Comments Page - Building AI Products–Part I: Back-End Architecture

Building AI Products–Part I: Back-End Architecture (philcalcado.com)
Submitted by rafaelferreira 6 months ago

xrd 6 months ago
This article is written by an engineer, first and foremost.
Many of the APIs or LLM extensions provided by AI companies are written by ML engineers that do not have Phil's decades of experience in distributed systems, databases and networking. That is evident after reading this article; the first time I've seen a coherent discussion of the tools and tradeoffs when building agentic systems.
I've struggled to actually build something useful with the "agentic" systems and tools out there (and I've tried a lot). Deep down I've felt intimidated by the dozens of new terms the docs use, and after reflection, those tech marketing pieces give the vibe that they are written primarily by AI and told to be colorful rather than clear and precise. These solutions from companies valued in the billions must present "brand new" ideas to justify their valuations. We should know better: everything builds on the shoulders of decades of research and discovery. If you see something flying high in the clouds (and not standing on the shoulders of giants), it is sure to fall back to earth soon.
A great read. I'm very excited about Outropy.
- pcalcado 6 months ago
  Thanks! Honestly, I feel like my first year was a lot of just translating what those papers were trying to say, especially because often they talk a lot but don't say much. I am lucky that my cofounder has a background in ML/AI and could help me understand, but something else that helped me was to ask Claude/GPT to explain something I don't understand: "using analogies an experienced back-end developer understands".
- bsenftner 6 months ago
  Yeah, it's really clear it's an engineer, what the software does is never even mentioned, the purposes and tasks that it performs: what are they? And how is hallucination managed? This reads to me like a complexity soup, where they just started without a clear idea of purpose or goal. Perhaps if the article mentioned what the software does, the purpose, it might be more clear. It sounds like a replacement for the entire management layer of a company...
  NeutralCrane 6 months ago
  Agreed. This could be a very intelligent implementation, or it could be an over-engineered mess. It certainly seems like overkill for my experiences with agents, but problem applications can vary wildly. It is impossible to tell how to evaluate these design choices without more concrete details.
  pcalcado 6 months ago
  This is good feedback; thanks both! Initially, this was a single article, and it started with an explanation of the system, but it was getting too long, so I decided to split it into three. In hindsight, I should have started with part II, where I wanted to talk about the features, but I thought that the most underserved part of the AI stack was the back-end architecture, so I tried to address it first.
  pixelsort 6 months ago
  Managers need to replace engineers with AI faster than engineers use AI to replace their managers. In the end, nobody wins except OpenAI.
svilen_dobrev 6 months ago
> durable workflows
This is what long-running-transactions of the past became.. and slowly cover all their ground (initially Cadence by Uber, then Temporal). Zillions of little flows that can go through their FSMs at any speed, (milli)seconds-or-days-or-months-or-whenever.
i wonder though, how much some further developments like Cloudflare's durable objects, or similar recently announced Rivet actions [1] would simplify (or, complicate) matters, esp. in this "agentic" case ?
[1] https://news.ycombinator.com/item?id=42472519
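To make the "little flows that go through their FSMs at any speed" idea concrete, here is a minimal sketch of a flow whose current state is written to storage after every transition, so it can be picked up again seconds or months later. This is not how Cadence or Temporal work internally; the state names, the `DurableFlow` class, and the use of SQLite are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical three-step flow; the state names are illustrative only.
TRANSITIONS = {
    "received": "summarized",
    "summarized": "notified",
    "notified": None,  # terminal state
}


class DurableFlow:
    """A tiny FSM whose current state is persisted after every transition."""

    def __init__(self, conn, flow_id):
        self.conn = conn
        self.flow_id = flow_id
        conn.execute(
            "CREATE TABLE IF NOT EXISTS flows (id TEXT PRIMARY KEY, state TEXT)"
        )
        # A new flow starts in "received"; an existing row is left untouched,
        # which is what lets a later process resume where this one stopped.
        conn.execute(
            "INSERT OR IGNORE INTO flows (id, state) VALUES (?, ?)",
            (flow_id, "received"),
        )
        conn.commit()

    @property
    def state(self):
        row = self.conn.execute(
            "SELECT state FROM flows WHERE id = ?", (self.flow_id,)
        ).fetchone()
        return row[0]

    def advance(self):
        """Move to the next state and persist it; return False once finished."""
        nxt = TRANSITIONS[self.state]
        if nxt is None:
            return False
        self.conn.execute(
            "UPDATE flows SET state = ? WHERE id = ?", (nxt, self.flow_id)
        )
        self.conn.commit()
        return True
```

Because the state lives in the table rather than in the object, constructing a second `DurableFlow` with the same `flow_id` continues from the last persisted state instead of starting over.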
nikolayasdf123 6 months ago
> Agents are not Microservices
> Agents naturally align with OOP principles: they maintain encapsulated state (their memory), expose methods (their tools and decision-making capabilities via inference pipelines), and communicate through message passing
it does sound like a service (memory=db,methods+messages=api). it is just the level of isolation/deployment you need
UPD: also, how come your services share a database layer (?), maybe problems in scaling are not due to Agents at all? do you have scaling issues even without agents? would not be surprised! classic rule from the Amazon 2002 API mandate by Bezos: "no shared db between services. all communication happens over exposed interfaces and over network".
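For readers who want the quoted OOP framing spelled out, here is a minimal sketch of an agent as an object: memory as encapsulated state, tools as exposed methods, and a single `receive` entry point for message passing. The `Agent` and `Message` names and the tool registry are hypothetical and are not taken from the article's code.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    tool: str      # which tool the sender wants invoked
    payload: str


@dataclass
class Agent:
    name: str
    memory: list = field(default_factory=list)  # encapsulated state
    tools: dict = field(default_factory=dict)   # exposed "methods", by name

    def receive(self, msg: Message) -> str:
        """Handle one message: record it, then dispatch to the named tool."""
        self.memory.append(msg)
        handler = self.tools.get(msg.tool)
        if handler is None:
            return f"{self.name}: unknown tool {msg.tool!r}"
        return handler(msg.payload)


# Usage: the caller only ever sends messages; memory stays inside the agent.
agent = Agent("notes", tools={"shout": str.upper})
reply = agent.receive(Message(sender="user", tool="shout", payload="hi"))
```

Nothing here says where the object runs. The same class could sit in one process with many siblings or behind a network API, which is the "level of isolation/deployment" question raised above.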
- pcalcado 6 months ago
  By database layer, do you mean the RDS in the diagrams?
  If so, they were logical diagrams; the deployment itself was more complicated to handle the realities of AWS and whatnot.
  Still, having a single beefy RDS instance is a pretty common pattern for apps at this size. I've never experienced RDS postgres as a bottleneck for standard microservices architectures even at the 100-million-MAU scale.
- tossandthrow 6 months ago
  What I read is that the cuts for microservices and agents do not align.
  This does not mean that agents cannot run in microservices, just that it is not 1:1 between an agent and a microservice.
  pcalcado 6 months ago
  Yes, that's what I meant.
  There's a whole can of worms here around the "what is a microservice, anyway?" but I tried to avoid more philosophical questions and used the term as shorthand for "small deployable unit following some version of 12 factor for horizontal scalability." It's not super comprehensive but matches what I've seen in practice over the last decade+
theptip 6 months ago
Great breakdown of the "architectural decision log" for the evolution of this system.
> This model broke down when we added backpressure and resilience patterns to our agents. We faced new challenges: what happens when the third of five LLM calls fails during an agent’s decision process? Should we retry everything? Save partial results and retry just the failed call? When do we give up and error out?
> We first looked at ETL tools like Apache Airflow. While great for data engineering, Airflow’s focus on stateless, scheduled tasks wasn’t a good fit for our agents’ stateful, event-driven operations.
> I’d heard great things about Temporal from my previous teams at DigitalOcean. It’s built for long-running, stateful workflows, offering the durability and resilience we needed out of the box.
I would also have reached for workflow engines here. But I wonder if Actor frameworks might actually be the sweet spot; something like Erlang's distributed actor model could be a good fit. I'm not familiar with a good distributed Actor framework for Python but there's of course Elixir, Actix, Akka in other stacks.
Coming from the other direction, I'm not surprised that Airflow isn't fit for this purpose, but I wonder if one of the newer generation of ETL engines like Dagster would work? Maybe the workflow here just involves too many pipelines (one per customer per Agent, I suppose), and too many Sensor events (each Slack message would get materialized, not sure if that's excessive). Could be a fairly substantial overhaul to the architecture vs. Temporal, but I'd be interested to know if anyone has experimented with this option for AI workflows.
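The "third of five LLM calls fails" question from the first quote can be sketched without any engine at all: save each step's result as it completes, retry only the failing step with backoff, and give up after a fixed number of attempts. This is a hand-rolled illustration, not Temporal's API. The `run_steps` function, its parameters, and the dict used as a checkpoint are assumptions; a workflow engine would supply the durable storage and retry policy itself.

```python
import time


def run_steps(steps, checkpoint, max_attempts=3, base_delay=0.0):
    """Run named steps in order, skipping any whose result is already saved.

    `steps` is a list of (name, function) pairs; each function receives the
    checkpoint so it can read earlier results. `checkpoint` is a dict-like
    store of step name -> result (in-memory here, durable in a real system).
    """
    for name, fn in steps:
        if name in checkpoint:
            continue  # completed in an earlier run: do not repeat the call
        for attempt in range(1, max_attempts + 1):
            try:
                checkpoint[name] = fn(checkpoint)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; earlier results stay in the checkpoint
                # Exponential backoff before retrying only this step.
                time.sleep(base_delay * 2 ** (attempt - 1))
    return checkpoint
```

If the function raises after exhausting its attempts, calling it again with the same checkpoint resumes at the failed step rather than replaying the calls that already succeeded.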
karmasimida 6 months ago
I don't see AI systems as too special in terms of back-end engineering, except maybe for agentic systems, where things are inherently stateful.
But considering how limited RPM/TPM are for mainstream LLMs, saving/loading state is hardly the bottleneck, I feel.
- pcalcado 6 months ago
  Same here. I've had data-intensive systems and classifiers on critical paths for non-AI apps, and the same tools I used before seem to work fine with GenAI.
  The primary real difference I've found has to do with when agents make decisions; this creates arbitrary call graphs in your distributed architecture and makes it harder to provision things, optimize, and do anomaly detection.
upghost 6 months ago
Color me impressed. These guys get it right because they treat LLMs like what they are -- tools with a specific use, not anthropomorphized pets. (although I did groan a bit at the "AI Chief of Staff" moniker).
It's extremely refreshing to hear an actual engineering conversation around LLMs that doesn't sound like it came out of the pages of an undergraduate alchemy notebook.
AYBABTME 6 months ago
I came to the same conclusion about Temporal for these types of things. Interactive stuff that touches 1 DB? Do it in the API. Needs to coordinate >1 thing? Temporal.
Orchestrating a bunch of LLM calls is a perfect fit for Temporal.
- pcalcado 6 months ago
  If only they embraced Pydantic and the libs Python people actually use >.<
iandanforth 6 months ago
Thanks for the excellent article. It's hard to find these step-by-step architecture evolution retrospectives. A great reference for other startups going through a similar journey!
jarbus 6 months ago
Great article, really enjoyed how they described what they initially tried, where it struggled, and why their current solution works better.
xwowsersx 6 months ago
Well written. It's a rare pleasure to hear a discussion about LLMs grounded in real engineering, free from the fanciful notions often found in all the other spam out there.
asah 6 months ago
crazy idea: could quantum entangled communication help soften CAP ? (e.g. by allowing limited communication between partitions)