by fazlerocks on 1/10/25, 8:22 PM with 3 comments
Current stack:
- Next.js on Vercel
- Serverless functions for AI/LLM endpoints
- Pinecone for vector storage
Questions for those running AI in production:
1. What's your serverless infrastructure choice? (Vercel/Cloud Run/Lambda)
2. How are you handling state management for long-running agent tasks?
3. What's your approach to cost optimization with LLM API calls?
4. Are you self-hosting any components?
5. How are you handling vector store scaling?
Particularly interested in hearing from teams who've scaled beyond the prototype stage. Have you hit any unexpected limitations with serverless for AI workloads?
by lunarcave on 1/10/25, 8:54 PM
1. fly.io is probably the best, IMHO [1]. It strikes a nice balance between running ephemeral containers that can support long-running tasks and booting up quickly to respond to a tool call.
2. If your task is truly long-running (I'm thinking several minutes), it's probably wise to put Trigger.dev [2] or Temporal [3] under it; there's a rough Temporal sketch after this list.
3. A mix of prompt caching, context shedding, and progressive context enrichment [4]; see the trimming sketch after the list.
4. I'm building a platform that can be self-hosted to do a few of the above, so I can't speak to this impartially. But most of my customers do not self-host.
5. To start with, a simple Postgres table with pgvector is all you need (sketch after the list). But I've recently been delighted with the DX of Upstash Vector [5]. They handle the embeddings for you and give you a text-in, text-out experience. If you want more control, and savings at higher scale, I've heard good things about marqo.ai [6].
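For (2), here's roughly what I mean by putting Temporal under an agent loop, using their TypeScript SDK. Treat it as a sketch: the activity names (callModel, runTool) and the step cap are made up for illustration, not part of any real API.

```ts
// workflow.ts - runs inside Temporal's deterministic workflow sandbox
import { proxyActivities } from '@temporalio/workflow';

// Hypothetical activities; each wraps a real LLM or tool call on the worker side.
interface AgentActivities {
  callModel(prompt: string): Promise<string>;
  runTool(name: string, args: string): Promise<string>;
}

const { callModel, runTool } = proxyActivities<AgentActivities>({
  startToCloseTimeout: '5 minutes',
  retry: { maximumAttempts: 3 },
});

// The workflow can run for hours; the transcript is persisted in workflow
// history, so a worker crash or redeploy resumes instead of losing progress.
export async function agentTask(goal: string): Promise<string> {
  let transcript = goal;
  for (let step = 0; step < 10; step++) {
    const reply = await callModel(transcript);
    if (!reply.startsWith('TOOL:')) return reply; // model produced a final answer
    const result = await runTool('search', reply.slice(5));
    transcript += `\n${reply}\n${result}`;
  }
  return transcript;
}
```

The point is that each model/tool call becomes a retryable activity, so long-running agent state stops living in a serverless function's memory.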
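For (3), "context shedding" just means dropping the oldest turns once the conversation stops fitting a token budget, so you're not paying for history the model no longer needs. A minimal sketch; the 4-chars-per-token estimate and the 8k budget are assumptions, swap in a real tokenizer and your model's limits:

```ts
interface Turn { role: 'user' | 'assistant'; content: string }

// Very rough token estimate; use a real tokenizer (e.g. tiktoken) in practice.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// Keep the system prompt plus the most recent turns that fit the budget,
// shedding the oldest history first.
function shedContext(system: string, turns: Turn[], budget = 8000): Turn[] {
  let used = estimateTokens(system);
  const kept: Turn[] = [];
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].content);
    if (used + cost > budget) break;
    used += cost;
    kept.unshift(turns[i]);
  }
  return kept;
}
```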
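And for (5), the pgvector starting point really is just one table and an index. A sketch with node-postgres; the table name, column names, and the 1536-dim size (OpenAI-style embeddings) are assumptions:

```ts
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function setup() {
  await pool.query('CREATE EXTENSION IF NOT EXISTS vector');
  await pool.query(`
    CREATE TABLE IF NOT EXISTS docs (
      id BIGSERIAL PRIMARY KEY,
      body TEXT NOT NULL,
      embedding VECTOR(1536)
    )`);
  // HNSW index keeps nearest-neighbour queries fast as the table grows.
  await pool.query(
    'CREATE INDEX IF NOT EXISTS docs_embedding_idx ON docs USING hnsw (embedding vector_cosine_ops)'
  );
}

// <=> is pgvector's cosine distance operator.
export async function search(queryEmbedding: number[], k = 5) {
  const { rows } = await pool.query(
    'SELECT id, body FROM docs ORDER BY embedding <=> $1::vector LIMIT $2',
    [JSON.stringify(queryEmbedding), k]
  );
  return rows;
}
```

That setup gets most teams surprisingly far before a managed vector store earns its keep.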
Happy to talk more about this at length. (E-mail in the profile)
[1] https://fly.io/docs/reference/architecture/
[2] https://trigger.dev
[3] https://temporal.io
[4] https://www.inferable.ai/blog/posts/llm-progressive-context-...
[5] https://upstash.com
[6] https://marqo.ai