The AI application development landscape has stabilised significantly since 2023. Here is a practical guide to making tech stack decisions that will hold up.
The Model Layer
For the LLM itself, the decision comes down to: hosted API vs self-hosted open source. Hosted APIs (Anthropic Claude, OpenAI GPT-4o, Google Gemini) offer: no infrastructure management, high quality, simple API, pay-per-token pricing. The right choice for most applications. Self-hosted open source (Llama 3, Mistral, Qwen) via frameworks like Ollama (local) or vLLM (server): lower per-token cost at high volume, data stays on-premise (important for sensitive data), but requires GPU infrastructure and engineering to maintain. The threshold where self-hosting becomes economically rational: generally above $10,000–30,000/month in API costs. Below that, hosted APIs with their infrastructure and safety guarantees are almost always the better choice. Model selection: for general-purpose tasks, Claude Sonnet 4.6 and GPT-4o are the current benchmarks; for cost-sensitive high-volume inference, smaller models (Haiku, GPT-4o-mini, Gemini Flash) are often sufficient; for coding-specific tasks, coding-optimised models outperform general models.
The Orchestration Layer
LangChain and LlamaIndex remain the dominant frameworks for building AI pipelines. LangChain: best for building complex chains, agents with tool use, and applications requiring many different integrations. Criticised for abstraction complexity — many developers move away from it as they understand the problem better and write more direct code. LlamaIndex: better suited to document-heavy RAG applications, with stronger native support for chunking strategies, vector stores, and retrieval evaluation. For simpler applications: calling the model API directly (Anthropic SDK, OpenAI SDK) with minimal framework is often cleaner and more maintainable than LangChain. The framework adds value when the pipeline is complex; for a single LLM call with a prompt, frameworks add overhead without benefit. Emerging: LangGraph (part of LangChain) for multi-agent workflows with state management; smolagents (from Hugging Face) as a lightweight agent framework.
The Infrastructure Layer
Vector databases for RAG: Pinecone (fully managed, easiest to start), Weaviate (managed or self-hosted, richer query options), Chroma (local, good for development), pgvector (Postgres extension — if you already use Postgres, this is often the simplest production choice). Observability: LangSmith (LangChain’s observability tool), LangFuse (open-source alternative), and Helicone are the main options for tracing LLM calls, evaluating quality, and monitoring costs. Without observability, you are flying blind on quality and cost. Caching: prompt caching (Anthropic and OpenAI both offer prefix caching that reduces cost for repeated long system prompts by 50–80%) is worth implementing early — it can meaningfully reduce API costs at scale.
The Deployment and Evaluation Reality
The mistakes most AI applications make: building without an evaluation framework first (you can’t improve what you don’t measure); not designing for prompt versioning and A/B testing from the start; underestimating latency (LLM inference is slow — 1–5 seconds for a response — which affects UX design significantly); ignoring structured output (use Pydantic models and the model’s JSON output mode to get reliable structured data from LLMs instead of parsing free text). The evaluation-first principle: before building the application UI, build the evaluation harness — a set of test cases with expected outputs that you can run against model versions to catch regressions. Without this, prompt engineering and model updates become regressions you discover in production.



