AI API costs can be unexpectedly high in production if you don’t understand the pricing model. Here is what you need to know about token economics for building cost-effective AI applications.
What Tokens Are and How They’re Counted
AI language models process text as tokens: one token is roughly 4 characters, or ¾ of a word, in English. 1 million tokens ≈ 750,000 words ≈ 1,500 pages of text. Most hosted LLM APIs are priced per million tokens, with separate rates for input tokens (the prompt you send) and output tokens (the model’s response). Output tokens typically cost 3–5x as much as input tokens. Example pricing in 2025: Claude Sonnet 4.6 is $3/million input tokens and $15/million output tokens (5x); GPT-4o is $2.50/million input and $10/million output (4x). The implication: keep output short, because each output token is the most expensive token in any LLM interaction. In prompt-heavy workloads the input side can still make up most of the total bill, so both sides need attention.
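The per-token pricing above can be turned into a small estimator. This is a minimal sketch, not a billing tool: the price table copies the example figures quoted in this section, the model keys are illustrative labels rather than real API model identifiers, and `estimate_tokens` relies only on the rough 4-characters-per-token rule of thumb. For exact counts, use the provider's own tokenizer or the usage figures returned with each API response.

```python
# Example prices in USD per million tokens, copied from the text above.
# The dictionary keys are illustrative labels, not official model IDs.
PRICES = {
    "claude-sonnet": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic."""
    return max(1, round(len(text) / 4))


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of a single request."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    prompt = "Summarise the attached meeting notes in three bullet points."
    cost = request_cost("gpt-4o", estimate_tokens(prompt), output_tokens=150)
    print(f"Estimated cost: ${cost:.6f}")
```

Keeping prices in one table makes it easy to update them when providers change their rates, which happens often.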
The Cost Calculation
A simple cost calculation for a RAG chatbot processing 1,000 queries/day at Claude Sonnet pricing ($3/million input, $15/million output). Input: system prompt, 500 tokens × 1,000 = 500,000 tokens/day; RAG context (retrieved documents), 2,000 tokens × 1,000 = 2,000,000 tokens/day; user query, 50 tokens × 1,000 = 50,000 tokens/day. Total input: 2.55M tokens/day = $7.65/day. Output (responses): 300 tokens × 1,000 = 300,000 tokens/day = $4.50/day. Total daily cost: $12.15, or about $365 over a 30-day month. A production system handling 10,000 queries/day costs about $3,650/month: significant, but knowable in advance and scaling linearly with usage.
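The same arithmetic can be written as a short script so the numbers are easy to rerun with different assumptions. The sketch below uses the token counts and Claude Sonnet prices from the worked example; the `Workload` class and `daily_cost` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Per-query token counts, defaulting to the example figures above."""
    queries_per_day: int
    system_prompt_tokens: int = 500
    context_tokens: int = 2_000
    query_tokens: int = 50
    output_tokens: int = 300


def daily_cost(w: Workload, input_price: float = 3.00,
               output_price: float = 15.00) -> float:
    """Daily cost in USD; prices are USD per million tokens."""
    input_per_query = w.system_prompt_tokens + w.context_tokens + w.query_tokens
    input_tokens = input_per_query * w.queries_per_day
    output_tokens = w.output_tokens * w.queries_per_day
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    for qpd in (1_000, 10_000):
        per_day = daily_cost(Workload(queries_per_day=qpd))
        # A 30-day month is assumed for the monthly figure.
        print(f"{qpd:>6} queries/day: ${per_day:.2f}/day, ${per_day * 30:.2f}/month")
```

Changing a single field, such as `context_tokens`, shows quickly how sensitive the monthly bill is to the amount of retrieved context.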
Cost Optimisation Strategies
Prompt caching: Anthropic’s prompt caching cuts the cost of repeated system prompts and shared context by 90% on cache hits ($0.30/million vs $3/million at Sonnet pricing). Writing to the cache is billed at a premium over the normal input rate and cache entries expire, so caching pays back once the same prompt prefix is reused, which for a constant system prompt under steady traffic is almost immediately. Tiered model selection: use a cheaper, faster model (GPT-4o Mini at $0.15/million input, or Claude 3 Haiku at $0.25/million) for simpler classification and routing tasks, reserving the more expensive model for complex generation. Output length control: explicit constraints in your prompt (“respond in under 100 words”) directly reduce output tokens, and a maximum output token limit on the API request caps them. Batch processing: Anthropic and OpenAI offer batch APIs with a 50% discount for non-real-time workloads (nightly report generation, bulk data processing).
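To see what caching and tiered model selection might look like in numbers and in code, here is a rough sketch. It makes simplifying assumptions: cache-write premiums, cache expiry, and any minimum cacheable prompt length are ignored, and the cache hit rate is supplied by hand. The function names and the `small-model` / `large-model` labels are hypothetical placeholders, not real API identifiers.

```python
BASE_INPUT_PRICE = 3.00    # USD per million input tokens (from the text)
CACHED_INPUT_PRICE = 0.30  # USD per million tokens on a cache hit (from the text)


def input_cost_with_cache(cacheable_tokens: int, other_tokens: int,
                          hit_rate: float) -> float:
    """Daily input cost in USD when a share of cacheable tokens hits the cache.

    Simplification: cache writes are charged at the base rate here.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    blended = hit_rate * CACHED_INPUT_PRICE + (1 - hit_rate) * BASE_INPUT_PRICE
    return (cacheable_tokens * blended + other_tokens * BASE_INPUT_PRICE) / 1_000_000


# Task types assumed to be simple enough for the cheaper tier.
CHEAP_TASKS = {"classification", "routing"}


def pick_model(task_type: str) -> str:
    """Send simple tasks to the cheap tier, everything else to the large model."""
    return "small-model" if task_type in CHEAP_TASKS else "large-model"


if __name__ == "__main__":
    # 500,000 system prompt tokens/day cacheable, 2,050,000 other input tokens/day.
    for rate in (0.0, 0.9, 1.0):
        cost = input_cost_with_cache(500_000, 2_050_000, rate)
        print(f"hit rate {rate:.0%}: ${cost:.2f}/day input")
    print(pick_model("routing"), pick_model("generation"))
```

Under these assumptions, caching only the 500-token system prompt from the 1,000 queries/day example lowers the daily input cost from $7.65 to about $6.30 at a 100% hit rate. The larger saving would come from caching shared context, if the retrieved documents repeat across queries.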
When Costs Become a Problem
Two common cost surprises: recursive agent loops (an agent that keeps calling itself for complex tasks can run up hundreds of dollars in minutes without a token budget or loop limit), and large document processing (if users can upload documents, a 200-page PDF in the context window is expensive per query). Mitigations: set maximum token budgets in your agent code, implement rate limiting per user, use RAG to retrieve relevant chunks rather than including entire documents in the prompt, and monitor costs per request in your logging. Cost monitoring tools: LangSmith (for LangChain apps), Helicone (provider-agnostic API proxy with cost tracking), and the native cost dashboards from OpenAI and Anthropic.
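A token budget and loop limit for an agent can be as simple as a counter that raises once either limit is crossed. The sketch below is one possible shape, not a feature of any particular framework: `TokenBudget`, `BudgetExceeded`, and `run_agent` are hypothetical names, and `step_fn` stands in for whatever function performs one model call and reports the tokens it consumed.

```python
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when an agent run crosses its token or step limit."""


class TokenBudget:
    def __init__(self, max_tokens: int, max_steps: int) -> None:
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.tokens_used = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Record one completed step and stop the run if a limit is exceeded."""
        self.steps += 1
        self.tokens_used += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} exceeded")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget of {self.max_tokens} exceeded")


def run_agent(step_fn: Callable[[], Tuple[bool, int]],
              budget: TokenBudget) -> TokenBudget:
    """Call step_fn until it reports completion or the budget is exhausted.

    step_fn is a placeholder: it performs one agent step and returns
    (done, tokens_consumed_by_that_step).
    """
    while True:
        done, tokens = step_fn()
        budget.charge(tokens)
        if done:
            return budget


if __name__ == "__main__":
    def stuck_step() -> Tuple[bool, int]:
        return False, 1_000  # simulates an agent that never finishes

    try:
        run_agent(stuck_step, TokenBudget(max_tokens=5_000, max_steps=10))
    except BudgetExceeded as exc:
        print(f"Run stopped: {exc}")
```

Because the check runs after each step, a run can overshoot the budget by at most one step's worth of tokens. Logging `tokens_used` per request alongside the user ID also gives you the per-request cost data needed for monitoring and per-user rate limiting.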




