LLMs in Production: Costs, Latency, and the Metrics Nobody Talks About
The demo is easy, the invoice is the surprise
Building a demo with GPT-4 that impresses in a board meeting takes an afternoon. Taking that same system to production with 10,000 users and keeping it profitable is a months-long engineering project. The distance between the demo and production is where most LLM projects die, and the reasons are rarely technical. They’re economic and operational.
We’ve been operating LLMs in production for clients in logistics, legal, and financial services for 18 months. This article documents the real metrics, the costs that don’t appear in tutorials, and the patterns that reduce the bill without sacrificing quality.
The real cost of tokens
Model prices are published per million tokens. It seems cheap. Until you calculate how many tokens your application consumes.
Anatomy of a typical call
A customer support chatbot with context:
| Component | Tokens (approximate) |
|---|---|
| System prompt | 500-1,500 |
| RAG context (retrieved documents) | 2,000-8,000 |
| Conversation history (5 turns) | 1,500-3,000 |
| User question | 50-200 |
| Total input | 4,050-12,700 |
| Model response | 200-800 |
| Total output | 200-800 |
With GPT-4o (March 2025): input at $2.50/M tokens, output at $10.00/M tokens. A typical call from the table above: ~$0.012-0.040. Seems irrelevant. At 50,000 calls per month: $600-2,000/month. At 500,000: $6,000-20,000/month. And that’s a simple chatbot.
A document processing system analyzing 30-page contracts can consume 50,000-80,000 tokens per document per pass. At 100 contracts/month with GPT-4o that’s only $15-25 in input tokens, but pipelines rarely read a contract once: clause extraction, risk analysis, and summarization each re-send the text, and a dozen passes multiplies the bill by an order of magnitude.
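The arithmetic above is simple enough to encode once and reuse. A minimal sketch, with illustrative prices hard-coded (real prices change often, so production code should treat them as config):

```python
# Per-model prices in $ per 1M tokens (input, output).
# Illustrative March-2025 figures; treat as config, not constants.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of a single API call."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

def monthly_cost(model: str, input_tokens: int, output_tokens: int,
                 calls_per_month: int) -> float:
    """Projected monthly bill for a fixed per-call token profile."""
    return call_cost(model, input_tokens, output_tokens) * calls_per_month

# The typical chatbot call from the table: ~4,050 tokens in, 200 out.
cost = call_cost("gpt-4o", 4_050, 200)             # ≈ $0.012 per call
bill = monthly_cost("gpt-4o", 4_050, 200, 50_000)  # ≈ $606/month
```

Running the same profile through the mini model makes the 10-20x gap discussed below concrete.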
The long context window trap
Models with 128K or 200K token context windows invite stuffing in all available context. “Why bother with RAG if I can put the entire document in the prompt?” Because cost scales linearly with input. A 100,000-token prompt with GPT-4o costs $0.25 per call. At 1,000 daily calls, that’s $7,500/month.
Additionally, response quality degrades with very long contexts (the “lost in the middle” effect documented by Liu et al., 2023). More context isn’t always better context.
The cost ladder by model
| Model | Input ($/M) | Output ($/M) | Use case |
|---|---|---|---|
| GPT-4o | 2.50 | 10.00 | Complex reasoning |
| Claude 3.5 Sonnet | 3.00 | 15.00 | Analysis and writing |
| GPT-4o mini | 0.15 | 0.60 | Classification, extraction |
| Claude 3.5 Haiku | 0.80 | 4.00 | Fast tasks |
| Llama 3.1 70B (self-hosted) | ~0.50* | ~0.50* | High volume, control |
| Mistral Large | 2.00 | 6.00 | EU alternative |
*Estimated cost including GPU, inference on A100/H100.
The difference between using GPT-4o for everything and using the right model for each task can be 10-20x.
Latency budgets
LLM latency has two components: Time to First Token (TTFT) and tokens per second generation speed. Both matter, but in different ways.
TTFT determines how long the user waits before seeing anything. For interactive interfaces, TTFT > 2 seconds is perceived as slow. Large models (GPT-4o, Claude Opus) have TTFT of 1-3 seconds. Small models (GPT-4o mini, Haiku) are at 200-800 ms.
Generation speed determines how long the complete response takes. GPT-4o generates 50-80 tokens/second. With streaming, the user sees the response forming progressively, which improves the perception of speed even though total latency is the same.
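Both numbers are easy to capture if you consume generation as a stream. A minimal sketch that wraps any iterator of text chunks (such as the content deltas of a streaming chat-completions response); the `fake_stream` generator is a stand-in for a real API call:

```python
import time
from typing import Iterable, Iterator

def measure_stream(tokens: Iterable[str]) -> tuple[str, float, float]:
    """Consume a token stream and return (text, ttft_s, total_s).
    `tokens` can be any iterator of text chunks -- for instance the
    content deltas of a streaming chat-completions response."""
    start = time.monotonic()
    ttft = None
    parts: list[str] = []
    for tok in tokens:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        parts.append(tok)
    return "".join(parts), ttft or 0.0, time.monotonic() - start

def fake_stream() -> Iterator[str]:
    # Stand-in for a real API stream: slow first token, fast afterwards.
    time.sleep(0.05)
    yield "Hello"
    for _ in range(3):
        time.sleep(0.01)
        yield " world"

text, ttft, total = measure_stream(fake_stream())
```

Logging both values per request, rather than only total latency, is what lets you tell a slow model apart from a slow prompt.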
Budgets by use case
| Use case | Max TTFT | Max total latency |
|---|---|---|
| Interactive chatbot | 1.5 s | 5 s (with streaming) |
| Real-time autocomplete | 200 ms | 500 ms |
| Batch processing (documents) | Not critical | < 30 s/document |
| Inline classification | 300 ms | 1 s |
| Email generation | 2 s | 8 s |
If your use case requires sub-second latency, large models are ruled out for the critical path. You need small models, fine-tuning, or pre-computation.
Factors that inflate latency
- Distance to endpoint: OpenAI and Anthropic APIs serve from the US. For European users, add 100-200 ms round trip. Azure OpenAI with West Europe region reduces this to 20-40 ms.
- Long input: more input tokens = more processing time. A 50,000-token prompt takes significantly longer to process than a 5,000-token one, even before generating output.
- Contention: during high demand, APIs may throttle or increase latency. There’s no latency SLA on OpenAI’s standard plans.
Caching: the most underrated optimization
Semantic caching is probably the optimization with the highest ROI in production. The idea: if two questions are sufficiently similar, the cached response is valid for both.
Cache levels
Exact cache: same question, same response. Trivial to implement (hash of the prompt as key, Redis as store). Effective for FAQs and repetitive queries. In a technical support chatbot we operate, exact cache has a 35% hit rate, which reduces costs and latency by the same proportion.
Semantic cache: similar (not identical) questions return the same response. Requires embeddings: generate a vector of the question, search for similar questions in a vector index (Pinecone, Qdrant, pgvector), and if similarity exceeds a threshold (typically 0.95), return the cached response. GPTCache and LangChain have ready-to-use implementations.
Fragment cache: for RAG systems, cache the retrieval results (not the generation). If 10 different questions retrieve the same 3 documents, the vector search executes once and is reused for all 10. This cuts both retrieval latency and the cost of generating query embeddings.
Invalidation
The classic caching problem applies multiplied: when do I invalidate a cached response? If underlying data changes (new product, new policy, price correction), cached responses may be incorrect. Recommended strategy: aggressive TTL (time to live) for volatile data (1-4 hours) and conservative for stable data (24-72 hours), combined with explicit invalidation when changes are detected in the data source.
Model routing: the right model for each task
Not all tasks need GPT-4. In fact, most don’t. Model routing is the pattern of sending each request to the most economical model that can resolve it with acceptable quality.
Classification-based router
A lightweight classifier (can be a 3B parameter model or even rules) analyzes the incoming request and decides:
- Simple task (classification, data extraction, formatting) -> GPT-4o mini or Haiku
- Medium task (summarization, translation, Q&A over documents) -> GPT-4o mini or Sonnet
- Complex task (multi-step reasoning, legal analysis, formal writing) -> GPT-4o or Opus
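At its simplest, the classifier can be a set of rules. A keyword-based sketch of the three tiers above; the patterns and the tier-to-model mapping are illustrative assumptions, and a real router would use a small classifier model instead:

```python
import re

# Illustrative tier -> model mapping (names from the cost ladder above).
ROUTES = {
    "simple": "gpt-4o-mini",
    "medium": "claude-3.5-sonnet",
    "complex": "gpt-4o",
}

# Toy keyword signals; a production router learns these from labels.
SIMPLE = re.compile(r"\b(classify|extract|format|label)\b", re.I)
COMPLEX = re.compile(r"\b(why|analy[sz]e|draft|reason|contract)\b", re.I)

def route(request: str) -> str:
    """Check for complex signals first, then simple ones,
    and default to the medium tier."""
    if COMPLEX.search(request):
        return ROUTES["complex"]
    if SIMPLE.search(request):
        return ROUTES["simple"]
    return ROUTES["medium"]
```

Checking the expensive tier first matters: a request that both extracts and reasons should go to the model that can reason.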
The router cost is negligible compared to the savings. In a system we operate, routing reduces the average cost per call by 65% with a measured quality degradation (judged by human evaluators) of less than 3%.
Progressive fallback
Another strategy: start with the cheap model and scale up if quality isn’t sufficient. The flow:
- Send to the small model
- Evaluate the response (length, confidence, presence of “I don’t know”)
- If evaluation fails, resend to the large model
- Cache that this class of question requires the large model
Over time, the system learns which questions need which model. The average cost converges to the optimum.
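The flow can be sketched in a few lines. The adequacy check and the first-word bucketing below are deliberately crude stand-ins for a real response evaluator and intent classifier:

```python
def looks_adequate(answer: str) -> bool:
    # Crude evaluation from the steps above: minimum length plus a
    # refusal check. Real systems add confidence scores or an LLM judge.
    return len(answer) >= 40 and "i don't know" not in answer.lower()

def answer_with_fallback(question: str, small_model, large_model,
                         escalations: set) -> tuple[str, str]:
    """small_model / large_model are callables: question -> answer.
    `escalations` remembers question classes that needed the large model;
    bucketing on the first word is a stand-in for a real intent label."""
    bucket = question.split()[0].lower()
    if bucket not in escalations:
        answer = small_model(question)
        if looks_adequate(answer):
            return answer, "small"
        escalations.add(bucket)  # skip the small model next time
    return large_model(question), "large"
```

The escalation set is the learning step: question classes that once failed the cheap model go straight to the large one, so you stop paying for the failed first attempt.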
The metrics you should be monitoring
Beyond the classic “it works or it doesn’t,” these are the metrics that separate an operational LLM system from one that’s silently falling apart:
Quality
- Faithfulness: is the response consistent with the provided context? Measured automatically with frameworks like RAGAS or DeepEval. Target: > 0.85.
- Answer relevancy: does the response answer the question? Different from faithfulness: a response can be faithful to the documents but not answer the question. Target: > 0.80.
- Hallucination rate: percentage of responses containing information not supported by context. Target: < 5%. For critical data (prices, dates, numbers), target: 0%.
Operational
- Tokens/request: mean and p95. If it grows without product changes, there’s prompt bloat.
- Cost/request: broken down by model. The most direct efficiency metric.
- TTFT and total latency: tracked per model and task type.
- API error rate: timeout, rate limiting, provider 5xx errors.
- Cache hit rate: the indicator of how much you’re saving. Target: > 25% for applications with repetitive queries.
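Most of the operational metrics fall out of logging one record per request. A minimal sketch with illustrative field names; a real deployment would ship these to a metrics backend rather than aggregate in memory:

```python
from dataclasses import dataclass, field
import statistics

@dataclass
class RequestLog:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    ttft_ms: float
    cache_hit: bool

@dataclass
class Metrics:
    logs: list = field(default_factory=list)

    def record(self, log: RequestLog) -> None:
        self.logs.append(log)

    def tokens_p95(self) -> int:
        # Nearest-rank p95 over total tokens per request.
        vals = sorted(l.input_tokens + l.output_tokens for l in self.logs)
        return vals[min(len(vals) - 1, int(0.95 * len(vals)))]

    def cost_per_request(self) -> float:
        return statistics.mean(l.cost_usd for l in self.logs)

    def cache_hit_rate(self) -> float:
        return sum(l.cache_hit for l in self.logs) / len(self.logs)
```

Grouping the same aggregates by model is what makes cost regressions attributable: a rising cost/request with a stable token count points at routing, not at prompt bloat.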
Business
- Cost per user session: how much it costs to serve a complete conversation. The metric the CFO understands.
- Deflection rate (if support): percentage of queries resolved by the LLM without human escalation. The direct ROI.
- User satisfaction: thumbs up/down on responses. The only metric that connects perceived quality with measured quality.
The costs nobody budgets for
Prompt engineering
The prompt isn’t something you write once and forget. It’s a living component that gets iterated, tested, and versioned. We’ve dedicated over 200 engineer hours in the past year to prompt optimization alone for production systems. That’s a personnel cost that doesn’t appear on the API bill.
Tools that reduce this cost: PromptFoo for systematic prompt testing, LangSmith for tracing and debugging, and a version-controlled prompt repository with git (yes, prompts should be in version control).
Continuous evaluation
Models get updated. GPT-4o from January doesn’t behave exactly the same as GPT-4o from April. Changes are subtle but can break finely tuned prompts. You need an evaluation pipeline that automatically runs your test cases against the current model and alerts if quality drops. Without this, you discover the regression when users complain.
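Such a pipeline can start as simple as a list of (prompt, check) pairs run on a schedule. A sketch, where `model_fn` is whatever callable wraps your API call and the cases shown are toy examples:

```python
def run_regression_suite(model_fn, cases, min_pass_rate=0.9):
    """cases: list of (prompt, check) pairs, where check(answer) -> bool.
    Returns (pass_rate, failed_prompts); wire an alert to pass_rate
    dropping below min_pass_rate."""
    failures = [prompt for prompt, check in cases
                if not check(model_fn(prompt))]
    rate = 1 - len(failures) / len(cases)
    return rate, failures

# Toy cases; real ones would be your production prompts and checks.
CASES = [
    ("Extract the invoice date: 'Issued 2024-03-01'",
     lambda a: "2024-03-01" in a),
    ("Classify sentiment: 'great service'",
     lambda a: a.strip().lower() in {"positive", "negative", "neutral"}),
]
```

Deterministic checks (substring, regex, set membership) catch the blunt regressions cheaply; the faithfulness and relevancy scores above cover what simple checks can’t.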
RAG infrastructure
If you use RAG (and you probably should), the cost of the vector database, the indexing pipeline, and document ingestion is significant. Pinecone: from $70/month to start. Qdrant self-hosted: VM cost + operation. pgvector on PostgreSQL: the most economical option if you already have Postgres, but with scale limitations.
The golden rule
Before putting an LLM in production, answer these three questions:
- How much does it cost to serve one user for a month? If you can’t calculate it, you’re not ready for production.
- What happens when the model fails? Timeout, hallucination, incorrect response. Each scenario needs a defined fallback.
- How do I measure quality automatically? If the only way to know if the system is working well is reading responses manually, it won’t scale.
LLMs in production are an engineering problem, not a data science problem. The models are already good enough. The difference is in how you operate them.
Our AI and machine learning team implements LLMs in production with the metrics, caching, and model routing needed to make them economically viable. If you already have a prototype that needs to go to production, we can help with the necessary engineering. For the full process of getting models to production, see our whitepaper on MLOps: from notebook to pipeline. And for an applied use case, our article on generative AI in logistics shows how these patterns work in the logistics sector.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
