The Age of AI Agents: State of the Art in 2025
Three numbers that frame 2025
72% of Fortune 500 companies have at least one AI agent project underway. Only 14% have reached production. And of that 14%, half admit their agents operate under constraints so severe they barely qualify as “autonomous.”
Those three numbers, from McKinsey’s December 2024 generative AI report, frame the state of the art better than any introductory paragraph could. The technology is ready. The engineering to put it into production is not.
This whitepaper documents where we actually are. Not where we would like to be, not where keynote speakers say we are, but where the systems that process transactions, serve customers, and make business decisions are operating every day. It includes deployment patterns we have validated, cost structures from real projects, and the metrics that separate a useful agent from an incident generator.
Defining “agent” without ambiguity
The word “agent” has been inflated to the point of meaninglessness. A chatbot with database access is not an agent. A RAG pipeline with a decision step is not an agent either. We need an operational definition.
An AI agent, in the context of this whitepaper, meets four criteria:
- It receives a goal, not step-by-step instructions. You say “process this invoice” or “resolve this incident,” not “extract field X from document Y and write Z to the database.”
- It plans its own sequence of actions. The agent decides which tools to use, in what order, and how to interpret intermediate results.
- It executes actions with real-world effects. It does not just generate text. It calls APIs, modifies records, sends communications.
- It adapts to unexpected results. If a tool fails or returns something unexpected, the agent adjusts its plan.
If your system does not meet all four, you probably have a sophisticated pipeline, not an agent. That is not an insult. Pipelines are predictable, testable, and cheap. Sometimes they are the right answer.
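As a concrete anchor for the four criteria, here is a minimal sketch in Python. Everything in it is hypothetical: the tool names are stubs, and the plan is hard-coded where a real agent would ask the model to produce it.

```python
# Minimal sketch of the four criteria: a goal in, a self-chosen plan,
# real actions via tools, and adaptation when a tool misbehaves.
# All tool names here are hypothetical placeholders.

def lookup_invoice(invoice_id: str) -> dict:
    # Placeholder tool: in a real agent this would call an API.
    return {"id": invoice_id, "amount": 1200.0}

def write_record(record: dict) -> bool:
    # Placeholder tool with a real-world effect (a database write).
    return True

TOOLS = {"lookup_invoice": lookup_invoice, "write_record": write_record}

def run_agent(goal: str) -> str:
    # 1. The agent receives a goal, not step-by-step instructions.
    # 2. It plans its own action sequence (trivially static here; a real
    #    agent would have the model generate and revise this plan).
    plan = [("lookup_invoice", "INV-001"), ("write_record", None)]
    state: dict = {}
    for tool_name, arg in plan:
        tool = TOOLS[tool_name]
        try:
            # 3. It executes actions with real-world effects.
            result = tool(arg) if arg is not None else tool(state)
            state[tool_name] = result
        except Exception:
            # 4. It adapts to unexpected results: here, by escalating.
            return "escalated_to_human"
    return "completed"
```

A pipeline, by contrast, would hard-code both the sequence and the failure behavior; the agent owns both.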
The autonomy spectrum
Not all agents need the same level of autonomy. Most should not have it. We have identified four levels that exist in production today:
Level 1: Supervised assistant. The agent proposes actions and a human approves them. This is the most common model and the safest. We see it in support triage, document classification, and response drafting. Adoption is high because risk is low.
Level 2: Autonomous with guardrails. The agent executes actions within strict boundaries. It can process invoices under EUR 5,000, answer questions about internal policies, or reclassify low-priority tickets. Anything outside those limits escalates. This is where most enterprises want to be in 2025.
Level 3: Autonomous with audit. The agent operates without real-time supervision, but every decision is logged and periodically audited. Suitable for high-volume, low-risk tasks: data enrichment, price monitoring, report generation. Human review happens after, not before.
Level 4: Fully autonomous. The agent operates without human intervention at any point. In real production, this barely exists. The only examples we have observed are high-frequency trading agents and recommendation systems that technically meet our four criteria but operate in domains where the cost of any individual error is low.
The question every company should ask is not “how do I get to Level 4” but “what level do I need for each use case?”
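The Level 2 boundary check is worth making concrete, because it is where most enterprises want to land. A minimal sketch, using the EUR 5,000 invoice threshold from the text; the action names are illustrative:

```python
# Sketch of a Level 2 guardrail: execute inside strict boundaries,
# escalate everything else. Threshold taken from the text; the set of
# allowed actions is an illustrative assumption.

INVOICE_LIMIT_EUR = 5000.0

ALLOWED_ACTIONS = {"process_invoice", "answer_policy_question",
                   "reclassify_ticket"}

def decide(action: str, amount_eur: float) -> str:
    """Return 'execute' if within the guardrails, 'escalate' otherwise."""
    allowed = action in ALLOWED_ACTIONS
    within_limit = amount_eur < INVOICE_LIMIT_EUR
    return "execute" if (allowed and within_limit) else "escalate"
```

The key design choice is that the default path is escalation: anything the guardrail does not explicitly recognize goes to a human.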
Deployment patterns that work in production
After analyzing over 30 implementations (our own and clients’), four architectural patterns appear consistently in successful deployments.
Pattern 1: Orchestrator-Workers
The most robust pattern. An orchestrator agent receives the task, decomposes it, and delegates to specialized worker agents. Each worker has a limited scope: one reads documents, another queries APIs, another generates structured outputs. The orchestrator maintains state, handles errors, and decides when the task is complete.
Where it works: complex document processing, multi-system workflows, tasks that combine analysis and action.
Where it fails: simple tasks where orchestration introduces unnecessary latency. If your task is “classify this email,” you do not need three agents.
Typical cost: 2-5x more expensive in tokens than a monolithic agent, but significantly more reliable. In an invoice processing implementation, we went from 76% success rate with a single agent to 94% with orchestrator-workers. Cost per invoice rose from EUR 0.03 to EUR 0.08, but the 18-point reliability gain more than justified it.
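The control flow of the pattern can be sketched in a few lines. This is a skeleton with stubbed workers, not a real implementation: the point is that the orchestrator owns sequencing, state, and error handling, while each worker keeps a narrow scope.

```python
# Orchestrator-workers skeleton: the orchestrator decomposes the task,
# delegates to scoped workers, tracks state, and decides completion.
# The three workers are stubs standing in for real tool-using agents.

def read_document(doc: str) -> dict:
    return {"text": doc.upper()}          # stub: document-reading worker

def query_api(data: dict) -> dict:
    return {**data, "enriched": True}     # stub: API-querying worker

def generate_output(data: dict) -> str:
    return f"report:{data['text']}"       # stub: structured-output worker

WORKERS = [read_document, query_api, generate_output]

def orchestrate(task: str) -> dict:
    state = task
    for worker in WORKERS:
        try:
            state = worker(state)
        except Exception as exc:
            # The orchestrator, not the worker, owns error handling.
            return {"status": "failed", "error": str(exc)}
    return {"status": "done", "result": state}
```

The token overhead comes from the fact that each delegation is itself a model call with its own context; the reliability comes from the same fact.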
Pattern 2: Router + Specialists
Similar to the above, but the routing component is not a full agent. It is a lightweight classifier (often a small model or even heuristic rules) that determines the task type and routes to the appropriate specialist. Each specialist is a complete agent, but it only knows how to do one thing.
The advantage is cost. The router consumes minimal tokens, and each specialist is optimized for its domain. We have seen companies that combine an embedding-based router (near-zero cost) with specialists using different models based on complexity. The classification specialist uses Claude Haiku. The legal reasoning specialist uses Claude Opus. The document generation specialist uses GPT-4o.
Where it works: companies with multiple use cases that want a unified interface. Example: a logistics company using a single entry point for customer inquiries, claims, and shipment tracking, with three specialists behind it.
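The logistics example above can be sketched with a heuristic router. Keyword rules stand in for the embedding-based classifier, and the three specialists are stubs; everything here is illustrative.

```python
# Router + specialists sketch: a cheap classifier in front, one full
# agent per domain behind it. Specialist bodies are stubs.

def handle_inquiry(msg: str) -> str:  return "inquiry_handled"
def handle_claim(msg: str) -> str:    return "claim_handled"
def handle_tracking(msg: str) -> str: return "tracking_handled"

SPECIALISTS = {
    "inquiry":  handle_inquiry,
    "claim":    handle_claim,
    "tracking": handle_tracking,
}

def route(message: str) -> str:
    """Heuristic router; in production this could be an embedding-based
    classifier at near-zero token cost."""
    text = message.lower()
    if "where is" in text or "tracking" in text:
        return "tracking"
    if "damaged" in text or "claim" in text:
        return "claim"
    return "inquiry"

def handle(message: str) -> str:
    return SPECIALISTS[route(message)](message)
```

The router's only job is to be cheap and rarely wrong; each specialist can then use the model tier its domain actually needs.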
Pattern 3: Agent with Persistent State
Most agents are stateless. They receive a task, execute it, return a result. But there are cases where the agent needs to remember context across executions. An agent managing a client relationship needs to know what was discussed last week. An agent monitoring a process needs to know which alerts it already sent.
The typical implementation uses a vector database or key-value store for agent state, combined with a summarization mechanism to compress historical context. Pinecone, Weaviate, and pgvector are the most common choices.
The risk: accumulated state can degrade agent quality if not actively managed. We had an agent that accumulated so much historical context that the model got confused and mixed information from different clients. The fix was a sliding context window with periodic summarization: every 20 interactions, a second model generates a summary of relevant context and discards the detail.
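The sliding-window fix described above can be sketched as follows. The summarizer is a stub; in production a second, cheaper model generates the summary, and the interval of 20 interactions is the one from the text.

```python
# Sliding context window with periodic summarization: every 20
# interactions, compress history into a summary and drop the detail.
# summarize() is a stub standing in for a call to a cheap model.

SUMMARY_EVERY = 20

def summarize(messages: list[str]) -> str:
    # Stub: a real implementation would call a small model here.
    return f"[summary of {len(messages)} messages]"

class AgentMemory:
    def __init__(self):
        self.summary = ""
        self.recent: list[str] = []

    def add(self, message: str):
        self.recent.append(message)
        if len(self.recent) >= SUMMARY_EVERY:
            history = [self.summary] if self.summary else []
            self.summary = summarize(history + self.recent)
            self.recent = []

    def context(self) -> str:
        # What the agent actually sees: one summary plus recent detail.
        return "\n".join(filter(None, [self.summary] + self.recent))
```

The context the model receives stays bounded regardless of how long the relationship runs, which is what prevents the cross-client confusion described above.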
Pattern 4: Reactive Agent with Event-Driven Architecture
Instead of a user or a cron triggering the agent, the agent activates on system events. A new email arrives, the classification agent fires. An invoice is uploaded to Drive, the processing agent activates. A ticket changes state, the follow-up agent evaluates whether it needs to act.
This pattern fits naturally with microservice architectures that already use message queues. Kafka, RabbitMQ, or even Google Cloud Functions as triggers. The agent becomes just another consumer in the event system.
What we learned: the biggest risk is event storms. An upstream failure generating 500 duplicate events can trigger 500 agent executions, each with its token cost. Three safeguards are mandatory: event deduplication, per-source rate limiting, and circuit breakers that halt execution if the event rate exceeds a threshold.
Real cost structures
Let us talk money. The costs of a production agent have four components, and most estimates only consider one.
Component 1: LLM Inference
The most visible and most variable. Prices dropped dramatically in 2024. Claude 3.5 Sonnet costs $3/million input tokens and $15/million output. GPT-4o is at $2.50/$10. Small models (Haiku, GPT-4o-mini) cost 10-20x less.
But cost per token is only half the equation. What matters is cost per completed task, and that depends on how many tokens your agent consumes per execution.
Real data from three of our implementations:
| Use case | Model | Tokens/execution | Cost/execution | Monthly volume | Monthly cost |
|---|---|---|---|---|---|
| Email classification | Haiku | 2,800 | EUR 0.002 | 15,000 | EUR 30 |
| Invoice processing | Sonnet | 12,000 | EUR 0.05 | 3,000 | EUR 150 |
| Contract analysis | Opus | 45,000 | EUR 0.90 | 200 | EUR 180 |
The variance across use cases is 450x. Generalizing AI costs is about as useful as saying “a car costs EUR 30,000.”
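The token-to-cost arithmetic is simple enough to sanity-check. A hedged helper: the 80/20 input/output split is an assumption for illustration, not a figure from the table, and prices are the per-million rates quoted above.

```python
# Estimate cost per execution from token volume and per-million prices.
# The input/output split (out_fraction) is an illustrative assumption.

def cost_per_execution(tokens: int, price_in_per_m: float,
                       price_out_per_m: float,
                       out_fraction: float = 0.2) -> float:
    tokens_out = tokens * out_fraction
    tokens_in = tokens - tokens_out
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000

def monthly_cost(cost_per_exec: float, volume: int) -> float:
    return cost_per_exec * volume
```

Plugging in the invoice-processing row (12,000 tokens at Sonnet's $3/$15 rates) gives roughly $0.065 per execution, in the same range as the table's EUR 0.05; the monthly figures are just cost per execution times volume.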
Component 2: External tools and APIs
Every API call the agent makes has a cost. Google Maps, company databases, verification services, document parsing. In our invoice processing agent, the OCR cost (Google Document AI, $1.50 per 1,000 pages) exceeds the LLM cost at high volume.
Component 3: Infrastructure
The agent needs to run somewhere. If you use a framework like LangGraph Cloud or CrewAI+, you pay per execution. If you self-host, you pay for compute. On Railway, a service running agents with one worker costs EUR 5-20/month depending on volume. On AWS Lambda, you pay per invocation.
Component 4: Human supervision cost
The most ignored and often the highest. Someone has to review escalated actions, analyze errors, tune prompts, and monitor performance. In a typical Level 2 deployment (autonomous with guardrails), we estimate 2-5 hours per week from a technical team member to keep an agent stable. That is EUR 400-1,000/month in personnel cost.
The real total cost of a production agent for a mid-sized European company is EUR 600-2,500/month, depending on the use case and volume. That includes all four components. If someone tells you they can run a production agent for EUR 50/month, they are not including supervision, infrastructure, or tools.
Reliability metrics: what to measure and what to ignore
There are metrics that matter and metrics that look impressive on a dashboard but tell you nothing useful.
Metrics that matter
Task Completion Rate (TCR). Percentage of tasks the agent completes without human intervention. This is the single most important metric. If your TCR is 85%, it means 15 out of every 100 tasks need a human. That directly determines the time savings the agent generates.
Critical error rate. Percentage of tasks where the agent produces an incorrect result that is not automatically detected. An agent that classifies an email as “urgent” when it is not generates noise. One that processes an invoice with the wrong amount creates an accounting problem. The first is annoying. The second is expensive. You need to measure both separately.
Mean execution time. Not just for efficiency, but because it correlates with cost and user experience. An agent that takes 45 seconds to respond to a customer is unacceptable for live support. One that takes 3 minutes to process a complex document might be perfectly fine.
Cost per successfully completed task. Not cost per execution. Cost per task that completes correctly. If your agent costs EUR 0.05 per execution but fails 30% of the time, your real cost per completed task is EUR 0.07 plus the cost of the human who resolves the failures.
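The arithmetic behind that last metric is worth writing down, because it is the one most teams skip. A sketch: divide the per-execution cost by the success rate, then add the human cost of handling the failures.

```python
# Cost per successfully completed task: executions that fail still cost
# money, and so does the human who picks up each failure.

def cost_per_completed_task(cost_per_exec: float, success_rate: float,
                            human_cost_per_failure: float = 0.0) -> float:
    failures_per_success = (1 - success_rate) / success_rate
    return (cost_per_exec / success_rate
            + failures_per_success * human_cost_per_failure)
```

With the numbers from the text (EUR 0.05 per execution, 30% failure rate) the model-only figure comes out to about EUR 0.07 per completed task, before counting the human cost of the failures.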
Metrics that matter less than you think
Tokens per execution. Useful for internal optimization, but meaningless as a business metric. An agent that uses lots of tokens but correctly completes 98% of tasks is better than one that uses few but fails 25%.
Time to first response. Relevant for chatbots, irrelevant for back-office agents. Do not optimize the wrong metric.
LLM “helpfulness” score. Automated quality evaluations with LLM-as-judge are useful for detecting regressions, but they are not a business metric. An agent can be “very helpful” according to GPT-4 and still get 20% of invoices wrong.
Frameworks and tools: the 2025 map
The tooling ecosystem for building agents has consolidated significantly over the past year. We are past the “new framework every week” phase (well, sort of, but the ones that matter have stabilized).
Orchestration frameworks
LangGraph (LangChain) has become the de facto standard for stateful agents. Its directed graph model fits well with the orchestrator-workers pattern. The cloud version adds persistence and traceability out of the box. We use it in 60% of our projects.
CrewAI has gained traction for multi-role agents. The “team of agents” metaphor is intuitive for non-technical stakeholders, which helps adoption. But its execution model is less flexible than LangGraph for complex workflows.
AutoGen (Microsoft) is fine for research and prototypes, but its multi-agent conversation model does not scale well for production. Too much inter-agent chatter, too many tokens burned on coordination.
Native agent SDKs from Anthropic and OpenAI. Both providers have shipped their own agent tooling: OpenAI's Agents SDK and Anthropic's advanced tool use. The upside: native model integration, guaranteed structured outputs, lower latency. The downside: provider lock-in.
Models
The relevant model landscape for agents as of January 2025:
| Model | Best for | Relative cost | Latency |
|---|---|---|---|
| Claude 3 Opus | Complex reasoning, decisions | High | Medium |
| Claude 3.5 Sonnet | Quality/cost balance | Medium | Low |
| Claude 3.5 Haiku | Classification, routing | Low | Very low |
| GPT-4o | Multimodal, generation | Medium-high | Medium |
| GPT-4o-mini | Simple tasks | Low | Low |
| Gemini 1.5 Pro | Long context, documents | Medium | Medium |
| Mixtral/Llama 3 | Self-hosted, privacy | Own infra | Variable |
The clear trend: today’s “small” models outperform last year’s “large” models. Claude Haiku now solves tasks that a year ago required GPT-4.
Observability
LangSmith is the natural companion to LangGraph. Full chain tracing, evaluations, and test datasets. It is our primary tool.
Langfuse as an open-source alternative. Less polished, but your data stays in your infrastructure. For companies with privacy requirements, it is the obvious choice.
OpenTelemetry with custom exporters. For teams that already have an observability stack (Datadog, Grafana), extending OTEL for agent traces is the lowest-friction path.
What is coming: trends for H1 2025
Three trends we are tracking that we believe will define the first half of the year.
Computer use and browser agents. Claude and other models can now interact with graphical interfaces. This opens the door to agents that operate legacy applications without APIs. We prototyped an agent that navigates a Spanish tax authority portal to download notifications. It works 80% of the time. The other 20% is why it is not in production.
Agents with long-term memory. Advances in memory architectures (MemGPT, advanced RAG with reranking) are enabling agents that maintain useful context for weeks or months. This is critical for use cases like client management, where the agent needs to remember previous interactions.
Vertical specialization. Generic agents are giving way to agents designed for specific verticals. Agents for accounting that understand local tax codes. Agents for logistics that know Incoterms and customs documentation. Agents for legal that handle procedural deadlines and statutory references. Vertical specialization is where the real value lies, not in the agent that knows a little bit about everything.
Practical recommendations
If you are considering implementing AI agents in your company in 2025, these are our recommendations based on what we have seen work.
Start at Level 1. A supervised assistant that proposes actions. Investment is low, risk is minimal, and it gives you real data on which use cases deliver value. Three months with a supervised assistant teaches you more than six months planning an autonomous agent.
Choose a use case with high volume and low risk. Email classification, ticket triage, data extraction from documents. Do not start with the agent that makes financial decisions. Start with the one that saves 20 minutes a day for a team of five.
Measure before you automate. Before building the agent, measure the manual process. How many tasks per day. How long per task. How many errors. Without that baseline, you cannot demonstrate ROI.
Budget for all four cost components. Not just inference. Tools, infrastructure, and human supervision. If your budget only covers LLM tokens, your project will fail.
Invest in observability from day one. This is not something you add later. If you cannot see what your agent does step by step, you cannot diagnose failures, optimize costs, or prove it works.
We are at the phase where agent technology is no longer the bottleneck. Production engineering, integration with existing processes, and organizational change management are. The companies that understand this and act accordingly will have a real competitive advantage. Those waiting for it to become “easy” will be waiting a long time.
If you want to explore how AI agents can fit into your operations, our AI and Machine Learning team can help you identify the highest-impact use cases. You can also review our article on AI agents in production for a deeper technical analysis of lessons learned, and our article on orchestration and failure patterns for a deep dive into production failure modes.
For companies seeking a full assessment of their technology maturity, we offer strategic consulting that includes an AI implementation roadmap tailored to your sector and size.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
