AI-Native Architectures: Designing Systems Where AI Is a First-Class Citizen
The bolt-on AI problem
Most enterprise AI implementations follow the same playbook: take an existing system, bolt a language model onto its side, and hope something improves. Sometimes it works. Usually, it introduces friction nobody anticipated.
The original system was designed with an implicit assumption: data flows from humans to machines and back, in predictable formats, at human speed. A form creates a database record. An operator reviews a queue. A batch process generates a report. Everything is synchronous, predictable, and sized for the throughput of a person with a keyboard.
Drop an AI agent into that flow and every assumption breaks. The agent processes at machine speed, generates data volumes the system never expected, and makes decisions the original flow assumed a human would handle. The result is Frankenstein's monster: a legacy system with an AI brain that does not fit the existing anatomy.
The alternative is designing from scratch with AI as a core component. That is what we call AI-native architecture.
Three properties that define AI-native
An AI-native architecture is not simply one that “uses AI.” It has three structural properties that set it apart.
Continuous bidirectional data flow. Traditional systems move data in one direction: input, processing, output. AI-native systems move data in loops. A model’s output feeds the next iteration. The model’s decisions generate data that refines future decisions. The system learns from its own operation — not as an abstract concept but as an explicit data flow in the architecture.
Uncertainty as a first-class data type. A traditional system operates with certainties: an order is confirmed or not. An invoice is correct or incorrect. An AI-native system operates with probabilities. A document has an 87% likelihood of being a transport invoice. A client has a 72% chance of needing additional insurance. The architecture must propagate, store, and act on those probabilities. This affects the data model, the business logic, and the user interface.
Human-AI collaboration as the primary interaction pattern. Not “the human decides” or “the AI decides.” A continuous dialogue where the system presents options with confidence, the human corrects or confirms, and that correction feeds the model. This pattern requires interfaces designed for collaboration, not supervision.
The data graph: backbone of an AI-native system
The most critical component of an AI-native architecture is not the model. It is the data graph.
In a conventional system, the database schema reflects business entities: customers, orders, invoices. Relationships are explicit and static. In an AI-native system, you also need to represent inferred relationships, confidence scores, and temporal context.
A concrete example from our logistics platform. A shipment is not just a record with origin, destination, and weight. It is a node connected to: the customer (with inferred preference history), the route (with time predictions based on historical data and current conditions), the carrier (with a reliability score computed by ML), and documentation (with automatically extracted fields and their confidence levels). Each connection carries metadata: when it was created, which model generated it, at what confidence, and whether a human validated it.
We implement this graph as a layer on top of relational storage, not instead of it. Postgres remains the source of truth for core entities. We add a metadata layer that captures inferences, confidences, and AI-generated relationships. In practice, that means additional tables using an EAV-style pattern for inferences, or JSONB columns in Postgres for confidence metadata.
The benefit is twofold. First, every AI decision is traceable: you can reconstruct why the system classified a shipment as priority. Second, human corrections are captured structurally and feed retraining pipelines.
Data flow patterns for AI
Data flow in an AI-native system follows patterns distinct from conventional CRUD. We have identified four that appear repeatedly.
Progressive enrichment. A datum enters the system with minimal information and gets enriched at each step. An email arrives as plain text. The first model extracts entities (name, company, shipment reference). The second classifies intent (inquiry, complaint, quote request). The third links it to an existing case or creates a new one. Each step adds metadata without modifying the original datum. By the end, a three-line email has ten structured fields attached.
This pattern maps to an event pipeline. Each enrichment step is an independent consumer that reads from the previous event and publishes an enriched result. We use Apache Kafka for high-volume pipelines and lighter queues (Redis Streams or SQS) for moderate volumes. The design principle: each step is idempotent and can fail without corrupting the original data.
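A minimal in-memory sketch of one enrichment step under these constraints (no real Kafka or Redis Streams consumer here; `EnrichmentStep` and the toy extractors are hypothetical names for illustration, not our production code):

```python
import copy

class EnrichmentStep:
    """One idempotent stage of a progressive-enrichment pipeline."""

    def __init__(self, name, enrich_fn):
        self.name = name
        self.enrich_fn = enrich_fn   # callable: event -> dict of new fields
        self._results = {}           # event id -> enriched result (idempotency)

    def handle(self, event):
        # Replaying an already-processed event returns the cached result,
        # so a retry after a downstream failure cannot double-enrich.
        if event["id"] in self._results:
            return self._results[event["id"]]
        enriched = copy.deepcopy(event)          # original datum never mutated
        enriched.setdefault("enrichments", {})[self.name] = self.enrich_fn(event)
        self._results[event["id"]] = enriched
        return enriched

# Toy stand-ins for the model-backed steps described above.
extract = EnrichmentStep("entities", lambda e: {"company": "ACME", "ref": "SH-42"})
classify = EnrichmentStep("intent", lambda e: {"intent": "quote_request"})

email = {"id": "msg-1", "body": "Quote for 3 pallets to Lisbon, ref SH-42"}
result = classify.handle(extract.handle(email))
```

Caching the result by event id is what makes a replayed message a no-op: each step can crash and be retried without corrupting the original datum or duplicating metadata.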
Confidence-routed decision. The system generates a decision with a confidence level and routes it by threshold. High confidence (>90%): automatic execution. Medium confidence (70-90%): fast review queue where a human confirms or corrects with a single click. Low confidence (<70%): manual processing queue with all contextual information pre-populated.
What matters in the design is that thresholds are not fixed. They are calibrated per decision type and adjust dynamically. If the human correction rate in the medium-confidence queue climbs from 5% to 15%, the system automatically raises the confidence required for automatic execution, routing more decisions to review. This is implemented as a feedback loop with a sliding window over the last 500 decisions.
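The routing and recalibration logic can be sketched as follows. The class name, step size, and minimum-signal guard are illustrative assumptions; only the thresholds and window size come from the description above:

```python
from collections import deque

class ConfidenceRouter:
    """Routes each decision by confidence and recalibrates from human outcomes."""

    def __init__(self, auto=0.90, manual=0.70, window=500,
                 target_correction_rate=0.05, step=0.02, min_signal=50):
        self.auto = auto                      # >= auto  -> execute automatically
        self.manual = manual                  # < manual -> full manual queue
        self.outcomes = deque(maxlen=window)  # sliding window of review outcomes
        self.target = target_correction_rate
        self.step = step
        self.min_signal = min_signal          # don't adjust on too little data

    def route(self, confidence):
        if confidence >= self.auto:
            return "auto"
        if confidence >= self.manual:
            return "fast_review"
        return "manual"

    def record_review(self, was_corrected):
        """One outcome from the fast-review queue: True if the human corrected it."""
        self.outcomes.append(was_corrected)
        if len(self.outcomes) < self.min_signal:
            return
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate > self.target:
            # Humans correct too often: demand more confidence before automating.
            self.auto = min(0.99, self.auto + self.step)
        elif rate < self.target / 2:
            # Model is reliable again: relax and automate more.
            self.auto = max(self.manual, self.auto - self.step)
```

Note that the adjustment works in both directions, so the share of automatic decisions tracks the model's observed reliability rather than a number fixed at design time.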
Explicit feedback. Every human correction is captured as a pair (model_prediction, human_correction) with the full context available to the model. These pairs accumulate in a continuously growing fine-tuning dataset. We do not run continuous fine-tuning in production (the risks outweigh the benefits in most cases), but we use this data to evaluate new models before deploying them and to refine prompts.
Cumulative context. The system maintains per-entity context (customer, case, operation) that enriches with each interaction. When an AI agent handles a query about a shipment, it has access not only to current status but to the full interaction history, prior decisions, past exceptions, and inferred customer preferences. This context is composed dynamically by selecting the most relevant fragments using semantic search over a vector store (we use pgvector inside the same Postgres instance).
The orchestration layer: coordinating humans and AI
In an AI-native architecture, orchestration is not a cron job running tasks. It is a coordination system managing workflows where AI models and humans participate in an interleaved fashion.
We implement orchestration as state machines. Each workflow has defined states, possible transitions, and for each transition: who executes it (model or human), what data it needs, what validation criteria apply, and where it goes on failure.
A sample flow in our shipment management system:
- Request intake (automatic): model extracts data from email or form.
- Classification (automatic if confidence >90%, human otherwise): shipment type, urgency, destination.
- Quoting (automatic): tariff calculation across multiple carriers in parallel.
- Client presentation (automatic): generates and sends formatted quote.
- Confirmation (human): operator verifies critical data before confirming with carrier.
- Tracking (automatic): monitors status and flags anomalies.
- Close-out (automatic with validation): invoicing and archival.
Five of seven steps are automatic. Two require human intervention. But the key is not how many steps are automated — it is that every step has a defined fallback. If the classification model fails, the request routes to the manual classification queue with a 30-minute SLA. If the tariff calculation fails for one carrier, the system excludes it and presents the rest.
The state machine runs on Temporal (we previously used a combination of queues and database states; migrating to Temporal was one of the best architectural decisions we have made). Temporal provides durability, automatic retries, configurable timeouts, and visibility into every in-flight workflow.
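Independent of Temporal, the core idea is a transition table where every step declares its fallback. A minimal sketch, truncated to the first steps of the flow above (executor and queue names are illustrative):

```python
# Each state declares who executes it, where success leads, and its fallback.
WORKFLOW = {
    "intake": {
        "executor": "model",
        "on_success": "classification",
        "on_failure": "manual_intake",
    },
    "classification": {
        "executor": "model",
        "min_confidence": 0.90,                 # below this, a human classifies
        "on_success": "quoting",
        "on_failure": "manual_classification",  # queue with a 30-minute SLA
    },
    "quoting": {
        "executor": "model",
        "on_success": "presentation",
        "on_failure": "manual_quoting",
    },
}

def next_state(state, succeeded, confidence=1.0):
    """Resolve the next state; every step has a defined fallback."""
    spec = WORKFLOW[state]
    if succeeded and confidence >= spec.get("min_confidence", 0.0):
        return spec["on_success"]
    return spec["on_failure"]
```

Temporal then supplies what this sketch omits: durable state, retries, timeouts, and visibility, while the table keeps the human/model handoffs explicit and auditable.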
Feedback loops: the self-improving system
Feedback loops are what separate a system with AI from an AI-native system. Without feedback loops, you have a static system that uses a model. With them, you have a system that improves continuously.
We maintain three types of feedback loops in production.
Threshold calibration loop. Every automatic decision is retrospectively evaluated. If an automatic classification turns out incorrect (detected when a human corrects it downstream), it is recorded as a false positive. The false positive rate per decision type is calculated in 7-day windows. If it exceeds a threshold, the confidence level required for that decision adjusts automatically. This works in both directions: when accuracy improves, thresholds relax and more decisions become automatic.
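The per-window adjustment reduces to a small pure function. The target rate, step size, and bounds here are illustrative, not our calibrated values:

```python
def recalibrate(threshold, false_positive_rate,
                target=0.05, step=0.02, lo=0.70, hi=0.99):
    """Adjust an automation confidence threshold from the false-positive
    rate observed in the last 7-day window. Works in both directions."""
    if false_positive_rate > target:
        # Too many bad automatic decisions: require more confidence.
        return min(hi, threshold + step)
    if false_positive_rate < target / 2:
        # Comfortably under target: relax and automate more.
        return max(lo, threshold - step)
    return threshold
```

Running this once per window per decision type keeps every threshold anchored to recent observed accuracy instead of a value chosen at launch.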
Data quality loop. AI-generated data (classifications, extractions, summaries) is periodically sampled and evaluated against ground truth. Five percent of document extractions are manually reviewed each week. Results feed a quality dashboard showing trends by document type, model, and time period. When quality drops below 95%, an investigation triggers. The cause is usually a change in input document format, not model degradation.
User experience loop. Every human-AI interaction in the interface is instrumented. We measure: review time (how long an operator takes to validate an AI suggestion), acceptance rate (what percentage of suggestions are accepted without modification), and correction pattern (which fields operators correct most frequently). This data informs both model improvements and interface improvements. If operators systematically correct the “goods type” field, it is not just a model problem — it might be a problem with how we present the options.
Data model: extending the relational schema
An AI-native data model extends the conventional relational schema with three additional concepts.
Inferences table. Each model-generated inference is stored with: entity_id, inference_type, value, confidence, model_version, timestamp, and status (pending, validated, rejected). This creates a complete history of what the AI has “thought” about each entity.
Corrections table. Each human correction to an inference is stored linked to the original inference: inference_id, corrected_value, user_id, timestamp, reason (optional). This is the continuous improvement dataset.
Context table. A vector store holding embeddings of documents, interactions, and decisions associated with each entity. Queried via semantic search to compose the dynamic context that feeds the model on each interaction.
In Postgres, the practical implementation:
CREATE TABLE ai_inferences (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    entity_type VARCHAR(50) NOT NULL,
    entity_id UUID NOT NULL,
    inference_type VARCHAR(100) NOT NULL,
    value JSONB NOT NULL,
    confidence FLOAT NOT NULL,
    model_version VARCHAR(50) NOT NULL,
    status VARCHAR(20) DEFAULT 'pending',
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE ai_corrections (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    inference_id UUID REFERENCES ai_inferences(id),
    corrected_value JSONB NOT NULL,
    corrected_by UUID NOT NULL,
    reason TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);
The context table uses pgvector, with an HNSW index for efficient similarity search. The schema is not complex. What is complex is the discipline to use it consistently at every point where the AI generates a decision.
Human-AI collaboration patterns in the UI
The user interface of an AI-native system is fundamentally different from a CRUD system. It is not a form for entering data. It is a collaboration space where the system proposes and the human decides.
We have iterated on three interface patterns.
Intelligent pre-population. The classic form, but with every field pre-filled by AI with visual confidence indicators. A field with >95% confidence appears green and the operator skips it. A field with <70% appears yellow with a tooltip explaining why the AI is unsure. The operator focuses only on the yellow fields. In our measurements, this reduces processing time per shipment from 4 minutes to 45 seconds.
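The field-level logic behind those indicators is simple. A sketch, assuming a neutral treatment for the unflagged middle band (the article only specifies the green and yellow extremes):

```python
def field_review_state(confidence, high=0.95, low=0.70):
    """Map a field's extraction confidence to its UI treatment."""
    if confidence > high:
        return "green"    # pre-filled; the operator skips it
    if confidence >= low:
        return "neutral"  # pre-filled, shown without a flag
    return "yellow"       # flagged, with a tooltip explaining the uncertainty
```

The payoff comes from the operator's attention budget: with most fields green, review time concentrates on the handful of yellow fields where the model is genuinely unsure.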
Exception-based review. Instead of reviewing every item, the operator sees only those the AI flags as exceptions: inconsistent data, out-of-range values, or unrecognized entities. A queue of 200 shipments reduces to 15 exceptions. The operator resolves the exceptions and the system auto-confirms the rest. The design key: the operator must be able to access the complete item from the exception view, not just the problematic field.
Contextual dialogue. For complex tasks that a form cannot resolve, the operator interacts with an AI agent in a conversational interface with full operational context. “I need to reclassify this shipment as DDP instead of DAP, recalculate duties, and generate new documentation.” The agent executes the changes, shows a diff of what it will modify, and the operator confirms. This is radically more efficient than navigating five different screens to make the same change manually.
Infrastructure: what you actually need
An AI-native architecture does not require exotic infrastructure. The components are the same as any modern system, with a few additions.
Compute layer. Containers orchestrated with Kubernetes or managed services (Railway, ECS, Cloud Run). AI agents are stateless processes that scale horizontally. The only difference: you need to handle LLM API calls with longer timeouts (30-120 seconds) than typical REST APIs.
Data layer. Postgres as the source of truth, with pgvector for semantic search. Redis for context cache and sessions. A message broker (Kafka for high volume, Redis Streams for moderate volume) for enrichment pipelines.
Orchestration layer. Temporal for durable workflows. Manages retries, timeouts, and coordination between automatic and manual steps.
Observability layer. OpenTelemetry for tracing, Prometheus for metrics, Grafana for dashboards. We add AI-specific metrics: tokens consumed, inference latency, confidence rate, and human correction rate.
Model layer. LLM provider APIs (Anthropic, OpenAI) for language models. Custom models deployed on GPU (if any) for specialized tasks like document field extraction. A router that selects the model based on task and budget.
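A model router can be as simple as a capability-tier lookup with a budget constraint. The catalog, tiers, and prices below are entirely hypothetical placeholders, not real provider pricing:

```python
# Hypothetical catalog: cost is per 1K tokens, tier is a capability rank.
MODELS = [
    {"name": "small-fast",  "cost": 0.25,  "tier": 1},
    {"name": "mid-general", "cost": 3.00,  "tier": 2},
    {"name": "large-best",  "cost": 15.00, "tier": 3},
]

# Minimum capability tier each task type is assumed to need.
TASK_TIER = {"classification": 1, "extraction": 2, "drafting": 2, "reasoning": 3}

def pick_model(task, budget_per_1k):
    """Cheapest model that meets the task's tier within budget; if none
    qualifies, fall back to the most capable model the budget allows."""
    needed = TASK_TIER.get(task, 3)
    candidates = [m for m in MODELS
                  if m["tier"] >= needed and m["cost"] <= budget_per_1k]
    if candidates:
        return min(candidates, key=lambda m: m["cost"])["name"]
    affordable = [m for m in MODELS if m["cost"] <= budget_per_1k]
    return max(affordable, key=lambda m: m["tier"])["name"] if affordable else None
```

Routing cheap, high-volume tasks away from the frontier model is usually the single biggest lever on inference cost.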
Total infrastructure cost for a mid-sized AI-native system (10-50 users, 10K-100K daily operations) runs between EUR 2,000 and 5,000 per month, including LLM inference. The most variable component is inference cost, which depends on volume and task complexity. In our experience, inference represents 40% to 60% of total infrastructure cost.
Mistakes we made (and you probably will too)
Treating AI as an isolated microservice. Our first attempt was a standalone “AI service” the rest of the system called. This forced rigid interfaces, prevented context flow, and added unnecessary latency. AI is not a service; it is a capability that permeates the system. Every service that needs intelligence has it integrated, not delegated.
Underestimating feedback data volume. The inferences and corrections tables grow fast. A system processing 5,000 daily operations generates 50,000 inferences per day (roughly 10 per operation). In a year, that is 18 million rows in the inferences table alone. You need a partitioning and archival strategy from day one, not when the table hits 100 million rows.
Ignoring perceived latency. A model that takes 3 seconds to respond is acceptable in an automated pipeline. It is unacceptable in an interactive interface. We had to redesign collaboration interfaces to show results progressively (streaming) rather than waiting for the complete response. The difference in user experience is dramatic: the operator starts reading while the model continues generating.
Not versioning prompts. For months, prompts were embedded in code. A prompt change required a deployment. Now prompts live in a separate repository, versioned, with regression tests, and an evaluation pipeline that compares the new version against the previous one on a reference dataset before promoting to production.
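The promotion gate at the end of that pipeline reduces to a regression check over per-case scores on the reference dataset. A sketch, with the tolerance and minimum dataset size as illustrative assumptions:

```python
def should_promote(new_scores, old_scores, min_cases=20, tolerance=0.01):
    """Gate a prompt change: promote only if the new version does not
    regress on the reference dataset. Scores are per-case results in [0, 1]."""
    if len(new_scores) != len(old_scores) or len(new_scores) < min_cases:
        return False   # not comparable, or not enough evidence to decide
    new_mean = sum(new_scores) / len(new_scores)
    old_mean = sum(old_scores) / len(old_scores)
    return new_mean >= old_mean - tolerance
```

The tolerance matters: demanding strict improvement on every change blocks harmless refactors, while a small allowed regression keeps the gate practical without hiding real degradation.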
When it is worth it and when it is not
An AI-native architecture makes sense when:
- The domain has high variability in input data (unstructured documents, natural language communications, multi-format data).
- Processes require judgment that currently depends on operator experience.
- Volume justifies automation investment (>1,000 daily operations or >50 hours of manual work per week).
- A natural feedback loop exists: humans already review and correct decisions.
It does not make sense when:
- Processes are deterministic and well-defined (use rules, not AI).
- Volume is low and will not grow.
- No historical data exists to calibrate models.
- The cost of an error is catastrophic with no way to detect it before damage occurs.
The decision is not binary. Many systems benefit from a hybrid approach: the core is deterministic, AI applies at high-variability points. Classify a document: AI. Calculate a tariff from already-classified data: rules. Detect an anomaly in a shipping pattern: AI. Execute the response to the anomaly: human-approved process.
A pragmatic conclusion
Designing an AI-native system is more a shift in mindset than in technology. The tools are the same: Postgres, Kafka, Kubernetes, LLM APIs. What changes is how you compose them.
Data flows in loops, not lines. Decisions carry confidence, not certainty. Humans collaborate with AI instead of using it or supervising it. And every interaction generates data that improves the next one.
If you are starting a new system today and know AI will be central to its operation, build the architecture for it from the start. The cost of retrofitting a legacy system for AI is orders of magnitude higher than designing for AI from day one.
And if you already have a legacy system, do not attempt an AI-native conversion overnight. Identify the point of highest variability, implement the progressive enrichment pattern there, and extend from that point. It is slower but safer. And ultimately, what matters is not the architecture in the diagram. It is the architecture in production. For the foundations of MLOps and production pipelines, or to understand the real metrics of LLMs in production, see our dedicated articles.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
