By CitrusBits Engineering Leadership | Published 2026
Decision loop, compressed: [User Query] → [Intent] → [Plan] → [Execute via MCP + Skills] → [Synthesize] → [Respond]
Every step is observable. Every agent decision is logged. Every human override is recorded. This is not just good architecture; in a regulated healthcare environment, audit trails like these are a compliance requirement.
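To make the "every step is logged" idea concrete, here is a minimal sketch of the decision loop as a chain of named steps that writes each output to an audit log. The `AuditLog` class, the `run_decision_loop` function, and the lambda stages are hypothetical stand-ins for illustration; they are not taken from our production system.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AuditLog:
    """Append-only record of every step the loop takes."""
    entries: list = field(default_factory=list)

    def record(self, step: str, detail: str) -> None:
        self.entries.append({"step": step, "detail": detail})


def run_decision_loop(query: str, steps: list, log: AuditLog) -> str:
    """Pass the query through each (name, function) step in order, logging every output."""
    value = query
    log.record("user_query", value)
    for name, step in steps:
        value = step(value)
        log.record(name, value)
    return value


# Hypothetical stand-ins for the real stages (assumption, for illustration only).
STEPS: list = [
    ("intent", lambda q: f"intent({q})"),
    ("plan", lambda i: f"plan({i})"),
    ("execute", lambda p: f"result({p})"),
    ("synthesize", lambda r: f"draft({r})"),
]

log = AuditLog()
response = run_decision_loop("summarize visit", STEPS, log)
```

After the run, `log.entries` holds one entry per step, in order, which is the property an auditor would want to inspect.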
Great architecture deserves clear concepts. Let’s unpack the terms that matter, without the hand-waving.
An AI agent is a system that perceives its environment, reasons about what to do, takes action, and observes the result. Think of it as a microservice with a will of its own, albeit a safe and auditable one. The sophistication lies in the quality of the reasoning and the richness of the action space.
In our clinical documentation system, the Clinical Reasoning Agent perceives a structured transcript and patient context, reasons about what a complete clinical note should contain, takes action by generating that note, and observes the result (physician edits), which can be fed back to improve future outputs.
The key distinction from traditional software: the agent decides what to do within a goal, rather than following a predefined procedure. This is the shift from scripted to reasoning.
Autonomous vs. semi-autonomous agents: Our MVP uses semi-autonomous agents throughout. Every agent output is reviewed by a human before it affects the real world. Fully autonomous agents (systems that take consequential action without human review) are appropriate in low-stakes, high-volume contexts. In clinical documentation, we keep humans firmly in the loop. The AI is a highly capable collaborator, not an autonomous actor.
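One way to enforce the semi-autonomous rule in code is a review gate: nothing reaches the record unless a human has approved it. The sketch below assumes hypothetical names (`DraftNote`, `review`, `commit_to_record`) and a simplified status model; a real system would also persist who approved what and when.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class DraftNote:
    text: str
    status: ReviewStatus = ReviewStatus.PENDING
    reviewer: Optional[str] = None


def review(draft: DraftNote, reviewer: str, approve: bool,
           edited_text: Optional[str] = None) -> DraftNote:
    """Record the human decision; the reviewer may edit the text before approving."""
    return DraftNote(
        text=edited_text if edited_text is not None else draft.text,
        status=ReviewStatus.APPROVED if approve else ReviewStatus.REJECTED,
        reviewer=reviewer,
    )


def commit_to_record(draft: DraftNote) -> str:
    """Refuse to write anything that a human has not approved."""
    if draft.status is not ReviewStatus.APPROVED:
        raise PermissionError("draft has not been approved by a human reviewer")
    return f"committed by {draft.reviewer}: {draft.text}"
```

The design choice here is that the gate lives in the commit path rather than in the agent, so no agent change can bypass it.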
(Covered in detail in the architecture section above.)
The core insight bears repeating: MCP defines a standard interface for resources (data the agent can read), tools (actions the agent can take), and prompts (reusable interaction patterns). In practical terms, the same Clinical Reasoning Agent that works with one EHR vendor’s data can work with another’s. This is possible not because someone wrote a custom adapter, but because both vendors expose their data through the standard MCP interface.
Healthcare has been plagued for decades by integration complexity. MCP is not a silver bullet, but it fundamentally changes the architecture of that work, from bespoke to standard.
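The "standard interface instead of custom adapter" point can be illustrated without any SDK. The sketch below is not the MCP SDK and does not use its API; `PatientResource`, `VendorAServer`, and `VendorBServer` are hypothetical names that only mirror the idea of a shared read interface for resources.

```python
from typing import Protocol


class PatientResource(Protocol):
    """Hypothetical stand-in for a standard read interface; not the MCP SDK."""

    def read(self, patient_id: str) -> dict: ...


class VendorAServer:
    def read(self, patient_id: str) -> dict:
        return {"patient_id": patient_id, "allergies": ["penicillin"]}


class VendorBServer:
    def read(self, patient_id: str) -> dict:
        return {"patient_id": patient_id, "allergies": []}


def allergy_context(source: PatientResource, patient_id: str) -> str:
    """Agent-side code depends only on the shared interface, never on the vendor."""
    record = source.read(patient_id)
    allergies = ", ".join(record["allergies"]) or "none recorded"
    return f"Allergies for {patient_id}: {allergies}"
```

Swapping `VendorAServer()` for `VendorBServer()` changes the data, not the agent code, which is the architectural shift described above.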
(Covered in depth in the architecture section above.)
The one thing worth reemphasizing here: skills turn specialized knowledge into callable, testable functions. If the ICD-10 coding skill produces errors, you know exactly where to look and how to measure improvement. In a regulated healthcare environment, that auditability is not a nice-to-have. It’s the foundation of trust.
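Here is a toy illustration of what "callable, testable" means for a skill. The lookup table is deliberately tiny and the function names (`icd10_skill`, `accuracy`) are hypothetical; a production coding skill would be far richer, but it could be measured the same way.

```python
from typing import Callable, Optional

# A deliberately tiny lookup; real coding skills cover far more than three entries.
ICD10_LOOKUP = {
    "essential hypertension": "I10",
    "type 2 diabetes mellitus without complications": "E11.9",
    "acute bronchitis, unspecified": "J20.9",
}


def icd10_skill(diagnosis: str) -> Optional[str]:
    """Return the ICD-10 code for a normalized diagnosis string, or None if unknown."""
    return ICD10_LOOKUP.get(diagnosis.strip().lower())


def accuracy(skill: Callable[[str], Optional[str]], labeled_cases: list) -> float:
    """Fraction of (diagnosis, expected_code) pairs the skill gets right."""
    correct = sum(1 for text, expected in labeled_cases if skill(text) == expected)
    return correct / len(labeled_cases)


LABELED_CASES = [
    ("Essential hypertension", "I10"),
    ("acute bronchitis, unspecified", "J20.9"),
    ("hypertension", "I10"),  # the toy lookup misses this shorthand
]

score = accuracy(icd10_skill, LABELED_CASES)
```

Because the skill is an ordinary function, the failing case is easy to locate: the shorthand "hypertension" is not in the lookup, and the score moves as soon as that gap is closed.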
Retrieval-Augmented Generation (RAG) is the technique of giving an AI model access to a relevant knowledge base at inference time, rather than relying solely on what it learned during training.
In our clinical documentation system, this means the Clinical Reasoning Agent doesn’t just rely on what it “knows” about medicine. It retrieves the specific patient’s history, the current clinical guidelines relevant to the diagnosed condition, and the hospital formulary, in real time, and uses that retrieved context to generate grounded, accurate outputs.
A model that relies only on general medical knowledge will underperform one that also has access to this specific patient’s actual data. RAG bridges that gap.
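The retrieval step can be sketched in a few lines. This example assumes a naive word-overlap score in place of the embedding search a real system would likely use, and the documents and function names (`retrieve`, `build_prompt`) are invented for illustration.

```python
def tokenize(text: str) -> set:
    """Lowercase and split on whitespace; punctuation stays attached to words."""
    return set(text.lower().split())


def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Rank documents by word overlap with the query; keep the top k that overlap at all."""
    q = tokenize(query)
    scored = [(len(q & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query: str, documents: list) -> str:
    """Place the retrieved context ahead of the task so the model can ground its answer."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Context:\n{context}\n\nTask: {query}"


# Hypothetical knowledge base (assumption, for illustration only).
KNOWLEDGE = [
    "Patient history: hypertension diagnosed 2019, on lisinopril.",
    "Formulary: lisinopril 10 mg tablets are stocked.",
    "Cafeteria menu: soup of the day is tomato.",
]

QUERY = "draft plan for hypertension follow-up on lisinopril"
prompt = build_prompt(QUERY, KNOWLEDGE)
```

The irrelevant cafeteria entry never reaches the prompt, which is the point: the model sees this patient’s data and the formulary, not the whole knowledge base.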
The traditional SDLC follows a familiar rhythm:
Plan → Build → Test → Deploy
The AI Development Lifecycle replaces that linear sequence with four continuous loops:
Intent Scoping → Context Engineering → Evaluation → Observation
Intent Scoping replaces “requirements gathering.” Instead of writing specification documents that describe exactly what the system should do, teams define what the agent should be able to achieve: the goals, not the procedures. “Automate SOAP note generation with billing accuracy” is an intent. A 47-page functional specification is not.
Context Engineering is the discipline of designing what information the agent receives and how. This is where most AI projects either win or lose. The model is the engine; the context is the fuel. Poorly designed context (missing patient data, vague instructions, inconsistent formatting) produces unreliable outputs regardless of how powerful the underlying model is.
Evaluation replaces unit testing as the primary quality gate. You don’t just test that the function runs; you test that the agent’s judgment is correct, safe, and consistent across a representative range of real-world scenarios. In clinical contexts, this means testing against messy, ambiguous, multi-diagnosis encounters, not just clean happy paths.
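A minimal sketch of an evaluation harness, assuming a very simple pass criterion (required terms must appear in the note). The `EvalCase` shape and the deliberately naive `first_sentence_agent` are hypothetical; real evaluations would use richer scoring and clinician-reviewed expectations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    transcript: str
    must_include: list


def evaluate(agent: Callable[[str], str], cases: list) -> dict:
    """Run the agent on every case and check that each required term appears in the note."""
    results = {}
    for case in cases:
        note = agent(case.transcript).lower()
        results[case.name] = all(term.lower() in note for term in case.must_include)
    return results


def pass_rate(results: dict) -> float:
    return sum(results.values()) / len(results)


def first_sentence_agent(transcript: str) -> str:
    """Deliberately naive stand-in agent: it only summarizes the first sentence."""
    return "Assessment: " + transcript.split(".")[0]


CASES = [
    EvalCase("clean_single_complaint",
             "Patient reports cough for three days.",
             ["cough"]),
    EvalCase("multi_complaint",
             "Patient reports cough for three days. Also mentions chest pain on exertion.",
             ["cough", "chest pain"]),
]

results = evaluate(first_sentence_agent, CASES)
```

The naive agent passes the clean case and fails the multi-complaint one, which is exactly the kind of gap a happy-path-only test suite would hide.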
Observation is the ongoing monitoring loop that never closes. AI systems can drift. A model that performs well at launch may degrade subtly as data patterns shift. Observation means logging every agent decision, tracking output quality over time, and catching problems before physicians do.
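One observable signal of drift is how much physicians edit the drafts before signing. The sketch below assumes that proxy and uses the standard library’s `difflib.SequenceMatcher` to measure it; the `DriftMonitor` class, window size, and threshold are hypothetical choices, not tuned values.

```python
from collections import deque
from difflib import SequenceMatcher


def edit_fraction(draft: str, signed: str) -> float:
    """0.0 means the physician changed nothing; 1.0 means nothing in common survived."""
    return 1.0 - SequenceMatcher(None, draft, signed).ratio()


class DriftMonitor:
    """Tracks a rolling mean of edit fractions over the most recent encounters."""

    def __init__(self, window: int, threshold: float) -> None:
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, draft: str, signed: str) -> bool:
        """Record one encounter; return True when the rolling mean exceeds the threshold."""
        self.recent.append(edit_fraction(draft, signed))
        return sum(self.recent) / len(self.recent) > self.threshold
```

A usage pattern would be to call `observe` each time a note is signed and raise an alert on the first `True`, so the team hears about degradation from the dashboard rather than from physicians.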
“You don’t deploy code; you deploy a cognitive process.”
This is the sentence that changes how you think about release management in AI-native systems. A deployment is not a conclusion. It’s the beginning of the observation loop.
AI agents can operate with different types of memory:
Short-term context (in-context memory): Everything in the active session: the current encounter’s transcript, the retrieved patient data, the in-progress note. This is ephemeral; it vanishes when the session ends.
Long-term memory (external storage): Persistent data that can be retrieved across sessions: patient history, past physician preference patterns, system-wide performance logs. This is stored in databases and retrieved via the MCP layer as needed.
Context engineering, that is, designing what goes in and what stays out, is one of the highest-leverage skills in AI-native development. Stuffing too much into short-term context is expensive and can degrade model performance. Not retrieving enough from long-term memory produces thin, context-poor outputs.
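The "what goes in and what stays out" decision can be sketched as packing prioritized items into a fixed budget. This example assumes a crude one-token-per-word estimate and hypothetical item names; a real system would use the model’s tokenizer and a smarter ranking.

```python
def estimate_tokens(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word (an assumption)."""
    return len(text.split())


def pack_context(items: list, budget: int) -> list:
    """Keep the highest-priority items that fit the budget (lower number = higher priority)."""
    chosen = []
    used = 0
    for priority, text in sorted(items, key=lambda item: item[0]):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen


# Hypothetical context candidates as (priority, text) pairs.
ITEMS = [
    (3, "full ten year visit history in narrative form"),
    (1, "current encounter transcript excerpt"),
    (4, "allergy list"),
    (2, "active medication list"),
]

packed = pack_context(ITEMS, budget=10)
```

With a budget of 10, the long visit history is skipped while the shorter allergy list still fits, showing that the cheapest high-value items survive when space is tight.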
Traditional application observability asks: is the server up? Are the APIs responding? Is the error rate within bounds?
AI-native application observability asks all of that, plus:
Building observability into an AI-native system from day one is not optional. It’s the only way to maintain trust in a clinical context, and the only way to systematically improve the system over time.
Let’s talk about what this means for the people writing the code.
Traditional Development | AI-Native Development
Write exhaustive API specifications | Write intent examples and context |
Handle every edge case in code | Let the agent reason about edge cases |
Hardcode conditional logic | Provide skills; let the agent decide when to invoke them |
Debug with logs and breakpoints | Debug with prompt traces and evaluation scores |
Months to build a clinical rule engine | Days to prototype a working agent |
Deploy and maintain static business logic | Deploy and observe a reasoning system |
An illustrative example: In our ambient documentation system, handling “flag anything unusual in this patient’s morning labs” would require dozens of lines of if-else logic in a traditional system, hardcoding every definition of “unusual” for every lab type, every baseline, every edge case. With AI-native architecture, the agent interprets “unusual” contextually (delta from this patient’s baseline, not just population norms), retrieves the relevant labs, and reasons about what a clinician would want to know. You don’t hardcode “unusual.” You provide the data, the context, and a well-designed prompt.
That’s not fewer decisions for the developer; it’s different decisions. You trade branching logic for context engineering. That is a win, because context is declarative, testable, and far easier to update than nested conditionals.
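As a sketch of what "providing the data and the context" might look like for the labs example, the code below computes the change against this patient’s own baseline and formats it as a context line for the agent. The function names and lab values are hypothetical and purely illustrative; note that the code contains no rule for what counts as unusual.

```python
from statistics import mean


def baseline_delta(history: list, latest: float) -> float:
    """Relative change of the latest value against this patient's own mean."""
    base = mean(history)
    return (latest - base) / base


def lab_context(name: str, unit: str, history: list, latest: float) -> str:
    """One context line per lab: the agent, not this function, judges what is unusual."""
    delta = baseline_delta(history, latest)
    return (f"{name}: latest {latest} {unit}, "
            f"patient baseline {mean(history):.2f} {unit}, change {delta:+.0%}")


# Hypothetical values for illustration only.
line = lab_context("Creatinine", "mg/dL", [0.8, 1.0, 1.2], 1.5)
```

Updating what the agent sees means changing this declarative context line, not rewriting a tree of conditionals.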
“In 2026, the best engineer is the one who provides the best context, not the one who writes the most clever loop.”
Developers don’t disappear in this model. They transform. They move from code writers to system orchestrators, from translating requirements into functions to designing the reasoning environments in which agents operate.
Claude in the developer workflow:
At CitrusBits, we’ve seen Claude integrated into engineering workflows in ways that genuinely accelerate delivery, not by replacing engineering judgment, but by dramatically reducing the time spent on the parts of the job that don’t require it:
Claude Code, specifically, integrates directly into the development environment, making AI assistance a continuous background capability rather than a separate context-switch to a chat interface. Developers stay in flow. Velocity increases. The time from “I need this skill” to “this skill is tested and deployed” compresses meaningfully.
Building AI-native systems requires AI-native teams. This is not a layoff memo masquerading as strategy. It’s a skills evolution, and it’s one that engineering leaders need to actively lead.
Systems thinking at the agent level
Traditional engineers think in functions and APIs. AI-native engineers think in agents, contexts, and feedback loops. The mental model shifts from “what does this function do?” to “what does this agent know, what can it do, and how does it learn?”
Prompt and context engineering
This is a real engineering discipline, not a parlor trick. Designing the context that an agent receives directly determines the quality of the output. Engineers who can design this well are genuinely rare and genuinely valuable. The model is the engine. Context is the fuel. A team that masters context design will outperform a team that simply buys the biggest model.
Agent orchestration
When you have multiple agents that need to collaborate (transcription feeds into reasoning, which feeds into routing), someone has to design the choreography. Who goes first? What does failure look like? How does the system recover? This is orchestration design, and it requires deep understanding of both the AI capabilities and the clinical workflow.
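The questions "who goes first, what does failure look like, how does the system recover" can be made explicit in a small orchestrator. This sketch assumes one particular recovery policy (retry a failing stage, then hand off to a human queue); the stage functions and `StageError` are hypothetical stand-ins.

```python
class StageError(Exception):
    """Raised when a stage cannot produce a usable output."""


def run_pipeline(payload: str, stages: list, max_retries: int = 1) -> dict:
    """Run (name, function) stages in order; retry a failing stage, then hand off to a human."""
    for name, stage in stages:
        for attempt in range(max_retries + 1):
            try:
                payload = stage(payload)
                break
            except StageError:
                if attempt == max_retries:
                    # Recovery path: stop and keep the last good payload for human review.
                    return {"status": "needs_human", "failed_stage": name, "payload": payload}
    return {"status": "ok", "payload": payload}


# Hypothetical stand-in stages (assumption, for illustration only).
calls = {"reasoning": 0}


def transcribe(audio: str) -> str:
    return f"transcript({audio})"


def flaky_reason(transcript: str) -> str:
    """Fails on its first call, then succeeds, to exercise the retry path."""
    calls["reasoning"] += 1
    if calls["reasoning"] == 1:
        raise StageError("timeout")
    return f"note({transcript})"


def route(note: str) -> str:
    return f"routed({note})"


def always_fails(_: str) -> str:
    raise StageError("service down")


result = run_pipeline("audio", [("transcription", transcribe),
                                ("reasoning", flaky_reason),
                                ("routing", route)])
fallback = run_pipeline("audio", [("transcription", transcribe),
                                  ("reasoning", always_fails)])
```

The first run recovers through a retry; the second ends in a `needs_human` result that names the failed stage and preserves the transcript, so nothing is silently dropped.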
Evaluation and measurement
AI systems need to be measured against behavioral benchmarks, not just functional tests. Engineers who can build evaluation pipelines, defining what “good” looks like and measuring against it continuously, are building the immune system of AI-native applications.
Explainability and compliance awareness
“It works” is not enough in healthcare. Engineers need to understand what their systems are doing well enough to explain it to compliance officers, clinical staff, and, in some cases, regulators. This requires both technical depth and communication clarity.
Start with a bounded workflow, not the entire product.
The worst way to introduce AI-native development is to bet the entire roadmap on it at once. The best way is to identify one high-friction workflow (clinical documentation is an excellent candidate) and build a focused MVP with a team that has the psychological safety to learn and iterate. The first five agents will fail. The sixth will ship.
Invest in context, not just models.
There is a persistent temptation among engineering leaders to treat AI adoption as a model selection problem. It isn’t. The model is the engine; the context is the fuel. Invest in the data architecture, the MCP integrations, and the context engineering discipline. A well-contextualized smaller model will consistently outperform a poorly contextualized larger one.
Build an evals-first culture.
Before a single AI-native component touches a production patient workflow, it must pass a battery of evaluation scenarios that simulate the messiest, most ambiguous encounters imaginable. CTOs who mandate this aren’t slowing their teams down; they’re building the trust infrastructure that makes scale possible.
Fund observability from day one.
Retrofitting observability into an AI-native system is genuinely painful. Teams that bake monitoring, evaluation pipelines, and human feedback collection into their architecture from the start have dramatically better outcomes.
Kill the requirements document religion.
Replace requirements documents with intent scenarios and context specifications. “The system shall display the patient’s last three lab results” is a requirement. “The agent should understand what a physician means by ‘anything concerning in the recent labs’ and surface the relevant data” is an intent. These are different design challenges, and they need different design tools.
Redefine what “done” means.
An AI-native system is never done in the traditional sense. It has a launch, and then it has a continuous observation loop. Engineering roadmaps need to reflect this, allocating ongoing capacity for evaluation, refinement, and adaptation rather than treating the first deployment as the finish line.
There’s a softer dimension to this transformation that engineering leaders often underestimate: the shift from certainty to probability.
Traditional software engineers are trained to produce deterministic outputs. A function either works correctly or it doesn’t. There is a right answer and a wrong answer.
AI-native systems operate probabilistically. An agent produces an output that is more or less likely to be correct, given its context and reasoning. This requires a different relationship with uncertainty.
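One practical consequence of working probabilistically: a pass rate from an evaluation run is an estimate, not a fact. The sketch below uses the Wilson score interval, a standard way to put bounds on a proportion; the choice of this interval and the example counts are our illustration, not something the system above prescribes.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a pass rate; z = 1.96 gives roughly 95% confidence."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical run: 90 of 100 evaluation scenarios passed.
low, high = wilson_interval(90, 100)
```

For 90 passes out of 100, the interval runs from roughly 0.83 to 0.94, so a later run that scores 0.88 is not evidence of regression on its own.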
“We used to fear ambiguity. Now we design for it.”
Teams that embrace this shift, becoming comfortable with monitoring, measuring, and improving rather than fixing and shipping, are the teams that will build the healthcare software of the next decade. In AI-native teams, progress beats perfection. The system that improves every week will always outperform the system that shipped perfectly and then stood still.
The concepts in this post are grounded in the Anthropic ecosystem: Claude, the Claude API, Claude Code, and the MCP standard that Anthropic has championed. This is our primary toolkit, and it is an excellent one.
It is also worth noting that AI-native development is an architectural philosophy, not a vendor commitment. The principles of agent design, skill modularity, context engineering, and observability apply across ecosystems. Teams building with OpenAI’s APIs, Google’s Gemini, Meta’s Llama models, or open-source frameworks like LangGraph and AutoGen are building AI-native systems too; much of what we’ve described here translates directly.
The right tool is the one that fits your use case, your compliance requirements, and your team’s capabilities. What matters more than the specific model or platform is the disciplined, layered architecture that makes your system trustworthy, adaptable, and improvable over time.
The ecosystem is large. The principles are portable. Build with both in mind.
Let’s return to where we started.
It’s 11:47 PM on a Thursday. Dr. Emily Carter has just finished her twelfth patient.
In this version of the story, she doesn’t open fourteen dropdown menus. The ambient documentation system has already been listening, with her consent and within her institution’s HIPAA-compliant infrastructure, since the moment she walked into the room. By the time she reaches her desk, a draft SOAP note is waiting. The ICD-10 codes are pre-populated. The medication interaction flag (the one that would have taken her three separate database lookups to surface manually) is highlighted at the top of the review panel. She reads it in ninety seconds, makes two small adjustments, and signs.
Her daughter is still awake.
That system does not exist because someone wrote better conditional logic. It exists because an engineering team stopped thinking in features and started thinking in agents. It exists because a CTO decided that “done” means “continuously improving,” not “successfully deployed.” It exists because developers who used to write every rule learned to design the context in which reasoning happens.
That is AI-native development. And it is not optional. It is inevitable.
The question is no longer whether to adopt AI. It is how quickly your team can develop the architectural fluency, the evaluation discipline, and the cultural tolerance for probability that this shift requires.
The developers who adapt will build the systems that change how medicine is practiced. They will be the ones whose code runs in the rooms where lives are saved.
The CTOs who lead this transformation will define the next generation of healthcare technology companies.
The blank file is waiting. This time, don’t just write code.
Design intelligence.
CitrusBits is a healthcare technology and AI development firm building AI-native systems for health systems, medical device companies, and digital health platforms. We partner with engineering leaders who are ready to build differently.
Interested in starting your AI-native journey? citrusbits.com
Tags: AI-native development · healthcare engineering · clinical AI · AI agents · MCP · ADLC · digital health · software architecture · LLM integration · Claude API
1) The Night the Codebase Won
2) The SDLC Had a Good Run. But Its Time Is Up.
3) What Does “AI-Native” Actually Mean?
4) The Application: An Ambient Clinical Documentation Assistant
5) The Architecture: Building Intelligence in Layers
6) Visualizing the Architecture
7) Core Concepts, Demystified
8) From Static Codebases to Adaptive Systems: The Developer’s New Reality
9) The Transformation of Healthcare Engineering Teams
10) A Note on the Broader Ecosystem
11) The Conclusion That Isn’t Really a Conclusion