AI Agent Architecture Patterns on the Microsoft Stack
Distributed systems gave us a useful vocabulary for thinking about coordination: sequential pipelines, fan-out/fan-in, event-driven messaging, supervisor hierarchies. The patterns are mature. The failure modes are documented. The tooling exists.
Agent systems inherit that vocabulary and then break it in one specific place: every coordination hop is a reasoning step, and reasoning degrades with context. In a microservices pipeline, passing a message between services doesn't make either service worse at its job. In a multi-agent system, what you pass, how much you pass, and in what order you pass it directly affects the quality of every downstream decision. Context is not just data; it's the cognitive working memory of the system, and it's finite, expensive, and accumulates garbage.
That single constraint, context as a managed resource, is what makes agent architecture genuinely different from the distributed systems problems you've solved before. It's why the architectural decision that matters most isn't which model you use or how many agents you deploy. It's where you draw the boundaries, what you let each agent see, and how much coordination overhead you're willing to pay for.
This post maps the canonical patterns (single-agent, sequential, concurrent, evaluator-optimizer, and dynamic supervisor) onto Microsoft Foundry, Microsoft Agent Framework, and Microsoft Fabric. For each one: what problem it actually solves, where it fails, and what the implementation looks like in practice.
What Actually Matters
I've seen a lot of teams get this wrong in the same way. They spend three weeks benchmarking GPT versus Claude, pick the winner, and then wire it up with whatever architecture felt intuitive. Six months later they're rewriting it from scratch.
The model is not the hard part. Architecture is the hard part, because once you have a context management problem, a coordination failure, or a token budget blowout at scale, no amount of model-switching fixes it. The structure determines whether the system works in production. The model determines how good the answers are within that structure.
This post is about the structure. Specifically, how the canonical agent design patterns from Anthropic's production research map onto the three Microsoft components you're most likely building on right now: Microsoft Foundry, Microsoft Agent Framework, and Microsoft Fabric.
A quick note on naming before we start
If you've been building on Semantic Kernel or AutoGen, those are now legacy paths. Microsoft Agent Framework is the direct successor to both — same teams, merged into one SDK, with proper workflow orchestration added on top. Microsoft's own docs include migration guides from both. I'll only reference Agent Framework here.
"Azure AI Foundry" is also old. It's just Microsoft Foundry now, with Foundry Agent Service as the runtime at the centre of it.
The three pieces and how they actually fit together
Before the patterns, let me clarify what each piece does and where the seams are — because the marketing descriptions of these products don't make it obvious.
Foundry Agent Service is the runtime. It manages conversation threads server-side, orchestrates tool calls, enforces content safety, handles Entra identity and RBAC, and integrates with Application Insights for tracing. When you deploy an agent here, governance is not something you add later — it's structurally enforced. There's also a hosted agents model in public preview: you bring your own agent code in any framework, containerise it, and Foundry runs it on managed infrastructure.
Microsoft Agent Framework (open source, .NET and Python) is the SDK you write your orchestration logic in. It targets Foundry Agent Service but also supports Azure OpenAI, OpenAI, Anthropic, Ollama, and others via a shared AIAgent base class. Its five built-in orchestration patterns — Sequential, Concurrent, Handoff, Group Chat, Magentic — are what I'll spend most of this post on.
Microsoft Fabric is your data layer. OneLake, Lakehouse, Warehouse, Real-Time Intelligence (Eventhouse/KQL), and semantic models — all under one governance model. What matters for agents is that Fabric has a native Fabric Data Agent that you publish as an endpoint. That endpoint connects directly to Foundry Agent Service as a knowledge source, with identity passthrough: the agent queries Fabric using the end user's credentials, not a service principal. This one detail matters enormously for enterprise governance and you should understand it before you design any data access pattern.
These three don't compete; they form a coherent stack. Foundry is the runtime, Agent Framework is the orchestration SDK, Fabric is the data substrate.
Pattern 1: Single-Agent
The perceive → reason → act loop. One model, one tool set, iterates until done.
Most production use cases fit here. I know that's not what people want to hear — everyone wants to build the cool multi-agent system — but the evidence is pretty consistent: single-agent systems cost less, debug faster, and produce auditable traces that actually mean something to the people who need to read them (compliance teams, regulators, management). The most common mistake I see is jumping to multi-agent architecture before a single well-equipped agent has been properly pushed.
There's a practical ceiling on tool count: around 10–15 function tools before the model starts making poor selection decisions. If you're hitting that ceiling on a single agent, that's a signal — but first check whether you've structured your tools well. Broad, vague tool descriptions are usually the culprit, not agent count.
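To make the tool-description point concrete, here's a hypothetical before-and-after using the standard OpenAI-style function schema (the names and descriptions are invented for illustration). The vague version forces the model to guess what the tool is for; the narrow version makes selection nearly mechanical:

```python
# Hypothetical example: the same capability described vaguely vs. narrowly.
# Vague descriptions force the model to guess; narrow ones make selection easy.
vague_tool = {
    "type": "function",
    "function": {
        "name": "get_data",
        "description": "Gets data about a customer.",
        "parameters": {"type": "object", "properties": {"id": {"type": "string"}}},
    },
}

narrow_tool = {
    "type": "function",
    "function": {
        "name": "get_credit_bureau_report",
        "description": (
            "Fetch the latest credit bureau report for a customer. "
            "Use ONLY for creditworthiness checks; returns score, open accounts, "
            "and delinquencies. Not for AML or KYC screening."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "description": "Internal customer reference, e.g. REF-2025-009341",
                }
            },
            "required": ["customer_id"],
        },
    },
}
```

The description does double duty: it tells the model when to call the tool and, just as importantly, when not to.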
Where it breaks down: genuine cross-domain parallelism, tasks that overflow a context window without clean summarisation, and anything that structurally requires an independent review step on the generated output.
In Foundry, a single agent is built with PersistentAgentsClient. The conversation history is stored server-side per thread — you're not managing it in memory:
import os
from azure.ai.agents.persistent import PersistentAgentsClient
from azure.identity.aio import DefaultAzureCredential
async with DefaultAzureCredential() as credential:
client = PersistentAgentsClient(
os.environ["PROJECT_ENDPOINT"], # https://<resource>.services.ai.azure.com/api/projects/<project>
credential
)
agent = await client.create_agent(
model=os.environ["MODEL_DEPLOYMENT_NAME"],
name="credit-risk-analyst",
instructions="""You are a credit risk analyst.
Use the available tools to evaluate loan applications.
Always complete AML screening before generating a risk verdict.
Flag any regulatory non-compliance explicitly.""",
tools=[
{"type": "function", "function": get_credit_bureau_data},
{"type": "function", "function": query_internal_risk_model},
{"type": "function", "function": check_aml_screening},
]
)
thread = await client.threads.create()
await client.messages.create(
thread_id=thread.id,
role="user",
content="Evaluate REF-2025-009341 for a £250K commercial loan."
)
run = await client.runs.create_and_process(thread_id=thread.id, agent_id=agent.id)
Connecting Fabric here: Instead of writing SQL connectors per agent, publish a Fabric Data Agent over your Lakehouse or Warehouse and register it as a knowledge source in Foundry via workspace ID and artifact ID. The Foundry agent decides at runtime whether the incoming query warrants calling Fabric — and when it does, it passes through the user's own identity. Your credit risk agent querying a customer's transaction history never needs database credentials: it gets exactly the data that user is allowed to see, governed in Fabric, with the lineage tracked there.
# After publishing a Fabric Data Agent from the Fabric portal:
# Foundry portal: Agent Setup → Knowledge → Add → Microsoft Fabric
# Requires FABRIC_WORKSPACE_ID and FABRIC_ARTIFACT_ID from the published endpoint URL:
# https://<env>.fabric.microsoft.com/groups/<workspace_id>/aiskills/<artifact_id>
Pattern 2: Sequential Orchestration
Agents in a defined order. Each one receives the previous agent's output. The control flow is static — you decide it, not the model.
This is the right pattern when your process decomposes into stages with hard linear dependencies: stage 2 cannot start until stage 1 is verified. Document processing pipelines, multi-stage compliance checks, content workflows that go draft → review → publish. The benefit is predictability — you can trace exactly which stage produced what, estimate costs before running, and debug by examining individual stage inputs and outputs.
The trade-off is latency: stages run serially. This is often fine. People underestimate how much quality improves when each model call is narrowly focused rather than handling everything at once.
from agent_framework import ChatAgent
from agent_framework.orchestrations import SequentialBuilder
from agent_framework.azure import AzureChatClient
chat_client = AzureChatClient(
endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
model_id="gpt-4o"
)
# Three agents, three clearly separated responsibilities
extraction_agent = ChatAgent(
name="DocumentExtractor",
description="Extracts structured data from unstructured contract text",
instructions="Extract all financial figures, parties, dates, and obligations. JSON output only. No analysis.",
chat_client=chat_client,
)
analysis_agent = ChatAgent(
name="RiskAnalyst",
description="Analyses extracted contract data for risk indicators",
instructions="Analyse the structured data for financial and operational risk. Apply IFRS 9 standards. Flag anomalies.",
chat_client=chat_client,
)
compliance_agent = ChatAgent(
name="ComplianceChecker",
description="Validates risk analysis against CBUAE regulatory requirements",
instructions="Validate the risk analysis. Produce a compliance verdict: PASS, CONDITIONAL, or FAIL with reasons.",
chat_client=chat_client,
)
workflow = SequentialBuilder() \
.participants([extraction_agent, analysis_agent, compliance_agent]) \
.build()
async for event in workflow.run_stream(contract_text):
print(event)
The extraction agent doesn't reason about compliance. The compliance agent doesn't re-extract data. Each one does one thing well. That separation is what makes each stage reliably auditable.
The Fabric angle for sequential workflows: map stages to the medallion architecture. Extraction writes to Bronze. Analysis reads Bronze, writes enriched data to Silver. Compliance reads Silver, writes its verdict to a certified Gold table — the only layer that downstream Power BI reports and business systems consume. You get data lineage from agent pipeline to governed data product essentially for free.
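A minimal sketch of that stage-to-layer mapping (the table names and the routing helper are my assumptions, not anything the Agent Framework provides — the actual write would happen in your Fabric notebook or pipeline):

```python
# Sketch: map each sequential stage to its medallion-layer Lakehouse table.
# Table names are illustrative assumptions.
MEDALLION_TABLES = {
    "DocumentExtractor": ("bronze", "bronze_contract_extracts"),
    "RiskAnalyst": ("silver", "silver_contract_risk"),
    "ComplianceChecker": ("gold", "gold_compliance_verdicts"),  # certified layer
}

def route_stage_output(agent_name: str, payload: dict) -> dict:
    """Tag a stage's output with its target table and lineage metadata
    before it's written to the Lakehouse."""
    layer, table = MEDALLION_TABLES[agent_name]
    return {
        "layer": layer,
        "table": table,
        "payload": payload,
        "lineage": {"produced_by": agent_name, "pattern": "sequential"},
    }
```

Only the gold table is exposed to downstream consumers; bronze and silver exist for lineage and debugging.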
Pattern 3: Concurrent Orchestration
Multiple agents, same input, running in parallel. Results collected and aggregated when all finish.
There's a rule you have to get right here: concurrent agents must not depend on each other's intermediate output. If Agent B needs Agent A's result to start, that's a sequential dependency disguised as a parallel topology. Forcing it into concurrency introduces race conditions you'll spend weeks debugging. When in doubt, check: could you run these agents in any order and get the same result? If yes, concurrent is valid.
Where this pattern genuinely delivers is multi-dimensional analysis: financial risk assessment across credit, market, and compliance dimensions; content review against multiple independent criteria; voting patterns where you run the same task through N agents and aggregate. The quality improvement from independent perspectives on the same problem is real and measurable.
from agent_framework import WorkflowOutputEvent
from agent_framework.orchestrations import ConcurrentBuilder
credit_agent = ChatAgent(
name="CreditRiskAgent",
description="Assesses creditworthiness and probability of default",
instructions="Evaluate credit risk: debt ratios, payment history, collateral. Output a credit risk score 0-100 with justification.",
chat_client=chat_client,
)
market_agent = ChatAgent(
name="MarketRiskAgent",
description="Evaluates exposure to market volatility and macro factors",
instructions="Assess market risk: sector exposure, rate sensitivity, economic indicators. Output a market risk score 0-100.",
chat_client=chat_client,
)
aml_agent = ChatAgent(
name="AMLAgent",
description="Screens for AML, KYC compliance, and sanctions exposure",
instructions="Screen for AML/KYC issues, sanctions lists, PEP status. Output a compliance risk score 0-100.",
chat_client=chat_client,
)
# All three run simultaneously. Total wall time ≈ slowest single agent.
workflow = ConcurrentBuilder() \
.participants([credit_agent, market_agent, aml_agent]) \
.build()
async for event in workflow.run_stream(loan_application_json):
if isinstance(event, WorkflowOutputEvent):
# Aggregate: weighted score per institutional risk policy
aggregate_risk_score(event.data)
A BFSI example where I've seen this work well in practice: loan underwriting where regulatory requirements mandate independent assessments across risk dimensions. The assessments are genuinely independent. Running them concurrently cuts decision time without compromising the separation requirement.
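The `aggregate_risk_score` call in the snippet above is deliberately left abstract. A minimal sketch of what it might do — the weights here are illustrative, not an actual institutional risk policy:

```python
# Illustrative weights only: a real institution's risk policy defines these,
# and they would typically live in configuration, not code.
RISK_WEIGHTS = {"CreditRiskAgent": 0.5, "MarketRiskAgent": 0.3, "AMLAgent": 0.2}

def aggregate_risk_score(results: dict[str, float]) -> float:
    """Weighted average of per-dimension risk scores (each agent outputs 0-100)."""
    total_weight = sum(RISK_WEIGHTS[name] for name in results)
    return sum(score * RISK_WEIGHTS[name] for name, score in results.items()) / total_weight
```

In practice you'd also want hard overrides here — an AML score above some threshold should fail the application outright regardless of the weighted average.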
Pattern 4: Handoff and Group Chat
Two separate patterns that both address multi-agent collaboration, but with different control philosophies.
Handoff transfers complete task ownership between agents based on context. There's no supervisor — an agent decides for itself to pass the conversation to a peer, and when it does, the receiving agent takes full responsibility. It's inherently interactive: designed for triage and routing scenarios where a first-contact agent identifies the right specialist and exits.
Don't confuse this with agent-as-tools (where a supervisor delegates a subtask and retains control). In Handoff, once the handoff fires, the originating agent is done. The context propagation model in Agent Framework broadcasts user and agent messages to all participants but deliberately excludes tool call contents — worth knowing before you design a workflow that assumes otherwise.
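A toy illustration of that propagation rule — the message shape here is an assumption for demonstration, not the framework's internal type. User and agent messages are broadcast to participants; tool calls and tool results are not:

```python
def broadcast_view(history: list[dict]) -> list[dict]:
    """Toy model of the documented handoff propagation behaviour:
    user and assistant messages are visible to all participants;
    tool results are dropped and tool-call contents are stripped."""
    visible = []
    for m in history:
        if m["role"] not in ("user", "assistant"):
            continue  # tool results are not broadcast
        visible.append({"role": m["role"], "content": m["content"]})  # tool_calls stripped
    return visible

history = [
    {"role": "user", "content": "Reset my card PIN"},
    {"role": "assistant", "content": "Routing you to card services.",
     "tool_calls": [{"name": "handoff_to_cards"}]},
    {"role": "tool", "content": "{'status': 'transferred'}"},
    {"role": "assistant", "content": "Card services here. I can help with that."},
]
```

If your receiving agent needs something that only existed inside a tool result, you have to surface it in a regular agent message before the handoff fires.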
Group Chat is a shared conversation where multiple agents participate in the same thread. Each agent sees the full conversation history and contributes when the orchestrator selects them. Good for research tasks, design reviews, anything where you genuinely benefit from agents building on each other's observations rather than working independently.
Neither of these is inherently "better" than the other. Handoff is cleaner operationally; Group Chat is more flexible. Pick based on whether you need clean ownership transfer or collaborative refinement.
Pattern 5: Magentic — The Dynamic Supervisor
Magentic is Agent Framework's implementation of the supervisor/hierarchical pattern. A manager agent receives the task, builds a plan, selects which specialist to invoke, evaluates progress after each response, replans if needed, and synthesises the final result. The control flow is entirely LLM-driven — the manager decides everything at runtime.
This is the most capable orchestration pattern and also the most expensive and operationally complex. It's genuinely appropriate for open-ended tasks where you cannot predetermine the sequence of steps: competitive intelligence gathering, multi-domain research, incident analysis across systems. For anything where you can predetermine the steps, Sequential is cheaper, faster, and easier to debug. Use Magentic when the problem resists decomposition.
from agent_framework import ChatAgent
from agent_framework.orchestrations import MagenticBuilder
from agent_framework.openai import OpenAIChatClient
# The manager agent is implicit in MagenticBuilder.
# You define the specialists and their descriptions — the manager
# uses those descriptions to decide who to call and when.
researcher = ChatAgent(
name="Researcher",
description="Gathers information, market data, and evidence. Does not draw conclusions.",
instructions="You find and surface relevant information. Present evidence; let the analyst draw conclusions.",
chat_client=OpenAIChatClient(model_id="gpt-4o-search-preview"), # web search enabled
)
analyst = ChatAgent(
name="FinancialAnalyst",
description="Performs quantitative analysis on data. Produces structured numerical outputs.",
instructions="Analyse data quantitatively. Use figures. Avoid speculation without data.",
chat_client=chat_client,
)
strategist = ChatAgent(
name="Strategist",
description="Synthesises research and analysis into recommendations.",
instructions="You produce final recommendations based on evidence from the researcher and analyst.",
chat_client=chat_client,
)
workflow = MagenticBuilder() \
.participants([researcher, analyst, strategist]) \
.build()
One production consideration worth flagging: Magentic includes optional human-in-the-loop at the planning phase and at stall detection points. If you're running this in a regulated environment where someone needs to approve the agent's plan before it executes, that hook is there. Use it.
Pattern 6: Evaluator-Optimizer
Two agents in a loop: one generates, one evaluates, repeat until quality criteria are met. Typically converges in 2–4 cycles.
The pattern is only worth its token cost when three things are true: quality criteria can be made explicit enough for an agent to apply them consistently; the generator is capable of meaningfully incorporating feedback (not just acknowledging it); and the business value of a higher-quality output justifies 3–5× the token spend of a single pass.
For contract clause generation, security-sensitive code, regulatory disclosures, and technical documentation — where precise wording genuinely matters and errors have downstream consequences — this pays for itself. For anything where first-pass quality is already acceptable, it doesn't.
async def evaluator_optimizer_loop(
generator_agent_id: str,
evaluator_agent_id: str,
task: str,
max_iterations: int = 4,
quality_threshold: float = 0.85
) -> str:
current_output = ""
feedback = ""
for iteration in range(max_iterations):
gen_prompt = (
task if iteration == 0
else f"{task}\n\nPrevious output:\n{current_output}\n\nEvaluator feedback:\n{feedback}\n\nRevise."
)
gen_thread = await client.threads.create()
await client.messages.create(thread_id=gen_thread.id, role="user", content=gen_prompt)
gen_run = await client.runs.create_and_process(thread_id=gen_thread.id, agent_id=generator_agent_id)
current_output = extract_last_message(gen_run)
eval_prompt = f"""Evaluate this contract clause against:
1. Legal precision under UAE law
2. No ambiguous terms
3. DIFC compliance
4. Risk exposure for the issuing party
Clause: {current_output}
Return JSON only: {{"score": 0.0-1.0, "feedback": "...", "approved": true/false}}"""
eval_thread = await client.threads.create()
await client.messages.create(thread_id=eval_thread.id, role="user", content=eval_prompt)
eval_run = await client.runs.create_and_process(thread_id=eval_thread.id, agent_id=evaluator_agent_id)
result = parse_json(extract_last_message(eval_run))
if result["approved"] or result["score"] >= quality_threshold:
print(f"Converged at iteration {iteration + 1} — score: {result['score']}")
break
feedback = result["feedback"]
return current_output
Write every iteration to a Fabric Lakehouse table: generator output, evaluator score, feedback, cycle number, and final verdict. After a few months of production data, you'll have a clear picture of which task types consistently require 3–4 cycles versus converge on the first pass. That data directly informs where to invest in prompt engineering and where your generator is reliably good enough.
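The analysis that log enables is simple once the rows exist. A sketch of the convergence query in plain Python (field names are my assumptions; in practice this would be a query over the Lakehouse table):

```python
from collections import defaultdict

def cycles_to_converge(records: list[dict]) -> dict[str, float]:
    """Average number of cycles before evaluator approval, grouped by task type.
    Each record is one evaluator-optimizer iteration with fields:
    task_type, cycle (1-based), approved (bool)."""
    cycles = defaultdict(list)
    for r in records:
        if r["approved"]:
            cycles[r["task_type"]].append(r["cycle"])
    return {task: sum(c) / len(c) for task, c in cycles.items()}
```

Task types that consistently converge on cycle 1 are candidates for dropping the evaluator entirely; those stuck at 3–4 cycles are where prompt engineering effort pays off.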
Which pattern to use: three questions, not a matrix
I've seen enough pattern-selection matrices that don't survive first contact with a real requirement. So here are the three questions that actually resolve most decisions:
Can you write down the steps before you run it? If yes, Sequential or Concurrent. If no, Magentic or Group Chat. This single question eliminates 80% of the ambiguity.
Do the subtasks depend on each other's outputs? If yes and they're linear, Sequential. If yes and the dependency structure is complex and dynamic, Magentic. If no, Concurrent.
What does it cost to get this wrong? High-stakes regulated outputs (loan approvals, compliance verdicts, contract clauses) should start simple — single agent or sequential — because you need a trace that a compliance officer can follow. Research and analysis tasks with lower consequence have more room for collaborative/dynamic patterns.
Multi-agent systems consume roughly 10–15× more tokens than equivalent single-agent ones. That's not a reason to avoid them — it's a reason to be intentional about when you deploy them.
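The three questions above are mechanical enough to encode. This is a toy starting point, not a rulebook — real selection involves judgment the function can't capture:

```python
def choose_pattern(steps_known: bool, dependencies: str, high_stakes: bool) -> str:
    """Toy encoding of the three selection questions.
    dependencies: 'none', 'linear', or 'dynamic'."""
    if high_stakes:
        # Regulated outputs need a trace a compliance officer can follow.
        return "single-agent or sequential"
    if not steps_known or dependencies == "dynamic":
        return "magentic or group chat"
    if dependencies == "linear":
        return "sequential"
    return "concurrent"
```

Note the ordering: the stakes question dominates. A high-stakes task with a genuinely dynamic structure is a reason to decompose the task, not a reason to reach for Magentic.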
Observability is not a feature you add later
Every pattern above will eventually produce a failure mode you didn't anticipate. An agent that consistently makes the wrong tool selection in a specific context. A Magentic manager that replans unnecessarily on a particular class of input. A concurrent workflow where one agent's output systematically degrades the aggregated result.
A stack trace won't tell you any of that. You need visibility into the full reasoning path: what was in the context window, which tool was selected and why, what the intermediate outputs looked like, where the orchestration branched.
Foundry Agent Service emits run-level traces via OpenTelemetry to Application Insights — individual runs, tool call sequences, token consumption per step. Agent Framework adds structured telemetry on top. For multi-agent orchestrations, you need correlation IDs that span the entire workflow, not just individual runs.
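One way to get that workflow-spanning correlation, sketched with a contextvar — this is an assumption about how you'd wire it in your own orchestration code, not a built-in Agent Framework feature:

```python
import contextvars
import uuid

# One ID minted per orchestration, visible to every run within it.
workflow_id: contextvars.ContextVar[str] = contextvars.ContextVar("workflow_id")

def start_workflow() -> str:
    """Mint a single correlation ID for the whole orchestration, not per run."""
    wid = str(uuid.uuid4())
    workflow_id.set(wid)
    return wid

def telemetry_record(agent_name: str, run_id: str, **fields) -> dict:
    """Every agent run's telemetry carries the same workflow-level ID,
    so runs can be joined back into one orchestration trace."""
    return {"workflow_id": workflow_id.get(), "agent": agent_name,
            "run_id": run_id, **fields}
```

If you're already on OpenTelemetry, the equivalent is setting the workflow ID as a span attribute on a root span that parents every agent run.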
Ship that telemetry to Fabric Real-Time Intelligence (Eventhouse). Store it. Query it:
// Agent runs that are candidates for investigation: high token spend or repeated retries
AgentRunLogs
| where timestamp > ago(24h)
| where totalTokens > 15000 or retryCount > 2
| summarize
avgTokens = avg(totalTokens),
p95Tokens = percentile(totalTokens, 95),
failureRate = countif(status == "failed") * 100.0 / count()
by agentName, orchestrationPattern, bin(timestamp, 1h)
| order by failureRate desc
The KQL is not the point. The point is that once your agent telemetry lives in Fabric alongside your operational data, you can correlate agent behaviour with business outcomes — which is the question that eventually matters, not just whether the agent ran successfully.
The reference architecture
This is the setup I keep coming back to for enterprise deployments. Clean separation between orchestration, tool, and data layers.
Two decisions baked into this that are worth making explicitly:
Agents don't hold database credentials. They access data through published Fabric Data Agents, which means access control lives in Fabric and travels with the user's identity. When a user queries through an agent, they see what they're permitted to see — not what the service principal is permitted to see. This is not a minor governance detail.
Telemetry flows to Fabric, not just to Application Insights. App Insights is where you debug individual runs. Fabric is where you understand system behaviour over time, correlate with business data, and build the improvement loop that makes the system better over months.
What I'd actually do if I were starting today
Start with a single Foundry agent. Wire up a Fabric Data Agent as its primary knowledge source for structured data. Get it working well. Measure.
Add Agent Framework orchestration when — and only when — you have specific evidence that the single-agent pattern is hitting a structural ceiling: context overflow on a particular task class, quality degradation that traces back to the agent trying to handle too many independent concerns at once, or a genuine parallelism requirement.
The teams I've seen do this well shared one habit: they were suspicious of complexity. They pushed single agents further than felt comfortable, proved the ceiling, then made the architectural move. The teams that wasted months did the opposite: they designed an elegant multi-agent system on a whiteboard, then spent six months debugging coordination failures.
The patterns are all available. Use the simplest one that works.