Azure Architecture Domain
Architecture is the structural design layer that governs how cloud services are assembled into coherent, reliable, and cost‑effective systems. In Azure, architecture is not a service or a pattern library—it is the decision‑making discipline that shapes every workload. This page defines architecture as a first‑class cloud domain, explaining its role as the “thinking layer” above compute, storage, networking, security, data, and AI. It connects architectural thinking to real trade‑offs and is designed to be reused across multiple certification paths.
1. Overview
What Is Cloud Architecture in Azure
Cloud architecture is the practice of designing distributed systems that run on the Azure platform. It is the conceptual framework that decides how services are selected, composed, scaled, and operated. Architecture translates business requirements, budget constraints, and quality targets (scalability, availability, security, latency) into a concrete structure of Azure resources, interactions, and governance policies.
Why Architecture Is the Highest Abstraction Layer
While compute, storage, and networking are tangible domains that provide primitive capabilities, architecture sits above them. It does not execute code, store data, or forward packets; instead, it determines how those pieces fit together. A well‑architected system integrates identity flows, cost guardrails, security perimeters, and observability from the start. Without architecture, services become a disjointed collection that fails under real‑world conditions. Architecture is the strategic layer that makes the cloud predictable, secure, and maintainable.
Architecture as System Design + Decision Logic + Trade-off Management
Architecture is a continuous process of:
- System design: decomposing a problem into components and defining their interactions.
- Decision logic: choosing between competing alternatives (serverless vs. Kubernetes, SQL vs. NoSQL, synchronous vs. asynchronous) based on empirical evaluation.
- Trade‑off management: accepting that no design is optimal in all dimensions; architecture explicitly balances cost, complexity, performance, security, and operational overhead.
In Azure, architecture is codified by frameworks like the Well‑Architected Framework, but the core skill is the ability to reason across domains and justify choices with data and principles.
2. Architecture Domain Scope
What Architecture Is Responsible For
Architecture owns the structural integrity of the system. Its responsibilities include:
- Defining the high‑level component topology (services, boundaries, communication patterns).
- Allocating quality attributes (reliability, security, performance, cost) across components.
- Enforcing consistent design principles (least privilege, idempotency, statelessness, loose coupling).
- Governing identity, network segmentation, and data flow between subsystems.
- Establishing the operational model: deployment pipelines, monitoring, disaster recovery.
- Anticipating change: ensuring that the system can evolve without a complete rewrite.
What Architecture Is NOT
Architecture is not:
- A list of services: naming that you use AKS, Cosmos DB, and Front Door is not architecture; it is a bill of materials.
- Infrastructure as Code alone: IaC is the implementation of an architectural decision, not the decision itself.
- A single pattern: event‑driven, microservices, or CQRS are patterns; architecture is the selection and combination of patterns appropriate for the context.
- A static diagram: real architecture evolves with the system, informed by live metrics and incident analysis.
Relationship with Azure Domains
Architecture is the integrator of the five core Azure domains:
- Identity: architecture defines trust boundaries and how identity propagates across services.
- Compute: architecture selects the execution models (serverless, containerised, virtualised) for each workload.
- Storage: architecture designs data models, consistency requirements, and access patterns.
- Networking: architecture defines segmentation, traffic routing, and security perimeters (e.g., hub‑spoke, private endpoints).
- AI/Data: architecture governs the flow of data into models, the orchestration of agents, and the safe exposure of AI capabilities.
Architecture does not replace these domains; it coordinates them into a unified system.
3. Core Architecture Layers
Structural Layer (System Decomposition)
This layer describes how a system is broken into manageable parts. It defines:
- Bounded contexts (domain‑driven design): each context owns a distinct business capability and its data.
- Service boundaries: synchronous APIs, asynchronous message channels, and event streams.
- Deployment units: VMs, containers, functions, and their dependencies.
- Hierarchical organisation: management groups, subscriptions, and resource groups that reflect operational and governance boundaries.
A good structural layer reduces coupling and allows independent evolution.
Decision Layer (Trade-offs & Frameworks)
Architecture is an exercise in decision‑making under constraints. The decision layer includes:
- Decision matrices: evaluating options against criteria (e.g., consistency vs. latency for databases).
- Trade‑off analysis: recognising that high availability often increases cost, and deep security may reduce developer velocity.
- Frameworks: the Well‑Architected Framework (pillars of reliability, security, cost optimisation, operational excellence, performance efficiency) provides a structured way to evaluate alternatives.
- Decision records: lightweight documents that capture context, options, rationale, and consequences, ensuring institutional memory.
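The decision matrix and trade‑off analysis above can be made explicit in a few lines. The following is a minimal weighted‑scoring sketch; the options, criteria, weights, and 1-5 scores are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    scores: dict[str, int]  # criterion -> score on a 1-5 scale, higher is better


def rank_options(options: list[Option], weights: dict[str, float]) -> list[tuple[str, float]]:
    """Return (option name, weighted score) pairs, best first."""
    total_weight = sum(weights.values())
    ranked = []
    for option in options:
        # Weighted average keeps the result on the same 1-5 scale as the inputs.
        score = sum(weight * option.scores[criterion] for criterion, weight in weights.items())
        ranked.append((option.name, round(score / total_weight, 2)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical example: choosing a compute platform for a bursty, stateless API.
weights = {"cost": 3, "operational_overhead": 2, "latency": 1}
options = [
    Option("serverless", {"cost": 5, "operational_overhead": 5, "latency": 3}),
    Option("kubernetes", {"cost": 3, "operational_overhead": 2, "latency": 5}),
]
for name, score in rank_options(options, weights):
    print(f"{name}: {score}")
```

Recording the weights and scores in the decision record, not just the winning option, lets a later reviewer see which assumption to revisit when constraints change.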
Pattern Layer (Reusable Solutions)
Patterns are reusable architectural solutions to recurring problems; many are documented with Azure‑specific guidance in the Azure Architecture Center's catalogue of cloud design patterns. Examples include:
- Strangler Fig for migrating monoliths.
- Event‑driven architecture for decoupling producers and consumers.
- CQRS and Event Sourcing for systems whose read and write workloads differ sharply, or that need a complete, replayable history of state changes.
- Gateway Routing and Backends for Frontends for API management.
- Sidecar and Ambassador for cross‑cutting concerns in containerised environments.
Patterns are not prescriptive; architecture selects and adapts them.
Quality Attributes Layer (Reliability, Scalability, Cost, Security)
Every system has non‑functional requirements that architecture must satisfy:
- Reliability: defined by RTO/RPO targets, fault isolation, and self‑healing mechanisms.
- Scalability: elastic horizontal scaling, partitioning strategies, and throttling.
- Cost: balancing reserved vs. consumption models, optimising idle resources, and using spot/transient compute.
- Security: defence in depth, Zero Trust, identity as the perimeter, encryption, and compliance.
- Operability: logging, metrics, alerting, and runbook automation.
Architecture translates these abstract qualities into concrete design choices (e.g., Availability Zone deployment, autoscale rules, RBAC scoping).
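Availability arithmetic shows how a reliability target becomes a concrete design choice. Components called in series multiply their availabilities, while redundant replicas (for example, one per Availability Zone) fail only if every replica fails, assuming failures are independent. The helper names and figures below are illustrative, not published Azure SLAs.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Every component in the chain must be up, so availabilities multiply."""
    return prod(availabilities)


def redundant(availability: float, replicas: int) -> float:
    """At least one of N independent replicas must be up."""
    return 1 - (1 - availability) ** replicas


# Illustrative figures only: an app tier at 99.9% and a database at 99.99%.
single_instance = serial(0.999, 0.9999)
zone_redundant = serial(redundant(0.999, 3), 0.9999)
print(f"single app instance: {single_instance:.4%}")   # 99.8900%
print(f"3 zonal app replicas: {zone_redundant:.4%}")   # 99.9900%
```

The independence assumption is the weak point: a shared dependency or a bad deployment rolled out to every zone causes correlated failures that this arithmetic ignores, which is why fault isolation matters as much as replica count.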
4. Key Architecture Knowledge Areas
Design Principles
Universal principles that guide architecture on Azure:
- Design for failure: every component can fail; build redundancy and graceful degradation.
- Least privilege: identities and services should have no more access than necessary.
- Idempotency: operations must be safe to retry, which is critical in distributed systems where timeouts and at‑least‑once message delivery make duplicate requests routine.
- Loose coupling: components communicate through well‑defined contracts, not shared state.
- Automate everything: infrastructure, testing, deployment, and scaling.
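Idempotency is commonly implemented with an idempotency key: the caller attaches a unique key to each logical operation, and the service stores the result so that a retried request returns the stored result instead of repeating the side effect. This sketch uses an in‑memory dictionary as a stand‑in for a durable store; `PaymentService` and its method are hypothetical.

```python
import uuid


class PaymentService:
    """Hypothetical service that applies each charge at most once per idempotency key."""

    def __init__(self) -> None:
        # idempotency key -> stored result; a durable store (e.g. a database table) in practice
        self._results: dict[str, str] = {}
        self.charges_applied = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A retry with a key we have already seen returns the original result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_applied += 1  # the side effect runs only for the first delivery
        receipt = f"receipt-{amount_cents}-{uuid.uuid4().hex[:8]}"
        self._results[idempotency_key] = receipt
        return receipt


service = PaymentService()
key = str(uuid.uuid4())             # generated once by the caller per logical operation
first = service.charge(key, 1_000)
retry = service.charge(key, 1_000)  # e.g. the client timed out and sent the request again
print(first == retry, service.charges_applied)  # True 1
```

In a real service the check and the write must be atomic (for example, a conditional insert keyed on the idempotency key); otherwise two concurrent retries can both pass the check and both apply the charge.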
Decision Frameworks
Structured methods for evaluating options:
- Azure's own Well‑Architected Framework pillars, applied as review criteria (conceptually similar to the AWS Well‑Architected Framework, but with Azure‑specific guidance).
- Cost‑benefit analysis that includes operational overhead, not just resource charges.
- Suitability models: mapping workload characteristics (predictable vs. bursty, stateless vs. stateful) to service families.
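A suitability model can start as a simple rule table that maps workload characteristics to a shortlist of service families, which then goes through the decision matrix. The rules below are a hypothetical simplification for illustration, not Microsoft guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    load_profile: str   # "predictable" or "bursty"
    stateful: bool
    long_running: bool  # runs longer than a typical HTTP request


def candidate_compute(w: Workload) -> list[str]:
    """Hypothetical rules that return a shortlist for evaluation, not a final decision."""
    if w.stateful and w.long_running:
        return ["Azure Virtual Machines", "Azure Kubernetes Service"]
    if w.long_running:
        return ["Azure Container Apps jobs", "Durable Functions"]
    if w.load_profile == "bursty" and not w.stateful:
        return ["Azure Functions", "Azure Container Apps"]
    return ["Azure App Service", "Azure Container Apps"]


print(candidate_compute(Workload("bursty", stateful=False, long_running=False)))
```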
Architectural Patterns
In addition to the pattern layer, architects must recognise when a pattern becomes an anti‑pattern on Azure (e.g., “lift‑and‑shift” without refactoring, or synchronous chains across many services that cause cascading failures). This knowledge area includes pattern catalogues (e.g., the Azure Architecture Center) and the discipline to adapt patterns rather than copy them.
Well‑Architected Framework
The Microsoft Azure Well‑Architected Framework (WAF) provides a consistent approach to evaluate workloads across five pillars:
- Reliability
- Security
- Cost Optimisation
- Operational Excellence
- Performance Efficiency
Architecture uses WAF as a lens through which to review designs, identify risks, and track improvements over time.
Reference Architectures
Azure provides tested, documented reference architectures for common scenarios (web applications, microservices, data lakes, AI inference). Architecture treats these as starting points, not blueprints. The skill is in customising them to the specific constraints of the organisation while preserving their validated characteristics.
5. Architecture vs Services vs Domains
A frequent point of confusion is the difference between these three concepts.
- Domains represent WHAT capabilities exist: compute, storage, networking, identity, data/AI. They group services by function but do not describe how they interact.
- Services are HOW those capabilities are implemented: Azure VM, Azure Functions, Cosmos DB, Azure OpenAI. They are the concrete building blocks.
- Architecture is HOW EVERYTHING IS DESIGNED TOGETHER. It decides which services to use, how they communicate, how identity flows, and how the system handles failure.
For example, the “compute domain” tells you that you can run code. The service “Azure Kubernetes Service” tells you it provides container orchestration. Architecture tells you to use AKS for a specific set of microservices, to isolate stateful workloads, to configure the cluster autoscaler, to integrate with Microsoft Entra Workload ID (the successor to the deprecated Azure AD Pod Identity), and to restrict network egress via Azure Firewall.
Architecture is the integrator; domains and services are the raw materials.
6. Architecture in AI & Agent Systems
Modern AI workloads demand robust architecture. The dynamic nature of LLMs, tools, and agent loops introduces unique structural challenges.
LLM System Architecture
An LLM‑based application is not just a call to an API; it is a pipeline:
- Orchestration layer: compute (App Service, Container Apps) that manages prompts, context windows, and response streaming.
- Memory and state: external storage (Redis, Cosmos DB) for conversation history and long‑term memory.
- Content safety: explicit moderation services (Azure AI Content Safety) and architectural guardrails that validate inputs and outputs.
Architecture ensures that the LLM’s non‑deterministic nature is bounded and that the system degrades gracefully when the model is unavailable.
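Two of these responsibilities, keeping conversation history inside the context window and degrading gracefully when the model is unavailable, can be sketched without any SDK. Token counts are approximated by word counts, which is an assumption (production code would use the model's tokenizer), and `call_model` is a hypothetical stand‑in for the real model client.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def approx_tokens(text: str) -> int:
    # Assumption: roughly one token per whitespace-separated word.
    return len(text.split())


def fit_history(system: Message, history: list[Message], budget: int) -> list[Message]:
    """Keep the system prompt plus the newest messages that fit within the token budget."""
    used = approx_tokens(system["content"])
    kept: list[Message] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = approx_tokens(message["content"])
        if used + cost > budget:
            break  # older messages are dropped here; a fuller design might summarise them
        kept.append(message)
        used += cost
    return [system] + kept[::-1]  # restore chronological order


FALLBACK = "The assistant is temporarily unavailable. Please try again shortly."


def answer(call_model: Callable[[list[Message]], str], messages: list[Message]) -> str:
    """Degrade gracefully: return a fixed reply instead of surfacing a model outage."""
    try:
        return call_model(messages)
    except (TimeoutError, ConnectionError):
        return FALLBACK
```

Which exceptions signal an outage depends on the client library actually used; catching the two built‑in connection errors here is a simplification.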
RAG Architecture Design
Retrieval‑Augmented Generation (RAG) introduces a retrieval pipeline before the LLM:
- Ingestion path: compute (for example, Azure Functions) that chunks documents, generates embeddings, and indexes them in Azure AI Search.
- Query path: at runtime, the user’s query is embedded, relevant documents are retrieved, and they are injected into the prompt.
- Access control: architecture must propagate the user’s identity to the search index so that only authorised documents are retrieved, preventing data leakage.
This is a classic multi‑domain design: compute (ingestion and query serving), storage (documents), AI (embedding model), and identity (user‑scoped search).
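In Azure AI Search, this is usually done with security trimming: each indexed chunk carries a field listing the groups allowed to read it, and every query adds a filter built from the caller's group memberships, taken from the validated token rather than from client input. The sketch below only builds the filter string; the field name `group_ids` is an assumed index schema, and the expression follows the `search.in` pattern described in the Azure AI Search documentation on security filters.

```python
def security_filter(user_group_ids: list[str], field: str = "group_ids") -> str:
    """Build an OData filter matching documents shared with any of the user's groups."""
    if not user_group_ids:
        # A user with no groups should not query the index at all, rather than query unfiltered.
        raise PermissionError("caller has no group memberships")
    for gid in user_group_ids:
        # IDs come from the validated token; reject anything that could break out of the literal.
        if "'" in gid or "," in gid:
            raise ValueError(f"unexpected character in group id: {gid!r}")
    joined = ",".join(user_group_ids)
    return f"{field}/any(g: search.in(g, '{joined}', ','))"


print(security_filter(["a1b2", "c3d4"]))
# group_ids/any(g: search.in(g, 'a1b2,c3d4', ','))
```

Because the service applies the filter before returning results, chunks the user may not read never reach the prompt.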
Agent System Architecture
AI agents (autonomous loops that plan and execute tool calls) require an architectural treatment that considers:
- Agent runtime: a stateful compute service (Container Apps with Dapr, Durable Functions) that maintains the agent’s execution state across multiple tool invocations.
- Tool sandboxing: each tool the agent can call must run under a scoped identity (user‑assigned managed identity or OBO token) that limits blast radius. The agent’s core orchestration identity has minimal permissions.
- Authorisation context: the architecture must distinguish between the agent’s own identity and the user’s identity it is acting on behalf of. OAuth 2.0 on‑behalf‑of flows or custom token exchange enforce this.
- Observability: every tool call, prompt, and reasoning step must be logged for auditability and cost tracking.
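Tool sandboxing and observability can be enforced at a single choke point: a registry that checks each call against the scopes granted to the identity it runs under and writes an audit record whether or not the call is allowed. Everything here is hypothetical, including the scope names and the in‑memory audit log; on Azure the scopes would correspond to role assignments on managed identities or the scopes in an on‑behalf‑of token.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    func: Callable[..., Any]
    required_scope: str  # hypothetical scope name the calling identity must hold


@dataclass
class ToolRegistry:
    tools: dict[str, Tool] = field(default_factory=dict)
    audit_log: list[str] = field(default_factory=list)  # stand-in for a real log sink

    def register(self, name: str, func: Callable[..., Any], required_scope: str) -> None:
        self.tools[name] = Tool(func, required_scope)

    def invoke(self, name: str, granted_scopes: set[str], **kwargs: Any) -> Any:
        tool = self.tools.get(name)
        allowed = tool is not None and tool.required_scope in granted_scopes
        # Every attempt is logged, including denied ones and unknown tool names.
        self.audit_log.append(json.dumps(
            {"ts": time.time(), "tool": name, "args": kwargs, "allowed": allowed}))
        if not allowed:
            raise PermissionError(f"tool {name!r} is not permitted for this identity")
        return tool.func(**kwargs)


registry = ToolRegistry()
registry.register("search_kb", lambda query: f"results for {query}", "kb.read")
registry.register("delete_record", lambda record_id: f"deleted {record_id}", "records.write")

agent_scopes = {"kb.read"}  # the agent's own identity is read-only
print(registry.invoke("search_kb", agent_scopes, query="refund policy"))
```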
Multi‑Agent Orchestration Design
When multiple agents collaborate, architecture addresses:
- Communication patterns: direct agent‑to‑agent, broker‑mediated, or event‑driven.
- State partitioning: how tasks are assigned and how agents share context.
- Consistency boundaries: each agent may have its own view of the world; eventual consistency and conflict resolution must be designed.
- Security boundaries: each agent and its tools must operate within an independent identity scope to prevent cross‑contamination.
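One way to resolve conflicting updates to shared context is optimistic concurrency: each write states the version it read, and the store rejects the write if another agent has changed the record since. Azure Cosmos DB supports this pattern through ETags and conditional requests; the in‑memory `SharedContext` class below is a hypothetical stand‑in that shows only the control flow.

```python
from dataclasses import dataclass


class ConflictError(Exception):
    """Raised when a write is based on a stale version."""


@dataclass
class Versioned:
    value: dict
    version: int


class SharedContext:
    """Hypothetical shared task context with compare-and-set writes."""

    def __init__(self) -> None:
        self._items: dict[str, Versioned] = {}

    def read(self, key: str) -> Versioned:
        item = self._items.get(key, Versioned({}, 0))
        return Versioned(dict(item.value), item.version)  # a copy, so callers cannot mutate the store

    def write(self, key: str, value: dict, expected_version: int) -> int:
        current = self._items.get(key, Versioned({}, 0)).version
        if current != expected_version:
            raise ConflictError(f"{key}: expected v{expected_version}, found v{current}")
        self._items[key] = Versioned(dict(value), current + 1)
        return current + 1


def update_with_retry(store: SharedContext, key: str, name: str, val: str, attempts: int = 3) -> int:
    """On conflict, re-read and re-apply the change instead of overwriting another agent's work."""
    for _ in range(attempts):
        snapshot = store.read(key)
        snapshot.value[name] = val
        try:
            return store.write(key, snapshot.value, snapshot.version)
        except ConflictError:
            continue
    raise ConflictError(f"{key}: gave up after {attempts} attempts")


store = SharedContext()
planner_view = store.read("task-7")                          # planner agent reads version 0
update_with_retry(store, "task-7", "status", "researching")  # researcher agent writes version 1
try:
    store.write("task-7", {"status": "planning"}, planner_view.version)  # stale write
except ConflictError as err:
    print("rejected:", err)  # rejected: task-7: expected v0, found v1
```

Re‑reading and re‑applying the change, rather than blindly overwriting, is what keeps one agent from silently erasing another agent's work.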
AI System Trade‑offs
AI systems introduce acute trade‑offs:
- Latency vs. accuracy: a chain of multiple model calls and retrievals increases latency; architecture may use caching, speculative execution, or tiered models (small model for fast path, large model for complex queries).
- Cost vs. quality: larger models (for example, GPT‑4‑class models) cost considerably more per token; architecture can route simple queries to lighter models.
- Autonomy vs. safety: agents that can execute destructive actions (delete a record) need a human‑in‑the‑loop or approval step. Architecture defines the decision points and integration with human workflows.
- Stateless vs. stateful agents: stateless agents scale easily but lose context; stateful agents provide continuity but require session affinity or external state stores. Architecture chooses the right balance.
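The tiered‑model idea can be sketched as a router that sends short, simple queries to a smaller model and escalates everything else. The word‑count threshold, the keyword list, and the tier names are hypothetical; real routers often use a trained classifier or the small model's own confidence instead of keywords.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    tier: str
    reason: str


# Hypothetical words and phrases suggesting multi-step reasoning is needed.
COMPLEX_HINTS = ("compare", "why", "explain", "step by step", "plan")


def route_query(query: str, max_simple_words: int = 25) -> Route:
    """Send short, simple queries to the small tier; escalate long or complex ones."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    padded = f" {' '.join(words)} "  # pad so hints match whole words only
    if len(words) > max_simple_words:
        return Route("large-model", "long query")
    if any(f" {hint} " in padded for hint in COMPLEX_HINTS):
        return Route("large-model", "complexity hint")
    return Route("small-model", "default fast path")


print(route_query("What are your opening hours?").tier)                     # small-model
print(route_query("Explain why my invoice total changed this month").tier)  # large-model
```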
7. Certification Mapping
Architecture as a domain appears across Azure and related certifications, at different depths:
| Certification | Architectural Expectation |
|---|---|
| AZ‑104 | Understand operational architecture: how VMs, virtual networks, and storage accounts are connected and secured. Execute design decisions provided by an architect. |
| AZ‑305 | Design and evaluate cloud architectures end‑to‑end. Select services, plan for HA/DR, optimise cost, and ensure security across a solution. |
| AI‑900 | Grasp the basic architecture of AI solutions: how AI services fit into applications, the role of compute and storage in AI pipelines. |
| AI‑103 | Design architecture for AI applications: host LLM‑based APIs, implement RAG pipelines, integrate identity, and manage scaling for AI inference. |
| AI‑300 | Architect production‑grade ML systems: training infrastructure, model registries, deployment pipelines, monitoring, and governance. |
| GH‑600 | Architect autonomous agent systems: secure tool execution, identity delegation, state management, and observability for agents operating in GitHub‑ecosystem environments. |
8. Real‑world Architecture Thinking Example
Designing a Scalable AI System
Scenario: Build a customer support platform that uses AI to answer questions and, with user permission, perform account operations.
Step 1 – Decompose the system:
- User interface: web and mobile clients.
- API gateway: Azure API Management for throttling, authentication, and routing.
- Core services:
  - Query service: handles chat sessions, calls Azure OpenAI, and performs RAG.
  - Action service: executes account operations (change address, cancel subscription) on behalf of the user.
  - Notification service: sends emails/SMS after actions.
- Data: Azure Cosmos DB (session state), Azure SQL (account data), Azure AI Search (knowledge base index).
Step 2 – Choose patterns:
- The user‑facing API uses Backends for Frontends: separate API surfaces for web and mobile.
- The communication between query service and action service is asynchronous via Service Bus, ensuring that the AI agent never directly updates account records.
- The action service uses the Saga pattern to coordinate multi‑step account changes across different microservices.
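A minimal orchestration‑style saga runs each step in order and, if a step fails, runs the compensating actions of the completed steps in reverse. The step names echo the scenario but are hypothetical; in the described design the steps would be triggered through Service Bus and the saga state persisted so it survives restarts, both of which this in‑memory sketch omits.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]  # undoes the action's business effect


def run_saga(steps: list[SagaStep], log: list[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            log.append(f"failed {step.name}: {exc}")
            for done in reversed(completed):
                done.compensate()  # compensations should themselves be idempotent and retryable
                log.append(f"compensated {done.name}")
            return False
        completed.append(step)
        log.append(f"done {step.name}")
    return True


def noop() -> None:
    pass  # placeholder for a real service call


def billing_down() -> None:
    raise RuntimeError("billing service unavailable")


log: list[str] = []
ok = run_saga([
    SagaStep("cancel_subscription", noop, noop),
    SagaStep("issue_refund", noop, noop),
    SagaStep("update_billing", billing_down, noop),
], log)
print(ok)
print(log)
```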
Step 3 – Apply identity and security:
- The query service authenticates the user and receives a token. When the AI agent decides to invoke an action, the architecture requires an explicit user approval step (a user‑facing confirmation UI). Only after the user authorises the action does the action service obtain an on‑behalf‑of token to call downstream APIs.
- The agent runtime (Container Apps) has a managed identity that can read from AI Search but has no direct access to account databases. This identity boundary ensures that even if the agent is manipulated, it cannot execute privileged operations without a user’s explicit consent.
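The approval requirement can be enforced in code by binding each approval to the exact action it covers, so an approval for one change cannot be reused for another. In this hypothetical sketch the agent can only propose actions, and the action service executes one only when it receives a matching, unexpired approval. The HMAC key, field names, and five‑minute expiry are assumptions; in practice the key would come from a secret store such as Azure Key Vault.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

APPROVAL_KEY = b"example-only-load-from-a-secret-store"  # hypothetical; never hard-code a real key
APPROVAL_TTL_SECONDS = 300


def action_digest(action: dict) -> str:
    # Canonical JSON so identical actions always produce the same digest.
    return hashlib.sha256(json.dumps(action, sort_keys=True).encode()).hexdigest()


def _sign(user_id: str, action: dict, issued: float) -> str:
    payload = f"{user_id}|{action_digest(action)}|{issued}"
    return hmac.new(APPROVAL_KEY, payload.encode(), hashlib.sha256).hexdigest()


def approve(action: dict, user_id: str, now: Optional[float] = None) -> dict:
    """Called by the confirmation UI's backend after the user explicitly confirms."""
    issued = time.time() if now is None else now
    return {"user_id": user_id, "issued": issued, "sig": _sign(user_id, action, issued)}


def execute_if_approved(action: dict, approval: dict, now: Optional[float] = None) -> bool:
    """Action service: refuse anything without a valid, unexpired approval for this exact action."""
    now = time.time() if now is None else now
    if now - approval["issued"] > APPROVAL_TTL_SECONDS:
        return False
    expected = _sign(approval["user_id"], action, approval["issued"])
    if not hmac.compare_digest(expected, approval["sig"]):
        return False
    # Only here would the service obtain an on-behalf-of token and call downstream APIs.
    return True


cancel = {"op": "cancel_subscription", "account": "A-100"}
approval = approve(cancel, "user-1", now=1000.0)
print(execute_if_approved(cancel, approval, now=1010.0))                          # True
print(execute_if_approved({**cancel, "account": "A-999"}, approval, now=1010.0))  # False
```

A production version would also record each approval as used, so the same approval cannot be replayed for a second execution within its expiry window.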
Step 4 – Evaluate trade‑offs:
- Performance vs. safety: Adding a user approval step increases latency but prevents unauthorised actions. The architecture accepts this trade‑off and keeps the confirmation step lightweight (for example, a single confirmation that summarises the pending change) to minimise friction.
- Cost vs. scalability: The RAG pipeline and LLM calls are the main cost drivers. Architecture introduces a semantic cache (Azure Cache for Redis) that stores the embeddings of previous queries together with their answers, so a sufficiently similar new query is answered from the cache instead of triggering another expensive LLM call.
- Operational complexity vs. resilience: Instead of a single monolithic agent service, the design uses microservices. This increases complexity but allows independent scaling and failure isolation.
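A semantic cache differs from an exact‑match cache: it stores the embedding of each answered query alongside the answer, and serves the cached answer when a new query's embedding is close enough. This sketch keeps entries in a Python list and uses cosine similarity with a hypothetical 0.9 threshold; in the scenario the entries would live in Azure Cache for Redis, and the embeddings would come from an embedding model rather than the hand‑written vectors used here.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # hypothetical; tune against real traffic
        self.entries: list[tuple[list[float], str]] = []

    def get(self, embedding: list[float]) -> Optional[str]:
        """Return the answer of the most similar cached query, if it clears the threshold."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self.entries:  # linear scan; a vector index in practice
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding: list[float], answer: str) -> None:
        self.entries.append((embedding, answer))


cache = SemanticCache()
cache.put([0.9, 0.1, 0.0], "You can reset your password from the sign-in page.")
print(cache.get([0.85, 0.15, 0.0]))  # similar query: cache hit
print(cache.get([0.0, 0.2, 0.9]))    # unrelated query: None, so call the LLM
```

The threshold is itself a trade‑off: lowering it raises the hit rate but increases the chance of serving an answer to a different question. Cache entries also need the same access scoping as the search index, so one user's cached answer is never served to a user who could not have retrieved its sources.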
Step 5 – Validate with Well‑Architected Framework:
- Reliability: all services deployed across Availability Zones; Cosmos DB with multi‑region writes; retry policies with exponential backoff.
- Security: Zero Trust—every service‑to‑service call authenticated via managed identity; user consent required for sensitive actions.
- Cost Optimisation: Azure Functions on the Consumption plan for background tasks; reserved capacity for baseline App Service instances; autoscale to handle peaks.
- Operational Excellence: distributed tracing via Application Insights; deployment via CI/CD with blue‑green slots.
- Performance Efficiency: streaming responses from Azure OpenAI; async queue decouples request from heavy processing.
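The retry policies listed under Reliability usually use exponential backoff with jitter, so that many clients recovering at once do not retry in lockstep. Azure SDK clients include their own configurable retry policies, so a hand‑written helper like this hypothetical `retry_with_backoff` is mainly relevant for custom dependencies; the delay values are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Marker for failures worth retrying, such as throttling or timeouts."""


def retry_with_backoff(op: Callable[[], T], max_attempts: int = 5, base_delay: float = 0.5,
                       max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # "Full jitter": sleep a random time up to an exponentially growing cap.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise RuntimeError("max_attempts must be at least 1")


calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("429 Too Many Requests")
    return "ok"


delays: list[float] = []
print(retry_with_backoff(flaky, sleep=delays.append), calls["n"])  # ok 3
```

Retries are only safe for idempotent operations, which is why the idempotency principle and the retry policy belong together.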
This example demonstrates how architecture integrates identity, compute, data, and AI into a coherent system, and how every design choice is a deliberate trade‑off evaluated against business and technical constraints. Architecture is not about picking the “best” service; it is about making the system work as a whole—securely, reliably, and sustainably.