Azure Observability Domain
Azure Observability is the architectural capability that enables teams to understand the internal state of a system from its external outputs—metrics, logs, and traces. This domain page defines observability as a first-class design concern, not a monitoring afterthought, and connects it to reliability, performance, security, and the specific demands of modern AI and agent systems. It is structured for reuse across multiple Azure certifications.
1. Overview #
What Is Observability in Cloud Architecture #
In cloud architecture, observability is the ability to infer the health, performance, and behaviour of a distributed system by instrumenting it and collecting telemetry data. It goes beyond simply checking whether a service is “up”; it provides the granular data needed to ask arbitrary questions about the system without having to pre-define all possible failure modes.
Observability is built on three pillars: metrics (numerical measurements over time), logs (timestamped event records), and traces (end-to-end request flows). Together they allow engineers to detect anomalies, debug issues, and continuously improve the system.
Monitoring vs Observability vs Logging #
- Monitoring is the practice of collecting predefined sets of metrics and alerts to detect known failure conditions. It answers “Is the system healthy right now?”
- Logging is the capture of discrete event data from services and infrastructure. It provides the raw material for debugging but is often voluminous and needs structure for efficient analysis.
- Observability is the superset: it combines metrics, logs, and traces with correlation and querying capabilities, enabling investigation of unknown-unknowns. It answers “Why is the system behaving this way?” and “What is happening across all services for a given request?”
Observability is a property of the system architecture, not just a tool. It must be designed into services through instrumentation, correlation ID propagation, and centralised telemetry pipelines.
Why Observability Is Critical for Distributed Systems #
Distributed systems—microservices, serverless functions, and AI agents—produce complex, non-linear request paths. A single user action may span dozens of services, queues, and databases. Without observability:
- Failures are invisible or take hours to diagnose.
- Performance bottlenecks are impossible to pinpoint.
- Cost of running services cannot be attributed accurately.
- Autonomous agents behave opaquely, raising safety and compliance risks.
Observability transforms the system from a black box into a transparent, manageable entity.
2. Core Observability Signals #
Metrics (System Performance Indicators) #
Metrics are numeric representations of system state captured at regular intervals. Examples: CPU utilisation, request count, message queue length, token usage per minute. Metrics are lightweight, efficient to store, and ideal for:
- Dashboarding real-time status.
- Triggering alerts on threshold breaches.
- Capacity planning and trend analysis.
Azure Monitor collects platform metrics from Azure resources automatically and allows custom metrics from applications.
Logs (Event-Level System Records) #
Logs are immutable, timestamped records of discrete events that occurred within the system. They can be structured (JSON) or unstructured. Logs provide the detailed context needed for debugging: which user did what, which parameters were used, what error was returned.
Azure resource logs, application logs (via Application Insights), and Microsoft Entra ID audit logs all feed into a central Log Analytics workspace for querying and correlation.
Traces (Request Flow Across Services) #
Distributed traces follow a single request as it propagates through multiple services. A trace is composed of spans—each representing a unit of work in a service (e.g., a database query, an HTTP call). Spans carry a common trace ID and parent span ID, enabling reconstruction of the entire call chain.
Application Insights automatically collects distributed traces for instrumented applications, showing latency per hop, dependencies, and failures.
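To make the span model concrete, the following is a minimal sketch of how a call chain can be rebuilt from a flat list of spans using only the trace ID and parent span ID. The Span class, its field names, and the sample data are assumptions made for illustration; they are not the Application Insights schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of a trace
    name: str
    duration_ms: float


def build_call_chain(spans, trace_id):
    """Rebuild the call tree of one trace as indented text lines."""
    children = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical spans, deliberately out of order as they might arrive from a store.
spans = [
    Span("t1", "c", "b", "SQL query", 70),
    Span("t1", "a", None, "GET /checkout", 120),
    Span("t2", "x", None, "GET /home", 10),
    Span("t1", "b", "a", "orders-api", 90),
]
print("\n".join(build_call_chain(spans, "t1")))
```

The output lists the root request first, with each downstream unit of work indented beneath its parent, which is the same information a transaction view presents graphically.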
Correlation Between Signals #
The power of observability comes from correlating these signals. For example:
- An alert fires on high latency (metric).
- The engineer drills into the alert timeline and sees correlated error logs (logs).
- They pivot to the distributed trace of a slow request to find the specific database call that timed out (trace).
Correlation IDs (trace IDs, operation IDs) are propagated across services via HTTP headers (e.g., the traceparent header defined by the W3C Trace Context standard) and message properties, linking metrics, logs, and traces.
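As an illustration of header-based propagation, here is a small sketch that parses a version 00 traceparent value (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) and derives the value a service would send on its outgoing call. The function names are hypothetical, and in practice an SDK such as OpenTelemetry handles this for you; the sketch only shows what travels in the header.

```python
import re
import secrets

# version "00" layout: version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with a random trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the value is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(header):
    """Keep the trace ID, mint a new span ID for the outgoing call."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return new_traceparent()
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Because the trace ID is unchanged between incoming and outgoing, every log line or span tagged with it can later be joined into one request view.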
Distributed Systems Visibility #
In a distributed system, no single service has the full picture. Centralised telemetry with consistent correlation allows an operator to view a request end‑to‑end: from the front‑end, through API gateways and business services, to the database and back.
3. Azure Observability Services Mapping #
| Service | Architectural Role |
|---|---|
| Azure Monitor | The umbrella observability platform. Collects, analyses, and acts on telemetry from Azure and on-premises environments. It encompasses metrics, logs, alerts, and dashboards. |
| Log Analytics Workspace | A centralised store for logs and performance data. All Azure resource logs, Application Insights data, and custom logs are queried via Kusto Query Language (KQL). Workspaces can be shared or dedicated. |
| Application Insights | Application Performance Management (APM) service for live applications. Automatically instruments web apps, Functions, and containers to capture request rates, dependencies, exceptions, and traces. It supports distributed tracing and smart detection. |
| Azure Metrics Explorer | Interactive metrics visualisation in the Azure portal. Allows creation of metric charts based on Azure platform metrics and custom application metrics. |
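| Diagnostic Settings | Configuration that routes Azure resource logs and metrics to a destination: Log Analytics workspace, storage account, or Event Hubs. Resource logs are not collected until a diagnostic setting is created, so each resource must be configured explicitly (or at scale through Azure Policy). |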
| Network Watcher | Network-specific observability: topology, connectivity tests, NSG flow logs, packet capture, and VPN diagnostics. Provides visibility into network-level behaviour. |
4. Observability Architecture Model #
Centralised vs Decentralised Observability Design #
- Centralised model: all logs and metrics from all subscriptions and regions flow into a single Log Analytics workspace or a small set of workspaces. Simplifies cross-service querying, correlation, and access control. Best for single-tenant or highly governed organisations.
- Decentralised model: each application team owns its own workspace. Provides autonomy and cost isolation but makes end‑to‑end tracing across teams harder. Can be mitigated with cross-workspace queries.
A common hybrid approach: centralised workspaces for security and platform logs, application-specific workspaces for app telemetry, with a central dashboard aggregating insights.
Data Collection Pipelines #
Telemetry flows from agents, SDKs, and diagnostic settings into Azure Monitor. The pipeline stages:
- Generation: Application code emits logs, metrics, and traces using SDKs (OpenTelemetry, Application Insights SDK). Azure resources emit platform logs.
- Collection: Diagnostic settings route resource logs; Application Insights SDK sends data directly; Azure Monitor Agent (AMA) collects guest OS data.
- Ingestion & transformation: Logs enter Log Analytics; metrics go to the time-series database. Data Collection Rules (DCRs) can filter and transform data before ingestion to reduce volume and cost.
- Analysis & alerting: KQL queries, workbooks, and alerts consume the data.
Telemetry Ingestion Architecture #
High-volume systems must design for ingestion scaling:
- Use sampling in Application Insights to reduce data volume while preserving statistically relevant data.
- Aggregate custom metrics rather than emitting per-event metrics.
- Apply ingestion-time transformations to drop noisy or low-value logs.
- Send verbose logs to low-cost storage (archive) and retain only essential data in Log Analytics for interactive query.
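The advice to aggregate custom metrics rather than emit one data point per event can be sketched as follows. MetricAggregator and its method names are hypothetical and not part of any Azure SDK; a real exporter would call flush on a timer and send each summary as a single pre-aggregated metric.

```python
class MetricAggregator:
    """Collapse many measurements into one summary per metric and dimension set."""

    def __init__(self):
        self._buckets = {}

    def record(self, name, value, **dimensions):
        key = (name, tuple(sorted(dimensions.items())))
        bucket = self._buckets.setdefault(
            key, {"count": 0, "sum": 0.0, "min": value, "max": value}
        )
        bucket["count"] += 1
        bucket["sum"] += value
        bucket["min"] = min(bucket["min"], value)
        bucket["max"] = max(bucket["max"], value)

    def flush(self):
        """Return the summaries collected so far and start a new interval."""
        summaries = [
            {"name": name, "dimensions": dict(dims), **stats}
            for (name, dims), stats in self._buckets.items()
        ]
        self._buckets = {}
        return summaries


aggregator = MetricAggregator()
for latency_ms in (120, 80, 100):
    aggregator.record("checkout_latency_ms", latency_ms, region="westeurope")
summaries = aggregator.flush()  # one summary instead of three data points
```

The trade-off is resolution: individual values are lost, so keep the flush interval short enough for the alerts that depend on the metric.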
Correlation ID and Distributed Tracing Strategies #
Correlation IDs must be propagated across all services:
- For HTTP calls, the W3C Trace-Context header (traceparent, tracestate) is standard and automatically used by Application Insights.
- For messaging (Service Bus, Event Hubs), the correlation ID is passed as a message property and set as the operation parent on the consumer side.
- For Functions and Logic Apps, the invocation ID can be linked to the upstream operation.
Application Insights automatically stitches spans into a full end‑to‑end transaction map.
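For the messaging case, the sketch below shows the idea with plain dictionaries standing in for queue messages. The property name traceparent, the message shape, and both function names are assumptions made for this example; real Service Bus or Event Hubs clients and their instrumentation libraries use their own property conventions.

```python
import secrets

TRACE_PROPERTY = "traceparent"  # hypothetical property name chosen for this sketch


def publish(body, current_traceparent):
    """Producer side: attach the caller's trace context to the message."""
    return {
        "body": body,
        "application_properties": {TRACE_PROPERTY: current_traceparent},
    }


def start_consumer_span(message, name):
    """Consumer side: continue the producer's trace, or start a new one."""
    header = message.get("application_properties", {}).get(TRACE_PROPERTY, "")
    parts = header.split("-")
    if len(parts) == 4:
        trace_id, parent_id = parts[1], parts[2]
    else:
        trace_id, parent_id = secrets.token_hex(16), None
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,  # the producer's span becomes the operation parent
        "name": name,
    }


upstream = "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01"
message = publish({"orderId": 42}, upstream)
span = start_consumer_span(message, "process order")
```

The consumer span shares the producer's trace ID, so asynchronous work appears in the same end-to-end transaction as the request that enqueued it.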
Multi-Service System Visibility Design #
To achieve complete visibility:
- Instrument all services (including containers, Functions, VMs) with the same telemetry provider (Application Insights) for unified tracing.
- Use a shared Log Analytics workspace or cross-workspace queries for logs.
- Define a custom property (
Environment,ApplicationName) on all telemetry to enable filtering and grouping. - Build workbooks that pull metrics from multiple sources into a single pane of glass.
5. Observability Design Decisions #
Metrics vs Logs vs Traces Usage Patterns #
- Metrics: for real-time dashboards, alerts, SLO tracking, and capacity planning. Minimal storage cost, fast queries.
- Logs: for debugging, audit, compliance, and ad-hoc queries. Higher ingestion and storage cost; require structured logging for efficient querying.
- Traces: for latency analysis, bottleneck identification, and dependency mapping. Generated automatically by APM tools; rely on correlation ID propagation.
Pattern: alert on metrics, drill into traces for latency problems, and use logs for error detail.
When to Use Application Insights vs Log Analytics #
- Application Insights: primary choice for application-level telemetry (requests, dependencies, exceptions, page views). It provides APM features like application map, live metrics, and smart detection. It stores data in a Log Analytics workspace under the hood.
- Log Analytics: the direct KQL-queryable store for any log data. Use for resource logs, security logs, and custom logs that are not application request/response telemetry.
The two are complementary: App Insights is the APM layer; Log Analytics is the data platform.
Sampling Strategies for High-Scale Systems #
Sampling reduces telemetry volume while preserving enough data for analysis:
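- Adaptive sampling: automatically adjusts the sampling rate to keep within a target volume while preserving trace correlation (the telemetry items of a transaction are kept or dropped together). Good for production.
- Fixed-rate sampling: keeps a constant percentage. Simple, but trace completeness can break if services in the same call chain are configured with different rates.
- Ingestion sampling: discards a share of telemetry at the Application Insights ingestion endpoint, after it has left the application. It lowers retained volume and cost but not the traffic sent by the SDK; filtering at the SDK or DCR level is a separate mechanism.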
For critical flows (payment, auth), consider low or zero sampling; for high-volume read traffic, aggressive sampling is acceptable.
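One way to keep traces complete under sampling is to make the keep-or-drop decision a deterministic function of the trace ID, so every service reaches the same verdict for the same request. The sketch below illustrates that idea with a SHA-256 hash; it is not the algorithm Application Insights uses, and the function name and always_keep flag are hypothetical.

```python
import hashlib


def keep_trace(trace_id, sample_rate, always_keep=False):
    """Decide whether to keep all telemetry for a trace.

    always_keep models critical flows (payment, auth) that bypass sampling.
    """
    if always_keep:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Every service evaluating the same trace ID at the same rate agrees.
decision_in_api = keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.10)
decision_in_worker = keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.10)
```

The guarantee only holds when all services in the call chain apply the same rate; a downstream service with a lower rate will still drop spans that upstream services kept.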
Retention vs Cost Optimisation #
Log Analytics charges per GB ingested and per GB stored beyond the included retention. Design decisions:
- Default interactive retention is 30 days; data kept in long-term (archive) retention costs less but must be brought back with a search job or restore before it can be queried interactively.
- Send verbose debug logs to a storage account (cheap) and keep only the last 7–14 days in Log Analytics.
- Use lifecycle management on storage accounts for long-term archival.
Real-Time vs Batch Observability Trade-offs #
- Real-time: metrics and alerts, live metrics stream from App Insights. Required for incident detection and SLO dashboards. Slightly higher processing cost.
- Batch: log aggregation and analysis that runs periodically (e.g., daily usage reports, weekly anomaly detection). Lower cost, but not suitable for immediate response.
A balanced architecture uses real-time alerting and metrics, with batch processing for in-depth analysis and compliance reports.
6. Observability in Enterprise Architecture #
Microservices Architectures #
Microservices increase the number of moving parts. Observability strategies:
- Use a distributed tracing tool (Application Insights) to map service-to-service dependencies automatically.
- Emit structured logs from each service with a common schema (e.g., { "severity", "message", "correlationId", "service" }).
- Implement health endpoints (/health, /ready) that report dependencies; aggregate with metric alerts.
- Use service mesh observability (Istio telemetry) for network-level metrics without code changes.
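A common log schema is easiest to enforce in a shared formatter. The following sketch uses Python's standard logging module to emit the four fields named above as one JSON object per line; the JsonFormatter and build_logger names, the logger naming scheme, and the sample values are assumptions for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record with the shared schema used by all services."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            # Supplied per call through logging's `extra` argument.
            "correlationId": getattr(record, "correlationId", None),
            "service": self.service,
        }
        return json.dumps(payload)


def build_logger(service, stream):
    logger = logging.getLogger(f"structured.{service}")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout or a log shipper
logger = build_logger("checkout", stream)
logger.warning("payment retry", extra={"correlationId": "4bf92f35"})
```

Keeping the field names identical across services is what allows a single query to filter by correlationId regardless of which team wrote the service.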
Multi-Tier Applications #
For traditional multi-tier apps:
- Instrument each tier (web, API, worker) with App Insights.
- Use operation ID to correlate requests from the web tier down to the database.
- Monitor SQL query performance via SQL Insights and diagnostic logs.
Hybrid Cloud Systems #
Observability must span on-premises and cloud:
- Azure Arc extends monitoring agents to on-premises VMs and Kubernetes clusters, sending data to Azure Monitor.
- Use the same Log Analytics workspace for both cloud and on-premises to unify querying.
- Ensure network paths allow telemetry upload.
Multi-Region Distributed Systems #
Multi-region deployments add geographic complexity:
- Deploy regional Log Analytics workspaces for data residency and egress optimisation, or use a central workspace with cross-region ingestion (consider cost).
- Use availability tests from multiple regions to monitor global latency.
- Correlate telemetry across regions with a single correlation ID per request, and record the region as a property on every telemetry item so that queries can group or filter by region.
SLA/SLO/SLI Compliance Tracking #
Observability is the foundation of service level management:
- SLIs (Service Level Indicators): metrics that measure the service (e.g., latency p95, error rate).
- SLOs (Service Level Objectives): target values for SLIs over a window.
- SLAs: contractual commitments.
Implement SLO tracking via Azure Monitor metric alerts and burn‑rate calculation, integrating with dashboards and incident management.
7. Observability for AI & Agent Systems #
Modern AI workloads introduce unique observability challenges. Traditional request latency is not enough; you must understand model behaviour, tool execution, and quality of reasoning.
LLM Latency and Token Usage Tracking #
For applications using Azure OpenAI:
- Latency: track time to first token, token generation rate, and end‑to‑end completion time. Decompose into model inference time and network latency.
- Token usage: log prompt tokens, completion tokens, and total tokens per request. Aggregate to monitor cost and detect anomalies (e.g., prompt injection leading to excessive token consumption).
- Application Insights custom events can record these per invocation.
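The latency and token signals above can be captured by wrapping the streamed response. This is a minimal sketch: LlmCallRecord and measure_stream are hypothetical names, each streamed chunk is counted as one token (an approximation, since real responses report usage separately), and the injectable clock exists only to make the example deterministic.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class LlmCallRecord:
    prompt_tokens: int
    completion_tokens: int
    time_to_first_token_s: Optional[float]
    total_time_s: float

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens

    @property
    def tokens_per_second(self):
        if self.time_to_first_token_s is None:
            return 0.0
        generation_time = self.total_time_s - self.time_to_first_token_s
        return self.completion_tokens / generation_time if generation_time > 0 else 0.0


def measure_stream(token_stream, prompt_tokens, clock=time.perf_counter):
    """Consume a streamed completion and time it (one chunk counted as one token)."""
    start = clock()
    first_token_at = None
    completion_tokens = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = clock()
        completion_tokens += 1
    end = clock()
    ttft = None if first_token_at is None else first_token_at - start
    return LlmCallRecord(prompt_tokens, completion_tokens, ttft, end - start)


# Deterministic demo: a fake clock returning start, first token, and end times.
ticks = iter([0.0, 0.5, 2.5])
record = measure_stream(iter("abcde"), prompt_tokens=20, clock=lambda: next(ticks))
```

The fields of the resulting record map naturally onto the properties of a custom event, one per model invocation.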
RAG Pipeline Observability #
Retrieval-Augmented Generation involves both retrieval and generation steps. Key signals:
- Retrieval quality: log the number of documents retrieved, similarity scores, and whether the top result was actually used by the LLM. Compare retrieved chunk IDs against the final answer citations to measure precision/recall.
- Hallucination indicators: if the model’s answer cannot be grounded in retrieved documents (based on citation verification), flag the interaction for review.
- Pipeline latency: measure time spent in embedding generation, vector search, and LLM inference separately.
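Comparing retrieved chunk IDs with the citations in the final answer reduces to set arithmetic. The sketch below treats a cited chunk as the proxy for a relevant one, which is an assumption of this example rather than a ground-truth relevance judgement; the function name and chunk IDs are hypothetical.

```python
def citation_precision_recall(retrieved_ids, cited_ids):
    """Score one RAG interaction from chunk IDs alone.

    precision: share of retrieved chunks the answer actually cited.
    recall:    share of cited chunks that were really retrieved.
    """
    retrieved, cited = set(retrieved_ids), set(cited_ids)
    used = retrieved & cited
    return {
        "precision": len(used) / len(retrieved) if retrieved else 0.0,
        "recall": len(used) / len(cited) if cited else 0.0,
        # Citations with no retrieved source are a hallucination indicator.
        "ungrounded_citations": sorted(cited - retrieved),
    }


scores = citation_precision_recall(
    retrieved_ids=["c1", "c2", "c3", "c4"],
    cited_ids=["c2", "c4", "c9"],
)
```

Here two of the four retrieved chunks were cited (precision 0.5), and the citation c9 has no retrieved source, so the interaction would be flagged for review.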
Agent Execution Tracing #
AI agents perform multi-step loops with tool calls. Observability must capture:
- Agent reasoning trace: log each “thought” or “plan” generated by the LLM.
- Tool invocations: for each tool call, log the tool name, parameters, success/failure, and duration. This is analogous to dependency tracking in traditional services.
- State changes: record how the agent’s plan evolves after each tool result.
- Correlation: each agent loop iteration should share a trace ID and have a unique span ID, with parent‑child relationships between the planning step and subsequent tool calls.
Use Application Insights SDK with custom telemetry or OpenTelemetry to instrument the agent orchestrator.
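The span structure described above can be sketched without any SDK. AgentTrace, the span names, and the tool name below are hypothetical; a real orchestrator would use OpenTelemetry or the Application Insights SDK to create and export the spans, but the parent-child shape would be the same.

```python
import secrets
import time
from contextlib import contextmanager


class AgentTrace:
    """Collect spans for one agent run; all spans share a single trace ID."""

    def __init__(self):
        self.trace_id = secrets.token_hex(16)
        self.spans = []

    @contextmanager
    def span(self, name, parent_id=None, **attributes):
        record = {
            "trace_id": self.trace_id,
            "span_id": secrets.token_hex(8),
            "parent_id": parent_id,
            "name": name,
            "attributes": attributes,
            "status": "ok",
        }
        started = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["status"] = "error"
            record["attributes"]["error"] = repr(exc)
            raise
        finally:
            # Spans are stored when they finish, so children precede parents.
            record["duration_ms"] = (time.perf_counter() - started) * 1000
            self.spans.append(record)


trace = AgentTrace()
with trace.span("agent.plan", step=1) as plan:
    with trace.span("tool.call", parent_id=plan["span_id"], tool="search_orders") as call:
        call["attributes"]["result_count"] = 1
```

Recording the tool name, outcome, and duration on each tool span gives the same view for agents that dependency tracking gives for conventional services.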
Prompt-Level Logging and Debugging #
For security and quality, log the final prompt sent to the LLM and the raw response (with appropriate data masking). This enables:
- Reviewing prompt injection attempts.
- Understanding why the model gave a particular answer.
- Iterating on prompt engineering.
Log prompts to a dedicated container with strict access controls due to sensitivity.
Model Performance Monitoring #
For custom AI models (fine‑tuned LLMs, vision, etc.), track:
- Inference latency and throughput.
- Model accuracy/quality using evaluation metrics (e.g., ROUGE, BERTScore) on a hold‑out set or live feedback.
- Data drift: monitor input feature distributions compared to training baseline (Azure ML data drift).
- Resource utilisation: GPU memory, CPU, for capacity planning.
Autonomous Agent Behaviour Tracking #
For agent systems (GH-600 relevance), observability must ensure that agents operate within defined safety and performance boundaries:
- Log every decision‑making step: what the agent planned to do, what it actually did.
- Set alerts on anomalies: sudden spikes in tool call failures, agents entering infinite loops (exceeding maximum step count), or executing unexpected tool sequences.
- Track user feedback (thumbs up/down) and link to agent traces for continuous improvement.
- Enable a “replay” capability: given a trace, be able to reconstruct the agent’s state and actions.
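The anomaly conditions listed above can be evaluated directly over the logged tool calls of a run. This is a minimal sketch under stated assumptions: the record shape, the thresholds, the anomaly labels, and the tool names are all hypothetical, and a production system would more likely express these rules as log queries feeding alerts.

```python
import json
from collections import Counter


def detect_agent_anomalies(tool_calls, max_steps=10, max_repeats=3, allowed_tools=None):
    """Return anomaly labels for one agent run, given its logged tool calls."""
    anomalies = []

    # Possible runaway loop: more steps than the configured budget.
    if len(tool_calls) > max_steps:
        anomalies.append("step_limit_exceeded")

    # The same tool called with identical parameters again and again.
    repeats = Counter(
        (call["tool"], json.dumps(call.get("params", {}), sort_keys=True))
        for call in tool_calls
    )
    for (tool, _), count in repeats.items():
        if count >= max_repeats:
            anomalies.append(f"repeated_call:{tool}")

    # Tools outside the sequence this agent is expected to use.
    if allowed_tools is not None:
        for call in tool_calls:
            label = f"unexpected_tool:{call['tool']}"
            if call["tool"] not in allowed_tools and label not in anomalies:
                anomalies.append(label)

    return anomalies


calls = [{"tool": "search", "params": {"q": "refund policy"}}] * 3
calls.append({"tool": "delete_account", "params": {}})
found = detect_agent_anomalies(calls, max_steps=3, max_repeats=3, allowed_tools={"search"})
```

Each label can be emitted as a custom event carrying the run's trace ID, so an alert links straight back to the trace that triggered it.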
8. Reliability, Performance & Incident Response #
Alerting Strategies and Rule Design #
Effective alerting requires:
- Signal selection: alert on symptoms (high latency, error rate) not causes (CPU spike) when possible.
- Thresholds based on SLOs: set alert thresholds to trigger when the burn rate indicates you will miss the SLO.
- Action groups: define notifications (email, SMS, webhook) and actions (automated runbook, Logic App) that fire on alert.
- Suppression: avoid alert storms with dynamic thresholds, alert grouping, and maintenance windows.
Root Cause Analysis (RCA) Workflows #
Observability supports RCA by providing:
- The timeline of events leading up to the incident.
- The distributed trace of a failed request.
- Logs around the failure timestamp.
- Metrics showing correlation with deployments (use change tracking).
Build an incident workbook that pulls all relevant data for a given time range and correlation ID.
SLO/SLA Monitoring #
Monitor SLOs with metric alerts based on error budgets:
- Compute error budget consumption from a time-series of good/total events.
- Use a dashboard to visualise remaining budget.
- Alert when the error budget burn rate exceeds a sustainable threshold (e.g., 2% of the budget burned in 1 hour).
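The arithmetic behind burn-rate alerting is short enough to show in full. The sketch assumes a 30-day SLO period and a 99.9% target; both values and the function names are illustrative choices, not Azure Monitor defaults. With a 30-day (720-hour) period, burning 2% of the budget in 1 hour corresponds to a burn rate of 0.02 × 720 / 1 = 14.4.

```python
def burn_rate(good_events, total_events, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    error_rate = 1 - good_events / total_events
    error_budget = 1 - slo_target
    return error_rate / error_budget


def budget_consumed(rate, window_hours, slo_period_hours=30 * 24):
    """Fraction of the whole period's budget used up during the window."""
    return rate * window_hours / slo_period_hours


def should_alert(good_events, total_events, slo_target, window_hours,
                 max_budget_fraction=0.02):
    rate = burn_rate(good_events, total_events, slo_target)
    return budget_consumed(rate, window_hours) >= max_budget_fraction


# Last hour: 2% of requests failed against a 99.9% SLO (burn rate about 20).
page_on_call = should_alert(98_000, 100_000, slo_target=0.999, window_hours=1)
# Last hour: 0.05% failed (burn rate about 0.5), well within budget.
healthy = should_alert(99_950, 100_000, slo_target=0.999, window_hours=1)
```

In practice the same check is usually run over a long and a short window together, so that the alert fires quickly but clears once the problem stops.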
Auto-Remediation Concepts #
Observability can trigger automated remediation:
- A metric alert fires, triggering an Azure Function or Logic App that performs a known safe action (restart a container, scale out a service).
- For AI agents, a failed tool call can automatically retry with a fallback tool or fall back to a simpler response.
Auto-remediation must be designed with idempotency and safety in mind.
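Idempotency and safety can be sketched as two guards around the remediation action: ignore duplicate deliveries of the same alert, and refuse to repeat an action on the same target within a cooldown. The Remediator class, its return labels, and the 10-minute cooldown are assumptions for this example; state is held in memory here, whereas a Function or Logic App would need a durable store.

```python
import time


class Remediator:
    """Run a remediation action at most once per alert, with a per-target cooldown."""

    def __init__(self, action, cooldown_s=600, clock=time.monotonic):
        self._action = action
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._outcomes = {}   # alert_id -> outcome of its first delivery
        self._last_run = {}   # target -> time of the last executed action

    def handle(self, alert_id, target):
        # Idempotency: the same alert delivered twice must not act twice.
        if alert_id in self._outcomes:
            return "duplicate"
        now = self._clock()
        last = self._last_run.get(target)
        # Safety: avoid restart loops when alerts keep firing for one target.
        if last is not None and now - last < self._cooldown_s:
            self._outcomes[alert_id] = "skipped_cooldown"
            return "skipped_cooldown"
        self._action(target)
        self._last_run[target] = now
        self._outcomes[alert_id] = "executed"
        return "executed"


restarted = []           # stands in for a real restart or scale-out call
now = [0.0]              # fake clock so the example is deterministic
remediator = Remediator(restarted.append, cooldown_s=600, clock=lambda: now[0])
first = remediator.handle("alert-1", "orders-api")
retry = remediator.handle("alert-1", "orders-api")
```

When the cooldown suppresses an action, the incident should still be escalated to a person, since repeated alerts on one target suggest the automated fix is not working.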
Incident Lifecycle Management #
Integrate observability with ITSM (e.g., ServiceNow) via connectors. A typical lifecycle:
- Alert fires → creates an incident in ITSM.
- On-call engineer acknowledges → uses observability tools to diagnose.
- RCA documented, linked to telemetry.
- Post‑mortem analysis feeds back into observability (add new alerts, refine dashboards).
9. Security & Compliance in Observability #
Secure Telemetry Data Handling #
Telemetry data can contain sensitive information (PII, secrets). Design considerations:
- Encryption in transit: all telemetry channels use TLS.
- Encryption at rest: Log Analytics and storage accounts are encrypted.
- Customer-managed keys (CMK) for Log Analytics workspaces in regulated environments.
- Private endpoints: for sending logs to a workspace over a VNet, avoiding public internet.
Sensitive Data Masking in Logs #
Configure log scrubbing to mask or remove sensitive data:
- Use Application Insights telemetry initializers to filter out fields before transmission.
- Use Data Collection Rules to transform or drop specific log columns.
- Avoid logging raw API keys, passwords, or full credit card numbers.
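Masking is best applied before a log line leaves the process. The sketch below shows a regex-based scrubber for key-value secrets and card-like numbers; the patterns, the redaction markers, and the scrub function are assumptions for illustration and will not catch every format. Candidate card numbers are confirmed with a Luhn checksum to avoid masking ordinary long identifiers.

```python
import re

# key=value or key: value pairs whose key suggests a credential
_SECRET = re.compile(r"(?i)\b(password|api[_-]?key|secret|token)(\s*[=:]\s*)\S+")
# 13 to 19 digits, optionally separated by single spaces or hyphens
_CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def _luhn_ok(digits):
    """Checksum used by payment card numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def _mask_card(match):
    digits = re.sub(r"\D", "", match.group(0))
    return "[REDACTED_CARD]" if _luhn_ok(digits) else match.group(0)


def scrub(message):
    """Mask credentials and card numbers in a log message."""
    message = _SECRET.sub(r"\1\2[REDACTED]", message)
    return _CARD.sub(_mask_card, message)


clean = scrub("api_key=abc123 card 4111 1111 1111 1111 ok")
```

A function like this can be called from a logging filter or a telemetry processor so that the same rules apply to every signal the service emits.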
Role-Based Access Control for Monitoring Data #
Azure provides granular RBAC for observability data:
- Monitoring Reader: read metrics and logs.
- Monitoring Contributor: read/write monitoring configuration.
- Log Analytics Reader / Contributor: query and manage workspaces.
- Application Insights component-level access to separate production and development data.
Enforce least privilege: on-call operators may need log query access across workspaces, whereas developers may need only read-only access to their own application’s App Insights resource.
Audit Logging Requirements #
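Control-plane operations (e.g., who modified an alert rule or changed a diagnostic setting) are captured in the Azure Activity Log. Queries run against Log Analytics are not recorded there; they are audited separately through workspace query audit logs (the LAQueryLogs table), which must be enabled with a diagnostic setting on the workspace. For high-compliance environments, send Activity Logs to a dedicated Log Analytics workspace with immutable storage and long-term retention.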
Compliance Reporting Support #
Observability data feeds compliance reporting:
- Use Azure Policy to enforce that all resources send logs to a central workspace.
- Build workbooks that summarise security posture, incident metrics, and audit trails for auditors.
- Integrate with Microsoft Purview for data classification if logs contain business data.
10. Certification Mapping #
| Certification | Observability Domain Relevance |
|---|---|
| AZ-104 | Configure diagnostic settings, monitor VMs and resources with Azure Monitor, create alerts, interpret metrics, and use Network Watcher. |
| AZ-305 | Design enterprise monitoring and logging architecture: workspace strategy, sampling, cost optimisation, SLO tracking, multi-region observability. |
| AI-900 | Understand basic monitoring concepts for AI services: API response monitoring, basic model metrics. |
| AI-103 | Implement observability for AI applications: track LLM calls, token usage, tool invocation telemetry, debug RAG pipelines. |
| AI-300 | Architect MLOps observability: model monitoring, data drift, inference latency, pipeline logging, and model performance dashboards. |
| GH-600 | Design agent observability: trace multi-step agent loops, log tool calls, monitor behaviour anomalies, ensure auditability of autonomous actions. |
11. Real-World Architecture Example #
Scenario: A global e‑commerce platform with microservices, AI‑powered recommendations, and a customer service agent.
Observability design:
- Unified telemetry: All services (App Service APIs, Container Apps, Functions, AKS) are instrumented with Application Insights using a connection string that points to the same Application Insights resource (connection strings supersede standalone instrumentation keys). Correlation IDs are propagated using W3C Trace-Context headers for HTTP and custom properties for Service Bus messages.
- Centralised logging: All diagnostic settings for Azure resources (storage, Cosmos DB, SQL Database) route logs to a central Log Analytics workspace in a management subscription. Application Insights data is also stored in this workspace. A secondary workspace in the DR region aggregates logs for resilience.
- Dashboards:
- Operations dashboard: Azure Monitor workbook showing request rates, p95 latency, error rates per service, and overall health status.
- AI performance dashboard: custom workbook tracking token usage per hour, average LLM latency, RAG retrieval precision (calculated via a custom metric emitted by the orchestration service), and agent tool call success rates.
- SLO dashboard: error budgets for key SLIs (checkout API availability, search latency) with burn‑down charts.
- Alerts:
- High-priority: checkout API error rate > 0.1% for 5 minutes → notification to on‑call team via ITSM.
- AI-specific: when agent tool call failures exceed 10% of requests over 15 minutes → investigate agent reliability.
- Token usage spike: if daily token consumption exceeds forecast by 50% → alert for possible prompt injection or misconfiguration.
- Alert actions: high-severity alerts trigger a Logic App that posts incident details to Teams and creates a ServiceNow ticket; critical database issues trigger an automated failover test.
- Agent observability:
- The agent runtime (Container Apps) emits custom events: “agent.plan.created”, “tool.call.started”, “tool.call.completed”, “agent.plan.completed”. Each event includes the trace ID, user session ID, and step number.
- A Power BI report, built on data exported from Log Analytics to Azure Data Explorer, shows agent conversation quality trends (user feedback sentiment, average steps per task).
- Agent logs are stored in an immutable blob container for compliance, with PII scrubbed.
- Incident response: An incident workbook aggregates all telemetry for a given correlation ID: the end‑user request, all downstream HTTP calls, the trace of the LLM interaction, and the exact logs of the tool call that failed. This reduces mean time to resolution.
This observability architecture ensures that every layer—from web traffic to AI reasoning—is transparent, measurable, and actionable, enabling the platform to operate reliably at scale.