Skip to main content
  1. CloudCertPro - Learn the Architecture Behind the Certification
  2. >
  3. Azure Cloud Knowledge Hub - CloudCertPro
  4. >
  5. Azure Domains Learning Hub: Master Azure by Capability Domains
  6. >
  7. Azure AI & Machine Learning Domain

Azure AI & Machine Learning Domain

Azure AI & Machine Learning is the architectural domain that enables applications to perceive, reason, and act using data and models. This domain page defines AI and ML as a full-stack cloud concern, not a service listing, and explains the model layer, data layer, orchestration, and application integration. It covers LLM systems, RAG, agents, MLOps, and the critical design decisions that shape production AI systems. The content is structured for reuse across multiple Azure certifications.


1. Overview
#

What Is AI & ML in Cloud Architecture
#

In cloud architecture, AI & Machine Learning is the set of services, patterns, and infrastructure that deliver intelligent capabilities—prediction, classification, generation, reasoning, and autonomous action. It transforms raw data into models, serves those models at scale, and integrates model outputs into applications and workflows.

AI is no longer a niche experiment; it is a horizontal architectural layer that interacts with compute, storage, networking, identity, and observability domains. Building AI systems requires deliberate decisions about model selection, data engineering, inference runtime, and safety guardrails.

Evolution from ML Systems → LLM Systems → Agent Systems
#

  • Traditional ML Systems: focused on predictive models—regression, classification, recommendation—trained on structured data and deployed as batch or real-time endpoints.
  • LLM‑Powered Systems: leverage pre-trained large language models (LLMs) via APIs, adding prompt engineering, retrieval augmentation (RAG), and conversation orchestration.
  • Agent Systems: extend LLMs with tool‑calling, multi-step planning, and autonomous loops, turning models from passive responders into active participants that interact with APIs, databases, and other agents.

This evolution elevates AI from a stateless model call to a stateful, distributed application pattern that demands its own architectural treatment.

Why AI Is Now a Full-Stack Architecture Domain
#

AI applications involve:

  • Data ingestion and preprocessing at scale.
  • Model training or fine-tuning with GPU/CPU infrastructure.
  • Inference hosting with low latency and high availability.
  • Integration with application backends, messaging, and APIs.
  • Safety, security, and governance that addresses prompt injection, data leakage, and compliance.

AI architecture thus spans compute, data, networking, security, and observability, making it a cross-cutting domain on par with identity or integration.


2. Core AI System Layers
#

A complete AI system is composed of four logical layers.

Model Layer
#

The model layer contains the machine learning models or LLMs that perform inference. It includes:

  • Foundation models: pre-trained models (e.g., GPT‑4, Llama) consumed via API or hosted endpoints.
  • Fine‑tuned models: foundation models adapted on domain data.
  • Custom models: built from scratch using Azure Machine Learning, PyTorch, or TensorFlow.
  • Model registries and versioning: Azure ML Model Registry tracks model artifacts.

Data Layer
#

The data layer supplies training data, retrieval knowledge bases, and runtime context. Components:

  • Training datasets: stored in Data Lake Storage, Blob Storage, or Azure ML Datastores.
  • RAG knowledge base: documents stored in Blob/Data Lake, indexed by Azure AI Search for vector retrieval.
  • Feature stores: Azure ML Managed Feature Store for consistent feature definitions across training and inference.
  • Feedback data: conversation logs and user feedback stored in Cosmos DB or Azure SQL for model improvement.

Orchestration Layer
#

This layer coordinates model calls, data retrieval, and tool execution:

  • Prompt orchestration: crafting prompts, managing context windows, and chaining multiple model calls (e.g., Semantic Kernel, LangChain, Azure AI Foundry prompt flow).
  • RAG pipeline: retrieve documents, re-rank, inject into prompt, generate response.
  • Agent loop: planning → tool selection → execution → observation → re-planning.
  • Batch inference pipelines: Azure Machine Learning pipelines for large-scale scoring.
  • MLOps automation: CI/CD for model training, evaluation, and deployment.

Application Layer
#

This is the user-facing or system-facing interface:

  • Chat UI, copilots, voice assistants: built with Azure AI Services or custom code.
  • API endpoints: served via Azure API Management or App Service, exposing model capabilities.
  • Business process integration: Logic Apps or Power Automate triggering AI steps.

These layers are decoupled, enabling independent scaling and technology selection.


3. Azure AI Services Mapping
#

Service Architectural Role
Azure OpenAI Service Hosts GPT, Embedding, and DALL‑E models with enterprise controls (private endpoints, content filtering, abuse monitoring). Used as the reasoning engine in LLM applications and agents.
Azure AI Foundry A unified platform for building, testing, and deploying AI applications. Includes prompt flow, model catalog, and responsible AI tools. (Formerly Azure AI Studio).
Azure AI Search Managed search and retrieval service with vector search, hybrid search, and semantic ranking. Serves as the RAG retrieval layer, supporting full‑text and vector queries.
Azure Machine Learning Enterprise MLOps platform: automated ML, designer, pipelines, managed endpoints, model registry, and responsible AI dashboards.
Azure AI Services Pre‑built cognitive APIs: Speech, Vision, Language, Translator, Document Intelligence. Can be used as tools by agents.
Azure Bot Service Framework for building conversational bots with channel integration (Teams, web). Increasingly superseded by custom agent patterns but relevant for simple Q&A bots.

4. AI Architecture Patterns
#

RAG (Retrieval-Augmented Generation)
#

RAG combines an LLM with a knowledge retrieval step to ground responses in authoritative data. The pattern:

  1. User query → embedding model converts to vector.
  2. Vector search retrieves top‑k documents from Azure AI Search (or other vector store).
  3. Retrieved chunks are injected into the LLM prompt with instructions to cite sources.
  4. LLM generates an answer grounded in the retrieved content.

RAG reduces hallucinations and enables the model to answer questions on proprietary data without fine‑tuning.

Fine-Tuning vs Prompt Engineering vs RAG Decision Model
#

Approach When to Use Trade-offs
Prompt Engineering Quick prototyping, simple tasks, limited data. No training cost, but limited control; may not capture domain nuance.
RAG Need up‑to‑date knowledge, large document sets, or traceability. Requires retrieval infrastructure; latency added; can still miss context.
Fine‑Tuning Specific style, tone, or complex task where examples exist. Requires curated dataset, training cost, risk of forgetting. Not for dynamic knowledge.

Most enterprise solutions combine all three: fine‑tune a model for domain language, use RAG for factual grounding, and craft prompts for instruction following.

Agent-Based Architecture
#

An agent is an autonomous loop:

  • Planner: decomposes user goal into steps.
  • Tool definitions: a catalog of APIs/functions with descriptions.
  • Executor: invokes tools based on LLM decisions, using structured function calling.
  • Memory: stores conversation state, plan progress, and tool results.

This pattern enables complex, multi‑step tasks like “book a flight and hotel within budget” where the agent decides to call flight API, check hotel API, compare results, and ask for confirmation.

Batch Inference vs Real-Time Inference
#

  • Batch inference: process large datasets asynchronously. Uses Azure ML pipelines, endpoints with batch deployments, or Azure Databricks. Cost‑efficient, high throughput, higher latency (minutes to hours).
  • Real‑time inference: low‑latency, synchronous predictions via managed online endpoints (Azure ML), Container Apps, or AKS. Suitable for user‑facing features.

Many systems use both: real‑time for user interactions, batch for nightly scoring or content processing.

Multi-Model Orchestration Systems
#

Complex scenarios may route requests to different models based on content, cost, or latency. For example:

  • Light model for simple greetings, heavy model for complex analysis.
  • A router model classifies intent and selects the appropriate downstream model.
  • Azure AI Foundry prompt flow or custom orchestration (Semantic Kernel) can implement multi‑model graphs.

5. AI Design Decisions
#

When to Use RAG vs Fine-Tuning
#

  • RAG is preferred when the knowledge base changes frequently, factual accuracy is critical, or traceability to sources is required.
  • Fine‑tuning is appropriate when you need a specific behaviour, tone, or format that cannot be achieved through prompting alone, and you have a high‑quality dataset.
  • Combined: use fine‑tuning to teach the model how to use tools or follow a specific persona, and RAG to provide fresh knowledge.

Hosted API Models vs Self-Managed ML Models
#

  • Hosted API models (Azure OpenAI, AI Services): zero infrastructure management, global availability, built‑in safety and monitoring. Best for teams that want fast integration.
  • Self‑managed models (Azure ML endpoints, AKS with custom containers): full control over model version, scaling, and cost. Required for fine‑tuned open‑source models (Llama, Mistral) or when data cannot leave a VNet.

Enterprises often start with APIs and migrate to self‑managed for cost control or compliance at scale.

Latency vs Accuracy Trade-offs
#

  • Lower latency (e.g., streaming responses, smaller models) may sacrifice answer quality.
  • Higher accuracy (e.g., CoT reasoning, multi‑step tool calls) adds latency.
  • Design with streaming to show partial results, and allow users to cancel long‑running agent tasks.

Model Selection Strategies
#

Select based on:

  • Task: text generation, embedding, image, speech.
  • Quality: benchmark scores, human evaluation.
  • Cost: token pricing, provisioned throughput.
  • Latency: time to first token, total time.
  • Safety: content filtering strength, alignment.

Azure AI Foundry model catalog offers comparison and evaluation tools.

Data Freshness vs Training Cost
#

  • Fresh data via RAG is near real‑time but adds retrieval cost.
  • Retraining/fine‑tuning on new data is periodic and computationally expensive.
  • For rapidly changing information (e.g., stock prices), RAG is essential; fine‑tuning may be updated monthly.

Single-Agent vs Multi-Agent Systems
#

  • Single agent: simpler to manage, debug, and secure; suitable for personal assistants or domain‑specific copilots.
  • Multi‑agent: distribute tasks across specialized agents, enabling complex workflows and parallel execution. Increases coordination overhead and security complexity.

Choose multi‑agent when tasks naturally decompose into distinct domains with different tools and safety requirements.


6. AI in Enterprise Architecture
#

Customer-Facing Applications
#

AI powers chatbots, recommendation engines, and personalized content. Architectural considerations:

  • Integrate with existing web/mobile backends via APIs.
  • Use APIM for rate limiting and authentication.
  • Ensure low latency with regional deployments and streaming.

Enterprise Automation Workflows
#

AI can classify emails, extract invoice data, or generate reports. Integration with Logic Apps or Power Automate connects AI services to business processes. Use Azure AI Services for document understanding, and LLMs for summarization.

Data Analytics Augmentation
#

Analysts use natural language queries (“show sales by region”) against semantic models in Power BI, or use AI to generate insights from data stored in Synapse. The architecture involves an orchestration layer that translates NL to SQL (or DAX) and formats results.

Microservices AI Integration
#

AI capabilities are exposed as internal services:

  • An “embedding service” (Container App) that other services call.
  • A “recommendation service” that combines rules and ML models.
  • These services follow the same identity, networking, and observability patterns as other microservices.

Decision Support Systems
#

AI models provide predictions that inform human decisions (loan approval, fraud alerts). The architecture must provide explanations (Responsible AI dashboards) and audit trails, often integrating with Azure SQL for structured decisions and Cosmos DB for event logs.


7. AI + LLM + Agent Systems
#

LLM-Based Application Design
#

A typical LLM‑powered app consists of:

  • Client (web, mobile) sending prompts to an API.
  • API service (App Service, Container Apps) that manages sessions, constructs prompts, and calls Azure OpenAI.
  • Optional RAG retrieval before calling the LLM.
  • Response streaming via WebSockets or SSE to the client.

The API service uses managed identity to authenticate to Azure OpenAI, and logs prompts and completions for monitoring.

RAG Pipelines End‑to‑End Architecture
#

  1. Ingestion: blob upload triggers a Function that chunks documents, calls the embedding model, and indexes chunks into Azure AI Search.
  2. Query: user query → embedding → search → re‑rank → prompt assembly → LLM generation.
  3. Security: search index applies user‑level security filters so retrieved documents are permission‑scoped.

The pipeline components are stateless and event‑driven, scaling with demand.

Agent Execution Systems
#

Agent runtime components:

  • Agent host: Container Apps or Durable Functions executing the reasoning loop.
  • Tool registry: a list of tool definitions stored in Cosmos DB or config, each with an associated API endpoint.
  • Identity propagation: the agent obtains an on‑behalf‑of token for the user, ensuring downstream APIs enforce per‑user authorization.
  • State persistence: the agent’s plan and conversation state stored in Cosmos DB for resilience.

Multi-Agent Coordination Systems
#

Agents communicate via messaging (Service Bus) or events (Event Grid). A supervisor agent can delegate to specialist agents (retrieval agent, action agent). Coordination requires:

  • Shared correlation ID for tracing.
  • Timeout and retry policies.
  • Conflict resolution when two agents attempt conflicting actions.

Guardrails and Safe AI Execution
#

Safe AI systems implement:

  • Input guards: Azure AI Content Safety for prompt injection and harmful content.
  • Output guards: content filtering on LLM responses.
  • Tool guardrails: validate tool parameters against allowed schemas; reject dangerous operations.
  • Human‑in‑the‑loop: for high‑risk actions, require user confirmation via an approval workflow (Logic Apps).

8. AI + Data + Compute Integration
#

AI systems are heavily dependent on other cloud domains.

Compute Scaling for Inference/Training
#

  • Inference: real‑time endpoints on Azure ML or AKS scale based on request metrics (CPU, GPU, custom). Use KEDA for event‑driven scaling in Container Apps.
  • Training: Azure ML compute clusters scale out dynamically. Spot VM priority reduces cost for non‑critical jobs. Use the right VM SKU (GPU: NCasT4_v3, ND A100 v4).

Storage for Datasets and Embeddings
#

  • Training data: Data Lake Storage Gen2, versioned and partitioned.
  • RAG document store: Blob Storage or Data Lake with lifecycle management.
  • Embedding cache: Redis stores frequently used embeddings to reduce re‑computation cost.

Databases for Structured Memory
#

  • Agent memory: Cosmos DB for conversation state and user preferences; Azure SQL for transactional data that agents query.
  • Feature stores: Azure ML Managed Feature Store for online/low‑latency feature serving.

Networking for Low-Latency Inference APIs
#

  • Deploy inference endpoints in the same region as consuming services, with VNet peering or in‑VNet integration.
  • Use private endpoints for Azure OpenAI and AI Search to avoid internet latency and exposure.
  • Azure Front Door for global routing of API traffic to nearest region.

Observability for Model Performance Tracking
#

(Further expanded in Section 10, but dependency noted here.)


9. AI Security & Governance
#

Prompt Injection Risks
#

Attackers can craft prompts that manipulate model behaviour. Mitigations:

  • Use system prompts that are immutable and clearly separated from user input.
  • Sanitize user input and restrict length.
  • Employ Azure AI Content Safety to detect jailbreak patterns.
  • Validate tool parameters before execution.

Data Leakage in LLM Systems
#

Models may inadvertently reveal training data or information from RAG contexts. Protections:

  • Use your own Azure OpenAI deployment (not shared public API) with private endpoints.
  • Apply strict RBAC on data stores.
  • Monitor outputs for data loss prevention (DLP) patterns.
  • Limit retrieved document count to minimize exposure.

Model Access Control
#

  • Use Azure RBAC and managed identity to grant least privilege to models.
  • In Azure OpenAI, restrict model deployments and rate limits per subscription.
  • Apply network restrictions (IP firewall, VNet) to model endpoints.

Responsible AI Principles
#

Microsoft’s Responsible AI framework: fairness, reliability & safety, privacy & security, inclusiveness, transparency, accountability. In Azure:

  • Azure AI Content Safety: configurable filters for harmful content.
  • Responsible AI dashboards in Azure ML: error analysis, fairness assessment, interpretability.
  • System messages can include ethical guidelines.
  • Human review for high‑impact decisions.

AI Policy Enforcement in Enterprise Systems
#

Use Azure Policy to enforce that:

  • AI services (Azure OpenAI, Cognitive Services) use private endpoints.
  • Content filtering is enabled.
  • Diagnostic settings are configured.
  • Only approved models (e.g., GPT‑4 without fine‑tuning) are allowed.

10. Observability & MLOps (AI-300 Focus)
#

Model Performance Monitoring
#

Monitor inference endpoints for:

  • Latency (p50, p95, p99) and throughput.
  • Error rates (4xx, 5xx, model errors).
  • Token usage and cost (for LLMs).
  • Data drift: input feature distributions compared to training baseline.
  • Model accuracy: if ground truth or feedback is available, compute evaluation metrics.

Drift Detection
#

Azure ML data drift monitors compare baseline data with current inference data. Set up alerts to trigger retraining when drift exceeds thresholds.

Pipeline Automation
#

Azure ML pipelines (or Azure DevOps/GitHub Actions) orchestrate:

  • Data preparation
  • Model training (including sweep for hyperparameters)
  • Model evaluation
  • Model registration
  • Deployment to staging/production with approval gates.

Reusable components and parameters enable rapid experimentation while maintaining auditability.

CI/CD for ML Models
#

MLOps combines ML pipelines with CI/CD:

  • Training pipeline triggered on code change or schedule.
  • Model promotion: after evaluation, model is registered, then deployed to staging automatically, with manual approval for production.
  • Rollback: ability to revert to previous model version if errors spike.

Evaluation Metrics for AI Systems
#

For LLM-based applications:

  • Groundedness: does the answer reference source documents?
  • Relevance: are retrieved documents relevant to the query?
  • Coherence: is the response logically consistent?
  • Fluency: grammar, readability.
  • Safety: check for harmful content.

Azure AI Foundry and Azure ML provide built‑in evaluation flows.


11. Certification Mapping
#

Certification AI & ML Domain Relevance
AI-900 Foundational understanding of AI workloads, model types, and Azure AI services.
AI-103 Building AI applications: using Azure OpenAI, implementing RAG, orchestrating agents, securing AI endpoints, and monitoring AI calls.
AI-300 Architecting MLOps: designing training pipelines, model management, inference strategies, drift monitoring, and responsible AI.
AZ-305 Integrating AI into enterprise architecture: compute, data, networking, security, and governance requirements for AI workloads.
AZ-104 Basic operations: provisioning AI services, configuring networking and identity for AI resources, monitoring usage.
GH-600 Agent system design: autonomous execution loops, tool calling, memory, secure multi‑agent systems, and observability of agent behaviour.

12. Real-World Architecture Example
#

Scenario: An enterprise knowledge assistant that answers employee questions based on internal documentation, and can perform actions like booking a meeting room.

Components:

  • User interface: Bot front‑end in Microsoft Teams, connected to an Azure Bot Service.
  • API layer: Azure API Management secures and throttles incoming requests.
  • Orchestration service: Azure Container Apps running a Python-based agent framework (e.g., Semantic Kernel) that handles the agent loop.
  • Model endpoint: Azure OpenAI (GPT‑4) accessed via Private Endpoint, with content filtering enabled.
  • RAG pipeline:
    • Document storage: SharePoint documents synced to Azure Blob Storage.
    • Ingestion: Logic App triggers an Azure Function on file upload, which chunks and embeds documents, then indexes into Azure AI Search.
    • Retrieval: at query time, the agent retrieves top‑5 chunks via AI Search with security filters based on the user’s group memberships.
  • Agent tools:
    • “SearchKnowledgeBase” (RAG retrieval).
    • “FindMeetingRoom” (calls internal room booking API via APIM with on‑behalf‑of token).
    • “LogTicket” (creates a support ticket in ServiceNow via its API).
  • Memory: Azure Cosmos DB stores conversation history and user preferences (long‑term). Redis caches session context.
  • MLOps for fine‑tuning: A separate Azure ML pipeline fine‑tunes a smaller model on anonymised conversations to improve intent classification, which is used in the router to decide whether to use RAG or a direct tool call.
  • Observability:
    • Application Insights traces the end‑to‑end flow with correlation IDs.
    • Custom metrics: token consumption per query, retrieval precision, tool call success rate.
    • Alerts on: high token usage (potential abuse), agent loop exceeding max steps (10), RAG retrieval returning zero results (possible index issue).
  • Security:
    • All AI services are VNet‑integrated with private endpoints.
    • Agent uses user‑assigned managed identity; tool calls carry delegated user tokens.
    • Sensitive tool “LogTicket” requires explicit user confirmation via an Adaptive Card in Teams.

Flow:

  1. User asks: “Book a meeting room near Building A and summarise the Q3 earnings report.”
  2. Agent router classifies intent: (a) retrieve document, (b) perform action.
  3. Agent retrieves relevant chunks from AI Search; notes that room booking requires building preference stored in user memory (Cosmos DB).
  4. Agent generates a plan: search docs, find room availability, then summarise.
  5. Executes RAG tool, obtains summary. Calls room booking API, which responds with available rooms. Agent presents both to user.
  6. User confirms room; agent executes booking via tool call (with user token).
  7. Full trace and chat log stored; cost per interaction recorded.

This architecture demonstrates the integration of AI models, RAG, agent tools, MLOps, and enterprise security into a production‑ready, scalable system.