How to Prevent Data Loss in Event Ingestion During Downstream Failures #
Exam Context: AWS SAP-C02
Scenario Category: Event-Driven
Decision Focus: Introducing SQS buffering between ingestion and processing to guarantee durability, enable retries, and isolate failures with DLQs under unreliable downstream dependencies
While preparing for the AWS SAP-C02, many candidates get confused by when to use SQS vs. EventBridge vs. Lambda retries. In the real world, this is fundamentally a decision about decoupling critical dependencies from synchronous API paths while maintaining cost efficiency. Let’s drill into a simulated scenario.
The Scenario #
GlobalSenseTech Inc. operates a climate monitoring platform that processes telemetry from 8,500 distributed environmental sensors worldwide. Each sensor transmits temperature, humidity, and air quality readings every 60 seconds via HTTPS to an Amazon API Gateway REST API endpoint.
The current architecture:
- API Gateway invokes a Lambda function synchronously
- Lambda calls a third-party geospatial enrichment service (adds location metadata like elevation, terrain type)
- The third-party service has unpredictable availability (recent outages: 3-4 times/week, lasting 10-45 minutes)
- During outages, Lambda times out after 15 seconds, API Gateway returns HTTP 504, and sensor data is permanently lost
Recent incident: A 2-hour service outage resulted in 120,000 lost readings, preventing climate trend analysis for a critical research contract.
Key Requirements #
Design a solution that:
- Eliminates data loss during third-party service failures
- Allows failed data to be automatically reprocessed when the service recovers
- Maintains the existing API Gateway endpoint (sensors cannot be reconfigured)
- Minimizes operational overhead and cost
The Options #
- A) Create an Amazon SQS queue and configure it as the API Gateway’s dead letter queue.
- B) Create two SQS queues (primary and DLQ). Update API Gateway to use a SERVICE integration to the primary queue. Configure the DLQ as the primary queue’s dead letter queue. Update Lambda to poll from the primary queue.
- C) Create two EventBridge event buses (primary and secondary). Update API Gateway to use a SERVICE integration to the primary event bus. Configure an EventBridge rule to route events to Lambda. Set the secondary event bus as Lambda’s failure destination.
- D) Create a custom EventBridge event bus and configure it as the Lambda function’s failure destination.
Correct Answer #
Option B: Create a primary SQS queue with a DLQ, integrate API Gateway directly with SQS, and configure Lambda as the queue consumer.
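The direct API Gateway to SQS integration usually relies on a mapping template that turns the incoming HTTPS body into a form-encoded `SendMessage` request (commonly written in VTL as `Action=SendMessage&MessageBody=$util.urlEncode($input.body)`). The sketch below reproduces that transformation in Python purely for illustration; the sensor field names are hypothetical and not taken from the scenario.

```python
from urllib.parse import parse_qs, urlencode


def build_send_message_body(sensor_payload: str) -> str:
    """Form-encode a raw sensor payload as an SQS SendMessage request body."""
    return urlencode({"Action": "SendMessage", "MessageBody": sensor_payload})


# Hypothetical sensor reading; the field names are illustrative only.
reading = '{"sensor_id": "s-0001", "temp_c": 21.4, "humidity": 0.52}'
body = build_send_message_body(reading)

# Decoding the form body gives back the original payload unchanged,
# which is what the queue consumer will later receive as the message body.
decoded = parse_qs(body)
```

Because the payload is stored verbatim as the message body, the sensors keep posting to the same endpoint with the same request format.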
Step-by-Step Winning Logic #
This solution solves three problems simultaneously:
- Decoupling the API path from the dependency
  By inserting SQS between API Gateway and Lambda, sensor transmissions return HTTP 200 as soon as the message is queued (typically within tens of milliseconds), regardless of third-party service health. The synchronous API contract is preserved while the backend becomes asynchronous.
- Automated retry with visibility
  Messages that fail Lambda processing (due to third-party timeouts) become visible again after the visibility timeout and are retried; each receive increments the message's receive count. After exceeding maxReceiveCount (e.g., 3 attempts), messages move to the DLQ where they can be:
  - Monitored via CloudWatch alarms
  - Manually inspected for systemic issues
  - Re-driven to the primary queue when the service recovers (using SQS dead-letter queue redrive)
- Cost efficiency at scale
  - SQS Standard Queue: $0.40 per million requests after 1M free tier/month
  - Lambda invocations: Charged only for actual processing (batch size 10 = 3.72M invocations/month)
  - No intermediary costs: Direct API Gateway → SQS integration eliminates the original Lambda invocation for every API call
Why the DLQ is critical: Without it, messages that keep failing would cycle through retries until the queue's retention period expires and would then be deleted, reintroducing data loss. The DLQ acts as a durable parking lot for failures, enabling postmortem analysis and recovery.
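As a rough sketch of how the primary queue could be wired to its DLQ, the helper below builds the SQS attributes involved. `RedrivePolicy`, `VisibilityTimeout`, and `MessageRetentionPeriod` are real SQS attribute names; the ARN, the queue name in the comment, and the 90-second visibility timeout (six times the 15-second function timeout from the scenario) are assumptions made for illustration.

```python
import json

MAX_RETENTION_SECONDS = 14 * 24 * 60 * 60  # SQS maximum retention: 14 days


def primary_queue_attributes(dlq_arn: str, max_receive_count: int = 3,
                             visibility_timeout_s: int = 90) -> dict:
    """Build SQS attributes that attach a DLQ to the primary queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    return {
        # SQS expects every attribute value as a string.
        "VisibilityTimeout": str(visibility_timeout_s),
        "MessageRetentionPeriod": str(MAX_RETENTION_SECONDS),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


# Hypothetical ARN, used only for illustration.
attrs = primary_queue_attributes("arn:aws:sqs:us-east-1:123456789012:sensor-dlq")

# With boto3 installed, the attributes would be passed along these lines:
#   boto3.client("sqs").create_queue(QueueName="sensor-primary", Attributes=attrs)
```

Keeping the visibility timeout well above the function timeout avoids a message being redelivered while an earlier attempt is still running.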
Professional-Level Analysis #
This section breaks down the scenario from a professional exam perspective, focusing on constraints, trade-offs, and the decision signals used to eliminate incorrect options.
Expert Deep Dive: Why Options Fail #
This walkthrough explains how the exam expects you to reason through the scenario step by step, highlighting the constraints and trade-offs that invalidate each incorrect option.
The Traps (Distractor Analysis) #
This section explains why each incorrect option looks reasonable at first glance, and the specific assumptions or constraints that ultimately make it fail.
The difference between the correct answer and the distractors comes down to one decision assumption most candidates overlook.
Why not Option A?
API Gateway does not support dead letter queues. DLQs are a feature of services such as:
- Lambda asynchronous invocations
- SNS subscriptions
- SQS queues (as a target for other queues)
This option confuses Lambda’s DLQ capability with API Gateway, which lacks native failure capture beyond access logs.
Why not Option C?
EventBridge is over-engineered for this use case:
- Cost: $1.00/million events for ingestion + $1.00/million for rule matching = 5× more expensive than SQS
- Complexity: Requires managing event buses, schemas, and rules vs. SQS’s single-purpose queue model
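- Insufficient retry: EventBridge invokes Lambda asynchronously, so a function that fails because the enrichment service is down gets only Lambda's asynchronous retries (two by default) before the event is discarded or sent to a failure destination. Surviving a 10-45 minute outage would require custom retry logic in Lambda or adding SQS anyway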
- Use case mismatch: EventBridge excels at event routing to multiple consumers (e.g., one sensor reading triggers analytics, archival, and alerting). Here, there’s one consumer (Lambda) and one action (enrich data).
Why not Option D?
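Lambda failure destinations apply only to asynchronous invocations (and to certain event source mappings). In the current architecture API Gateway invokes Lambda synchronously, so the destination is never triggered: the function times out, API Gateway returns 504, and the reading is still lost. Even for asynchronous invocations, a failure destination only records the failure after Lambda's built-in retries are exhausted; it does not hold the event and replay it when the service recovers. Failure destinations are for observability and triggering compensating workflows, not data durability.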
The Solution Blueprint #
This blueprint visualizes the expected solution, showing how services interact and which architectural pattern the exam is testing.
Seeing the full solution end to end often makes the trade-offs, and the failure points of simpler options, immediately clear.
graph TB
A[8,500 Sensors] -->|HTTPS POST| B[API Gateway REST API]
B -->|AWS Service Integration| C[SQS Primary Queue]
C -->|Event Source Mapping<br/>Batch Size: 10| D[Lambda Function]
D -->|Enrichment API Call| E[Third-Party Service]
E -->|Success| F[DynamoDB/S3<br/>Data Lake]
D -->|Failure after 3 retries| G[SQS Dead Letter Queue]
G -->|CloudWatch Alarm| H[Operations Team]
G -.->|DLQ Redrive<br/>When Service Recovers| C
style C fill:#FF9900,stroke:#232F3E,stroke-width:3px,color:#fff
style G fill:#FF4F00,stroke:#232F3E,stroke-width:2px,color:#fff
style D fill:#FF9900,stroke:#232F3E,stroke-width:2px,color:#fff
Diagram Note: Sensor data flows through API Gateway into the primary SQS queue, which Lambda polls in batches. Failed messages automatically move to the DLQ after retry exhaustion, enabling manual inspection and recovery via DLQ redrive once the service is healthy again.
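The batch consumer in the diagram can be sketched as follows. It uses Lambda's partial batch response format (`batchItemFailures` with `itemIdentifier`), which only takes effect when `ReportBatchItemFailures` is enabled on the event source mapping. The `enrich` callable and the `EnrichmentUnavailable` exception are hypothetical stand-ins for the third-party client, and persistence of the enriched reading is left out.

```python
import json
from typing import Callable


class EnrichmentUnavailable(Exception):
    """Raised by the (hypothetical) enrichment client when the service is down."""


def process_batch(event: dict, enrich: Callable[[dict], dict]) -> dict:
    """Process an SQS batch and report only the failed messages for retry."""
    failures = []
    for record in event.get("Records", []):
        try:
            reading = json.loads(record["body"])
            enrich(reading)  # a real handler would also persist the result
        except (EnrichmentUnavailable, ValueError):
            # Only this message returns to the queue; after maxReceiveCount
            # failed receives SQS moves it to the DLQ.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def flaky_enrich(reading: dict) -> dict:
    """Test double that fails for one sensor to simulate an outage."""
    if reading["sensor_id"] == "s-2":
        raise EnrichmentUnavailable("timeout")
    return {**reading, "elevation_m": 120}


event = {"Records": [
    {"messageId": "m-1", "body": '{"sensor_id": "s-1"}'},
    {"messageId": "m-2", "body": '{"sensor_id": "s-2"}'},
]}
result = process_batch(event, flaky_enrich)
```

Reporting failures per message means one unreachable enrichment call does not force the other nine readings in the batch to be reprocessed.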
The Decision Matrix #
This matrix compares all options across cost, complexity, and operational impact, making the trade-offs explicit and the correct choice logically defensible.
At the professional level, the exam expects you to justify your choice by explicitly comparing cost, complexity, and operational impact.
| Option | Est. Complexity | Est. Monthly Cost (37.2M msgs) | Pros | Cons |
|---|---|---|---|---|
| A | Low | N/A | Simple concept | Does not work: API Gateway has no DLQ support |
| B (correct) | Medium | ~$148 ($14.88 SQS + $133 Lambda compute @ 512MB/3s avg) | • Native durability<br>• Auto-retry with DLQ<br>• Lowest cost<br>• Decouples API from dependency | • Requires refactoring to async pattern<br>• DLQ monitoring overhead |
| C | High | ~$744 ($37.20 EventBridge ingestion + $37.20 rules + $133 Lambda + $536 event bus data transfer) | • Event-driven architecture<br>• Easy to add future consumers | • 5× more expensive<br>• Over-engineered for single consumer<br>• Retries too limited for long outages (needs SQS anyway) |
| D | Low | ~$165 ($0.20 EventBridge + $165 Lambda w/ retries) | • Easy to implement | • Data loss: the failure destination does not fire for synchronous API Gateway invocations<br>• Doesn’t solve core problem |
FinOps Deep Dive (Option B):
- SQS: 37.2M requests - 1M free = 36.2M × $0.40/million = $14.48
- Lambda: 3.72M invocations (batch 10) × 3s avg × 512MB (0.5 GB) = 5.58M GB-s
  - Compute: first 400k GB-s free → 5.18M × $0.0000166667 = $86.33
  - Requests: 3.72M invocations - 1M free = 2.72M × $0.20/million = $0.54
- Data Transfer: Negligible (within same region)
- Total: ~$101.35/month (SQS + Lambda combined, excluding storage costs)
(Note: The ~$148 estimate in the matrix also includes storage and additional processing costs not shown in this breakdown.)
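The breakdown can be recomputed from its stated inputs (message volume, batch size, average duration, memory). The sketch below uses the list prices and free-tier allowances quoted in this article; they are assumptions that vary by region and over time, and the model ignores SQS receive and delete requests, storage, and data transfer.

```python
def monthly_cost(messages: float, batch_size: int = 10,
                 avg_seconds: float = 3.0, memory_gb: float = 0.5) -> dict:
    """Estimate monthly SQS + Lambda cost using the article's quoted prices."""
    sqs_price_per_million = 0.40          # USD per million requests
    lambda_price_per_gb_s = 0.0000166667  # USD per GB-second
    lambda_price_per_million = 0.20       # USD per million invocations

    # Free tiers quoted in the article: 1M SQS requests, 1M Lambda
    # invocations, and 400k GB-seconds of Lambda compute per month.
    sqs = max(messages - 1_000_000, 0) / 1e6 * sqs_price_per_million
    invocations = messages / batch_size
    gb_seconds = invocations * avg_seconds * memory_gb
    compute = max(gb_seconds - 400_000, 0) * lambda_price_per_gb_s
    requests = max(invocations - 1_000_000, 0) / 1e6 * lambda_price_per_million

    return {
        "sqs": sqs,
        "lambda_gb_seconds": gb_seconds,
        "lambda_compute": compute,
        "lambda_requests": requests,
        "total": sqs + compute + requests,
    }


costs = monthly_cost(37_200_000)
for item, value in costs.items():
    print(f"{item}: {value:,.2f}")
```

Changing `batch_size` or `avg_seconds` shows how sensitive the total is to Lambda duration, which dominates the bill far more than the queue itself.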
Real-World Practitioner Insight #
This section connects the exam scenario to real production environments, highlighting how similar decisions are made, and often misjudged, in practice.
This is the kind of decision that frequently looks correct on paper, but creates long-term friction once deployed in production.
Exam Rule #
For SAP-C02, when you see ‘data loss during third-party failures’ + ‘existing API endpoint’, immediately think SQS decoupling with DLQ. EventBridge is a distractor unless the scenario requires multi-target routing.
Real World #
In production at GlobalSenseTech, we would add:
- SQS FIFO queue if sensor reading order matters (e.g., calculating rate-of-change metrics). Cost increases to $0.50/million requests, but deduplication prevents duplicate processing during network retries.
- Lambda reserved concurrency (e.g., 50) to prevent the third-party service from being overwhelmed when the DLQ is re-driven. Without throttling, 100k queued messages could scale the consumer up to the concurrency ceiling for SQS event sources (on the order of 1,000 concurrent invocations).
- CloudWatch alarm on ApproximateAgeOfOldestMessage in the DLQ. If messages sit unprocessed for >6 hours, escalate to engineering (may indicate a schema change in the third-party API).
- Cost anomaly detection: A sudden spike in DLQ depth indicates widespread third-party failures, which could trigger vendor SLA credits or justify migrating to an alternative enrichment service.
- Consideration of Step Functions: For complex retry logic (exponential backoff, circuit breaker patterns), Step Functions Standard Workflows could replace Lambda polling, though at roughly 60× the per-unit cost ($25/million state transitions vs. $0.40/million SQS requests).
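One way to avoid flooding the third-party service on recovery is to rate-limit the redrive itself. The sketch below wraps the SQS `StartMessageMoveTask` API, exposed in boto3 as `start_message_move_task`; the ARNs and the 50 messages/second cap are illustrative assumptions. The client is passed in, so the function can be exercised with a stub.

```python
def start_throttled_redrive(sqs_client, dlq_arn: str, primary_arn: str,
                            max_per_second: int = 50) -> str:
    """Start moving DLQ messages back to the primary queue at a capped rate."""
    if max_per_second < 1:
        raise ValueError("max_per_second must be at least 1")
    response = sqs_client.start_message_move_task(
        SourceArn=dlq_arn,
        DestinationArn=primary_arn,
        MaxNumberOfMessagesPerSecond=max_per_second,
    )
    # The task handle can later be used to cancel the move task if needed.
    return response["TaskHandle"]


# With boto3 installed and credentials configured, usage would look like:
#   import boto3
#   handle = start_throttled_redrive(
#       boto3.client("sqs"),
#       "arn:aws:sqs:us-east-1:123456789012:sensor-dlq",      # hypothetical
#       "arn:aws:sqs:us-east-1:123456789012:sensor-primary",  # hypothetical
#   )
```

Capping the redrive rate complements the concurrency limit: the first bounds how fast messages re-enter the queue, the second bounds how many are processed at once.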
The hidden trade-off: SQS retains messages for at most 14 days, so data that sits in the DLQ longer than 2 weeks is permanently lost. For compliance-critical data, you’d add an S3 backup via an SQS → Lambda → S3 pipeline or use Kinesis Data Streams (retention configurable up to 365 days).