How to Transfer Large Genomics Datasets to AWS in a Hybrid Architecture #
Exam Context: AWS SAP-C02
Scenario Category: Hybrid Connectivity
Decision Focus: Choosing a data transfer strategy that balances latency, cost, and operational scalability for large datasets
While preparing for the AWS SAP-C02, many candidates get confused by data transfer service selection and container orchestration patterns. In the real world, this is fundamentally a decision about transfer latency vs. operational complexity vs. workflow orchestration maturity. Let’s drill into a simulated scenario.
The Scenario #
BioGenix Research, a genomics analytics company, currently runs Docker-based bioinformatics pipelines on on-premises servers. Their sequencing machines generate approximately 200 GB of raw data per genome, stored initially on a local SAN (Storage Area Network). Processing each genome takes several hours with adequate compute, but capacity constraints have stretched total turnaround time to several weeks.
The research team has decided to migrate to AWS to:
- Scale compute dynamically based on workload demand
- Reduce processing time from weeks to days
- Leverage their existing AWS Direct Connect high-speed connection
- Store final results in Amazon S3 for long-term access
Operational Context:
- Daily workload: 10-15 genomics processing jobs
- Per-job data size: ~200 GB
- Current tooling: Open-source Docker containers for analysis
- Existing infrastructure: Direct Connect link already established
Key Requirements #
Design a solution that:
- Transfers sequencing data from on-prem SAN to AWS efficiently
- Triggers automated processing upon data arrival
- Executes containerized workloads with appropriate orchestration
- Stores final results in S3
- Optimizes for cost-efficiency at this scale (10-15 jobs/day)
The Options #
- A) Use AWS Snowball Edge devices on a scheduled rotation to transfer sequencing data to AWS. When AWS receives the device and loads data into S3, trigger an AWS Lambda function via S3 event to process the data.
- B) Use AWS Data Pipeline to transfer sequencing data to S3. Configure S3 events to trigger an EC2 Auto Scaling group that launches instances from a custom AMI running Docker containers.
- C) Use AWS DataSync to transfer sequencing data to S3. Configure S3 events to trigger an AWS Lambda function that initiates an AWS Step Functions workflow. Store Docker images in Amazon ECR and use AWS Batch to execute containerized processing jobs.
- D) Deploy AWS Storage Gateway (File Gateway) to transfer sequencing data to S3. Use S3 events to trigger AWS Batch jobs running Docker containers on EC2 instances.
Correct Answer #
Option C: DataSync + Lambda + Step Functions + ECR + Batch
Step-by-Step Winning Logic #
This solution delivers the optimal trade-off across five critical dimensions:
- Transfer Efficiency: AWS DataSync is purpose-built for high-throughput transfers over Direct Connect, automatically handling network optimization, encryption, and data validation, far superior to generic pipelines.
- Orchestration Maturity: AWS Step Functions provides visual workflow management for multi-step genomics pipelines (QC → Alignment → Variant Calling → Reporting), with built-in error handling and retry logic.
- Container Management: Amazon ECR + AWS Batch eliminates the operational burden of maintaining custom AMIs and Auto Scaling groups. Batch handles:
  - Optimal instance selection (Spot/On-Demand mix)
  - Queue management
  - Automatic scaling based on job submission rate
- Cost Optimization: At 10-15 jobs/day, serverless orchestration (Lambda + Step Functions) costs pennies per workflow, while Batch’s Spot integration can reduce compute costs by 70%+ compared to On-Demand instances.
- Existing Asset Leverage: Direct Connect bandwidth is already paid for; DataSync maximizes that sunk investment.
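The S3-event-to-Step-Functions glue is thin enough to sketch. Below is a minimal, hypothetical Lambda handler for this pattern; the `STATE_MACHINE_ARN` environment variable and the input payload shape are assumptions for illustration, not details from the scenario:

```python
import json
import os
import urllib.parse


def execution_input(record):
    """Build the Step Functions input payload from one S3 event record."""
    return {
        "bucket": record["s3"]["bucket"]["name"],
        # S3 event notifications URL-encode object keys (spaces arrive as '+')
        "key": urllib.parse.unquote_plus(record["s3"]["object"]["key"]),
    }


def handler(event, context):
    """Lambda entry point: start one workflow execution per uploaded object."""
    import boto3  # imported lazily so execution_input stays testable offline

    sfn = boto3.client("stepfunctions")
    arns = []
    for record in event.get("Records", []):
        resp = sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # assumed env var
            input=json.dumps(execution_input(record)),
        )
        arns.append(resp["executionArn"])
    return {"startedExecutions": arns}
```

Decoupling the pure `execution_input` helper from the boto3 call keeps the payload logic unit-testable without AWS credentials.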
Professional-Level Analysis #
This section breaks down the scenario from a professional exam perspective, focusing on constraints, trade-offs, and the decision signals used to eliminate incorrect options.
Expert Deep Dive: Why Options Fail #
This walkthrough explains how the exam expects you to reason through the scenario step by step, highlighting the constraints and trade-offs that invalidate each incorrect option.
The Traps (Distractor Analysis) #
This section explains why each incorrect option looks reasonable at first glance, and the specific assumptions or constraints that ultimately make it fail.
The difference between the correct answer and the distractors comes down to one decision assumption most candidates overlook.
Why not A (Snowball Edge)? #
- Latency Death Spiral: Physical device shipping adds 5-7 days per cycle, completely contradicting the “weeks to days” requirement.
- Operational Overhead: Requires coordinating device orders, local data loading, and shipping logistics.
- Use Case Mismatch: Snowball is for petabyte-scale migrations or disconnected environments, not daily 200 GB transfers over an existing Direct Connect link.
- Cost Impact: Device fees ($300+ per transfer) plus shipping vs. DataSync bandwidth charges.
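The latency argument is easy to sanity-check with back-of-envelope math. In the sketch below, the 1 Gbps and 10 Gbps link speeds and the 80% efficiency factor are assumptions; the scenario only says a Direct Connect link exists:

```python
def transfer_hours(size_gb, link_gbps, efficiency=0.8):
    """Estimate wire time in hours; `efficiency` discounts protocol overhead."""
    size_gigabits = size_gb * 8  # gigabytes -> gigabits
    seconds = size_gigabits / (link_gbps * efficiency)
    return seconds / 3600


one_genome_1g = transfer_hours(200, 1.0)    # one 200 GB genome, 1 Gbps link
daily_batch_1g = transfer_hours(3000, 1.0)  # 15 jobs x 200 GB = 3 TB/day
one_genome_10g = transfer_hours(200, 10.0)

# Roughly 0.6 h per genome and ~8.3 h per full daily batch on a 1 Gbps link,
# versus a 5-7 day round trip per Snowball shipping cycle.
```

Even on a modest 1 Gbps link, a day's entire output fits comfortably within the day, which is why shipping devices over an existing Direct Connect link makes no sense here.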
Why not B (Data Pipeline + Custom AMI + ASG)? #
- Architectural Legacy Debt: AWS Data Pipeline is a legacy service in maintenance mode, superseded by Step Functions and AWS Glue for workflow orchestration.
- Operational Complexity: Maintaining custom AMIs with Docker runtime requires:
- Patching and security updates
- Version control for genomics tool dependencies
- ASG launch template management
- Scaling Inefficiency: ASG responds to CloudWatch metrics, not job queue depth; Batch’s native job queue is purpose-built for this pattern.
- FinOps Red Flag: You’re paying for EC2 instance uptime during scale-out delays and scale-in cooldowns.
Why not D (Storage Gateway File Gateway)? #
- Latency Mismatch: File Gateway provides an NFS/SMB interface with local caching, designed for interactive file access, not bulk batch transfers.
- Cache Overhead: Requires provisioning and managing on-premises cache storage (VM or hardware appliance).
- Missing Orchestration: While Batch is correct for execution, the triggering mechanism is incomplete; there is no workflow coordination for multi-step genomics pipelines.
- Cost Impact: File Gateway charges for cache storage, API requests, and data transfer; DataSync is more cost-effective for one-way bulk transfers.
The Solution Blueprint #
This blueprint visualizes the expected solution, showing how services interact and which architectural pattern the exam is testing.
Seeing the full solution end to end often makes the trade-offs (and the failure points of simpler options) immediately clear.
```mermaid
graph TD
    A[On-Prem Sequencing Machine] -->|200GB Raw Data| B[Local SAN Storage]
    B -->|Direct Connect| C[AWS DataSync Agent]
    C -->|Optimized Transfer| D[Amazon S3 Raw Data Bucket]
    D -->|S3 Event Notification| E[AWS Lambda Orchestrator]
    E -->|StartExecution API| F[AWS Step Functions Workflow]
    F --> G{Quality Check Step}
    G -->|Pass| H[AWS Batch Job: Alignment]
    G -->|Fail| I[SNS Notification]
    H -->|Pull Image| J[Amazon ECR - Genomics Tools]
    H -->|Execute on| K[EC2 Spot Instances]
    K -->|Processed Data| L[S3 Results Bucket]
    F --> M[Batch Job: Variant Calling]
    M --> K
    M --> L
    F --> N[Batch Job: Report Generation]
    N --> L
    style C fill:#FF9900,stroke:#232F3E,color:#FFFFFF
    style F fill:#FF4F8B,stroke:#232F3E,color:#FFFFFF
    style H fill:#FF9900,stroke:#232F3E,color:#FFFFFF
    style J fill:#FF9900,stroke:#232F3E,color:#FFFFFF
    style K fill:#FF9900,stroke:#232F3E,color:#FFFFFF
```
Diagram Note: DataSync handles optimized transfer over Direct Connect; S3 events trigger Lambda to orchestrate Step Functions; Batch pulls container images from ECR and executes multi-stage genomics workflows on cost-optimized Spot instances.
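The orchestration layer in the diagram maps naturally onto the Amazon States Language. Here is a trimmed sketch of the QC → Alignment branch as a Python dict; the queue/definition ARNs, SNS topic, and the `$.qcPassed` input field are placeholders, not details from the scenario:

```python
# Placeholder topic ARN for QC-failure alerts
QC_FAILED_TOPIC = "arn:aws:sns:us-east-1:123456789012:qc-failed"


def genomics_state_machine(job_queue_arn, job_def_arn):
    """Minimal ASL definition: branch on QC result, then run alignment."""
    return {
        "StartAt": "QualityCheck",
        "States": {
            "QualityCheck": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.qcPassed", "BooleanEquals": True,
                     "Next": "Alignment"}
                ],
                "Default": "NotifyFailure",
            },
            "Alignment": {
                "Type": "Task",
                # The .sync suffix makes Step Functions wait for the Batch
                # job to finish before moving on
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "alignment",
                    "JobQueue": job_queue_arn,
                    "JobDefinition": job_def_arn,
                },
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {"TopicArn": QC_FAILED_TOPIC,
                               "Message": "QC failed"},
                "End": True,
            },
        },
    }
```

The `batch:submitJob.sync` service integration is the key exam signal: Step Functions owns the retry/error handling, so the container itself stays a plain bioinformatics tool.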
The Decision Matrix #
This matrix compares all options across cost, complexity, and operational impact, making the trade-offs explicit and the correct choice logically defensible.
At the professional level, the exam expects you to justify your choice by explicitly comparing cost, complexity, and operational impact.
| Option | Est. Complexity | Est. Monthly Cost (15 jobs/day) | Pros | Cons |
|---|---|---|---|---|
| A: Snowball Edge | Low (physical process) | High (~$9,700) • Device fees: $300 × 30 = $9,000 • Shipping: variable • S3 storage: $690 | • Simple concept • No network dependency (ironic here) | • 5-7 day latency per cycle • Defeats “weeks to days” goal • Logistics overhead • Direct Connect waste |
| B: Data Pipeline + ASG | High (legacy + manual scaling) | Medium (~$1,650) • Data Pipeline: ~$100 • EC2 (On-Demand, inefficient scaling): ~$800 • AMI storage/snapshots: $50 • S3: $690 | • Familiar EC2 patterns | • Legacy service (Data Pipeline) • AMI maintenance burden • ASG scaling lag • No native workflow orchestration |
| C: DataSync + Step Functions + Batch ✅ | Medium (managed services) | Low (~$1,050) • DataSync: ~$150 (3 TB @ $0.05/GB) • Step Functions: ~$2 (450 transitions) • Lambda: ~$1 • Batch (70% Spot): ~$200 • ECR: ~$7 • S3: $690 | • Purpose-built for each layer • 70% compute savings (Spot) • Visual workflow management • No server maintenance • Maximizes Direct Connect ROI | • Requires Step Functions workflow design • Learning curve for Batch job definitions |
| D: File Gateway + Batch | Medium-High (cache management) | Medium (~$1,300) • File Gateway (cache + API): ~$400 • Batch: ~$200 • S3: $690 | • NFS/SMB compatibility • Batch for execution | • Wrong transfer pattern (interactive vs. batch) • Cache management overhead • No orchestration for multi-step workflows • Higher transfer costs than DataSync |
FinOps Insight:
Option C carries the lowest estimated monthly cost of the four options while also cutting operational overhead by eliminating AMI maintenance, ASG tuning, and cache management. The 70% Spot discount in Batch is enabled by job queue flexibility, a compute saving of several hundred dollars a month that buys back real engineering time.
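The Spot saving is straightforward to model. In the rough sketch below, the 4 hours/job runtime and the $0.44/hour On-Demand rate are illustrative assumptions, not figures from the scenario:

```python
def monthly_compute(jobs_per_day, hours_per_job, od_rate_per_hour,
                    spot_discount=0.70, days=30):
    """Return (on_demand_cost, spot_cost) for a month of Batch compute."""
    on_demand = jobs_per_day * hours_per_job * od_rate_per_hour * days
    return on_demand, on_demand * (1 - spot_discount)


# 15 jobs/day x 4 h/job at an assumed $0.44/h On-Demand rate
od, spot = monthly_compute(15, 4, 0.44)
# od ~ $792/month On-Demand vs spot ~ $238/month: a ~$550/month saving
# at these assumed rates.
```

The exact dollars depend on instance choice and interruption tolerance, but the 70% multiplier dominates whatever rate you plug in.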
Real-World Practitioner Insight #
This section connects the exam scenario to real production environments, highlighting how similar decisions are madeâand often misjudgedâin practice.
This is the kind of decision that frequently looks correct on paper, but creates long-term friction once deployed in production.
Exam Rule #
“For the SAP-C02 exam, when you see existing Direct Connect + containerized workloads + multi-hour jobs, choose DataSync + Batch. If workflow orchestration is mentioned, add Step Functions. Always prefer managed services over custom infrastructure at Professional level.”
Real World #
In production, we’d likely:
- Add Observability: Integrate AWS Batch with CloudWatch Container Insights and AWS X-Ray for Step Functions tracing.
- Implement Cost Controls:
  - Batch compute environments with Spot allocation strategies and fallback to On-Demand.
  - S3 Lifecycle policies to transition raw data to Glacier after 30 days (genomics data is rarely reprocessed).
  - DataSync bandwidth throttling during business hours if Direct Connect is shared.
- Enhance Data Governance:
  - S3 Object Lock for regulatory compliance (HIPAA/GDPR for genetic data).
  - AWS Lake Formation for metadata cataloging if analytics teams need to query results.
- Operational Maturity:
  - EventBridge instead of direct Lambda triggers for better decoupling and filtering.
  - AWS Batch multi-node parallel jobs if genomics tools support MPI for distributed processing.
  - Amazon FSx for Lustre as a scratch filesystem if intermediate processing requires high-IOPS shared storage.
- Hybrid Optimization:
  - Consider AWS Outposts if regulatory requirements prevent raw data from leaving premises: run Batch locally, store results in S3.
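As one concrete example of the cost controls above, the 30-day Glacier transition can be expressed as an S3 lifecycle configuration. The bucket name below is a placeholder:

```python
RAW_BUCKET = "biogenix-raw-data"  # placeholder bucket name


def raw_data_lifecycle(days=30):
    """Lifecycle config moving raw sequencing data to Glacier after `days`."""
    return {
        "Rules": [{
            "ID": f"raw-to-glacier-{days}d",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix = whole bucket
            "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
        }]
    }


def apply_lifecycle(bucket=RAW_BUCKET):
    import boto3  # lazy import so the config builder stays testable offline

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=raw_data_lifecycle()
    )
```

Keeping the rule as a plain dict also makes it easy to lift into CloudFormation or Terraform later.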
The exam tests service selection logic; production requires defense-in-depth across cost, security, and operational maturity.