Managed AI vs Custom PHI Detection | SAA-C03

Candidates preparing for the AWS SAA-C03 often get lost in AWS’s sprawling AI/ML service portfolio. In practice, questions like this one come down to a build-vs-buy decision: which option delivers the capability with the least complexity and operational overhead? Let’s drill into a simulated scenario.

The Scenario

MediCare Connect, a regional healthcare network, has recently deployed a serverless document ingestion system using Amazon API Gateway and AWS Lambda. Healthcare providers upload medical reports in PDF and JPEG formats through a RESTful API endpoint.

The compliance team has mandated that the system must automatically scan all uploaded documents to identify and flag Protected Health Information (PHI) such as patient names, dates of birth, medical record numbers, and diagnosis codes to ensure HIPAA compliance workflows are triggered appropriately.

The engineering team has limited machine learning expertise and wants to avoid managing ML infrastructure or model training pipelines.

Key Requirements

Modify the existing Lambda function to extract text from both PDF and JPEG documents, then identify PHI within that text, with minimal operational overhead.

The Options

A) Integrate an existing Python library (like pytesseract) into the Lambda function to extract text from reports, then use regular expressions and custom logic to identify PHI patterns.
B) Use Amazon Textract to extract text from reports. Deploy Amazon SageMaker with a pre-trained NLP model to identify PHI from the extracted text.
C) Use Amazon Textract to extract text from reports. Use Amazon Comprehend Medical to identify PHI from the extracted text.
D) Use Amazon Rekognition to extract text from reports. Use Amazon Comprehend Medical to identify PHI from the extracted text.

Correct Answer

Option C.

Step-by-Step Winning Logic

This solution represents the optimal trade-off for the SAA-C03 level because:

Service Specialization: Amazon Textract is purpose-built for document text extraction, combining OCR with layout analysis. It accepts both PDF and JPEG input; multi-page PDFs go through its asynchronous operations with the file stored in Amazon S3.
Domain-Specific AI: Amazon Comprehend Medical is a HIPAA-eligible, healthcare-specialized NLP service. Its DetectPHI operation flags PHI entities such as names, dates, IDs, and addresses, and its DetectEntitiesV2 operation extracts clinical entities such as medical conditions and medications, all without any model training.
Zero Infrastructure Management: Both services are fully managed, serverless, and API-driven—perfect for Lambda integration with no servers to patch, no models to train.
Compliance Readiness: Comprehend Medical is designed for healthcare use cases and can identify PHI categories aligned with HIPAA requirements.
Pay-per-Use Economics: Costs scale linearly with document volume—no upfront ML infrastructure costs.
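The two managed calls described above can be sketched in a few lines of Python with boto3. This is a minimal illustration, not the article's reference implementation: the function names, the `MIN_SCORE` cutoff, and the choice to return offsets instead of the matched text are assumptions made for the example. It covers the synchronous path only (a JPEG, PNG, or single-page PDF already stored in S3).

```python
from typing import Any

MIN_SCORE = 0.80  # hypothetical confidence cutoff; tune it against validated samples


def lines_from_textract(response: dict[str, Any]) -> str:
    """Join the LINE blocks of a Textract response into plain text."""
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    )


def phi_findings(response: dict[str, Any], min_score: float = MIN_SCORE) -> list[dict[str, Any]]:
    """Keep PHI entities at or above the cutoff, recording offsets rather than the text."""
    return [
        {
            "type": entity["Type"],
            "score": entity["Score"],
            "begin": entity["BeginOffset"],
            "end": entity["EndOffset"],
        }
        for entity in response.get("Entities", [])
        if entity.get("Score", 0.0) >= min_score
    ]


def scan_image(bucket: str, key: str) -> list[dict[str, Any]]:
    """Run OCR and PHI detection for one document stored in S3."""
    import boto3  # imported lazily so the helpers above can be exercised offline

    textract = boto3.client("textract")
    medical = boto3.client("comprehendmedical")

    ocr = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    text = lines_from_textract(ocr)
    if not text:
        return []  # DetectPHI rejects empty input, so skip the call
    return phi_findings(medical.detect_phi(Text=text))
```

Keeping the response parsing in small pure functions means the Lambda handler itself stays a thin wrapper, and the parsing logic can be unit-tested without AWS credentials.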

💎 The Architect’s Deep Dive: Why Options Fail

Correct Answer

Option C: Use Amazon Textract to extract text from reports, then use Amazon Comprehend Medical to identify PHI from the extracted text.

The Traps (Distractor Analysis)

Why not Option A?
- Accuracy Risk: Open-source OCR libraries like pytesseract have significantly lower accuracy on complex medical documents compared to Textract.
- Operational Overhead: You must maintain OCR libraries, handle version updates, and continuously refine regex patterns for PHI detection—this violates the “minimal operational overhead” requirement.
- Compliance Gaps: Custom regex-based PHI detection is error-prone and difficult to validate against HIPAA standards.
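- Lambda Layer Complexity: pytesseract is only a Python wrapper, so the Tesseract binary and its language data must also be packaged (as a layer or container image), which increases deployment package size and cold start times.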
Why not Option B?
- Unnecessary Complexity: Amazon SageMaker requires you to either train a custom model or deploy a pre-trained model to an always-on endpoint (or serverless inference).
- Operational Overhead: Managing SageMaker endpoints (scaling, monitoring, updating) introduces significant operational burden compared to a simple API call.
- Cost Structure: Real-time SageMaker endpoints incur hourly instance costs even during idle periods, whereas Comprehend Medical is billed purely per request.
- Time to Value: Selecting, deploying, and validating a SageMaker model can take weeks; Comprehend Medical is available immediately through an API call.
Why not Option D?
- Wrong Tool for the Job: Amazon Rekognition is designed for image and video analysis (object detection, face recognition, celebrity identification), not document text extraction.
- Limited Text Extraction: While Rekognition has a DetectText API, it’s optimized for scene text (street signs, billboards) and performs poorly on dense document layouts like medical reports.
- No PDF Support: Rekognition does not natively support PDF files—only image formats.
- Inferior to Textract: Textract’s document-specific features (table extraction, form parsing, layout preservation) far exceed Rekognition’s text detection capabilities for this use case.
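To make the Option A maintenance concern concrete, here is a small sketch of what hand-written PHI detection tends to look like. The two patterns and the sample strings are invented for illustration; they are not taken from any real system. The point is that each pattern encodes one formatting convention, and a report that phrases the same facts differently passes through unflagged.

```python
import re

# Hypothetical hand-written patterns of the kind Option A would require.
PHI_PATTERNS = {
    "DOB": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
}


def regex_phi(text: str) -> dict[str, list[str]]:
    """Return every match for each hand-written PHI pattern."""
    return {label: pattern.findall(text) for label, pattern in PHI_PATTERNS.items()}


# Same facts, two phrasings: only the first matches the patterns.
report_a = "DOB: 03/14/1984, MRN: 00123456"
report_b = "Born March 14, 1984. Medical record no. 00123456"

print(regex_phi(report_a))  # both fields found
print(regex_phi(report_b))  # both lists empty: the PHI is missed
```

Every missed phrasing means another pattern to write, test, and maintain, which is the ongoing overhead the question asks you to avoid.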

The Architect Blueprint

graph TD
    A[Healthcare Provider] -->|Uploads PDF/JPEG| B[API Gateway]
    B --> C[Lambda Function]
    C -->|Document Stored| D[S3 Bucket]
    C -->|Async Extract Text| E[Amazon Textract]
    E -->|Raw Text JSON| C
    C -->|Detect PHI Entities| F[Amazon Comprehend Medical]
    F -->|PHI Metadata| C
    C -->|Store PHI Flags| G[DynamoDB]
    C -->|Trigger Compliance Workflow| H[EventBridge/SNS]
    
    style E fill:#FF9900,stroke:#232F3E,stroke-width:2px,color:#fff
    style F fill:#3F8624,stroke:#232F3E,stroke-width:2px,color:#fff
    style C fill:#FF9900,stroke:#232F3E,stroke-width:2px,color:#fff

Diagram Note: The Lambda function orchestrates two managed AI services (Textract for OCR, Comprehend Medical for PHI detection) without managing any ML infrastructure, then stores compliance metadata for downstream workflows.
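The "Async Extract Text" edge in the diagram could look roughly like the sketch below, which starts a Textract job for a document in S3 and then pages through the results. The helper names, the polling interval, and the poll limit are assumptions for illustration. Polling is used here only to keep the example short; Textract can also publish job completion to an SNS topic, which usually suits Lambda better than waiting in a loop.

```python
import time
from typing import Any


def start_ocr(textract: Any, bucket: str, key: str) -> str:
    """Start an asynchronous text-detection job and return its JobId."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def collect_text(textract: Any, job_id: str, poll_seconds: float = 2.0, max_polls: int = 60) -> str:
    """Poll until the job finishes, then page through results and join LINE blocks."""
    for _ in range(max_polls):
        page = textract.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} is still running")

    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {page['JobStatus']}")

    lines: list[str] = []
    while True:
        lines += [b["Text"] for b in page.get("Blocks", []) if b.get("BlockType") == "LINE"]
        token = page.get("NextToken")
        if not token:
            return "\n".join(lines)
        # Large documents return results in several pages linked by NextToken.
        page = textract.get_document_text_detection(JobId=job_id, NextToken=token)
```

In a Lambda function, `textract` would be `boto3.client("textract")`. Passing the client in as a parameter keeps the helpers easy to test with a stub.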

The Decision Matrix

| Option | Est. Complexity | Est. Monthly Cost (1,000 docs) | Pros | Cons |
|---|---|---|---|---|
| A) Custom Python OCR + Regex | High | Low (~$20: Lambda execution only) | • Low per-request cost • No external service dependencies | • Poor OCR accuracy on medical docs • High maintenance burden • Compliance validation difficulty • Requires ML/NLP expertise |
| B) Textract + SageMaker | Very High | High (~$92: $7.50 Textract + $84 SageMaker endpoint) | • High accuracy for both OCR and PHI • Customizable ML model | • SageMaker endpoint hourly costs • Model deployment/management overhead • Requires ML expertise • Slow time-to-production |
| C) Textract + Comprehend Medical ✅ | Low | Medium (~$58: $7.50 Textract + $50 Comprehend Medical) | • Purpose-built for healthcare • Zero ML expertise required • HIPAA-eligible services • Pay-per-use pricing • Immediate production readiness | • Higher per-request cost than custom solution • Limited customization of PHI categories |
| D) Rekognition + Comprehend Medical | Medium | Medium (~$55: $5 Rekognition + $50 Comprehend Medical) | • Pay-per-use pricing • Managed services | • Rekognition poor for document OCR • No PDF support • Lower accuracy than Textract |

Cost Assumptions:

1,000 documents/month, average 5 pages each (5,000 pages)
Textract: $1.50 per 1,000 pages × 5,000 pages = $7.50
Comprehend Medical: $0.01 per unit (1 unit = 100 characters); the $50 figure covers 5,000 units (about 500,000 characters in total), and text-dense reports will cost proportionally more
SageMaker: one ml.m5.large endpoint at ~$0.115/hr × 730 hrs ≈ $84/month (minimum)
Rekognition DetectText: $0.001 per image × 5,000 images = $5
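The unit rates listed above can be turned into a small calculator for checking an estimate against your own volumes. This is a rough sketch: the rates are the ones quoted in the assumptions, the function names are made up for the example, and billing each page's text rounded up to whole 100-character units is a simplifying assumption, not a statement of exactly how AWS meters requests.

```python
import math

TEXTRACT_PER_PAGE = 1.50 / 1000      # USD per page, from the assumptions above
COMPREHEND_MEDICAL_PER_UNIT = 0.01   # USD per 100-character unit
SAGEMAKER_M5_LARGE_HOURLY = 0.115    # USD per hour for one ml.m5.large endpoint
REKOGNITION_PER_IMAGE = 0.001        # USD per DetectText image
HOURS_PER_MONTH = 730


def textract_cost(pages: int) -> float:
    """Monthly Textract text-detection cost in USD."""
    return round(pages * TEXTRACT_PER_PAGE, 2)


def comprehend_medical_cost(pages: int, chars_per_page: int) -> float:
    """Monthly cost assuming each page is billed in whole 100-character units."""
    units = pages * math.ceil(chars_per_page / 100)
    return round(units * COMPREHEND_MEDICAL_PER_UNIT, 2)


def sagemaker_endpoint_cost(instances: int = 1) -> float:
    """Monthly cost of always-on real-time endpoint instances."""
    return round(instances * SAGEMAKER_M5_LARGE_HOURLY * HOURS_PER_MONTH, 2)


def rekognition_cost(images: int) -> float:
    """Monthly Rekognition DetectText cost in USD."""
    return round(images * REKOGNITION_PER_IMAGE, 2)
```

For 5,000 pages, `textract_cost(5000)` works out to $7.50 and `sagemaker_endpoint_cost()` to $83.95. The Comprehend Medical figure depends heavily on how much text each page carries, so it is worth measuring `chars_per_page` on a sample of real reports before budgeting.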

Real-World Practitioner Insight

Exam Rule

For the AWS SAA-C03 exam, when you see:

“Minimal operational overhead” + “machine learning/AI task” → Choose fully managed AI services over custom solutions or SageMaker.
“Extract text from documents (PDF/images)” → Amazon Textract (not Rekognition).
“Healthcare/medical entity extraction” → Amazon Comprehend Medical (not SageMaker or custom models).

Real World

In a production healthcare system, we would likely implement these additional considerations:

Hybrid Approach for Cost Optimization:
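- Use Textract’s asynchronous API for multi-page PDFs and batch jobs so that Lambda is not billed while it waits on OCR. Textract’s per-page price is the same at any time of day, so the saving comes from shorter Lambda run time, not from off-peak scheduling.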
- Implement caching for frequently accessed documents to avoid re-processing.
Enhanced Compliance Controls:
- Enable CloudTrail logging for all Textract and Comprehend Medical API calls for audit trails.
- Use AWS PrivateLink to keep data traffic within the AWS network.
- Implement S3 bucket encryption with AWS KMS using customer-managed keys.
Accuracy Monitoring:
- Build a feedback loop where clinicians can flag false positives/negatives.
- If Comprehend Medical’s accuracy falls below 95% after validation, then consider SageMaker with a custom model fine-tuned on your specific document types.
Cost Governance:
- Set up AWS Budgets alerts if Textract/Comprehend Medical costs exceed thresholds.
- Analyze CloudWatch metrics to identify documents that repeatedly fail processing (wasting API calls).
Multi-Region Considerations:
- As of 2025, Comprehend Medical is not available in all AWS regions—you may need cross-region API calls or data replication strategies.
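The caching idea from the cost-optimization bullets above can be sketched as a lookup keyed by a hash of the document bytes, so an identical upload is not sent through Textract and Comprehend Medical a second time. The `ScanCache` class and its in-memory dictionary are hypothetical stand-ins; a real deployment would more likely keep the fingerprints in a persistent store such as the DynamoDB table from the diagram.

```python
import hashlib
from typing import Callable


def document_fingerprint(data: bytes) -> str:
    """Content hash used as the cache key; identical bytes give identical keys."""
    return hashlib.sha256(data).hexdigest()


class ScanCache:
    """In-memory stand-in for a persistent result store."""

    def __init__(self, scan: Callable[[bytes], list]):
        self._scan = scan              # the expensive OCR + PHI pipeline
        self._results: dict[str, list] = {}
        self.misses = 0                # how many times the pipeline actually ran

    def get(self, data: bytes) -> list:
        key = document_fingerprint(data)
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._scan(data)
        return self._results[key]
```

Hashing the content, not the file name, means a re-upload under a different name still hits the cache, while any edited document is scanned again.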

The key difference: The exam tests your ability to select the right managed service for the job. Real-world implementations layer on governance, cost controls, and continuous improvement processes that aren’t mentioned in the question.
