Amazon Bedrock Automated Reasoning: A Reality Check
Last Updated: October 31, 2025
What It Actually Does
Amazon’s Automated Reasoning checks (a feature of Bedrock Guardrails) use an LLM to extract SMT constraints from your documents.
The Workflow
```python
# Step 1: Upload source document
document = "company_pricing_policy.pdf"
bedrock.upload(document)

# Step 2: LLM EXTRACTS RULES (automated, but can hallucinate!)
# API: StartAutomatedReasoningPolicyBuildWorkflow
# Official docs: "invokes large language models to extract formal logic rules"
extracted_policy = llm_extract_rules(document)

# Example extraction:
{
    "variables": {
        "original_price": "Initial price before discounts",
        "discount_percent": "Percentage discount (0-100)",
        "final_price": "Price after applying discount"
    },
    "rules": [
        "final_price = original_price * (1 - discount_percent / 100)",
        "discount_percent >= 0",
        "discount_percent <= 100",
        "original_price >= 0"
    ]
}

# Step 3: HUMAN REVIEW REQUIRED
# Amazon requires you to manually verify the extraction
# because LLMs can hallucinate the rules!

# Step 4: Compile to SMT constraints
smt_policy = bedrock.compile_policy(extracted_policy)

# Step 5: Test the policy
bedrock.test_policy(smt_policy, test_cases)

# Step 6: Deploy
bedrock.deploy_guardrail(smt_policy)
```
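To make Step 4 concrete, here is a minimal sketch of what a compiled policy could look like, using the open-source Z3 solver (`z3-solver` on PyPI). Z3, the `check_answer` helper, and the variable bindings are assumptions for illustration; Amazon has not published its internal representation.

```python
from z3 import Real, Solver, sat

original_price = Real("original_price")
discount_percent = Real("discount_percent")
final_price = Real("final_price")

# The extracted rules, encoded as solver constraints
policy = [
    final_price == original_price * (1 - discount_percent / 100),
    discount_percent >= 0,
    discount_percent <= 100,
    original_price >= 0,
]

def check_answer(values):
    """True iff the claimed values are jointly consistent with the policy."""
    s = Solver()
    s.add(policy)
    s.add(original_price == values["original_price"],
          discount_percent == values["discount_percent"],
          final_price == values["final_price"])
    return s.check() == sat

print(check_answer({"original_price": 100, "discount_percent": 20, "final_price": 80}))  # True
print(check_answer({"original_price": 100, "discount_percent": 20, "final_price": 90}))  # False
```

Note that the solver only checks that the claimed values satisfy the constraints together; it has no notion of what the user actually asked.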
The Critical Admission

From AWS documentation:
“The accuracy of the translation from natural language to formal logic is highly dependent on the quality of variable descriptions.”
Translation: LLM extraction can fail, so you must manually verify it.
Why This Matters
The verification is only as good as the extracted rules:
```
# Your document says:
# "Members receive 10-30% discount on all purchases"

# LLM could extract (WRONG):
discount_percent == 20  # Fixed at 20%

# Or (CORRECT):
discount_percent >= 10 AND discount_percent <= 30

# If extraction is wrong, SMT will verify against WRONG rules!
```

The Trust Chain (With LLM Extraction Points)
```
Document with business rules
        ↓
[1] LLM extracts rules                 ← Can hallucinate! ❌
        ↓
Human reviews (catches some errors)
        ↓
SMT constraints compiled
        ↓
User asks question
        ↓
[2] LLM extracts values from question  ← Can hallucinate! ❌
        ↓
[3] LLM generates answer               ← Can hallucinate values! ❌
        ↓
SMT verifies (only checks internal consistency)
        ↓
If invalid → regenerate with feedback (up to N attempts)
        ↓
Return answer if passes
```

Three LLM extraction points where hallucination can occur!
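The last three steps of the chain form a loop. Here is a minimal sketch of that loop; all three callables are hypothetical stand-ins, and Bedrock’s real API is not shaped like this.

```python
from typing import Callable, Optional, Tuple

def answer_with_verification(
    question: str,
    generate: Callable[[str, Optional[str]], dict],  # LLM call: extraction point [3]
    verify: Callable[[dict], Tuple[bool, str]],      # SMT check: internal consistency only
    max_attempts: int = 3,
) -> dict:
    feedback = None
    for _ in range(max_attempts):
        answer = generate(question, feedback)
        ok, counterexample = verify(answer)
        if ok:
            return answer          # consistent with the policy; not necessarily correct
        feedback = counterexample  # feed the solver's counterexample back to the LLM
    raise RuntimeError("no policy-consistent answer within the attempt budget")
```

Note what the loop cannot do: if `generate` keeps producing values that are wrong but mutually consistent, `verify` passes on the first attempt.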
Where the “99%” Really Comes From
Amazon’s Actual Claim
“Up to 99% verification accuracy with Automated Reasoning checks”
What This Actually Means
```
99% = P(correct answer | ALL of these conditions)

CONDITIONS:
1. ✅ Domain is simple (pricing, dates, basic math)
2. ✅ Source document is clear and unambiguous
3. ✅ Human reviewed extracted policy and fixed errors
4. ✅ Policy was thoroughly tested
5. ✅ Question has simple, unambiguous values
6. ✅ Multiple regeneration attempts allowed
7. ✅ SMT-guided feedback loop
```

This is NOT:
- ❌ “LLM extracts rules correctly 99% of the time”
- ❌ “LLM generates correct values 99% of the time”
- ❌ “Works for any domain with 99% accuracy”
Failure Modes
Failure Mode 1: Policy Extraction Error
Document says:
“Discounts range from 5% to 25% based on membership level”
LLM extracts:
{ "rules": [ "discount_percent >= 5", "discount_percent <= 20" # ❌ Should be 25! ]}Human reviewer misses this
Result:
- Valid 22% discount gets rejected by SMT
- System claims it’s “verified” but is enforcing wrong rules
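A minimal sketch of this failure, assuming a Z3 encoding like the one shown earlier (the `accepts` helper is an illustrative assumption, not the product’s internals):

```python
from z3 import Real, Solver, sat

discount_percent = Real("discount_percent")

wrong_policy = [discount_percent >= 5, discount_percent <= 20]  # as extracted
right_policy = [discount_percent >= 5, discount_percent <= 25]  # as documented

def accepts(policy, value):
    """True iff the value is consistent with the given rules."""
    s = Solver()
    s.add(policy)
    s.add(discount_percent == value)
    return s.check() == sat

print(accepts(wrong_policy, 22))  # False: a valid discount is rejected as "unverified"
print(accepts(right_policy, 22))  # True: the documented rules would have allowed it
```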
Failure Mode 2: Ambiguous Document
Document says:
“Members get up to 20% off”
LLM extracts:
{ "rules": [ "discount_percent <= 20" ]}But the document didn’t specify:
- Can discounts stack with promotions?
- What about loyalty points?
- Minimum purchase requirements?
Result:
- SMT can only verify what was extracted
- Missing rules create security holes
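To see why the missing rules matter, here is an illustrative Z3 check (same assumed encoding as before): because stacking was never encoded, nothing forbids a member discount and a promotion combining to 35%.

```python
from z3 import Real, Solver

member_discount = Real("member_discount")
promo_discount = Real("promo_discount")
total_discount = Real("total_discount")

extracted_policy = [member_discount <= 20]  # the only rule the LLM extracted

s = Solver()
s.add(extracted_policy)
s.add(member_discount == 20,
      promo_discount == 15,
      total_discount == member_discount + promo_discount)
print(s.check())  # sat: a 35% total discount passes "verification"
```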
Failure Mode 3: Compatible Hallucinations
Question: “What is 20% of $100?”
Policy (correct):
```python
result = amount * (percentage / 100)
```

LLM hallucinates the VALUES:
{ "amount": 10, # ❌ Should be 100 "percentage": 20, # ✅ Correct "result": 2 # ✅ Mathematically consistent with wrong amount!}SMT verification:
```
result == amount * (percentage / 100)
2 == 10 * (20 / 100)
2 == 2   ✅ PASS
```

Answer: “20% of $10 is $2”
WRONG QUESTION, but SMT passed!
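The same failure, spelled out in illustrative Z3 terms: the constraint ties the three values to each other, but nothing ties `amount` back to the $100 in the question.

```python
from z3 import Real, Solver, sat

amount, percentage, result = Real("amount"), Real("percentage"), Real("result")

s = Solver()
s.add(result == amount * (percentage / 100))        # the (correct) policy
s.add(amount == 10, percentage == 20, result == 2)  # hallucinated but consistent values
print(s.check() == sat)  # True: the wrong answer is "verified"
```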
Failure Mode 4: Question Extraction Hallucination
User asks: “Calculate 17.5% discount on $127.43”
LLM extracts from question:
{ "amount": 127.34, # ❌ Transposed digits! "percentage": 17.5 # ✅ Correct}LLM generates answer:
{ "amount": 127.34, # Same hallucinated value "percentage": 17.5, "result": 22.29 # Math is correct for wrong amount}SMT verification: ✅ PASS (math is internally consistent)
Answer: “17.5% of $127.34 is $22.28”

WRONG AMOUNT, but SMT passed!
What Amazon Actually Requires
1. Manual Policy Review
From AWS workflow:
- Upload source document
- Review the extracted policy ← Required human step!
- Test and refine
- Deploy
Why? Because LLM extraction can fail.
2. Comprehensive Testing
AWS docs emphasize:
“Test edge cases and boundary conditions”
You must test the extracted policy to catch errors.
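One way to act on this is to enumerate boundary and edge cases explicitly. The shape below extends the pseudocode from the workflow above; `bedrock.test_policy` and the test-case format are illustrative, not the real API.

```python
# Boundary and edge cases for the discount policy
test_cases = [
    # (claimed values, expected verdict)
    ({"original_price": 100, "discount_percent": 0,   "final_price": 100},  "valid"),
    ({"original_price": 100, "discount_percent": 100, "final_price": 0},    "valid"),
    ({"original_price": 100, "discount_percent": 25,  "final_price": 75},   "valid"),
    ({"original_price": 100, "discount_percent": 101, "final_price": -1},   "invalid"),
    ({"original_price": -5,  "discount_percent": 10,  "final_price": -4.5}, "invalid"),
]
bedrock.test_policy(smt_policy, test_cases)
```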
3. High-Quality Descriptions
AWS requires:
“Write comprehensive variable descriptions accounting for how users naturally phrase concepts”
Because the extraction quality depends on clear descriptions.
4. Domain Limitations
Amazon only claims 99% for:
- ✅ Pricing calculations
- ✅ Date arithmetic
- ✅ Policy compliance
- ✅ Financial formulas
NOT for:
- ❌ Open-ended questions
- ❌ Complex multi-step reasoning
- ❌ Ambiguous domains
The Real Value Proposition
What Amazon’s product actually provides:
✅ It DOES Help
- Reduces manual work - Don’t have to write Z3 code by hand
- Iterative improvement - Test and refine extracted policies
- SMT verification - Catches internal inconsistencies
- Regeneration with feedback - Multiple attempts to get correct answer
❌ It Does NOT Eliminate
- LLM hallucination - Can happen at multiple stages
- Human oversight - Still need to review extracted policies
- Domain limitations - Only works for specific use cases
- Edge case bugs - Compatible hallucinations can pass
Bottom Line
What We Learned
- Amazon uses LLM to build the SMT model from documents
- Manual review is required because extraction can fail
- 99% is conditional on simple domains and human oversight
- Multiple failure modes exist where hallucinations pass verification
The Real Formula
```
System Reliability =
    P(correct policy extraction | human review)
  × P(correct question extraction)
  × P(correct answer generation | SMT feedback)
  × P(SMT catches error)

Optimistically:
  = 0.95 × 0.92 × 0.95 × 0.9999
  ≈ 0.83 (83%)
```
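Reproducing that arithmetic (the four probabilities are illustrative guesses, not measured values):

```python
p_policy, p_question, p_answer, p_smt = 0.95, 0.92, 0.95, 0.9999
print(round(p_policy * p_question * p_answer * p_smt, 2))  # 0.83
```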
NOT 99%!

The 99% likely assumes:
- Simple, unambiguous domains
- Thorough human review of policies
- Well-tested policies
- Questions with obvious values
- Multiple regeneration attempts
When It’s Actually Useful
Use Amazon Bedrock Automated Reasoning when:
- ✅ Domain is narrow and well-defined
- ✅ Rules can be clearly documented
- ✅ You can invest in policy review and testing
- ✅ Questions have unambiguous values
- ✅ Stakes justify the overhead
Don’t expect:
- ❌ General-purpose hallucination prevention
- ❌ Zero human oversight
- ❌ 99% accuracy without significant setup work
Sources
- AWS Documentation: Automated Reasoning Checks
- AWS Blog: Build Reliable AI Systems
- API Reference: StartAutomatedReasoningPolicyBuildWorkflow
Key Quote:
“These operations invoke large language models to extract formal logic rules from source documents.”
This confirms LLM extraction is used to build the verification models.