Amazon Bedrock Automated Reasoning: A Reality Check
Last Updated: October 31, 2025
What It Actually Does
Amazon’s Automated Reasoning checks (a feature of Bedrock Guardrails) use an LLM to extract SMT constraints from your documents.
The Workflow
```python
# Step 1: Upload source document
document = "company_pricing_policy.pdf"
bedrock.upload(document)

# Step 2: LLM EXTRACTS RULES (automated, but can hallucinate!)
# API: StartAutomatedReasoningPolicyBuildWorkflow
# Official docs: "invokes large language models to extract formal logic rules"
extracted_policy = llm_extract_rules(document)

# Example extraction:
{
    "variables": {
        "original_price": "Initial price before discounts",
        "discount_percent": "Percentage discount (0-100)",
        "final_price": "Price after applying discount"
    },
    "rules": [
        "final_price = original_price * (1 - discount_percent / 100)",
        "discount_percent >= 0",
        "discount_percent <= 100",
        "original_price >= 0"
    ]
}

# Step 3: HUMAN REVIEW REQUIRED
# Amazon requires you to manually verify the extraction
# because LLMs can hallucinate the rules!

# Step 4: Compile to SMT constraints
smt_policy = bedrock.compile_policy(extracted_policy)

# Step 5: Test the policy
bedrock.test_policy(smt_policy, test_cases)

# Step 6: Deploy
bedrock.deploy_guardrail(smt_policy)
```
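To make Step 4 concrete, here is a minimal sketch of what a compiled policy could look like, using the open-source Z3 solver (`z3-solver` on PyPI). Z3, the `check_answer` helper, and the variable bindings are assumptions for illustration; Amazon has not published its internal representation.

```python
from z3 import Real, Solver, sat

original_price = Real("original_price")
discount_percent = Real("discount_percent")
final_price = Real("final_price")

# The extracted rules, encoded as solver constraints
policy = [
    final_price == original_price * (1 - discount_percent / 100),
    discount_percent >= 0,
    discount_percent <= 100,
    original_price >= 0,
]

def check_answer(values):
    """True iff the claimed values are jointly consistent with the policy."""
    s = Solver()
    s.add(policy)
    s.add(original_price == values["original_price"],
          discount_percent == values["discount_percent"],
          final_price == values["final_price"])
    return s.check() == sat

print(check_answer({"original_price": 100, "discount_percent": 20, "final_price": 80}))  # True
print(check_answer({"original_price": 100, "discount_percent": 20, "final_price": 90}))  # False
```

Note that the solver only checks that the claimed values satisfy the constraints together; it has no notion of what the user actually asked.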
The Critical Admission

From AWS documentation:
“The accuracy of the translation from natural language to formal logic is highly dependent on the quality of variable descriptions.”
Translation: LLM extraction can fail, so you must manually verify it.
Why This Matters
The verification is only as good as the extracted rules:
```
# Your document says:
# "Members receive 10-30% discount on all purchases"

# LLM could extract (WRONG):
discount_percent == 20  # Fixed at 20%

# Or (CORRECT):
discount_percent >= 10 AND discount_percent <= 30

# If extraction is wrong, SMT will verify against WRONG rules!
```

The Trust Chain (With LLM Extraction Points)
```
Document with business rules
        ↓
[1] LLM extracts rules                 ← Can hallucinate! ❌
        ↓
Human reviews (catches some errors)
        ↓
SMT constraints compiled
        ↓
User asks question
        ↓
[2] LLM extracts values from question  ← Can hallucinate! ❌
        ↓
[3] LLM generates answer               ← Can hallucinate values! ❌
        ↓
SMT verifies (only checks internal consistency)
        ↓
If invalid → regenerate with feedback (up to N attempts)
        ↓
Return answer if passes
```

Three LLM extraction points where hallucination can occur!
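The last three steps of the chain form a loop. Here is a minimal sketch of that loop; all three callables are hypothetical stand-ins, and Bedrock’s real API is not shaped like this.

```python
from typing import Callable, Optional, Tuple

def answer_with_verification(
    question: str,
    generate: Callable[[str, Optional[str]], dict],  # LLM call: extraction point [3]
    verify: Callable[[dict], Tuple[bool, str]],      # SMT check: internal consistency only
    max_attempts: int = 3,
) -> dict:
    feedback = None
    for _ in range(max_attempts):
        answer = generate(question, feedback)
        ok, counterexample = verify(answer)
        if ok:
            return answer          # consistent with the policy; not necessarily correct
        feedback = counterexample  # feed the solver's counterexample back to the LLM
    raise RuntimeError("no policy-consistent answer within the attempt budget")
```

Note what the loop cannot do: if `generate` keeps producing values that are wrong but mutually consistent, `verify` passes on the first attempt.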
Where the “99%” Really Comes From
Amazon’s Actual Claim
“Up to 99% verification accuracy with Automated Reasoning checks”
What This Actually Means
```
99% = P(correct answer | ALL of these conditions)

CONDITIONS:
1. ✅ Domain is simple (pricing, dates, basic math)
2. ✅ Source document is clear and unambiguous
3. ✅ Human reviewed extracted policy and fixed errors
4. ✅ Policy was thoroughly tested
5. ✅ Question has simple, unambiguous values
6. ✅ Multiple regeneration attempts allowed
7. ✅ SMT-guided feedback loop
```

This is NOT:
- ❌ “LLM extracts rules correctly 99% of the time”
- ❌ “LLM generates correct values 99% of the time”
- ❌ “Works for any domain with 99% accuracy”
Failure Modes
Failure Mode 1: Policy Extraction Error
Document says:
“Discounts range from 5% to 25% based on membership level”
LLM extracts:
{ "rules": [ "discount_percent >= 5", "discount_percent <= 20" # ❌ Should be 25! ]}Human reviewer misses this
Result:
- Valid 22% discount gets rejected by SMT
- System claims it’s “verified” but is enforcing wrong rules
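A minimal sketch of this failure, assuming a Z3 encoding like the one shown earlier (the `accepts` helper is an illustrative assumption, not the product’s internals):

```python
from z3 import Real, Solver, sat

discount_percent = Real("discount_percent")

wrong_policy = [discount_percent >= 5, discount_percent <= 20]  # as extracted
right_policy = [discount_percent >= 5, discount_percent <= 25]  # as documented

def accepts(policy, value):
    """True iff the value is consistent with the given rules."""
    s = Solver()
    s.add(policy)
    s.add(discount_percent == value)
    return s.check() == sat

print(accepts(wrong_policy, 22))  # False: a valid discount is rejected as "unverified"
print(accepts(right_policy, 22))  # True: the documented rules would have allowed it
```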
Failure Mode 2: Ambiguous Document
Document says:
“Members get up to 20% off”
LLM extracts:
{ "rules": [ "discount_percent <= 20" ]}But the document didn’t specify:
- Can discounts stack with promotions?
- What about loyalty points?
- Minimum purchase requirements?
Result:
- SMT can only verify what was extracted
- Missing rules create security holes
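To see why the missing rules matter, here is an illustrative Z3 check (same assumed encoding as before): because stacking was never encoded, nothing forbids a member discount and a promotion combining to 35%.

```python
from z3 import Real, Solver

member_discount = Real("member_discount")
promo_discount = Real("promo_discount")
total_discount = Real("total_discount")

extracted_policy = [member_discount <= 20]  # the only rule the LLM extracted

s = Solver()
s.add(extracted_policy)
s.add(member_discount == 20,
      promo_discount == 15,
      total_discount == member_discount + promo_discount)
print(s.check())  # sat: a 35% total discount passes "verification"
```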
Failure Mode 3: Compatible Hallucinations
Question: “What is 20% of $100?”
Policy (correct):
```python
result = amount * (percentage / 100)
```

LLM hallucinates the VALUES:
{ "amount": 10, # ❌ Should be 100 "percentage": 20, # ✅ Correct "result": 2 # ✅ Mathematically consistent with wrong amount!}SMT verification:
```
result == amount * (percentage / 100)
2 == 10 * (20 / 100)
2 == 2   ✅ PASS
```

Answer: “20% of $10 is $2”
WRONG QUESTION, but SMT passed!
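The same failure, spelled out in illustrative Z3 terms: the constraint ties the three values to each other, but nothing ties `amount` back to the $100 in the question.

```python
from z3 import Real, Solver, sat

amount, percentage, result = Real("amount"), Real("percentage"), Real("result")

s = Solver()
s.add(result == amount * (percentage / 100))        # the (correct) policy
s.add(amount == 10, percentage == 20, result == 2)  # hallucinated but consistent values
print(s.check() == sat)  # True: the wrong answer is "verified"
```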
Failure Mode 4: Question Extraction Hallucination
User asks: “Calculate 17.5% discount on $127.43”
LLM extracts from question:
{ "amount": 127.34, # ❌ Transposed digits! "percentage": 17.5 # ✅ Correct}LLM generates answer:
{ "amount": 127.34, # Same hallucinated value "percentage": 17.5, "result": 22.29 # Math is correct for wrong amount}SMT verification: ✅ PASS (math is internally consistent)
Answer: “17.5% of $127.34 is $22.28”

WRONG AMOUNT, but SMT passed!
What Amazon Actually Requires
1. Manual Policy Review
From AWS workflow:
- Upload source document
- Review the extracted policy ← Required human step!
- Test and refine
- Deploy
Why? Because LLM extraction can fail.
2. Comprehensive Testing
AWS docs emphasize:
“Test edge cases and boundary conditions”
You must test the extracted policy to catch errors.
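One way to act on this is to enumerate boundary and edge cases explicitly. The shape below extends the pseudocode from the workflow above; `bedrock.test_policy` and the test-case format are illustrative, not the real API.

```python
# Boundary and edge cases for the discount policy
test_cases = [
    # (claimed values, expected verdict)
    ({"original_price": 100, "discount_percent": 0,   "final_price": 100},  "valid"),
    ({"original_price": 100, "discount_percent": 100, "final_price": 0},    "valid"),
    ({"original_price": 100, "discount_percent": 25,  "final_price": 75},   "valid"),
    ({"original_price": 100, "discount_percent": 101, "final_price": -1},   "invalid"),
    ({"original_price": -5,  "discount_percent": 10,  "final_price": -4.5}, "invalid"),
]
bedrock.test_policy(smt_policy, test_cases)
```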
3. High-Quality Descriptions
AWS requires:
“Write comprehensive variable descriptions accounting for how users naturally phrase concepts”
Because the extraction quality depends on clear descriptions.
4. Domain Limitations
Amazon only claims 99% for:
- ✅ Pricing calculations
- ✅ Date arithmetic
- ✅ Policy compliance
- ✅ Financial formulas
NOT for:
- ❌ Open-ended questions
- ❌ Complex multi-step reasoning
- ❌ Ambiguous domains
The Real Value Proposition
What Amazon’s product actually provides:
✅ It DOES Help
- Reduces manual work - Don’t have to write Z3 code by hand
- Iterative improvement - Test and refine extracted policies
- SMT verification - Catches internal inconsistencies
- Regeneration with feedback - Multiple attempts to get correct answer
❌ It Does NOT Eliminate
- LLM hallucination - Can happen at multiple stages
- Human oversight - Still need to review extracted policies
- Domain limitations - Only works for specific use cases
- Edge case bugs - Compatible hallucinations can pass
Bottom Line
What We Learned
- Amazon uses LLM to build the SMT model from documents
- Manual review is required because extraction can fail
- 99% is conditional on simple domains and human oversight
- Multiple failure modes exist where hallucinations pass verification
The Real Formula
```
System Reliability =
    P(correct policy extraction | human review)
  × P(correct question extraction)
  × P(correct answer generation | SMT feedback)
  × P(SMT catches error)

Optimistically:
  = 0.95 × 0.92 × 0.95 × 0.9999
  ≈ 0.83 (83%)
```
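Reproducing that arithmetic (the four probabilities are illustrative guesses, not measured values):

```python
p_policy, p_question, p_answer, p_smt = 0.95, 0.92, 0.95, 0.9999
print(round(p_policy * p_question * p_answer * p_smt, 2))  # 0.83
```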
NOT 99%!

The 99% likely assumes:
- Simple, unambiguous domains
- Thorough human review of policies
- Well-tested policies
- Questions with obvious values
- Multiple regeneration attempts
When It’s Actually Useful
Use Amazon Bedrock Automated Reasoning when:
- ✅ Domain is narrow and well-defined
- ✅ Rules can be clearly documented
- ✅ You can invest in policy review and testing
- ✅ Questions have unambiguous values
- ✅ Stakes justify the overhead
Don’t expect:
- ❌ General-purpose hallucination prevention
- ❌ Zero human oversight
- ❌ 99% accuracy without significant setup work
Sources
- AWS Documentation: Automated Reasoning Checks
- AWS Blog: Build Reliable AI Systems
- API Reference: StartAutomatedReasoningPolicyBuildWorkflow
Key Quote:
“These operations invoke large language models to extract formal logic rules from source documents.”
This confirms LLM extraction is used to build the verification models.