Last Updated: October 31, 2025

What It Actually Does

Amazon Bedrock’s Automated Reasoning checks use an LLM to extract SMT constraints from your source documents; an SMT solver then checks model answers against those extracted constraints.

The Workflow

# Illustrative pseudocode: bedrock.* and llm_extract_rules() are placeholders
# for the workflow steps, not literal API calls.

# Step 1: Upload source document
document = "company_pricing_policy.pdf"
bedrock.upload(document)

# Step 2: LLM EXTRACTS RULES (automated, but can hallucinate!)
# API: StartAutomatedReasoningPolicyBuildWorkflow
# Official docs: "invokes large language models to extract formal logic rules"
extracted_policy = llm_extract_rules(document)

# Example of what the extracted policy might contain:
extracted_policy_example = {
    "variables": {
        "original_price": "Initial price before discounts",
        "discount_percent": "Percentage discount (0-100)",
        "final_price": "Price after applying discount",
    },
    "rules": [
        "final_price = original_price * (1 - discount_percent / 100)",
        "discount_percent >= 0",
        "discount_percent <= 100",
        "original_price >= 0",
    ],
}

# Step 3: HUMAN REVIEW REQUIRED
# Amazon requires you to manually verify the extraction
# because LLMs can hallucinate the rules!

# Step 4: Compile to SMT constraints
smt_policy = bedrock.compile_policy(extracted_policy)

# Step 5: Test the policy
bedrock.test_policy(smt_policy, test_cases)

# Step 6: Deploy
bedrock.deploy_guardrail(smt_policy)
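
To make the compile-and-check step concrete, here is a minimal sketch using the open-source z3-solver Python package. It is not Bedrock's internal compiler or policy format (which is not public); it only shows what the example rules above look like as solver constraints and how a candidate answer gets checked against them.

# Minimal sketch with the z3-solver package (pip install z3-solver).
# An illustration of the SMT encoding, not Bedrock's actual internals.
from z3 import Real, Solver

original_price = Real("original_price")
discount_percent = Real("discount_percent")
final_price = Real("final_price")

policy = [
    final_price == original_price * (1 - discount_percent / 100),
    discount_percent >= 0,
    discount_percent <= 100,
    original_price >= 0,
]

# Check one candidate answer against the policy: $100 at 20% off -> $80
s = Solver()
s.add(policy)
s.add(original_price == 100, discount_percent == 20, final_price == 80)
print(s.check())  # sat -> the answer is consistent with the extracted rules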

The Critical Admission

From AWS documentation:

“The accuracy of the translation from natural language to formal logic is highly dependent on the quality of variable descriptions.”

Translation: LLM extraction can fail, so you must manually verify it.

Why This Matters

The verification is only as good as the extracted rules:

# Your document says:
"Members receive 10-30% discount on all purchases"

# LLM could extract (WRONG):
discount_percent == 20  # Fixed at 20%

# Or (CORRECT):
discount_percent >= 10 and discount_percent <= 30

# If extraction is wrong, SMT will verify against WRONG rules!
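
One way to see why the human step is unavoidable: both extractions above are perfectly satisfiable rule sets, so the solver alone has no way to tell which one matches the document. A small sketch with the z3-solver package (an illustration, not Bedrock's internals):

from z3 import Real, Solver, And

d = Real("discount_percent")

wrong = d == 20                    # hallucinated: a fixed 20% discount
correct = And(d >= 10, d <= 30)    # what the document actually says

for name, rules in [("wrong", wrong), ("correct", correct)]:
    s = Solver()
    s.add(rules)
    print(name, s.check())  # both print "sat": only human review catches the difference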

The Trust Chain (With LLM Extraction Points)

Document with business rules
[1] LLM extracts rules ← Can hallucinate! ❌
Human reviews (catches some errors)
SMT constraints compiled
User asks question
[2] LLM extracts values from question ← Can hallucinate! ❌
[3] LLM generates answer ← Can hallucinate values! ❌
SMT verifies (only checks internal consistency)
If invalid → regenerate with feedback (up to N attempts)
Return answer if passes

Three LLM extraction points where hallucination can occur!
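
The bottom of the chain is a generate/verify/regenerate loop. A rough sketch of that loop is below; generate_answer and build_smt_check are placeholders standing in for the LLM call and the value extraction at points [2] and [3], not Bedrock APIs.

from z3 import Solver, sat

MAX_ATTEMPTS = 3  # assumed limit; the actual N is configuration-dependent

def verified_answer(question, policy_constraints, generate_answer, build_smt_check):
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        answer = generate_answer(question, feedback)   # LLM call -- can hallucinate values
        s = Solver()
        s.add(policy_constraints)                      # compiled policy rules
        s.add(build_smt_check(answer))                 # equalities for values pulled from the answer
        if s.check() == sat:
            return answer                              # internally consistent -> returned as "verified"
        feedback = "Answer was inconsistent with the policy; regenerate."
    return None                                        # give up after N attempts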


Where the “99%” Really Comes From

Amazon’s Actual Claim

“Up to 99% verification accuracy with Automated Reasoning checks”

What This Actually Means

99% = P(correct answer | ALL of these conditions)
CONDITIONS:
1. ✅ Domain is simple (pricing, dates, basic math)
2. ✅ Source document is clear and unambiguous
3. ✅ Human reviewed extracted policy and fixed errors
4. ✅ Policy was thoroughly tested
5. ✅ Question has simple, unambiguous values
6. ✅ Multiple regeneration attempts allowed
7. ✅ SMT-guided feedback loop

This is NOT:

  • ❌ “LLM extracts rules correctly 99% of the time”
  • ❌ “LLM generates correct values 99% of the time”
  • ❌ “Works for any domain with 99% accuracy”

Failure Modes

Failure Mode 1: Policy Extraction Error

Document says:

“Discounts range from 5% to 25% based on membership level”

LLM extracts:

{
    "rules": [
        "discount_percent >= 5",
        "discount_percent <= 20"   # ❌ Should be 25!
    ]
}

Human reviewer misses this

Result:

  • Valid 22% discount gets rejected by SMT
  • System claims it’s “verified” but is enforcing wrong rules
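
Reproducing the mis-extracted bound with z3-solver (an illustration, not Bedrock's internals) shows the effect directly:

from z3 import Real, Solver

d = Real("discount_percent")

extracted_rules = [d >= 5, d <= 20]   # wrong upper bound; the document says 25
s = Solver()
s.add(extracted_rules)
s.add(d == 22)                        # valid per the actual document
print(s.check())  # unsat -> the "verified" system rejects a correct answer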

Failure Mode 2: Ambiguous Document

Document says:

“Members get up to 20% off”

LLM extracts:

{
    "rules": [
        "discount_percent <= 20"
    ]
}

But the document didn’t specify:

  • Can discounts stack with promotions?
  • What about loyalty points?
  • Minimum purchase requirements?

Result:

  • SMT can only verify what was extracted
  • Missing rules create security holes
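
In SMT terms, a constraint that was never extracted is simply never checked. A z3-solver sketch of the hole (purchase_total and the missing minimum-purchase rule are illustrative assumptions):

from z3 import Real, Solver

d = Real("discount_percent")
purchase_total = Real("purchase_total")

extracted_rules = [d <= 20]            # nothing about stacking, loyalty, or minimum purchase
s = Solver()
s.add(extracted_rules)
s.add(d == 20, purchase_total == 1)    # 20% off a $1 order
print(s.check())  # sat -> the unstated minimum-purchase rule is never enforced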

Failure Mode 3: Compatible Hallucinations

Question: “What is 20% of $100?”

Policy (correct):

result = amount * (percentage / 100)

LLM hallucinates the VALUES:

{
    "amount": 10,        # ❌ Should be 100
    "percentage": 20,    # ✅ Correct
    "result": 2          # ✅ Mathematically consistent with wrong amount!
}

SMT verification:

result == amount * (percentage / 100)
2 == 10 * (20 / 100)
2 == 2 ✅ PASS

Answer: “20% of $10 is $2”

WRONG QUESTION, but SMT passed!
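
The same check in z3-solver (illustrative) makes the limitation explicit: the solver only sees the three numbers and the formula, not the question.

from z3 import Real, Solver

amount, percentage, result = Real("amount"), Real("percentage"), Real("result")

s = Solver()
s.add(result == amount * (percentage / 100))          # correct policy formula
s.add(amount == 10, percentage == 20, result == 2)    # hallucinated but mutually consistent
print(s.check())  # sat -> nothing tells the solver that amount should have been 100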


Failure Mode 4: Question Extraction Hallucination

User asks: “Calculate 17.5% discount on $127.43”

LLM extracts from question:

{
    "amount": 127.34,    # ❌ Transposed digits!
    "percentage": 17.5   # ✅ Correct
}

LLM generates answer:

{
    "amount": 127.34,    # Same hallucinated value
    "percentage": 17.5,
    "result": 22.28      # Math is correct for the wrong amount (127.34 * 0.175 ≈ 22.28)
}

SMT verification: ✅ PASS (math is internally consistent)

Answer: “A 17.5% discount on $127.34 is $22.28” (wrong amount!)
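
The consistency check cannot catch this because both stages carry the same transposed amount; the error happened upstream, when $127.43 was read as $127.34. A plain-Python sketch of why the check passes (a one-cent rounding tolerance is assumed):

extracted_from_question = {"amount": 127.34, "percentage": 17.5}  # digits transposed upstream
generated_answer = {"amount": 127.34, "percentage": 17.5, "result": 22.28}

expected = extracted_from_question["amount"] * extracted_from_question["percentage"] / 100
print(abs(generated_answer["result"] - expected) < 0.01)  # True -> passes, wrong dollar amount throughout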


What Amazon Actually Requires

1. Manual Policy Review

From AWS workflow:

  1. Upload source document
  2. Review the extracted policy ← Required human step!
  3. Test and refine
  4. Deploy

Why? Because LLM extraction can fail.

2. Comprehensive Testing

AWS docs emphasize:

“Test edge cases and boundary conditions”

You must test the extracted policy to catch errors.
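
A local boundary-test harness (sketched here with z3-solver, not the Bedrock test API) is exactly how the mis-extracted bound from Failure Mode 1 gets caught: test the values the document itself names.

from z3 import Real, Solver, sat

d = Real("discount_percent")
extracted_rules = [d >= 5, d <= 20]   # the flawed extraction from Failure Mode 1

# (value to test, whether the source document allows it)
boundary_cases = [(5, True), (20, True), (25, True), (4, False), (26, False)]

for value, allowed_by_document in boundary_cases:
    s = Solver()
    s.add(extracted_rules)
    s.add(d == value)
    allowed_by_policy = s.check() == sat
    if allowed_by_policy != allowed_by_document:
        print(f"Extracted policy disagrees with the document at {value}%")  # fires at 25%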

3. High-Quality Descriptions

AWS requires:

“Write comprehensive variable descriptions accounting for how users naturally phrase concepts”

Because the extraction quality depends on clear descriptions.
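
As a rough illustration (not the exact schema Bedrock expects), compare how much the extractor has to guess with a terse description versus a comprehensive one:

# Illustrative variable descriptions; field names and format are assumptions.
terse = {
    "discount_percent": "discount",
}

comprehensive = {
    "discount_percent": (
        "Percentage discount applied to the order subtotal, as a number from 0 to 100. "
        "Users may phrase this as 'percent off', 'X% discount', or 'markdown'. "
        "Does not include loyalty-point credits or stacked promotions."
    ),
}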

4. Domain Limitations

Amazon only claims 99% for:

  • ✅ Pricing calculations
  • ✅ Date arithmetic
  • ✅ Policy compliance
  • ✅ Financial formulas

NOT for:

  • ❌ Open-ended questions
  • ❌ Complex multi-step reasoning
  • ❌ Ambiguous domains

The Real Value Proposition

What Amazon’s product actually provides:

✅ It DOES Help

  1. Reduces manual work - Don’t have to write Z3 code by hand
  2. Iterative improvement - Test and refine extracted policies
  3. SMT verification - Catches internal inconsistencies
  4. Regeneration with feedback - Multiple attempts to get correct answer

❌ It Does NOT Eliminate

  1. LLM hallucination - Can happen at multiple stages
  2. Human oversight - Still need to review extracted policies
  3. Domain limitations - Only works for specific use cases
  4. Edge case bugs - Compatible hallucinations can pass

Bottom Line

What We Learned

  1. Amazon uses LLM to build the SMT model from documents
  2. Manual review is required because extraction can fail
  3. 99% is conditional on simple domains and human oversight
  4. Multiple failure modes exist where hallucinations pass verification

The Real Formula

System Reliability =
    P(correct policy extraction | human review)
    * P(correct question extraction)
    * P(correct answer generation | SMT feedback)
    * P(SMT catches error)

Optimistically:
    = 0.95 * 0.92 * 0.95 * 0.9999
    = ~0.83 (83%)

NOT 99%!

The 99% likely assumes:

  • Simple, unambiguous domains
  • Thorough human review of policies
  • Well-tested policies
  • Questions with obvious values
  • Multiple regeneration attempts

When It’s Actually Useful

Use Amazon Bedrock Automated Reasoning when:

  1. ✅ Domain is narrow and well-defined
  2. ✅ Rules can be clearly documented
  3. ✅ You can invest in policy review and testing
  4. ✅ Questions have unambiguous values
  5. ✅ Stakes justify the overhead

Don’t expect:

  • ❌ General-purpose hallucination prevention
  • ❌ Zero human oversight
  • ❌ 99% accuracy without significant setup work

Sources

Key Quote:

“These operations invoke large language models to extract formal logic rules from source documents.”

This confirms LLM extraction is used to build the verification models.