AMC vs Everything Else

How does Agent Maturity Compass compare to existing AI evaluation approaches?

Spoiler: most approaches let the agent grade its own homework.

| Dimension | AMC | Self-Reported Docs | MTEB / Benchmarks | Model Cards | SOC2 Audits | Manual Red-Teaming | AgentBench | HELM |
|---|---|---|---|---|---|---|---|---|
| Evidence Source | Execution evidence (observed) | Self-reported | Synthetic benchmarks | Vendor-authored | Point-in-time audit | Manual testing | Synthetic tasks | Synthetic scenarios |
| Tamper Resistance | Ed25519 + Merkle tree | None | None | None | Auditor independence | Human judgment | None | None |
| Continuous Monitoring | Real-time + drift detection | Static docs | One-time run | Static document | Annual audit | Periodic testing | One-time eval | One-time eval |
| Framework Coverage | 14 adapters (LangChain, CrewAI, etc.) | Framework-specific | Model-level only | Model-level only | Org-level | Single agent | Limited frameworks | Model-level only |
| EU AI Act Ready | Full compliance mapping | No mapping | Not designed for it | Some transparency | Partial overlap | No compliance focus | Not designed for it | Transparency focus |
| Cost | Free (MIT licensed) | Free | Free | Free | $50K–$200K/year | $10K–$100K/engagement | Free | Free |
| Time to First Score | 2 minutes | Hours to write | Minutes to run | Weeks to prepare | 3–6 months | Days to weeks | Hours to set up | Hours to set up |
| Anti-Gaming | Temporal + cross-ref + freshness | Easily gamed | Data contamination risk | Cherry-picked results | Scope limitations | Evaluator bias | Limited checks | Limited checks |
| Agent-Level (not Model) | Full agent evaluation | Depends on docs | Model only | Model only | Org-level | Agent-level | Agent-level | Model only |
| Domain Correctness Proof | Bounded amcproof lane: proven/disproven/unsupported against declared rule manifests | Claims only | Benchmark score only | Narrative limitations | Control audit, not answer proof | Human review only | No source-to-rule proof | No source-to-rule proof |

The Core Problem

When you evaluate an AI agent using self-reported documentation, you get a score of 100/100. When AMC evaluates the same agent from execution evidence, watching what it actually does, the real score is 16/100.

That 84-point gap is documentation inflation. Every approach on this page except AMC suffers from some form of it, because they either let the agent provide its own evidence, test only at the model level, or evaluate at a single point in time.

Why AMC Wins

๐Ÿ” Cryptographic Evidence

Ed25519 signatures + Merkle tree ledger. Evidence is tamper-evident by design โ€” you can't fake your way to L5.
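
To see why a Merkle tree makes an evidence log tamper-evident, here is a minimal sketch in Python. It is not AMC's ledger format: the record strings, the SHA-256 choice, the leaf/node prefixes, and the function names are all assumptions made for illustration. The Ed25519 step is left out because it needs a third-party library; in a design like the one described above, the root hash would be the value that gets signed.

```python
import hashlib

def leaf_hash(record: bytes) -> bytes:
    # The 0x00 prefix keeps leaf hashes distinct from interior-node hashes.
    return hashlib.sha256(b"\x00" + record).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    # The 0x01 prefix marks an interior node built from two children.
    return hashlib.sha256(b"\x01" + left + right).digest()

def merkle_root(records: list[bytes]) -> bytes:
    """Fold a list of evidence records into a single 32-byte root hash."""
    if not records:
        return hashlib.sha256(b"").digest()
    level = [leaf_hash(r) for r in records]
    while len(level) > 1:
        # Hash adjacent pairs; an unpaired last node is carried up unchanged.
        nxt = [node_hash(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Hypothetical evidence records, for illustration only.
evidence = [b"tool_call:search", b"policy_check:passed", b"output:redacted"]
root = merkle_root(evidence)

# Rewriting any single record changes the root, so a signed root exposes the edit.
tampered = list(evidence)
tampered[1] = b"policy_check:skipped"
print(merkle_root(tampered) != root)  # prints True
```

Because every record feeds into the root, a verifier who holds a signed root only needs to recompute it from the stored records to detect an after-the-fact edit.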

📡 Continuous, Not Point-in-Time

AMC monitors continuously with drift detection. SOC2 audits give you a snapshot; AMC gives you a movie.
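
The page does not say how AMC detects drift, so the following is only a sketch of one simple approach: compare the mean of the most recent scores against the mean of an early baseline and flag a drop beyond a tolerance. The function name, window sizes, and the 10-point tolerance are all hypothetical.

```python
from statistics import mean

def detect_drift(scores: list[float], baseline_n: int = 5,
                 window_n: int = 3, tolerance: float = 10.0) -> bool:
    """Return True when the latest window's mean falls more than
    `tolerance` points below the mean of the first `baseline_n` scores."""
    if len(scores) < baseline_n + window_n:
        return False  # not enough history to compare against a baseline
    baseline = mean(scores[:baseline_n])
    recent = mean(scores[-window_n:])
    return baseline - recent > tolerance

# Hypothetical score histories, for illustration only.
stable = [70, 71, 69, 70, 72, 70, 71, 69]
drifting = [70, 71, 69, 70, 72, 60, 55, 50]

print(detect_drift(stable))    # prints False
print(detect_drift(drifting))  # prints True
```

A one-time evaluation sees only a single point in a history like `drifting`; a continuous check of this kind sees the slope.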

🤖 Agent-Level, Not Model-Level

MTEB and HELM evaluate models. AMC evaluates the full agent: tools, memory, governance, the whole stack.

🔌 14 Framework Adapters

LangChain, CrewAI, AutoGen, OpenAI Agents SDK, and more. One env var change. Zero code modifications.

โš–๏ธ EU AI Act Mapped

Full compliance mapping to EU AI Act articles. August 2026 deadline is coming โ€” AMC gets you ready now.

🧾 Domain Proof Lane

For declared rule manifests, AMC can say proven, disproven, or unsupported, without pretending evidence receipts prove answer correctness.
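
The three-way verdict can be pictured with a small Python sketch. This is not the amcproof implementation or its manifest format. The manifest layout, the `refund_within_limit` rule, and the `check` function are invented for illustration. The point it shows is that a claim outside the declared rules, or one missing the facts a rule needs, comes back as unsupported rather than being guessed at.

```python
from enum import Enum
from typing import Callable, Mapping

class Verdict(Enum):
    PROVEN = "proven"
    DISPROVEN = "disproven"
    UNSUPPORTED = "unsupported"

# Hypothetical manifest: rule id -> (required fact names, predicate over those facts).
Rule = tuple[tuple[str, ...], Callable[[Mapping[str, float]], bool]]

MANIFEST: dict[str, Rule] = {
    "refund_within_limit": (("refund", "limit"), lambda f: f["refund"] <= f["limit"]),
}

def check(rule_id: str, facts: Mapping[str, float]) -> Verdict:
    rule = MANIFEST.get(rule_id)
    if rule is None:
        return Verdict.UNSUPPORTED  # no declared rule covers this claim
    required, predicate = rule
    if any(name not in facts for name in required):
        return Verdict.UNSUPPORTED  # the rule exists but its inputs are missing
    return Verdict.PROVEN if predicate(facts) else Verdict.DISPROVEN

print(check("refund_within_limit", {"refund": 40, "limit": 50}).value)  # prints proven
print(check("refund_within_limit", {"refund": 80, "limit": 50}).value)  # prints disproven
print(check("tone_is_polite", {}).value)                                # prints unsupported
```

Keeping unsupported as its own outcome is what makes the lane bounded: the checker only ever speaks to rules that were declared up front.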

💰 Free & Open Source

MIT licensed. No $200K SOC2 audits. No $100K red-team engagements. Zero to first score in 2 minutes.

Ready to close the gap?

Install AMC and get your first score in under 2 minutes.

npm i -g agent-maturity-compass && amc

Read the Docs →