How does Agent Maturity Compass compare to existing AI evaluation approaches?
Spoiler: most approaches let the agent grade its own homework.
| Dimension | AMC | Self-Reported Docs | MTEB / Benchmarks | Model Cards | SOC2 Audits | Manual Red-Teaming | AgentBench | HELM |
|---|---|---|---|---|---|---|---|---|
| Evidence Source | Execution evidence (observed) | Self-reported | Synthetic benchmarks | Vendor-authored | Point-in-time audit | Manual testing | Synthetic tasks | Synthetic scenarios |
| Tamper Resistance | Ed25519 + Merkle tree | None | None | None | Auditor independence | Human judgment | None | None |
| Continuous Monitoring | Real-time + drift detection | Static docs | One-time run | Static document | Annual audit | Periodic testing | One-time eval | One-time eval |
| Framework Coverage | 14 adapters (LangChain, CrewAI, etc.) | Framework-specific | Model-level only | Model-level only | Org-level | Single agent | Limited frameworks | Model-level only |
| EU AI Act Ready | Full compliance mapping | No mapping | Not designed for it | Some transparency | Partial overlap | No compliance focus | Not designed for it | Transparency focus |
| Cost | Free (MIT licensed) | Free | Free | Free | $50K–$200K/year | $10K–$100K/engagement | Free | Free |
| Time to First Score | 2 minutes | Hours to write | Minutes to run | Weeks to prepare | 3–6 months | Days to weeks | Hours to set up | Hours to set up |
| Anti-Gaming | Temporal + cross-ref + freshness | Easily gamed | Data contamination risk | Cherry-picked results | Scope limitations | Evaluator bias | Limited checks | Limited checks |
| Agent-Level (not Model) | Full agent evaluation | Depends on docs | Model only | Model only | Org-level | Agent-level | Agent-level | Model only |
| Domain Correctness Proof | Bounded amcproof lane: proven/disproven/unsupported against declared rule manifests | Claims only | Benchmark score only | Narrative limitations | Control audit, not answer proof | Human review only | No source-to-rule proof | No source-to-rule proof |
When you evaluate an AI agent using self-reported documentation, you get a score of 100/100. When AMC evaluates the same agent from execution evidence (watching what it actually does), the real score is 16/100.
That 84-point gap is documentation inflation. Every approach on this page except AMC suffers from some form of it, because each one either lets the agent provide its own evidence, tests only at the model level, or evaluates at a single point in time.
Ed25519 signatures + Merkle tree ledger. Evidence is tamper-evident by design: you can't fake your way to L5.
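To make "tamper-evident" concrete, here is a minimal sketch of the general technique: hash evidence records into a Merkle root, sign the root with Ed25519, and observe that any later edit to a record breaks verification. This is not AMC's actual ledger code. The event strings, the `merkleRoot` helper, and the choice to duplicate the last node on odd levels are all assumptions made for illustration. It uses only Node's built-in `node:crypto` module.

```typescript
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";

// SHA-256 helper returning the raw digest.
const sha256 = (data: Buffer): Buffer => createHash("sha256").update(data).digest();

// Hypothetical helper: fold a list of evidence records into a single Merkle root.
// Assumption: odd levels duplicate their last node (one common convention).
function merkleRoot(leaves: string[]): Buffer {
  if (leaves.length === 0) throw new Error("need at least one leaf");
  let level: Buffer[] = leaves.map((leaf) => sha256(Buffer.from(leaf, "utf8")));
  while (level.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      const right = i + 1 < level.length ? level[i + 1] : left;
      next.push(sha256(Buffer.concat([left, right])));
    }
    level = next;
  }
  return level[0];
}

// Illustrative evidence records (made up for this sketch).
const events = ["tool_call:search", "policy_check:passed", "output:redacted"];

// Sign the root once; Ed25519 in Node takes `null` as the algorithm argument.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const signature = sign(null, merkleRoot(events), privateKey);

// Later: recompute the root from the stored records and verify the signature.
const tampered = [...events];
tampered[1] = "policy_check:skipped";

const validOk = verify(null, merkleRoot(events), publicKey, signature);
const tamperedOk = verify(null, merkleRoot(tampered), publicKey, signature);

console.log({ validOk, tamperedOk });
```

Changing a single record changes the root, so the original signature no longer verifies. That is what makes the ledger tamper-evident rather than tamper-proof: edits are still possible, but they are detectable.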
AMC monitors continuously with drift detection. SOC2 audits give you a snapshot; AMC gives you a movie.
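As a rough illustration of what drift detection means in general, the sketch below compares the mean of the most recent scores against a baseline window and flags a shift larger than a threshold. The function name `detectDrift`, the window size, and the threshold are hypothetical. AMC's real detection logic is not described here and may work quite differently.

```typescript
// Arithmetic mean of a non-empty list of numbers.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Hypothetical drift check: compare the first `windowSize` scores (baseline)
// with the last `windowSize` scores (recent). Returns false until there is
// enough history for two non-overlapping windows.
function detectDrift(scores: number[], windowSize: number, threshold: number): boolean {
  if (windowSize <= 0 || scores.length < 2 * windowSize) return false;
  const baseline = mean(scores.slice(0, windowSize));
  const recent = mean(scores.slice(-windowSize));
  return Math.abs(recent - baseline) > threshold;
}

// Illustrative score histories (made up for this sketch).
const stable = [80, 82, 81, 79, 80, 81, 79, 82];
const degrading = [80, 82, 81, 79, 60, 58, 61, 59];

console.log(detectDrift(stable, 4, 10));    // baseline 80.5 vs recent 80.5
console.log(detectDrift(degrading, 4, 10)); // baseline 80.5 vs recent 59.5
```

A point-in-time audit only ever sees one of these windows. A continuous monitor sees both and can react when they diverge.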
MTEB and HELM evaluate models. AMC evaluates the full agent: tools, memory, governance, the whole stack.
LangChain, CrewAI, AutoGen, OpenAI Agents SDK, and more. One env var change. Zero code modifications.
Full compliance mapping to EU AI Act articles. The August 2026 deadline is coming; AMC gets you ready now.
For declared rule manifests, AMC can say proven, disproven, or unsupported, without pretending evidence receipts prove answer correctness.
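The three-way verdict can be sketched as follows, assuming a manifest is simply a map from rule names to functions that recompute the expected answer. The `Claim` shape, the `judge` function, and the example rules are hypothetical and do not reflect the real amcproof format. The point is only that "no applicable rule" yields `unsupported` instead of a guess.

```typescript
type Verdict = "proven" | "disproven" | "unsupported";

// Hypothetical claim shape: which rule it appeals to, its inputs, and the agent's answer.
interface Claim {
  rule: string;
  inputs: number[];
  answer: number;
}

// Hypothetical manifest: declared rules that can recompute the expected answer.
type RuleManifest = Map<string, (inputs: number[]) => number>;

const manifest: RuleManifest = new Map<string, (inputs: number[]) => number>([
  ["sum", (xs: number[]) => xs.reduce((a, b) => a + b, 0)],
  ["max", (xs: number[]) => Math.max(...xs)],
]);

// A claim is only judged against rules the manifest actually declares.
function judge(claim: Claim, rules: RuleManifest): Verdict {
  const rule = rules.get(claim.rule);
  if (rule === undefined) return "unsupported"; // no declared rule covers this claim
  return rule(claim.inputs) === claim.answer ? "proven" : "disproven";
}

console.log(judge({ rule: "sum", inputs: [1, 2], answer: 3 }, manifest));
console.log(judge({ rule: "sum", inputs: [1, 2], answer: 4 }, manifest));
console.log(judge({ rule: "median", inputs: [1, 2], answer: 1.5 }, manifest));
```

Keeping `unsupported` as a distinct outcome is the design choice that matters: a signed receipt shows what the agent did, but only a declared rule can say whether the answer was right.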
MIT licensed. No $200K SOC2 audits. No $100K red-team engagements. Zero to first score in 2 minutes.
Install AMC and get your first score in under 2 minutes.
npm i -g agent-maturity-compass && amc