๐Ÿ“ AMC Blog

Insights on AI agent trust, evaluation, and compliance.

March 2026 · 5 min read

How to Evaluate AI Agents in 2026

The landscape of AI agent evaluation is shifting from self-reported benchmarks to execution-verified evidence. Here's what actually works.

February 2026 · 6 min read

EU AI Act Compliance Checklist for AI Agents

The August 2026 deadline is real. Here's a practical checklist for getting your AI agents compliant before enforcement begins.

February 2026 · 4 min read

The 84-Point Documentation Inflation Gap

Your AI agent scores 100/100 on self-reported docs. Its real execution score? 16/100. The gap isn't a bug; it's the status quo.

March 2026 · 5 min read

How to Evaluate AI Agents in 2026

If you deployed an AI agent in 2024, you probably evaluated it with benchmarks. MMLU. HumanEval. MTEB. Maybe you ran it through a red-team exercise and called it a day.

In 2026, that approach is dangerously insufficient. Here's why, and what to do instead.

The Benchmark Problem

Benchmarks evaluate models, not agents. An agent is a model wrapped in tools, memory, policies, access controls, and decision loops. Scoring the model tells you nothing about how the agent behaves when a user asks it to delete production data, or when a prompt injection slips through its content filter.

We've analyzed over 200 production AI agents across industries. The pattern is consistent: agents that score well on model benchmarks still fail basic governance checks. They leak context between conversations. They escalate privileges without approval. They hallucinate tool calls that bypass safety guardrails.

What Actually Matters

Effective agent evaluation in 2026 requires five things:

  1. Execution evidence, not self-reporting. Watch what the agent does, not what its README claims. Capture API calls, tool invocations, and decision points with cryptographic signatures.
  2. Continuous monitoring, not one-time testing. Agents drift. A safe agent today may not be safe tomorrow after a model update, config change, or prompt injection attack. You need ongoing evaluation.
  3. Agent-level, not model-level. Test the full stack: the model, the tools, the memory, the policies, the approval flows. A model-level benchmark can't catch a misconfigured tool permission.
  4. Anti-gaming protections. If the agent can influence its own evaluation, the evaluation is compromised. Temporal consistency checks, cross-reference validation, and evidence freshness rules prevent gaming.
  5. Regulatory readiness. The EU AI Act enforcement begins August 2026. If your agent operates in or serves the EU market, compliance isn't optional anymore.
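
The temporal consistency check from point 4 can be sketched in a few lines. This is a minimal illustration, not AMC's implementation: the function name `suspicious_jumps` and the 10-points-per-day limit are assumptions made up for the example. The idea is simply that a score which climbs faster than real engineering work plausibly allows deserves a second look.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical limit, chosen only for illustration.
MAX_POINTS_PER_DAY = 10.0


def suspicious_jumps(history, max_rate=MAX_POINTS_PER_DAY):
    """Return (timestamp, score) entries that rose faster than max_rate points per day."""
    ordered = sorted(history)
    flagged = []
    for (prev_time, prev_score), (time, score) in zip(ordered, ordered[1:]):
        elapsed_days = (time - prev_time).total_seconds() / 86400
        if elapsed_days <= 0:
            continue  # duplicate timestamps carry no rate information
        if (score - prev_score) / elapsed_days > max_rate:
            flagged.append((time, score))
    return flagged


start = datetime(2026, 3, 1, tzinfo=timezone.utc)
history = [
    (start, 20.0),
    (start + timedelta(days=1), 24.0),  # +4 points in a day: plausible
    (start + timedelta(days=2), 70.0),  # +46 points in a day: flagged
]
print(suspicious_jumps(history))
```

With this sample history, only the third entry is flagged. A real check would also need to decide what happens to a flagged score, which is a policy question this sketch leaves open.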

The AMC Approach

Agent Maturity Compass (AMC) was built specifically for this. It sits between your agent and its LLM provider as a trusted observer, capturing every API call and tool use with Ed25519 signatures in a Merkle-tree ledger.
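
To make the Merkle-tree idea concrete, here is a rough sketch of how a batch of captured calls can be reduced to a single root hash. It is an illustration under stated assumptions, not AMC's ledger format: the record fields, `leaf_hash`, and `merkle_root` are hypothetical, and the 0x00/0x01 prefixes follow the leaf/node separation convention used in Certificate Transparency. Ed25519 signing is left out because Python's standard library does not provide it; in a design like the one described above, the root is the value that would be signed.

```python
import hashlib
import json


def leaf_hash(record: dict) -> bytes:
    """Hash one captured call; sorted keys make the encoding deterministic."""
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(b"\x00" + encoded).digest()


def merkle_root(records: list) -> str:
    """Fold leaf hashes pairwise until one root remains (an odd node is carried up)."""
    if not records:
        return hashlib.sha256(b"").hexdigest()
    level = [leaf_hash(r) for r in records]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # unpaired node moves up unchanged
        level = next_level
    return level[0].hex()


# Hypothetical captured events, for illustration only.
calls = [
    {"seq": 1, "type": "llm_request", "model": "example-model"},
    {"seq": 2, "type": "tool_call", "tool": "search"},
    {"seq": 3, "type": "llm_response", "tokens": 212},
]
root = merkle_root(calls)
print(root)
```

Changing any field in any record changes the root, which is what makes after-the-fact edits detectable once the root has been signed or published.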

Deep diagnostic coverage across 5 trust dimensions. Evidence-gated scoring from L0 to L5. 14 framework adapters, including LangChain, CrewAI, AutoGen, and the OpenAI Agents SDK. Zero to first score in 2 minutes:

npm i -g agent-maturity-compass
amc

The key insight: Self-reported evidence is capped at 0.4x weight. Observed evidence scores at 1.0x. Cryptographic evidence also scores at 1.0x. The scoring system structurally incentivizes real evidence over claims.
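
A quick worked example shows what those weights do to a score. The three weights come from the paragraph above; everything else here is an assumption for illustration, including the `weighted_score` name and the choice to average per-control scores. AMC's real aggregation may well differ.

```python
# Weights quoted in the article; the averaging scheme below is an assumption.
EVIDENCE_WEIGHTS = {"self_reported": 0.4, "observed": 1.0, "cryptographic": 1.0}


def weighted_score(items):
    """Average per-control scores (0-100), scaling each by its evidence weight."""
    if not items:
        return 0.0
    total = sum(EVIDENCE_WEIGHTS[kind] * score for kind, score in items)
    return total / len(items)


# Five controls, all claimed perfect in the docs and none observed.
claims_only = [("self_reported", 100.0)] * 5
# The same agent after two of those controls were actually observed.
partly_observed = [("self_reported", 100.0)] * 3 + [("observed", 100.0)] * 2

print(weighted_score(claims_only))
print(weighted_score(partly_observed))
```

Under this scheme an agent with perfect documentation and no observed evidence tops out around 40, and every control that moves from claimed to observed lifts the total.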

What to Do This Week

  1. Audit your current evaluation. How much of it is based on what the agent claims vs. what it does?
  2. Run a baseline AMC score. Most agents start at L0–L1. That's normal. It's the honest starting point.
  3. Set a target. L3 is the EU AI Act minimum. How far are you from it?
  4. Instrument your agent. One environment variable change to route through AMC's evidence-capturing gateway.

The era of trusting AI agents because their documentation says they're safe is over. In 2026, trust is earned through evidence.

โ† Back to blog
February 2026 · 6 min read

EU AI Act Compliance Checklist for AI Agents

Most of the EU AI Act's obligations become applicable in August 2026. If your AI agent operates in or serves the EU market, compliance is no longer a nice-to-have; it's a legal requirement. The most serious violations carry fines of up to €35 million or 7% of global annual turnover.

Here's a practical checklist for getting your AI agents ready.

Risk Classification

First, determine your agent's risk level under the EU AI Act. The Act distinguishes four tiers: unacceptable, high, limited, and minimal risk.

Most production AI agents fall into the high-risk or limited-risk category.

The Compliance Checklist

✅ 1. Risk Management System (Article 9)

Document and implement a continuous risk management system. This isn't a one-time assessment; it must be active throughout the agent's lifecycle.

✅ 2. Data Governance (Article 10)

Training, validation, and testing datasets must meet quality criteria.

✅ 3. Technical Documentation (Article 11)

Comprehensive technical documentation must be prepared before the agent is placed on the market.

✅ 4. Record-Keeping / Logging (Article 12)

Automatic logging of events throughout the agent's lifecycle. This is where most agents fail.
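
One way to make logging automatic rather than optional is to wrap every tool the agent can call. The sketch below is a hypothetical illustration, not AMC's API and not a statement of what Article 12 legally requires: `logged_tool`, `EVENT_LOG`, and `lookup_order` are invented names, and the in-memory list stands in for durable, append-only storage.

```python
import functools
import json
from datetime import datetime, timezone

EVENT_LOG = []  # stand-in for durable, append-only storage


def logged_tool(func):
    """Record every invocation of a tool, including failures, as one JSON line."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tool": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
        }
        try:
            result = func(*args, **kwargs)
            event["outcome"] = "ok"
            return result
        except Exception as exc:
            event["outcome"] = "error"
            event["error"] = repr(exc)
            raise  # log the failure, but never swallow it
        finally:
            EVENT_LOG.append(json.dumps(event))

    return wrapper


@logged_tool
def lookup_order(order_id: str) -> dict:
    """Hypothetical tool used only to exercise the decorator."""
    if not order_id:
        raise ValueError("order_id is required")
    return {"order_id": order_id, "status": "shipped"}
```

Because the record is written in the `finally` branch, a failed call leaves a trace just like a successful one, and the exception still reaches the caller.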

✅ 5. Transparency (Article 13)

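High-risk systems must be transparent enough for deployers to interpret and use their output correctly, and must ship with clear instructions for use. The separate duty to tell people they're interacting with an AI system comes from Article 50.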

✅ 6. Human Oversight (Article 14)

Appropriate human oversight mechanisms must be in place.
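
In code, an oversight mechanism can be as simple as a gate that refuses to run high-impact actions without a recorded human decision. This is a minimal sketch under assumptions: `OversightGate`, `ApprovalRequired`, and the list of high-impact actions are hypothetical, and a real `approve` callback would page a person rather than return a constant.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical set of actions that must never run without a human decision.
HIGH_IMPACT_ACTIONS = {"delete_data", "grant_access", "transfer_funds"}


class ApprovalRequired(Exception):
    """Raised when a high-impact action was not approved by a human."""


@dataclass
class OversightGate:
    approve: Callable[[str, dict], bool]  # asks a human; returns their decision
    decisions: list = field(default_factory=list)

    def run(self, action: str, params: dict, execute: Callable[[dict], object]) -> object:
        if action in HIGH_IMPACT_ACTIONS:
            approved = self.approve(action, params)
            self.decisions.append((action, approved))  # keep an audit trail
            if not approved:
                raise ApprovalRequired(f"{action} was not approved")
        return execute(params)


# Example: a reviewer who rejects everything.
gate = OversightGate(approve=lambda action, params: False)
print(gate.run("read_docs", {}, lambda params: "ok"))  # low-impact, runs directly
```

The gate sits in the execution path, so it cannot be skipped the way a documented-but-unwired approval flow can.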

✅ 7. Accuracy, Robustness, Cybersecurity (Article 15)

How AMC Maps to EU AI Act

AMC's diagnostic framework maps directly to EU AI Act requirements:

Article 9 → AMC Dimensions: Strategic Operations, Governance

Article 10 → AMC: Evidence Trust, Claim Provenance

Article 11 → AMC: Full diagnostic report with 140+ data points

Article 12 → AMC: Merkle-tree evidence ledger (tamper-evident by design)

Article 13 → AMC: Transparency reports, Agent Passport

Article 14 → AMC: Approval flows, human oversight scoring

Article 15 → AMC: Shield (threat detection), Enforce (policy guardrails)

Run amc assurance run --pack eu-ai-act for an automated compliance check against all relevant articles.

Timeline

You have months, not years. Start now.

npm i -g agent-maturity-compass
amc init && amc assurance run --pack eu-ai-act
โ† Back to blog
February 2026 · 4 min read

The 84-Point Documentation Inflation Gap

Here's an experiment we've run hundreds of times: take a production AI agent, score it using its own documentation, then score it using AMC's execution evidence. The results are always the same.

Documentation score: 100/100. The README says it handles errors gracefully. The docs claim it respects rate limits. The architecture diagram shows a human-in-the-loop approval flow.

Execution score: 16/100. The error handler swallows exceptions silently. The rate limiter was never implemented. The "approval flow" is a commented-out function.

The gap? 84 points.

Why the Gap Exists

Documentation inflation isn't malice; it's incentive structure. When you evaluate agents based on what they claim, you incentivize better claims, not better agents. Every framework that accepts self-reported evidence has this structural flaw.

What the Gap Looks Like in Practice

Across our analysis of 200+ production agents:

Closing the Gap

The fix is structural, not procedural. You don't close an 84-point gap with better documentation; you close it by changing what counts as evidence.

AMC's approach:

  1. Self-reported evidence is capped at 0.4x weight. Claim whatever you want; it's worth less than half of observed evidence.
  2. Observed evidence scores at 1.0x. What AMC watches your agent actually do? Full weight.
  3. Cryptographic evidence also scores at 1.0x. Ed25519 signed, Merkle-tree chained, tamper-evident.
  4. Evidence decays. Old evidence is worth less. You can't coast on a good assessment from 6 months ago.
  5. Anti-gaming. Temporal consistency prevents overnight jumps. Cross-reference validation catches inconsistencies.

The result: AMC scores reflect what agents actually do, not what they claim. An L3 in AMC means something: it's backed by signed execution evidence, not a bullet point in a README.
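
Evidence decay (point 4) is easy to picture as a half-life. The article doesn't say how AMC computes decay, so treat this as a sketch under assumptions: the 90-day half-life and the `decayed_weight` function are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

HALF_LIFE_DAYS = 90.0  # hypothetical; the article gives no decay rate


def decayed_weight(base_weight: float, observed_at: datetime, now: datetime) -> float:
    """Halve the evidence weight for every HALF_LIFE_DAYS of age."""
    age_days = max((now - observed_at).total_seconds() / 86400, 0.0)
    return base_weight * 0.5 ** (age_days / HALF_LIFE_DAYS)


now = datetime(2026, 2, 1, tzinfo=timezone.utc)
for age in (0, 90, 180):
    weight = decayed_weight(1.0, now - timedelta(days=age), now)
    print(f"{age:>3} days old: weight {weight:.2f}")
```

With these assumed numbers, observed evidence from roughly six months ago counts for about a quarter of its original weight, which is below what a fresh self-reported claim earns at 0.4x.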

Try It Yourself

Score your agent. See where the gap is. It's free, open source, and takes 2 minutes:

npm i -g agent-maturity-compass
amc

The gap might be uncomfortable. But knowing your real score is the first step to closing it.