March 2026 · 5 min read
How to Evaluate AI Agents in 2026
If you deployed an AI agent in 2024, you probably evaluated it with benchmarks. MMLU. HumanEval. MTEB. Maybe you ran it through a red-team exercise and called it a day.
In 2026, that approach is dangerously insufficient. Here's why, and what to do instead.
The Benchmark Problem
Benchmarks evaluate models, not agents. An agent is a model wrapped in tools, memory, policies, access controls, and decision loops. Scoring the model tells you nothing about how the agent behaves when a user asks it to delete production data, or when a prompt injection slips through its content filter.
We've analyzed over 200 production AI agents across industries. The pattern is consistent: agents that score well on model benchmarks still fail basic governance checks. They leak context between conversations. They escalate privileges without approval. They hallucinate tool calls that bypass safety guardrails.
What Actually Matters
Effective agent evaluation in 2026 requires five things:
- Execution evidence, not self-reporting. Watch what the agent does, not what its README claims. Capture API calls, tool invocations, and decision points with cryptographic signatures.
- Continuous monitoring, not one-time testing. Agents drift. A safe agent today may not be safe tomorrow after a model update, config change, or prompt injection attack. You need ongoing evaluation.
- Agent-level, not model-level. Test the full stack: the model, the tools, the memory, the policies, the approval flows. A model-level benchmark can't catch a misconfigured tool permission.
- Anti-gaming protections. If the agent can influence its own evaluation, the evaluation is compromised. Temporal consistency checks, cross-reference validation, and evidence freshness rules prevent gaming.
- Regulatory readiness. Most EU AI Act obligations, including those for high-risk systems, apply from August 2026. If your agent operates in or serves the EU market, compliance isn't optional anymore.
The AMC Approach
Agent Maturity Compass (AMC) was built specifically for this. It sits between your agent and its LLM provider as a trusted observer, capturing every API call and tool use with Ed25519 signatures in a Merkle-tree ledger.
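To make the "Merkle-tree ledger" idea concrete, here is a minimal sketch of how a single root hash can commit to a batch of captured events. It is an illustration under assumptions, not AMC's actual ledger format: the event payloads and the function names `leaf_hash`, `node_hash`, and `merkle_root` are hypothetical, and the Ed25519 signature over the root is left out because it needs a third-party library.

```python
import hashlib


def leaf_hash(record: bytes) -> bytes:
    # Prefix leaves with 0x00 so a leaf can never be confused with an inner node.
    return hashlib.sha256(b"\x00" + record).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Prefix inner nodes with 0x01 (domain separation).
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(records: list[bytes]) -> bytes:
    """Return the Merkle root of the records; an unpaired node is promoted unchanged."""
    if not records:
        return hashlib.sha256(b"").digest()
    level = [leaf_hash(r) for r in records]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(node_hash(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]


# Hypothetical captured events, for illustration only.
events = [
    b'{"tool":"search","args":"q=refunds"}',
    b'{"tool":"db.delete","args":"table=orders"}',
    b'{"tool":"email.send","args":"to=ops"}',
]
root = merkle_root(events)

# Rewriting any single event changes the root, which is what makes the ledger tamper-evident.
tampered = list(events)
tampered[1] = b'{"tool":"db.read","args":"table=orders"}'
tampered_root = merkle_root(tampered)
```

In a real ledger the root would be signed and stored somewhere the agent cannot write to, so that a later mismatch between `root` and a recomputed value is evidence of tampering.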
Deep diagnostic coverage across 5 trust dimensions. Evidence-gated scoring from L0 to L5. 14 framework adapters, including LangChain, CrewAI, AutoGen, and OpenAI Agents SDK. Zero to first score in 2 minutes:
npm i -g agent-maturity-compass
amc
The key insight: Self-reported evidence is capped at 0.4x weight. Observed evidence scores at 1.0x. Cryptographic evidence also scores at 1.0x. The scoring system structurally incentivizes real evidence over claims.
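A small sketch of what that weighting does to a score. Only the three weights come from the text above; the `Evidence` type, the averaging rule in `weighted_score`, and the sample data are assumptions made for illustration, since the article does not say how AMC aggregates evidence.

```python
from dataclasses import dataclass

# Weights quoted in the article; everything else here is an illustrative assumption.
WEIGHTS = {"self_reported": 0.4, "observed": 1.0, "cryptographic": 1.0}


@dataclass
class Evidence:
    kind: str     # one of the WEIGHTS keys
    score: float  # raw score for one check, between 0.0 and 1.0


def weighted_score(items: list[Evidence]) -> float:
    """Average of raw scores after applying the per-kind weight."""
    if not items:
        return 0.0
    return sum(WEIGHTS[e.kind] * e.score for e in items) / len(items)


# Three perfect claims versus three perfect observations.
claims_only = [Evidence("self_reported", 1.0)] * 3
observed = [Evidence("observed", 1.0)] * 3

claims_result = weighted_score(claims_only)   # roughly 0.4
observed_result = weighted_score(observed)    # 1.0
```

Under this assumed averaging rule, an agent backed only by claims cannot score above roughly 40% of the maximum, however good its documentation looks.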
What to Do This Week
- Audit your current evaluation. How much of it is based on what the agent claims vs. what it does?
- Run a baseline AMC score. Most agents start at L0–L1. That's normal. It's the honest starting point.
- Set a target. L3 is the level AMC treats as the EU AI Act minimum. How far are you from it?
- Instrument your agent. One environment variable change to route through AMC's evidence-capturing gateway.
The era of trusting AI agents because their documentation says they're safe is over. In 2026, trust is earned through evidence.
February 2026 · 6 min read
EU AI Act Compliance Checklist for AI Agents
Most of the EU AI Act's obligations, including those for high-risk systems, apply from August 2026. If your AI agent operates in or serves the EU market, compliance is no longer a nice-to-have; it's a legal requirement. Fines run up to €35 million or 7% of global annual turnover for the most serious violations.
Here's a practical checklist for getting your AI agents ready.
Risk Classification
First, determine your agent's risk level under the EU AI Act:
- Unacceptable risk: Banned. Social scoring, real-time remote biometric identification in public spaces, manipulation of vulnerable groups.
- High risk: Requires full compliance. AI in employment, credit scoring, law enforcement, education, critical infrastructure.
- Limited risk: Transparency obligations. Chatbots, content generation, emotion detection.
- Minimal risk: No specific requirements. Spam filters, video games, inventory management.
Most production AI agents fall into high risk or limited risk categories.
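The four tiers above can be sketched as a lookup that returns the strictest tier across everything an agent does. This is a simplification for triage, not legal advice: the use-case tags in `USE_CASE_TIER` and the rule that unknown tags default to high risk are assumptions made for illustration.

```python
from enum import Enum


class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"
    HIGH = "high"
    LIMITED = "limited"
    MINIMAL = "minimal"


# One-line summary of each tier, taken from the list above.
OBLIGATIONS = {
    RiskTier.UNACCEPTABLE: "banned",
    RiskTier.HIGH: "full compliance",
    RiskTier.LIMITED: "transparency obligations",
    RiskTier.MINIMAL: "no specific requirements",
}

# Hypothetical use-case tags; real classification needs legal review.
USE_CASE_TIER = {
    "social_scoring": RiskTier.UNACCEPTABLE,
    "credit_scoring": RiskTier.HIGH,
    "employment": RiskTier.HIGH,
    "chatbot": RiskTier.LIMITED,
    "spam_filter": RiskTier.MINIMAL,
}

# Strictest tier first.
STRICTNESS = [RiskTier.UNACCEPTABLE, RiskTier.HIGH, RiskTier.LIMITED, RiskTier.MINIMAL]


def classify(use_cases: list[str]) -> RiskTier:
    """Return the strictest tier among an agent's use cases.

    Unknown tags are treated as HIGH so that gaps in the table fail safe.
    """
    if not use_cases:
        raise ValueError("at least one use case is required")
    tiers = [USE_CASE_TIER.get(u, RiskTier.HIGH) for u in use_cases]
    return min(tiers, key=STRICTNESS.index)


support_bot = classify(["chatbot"])
lending_agent = classify(["chatbot", "credit_scoring"])
```

An agent inherits the tier of its riskiest capability, which is why a chatbot that also scores credit applications lands in the high-risk tier.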
The Compliance Checklist
✅ 1. Risk Management System (Article 9)
Document and implement a continuous risk management system. This isn't a one-time assessment; it must be active throughout the agent's lifecycle.
- Identify and analyze known and foreseeable risks
- Implement risk mitigation measures
- Document residual risks and communicate them to users
- Regularly test and update the risk assessment
✅ 2. Data Governance (Article 10)
Training, validation, and testing datasets must meet quality criteria.
- Document data provenance and lineage
- Assess and mitigate bias in training data
- Ensure datasets are relevant and representative
- Implement data quality monitoring
✅ 3. Technical Documentation (Article 11)
Comprehensive technical documentation before the agent is placed on the market.
- System architecture and design specifications
- Performance metrics and limitations
- Risk assessment results
- Testing and validation results
✅ 4. Record-Keeping / Logging (Article 12)
Automatic logging of events throughout the agent's lifecycle. This is where most agents fail.
- Tamper-evident audit logs of all decisions
- Evidence of policy enforcement
- Incident records and response actions
- Traceability from input to output
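One common way to get tamper-evident logs is a hash chain, where each entry commits to the one before it. The sketch below is a generic illustration of that technique, not AMC's implementation; the event fields and the helper names `entry_hash`, `append`, and `verify` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    # Canonical JSON (sorted keys) so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list[dict], event: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})


def verify(log: list[dict]) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


log: list[dict] = []
append(log, {"input": "refund order 17", "decision": "escalate", "policy": "refund-limit"})
append(log, {"input": "approve", "decision": "refund_issued", "approver": "human"})
intact = verify(log)

log[0]["event"]["decision"] = "auto_approve"  # simulate after-the-fact tampering
still_valid = verify(log)
```

A chain like this detects edits and reordering, but not truncation of the newest entries. Catching that requires storing the latest hash somewhere the agent cannot modify.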
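✅ 5. Transparency (Article 13)
High-risk systems must be transparent enough for deployers to interpret and use their output correctly. Telling people they are interacting with AI is a related duty under Article 50.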
- Clear disclosure of AI agent identity
- Documentation of capabilities and limitations
- Interpretable outputs where possible
✅ 6. Human Oversight (Article 14)
Appropriate human oversight mechanisms must be in place.
- Human-in-the-loop for high-risk decisions
- Ability to override agent decisions
- Escalation paths for uncertain situations
- Kill switch / emergency stop capability
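The oversight mechanisms above can be sketched as a gate that every action passes through before it runs. This is a minimal illustration under assumptions: `OversightGate`, `Action`, the `approve` callback, and the two-level risk label are hypothetical names, and how an action gets its risk label is out of scope.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    tool: str
    risk: str  # "low" or "high"; how risk is assigned is out of scope here


class EmergencyStop(Exception):
    """Raised when the kill switch is engaged."""


class OversightGate:
    def __init__(self, approve: Callable[[Action], bool]):
        self.approve = approve  # human decision callback (hypothetical)
        self.stopped = False    # kill switch state

    def stop(self) -> None:
        self.stopped = True

    def execute(self, action: Action, run: Callable[[Action], str]) -> str:
        if self.stopped:
            raise EmergencyStop("agent halted by operator")
        # High-risk actions run only after an explicit human approval.
        if action.risk == "high" and not self.approve(action):
            return "rejected"
        return run(action)


# Stand-in for a human reviewer who refuses destructive database calls.
gate = OversightGate(approve=lambda a: a.tool != "db.delete")
run = lambda a: f"ran {a.tool}"

low = gate.execute(Action("search", "low"), run)
blocked = gate.execute(Action("db.delete", "high"), run)
```

The important property is that the gate sits in the execution path. An approval flow that the agent can skip, or that exists only in a diagram, gives no oversight.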
✅ 7. Accuracy, Robustness, Cybersecurity (Article 15)
- Documented accuracy levels and performance bounds
- Resilience to adversarial attacks (prompt injection, jailbreaking)
- Cybersecurity measures appropriate to the risk level
- Graceful degradation under failure conditions
How AMC Maps to EU AI Act
AMC's diagnostic framework maps directly to EU AI Act requirements:
Article 9 → AMC Dimensions: Strategic Operations, Governance
Article 10 → AMC: Evidence Trust, Claim Provenance
Article 11 → AMC: Full diagnostic report with 140+ data points
Article 12 → AMC: Merkle-tree evidence ledger (tamper-evident by design)
Article 13 → AMC: Transparency reports, Agent Passport
Article 14 → AMC: Approval flows, human oversight scoring
Article 15 → AMC: Shield (threat detection), Enforce (policy guardrails)
Run amc assurance run --pack eu-ai-act for an automated compliance check against all relevant articles.
Timeline
- February 2025: Prohibited AI practices enforced
- August 2025: General-purpose AI rules apply
- August 2026: Most remaining obligations apply, including those for high-risk AI systems
You have months, not years. Start now.
npm i -g agent-maturity-compass
amc init && amc assurance run --pack eu-ai-act
February 2026 · 4 min read
The 84-Point Documentation Inflation Gap
Here's an experiment we've run hundreds of times: take a production AI agent, score it using its own documentation, then score it using AMC's execution evidence. The results are always the same.
Documentation score: 100/100. The README says it handles errors gracefully. The docs claim it respects rate limits. The architecture diagram shows a human-in-the-loop approval flow.
Execution score: 16/100. The error handler swallows exceptions silently. The rate limiter was never implemented. The "approval flow" is a commented-out function.
The gap? 84 points.
Why the Gap Exists
Documentation inflation isn't malice; it's incentive structure. When you evaluate agents based on what they claim, you incentivize better claims, not better agents. Every framework that accepts self-reported evidence has this structural flaw:
- Model cards: written by the model vendor. Cherry-picked benchmarks, favorable framing.
- Self-assessments: the agent evaluating itself. Structural conflict of interest.
- Documentation reviews: checking if docs exist, not if they're accurate. "Does this policy document exist? ✅" doesn't mean the policy is enforced.
- Periodic audits: snapshot in time. Agents drift between audits.
What the Gap Looks Like in Practice
Across our analysis of 200+ production agents:
- 73% claimed to have error handling. Only 12% actually caught and recovered from failures gracefully when tested.
- 91% documented rate limiting. Only 8% enforced it when we sent burst traffic.
- 65% listed human oversight as a feature. Only 4% actually routed high-risk decisions to a human.
- 89% claimed prompt injection defense. Only 31% blocked the prompt injection patterns described in the OWASP Top 10 for LLM Applications.
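The size of each gap is easier to compare once it is computed. The snippet below uses only the percentages from the list above; the variable names are illustrative.

```python
# (capability, % of agents claiming it, % verified under test), figures from the list above
findings = [
    ("error handling", 73, 12),
    ("rate limiting", 91, 8),
    ("human oversight", 65, 4),
    ("prompt injection defense", 89, 31),
]

# Gap in percentage points between what was claimed and what held up under test.
gaps = {name: claimed - verified for name, claimed, verified in findings}
widest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: {gap} points between claimed and verified")
```

Rate limiting shows the widest gap at 83 percentage points, and even the narrowest, prompt injection defense, is 58.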
Closing the Gap
The fix is structural, not procedural. You don't close an 84-point gap with better documentation; you close it by changing what counts as evidence.
AMC's approach:
- Self-reported evidence is capped at 0.4x weight. Claim whatever you want โ it's worth less than half of observed evidence.
- Observed evidence scores at 1.0x. What AMC watches your agent actually do? Full weight.
- Cryptographic evidence also scores at 1.0x. Ed25519 signed, Merkle-tree chained, tamper-evident.
- Evidence decays. Old evidence is worth less. You can't coast on a good assessment from 6 months ago.
- Anti-gaming. Temporal consistency prevents overnight jumps. Cross-reference validation catches inconsistencies.
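The last two rules, decay and temporal consistency, can be sketched as follows. The article does not give AMC's actual decay curve or jump threshold, so the 90-day half-life, the 5-points-per-day limit, and both function names are assumptions chosen only to make the idea concrete.

```python
from datetime import date

HALF_LIFE_DAYS = 90    # assumed: evidence loses half its weight every 90 days
MAX_DAILY_GAIN = 5.0   # assumed: largest plausible score gain per day


def decayed_weight(observed_on: date, today: date, base_weight: float = 1.0) -> float:
    """Exponential decay of an evidence weight with age."""
    age_days = max((today - observed_on).days, 0)
    return base_weight * 0.5 ** (age_days / HALF_LIFE_DAYS)


def is_plausible_jump(prev_score: float, new_score: float, days_between: int) -> bool:
    """Flag score increases that are too fast to reflect real engineering work."""
    allowed = MAX_DAILY_GAIN * max(days_between, 1)
    return (new_score - prev_score) <= allowed


today = date(2026, 3, 1)
fresh = decayed_weight(date(2026, 3, 1), today)
six_months_old = decayed_weight(date(2025, 9, 2), today)  # 180 days earlier
overnight_jump_ok = is_plausible_jump(16.0, 100.0, days_between=1)
```

With these assumed parameters, evidence from 180 days ago carries a quarter of its original weight, and a jump from 16 to 100 in one day is flagged as implausible.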
The result: AMC scores reflect what agents actually do, not what they claim. An L3 in AMC means something: it's backed by signed execution evidence, not a bullet point in a README.
Try It Yourself
Score your agent. See where the gap is. It's free, open source, and takes 2 minutes:
npm i -g agent-maturity-compass
amc
The gap might be uncomfortable. But knowing your real score is the first step to closing it.