Claude Mythos Preview reportedly scores 93.9% on SWE-bench Verified and 97.6% on the 2026 USA Mathematical Olympiad (USAMO), a jump so large that it changes how we should think about frontier AI. But the bigger story is not just the numbers. It is that benchmark suites are nearing saturation while Anthropic and Google are already moving Mythos into tightly controlled enterprise deployment.
Table of Contents
- Why These Benchmarks Matter
- The Scoreboard: Where Mythos Pulls Away
- SWE-bench: The Software Engineering Gap Is Widening
- USAMO: From Good to Almost Perfect
- Benchmarks Are Running Out of Road
- The Latest News: Glasswing Expansion and Vertex AI Preview
- What This Means for Enterprise AI Buyers
- Key Takeaways
- Frequently Asked Questions
- References
Why These Benchmarks Matter
In Part 1 of this series, we covered the broad significance of Claude Mythos Preview: Anthropic built a model that it says is its most capable ever — and then refused to release it publicly. In Part 2, we covered the cybersecurity results that forced that decision.
Part 3 is about the capability picture behind those headlines. The benchmark story is not merely that Mythos leads. It is that the size of the lead is unusually large for a single model generation, especially in coding and reasoning. And once frontier models are scoring in the 90s on difficult evaluations, every percentage point becomes more consequential.
For developers, this is the clearest signal yet that autonomous software engineering is becoming practical. For enterprise leaders, it is a warning that the most capable models may increasingly reach the market first through restricted partnerships rather than mass public APIs. And for the AI research community, it is a sign that our familiar benchmarks may no longer be hard enough.

The Scoreboard: Where Mythos Pulls Away
Anthropic’s reported Claude Mythos Preview results include:
- SWE-bench Verified: 93.9%
- USAMO 2026: 97.6%
- Terminal-Bench 2.0: 82.0% standard, 92.1% extended
- HLE (Humanity’s Last Exam) with tools: 64.7%
- GPQA Diamond: 94.5%
- GraphWalks BFS 256K–1M: 80.0%
The comparisons are what make the picture striking. On the same benchmark slate, Anthropic reports Claude Opus 4.6 at 80.8% on SWE-bench Verified and 42.3% on USAMO 2026. GPT-5.4 is cited at 95.2% on USAMO. Gemini 3.1 Pro is cited at 80.6% on SWE-bench Verified. Against its own predecessor, and against Gemini on coding, Mythos is not just edging ahead. It is creating visible daylight.
That matters because frontier benchmark progress usually compresses as models improve. Gains become harder to earn at the top end. Yet here, Mythos appears to post one of the biggest generation-over-generation jumps we have seen in high-value enterprise tasks.
SWE-bench: The Software Engineering Gap Is Widening
SWE-bench Verified is one of the most important AI coding evaluations because it measures whether a model can solve real software engineering issues in real repositories, not toy exercises. A 93.9% score implies success on roughly 470 of 500 validated tasks.
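As a quick sanity check on that arithmetic, here is a minimal sketch in plain Python. It assumes the standard 500-task Verified set and takes the vendor-reported pass rates at face value.

```python
# Back-of-the-envelope check on the reported SWE-bench Verified numbers.
# Assumes the standard 500-task Verified set and vendor-reported pass rates.
TASKS = 500

def solved(pass_rate: float) -> int:
    """Approximate number of tasks solved at a given pass rate."""
    return round(pass_rate * TASKS)

mythos, opus_46 = 0.939, 0.808                   # reported scores
print(solved(mythos), TASKS - solved(mythos))    # ~470 solved, ~30 unsolved
print(solved(opus_46), TASKS - solved(opus_46))  # ~404 solved, ~96 unsolved
# The unsolved count drops by roughly a factor of three between generations.
```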
If that number holds under scrutiny, it has major implications:
- Autonomous code repair becomes credible. This is no longer about autocomplete quality or neat demos. It is about handling issue-driven engineering work at very high accuracy.
- The gap between public and restricted models may widen. Enterprises may soon face a market where the best models are available first through curated programs, not open sign-up.
- Workflow design matters more than model choice alone. Once models operate in this range, the difference between teams may come down to evals, guardrails, repo hygiene, review process, and tool orchestration.
There is also a useful caution here. Public leaderboards such as Vals SWE-bench show top public models clustered in the high-70s as of April 9, 2026, with Gemini 3.1 Pro at 78.8%, GPT-5.4 at 78.2%, and Claude Opus 4.6 at 78.2%. Those numbers are not directly equivalent to Anthropic’s vendor-reported Mythos score because the setups, harnesses, and policies differ. But the contrast still reinforces the same point: benchmark ceilings for public models remain materially lower than what Anthropic says Mythos can do.
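One rough way to judge which gaps are meaningful on a 500-task benchmark is to look at sampling noise alone. The sketch below is a deliberate simplification: it ignores exactly the harness, policy, and budget differences described above, so treat it as a floor on uncertainty rather than a full comparison.

```python
# Rough sampling noise on a 500-task benchmark: binomial standard error only.
# Ignores harness, policy, and budget differences, so it is a lower bound on
# the uncertainty in any cross-setup comparison.
from math import sqrt

def std_err_points(pass_rate: float, n_tasks: int = 500) -> float:
    """One standard error of a pass rate, in percentage points."""
    return 100 * sqrt(pass_rate * (1 - pass_rate) / n_tasks)

print(round(std_err_points(0.78), 1))   # ~1.9 points: sub-point leaderboard gaps sit within noise
print(round(std_err_points(0.939), 1))  # ~1.1 points: a double-digit gap does not
```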
USAMO: From Good to Almost Perfect
The most dramatic figure in the set may be the 97.6% score on the 2026 USA Mathematical Olympiad. USAMO is not a multiple-choice benchmark. It is proof-heavy, difficult, and designed for elite human problem-solvers. Going from 42.3% for Opus 4.6 to 97.6% for Mythos is not a routine improvement. It suggests a qualitative shift in long-horizon reasoning, abstraction, and structured problem solving.
Why should enterprise buyers care about a math olympiad score? Because high-end reasoning benchmarks often function as an early signal. They do not perfectly map to business workflows, but they often correlate with a model’s ability to plan, decompose complex tasks, maintain correctness across long chains of logic, and recover from dead ends. Those are the traits that also matter in architecture design, debugging, agent planning, and scientific workflows.
In other words: a model that gets dramatically better at formal reasoning rarely stays confined to math contests.

Benchmarks Are Running Out of Road
One of the biggest takeaways from Mythos is not simply that it sets new records. It is that some evaluations are becoming less informative at the frontier. Anthropic already says Cybench is saturated for its top models. Once scores approach perfection, the benchmark stops telling you much about real-world separation between models.
That creates three problems:
- Capability outpaces measurement. We may know a model is very strong without having a reliable way to quantify how much stronger it is than the last one.
- Safety evaluation gets harder. If benchmark suites become less discriminating, labs need new ways to assess what models can actually do in production-like environments.
- Marketing noise rises. As more numbers converge near the top, selective reporting and methodology differences matter more.
This is why the benchmark story cannot be separated from the cyber story. Anthropic’s most meaningful evidence did not come from legacy benchmark suites alone. It came from operationally realistic testing: Firefox exploit generation, vulnerability reproduction, and private cyber ranges.
The Latest News: Glasswing Expansion and Vertex AI Preview
The newest development is that Claude Mythos is already moving from disclosure to controlled deployment.
Anthropic says its newly announced Project Glasswing now includes launch partners such as AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. Anthropic also says it has extended access to more than 40 additional organizations that build or maintain critical software infrastructure, while committing up to $100 million in usage credits and $4 million in direct donations to open-source security organizations.
Separately, Google Cloud says Claude Mythos Preview is now available in Private Preview to a select group of Google Cloud customers through Vertex AI as part of Project Glasswing.
That is a meaningful market signal. It suggests Anthropic is not treating Mythos as a mere research artifact. It is treating it as a restricted enterprise asset — too sensitive for broad release, but too strategically valuable to leave on the shelf.
For buyers, this is the beginning of a new pattern: frontier capability may increasingly appear first in gated enterprise programs tied to safety, infrastructure, and compliance requirements. Public access may come later, with stricter monitoring and narrower operating envelopes.

What This Means for Enterprise AI Buyers
If you are evaluating AI for software engineering, internal agents, or security operations, Mythos changes the planning horizon.
- Do not assume public availability equals frontier capability. The best model in the market may not be the one you can trial with a credit card.
- Expect restricted-access programs to grow. The most powerful models may be routed first through cloud partners, critical infrastructure programs, or tightly monitored beta cohorts.
- Invest in evaluation infrastructure now. If benchmark claims are getting harder to compare, enterprises need their own task suites, quality gates, and business-specific scorecards; a minimal sketch follows this list.
- Prepare for agentic software work. At these performance levels, the question is not whether models will change engineering workflows. It is how quickly your organization can adapt responsibly.
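As a starting point for that evaluation infrastructure, here is a minimal, hypothetical scorecard harness. The task names, prompts, pass criteria, and the `run_model` stub are placeholders rather than any vendor’s API; the point is the shape: business-specific tasks, explicit quality gates, and a pass rate you own.

```python
# Hypothetical internal eval scorecard. Task names, prompts, pass criteria, and
# the run_model stub are placeholders; wire them to your own provider SDK and
# business-specific task suite.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    name: str
    prompt: str
    passes: Callable[[str], bool]   # business-specific quality gate

def run_scorecard(tasks: list[EvalTask], run_model: Callable[[str], str]) -> dict:
    """Run each task through the model and return per-task results plus a pass rate."""
    results = {t.name: t.passes(run_model(t.prompt)) for t in tasks}
    return {"tasks": results, "pass_rate": sum(results.values()) / len(tasks)}

# Toy tasks with string-based checks standing in for real review criteria.
tasks = [
    EvalTask("sql_fix", "Fix the GROUP BY bug in this query: ...", lambda out: "GROUP BY" in out),
    EvalTask("ticket_triage", "Classify this incident ticket: ...", lambda out: out.strip() != ""),
]

# Stubbed model call for illustration; swap in a real endpoint here.
scorecard = run_scorecard(tasks, run_model=lambda prompt: "SELECT region, SUM(x) FROM t GROUP BY region")
print(scorecard["pass_rate"])   # 1.0 with the stub; track this per model and per release
```

The value is less in the specific checks than in owning the task list, the gates, and the trend line across model versions.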
Hybr’s view is simple: the benchmark race is no longer academic. It is now directly connected to software delivery speed, security posture, cloud governance, and competitive advantage.
Key Takeaways
| If You Are… | What This Means |
|---|---|
| A CTO or VP Engineering | 93.9% on SWE-bench Verified signals that AI-assisted issue resolution and code maintenance are becoming operationally real. |
| An AI Platform Leader | Benchmark claims are getting harder to compare. Build internal evals instead of relying only on vendor scorecards. |
| A CISO | The same model class that dominates coding benchmarks is also strong enough to change the offense-defense balance in cyber. |
| A CIO or COO | Frontier models may arrive first through governed enterprise partnerships, not open consumer rollout. |
| A Policymaker or Analyst | Benchmark saturation means conventional public metrics may understate how quickly frontier capability is advancing. |
Frequently Asked Questions
What is Claude Mythos Preview’s most impressive benchmark?
The two headline figures are 93.9% on SWE-bench Verified and 97.6% on the 2026 USAMO. SWE-bench is likely more immediately relevant for enterprise buyers because it reflects real software engineering task performance.
Is Mythos publicly available?
No. Anthropic has not broadly released Claude Mythos Preview. Access is being routed through Project Glasswing and selected enterprise or infrastructure partners under restricted conditions.
Why does the Vals SWE-bench leaderboard show lower scores than Anthropic’s Mythos claims?
Because they are different evaluation setups. Public leaderboard results and vendor-reported benchmark figures often use different harnesses, policies, budgets, and eligibility criteria. They are useful for context, but not always directly comparable.
Why should enterprises care about benchmark saturation?
Because once common evaluations stop separating frontier models, enterprises can no longer depend on a single public score to choose platforms. Internal evals, reliability testing, and domain-specific workflows become more important.
What is the most important recent news beyond the original Mythos system card?
Two developments stand out: Anthropic says Glasswing has expanded to 40+ additional critical software organizations with major funding behind it, and Google Cloud says Mythos is now in private preview on Vertex AI for select customers.
References
- Anthropic, Claude Mythos Preview System Card (April 7, 2026) — https://www.anthropic.com/research/claude-mythos-preview-system-card
- Anthropic, Claude Mythos Preview Addendum (April 7, 2026) — https://assets.anthropic.com/m/claude-mythos-preview-addendum.pdf
- Anthropic, Project Glasswing: Securing critical software for the AI era — https://www.anthropic.com/glasswing
- Google Cloud Blog, Claude Mythos Preview on Vertex AI — https://cloud.google.com/blog/products/ai-machine-learning/claude-mythos-preview-on-vertex-ai
- Vals AI, SWE-bench (updated April 9, 2026) — https://www.vals.ai/benchmarks/swebench
- Part 1: Claude Mythos: The Most Powerful AI Model That Anthropic Won’t Let You Use
- Part 2: Claude Mythos: Cybersecurity — Found Zero-Day Bugs in Firefox — Then Built Working Exploits
This is Part 3 of a 10-part series analyzing the Claude Mythos Preview system card. Up next: “Best-Aligned and Most Dangerous: The Claude Mythos Paradox”.
Follow hybr.com/blog for the full series.
