Deep Dive: Benchmarks, Community Sentiment & Competitive Position
May 28, 2026 · Released 42 days after Opus 4.7
Sources: Anthropic System Card, vellum.ai, Artificial Analysis, llm-stats.com
Release
May 28
2026
Days Since 4.7
42
~6 weeks
Pricing
$5/$25
per 1M tokens
Context
1M
tokens
Anthropic calls it a "modest but tangible improvement" — but benchmark numbers tell a more interesting story. Same price, faster release cadence, and wins on 5 of 6 headline benchmarks.
The 4.7 → 4.8 gap is noticeably shorter than typical Anthropic release cycles (~2-3 months). Possible drivers: competitive pressure from GPT-5.5 and Gemini 3.1 Pro launches, plus rapid iteration on agentic capabilities.
Sharper Judgement
Better at catching own mistakes, pushing back on unsound plans, and building confidence before making big changes.
4× Honesty Improvement
~4x less likely to let code flaws pass unremarked. The productivity unlock that matters more than any benchmark point.
Dynamic Workflows
Claude Code plans work and runs hundreds of parallel subagents in a single session.
Fast Mode: 3× Cheaper
2.5× speed for $10/$50 per 1M tokens — down from $30/$150. A third of the previous cost.
Effort Control Dial
System Entries in Messages Array
Update Claude's instructions mid-task without breaking prompt cache or routing through a user turn. Permissions, token budgets, environment context — all updatable inline.
💡 Note
Most published benchmark scores are at default "high" effort. Anthropic notes higher tiers improve quality further — the headline numbers are conservative.
Hardest variant — actively-maintained repos, multi-file diffs, no ground-truth leakage
69.2%
Opus 4.8
64.3%
Opus 4.7
58.6%
GPT-5.5
54.2%
Gemini 3.1 Pro
+4.9 pts vs 4.7 · +10.6 pts vs GPT-5.5 · +15 pts vs Gemini
Opus 4.8: 88.6% vs 4.7: 87.6% vs Gemini: 80.6%
GPT-5.5 leads on own harness (83.4%), but apples-to-apples: Opus 4.8 74.6% vs GPT-5.5 78.2%
Hardest general-knowledge reasoning benchmark
With Tools
57.9%
Opus 4.8
Without Tools
49.8%
Opus 4.8
Leads both configs. With tools: +3.2 pts vs 4.7, +5.7 pts vs GPT-5.5
Real-world computer tasks on live Ubuntu VM
83.4%
Opus 4.8
82.8%
Opus 4.7
+0.6 pts vs 4.7 · GPT-5.5: 78.7% · Gemini 3.1 Pro: 76.2%
Browser-agent benchmark — Opus 4.8 scores 84%, called "a meaningful jump over both Opus 4.7 and GPT-5.5" by Browserbase.
GDPval-AA — Economically valuable knowledge work across professional domains (scale to 2000)
1,890
Opus 4.8
1,769
GPT-5.5
1,753
Opus 4.7
1,314
Gemini 3.1 Pro
576-point gap vs Gemini — largest spread on any benchmark Anthropic published
Gemini 3.5 Flash wins at 57.9% — smaller, cheaper model beats frontier class. Opus 4.8: 53.9% (leads frontier field).
First model to break 10% on all-pass standard — requires every sub-task in multi-step legal workflow to be correct.
Early testers consistently report "sharper judgement," better self-correction, and improved collaboration. The 4× honesty improvement is repeatedly called the most impactful change for daily developer workflows.
Opus 4.8 is ~4× less likely than its predecessor to allow flaws in code it has written to pass unremarked.
💡 Alignment Milestone
Anthropic's Alignment team: Opus 4.8 "reaches new highs on prosocial traits" and misalignment rates are "similar to Claude Mythos Preview" — previously gated behind Project Glasswing.
Coding
SWE-Bench Pro, Verified
CursorBench efficiency
Reasoning
Humanity's Last Exam
GPQA (tied)
Agentic / Computer Use
OSWorld, Online-Mind2Web
Legal Agent Benchmark
Terminal Work
GPT-5.5 leads on own harness
Opus 4.8 leads on public harness
Finance Analysis
Gemini 3.5 Flash wins
Smaller models winning verticals
Professional Work
GDPval-AA dominant
576-point gap over Gemini
Unchanged from Opus 4.7
$5
per 1M input
$25
per 1M output
2.5× speed, 1/3 the previous cost
$10
per 1M input
$50
per 1M output
Was $30/$150 on prior Opus models
Cursor reports fewer steps for same intelligence = lower token-per-task cost. Cognition reports fixed verbosity issues from 4.7 = fewer output tokens. The model is effectively cheaper to use even at the same per-token price.
⚠️ Worth Flagging
Anthropic's own framing calls 4.8 a "modest but tangible improvement." The benchmark numbers are stronger than the marketing suggests, but this isn't a revolutionary leap — it's a sharpening release.
Opus 4.8
61.4
+4.1 vs 4.7
GPT-5.5 (xhigh)
60.2
Previous leader
Opus 4.7
57.3
Prior version
Opus 4.8 takes the #1 spot on the Artificial Analysis Intelligence Index, which incorporates 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPath.
💡 Note
The index evaluates independently — not using Anthropic's own numbers. 14 of 16 quality evaluations show improvement over Opus 4.7.
Claude Mythos Preview is currently available to a small number of organizations through Project Glasswing. Anthropic confirms Mythos-class models for general release "in the coming weeks."
Anthropic also states they plan to release a new class of model with even higher intelligence than Opus — suggesting the Opus line may not be the ceiling for long.
Claude Opus 4.8 is a sharpening release that punches above its "modest" framing. It wins where it matters for developers (coding, honesty, agentic reliability) and the 3× cheaper fast mode makes frontier intelligence more accessible. The 42-day release cadence signals Anthropic is iterating faster under competitive pressure. For production Claude users, switching requires little deliberation. For those evaluating frontiers, the benchmark case is strong — but GPT-5.5 and Gemini still win specific verticals.
Anthropic Announcement
anthropic.com/news/claude-opus-4-8
Claude API Docs
docs.anthropic.com
System Card
Linked from announcement
vellum.ai Benchmarks Explained
vellum.ai/blog/claude-opus-4-8-benchmarks-explained
Artificial Analysis Index
artificialanalysis.ai
llm-stats.com
llm-stats.com/models/claude-opus-4-8
💡 Research Methodology
This analysis combines Anthropic's official System Card data, independent evaluation from Artificial Analysis, third-party benchmark analysis from vellum.ai, and qualitative feedback from early testers at Cursor, Cognition, Browserbase, Harvey, and others.