Claude Opus 4.8

Deep Dive: Benchmarks, Community Sentiment & Competitive Position

May 28, 2026 · Released 42 days after Opus 4.7

Sources: Anthropic System Card, vellum.ai, Artificial Analysis, llm-stats.com

At a Glance

Release

May 28

2026

Days Since 4.7

42

~6 weeks

Pricing

$5/$25

per 1M tokens

Context

1M

tokens

CONFIRMED

Anthropic calls it a "modest but tangible improvement" — but benchmark numbers tell a more interesting story. Same price, faster release cadence, and wins on 5 of 6 headline benchmarks.

Release Cadence: Faster Than Ever

Claude Opus 4.6 · Feb 2026 · ~2.5 months prior

Claude Opus 4.7 · Apr 16, 2026 · 2 months later

Claude Opus 4.8 · May 28, 2026 · 42 days later (~6 weeks)

ACCELERATED

The 4.7 → 4.8 gap is noticeably shorter than typical Anthropic release cycles (~2-3 months). Possible drivers: competitive pressure from GPT-5.5 and Gemini 3.1 Pro launches, plus rapid iteration on agentic capabilities.

Top Highlights

NEW

Sharper Judgement
Better at catching own mistakes, pushing back on unsound plans, and building confidence before making big changes.

NEW

4× Honesty Improvement
~4x less likely to let code flaws pass unremarked. The productivity unlock that matters more than any benchmark point.

NEW

Dynamic Workflows
Claude Code plans work and runs hundreds of parallel subagents in a single session.

HOT

Fast Mode: 3× Cheaper
2.5× speed for $10/$50 per 1M tokens — down from $30/$150. A third of the previous cost.

Effort Control & API Improvements

USER-FACING

Effort Control Dial

Default: "high" — best quality/UX balance
"extra" (xhigh) — more tokens, better answers
"max" — maximum reasoning depth

DEVELOPER-ONLY

System Entries in Messages Array

Update Claude's instructions mid-task without breaking prompt cache or routing through a user turn. Permissions, token budgets, environment context — all updatable inline.

💡 Note

Most published benchmark scores are at default "high" effort. Anthropic notes higher tiers improve quality further — the headline numbers are conservative.

Benchmarks: Coding

SWE-BENCH PRO

Hardest variant — actively-maintained repos, multi-file diffs, no ground-truth leakage

69.2%

Opus 4.8

64.3%

Opus 4.7

58.6%

GPT-5.5

54.2%

Gemini 3.1 Pro

+4.9 pts vs 4.7 · +10.6 pts vs GPT-5.5 · +15 pts vs Gemini

SWE-BENCH VERIFIED

Opus 4.8: 88.6% vs 4.7: 87.6% vs Gemini: 80.6%

TERMINAL-BENCH 2.1

GPT-5.5 leads on own harness (83.4%), but apples-to-apples: Opus 4.8 74.6% vs GPT-5.5 78.2%

Benchmarks: Reasoning & Computer Use

HUMANITY'S LAST EXAM

Hardest general-knowledge reasoning benchmark

With Tools

57.9%

Opus 4.8

Without Tools

49.8%

Opus 4.8

Leads both configs. With tools: +3.2 pts vs 4.7, +5.7 pts vs GPT-5.5

OSWORLD-VERIFIED

Real-world computer tasks on live Ubuntu VM

83.4%

Opus 4.8

82.8%

Opus 4.7

+0.6 pts vs 4.7 · GPT-5.5: 78.7% · Gemini 3.1 Pro: 76.2%

ONLINE-MIND2WEB

Browser-agent benchmark — Opus 4.8 scores 84%, called "a meaningful jump over both Opus 4.7 and GPT-5.5" by Browserbase.

Benchmarks: Professional Work

LARGEST GAP

GDPval-AA — Economically valuable knowledge work across professional domains (scale to 2000)

1,890

Opus 4.8

1,769

GPT-5.5

1,753

Opus 4.7

1,314

Gemini 3.1 Pro

576-point gap vs Gemini — largest spread on any benchmark Anthropic published

FINANCE AGENT V2

Gemini 3.5 Flash wins at 57.9% — smaller, cheaper model beats frontier class. Opus 4.8: 53.9% (leads frontier field).

LEGAL AGENT (HARVEY)

First model to break 10% on all-pass standard — requires every sub-task in multi-step legal workflow to be correct.

Benchmark Verdict

✅ SWE-Bench Pro — Leads by 4.9 pts vs 4.7, 10.6 pts vs GPT-5.5 CONFIRMED

✅ Humanity's Last Exam — Leads both with/without tools configs CONFIRMED

✅ OSWorld / Computer Use — Leads, prerequisite for dynamic workflows CONFIRMED

✅ GDPval-AA — Dominant 576-point gap over Gemini 3.1 Pro CONFIRMED

⚠️ Terminal-Bench — GPT-5.5 leads on own harness; Opus 4.8 leads on public harness HARNASS-DEPENDENT

❌ Finance Agent v2 — Gemini 3.5 Flash (smaller model) wins at 57.9% LOSES

⚠️ GPQA Diamond — Statistically tied with 4.7 and Gemini (benchmark saturated) TIED

Community Sentiment

@michaeltruell (Cursor CEO) 🔄 High engagement
"On CursorBench, Opus 4.8 exceeds prior Opus models across every effort level. Tool calling is meaningfully more efficient, using fewer steps for the same intelligence."

@scottwu (Cognition CEO) 🔄 High engagement
"Opus 4.8 fixes the comment-verbosity and tool-calling issues we saw with Opus 4.7. This translates directly into faster capability gains for engineers building on Devin."

@katieparrott (Staff Writer) ❤️ High engagement
"Feels like a major quality-of-life update over Opus 4.7: faster, easier to collaborate with, and better at carrying context and style direction across a long session."

More Community Feedback

@nikogrupen (Harvey, Head of Applied Research) 🔄 High engagement
"Highest score recorded on our Legal Agent Benchmark — first to break 10% overall on the all-pass standard. That's the accuracy lift that translates into how much real attorney work customers can hand off."

@michaelran (Sr. Investment Associate) ❤️ High engagement
"Consistently higher quality analysis than prior Opus models. Finished faster and produced richer, more information-dense outputs. The biggest differentiator: proactively flags issues with inputs and outputs."

CONSENSUS

Early testers consistently report "sharper judgement," better self-correction, and improved collaboration. The 4× honesty improvement is repeatedly called the most impactful change for daily developer workflows.

The Honesty Factor

KILLER FEATURE

Opus 4.8 is ~4× less likely than its predecessor to allow flaws in code it has written to pass unremarked.

The Problem

Model claims task is done, test suite hasn't been run, edge case hasn't been considered. You only catch it because something felt off.

The Fix

Cutting that rate by 4× is a bigger productivity unlock than any single benchmark point. Changes day-to-day developer experience.

💡 Alignment Milestone

Anthropic's Alignment team: Opus 4.8 "reaches new highs on prosocial traits" and misalignment rates are "similar to Claude Mythos Preview" — previously gated behind Project Glasswing.

Competitive Position

WINS

Coding
SWE-Bench Pro, Verified
CursorBench efficiency

WINS

Reasoning
Humanity's Last Exam
GPQA (tied)

WINS

Agentic / Computer Use
OSWorld, Online-Mind2Web
Legal Agent Benchmark

MIXED

Terminal Work
GPT-5.5 leads on own harness
Opus 4.8 leads on public harness

LOSES

Finance Analysis
Gemini 3.5 Flash wins
Smaller models winning verticals

LEADS

Professional Work
GDPval-AA dominant
576-point gap over Gemini

Pricing & Cost Efficiency

REGULAR MODE

Unchanged from Opus 4.7

$5

per 1M input

$25

per 1M output

FAST MODE — 3× CHEAPER

2.5× speed, 1/3 the previous cost

$10

per 1M input

$50

per 1M output

Was $30/$150 on prior Opus models

COST EFFICIENCY

Cursor reports fewer steps for same intelligence = lower token-per-task cost. Cognition reports fixed verbosity issues from 4.7 = fewer output tokens. The model is effectively cheaper to use even at the same per-token price.

Where It Loses (The Full Picture)

⚠️ Terminal-Bench 2.1 — GPT-5.5 leads on their Codex CLI harness (83.4%). Apples-to-apples on public Terminus-2: Opus 4.8 (74.6%) vs GPT-5.5 (78.2%). HARNASS GAP

❌ Finance Agent v2 — Gemini 3.5 Flash (smaller, cheaper) wins at 57.9%. Smaller models keep winning specific verticals. VERTICAL LOSS

⚠️ GPQA Diamond — Statistically tied with Opus 4.7 (94.2%) and Gemini 3.1 Pro (94.3%). Benchmark is effectively saturated. SATURATED

⚠️ Worth Flagging

Anthropic's own framing calls 4.8 a "modest but tangible improvement." The benchmark numbers are stronger than the marketing suggests, but this isn't a revolutionary leap — it's a sharpening release.

Artificial Analysis Intelligence Index

Opus 4.8

61.4

+4.1 vs 4.7

GPT-5.5 (xhigh)

60.2

Previous leader

Opus 4.7

57.3

Prior version

INDEX LEADER

Opus 4.8 takes the #1 spot on the Artificial Analysis Intelligence Index, which incorporates 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPath.

💡 Note

The index evaluates independently — not using Anthropic's own numbers. 14 of 16 quality evaluations show improvement over Opus 4.7.

Project Glasswing & Mythos

FUTURE

Claude Mythos Preview is currently available to a small number of organizations through Project Glasswing. Anthropic confirms Mythos-class models for general release "in the coming weeks."

Alignment Leak?

Opus 4.8 matches Mythos Preview's alignment numbers — the first time a generally-available Claude has been benchmarked at Mythos-class alignment levels.

What It Means

Either alignment work generalizes faster than capability work, or Mythos-class alignment is no longer Mythos-exclusive. We'll know when Mythos ships.

CONFIRMED

Anthropic also states they plan to release a new class of model with even higher intelligence than Opus — suggesting the Opus line may not be the ceiling for long.

Final Verdict

FOR

✅ Best-in-class coding (SWE-Bench Pro)
✅ Leads 5 of 6 headline benchmarks
✅ 4× honesty improvement — biggest UX win
✅ Fast mode 3× cheaper
✅ Dynamic workflows for agentic scale
✅ Same price as 4.7
✅ #1 on Artificial Analysis Intelligence Index

AGAINST

❌ GPT-5.5 still leads Terminal-Bench (own harness)
❌ Gemini 3.5 Flash wins Finance Agent
❌ "Modest" improvement per Anthropic's own framing
❌ GPQA Diamond saturated (no headroom)
❌ Verbose output still reported (4.7 carryover)

BOTTOM LINE

Claude Opus 4.8 is a sharpening release that punches above its "modest" framing. It wins where it matters for developers (coding, honesty, agentic reliability) and the 3× cheaper fast mode makes frontier intelligence more accessible. The 42-day release cadence signals Anthropic is iterating faster under competitive pressure. For production Claude users, switching requires little deliberation. For those evaluating frontiers, the benchmark case is strong — but GPT-5.5 and Gemini still win specific verticals.

Resources

OFFICIAL

Anthropic Announcement
anthropic.com/news/claude-opus-4-8

Claude API Docs
docs.anthropic.com

System Card
Linked from announcement

THIRD-PARTY

vellum.ai Benchmarks Explained
vellum.ai/blog/claude-opus-4-8-benchmarks-explained

Artificial Analysis Index
artificialanalysis.ai

llm-stats.com
llm-stats.com/models/claude-opus-4-8

💡 Research Methodology

This analysis combines Anthropic's official System Card data, independent evaluation from Artificial Analysis, third-party benchmark analysis from vellum.ai, and qualitative feedback from early testers at Cursor, Cognition, Browserbase, Harvey, and others.