Seventy percent versus fifty-eight percent. That single CursorBench jump — a twelve-point gap in a benchmark that barely budges from release to release — is the headline number for Opus 4.7 vs Opus 4.6, and it is the kind of result that makes a Thursday morning interesting. Anthropic shipped claude-opus-4-7 on April 16, 2026, and the release note reads less like an incremental point bump and more like a quiet shift in what a production coding model can actually finish unattended. At TheBomb®, we have been running Claude models through real client builds since Sonnet 3.5 — Astro sites, Cloudflare Worker APIs, WordPress migrations, the full ugly middle of agency work — and the gap between a benchmark chart and a billable shipping sprint is where we live. This piece walks the new numbers, what actually improved, and where the upgrade carries a catch worth budgeting for.
The Short Answer: Is Opus 4.7 Worth Upgrading From 4.6?
Yes — with two caveats.
Upgrade if you run agent loops, long-horizon coding sessions, one-shot feature requests, or anything involving screenshots and visual reasoning. The coding, agent, and vision gains are large enough that staying on 4.6 is a measurable handicap.
Hold if you have a mature, heavily tuned prompt pipeline that is shipping well on 4.6, or if your workload is cost-dominated batch inference where a token accounting shift would sting. Pricing holds at $5 per million input and $25 per million output tokens, unchanged from 4.6, but the new tokenizer means the same English text now maps to between 1.0× and 1.35× as many tokens. Net-net, Anthropic’s internal coding evals report this is efficiency-positive because the model solves more in fewer turns — but your dashboard will not agree on day one.
The stricter instruction adherence in 4.7 also means prompts written against 4.6’s intuition will sometimes get read more literally than you want. More on that below.
Coding Benchmarks Side-by-Side
Here is the full scoreboard from Anthropic’s April 16 release announcement, the associated partner reports, and the official model documentation.
| Benchmark | Opus 4.6 | Opus 4.7 | Delta | Source |
|---|---|---|---|---|
| CursorBench (agentic coding) | 58% | 70% | +12 pts | Cursor |
| Rakuten-SWE-Bench (production tasks resolved) | 1× | 3× | 3× multiplier | Rakuten |
| Terminal-Bench 2.0 | baseline | passes 3 tasks no prior Claude model completed | qualitative | Anthropic |
| Internal 93-task coding benchmark | baseline | +13% resolution | +13 pts | Anthropic |
| CodeRabbit code review recall | baseline | recall up 10%+ | +10 pts | CodeRabbit |
| Finance Agent (General Finance) | 0.767 | 0.813 | +0.046 | Anthropic |
| Research-agent benchmark | — | 0.715 tied-top across 6 modules | new SOTA tie | Anthropic |
| Databricks OfficeQA Pro (document reasoning errors) | baseline | 21% fewer errors | -21% errors | Databricks |
| Harvey BigLaw Bench (substantive accuracy) | baseline | 90.9% | new high | Harvey |
| XBOW visual acuity | 54.5% | 98.5% | +44 pts | XBOW |
| Notion Agent (multi-step workflows) | baseline | +14% | +14 pts | Notion |
The SWE-Bench family is the canonical signal for whether a model can actually close tickets, and a 3× multiplier on Rakuten’s production variant is not a rounding error — it is the difference between a model that helps a senior engineer and a model that can run overnight on its own backlog.
What Changed Under the Hood
Anthropic ships these point releases quietly, but the 4.7 changelist is unusually substantive.
New tokenizer. The vocabulary was rebuilt. The same English input now consumes 1.0–1.35× as many tokens, depending on content. Code-heavy inputs trend toward the low end; prose and structured data toward the high end.
Stricter instruction adherence. 4.6 was known for reading between the lines — sometimes helpfully, sometimes not. 4.7 reads what you actually wrote. Vague instructions produce literal interpretations rather than the old intuition-jumps.
xhigh effort tier. A new reasoning-effort level slots between high and max. It is now the default for Claude Code, which is why one-shot session quality has visibly stepped up without any user config change.
/ultrareview slash command. Claude Code gains a dedicated deep-review mode for critical diffs. Think of it as a preflight check before merge.
Task budgets in public beta. You can now cap an agent’s tool-call or token spend per task, which pairs well with the new tokenizer math.
File-system-based memory. Long-horizon agents can now persist state across sessions through a proper filesystem abstraction rather than fighting context windows.
Sharper deductive logic. Several partners report the model is markedly better at admitting missing information — Hex specifically notes 4.7 reports missing data instead of generating plausible-but-wrong fallbacks. That is a safety win and a reliability win in the same breath.
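Task budgets are the changelist item most teams will touch first, so here is a minimal sketch of the idea — a per-task cap on token spend and tool calls that halts an agent loop cleanly instead of letting it run open-ended. To be clear, this is our own illustrative wrapper, not Anthropic's actual beta API; every name in it (`TaskBudget`, `run_task`, the simulated step list) is hypothetical.

```python
# Illustrative sketch only: a generic per-task budget guard, not the
# actual task-budget API from the 4.7 beta. All names are hypothetical.

class BudgetExceeded(Exception):
    pass


class TaskBudget:
    """Caps total token spend and tool calls for a single agent task."""

    def __init__(self, max_tokens: int, max_tool_calls: int):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.tokens_spent = 0
        self.tool_calls = 0

    def charge(self, tokens: int) -> None:
        """Record one tool call's token cost; raise once either cap is blown."""
        self.tokens_spent += tokens
        self.tool_calls += 1
        if self.tokens_spent > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.tokens_spent}")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call budget exceeded: {self.tool_calls}")


def run_task(steps, budget: TaskBudget):
    """Run simulated agent steps, stopping cleanly when the budget runs out."""
    completed = []
    for name, cost in steps:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            return completed, "halted"
        completed.append(name)
    return completed, "done"
```

With a 4,000-token budget, a plan/edit/test sequence costing 500, 2,000, and 3,000 tokens halts after the edit step — the agent keeps its partial progress instead of overrunning. The same guard pairs naturally with the new tokenizer math: the cap is in post-rebase tokens, so budgets written against 4.6 counts need revisiting.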
Real-World Coding: Where 4.7 Pulls Ahead
“We saw a meaningful jump in capabilities with improved autonomy on longer tasks.” — Cursor
That quote tracks with what we have seen running 4.7 through our own development stack. Benchmark deltas are one thing; a model that still knows what it is doing on hour three of a refactor is another.
How does Opus 4.7 handle long-horizon work?
This is the category with the largest practical lift. Devin reports “coherent work for hours” on 4.7 — pushing past what had been a hard ceiling on unattended session length. Augment Code calls it “state-of-the-art” for async CI/CD workflows, where the model picks up a branch, runs tests, iterates on failures, and comes back with a clean diff. Replit reports matching-or-better output quality at lower cost because 4.7 finishes tasks in fewer iterations, and bug detection specifically improved.
For agency work, this maps to the stuff that used to require a senior developer on standby: migration sweeps across dozens of files, Tailwind utility consolidations, schema-aware refactors, legacy plugin extractions. The kind of thing we handle on the development side of TheBomb on a weekly basis.
What about one-shot prompts?
Vercel calls 4.7 “phenomenal on one-shot coding tasks.” Genspark reports the highest quality-per-tool-call ratio of any model they have tested. The xhigh default effort in Claude Code is doing real work here — the model spends more reasoning tokens up-front, and the output reflects that investment. A request like “build me a Cloudflare Worker that proxies an RSS feed with ETag caching” now comes back closer to shippable on the first pass.
Agent & Multi-Step Workflows
Coding is half the 4.7 story. The other half is the agent stack — the long-loop workflows where a model has to plan, act, observe, and replan.
Notion’s AI team reports 4.7 as the first model to reliably pass their implicit-need tests, where the user asks for one thing but actually needs the model to infer and act on a prerequisite. Notion Agent improved 14% on multi-step workflows overall. Harvey BigLaw Bench hit 90.9% substantive accuracy. Databricks OfficeQA Pro saw 21% fewer document reasoning errors.
The research-agent benchmark tied the top score across all six modules at 0.715 — a rare result, because research agents typically have module-level personality: strong at synthesis, weak at retrieval, or vice versa. 4.7 is the first model in its class to hold uniformly.
Vision: A 3× Resolution Jump, Quantified
Opus 4.7 now accepts images up to 2,576 pixels on the long edge — roughly 3.75 megapixels, a three-fold lift over 4.6’s ceiling. On XBOW’s visual acuity test, the model jumps from 54.5% to 98.5%.
Why does this matter for web work? Three reasons:
- Figma screenshots at native Retina resolution are now legible end-to-end. You can paste a full-page mock and get component-accurate extraction without downsampling artifacts.
- Whiteboard photos from discovery calls survive the upload. Client hand-drawn wireframes, site maps, IA diagrams — the model can read them instead of guessing.
- Dev screenshots — console errors, rendered layouts, responsive breakpoints — become reliable inputs. “Here is a screenshot of the bug, fix it” is a real workflow now.
The visual side of the web design workflow benefits disproportionately, and it has direct implications for the kind of detail we preserve in our portfolio projects.
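The practical question for a screenshot-heavy workflow is simply whether a given capture fits under the new long-edge ceiling, and if not, how hard it gets downscaled. The arithmetic below is ours — the 2,576 px figure comes from the section above, the helper names are hypothetical.

```python
# Sketch: does an image fit under the long-edge ceiling, and if not,
# what scale factor does it take? The 2,576 px ceiling is the 4.7
# figure quoted above; the rest is plain arithmetic.

LONG_EDGE_CEILING = 2576


def fits(width: int, height: int) -> bool:
    """True if the image needs no downscaling for the 4.7 ceiling."""
    return max(width, height) <= LONG_EDGE_CEILING


def fit_scale(width: int, height: int) -> float:
    """Downscale factor (<= 1.0) needed to bring the long edge under the cap."""
    long_edge = max(width, height)
    return min(1.0, LONG_EDGE_CEILING / long_edge)
```

The nice consequence: a 2× (Retina) capture of a 1,280-px-wide Figma frame is 2,560 px on the long edge — under the ceiling, so it reaches the model untouched. A full 4K screenshot (3,840 × 2,160) still gets scaled to about 67%, which is worth knowing before you paste a pixel-perfect breakpoint audit.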
The Tokenizer Catch: Why Your Bill Might Shift
Pricing is unchanged — $5/M input, $25/M output. But the new tokenizer rebases what “a million tokens” actually buys. Expect the same prompt to bill anywhere from 1.0× to 1.35× depending on content mix.
Anthropic’s internal coding evals claim net positive efficiency. The claim is defensible: a model that closes a task in one turn instead of three costs less even if each turn is 1.2× more expensive. The Cursor team’s autonomy numbers support the same logic — fewer iterations per successful task.
But on day one, your bill will spike. Two mitigations help:
- Task budgets (now in public beta) let you cap per-task spend. For batch jobs, this is essential.
- Conciseness prompts matter more than they did. Asking for terse output used to be polite; on 4.7 it is line-item accounting.
If your workload is cost-dominated and quality-plateaued on 4.6, there is a real argument for staying. For most teams, the efficiency claim cashes out within a week of running real traffic.
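The efficiency claim is easy to sanity-check with a back-of-envelope model. The prices below are the quoted $5/M input and $25/M output; the 1.2× token inflation and the three-turn-to-one-turn assumption are illustrative numbers we picked for the sketch, not measurements.

```python
# Back-of-envelope cost model for the tokenizer shift. Prices are the
# quoted $5/M in, $25/M out; the 1.2x inflation and 3-turn -> 1-turn
# figures are illustrative assumptions, not benchmarks.

PRICE_IN = 5.00 / 1_000_000    # dollars per input token
PRICE_OUT = 25.00 / 1_000_000  # dollars per output token


def task_cost(turns: int, in_tokens_per_turn: int,
              out_tokens_per_turn: int, inflation: float = 1.0) -> float:
    """Total dollar cost of one task across all of its turns."""
    per_turn = (in_tokens_per_turn * inflation * PRICE_IN
                + out_tokens_per_turn * inflation * PRICE_OUT)
    return turns * per_turn


# 4.6: three turns at baseline tokenization
cost_46 = task_cost(turns=3, in_tokens_per_turn=20_000, out_tokens_per_turn=2_000)
# 4.7: one turn, but every token is 1.2x
cost_47 = task_cost(turns=1, in_tokens_per_turn=20_000,
                    out_tokens_per_turn=2_000, inflation=1.2)
```

Under these assumptions the 4.6 task costs $0.45 and the 4.7 task $0.18 — the per-token inflation is swamped by the turn reduction. The flip side is just as visible: set `turns=1` for both and 4.7 is strictly 20% more expensive, which is exactly the batch-inference case where holding on 4.6 makes sense.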
Prompt Migration: What Actually Breaks
Stricter instruction adherence is the single biggest behavioural change, and it is the one that catches teams off guard. A quick field guide:
Before (4.6-friendly):
“Make this component better.”
After (4.7-friendly):
“Refactor this component to reduce re-renders, extract the fetch logic into a custom hook, and preserve the existing API surface.”
4.6 would often expand the first phrasing into something like the second on its own. 4.7 will more literally “make it better” — swap a variable name, maybe tidy some spacing — and hand it back. Not wrong, just literal.
Before:
“Clean up this code.”
After:
“Remove unused imports, sort the remaining imports alphabetically, and replace `var` with `const` or `let` where the variable is never reassigned.”
Before:
“Write some tests for this.”
After:
“Write Vitest unit tests covering the happy path, one error case per branch, and one null-input guard. Use the existing test file in this directory as a style reference.”
The pattern: specificity is now a first-class input. If you have a prompt library built against 4.6’s looser reading, schedule a tune-up sprint. Your content-ops prompts and internal tooling prompts are both fair game.
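A tune-up sprint goes faster with a crude lint pass over the prompt library first. The sketch below flags 4.6-era vague phrasing — the phrase list is our own heuristic, not anything shipped by Anthropic, and a real audit would grow it from your library's actual failure modes.

```python
# Hypothetical helper for a prompt-library tune-up: flag prompts that
# lean on 4.6-era vague phrasing. The phrase list is our own heuristic.
import re

VAGUE_PATTERNS = [
    r"\bmake (this|it)\b.*\bbetter\b",   # "make this X better" with no target
    r"\bclean (this |it )?up\b",          # "clean up" with no checklist
    r"\bsome tests\b",                    # test request with no framework/cases
]


def vague_phrases(prompt: str) -> list[str]:
    """Return vague phrases found in a prompt; an empty list means it looks specific."""
    lowered = prompt.lower()
    hits: list[str] = []
    for pattern in VAGUE_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, lowered))
    return hits
```

Run it over the before/after pairs above and all three “before” prompts get flagged while the rewrites pass clean — a cheap way to triage which prompts need a human rewrite before the migration.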
Safety, Honesty, and the Sycophancy Drop
Safety profile is broadly similar to 4.6 — no regressions, and cyber capabilities were intentionally reduced versus the internal Mythos line. The notable wins:
- Lower sycophancy. 4.7 pushes back when you are wrong more often. This is a productivity feature, not just a safety feature.
- Better prompt-injection resistance. Agentic workflows that touch untrusted inputs — web browsing, email triage, customer data — are measurably harder to hijack.
- Improved honesty. Hex’s data point is the clearest: 4.7 reports “I do not have this information” rather than generating a confident fake. For a revenue dashboard that is the difference between a bug report and a career incident.
When Opus 4.6 Is Still the Right Call
Two scenarios:
- Cost-sensitive batch jobs. If you are running summarisation or classification over millions of documents, and 4.6 quality is sufficient, the tokenizer shift is an unforced cost increase. Stay on 4.6 until Haiku 4.7 ships or until your volume drops.
- Mature prompt pipelines. If you have a hand-tuned production prompt hitting consistent quality on 4.6 and the failure modes are known and contained, do not migrate on release week. Let the community shake out the edge cases, then migrate with a proper regression suite.
For everything else — new builds, agent workflows, Claude Code sessions, vision-heavy tasks, long-horizon coding — 4.7 is the default.
What This Means for Your Team
For development shops and in-house teams, the practical read on Opus 4.7 vs Opus 4.6 is simple: the ceiling on unattended agent work moved up, and the cost of a badly specified prompt moved up with it. Teams that invest in prompt hygiene will see outsized returns. Teams that treat the model like a vending machine will see their bills drift and their output quality flatten.
If you are building a new product, shipping a migration, or trying to decide whether AI-assisted development is ready for your stack, we are happy to talk it through. Our development services page covers the scope of that work, and if you want the longer version, a short call through TheBomb is usually enough to get the brief right. More on the author’s perspective on AI-assisted agency workflows lives on the Cody New author page.
TheBomb® has been building on Anthropic models through every major release since Sonnet 3.5, and the pattern holds: the teams that read the release notes win the quarter.
Key Takeaways
- CursorBench 70% vs 58% and 3× Rakuten-SWE-Bench are the two numbers that matter — both point to meaningfully higher autonomy on production coding tasks.
- Vision ceiling tripled to 2,576 px long edge with XBOW accuracy jumping from 54.5% to 98.5%, making Figma mocks and whiteboard photos viable inputs.
- Tokenizer reshuffle means 1.0–1.35× as many tokens for the same English input; net efficiency positive in practice, but budget for a day-one bill spike.
- Stricter instruction adherence breaks loose 4.6-era prompts; specificity is now a first-class input and prompt libraries need a tune-up.
- Upgrade default: yes. Hold only for cost-dominated batch jobs or mature production pipelines where the retune cost outweighs the quality lift.