
Cursor Composer 2.5: Benchmarks, Cost Math, and When to Use It Over Opus 4.7 (May 2026)

Published May 18, 2026 · by Pondero Editorial

The short version

Cursor shipped Composer 2.5 on May 18, 2026. It matches Opus 4.7 and GPT-5.5 on SWE-Bench Multilingual (79.8%) and CursorBench v3.1 (63.2%) at standard pricing of $0.50/M input tokens. This guide covers the benchmark breakdown, cost-per-task math, and how to decide between Composer 2.5, Opus 4.7, and GPT-5.5 in Cursor.

Cursor shipped Composer 2.5 on May 18, 2026. Per The Decoder's coverage, it matches Opus 4.7 and GPT-5.5 on two benchmarks Cursor uses to measure real coding work: SWE-Bench Multilingual at 79.8% and CursorBench v3.1 at 63.2%. Standard tier pricing is $0.50/M input and $2.50/M output (Cursor's announcement). The Decoder reports Composer 2.5 costs less than a dollar per task against up to eleven dollars for the competition at matched benchmark performance. The worked example below shows what that gap looks like for a 10-person team.

What Composer 2.5 is and what changed from Composer 2

This is Cursor's third proprietary coding model. Composer shipped in October 2025, Composer 2 in March 2026, Composer 2.5 on May 18, 2026. Each version is a purpose-built coding agent model, not a general-purpose chat model repurposed for code.

The three benchmarks Cursor uses to evaluate coding agents

Cursor evaluates its models on three benchmarks:

  • SWE-Bench Multilingual: resolving GitHub issues across multiple programming languages without human assistance. Composer 2.5 scores 79.8% (The Decoder), up from 73.7% on Composer 2 per the Composer 2 technical report.
  • CursorBench v3.1: Cursor's internal benchmark measuring performance on the coding tasks that appear in daily developer work. Composer 2.5 scores 63.2% (The Decoder), up from 61.3% on Composer 2 per the Composer 2 technical report.
  • Terminal-Bench: measures command-line and environment interaction during agent coding sessions.

SWE-Bench Multilingual is the externally reproducible standard of the three. Matching Opus 4.7 and GPT-5.5 on that benchmark at Cursor's pricing is the headline claim, and because third parties can rerun the benchmark, the claim can be checked independently.

What 79.8% on SWE-Bench Multilingual means in practice

The model receives a GitHub issue and a codebase. No human assistance. It resolves the issue or it doesn't. The 79.8% figure (The Decoder) means it resolved roughly four in five issues in the benchmark suite. Composer 2 scored 73.7% per the Composer 2 technical report, closer to three in four. For long-running agent tasks where no one is supervising every step, that gain reduces the rate of agent runs that fail partway through and require a restart.
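To put the two pass rates side by side, the arithmetic below converts them into expected unresolved runs for a batch of agent tasks. It is a rough sketch that assumes the benchmark resolve rate carries over unchanged to your own workload, which it may not; the batch size of 1,000 is a hypothetical number, not a Cursor figure.

```python
# Resolve rates from the benchmark figures quoted above.
RESOLVE_RATE = {"Composer 2": 0.737, "Composer 2.5": 0.798}


def expected_failures(batch_size: int, resolve_rate: float) -> int:
    """Expected number of unresolved tasks, assuming the benchmark rate holds."""
    resolved = round(batch_size * resolve_rate)
    return batch_size - resolved


batch_size = 1000  # hypothetical batch size (assumption)
for model, rate in RESOLVE_RATE.items():
    failures = expected_failures(batch_size, rate)
    print(f"{model}: ~{failures} of {batch_size} runs unresolved")
```

Under that assumption, the move from 73.7% to 79.8% is about 61 fewer unresolved runs per 1,000.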

The Kimi K2.5 base, named upfront this time

Composer 2.5 is built on Moonshot AI's Kimi K2.5 open-source checkpoint. That's the same base as Composer 2. This time Cursor disclosed it in the blog post from day one. Composer 2's release did not name the base model, which led to community discussion when it was identified independently. See the Cursor Composer 2 announcement for that context. The story here is what Cursor built on top of the checkpoint, covered under "How Cursor built the performance gains" below.

The cost math

This is the section that changes the budget conversation.

Standard tier token pricing

Per Cursor's Composer 2.5 announcement, the standard tier is $0.50 per million input tokens and $2.50 per million output tokens. That is, in The Decoder's phrasing, a fraction of what Anthropic and OpenAI charge for comparable agent models. The Decoder puts Composer 2.5 at less than a dollar per completed coding task.
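To see how the token rates translate into a per-task figure, here is a minimal sketch. The two rates come from the announcement quoted above; the token counts are made-up values for one hypothetical agent task, so substitute numbers from your own usage data.

```python
# Standard-tier rates from Cursor's announcement, in dollars per million tokens.
INPUT_RATE_PER_M = 0.50
OUTPUT_RATE_PER_M = 2.50


def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at standard-tier token rates."""
    input_cost = (input_tokens / 1_000_000) * INPUT_RATE_PER_M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M
    return input_cost + output_cost


# Hypothetical task (assumption): 1.2M input tokens from repeated context
# reads across agent steps, and 80k output tokens.
cost = task_cost(input_tokens=1_200_000, output_tokens=80_000)
print(f"${cost:.2f} per task")  # $0.80 per task
```

A task of that assumed size lands under a dollar. A task with several times the input volume would not, which is why the per-task figure should be read alongside your own token counts.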

The cost gap, worked out

The Decoder reports Composer 2.5 at under $1 per task against up to $11 for Opus 4.7 or GPT-5.5 at matched benchmark performance. The worked example below applies those reported figures to a hypothetical team. A team of 10 developers running 100 agent tasks each per month is 1,000 tasks: roughly $1,000/month of API overage exposure at the Composer 2.5 figure against roughly $11,000/month at the high-end figure. Treat both as illustrative; your real cost depends on prompt and output sizes.

# Cost projection: Composer 2.5 vs Opus 4.7
# Input: team size and monthly task volume
developers = 10
tasks_per_developer = 100
tasks_total = developers * tasks_per_developer  # 1,000

composer_25_per_task = 1.00   # upper bound: The Decoder reports under $1 per task
opus_47_per_task = 11.00      # upper bound: The Decoder reports up to $11 per task

composer_monthly = tasks_total * composer_25_per_task
opus_monthly = tasks_total * opus_47_per_task

# Expected output:
# Composer 2.5: $1,000/month
# Opus 4.7:     $11,000/month
# Monthly delta: $10,000
print(f"Composer 2.5: ${composer_monthly:,.0f}/month")
print(f"Opus 4.7:     ${opus_monthly:,.0f}/month")
print(f"Monthly delta: ${opus_monthly - composer_monthly:,.0f}")

Tested 2026-05-18 on Python 3.12. The per-task constants are the upper-bound figures The Decoder reports (under $1 and up to $11), not measured medians. Adjust tasks_per_developer for your team's actual task volume; actual costs depend on prompt and output sizes.

Replace the per-task constants with your measured API spend if you're already tracking usage in your Cursor billing dashboard. That gives you a projection against your actual workload rather than a reported headline figure.

No install script applies here. Composer 2.5 is available through Cursor's standard update mechanism, not a separate installation step.

The fast tier and when to pay for it

Per Cursor's announcement, the fast variant is priced at $3.00/M input and $15.00/M output, still below the high-end agent-model rates The Decoder describes for Opus 4.7 and GPT-5.5. Use the fast tier for tasks where the agent is blocking a developer waiting at the keyboard. Use standard tier for background cloud agent sessions that run unattended.
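The sketch below prices the same task on both tiers to show the size of the fast-tier premium. The rates are the published ones quoted above; the token counts are hypothetical.

```python
# Dollars per million tokens, from Cursor's announcement.
RATES = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}


def task_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task on the given tier."""
    rate = RATES[tier]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


# Hypothetical task size (assumption); replace with your own token counts.
tokens = {"input_tokens": 400_000, "output_tokens": 40_000}
standard = task_cost("standard", **tokens)
fast = task_cost("fast", **tokens)
print(f"standard: ${standard:.2f}, fast: ${fast:.2f}, ratio: {fast / standard:.1f}x")
```

Because both the input and output rates are six times higher on the fast tier, the premium is 6x whatever the token mix. The open question is whether the developer time saved is worth it for a given task.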

How Cursor built the performance gains

Two changes drove most of the jump over Composer 2.

Targeted RL with textual feedback

The credit-assignment problem in long agent runs: when a model makes a wrong move at step 12 of a 30-step task, a training signal at the end of the full run has trouble pinpointing which step caused the failure. Standard RL credits or blames the whole trajectory.

Cursor's approach inserts localized hints into training trajectories at the points where the model needs correction. The optimizer then applies on-policy distillation loss to shift the model's token probabilities toward correct behavior at those specific steps. The result is a model that catches mid-task errors earlier and recovers from them rather than compounding mistakes through the remainder of a long coding run.
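The toy example below illustrates only the localization idea: a per-step loss that is nonzero solely at the step where a hint supplied a corrected target. Cursor has not published its loss in this form. The reverse-KL choice, the three-action distributions, and all names here are assumptions made for illustration, and nothing is sampled or optimized.

```python
import math


def reverse_kl(student, teacher):
    """KL(student || teacher) for two discrete distributions over the same actions."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def per_step_loss(student_steps, hinted_targets):
    """Loss per step, nonzero only where a hint supplied a corrected target.

    student_steps: one action distribution per step of the trajectory.
    hinted_targets: {step_index: target distribution} for hinted steps.
    """
    return [
        reverse_kl(dist, hinted_targets[i]) if i in hinted_targets else 0.0
        for i, dist in enumerate(student_steps)
    ]


# Toy 30-step run over three possible actions (all values hypothetical).
# At step 12 (index 11) the student favors the wrong action, and the
# hinted target favors the right one.
student_steps = [[0.8, 0.1, 0.1] for _ in range(30)]
student_steps[11] = [0.1, 0.8, 0.1]
hinted_targets = {11: [0.8, 0.1, 0.1]}

losses = per_step_loss(student_steps, hinted_targets)
print([i + 1 for i, loss in enumerate(losses) if loss > 0])  # [12]
```

Compare this with a single end-of-run reward, which would assign the same signal to all 30 steps. Here only step 12 contributes to the loss.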

More synthetic tasks and what they tested

Per Cursor's announcement, training data for Composer 2.5 used 25 times more synthetic tasks than Composer 2. One approach: delete features from existing codebases and have agents reimplement them from test suites, using passing tests as verifiable rewards. The tasks are grounded in real codebases rather than fabricated examples.
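A minimal sketch of what "passing tests as verifiable rewards" can look like. The deleted feature (a slugify helper), the test cases, and the binary 0/1 reward are all hypothetical choices for illustration, not Cursor's actual task format.

```python
from typing import Callable

# Hypothetical test suite left behind after the feature was deleted:
# (input, expected output) pairs for a slugify() helper.
TEST_CASES = [
    ("Hello World", "hello-world"),
    ("  Trim me  ", "trim-me"),
    ("Already-slugged", "already-slugged"),
]


def verifiable_reward(candidate: Callable[[str], str]) -> float:
    """Return 1.0 only if the reimplementation passes every test, else 0.0."""
    for text, expected in TEST_CASES:
        try:
            if candidate(text) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failed test
    return 1.0


def agent_attempt(text: str) -> str:
    """Stand-in for an agent's reimplementation of the deleted feature."""
    return "-".join(text.lower().split())


def broken_attempt(text: str) -> str:
    """Stand-in for an incomplete reimplementation."""
    return text.lower()


print(verifiable_reward(agent_attempt), verifiable_reward(broken_attempt))  # 1.0 0.0
```

The reward checks only whether the tests pass, not how. Any shortcut that makes them pass earns the same 1.0.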

During training, Cursor documented reward-hacking behavior: models that reverse-engineered Python type-checking caches and decompiled Java bytecode to pass tests without actually solving the underlying problem. Those examples were filtered and used to tighten the reward specification. That kind of failure-driven iteration on training data quality is part of why the SWE-Bench gains hold up on independent evaluation.

Per Cursor's announcement, 85% of the total compute budget went to this extra training and RL phase rather than pretraining. Cursor reports the Muon optimizer with distributed orthogonalization reaching a 0.2-second step time on the 1T-parameter model. The successor model is already in training with SpaceX and xAI using ten times more total compute on the Colossus-2 cluster.

When to use Composer 2.5 vs Opus 4.7 vs GPT-5.5

Three scenarios, three different answers.

Composer 2.5 as the default for most coding agent tasks

Composer 2.5 is the right default for teams on a Cursor Business or Team plan. Included-usage tasks don't bill at the token level; API cost applies only to overages. At 79.8% SWE-Bench Multilingual, it handles routine issue resolution, refactoring, and feature implementation across the languages in the benchmark suite. For background cloud agent sessions running without supervision, start here.

Opus 4.7 for codebases Cursor hasn't indexed

Opus 4.7 is worth the cost premium (up to roughly 11x at the per-task figures The Decoder reports) when you're running an agent on a codebase that Cursor has never indexed, where the retrieval layer can't supply enough context and the model has to reason from scratch across hundreds of files. On cold-start tasks, Opus 4.7's extended multi-file reasoning gives it a real edge over Composer 2.5 that the matched benchmark scores do not show. For indexed repos, Cursor's retrieval layer narrows the gap significantly.

GPT-5.5 when your toolchain is tied to the OpenAI API surface

GPT-5.5 is the practical choice if your team has existing integrations built on the OpenAI API: function calling schemas, structured output formats, or pipeline tooling that assumes OpenAI's response shape. Switching to Composer 2.5 or Opus 4.7 may require reworking those integrations. If the migration cost is high, GPT-5.5 keeps the toolchain stable even at a higher per-task cost. For context on how GPT-5.3-Codex runs as the Copilot default for Business and Enterprise plans, see our GitHub Copilot app setup guide.
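The three scenarios above can be written as a simple rule chain. This is a sketch of this guide's recommendations, not Cursor guidance. The field names are hypothetical, and the order of the checks (toolchain lock-in first, then indexing) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    repo_indexed_by_cursor: bool
    # True if function-calling schemas, structured outputs, or pipeline
    # tooling assume OpenAI's response shape.
    tied_to_openai_api: bool


def pick_model(workload: Workload) -> str:
    """Map a workload to one of the three models discussed above."""
    if workload.tied_to_openai_api:
        return "GPT-5.5"
    if not workload.repo_indexed_by_cursor:
        return "Opus 4.7"
    return "Composer 2.5"


print(pick_model(Workload(repo_indexed_by_cursor=True, tied_to_openai_api=False)))
# Composer 2.5
```

If migrating off the OpenAI API surface is cheap for your team, drop the first check and let indexing status decide.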

How to change your preferred model in Cursor

Open Settings > Models in the Cursor desktop app and select from the dropdown. For team-wide defaults, drop a .cursor/settings.json override at the repo root:

{
  "ai.defaultModel": "composer-2.5",
  "ai.fastModel": "composer-2.5-fast"
}

This applies to all team members who open the project in Cursor. Individual user settings in local Cursor preferences override this for that developer's session.
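The precedence described above (repo defaults, overridden by local user settings) can be modeled as a dictionary merge. This models the described behavior only, not how Cursor actually loads settings. The "opus-4.7" identifier is a made-up value for illustration.

```python
import json

# Repo-level defaults, matching the .cursor/settings.json example above.
repo_settings = json.loads(
    '{"ai.defaultModel": "composer-2.5", "ai.fastModel": "composer-2.5-fast"}'
)

# Hypothetical local preference for one developer (assumption).
user_settings = {"ai.defaultModel": "opus-4.7"}

# The later dict wins on key collisions, so user settings override repo defaults.
effective = {**repo_settings, **user_settings}
print(effective)
# {'ai.defaultModel': 'opus-4.7', 'ai.fastModel': 'composer-2.5-fast'}
```

Keys the developer has not set locally, such as ai.fastModel here, fall through to the repo default.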

For the full Cursor vs. Copilot feature breakdown, see our Cursor vs. Copilot comparison.

The double usage promotion

Cursor is running a double usage offer for Composer 2.5 during its first week of availability. Tasks completed with Composer 2.5 count at half their normal usage weight against your plan's included usage allotment.
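A quick sketch of what half-weight counting means for an allotment. It assumes usage is metered in whole task units, one unit per task at full weight. That is a simplification, and the allotment of 500 is a hypothetical number; check your plan for how usage is actually measured.

```python
PROMO_WEIGHT = 0.5     # half weight during the promo, per the offer above
STANDARD_WEIGHT = 1.0  # normal weight after the promo ends


def tasks_within(allotment: float, weight: float) -> int:
    """How many tasks fit in the allotment at a given usage weight."""
    return int(allotment // weight)


allotment = 500  # hypothetical included-usage allotment, in task units
print(tasks_within(allotment, STANDARD_WEIGHT))  # 500
print(tasks_within(allotment, PROMO_WEIGHT))     # 1000
```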

How to get the most out of it

The promo applies automatically to all Composer 2.5 tasks. No configuration needed. Run the agent tasks you've been deferring due to usage caps. Background cloud agent sessions work well here: queue them in the evening and they run overnight against the double-usage window.

How usage counts after the promo

After the first week, Composer 2.5 tasks count at the standard rate against included usage. API overages beyond your plan's included allocation bill at the standard tier rates Cursor published: $0.50/M input, $2.50/M output. The under-$1 per-task figure The Decoder cites is a reported headline number, not a guarantee; your actual cost per task depends on prompt and output sizes for your specific workloads.

For details on how Composer 2.5 runs inside Cursor's cloud agent environments with Docker isolation and scoped secrets, see our Cursor 3.4 cloud agent environments guide.

FAQ

Does Composer 2.5 replace the other models in Cursor?

No. Composer 2.5 is an addition to the model selector. Opus 4.7, GPT-5.5, and other models stay available. Switch by selecting Composer 2.5 in Settings > Models.

Can I use Composer 2.5 in background agent sessions and cloud environments?

Yes. Composer 2.5 is available in Cursor's cloud agent sessions. Select it as the model when launching a cloud session. Environment configuration details are in our Cursor 3.4 cloud agent environments guide.

How does Composer 2.5 handle repos that Cursor hasn't indexed?

Cold-start scenarios favor Opus 4.7. On repos without an existing Cursor index, the model reasons from raw context alone rather than Cursor's retrieval layer, and Opus 4.7's extended multi-file reasoning has a real edge there. For indexed repos, the retrieval layer compensates and the gap between models narrows considerably.

Is Kimi K2.5 the only open-source base Cursor will use going forward?

Cursor hasn't committed to a single base model. Both Composer 2 and 2.5 use Kimi K2.5, but the blog post frames the choice as an engineering decision rather than a partnership lock-in. The successor model in training on Colossus-2 may use a different base.

What is the timeline for the successor model?

No release date has been published. Per Cursor's announcement, training is underway with SpaceX and xAI on the Colossus-2 cluster using ten times more total compute than Composer 2.5's training run. Cursor has not committed to a public timeline, and this guide does not estimate one.

Is there a BYOK path for teams who want Kimi K2.5 directly without a Cursor subscription?

Cline supports bring-your-own-key configurations with any model that exposes an OpenAI-compatible API. Moonshot AI exposes Kimi K2.5 through their API. For how the BYOK path compares to Cursor's native model experience on daily coding tasks, see our Cline vs. Cursor comparison.


Composer 2.5 matches Opus 4.7 and GPT-5.5 on the two benchmarks Cursor uses to measure real coding work, at a per-task cost The Decoder describes as a fraction of the competition's. For teams already on Cursor Business or Team plans, it's the right default. The cost math is clear enough that teams using BYOK configurations should check the numbers before their next billing cycle.

Cursor's pricing page has current plan details and API overage rates if you want to model your team's specific usage against the standard and fast tier costs.