Published April 24, 2026 in announcements

Testing GPT-5.5 in early access: what we are seeing so far

Lovable has been testing GPT-5.5 in early access. Our evals show it is the most capable model we've tested for getting builders unblocked, and it is meaningfully stronger than GPT-5.4 on the complex tasks that can stall a build session.

This post breaks down how we benchmarked it, where it wins, and what we're seeing in production.

How we evaluate new models

We run every candidate model against Lovable's internal benchmark suite, which measures end-to-end app-building performance on:

  • Production-readiness: security, reliability, and the kind of edge-case handling a senior engineer would expect in a code review
  • Agentic task completion: multi-step requests that involve tool use, file edits, and iterative debugging
  • Common build scenarios: authentication flows, real-time syncing, database integration, multi-file edits, API wiring
  • Unblocking users on the hardest tasks: specific moments where builders stall, for example extensive UI polishing, long debugging loops, backend config issues, and complex requests that require the model to propose a path forward rather than just execute
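A per-category pass-rate harness is one simple way to run a suite like the one above. This is a hypothetical sketch: the category names come from the post, but `TaskResult`, `score_by_category`, and the pass/fail scoring are illustrative assumptions, not Lovable's actual internal tooling.

```python
# Hypothetical eval harness sketch: score a model's pass rate per benchmark
# category. Categories mirror the post; everything else is an assumption.
from dataclasses import dataclass

@dataclass
class TaskResult:
    category: str  # e.g. "hardest_tasks", "common_build_scenarios"
    passed: bool   # did the model's output meet the bar for this task?

def score_by_category(results):
    """Return the pass rate per category, as a fraction in [0, 1]."""
    totals, passes = {}, {}
    for r in results:
        totals[r.category] = totals.get(r.category, 0) + 1
        passes[r.category] = passes.get(r.category, 0) + (1 if r.passed else 0)
    return {c: passes[c] / totals[c] for c in totals}

results = [
    TaskResult("hardest_tasks", True),
    TaskResult("hardest_tasks", False),
    TaskResult("common_build_scenarios", True),
]
print(score_by_category(results))
# {'hardest_tasks': 0.5, 'common_build_scenarios': 1.0}
```

Comparing two models then reduces to running the same task set through both and diffing the per-category pass rates.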

GPT-5.5 vs. GPT-5.4

Benchmark                                       GPT-5.4   GPT-5.5   Delta
Hardest-tasks benchmark (senior engineer bar)   36.9%     41.6%     +12.5%
Tool calls per request                          11.74     9.03      −23.1%
High user success rate                          27.36%    30.62%    +11.9%
% of stuck user messages                        3.086%    2.780%    −9.9%
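The Delta column appears to be the relative change between the two models' values; that reading is an assumption on our part, not a formula stated in the post, but it reproduces several of the rows:

```python
# Relative change from a GPT-5.4 value to a GPT-5.5 value, as a percentage.
# Assumes the Delta column is (new / old - 1) * 100 — our reading, not a
# stated formula.
def relative_delta(old, new):
    """Percent change from old to new, rounded to one decimal place."""
    return round((new / old - 1) * 100, 1)

print(relative_delta(11.74, 9.03))   # tool calls per request -> -23.1
print(relative_delta(27.36, 30.62))  # high user success rate -> 11.9
print(relative_delta(3.086, 2.780))  # stuck user messages -> -9.9
```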

What changed under the hood

Three things stand out in how GPT-5.5 handles Lovable workloads:

  1. Deeper reasoning on high-stakes requests. On our hardest benchmark, GPT-5.5 is 12.5% ahead of GPT-5.4 at the same cost. It weighs consequences more carefully and produces output closer to what a senior engineer would accept on first review.
  2. Tighter tool use. On average it makes 23.1% fewer tool calls per request and produces 33% fewer output tokens per message. In practice that means fewer intermediate steps, less course-correction, and more intentional edits, making the model around 15% more cost-efficient than GPT-5.4 on everyday tasks.
  3. Better unstuck behavior. The rate of stuck user messages drops by 9.9%, meaning GPT-5.5 is correspondingly more likely to resolve the complex tasks that block users. When there is no obvious path forward, we've seen it reason its way to novel solutions.

What this means for builders

GPT-5.5 is the model to pull in when builders are stuck. The UI fix that won't land, the bug that won't go away, the backend config that's silently breaking your app — GPT-5.5 resolves those in far fewer turns, with a higher success rate, at a lower cost per session.

Builders want continuous progress, not endless iteration. GPT-5.5 breaks through the walls people usually hit on more complex tasks, like authentication flows and real-time syncing, with far less back-and-forth.

— Fabian Hedin, CTO & Co-founder, Lovable.

This will be rolled out to Lovable builders soon.
