GPT-5.5 is OpenAI’s most capable agentic AI model yet

OpenAI launched GPT-5.5 on April 23 as what it calls “a brand new class of intelligence for actual work and powering agents,” and the framing is deliberate. OpenAI says it is its most capable agentic AI model to date, built from the ground up to plan, use tools, check its own output, and work through tasks independently.

GPT-5.5 is the first retrained base model since GPT-4.5, co-designed with NVIDIA’s GB200 and GB300 NVL72 rack-scale systems. The company says the practical difference is that, with GPT-5.5, tasks that previously required multiple prompts and human ‘course-correction’ can now be handed off more completely. The model is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. API access followed on April 24.

The benchmarks

OpenAI’s strongest performance claim is on Terminal-Bench 2.0, a benchmark that tests command-line workflows requiring planning and tool coordination in a sandboxed environment. GPT-5.5 scores 82.7%, against GPT-5.4’s 75.1% and Claude Opus 4.7’s 69.4%.

On SWE-Bench Pro, which evaluates GitHub issue resolution, GPT-5.5 reaches 58.6%, solving more issues in a single pass than earlier versions. OpenAI also introduced Professional-SWE, an internal benchmark where tasks carry a median estimated human completion time of 20 hours. GPT-5.5 scores 73.1%, up from GPT-5.4’s 68.5%.

In long-context reasoning, on MRCR v2 at one million tokens, a retrieval benchmark testing whether a model can locate a specific answer buried in a large document, GPT-5.5 scores 74.0%, against GPT-5.4’s 36.6%.

However, on MCP Atlas, Scale AI’s Model Context Protocol tool-use benchmark, Claude Opus 4.7 leads at 79.1% and no score is recorded for GPT-5.5. OpenAI included that absence in its own benchmark table, which at least signals its confidence in the overall picture.

Token efficiency, pricing reality

API access is priced at US$5 per million input tokens and US$30 per million output tokens, exactly twice the rates for GPT-5.4. OpenAI’s defence is that GPT-5.5 completes the same Codex tasks with fewer tokens than GPT-5.4, making effective costs roughly 20% higher once its efficiency is factored in, a claim that independent testing lab Artificial Analysis validated.
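The arithmetic behind that claim can be checked directly. The following is a minimal sketch, assuming the per-token price is exactly double and that the "roughly 20%" figure is the ratio of total spend on the same tasks; the function name is illustrative and not part of any OpenAI tooling.

```python
def implied_token_ratio(price_multiplier: float,
                        effective_cost_multiplier: float) -> float:
    """Token usage of the new model relative to the old one.

    Effective cost multiplier = price multiplier * token ratio,
    so the token ratio is the quotient of the two multipliers.
    """
    if price_multiplier <= 0:
        raise ValueError("price_multiplier must be positive")
    return effective_cost_multiplier / price_multiplier


# Figures from the article: 2x per-token price, ~1.2x effective cost.
ratio = implied_token_ratio(price_multiplier=2.0,
                            effective_cost_multiplier=1.2)
print(f"Implied token usage vs GPT-5.4: {ratio:.0%}")
```

Under these assumptions, the claim implies GPT-5.5 uses around 60% of the tokens GPT-5.4 needs for the same Codex tasks. Workloads where the token saving is smaller would see an effective cost increase closer to the full doubling.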

GPT-5.5 Pro, available to Pro, Business, and Enterprise users, is priced at US$30 per million input tokens and US$180 per million output tokens. It applies more parallel test-time compute on harder problems and leads the list of publicly available models on BrowseComp, OpenAI’s agentic web-browsing benchmark, at 90.1%.

Token efficiency is worth stress-testing against actual workloads before committing to a model switch. At 10 million output tokens per month, GPT-5.5 standard costs US$300 against Claude Opus 4.7’s US$250, a 20% premium that only pays off if the model’s superior agentic performance means fewer task iterations and fewer retries, with the maths varying by use case.
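The break-even point for that comparison can be sketched in a few lines. This is a rough illustration only: the Claude Opus 4.7 output rate of US$25 per million tokens is an assumption inferred from the US$250 figure above, input-token costs are ignored, and the helper name is hypothetical.

```python
def monthly_output_cost(tokens_millions: float, usd_per_million: float) -> float:
    """Monthly spend on output tokens only (input tokens are ignored)."""
    return tokens_millions * usd_per_million


GPT55_OUTPUT_RATE = 30.0   # US$ per million output tokens, from the article
OPUS47_OUTPUT_RATE = 25.0  # assumption: implied by US$250 for 10M tokens

baseline_tokens = 10.0  # million output tokens per month

gpt55_cost = monthly_output_cost(baseline_tokens, GPT55_OUTPUT_RATE)
opus47_cost = monthly_output_cost(baseline_tokens, OPUS47_OUTPUT_RATE)
premium = gpt55_cost / opus47_cost - 1

# Fraction of the baseline output tokens GPT-5.5 may consume
# before its monthly spend exceeds the cheaper model's.
break_even_fraction = OPUS47_OUTPUT_RATE / GPT55_OUTPUT_RATE

print(f"GPT-5.5: US${gpt55_cost:.0f}, Opus 4.7: US${opus47_cost:.0f}")
print(f"Premium at equal token usage: {premium:.0%}")
print(f"Break-even token usage: {break_even_fraction:.1%} of baseline")
```

On these assumptions, GPT-5.5 needs to finish the same work with roughly one-sixth fewer output tokens, through fewer iterations or retries, before the higher rate stops costing more. Replacing `baseline_tokens` and the rates with measured figures from a real workload gives a more meaningful answer than the headline prices.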

In practice

OpenAI says more than 85% of employees now use Codex weekly across its departments, including engineering and marketing. In one example, the communications team used GPT-5.5 to process six months of speaking request data, where the model was able to build a scoring and risk framework to help automate low-risk approvals.

Greg Brockman described the release as “a real step forward towards the kind of computing that we expect in the future,” and chief scientist Jakub Pachocki noted the last two years of model progress had felt “surprisingly slow.”

OpenAI says GPT-5.5 matches GPT-5.4’s per-token latency in production serving while performing at a higher level of intelligence; larger, more capable models are often slower to serve, but that trade-off was avoided here.

Whether the benchmark leads translate into production gains for teams running real agentic pipelines is the question that will take the next few weeks to answer properly. The Terminal-Bench score is promising for unattended terminal agents and DevOps automation. The MCP Atlas gap is worth watching for anyone building heavily on tool-use orchestration.

(Image source: “‘The Agent’ Fossil Watch” by MarkGregory007 is licensed under CC BY-NC-SA 2.0.)
