GPT-5.4 Hands-On: Fast, Smart, but Not Cheap

Key insights
- GPT-5.4's new Playwright skill lets it open a browser, spot visual bugs, and fix them without human input, producing a 3D project in three prompts
- Tool search cuts token usage by 47% when many tools are connected, but GPT-5.4 has the slowest time-to-first-token of any major model
- Anthropic's Opus 4.6 still produces better UI designs in side-by-side comparisons, a gap confirmed by Design Arena rankings
This article is a summary of The New Best Model Is Here (GPT-5.4). Watch the video →
In Brief
Richard Oliver Bray, developer educator at Better Stack, spent several hours testing OpenAI's newly released GPT-5.4. His verdict: the model is a clear upgrade for coding and agentic tasks, with impressive browser automation through a new Playwright skill. But it comes with the slowest response time of any major model, steep API pricing, and UI design output that still trails Anthropic's Opus 4.6 in side-by-side tests.
The pitch: one model to rule them all
Bray describes GPT-5.4 as OpenAI's attempt to merge the coding power of Codex 5.3 with the knowledge and web search capabilities of GPT-5.2 into a single model (0:30). According to third-party benchmarks from Artificial Analysis, this strategy appears to have worked: GPT-5.4 ranks as the best coding model, the best agentic model (AI that takes actions independently using tools), and ties with Google's Gemini for best intelligence model (0:44).
The most notable addition is native computer use. GPT-5.4 is reportedly OpenAI's first general-purpose model with built-in capabilities for controlling a computer through mouse and keyboard commands in response to screenshots (0:57). OpenAI also released an experimental Playwright skill, a browser automation tool built on Microsoft's open-source library (1:08).
The demo: 3D Tower Bridge in three prompts
Bray tested this by asking GPT-5.4 to build an interactive 3D experience of Tower Bridge in London. The first iteration took about 30 minutes from a single prompt (1:50). The model wrote code, opened a browser using the Playwright skill, navigated the 3D scene, spotted visual problems like mismatched backgrounds, jumped back into the code to fix them, and repeated the cycle (1:35).
After two follow-up prompts to fix details like sideways boats and overlapping textures, the total project took about 1.5 hours of completely hands-off work (2:10). Bray calls it "not perfect by any means" but a "no-brainer upgrade" for existing Codex users.
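The render-inspect-fix cycle Bray describes can be sketched as a simple loop. This is a toy illustration, not the Playwright skill's actual implementation: `render_scene`, `find_visual_issues`, and `apply_fix` are hypothetical stand-ins for the model opening a browser, reading a screenshot, and editing code.

```python
# A toy sketch of the screenshot-inspect-fix loop described above.
# All three helper functions are hypothetical stand-ins, not real APIs.

def render_scene(code: str) -> str:
    """Stand-in for opening a browser and capturing a screenshot."""
    return f"screenshot of: {code}"

def find_visual_issues(screenshot: str) -> list[str]:
    """Stand-in for inspecting the screenshot for visual bugs."""
    issues = []
    for known_bug in ("mismatched background", "sideways boat"):
        if known_bug in screenshot:
            issues.append(known_bug)
    return issues

def apply_fix(code: str, issue: str) -> str:
    """Stand-in for editing the code so the named issue disappears."""
    return code.replace(issue, "OK")

def verify_and_fix(code: str, max_rounds: int = 5) -> tuple[str, int]:
    """Repeat render -> inspect -> fix until the scene comes back clean."""
    for round_no in range(max_rounds):
        issues = find_visual_issues(render_scene(code))
        if not issues:
            return code, round_no
        for issue in issues:
            code = apply_fix(code, issue)
    return code, max_rounds

fixed, rounds = verify_and_fix("scene with mismatched background and sideways boat")
print(rounds)  # 1: one round of fixes, then a clean verification pass
```

The key point is the termination condition: the loop only stops once a fresh screenshot shows no remaining issues, which is what makes the workflow hands-off.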
Tool search: less waste, same accuracy
The other standout feature is tool search. When an AI model has many tools connected through MCP servers (Model Context Protocol, a standard for connecting AI to external tools), all their definitions normally get loaded into the conversation upfront, wasting tokens (the basic units AI models process, roughly 3-4 characters each) and hurting the quality of answers (2:49).
GPT-5.4 instead loads a lightweight list of available tools and looks up the full definition only when it actually needs one. OpenAI claims this reduces token usage by 47% in a test with 36 MCP servers while maintaining the same accuracy (3:15).
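The savings come from deferring the bulky part. A minimal sketch of the idea, with invented tool definitions and a very rough 4-characters-per-token estimate (not OpenAI's actual mechanism or tokenizer):

```python
# Sketch of lazy tool loading: expose only tool names upfront and
# fetch a full definition on demand. Definitions here are invented.

FULL_DEFINITIONS = {
    "search_issues": (
        '{"name": "search_issues", "description": "Search the issue tracker",'
        ' "parameters": {"type": "object", "properties": {"query":'
        ' {"type": "string"}, "state": {"type": "string"}}}}'
    ),
    "create_branch": (
        '{"name": "create_branch", "description": "Create a git branch",'
        ' "parameters": {"type": "object", "properties": {"name":'
        ' {"type": "string"}}}}'
    ),
}

def rough_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def stub_listing() -> str:
    """The lightweight list the model sees upfront: names only."""
    return ", ".join(FULL_DEFINITIONS)

def lookup(tool_name: str) -> str:
    """Load the full definition only when the tool is actually needed."""
    return FULL_DEFINITIONS[tool_name]

# Loading everything upfront vs. stubs plus one on-demand lookup:
upfront_cost = sum(rough_tokens(d) for d in FULL_DEFINITIONS.values())
lazy_cost = rough_tokens(stub_listing()) + rough_tokens(lookup("search_issues"))
print(upfront_cost, lazy_cost)
```

With only two tools the saving is modest; the gap widens with every connected server, which is why OpenAI's 36-server test shows a 47% reduction.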
The tradeoffs
Speed: slowest of any major model
The most noticeable downside, according to Bray, is response time. GPT-5.4 has the longest time-to-first-token of any model tracked by Artificial Analysis, meaning it takes longer than competitors to start generating output (4:06). The same holds for the time to return the first 500 tokens.
Bray notes he is "not sure if this is a model issue or a provider issue" and suggests it might improve over time. But he also raises a more cynical possibility: that the model is intentionally slow, nudging users toward the new fast mode (4:21).
Fast mode: same model, higher bill
Fast mode delivers the exact same model and intelligence at 1.5 times the token speed, but it is billed at double the normal rate (2:32). Bray describes it as "essentially just a priority tier," not a different model at all. After two hours of testing, the model itself suggested he should have been using fast mode, which would have saved about an hour.
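The fast-mode tradeoff is easy to quantify: 1.5× the throughput at 2× the price. A back-of-envelope calculation, with illustrative numbers (the 50 tokens/sec throughput is an assumption, not a measured figure):

```python
# Back-of-envelope math for fast mode as described: same model,
# 1.5x the token throughput, billed at 2x the rate.

def fast_mode_tradeoff(tokens: int, base_tps: float, base_cost_per_m: float):
    """Return (seconds saved, extra dollars spent) by using fast mode."""
    normal_time = tokens / base_tps
    fast_time = tokens / (base_tps * 1.5)
    normal_cost = tokens / 1_000_000 * base_cost_per_m
    extra_cost = normal_cost  # double the rate = one extra normal cost
    return normal_time - fast_time, extra_cost

# e.g. 3 million output tokens at an assumed 50 tokens/sec, $15 per million:
saved_s, extra_usd = fast_mode_tradeoff(3_000_000, 50.0, 15.0)
print(round(saved_s / 3600, 2), extra_usd)  # ~5.56 hours saved for $45 more
```

Whether that trade is worth it depends entirely on whether you are waiting on the output interactively or running jobs in the background.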
API pricing: a significant jump
For developers using the API directly, pricing has climbed. The base model costs $2.50 per million input tokens and $15 per million output tokens. The Pro variant costs $30 per million input and $180 per million output (4:27). And if you want to use the full 1 million token context window (the maximum text a model can handle in one conversation), any input beyond 272,000 tokens is billed at double the normal rate (4:47).
| Tier | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Base | $2.50 | $15.00 |
| Pro | $30.00 | $180.00 |
| Base, input beyond 272K | $5.00 (2× base) | $15.00 (unchanged) |
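The tiered input billing can be sketched as a small cost estimator, using the base-tier rates quoted above (a simplification of the actual billing, which may include other factors such as caching discounts):

```python
# Sketch of the tiered base-tier pricing described above: input beyond
# 272K tokens is billed at twice the base rate; output is a flat rate.

BASE_INPUT_PER_M = 2.50    # dollars per million input tokens
BASE_OUTPUT_PER_M = 15.00  # dollars per million output tokens
LONG_CONTEXT_THRESHOLD = 272_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate base-tier cost in dollars, with the long-context surcharge."""
    standard = min(input_tokens, LONG_CONTEXT_THRESHOLD)
    surcharged = max(0, input_tokens - LONG_CONTEXT_THRESHOLD)
    cost = standard / 1e6 * BASE_INPUT_PER_M
    cost += surcharged / 1e6 * BASE_INPUT_PER_M * 2
    cost += output_tokens / 1e6 * BASE_OUTPUT_PER_M
    return round(cost, 4)

# A 300K-token input with a 10K-token answer:
print(request_cost(300_000, 10_000))  # 0.97
```

Note how quickly long-context requests add up: the surcharge applies to every token past the threshold, so routinely filling the 1M window gets expensive.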
UI design: Opus 4.6 still leads
Bray compared a cafe website generated by GPT-5.4 and Anthropic's Opus 4.6. He preferred the Opus design, noting that GPT models tend to default to a "frosted card" UI style with heavy use of gradients (4:58). This is not just one reviewer's taste: on Design Arena, a crowdsourced platform where users vote on AI-generated designs, GPT-5.4 does not rank highly either (5:21).
How to interpret these claims
Bray's review is practical and honest, but a few things are worth weighing carefully.
A five-minute review has limits
The entire video is under six minutes, and the hands-on testing covers one project type (3D web experience) and one design comparison. A coding model's real strengths and weaknesses emerge across many project types, languages, and complexity levels. The Playwright demo is impressive, but it is a single data point.
Benchmark rankings shift quickly
Artificial Analysis shows GPT-5.4 leading in several categories today. But model rankings have been unstable in 2025-2026, with Anthropic, Google, and OpenAI regularly trading the top position. A model being "best" on benchmarks at launch does not guarantee it will hold that position.
The speed problem may matter more than benchmarks
For developers who use AI coding assistants throughout the day, time-to-first-token is not an abstract metric. A model that thinks longer before responding breaks the quick back-and-forth that makes AI-assisted coding productive. If GPT-5.4's speed does not improve, the practical experience may lag behind what the benchmarks suggest.
Pricing signals a broader trend
The cost increase from GPT-5.2 to GPT-5.4 follows a pattern across frontier models: each generation gets more capable and more expensive. Developers building production applications on these APIs need to factor in not just today's pricing but the direction it is heading.
Practical implications
For Codex users
If you already use OpenAI's Codex for coding tasks, GPT-5.4 is a straightforward upgrade. The Playwright skill adds genuine value for web development workflows where visual verification matters. Consider starting with fast mode if your billing allows it, since the time savings are real.
For teams choosing between providers
The UI design gap is worth testing yourself before committing. If your project involves front-end work, compare GPT-5.4 and Opus 4.6 on your actual design requirements. Benchmarks measure coding and reasoning, but the visual taste of AI models varies in ways that benchmarks do not capture.
Glossary
| Term | Definition |
|---|---|
| Computer use | An AI capability where the model controls mouse, keyboard, and browser by analyzing screenshots, letting it interact with software the way a human would. |
| Playwright | An open-source browser automation library built by Microsoft. AI models can use it to open web pages, click buttons, and verify visual output programmatically. |
| Tool search | A feature where the model looks up tool definitions on demand instead of loading all of them at once, reducing wasted tokens in conversations with many connected tools. |
| MCP server | Model Context Protocol server. A standardized way to connect external tools and data sources to an AI model so it can use them during conversations. |
| Token | The smallest unit an AI model processes. Roughly 3-4 characters. Pricing for API access is measured in tokens. |
| Context window | The maximum amount of text a model can handle in a single conversation. GPT-5.4's window is 1 million tokens, roughly 750,000 words. |
| Time-to-first-token | How long it takes a model to start generating output after receiving your input. Lower is better for interactive use. |
| Fast mode | A priority tier that delivers the same model at 1.5 times the speed, billed at double the normal rate. Not a different or smarter model. |
| Design Arena | A crowdsourced platform where users compare and vote on AI-generated designs, providing a community-driven measure of visual quality. |
Sources and resources
Want to go deeper? Watch the full video on YouTube →