GPT-5.4 Hands-On: Fast, Smart, but Not Cheap

Key insights
- GPT-5.4's new Playwright skill lets it open a browser, spot visual bugs, and fix them without human input, producing a 3D project in three prompts
- Tool search cuts token usage by 47% when many tools are connected, but GPT-5.4 has the slowest time-to-first-token of any major model
- Anthropic's Opus 4.6 still produces better UI designs in side-by-side comparisons, a gap confirmed by Design Arena rankings
This article is a summary of The New Best Model Is Here (GPT-5.4). Watch the video →
In Brief
Richard Oliver Bray, developer educator at Better Stack, spent several hours testing OpenAI's newly released GPT-5.4. His verdict: the model is a clear upgrade for coding and agentic tasks, with impressive browser automation through a new Playwright skill. But it comes with the slowest response time of any major model, steep API pricing, and UI design output that still trails Anthropic's Opus 4.6 in side-by-side tests.
The pitch: one model to rule them all
Bray describes GPT-5.4 as OpenAI's attempt to merge the coding power of Codex 5.3 with the knowledge and web search capabilities of GPT-5.2 into a single model (0:30). According to third-party benchmarks from Artificial Analysis, this strategy appears to have worked: GPT-5.4 ranks as the best coding model, the best agentic model (AI that takes actions independently using tools), and ties with Google's Gemini for best intelligence model (0:44).
The most notable addition is native computer use. GPT-5.4 is reportedly OpenAI's first general-purpose model with built-in capabilities for controlling a computer through mouse and keyboard commands in response to screenshots (0:57). OpenAI also released an experimental Playwright skill, a browser automation tool built on Microsoft's open-source library (1:08).
The demo: 3D Tower Bridge in three prompts
Bray tested this by asking GPT-5.4 to build an interactive 3D experience of Tower Bridge in London. The first iteration took about 30 minutes from a single prompt (1:50). The model wrote code, opened a browser using the Playwright skill, navigated the 3D scene, spotted visual problems like mismatched backgrounds, jumped back into the code to fix them, and repeated the cycle (1:35).
After two follow-up prompts to fix details like sideways boats and overlapping textures, the total project took about 1.5 hours of completely hands-off work (2:10). Bray calls it "not perfect by any means" but a "no-brainer upgrade" for existing Codex users.
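The render-inspect-fix cycle Bray describes can be sketched as a simple loop. This is a toy illustration, not the Playwright skill's actual implementation: `render_scene`, `find_visual_issues`, and `apply_fix` are hypothetical stand-ins for the model opening a browser, reading a screenshot, and editing code.

```python
# A toy sketch of the screenshot-inspect-fix loop described above.
# All three helper functions are hypothetical stand-ins, not real APIs.

def render_scene(code: str) -> str:
    """Stand-in for opening a browser and capturing a screenshot."""
    return f"screenshot of: {code}"

def find_visual_issues(screenshot: str) -> list[str]:
    """Stand-in for inspecting the screenshot for visual bugs."""
    issues = []
    for known_bug in ("mismatched background", "sideways boat"):
        if known_bug in screenshot:
            issues.append(known_bug)
    return issues

def apply_fix(code: str, issue: str) -> str:
    """Stand-in for editing the code so the named issue disappears."""
    return code.replace(issue, "OK")

def verify_and_fix(code: str, max_rounds: int = 5) -> tuple[str, int]:
    """Repeat render -> inspect -> fix until the scene comes back clean."""
    for round_no in range(max_rounds):
        issues = find_visual_issues(render_scene(code))
        if not issues:
            return code, round_no
        for issue in issues:
            code = apply_fix(code, issue)
    return code, max_rounds

fixed, rounds = verify_and_fix("scene with mismatched background and sideways boat")
print(rounds)  # 1: one round of fixes, then a clean verification pass
```

The key point is the termination condition: the loop only stops once a fresh screenshot shows no remaining issues, which is what makes the workflow hands-off.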
Tool search: less waste, same accuracy
The other standout feature is tool search. When an AI model has many tools connected through MCP servers (Model Context Protocol, a standard for connecting AI to external tools), all their definitions normally get loaded into the conversation upfront, wasting tokens (the basic units AI models process, roughly 3-4 characters each) and hurting the quality of answers (2:49).
GPT-5.4 instead loads a lightweight list of available tools and looks up the full definition only when it actually needs one. OpenAI claims this reduces token usage by 47% in a test with 36 MCP servers while maintaining the same accuracy (3:15).
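The savings come from deferring the bulky part. A minimal sketch of the idea, with invented tool definitions and a very rough 4-characters-per-token estimate (not OpenAI's actual mechanism or tokenizer):

```python
# Sketch of lazy tool loading: expose only tool names upfront and
# fetch a full definition on demand. Definitions here are invented.

FULL_DEFINITIONS = {
    "search_issues": (
        '{"name": "search_issues", "description": "Search the issue tracker",'
        ' "parameters": {"type": "object", "properties": {"query":'
        ' {"type": "string"}, "state": {"type": "string"}}}}'
    ),
    "create_branch": (
        '{"name": "create_branch", "description": "Create a git branch",'
        ' "parameters": {"type": "object", "properties": {"name":'
        ' {"type": "string"}}}}'
    ),
}

def rough_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def stub_listing() -> str:
    """The lightweight list the model sees upfront: names only."""
    return ", ".join(FULL_DEFINITIONS)

def lookup(tool_name: str) -> str:
    """Load the full definition only when the tool is actually needed."""
    return FULL_DEFINITIONS[tool_name]

# Loading everything upfront vs. stubs plus one on-demand lookup:
upfront_cost = sum(rough_tokens(d) for d in FULL_DEFINITIONS.values())
lazy_cost = rough_tokens(stub_listing()) + rough_tokens(lookup("search_issues"))
print(upfront_cost, lazy_cost)
```

With only two tools the saving is modest; the gap widens with every connected server, which is why OpenAI's 36-server test shows a 47% reduction.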
The tradeoffs
Speed: slowest of any major model
The most noticeable downside, according to Bray, is response time. GPT-5.4 has the longest time-to-first-token of any model tracked by Artificial Analysis, meaning it takes longer than competitors to start generating output (4:06). The same holds for the time to return the first 500 tokens.
Bray notes he is "not sure if this is a model issue or a provider issue" and suggests it might improve over time. But he also raises a more cynical possibility: that the model is intentionally slow, nudging users toward the new fast mode (4:21).
Fast mode: same model, higher bill
Fast mode delivers the exact same model and intelligence at 1.5 times the token speed, but it is billed at double the normal rate (2:32). Bray describes it as "essentially just a priority tier," not a different model at all. After two hours of testing, the model itself suggested he should have been using fast mode, which would have saved about an hour.
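The fast-mode tradeoff is easy to quantify: 1.5× the throughput at 2× the price. A back-of-envelope calculation, with illustrative numbers (the 50 tokens/sec throughput is an assumption, not a measured figure):

```python
# Back-of-envelope math for fast mode as described: same model,
# 1.5x the token throughput, billed at 2x the rate.

def fast_mode_tradeoff(tokens: int, base_tps: float, base_cost_per_m: float):
    """Return (seconds saved, extra dollars spent) by using fast mode."""
    normal_time = tokens / base_tps
    fast_time = tokens / (base_tps * 1.5)
    normal_cost = tokens / 1_000_000 * base_cost_per_m
    extra_cost = normal_cost  # double the rate = one extra normal cost
    return normal_time - fast_time, extra_cost

# e.g. 3 million output tokens at an assumed 50 tokens/sec, $15 per million:
saved_s, extra_usd = fast_mode_tradeoff(3_000_000, 50.0, 15.0)
print(round(saved_s / 3600, 2), extra_usd)  # ~5.56 hours saved for $45 more
```

Whether that trade is worth it depends entirely on whether you are waiting on the output interactively or running jobs in the background.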
API pricing: a significant jump
For developers using the API directly, pricing has climbed. The base model costs $2.50 per million input tokens and $15 per million output tokens. The Pro variant costs $30 per million input and $180 per million output (4:27). And if you want to use the full 1 million token context window (the maximum text a model can handle in one conversation), any input beyond 272,000 tokens is billed at double the normal rate (4:47).
| Tier | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Base | $2.50 | $15.00 |
| Pro | $30.00 | $180.00 |
| Base, input beyond 272K | $5.00 (2× base) | $15.00 (unchanged) |
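The tiered input billing can be sketched as a small cost estimator, using the base-tier rates quoted above (a simplification of the actual billing, which may include other factors such as caching discounts):

```python
# Sketch of the tiered base-tier pricing described above: input beyond
# 272K tokens is billed at twice the base rate; output is a flat rate.

BASE_INPUT_PER_M = 2.50    # dollars per million input tokens
BASE_OUTPUT_PER_M = 15.00  # dollars per million output tokens
LONG_CONTEXT_THRESHOLD = 272_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate base-tier cost in dollars, with the long-context surcharge."""
    standard = min(input_tokens, LONG_CONTEXT_THRESHOLD)
    surcharged = max(0, input_tokens - LONG_CONTEXT_THRESHOLD)
    cost = standard / 1e6 * BASE_INPUT_PER_M
    cost += surcharged / 1e6 * BASE_INPUT_PER_M * 2
    cost += output_tokens / 1e6 * BASE_OUTPUT_PER_M
    return round(cost, 4)

# A 300K-token input with a 10K-token answer:
print(request_cost(300_000, 10_000))  # 0.97
```

Note how quickly long-context requests add up: the surcharge applies to every token past the threshold, so routinely filling the 1M window gets expensive.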
UI design: Opus 4.6 still leads
Bray compared a cafe website generated by GPT-5.4 and Anthropic's Opus 4.6. He preferred the Opus design, noting that GPT models tend to default to a "frosted card" UI style with heavy use of gradients (4:58). This is not just one reviewer's taste: on Design Arena, a crowdsourced platform where users vote on AI-generated designs, GPT-5.4 does not rank highly either (5:21).
How to interpret these claims
Bray's review is practical and honest, but a few things are worth weighing carefully.
A five-minute review has limits
The entire video is under six minutes, and the hands-on testing covers one project type (3D web experience) and one design comparison. A coding model's real strengths and weaknesses emerge across many project types, languages, and complexity levels. The Playwright demo is impressive, but it is a single data point.
Benchmark rankings shift quickly
Artificial Analysis shows GPT-5.4 leading in several categories today. But model rankings have been unstable in 2025-2026, with Anthropic, Google, and OpenAI regularly trading the top position. A model being "best" on benchmarks at launch does not guarantee it will hold that position.
The speed problem may matter more than benchmarks
For developers who use AI coding assistants throughout the day, time-to-first-token is not an abstract metric. A model that thinks longer before responding breaks the quick back-and-forth that makes AI-assisted coding productive. If GPT-5.4's speed does not improve, the practical experience may lag behind what the benchmarks suggest.
Pricing signals a broader trend
The cost increase from GPT-5.2 to GPT-5.4 follows a pattern across frontier models: each generation gets more capable and more expensive. Developers building production applications on these APIs need to factor in not just today's pricing but the direction it is heading.
Practical implications
For Codex users
If you already use OpenAI's Codex for coding tasks, GPT-5.4 is a straightforward upgrade. The Playwright skill adds genuine value for web development workflows where visual verification matters. Consider starting with fast mode if your billing allows it, since the time savings are real.
For teams choosing between providers
The UI design gap is worth testing yourself before committing. If your project involves front-end work, compare GPT-5.4 and Opus 4.6 on your actual design requirements. Benchmarks measure coding and reasoning, but the visual taste of AI models varies in ways that benchmarks do not capture.
Glossary
| Term | Definition |
|---|---|
| Computer use | An AI capability where the model controls mouse, keyboard, and browser by analyzing screenshots, letting it interact with software the way a human would. |
| Playwright | An open-source browser automation library built by Microsoft. AI models can use it to open web pages, click buttons, and verify visual output programmatically. |
| Tool search | A feature where the model looks up tool definitions on demand instead of loading all of them at once, reducing wasted tokens in conversations with many connected tools. |
| MCP server | Model Context Protocol server. A standardized way to connect external tools and data sources to an AI model so it can use them during conversations. |
| Token | The smallest unit an AI model processes. Roughly 3-4 characters. Pricing for API access is measured in tokens. |
| Context window | The maximum amount of text a model can handle in a single conversation. GPT-5.4's window is 1 million tokens, roughly 750,000 words. |
| Time-to-first-token | How long it takes a model to start generating output after receiving your input. Lower is better for interactive use. |
| Fast mode | A priority tier that delivers the same model at 1.5 times the speed, billed at double the normal rate. Not a different or smarter model. |
| Design Arena | A crowdsourced platform where users compare and vote on AI-generated designs, providing a community-driven measure of visual quality. |
Sources and resources
Want to go deeper? Watch the full video on YouTube →