How 7 Researchers Beat Claude Opus Without Fine-Tuning

Key insights
- Poetiq scored 55% on Humanity's Last Exam, outperforming Claude Opus 4.6 (53.1%) — built on top of existing models without any fine-tuning
- On ARC-AGI v2, Poetiq achieved 54% at $32 per problem versus Gemini 3 Deep Think's 45% at ~$70 — better results at less than half the cost
- The entire optimization for Humanity's Last Exam cost less than $100K, compared to hundreds of millions for training frontier models
In Brief
Poetiq is a seven-person startup founded by former Google DeepMind researchers. Their pitch: instead of spending millions to fine-tune a model that becomes obsolete with the next release, build a "reasoning harness" on top of existing frontier models (the most advanced AI models currently available) that automatically makes them perform better. In this episode of Y Combinator (YC)'s Lightcone podcast, Poetiq CEO Ian Fischer explains how their recursive self-improvement system topped the ARC-AGI v2 leaderboard and outscored Claude Opus 4.6 on Humanity's Last Exam — all for under $100K in optimization costs.
The fine-tuning trap
The episode opens with a problem that many AI startups face. Fischer describes the traditional approach: collect tens of thousands of examples, fine-tune a frontier model (train it further on specialized data), and spend significant compute (processing power) doing it. The result works better than the base model, but by the time you ship, a new model has come out that outperforms your fine-tuned version (2:07).
One of the Lightcone hosts puts it bluntly: you spend millions to hundreds of millions of dollars on fine-tuning and then effectively set that money on fire when the next frontier model drops (2:17).
Fischer argues this dynamic is particularly dangerous for startups. Large labs like Anthropic, OpenAI, and Google can absorb the cost of retraining because it's their core business. A startup that bets everything on a fine-tuned model risks going out of business when the next generation arrives (4:01).
"Stilts" for LLMs (Large Language Models) — what Poetiq actually builds
Rather than modifying the model itself, Poetiq builds what Fischer calls a reasoning harness: a combination of code, prompts, and data that sits on top of one or more language models (8:43). The key difference from manual prompt engineering (the craft of writing better instructions for AI): Poetiq's "meta-system" generates and optimizes these harnesses automatically through recursive self-improvement, a process where the system uses its own output to make itself better over time.
Fischer explains the value proposition: when a new frontier model comes out, the harness is compatible with it immediately. No retraining needed. The same harness that improved Gemini 3 Pro also works with the next release — and Poetiq can continue optimizing for whatever model a customer wants to use (4:35).
The hosts use the word "stilts" repeatedly throughout the episode — the idea that whatever model comes out, Poetiq can make you taller than that model out of the box (6:06).
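The general pattern — ordinary code and strategy prompts wrapped around an unmodified model — can be sketched in a few lines. The following is a hypothetical illustration of that pattern, not Poetiq's actual system: `call_model` is a stand-in for any frontier-model API, and the sampling-plus-voting logic is just one simple example of what the "code" layer of a harness might do.

```python
# A minimal, hypothetical sketch of a "reasoning harness": code + prompts
# layered on top of an unmodified model. NOT Poetiq's system -- just an
# illustration of the general pattern described in the episode.
from collections import Counter

def call_model(prompt: str) -> str:
    """Stand-in for any frontier-model API; swap in a real client here."""
    return "42"  # placeholder answer for the sketch

def harness(question: str, n_samples: int = 3) -> str:
    # 1. Reframe the question with a strategy prompt (the "prompts" layer).
    strategy_prompt = f"Think step by step, then answer:\n{question}"
    # 2. Sample several candidate answers (the "code" layer adds structure
    #    the base model does not have on its own).
    answers = [call_model(strategy_prompt) for _ in range(n_samples)]
    # 3. Aggregate -- here, simple majority voting over the samples.
    return Counter(answers).most_common(1)[0][0]

print(harness("What is 6 * 7?"))
```

Because the harness only touches the model through its API, swapping in a newer model means changing one function — which is the portability argument Fischer makes.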
ARC-AGI v2: Beating Gemini at half the cost
Poetiq came out of stealth (secret development mode) in December 2025 by topping the ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) v2 benchmark, a standardized test used to measure how well AI systems perform. At the time, Gemini 3 Deep Think had just claimed the lead at 45%. Two days later, Poetiq published results showing 54% — a 9 percentage point improvement (5:19).
The cost comparison is striking. Poetiq used Gemini 3 Pro (a much cheaper model than Deep Think) as the base, achieving their results at about $32 per problem versus roughly $70 for Gemini 3 Deep Think (6:14). Better results at less than half the price — on a cheaper underlying model.
Humanity's Last Exam: Outscoring Claude Opus
The more recent benchmark result is Humanity's Last Exam (HLE), a test developed by Scale AI and the Center for AI Safety with contributions from over 500 domain experts at universities worldwide. The test contains 2,500 questions spanning fields like mathematics, physics, biology, law, and philosophy, all designed to be challenging even for PhDs in their respective fields (6:44). The goal is to measure whether AI systems are approaching expert-level human knowledge. No AI has passed it yet.
Poetiq scored 55%, almost two percentage points above the previous state-of-the-art: Claude Opus 4.6 at 53.1%, which had been published just a week before (7:04).
The optimization cost? Fischer says less than $100K (7:36). The hosts point out the contrast: each foundation model training run costs hundreds of millions of dollars, while Poetiq achieved a new state-of-the-art with a seven-person team and a fraction of the budget.
How the meta-system works
Fischer is careful not to reveal specific details about Poetiq's approach. What he does share: the "Poetiq meta-system" is a recursively self-improving system whose output is reasoning systems that solve hard problems (9:13).
The meta-system handles what a human team would normally do: it examines the data, identifies failure modes, and discovers reasoning strategies, but does so automatically and at a fraction of the cost (9:37).
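Fischer doesn't disclose how this works internally, but the outer loop he describes — evaluate a candidate reasoning system, measure where it fails, keep the variants that improve — can be sketched on a toy benchmark. Everything below is an illustrative assumption: the toy problems, the fixed set of three "strategies," and the search loop are invented for the example.

```python
# Hypothetical sketch of a meta-optimization loop: score candidate reasoning
# systems on a benchmark and keep the best. Purely illustrative -- not
# Poetiq's method. Toy setup: each problem is best solved by one of three
# fixed strategies, and a "harness" always applies a single strategy.

# (problem_id, best_strategy): strategy 2 solves 15 of these 20 problems.
problems = [(i, 2 if i % 4 else 0) for i in range(20)]

def make_harness(strategy_id: int):
    """A trivial 'reasoning system' that always applies one strategy."""
    return lambda problem: strategy_id

def score(harness) -> float:
    """Fraction of toy problems the harness gets right."""
    return sum(harness(p) == best for p, best in problems) / len(problems)

# The outer loop: propose variants, measure, keep improvements.
best_id, best_score = 0, score(make_harness(0))
for candidate in range(1, 3):
    s = score(make_harness(candidate))
    if s > best_score:
        best_id, best_score = candidate, s

print(best_id, best_score)  # -> 2 0.75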
Fischer also describes a second use case: startups that already have their own AI agent (a program that can take actions and make decisions on its own) can bring it to Poetiq for optimization. The meta-system can optimize specific components (just the prompts, just the reasoning strategies, or the entire system) depending on the customer's needs (10:12).
One of the hosts frames this as a new S-curve (a growth pattern that starts slow, accelerates, then plateaus) beyond reinforcement learning (a training method where AI learns by trial and reward). Fischer agrees, adding that as both the meta-system and the underlying models improve, the performance ceiling keeps shifting higher (11:05).
From 5% to 95% — what reasoning strategies can do
Fischer shares a concrete example from a paper published while he was still at DeepMind. The team was working on a very hard problem using Gemini 1.5 Flash. After extensive manual prompt optimization, they reached 5% performance. When they added reasoning strategies (code-based approaches rather than just better prompts), performance jumped to 95% (14:09).
This illustrates a point Fischer makes about automated prompt optimization tools like DSPy (a framework for programmatically optimizing AI prompts): they can improve results, but they capture only a fraction of what becomes possible once reasoning strategies are written in code rather than expressed purely in prompts (14:33).
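The prompt-versus-code distinction is easy to show with a contrived example (this is not the DeepMind paper's actual task): a prompt-only strategy delegates the whole problem to one model call, while a code-based strategy puts the control flow in ordinary code and uses the model only where it is actually needed. `model_answer` is a hypothetical stand-in for an LLM call.

```python
# Illustrative contrast between a prompt-only strategy and a code-based
# reasoning strategy. model_answer() is a hypothetical stand-in for a real
# LLM API call that is unreliable on multi-step arithmetic.

def model_answer(prompt: str) -> str:
    """Pretend model call that fails on this kind of task."""
    return "not sure"

def prompt_strategy(numbers):
    # Everything rides on a single model call.
    return model_answer(f"Add up these numbers: {numbers}")

def code_strategy(numbers):
    # The reasoning (decomposition, iteration, aggregation) lives in code;
    # here the subtasks are trivial enough that no model call is needed.
    return str(sum(numbers))

print(prompt_strategy([17, 4, 21]))  # -> "not sure"
print(code_strategy([17, 4, 21]))    # -> "42"
```

In realistic versions of this pattern, the code still calls the model — but only for the narrow subtasks it is good at, which is how a 5% prompt-only baseline can jump to 95%.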
One notable detail: the meta-system generated some examples for ARC-AGI that Fischer says are clearly not what a human would have written, including one example that was actually wrong. The team chose not to fix it, treating the system's output as the product rather than something to manually tweak (12:21).
Ian Fischer: From YC founder to DeepMind to Poetiq
Fischer's path to Poetiq is unconventional. Over a decade ago, he founded Portable, a YC-backed startup that ported mobile apps cross-platform. Google acquired the company, and rather than continuing in mobile dev tools, Fischer used the transition to pivot entirely into AI and robotics research (16:45).
He admits that robotics was more aspirational than practical. Hardware is hard. Instead, he went deep into machine learning research, spending roughly a decade at Google Research and then DeepMind (18:07).
His advice for engineers looking to get into AI: try things every day, push yourself to find the limits of what AI can do, and don't be afraid to build things outside your comfort zone. Fischer himself built an iPhone app in a weekend using GPT-5, something he hadn't done in a decade (18:38).
How to interpret this
Poetiq's benchmark results are impressive, but several things are worth considering before drawing broad conclusions.
Benchmarks are not products. Scoring 55% on Humanity's Last Exam shows capability, but it doesn't tell us how well the system works on real-world production tasks — customer service, code generation, or domain-specific reasoning. Benchmark performance has historically been a poor predictor of real-world utility.
The promotional context matters. This is a YC Lightcone episode featuring a YC company. The hosts are enthusiastic and supportive; this is closer to a pitch than an independent evaluation. No external researchers or skeptics are present to challenge the claims.
Poetiq hasn't shipped anything yet. The company is in early access mode. The technology has only been tested on benchmarks and hasn't been battle-tested in production environments (real customer use at scale). Fischer says startups can sign up at poetiq.ai, but no product is publicly available at the time of recording (14:57).
The cost comparisons need context. The "under $100K optimization cost" is compared to "hundreds of millions for training." This is a valid comparison for the training step, but it doesn't account for the inference cost (the cost of running the AI to generate responses) when the harness is used at scale. Each reasoning harness makes multiple LLM calls per problem, so per-query production cost could still be significant.
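A back-of-envelope calculation makes the caveat concrete. All numbers below are assumptions invented for illustration — neither Poetiq nor the episode provides per-call or volume figures.

```python
# Hypothetical inference-cost estimate for a multi-call reasoning harness.
# Every number here is an illustrative assumption, not a reported figure.
calls_per_query = 8          # assumed LLM calls the harness makes per query
cost_per_call = 0.05         # assumed USD per LLM call
queries_per_month = 100_000  # assumed production volume

monthly_cost = calls_per_query * cost_per_call * queries_per_month
print(f"${monthly_cost:,.0f} per month")  # -> $40,000 per month
```

Even with a one-time optimization cost under $100K, recurring inference spend at this scale would dominate within months — which is why the training-cost comparison alone doesn't settle the economics.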
Recursive self-improvement is a bold claim. Fischer positions Poetiq's system as genuinely recursively self-improving. This is a loaded term in AI — historically associated with theoretical AGI (Artificial General Intelligence, AI that can match humans at any intellectual task) scenarios. What Poetiq describes appears to be automated optimization of reasoning systems, which is impressive but a narrower claim than what "recursive self-improvement" might suggest.
Glossary
| Term | Definition |
|---|---|
| Fine-tuning | Taking a pre-trained AI model and training it further on a specialized dataset. Expensive and becomes obsolete when a better base model is released — what Fischer calls "the fine-tuning trap." |
| Reasoning harness | A system of code, prompts, and data that wraps around an existing language model to improve its performance on specific tasks. Poetiq generates these automatically instead of building them by hand. |
| Recursive self-improvement | A system that can make itself better at making itself better. In Poetiq's case, the meta-system optimizes reasoning systems, and the meta-system itself can also be improved over time. |
| ARC-AGI | Abstraction and Reasoning Corpus for Artificial General Intelligence — a benchmark testing abstract reasoning ability. Version 2 (v2) is the latest iteration. Poetiq holds the top score at 54%. |
| Humanity's Last Exam | A benchmark of 2,500 expert-level questions designed to be challenging even for PhDs. Created as a harder alternative to existing benchmarks. No AI system has passed it yet. |
| S-curve | A pattern where a technology improves slowly at first, then rapidly, then plateaus. Fischer argues that Poetiq represents a new S-curve beyond what reinforcement learning alone can achieve. |
| DSPy | A popular framework for automated prompt optimization. Fischer acknowledges it provides improvements but argues reasoning strategies in code go much further. |
| The bitter lesson | A concept from AI researcher Rich Sutton arguing that general methods which leverage increasing compute ultimately beat approaches built on hand-engineered human knowledge. The hosts suggest Poetiq "vaccinates" startups against this by making them model-agnostic. |
Sources and resources
Want to go deeper? Watch the full video on YouTube →