TubeReads

How A Team Of 7 Keeps Breaking AI Benchmark Records

While OpenAI and Anthropic spend hundreds of millions of dollars training new models from scratch, a seven-person startup called Poetic is topping AI benchmarks with a radically different approach: recursively self-improving systems that cost less than $100k per optimization run. When Gemini 3 Deep Think hit 45% on ARC AGI v2, Poetic beat it at 54% two days later, at half the per-problem cost. The big question: can this approach really vaccinate startups against "the bitter lesson", or is there a ceiling that such "stilts" can't overcome?

Video length: 19:46 · Published Feb 27, 2026 · Video language: en-US
4–5 min read · 3,754 spoken words summarized to 996 words (4×)

1

Key Takeaways

1. Poetic's recursively self-improving system optimizes AI agents for under $100k per run, compared to the hundreds of millions required to train frontier models from scratch, and remains compatible when new base models are released.

2. The company topped ARC AGI v2 at 54% accuracy (vs. Gemini 3 Deep Think's 45%) at half the per-problem cost, and recently achieved 55% on Humanity's Last Exam, outperforming Claude Opus 4.6's 53.1%.

3. Reasoning strategy improvements — implemented as code-based harnesses rather than just prompt optimization — are responsible for the majority of performance gains, taking one internal benchmark from 5% to 95% accuracy.

4. Startups that fine-tune on older models risk obsolescence when new frontier models are released; Poetic's approach sidesteps this by treating base models as a swappable commodity layer.

In Short

Poetic has cracked a way to make any AI system dramatically smarter without the hundreds-of-millions cost of fine-tuning or retraining — and their approach stays compatible when the next frontier model drops, making it possible for a seven-person team to beat Google and Anthropic on public benchmarks.


2

The Core Insight: Recursive Self-Improvement Without Retraining

Poetic automates agent optimization far faster and cheaper than training new LLMs.

Poetic's central breakthrough is that it can perform recursive self-improvement — where the AI makes itself smarter — without requiring a new foundation model to be trained from scratch. Most approaches to self-improvement, including those explored by OpenAI, Anthropic, and Google, involve retraining a model at every improvement step, which costs hundreds of millions of dollars and takes months. Poetic's "meta system" instead generates what the industry now calls "harnesses" or "agentic systems": layers of code, prompts, and reasoning strategies that sit on top of existing foundation models and outperform them.

When a new frontier model is released, the same harness remains compatible — no expensive retraining required. This means startups don't have to choose between spending millions on fine-tuning (which becomes obsolete the moment GPT-5 or Claude 5 drops) and falling behind competitors. Ian Fischer describes this as "building stilts to stand on top of" foundation models rather than competing with them. The company demonstrated this in December 2024 when it beat Gemini 3 Deep Think on ARC AGI v2 two days after Google's release, achieving 54% accuracy at $32 per problem versus Google's 45% at over $70.
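The "stilts" idea can be sketched in a few lines: if the harness depends only on a generic call-the-model function, swapping in a newer base model is a one-line change with no retraining. This is a minimal illustration of the concept under that assumption, not Poetic's actual system; the function names and stub models are hypothetical.

```python
from typing import Callable

# The harness depends only on a "prompt in, text out" interface,
# so any base model generation can be plugged in unchanged.
Model = Callable[[str], str]

def harness(model: Model, task: str) -> str:
    """A toy harness: plan, solve, then verify -- all via the base model."""
    plan = model(f"Break this task into steps: {task}")
    draft = model(f"Follow this plan to solve the task.\nPlan: {plan}\nTask: {task}")
    check = model(f"Verify this answer against the task. Answer: {draft}")
    # Retry once if the verification step doesn't approve the draft.
    return draft if "OK" in check else model(f"Fix the answer: {draft}")

# Stub models standing in for two generations of a frontier API:
def model_v1(prompt: str) -> str:
    return "OK: approved" if "Verify" in prompt else "v1 output"

def model_v2(prompt: str) -> str:
    return "OK: approved" if "Verify" in prompt else "v2 output"

# The same harness runs unchanged on either base model.
print(harness(model_v1, "sort a list"))  # prints "v1 output"
print(harness(model_v2, "sort a list"))  # prints "v2 output"
```

In a real deployment the stubs would be replaced by API clients, but the harness code itself would not change — which is the compatibility property the article describes.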

The system optimizes not just prompts — which many tools like DSPy already automate — but the entire reasoning pipeline, including which models to route tasks to, how to structure multi-step reasoning, and what context to surface at each stage. In one internal benchmark, manual prompt optimization improved accuracy from baseline to 5%, but adding reasoning strategies catapulted performance to 95%. Fischer emphasizes that Poetic's outputs often don't look like human-written prompts: "There's some unexpected stuff and one of the examples is actually wrong but we didn't change it… this is the thing that it output we'll just leave it be."


3

Breaking Records on a Startup Budget

A seven-person team outperformed Google and Anthropic at a fraction of the cost.

ARC AGI v2 accuracy: 54% (Poetic) vs. 45% (Gemini 3 Deep Think) — Poetic's result came two days after Google's release, at half the per-problem cost
Cost per problem (ARC AGI v2): $32 (Poetic) vs. $70+ (Gemini) — Poetic built on the cheaper Gemini 3 Pro base model
Humanity's Last Exam score: 55% — outperformed Claude Opus 4.6's 53.1% from the previous week
Optimization budget (Humanity's Last Exam): under $100k — compared to hundreds of millions for frontier model training runs
Team size: 7 research scientists and engineers — competing against labs with thousands of employees

4

Prompts vs. Reasoning Strategies: Where the Real Gains Come From

Code-based reasoning harnesses deliver 10–20× the improvement of prompt tuning alone.

PROMPT OPTIMIZATION
The Table Stakes
Many startups already use tools like DSPy to automate prompt engineering, and this delivers measurable gains. In Poetic's internal testing, aggressive manual prompt optimization improved performance from baseline to around 5% on their hardest benchmark. While useful, this approach hits a ceiling quickly and doesn't fundamentally change how the model reasons.
REASONING STRATEGIES
The 10× Multiplier
Reasoning strategies are implemented as code — determining which models to call, how to structure multi-step inference, what context to surface, and how to verify outputs. On the same benchmark where prompts alone reached 5%, adding reasoning strategies pushed accuracy to 95%. Fischer notes these strategies are "written in code rather than in just better prompts" and represent the majority of Poetic's performance advantage.
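To illustrate why a code-level strategy can dwarf prompt tuning, here is a hedged sketch of one classic strategy, sample-and-verify — an assumption for illustration, not Poetic's proprietary method. A deliberately unreliable stub stands in for the LLM, and a cheap programmatic verifier filters its samples; the gap between one-shot prompting and the strategy mirrors (in miniature) the 5%-vs-95% contrast described above.

```python
import random

random.seed(0)  # make the demonstration reproducible

def stub_model(prompt: str) -> int:
    """Stand-in for an LLM answering 'what is 17 * 23?'; right ~30% of the time."""
    return 391 if random.random() < 0.3 else random.randint(300, 500)

def verifier(answer: int) -> bool:
    """Cheap programmatic check -- here, exact arithmetic."""
    return answer == 17 * 23

def single_prompt() -> int:
    """Baseline: one prompt, one answer, no code-level strategy."""
    return stub_model("What is 17 * 23?")

def sample_and_verify(k: int = 8) -> int:
    """Strategy in code: draw k samples, return the first that verifies."""
    candidates = [stub_model("What is 17 * 23? Think step by step.") for _ in range(k)]
    for c in candidates:
        if verifier(c):
            return c
    return candidates[0]  # fall back to an unverified guess

# Accuracy of one-shot prompting vs. the code-level strategy over 200 trials:
trials = 200
base = sum(verifier(single_prompt()) for _ in range(trials)) / trials
strat = sum(verifier(sample_and_verify()) for _ in range(trials)) / trials
print(f"single prompt: {base:.0%}, sample-and-verify: {strat:.0%}")
```

The strategy lives entirely in ordinary code around the model — sampling count, verification logic, fallback behavior — which is the kind of lever prompt wording alone cannot reach.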

5

"Don't Limit Yourself — Just Try Things Every Day"

Ian Fischer on building with AI as an iterative, daily practice.

The world is changing so quickly. This is probably a little bit obvious, but you should just try things and every day do something with AI. Last summer, I took a weekend and used GPT-5 to help me build an iPhone app. I hadn't done that in a decade. And yeah, it's so fast and so easy. And that was, you know, an age ago. That was like 8 months ago. Now it's even faster and easier. Don't limit yourself. Like anything that you imagine, you should just try to use AI and see how far you can get with it and you'll be, you know, making the world better.

Ian Fischer


6

Vaccinated Against the Bitter Lesson

Poetic's compatibility with any base model eliminates the fine-tuning obsolescence trap.


The traditional startup playbook — collect tens of thousands of examples, fine-tune a frontier model, deploy — is a ticking time bomb. By the time your fine-tuned GPT-4 model ships, GPT-5 has already surpassed it. Poetic's harnesses remain compatible across model generations, letting startups upgrade to the latest base model without rebuilding. As Fischer puts it: "You're totally vaccinated against the bitter lesson."


7

People

Ian Fischer
Co-founder and Co-CEO, Poetic (guest)

Glossary
Recursive Self-Improvement: An AI system that autonomously makes itself smarter over successive iterations, often considered the "holy grail" of AI research.
Harness (or Agentic System): A layer of code, prompts, and reasoning strategies built on top of foundation models to improve their performance on specific tasks.
The Bitter Lesson: The observation that general-purpose computation and learning (scaling up models) consistently beats hand-crafted domain knowledge in AI; here used to describe how new models make prior fine-tuning obsolete.
Fine-Tuning: Training an existing model on a specific dataset to specialize it for a particular task, typically costing millions and becoming outdated when new base models are released.

Disclaimer: This is an AI-generated summary of a YouTube video for educational and reference purposes. It does not constitute investment, financial, or legal advice. Always verify information against the original sources before making decisions. TubeReads is not affiliated with the content creator.