How A Team Of 7 Keeps Breaking AI Benchmark Records
While OpenAI and Anthropic spend hundreds of millions of dollars training new models from scratch, a seven-person startup called Poetic is topping AI benchmarks with a radically different approach: recursively self-improving systems that cost less than $100k per optimization run. When Gemini 3 Deep Think hit 45% on ARC AGI v2, Poetic beat it at 54% two days later, at half the cost. The big question: can this approach really vaccinate startups against "the bitter lesson," or is there a ceiling that these "stilts" on top of foundation models can't overcome?
Key Takeaways
Poetic's recursively self-improving system optimizes AI agents for under $100k per run, compared to the hundreds of millions required to train frontier models from scratch, and remains compatible when new base models are released.
The company topped ARC AGI v2 at 54% accuracy (vs. Gemini 3 Deep Think's 45%) at half the per-problem cost, and recently achieved 55% on Humanity's Last Exam, outperforming Claude Opus 4.6's 53.1%.
Reasoning strategy improvements — implemented as code-based harnesses rather than just prompt optimization — are responsible for the majority of performance gains, taking one internal benchmark from 5% to 95% accuracy.
Startups that fine-tune on older models risk obsolescence when new frontier models release; Poetic's approach sidesteps this by treating base models as a swappable commodity layer.
In Brief
Poetic has cracked a way to make any AI system dramatically smarter without the hundreds-of-millions cost of fine-tuning or retraining — and their approach stays compatible when the next frontier model drops, making it possible for a seven-person team to beat Google and Anthropic on public benchmarks.
The Core Insight: Recursive Self-Improvement Without Retraining
Poetic automates agent optimization far faster and cheaper than training new LLMs.
Poetic's central breakthrough is that it can perform recursive self-improvement — where the AI makes itself smarter — without requiring a new foundation model to be trained from scratch. Most approaches to self-improvement, including those explored by OpenAI, Anthropic, and Google, involve retraining a model at every improvement step, which costs hundreds of millions of dollars and takes months. Poetic's "meta system" instead generates what the industry now calls "harnesses" or "agentic systems": layers of code, prompts, and reasoning strategies that sit on top of existing foundation models and outperform them.
When a new frontier model is released, the same harness remains compatible — no expensive retraining required. This means startups don't have to choose between spending millions on fine-tuning (which becomes obsolete the moment GPT-5 or Claude 5 drops) and falling behind competitors. Ian Fischer describes this as "building stilts to stand on top of" foundation models rather than competing with them. The company demonstrated this in December 2024 when it beat Gemini 3 Deep Think on ARC AGI v2 two days after Google's release, achieving 54% accuracy at $32 per problem versus Google's 45% at over $70.
The system optimizes not just prompts — which many tools like DSPy already automate — but the entire reasoning pipeline: which models to route tasks to, how to structure multi-step reasoning, and what context to surface at each stage. In one internal benchmark, manual prompt optimization improved accuracy from baseline to just 5%, while adding reasoning strategies catapulted performance to 95%. Fischer emphasizes that Poetic's outputs often don't look like human-written prompts: "There's some unexpected stuff, and one of the examples is actually wrong, but we didn't change it… this is the thing that it output; we'll just leave it be."
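The layering described above can be sketched in code. This is an illustrative toy, not Poetic's actual system: the `Harness` class, the decompose-then-verify strategy, and the stub model are all hypothetical stand-ins showing how strategy code wraps a swappable base model.

```python
from dataclasses import dataclass
from typing import Callable

# A "base model" is just a callable from prompt text to completion text.
# In practice this would wrap an API client; a stub stands in here.
Model = Callable[[str], str]

@dataclass
class Harness:
    """Hypothetical reasoning harness: strategy code layered on top of
    a swappable base model."""
    model: Model

    def single_prompt(self, task: str) -> str:
        # Baseline approach: one prompt, one completion.
        return self.model(f"Solve: {task}")

    def decompose_and_verify(self, task: str) -> str:
        # A reasoning strategy: break the task into sub-steps, solve
        # each one, then ask the model to check the combined draft.
        steps = self.model(f"List the sub-steps for: {task}").split("\n")
        partials = [self.model(f"Do step: {s}") for s in steps if s.strip()]
        draft = "\n".join(partials)
        return self.model(f"Verify and finalize:\n{draft}")

def stub_model(prompt: str) -> str:
    # Stub so the sketch runs without any API key.
    return f"[model output for: {prompt[:30]}...]"

harness = Harness(stub_model)
print(harness.single_prompt("sort a list of numbers"))
print(harness.decompose_and_verify("sort a list of numbers"))
```

The point of the sketch is that the strategy methods never care which model backs `self.model` — swapping in a newer base model is a one-argument change, which is what keeps the harness compatible across model generations.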
Breaking Records on a Startup Budget
A seven-person team outperformed Google and Anthropic at a fraction of the cost.
Prompts vs. Reasoning Strategies: Where the Real Gains Come From
Code-based reasoning harnesses deliver 10–20× the improvement of prompt tuning alone.
«Don't Limit Yourself — Just Try Things Every Day»
Ian Fischer on building with AI as an iterative, daily practice.
“The world is changing so quickly. This is probably a little bit obvious, but you should just try things and every day do something with AI. Last summer, I took a weekend and used GPT-5 to help me build an iPhone app. I hadn't done that in a decade. And yeah, it's so fast and so easy. And that was, you know, an age ago. That was like 8 months ago. Now it's even faster and easier. Don't limit yourself. Like anything that you imagine, you should just try to use AI and see how far you can get with it and you'll be, you know, making the world better.”
Vaccinated Against the Bitter Lesson
Poetic's compatibility with any base model eliminates the fine-tuning obsolescence trap.
The traditional startup playbook — collect tens of thousands of examples, fine-tune a frontier model, deploy — is a ticking time bomb. By the time your fine-tuned GPT-4 model ships, GPT-5 has already surpassed it. Poetic's harnesses remain compatible across model generations, letting startups upgrade to the latest base model without rebuilding. As Fischer puts it: "You're totally vaccinated against the bitter lesson."
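A minimal sketch of why this sidesteps the obsolescence trap: if the harness accepts the base model as a plain callable, a model upgrade is a one-line swap. All names below are hypothetical stand-ins for real API clients.

```python
def harness(model, task: str) -> str:
    # The strategy layer is model-agnostic: the same prompt scaffolding
    # runs against whichever base model is passed in.
    return model(f"Think step by step, then answer: {task}")

# Stand-ins for two model generations; in practice these would wrap
# API clients for last year's and this year's frontier model.
def old_model(prompt: str) -> str:
    return f"old-gen answer to: {prompt}"

def new_model(prompt: str) -> str:
    return f"new-gen answer to: {prompt}"

print(harness(old_model, "2 + 2"))  # harness built against the old model
print(harness(new_model, "2 + 2"))  # same harness, newer model, no retraining
```

Contrast this with fine-tuning, where the improvement is baked into the old model's weights and cannot be carried forward when a stronger base model ships.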
People
Glossary
Disclaimer: This is an AI-generated summary of a YouTube video, prepared for educational and reference purposes. It does not constitute investment, financial, or legal advice. Always verify information against primary sources before making decisions. TubeReads is not affiliated with the content's author.