Inference, Diffusion, World Models, and More | YC Paper Club
The first-ever YC Paper Club convenes at Pioneer in Woodside, California — the same historic space where OpenAI's founding team once brainstormed research directions with YC founders. Bringing together researchers with thousands of citations and founders who've raised tens of millions, the session tackles a set of pressing questions: Can inference speed unlock new levels of capability, not just cost savings? Will robots need internal models of the world to act intelligently, or can they get by on pattern-matching alone? And what happens when compute is infinite but data is scarce — can classical ideas like ensembling and distillation rewrite the scaling playbook?
Key Takeaways
Inference speed will soon determine peak intelligence: for reasoning systems that scale performance with thinking time, tokens-per-second directly caps how smart the model can be.
Speculative speculative decoding (SSD) parallelizes drafting and verification, hiding latency and achieving 300 tokens/sec on Llama 3 70B with four H100s — a meaningful speedup over open-source engines.
World models enable runtime adaptation to new rewards and dynamics, quantify uncertainty, and may be essential for generalizable robotic policies — though the jury is still out versus model-free approaches.
Over-parameterization and benign overfitting are not mysteries: PAC-Bayes bounds show larger models find flatter, more compressible minima, and soft inductive biases reconcile flexibility with generalization.
In data-constrained regimes, aggressive weight decay (30× standard values), ensembling infinitely many models, and distillation can yield a 5× data efficiency win — a potential blueprint as internet text growth (3% per year) lags compute growth (4–5× per year).
In a Nutshell
Inference is evolving from a cost consideration into a capability frontier, world models are making a comeback as a path to generalizable robot control, and when data is the bottleneck, aggressive regularization and ensembling can deliver 5× data efficiency wins, even in the age of trillion-token pre-training.
Welcome to Paper Club: Making Pioneer Great Again
Over a thousand applied; one hundred were selected to resurrect Woodside as an AI research hub.
The inaugural YC Paper Club session opens at Pioneer, the Woodside campus where Sam Altman, Andrej Karpathy, and Greg Brockman once sat alongside early-stage AI founders brainstorming what would become OpenAI. The organizer frames two missions: create a community of top founders and researchers, and revitalize a campus that has sat underutilized despite its legacy. A show-of-hands poll reveals the room's caliber: multiple attendees have ten-thousand-plus citations; several have raised $50 million or more.
The hidden geographic thesis is that half the Bay Area's AI talent lives in San Francisco (Anthropic, OpenAI, Cursor), but the other half — Google DeepMind, Tesla, xAI, Thinking Machines — works in Palo Alto and rarely makes the trek north. Pioneer sits at the nexus, and the club aims to pull both hemispheres together. Five papers anchor the session, spanning speculative decoding, diffusion-based control, world models, generalization theory, and data-constrained scaling.
Speculative Speculative Decoding (SSD): Parallelizing Draft and Verify
Diffusion Model Predictive Control: Multi-Step Actions, Multi-Step Dynamics
DMPC uses diffusion models for action proposals and world evolution, enabling runtime reward adaptation and novel-dynamics transfer.
Model predictive control (MPC) factors agent design into an action-proposal module and a dynamics model (world model). Diffusion MPC (DMPC) applies diffusion models to both, reducing compounding error and simplifying planning. The speaker demonstrates on locomotion tasks: a quadruped trained only on "run forward" and "jump" can exhibit novel gaits at test time by swapping in a new reward function. When dynamics shift (for example, a walker with a broken left ankle), the factorized design allows retraining only the dynamics model on a handful of samples, preserving the action prior and recovering performance.
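The factorization described above can be illustrated with a toy planner. This is a minimal sketch under loud assumptions: uniform random action proposals stand in for the diffusion action model, a one-dimensional linear map stands in for the diffusion dynamics model, and every name (`propose_actions`, `make_dynamics`, `plan`, `fit_gain`) is hypothetical rather than taken from the DMPC work.

```python
import numpy as np

def propose_actions(rng, num_samples, horizon):
    """Stand-in action prior: uniform random sequences (DMPC would sample a diffusion model)."""
    return rng.uniform(-1.0, 1.0, size=(num_samples, horizon))

def make_dynamics(gain):
    """Stand-in world model for a 1-D point: x_next = x + gain * action."""
    def rollout(x0, actions):
        # A cumulative sum yields the whole multi-step trajectory in one call.
        return x0 + gain * np.cumsum(actions, axis=-1)
    return rollout

def plan(rng, x0, rollout, reward_fn, num_samples=256, horizon=5):
    """Score proposed sequences under the model; return the first action of the best one."""
    actions = propose_actions(rng, num_samples, horizon)
    states = rollout(x0, actions)                  # shape (num_samples, horizon)
    scores = reward_fn(states).sum(axis=-1)
    return actions[np.argmax(scores), 0]

def run_episode(reward_fn, true_gain=1.0, model_gain=1.0, steps=20, seed=0):
    """Replan at every step and apply only the first planned action."""
    rng = np.random.default_rng(seed)
    rollout = make_dynamics(model_gain)
    x = 0.0
    for _ in range(steps):
        a = plan(rng, x, rollout, reward_fn)
        x = x + true_gain * a                      # the real environment
    return x

def fit_gain(actions, deltas):
    """Re-fit only the dynamics model from a handful of (action, state change) samples."""
    return float(actions @ deltas / (actions @ actions))

move_right = lambda states: states                 # reward: be far to the right
move_left = lambda states: -states                 # new reward swapped in at test time

x_right = run_episode(move_right)
x_left = run_episode(move_left)                    # same proposals and model, new reward

probe = np.array([0.5, -1.0, 0.25])
refit = fit_gain(probe, 0.5 * probe)               # actuator is now half as effective
print(x_right, x_left, refit)
```

Swapping `move_right` for `move_left` changes behavior without touching the proposal or dynamics code, and `fit_gain` updates only the dynamics side from three samples, which mirrors the two adaptation modes the paragraph describes.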
DMPC outperforms prior MPC and model-free baselines on 2D tasks (Push-T) and remains competitive on 3D (Push Cube), though DINO World Model wins in 3D thanks to its large foundational vision backbone. Crucially, DMPC runs 50× faster than alternatives because all work happens in a learned latent space, fitting on a single GPU with under 24 GB VRAM and only 15 million parameters. The work predates the current wave of "robot foundation models" but shares the same core bet: factored, multi-step models generalize better and adapt faster than end-to-end policies.
JEPA World Models: Avoiding Collapse with the SIGReg Regularizer
LeWorldModel trains action-conditioned latent forecasting with a single-hyperparameter regularizer that enforces Gaussian-distributed embeddings.
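To make the anti-collapse idea concrete, here is a rough sketch of a regularizer that pushes embeddings toward a Gaussian. It is an assumption-heavy stand-in: it matches only the mean and variance of random one-dimensional projections, which is simpler than whatever statistic the actual regularizer uses, and the function name `gaussian_moment_penalty` is hypothetical.

```python
import numpy as np

def gaussian_moment_penalty(embeddings, num_directions=64, seed=0):
    """Penalize random 1-D projections whose mean is not 0 or whose variance is not 1."""
    rng = np.random.default_rng(seed)
    dim = embeddings.shape[1]
    directions = rng.normal(size=(dim, num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)  # unit vectors
    projected = embeddings @ directions            # shape (batch, num_directions)
    mean_term = projected.mean(axis=0) ** 2
    var_term = (projected.var(axis=0) - 1.0) ** 2
    return float((mean_term + var_term).mean())

rng = np.random.default_rng(1)
healthy = rng.normal(size=(2048, 8))               # roughly isotropic Gaussian embeddings
collapsed = np.tile(rng.normal(size=(1, 8)), (2048, 1))  # every input maps to one point

healthy_penalty = gaussian_moment_penalty(healthy)
collapsed_penalty = gaussian_moment_penalty(collapsed)
print(healthy_penalty, collapsed_penalty)
```

A collapsed encoder has zero variance along every direction, so the variance term alone keeps its penalty near 1, while well-spread embeddings score close to 0. Adding such a term to the forecasting loss with a single weight is one plausible reading of "single-hyperparameter".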
Key Capabilities of World Models
World models enable open-loop prediction, model-based planning, and explicit uncertainty quantification.
Open-Loop Imagination: Given a context frame and an action sequence, the model "imagines" future observations. High-quality predictions indicate the model has learned environment dynamics.
Model Predictive Control: Encode the current observation and a goal observation, then search over action sequences in latent space to find a trajectory that bridges them. Works well when goal images are available.
Uncertainty Quantification: World models can detect when predictions fail. Perturbing the environment, for example by changing an object's color or teleporting an object, causes a spike in model error, allowing the agent to know when its predictions are unreliable. Model-free policies lack this native capability.
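The uncertainty-quantification capability above can be sketched as a simple prediction-error monitor. Everything here is assumed for illustration: a known linear map plays the role of the learned latent dynamics, the "teleport" is an additive jump in a scalar state, and the threshold rule (three times the largest error seen during a calibration window) is a hypothetical choice, not one from the talk.

```python
import numpy as np

def predict_next(z, a):
    """Stand-in latent dynamics model: a known linear map."""
    return 0.9 * z + a

rng = np.random.default_rng(0)
steps, teleport_at = 50, 30
z = 0.0
errors = []
for t in range(steps):
    a = rng.uniform(-1.0, 1.0)
    predicted = predict_next(z, a)
    observed = predicted + rng.uniform(-0.01, 0.01)  # small unmodeled noise
    if t == teleport_at:
        observed += 5.0                              # the object "teleports"
    errors.append(abs(observed - predicted))
    z = observed                                     # re-anchor on the real observation

errors = np.array(errors)
threshold = 3.0 * errors[:20].max()                  # calibrated on nominal steps only
flagged = np.flatnonzero(errors > threshold).tolist()
print(flagged)
```

Because the model makes an explicit prediction at every step, the agent gets this error signal for free; a model-free policy has no comparable quantity to monitor.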
Generalization Is Not Mysterious: PAC-Bayes, Flatness, and Soft Inductive Bias
Andrew Gordon Wilson's work shows overparameterization and benign overfitting fit classical theory when compression is measured correctly.
The common explanation for why scaling works is that "it just does", an unsatisfying answer when generalization underpins every capability gain in modern AI. Andrew Gordon Wilson's paper argues that classical PAC-Bayes bounds, long dismissed as vacuous for overparameterized models, actually explain deep learning when applied correctly. A PAC-Bayes bound caps test loss at training loss plus a compression term. Historically, the compression term dominated and the bounds became loose. Wilson shows that larger models find more compressible solutions: the volume of flat minima in parameter space grows exponentially with parameter count, and flat minima compress better than sharp ones.
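For reference, one standard form of such a bound (the McAllester/Maurer style, for a loss bounded in [0, 1]) is written out below. This is a textbook statement offered as context, not necessarily the exact bound used in the paper. Here P is a prior fixed before seeing data, Q is any posterior over hypotheses, n is the number of i.i.d. training examples, and the bound holds with probability at least 1 − δ.

```latex
\mathbb{E}_{h \sim Q}\big[R(h)\big]
\;\le\;
\mathbb{E}_{h \sim Q}\big[\hat{R}_n(h)\big]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\!\big(2\sqrt{n}/\delta\big)}{2n}}
```

The KL term is the "compression" cost: a solution that can be described in few bits relative to the prior has a small KL, so the gap between training loss and test loss is provably small even when the raw parameter count is huge.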
The paper also resolves "benign overfitting", the mystery of how networks fit random noise yet generalize on structured data. A regularized polynomial offers intuition: enough parameters fit noise, but regularization biases the model toward low-order terms that capture structure. Deep nets are expressive hypothesis spaces with soft inductive biases. The takeaway: if we identify the right inductive biases and optimize for them (e.g., compressibility, flatness), we may unlock massive sample-efficiency gains. The no-free-lunch theorem implies that all learning efficiency comes from inductive bias, and humans remain orders of magnitude more sample-efficient than models.
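The regularized-polynomial intuition can be tried directly. The sketch below is an illustration under assumptions of my own choosing: a Legendre basis, a penalty that grows as the fourth power of the polynomial order, and a noisy quadratic as the "structured" target. None of these specifics come from the paper.

```python
import numpy as np
from numpy.polynomial import legendre

def fit_polynomial(x, y, degree, lam=0.0, power=4):
    """Least squares in a Legendre basis with an order-dependent penalty lam * k**power."""
    X = legendre.legvander(x, degree)              # columns are P_0(x) ... P_degree(x)
    k = np.arange(degree + 1).astype(float)
    # Ridge written as an augmented least-squares problem to avoid squaring the conditioning.
    A = np.vstack([X, np.diag(np.sqrt(lam * k ** power))])
    b = np.concatenate([y, np.zeros(degree + 1)])
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs

def mse(coeffs, x, y):
    return float(np.mean((legendre.legval(x, coeffs) - y) ** 2))

def order_penalty(coeffs, power=4):
    """How much weight the fit places on high-order terms."""
    k = np.arange(len(coeffs)).astype(float)
    return float(np.sum(k ** power * coeffs ** 2))

structure = lambda x: 1.0 + 2.0 * x - 1.5 * x ** 2   # low-order signal
rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 12)
y_train = structure(x_train) + 0.3 * rng.normal(size=12)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = structure(x_test)

free = fit_polynomial(x_train, y_train, degree=11)              # 12 params, 12 points
biased = fit_polynomial(x_train, y_train, degree=11, lam=1e-2)  # same capacity, soft bias

for name, c in [("free", free), ("biased", biased)]:
    print(name, mse(c, x_train, y_train), mse(c, x_test, y_test), order_penalty(c))
```

Both fits use the same hypothesis space; only the soft preference for low-order terms differs. The unpenalized fit interpolates the noise exactly, while the penalized fit gives up some training error in exchange for smaller high-order coefficients.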
Data-Constrained Scaling: When Compute Is Infinite but Tokens Are Scarce
Aggressive regularization, ensembling, and distillation yield a 5× data efficiency win as compute growth outpaces internet text.
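A quick back-of-the-envelope calculation shows how fast that gap compounds. It uses the growth figures quoted in the Key Takeaways (text growing about 3% per year, compute about 4 to 5× per year), takes the lower 4× figure, and assumes both rates stay constant, which is a simplification.

```python
def compute_to_data_ratio(years, compute_growth=4.0, data_growth=1.03):
    """Factor by which compute has outgrown text data after `years` of compounding."""
    return (compute_growth / data_growth) ** years

ratio_after_5_years = compute_to_data_ratio(5)
print(round(ratio_after_5_years))
```

Under these assumptions compute outgrows available text by roughly three orders of magnitude within five years, which is why the recipe treats compute as effectively free and data as the scarce resource.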
Practical Wins: Distillation and Self-Distillation
Distilling an eight-model ensemble into one dense model retains 83% of the gain; self-distillation surprisingly improves loss.
Even though the joint scaling recipe requires massive training compute, distillation compresses test-time cost. An eight-member ensemble (2.4B total parameters) distills into a single 300M dense model while preserving 83% of the loss improvement. More surprisingly, self-distillation — training a fresh copy of the same 300M model on its own outputs — beats the regularized recipe's asymptote. Prior work suggests self-distillation implicitly trains a two-member ensemble, reconciling the counterintuitive result. The findings hold on downstream benchmarks (fully held-out until the end) and in continued pre-training scenarios, confirming the approach generalizes beyond in-distribution validation loss.
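The ensemble-then-distill step can be sketched in a few lines. This is a toy with random logits rather than trained language models, and it assumes the ensemble averages member predictions in probability space and that the student is trained with cross-entropy against those soft targets; the source does not specify either detail, and all function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_probs(member_logits):
    """Average member predictions in probability space. Input: (members, tokens, vocab)."""
    return softmax(member_logits).mean(axis=0)

def nll(probs, targets):
    """Mean negative log-likelihood of the target token ids."""
    return float(-np.log(probs[np.arange(len(targets)), targets]).mean())

def distill_loss(student_logits, teacher_probs):
    """Cross-entropy of the student against the teacher's soft targets."""
    log_q = np.log(softmax(student_logits))
    return float(-(teacher_probs * log_q).sum(axis=-1).mean())

rng = np.random.default_rng(0)
members, tokens, vocab = 8, 32, 50
member_logits = rng.normal(size=(members, tokens, vocab))
targets = rng.integers(0, vocab, size=tokens)

teacher = ensemble_probs(member_logits)
ensemble_nll = nll(teacher, targets)
mean_member_nll = float(np.mean([nll(softmax(m), targets) for m in member_logits]))

matched = distill_loss(np.log(teacher), teacher)      # student reproduces the teacher
mismatched = distill_loss(member_logits[0], teacher)  # a single member as the student
print(ensemble_nll, mean_member_nll, matched, mismatched)
```

By Jensen's inequality the averaged distribution never has a worse loss than the average of its members, and the distillation loss is minimized exactly when the student matches the ensemble, so a single model trained this way inherits the ensemble's predictions at one model's inference cost.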
The Billion-Dollar Question: Model-Free or Model-Based?
Yann LeCun raised $1B to train world models; the field remains split on explicit versus implicit modeling.
“Yann LeCun raised $1.03 billion dollars back in March basically just to train world models.”
Disclaimer: This is an AI-generated summary of a YouTube video for educational and reference purposes. It does not constitute investment, financial, or legal advice. Always verify information with original sources before making any decisions. TubeReads is not affiliated with the content creator.