Inference, Diffusion, World Models, and More | YC Paper Club

The first-ever YC Paper Club convenes at Pioneer in Woodside, California — the same historic space where OpenAI's founding team once brainstormed research directions with YC founders. Bringing together researchers with thousands of citations and founders who've raised tens of millions, the session tackles a set of pressing questions: Can inference speed unlock new levels of capability, not just cost savings? Will robots need internal models of the world to act intelligently, or can they get by on pattern-matching alone? And what happens when compute is infinite but data is scarce — can classical ideas like ensembling and distillation rewrite the scaling playbook?

Y Combinator · Tech · 11 people mentioned · 8 glossary terms

Video length: 1:07:19 · Published May 28, 2026 · Video language: en-US

7–8 min read · 12,985 spoken words → summarized to 1,550 words (8x)

1 —

Key takeaways

1

Inference speed will soon determine peak intelligence: for reasoning systems that scale performance with thinking time, tokens-per-second directly caps how smart the model can be.

2

Speculative speculative decoding (SSD) parallelizes drafting and verification, hiding latency and achieving 300 tokens/sec on Llama 3 70B with four H100s — a meaningful speedup over open-source engines.

3

World models enable runtime adaptation to new rewards and dynamics, quantify uncertainty, and may be essential for generalizable robotic policies — though the jury is still out versus model-free approaches.

4

Over-parameterization and benign overfitting are not mysteries: PAC-Bayes bounds show larger models find flatter, more compressible minima, and soft inductive biases reconcile flexibility with generalization.

5

In data-constrained regimes, aggressive weight decay (30× standard values), ensembling infinitely many models, and distillation can yield a 5× data efficiency win — a potential blueprint as internet text growth (3% per year) lags compute growth (4–5× per year).

In summary

Inference is evolving from a cost consideration into a capability frontier, world models are making a comeback as a path to generalizable robot control, and when data is the bottleneck, aggressive regularization and ensembling can deliver 5x data efficiency wins — even in the age of trillion-token pre-training.

2 —

Welcome to Paper Club: Making Pioneer Great Again

Over a thousand applied; one hundred were selected to resurrect Woodside as an AI research hub.

The inaugural YC Paper Club session opens at Pioneer, the Woodside campus where Sam Altman, Andrej Karpathy, and Greg Brockman once sat alongside early-stage AI founders brainstorming what would become OpenAI. The organizer frames two missions: create a community of top founders and researchers, and revitalize a campus that has sat underutilized despite its legacy. A show-of-hands poll reveals the room's caliber: multiple attendees have ten-thousand-plus citations; several have raised $50 million or more.

The hidden geographic thesis is that half the Bay Area's AI talent lives in San Francisco (Anthropic, OpenAI, Cursor), but the other half — Google DeepMind, Tesla, xAI, Thinking Machines — works in Palo Alto and rarely makes the trek north. Pioneer sits at the nexus, and the club aims to pull both hemispheres together. Five papers anchor the session, spanning speculative decoding, diffusion-based control, world models, generalization theory, and data-constrained scaling.

3 —

Speculative Speculative Decoding (SSD): Parallelizing Draft and Verify

🚀

Inference as Capability

For reasoning systems that scale with thinking time, tokens-per-second determines peak intelligence — not just cost. A 20,000-GPU cluster working on the Riemann hypothesis requires fast inference to deliver meaningful throughput.

🔮

Predicting Verification

SSD runs drafting and verification in parallel: while the target model verifies the current draft, the draft model guesses the most likely accept/reject outcomes (using its own token distributions) and pre-drafts the next round for them. Cache-hit rates reach 80–90%, and when the guess is correct the drafting latency is hidden entirely.

⚡

300 Tokens/Sec on 4×H100

A hand-rolled inference engine implementing SSD samples Llama 3 70B at 300 tok/s on four H100s, a substantial speedup over competing open-source engines that the speaker credits to the algorithm rather than to systems work.
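The draft, verify, and pre-draft loop described above can be sketched with toy models. This is a minimal illustration and not the speaker's engine: `target_next` and `draft_next` are hypothetical stand-ins for the large and small models, verification is greedy, only one outcome is guessed per round (all drafted tokens accepted), and the "parallel" work is simulated sequentially.

```python
VOCAB = 50

def target_next(ctx):
    """Toy 'large' model (hypothetical): deterministic greedy next token."""
    return (3 * ctx[-1] + len(ctx)) % VOCAB

def draft_next(ctx):
    """Toy 'small' model: agrees with the target except at every 7th position."""
    tok = target_next(ctx)
    return (tok + 1) % VOCAB if len(ctx) % 7 == 0 else tok

def draft_block(ctx, k):
    """Autoregressively draft k tokens with the draft model."""
    out = []
    for _ in range(k):
        out.append(draft_next(ctx + out))
    return out

def verify(ctx, draft):
    """Greedy verification: keep the matching prefix, then add one target token.
    A real engine would score all drafted positions in one forward pass."""
    accepted = []
    for tok in draft:
        expected = target_next(ctx + accepted)
        if tok != expected:
            return accepted + [expected]  # correction at the first mismatch
        accepted.append(tok)
    return accepted + [target_next(ctx + accepted)]  # bonus token on full accept

def greedy_generate(prompt, n_new):
    """Reference: plain token-by-token decoding with the target model."""
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx

def ssd_generate(prompt, n_new, k=4):
    ctx = list(prompt)
    draft = draft_block(ctx, k)
    hits = rounds = 0
    while len(ctx) - len(prompt) < n_new:
        # Work that would overlap with verification: guess the outcome
        # (everything accepted, bonus token = draft's own prediction)
        # and pre-draft the following round for that guess.
        guess = draft + [draft_next(ctx + draft)]
        pre_draft = draft_block(ctx + guess, k)

        new_tokens = verify(ctx, draft)
        ctx += new_tokens
        rounds += 1
        if new_tokens == guess:   # cache hit: the next draft is already available
            draft = pre_draft
            hits += 1
        else:                     # cache miss: draft again from the true context
            draft = draft_block(ctx, k)
    return ctx[: len(prompt) + n_new], hits / rounds

tokens, hit_rate = ssd_generate([1, 2], n_new=40)
print(tokens == greedy_generate([1, 2], 40), hit_rate)
```

The output always matches plain greedy decoding with the target model; the guess only changes how often the next draft is ready before verification finishes.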

4 —

Diffusion Model Predictive Control: Multi-Step Actions, Multi-Step Dynamics

DMPC uses diffusion models for action proposals and world evolution, enabling runtime reward adaptation and novel-dynamics transfer.

Model predictive control (MPC) factors agent design into an action-proposal module and a dynamics model (world model). Diffusion MPC (DMPC) applies multi-step diffusion models to both, reducing compounding error and simplifying planning. The speaker demonstrates on locomotion tasks: a quadruped trained only on "run forward" and "jump" can exhibit novel gaits at test time by swapping in a new reward function. When dynamics shift (for example, a walker with a broken left ankle), the factorized design allows retraining only the dynamics model on a handful of samples, preserving the action prior and recovering performance.
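The factored design above can be illustrated with a small sample-and-score planner. This is a sketch under stated assumptions: `propose_actions` and `rollout` are hypothetical stand-ins (a uniform sampler and a 1-D point mass) for the learned diffusion proposal and dynamics models, and the planner simply keeps the highest-scoring candidate. The point is that the reward function is an argument supplied at planning time, so it can be swapped without retraining either module.

```python
import random

def propose_actions(rng, horizon):
    """Stand-in for the action proposal: one multi-step action sequence."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]

def rollout(state, actions):
    """Stand-in for the multi-step dynamics model: a 1-D point mass."""
    pos, vel = state
    trajectory = []
    for a in actions:
        vel += 0.1 * a
        pos += vel
        trajectory.append((pos, vel))
    return trajectory

def score(state, actions, reward_fn):
    """Total reward of the imagined trajectory."""
    return sum(reward_fn(s) for s in rollout(state, actions))

def plan(state, reward_fn, n_samples=256, horizon=10, seed=0):
    """Sample candidate action sequences, imagine each one, keep the best."""
    rng = random.Random(seed)
    candidates = [propose_actions(rng, horizon) for _ in range(n_samples)]
    return max(candidates, key=lambda acts: score(state, acts, reward_fn))

def move_right(s):
    return s[0]      # reward: position

def move_left(s):
    return -s[0]     # a different reward, swapped in at planning time

start = (0.0, 0.0)
plan_right = plan(start, move_right)
plan_left = plan(start, move_left)
print(rollout(start, plan_right)[-1], rollout(start, plan_left)[-1])
```

Handling a dynamics shift would correspond to replacing `rollout` alone while keeping `propose_actions` unchanged.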

DMPC outperforms prior MPC and model-free baselines on 2D tasks (Push-T) and remains competitive on 3D (Push Cube), though DINO World Model wins in 3D thanks to its large foundational vision backbone. Crucially, DMPC runs 50× faster than alternatives because all work happens in a learned latent space, fitting on a single GPU with under 24 GB VRAM and only 15 million parameters. The work predates the current wave of "robot foundation models" but shares the same core bet: factored, multi-step models generalize better and adapt faster than end-to-end policies.

5 —

JEPA World Models: Avoiding Collapse with SIGReg

LeWorldModel (LeWM) trains action-conditioned latent forecasting with a single-hyperparameter regularizer that enforces Gaussian-distributed embeddings.

THE CHALLENGE

Co-Learning Representation and Dynamics

Training world models requires learning both how to compactly encode high-dimensional observations (images, lidar) and how actions change that representation. Many solutions in the optimization landscape lead to trivial collapse — for instance, mapping every state to the same embedding. Existing methods avoid this through ad-hoc tricks: explicit heuristics to enforce latent health, foundational pre-trained encoders, or privileged ground-truth data during training.

THE SOLUTION

SIGReg: One Regularizer, No Tricks

LeWorldModel uses the SIGReg regularizer (Sketched Isotropic Gaussian Regularization). Instead of testing Gaussianity in the full high-dimensional space, which is expensive, SIGReg projects the embeddings onto one-dimensional slices through the latent space and checks that each slice is Gaussian-distributed. If every one-dimensional projection is Gaussian, the full distribution is Gaussian as well, so the embeddings cannot have collapsed to a point. This single loss term, with one hyperparameter, prevents collapse and matches or beats prior methods on Push-T and Push-Cube tasks while enabling 50× faster inference than DINO World Model on simple environments.
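The slicing idea above can be sketched in a few lines. This is an assumption-laden illustration, not the paper's loss: the function name `sliced_gaussian_penalty` is hypothetical, the slices are random unit directions, and each slice is checked only by matching its mean to 0 and its variance to 1, which is a much weaker stand-in for a proper one-dimensional Gaussianity test.

```python
import math
import random

def random_unit_vector(rng, dim):
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sliced_gaussian_penalty(embeddings, n_slices=32, seed=0):
    """Average, over random 1-D slices, of how far each projection's
    mean and variance are from those of a standard Gaussian."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(n_slices):
        direction = random_unit_vector(rng, dim)
        proj = [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / n_slices

data_rng = random.Random(1)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]
collapsed = [[0.5] * 8 for _ in range(1000)]  # every state maps to one point

print(sliced_gaussian_penalty(healthy), sliced_gaussian_penalty(collapsed))
```

Collapsed embeddings have zero variance along every slice, so they are penalized heavily, while embeddings sampled from an isotropic Gaussian score close to zero.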

6 —

Key Capabilities of World Models

World models enable open-loop prediction, model-based planning, and explicit uncertainty quantification.

1

Open-Loop Imagination: Given a context frame and an action sequence, the model "imagines" future observations. High-quality predictions indicate the model has learned environment dynamics.

2

Model Predictive Control: Encode the current observation and a goal observation, then search over action sequences in latent space to find a trajectory that bridges them. Works well when goal images are available.

3

Uncertainty Quantification: World models can detect when predictions fail. Perturbing the environment (changing an object's color or teleporting an object) causes a spike in model error, letting the agent know when its predictions are unreliable. Model-free policies lack this native capability.
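The third capability, noticing a spike in model error, can be sketched as a simple threshold on prediction error. Everything here is illustrative: the function name `surprise_flags`, the warm-up length, the three-standard-deviation threshold, and the error values are assumptions, not details from the talk.

```python
import math

def surprise_flags(errors, warmup=5, k=3.0):
    """Flag steps whose prediction error exceeds the mean of all earlier
    errors by more than k standard deviations."""
    flags = []
    for t, err in enumerate(errors):
        history = errors[:t]
        if len(history) < warmup:
            flags.append(False)  # not enough history to judge yet
            continue
        mean = sum(history) / len(history)
        std = math.sqrt(sum((h - mean) ** 2 for h in history) / len(history))
        flags.append(err > mean + k * max(std, 1e-8))
    return flags

# Hypothetical per-step errors between predicted and observed latents;
# the jump at step 6 mimics an object being teleported.
errors = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.95, 0.11]
flags = surprise_flags(errors)
print(flags)
```

An agent could use such a flag to fall back to a cautious behavior or to request new data whenever its world model stops explaining what it sees.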

7 —

Generalization Is Not Mysterious: PAC-Bayes, Flatness, and Soft Inductive Bias

Andrew Gordon Wilson's work shows overparameterization and benign overfitting fit classical theory when compression is measured correctly.

The usual explanation for why scaling works is that "it just does," a dissatisfying answer when generalization underpins every capability gain in modern AI. Andrew Gordon Wilson's paper argues that classical PAC-Bayes bounds, long dismissed as vacuous for overparameterized models, actually explain deep learning when applied correctly. A PAC-Bayes bound caps test loss at training loss plus a compression term. Historically, the compression term dominated and the bounds became too loose to be informative. Wilson shows that larger models find more compressible solutions: the volume of flat minima in parameter space grows exponentially with parameter count, and flat minima compress better than sharp ones.
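For reference, one common textbook form of a PAC-Bayes bound makes the "training loss plus compression term" structure explicit. The notation is an assumption added here and may differ from the variant used in the paper: the loss takes values in [0, 1], P is a prior over hypotheses fixed before seeing the data, Q is any posterior, n is the number of training examples, and the bound holds with probability at least 1 − δ over the training sample S.

```latex
\mathbb{E}_{h \sim Q}\big[L(h)\big]
\;\le\;
\mathbb{E}_{h \sim Q}\big[\hat{L}_S(h)\big]
\;+\;
\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

The KL term plays the role of the compression term: a solution that can be described cheaply relative to the prior has a small KL divergence, so the bound stays tight even when the parameter count is large.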

The paper also resolves "benign overfitting," the puzzle of how networks can fit random noise yet still generalize on structured data. A regularized polynomial offers intuition: with enough parameters it can fit noise, but regularization biases it toward the low-order terms that capture structure. Deep nets are expressive hypothesis spaces with soft inductive biases. The takeaway: if we identify the right inductive biases and optimize for them (e.g., compressibility, flatness), we may unlock massive sample-efficiency gains. The no-free-lunch theorem guarantees that all learning efficiency comes from inductive bias, and humans remain orders of magnitude more sample-efficient than models.

8 —

Data-Constrained Scaling: When Compute Is Infinite but Tokens Are Scarce

Aggressive regularization, ensembling, and distillation yield a 5× data efficiency win as compute growth outpaces internet text.

Internet Text Growth

~3% per year

Human-generated text on the internet grows slowly, creating a looming data bottleneck.

Pre-Training Compute Growth

4–5× per year

Compute budgets are scaling much faster than available data, flipping the optimization problem.

Weight Decay Multiplier

30×

Optimal weight decay in data-constrained settings is 30 times higher than in compute-optimal pre-training.

Ensemble Data Efficiency

5× improvement

Extrapolating the scaling laws to the double limit of infinitely many, infinitely large ensemble members projects that the same loss is reachable with 5× fewer tokens.

Continued Pre-Training Win

17× data efficiency

On math-related tokens, aggressive epoching and ensembling matched full-corpus performance using only 4B of 73B tokens.
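The gap between the two growth rates quoted above compounds quickly. The short calculation below uses only the summary's figures (about 3% per year for text, 4–5× per year for compute); the five-year horizon is an arbitrary choice for illustration.

```python
def growth_after(years, factor_per_year):
    """Cumulative growth after compounding a yearly factor."""
    return factor_per_year ** years

years = 5
data_growth = growth_after(years, 1.03)   # internet text at ~3% per year
compute_low = growth_after(years, 4)      # compute at 4x per year
compute_high = growth_after(years, 5)     # compute at 5x per year

# How much more compute is available per token of data after `years`
ratio_low = compute_low / data_growth
ratio_high = compute_high / data_growth

print(f"data x{data_growth:.2f}, compute x{compute_low} to x{compute_high}")
print(f"compute per token grows x{ratio_low:.0f} to x{ratio_high:.0f}")
```

Under these assumptions, text grows by roughly 16% in five years while compute grows about a thousandfold or more, which is why the recipe optimizes for data efficiency rather than compute efficiency.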

9 —

Practical Wins: Distillation and Self-Distillation

Distilling an eight-model ensemble into one dense model retains 83% of the gain; self-distillation surprisingly improves loss.

Even though the joint scaling recipe requires massive training compute, distillation compresses test-time cost. An eight-member ensemble (2.4B total parameters) distills into a single 300M dense model while preserving 83% of the loss improvement. More surprisingly, self-distillation — training a fresh copy of the same 300M model on its own outputs — beats the regularized recipe's asymptote. Prior work suggests self-distillation implicitly trains a two-member ensemble, reconciling the counterintuitive result. The findings hold on downstream benchmarks (fully held-out until the end) and in continued pre-training scenarios, confirming the approach generalizes beyond in-distribution validation loss.
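The distillation step above can be sketched as training a student on the ensemble's averaged predictions. This is a minimal illustration with assumed details: the logits are made up, the teacher is the average of the members' probability distributions, and the loss is plain cross-entropy against those soft labels. The talk's exact objective is not specified in this summary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_probs(member_logits):
    """Average the members' predictive distributions (not their logits)."""
    probs = [softmax(logits) for logits in member_logits]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]

def distill_loss(teacher_probs, student_logits):
    """Cross-entropy of the student's distribution against the soft labels."""
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student))

# Hypothetical next-token logits from three ensemble members
members = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [2.5, 0.0, -1.5]]
teacher = ensemble_probs(members)

matched_student = [math.log(p) for p in teacher]  # softmax reproduces the teacher
uniform_student = [0.0, 0.0, 0.0]                 # an untrained student

print(distill_loss(teacher, matched_student), distill_loss(teacher, uniform_student))
```

The loss is smallest when the student reproduces the ensemble's distribution, which is how a single dense model can absorb most of the ensemble's gain. Self-distillation uses the same loss with a single model of the same architecture as the teacher.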

10 —

The Billion-Dollar Question: Model-Free or Model-Based?

Yann LeCun raised $1B to train world models; the field remains split on explicit versus implicit modeling.

“Yann LeCun raised $1.03 billion dollars back in March basically just to train world models.”
— Isaac Ward

11 —

People

Tanishq

Graduate Student, Stanford

guest

Stannis

Staff Research Scientist, Google DeepMind

guest

Isaac Ward

Researcher (world models specialist)

guest

Ashe

President, Q Labs (YC startup)

guest

Kun Wu

Researcher (co-led data-efficiency paper)

guest

Harj

YC Partner

mentioned

Sam Altman

Former YC President

mentioned

Andrej Karpathy

Co-founder, OpenAI (YC W16)

mentioned

Greg Brockman

Co-founder, OpenAI

mentioned

Yann LeCun

Chief AI Scientist, Meta

mentioned

Andrew Gordon Wilson

Professor (generalization theory)

mentioned

Glossary

Speculative Decoding: A technique where a small "draft" model proposes token sequences and a large "target" model verifies them in parallel, trading compute for lower latency.

World Model: A learned dynamics model that predicts how a system's state evolves given actions, enabling planning and uncertainty quantification.

JEPA (Joint Embedding Predictive Architecture): Yann LeCun's framework for learning representations by predicting future latent embeddings rather than raw pixels.

PAC-Bayes Bound: A generalization bound from learning theory that relates test loss to training loss plus a measure of model complexity or compressibility.

SIGReg: A regularization term (Sketched Isotropic Gaussian Regularization) that enforces Gaussian-distributed latent embeddings by checking one-dimensional slices of the latent space.

Benign Overfitting: The phenomenon where overparameterized models fit random noise yet still generalize well on structured data.

Chinchilla Scaling Laws: Empirical rules stating that compute-optimal training requires scaling both model size and training data proportionally.

Self-Distillation: Training a fresh copy of a model using the original model's predictions as soft labels, often improving generalization.

Disclaimer: This is an AI-generated summary of a YouTube video, provided for educational and reference purposes. It does not constitute investment, financial, or legal advice. Always verify the information against the original sources before making decisions. TubeReads is not affiliated with the content creator.