Video Summary

Yann LeCun's $1B Bet Against LLMs

Welch Labs

Main takeaways

Yann LeCun raised a billion dollars to pursue JEPA (Joint Embedding Predictive Architecture), a non‑generative alternative to LLMs.

JEPA encodes observations into embeddings and trains a predictor to predict next‑state embeddings instead of raw pixels or tokens.

Generative video prediction often yields blurry outputs because pixelwise losses average over many plausible futures.

Joint embedding methods (Siamese networks) learned strong representations but suffered from representation collapse.

Barlow Twins introduced a redundancy‑reduction loss (cross‑correlation → identity) to prevent collapse and improve self‑supervised vision learning. A frozen Barlow Twins encoder reached 73.2% ImageNet accuracy, noted as a milestone.

Key moments

Questions answered

What is JEPA and how does it differ from generative LLMs?

JEPA (Joint Embedding Predictive Architecture) encodes observations into embeddings for both current and next states, and trains a predictor to predict the next‑state embedding from the current one. Unlike generative LLMs, it does not directly generate text or pixels; it predicts in embedding space, avoiding pixelwise prediction.

Why do generative video models often produce blurry frames?

Because the next video frame can be highly uncertain and multimodal; pixelwise losses (e.g., MSE) average across plausible futures, producing washed‑out, blurry images when the model cannot commit to a single sharp outcome.
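The averaging effect can be illustrated with a minimal sketch. It assumes a hypothetical single "pixel" whose next value is black (0.0) or white (1.0) with equal probability; the names `futures`, `candidates`, and `best` are illustrative and not from the video. The prediction that minimizes MSE lands near gray, which is neither of the plausible outcomes.

```python
import random


def mse(prediction, targets):
    """Mean squared error of one constant prediction against many targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


random.seed(0)
# Hypothetical toy "pixel": the next frame is black (0.0) or white (1.0), 50/50.
futures = [random.choice([0.0, 1.0]) for _ in range(10_000)]

# Search a grid of candidate predictions for the one with the lowest MSE.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: mse(p, futures))

mean_future = sum(futures) / len(futures)
print(f"MSE-optimal prediction: {best:.2f} (mean of futures: {mean_future:.3f})")
print(f"MSE when committing to black: {mse(0.0, futures):.3f}")
print(f"MSE at the optimum:           {mse(best, futures):.3f}")
```

Committing to either sharp outcome is penalized more heavily than predicting the washed-out average, which is the one-pixel version of a blurry frame.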

What is representation collapse and how did Barlow Twins address it?

Representation collapse is when an embedding network maps diverse inputs to identical or trivial vectors. Barlow Twins prevents this by minimizing redundancy: it computes the cross‑correlation between twin embeddings and penalizes deviations from the identity matrix so corresponding dimensions stay correlated while off‑diagonal entries, which measure redundancy between different dimensions, are pushed toward zero.

How can JEPA-style models help robotic control (VJEPA‑2)?

VJEPA‑2 conditions embedding predictions on action signals, learning how actions change embedding states. That lets a robot predict goal embeddings and use classical optimal control or planning to choose actions that reach desired embeddings rather than predicting raw pixels.

Will JEPA replace LLMs?

According to the discussion, JEPA and world‑model approaches will initially solve different problems (especially perception and control). LeCun suggests they could eventually supplant LLMs for broader reasoning tasks, but LLMs remain strong where language is the substrate of reasoning.

Yann LeCun's Alternative AI Approach 00:10

"AI legend Yann LeCun has raised a billion dollars to pursue an alternative approach to AI."

Yann LeCun has secured a billion dollars to focus on a new method for artificial intelligence that diverges from traditional large language models (LLMs).
Unlike LLMs, which generate text, images, or videos, LeCun's approach is centered around an architecture called JEPA, designed to be non-generative.
JEPA is not just a single model but a framework for training various AI models, emphasizing a different methodology for processing inputs and outputs.

Distinction Between Traditional Models and JEPA 00:40

"Many successful approaches in AI and machine learning train models to predict some output Y given some input X."

Conventional AI methods typically involve predicting an output Y based on an input X, such as predicting the following word in a text or classifying an image.
Specifically, traditional large language models receive text X and aim to predict the subsequent text Y, while image classifiers operate on input images to assign corresponding labels.
In contrast, JEPA utilizes encoders that convert the inputs and outputs into numerical vectors or matrices, referred to as embeddings.

The Future of AI Systems: JEPA's Potential 01:24

"Initially they'll solve different problems. Eventually they could replace LLMs."

The early applications of JEPA and world model-based approaches are likely to address different challenges than those tackled by traditional LLMs.
However, there is a possibility that these methodologies could ultimately supplant LLMs, which excel at linguistic manipulation but have limitations in other areas.
JEPA represents an alternative avenue for AI, focusing on joint embedding architectures rather than solely on generative language processes.

LeCun's Historical Context in AI 02:35

"Yann LeCun saw the revolution coming in the 1980s."

LeCun was a pioneer in AI, already envisioning transformative changes while the field was primarily focused on expert systems that relied on hand-programmed rules rather than on learning from data.
He played a major role in the development of convolutional neural networks, which became foundational for later deep learning advances, notably the AlexNet model in the 2010s.
This historical perspective underscores the evolution of AI methodologies, revealing how LeCun's early predictions and contributions were pivotal to future innovations.

Concerns Regarding Labeled Training Data and Alternative Learning Approaches 03:08

"LeCun and other researchers became increasingly concerned by how much this approach to AI depended on labeled training data."

As the reliance on labeled data became a bottleneck in supervised learning, researchers began to explore alternative learning strategies.
There was a resurgence of interest in reinforcement learning, wherein models learn through environmental interactions rather than through explicit labels, as highlighted by breakthroughs from Google DeepMind.
At the same time, LeCun and colleagues investigated unsupervised methods, emphasizing self-supervised learning where the model extracts labels from the data itself.

The Impact of Self-Supervised Learning 04:22

"If intelligence is a cake, the bulk of the cake is self-supervised learning."

LeCun characterized self-supervised learning as fundamental to advancing AI, with supervised learning being just a component of the overall framework.
The significant success of self-supervised models in language tasks proceeded faster than in visual modalities, ultimately contributing to the emergence of LLMs and the development of suitable training methodologies for them.
One particular model, GPT-1, demonstrated how to reduce the reliance on human-labeled data, paving the way for the scaling seen in subsequent versions like GPT-2 and GPT-3.

Challenges in Predicting Video Frames 08:33

"Results quickly devolve into blurry nothingness."

LeCun and others attempted to extend generative principles to video prediction, configuring neural networks to foresee what occurs in subsequent frames based on initial pixel values.
However, these models often produced blurry results, particularly in long-horizon predictions, illustrating challenges unique to video data that are not present in text-based models.
The exploration of auto-regressive predictions in video has shown significant limitations, necessitating new strategies to refine the output quality in dynamic visual settings.

Application of Transformers in New Domains 09:40

"Is it possible to train a transformer to find patterns in this data and predict future prices?"

The video discusses the utilization of transformer architectures beyond language, specifically in financial markets to predict stock prices based on order book data.
By analyzing the influx of bids and asks for stocks, researchers aim to discover patterns that can aid in forecasting price movements in a data-rich environment.
This approach signifies the versatile potential of transformers, presenting opportunities for comprehensive analysis across various domains beyond natural language processing.

Machine Learning Challenges in Trading 09:52

"The Hudson River Trading team has developed interesting approaches to adapt cutting-edge transformer architectures to the complexities and constraints of trading data."

The Hudson River Trading team is exploring novel machine learning methods that tackle unique challenges presented by financial trading data.
The VJEPA model discussed in the video maps patches of video to individual embedding vectors, suggesting a similar approach for order book data: tokenizing groups of orders based on financial intuition.
Despite the initial promise, a naive model may struggle to perform effectively because of the tight latency constraints of the fast-paced trading environment.

The Complexity of Video Prediction Models 11:15

"Language models use fixed-size vocabularies, but for full HD video, there are approximately 10^(15 million) possible next video frames, dwarfing the number of atoms in the observable universe."

Video prediction models face challenges due to the immense complexity of video data compared to language data. Language models like GPT-2 operate with defined vocabularies, but video data presents a near-infinite array of possible outputs.
Traditional generative models may struggle as they attempt to predict each single pixel's intensity directly, leading to issues with uncertainty.
In video, when models are forced to predict frames directly from ambiguous scenarios, the outcomes are often averaged, resulting in blurred and washed-out images.
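The scale in the quote above can be checked with quick arithmetic. The sketch below assumes a 1920x1080 frame with 3 color channels and 8 bits per channel; those specifics are assumptions, not stated in the summary. The GPT-2 vocabulary size (50,257 tokens) and the figure of roughly 10^80 atoms in the observable universe are commonly cited values used here only for comparison.

```python
import math

# Assumptions (not stated in the summary): a 1920x1080 frame,
# 3 color channels, 8 bits (256 levels) per channel.
values_per_frame = 1920 * 1080 * 3
# Number of distinct frames is 256 ** values_per_frame; work in log10
# because the number itself is far too large to print usefully.
log10_frames = values_per_frame * math.log10(256)

gpt2_vocab = 50_257  # GPT-2 vocabulary size, for comparison
log10_atoms = 80     # commonly cited rough estimate for the observable universe

print(f"values per frame:          {values_per_frame:,}")
print(f"log10(possible frames):    {log10_frames:,.0f}")
print(f"log10(GPT-2 next tokens):  {math.log10(gpt2_vocab):.2f}")
print(f"log10(atoms in universe):  {log10_atoms}")
```

Under these assumptions the exponent is on the order of 15 million, so the count of possible next frames is about 10 raised to the power of 15 million, compared with fewer than 10^5 possible next tokens for GPT-2.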

Joint Embedding in Representation Learning 16:19

"Joint embeddings offer a potentially viable solution to our blurry video problem."

The concept of joint embedding networks emerges as a promising technique to tackle the analytic challenges of video generation.
By using Siamese networks, which train on pairs of images to learn similar embeddings for semantically identical samples, researchers can sidestep the blurry video problem now affecting generative models.
This approach enables the extraction of valuable internal representations from images or videos that can be repurposed for various tasks, analogous to GPT models during their pre-training phase.

Challenges of Representation Collapse 17:54

"Representation collapse occurs when the network simplifies all inputs to a trivial output."

A critical challenge faced by joint embedding systems is representation collapse, where a network reduces all inputs to identical embedding vectors, failing to capture useful distinctions between different inputs.
Utilizing contrastive learning, which presents both positive and negative examples to the network, can overcome this issue by encouraging the model to generate distinct embeddings for different data points.
While effective, these contrastive methods can encounter scalability issues, requiring significant computation and data to yield meaningful internal representations.
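A minimal sketch of why collapse happens and how negatives counter it. The encoders, the jittered "augmentation", and the margin-based contrastive loss below are illustrative assumptions (a generic textbook-style formulation), not the specific methods discussed in the video.

```python
import math
import random


def collapsed_encoder(x):
    """A degenerate encoder: every input maps to the same vector."""
    return [1.0, 1.0]


def simple_encoder(x):
    """A hypothetical non-collapsed encoder: here just the identity."""
    return list(x)


def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def similarity_loss(enc, pairs):
    """Pull embeddings of two views of the same input together."""
    return sum(dist(enc(a), enc(b)) ** 2 for a, b in pairs) / len(pairs)


def contrastive_loss(enc, pos_pairs, neg_pairs, margin=1.0):
    """Similarity term plus a hinge that pushes different inputs apart."""
    push = sum(max(0.0, margin - dist(enc(a), enc(b))) ** 2 for a, b in neg_pairs)
    return similarity_loss(enc, pos_pairs) + push / len(neg_pairs)


random.seed(0)
inputs = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(200)]
# Positive pair: an input and a slightly jittered copy (stand-in for augmentation).
pos = [(x, [v + random.gauss(0, 0.05) for v in x]) for x in inputs]
# Negative pair: two different inputs.
neg = list(zip(inputs[:-1], inputs[1:]))

for name, enc in [("collapsed", collapsed_encoder), ("simple", simple_encoder)]:
    print(name,
          "similarity-only:", round(similarity_loss(enc, pos), 4),
          "contrastive:", round(contrastive_loss(enc, pos, neg), 4))
```

With the similarity term alone, the collapsed encoder achieves a perfect loss of zero, which is exactly the trivial solution. Adding negative pairs makes the collapsed encoder the worse of the two.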

The Problem with Generative Models for Self-Supervised Learning 19:13

"By the end of the 2010s, it was clear to LeCun and others that using generative models to fully reconstruct images and video was not a good strategy for self-supervised learning."

The number of contrastive samples can grow exponentially with the dimensionality of the representation, which complicates the learning process.
It became evident to researchers that relying on generative models for full reconstruction of data like images and video was ineffective for developing self-supervised learning methods.
The challenge was the "representation collapse problem," which hindered joint embedding architectures from achieving the robust and general representations that large language models were successfully generating.

Insights into Representation Collapse 19:50

"I had a bit of an epiphany because the methods that we were using to train those joint embedding architectures were kind of hacks."

The realization that existing methods for training joint embedding architectures were merely makeshift led to significant developments in the field.
A key breakthrough came from working with researchers, notably Stéphane Deny, who developed a technique called Barlow Twins, which builds on longstanding concepts in computational neuroscience.
Barlow's hypothesis from 1961 suggested that neurons in visual systems operate by minimizing redundant information, which became a foundational idea for improving joint embedding systems.

The Implementation of Barlow Twins 20:58

"Stéphane Deny proposed that one way to avoid representation collapse could be to apply Barlow's idea to the outputs of their networks."

Deny's approach focused on applying Barlow's principle to outputs generated by artificial neurons in embedding networks, so that the system learns without redundancy between those outputs.
In these joint embedding architectures, outputs from networks are compared, aiming for similarity between embeddings derived from distorted versions of the same input.
Traditional methods tended to lead to representation collapse by producing identical outputs regardless of input differences, which Deny's new method aimed to prevent by reducing redundancy between neuron outputs.

Measuring and Reducing Redundancy 22:58

"To measure the redundancy between neuron outputs, the team computed the cross-correlation between these output vectors."

The new loss function designed for the Barlow Twins architecture evaluates how much the resulting cross-correlation matrix deviates from the identity matrix. This allows the system to maintain high similarity among corresponding neurons while minimizing redundancy between different neurons.
The desired outcome is for the cross-correlation matrix to resemble an identity matrix, signifying effective representation learning with minimal overlap among neuron outputs.
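The loss described above can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name, the off-diagonal weight `lam`, and the toy embeddings are all hypothetical choices made for the example.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-8):
    """Redundancy-reduction loss on two batches of embeddings, shape (batch, dim)."""
    n = z_a.shape[0]
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (dim, dim).
    c = z_a.T @ z_b / n
    # Diagonal should be 1: matching dimensions agree across views.
    on_diag = np.sum((1.0 - np.diag(c)) ** 2)
    # Off-diagonal should be 0: different dimensions are not redundant.
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)
    return on_diag + lam * off_diag


rng = np.random.default_rng(0)
noise = lambda: 0.1 * rng.normal(size=(512, 8))  # stand-in for augmentation noise

z = rng.normal(size=(512, 8))                    # 8 independent features
good = barlow_twins_loss(z, z + noise())

feature = rng.normal(size=(512, 1))
copies = np.repeat(feature, 8, axis=1)           # 8 copies of one feature
redundant = barlow_twins_loss(copies, copies + noise())

constant = np.ones((512, 8))                     # fully collapsed embeddings
collapsed = barlow_twins_loss(constant, constant)

print(f"good: {good:.4f}  redundant: {redundant:.4f}  collapsed: {collapsed:.4f}")
```

In this toy setup, collapsed embeddings are penalized through the diagonal term and redundant embeddings through the off-diagonal term, while independent features that agree across views score lowest.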

Performance of Barlow Twins 24:45

"Their new method, which they called Barlow Twins, worked surprisingly well, avoiding representation collapse while learning a powerful internal representation."

Barlow Twins achieved impressive accuracy on the ImageNet dataset, a standard benchmark for image classification tasks, exceeding early supervised baselines.
The implementation utilized a linear probe approach to adapt the representations learned by Barlow Twins to specific image classification problems, demonstrating the effectiveness of self-supervised learning.
The frozen Barlow Twins encoder achieved an ImageNet accuracy of 73.2%, surpassing the original AlexNet model's performance by a notable margin.

Advances in Self-Supervised Learning 26:53

"By 2021, thanks to the Barlow Twins epiphany and other joint embedding approaches, self-supervised learning was advancing rapidly for vision tasks."

Although self-supervised methods were improving in vision-related tasks, they still lagged behind fully supervised models, such as those utilizing transformer architectures, which reached state-of-the-art image classification accuracy.
Research efforts like the DINO series of architectures began making strides in reducing the performance gap, indicating a pivotal shift toward more effective self-supervised learning in image processing without relying on labeled data.
The DINOv3 paper, released in August 2025, marked a significant milestone, achieving near state-of-the-art accuracy on image classification and demonstrating the capabilities of self-supervised models.

Yann LeCun's Vision for Autonomous Machine Intelligence 28:53

"LeCun argues that our current approaches to AI are nowhere near the capabilities of human learning."

Yann LeCun introduced a holistic first-principles approach towards building intelligent machines in his 2022 position paper, "A Path Towards Autonomous Machine Intelligence." This paper contrasts with his previous work that focused on specific machine learning theories and practices.
LeCun uses the example of a teenager who can learn to drive a car in approximately 20 hours, emphasizing that current AI systems, like Tesla's, are still far from reaching similar levels of learning efficiency, as they require millions of hours of training data.
He posits that the key missing component in modern AI is "world models," which can make predictions about the physical world and facilitate common sense reasoning within AI systems.

The Role of World Models in Skill Acquisition 30:02

"Common sense can be seen as a collection of models of the world that can tell an agent what is likely, what is plausible, and what is impossible."

LeCun suggests that animals can learn new skills with minimal trials because they possess world models that allow them to predict the consequences of their actions, leading to improved reasoning, planning, exploration, and creativity in problem-solving.
He believes that joint embedding predictive architectures, like JEPA, provide a foundational basis for developing these world models.

Joint Embedding Predictive Architecture (JEPA) Explained 30:40

"You take an observation in the world and then the next observation in the world, you run them through encoders."

JEPA operates by processing observations through encoders to predict the next state of the world based on previous states, conditioned on certain actions.
Unlike traditional generative architectures that predict pixel values in videos, JEPA simplifies the task by focusing on predicting embeddings, which represent salient features of the scene.
An example is provided where a JEPA model analyzes dashcam footage. The model prioritizes elements relevant to the task rather than random motion that is difficult to predict.
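The encode-then-predict structure described above can be sketched as a single forward pass. Everything here is a hypothetical stand-in: the toy dimensions, the linear `W_enc` and `W_pred` weights, and the function names are assumptions for illustration, not the architecture used in any JEPA paper. Training details, including how collapse is prevented, are deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, EMB_DIM = 64, 2, 8  # hypothetical toy sizes

# Hypothetical linear stand-ins for the encoder and predictor networks.
W_enc = rng.normal(scale=0.1, size=(OBS_DIM, EMB_DIM))
W_pred = rng.normal(scale=0.1, size=(EMB_DIM + ACT_DIM, EMB_DIM))


def encode(obs):
    """Map raw observations (batch, OBS_DIM) to embeddings (batch, EMB_DIM)."""
    return np.tanh(obs @ W_enc)


def predict_next(emb, action):
    """Predict the next-state embedding from the current embedding and action."""
    return np.concatenate([emb, action], axis=1) @ W_pred


def jepa_loss(obs_t, action_t, obs_next):
    """Prediction error measured in embedding space, not pixel space."""
    target = encode(obs_next)  # embedding of the actual next observation
    pred = predict_next(encode(obs_t), action_t)
    return np.mean(np.sum((pred - target) ** 2, axis=1))


# Toy batch: the "next observation" is a slightly perturbed current one.
obs_t = rng.normal(size=(16, OBS_DIM))
action_t = rng.normal(size=(16, ACT_DIM))
obs_next = obs_t + 0.1 * rng.normal(size=(16, OBS_DIM))

loss = jepa_loss(obs_t, action_t, obs_next)
print(f"embedding-space loss over {EMB_DIM} dims (vs {OBS_DIM} raw values): {loss:.4f}")
```

The point of the sketch is the shape of the objective: the error is computed between two 8-dimensional embeddings rather than between 64 raw observation values, so detail the encoder discards never has to be predicted.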

Advancements with VJEPA 2 and Robot Control 32:15

"In the VJEPA 2 paper... the team conditions a JEPA model on the action signals sent to a robot arm."

In VJEPA 2, the architecture is trained with sequences of images and control signals, enabling the model to learn how different control commands influence the position of a robotic arm in its environment.
By representing goal states and the effects of various actions in this world model, the robot can plan and execute tasks more effectively, such as moving a cup off a platform.
This innovative approach allows for classical optimal control techniques to apply, guided by predictions from learned models rather than fixed parameter settings.
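Planning with an action-conditioned predictor can be sketched with a simple random-shooting planner in a replanning loop. This is a stand-in chosen for brevity, since the summary only says "classical optimal control techniques"; the toy `predict_next` dynamics, the 2-D embedding, and all parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def predict_next(state, action):
    """Hypothetical learned predictor: a toy model where actions shift the embedding."""
    return state + 0.5 * action


def plan(state, goal, horizon=3, n_candidates=256):
    """Random shooting: sample action sequences, return the first action of the best."""
    seqs = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, state.shape[0]))
    best_cost, best_action = np.inf, None
    for seq in seqs:
        s, cost = state, 0.0
        for a in seq:
            s = predict_next(s, a)          # imagined rollout in embedding space
            cost += np.sum((s - goal) ** 2)  # distance to the goal embedding
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action


state = np.zeros(2)            # current embedding (toy)
goal = np.array([2.0, -1.0])   # goal embedding (toy)
start_dist = np.linalg.norm(state - goal)

for step in range(15):
    action = plan(state, goal)
    state = predict_next(state, action)  # stand-in for acting in the real world

final_dist = np.linalg.norm(state - goal)
print(f"distance to goal: start {start_dist:.3f}, after 15 steps {final_dist:.3f}")
```

The planner never predicts pixels: candidate actions are scored purely by how close the imagined embedding gets to the goal embedding, and only the first action of the best sequence is executed before replanning.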

The Importance of Predictive Abilities in Agentic Systems 35:10

"I do not understand how you can even think of building an agentic system without the ability to predict the consequences of its actions."

LeCun stresses that reliable agentic systems must predict the outcomes of their actions in order to plan effectively and operate safely.
He notes that models without a world model cannot plan proactively, and argues that modern AI must include predictive capabilities to accelerate the development of intelligent systems.
