Meta V-JEPA 2: The AI Model Bringing Common Sense to Robots

Meta's Video Joint Embedding Predictive Architecture 2 (V-JEPA 2) is a significant advance in Artificial Intelligence (AI) that helps robots understand and predict physical interactions. Trained on over one million hours of video, the model enables robots to anticipate what will happen next and to plan actions in new environments, allowing them to interact with unfamiliar objects more effectively.
V-JEPA 2 uses self-supervised learning: it learns directly from video data without requiring human annotations, which sets it apart from AI models that rely on labeled data. Robots can predict outcomes based on visual context, then adapt and plan actions as needed. This brings us closer to achieving Advanced Machine Intelligence (AMI).
Building on Meta’s Joint Embedding Predictive Architecture (JEPA), V-JEPA 2 enhances action prediction and world modeling, enabling robots to handle new tasks in unfamiliar settings. Meta is sharing this model with the research community to accelerate AI progress and improve robot capabilities.
Why Common Sense in Robots Has Always Been Hard
Common sense is the ability to make basic judgments about the physical world, for example, knowing that a cup will spill if tipped over or that a chair might block a path. For humans, this knowledge comes naturally through experience. Robots, however, struggle to develop the same intuition.
Most robots are programmed for specific tasks in controlled environments. They do well in these tasks. But when situations change or unexpected elements appear, robots struggle. They often fail to recognize cause and effect or predict the consequences of actions. For example, a robot may know how to place a cup on a flat surface. However, it may not foresee that tilting the cup could cause it to spill.
Current AI models, such as those based on Reinforcement Learning (RL), face limitations. RL requires a significant amount of trial-and-error learning, which makes the process slow and resource-intensive. Large Language Models (LLMs) excel at language but lack grounding in the physical world; they often hallucinate responses based solely on text, making them unreliable in dynamic situations. Traditional computer vision models are also limited: they are task-specific and fail to adapt to new or unexpected scenarios.
To address these issues, experts recommend utilizing world models. World models enable robots to simulate and predict future actions based on past experiences. These models help robots understand the world's physical dynamics. For example, predicting what will happen when an object is moved or when two objects collide. Meta's V-JEPA 2 is the first model to integrate these principles. It learns directly from raw video data. This makes it adaptable to real-world environments, allowing robots to reason and plan based on dynamic physical interactions.
Understanding V-JEPA 2
V-JEPA 2 is a self-supervised learning model created by Meta’s Fundamental AI Research (FAIR) team. Unlike traditional AI models that require labeled data, V-JEPA 2 learns from unlabeled video by predicting the missing parts of video sequences. This process is known as representation-level prediction. Instead of focusing on every pixel, V-JEPA 2 works with abstract representations that capture the key dynamics and relationships between objects and actions in the environment.
The model is built on Meta’s Joint Embedding Predictive Architecture (JEPA), designed to understand physical dynamics. It has two key components: an encoder, which processes raw video to create useful representations, and a predictor, which uses those representations to predict future events. V-JEPA 2 is trained on over one million hours of video, enabling it to learn complex patterns in the physical world. By learning from video, the model can predict future actions and interactions, improving how robots plan and make decisions.
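To make the encoder/predictor split concrete, here is a minimal, hypothetical PyTorch sketch of the two roles. The layer choices, dimensions, and class names are illustrative assumptions for exposition, not Meta's released architecture.

```python
# Sketch of the JEPA-style split: an encoder that maps video tokens to abstract
# representations, and a predictor that predicts representations of unseen content.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps tokenized video to abstract feature representations."""
    def __init__(self, patch_dim=1024, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU())

    def forward(self, tokens):            # tokens: (batch, num_tokens, patch_dim)
        return self.proj(tokens)          # -> (batch, num_tokens, embed_dim)

class Predictor(nn.Module):
    """Predicts representations of missing or future content from visible context."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, context_repr):      # context_repr: (batch, num_tokens, embed_dim)
        return self.net(context_repr)     # predicted representations, same shape

encoder, predictor = Encoder(), Predictor()
video_tokens = torch.randn(2, 64, 1024)           # dummy batch of tokenized video
predicted = predictor(encoder(video_tokens))      # representation-level prediction
print(predicted.shape)                            # torch.Size([2, 64, 256])
```

The key design point is that prediction happens in this learned representation space rather than in pixel space, which keeps the model focused on object- and motion-level structure.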
V-JEPA 2 helps robots perform zero-shot planning, meaning they can handle tasks in new environments without prior training. For example, robots can pick up objects and place them in new locations, even if they have never performed those tasks before. This makes V-JEPA 2 a significant improvement in action prediction and world modeling, making robots more adaptable to new situations.
The model learns from raw video data, enabling robots to predict future events. This makes robots more capable in real-world situations. V-JEPA 2 brings us closer to robots that can plan and execute tasks like humans. Meta is sharing V-JEPA 2 with the research community to accelerate AI progress. Robots using V-JEPA 2 can operate in dynamic environments, adapt quickly, and plan tasks more efficiently.
How V-JEPA 2 Operates: The Two-Stage Process
V-JEPA 2 works in two distinct stages. Each stage enables the model to learn from raw video data and subsequently apply this knowledge to make informed decisions in real-world tasks.
Stage 1: Action-Free Representation Learning
V-JEPA 2 starts with large-scale pre-training on over one million hours of video and one million images. The model learns by predicting missing parts of video sequences. It processes the video as 3D tubelets, which serve as the primary tokens for the model. The model employs a Vision Transformer (ViT) architecture with 3D Rotary Position Embeddings (3D-RoPE) to capture both spatial and temporal information more effectively.
The encoder processes the tubelets to create high-dimensional feature vectors. These vectors represent both the spatial and temporal dynamics of the video. The model uses a mask denoising objective, where large portions of the video are hidden. The model attempts to predict the hidden content by using the visible parts. An Exponential Moving Average (EMA) target encoder helps the model avoid trivial solutions and ensures stable learning. The loss function minimizes the L1 distance between the predictions and the EMA target encoder's output, focusing on higher-level concepts such as object permanence and motion, rather than pixel-level details.
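The training loop below is a minimal sketch of this Stage 1 objective under stated assumptions: the encoder and predictor are stand-in linear layers rather than the full ViT, and the mask ratio and EMA decay are illustrative values. It shows the essential mechanics, an L1 loss on masked positions computed against a gradient-free EMA target encoder, not Meta's actual training code.

```python
# Minimal sketch of the Stage 1 mask-denoising objective with an EMA target encoder.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, patch_dim = 256, 1024
encoder = nn.Linear(patch_dim, embed_dim)      # stand-in for the ViT encoder
predictor = nn.Linear(embed_dim, embed_dim)    # stand-in for the predictor
target_encoder = copy.deepcopy(encoder)        # EMA copy, never updated by gradients
for p in target_encoder.parameters():
    p.requires_grad = False

def training_step(tubelets, mask, optimizer, ema_decay=0.999):
    """tubelets: (batch, tokens, patch_dim); mask: (batch, tokens) bool, True = hidden."""
    context = tubelets * (~mask).unsqueeze(-1)       # zero out the hidden tokens
    pred = predictor(encoder(context))               # predict representations for all tokens
    with torch.no_grad():
        target = target_encoder(tubelets)            # targets come from the full, unmasked video
    # Representation-level L1 loss, computed only on the masked positions (not pixels)
    loss = F.l1_loss(pred[mask], target[mask])
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    # EMA update keeps the target encoder a slow-moving average of the online encoder,
    # which helps avoid trivial (collapsed) solutions
    with torch.no_grad():
        for p_t, p_o in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.mul_(ema_decay).add_(p_o, alpha=1 - ema_decay)
    return loss.item()

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)
tubelets = torch.randn(2, 64, patch_dim)
mask = torch.rand(2, 64) < 0.75                      # hide roughly 75% of tokens
print(training_step(tubelets, mask, optimizer))
```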
Stage 2: Action-Conditioned Planning and Control
In the second stage, the model shifts to action-conditioned training. The encoder weights are frozen, and a new predictor is trained on data from robot interactions: video observations paired with the corresponding control actions, drawn from the DROID dataset (about 62 hours of robot data). The model can now predict the future state of an environment based on both the current state and candidate actions.
V-JEPA 2 sets up a goal-conditioned energy minimization problem. It encodes both the current observation and a goal image into feature maps. The model then predicts how the state will change with different action sequences. The optimal action sequence is found by minimizing the L1 distance between the predicted future state and the goal representation. The Cross-Entropy Method (CEM) is used for trajectory optimization.
Only the first action of the optimal sequence is carried out, and the process is repeated in a receding horizon control loop. This enables real-time planning and adaptation. By utilizing 3D tubelet processing, V-JEPA 2 captures both spatial and temporal dependencies, which allows robots to reason about motion, object interactions, and the consequences of their actions in complex environments. This enables zero-shot planning and control, even in new scenarios, without the need for task-specific demonstrations or reward engineering.
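As a rough illustration of this planning loop, the hypothetical sketch below samples candidate action sequences with the Cross-Entropy Method, scores each by the L1 distance between the predicted future representation and the goal representation, refits the sampling distribution to the best candidates, and returns only the first action for receding horizon control. The encode and predict functions are dummy stand-ins for the frozen encoder and action-conditioned predictor, and all sizes and iteration counts are assumptions.

```python
# Sketch of goal-conditioned planning via the Cross-Entropy Method (CEM).
import torch

def cem_plan(encode, predict, obs, goal_img, horizon=5, action_dim=7,
             n_samples=256, n_elites=32, n_iters=4):
    z0, z_goal = encode(obs), encode(goal_img)       # current and goal representations
    mean = torch.zeros(horizon, action_dim)          # sampling distribution over action sequences
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        z = z0.expand(n_samples, -1)
        for t in range(horizon):                     # roll the predictor forward in time
            z = predict(z, actions[:, t])
        energy = (z - z_goal).abs().mean(dim=-1)     # L1 distance to the goal representation
        elites = actions[energy.topk(n_elites, largest=False).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6
    return mean[0]                                   # receding horizon: execute only the first action

# Dummy stand-ins so the sketch runs end to end.
embed_dim = 256
def encode(x): return x.mean(dim=-2)                         # fake encoder: average the tokens
def predict(z, a): return z + 0.01 * a.sum(dim=-1, keepdim=True)  # fake action-conditioned predictor

obs = torch.randn(16, embed_dim)       # fake tokenized current observation
goal = torch.randn(16, embed_dim)      # fake tokenized goal image
first_action = cem_plan(encode, predict, obs, goal)
print(first_action.shape)              # torch.Size([7])
```

After executing that first action, the robot re-encodes the new observation and replans, which is what makes the loop robust to prediction errors and changing surroundings.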
Applications of V-JEPA 2 in Robotics
V-JEPA 2 is changing the way robots interact with the world. Many applications are still being developed, but the model has demonstrated strong capabilities in controlled environments.
Pick-and-Place Manipulation
In lab settings, V-JEPA 2 has enabled robots to perform pick-and-place tasks with minimal training. Using only 62 hours of data from the DROID dataset, robots can manipulate various objects, including both rigid and deformable ones. This ability is crucial in fields such as logistics, manufacturing, and home robotics, where objects vary significantly in size and complexity.
Navigation in Dynamic Environments
V-JEPA 2 can model temporal dynamics, which makes it useful for real-time navigation in environments with moving people, animals, or obstacles. While it has not yet been used in autonomous vehicles or drones, its predictive abilities can help robots anticipate changes and adjust their paths. This is key for safety and efficiency in busy environments.
Human-Robot Interaction
By learning to predict human actions, V-JEPA 2 can improve human-robot collaboration. Robots can respond more naturally and safely in shared spaces, such as hospitals, homes, or industrial floors. Though still in progress, this ability represents a step toward socially aware robots that can adapt to their surroundings.
Generalization and Zero-Shot Planning
V-JEPA 2 can generalize across tasks and environments. Robots can utilize learned representations in new situations without requiring additional training. This zero-shot planning enables robots to quickly adapt to new tasks, thereby reducing the need for new data collection or retraining.
Real-Time Decision-Making and Efficiency
With its efficient design, V-JEPA 2 supports real-time planning and control. Meta reports that V-JEPA 2 is 30x faster than Nvidia's Cosmos model in some benchmarks. This speed is essential for tasks needing fast decisions, such as robotic manipulation or navigation in changing environments.
Practical Challenges and Limitations
Though V-JEPA 2 has made significant progress in self-supervised learning and robotic planning, there are still challenges to address before it can be widely deployed. Here are the key limitations:
Reliance on Visual Data Alone
V-JEPA 2 is trained solely on video and image data. This makes it effective for visual tasks, but limits its ability to perform multi-sensory tasks, such as tactile manipulation or using auditory cues. Real-world robots rely on multiple sensory inputs.
Sensitivity to Camera Position and Calibration
The model relies on monocular RGB input, which can degrade performance if the robot's base or reference frame is not visible. Manual adjustments to camera setups may be needed to ensure consistent performance.
Limitations in Long-Term and Multi-Step Planning
V-JEPA 2 performs well with short-horizon tasks but struggles with long-term planning. The accumulation of errors in predictions and the expansion of action spaces make complex, multi-step operations difficult.
High Computational Demands
While faster than models like Nvidia's Cosmos, V-JEPA 2 has over 1.2 billion parameters. This requires significant computational resources, which may pose a challenge for smaller labs or organizations with limited infrastructure.
Generalization in Unstructured Environments
V-JEPA 2 performs well in controlled settings but may face issues in unfamiliar or unstructured environments. Its success rate in pick-and-place tasks is around 80%, but it may fail in edge cases.
Integration with Full Robotic Stacks
To be useful, V-JEPA 2 must integrate with motor controllers, real-time sensors, and task planners. Achieving smooth interoperability in dynamic environments remains a challenge.
Ethical Considerations and Bias
Like all large models, V-JEPA 2 may inherit biases from its training data. In real-world applications, particularly involving human interaction, these biases could lead to unintended outcomes. Ethical oversight is essential.
The Bottom Line
V-JEPA 2 represents a significant advance in AI and robotics. It enables robots to understand and interact with the physical world much as humans do. While the model has demonstrated strong performance in predicting actions, modeling the world, and planning without prior training, it still faces several challenges.
V-JEPA 2 relies on visual data and has some limitations in multi-sensory tasks, long-term planning, and integration with complete robotic systems. However, its ability to make real-time decisions and adapt to new environments makes it highly useful for complex real-world situations.
Meta is continuing to refine V-JEPA 2, which will contribute to advancing AI and making robots smarter. This progress will be valuable for industries such as healthcare, logistics, and autonomous vehicles. V-JEPA 2 has great potential and will play a critical role in the future of robotics.