We Taught Robots to Move. Now We Are Teaching Them to Live

Modern robotics has reached a point where movement is no longer the main challenge – machines can already navigate, grasp, and operate in space with impressive precision. Yet enabling them to truly “live” and function in the real world remains an unsolved problem.
The key role in this process belongs to what could be called the robot’s “spinal cord”: the system responsible for basic reactions, behavior, and interaction with the environment.
When you look at the evolution of robots through this lens, it becomes clear that this sequence of stages – where the system learns something new at each step, from simple movement to complex, context-aware actions – closely resembles human development.
And it is precisely within this evolution – from “empty” hardware to meaningful behavior – that the main shift in physical AI is happening today. It is worth examining this shift more closely.
The foundation of robotics: a stage rarely discussed
What is a robot in practical terms? It is a physical device initially created as a universal platform. In essence, it is a “blank” that must then be adapted to specific tasks, trained to operate in a given environment, and taught to perform the required actions.
If we move beyond everyday scenarios and consider more realistic near-future applications, it becomes clear that the full adoption of robots will primarily occur in industrial and potentially hazardous environments. This, in turn, implies significantly higher requirements for their behavior, robustness, and training quality.
The process begins with the most basic step – building the device itself. A robot is assembled from multiple components, including actuators, motors, sensors, cameras, and LiDARs. It can be humanoid, wheeled, bipedal, or quadrupedal – the form factor is secondary. What matters is that, at this stage, we end up with a functioning but still “empty” device.
The next stage is installing a base model that serves as the foundation for its behavior. In a broad sense, the “model” is the entire functional control layer. It is responsible for core capabilities: maintaining balance, standing and moving, navigating from point A to point B, avoiding obstacles, not damaging the environment, and safely interacting with humans.
This is where reinforcement learning comes into play. Training such a system involves running billions of simulated episodes. We often see videos of robots “learning” in complex environments: most of them fall, lose balance, or fail to complete the task. But those that manage to stay upright and keep moving are the ones that progress.
This is the essence of reinforcement learning: selecting successful behavior. The algorithms of those that “survive” become the basis for the next iterations. As a result, after an enormous number of runs, a model emerges that can confidently handle obstacles. This algorithm is then transferred to the physical device.
It is a low-level yet critically important stage – often involving little to no computer vision, which is not required at this point. What we are dealing with here is fundamental physics and mechanics that must be embedded into the system from the very beginning.
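The selection loop described above can be sketched in a few lines of Python. Everything here is a toy stand-in: run_episode replaces a real physics simulator with a single “gain” parameter and a hidden optimum, and the “survivors” of each generation simply seed the next one with slightly perturbed copies of themselves.

```python
import random

random.seed(7)  # make the toy run reproducible

def run_episode(policy, steps=100):
    """Toy stand-in for a physics simulation: the closer the policy's
    'gain' parameter is to a hidden optimum, the longer the simulated
    robot stays upright. Purely illustrative, not a real simulator."""
    optimum = 0.7
    upright_steps = 0
    for _ in range(steps):
        # Chance of staying upright this step falls with distance from the optimum.
        if random.random() < max(0.0, 1.0 - abs(policy["gain"] - optimum)):
            upright_steps += 1
        else:
            break  # the robot "falls" and the episode ends
    return upright_steps

def select_and_iterate(generations=30, population=50):
    # Start from random policies: most of them "fall" almost immediately.
    policies = [{"gain": random.uniform(0.0, 2.0)} for _ in range(population)]
    for _ in range(generations):
        ranked = sorted(policies, key=run_episode, reverse=True)
        survivors = ranked[: population // 5]  # keep the ones that stayed upright longest
        # Next iteration: slightly perturbed copies of the survivors.
        policies = [
            {"gain": s["gain"] + random.gauss(0.0, 0.05)}
            for s in random.choices(survivors, k=population)
        ]
    return max(policies, key=run_episode)

best = select_and_iterate()
```

After enough generations, the surviving policies cluster around the optimum – the same dynamic, at vastly larger scale, that produces a base model able to handle obstacles.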
How robots begin to “feel” the world
So, we already have the “hardware” – a robot with a base model installed: it can stand, walk, and maintain balance. But is this enough for real-world tasks, for example, in industrial environments? Clearly not.
The next level begins here. We integrate sensors and train the model to act based on sensory input. A new layer of core skills emerges – already far more complex than simple movement.
An analogy with human development is useful here. At the first stage, we brought the system to roughly the level of a one-year-old child: it can stand, take its first steps, and keep balance without falling. The next step is more in line with an eight-year-old’s level.
At this age, a child actively uses their “sensors”: they can perceive risk and evaluate the consequences of their actions. They understand not to touch something hot or put something very cold in their mouth. They can climb onto a table, ride a bicycle, and interact with objects. They are capable of grasping, carrying, and manipulating items and performing basic self-care actions.
We call this stage pretraining. And at this point, simulations alone are no longer sufficient.
Yes, some scenarios can still be effectively modeled: how to pick up a glass, or how to replace a battery – removing one component, placing it on charge, taking another, and installing it.
But overall, the balance shifts: around 80% of training can still happen in simulation, while about 20% of the data must come from the real world. And this is where we begin discussing egocentric data.
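As a rough sketch of what such a split might look like in a data pipeline – the function name and the exact 20% fraction are illustrative, not a fixed industry standard:

```python
import random

random.seed(0)  # reproducible sampling for the example

def mix_batch(sim_pool, real_pool, batch_size=10, real_fraction=0.2):
    """Compose one training batch that is mostly simulated samples,
    topped up with real-world (egocentric) ones. Illustrative only."""
    n_real = round(batch_size * real_fraction)
    batch = random.sample(sim_pool, batch_size - n_real) + random.sample(real_pool, n_real)
    random.shuffle(batch)  # avoid a fixed sim-then-real ordering
    return batch

sim = [f"sim_{i}" for i in range(100)]    # cheap, abundant simulated samples
real = [f"real_{i}" for i in range(25)]   # scarce, expensive real-world samples
batch = mix_batch(sim, real)
```

The point of the sketch is the asymmetry: simulated data is abundant and cheap to generate, while the real-world slice is small, expensive, and irreplaceable.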
Egocentric data as the foundation of environmental understanding
Today, egocentric data is being collected at a massive scale worldwide – because without it, it’s impossible to move from basic mechanics to meaningful interaction with the real world. A colleague of mine, who runs a network of auto repair shops, has employees using head-mounted cameras to record the entire car repair process. A building owner in New York City has implemented a similar approach: cleaning staff wear forehead-mounted cameras that capture how they vacuum spaces and maintain sanitary areas.
Over time, these recordings become a standalone product – they are packaged and sold. Their key value lies in their suitability for the pretraining stage, helping build a foundational understanding of environments and sequences of actions.
For example, such a service existed at Keymakr, where the team independently created entire collections of egocentric data, covering everything from simple scenarios like washing dishes to far more complex ones.
Why is this so important? Because such data provides something that pure simulation cannot – the diversity of real-world environments. Offices, auto repair shops, construction sites, restaurants, and hotels – each of these adds its own context, scenarios, and nuances. Together, they form a dataset that allows a system not just to “see,” but to gradually begin understanding the dynamics of the real world.
At this stage, the goal is no longer to teach a robot to perfectly execute a specific action. What matters more is enabling it to orient itself within its surroundings in the first place.
Today, nearly all companies working in robotics – from Tesla to Unitree Robotics and Figure AI – are focused on this exact stage. Their goal is to build a base model whose capabilities first resemble those of an “eight-year-old child,” and then progress toward a “twelve-year-old.” This is also what we focus on at Introspector – preparing the data required for pretraining, the most critical phase in the “coming of age” of modern robotics.
The last mile of training: where universality ends, and specialization begins
Let’s imagine a robot has already completed pretraining and ships with a basic understanding of the world and a skill set comparable to that of a teenager. But even this is not enough for real business use cases. Companies don’t need just a “general-purpose” robot – they need a specialist.
Take automotive manufacturing as an example. Some tasks are still performed by humans because they require sensitivity, precision, and continuous visual control. Traditional automation struggles here. Industrial manipulators excel at repetitive, rigid tasks – “pick, move, place.” But tasks that require adaptability, pressure sensing, and real-time adjustments remain in the human domain.
This is where a new demand emerges: to train a robot to perform a specific operation exactly as a skilled worker does on a production line. In other words, after base training comes the next level: training for a specific profession and scenario.
At this point, a practical question arises: what exactly is required for this level of training? If we want a robot to replicate human performance, we need to capture that human behavior as precisely as possible. For example, the specialist on the factory floor would need to wear a camera and, over an extended period – months or even a year – record how they perform the task.
What it takes for robots to “live” in the human world
A camera alone is not enough. It’s necessary to capture not only the visual perspective but also the physics of movement. This is done using specialized gloves with tactile sensors that measure pressure, applied force, and the nature of interaction with objects. This is especially important because the objects themselves can vary significantly. For instance, sealing strips may differ in stiffness by car model, which directly affects how the task is performed.
Next comes kinematic tracking. Markers – visual or sensor-based – are placed on the wrists, elbows, and sometimes shoulders. These can include, for example, bracelets with identifiable markers (similar to QR codes) that allow the system to track hand position in space from video. Additional sensors, such as gyroscopes, are used to capture joint movements.
The final goal is to fully reconstruct the mechanics of motion: how the shoulder moves, how the elbow bends, how the wrist rotates. All of this becomes essential for the next stage – post-training.
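One way to picture the result of such a capture session is a per-time-step record that combines all the modalities above – camera frame, glove pressure, marker positions, and gyroscope readings. The schema below is hypothetical: field names, units, and structure are illustrative, not taken from any specific capture rig.

```python
from dataclasses import dataclass

@dataclass
class CaptureFrame:
    """One time step of a multimodal recording session.
    All field names and units are illustrative assumptions."""
    timestamp_ms: int
    video_frame: str                       # reference to the egocentric camera frame
    glove_pressure: dict                   # per-fingertip pressure, e.g. in newtons
    marker_positions: dict                 # wrist/elbow/shoulder -> (x, y, z) in metres
    gyro_rad_s: tuple                      # angular velocity of a wrist-mounted IMU

frame = CaptureFrame(
    timestamp_ms=1530,
    video_frame="frame_001530.jpg",
    glove_pressure={"thumb": 2.4, "index": 1.9},
    marker_positions={"wrist": (0.41, 1.02, 0.87), "elbow": (0.38, 1.21, 0.66)},
    gyro_rad_s=(0.02, -0.15, 0.31),
)
```

A full recording is then a long sequence of such frames – enough, in aggregate, to reconstruct how the shoulder, elbow, and wrist move through an entire task.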
While pretraining could still partially rely on simulation, at this stage simulation no longer works. This “last mile” is almost impossible to model accurately. You cannot fully simulate, for example, how a chef rolls out dough – the force applied, how pressure is distributed, how the material is felt.
That’s why, during post-training, nearly all data must come from the real world. And this is where it becomes clear: the main challenge shifts into the practical domain – how to obtain such data in reality. Collecting egocentric data at this level is a complex, multi-step process that involves access to environments, specialized equipment, participation by skilled workers, and subsequent data preparation.
Beyond theory, this is where robots truly “come to life” – once we manage to organize this process, overcome the constraints teams face across industries, and annotate such datasets at scale. The next part will take a closer look at the challenges that arise during the labeling and preparation of this data.