The True Cost of Training Robots

In the first part, we discussed how robots evolve from basic mechanics to understanding their environment. At the “last mile” stage – when robots undergo post-training for specific, custom tasks – an unexpected barrier emerges. It is tied to data: its collection, organization, and scaling in real-world conditions.

It is precisely at this stage that the gap between concept and implementation becomes most apparent. What are the key bottlenecks, and how can they be overcome with minimal friction?

Why thousands of hours of data turn into years of work

Let’s imagine we already have a robot that has undergone pretraining. It can navigate its surroundings, move, avoid obstacles, and interact with objects. It’s like a “ten-year-old child” who is generally capable of acting independently. The next step is to teach it to perform specific actions under specific conditions, for example, installing glass panels and sealing strips on an automotive production line.

At first glance, the task seems simpler. It involves mastering a single scenario, and the volume of data required is significantly smaller than during pretraining. While foundational training may require hundreds of thousands of hours, post-training might take only thousands. But these numbers are misleading.

When translated into real time, the process reveals its true complexity. Under a standard work schedule, a person works about 160 hours per month. However, this does not mean all that time can be used for recording.

In practice, disruptions are constant: batteries run out, cameras shift, sensors fail. The more complex the equipment setup, the higher the likelihood of issues. Even a simple failure, such as the sensors on a glove cutting out, can halt the process and result in lost time.

As a result, the actual data-collection speed is 2–3 times lower than the schedule suggests: one hour of high-quality recording can require up to three hours of real work. This radically changes the calculation. At the upper end, 5,000 hours of data translates into roughly 15,000 hours of labor, or about 94 person-months at 160 hours per month.
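The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `labor_estimate` and its parameter names are assumptions made for this example, while the 160 hours per month and the factor of three (the upper end of the 2 to 3 times slowdown) come from the text.

```python
def labor_estimate(data_hours, overhead_factor=3.0, hours_per_month=160.0):
    """Estimate the labor needed to record `data_hours` of usable data.

    overhead_factor: real working hours spent per hour of usable recording
                     (assumed here to be the upper end of the 2-3x range).
    hours_per_month: working hours one person contributes per month.
    Returns (labor_hours, person_months).
    """
    labor_hours = data_hours * overhead_factor
    person_months = labor_hours / hours_per_month
    return labor_hours, person_months


labor_hours, person_months = labor_estimate(5_000)
print(f"{labor_hours:,.0f} labor hours, about {person_months:.0f} person-months")
```

Lowering `overhead_factor` to 2.0 gives the optimistic end of the same range, which is still twice the nominal recording time.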

Layers upon layers of complexity

During pretraining, it may be enough to give a person a camera and ask them to record everyday activities. At this stage, however, access to a specific environment is required, such as a factory, a construction site, or a specialized production facility.

This immediately introduces practical constraints. For example, on a construction site, workers are required to wear safety helmets, meaning specialized equipment must be developed: helmets with integrated cameras that are resistant to dust, moisture, and impact.

Then comes access to the site itself. Agreements must be made with site owners, permissions obtained, and conditions negotiated. This almost always involves additional costs: companies expect compensation, and workers expect to be paid for participation.

Insurance and safety compliance also become critical concerns. If the equipment does not meet required standards, insurance may be voided, forcing the entire process to be restructured.

Even at the level of daily operations, challenges persist. Cameras must be turned on, monitored, and maintained. Workers operate in gloves and harsh conditions. Equipment gets dirty, wears out, and breaks down. A camera may shut off after a few minutes, and the person may not even notice.

This creates the need to train participants: they must understand how to use the equipment. Continuous supervision is also required, because someone must ensure that recording is ongoing and that devices are functioning properly.

From raw video to training data

After recording, the next stage begins: gathering the recordings, uploading them, structuring the data, validating its quality, and labeling it.

The raw data consists of video and sensor signals. To turn it into training material, it must be structured: objects need to be identified, actions captured, and states, movements, and interactions with the environment described. This is where annotation comes in. A logical question arises: what is the gold standard for such an annotation workflow?

In some cases, simple bounding boxes are enough to identify objects in a frame. In others, temporal annotation is required to describe sequences of actions over time. In certain scenarios, keypoints and skeletal models are used to capture body movement. In more complex cases, 3D meshes or hand pose tracking are needed to accurately represent interaction mechanics. Additional sensors, such as accelerometers, are often integrated to capture motion dynamics and applied force.
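To make these annotation layers concrete, here is a minimal sketch of how a single annotated clip might be represented. All class and field names (`BoundingBox`, `ActionSegment`, `ClipAnnotation`, and so on) are hypothetical and chosen only for illustration; they do not describe any particular tool's schema, and real projects may add 3D meshes, hand pose, or sensor streams as further layers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BoundingBox:
    """Object location in one frame (normalized image coordinates)."""
    label: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class ActionSegment:
    """Temporal annotation: one action spanning a time interval."""
    label: str
    start_s: float
    end_s: float

    def duration(self) -> float:
        return self.end_s - self.start_s


@dataclass
class ClipAnnotation:
    """All annotation layers attached to one recorded clip (hypothetical)."""
    clip_id: str
    boxes: List[BoundingBox] = field(default_factory=list)
    actions: List[ActionSegment] = field(default_factory=list)
    # Keypoint name -> (x, y) position, e.g. for a skeletal model.
    keypoints: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def layers_present(self) -> List[str]:
        """Names of the layers that actually contain annotations."""
        layers = {"boxes": self.boxes, "actions": self.actions,
                  "keypoints": self.keypoints}
        return [name for name, content in layers.items() if content]


clip = ClipAnnotation(
    clip_id="demo-001",
    boxes=[BoundingBox("glass", 0.4, 0.5, 0.1, 0.2)],
    actions=[ActionSegment("fill", 2.0, 6.5)],
)
print(clip.layers_present())
```

Keeping each layer optional reflects the point made above: some tasks need only boxes, while others need temporal or keypoint layers as well.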

Projects like these also often require scaling the team. Labeling is a large and complex task in itself, demanding time, expertise, and substantial human resources. This is the role of data solution providers with in-house annotation teams, such as Keymakr, which has proven particularly effective thanks to its ability to scale teams to match any data volume, from a single specialist to hundreds of annotators.

There’s no right approach to training yet

The industry is still in an exploratory phase, as there is no consensus on which data combination yields the best results. Many approaches are validated empirically because they work in specific experiments. As a result, different teams continue to rely on different technologies, shaped by their own experience, tasks, and constraints.

At both academic and applied levels, this leads to fragmentation: labs and companies are moving in different directions. The situation is reminiscent of the early days of autonomous driving when Tesla bet on a vision-only approach without LiDAR, while most other players chose LiDAR as a core sensor.

Today, LiDAR-based systems tend to demonstrate more stable performance, yet Tesla’s approach continues to evolve. The difference is that in autonomous driving, the market has largely matured: stable architectures have emerged, limitations are well understood, and significant expertise has been accumulated.

In contrast, for Physical AI and similar model training, this level of maturity has not yet been reached. The market is still forming, standards are lacking, and much of the progress is driven by experimentation. New methods for training models, improving efficiency, and adapting to real-world scenarios continue to emerge, suggesting that the most important breakthroughs in this field are still ahead.

The human as a reinforcement system

Labeling does not exist in isolation, nor for the model alone. It serves as a tool for the engineer building that model. Through it, they formalize reality, identify key parameters, and define the system’s behavioral rules.

The engineer’s task is to teach the system to perform actions correctly in real-world conditions. For example, a basic scenario may consist of four actions: pick up a glass, turn on the tap, fill it, and turn the tap off. But in reality, a deviation occurs – the glass overflows.

At that moment, the model is expected not only to complete the scenario but also to take corrective actions: stop the water flow, adjust the water level, and prevent spillage. This is behavioral logic based on contextual understanding.

The engineer follows a cycle: annotate data, train the model, test it. If the system works, the hypothesis is confirmed. If not, the analysis begins.

At some point, it may become clear that the model is missing an important parameter, such as the glass’s fill level. Previously, the data may have included annotations for objects (glass, tap, handle) and actions (opening, filling, closing), but lacked annotations for state, such as the degree of fullness.

A new layer is then added to the process: annotating the fill level, followed by formalization, for instance, defining anything above 85% as a critical state.
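As a rough illustration of this added state layer, the sketch below attaches a fill level to each annotated frame and flags the critical state. The 85% threshold is the example given in the text; the names `FrameState`, `is_critical`, and `first_critical` are hypothetical, as are the sample values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

CRITICAL_FILL = 0.85  # example threshold from the text: above 85% is critical


@dataclass
class FrameState:
    """State annotation for one frame (hypothetical schema)."""
    timestamp_s: float
    action: str        # e.g. "filling"
    fill_level: float  # 0.0 = empty glass, 1.0 = full glass

    def is_critical(self) -> bool:
        return self.fill_level > CRITICAL_FILL


def first_critical(frames: Iterable[FrameState]) -> Optional[FrameState]:
    """Return the earliest frame whose fill level is critical, if any."""
    for frame in frames:
        if frame.is_critical():
            return frame
    return None


frames = [
    FrameState(0.0, "filling", 0.10),
    FrameState(2.0, "filling", 0.60),
    FrameState(4.0, "filling", 0.90),
]
hit = first_critical(frames)
print(hit.timestamp_s if hit else "no critical state")
```

Once the state is explicit in the data, the overflow case becomes something the model can be trained and tested on, rather than an unlabeled deviation.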

This leads to the next iteration of training, and there can be hundreds of such iterations.

No one assumes the system will work correctly immediately. On the contrary, the process is built around successive approximations: first, a baseline version is created; then it is tested in real or near-real conditions; gaps are identified; and the system is refined. This is something I often discuss with clients at Introspector, with whom we go through the entire Physical AI journey together.

At a certain point, the desired result is achieved. But its value lies not only in the system beginning to work, but in the accumulated experience that allows this result to be reproduced more predictably.

The economics everyone forgets

Over the past year or so, I’ve noticed that the biggest mistake companies make when working with egocentric data has little to do with technology.

The core problem is underestimating project economics.

At the idea stage, tech takes center stage – what models to use, how to train them, and which approaches to apply. You study, research, discuss architectures, and test hypotheses. This is natural: technology feels like the most tangible and obvious part of the problem.

But far less often at this stage do teams ask a direct and practical question: how much will it cost?

When a project moves from theory to implementation, it becomes clear that behind every model are tens of thousands of hours of data. Collecting this data requires time, access to real environments, and the involvement of specialists. Labeling adds yet another layer of complexity and cost. As a result, the final numbers are often orders of magnitude higher than initially expected.

This does not mean such projects shouldn’t be pursued. On the contrary, they are what drive the industry forward.

But what matters is understanding the scale of the challenge from the very beginning, and recognizing that in model training, behind every impressive algorithm lies complex, resource-intensive data work.

Even strong ideas fail to reach full implementation when data costs start to mount well above seven figures.

And perhaps the most important shift happening in robotics today is tied to this realization. The future of these systems will be defined not only by how “intelligent” they are but also by how effectively and precisely the entire data pipeline is built, from data collection to final interpretation.

Michael Abramov is the founder & CEO of Introspector, bringing over 15 years of software engineering and computer vision AI systems experience to building enterprise-grade labelling tools.

Michael began his career as a software engineer and R&D manager, building scalable data systems and managing cross-functional engineering teams. Until 2025, he served as the CEO of Keymakr, a data labelling service company, where he pioneered human-in-the-loop workflows, advanced QA systems, and bespoke tooling to support large-scale computer vision and autonomy data needs.

He holds a B.Sc. in Computer Science and has a background in engineering and creative arts, bringing a multidisciplinary lens to solving hard problems. Michael works at the intersection of technology innovation, strategic product leadership, and real-world impact, driving forward the next frontier of autonomous systems and intelligent automation.