
Victor Erukhimov, CEO of CraftStory – Interview Series

Victor Erukhimov, CEO of CraftStory, is a computer-vision R&D engineer turned entrepreneur who helped shape the early evolution of OpenCV, later co-founding Itseez and guiding it from a technical startup into one of the world’s leading computer-vision research teams before its acquisition by Intel. Over more than a decade, he progressed from CTO to CEO to President, and continued that trajectory at Itseez3D, where he led the development of advanced mobile 3D-scanning and avatar-generation technologies while also serving as a long-time board member of OpenCV.org.

At CraftStory, he now focuses on AI-native video creation, building technology that transforms simple inputs into highly realistic, creator-ready videos. Under his leadership, the company is developing next-generation generative video models designed for marketing teams, educators, and product storytellers who need fast, high-quality content without studio overhead.

You’ve been a driving force behind some of the most influential computer vision projects—from OpenCV to Itseez3D. What inspired you to found CraftStory, and how did your past work shape the vision for long-form, studio-quality AI video?

Before CraftStory, my team and I were working on Avatar SDK—a tool that creates realistic avatars from selfies for VR/AR, gaming, marketing, and other applications. We’d already been thinking deeply about digital humans for several years. Then, about two years ago, we realized that GenAI technology for video generation was getting good enough to unlock an entirely new wave of applications, and we jumped right in.

CraftStory launched with the creators of OpenCV at its core. How did that shared background influence the technical direction and research priorities for Model 2.0?

We’re living in a period of extraordinary progress in computer vision and machine learning. It feels like all the breakthroughs of early quantum mechanics—originally spread across decades—have been compressed into just a few years. Image understanding and generation have advanced far beyond what we were working with when developing OpenCV. Having observed this evolution for more than a decade, making predictions and seeing them succeed or fail, we’ve gained a deep intuition for where the technology and the market are heading. That perspective directly shaped our research priorities and the roadmap for Model 2.0.

Model 2.0 tackles something many video models struggle with: maintaining identity, emotion, and consistency across minutes of footage. What breakthroughs made this possible?

Identity and consistency have been our priorities from day one. Several architectural choices in the network were specifically designed to address these challenges. But equally important was fine-tuning the model on data we collected ourselves. We filmed professional actors in a controlled studio environment using our own high-frame-rate cameras to ensure that every frame, including fast movements of the body, hands, and fingers, remained sharp. That level of high-quality, motion-rich data made a significant difference.

Your team introduced a parallelized diffusion pipeline to keep long sequences coherent. What problem was this designed to solve, and why was it essential for multi-minute human video?

Running a single diffusion process over a long sequence of frames is extremely challenging: it's computationally expensive and demands a massive amount of training data. Our parallelized diffusion pipeline solves this by running multiple diffusion processes on different time segments simultaneously. The key breakthrough was figuring out how to connect these segments so they stay coherent and consistent over long durations. Model 2.0 can now generate videos up to five minutes long, but that limit is mainly a technical constraint; with more engineering work, we can extend it to videos of essentially arbitrary length.
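To make the general idea concrete, here is a minimal, hypothetical sketch of how overlapping time segments might be denoised in parallel and cross-faded where they meet so a long sequence stays coherent. The denoise_step function, segment length, and overlap size below are illustrative assumptions, not details of CraftStory's Model 2.0.

```python
import numpy as np

# Illustrative sketch only: one generic way to run diffusion over overlapping
# temporal segments and blend the overlaps so a long sequence stays coherent.
# NOT CraftStory's implementation; denoise_step(), SEGMENT, and OVERLAP are
# hypothetical placeholders.

FRAMES, LATENT_DIM = 600, 16   # a long clip, per-frame latent size
SEGMENT, OVERLAP = 120, 24     # each segment covers 120 frames, 24 shared

def denoise_step(latents: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for one denoising step of a video diffusion model."""
    return latents - 0.01 * latents * (t / 50.0)  # placeholder dynamics

def blend_weights(n: int) -> np.ndarray:
    """Linear cross-fade weights applied to the overlapping frames."""
    return np.linspace(0.0, 1.0, n)[:, None]

def parallel_diffusion(steps: int = 50) -> np.ndarray:
    latents = np.random.randn(FRAMES, LATENT_DIM)
    starts = range(0, FRAMES - OVERLAP, SEGMENT - OVERLAP)
    for t in reversed(range(steps)):
        # 1) Denoise every segment independently (this part parallelizes).
        segments = {s: denoise_step(latents[s:s + SEGMENT].copy(), t)
                    for s in starts}
        # 2) Write segments back, cross-fading overlapping frames so
        #    neighbouring segments agree before the next step.
        for s, seg in segments.items():
            end = min(s + SEGMENT, FRAMES)
            seg = seg[:end - s]
            if s == 0:
                latents[s:end] = seg
            else:
                w = blend_weights(OVERLAP)
                latents[s:s + OVERLAP] = ((1 - w) * latents[s:s + OVERLAP]
                                          + w * seg[:OVERLAP])
                latents[s + OVERLAP:end] = seg[OVERLAP:end - s]
    return latents

if __name__ == "__main__":
    out = parallel_diffusion()
    print(out.shape)  # (600, 16): one latent per frame of the long sequence
```

In a real pipeline the per-segment denoising would run on separate GPUs and the cross-segment coupling would be learned rather than a fixed cross-fade, but the sketch shows why segment-level parallelism sidesteps the cost of a single diffusion pass over every frame at once.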

CraftStory emphasizes realism in both motion and expression. What were the hardest challenges in preserving natural hand, body, and facial dynamics at longer durations?

The biggest challenge is generating realistic body and facial movement consistently over long durations. Small details, like subtle hand motion, shifting posture, or micro-expressions, tend to break down in most models as the sequence gets longer. We solved this by training on our own extensive, high-quality dataset, captured with professional actors and high-frame-rate cameras. That level of controlled, motion-rich footage gave the model the signal it needed to preserve natural dynamics across the entire performance, not just in isolated moments.

Many companies are stuck between expensive live shoots and short, unreliable AI clips. Where do you see the biggest commercial demand emerging for multi-minute, human-centric video?

AI-generated videos are rapidly becoming indistinguishable from camera-shot footage, while costing a fraction of traditional production. The biggest early demand we’re seeing is in corporate content—especially Learning & Development—where companies need large volumes of clear, human-centric instructional videos that can be updated instantly. Multi-minute, consistent AI presenters are a perfect fit for that.

We’re also seeing growing interest in marketing use cases such as product introductions, tutorials, and explainers. As the technology matures, long-form AI video will increasingly replace both expensive live shoots and the short, unreliable clips most tools can produce today.

You’ve built an advanced lip-sync and gesture alignment system. How far are we from fully believable AI dialogue, and what still needs improvement?

I think we’re very close. One more iteration of the technology—especially to make it faster and to generate native 1080p—will get us to fully believable AI dialogue.

The text-to-video model you’re developing promises long-form generation directly from scripts. What technical barriers are you still working to overcome before that becomes mainstream?

There are no fundamental barriers—just a lot of engineering work ahead. Video-to-video was the lower-hanging fruit, so we brought that to market first. Now we’re focused on the image-to-video model that takes a script and a reference image as input. We’re making fast progress and hope to release it within the next few weeks.

Moving-camera sequences—like walk-and-talk shots—are a major step toward cinematic automation. How is your team approaching this challenge compared to competitors like Sora?

We’re focused on generating long walk-and-talk sequences—multi-minute shots that feel cinematic and natural. Our goal is to give customers the ability to create videos in the style of the famous “Keep Walking” campaign by Johnnie Walker, but without a full production crew. We’re making rapid progress, and very soon we’ll be able to produce walk-and-talk sequences that run for several minutes with consistent characters, motion, and camera dynamics.

With OpenAI, Google, and others racing into long-form video, what do you see as CraftStory’s edge in this emerging market?

The AI video market is incredibly competitive, and we fully expect the big players to catch up technologically. But our advantage is focus and speed. We have a very ambitious roadmap, and we’re a lean team that can move fast and iterate quickly. That agility—and our focus on long-form, human-centric video—is what sets CraftStory apart.

As AI-generated human video becomes more lifelike and scalable, what ethical or creative safeguards do you believe should be in place as this technology spreads?

Every powerful technology is a double-edged sword, and it’s crucial to understand the specific risks that come with bringing it to market. In AI-generated human video, impersonation is the most significant—though not the only—concern. We’ve spent time analyzing these risks and have implemented safeguards that prevent certain harmful use cases. As the technology becomes more lifelike and scalable, maintaining strong ethical and creative protections will be essential for the entire industry.

Thank you for the great interview. Readers who wish to learn more should visit CraftStory.
