ShengShu Technology Raises Over $86 Million in Series A+ Funding to Push Multimodal AI Boundaries

ShengShu Technology has completed a Series A+ funding round exceeding RMB 600 million (approximately $86 million), marking a major milestone for the company as it scales its multimodal foundation models for both digital and physical-world applications. The round was co-led by Zhongguancun Science City and LINK-X CAPITAL, with strategic participation from Wondershare, Visual China Group, and TRS. Several existing investors also increased their commitments, underscoring continued confidence in ShengShu’s technical direction and commercial progress.
The new capital arrives at a moment when multimodal AI systems are moving from experimental tools toward infrastructure that underpins real-world production. ShengShu’s trajectory reflects that shift, with research breakthroughs increasingly translating into deployed products used across industries.
From Early Research to Commercial-Grade Models
ShengShu Technology was among the earliest teams globally to focus on multimodal generative algorithms as a core research direction. In 2022, its founding team introduced the U-ViT architecture, helping establish a technical foundation for models that work across text, image, and video. This research-first approach set the stage for the launch of Vidu in mid-2024.
Vidu entered the market with a Reference-to-Video capability that moved beyond conventional text-to-video or image-to-video generation. Rather than treating each frame as an isolated output, the system was designed to preserve multi-entity consistency across scenes, addressing a long-standing challenge in commercial video generation. Since launch, ShengShu has iterated rapidly, releasing successive versions that improved semantic understanding, motion stability, visual coherence, and inference speed.
The most recent release, Vidu Q3, reflects a deliberate focus on storytelling. The model supports synchronized audio-video generation up to 16 seconds, native 1080p output, precise shot transitions, multilingual text rendering, and multi-language output. These capabilities position the system closer to production workflows than to short-form experimental clips.
Performance, Speed, and Open Innovation
Beyond output quality, ShengShu has emphasized efficiency as a competitive differentiator. In late 2025, the company open-sourced its TurboDiffusion framework, which significantly reduces video generation latency. With this framework, a five-second video can be generated in under two seconds on a single high-end GPU, representing orders-of-magnitude gains compared to earlier approaches.
This focus on speed is not just a technical benchmark. Lower latency and compute requirements directly affect the feasibility of deploying multimodal models at scale, especially for interactive applications and real-time creative tools. By reducing the cost and time required to generate high-quality video, ShengShu is pushing multimodal AI closer to everyday use in professional environments.
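To put the latency figure in context, the arithmetic can be sketched in a few lines of Python. The function name `real_time_factor` is a hypothetical helper introduced here for illustration, and the only inputs are the two figures quoted above (a five-second clip, generated in under two seconds). Because the article gives only an upper bound on generation time, the result is a lower bound on the speed ratio, not a measured benchmark.

```python
def real_time_factor(clip_seconds: float, generation_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock generation time.

    A value above 1.0 means the clip is generated faster than it plays back.
    """
    if clip_seconds <= 0 or generation_seconds <= 0:
        raise ValueError("durations must be positive")
    return clip_seconds / generation_seconds


# Figures quoted in the article: a 5 s clip generated in under 2 s.
# Using 2.0 s (the stated upper bound) yields a lower bound on the factor.
factor = real_time_factor(5.0, 2.0)
print(f"real-time factor: at least {factor:.1f}x")  # prints "real-time factor: at least 2.5x"
```

A factor above 1.0 is the threshold that matters for the interactive and real-time tools mentioned above, since it means generation finishes before the clip itself would finish playing.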
Expanding Adoption Across Creative and Enterprise Markets
ShengShu has built a broad product ecosystem around Vidu, spanning managed services, SaaS offerings, applications, and agent-based tools. These products now serve creators, studios, and enterprises across more than 200 countries and regions. In 2025, the company reported more than tenfold growth in both users and revenue, indicating accelerating adoption.
In film and entertainment, Vidu is used across animation, short-form production, and feature workflows, with engagement across content owners, tool providers, and production studios. In parallel, internet platforms and smart hardware companies are applying the technology to marketing asset creation, interactive content, and product innovation.
Advertising and gaming have emerged as additional areas of traction. Brands and agencies use Vidu to scale video production for campaigns, while game developers deploy it for advertising content and scene generation. Internationally, the platform is being adopted by creative tool developers and enterprise users, with applications extending into education, broadcasting, and cultural tourism.
The Broader Implications of Multimodal AI
The progress of multimodal foundation models has implications far beyond video creation. By integrating text, image, audio, and motion into unified systems, these models enable machines to interpret context in a way that more closely resembles human perception. For industries, this means faster production cycles, lower barriers to entry for high-quality content, and new forms of interaction between humans and software.
At the same time, the maturation of multimodal AI raises important questions around authenticity, intellectual property, and responsible deployment. As generated video becomes increasingly realistic, technical safeguards and governance frameworks will be essential to maintain trust in digital media.
Looking ahead, multimodal models are likely to play a role not only in digital workflows but also in physical-world systems, from robotics and simulation to smart environments. ShengShu Technology’s latest funding round positions it to participate in that transition, as multimodal AI shifts from a creative novelty to a foundational layer of next-generation productivity.