Anderson's Angle

A Video Codec for AI-Generated Footage

Published June 4, 2026

Martin Anderson

AI-generated image (GPT-2 + Photoshop): an industrial humanoid robot, heavily blurred, holds a vertical strip of 70mm film close to the camera, with the film frames in sharp focus showing a man and a woman standing and conversing. Every third frame is tinted green and the others appear monochrome.

As the all-you-can-eat era of AI draws to a close, an economical new approach to AI video generation promises notable savings in tokens and time.

The real cost of AI inference is bringing a new note of sobriety to the breakneck pace of the current AI revolution, with increased interest in rationalizing the cost of machine learning. Besides the potential of bringing AI in-house, and the general rise of private AI, VRAM-hungry and resource-gobbling machine learning routines will clearly need optimization too.

Video generation is perhaps the biggest offender in this respect. If you’ve ever recompressed a movie or exported one from a video-editing suite, you already know the toll that this particular (non-AI) task takes on your hardware – eating RAM and CPU cycles, and often blocking the machine for any other usage, unless measures are taken to limit the compression algorithm’s impact on the host computer.

Therefore one need only imagine the extent to which the rise of AI video is replaying this ‘power-mad’ procedure in data centers around the globe. At that scale of operation, the smallest gains become immediately significant in the aggregate reckoning.

In the Frame

With this in mind, a new research offering from Shanghai, in collaboration with JD.com, is proposing a video codec aimed not at the rendering-out process (the process of compressing huge, raw frames into a smaller video file size), but at the actual AI video generation process itself.

A normal video codec works not by storing every frame as a full image, but by storing a smaller number of complete pictures, called I-frames, and then recording only the changes between frames.

For instance, if a person moves slightly in a video, the codec records only the parts of the frame that register that change, rather than rewriting the whole scene. These change records are P-frames, which are derived from earlier frames, and B-frames, which can also draw on information from later frames:

Anatomy of a video codec: top row shows frames over time labeled I, P, and B, with I-frames fully stored in full color and P- and B-frames shown faded to indicate reconstruction; arrows indicate whether frames use earlier frames, later frames, or both; lower panels depict a fully stored frame (I-frame, left), a frame built from a previous frame (P-frame, second-left), and a frame built from both earlier and later frames (B-frame, right).

This reuse of nearby information is why video files stay small, with most frames operating not as new images, but as instructions describing how the previous frame has changed. I-frames are therefore the ‘full-fat’, space-hogging uncompressed (or minimally compressed) images, while the frames between them store only the differences from the I-frames (and from each other).
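The relationship between full frames and difference frames can be sketched in a few lines of Python. This is a toy illustration only, assuming frames are flat lists of pixel values and that a 'P-frame' is simply a record of which pixels changed. Real codecs use motion compensation and transforms instead, and the names `encode`, `decode` and `gop_size` are hypothetical:

```python
# Toy I/P-frame scheme: a full frame at fixed intervals, and per-pixel
# change records in between. Not representative of any production codec.

def encode(frames, gop_size=4):
    """Return a list of ('I', full_frame) or ('P', {pixel_index: new_value})."""
    encoded = []
    previous = None
    for i, frame in enumerate(frames):
        if previous is None or i % gop_size == 0:
            # Reference frame: store every pixel.
            encoded.append(("I", list(frame)))
        else:
            # Predicted frame: store only the pixels that differ.
            changes = {j: new for j, (old, new) in enumerate(zip(previous, frame))
                       if old != new}
            encoded.append(("P", changes))
        previous = frame
    return encoded


def decode(encoded):
    """Rebuild every frame by applying change records to the last full frame."""
    frames = []
    current = None
    for kind, payload in encoded:
        if kind == "I":
            current = list(payload)
        else:
            current = list(current)
            for j, value in payload.items():
                current[j] = value
        frames.append(current)
    return frames


frames = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 0, 0, 0, 0],   # a single bright pixel appears...
    [0, 0, 9, 0, 0, 0],   # ...and moves to the right
    [0, 0, 0, 9, 0, 0],
    [5, 5, 5, 5, 5, 5],   # lands on the next fixed reference slot
]
encoded = encode(frames, gop_size=4)
frame_kinds = [kind for kind, _ in encoded]
stored_values = sum(len(payload) for _, payload in encoded)
raw_values = sum(len(frame) for frame in frames)
print(frame_kinds, stored_values, "of", raw_values, "values stored")
```

In this small example the encoded form holds 17 values against 30 in the raw frames, and decoding reproduces the original frames exactly.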

When every single frame is a full uncompressed image, the movie essentially has no compression. Saving a movie in this way, as uncompressed video, would result in a 2-hour movie needing near (or more than) a terabyte of disk space. Yet this is how AI makes movies^† – by dedicating equal resources and tokens to each and every frame, when it’s calculating how to formulate the video.
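As a rough check on that terabyte figure, the arithmetic can be worked through directly. The resolution, color depth and frame rate below are assumptions chosen for illustration (1080p, 8-bit RGB, 24fps), and are not taken from the paper:

```python
# Back-of-envelope size of an uncompressed 2-hour movie.
WIDTH, HEIGHT = 1920, 1080        # assumed 1080p resolution
BYTES_PER_PIXEL = 3               # assumed 8-bit RGB, no alpha channel
FPS = 24                          # assumed cinema frame rate
DURATION_SECONDS = 2 * 60 * 60    # the 2-hour movie mentioned above

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
total_frames = FPS * DURATION_SECONDS
total_bytes = bytes_per_frame * total_frames

print(f"{total_frames} frames, {total_bytes / 1e12:.2f} TB")
```

Under these assumptions the total comes to roughly 1.07 TB; higher resolutions, bit depths or frame rates push the figure up considerably.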

Economy of Scale

The new work, titled AdaCodec: A Predictive Visual Code for Video MLLMs, instead expends full visual tokens exclusively on reference frames (I-frames), with all interstitial frames rendered as ‘compact P-tokens’ – a paradigm clearly taken from the traditional compression employed by historical ‘real world’ video codecs.

After this internal compression has taken place, the genAI video can then be compressed normally, and, in theory, all the savings are server-side:

Overview of AdaCodec. Left, videos are divided into adaptive groups of pictures, with full I-frames reserved for moments that are difficult to predict and intermediate P-frames represented using compact motion and residual information. Right, the resulting system matches or exceeds Qwen3-VL-8B across eleven benchmarks, maintains higher long-video accuracy across token budgets, and reduces response latency while processing substantially fewer video tokens. Source

The savings reported from the AdaCodec tests appear worth pursuing: the paper states that the system outperformed the unmodified Qwen3-VL-8B model across every major benchmark while using the same visual-token budget, and still matched or exceeded that model’s performance after cutting video tokens by approximately 86%.

The authors state*:

‘We draw inspiration from predictive coding, where a system transmits errors from a prediction rather than the raw signal. This principle has biological grounding: the visual system is thought to encode prediction errors, the mismatch between expected and observed input, rather than the input itself.

‘Modern video codecs use the same residual-coding idea in engineering: reference frames carry full content, while predictive frames carry motion and residual signals relative to a reference.

‘These systems have different objectives, but they share the same conditional structure: when nearby samples are redundant, the channel should carry what prediction fails to explain.

‘Standard codecs, however, optimize for bitstreams and human-viewable reconstruction, not for visual tokens consumed by an LLM. We therefore redesign this mechanism as an MLLM interface for video understanding.’

The new work, written by 11 researchers from Shanghai Jiao Tong University, Shanghai Innovation Institute, and JD.com, comes with an associated project page, with release of source code promised.

Method

As discussed, instead of treating every frame as a completely new image, the system looks for what changed between one frame and the next. On the left, in the image below, we see a small region of the current frame that’s matched against the most similar region in an earlier frame:

Schema overview for AdaCodec.

The distance between the two locations becomes a motion vector, while any remaining visual differences become a residual, with these compact descriptions replacing the need to store a full image.
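The motion-vector-plus-residual idea can be illustrated with classic block matching, sketched below. This is a generic textbook technique under stated assumptions, not the authors' implementation: frames are small grayscale grids held as lists of lists, similarity is measured by sum of absolute differences, and `match_block`, `get_block` and the `search` radius are hypothetical names and parameters:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame."""
    return [row[left:left + size] for row in frame[top:top + size]]


def match_block(reference, current, top, left, size, search=2):
    """Return (motion_vector, residual) for one block of `current`.

    The motion vector is the (dy, dx) offset from the block's position in
    `current` to its best match in `reference`; the residual is whatever
    the matched block fails to explain.
    """
    target = get_block(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block would fall outside the frame
            candidate = get_block(reference, y, x, size)
            cost = sad(target, candidate)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx), candidate)
    _, motion_vector, matched = best
    residual = [[t - m for t, m in zip(t_row, m_row)]
                for t_row, m_row in zip(target, matched)]
    return motion_vector, residual


# A 2x2 bright patch sits at row 1, column 1 in the earlier frame...
reference = [[0] * 6 for _ in range(6)]
for r in (1, 2):
    for c in (1, 2):
        reference[r][c] = 8

# ...and at row 2, column 3 in the current frame, with one pixel brighter.
current = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4):
        current[r][c] = 8
current[3][4] = 9

motion_vector, residual = match_block(reference, current, top=2, left=3, size=2)
print(motion_vector, residual)
```

Here the patch is found one row up and two columns to the left in the earlier frame, and the residual is zero everywhere except the single pixel that changed brightness, so two small descriptions stand in for the block's full pixel data.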

On the right, we see the resulting information fed into the AI model: important reference frames are still processed as complete images, but the intervening frames are represented by much smaller motion-and-residual tokens – apparently allowing the model to retain enough information to parse the video, while processing notably less visual data.

One interesting challenge is deciding which frames deserve to be stored in full: traditional video codecs usually place reference frames at regular intervals, whether they are needed or not. AdaCodec, instead, tries to identify the moments that matter most.

For instance, consider a scene mostly depicting a static conversation between two people in an apartment – and suddenly a SWAT team bursts in through the window. Immediately the camera views and number of editing cuts will ramp up and require much more data than a regularized reference frame interval will provide:

Sequence of frames showing an empty room, two people talking, a group entering through the window, the two people being escorted away, and the room empty again. The strip of intermediate frames below depicts the same events across a longer timeline, with selected frames highlighted in blue. A horizontal scale labeled

This is the logic behind variable bitrate (VBR) compression methods, which analyze source video for such ‘busy’ periods, and assign greater data where it is needed – at no small cost of time and resources.

In AdaCodec, if a frame can be predicted accurately from nearby frames, the system continues using compact motion-and-residual tokens; if the scene changes substantially (e.g., the SWAT example above, or something less dramatic), a full reference frame is inserted. This allows more of the available processing budget to be spent on important visual information instead of being distributed evenly across the entire video.
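A minimal sketch of this kind of adaptive decision follows. The article does not describe the paper's actual criterion, so everything here is an assumption made for illustration: the 'prediction' is simply the previous frame left unchanged, the error measure is mean absolute pixel difference, and `assign_frame_types` and its `threshold` value are hypothetical:

```python
def mean_abs_diff(frame_a, frame_b):
    """Average per-pixel absolute difference between two flat frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def assign_frame_types(frames, threshold=10.0):
    """Label each frame 'I' (stored in full) or 'P' (stored compactly).

    A frame becomes a reference frame only when the previous frame
    predicts it poorly, i.e. the error exceeds `threshold`.
    """
    types = []
    previous = None
    for frame in frames:
        if previous is None or mean_abs_diff(previous, frame) > threshold:
            types.append("I")
        else:
            types.append("P")
        previous = frame
    return types


frames = [
    [10, 10, 10, 10],        # static conversation
    [10, 11, 10, 10],
    [10, 11, 12, 10],
    [200, 180, 220, 190],    # abrupt change (the window comes in)
    [201, 180, 221, 190],
]
frame_types = assign_frame_types(frames)
print(frame_types)
```

With these toy values, full frames are spent only on the opening shot and on the abrupt change, while the near-static frames in between are marked for compact storage.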

Data and Tests

In testing, the researchers used the aforementioned Qwen3-VL-8B as the base model and evaluated AdaCodec across eleven benchmarks spanning three areas of video understanding: long-video performance was assessed with MLVU, LongVideoBench, and LVBench; temporal understanding with TempCompass, MotionBench, and TOMATO; and general video understanding with Video-MME, MVBench, NExT-QA, PerceptionTest, and EgoSchema.

Open source models tested were InternVL3.5-8B; Keye-VL-1.5-8B; GLM-4.1V-9B; MiniCPM-V-4.5-8B; Eagle2.5-8B; PLM-8B; LLaVA-Video-7B; VideoChat-Flash-7B; Molmo2-8B; and Molmo2-O-7B.

The GPT-5, Gemini, and Claude variations appear in the table below only as comparison baselines. CoPE-VideoLM-7B and ReMoRa-7B are earlier video-language models that reduce visual-token usage through codec-inspired compression, making them the closest direct competitors to AdaCodec:

Main results across eleven benchmarks covering long-video understanding, temporal reasoning, and general video understanding. Higher scores indicate better performance. LVB = LongVideoBench; V-MME = Video-MME. Bold and underlined values indicate the highest and second-highest scores among open-source models. Scores for closed-source models were taken from official reports where available, with missing results sourced from Molmo2, or evaluated by the authors.

To ensure a fair comparison, the same number of visual tokens was allocated to both AdaCodec and the standard Qwen3-VL-8B system, allowing the results to reflect the effectiveness of the compression approach, rather than any differences in computational resources.

At the most aggressive setting, AdaCodec reduced visual-token usage by about 86% while still matching or slightly exceeding the baseline system on long-video, temporal, and general video-understanding tasks.

When the saved tokens were reinvested into processing more video frames, performance improved across every long-video benchmark and every temporal benchmark, with gains reaching +5.4 points on LongVideoBench and +4.3 points on TOMATO, while also producing some of the strongest open-source results in the study.
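To give a sense of scale for that reinvestment, the arithmetic below shows how many more frames fit into an unchanged token budget at an 86% average saving. Only the 86% figure comes from the reported results; the per-frame token count and baseline frame count are hypothetical placeholders, and the calculation assumes the saving applies uniformly on average:

```python
REDUCTION = 0.86                   # approximate token saving reported above
BASELINE_TOKENS_PER_FRAME = 256    # hypothetical value, for illustration
BASELINE_FRAMES = 64               # hypothetical value, for illustration

# Fixed budget: what the baseline spends on its frames.
budget = BASELINE_TOKENS_PER_FRAME * BASELINE_FRAMES

# Average cost per frame once most frames are stored compactly.
avg_tokens_per_frame = BASELINE_TOKENS_PER_FRAME * (1 - REDUCTION)

frames_at_same_budget = int(budget / avg_tokens_per_frame)
multiplier = frames_at_same_budget / BASELINE_FRAMES

print(f"{frames_at_same_budget} frames ({multiplier:.1f}x the baseline)")
```

On these placeholder numbers, the same budget covers roughly seven times as many frames, which gives some intuition for why long-video benchmarks might benefit most.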

Conclusion

Though projects of this nature are usually aimed at hyperscale providers, this is the kind of effort that will be of interest to hobbyists and SMEs alike, as part of a potential new ‘public asceticism’ around local, rationalized AI deployment.

In communities such as r/stablediffusion, this is very old news, as every major open source release that arrives there is regularly tormented into a hyper-optimized (GGUF, quantized weights, etc.) version capable of running, with some patience, on lower-end graphics cards.

If the ‘performative stage’ of this third AI upsurge is indeed over, and on the assumption that corporations will be repelled by the true costs of inference, then initiatives such as AdaCodec may comprise part of a coming ‘grand optimization’.

^† This is not the same as rendering out the video to a user-friendly format/file-size; rather, it deals with the internal generation and collation of frames that takes place inside the AI model at inference time.

* My conversion, within reason, of the authors’ inline citations to hyperlinks.
