A collaboration between the Virginia Polytechnic Institute and State University and Facebook has solved one of the major challenges in NeRF video synthesis: freely mixing static and dynamic imagery and video in Neural Radiance Fields (NeRF) output.
The system can generate navigable scenes that feature both dynamic video elements and static environments, each recorded on location, but separated out into controllable facets of a virtual environment:
Furthermore, it achieves this from a single viewpoint, without the need for the kind of multi-camera array that can bind initiatives like this to a studio environment.
The paper, titled Dynamic View Synthesis from Dynamic Monocular Video, is not the first to develop a monocular NeRF workflow, but it appears to be the first to simultaneously train a time-varying and a time-static model from the same input, and to produce a framework that allows motion video to exist inside a ‘pre-mapped’ NeRF locale, similar to the kind of virtual environments that often encapsulate actors in high-budget SF outings.
The researchers have had to essentially recreate the versatility of Dynamic NeRF (D-NeRF) with just a single point of view, and not the multiplicity of cameras that D-NeRF uses. To resolve this, they predicted the forward and backward scene flow and used this information to develop a warped radiance field that’s temporally consistent.
With only one POV, it was necessary to use 2D optical flow analysis to recover 3D points in reference frames. Each predicted 3D point is then reprojected into the virtual camera, establishing a ‘scene flow’ whose induced 2D motion can be matched against the optical flow estimated from the video itself.
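The consistency check described above can be sketched as follows. This is an illustrative toy version, not the paper's code: the pinhole projection, the function names, and the camera intrinsics are all assumptions made for the example. The idea is that warping 3D points forward by the predicted scene flow and reprojecting them should reproduce the 2D optical flow estimated from the frames.

```python
import numpy as np

def project(points_3d, focal, cx, cy):
    """Project 3D camera-space points to 2D pixels with a simple pinhole model."""
    x, y, z = points_3d[..., 0], points_3d[..., 1], points_3d[..., 2]
    u = focal * x / z + cx
    v = focal * y / z + cy
    return np.stack([u, v], axis=-1)

def scene_flow_consistency_loss(points_t, scene_flow_fw, flow_2d_estimated,
                                focal=500.0, cx=320.0, cy=240.0):
    """Penalise disagreement between the 2D flow induced by predicted 3D
    scene flow and the optical flow estimated from the video (L1 here)."""
    # Warp 3D points from time t to t+1 with the predicted forward scene flow.
    points_t1 = points_t + scene_flow_fw
    # Project both point sets; their difference is the induced 2D flow.
    induced_flow = project(points_t1, focal, cx, cy) - project(points_t, focal, cx, cy)
    return np.abs(induced_flow - flow_2d_estimated).mean()
```

In training, gradients from a loss like this flow back into the scene-flow predictor, tying the 3D motion estimate to observable 2D evidence.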
At training time, dynamic elements and static elements are reconciled into a full model as separately accessible facets.
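One common way to reconcile two radiance fields at render time is to alpha-composite their samples jointly along each ray, mixing colours in proportion to each field's density. The sketch below shows that general NeRF blending scheme; it is not the paper's implementation, and all names are illustrative.

```python
import numpy as np

def composite_ray(sigma_s, color_s, sigma_d, color_d, deltas):
    """Alpha-composite samples from a static field (sigma_s, color_s) and a
    dynamic field (sigma_d, color_d) along one ray.
    sigma_*: (N,) densities; color_*: (N, 3) RGB; deltas: (N,) sample spacings."""
    eps = 1e-10
    sigma = sigma_s + sigma_d                      # combined density per sample
    # Density-weighted mix of the two fields' colours at each sample.
    color = (sigma_s[:, None] * color_s + sigma_d[:, None] * color_d) / (sigma[:, None] + eps)
    alpha = 1.0 - np.exp(-sigma * deltas)          # opacity of each sample
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + eps]))
    weights = alpha * trans
    return (weights[:, None] * color).sum(axis=0)  # final pixel colour
```

Because each field keeps its own density and colour outputs, either one can be rendered alone, which is what makes the static environment and the dynamic performers separately accessible.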
By adding a depth order loss to the model and applying rigorous regularization of the scene flow prediction, the problem of motion blur is greatly mitigated.
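Scene-flow regularizers of the kind referred to here typically include cycle consistency (warping forward and then backward should return a point to where it started) and smoothness (neighbouring samples should move similarly). A minimal sketch of both, with illustrative names and no claim to match the paper's exact formulation:

```python
import numpy as np

def cycle_consistency_loss(points_t, flow_fw, flow_bw_at_t1):
    """Forward flow to t+1 followed by backward flow should return to t."""
    points_t1 = points_t + flow_fw
    points_back = points_t1 + flow_bw_at_t1
    return np.abs(points_back - points_t).mean()

def smoothness_loss(flow_fw):
    """Neighbouring samples should move similarly (L1 on first differences)."""
    return np.abs(np.diff(flow_fw, axis=0)).mean()
```

Both terms push the predicted motion toward physically plausible trajectories, which is what suppresses the smeared, blurred reconstructions that unconstrained scene flow tends to produce.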
Though the research has much to offer in terms of regularizing NeRF calculation, and greatly improves the freedom with which output from a single POV can be explored, of at least equal note is the novel separation and re-integration of dynamic and static NeRF elements.
Relying on a sole camera, such a system cannot replicate the panopticon view of multi-camera array NeRF setups, but it can go anywhere, and without a truck.
NeRF – Static Or Video?
Recently we looked at some impressive new NeRF research from China that’s able to separate out elements in a dynamic NeRF scene captured with 16 cameras.
ST-NeRF (above) allows the viewer to reposition individuated elements in a captured scene, and even to resize them, change their playback rate, freeze them or run them backwards. Additionally, ST-NeRF allows the user to ‘scroll’ through any part of the 180-degree arc captured by the 16 cameras.
However, the researchers of the ST-NeRF paper concede in closing that under this system time is always running in one direction or another, and that it is difficult to change the lighting and apply effects to environments that are actually video, rather than ‘statically-mapped’ NeRF environments, which in themselves contain no moving components and do not need to be captured as video.
Highly Editable Static NeRF Environments
A static Neural Radiance Field scene, now isolated from any motion video segments, is easier to treat and augment in a number of ways, including relighting, as proposed earlier this year by NeRV (Neural Reflectance and Visibility Fields for Relighting and View Synthesis), which offers an initial step in changing the lighting and/or the texturing of a NeRF environment or object: