TL;DR: CAT4D creates 4D scenes from real or generated videos.


How it works

Given an input monocular video, we generate multi-view videos at novel viewpoints using our multi-view video diffusion model. These generated videos are then used to reconstruct the dynamic 3D scene as deforming 3D Gaussians.
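
For concreteness, here is a minimal sketch of that two-stage pipeline. All names below (Frame, sample_multiview_videos, fit_deforming_gaussians, the loaders in the usage comment) are hypothetical placeholders for illustration, not CAT4D's released code or API.

```python
# Hypothetical sketch of the two-stage pipeline described above.
# Every name here is an illustrative placeholder, not CAT4D's actual code.

from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Frame:
    image: np.ndarray   # (H, W, 3) RGB
    camera: np.ndarray  # (4, 4) camera-to-world pose
    time: float         # timestamp in [0, 1]


def sample_multiview_videos(input_video: List[Frame],
                            novel_cameras: List[np.ndarray]) -> List[List[Frame]]:
    """Stage 1: for each novel viewpoint, sample a video with the multi-view
    video diffusion model, conditioned on the input frames and the target
    camera. (A stand-in that copies the input frames is used here.)"""
    videos = []
    for cam in novel_cameras:
        videos.append([Frame(f.image.copy(), cam, f.time) for f in input_video])
    return videos


def fit_deforming_gaussians(videos: List[List[Frame]]):
    """Stage 2: optimize a set of 3D Gaussians plus a per-timestep deformation
    so their renderings match every generated video (optimization omitted)."""
    ...


# Usage sketch: turn one monocular video into a 4D (dynamic 3D) scene.
# input_video = load_video("walk.mp4")            # hypothetical loader
# cameras = sample_orbit_cameras(n=8)             # hypothetical camera trajectory
# scene = fit_deforming_gaussians(sample_multiview_videos(input_video, cameras))
```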



Interactive Viewer

Click on the images below to render 4D scenes in real time in your browser, powered by Brush!
Note that the viewer is experimental and quality may be reduced.



Separate camera and time control

At the core of CAT4D is a multi-view video diffusion model that disentangles control over camera motion from control over scene motion. We demonstrate this by generating three types of output sequences from three input images (with camera poses): 1) fixed viewpoint and varying time, 2) varying viewpoint and fixed time, and 3) varying viewpoint and varying time.

Panels: Input | Fixed View, Varying Time | Varying View, Fixed Time | Varying View, Varying Time
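
To make the disentanglement concrete, the sketch below shows one way the three demo modes could map to per-frame (camera, time) conditioning targets. The helper and its arguments are assumptions made for illustration, not the model's actual interface.

```python
# Illustrative only: how the three demo sequences could map to (camera, time)
# conditioning pairs. The helper name and signature are assumptions, not the
# released interface.

import numpy as np


def make_conditioning(mode: str, cameras: list, times: list, n_frames: int = 16):
    """Return a list of (camera, time) targets for one output sequence.

    mode: "fixed_view_varying_time", "varying_view_fixed_time",
          or "varying_view_varying_time".
    cameras: camera poses along a novel trajectory.
    times: timestamps covered by the input video, in [0, 1].
    """
    cam_idx = np.linspace(0, len(cameras) - 1, n_frames).round().astype(int)
    t_idx = np.linspace(0, len(times) - 1, n_frames).round().astype(int)

    if mode == "fixed_view_varying_time":
        return [(cameras[0], times[i]) for i in t_idx]       # camera frozen, time sweeps
    if mode == "varying_view_fixed_time":
        return [(cameras[i], times[0]) for i in cam_idx]     # time frozen, camera sweeps
    if mode == "varying_view_varying_time":
        return [(cameras[c], times[t]) for c, t in zip(cam_idx, t_idx)]  # both sweep
    raise ValueError(f"unknown mode: {mode}")
```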



Comparisons

Compare our method to baselines on different tasks. Try selecting different tasks and scenes!


Comparison of dynamic scene reconstruction from monocular videos on the DyCheck dataset.

Columns: 4D-GS | Shape-of-Motion | MoSca | Ours | Ground Truth | Input



Acknowledgements

We would like to thank Arthur Brussee, Philipp Henzler, Daniel Watson, Jiahui Lei, Hang Gao, Qianqian Wang, Songyou Peng, Stan Szymanowicz, Jiapeng Tang, Hadi Alzayer, Dana Roth, and Angjoo Kanazawa for their valuable contributions. We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, Han Zhang, and Amir Hertz for training the base text-to-image latent diffusion model.