CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
TL;DR: CAT4D creates 4D scenes from real or generated videos.
How it works
Given a monocular input video, we generate multi-view videos at novel viewpoints using our multi-view video diffusion model. These generated videos are then used to reconstruct the dynamic scene as a set of deforming 3D Gaussians.
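For concreteness, here is a minimal Python sketch of that two-stage pipeline. All names here (`create_4d_scene`, `diffusion_model.sample`, `reconstructor`) are hypothetical stand-ins for illustration, not the actual CAT4D API.

```python
# Hypothetical sketch of the two-stage pipeline; none of these names come
# from the CAT4D codebase.

def create_4d_scene(input_frames, input_cameras, novel_camera_trajs,
                    diffusion_model, reconstructor):
    """Stage 1: sample multi-view videos at novel viewpoints.
    Stage 2: fit deforming 3D Gaussians to the generated videos."""
    # Stage 1: condition the multi-view video diffusion model on the input
    # video (frames + camera poses) and sample one video per novel trajectory.
    generated = [
        diffusion_model.sample(
            cond_frames=input_frames,
            cond_cameras=input_cameras,
            target_cameras=traj,
        )
        for traj in novel_camera_trajs
    ]
    # Stage 2: optimize a dynamic scene representation (3D Gaussians plus a
    # per-timestep deformation) against the input and generated videos.
    return reconstructor(
        videos=[input_frames, *generated],
        cameras=[input_cameras, *novel_camera_trajs],
    )
```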
Interactive Viewer
Click on the images below to render 4D scenes in real time in your browser, powered by Brush! Note that this viewer is experimental and quality may be reduced.
Separate camera and time control
At the core of CAT4D is a multi-view video diffusion model that disentangles camera control from scene motion. We demonstrate this by generating three types of output sequences from three input images (with camera poses): 1) fixed viewpoint and varying time, 2) varying viewpoint and fixed time, and 3) varying viewpoint and varying time. A minimal sketch of these controls follows the table below.
| Input | Fixed View, Varying Time | Varying View, Fixed Time | Varying View, Varying Time |
| --- | --- | --- | --- |
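The sketch below shows one way separate camera and time conditioning can produce the three sequence types above: each sampled frame gets its own (camera, time) control pair. The function `make_controls` and the integer time-index convention are hypothetical illustrations, not the paper's actual interface.

```python
# Hypothetical illustration of per-frame (camera, time) conditioning.

def make_controls(camera_path, num_frames, mode):
    """Return per-frame (camera, time) conditioning pairs for one sequence."""
    if mode == "fixed_view_varying_time":
        # Hold the first camera, sweep time: a static-camera video.
        return [(camera_path[0], t) for t in range(num_frames)]
    if mode == "varying_view_fixed_time":
        # Sweep the camera, freeze time: a "bullet time" orbit of one instant.
        return [(cam, 0) for cam in camera_path[:num_frames]]
    if mode == "varying_view_varying_time":
        # Sweep both: a novel-view video that also shows scene motion.
        return [(cam, t) for t, cam in enumerate(camera_path[:num_frames])]
    raise ValueError(f"unknown mode: {mode}")

# Usage: build the controls, then sample all frames jointly with the
# multi-view video diffusion model (sampling call omitted here).
controls = make_controls(camera_path=list(range(16)), num_frames=16,
                         mode="varying_view_fixed_time")
```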
Comparisons
We compare our method against baselines on several tasks. Try selecting different tasks and scenes!
Comparison of dynamic scene reconstruction from monocular videos on the DyCheck dataset.
| 4D-GS | Shape-of-Motion | MoSca | Ours | Ground Truth | Input |
| --- | --- | --- | --- | --- | --- |
Acknowledgements
We would like to thank Arthur Brussee, Philipp Henzler, Daniel Watson, Jiahui Lei, Hang Gao,
Qianqian Wang, Songyou Peng, Stan Szymanowicz, Jiapeng Tang, Hadi Alzayer, Dana Roth, and Angjoo Kanazawa for their
valuable contributions.
We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, Han Zhang,
and Amir Hertz for training the base text-to-image latent diffusion model.