CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
TL;DR: CAT4D creates 4D scenes from real or generated videos.
How it works
Given a monocular input video, we generate multi-view videos at novel viewpoints using our multi-view video diffusion model. These generated videos are then used to reconstruct the dynamic scene as a set of deforming 3D Gaussians.
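For concreteness, here is a minimal Python sketch of that two-stage pipeline. All names here (`create_4d_scene`, `diffusion_model.sample`, `reconstructor`) are hypothetical stand-ins for illustration, not the actual CAT4D API.

```python
# Hypothetical sketch of the two-stage pipeline; none of these names come
# from the CAT4D codebase.

def create_4d_scene(input_frames, input_cameras, novel_camera_trajs,
                    diffusion_model, reconstructor):
    """Stage 1: sample multi-view videos at novel viewpoints.
    Stage 2: fit deforming 3D Gaussians to the generated videos."""
    # Stage 1: condition the multi-view video diffusion model on the input
    # video (frames + camera poses) and sample one video per novel trajectory.
    generated = [
        diffusion_model.sample(
            cond_frames=input_frames,
            cond_cameras=input_cameras,
            target_cameras=traj,
        )
        for traj in novel_camera_trajs
    ]
    # Stage 2: optimize a dynamic scene representation (3D Gaussians plus a
    # per-timestep deformation) against the input and generated videos.
    return reconstructor(
        videos=[input_frames, *generated],
        cameras=[input_cameras, *novel_camera_trajs],
    )
```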
Interactive Viewer
Click on the images below to render 4D scenes in real time in your browser, powered by Brush! Note that this viewer is experimental and quality may be reduced.
Separate camera and time control
At the core of CAT4D is a multi-view video diffusion model that disentangles camera control from scene motion. We demonstrate this by generating three types of output sequences from three input images (with camera poses): 1) fixed viewpoint and varying time, 2) varying viewpoint and fixed time, and 3) varying viewpoint and varying time. A minimal sketch of these controls follows the table below.
| Input | Fixed View, Varying Time | Varying View, Fixed Time | Varying View, Varying Time |
| --- | --- | --- | --- |
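The sketch below shows one way separate camera and time conditioning can produce the three sequence types above: each sampled frame gets its own (camera, time) control pair. The function `make_controls` and the integer time-index convention are hypothetical illustrations, not the paper's actual interface.

```python
# Hypothetical illustration of per-frame (camera, time) conditioning.

def make_controls(camera_path, num_frames, mode):
    """Return per-frame (camera, time) conditioning pairs for one sequence."""
    if mode == "fixed_view_varying_time":
        # Hold the first camera, sweep time: a static-camera video.
        return [(camera_path[0], t) for t in range(num_frames)]
    if mode == "varying_view_fixed_time":
        # Sweep the camera, freeze time: a "bullet time" orbit of one instant.
        return [(cam, 0) for cam in camera_path[:num_frames]]
    if mode == "varying_view_varying_time":
        # Sweep both: a novel-view video that also shows scene motion.
        return [(cam, t) for t, cam in enumerate(camera_path[:num_frames])]
    raise ValueError(f"unknown mode: {mode}")

# Usage: build the controls, then sample all frames jointly with the
# multi-view video diffusion model (sampling call omitted here).
controls = make_controls(camera_path=list(range(16)), num_frames=16,
                         mode="varying_view_fixed_time")
```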
Comparisons
We compare our method against baselines on several tasks. Try selecting different tasks and scenes!
Comparison of dynamic scene reconstruction from monocular videos on the DyCheck dataset.
| 4D-GS | Shape-of-Motion | MoSca | Ours | Ground Truth | Input |
| --- | --- | --- | --- | --- | --- |
Acknowledgements
We would like to thank Arthur Brussee, Philipp Henzler, Daniel Watson, Jiahui Lei, Hang Gao,
Qianqian Wang, Songyou Peng, Stan Szymanowicz, Jiapeng Tang, Hadi Alzayer, Dana Roth, and Angjoo Kanazawa for their
valuable contributions.
We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, Han Zhang,
and Amir Hertz for training the base text-to-image latent diffusion model.