DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

CVPR 2024

1 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China · 2 Beijing National Research Center for Information Science and Technology (BNRist) · 3 Department of Computer Science, University of Rochester, USA · 4 ByteDance, Hangzhou, China

Abstract

Choreographers determine what dances look like, while cameramen determine the final presentation of dances. Recently, various methods and datasets have showcased the feasibility of dance synthesis. However, camera movement synthesis with music and dance remains a challenging, unsolved problem due to the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D dataset that, for the first time, combines camera movement with dance motion and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of paired dance-camera-music data from the anime community, covering 4 music genres. With this dataset, we uncover that dance camera movement is multifaceted and human-centric, with multiple influencing factors, making dance camera synthesis a more challenging task than camera or dance synthesis alone. To overcome these difficulties, we propose DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy. For evaluation, we devise new metrics measuring camera movement quality, diversity, and dancer fidelity. Using these metrics, we conduct extensive experiments on our DCM dataset, providing both quantitative and qualitative evidence of the effectiveness of DanceCamera3D.

Figure: An overview of the DCM dataset.

Dataset Description

After alignment, the DCM dataset contains 108 pieces of paired data (193 minutes) covering music in 4 languages: English, Chinese, Korean, and Japanese. For the camera pose representation, we originally acquired data in the MMD format (a polar-coordinate representation) and preprocessed it into a camera-centric format (a Cartesian-coordinate representation) for computing the training losses; a conversion sketch is given below. In addition, the dance motion in our data consists of the rotations and positions of 60 joints. Both dance motion and camera movement are sampled at 30 FPS.
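As a rough illustration of this preprocessing step, here is a minimal sketch of converting a polar (MMD-style) camera pose into a Cartesian camera-centric pose. The field layout (orbit target, Euler angles, orbit distance, FOV) and the rotation composition order are assumptions made for illustration, not the official DCM specification; the exact conversion lives in our preprocessing code.

import numpy as np

def mmd_to_camera_centric(target, euler_xyz, distance, fov):
    # target:    (3,) ndarray, the point the camera orbits (assumed layout)
    # euler_xyz: (3,) rotations about the x/y/z axes in radians (assumed)
    # distance:  scalar orbit radius from target to camera (assumed)
    # fov:       field of view in degrees, passed through unchanged
    rx, ry, rz = euler_xyz
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx),  np.cos(rx)]])
    Ry = np.array([[ np.cos(ry), 0, np.sin(ry)],
                   [0,           1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz),  np.cos(rz), 0],
                   [0,           0,          1]])
    R = Ry @ Rx @ Rz  # composition order is an assumption
    # The camera sits `distance` away from the target along the rotated -z axis.
    eye = target + R @ np.array([0.0, 0.0, -distance])
    view_dir = (target - eye) / np.linalg.norm(target - eye)
    return eye, view_dir, fov

As a quick sanity check for any variant of this conversion: with zero rotations and distance 5, the camera lands 5 units behind the target on the z axis, looking along +z toward it.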

The duration of the original data ranges from 17 to 267 seconds. To divide the data into train, test, and validation sets, we first randomly cut it into shorter pieces of 17 to 35 seconds, choosing all cut points at camera keyframes to better preserve camera characteristics. Then, for every music type, we randomly assign pieces to the train, test, and validation sets with probabilities of 0.8 : 0.1 : 0.1, as sketched below.
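The sketch below illustrates the per-music-type 0.8 : 0.1 : 0.1 assignment. The data layout (a list of dicts carrying a 'music_type' key) and the fixed random seed are hypothetical; only the split probabilities follow the protocol above.

import random

def split_dataset(pieces, seed=0):
    # pieces: hypothetical list of dicts, each with a 'music_type' key
    rng = random.Random(seed)
    splits = {"train": [], "test": [], "val": []}
    # Group by music type so every genre contributes to each split.
    by_type = {}
    for piece in pieces:
        by_type.setdefault(piece["music_type"], []).append(piece)
    for group in by_type.values():
        for piece in group:
            r = rng.random()
            if r < 0.8:
                splits["train"].append(piece)
            elif r < 0.9:
                splits["test"].append(piece)
            else:
                splits["val"].append(piece)
    return splits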

Dataset Download

This dataset is available for academic use only. Out of respect for the original data providers, we have collected links to the raw data so that users can download it directly from the original creators. Please show your appreciation and support for their work by liking and bookmarking their content if you use this data. Please adhere to the usage rules of the original data; any ethical or legal violation is the responsibility of the user. To obtain access, sign the EULA form DCM-EULA-20240318.pdf and send the scanned form to wangzixu21@mails.tsinghua.edu.cn. Once approved, you will receive a download link. To preprocess the dataset, see the README.md in our code.

BibTeX

@InProceedings{Wang_2024_CVPR,
  author    = {Wang, Zixuan and Jia, Jia and Sun, Shikun and Wu, Haozhe and Han, Rong and Li, Zhenyu and Tang, Di and Zhou, Jiaqing and Luo, Jiebo},
  title     = {DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {7892-7901}
}