Conditional human motion synthesis (HMS) aims to generate human motion sequences that conform to specific conditions. Text and audio are the two predominant modalities used as control conditions. While existing research has focused primarily on single conditions, multi-condition human motion generation remains underexplored. In this work, we propose MCM, a multi-condition HMS framework built on a dual-branch structure consisting of a main branch and a control branch. The framework extends a diffusion model originally conditioned only on text to auditory conditions, covering both music-to-dance and co-speech HMS, while preserving the motion quality and semantic-association capabilities of the original model. We further propose MWNet, a Transformer-based diffusion model, as the main branch; through its multi-wise self-attention modules, it captures the spatial characteristics and inter-joint correlations of motion sequences. Extensive experiments show that our method achieves competitive results in both single-condition and multi-condition HMS tasks.
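To illustrate the idea of multi-wise self-attention, below is a minimal PyTorch sketch that applies attention both frame-wise (over time) and joint-wise (within each frame). The class name `MultiWiseSelfAttention`, the shapes, and the residual fusion are assumptions for illustration only; the paper's exact module design may differ.

```python
import torch
import torch.nn as nn

class MultiWiseSelfAttention(nn.Module):
    """Illustrative block: self-attention over the temporal axis and over the
    joint axis, fused with residual connections (an assumption, not the
    released MWNet design)."""

    def __init__(self, num_joints: int, joint_dim: int, num_heads: int = 8):
        super().__init__()
        d_model = num_joints * joint_dim  # must be divisible by num_heads
        # Frame-wise attention: each frame attends to other frames.
        self.temporal_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Joint-wise attention: each joint attends to other joints in the same frame.
        self.joint_attn = nn.MultiheadAttention(joint_dim, 1, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.num_joints, self.joint_dim = num_joints, joint_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, num_joints * joint_dim)
        B, T, _ = x.shape
        t_out, _ = self.temporal_attn(x, x, x)       # temporal structure
        x = self.norm1(x + t_out)
        j = x.reshape(B * T, self.num_joints, self.joint_dim)
        j_out, _ = self.joint_attn(j, j, j)           # inter-joint correlations
        x = self.norm2(x + j_out.reshape(B, T, -1))
        return x
```

For example, with 24 joints and a per-joint feature dimension of 64, the block operates on sequences of shape (batch, frames, 1536).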
MCM employs a dual-branch architecture comprising a pre-trained main branch with frozen parameters and a trainable control branch. The main branch is a pre-trained MWNet or another network based on Denoising Diffusion Probabilistic Models (DDPM), such as MotionDiffuse or MDM. The control branch mirrors the structure of the main branch and is initialized with the main branch's parameters. Inspired by ControlNet, we optimize the main branch and the control branch independently in distinct training stages. We first pre-train the main branch on the text-to-motion task, endowing MCM with foundational motion quality and semantic-association capabilities. When training on audio-to-motion tasks, all parameters except those of the control branch and the bridge modules are kept frozen, which guarantees that the main branch's generation quality and semantic-association capabilities are retained. As input to the control branch, we add the Jukebox audio features element-wise to the motion latent vector, injecting audio information into MCM. The output of each control-branch layer is added to the input of the main branch through a bridge module, producing a small, audio-conditioned offset in the main branch's output. Because the bridge modules are zero-initialized, the initial output of MCM is identical to that of the main branch, and the parameters are gradually adjusted according to the audio over the course of training. A sketch of this wiring is shown below.
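The following is a minimal PyTorch sketch of the dual-branch update, with the diffusion timestep and text conditioning omitted for brevity. The names (`MCMSketch`, `zero_linear`), the per-layer shapes, and the choice of a linear layer as the bridge module are assumptions for illustration, not the released implementation.

```python
import copy
import torch
import torch.nn as nn

def zero_linear(dim: int) -> nn.Linear:
    """Bridge module: a linear layer initialized to zero, so the control branch
    contributes nothing at the start of training."""
    layer = nn.Linear(dim, dim)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

class MCMSketch(nn.Module):
    """Illustrative dual-branch wiring: a frozen, pre-trained main branch and a
    trainable control branch initialized as a copy of it, connected through
    zero-initialized bridge modules."""

    def __init__(self, main_layers: nn.ModuleList, latent_dim: int):
        super().__init__()
        self.main_layers = main_layers
        for p in self.main_layers.parameters():
            p.requires_grad_(False)                        # freeze the main branch
        self.control_layers = copy.deepcopy(main_layers)   # init from main branch
        for p in self.control_layers.parameters():
            p.requires_grad_(True)
        self.bridges = nn.ModuleList(zero_linear(latent_dim) for _ in main_layers)

    def forward(self, motion_latent: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # Audio features (e.g. Jukebox) are added element-wise to the motion latent
        # to form the control-branch input.
        c = motion_latent + audio_feat
        x = motion_latent
        for main_layer, ctrl_layer, bridge in zip(self.main_layers, self.control_layers, self.bridges):
            c = ctrl_layer(c)
            # Each control-branch output is added to the corresponding main-branch
            # input through a zero-initialized bridge, so the initial output of the
            # combined model equals that of the main branch alone.
            x = main_layer(x + bridge(c))
        return x
```

Because the bridges start at zero, gradients flow into the control branch and bridges while the frozen main branch keeps its text-to-motion behavior; the audio-conditioned offset grows only as training progresses.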
@misc{ling2023mcm,
title={MCM: Multi-condition Motion Synthesis Framework for Multi-scenario},
author={Zeyu Ling and Bo Han and Yongkang Wong and Mohan Kankanhalli and Weidong Geng},
year={2024},
archivePrefix={IJCAI},
}

This website draws heavy design inspiration from the excellent EDGE site.