We present a physics-based humanoid controller that achieves high-fidelity motion imitation and fault-tolerant behavior in the presence of noisy input (e.g., pose estimates from video or motion generated from language) and unexpected falls. Our controller scales to learning ten thousand motion clips without using any external stabilizing forces and learns to naturally recover from fail-states. Given reference motion, our controller can perpetually control simulated avatars without requiring resets. At its core, we propose the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn harder and harder motion sequences. PMCP allows efficient scaling for learning from large-scale motion databases and adding new tasks, such as fail-state recovery, without catastrophic forgetting. We demonstrate the effectiveness of our controller by using it to imitate noisy poses from video-based pose estimators and language-based motion generators in a live, real-time multi-person avatar use case.
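As a rough illustration of the progressive multiplicative composition behind PMCP, the sketch below shows the general idea: each frozen primitive proposes a Gaussian action distribution, a composer weights them, and the weighted product of Gaussians is the composite action distribution. This is a minimal sketch only; the network sizes, primitive architecture, and gating network here are assumptions, not the trained controller's settings.

```python
import torch
import torch.nn as nn

class PMCPSketch(nn.Module):
    """Minimal, illustrative sketch of progressive multiplicative composition.

    Each primitive is a small MLP outputting a Gaussian action distribution.
    When a harder subset of motions (or a new task such as fail-state recovery)
    is introduced, existing primitives are frozen and a fresh primitive plus a
    re-initialized composer are trained. Sizes here are placeholders.
    """

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.obs_dim, self.act_dim, self.hidden = obs_dim, act_dim, hidden
        self.primitives = nn.ModuleList()
        self.add_primitive()  # start with one primitive

    def add_primitive(self):
        for p in self.primitives:          # freeze everything learned so far
            p.requires_grad_(False)
        self.primitives.append(nn.Sequential(
            nn.Linear(self.obs_dim, self.hidden), nn.ReLU(),
            nn.Linear(self.hidden, 2 * self.act_dim)))   # mean and log-std
        self.composer = nn.Sequential(      # gating over the current primitive set
            nn.Linear(self.obs_dim, self.hidden), nn.ReLU(),
            nn.Linear(self.hidden, len(self.primitives)), nn.Softmax(dim=-1))

    def forward(self, obs):
        # obs: (batch, obs_dim)
        mus, sigmas = [], []
        for p in self.primitives:
            mu, log_std = p(obs).chunk(2, dim=-1)
            mus.append(mu)
            sigmas.append(log_std.clamp(-5, 2).exp())
        mus, sigmas = torch.stack(mus), torch.stack(sigmas)      # (P, B, A)
        w = self.composer(obs).transpose(0, 1).unsqueeze(-1)     # (P, B, 1)
        precision = (w / sigmas.pow(2)).sum(0)                   # product of Gaussians
        sigma = precision.rsqrt()
        mu = sigma.pow(2) * (w * mus / sigmas.pow(2)).sum(0)
        return mu, sigma                                         # composite distribution
```

Growing the primitive set for harder clips, rather than fine-tuning a single network, is what allows new skills to be added without overwriting the ones already learned.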
In this section, we visualize PHC's ability to imitate high-quality motion capture (MoCap) data on both seen and unseen sequences during training. All rendered SMPL meshes (bottom left) are produced from simulation results without any post-processing.
Here we compare the motion imitation capability of our controller with the state-of-the-art model, the universal humanoid controller (UHC). We use the mesh-based humanoid and the rotation-based imitator for a fair comparison. We compare with UHC both with and without residual force control (RFC). We can see that PHC (without using RFC) performs on par with UHC with RFC, while UHC without RFC exhibits jittery motion and falls.
PHC can be trained to control humanoids of different body shapes. Here we showcase 16 humanoids simulated with different genders (red: female; blue: male) and body proportions. Red spheres indicate the reference motion's joint positions, and the color gradient indicates weight.
Here we showcase PHC's ability to imitate dynamic motion from AMASS, such as high jumps, spin kicks, and cartwheels. Failure cases include backflips, the running high jump, etc. Notice that while a multi-clip PHC struggles to imitate these motions, we can overfit to them (last video).
Imitating motion estimated from monocular video or generated from language is significantly more challenging than imitating clean high-quality MoCap sequences. In this section, we showcase PHC's ability to imitate noisy motion and demonstrate real-time avatar use cases.
In this section, the red spheres/humanoid inside the simulation indicate the reference motion; the imitator's task is to track them.
We run HybrIK on the test split of the H36M dataset to extract motion estimates. Motion estimated from video can have severe foot sliding, floating, and physically implausible poses. We post-process these estimated motions with Gaussian smoothing to improve their stability (which is not possible for real-time applications). Here we showcase PHC's ability to imitate this noisy motion and compare it with the state-of-the-art, UHC.
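As a minimal sketch of the kind of temporal Gaussian smoothing referred to above (the array layout and filter width are assumptions, not our exact settings):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_motion(joints, sigma=2.0):
    """Temporally smooth estimated motion to reduce jitter before imitation.

    joints: array of shape (T, J, 3) -- per-frame 3D joint positions
            (or any per-frame pose parameters laid out along axis 0).
    sigma:  filter width in frames; larger values smooth more aggressively.
    """
    return gaussian_filter1d(joints, sigma=sigma, axis=0)
```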
Here we show PHC's ability to imitate live, real-time pose estimates from free-form videos. Both the input and the physics simulation visualized in the video are live screen recordings (the simulation can be seen on the captured monitor feed). We showcase both rotation-based and keypoint-based imitators, using rotation and keypoint estimates from HybrIK and MeTRAbs, respectively. We find that MeTRAbs provides more stable depth estimates, so we mainly use the keypoint-based imitator paired with MeTRAbs. We do not reset during any of the videos. Note that during the live demonstration, both the simulation and the pose estimation run at fluctuating framerates, which poses an additional challenge for our imitator. The slight delay between the video input and the simulation is caused by pose estimation latency. The slight misalignment between the rendered mesh and the simulation is caused by the framerate discrepancy between the simulation and the screen recording. Red spheres indicate keypoints estimated from video (for HybrIK, we perform forward kinematics on the estimated rotations).
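A rough outline of the live loop is sketched below. The `estimate_keypoints` callable and `imitator` object are hypothetical placeholders standing in for the real pose estimator and controller interfaces, which are not part of this page.

```python
import cv2  # webcam / screen capture

def run_live_demo(estimate_keypoints, imitator, camera_id=0):
    """Drive the simulated avatar from a live camera feed (illustrative only).

    estimate_keypoints: callable frame -> (J, 3) keypoints (e.g. wrapping MeTRAbs).
    imitator:           object whose step() advances the physics simulation
                        toward the given reference keypoints.
    """
    cap = cv2.VideoCapture(camera_id)
    keypoints = None
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        new_kp = estimate_keypoints(frame)   # may lag behind the simulation rate
        if new_kp is not None:
            keypoints = new_kp               # otherwise reuse the last estimate
        if keypoints is not None:
            imitator.step(keypoints)         # no resets, even on drops or occlusion
    cap.release()
```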
We also showcase a multi-person scenario, where we track two actors in the same scene. We use our keypoint-based imitator and MeTRAbs for this demo, and enable physically plausible human-to-human interaction by turning on humanoid-to-humanoid collision. Note that in the multi-person case, occlusions and identity switches are common, as can be seen in the flying keypoint estimates (red spheres).
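Conceptually, enabling humanoid-to-humanoid collision amounts to spawning both actors in the same collision group. A sketch in Isaac Gym terms follows; the `gym`, `env`, asset, and pose variables are assumed to be set up elsewhere and are placeholders here.

```python
# Sketch: spawning two humanoids that can collide with each other in Isaac Gym.
# `gym`, `env`, `humanoid_asset`, and the start poses come from the usual
# sim/env/asset setup, omitted here.
collision_group = 0    # same group -> the two actors collide with each other
collision_filter = 0   # 0 = no collision pairs masked off

actor_a = gym.create_actor(env, humanoid_asset, start_pose_a, "humanoid_a",
                           collision_group, collision_filter)
actor_b = gym.create_actor(env, humanoid_asset, start_pose_b, "humanoid_b",
                           collision_group, collision_filter)
```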
We use the motion diffusion model (MDM) to perform text-to-motion generation. Here we showcase imitating a single generated motion clip. Since MDM natively outputs 3D keypoints, we use our keypoint-based imitator in this section.
We can chain independently generated motion clips and use PHC for motion in-betweening. For each clip generated from text, we pad it by repeating its first frame 30 times to give PHC time to react.
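The padding itself is simple; here is a minimal sketch, assuming each clip is an array of per-frame poses or keypoints (the exact clip format is an assumption):

```python
import numpy as np

def chain_clips(clips, pad_frames=30):
    """Concatenate independently generated clips into one reference sequence.

    Each clip is padded by repeating its first frame `pad_frames` times, which
    holds the new target still long enough for the controller to transition.
    clips: list of arrays shaped (T_i, ...) -- per-frame poses or keypoints.
    """
    padded = []
    for clip in clips:
        hold = np.repeat(clip[:1], pad_frames, axis=0)   # repeat the first frame
        padded.append(np.concatenate([hold, clip], axis=0))
    return np.concatenate(padded, axis=0)
```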
Here we visualize PHC's ability to recover from fail-states. We study the fall state, the far state, and combined fall+far states. PHC can get up, walk back to reference motion that is far away (> 5 m, with no upper limit), and resume motion imitation. Red spheres indicate the reference motion.