Perpetual Humanoid Control for Real-time Simulated Avatars



Abstract

We present a physics-based humanoid controller that achieves high-fidelity motion imitation and fault-tolerant behavior in the presence of noisy input (e.g. pose estimates from video or generated from language) and unexpected falls. Our controller scales up to learning ten thousand motion clips without using any external stabilizing forces and learns to naturally recover from fail-state. Given reference motion, our controller can perpetually control simulated avatars without requiring resets. At its core, we propose the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn harder and harder motion sequences. PMCP allows efficient scaling for learning from large-scale motion databases and adding new tasks, such as fail-state recovery, without catastrophic forgetting. We demonstrate the effectiveness of our controller by using it to imitate noisy poses from video-based pose estimators and language-based motion generators in a live and real-time multi-person avatar use case.
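To give a sense of what such a policy can look like, below is a minimal sketch of a progressive multiplicative control policy: primitives are combined through a weighted product of Gaussians, and new primitives are appended for harder clips while previously trained ones are frozen. The class names, network sizes, and composition details here are illustrative assumptions, not the released implementation.

```python
# Sketch of a progressive multiplicative control policy (assumed structure):
# multiplicative (product-of-Gaussians) composition of primitives, with new
# capacity allocated by appending a primitive and freezing the existing ones.
import torch
import torch.nn as nn

class Primitive(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)                  # per-primitive action mean
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))   # per-primitive std

    def forward(self, obs):
        h = self.net(obs)
        return self.mu(h), self.log_sigma.exp()

class PMCPSketch(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.obs_dim, self.act_dim = obs_dim, act_dim
        self.primitives = nn.ModuleList([Primitive(obs_dim, act_dim)])
        self.gate = nn.Linear(obs_dim, 1)                     # rebuilt whenever a primitive is added

    def add_primitive(self):
        """Allocate new capacity for harder clips; freeze what is already learned."""
        for p in self.primitives:
            p.requires_grad_(False)
        self.primitives.append(Primitive(self.obs_dim, self.act_dim))
        self.gate = nn.Linear(self.obs_dim, len(self.primitives))

    def forward(self, obs):
        w = torch.softmax(self.gate(obs), dim=-1).unsqueeze(-1)    # (B, K, 1) primitive weights
        mus, sigmas = zip(*[p(obs) for p in self.primitives])
        mu = torch.stack(mus, dim=1)                               # (B, K, A)
        sigma = torch.stack(sigmas, dim=0).expand_as(mu)           # (B, K, A)
        # Weighted product of Gaussians -> composite action distribution.
        prec = w / sigma
        comp_sigma = 1.0 / prec.sum(dim=1)
        comp_mu = comp_sigma * (prec * mu).sum(dim=1)
        return comp_mu, comp_sigma
```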


  1. MoCap Motion Imitation
  2. Noisy Motion Imitation
  3. Fail-state Recovery


MoCap Motion Imitation

In this section, we visualize PHC's ability to imitate high-quality motion capture (MoCap) data on both seen and unseen sequences during training. All rendered SMPL meshes (bottom left) are produced from simulation results without any post-processing.

AMASS Train & Test

AMASS-Train-Belly dancing
AMASS-Train-Simple movement
AMASS-Test

Comparison with SOTA

Here we compare the motion imitation capability of our controller with the state-of-the-art model, the universal humanoid controller (UHC). We use the mesh-based humanoid and rotation-based imitator for a fair comparison. We compare with UHC both with and without residual force control (RFC). We can see that PHC (which does not use RFC) performs on par with UHC with RFC, while UHC without RFC exhibits jittery motion and falls.

Shape Variation

PHC can be trained to control humanoids of different body shapes. Here we showcase 16 humanoids simulated with different genders (red: female; blue: male) and body proportions. Red spheres indicate the reference motion's joint positions, and the color gradient indicates weight.

Motion imitation with various body shapes
Fail-state recovery with various body shapes. Inter-humanoid collision is disabled.

Hard/Failure Cases & Overfitting

Here we showcase PHC's ability to imitate dynamic motions from AMASS such as high jumps, spin kicks, and cartwheels. Failure cases include backflips, the running high jump, etc. Notice that while a multi-clip PHC struggles to imitate these motions, we can overfit to them by training on a single clip (last video).

Hard motion
Failure cases
Overfitting to one clip

Noisy Motion Imitation

Imitating motion estimated from monocular video or generated from language is significantly more challenging than imitating clean high-quality MoCap sequences. In this section, we showcase PHC's ability to imitate noisy motion and demonstrate real-time avatar use cases.

In this section, the red spheres/humanoid inside the simulation indicate the reference motion; the imitator's task is to track them.

Comparison with SOTA

We run HybrIK on the test split of the H36M dataset to extract motion estimates. Motion estimated from video can have severe foot sliding, floating, and physically implausible poses. We post-process these estimated motions with Gaussian smoothing to improve their stability (not possible for real-time applications). Here we showcase PHC's ability to imitate noisy motion and compare with the state-of-the-art, UHC.
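As a concrete illustration, below is a minimal sketch of this offline smoothing step; the array shapes and filter width (sigma) are assumptions rather than the exact values used in our experiments.

```python
# Offline Gaussian smoothing of video-based pose estimates along the time axis.
# This requires the full sequence, which is why it is omitted in the real-time demos.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_pose_sequence(poses: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """poses: (T, J, 3) per-frame joint positions (or axis-angle rotations).
    Smoothing is applied independently per joint and coordinate over time."""
    return gaussian_filter1d(poses, sigma=sigma, axis=0)

# Example: smooth a placeholder 300-frame, 24-joint estimate.
noisy = np.random.randn(300, 24, 3).astype(np.float32)
smoothed = smooth_pose_sequence(noisy, sigma=2.0)
```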

Imitating motion estimated from H36M: a few clips start from ill-formed pose estimates (1:24, 2:09), from which our PHC can quickly recover.
Comparing fail-state recovery strategies: using the kinematic pose to reset (0:05) can lead to a vicious cycle of reset-fall-reset when the kinematic pose estimate is unreliable.

Real-time Single-person Avatar from Video

Here we show PHC's ability to imitate real-time, live pose estimates from free-form videos. Both the input and the physics simulation visualized in the video are live screen recordings (the simulation can be seen on the captured monitor feed). We showcase both rotation-based and keypoint-based imitators, using rotation and keypoint estimates from HybrIK and MeTRAbs, respectively. We find that MeTRAbs provides more stable depth estimates, so we mainly focus on the keypoint-based imitator paired with MeTRAbs. We do not reset during any of the videos. Note that during the live demonstration, both the simulation and the pose estimation have fluctuating framerates, which poses an additional challenge to our imitator. The slight delay between the video input and the simulation is caused by pose estimation latency, and the slight misalignment between the rendered mesh and the simulation is caused by the discrepancy between the simulation and screen recording framerates. Red spheres indicate keypoints estimated from video (for HybrIK, we perform forward kinematics on the estimated rotations).
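For reference, the sketch below outlines the kind of per-frame loop assumed in these live demos: each captured frame is converted to 3D keypoints, and the latest estimate serves as the tracking target for the keypoint-based imitator. The `estimate_keypoints`, `policy`, and `sim` callables are hypothetical placeholders, not the released interfaces.

```python
# Sketch of a live avatar loop (hypothetical interfaces): camera frame ->
# keypoint estimate -> imitator action -> simulation step, with no resets.
import time
import cv2  # webcam capture

def run_live_avatar(policy, sim, estimate_keypoints, camera_id=0, sim_dt=1.0 / 30.0):
    """`policy` (keypoint-based imitator), `sim` (physics simulation), and
    `estimate_keypoints` (e.g., a MeTRAbs wrapper) are hypothetical callables."""
    cap = cv2.VideoCapture(camera_id)
    target = None
    while cap.isOpened():
        tic = time.time()
        ok, frame = cap.read()
        if not ok:
            break
        keypoints = estimate_keypoints(frame)   # may return None on a missed detection
        if keypoints is not None:
            target = keypoints                  # keep tracking the latest valid estimate
        if target is not None:
            obs = sim.observe(target)           # humanoid state relative to reference keypoints
            sim.step(policy(obs))               # no resets during the whole session
        # Best-effort pacing; in practice both estimator and simulator framerates fluctuate.
        time.sleep(max(0.0, sim_dt - (time.time() - tic)))
    cap.release()
```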

Rotation-based imitator + HybrIK
Keypoint-based imitator + MeTRAbs | Simple motion
Keypoint-based imitator + MeTRAbs | Dynamic motion including jumping, crouching, single-leg standing, etc.
Keypoint-based imitator + MeTRAbs | Dynamic motion including a jump with rotation (0:30), human-object interaction (2:11), a missing person (2:30), etc.

Real-time Physics-based Multi-Person Interaction from Video

We also showcase a multi-person scenario, where we track two actors in the same scene. We use our keypoint-based imitator and MeTRAbs for this demo. We also enable physically plausible human-to-human interaction by turning on humanoid-to-humanoid collision. Note that in the multi-person case, occlusion and identity switches are common, as can be seen in the flying keypoint estimates (red spheres).

Simple motion
Human-to-human interaction. At 2:01, the two humanoids get entangled due to an identity switch and later recover.

Avatars from Language - Single Clip

We use the motion diffusion model (MDM) to perform text-to-motion generation. Here we showcase imitating a single clip of generated motion. Since MDM natively outputs 3D keypoints, we use our keypoint-based imitator in this section.

"A person walks backwards."
"A person crouches down and then jumps up."
"A person kicks"
"A person punches in a manner consistent with martial arts."
"A person is skipping rope."
"A person turns to his right and paces back and forth."
"A person walks forward, bends down to pick something up off the ground. "
"A person jumps up and down."
"A throws a ball."

Avatars from Language - Multiple Clips

We can chain together independently generated motion clips and use PHC for motion in-betweening. For each clip generated from text, we pad it by repeating its first frame 30 times to allow PHC time to react.
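A minimal sketch of this padding-and-chaining step is shown below; the keypoint array shapes (22 joints) and the helper name are illustrative assumptions.

```python
# Chain independently generated clips into one reference sequence, padding each
# clip by repeating its first frame 30 times so the controller can transition
# from wherever the previous clip ended to the start of the next one.
import numpy as np

def chain_clips(clips, pad_frames=30):
    """clips: list of (T_i, J, 3) keypoint arrays generated from text prompts."""
    padded = []
    for clip in clips:
        hold = np.repeat(clip[:1], pad_frames, axis=0)   # repeat the first frame
        padded.append(np.concatenate([hold, clip], axis=0))
    return np.concatenate(padded, axis=0)                # one long reference sequence

# Example with two placeholder clips of 60 and 90 frames, 22 joints each.
clips = [np.zeros((60, 22, 3)), np.zeros((90, 22, 3))]
reference = chain_clips(clips)   # shape: (30 + 60 + 30 + 90, 22, 3)
```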

"A person waves."
"A person crouches down to pick up an object."
"A person is very happy and flops onto the ground."
"A person kicks up."
"A person talks on the phone."
"A person sits down to the ground. "
"A person walks forward while rasing hands."
"A person is dancing."
"A person is standing on one leg."

Fail-state Recovery

Here we visualize PHC's ability to recover from fail-states. We study the fall-state, far-state, and fall+far state. PHC can get up, walk back to reference motion that is far away (> 5 m, with no upper limit), and resume motion imitation. Red spheres indicate the reference motion.
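As an illustration, the sketch below flags a frame as a fail-state when the mean distance between the simulated and reference joints exceeds a threshold; the 0.5 m value and array shapes are assumptions for illustration, not necessarily the exact criterion used by PHC.

```python
# Sketch of a fail-state check: the controller is considered to have failed when
# the average simulated-vs-reference joint distance grows beyond a threshold,
# after which it switches to recovery (get up, walk back) and resumes imitation.
import numpy as np

def is_fail_state(sim_joints: np.ndarray, ref_joints: np.ndarray, threshold: float = 0.5) -> bool:
    """sim_joints, ref_joints: (J, 3) world-space joint positions for one frame."""
    mean_error = np.linalg.norm(sim_joints - ref_joints, axis=-1).mean()
    return mean_error > threshold
```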

Fall-state
Far-state, including > 5 m (0:45)
Fall+far state