TSTMotion: Training-free Scene-aware Text-to-motion Generation

Anonymous Authors

Abstract

Text-to-motion generation has recently attracted significant research interest, but most work focuses on generating human motion sequences against blank backgrounds. Human motions, however, commonly occur within diverse 3D scenes, which has prompted exploration of scene-aware text-to-motion generation. Despite notable progress, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which are expensive to collect. To mitigate this challenge, we propose the first Training-free Scene-aware Text-to-Motion framework, dubbed TSTMotion, which efficiently equips pre-trained blank-background motion generators with scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we use foundation models jointly to reason about, predict, and validate scene-aware motion guidance. The motion guidance is then incorporated into the blank-background motion generators through two modifications, yielding scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework.

Method

Overview of the proposed training-free TSTMotion framework, given a text description $d$ and a 3D scene $S_{3D}$. First, the Scene Compiler extracts the spatial auxiliary from $S_{3D}$. Based on the spatial auxiliary, the Motion Planner combines the text description with well-designed prompt templates to infer the motion guidance $s[M_{mask}]$. Equipped with $s[M_{mask}]$, the Aligned Motion Diffusion Model predicts an initial scene-aware text-driven motion sequence $m$ using two training-free modifications. Finally, the Motion Checker iteratively refines $m$ to produce a final sequence that better aligns with $d$ and $S_{3D}$.
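The control flow of the four stages can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `run_pipeline`, the callable signatures, the `max_rounds` cap, and the idea that the Motion Checker returns a textual feedback string that is passed back to the Motion Planner are all assumptions made for illustration. The overview only states that refinement is iterative.

```python
from typing import Any, Callable, Tuple


def run_pipeline(
    text: str,
    scene: Any,
    scene_compiler: Callable[[Any], Any],
    motion_planner: Callable[[str, Any, str], Any],
    motion_generator: Callable[[str, Any], Any],
    motion_checker: Callable[[Any, str, Any], Tuple[bool, str]],
    max_rounds: int = 3,  # hypothetical cap on refinement rounds
) -> Any:
    """Sketch of the four-stage loop; every stage is a user-supplied callable."""
    # Stage 1: extract the spatial auxiliary from the 3D scene (done once).
    spatial_aux = scene_compiler(scene)

    feedback = ""  # assumed feedback channel from checker to planner
    motion = None
    for _ in range(max_rounds):
        # Stage 2: infer the motion guidance from text + spatial auxiliary.
        guidance = motion_planner(text, spatial_aux, feedback)
        # Stage 3: generate a motion sequence conditioned on the guidance.
        motion = motion_generator(text, guidance)
        # Stage 4: check alignment with the text and scene; stop if accepted.
        accepted, feedback = motion_checker(motion, text, spatial_aux)
        if accepted:
            break
    # Returns the last generated motion, accepted or not.
    return motion
```

Passing the stages in as callables keeps the sketch independent of any particular foundation model or motion generator, which mirrors the training-free, plug-in nature of the framework described above.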

Complete Algorithm of the Aligned Motion Diffusion Model


With this algorithm, scene-aware text-driven motion sequences are generated in a training-free manner.

Segmentation Results

Segmentation results taken directly from OpenIns3D.

Motion Guidance Results

Demonstration of the motion guidance output by the Motion Planner given different text prompts.

Scene-Aware Text-Driven Motion Results

Stand up from the couch

Sit on the toilet

Balance on the table

Box in front of the punching bag

Dance happily in front of the mirror

Do a handstand in front of the chair

Fly kick with his right leg on the door

Walk as a chicken towards the barn

Walk in a counterclockwise circle near the campfire

Walk to the couch