Human reaction generation is a significant research area for interactive AI, since humans constantly interact with their surroundings. Previous works focus mainly on synthesizing reactive motion given a human motion sequence. This paradigm restricts interactions to the human-human category and ignores emotions that may influence reaction generation. In this work, we propose generating 3D human reactions from RGB videos, which cover a wider range of interaction categories and naturally provide expression cues that may reflect the subject's emotions. To address this task, we present HERO, a simple yet powerful framework for Human rEaction geneRation from videOs. HERO considers both global and frame-level local representations of the video to extract the interaction intention, and then uses the extracted intention to guide the synthesis of the reaction. In addition, local visual representations are continuously injected into the model to maximally exploit the dynamic properties inherent in videos. Furthermore, we collect the ViMo dataset of paired Video-Motion data to support the task. Beyond human-human interactions, these video-motion pairs also cover animal-human and scene-human interactions. Extensive experiments demonstrate the superiority of our method.
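As a rough illustration of the two levels of video representation mentioned above, the minimal PyTorch-style sketch below applies a per-frame image backbone to obtain frame-level local features and pools them into a global feature. The names `VideoRepresentation` and `frame_encoder`, as well as all shapes, are illustrative assumptions and not HERO's actual components.

```python
import torch
import torch.nn as nn

class VideoRepresentation(nn.Module):
    """Illustrative extraction of global and frame-level local video features.

    `frame_encoder` stands in for any pretrained per-frame image backbone
    (assumed to map a batch of frames to (N, feat_dim) features); it is an
    assumption, not the encoder used in the paper.
    """

    def __init__(self, frame_encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.frame_encoder = frame_encoder
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, video: torch.Tensor):
        # video: (B, T, C, H, W) -- a batch of T-frame RGB clips
        b, t, c, h, w = video.shape
        frames = video.view(b * t, c, h, w)
        local = self.frame_encoder(frames).view(b, t, -1)  # frame-level local features
        global_feat = self.proj(local.mean(dim=1))          # pooled global representation
        return global_feat, local
```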
Our framework consists of three modules: the video encoder extracts visual representations from the input video; the motion VQ-VAE learns a mapping between raw motion and discrete code sequences; and the reaction generation module extracts the interaction intention from the visual representations and uses it to guide the generation of code indices. During training, both the video and the GT reactive motion are fed into HERO; at inference, only the video is provided. Note that the residual motion refinement is omitted from the figure for clarity.
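For intuition only, the following hedged sketch shows one way such a generation module could condition the prediction of motion VQ-VAE code indices on the extracted intention while keeping the frame-level features in the decoder memory. Module names, sizes, and the attention-based intention extraction are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ReactionGenerator(nn.Module):
    """Hedged sketch: predict motion VQ-VAE code indices conditioned on an
    interaction-intention token and frame-level visual features. All names
    and hyperparameters are illustrative assumptions."""

    def __init__(self, feat_dim=512, codebook_size=1024, n_layers=6, n_heads=8):
        super().__init__()
        self.intention = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.code_emb = nn.Embedding(codebook_size + 1, feat_dim)  # +1 for a start token
        layer = nn.TransformerDecoderLayer(feat_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(feat_dim, codebook_size)

    def forward(self, global_feat, local_feats, code_ids):
        # global_feat: (B, D); local_feats: (B, T, D); code_ids: (B, L) GT indices (teacher forcing)
        # Read an interaction-intention token by attending from the global
        # feature to the frame-level local features.
        intent, _ = self.intention(global_feat.unsqueeze(1), local_feats, local_feats)
        # Keep the local visual features in the memory so dynamic video
        # information is injected at every decoding layer.
        memory = torch.cat([intent, local_feats], dim=1)
        tgt = self.code_emb(code_ids)
        L = tgt.size(1)
        causal_mask = torch.triu(
            torch.full((L, L), float("-inf"), device=tgt.device), diagonal=1
        )
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.head(hidden)  # (B, L, codebook_size) logits over VQ code indices
```

At inference, the predicted indices would be decoded back to motion by the VQ-VAE decoder; that step is omitted here.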
The ViMo dataset contains 32 subcategorized interactions, each belonging to one of three broad categories: human-human, animal-human, and scene-human interactions. Among them, the human-human interactions cover daily socializing, sports, physical confrontations, and more.
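For illustration only, one paired video-motion sample might be organized as below; the field names, shapes, and example file path are hypothetical and do not describe the released ViMo format.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical layout of one paired Video-Motion sample (not the actual ViMo schema).
@dataclass
class ViMoSample:
    video_path: str        # RGB video of the interacting party (human, animal, or scene event)
    reaction: np.ndarray   # (T, J, 3) 3D joint positions of the reacting human
    category: str          # "human-human" | "animal-human" | "scene-human"
    subcategory: str       # one of the 32 fine-grained interaction labels

sample = ViMoSample(
    video_path="videos/handshake_001.mp4",           # hypothetical path
    reaction=np.zeros((120, 22, 3), dtype=np.float32),
    category="human-human",
    subcategory="handshake",
)
```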
Visual comparisons among different methods on three distinct videos from the ViMo test set.
Visualized generation results on the Unseen set. Note that data from the subcategories to which the videos in the figure belong are not used when training HERO on the Seen set.
@article{yu2025hero,
title={HERO: Human Reaction Generation from Videos},
author={Yu, Chengjun and Zhai, Wei and Yang, Yuhang and Cao, Yang and Zha, Zheng-Jun},
journal={arXiv preprint arXiv:2503.08270},
year={2025}
}