CLOPS: Vision-Driven Avatar Motion Generation

Human-like motion requires human-like perception. We introduce CLOPS, the first human avatar that uses solely egocentric vision to perceive its surroundings and navigate 3D scenes. Rather than relying on task-specific perception methods or privileged state information, we argue that human-like avatar behavior requires human-like perception.
Our approach decouples learning into two stages. First, we train a data-driven low-level motion prior on large-scale motion capture data. Then, we use Q-learning to train a policy that maps egocentric visual observations to motion control commands, closing a continuous perception-motion loop. We demonstrate that egocentric vision gives rise to human-like motion characteristics: avatars naturally avoid obstacles that enter their visual field, suggesting that this sensor-aligned approach holds promise for developing avatars with more natural, human-like behaviors.
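As a rough illustration of the second stage, the sketch below shows a Q-network over egocentric frames and a single temporal-difference update, assuming the policy selects among a discrete set of candidate motion commands. The class names, network architecture, and shapes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the Q-learning stage (assumed PyTorch-style interfaces).
# QNetwork, the CNN architecture, and the discrete candidate set are
# illustrative assumptions, not the authors' actual code.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an egocentric image to Q-values over a discrete set of
    candidate motion commands (e.g. target head poses)."""
    def __init__(self, num_candidates: int):
        super().__init__()
        self.encoder = nn.Sequential(                  # small CNN over the egocentric view
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_candidates)

    def forward(self, ego_image: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(ego_image))     # (B, num_candidates)

def td_update(q_net, target_net, optimizer, obs, action, reward, next_obs,
              gamma: float = 0.99) -> float:
    """One temporal-difference update on a batch of transitions (o, a, r, o')."""
    q_sa = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = reward + gamma * target_net(next_obs).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```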
CLOPS combines human-like motion with human-like perception. Given a goal in the scene (represented by a red sphere), CLOPS navigates to it using only first-person vision – no maps, no privileged state, just looking.
The system integrates a data-driven low-level motion prior with a Q-learning policy, creating a feedback loop between visual perception and motion generation.
The Q-network processes egocentric observations at 1 Hz, predicting target head poses (visualized as coordinate frames). The motion generation network then produces natural movements that reach these targets, closing the continuous perception-motion loop.
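The inference-time perception-motion loop can be sketched as follows. The objects passed in (the scene renderer, the motion prior, and the list of candidate head-pose targets) are hypothetical interfaces assumed for illustration; only the 1 Hz decision rate and the alternation between looking and moving follow the description above.

```python
# Hedged sketch of the runtime perception-motion loop. All method names on
# scene and motion_prior are assumed interfaces, not the authors' API.
import torch

FPS = 30            # assumed motion frame rate
REPLAN_HZ = 1       # the Q-network is queried once per second

def navigate(q_net, motion_prior, scene, avatar_state, candidate_poses,
             max_seconds: int = 30):
    """Alternate between looking (1 Hz decisions) and moving (FPS frames)."""
    frames = []
    for _ in range(max_seconds * REPLAN_HZ):
        # 1) Perceive: render the avatar's current egocentric view.
        ego_image = scene.render_egocentric(avatar_state)   # (1, 3, H, W) tensor

        # 2) Decide: pick the candidate target head pose with the highest Q-value.
        with torch.no_grad():
            action = q_net(ego_image).argmax(dim=1).item()
        target_head_pose = candidate_poses[action]

        # 3) Act: the motion prior fills in one second of motion toward the target.
        for _ in range(FPS // REPLAN_HZ):
            avatar_state = motion_prior.step(avatar_state, target_head_pose)
            frames.append(avatar_state)

        if scene.goal_reached(avatar_state):
            break
    return frames
```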
@article{diomataris2025clops,
title = {Moving by Looking: Towards Vision-Driven Avatar Motion Generation},
author = {Diomataris, Markos and Albaba, Berat Mert and Becherini, Giorgio and Ghosh, Partha and Taheri, Omid and Black, Michael J.},
journal = {arXiv preprint arXiv:2509.19259},
year = {2025},
}