CLOPS: Vision-Driven Avatar Motion Generation

Human-like motion requires human-like perception. We introduce CLOPS, the first human avatar that uses solely egocentric vision to perceive its surroundings and navigate 3D scenes. Rather than relying on task-specific perception methods or privileged state information, we argue that human-like avatar behavior requires human-like perception.
Our approach decouples learning into two stages. First, we train a data-driven low-level motion prior on large-scale motion capture data. Then, we use Q-learning to train a policy that maps egocentric visual observations to motion control commands, closing a continuous perception-motion loop. We demonstrate that egocentric vision gives rise to human-like motion characteristics: avatars naturally avoid obstacles that enter their visual field, suggesting that this sensor-aligned approach holds promise for developing avatars with more natural, human-like behaviors.
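As a rough illustration of the second stage, the sketch below shows a Q-network over egocentric frames and a single temporal-difference update, assuming the policy selects among a discrete set of candidate motion commands. The class names, network architecture, and shapes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the Q-learning stage (assumed PyTorch-style interfaces).
# QNetwork, the CNN architecture, and the discrete candidate set are
# illustrative assumptions, not the authors' actual code.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an egocentric image to Q-values over a discrete set of
    candidate motion commands (e.g. target head poses)."""
    def __init__(self, num_candidates: int):
        super().__init__()
        self.encoder = nn.Sequential(                  # small CNN over the egocentric view
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_candidates)

    def forward(self, ego_image: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(ego_image))     # (B, num_candidates)

def td_update(q_net, target_net, optimizer, obs, action, reward, next_obs,
              gamma: float = 0.99) -> float:
    """One temporal-difference update on a batch of transitions (o, a, r, o')."""
    q_sa = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = reward + gamma * target_net(next_obs).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```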
CLOPS combines human-like motion with human-like perception. Given a goal in the scene (represented by a red sphere), CLOPS navigates to it using only first-person vision – no maps, no privileged state, just looking.
The system integrates a data-driven low-level motion prior with a Q-learning policy, creating a feedback loop between visual perception and motion generation.
The Q-network processes egocentric observations at 1 Hz, predicting target head poses (visualized as coordinate frames). The motion generation network then produces natural movements that reach these targets, closing the continuous perception-motion loop.
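The inference-time perception-motion loop can be sketched as follows. The objects passed in (the scene renderer, the motion prior, and the list of candidate head-pose targets) are hypothetical interfaces assumed for illustration; only the 1 Hz decision rate and the alternation between looking and moving follow the description above.

```python
# Hedged sketch of the runtime perception-motion loop. All method names on
# scene and motion_prior are assumed interfaces, not the authors' API.
import torch

FPS = 30            # assumed motion frame rate
REPLAN_HZ = 1       # the Q-network is queried once per second

def navigate(q_net, motion_prior, scene, avatar_state, candidate_poses,
             max_seconds: int = 30):
    """Alternate between looking (1 Hz decisions) and moving (FPS frames)."""
    frames = []
    for _ in range(max_seconds * REPLAN_HZ):
        # 1) Perceive: render the avatar's current egocentric view.
        ego_image = scene.render_egocentric(avatar_state)   # (1, 3, H, W) tensor

        # 2) Decide: pick the candidate target head pose with the highest Q-value.
        with torch.no_grad():
            action = q_net(ego_image).argmax(dim=1).item()
        target_head_pose = candidate_poses[action]

        # 3) Act: the motion prior fills in one second of motion toward the target.
        for _ in range(FPS // REPLAN_HZ):
            avatar_state = motion_prior.step(avatar_state, target_head_pose)
            frames.append(avatar_state)

        if scene.goal_reached(avatar_state):
            break
    return frames
```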
@article{diomataris2025clops,
title = {Moving by Looking: Towards Vision-Driven Avatar Motion Generation},
author = {Diomataris, Markos and Albaba, Berat Mert and Becherini, Giorgio and Ghosh, Partha and Taheri, Omid and Black, Michael J.},
journal = {arXiv preprint arXiv:2509.19259},
year = {2025},
}