CLUTCH: Text-conditioned hand motion generation in the wild

Hands play a central role in daily life, yet modeling natural hand motion remains underexplored. Existing methods for text-to-hand-motion generation and hand animation captioning rely on studio-captured datasets with limited actions and contexts, which makes them costly to scale to in-the-wild settings. Moreover, contemporary models and their training schemes struggle to achieve both animation fidelity and text-motion alignment.
To address this, we introduce 3D Hands in the Wild (3D-HIW), a dataset of 32K 3D hand-motion sequences with aligned text descriptions, and propose CLUTCH, an LLM-based hand animation system with two critical innovations: (1) SHIFT, a novel part-modality decomposed VQ-VAE architecture for improved hand motion tokenization, and (2) a geometric refinement stage that fine-tunes the LLM. CLUTCH establishes the first benchmark for scalable in-the-wild hand motion modelling, demonstrating strong results in both directions: text-to-motion and motion-to-text.
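To make the tokenization idea concrete, below is a minimal sketch (not the authors' released code) of a part-modality decomposed VQ-VAE in PyTorch: each stream of the hand representation, here an illustrative split into global wrist motion and finger articulation, gets its own encoder, codebook, and decoder, and the resulting codebook indices serve as discrete motion tokens an LLM could consume. All names, dimensions, and the two-stream split are assumptions for illustration.

# Minimal sketch, assuming a MANO-style parameter split into hypothetical
# (part, modality) streams; not the authors' SHIFT implementation.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, num_codes: int, code_dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):                                  # z: (B, T, code_dim)
        flat = z.reshape(-1, z.shape[-1])
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx).view_as(z)
        # Codebook loss pulls codes toward encodings; commitment loss (0.25) does the reverse.
        vq_loss = ((z_q - z.detach()) ** 2).mean() + 0.25 * ((z_q.detach() - z) ** 2).mean()
        z_q = z + (z_q - z).detach()                       # straight-through gradient
        return z_q, idx.view(z.shape[:-1]), vq_loss


class PartModalityVQVAE(nn.Module):
    """One small VQ-VAE branch per (part, modality) stream; the per-stream
    codebook indices are the discrete motion tokens."""

    def __init__(self, stream_dims: dict, latent_dim: int = 128, num_codes: int = 512):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, latent_dim), nn.ReLU(),
                                nn.Linear(latent_dim, latent_dim))
            for name, d in stream_dims.items()})
        self.quantizers = nn.ModuleDict({name: VectorQuantizer(num_codes, latent_dim)
                                         for name in stream_dims})
        self.decoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                nn.Linear(latent_dim, d))
            for name, d in stream_dims.items()})

    def forward(self, streams: dict):                      # each value: (B, T, dim)
        recon, tokens, vq_loss = {}, {}, 0.0
        for name, x in streams.items():
            z_q, idx, loss = self.quantizers[name](self.encoders[name](x))
            recon[name] = self.decoders[name](z_q)
            tokens[name] = idx
            vq_loss = vq_loss + loss
        return recon, tokens, vq_loss


if __name__ == "__main__":
    # Hypothetical split: 3-DoF global wrist rotation vs. 45-DoF finger articulation.
    model = PartModalityVQVAE({"wrist": 3, "fingers": 45})
    batch = {"wrist": torch.randn(2, 60, 3), "fingers": torch.randn(2, 60, 45)}
    recon, tokens, vq_loss = model(batch)
    print({k: v.shape for k, v in tokens.items()}, vq_loss.item())

Decomposing the codebooks per part and modality keeps each codebook small and lets the language model treat wrist trajectory and finger articulation as separate token streams; the actual SHIFT architecture may define and fuse its streams differently.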
CLUTCH synthesizes and captions in-the-wild 3D hand motions through three key steps.
@inproceedings{thambiraja2026clutch,
title = {{CLUTCH}: Contextualized Language Model for Unlocking Text-Conditioned Hand Motion Modelling in the Wild},
author = {Thambiraja, Balamurugan and Taheri, Omid and Danecek, Radek and Becherini, Giorgio and Pons-Moll, Gerard and Thies, Justus},
booktitle = {The Fourteenth International Conference on Learning Representations},
year = {2026},
url = {https://openreview.net/forum?id=W7YRskO47j}
}