Omid Taheri
NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
Abstract Acquiring physically plausible motor skills across diverse and unconventional morphologies, from humanoids to ants, is crucial for robotics and simulation. We introduce No-data Imitation Learning (NIL), which generates a reference video with a pretrained video diffusion model from a single simulation frame and a text prompt.
Mert Albaba, Chenhao Li, Markos Diomataris, Omid Taheri, Andreas Krause, Michael Black
Cite · DOI · PDF · arXiv
CVPR 2026
HaPTIC: Predicting 4D Hand Trajectory from Monocular Videos
Abstract We present HaPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space.
Yufei Ye, Yao Feng, Omid Taheri, Haiwen Feng, Shubham Tulsiani, Michael J. Black
Cite · Project · DOI · arXiv · Code
3DV 2026
NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image
Abstract Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments. Due to the lack of real-world pattern-image paired data, prior approaches fine-tune large vision-language models (VLMs) on synthetic garment datasets, limiting their generalization to real-world images.
Anna Badalyan, Pratheba Selvaraju, Giorgio Becherini, Omid Taheri, Victoria Fernandez Abrevaya, Michael Black
arXiv · PDF
CLUTCH: Contextualized Language Model for Unlocking Text-Conditioned Hand Motion Modelling in the Wild
Abstract Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to in-the-wild settings.
Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies
arXiv · PDF · Project · Code
ICLR 2026
FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion
Abstract Hands are central to interacting with our surroundings and conveying gestures, making their inclusion essential for full-body motion synthesis. Yet existing methods either ignore hands entirely or operate under highly constrained conditions.
Enes Duran, Nikos Athanasiou, Muhammed Kocabas, Michael J. Black, Omid Taheri
arXiv · PDF
Moving by Looking: Towards Vision-Driven Avatar Motion Generation
Abstract Human-like motion requires human-like perception. We introduce CLOPS, the first human avatar that uses only egocentric vision to perceive its surroundings and navigate 3D scenes, rather than relying on task-specific perception methods or privileged state information.
Markos Diomataris, Berat Mert Albaba, Giorgio Becherini, Partha Ghosh, Omid Taheri, Michael J. Black
arXiv · PDF · Video · Project
Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions
Abstract While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lack the ability to physically interact with the environment due to their kinematic nature.
Li Siyao, Yao Feng, Omid Taheri, Chen Change Loy, Michael J. Black
arXiv · PDF · Project
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Abstract We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes.
Sai Kumar Dwivedi, Dimitrije Antić, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, Dimitrios Tzionas
Cite · PDF · arXiv · Video · Code
CVPR 2025
Humanity's Last Exam: A Multi-Modal Benchmark at the Frontier of Human Knowledge
Abstract Benchmarks are essential for tracking rapid LLM progress—but today’s models exceed 90% on tasks like MMLU, saturating existing exams. We introduce Humanity’s Last Exam (HLE), a multi-modal, closed-ended benchmark spanning 2,500 questions across 100+ subjects at the frontier of human knowledge.
Omid Taheri, & Many Others
Cite · PDF · arXiv · Data · Code · SEAL LLM Leaderboards
CHOIR: A Versatile and Differentiable Hand-Object Interaction Representation
Abstract Synthesizing accurate hand-object interactions (HOI) is critical for AR/VR and vision tasks. Existing dense-correspondence methods improve contact fidelity but lack full differentiability or generality. We propose CHOIR, a versatile, fully differentiable interaction field.
Théo Morales, Omid Taheri, Gerard Lacey
Cite · Paper · arXiv · Code