Omid Taheri
Publications
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Abstract: We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes.
Sai Kumar Dwivedi, Dimitrije Antić, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, Dimitrios Tzionas
Cite · PDF · arXiv · Video · Code
CVPR 2025
Humanity's Last Exam: A Multi-Modal Benchmark at the Frontier of Human Knowledge
Abstract: Benchmarks are essential for tracking rapid LLM progress, but today's models exceed 90% on tasks like MMLU, saturating existing exams. We introduce Humanity's Last Exam (HLE), a multi-modal, closed-ended benchmark spanning 2,500 questions across 100+ subjects at the frontier of human knowledge.
Omid Taheri & Many Others
Cite · PDF · arXiv · Data · Code
SEAL LLM Leaderboards
NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
Abstract: Acquiring physically plausible motor skills across diverse and unconventional morphologies, from humanoids to ants, is crucial for robotics and simulation. We introduce No-data Imitation Learning (NIL), which generates a reference video with a pretrained video diffusion model from a single simulation frame and a text prompt.
Mert Albaba, Chenhao Li, Markos Diomataris, Omid Taheri, Andreas Krause, Michael Black
Cite · DOI · PDF · arXiv
HaPTIC: Predicting 4D Hand Trajectory from Monocular Videos
Abstract: We present HaPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space.
Yufei Ye, Yao Feng, Omid Taheri, Haiwen Feng, Shubham Tulsiani, Michael J. Black
Cite · Project · DOI · arXiv · Code
CHOIR: A Versatile and Differentiable Hand-Object Interaction Representation
Abstract: Synthesizing accurate hand-object interactions (HOI) is critical for AR/VR and vision tasks. Existing dense-correspondence methods improve contact fidelity but lack full differentiability or generality. We propose CHOIR, a versatile, fully differentiable interaction field (see the sketch below).
Théo Morales, Omid Taheri, Gerard Lacey
Cite · Paper · arXiv · Code
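
The paper defines CHOIR's field precisely; as a rough illustration of what a fully differentiable hand-object interaction representation buys you, here is a minimal PyTorch sketch. All names, thresholds, and the soft-min formulation are illustrative assumptions, not CHOIR's actual definition:

```python
import torch

def soft_contact_field(hand_verts, obj_points, temp=0.01, contact_r=0.005):
    """Toy differentiable interaction field: per hand vertex, a smooth
    distance to the object surface and a soft contact weight in [0, 1].
    Illustrative only -- not CHOIR's actual formulation."""
    d = torch.cdist(hand_verts, obj_points)                   # (H, O) pairwise distances
    soft_min = -temp * torch.logsumexp(-d / temp, dim=1)      # smooth min distance per vertex
    contact_w = torch.sigmoid((contact_r - soft_min) / temp)  # ~1 near the surface, ~0 far away
    return soft_min, contact_w

# Because every op above is differentiable, the field can drive optimization:
hand = torch.randn(778, 3, requires_grad=True)   # e.g., a MANO-sized vertex set
obj = torch.rand(2048, 3)                        # points sampled on an object surface
dist, w = soft_contact_field(hand, obj)
(w * dist).sum().backward()                      # gradients flow back to the hand vertices
```

The point of such a field is that contact fidelity becomes a loss term to backpropagate through, rather than a post-hoc test.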
CWGrasp: 3D Whole-Body Grasp Synthesis with Directional Controllability
Abstract: Synthesizing 3D whole bodies that realistically grasp objects is crucial for animation, mixed reality, and robotics. Key challenges include natural coordination between hand, body, and environment, and the scarcity of training data.
Georgios Paschalidis, Romana Wilschut, Dimitrije Antić, Omid Taheri, Dimitrios Tzionas
Cite · arXiv · Video · Code
3DV 2025
HUMOS: Human Motion Model Conditioned on Body Shape
Abstract: Generating realistic human motion is crucial for many computer vision and graphics applications. The rich diversity of human body shapes and sizes significantly influences how people move. However, existing motion models typically overlook these differences, using a normalized, average body instead.
Shashank Tripathi, Omid Taheri, Christoph Lassner, Michael Black, Daniel Holden, Carsten Stoll
Cite · PDF · arXiv · Video · Code
ECCV 2024
WANDR: Intention-guided Human Motion Generation
Abstract: Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness.
Markos Diomataris, Nikos Athanasiou, Omid Taheri, Xi Wang, Otmar Hilliges, Michael J. Black
Cite · PDF · arXiv · Video · Code · Project
CVPR 2024
GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency
Abstract: Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality.
Omid Taheri, Yi Zhou, Dimitrios Tzionas, Yang Zhou, Duygu Ceylan, Soren Pirk, Michael J. Black
Cite · PDF · arXiv · Video · Code · Poster · Project
3DV 2024
InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction
Abstract: Humans constantly interact with objects to accomplish tasks. To understand such interactions, computers need to reconstruct these in 3D from images of whole bodies manipulating objects, e.g., for grasping, moving, and using the latter.
Yinghao Huang, Omid Taheri, Michael J. Black, Dimitrios Tzionas
Cite · PDF · Paper · Video · Code · Data · Project
IJCV 2024
ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation
Abstract: Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines.
Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, Otmar Hilliges
Cite · PDF · arXiv · Video · Code/Competition · Data · Project
CVPR 2023
IPMAN: 3D Human Pose Estimation via Intuitive Physics
Abstract: The estimation of 3D human body shape and pose from images has advanced rapidly. While the results are often well aligned with image features in the camera view, the 3D pose is often physically implausible; bodies lean, float, or penetrate the floor (a toy stability check is sketched below).
Shashank Tripathi, Lea Müller, Chun-Hao P. Huang, Omid Taheri, Michael Black, Dimitrios Tzionas
Cite · PDF · Video · Code · Data (MoYo) · Poster · Project
CVPR 2023
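
To make "physically implausible" concrete, here is a toy static-stability test in the spirit of intuitive physics. It is an assumption-laden sketch, not IPMAN's actual stability term: a body counts as stable when its center of mass, projected onto the floor, falls inside the support polygon of its floor contacts.

```python
import numpy as np
from scipy.spatial import Delaunay

def is_statically_stable(verts, masses, contact_mask):
    """Stable iff the center of mass projects into the convex hull
    of the contact vertices on the floor plane (toy criterion)."""
    com = (masses[:, None] * verts).sum(0) / masses.sum()  # center of mass
    support = verts[contact_mask][:, :2]                   # xy of floor contacts
    if len(support) < 3:
        return False                                       # no support polygon
    hull = Delaunay(support)                               # triangulated support hull
    return bool(hull.find_simplex(com[None, :2])[0] >= 0)  # CoM inside the hull?

# A body standing over a unit-square footprint is stable:
verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0], [0.5, 0.5, 1.0]])
masses = np.ones(len(verts))
contact = verts[:, 2] < 0.01                  # vertices within 1 cm of the floor
print(is_statically_stable(verts, masses, contact))  # True
```

Leaning or floating bodies fail this check, which is the kind of implausibility the paper's intuitive-physics terms penalize during pose estimation.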
InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction
Abstract: Humans constantly interact with objects to accomplish tasks. To understand such interactions, computers need to reconstruct these in 3D from images of whole bodies manipulating objects, e.g., for grasping, moving, and using the latter.
Yinghao Huang, Omid Taheri, Michael J. Black, Dimitrios Tzionas
Cite · PDF · arXiv · Video · Code · Data · Poster · Project
GCPR 2022
GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping
Abstract: Generating digital humans that move realistically has many applications and is widely studied, but existing methods focus on the major limbs of the body, ignoring the hands and head. Hands have been studied separately, but the focus has been on generating realistic static grasps of objects.
Omid Taheri, Vasileios Choutas, Michael J. Black, Dimitrios Tzionas
Cite · PDF · arXiv · Video · Code · Poster · Project
CVPR 2022
GRAB: A Dataset of Whole-Body Human Grasping of Objects
Abstract: Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time.
Omid Taheri, Nima Ghorbani, Michael J. Black, Dimitrios Tzionas
Cite · arXiv · Video · GRAB · GrabNet · Data · Project
ECCV 2020
Human Leg Motion Tracking by Fusing IMUs and RGB Camera Data Using Extended Kalman Filter
Abstract: Human motion capture is frequently used to study rehabilitation and clinical problems, as well as to provide realistic animation for the entertainment industry. IMU-based systems and marker-based motion tracking systems are among the most popular methods for tracking movement, owing to their low implementation cost and light weight (see the fusion sketch below).
Omid Taheri, Hassan Salarieh, Aria Alasty
Cite · arXiv
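
Since the abstract names the fusion machinery explicitly, here is a minimal extended Kalman filter sketch for a single knee joint. The lengths, noise levels, and camera model are invented for illustration; the paper's state and measurement models are richer:

```python
import numpy as np

L, dt = 0.4, 0.01                        # shank length [m], time step [s] (assumed)
F = np.array([[1.0, dt], [0.0, 1.0]])    # state x = [angle, rate]; constant-rate model
Q = np.diag([1e-6, 1e-4])                # process noise (gyro drift, etc.)
R = np.diag([1e-4, 1e-4])                # camera measurement noise

def h(x):                                # camera observes the 2D ankle position
    return np.array([L * np.sin(x[0]), -L * np.cos(x[0])])

def H_jac(x):                            # Jacobian of h; needed because h is nonlinear
    return np.array([[L * np.cos(x[0]), 0.0],
                     [L * np.sin(x[0]), 0.0]])

def ekf_step(x, P, z_cam):
    x_pred = F @ x                       # predict (an IMU rate would refine this step)
    P_pred = F @ P @ F.T + Q
    H = H_jac(x_pred)                    # update with the camera observation
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z_cam - h(x_pred))
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

x, P = np.array([0.10, 0.0]), np.eye(2) * 0.1
x, P = ekf_step(x, P, h(np.array([0.12, 0.0])))  # fake camera frame at 0.12 rad
print(x)                                 # angle estimate pulled toward 0.12
```

The fusion logic is the whole point: the motion-model prediction keeps the estimate smooth between frames, while each camera measurement corrects accumulated drift.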