Whole-Body Conditioned Egocentric Video Prediction

August 1, 2025 Yutong Bai — UC Berkeley

Abstract

We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model’s embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
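As a rough illustration of two ideas in the abstract, the sketch below expresses a body pose relative to a joint hierarchy and rolls a predictor forward auto-regressively, one action at a time. This is not the PEVA implementation. The skeleton, the joint names, the use of positional offsets (rather than whatever pose parameterization the model actually uses), and the `predict_next` callable are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical toy skeleton: joint name -> parent name (None marks the root).
PARENTS: Dict[str, Optional[str]] = {
    "pelvis": None,
    "spine": "pelvis",
    "head": "spine",
    "left_hand": "spine",
    "right_hand": "spine",
}


def to_parent_relative(
    pose: Dict[str, Vec3],
    parents: Dict[str, Optional[str]] = PARENTS,
) -> Dict[str, Vec3]:
    """Express each joint as an offset from its parent joint.

    The root joint has no parent, so it keeps its global position.
    """
    relative: Dict[str, Vec3] = {}
    for joint, parent in parents.items():
        x, y, z = pose[joint]
        if parent is None:
            relative[joint] = (x, y, z)
        else:
            px, py, pz = pose[parent]
            relative[joint] = (x - px, y - py, z - pz)
    return relative


def rollout(
    predict_next: Callable[[Sequence, object], object],
    context: Sequence,
    actions: Sequence,
    window: int,
) -> List:
    """Auto-regressive rollout: each predicted frame is fed back as context.

    `predict_next(recent_frames, action)` stands in for a learned model
    (assumed interface); only the last `window` frames are passed to it.
    """
    frames = list(context)
    for action in actions:
        frames.append(predict_next(frames[-window:], action))
    # Return only the newly predicted frames.
    return frames[len(context):]
```

With a real model, `predict_next` would be a conditional generative step (for example, sampling from a diffusion model); here any callable with that shape works, which keeps the rollout logic easy to test in isolation.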

Speaker Bio

Yutong is currently a Postdoctoral Researcher at UC Berkeley (BAIR), advised by Prof. Alexei (Alyosha) Efros, Prof. Jitendra Malik, and Prof. Trevor Darrell. Prior to that, she obtained her PhD in computer science at Johns Hopkins University, advised by Prof. Alan Yuille. She previously interned at Meta AI (FAIR Labs) and Google Brain, and was selected as a 2023 Apple Scholar and an MIT EECS Rising Star. Her work was nominated for the CVPR 2022 Best Paper Award.
