Ego-centric Action-Conditioned Video World Models

Anurag Bagchi1, Zhipeng Bao1, Homanga Bharadhwaj2, Yu-Xiong Wang3, Pavel Tokmakov4, Martial Hebert1

1Carnegie Mellon University, 2JHU, 3UIUC, 4TRI

TL;DR: We finetune any pre-trained video diffusion model (U-Net or DiT, temporally compressed or not) to follow action trajectories, with action spaces ranging from 3-DoF position control to 25-DoF joint-angle control of the 1x EVE humanoid, generating egocentric videos of navigation and manipulation tasks.


Input: Initial Frame (I_0) + Action Trajectory (A_{1:T}) → Output: Future Frames (I_{1:T})
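Concretely, the interface can be thought of as a single conditional rollout call. Below is a minimal sketch of that interface, assuming a generic diffusion-based world model; the function and argument names are hypothetical placeholders, not the released API.

```python
# Minimal sketch of the action-conditioned rollout interface.
# All names here (predict_future_frames, world_model.sample) are
# hypothetical placeholders, not the released API.
import torch

def predict_future_frames(world_model, initial_frame, actions):
    """Roll out future frames from one initial frame and an action trajectory.

    initial_frame: (3, H, W) RGB tensor, the initial frame I_0.
    actions:       (T, action_dim) tensor A_{1:T}; action_dim is 3 for
                   position control or 25 for 1x EVE joint-angle control.
    returns:       (T, 3, H, W) tensor of predicted future frames I_{1:T}.
    """
    with torch.no_grad():
        return world_model.sample(initial_frame=initial_frame, actions=actions)

# Example action-trajectory shapes for the two action spaces on this page.
T = 16
actions_position = torch.zeros(T, 3)    # 3-DoF position control
actions_humanoid = torch.zeros(T, 25)   # 25-DoF humanoid joint angles
```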

[1] Zero-shot Generalisation to Paintings!

We select unseen action trajectories from real-world data and apply them to highly OOD paintings used as initial frames, to see the egocentric navigation or joint-control actions play out inside the painting.
[Unseen Action Trajectory Illustrated in Real Video] → Predicted Video in [Painting 1] [Painting 2] ...
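All of the zero-shot results on this page follow the same recipe: take an action trajectory recorded alongside a real egocentric clip and pair it with an unrelated initial frame. A rough usage sketch under the hypothetical interface above (file names, array keys, and the world_model object are placeholders, not released artifacts):

```python
import numpy as np
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

# Action trajectory logged with a real egocentric clip. The file name and
# array key are placeholders for however the logged actions are stored.
actions = torch.from_numpy(np.load("real_clip_0042.npz")["actions"]).float()

# Any out-of-distribution image can serve as the initial frame I_0,
# e.g. a painting or a photo captured in our surroundings.
initial_frame = to_tensor(Image.open("painting_1.jpg").convert("RGB"))

# Reuse the hypothetical rollout interface sketched above; world_model is
# assumed to be an already-loaded, finetuned video diffusion model.
frames = predict_future_frames(world_model, initial_frame, actions)
```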

[1.1] 3-DoF Position Control

[Video grid: three unseen reference action trajectories, each shown with Predictions 1–4 in different paintings]

[1.2] Look down and see yourself! 25-DoF Joint Angle Control

[Unseen Action Trajectory Illustrated in Real Video] → Predicted Video in [Painting 1] [Painting 2]

[Video grid: one unseen reference action trajectory with Predictions 1–2 in the two paintings]

[2] Zero-shot Generalisation to Real-World Images Captured by Us

Here we select unseen action trajectories from real-world data and apply them to unseen images captured by us in our surroundings, to see the egocentric navigation play out in unseen real-world scenes.
[Unseen Action Trajectory Illustrated in Real Video] [Initial Image] [Predicted Video]

[2.1] 3-DoF Position Control

[Video grid: two unseen reference actions, each paired with an initial frame and its prediction]

[2.2] 25-DoF Joint Angle Control

[Unseen Action Trajectory Illustrated in Real Video] [Initial Image 1] [Predicted Video 1] ...

[Video grid: four unseen reference actions, each applied to two initial frames with their corresponding predictions]

[3] 25-DoF Humanoid Joint Angle Control Results on 1x Validation Set

Here we show humanoid navigation and manipulation results controlled by 25-DoF joint-angle action trajectories on the validation set of the 1x dataset, and compare them against the ground-truth videos for the same unseen action trajectories.

[3.1] Navigation

[3.2] Manipulation

[Video grids: Initial Frame + Action Traj. | GT Video | Ours (Cosmos), repeated across the navigation and manipulation sequences]

[4] 3-DoF Position Control Comparison on RECON Test Set

Here we show navigation results controlled by 3-DoF position trajectories on the test set of RECON. In each row, we compare the ground-truth video with the videos predicted by our models (SVD and Cosmos backbones) and by the Navigation World Model (NWM) for the same unseen action trajectory.

[Video grid: Initial Frame + Action Traj. | GT Video | Ours (SVD) | Ours (Cosmos) | NWM, repeated for four test sequences]