We select unseen action trajectories from real-world data and apply them to highly out-of-distribution (OOD) paintings used as initial frames, to see the egocentric navigation or joint-control actions acted out in the paintings.
[Unseen Action Trajectory Illustrated in Real Video] → Predicted Video in [Painting 1] [Painting 2] ...
[Unseen Action Trajectory Illustrated in Real Video] → Predicted Video in [Painting 1] [Painting 2]
Here we select unseen action trajectories from real-world data and apply them to previously unseen images we captured in our surroundings, to see the egocentric navigation acted out in new real-world scenes.
[Unseen Action Trajectory Illustrated in Real Video] → [Initial Image] [Predicted Video]
[Unseen Action Trajectory Illustrated in Real Video] → [Initial Image 1] [Predicted Video 1] ...
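To make the setup concrete, the sketch below shows how applying a source video's action trajectory to a different initial frame could look in code. The `DummyWorldModel` class, its `rollout` interface, the file-free inputs, and the array shapes are illustrative assumptions, not our actual implementation.

```python
# Minimal sketch of transferring an unseen action trajectory to a new initial
# frame (a painting or a photo we captured). The model class below is a
# hypothetical stand-in, not the actual implementation.
import numpy as np


class DummyWorldModel:
    """Stand-in for an action-conditioned video world model."""

    def rollout(self, initial_frame: np.ndarray, actions: np.ndarray) -> np.ndarray:
        # A real model would autoregressively predict frames conditioned on
        # the initial frame and each action; here we just tile the frame.
        return np.repeat(initial_frame[None], len(actions), axis=0)


def transfer_actions(model, actions: np.ndarray, initial_frame: np.ndarray) -> np.ndarray:
    """Roll out a source video's action trajectory from a different initial frame.

    actions:       (T, action_dim) trajectory taken from a real-world video.
    initial_frame: (H, W, 3) uint8 image, e.g. a painting or a photo of our surroundings.
    returns:       (T, H, W, 3) predicted video.
    """
    return model.rollout(initial_frame=initial_frame, actions=actions)


if __name__ == "__main__":
    model = DummyWorldModel()
    actions = np.random.randn(16, 3).astype(np.float32)   # e.g. 3-DoF navigation actions
    painting = np.zeros((256, 256, 3), dtype=np.uint8)     # stand-in OOD initial frame
    video = transfer_actions(model, actions, painting)
    print(video.shape)  # (16, 256, 256, 3)
```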
Here we show humanoid navigation and manipulation results controlled by 25-DoF joint-angle action trajectories on the validation set of the 1x dataset, and compare them with the ground-truth videos for the same unseen action trajectories.
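As a rough illustration of how such a rollout could be compared with its ground-truth clip, the sketch below computes per-frame PSNR between the two videos. The metric choice, array shapes, and function names are assumptions for illustration rather than the exact evaluation protocol.

```python
# Sketch of comparing a predicted rollout with the ground-truth clip for the
# same unseen 25-DoF joint-angle trajectory. Per-frame PSNR and the shapes
# below are illustrative assumptions, not the exact evaluation protocol.
import numpy as np


def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two uint8 frames."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)


def per_frame_psnr(pred_video: np.ndarray, gt_video: np.ndarray) -> np.ndarray:
    """pred_video, gt_video: (T, H, W, 3) clips driven by the same action trajectory."""
    assert pred_video.shape == gt_video.shape
    return np.array([psnr(p, g) for p, g in zip(pred_video, gt_video)])


if __name__ == "__main__":
    T, H, W = 16, 256, 256
    actions = np.zeros((T, 25), dtype=np.float32)  # one 25-DoF joint-angle target per frame
    gt = np.random.randint(0, 256, (T, H, W, 3), dtype=np.uint8)
    pred = np.random.randint(0, 256, (T, H, W, 3), dtype=np.uint8)
    print(per_frame_psnr(pred, gt).mean())
```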
Here we show navigation results controlled by 3-DoF position trajectories on the test set of RECON. In each row, we compare the ground-truth video with the videos predicted by our model and by the Navigation World Model (NWM) for the same unseen action trajectory.
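For context, the sketch below shows one plausible way a 3-DoF navigation action trajectory could be derived from a ground-truth pose sequence, as relative (forward, lateral, yaw) steps in the current egocentric frame. The exact action parameterization used by our model and by NWM may differ, so treat this purely as an illustrative assumption.

```python
# Sketch of converting an absolute (x, y, yaw) pose sequence into a 3-DoF
# relative action trajectory in the egocentric frame. The parameterization is
# an illustrative assumption, not the exact format used in our experiments.
import numpy as np


def poses_to_relative_actions(poses: np.ndarray) -> np.ndarray:
    """poses: (T, 3) absolute (x, y, yaw); returns (T-1, 3) relative actions."""
    actions = []
    for (x0, y0, th0), (x1, y1, th1) in zip(poses[:-1], poses[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Rotate the world-frame displacement into the current egocentric frame.
        fwd = np.cos(th0) * dx + np.sin(th0) * dy
        lat = -np.sin(th0) * dx + np.cos(th0) * dy
        dyaw = np.arctan2(np.sin(th1 - th0), np.cos(th1 - th0))  # wrap to [-pi, pi]
        actions.append((fwd, lat, dyaw))
    return np.asarray(actions, dtype=np.float32)


if __name__ == "__main__":
    t = np.linspace(0, np.pi / 2, 16)
    poses = np.stack([np.cos(t), np.sin(t), t + np.pi / 2], axis=1)  # quarter-circle path
    print(poses_to_relative_actions(poses)[:3])
```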