Video Prediction

Unsupervised object-centric video prediction with DDLP


  • Input frames \(\{x_t\}_{t=0}^{T-1}\).
  • For each \(0 \leq t \leq T-1\), a particle encoder produces the latent particle representation \(z_t\) given \(x_t\) and \(z_{t-1}\), tracking the particles across frames to induce temporal consistency.
  • Each latent \(z_t\) is fed through the particle decoder to produce the reconstruction \(\hat{x}_t\) of frame \(x_t\).
  • A Transformer-based dynamics module (PINT, the Particle Interaction Transformer) models the parameters of the prior distribution \(\{\hat{z}_t\}_{t=1}^{T-1}\) given \(\{z_t\}_{t=0}^{T-2}\).
  • A pixel-wise reconstruction term minimizes the distance between the original frame \(x_t\) and the decoded frame \(\hat{x}_t\), and a KL term minimizes the divergence between the posterior \(\{z_t\}_{t=1}^{T-1}\) and the prior \(\{\hat{z}_t\}_{t=1}^{T-1}\) (a minimal training sketch follows this list).
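To make the objective above concrete, here is a minimal, self-contained PyTorch sketch of a DDLP-style training step and prediction rollout. Everything in it is an illustrative stand-in chosen for brevity: the module names (ToyParticleEncoder, ToyParticleDecoder, ToyPINT), the MLP/Transformer internals, the tensor shapes, and the KL weight are assumptions made for this sketch, not the actual DDLP architecture or hyperparameters.

```python
# Illustrative sketch only: every module below is a toy stand-in (small MLPs instead of the
# real particle-based encoder/decoder) so that the loss structure is easy to follow.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, B, C, H, W = 6, 4, 3, 64, 64   # frames, batch, channels, height, width
K, D = 8, 16                      # number of particles, latent dim per particle

class ToyParticleEncoder(nn.Module):
    """Maps (x_t, z_{t-1}) to a posterior mean/log-variance over K particles."""
    def __init__(self):
        super().__init__()
        self.img = nn.Sequential(nn.Flatten(), nn.Linear(C * H * W, 128), nn.ReLU())
        self.prev = nn.Linear(K * D, 128)
        self.head = nn.Linear(256, 2 * K * D)

    def forward(self, x_t, z_prev):
        h = torch.cat([self.img(x_t), self.prev(z_prev.flatten(1))], dim=-1)
        mu, logvar = self.head(h).chunk(2, dim=-1)
        return mu.view(-1, K, D), logvar.view(-1, K, D)

class ToyParticleDecoder(nn.Module):
    """Maps the particle set z_t back to a reconstructed frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(K * D, 256), nn.ReLU(), nn.Linear(256, C * H * W))

    def forward(self, z_t):
        return self.net(z_t.flatten(1)).view(-1, C, H, W)

class ToyPINT(nn.Module):
    """Transformer over past particle sets; predicts the prior for the next step."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.tf = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, 2 * D)

    def forward(self, z_past):                                 # (B, t, K, D)
        b, t, k, d = z_past.shape
        h = self.tf(z_past.reshape(b, t * k, d))               # attend over time and particles
        mu, logvar = self.head(h[:, -k:]).chunk(2, dim=-1)     # read out the last K tokens
        return mu, logvar

def kl_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over particles and latent dims."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum(dim=(-2, -1))

enc, dec, pint = ToyParticleEncoder(), ToyParticleDecoder(), ToyPINT()
x = torch.rand(T, B, C, H, W)  # a random "video" standing in for real data

# 1) Encode frames sequentially, conditioning on the previous particles (tracking).
mus, logvars, zs, z_prev = [], [], [], torch.zeros(B, K, D)
for t in range(T):
    mu, logvar = enc(x[t], z_prev)
    z_t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterization trick
    mus.append(mu); logvars.append(logvar); zs.append(z_t)
    z_prev = z_t

# 2) Decode every z_t and accumulate the pixel-wise reconstruction term.
recon_loss = sum(F.mse_loss(dec(zs[t]), x[t]) for t in range(T))

# 3) PINT prior: predict hat{z}_t from z_{0..t-1}, then penalize KL(posterior || prior).
kl_loss = torch.zeros(())
for t in range(1, T):
    mu_p, logvar_p = pint(torch.stack(zs[:t], dim=1))
    kl_loss = kl_loss + kl_gauss(mus[t], logvars[t], mu_p, logvar_p).mean()

loss = recon_loss + 0.1 * kl_loss  # the KL weight here is arbitrary, not a DDLP hyperparameter
loss.backward()
print(f"recon={recon_loss.item():.3f}  kl={kl_loss.item():.3f}")

# 4) Prediction rollout: apply PINT autoregressively from a short context and decode.
with torch.no_grad():
    context, horizon = list(zs[:3]), 5
    for _ in range(horizon):
        mu_p, logvar_p = pint(torch.stack(context, dim=1))
        context.append(mu_p)                                  # use the prior mean as the next particle set
    predicted_frames = torch.stack([dec(z) for z in context[3:]])   # (horizon, B, C, H, W)
```

The rollout at the end mirrors how prediction works at test time in this sketch: the dynamics prior is applied autoregressively to the encoded context particles, and the decoder renders the predicted particle sets into future frames.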

OBJ3D

[Video prediction results on OBJ3D]

More videos are available under the “What If…?” section.

Traffic

[Video prediction results on Traffic]

PHYRE

[Video prediction results on PHYRE]

CLEVRER

[Video prediction results on CLEVRER]

Balls-Interaction

[Video prediction results on Balls-Interaction]

Failure Cases


  • DDLP errs when new objects appear during the context window (CLEVRER).
  • DDLP may deform objects in scenes where object scales vary widely (PHYRE).