Video Prediction

Unsupervised object-centric video prediction with DDLP


  • Input frames \(\{x_t\}_{t=0}^{T-1}\).
  • For each \(0 \leq t \leq T-1\), a particle encoder produces the latent particle representation \(z_t\) given \(x_t\) and \(z_{t-1}\), tracking the particles across frames to induce temporal consistency.
  • Each latent \(z_t\) is fed through the particle decoder to produce the reconstruction \(\hat{x}_t\) of frame \(x_t\).
  • A Transformer-based dynamics module (PINT, the Particle Interaction Transformer) models the parameters of the prior distribution \(\{\hat{z}_t\}_{t=1}^{T-1}\) given \(\{z_t\}_{t=0}^{T-2}\).
  • A pixel-wise reconstruction term minimizes the distance between the original frame \(x_t\) and the decoded frame \(\hat{x}_t\), and a KL term minimizes the divergence between the posterior \(\{z_t\}_{t=1}^{T-1}\) and the prior \(\{\hat{z}_t\}_{t=1}^{T-1}\) (a minimal training sketch follows this list).
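To make the objective above concrete, here is a minimal, self-contained PyTorch sketch of a DDLP-style training step and prediction rollout. Everything in it is an illustrative stand-in chosen for brevity: the module names (ToyParticleEncoder, ToyParticleDecoder, ToyPINT), the MLP/Transformer internals, the tensor shapes, and the KL weight are assumptions made for this sketch, not the actual DDLP architecture or hyperparameters.

```python
# Illustrative sketch only: every module below is a toy stand-in (small MLPs instead of the
# real particle-based encoder/decoder) so that the loss structure is easy to follow.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, B, C, H, W = 6, 4, 3, 64, 64   # frames, batch, channels, height, width
K, D = 8, 16                      # number of particles, latent dim per particle

class ToyParticleEncoder(nn.Module):
    """Maps (x_t, z_{t-1}) to a posterior mean/log-variance over K particles."""
    def __init__(self):
        super().__init__()
        self.img = nn.Sequential(nn.Flatten(), nn.Linear(C * H * W, 128), nn.ReLU())
        self.prev = nn.Linear(K * D, 128)
        self.head = nn.Linear(256, 2 * K * D)

    def forward(self, x_t, z_prev):
        h = torch.cat([self.img(x_t), self.prev(z_prev.flatten(1))], dim=-1)
        mu, logvar = self.head(h).chunk(2, dim=-1)
        return mu.view(-1, K, D), logvar.view(-1, K, D)

class ToyParticleDecoder(nn.Module):
    """Maps the particle set z_t back to a reconstructed frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(K * D, 256), nn.ReLU(), nn.Linear(256, C * H * W))

    def forward(self, z_t):
        return self.net(z_t.flatten(1)).view(-1, C, H, W)

class ToyPINT(nn.Module):
    """Transformer over past particle sets; predicts the prior for the next step."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.tf = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, 2 * D)

    def forward(self, z_past):                                 # (B, t, K, D)
        b, t, k, d = z_past.shape
        h = self.tf(z_past.reshape(b, t * k, d))               # attend over time and particles
        mu, logvar = self.head(h[:, -k:]).chunk(2, dim=-1)     # read out the last K tokens
        return mu, logvar

def kl_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over particles and latent dims."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum(dim=(-2, -1))

enc, dec, pint = ToyParticleEncoder(), ToyParticleDecoder(), ToyPINT()
x = torch.rand(T, B, C, H, W)  # a random "video" standing in for real data

# 1) Encode frames sequentially, conditioning on the previous particles (tracking).
mus, logvars, zs, z_prev = [], [], [], torch.zeros(B, K, D)
for t in range(T):
    mu, logvar = enc(x[t], z_prev)
    z_t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterization trick
    mus.append(mu); logvars.append(logvar); zs.append(z_t)
    z_prev = z_t

# 2) Decode every z_t and accumulate the pixel-wise reconstruction term.
recon_loss = sum(F.mse_loss(dec(zs[t]), x[t]) for t in range(T))

# 3) PINT prior: predict hat{z}_t from z_{0..t-1}, then penalize KL(posterior || prior).
kl_loss = torch.zeros(())
for t in range(1, T):
    mu_p, logvar_p = pint(torch.stack(zs[:t], dim=1))
    kl_loss = kl_loss + kl_gauss(mus[t], logvars[t], mu_p, logvar_p).mean()

loss = recon_loss + 0.1 * kl_loss  # the KL weight here is arbitrary, not a DDLP hyperparameter
loss.backward()
print(f"recon={recon_loss.item():.3f}  kl={kl_loss.item():.3f}")

# 4) Prediction rollout: apply PINT autoregressively from a short context and decode.
with torch.no_grad():
    context, horizon = list(zs[:3]), 5
    for _ in range(horizon):
        mu_p, logvar_p = pint(torch.stack(context, dim=1))
        context.append(mu_p)                                  # use the prior mean as the next particle set
    predicted_frames = torch.stack([dec(z) for z in context[3:]])   # (horizon, B, C, H, W)
```

The rollout at the end mirrors how prediction works at test time in this sketch: the dynamics prior is applied autoregressively to the encoded context particles, and the decoder renders the predicted particle sets into future frames.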

OBJ3D

[Video prediction results on OBJ3D]

More videos are available under the “What If…?” section.

Traffic

[Video prediction results on Traffic]

PHYRE

[Video prediction results on PHYRE]

CLEVRER

[Video prediction results on CLEVRER]

Balls-Interaction

[Video prediction results on Balls-Interaction]

Failure Cases


  • DDLP errs when new objects appear during the context window (CLEVRER).
  • DDLP may deform objects in scenes where object scales vary widely (PHYRE).