We re-purpose Text-to-Video diffusion models to segment any spatio-temporal entity given a referral text.
Hover over any GIF to display the input text prompt for the predicted mask.
Results from our eval-only benchmark Ref-VPS (first 4 rows) and other interesting samples outside the scope of our dataset.
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method capitalizes on visual-language representations learned by video diffusion models on Internet-scale datasets. A key insight of our approach is preserving as much of the generative model's original representation as possible, while fine-tuning it on narrow-domain Referral Object Segmentation datasets. As a result, our framework can accurately segment and track rare and unseen objects, despite being trained on object masks from a limited set of categories. Additionally, it can generalize to non-object dynamic concepts, such as waves crashing in the ocean, as demonstrated in our newly introduced eval-only benchmark for Referral Video Process Segmentation (Ref-VPS). Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while outperforming them by up to twelve points in terms of region similarity on out-of-domain data, leveraging the power of Internet-scale pre-training.
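To make the core idea concrete, below is a minimal PyTorch sketch of how a pretrained text-to-video diffusion backbone could be reused as a feature extractor with a lightweight mask decoder fine-tuned on referral segmentation data. The class name `ReferralSegmenter`, the `mask_head` decoder, and all tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class ReferralSegmenter(nn.Module):
    """Hypothetical sketch: reuse a pretrained text-to-video diffusion
    denoiser as a (mostly preserved) feature extractor and attach a small
    mask head that is fine-tuned on referral segmentation masks."""

    def __init__(self, diffusion_backbone: nn.Module, feat_dim: int = 1280):
        super().__init__()
        # Pretrained video diffusion denoiser; most of its weights are kept
        # so the Internet-scale visual-language representation is retained.
        self.backbone = diffusion_backbone
        # Lightweight decoder mapping per-frame backbone features to mask logits.
        self.mask_head = nn.Sequential(
            nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(256, 1, kernel_size=1),
        )

    def forward(self, frames: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) video clip; text_emb: referral-text embedding.
        # Assumes the backbone returns per-frame features of shape
        # (B, T, feat_dim, h, w) conditioned on the text embedding.
        b, t = frames.shape[:2]
        feats = self.backbone(frames, text_emb)
        feats = feats.flatten(0, 1)                  # fold time into the batch dim
        logits = self.mask_head(feats)               # (B*T, 1, h, w) mask logits
        return logits.view(b, t, *logits.shape[1:])  # (B, T, 1, h, w)
```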
Below we show examples from three datasets (Ref-VPS, VSPW, BURST) together with visual comparisons among three methods (UNINEXT, VD-IT, Ours).
Compared to the state of the art, REM segments the referred entity far more consistently through frequent occlusions, viewpoint changes, and distortions.