ReferEverything: Towards Segmenting Everything We Can Speak of in Videos


We re-purpose Text-to-Video diffusion models to segment any spatio-temporal entity given a referral text.

Each GIF shows the predicted mask; the line below it is the input text prompt.

image
the spinning column of water
image
the rain drop running down the center of the windowpane
image
the smoke blown out
image
the large firework
image
the wave crashing in the ocean
image
the building in the middle falling down
image
the beam of the light
image
the green bottle of paint shattering from the impact of the ball
image
the cracks in the middle of the glass
image
the circle of the bright light
image
the black moving particles
image
the water running into the sink
image
the changing beam of white light
image
the bubble being popped
image
the melting glass
image
the tall water eruption
image
the aurora
image
the lava flowing
image
the dark changing shape in the sky
image
the bat signal and the light beam

image
the legs only
image
the wings of the bird only
image
the small paper
image
the smoke dissipating
image
the boy
image
the hat being thrown
image
the flames on the right
image
the smoke kicked up by the car
image
the fire

Results from our eval-only benchmark Ref-VPS (first 4 rows) and other interesting samples outside the scope of our dataset.



Abstract

We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method capitalizes on visual-language representations learned by video diffusion models on Internet-scale datasets. A key insight of our approach is preserving as much of the generative model's original representation as possible, while fine-tuning it on narrow-domain Referral Object Segmentation datasets. As a result, our framework can accurately segment and track rare and unseen objects, despite being trained on object masks from a limited set of categories. Additionally, it can generalize to non-object dynamic concepts, such as waves crashing in the ocean, as demonstrated in our newly introduced eval-only benchmark for Referral Video Process Segmentation (Ref-VPS). Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while outperforming them by up to twelve points in terms of region similarity on out-of-domain data, leveraging the power of Internet-scale pre-training.
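For readers unfamiliar with the metric, region similarity (often written J) is commonly computed as the Jaccard index, i.e. intersection-over-union, between predicted and ground-truth masks, averaged over the frames of a video. The sketch below illustrates that definition only; it is not the evaluation code behind the reported numbers. The function name `region_similarity` and the convention that a frame with two empty masks scores 1.0 are assumptions made for this example. A "point" then corresponds to 0.01 of this score when it is reported on a 0-100 scale.

```python
import numpy as np


def region_similarity(pred_masks, gt_masks):
    """Mean per-frame Jaccard index (IoU) of two binary mask stacks shaped (T, H, W)."""
    pred = np.asarray(pred_masks, dtype=bool)
    gt = np.asarray(gt_masks, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("mask stacks must have the same shape")

    # Per-frame pixel counts of the intersection and the union.
    inter = np.logical_and(pred, gt).sum(axis=(1, 2))
    union = np.logical_or(pred, gt).sum(axis=(1, 2))

    # Assumed convention: if both masks are empty in a frame, count it as a perfect match.
    iou = np.where(union == 0, 1.0, inter / np.maximum(union, 1))
    return float(iou.mean())


if __name__ == "__main__":
    # Two 2x2 frames: the first overlaps on 1 of 2 union pixels, the second is empty in both.
    pred = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
    gt = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
    print(f"J = {100 * region_similarity(pred, gt):.1f}")
```

Benchmarks differ in how they treat empty frames and in whether they average over frames, objects, or videos first, so the exact aggregation should be taken from the benchmark's official toolkit.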


Visual Comparisons - SOTA vs REM (Ours)

Below we show examples from three datasets (Ref-VPS, VSPW, BURST) together with visual comparisons among three methods (UNINEXT, VD-IT, Ours).


BURST results from UNINEXT

image
the sponge
image
the calendar
image
the backpack
image
the can

BURST results from VD-IT

image
the sponge
image
the calendar
image
the backpack
image
the can

BURST results from REM (Ours)

image
the sponge
image
the calendar
image
the backpack
image
the can




VSPW (stuff) results from UNINEXT

image
the bridge
image
the building
image
the wall
image
the house

VSPW (stuff) results from VD-IT

image
the bridge
image
the building
image
the wall
image
the house

VSPW (stuff) results from REM (Ours)

image
the bridge
image
the building
image
the wall
image
the house




Ref-VPS results from UNINEXT

image
the popping bubble
image
the rainbow appearing in the mist
image
the skin being pulled
image
the green bottle with liquid being broken by the ball
image
the crack patterns in the glass
image
the wave
image
the fireworks in the sky
image
the rainbow
image
the rising wave
image
the blue smoke-like patterns in the bowl
image
the smoke from the car
image
the ice forming
image
the water rising and splashing
image
the bubble being burst


Ref-VPS results from VD-IT

image
the popping bubble
image
the rainbow appearing in the mist
image
the skin being pulled
image
the green bottle with liquid being broken by the ball
image
the crack patterns in the glass
image
the wave
image
the fireworks in the sky
image
the rainbow
image
the rising wave
image
the blue smoke-like patterns in the bowl
image
the smoke from the car
image
the ice forming
image
the water rising and splashing
image
the bubble being burst


Ref-VPS results from REM (Ours)

image
the popping bubble
image
the rainbow appearing in the mist
image
the skin being pulled
image
the green bottle with liquid being broken by the ball
image
the crack patterns in the glass
image
the wave
image
the fireworks in the sky
image
the rainbow
image
the rising wave
image
the blue smoke-like patterns in the bowl
image
the smoke from the car
image
the ice forming
image
the water rising and splashing
image
the bubble being burst



Results on highly challenging fighting scenes

Compared to the SOTA, our method REM is much better at consistently segmenting the referred entity through frequent occlusions, POV changes, and distortions.

UNINEXT

image
the man with white hair

VD-IT

image
the man with white hair

REM (Ours)

image
the man with white hair
image
the man with red cape
image
the man with red cape
image
the man with red cape
image
the man without a shirt
image
the man without a shirt
image
the man without a shirt
image
the man in red and gold suit
image
the man in red and gold suit
image
the man in red and gold suit
image
the boy with dark hair
image
the boy with dark hair
image
the boy with dark hair
image
the man with white hair
image
the man with white hair
image
the man with white hair
image
the boy wearing blue shirt
image
the boy wearing blue shirt
image
the boy wearing blue shirt
image
the boy with red collar and pink hair
image
the boy with red collar and pink hair
image
the boy with red collar and pink hair