Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task, as videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching up dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being hitched to the sled). One way to spur research in video understanding is via the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single image captioning and standard video captioning, which consists of describing short videos with a single sentence.
Dense video captioning systems have wide applications, such as making videos accessible to people with visual or auditory impairments, automatically generating chapters for videos, or improving the search of video moments in large databases. Current dense video captioning approaches, however, have several limitations: for example, they often contain highly specialized task-specific components, which make it challenging to integrate them into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are particularly hard to obtain and hence are not a scalable solution.
In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning”, to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. In order to pre-train this unified model, we leverage unlabeled narrated videos by reformulating sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model, pre-trained on millions of narrated videos, improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have also released the code for Vid2Seq here.
Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.
A visual language model for dense video captioning
Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.
To achieve this, we augment a visual language model with special time tokens (like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.
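To make the time-token idea concrete, below is a minimal sketch (not the released Vid2Seq code) of how timestamps could be quantized into special tokens placed after the text vocabulary; the number of time bins, the vocabulary size, and the function names are illustrative assumptions.

```python
# Minimal sketch: discretize timestamps into special time tokens appended
# after the text vocabulary, so events become [start, end, caption...] tokens.
NUM_TIME_TOKENS = 100  # number of discrete time bins (hypothetical value)

def time_to_token(timestamp_sec: float, video_duration_sec: float,
                  text_vocab_size: int) -> int:
    """Map a timestamp to one of NUM_TIME_TOKENS ids placed after the text vocabulary."""
    # Relative position of the timestamp in the video, clipped to [0, 1).
    rel = min(max(timestamp_sec / video_duration_sec, 0.0), 1.0 - 1e-6)
    return text_vocab_size + int(rel * NUM_TIME_TOKENS)

def event_to_token_ids(start_sec, end_sec, caption_token_ids,
                       video_duration_sec, text_vocab_size):
    """An event becomes [start time token, end time token, caption tokens]."""
    return [
        time_to_token(start_sec, video_duration_sec, text_vocab_size),
        time_to_token(end_sec, video_duration_sec, text_vocab_size),
        *caption_token_ids,
    ]

# Example: a 180 s video with an event from 12 s to 47 s whose caption was
# tokenized to [231, 87, 905] by the text tokenizer (assumed vocab size 32128).
sequence = event_to_token_ids(12.0, 47.0, [231, 87, 905],
                              video_duration_sec=180.0, text_vocab_size=32128)
print(sequence)  # [32134, 32154, 231, 87, 905]
```

A full dense-captioning target sequence is then simply the concatenation of such per-event chunks in temporal order.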
The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.
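The overall data flow can be summarized with the structural sketch below, written in PyTorch. The `Vid2SeqSketch` class, layer counts, and module sizes are assumptions for illustration only; in particular, the real model is initialized from pretrained visual and language backbones rather than built from scratch like this toy skeleton.

```python
# Structural sketch (assumed module names and sizes, not the released code):
# frames and transcribed speech are encoded separately, then a text decoder
# autoregressively emits the mixed sequence of time and text tokens.
import torch
import torch.nn as nn

class Vid2SeqSketch(nn.Module):
    def __init__(self, d_model=512, vocab_size=32128 + 100):  # text + time tokens
        super().__init__()
        make_enc_layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=8, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(make_enc_layer(), num_layers=2)
        self.speech_encoder = nn.TransformerEncoder(make_enc_layer(), num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, speech_ids, target_ids):
        # frame_feats: (B, T_frames, d_model); speech_ids, target_ids: (B, L)
        visual = self.visual_encoder(frame_feats)
        speech = self.speech_encoder(self.embed(speech_ids))
        memory = torch.cat([visual, speech], dim=1)  # fused conditioning context
        causal = nn.Transformer.generate_square_subsequent_mask(target_ids.size(1))
        hidden = self.decoder(self.embed(target_ids), memory, tgt_mask=causal)
        return self.lm_head(hidden)  # logits over text + time tokens
```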
Large-scale pre-training on untrimmed narrated videos
Due to the dense nature of the task, the manual collection of annotations for dense video captioning is particularly expensive. Hence we pre-train the Vid2Seq model using unlabeled narrated videos, which are easily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.
We use transcribed speech sentences and their corresponding timestamps as supervision, which are cast as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.
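As a rough illustration of the corruption step for the denoising objective, the snippet below randomly masks out contiguous spans of speech tokens. The masking ratio, span-length distribution, and sentinel id are assumptions, not the exact recipe from the paper.

```python
# Sketch of span masking for the denoising objective (assumed hyperparameters).
import random

MASK_TOKEN_ID = 32100  # hypothetical sentinel id

def mask_spans(token_ids, mask_ratio=0.25, mean_span_len=5, seed=None):
    """Randomly replace contiguous spans of tokens with a mask sentinel."""
    rng = random.Random(seed)
    ids = list(token_ids)
    num_to_mask = int(len(ids) * mask_ratio)
    masked = 0
    while masked < num_to_mask:
        span_len = max(1, int(rng.expovariate(1.0 / mean_span_len)))
        start = rng.randrange(0, max(1, len(ids) - span_len))
        for i in range(start, min(start + span_len, len(ids))):
            if ids[i] != MASK_TOKEN_ID:
                ids[i] = MASK_TOKEN_ID
                masked += 1
    return ids

# The denoising objective then asks the decoder to reconstruct the original
# (unmasked) speech tokens given these corrupted ids plus the visual inputs.
noisy_ids = mask_spans(list(range(100, 140)), mask_ratio=0.25, seed=0)
```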
Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).
Results on downstream dense video captioning benchmarks
The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2 and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper we provide additional ablation studies, qualitative results, as well as results in the few-shot settings and in the video paragraph captioning task.
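For reference, teacher-forced fine-tuning amounts to a standard next-token cross-entropy loss over the target sequence of time and text tokens, sketched below using the assumed `Vid2SeqSketch` signature from the earlier snippet (the padding id and shifting convention are also assumptions).

```python
# Sketch of fine-tuning with teacher forcing: the decoder predicts each token
# of the ground-truth dense-captioning sequence given all previous ground-truth
# tokens, trained with cross-entropy over text + time tokens.
import torch
import torch.nn.functional as F

def teacher_forcing_loss(model, frame_feats, speech_ids, target_ids, pad_id=0):
    # Shift: the model sees tokens [0 .. L-2] and predicts tokens [1 .. L-1].
    decoder_input = target_ids[:, :-1]
    labels = target_ids[:, 1:]
    logits = model(frame_feats, speech_ids, decoder_input)  # (B, L-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,  # do not penalize padding positions
    )
```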
Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).
Conclusion
We introduce Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pretrained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.
Acknowledgements
This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.