The 2nd MSR Video to Language Challenge


Video has become ubiquitous on the Internet, on broadcast channels, and on personal devices. This has encouraged the development of advanced techniques to analyze semantic video content for a wide variety of applications. Recognition of videos has been a fundamental challenge of computer vision for decades. Previous research has predominantly focused on recognizing videos with a predefined yet very limited set of individual words. In this grand challenge, we go one step further and aim to translate video content into a complete and natural sentence, which can be regarded as the ultimate goal of video understanding. This challenge will bring together diverse topics in the areas of multimedia, computer vision, natural language processing and machine learning, as well as multiple modalities (textual, visual, and aural) and multiple ways of understanding and analyzing video content.

To further motivate and challenge the academic and industrial research communities, Microsoft Research organized the first Video to Language Grand Challenge at ACM Multimedia 2016 and released the first version of “Microsoft Research - Video to Text” (MSR-VTT), a large-scale public video benchmark for video understanding. The dataset contains 38.7 hours of video and 200K clip-sentence pairs in total, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentences and vocabulary. The dataset can be used to train and evaluate video to language systems and, in the future, other tasks as well (e.g., video retrieval, event detection, video categorization). Below are some statistics from the grand challenge at ACM Multimedia 2016.


Figure 1. Statistics of participants in the first grand challenge at ACM MM 2016. In total, 77 teams registered for the challenge, and 22 teams submitted final results.


Figure 2. Ranking list of the top 10 participants in terms of the M1 (objective evaluation) metric in the first grand challenge at ACM MM 2016.


Figure 3. Ranking list of the top 10 participants in terms of the M2 (human subjective evaluation) metric in the first grand challenge at ACM MM 2016.

This year we are proposing to organize the second grand challenge at ACM Multimedia 2017 and will release a new test dataset for evaluation.

As in the first challenge, by participating this year you can:

  • Leverage the MSR-VTT benchmark to boost research on the emerging task of video to language;
  • Try out your video to language system using real world data;
  • See how it compares to the rest of the community’s entries;
  • Get to be a contender for the ACM Multimedia 2017 Grand Challenge.

Task Description

This year we will focus on the video to language task. Given an input video clip, the goal is to automatically generate a complete and natural sentence that describes the video content, ideally encapsulating its most informative dynamics.

Contestants are asked to develop video to language systems based on the MSR-VTT dataset provided by the Challenge (as training data) and any other public or private data, in order to recognize a wide range of objects, scenes, events, etc., in the images/videos. For evaluation purposes, a contesting system is asked to produce at least one sentence for each test video. Accuracy will be evaluated against human pre-generated sentence(s) during the evaluation stage.

For more information, please refer to the “Microsoft Research Video to Language Grand Challenge” website.


The dataset is based on MSR-VTT, and we split the data 60%:30%:10% into the training, testing and validation sets, respectively. The table below shows the statistics of the MSR-VTT dataset.

Dataset Context Sentence
#Video   #Clip    #Sentence   #Word       Vocabulary   Duration (hours)
5,942    10,000   200,000     1,535,917   28,528       38.7

*In the MSR-VTT dataset, we provide category information for each video clip, and each video clip contains audio information as well.
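To put the table figures in perspective, a few per-clip averages can be derived from them. The short Python sketch below only divides the totals reported above; the variable names are illustrative and not part of any official tooling.

```python
# Totals taken from the MSR-VTT statistics table above.
clips = 10_000
sentences = 200_000
words = 1_535_917
duration_hours = 38.7

# Derived averages (simple division, no additional data assumed).
sentences_per_clip = sentences / clips               # 20 sentences per clip
words_per_sentence = words / sentences               # roughly 7.7 words per sentence
seconds_per_clip = duration_hours * 3600 / clips     # roughly 14 seconds per clip

print(f"{sentences_per_clip:.1f} sentences/clip")
print(f"{words_per_sentence:.2f} words/sentence")
print(f"{seconds_per_clip:.1f} seconds/clip")
```

In other words, each clip is a short segment of roughly 14 seconds on average, described by 20 sentences of fewer than 8 words each on average.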

Evaluation Metric

The evaluation provided here can be used to obtain results on the testing set of MSR-VTT. It computes multiple common metrics, including BLEU@4, METEOR, ROUGE-L, and CIDEr.
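To give an intuition for one of these metrics, the following is a minimal Python sketch of a sentence-level ROUGE-L score, which is based on the longest common subsequence (LCS) between a generated sentence and a reference. It is an illustration only, not the challenge's evaluation code: the whitespace tokenization, the lowercasing, the `beta=1.2` weighting, and taking the best score over references are all assumptions, and the function names `lcs_length` and `rouge_l` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, references, beta=1.2):
    """Best LCS-based F-score of `candidate` against any reference sentence.

    Assumptions: whitespace tokenization, case-insensitive matching,
    and recall weighted by `beta` (1.2 is a commonly used value).
    """
    cand = candidate.lower().split()
    best = 0.0
    for ref in references:
        ref_tokens = ref.lower().split()
        lcs = lcs_length(cand, ref_tokens)
        if lcs == 0:
            continue  # no overlap with this reference
        precision = lcs / len(cand)
        recall = lcs / len(ref_tokens)
        f = (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
        best = max(best, f)
    return best


score = rouge_l("a man is singing on stage",
                ["a man is singing", "someone sings a song on a stage"])
print(round(score, 3))
```

An identical candidate and reference score 1.0, and sentences with no words in common score 0.0. Results from a sketch like this should not be expected to match the official numbers, since the official evaluation may tokenize and aggregate differently.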

In addition, we will carry out a human evaluation of the systems submitted to this challenge on a subset of the testing set. Human judges are asked to rank four generated sentences and a reference sentence from 1 to 5 (lower is better) with respect to the following criteria.

  • Grammar: judge the fluency and readability of the sentence (independently of the correctness with respect to the video clip).
  • Correctness: which sentence's content is more correct with respect to the video clip, regardless of whether it is complete (i.e., describes everything) and regardless of its grammatical correctness?
  • Relevance: Which sentence contains the more salient (i.e., relevant, important) events/objects of the video clip?
  • Helpful for blind (additional criterion): how helpful would the sentence be for a blind person to understand what is happening in this video clip?
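
The rankings collected under the criteria above need to be combined into a per-system score. The challenge description does not state how this is done; the Python sketch below shows one plausible approach, averaging each system's rank over all judgments for a single criterion. The function name `mean_ranks`, the input layout, and the system names are hypothetical.

```python
from collections import defaultdict


def mean_ranks(judgments):
    """Average rank per system for one criterion (lower is better).

    `judgments` is a list of dicts, one per judged video, each mapping
    a system name to the rank (1 to 5) a human judge assigned to it.
    This aggregation is an assumption, not the official M2 computation.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in judgments:
        for system, rank in ranking.items():
            totals[system] += rank
            counts[system] += 1
    return {system: totals[system] / counts[system] for system in totals}


# Two hypothetical judgments of the "Grammar" criterion.
grammar = [
    {"reference": 1, "system_a": 2, "system_b": 4},
    {"reference": 2, "system_a": 1, "system_b": 5},
]
averages = mean_ranks(grammar)
for system, avg in sorted(averages.items(), key=lambda kv: kv[1]):
    print(system, avg)
```

Sorting by average rank in ascending order lists the best-rated system first, consistent with the "lower is better" convention.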

Important Dates:
  • April 18, 2017: Dataset available for download (training and validation set)
  • June 1, 2017: Test set available for download
  • June 15, 2017: Results and one-page notebook paper submission
  • June 16-28, 2017: Objective evaluation and human evaluation
  • July 3, 2017: Evaluation results announced
  • July 14, 2017: Paper submission deadline (please follow the instructions on the main conference website)

Ting Yao, Associate Researcher, Microsoft Research Asia

Tao Mei, Senior Researcher, Microsoft Research Asia