ICMR 2016


Yahoo! Flickr YFCC100M User Tag and Caption Prediction Challenge

Following the tradition of Grand Challenges at ACM Multimedia and Multimedia Retrieval competitions such as the VideOlympics at CIVR, ICMR 2016 will introduce its first edition of grand challenges. Industry leaders are encouraged to organize challenges that engage the Multimedia Retrieval research community in solving relevant, interesting and challenging questions about the industry's 3-5 year vision for important retrieval problems of a strictly multi-modal, multi-media nature. These challenges will offer a fun and engaging framework for developing creative multi-media solutions to challenging real problems affecting the industry today and in the immediate future.

Challenge overview

The members of the Flickr community manually tag photos with the goal of making them searchable and discoverable. With the advent of mobile phone cameras and auto-uploaders, photo uploads have become more numerous and asynchronous, and manual tagging is cumbersome for most users. Progress in automatic image annotation has been largely driven by training deep neural networks on datasets, such as ImageNet, that were built by manual annotators. However, acquiring annotations is expensive. In addition, the categories of annotations are defined by researchers and not by users, which means they are not necessarily relevant to users' interests and cannot be directly leveraged to enable search and discovery.

This challenge fills a void that is not addressed by existing challenges and is uniquely aligned with the context of multimedia retrieval in two respects: (1) the dataset contains on the order of 100 million photos, which reflects well the challenges of understanding multimedia at large scale, and (2) the benchmark focuses on user-generated content, where a large vocabulary of concepts is collected from tags annotated by users.

The challenge focuses on how people annotate photos, rather than on photo annotation without the human component. It asks participants to build image analysis systems that think like humans: the correct annotation for an image isn't necessarily the "true label". For example, a recent study showed that while a photo containing an apple, a banana and a pear could be annotated using these three words, a person would actually be more likely to annotate the image with the single word "fruit". As the problem of automatic image annotation is not close to being solved, we intend to hold the grand challenge over multiple years. Depending on the progress of the submissions and the state of the art, the difficulty of the challenge could increase.


The challenge will use data exclusively drawn from the Yahoo Flickr Creative Commons 100M (YFCC100M) dataset. The benefits of this dataset are its sheer volume and the fact that it is freely and legally usable. The metadata, pixel data, and a wide variety of features are stored on Amazon S3, meaning that they can be accessed and processed directly in the cloud; this is of particular importance to potential participants who may not have access to sufficient computational power or disk storage at their own research lab.

The YFCC100M dataset overlaps with existing datasets, from which we will borrow manual annotations to serve as additional ground truth in an additional evaluation. Of particular note is the COCO dataset, of which about one third (~100K) is present in the YFCC100M.

Task description and evaluation metric

We aim to split the data into three groups (training, validation, and testing) depending on the last digit prior to the @ symbol in the user identifier (NSID). Splitting by user ensures that no user occurs in multiple partitions, which avoids a dependency between the different splits. Depending on the skewness of the amount of data per digit, we plan to group the whole dataset into 10 splits, with split 0 as the testing set, split 1 as the validation set, and the others as the training set. To explore the rich nature of YFCC100M, we consider the following subtasks:
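Before turning to the subtasks, the user-based split can be illustrated with a minimal sketch. It assumes NSIDs of the form digits, then "@", then a suffix; the function name and the example identifiers are made up for illustration and are not part of the official challenge tooling.

```python
def split_for_user(nsid: str) -> str:
    """Assign a user to a partition from the last digit before '@' in the NSID.

    Assumed mapping, following the plan described above:
    digit 0 -> test, digit 1 -> validation, digits 2-9 -> train.
    """
    prefix, sep, _suffix = nsid.partition("@")
    # Reject identifiers that do not look like '<digits>@<suffix>'.
    if not sep or not prefix or prefix[-1] not in "0123456789":
        raise ValueError(f"unexpected NSID format: {nsid!r}")
    digit = int(prefix[-1])
    if digit == 0:
        return "test"
    if digit == 1:
        return "validation"
    return "train"


# Hypothetical identifiers, for illustration only.
for example in ["12345670@N00", "99999991@N05", "48600090482@N01"]:
    print(example, "->", split_for_user(example))
```

Because the partition depends only on the user identifier, all photos of a given user fall into the same split.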

Subtask 1 Tag Prediction:

This task focuses on user tag prediction, i.e. predict the tags that a user annotated a photo with. We also add a subtask in the evaluation where we only consider tags that are in the English dictionary in order to remove tags corresponding to dates and locations, as well as other "noise" tags that are difficult to predict. The subtasks combined will reveal to what extent complicated tags are responsible for a change in prediction performance.
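As a rough illustration of the dictionary-filtered evaluation, the sketch below keeps only tags found in a word list. The challenge text does not say which English dictionary is used, so the tiny word list, the case-insensitive matching, and the function name are all assumptions.

```python
def filter_dictionary_tags(tags, dictionary):
    """Keep only the tags whose lowercase form appears in the given word list."""
    return [tag for tag in tags if tag.lower() in dictionary]


# Hypothetical word list and user tags, for illustration only.
english_words = {"beach", "sunset", "dog"}
user_tags = ["Sunset", "2014", "newyorkcity", "beach", "img_0042"]
print(filter_dictionary_tags(user_tags, english_words))
```

In this example the date-like tag, the location tag, and the camera file name are dropped, while the two dictionary words are kept.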

The following metrics can be used to evaluate the filtered user tag prediction in the test set:

- accuracy @ K: 1 if at least one of the top K predicted tags is present in the user tags, 0 otherwise

- precision @ K: proportion of the top K predicted tags that appear in the user tags

- recall @ K: proportion of the user tags that appear in the top K predicted tags.

We will test the following values for K: 1, 5, 10.
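The three metrics can be sketched for a single photo as follows. This is an illustrative reading of the definitions above, not the official scoring script: it assumes predictions arrive as a ranked list, that precision is divided by K even when fewer than K tags are predicted, and that duplicate predictions count once. The function name and example tags are hypothetical.

```python
def metrics_at_k(predicted, user_tags, k):
    """Return (accuracy@K, precision@K, recall@K) for one photo.

    predicted: tags ranked from most to least confident.
    user_tags: tags the user actually assigned to the photo.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    truth = set(user_tags)
    # Distinct top-K predictions that the user also assigned.
    matched = truth & set(predicted[:k])
    accuracy = 1.0 if matched else 0.0
    precision = len(matched) / k
    recall = len(matched) / len(truth) if truth else 0.0
    return accuracy, precision, recall


# Hypothetical prediction and ground truth, for illustration only.
predicted = ["beach", "sunset", "sea", "dog", "sky"]
user_tags = ["sunset", "sky", "holiday"]
for k in (1, 5, 10):
    print(k, metrics_at_k(predicted, user_tags, k))
```

A score for the whole test set could then be obtained by averaging the per-photo values, although the aggregation is not specified in the text above.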

Subtask 2 Photo Caption:

This task is to mimic how Flickr users caption their photos. Each participant is encouraged to produce image captions that are not only accurate but also attractive. Non-informative captions are less preferred. The long-term goal of this subtask is to build machines that can not only understand what is in the photo, but also experience emotions and feelings from the photo like a human being.

We would like to consider two types of possible evaluation criteria for the caption task.

- automatic evaluation metrics, including BLEU and METEOR scores. These metrics measure the differences between the generated sentences and the original captions on the whole test set.

- human judgements. A group of human judges will read the captions on a subsample of the test set and choose the best-performing system.
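To give a feel for the automatic metrics, the sketch below computes the clipped n-gram precision that BLEU is built on. It is not a full BLEU or METEOR implementation: it omits the brevity penalty and the combination over several n-gram orders, uses plain whitespace tokenization, and compares against a single reference caption. The function names are chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference,
    so repeating a common word does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# A degenerate caption of repeated words gets credit for only two of seven tokens.
print(clipped_precision("the the the the the the the", "the cat is on the mat"))
```

Clipping is what prevents a system from scoring well by emitting frequent words repeatedly, which is one reason non-informative captions tend to fare poorly under such metrics.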

Submissions should:

Significantly address the challenge posted on the web site.

Depict working, presentable systems or demos, using the grand challenge dataset.

Describe why the system presents a novel and interesting solution.

Submission Guidelines

The submissions (max 4 pages) should be formatted according to ICMR formatting guidelines. Grand Challenge reviewing is double-blind so authors shouldn't reveal their identity in the paper. The finalists will be selected by a committee consisting of academia and industry representatives, based on novelty, presentation, scientific interest of the approaches, and for the evaluation-based challenges, on the performance against the task.

Accepted submissions will be published in the conference proceedings, and will be presented in a special event during the ICMR 2016 conference. At the conference, finalists will be requested to introduce their solutions, give a quick demo, and take questions from the judges and the audience.

Winners will be selected for Grand Challenge awards based on their presentation.

Challenge solutions are to be submitted via the EasyChair conference website.

Important Dates

Grand Challenge paper submission: March 20, 2016

Notification of acceptance: April 1, 2016

Camera-ready paper due: April 15, 2016

Grand Challenge Contacts

Bart Thomee, [email protected]

Pierre Garrigues, [email protected]

Liangliang Cao, [email protected]
