Yahoo! Flickr YFCC100M User Tag and Caption Prediction Challenge
Following the tradition of Grand Challenges at ACM Multimedia and of multimedia retrieval competitions such as the VideOlympics at CIVR, ICMR 2016 will introduce its first edition of grand challenges.
Industry leaders are encouraged to organize challenges that engage the multimedia retrieval research community in solving relevant, interesting, and challenging questions reflecting the industry's 3-5 year vision for important retrieval problems of a strictly multi-modal, multimedia nature.
These challenges provide a fun and engaging framework for creative multimedia solutions to challenging real problems affecting the industry today and in the immediate future.
Yahoo! Flickr YFCC100M User Tag and Caption Prediction Challenge
Challenge overview
The members of the Flickr community manually tag photos with the goal of making them
searchable and discoverable. With the advent of mobile phone cameras and auto-uploaders,
photo uploads have become more numerous and asynchronous, and manual tagging is
cumbersome for most users. Progress in automatic annotation has largely been driven by
training deep neural networks on manually annotated datasets such as ImageNet. However,
acquiring annotations is expensive. In addition, the categories of annotations are defined by
researchers rather than by users, which means they are not necessarily relevant to users'
interests and cannot be directly leveraged to enable search and discovery.
This challenge fills a void not currently addressed by existing
challenges and is uniquely aligned with the context of multimedia retrieval in two
aspects: (1) the dataset contains on the order of 100 million photos, which reflects the
challenges of understanding multimedia at large scale, and (2) the benchmark focuses on
user-generated content, where a large vocabulary of concepts is collected from tags assigned
by users.
The challenge focuses on how people annotate photos, rather than on photo
annotation without the human component. It asks participants to build image
analysis systems that think like humans: the correct annotation for an image is not necessarily
the "true label". For example, a recent study showed that while a photo containing an apple, a
banana, and a pear could be annotated with these three words, a person would more likely
annotate the image with the single word "fruit".
As the problem of automatic image annotation is far from solved, we intend to hold the
grand challenge over multiple years. Depending on the progress of the submissions and the
state of the art, the difficulty of the challenge may increase in future editions.
Dataset
The challenge will use data exclusively drawn from the Yahoo Flickr Creative Commons 100M
(YFCC100M) dataset. The benefits of this dataset are its sheer volume and the fact that it is
freely and legally usable. The metadata, pixel data, and a wide variety of features are stored
on Amazon S3, meaning that the dataset can be accessed and processed directly in the cloud;
this is of particular importance to potential participants who may not have access to sufficient
computational power or disk storage at their own research lab.
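As a rough illustration of such cloud-side access, the Python sketch below streams one metadata shard directly from S3 with boto3. The bucket and key names are placeholders of our own, not the dataset's actual location; consult the dataset documentation for those.

    import boto3

    # Placeholder names for illustration only; they are NOT the real
    # S3 location of the YFCC100M.
    BUCKET = 'yfcc100m-example-bucket'
    KEY = 'metadata/yfcc100m_dataset-shard-0'

    # Stream one tab-separated metadata shard without downloading
    # the full dataset to local disk.
    s3 = boto3.client('s3')
    body = s3.get_object(Bucket=BUCKET, Key=KEY)['Body']
    for line in body.iter_lines():
        fields = line.decode('utf-8').split('\t')
        photo_id, nsid = fields[0], fields[1]  # assumed column order
        # process one photo record here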
The YFCC100M dataset overlaps with existing datasets from which we will borrow manual
annotations to serve as additional ground truth in a supplementary evaluation. Of
particular note is the COCO dataset, about one third of which (~100K images) is present in
the YFCC100M.
Task description and evaluation metrics
We aim to split the data into three groups depending on the last digit prior to the @ symbol in
the user identifier (NSID). The motivation for splitting the data such that no user occurs in
multiple partitions is to avoid dependencies between the different splits. Depending on the
skewness of the amount of data per digit, we plan to group the whole dataset into 10 splits, with
split 0 as the testing set, split 1 as the validation set, and the others as the training set.
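For concreteness, a minimal Python sketch of this user-based partitioning (ours, not official tooling) is:

    def assign_split(nsid):
        """Partition by the last digit of the NSID before the '@' symbol,
        so that all photos of a user fall into exactly one split."""
        digit = int(nsid.split('@', 1)[0][-1])
        if digit == 0:
            return 'test'
        if digit == 1:
            return 'validation'
        return 'train'

    # All photos of a given user land in the same partition:
    assert assign_split('12345670@N00') == 'test'
    assert assign_split('12345671@N00') == 'validation'
    assert assign_split('12345678@N00') == 'train'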
To explore the rich nature of the YFCC100M, we consider the following subtasks:
Subtask 1 Tag Prediction:
This task focuses on user tag prediction, i.e. predicting the tags with which a user annotated
a photo. We also add a filtered variant to the evaluation in which we only consider tags found
in an English dictionary, in order to remove tags corresponding to dates and locations, as well
as other "noise" tags that are difficult to predict; a sketch of such a filter follows this
paragraph. The two variants combined will reveal to what extent such complicated tags are
responsible for changes in prediction performance.
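A minimal sketch of such a dictionary filter, assuming a lowercase English word list is available (the exact word list used in the evaluation is not specified here):

    def load_dictionary(path='/usr/share/dict/words'):
        """Load a lowercase English word list (the path is an assumption)."""
        with open(path) as f:
            return {line.strip().lower() for line in f}

    def filter_tags(tags, dictionary):
        """Keep only dictionary words, dropping dates, place names, and
        other 'noise' tags that are difficult to predict."""
        return [t for t in tags if t.lower() in dictionary]

    # Example: filter_tags(['sunset', '2008', 'nyc', 'beach'], words)
    # would keep ['sunset', 'beach'] for a typical English word list.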
The following metrics can be used to evaluate the filtered user tag prediction in the test set:
- accuracy @ K: 1 if at least one of the top K predicted tags is present in the user tags, 0
otherwise
- precision @ K: proportion of the top K predicted tags that appear in the user tags
- recall @ K: proportion of the user tags that appear in the top K predicted tags.
We will test the following values for K: 1, 5, and 10.
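For concreteness, the following Python sketch implements the three metrics as defined above; it is an illustration, not the official evaluation code.

    def accuracy_at_k(predicted, user_tags, k):
        """1 if any of the top-k predictions is a user tag, else 0."""
        return 1 if any(t in user_tags for t in predicted[:k]) else 0

    def precision_at_k(predicted, user_tags, k):
        """Proportion of the top-k predicted tags that are user tags."""
        top_k = predicted[:k]
        return sum(t in user_tags for t in top_k) / float(len(top_k))

    def recall_at_k(predicted, user_tags, k):
        """Proportion of the user tags found among the top-k predictions."""
        hits = sum(t in user_tags for t in predicted[:k])
        return hits / float(len(user_tags))

    predicted = ['fruit', 'apple', 'table', 'banana', 'bowl']
    user_tags = {'fruit', 'banana', 'kitchen'}
    print(accuracy_at_k(predicted, user_tags, 5))   # 1
    print(precision_at_k(predicted, user_tags, 5))  # 0.4 (2 of 5)
    print(recall_at_k(predicted, user_tags, 5))     # 0.666... (2 of 3)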
Subtask 2 Photo Caption:
This task mimics how Flickr users caption their photos. Each participant is
encouraged to produce image captions that are not only accurate but also attractive;
non-informative captions are less preferred. The long-term goal of this subtask is to
build machines that can not only understand what is in a photo, but also
experience emotions and feelings from the photo like a human being.
We consider two types of possible evaluation criteria for the caption task:
- automatic evaluation metrics, including BLEU and METEOR scores, which measure the
difference between the generated sentences and the original captions over the whole test set
(illustrated below);
- human judgements: a group of human judges will read the captions on a subsample of
the test set and choose the best-performing system.
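As a rough illustration of the automatic route, the sketch below computes a smoothed sentence-level BLEU score with NLTK; the challenge's actual scoring script, tokenization, and METEOR configuration are not specified here.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = 'a bowl of fresh fruit on the table'.split()
    candidate = 'fresh fruit in a bowl on a table'.split()

    # sentence_bleu expects a list of tokenized reference sentences;
    # smoothing avoids zero scores for short captions.
    score = sentence_bleu([reference], candidate,
                          smoothing_function=SmoothingFunction().method1)
    print('BLEU: %.3f' % score)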
Submissions should:
- significantly address the challenge posted on the web site;
- depict working, presentable systems or demos using the grand challenge dataset;
- describe why the system presents a novel and interesting solution.
Submission Guidelines
Submissions (max 4 pages) should be formatted according to the ICMR formatting guidelines. Grand Challenge reviewing is double-blind, so authors should not reveal their identity in the paper. The finalists will be selected by a committee of academia and industry representatives, based on novelty, presentation, and scientific interest of the approaches, and, for the evaluation-based challenges, on performance against the task.
Accepted submissions will be published in the conference proceedings, and will be presented in a special event during the ICMR 2016 conference. At the conference, finalists will be requested to introduce their solutions, give a quick demo, and take questions from the judges and the audience.
Winners will be selected for Grand Challenge awards based on their presentation.