Images do not appear in isolation. On the web, for example, images are typically surrounded by informative text in the form of tags (e.g., on Flickr), captions (short summaries conveying something about the picture), blog posts, news articles, etc. In robotics, language is the most convenient way to teach an autonomous agent novel concepts or to communicate the mistakes it is making. For example, when giving a robot a novel task such as "pass me the stapler", we could provide additional information, e.g., "it is next to the beer bottle on the table". This information could greatly simplify the parsing task. Conversely, it is also crucial that the agent communicates its understanding of the scene to the human, e.g., "I can't, I am watching TV on a sofa, next to the wine bottle."

This class is a graduate seminar course in computer vision. The class will focus on the topic of visual recognition by exploiting textual information. We will discuss various problems and applications in this domain, and survey the current papers on the topic of images/videos and text. The goal of the class will be to understand the cross-domain approaches, to analyze their strengths and weaknesses, as well as to identify interesting new directions for future research.

Prerequisites: Courses in computer vision and/or machine learning (e.g., CSC320, CSC420) are highly recommended (otherwise you will need some additional reading), and basic programming skills are required for projects.

back to top

When emailing me, please put CSC2523 in the subject line.


This class uses Piazza. There we will post announcements and assignments. Students will also be able to post questions and discussions in a forum-style manner, either to the instructors or to their peers.

Please sign up here at the beginning of the class.

back to top

We are happy to announce that four projects from this class have been published in top conferences:

(oral presentation)

Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu*, Ryan Kiros*, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

* Denotes equal contribution

@inproceedings{zhu2015aligning,
  title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
  author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
  booktitle = {ICCV},
  year = {2015}
}

Books are a rich source of both fine-grained information, such as how a character, an object or a scene looks, and high-level semantics, such as what someone is thinking or feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.
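As a rough sketch of the alignment step described above (not the authors' implementation), the core operation is scoring movie clips against book sentences by similarity in a shared embedding space. All vectors below are random stand-ins for the learned sentence and clip embeddings:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between each row of a and each row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy stand-ins for learned embeddings: 4 book sentences and
# 3 movie clips, all mapped into a shared 5-dimensional space.
rng = np.random.default_rng(0)
book_sentences = rng.normal(size=(4, 5))
movie_clips = rng.normal(size=(3, 5))

sim = cosine_sim(movie_clips, book_sentences)  # (3, 4) similarity matrix
best_sentence = sim.argmax(axis=1)             # best-matching sentence per clip
```

In the paper, such pairwise similarities are only one input; the context-aware CNN additionally combines contextual and dialog cues before the final alignment is produced.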

(oral presentation)

Lost Shopping! Monocular Localization in Large Indoor Spaces

Shenlong Wang, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{wang2015lost,
  title = {Lost Shopping! Monocular Localization in Large Indoor Spaces},
  author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
  booktitle = {ICCV},
  year = {2015}
}

In this paper we propose a novel approach to localization in very large indoor spaces (i.e., 200+ store shopping malls) that takes a single image and a floor plan of the environment as input. We formulate the localization problem as inference in a Markov random field, which jointly reasons about text detection (localizing shop names in the image with precise bounding boxes), shop facade segmentation, as well as the camera's rotation and translation within the entire shopping mall. The power of our approach is that it does not use any prior information about appearance and instead exploits text detections corresponding to the shop names. This makes our method applicable to a variety of domains and robust to store appearance variation across countries, seasons, and illumination conditions. We demonstrate the performance of our approach on a new dataset we collected of two very large shopping malls, and show the power of holistic reasoning.
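To illustrate the spirit of the text-based evidence (a toy, not the paper's MRF): a candidate location can be scored by how many detected shop names are among the shops visible from it on the floor plan. All shop names and grid positions below are made up for illustration:

```python
# Hypothetical floor plan: for each candidate position on a grid,
# the set of shop names visible from there.
floor_plan = {
    (0, 0): {"CoffeeHut", "ShoeWorld"},
    (0, 1): {"ShoeWorld", "BookNook"},
    (1, 0): {"BookNook", "TechMart"},
}

# Shop names read off the query image by a text detector.
detections = {"ShoeWorld", "BookNook"}

# Score each position by agreement between detections and visibility.
scores = {pos: len(visible & detections) for pos, visible in floor_plan.items()}
best = max(scores, key=scores.get)  # (0, 1): both detected shops are visible
```

The actual model goes well beyond this discrete matching: it performs joint inference over text detection, facade segmentation, and the continuous camera rotation and translation.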

Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions

Jimmy Ba, Kevin Swersky, Sanja Fidler, Ruslan Salakhutdinov

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{ba2015predicting,
  title = {Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions},
  author = {Jimmy Ba and Kevin Swersky and Sanja Fidler and Ruslan Salakhutdinov},
  booktitle = {ICCV},
  year = {2015}
}

One of the main challenges in Zero-Shot Learning of visual categories is gathering semantic attributes to accompany images. Recent work has shown that learning from textual descriptions, such as Wikipedia articles, avoids the problem of having to explicitly define these attributes. We present a new model that can classify unseen categories from their textual description. Specifically, we use text features to predict the output weights of both the convolutional and the fully connected layers in a deep convolutional neural network (CNN). We take advantage of the architecture of CNNs and learn features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches. The proposed model also allows us to automatically generate a list of pseudo-attributes for each visual category consisting of words from Wikipedia articles. We train our models end-to-end using the Caltech-UCSD bird and flower datasets and evaluate both ROC and Precision-Recall curves. Our empirical results show that the proposed model significantly outperforms previous methods.
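A minimal sketch of the weight-prediction idea: a mapping from text features to classifier weights, so that an unseen class can be scored without any training images. Here the mapping is a single linear layer and all features are random stand-ins (the paper predicts weights for both convolutional and fully connected CNN layers from learned text features):

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_img, n_classes = 10, 6, 3

# Toy stand-ins: one text feature per unseen class (e.g., derived
# from its Wikipedia article) and one image feature.
text_feats = rng.normal(size=(n_classes, d_text))
image_feat = rng.normal(size=(d_img,))

# A (here untrained) linear mapping from text space to the space
# of classifier weights.
M = rng.normal(size=(d_text, d_img))

class_weights = text_feats @ M       # one predicted weight vector per class
scores = class_weights @ image_feat  # zero-shot class scores for the image
pred = scores.argmax()               # predicted unseen class
```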

Skip-Thought Vectors

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler

Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

@inproceedings{kiros2015skipthought,
  title = {Skip-Thought Vectors},
  author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
  booktitle = {NIPS},
  year = {2015}
}

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.
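The vocabulary-expansion step can be sketched as fitting a linear map from a large pre-trained word-embedding space (e.g., word2vec) into the encoder's word-embedding space, using the words present in both vocabularies; the dimensions and vectors below are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_shared, d_w2v, d_rnn = 50, 8, 6

# Toy stand-ins for the two embedding tables, row-aligned on the
# words that appear in both vocabularies.
w2v_shared = rng.normal(size=(n_shared, d_w2v))
rnn_shared = rng.normal(size=(n_shared, d_rnn))

# Fit a linear map W by least squares: rnn_shared ≈ w2v_shared @ W.
W, *_ = np.linalg.lstsq(w2v_shared, rnn_shared, rcond=None)

# Any word with a word2vec vector can now be projected into the
# encoder's embedding space, even if unseen during training.
new_word_w2v = rng.normal(size=(d_w2v,))
new_word_rnn = new_word_w2v @ W
```

This is how a model trained on a comparatively small book vocabulary can be extended to cover on the order of a million words.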

back to top

Each student will need to write two paper reviews each week, present once or twice in class (depending on enrollment), participate in class discussions, and complete a project (done individually or in pairs).

The final grade will consist of the following:
Participation (attendance, participation in discussions, reviews): 25%
Presentation (presentation of papers in class): 35%
Project (proposal, final report): 40%

Detailed Requirements   

Paper reviewing

Every week (except for the first two) we will read 2 to 3 papers. The success of the discussion in class will thus depend on how prepared the students come to class. Each student is expected to read all the papers that will be discussed and to write two detailed reviews of the selected two papers. Depending on enrollment, each student will also need to present a paper in class. When you present, you do not need to hand in a review.

Deadline: The reviews will be due one day before the class.

Structure of the review
Short summary of the paper
Main contributions
Positive and negative points
How strong is the evaluation?
Possible directions for future work


Depending on enrollment, each student will need to present a few papers in class. The presentation should be clear and practiced and the student should read the assigned paper and related work in enough detail to be able to lead a discussion and answer questions. Extra credit will be given to students who also prepare a simple experimental demo highlighting how the method works in practice.

A presentation should be roughly 20 minutes long (please time it beforehand so that you do not go overtime). Typically this is about 15 to 20 slides. You are allowed to take some material from presentations on the web as long as you cite the source fairly. In the presentation, also provide the citation to the paper you present and to any other related work you reference.

Deadline: The presentation should be handed in one day before the class (or earlier if you want feedback).

Structure of presentation:
High-level overview with contributions
Main motivation
Clear statement of the problem
Overview of the technical approach
Strengths/weaknesses of the approach
Overview of the experimental evaluation
Strengths/weaknesses of evaluation
Discussion: future direction, links to other work


Each student will need to write a short project proposal at the beginning of the class (in January). The projects will be research oriented. In the middle of the semester you will need to hand in a progress report. The final project report will need to be handed in one week prior to the end of the class and presented in the last lecture (April). This will be a short, roughly 15-20 minute, presentation.

The students can work on projects individually or in pairs. The project can be an interesting topic that the student comes up with on their own or with the help of the instructor. The grade will depend on the ideas, how well you present them in the report, how well you position your work in the related literature, how thorough your experiments are, and how thoughtful your conclusions are.

Here is a list of possible topics for the projects.

back to top

We will first survey a few current methods on visual object recognition and scene understanding, as well as basic Natural Language Processing. The main focus of the course will be on vision and how to exploit natural language to learn visual concepts, improve visual parsing, do retrieval, as well as lingual description generation.

Visual recognition fundamentals
object recognition
image labeling
scene understanding
Natural Language Processing
parsing, part-of-speech tagging
coreference resolution
Images/videos and text
image tags
visual word-sense disambiguation
retrieval with complex lingual queries
visual concept grounding
description generation

back to top

We will have a guest lecture by Dr. Mohit Bansal, TTI-Chicago, on popular NLP topics and state-of-the-art techniques.

Basics in Image Understanding and Natural Language Processing

Jan 14: Course Intro & Overview of Computer Vision
Sanja Fidler (intro lecture, rec. lecture)

Jan 21: Basics and Popular Topics in NLP
Mohit Bansal (invited lecture)

Readings on Images/Videos and Text

Jan 28
Description generation: Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models [PDF]
R. Kiros, R. Salakhutdinov, R. S. Zemel
Presenter: Ryan Kiros
Description generation: Deep Visual-Semantic Alignments for Generating Image Descriptions [PDF]
A. Karpathy, L. Fei-Fei
Presenter: Ivan Vendrov (slides)

Feb 4
Datasets and metrics
Presenter: Kaustav Kundu (slides)
Description generation: Every Picture Tells a Story: Generating Sentences for Images [PDF]
A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, D. A. Forsyth
Presenter: Yukun Zhu (slides)

Feb 11
Description generation (video): Translating Video Content to Natural Language Descriptions [PDF]
M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, B. Schiele
Presenter: Patricia Thaine (slides)
Image generation (from text): Learning the Visual Interpretation of Sentences [PDF]
C. L. Zitnick, D. Parikh, L. Vanderwende
Presenter: Shenlong Wang (slides)

Feb 18: Reading week (no class)

Feb 25
Learning visual models from text: Connecting Modalities: Semi-supervised Segmentation and Annotation of Images Using Unaligned Text Corpora [PDF]
R. Socher, L. Fei-Fei
Presenter: Jake Snell (slides)
Learning visual models from text: Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers [PDF]
A. Gupta, L. S. Davis
Presenter: Arvid Frydenlund (slides)

March 4
Description generation: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention [PDF]
K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio
Presenter: Jimmy Ba
Learning visual models from the web: Inferring the Why in Images [PDF]
H. Pirsiavash, C. Vondrick, A. Torralba
Presenter: Micha Livne (slides)

March 11
Zero-shot visual learning via text: Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions [PDF]
M. Elhoseiny, B. Saleh, A. Elgammal
Presenter: Jimmy Ba
Learning visual models from the web: Learning Everything about Anything: Webly-Supervised Visual Concept Learning [PDF]
S. K. Divvala, A. Farhadi, C. Guestrin
Presenter: Patricia Thaine (slides)

March 18
Word-sense disambiguation: Unsupervised Learning of Visual Sense Models for Polysemous Words [PDF]
K. Saenko, T. Darrell
Presenter: Kamyar Seyed Ghasemipour
Question answering: A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input [PDF]
M. Malinowski, M. Fritz
Presenter: Ivan Vendrov (slides)

March 25
Learning visual models from text and video: Joint person naming in videos and coreference resolution in text [PDF]
V. Ramanathan, A. Joulin, P. Liang, L. Fei-Fei
Presenter: Yukun Zhu (slides)
Visual retrieval via complex lingual queries: Visual Semantic Search: Retrieving Videos via Complex Textual Queries [PDF]
D. Lin, S. Fidler, C. Kong, R. Urtasun
Presenter: Micha Livne (slides)

back to top

coming soon

back to top