Sanja Fidler

Nadia Li
4th year undergraduate, UofT
Oct 2020 -

Sandeep Routray
4th year undergraduate, IIT Kanpur
July 2020 -

Avik Pal
4th year undergraduate, IIT Kanpur
Jan 2020 -

Hirotaka Ishihara
4th year undergraduate, UofT
Dec 2019 -

Jiongtian Guo
4th year undergraduate, UofT
May 2020 -

Chen Cui
4th year undergraduate, UofT
May 2020 -

Ziyue Xu
4th year undergraduate, UofT
May 2020 -

Anyi Rao
PhD student at CHUK
Sept 2020 -- -

Makarand Tapaswi
Postdoc (now postdoc at INRIA)
Sept 2016 - Dec 2018

Hang Chu
Sept 2016 - April 2020
Now at Autodesk

Atef Chaudhury
Graduated with MSc (now at Google)
Sept 2017 - April 2019

Kevin Shen
Graduated with MSc (now at Layer6)
Sept 2017 - Jan 2019

Jiaman Li
Graduated with MSc (now Phd student at University of South California)
Sept 2017 - Jan 2019

Chaoqi Wang
Graduated with MSc (now Phd student at University of Chicago)
Sept 2017 - Jan 2019

Kaustav Kundu
Graduating with PhD (now at Amazon)
Sept 2015 - May 2017
Co-supervised with Raquel Urtasun

Lluis Castrejon
Graduated with MSc (now at University of Montreal)
Sept 2015 - May 2017
Co-supervised with Raquel Urtasun

Lluis Castrejon

Lluis worked on semi-automatic instance segmentation. Our CVPR'17 paper on this topic received Best Paper Honorable Mention.

Publications

Annotating Object Instances with a Polygon-RNN (best paper honorable mention)

Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017

Paper Abstract Bibtex

@inproceedings{CastrejonCVPR17,
title = {Annotating Object Instances with a Polygon-RNN},
author = {Lluis Castrejon and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2017}
}

We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows a human annotator to interfere at any time and correct a vertex if needed, producing as accurate segmentation as desired by the annotator. We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with original ground-truth, matching the typical agreement between human annotators. For cars, our speed-up factor is 7.3 for an agreement of 82.2%. We further show generalization capabilities of our approach to unseen datasets.

close window

Tom Lee
Graduated with PhD (now at LTAS Technologies Inc)
Sept 2011 - March 2016
Co-supervised with Sven Dickinson

Tom Lee

Tom is a 4th year PhD student and is currently doing a 8-month internship in a Toronto-based company LTAS Technologies Inc. Tom works on mid-level vision: grouping superpixels to form symmetric parts using a discriminative (trained) approach, and a learning framework for grouping superpixels into object proposals using several Gestalt-like cues (symmetry, closure, homogeneity of appearance). For the former, he showed how to learn with parametric submodular energies. His primary supervisor is Prof. Sven Dickinson.

Publications

A Learning Framework for Generating Region Proposals with Mid-level Cues

Tom Lee, Sanja Fidler, Sven Dickinson

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

Paper Abstract Bibtex

@inproceedings{TLeeICCV15,
title = {A Learning Framework for Generating Region Proposals with Mid-level Cues},
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
booktitle = {ICCV},
year = {2015}
}

The object categorization community's migration from object detection to large-scale object categorization has seen a shift from sliding window approaches to bottom-up region segmentation, with the resulting region proposals offering discriminating shape and appearance features through an attempt to explicitly segment the objects in a scene from their background. One powerful class of region proposal techniques is based on parametric energy minimization (PEM) via parametric maxflow. In this paper, we incorporate PEM into a novel structured learning framework that learns how to combine a set of mid-level grouping cues to yield a small set of region proposals with high recall. Second, we diversify our region proposals and rank them with region-based convolutional neural network features. Our novel approach, called parametric min-loss, casts perceptual grouping and cue combination in a learning framework which yields encouraging results on VOC'2012.

A Framework for Symmetric Part Detection in Cluttered Scenes

Tom Lee, Sanja Fidler, Alex Levinshtein, Cristian Sminchisescu, Sven Dickinson

Symmetry, Vol. 7, 2015, pp 1333-1351

Paper Abstract Bibtex

@article{LeeSymmetry2015,
title = {A Framework for Symmetric Part Detection in Cluttered Scenes},
author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Cristian Sminchisescu and Sven Dickinson},
journal = {Symmetry},
volume = {7},
pages = {1333-1351},
year = {2015}
}

The role of symmetry in computer vision has waxed and waned in importance during the evolution of the field from its earliest days. At first figuring prominently in support of bottom-up indexing, it fell out of favour as shape gave way to appearance and recognition gave way to detection. With a strong prior in the form of a target object, the role of the weaker priors offered by perceptual grouping was greatly diminished. However, as the field returns to the problem of recognition from a large database, the bottom-up recovery of the parts that make up the objects in a cluttered scene is critical for their recognition. The medial axis community has long exploited the ubiquitous regularity of symmetry as a basis for the decomposition of a closed contour into medial parts. However, today's recognition systems are faced with cluttered scenes and the assumption that a closed contour exists, i.e., that figure-ground segmentation has been solved, rendering much of the medial axis community's work inapplicable. In this article, we review a computational framework, previously reported in [Lee et al., ICCV'13, Levinshtein et al., ICCV'09, Levinshtein et al., IJCV'13], that bridges the representation power of the medial axis and the need to recover and group an object's parts in a cluttered scene. Our framework is rooted in the idea that a maximally-inscribed disc, the building block of a medial axis, can be modelled as a compact superpixel in the image. We evaluate the method on images of cluttered scenes.

Multi-cue Mid-level Grouping

Tom Lee, Sanja Fidler, Sven Dickinson

In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

Paper Abstract Bibtex

@inproceedings{LeeACCV14,
title = {Multi-cue mid-level grouping},
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
booktitle = {ACCV},
year = {2014}
}

Region proposal methods provide richer object hypotheses than sliding windows with dramatically fewer proposals, yet they still number in the thousands. This large quantity of proposals typically results from a diversification step that propagates bottom-up ambiguity in the form of proposals to the next processing stage. In this paper, we take a complementary approach in which mid-level knowledge is used to resolve bottom-up ambiguity at an earlier stage to allow a further reduction in the number of proposals. We present a method for generating regions using the mid-level grouping cues of closure and symmetry. In doing so, we combine mid-level cues that are typically used only in isolation, and leverage them to produce fewer but higher quality proposals. We emphasize that our model is mid-level by learning it on a limited number of objects while applying it to different objects, thus demonstrating that it is transferable to other objects. In our quantitative evaluation, we 1) establish the usefulness of each grouping cue by demonstrating incremental improvement, and 2) demonstrate improvement on two leading region proposal methods with a limited budget of proposals.

Detecting Curved Symmetric Parts using a Deformable Disc Model

Tom Lee, Sanja Fidler, Sven Dickinson

In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

Paper Abstract Project page Bibtex

@inproceedings{LeeICCV13,
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
title = {Detecting Curved Symmetric Parts using a Deformable Disc Model},
booktitle = {ICCV},
year = {2013}
}

Symmetry is a powerful shape regularity that's been exploited by perceptual grouping researchers in both human and computer vision to recover part structure from an image without a priori knowledge of scene content. Drawing on the concept of a medial axis, defined as the locus of centers of maximal inscribed discs that sweep out a symmetric part, we model part recovery as the search for a sequence of deformable maximal inscribed disc hypotheses generated from a multiscale superpixel segmentation, a framework proposed by [Levinshtein et al., ICCV'09]. However, we learn affinities between adjacent superpixels in a space that's invariant to bending and tapering along the symmetry axis, enabling us to capture a wider class of symmetric parts. Moreover, we introduce a global cost that perceptually integrates the hypothesis space by combining a pairwise and a higher-level smoothing term, which we minimize globally using dynamic programming. The new framework is demonstrated on two datasets, and is shown to significantly outperform the baseline [Levinshtein et al., ICCV'09].

Learning Categorical Shape from Captioned Images

Tom Lee, Sanja Fidler, Alex Levinshtein, Sven Dickinson

Conference on Computer and Robot Vision (CRV), Toronto, Canada, May 2012

Paper Abstract Bibtex

@inproceedings{LeeCRV12,
author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Sven Dickinson},
title = {Learning Categorical Shape from Captioned Images},
booktitle = {Canadian Conference on Computer and Robot Vision (CRV)},
year = {2012}
}

Given a set of captioned images of cluttered scenes containing various objects in different positions and scales, we learn named contour models of object categories without relying on bounding box annotation. We extend a recent language-vision integration framework that finds spatial configurations of image features that co-occur with words in image captions. By substituting appearance features with local contour features, object categories are recognized by a contour model that grows along the object's boundary. Experiments on ETHZ are presented to show that 1) the extended framework is better able to learn named visual categories whose within class variation is better captured by a shape model than an appearance model; and 2) typical object recognition methods fail when manually annotated bounding boxes are unavailable

close window

Yukun Zhu
Graduated with MSc (now at Google)
Sept 2015 - Jan 2016
Co-supervised with Raquel Urtasun and Ruslan Salakhutdinov

Yukun Zhu

Yukun's research was in two domains: object class detection and vision-language integration. His approach published at CVPR'15 significantly outperformed previous state-of-the-art in detection on PASCAL VOC.

Publications

3D Object Proposals using Stereo Imagery for Accurate Object Class Detection

Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Huimin Ma, Sanja Fidler, Raquel Urtasun

In Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017

Paper Abstract Bibtex

@inproceedings{ChenArxiv16,
title = {3D Object Proposals using Stereo Imagery for Accurate Object Class Detection},
author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {arXiv:1608.07711},
year = {2016}
}

The goal of this paper is to perform 3D object detection in the context of autonomous driving. Our method first aims at generating a set of high-quality 3D object proposals by exploiting stereo imagery. We formulate the problem as minimizing an energy function that encodes object size priors, placement of objects on the ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. We then exploit a CNN on top of these proposals to perform object detection. In particular, we employ a convolutional neural net (CNN) that exploits context and depth information to jointly regress to 3D bounding box coordinates and object pose. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. When combined with the CNN, our approach outperforms all existing results in object detection and orientation estimation tasks for all three KITTI object classes. Furthermore, we experiment also with the setting where LIDAR information is available, and show that using both LIDAR and stereo leads to the best result.

MovieQA: Understanding Stories in Movies through Question-Answering
(spotlight presentation)

Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

Benchmark on question-answering about movies

Paper Abstract Benchmark Bibtex

@inproceedings{TapaswiCVPR16,
title = {MovieQA: Understanding Stories in Movies through Question-Answering},
author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2016}
}

We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.

segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

Currently third in detection on PASCAL VOC Leaderboard

Paper Abstract Project page Bibtex

@inproceedings{ZhuSegDeepM15,
title = {segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection},
author = {Yukun Zhu and Raquel Urtasun and Ruslan Salakhutdinov and Sanja Fidler},
booktitle = {CVPR},
year = {2015}
}

In this paper, we propose an approach that exploits object segmentation in order to improve the accuracy of object detection. We frame the problem as inference in a Markov Random Field, in which each detection hypothesis scores object appearance as well as contextual information using Convolutional Neural Networks, and allows the hypothesis to choose and score a segment out of a large pool of accurate object segmentation proposals. This enables the detector to incorporate additional evidence when it is available and thus results in more accurate detections. Our experiments show an improvement of 4.1% in mAP over the R-CNN baseline on PASCAL VOC 2010, and 3.4% over the current state-of-the-art, demonstrating the power of our approach.

Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

arXiv preprint arXiv:1506.06724, 2015

Aligning movies and books for story-like captioning

Paper Abstract Project page Bibtex

@inproceedings{moviebook,
title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
booktitle = {arXiv preprint arXiv:1506.06724},
year = {2015}
}

Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.

Skip-Thought Vectors

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler

arXiv preprint arXiv:1506.06726, 2015

Sent2vec neural representation trained on 11K books

Paper Abstract Code Bibtex

@inproceedings{moviebook,
title = {Skip-Thought Vectors},
author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
booktitle = {arXiv preprint arXiv:1506.06726},
year = {2015}
}

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.

3D Object Proposals for Accurate Object Class Detection

Xiaozhi Chen*, Kaustav Kundu*, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun

* Denotes equal contribution

Currently third in Car, and first in Pedestrian and Cyclist detection on KITTI's Leaderboard

Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

Paper Abstract Project page Bibtex

@inproceedings{XiaozhiNIPS15,
title = {3D Object Proposals for Accurate Object Class Detection},
author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Andrew Berneshawi and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {NIPS},
year = {2015}
}

The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on Car and Cyclist, and is competitive for the Pedestrian class.

close window

Ivan Vendrov
Graduated with MSc (now at Google)
Sept 2015 - Jan 2016
Co-supervised with Raquel Urtasun

Ivan's masters thesis was on the topic of semantic visual search.

Publications

Order-Embeddings of Images and Language (oral presentation)

Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun

In International Conference on Learning Representations, Puerto Rico, 2016

State-of-the-art in caption-image retrieval on COCO

Paper Abstract Code Bibtex

@inproceedings{VendrovArxiv15,
title = {Order-Embeddings of Images and Language},
author = {Ivan Vendrov and Ryan Kiros and Sanja Fidler and Raquel Urtasun},
booktitle = {ICLR},
year = {2016}
}

Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.

close window

Ziyu Zhang
Graduated with MSc (now at Snap Inc)
Sept 2015 - April 2016
Co-supervised with Raquel Urtasun

Ziyu Zhang

Ziyu's masters' thesis was on instance-level object segmentation in monocular imagery.

Publications

Instance-Level Segmentation with Deep Densely Connected MRFs

Ziyu Zhang, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

Paper Abstract Bibtex

@inproceedings{ZhangCVPR16,
title = {Instance-Level Segmentation with Deep Densely Connected MRFs},
author = {Ziyu Zhang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}
}

Our aim is to provide a pixel-level object instance labeling of a monocular image. We build on recent work [Zhang et al., ICCV15] that trained a convolutional neural net to predict instance labeling in local image patches, extracted exhaustively in a stride from an image. A simple Markov random field model using several heuristics was then proposed in [Zhang et al., ICCV15] to derive a globally consistent instance labeling of the image. In this paper, we formulate the global labeling problem with a novel densely connected Markov random field and show how to encode various intuitive potentials in a way that is amenable to efficient mean field inference [Krahenbuhl et al., NIPS11]. Our potentials encode the compatibility between the global labeling and the patch-level predictions, contrast-sensitive smoothness as well as the fact that separate regions form different instances. Our experiments on the challenging KITTI benchmark [Geiger et al., CVPR12] demonstrate that our method achieves a significant performance boost over the baseline [Zhang et al., ICCV15].

Monocular 3D Object Detection for Autonomous Driving

Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

Paper Abstract Project page Bibtex

@inproceedings{ChenCVPR16,
title = {Monocular 3D Object Detection for Autonomous Driving},
author = {Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}
}

The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.

Monocular Object Instance Segmentation and Depth Ordering with CNNs

Ziyu Zhang, Alex Schwing, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

Paper Abstract Bibtex

@inproceedings{ZhangICCV15,
title = {Monocular Object Instance Segmentation and Depth Ordering with CNNs},
author = {Ziyu Zhang and Alex Schwing and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}
}

In this paper we tackle the problem of instance level segmentation and depth ordering from a single monocular image. Towards this goal, we take advantage of convolutional neural nets and train them to directly predict instance level segmentations where the instance ID encodes depth ordering from large image patches. To provide a coherent single explanation of an image we develop a Markov random field which takes as input the predictions of convolutional nets applied at overlapping patches of different resolutions as well as the output of a connected component algorithm and predicts very accurate instance level segmentation and depth ordering. We demonstrate the effectiveness of our approach on the challenging KITTI benchmark and show very good performance on both tasks.

close window

Didac Suric
MSc student at UPC (now at Columbia University)
Sept 2018 -

Bo Dai
PhD student, Chinese University of Hong Kong (now postdoc at MIT)
Sept - Dec 2017

Enric Corona
MSc student, UPC in Barcelona (now PhD at UPC)
May - Dec 2017

Xavier Puig Fernandez
PhD student, MIT
Jan-March, 2016

Xavier is visited the group two times, once in Nov 2015, and from Jan to March, 2016. We are working on the problem of video to text alignment.

Urban Jezernik
PhD student, University of Ljubljana
Jan-April, 2016

Urban visited the group from Jan to April, 2016. We were working on the problem of music generation.

Makarand Tapaswi
PhD student, KIT (now a postdoc in our group)
Sept-Dec, 2015

Makarand visited for three months in 2015, and has joined our group as a postdoc in the fall of 2016.

Publications

MovieQA: Understanding Stories in Movies through Question-Answering
(spotlight presentation)

Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

Paper Abstract Benchmark Bibtex

@inproceedings{TapaswiCVPR16,
title = {MovieQA: Understanding Stories in Movies through Question-Answering},
author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2016}
}

We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.

close window

Edgar Simo-Serra
PhD student, UPC in Barcelona (now a postdoc at Tokyo University)
Summer 2013, 2014

Edgar visited the group twice. During his first visit (to TTI-C) he was working on clothing parsing in fashion photographs. On this topic he published a first-author paper at ACCV'14. In his second visit (to UofT), he worked on predicting how fashionable / stylish someone looks on a photograph, and suggest ways to help the user to improve her/his "look". This resulted in a first-author CVPR'15 paper. The paper got significant international press coverage in major news and fashion media such as New Scientist, Quartz, Wired, Glamour, Cosmopolitan, Elle and Marie Claire (see project page for more details). Edgar gave several interviews for the press, including an appearance on Spanish television (minutes 15:12 to 16:43) and radio (minutes 16:10 to 20:43). Yahoo News, Canada, featured a full photo of him in one of my favorite press articles on the subject.

Publications

Neuroaesthetics in Fashion: Modeling the Perception of Beauty

Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

How fashionable do you look in a photo? And how can you improve?

Paper Abstract Project page Bibtex

@inproceedings{SimoCVPR15,
title = {Neuroaesthetics in Fashion: Modeling the Perception of Beauty},
author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}
}

In this paper, we analyze the fashion of clothing of a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback back to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.

A High Performance CRF Model for Clothes Parsing

Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

Significant performance gain over state-of-the-art in clothing parsing.

Paper Abstract Project page Bibtex

@inproceedings{SimoACCV14,
title = {A High Performance CRF Model for Clothes Parsing},
author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
booktitle = {ACCV},
year = {2014}}

In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset [Yamaguchi et al., CVPR'12] and show that we can obtain a significant improvement over the state-of-the-art.

close window

Roozbeh Mottaghi
PhD student, UCLA (now a Research Scientist at AI2)
Summer 2012, 2013

Roozbeh visited the group several times, working on the topic of object class detection. His work resulted in several state-of-the-art detectors. He published two first-author and two second-author CVPR papers (CVPR'13 and '14), as well as a first-author T-PAMI publication.
Roozbeh went to do a postdoc with Prof. Silvio Savarese at Stanford and is now a Research Scientist at AI2.

Publications

Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding

Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh

Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(1), 2016, pages 74-8

Paper Abstract Suppl. Mat. Bibtex

@article{MottaghiPAMI16,
title = {Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding},
author = {Roozbeh Mottaghi and Sanja Fidler and Alan Yuille and Raquel Urtasun and Devi Parikh},
journal = {Trans. on Pattern Analysis and Machine Intelligence},
volume= {38},
number= {1},
pages= {74--87},
year = {2016}
}

Recent trends in image understanding have pushed for scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we "plug-in" human subjects for each of the various components in a conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve scene understanding by focusing research efforts on various individual tasks.

The Role of Context for Object Detection and Semantic Segmentation in the Wild

Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, Alan Yuille

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

PASCAL VOC with dense segmentation labels for 400+ classes in Project page

Paper Errata Abstract Project page Suppl. Mat. Bibtex

@inproceedings{MottaghiCVPR14,
author = {Roozbeh Mottaghi and Xianjie Chen and Xiaobai Liu and Sanja Fidler and Raquel Urtasun and Alan Yuille},
title = {The Role of Context for Object Detection and Semantic Segmentation in the Wild},
booktitle = {CVPR},
year = {2014}
}

In this paper we study the role of context in existing stateof-the-art detection and segmentation approaches. Towards this goal, we label every pixel of PASCAL VOC 2010 detection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmentation and object detection. Our analysis shows that nearest neighbor based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, improvements of existing contextual models for detection is rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales.

Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts

Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Nam-Gyu Cho, Sanja Fidler, Raquel Urtasun, Alan Yuille

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

PASCAL VOC with object parts segmentations available in Project page

Paper Abstract Project page Bibtex

@inproceedings{PartsCVPR14,
author = {Xianjie Chen and Roozbeh Mottaghi and Xiaobai Liu and Nam-Gyu Cho and Sanja Fidler and Raquel Urtasun and Alan Yuille},
title = {Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts},
booktitle = {CVPR},
year = {2014}
}

Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals depicted at low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables us to represent a large number of holistic object and body part combinations to better deal with different "detectability" patterns caused by deformations, occlusion and/or low resolution. We apply our method to the six animal categories in the PASCAL VOC dataset and show that our method significantly improves state-of-the-art (by 4.1% AP) and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc), making use of a new dataset of fully annotated object parts for PASCAL VOC 2010, which provides a mask for each part.

Bottom-up Segmentation for Top-down Detection

Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

8% over DPM and 4% over the state-of-the-art on PASCAL VOC at the time.

Paper Abstract Project page Suppl. Mat. Bibtex

@inproceedings{segdpmCVPR13,
author = {Sanja Fidler and Roozbeh Mottaghi and Alan Yuille and Raquel Urtasun},
title = {Bottom-up Segmentation for Top-down Detection},
booktitle = {CVPR},
year = {2013}
}

In this paper we are interested in how semantic segmentation can help object detection. Towards this goal, we propose a novel deformable part-based model which exploits region-based segmentation algorithms that compute candidate object regions by bottom-up clustering followed by ranking of those regions. Our approach allows every detection hypothesis to select a segment (including void), and scores each box in the image using both the traditional HOG filters as well as a set of novel segmentation features. Thus our model "blends" between the detector and segmentation models. Since our features can be computed very efficiently given the segments, we maintain the same complexity as the original DPM. We demonstrate the effectiveness of our approach in PASCAL VOC 2010, and show that when employing only a root filter our approach outperforms Dalal & Triggs detector on all classes, achieving 13% higher average AP. When employing the parts, we outperform the original DPM in 19 out of 20 classes, achieving an improvement of 8% AP. Furthermore, we outperform the previous state-of-the-art on VOC'10 test by 4%.

Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs

Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

Paper Abstract Suppl. Mat. Bibtex

@inproceedings{MottaghiCVPR13,
author = {Roozbeh Mottaghi and Sanja Fidler and Jian Yao and Raquel Urtasun and Devi Parikh},
title = {Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs},
booktitle = {CVPR},
year = {2013}
}

Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we "plug-in" human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted in depth analysis of the human generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MRSC dataset.

close window

Liang-Chieh Chen
PhD student, UCLA (now at Google)
Summer 2013

Liang-Chieh ("Jay") worked on weakly-labeled segmentation: getting accurate object segmentation given a ground-truth 3D bounding box as available in the KITTI dataset. His method improved significantly over existing grab-cut type of approaches, and even outperformed MTurkers (compared to accurate in-house annotations). Jay authored a first-author paper at CVPR'14.

Publications

Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision

Liang-Chieh Chen, Sanja Fidler, Alan Yuille, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

More accurate pixel-level object labeling than MTurkers

Paper Abstract Project page CAD models Suppl. Mat. Bibtex

@inproceedings{ChenCVPR14,
author = {Liang-Chieh Chen and Sanja Fidler and Alan Yuille and Raquel Urtasun},
title = {Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision},
booktitle = {CVPR},
year = {2014}
}

Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and a budget of tens or hundreds of thousands of dollars. Thus, developing solutions that can automatically perform the labeling given only weak supervision is key to reduce this cost. In this paper, we show how to exploit 3D information to automatically generate very accurate object segmentations given annotated 3D bounding boxes. We formulate the problem as the one of inference in a binary Markov random field which exploits appearance models, stereo and/or noisy point clouds, a repository of 3D CAD models as well as topological constraints. We demonstrate the effectiveness of our approach in the context of autonomous driving, and show that we can segment cars with the accuracy of 86% intersection-over-union, performing as well as highly recommended MTurkers!

close window

Abhishek Sharma
PhD student, UMD (now at Apple)
Summer 2012

Abhishek worked on holistic scene parsing by exploiting image captions. Making use of textual information for visual parsing is important for, e.g., robotics applications where an automatics system interacts with a human user. Abhishek co-authored a CVPR'13 paper.

Publications

A Sentence is Worth a Thousand Pixels

Sanja Fidler, Abhishek Sharma, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

Paper Abstract Suppl. Mat. Bibtex

@inproceedings{FidlerCVPR13,
author = {Sanja Fidler and Abhishek Sharma and Raquel Urtasun},
title = {A Sentence is Worth a Thousand Pixels},
booktitle = {CVPR},
year = {2013}
}

We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models.

close window

Bowen Chen
3rd year undergraduate, UofT (now Msc at Columbia University)
Jan 2019 - Sept 2020

Maciej Kowalski
4th year undergraduate, UofT (now at University of Edinburgh)
Dec 2019 - May 2020

Jinchen Xuan
4th year undergraduate, Peking University
June 2019 – Nov 2019

Zhaocong (Justin) Yuan
4th year undergraduate, UofT
Sept 2019 -- May 2020

Jordi Fortuny Profitos
4th year undergraduate, UPC
Thesis, Sept 2018 -

Eric Guisado Bandres
4th year undergraduate, UPC
Thesis, Sept 2018 -

Rafel Palliser Sans
4th year undergraduate, UPC
Thesis, Sept 2018 -

Louis Clergue
4th year undergraduate, UPC
Thesis, Feb 2019 -

Tianshi Cao
4th year undergraduate, UofT (now Msc at UofT)
Thesis, Sept 2018 - Aug 2019

Xi Yan
4rd year undergraduate, UofT (now at Stanford University)
May 2019 - Dec 2019

Yuhao Zhou
3rd year undergraduate, UofT
Jan 2017 - 2019

Zian Wang
4th year undergraduate, Tsinghua University (now PhD at UofT)
July 2018 -- Nov 2018

Liren Chen
3rd year undergraduate, Tsinghua University
Summer 2017

Ching-Yao Chuang
4th year undergraduate, National Tsinghua University of Taiwan (now PhD at MIT)
July - Nov 2017

Zheng Wu
3rd year undergraduate, Shanghai Jiao Tong University
Summer 2017

Kefan (Arthur) Chen
4th year undergraduate, UofT /(now at Google)
Capstone project, Sept 2017 - May 2018

Tiantian Fang
4th year undergraduate, UofT (now at UIUC)
Sept 2017 - Jan 2018

Wesley Heung
4th year undergraduate, UofT
Thesis, Sept 2017 - May 2018

Daiqing Li
4th year undergraduate, UofT (now at NVIDIA)
Capstone project, Sept 2017 - May 2018

Juan Morales Vega
4th year undergraduate, UPC in Barcelona
Feb - June 2017

Haokun Liu
3rd year undergraduate, Peking University (now at NYU)
Feb - June 2017

Ge (Olga) Xu
3rd year undergraduate, UofT
Summer 2016 (USRA)

Kevin Kyunghwan Ra
4th year undergraduate, UofT (now at McMaster University)
2016

Vasu Sharma
3rd year undergraduate, IIT Kanpur
Summer 2016, co-supervised with Raquel Urtasun

Amlan Kar
3rd year undergraduate, IIT Kanpur (now doing PhD with me at UofT)
Summer 2016, co-supervised with Raquel Urtasun

Erin Grant
4th year undergraduate, UofT (now a PhD student at UC Berkeley)
Jan-April, 2016

Seung Wook Kim
4th year undergraduate, UofT (now doing MSc with me at UofT)
Jan-April, 2016

Huazhe Xu
4th year visiting student from Tsinghua University (now a PhD student at UC Berkeley)
Sep 2015 - Dec 2015, co-supervised with Raquel Urtasun

Boris Ivanovic
4th year undergraduate, UofT (now a MSc student at Stanford University)
Sep 2015 - May 2016, co-supervised with Raquel Urtasun

Tamara Lipowski
4th year undergraduate, UofT (now a MSc student at University of Salzburg)
Jan-April, 2016

Zexuan (Aaron) Wang
4th year undergraduate, UofT (now at Qumulo Inc)
Sept 2015 - April 2016, co-supervised with Raquel Urtasun

Jurgen Aliaj
2nd year undergraduate, UofT (now a MSc student at UofT)
Summer 2015 (USRA)

Andrew Berneshawi
4th year undergraduate, UofT (now at Amazon, Seattle)
CSC494, Winter 2015

Andrew worked on road estimation as part of a semester-long project course (CSC494). His approach ranked second on KITTI's road classification benchmark (entry: NNP, time stamped: June 2015).

close window

Chenxi Liu
4th year undergraduate, Tsingua University (now a PhD student at Johns Hopkins University)
Summer 2014, co-supervised with Raquel Urtasun

Chenxi worked on the problem of apartment reconstruction in 3D from rental data (monocular imagery and floor-plan). His work resulted in a joint first-author oral CVPR'15 paper. He gave a talk at CVPR and did a great job (you can check his performance below).

Publications

Rent3D: Floor-Plan Priors for Monocular Layout Estimation
(oral presentation)

Chenxi Liu*, Alex Schwing*, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

Rent an apartment in 3D!

* Denotes equal contribution

Paper Abstract Suppl. Mat. Project page Bibtex

@inproceedings{ApartmentsCVPR15,
title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2015}}

The goal of this paper is to enable a 3D "virtual-tour" of an apartment given a small set of monocular images of different rooms, as well as a 2D floor plan. We frame the problem as the one of inference in a Markov random field which reasons about the layout of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example, about in which room the picture was taken. What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge. In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information can significantly help in resolving the challenging room-apartment alignment problem. We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we exploit integral geometry as well as our new bounds on the aspect ratio of rooms which allow us to carve the space, reducing significantly the number of physically possible configurations. We demonstrate the effectiveness of our approach in a new dataset which contains over 200 apartments.

[talk] [slides]

close window

Yinan Zhao
4th year undergraduate, Tsingua University (now a PhD student at UT Austin)
Summer 2014, co-supervised with Raquel Urtasun

Chen Kong
4th year undergraduate, Tsingua University (now a PhD student at CMU)
Summer 2013, co-supervised with Raquel Urtasun

Chen worked on 3D indoor scene understanding by exploiting textual information. His work resulted in one first-author and another co-authored CVPR'14 paper, and he co-authored an oral paper at BMVC'15.

Publications

What are you talking about? Text-to-Image Coreference

Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

Exploits text for visual parsing and aligns nouns to objects.

Paper Abstract Bibtex

@inproceedings{KongCVPR14,
title = {What are you talking about? Text-to-Image Coreference},
author = {Chen Kong and Dahua Lin and Mohit Bansal and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2014}
}

In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system.

Visual Semantic Search: Retrieving Videos via Complex Textual Queries

Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

Video retrieval when a query is a longer sentence or a multi-sentence description

Paper Abstract Suppl. Mat. Bibtex

@inproceedings{LinCVPR14,
author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
title = {Visual Semantic Search: Retrieving Videos via Complex Textual Queries},
booktitle = {CVPR},
year = {2014}
}

In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.

Generating Multi-Sentence Lingual Descriptions of Indoor Scenes (oral presentation)

Dahua Lin, Chen Kong, Sanja Fidler, Raquel Urtasun

In British Machine Vision Conference (BMVC), 2015

Paper Abstract Bibtex

@inproceedings{Lin15,
title = {Generating Multi-Sentence Lingual Descriptions of Indoor Scenes},
author = {Dahua Lin and Chen Kong and Sanja Fidler and Raquel Urtasun},
booktitle = {arXiv:1503.00064},
year = {2015}}

This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches focusing primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this, by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROGUE scores compared to those produced by the baseline.

close window

Jialiang Wang
4th year undergraduate, UofT (now a PhD student at Harvard University)
Summer 2014 (USRA), co-supervised with Sven Dickinson

Uri Priel
3rd year undergraduate, UofT
Summer 2014 (USRA), co-supervised with Sven Dickinson

Kamyar Seyed Ghasemipour
2nd year undergraduate, UofT (now a MSc student at UofT)
Summer 2014 (USRA), co-supervised with Suzanne Stevenson and Sven Dickinson

Kamyar worked on unsupervised word-sense disambiguation of captioned images. He won a research video competition (video) held for the Undergraduate Summer Research Program at UofT.

close window

Students

PhD and MSc students

Amlan Kar

Undergraduate students

Visiting students

Past postdocs

Past graduate students

Lluis Castrejon

Publications

Annotating Object Instances with a Polygon-RNN (best paper honorable mention)

Tom Lee

Publications

A Learning Framework for Generating Region Proposals with Mid-level Cues

A Framework for Symmetric Part Detection in Cluttered Scenes

Multi-cue Mid-level Grouping

Detecting Curved Symmetric Parts using a Deformable Disc Model

Learning Categorical Shape from Captioned Images

Yukun Zhu

Publications

3D Object Proposals using Stereo Imagery for Accurate Object Class Detection

MovieQA: Understanding Stories in Movies through Question-Answering(spotlight presentation)

segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

Skip-Thought Vectors

3D Object Proposals for Accurate Object Class Detection

Publications

Order-Embeddings of Images and Language (oral presentation)

Ziyu Zhang

Publications

Instance-Level Segmentation with Deep Densely Connected MRFs

Monocular 3D Object Detection for Autonomous Driving

Monocular Object Instance Segmentation and Depth Ordering with CNNs

Past visiting graduate students

Publications

MovieQA: Understanding Stories in Movies through Question-Answering (spotlight presentation)

Publications

Neuroaesthetics in Fashion: Modeling the Perception of Beauty

A High Performance CRF Model for Clothes Parsing

Publications

Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding

The Role of Context for Object Detection and Semantic Segmentation in the Wild

Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts

Bottom-up Segmentation for Top-down Detection

Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs

Publications

Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision

Publications

A Sentence is Worth a Thousand Pixels

Past undergraduate students

Publications

Rent3D: Floor-Plan Priors for Monocular Layout Estimation (oral presentation)

Publications

What are you talking about? Text-to-Image Coreference

Visual Semantic Search: Retrieving Videos via Complex Textual Queries

Generating Multi-Sentence Lingual Descriptions of Indoor Scenes (oral presentation)

MovieQA: Understanding Stories in Movies through Question-Answering
(spotlight presentation)

MovieQA: Understanding Stories in Movies through Question-Answering
(spotlight presentation)

Rent3D: Floor-Plan Priors for Monocular Layout Estimation
(oral presentation)