Makarand Tapaswi

joined in September 2016
PhD from KIT, Germany

Makarand visited the group for three months in 2015 and joined as a postdoc in the fall of 2016. We are working on video understanding problems.

Publications

  • Situation Recognition with Graph Neural Networks

    Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, Sanja Fidler

    In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

    Paper  Abstract  Bibtex

    @inproceedings{SituationsICCV17,
    title = {Situation Recognition with Graph Neural Networks},
    author = {Ruiyu Li and Makarand Tapaswi and Renjie Liao and Jiaya Jia and Raquel Urtasun and Sanja Fidler},
    booktitle = {ICCV},
    year = {2017}
    }

    We address the problem of recognizing situations in images. Given an image, the task is to predict the most salient verb (action), and fill its semantic roles such as who is performing the action, what is the source and target of the action, etc. Different verbs have different roles (e.g., attacking has a weapon role), and each role can take on many possible values (nouns). We propose a model based on Graph Neural Networks that allows us to efficiently capture joint dependencies between roles using neural networks defined on a graph. Experiments with different graph connectivities show that our approach that propagates information between roles significantly outperforms existing work, as well as multiple baselines. We obtain roughly 3-5% improvement over previous work in predicting the full situation. We also provide a thorough qualitative analysis of our model and the influence of different roles in the verbs.
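    As a toy illustration of the idea, message passing between role nodes on a small fully connected graph can be sketched in plain Python (a hypothetical simplification: the paper's model uses learned, gated neural message and update functions, which are replaced here by simple averaging):

```python
# Toy message passing between role nodes (hypothetical simplification:
# the learned GNN updates are replaced by plain averaging).

def propagate(states, edges, steps=2):
    """states: dict node -> list[float]; edges: list of (src, dst) pairs."""
    dim = len(next(iter(states.values())))
    for _ in range(steps):
        messages = {n: [0.0] * dim for n in states}
        counts = {n: 0 for n in states}
        for src, dst in edges:
            for i in range(dim):
                messages[dst][i] += states[src][i]
            counts[dst] += 1
        new_states = {}
        for n in states:
            # Update: average own state with the mean incoming message.
            if counts[n]:
                incoming = [m / counts[n] for m in messages[n]]
            else:
                incoming = states[n]
            new_states[n] = [(s + m) / 2.0 for s, m in zip(states[n], incoming)]
        states = new_states
    return states

# Role nodes of one verb exchange information over a fully connected graph.
roles = {"verb": [1.0, 0.0], "agent": [0.0, 1.0], "tool": [0.0, 0.0]}
edges = [(a, b) for a in roles for b in roles if a != b]
out = propagate(roles, edges, steps=2)
```

    Repeated propagation mixes information across roles, which is the effect the learned version exploits when predicting all roles of a situation jointly.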

  • MovieQA: Understanding Stories in Movies through Question-Answering
    (spotlight presentation)

    Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Paper  Abstract  Benchmark  Bibtex

    @inproceedings{TapaswiCVPR16,
    title = {MovieQA: Understanding Stories in Movies through Question-Answering},
    author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2016}
    }

    We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.
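    To make the "intelligent baselines" concrete, a minimal word-overlap baseline for the five-way multiple-choice setup might look as follows (a hypothetical sketch: the paper's baselines operate on richer text and video representations, and the story and choices below are invented):

```python
# Toy word-overlap baseline for 5-way multiple choice (hypothetical).

def overlap(a, b):
    """Count words shared between two strings, case-insensitively."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def answer(story, choices):
    """Pick the choice sharing the most words with the story text."""
    return max(range(len(choices)), key=lambda i: overlap(story, choices[i]))

story = "Harry learns that he is a wizard and leaves for Hogwarts"
choices = [
    "He becomes a banker",
    "He learns he is a wizard",
    "He moves to London",
    "He loses his memory",
    "He wins a lottery",
]
pred = answer(story, choices)  # index of the highest-overlap choice
```

    Deceiving answers are written to defeat exactly this kind of shallow matching, which is why such baselines perform poorly on the dataset.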


Kaustav Kundu

PhD Student (2014 - )

Co-supervised with Raquel Urtasun
 

Hang Chu

PhD Student (2016 - )

Co-supervised with Raquel Urtasun
 

Masha Shugrina

PhD Student (2017 - )

Co-supervised with Karan Singh
 

Amlan Kar

PhD Student (2017 - )
 

Wenzheng Chen

PhD Student (2017 - )

Co-supervised with Kyros Kutulakos

 

Kaustav Kundu

Kaustav is a 4th year PhD student, working on 3D scene understanding. He co-authored the Polygon-RNN paper, which received the best paper honorable mention at CVPR'17. Kaustav is graduating in Dec 2017.

Publications

  • Annotating Object Instances with a Polygon-RNN
    (best paper honorable mention)

    Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017

    Paper  Abstract  Bibtex

    @inproceedings{CastrejonCVPR17,
    title = {Annotating Object Instances with a Polygon-RNN},
    author = {Lluis Castrejon and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2017}
    }

    We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows a human annotator to interfere at any time and correct a vertex if needed, producing as accurate segmentation as desired by the annotator. We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with original ground-truth, matching the typical agreement between human annotators. For cars, our speed-up factor is 7.3 for an agreement of 82.2%. We further show generalization capabilities of our approach to unseen datasets.
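    The human-in-the-loop annotation loop can be sketched as follows (a hypothetical toy version: the real model is a CNN-RNN predicting polygon vertices on a spatial grid, replaced here by a stub that traces a square):

```python
# Minimal sketch of the annotate-and-correct loop (hypothetical: the
# model below is a stub that walks a fixed square outline).

def model_next_vertex(vertices):
    """Stub predictor: traces the unit square, one corner per step."""
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    return square[len(vertices) % 4]

def annotate(num_vertices, corrections=None):
    """Run the loop; `corrections` maps step index -> human-fixed vertex."""
    corrections = corrections or {}
    vertices = []
    for step in range(num_vertices):
        v = model_next_vertex(vertices)
        v = corrections.get(step, v)  # annotator can override any step
        vertices.append(v)
    return vertices

poly = annotate(4, corrections={2: (2, 1)})  # human fixes the 3rd vertex
```

    Because prediction is sequential, a single corrected vertex is folded back into the sequence, so the annotator pays only for the vertices the model gets wrong.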

  • 3D Object Proposals using Stereo Imagery for Accurate Object Class Detection

    Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Huimin Ma, Sanja Fidler, Raquel Urtasun

    In Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017

    Paper  Abstract  Bibtex

    @article{ChenArxiv16,
    title = {3D Object Proposals using Stereo Imagery for Accurate Object Class Detection},
    author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    journal = {TPAMI},
    year = {2017}
    }

    The goal of this paper is to perform 3D object detection in the context of autonomous driving. Our method first aims at generating a set of high-quality 3D object proposals by exploiting stereo imagery. We formulate the problem as minimizing an energy function that encodes object size priors, placement of objects on the ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. We then exploit a CNN on top of these proposals to perform object detection. In particular, we employ a convolutional neural net (CNN) that exploits context and depth information to jointly regress to 3D bounding box coordinates and object pose. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. When combined with the CNN, our approach outperforms all existing results in object detection and orientation estimation tasks for all three KITTI object classes. Furthermore, we experiment also with the setting where LIDAR information is available, and show that using both LIDAR and stereo leads to the best result.

  • Monocular 3D Object Detection for Autonomous Driving

    Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Paper  Abstract  Project page  Bibtex

    @inproceedings{ChenCVPR16,
    title = {Monocular 3D Object Detection for Autonomous Driving},
    author = {Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2016}
    }

    The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.

  • 3D Object Proposals for Accurate Object Class Detection

    Xiaozhi Chen*, Kaustav Kundu*, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun

    * Denotes equal contribution

    Currently third in Car and first in Pedestrian and Cyclist detection on KITTI's leaderboard

    In Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

    Paper  Abstract  Project page  Bibtex

    @inproceedings{XiaozhiNIPS15,
    title = {3D Object Proposals for Accurate Object Class Detection},
    author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Andrew Berneshawi and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {NIPS},
    year = {2015}
    }

    The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on Car and Cyclist, and is competitive for the Pedestrian class.
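    The proposal-scoring idea, a linear energy over depth-informed features minimized across candidate boxes, can be sketched like this (the feature values and weights below are invented for illustration):

```python
# Sketch of scoring candidate 3D boxes with a linear energy over
# depth-informed features (hypothetical feature values and weights).

def energy(features, weights):
    """Weighted sum of feature costs; lower energy = better proposal."""
    return sum(w * f for w, f in zip(weights, features))

# Features per candidate: (free-space violation, point-density deficit,
# height-above-ground prior violation).
weights = [1.0, 2.0, 0.5]
candidates = {
    "box_a": [0.2, 0.1, 0.3],
    "box_b": [0.8, 0.05, 0.1],
    "box_c": [0.1, 0.4, 0.2],
}
best = min(candidates, key=lambda k: energy(candidates[k], weights))
```

    In the paper the minimization runs over a dense grid of 3D boxes rather than a handful of candidates, but the principle, ranking boxes by a weighted combination of depth-informed cues, is the same.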

  • Rent3D: Floor-Plan Priors for Monocular Layout Estimation
    (oral presentation)

    Chenxi Liu, Alex Schwing, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Rent an apartment in 3D!

    Paper  Abstract  Suppl. Mat.  Project page  Bibtex

    @inproceedings{ApartmentsCVPR15,
    title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
    author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2015}
    }

    The goal of this paper is to enable a 3D "virtual-tour" of an apartment given a small set of monocular images of different rooms, as well as a 2D floor plan. We frame the problem as the one of inference in a Markov random field which reasons about the layout of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example, about in which room the picture was taken. What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge. In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information can significantly help in resolving the challenging room-apartment alignment problem. We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we exploit integral geometry as well as our new bounds on the aspect ratio of rooms which allow us to carve the space, reducing significantly the number of physically possible configurations. We demonstrate the effectiveness of our approach in a new dataset which contains over 200 apartments.


Hang Chu

Hang is a second year PhD student. He is working on 3D scene understanding.

Publications

  • TorontoCity: Seeing the World with a Million Eyes
    (spotlight presentation)

    Shenlong Wang, Min Bai, Gellert Mattyus, Hang Chu, Wenjie Luo, Bin Yang, Justin Liang, Joel Cheverie, Sanja Fidler, Raquel Urtasun

    In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

    Paper  Abstract  Bibtex

    @inproceedings{TCity2017,
    title = {TorontoCity: Seeing the World with a Million Eyes},
    author = {Shenlong Wang and Min Bai and Gellert Mattyus and Hang Chu and Wenjie Luo and Bin Yang and Justin Liang and Joel Cheverie and Sanja Fidler and Raquel Urtasun},
    booktitle = {ICCV},
    year = {2017}
    }

    In this paper we introduce the TorontoCity benchmark, which covers the full greater Toronto area (GTA) with 712.5 km² of land, 8,439 km of road and around 400,000 buildings. Our benchmark provides different perspectives of the world captured from airplanes, drones and cars driving around the city. Manually labeling such a large scale dataset is infeasible. Instead, we propose to utilize different sources of high-precision maps to create our ground truth. Towards this goal, we develop algorithms that allow us to align all data sources with the maps while requiring minimal human supervision. We have designed a wide variety of tasks including building height estimation (reconstruction), road centerline and curb extraction, building instance segmentation, building contour extraction (reorganization), semantic labeling and scene type classification (recognition). Our pilot study shows that most of these tasks are still difficult for modern convolutional neural networks.

  • Song From PI: A Musically Plausible Network for Pop Music Generation

    Hang Chu, Raquel Urtasun, Sanja Fidler

    ICLR Workshop track, 2017

    Generation of pop songs

    Paper  Abstract  Project page  Press  Bibtex

    @inproceedings{SongOfPI,
    title = {Song From PI: A Musically Plausible Network for Pop Music Generation},
    author = {Hang Chu and Raquel Urtasun and Sanja Fidler},
    booktitle = {ICLR Workshop Track},
    year = {2017}
    }

    We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.
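    A toy rendering of the hierarchy (hypothetical: each layer in the paper is a recurrent network, replaced here by a simple rule that conditions on the layer below):

```python
# Toy two-layer hierarchy: the bottom layer produces a melody, the
# layer above conditions on it (hypothetical stand-in for the RNN layers).
import random

C_MAJOR = [60, 62, 64, 65, 67, 69, 71]  # MIDI pitches, one octave

def melody_layer(rng, length):
    """Bottom layer: sample scale notes (the RNN melody layer's role)."""
    return [rng.choice(C_MAJOR) for _ in range(length)]

def chord_layer(melody):
    """Higher layer conditions on the melody: a triad rooted on each note."""
    return [(n, n + 4, n + 7) for n in melody]

rng = random.Random(0)
melody = melody_layer(rng, 8)
chords = chord_layer(melody)
```

    The point of the structure is that each level only has to model its own musical role while being conditioned on the level beneath it, which is the prior knowledge the paper builds into the network hierarchy.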

  • HouseCraft: Building Houses from Rental Ads and Street Views

    Hang Chu, Shenlong Wang, Raquel Urtasun, Sanja Fidler

    In European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016

    Creating 3D models of houses

    Paper  Abstract  Project page  Bibtex

    @inproceedings{ChuECCV16,
    title = {HouseCraft: Building Houses from Rental Ads and Street Views},
    author = {Hang Chu and Shenlong Wang and Raquel Urtasun and Sanja Fidler},
    booktitle = {ECCV},
    year = {2016}
    }

    In this paper, we utilize rental ads to create realistic textured 3D models of building exteriors. In particular, we exploit the address of the property and its floorplan, which are typically available in the ad. The address allows us to extract Google StreetView images around the building, while the building's floorplan allows for an efficient parametrization of the building in 3D via a small set of random variables. We propose an energy minimization framework which jointly reasons about the height of each floor, the vertical positions of windows and doors, as well as the precise location of the building in the world's map, by exploiting several geometric and semantic cues from the StreetView imagery. To demonstrate the effectiveness of our approach, we collected a new dataset with 174 houses by crawling a popular rental website. Our experiments show that our approach is able to precisely estimate the geometry and location of the property, and can create realistic 3D building models.


Tingwu Wang

MSc Student (2016 - )
 

Harris Chan

MSc Student (2017 - )
Co-supervised with Jimmy Ba
 

Atef Chaudhury

MSc Student (2017 - )
 

Seung Kim

MSc Student (2017 - )
 

Jiaman Li

MSc Student (2017 - )
 

Kevin Shen

MSc Student (2017 - )
 

Chaoqi Wang

MSc Student (2017 - )

Amlan Kar

Amlan started his PhD in Sept 2017.


Huan Ling

4th year undergraduate, UofT (Oct 2016 - )
 

David Acuna

MScAc, UofT (June 2017 - )
 

Huan Ling

Huan is a 4th year undergraduate at the University of Toronto. He is working on human-guided learning and object instance segmentation. He published a NIPS paper during his 3rd year of undergraduate studies.

Publications

  • Teaching Machines to Describe Images via Natural Language Feedback

    Huan Ling, Sanja Fidler

    In Neural Information Processing Systems (NIPS), Long Beach, US, 2017

    Paper  Abstract  Project page  Bibtex

    @inproceedings{LingNIPS2017,
    title = {Teaching Machines to Describe Images via Natural Language Feedback},
    author = {Huan Ling and Sanja Fidler},
    booktitle = {NIPS},
    year = {2017}
    }

    Robots will eventually be part of every household. It is thus critical to enable algorithms to learn from and be guided by non-expert users. In this paper, we bring a human in the loop, and enable a human teacher to give feedback to a learning agent in the form of natural language. We argue that a descriptive sentence can provide a stronger learning signal than a numeric reward in that it can easily point to where the mistakes are and how to correct them. We focus on the problem of image captioning in which the quality of the output can easily be judged by non-experts. We propose a hierarchical phrase-based captioning model trained with policy gradients, and design a feedback network that provides reward to the learner by conditioning on the human-provided feedback. We show that by exploiting descriptive feedback our model learns to perform better than when given independently written human captions.
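    The role of the feedback network can be illustrated with a toy reward function that penalizes only the word the teacher pointed at (a hypothetical stand-in for the learned network in the paper):

```python
# Toy reward shaping from natural-language feedback (hypothetical:
# the paper learns this mapping with a feedback network).

def feedback_reward(caption, feedback):
    """caption: list of words; feedback: dict word_index -> correction."""
    rewards = []
    for i, word in enumerate(caption):
        if i in feedback and feedback[i] != word:
            rewards.append(-1.0)  # the teacher pointed at this mistake
        else:
            rewards.append(0.1)   # small positive reward elsewhere
    return rewards

caption = "a dog riding a horse".split()
fb = {2: "walking"}  # "riding" is wrong, should be "walking"
r = feedback_reward(caption, fb)
```

    Unlike a single scalar reward, the per-position signal tells the policy-gradient learner where the mistake is, which is the argument the paper makes for descriptive feedback.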


Kefan (Arthur) Chen
4th year undergraduate, UofT
Capstone project, Sept 2017 -

Tiantian Fang
4th year undergraduate, UofT
Sept 2017 -

Wesley Heung
4th year undergraduate, UofT
Thesis, Sept 2017 -

Daiqing Li
4th year undergraduate, UofT
Capstone project, Sept 2017 -

Yuhao Zhou
3rd year undergraduate, UofT
Jan 2017 -

Bo Dai
PhD student, Chinese University of Hong Kong
Sept - Dec 2017

Enric Corona
MSc student, UPC in Barcelona
May - Dec 2017

Liren Chen
3rd year undergraduate, Tsinghua University
Summer 2017

Ching-Yao Chuang
4th year undergraduate, National Tsinghua University of Taiwan
July - Nov 2017

Zheng Wu
3rd year undergraduate, Shanghai Jiao Tong University
Summer 2017


Lluis Castrejon
Graduated with MSc  (now at University of Montreal)
Sept 2015 - May 2017
Co-supervised with Raquel Urtasun

Lluis Castrejon

Lluis worked on semi-automatic instance segmentation. Our CVPR'17 paper on this topic received Best Paper Honorable Mention.

Publications

  • Annotating Object Instances with a Polygon-RNN
    (best paper honorable mention)

    Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017

    Paper  Abstract  Bibtex

    @inproceedings{CastrejonCVPR17,
    title = {Annotating Object Instances with a Polygon-RNN},
    author = {Lluis Castrejon and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2017}
    }

    We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows a human annotator to interfere at any time and correct a vertex if needed, producing as accurate segmentation as desired by the annotator. We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with original ground-truth, matching the typical agreement between human annotators. For cars, our speed-up factor is 7.3 for an agreement of 82.2%. We further show generalization capabilities of our approach to unseen datasets.



Tom Lee
Graduated with PhD  (now at LTAS Technologies Inc)
Sept 2011 - March 2016
Co-supervised with Sven Dickinson

Tom Lee

Tom worked on mid-level vision: grouping superpixels to form symmetric parts using a discriminative (trained) approach, and a learning framework for grouping superpixels into object proposals using several Gestalt-like cues (symmetry, closure, homogeneity of appearance). For the former, he showed how to learn with parametric submodular energies. During his PhD he did an 8-month internship at LTAS Technologies Inc, a Toronto-based company, where he now works. His primary supervisor was Prof. Sven Dickinson.

Publications

  • A Learning Framework for Generating Region Proposals with Mid-level Cues

    Tom Lee, Sanja Fidler, Sven Dickinson

    In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

    Paper  Abstract  Bibtex

    @inproceedings{TLeeICCV15,
    title = {A Learning Framework for Generating Region Proposals with Mid-level Cues},
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    booktitle = {ICCV},
    year = {2015}
    }

    The object categorization community's migration from object detection to large-scale object categorization has seen a shift from sliding window approaches to bottom-up region segmentation, with the resulting region proposals offering discriminating shape and appearance features through an attempt to explicitly segment the objects in a scene from their background. One powerful class of region proposal techniques is based on parametric energy minimization (PEM) via parametric maxflow. In this paper, we incorporate PEM into a novel structured learning framework that learns how to combine a set of mid-level grouping cues to yield a small set of region proposals with high recall. Second, we diversify our region proposals and rank them with region-based convolutional neural network features. Our novel approach, called parametric min-loss, casts perceptual grouping and cue combination in a learning framework which yields encouraging results on VOC'2012.

  • A Framework for Symmetric Part Detection in Cluttered Scenes

    Tom Lee, Sanja Fidler, Alex Levinshtein, Cristian Sminchisescu, Sven Dickinson

    Symmetry, Vol. 7, 2015, pp 1333-1351

    Paper  Abstract  Bibtex

    @article{LeeSymmetry2015,
    title = {A Framework for Symmetric Part Detection in Cluttered Scenes},
    author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Cristian Sminchisescu and Sven Dickinson},
    journal = {Symmetry},
    volume = {7},
    pages = {1333-1351},
    year = {2015}
    }

    The role of symmetry in computer vision has waxed and waned in importance during the evolution of the field from its earliest days. At first figuring prominently in support of bottom-up indexing, it fell out of favour as shape gave way to appearance and recognition gave way to detection. With a strong prior in the form of a target object, the role of the weaker priors offered by perceptual grouping was greatly diminished. However, as the field returns to the problem of recognition from a large database, the bottom-up recovery of the parts that make up the objects in a cluttered scene is critical for their recognition. The medial axis community has long exploited the ubiquitous regularity of symmetry as a basis for the decomposition of a closed contour into medial parts. However, today's recognition systems are faced with cluttered scenes and the assumption that a closed contour exists, i.e., that figure-ground segmentation has been solved, rendering much of the medial axis community's work inapplicable. In this article, we review a computational framework, previously reported in [Lee et al., ICCV'13, Levinshtein et al., ICCV'09, Levinshtein et al., IJCV'13], that bridges the representation power of the medial axis and the need to recover and group an object's parts in a cluttered scene. Our framework is rooted in the idea that a maximally-inscribed disc, the building block of a medial axis, can be modelled as a compact superpixel in the image. We evaluate the method on images of cluttered scenes.

  • Multi-cue Mid-level Grouping

    Tom Lee, Sanja Fidler, Sven Dickinson

    In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

    Paper  Abstract  Bibtex

    @inproceedings{LeeACCV14,
    title = {Multi-cue Mid-level Grouping},
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    booktitle = {ACCV},
    year = {2014}
    }

    Region proposal methods provide richer object hypotheses than sliding windows with dramatically fewer proposals, yet they still number in the thousands. This large quantity of proposals typically results from a diversification step that propagates bottom-up ambiguity in the form of proposals to the next processing stage. In this paper, we take a complementary approach in which mid-level knowledge is used to resolve bottom-up ambiguity at an earlier stage to allow a further reduction in the number of proposals. We present a method for generating regions using the mid-level grouping cues of closure and symmetry. In doing so, we combine mid-level cues that are typically used only in isolation, and leverage them to produce fewer but higher quality proposals. We emphasize that our model is mid-level by learning it on a limited number of objects while applying it to different objects, thus demonstrating that it is transferable to other objects. In our quantitative evaluation, we 1) establish the usefulness of each grouping cue by demonstrating incremental improvement, and 2) demonstrate improvement on two leading region proposal methods with a limited budget of proposals.

  • Detecting Curved Symmetric Parts using a Deformable Disc Model

    Tom Lee, Sanja Fidler, Sven Dickinson

    In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

    Paper  Abstract  Project page  Bibtex

    @inproceedings{LeeICCV13,
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    title = {Detecting Curved Symmetric Parts using a Deformable Disc Model},
    booktitle = {ICCV},
    year = {2013}
    }

    Symmetry is a powerful shape regularity that's been exploited by perceptual grouping researchers in both human and computer vision to recover part structure from an image without a priori knowledge of scene content. Drawing on the concept of a medial axis, defined as the locus of centers of maximal inscribed discs that sweep out a symmetric part, we model part recovery as the search for a sequence of deformable maximal inscribed disc hypotheses generated from a multiscale superpixel segmentation, a framework proposed by [Levinshtein et al., ICCV'09]. However, we learn affinities between adjacent superpixels in a space that's invariant to bending and tapering along the symmetry axis, enabling us to capture a wider class of symmetric parts. Moreover, we introduce a global cost that perceptually integrates the hypothesis space by combining a pairwise and a higher-level smoothing term, which we minimize globally using dynamic programming. The new framework is demonstrated on two datasets, and is shown to significantly outperform the baseline [Levinshtein et al., ICCV'09].

  • Learning Categorical Shape from Captioned Images

    Tom Lee, Sanja Fidler, Alex Levinshtein, Sven Dickinson

    Conference on Computer and Robot Vision (CRV), Toronto, Canada, May 2012

    Paper  Abstract  Bibtex

    @inproceedings{LeeCRV12,
    author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Sven Dickinson},
    title = {Learning Categorical Shape from Captioned Images},
    booktitle = {Canadian Conference on Computer and Robot Vision (CRV)},
    year = {2012}
    }

    Given a set of captioned images of cluttered scenes containing various objects in different positions and scales, we learn named contour models of object categories without relying on bounding box annotation. We extend a recent language-vision integration framework that finds spatial configurations of image features that co-occur with words in image captions. By substituting appearance features with local contour features, object categories are recognized by a contour model that grows along the object's boundary. Experiments on ETHZ are presented to show that 1) the extended framework is better able to learn named visual categories whose within class variation is better captured by a shape model than an appearance model; and 2) typical object recognition methods fail when manually annotated bounding boxes are unavailable.



Yukun Zhu
Graduated with MSc  (now at Google)
Sept 2015 - Jan 2016
Co-supervised with Raquel Urtasun and Ruslan Salakhutdinov

Yukun Zhu

Yukun's research was in two domains: object class detection and vision-language integration. His segDeepM approach, published at CVPR'15, significantly outperformed the previous state of the art in detection on PASCAL VOC.

Publications

  • 3D Object Proposals using Stereo Imagery for Accurate Object Class Detection

    Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Huimin Ma, Sanja Fidler, Raquel Urtasun

    In Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017

    Paper  Abstract  Bibtex

    @article{ChenArxiv16,
    title = {3D Object Proposals using Stereo Imagery for Accurate Object Class Detection},
    author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    journal = {TPAMI},
    year = {2017}
    }

    The goal of this paper is to perform 3D object detection in the context of autonomous driving. Our method first aims at generating a set of high-quality 3D object proposals by exploiting stereo imagery. We formulate the problem as minimizing an energy function that encodes object size priors, placement of objects on the ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. We then exploit a CNN on top of these proposals to perform object detection. In particular, we employ a convolutional neural net (CNN) that exploits context and depth information to jointly regress to 3D bounding box coordinates and object pose. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. When combined with the CNN, our approach outperforms all existing results in object detection and orientation estimation tasks for all three KITTI object classes. Furthermore, we experiment also with the setting where LIDAR information is available, and show that using both LIDAR and stereo leads to the best result.

  • MovieQA: Understanding Stories in Movies through Question-Answering
    (spotlight presentation)

    Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Benchmark on question-answering about movies

    Paper  Abstract  Benchmark  Bibtex

    @inproceedings{TapaswiCVPR16,
    title = {MovieQA: Understanding Stories in Movies through Question-Answering},
    author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2016}
    }

    We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.

  • segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

    Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Currently third in detection on PASCAL VOC Leaderboard

    Paper  Abstract  Project page  Bibtex

    @inproceedings{ZhuSegDeepM15,
    title = {segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection},
    author = {Yukun Zhu and Raquel Urtasun and Ruslan Salakhutdinov and Sanja Fidler},
    booktitle = {CVPR},
    year = {2015}
    }

    In this paper, we propose an approach that exploits object segmentation in order to improve the accuracy of object detection. We frame the problem as inference in a Markov Random Field, in which each detection hypothesis scores object appearance as well as contextual information using Convolutional Neural Networks, and allows the hypothesis to choose and score a segment out of a large pool of accurate object segmentation proposals. This enables the detector to incorporate additional evidence when it is available and thus results in more accurate detections. Our experiments show an improvement of 4.1% in mAP over the R-CNN baseline on PASCAL VOC 2010, and 3.4% over the current state-of-the-art, demonstrating the power of our approach.


  • Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

    Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

    arXiv preprint arXiv:1506.06724, 2015

    Aligning movies and books for story-like captioning

    Paper  Abstract  Project page  Bibtex

@article{moviebook,
title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
journal = {arXiv preprint arXiv:1506.06724},
year = {2015}

Books are a rich source of both fine-grained information, what a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.


  • Skip-Thought Vectors

    Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    arXiv preprint arXiv:1506.06726, 2015

    Sent2vec neural representation trained on 11K books

    Paper  Abstract  Code  Bibtex

@article{skipthoughts,
title = {Skip-Thought Vectors},
author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
journal = {arXiv preprint arXiv:1506.06726},
year = {2015}

    We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.

  • 3D Object Proposals for Accurate Object Class Detection

    Xiaozhi Chen*, Kaustav Kundu*, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun

    * Denotes equal contribution

    Currently third in Car, and first in Pedestrian and Cyclist detection on KITTI's Leaderboard

In Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

    Paper  Abstract  Project page  Bibtex

    @inproceedings{XiaozhiNIPS15,
    title = {3D Object Proposals for Accurate Object Class Detection},
    author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Andrew Berneshawi and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {NIPS},
    year = {2015}
    }

The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, the ground plane, as well as several depth-informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on Car and Cyclist, and is competitive for the Pedestrian class.



Ivan Vendrov
Graduated with MSc  (now at Google)
Sept 2015 - Jan 2016
Co-supervised with Raquel Urtasun

Ivan's master's thesis was on the topic of semantic visual search.

Publications

  • Order-Embeddings of Images and Language    (oral presentation)

    Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun

    In International Conference on Learning Representations, Puerto Rico, 2016

    State-of-the-art in caption-image retrieval on COCO

    Paper  Abstract  Code  Bibtex

    @inproceedings{VendrovArxiv15,
    title = {Order-Embeddings of Images and Language},
    author = {Ivan Vendrov and Ryan Kiros and Sanja Fidler and Raquel Urtasun},
    booktitle = {ICLR},
    year = {2016}
    }

    Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.
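The ordered representations mentioned above are trained with an order-violation penalty of the form E(x, y) = ||max(0, y - x)||², which is zero exactly when one vector dominates the other coordinate-wise. The sketch below is illustrative only: the vector values are made up, and the convention of which argument is the more general (hypernym) side is an assumption here.

```python
import numpy as np

def order_violation(x, y):
    """Penalty E(x, y) = ||max(0, y - x)||^2.
    Zero exactly when x >= y coordinate-wise, i.e. when the ordered
    pair (x more specific, y more general) is satisfied."""
    return float(np.sum(np.maximum(0.0, y - x) ** 2))

# Illustrative vectors (made up): "dog" should entail "animal".
dog = np.array([2.0, 3.0, 1.5])
animal = np.array([1.0, 2.0, 1.0])   # more general: closer to the origin
print(order_violation(dog, animal))  # 0.0 -> order satisfied
print(order_violation(animal, dog))  # 2.25 -> order violated
```

The asymmetry of this penalty is the point: unlike a cosine distance, it can represent that "dog entails animal" while "animal entails dog" does not hold.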



Ziyu Zhang
Graduated with MSc (now at Snap Inc)  
Sept 2015 - April 2016
Co-supervised with Raquel Urtasun


Ziyu's master's thesis was on instance-level object segmentation in monocular imagery.

Publications

  • Instance-Level Segmentation with Deep Densely Connected MRFs

    Ziyu Zhang, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Paper  Abstract  Bibtex

    @inproceedings{ZhangCVPR16,
    title = {Instance-Level Segmentation with Deep Densely Connected MRFs},
    author = {Ziyu Zhang and Sanja Fidler and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2016}
    }

    Our aim is to provide a pixel-level object instance labeling of a monocular image. We build on recent work [Zhang et al., ICCV15] that trained a convolutional neural net to predict instance labeling in local image patches, extracted exhaustively in a stride from an image. A simple Markov random field model using several heuristics was then proposed in [Zhang et al., ICCV15] to derive a globally consistent instance labeling of the image. In this paper, we formulate the global labeling problem with a novel densely connected Markov random field and show how to encode various intuitive potentials in a way that is amenable to efficient mean field inference [Krahenbuhl et al., NIPS11]. Our potentials encode the compatibility between the global labeling and the patch-level predictions, contrast-sensitive smoothness as well as the fact that separate regions form different instances. Our experiments on the challenging KITTI benchmark [Geiger et al., CVPR12] demonstrate that our method achieves a significant performance boost over the baseline [Zhang et al., ICCV15].

  • Monocular 3D Object Detection for Autonomous Driving

    Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Paper  Abstract  Project page  Bibtex

    @inproceedings{ChenCVPR16,
    title = {Monocular 3D Object Detection for Autonomous Driving},
    author = {Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2016}
    }

    The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.

  • Monocular Object Instance Segmentation and Depth Ordering with CNNs

    Ziyu Zhang, Alex Schwing, Sanja Fidler, Raquel Urtasun

    In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

    Paper  Abstract  Bibtex

    @inproceedings{ZhangICCV15,
    title = {Monocular Object Instance Segmentation and Depth Ordering with CNNs},
    author = {Ziyu Zhang and Alex Schwing and Sanja Fidler and Raquel Urtasun},
    booktitle = {ICCV},
    year = {2015}
    }

    In this paper we tackle the problem of instance level segmentation and depth ordering from a single monocular image. Towards this goal, we take advantage of convolutional neural nets and train them to directly predict instance level segmentations where the instance ID encodes depth ordering from large image patches. To provide a coherent single explanation of an image we develop a Markov random field which takes as input the predictions of convolutional nets applied at overlapping patches of different resolutions as well as the output of a connected component algorithm and predicts very accurate instance level segmentation and depth ordering. We demonstrate the effectiveness of our approach on the challenging KITTI benchmark and show very good performance on both tasks.




Xavier Puig Fernandez
PhD student, MIT
Jan-March, 2016

Xavier visited the group twice, once in Nov 2015 and again from Jan to March 2016. We are working on the problem of video-to-text alignment.

Co-supervised with Raquel Urtasun.


Urban Jezernik
PhD student, University of Ljubljana
Jan-April, 2016

Urban visited the group from Jan to April, 2016. We worked on the problem of music generation.


Makarand Tapaswi
PhD student, KIT (now a postdoc in our group)
Sept-Dec, 2015

Makarand visited for three months in 2015, and has joined our group as a postdoc in the fall of 2016.

Publications

  • MovieQA: Understanding Stories in Movies through Question-Answering
    (spotlight presentation)

    Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Paper  Abstract  Benchmark  Bibtex

    @inproceedings{TapaswiCVPR16,
    title = {MovieQA: Understanding Stories in Movies through Question-Answering},
    author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2016}
    }

    We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.



Edgar Simo-Serra
PhD student, UPC in Barcelona (now a postdoc at Tokyo University)
Summer 2013, 2014

Edgar visited the group twice. During his first visit (to TTI-C) he worked on clothing parsing in fashion photographs, and published a first-author paper at ACCV'14 on this topic. During his second visit (to UofT), he worked on predicting how fashionable/stylish someone looks in a photograph, and on suggesting ways to help the user improve her/his "look". This resulted in a first-author CVPR'15 paper. The paper received significant international press coverage in major news and fashion media such as New Scientist, Quartz, Wired, Glamour, Cosmopolitan, Elle and Marie Claire (see the project page for more details). Edgar gave several interviews for the press, including an appearance on Spanish television (minutes 15:12 to 16:43) and radio (minutes 16:10 to 20:43). Yahoo News, Canada, featured a full photo of him in one of my favorite press articles on the subject.

Publications

  • Neuroaesthetics in Fashion: Modeling the Perception of Beauty

    Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    How fashionable do you look in a photo? And how can you improve?

    Paper  Abstract  Project page  Bibtex

    @inproceedings{SimoCVPR15,
    title = {Neuroaesthetics in Fashion: Modeling the Perception of Beauty},
    author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2015}
    }

In this paper, we analyze clothing fashion on a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback back to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.

  • A High Performance CRF Model for Clothes Parsing

    Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

    In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

    Significant performance gain over state-of-the-art in clothing parsing.

    Paper  Abstract  Project page  Bibtex


    @inproceedings{SimoACCV14,
    title = {A High Performance CRF Model for Clothes Parsing},
    author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
    booktitle = {ACCV},
year = {2014}
}

    In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset [Yamaguchi et al., CVPR'12] and show that we can obtain a significant improvement over the state-of-the-art.



Roozbeh Mottaghi
PhD student, UCLA (now a Research Scientist at AI2)
Summer 2012, 2013

Roozbeh visited the group several times, working on the topic of object class detection. His work resulted in several state-of-the-art detectors. He published two first-author and two second-author CVPR papers (CVPR'13 and '14), as well as a first-author T-PAMI publication.
Roozbeh went to do a postdoc with Prof. Silvio Savarese at Stanford and is now a Research Scientist at AI2.

Publications

  • Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding

    Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh

Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(1), 2016, pages 74-87

    Paper  Abstract  Suppl. Mat.  Bibtex

    @article{MottaghiPAMI16,
    title = {Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding},
    author = {Roozbeh Mottaghi and Sanja Fidler and Alan Yuille and Raquel Urtasun and Devi Parikh},
    journal = {Trans. on Pattern Analysis and Machine Intelligence},
    volume= {38},
    number= {1},
    pages= {74--87},
    year = {2016}
    }

    Recent trends in image understanding have pushed for scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we "plug-in" human subjects for each of the various components in a conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve scene understanding by focusing research efforts on various individual tasks.

  • The Role of Context for Object Detection and Semantic Segmentation in the Wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, Alan Yuille

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    PASCAL VOC with dense segmentation labels for 400+ classes in Project page

    Paper  Errata  Abstract  Project page  Suppl. Mat.  Bibtex

    @inproceedings{MottaghiCVPR14,
    author = {Roozbeh Mottaghi and Xianjie Chen and Xiaobai Liu and Sanja Fidler and Raquel Urtasun and Alan Yuille},
    title = {The Role of Context for Object Detection and Semantic Segmentation in the Wild},
    booktitle = {CVPR},
    year = {2014}
    }

In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of the PASCAL VOC 2010 detection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmentation and object detection. Our analysis shows that nearest neighbor based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, improvements of existing contextual models for detection are rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales.

  • Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts

    Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Nam-Gyu Cho, Sanja Fidler, Raquel Urtasun, Alan Yuille

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    PASCAL VOC with object parts segmentations available in Project page

    Paper  Abstract  Project page  Bibtex

    @inproceedings{PartsCVPR14,
    author = {Xianjie Chen and Roozbeh Mottaghi and Xiaobai Liu and Nam-Gyu Cho and Sanja Fidler and Raquel Urtasun and Alan Yuille},
    title = {Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts},
    booktitle = {CVPR},
    year = {2014}
    }

    Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals depicted at low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables us to represent a large number of holistic object and body part combinations to better deal with different "detectability" patterns caused by deformations, occlusion and/or low resolution. We apply our method to the six animal categories in the PASCAL VOC dataset and show that our method significantly improves state-of-the-art (by 4.1% AP) and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc), making use of a new dataset of fully annotated object parts for PASCAL VOC 2010, which provides a mask for each part.

  • Bottom-up Segmentation for Top-down Detection

    Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    8% over DPM and 4% over the state-of-the-art on PASCAL VOC at the time.

    Paper  Abstract  Project page  Suppl. Mat.  Bibtex

    @inproceedings{segdpmCVPR13,
    author = {Sanja Fidler and Roozbeh Mottaghi and Alan Yuille and Raquel Urtasun},
    title = {Bottom-up Segmentation for Top-down Detection},
    booktitle = {CVPR},
    year = {2013}
    }

    In this paper we are interested in how semantic segmentation can help object detection. Towards this goal, we propose a novel deformable part-based model which exploits region-based segmentation algorithms that compute candidate object regions by bottom-up clustering followed by ranking of those regions. Our approach allows every detection hypothesis to select a segment (including void), and scores each box in the image using both the traditional HOG filters as well as a set of novel segmentation features. Thus our model "blends" between the detector and segmentation models. Since our features can be computed very efficiently given the segments, we maintain the same complexity as the original DPM. We demonstrate the effectiveness of our approach in PASCAL VOC 2010, and show that when employing only a root filter our approach outperforms Dalal & Triggs detector on all classes, achieving 13% higher average AP. When employing the parts, we outperform the original DPM in 19 out of 20 classes, achieving an improvement of 8% AP. Furthermore, we outperform the previous state-of-the-art on VOC'10 test by 4%.

  • Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs

    Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{MottaghiCVPR13,
    author = {Roozbeh Mottaghi and Sanja Fidler and Jian Yao and Raquel Urtasun and Devi Parikh},
    title = {Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs},
    booktitle = {CVPR},
    year = {2013}
    }

Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we "plug-in" human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted an in-depth analysis of the human-generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.



Liang-Chieh Chen
PhD student, UCLA (now at Google)
Summer 2013

Liang-Chieh ("Jay") worked on weakly-labeled segmentation: obtaining accurate object segmentations given a ground-truth 3D bounding box, as available in the KITTI dataset. His method improved significantly over existing grab-cut style approaches, and even outperformed MTurkers (compared against accurate in-house annotations). This work resulted in a first-author paper at CVPR'14.

Publications

  • Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision

    Liang-Chieh Chen, Sanja Fidler, Alan Yuille, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    More accurate pixel-level object labeling than MTurkers

    Paper  Abstract  Project page  CAD models  Suppl. Mat.  Bibtex

    @inproceedings{ChenCVPR14,
    author = {Liang-Chieh Chen and Sanja Fidler and Alan Yuille and Raquel Urtasun},
    title = {Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision},
    booktitle = {CVPR},
    year = {2014}
    }

    Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and a budget of tens or hundreds of thousands of dollars. Thus, developing solutions that can automatically perform the labeling given only weak supervision is key to reduce this cost. In this paper, we show how to exploit 3D information to automatically generate very accurate object segmentations given annotated 3D bounding boxes. We formulate the problem as the one of inference in a binary Markov random field which exploits appearance models, stereo and/or noisy point clouds, a repository of 3D CAD models as well as topological constraints. We demonstrate the effectiveness of our approach in the context of autonomous driving, and show that we can segment cars with the accuracy of 86% intersection-over-union, performing as well as highly recommended MTurkers!



Abhishek Sharma
PhD student, UMD (now at Apple)
Summer 2012

Abhishek worked on holistic scene parsing by exploiting image captions. Making use of textual information for visual parsing is important for, e.g., robotics applications in which an automatic system interacts with a human user. Abhishek co-authored a CVPR'13 paper.

Publications

  • A Sentence is Worth a Thousand Pixels

    Sanja Fidler, Abhishek Sharma, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{FidlerCVPR13,
    author = {Sanja Fidler and Abhishek Sharma and Raquel Urtasun},
    title = {A Sentence is Worth a Thousand Pixels},
    booktitle = {CVPR},
    year = {2013}
    }

    We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models.


Juan Morales Vega
4th year undergraduate, UPC in Barcelona
Feb - June 2017

Haokun Liu
3rd year undergraduate, Peking University
Feb - June 2017

Ge (Olga) Xu
3rd year undergraduate, UofT
Summer 2016 (USRA)

Kevin Kyunghwan Ra
4th year undergraduate, UofT (now at McMaster University)
2016

Vasu Sharma
3rd year undergraduate, IIT Kanpur
Summer 2016, co-supervised with Raquel Urtasun

Amlan Kar
3rd year undergraduate, IIT Kanpur (now doing PhD with me at UofT)
Summer 2016, co-supervised with Raquel Urtasun

Erin Grant
4th year undergraduate, UofT (now a PhD student at UC Berkeley)
Jan-April, 2016

Seung Wook Kim
4th year undergraduate, UofT (now doing MSc with me at UofT)
Jan-April, 2016

Huazhe Xu
4th year visiting student from Tsinghua University (now a PhD student at UC Berkeley)
Sep 2015 - Dec 2015, co-supervised with Raquel Urtasun

Boris Ivanovic
4th year undergraduate, UofT (now a MSc student at Stanford University)
Sep 2015 - May 2016, co-supervised with Raquel Urtasun

Tamara Lipowski
4th year undergraduate, UofT (now a MSc student at University of Salzburg)
Jan-April, 2016

Zexuan (Aaron) Wang
4th year undergraduate, UofT (now at Qumulo Inc)
Sept 2015 - April 2016, co-supervised with Raquel Urtasun

Jurgen Aliaj
2nd year undergraduate, UofT (now a MSc student at UofT)
Summer 2015 (USRA)

Andrew Berneshawi
4th year undergraduate, UofT (now at Amazon, Seattle)
CSC494, Winter 2015

Andrew worked on road estimation as part of a semester-long project course (CSC494). His approach ranked second on KITTI's road classification benchmark (entry: NNP, timestamped June 2015).



Chenxi Liu
4th year undergraduate, Tsinghua University (now a PhD student at Johns Hopkins University)
Summer 2014, co-supervised with Raquel Urtasun

Chenxi worked on the problem of apartment reconstruction in 3D from rental data (monocular imagery and a floor plan). His work resulted in a joint first-author oral CVPR'15 paper. He gave a talk at CVPR and did a great job (you can check his performance below).

Publications

  • Rent3D: Floor-Plan Priors for Monocular Layout Estimation
    (oral presentation)

    Chenxi Liu*, Alex Schwing*, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Rent an apartment in 3D!

    * Denotes equal contribution

    Paper  Abstract  Suppl. Mat.  Project page  Bibtex

    @inproceedings{ApartmentsCVPR15,
    title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
    author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2015}
    }

    The goal of this paper is to enable a 3D "virtual-tour" of an apartment given a small set of monocular images of different rooms, as well as a 2D floor plan. We frame the problem as one of inference in a Markov random field which reasons about the layout of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example, about in which room the picture was taken. What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge. In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information can significantly help in resolving the challenging room-apartment alignment problem. We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we exploit integral geometry as well as our new bounds on the aspect ratio of rooms, which allow us to carve the space, significantly reducing the number of physically possible configurations. We demonstrate the effectiveness of our approach on a new dataset which contains over 200 apartments.


    [talk]  [slides]
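The space-carving idea behind the fast inference can be illustrated with a minimal sketch. The hypotheses, floor-plan ratio, and tolerance below are hypothetical stand-ins, not the paper's actual bounds:

```python
# Minimal sketch of aspect-ratio-based pruning (hypothetical values).
# Each layout hypothesis is a (width, height) wall extent; the floor
# plan fixes the room's true aspect ratio, so hypotheses outside a
# tolerance band can be discarded before running exact inference.
def carve_by_aspect_ratio(hypotheses, plan_ratio, tol=0.15):
    lo, hi = plan_ratio * (1 - tol), plan_ratio * (1 + tol)
    return [(w, h) for (w, h) in hypotheses if lo <= w / h <= hi]

hyps = [(4.0, 3.0), (6.0, 2.0), (3.9, 3.1), (5.0, 5.0)]
kept = carve_by_aspect_ratio(hyps, plan_ratio=4.0 / 3.0)
print(kept)  # only hypotheses consistent with the floor plan survive
```

Pruning like this shrinks the hypothesis space before the exact inference step ever runs, which is what makes a few-ms-per-apartment runtime plausible.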


Yinan Zhao
4th year undergraduate, Tsinghua University (now a PhD student at UT Austin)
Summer 2014, co-supervised with Raquel Urtasun

Chen Kong
4th year undergraduate, Tsinghua University (now a PhD student at CMU)
Summer 2013, co-supervised with Raquel Urtasun

Chen worked on 3D indoor scene understanding by exploiting textual information. His work resulted in a first-author and a co-authored CVPR'14 paper, as well as a co-authored oral paper at BMVC'15.

Publications

  • What are you talking about? Text-to-Image Coreference

    Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    Exploits text for visual parsing and aligns nouns to objects.

    Paper  Abstract  Bibtex

    @inproceedings{KongCVPR14,
    title = {What are you talking about? Text-to-Image Coreference},
    author = {Chen Kong and Dahua Lin and Mohit Bansal and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2014}
    }

    In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system.

  • Visual Semantic Search: Retrieving Videos via Complex Textual Queries

    Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    Video retrieval when a query is a longer sentence or a multi-sentence description

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{LinCVPR14,
    author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
    title = {Visual Semantic Search: Retrieving Videos via Complex Textual Queries},
    booktitle = {CVPR},
    year = {2014}
    }

    In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.
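The matching step at the heart of this approach can be sketched in a few lines. The query nodes, detections, and costs below are made-up placeholders, and a brute-force solver stands in for the paper's generalized bipartite matching with learned term weights:

```python
from itertools import permutations

# Hypothetical match costs between query-graph nodes (rows) and
# candidate detections in a video (columns); lower is better. In the
# paper these scores combine appearance, motion, and spatial relations
# with learned weights -- here they are made-up numbers.
query_nodes = ["car", "pedestrian", "traffic light"]
detections = ["det_0", "det_1", "det_2", "det_3"]
cost = [
    [0.2, 0.9, 0.8, 0.7],
    [0.8, 0.1, 0.9, 0.6],
    [0.9, 0.8, 0.3, 0.5],
]

def best_matching(cost):
    """Brute-force min-cost bipartite matching (fine for tiny graphs;
    real systems use the Hungarian algorithm or an LP relaxation)."""
    n_rows, n_cols = len(cost), len(cost[0])
    best = None
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if best is None or total < best[0]:
            best = (total, cols)
    return best

total, cols = best_matching(cost)
matching = {query_nodes[r]: detections[c] for r, c in enumerate(cols)}
print(matching)  # each query node paired with its cheapest detection
```

The video-level retrieval score then aggregates these per-node match costs, so a clip that accounts for more of the query graph ranks higher.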

  • Generating Multi-Sentence Lingual Descriptions of Indoor Scenes
    (oral presentation)

    Dahua Lin, Chen Kong, Sanja Fidler, Raquel Urtasun

    In British Machine Vision Conference (BMVC), 2015

    Paper  Abstract  Bibtex

    @inproceedings{Lin15,
    title = {Generating Multi-Sentence Lingual Descriptions of Indoor Scenes},
    author = {Dahua Lin and Chen Kong and Sanja Fidler and Raquel Urtasun},
    booktitle = {BMVC},
    year = {2015}
    }

    This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches focus primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROUGE scores compared to those produced by the baseline.
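The ROUGE scores used for evaluation measure n-gram overlap between generated and reference text. A minimal ROUGE-1 recall computation, with made-up sentences, looks like this:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams covered by the
    candidate, with counts clipped to the reference frequencies."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

# Hypothetical generated vs. ground-truth scene descriptions.
generated = "a wooden table stands near the window"
reference = "a table stands near the large window"
print(round(rouge1_recall(generated, reference), 3))
```

Full ROUGE toolkits also report higher-order n-grams and longest-common-subsequence variants; this unigram recall is just the simplest member of the family.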


Jialiang's fun video about his summer research
Jialiang Wang
4th year undergraduate, UofT (now a PhD student at Harvard University)
Summer 2014 (USRA), co-supervised with Sven Dickinson

Uri's fun video about his summer research
Uri Priel
3rd year undergraduate, UofT
Summer 2014 (USRA), co-supervised with Sven Dickinson


Winning video of an undergraduate research video competition
Kamyar Seyed Ghasemipour
2nd year undergraduate, UofT (now an MSc student at UofT)
Summer 2014 (USRA), co-supervised with Suzanne Stevenson and Sven Dickinson

Kamyar worked on unsupervised word-sense disambiguation of captioned images. He won a research video competition (video) held for the Undergraduate Summer Research Program at UofT.
