Makarand Tapaswi

joined in September 2016
PhD from KIT, Germany
co-advised with Raquel Urtasun

Makarand visited for three months in 2015, and joined the group as a postdoc in the fall of 2016. We are working on the problem of question-answering based on videos.

Publications

  • MovieQA: Understanding Stories in Movies through Question-Answering
    (spotlight presentation)

    Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Paper  Abstract  Benchmark  Bibtex

    @inproceedings{TapaswiCVPR16,
    title = {MovieQA: Understanding Stories in Movies through Question-Answering},
    author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2016}
    }

    We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.
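
The five-way multiple-choice protocol described above can be illustrated with a toy retrieval-style baseline (in the spirit of the paper's simpler baselines, not its actual models): embed the question together with the story source, embed each candidate answer, and pick the most similar candidate. All vectors below are made-up 3-d examples.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def answer_question(context_vec, answer_vecs):
    """Pick the candidate answer most similar to the question+story context."""
    scores = [cosine(context_vec, a) for a in answer_vecs]
    return int(np.argmax(scores))

# Hypothetical 3-d embeddings: one question+story context, five candidates.
context = np.array([1.0, 0.5, 0.0])
answers = [np.array([0.9, 0.6, 0.1]),    # paraphrases the context
           np.array([-1.0, 0.0, 0.0]),
           np.array([0.0, 0.0, 1.0]),
           np.array([0.0, -1.0, 0.0]),
           np.array([-0.5, -0.5, 0.5])]
print(answer_question(context, answers))  # → 0
```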



 

Hang Chu

PhD Student (2016 - )

Co-supervised with Raquel Urtasun

Namdar Homayounfar

PhD Student (2014 - )

Co-supervised with Raquel Urtasun

Kaustav Kundu

PhD Student (2014 - )

Co-supervised with Raquel Urtasun

Paul Vicol

PhD Student (2016 - )

Co-supervised with Raquel Urtasun

Lluis Castrejon

MSc Student (2015 - )

Co-supervised with Raquel Urtasun

Tingwu Wang

MSc Student (2016 - )

Kaustav Kundu

Kaustav is a third-year PhD student working on 3D scene understanding. His main project so far has been to exploit RGB-D information to efficiently generate high-quality 3D object proposals. This is particularly important for robotics applications, where efficiency (speed) and quality (precise 3D information) are crucial. This work resulted in significant improvements over state-of-the-art region proposal approaches. Kaustav also contributed significantly to the Rent3D project, which reconstructs apartments in 3D from rental data (monocular imagery and floor plans). This paper appeared as an oral at CVPR'15.

Publications

  • Monocular 3D Object Detection for Autonomous Driving

    Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Abstract  Bibtex

    @inproceedings{ChenCVPR16,
    title = {Monocular 3D Object Detection for Autonomous Driving},
    author = {Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2016}
    }

    The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.
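
The candidate-scoring step above can be sketched as a weighted sum of per-candidate potentials. The potential names and weights below are illustrative placeholders, not the paper's learned parameters:

```python
# Toy sketch of scoring 3D candidate boxes as a weighted sum of potentials.
# Potential names and weights are illustrative, not the paper's learned values.
WEIGHTS = {"segmentation": 1.0, "size_prior": 0.5, "location_prior": 0.5}

def score_candidate(potentials, weights=WEIGHTS):
    """Higher is better; each potential is assumed normalized to [0, 1]."""
    return sum(weights[name] * value for name, value in potentials.items())

candidates = [
    {"segmentation": 0.9, "size_prior": 0.8, "location_prior": 0.7},  # well placed
    {"segmentation": 0.2, "size_prior": 0.9, "location_prior": 0.3},  # poor overlap
]
best = max(range(len(candidates)), key=lambda i: score_candidate(candidates[i]))
print(best)  # → 0
```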

  • 3D Object Proposals for Accurate Object Class Detection

    Xiaozhi Chen*, Kaustav Kundu*, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun

    * Denotes equal contribution

    Currently third in Car, and first in Pedestrian and Cyclist detection on KITTI's Leaderboard

    Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

    Paper  Abstract  Project page  Bibtex

    @inproceedings{XiaozhiNIPS15,
    title = {3D Object Proposals for Accurate Object Class Detection},
    author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Andrew Berneshawi and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {NIPS},
    year = {2015}
    }

    The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on Car and Cyclist, and is competitive for the Pedestrian class.

  • Rent3D: Floor-Plan Priors for Monocular Layout Estimation         (oral presentation)

    Chenxi Liu, Alex Schwing, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Rent an apartment in 3D!

    Paper  Abstract  Suppl. Mat.  Project page  Bibtex

    @inproceedings{ApartmentsCVPR15,
    title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
    author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2015}
    }

    The goal of this paper is to enable a 3D "virtual-tour" of an apartment given a small set of monocular images of different rooms, as well as a 2D floor plan. We frame the problem as the one of inference in a Markov random field which reasons about the layout of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example, about in which room the picture was taken. What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge. In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information can significantly help in resolving the challenging room-apartment alignment problem. We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we exploit integral geometry as well as our new bounds on the aspect ratio of rooms which allow us to carve the space, reducing significantly the number of physically possible configurations. We demonstrate the effectiveness of our approach in a new dataset which contains over 200 apartments.
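
The "integral geometry" that makes exact inference fast can be illustrated with a standard summed-area table: after one pass over the image, the sum over any axis-aligned rectangle (and hence a rectangle-based layout score) costs only four lookups. This is a generic sketch, not the paper's inference code.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) via four table lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3))  # sum of [[5, 6], [9, 10]] → 30.0
```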


Lluis Castrejon

Lluis is a first-year Master's student. He is working on representations of text.


Haokun Liu
3rd year undergraduate, Peking University

Ge (Olga) Xu
3rd year undergraduate, UofT
Summer 2016 (USRA)


Tom Lee
Graduated with PhD  (now at LTAS Technologies Inc)
Sept 2011 - March 2016
Co-supervised with Sven Dickinson

Tom Lee

Tom completed his PhD working on mid-level vision, and did an 8-month internship at LTAS Technologies Inc., a Toronto-based company, where he now works. His research covered grouping superpixels to form symmetric parts using a discriminative (trained) approach, and a learning framework for grouping superpixels into object proposals using several Gestalt-like cues (symmetry, closure, homogeneity of appearance). For the former, he showed how to learn with parametric submodular energies. His primary supervisor was Prof. Sven Dickinson.

Publications

  • A Learning Framework for Generating Region Proposals with Mid-level Cues

    Tom Lee, Sanja Fidler, Sven Dickinson

    In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

    Paper  Abstract  Bibtex

    @inproceedings{TLeeICCV15,
    title = {A Learning Framework for Generating Region Proposals with Mid-level Cues},
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    booktitle = {ICCV},
    year = {2015}
    }

    The object categorization community's migration from object detection to large-scale object categorization has seen a shift from sliding window approaches to bottom-up region segmentation, with the resulting region proposals offering discriminating shape and appearance features through an attempt to explicitly segment the objects in a scene from their background. One powerful class of region proposal techniques is based on parametric energy minimization (PEM) via parametric maxflow. In this paper, we incorporate PEM into a novel structured learning framework that learns how to combine a set of mid-level grouping cues to yield a small set of region proposals with high recall. Second, we diversify our region proposals and rank them with region-based convolutional neural network features. Our novel approach, called parametric min-loss, casts perceptual grouping and cue combination in a learning framework which yields encouraging results on VOC'2012.

  • A Framework for Symmetric Part Detection in Cluttered Scenes

    Tom Lee, Sanja Fidler, Alex Levinshtein, Cristian Sminchisescu, Sven Dickinson

    Symmetry, Vol. 7, 2015, pp 1333-1351

    Paper  Abstract  Bibtex

    @article{LeeSymmetry2015,
    title = {A Framework for Symmetric Part Detection in Cluttered Scenes},
    author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Cristian Sminchisescu and Sven Dickinson},
    journal = {Symmetry},
    volume = {7},
    pages = {1333-1351},
    year = {2015}
    }

    The role of symmetry in computer vision has waxed and waned in importance during the evolution of the field from its earliest days. At first figuring prominently in support of bottom-up indexing, it fell out of favour as shape gave way to appearance and recognition gave way to detection. With a strong prior in the form of a target object, the role of the weaker priors offered by perceptual grouping was greatly diminished. However, as the field returns to the problem of recognition from a large database, the bottom-up recovery of the parts that make up the objects in a cluttered scene is critical for their recognition. The medial axis community has long exploited the ubiquitous regularity of symmetry as a basis for the decomposition of a closed contour into medial parts. However, today's recognition systems are faced with cluttered scenes and the assumption that a closed contour exists, i.e., that figure-ground segmentation has been solved, rendering much of the medial axis community's work inapplicable. In this article, we review a computational framework, previously reported in [Lee et al., ICCV'13, Levinshtein et al., ICCV'09, Levinshtein et al., IJCV'13], that bridges the representation power of the medial axis and the need to recover and group an object's parts in a cluttered scene. Our framework is rooted in the idea that a maximally-inscribed disc, the building block of a medial axis, can be modelled as a compact superpixel in the image. We evaluate the method on images of cluttered scenes.

  • Multi-cue Mid-level Grouping

    Tom Lee, Sanja Fidler, Sven Dickinson

    In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

    Paper  Abstract  Bibtex

    @inproceedings{LeeACCV14,
    title = {Multi-cue mid-level grouping},
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    booktitle = {ACCV},
    year = {2014}
    }

    Region proposal methods provide richer object hypotheses than sliding windows with dramatically fewer proposals, yet they still number in the thousands. This large quantity of proposals typically results from a diversification step that propagates bottom-up ambiguity in the form of proposals to the next processing stage. In this paper, we take a complementary approach in which mid-level knowledge is used to resolve bottom-up ambiguity at an earlier stage to allow a further reduction in the number of proposals. We present a method for generating regions using the mid-level grouping cues of closure and symmetry. In doing so, we combine mid-level cues that are typically used only in isolation, and leverage them to produce fewer but higher quality proposals. We emphasize that our model is mid-level by learning it on a limited number of objects while applying it to different objects, thus demonstrating that it is transferable to other objects. In our quantitative evaluation, we 1) establish the usefulness of each grouping cue by demonstrating incremental improvement, and 2) demonstrate improvement on two leading region proposal methods with a limited budget of proposals.

  • Detecting Curved Symmetric Parts using a Deformable Disc Model

    Tom Lee, Sanja Fidler, Sven Dickinson

    In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

    Paper  Abstract  Project page  Bibtex

    @inproceedings{LeeICCV13,
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    title = {Detecting Curved Symmetric Parts using a Deformable Disc Model},
    booktitle = {ICCV},
    year = {2013}
    }

    Symmetry is a powerful shape regularity that's been exploited by perceptual grouping researchers in both human and computer vision to recover part structure from an image without a priori knowledge of scene content. Drawing on the concept of a medial axis, defined as the locus of centers of maximal inscribed discs that sweep out a symmetric part, we model part recovery as the search for a sequence of deformable maximal inscribed disc hypotheses generated from a multiscale superpixel segmentation, a framework proposed by [Levinshtein et al., ICCV'09]. However, we learn affinities between adjacent superpixels in a space that's invariant to bending and tapering along the symmetry axis, enabling us to capture a wider class of symmetric parts. Moreover, we introduce a global cost that perceptually integrates the hypothesis space by combining a pairwise and a higher-level smoothing term, which we minimize globally using dynamic programming. The new framework is demonstrated on two datasets, and is shown to significantly outperform the baseline [Levinshtein et al., ICCV'09].

  • Learning Categorical Shape from Captioned Images

    Tom Lee, Sanja Fidler, Alex Levinshtein, Sven Dickinson

    Conference on Computer and Robot Vision (CRV), Toronto, Canada, May 2012

    Paper  Abstract  Bibtex

    @inproceedings{LeeCRV12,
    author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Sven Dickinson},
    title = {Learning Categorical Shape from Captioned Images},
    booktitle = {Canadian Conference on Computer and Robot Vision (CRV)},
    year = {2012}
    }

    Given a set of captioned images of cluttered scenes containing various objects in different positions and scales, we learn named contour models of object categories without relying on bounding box annotation. We extend a recent language-vision integration framework that finds spatial configurations of image features that co-occur with words in image captions. By substituting appearance features with local contour features, object categories are recognized by a contour model that grows along the object's boundary. Experiments on ETHZ are presented to show that 1) the extended framework is better able to learn named visual categories whose within class variation is better captured by a shape model than an appearance model; and 2) typical object recognition methods fail when manually annotated bounding boxes are unavailable.



Yukun Zhu
Graduated with MSc  (now at Google)
Sept 2015 - Jan 2016
Co-supervised with Raquel Urtasun and Ruslan Salakhutdinov

Yukun Zhu

Yukun's research was in two domains: object class detection and vision-language integration. His approach published at CVPR'15 significantly outperformed previous state-of-the-art in detection on PASCAL VOC.

Publications

  • MovieQA: Understanding Stories in Movies through Question-Answering
    (spotlight presentation)

    Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Benchmark on question-answering about movies

    Paper  Abstract  Benchmark  Bibtex

    @inproceedings{TapaswiCVPR16,
    title = {MovieQA: Understanding Stories in Movies through Question-Answering},
    author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2016}
    }

    We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.

  • segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

    Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Currently third in detection on PASCAL VOC Leaderboard

    Paper  Abstract  Project page  Bibtex

    @inproceedings{ZhuSegDeepM15,
    title = {segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection},
    author = {Yukun Zhu and Raquel Urtasun and Ruslan Salakhutdinov and Sanja Fidler},
    booktitle = {CVPR},
    year = {2015}
    }

    In this paper, we propose an approach that exploits object segmentation in order to improve the accuracy of object detection. We frame the problem as inference in a Markov Random Field, in which each detection hypothesis scores object appearance as well as contextual information using Convolutional Neural Networks, and allows the hypothesis to choose and score a segment out of a large pool of accurate object segmentation proposals. This enables the detector to incorporate additional evidence when it is available and thus results in more accurate detections. Our experiments show an improvement of 4.1% in mAP over the R-CNN baseline on PASCAL VOC 2010, and 3.4% over the current state-of-the-art, demonstrating the power of our approach.
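
The hypothesis scoring described above, appearance and context terms plus the best segment chosen from a pool, can be sketched as follows (weights and scores here are made up for illustration, not the model's learned parameters):

```python
def score_hypothesis(appearance_score, context_score, segment_scores,
                     w_app=1.0, w_ctx=0.5, w_seg=0.5):
    """Toy segDeepM-style score: appearance + context + best segment
    chosen from a pool (weights are illustrative, not learned)."""
    best_seg = max(segment_scores) if segment_scores else 0.0
    return w_app * appearance_score + w_ctx * context_score + w_seg * best_seg

# A hypothesis with a well-matching segment in the pool beats one without.
with_seg = score_hypothesis(0.8, 0.6, [0.2, 0.9, 0.5])
without = score_hypothesis(0.8, 0.6, [])
print(with_seg > without)  # → True
```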


  • Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

    Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

    arXiv preprint arXiv:1506.06724, 2015

    Aligning movies and books for story-like captioning

    Paper  Abstract  Project page  Bibtex

    @inproceedings{moviebook,
    title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
    author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
    booktitle = {arXiv preprint arXiv:1506.06724},
    year = {2015}
    }

    Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.


  • Skip-Thought Vectors

    Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    arXiv preprint arXiv:1506.06726, 2015

    Sent2vec neural representation trained on 11K books

    Paper  Abstract  Code  Bibtex

    @inproceedings{skipthought,
    title = {Skip-Thought Vectors},
    author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {arXiv preprint arXiv:1506.06726},
    year = {2015}
    }

    We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.
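
The vocabulary-expansion step mentioned above can be illustrated with a toy linear mapping: fit a least-squares map from a large pre-trained word space to the encoder's word-embedding space on the shared vocabulary, then project unseen words through it. Dimensions and data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding tables: a large word2vec-style space (dim 10) and the
# encoder's own, smaller vocabulary embedded in dim 6.
w2v = rng.normal(size=(50, 10))          # 50 words known to the big word space
true_map = rng.normal(size=(10, 6))
rnn = w2v[:30] @ true_map                # 30 words shared with the encoder

# Fit a linear map from the big word space to the encoder's embedding space
# on the shared vocabulary (least squares), as in the expansion trick.
W, *_ = np.linalg.lstsq(w2v[:30], rnn, rcond=None)

# Words unseen during training (rows 30..49) can now be projected.
projected = w2v[30:] @ W
expected = w2v[30:] @ true_map
print(np.allclose(projected, expected))  # → True
```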

  • 3D Object Proposals for Accurate Object Class Detection

    Xiaozhi Chen*, Kaustav Kundu*, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun

    * Denotes equal contribution

    Currently third in Car, and first in Pedestrian and Cyclist detection on KITTI's Leaderboard

    Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

    Paper  Abstract  Project page  Bibtex

    @inproceedings{XiaozhiNIPS15,
    title = {3D Object Proposals for Accurate Object Class Detection},
    author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Andrew Berneshawi and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {NIPS},
    year = {2015}
    }

    The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on Car and Cyclist, and is competitive for the Pedestrian class.



Ivan Vendrov
Graduated with MSc  (now at Google)
Sept 2015 - Jan 2016
Co-supervised with Raquel Urtasun

Ivan's master's thesis was on the topic of semantic visual search.

Publications

  • Order-Embeddings of Images and Language    (oral presentation)

    Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun

    In International Conference on Learning Representations, Puerto Rico, 2016

    State-of-the-art in caption-image retrieval on COCO

    Paper  Abstract  Code  Bibtex

    @inproceedings{VendrovArxiv15,
    title = {Order-Embeddings of Images and Language},
    author = {Ivan Vendrov and Ryan Kiros and Sanja Fidler and Raquel Urtasun},
    booktitle = {ICLR},
    year = {2016}
    }

    Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.
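
The partial-order idea can be made concrete with an order-violation penalty of the form used in the paper, E(x, y) = ||max(0, y − x)||², which is zero exactly when y ≤ x coordinate-wise (more general concepts sit closer to the origin). The vectors below are made up for illustration.

```python
import numpy as np

def order_violation(x, y):
    """Penalty E(x, y) = ||max(0, y - x)||^2: zero iff y <= x coordinate-wise,
    i.e. iff x precedes y in the reversed product order."""
    return float(np.sum(np.maximum(0.0, y - x) ** 2))

# "dog" should precede the more general "animal": every coordinate of the
# more general concept is smaller. Vectors are hypothetical.
dog = np.array([2.0, 3.0])
animal = np.array([1.0, 2.0])
print(order_violation(dog, animal))  # → 0.0 (ordered correctly)
print(order_violation(animal, dog))  # → 2.0 (violated in both coordinates)
```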



Ziyu Zhang
Graduated with MSc  
Sept 2015 - April 2016
Co-supervised with Raquel Urtasun

Ziyu Zhang

Ziyu's master's thesis was on instance-level object segmentation in monocular imagery.

Publications

  • Instance-Level Segmentation with Deep Densely Connected MRFs

    Ziyu Zhang, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Paper  Abstract  Bibtex

    @inproceedings{ZhangCVPR16,
    title = {Instance-Level Segmentation with Deep Densely Connected MRFs},
    author = {Ziyu Zhang and Sanja Fidler and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2016}
    }

    Our aim is to provide a pixel-level object instance labeling of a monocular image. We build on recent work [Zhang et al., ICCV15] that trained a convolutional neural net to predict instance labeling in local image patches, extracted exhaustively in a stride from an image. A simple Markov random field model using several heuristics was then proposed in [Zhang et al., ICCV15] to derive a globally consistent instance labeling of the image. In this paper, we formulate the global labeling problem with a novel densely connected Markov random field and show how to encode various intuitive potentials in a way that is amenable to efficient mean field inference [Krahenbuhl et al., NIPS11]. Our potentials encode the compatibility between the global labeling and the patch-level predictions, contrast-sensitive smoothness as well as the fact that separate regions form different instances. Our experiments on the challenging KITTI benchmark [Geiger et al., CVPR12] demonstrate that our method achieves a significant performance boost over the baseline [Zhang et al., ICCV15].
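
Mean-field inference of the kind referenced above can be sketched for a small fully connected pairwise MRF with Potts compatibility; this is a generic toy version, not the paper's potentials nor the efficient filtering-based updates of [Krahenbuhl et al., NIPS11].

```python
import numpy as np

def mean_field(unary, pairwise_w, n_iters=20):
    """Naive mean-field for a fully connected pairwise MRF with a Potts
    penalty (a generic sketch, not the paper's exact potentials).
    unary:      (n, L) per-node negative log-potentials
    pairwise_w: (n, n) symmetric connection weights, zero diagonal
    Returns (n, L) approximate marginals Q."""
    n, L = unary.shape
    Q = np.full((n, L), 1.0 / L)
    for _ in range(n_iters):
        # Each neighbour j votes against labels it does not currently hold.
        message = pairwise_w @ (1.0 - Q)
        Q = np.exp(-unary - message)
        Q /= Q.sum(axis=1, keepdims=True)
    return Q

# Three nodes, two labels; node 2 has weak unaries but strong ties to nodes
# 0 and 1, which both prefer label 0, so smoothing pulls node 2 to label 0.
unary = np.array([[0.0, 2.0], [0.0, 2.0], [1.0, 0.9]])
w = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
Q = mean_field(unary, w)
print(Q.argmax(axis=1))  # → [0 0 0]
```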

  • Monocular 3D Object Detection for Autonomous Driving

    Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Abstract  Bibtex

    @inproceedings{ChenCVPR16,
    title = {Monocular 3D Object Detection for Autonomous Driving},
    author = {Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2016}
    }

    The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.

  • Monocular Object Instance Segmentation and Depth Ordering with CNNs

    Ziyu Zhang, Alex Schwing, Sanja Fidler, Raquel Urtasun

    In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

    Paper  Abstract  Bibtex

    @inproceedings{ZhangICCV15,
    title = {Monocular Object Instance Segmentation and Depth Ordering with CNNs},
    author = {Ziyu Zhang and Alex Schwing and Sanja Fidler and Raquel Urtasun},
    booktitle = {ICCV},
    year = {2015}
    }

    In this paper we tackle the problem of instance level segmentation and depth ordering from a single monocular image. Towards this goal, we take advantage of convolutional neural nets and train them to directly predict instance level segmentations where the instance ID encodes depth ordering from large image patches. To provide a coherent single explanation of an image we develop a Markov random field which takes as input the predictions of convolutional nets applied at overlapping patches of different resolutions as well as the output of a connected component algorithm and predicts very accurate instance level segmentation and depth ordering. We demonstrate the effectiveness of our approach on the challenging KITTI benchmark and show very good performance on both tasks.




Xavier Puig Fernandez
PhD student, MIT
Jan-March, 2016

Xavier visited the group twice: once in Nov 2015, and again from Jan to March, 2016. We are working on the problem of video-to-text alignment.

Co-supervised with Raquel Urtasun:


Urban Jezernik
PhD student, University of Ljubljana
Jan-April, 2016

Urban visited the group from Jan to April, 2016. We were working on the problem of music generation.


Makarand Tapaswi
PhD student, KIT (now a postdoc in our group)
Sept-Dec, 2015

Makarand visited for three months in 2015, and joined the group as a postdoc in the fall of 2016. We are working on the problem of question-answering based on videos.

Publications

  • MovieQA: Understanding Stories in Movies through Question-Answering
    (spotlight presentation)

    Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

    Paper  Abstract  Benchmark  Bibtex

    @inproceedings{TapaswiCVPR16,
    title = {MovieQA: Understanding Stories in Movies through Question-Answering},
    author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2016}
    }

    We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.



TV Interview (min 15:12 to 16:43), in Spanish
Edgar Simo-Serra
PhD student, UPC (now a postdoc at Tokyo University)
Summer 2013, 2014

Edgar visited the group twice. During his first visit (to TTI-C) he worked on clothing parsing in fashion photographs, publishing a first-author paper at ACCV'14 on the topic. During his second visit (to UofT), he worked on predicting how fashionable/stylish someone looks in a photograph and on suggesting ways to help users improve their "look". This resulted in a first-author CVPR'15 paper, which received significant international press coverage in major news and fashion media such as New Scientist, Quartz, Wired, Glamour, Cosmopolitan, Elle and Marie Claire (see the project page for more details). Edgar gave several interviews for the press, including an appearance on Spanish television (minutes 15:12 to 16:43) and radio (minutes 16:10 to 20:43). Yahoo News Canada featured a full photo of him in one of my favorite press articles on the subject.

Publications

  • Neuroaesthetics in Fashion: Modeling the Perception of Beauty

    Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    How fashionable do you look in a photo? And how can you improve?

    Paper  Abstract  Project page  Bibtex

    @inproceedings{SimoCVPR15,
    title = {Neuroaesthetics in Fashion: Modeling the Perception of Beauty},
    author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2015}
    }

    In this paper, we analyze the fashion of clothing of a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback back to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.

  • A High Performance CRF Model for Clothes Parsing

    Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

    In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

    Significant performance gain over state-of-the-art in clothing parsing.

    Paper  Abstract  Project page  Bibtex

     

    @inproceedings{SimoACCV14,
    title = {A High Performance CRF Model for Clothes Parsing},
    author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
    booktitle = {ACCV},
    year = {2014}}

    In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset [Yamaguchi et al., CVPR'12] and show that we can obtain a significant improvement over the state-of-the-art.



Roozbeh Mottaghi
PhD student, UCLA (now a Research Scientist at AI2)
Summer 2012, 2013

Roozbeh visited the group several times, working on the topic of object class detection. His work resulted in several state-of-the-art detectors. He published two first-author and two second-author CVPR papers (CVPR'13 and '14), as well as a first-author T-PAMI publication.
Roozbeh went on to do a postdoc with Prof. Silvio Savarese at Stanford and is now a Research Scientist at AI2.

Publications

  • Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding

    Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh

    Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(1), 2016, pages 74-87

    Paper  Abstract  Suppl. Mat.  Bibtex

    @article{MottaghiPAMI16,
    title = {Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding},
    author = {Roozbeh Mottaghi and Sanja Fidler and Alan Yuille and Raquel Urtasun and Devi Parikh},
    journal = {Trans. on Pattern Analysis and Machine Intelligence},
    volume= {38},
    number= {1},
    pages= {74--87},
    year = {2016}
    }

    Recent trends in image understanding have pushed for scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we "plug-in" human subjects for each of the various components in a conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve scene understanding by focusing research efforts on various individual tasks.

  • The Role of Context for Object Detection and Semantic Segmentation in the Wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, Alan Yuille

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    PASCAL VOC with dense segmentation labels for 400+ classes, available on the Project page

    Paper  Errata  Abstract  Project page  Suppl. Mat.  Bibtex

    @inproceedings{MottaghiCVPR14,
    author = {Roozbeh Mottaghi and Xianjie Chen and Xiaobai Liu and Sanja Fidler and Raquel Urtasun and Alan Yuille},
    title = {The Role of Context for Object Detection and Semantic Segmentation in the Wild},
    booktitle = {CVPR},
    year = {2014}
    }

    In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of the PASCAL VOC 2010 detection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmentation and object detection. Our analysis shows that nearest neighbor based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, the improvements of existing contextual models for detection are rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales.

  • Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts

    Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Nam-Gyu Cho, Sanja Fidler, Raquel Urtasun, Alan Yuille

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    PASCAL VOC with object parts segmentations, available on the Project page

    Paper  Abstract  Project page  Bibtex

    @inproceedings{PartsCVPR14,
    author = {Xianjie Chen and Roozbeh Mottaghi and Xiaobai Liu and Nam-Gyu Cho and Sanja Fidler and Raquel Urtasun and Alan Yuille},
    title = {Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts},
    booktitle = {CVPR},
    year = {2014}
    }

    Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals depicted at low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables us to represent a large number of holistic object and body part combinations to better deal with different "detectability" patterns caused by deformations, occlusion and/or low resolution. We apply our method to the six animal categories in the PASCAL VOC dataset and show that our method significantly improves state-of-the-art (by 4.1% AP) and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc), making use of a new dataset of fully annotated object parts for PASCAL VOC 2010, which provides a mask for each part.

  • Bottom-up Segmentation for Top-down Detection

    Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    8% over DPM and 4% over the state-of-the-art on PASCAL VOC at the time.

    Paper  Abstract  Project page  Suppl. Mat.  Bibtex

    @inproceedings{segdpmCVPR13,
    author = {Sanja Fidler and Roozbeh Mottaghi and Alan Yuille and Raquel Urtasun},
    title = {Bottom-up Segmentation for Top-down Detection},
    booktitle = {CVPR},
    year = {2013}
    }

    In this paper we are interested in how semantic segmentation can help object detection. Towards this goal, we propose a novel deformable part-based model which exploits region-based segmentation algorithms that compute candidate object regions by bottom-up clustering followed by ranking of those regions. Our approach allows every detection hypothesis to select a segment (including void), and scores each box in the image using both the traditional HOG filters as well as a set of novel segmentation features. Thus our model "blends" between the detector and segmentation models. Since our features can be computed very efficiently given the segments, we maintain the same complexity as the original DPM. We demonstrate the effectiveness of our approach in PASCAL VOC 2010, and show that when employing only a root filter our approach outperforms Dalal & Triggs detector on all classes, achieving 13% higher average AP. When employing the parts, we outperform the original DPM in 19 out of 20 classes, achieving an improvement of 8% AP. Furthermore, we outperform the previous state-of-the-art on VOC'10 test by 4%.

  • Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs

    Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{MottaghiCVPR13,
    author = {Roozbeh Mottaghi and Sanja Fidler and Jian Yao and Raquel Urtasun and Devi Parikh},
    title = {Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs},
    booktitle = {CVPR},
    year = {2013}
    }

    Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we "plug-in" human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted an in-depth analysis of the human-generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.



Liang-Chieh Chen
PhD student, UCLA (now at Google)
Summer 2013

Liang-Chieh ("Jay") worked on weakly-labeled segmentation: obtaining accurate object segmentations given a ground-truth 3D bounding box, as available in the KITTI dataset. His method improved significantly over existing grab-cut-style approaches, and even outperformed MTurkers (when compared against accurate in-house annotations). Jay published a first-author paper at CVPR'14.

Publications

  • Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision

    Liang-Chieh Chen, Sanja Fidler, Alan Yuille, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    More accurate pixel-level object labeling than MTurkers

    Paper  Abstract  Project page  CAD models  Suppl. Mat.  Bibtex

    @inproceedings{ChenCVPR14,
    author = {Liang-Chieh Chen and Sanja Fidler and Alan Yuille and Raquel Urtasun},
    title = {Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision},
    booktitle = {CVPR},
    year = {2014}
    }

    Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and a budget of tens or hundreds of thousands of dollars. Thus, developing solutions that can automatically perform the labeling given only weak supervision is key to reduce this cost. In this paper, we show how to exploit 3D information to automatically generate very accurate object segmentations given annotated 3D bounding boxes. We formulate the problem as the one of inference in a binary Markov random field which exploits appearance models, stereo and/or noisy point clouds, a repository of 3D CAD models as well as topological constraints. We demonstrate the effectiveness of our approach in the context of autonomous driving, and show that we can segment cars with the accuracy of 86% intersection-over-union, performing as well as highly recommended MTurkers!



Abhishek Sharma
PhD student, UMD (now at Apple)
Summer 2012

Abhishek worked on holistic scene parsing by exploiting image captions. Making use of textual information for visual parsing is important for, e.g., robotics applications where an automatic system interacts with a human user. Abhishek co-authored a CVPR'13 paper.

Publications

  • A Sentence is Worth a Thousand Pixels

    Sanja Fidler, Abhishek Sharma, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{FidlerCVPR13,
    author = {Sanja Fidler and Abhishek Sharma and Raquel Urtasun},
    title = {A Sentence is Worth a Thousand Pixels},
    booktitle = {CVPR},
    year = {2013}
    }

    We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models.


Kevin Kyunghwan Ra
4th year undergraduate, UofT (now at McGill University)
2016

Vasu Sharma
3rd year undergraduate, IIT Kanpur
Summer 2016, co-supervised with Raquel Urtasun

Amlan Kar
3rd year undergraduate, IIT Kanpur
Summer 2016, co-supervised with Raquel Urtasun

Erin Grant
4th year undergraduate, UofT (now a PhD student at UC Berkeley)
Jan-April, 2016

Seung Wook Kim
4th year undergraduate, UofT
Jan-April, 2016

Huazhe Xu
4th year visiting student from Tsinghua University (now a PhD student at UC Berkeley)
Sep 2015 - Dec 2015, co-supervised with Raquel Urtasun

Boris Ivanovic
4th year undergraduate, UofT (now a Masters student at Stanford University)
Sep 2015 - May 2016, co-supervised with Raquel Urtasun

Tamara Lipowski
4th year undergraduate, UofT (now a Masters student at University of Salzburg)
Jan-April, 2016

Zexuan (Aaron) Wang
4th year undergraduate, UofT (now at Qumulo Inc)
Sept 2015 - April 2016, co-supervised with Raquel Urtasun

Jurgen Aliaj
2nd year undergraduate, UofT
Summer 2015 (USRA)

Andrew Berneshawi
4th year undergraduate, UofT (now at Amazon, Canada)
CSC494, Winter 2015

Andrew worked on road estimation as part of a semester-long project course (CSC494). His approach ranked second on KITTI's road classification benchmark (entry: NNP, timestamped June 2015).



Chenxi's talk at CVPR'15
Chenxi Liu
4th year undergraduate, Tsinghua University (now a PhD student at Johns Hopkins University)
Summer 2014, co-supervised with Raquel Urtasun

Chenxi worked on the problem of apartment reconstruction in 3D from rental data (monocular imagery and floor-plan). His work resulted in a joint first-author oral CVPR'15 paper. He gave a talk at CVPR and did a great job (you can check his performance below).

Publications

  • Rent3D: Floor-Plan Priors for Monocular Layout Estimation
    (oral presentation)

    Chenxi Liu*, Alex Schwing*, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Rent an apartment in 3D!

    * Denotes equal contribution

    Paper  Abstract  Suppl. Mat.  Project page  Bibtex

    @inproceedings{ApartmentsCVPR15,
    title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
    author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2015}}

    The goal of this paper is to enable a 3D "virtual-tour" of an apartment given a small set of monocular images of different rooms, as well as a 2D floor plan. We frame the problem as the one of inference in a Markov random field which reasons about the layout of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example, about in which room the picture was taken. What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge. In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information can significantly help in resolving the challenging room-apartment alignment problem. We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we exploit integral geometry as well as our new bounds on the aspect ratio of rooms which allow us to carve the space, reducing significantly the number of physically possible configurations. We demonstrate the effectiveness of our approach in a new dataset which contains over 200 apartments.


    [talk]  [slides]


Yinan Zhao
4th year undergraduate, Tsinghua University (now a PhD student at UT Austin)
Summer 2014, co-supervised with Raquel Urtasun

Chen Kong
4th year undergraduate, Tsinghua University (now a PhD student at CMU)
Summer 2013, co-supervised with Raquel Urtasun

Chen worked on 3D indoor scene understanding by exploiting textual information. His work resulted in one first-author and one co-authored CVPR'14 paper, and he co-authored an oral paper at BMVC'15.

Publications

  • What are you talking about? Text-to-Image Coreference

    Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    Exploits text for visual parsing and aligns nouns to objects.

    Paper  Abstract  Bibtex

    @inproceedings{KongCVPR14,
    title = {What are you talking about? Text-to-Image Coreference},
    author = {Chen Kong and Dahua Lin and Mohit Bansal and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2014}
    }

    In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system.

  • Visual Semantic Search: Retrieving Videos via Complex Textual Queries

    Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    Video retrieval when a query is a longer sentence or a multi-sentence description

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{LinCVPR14,
    author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
    title = {Visual Semantic Search: Retrieving Videos via Complex Textual Queries},
    booktitle = {CVPR},
    year = {2014}
    }

    In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.

  • Generating Multi-Sentence Lingual Descriptions of Indoor Scenes
    (oral presentation)

    Dahua Lin, Chen Kong, Sanja Fidler, Raquel Urtasun

    In British Machine Vision Conference (BMVC), 2015

    Paper  Abstract  Bibtex

    @inproceedings{Lin15,
    title = {Generating Multi-Sentence Lingual Descriptions of Indoor Scenes},
    author = {Dahua Lin and Chen Kong and Sanja Fidler and Raquel Urtasun},
    booktitle = {BMVC},
    year = {2015}}

    This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches have focused primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROUGE scores compared to those produced by the baseline.


Jialiang's fun video about his summer research
Jialiang Wang
4th year undergraduate, UofT (now a PhD student at Harvard University)
Summer 2014 (USRA), co-supervised with Sven Dickinson

Uri's fun video about his summer research
Uri Priel
3rd year undergraduate, UofT
Summer 2014 (USRA), co-supervised with Sven Dickinson


Winning video of an undergraduate research video competition
Kamyar Seyed Ghasemipour
2nd year undergraduate, UofT
Summer 2014 (USRA), co-supervised with Suzanne Stevenson and Sven Dickinson

Kamyar worked on unsupervised word-sense disambiguation of captioned images. He won a research video competition (video) held for the Undergraduate Summer Research Program at UofT.


