Object Detection
Movies and Books
Scene Understanding
Vision and Language
Clothing and Fashion
Segmentation

Object Detection

Object class recognition and detection is one of the main challenges in computer vision and also one of my greatest passions. In my PhD work I proposed multi-class hierarchical compositional representations of objects. Its main advantages at the time were that detection time grew sub-linearly with the number of classes, the representation was learned from data, and it modeled contour deformation at multiple levels.

Combining different sources of information, such as appearance and segmentation, is also important. For example, articulated objects may be recognized more accurately at a smaller scale (texture patches), which is well captured by local image-labeling approaches, while shape is a more global cue and thus better modeled by a representation over a more global, object-sized region/window. In recent work we showed that combining appearance, segmentation and contextual information improved detection performance by 5-8% over DPM and R-CNN. As part of this work, we also labeled PASCAL VOC with dense pixel labels for over 400 classes (the PASCAL-Context dataset), as well as with detailed masks of semantic object parts (the PASCAL-Part dataset).

We live in a 3D world, and thus our models should also reason in 3D. This is particularly important for robotics applications where estimating accurate 3D location and pose is crucial. Our work focuses on 3D object detection from single monocular images as well as RGB-D data.

Relevant Publications


Multi-cue Detection

  • segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

    Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Currently third in detection on PASCAL VOC Leaderboard

    Paper  Abstract  Project page  Bibtex

    @inproceedings{ZhuSegDeepM15,
    title = {segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection},
    author = {Yukun Zhu and Raquel Urtasun and Ruslan Salakhutdinov and Sanja Fidler},
    booktitle = {CVPR},
    year = {2015}
    }

    In this paper, we propose an approach that exploits object segmentation in order to improve the accuracy of object detection. We frame the problem as inference in a Markov Random Field, in which each detection hypothesis scores object appearance as well as contextual information using Convolutional Neural Networks, and allows the hypothesis to choose and score a segment out of a large pool of accurate object segmentation proposals. This enables the detector to incorporate additional evidence when it is available and thus results in more accurate detections. Our experiments show an improvement of 4.1% in mAP over the R-CNN baseline on PASCAL VOC 2010, and 3.4% over the current state-of-the-art, demonstrating the power of our approach.

  • The Role of Context for Object Detection and Semantic Segmentation in the Wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, Alan Yuille

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    PASCAL VOC with dense segmentation labels for 400+ classes in Project page

    Paper  Errata  Abstract  Project page  Suppl. Mat.  Bibtex

    @inproceedings{MottaghiCVPR14,
    author = {Roozbeh Mottaghi and Xianjie Chen and Xiaobai Liu and Sanja Fidler and Raquel Urtasun and Alan Yuille},
    title = {The Role of Context for Object Detection and Semantic Segmentation in the Wild},
    booktitle = {CVPR},
    year = {2014}
    }

In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of the PASCAL VOC 2010 detection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmentation and object detection. Our analysis shows that nearest neighbor based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, improvements of existing contextual models for detection are rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales.

  • Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts

    Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Nam-Gyu Cho, Sanja Fidler, Raquel Urtasun, Alan Yuille

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    PASCAL VOC with object parts segmentations available in Project page

    Paper  Abstract  Project page  Bibtex

    @inproceedings{PartsCVPR14,
    author = {Xianjie Chen and Roozbeh Mottaghi and Xiaobai Liu and Nam-Gyu Cho and Sanja Fidler and Raquel Urtasun and Alan Yuille},
    title = {Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts},
    booktitle = {CVPR},
    year = {2014}
    }

    Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals depicted at low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables us to represent a large number of holistic object and body part combinations to better deal with different "detectability" patterns caused by deformations, occlusion and/or low resolution. We apply our method to the six animal categories in the PASCAL VOC dataset and show that our method significantly improves state-of-the-art (by 4.1% AP) and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc), making use of a new dataset of fully annotated object parts for PASCAL VOC 2010, which provides a mask for each part.

  • Bottom-up Segmentation for Top-down Detection

    Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    8% over DPM and 4% over the state-of-the-art on PASCAL VOC at the time. Data available.

    Paper  Abstract  Project page  Suppl. Mat.  Bibtex

    @inproceedings{segdpmCVPR13,
    author = {Sanja Fidler and Roozbeh Mottaghi and Alan Yuille and Raquel Urtasun},
    title = {Bottom-up Segmentation for Top-down Detection},
    booktitle = {CVPR},
    year = {2013}
    }

    In this paper we are interested in how semantic segmentation can help object detection. Towards this goal, we propose a novel deformable part-based model which exploits region-based segmentation algorithms that compute candidate object regions by bottom-up clustering followed by ranking of those regions. Our approach allows every detection hypothesis to select a segment (including void), and scores each box in the image using both the traditional HOG filters as well as a set of novel segmentation features. Thus our model "blends" between the detector and segmentation models. Since our features can be computed very efficiently given the segments, we maintain the same complexity as the original DPM. We demonstrate the effectiveness of our approach in PASCAL VOC 2010, and show that when employing only a root filter our approach outperforms Dalal & Triggs detector on all classes, achieving 13% higher average AP. When employing the parts, we outperform the original DPM in 19 out of 20 classes, achieving an improvement of 8% AP. Furthermore, we outperform the previous state-of-the-art on VOC'10 test by 4%.
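    A recurring idea in the works above (segDeepM and the segmentation-aware DPM) is to let every detection hypothesis pick the best-matching segment from a pool of bottom-up proposals, including a "void" option, and add that evidence to its appearance score. The snippet below is a minimal sketch of this scoring scheme, not the published implementation; the IoU-based segment feature and all names are illustrative stand-ins.

    import numpy as np

    def score_detection(appearance_score, box, segments, w_seg):
        """Score a detection hypothesis that may pick one segment (or 'void')
        from a pool of bottom-up proposals; a toy stand-in for the richer
        segmentation features used in the papers above."""
        def iou(a, b):
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
            union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
            return inter / union if union > 0 else 0.0

        # index 0 is the 'void' choice: no segment, zero contribution
        seg_scores = [0.0] + [w_seg * iou(box, s) for s in segments]
        best = int(np.argmax(seg_scores))
        return appearance_score + seg_scores[best], best

    # toy usage: one HOG/CNN-scored box and two segment proposals
    box = (10, 10, 100, 120)
    proposals = [(12, 8, 98, 118), (200, 50, 260, 90)]
    print(score_detection(1.3, box, proposals, w_seg=0.5))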


3D Object Detection in RGB-D

  • Holistic Scene Understanding for 3D Object Detection with RGBD cameras           (oral presentation)

    Dahua Lin, Sanja Fidler, Raquel Urtasun

    In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

    Code, models and ground-truth cuboids for NYU-v2 in Project page

    Paper  Abstract  Project page  Talk slides  Bibtex

    @inproceedings{LinICCV13,
    author = {Dahua Lin and Sanja Fidler and Raquel Urtasun},
    title = {Holistic Scene Understanding for 3D Object Detection with RGBD cameras},
    booktitle = {ICCV},
    year = {2013}
    }

    In this paper, we tackle the problem of indoor scene understanding using RGBD data. Towards this goal, we propose a holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects. Specifically, we extend the CPMC framework to 3D in order to generate candidate cuboids, and develop a conditional random field to integrate information from different sources to classify the cuboids. With this formulation, scene classification and 3D object recognition are coupled and can be jointly solved through probabilistic inference. We test the effectiveness of our approach on the challenging NYU v2 dataset. The experimental results demonstrate that through effective evidence integration and holistic reasoning, our approach achieves substantial improvement over the state-of-the-art.


3D Object Detection in Monocular Imagery

  • Box In the Box: Joint 3D Layout and Object Reasoning from Single Images

    Alex Schwing, Sanja Fidler, Marc Pollefeys, Raquel Urtasun

    In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

    Parallel and improved implementation of Structured SVMs available

    Paper  Abstract  Learning code  Bibtex

    @inproceedings{SchwingICCV13,
    author = {Alex Schwing and Sanja Fidler and Marc Pollefeys and Raquel Urtasun},
    title = {Box In the Box: Joint 3D Layout and Object Reasoning from Single Images},
    booktitle = {ICCV},
    year = {2013}
    }

    In this paper we propose an approach to jointly infer the room layout as well as the objects present in the scene. Towards this goal, we propose a branch and bound algorithm which is guaranteed to retrieve the global optimum of the joint problem. The main difficulty resides in taking into account occlusion in order to not over-count the evidence. We introduce a new decomposition method, which generalizes integral geometry to triangular shapes, and allows us to bound the different terms in constant time. We exploit both geometric cues and object detectors as image features and show large improvements in 2D and 3D object detection over state-of-the-art deformable part-based models.

  • 3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model      (spotlight presentation)

    Sanja Fidler, Sven Dickinson, Raquel Urtasun

    Neural Information Processing Systems (NIPS), Lake Tahoe, USA, December 2012

    800 CAD models registered to canonical viewpoint available!

    Paper  Abstract  CAD dataset  Bibtex

    @inproceedings{FidlerNIPS12,
    author = {Sanja Fidler and Sven Dickinson and Raquel Urtasun},
    title = {3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model},
    booktitle = {NIPS},
    year = {2012}
    }

    This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D and 3D object detection.
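    Inference in the cuboid model amounts to sliding and rotating a 3D box over a discretized search space and scoring each hypothesis. The sketch below only illustrates that enumeration loop under assumed ranges; the actual face and part scoring from the paper is abstracted behind a placeholder score_fn callback.

    import numpy as np
    from itertools import product

    def best_cuboid(score_fn, x_range, y_range, z_range, yaw_steps=8):
        """Slide and rotate a 3D box over a discretized grid and keep the
        best-scoring hypothesis. score_fn(center, yaw) stands in for the
        model's face and part scores evaluated on the image."""
        yaws = np.linspace(0.0, 2.0 * np.pi, yaw_steps, endpoint=False)
        best_score, best_hyp = -np.inf, None
        for cx, cy, cz, yaw in product(x_range, y_range, z_range, yaws):
            s = score_fn((cx, cy, cz), yaw)
            if s > best_score:
                best_score, best_hyp = s, ((cx, cy, cz), yaw)
        return best_score, best_hyp

    # toy usage with a dummy scorer preferring boxes near the origin, facing forward
    dummy = lambda c, yaw: -np.hypot(c[0], c[1]) - abs(yaw)
    print(best_cuboid(dummy, np.linspace(-2, 2, 5), np.linspace(-2, 2, 5), [0.0]))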


Compositional Models (selected publications)

  • Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation

    Sanja Fidler, Marko Boben, Ales Leonardis

    arXiv preprint arXiv:1408.5516, 2014

    Journal version of my PhD work on learning compositional hierarchies encoding spatial relations

    Paper  Abstract  Bibtex

    @inproceedings{FidlerArxiv14,
    title = {Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation},
    author = {Sanja Fidler and Marko Boben and Ale\v{s} Leonardis},
    booktitle = {ArXiv:1408.5516},
    year = {2014}
    }

    Hierarchies allow feature sharing between objects at multiple levels of representation, can code exponential variability in a very compact way and enable fast inference. This makes them potentially suitable for learning and recognizing a higher number of object classes. However, the success of the hierarchical approaches so far has been hindered by the use of hand-crafted features or predetermined grouping rules. This paper presents a novel framework for learning a hierarchical compositional shape vocabulary for representing multiple object classes. The approach takes simple contour fragments and learns their frequent spatial configurations. These are recursively combined into increasingly more complex and class-specific shape compositions, each exerting a high degree of shape variability. At the top level of the vocabulary, the compositions are sufficiently large and complex to represent the whole shapes of the objects. We learn the vocabulary layer after layer, by gradually increasing the size of the window of analysis and reducing the spatial resolution at which the shape configurations are learned. The lower layers are learned jointly on images of all classes, whereas the higher layers of the vocabulary are learned incrementally, by presenting the algorithm with one object class after another. The experimental results show that the learned multi-class object representation scales favorably with the number of object classes and achieves state-of-the-art detection performance with both faster inference and shorter training times.

  • A coarse-to-fine Taxonomy of Constellations for Fast Multi-class Object Detection

    Sanja Fidler, Marko Boben, Ales Leonardis

    In European Conference in Computer Vision (ECCV), 2010

    Paper  Abstract  Bibtex

    @inproceedings{FidlerECCV10,
    author = {Sanja Fidler and Marko Boben and Ales Leonardis},
    title = {A coarse-to-fine Taxonomy of Constellations for Fast Multi-class Object Detection},
    booktitle = {ECCV},
    year = {2010}
    }

    In order for recognition systems to scale to a larger number of object categories, building visual class taxonomies is important to achieve running times logarithmic in the number of classes [1, 2]. In this paper we propose a novel approach for speeding up recognition times of multi-class part-based object representations. The main idea is to construct a taxonomy of constellation models cascaded from coarse-to-fine resolution and use it in recognition with an efficient search strategy. The taxonomy is built automatically in a way to minimize the number of expected computations during recognition by optimizing the cost-to-power ratio [Blanchard and Geman, Annals of Statistics, 2005]. The structure and the depth of the taxonomy are not pre-determined but are inferred from the data. The approach is utilized on the hierarchy-of-parts model, achieving efficiency both in the representation of object structure and in the number of modeled object classes. We achieve a speed-up even for a small number of object classes on the ETHZ and TUD datasets. On a larger scale, our approach achieves detection time that is logarithmic in the number of classes.

  • Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

    Sanja Fidler, Marko Boben, Ales Leonardis

    Neural Information Processing Systems (NIPS), 2009

    Paper  Abstract  Bibtex

    @inproceedings{FidlerNIPS09,
    author = {Sanja Fidler and Marko Boben and Ales Leonardis},
    title = {Evaluating multi-class learning strategies in a generative hierarchical framework for object detection},
    booktitle = {NIPS},
    year = {2009}
    }

    Multi-class object learning and detection is a challenging problem due to the large number of object classes and their high visual variability. Specialized detectors usually excel in performance, while joint representations optimize sharing and reduce inference time -- but are complex to train. Conveniently, sequential class learning cuts down training time by transferring existing knowledge to novel classes, but cannot fully exploit the shareability of features among object classes and might depend on ordering of classes during learning. In hierarchical frameworks these issues have been little explored. In this paper, we provide a rigorous experimental analysis of various multiple object class learning strategies within a generative hierarchical framework. Specifically, we propose, evaluate and compare three important types of multi-class learning: 1.) independent training of individual categories, 2.) joint training of classes, and 3.) sequential learning of classes. We explore and compare their computational behavior (space and time) and detection performance as a function of the number of learned object classes on several recognition datasets. We show that sequential training achieves the best trade-off between inference and training times at a comparable detection performance and could thus be used to learn the classes on a larger scale.

  • Learning Hierarchical Compositional Representations of Object Structure

    Sanja Fidler, Marko Boben, Ales Leonardis

    Object Categorization: Computer and Human Vision Perspectives
    Editors: S. Dickinson, A. Leonardis, B. Schiele and M. J. Tarr
    Cambridge university press, 2009

    Bibtex

    @InCollection{FidlerChapter09,
    author = {Sanja Fidler and Marko Boben and Ales Leonardis},
    title = {Learning Hierarchical Compositional Representations of Object Structure},
    booktitle = {Object Categorization: Computer and Human Vision Perspectives},
    editor = {Sven Dickinson and Ale\v{s} Leonardis and Bernt Schiele and Michael J. Tarr},
    year = {2009},
    publisher = {Cambridge University Press},
    pages = {}
    }

  • Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts

    Sanja Fidler, Ales Leonardis

    Conference on Computer Vision and Pattern Recognition (CVPR), 2007

    Learning a deep hierarchy of interpretable features encoding spatial relations

    Paper  Abstract  Bibtex

    @inproceedings{FidlerCVPR07,
    author = {Sanja Fidler and Ales Leonardis},
    title = {Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts},
    booktitle = {CVPR},
    year = {2007}
    }

    This paper proposes a novel approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories. Inspired by the principles of efficient indexing (bottom-up), robust matching (top-down), and ideas of compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which consequently define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy. Built in this way, new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories. Detection results confirm the effectiveness and robustness of the learned parts.
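    The layer-wise learning in these compositional models boils down to finding statistically frequent spatial configurations of parts from the layer below and promoting them to new, larger parts. The snippet below is a toy sketch of that counting step under assumed inputs (lists of detected parts with image coordinates); it is not the published learning algorithm, which also handles spatial flexibility and compositions of more than two parts.

    from collections import Counter
    from itertools import combinations

    def propose_compositions(images, cell=8, top_k=20):
        """Count co-occurring part pairs with quantized relative offsets and
        return the most frequent configurations as candidate next-layer parts.

        images : list of per-image detections, each a list of (part_id, x, y)
        """
        counts = Counter()
        for parts in images:
            for (p, px, py), (q, qx, qy) in combinations(parts, 2):
                dx, dy = round((qx - px) / cell), round((qy - py) / cell)
                counts[(p, q, dx, dy)] += 1
        return counts.most_common(top_k)

    # toy usage: part 0 tends to appear about 16 px to the left of part 1
    imgs = [[(0, 10, 20), (1, 26, 21)], [(0, 50, 40), (1, 66, 42)]]
    print(propose_compositions(imgs))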


Movies and Books

One of our most recent interests is parsing movies and books. This entails designing semantic representations of both video and text. Our first work on this topic aims to align movies and books with the goal of collecting large-scale, multi-sentence, story-like captions of video clips. We also proposed a new neural representation of sentences, called skip-thoughts, trained unsupervised on a large collection of books (the Book Corpus dataset). Our model represents each sentence as a vector which can be used for, e.g., semantic relatedness, image-sentence ranking, paraphrase detection, and sentiment analysis.
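As a concrete illustration of how such sentence vectors are typically used for semantic relatedness or retrieval, here is a minimal cosine-similarity ranking sketch; the encoder itself is assumed to be a pretrained model (not shown), and the vectors below are random stand-ins.

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def rank_by_relatedness(query_vec, sentence_vecs):
        """Rank candidate sentences by cosine similarity to a query sentence,
        the usual way fixed-length sentence vectors (e.g. skip-thoughts) are
        used for semantic relatedness and retrieval."""
        sims = [cosine(query_vec, v) for v in sentence_vecs]
        return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)

    # toy usage with random stand-in vectors
    rng = np.random.default_rng(0)
    query, candidates = rng.normal(size=64), rng.normal(size=(5, 64))
    print(rank_by_relatedness(query, candidates))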

Relevant Publications



  • Skip-Thought Vectors

    Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    arXiv preprint arXiv:1506.06726, 2015

    Sent2vec neural representation trained on 11K books

    Paper  Abstract  Code  Bibtex

    @inproceedings{moviebook,
    title = {Skip-Thought Vectors},
    author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {arXiv preprint arXiv:1506.06726},
    year = {2015}
    }

    coming soon


  • Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

    Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

    arXiv preprint arXiv:1506.06724, 2015

    Aligning movies and books for story-like captioning

    Paper  Abstract  Project page  Bibtex

    @inproceedings{moviebook,
    title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
    author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
    booktitle = {arXiv preprint arXiv:1506.06724},
    year = {2015}
    }

    coming soon


Scene Understanding

3D scene understanding requires reasoning about multiple related tasks: (3D) object detection and segmentation, relationships between objects (e.g., support), scene-type prediction, as well as inferring the structure of the scene itself (e.g., the ground plane in outdoor scenarios and the room layout indoors). Our work focuses on designing holistic models that reason jointly about these related sub-tasks, and as such outperform the individual modules.

Our most recent work in this domain aims to reconstruct apartments in 3D from rental ads, for an enhanced viewing experience. This entails localizing each photo within the floor plan by exploiting semantic, scene and geometric cues. Renting will never be the same again. ;)

Relevant Publications


Indoor Scene Understanding (Monocular)

  • Rent3D: Floor-Plan Priors for Monocular Layout Estimation               (oral presentation)

    Chenxi Liu, Alex Schwing, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Rent an apartment in 3D!

    Paper  Abstract  Suppl. Mat.  Project page  Bibtex

    @inproceedings{ApartmentsCVPR15,
    title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
    author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2015}}

    The goal of this paper is to enable a 3D "virtual-tour" of an apartment given a small set of monocular images of different rooms, as well as a 2D floor plan. We frame the problem as the one of inference in a Markov random field which reasons about the layout of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example, about in which room the picture was taken. What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge. In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information can significantly help in resolving the challenging room-apartment alignment problem. We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we exploit integral geometry as well as our new bounds on the aspect ratio of rooms which allow us to carve the space, reducing significantly the number of physically possible configurations. We demonstrate the effectiveness of our approach in a new dataset which contains over 200 apartments.


    [talk]  [slides]

  • Box In the Box: Joint 3D Layout and Object Reasoning from Single Images

    Alex Schwing, Sanja Fidler, Marc Pollefeys, Raquel Urtasun

    In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

    Parallel and improved implementation of Structured SVMs available

    Paper  Abstract  Learning code  Bibtex

    @inproceedings{SchwingICCV13,
    author = {Alex Schwing and Sanja Fidler and Marc Pollefeys and Raquel Urtasun},
    title = {Box In the Box: Joint 3D Layout and Object Reasoning from Single Images},
    booktitle = {ICCV},
    year = {2013}
    }

    In this paper we propose an approach to jointly infer the room layout as well as the objects present in the scene. Towards this goal, we propose a branch and bound algorithm which is guaranteed to retrieve the global optimum of the joint problem. The main difficulty resides in taking into account occlusion in order to not over-count the evidence. We introduce a new decomposition method, which generalizes integral geometry to triangular shapes, and allows us to bound the different terms in constant time. We exploit both geometric cues and object detectors as image features and show large improvements in 2D and 3D object detection over state-of-the-art deformable part-based models.


Indoor Scene Understanding (RGB-D)

  • Generating Multi-Sentence Lingual Descriptions of Indoor Scenes     (oral presentation)

    Dahua Lin, Chen Kong, Sanja Fidler, Raquel Urtasun

    In British Machine Vision Conference (BMVC), To appear, 2015

    Paper  Abstract  Bibtex

    @inproceedings{Lin15,
    title = {Generating Multi-Sentence Lingual Descriptions of Indoor Scenes},
    author = {Dahua Lin and Chen Kong and Sanja Fidler and Raquel Urtasun},
    booktitle = {arXiv:1503.00064},
    year = {2015}}

    This paper proposes a novel framework for generating lingual descriptions of indoor scenes. While substantial efforts have been made to tackle this problem, previous approaches focus primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROUGE scores compared to those produced by the baseline.

  • What are you talking about? Text-to-Image Coreference

    Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    Exploits text for visual parsing and aligns nouns to objects. Code and data out soon!

    Paper  Abstract  Bibtex

    @inproceedings{KongCVPR14,
    title = {What are you talking about? Text-to-Image Coreference},
    author = {Chen Kong and Dahua Lin and Mohit Bansal and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2014}
    }

    In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system.

  • Holistic Scene Understanding for 3D Object Detection with RGBD cameras           (oral presentation)

    Dahua Lin, Sanja Fidler, Raquel Urtasun

    In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

    Code, models and ground-truth cuboids for NYU-v2 in Project page

    Paper  Abstract  Project page  Talk slides  Bibtex

    @inproceedings{LinICCV13,
    author = {Dahua Lin and Sanja Fidler and Raquel Urtasun},
    title = {Holistic Scene Understanding for 3D Object Detection with RGBD cameras},
    booktitle = {ICCV},
    year = {2013}
    }

    In this paper, we tackle the problem of indoor scene understanding using RGBD data. Towards this goal, we propose a holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects. Specifically, we extend the CPMC framework to 3D in order to generate candidate cuboids, and develop a conditional random field to integrate information from different sources to classify the cuboids. With this formulation, scene classification and 3D object recognition are coupled and can be jointly solved through probabilistic inference. We test the effectiveness of our approach on the challenging NYU v2 dataset. The experimental results demonstrate that through effective evidence integration and holistic reasoning, our approach achieves substantial improvement over the state-of-the-art.


Outdoor Scene Understanding (Monocular)

  • Holistic 3D Scene Understanding from a Single Geo-tagged Image   (oral presentation)

    Shenlong Wang, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Exploiting map priors for segmentation and monocular depth estimation

    Paper  Abstract  Project page  Bibtex

    @inproceedings{WangCVPR15,
    title = {Holistic 3D Scene Understanding from a Single Geo-tagged Image},
    author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2015}}

    In this paper we are interested in exploiting geographic priors to help outdoor scene understanding. Towards this goal we propose a holistic approach that reasons jointly about 3D object detection, pose estimation, semantic segmentation as well as depth reconstruction from a single image. Our approach takes advantage of large-scale crowdsourced maps to generate dense geographic, geometric and semantic priors by rendering the 3D world. We demonstrate the effectiveness of our holistic model on the challenging KITTI dataset, and show significant improvements over the baselines in all metrics and tasks.


    [talk]  [slides]

2D Scene Understanding

  • Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding

    Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh

    Transactions on Pattern Analysis and Machine Intelligence (TPAMI), To appear 2015

    Paper  Abstract  Suppl. Mat.  Bibtex

    @article{MottaghiPAMI15,
    title = {Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding},
    author = {Roozbeh Mottaghi and Sanja Fidler and Alan Yuille and Raquel Urtasun and Devi Parikh},
    journal = {Trans. on Pattern Analysis and Machine Intelligence},
    year = {2015}
    }

    Recent trends in image understanding have pushed for scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we "plug-in" human subjects for each of the various components in a conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve scene understanding by focusing research efforts on various individual tasks.

  • A Sentence is Worth a Thousand Pixels

    Sanja Fidler, Abhishek Sharma, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    Reasons about object detection, segmentation, scene-type and sentence descriptions to improve image parsing.

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{FidlerCVPR13,
    author = {Sanja Fidler and Abhishek Sharma and Raquel Urtasun},
    title = {A Sentence is Worth a Thousand Pixels},
    booktitle = {CVPR},
    year = {2013}
    }

    We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models.

  • Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation

    Jian Yao, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012

    Code, trained models and annotated bounding boxes for MSRC in Project page

    Paper  Abstract  Project page.  Bibtex

    @inproceedings{YaoCVPR12,
    author = {Jian Yao and Sanja Fidler and Raquel Urtasun},
    title = {Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation},
    booktitle = {CVPR},
    year = {2012}
    }

    In this paper we propose an approach to holistic scene understanding that reasons jointly about regions, location, class and spatial extent of objects, presence of a class in the image, as well as the scene type. Learning and inference in our model are efficient as we reason at the segment level, and introduce auxiliary variables that allow us to decompose the inherent high-order potentials into pairwise potentials between a few variables with small number of states (at most the number of classes). Inference is done via a convergent message-passing algorithm, which, unlike graph-cuts inference, has no submodularity restrictions and does not require potential specific moves. We believe this is very important, as it allows us to encode our ideas and prior knowledge about the problem without the need to change the inference engine every time we introduce a new potential. Our approach outperforms the state-of-the-art on the MSRC-21 benchmark, while being much faster. Importantly, our holistic model is able to improve performance in all tasks.


Clothing and Fashion

"The finest clothing made is a person's skin, but, of course, society demands something more than this.''

- Mark Twain   

Fashion has a tremendous impact on our society. Clothing typically reflects a person's social status, and thus creates pressure to dress to fit a particular occasion. Its importance becomes even more pronounced with online social sites like Facebook and Instagram, where one's photographs are shared with the world. We also live in a technological era where a significant portion of the population looks for their dream partner on online dating sites. People want to look good; business or casual, elegant or sporty, sexy but not slutty, and of course trendy, particularly so when putting their picture online. This is reflected in growing online retail sales, projected to reach 370 billion dollars in the US by 2017 and 191 billion euros in Europe, according to Forbes magazine (2013).

Our goals include parsing clothing from a photo, as well as predicting how fashionable a person looks in a particular photograph. Moreover, we aim to give rich feedback to the user: not only whether the photograph is appealing or not, but also suggestions of which clothing, or even scenery, the user could change in order to improve her/his look.

Relevant Publications


  • Neuroaesthetics in Fashion: Modeling the Perception of Beauty

    Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    How fashionable do you look in a photo? And how can you improve?

    Paper  Abstract  Project page  Bibtex

    @inproceedings{SimoCVPR15,
    title = {Neuroaesthetics in Fashion: Modeling the Perception of Beauty},
    author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2015}
    }

    In this paper, we analyze the fashion of clothing of a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback back to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.





    International news


    Vogue (Spain) Stylebook (Germany) Ansa (Italy) CenarioMT (Brazil) Amsterdam Fashion (NL)
    Marie Claire (France) Fashion Police (Nigeria) Nauka (Poland) Pluska (Slovakia) Pressetext (Austria)
    Wired (Germany) Jetzt (Germany) La Gazzetta (Italy) PopSugar (Australia) SinEmbargo (Mexico)

    A more complete list is maintained on our project webpage.


  • A High Performance CRF Model for Clothes Parsing

    Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

    In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

    Significant performance gain over state-of-the-art. Code, features & models in Project page!

    Paper  Abstract  Project page  Bibtex

     

    @inproceedings{SimoACCV14,
    title = {A High Performance CRF Model for Clothes Parsing},
    author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
    booktitle = {ACCV},
    year = {2014}}

    In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset [Yamaguchi et al., CVPR'12] and show that we can obtain a significant improvement over the state-of-the-art.


Vision and Language

A successful robotic platform needs to understand both the visual world and the user's (lingual) instructions, and to communicate its understanding back to the user in a natural way. One of my main scientific interests lies in the integration of vision and language in order to develop high-performance autonomous systems that can interact with humans through language. Such solutions are of particular importance for the blind or visually impaired, for whom language is one of the few available means of human-robot interaction.

Relevant Publications


Representation of Text


  • Skip-Thought Vectors

    Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler

    arXiv preprint arXiv:1506.06726, 2015

    Sent2vec neural representation trained on 11K books

    Paper  Abstract  Code  Bibtex

    @inproceedings{moviebook,
    title = {Skip-Thought Vectors},
    author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {arXiv preprint arXiv:1506.06726},
    year = {2015}
    }

    coming soon


Vision-Text Alignment and Retrieval


  • Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

    Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

    arXiv preprint arXiv:1506.06724, 2015

    Aligning movies and books for story-like captioning

    Paper  Abstract  Project page  Bibtex

    @inproceedings{moviebook,
    title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
    author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
    booktitle = {arXiv preprint arXiv:1506.06724},
    year = {2015}
    }

    coming soon

  • What are you talking about? Text-to-Image Coreference

    Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    Exploits text for visual parsing and aligns nouns to objects. Code and data out soon!

    Paper  Abstract  Bibtex

    @inproceedings{KongCVPR14,
    title = {What are you talking about? Text-to-Image Coreference},
    author = {Chen Kong and Dahua Lin and Mohit Bansal and Raquel Urtasun and Sanja Fidler},
    booktitle = {CVPR},
    year = {2014}
    }

    In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system.

  • Visual Semantic Search: Retrieving Videos via Complex Textual Queries

    Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    Video retrieval when a query is a longer sentence or a multi-sentence description

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{LinCVPR14,
    author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
    title = {Visual Semantic Search: Retrieving Videos via Complex Textual Queries},
    booktitle = {CVPR},
    year = {2014}
    }

    In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.
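    The core matching step in this retrieval approach assigns query nouns to detected object tracks. Below is a simplified sketch using a standard linear assignment (SciPy) in place of the generalized bipartite matching and structure-prediction weights described in the paper; the affinity matrix is a made-up example.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_query_to_tracks(affinity):
        """Match query nouns to object tracks by maximizing total affinity.
        affinity : (num_query_nouns, num_tracks) compatibility scores."""
        rows, cols = linear_sum_assignment(-affinity)  # negate: the solver minimizes cost
        return list(zip(rows.tolist(), cols.tolist()))

    # toy usage: 2 query nouns, 3 candidate tracks
    aff = np.array([[0.9, 0.1, 0.3],
                    [0.2, 0.2, 0.8]])
    print(match_query_to_tracks(aff))  # -> [(0, 0), (1, 2)]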


Exploiting Text for Image Parsing

  • Neuroaesthetics in Fashion: Modeling the Perception of Beauty

    Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    How fashionable do you look in a photo? And how can you improve? This paper exploits image information as well as tags and user's comments to predict fashionability.

    Paper  Abstract  Project page  Bibtex

    @inproceedings{SimoCVPR15,
    title = {Neuroaesthetics in Fashion: Modeling the Perception of Beauty},
    author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2015}
    }

    In this paper, we analyze the fashion of clothing of a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback back to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.





    International news


    Vogue (Spain) Stylebook (Germany) Ansa (Italy) CenarioMT (Brazil) Amsterdam Fashion (NL)
    Marie Claire (France) Fashion Police (Nigeria) Nauka (Poland) Pluska (Slovakia) Pressetext (Austria)
    Wired (Germany) Jetzt (Germany) La Gazzetta (Italy) PopSugar (Australia) SinEmbargo (Mexico)

    A more complete list is maintained on our project webpage.


  • A Sentence is Worth a Thousand Pixels

    Sanja Fidler, Abhishek Sharma, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    Exploits textual descriptions of images to improve visual parsing

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{FidlerCVPR13,
    author = {Sanja Fidler and Abhishek Sharma and Raquel Urtasun},
    title = {A Sentence is Worth a Thousand Pixels},
    booktitle = {CVPR},
    year = {2013}
    }

    We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models.


Learning Visual Models from Text

  • Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions

    Jimmy Ba, Kevin Swersky, Sanja Fidler, Ruslan Salakhutdinov

    arXiv preprint arXiv:1506.00511, 2015

    Classification of unseen categories from their textual description (Wiki articles)

    Paper  Abstract  Bibtex

    @inproceedings{BaZeroShot15,
    title = {Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions},
    author = {Jimmy Ba and Kevin Swersky and Sanja Fidler and Ruslan Salakhutdinov},
    booktitle = {arXiv preprint arXiv:1506.00511},
    year = {2015}
    }

    One of the main challenges in Zero-Shot Learning of visual categories is gathering semantic attributes to accompany images. Recent work has shown that learning from textual descriptions, such as Wikipedia articles, avoids the problem of having to explicitly define these attributes. We present a new model that can classify unseen categories from their textual description. Specifically, we use text features to predict the output weights of both the convolutional and the fully connected layers in a deep convolutional neural network (CNN). We take advantage of the architecture of CNNs and learn features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches. The proposed model also allows us to automatically generate a list of pseudo-attributes for each visual category consisting of words from Wikipedia articles. We train our models end-to-end using the Caltech-UCSD bird and flower datasets and evaluate both ROC and Precision-Recall curves. Our empirical results show that the proposed model significantly outperforms previous methods.
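    The key mechanism in this zero-shot model is predicting a classifier's weights directly from a class's text description. The following is a deliberately simplified linear sketch of that idea with random stand-in features; the paper itself predicts the weights of convolutional and fully connected CNN layers with a learned text model rather than a single matrix.

    import numpy as np

    def zero_shot_scores(img_feat, text_feats, M):
        """Score an image against unseen classes whose classifier weights are
        predicted from their text embeddings.

        img_feat   : (d_img,)  image feature, e.g. from a CNN
        text_feats : (C, d_txt) one text embedding per unseen class
        M          : (d_txt, d_img) learned mapping from text to classifier weights
        """
        W = text_feats @ M   # (C, d_img): predicted per-class weight vectors
        return W @ img_feat  # (C,): one score per unseen class

    # toy usage with random stand-ins for all learned quantities
    rng = np.random.default_rng(1)
    scores = zero_shot_scores(rng.normal(size=128),
                              rng.normal(size=(4, 300)),
                              rng.normal(size=(300, 128)))
    print(scores.argmax())  # index of the best-matching unseen class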


Caption Generation

  • Generating Multi-Sentence Lingual Descriptions of Indoor Scenes     (oral presentation)

    Dahua Lin, Chen Kong, Sanja Fidler, Raquel Urtasun

    In British Machine Vision Conference (BMVC), To appear, 2015

    Paper  Abstract  Bibtex

    @inproceedings{Lin15,
    title = {Generating Multi-Sentence Lingual Descriptions of Indoor Scenes},
    author = {Dahua Lin and Chen Kong and Sanja Fidler and Raquel Urtasun},
    booktitle = {arXiv:1503.00064},
    year = {2015}}

    This paper proposes a novel framework for generating lingual descriptions of indoor scenes. While substantial efforts have been made to tackle this problem, previous approaches focus primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROUGE scores compared to those produced by the baseline.

  • Video In Sentences Out           (oral presentation)

    Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, Zhiqi Zhang

    In Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, 2012

    Paper  Abstract  Project page  Bibtex

    @inproceedings{BarbuUAI12,
    author = {Andrei Barbu and Alexander Bridge and Zachary Burchill and Dan Coroian and Sven Dickinson and Sanja Fidler and Aaron Michaux and Sam Mussman and Siddharth Narayanaswamy and Dhaval Salvi and Lara Schmidt and Jiangnan Shangguan and Jeffrey Mark Siskind and Jarrell Waggoner and Song Wang and Jinlian Wei and Yifan Yin and Zhiqi Zhang},
    title = {Video In Sentences Out},
    booktitle = {UAI},
    year = {2012}
    }

    We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.
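    To make the rendering step concrete, here is a toy template in the spirit of the mapping described above (action class to verb, participants to noun phrases, location to a prepositional phrase); the event dictionary and field names are invented for illustration and are not the system's actual interface.

    def render_sentence(event):
        """Render a recognized event as a sentence: verb for the action class,
        noun phrases for participants, a prepositional phrase for location."""
        agent = " ".join(filter(None, [event.get("agent_attr"), event["agent"]]))
        patient = " ".join(filter(None, [event.get("patient_attr"), event.get("patient")]))
        parts = [f"The {agent} {event['verb']}"]
        if patient:
            parts.append(f"the {patient}")
        if event.get("location"):
            parts.append(f"near the {event['location']}")
        return " ".join(parts) + "."

    # toy usage
    print(render_sentence({"agent": "person", "agent_attr": "tall",
                           "verb": "picked up", "patient": "ball",
                           "patient_attr": "red", "location": "bench"}))
    # -> "The tall person picked up the red ball near the bench."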


Word-Sense Disambiguation

  • Unsupervised Disambiguation of Image Captions

    Wesley May, Sanja Fidler, Afsaneh Fazly, Suzanne Stevenson, Sven Dickinson

    First Joint Conference on Lexical and Computational Semantics (*SEM), 2012

    Paper  Abstract  Bibtex

    @inproceedings{MaySEM12,
    author = {Wesley May and Sanja Fidler and Afsaneh Fazly and Suzanne Stevenson and Sven Dickinson},
    title = {Unsupervised Disambiguation of Image Captions},
    booktitle = {First Joint Conference on Lexical and Computational Semantics (*SEM)},
    year = {2012}
    }

    Given a set of images with related captions, our goal is to show how visual features can improve the accuracy of unsupervised word sense disambiguation when the textual context is very small, as this sort of data is common in news and social media. We extend previous work in unsupervised text-only disambiguation with methods that integrate text and images. We construct a corpus by using Amazon Mechanical Turk to caption sense-tagged images gathered from ImageNet. Using a Yarowsky-inspired algorithm, we show that gains can be made over text-only disambiguation, as well as multimodal approaches such as Latent Dirichlet Allocation.
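    For intuition, here is a minimal sketch of a Yarowsky-style bootstrapping loop over joint text+image features: a few seed-labeled instances per sense are grown by repeatedly labeling the most confident unlabeled instances. It is a generic self-training loop, not the paper's algorithm, and all inputs are toy stand-ins.

    import numpy as np

    def yarowsky_self_train(feats, seed_labels, n_iters=10, thresh=0.8):
        """Bootstrap sense labels from seeds via nearest-centroid confidence.

        feats       : (N, d) features (e.g., concatenated text and visual cues)
        seed_labels : (N,) sense ids for seed instances, -1 for unlabeled
        """
        labels = seed_labels.copy()
        classes = np.unique(labels[labels >= 0])
        for _ in range(n_iters):
            centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
            sims = (feats @ centroids.T) / (
                np.linalg.norm(feats, axis=1, keepdims=True)
                * np.linalg.norm(centroids, axis=1) + 1e-9)
            conf, best = sims.max(axis=1), sims.argmax(axis=1)
            newly = (labels == -1) & (conf > thresh)
            if not newly.any():
                break
            labels[newly] = classes[best[newly]]
        return labels

    # toy usage: two senses in a 2D feature space, one seed per sense
    X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95], [0.8, 0.2], [0.2, 0.9]])
    y = np.array([0, -1, 1, -1, -1, -1])
    print(yarowsky_self_train(X, y))  # every instance ends up labeled 0 or 1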


Segmentation

Image segmentation takes on many forms, ranging from low-level grouping of pixels into superpixels, to grouping (super)pixels into object proposals, to image labeling. Each task is important in its own way: grouping pixels into superpixels efficiently subserves high-level semantic tasks since it reduces the complexity of the input; generating a small set of object proposals facilitates object detection as it prunes the exhaustive set of possible windows down to a small number of plausible candidates; and image labeling tries to assign a semantic label to each (super)pixel. Our work spans all three of these areas, providing a link between low-level image information and the higher-level reasoning in the other domains of our research.
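As a minimal, runnable illustration of the lowest-level grouping step (not the specific algorithms from the publications below), the snippet oversegments an image into SLIC superpixels with scikit-image and computes a cheap per-superpixel color feature that higher-level grouping or labeling could build on.

    import numpy as np
    from skimage import data, segmentation

    # Oversegment an example image into superpixels with SLIC.
    image = data.astronaut()
    labels = segmentation.slic(image, n_segments=200, compactness=10, start_label=1)
    print("number of superpixels:", labels.max())

    # Average color per superpixel: a typical cheap feature for later grouping/labeling.
    avg_color = np.stack([image[labels == l].mean(axis=0) for l in np.unique(labels)])
    print(avg_color.shape)  # (num_superpixels, 3)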

Relevant Publications


Superpixels and Superpixel Grouping

  • Real-Time Coarse-to-fine Topologically Preserving Segmentation

    Jian Yao, Marko Boben, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

    Paper  Abstract  Bibtex

    @inproceedings{YaoCVPR15,
    title = {Real-Time Coarse-to-fine Topologically Preserving Segmentation},
    author = {Jian Yao and Marko Boben and Sanja Fidler and Raquel Urtasun},
    booktitle = {CVPR},
    year = {2015}
    }

    In this paper, we tackle the problem of unsupervised segmentation in the form of superpixels. Our main emphasis is on speed and accuracy. We build on [Yamaguchi et al., ECCV'14] to define the problem as a boundary- and topology-preserving Markov random field. We propose a coarse-to-fine optimization technique that speeds up inference in terms of the number of updates by an order of magnitude. Our approach is shown to outperform [Yamaguchi et al., ECCV'14] while employing a single iteration. We evaluate and compare our approach to state-of-the-art superpixel algorithms on the BSD and KITTI benchmarks. Our approach significantly outperforms the baselines in the segmentation metrics and achieves the lowest error on the stereo task.
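
    The following sketch only illustrates the coarse-to-fine scheduling idea in the abstract above, not the paper's actual boundary- and topology-preserving MRF: labels are fixed on a coarse grid first, and only cells lying on a label boundary need to be revisited at the finer level, which is where the reduction in the number of updates comes from. The grid size and block layout are made up for illustration.

    import numpy as np

    def boundary_mask(labels):
        """True for grid cells whose label differs from a 4-neighbour."""
        m = np.zeros_like(labels, dtype=bool)
        m[:-1, :] |= labels[:-1, :] != labels[1:, :]
        m[1:, :]  |= labels[1:, :]  != labels[:-1, :]
        m[:, :-1] |= labels[:, :-1] != labels[:, 1:]
        m[:, 1:]  |= labels[:, 1:]  != labels[:, :-1]
        return m

    # A 32x32 grid pre-labeled into 4x4 coarse blocks of 8x8 cells each.
    coarse = np.repeat(np.repeat(np.arange(16).reshape(4, 4), 8, axis=0), 8, axis=1)
    updates_naive = coarse.size                       # revisit every cell
    updates_c2f   = int(boundary_mask(coarse).sum())  # revisit boundary cells only
    print(updates_naive, "vs", updates_c2f)           # far fewer fine-level updates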

  • A Framework for Symmetric Part Detection in Cluttered Scenes

    Tom Lee, Sanja Fidler, Alex Levinshtein, Cristian Sminchisescu, Sven Dickinson

    Symmetry, Vol. 7, 2015, pp 1333-1351

    Paper  Abstract  Bibtex

    @article{LeeSymmetry2015,
    title = {A Framework for Symmetric Part Detection in Cluttered Scenes},
    author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Cristian Sminchisescu and Sven Dickinson},
    journal = {Symmetry},
    volume = {7},
    pages = {1333-1351},
    year = {2015}
    }

    The role of symmetry in computer vision has waxed and waned in importance during the evolution of the field from its earliest days. At first figuring prominently in support of bottom-up indexing, it fell out of favour as shape gave way to appearance and recognition gave way to detection. With a strong prior in the form of a target object, the role of the weaker priors offered by perceptual grouping was greatly diminished. However, as the field returns to the problem of recognition from a large database, the bottom-up recovery of the parts that make up the objects in a cluttered scene is critical for their recognition. The medial axis community has long exploited the ubiquitous regularity of symmetry as a basis for the decomposition of a closed contour into medial parts. However, today's recognition systems are faced with cluttered scenes and the assumption that a closed contour exists, i.e., that figure-ground segmentation has been solved, rendering much of the medial axis community's work inapplicable. In this article, we review a computational framework, previously reported in [Lee et al., ICCV'13, Levinshtein et al., ICCV'09, Levinshtein et al., IJCV'13], that bridges the representation power of the medial axis and the need to recover and group an object's parts in a cluttered scene. Our framework is rooted in the idea that a maximally-inscribed disc, the building block of a medial axis, can be modelled as a compact superpixel in the image. We evaluate the method on images of cluttered scenes.

  • Detecting Curved Symmetric Parts using a Deformable Disc Model

    Tom Lee, Sanja Fidler, Sven Dickinson

    In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

    Paper  Abstract  Project page  Bibtex

    @inproceedings{LeeICCV13,
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    title = {Detecting Curved Symmetric Parts using a Deformable Disc Model},
    booktitle = {ICCV},
    year = {2013}
    }

    Symmetry is a powerful shape regularity that's been exploited by perceptual grouping researchers in both human and computer vision to recover part structure from an image without a priori knowledge of scene content. Drawing on the concept of a medial axis, defined as the locus of centers of maximal inscribed discs that sweep out a symmetric part, we model part recovery as the search for a sequence of deformable maximal inscribed disc hypotheses generated from a multiscale superpixel segmentation, a framework proposed by [Levinshtein et al., ICCV'09]. However, we learn affinities between adjacent superpixels in a space that's invariant to bending and tapering along the symmetry axis, enabling us to capture a wider class of symmetric parts. Moreover, we introduce a global cost that perceptually integrates the hypothesis space by combining a pairwise and a higher-level smoothing term, which we minimize globally using dynamic programming. The new framework is demonstrated on two datasets, and is shown to significantly outperform the baseline [Levinshtein et al., ICCV'09].
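
    As a rough sketch of the dynamic-programming step mentioned above (and only that step: the superpixel affinities, the bending/tapering-invariant space and the higher-level smoothing term are not reproduced), the code below finds the minimum-cost chain of hypotheses along a part with a standard Viterbi-style recursion over invented unary and pairwise costs.

    import numpy as np

    def best_chain(unary, pairwise):
        """unary: (T, K) cost of picking hypothesis k at step t along the part;
        pairwise: (T-1, K, K) cost of consecutive picks. Returns min-cost sequence."""
        T, K = unary.shape
        cost = unary[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            total = cost[:, None] + pairwise[t - 1] + unary[t][None, :]  # (prev, cur)
            back[t] = np.argmin(total, axis=0)
            cost = np.min(total, axis=0)
        seq = [int(np.argmin(cost))]
        for t in range(T - 1, 0, -1):          # backtrack the optimal hypotheses
            seq.append(int(back[t][seq[-1]]))
        return seq[::-1], float(np.min(cost))

    rng = np.random.default_rng(0)
    u = rng.random((5, 4))             # 5 steps along the part, 4 disc hypotheses each
    p = rng.random((4, 4, 4))          # invented transition costs between hypotheses
    print(best_chain(u, p))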


Object Proposals

  • Multi-cue Mid-level Grouping

    Tom Lee, Sanja Fidler, Sven Dickinson

    In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

    Paper  Abstract  Bibtex

    @inproceedings{LeeACCV14,
    title = {Multi-cue mid-level grouping},
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    booktitle = {ACCV},
    year = {2014}
    }

    Region proposal methods provide richer object hypotheses than sliding windows with dramatically fewer proposals, yet they still number in the thousands. This large quantity of proposals typically results from a diversification step that propagates bottom-up ambiguity in the form of proposals to the next processing stage. In this paper, we take a complementary approach in which mid-level knowledge is used to resolve bottom-up ambiguity at an earlier stage to allow a further reduction in the number of proposals. We present a method for generating regions using the mid-level grouping cues of closure and symmetry. In doing so, we combine mid-level cues that are typically used only in isolation, and leverage them to produce fewer but higher quality proposals. We emphasize that our model is mid-level by learning it on a limited number of objects while applying it to different objects, thus demonstrating that it is transferable to other objects. In our quantitative evaluation, we 1) establish the usefulness of each grouping cue by demonstrating incremental improvement, and 2) demonstrate improvement on two leading region proposal methods with a limited budget of proposals.
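
    The sketch below only illustrates the cue-combination idea in the abstract above: each candidate region receives a closure score and a symmetry score, and a weighted sum ranks candidates so that only a small budget survives. The scores, weights and budget are invented placeholders rather than the paper's learned model.

    import numpy as np

    def rank_proposals(closure, symmetry, weights=(0.6, 0.4), budget=3):
        """closure, symmetry: per-proposal cue scores in [0, 1]."""
        combined = weights[0] * np.asarray(closure) + weights[1] * np.asarray(symmetry)
        order = np.argsort(-combined)          # best first
        return order[:budget], combined[order[:budget]]

    closure  = [0.9, 0.2, 0.7, 0.4, 0.8]
    symmetry = [0.6, 0.9, 0.8, 0.1, 0.3]
    print(rank_proposals(closure, symmetry))   # indices and scores of the kept proposals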


Image Labeling

  • Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding

    Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh

    Transactions on Pattern Analysis and Machine Intelligence (TPAMI), to appear, 2015

    Paper  Abstract  Suppl. Mat.  Bibtex

    @article{MottaghiPAMI15,
    title = {Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding},
    author = {Roozbeh Mottaghi and Sanja Fidler and Alan Yuille and Raquel Urtasun and Devi Parikh},
    journal = {Trans. on Pattern Analysis and Machine Intelligence},
    year = {2015}
    }

    Recent trends in image understanding have pushed for scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we "plug-in" human subjects for each of the various components in a conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve scene understanding by focusing research efforts on various individual tasks.
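
    A minimal sketch of the "plug-in" protocol described above, with invented component names, potential values and weights: the model score is a weighted sum of component potentials, and any one component (here the detection term) can be swapped for human responses to estimate how much head room that component offers.

    def crf_score(potentials, weights):
        """potentials/weights: dicts keyed by component name."""
        return sum(weights[name] * value for name, value in potentials.items())

    machine = {"segmentation": 0.55, "detection": 0.40, "scene": 0.70}
    weights = {"segmentation": 1.0, "detection": 0.8, "scene": 0.5}

    hybrid = dict(machine, detection=0.90)   # human subjects answer the detection queries
    print(crf_score(machine, weights), crf_score(hybrid, weights))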

  • Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision

    Liang-Chieh Chen, Sanja Fidler, Alan Yuille, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

    Ground-truth segmentations provided for a subset of KITTI cars in Project page

    Paper  Abstract  Project page  CAD models  Suppl. Mat.  Bibtex

    @inproceedings{ChenCVPR14,
    author = {Liang-Chieh Chen and Sanja Fidler and Alan Yuille and Raquel Urtasun},
    title = {Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision},
    booktitle = {CVPR},
    year = {2014}
    }

    Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and a budget of tens or hundreds of thousands of dollars. Thus, developing solutions that can automatically perform the labeling given only weak supervision is key to reducing this cost. In this paper, we show how to exploit 3D information to automatically generate very accurate object segmentations given annotated 3D bounding boxes. We formulate the problem as one of inference in a binary Markov random field which exploits appearance models, stereo and/or noisy point clouds, a repository of 3D CAD models as well as topological constraints. We demonstrate the effectiveness of our approach in the context of autonomous driving, and show that we can segment cars with an accuracy of 86% intersection-over-union, performing as well as highly recommended MTurkers!
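
    The 86% figure quoted above is an intersection-over-union score; for reference, a minimal computation of that metric on binary segmentation masks looks as follows (the toy masks are placeholders).

    import numpy as np

    def iou(pred, gt):
        """pred, gt: boolean masks of the same shape."""
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        return inter / union if union else 1.0

    pred = np.zeros((10, 10), bool); pred[2:8, 2:8] = True
    gt   = np.zeros((10, 10), bool); gt[3:9, 3:9] = True
    print(round(iou(pred, gt), 3))   # overlap of two shifted 6x6 boxes: 25 / 47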

  • A Sentence is Worth a Thousand Pixels

    Sanja Fidler, Abhishek Sharma, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    Reasons about object detection, segmentation, scene-type and sentence descriptions to improve image parsing.

    Paper  Abstract  Suppl. Mat.  Bibtex

    @inproceedings{FidlerCVPR13,
    author = {Sanja Fidler and Abhishek Sharma and Raquel Urtasun},
    title = {A Sentence is Worth a Thousand Pixels},
    booktitle = {CVPR},
    year = {2013}
    }

    We are interested in holistic scene understanding where images are accompanied by text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach on the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual-only model and detection improvements of 5% AP over deformable part-based models.
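
    The sketch below is a heavily simplified illustration of the text-driven re-ranking mentioned above: detections whose class is mentioned in the caption get a score boost before ranking. The word matching, class names and boost value are placeholders; the paper's sentence parsing and CRF potentials are not reproduced.

    def rerank(detections, sentence, boost=0.2):
        """detections: list of (class_name, score); sentence: raw caption text."""
        words = set(sentence.lower().split())
        return sorted(((cls, score + boost) if cls in words else (cls, score)
                       for cls, score in detections),
                      key=lambda d: -d[1])

    dets = [("dog", 0.55), ("sofa", 0.60), ("cat", 0.30)]
    print(rerank(dets, "A dog is sleeping next to a cat"))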

  • Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation

    Jian Yao, Sanja Fidler, Raquel Urtasun

    In Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012

    Code, trained models and annotated bounding boxes for MSRC in Project page

    Paper  Abstract  Project page  Bibtex

    @inproceedings{YaoCVPR12,
    author = {Jian Yao and Sanja Fidler and Raquel Urtasun},
    title = {Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation},
    booktitle = {CVPR},
    year = {2012}
    }

    In this paper we propose an approach to holistic scene understanding that reasons jointly about regions, location, class and spatial extent of objects, presence of a class in the image, as well as the scene type. Learning and inference in our model are efficient as we reason at the segment level, and introduce auxiliary variables that allow us to decompose the inherent high-order potentials into pairwise potentials between a few variables with a small number of states (at most the number of classes). Inference is done via a convergent message-passing algorithm, which, unlike graph-cuts inference, has no submodularity restrictions and does not require potential-specific moves. We believe this is very important, as it allows us to encode our ideas and prior knowledge about the problem without the need to change the inference engine every time we introduce a new potential. Our approach outperforms the state-of-the-art on the MSRC-21 benchmark, while being much faster. Importantly, our holistic model is able to improve performance in all tasks.
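
    The brute-force check below illustrates, for a toy case with invented costs, the kind of auxiliary-variable construction referred to above: a high-order "class presence" cost over all segments is rewritten as one extra binary variable plus pairwise consistency terms, and minimizing over that variable recovers exactly the original potential. The paper's full set of potentials and its message-passing inference are not reproduced.

    from itertools import product

    def high_order(x, k, c):
        """Cost c if class k appears anywhere in the segment labeling x, else 0."""
        return c if k in x else 0.0

    def decomposed(x, k, c, big=1e6):
        """Same potential via an auxiliary presence bit b and pairwise terms,
        minimized over b (exact for c >= 0)."""
        def energy(b):
            pair = sum(big for xi in x if xi == k and b == 0)  # any x_i = k forces b = 1
            return c * b + pair
        return min(energy(0), energy(1))

    k, c, n_classes, n_segs = 1, 0.7, 3, 4
    assert all(abs(high_order(x, k, c) - decomposed(x, k, c)) < 1e-9
               for x in product(range(n_classes), repeat=n_segs))
    print("pairwise decomposition matches the high-order potential")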
