Full list also on Google Scholar

Year 2018



Semantic Understanding of Scenes Through the ADE20K Dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, Antonio Torralba

International Journal of Computer Vision (IJCV)

@article{ADEIJCV,
title = {Semantic Understanding of Scenes Through the ADE20K Dataset},
author = {Bolei Zhou and Hang Zhao and Xavier Puig and Tete Xiao and Sanja Fidler and Adela Barriuso and Antonio Torralba},
journal = {IJCV},
year = {2018}
}

Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present ADE20K, a densely annotated dataset spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and re-implement state-of-the-art models as open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that networks trained on ADE20K are able to segment a wide variety of scenes and objects.


A Neural Compositional Paradigm for Image Captioning

Bo Dai, Sanja Fidler, Dahua Lin

In Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018

@inproceedings{Dai18neurips,
title = { A Neural Compositional Paradigm for Image Captioning},
author = {Bo Dai and Sanja Fidler and Dahua Lin},
booktitle = {NeurIPS},
year = {2018}
}

Mainstream captioning models often follow a sequential structure to generate captions, leading to issues such as the introduction of irrelevant semantics, lack of diversity in the generated captions, and inadequate generalization performance. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption through a recursive compositional procedure in a bottom-up manner. Compared to conventional approaches, our paradigm better preserves the semantic content through an explicit factorization of semantics and syntax. With the compositional generation procedure, caption construction follows a recursive structure, which naturally fits the properties of human language. Moreover, the proposed compositional procedure requires less data to train, generalizes better, and yields more diverse captions.


Pose Estimation for Objects with Rotational Symmetry

Enric Corona, Kaustav Kundu, Sanja Fidler

In International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018

@inproceedings{pose2018,
title = {Pose Estimation for Objects with Rotational Symmetry},
author = {Enric Corona and Kaustav Kundu and Sanja Fidler},
booktitle = {IROS},
year = {2018}
}

Pose estimation is a widely explored problem, enabling many robotic tasks such as grasping and manipulation. In this paper, we tackle the problem of pose estimation for objects that exhibit rotational symmetry, which are common in man-made and industrial environments. In particular, our aim is to infer poses for objects not seen at training time, but for which 3D CAD models are available at test time. Previous work has tackled this problem by learning to compare captured views of real objects with rendered views of their 3D CAD models, embedding them in a joint latent space using neural networks. We show that sidestepping the issue of symmetry in this scenario during training leads to poor performance at test time. We propose a model that reasons about rotational symmetry during training by having access to only a small set of symmetry-labeled objects, while exploiting a large collection of unlabeled CAD models. We demonstrate that our approach significantly outperforms a naively trained neural network on a new pose dataset containing images of tools and hardware.


(oral presentation)

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

In European Conference on Computer Vision (ECCV), Munich, Germany, 2018

@inproceedings{Damen2018EPICKITCHENS,
title = {Scaling Egocentric Vision: The EPIC-KITCHENS Dataset},
author = {Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Fidler, Sanja and Furnari, Antonino and Kazakos, Evangelos and Moltisanti, Davide and Munro, Jonathan and Perrett, Toby and Price, Will and Wray, Michael},
booktitle = {ECCV},
year = {2018}
}

First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens.


(spotlight presentation)

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler

In British Machine Vision Conference (BMVC), Newcastle upon Tyne, UK, 2018

@inproceedings{vsepp2018,
title = {VSE++: Improving Visual-Semantic Embeddings with Hard Negatives},
author = {Fartash Faghri and David J. Fleet and Jamie Ryan Kiros and Sanja Fidler},
booktitle = {BMVC},
year = {2018}
}

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. This change, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
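
To make the loss change concrete, below is a minimal NumPy sketch of a max-of-hinges ranking loss that focuses on the hardest negative in a mini-batch, in the spirit of the description above; the function name, the margin value, and the use of cosine similarity are illustrative assumptions, not taken from the paper's released code.

import numpy as np

def hardest_negative_ranking_loss(im_emb, cap_emb, margin=0.2):
    """Ranking loss using only the hardest negative in the batch.

    im_emb, cap_emb: (N, D) L2-normalized embeddings; row i of each is a
    matched image-caption pair. The margin value is illustrative."""
    sims = im_emb @ cap_emb.T              # (N, N) cosine similarities
    pos = np.diag(sims)                    # matched-pair scores
    n = sims.shape[0]
    mask = ~np.eye(n, dtype=bool)          # exclude the positive pair

    # caption retrieval: for each image, the hardest negative caption
    cost_cap = np.clip(margin + sims - pos[:, None], 0, None)
    # image retrieval: for each caption, the hardest negative image
    cost_im = np.clip(margin + sims - pos[None, :], 0, None)

    hardest_cap = np.where(mask, cost_cap, 0).max(axis=1)
    hardest_im = np.where(mask, cost_im, 0).max(axis=0)
    return (hardest_cap + hardest_im).sum()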


Efficient Annotation of Segmentation Datasets with Polygon-RNN++

David Acuna*, Huan Ling*, Amlan Kar*, Sanja Fidler (* denotes equal contribution)

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{PolygonPP2018,
title = {Efficient Annotation of Segmentation Datasets with Polygon-RNN++},
author = {Acuna, David and Ling, Huan and Kar, Amlan and Fidler, Sanja},
booktitle = {CVPR},
year = {2018}
}

Manually labeling datasets with object masks is extremely time consuming. In this work, we follow the idea of Polygon-RNN to produce polygonal annotations of objects interactively, with a human in the loop. We introduce several important improvements to the model: 1) we design a new CNN encoder architecture, 2) we show how to effectively train the model with Reinforcement Learning, and 3) we significantly increase the output resolution using a Graph Neural Network, allowing the model to accurately annotate high-resolution objects in images. Extensive evaluation on the Cityscapes dataset shows that our model, which we refer to as Polygon-RNN++, significantly outperforms the original model in both automatic (10% absolute and 16% relative improvement in mean IoU) and interactive modes (requiring 50% fewer clicks by annotators). We further analyze the cross-domain scenario in which our model is trained on one dataset and used out of the box on datasets from varying domains. The results show that Polygon-RNN++ exhibits powerful generalization capabilities, achieving significant improvements over existing pixel-wise methods. Using simple online fine-tuning we further achieve a large reduction in annotation time for new datasets, moving a step closer towards an interactive annotation tool to be used in practice.


(oral presentation)

VirtualHome: Simulating Household Activities via Programs

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, Antonio Torralba

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{VirtualHome2018,
title = {VirtualHome: Simulating Household Activities via Programs},
author = {Xavier Puig and Kevin Ra and Marko Boben and Jiaman Li and Tingwu Wang and Sanja Fidler and Antonio Torralba},
booktitle = {CVPR},
year = {2018}
}

In this paper, we are interested in modeling complex activities that occur in a typical household. We propose to use programs, i.e., sequences of atomic actions and interactions, as a high-level representation of complex tasks. Programs are interesting because they provide a non-ambiguous representation of a task and allow agents to execute them. However, no existing database provides this type of information. Towards this goal, we first crowd-source programs for a variety of activities that happen in people's homes, via a game-like interface used for teaching kids how to code. Using the collected dataset, we show how we can learn to extract programs directly from natural language descriptions or from videos. We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to ``drive'' an artificial agent to execute tasks in a simulated household environment. Our VirtualHome simulator allows us to create a large activity video dataset with rich ground-truth, enabling training and testing of video understanding models. We further showcase examples of our agent performing tasks in VirtualHome based on language descriptions.


(spotlight presentation)

MovieGraphs: Towards Understanding Human-Centric Situations from Videos

Paul Vicol, Makarand Tapaswi, Lluis Castrejon, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{Moviegraphs2018,
title = {MovieGraphs: Towards Understanding Human-Centric Situations from Videos},
author = {Paul Vicol and Makarand Tapaswi and Lluis Castrejon and Sanja Fidler},
booktitle = {CVPR},
year = {2018}
}

There is growing interest in artificial intelligence to build socially intelligent robots. This requires machines to have the ability to "read" people's emotions, motivations, and other factors that affect behavior. Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (e.g., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions. In addition, most interactions and many attributes are grounded in the video with time stamps. We provide a thorough analysis of our dataset, showing interesting common-sense correlations between different social aspects of scenes, as well as across scenes over time. We propose a method for querying videos and text with graphs, and show that: 1) our graphs contain rich and sufficient information to summarize and localize each scene; and 2) subgraphs allow us to describe situations at an abstract level and retrieve multiple semantically relevant situations. We also propose methods for interaction understanding via ordering, and for reasoning about the social scene. MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents.


(spotlight presentation)

Now You Shake Me: Towards Automatic 4D Cinema

Yuhao Zhou, Makarand Tapaswi, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{Movie4D2018,
title = {Now You Shake Me: Towards Automatic 4D Cinema},
author = {Yuhao Zhou and Makarand Tapaswi and Sanja Fidler},
booktitle = {CVPR},
year = {2018}
}

We are interested in enabling automatic 4D cinema by parsing physical and special effects from untrimmed movies. These include effects such as physical interactions, water splashing, light, and shaking, and are grounded to either a character in the scene or the camera. We collect a new dataset referred to as the Movie4D dataset which annotates over 9K effects in 63 movies. We propose a Conditional Random Field model atop a neural network that brings together visual and audio information, as well as semantics in the form of person tracks. Our model further exploits correlations of effects between different characters in the clip as well as across movie threads. We propose effect detection and classification as two tasks, and present results along with ablation studies on our dataset, paving the way towards 4D cinema in everyone's homes.


Learning to Act Properly: Predicting and Explaining Affordances from Images

Ching-Yao Chuang, Jiaman Li, Antonio Torralba, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{ActProperly2018,
title = {Learning to Act Properly: Predicting and Explaining Affordances from Images},
author = {Chuang, Ching-Yao and Li, Jiaman and Torralba, Antonio and Fidler, Sanja},
booktitle = {CVPR},
year = {2018}
}

We address the problem of affordance reasoning in the diverse scenes that appear in the real world. Affordances relate the agent's actions to their effects when taken on the surrounding objects. In our work, we take the egocentric view of the scene, and aim to reason about action-object affordances that respect both the physical world and the social norms imposed by society. We also aim to teach artificial agents why some actions should not be taken in certain situations, and what would likely happen if these actions were taken. We collect a new dataset that builds upon ADE20K, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning. We propose a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object. Our model is showcased through various ablation studies, pointing to successes and challenges in this complex task.


A Face-to-Face Neural Conversation Model

Hang Chu, Daiqing Li, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{F2F2018,
title = {A Face-to-Face Neural Conversation Model},
author = {Hang Chu and Daiqing Li and Sanja Fidler},
booktitle = {CVPR},
year = {2018}
}

Neural networks have recently become good at engaging in dialog. However, current approaches are based solely on verbal text, lacking the richness of a real face-to-face conversation. We propose a neural conversation model that aims to read and generate facial gestures alongside text. This allows our model to adapt its response based on the ``mood'' of the conversation. In particular, we introduce an RNN encoder-decoder that exploits the movement of facial muscles, as well as the verbal conversation. The decoder consists of two layers, where the lower layer aims at generating the verbal response and coarse facial expressions, while the second layer fills in the subtle gestures, making the generated output smoother and more natural. We train our neural network by having it "watch" 250 movies. We showcase our joint face-text model in generating more natural conversations through automatic metrics and a human study. We demonstrate an example application with a face-to-face chatting avatar.


SurfConv: Bridging 3D and 2D Convolution for RGBD Images

Hang Chu, Wei-Chiu Ma, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{SurfConv2018,
title = {SurfConv: Bridging 3D and 2D Convolution for RGBD Images},
author = {Hang Chu and Wei-Chiu Ma and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2018}
}

The last few years have seen approaches trying to combine the increasing popularity of depth sensors and the success of convolutional neural networks. Using depth as an additional channel alongside the RGB input suffers from the scale-variance problem present in image-convolution-based approaches. On the other hand, 3D convolution wastes a large amount of memory on mostly unoccupied 3D space, since the occupied portion consists of only the surface visible to the sensor. Instead, we propose SurfConv, which "slides" compact 2D filters along the visible 3D surface. SurfConv is formulated as a simple depth-aware multi-scale 2D convolution, through a new Data-Driven Depth Discretization (D4) scheme. We demonstrate the effectiveness of our method on indoor and outdoor 3D semantic segmentation datasets. Our method achieves state-of-the-art performance while using less than 30% of the parameters used by 3D convolution based approaches.


NerveNet: Learning Structured Policy with Graph Neural Networks

Tingwu Wang, Renjie Liao, Jimmy Ba, Sanja Fidler

In International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018

@inproceedings{WangICLR2018,
title = {NerveNet: Learning Structured Policy with Graph Neural Networks},
author = {Tingwu Wang and Renjie Liao and Jimmy Ba and Sanja Fidler},
booktitle = {ICLR},
year = {2018}
}

We address the problem of learning structured policies for continuous control. In traditional reinforcement learning, policies of agents are learned by MLPs which take the concatenation of all observations from the environment as input for predicting actions. In this work, we propose NerveNet to explicitly model the structure of an agent, which naturally takes the form of a graph. Specifically, serving as the agent's policy network, NerveNet first propagates information over the structure of the agent and then predicts actions for different parts of the agent. In the experiments, we first show that NerveNet is comparable to state-of-the-art methods on standard MuJoCo environments. We further propose customized reinforcement learning environments for benchmarking two types of structure transfer learning tasks, i.e., size and disability transfer. We demonstrate that policies learned by NerveNet are significantly better than policies learned by other models and are able to transfer even in a zero-shot setting.

Year 2017



Teaching Machines to Describe Images via Natural Language Feedback

Huan Ling, Sanja Fidler

In Neural Information Processing Systems (NIPS), Long Beach, US, 2017

@inproceedings{LingNIPS2017,
title = {Teaching Machines to Describe Images via Natural Language Feedback},
author = {Huan Ling and Sanja Fidler},
booktitle = {NIPS},
year = {2017}
}

Robots will eventually be part of every household. It is thus critical to enable algorithms to learn from and be guided by non-expert users. In this paper, we bring a human in the loop, and enable a human teacher to give feedback to a learning agent in the form of natural language. We argue that a descriptive sentence can provide a stronger learning signal than a numeric reward, in that it can easily point to where the mistakes are and how to correct them. We focus on the problem of image captioning, in which the quality of the output can easily be judged by non-experts. We propose a hierarchical phrase-based captioning model trained with policy gradients, and design a feedback network that provides reward to the learner by conditioning on the human-provided feedback. We show that by exploiting descriptive feedback our model learns to perform better than when given independently written human captions.


(oral presentation)

Towards Diverse and Natural Image Descriptions via a Conditional GAN

Bo Dai, Sanja Fidler, Raquel Urtasun, Dahua Lin

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{DaiICCV17,
title = {Towards Diverse and Natural Image Descriptions via a Conditional GAN},
author = {Bo Dai and Sanja Fidler and Raquel Urtasun and Dahua Lin},
booktitle = {ICCV},
year = {2017}
}

Despite the substantial progress in recent years, image captioning techniques are still far from perfect. Sentences produced by existing methods, e.g., those based on RNNs, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, namely maximizing the likelihood of training samples. This principle encourages high resemblance to the "ground-truth" captions, while suppressing other reasonable descriptions. Conventional evaluation metrics, e.g., BLEU and METEOR, also favor such restrictive methods. In this paper, we explore an alternative approach, with the aim of improving naturalness and diversity -- two essential properties of human expression. Specifically, we propose a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content. It is noteworthy that training a sequence generator is nontrivial. We overcome the difficulty via Policy Gradient, a strategy stemming from Reinforcement Learning, which allows the generator to receive early feedback along the way. We tested our method on two large datasets, where it performed competitively against real people in our user study and outperformed other methods on various tasks.


(oral presentation)

3D Graph Neural Networks for RGBD Semantic Segmentation

Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{3dggnnICCV17,
title = {3D Graph Neural Networks for RGBD Semantic Segmentation},
author = {Xiaojuan Qi and Renjie Liao and Jiaya Jia and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2017}
}

RGBD semantic segmentation requires joint reasoning about 2D appearance and 3D geometric information. In this paper we propose a 3D graph neural network (3DGNN) that builds a k-nearest-neighbor graph on top of a 3D point cloud. Each node in the graph corresponds to a set of points and is associated with a hidden representation vector initialized with an appearance feature extracted by a unary CNN from 2D images. Relying on recurrent functions, every node dynamically updates its hidden representation based on its current status and incoming messages from its neighbors. This propagation model is unrolled for a certain number of time steps, and the final per-node representation is used for predicting the semantic class of each pixel. We use back-propagation through time to train the model. Extensive experiments on the NYUD2 and SUN-RGBD datasets demonstrate the effectiveness of our approach.
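
As an illustrative sketch of the propagation structure described above (not the paper's implementation), the following NumPy code builds a k-nearest-neighbor graph over 3D points and unrolls a few message-passing steps; the tanh update stands in for the recurrent update used in the paper, and the feature sizes and weights are placeholders.

import numpy as np

def knn_graph(points, k=8):
    """Indices of the k nearest neighbors (excluding self) for each 3D point."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]          # (N, k)

def propagate(h, neighbors, W, U, steps=3):
    """Unrolled message passing: each node aggregates its neighbors' states
    and updates its own hidden vector; a tanh update is used here purely
    for illustration."""
    for _ in range(steps):
        msg = h[neighbors].mean(axis=1)           # (N, D) aggregated messages
        h = np.tanh(h @ W + msg @ U)
    return h

# toy usage: 500 points with 16-d placeholder appearance features
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3))
h0 = rng.normal(size=(500, 16))
W, U = rng.normal(size=(16, 16)) * 0.1, rng.normal(size=(16, 16)) * 0.1
h_final = propagate(h0, knn_graph(pts), W, U)     # fed to a per-node classifier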


(spotlight presentation)

TorontoCity: Seeing the World with a Million Eyes

Shenlong Wang, Min Bai, Gellert Mattyus, Hang Chu, Wenjie Luo, Bin Yang, Justin Liang, Joel Cheverie, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{TCity2017,
title = {TorontoCity: Seeing the World with a Million Eyes},
author = {Shenlong Wang and Min Bai and Gellert Mattyus and Hang Chu and Wenjie Luo and Bin Yang and Justin Liang and Joel Cheverie and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2017}
}

In this paper we introduce the TorontoCity benchmark, which covers the full greater Toronto area (GTA) with 712.5 km² of land, 8,439 km of road and around 400,000 buildings. Our benchmark provides different perspectives of the world captured from airplanes, drones and cars driving around the city. Manually labeling such a large-scale dataset is infeasible. Instead, we propose to utilize different sources of high-precision maps to create our ground truth. Towards this goal, we develop algorithms that allow us to align all data sources with the maps while requiring minimal human supervision. We have designed a wide variety of tasks including building height estimation (reconstruction), road centerline and curb extraction, building instance segmentation, building contour extraction (reorganization), semantic labeling and scene type classification (recognition). Our pilot study shows that most of these tasks are still difficult for modern convolutional neural networks.


Situation Recognition with Graph Neural Networks

Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, Sanja Fidler

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{SituationsICCV17,
title = {Situation Recognition with Graph Neural Networks},
author = {Ruiyu Li and Makarand Tapaswi and Renjie Liao and Jiaya Jia and Raquel Urtasun and Sanja Fidler},
booktitle = {ICCV},
year = {2017}
}

We address the problem of recognizing situations in images. Given an image, the task is to predict the most salient verb (action) and fill its semantic roles, such as who is performing the action, what is the source and target of the action, etc. Different verbs have different roles (e.g., attacking has a weapon), and each role can take on many possible values (nouns). We propose a model based on Graph Neural Networks that allows us to efficiently capture joint dependencies between roles using neural networks defined on a graph. Experiments with different graph connectivities show that our approach, which propagates information between roles, significantly outperforms existing work, as well as multiple baselines. We obtain roughly 3-5% improvement over previous work in predicting the full situation. We also provide a thorough qualitative analysis of our model and the influence of different roles in the verbs.


Open Vocabulary Scene Parsing

Hang Zhao, Xavier Puig, Bolei Zhou, Sanja Fidler, Antonio Torralba

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{openvoc17,
title = {Open Vocabulary Scene Parsing},
author = {Hang Zhao and Xavier Puig and Bolei Zhou and Sanja Fidler and Antonio Torralba},
booktitle = {ICCV},
year = {2017}
}

Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task that aims at parsing scenes with a large and open vocabulary, and several evaluation metrics are explored for this problem. Our proposed approach to this problem is a joint image pixel and word concept embeddings framework, where word concepts are connected by semantic relations. We validate the open vocabulary prediction ability of our framework on ADE20K dataset which covers a wide variety of scenes and objects. We further explore the trained joint embedding space to show its interpretability.


Sequential Grouping Networks for Instance Segmentation

Shu Liu, Jiaya Jia, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{SGN17,
title = {Sequential Grouping Networks for Instance Segmentation},
author = {Shu Liu and Jiaya Jia and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2017}
}

In this paper, we propose Sequential Grouping Networks (SGN) to tackle the problem of object instance segmentation. SGNs employ a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels. In particular, the first network aims to group pixels along each image row and column by predicting horizontal and vertical object breakpoints. These breakpoints are then used to create line segments. By exploiting two-directional information, the second network groups horizontal and vertical lines into connected components. Finally, the third network groups the connected components into object instances. Our experiments show that SGN significantly outperforms state-of-the-art approaches on both the Cityscapes dataset and PASCAL VOC.
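
The first step above, turning predicted breakpoints into line segments, can be illustrated with a small sketch; the array layout and function name are assumptions for illustration only, and the subsequent learned grouping of segments into connected components and instances is not shown.

import numpy as np

def row_segments(breakpoints):
    """Split each image row into horizontal line segments at predicted
    object breakpoints. `breakpoints` is a (H, W) boolean array; a True
    entry marks a column where a new segment starts in that row.
    Returns a list of (row, start_col, end_col) tuples, half-open on the right."""
    H, W = breakpoints.shape
    segments = []
    for y in range(H):
        starts = [0] + list(np.flatnonzero(breakpoints[y]))
        for s, e in zip(starts, starts[1:] + [W]):
            if e > s:
                segments.append((y, s, e))
    return segments

# toy usage on a 4x8 breakpoint map
bp = np.zeros((4, 8), dtype=bool)
bp[1, 3] = True                                   # row 1 splits at column 3
print(row_segments(bp))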


Be Your Own Prada: Fashion Synthesis with Structural Coherence

Shizhan Zhu, Sanja Fidler, Raquel Urtasun, Dahua Lin, Chen Change Loy

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{GANprada17,
title = {Be Your Own Prada: Fashion Synthesis with Structural Coherence},
author = {Shizhan Zhu and Sanja Fidler and Raquel Urtasun and Dahua Lin and Chen Change Loy},
booktitle = {ICCV},
year = {2017}
}

We present a novel and effective approach for generating new clothing on a wearer through generative adversarial learning. Given an input image of a person and a sentence describing a different outfit, our model "redresses" the person as desired, while at the same time keeping the wearer and her/his pose unchanged. Generating new outfits with precise regions conforming to a language description while retaining the wearer's body structure is a new and challenging task. Existing generative adversarial networks are not ideal for ensuring global coherence of structure given both the input photograph and the language description as conditions. We address this challenge by decomposing the complex generative process into two conditional stages. In the first stage, we generate a plausible semantic segmentation map that obeys the wearer's pose as a latent spatial arrangement. An effective spatial constraint is formulated to guide the generation of this semantic segmentation map. In the second stage, a generative model with a newly proposed compositional mapping layer is used to render the final image with precise regions and textures conditioned on this map. We extended the DeepFashion dataset by collecting sentence descriptions for 79K images. We demonstrate the effectiveness of our approach through both quantitative and qualitative evaluations. A user study is also conducted.


(best paper honorable mention)

Annotating Object Instances with a Polygon-RNN

Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017

@inproceedings{CastrejonCVPR17,
title = {Annotating Object Instances with a Polygon-RNN},
author = {Lluis Castrejon and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2017}
}

We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows a human annotator to intervene at any time and correct a vertex if needed, producing a segmentation as accurate as the annotator desires. We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with the original ground truth, matching the typical agreement between human annotators. For cars, our speed-up factor is 7.3 for an agreement of 82.2%. We further show the generalization capabilities of our approach to unseen datasets.
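
Below is a hedged sketch of the human-in-the-loop protocol described above: the model proposes the next polygon vertex and the annotator may override it before it is fed back. Both callables are hypothetical stand-ins for the model and the annotation UI; this is not the authors' tool.

def annotate_polygon(predict_next_vertex, get_correction, max_vertices=60):
    """Human-in-the-loop polygon annotation loop (illustrative only).

    predict_next_vertex(vertices) -> (x, y), or None when the model decides
    the polygon is closed; get_correction(vertex) -> corrected (x, y), or
    None if the annotator accepts the prediction."""
    vertices = []
    while len(vertices) < max_vertices:
        pred = predict_next_vertex(vertices)
        if pred is None:                      # model emitted its end-of-polygon signal
            break
        fix = get_correction(pred)
        vertices.append(fix if fix is not None else pred)
    return vertices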


Sports Field Localization via Deep Structured Models

Namdar Homayounfar, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017

@inproceedings{NamdarCVPR17,
title = {Sports Field Localization via Deep Structured Models},
author = {Namdar Homayounfar and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2017}
}

In this work, we propose a novel way of efficiently localizing a sports field from a single broadcast image of the game. Related work in this area relies on manually annotating a few key frames and extending the localization to similar images, or on installing fixed specialized cameras in the stadium from which the layout of the field can be obtained. In contrast, we formulate this problem as branch-and-bound inference in a Markov random field, where an energy function is defined in terms of semantic cues such as the field surface, lines and circles obtained from a deep semantic segmentation network. Moreover, our approach is fully automatic and depends only on a single image from the broadcast video of the game. We demonstrate the effectiveness of our method by applying it to soccer and hockey.


Scene Parsing through ADE20K Dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, Antonio Torralba

In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017

@inproceedings{Ade20k,
title = {Scene Parsing through ADE20K Dataset},
author = {Bolei Zhou and Hang Zhao and Xavier Puig and Sanja Fidler and Adela Barriuso and Antonio Torralba},
booktitle = {CVPR},
year = {2017}
}

Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A scene parsing benchmark is built upon ADE20K with 150 object and stuff classes included. Several segmentation baseline models are evaluated on the benchmark. A novel network design called Cascade Segmentation Module is proposed to parse a scene into stuff, objects, and object parts in a cascade, improving over the baselines. We further show that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.


Find Your Way by Observing the Sun and Other Semantic Cues

Wei-Chiu Ma, Shenlong Wang, Marcus A. Brubaker, Sanja Fidler, Raquel Urtasun

In International Conference on Robotics and Automation (ICRA), Singapore, 2017

@inproceedings{WeiChiuICRA17,
title = {Find Your Way by Observing the Sun and Other Semantic Cues},
author = {Wei-Chiu Ma and Shenlong Wang and Marcus A. Brubaker and Sanja Fidler and Raquel Urtasun},
booktitle = {ICRA},
year = {2017}
}

In this paper we present a robust, efficient and affordable approach to self-localization which requires neither GPS nor knowledge about the appearance of the world. Towards this goal, we utilize freely available cartographic maps and derive a probabilistic model that exploits semantic cues in the form of sun direction, presence of an intersection, road type, speed limit, as well as the ego-car trajectory, in order to produce very reliable localization results. Our experimental evaluation shows that our approach can localize much faster (in terms of driving time), with less computation and more robustly than competing approaches, which ignore semantic information.


3D Object Proposals using Stereo Imagery for Accurate Object Class Detection

Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Huimin Ma, Sanja Fidler, Raquel Urtasun

To appear in Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017

@inproceedings{ChenArxiv16,
title = {3D Object Proposals using Stereo Imagery for Accurate Object Class Detection},
author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {arXiv:1608.07711},
year = {2016}
}

The goal of this paper is to perform 3D object detection in the context of autonomous driving. Our method first aims at generating a set of high-quality 3D object proposals by exploiting stereo imagery. We formulate the problem as minimizing an energy function that encodes object size priors, placement of objects on the ground plane, as well as several depth-informed features that reason about free space, point cloud densities and distance to the ground. We then exploit a CNN on top of these proposals to perform object detection. In particular, we employ a convolutional neural net (CNN) that exploits context and depth information to jointly regress to 3D bounding box coordinates and object pose. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. When combined with the CNN, our approach outperforms all existing results in object detection and orientation estimation tasks for all three KITTI object classes. Furthermore, we also experiment with the setting where LIDAR information is available, and show that using both LIDAR and stereo leads to the best result.

Year 2016



Song From PI: A Musically Plausible Network for Pop Music Generation

Hang Chu, Raquel Urtasun, Sanja Fidler

arXiv preprint arXiv:1611.03477; ICLR 2017 Workshop track

@inproceedings{SongOfPI,
title = {Song From PI: A Musically Plausible Network for Pop Music Generation},
author = {Hang Chu and Raquel Urtasun and Sanja Fidler},
booktitle = {arXiv:1611.03477},
year = {2016}
}

We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show a strong preference for our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.


Efficient Summarization with Read-Again and Copy Mechanism

Wenyuan Zeng, Wenjie Luo, Sanja Fidler, Raquel Urtasun

arXiv preprint arXiv:1611.03382

@inproceedings{WenyuanArxiv16,
title = {Efficient Summarization with Read-Again and Copy Mechanism},
author = {Wenyuan Zeng and Wenjie Luo and Sanja Fidler and Raquel Urtasun},
booktitle = {arXiv:1611.03382},
year = {2016}
}

Encoder-decoder models have been widely used to solve sequence-to-sequence prediction tasks. However, current approaches suffer from two shortcomings. First, the encoders compute a representation of each word taking into account only the history of the words it has read so far, yielding suboptimal representations. Second, current decoders utilize large vocabularies in order to minimize the problem of unknown words, resulting in slow decoding times. In this paper we address both shortcomings. Towards this goal, we introduce a simple mechanism that reads the input sequence in full before committing to a representation of each word. Furthermore, we propose a simple copy mechanism that is able to exploit very small vocabularies and handle out-of-vocabulary words. We demonstrate the effectiveness of our approach on the Gigaword dataset and the DUC competition, outperforming the state-of-the-art.


Proximal Deep Structured Models

Shenlong Wang, Sanja Fidler, Raquel Urtasun

In Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016

@inproceedings{ShenlongNIPS16,
title = {Proximal Deep Structured Models},
author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {NIPS},
year = {2016}
}

Many problems in real-world applications involve predicting continuous-valued random variables that are statistically related. In this paper, we propose a powerful deep structured model that is able to learn complex non-linear functions which encode the dependencies between continuous output variables. We show that inference in our model using proximal methods can be efficiently solved as a feed-forward pass of a special type of deep recurrent neural network. We demonstrate the effectiveness of our approach in the tasks of image denoising, depth refinement and optical flow estimation.
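
To illustrate the point that inference with proximal methods unrolls into a feed-forward network, here is a generic proximal-gradient (ISTA-style) unrolling for a toy L1-regularized denoising objective; the objective, step size and number of layers are illustrative assumptions and do not reproduce the paper's learned operators.

import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def unrolled_proximal_denoise(y, lam=0.1, step=1.0, layers=10):
    """Each 'layer' is one proximal-gradient step on
    0.5*||x - y||^2 + lam*||x||_1; stacking the steps gives a feed-forward,
    recurrent structure (here with fixed, hand-set parameters instead of
    learned ones)."""
    x = np.zeros_like(y)
    for _ in range(layers):
        grad = x - y                       # gradient of the quadratic data term
        x = soft_threshold(x - step * grad, step * lam)
    return x

noisy = np.array([0.05, -0.8, 1.2, 0.02])
print(unrolled_proximal_denoise(noisy))    # small entries are shrunk to zero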


HouseCraft: Building Houses from Rental Ads and Street Views

Hang Chu, Shenlong Wang, Raquel Urtasun, Sanja Fidler

In European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016

@inproceedings{ChuCVPR16,
title = {HouseCraft: Building Houses from Rental Ads and Street Views},
author = {Hang Chu and Shenlong Wang and Raquel Urtasun and Sanja Fidler},
booktitle = {ECCV},
year = {2016}
}

In this paper, we utilize rental ads to create realistic textured 3D models of building exteriors. In particular, we exploit the address of the property and its floorplan, which are typically available in the ad. The address allows us to extract Google StreetView images around the building, while the building's floorplan allows for an efficient parametrization of the building in 3D via a small set of random variables. We propose an energy minimization framework which jointly reasons about the height of each floor, the vertical positions of windows and doors, as well as the precise location of the building in the world's map, by exploiting several geometric and semantic cues from the StreetView imagery. To demonstrate the effectiveness of our approach, we collected a new dataset with 174 houses by crawling a popular rental website. Our experiments show that our approach is able to precisely estimate the geometry and location of the property, and can create realistic 3D building models.


(spotlight presentation)

MovieQA: Understanding Stories in Movies through Question-Answering

Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

@inproceedings{TapaswiCVPR16,
title = {MovieQA: Understanding Stories in Movies through Question-Answering},
author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2016}
}

We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.


Instance-Level Segmentation for Autonomous Driving with Deep Densely Connected MRFs

Ziyu Zhang, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

@inproceedings{ZhangCVPR16,
title = {Instance-Level Segmentation for Autonomous Driving with Deep Densely Connected MRFs},
author = {Ziyu Zhang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}
}

Our aim is to provide a pixel-level object instance labeling of a monocular image. We build on recent work [Zhang et al., ICCV15] that trained a convolutional neural net to predict instance labeling in local image patches, extracted exhaustively in a stride from an image. A simple Markov random field model using several heuristics was then proposed in [Zhang et al., ICCV15] to derive a globally consistent instance labeling of the image. In this paper, we formulate the global labeling problem with a novel densely connected Markov random field and show how to encode various intuitive potentials in a way that is amenable to efficient mean field inference [Krahenbuhl et al., NIPS11]. Our potentials encode the compatibility between the global labeling and the patch-level predictions, contrast-sensitive smoothness as well as the fact that separate regions form different instances. Our experiments on the challenging KITTI benchmark [Geiger et al., CVPR12] demonstrate that our method achieves a significant performance boost over the baseline [Zhang et al., ICCV15].


Monocular 3D Object Detection for Autonomous Driving

Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

@inproceedings{ChenCVPR16,
title = {Monocular 3D Object Detection for Autonomous Driving},
author = {Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}
}

The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.


HD Maps: Fine-grained Road Segmentation by Parsing Ground and Aerial Images

Gellert Mattyus, Shenlong Wang, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

@inproceedings{MattyusCVPR16,
title = {HD Maps: Fine-grained Road Segmentation by Parsing Ground and Aerial Images},
author = {Gellert Mattyus and Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}
}

In this paper we present an approach to enhance existing maps with fine-grained segmentation categories such as parking spots and sidewalk, as well as the number and location of road lanes. Towards this goal, we propose an efficient approach that is able to estimate these fine-grained categories by doing joint inference over both monocular aerial imagery and ground images taken from a stereo camera pair mounted on top of a car. Important to this is reasoning about the alignment between the two types of imagery, as even when the measurements are taken with sophisticated GPS+IMU systems, this alignment is not sufficiently accurate. We demonstrate the effectiveness of our approach on a new dataset which enhances KITTI with aerial images taken with a camera mounted on an airplane flying around the city of Karlsruhe, Germany.


(oral presentation)

Order-Embeddings of Images and Language

Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun

In International Conference on Learning Representations (ICLR), Puerto Rico, 2016

@inproceedings{VendrovArxiv15,
title = {Order-Embeddings of Images and Language},
author = {Ivan Vendrov and Ryan Kiros and Sanja Fidler and Raquel Urtasun},
booktitle = {ICLR},
year = {2016}
}

Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.
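
For concreteness, here is a small NumPy sketch of an order-violation penalty of the form ||max(0, y - x)||^2 together with a hinge-style ranking loss built on it, in the spirit of the ordered representations described above; the margin and the toy vectors are illustrative assumptions.

import numpy as np

def order_violation(x, y):
    """Order-violation penalty ||max(0, y - x)||^2: zero exactly when
    y <= x coordinate-wise, and positive otherwise."""
    return (np.maximum(0.0, y - x) ** 2).sum(axis=-1)

def order_ranking_loss(x_pos, y_pos, x_neg, y_neg, margin=1.0):
    """Hinge loss pushing violations of true (ordered) pairs toward zero
    and violations of negative pairs above a margin (margin is illustrative)."""
    pos = order_violation(x_pos, y_pos)
    neg = np.maximum(0.0, margin - order_violation(x_neg, y_neg))
    return pos.mean() + neg.mean()

# toy usage with 2-d embeddings
a = np.array([1.0, 2.0])
b = np.array([0.5, 1.5])   # b <= a coordinate-wise: no violation
c = np.array([0.5, 2.5])   # c exceeds a in one coordinate
print(order_violation(a, b), order_violation(a, c))   # 0.0 0.25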

Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding

Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh

Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(1), 2016, pages 74-87

@article{MottaghiPAMI16,
title = {Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding},
author = {Roozbeh Mottaghi and Sanja Fidler and Alan Yuille and Raquel Urtasun and Devi Parikh},
journal = {Trans. on Pattern Analysis and Machine Intelligence},
volume= {38},
number= {1},
pages= {74--87},
year = {2016}
}

Recent trends in image understanding have pushed for scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we "plug-in" human subjects for each of the various components in a conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve scene understanding by focusing research efforts on various individual tasks.

Year 2015



(oral presentation)

Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu*, Ryan Kiros*, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

* Denotes equal contribution

@inproceedings{ZhuICCV15,
title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
booktitle = {ICCV},
year = {2015}
}

Books are a rich source of both fine-grained information, such as what a character, an object or a scene looks like, as well as high-level semantics, such as what someone is thinking or feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.

(oral presentation)

Lost Shopping! Monocular Localization in Large Indoor Spaces

Shenlong Wang, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{WangICCV15,
title = {Lost Shopping! Monocular Localization in Large Indoor Spaces},
author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}
}

In this paper we propose a novel approach to localization in very large indoor spaces (i.e., 200+ store shopping malls) that takes a single image and a floor plan of the environment as input. We formulate the localization problem as inference in a Markov random field, which jointly reasons about text detection (localizing shop names in the image with precise bounding boxes), shop facade segmentation, as well as the camera's rotation and translation within the entire shopping mall. The power of our approach is that it does not use any prior information about appearance and instead exploits text detections corresponding to the shop names. This makes our method applicable to a variety of domains and robust to store appearance variation across countries, seasons, and illumination conditions. We demonstrate the performance of our approach on a new dataset we collected of two very large shopping malls, and show the power of holistic reasoning.

Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions

Jimmy Ba, Kevin Swersky, Sanja Fidler, Ruslan Salakhutdinov

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{BaICCV15,
title = {Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions},
author = {Jimmy Ba and Kevin Swersky and Sanja Fidler and Ruslan Salakhutdinov},
booktitle = {ICCV},
year = {2015}
}

One of the main challenges in zero-shot learning of visual categories is gathering semantic attributes to accompany images. Recent work has shown that learning from textual descriptions, such as Wikipedia articles, avoids the problem of having to explicitly define these attributes. We present a new model that can classify unseen categories from their textual description. Specifically, we use text features to predict the output weights of both the convolutional and the fully connected layers in a deep convolutional neural network (CNN). We take advantage of the architecture of CNNs and learn features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches. The proposed model also allows us to automatically generate a list of pseudo-attributes for each visual category consisting of words from Wikipedia articles. We train our models end-to-end using the Caltech-UCSD bird and flower datasets and evaluate both ROC and Precision-Recall curves. Our empirical results show that the proposed model significantly outperforms previous methods.

Enhancing World Maps by Parsing Aerial Images

Gellert Mattyus, Shenlong Wang, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{MatthyusICCV15,
title = {Enhancing World Maps by Parsing Aerial Images},
author = {Gellert Mattyus and Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}
}

In recent years, contextual models that exploit maps have been shown to be very effective for many recognition and localization tasks. In this paper, we propose to exploit aerial images in order to enhance freely available world maps. Towards this goal, we make use of OpenStreetMap and formulate the problem as inference in a Markov random field parameterized in terms of the location of the road-segment centerlines as well as their width. This parameterization enables very efficient inference and returns only topologically correct roads. In particular, we can segment all OSM roads in the world in a single day using a small cluster of 10 computers. Importantly, our approach generalizes very well; it can be trained using a single aerial image and produces very accurate results in any location across the globe. We demonstrate the effectiveness of our approach over the previous state-of-the-art on two new benchmarks that we collect. We additionally show how our enhanced maps can be exploited for semantic segmentation of ground images.

Monocular Object Instance Segmentation and Depth Ordering with CNNs

Ziyu Zhang, Alex Schwing, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{ZhangICCV15,
title = {Monocular Object Instance Segmentation and Depth Ordering with CNNs},
author = {Ziyu Zhang and Alex Schwing and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}
}

In this paper we tackle the problem of instance level segmentation and depth ordering from a single monocular image. Towards this goal, we take advantage of convolutional neural nets and train them to directly predict instance level segmentations where the instance ID encodes depth ordering from large image patches. To provide a coherent single explanation of an image we develop a Markov random field which takes as input the predictions of convolutional nets applied at overlapping patches of different resolutions as well as the output of a connected component algorithm and predicts very accurate instance level segmentation and depth ordering. We demonstrate the effectiveness of our approach on the challenging KITTI benchmark and show very good performance on both tasks.

A Learning Framework for Generating Region Proposals with Mid-level Cues

Tom Lee, Sanja Fidler, Sven Dickinson

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{TLeeICCV15,
title = {A Learning Framework for Generating Region Proposals with Mid-level Cues},
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
booktitle = {ICCV},
year = {2015}
}

The object categorization community's migration from object detection to large-scale object categorization has seen a shift from sliding window approaches to bottom-up region segmentation, with the resulting region proposals offering discriminating shape and appearance features through an attempt to explicitly segment the objects in a scene from their background. One powerful class of region proposal techniques is based on parametric energy minimization (PEM) via parametric maxflow. In this paper, we incorporate PEM into a novel structured learning framework that learns how to combine a set of mid-level grouping cues to yield a small set of region proposals with high recall. Second, we diversify our region proposals and rank them with region-based convolutional neural network features. Our novel approach, called parametric min-loss, casts perceptual grouping and cue combination in a learning framework which yields encouraging results on VOC'2012.

Skip-Thought Vectors

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler

Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

@inproceedings{KirosNIPS15,
title = {Skip-Thought Vectors},
author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
booktitle = {NIPS},
year = {2015}
}

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.
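
The vocabulary-expansion step is simple enough to show directly. Below is a hedged numpy sketch of the idea described in the abstract: fit a linear map from a pre-trained word-embedding space (e.g. word2vec) to the encoder's learned word-embedding space over the shared training vocabulary, then project unseen words through it. The embedding tables here are random stand-ins with hypothetical sizes.

import numpy as np

rng = np.random.default_rng(0)

n_shared, d_w2v, d_rnn = 20000, 300, 620     # hypothetical vocabulary and embedding sizes
V_w2v = rng.normal(size=(n_shared, d_w2v))   # pre-trained vectors of words seen in training
V_rnn = rng.normal(size=(n_shared, d_rnn))   # the encoder's learned word embeddings

# Fit a linear map W so that V_w2v @ W approximates V_rnn (least squares over shared words).
W, *_ = np.linalg.lstsq(V_w2v, V_rnn, rcond=None)

# A word with a pre-trained vector but never seen during training can now be
# projected into the encoder's embedding space and fed to the trained model.
v_new_word = rng.normal(size=d_w2v)
e_new_word = v_new_word @ W
print(e_new_word.shape)                      # (620,)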

3D Object Proposals for Accurate Object Class Detection

Xiaozhi Chen*, Kaustav Kundu*, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun

Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

* Denotes equal contribution

@inproceedings{XiaozhiNIPS15,
title = {3D Object Proposals for Accurate Object Class Detection},
author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Andrew Berneshawi and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {NIPS},
year = {2015}
}

The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on Car and Cyclist, and is competitive for the Pedestrian class.
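
To make the energy-minimization view concrete, here is a toy Python sketch of scoring candidate 3D boxes with a weighted combination of a point-cloud density term and a size prior. The geometry is deliberately simplified (axis-aligned boxes, random points, made-up weights); the paper's energy additionally reasons about free space, ground-plane placement and distance to the ground.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stereo point cloud (x, y = height below camera, z = depth).
points = rng.uniform(low=[-20, -2, 0], high=[20, 0, 40], size=(5000, 3))

# Candidate boxes on the ground plane: (x_center, z_center, length, width, height).
candidates = np.array([[0.0, 10.0, 4.0, 1.8, 1.6],
                       [5.0, 20.0, 4.0, 1.8, 1.6],
                       [-8.0, 30.0, 4.0, 1.8, 1.6]])

def score_box(box, pts, w_density=1.0, w_size=0.5, prior_volume=4.0 * 1.8 * 1.6):
    # Toy energy: reward point density inside the box, penalize deviation from a size prior.
    xc, zc, l, w, h = box
    inside = ((np.abs(pts[:, 0] - xc) < l / 2) &
              (np.abs(pts[:, 2] - zc) < w / 2) &
              (pts[:, 1] > -h) & (pts[:, 1] < 0))
    density = inside.sum() / (l * w * h)
    size_penalty = abs(l * w * h - prior_volume)
    return w_density * density - w_size * size_penalty

scores = [score_box(b, points) for b in candidates]
print(int(np.argmax(scores)), scores)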


(oral presentation)

Generating Multi-Sentence Lingual Descriptions of Indoor Scenes

Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun

In British Machine Vision Conference (BMVC), Swansea, UK, 2015

@inproceedings{LinBMVC15,
title = {Generating Multi-Sentence Lingual Descriptions of Indoor Scenes},
author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
booktitle = {BMVC},
year = {2015}}

This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches have focused primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROUGE scores compared to those produced by the baseline.


(oral presentation)

Rent3D: Floor-Plan Priors for Monocular Layout Estimation

Chenxi Liu*, Alex Schwing*, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

* Denotes equal contribution

@inproceedings{ApartmentsCVPR15,
title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2015}}

The goal of this paper is to enable a 3D "virtual-tour" of an apartment given a small set of monocular images of different rooms, as well as a 2D floor plan. We frame the problem as the one of inference in a Markov random field which reasons about the layout of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example, about in which room the picture was taken. What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge. In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information can significantly help in resolving the challenging room-apartment alignment problem. We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we exploit integral geometry as well as our new bounds on the aspect ratio of rooms which allow us to carve the space, reducing significantly the number of physically possible configurations. We demonstrate the effectiveness of our approach in a new dataset which contains over 200 apartments.


(oral presentation)

Holistic 3D Scene Understanding from a Single Geo-tagged Image

Shenlong Wang, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

@inproceedings{WangCVPR15,
title = {Holistic 3D Scene Understanding from a Single Geo-tagged Image},
author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}}

In this paper we are interested in exploiting geographic priors to help outdoor scene understanding. Towards this goal we propose a holistic approach that reasons jointly about 3D object detection, pose estimation, semantic segmentation as well as depth reconstruction from a single image. Our approach takes advantage of large-scale crowdsourced maps to generate dense geographic, geometric and semantic priors by rendering the 3D world. We demonstrate the effectiveness of our holistic model on the challenging KITTI dataset, and show significant improvements over the baselines in all metrics and tasks.

segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

@inproceedings{ZhuSegDeepM15,
title = {segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection},
author = {Yukun Zhu and Raquel Urtasun and Ruslan Salakhutdinov and Sanja Fidler},
booktitle = {CVPR},
year = {2015}
}

In this paper, we propose an approach that exploits object segmentation in order to improve the accuracy of object detection. We frame the problem as inference in a Markov Random Field, in which each detection hypothesis scores object appearance as well as contextual information using Convolutional Neural Networks, and allows the hypothesis to choose and score a segment out of a large pool of accurate object segmentation proposals. This enables the detector to incorporate additional evidence when it is available and thus results in more accurate detections. Our experiments show an improvement of 4.1% in mAP over the R-CNN baseline on PASCAL VOC 2010, and 3.4% over the current state-of-the-art, demonstrating the power of our approach.

Neuroaesthetics in Fashion: Modeling the Perception of Beauty

Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

@inproceedings{SimoCVPR15,
title = {Neuroaesthetics in Fashion: Modeling the Perception of Beauty},
author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}
}

In this paper, we analyze clothing fashion on a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.





International news


Vogue (Spain) Stylebook (Germany) Ansa (Italy) CenarioMT (Brazil) Amsterdam Fashion (NL)
Marie Claire (France) Fashion Police (Nigeria) Nauka (Poland) Pluska (Slovakia) Pressetext (Austria)
Wired (Germany) Jetzt (Germany) La Gazzetta (Italy) PopSugar (Australia) SinEmbargo (Mexico)

A more complete list is maintained on our project webpage.


Real-Time Coarse-to-fine Topologically Preserving Segmentation

Jian Yao, Marko Boben, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

@inproceedings{YaoCVPR15,
title = {Real-Time Coarse-to-fine Topologically Preserving Segmentation},
author = {Jian Yao and Marko Boben and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}
}

In this paper, we tackle the problem of unsupervised segmentation in the form of superpixels. Our main emphasis is on speed and accuracy. We build on [Yamaguchi et al., ECCV'14] to define the problem as a boundary and topology preserving Markov random field. We propose a coarse to fine optimization technique that speeds up inference in terms of the number of updates by an order of magnitude. Our approach is shown to outperform [Yamaguchi et al., ECCV'14] while employing a single iteration. We evaluate and compare our approach to state-of-the-art superpixel algorithms on the BSD and KITTI benchmarks. Our approach significantly outperforms the baselines in the segmentation metrics and achieves the lowest error on the stereo task.

A Framework for Symmetric Part Detection in Cluttered Scenes

Tom Lee, Sanja Fidler, Alex Levinshtein, Cristian Sminchisescu, Sven Dickinson

Symmetry, Vol. 7, 2015, pp 1333-1351

@article{LeeSymmetry2015,
title = {A Framework for Symmetric Part Detection in Cluttered Scenes},
author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Cristian Sminchisescu and Sven Dickinson},
journal = {Symmetry},
volume = {7},
pages = {1333-1351},
year = {2015}
}

The role of symmetry in computer vision has waxed and waned in importance during the evolution of the field from its earliest days. At first figuring prominently in support of bottom-up indexing, it fell out of favour as shape gave way to appearance and recognition gave way to detection. With a strong prior in the form of a target object, the role of the weaker priors offered by perceptual grouping was greatly diminished. However, as the field returns to the problem of recognition from a large database, the bottom-up recovery of the parts that make up the objects in a cluttered scene is critical for their recognition. The medial axis community has long exploited the ubiquitous regularity of symmetry as a basis for the decomposition of a closed contour into medial parts. However, today's recognition systems are faced with cluttered scenes and the assumption that a closed contour exists, i.e., that figure-ground segmentation has been solved, rendering much of the medial axis community's work inapplicable. In this article, we review a computational framework, previously reported in [Lee et al., ICCV'13, Levinshtein et al., ICCV'09, Levinshtein et al., IJCV'13], that bridges the representation power of the medial axis and the need to recover and group an object's parts in a cluttered scene. Our framework is rooted in the idea that a maximally-inscribed disc, the building block of a medial axis, can be modelled as a compact superpixel in the image. We evaluate the method on images of cluttered scenes.


Year 2014


A High Performance CRF Model for Clothes Parsing

Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

 

@inproceedings{SimoACCV14,
title = {A High Performance CRF Model for Clothes Parsing},
author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
booktitle = {ACCV},
year = {2014}}

In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset [Yamaguchi et al., CVPR'12] and show that we can obtain a significant improvement over the state-of-the-art.

Multi-cue Mid-level Grouping

Tom Lee, Sanja Fidler, Sven Dickinson

In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

 

@inproceedings{LeeACCV14,
title = {Multi-cue mid-level grouping},
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
booktitle = {ACCV},
year = {2014}
}

Region proposal methods provide richer object hypotheses than sliding windows with dramatically fewer proposals, yet they still number in the thousands. This large quantity of proposals typically results from a diversification step that propagates bottom-up ambiguity in the form of proposals to the next processing stage. In this paper, we take a complementary approach in which mid-level knowledge is used to resolve bottom-up ambiguity at an earlier stage to allow a further reduction in the number of proposals. We present a method for generating regions using the mid-level grouping cues of closure and symmetry. In doing so, we combine mid-level cues that are typically used only in isolation, and leverage them to produce fewer but higher quality proposals. We emphasize that our model is mid-level by learning it on a limited number of objects while applying it to different objects, thus demonstrating that it is transferable to other objects. In our quantitative evaluation, we 1) establish the usefulness of each grouping cue by demonstrating incremental improvement, and 2) demonstrate improvement on two leading region proposal methods with a limited budget of proposals.

Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation

Sanja Fidler, Marko Boben, Ales Leonardis

arXiv preprint arXiv:1408.5516, 2014

@article{FidlerArxiv14,
title = {Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation},
author = {Sanja Fidler and Marko Boben and Ale\v{s} Leonardis},
journal = {arXiv preprint arXiv:1408.5516},
year = {2014}
}

Hierarchies allow feature sharing between objects at multiple levels of representation, can code exponential variability in a very compact way and enable fast inference. This makes them potentially suitable for learning and recognizing a higher number of object classes. However, the success of the hierarchical approaches so far has been hindered by the use of hand-crafted features or predetermined grouping rules. This paper presents a novel framework for learning a hierarchical compositional shape vocabulary for representing multiple object classes. The approach takes simple contour fragments and learns their frequent spatial configurations. These are recursively combined into increasingly more complex and class-specific shape compositions, each exerting a high degree of shape variability. At the top-level of the vocabulary, the compositions are sufficiently large and complex to represent the whole shapes of the objects. We learn the vocabulary layer after layer, by gradually increasing the size of the window of analysis and reducing the spatial resolution at which the shape configurations are learned. The lower layers are learned jointly on images of all classes, whereas the higher layers of the vocabulary are learned incrementally, by presenting the algorithm with one object class after another. The experimental results show that the learned multi-class object representation scales favorably with the number of object classes and achieves state-of-the-art detection performance with both faster inference and shorter training times.

What are you talking about? Text-to-Image Coreference

Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{KongCVPR14,
title = {What are you talking about? Text-to-Image Coreference},
author = {Chen Kong and Dahua Lin and Mohit Bansal and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2014}
}

In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system.

Visual Semantic Search: Retrieving Videos via Complex Textual Queries

Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{LinCVPR14,
author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
title = {Visual Semantic Search: Retrieving Videos via Complex Textual Queries},
booktitle = {CVPR},
year = {2014}
}

In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.
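
The matching step can be illustrated with off-the-shelf tools. The sketch below uses SciPy's Hungarian solver as a stand-in for the paper's generalized bipartite matching: rows are nodes of the parsed query graph, columns are visual concepts detected in a video, and the entries are hypothetical similarity scores (in the paper these combine appearance, motion and spatial relations with learned weights).

import numpy as np
from scipy.optimize import linear_sum_assignment

similarity = np.array([
    [0.9, 0.1, 0.2, 0.0],   # "black car"
    [0.2, 0.8, 0.1, 0.3],   # "pedestrian crossing"
    [0.1, 0.2, 0.1, 0.7],   # "traffic light"
])

# Maximize total similarity == minimize its negation (classic bipartite assignment).
rows, cols = linear_sum_assignment(-similarity)
for r, c in zip(rows, cols):
    print(f"query node {r} -> video concept {c} (score {similarity[r, c]:.2f})")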

Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision

Liang-Chieh Chen, Sanja Fidler, Alan Yuille, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{ChenCVPR14,
author = {Liang-Chieh Chen and Sanja Fidler and Alan Yuille and Raquel Urtasun},
title = {Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision},
booktitle = {CVPR},
year = {2014}
}

Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and a budget of tens or hundreds of thousands of dollars. Thus, developing solutions that can automatically perform the labeling given only weak supervision is key to reduce this cost. In this paper, we show how to exploit 3D information to automatically generate very accurate object segmentations given annotated 3D bounding boxes. We formulate the problem as the one of inference in a binary Markov random field which exploits appearance models, stereo and/or noisy point clouds, a repository of 3D CAD models as well as topological constraints. We demonstrate the effectiveness of our approach in the context of autonomous driving, and show that we can segment cars with the accuracy of 86% intersection-over-union, performing as well as highly recommended MTurkers!

The Role of Context for Object Detection and Semantic Segmentation in the Wild

Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, Alan Yuille

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{MottaghiCVPR14,
author = {Roozbeh Mottaghi and Xianjie Chen and Xiaobai Liu and Sanja Fidler and Raquel Urtasun and Alan Yuille},
title = {The Role of Context for Object Detection and Semantic Segmentation in the Wild},
booktitle = {CVPR},
year = {2014}
}

In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of the PASCAL VOC 2010 detection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmentation and object detection. Our analysis shows that nearest-neighbor-based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, the improvements from existing contextual models for detection are rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales.

Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts

Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Nam-Gyu Cho, Sanja Fidler, Raquel Urtasun, Alan Yuille

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{PartsCVPR14,
author = {Xianjie Chen and Roozbeh Mottaghi and Xiaobai Liu and Nam-Gyu Cho and Sanja Fidler and Raquel Urtasun and Alan Yuille},
title = {Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts},
booktitle = {CVPR},
year = {2014}
}

Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals depicted at low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables us to represent a large number of holistic object and body part combinations to better deal with different "detectability" patterns caused by deformations, occlusion and/or low resolution. We apply our method to the six animal categories in the PASCAL VOC dataset and show that our method significantly improves state-of-the-art (by 4.1% AP) and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc), making use of a new dataset of fully annotated object parts for PASCAL VOC 2010, which provides a mask for each part.


Year 2013



(oral presentation)

Holistic Scene Understanding for 3D Object Detection with RGBD cameras

Dahua Lin, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

@inproceedings{LinICCV13,
author = {Dahua Lin and Sanja Fidler and Raquel Urtasun},
title = {Holistic Scene Understanding for 3D Object Detection with RGBD cameras},
booktitle = {ICCV},
year = {2013}
}

In this paper, we tackle the problem of indoor scene understanding using RGBD data. Towards this goal, we propose a holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects. Specifically, we extend the CPMC framework to 3D in order to generate candidate cuboids, and develop a conditional random field to integrate information from different sources to classify the cuboids. With this formulation, scene classification and 3D object recognition are coupled and can be jointly solved through probabilistic inference. We test the effectiveness of our approach on the challenging NYU v2 dataset. The experimental results demonstrate that through effective evidence integration and holistic reasoning, our approach achieves substantial improvement over the state-of-the-art.

Box In the Box: Joint 3D Layout and Object Reasoning from Single Images

Alex Schwing, Sanja Fidler, Marc Pollefeys, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

@inproceedings{SchwingICCV13,
author = {Alex Schwing and Sanja Fidler and Marc Pollefeys and Raquel Urtasun},
title = {Box In the Box: Joint 3D Layout and Object Reasoning from Single Images},
booktitle = {ICCV},
year = {2013}
}

In this paper we propose an approach to jointly infer the room layout as well as the objects present in the scene. Towards this goal, we propose a branch and bound algorithm which is guaranteed to retrieve the global optimum of the joint problem. The main difficulty resides in taking into account occlusion in order to not over-count the evidence. We introduce a new decomposition method, which generalizes integral geometry to triangular shapes, and allows us to bound the different terms in constant time. We exploit both geometric cues and object detectors as image features and show large improvements in 2D and 3D object detection over state-of-the-art deformable part-based models.
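
The constant-time bounds rest on integral-geometry ideas; the standard axis-aligned case is shown below as a small Python sketch (a summed-area table giving O(1) rectangle sums over a per-pixel score map). The paper's contribution is the generalization of this trick to triangular regions so that occluded evidence is not over-counted; that part is not reproduced here.

import numpy as np

def integral_image(img):
    # Summed-area table with a zero row/column prepended.
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    S[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return S

def rect_sum(S, r0, c0, r1, c1):
    # Sum of img[r0:r1, c0:c1] in O(1) using four lookups.
    return S[r1, c1] - S[r0, c1] - S[r1, c0] + S[r0, c0]

rng = np.random.default_rng(0)
scores = rng.normal(size=(480, 640))     # e.g. per-pixel evidence for a wall or object face
S = integral_image(scores)
print(np.isclose(rect_sum(S, 100, 200, 300, 400), scores[100:300, 200:400].sum()))  # True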

Detecting Curved Symmetric Parts using a Deformable Disc Model

Tom Lee, Sanja Fidler, Sven Dickinson

In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

@inproceedings{LeeICCV13,
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
title = {Detecting Curved Symmetric Parts using a Deformable Disc Model},
booktitle = {ICCV},
year = {2013}
}

Symmetry is a powerful shape regularity that's been exploited by perceptual grouping researchers in both human and computer vision to recover part structure from an image without a priori knowledge of scene content. Drawing on the concept of a medial axis, defined as the locus of centers of maximal inscribed discs that sweep out a symmetric part, we model part recovery as the search for a sequence of deformable maximal inscribed disc hypotheses generated from a multiscale superpixel segmentation, a framework proposed by [Levinshtein et al., ICCV'09]. However, we learn affinities between adjacent superpixels in a space that's invariant to bending and tapering along the symmetry axis, enabling us to capture a wider class of symmetric parts. Moreover, we introduce a global cost that perceptually integrates the hypothesis space by combining a pairwise and a higher-level smoothing term, which we minimize globally using dynamic programming. The new framework is demonstrated on two datasets, and is shown to significantly outperform the baseline [Levinshtein et al., ICCV'09].

Bottom-up Segmentation for Top-down Detection

Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

@inproceedings{segdpmCVPR13,
author = {Sanja Fidler and Roozbeh Mottaghi and Alan Yuille and Raquel Urtasun},
title = {Bottom-up Segmentation for Top-down Detection},
booktitle = {CVPR},
year = {2013}
}

In this paper we are interested in how semantic segmentation can help object detection. Towards this goal, we propose a novel deformable part-based model which exploits region-based segmentation algorithms that compute candidate object regions by bottom-up clustering followed by ranking of those regions. Our approach allows every detection hypothesis to select a segment (including void), and scores each box in the image using both the traditional HOG filters as well as a set of novel segmentation features. Thus our model "blends" between the detector and segmentation models. Since our features can be computed very efficiently given the segments, we maintain the same complexity as the original DPM. We demonstrate the effectiveness of our approach in PASCAL VOC 2010, and show that when employing only a root filter our approach outperforms Dalal & Triggs detector on all classes, achieving 13% higher average AP. When employing the parts, we outperform the original DPM in 19 out of 20 classes, achieving an improvement of 8% AP. Furthermore, we outperform the previous state-of-the-art on VOC'10 test by 4%.

A Sentence is Worth a Thousand Pixels

Sanja Fidler, Abhishek Sharma, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

@inproceedings{FidlerCVPR13,
author = {Sanja Fidler and Abhishek Sharma and Raquel Urtasun},
title = {A Sentence is Worth a Thousand Pixels},
booktitle = {CVPR},
year = {2013}
}

We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models.

Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs

Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

@inproceedings{MottaghiCVPR13,
author = {Roozbeh Mottaghi and Sanja Fidler and Jian Yao and Raquel Urtasun and Devi Parikh},
title = {Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs},
booktitle = {CVPR},
year = {2013}
}

Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, and contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we "plug in" human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted an in-depth analysis of the human-generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.


Year 2012



(spotlight presentation)

3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

Sanja Fidler, Sven Dickinson, Raquel Urtasun

Neural Information Processing Systems (NIPS), Lake Tahoe, USA, December 2012

@inproceedings{FidlerNIPS12,
author = {Sanja Fidler and Sven Dickinson and Raquel Urtasun},
title = {3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model},
booktitle = {NIPS},
year = {2012}
}

This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D and 3D object detection.

Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation

Jian Yao, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012

@inproceedings{YaoCVPR12,
author = {Jian Yao and Sanja Fidler and Raquel Urtasun},
title = {Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation},
booktitle = {CVPR},
year = {2012}
}

In this paper we propose an approach to holistic scene understanding that reasons jointly about regions, location, class and spatial extent of objects, presence of a class in the image, as well as the scene type. Learning and inference in our model are efficient as we reason at the segment level, and introduce auxiliary variables that allow us to decompose the inherent high-order potentials into pairwise potentials between a few variables with a small number of states (at most the number of classes). Inference is done via a convergent message-passing algorithm, which, unlike graph-cuts inference, has no submodularity restrictions and does not require potential-specific moves. We believe this is very important, as it allows us to encode our ideas and prior knowledge about the problem without the need to change the inference engine every time we introduce a new potential. Our approach outperforms the state-of-the-art on the MSRC-21 benchmark, while being much faster. Importantly, our holistic model is able to improve performance in all tasks.

Super-edge grouping for object localization by combining appearance and shape information

Zhiqi Zhang, Sanja Fidler, Jarell W. Waggoner, Yu Cao, Jeff M. Siskind, Sven Dickinson, Song Wang

In Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012

@inproceedings{ZhangCVPR12,
author = {Zhiqi Zhang and Sanja Fidler and Jarell W. Waggoner and Yu Cao and Jeff M. Siskind and Sven Dickinson and Song Wang},
title = {Super-edge grouping for object localization by combining appearance and shape information},
booktitle = {CVPR},
year = {2012}
}

Both appearance and shape play important roles in object localization and object detection. In this paper, we propose a new superedge grouping method for object localization by incorporating both boundary shape and appearance information of objects. Compared with the previous edge grouping methods, the proposed method does not subdivide detected edges into short edgels before grouping. Such long, unsubdivided superedges not only facilitate the incorporation of object shape information into localization, but also increase the robustness against image noise and reduce computation. We identify and address several important problems in achieving the proposed superedge grouping, including gap filling for connecting superedges, accurate encoding of region-based information into individual edges, and the incorporation of object-shape information into object localization. In this paper, we use the bag of visual words technique to quantify the region-based appearance features of the object of interest. We find that the proposed method, by integrating both boundary and region information, can produce better localization performance than previous subwindow search and edge grouping methods on most of the 20 object categories from the VOC 2007 database. Experiments also show that the proposed method is roughly 50 times faster than the previous edge grouping method.


(oral presentation)

Video In Sentences Out

Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, Zhiqi Zhang

In Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, 2012

@inproceedings{BarbuUAI12,
author = {Andrei Barbu and Alexander Bridge and Zachary Burchill and Dan Coroian and Sven Dickinson and Sanja Fidler and Aaron Michaux and Sam Mussman and Siddharth Narayanaswamy and Dhaval Salvi and Lara Schmidt and Jiangnan Shangguan and Jeffrey Mark Siskind and Jarrell Waggoner and Song Wang and Jinlian Wei and Yifan Yin and Zhiqi Zhang},
title = {Video In Sentences Out},
booktitle = {UAI},
year = {2012}
}

We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.

Unsupervised Disambiguation of Image Captions

Wesley May, Sanja Fidler, Afsaneh Fazly, Suzanne Stevenson, Sven Dickinson

First Joint Conference on Lexical and Computational Semantics (*SEM), 2012

@inproceedings{MaySEM12,
author = {Wesley May and Sanja Fidler and Afsaneh Fazly and Suzanne Stevenson and Sven Dickinson},
title = {Unsupervised Disambiguation of Image Captions},
booktitle = {First Joint Conference on Lexical and Computational Semantics (*SEM)},
year = {2012}
}

Given a set of images with related captions, our goal is to show how visual features can improve the accuracy of unsupervised word sense disambiguation when the textual context is very small, as this sort of data is common in news and social media. We extend previous work in unsupervised text-only disambiguation with methods that integrate text and images. We construct a corpus by using Amazon Mechanical Turk to caption sense-tagged images gathered from ImageNet. Using a Yarowsky-inspired algorithm, we show that gains can be made over text-only disambiguation, as well as multimodal approaches such as Latent Dirichlet Allocation.
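
The text-only bootstrapping loop at the heart of a Yarowsky-style algorithm is easy to sketch; the version below (scikit-learn, tiny invented captions, naive Bayes as the base classifier) labels the single most confident unlabeled example each round and retrains. The paper's contribution, adding image features to this loop, is not shown.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus of captions containing the ambiguous word "bank".
captions = [
    "a fisherman sits on the bank of the river",         # seed: sense 0 (river)
    "customers queue inside the bank to withdraw cash",  # seed: sense 1 (finance)
    "ducks rest on the muddy bank near the water",
    "the bank approved the loan application",
    "a boat is tied up at the grassy bank",
    "an atm outside the bank on main street",
]
labels = np.array([0, 1, -1, -1, -1, -1])   # -1 = unlabeled

vec = CountVectorizer()
X = vec.fit_transform(captions)

# Yarowsky-style bootstrapping: retrain on the labeled pool, absorb the most
# confident unlabeled prediction, and repeat until everything is labeled.
for _ in range(10):
    clf = MultinomialNB().fit(X[labels != -1], labels[labels != -1])
    unl = np.where(labels == -1)[0]
    if len(unl) == 0:
        break
    proba = clf.predict_proba(X[unl])
    best = proba.max(axis=1).argmax()
    labels[unl[best]] = proba[best].argmax()

print(dict(zip(captions, labels)))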

Learning Categorical Shape from Captioned Images

Tom Lee, Sanja Fidler, Alex Levinshtein, Sven Dickinson

Conference on Computer and Robot Vision (CRV), Toronto, Canada, May 2012

@inproceedings{LeeCRV12,
author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Sven Dickinson},
title = {Learning Categorical Shape from Captioned Images},
booktitle = {Canadian Conference on Computer and Robot Vision (CRV)},
year = {2012}
}

Given a set of captioned images of cluttered scenes containing various objects in different positions and scales, we learn named contour models of object categories without relying on bounding box annotation. We extend a recent language-vision integration framework that finds spatial configurations of image features that co-occur with words in image captions. By substituting appearance features with local contour features, object categories are recognized by a contour model that grows along the object's boundary. Experiments on ETHZ are presented to show that 1) the extended framework is better able to learn named visual categories whose within class variation is better captured by a shape model than an appearance model; and 2) typical object recognition methods fail when manually annotated bounding boxes are unavailable.


Years 2006-2011


A Probabilistic Model for Recursive Factorized Image Features

Sergey Karayev, Mario Fritz, Sanja Fidler, Trevor Darrell

In Computer Vision and Pattern Recognition (CVPR), 2011

@inproceedings{KarayevCVPR11,
author = {Sergey Karayev and Mario Fritz and Sanja Fidler and Trevor Darrell},
title = {A Probabilistic Model for Recursive Factorized Image Features},
booktitle = {CVPR},
year = {2011}
}

Layered representations for object recognition are important due to their increased invariance, biological plausibility, and computational benefits. However, most existing approaches to hierarchical representations are strictly feedforward, and thus not well able to resolve local ambiguities. We propose a probabilistic model that learns and infers all layers of the hierarchy jointly. Specifically, we suggest a process of recursive probabilistic factorization, and present a novel generative model based on Latent Dirichlet Allocation to this end. The approach is tested on a standard recognition dataset, outperforming existing hierarchical approaches and demonstrating performance on par with current single-feature state-of-the-art models. We demonstrate two important properties of our proposed model: 1) adding an additional layer to the representation increases performance over the flat model; 2) a full Bayesian approach outperforms a feedforward implementation of the model.
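
As a point of reference, the plain (non-recursive) LDA layer that the model builds on can be run in a few lines with scikit-learn; the input here is a random stand-in for bag-of-visual-words counts, and the recursive factorization and joint Bayesian inference proposed in the paper are not reproduced.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Hypothetical counts: one row per local image neighbourhood, one column per low-level codeword.
counts = rng.poisson(1.0, size=(200, 300))

lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(counts)
topic_mix = lda.transform(counts)    # per-neighbourhood distribution over latent factors
print(topic_mix.shape)               # (200, 20)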

A coarse-to-fine Taxonomy of Constellations for Fast Multi-class Object Detection

Sanja Fidler, Marko Boben, Ales Leonardis

In European Conference in Computer Vision (ECCV), 2010

@inproceedings{FidlerECCV10,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {A coarse-to-fine Taxonomy of Constellations for Fast Multi-class Object Detection},
booktitle = {ECCV},
year = {2010}
}

In order for recognition systems to scale to a larger number of object categories, building visual class taxonomies is important to achieve running times logarithmic in the number of classes [1, 2]. In this paper we propose a novel approach for speeding up recognition times of multi-class part-based object representations. The main idea is to construct a taxonomy of constellation models cascaded from coarse-to-fine resolution and use it in recognition with an efficient search strategy. The taxonomy is built automatically in a way to minimize the number of expected computations during recognition by optimizing the cost-to-power ratio [Blanchard and Geman, Annals of Statistics, 2005]. The structure and the depth of the taxonomy are not pre-determined but are inferred from the data. The approach is utilized on the hierarchy-of-parts model, achieving efficiency in both the representation of object structure and the number of modeled object classes. We achieve a speed-up even for a small number of object classes on the ETHZ and TUD datasets. On a larger scale, our approach achieves detection time that is logarithmic in the number of classes.

Categorical Perception

Mario Fritz, Mykhaylo Andriluka, Sanja Fidler, Michael Stark, Ales Leonardis, Bernt Schiele

Cognitive Systems, Cognitive Systems Monographs Vol. 8, Springer, 2010

@InCollection{FritzChapter09,
author = {Mario Fritz and Mykhaylo Andriluka and Sanja Fidler and Michael Stark and Ales Leonardis and Bernt Schiele},
title = {Categorical Perception},
booktitle = {Cognitive Systems},
series = {Cognitive Systems Monographs},
volume = {8},
year = {2010},
publisher = {Springer},
organization = {Springer},
chapter = {Categorical Perception}
}

The ability to recognize and categorize entities in its environment is a vital competence of any cognitive system. Reasoning about the current state of the world, assessing consequences of possible actions, as well as planning future episodes requires a concept of the roles that objects and places may possibly play. For example, objects afford to be used in specific ways, and places are usually devoted to certain activities. The ability to represent and infer these roles, or, more generally, categories, from sensory observations of the world, is an important constituent of a cognitive system's perceptual processing (Section 1.3 elaborates on this with a very visual example). In the CoSy project, a substantial amount of work has been conducted on the advancement of methods that recognize and categorize objects and places by using different modalities, namely, vision, language, and laser range data. Our progress contributes to our effort to build systems that evolve through interaction with their environment in an ultimately life-long learning process. While this chapter describes our contribution to modeling, learning and representing visual categories, Chapter 7 shows how to combine the visual information with other modalities in a multi-modal learning process (e.g. speech/language as detailed in Chapter 8). Finally, Chapters 9 and 10 show how we integrated these concepts into autonomous systems to understand the implications of our progress in categorization on an interactive evolving system.

Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

Sanja Fidler, Marko Boben, Ales Leonardis

Neural Information Processing Systems (NIPS), 2009

@inproceedings{FidlerNIPS09,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Evaluating multi-class learning strategies in a generative hierarchical framework for object detection},
booktitle = {NIPS},
year = {2009}
}

Multi-class object learning and detection is a challenging problem due to the large number of object classes and their high visual variability. Specialized detectors usually excel in performance, while joint representations optimize sharing and reduce inference time -- but are complex to train. Conveniently, sequential class learning cuts down training time by transferring existing knowledge to novel classes, but cannot fully exploit the shareability of features among object classes and might depend on ordering of classes during learning. In hierarchical frameworks these issues have been little explored. In this paper, we provide a rigorous experimental analysis of various multiple object class learning strategies within a generative hierarchical framework. Specifically, we propose, evaluate and compare three important types of multi-class learning: 1.) independent training of individual categories, 2.) joint training of classes, and 3.) sequential learning of classes. We explore and compare their computational behavior (space and time) and detection performance as a function of the number of learned object classes on several recognition datasets. We show that sequential training achieves the best trade-off between inference and training times at a comparable detection performance and could thus be used to learn the classes on a larger scale.

Optimization framework for learning a hierarchical shape vocabulary for object class detection

Sanja Fidler, Marko Boben, Ales Leonardis

British Machine Vision Conference (BMVC), 2009

@inproceedings{FidlerBMVC09,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Optimization framework for learning a hierarchical shape vocabulary for object class detection},
booktitle = {BMVC},
year = {2009}
}

Learning Hierarchical Compositional Representations of Object Structure

Sanja Fidler, Marko Boben, Ales Leonardis

Object Categorization: Computer and Human Vision Perspectives
Editors: S. Dickinson, A. Leonardis, B. Schiele and M. J. Tarr
Cambridge university press, 2009

@InCollection{FidlerChapter09,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Learning Hierarchical Compositional Representations of Object Structure},
booktitle = {Object Categorization: Computer and Human Vision Perspectives},
editor = {Sven Dickinson and Ale\v{s} Leonardis and Bernt Schiele and Michael J. Tarr},
year = {2009},
publisher = {Cambridge University Press},
pages = {}
}

Similarity-based cross-layered hierarchical representation for object categorization

Sanja Fidler, Marko Boben, Ales Leonardis

Conference on Computer Vision and Pattern Recognition (CVPR), 2008

@inproceedings{FidlerCVPR08,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Similarity-based cross-layered hierarchical representation for object categorization},
booktitle = {CVPR},
year = {2008}
}

Selecting features for object detection using an AdaBoost-compatible evaluation function

Luka Furst, Sanja Fidler, Ales Leonardis

Pattern Recognition Letters, Vol. 29, No. 11, pp. 1603-1612, 2008

@article{FurstPRL08,
author = {Luka Furst and Sanja Fidler and Ales Leonardis},
title = {Selecting features for object detection using an AdaBoost-compatible evaluation function},
journal = {Pattern Recognition Letters},
volume = {29},
number = {11},
pages = {1603-1612},
year = {2008}
}

This paper addresses the problem of selecting features in a visual object detection setup where a detection algorithm is applied to an input image represented by a set of features. The set of features to be employed in the test stage is prepared in two training-stage steps. In the first step, a feature extraction algorithm produces a (possibly large) initial set of features. In the second step, on which this paper focuses, the initial set is reduced using a selection procedure. The proposed selection procedure is based on a novel evaluation function that measures the utility of individual features for a certain detection task. Owing to its design, the evaluation function can be seamlessly embedded into an AdaBoost selection framework. The developed selection procedure is integrated with state-of-the-art feature extraction and object detection methods. The presented system was tested on five challenging detection setups. In three of them, a fairly high detection accuracy was effected by as few as six features selected out of several hundred initial candidates.
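
For intuition, the AdaBoost-style selection loop can be sketched in numpy as below: each round greedily picks the feature whose one-feature stump has the lowest weighted error and reweights the data. The data are synthetic, and the trivial stump stands in for the paper's dedicated evaluation function.

import numpy as np

rng = np.random.default_rng(0)

# Rows are images, columns are binary "feature fired" indicators; y marks object presence.
n, d = 400, 50
X = rng.integers(0, 2, size=(n, d))
y = np.where(X[:, 3] | X[:, 17], 1, -1)   # signal carried by features 3 and 17
y[rng.random(n) < 0.1] *= -1              # label noise

w = np.ones(n) / n
selected = []
for _ in range(6):                        # keep six features, as in the paper's setups
    # Weighted error of each feature used as a one-feature stump (predict +1 when it fires).
    errs = np.array([w[(2 * X[:, j] - 1) != y].sum() for j in range(d)])
    j = int(errs.argmin())
    eps = max(errs[j], 1e-12)
    alpha = 0.5 * np.log((1 - eps) / eps)
    w *= np.exp(-alpha * y * (2 * X[:, j] - 1))
    w /= w.sum()
    selected.append((j, alpha))

print("selected features:", [j for j, _ in selected])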

Learning hierarchical representations of object categories for robot vision (invited paper)

Ales Leonardis, Sanja Fidler

International Symposium on Robotics Research (ISRR), 2007

@inproceedings{FidlerISSR07,
author = {Ales Leonardis and Sanja Fidler},
title = {Learning hierarchical representations of object categories for robot vision},
booktitle = {International Symposium on Robotics Research (ISRR)},
year = {2007}
}

This paper presents our recently developed approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories. Inspired by the principles of efficient indexing, robust matching, and ideas of compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which consequently define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy. Built in this way, new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories.

Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts

Sanja Fidler, Ales Leonardis

Conference on Computer Vision and Pattern Recognition (CVPR), 2007

@inproceedings{FidlerCVPR07,
author = {Sanja Fidler and Ales Leonardis},
title = {Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts},
booktitle = {CVPR},
year = {2007}
}

This paper proposes a novel approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories. Inspired by the principles of efficient indexing (bottom-up), robust matching (top-down), and ideas of compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which consequently define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy. Built in this way, new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories. Detection results confirm the effectiveness and robustness of the learned parts.

Combining Reconstructive and Discriminative Subspace Methods for Robust Classification and Regression by Subsampling

Sanja Fidler, Danijel Skocaj, Ales Leonardis

IEEE Trans. on Pattern Anal. and Machine Intell. (PAMI), vol. 28, no. 3, pp. 337-350, 2006

@article{FidlerPAMI06,
author = {Sanja Fidler and Danijel Skocaj and Ales Leonardis},
title = {Combining Reconstructive and Discriminative Subspace Methods for Robust Classification and Regression by Subsampling},
journal = {IEEE Trans. on Pattern Analysis and Machine Intelligence},
volume = {28},
number = {3},
pages = {337-350},
year = {2006}
}

Linear subspace methods that provide sufficient reconstruction of the data, such as PCA, offer an efficient way of dealing with missing pixels, outliers, and occlusions that often appear in visual data. Discriminative methods, such as LDA, on the other hand, are better suited for classification tasks but are highly sensitive to corrupted data. We present a theoretical framework for achieving the best of both types of methods: an approach that combines the discrimination power of discriminative methods with the reconstruction property of reconstructive methods, which enables us to work on subsets of pixels in images and to efficiently detect and reject outliers. The proposed approach is therefore capable of robust classification with a high breakdown point. We also show that subspace methods used for solving regression tasks, such as CCA, can be treated in a similar manner. The theoretical results are demonstrated on several computer vision tasks, showing that the proposed approach significantly outperforms standard discriminative methods in the presence of missing pixels and of images containing occlusions and outliers.
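The sketch below illustrates (with synthetic data and scikit-learn, neither of which comes from the paper) the division of labor the abstract describes: a reconstructive PCA basis models the images, a discriminative LDA rule is learned on the PCA coefficients, and at test time the coefficients of a corrupted image are estimated from a subset of pixels only, so occluded or outlying pixels can simply be left out. The automatic, subsampling-based detection of which pixels are corrupted is omitted here (the clean subset is assumed known), and names such as classify_with_subset are illustrative.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# toy "images": two classes of 100-dimensional vectors
n_per_class, dim = 50, 100
means = [np.zeros(dim), np.ones(dim)]
X = np.vstack([m + 0.3 * rng.standard_normal((n_per_class, dim)) for m in means])
y = np.repeat([0, 1], n_per_class)

pca = PCA(n_components=10).fit(X)                             # reconstructive subspace
lda = LinearDiscriminantAnalysis().fit(pca.transform(X), y)   # discriminative rule

def classify_with_subset(x, pixel_subset):
    """Estimate the PCA coefficients of x from the given pixels only (least squares,
    since the basis restricted to a pixel subset is no longer orthonormal), then
    apply the LDA rule learned on clean coefficients."""
    W = pca.components_.T[pixel_subset]              # basis rows for the kept pixels
    r = x[pixel_subset] - pca.mean_[pixel_subset]
    a, *_ = np.linalg.lstsq(W, r, rcond=None)
    return lda.predict(a.reshape(1, -1))[0]

# corrupt half of the pixels of a class-1 image and classify from the clean half
x = X[-1].copy()
occluded = rng.choice(dim, size=dim // 2, replace=False)
x[occluded] = 50.0                                   # gross outliers / occlusion
clean = np.setdiff1d(np.arange(dim), occluded)
print("predicted class:", classify_with_subset(x, clean))    # expected: 1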

Hierarchical Statistical Learning of Generic Parts of Object Structure

Sanja Fidler, Gregor Berginc, Ales Leonardis

Conference on Computer Vision and Pattern Recognition (CVPR), 2006

@inproceedings{FidlerCVPR06,
author = {Sanja Fidler and Gregor Berginc and Ales Leonardis},
title = {Hierarchical Statistical Learning of Generic Parts of Object Structure},
booktitle = {CVPR},
year = {2006}
}

With the growing interest in object categorization, various methods have emerged that perform well in this challenging task, yet they are inherently limited to only a moderate number of object classes. In pursuit of a more general categorization system, this paper proposes a way to overcome the computational complexity that comes with the enormous number of different object categories by exploiting the statistical properties of the highly structured visual world. Our approach acquires generic parts of object structure hierarchically, from simple to more complex ones, exploiting the favorable statistics of natural images. The parts recovered in the individual layers of the hierarchy can be used in a top-down manner, resulting in a robust statistical engine that can be used efficiently within many current categorization systems. The proposed approach has been applied to large image datasets, yielding important statistical insights into the generic parts of object structure.
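To make the "statistics of the highly structured visual world" concrete, here is a hedged illustration of the kind of measurement such an approach can start from: for every ordered pair of simple feature types (e.g. edge orientations), accumulate a histogram of the relative displacement at which the second feature occurs with respect to the first; peaks in these maps indicate frequently recurring local configurations, i.e. candidates for generic parts. Feature extraction itself is not shown, and all names are illustrative.

import numpy as np
from collections import defaultdict

def cooccurrence_maps(detections_per_image, window=5):
    """detections_per_image: list of images, each a list of (feature_type, x, y).
    Returns {(type_a, type_b): (2*window+1) x (2*window+1) histogram of offsets}."""
    size = 2 * window + 1
    maps = defaultdict(lambda: np.zeros((size, size), dtype=int))
    for detections in detections_per_image:
        for ta, xa, ya in detections:
            for tb, xb, yb in detections:
                dx, dy = xb - xa, yb - ya
                if (dx, dy) != (0, 0) and abs(dx) <= window and abs(dy) <= window:
                    maps[(ta, tb)][dy + window, dx + window] += 1
    return maps

# toy usage: feature type 0 is repeatedly followed, 3 pixels to the right, by
# another feature of type 0 -- the (0, 0) map peaks at that offset
images = [[(0, x, 7), (0, x + 3, 7)] for x in range(20)]
pair, hist = max(cooccurrence_maps(images).items(), key=lambda kv: kv[1].max())
print("strongest pair:", pair, "max count:", int(hist.max()))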


Earlier work


Robust estimation of canonical correlation coefficients

Danijel Skocaj, Ales Leonardis, Sanja Fidler

28th Workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR), 2004

@inproceedings{SkocajOAGM04,
author = {Danijel Skocaj and Ales Leonardis and Sanja Fidler},
title = {Robust estimation of canonical correlation coefficients},
booktitle = {28th workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR)},
year = {2004}
}

Canonical Correlation Analysis (CCA) is well suited for regression tasks in the appearance-based approach to modeling objects and scenes. However, since it relies on the standard projection, it is inherently non-robust. In this paper, we propose to embed the estimation of the CCA coefficients in an augmented PCA space, which enables detection of outliers and preserves the regression-relevant information, enabling robust estimation of the canonical correlation coefficients.
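A hedged sketch of that idea follows, with synthetic data and scikit-learn standing in for the actual implementation: the CCA projection itself is not robust, so a corrupted input is first repaired in an augmented basis (the PCA basis of the inputs extended with the CCA input directions, so that the regression-relevant information is preserved) using only pixels believed to be clean, and the standard CCA regression is then applied to the repaired input. The outlier-detection step is reduced to a known mask here, and robust_predict is an illustrative name.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
n, dim_x, dim_y = 200, 60, 3
latent = rng.standard_normal((n, dim_y))
X = latent @ rng.standard_normal((dim_y, dim_x)) + 0.05 * rng.standard_normal((n, dim_x))
Y = latent + 0.05 * rng.standard_normal((n, dim_y))

pca = PCA(n_components=8).fit(X)
cca = CCA(n_components=dim_y).fit(X, Y)

# augmented basis: reconstructive PCA directions plus the CCA input directions
B = np.hstack([pca.components_.T, cca.x_weights_])   # shape (dim_x, 8 + dim_y)

def robust_predict(x, clean_pixels):
    """Estimate the augmented-basis coefficients of x from clean pixels only,
    rebuild the full input vector, and run the standard CCA regression on it."""
    a, *_ = np.linalg.lstsq(B[clean_pixels],
                            x[clean_pixels] - pca.mean_[clean_pixels], rcond=None)
    x_repaired = pca.mean_ + B @ a
    return cca.predict(x_repaired.reshape(1, -1))[0]

# corrupt a third of the pixels of one input and compare against the naive mapping
x, y_true = X[0].copy(), Y[0]
corrupted = rng.choice(dim_x, size=dim_x // 3, replace=False)
x[corrupted] = 100.0
clean = np.setdiff1d(np.arange(dim_x), corrupted)
print("error (robust):", np.linalg.norm(robust_predict(x, clean) - y_true))
print("error (naive): ", np.linalg.norm(cca.predict(x.reshape(1, -1))[0] - y_true))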

Robust LDA classification by subsampling

Sanja Fidler, Ales Leonardis

In Workshop on Statistical Analysis in Computer Vision (in conjunction with CVPR), 2003

@inproceedings{FidlerSACV03,
author = {Sanja Fidler and Ales Leonardis},
title = {Robust LDA classification by subsampling},
booktitle = {Workshop on Statistical Analysis in Computer Vision (in conjunction with CVPR)},
year = {2003}
}

In this paper we present a new method that enables a robust calculation of the LDA classification rule, thus making recognition of objects possible under non-ideal conditions, i.e., when objects are occluded, appear on a varying background, or when their images are corrupted by outliers. The main idea behind the method is to translate the task of calculating the LDA classification rule into the problem of determining the coefficients of an augmented generative model (PCA). Specifically, we construct an augmented PCA basis which, on the one hand, contains the information necessary for classification (in the LDA sense) and, on the other hand, enables us to calculate the necessary coefficients by means of a subsampling approach, resulting in classification with a high breakdown point. The theoretical results are evaluated on the ORL face database, showing that the proposed method significantly outperforms standard LDA.
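The subsampling step can be sketched in isolation as follows (a hedged illustration, not the paper's exact procedure): draw many random pixel subsets, solve for the coefficients of the augmented generative basis on each subset, and keep the hypothesis whose reconstruction is consistent with the largest number of pixels. Because a few all-inlier subsets suffice, gross occlusions do not contaminate the selected coefficients, which can then be fed into the LDA classification rule. The function name robust_coefficients and all thresholds are illustrative.

import numpy as np

def robust_coefficients(x, basis, mean, n_hypotheses=200, subset_size=None,
                        inlier_tol=1.0, rng=None):
    """x: corrupted image (d,); basis: (d, k) augmented generative basis."""
    rng = np.random.default_rng() if rng is None else rng
    d, k = basis.shape
    subset_size = subset_size or 2 * k
    r_full = x - mean
    best_a, best_support = None, -1
    for _ in range(n_hypotheses):
        idx = rng.choice(d, size=subset_size, replace=False)
        a, *_ = np.linalg.lstsq(basis[idx], r_full[idx], rcond=None)
        support = int((np.abs(basis @ a - r_full) < inlier_tol).sum())  # explained pixels
        if support > best_support:
            best_a, best_support = a, support
    return best_a, best_support

# toy check: a rank-3 generative model with 40% of the pixels replaced by outliers
rng = np.random.default_rng(2)
d, k = 100, 3
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))
a_true = rng.standard_normal(k)
x = basis @ a_true
x[rng.choice(d, size=40, replace=False)] += 25.0
a_hat, support = robust_coefficients(x, basis, np.zeros(d), rng=rng)
print("coefficient error:", float(np.linalg.norm(a_hat - a_true)), "support:", support)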


(best paper award)

Robust LDA classification

Sanja Fidler, Ales Leonardis

27th Workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR), 2003

@inproceedings{FidlerOAGM03,
author = {Sanja Fidler and Ales Leonardis},
title = {Robust LDA classification},
booktitle = {27th workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR)},
year = {2003}
}
