Jun Gao, Wenzheng Chen, Tommy Xiang, Alec Jacobson, Morgan McGuire, Sanja Fidler
In Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2020
Paper  Abstract  Project page  Bibtex@inproceedings{DefTet20,
title = {Learning Deformable Tetrahedral Meshes for 3D Reconstruction},
author = {Jun Gao and Wenzheng Chen and Tommy Xiang and Alec Jacobson and Morgan McGuire and Sanja Fidler},
booktitle = {NeurIPS},
year = {2020}}
3D shape representations that accommodate learning-based 3D reconstruction are an open problem in machine learning and computer graphics. Previous work on neural 3D reconstruction demonstrated benefits, but also limitations, of point cloud, voxel, surface mesh, and implicit function representations. We introduce Deformable Tetrahedral Meshes (DefTet) as a particular parameterization that utilizes volumetric tetrahedral meshes for the reconstruction problem. Unlike existing volumetric approaches, DefTet optimizes for both vertex placement and occupancy, and is differentiable with respect to standard 3D reconstruction loss functions. It is thus simultaneously high-precision, volumetric, and amenable to learning-based neural architectures. We show that it can represent arbitrary, complex topology, is both memory and computationally efficient, and can produce high-fidelity reconstructions with a significantly smaller grid size than alternative volumetric approaches. The predicted surfaces are also inherently defined as tetrahedral meshes, thus do not require post-processing. We demonstrate that DefTet matches or exceeds both the quality of the previous best approaches and the performance of the fastest ones. Our approach obtains high-quality tetrahedral meshes computed directly from noisy point clouds, and is the first to showcase high-quality 3D results using only a single image as input.
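For intuition only, and not the authors' implementation, the following PyTorch-style sketch shows the kind of parameterization described above: a fixed tetrahedral connectivity whose vertex positions and per-tetrahedron occupancies are both predicted by small networks, so a differentiable reconstruction loss can update vertex placement and occupancy jointly. The tensor shapes, offset bound, and feature inputs are assumptions.

```python
import torch
import torch.nn as nn

class DeformableTetGrid(nn.Module):
    """Toy stand-in for a DefTet-style parameterization: fixed tetrahedral
    connectivity, predicted vertex offsets, and per-tetrahedron occupancy."""

    def __init__(self, vertices, tets, feat_dim=64, max_offset=0.05):
        super().__init__()
        self.register_buffer("base_vertices", vertices)  # (V, 3) rest positions
        self.register_buffer("tets", tets)               # (T, 4) vertex indices
        self.max_offset = max_offset                     # heuristic bound (assumption)
        self.offset_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))
        self.occ_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, vert_feats, tet_feats):
        # Bounded offsets keep tetrahedra from degenerating too easily.
        offsets = torch.tanh(self.offset_head(vert_feats)) * self.max_offset  # (V, 3)
        vertices = self.base_vertices + offsets
        occupancy = torch.sigmoid(self.occ_head(tet_feats)).squeeze(-1)       # (T,)
        return vertices, occupancy  # both differentiable w.r.t. the networks
```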
Huan Ling, David Acuna, Karsten Kreis, Seung Kim, Sanja Fidler
In Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2020
Paper  Abstract  Project page  Bibtex@inproceedings{amodalVAE20,
title = {Variational Amodal Object Completion for Interactive Scene Editing},
author = {Huan Ling and David Acuna and Karsten Kreis and Seung Kim and Sanja Fidler},
booktitle = {NeurIPS},
year = {2020}}
In images of complex scenes, objects often occlude each other, which makes perception tasks such as object detection and tracking, or robotic control tasks such as planning, challenging. To facilitate downstream tasks, it is thus important to reason about the full extent of objects, i.e., seeing behind occlusion, typically referred to as amodal instance completion. In this paper, we propose a variational generative framework for amodal completion which does not require any amodal labels at training time, as it is able to utilize widely available object instance masks. We showcase our approach on the downstream task of scene editing where the user is presented with interactive tools to complete and erase objects in photographs. Experiments on complex street scenes demonstrate state-of-the-art performance in amodal mask completion, and showcase higher-quality scene editing results over parallel work by Zhan et al. Interestingly, a user study shows that humans prefer our object completions to the human-labeled ones.
Daiqing Li, Amlan Kar, Nishant Ravikumar, Alejandro F Frangi, Sanja Fidler
In Medical Image Computing and Computer Assisted Intervention (MICCAI), 2020
Paper  Abstract  Project page  Bibtex@inproceedings{fedsim20,
title = {Federated Simulation for Medical Imaging},
author = {Daiqing Li and Amlan Kar and Nishant Ravikumar and Alejandro F Frangi and Sanja Fidler},
booktitle = {MICCAI},
year = {2020}}
Labelling data is expensive and time-consuming, especially for domains such as medical imaging that contain volumetric imaging data and require expert knowledge. Exploiting a larger pool of labeled data available across multiple centers, such as in federated learning, has also seen limited success since current deep learning approaches do not generalize well to images acquired with scanners from different manufacturers. We aim to address these problems in a common, learning-based image simulation framework which we refer to as Federated Simulation. We introduce a physics-driven generative approach that consists of two learnable neural modules: 1) a module that synthesizes 3D cardiac shapes along with their materials, and 2) a CT simulator that renders these into realistic 3D CT volumes, with annotations. Since the model of geometry and material is disentangled from the imaging sensor, it can effectively be trained across multiple medical centers. We show that our data synthesis framework improves the downstream segmentation performance on several datasets.
Jonah Philion, Sanja Fidler
In European Conference on Computer Vision (ECCV), 2020
Paper  Abstract  Project page  Bibtex@inproceedings{liftsplat20,
title = {Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D},
author = {Jonah Philion and Sanja Fidler},
booktitle = {ECCV},
year = {2020}}
The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single "bird's-eye-view" coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a rasterized bird's-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird's-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar.
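As a rough sketch of the "lift" and "splat" steps, and not the paper's code, the functions below weight per-pixel features by a predicted categorical depth distribution to form a frustum of features, then scatter-add pooled frustum points into a flattened bird's-eye-view grid. The depth-bin discretization and sum pooling are assumptions.

```python
import torch

def lift_image_features(feats, depth_logits):
    """'Lift': turn per-pixel features into a frustum of features by weighting
    them with a predicted depth distribution.
    feats:        (B, C, H, W) image features
    depth_logits: (B, D, H, W) scores over D depth bins
    returns:      (B, D, C, H, W) frustum features"""
    depth_prob = depth_logits.softmax(dim=1)             # (B, D, H, W)
    return depth_prob.unsqueeze(2) * feats.unsqueeze(1)  # outer product over depth

def splat_to_bev(frustum_feats, bev_index, bev_cells):
    """'Splat': scatter-add frustum features into a flattened BEV grid.
    frustum_feats: (N, C) features of N frustum points from all cameras
    bev_index:     (N,) flat BEV cell index of each point (precomputed geometrically)
    bev_cells:     number of cells in the flattened BEV grid"""
    bev = torch.zeros(bev_cells, frustum_feats.shape[1], device=frustum_feats.device)
    bev.index_add_(0, bev_index, frustum_feats)          # sum-pool per BEV cell
    return bev
```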
Jeevan Devaranjan*, Amlan Kar*, Sanja Fidler
In European Conference on Computer Vision (ECCV), 2020
Paper  Abstract  Project page  Bibtex@inproceedings{metasim20,
title = {Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation},
author = {Jeevan Devaranjan and Amlan Kar and Sanja Fidler},
booktitle = {ECCV},
year = {2020}}
Procedural models are widely used to synthesize scenes for graphics and gaming, and to create (labeled) synthetic datasets for ML. In order to produce realistic and diverse scenes, a number of parameters governing the procedural models have to be carefully tuned by experts. These parameters control both the structure of the scenes being generated (e.g. how many cars are in the scene) and the parameters that place objects in valid configurations. Meta-Sim aimed at automatically tuning parameters given a target collection of real images in an unsupervised way. In Meta-Sim2, we aim to learn the scene structure in addition to parameters, which is a challenging problem due to its discrete nature. Meta-Sim2 proceeds by learning to sequentially sample rule expansions from a given probabilistic scene grammar. Due to the discrete nature of the problem, we use Reinforcement Learning to train our model, and design a feature space divergence between our synthesized and target images that is key to successful training. Experiments on a real driving dataset show that, without any supervision, we can successfully learn to generate data that captures discrete structural statistics of objects, such as their frequency, in real images. We also show that this leads to downstream improvement in the performance of an object detector trained on our generated dataset as opposed to other baseline simulation methods.
Jun Gao, Zian Wang, Jinchen Xuan, Sanja Fidler
In European Conference on Computer Vision (ECCV), 2020
Paper  Abstract  Project page  Bibtex@inproceedings{defgrid20,
title = {Beyond Fixed Grid: Learning Geometric Image Representation with a Deformable Grid},
author = {Jun Gao and Zian Wang and Jinchen Xuan and Sanja Fidler},
booktitle = {ECCV},
year = {2020}}
In modern computer vision, images are typically represented as a fixed uniform grid with some stride and processed via a deep convolutional neural network. We argue that deforming the grid to better align with the high-frequency image content is a more effective strategy. We introduce Deformable Grid (DefGrid), a learnable neural network module that predicts location offsets of vertices of a 2-dimensional triangular grid, such that the edges of the deformed grid align with image boundaries. We showcase our DefGrid in a variety of use cases, i.e., by inserting it as a module at various levels of processing. We utilize DefGrid as an end-to-end learnable geometric downsampling layer that replaces standard pooling methods for reducing feature resolution when feeding images into a deep CNN. We show significantly improved results at the same grid resolution compared to using CNNs on uniform grids for the task of semantic segmentation. We also utilize DefGrid at the output layers for the task of object mask annotation, and show that reasoning about object boundaries on our predicted polygonal grid leads to more accurate results over existing pixel-wise and curve-based approaches. We finally showcase DefGrid as a standalone module for unsupervised image partitioning, showing superior performance over existing approaches.
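A minimal sketch of the core idea, with the feature dimensionality and offset bound as assumptions rather than the authors' architecture: each grid vertex receives a bounded 2D offset predicted from features sampled at that vertex, so the deformed grid can be trained to follow image boundaries.

```python
import torch
import torch.nn as nn

class GridDeformer(nn.Module):
    """Toy DefGrid-style module: predict a bounded 2D offset for every grid
    vertex from image features sampled at the vertex location."""

    def __init__(self, feat_dim=64, max_offset=0.5):
        super().__init__()
        self.max_offset = max_offset  # in units of one grid cell (an assumption)
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, vertex_xy, vertex_feats):
        # vertex_xy: (N, 2) base grid positions, vertex_feats: (N, feat_dim)
        x = torch.cat([vertex_xy, vertex_feats], dim=-1)
        offsets = torch.tanh(self.mlp(x)) * self.max_offset
        return vertex_xy + offsets  # deformed grid; a boundary loss moves edges onto contours
```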
Tianchang Shen, Jun Gao, Amlan Kar, Sanja Fidler
In European Conference on Computer Vision (ECCV), 2020
Paper  Abstract  Project page  Demo  Bibtex@inproceedings{scribble3d20,
title = {Interactive Annotation of 3D Object Geometry using 2D Scribbles},
author = {Tianchang Shen and Jun Gao and Amlan Kar and Sanja Fidler},
booktitle = {ECCV},
year = {2020}}
Inferring detailed 3D geometry of the scene is crucial for robotics applications, simulation, and 3D content creation. However, such information is hard to obtain, and thus very few datasets support it. In this paper, we propose an interactive framework for annotating 3D object geometry from both point cloud data and RGB imagery. The key idea behind our approach is to exploit strong priors that humans have about the 3D world in order to interactively annotate complete 3D shapes. Our framework targets a wide pool of annotators, i.e. naive users without artistic or graphics expertise. In particular, we introduce two simple-to-use interaction modules. First, we make an automatic guess of the 3D shape and allow the user to provide feedback about large errors by drawing scribbles in desired 2D views. Next, we aim to correct minor errors, in which users drag and drop 3D mesh vertices, assisted by a neural interactive module implemented as a Graph Convolutional Network. Experimentally, we show that only a few user interactions are needed to produce good quality 3D shapes on popular benchmarks such as ShapeNet, Pix3D and ScanNet. We implement our framework as a web service and conduct a user study, where we show that user annotated data using our method effectively facilitates real-world learning tasks.
Bowen Chen, Huan Ling, Xiaohui Zeng, Jun Gao, Ziyue Xu, Sanja Fidler
In European Conference on Computer Vision (ECCV), 2020
Paper  Abstract  Project page  Bibtex@inproceedings{scribblebox20,
title = {ScribbleBox: Interactive Annotation Framework for Video Object Segmentation},
author = {Bowen Chen and Huan Ling and Xiaohui Zeng and Jun Gao and Ziyue Xu and Sanja Fidler},
booktitle = {ECCV},
year = {2020}}
Manually labeling video datasets for segmentation tasks is extremely time consuming. In this paper, we introduce ScribbleBox, a novel interactive framework for annotating object instances with masks in videos. In particular, we split annotation into two steps: annotating objects with tracked boxes, and labeling masks inside these tracks. We introduce automation and interaction in both steps. Box tracks are annotated efficiently by approximating the trajectory using a parametric curve with a small number of control points which the annotator can interactively correct. Our approach tolerates a modest amount of noise in the box placements, thus typically only a few clicks are needed to annotate tracked boxes to a sufficient accuracy. Segmentation masks are corrected via scribbles which are efficiently propagated through time. We show significant performance gains in annotation efficiency over past work. We show that our ScribbleBox approach reaches 88.92% J&F on DAVIS2017 with 9.14 clicks per box track, and 4 frames of scribble annotation.
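To illustrate the box-track idea only, the sketch below approximates each box coordinate over time with a low-degree polynomial; the paper's actual curve family and control-point parameterization are not reproduced here, so the polynomial is purely a stand-in. The annotator then only corrects frames where the interpolated box drifts.

```python
import numpy as np

def fit_box_track(frames, boxes, degree=3):
    """Approximate a box track with one low-degree polynomial per coordinate.
    frames: (T,) frame indices; boxes: (T, 4) boxes as (x1, y1, x2, y2).
    Returns a function mapping any frame index to an interpolated box."""
    coeffs = [np.polyfit(frames, boxes[:, k], degree) for k in range(4)]
    return lambda t: np.array([np.polyval(c, t) for c in coeffs])

# Toy usage: a linearly moving box is summarized by a handful of coefficients,
# and intermediate frames are filled in automatically.
t = np.arange(20)
boxes = np.stack([10 + t, 20 + 0.5 * t, 60 + t, 80 + 0.5 * t], axis=1)
track = fit_box_track(t, boxes)
print(track(10.5))
```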
Hang Chu, Shugao Ma, Fernando De la Torre, Sanja Fidler, Yaser Sheikh
In European Conference on Computer Vision (ECCV), 2020
Paper  Abstract  Bibtex@inproceedings{modularavatar20,
title = {Expressive Telepresence via Modular Codec Avatar},
author = {Hang Chu and Shugao Ma and Fernando De la Torre and Sanja Fidler and Yaser Sheikh},
booktitle = {ECCV},
year = {2020}}
VR telepresence consists of interacting with another human in a virtual space represented by an avatar. Today most avatars are cartoon-like, but soon the technology will allow video-realistic ones. This paper aims in this direction, and presents Modular Codec Avatars (MCA), a method to generate hyper-realistic faces driven by the cameras in the VR headset. MCA extends traditional Codec Avatars (CA) by replacing the holistic models with a learned modular representation. It is important to note that traditional person-specific CAs are learned from few training samples, and typically lack robustness as well as expressiveness when transferring facial expressions. MCAs solve these issues by learning a modulated adaptive blending of different facial components as well as an exemplar-based latent alignment. We demonstrate that MCA achieves improved expressiveness and robustness w.r.t. CA in a variety of real-world datasets and practical scenarios. Finally, we showcase new applications in VR telepresence enabled by the proposed model.
Maria Shugrina, Amlan Kar, Sanja Fidler, Karan Singh
In SIGGRAPH, 2020
Paper  Abstract  Project page  Bibtex@inproceedings{colortriads20,
title = {Nonlinear Color Triads for Approximation, Learning and Direct Manipulation of Color Distributions},
author = {Maria Shugrina and Amlan Kar and Sanja Fidler and Karan Singh},
booktitle = {SIGGRAPH},
year = {2020}}
We present nonlinear color triads, an extension of color gradients able to approximate a variety of natural color distributions that have no standard interactive representation. We derive a method to fit this compact parametric representation to existing images and show its power for tasks such as image editing and compression. Our color triad formulation can also be included in standard deep learning architectures, facilitating further research.
Dima Damen, Hazel Doughty, Giovanni Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2020
Paper  Abstract  Project page  Bibtex@article{EPICKITCHENS20,
title = {The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines},
author = {Dima Damen and Hazel Doughty and Giovanni Farinella and Sanja Fidler and Antonino Furnari and Evangelos Kazakos and Davide Moltisanti and Jonathan Munro and Toby Perrett and Will Price and Michael Wray},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2020}}
Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict non-scripted daily activities, as recording started every time a participant entered their kitchen. Recording took place in 4 countries by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions (e.g. "closing a tap" from "opening" it up).
Xi Yan*, David Acuna*, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), 2020
Paper  Abstract  Project page  Bibtex@inproceedings{nds20,
title = {Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data},
author = {Xi Yan and David Acuna and Sanja Fidler},
booktitle = {CVPR},
year = {2020}}
Transfer learning has proven to be a successful technique to train deep learning models in domains where little training data is available. The dominant approach is to pretrain a model on a large generic dataset such as ImageNet and finetune its weights on the target domain. However, in the new era of an ever-increasing number of massive datasets, selecting the relevant data for pretraining is a critical issue. We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data for the target domain. NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client, an end-user with a target application and its own small labeled dataset. The dataserver represents large datasets with a much more compact mixture-of-experts model, and employs it to perform data search in a series of dataserver-client transactions at a low computational cost. We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets and tasks such as image classification, object detection and instance segmentation. Neural Data Server is available as a web service.
Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), 2020
Paper  Abstract  Project page  Bibtex@inproceedings{gamegan20,
title = {Learning to Simulate Dynamic Environments with GameGAN},
author = {Seung Wook Kim and Yuhao Zhou and Jonah Philion and Antonio Torralba and Sanja Fidler},
booktitle = {CVPR},
year = {2020}}
Simulation is a crucial component of any robotic system. In order to simulate correctly, we need to write complex rules of the environment: how dynamic agents behave, and how the actions of each of the agents affect the behavior of others. In this paper, we aim to learn a simulator by simply watching an agent interact with an environment. We focus on graphics games as a proxy of the real environment. We introduce GameGAN, a generative model that learns to visually imitate a desired game by ingesting screenplay and keyboard actions during training. Given a key pressed by the agent, GameGAN "renders" the next screen using a carefully designed generative adversarial network. Our approach offers key advantages over existing work: we design a memory module that builds an internal map of the environment, allowing for the agent to return to previously visited locations with high visual consistency. In addition, GameGAN is able to disentangle static and dynamic components within an image making the behavior of the model more interpretable, and relevant for downstream tasks that require explicit reasoning over dynamic elements. This enables many interesting applications such as swapping different components of the game to build new games that do not exist.
Jonah Philion, Amlan Kar, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), 2020
Paper  Abstract  Project page  Bibtex@inproceedings{learneval20,
title = {Learning to Evaluate Perception Models Using Planner-Centric Metrics},
author = {Jonah Philion and Amlan Kar and Sanja Fidler},
booktitle = {CVPR},
year = {2020}}
Variants of accuracy and precision are the gold-standard by which the computer vision community measures progress of perception algorithms. One reason for the ubiquity of these metrics is that they are largely task-agnostic; we in general seek to detect zero false negatives or positives. The downside of these metrics is that, at worst, they penalize all incorrect detections equally without conditioning on the task or scene, and at best, heuristics need to be chosen to ensure that different mistakes count differently. In this paper, we propose a principled metric for 3D object detection specifically for the task of self-driving. The core idea behind our metric is to isolate the task of object detection and measure the impact the produced detections would induce on the downstream task of driving. Without hand-designing it to, we find that our metric penalizes many of the mistakes that other metrics penalize by design. In addition, our metric downweighs detections based on additional factors such as distance from a detection to the ego car and the speed of the detection in intuitive ways that other detection metrics do not. For human evaluation, we generate scenes in which standard metrics and our metric disagree and find that humans side with our metric 79% of the time.
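A toy sketch of the underlying idea follows; the planner, the distance between plans, and the probabilistic details of the actual metric are placeholders, not the paper's formulation. Detections are scored by how much they change the plan a downstream planner produces relative to the plan obtained from ground-truth objects.

```python
import numpy as np

def toy_planner(objects, goal=np.array([0.0, 30.0]), steps=10):
    """Placeholder planner: head toward a goal, nudged away from each object.
    Stands in for whatever planner the metric would wrap."""
    pos, plan = np.zeros(2), []
    for _ in range(steps):
        step = (goal - pos) / steps
        for obj in objects:                       # simple repulsion from objects
            d = pos - obj
            step += d / (np.linalg.norm(d) ** 2 + 1.0)
        pos = pos + step
        plan.append(pos.copy())
    return np.stack(plan)

def planner_centric_score(gt_objects, detected_objects):
    """Score detections by the change they induce in the downstream plan:
    identical plans give 0; mistakes that matter for driving give large values."""
    return float(np.abs(toy_planner(gt_objects) - toy_planner(detected_objects)).mean())

gt = [np.array([2.0, 10.0])]
print(planner_centric_score(gt, gt), planner_centric_score(gt, []))  # 0.0 vs. a false negative
```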
Wenzheng Chen, Parsa Mirdehghan, Sanja Fidler, Kiriakos N. Kutulakos
In Computer Vision and Pattern Recognition (CVPR), 2020
Paper  Abstract  Project page  Bibtex@inproceedings{autostructlight20,
title = {Auto-Tuning Structured Light by Optical Stochastic Gradient Descent},
author = {Wenzheng Chen and Parsa Mirdehghan and Sanja Fidler and Kiriakos N. Kutulakos},
booktitle = {CVPR},
year = {2020}}
We consider the problem of optimizing the performance of an active imaging system by automatically discovering the illuminations it should use, and the way to decode them. Our approach tackles two seemingly incompatible goals: (1) "tuning" the illuminations and decoding algorithm precisely to the devices at hand---to their optical transfer functions, non-linearities, spectral responses, image processing pipelines---and (2) doing so without modeling or calibrating the system; without modeling the scenes of interest; and without prior training data. The key idea is to formulate a stochastic gradient descent (SGD) optimization procedure that puts the actual system in the loop: projecting patterns, capturing images, and calculating the gradient of expected reconstruction error. We apply this idea to structured-light triangulation to "auto-tune" several devices---from smartphones and laser projectors to advanced computational cameras. Our experiments show that despite being model-free and automatic, optical SGD can boost system 3D accuracy substantially over state-of-the-art coding schemes.
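A hedged sketch of the hardware-in-the-loop loop described above: since the paper's optical gradient computation is not reproduced here, a simultaneous-perturbation (SPSA-style) estimate stands in for it, and the `project_and_capture` and `reconstruction_error` callables are placeholders for the real projector/camera and decoder.

```python
import numpy as np

def optimize_patterns(project_and_capture, reconstruction_error, patterns,
                      iters=100, lr=0.05, eps=0.01, seed=0):
    """System-in-the-loop pattern optimization sketch with an SPSA-style
    two-sided probe standing in for the optical gradient estimate."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        delta = rng.choice([-1.0, 1.0], size=patterns.shape)
        err_plus = reconstruction_error(project_and_capture(patterns + eps * delta))
        err_minus = reconstruction_error(project_and_capture(patterns - eps * delta))
        grad_est = (err_plus - err_minus) / (2 * eps) * delta
        patterns = np.clip(patterns - lr * grad_est, 0.0, 1.0)  # keep valid intensities
    return patterns

# Synthetic stand-ins so the sketch runs end to end (a real setup would project
# and capture with actual hardware inside the loop).
target = np.linspace(0.0, 1.0, 32)
capture = lambda p: p                              # "camera" just echoes the patterns
error = lambda imgs: float(((imgs - target) ** 2).mean())
tuned = optimize_patterns(capture, error, patterns=np.full(32, 0.5))
```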
Tianshi Cao, Marc Law, Sanja Fidler
In International Conference on Learning Representations (ICLR), 2020
Paper  Abstract  Bibtex@inproceedings{protoanalysis20,
title = {A Theoretical Analysis of the Number of Shots in Few-Shot Learning},
author = {Tianshi Cao and Marc Law and Sanja Fidler},
booktitle = {ICLR},
year = {2020}}
Few-shot classification is the task of predicting the category of an example from a set of few labeled examples. The number of labeled examples per category is called the number of shots (or shot number). Recent works tackle this task through meta-learning, where a meta-learner extracts information from observed tasks during meta-training to quickly adapt to new tasks during meta-testing. In this formulation, the number of shots exploited during meta-training has an impact on the recognition performance at meta-test time. Generally, the shot number used in meta-training should match the one used in meta-testing to obtain the best performance. We introduce a theoretical analysis of the impact of the shot number on Prototypical Networks, a state-of-the-art few-shot classification method. From our analysis, we propose a simple method that is robust to the choice of shot number used during meta-training, which is a crucial hyperparameter. Our model trained with an arbitrary meta-training shot number performs well across a range of meta-testing shot numbers. We experimentally demonstrate our approach on different few-shot classification benchmarks.
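For context, a minimal Prototypical Networks episode on pre-embedded examples: prototypes are support-set class means and queries are scored by negative squared distance to them. The number of shots used to form the prototypes at meta-training versus meta-testing is exactly the hyperparameter the analysis concerns; the embedding network and episode sampling are omitted.

```python
import torch

def prototypical_logits(support, support_labels, query, n_way):
    """One Prototypical Networks episode on already-embedded examples.
    support: (n_way * k_shot, D), support_labels: values in [0, n_way), query: (Q, D)."""
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(n_way)])   # (n_way, D) class means
    # Negative squared Euclidean distance to each prototype acts as the logit.
    return -torch.cdist(query, prototypes).pow(2)       # (Q, n_way)
```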
Wei Yu, Yichao Lu, Steve Easterbrook, Sanja Fidler
In International Conference on Learning Representations (ICLR), 2020
Paper  Abstract  Bibtex@inproceedings{crevnet20,
title = {Efficient and Information-Preserving Future Frame Prediction and Beyond},
author = {Wei Yu and Yichao Lu and Steve Easterbrook and Sanja Fidler},
booktitle = {ICLR},
year = {2020}}
Applying resolution-preserving blocks is a common practice to maximize information preservation in video prediction, yet their high memory consumption greatly limits their application scenarios. We propose CrevNet, a Conditionally Reversible Network that uses reversible architectures to build a bijective two-way autoencoder and its complementary recurrent predictor. Our model enjoys the theoretically guaranteed property of no information loss during the feature extraction, much lower memory consumption and computational efficiency. The lightweight nature of our model enables us to incorporate 3D convolutions without concern of memory bottleneck, enhancing the model’s ability to capture both short-term and long-term temporal dependencies. Our proposed approach achieves state-of-the-art results on Moving MNIST, Traffic4cast and KITTI datasets. We further demonstrate the transferability of our self-supervised learning method by exploiting its learnt features for object detection on KITTI. Our competitive results indicate the potential of using CrevNet as a generative pre-training strategy to guide downstream tasks.
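The reversible building-block idea behind a bijective encoder can be illustrated with additive coupling; the exact coupling functions and channel split used by CrevNet are assumptions here. Because the input is reconstructed exactly from the output, feature extraction loses no information and intermediate activations need not be stored.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Additive-coupling reversible block: forward() is exactly invertible."""

    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g  # arbitrary sub-networks acting on half the channels

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Sanity check with two tiny conv sub-networks on a channel-split tensor.
block = ReversibleBlock(nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 3, padding=1))
x1, x2 = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
r1, r2 = block.inverse(*block(x1, x2))
print(torch.allclose(x1, r1, atol=1e-6), torch.allclose(x2, r2, atol=1e-6))
```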
Krishna Murthy Jatavallabhula, Edward Smith, Jean-Francois Lafleche, Clement Fuji Tsang, Artem Rozantsev, Wenzheng Chen, Tommy Xiang, Rev Lebaredian, Sanja Fidler
Technical Report (arXiv:1911.05063), 2019
Tech Report  Abstract  Project page  Bibtex@inproceedings{Kaolin19,
title = {Kaolin: A PyTorch Library for Accelerating 3D Deep Learning Research},
author = {Krishna Murthy Jatavallabhula and Edward Smith and Jean-Francois Lafleche and Clement Fuji Tsang and Artem Rozantsev and Wenzheng Chen and Tommy Xiang and Rev Lebaredian and Sanja Fidler},
booktitle = {arXiv:1911.05063},
year = {2019}}
Kaolin is a PyTorch library aiming to accelerate 3D deep learning research. Kaolin provides efficient implementations of differentiable 3D modules for use in deep learning systems. With functionality to load and preprocess several popular 3D datasets, and native functions to manipulate meshes, pointclouds, signed distance functions, and voxel grids, Kaolin mitigates the need to write wasteful boilerplate code. Kaolin packages together several differentiable graphics modules including rendering, lighting, shading, and view warping. Kaolin also supports an array of loss functions and evaluation metrics for seamless evaluation and provides visualization functionality to render the 3D results. Importantly, we curate a comprehensive model zoo comprising many state-of-the-art 3D deep learning architectures, to serve as a starting point for future research endeavours.
Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, Sanja Fidler
In Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{DIBR19,
title = {Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer},
author = {Wenzheng Chen and Huan Ling and Jun Gao and Edward Smith and Jaakko Lehtinen and Alec Jacobson and Sanja Fidler},
booktitle = {NeurIPS},
year = {2019}}
Many machine learning models operate on images, but ignore the fact that images are 2D projections formed by 3D geometry interacting with light, in a process called rendering. Enabling ML models to understand image formation might be key for generalization. However, due to an essential rasterization step involving discrete assignment operations, rendering pipelines are non-differentiable and thus largely inaccessible to gradient-based ML techniques. In this paper, we present DIB-R, a differentiable rendering framework which allows gradients to be analytically computed for all pixels in an image. Key to our approach is to view foreground rasterization as a weighted interpolation of local properties and background rasterization as a distance-based aggregation of global geometry. Our approach allows for accurate optimization over vertex positions, colors, normals, light directions and texture coordinates through a variety of lighting models. We showcase our approach in two ML applications: single-image 3D object prediction, and 3D textured object generation, both trained exclusively using 2D supervision.
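A simplified sketch of the "distance-based aggregation" idea for silhouette/background pixels; the kernel and the aggregation rule are simplifications rather than DIB-R's exact formulation. Each face exerts a soft influence that decays with a pixel's distance to its projection, and the influences combine into a differentiable coverage probability.

```python
import torch

def soft_silhouette(pixel_to_face_dist, sigma=1e-2):
    """Differentiable coverage from per-face distances (simplified).
    pixel_to_face_dist: (P, F) distance of each pixel to each projected face;
    zero or negative values mean the pixel lies inside that face."""
    influence = torch.sigmoid(-pixel_to_face_dist / sigma)  # decays outside the face
    # Probability that at least one face covers the pixel, differentiable in the distances.
    return 1.0 - torch.prod(1.0 - influence, dim=1)         # (P,)
```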
Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, Sanja Fidler
In International Conference on Computer Vision (ICCV), Seoul, Korea, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{Metasim19,
title = {Meta-Sim: Learning to Generate Synthetic Datasets},
author = {Amlan Kar and Aayush Prakash and Ming-Yu Liu and Eric Cameracci and Justin Yuan and Matt Rusiniak and David Acuna and Antonio Torralba and Sanja Fidler},
booktitle = {ICCV},
year = {2019}}
Training models to high-end performance requires availability of large labeled datasets, which are expensive to get. The goal of our work is to automatically synthesize labeled datasets that are relevant for a downstream task. We propose Meta-Sim, which learns a generative model of synthetic scenes, and obtain images as well as its corresponding ground-truth via a graphics engine. We parametrize our dataset generator with a neural network, which learns to modify attributes of scene graphs obtained from probabilistic scene grammars, so as to minimize the distribution gap between its rendered outputs and target data. If the real dataset comes with a small labeled validation set, we additionally aim to optimize a meta-objective, i.e. downstream task performance. Experiments show that the proposed method can greatly improve content generation quality over a human-engineered probabilistic scene grammar, both qualitatively and quantitatively as measured by performance on a downstream task.
Hang Chu, Daiqing Li, David Acuna, Amlan Kar, Maria Shugrina, Xinkai Wei, Ming-Yu Liu, Antonio Torralba, Sanja Fidler
In International Conference on Computer Vision (ICCV), Seoul, Korea, 2019
Paper  Abstract  Bibtex@inproceedings{NTG19,
title = {Neural Turtle Graphics for Modeling City Road Layouts},
author = {Hang Chu and Daiqing Li and David Acuna and Amlan Kar and Maria Shugrina and Xinkai Wei and Ming-Yu Liu and Antonio Torralba and Sanja Fidler},
booktitle = {ICCV},
year = {2019}}
We propose Neural Turtle Graphics (NTG), a novel generative model for spatial graphs, and demonstrate its applications in modeling city road layouts. Specifically, we represent the road layout using a graph where nodes in the graph represent control points and edges in the graph represent road segments. NTG is a sequential generative model parameterized by a neural network. It iteratively generates a new node and an edge connecting to an existing node conditioned on the current graph. We train NTG on Open Street Map data and show it outperforms existing approaches using a set of diverse performance metrics. Moreover, our method allows users to control styles of generated road layouts mimicking existing cities as well as to sketch a part of the city road layout to be synthesized. In addition to synthesis, the proposed NTG also finds use in the analytical task of aerial road parsing. Experimental results show that it achieves state-of-the-art performance on the SpaceNet dataset.
Towaki Takikawa, David Acuna, Varun Jampani, Sanja Fidler
In International Conference on Computer Vision (ICCV), Seoul, Korea, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{GSCNN19,
title = {Gated-SCNN: Gated Shape CNNs for Semantic Segmentation},
author = {Towaki Takikawa and David Acuna and Varun Jampani and Sanja Fidler},
booktitle = {ICCV},
year = {2019}}
Current state-of-the-art methods for image segmentation form a dense image representation where the color, shape and texture information are all processed together inside a deep CNN. This however may not be ideal as they contain very different types of information relevant for recognition. Here, we propose a new two-stream CNN architecture for semantic segmentation that explicitly wires shape information as a separate processing branch, i.e. shape stream, that processes information in parallel to the classical stream. Key to this architecture is a new type of gate that connects the intermediate layers of the two streams. Specifically, we use the higher-level activations in the classical stream to gate the lower-level activations in the shape stream, effectively removing noise and helping the shape stream to focus only on processing the relevant boundary-related information. This enables us to use a very shallow architecture for the shape stream that operates at image-level resolution. Our experiments show that this leads to a highly effective architecture that produces sharper predictions around object boundaries and significantly boosts performance on thinner and smaller objects. Our method achieves state-of-the-art performance on the Cityscapes benchmark, in terms of both mask (mIoU) and boundary (F-score) quality, improving by 2% and 4% over strong baselines.
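A minimal sketch of the gating idea between the two streams; the channel counts and 1x1 convolutions are assumptions rather than the paper's exact gate. Higher-level classical-stream features produce a sigmoid attention map that filters the shape-stream features.

```python
import torch
import torch.nn as nn

class ShapeGate(nn.Module):
    """Toy gate: classical-stream features produce a sigmoid map that filters
    the shape stream so it keeps only boundary-relevant activations."""

    def __init__(self, shape_ch, classical_ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(shape_ch + classical_ch, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        self.refine = nn.Conv2d(shape_ch, shape_ch, kernel_size=1)

    def forward(self, shape_feat, classical_feat):
        # Assumes classical_feat has already been resized to shape_feat's resolution.
        alpha = self.gate(torch.cat([shape_feat, classical_feat], dim=1))  # (B, 1, H, W)
        return self.refine(shape_feat * alpha)
```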
Kevin Shen, Amlan Kar, Sanja Fidler
In International Conference on Computer Vision (ICCV), Seoul, Korea, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{Shen19,
title = {Lifelong Learning for Image Captioning by Asking Natural Language Questions},
author = {Kevin Shen and Amlan Kar and Sanja Fidler},
booktitle = {ICCV},
year = {2019}}
In order to bring artificial agents into our lives, we will need to go beyond supervised learning on closed datasets to having the ability to continuously expand knowledge. Inspired by a student learning in a classroom, we present an agent that can continuously learn by posing natural language questions to humans. Our agent is composed of three interacting modules, one that performs captioning, another that generates questions and a decision maker that learns when to ask questions by implicitly reasoning about the uncertainty of the agent and expertise of the teacher. As compared to current active learning methods which query images for full captions, our agent is able to ask pointed questions to improve the generated captions. The agent trains on the improved captions, expanding its knowledge. We show that our approach achieves better performance using less human supervision than the baselines on the challenging MSCOCO dataset.
Makarand Tapaswi, Marc Law, Sanja Fidler
In International Conference on Computer Vision (ICCV), Seoul, Korea, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{BallClust19,
title = {Video Face Clustering with Unknown Number of Clusters},
author = {Makarand Tapaswi and Marc Law and Sanja Fidler},
booktitle = {ICCV},
year = {2019}}
Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing. We address the challenging problem of clustering face tracks based on their identity. Different from previous work in this area, we choose to operate in a realistic and difficult setting where: (i) the number of characters is not known a priori; and (ii) face tracks belonging to minor or background characters are not discarded. To this end, we propose Ball Cluster Learning (BCL), a supervised approach to carve the embedding space into balls of equal size, one for each cluster. The learned ball radius is easily translated to a stopping criterion for iterative merging algorithms. This gives BCL the ability to estimate the number of clusters as well as their assignment, achieving promising results on commonly used datasets. We also present a thorough discussion of how existing metric learning literature can be adapted for this task.
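To illustrate how a learned ball radius can act as a stopping criterion, the sketch below merges face-track embeddings agglomeratively until the closest pair of cluster centroids exceeds a radius-derived threshold, which simultaneously fixes the number of clusters. Centroid linkage and the threshold value are assumptions, not the paper's exact procedure.

```python
import numpy as np

def merge_with_radius(embeddings, stop_dist):
    """Greedy agglomerative merging that stops once the two closest cluster
    centroids are farther apart than `stop_dist` (e.g. derived from a learned
    ball radius). Returns the surviving clusters as lists of track indices."""
    clusters = [[i] for i in range(len(embeddings))]
    centroids = [embeddings[i].astype(float) for i in range(len(embeddings))]
    while len(clusters) > 1:
        best, best_d = None, np.inf
        for a in range(len(centroids)):            # find the closest centroid pair
            for b in range(a + 1, len(centroids)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if d < best_d:
                    best, best_d = (a, b), d
        if best_d > stop_dist:                     # learned radius gives the stopping criterion
            break
        a, b = best
        clusters[a] += clusters.pop(b)
        centroids[a] = embeddings[clusters[a]].mean(axis=0)
        centroids.pop(b)
    return clusters

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(merge_with_radius(points, stop_dist=1.0))    # -> two clusters of two tracks
```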
Xiaohui Zeng, Renjie Liao, Li Gu, Yuwen Xiong, Sanja Fidler, Raquel Urtasun
In International Conference on Computer Vision (ICCV), Seoul, Korea, 2019
Paper  Abstract  Code  Bibtex@inproceedings{DMMNet19,
title = {DMM-Net: Differentiable Mask-Matching Network for Video Instance Segmentation},
author = {Xiaohui Zeng and Renjie Liao and Li Gu and Yuwen Xiong and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2019}}
In this paper, we propose the differentiable mask-matching network (DMM-Net) for solving the video object segmentation problem where the initial object masks are provided. Relying on the Mask R-CNN backbone, we extract mask proposals per frame and formulate the matching between object templates and proposals as a linear assignment problem where the cost matrix is predicted by a deep convolutional neural network. We propose a differentiable matching layer which unrolls a projected gradient descent algorithm in which the projection step exploits Dykstra's algorithm. We prove that under mild conditions, the matching is guaranteed to converge to the optimal one. In practice, it achieves similar performance compared to the Hungarian algorithm during inference. Importantly, we can back-propagate through it to learn the cost matrix. After matching, a U-Net style architecture is exploited to refine the matched mask per time step. On the DAVIS 2017 dataset, DMM-Net achieves the best performance without online learning on the first frames and the 2nd best with it. Without any fine-tuning, DMM-Net performs comparably to state-of-the-art methods on the SegTrack v2 dataset. Finally, our differentiable matching layer is very simple to implement; we attach the PyTorch code in the supplementary material, which is less than 50 lines long.
David Acuna, Amlan Kar, Sanja Fidler
Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{Steal19,
title = {Devil is in the Edges: Learning Semantic Boundaries from Noisy Annotations},
author = {David Acuna and Amlan Kar and Sanja Fidler},
booktitle = {CVPR},
year = {2019}}
We tackle the problem of semantic boundary prediction, which aims to identify pixels that belong to object (class) boundaries. We notice that relevant datasets contain a significant level of label noise, reflecting the fact that precise annotations are laborious to get and thus annotators trade off quality for efficiency. We aim to learn sharp and precise semantic boundaries by explicitly reasoning about annotation noise during training. We propose a simple new layer and loss that can be used with existing learning-based boundary detectors. Our layer/loss forces the detector to predict a maximum response along the normal direction at an edge, while also regularizing its direction. We further reason about true object boundaries during training using a level set formulation, which allows the network to learn from misaligned labels in an end-to-end fashion. Experiments show that we improve over the CASENet backbone network by more than 4% in terms of MF(ODS) and 18.61% in terms of AP, outperforming all current state-of-the-art methods including those that deal with alignment. Furthermore, we show that our learned network can be used to significantly improve coarse segmentation labels, lending itself as an efficient way to label new data.
Huan Ling*, Jun Gao*, Amlan Kar, Wenzheng Chen, Sanja Fidler
Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019
Paper  Abstract  Code  Bibtex@inproceedings{CurveGCN19,
title = {Fast Interactive Object Annotation with Curve-GCN},
author = {Huan Ling and Jun Gao and Amlan Kar and Wenzheng Chen and Sanja Fidler},
booktitle = {CVPR},
year = {2019}}
Manually labeling objects by tracing their boundaries is a laborious process. In Polygon-RNN++ the authors proposed Polygon-RNN that produces polygonal annotations in a recurrent manner using a CNN-RNN architecture, allowing interactive correction via humans-in-the-loop. We propose a new framework that alleviates the sequential nature of Polygon-RNN, by predicting all vertices simultaneously using a Graph Convolutional Network (GCN). Our model is trained end-to-end. It supports object annotation by either polygons or splines, facilitating labeling efficiency for both line-based and curved objects. We show that Curve-GCN outperforms all existing approaches in automatic mode, including the powerful PSP-DeepLab and is significantly more efficient in interactive mode than Polygon-RNN++. Our model runs at 29.3ms in automatic, and 2.6ms in interactive mode, making it 10x and 100x faster than Polygon-RNN++.
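A minimal sketch of predicting all polygon vertices simultaneously with a graph convolution over the closed contour; the feature construction, number of layers, and aggregation are assumptions. Every vertex aggregates itself and its two neighbours, and a shared head outputs its 2D offset in parallel with all others.

```python
import torch
import torch.nn as nn

class PolygonGCNLayer(nn.Module):
    """One graph-convolution step on a closed polygon: each vertex mixes its own
    features with its two neighbours', then a shared head predicts a 2D offset."""

    def __init__(self, feat_dim):
        super().__init__()
        self.self_fc = nn.Linear(feat_dim, feat_dim)
        self.nbr_fc = nn.Linear(feat_dim, feat_dim)
        self.offset = nn.Linear(feat_dim, 2)

    def forward(self, vert_feats):
        # vert_feats: (N, D) features of the N polygon vertices, in contour order.
        prev_f = torch.roll(vert_feats, shifts=1, dims=0)
        next_f = torch.roll(vert_feats, shifts=-1, dims=0)
        h = torch.relu(self.self_fc(vert_feats) + self.nbr_fc(prev_f + next_f))
        return self.offset(h)  # offsets for all vertices are predicted in parallel
```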
Zian Wang, David Acuna, Huan Ling, Amlan Kar, Sanja Fidler
Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019
Paper  Abstract  Bibtex@inproceedings{DELSE19,
title = {Object Instance Annotation with Deep Extreme Level Set Evolution},
author = {Zian Wang and David Acuna and Huan Ling and Amlan Kar and Sanja Fidler},
booktitle = {CVPR},
year = {2019}}
In this paper, we tackle the task of interactive object segmentation. We revive the old ideas on level set segmentation which framed object annotation as curve evolution. Carefully designed energy functions ensured that the curve was well aligned with image boundaries, and generally "well behaved". The Level Set Method can handle objects with complex shapes and topological changes such as merging and splitting, thus able to deal with occluded objects and objects with holes. We propose Deep Extreme Level Set Evolution that combines powerful CNN models with level set optimization in an end-to-end fashion. Our method learns to predict evolution parameters conditioned on the image and evolves the predicted initial contour to produce the final result. We make our model interactive by incorporating user clicks on the extreme boundary points, following DEXTR. We show that our approach significantly outperforms DEXTR on the static Cityscapes dataset and the video segmentation benchmark DAVIS, and performs on par on PASCAL and SBD.
Yuan-Hong Liao, Xavier Puig, Marko Boben, Antonio Torralba, Sanja Fidler
Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{VirtualHome2,
title = {Synthesizing Environment-Aware Activities via Activity Sketches},
author = {Yuan-Hong Liao and Xavier Puig and Marko Boben and Antonio Torralba and Sanja Fidler},
booktitle = {CVPR},
year = {2019}}
In order to learn to perform activities from demonstrations or descriptions, agents need to distill what the essence of the given activity is, and how it can be adapted to new environments. In this work, we address the problem of environment-aware program generation. Given a visual demonstration or a description of an activity, we generate program sketches representing the essential instructions and propose a model to transform these into full programs representing the actions needed to perform the activity under the presented environmental constraints. To this end, we build upon VirtualHome to create a new dataset VirtualHome-Env, where we collect program sketches to represent activities and match programs with environments that can afford them. Furthermore, we construct a knowledge base to sample realistic environments and another knowledge base to seek out the programs under the sampled environments. Finally, we propose ResActGraph, a network that generates a program from a given sketch and an environment graph and tracks the changes in the environment induced by the program.
Maria Shugrina, Ziheng Liang, Amlan Kar, Jiaman Li, Angad Singh, Karan Singh, Sanja Fidler
Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{CreativeFlow19,
title = {Creative Flow+ Dataset},
author = {Maria Shugrina and Ziheng Liang and Amlan Kar and Jiaman Li and Angad Singh and Karan Singh and Sanja Fidler},
booktitle = {CVPR},
year = {2019}}
We present the Creative Flow+ Dataset, the first diverse multi-style artistic video dataset richly labeled with per-pixel optical flow, occlusions, correspondences, segmentation labels, normals, and depth. Our dataset includes 3000 animated sequences rendered using styles randomly selected from 40 textured line styles and 38 shading styles, spanning the range between flat cartoon fill and wildly sketchy shading. Our dataset includes 124K+ train set frames and 10K test set frames rendered at 1500x1500 resolution, far surpassing the largest available optical flow datasets in size. While modern techniques for tasks such as optical flow estimation achieve impressive performance on realistic images and video, today there is no way to gauge their performance on non-photorealistic images. Creative Flow+ poses a new challenge to generalize real-world Computer Vision to messy stylized content. We show that learning-based optical flow methods fail to generalize to this data and struggle to compete with classical approaches, and invite new research in this area. Our dataset and a new optical flow benchmark will be publicly available at: www.cs.toronto.edu/creativeflow/. We further release the complete dataset creation pipeline, allowing the community to generate and stylize their own data on demand.
Dominic Cheng, Renjie Liao, Sanja Fidler, Raquel Urtasun
Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{DARnet19,
title = {DARNet: Deep Active Ray Network for Building Segmentation},
author = {Dominic Cheng and Renjie Liao and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2019}}
In this paper, we propose a Deep Active Ray Network (DARNet) for automatic building segmentation. Taking an image as input, it first exploits a deep convolutional neural network (CNN) as the backbone to predict energy maps, which are further utilized to construct an energy function. A polygon-based contour is then evolved via minimizing the energy function, of which the minimum defines the final segmentation. Instead of parameterizing the contour using Euclidean coordinates, we adopt polar coordinates, i.e., rays, which not only prevents self-intersection but also simplifies the design of the energy function. Moreover, we propose a loss function that directly encourages the contours to match building boundaries. Our DARNet is trained end-to-end by back-propagating through the energy minimization and the backbone CNN, which makes the CNN adapt to the dynamics of the contour evolution. Experiments on three building instance segmentation datasets demonstrate our DARNet achieves either state-of-the-art or comparable performances to other competitors.
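A small sketch of the polar ("active ray") contour parameterization mentioned above: vertices are one radius per fixed, ordered angle around a center point, which rules out self-intersection by construction. The energy function and its minimization are not reproduced here.

```python
import numpy as np

def rays_to_contour(center, radii):
    """Active-ray contour: one radius per fixed angle around a center.
    Fixed, ordered angles mean the resulting contour cannot self-intersect."""
    angles = np.linspace(0.0, 2.0 * np.pi, num=len(radii), endpoint=False)
    xs = center[0] + radii * np.cos(angles)
    ys = center[1] + radii * np.sin(angles)
    return np.stack([xs, ys], axis=1)             # (N, 2) contour vertices

contour = rays_to_contour(center=(64.0, 64.0), radii=np.full(36, 20.0))
```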
Davide Moltisanti, Sanja Fidler, Dima Damen
Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{Moltisani19,
title = {Action Recognition from Single Timestamp Supervision in Untrimmed Videos},
author = {Davide Moltisanti and Sanja Fidler and Dima Damen},
booktitle = {CVPR},
year = {2019}}
Recognizing actions in videos relies on labelled supervision during training, typically the start and end times of each action instance. This supervision is not only subjective, but also expensive to acquire. Weak video-level supervision has been successfully exploited for recognition in untrimmed videos, however it is challenged when the number of different actions in training videos increases. We propose a method that is supervised by single timestamps located around each action instance, in untrimmed videos. We replace expensive action bounds with sampling distributions initialized from these timestamps. We then use the classifier's response to iteratively update the sampling distributions. We demonstrate that these distributions converge to the location and extent of discriminative action segments. We evaluate our method on three datasets for fine-grained recognition, with an increasing number of different actions per video, and show that single timestamps offer a reasonable compromise between recognition performance and labelling effort, performing comparably to full temporal supervision. Our update method improves top-1 test accuracy by up to 5.4% across the evaluated datasets.
Chaoqi Wang, Roger Grosse, Sanja Fidler, Guodong Zhang
International Conference on Machine Learning (ICML), Long Beach, USA, 2019
Paper  Abstract  Code  Bibtex@inproceedings{EigenDamage19,
title = {EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis},
author = {Chaoqi Wang and Roger Grosse and Sanja Fidler and Guodong Zhang},
booktitle = {ICML},
year = {2019}}
Reducing the test time resource requirements of a neural network while preserving test accuracy is crucial for running inference on resource-constrained devices. To achieve this goal, we introduce a novel network reparameterization based on the Kronecker-factored eigenbasis (KFE), and then apply Hessian-based structured pruning methods in this basis. As opposed to existing Hessian-based pruning algorithms which do pruning in parameter coordinates, our method works in the KFE where different weights are approximately independent, enabling accurate pruning and fast computation. We demonstrate empirically the effectiveness of the proposed method through extensive experiments. In particular, we highlight that the improvements are especially significant for more challenging datasets and networks. With negligible loss of accuracy, an iterative-pruning version gives a 10x reduction in model size and an 8x reduction in FLOPs on wide ResNet32.
Seung Wook Kim, Makarand Tapaswi, Sanja Fidler
International Conference on Learning Representations (ICLR), New Orleans, USA, 2019
Paper  Abstract  Code  Bibtex@inproceedings{PMN2018,
title = {Visual Reasoning by Progressive Module Networks},
author = {Seung Wook Kim and Makarand Tapaswi and Sanja Fidler},
booktitle = {ICLR},
year = {2019}}
Humans learn to solve tasks of increasing complexity by building on top of previously acquired knowledge. Typically, there exists a natural progression in the tasks that we learn - most do not require completely independent solutions, but can be broken down into simpler subtasks. We propose to represent a solver for each task as a neural module that calls existing modules (solvers for simpler tasks) in a functional program-like manner. Lower modules are a black box to the calling module, and communicate only via a query and an output. Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output. Our model effectively combines previous skill-sets, does not suffer from forgetting, and is fully differentiable. We test our model in learning a set of visual reasoning tasks, and demonstrate improved performances in all tasks by learning progressively. By evaluating the reasoning process using human judges, we show that our model is more interpretable than an attention-based baseline.
Tingwu Wang, Yuhao Zhou, Sanja Fidler, Jimmy Ba
International Conference on Learning Representations (ICLR), New Orleans, USA, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{NGE2019,
title = {Neural Graph Evolution: Automatic Robot Design},
author = {Tingwu Wang and Yuhao Zhou and Sanja Fidler and Jimmy Ba},
booktitle = {ICLR},
year = {2019}}
Despite the recent successes in robotic locomotion control, the design of robots still relies heavily on human engineering. Automatic robot design has been a long-studied subject, but recent progress has been slowed by the large combinatorial search space and the difficulty of evaluating the found candidates. To address these two challenges, we formulate automatic robot design as a graph search problem and perform evolution search in graph space. We propose Neural Graph Evolution (NGE), which performs selection on current candidates and evolves new ones iteratively. Different from previous approaches, NGE uses graph neural networks to parameterize the control policies, which reduces evaluation cost on new candidates with the help of skill transfer from previously evaluated designs. In addition, NGE applies Graph Mutation with Uncertainty (GM-UC) by incorporating model uncertainty, which reduces the search space by balancing exploration and exploitation. We show that NGE significantly outperforms previous methods by an order of magnitude. As shown in experiments, NGE is the first algorithm that can automatically discover kinematically preferred robotic graph structures, such as a fish with two symmetrical flat side-fins and a tail, or a cheetah with athletic front and back legs. Instead of using thousands of cores for weeks, NGE efficiently solves the search problem within a day on a single 64 CPU-core Amazon EC2 machine.
Maria Shugrina, Wenjia Zhang, Fanny Chevalier, Sanja Fidler, Karan Singh
CHI Conference on Human Factors in Computing Systems (CHI), Glasgow, UK, 2019
Paper  Abstract  Project page  Bibtex@inproceedings{CHI-Masha2019,
title = {Color Builder: A Direct Manipulation Interface for Versatile Color Theme Authoring},
author = {Maria Shugrina and Wenjia Zhang and Fanny Chevalier and Sanja Fidler and Karan Singh},
booktitle = {CHI},
year = {2019}}
Color themes or palettes are popular for sharing color combinations across many visual domains. We present a novel interface for creating color themes through direct manipulation of color swatches. Users can create and rearrange swatches, and combine them into smooth and step-based gradients and three-color blends -- all using a seamless touch or mouse input. Analysis of existing solutions reveals a fragmented color design workflow, where separate software is used for swatches, smooth and discrete gradients and for in-context color visualization. Our design unifies these tasks, while encouraging playful creative exploration. Adjusting a color using standard color pickers can break this interaction flow with mechanical slider manipulation. To keep interaction seamless, we additionally design an in situ color tweaking interface for freeform exploration of an entire color neighborhood. We evaluate our interface with a group of professional designers and students majoring in this field.
Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, Antonio Torralba
International Journal of Computer Vision (IJCV), 2018
Paper  Abstract  Dataset  Code  Bibtex@article{ADEIJCV,
title = {Semantic Understanding of Scenes Through the ADE20K Dataset},
author = {Bolei Zhou and Hang Zhao and Xavier Puig and Tete Xiao and Sanja Fidler and Adela Barriuso and Antonio Torralba},
journal = {IJCV},
year = {2018}}
Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and re-implement state-of-the-art models as open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.
Maria Shugrina, Amlan Kar, Karan Singh, Sanja Fidler
Arxiv preprint arXiv:1806.02918
Paper  Abstract  Project page  Bibtex@inproceedings{sails2018,
title = {Color Sails: Discrete-Continuous Palettes for Deep Color Exploration},
author = {Maria Shugrina and Amlan Kar and Karan Singh and Sanja Fidler},
booktitle = {arXiv:1806.02918},
year = {2018}}
We present color sails, a discrete-continuous color gamut representation that extends the color gradient analogy to three dimensions and allows interactive control of the color blending behavior. Our representation models a wide variety of color distributions in a compact manner, and lends itself to applications such as color exploration for graphic design, illustration and similar fields. We propose a Neural Network that can fit a color sail to any image. Then, the user can adjust color sail parameters to change the base colors, their blending behavior and the number of colors, exploring a wide range of options for the original design. In addition, we propose a Deep Learning model that learns to automatically segment an image into color-compatible alpha masks, each equipped with its own color sail. This allows targeted color exploration by either editing their corresponding color sails or using standard software packages. Our model is trained on a custom diverse dataset of art and design. We provide both quantitative evaluations, and a user study, demonstrating the effectiveness of color sail interaction. Interactive demos are available at www.colorsails.com.
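As a toy illustration of a three-color blend (a much simpler construct than the color sail itself, which also models blending behavior and can be fit to images), the snippet below mixes three base colors with barycentric weights; the linear RGB blend and all names are our own assumptions:

import numpy as np

def blend_three(colors, u, v):
    """Barycentric blend of three base RGB colors.
    colors: (3, 3) array with rows in [0, 1]; (u, v) are barycentric
    coordinates with u >= 0, v >= 0 and u + v <= 1; the third weight is implied."""
    w = np.array([u, v, 1.0 - u - v])
    return w @ np.asarray(colors, dtype=float)   # convex combination

# Sample a small triangular grid of blends between red, green and blue.
base = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
grid = [blend_three(base, u, v)
        for u in np.linspace(0, 1, 4)
        for v in np.linspace(0, 1 - u, 4)]
print(len(grid), grid[0])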
Bo Dai, Sanja Fidler, Dahua Lin
In Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018
Paper  Abstract  Bibtex@inproceedings{Dai18neurips,
title = {A Neural Compositional Paradigm for Image Captioning},
author = {Bo Dai and Sanja Fidler and Dahua Lin},
booktitle = {NeurIPS},
year = {2018}}
Mainstream captioning models often follow a sequential structure to generate captions, leading to issues such as the introduction of irrelevant semantics, a lack of diversity in the generated captions, and inadequate generalization performance. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption based on a recursive compositional procedure in a bottom-up manner. Compared to conventional models, our paradigm better preserves semantic content through an explicit factorization of semantics and syntax. With the compositional generation procedure, caption construction follows a recursive structure, which naturally fits the properties of human language. Moreover, the proposed compositional procedure requires less data to train, generalizes better, and yields more diverse captions.
Enric Corona, Kaustav Kundu, Sanja Fidler
In International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain
Paper  Abstract  Project page  Bibtex@inproceedings{pose2018,
title = {Pose Estimation for Objects with Rotational Symmetry},
author = {Enric Corona and Kaustav Kundu and Sanja Fidler},
booktitle = {IROS},
year = {2018}}
Pose estimation is a widely explored problem, enabling many robotic tasks such as grasping and manipulation. In this paper, we tackle the problem of pose estimation for objects that exhibit rotational symmetry, which are common in man-made and industrial environments. In particular, our aim is to infer poses for objects not seen at training time, but for which their 3D CAD models are available at test time. Previous work has tackled this problem by learning to compare captured views of real objects with the rendered views of their 3D CAD models, by embedding them in a joint latent space using neural networks. We show that sidestepping the issue of symmetry in this scenario during training leads to poor performance at test time. We propose a model that reasons about rotational symmetry during training by having access to only a small set of symmetry-labeled objects, while exploiting a large collection of unlabeled CAD models. We demonstrate that our approach significantly outperforms a naively trained neural network on a new pose dataset containing images of tools and hardware.
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray
In European Conference on Computer Vision (ECCV), Munich, Germany
Paper  Abstract  Project page  Bibtex@inproceedings{Damen2018EPICKITCHENS,
title = {Scaling Egocentric Vision: The EPIC-KITCHENS Dataset},
author = {Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Fidler, Sanja and
Furnari, Antonino and Kazakos, Evangelos and Moltisanti, Davide and Munro, Jonathan
and Perrett, Toby and Price, Will and Wray, Michael},
booktitle = {ECCV},
year = {2018}}
First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens.
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler
In British Machine Vision Conference (BMVC), Newcastle upon Tyne, UK
Paper  Abstract  Project page  Bibtex@inproceedings{vsepp2018,
title = {VSE++: Improving Visual-Semantic Embeddings with Hard Negatives},
author = {Fartash Faghri and David J. Fleet and Jamie Ryan Kiros and Sanja Fidler},
booktitle = {BMVC},
year = {2018}}
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
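The "simple change" is a hinge-based ranking loss that penalizes only the hardest negative in the batch instead of summing over all negatives. Below is a minimal NumPy sketch of such a max-hinge loss (the margin value, function name and in-batch negative sampling are our assumptions; the actual VSE++ code operates on learned embeddings inside an autodiff framework):

import numpy as np

def max_hinge_loss(img, cap, margin=0.2):
    """img, cap: (N, D) L2-normalized embeddings of matching image/caption pairs.
    For each positive pair, penalize only the hardest in-batch negative, in
    both retrieval directions (caption retrieval and image retrieval)."""
    scores = img @ cap.T                      # (N, N) cosine similarities
    pos = np.diag(scores)                     # matching pairs on the diagonal
    neg = np.where(np.eye(len(scores), dtype=bool), -np.inf, scores)
    hard_cap = neg.max(axis=1)                # hardest caption for each image
    hard_img = neg.max(axis=0)                # hardest image for each caption
    return (np.maximum(0.0, margin + hard_cap - pos)
            + np.maximum(0.0, margin + hard_img - pos)).mean()

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16)); img /= np.linalg.norm(img, axis=1, keepdims=True)
cap = rng.normal(size=(8, 16)); cap /= np.linalg.norm(cap, axis=1, keepdims=True)
print(max_hinge_loss(img, cap))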
Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, Antonio Torralba
In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018
Paper  Abstract  Project page  Press  Bibtex@inproceedings{VirtualHome2018,
title = {VirtualHome: Simulating Household Activities via Programs},
author = {Xavier Puig and Kevin Ra and Marko Boben and Jiaman Li and Tingwu Wang and Sanja Fidler and Antonio Torralba},
booktitle = {CVPR},
year = {2018}}
In this paper, we are interested in modeling complex activities that occur in a typical household. We propose to use programs, i.e., sequences of atomic actions and interactions, as a high level representation of complex tasks. Programs are interesting because they provide a non-ambiguous representation of a task, and allow agents to execute them. However, nowadays, there is no database providing this type of information. Towards this goal, we first crowd-source programs for a variety of activities that happen in people's homes, via a game-like interface used for teaching kids how to code. Using the collected dataset, we show how we can learn to extract programs directly from natural language descriptions or from videos. We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to ``drive'' an artificial agent to execute tasks in a simulated household environment. Our VirtualHome simulator allows us to create a large activity video dataset with rich ground-truth, enabling training and testing of video understanding models. We further showcase examples of our agent performing tasks in our VirtualHome based on language descriptions.
Paul Vicol, Makarand Tapaswi, Lluis Castrejon, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018
Paper  Abstract  Project page  Bibtex@inproceedings{Moviegraphs2018,
title = {MovieGraphs: Towards Understanding Human-Centric Situations from Videos},
author = {Paul Vicol and Makarand Tapaswi and Lluis Castrejon and Sanja Fidler},
booktitle = {CVPR},
year = {2018}}
There is growing interest in artificial intelligence to build socially intelligent robots. This requires machines to have the ability to "read" people's emotions, motivations, and other factors that affect behavior. Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (e.g., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions. In addition, most interactions and many attributes are grounded in the video with time stamps. We provide a thorough analysis of our dataset, showing interesting common-sense correlations between different social aspects of scenes, as well as across scenes over time. We propose a method for querying videos and text with graphs, and show that: 1) our graphs contain rich and sufficient information to summarize and localize each scene; and 2) subgraphs allow us to describe situations at an abstract level and retrieve multiple semantically relevant situations. We also propose methods for interaction understanding via ordering, and reasoning about the social scene. MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents.
Yuhao Zhou, Makarand Tapaswi, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018
Paper  Abstract  Project page  Bibtex@inproceedings{Movie4D2018,
title = {Now You Shake Me: Towards Automatic 4D Cinema},
author = {Yuhao Zhou and Makarand Tapaswi and Sanja Fidler},
booktitle = {CVPR},
year = {2018}}
We are interested in enabling automatic 4D cinema by parsing physical and special effects from untrimmed movies. These include effects such as physical interactions, water splashing, light, and shaking, and are grounded to either a character in the scene or the camera. We collect a new dataset referred to as the Movie4D dataset which annotates over 9K effects in 63 movies. We propose a Conditional Random Field model atop a neural network that brings together visual and audio information, as well as semantics in the form of person tracks. Our model further exploits correlations of effects between different characters in the clip as well as across movie threads. We propose effect detection and classification as two tasks, and present results along with ablation studies on our dataset, paving the way towards 4D cinema in everyone's homes.
David Acuna*, Huan Ling*, Amlan Kar*, Sanja Fidler
* Denotes equal contribution
In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018
Paper  Abstract  Project page  Bibtex@inproceedings{PolygonPP2018,
title = {Efficient Annotation of Segmentation Datasets with Polygon-RNN++},
author = {Acuna, David and Ling, Huan and Kar, Amlan and Fidler, Sanja},
booktitle = {CVPR},
year = {2018}}
Manually labeling datasets with object masks is extremely time consuming. In this work, we follow the idea of PolygonRNN to produce polygonal annotations of objects interactively using humans-in-the-loop. We introduce several important improvements to the model: 1) we design a new CNN encoder architecture, 2) show how to effectively train the model with Reinforcement Learning, and 3) significantly increase the output resolution using a Graph Neural Network, allowing the model to accurately annotate high resolution objects in images. Extensive evaluation on the Cityscapes dataset shows that our model, which we refer to as Polygon-RNN++, significantly outperforms the original model in both automatic (10% absolute and 16% relative improvement in mean IoU) and interactive modes (requiring 50% fewer clicks by annotators). We further analyze the cross-domain scenario in which our model is trained on one dataset, and used out of the box on datasets from varying domains. The results show that Polygon-RNN++ exhibits powerful generalization capabilities, achieving significant improvements over existing pixel-wise methods. Using simple online fine-tuning we further achieve a high reduction in annotation time for new datasets, moving a step closer towards an interactive annotation tool to be used in practice.
Ching-Yao Chuang, Jiaman Li, Antonio Torralba, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018
Paper  Abstract  Project page  Bibtex@inproceedings{ActProperly2018,
title = {Learning to Act Properly: Predicting and Explaining Affordances from Images},
author = {Chuang, Ching-Yao and Li, Jiaman and Torralba, Antonio and Fidler, Sanja},
booktitle = {CVPR},
year = {2018}}
We address the problem of affordance reasoning in diverse scenes that appear in the real world. Affordances relate the agent's actions to their effects when taken on the surrounding objects. In our work, we take the egocentric view of the scene, and aim to reason about action-object affordances that respect both the physical world and the social norms imposed by society. We also aim to teach artificial agents why some actions should not be taken in certain situations, and what would likely happen if these actions were taken. We collect a new dataset that builds upon ADE20k, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning. We propose a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object. Our model is showcased through various ablation studies, pointing to successes and challenges in this complex task.
Hang Chu, Daiqing Li, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018
Paper  Abstract  Project page  Bibtex@inproceedings{F2F2018,
title = {A Face-to-Face Neural Conversation Model},
author = {Hang Chu and Daiqing Li and Sanja Fidler},
booktitle = {CVPR},
year = {2018}}
Neural networks have recently become good at engaging in dialog. However, current approaches are based solely on verbal text, lacking the richness of a real face-to-face conversation. We propose a neural conversation model that aims to read and generate facial gestures alongside the text. This allows our model to adapt its response based on the ``mood'' of the conversation. In particular, we introduce an RNN encoder-decoder that exploits the movement of facial muscles, as well as the verbal conversation. The decoder consists of two layers, where the lower layer aims at generating the verbal response and coarse facial expressions, while the second layer fills in the subtle gestures, making the generated output smoother and more natural. We train our neural network by having it ``watch'' 250 movies. We showcase our joint face-text model in generating more natural conversations through automatic metrics and a human study. We demonstrate an example application with a face-to-face chatting avatar.
Hang Chu, Wei-Chiu Ma, Kaustav Kundu, Raquel Urtasun, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018
Paper  Abstract  Project page  Bibtex@inproceedings{SurfConv2018,
title = {SurfConv: Bridging 3D and 2D Convolution for RGBD Images},
author = {Hang Chu and Wei-Chiu Ma and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2018}}
The last few years have seen approaches trying to combine the increasing popularity of depth sensors and the success of convolutional neural networks. Using depth as an additional channel alongside the RGB input inherits the scale-variance problem of image-based convolution. On the other hand, 3D convolution wastes a large amount of memory on mostly unoccupied 3D space, of which only the surface visible to the sensor is occupied. Instead, we propose SurfConv, which "slides" compact 2D filters along the visible 3D surface. SurfConv is formulated as a simple depth-aware multi-scale 2D convolution, through a new Data-Driven Depth Discretization (D4) scheme. We demonstrate the effectiveness of our method on indoor and outdoor 3D semantic segmentation datasets. Our method achieves state-of-the-art performance while using less than 30% of the parameters used by 3D convolution based approaches.
Tingwu Wang, Renjie Liao, Jimmy Ba, Sanja Fidler
In International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018
Paper  Abstract  Project page  Bibtex@inproceedings{WangICLR2018,
title = {NerveNet: Learning Structured Policy with Graph Neural Networks},
author = {Tingwu Wang and Renjie Liao and Jimmy Ba and Sanja Fidler},
booktitle = {ICLR},
year = {2018}}
We address the problem of learning structured policies for continuous control. In traditional reinforcement learning, policies of agents are learned by MLPs which take the concatenation of all observations from the environment as input for predicting actions. In this work, we propose NerveNet to explicitly model the structure of an agent, which naturally takes the form of a graph. Specifically, serving as the agent's policy network, NerveNet first propagates information over the structure of the agent and then predicts actions for different parts of the agent. In our experiments, we first show that NerveNet is comparable to state-of-the-art methods on standard MuJoCo environments. We further propose customized reinforcement learning environments for benchmarking two types of structure transfer learning tasks, i.e., size and disability transfer. We demonstrate that policies learned by NerveNet are significantly better than policies learned by other models and are able to transfer even in a zero-shot setting.
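As a rough, framework-free sketch of the propagation idea (a simplification written for illustration, not the released NerveNet model), each joint of the agent is a graph node that repeatedly mixes its hidden state with messages from its neighbors, after which a readout shared across nodes produces per-joint actions; all weights below are random placeholders:

import numpy as np

def propagate(adj, h, w_self, w_msg, steps=3):
    """adj: (N, N) 0/1 adjacency of the agent's body graph (joints as nodes).
    h: (N, D) node states. Each step aggregates the mean neighbor state and
    applies a shared update; tanh keeps the recursion bounded."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)
    for _ in range(steps):
        msg = (adj @ h) / deg                  # mean message from neighbors
        h = np.tanh(h @ w_self + msg @ w_msg)  # update shared across nodes
    return h

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-joint chain
h0 = rng.normal(size=(3, 8))
w_self, w_msg, w_out = (rng.normal(size=(8, 8)), rng.normal(size=(8, 8)),
                        rng.normal(size=(8, 1)))
actions = np.tanh(propagate(adj, h0, w_self, w_msg) @ w_out)    # one action per joint
print(actions.ravel())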
Huan Ling, Sanja Fidler
In Neural Information Processing Systems (NIPS), Long Beach, USA, 2017
Paper  Abstract  Project page  Bibtex@inproceedings{LingNIPS2017,
title = {Teaching Machines to Describe Images via Natural Language Feedback},
author = {Huan Ling and Sanja Fidler},
booktitle = {NIPS},
year = {2017}}
Robots will eventually be part of every household. It is thus critical to enable algorithms to learn from and be guided by non-expert users. In this paper, we bring a human in the loop, and enable a human teacher to give feedback to a learning agent in the form of natural language. We argue that a descriptive sentence can provide a stronger learning signal than a numeric reward in that it can easily point to where the mistakes are and how to correct them. We focus on the problem of image captioning in which the quality of the output can easily be judged by non-experts. We propose a hierarchical phrase-based captioning model trained with policy gradients, and design a feedback network that provides rewards to the learner by conditioning on the human-provided feedback. We show that by exploiting descriptive feedback our model learns to perform better than when given independently written human captions.
Bo Dai, Sanja Fidler, Raquel Urtasun, Dahua Lin
In International Conference on Computer Vision (ICCV), Venice, Italy, 2017
Paper  Abstract  Bibtex@inproceedings{DaiICCV17,
title = {Towards Diverse and Natural Image Descriptions via a Conditional GAN},
author = {Bo Dai and Sanja Fidler and Raquel Urtasun and Dahua Lin},
booktitle = {ICCV},
year = {2017}}
Despite the substantial progress in recent years, image captioning techniques are still far from perfect. Sentences produced by existing methods, e.g., those based on RNNs, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, namely maximizing the likelihood of training samples. This principle encourages high resemblance to the ``ground-truth'' captions, while suppressing other reasonable descriptions. Conventional evaluation metrics, e.g., BLEU and METEOR, also favor such restrictive methods. In this paper, we explore an alternative approach, with the aim to improve naturalness and diversity -- two essential properties of human expression. Specifically, we propose a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content. It is noteworthy that training a sequence generator is nontrivial. We overcome this difficulty with Policy Gradient, a strategy stemming from Reinforcement Learning, which allows the generator to receive early feedback along the way. We tested our method on two large datasets, where it performed competitively against real people in our user study and outperformed other methods on various tasks.
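The Policy Gradient component is the standard score-function (REINFORCE) estimator. As a schematic illustration only (a single terminal reward, no rollouts, and names of our choosing), the surrogate loss below weights the log-probability of a sampled caption by the evaluator's reward; in practice an autodiff framework backpropagates this quantity through the generator:

import numpy as np

def reinforce_loss(log_probs, reward, baseline=0.0):
    """log_probs: (T,) log-probabilities of the sampled caption's tokens under
    the generator; reward: scalar score from the evaluator for that caption.
    Minimizing the surrogate raises the probability of captions scored above
    the baseline and lowers it otherwise."""
    return -(reward - baseline) * float(np.sum(log_probs))

# Toy numbers: a 5-token caption that the evaluator scored 0.8.
print(reinforce_loss(np.log([0.4, 0.3, 0.5, 0.2, 0.6]), reward=0.8, baseline=0.5))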
Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, Raquel Urtasun
In International Conference on Computer Vision (ICCV), Venice, Italy, 2017
Paper  Abstract  Bibtex@inproceedings{3dggnnICCV17,
title = {3D Graph Neural Networks for RGBD Semantic Segmentation},
author = {Xiaojuan Qi and Renjie Liao and Jiaya Jia and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2017}}
RGBD semantic segmentation requires joint reasoning about 2D appearance and 3D geometric information. In this paper we propose a 3D graph neural network (3DGNN) that builds a k-nearest neighbor graph on top of 3D point cloud. Each node in the graph corresponds to a set of points and is associated with a hidden representation vector initialized with an appearance feature extracted by a unary CNN from 2D images. Relying on recurrent functions, every node dynamically updates its hidden representation based on the current status and incoming messages from its neighbors. This propagation model is unrolled for a certain number of time steps and the final per-node representation is used for predicting the semantic class of each pixel. We use back-propagation through time to train the model. Extensive experiments on NYUD2 and SUN-RGBD datasets demonstrate the effectiveness of our approach.
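A minimal sketch of the graph construction described above (our own simplification; the update rule here is a plain average-and-tanh placeholder rather than the recurrent unit used in the paper): build a k-nearest-neighbor graph over the 3D points and repeatedly mix each node's state with the mean state of its neighbors:

import numpy as np

def knn_graph(points, k=4):
    """points: (N, 3) 3D coordinates. Returns (N, k) indices of each point's
    k nearest neighbors (excluding itself), via brute-force distances."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def message_pass(h, neighbors):
    """h: (N, D) node states initialized from 2D appearance features.
    Each node mixes its state with the mean state of its k neighbors."""
    return np.tanh(0.5 * h + 0.5 * h[neighbors].mean(axis=1))

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))     # stand-in for points from the depth map
h = rng.normal(size=(100, 16))      # stand-in for unary CNN features
nbrs = knn_graph(pts, k=4)
for _ in range(3):                  # unroll a few propagation steps
    h = message_pass(h, nbrs)
print(h.shape)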
Shenlong Wang, Min Bai, Gellert Mattyus, Hang Chu, Wenjie Luo, Bin Yang, Justin Liang, Joel Cheverie, Sanja Fidler, Raquel Urtasun
In International Conference on Computer Vision (ICCV), Venice, Italy, 2017
Paper  Abstract  Bibtex@inproceedings{TCity2017,
title = {TorontoCity: Seeing the World with a Million Eyes},
author = {Shenlong Wang and Min Bai and Gellert Mattyus and Hang Chu and Wenjie Luo and Bin Yang and Justin Liang and Joel Cheverie and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2017}}
In this paper we introduce the TorontoCity benchmark, which covers the full greater Toronto area (GTA) with 712.5 km² of land, 8,439 km of road and around 400,000 buildings. Our benchmark provides different perspectives of the world captured from airplanes, drones and cars driving around the city. Manually labeling such a large-scale dataset is infeasible. Instead, we propose to utilize different sources of high-precision maps to create our ground truth. Towards this goal, we develop algorithms that allow us to align all data sources with the maps while requiring minimal human supervision. We have designed a wide variety of tasks including building height estimation (reconstruction), road centerline and curb extraction, building instance segmentation, building contour extraction (reorganization), semantic labeling and scene type classification (recognition). Our pilot study shows that most of these tasks are still difficult for modern convolutional neural networks.
Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, Sanja Fidler
In International Conference on Computer Vision (ICCV), Venice, Italy, 2017
Paper  Abstract  Bibtex@inproceedings{SituationsICCV17,
title = {Situation Recognition with Graph Neural Networks},
author = {Ruiyu Li and Makarand Tapaswi and Renjie Liao and Jiaya Jia and Raquel Urtasun and Sanja Fidler},
booktitle = {ICCV},
year = {2017}}
We address the problem of recognizing situations in images. Given an image, the task is to predict the most salient verb (action), and fill its semantic roles such as who is performing the action, what is the source and target of the action, etc. Different verbs have different roles (e.g., attacking has a weapon), and each role can take on many possible values (nouns). We propose a model based on Graph Neural Networks that allows us to efficiently capture joint dependencies between roles using neural networks defined on a graph. Experiments with different graph connectivities show that our approach, which propagates information between roles, significantly outperforms existing work, as well as multiple baselines. We obtain roughly 3-5% improvement over previous work in predicting the full situation. We also provide a thorough qualitative analysis of our model and the influence of different roles in the verbs.
Hang Zhao, Xavier Puig, Bolei Zhou, Sanja Fidler, Antonio Torralba
In International Conference on Computer Vision (ICCV), Venice, Italy, 2017
Paper  Abstract  Bibtex@inproceedings{openvoc17,
title = {Open Vocabulary Scene Parsing},
author = {Hang Zhao and Xavier Puig and Bolei Zhou and Sanja Fidler and Antonio Torralba},
booktitle = {ICCV},
year = {2017}}
Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task of parsing scenes with a large and open vocabulary, and explore several evaluation metrics for this problem. Our approach is a framework of joint image pixel and word concept embeddings, where word concepts are connected by semantic relations. We validate the open vocabulary prediction ability of our framework on the ADE20K dataset, which covers a wide variety of scenes and objects. We further explore the trained joint embedding space to show its interpretability.
Shu Liu, Jiaya Jia, Sanja Fidler, Raquel Urtasun
In International Conference on Computer Vision (ICCV), Venice, Italy, 2017
Paper  Abstract  Bibtex@inproceedings{SGN17,
title = {Sequential Grouping Networks for Instance Segmentation},
author = {Shu Liu and Jiaya Jia and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2017}}
In this paper, we propose Sequential Grouping Networks (SGN) to tackle the problem of object instance segmentation. SGNs employ a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels. In particular, the first network aims to group pixels along each image row and column by predicting horizontal and vertical object breakpoints. These breakpoints are then used to create line segments. By exploiting two-directional information, the second network groups horizontal and vertical lines into connected components. Finally, the third network groups the connected components into object instances. Our experiments show that SGN significantly outperforms state-of-the-art approaches on both the Cityscapes dataset and PASCAL VOC.
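As a small illustration of the first sub-grouping step (our simplification, reduced to a single image row), the helper below turns a row of predicted breakpoint indicators into horizontal line segments by cutting wherever a breakpoint fires:

def row_segments(breakpoints):
    """breakpoints: 0/1 flags per pixel of one image row; a 1 at position x
    marks a predicted object boundary between pixels x-1 and x. Returns the
    inclusive (start, end) index pairs of the resulting horizontal segments."""
    bp = list(breakpoints)
    segments, start = [], 0
    for x, flag in enumerate(bp):
        if flag and x > 0:
            segments.append((start, x - 1))
            start = x
    if bp:
        segments.append((start, len(bp) - 1))
    return segments

# Two objects meeting at pixel 4: -> [(0, 3), (4, 7)]
print(row_segments([0, 0, 0, 0, 1, 0, 0, 0]))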
Shizhan Zhu, Sanja Fidler, Raquel Urtasun, Dahua Lin, Chen Change Loy
In International Conference on Computer Vision (ICCV), Venice, Italy, 2017
Paper  Abstract  Project page  Bibtex@inproceedings{GANprada17,
title = {Be Your Own Prada: Fashion Synthesis with Structural Coherence},
author = {Shizhan Zhu and Sanja Fidler and Raquel Urtasun and Dahua Lin and Chen Change Loy},
booktitle = {ICCV},
year = {2017}}
We present a novel and effective approach for generating new clothing on a wearer through generative adversarial learning. Given an input image of a person and a sentence describing a different outfit, our model "redresses" the person as desired, while at the same time keeping the wearer and her/his pose unchanged. Generating new outfits with precise regions conforming to a language description while retaining wearer's body structure is a new challenging task. Existing generative adversarial networks are not ideal in ensuring global coherence of structure given both the input photograph and language description as conditions. We address this challenge by decomposing the complex generative process into two conditional stages. In the first stage, we generate a plausible semantic segmentation map that obeys the wearer's pose as a latent spatial arrangement. An effective spatial constraint is formulated to guide the generation of this semantic segmentation map. In the second stage, a generative model with a newly proposed compositional mapping layer is used to render the final image with precise regions and textures conditioned on this map. We extended the DeepFashion dataset by collecting sentence descriptions for 79K images. We demonstrate the effectiveness of our approach through both quantitative and qualitative evaluations. A user study is also conducted.
Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017
Paper  Abstract  Project page  Bibtex@inproceedings{CastrejonCVPR17,
title = {Annotating Object Instances with a Polygon-RNN},
author = {Lluis Castrejon and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2017}}
We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows a human annotator to interfere at any time and correct a vertex if needed, producing a segmentation as accurate as desired by the annotator. We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with the original ground truth, matching the typical agreement between human annotators. For cars, our speed-up factor is 7.3 for an agreement of 82.2%. We further show the generalization capabilities of our approach to unseen datasets.
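The human-in-the-loop behavior described above can be pictured with the small driver loop below; model.predict_next_vertex and ask_annotator are hypothetical stand-ins for the network and the annotation UI, not the paper's actual interface:

def annotate_instance(model, crop, ask_annotator, max_vertices=60):
    """Sequentially decode polygon vertices for an object crop, letting a human
    override any predicted vertex before it is committed. Both `model` and
    `ask_annotator` are placeholder callables for illustration."""
    polygon = []
    for _ in range(max_vertices):
        vertex, done = model.predict_next_vertex(crop, polygon)
        corrected = ask_annotator(vertex)   # returns a fixed vertex, or None to accept
        polygon.append(vertex if corrected is None else corrected)
        if done:                            # the model signals the polygon is closed
            break
    return polygon

# Toy stubs so the loop runs end to end.
class _StubModel:
    def predict_next_vertex(self, crop, poly):
        nxt = [(0, 0), (0, 10), (10, 10), (10, 0)][len(poly)]
        return nxt, len(poly) == 3
print(annotate_instance(_StubModel(), crop=None, ask_annotator=lambda v: None))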
Namdar Homayounfar, Sanja Fidler, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017
Paper  Abstract  Bibtex@inproceedings{NamdarCVPR17,
title = {Sports Field Localization via Deep Structured Models},
author = {Namdar Homayounfar and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2017}}
In this work, we propose a novel way of efficiently localizing a sports field from a single broadcast image of the game. Related work in this area relies on manually annotating a few key frames and extending the localization to similar images, or installing fixed specialized cameras in the stadium from which the layout of the field can be obtained. In contrast, we formulate this problem as a branch and bound inference in a Markov random field where an energy function is defined in terms of semantic cues such as the field surface, lines and circles obtained from a deep semantic segmentation network. Moreover, our approach is fully automatic and depends only on a single image from the broadcast video of the game. We demonstrate the effectiveness of our method by applying it to soccer and hockey.
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, Antonio Torralba
In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017
Paper  Abstract  Dataset  Code  Bibtex@inproceedings{Ade20k,
title = {Scene Parsing through ADE20K Dataset},
author = {Bolei Zhou and Hang Zhao and Xavier Puig and Sanja Fidler and Adela Barriuso and Antonio Torralba},
booktitle = {CVPR},
year = {2017}}
Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A scene parsing benchmark is built upon the ADE20K with 150 object and stuff classes included. Several segmentation baseline models are evaluated on the benchmark. A novel network design called Cascade Segmentation Module is proposed to parse a scene into stuff, objects, and object parts in a cascade and improve over the baselines. We further show that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.
Wei-Chiu Ma, Shenlong Wang, Marcus A. Brubaker, Sanja Fidler, Raquel Urtasun
In International Conference on Robotics and Automation (ICRA), Singapore, 2017
Paper  Abstract  Bibtex@inproceedings{WeiChiuICRA17,
title = {Find Your Way by Observing the Sun and Other Semantic Cues},
author = {Wei-Chiu Ma and Shenlong Wang and Marcus A. Brubaker and Sanja Fidler and Raquel Urtasun},
booktitle = {ICRA},
year = {2017}}
In this paper we present a robust, efficient and affordable approach to self-localization which requires neither GPS nor knowledge about the appearance of the world. Towards this goal, we utilize freely available cartographic maps and derive a probabilistic model that exploits semantic cues in the form of sun direction, presence of an intersection, road type and speed limit, as well as the ego-car trajectory, in order to produce very reliable localization results. Our experimental evaluation shows that our approach can localize much faster (in terms of driving time) with less computation and more robustly than competing approaches, which ignore semantic information.
Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Huimin Ma, Sanja Fidler, Raquel Urtasun
In Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017
Paper  Abstract  Project page  Bibtex@inproceedings{ChenArxiv16,
title = {3D Object Proposals using Stereo Imagery for Accurate Object Class Detection},
author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {arXiv:1608.07711},
year = {2016}}
The goal of this paper is to perform 3D object detection in the context of autonomous driving. Our method first aims at generating a set of high-quality 3D object proposals by exploiting stereo imagery. We formulate the problem as minimizing an energy function that encodes object size priors, placement of objects on the ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. We then exploit a CNN on top of these proposals to perform object detection. In particular, we employ a convolutional neural net (CNN) that exploits context and depth information to jointly regress to 3D bounding box coordinates and object pose. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. When combined with the CNN, our approach outperforms all existing results in object detection and orientation estimation tasks for all three KITTI object classes. Furthermore, we experiment also with the setting where LIDAR information is available, and show that using both LIDAR and stereo leads to the best result.
Hang Chu, Raquel Urtasun, Sanja Fidler
arXiv:1611.03477, ICLR Workshop track, 2017
Paper  Abstract  Project page  Press  Bibtex@inproceedings{SongOfPI,
title = {Song From PI: A Musically Plausible Network for Pop Music Generation},
author = {Hang Chu and Raquel Urtasun and Sanja Fidler},
booktitle = {arXiv:1611.03477},
year = {2016}}
We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.
Wenyuan Zeng, Wenjie Luo, Sanja Fidler, Raquel Urtasun
Arxiv preprint arXiv:1611.03382
Paper  Abstract  Bibtex@inproceedings{WenyuanArxiv16,
title = {Efficient Summarization with Read-Again and Copy Mechanism},
author = {Wenyuan Zeng and Wenjie Luo and Sanja Fidler and Raquel Urtasun},
booktitle = {arXiv:1611.03382},
year = {2016}}
Encoder-decoder models have been widely used to solve sequence-to-sequence prediction tasks. However, current approaches suffer from two shortcomings. First, the encoders compute a representation of each word taking into account only the history of the words it has read so far, yielding suboptimal representations. Second, current decoders utilize large vocabularies in order to minimize the problem of unknown words, resulting in slow decoding times. In this paper we address both shortcomings. Towards this goal, we first introduce a simple mechanism that reads the input sequence before committing to a representation of each word. Furthermore, we propose a simple copy mechanism that is able to exploit very small vocabularies and handle out-of-vocabulary words. We demonstrate the effectiveness of our approach on the Gigaword dataset and the DUC competition, outperforming the state-of-the-art.
Shenlong Wang, Sanja Fidler, Raquel Urtasun
In Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016
Paper  Abstract  Bibtex@inproceedings{ShenlongNIPS16,
title = {Proximal Deep Structured Models},
author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {NIPS},
year = {2016}}
Many problems in real-world applications involve predicting continuous-valued random variables that are statistically related. In this paper, we propose a powerful deep structured model that is able to learn complex non-linear functions which encode the dependencies between continuous output variables. We show that inference in our model using proximal methods can be efficiently solved as a feed-forward pass of a special type of deep recurrent neural network. We demonstrate the effectiveness of our approach in the tasks of image denoising, depth refinement and optical flow estimation.
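For readers unfamiliar with proximal methods, the classic proximal-gradient (ISTA) iteration below shows the kind of update the paper unrolls into feed-forward layers; this is the textbook algorithm for L1-regularized least squares, not the learned model itself:

import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam=0.1, steps=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by alternating a gradient
    step on the smooth term with the proximal (soft-thresholding) step."""
    x = np.zeros(A.shape[1])
    tau = 1.0 / np.linalg.norm(A, 2) ** 2   # step size from the Lipschitz constant
    for _ in range(steps):
        x = soft_threshold(x - tau * A.T @ (A @ x - b), tau * lam)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 50))
x_true = np.zeros(50); x_true[:3] = [1.0, -2.0, 0.5]
print(np.round(ista(A, A @ x_true, lam=0.05), 2)[:5])   # approximately recovers the sparse signal

Unrolling a fixed number of such iterations, with the hand-designed operators replaced by learned layers, is what turns inference into a special kind of recurrent network.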
Hang Chu, Shenlong Wang, Raquel Urtasun, Sanja Fidler
In European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016
Paper  Abstract  Project page  Bibtex@inproceedings{ChuECCV16,
title = {HouseCraft: Building Houses from Rental Ads and Street Views},
author = {Hang Chu and Shenlong Wang and Raquel Urtasun and Sanja Fidler},
booktitle = {ECCV},
year = {2016}}
In this paper, we utilize rental ads to create realistic textured 3D models of building exteriors. In particular, we exploit the address of the property and its floorplan, which are typically available in the ad. The address allows us to extract Google StreetView images around the building, while the building's floorplan allows for an efficient parametrization of the building in 3D via a small set of random variables. We propose an energy minimization framework which jointly reasons about the height of each floor, the vertical positions of windows and doors, as well as the precise location of the building in the world's map, by exploiting several geometric and semantic cues from the StreetView imagery. To demonstrate the effectiveness of our approach, we collected a new dataset with 174 houses by crawling a popular rental website. Our experiments show that our approach is able to precisely estimate the geometry and location of the property, and can create realistic 3D building models.
Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016
Benchmark on question-answering about movies
Paper  Abstract  Benchmark  Press  Bibtex@inproceedings{TapaswiCVPR16,
title = {MovieQA: Understanding Stories in Movies through Question-Answering},
author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2016}}
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.
Ziyu Zhang, Sanja Fidler, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016
Paper  Abstract  Project page  Bibtex@inproceedings{ZhangCVPR16,
title = {Instance-Level Segmentation for Autonomous Driving with Deep Densely Connected MRFs},
author = {Ziyu Zhang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}}
Our aim is to provide a pixel-level object instance labeling of a monocular image. We build on recent work [Zhang et al., ICCV15] that trained a convolutional neural net to predict instance labeling in local image patches, extracted exhaustively in a stride from an image. A simple Markov random field model using several heuristics was then proposed in [Zhang et al., ICCV15] to derive a globally consistent instance labeling of the image. In this paper, we formulate the global labeling problem with a novel densely connected Markov random field and show how to encode various intuitive potentials in a way that is amenable to efficient mean field inference [Krahenbuhl et al., NIPS11]. Our potentials encode the compatibility between the global labeling and the patch-level predictions, contrast-sensitive smoothness as well as the fact that separate regions form different instances. Our experiments on the challenging KITTI benchmark [Geiger et al., CVPR12] demonstrate that our method achieves a significant performance boost over the baseline [Zhang et al., ICCV15].
Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016
Paper  Abstract  Project page  Bibtex@inproceedings{ChenCVPR16,
title = {Monocular 3D Object Detection for Autonomous Driving},
author = {Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}}
The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.
Gellert Mattyus, Shenlong Wang, Sanja Fidler, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016
Paper  Abstract  Bibtex@inproceedings{MattyusCVPR16,
title = {HD Maps: Fine-grained Road Segmentation by Parsing Ground and Aerial Images},
author = {Gellert Mattyus and Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}}
In this paper we present an approach to enhance existing maps with fine-grained segmentation categories such as parking spots and sidewalk, as well as the number and location of road lanes. Towards this goal, we propose an efficient approach that is able to estimate these fine-grained categories by doing joint inference over both monocular aerial imagery and ground images taken from a stereo camera pair mounted on top of a car. Key to this is reasoning about the alignment between the two types of imagery, as even when the measurements are taken with sophisticated GPS+IMU systems, this alignment is not sufficiently accurate. We demonstrate the effectiveness of our approach on a new dataset which enhances KITTI with aerial images taken with a camera mounted on an airplane flying around the city of Karlsruhe, Germany.
Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun
In International Conference on Learning Representations (ICLR), Puerto Rico, 2016
Paper  Abstract  Code  Bibtex@inproceedings{VendrovArxiv15,
title = {Order-Embeddings of Images and Language},
author = {Ivan Vendrov and Ryan Kiros and Sanja Fidler and Raquel Urtasun},
booktitle = {ICLR},
year = {2016}}
Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.
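The core ingredient in this line of work is an asymmetric penalty that is zero exactly when one embedding dominates the other coordinate-wise, so the partial order is encoded geometrically. The NumPy sketch below uses the common ||max(0, y - x)||^2 form of such an order-violation penalty inside a simple margin loss; the ordering convention and loss arrangement are our hedged reconstruction, not a copy of the paper's implementation:

import numpy as np

def order_violation(x, y):
    """Zero iff x >= y coordinate-wise, i.e. iff the ordered pair (x, y)
    respects the (reversed) product order on the positive orthant."""
    return float(np.sum(np.maximum(0.0, y - x) ** 2))

def pair_loss(x, y, x_neg, y_neg, margin=1.0):
    """Drive true ordered pairs to zero violation and corrupted pairs to a
    violation of at least `margin` (a standard contrastive arrangement)."""
    return order_violation(x, y) + max(0.0, margin - order_violation(x_neg, y_neg))

# Toy example: "dog" should dominate its hypernym "animal" coordinate-wise.
dog, animal, car = np.array([2.0, 3.0]), np.array([1.0, 1.0]), np.array([3.0, 0.5])
print(order_violation(dog, animal))   # 0.0: the pair respects the order
print(order_violation(animal, dog))   # > 0: the reversed pair is penalized
print(pair_loss(dog, animal, car, animal))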
Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(1), 2016, pages 74-87
Paper  Abstract  Suppl. Mat.  Bibtex@article{MottaghiPAMI16,
title = {Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding},
author = {Roozbeh Mottaghi and Sanja Fidler and Alan Yuille and Raquel Urtasun and Devi Parikh},
journal = {Trans. on Pattern Analysis and Machine Intelligence},
volume= {38},
number= {1},
pages= {74--87},
year = {2016}
}
Recent trends in image understanding have pushed for scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we "plug-in" human subjects for each of the various components in a conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve scene understanding by focusing research efforts on various individual tasks.
Yukun Zhu*, Ryan Kiros*, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler
In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015
* Denotes equal contribution
Paper  Abstract  Project page  Bibtex@inproceedings{ZhuICCV15,
title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
booktitle = {ICCV},
year = {2015}}
Books are a rich source of both fine-grained information (what a character, an object or a scene looks like) and high-level semantics (what someone is thinking or feeling, and how these states evolve through a story). This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.
Shenlong Wang, Sanja Fidler, Raquel Urtasun
In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015
Paper  Abstract  Bibtex@inproceedings{WangICCV15,
title = {Lost Shopping! Monocular Localization in Large Indoor Spaces},
author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}}
In this paper we propose a novel approach to localization in very large indoor spaces (i.e., 200+ store shopping malls) that takes a single image and a floor plan of the environment as input. We formulate the localization problem as inference in a Markov random field, which jointly reasons about text detection (localizing shops' names in the image with precise bounding boxes), shop facade segmentation, as well as the camera's rotation and translation within the entire shopping mall. The power of our approach is that it does not use any prior information about appearance and instead exploits text detections corresponding to the shop names. This makes our method applicable to a variety of domains and robust to store appearance variation across countries, seasons, and illumination conditions. We demonstrate the performance of our approach on a new dataset we collected of two very large shopping malls, and show the power of holistic reasoning.
Jimmy Ba, Kevin Swersky, Sanja Fidler, Ruslan Salakhutdinov
In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015
Classification of unseen categories from their textual description (Wiki articles)
Paper  Abstract  Bibtex@inproceedings{BaICCV15,
title = {Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions},
author = {Jimmy Ba and Kevin Swersky and Sanja Fidler and Ruslan Salakhutdinov},
booktitle = {ICCV},
year = {2015}}
One of the main challenges in Zero-Shot Learning of visual categories is gathering semantic attributes to accompany images. Recent work has shown that learning from textual descriptions, such as Wikipedia articles, avoids the problem of having to explicitly define these attributes. We present a new model that can classify unseen categories from their textual description. Specifically, we use text features to predict the output weights of both the convolutional and the fully connected layers in a deep convolutional neural network (CNN). We take advantage of the architecture of CNNs and learn features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches. The proposed model also allows us to automatically generate a list of pseudo-attributes for each visual category consisting of words from Wikipedia articles. We train our models end-to-end using the Caltech-UCSD bird and flower datasets and evaluate both ROC and Precision-Recall curves. Our empirical results show that the proposed model significantly outperforms previous methods.
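Schematically (our simplification, with made-up array names and a single linear text-to-weight mapping), the model turns an article's text features into the weight vector of a classifier that is applied to the CNN's image features, so an unseen class gets a classifier directly from its description:

import numpy as np

def zero_shot_scores(image_feats, text_feats, W_map):
    """image_feats: (N, D) CNN features of test images.
    text_feats:  (C, T) text features of each unseen class's article.
    W_map:       (T, D) learned mapping from text features to classifier weights.
    Returns an (N, C) score matrix: each predicted class weight vector is
    dotted with every image's features."""
    class_weights = text_feats @ W_map        # (C, D) predicted classifiers
    return image_feats @ class_weights.T      # (N, C) class scores

rng = np.random.default_rng(0)
scores = zero_shot_scores(rng.normal(size=(5, 64)),
                          rng.normal(size=(3, 32)),
                          rng.normal(size=(32, 64)))
print(scores.argmax(axis=1))                  # predicted unseen class per image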
Gellert Matthyus, Shenlong Wang, Sanja Fidler, Raquel Urtasun
In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015
Paper  Abstract  Bibtex@inproceedings{MatthyusICCV15,
title = {Enhancing World Maps by Parsing Aerial Images},
author = {Gellert Matthyus and Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}}
In recent years, contextual models that exploit maps have been shown to be very effective for many recognition and localization tasks. In this paper, we propose to exploit aerial images in order to enhance freely available world maps. Towards this goal, we make use of OpenStreetMap and formulate the problem as the one of inference in a Markov random field parameterized in terms of the location of the road-segment centerlines as well as their width. This parameterization enables very efficient inference and returns only topologically correct roads. In particular, we can segment all OSM roads in the world in a single day using a small cluster of 10 computers. Importantly, our approach generalizes very well; it can be trained using a single aerial image and produces very accurate results in any location across the globe. We demonstrate the effectiveness of our approach over the previous state-of-the-art on two new benchmarks that we collect. We additionally show how our enhanced maps can be exploited for semantic segmentation of ground images.
Ziyu Zhang, Alex Schwing, Sanja Fidler, Raquel Urtasun
In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015
Paper  Abstract  Bibtex@inproceedings{ZhangICCV15,
title = {Monocular Object Instance Segmentation and Depth Ordering with CNNs},
author = {Ziyu Zhang and Alex Schwing and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}}
In this paper we tackle the problem of instance level segmentation and depth ordering from a single monocular image. Towards this goal, we take advantage of convolutional neural nets and train them to directly predict instance level segmentations where the instance ID encodes depth ordering from large image patches. To provide a coherent single explanation of an image we develop a Markov random field which takes as input the predictions of convolutional nets applied at overlapping patches of different resolutions as well as the output of a connected component algorithm and predicts very accurate instance level segmentation and depth ordering. We demonstrate the effectiveness of our approach on the challenging KITTI benchmark and show very good performance on both tasks.
Tom Lee, Sanja Fidler, Sven Dickinson
In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015
Paper  Abstract  Bibtex@inproceedings{TLeeICCV15,
title = {A Learning Framework for Generating Region Proposals with Mid-level Cues},
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
booktitle = {ICCV},
year = {2015}}
The object categorization community's migration from object detection to large-scale object categorization has seen a shift from sliding window approaches to bottom-up region segmentation, with the resulting region proposals offering discriminating shape and appearance features through an attempt to explicitly segment the objects in a scene from their background. One powerful class of region proposal techniques is based on parametric energy minimization (PEM) via parametric maxflow. In this paper, we incorporate PEM into a novel structured learning framework that learns how to combine a set of mid-level grouping cues to yield a small set of region proposals with high recall. Second, we diversify our region proposals and rank them with region-based convolutional neural network features. Our novel approach, called parametric min-loss, casts perceptual grouping and cue combination in a learning framework which yields encouraging results on VOC'2012.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
Neural Information Processing Systems (NIPS), Montreal, Canada, 2015
A neural representation of sentences trained on 11K books
Paper  Abstract  Code  Neural storyteller  Bibtex@inproceedings{KirosNIPS15,
title = {Skip-Thought Vectors},
author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
booktitle = {NIPS},
year = {2015}}
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.
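For readers who want a concrete picture of the encoder-decoder setup, the sketch below shows a skip-thought-style model in PyTorch. It is only an illustration under simplified assumptions (toy vocabulary size, single-layer GRUs, random token ids); the authors' actual implementation and the vocabulary-expansion step are available from the code link above.
```python
# Illustrative skip-thought-style encoder-decoder (assumptions: PyTorch, toy
# hyper-parameters and random token ids; not the released implementation).
import torch
import torch.nn as nn

class SkipThoughtSketch(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, hid_dim=600):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # two decoders: one reconstructs the previous sentence, one the next
        self.dec_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_next = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, cur, prev, nxt):
        # cur, prev, nxt: LongTensors of token ids, shape (batch, seq_len)
        _, h = self.encoder(self.embed(cur))        # h encodes the current sentence
        out_prev, _ = self.dec_prev(self.embed(prev), h)
        out_next, _ = self.dec_next(self.embed(nxt), h)
        return self.out(out_prev), self.out(out_next), h.squeeze(0)

model = SkipThoughtSketch()
toy = lambda: torch.randint(0, 20000, (4, 12))      # toy batch of token ids
logits_prev, logits_next, sentence_vec = model(toy(), toy(), toy())
print(sentence_vec.shape)                            # (4, 600) generic sentence vectors
```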
Xiaozhi Chen*, Kaustav Kundu*, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun
* Denotes equal contribution
Neural Information Processing Systems (NIPS), Montreal, Canada, 2015
Paper  Abstract  Project page  Bibtex@inproceedings{XiaozhiNIPS15,
title = {3D Object Proposals for Accurate Object Class Detection},
author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Andrew Berneshawi and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {NIPS},
year = {2015}}
The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on Car and Cyclist, and is competitive for the Pedestrian class.
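As a rough illustration of how a 3D proposal might be scored against a point cloud, the toy snippet below combines a point-density term with a class-size prior for an axis-aligned box. This is a simplification of the paper's energy, which also reasons about free space, distance to the ground, and learned weights; the box dimensions and weights here are made up for the example.
```python
# Toy scoring of a 3D box proposal against a point cloud (assumptions: axis-aligned
# box, hand-set weights and sizes; the full energy has more terms and is learned).
import numpy as np

def score_box(points, center, size, expected_size):
    # count points falling inside the axis-aligned 3D box
    lo, hi = center - size / 2.0, center + size / 2.0
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    density = inside.mean()
    # class-specific size prior (e.g., cars have a typical width/height/length)
    size_prior = np.exp(-np.sum((size - expected_size) ** 2))
    return 2.0 * density + size_prior   # weights would be learned in the full model

rng = np.random.default_rng(0)
points = rng.uniform(-5.0, 5.0, size=(10000, 3))    # toy scene point cloud
car_size = np.array([1.6, 1.5, 4.0])                # width, height, length in metres
print(score_box(points, np.zeros(3), car_size, car_size))
```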
Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun
In British Machine Vision Conference (BMVC), Swansea, UK, 2015
Paper  Abstract  Bibtex@inproceedings{LinBMVC15,
title = {Generating Multi-Sentence Lingual Descriptions of Indoor Scenes},
author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
booktitle = {BMVC},
year = {2015}}
This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches have focused primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this, by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROUGE scores compared to those produced by the baseline.
Chenxi Liu*, Alex Schwing*, Kaustav Kundu, Raquel Urtasun, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Boston, 2015
* Denotes equal contribution
Paper  Abstract  Suppl. Mat.  Project page  Bibtex@inproceedings{ApartmentsCVPR15,
title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2015}}
The goal of this paper is to enable a 3D "virtual-tour" of an apartment given a small set of monocular images of different rooms, as well as a 2D floor plan. We frame the problem as the one of inference in a Markov random field which reasons about the layout of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example, about in which room the picture was taken. What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge. In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information can significantly help in resolving the challenging room-apartment alignment problem. We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we exploit integral geometry as well as our new bounds on the aspect ratio of rooms which allow us to carve the space, reducing significantly the number of physically possible configurations. We demonstrate the effectiveness of our approach in a new dataset which contains over 200 apartments.
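As a small, hypothetical illustration of how a floor-plan prior can carve the hypothesis space, the snippet below discards room-layout hypotheses whose aspect ratio disagrees with the floor plan by more than a tolerance. The actual model does this inside an exact MRF inference procedure with integral-geometry bounds; the numbers here are invented.
```python
# Hypothetical illustration of a floor-plan aspect-ratio prior pruning room-layout
# hypotheses (assumption: hypotheses are simple (width, depth) estimates; all
# numbers are invented for the example).
def plausible_layouts(hypotheses, plan_ratio, tol=0.2):
    keep = []
    for width, depth in hypotheses:
        ratio = width / depth
        if abs(ratio - plan_ratio) / plan_ratio <= tol:   # consistent with the plan
            keep.append((width, depth))
    return keep

# the floor plan says this room is 3m x 4m, i.e. aspect ratio 0.75
print(plausible_layouts([(3.0, 4.0), (5.0, 2.0), (3.2, 4.2)], plan_ratio=0.75))
```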
Shenlong Wang, Sanja Fidler, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Boston, 2015
Exploiting map priors for segmentation and monocular depth estimation
Paper  Abstract  Project page  Bibtex@inproceedings{WangCVPR15,
title = {Holistic 3D Scene Understanding from a Single Geo-tagged Image},
author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}}
In this paper we are interested in exploiting geographic priors to help outdoor scene understanding. Towards this goal we propose a holistic approach that reasons jointly about 3D object detection, pose estimation, semantic segmentation as well as depth reconstruction from a single image. Our approach takes advantage of large-scale crowdsourced maps to generate dense geographic, geometric and semantic priors by rendering the 3D world. We demonstrate the effectiveness of our holistic model on the challenging KITTI dataset, and show significant improvements over the baselines in all metrics and tasks.
Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Boston, 2015
Paper  Abstract  Project page  Bibtex@inproceedings{ZhuSegDeepM15,
title = {segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection},
author = {Yukun Zhu and Raquel Urtasun and Ruslan Salakhutdinov and Sanja Fidler},
booktitle = {CVPR},
year = {2015}
}
In this paper, we propose an approach that exploits object segmentation in order to improve the accuracy of object detection. We frame the problem as inference in a Markov Random Field, in which each detection hypothesis scores object appearance as well as contextual information using Convolutional Neural Networks, and allows the hypothesis to choose and score a segment out of a large pool of accurate object segmentation proposals. This enables the detector to incorporate additional evidence when it is available and thus results in more accurate detections. Our experiments show an improvement of 4.1% in mAP over the R-CNN baseline on PASCAL VOC 2010, and 3.4% over the current state-of-the-art, demonstrating the power of our approach.
Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Boston, 2015
How fashionable do you look in a photo? And how can you improve?
Paper  Abstract  Project page  Bibtex@inproceedings{SimoCVPR15,
title = {Neuroaesthetics in Fashion: Modeling the Perception of Beauty},
author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}
}
In this paper, we analyze the fashion of clothing of a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback back to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.
Press coverage:
New Scientist | Quartz | Tech Times | Wired, UK | Mashable
AOL News (video) | Huffington Post, UK (video) | Huffington Post, Canada | MSN, Canada | Protein
Yahoo, Canada | Science Daily | Daily Mail, UK | PSFK | Toronto Star
Gizmag | TheRecord.com | iDigitalTimes
Harper's Bazaar | Glamour | Elle | Cosmopolitan, UK | Marie Claire
Fashion Magazine | Yahoo style | Red Magazine, UK | The Pool, UK | FashionNotes
Vogue (Spain) | Stylebook (Germany) | Ansa (Italy) | CenarioMT (Brazil) | Amsterdam Fashion (NL)
Marie Claire (France) | Fashion Police (Nigeria) | Nauka (Poland) | Pluska (Slovakia) | Pressetext (Austria)
Wired (Germany) | Jetzt (Germany) | La Gazzetta (Italy) | PopSugar (Australia) | SinEmbargo (Mexico)
A more complete list is maintained on our project webpage.
Jian Yao, Marko Boben, Sanja Fidler, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Boston, 2015
Paper  Abstract  Code  Bibtex@inproceedings{YaoCVPR15,
title = {Real-Time Coarse-to-fine Topologically Preserving Segmentation},
author = {Jian Yao and Marko Boben and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}
}
In this paper, we tackle the problem of unsupervised segmentation in the form of superpixels. Our main emphasis is on speed and accuracy. We build on [Yamaguchi et al., ECCV'14] to define the problem as a boundary and topology preserving Markov random field. We propose a coarse to fine optimization technique that speeds up inference in terms of the number of updates by an order of magnitude. Our approach is shown to outperform [Yamaguchi et al., ECCV'14] while employing a single iteration. We evaluate and compare our approach to state-of-the-art superpixel algorithms on the BSD and KITTI benchmarks. Our approach significantly outperforms the baselines in the segmentation metrics and achieves the lowest error on the stereo task.
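The sketch below conveys only the coarse-to-fine flavour of the approach: superpixel labels are initialized on a grid, refined at half resolution, lifted to full resolution, and refined again by reassigning boundary pixels to the neighbouring superpixel with the closest mean colour. The actual method optimizes a boundary- and topology-preserving MRF, which this toy code does not implement.
```python
# Toy coarse-to-fine superpixel refinement (assumption: the real method optimizes a
# boundary- and topology-preserving MRF; this only reassigns boundary pixels by
# mean-colour distance, first at half resolution and then at full resolution).
import numpy as np

def grid_labels(h, w, step):
    ys, xs = np.mgrid[0:h, 0:w]
    return (ys // step) * int(np.ceil(w / step)) + (xs // step)

def refine_once(img, labels):
    h, w, _ = img.shape
    means = np.array([img[labels == k].mean(axis=0) for k in range(labels.max() + 1)])
    new = labels.copy()
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            candidates = {labels[y - 1, x], labels[y + 1, x],
                          labels[y, x - 1], labels[y, x + 1], labels[y, x]}
            new[y, x] = min(candidates,
                            key=lambda k: np.sum((img[y, x] - means[k]) ** 2))
    return new

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))                 # stand-in image
coarse = refine_once(img[::2, ::2], grid_labels(32, 32, step=8))
fine = refine_once(img, np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1))
```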
Tom Lee, Sanja Fidler, Alex Levinshtein, Cristian Sminchisescu, Sven Dickinson
Symmetry, Vol. 7, 2015, pp. 1333-1351
Paper  Abstract  Bibtex@article{LeeSymmetry2015,
title = {A Framework for Symmetric Part Detection in Cluttered Scenes},
author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Cristian Sminchisescu and Sven Dickinson},
journal = {Symmetry},
volume = {7},
pages = {1333-1351},
year = {2015}
}
The role of symmetry in computer vision has waxed and waned in importance during the evolution of the field from its earliest days. At first figuring prominently in support of bottom-up indexing, it fell out of favour as shape gave way to appearance and recognition gave way to detection. With a strong prior in the form of a target object, the role of the weaker priors offered by perceptual grouping was greatly diminished. However, as the field returns to the problem of recognition from a large database, the bottom-up recovery of the parts that make up the objects in a cluttered scene is critical for their recognition. The medial axis community has long exploited the ubiquitous regularity of symmetry as a basis for the decomposition of a closed contour into medial parts. However, today's recognition systems are faced with cluttered scenes and the assumption that a closed contour exists, i.e., that figure-ground segmentation has been solved, rendering much of the medial axis community's work inapplicable. In this article, we review a computational framework, previously reported in [Lee et al., ICCV'13, Levinshtein et al., ICCV'09, Levinshtein et al., IJCV'13], that bridges the representation power of the medial axis and the need to recover and group an object's parts in a cluttered scene. Our framework is rooted in the idea that a maximally-inscribed disc, the building block of a medial axis, can be modelled as a compact superpixel in the image. We evaluate the method on images of cluttered scenes.
Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun
In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014
Paper  Abstract  Project page  Bibtex@inproceedings{SimoACCV14,
title = {A High Performance CRF Model for Clothes Parsing},
author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
booktitle = {ACCV},
year = {2014}}
In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset [Yamaguchi et al., CVPR'12] and show that we can obtain a significant improvement over the state-of-the-art.
Tom Lee, Sanja Fidler, Sven Dickinson
In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014
Paper  Abstract  Bibtex@inproceedings{LeeACCV14,
title = {Multi-cue mid-level grouping},
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
booktitle = {ACCV},
year = {2014}
}
Region proposal methods provide richer object hypotheses than sliding windows with dramatically fewer proposals, yet they still number in the thousands. This large quantity of proposals typically results from a diversification step that propagates bottom-up ambiguity in the form of proposals to the next processing stage. In this paper, we take a complementary approach in which mid-level knowledge is used to resolve bottom-up ambiguity at an earlier stage to allow a further reduction in the number of proposals. We present a method for generating regions using the mid-level grouping cues of closure and symmetry. In doing so, we combine mid-level cues that are typically used only in isolation, and leverage them to produce fewer but higher quality proposals. We emphasize that our model is mid-level by learning it on a limited number of objects while applying it to different objects, thus demonstrating that it is transferable to other objects. In our quantitative evaluation, we 1) establish the usefulness of each grouping cue by demonstrating incremental improvement, and 2) demonstrate improvement on two leading region proposal methods with a limited budget of proposals.
Sanja Fidler, Marko Boben, Ales Leonardis
arXiv preprint arXiv:1408.5516, 2014
Journal version of my PhD work on learning compositional hierarchies encoding spatial relations
Paper  Abstract  Bibtex@inproceedings{FidlerArxiv14,
title = {Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation},
author = {Sanja Fidler and Marko Boben and Ale\v{s} Leonardis},
booktitle = {ArXiv:1408.5516},
year = {2014}
}
Hierarchies allow feature sharing between objects at multiple levels of representation, can code exponential variability in a very compact way and enable fast inference. This makes them potentially suitable for learning and recognizing a higher number of object classes. However, the success of the hierarchical approaches so far has been hindered by the use of hand-crafted features or predetermined grouping rules. This paper presents a novel framework for learning a hierarchical compositional shape vocabulary for representing multiple object classes. The approach takes simple contour fragments and learns their frequent spatial configurations. These are recursively combined into increasingly more complex and class-specific shape compositions, each exerting a high degree of shape variability. At the top-level of the vocabulary, the compositions are sufficiently large and complex to represent the whole shapes of the objects. We learn the vocabulary layer after layer, by gradually increasing the size of the window of analysis and reducing the spatial resolution at which the shape configurations are learned. The lower layers are learned jointly on images of all classes, whereas the higher layers of the vocabulary are learned incrementally, by presenting the algorithm with one object class after another. The experimental results show that the learned multi-class object representation scales favorably with the number of object classes and achieves a state-of-the-art detection performance at both, faster inference as well as shorter training times.
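A toy sketch of the core statistic behind the vocabulary learning: counting frequent quantized spatial offsets between co-occurring part detections within each image and keeping the configurations that repeat. The real system works with oriented contour fragments, spatial distributions, and multiple layers; the part ids, grid cell size, and count threshold below are illustrative.
```python
# Toy sketch of mining frequent spatial configurations of parts (assumptions: parts
# are labelled 2D points, compositions are quantized pairwise offsets, and the cell
# size / count threshold are illustrative).
from collections import Counter

def frequent_compositions(images, cell=10, min_count=2):
    counts = Counter()
    for dets in images:                          # dets: list of (part_id, x, y)
        for i, (pa, xa, ya) in enumerate(dets):
            for pb, xb, yb in dets[i + 1:]:
                offset = (round((xb - xa) / cell), round((yb - ya) / cell))
                counts[(pa, pb, offset)] += 1    # count within-image co-occurrences
    return [c for c, n in counts.items() if n >= min_count]

img1 = [(0, 15, 40), (1, 35, 40)]                # part 1 sits ~20px right of part 0
img2 = [(0, 80, 12), (1, 100, 13)]
print(frequent_compositions([img1, img2]))       # -> [(0, 1, (2, 0))]
```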
Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler
In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014
Paper  Abstract  Project page  Bibtex@inproceedings{KongCVPR14,
title = {What are you talking about? Text-to-Image Coreference},
author = {Chen Kong and Dahua Lin and Mohit Bansal and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2014}
}
In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system.
Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014
Paper  Abstract  Suppl. Mat.  Bibtex@inproceedings{LinCVPR14,
author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
title = {Visual Semantic Search: Retrieving Videos via Complex Textual Queries},
booktitle = {CVPR},
year = {2014}
}
In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.
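A minimal sketch of the matching step, assuming a plain bipartite assignment (the paper uses a generalized matching with weights learned by structured prediction): query-graph nodes are assigned to detected visual concepts by maximizing a mixed appearance/motion similarity. The similarity numbers and the 0.6/0.4 weighting below are invented for the example.
```python
# Sketch of matching parsed query nodes to detected visual concepts via bipartite
# matching (assumptions: fixed 0.6/0.4 appearance/motion weighting and toy scores;
# the paper learns the term weights and uses a generalized matching).
import numpy as np
from scipy.optimize import linear_sum_assignment

query_nodes = ["red car", "pedestrian"]        # nodes of the parsed semantic graph
detections = ["car", "van", "person"]          # visual concepts found in the video

appearance = np.array([[0.9, 0.6, 0.1],        # rows: query nodes, cols: detections
                       [0.1, 0.2, 0.8]])
motion = np.array([[0.7, 0.5, 0.2],
                   [0.3, 0.1, 0.9]])
score = 0.6 * appearance + 0.4 * motion

rows, cols = linear_sum_assignment(-score)     # negate to maximize total similarity
for r, c in zip(rows, cols):
    print(f"{query_nodes[r]} -> {detections[c]} ({score[r, c]:.2f})")
```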
Liang-Chieh Chen, Sanja Fidler, Alan Yuille, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014
Ground-truth segmentations provided for a subset of KITTI cars in Project page
Paper  Abstract  Project page  CAD models  Suppl. Mat.  Bibtex@inproceedings{ChenCVPR14,
author = {Liang-Chieh Chen and Sanja Fidler and Alan Yuille and Raquel Urtasun},
title = {Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision},
booktitle = {CVPR},
year = {2014}
}
Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and a budget of tens or hundreds of thousands of dollars. Thus, developing solutions that can automatically perform the labeling given only weak supervision is key to reduce this cost. In this paper, we show how to exploit 3D information to automatically generate very accurate object segmentations given annotated 3D bounding boxes. We formulate the problem as the one of inference in a binary Markov random field which exploits appearance models, stereo and/or noisy point clouds, a repository of 3D CAD models as well as topological constraints. We demonstrate the effectiveness of our approach in the context of autonomous driving, and show that we can segment cars with the accuracy of 86% intersection-over-union, performing as well as highly recommended MTurkers!
Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, Alan Yuille
In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014
PASCAL VOC with dense segmentation labels for 400+ classes in Project page
Paper  Errata  Abstract  Project page  Suppl. Mat.  Bibtex@inproceedings{MottaghiCVPR14,
author = {Roozbeh Mottaghi and Xianjie Chen and Xiaobai Liu and Sanja Fidler and Raquel Urtasun and Alan Yuille},
title = {The Role of Context for Object Detection and Semantic Segmentation in the Wild},
booktitle = {CVPR},
year = {2014}
}
In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of PASCAL VOC 2010 detection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmentation and object detection. Our analysis shows that nearest neighbor based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, improvements of existing contextual models for detection is rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales.
Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Nam-Gyu Cho, Sanja Fidler, Raquel Urtasun, Alan Yuille
In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014
PASCAL VOC with object parts segmentations available in Project page
Paper  Abstract  Project page  Bibtex@inproceedings{PartsCVPR14,
author = {Xianjie Chen and Roozbeh Mottaghi and Xiaobai Liu and Nam-Gyu Cho and Sanja Fidler and Raquel Urtasun and Alan Yuille},
title = {Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts},
booktitle = {CVPR},
year = {2014}
}
Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals depicted at low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables us to represent a large number of holistic object and body part combinations to better deal with different "detectability" patterns caused by deformations, occlusion and/or low resolution. We apply our method to the six animal categories in the PASCAL VOC dataset and show that our method significantly improves state-of-the-art (by 4.1% AP) and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc), making use of a new dataset of fully annotated object parts for PASCAL VOC 2010, which provides a mask for each part.
Dahua Lin, Sanja Fidler, Raquel Urtasun
In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013
Code, models and ground-truth cuboids for NYU-v2 in Project page
Paper  Abstract  Project page  Talk slides  Bibtex@inproceedings{LinICCV13,
author = {Dahua Lin and Sanja Fidler and Raquel Urtasun},
title = {Holistic Scene Understanding for 3D Object Detection with RGBD cameras},
booktitle = {ICCV},
year = {2013}
}
In this paper, we tackle the problem of indoor scene understanding using RGBD data. Towards this goal, we propose a holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects. Specifically, we extend the CPMC framework to 3D in order to generate candidate cuboids, and develop a conditional random field to integrate information from different sources to classify the cuboids. With this formulation, scene classification and 3D object recognition are coupled and can be jointly solved through probabilistic inference. We test the effectiveness of our approach on the challenging NYU v2 dataset. The experimental results demonstrate that through effective evidence integration and holistic reasoning, our approach achieves substantial improvement over the state-of-the-art.
Alex Schwing, Sanja Fidler, Marc Pollefeys, Raquel Urtasun
In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013
Parallel and improved implementation of Structured SVMs released
Paper  Abstract  Learning code  Bibtex@inproceedings{SchwingICCV13,
author = {Alex Schwing and Sanja Fidler and Marc Pollefeys and Raquel Urtasun},
title = {Box In the Box: Joint 3D Layout and Object Reasoning from Single Images},
booktitle = {ICCV},
year = {2013}
}
In this paper we propose an approach to jointly infer the room layout as well as the objects present in the scene. Towards this goal, we propose a branch and bound algorithm which is guaranteed to retrieve the global optimum of the joint problem. The main difficulty resides in taking into account occlusion in order to not over-count the evidence. We introduce a new decomposition method, which generalizes integral geometry to triangular shapes, and allows us to bound the different terms in constant time. We exploit both geometric cues and object detectors as image features and show large improvements in 2D and 3D object detection over state-of-the-art deformable part-based models.
Tom Lee, Sanja Fidler, Sven Dickinson
In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013
Paper  Abstract  Project page  Bibtex@inproceedings{LeeICCV13,
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
title = {Detecting Curved Symmetric Parts using a Deformable Disc Model},
booktitle = {ICCV},
year = {2013}
}
Symmetry is a powerful shape regularity that's been exploited by perceptual grouping researchers in both human and computer vision to recover part structure from an image without a priori knowledge of scene content. Drawing on the concept of a medial axis, defined as the locus of centers of maximal inscribed discs that sweep out a symmetric part, we model part recovery as the search for a sequence of deformable maximal inscribed disc hypotheses generated from a multiscale superpixel segmentation, a framework proposed by [Levinshtein et al., ICCV'09]. However, we learn affinities between adjacent superpixels in a space that's invariant to bending and tapering along the symmetry axis, enabling us to capture a wider class of symmetric parts. Moreover, we introduce a global cost that perceptually integrates the hypothesis space by combining a pairwise and a higher-level smoothing term, which we minimize globally using dynamic programming. The new framework is demonstrated on two datasets, and is shown to significantly outperform the baseline [Levinshtein et al., ICCV'09].
Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013
Paper  Abstract  Project page  Suppl. Mat.  Bibtex@inproceedings{segdpmCVPR13,
author = {Sanja Fidler and Roozbeh Mottaghi and Alan Yuille and Raquel Urtasun},
title = {Bottom-up Segmentation for Top-down Detection},
booktitle = {CVPR},
year = {2013}
}
In this paper we are interested in how semantic segmentation can help object detection. Towards this goal, we propose a novel deformable part-based model which exploits region-based segmentation algorithms that compute candidate object regions by bottom-up clustering followed by ranking of those regions. Our approach allows every detection hypothesis to select a segment (including void), and scores each box in the image using both the traditional HOG filters as well as a set of novel segmentation features. Thus our model "blends" between the detector and segmentation models. Since our features can be computed very efficiently given the segments, we maintain the same complexity as the original DPM. We demonstrate the effectiveness of our approach in PASCAL VOC 2010, and show that when employing only a root filter our approach outperforms Dalal & Triggs detector on all classes, achieving 13% higher average AP. When employing the parts, we outperform the original DPM in 19 out of 20 classes, achieving an improvement of 8% AP. Furthermore, we outperform the previous state-of-the-art on VOC'10 test by 4%.
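To make the "each hypothesis selects a segment" idea concrete, here is a toy re-scoring function in which a detection box picks the best-overlapping region proposal (or void) and adds a single IoU-based feature to its appearance score. The real segDPM uses HOG part filters and a set of learned segmentation features, so this is only a schematic.
```python
# Toy re-scoring in the spirit of "each detection hypothesis selects a segment"
# (assumptions: segments are summarized by bounding boxes, the only segmentation
# feature is IoU, and the 0.5 weight is arbitrary; real segDPM learns these jointly).
def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def rescore(box, appearance_score, segment_boxes, w_seg=0.5):
    best = max([0.0] + [iou(box, s) for s in segment_boxes])   # 0.0 = "void" segment
    return appearance_score + w_seg * best

segments = [(10, 10, 60, 80), (100, 40, 160, 120)]             # region proposal boxes
print(rescore((12, 8, 58, 82), appearance_score=1.3, segment_boxes=segments))
```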
Sanja Fidler, Abhishek Sharma, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013
Paper  Abstract  Suppl. Mat.  Bibtex@inproceedings{FidlerCVPR13,
author = {Sanja Fidler and Abhishek Sharma and Raquel Urtasun},
title = {A Sentence is Worth a Thousand Pixels},
booktitle = {CVPR},
year = {2013}
}
We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models.
Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh
In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013
Paper  Abstract  Suppl. Mat.  Bibtex@inproceedings{MottaghiCVPR13,
author = {Roozbeh Mottaghi and Sanja Fidler and Jian Yao and Raquel Urtasun and Devi Parikh},
title = {Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs},
booktitle = {CVPR},
year = {2013}
}
Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we "plug-in" human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted an in-depth analysis of the human generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.
Sanja Fidler, Sven Dickinson, Raquel Urtasun
Neural Information Processing Systems (NIPS), Lake Tahoe, USA, December 2012
800 CAD models registered to canonical viewpoint released
Paper  Abstract  CAD dataset  Bibtex@inproceedings{FidlerNIPS12,
author = {Sanja Fidler and Sven Dickinson and Raquel Urtasun},
title = {3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model},
booktitle = {NIPS},
year = {2012}
}
This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D and 3D object detection.
Jian Yao, Sanja Fidler, Raquel Urtasun
In Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012
Code, trained models and annotated bounding boxes for MSRC in Project page
Paper  Abstract  Project page.  Bibtex@inproceedings{YaoCVPR12,
author = {Jian Yao and Sanja Fidler and Raquel Urtasun},
title = {Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation},
booktitle = {CVPR},
year = {2012}
}
In this paper we propose an approach to holistic scene understanding that reasons jointly about regions, location, class and spatial extent of objects, presence of a class in the image, as well as the scene type. Learning and inference in our model are efficient as we reason at the segment level, and introduce auxiliary variables that allow us to decompose the inherent high-order potentials into pairwise potentials between a few variables with small number of states (at most the number of classes). Inference is done via a convergent message-passing algorithm, which, unlike graph-cuts inference, has no submodularity restrictions and does not require potential specific moves. We believe this is very important, as it allows us to encode our ideas and prior knowledge about the problem without the need to change the inference engine every time we introduce a new potential. Our approach outperforms the state-of-the-art on the MSRC-21 benchmark, while being much faster. Importantly, our holistic model is able to improve performance in all tasks.
Zhiqi Zhang, Sanja Fidler, Jarell W. Waggoner, Yu Cao, Jeff M. Siskind, Sven Dickinson, Song Wang
In Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012
Paper  Abstract  Bibtex@inproceedings{ZhangCVPR12,
author = {Zhiqi Zhang and Sanja Fidler and Jarell W. Waggoner and Yu Cao and Jeff M. Siskind and Sven Dickinson and Song Wang},
title = {Super-edge grouping for object localization by combining appearance and shape information},
booktitle = {CVPR},
year = {2012}
}
Both appearance and shape play important roles in object localization and object detection. In this paper, we propose a new superedge grouping method for object localization by incorporating both boundary shape and appearance information of objects. Compared with the previous edge grouping methods, the proposed method does not subdivide detected edges into short edgels before grouping. Such long, unsubdivided superedges not only facilitate the incorporation of object shape information into localization, but also increase the robustness against image noise and reduce computation. We identify and address several important problems in achieving the proposed superedge grouping, including gap filling for connecting superedges, accurate encoding of region-based information into individual edges, and the incorporation of object-shape information into object localization. In this paper, we use the bag of visual words technique to quantify the region-based appearance features of the object of interest. We find that the proposed method, by integrating both boundary and region information, can produce better localization performance than previous subwindow search and edge grouping methods on most of the 20 object categories from the VOC 2007 database. Experiments also show that the proposed method is roughly 50 times faster than the previous edge grouping method.
Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, Zhiqi Zhang
In Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, 2012
Paper  Abstract  Project page  Bibtex@inproceedings{BarbuUAI12,
author = {Andrei Barbu and Alexander Bridge and Zachary Burchill and Dan Coroian and Sven Dickinson and Sanja Fidler and Aaron Michaux and Sam Mussman and Siddharth Narayanaswamy and Dhaval Salvi and Lara Schmidt and Jiangnan Shangguan and Jeffrey Mark Siskind and Jarrell Waggoner and Song Wang and Jinlian Wei and Yifan Yin and Zhiqi Zhang},
title = {Video In Sentences Out},
booktitle = {UAI},
year = {2012}
}
We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.
Wesley May, Sanja Fidler, Afsaneh Fazly, Suzanne Stevenson, Sven Dickinson
First Joint Conference on Lexical and Computational Semantics (*SEM), 2012
Paper  Abstract  Bibtex@inproceedings{MaySEM12,
author = {Wesley May and Sanja Fidler and Afsaneh Fazly and Suzanne Stevenson and Sven Dickinson},
title = {Unsupervised Disambiguation of Image Captions},
booktitle = {First Joint Conference on Lexical and Computational Semantics (*SEM)},
year = {2012}
}
Given a set of images with related captions, our goal is to show how visual features can improve the accuracy of unsupervised word sense disambiguation when the textual context is very small, as this sort of data is common in news and social media. We extend previous work in unsupervised text-only disambiguation with methods that integrate text and images. We construct a corpus by using Amazon Mechanical Turk to caption sense-tagged images gathered from ImageNet. Using a Yarowsky-inspired algorithm, we show that gains can be made over text-only disambiguation, as well as multimodal approaches such as Latent Dirichlet Allocation.
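A minimal sketch of the Yarowsky-style bootstrapping loop, under toy assumptions (2D feature vectors standing in for combined text-plus-image features, prototype-based cosine scoring instead of the full model, arbitrary threshold and iteration count): confident predictions are added to the labeled set and the sense prototypes are re-estimated.
```python
# Minimal Yarowsky-style bootstrapping loop (assumptions: toy 2D vectors stand in
# for combined text+image features; prototype cosine similarity replaces the
# paper's full model; threshold and iteration count are arbitrary).
import numpy as np

def bootstrap(features, seed_labels, n_iters=5, threshold=0.8):
    labels = dict(seed_labels)                               # example index -> sense
    senses = sorted(set(seed_labels.values()))
    for _ in range(n_iters):
        protos = {s: features[[i for i, l in labels.items() if l == s]].mean(axis=0)
                  for s in senses}                           # per-sense prototypes
        for i in range(len(features)):
            if i in labels:
                continue
            sims = {s: float(features[i] @ p) /
                       (np.linalg.norm(features[i]) * np.linalg.norm(p) + 1e-8)
                    for s, p in protos.items()}
            best = max(sims, key=sims.get)
            if sims[best] > threshold:                       # only keep confident calls
                labels[i] = best
    return labels

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal([1, 0], 0.2, (10, 2)),         # examples of sense_a
                   rng.normal([0, 1], 0.2, (10, 2))])        # examples of sense_b
print(bootstrap(feats, {0: "sense_a", 10: "sense_b"}))
```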
Tom Lee, Sanja Fidler, Alex Levinshtein, Sven Dickinson
Conference on Computer and Robot Vision (CRV), Toronto, Canada, May 2012
Paper  Abstract  Bibtex@inproceedings{LeeCRV12,
author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Sven Dickinson},
title = {Learning Categorical Shape from Captioned Images},
booktitle = {Canadian Conference on Computer and Robot Vision (CRV)},
year = {2012}
}
Given a set of captioned images of cluttered scenes containing various objects in different positions and scales, we learn named contour models of object categories without relying on bounding box annotation. We extend a recent language-vision integration framework that finds spatial configurations of image features that co-occur with words in image captions. By substituting appearance features with local contour features, object categories are recognized by a contour model that grows along the object's boundary. Experiments on ETHZ are presented to show that 1) the extended framework is better able to learn named visual categories whose within class variation is better captured by a shape model than an appearance model; and 2) typical object recognition methods fail when manually annotated bounding boxes are unavailable.
Sergey Karayev, Mario Fritz, Sanja Fidler, Trevor Darrell
In Computer Vision and Pattern Recognition (CVPR), 2011
Paper  Abstract  Bibtex@inproceedings{KarayevCVPR11,
author = {Sergey Karayev and Mario Fritz and Sanja Fidler and Trevor Darrell},
title = {A Probabilistic Model for Recursive Factorized Image Features},
booktitle = {CVPR},
year = {2011}
}
Layered representations for object recognition are important due to their increased invariance, biological plausibility, and computational benefits. However, most existing approaches to hierarchical representations are strictly feedforward, and thus not well able to resolve local ambiguities. We propose a probabilistic model that learns and infers all layers of the hierarchy jointly. Specifically, we suggest a process of recursive probabilistic factorization, and present a novel generative model based on Latent Dirichlet Allocation to this end. The approach is tested on a standard recognition dataset, outperforming existing hierarchical approaches and demonstrating performance on par with current single-feature state-of-the-art models. We demonstrate two important properties of our proposed model: 1) adding an additional layer to the representation increases performance over the flat model; 2) a full Bayesian approach outperforms a feedforward implementation of the model.
Sanja Fidler, Marko Boben, Ales Leonardis
In European Conference in Computer Vision (ECCV), 2010
Paper  Abstract  Bibtex@inproceedings{FidlerECCV10,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {A coarse-to-fine Taxonomy of Constellations for Fast Multi-class Object Detection},
booktitle = {ECCV},
year = {2010}
}
In order for recognition systems to scale to a larger number of object categories, building visual class taxonomies is important to achieve running times logarithmic in the number of classes [1, 2]. In this paper we propose a novel approach for speeding up recognition times of multi-class part-based object representations. The main idea is to construct a taxonomy of constellation models cascaded from coarse-to-fine resolution and use it in recognition with an efficient search strategy. The taxonomy is built automatically in a way to minimize the number of expected computations during recognition by optimizing the cost-to-power ratio [Blanchard and Geman, Annals of Statistics, 2005]. The structure and the depth of the taxonomy is not pre-determined but is inferred from the data. The approach is utilized on the hierarchy-of-parts model, achieving efficiency both in the representation of object structure and in the number of modeled object classes. We achieve a speed-up even for a small number of object classes on the ETHZ and TUD datasets. On a larger scale, our approach achieves detection time that is logarithmic in the number of classes.
Mario Fritz, Mykhaylo Andriluka, Sanja Fidler, Michael Stark, Ales Leonardis, Bernt Schiele
Cognitive Systems, No. 8, 2010
Paper  Abstract  Bibtex@InCollection{FritzChapter09,
author = {Mario Fritz and Mykhaylo Andriluka and Sanja Fidler and Michael Stark and Ales Leonardis and Bernt Schiele},
title = {Categorical Perception},
booktitle = {Cognitive Systems},
series = {Cognitive Systems Monographs},
volume = {8},
year = {2010},
publisher = {Springer},
organization = {Springer},
chapter = {Categorical Perception}
}
The ability to recognize and categorize entities in its environment is a vital competence of any cognitive system. Reasoning about the current state of the world, assessing consequences of possible actions, as well as planning future episodes requires a concept of the roles that objects and places may possibly play. For example, objects afford to be used in specific ways, and places are usually devoted to certain activities. The ability to represent and infer these roles, or, more generally, categories, from sensory observations of the world, is an important constituent of a cognitive system's perceptual processing (Section 1.3 elaborates on this with a very visual example). In the CoSy project, a substantial amount of work has been conducted on the advancement of methods that recognize and categorize objects and places by using different modalities, namely, vision, language, and laser range data. Our progress contributes to our effort to build systems that evolve through interaction with their environment in an ultimately life-long learning process. While this chapter describes our contribution to modeling, learning and representing visual categories, Chapter 7 shows how to combine the visual information with other modalities in a multi-modal learning process (e.g., speech/language as detailed in Chapter 8). Finally, Chapters 9 and 10 show how we integrated these concepts in autonomous systems to understand the implications of our progress in categorization on an interactive evolving system.
Sanja Fidler, Marko Boben, Ales Leonardis
Neural Information Processing Systems (NIPS), 2009
Paper  Abstract  Bibtex@inproceedings{FidlerNIPS09,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Evaluating multi-class learning strategies in a generative hierarchical framework for object detection},
booktitle = {NIPS},
year = {2009}
}
Multi-class object learning and detection is a challenging problem due to the large number of object classes and their high visual variability. Specialized detectors usually excel in performance, while joint representations optimize sharing and reduce inference time -- but are complex to train. Conveniently, sequential class learning cuts down training time by transferring existing knowledge to novel classes, but cannot fully exploit the shareability of features among object classes and might depend on ordering of classes during learning. In hierarchical frameworks these issues have been little explored. In this paper, we provide a rigorous experimental analysis of various multiple object class learning strategies within a generative hierarchical framework. Specifically, we propose, evaluate and compare three important types of multi-class learning: 1.) independent training of individual categories, 2.) joint training of classes, and 3.) sequential learning of classes. We explore and compare their computational behavior (space and time) and detection performance as a function of the number of learned object classes on several recognition datasets. We show that sequential training achieves the best trade-off between inference and training times at a comparable detection performance and could thus be used to learn the classes on a larger scale.
Sanja Fidler, Marko Boben, Ales Leonardis
British Machine Vision Conference (BMVC), 2009
Bibtex@inproceedings{FidlerBMVC09,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Optimization framework for learning a hierarchical shape vocabulary for object class detection},
booktitle = {BMVC},
year = {2009}
}
Sanja Fidler, Marko Boben, Ales Leonardis
Object Categorization: Computer and Human Vision Perspectives
Editors: S. Dickinson, A. Leonardis, B. Schiele and M. J. Tarr
Cambridge university press, 2009
@InCollection{FidlerChapter09,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Learning Hierarchical Compositional Representations of Object Structure},
booktitle = {Object Categorization: Computer and Human Vision Perspectives},
editor = {Sven Dickinson and Ale\v{s} Leonardis and Bernt Schiele and Michael J. Tarr},
year = {2009},
publisher = {Cambridge University Press},
pages = {}
}
Sanja Fidler, Marko Boben, Ales Leonardis
Conference on Computer Vision and Pattern Recognition (CVPR), 2008
Bibtex@inproceedings{FidlerCVPR08,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Similarity-based cross-layered hierarchical representation for object categorization},
booktitle = {CVPR},
year = {2008}
}
Luka Furst, Sanja Fidler, Ales Leonardis
Pattern Recognition Letters, Vol. 29, No. 11, pp. 1603-1612, 2008
Paper  Abstract  Bibtex@article{FurstPRL08,
author = {Luka Furst and Sanja Fidler and Ales Leonardis},
title = {Selecting features for object detection using an AdaBoost-compatible evaluation function},
journal = {Pattern Recognition Letters},
volume = {29},
number = {11},
pages = {1603-1612},
year = {2008}
}
This paper addresses the problem of selecting features in a visual object detection setup where a detection algorithm is applied to an input image represented by a set of features. The set of features to be employed in the test stage is prepared in two training-stage steps. In the first step, a feature extraction algorithm produces a (possibly large) initial set of features. In the second step, on which this paper focuses, the initial set is reduced using a selection procedure. The proposed selection procedure is based on a novel evaluation function that measures the utility of individual features for a certain detection task. Owing to its design, the evaluation function can be seamlessly embedded into an AdaBoost selection framework. The developed selection procedure is integrated with state-of-the-art feature extraction and object detection methods. The presented system was tested on five challenging detection setups. In three of them, a fairly high detection accuracy was effected by as few as six features selected out of several hundred initial candidates.
Ales Leonardis, Sanja Fidler
International Symposium of Robotics Research (ISRR), 2007
Paper  Abstract  Bibtex@inproceedings{FidlerISSR07,
author = {Ales Leonardis and Sanja Fidler},
title = {Learning hierarchical representations of object categories for robot vision},
booktitle = {International Symposium of Robotics Research (ISRR)},
year = {2007}
}
This paper presents our recently developed approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories. Inspired by the principles of efficient indexing, robust matching, and ideas of compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which consequently define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy. Built in this way, new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories.
Sanja Fidler, Ales Leonardis
Conference on Computer Vision and Pattern Recognition (CVPR), 2007
Learning a compositional hierarchy of interpretable features encoding spatial relations
Paper  Abstract  Bibtex@inproceedings{FidlerCVPR07,
author = {Sanja Fidler and Ales Leonardis},
title = {Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts},
booktitle = {CVPR},
year = {2007}
}
This paper proposes a novel approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories. Inspired by the principles of efficient indexing (bottom-up), robust matching (top-down), and ideas of compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which consequently define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy. Built in this way, new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories. Detection results confirm the effectiveness and robustness of the learned parts.
Sanja Fidler, Danijel Skocaj, Ales Leonardis
IEEE Trans. on Pattern Anal. and Machine Intell. (PAMI), vol. 28, no. 3, pp. 337-350, 2006
Paper  Abstract  Code  Bibtex@article{FidlerPAMI06,
author = {Sanja Fidler and Danijel Skocaj and Ales Leonardis},
title = {Combining Reconstructive and Discriminative Subspace Methods for Robust Classification and Regression by Subsampling},
journal = {IEEE Trans. on Pattern Analysis and Machine Intelligence},
volume = {28},
number = {3},
pages = {337-350},
year = {2006}
}
Linear subspace methods that provide sufficient reconstruction of the data, such as PCA, offer an efficient way of dealing with missing pixels, outliers, and occlusions that often appear in the visual data. Discriminative methods, such as LDA, which, on the other hand, are better suited for classification tasks, are highly sensitive to corrupted data. We present a theoretical framework for achieving the best of both types of methods: An approach that combines the discrimination power of discriminative methods with the reconstruction property of reconstructive methods which enables one to work on subsets of pixels in images to efficiently detect and reject the outliers. The proposed approach is therefore capable of robust classification with a high-breakdown point. We also show that subspace methods, such as CCA, which are used for solving regression tasks, can be treated in a similar manner. The theoretical results are demonstrated on several computer vision tasks showing that the proposed approach significantly outperforms the standard discriminative methods in the case of missing pixels and images containing occlusions and outliers.
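The following sketch illustrates the pixel-subsampling idea for robustly estimating subspace coefficients in the presence of occlusions: solve for coefficients on random pixel subsets and keep the hypothesis that explains the most pixels. It uses a random orthonormal basis and hand-set thresholds rather than the paper's augmented discriminative basis, so treat it as a conceptual sketch only.
```python
# Sketch of robust coefficient estimation by pixel subsampling (assumptions: a random
# orthonormal basis instead of the paper's augmented LDA/CCA-aware PCA basis, and
# hand-set subset size, hypothesis count and inlier tolerance).
import numpy as np

def robust_coefficients(x, basis, n_hypotheses=200, subset=10, tol=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d, k = basis.shape
    best_a, best_inliers = None, -1
    for _ in range(n_hypotheses):
        idx = rng.choice(d, size=subset, replace=False)       # random pixel subset
        a, *_ = np.linalg.lstsq(basis[idx], x[idx], rcond=None)
        inliers = np.sum(np.abs(basis @ a - x) < tol)         # score on all pixels
        if inliers > best_inliers:
            best_a, best_inliers = a, inliers
    return best_a

rng = np.random.default_rng(1)
basis, _ = np.linalg.qr(rng.normal(size=(100, 5)))            # toy "PCA" basis
a_true = rng.normal(size=5)
x = basis @ a_true
x[:20] = 5.0                                                  # simulate an occlusion
a_hat = robust_coefficients(x, basis)
print(np.max(np.abs(a_hat - a_true)))                         # close to zero
```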
Sanja Fidler, Gregor Berginc, Ales Leonardis
Conference on Computer Vision and Pattern Recognition (CVPR), 2006
Paper  Abstract  Bibtex@inproceedings{FidlerCVPR06,
author = {Sanja Fidler and Gregor Berginc and Ales Leonardis},
title = {Hierarchical Statistical Learning of Generic Parts of Object Structure},
booktitle = {CVPR},
year = {2006}
}
With the growing interest in object categorization, various methods have emerged that perform well on this challenging task, yet are inherently limited to only a moderate number of object classes. In pursuit of a more general categorization system, this paper proposes a way to overcome the computational complexity arising from the enormous number of different object categories by exploiting the statistical properties of the highly structured visual world. Our approach acquires generic parts of object structure hierarchically, ranging from simple to more complex ones, which stem from the favorable statistics of natural images. The parts recovered in the individual layers of the hierarchy can be used in a top-down manner, resulting in a robust statistical engine that could be efficiently used within many current categorization systems. The proposed approach has been applied to large image datasets, yielding important statistical insights into the generic parts of object structure.
Danijel Skocaj, Ales Leonardis, Sanja Fidler
28th workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR), 2004
Paper  Abstract  Bibtex@inproceedings{SkocajOAGM04,
author = {Danijel Skocaj and Ales Leonardis and Sanja Fidler},
title = {Robust estimation of canonical correlation coefficients},
booktitle = {28th workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR)},
year = {2004}
}
Canonical Correlation Analysis (CCA) is well suited for regression tasks in the appearance-based approach to modeling objects and scenes. However, since it relies on the standard projection, it is inherently non-robust. In this paper, we propose to embed the estimation of the CCA coefficients in an augmented PCA space, which enables detection of outliers while preserving regression-relevant information, thus allowing robust estimation of the canonical correlation coefficients.
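For context, the (non-robust) canonical correlation coefficients that the paper sets out to estimate robustly are obtained from the SVD of the whitened cross-covariance of the two views. The sketch below shows only this standard computation, not the proposed augmented-PCA embedding; the small regularizer and all names are assumptions.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Sketch: standard (non-robust) CCA on paired samples.
    X is (n, dx), Y is (n, dy); returns projection bases Wx, Wy and
    the canonical correlation coefficients."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    # Whiten each view via its Cholesky factor, then take the SVD of the
    # whitened cross-covariance; the singular values are the canonical
    # correlation coefficients.
    Lx_invT = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Ly_invT = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, s, Vt = np.linalg.svd(Lx_invT.T @ Cxy @ Ly_invT, full_matrices=False)
    Wx = Lx_invT @ U      # canonical directions for X
    Wy = Ly_invT @ Vt.T   # canonical directions for Y
    return Wx, Wy, s
```

The paper's contribution is to replace the non-robust projection step above with coefficient estimation in an augmented PCA space, so that outlying pixels can be detected and rejected before the correlations are computed.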
Sanja Fidler, Ales Leonardis
In Workshop in Statistical Analysis in Computer Vision (in conjunction with CVPR), 2003
Paper  Abstract  Bibtex@inproceedings{FidlerSACV03,
author = {Sanja Fidler and Ales Leonardis},
title = {Robust LDA classification by subsampling},
booktitle = {Workshop in Statistical Analysis in Computer Vision in conjunction with CVPR},
year = {2003}
}
In this paper we present a new method that enables a robust calculation of the LDA classification rule, making it possible to recognize objects under non-ideal conditions, i.e., when objects are occluded, appear against a varying background, or their images are corrupted by outliers. The main idea behind the method is to translate the task of calculating the LDA classification rule into the problem of determining the coefficients of an augmented generative model (PCA). Specifically, we construct an augmented PCA basis which, on the one hand, contains the information necessary for classification (in the LDA sense) and, on the other hand, enables us to calculate the necessary coefficients by means of a subsampling approach, resulting in classification with a high breakdown point. The theoretical results are evaluated on the ORL face database, showing that the proposed method significantly outperforms standard LDA.
Sanja Fidler, Ales Leonardis
27th workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR), 2003
Bibtex@inproceedings{FidlerOAGM03,
author = {Sanja Fidler and Ales Leonardis},
title = {Robust LDA classification},
booktitle = {27th workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR)},
year = {2003}
}