Full list also on Google Scholar

Year 2018



Semantic Understanding of Scenes Through the ADE20K Dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, Antonio Torralba

International Journal of Computer Vision (IJCV)

@article{ADEIJCV,
title = {Semantic Understanding of Scenes Through the ADE20K Dataset},
author = {Bolei Zhou and Hang Zhao and Xavier Puig and Tete Xiao and Sanja Fidler and Adela Barriuso and Antonio Torralba},
journal = {IJCV},
year = {2018}
}

Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present ADE20K, a densely annotated dataset spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and re-implement state-of-the-art models as open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that networks trained on ADE20K are able to segment a wide variety of scenes and objects.


A Neural Compositional Paradigm for Image Captioning

Bo Dai, Sanja Fidler, Dahua Lin

In Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018

@inproceedings{Dai18neurips,
title = { A Neural Compositional Paradigm for Image Captioning},
author = {Bo Dai and Sanja Fidler and Dahua Lin},
booktitle = {NeurIPS},
year = {2018}
}

Mainstream captioning models often follow a sequential structure to generate captions, leading to issues such as the introduction of irrelevant semantics, lack of diversity in the generated captions, and inadequate generalization performance. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption through a recursive compositional procedure in a bottom-up manner. Compared to conventional approaches, our paradigm better preserves the semantic content through an explicit factorization of semantics and syntax. With the compositional generation procedure, caption construction follows a recursive structure, which naturally fits the properties of human language. Moreover, the proposed compositional procedure requires less data to train, generalizes better, and yields more diverse captions.


Pose Estimation for Objects with Rotational Symmetry

Enric Corona, Kaustav Kundu, Sanja Fidler

In International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018

@inproceedings{pose2018,
title = {Pose Estimation for Objects with Rotational Symmetry},
author = {Enric Corona and Kaustav Kundu and Sanja Fidler},
booktitle = {IROS},
year = {2018}
}

Pose estimation is a widely explored problem, enabling many robotic tasks such as grasping and manipulation. In this paper, we tackle the problem of pose estimation for objects that exhibit rotational symmetry, which are common in man-made and industrial environments. In particular, our aim is to infer poses for objects not seen at training time, but for which 3D CAD models are available at test time. Previous work has tackled this problem by learning to compare captured views of real objects with rendered views of their 3D CAD models, embedding them in a joint latent space using neural networks. We show that sidestepping the issue of symmetry in this scenario during training leads to poor performance at test time. We propose a model that reasons about rotational symmetry during training by having access to only a small set of symmetry-labeled objects, while exploiting a large collection of unlabeled CAD models. We demonstrate that our approach significantly outperforms a naively trained neural network on a new pose dataset containing images of tools and hardware.


(oral presentation)

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

In European Conference on Computer Vision (ECCV), Munich, Germany, 2018

@inproceedings{Damen2018EPICKITCHENS,
title = {Scaling Egocentric Vision: The EPIC-KITCHENS Dataset},
author = {Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Fidler, Sanja and Furnari, Antonino and Kazakos, Evangelos and Moltisanti, Davide and Munro, Jonathan and Perrett, Toby and Price, Will and Wray, Michael},
booktitle = {ECCV},
year = {2018}
}

First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens.


(spotlight presentation)

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler

In British Machine Vision Conference (BMVC), Newcastle upon Tyne, UK, 2018

@inproceedings{vsepp2018,
title = {VSE++: Improving Visual-Semantic Embeddings with Hard Negatives},
author = {Fartash Faghri and David J. Fleet and Jamie Ryan Kiros and Sanja Fidler},
booktitle = {BMVC},
year = {2018}
}

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. This change, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
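
To make the loss change concrete, below is a minimal NumPy sketch of a max-of-hinges ranking loss that focuses on the hardest negative in a mini-batch, in the spirit of the description above; the function name, the margin value, and the use of cosine similarity are illustrative assumptions, not taken from the paper's released code.

import numpy as np

def hardest_negative_ranking_loss(im_emb, cap_emb, margin=0.2):
    """Ranking loss using only the hardest negative in the batch.

    im_emb, cap_emb: (N, D) L2-normalized embeddings; row i of each is a
    matched image-caption pair. The margin value is illustrative."""
    sims = im_emb @ cap_emb.T              # (N, N) cosine similarities
    pos = np.diag(sims)                    # matched-pair scores
    n = sims.shape[0]
    mask = ~np.eye(n, dtype=bool)          # exclude the positive pair

    # caption retrieval: for each image, the hardest negative caption
    cost_cap = np.clip(margin + sims - pos[:, None], 0, None)
    # image retrieval: for each caption, the hardest negative image
    cost_im = np.clip(margin + sims - pos[None, :], 0, None)

    hardest_cap = np.where(mask, cost_cap, 0).max(axis=1)
    hardest_im = np.where(mask, cost_im, 0).max(axis=0)
    return (hardest_cap + hardest_im).sum()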


Efficient Annotation of Segmentation Datasets with Polygon-RNN++

David Acuna*, Huan Ling*, Amlan Kar*, Sanja Fidler (* denotes equal contribution)

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{PolygonPP2018,
title = {Efficient Annotation of Segmentation Datasets with Polygon-RNN++},
author = {Acuna, David and Ling, Huan and Kar, Amlan and Fidler, Sanja},
booktitle = {CVPR},
year = {2018}
}

Manually labeling datasets with object masks is extremely time consuming. In this work, we follow the idea of Polygon-RNN to produce polygonal annotations of objects interactively, with a human in the loop. We introduce several important improvements to the model: 1) we design a new CNN encoder architecture, 2) we show how to effectively train the model with Reinforcement Learning, and 3) we significantly increase the output resolution using a Graph Neural Network, allowing the model to accurately annotate high-resolution objects in images. Extensive evaluation on the Cityscapes dataset shows that our model, which we refer to as Polygon-RNN++, significantly outperforms the original model in both automatic (10% absolute and 16% relative improvement in mean IoU) and interactive modes (requiring 50% fewer clicks by annotators). We further analyze the cross-domain scenario in which our model is trained on one dataset and used out of the box on datasets from varying domains. The results show that Polygon-RNN++ exhibits powerful generalization capabilities, achieving significant improvements over existing pixel-wise methods. Using simple online fine-tuning we further achieve a large reduction in annotation time for new datasets, moving a step closer towards an interactive annotation tool to be used in practice.


(oral presentation)

VirtualHome: Simulating Household Activities via Programs

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, Antonio Torralba

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{VirtualHome2018,
title = {VirtualHome: Simulating Household Activities via Programs},
author = {Xavier Puig and Kevin Ra and Marko Boben and Jiaman Li and Tingwu Wang and Sanja Fidler and Antonio Torralba},
booktitle = {CVPR},
year = {2018}
}

In this paper, we are interested in modeling complex activities that occur in a typical household. We propose to use programs, i.e., sequences of atomic actions and interactions, as a high-level representation of complex tasks. Programs are interesting because they provide a non-ambiguous representation of a task and allow agents to execute them. However, no existing database provides this type of information. Towards this goal, we first crowd-source programs for a variety of activities that happen in people's homes, via a game-like interface used for teaching kids how to code. Using the collected dataset, we show how we can learn to extract programs directly from natural language descriptions or from videos. We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to ``drive'' an artificial agent to execute tasks in a simulated household environment. Our VirtualHome simulator allows us to create a large activity video dataset with rich ground-truth, enabling training and testing of video understanding models. We further showcase examples of our agent performing tasks in VirtualHome based on language descriptions.


(spotlight presentation)

MovieGraphs: Towards Understanding Human-Centric Situations from Videos

Paul Vicol, Makarand Tapaswi, Lluis Castrejon, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{Moviegraphs2018,
title = {MovieGraphs: Towards Understanding Human-Centric Situations from Videos},
author = {Paul Vicol and Makarand Tapaswi and Lluis Castrejon and Sanja Fidler},
booktitle = {CVPR},
year = {2018}
}

There is growing interest in artificial intelligence to build socially intelligent robots. This requires machines to have the ability to "read" people's emotions, motivations, and other factors that affect behavior. Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (e.g., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions. In addition, most interactions and many attributes are grounded in the video with time stamps. We provide a thorough analysis of our dataset, showing interesting common-sense correlations between different social aspects of scenes, as well as across scenes over time. We propose a method for querying videos and text with graphs, and show that: 1) our graphs contain rich and sufficient information to summarize and localize each scene; and 2) subgraphs allow us to describe situations at an abstract level and retrieve multiple semantically relevant situations. We also propose methods for interaction understanding via ordering, and for reasoning about the social scene. MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents.


(spotlight presentation)

Now You Shake Me: Towards Automatic 4D Cinema

Yuhao Zhou, Makarand Tapaswi, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{Movie4D2018,
title = {Now You Shake Me: Towards Automatic 4D Cinema},
author = {Yuhao Zhou and Makarand Tapaswi and Sanja Fidler},
booktitle = {CVPR},
year = {2018}
}

We are interested in enabling automatic 4D cinema by parsing physical and special effects from untrimmed movies. These include effects such as physical interactions, water splashing, light, and shaking, and are grounded to either a character in the scene or the camera. We collect a new dataset referred to as the Movie4D dataset which annotates over 9K effects in 63 movies. We propose a Conditional Random Field model atop a neural network that brings together visual and audio information, as well as semantics in the form of person tracks. Our model further exploits correlations of effects between different characters in the clip as well as across movie threads. We propose effect detection and classification as two tasks, and present results along with ablation studies on our dataset, paving the way towards 4D cinema in everyone's homes.


Learning to Act Properly: Predicting and Explaining Affordances from Images

Ching-Yao Chuang, Jiaman Li, Antonio Torralba, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{ActProperly2018,
title = {Learning to Act Properly: Predicting and Explaining Affordances from Images},
author = {Chuang, Ching-Yao and Li, Jiaman and Torralba, Antonio and Fidler, Sanja},
booktitle = {CVPR},
year = {2018}
}

We address the problem of affordance reasoning in the diverse scenes that appear in the real world. Affordances relate the agent's actions to their effects when taken on the surrounding objects. In our work, we take the egocentric view of the scene, and aim to reason about action-object affordances that respect both the physical world and the social norms imposed by society. We also aim to teach artificial agents why some actions should not be taken in certain situations, and what would likely happen if these actions were taken. We collect a new dataset that builds upon ADE20K, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning. We propose a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object. Our model is showcased through various ablation studies, pointing to successes and challenges in this complex task.


A Face-to-Face Neural Conversation Model

Hang Chu, Daiqing Li, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{F2F2018,
title = {A Face-to-Face Neural Conversation Model},
author = {Hang Chu and Daiqing Li and Sanja Fidler},
booktitle = {CVPR},
year = {2018}
}

Neural networks have recently become good at engaging in dialog. However, current approaches are based solely on verbal text, lacking the richness of a real face-to-face conversation. We propose a neural conversation model that aims to read and generate facial gestures alongside text. This allows our model to adapt its response based on the ``mood'' of the conversation. In particular, we introduce an RNN encoder-decoder that exploits the movement of facial muscles, as well as the verbal conversation. The decoder consists of two layers, where the lower layer aims at generating the verbal response and coarse facial expressions, while the second layer fills in the subtle gestures, making the generated output smoother and more natural. We train our neural network by having it "watch" 250 movies. We showcase our joint face-text model in generating more natural conversations through automatic metrics and a human study. We demonstrate an example application with a face-to-face chatting avatar.


SurfConv: Bridging 3D and 2D Convolution for RGBD Images

Hang Chu, Wei-Chiu Ma, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018

@inproceedings{SurfConv2018,
title = {SurfConv: Bridging 3D and 2D Convolution for RGBD Images},
author = {Hang Chu and Wei-Chiu Ma and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2018}
}

The last few years have seen approaches trying to combine the increasing popularity of depth sensors and the success of convolutional neural networks. Using depth as an additional channel alongside the RGB input suffers from the scale-variance problem present in image-convolution-based approaches. On the other hand, 3D convolution wastes a large amount of memory on mostly unoccupied 3D space, since the occupied portion consists of only the surface visible to the sensor. Instead, we propose SurfConv, which "slides" compact 2D filters along the visible 3D surface. SurfConv is formulated as a simple depth-aware multi-scale 2D convolution, through a new Data-Driven Depth Discretization (D4) scheme. We demonstrate the effectiveness of our method on indoor and outdoor 3D semantic segmentation datasets. Our method achieves state-of-the-art performance while using less than 30% of the parameters used by 3D convolution based approaches.


NerveNet: Learning Structured Policy with Graph Neural Networks

Tingwu Wang, Renjie Liao, Jimmy Ba, Sanja Fidler

In International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018

@inproceedings{WangICLR2018,
title = {NerveNet: Learning Structured Policy with Graph Neural Networks},
author = {Tingwu Wang and Renjie Liao and Jimmy Ba and Sanja Fidler},
booktitle = {ICLR},
year = {2018}
}

We address the problem of learning structured policies for continuous control. In traditional reinforcement learning, policies of agents are learned by MLPs which take the concatenation of all observations from the environment as input for predicting actions. In this work, we propose NerveNet to explicitly model the structure of an agent, which naturally takes the form of a graph. Specifically, serving as the agent's policy network, NerveNet first propagates information over the structure of the agent and then predicts actions for different parts of the agent. In the experiments, we first show that NerveNet is comparable to state-of-the-art methods on standard MuJoCo environments. We further propose customized reinforcement learning environments for benchmarking two types of structure transfer learning tasks, i.e., size and disability transfer. We demonstrate that policies learned by NerveNet are significantly better than policies learned by other models and are able to transfer even in a zero-shot setting.

Year 2017



Teaching Machines to Describe Images via Natural Language Feedback

Huan Ling, Sanja Fidler

In Neural Information Processing Systems (NIPS), Long Beach, US, 2017

@inproceedings{LingNIPS2017,
title = {Teaching Machines to Describe Images via Natural Language Feedback},
author = {Huan Ling and Sanja Fidler},
booktitle = {NIPS},
year = {2017}
}

Robots will eventually be part of every household. It is thus critical to enable algorithms to learn from and be guided by non-expert users. In this paper, we bring a human in the loop, and enable a human teacher to give feedback to a learning agent in the form of natural language. We argue that a descriptive sentence can provide a stronger learning signal than a numeric reward, in that it can easily point to where the mistakes are and how to correct them. We focus on the problem of image captioning, in which the quality of the output can easily be judged by non-experts. We propose a hierarchical phrase-based captioning model trained with policy gradients, and design a feedback network that provides reward to the learner by conditioning on the human-provided feedback. We show that by exploiting descriptive feedback our model learns to perform better than when given independently written human captions.


(oral presentation)

Towards Diverse and Natural Image Descriptions via a Conditional GAN

Bo Dai, Sanja Fidler, Raquel Urtasun, Dahua Lin

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{DaiICCV17,
title = {Towards Diverse and Natural Image Descriptions via a Conditional GAN},
author = {Bo Dai and Sanja Fidler and Raquel Urtasun and Dahua Lin},
booktitle = {ICCV},
year = {2017}
}

Despite the substantial progress in recent years, image captioning techniques are still far from perfect. Sentences produced by existing methods, e.g., those based on RNNs, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, namely maximizing the likelihood of training samples. This principle encourages high resemblance to the "ground-truth" captions, while suppressing other reasonable descriptions. Conventional evaluation metrics, e.g., BLEU and METEOR, also favor such restrictive methods. In this paper, we explore an alternative approach, with the aim of improving naturalness and diversity -- two essential properties of human expression. Specifically, we propose a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content. It is noteworthy that training a sequence generator is nontrivial. We overcome the difficulty via Policy Gradient, a strategy stemming from Reinforcement Learning, which allows the generator to receive early feedback along the way. We tested our method on two large datasets, where it performed competitively against real people in our user study and outperformed other methods on various tasks.


(oral presentation)

3D Graph Neural Networks for RGBD Semantic Segmentation

Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{3dggnnICCV17,
title = {3D Graph Neural Networks for RGBD Semantic Segmentation},
author = {Xiaojuan Qi and Renjie Liao and Jiaya Jia and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2017}
}

RGBD semantic segmentation requires joint reasoning about 2D appearance and 3D geometric information. In this paper we propose a 3D graph neural network (3DGNN) that builds a k-nearest-neighbor graph on top of a 3D point cloud. Each node in the graph corresponds to a set of points and is associated with a hidden representation vector initialized with an appearance feature extracted by a unary CNN from 2D images. Relying on recurrent functions, every node dynamically updates its hidden representation based on its current status and incoming messages from its neighbors. This propagation model is unrolled for a certain number of time steps, and the final per-node representation is used for predicting the semantic class of each pixel. We use back-propagation through time to train the model. Extensive experiments on the NYUD2 and SUN-RGBD datasets demonstrate the effectiveness of our approach.
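
As an illustrative sketch of the propagation structure described above (not the paper's implementation), the following NumPy code builds a k-nearest-neighbor graph over 3D points and unrolls a few message-passing steps; the tanh update stands in for the recurrent update used in the paper, and the feature sizes and weights are placeholders.

import numpy as np

def knn_graph(points, k=8):
    """Indices of the k nearest neighbors (excluding self) for each 3D point."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]          # (N, k)

def propagate(h, neighbors, W, U, steps=3):
    """Unrolled message passing: each node aggregates its neighbors' states
    and updates its own hidden vector; a tanh update is used here purely
    for illustration."""
    for _ in range(steps):
        msg = h[neighbors].mean(axis=1)           # (N, D) aggregated messages
        h = np.tanh(h @ W + msg @ U)
    return h

# toy usage: 500 points with 16-d placeholder appearance features
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3))
h0 = rng.normal(size=(500, 16))
W, U = rng.normal(size=(16, 16)) * 0.1, rng.normal(size=(16, 16)) * 0.1
h_final = propagate(h0, knn_graph(pts), W, U)     # fed to a per-node classifier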


(spotlight presentation)

TorontoCity: Seeing the World with a Million Eyes

Shenlong Wang, Min Bai, Gellert Mattyus, Hang Chu, Wenjie Luo, Bin Yang, Justin Liang, Joel Cheverie, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{TCity2017,
title = {TorontoCity: Seeing the World with a Million Eyes},
author = {Shenlong Wang and Min Bai and Gellert Mattyus and Hang Chu and Wenjie Luo and Bin Yang and Justin Liang and Joel Cheverie and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2017}
}

In this paper we introduce the TorontoCity benchmark, which covers the full greater Toronto area (GTA) with 712.5 km² of land, 8,439 km of road and around 400,000 buildings. Our benchmark provides different perspectives of the world captured from airplanes, drones and cars driving around the city. Manually labeling such a large-scale dataset is infeasible. Instead, we propose to utilize different sources of high-precision maps to create our ground truth. Towards this goal, we develop algorithms that allow us to align all data sources with the maps while requiring minimal human supervision. We have designed a wide variety of tasks including building height estimation (reconstruction), road centerline and curb extraction, building instance segmentation, building contour extraction (reorganization), semantic labeling and scene type classification (recognition). Our pilot study shows that most of these tasks are still difficult for modern convolutional neural networks.


Situation Recognition with Graph Neural Networks

Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, Sanja Fidler

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{SituationsICCV17,
title = {Situation Recognition with Graph Neural Networks},
author = {Ruiyu Li and Makarand Tapaswi and Renjie Liao and Jiaya Jia and Raquel Urtasun and Sanja Fidler},
booktitle = {ICCV},
year = {2017}
}

We address the problem of recognizing situations in images. Given an image, the task is to predict the most salient verb (action) and fill its semantic roles, such as who is performing the action, what is the source and target of the action, etc. Different verbs have different roles (e.g., attacking has a weapon), and each role can take on many possible values (nouns). We propose a model based on Graph Neural Networks that allows us to efficiently capture joint dependencies between roles using neural networks defined on a graph. Experiments with different graph connectivities show that our approach, which propagates information between roles, significantly outperforms existing work, as well as multiple baselines. We obtain roughly 3-5% improvement over previous work in predicting the full situation. We also provide a thorough qualitative analysis of our model and the influence of different roles in the verbs.


Open Vocabulary Scene Parsing

Hang Zhao, Xavier Puig, Bolei Zhou, Sanja Fidler, Antonio Torralba

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{openvoc17,
title = {Open Vocabulary Scene Parsing},
author = {Hang Zhao and Xavier Puig and Bolei Zhou and Sanja Fidler and Antonio Torralba},
booktitle = {ICCV},
year = {2017}
}

Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task that aims at parsing scenes with a large and open vocabulary, and several evaluation metrics are explored for this problem. Our proposed approach to this problem is a joint image pixel and word concept embeddings framework, where word concepts are connected by semantic relations. We validate the open vocabulary prediction ability of our framework on ADE20K dataset which covers a wide variety of scenes and objects. We further explore the trained joint embedding space to show its interpretability.


Sequential Grouping Networks for Instance Segmentation

Shu Liu, Jiaya Jia, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{SGN17,
title = {Sequential Grouping Networks for Instance Segmentation},
author = {Shu Liu and Jiaya Jia and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2017}
}

In this paper, we propose Sequential Grouping Networks (SGN) to tackle the problem of object instance segmentation. SGNs employ a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels. In particular, the first network aims to group pixels along each image row and column by predicting horizontal and vertical object breakpoints. These breakpoints are then used to create line segments. By exploiting two-directional information, the second network groups horizontal and vertical lines into connected components. Finally, the third network groups the connected components into object instances. Our experiments show that SGN significantly outperforms state-of-the-art approaches on both the Cityscapes dataset and PASCAL VOC.
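
The first step above, turning predicted breakpoints into line segments, can be illustrated with a small sketch; the array layout and function name are assumptions for illustration only, and the subsequent learned grouping of segments into connected components and instances is not shown.

import numpy as np

def row_segments(breakpoints):
    """Split each image row into horizontal line segments at predicted
    object breakpoints. `breakpoints` is a (H, W) boolean array; a True
    entry marks a column where a new segment starts in that row.
    Returns a list of (row, start_col, end_col) tuples, half-open on the right."""
    H, W = breakpoints.shape
    segments = []
    for y in range(H):
        starts = [0] + list(np.flatnonzero(breakpoints[y]))
        for s, e in zip(starts, starts[1:] + [W]):
            if e > s:
                segments.append((y, s, e))
    return segments

# toy usage on a 4x8 breakpoint map
bp = np.zeros((4, 8), dtype=bool)
bp[1, 3] = True                                   # row 1 splits at column 3
print(row_segments(bp))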


Be Your Own Prada: Fashion Synthesis with Structural Coherence

Shizhan Zhu, Sanja Fidler, Raquel Urtasun, Dahua Lin, Chen Change Loy

In International Conference on Computer Vision (ICCV), Venice, Italy, 2017

@inproceedings{GANprada17,
title = {Be Your Own Prada: Fashion Synthesis with Structural Coherence},
author = {Shizhan Zhu and Sanja Fidler and Raquel Urtasun and Dahua Lin and Chen Change Loy},
booktitle = {ICCV},
year = {2017}
}

We present a novel and effective approach for generating new clothing on a wearer through generative adversarial learning. Given an input image of a person and a sentence describing a different outfit, our model "redresses" the person as desired, while at the same time keeping the wearer and her/his pose unchanged. Generating new outfits with precise regions conforming to a language description while retaining the wearer's body structure is a new and challenging task. Existing generative adversarial networks are not ideal for ensuring global coherence of structure given both the input photograph and the language description as conditions. We address this challenge by decomposing the complex generative process into two conditional stages. In the first stage, we generate a plausible semantic segmentation map that obeys the wearer's pose as a latent spatial arrangement. An effective spatial constraint is formulated to guide the generation of this semantic segmentation map. In the second stage, a generative model with a newly proposed compositional mapping layer is used to render the final image with precise regions and textures conditioned on this map. We extended the DeepFashion dataset by collecting sentence descriptions for 79K images. We demonstrate the effectiveness of our approach through both quantitative and qualitative evaluations. A user study is also conducted.


(best paper honorable mention)

Annotating Object Instances with a Polygon-RNN

Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017

@inproceedings{CastrejonCVPR17,
title = {Annotating Object Instances with a Polygon-RNN},
author = {Lluis Castrejon and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2017}
}

We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows a human annotator to intervene at any time and correct a vertex if needed, producing a segmentation as accurate as the annotator desires. We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with the original ground truth, matching the typical agreement between human annotators. For cars, our speed-up factor is 7.3 for an agreement of 82.2%. We further show the generalization capabilities of our approach to unseen datasets.
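
Below is a hedged sketch of the human-in-the-loop protocol described above: the model proposes the next polygon vertex and the annotator may override it before it is fed back. Both callables are hypothetical stand-ins for the model and the annotation UI; this is not the authors' tool.

def annotate_polygon(predict_next_vertex, get_correction, max_vertices=60):
    """Human-in-the-loop polygon annotation loop (illustrative only).

    predict_next_vertex(vertices) -> (x, y), or None when the model decides
    the polygon is closed; get_correction(vertex) -> corrected (x, y), or
    None if the annotator accepts the prediction."""
    vertices = []
    while len(vertices) < max_vertices:
        pred = predict_next_vertex(vertices)
        if pred is None:                      # model emitted its end-of-polygon signal
            break
        fix = get_correction(pred)
        vertices.append(fix if fix is not None else pred)
    return vertices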


Sports Field Localization via Deep Structured Models

Namdar Homayounfar, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017

@inproceedings{NamdarCVPR17,
title = {Sports Field Localization via Deep Structured Models},
author = {Namdar Homayounfar and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2017}
}

In this work, we propose a novel way of efficiently localizing a sports field from a single broadcast image of the game. Related work in this area relies on manually annotating a few key frames and extending the localization to similar images, or on installing fixed specialized cameras in the stadium from which the layout of the field can be obtained. In contrast, we formulate this problem as branch-and-bound inference in a Markov random field, where an energy function is defined in terms of semantic cues such as the field surface, lines and circles obtained from a deep semantic segmentation network. Moreover, our approach is fully automatic and depends only on a single image from the broadcast video of the game. We demonstrate the effectiveness of our method by applying it to soccer and hockey.


Scene Parsing through ADE20K Dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, Antonio Torralba

In Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017

@inproceedings{Ade20k,
title = {Scene Parsing through ADE20K Dataset},
author = {Bolei Zhou and Hang Zhao and Xavier Puig and Sanja Fidler and Adela Barriuso and Antonio Torralba},
booktitle = {CVPR},
year = {2017}
}

Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A scene parsing benchmark is built upon ADE20K with 150 object and stuff classes included. Several segmentation baseline models are evaluated on the benchmark. A novel network design called Cascade Segmentation Module is proposed to parse a scene into stuff, objects, and object parts in a cascade, improving over the baselines. We further show that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.


Find Your Way by Observing the Sun and Other Semantic Cues

Wei-Chiu Ma, Shenlong Wang, Marcus A. Brubaker, Sanja Fidler, Raquel Urtasun

In International Conference on Robotics and Automation (ICRA), Singapore, 2017

@inproceedings{WeiChiuICRA17,
title = {Find Your Way by Observing the Sun and Other Semantic Cues},
author = {Wei-Chiu Ma and Shenlong Wang and Marcus A. Brubaker and Sanja Fidler and Raquel Urtasun},
booktitle = {ICRA},
year = {2017}
}

In this paper we present a robust, efficient and affordable approach to self-localization which requires neither GPS nor knowledge about the appearance of the world. Towards this goal, we utilize freely available cartographic maps and derive a probabilistic model that exploits semantic cues in the form of sun direction, presence of an intersection, road type, speed limit, as well as the ego-car trajectory, in order to produce very reliable localization results. Our experimental evaluation shows that our approach can localize much faster (in terms of driving time), with less computation and more robustly than competing approaches, which ignore semantic information.


3D Object Proposals using Stereo Imagery for Accurate Object Class Detection

Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Huimin Ma, Sanja Fidler, Raquel Urtasun

To appear in Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017

@inproceedings{ChenArxiv16,
title = {3D Object Proposals using Stereo Imagery for Accurate Object Class Detection},
author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {arXiv:1608.07711},
year = {2016}
}

The goal of this paper is to perform 3D object detection in the context of autonomous driving. Our method first aims at generating a set of high-quality 3D object proposals by exploiting stereo imagery. We formulate the problem as minimizing an energy function that encodes object size priors, placement of objects on the ground plane, as well as several depth-informed features that reason about free space, point cloud densities and distance to the ground. We then exploit a CNN on top of these proposals to perform object detection. In particular, we employ a convolutional neural net (CNN) that exploits context and depth information to jointly regress to 3D bounding box coordinates and object pose. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. When combined with the CNN, our approach outperforms all existing results in object detection and orientation estimation tasks for all three KITTI object classes. Furthermore, we also experiment with the setting where LIDAR information is available, and show that using both LIDAR and stereo leads to the best result.

Year 2016



Song From PI: A Musically Plausible Network for Pop Music Generation

Hang Chu, Raquel Urtasun, Sanja Fidler

arXiv preprint arXiv:1611.03477; ICLR 2017 Workshop track

@inproceedings{SongOfPI,
title = {Song From PI: A Musically Plausible Network for Pop Music Generation},
author = {Hang Chu and Raquel Urtasun and Sanja Fidler},
booktitle = {arXiv:1611.03477},
year = {2016}
}

We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show a strong preference for our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.


Efficient Summarization with Read-Again and Copy Mechanism

Wenyuan Zeng, Wenjie Luo, Sanja Fidler, Raquel Urtasun

arXiv preprint arXiv:1611.03382

@inproceedings{WenyuanArxiv16,
title = {Efficient Summarization with Read-Again and Copy Mechanism},
author = {Wenyuan Zeng and Wenjie Luo and Sanja Fidler and Raquel Urtasun},
booktitle = {arXiv:1611.03382},
year = {2016}
}

Encoder-decoder models have been widely used to solve sequence-to-sequence prediction tasks. However, current approaches suffer from two shortcomings. First, the encoders compute a representation of each word taking into account only the history of the words it has read so far, yielding suboptimal representations. Second, current decoders utilize large vocabularies in order to minimize the problem of unknown words, resulting in slow decoding times. In this paper we address both shortcomings. Towards this goal, we introduce a simple mechanism that reads the input sequence in full before committing to a representation of each word. Furthermore, we propose a simple copy mechanism that is able to exploit very small vocabularies and handle out-of-vocabulary words. We demonstrate the effectiveness of our approach on the Gigaword dataset and the DUC competition, outperforming the state-of-the-art.


Proximal Deep Structured Models

Shenlong Wang, Sanja Fidler, Raquel Urtasun

In Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016

@inproceedings{ShenlongNIPS16,
title = {Proximal Deep Structured Models},
author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {NIPS},
year = {2016}
}

Many problems in real-world applications involve predicting continuous-valued random variables that are statistically related. In this paper, we propose a powerful deep structured model that is able to learn complex non-linear functions which encode the dependencies between continuous output variables. We show that inference in our model using proximal methods can be efficiently solved as a feed-forward pass of a special type of deep recurrent neural network. We demonstrate the effectiveness of our approach in the tasks of image denoising, depth refinement and optical flow estimation.
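
To illustrate the point that inference with proximal methods unrolls into a feed-forward network, here is a generic proximal-gradient (ISTA-style) unrolling for a toy L1-regularized denoising objective; the objective, step size and number of layers are illustrative assumptions and do not reproduce the paper's learned operators.

import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def unrolled_proximal_denoise(y, lam=0.1, step=1.0, layers=10):
    """Each 'layer' is one proximal-gradient step on
    0.5*||x - y||^2 + lam*||x||_1; stacking the steps gives a feed-forward,
    recurrent structure (here with fixed, hand-set parameters instead of
    learned ones)."""
    x = np.zeros_like(y)
    for _ in range(layers):
        grad = x - y                       # gradient of the quadratic data term
        x = soft_threshold(x - step * grad, step * lam)
    return x

noisy = np.array([0.05, -0.8, 1.2, 0.02])
print(unrolled_proximal_denoise(noisy))    # small entries are shrunk to zero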


HouseCraft: Building Houses from Rental Ads and Street Views

Hang Chu, Shenlong Wang, Raquel Urtasun, Sanja Fidler

In European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016

@inproceedings{ChuCVPR16,
title = {HouseCraft: Building Houses from Rental Ads and Street Views},
author = {Hang Chu and Shenlong Wang and Raquel Urtasun and Sanja Fidler},
booktitle = {ECCV},
year = {2016}
}

In this paper, we utilize rental ads to create realistic textured 3D models of building exteriors. In particular, we exploit the address of the property and its floorplan, which are typically available in the ad. The address allows us to extract Google StreetView images around the building, while the building's floorplan allows for an efficient parametrization of the building in 3D via a small set of random variables. We propose an energy minimization framework which jointly reasons about the height of each floor, the vertical positions of windows and doors, as well as the precise location of the building in the world's map, by exploiting several geometric and semantic cues from the StreetView imagery. To demonstrate the effectiveness of our approach, we collected a new dataset with 174 houses by crawling a popular rental website. Our experiments show that our approach is able to precisely estimate the geometry and location of the property, and can create realistic 3D building models.


(spotlight presentation)

MovieQA: Understanding Stories in Movies through Question-Answering

Makarand Tapaswi, Yukun Zhu, Reiner Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

@inproceedings{TapaswiCVPR16,
title = {MovieQA: Understanding Stories in Movies through Question-Answering},
author = {Makarand Tapaswi and Yukun Zhu and Reiner Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2016}
}

We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 15,000 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.


Instance-Level Segmentation for Autonomous Driving with Deep Densely Connected MRFs

Ziyu Zhang, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

@inproceedings{ZhangCVPR16,
title = {Instance-Level Segmentation for Autonomous Driving with Deep Densely Connected MRFs},
author = {Ziyu Zhang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}
}

Our aim is to provide a pixel-level object instance labeling of a monocular image. We build on recent work [Zhang et al., ICCV15] that trained a convolutional neural net to predict instance labeling in local image patches, extracted exhaustively in a stride from an image. A simple Markov random field model using several heuristics was then proposed in [Zhang et al., ICCV15] to derive a globally consistent instance labeling of the image. In this paper, we formulate the global labeling problem with a novel densely connected Markov random field and show how to encode various intuitive potentials in a way that is amenable to efficient mean field inference [Krahenbuhl et al., NIPS11]. Our potentials encode the compatibility between the global labeling and the patch-level predictions, contrast-sensitive smoothness as well as the fact that separate regions form different instances. Our experiments on the challenging KITTI benchmark [Geiger et al., CVPR12] demonstrate that our method achieves a significant performance boost over the baseline [Zhang et al., ICCV15].


Monocular 3D Object Detection for Autonomous Driving

Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

@inproceedings{ChenCVPR16,
title = {Monocular 3D Object Detection for Autonomous Driving},
author = {Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}
}

The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.


HD Maps: Fine-grained Road Segmentation by Parsing Ground and Aerial Images

Gellert Mattyus, Shenlong Wang, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016

@inproceedings{MattyusCVPR16,
title = {HD Maps: Fine-grained Road Segmentation by Parsing Ground and Aerial Images},
author = {Gellert Mattyus and Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2016}
}

In this paper we present an approach to enhance existing maps with fine-grained segmentation categories such as parking spots and sidewalk, as well as the number and location of road lanes. Towards this goal, we propose an efficient approach that is able to estimate these fine-grained categories by doing joint inference over both monocular aerial imagery and ground images taken from a stereo camera pair mounted on top of a car. Important to this is reasoning about the alignment between the two types of imagery, as even when the measurements are taken with sophisticated GPS+IMU systems, this alignment is not sufficiently accurate. We demonstrate the effectiveness of our approach on a new dataset which enhances KITTI with aerial images taken with a camera mounted on an airplane flying around the city of Karlsruhe, Germany.


(oral presentation)

Order-Embeddings of Images and Language

Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun

In International Conference on Learning Representations (ICLR), Puerto Rico, 2016

@inproceedings{VendrovArxiv15,
title = {Order-Embeddings of Images and Language},
author = {Ivan Vendrov and Ryan Kiros and Sanja Fidler and Raquel Urtasun},
booktitle = {ICLR},
year = {2016}
}

Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.
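
For concreteness, here is a small NumPy sketch of an order-violation penalty of the form ||max(0, y - x)||^2 together with a hinge-style ranking loss built on it, in the spirit of the ordered representations described above; the margin and the toy vectors are illustrative assumptions.

import numpy as np

def order_violation(x, y):
    """Order-violation penalty ||max(0, y - x)||^2: zero exactly when
    y <= x coordinate-wise, and positive otherwise."""
    return (np.maximum(0.0, y - x) ** 2).sum(axis=-1)

def order_ranking_loss(x_pos, y_pos, x_neg, y_neg, margin=1.0):
    """Hinge loss pushing violations of true (ordered) pairs toward zero
    and violations of negative pairs above a margin (margin is illustrative)."""
    pos = order_violation(x_pos, y_pos)
    neg = np.maximum(0.0, margin - order_violation(x_neg, y_neg))
    return pos.mean() + neg.mean()

# toy usage with 2-d embeddings
a = np.array([1.0, 2.0])
b = np.array([0.5, 1.5])   # b <= a coordinate-wise: no violation
c = np.array([0.5, 2.5])   # c exceeds a in one coordinate
print(order_violation(a, b), order_violation(a, c))   # 0.0 0.25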

Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding

Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh

Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(1), 2016, pages 74-87

@article{MottaghiPAMI16,
title = {Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding},
author = {Roozbeh Mottaghi and Sanja Fidler and Alan Yuille and Raquel Urtasun and Devi Parikh},
journal = {Trans. on Pattern Analysis and Machine Intelligence},
volume= {38},
number= {1},
pages= {74--87},
year = {2016}
}

Recent trends in image understanding have pushed for scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we "plug-in" human subjects for each of the various components in a conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve scene understanding by focusing research efforts on various individual tasks.

Year 2015



(oral presentation)

Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu*, Ryan Kiros*, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

* Denotes equal contribution

@inproceedings{ZhuICCV15,
title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
booktitle = {ICCV},
year = {2015}
}

Books are a rich source of both fine-grained information, such as what a character, an object or a scene looks like, as well as high-level semantics, such as what someone is thinking or feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.

(oral presentation)

Lost Shopping! Monocular Localization in Large Indoor Spaces

Shenlong Wang, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{WangICCV15,
title = {Lost Shopping! Monocular Localization in Large Indoor Spaces},
author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}
}

In this paper we propose a novel approach to localization in very large indoor spaces (i.e., 200+ store shopping malls) that takes a single image and a floor plan of the environment as input. We formulate the localization problem as inference in a Markov random field, which jointly reasons about text detection (localizing shop names in the image with precise bounding boxes), shop facade segmentation, as well as the camera's rotation and translation within the entire shopping mall. The power of our approach is that it does not use any prior information about appearance and instead exploits text detections corresponding to the shop names. This makes our method applicable to a variety of domains and robust to store appearance variation across countries, seasons, and illumination conditions. We demonstrate the performance of our approach on a new dataset we collected of two very large shopping malls, and show the power of holistic reasoning.

Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions

Jimmy Ba, Kevin Swersky, Sanja Fidler, Ruslan Salakhutdinov

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{BaICCV15,
title = {Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions},
author = {Jimmy Ba and Kevin Swersky and Sanja Fidler and Ruslan Salakhutdinov},
booktitle = {ICCV},
year = {2015}
}

One of the main challenges in zero-shot learning of visual categories is gathering semantic attributes to accompany images. Recent work has shown that learning from textual descriptions, such as Wikipedia articles, avoids the problem of having to explicitly define these attributes. We present a new model that can classify unseen categories from their textual description. Specifically, we use text features to predict the output weights of both the convolutional and the fully connected layers in a deep convolutional neural network (CNN). We take advantage of the architecture of CNNs and learn features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches. The proposed model also allows us to automatically generate a list of pseudo-attributes for each visual category consisting of words from Wikipedia articles. We train our models end-to-end using the Caltech-UCSD bird and flower datasets and evaluate both ROC and Precision-Recall curves. Our empirical results show that the proposed model significantly outperforms previous methods.

Enhancing World Maps by Parsing Aerial Images

Gellert Mattyus, Shenlong Wang, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{MatthyusICCV15,
title = {Enhancing World Maps by Parsing Aerial Images},
author = {Gellert Mattyus and Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}
}

In recent years, contextual models that exploit maps have been shown to be very effective for many recognition and localization tasks. In this paper, we propose to exploit aerial images in order to enhance freely available world maps. Towards this goal, we make use of OpenStreetMap and formulate the problem as inference in a Markov random field parameterized in terms of the location of the road-segment centerlines as well as their width. This parameterization enables very efficient inference and returns only topologically correct roads. In particular, we can segment all OSM roads in the world in a single day using a small cluster of 10 computers. Importantly, our approach generalizes very well; it can be trained using a single aerial image and produces very accurate results in any location across the globe. We demonstrate the effectiveness of our approach over the previous state-of-the-art on two new benchmarks that we collect. We additionally show how our enhanced maps can be exploited for semantic segmentation of ground images.

Monocular Object Instance Segmentation and Depth Ordering with CNNs

Ziyu Zhang, Alex Schwing, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{ZhangICCV15,
title = {Monocular Object Instance Segmentation and Depth Ordering with CNNs},
author = {Ziyu Zhang and Alex Schwing and Sanja Fidler and Raquel Urtasun},
booktitle = {ICCV},
year = {2015}
}

In this paper we tackle the problem of instance level segmentation and depth ordering from a single monocular image. Towards this goal, we take advantage of convolutional neural nets and train them to directly predict instance level segmentations where the instance ID encodes depth ordering from large image patches. To provide a coherent single explanation of an image we develop a Markov random field which takes as input the predictions of convolutional nets applied at overlapping patches of different resolutions as well as the output of a connected component algorithm and predicts very accurate instance level segmentation and depth ordering. We demonstrate the effectiveness of our approach on the challenging KITTI benchmark and show very good performance on both tasks.

A Learning Framework for Generating Region Proposals with Mid-level Cues

Tom Lee, Sanja Fidler, Sven Dickinson

In International Conference on Computer Vision (ICCV), Santiago, Chile, 2015

@inproceedings{TLeeICCV15,
title = {A Learning Framework for Generating Region Proposals with Mid-level Cues},
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
booktitle = {ICCV},
year = {2015}
}

The object categorization community's migration from object detection to large-scale object categorization has seen a shift from sliding window approaches to bottom-up region segmentation, with the resulting region proposals offering discriminating shape and appearance features through an attempt to explicitly segment the objects in a scene from their background. One powerful class of region proposal techniques is based on parametric energy minimization (PEM) via parametric maxflow. In this paper, we incorporate PEM into a novel structured learning framework that learns how to combine a set of mid-level grouping cues to yield a small set of region proposals with high recall. Second, we diversify our region proposals and rank them with region-based convolutional neural network features. Our novel approach, called parametric min-loss, casts perceptual grouping and cue combination in a learning framework which yields encouraging results on VOC'2012.

Skip-Thought Vectors

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler

Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

@inproceedings{KirosNIPS15,
title = {Skip-Thought Vectors},
author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
booktitle = {NIPS},
year = {2015}
}

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.
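
The vocabulary-expansion step is simple enough to show directly. Below is a hedged numpy sketch of the idea described in the abstract: fit a linear map from a pre-trained word-embedding space (e.g. word2vec) to the encoder's learned word-embedding space over the shared training vocabulary, then project unseen words through it. The embedding tables here are random stand-ins with hypothetical sizes.

import numpy as np

rng = np.random.default_rng(0)

n_shared, d_w2v, d_rnn = 20000, 300, 620     # hypothetical vocabulary and embedding sizes
V_w2v = rng.normal(size=(n_shared, d_w2v))   # pre-trained vectors of words seen in training
V_rnn = rng.normal(size=(n_shared, d_rnn))   # the encoder's learned word embeddings

# Fit a linear map W so that V_w2v @ W approximates V_rnn (least squares over shared words).
W, *_ = np.linalg.lstsq(V_w2v, V_rnn, rcond=None)

# A word with a pre-trained vector but never seen during training can now be
# projected into the encoder's embedding space and fed to the trained model.
v_new_word = rng.normal(size=d_w2v)
e_new_word = v_new_word @ W
print(e_new_word.shape)                      # (620,)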

3D Object Proposals for Accurate Object Class Detection

Xiaozhi Chen*, Kaustav Kundu*, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun

Neural Information Processing Systems (NIPS), Montreal, Canada, 2015

* Denotes equal contribution

@inproceedings{XiaozhiNIPS15,
title = {3D Object Proposals for Accurate Object Class Detection},
author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Andrew Berneshawi and Huimin Ma and Sanja Fidler and Raquel Urtasun},
booktitle = {NIPS},
year = {2015}
}

The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on Car and Cyclist, and is competitive for the Pedestrian class.
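
To make the energy-minimization view concrete, here is a toy Python sketch of scoring candidate 3D boxes with a weighted combination of a point-cloud density term and a size prior. The geometry is deliberately simplified (axis-aligned boxes, random points, made-up weights); the paper's energy additionally reasons about free space, ground-plane placement and distance to the ground.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stereo point cloud (x, y = height below camera, z = depth).
points = rng.uniform(low=[-20, -2, 0], high=[20, 0, 40], size=(5000, 3))

# Candidate boxes on the ground plane: (x_center, z_center, length, width, height).
candidates = np.array([[0.0, 10.0, 4.0, 1.8, 1.6],
                       [5.0, 20.0, 4.0, 1.8, 1.6],
                       [-8.0, 30.0, 4.0, 1.8, 1.6]])

def score_box(box, pts, w_density=1.0, w_size=0.5, prior_volume=4.0 * 1.8 * 1.6):
    # Toy energy: reward point density inside the box, penalize deviation from a size prior.
    xc, zc, l, w, h = box
    inside = ((np.abs(pts[:, 0] - xc) < l / 2) &
              (np.abs(pts[:, 2] - zc) < w / 2) &
              (pts[:, 1] > -h) & (pts[:, 1] < 0))
    density = inside.sum() / (l * w * h)
    size_penalty = abs(l * w * h - prior_volume)
    return w_density * density - w_size * size_penalty

scores = [score_box(b, points) for b in candidates]
print(int(np.argmax(scores)), scores)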


(oral presentation)

Generating Multi-Sentence Lingual Descriptions of Indoor Scenes

Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun

In British Machine Vision Conference (BMVC), Swansea, UK, 2015

@inproceedings{LinBMVC15,
title = {Generating Multi-Sentence Lingual Descriptions of Indoor Scenes},
author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
booktitle = {BMVC},
year = {2015}}

This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches have focused primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROUGE scores compared to those produced by the baseline.


(oral presentation)

Rent3D: Floor-Plan Priors for Monocular Layout Estimation

Chenxi Liu*, Alex Schwing*, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

* Denotes equal contribution

@inproceedings{ApartmentsCVPR15,
title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2015}}

The goal of this paper is to enable a 3D "virtual-tour" of an apartment given a small set of monocular images of different rooms, as well as a 2D floor plan. We frame the problem as the one of inference in a Markov random field which reasons about the layout of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example, about in which room the picture was taken. What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge. In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information can significantly help in resolving the challenging room-apartment alignment problem. We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we exploit integral geometry as well as our new bounds on the aspect ratio of rooms which allow us to carve the space, reducing significantly the number of physically possible configurations. We demonstrate the effectiveness of our approach in a new dataset which contains over 200 apartments.


(oral presentation)

Holistic 3D Scene Understanding from a Single Geo-tagged Image

Shenlong Wang, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

@inproceedings{WangCVPR15,
title = {Holistic 3D Scene Understanding from a Single Geo-tagged Image},
author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}}

In this paper we are interested in exploiting geographic priors to help outdoor scene understanding. Towards this goal we propose a holistic approach that reasons jointly about 3D object detection, pose estimation, semantic segmentation as well as depth reconstruction from a single image. Our approach takes advantage of large-scale crowdsourced maps to generate dense geographic, geometric and semantic priors by rendering the 3D world. We demonstrate the effectiveness of our holistic model on the challenging KITTI dataset, and show significant improvements over the baselines in all metrics and tasks.

segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

@inproceedings{ZhuSegDeepM15,
title = {segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection},
author = {Yukun Zhu and Raquel Urtasun and Ruslan Salakhutdinov and Sanja Fidler},
booktitle = {CVPR},
year = {2015}
}

In this paper, we propose an approach that exploits object segmentation in order to improve the accuracy of object detection. We frame the problem as inference in a Markov Random Field, in which each detection hypothesis scores object appearance as well as contextual information using Convolutional Neural Networks, and allows the hypothesis to choose and score a segment out of a large pool of accurate object segmentation proposals. This enables the detector to incorporate additional evidence when it is available and thus results in more accurate detections. Our experiments show an improvement of 4.1% in mAP over the R-CNN baseline on PASCAL VOC 2010, and 3.4% over the current state-of-the-art, demonstrating the power of our approach.

Neuroaesthetics in Fashion: Modeling the Perception of Beauty

Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

@inproceedings{SimoCVPR15,
title = {Neuroaesthetics in Fashion: Modeling the Perception of Beauty},
author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}
}

In this paper, we analyze clothing fashion on a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.





International news


Vogue (Spain) Stylebook (Germany) Ansa (Italy) CenarioMT (Brazil) Amsterdam Fashion (NL)
Marie Claire (France) Fashion Police (Nigeria) Nauka (Poland) Pluska (Slovakia) Pressetext (Austria)
Wired (Germany) Jetzt (Germany) La Gazzetta (Italy) PopSugar (Australia) SinEmbargo (Mexico)

A more complete list is maintained on our project webpage.


Real-Time Coarse-to-fine Topologically Preserving Segmentation

Jian Yao, Marko Boben, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Boston, 2015

@inproceedings{YaoCVPR15,
title = {Real-Time Coarse-to-fine Topologically Preserving Segmentation},
author = {Jian Yao and Marko Boben and Sanja Fidler and Raquel Urtasun},
booktitle = {CVPR},
year = {2015}
}

In this paper, we tackle the problem of unsupervised segmentation in the form of superpixels. Our main emphasis is on speed and accuracy. We build on [Yamaguchi et al., ECCV'14] to define the problem as a boundary and topology preserving Markov random field. We propose a coarse to fine optimization technique that speeds up inference in terms of the number of updates by an order of magnitude. Our approach is shown to outperform [Yamaguchi et al., ECCV'14] while employing a single iteration. We evaluate and compare our approach to state-of-the-art superpixel algorithms on the BSD and KITTI benchmarks. Our approach significantly outperforms the baselines in the segmentation metrics and achieves the lowest error on the stereo task.

A Framework for Symmetric Part Detection in Cluttered Scenes

Tom Lee, Sanja Fidler, Alex Levinshtein, Cristian Sminchisescu, Sven Dickinson

Symmetry, Vol. 7, 2015, pp 1333-1351

@article{LeeSymmetry2015,
title = {A Framework for Symmetric Part Detection in Cluttered Scenes},
author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Cristian Sminchisescu and Sven Dickinson},
journal = {Symmetry},
volume = {7},
pages = {1333-1351},
year = {2015}
}

The role of symmetry in computer vision has waxed and waned in importance during the evolution of the field from its earliest days. At first figuring prominently in support of bottom-up indexing, it fell out of favour as shape gave way to appearance and recognition gave way to detection. With a strong prior in the form of a target object, the role of the weaker priors offered by perceptual grouping was greatly diminished. However, as the field returns to the problem of recognition from a large database, the bottom-up recovery of the parts that make up the objects in a cluttered scene is critical for their recognition. The medial axis community has long exploited the ubiquitous regularity of symmetry as a basis for the decomposition of a closed contour into medial parts. However, today's recognition systems are faced with cluttered scenes and the assumption that a closed contour exists, i.e., that figure-ground segmentation has been solved, rendering much of the medial axis community's work inapplicable. In this article, we review a computational framework, previously reported in [Lee et al., ICCV'13, Levinshtein et al., ICCV'09, Levinshtein et al., IJCV'13], that bridges the representation power of the medial axis and the need to recover and group an object's parts in a cluttered scene. Our framework is rooted in the idea that a maximally-inscribed disc, the building block of a medial axis, can be modelled as a compact superpixel in the image. We evaluate the method on images of cluttered scenes.


Year 2014


A High Performance CRF Model for Clothes Parsing

Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun

In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

 

@inproceedings{SimoACCV14,
title = {A High Performance CRF Model for Clothes Parsing},
author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
booktitle = {ACCV},
year = {2014}}

In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset [Yamaguchi et al., CVPR'12] and show that we can obtain a significant improvement over the state-of-the-art.

Multi-cue Mid-level Grouping

Tom Lee, Sanja Fidler, Sven Dickinson

In Asian Conference on Computer Vision (ACCV), Singapore, November, 2014

 

@inproceedings{LeeACCV14,
title = {Multi-cue mid-level grouping},
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
booktitle = {ACCV},
year = {2014}
}

Region proposal methods provide richer object hypotheses than sliding windows with dramatically fewer proposals, yet they still number in the thousands. This large quantity of proposals typically results from a diversification step that propagates bottom-up ambiguity in the form of proposals to the next processing stage. In this paper, we take a complementary approach in which mid-level knowledge is used to resolve bottom-up ambiguity at an earlier stage to allow a further reduction in the number of proposals. We present a method for generating regions using the mid-level grouping cues of closure and symmetry. In doing so, we combine mid-level cues that are typically used only in isolation, and leverage them to produce fewer but higher quality proposals. We emphasize that our model is mid-level by learning it on a limited number of objects while applying it to different objects, thus demonstrating that it is transferable to other objects. In our quantitative evaluation, we 1) establish the usefulness of each grouping cue by demonstrating incremental improvement, and 2) demonstrate improvement on two leading region proposal methods with a limited budget of proposals.

Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation

Sanja Fidler, Marko Boben, Ales Leonardis

arXiv preprint arXiv:1408.5516, 2014

@article{FidlerArxiv14,
title = {Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation},
author = {Sanja Fidler and Marko Boben and Ale\v{s} Leonardis},
journal = {arXiv preprint arXiv:1408.5516},
year = {2014}
}

Hierarchies allow feature sharing between objects at multiple levels of representation, can code exponential variability in a very compact way and enable fast inference. This makes them potentially suitable for learning and recognizing a higher number of object classes. However, the success of the hierarchical approaches so far has been hindered by the use of hand-crafted features or predetermined grouping rules. This paper presents a novel framework for learning a hierarchical compositional shape vocabulary for representing multiple object classes. The approach takes simple contour fragments and learns their frequent spatial configurations. These are recursively combined into increasingly more complex and class-specific shape compositions, each exerting a high degree of shape variability. At the top-level of the vocabulary, the compositions are sufficiently large and complex to represent the whole shapes of the objects. We learn the vocabulary layer after layer, by gradually increasing the size of the window of analysis and reducing the spatial resolution at which the shape configurations are learned. The lower layers are learned jointly on images of all classes, whereas the higher layers of the vocabulary are learned incrementally, by presenting the algorithm with one object class after another. The experimental results show that the learned multi-class object representation scales favorably with the number of object classes and achieves state-of-the-art detection performance with both faster inference and shorter training times.

What are you talking about? Text-to-Image Coreference

Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{KongCVPR14,
title = {What are you talking about? Text-to-Image Coreference},
author = {Chen Kong and Dahua Lin and Mohit Bansal and Raquel Urtasun and Sanja Fidler},
booktitle = {CVPR},
year = {2014}
}

In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system.

Visual Semantic Search: Retrieving Videos via Complex Textual Queries

Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{LinCVPR14,
author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
title = {Visual Semantic Search: Retrieving Videos via Complex Textual Queries},
booktitle = {CVPR},
year = {2014}
}

In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.
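
The matching step can be illustrated with off-the-shelf tools. The sketch below uses SciPy's Hungarian solver as a stand-in for the paper's generalized bipartite matching: rows are nodes of the parsed query graph, columns are visual concepts detected in a video, and the entries are hypothetical similarity scores (in the paper these combine appearance, motion and spatial relations with learned weights).

import numpy as np
from scipy.optimize import linear_sum_assignment

similarity = np.array([
    [0.9, 0.1, 0.2, 0.0],   # "black car"
    [0.2, 0.8, 0.1, 0.3],   # "pedestrian crossing"
    [0.1, 0.2, 0.1, 0.7],   # "traffic light"
])

# Maximize total similarity == minimize its negation (classic bipartite assignment).
rows, cols = linear_sum_assignment(-similarity)
for r, c in zip(rows, cols):
    print(f"query node {r} -> video concept {c} (score {similarity[r, c]:.2f})")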

Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision

Liang-Chieh Chen, Sanja Fidler, Alan Yuille, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{ChenCVPR14,
author = {Liang-Chieh Chen and Sanja Fidler and Alan Yuille and Raquel Urtasun},
title = {Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision},
booktitle = {CVPR},
year = {2014}
}

Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and a budget of tens or hundreds of thousands of dollars. Thus, developing solutions that can automatically perform the labeling given only weak supervision is key to reduce this cost. In this paper, we show how to exploit 3D information to automatically generate very accurate object segmentations given annotated 3D bounding boxes. We formulate the problem as the one of inference in a binary Markov random field which exploits appearance models, stereo and/or noisy point clouds, a repository of 3D CAD models as well as topological constraints. We demonstrate the effectiveness of our approach in the context of autonomous driving, and show that we can segment cars with the accuracy of 86% intersection-over-union, performing as well as highly recommended MTurkers!

The Role of Context for Object Detection and Semantic Segmentation in the Wild

Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, Alan Yuille

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{MottaghiCVPR14,
author = {Roozbeh Mottaghi and Xianjie Chen and Xiaobai Liu and Sanja Fidler and Raquel Urtasun and Alan Yuille},
title = {The Role of Context for Object Detection and Semantic Segmentation in the Wild},
booktitle = {CVPR},
year = {2014}
}

In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of the PASCAL VOC 2010 detection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmentation and object detection. Our analysis shows that nearest-neighbor-based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, the improvements from existing contextual models for detection are rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales.

Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts

Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Nam-Gyu Cho, Sanja Fidler, Raquel Urtasun, Alan Yuille

In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June, 2014

@inproceedings{PartsCVPR14,
author = {Xianjie Chen and Roozbeh Mottaghi and Xiaobai Liu and Nam-Gyu Cho and Sanja Fidler and Raquel Urtasun and Alan Yuille},
title = {Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts},
booktitle = {CVPR},
year = {2014}
}

Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals depicted at low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables us to represent a large number of holistic object and body part combinations to better deal with different "detectability" patterns caused by deformations, occlusion and/or low resolution. We apply our method to the six animal categories in the PASCAL VOC dataset and show that our method significantly improves state-of-the-art (by 4.1% AP) and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc), making use of a new dataset of fully annotated object parts for PASCAL VOC 2010, which provides a mask for each part.


Year 2013



(oral presentation)

Holistic Scene Understanding for 3D Object Detection with RGBD cameras

Dahua Lin, Sanja Fidler, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

@inproceedings{LinICCV13,
author = {Dahua Lin and Sanja Fidler and Raquel Urtasun},
title = {Holistic Scene Understanding for 3D Object Detection with RGBD cameras},
booktitle = {ICCV},
year = {2013}
}

In this paper, we tackle the problem of indoor scene understanding using RGBD data. Towards this goal, we propose a holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects. Specifically, we extend the CPMC framework to 3D in order to generate candidate cuboids, and develop a conditional random field to integrate information from different sources to classify the cuboids. With this formulation, scene classification and 3D object recognition are coupled and can be jointly solved through probabilistic inference. We test the effectiveness of our approach on the challenging NYU v2 dataset. The experimental results demonstrate that through effective evidence integration and holistic reasoning, our approach achieves substantial improvement over the state-of-the-art.

Box In the Box: Joint 3D Layout and Object Reasoning from Single Images

Alex Schwing, Sanja Fidler, Marc Pollefeys, Raquel Urtasun

In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

@inproceedings{SchwingICCV13,
author = {Alex Schwing and Sanja Fidler and Marc Pollefeys and Raquel Urtasun},
title = {Box In the Box: Joint 3D Layout and Object Reasoning from Single Images},
booktitle = {ICCV},
year = {2013}
}

In this paper we propose an approach to jointly infer the room layout as well as the objects present in the scene. Towards this goal, we propose a branch and bound algorithm which is guaranteed to retrieve the global optimum of the joint problem. The main difficulty resides in taking into account occlusion in order to not over-count the evidence. We introduce a new decomposition method, which generalizes integral geometry to triangular shapes, and allows us to bound the different terms in constant time. We exploit both geometric cues and object detectors as image features and show large improvements in 2D and 3D object detection over state-of-the-art deformable part-based models.
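
The constant-time bounds rest on integral-geometry ideas; the standard axis-aligned case is shown below as a small Python sketch (a summed-area table giving O(1) rectangle sums over a per-pixel score map). The paper's contribution is the generalization of this trick to triangular regions so that occluded evidence is not over-counted; that part is not reproduced here.

import numpy as np

def integral_image(img):
    # Summed-area table with a zero row/column prepended.
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    S[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return S

def rect_sum(S, r0, c0, r1, c1):
    # Sum of img[r0:r1, c0:c1] in O(1) using four lookups.
    return S[r1, c1] - S[r0, c1] - S[r1, c0] + S[r0, c0]

rng = np.random.default_rng(0)
scores = rng.normal(size=(480, 640))     # e.g. per-pixel evidence for a wall or object face
S = integral_image(scores)
print(np.isclose(rect_sum(S, 100, 200, 300, 400), scores[100:300, 200:400].sum()))  # True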

Detecting Curved Symmetric Parts using a Deformable Disc Model

Tom Lee, Sanja Fidler, Sven Dickinson

In International Conference on Computer Vision (ICCV), Sydney, Australia, 2013

@inproceedings{LeeICCV13,
author = {Tom Lee and Sanja Fidler and Sven Dickinson},
title = {Detecting Curved Symmetric Parts using a Deformable Disc Model},
booktitle = {ICCV},
year = {2013}
}

Symmetry is a powerful shape regularity that's been exploited by perceptual grouping researchers in both human and computer vision to recover part structure from an image without a priori knowledge of scene content. Drawing on the concept of a medial axis, defined as the locus of centers of maximal inscribed discs that sweep out a symmetric part, we model part recovery as the search for a sequence of deformable maximal inscribed disc hypotheses generated from a multiscale superpixel segmentation, a framework proposed by [Levinshtein et al., ICCV'09]. However, we learn affinities between adjacent superpixels in a space that's invariant to bending and tapering along the symmetry axis, enabling us to capture a wider class of symmetric parts. Moreover, we introduce a global cost that perceptually integrates the hypothesis space by combining a pairwise and a higher-level smoothing term, which we minimize globally using dynamic programming. The new framework is demonstrated on two datasets, and is shown to significantly outperform the baseline [Levinshtein et al., ICCV'09].

Bottom-up Segmentation for Top-down Detection

Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

@inproceedings{segdpmCVPR13,
author = {Sanja Fidler and Roozbeh Mottaghi and Alan Yuille and Raquel Urtasun},
title = {Bottom-up Segmentation for Top-down Detection},
booktitle = {CVPR},
year = {2013}
}

In this paper we are interested in how semantic segmentation can help object detection. Towards this goal, we propose a novel deformable part-based model which exploits region-based segmentation algorithms that compute candidate object regions by bottom-up clustering followed by ranking of those regions. Our approach allows every detection hypothesis to select a segment (including void), and scores each box in the image using both the traditional HOG filters as well as a set of novel segmentation features. Thus our model "blends" between the detector and segmentation models. Since our features can be computed very efficiently given the segments, we maintain the same complexity as the original DPM. We demonstrate the effectiveness of our approach in PASCAL VOC 2010, and show that when employing only a root filter our approach outperforms Dalal & Triggs detector on all classes, achieving 13% higher average AP. When employing the parts, we outperform the original DPM in 19 out of 20 classes, achieving an improvement of 8% AP. Furthermore, we outperform the previous state-of-the-art on VOC'10 test by 4%.

A Sentence is Worth a Thousand Pixels

Sanja Fidler, Abhishek Sharma, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

@inproceedings{FidlerCVPR13,
author = {Sanja Fidler and Abhishek Sharma and Raquel Urtasun},
title = {A Sentence is Worth a Thousand Pixels},
booktitle = {CVPR},
year = {2013}
}

We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models.

Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs

Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

@inproceedings{MottaghiCVPR13,
author = {Roozbeh Mottaghi and Sanja Fidler and Jian Yao and Raquel Urtasun and Devi Parikh},
title = {Analyzing Semantic Segmentation Using Human-Machine Hybrid CRFs},
booktitle = {CVPR},
year = {2013}
}

Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, and contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we "plug in" human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much "head room" there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted an in-depth analysis of the human-generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.


Year 2012



(spotlight presentation)

3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

Sanja Fidler, Sven Dickinson, Raquel Urtasun

Neural Information Processing Systems (NIPS), Lake Tahoe, USA, December 2012

@inproceedings{FidlerNIPS12,
author = {Sanja Fidler and Sven Dickinson and Raquel Urtasun},
title = {3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model},
booktitle = {NIPS},
year = {2012}
}

This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D and 3D object detection.

Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation

Jian Yao, Sanja Fidler, Raquel Urtasun

In Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012

@inproceedings{YaoCVPR12,
author = {Jian Yao and Sanja Fidler and Raquel Urtasun},
title = {Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation},
booktitle = {CVPR},
year = {2012}
}

In this paper we propose an approach to holistic scene understanding that reasons jointly about regions, location, class and spatial extent of objects, presence of a class in the image, as well as the scene type. Learning and inference in our model are efficient as we reason at the segment level, and introduce auxiliary variables that allow us to decompose the inherent high-order potentials into pairwise potentials between a few variables with a small number of states (at most the number of classes). Inference is done via a convergent message-passing algorithm, which, unlike graph-cuts inference, has no submodularity restrictions and does not require potential-specific moves. We believe this is very important, as it allows us to encode our ideas and prior knowledge about the problem without the need to change the inference engine every time we introduce a new potential. Our approach outperforms the state-of-the-art on the MSRC-21 benchmark, while being much faster. Importantly, our holistic model is able to improve performance in all tasks.

Super-edge grouping for object localization by combining appearance and shape information

Zhiqi Zhang, Sanja Fidler, Jarell W. Waggoner, Yu Cao, Jeff M. Siskind, Sven Dickinson, Song Wang

In Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012

@inproceedings{ZhangCVPR12,
author = {Zhiqi Zhang and Sanja Fidler and Jarell W. Waggoner and Yu Cao and Jeff M. Siskind and Sven Dickinson and Song Wang},
title = {Super-edge grouping for object localization by combining appearance and shape information},
booktitle = {CVPR},
year = {2012}
}

Both appearance and shape play important roles in object localization and object detection. In this paper, we propose a new superedge grouping method for object localization by incorporating both boundary shape and appearance information of objects. Compared with the previous edge grouping methods, the proposed method does not subdivide detected edges into short edgels before grouping. Such long, unsubdivided superedges not only facilitate the incorporation of object shape information into localization, but also increase the robustness against image noise and reduce computation. We identify and address several important problems in achieving the proposed superedge grouping, including gap filling for connecting superedges, accurate encoding of region-based information into individual edges, and the incorporation of object-shape information into object localization. In this paper, we use the bag of visual words technique to quantify the region-based appearance features of the object of interest. We find that the proposed method, by integrating both boundary and region information, can produce better localization performance than previous subwindow search and edge grouping methods on most of the 20 object categories from the VOC 2007 database. Experiments also show that the proposed method is roughly 50 times faster than the previous edge grouping method.


(oral presentation)

Video In Sentences Out

Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, Zhiqi Zhang

In Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, 2012

@inproceedings{BarbuUAI12,
author = {Andrei Barbu and Alexander Bridge and Zachary Burchill and Dan Coroian and Sven Dickinson and Sanja Fidler and Aaron Michaux and Sam Mussman and Siddharth Narayanaswamy and Dhaval Salvi and Lara Schmidt and Jiangnan Shangguan and Jeffrey Mark Siskind and Jarrell Waggoner and Song Wang and Jinlian Wei and Yifan Yin and Zhiqi Zhang},
title = {Video In Sentences Out},
booktitle = {UAI},
year = {2012}
}

We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.

Unsupervised Disambiguation of Image Captions

Wesley May, Sanja Fidler, Afsaneh Fazly, Suzanne Stevenson, Sven Dickinson

First Joint Conference on Lexical and Computational Semantics (*SEM), 2012

@inproceedings{MaySEM12,
author = {Wesley May and Sanja Fidler and Afsaneh Fazly and Suzanne Stevenson and Sven Dickinson},
title = {Unsupervised Disambiguation of Image Captions},
booktitle = {First Joint Conference on Lexical and Computational Semantics (*SEM)},
year = {2012}
}

Given a set of images with related captions, our goal is to show how visual features can improve the accuracy of unsupervised word sense disambiguation when the textual context is very small, as this sort of data is common in news and social media. We extend previous work in unsupervised text-only disambiguation with methods that integrate text and images. We construct a corpus by using Amazon Mechanical Turk to caption sense-tagged images gathered from ImageNet. Using a Yarowsky-inspired algorithm, we show that gains can be made over text-only disambiguation, as well as multimodal approaches such as Latent Dirichlet Allocation.
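
The text-only bootstrapping loop at the heart of a Yarowsky-style algorithm is easy to sketch; the version below (scikit-learn, tiny invented captions, naive Bayes as the base classifier) labels the single most confident unlabeled example each round and retrains. The paper's contribution, adding image features to this loop, is not shown.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus of captions containing the ambiguous word "bank".
captions = [
    "a fisherman sits on the bank of the river",         # seed: sense 0 (river)
    "customers queue inside the bank to withdraw cash",  # seed: sense 1 (finance)
    "ducks rest on the muddy bank near the water",
    "the bank approved the loan application",
    "a boat is tied up at the grassy bank",
    "an atm outside the bank on main street",
]
labels = np.array([0, 1, -1, -1, -1, -1])   # -1 = unlabeled

vec = CountVectorizer()
X = vec.fit_transform(captions)

# Yarowsky-style bootstrapping: retrain on the labeled pool, absorb the most
# confident unlabeled prediction, and repeat until everything is labeled.
for _ in range(10):
    clf = MultinomialNB().fit(X[labels != -1], labels[labels != -1])
    unl = np.where(labels == -1)[0]
    if len(unl) == 0:
        break
    proba = clf.predict_proba(X[unl])
    best = proba.max(axis=1).argmax()
    labels[unl[best]] = proba[best].argmax()

print(dict(zip(captions, labels)))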

Learning Categorical Shape from Captioned Images

Tom Lee, Sanja Fidler, Alex Levinshtein, Sven Dickinson

Conference on Computer and Robot Vision (CRV), Toronto, Canada, May 2012

@inproceedings{LeeCRV12,
author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Sven Dickinson},
title = {Learning Categorical Shape from Captioned Images},
booktitle = {Canadian Conference on Computer and Robot Vision (CRV)},
year = {2012}
}

Given a set of captioned images of cluttered scenes containing various objects in different positions and scales, we learn named contour models of object categories without relying on bounding box annotation. We extend a recent language-vision integration framework that finds spatial configurations of image features that co-occur with words in image captions. By substituting appearance features with local contour features, object categories are recognized by a contour model that grows along the object's boundary. Experiments on ETHZ are presented to show that 1) the extended framework is better able to learn named visual categories whose within class variation is better captured by a shape model than an appearance model; and 2) typical object recognition methods fail when manually annotated bounding boxes are unavailable.


Years 2006-2011


A Probabilistic Model for Recursive Factorized Image Features

Sergey Karayev, Mario Fritz, Sanja Fidler, Trevor Darrell

In Computer Vision and Pattern Recognition (CVPR), 2011

@inproceedings{KarayevCVPR11,
author = {Sergey Karayev and Mario Fritz and Sanja Fidler and Trevor Darrell},
title = {A Probabilistic Model for Recursive Factorized Image Features},
booktitle = {CVPR},
year = {2011}
}

Layered representations for object recognition are important due to their increased invariance, biological plausibility, and computational benefits. However, most existing approaches to hierarchical representations are strictly feedforward, and thus not well able to resolve local ambiguities. We propose a probabilistic model that learns and infers all layers of the hierarchy jointly. Specifically, we suggest a process of recursive probabilistic factorization, and present a novel generative model based on Latent Dirichlet Allocation to this end. The approach is tested on a standard recognition dataset, outperforming existing hierarchical approaches and demonstrating performance on par with current single-feature state-of-the-art models. We demonstrate two important properties of our proposed model: 1) adding an additional layer to the representation increases performance over the flat model; 2) a full Bayesian approach outperforms a feedforward implementation of the model.
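
As a point of reference, the plain (non-recursive) LDA layer that the model builds on can be run in a few lines with scikit-learn; the input here is a random stand-in for bag-of-visual-words counts, and the recursive factorization and joint Bayesian inference proposed in the paper are not reproduced.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Hypothetical counts: one row per local image neighbourhood, one column per low-level codeword.
counts = rng.poisson(1.0, size=(200, 300))

lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(counts)
topic_mix = lda.transform(counts)    # per-neighbourhood distribution over latent factors
print(topic_mix.shape)               # (200, 20)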

A coarse-to-fine Taxonomy of Constellations for Fast Multi-class Object Detection

Sanja Fidler, Marko Boben, Ales Leonardis

In European Conference in Computer Vision (ECCV), 2010

@inproceedings{FidlerECCV10,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {A coarse-to-fine Taxonomy of Constellations for Fast Multi-class Object Detection},
booktitle = {ECCV},
year = {2010}
}

In order for recognition systems to scale to a larger number of object categories, building visual class taxonomies is important to achieve running times logarithmic in the number of classes [1, 2]. In this paper we propose a novel approach for speeding up recognition times of multi-class part-based object representations. The main idea is to construct a taxonomy of constellation models cascaded from coarse-to-fine resolution and use it in recognition with an efficient search strategy. The taxonomy is built automatically in a way to minimize the number of expected computations during recognition by optimizing the cost-to-power ratio [Blanchard and Geman, Annals of Statistics, 2005]. The structure and the depth of the taxonomy are not pre-determined but are inferred from the data. The approach is utilized on the hierarchy-of-parts model, achieving efficiency in both the representation of object structure and the number of modeled object classes. We achieve a speed-up even for a small number of object classes on the ETHZ and TUD datasets. On a larger scale, our approach achieves detection time that is logarithmic in the number of classes.

Categorical Perception

Mario Fritz, Mykhaylo Andriluka, Sanja Fidler, Michael Stark, Ales Leonardis, Bernt Schiele

Cognitive Systems, Cognitive Systems Monographs Vol. 8, Springer, 2010

@InCollection{FritzChapter09,
author = {Mario Fritz and Mykhaylo Andriluka and Sanja Fidler and Michael Stark and Ales Leonardis and Bernt Schiele},
title = {Categorical Perception},
booktitle = {Cognitive Systems},
series = {Cognitive Systems Monographs},
volume = {8},
year = {2010},
publisher = {Springer},
organization = {Springer},
chapter = {Categorical Perception}
}

The ability to recognize and categorize entities in its environment is a vital competence of any cognitive system. Reasoning about the current state of the world, assessing consequences of possible actions, as well as planning future episodes requires a concept of the roles that objects and places may possibly play. For example, objects afford to be used in specific ways, and places are usually devoted to certain activities. The ability to represent and infer these roles, or, more generally, categories, from sensory observations of the world, is an important constituent of a cognitive system's perceptual processing (Section 1.3 elaborates on this with a very visual example). In the CoSy project, a substantial amount of work has been conducted on the advancement of methods that recognize and categorize objects and places by using different modalities, namely, vision, language, and laser range data. Our progress contributes to our effort to build systems that evolve through interaction with their environment in an ultimately life-long learning process. While this chapter describes our contribution to modeling, learning and representing visual categories, Chapter 7 shows how to combine the visual information with other modalities in a multi-modal learning process (e.g. speech/language as detailed in Chapter 8). Finally, Chapters 9 and 10 show how we integrated these concepts into autonomous systems to understand the implications of our progress in categorization on an interactive evolving system.

Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

Sanja Fidler, Marko Boben, Ales Leonardis

Neural Information Processing Systems (NIPS), 2009

@inproceedings{FidlerNIPS09,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Evaluating multi-class learning strategies in a generative hierarchical framework for object detection},
booktitle = {NIPS},
year = {2009}
}

Multi-class object learning and detection is a challenging problem due to the large number of object classes and their high visual variability. Specialized detectors usually excel in performance, while joint representations optimize sharing and reduce inference time -- but are complex to train. Conveniently, sequential class learning cuts down training time by transferring existing knowledge to novel classes, but cannot fully exploit the shareability of features among object classes and might depend on ordering of classes during learning. In hierarchical frameworks these issues have been little explored. In this paper, we provide a rigorous experimental analysis of various multiple object class learning strategies within a generative hierarchical framework. Specifically, we propose, evaluate and compare three important types of multi-class learning: 1.) independent training of individual categories, 2.) joint training of classes, and 3.) sequential learning of classes. We explore and compare their computational behavior (space and time) and detection performance as a function of the number of learned object classes on several recognition datasets. We show that sequential training achieves the best trade-off between inference and training times at a comparable detection performance and could thus be used to learn the classes on a larger scale.

Optimization framework for learning a hierarchical shape vocabulary for object class detection

Sanja Fidler, Marko Boben, Ales Leonardis

British Machine Vision Conference (BMVC), 2009

@inproceedings{FidlerBMVC09,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Optimization framework for learning a hierarchical shape vocabulary for object class detection},
booktitle = {BMVC},
year = {2009}
}

Learning Hierarchical Compositional Representations of Object Structure

Sanja Fidler, Marko Boben, Ales Leonardis

Object Categorization: Computer and Human Vision Perspectives
Editors: S. Dickinson, A. Leonardis, B. Schiele and M. J. Tarr
Cambridge university press, 2009

@InCollection{FidlerChapter09,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Learning Hierarchical Compositional Representations of Object Structure},
booktitle = {Object Categorization: Computer and Human Vision Perspectives},
editor = {Sven Dickinson and Ale\v{s} Leonardis and Bernt Schiele and Michael J. Tarr},
year = {2009},
publisher = {Cambridge University Press},
pages = {}
}

Similarity-based cross-layered hierarchical representation for object categorization

Sanja Fidler, Marko Boben, Ales Leonardis

Conference on Computer Vision and Pattern Recognition (CVPR), 2008

@inproceedings{FidlerCVPR08,
author = {Sanja Fidler and Marko Boben and Ales Leonardis},
title = {Similarity-based cross-layered hierarchical representation for object categorization},
booktitle = {CVPR},
year = {2008}
}

Selecting features for object detection using an AdaBoost-compatible evaluation function

Luka Furst, Sanja Fidler, Ales Leonardis

Pattern Recognition Letters, Vol. 29, No. 11, pp. 1603-1612, 2008

@article{FurstPRL08,
author = {Luka Furst and Sanja Fidler and Ales Leonardis},
title = {Selecting features for object detection using an AdaBoost-compatible evaluation function},
journal = {Pattern Recognition Letters},
volume = {29},
number = {11},
pages = {1603-1612},
year = {2008}
}

This paper addresses the problem of selecting features in a visual object detection setup where a detection algorithm is applied to an input image represented by a set of features. The set of features to be employed in the test stage is prepared in two training-stage steps. In the first step, a feature extraction algorithm produces a (possibly large) initial set of features. In the second step, on which this paper focuses, the initial set is reduced using a selection procedure. The proposed selection procedure is based on a novel evaluation function that measures the utility of individual features for a certain detection task. Owing to its design, the evaluation function can be seamlessly embedded into an AdaBoost selection framework. The developed selection procedure is integrated with state-of-the-art feature extraction and object detection methods. The presented system was tested on five challenging detection setups. In three of them, a fairly high detection accuracy was effected by as few as six features selected out of several hundred initial candidates.
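
For intuition, the AdaBoost-style selection loop can be sketched in numpy as below: each round greedily picks the feature whose one-feature stump has the lowest weighted error and reweights the data. The data are synthetic, and the trivial stump stands in for the paper's dedicated evaluation function.

import numpy as np

rng = np.random.default_rng(0)

# Rows are images, columns are binary "feature fired" indicators; y marks object presence.
n, d = 400, 50
X = rng.integers(0, 2, size=(n, d))
y = np.where(X[:, 3] | X[:, 17], 1, -1)   # signal carried by features 3 and 17
y[rng.random(n) < 0.1] *= -1              # label noise

w = np.ones(n) / n
selected = []
for _ in range(6):                        # keep six features, as in the paper's setups
    # Weighted error of each feature used as a one-feature stump (predict +1 when it fires).
    errs = np.array([w[(2 * X[:, j] - 1) != y].sum() for j in range(d)])
    j = int(errs.argmin())
    eps = max(errs[j], 1e-12)
    alpha = 0.5 * np.log((1 - eps) / eps)
    w *= np.exp(-alpha * y * (2 * X[:, j] - 1))
    w /= w.sum()
    selected.append((j, alpha))

print("selected features:", [j for j, _ in selected])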

Learning hierarchical representations of object categories for robot vision (invited paper)

Ales Leonardis, Sanja Fidler

International Symposium on Robotics Research (ISRR), 2007

@inproceedings{FidlerISSR07,
author = {Ales Leonardis and Sanja Fidler},
title = {Learning hierarchical representations of object categories for robot vision},
booktitle = {International Symposium on Robotics Research (ISRR)},
year = {2007}
}

This paper presents our recently developed approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories. Inspired by the principles of efficient indexing, robust matching, and ideas of compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which consequently define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy. Built in this way, new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories.

Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts

Sanja Fidler, Ales Leonardis

Conference on Computer Vision and Pattern Recognition (CVPR), 2007

@inproceedings{FidlerCVPR07,
author = {Sanja Fidler and Ales Leonardis},
title = {Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts},
booktitle = {CVPR},
year = {2007}
}

This paper proposes a novel approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories. Inspired by the principles of efficient indexing (bottom-up), robust matching (top-down), and ideas of compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which consequently define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy. Built in this way, new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories. Detection results confirm the effectiveness and robustness of the learned parts.

Combining Reconstructive and Discriminative Subspace Methods for Robust Classification and Regression by Subsampling

Sanja Fidler, Danijel Skocaj, Ales Leonardis

IEEE Trans. on Pattern Anal. and Machine Intell. (PAMI), vol. 28, no. 3, pp. 337-350, 2006

@article{FidlerPAMI06,
author = {Sanja Fidler and Danijel Skocaj and Ales Leonardis},
title = {Combining Reconstructive and Discriminative Subspace Methods for Robust Classification and Regression by Subsampling},
journal = {IEEE Trans. on Pattern Analysis and Machine Intelligence},
volume = {28},
number = {3},
pages = {337-350},
year = {2006}
}

Linear subspace methods that provide sufficient reconstruction of the data, such as PCA, offer an efficient way of dealing with missing pixels, outliers, and occlusions that often appear in visual data. Discriminative methods, such as LDA, on the other hand, are better suited for classification tasks but are highly sensitive to corrupted data. We present a theoretical framework for achieving the best of both types of methods: an approach that combines the discrimination power of discriminative methods with the reconstruction property of reconstructive methods, which enables us to work on subsets of pixels in images and to efficiently detect and reject outliers. The proposed approach is therefore capable of robust classification with a high breakdown point. We also show that subspace methods used for solving regression tasks, such as CCA, can be treated in a similar manner. The theoretical results are demonstrated on several computer vision tasks, showing that the proposed approach significantly outperforms standard discriminative methods in the presence of missing pixels and of images containing occlusions and outliers.
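The sketch below illustrates (with synthetic data and scikit-learn, neither of which comes from the paper) the division of labor the abstract describes: a reconstructive PCA basis models the images, a discriminative LDA rule is learned on the PCA coefficients, and at test time the coefficients of a corrupted image are estimated from a subset of pixels only, so occluded or outlying pixels can simply be left out. The automatic, subsampling-based detection of which pixels are corrupted is omitted here (the clean subset is assumed known), and names such as classify_with_subset are illustrative.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# toy "images": two classes of 100-dimensional vectors
n_per_class, dim = 50, 100
means = [np.zeros(dim), np.ones(dim)]
X = np.vstack([m + 0.3 * rng.standard_normal((n_per_class, dim)) for m in means])
y = np.repeat([0, 1], n_per_class)

pca = PCA(n_components=10).fit(X)                             # reconstructive subspace
lda = LinearDiscriminantAnalysis().fit(pca.transform(X), y)   # discriminative rule

def classify_with_subset(x, pixel_subset):
    """Estimate the PCA coefficients of x from the given pixels only (least squares,
    since the basis restricted to a pixel subset is no longer orthonormal), then
    apply the LDA rule learned on clean coefficients."""
    W = pca.components_.T[pixel_subset]              # basis rows for the kept pixels
    r = x[pixel_subset] - pca.mean_[pixel_subset]
    a, *_ = np.linalg.lstsq(W, r, rcond=None)
    return lda.predict(a.reshape(1, -1))[0]

# corrupt half of the pixels of a class-1 image and classify from the clean half
x = X[-1].copy()
occluded = rng.choice(dim, size=dim // 2, replace=False)
x[occluded] = 50.0                                   # gross outliers / occlusion
clean = np.setdiff1d(np.arange(dim), occluded)
print("predicted class:", classify_with_subset(x, clean))    # expected: 1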

Hierarchical Statistical Learning of Generic Parts of Object Structure

Sanja Fidler, Gregor Berginc, Ales Leonardis

Conference on Computer Vision and Pattern Recognition (CVPR), 2006

@inproceedings{FidlerCVPR06,
author = {Sanja Fidler and Gregor Berginc and Ales Leonardis},
title = {Hierarchical Statistical Learning of Generic Parts of Object Structure},
booktitle = {CVPR},
year = {2006}
}

With the growing interest in object categorization, various methods have emerged that perform well in this challenging task, yet they are inherently limited to only a moderate number of object classes. In pursuit of a more general categorization system, this paper proposes a way to overcome the computational complexity that comes with the enormous number of different object categories by exploiting the statistical properties of the highly structured visual world. Our approach acquires generic parts of object structure hierarchically, from simple to more complex ones, exploiting the favorable statistics of natural images. The parts recovered in the individual layers of the hierarchy can be used in a top-down manner, resulting in a robust statistical engine that can be used efficiently within many current categorization systems. The proposed approach has been applied to large image datasets, yielding important statistical insights into the generic parts of object structure.
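To make the "statistics of the highly structured visual world" concrete, here is a hedged illustration of the kind of measurement such an approach can start from: for every ordered pair of simple feature types (e.g. edge orientations), accumulate a histogram of the relative displacement at which the second feature occurs with respect to the first; peaks in these maps indicate frequently recurring local configurations, i.e. candidates for generic parts. Feature extraction itself is not shown, and all names are illustrative.

import numpy as np
from collections import defaultdict

def cooccurrence_maps(detections_per_image, window=5):
    """detections_per_image: list of images, each a list of (feature_type, x, y).
    Returns {(type_a, type_b): (2*window+1) x (2*window+1) histogram of offsets}."""
    size = 2 * window + 1
    maps = defaultdict(lambda: np.zeros((size, size), dtype=int))
    for detections in detections_per_image:
        for ta, xa, ya in detections:
            for tb, xb, yb in detections:
                dx, dy = xb - xa, yb - ya
                if (dx, dy) != (0, 0) and abs(dx) <= window and abs(dy) <= window:
                    maps[(ta, tb)][dy + window, dx + window] += 1
    return maps

# toy usage: feature type 0 is repeatedly followed, 3 pixels to the right, by
# another feature of type 0 -- the (0, 0) map peaks at that offset
images = [[(0, x, 7), (0, x + 3, 7)] for x in range(20)]
pair, hist = max(cooccurrence_maps(images).items(), key=lambda kv: kv[1].max())
print("strongest pair:", pair, "max count:", int(hist.max()))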


Earlier work


Robust estimation of canonical correlation coefficients

Danijel Skocaj, Ales Leonardis, Sanja Fidler

28th Workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR), 2004

@inproceedings{SkocajOAGM04,
author = {Danijel Skocaj and Ales Leonardis and Sanja Fidler},
title = {Robust estimation of canonical correlation coefficients},
booktitle = {28th workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR)},
year = {2004}
}

Canonical Correlation Analysis (CCA) is well suited for regression tasks in the appearance-based approach to modeling objects and scenes. However, since it relies on the standard projection, it is inherently non-robust. In this paper, we propose to embed the estimation of the CCA coefficients in an augmented PCA space, which enables detection of outliers and preserves the regression-relevant information, enabling robust estimation of the canonical correlation coefficients.
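A hedged sketch of that idea follows, with synthetic data and scikit-learn standing in for the actual implementation: the CCA projection itself is not robust, so a corrupted input is first repaired in an augmented basis (the PCA basis of the inputs extended with the CCA input directions, so that the regression-relevant information is preserved) using only pixels believed to be clean, and the standard CCA regression is then applied to the repaired input. The outlier-detection step is reduced to a known mask here, and robust_predict is an illustrative name.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
n, dim_x, dim_y = 200, 60, 3
latent = rng.standard_normal((n, dim_y))
X = latent @ rng.standard_normal((dim_y, dim_x)) + 0.05 * rng.standard_normal((n, dim_x))
Y = latent + 0.05 * rng.standard_normal((n, dim_y))

pca = PCA(n_components=8).fit(X)
cca = CCA(n_components=dim_y).fit(X, Y)

# augmented basis: reconstructive PCA directions plus the CCA input directions
B = np.hstack([pca.components_.T, cca.x_weights_])   # shape (dim_x, 8 + dim_y)

def robust_predict(x, clean_pixels):
    """Estimate the augmented-basis coefficients of x from clean pixels only,
    rebuild the full input vector, and run the standard CCA regression on it."""
    a, *_ = np.linalg.lstsq(B[clean_pixels],
                            x[clean_pixels] - pca.mean_[clean_pixels], rcond=None)
    x_repaired = pca.mean_ + B @ a
    return cca.predict(x_repaired.reshape(1, -1))[0]

# corrupt a third of the pixels of one input and compare against the naive mapping
x, y_true = X[0].copy(), Y[0]
corrupted = rng.choice(dim_x, size=dim_x // 3, replace=False)
x[corrupted] = 100.0
clean = np.setdiff1d(np.arange(dim_x), corrupted)
print("error (robust):", np.linalg.norm(robust_predict(x, clean) - y_true))
print("error (naive): ", np.linalg.norm(cca.predict(x.reshape(1, -1))[0] - y_true))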

Robust LDA classification by subsampling

Sanja Fidler, Ales Leonardis

In Workshop on Statistical Analysis in Computer Vision (in conjunction with CVPR), 2003

@inproceedings{FidlerSACV03,
author = {Sanja Fidler and Ales Leonardis},
title = {Robust LDA classification by subsampling},
booktitle = {Workshop on Statistical Analysis in Computer Vision (in conjunction with CVPR)},
year = {2003}
}

In this paper we present a new method that enables a robust calculation of the LDA classification rule, thus making recognition of objects possible under non-ideal conditions, i.e., when objects are occluded, appear on a varying background, or when their images are corrupted by outliers. The main idea behind the method is to translate the task of calculating the LDA classification rule into the problem of determining the coefficients of an augmented generative model (PCA). Specifically, we construct an augmented PCA basis which, on the one hand, contains the information necessary for classification (in the LDA sense) and, on the other hand, enables us to calculate the necessary coefficients by means of a subsampling approach, resulting in classification with a high breakdown point. The theoretical results are evaluated on the ORL face database, showing that the proposed method significantly outperforms standard LDA.
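The subsampling step can be sketched in isolation as follows (a hedged illustration, not the paper's exact procedure): draw many random pixel subsets, solve for the coefficients of the augmented generative basis on each subset, and keep the hypothesis whose reconstruction is consistent with the largest number of pixels. Because a few all-inlier subsets suffice, gross occlusions do not contaminate the selected coefficients, which can then be fed into the LDA classification rule. The function name robust_coefficients and all thresholds are illustrative.

import numpy as np

def robust_coefficients(x, basis, mean, n_hypotheses=200, subset_size=None,
                        inlier_tol=1.0, rng=None):
    """x: corrupted image (d,); basis: (d, k) augmented generative basis."""
    rng = np.random.default_rng() if rng is None else rng
    d, k = basis.shape
    subset_size = subset_size or 2 * k
    r_full = x - mean
    best_a, best_support = None, -1
    for _ in range(n_hypotheses):
        idx = rng.choice(d, size=subset_size, replace=False)
        a, *_ = np.linalg.lstsq(basis[idx], r_full[idx], rcond=None)
        support = int((np.abs(basis @ a - r_full) < inlier_tol).sum())  # explained pixels
        if support > best_support:
            best_a, best_support = a, support
    return best_a, best_support

# toy check: a rank-3 generative model with 40% of the pixels replaced by outliers
rng = np.random.default_rng(2)
d, k = 100, 3
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))
a_true = rng.standard_normal(k)
x = basis @ a_true
x[rng.choice(d, size=40, replace=False)] += 25.0
a_hat, support = robust_coefficients(x, basis, np.zeros(d), rng=rng)
print("coefficient error:", float(np.linalg.norm(a_hat - a_true)), "support:", support)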


(best paper award)

Robust LDA classification

Sanja Fidler, Ales Leonardis

27th Workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR), 2003

@inproceedings{FidlerOAGM03,
author = {Sanja Fidler and Ales Leonardis},
title = {Robust LDA classification},
booktitle = {27th workshop of the Austrian Association for Pattern Recognition (OAGM/AAPR)},
year = {2003}
}
