Recognizing 3D objects and localizing them in images is a difficult problem that has been studied since the beginning of computer vision. The problem entails accurately segmenting each object in an image, inferring its semantic class, and parsing it in terms of 3D parts. The challenges include dealing with a large number of object classes, appearances, poses, and articulations, as well as occlusion and varying illumination conditions. Our group has produced some of the seminal work on 3D object parsing and recognition. Our recent work on object detection with deep learning ranks among the top-performing methods on standard benchmarks.
3D scene understanding requires reasoning about multiple related tasks: (3D) object detection and segmentation, relationships between objects (e.g., support), scene-type prediction, and inferring the structure of the scene (e.g., the ground plane in outdoor scenarios and the room layout in indoor ones). Our work focuses on designing holistic models that reason jointly about the related sub-tasks, and as such outperform the individual modules.
Some of the topics we are working on: semantic understanding and 3D reconstruction of RGB and RGB-D scenes, room layout estimation, indoor and outdoor localization, depth from a single image.
A successful robotic platform needs to understand both the three-dimensional visual world and a human's natural-language instructions, and to communicate its understanding back to the user in a natural way. Our group works on integrating vision and language in order to develop high-performance autonomous systems that can interact with humans through language. Such solutions are of particular importance for the visually impaired, for whom language is one of the few means of human-robot interaction.
The topics we are tackling are: image and video captioning, learning visual models from textual descriptions, zero-shot learning, alignment of text and video, and image search.
The physical regularities of our 3-D world project to regularities on our retinae that the human vision system has evolved to detect very efficiently. Detecting such regularities, the process of perceptual grouping, helps the human vision system parse an image into distinct objects, parse an object into distinct parts, or parse a video sequence into multiple moving objects. Our research seeks to computationally model such perceptual grouping processes, including symmetric part detection, symmetric part grouping, contour closure detection, region segmentation, and motion segmentation, in support of image and video segmentation when knowledge of image/video content is unavailable.
The shape of an object is a powerful invariant feature for many object categories. But a categorical shape description must be invariant to within-class shape deformation, leading to shape representations that capture the qualitative shape of an object. Our research explores various qualitative 2-D and 3-D shape representations, how they can be recovered from 2-D and 3-D images, and their role in object categorization.
Knowing the 3D structure of small biological particles such as viruses and proteins is central both to understanding fundamental biological processes and to developing new drugs and treatments. However, determining the structure of these particles is a challenging task due to their small size and delicate biological nature. Electron cryomicroscopy (cryo-EM) is an emerging method for structure determination that analyzes noisy 2D images of a particle, taken from unknown orientations, to determine its 3D structure. Our research focuses on developing efficient and robust methods for 3D structure determination, making it faster and more reliable.