## Announcements

- I will now be holding office hours on Friday Feb 14 from 2-4pm, instead of Wednesday Feb 12.

- I will be holding office hours on Tues Feb 11 from 4-6pm and Wed Feb 12 from 3-5pm to discuss project proposals. If you cannot make either of these times, please let me know, and I will try to arrange office hours with one of the TAs.
- Reminder: there is a discussion group for the course on Quercus.

- Project and project proposal deadlines are given below (under "Marking Scheme").

- We need 6 or 7 volunteers to present a paper on February 4, the second week of presentations. Please let me know ASAP if you are interested. Advantages of being second:

  - More support.
  - No overlap with course project and project proposal deadlines.

- Papers for the weeks of February 4 and 11 are now listed below.

- Marking rubrics are now available, below (under Marking Scheme).

- Paper presentations may be done in teams of two (or more), but the length of the presentation will be 15 minutes per student. Also, unless a paper is particularly difficult or long, a team will be asked to cover more than one paper. A team may cover one paper listed below and one or more of its references.

- We need 6 or 7 volunteers to present a paper on January 28, the first week of presentations. Please let me know ASAP if you are interested. Advantages of being first:

  - More support.
  - No overlap with course project and project proposal deadlines.

- Papers for the week of January 28 are now listed below.

## Overview

Convolutional neural networks have achieved astounding breakthroughs on a number of machine vision tasks, especially object classification. However, unlike people, they can require vast amounts of data to train, and their (sometimes comical) mistakes show that they do not truly understand what they see. This limits their abilities and leaves them short of the full promise of Artificial Intelligence.

To fully understand a scene, a computer must have a rich, 3-dimensional representation of the world. It must be able to infer what objects are in a scene, their position, orientation, size, shape, color, texture, category, what parts they are composed of, their relationship to other objects in the scene, as well as the illumination and position and viewing angle of the camera. In other words, a scene understanding program must be able to represent the world in much the same way as a computer graphics program does. The main difference is that computer graphics generates a 2-dimensional image from a 3-dimensional representation, while scene understanding aims to do the reverse: to infer a 3-dimensional representation of a scene from a 2-dimensional image. Note that once a 3-dimensional representation has been inferred, it should be possible to answer many common-sense questions about an image. It should also be possible to use a graphics program to regenerate the image from the 3-dimensional representation, and moreover, to generate modified versions of the image, in which objects have been moved or rotated and illumination or camera positions have changed.

This view of scene understanding is known as inverse graphics. Inverting the graphics process to generate a 3-dimensional representation of an image is a difficult, non-deterministic problem. This course approaches the problem with machine learning. That is, we investigate techniques for learning programs that do inverse graphics, as well as related techniques for overcoming the limitations of convolutional neural networks for vision.

This is an advanced graduate course in machine learning. It is primarily a seminar course in which students will read and present papers from the literature. There will also be a major course project. The goal is to bring students to the state of the art in this exciting field. Tentative topics include generative and discriminative models for vision, convolutional and deconvolutional neural nets, variational inference and autoencoders, capsule networks, group symmetries and equivariance, visual attention mechanisms, differentiable renderers, and applications.

## Prerequisites:

A solid introduction to Machine Learning (such as csc411 or a graduate course in ML), especially neural nets, a solid knowledge of linear algebra, the basics of multivariate calculus and probability, and programming skills, especially programming with vectors and matrices. Mathematical maturity will be assumed.

## Classes:

- Tuesday, 1-3pm
- BA 2155

## Instructor:

- Anthony Bonner
- Email: bonner [at] cs [dot] toronto [dot] edu
- Office: BA 5230
- Phone: 416-997-3463
- Office Hours: TBA

## Teaching Assistants:

- Alexander Olson, alex [dot] olson [at] mail [dot] utoronto [dot] ca

- Mohan Zhang, zhangmo4 [at] cs [dot] toronto [dot] edu

## Course Structure

The course is organized along the lines of csc2547: Learning to Search, given by David Duvenaud last semester, though the course content is quite different.

- First three classes: lectures on background material.
- Next seven classes: student presentations of papers from the literature.
- Last two classes: project presentations.

## Paper presentations:

- The goal is for each paper presentation to be a high-quality, accessible tutorial.
- Each week will focus on one topic.
- You will vote for your choice of topic/week (soon).

- I will assign you to a week (soon).

- Papers on each topic will be listed below.

- If you have a particular paper you would like to add to the list, let me know.
- 7 weeks and 45 students = 6 or 7 students per week and about 15 minutes per student.
- Two-week planning cycle:

  - Two weeks before your presentation, meet me after class to discuss and assign papers.
  - The following week, meet with the TA for a practice presentation (required).
  - Present in class under strict time constraints (just like a conference).
- Papers may be presented in teams of two or more with longer presentations (15 minutes per team member).

- Unless a paper is particularly difficult or long, a team will be expected to cover more than one paper (one paper per team member).

- A team may cover one paper listed below and one or more of its references.
- Feel free to suggest other possibilities.
- When presenting a paper:

  - Each slide should describe at most one idea.
  - Present no more than one idea per minute (so no more than one slide per minute, unless a single idea is spread over several slides).
  - You should be able to explain anything that you put on a slide. (E.g., if you mention MCMC on a slide, then you should be able to say something informative about it, even if you don't understand it completely.)
  - Don't just explain how a system works. Also explain *why* it works (e.g., why the latent variables are interpretable). This will require you to understand the system.
  - If it isn't clear why a system works, you should think about it and speculate on possible reasons.
  - Be able to answer questions about what you have presented.
  - Minimize the use of formulas.
  - Use pictures wherever possible.
  - Describe how the system is trained: End-to-end? Supervised? Unsupervised? Pre-trained with synthetic data?
  - Describe the loss function for training.

## Projects:

- You may propose any project you like, as long as it is about machine learning and vision and it has a major technical component.

- Here are some project ideas and considerations.

- Projects may be done individually or in teams of up to four. More will be expected of a team project.
- The grade will depend on the ideas, how well you present them in the report, how clearly you position your work relative to existing literature, how illuminating your experiments are, and how well-supported your conclusions are. Full marks will require a novel contribution.
- Each team will write a short (2-4 pages) research project proposal, which ideally will be structured similarly to a standard paper. It should include a description of a minimum viable project, some nice-to-haves if time allows, and a short review of related work. You don’t have to do what your project proposal says. The point of the proposal is mainly to have a plan and to make it easy for me to give you feedback.

- Towards the end of the course, everyone will present their project in a short presentation of roughly 5 minutes.

- At the end of the class you’ll hand in a project report (around 4 to 8 pages), ideally in the format of a machine learning conference paper such as NIPS.

## Marking Scheme:

- [20%] Paper presentation. Rubric

- [20%] Project proposal, due February 18. Rubric

- [20%] Project presentations, March 24 and 31. Rubric

- [40%] Project report and code, due April 12. Rubric

## Tentative Schedule

## Lectures:

January 7: lecture

Intro and overview

References:

Review of neural nets

Review of CNNs

Short video on how to trick a neural net. Read about it here.

Slides on vision as inverse graphics, by Vinjai Vale

January 14: lecture

overview of presentations, topics and discriminative models

lecture on variational inference and autoencoders

Tutorial on variational inference, by Shakir Mohamed

Metacademy on variational inference

January 21: lecture

overview of projects and generative models

lecture on variational autoencoders and the REINFORCE algorithm

Readings:

variational autoencoders and the reparameterization trick

backpropagation through discrete random variables based on REINFORCE

tutorial on variational autoencoders, by Jaan Altosaar
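To make the two gradient estimators named in these readings concrete, here is a minimal sketch that compares them on a toy problem. Everything in it is an illustrative assumption rather than material from the readings: the objective `f(z) = z**2`, the Gaussian with `mu = 1.5` and `sigma = 0.5`, and the helper name `gradient_samples`. REINFORCE is used in the readings for discrete variables; it is applied to a Gaussian here only so that both estimators target the same gradient, which is known exactly (`2 * mu`).

```python
import random
import statistics


def f(z):
    """Toy objective whose expectation we differentiate: f(z) = z**2."""
    return z * z


def df(z):
    """Derivative of the toy objective."""
    return 2.0 * z


def gradient_samples(mu, sigma, n_samples, seed=0):
    """Per-sample estimates of d/dmu E[f(z)] for z ~ N(mu, sigma**2).

    Returns two lists: reparameterization samples and REINFORCE samples.
    """
    rng = random.Random(seed)
    reparam, reinforce = [], []
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # noise that does not depend on mu
        z = mu + sigma * eps       # reparameterized sample
        # Reparameterization: differentiate through z = mu + sigma * eps,
        # so d f(z) / d mu = f'(z) * dz/dmu = f'(z) * 1.
        reparam.append(df(z))
        # REINFORCE (score function): f(z) * d log N(z; mu, sigma) / d mu
        #                           = f(z) * (z - mu) / sigma**2.
        reinforce.append(f(z) * (z - mu) / sigma ** 2)
    return reparam, reinforce


mu, sigma = 1.5, 0.5
reparam, reinforce = gradient_samples(mu, sigma, n_samples=20000)
print("exact gradient    :", 2 * mu)
print("reparameterization:", statistics.mean(reparam),
      "variance", statistics.variance(reparam))
print("REINFORCE         :", statistics.mean(reinforce),
      "variance", statistics.variance(reinforce))
```

Both estimators are unbiased on this problem, but the per-sample variance of the REINFORCE estimate is much larger, which is why the readings pair REINFORCE with variance-reduction techniques.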

## Student Presentations:

January 28: Discriminative approaches

Since the deep-learning revolution of 2012, there has been a surge of work on using convolutional neural nets in a feed-forward, discriminative fashion to address a large number of problems in machine vision.

Papers for presentation:

Human pose estimation:

cascade of CNNs (Chianda Chen)

Markov random field

spatial model

Object detection and localization:

based on region proposals.

multi-scale sliding window

faster region proposal network (Yizhan Jiang and Yunhao Ji)

spatial pyramid pooling

based on regression/classification.

single shot multibox detector

you only look once (Yizhan Jiang and Yunhao Ji)

Image transformation:

texture synthesis

semantic segmentation (Yushi Guan and Rohit Saha)

depth prediction

scene labeling

artistic style

colorization (Lauren Erdman)

feature interpolation

February 4: Generative models

Since the development of variational autoencoders (VAEs) in 2014, there has been extensive research on using them to learn representations of images. The accuracy and completeness of a representation can be tested by generating the image from the representation and comparing this to the original image. In this way, representations can be learned in an unsupervised way, without the need for labelled data.

Background:

variational autoencoders and the reparameterization trick for backpropagating through continuous random variables.

neural variational inference, for backpropagating through discrete random variables. Based on REINFORCE with some variance reduction of the gradient estimates.

The Concrete Distribution and Categorical Reparameterization. These two papers (published simultaneously) both introduce the Gumbel-softmax trick for estimating low-variance (but biased) gradients for backpropagating through discrete random variables using reparameterization.

REBAR: combines REINFORCE and Gumbel-softmax to estimate low-variance and unbiased gradients for backpropagating through discrete random variables.

tutorial on variational autoencoders, by Jaan Altosaar
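The Gumbel-softmax trick mentioned in the background above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the papers: the function names, the example probabilities `[0.1, 0.3, 0.6]`, and the temperature `0.5` are all chosen here for demonstration. Adding Gumbel noise to the log-probabilities and taking a temperature-scaled softmax gives a "soft" one-hot vector that is a differentiable function of the probabilities, and its argmax is distributed according to the original categorical distribution.

```python
import math
import random


def sample_gumbel(rng, eps=1e-12):
    """Draw one standard Gumbel(0, 1) sample via inverse transform sampling."""
    u = min(max(rng.random(), eps), 1.0 - eps)  # keep u strictly inside (0, 1)
    return -math.log(-math.log(u))


def gumbel_softmax(probs, temperature, rng):
    """Return a relaxed (soft) one-hot sample for a categorical distribution."""
    logits = [(math.log(p) + sample_gumbel(rng)) / temperature for p in probs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


rng = random.Random(0)
probs = [0.1, 0.3, 0.6]  # assumed example distribution (all entries > 0)
n = 20000
counts = [0] * len(probs)
for _ in range(n):
    y = gumbel_softmax(probs, temperature=0.5, rng=rng)
    counts[argmax(y)] += 1
frequencies = [c / n for c in counts]
print(frequencies)  # should be close to probs
```

Lower temperatures push the samples towards exact one-hot vectors at the cost of higher-variance gradients, while higher temperatures give smoother but more biased samples.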

Papers for presentation:

Learning disentangled representations:

Learning 3D structure of single objects:

3D structure from images (Bin Shi)

via 2.5D sketches (a representation intermediate between 2D and 3D)

deep voxels (3D pixels as a representation)

multi-view stereo images

Scene understanding:

animated video (Raymond Zheng)

neural scene derendering

scene representation networks

February 11: Generative models (continued)

Papers for presentation:

Towards inverse graphics:

scenes with multiple objects (Qinyu Lei)

overcoming occlusion, chapter 3

describing scenes with programs (you would not be expected to cover the entire paper)

Conditional image generation:

conditional VAE (Mingyue Yang)

arbitrary conditions

attributes to images (Hyunmin Lee)

Other:

learning a compositional representation

visually grounded imagination

visual analogies (Shuja Khalid)

visual question answering (John Chen and Yu-Siang Wang)

February 18: reading week, no classes

February 25: Capsule networks

Although convolutional neural networks have achieved amazing breakthroughs in computer vision, they require vast amounts of data for learning, make silly mistakes and do not understand what they see. Capsule networks are a recently developed alternative that addresses these problems. The notions of object and geometry are built into capsule networks and do not have to be learned, so less data is required and silly, non-geometric images are not misinterpreted.

Each capsule in a network represents an object, and unlike a neuron, which has a single output, a capsule has many outputs, representing the many properties of an object, such as its position, orientation, and texture. Moreover, unlike convolutional networks, which throw away positional information in the pooling layers, capsule networks keep track of the spatial relationships between objects. During training, a capsule network learns a model of common types of objects, including their parts and the spatial relationship of the parts to the whole. In effect, a capsule network learns a spatial grammar during training, and builds a spatial parse tree of an image during inference.

Background:

Hinton lecture on capsules

dynamic routing (higher layers: building objects out of parts)

EM routing (building objects out of parts)
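As a rough illustration of how dynamic routing builds objects out of parts, here is a small sketch of routing by agreement between one layer of lower ("part") capsules and one layer of parent ("object") capsules. It follows the general scheme of the dynamic routing paper as I understand it (softmax coupling coefficients, a squashing non-linearity, and dot-product agreement updates). The function names, the tiny 2-D prediction vectors, and the choice of three iterations are assumptions made for this example, and the learned transformation matrices that would normally produce the predictions are omitted.

```python
import math


def squash(s, eps=1e-9):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq + eps)
    return [scale * x for x in s]


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route(u_hat, n_iterations=3):
    """Routing by agreement.

    u_hat[i][j] is the prediction vector of lower capsule i for parent j.
    Returns (v, c): parent output vectors and coupling coefficients.
    """
    n_lower, n_parent = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_parent for _ in range(n_lower)]  # routing logits
    for _ in range(n_iterations):
        # Each lower capsule distributes its vote over the parents.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_parent):
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
                 for d in range(dim)]
            v.append(squash(s))
        for i in range(n_lower):
            for j in range(n_parent):
                # Agreement = dot product between prediction and parent output.
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c


# Three parts: all agree about parent 0, but disagree about parent 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat)
lengths = [math.sqrt(sum(x * x for x in vec)) for vec in v]
print("parent output lengths:", lengths)
print("coupling coefficients:", c)
```

In this toy setup, the parent whose predictions agree ends up with the longer output vector and receives more than half of each part's coupling weight, which is the sense in which agreement among parts signals the presence of a whole.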

Papers for presentation:

Foundations:

transforming autoencoders (the original paper, with an emphasis on the first layer)

dynamic routing (higher layers: building objects out of parts) (Zihao Chen)

EM routing (building objects out of parts)

stacked capsule autoencoders

Other developments:

Alternative implementations:

Lecture: Generative Adversarial Networks (GANs)

March 3: Building more geometry into CNNs

Convolutional neural networks have translational invariance and equivariance built in. That is, because the weights used at different locations are the same, they can recognize objects no matter where they are located in an image. There is now a substantial body of research into extending invariance and equivariance to geometric transformations other than translation, so that CNNs can recognize objects no matter how large or small they are, and no matter how much they are rotated, skewed or transformed in other ways. Depending on your background, the papers below may introduce you to new mathematical concepts (such as Fourier transforms, eigenfunctions or group representations, depending on the paper). These concepts are not difficult and have wide application, and Google and Wikipedia will answer all your questions.
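The link between shared weights and translation can be checked directly on a toy example. The sketch below is an illustration under simplifying assumptions: a 1-D signal instead of an image, circular (wrap-around) boundaries so that the property holds exactly, and made-up values for `signal` and `kernel`. It shows equivariance (shifting the input shifts the feature map by the same amount) and invariance after global max pooling (the strongest response does not change).

```python
def circular_correlate(signal, kernel):
    """Slide one shared kernel over every position of a circular 1-D signal."""
    n = len(signal)
    return [
        sum(w * signal[(i + k) % n] for k, w in enumerate(kernel))
        for i in range(n)
    ]


def shift(signal, t):
    """Translate a circular signal t positions to the right."""
    n = len(signal)
    return [signal[(i - t) % n] for i in range(n)]


signal = [0, 0, 1, 3, 1, 0, 0, 0]  # a small "object" at positions 2..4
kernel = [1, 2, 1]                 # the same weights are used at every position

features = circular_correlate(signal, kernel)
shifted_features = circular_correlate(shift(signal, 3), kernel)

# Equivariance: shifting the input shifts the feature map by the same amount.
is_equivariant = shifted_features == shift(features, 3)
# Invariance after global max pooling: the strongest response is unchanged.
is_invariant = max(shifted_features) == max(features)
print(is_equivariant, is_invariant)
```

With ordinary zero padding and strided pooling, as used in practice, these properties hold only approximately. Rotating or rescaling the signal instead of shifting it breaks them altogether, which is the gap the papers below aim to close.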

Papers for presentation:

Rotation:

harmonic networks (2D rotation)

spherical CNN (3D rotation for data on the surface of a sphere)

filter decomposition (rotational and radial basis filters)

Scale:

local scale invariance (a basic approach to scale invariance) (Haowei Zhang)

deep scale spaces (a sophisticated approach based on group theory)

Other transformations:

affine and non-linear transformations (Julian Braganza)

global scale and rotation equivariance (section 3 can be skipped)

March 10: Visual attention mechanisms

Background:

recurrent models of visual attention: image classification using attention

DRAW: an RNN for image generation

Papers for presentation:

recurrent models of visual attention: image classification using attention (Alex Chang)

multiple object recognition

DRAW: a VAE for image generation with RNNs for encoder and decoder

attend, infer, repeat (Mohammad Reza Motallebi)

March 17: Adversarial approaches and Differentiable rendering (2 topics)

Since the development of Generative Adversarial Networks (GANs) in 2014, there has been an explosion of work in adversarial methods and amazing breakthroughs in the generation of photo-realistic images. Here, we look at some of the applications of adversarial methods to inverse graphics. As with VAEs, the idea is to learn an image generator and then invert the image generation process. However, the way in which a GAN learns to generate images is completely different from that of a VAE and has added a whole new dimension to machine learning.

Differentiable rendering takes a different tack. Instead of *learning* to generate images, the idea is to use a graphics program to generate them, since graphics is a well-understood area. The main problem is that to invert the image generation process, the graphics program (or renderer) must be differentiable, so we can backpropagate through it. When used in combination with a GAN or VAE, a differentiable renderer disentangles the latent variables and gives them a natural interpretation.
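To see why differentiability matters for inverting the rendering process, consider the following toy sketch. It is not taken from any of the papers below, and every detail is an assumption made for illustration: the "renderer" draws a 1-D image of a Gaussian blob, the only scene parameter is the blob's position, and the gradient is written out by hand instead of being obtained by automatic differentiation. Because each pixel is a smooth function of the position, the position can be recovered from an observed image by gradient descent on the pixel error.

```python
import math


def render(position, n_pixels=20, sigma=2.0):
    """Toy differentiable 'renderer': a 1-D image of a Gaussian blob."""
    return [
        math.exp(-((i - position) ** 2) / (2 * sigma ** 2))
        for i in range(n_pixels)
    ]


def loss_and_gradient(position, target, sigma=2.0):
    """Squared pixel error and its derivative with respect to position."""
    image = render(position, len(target), sigma)
    loss, grad = 0.0, 0.0
    for i, (pixel, goal) in enumerate(zip(image, target)):
        loss += (pixel - goal) ** 2
        # Chain rule: (d loss / d pixel) * (d pixel / d position).
        grad += 2.0 * (pixel - goal) * pixel * (i - position) / sigma ** 2
    return loss, grad


def infer_position(target, initial_guess, learning_rate=0.5, n_steps=100):
    """Invert the renderer by gradient descent on the scene parameter."""
    position = initial_guess
    for _ in range(n_steps):
        _, grad = loss_and_gradient(position, target)
        position -= learning_rate * grad
    return position


target_image = render(10.0)  # "observed" image, true position 10
estimate = infer_position(target_image, initial_guess=7.0)
print(estimate)  # should be close to 10
```

A renderer with hard pixel boundaries (a blob that is either fully on or fully off at each pixel) would give zero gradient almost everywhere, so this descent would not move. Much of the work on differentiable rendering is about designing smooth approximations that avoid this problem for real 3-D scenes.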

Background:

the original GAN paper

Wasserstein GAN: an improvement that reduces many of the problems with GAN training.

Improved Training of Wasserstein GANs

progressive growing of GANs Demo

Papers for presentation:

Adversarial approaches:

image-to-image translation using cyclic constraints and a new kind of weakly-supervised training: "unpaired supervision" (Tianchang Shen)

adversarial inference and adversarial feature learning. These two papers appeared simultaneously and proposed the same method, a kind of adversarially-trained VAE.

disentangling representations by maximizing the mutual information between latent variables and images.

text to image using GANs with attention

Differentiable rendering:

OpenDR: the original differentiable renderer.

differentiable mesh rendering with applications to inverse graphics

RenderNet: a neural network for rendering 3D shapes into 2D images.

interpolation-based rendering: a nice description of the (differentiable) rendering process, and applications to inferring 3D object structure.

## Project Presentations:

Mar 24: project presentations

Mar 31: project presentations