Seyed Kamyar Seyed Ghasemipour

kamyar (at) generalistai [dot] com

I am a Founding Member of Technical Staff at Generalist. Our mission is to build robot foundation models for general-purpose dexterous manipulation. Prior to Generalist, I was a graduate student in the Machine Learning Group at the University of Toronto and the Vector Institute. My advisor was Rich Zemel. I was also a long-term student research collaborator with Google Brain Robotics (later Google DeepMind Robotics), where I worked with many wonderful collaborators.

You can find my CV here.

Announcements

May, 2024 — Joined Generalist as Founding Member of Technical Staff

Building robot foundation models for general-purpose dexterous bimanual manipulation!

Earlier Announcements

Dec, 2022 — Outstanding Paper Award @ NeurIPS 2022

The Imagen text-to-image generative model won Outstanding Paper Award at NeurIPS 2022!

Dec, 2022 — NeurIPS 2022

Our offline RL work "Why so pessimistic?" will be presented at NeurIPS 2022!

Dec, 2022 — NeurIPS 2022

The Imagen text-to-image generative model will be presented at NeurIPS 2022!

Oct, 2022 — Talk @ Autodesk Research

Presenting our line of work on robotic magnetic assembly as well as Imagen text-to-image generative models.

June, 2022 — ICML 2022

Our work "Blocks Assemble!" will be presented at ICML 2022!

March, 2022 — Sim2Real Work on Arxiv

Our effort on Sim2Real transfer of bimanual magnetic assembly policies is available on Arxiv!

Dec 14, 2021 — Offline RL Workshop NeurIPS 2021

We are very excited about our work "Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters." (with Shane Gu and Ofir Nachum) which was accepted at the Offline RL Workshop at NeurIPS 2021.

May 8, 2021 — ICML 2021

Our work "EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL" (with Dale Schuurmans and Shane Gu) was accepted at ICML 2021.

November 1, 2019 — Best Paper Award @ CoRL 2019!!!! :D

Our paper "A Divergence Minimization Perspective on Imitation Learning Methods" (with Richard Zemel and Shane Gu) received the Best Paper Award at the Conference on Robot Learning (CoRL) 2019!

September 30, 2019 — Research Internship @ Google Brain Robotics

This semester I am interning with Corey Lynch and Pierre Sermanet at Google Brain Robotics in Mountain View.

September 7, 2019 — CoRL Paper (Oral! :D)

Our paper "A Divergence Minimization Perspective on Imitation Learning Methods" (with Richard Zemel and Shane Gu) was accepted as an oral at CoRL 2019!

September 4, 2019 — NeurIPS Paper

Our paper "SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies" (with Shane Gu and Richard Zemel) was accepted as a poster at NeurIPS 2019!

June 1, 2019 — ICML Workshop Oral Presentation

Our paper "SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies" (with Shane Gu and Richard Zemel) was accepted as an oral presentation to the Imitation, Intent, and Interaction (I3) Workshop at ICML 2019!

June 1, 2019 — ICML Workshop Poster

Our paper "Interpreting Imitation Learning Methods Under a Divergence Minimization Perspective" (with Shane Gu and Richard Zemel) was accepted to the Imitation, Intent, and Interaction (I3) Workshop at ICML 2019!

April 20, 2019 — ICLR Workshop Poster

Our paper "Interpreting Imitation Learning Methods Under a Divergence Minimization Perspective" (with Shane Gu and Richard Zemel) was accepted to the Deep Generative Models for Highly Structured Data Workshop at ICLR 2019!

Research

Last Updated (June 15, 2025). For the most up-to-date information, please refer to my Google Scholar page.

Preprints / Under Review

Bi-Manual Manipulation and Attachment via Sim-to-Real Reinforcement Learning
Satoshi Kataoka, Seyed Kamyar Seyed Ghasemipour, Daniel Freeman, Igor Mordatch
Under Review at ICRA 2023
abstract

Assembly of multi-part physical structures is both a valuable end product for autonomous robotics, as well as a valuable diagnostic task for open-ended training of embodied intelligent agents. We introduce a naturalistic physics-based environment with a set of connectable magnet blocks inspired by children's toy kits. The objective is to assemble blocks into a succession of target blueprints. Despite the simplicity of this objective, the compositional nature of building diverse blueprints from a set of blocks leads to an explosion of complexity in structures that agents encounter. Furthermore, assembly stresses agents' multi-step planning, physical reasoning, and bimanual coordination. We find that the combination of large-scale reinforcement learning and graph-based policies -- surprisingly without any additional complexity -- is an effective recipe for training agents that not only generalize to complex unseen blueprints in a zero-shot manner, but even operate in a reset-free setting without being trained to do so. Through extensive experiments, we highlight the importance of large-scale training, structured representations, contributions of multi-task vs. single-task learning, as well as the effects of curriculums, and discuss qualitative behaviors of trained agents.

/ pdf / website

Braxlines: Fast and Interactive Toolkit for RL-driven Behavior Engineering beyond Reward Maximization
Shixiang Shane Gu, Manfred Diaz, Daniel C. Freeman, Hiroki Furuta, Seyed Kamyar Seyed Ghasemipour, Anton Raichuk, Byron David, Erik Frey, Erwin Coumans, Olivier Bachem
Preprint
abstract

The goal of continuous control is to synthesize desired behaviors. In reinforcement learning (RL)-driven approaches, this is often accomplished through careful task reward engineering for efficient exploration and running an off-the-shelf RL algorithm. While reward maximization is at the core of RL, reward engineering is not the only -- sometimes nor the easiest -- way for specifying complex behaviors. In this paper, we introduce braxlines, a toolkit for fast and interactive RL-driven behavior generation beyond simple reward maximization that includes composer, a programmatic API for generating continuous control environments, and a set of stable and well-tested baselines for two families of algorithms -- mutual information maximization (mimax) and divergence minimization (dmin) -- supporting unsupervised skill learning and distribution sketching as other modes of behavior specification. In addition, we discuss how to standardize metrics for evaluating these algorithms, which can no longer rely on simple reward maximization. Our implementations build on a hardware-accelerated Brax simulator in Jax with minimal modifications, enabling behavior synthesis within minutes of training. We hope braxlines can serve as an interactive toolkit for rapid creation and testing of environments and behaviors, empowering explosions of future benchmark designs and new modes of RL-driven behavior generation and their algorithmic research.

/ arxiv

Conference Publications

Self-Improving Embodied Foundation Models
Seyed Kamyar Seyed Ghasemipour, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi*, Igor Mordatch*
NeurIPS 2025
abstract

Foundation models trained on web-scale data have revolutionized robotics, but their application to low-level control remains largely limited to behavioral cloning. Drawing inspiration from the success of the reinforcement learning stage in fine-tuning large language models, we propose a two-stage post-training approach for robotics. The first stage, Supervised Fine-Tuning (SFT), fine-tunes pretrained foundation models using both: a) behavioral cloning, and b) steps-to-go prediction objectives. In the second stage, Self-Improvement, steps-to-go prediction enables the extraction of a well-shaped reward function and a robust success detector, enabling a fleet of robots to autonomously practice downstream tasks with minimal human supervision. Through extensive experiments on real-world and simulated robot embodiments, our novel post-training recipe unveils significant results on Embodied Foundation Models. First, we demonstrate that the combination of SFT and Self-Improvement is significantly more sample-efficient than scaling imitation data collection for supervised learning, and that it leads to policies with significantly higher success rates. Further ablations highlight that the combination of web-scale pretraining and Self-Improvement is the key to this sample-efficiency. Next, we demonstrate that our proposed combination uniquely unlocks a capability that current methods cannot achieve: autonomously practicing and acquiring novel skills that generalize far beyond the behaviors observed in the imitation learning datasets used during training. These findings highlight the transformative potential of combining pretrained foundation models with online Self-Improvement to enable autonomous skill acquisition in robotics.

/ pdf / website

ALOHA Unleashed: A Simple Recipe for Robot Dexterity
Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Seyed Kamyar Seyed Ghasemipour, Chelsea Finn, Ayzaan Wahid
CoRL 2024
abstract

Recent work has shown promising results for learning end-to-end robot policies using imitation learning. In this work we address the question of how far can we push imitation learning for challenging dexterous manipulation tasks. We show that a simple recipe of large scale data collection on the ALOHA 2 platform, combined with expressive models such as Diffusion Policies, can be effective in learning challenging bimanual manipulation tasks involving deformable objects and complex contact rich dynamics. We demonstrate our recipe on 5 challenging real-world and 3 simulated tasks and demonstrate improved performance over state-of-the-art baselines. The project website and videos can be found at https://aloha-unleashed.github.io/.

/ pdf / website

Learning Interactive Real-World Simulators
Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, Pieter Abbeel
Outstanding Paper Award, Oral Presentation ICLR 2024
abstract

Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as "open the drawer" and low-level controls from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world in zero shot after training purely in simulation. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io/unisim/.

/ pdf / website

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, Mohammad Norouzi
Outstanding Paper Award, Oral Presentation NeurIPS 2022
abstract

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.

/ pdf / website

Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters.
Seyed Kamyar Seyed Ghasemipour, Shane Gu, Ofir Nachum
NeurIPS 2022
abstract

Motivated by the success of ensembles for uncertainty estimation in supervised learning, we take a renewed look at how ensembles of Q-functions can be leveraged as the primary source of pessimism for offline reinforcement learning (RL). We begin by identifying a critical flaw in a popular algorithmic choice used by many ensemble-based RL algorithms, namely the use of shared pessimistic target values when computing each ensemble member's Bellman error. Through theoretical analyses and construction of examples in toy MDPs, we demonstrate that shared pessimistic targets can paradoxically lead to value estimates that are effectively optimistic. Given this result, we propose MSG, a practical offline RL algorithm that trains an ensemble of Q-functions with independently computed targets based on completely separate networks, and optimizes a policy with respect to the lower confidence bound of predicted action values. Our experiments on the popular D4RL and RL Unplugged offline RL benchmarks demonstrate that on challenging domains such as antmazes, MSG with deep ensembles surpasses highly well-tuned state-of-the-art methods by a wide margin. Additionally, through ablations on benchmark domains, we verify the critical significance of using independently trained Q-functions, and study the role of ensemble size. Finally, as using separate networks per ensemble member can become computationally costly with larger neural network architectures, we investigate whether efficient ensemble approximations developed for supervised learning can be similarly effective, and demonstrate that they do not match the performance and robustness of MSG with separate networks, highlighting the need for new efforts into efficient uncertainty estimation directed at RL.

/ pdf / code

Blocks Assemble! Learning to Assemble with Large-Scale Structured Reinforcement Learning
Seyed Kamyar Seyed Ghasemipour, Daniel Freeman, Byron David, Shixiang Shane Gu, Satoshi Kataoka, Igor Mordatch
ICML 2022
abstract

/ pdf / website

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL
Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, Shane Gu
ICML 2021
abstract

Off-policy reinforcement learning (RL) holds the promise of sample-efficient learning of decision-making policies by leveraging past experience. However, in the offline RL setting -- where a fixed collection of interactions are provided and no further interactions are allowed -- it has been shown that standard off-policy RL methods can significantly underperform. Recently proposed methods aim to address this shortcoming by regularizing learned policies to remain close to the given dataset of interactions. However, these methods involve several configurable components such as learning a separate policy network on top of a behavior cloning actor, and explicitly constraining action spaces through clipping or reward penalties. Striving for simultaneous simplicity and performance, in this work we present a novel backup operator, Expected-Max Q-Learning (EMaQ), which naturally restricts learned policies to remain within the support of the offline dataset without any explicit regularization, while retaining desirable theoretical properties such as contraction. We demonstrate that EMaQ is competitive with Soft Actor Critic (SAC) in online RL, and surpasses SAC in the deployment-efficient setting. In the offline RL setting -- the main focus of this work -- through EMaQ we are able to make important observations regarding key components of offline RL, and the nature of standard benchmark tasks. Lastly but importantly, we observe that EMaQ achieves state-of-the-art performance with fewer moving parts such as one less function approximation, making it a strong, yet easy to implement baseline for future work.

/ arxiv

A Divergence Minimization Perspective on Imitation Learning Methods
Seyed Kamyar Seyed Ghasemipour, Richard Zemel, Shane Gu
Best Paper Award, Oral Presentation CoRL 2019
abstract

In many settings, it is desirable to learn decision-making and control policies through learning or bootstrapping from expert demonstrations. The most common approaches under this Imitation Learning (IL) framework are Behavioural Cloning (BC), and Inverse Reinforcement Learning (IRL). Recent methods for IRL have demonstrated the capacity to learn effective policies with access to a very limited set of demonstrations, a scenario in which BC methods often fail. Unfortunately, due to multiple factors of variation, directly comparing these methods does not provide adequate intuition for understanding this difference in performance. In this work, we present a unified probabilistic perspective on IL algorithms based on divergence minimization. We present $f$-MAX, an $f$-divergence generalization of AIRL [Fu et al., 2018], a state-of-the-art IRL method. $f$-MAX enables us to relate prior IRL methods such as GAIL [Ho & Ermon, 2016] and AIRL [Fu et al., 2018], and understand their algorithmic properties. Through the lens of divergence minimization we tease apart the differences between BC and successful IRL approaches, and empirically evaluate these nuances on simulated high-dimensional continuous control domains. Our findings conclusively identify that IRL's state-marginal matching objective contributes most to its superior performance. Lastly, we apply our new understanding of IL methods to the problem of state-marginal matching, where we demonstrate that in simulated arm pushing environments we can teach agents a diverse range of behaviours using simply hand-specified state distributions and no reward functions or expert demonstrations. For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/fmax_paper.md .

/ camera-ready pdf / code

SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies
Seyed Kamyar Seyed Ghasemipour, Shane Gu, Richard Zemel
NeurIPS 2019
abstract

Imitation Learning (IL) has been successfully applied to complex sequential decision-making problems where standard Reinforcement Learning (RL) algorithms fail. A number of recent methods extend IL to few-shot learning scenarios, where a meta-trained policy learns to quickly master new tasks using limited demonstrations. However, although Inverse Reinforcement Learning (IRL) often outperforms Behavioral Cloning (BC) in terms of imitation quality, most of these approaches build on BC due to its simple optimization objective. In this work, we propose SMILe, a scalable framework for Meta Inverse Reinforcement Learning (Meta-IRL) based on maximum entropy IRL, which can learn high-quality policies from few demonstrations. We examine the efficacy of our method on a variety of high-dimensional simulated continuous control tasks and observe that SMILe significantly outperforms Meta-BC. Furthermore, we observe that SMILe performs comparably or outperforms Meta-DAgger, while being applicable in the state-only setting and not requiring online experts. To our knowledge, our approach is the first efficient method for Meta-IRL that scales to the function approximator setting. For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/smile_paper.md .

/ camera-ready pdf / code

Whitepapers

Acme: A Research Framework for Distributed Reinforcement Learning
Matthew W. Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Nikola Momchev, Danila Sinopalnikov, Piotr Stańczyk, Sabela Ramos, Anton Raichuk, Damien Vincent, Léonard Hussenot, Robert Dadashi, Gabriel Dulac-Arnold, Manu Orsini, Alexis Jacq, Johan Ferret, Nino Vieillard, Seyed Kamyar Seyed Ghasemipour, Sertan Girgin, Olivier Pietquin, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, Sarah Henderson, Abe Friesen, Ruba Haroun, Alex Novikov, Sergio Gómez Colmenarejo, Serkan Cabi, Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Andrew Cowie, Ziyu Wang, Bilal Piot, Nando de Freitas
Whitepaper
abstract

Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained as well as increased complexity of the RL algorithms used to train them. These increases have in turn made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns this work describes Acme, a framework for constructing novel RL algorithms that is specifically designed to enable agents that are built using simple, modular components that can be used at various scales of execution. While the primary goal of Acme is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions as well as an important contribution to reproducibility in RL research. In this work we describe the major design decisions made within Acme and give further details as to how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms as well as showing how these algorithms can be scaled up for much larger and more complex environments. This highlights one of the primary advantages of Acme, namely that it can be used to implement large, distributed RL algorithms that can run at massive scales while still maintaining the inherent readability of that implementation. This work presents a second version of the paper which coincides with an increase in modularity, additional emphasis on offline, imitation and learning from demonstrations algorithms, as well as various new agents implemented as part of Acme.

/ pdf

Workshop Publications

Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters.
Seyed Kamyar Seyed Ghasemipour, Shane Gu, Ofir Nachum
Offline RL Workshop at NeurIPS 2021, also Under Review at ICLR 2022
abstract

In offline/batch reinforcement learning (RL), the predominant class of approaches with most success have been "support constraint" methods, where trained policies are encouraged to remain within the support of the provided offline dataset. However, support constraints correspond to an overly pessimistic assumption that actions outside the provided data may lead to worst-case outcomes. In this work, we aim to relax this assumption by obtaining uncertainty estimates for predicted action values, and acting conservatively with respect to a lower-confidence bound (LCB) on these estimates. Motivated by the success of ensembles for uncertainty estimation in supervised learning, we propose MSG, an offline RL method that employs an ensemble of independently updated Q-functions. First, theoretically, by referring to the literature on infinite-width neural networks, we demonstrate the crucial dependence of the quality of derived uncertainties on the manner in which ensembling is performed, a phenomenon that arises due to the dynamic programming nature of RL and overlooked by existing offline RL methods. Our theoretical predictions are corroborated by pedagogical examples on toy MDPs, as well as empirical comparisons in benchmark continuous control domains. In the significantly more challenging antmaze domains of the D4RL benchmark, MSG with deep ensembles by a wide margin surpasses highly well-tuned state-of-the-art methods. Consequently, we investigate whether efficient approximations can be similarly effective. We demonstrate that while some very efficient variants also outperform current state-of-the-art, they do not match the performance and robustness of MSG with deep ensembles. We hope that the significant impact of our less pessimistic approach engenders increased focus into uncertainty estimation techniques directed at RL, and engenders new efforts from the community of deep network uncertainty estimation researchers who thus far have not employed offline reinforcement learning domains as a testbed for validating modern uncertainty estimation techniques.

/ pdf

ABC Problem: An Investigation of Offline RL for Vision-Based Dynamic Manipulation
Seyed Kamyar Seyed Ghasemipour, Igor Mordatch, Shane Gu
Embodied Multi-Modal Learning Workshop at ICLR 2021
abstract

In recent years, reinforcement learning has had significant successes in the domain of robotics. However, most such successes have either been in robotic locomotion domains or robotic manipulation settings such as pick-and-place, block-stacking, and other quasi-static tasks, both of which lack a requirement of fine-grained geometric reasoning. In this work, we ask the following question: Given current established methodologies in RL, can we obtain effective vision-based policies that solve tasks requiring significant geometric reasoning, and how well do such policies generalize? As a secondary question, we pursue our investigations under the offline/batch RL setting. We study these questions in a simplified simulated rendition of the "ABC Problem" proposed by Prof. Tenenbaum. In the ABC problem, in each episode an agent is initialized with two random objects to use as its hands (A and B), and the objective is to lift a third randomly selected object (C) from the ground. Due to the varying geometries of the sampled objects, a trained agent must learn to reason about the most effective procedure for lifting the objects. Our empirical results demonstrate that indeed, by training on a limited subset of available objects, vision-based policies obtained through offline RL can significantly improve upon the policies generating the offline datasets, and can transfer to a diversity of objects outside the training distribution. Additionally, we demonstrate that learned policies exhibit novel characteristics not seen in the offline datasets, and we provide evidence that points towards investing efforts in attention architectures for vision-based control policies. Videos can be found in supplementary materials.

/ pdf

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL
Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, Shane Gu
Offline RL Workshop at NeurIPS 2021
abstract

Off-policy reinforcement learning (RL) holds the promise of sample-efficient learning of decision-making policies by leveraging past experience. However, in the offline RL setting -- where a fixed collection of interactions are provided and no further interactions are allowed -- it has been shown that standard off-policy RL methods can significantly underperform. Recently proposed methods aim to address this shortcoming by regularizing learned policies to remain close to the given dataset of interactions. However, these methods involve several configurable components such as learning a separate policy network on top of a behavior cloning actor, and explicitly constraining action spaces through clipping or reward penalties. Striving for simultaneous simplicity and performance, in this work we present a novel backup operator, Expected-Max Q-Learning (EMaQ), which naturally restricts learned policies to remain within the support of the offline dataset without any explicit regularization, while retaining desirable theoretical properties such as contraction. We demonstrate that EMaQ is competitive with Soft Actor Critic (SAC) in online RL, and surpasses SAC in the deployment-efficient setting. In the offline RL setting -- the main focus of this work -- through EMaQ we are able to make important observations regarding key components of offline RL, and the nature of standard benchmark tasks. Lastly but importantly, we observe that EMaQ achieves state-of-the-art performance with fewer moving parts such as one less function approximation, making it a strong, yet easy to implement baseline for future work.

/ arxiv

SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies
Seyed Kamyar Seyed Ghasemipour, Shane Gu, Richard Zemel
Imitation, Intent, and Interaction (I3) Workshop, ICML 2019 (Oral Presentation)
abstract

Imitation Learning (IL) has been successfully applied to complex sequential decision-making problems where standard Reinforcement Learning (RL) algorithms fail. A number of recent methods extend IL to few-shot learning scenarios, where a meta-trained policy learns to quickly master new tasks using limited demonstrations. However, although Inverse Reinforcement Learning (IRL) often outperforms Behavioral Cloning (BC) in terms of imitation quality, most of these approaches build on BC due to its simple optimization objective. In this work, we propose SMILe, a scalable framework for Meta Inverse Reinforcement Learning (Meta-IRL) based on maximum entropy IRL, which can learn high-quality policies from few demonstrations. We examine the efficacy of our method on a variety of high-dimensional simulated continuous control tasks and observe that SMILe significantly outperforms Meta-BC. To our knowledge, our approach is the first efficient method for Meta-IRL that scales to the intractable function approximator setting.

/ pdf / code (coming soon) / poster

Interpreting Imitation Learning Methods Under a Divergence Minimization Perspective
Seyed Kamyar Seyed Ghasemipour, Richard Zemel, Shane Gu
Imitation, Intent, and Interaction (I3) Workshop, ICML 2019
Deep Generative Models for Highly Structured Data Workshop, ICLR 2019
abstract

In many settings, it is desirable to learn decision-making and control policies through learning or from expert demonstrations. The most common approaches under this framework are Behaviour Cloning (BC), and Inverse Reinforcement Learning (IRL). Recent methods for IRL have demonstrated the capacity to learn effective policies with access to a very limited set of demonstrations, a scenario in which BC methods often fail. Unfortunately, directly comparing the algorithms for these methods does not provide adequate intuition for understanding this difference in performance. This is the motivating factor for our work. We begin by presenting $f$-MAX, a generalization of AIRL (Fu et al., 2018), a state-of-the-art IRL method. $f$-MAX provides grounds for more directly comparing the objectives for LfD. We demonstrate that $f$-MAX, and by inheritance AIRL, is a subset of the cost-regularized IRL framework laid out by Ho & Ermon (2016). We conclude by empirically evaluating the factors of difference between various LfD objectives in the continuous control domain.

/ pdf / code (coming soon) / poster

Gradient-Based Optimization of Neural Network Architecture
Will Grathwohl*, Elliot Creager*, Seyed Kamyar Seyed Ghasemipour*, Richard Zemel
Workshop, ICLR 2018
abstract

Neural networks can learn relevant features from data, but their predictive accuracy and propensity to overfit are sensitive to the values of the discrete hyperparameters that specify the network architecture (number of hidden layers, number of units per layer, etc.). Previous work optimized these hyperparameters via grid search, random search, and black box optimization techniques such as Bayesian optimization. Bolstered by recent advances in gradient-based optimization of discrete stochastic objectives, we instead propose to directly model a distribution over possible architectures and use variational optimization to jointly optimize the network architecture and weights in one training pass. We discuss an implementation of this approach that estimates gradients via the Concrete relaxation, and show that it finds compact and accurate architectures for convolutional neural networks applied to the CIFAR10 and CIFAR100 datasets.

/ pdf

Unpublished Submissions

Semi-Supervised Structured Prediction with the Use of Generative Adversarial Networks
Seyed Kamyar Seyed Ghasemipour, Yujia Li, Jackson Wang, Richard Zemel
Submitted to ICCV 2017

Slides

SMILe Oral Presentation, Imitation, Intent, and Interaction (I3) Workshop at ICML 2019

Videos

Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters. Offline RL Workshop at NeurIPS 2021

SMILe (link coming soon) 15 min, June 15, 2019, Oral Presentation, Imitation, Intent, and Interaction (I3) Workshop at ICML 2019

Summer 2015 Research Video 1st place undergraduate research video competition

Summer 2014 Research Video 1st place undergraduate research video competition