PORL09: Partially Observable Reinforcement Learning
Symposium at NIPS'09 December 10, Vancouver


Marcus Hutter, The Australian National University
William Uther, NICTA and University of New South Wales
Pascal Poupart, University of Waterloo
Kee Siong Ng, NICTA and The Australian National University
Contact: Any of the organisers
Summary: For many years, the reinforcement learning community primarily focused on sequential decision making in fully observable but unknown domains while the planning under uncertainty community focused on known but partially observable domains. Since most problems are both partially observable and (at least partially) unknown, recent years have seen a surge of interest in combining the related, but often different, algorithmic machineries developed in the two communities. The time thus seems ripe for a symposium that brings these two communities together and reviews recent advances in this convergence.

A reinforcement learning agent for a partially observable environment is often broken into two parts: 1) the inference of an environment model from data; and 2) the solution of the associated control/planning problem. There has been significant progress on both these fronts in recent years. Both linear and non-linear models of various forms can now be learned from history data. Modern POMDP solvers can also now handle some models with millions of states. This symposium brings together five active researchers in PORL research to present some state-of-the-art developments.
Registration: PORL'09 is part of the NIPS'09 conference. To attend the symposium you must register for the NIPS conference. For more information on registration, please see the NIPS website.
Preliminary program:
Symposium session
13:30 - 13:45 William Uther: An Introduction and Overview of Approaches to PORL
13:50 - 14:20 Satinder Singh: What to Model?
14:25 - 14:55 David Wingate: Structured Hierarchical Bayesian Priors For Modeling Dynamical Systems
15:00 - 15:10 coffee break
15:10 - 15:40 Pascal Poupart: Algorithmic and Theoretical Properties of Bayesian Reinforcement Learning
15:45 - 16.15 Marcus Hutter: Principled Large-Scale POMDP Learning
16:15 - 16:30 Joel Veness: A Monte Carlo AIXI Approximation

Talk Abstracts: Title: An Introduction and Overview of Approaches to PORL
Speaker: William Uther, NICTA and University of New South Wales

Abstract: Inferring a model of a partially observable environment from a sequence of actions, observations and rewards has been approached in a number of different ways. This introduction for the mini-symposium will give a brief overview of the range of techniques found in the literature and describe some relationships between them.

Title: What to Model?
Speaker: Satinder Singh, University of Michigan

Abstract: In applications of reinforcement learning to engineering and control problems, it is often quite clear what models to build, i.e., the state variables and actions are well defined and the modeling challenge is to find a good parametric representation and efficient learning algorithm. In more AI settings, where the specific tasks have not themselves been the object of many person-years of engineering effort, e.g., a robot equipped with a camera and other sensors in a building expected to do all sorts of chores, it is far less clear what model to build. What is important to predict? What secondary things do we need to be able to predict in order to make predictions of the primary things we want to predict? Ideally, one would want to learn models whose complexity depends only on the complexity of the prediction questions we want the model to be able to answer and not on the complexity of the real world. These are significant and understudied challenges towards progress in AI. In this talk I will present some thoughts and preliminary results pertaining to these challenges.

Title: Structured Hierarchical Bayesian Priors For Modeling Dynamical Systems
Speaker: David Wingate, Massachusetts Institute of Technology

Abstract: Hierarchical Bayesian models with interesting (possibly nonparametric) priors can be used to learn models of structured but partially observable domains. Elemental distributions (such as multinomials or Dirichlet processes) can be composed, for example, to co-cluster states and perceptual features with an unknown number of clusters; such clustering improves generalization and decreases sample complexity of learning models. The same types of priors can also be used to reason about complex, compositional options which are triggered in response to features of the environment. In this talk, we'll discuss examples of both structured world models (including object-oriented models) and structured policies, and how an agent can plan using both.

This is joint work with Josh Tenenbaum and Noah Goodman.

Title: Algorithmic and Theoretical Properties of Bayesian Reinforcement Learning
Speaker: Pascal Poupart, University of Waterloo

Reinforcement Learning in partially observable domains is a notoriously hard problem. When taking a Bayesian perspective, reinforcement learning becomes a problem of planning in a special partially observable Markov decision process (POMDP) where the unknown parameters (e.g., transition dynamics, observation probabilities and reward distribution) are treated as state features that the learner maintains beliefs about. In this talk, I will describe some of the algorithmic and theoretical properties of this POMDP. More specifically, I will discuss how this POMDP provides a natural formulation of non-myopic active learning and how the exploration/exploitation tradeoff is naturally optimized. I will also show that the optimal value function is the upper surface of a set of mixtures of Dirichlets which is piecewise-linear and convex. These properties will then be used to derive approximate dynamic programming procedures such as point-based value iteration for offline planning.

Title: Principled Large-Scale POMDP Learning
Speaker: Marcus Hutter, The Australian National University

Abstract: Partially Observable Markov Decision Processes (POMDPs) are a very important generalization of MDPs. Nature is still assumed to be an MDP, but the states of nature are only partially observed via some non-injective or probabilistic function. While POMDP planning is well-defined but (harder than NP) hard, a general theory of learning POMDPs is missing. A different approach is to work with histories and directly map them to a finite approximation of a (belief) MDP. A Coding/MDL/Bayes-inspired criterion can be used to learn this mapping. This reduction significantly expands the scope of many existing reinforcement learning algorithms. The approach can be extended to factored MDPs, resulting in the first principled general-purpose learning algorithm for large POMDPs.

Title: A Monte Carlo AIXI Approximation
Speaker: Joel Veness, University of New South Wales

Abstract: Marcus Hutter's AIXI agent provides a mathematically rigorous, optimality notion for general reinforcement learning agents. An interesting open question is to what extent this ideal can be approximated in a computationally efficient manner. This talk will focus on a particular, direct approximation of AIXI. This approximation can be broken into two main parts: Solomonoff's universal distribution is approximated by an efficiently computable Bayesian mixture of Prediction Suffix Trees, and the finite horizon expectimax operation is approximated via Monte-Carlo Tree Search. Although this approximation is undoubtedly crude, it has already achieved impressive results on small, noisy, partially observable and stochastic domains.