%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Asymptotic Learnability of Reinforcement Problems %%
%% with Arbitrary Dependence %%
%% Daniil Ryabko & Marcus Hutter: Start 2005 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt,twoside]{article}
\usepackage{amsmath,amssymb}
\usepackage{amsthm}
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm \sloppy\lineskip=0pt
\sloppy\lineskip=0pt
%-------------------------------%
% Spacings %
%-------------------------------%
\edef\,{\thinspace} \edef\;{\thickspace} \edef\!{\negthinspace}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
\def\paradot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1.}}}
\def\paranodot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1}}}
%-------------------------------%
% Environments %
%-------------------------------%
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{definition}[theorem]{Definition}
\theoremstyle{remark}
\newtheorem*{note*}{Note}
\renewcommand{\proofname}{\bf Proof}
\def\qedsymbol{$\Box\quad$}
\def\qedx{}
\newenvironment{keywords}%
{\centerline{\bf\small Keywords}\begin{quote}\small}%
{\par\end{quote}\vskip 1ex}
%-------------------------------%
% More Macro-Definitions %
%-------------------------------%
\let\emptyset\varnothing
\let\phi\varphi
\def\as{\text{ a.s.}}
\def\argmax{\operatorname{argmax}}
\def\P{\operatorname{\bf P}}
\def\E{\operatorname{\bf E}}
\def\up{\overline}
\def\low{\underline}
\def\r#1#2{r_{#1..#2}}
\def\nq{\hspace{-1em}}
\def\odt{{\textstyle{1\over 2}}}
\def\odm{{\textstyle{1\over m}}}
\def\SetB{I\!\!B}
\def\SetR{I\!\!R}
\def\SetN{I\!\!N}
\def\C{{\cal C}} % Set of prob. distributions
\def\X{{\cal X}} % generic set
\def\Y{{\cal Y}} % generic set
\def\A{{\cal A}} % action space
\def\R{{\cal R}} % reward space
\def\O{{\cal O}} % observation space
\def\Z{{\cal Z}} % (action,reward,observation) space
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\def\qqmbox#1{{\qquad\mbox{#1}\qquad}}
\def\l{{\ell}} % length of string or program
\def\lb{{\log_2}}
\def\eps{\varepsilon} % for small positive number
\def\epstr{\epsilon} % for empty string
\def\toinfty#1{\stackrel{\smash{#1}\to\infty}{\longrightarrow}}
\def\a{\alpha}
\def\o{\omega}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\normalsize\sc Technical Report \hfill IDSIA-09-06
\vskip 2mm\bf\Large\hrule height5pt \vskip 6mm
Asymptotic Learnability of Reinforcement Problems with Arbitrary Dependence
\vskip 6mm \hrule height2pt \vskip 5mm}
\author{{{\bf Daniil Ryabko} and {\bf Marcus Hutter}}\\[3mm]
\normalsize IDSIA, Galleria 2, CH-6928\ Manno-Lugano, Switzerland%
\thanks{This work was supported by the Swiss NSF grant 200020-107616.
}\\
\normalsize \{daniil,marcus\}@idsia.ch, \ http://www.idsia.ch/$^{_{_\sim}}\!$\{daniil,marcus\} }
\date{28 March 2006}
\maketitle
\begin{abstract}\noindent We address the problem of reinforcement
learning in which observations may exhibit an arbitrary form of
stochastic dependence on past observations and actions.
The task for an agent is to attain the best possible asymptotic
reward where the true generating environment is unknown but
belongs to a known countable family of environments. We find some
sufficient conditions on the class of environments under which an
agent exists which attains the best asymptotic reward for any
environment in the class. We analyze how tight these conditions
are and how they relate to different probabilistic assumptions
known in reinforcement learning and related fields, such as Markov
Decision Processes and mixing conditions.
\end{abstract}
\begin{keywords}
Reinforcement learning,
asymptotic average value,
self-optimizing policies,
(non) Markov decision processes.
\end{keywords}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
%\paradot{Learning in uncertain worlds}
%-------------------------------%
Many real-world ``learning'' problems (like learning to drive a
car or playing a game) can be modelled as an agent $\pi$ that
interacts with an environment $\mu$ and is (occasionally)
rewarded for its behavior. We are interested in agents which
perform well in the sense of having high long-term reward, also called
the value $V(\mu,\pi)$ of agent $\pi$ in environment $\mu$. If $\mu$
is known, it is a pure (non-learning) computational problem to
determine the optimal agent $\pi^\mu:=\arg\max_\pi V(\mu,\pi)$.
It is far less clear what an ``optimal'' agent means, if
$\mu$ is unknown.
%
A reasonable objective is to have a single policy $\pi$ with high
value simultaneously in many environments. We will formalize and
call this criterion {\em self-optimizing} later.
%-------------------------------%
\paradot{Learning approaches in reactive worlds}
%-------------------------------%
Reinforcement learning, sequential decision theory, adaptive
control theory, and active expert advice, are theories dealing
with this problem. They overlap but have different core focus:
%
Reinforcement learning algorithms \cite{Sutton:98} are developed
to learn $\mu$ or directly its value. Temporal difference learning
is computationally very efficient, but has slow asymptotic
guarantees (only) in (effectively) small observable MDPs. Others
have faster guarantee in finite state MDPs \cite{Brafman:01}.
There are algorithms \cite{EvenDar:05} which are optimal for
any finite connected POMDP, and this is apparently the largest
class of environments considered.
%
In sequential decision theory, a Bayes-optimal agent $\pi^*$ that
maximizes $V(\xi,\pi)$ is considered, where $\xi$ is a mixture of
environments $\nu\in\C$ and $\C$ is a class of environments that
contains the true environment $\mu\in\C$ \cite{Hutter:04uaibook}.
Policy $\pi^*$ is self-optimizing in an arbitrary
class $\C$, provided $\C$
allows for self-optimizingness \cite{Hutter:02selfopt}.
%
Adaptive control theory \cite{Kumar:86} considers very simple
(from an AI perspective) or special systems (e.g.\ linear with
quadratic loss function), which sometimes allow computationally
and data efficient solutions.
%
Action with expert advice
\cite{Farias:03,Poland:05dule,Poland:05aixifoe,Cesa:06}
constructs an agent (called master) that performs nearly as well
as the best agent (best expert in hindsight) from some class of
experts, in {\em any} environment $\nu$.
The important special case of passive sequence prediction in
arbitrary unknown environments, where the actions=predictions do
not affect the environment is comparably easy
\cite{Hutter:03optisp,Hutter:04expert}.
The difficulty in active learning problems can be identified (at
least, for countable classes) with
{\em traps} in the environments. Initially the agent does not know
$\mu$, so has asymptotically to be forgiven in taking initial
``wrong'' actions. A well-studied such class are ergodic MDPs
which guarantee that, from any action history, every state can be
(re)visited \cite{Hutter:02selfopt}.
%-------------------------------%
\paradot{What's new}
%-------------------------------%
The aim of this paper is to characterize as general as possible
classes $\C$ in which self-optimizing behaviour is possible, more
general than POMDPs. To do this we need to characterize classes of
environments that forgive. For instance, exact
state recovery is unnecessarily strong; it is sufficient being
able to recover high rewards, from whatever states. Further, in
many real world problems there is no information available about
the ``states'' of the environment (e.g.\ in POMDPs) or the
environment may exhibit long history dependencies.
Rather than trying to model an environment (e.g. by MDP) we
try to identify the conditions sufficient for learning.
Towards this aim, we propose to consider only environments in
which, after any arbitrary finite sequence of actions, the best
value is still achievable. The performance criterion here is
asymptotic average reward. Thus we consider such environments for
which there exists a policy whose asymptotic average reward
exists and upper-bounds asymptotic average reward of any other
policy. Moreover, the same property should hold after any finite
sequence of actions has been taken (no traps).
Yet this property in itself is not sufficient for identifying
optimal behavior. We require further that, from any sequence of
$k$ actions, it is possible to return to the optimal level of
reward in $o(k)$ steps. (The above conditions will be formulated in a
probabilistic form.) Environments which possess this property are
called \emph{(strongly) value-stable}.
We show that for any countable class of value-stable environments
there exists a policy which achieves best possible value in any
of the environments from the class (i.e. is
\emph{self-optimizing} for this class). We also show that strong
value-stability is in a certain sense necessary.
We also consider examples of environments which possess strong
value-stability. In particular, any ergodic MDP can be easily
shown to have this property. A mixing-type condition which
implies value-stability is also demonstrated. Finally, we provide
a construction allowing to build examples of value-stable
environments which are not isomorphic to a finite POMDP, thus
demonstrating that the class of value-stable environments is quite
general.
It is important in our argument that the class of environments
for which we seek a self-optimizing policy is countable, although
the class of all value-stable environments is uncountable. To
find a set of conditions necessary and sufficient for learning
which do not rely on countability of the class is yet an open
problem. However, from a
computational perspective countable classes are sufficiently large
(e.g.\ the class of all computable probability measures is
countable).
%-------------------------------%
\paradot{Contents}
%-------------------------------%
The paper is organized as follows. Section~\ref{secNot} introduces
necessary notation of the agent framework. In
Section~\ref{secSetup} we define and explain the notion of
value-stability, which is central in the paper.
Section~\ref{secMain} presents the theorem about self-optimizing
policies for classes of value-stable environments, and illustrates
the applicability of the theorem by providing examples of
strongly value-stable environments. In Section~\ref{sec:nec} we
discuss necessity of the conditions of the main theorem.
Section~\ref{secDisc} provides some discussion of the results and
an outlook to future research. The formal proof of the main
theorem is given in the appendix, while Section~\ref{secMain}
contains only intuitive explanations.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Notation \& Definitions}\label{secNot}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We essentially follow the notation of \cite{Hutter:02selfopt,Hutter:04uaibook}.
%------------------------------%
\paradot{Strings and probabilities}
%------------------------------%
We use letters $i,k,l,m,n\in\SetN$ for natural numbers, and
denote the cardinality of sets $\cal S$ by $\#{\cal S}$.
We write $\X^*$ for the set of finite strings over some alphabet
$\X$, and $\X^\infty$ for the set of infinite sequences.
For a string $x\in\X^*$ of length $\l(x)=n$ we write $x_1x_2...x_n$ with
$x_t\in\X$ and further abbreviate
$x_{k:n}:=x_kx_{k+1}...x_{n-1}x_n$ and $x_{d_\nu(k,\eps)+n\eps\mid z_{V^*_{\nu^t}$)
and tries to get $\eps$-close to this value acting optimally under
$\nu^e$. If it can not get close to the $\nu^e$-optimal value then
$\nu^e$ is not the true environment, and the next environment can
be picked for exploration (here we call ``exploration'' successive
attempts to exploit an environment which differs from the current
hypothesis about the true environment and has a higher average
reward). If it can, then it switches to exploitation of $\nu^t$,
exploits it until it is $\eps'$-close to $V^*_{\nu^t}$,
$\eps'<\eps$ and switches to $\nu^e$ again this time trying to get
$\eps'$-close to $V_{\nu^e}$; and so on. This can happen only a
finite number of times if the true environment is $\nu^t$, since
$V^*_{\nu^t}V^*_{\nu^t}$ is picked for exploration. If it is
$\nu^t$ then the first consistent environment is picked for
exploitation (and denoted $\nu^t$). This in turn can happen only a
finite number of times before the true environment $\nu$ is picked
as $\nu^t$. After this, the algorithm still continues its
exploration attempts, but can always keep within
$\eps_k\rightarrow0$ of the optimal value. This is ensured by
$d(k)=o(k)$.
The probabilistic case is somewhat more complicated since we can
not say whether an environment is ``consistent'' with the current
history. Instead we test each environment for consistency as
follows. Let $\xi$ be a mixture of all environments in $\C$.
Observe that together with some fixed policy each environment
$\mu$ can be considered as a measure on $\Z^\infty$. Moreover, it
can be shown that (for any fixed policy) the ratio
$\frac{\nu(z_{0$ and \beqn
\r{k}{k+n}^\nu-\E \left( \r{k}{k+n}^{p\nu}\mid z_{d(k)+n\eps\right)
\\
& & \le I\left( \r{k}{k+n}^\nu-\E\r{k}{k+n}^{p\nu}>d(k)\right)+
\P\left(\left|\r{k}{k+n}^{p\nu}-\E\r{k}{k+n}^{p\nu}\right|>n\eps\right).
\eqan The first term equals $0$ by assumption and the second term
for each $\eps$ can be shown to be summable using
\cite[Thm.1.3]{Bosq:96}: For a sequence of uniformly
bounded zero-mean random variables $r_i$ satisfying strong
$\alpha$-mixing conditions the following bound holds true for any
integer $q\in[1,n/2]$: \beqn
\P\left(|\r1n|>n\eps\right)\le c e^{-\eps^2q/c}+cq\alpha\left(\frac{n}{2q}\right)
\eeqn for some constant $c$; in our case we just set
$q=n^{\frac{\eps}{2+\eps}}$.
\qed\end{proof}
%-------------------------------%
\paradot{(PO)MDPs}
%-------------------------------%
Applicability of Theorem~\ref{th:main} and
Proposition~\ref{prop:mix} can be illustrated on (PO)MDPs. We note
that self-optimizing policies for (uncountable) classes of finite
ergodic MDPs and POMDPs are known \cite{Brafman:01,EvenDar:05};
the aim of the present section is to show that value-stability is
a weaker requirement than the requirements of these models, and
also to illustrate applicability of our results.
We call $\mu$ a (stationary) {\em Markov decision process} (MDP)
if the probability of perceiving $x_k\in\X$, given history
$z_{n\eps\right)\le
\sup_{a\in\X } \P\left( \left|\E\left(\r{k}{k+n}^{p_\mu\mu}|x_k=a\right)
- \r{k}{k+n}^{p\mu}\right|>n\eps)\right) \\
& & \le \sup_{a,b\in\X}\P(l(a,b)>f(n)/r_{\max})
\\
& & +
\; \sup_{a,b\in\X}\P\left( \left|\E\left(\r{k}{k+n}^{p_\mu\mu}| x_k=a\right)
- \r{k+f(n)}{k+n}^{p_\mu\mu}\right|>n\eps-f(n)\Big|
x_{k+f(n)}=a\right) \\
& & \le \sup_{a,b\in\X}\P(l(a,b)>f(n)/r_{\max})
\\
& & +
\; \sup_{a\in\X}\P\left( \left|\E\left(\r{k}{k+n}^{p_\mu\mu}| x_k=a\right)
- \r{k}{k+n}^{p_\mu\mu}\right|>n\eps-2f(n)\Big|
x_{k}=a\right).
\eqan
In the last term we have the deviation of the reward attained by
the optimal policy from its expectation. Clearly, both terms are
bounded exponentially in $n$.
\qed\end{proof}
In the examples above the function $d(k,\eps)$ is a constant and
$\phi(n,\eps)$ decays exponentially fast. This suggests that the
class of value-stable environments stretches beyond finite
(PO)MDPs. We illustrate this guess by the construction that
follows.
%-------------------------------%
\paranodot{An example of a value-stable environment}: infinitely armed bandit.
%-------------------------------%
Next we present a construction of environments which can not be
modelled as finite POMDPs but are value-stable. Consider the
following environment $\nu$. There is a countable family
$\C'=\{\zeta_i: i\in\SetN\}$ of {\em arms}, that is, sources
generating i.i.d. rewards $0$ and $1$ (and, say, empty
observations) with some probability $\delta_i$ of the reward being
$1$. The action space $\Y$ consists of three actions
$\Y=\{g,u,d\}$. To get the next reward from the current arm
$\zeta_i$ an agent can use the action $g$. At the beginning the
current arm is $\zeta_0$ and then the agent can move between arms
as follows: it can move one arm ``up'' using the action $u$ or
move ``down'' to the first environment using the action $d$. The
reward for actions $u$ and $d$ is $0$.
Clearly, $\nu$ is a POMDP with countably infinite number of states
in the underlying Markov chain, which (in general) is not
isomorphic to a finite POMDP.
\begin{claim} The environment $\nu$ just constructed is value-stable.
\end{claim}
\begin{proof}
Let $\delta=\sup_{i\in\SetN}\delta_i$. Clearly, $\up V(\nu,p')\le
\delta$ with probability $1$ for any policy $p'$ . A policy $p$
which, knowing all the probabilities $\delta_i$, achieves $\up
V(\nu,p) =\low V(\nu,p) =\delta=:V^*_\nu$ a.s., can be easily
constructed. Indeed, find a sequence $\zeta'_j$, $j\in\SetN$,
where for each $j$ there is $i=:i_j$ such that $\zeta'_j=\zeta_i$,
satisfying $\lim_{j\rightarrow\infty}\delta_{i_j}=\delta$. The
policy $p$ should carefully exploit one by one the arms $\zeta_j$,
staying with each arm long enough to ensure that the average
reward is close to the expected reward with $\eps_j$ probability,
where $\eps_j$ quickly tends to 0, and so that switching between
arms has a negligible impact on the average reward. Thus $\nu$ can
be shown to be explorable. Moreover, a policy $p$ just sketched
can be made independent on (observation and) rewards.
Furthermore, one can modify the policy $p$ (possibly allowing it
to exploit each arm longer) so that on each time step $t$ (from
some $t$ on) we have $j(t)\le\sqrt{t}$, where $j(t)$ is the number
of the current arm on step $t$. Thus, after any
actions-perceptions history $z_{0$ construct
the environment $\nu_s$ as follows: $r_i(a)=1$ for any $i$,
$r_i(b)=2$ if the longest consecutive sequence of action $b$
taken has length greater than $n_i$ and $n_i\ge s$; otherwise
$r_i(b)=0$.
Suppose that there exists a policy $p$ such that $\low V(\nu_i,p)
= V^*_{\nu_i}$ for each $i>0$ and let the true environment be
$\nu_0$. By assumption, for each $s$ there exists such $n$ that
\beqn
\#\{i\le n : y_i=b, r_i=0\}\ge s >\#\{i\le n: y_i=a, r_i=1\}
\eeqn
which implies $\low V(\nu_0,p)\le 1/2<1=V^*_{\nu_0}$.
\qed\end{proof}
It is also easy to show that the {\em uniformity of convergence in
(\ref{eq:svs})} can not be dropped. That is, if in the definition of
value-stability we allow the function $\phi(n,\eps)$ to depend
additionally on the past history $z_{1$ define a measure $\xi$ as follows
$$
\xi(z_{*0$ for
all $\nu\in\mathcal C$.
\noindent{\bf\boldmath Define $T$.}
On each step $i$ let
\beqn
T \;\equiv\; T_i \;:=\;
\left\{\nu\in\C:\frac{\nu(z_{**V^*_{\nu^t}$ and $\nu^e(z_{0$, if such an environment exists.
Otherwise proceed one step (according to $p^t$) and try again.
Increment $j^e$.
\noindent{\bf Consistency.} On each step $i$ (re)define $T$.
If $\nu^t\notin T$, define $\nu^t$, increment
$s$ and iterate the infinite loop. (Thus $s$ is incremented only
if $\nu^t$ is not in $T$ or if $T$ is empty.)
Start the {\bf infinite loop}. Increment $n$.
Let $\delta:=(V^*_{\nu^e}-V^*_{\nu^t})/2$.
Let $\eps:=\eps^{\nu^t}_n$. If $\eps<\delta$ set $\delta=\eps$.
Let $h=j^e$.
\noindent{\bf Prepare for exploration.}
Increment $h$. The index $h$ is incremented with each next attempt of exploring $\nu^e$.
Each attempt will be at least $h$ steps
in length.
Let $p^t=p^{y_{**2i_h$ such that for all $m>k_2$
\beq\label{eq:k2}
\left|\frac{1}{m-i_h} \r{i_h+1}{m}^{\nu^t}-V^*_{{\nu^t}}\right|\le \eps/8.
\eeq
Find $k_3$ such that
\begin{equation}\label{eq:k3}
hr_{max}/k_3<\eps/8.
\end{equation}
Find $k_4$ such that for all $m>k_4$
\beq\label{eq:d}
\frac{1}{m}d_{{\nu^e}}(m,\eps/4)\le
\eps/8 \text{, \ \ }
\frac{1}{m}d_{\nu^t}(m,\eps/8)\le \eps/8
\text{\ \ and\ \ }
\frac{1}{m}d_{\nu^t}(i_h,\eps/8)\le \eps/8.
\eeq
Moreover, it is always possible to find such
$k>\max\{k_1,k_2,k_3,k_4\}$ that
\beq\label{eq:k}
\frac{1}{2k}\r{k}{3k}^{\nu^e} \ge \frac{1}{2k}\r{k}{3k}^{\nu^t} + \delta.
\eeq
Iterate up to the step $k$.
\noindent{\bf Exploration.}
Set $p^e=p_{{\nu^e}}^{y_{2i_h$ we obtain that the event
that $(ii)$ breaks infinitely often has probability $0$ under
$\nu^t$.
Thus (at least) one of the environments $\nu^t$ and $\nu^e$ is
singular with respect to the true environment $\nu$ given the
described policy and current history. Denote this environment by $\nu'$. It is known
(see e.g. \cite[Thm.26]{CsiszarShields:04}) that if measures
$\mu$ and $\nu$ are mutually singular then
$\frac{\mu(x_1,\dots,x_n)}{\nu(x_1,\dots,x_n)}\rightarrow\infty$
$\mu$-a.s. Thus
\beq\label{eq:sing}
\frac{\nu'(z_{**V^*_\nu$ then $\nu^t$ will be excluded from $T$
since on any optimal for $\nu^t$ sequence of actions (policy)
measures $\nu$ and $\nu^t$ are singular. If
$V^*_{\nu^t}*