%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% On the Possibility of Learning in Reactive Environmnents %%
%% with Arbitrary Dependence %%
%% Daniil Ryabko & Marcus Hutter: Start 2005 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
% Document-Style %
%-------------------------------%
\documentclass[12pt]{article}
\usepackage{amsmath,amssymb}
\usepackage{amsthm}
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm \sloppy\lineskip=0pt
\sloppy\lineskip=0pt
%-------------------------------%
% Spacings %
%-------------------------------%
\edef\,{\thinspace} \edef\;{\thickspace} \edef\!{\negthinspace} %\def\>{\mskip 4mu plus 2mu minus 4mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
\def\paradot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1.}}}
\def\paranodot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1}}}
%-------------------------------%
% Environments %
%-------------------------------%
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{definition}[theorem]{Definition}
\theoremstyle{remark}
\newtheorem*{note*}{Note}
\renewcommand{\proofname}{\bf Proof}
\def\qedsymbol{$\Box\quad$}
\def\qedx{}
\newenvironment{keywords}%
{\centerline{\bf\small Keywords}\begin{quote}\small}%
{\par\end{quote}\vskip 1ex}
%-------------------------------%
% More Macro-Definitions %
%-------------------------------%
\let\phi\varphi
\def\as{\text{ a.s.}}
\def\argmax{\operatorname{argmax}}
\def\P{\operatorname{\bf P}}
\def\E{\operatorname{\bf E}}
\def\v{\boldsymbol}
\def\up{\overline}
\def\low{\underline}
\def\r#1#2{r_{#1..#2}}
\def\nq{\hspace{-1em}}
\def\odm{{\textstyle{1\over m}}}
\def\SetR{I\!\!R}
\def\SetN{I\!\!N}
\def\C{{\cal C}} % Set of prob. distributions
\def\X{{\cal X}} % generic set
\def\Y{{\cal Y}} % generic set
\def\A{{\cal A}} % action space
\def\R{{\cal R}} % reward space
\def\O{{\cal O}} % observation space
\def\Z{{\cal Z}} % (action,reward,observation) space
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\def\l{{\ell}} % length of string or program
\def\eps{\varepsilon} % for small positive number
\def\toinfty#1{\stackrel{\smash{#1}\to\infty}{\longrightarrow}}
\def\o{\omega}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\vspace{-3ex}\normalsize\sc Technical Report \hfill IDSIA-08-08
\vskip 2mm\bf\Large\hrule height5pt \vskip 6mm
On the Possibility of Learning in Reactive Environments with Arbitrary Dependence
\vskip 6mm \hrule height2pt}
\author{
\begin{minipage}{0.49\textwidth}\centering
{\bf Daniil Ryabko}\\[3mm]
\normalsize IDSIA, Galleria 2, CH-6928\\
Manno-Lugano, Switzerland%
\thanks{This work was supported by the Swiss NSF grants 200020-107616 and 200021-113364.
}
\\
daniil@idsia.ch
\end{minipage}\hfill
\begin{minipage}{0.49\textwidth}\centering
{\bf Marcus Hutter}\\[3mm]
\normalsize RSISE$\,$@$\,$ANU and SML$\,$@$\,$NICTA \\
\normalsize Canberra, ACT, 0200, Australia \\
http://www.hutter1.net
\end{minipage}
}
\date{October 2008}
\maketitle
\vspace{-7ex}
\begin{abstract}
We address the problem of reinforcement learning in which
observations may exhibit an arbitrary form of stochastic dependence
on past observations and actions, i.e.\ environments more general
than (PO)MDPs. The task for an agent is to attain the best possible
asymptotic reward where the true generating environment is unknown
but belongs to a known countable family of environments. We find
some sufficient conditions on the class of environments under which
an agent exists which attains the best asymptotic reward for any
environment in the class. We analyze how tight these conditions are
and how they relate to different probabilistic assumptions known in
reinforcement learning and related fields, such as Markov Decision
Processes and mixing conditions.
\def\contentsname{\centering\normalsize Contents}
{\parskip=-2.5ex\tableofcontents}
\end{abstract}
\begin{keywords}
Reinforcement learning,
asymptotic average value,
self-optimizing policies,
(non) Markov decision processes.
\end{keywords}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
%\paradot{Learning in uncertain worlds}
%-------------------------------%
Many real-world ``learning'' problems (like learning to drive a car
or playing a game) can be modelled as an agent $\pi$ that interacts
with an environment $\mu$ and is (occasionally) rewarded for its
behavior. We are interested in agents which perform well in the
sense of having high long-term reward, also called the value
$V(\mu,\pi)$ of agent $\pi$ in environment $\mu$. If $\mu$ is known,
it is a pure (non-learning) computational problem to determine the
optimal agent $\pi^\mu:=\arg\max_\pi V(\mu,\pi)$. It is far less
clear what an ``optimal'' agent means, if $\mu$ is unknown.
%
A reasonable objective is to have a single policy $\pi$ with high
value simultaneously in many environments. We will formalize and
call this criterion {\em self-optimizing} later.
%-------------------------------%
\paradot{Learning approaches in reactive worlds}
%-------------------------------%
Reinforcement learning, sequential decision theory, adaptive
control theory, and active expert advice, are theories dealing
with this problem. They overlap but have different core focus:
%
Reinforcement learning algorithms \cite{Sutton:98} are developed to
learn $\mu$ or directly its value. Temporal difference learning is
computationally very efficient, but has slow asymptotic guarantees
(only) in (effectively) small observable MDPs. Others have faster
guarantee in finite state MDPs \cite{Brafman:01}. There are
algorithms \cite{EvenDar:05} which are optimal for any finite
connected POMDP, and this is apparently the largest class of
environments considered.
%
In sequential decision theory, a Bayes-optimal agent $\pi^*$ that
maximizes $V(\xi,\pi)$ is considered, where $\xi$ is a mixture of
environments $\nu\in\C$ and $\C$ is a class of environments that
contains the true environment $\mu\in\C$ \cite{Hutter:04uaibook}.
Policy $\pi^*$ is self-optimizing in an arbitrary (e.g.\ non-POMDP)
class $\C$, provided $\C$ allows for self-optimizingness
\cite{Hutter:02selfopt}.
%
Adaptive control theory \cite{Kumar:86} considers very simple
(from an AI perspective) or special systems (e.g.\ linear with
quadratic loss function), which sometimes allow computationally
and data efficient solutions.
%
Action with expert advice
\cite{Farias:03,Poland:05dule,Poland:05aixifoe,Cesa:06}
constructs an agent (called master) that performs nearly as well
as the best agent (best expert in hindsight) from some class of
experts, in {\em any} environment $\nu$.
%
The important special case of passive sequence prediction in
arbitrary unknown environments, where the actions=predictions do
not affect the environment is comparably easy
\cite{Hutter:03optisp,Hutter:04expert}.
The difficulty in active learning problems can be identified (at
least, for countable classes) with {\em traps} in the environments.
Initially the agent does not know $\mu$, so has asymptotically to be
forgiven in taking initial ``wrong'' actions. A well-studied such
class are ergodic MDPs which guarantee that, from any action
history, every state can be (re)visited \cite{Hutter:02selfopt}.
%-------------------------------%
\paradot{What's new}
%-------------------------------%
The aim of this paper is to characterize as general as possible
classes $\C$ in which self-optimizing behaviour is possible, more
general than POMDPs. To do this we need to characterize classes of
environments that forgive. For instance, exact state recovery is
unnecessarily strong; it is sufficient being able to recover high
rewards, from whatever states. Further, in many real world problems
there is no information available about the ``states'' of the
environment (e.g.\ in POMDPs) or the environment may exhibit long
history dependencies.
Rather than trying to model an environment (e.g. by MDP) we try to
identify the conditions sufficient for learning. Towards this aim,
we propose to consider only environments in which, after any
arbitrary finite sequence of actions, the best value is still
achievable. The performance criterion here is asymptotic average
reward. Thus we consider such environments for which there exists a
policy whose asymptotic average reward exists and upper-bounds
asymptotic average reward of any other policy. Moreover, the same
property should hold after any finite sequence of actions has been
taken (no traps). We call such environments \emph{recoverable}. If
we only want to get $\eps$-close to the optimal value infinitely
often with decreasing $\eps$ (that is, to have the same upper limit
for the average value), then this property is already sufficient.
Yet recoverability in itself is not sufficient for identifying
behaviour which results in optimal limiting average value. We
require further that, from any sequence of $k$ actions, it is
possible to return to the optimal level of reward in $o(k)$ steps;
that is, it is not just possible to recover after any sequence of
(wrong) actions, but it is possible to recover fast. Environments
which possess this property are called \emph{value-stable}. (These
conditions will be formulated in a probabilistic form.)
We show that for any countable class of value-stable environments
there exists a policy which achieves the best possible value in any
of the environments from the class (i.e. is \emph{self-optimizing}
for this class).
Furthermore, we present some examples of environments which possess
value-stability and/or recoverability. In particular, any ergodic
MDP can be easily shown to be value-stable. A mixing-type condition
which implies value-stability is also demonstrated. In addition, we
provide a construction allowing to build examples of value-stable
and/or recoverable environments which are not isomorphic to a finite
POMDP, thus demonstrating that the class of value-stable
environments is quite general.
Finally, we consider environments which are not recoverable but
still are value-stable. In other words, we consider the question of
what it means to be optimal in an environment which does not
``forgive'' wrong actions. Even in such cases some policies are
better than others, and we identify some conditions which are
sufficient for learning a policy that is optimal from some point on.
It is important in our argument that the class of environments for
which we seek a self-optimizing policy is countable, although the
class of all value-stable environments is uncountable. To find a set
of conditions necessary and sufficient for learning which do not
rely on countability of the class is yet an open problem. However,
from a computational perspective countable classes are sufficiently
large (e.g.\ the class of all computable probability measures is
countable).
%-------------------------------%
\paradot{Contents}
%-------------------------------%
The paper is organized as follows. Section~\ref{secNot} introduces
necessary notation of the agent framework. In Section~\ref{secSetup}
we define and explain the notion of value-stability, which is
central to the paper, and a weaker but simpler notion of
recoverability. Section~\ref{secMain} presents the theorems about
self-optimizing policies for classes of value-stable environments
and recoverable environments. In Section~\ref{sec:nonrec} we
discuss what can be achieved if the environments are not
recoverable. Section~\ref{sec:ex} illustrates the applicability of
the theorems by providing examples of value-stable and recoverable
environments. In Section~\ref{sec:nec} we discuss necessity of the
conditions of the main theorems. Section~\ref{secDisc} provides some
discussion of the results and an outlook to future research. Formal
proofs of the main theorems are given in Section~\ref{secPrThMain},
while Sections~\ref{secMain} and~\ref{sec:nonrec} contain only
intuitive explanations.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Notation and Definitions}\label{secNot}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We essentially follow the notation of
\cite{Hutter:02selfopt,Hutter:04uaibook}.
%------------------------------%
\paradot{Strings and probabilities}
%------------------------------%
We use letters $i,k,l,m,n\in\SetN$ for natural numbers, and denote
the cardinality of sets $\cal S$ by $\#{\cal S}$. We write $\X^*$
for the set of finite strings over some alphabet $\X$, and
$\X^\infty$ for the set of infinite sequences. For a string
$x\in\X^*$ of length $\l(x)=n$ we write $x_1x_2...x_n$ with
$x_t\in\X$ and further abbreviate $x_{k:n}:=x_kx_{k+1}...x_{n-1}x_n$
and $x_{0$ there exists a
policy $p$ such that
$$
\P (\up V(\nu,p) =\up V^*| z_{ d_\nu(k,\eps)+n\eps\mid z_{V^*_{\nu^t}$) and tries to get $\eps$-close to this
value acting optimally under $\nu^e$. If it cannot get close to the
$\nu^e$-optimal value then $\nu^e$ is not the true environment, and
the next environment can be picked for exploration (here we call
``exploration'' successive attempts to exploit an environment which
differs from the current hypothesis about the true environment and
has a higher average reward). If it can, then it switches to
exploitation of $\nu^t$, exploits it until it is $\eps'$-close to
$V^*_{\nu^t}$, $\eps'<\eps$ and switches to $\nu^e$ again this time
trying to get $\eps'$-close to $V_{\nu^e}$; and so on. This can
happen only a finite number of times if the true environment is
$\nu^t$, since $V^*_{\nu^t}V^*_{\nu^t}$ is picked for
exploration. If it is $\nu^t$ then the first consistent environment
is picked for exploitation (and denoted $\nu^t$). This in turn can
happen only a finite number of times before the true environment
$\nu$ is picked as $\nu^t$. After this, the algorithm still
continues its exploration attempts, but can always keep within
$\eps_k\rightarrow0$ of the optimal value. This is ensured by
$d(k)=o(k)$.
The probabilistic case is somewhat more complicated since we can not
say whether an environment is ``consistent'' with the current
history. Instead we test each environment for consistency as
follows. Let $\xi$ be a mixture of all environments in $\C$. Observe
that together with some fixed policy each environment $\mu$ can be
considered as a measure on $\Z^\infty$. Moreover, it can be shown
that (for any fixed policy) the ratio
$\frac{\nu(z_{0$, if there exists a constant $\up V$ or a
constant $\low V$ such that
$$
P(\up V(\nu,p) \;=\; \up V |z_{0$ there exists a
policy $p_\nu^{z_{0)} V^*_\nu(z_{0)}\up V^*_\nu(z_{
d_\nu(k,\eps)+n\eps\mid z_{0$ and
\beqn
\r{k}{k+n}^\nu-\E \left( \r{k}{k+n}^{p\nu}\mid z_{d(k)+n\eps\right)
\\
& & \le I\left( \r{k}{k+n}^\nu-\E\r{k}{k+n}^{p\nu}>d(k)\right)+
\P\left(\left|\r{k}{k+n}^{p\nu}-\E\r{k}{k+n}^{p\nu}\right|>n\eps\right).
\eqan
The first term equals $0$ by assumption and the second term for each
$\eps$ can be shown to be summable using \cite[Thm.1.3]{Bosq:96}:
for a sequence of uniformly bounded zero-mean random variables $r_i$
satisfying strong $\alpha$-mixing conditions the following bound
holds true for any integer $q\in[1,n/2]$
\beqn
\P\left(|\r1n|>n\eps\right)\le c e^{-\eps^2q/c}+cq\alpha\left(\frac{n}{2q}\right)
\eeqn
for some constant $c$; in our case we just set
$q=n^{\frac{\eps}{2+\eps}}$.
\end{proof}
%-------------------------------%
\paradot{(PO)MDPs}
%-------------------------------%
Applicability of Theorem~\ref{th:main} and
Proposition~\ref{prop:mix} can be illustrated on (PO)MDPs. We note
that self-optimizing policies for (uncountable) classes of finite
ergodic MDPs and POMDPs are known \cite{Brafman:01,EvenDar:05}; the
aim of the present section is to show that value-stability is a
weaker requirement than the requirements of these models, and also
to illustrate applicability of our results. We call $\mu$ a
(stationary) {\em Markov decision process} (MDP) if the probability
of perceiving $x_k\in\X$, given history $z_{n\eps\right)\le
\sup_{a\in\X } \P\left( \left|\E\left(\r{k}{k+n}^{p_\mu\mu}|x_k=a\right)
- \r{k}{k+n}^{p\mu}\right|>n\eps)\right)
\it \\ & & \le \sup_{a,b\in\X}\P(l(a,b)>f(n)/r_{\max})
\\
& & +
\; \sup_{a,b\in\X}\P\left( \left|\E\left(\r{k}{k+n}^{p_\mu\mu}| x_k=a\right)
- \r{k+f(n)}{k+n}^{p_\mu\mu}\right|>n\eps-f(n)\Big|
x_{k+f(n)}=a\right)
\rm \\ & & \le \sup_{a,b\in\X}\P(l(a,b)>f(n)/r_{\max})
\\
& & +
\; \sup_{a\in\X}\P\left( \left|\E\left(\r{k}{k+n}^{p_\mu\mu}| x_k=a\right)
- \r{k}{k+n}^{p_\mu\mu}\right|>n\eps-2f(n)\Big|
x_{k}=a\right).
\eqan
In the last term we have the deviation of the reward attained by
the optimal policy from its expectation. Clearly, both terms are
bounded exponentially in $n$.
\end{proof}
In the examples above the function $d(k,\eps)$ is a constant and
$\phi(n,\eps)$ decays exponentially fast. This suggests that the
class of value-stable environments stretches beyond finite
(PO)MDPs. We illustrate this guess by the construction that follows.
%-------------------------------%
A general scheme for constructing
{\bf value-stable environment or recoverable environments}:
infinitely armed bandit.
%-------------------------------%
Next we present a construction of environments which cannot be
modelled as finite POMDPs but are value-stable and/or recoverable.
Consider the following environment $\nu$. There is a countable
family $\C'=\{\zeta_i: i\in\SetN\}$ of {\em arms}, that is, sources
generating i.i.d. rewards $0$ and $1$ (and, say, empty observations)
with some probability $\delta_i$ of the reward being $1$. The action
space $\Y$ consists of three actions $\Y=\{g,u,d\}$. To get the next
reward from the current arm $\zeta_i$ an agent can use the action
$g$. Let $i$ denote the index of the current arm. At the beginning
$i=0$, the current arm is $\zeta_0$ and then the agent can move
between arms as follows: it can move $U(i)$ arms ``up'' using the
action $u$ (i.e. $i:=i+U(i)$) or it can move $D(i)$ arms ``down''
using the action $d$ (i.e. $i:=i-D(i)$ or 0 if the result is
negative). The reward for actions $u$ and $d$ is $0$. In all the
examples below $U(i)\equiv1$, that is, the action $u$ takes the
agent one arm up.
Clearly, $\nu$ is a POMDP with countably infinite number of states
in the underlying Markov chain, which (in general) is not isomorphic
to a finite POMDP.
\begin{claim}
If $D(i)=i$ for all $i\in\SetN$ then the environment $\nu$ just
constructed is value-stable. If $D(i)\equiv1$ then $\nu$ is
recoverable but not necessarily value-stable; that is, there are
choices of the probabilities $\delta_i$ such that $\nu$ is not
value-stable.
\end{claim}
%
\begin{proof}
First we show that in either case ($D(i)=i$ or $D(i)=1$) $\nu$ is
explorable. Let $\delta=\sup_{i\in\SetN}\delta_i$. Clearly, $\up
V(\nu,p')\le \delta$ with probability $1$ for any policy $p'$ . A
policy $p$ which, knowing all the probabilities $\delta_i$, achieves
$\up V(\nu,p) =\low V(\nu,p) =\delta=:V^*_\nu$ a.s., can be easily
constructed. Indeed, find a sequence $\zeta'_j$, $j\in\SetN$, where
for each $j$ there is $i=:i_j$ such that $\zeta'_j=\zeta_i$,
satisfying $\lim_{j\rightarrow\infty}\delta_{i_j}=\delta$. The
policy $p$ should carefully exploit one by one the arms $\zeta_j$,
staying with each arm long enough to ensure that the average reward
is close to the expected reward with $\eps_j$ probability, where
$\eps_j$ quickly tends to 0, and so that switching between arms has
a negligible impact on the average reward. Thus $\nu$ can be shown
to be explorable. Moreover, a policy $p$ just sketched can be made
independent on (observation and) rewards.
Next we show if $D(i)=i$, that is, the action $d$ always takes the
agent down to the first arm, then the environment is value-stable.
Indeed, one can modify the policy $p$ (possibly allowing it to
exploit each arm longer) so that on each time step $t$ (from some
$t$ on) we have $j(t)\le\sqrt{t}$, where $j(t)$ is the number of the
current arm on step $t$. Thus, after any actions-perceptions history
$z_{0$; then after
taking $n$ actions $u$ one can only return to optimal rewards with
$n$ actions ($d$), that is $d(k)=o(k)$ cannot be obtained. Still it
is easy to check that recoverability is preserved, whatever the
choice of $\delta_i$.
\end{proof}
In the above construction we can also allow the action $d$ to bring
the agent $d(i)*0$ construct the
environment $\nu_s$ as follows: $r_i(a)=1$ for any $i$, $r_i(b)=2$
if the longest consecutive sequence of action $b$ taken has length
greater than $n_i$ and $n_i\ge s$; otherwise $r_i(b)=0$.
It is easy to see that each $\nu_i$, $i>0$ satisfies the value
stability conditions with $\phi(n,\eps)\equiv0$ except
$d(k,\eps)=k\ne o(k)$, and does not satisfy it with any
$d(k,\eps)=o(k)$. Next we show that there is no self-optimizing
policy for the class.
Suppose that there exists a policy $p$ such that $\low V(\nu_i,p) =
V^*_{\nu_i}$ for each $i>0$ and let the true environment be $\nu_0$.
By assumption, for each $s$ there exists such $n$ that
\beqn
\#\{i\le n : y_i=b, r_i=0\}\ge s >\#\{i\le n: y_i=a, r_i=1\}
\eeqn
which implies $\low V(\nu_0,p)\le 1/2<1=V^*_{\nu_0}$.
\end{proof}
It is also easy to show that the {\em uniformity of convergence in
(\ref{eq:svs})} cannot be dropped. That is, if in the definition of
value-stability we allow the function $\phi(n,\eps)$ to depend
additionally on the past history $z_{0$ from the proof of
Proposition~\ref{th:tight}. (Whereas if we add $\nu_0$ to the class
a self-optimizing policy no longer exists.) So we have the
following:
%
\begin{claim}
There are countable classes of not value-stable environments for which
self-optimizing policies exist.
\end{claim}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Discussion}\label{secDisc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
\paradot{Summary}
%-------------------------------%
We have proposed a set of conditions on environments, called
value-stability, such that any countable class of value-stable
environments admits a self-optimizing policy. It was also shown that
these conditions are in a certain sense tight. The class of all
value-stable environments includes ergodic MDPs, certain class of
finite POMDPs, passive environments, and (provably) more
environments.
%
So the concept of value-stability allows us to characterize
self-optimizing environment classes, and proving value-stability is
typically much easier than proving self-optimizingness directly.
Value stability means that from any (sup-optimal) sequence of
actions it is possibly to recover fast. If it is possible to
recover, but not necessarily fast, then we get a condition which we
called recoverability, which was shown to be sufficient to be able
to recover the upper limit of the optimal average asymptotic value.
We have also analyzed what can be achieved in environments which
possess (worst-case) value-stability but are not recoverable; it
turned out that a certain worst-case self-optimizingness can be
identified in this case too.
On the following picture we summarize the concepts introduced in
Sections~\ref{secSetup}, \ref{secMain} and~\ref{sec:nonrec}. The
arrows symbolize implications: some of them follow from theorems or
stated in definitions (marked accordingly), while others are
trivial.
\begin{center}
\small
\unitlength=1.3pt
\begin{picture}(310,180)
%-------------------------------%
\put(0,150){\framebox(60,14)[c]{explorable}}
\put(110,157){\vector(-1,0){50}}
\put(80,160){\text{{\tiny Def.\ref{def:vstable}}}}
\put(30,150){\vector(0,-1){25}}
\put(110,150){\framebox(60,14)[c]{value-stable}}
\put(170,157){\vector(1,0){50}}
\put(170,156){\vector(1,0){50}}
\put(215,154){\text{$\blacktriangleright$}}
\put(180,160){\text{{\tiny Th.\ref{th:main}}}}
\put(140,150){\vector(0,-1){25}}
\put(110,150){\vector(0,-1){65}}
\put(260,150){\vector(0,-1){25}}
\put(223,150){\vector(0,-1){65}}
\put(223,150){\framebox(75,14)[c]{self-optimizing}}
%-------------------------------%
\put(3,110){\framebox(76,14)[c]{{\small upper explorable}}}
\put(112,117){\vector(-1,0){32}}
\put(85,120){\text{{\tiny Def.\ref{def:vstable2} }}}
\put(0,84){\vector(0,1){66}}
\put(298,110){\vector(0,-1){66}}
\put(114,110){\framebox(56,14)[c]{recoverable}}
\put(170,117){\vector(1,0){54}}
\put(170,116){\vector(1,0){54}}
\put(219,114){\text{$\blacktriangleright$}}
\put(180,120){\text{{\tiny Th.\ref{th:wr} }}}
\put(226,110){\framebox(72,14)[c]{upper self-opt.}}
%-------------------------------%
\put(0,70){\framebox(60,14)[c]{{\small strongly exp.}}}
\put(97,77){\vector(-1,0){37}}
\put(75,80){\text{{\tiny Def.\ref{def:wstable} }}}
\put(30,70){\vector(0,-1){25}}
\put(260,70){\vector(0,-1){25}}
\put(97,70){\framebox(80,14)[c]{{\small worst-case val.-st.}}}
\put(177,77){\vector(1,0){33}}
\put(177,76){\vector(1,0){33}}
\put(205,74){\text{$\blacktriangleright$}}
\put(180,80){\text{{\tiny Th.\ref{th:worst}\,(i) }}}
\put(213,70){\framebox(82,14)[c]{{\small worst-case self-opt.}}}
%-------------------------------%
\put(0,30){\framebox(90,14)[c]{{\small strongly upper exp.}}}
\put(70,44){\vector(0,1){66}}
\put(90,37){\vector(1,0){100}}
\put(90,36){\vector(1,0){100}}
\put(185,34){\text{$\blacktriangleright$}}
\put(120,40){\text{{\tiny Th.\ref{th:worst}\,(ii) }}}
\put(193,30){\framebox(112,14)[c]{{\small worst-case upper self-opt.}}}
\end{picture}
\end{center}
%-------------------------------%
\paradot{Outlook}
%-------------------------------%
We considered only countable environment classes $\C$. From a
computational perspective such classes are sufficiently large (e.g.\
the class of all computable probability measures is countable). On
the other hand, countability excludes continuously parameterized
families (like all ergodic MDPs), common in statistical practice.
%
So perhaps the main open problem is to find under which conditions
the requirement of countability of the class can be lifted. Another
important question is whether (meaningful) necessary and sufficient
conditions for self-optimizingness can be found. However,
identifying classes of environments for which self-optimizing
policies exist is a hard problem which has not been solved even for
passive environments \cite{RyabkoHutter:06pq}.
One more question concerns the uniformity of forgetfulness of the
environment. Currently in the definition of
value-stability~(\ref{eq:svs}) we have the function $\phi(n,\eps)$
which is the same for all histories $z_{1$ define a
measure $\xi$ as follows
\beq\label{eq:xi}
\xi(z_{**0$ for all $\nu\in\mathcal C$.
\noindent{\bf\boldmath Define $T$.}
On each step $i$ let
\beqn
T \;\equiv\; T_i \;:=\;
\left\{\nu\in\C:\frac{\nu(z_{**V^*_{\nu^t}$ and $\nu^e(z_{0$, if such an
environment exists. Otherwise proceed one step (according to $p^t$)
and try again. Increment $j^e$.
\noindent{\bf Consistency.} On each step $i$ (re)define $T$. If
$\nu^t\notin T$, define $\nu^t$, increment $s$ and iterate the
infinite loop. (Thus $s$ is incremented only if $\nu^t$ is not in
$T$ or if $T$ is empty.)
Start the {\bf infinite loop}. Increment $n$.
Let $\delta:=(V^*_{\nu^e}-V^*_{\nu^t})/2$. Let
$\eps:=\eps^{\nu^t}_n$. If $\eps<\delta$ set $\delta=\eps$. Let
$h=j^e$.
\noindent{\bf Prepare for exploration.}
Increment $h$. The index $h$ is incremented with each next attempt
of exploring $\nu^e$. Each attempt will be at least $h$ steps in
length.
Let $p^t=p^{y_{**2i_h$ such that for all $m>k_2$
\beq\label{eq:k2}
\left|\frac{1}{m-i_h} \r{i_h+1}{m}^{\nu^t}-V^*_{{\nu^t}}\right|\le \eps/8.
\eeq
Find $k_3$ such that
\beq\label{eq:k3}
hr_{max}/k_3<\eps/8.
\eeq
Find $k_4$ such that for all $m>k_4$
\beq\label{eq:d}
\frac{1}{m}d_{{\nu^e}}(m,\eps/4)\le
\eps/8 \text{, \ \ }
\frac{1}{m}d_{\nu^t}(m,\eps/8)\le \eps/8
\text{\ \ and\ \ }
\frac{1}{m}d_{\nu^t}(i_h,\eps/8)\le \eps/8.
\eeq
Moreover, it is always possible to find such
$k>\max\{k_1,k_2,k_3,k_4\}$ that
\beq\label{eq:k}
\frac{1}{2k}\r{k}{3k}^{\nu^e} \ge \frac{1}{2k}\r{k}{3k}^{\nu^t} + \delta.
\eeq
Iterate up to the step $k$.
\noindent{\bf Exploration.}
Set $p^e=p_{{\nu^e}}^{y_{2i_h$
we obtain that the event that $(ii)$ breaks infinitely often has
probability $0$ under $\nu^t$.
Thus (at least) one of the environments $\nu^t$ and $\nu^e$ is
singular with respect to the true environment $\nu$ given the
described policy and current history. Denote this environment by
$\nu'$. It is known (see e.g. \cite[Thm.26]{CsiszarShields:04}) that
if measures $\mu$ and $\nu$ are mutually singular then
$\frac{\mu(x_1,\dots,x_n)}{\nu(x_1,\dots,x_n)}\rightarrow\infty$
$\mu$-a.s. Thus
\beq\label{eq:sing}
\frac{\nu'(z_{**V^*_\nu$
then $\nu^t$ will be excluded from $T$ since on any optimal for
$\nu^t$ sequence of actions (policy) measures $\nu$ and $\nu^t$ are
singular. If $V^*_{\nu^t}*