%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Asymptotically Optimal Agents %%
%% Tor Lattimore and Marcus Hutter (2011) %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt,twoside]{article}
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\usepackage{latexsym,amsmath,amssymb}
\usepackage{wrapfig}
\usepackage{tikz}
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\newcommand{\eqn}[1]{\begin{align}#1\end{align}}
\newcommand{\eq}[1]{\begin{align*}#1\end{align*}}
\newcommand{\proofof}{Proof of }
\newcommand{\proofend}{}
\def\subsubsect#1{\vspace{1ex plus 0.5ex minus 0.5ex}\noindent{\bf\boldmath{#1.}}}
\usetikzlibrary{trees,arrows,automata}
\renewcommand{\v}[1]{{\boldsymbol #1}}
\newcommand{\argmax}{\operatornamewithlimits{arg\,max}}
\newcommand{\ind}[1]{[\![ #1 ]\!]}
\newcommand{\R}[0]{\mathbb R}
\newcommand{\N}[0]{\mathbb N}
\usepackage{amsthm}
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newenvironment{keywords}{\centerline{\bf\small
Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\newcommand{\xy}{x\!y}
\newcommand{\yx}{y\!x}
\newcommand{\yr}{y\!r}
\newcommand{\A}{\mathcal Y}
\newcommand{\Rc}{\mathcal R}
\newcommand{\Mc}{\mathcal M}
\newcommand{\X}{\mathcal X}
\newcommand{\B}{\mathcal B}
\newcommand{\E}{\mathbf E}
\renewcommand{\Alph}{\mathcal A}
\renewcommand{\d}{\gamma}
\renewcommand{\O}{\mathcal O}
\sloppy
\raggedbottom
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Title - Page %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{
\vskip 2mm\bf\Large\hrule height5pt \vskip 4mm
Asymptotically Optimal Agents
\vskip 4mm \hrule height2pt}
\author{{\bf Tor Lattimore}$^1$ and {\bf Marcus Hutter}$^{1,2}$\\[3mm]
\normalsize Research School of Computer Science \\[-0.5ex]
\normalsize $^1$Australian National University and $^2$ETH Z{\"u}rich \\[-0.5ex]
\normalsize\texttt{\{tor.lattimore,marcus.hutter\}@anu.edu.au}
}
\date{25 July 2011}
\maketitle
\begin{abstract}
Artificial general intelligence aims to create agents capable of learning to solve arbitrary interesting problems. We define two versions of
asymptotic optimality and prove that no agent can satisfy the strong version, while in some cases, depending on the discount function, there
does exist a non-computable weakly asymptotically optimal agent.
\def\contentsname{\centering\normalsize Contents}
{\parskip=-2.7ex\tableofcontents}
\end{abstract}
\begin{keywords}
Rational agents;
sequential decision theory;
artificial general intelligence;
reinforcement learning;
asymptotic optimality;
general discounting.
\end{keywords}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The dream of artificial general intelligence is to create an agent that, starting with no knowledge of its environment, eventually
learns to behave optimally. This means it should be able to learn chess just by playing, or Go, or how to drive a car or mow the lawn, or any
task we could conceivably be interested in assigning it.
Before considering the existence of universally intelligent agents, we must be precise about what is meant by optimality.
If the environment and goal are known, then subject to computation issues, the optimal policy is easy to construct using an expectimax search
from sequential decision theory \cite{NR03}.
However, if the true environment is unknown then the agent will necessarily spend some time exploring, and so
cannot immediately play according to the optimal policy.
Given a class of environments, we suggest two definitions of asymptotic optimality for an agent.
\begin{enumerate}
\item An agent is strongly asymptotically optimal if for every environment in the class it plays optimally in the limit.
\item It is weakly asymptotically optimal if for every environment in the class it plays optimally {\it on average} in the limit.
\end{enumerate}
The key difference is that a strongly asymptotically optimal agent must eventually stop exploring, while a weakly asymptotically optimal agent
may explore forever, but with decreasing frequency.
In this paper we consider the (non-)existence of weak/strong asymptotically optimal agents in the class of all deterministic computable environments.
The restriction to deterministic is for the sake of simplicity and because the results for this case are already sufficiently non-trivial to be interesting.
The restriction to computable environments is more philosophical. The Church-Turing thesis is the unprovable hypothesis that anything that can intuitively be
computed can also be computed by a Turing machine. Applying this to physics leads to the strong Church-Turing thesis that
the universe is computable (possibly stochastically computable, i.e.\ computable when given access to an oracle of random noise). Having made these assumptions, the largest interesting class then becomes the class of computable
(possibly stochastic) environments.
In \cite{Hut04}, Hutter conjectured that his universal Bayesian agent, AIXI, was weakly asymptotically optimal in the class
of all computable stochastic environments. Unfortunately this
was recently shown to be false in \cite{Ors10}, where it is proven that no Bayesian agent (with a static prior) can be
weakly asymptotically optimal in this class.\footnote{Or even the class of computable deterministic environments.} The key idea behind Orseau's proof was to show that AIXI eventually stops exploring. This is somewhat surprising because it is normally assumed that Bayesian agents solve
the exploration/exploitation dilemma in a principled way. This result is a bit reminiscent of Bayesian (passive induction)
inconsistency results \cite{DF86a,DF86b}, although the details of the failure are very different.
We extend the work of \cite{Ors10}, where only Bayesian agents are considered, to show that non-computable weakly asymptotically optimal
agents do exist in the class of deterministic computable environments for some discount
functions (including geometric), but not for others. We also show that no asymptotically optimal agent can be computable, and that
for all ``reasonable'' discount functions there does not exist a strongly asymptotically optimal agent.
The weakly asymptotically optimal agent we construct is similar to AIXI, but with an exploration component similar to $\epsilon$-exploration for finite-state
Markov decision processes or the UCB algorithm for bandits. The key is to explore sufficiently often, and sufficiently deeply, to ensure that the environment
used as the model is an adequate approximation of the true environment. At the same time, the agent must explore infrequently enough that it actually
exploits its knowledge. Whether or not it is possible to get this balance right depends, somewhat surprisingly, on how forward-looking the agent
is (determined by the discount function). That it is sometimes impossible to explore enough to learn the true environment without damaging even
a weak form of asymptotic optimality is surprising.
Note that the exploration/exploitation problem is
well-understood in the Bandit case \cite{Auer:02ucb,Berry:85}
and for (finite-state stationary) Markov decision
processes \cite{Strehl:08}. In these restrictive settings,
various satisfactory optimality criteria are available.
%
In this work, we make no assumptions such as Markovianity,
stationarity, or ergodicity beyond computability of the
environment. So far, no satisfactory optimality definition is
available for this general case.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Notation and Definitions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We use similar notation to \cite{Hut04,Ors10} where the agent takes actions and the environment returns an observation/reward pair.
%-------------------------------%
\subsubsect{Strings}
%-------------------------------%
A finite string $a$ over alphabet $\Alph$ is a finite sequence $a_1a_2a_3\cdots a_{n-1}a_n$ with $a_i \in \Alph$.
An infinite string $\omega$ over alphabet $\Alph$ is an infinite sequence $\omega_1\omega_2\omega_3\cdots$.
%
$\Alph^n$, $\Alph^*$ and $\Alph^\infty$ are the sets of strings of length $n$, strings of finite length, and infinite strings
respectively.
%
Let $x$ be a string (finite or infinite) then substrings are denoted
$x_{s:t} := x_s x_{s+1} \cdots x_{t-1} x_{t}$ where $s,t\in\N$ and $s \leq t$.
Strings may be concatenated.
%
Let $x,y \in \Alph^*$ of length $n$ and $m$ respectively, and $\omega \in \Alph^\infty$. Then define
$xy := x_1 x_2 \cdots x_{n-1} x_{n} y_1 y_2 \cdots y_{m-1} y_m$ and
$x\omega := x_1 x_2 \cdots x_{n-1} x_{n} \omega_1 \omega_2 \omega_3 \cdots$.
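As a concrete illustration (a Python sketch, not part of the formal development), the 1-indexed, inclusive substring notation $x_{s:t}$ and string concatenation correspond to the following:

```python
# Illustration of the substring notation x_{s:t}: 1-indexed and inclusive
# at both ends, unlike Python's 0-indexed, end-exclusive slices.
def substr(x, s, t):
    return x[s - 1:t]

x = "abcdef"
assert substr(x, 2, 4) == "bcd"                   # x_{2:4}
assert substr(x, 1, 3) + substr(x, 4, 6) == x     # concatenation x_{1:3} x_{4:6}
```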
Some useful shorthands,
\eqn{
\label{eqn-string} x_{<t} := x_{1:t-1} \qquad\qquad x_{\leq t} := x_{1:t}
}
\begin{wrapfigure}{r}{7.5cm}
\begin{center}
\begin{tikzpicture}[block/.style={rectangle}, >=latex', ->, scale=0.75]
\foreach \x in {1,2,3,4,5} {
\node[block] at (1.2*\x, 2.8) (input\x) {$o_\x | r_\x$};
\node[block] at (1.2*\x, 0) (output\x) {$y_\x$};
}
\node[block] at (1.2*6, 0) (output6) {$\cdots$};
\node[block] at (1.2*6, 2.8) (input6) {$\cdots$};
\node[block] at (2.4, 1.4) (agent) {$agent$, $\pi$};
\node[block] at (6.5, 1.4) (env) {$environment$, $\mu$};
\path (input3) edge node {} (agent)
(agent) edge node {} (output4)
(output4) edge node {} (env)
(env) edge node {} (input4)
;
\end{tikzpicture}
\end{center}
\end{wrapfigure}
%-------------------------------%
\subsubsect{Environments and Optimality}
%-------------------------------%
Let $\A$, $\O$ and $\Rc \subset \R$ be action, observation and reward spaces respectively. Let $\X = \O\times\Rc$.
An agent interacts with an environment as illustrated in the diagram on the right. First, the agent takes an action, upon which
it receives a new observation/reward pair. The agent then takes another action, receives another observation/reward pair, and so-on
indefinitely. The goal of the agent is to maximise its discounted rewards over time.
In this paper we consider only deterministic environments where the next observation/reward pair is determined
by a function of the previous actions, observations and rewards.
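The interaction loop described above can be sketched as follows (a minimal Python illustration; the `agent` and `env` functions are hypothetical stand-ins, not the paper's constructions):

```python
# Minimal sketch of the agent-environment loop: the environment is a
# deterministic function of the history of actions and observation/reward
# pairs, as in the definition below.
def interact(agent, env, steps):
    history = []                       # list of (action, (observation, reward))
    rewards = []
    for _ in range(steps):
        y = agent(history)             # agent acts on the history so far
        x = env(history, y)            # deterministic observation/reward pair
        history.append((y, x))
        rewards.append(x[1])
    return rewards

agent = lambda h: 1                                # a constant policy
env = lambda h, y: (0, 1.0 if y == 1 else 0.0)     # rewards action 1
assert sum(interact(agent, env, 10)) == 10.0
```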
\begin{definition}[Deterministic Environment]
A {\it deterministic environment} $\mu$ is a function $\mu:(\A\times\X)^*\times \A \to \X$
where $\mu(\yx_{<t} y_t) = x_t$ is the observation/reward pair returned after the agent takes action $y_t$ having experienced history $\yx_{<t}$.
\end{definition}
A {\it discount function} is a sequence $\d$ with $\d_t \geq 0$ and $\Gamma_t := \sum_{i=t}^\infty \d_i \in (0, \infty)$ for all $t$.
The {\it effective horizon} of the discount function at time $t$ is
\eqn{
H_t(p) := \min\left\{h : \sum_{i=t}^{t+h} {\d_i \over \Gamma_t} > p\right\}.
}
An infinite sequence of rewards starting at time
$t$, $r_t, r_{t+1}, r_{t+2},\cdots$ is given a value of ${1 \over \Gamma_t} \sum_{i=t}^\infty \d_i r_i$. The term ${1 \over \Gamma_t}$
is a normalisation term to ensure that values scale in such a way that they can still be compared in the limit.
%
A discount function is computable if there exists a Turing machine computing it. All well-known discount functions, such as geometric,
fixed-horizon and hyperbolic, are computable.
%
Note that $H_t(p)$ exists for all $p \in [0, 1)$ and represents the effective horizon of the agent. After $H_t(p)$ time-steps into the
future, starting at time $t$, the agent stands to gain/lose at most $1 - p$.
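For instance, under geometric discounting $\d_i = \gamma^i$ the effective horizon is finite and independent of $t$. The following sketch (illustrative only; the discount rate $\gamma = 0.9$ is an assumed example value) computes $H_t(p)$ directly from the definition:

```python
# Effective horizon H_t(p) = min{ h : sum_{i=t}^{t+h} gamma_i / Gamma_t > p }
# for geometric discounting gamma_i = gamma**i (an illustrative choice).
def effective_horizon(t, p, gamma=0.9):
    Gamma_t = gamma**t / (1 - gamma)     # tail sum of the geometric series
    acc = 0.0
    h = 0
    while True:
        acc += gamma**(t + h)
        if acc / Gamma_t > p:
            return h
        h += 1

# For geometric discounting the horizon does not depend on t.
assert effective_horizon(1, 0.99) == effective_horizon(100, 0.99)
```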
\begin{definition}[Values and Optimal Policy]\label{defn_optimal}
The value of policy $\pi$ in environment $\mu$ when starting from history $\yx^{\mu,\pi}_{<t}$ is
$V^{\pi}_\mu(\yx^{\mu,\pi}_{<t}) := {1 \over \Gamma_t} \sum_{i=t}^\infty \d_i r_i$, where $r_t, r_{t+1}, \cdots$ are the rewards
generated when $\pi$ interacts with $\mu$ from history $\yx^{\mu,\pi}_{<t}$. The optimal value is
$V^*_\mu := \sup_\pi V^{\pi}_\mu$ and $\pi^*_\mu := \argmax_\pi V^{\pi}_\mu$ is an optimal policy.
\end{definition}
A policy $\pi$ is {\it strongly asymptotically optimal} in a class of environments $\Mc$ if
$\lim_{t\to\infty} \left[V^*_\mu(\yx^{\pi,\mu}_{<t}) - V^{\pi}_\mu(\yx^{\pi,\mu}_{<t})\right] = 0$ for all $\mu \in \Mc$,
and {\it weakly asymptotically optimal} if
$\lim_{n\to\infty} {1 \over n} \sum_{t=1}^n \left[V^*_\mu(\yx^{\pi,\mu}_{<t}) - V^{\pi}_\mu(\yx^{\pi,\mu}_{<t})\right] = 0$
for all $\mu \in \Mc$. From now on $\Mc$ is the class of all deterministic computable environments.
That no computable policy can be asymptotically optimal follows from a diagonalisation argument: given a computable policy $\pi$,
define the environment $\mu$ (with trivial observations) by
\eqn{
\mu(\yx_{<t} y_t) := \begin{cases}
1 & \text{if } y_t \neq \pi(\yx^{\mu,\pi}_{<t}) \\
0 & \text{otherwise}
\end{cases}
}
Since $\pi$ is computable $\mu$ is as well. Therefore $\mu \in \Mc$. Now $V^*_\mu(\yr_{<t}) = 1$ while
$V^{\pi}_\mu(\yr_{<t}) = 0$ for all $t$, so $\pi$ is not even weakly asymptotically optimal.
For all ``reasonable'' discount functions a more intricate diagonalisation also rules out strongly asymptotically optimal policies, computable or not.
We conjecture that the number of time-steps $i \leq t$ at which a computable policy is suboptimal is $> c_pt$ with $c_p > 0$ for infinitely many $t$,
but the proof will likely be messy.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Existence of Weakly Asymptotically Optimal Policies}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In the previous section we showed that there does not exist a strongly asymptotically optimal policy (for most discount functions)
and that any weakly asymptotically optimal policy must be incomputable. In this section we show that a weakly
asymptotically optimal policy exists for geometric discounting (and is, of course, incomputable).
The policy is reminiscent of $\epsilon$-exploration in finite-state MDPs (or UCB for bandits) in that it spends
most of its time exploiting the information it already has, while
still exploring sufficiently often (and for sufficiently long) to detect any significant errors in its model.
The idea is to use a model-based policy that takes as its
current model the first environment in the model class (all computable deterministic environments) consistent
with the history seen so far. With increasing probability it takes the best action according to this model,
while still occasionally exploring randomly. When it explores, it always does so in bursts of increasing length.
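The first-consistent-environment rule can be sketched as follows (a toy Python illustration with a hypothetical three-element model class, not the enumeration of all computable environments):

```python
# First-consistent-environment model selection: the model is the first
# environment in the (here: tiny, hypothetical) class that reproduces
# every observation in the history.
def first_consistent(envs, history):
    # history is a list of (action, observation) pairs
    for i, env in enumerate(envs):
        if all(env(history[:k], a) == x for k, (a, x) in enumerate(history)):
            return i
    return None

envs = [lambda h, a: 0,           # always observe 0
        lambda h, a: a,           # echo the action
        lambda h, a: len(h) % 2]  # alternate 0, 1, 0, ...
true_env = envs[1]
history = []
for a in [0, 1, 0, 1]:
    history.append((a, true_env(history, a)))
assert first_consistent(envs, history) == 1   # env 0 eliminated by action 1
```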
%
\begin{definition}[History Consistent]
A deterministic environment $\mu$ is {\it consistent} with history $\yx_{<t}$ if $\mu(\yx_{<i} y_i) = x_i$ for all $i < t$.
\end{definition}
Let $\mu_1, \mu_2, \mu_3, \cdots$ be an enumeration of the environments in $\Mc$ and let $\nu_t := \mu_{i_t}$ where
$i_t := \min\left\{i : \mu_i \text{ consistent with history } \yx^{\pi,\mu}_{<t}\right\}$.
\begin{definition}[Exploring Policy]\label{defn_det_asym}
Let $\chi \in \B^\infty$ be sampled with each $\chi_t = 1$ independently with probability ${1 \over t}$, and let
$\psi \in \B^\infty$ be sampled from the coin-flip measure. The policy $\pi$ acts as follows. If $\chi_t = 1$ and it is not
already exploring, it begins an exploration burst, whose length increases over time, during which it takes the random
actions given by $\psi$. Otherwise it takes the optimal action $\pi^*_{\nu_t}(\yx^{\pi,\mu}_{<t})$ for its current model $\nu_t$.
\end{definition}
\begin{theorem}\label{thm_weak_deterministic_optimal}
For geometric discounting, the policy $\pi$ of Definition \ref{defn_det_asym} is weakly asymptotically optimal in the
class $\Mc$ of all deterministic computable environments.
\end{theorem}
The proof relies on the following properties of the exploration process. Recall that $\dot\chi^h_t = 1$ if the agent
explores at some point during the time-steps $t, t+1, \cdots, t+h$.
\begin{lemma}[Exploration Lemma]
Let $\chi$ be as in Definition \ref{defn_det_asym}, $h \in \N$ and $a \in [0, 1]^\infty$. Then, with probability $1$:
\begin{enumerate}
\item $\limsup_{n\to\infty} {1 \over n} \#1(\dot\chi^h_{1:n}) = 0$.
\item If $\limsup_{n\to\infty} {1 \over n} \sum_{i=1}^n a_i = \epsilon > 0$ and $\alpha_i := \ind{a_i > \epsilon/2}$ then $\alpha_i = \chi_i = 1$ for infinitely many $i$.
\end{enumerate}
\end{lemma}
\begin{proof}
1. Let $i \in \N$, $\epsilon > 0$ and $E_i^\epsilon$ be the event that $\#1(\dot\chi^h_{1:2^i}) > 2^i \epsilon$.
Using the definition of $\dot\chi^h$, the expectation satisfies $\E\left[\#1(\dot\chi^h_{1:2^i})\right] < i(i + 1)h$, and applying the Markov inequality
gives $P(E_i^\epsilon) < i(i+1) h 2^{-i} / \epsilon$. Therefore $\sum_{i \in \N} P(E_i^\epsilon) < \infty$,
and the Borel-Cantelli lemma gives that $E_i^\epsilon$ occurs for only finitely many $i$ with probability $1$.
Now assume that $\limsup_{n\to\infty} {1 \over n} \#1(\dot\chi^h_{1:n}) > 2 \epsilon > 0$; we show that $E_i^\epsilon$ must then occur infinitely often.
By the definition of $\limsup$ and our assumption there exists a sequence
$n_1, n_2, \cdots$ such that $\#1(\dot\chi^h_{1:n_i}) > 2 n_i \epsilon$ for all $i \in \N$.
Let $n^+ := \min\left\{2^k : k \in \N,\, 2^k \geq n\right\}$ and note that $\#1(\dot\chi^h_{1:n_i^+}) > n_i^+ \epsilon$, which is
exactly $E^\epsilon_{\log_2 n_i^+}$. Therefore $E_i^\epsilon$ occurs for infinitely many $i$, a contradiction. Since $\epsilon > 0$ was arbitrary,
$\limsup_{n\to\infty} {1 \over n} \#1(\dot\chi^h_{1:n}) = 0$ with probability 1.\\
2. The probability that $\alpha_i = 1 \implies \chi_i = 0$ for all $i \geq T$ is
$P(\alpha_i = 1 \implies \chi_i = 0 \;\forall i \geq T) = \prod_{i=T}^\infty \left(1 - {\alpha_i \over i}\right) = 0$,
by Lemma \ref{lem_tech2}.
Therefore the probability that $\alpha_i = \chi_i = 1$ for only
finitely many $i$ is zero, i.e.\ there exist infinitely many $i$ with $\alpha_i = \chi_i = 1$ with probability $1$, as required.
\proofend\end{proof}
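A quick numerical sanity check of the exploration schedule (an illustrative simulation, assuming independent exploration probability $1/t$ as in the lemma):

```python
import random

# chi_t = 1 with probability 1/t: exploration recurs forever in expectation
# (the harmonic series diverges) while its empirical frequency vanishes.
random.seed(0)
n = 100_000
chi = [1 if random.random() < 1.0 / t else 0 for t in range(1, n + 1)]
count = sum(chi)          # E[count] = H_n ~ ln(n) + 0.577, about 12 here
freq = count / n
assert 1 <= count <= 200  # chi_1 = 1 always; far below any constant fraction
assert freq < 0.01
```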
\begin{lemma}[Approximation Lemma]\label{lem_approximation}
Let $\pi_1$ and $\pi_2$ be policies, $\mu$ an environment and $h \geq H_t(1-\epsilon)$. Let $\yx_{<t}$ be a history and
suppose $\pi_1$ and $\pi_2$ take the same actions for the next $h$ time-steps when interacting with $\mu$ from $\yx_{<t}$.
Then $\left|V^{\pi_1}_\mu(\yx_{<t}) - V^{\pi_2}_\mu(\yx_{<t})\right| \leq \epsilon$.
\end{lemma}
The lemma holds because rewards received beyond the effective horizon contribute at most $\epsilon$ to the normalised value.
We say $\mu$ is {\it $h$-different} to $\nu$ on a history if the two environments produce different observation/reward
pairs at some point within the next $h$ time-steps.
\begin{lemma}
If $\left|V^*_\mu(\yx^{\pi,\mu}_{<t}) - V^*_\nu(\yx^{\pi,\mu}_{<t})\right| > \epsilon$
then $\mu$ is $H_t(1 - \epsilon)$-different to $\nu$ on $\yx^{\pi,\mu}_{<t}$.
\end{lemma}
\begin{proof}
Follows from the approximation lemma.
\proofend\end{proof}
We are now ready to prove the main theorem.
\begin{proof}[\proofof Theorem \ref{thm_weak_deterministic_optimal}]
Let $\pi$ be the policy defined in Definition \ref{defn_det_asym} and
$\mu$ be the true (unknown) environment. Recall that $\nu_t = \mu_{i_t}$ with
$i_t = \min\left\{i: \mu_i \text{ consistent with history } \yx^{\pi,\mu}_{<t}\right\}$.
Assume $\pi$ is not weakly asymptotically optimal in $\mu$, so that
\eqn{
\limsup_{n\to\infty} {1 \over n} \sum_{t=1}^n \left[V^*_\mu(\yx^{\pi,\mu}_{<t}) - V^{\pi}_\mu(\yx^{\pi,\mu}_{<t})\right] =: \epsilon > 0.
}
Let $\alpha \in \B^\infty$ be defined by $\alpha_t := 1$ if and only if,
\eqn{
\label{eqn10-1} V^*_\mu(\yx^{\pi,\mu}_{<t}) - V^{\pi}_\mu(\yx^{\pi,\mu}_{<t}) > {\epsilon \over 2}
}
By part 2 of the exploration lemma there are, with probability $1$, infinitely many $t$ with $\alpha_t = \chi_t = 1$.
Let $t_1 < t_2 < \cdots$ enumerate the starting times of these exploration bursts and let $h$ bound the effective horizon,
which for geometric discounting does not depend on $t$. Since $\psi$ is sampled from the coin-flip measure,
\eq{
P(\forall k\, \exists i \in [t_k, t_k + h] \text{ with }
\psi_{i} \neq \pi^*_\mu(\yx^{\pi,\mu}_{<i})) = 0,
}
so with probability $1$ some burst reproduces the optimal actions of $\pi^*_\mu$ for an entire effective horizon. During
such a burst any environment preceding $\mu$ in the enumeration that is $H_t(1-\epsilon)$-different to $\mu$ on the
history is eliminated, so there exists a time $T$ after which $\nu_t$ is never $H_t(1-\epsilon)$-different to $\mu$.
Now let $h := H_t(1-\epsilon)$ and $t \geq T$. If $\dot\chi^h_t = 0$ then by the definition of $\pi$ and the approximation lemma we obtain
\eqn{
\label{eqn_main1} \left|V_\mu^{\pi_\nu^*}(\yx^{\pi,\mu}_{<t}) - V_\nu^{\pi_\nu^*}(\yx^{\pi,\mu}_{<t})\right| \leq \epsilon
}
for the model $\nu := \nu_t$, which is not $H_t(1-\epsilon)$-different to $\mu$ on the history. A symmetric application of
Lemma \ref{lem_approximation} bounds $V^*_\mu(\yx^{\pi,\mu}_{<t}) - V^{\pi}_\mu(\yx^{\pi,\mu}_{<t})$ by a constant
multiple of $\epsilon$ whenever $\dot\chi^h_t = 0$. By part 1 of the exploration lemma the time-steps with
$\dot\chi^h_t = 1$ have density zero with probability $1$, so the average value difference tends to zero, contradicting
the assumption that it equals $\epsilon > 0$.
\proofend\end{proof}
It remains to state and prove the two technical lemmas used above.
\begin{lemma}\label{lem_tech1}
Let $A$ be a multiset of $n$ values $a \in [0, 1]$ with $\sum_{a \in A} a \geq n\epsilon$. Then
\eqn{
\left|\left\{a \in A : a \geq {\epsilon \over 2}\right\}\right| > n {\epsilon \over 2}
}
\end{lemma}
\begin{lemma}\label{lem_tech2}
Let $a \in [0, 1]^\infty$ satisfy $\limsup_{n\to\infty} {1 \over n} \sum_{i=1}^n a_i = \epsilon > 0$ and let
$\alpha_i := \ind{a_i \geq \epsilon/4}$. Then $\prod_{i=T}^\infty \left[1 - {\alpha_i \over i}\right] = 0$ for every $T \in \N$.
\end{lemma}
\begin{proof}[\proofof Lemma \ref{lem_tech1}]
Let $A_{>} := \left\{a \in A : a \geq {\epsilon \over 2}\right\}$ and
$A_{<} := A - A_{>}$. Therefore
\eq{
n\epsilon &\leq \sum_{a\in A} a = \sum_{a\in A_{<}} a + \sum_{a\in A_{>}} a \\
& \leq \sum_{a\in A_{<}} {\epsilon \over 2} + \sum_{a \in A_{>}} 1 \\
& = |A_{<}| {\epsilon \over 2} + |A_{>}|
}
Rearranging gives $\left|\left\{ a \in A : a \geq {\epsilon \over 2} \right\} \right| \equiv |A_{>}| > n{\epsilon \over 2}$, as required.
\proofend\end{proof}
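A numerical illustration of the counting bound (a hedged sketch; uniform random values are merely one convenient test case for the lemma's hypothesis):

```python
import random

# If n values in [0, 1] sum to at least n*eps, then more than n*eps/2 of
# them are at least eps/2 (Lemma lem_tech1).
random.seed(1)
eps = 0.2
n = 1000
A = [random.random() for _ in range(n)]
assert sum(A) >= n * eps                          # hypothesis of the lemma
large = sum(1 for a in A if a >= eps / 2)
assert large > n * eps / 2                        # conclusion of the lemma
```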
\begin{proof}[\proofof Lemma \ref{lem_tech2}] First,
\eqn{
\prod_{i=1}^\infty\left[1-{\alpha_i \over i} \right] & \leq \exp\left[-\sum_{i=1}^\infty {\alpha_i \over i} \right]
\label{eqn13-1}
}
Equation (\ref{eqn13-1}) follows since $1 - a \leq \exp(-a)$ for all $a$.
Now since $\limsup_{n\to\infty} {1 \over n} \sum_{i=1}^n a_i = \epsilon$, for any $N$ there exists an $n > N$
such that ${1 \over n} \sum_{i=1}^n a_i > {\epsilon \over 2}$.
Let $n_1 = 0$ and inductively choose
$n_j = \min\left\{n: n > {8(n_{j-1}+1) \over \epsilon} \wedge {1 \over n} \sum_{i=1}^n a_i > {\epsilon \over 2} \right\}.$
By Lemma \ref{lem_tech1},
\eqn{
\label{eqn-tech2-1} \left|\left\{i \leq n_j : a_i \geq {\epsilon \over 4}\right\}\right| \geq n_j{\epsilon \over 4}
}
Therefore
\eqn{
\label{eqn-tech2-3} \sum_{i=n_j+1}^{n_{j+1}} {\alpha_i \over i} &\geq \sum_{i=n_{j+1} - {\epsilon \over 4} n_{j+1} + n_j + 1}^{n_{j+1}} {1 \over n_{j+1}} \\
\label{eqn-tech2-4} &\geq \sum_{i=(1 - {\epsilon \over 8})n_{j+1}}^{n_{j+1}} {1 \over n_{j+1}} = {\epsilon \over 8}
}
Equation (\ref{eqn-tech2-3}) follows from (\ref{eqn-tech2-1}) and the fact that ${1 \over i}$ is decreasing in $i$, while (\ref{eqn-tech2-4}) follows from
the definition of $n_j$ and algebra. Therefore
\eqn{
\nonumber \sum_{i=1}^\infty {\alpha_i \over i} &= \lim_{k\to\infty} \sum_{j=1}^k \sum_{i=n_j+1}^{n_{j+1}} {\alpha_i \over i} \\
\label{eqn-tech2-6} &\geq \lim_{k\to\infty} \sum_{j=1}^k {\epsilon \over 8} = \infty
}
Finally, substituting Equation (\ref{eqn-tech2-6}) into (\ref{eqn13-1}) gives,
\eq{
\prod_{i=1}^\infty \left[1 - {\alpha_i \over i}\right] = 0
}
as required.
\proofend\end{proof}
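The divergence-forces-vanishing phenomenon in the proof can also be checked numerically; in the extreme case $\alpha_i = 1$ for all $i$ the partial products telescope exactly:

```python
# With alpha_i = 1 for all i, prod_{i=2}^{n} (1 - 1/i) telescopes to 1/n,
# so the infinite product is 0, matching sum 1/i = infinity.
def partial_product(n):
    p = 1.0
    for i in range(2, n + 1):
        p *= 1.0 - 1.0 / i
    return p

assert abs(partial_product(1000) - 1.0 / 1000) < 1e-12
```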
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Table of Notation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{|p{1.6cm} | p{12.5cm}|}
\hline
{\bf Symbol} & {\bf Description} \\ \hline
$\A$ & Set of possible actions \\
$\O$ & Set of possible observations \\
$\Rc$ & Set of possible rewards \\
$\mu, \nu$ & Environments \\
$y$ & An action. \\
$x$ & An observation/reward pair \\
$r$ & A reward \\
$o$ & An observation \\
$\ind{expr}$ & The indicator (Iverson bracket): $\ind{expression} = 1$ if $expression$ is true and $0$ otherwise. \\
$\neg b$ & The {\it not} function. $\neg 0 = 1$ and $\neg 1 = 0$. \\
$\pi$ & A policy. \\
$\chi$ & An infinite binary string. $\chi_k = 1$ if an agent starts exploring at time-step $k$. \\
$\bar \chi$ & An infinite binary string. $\bar \chi_k = 1$ if an agent is exploring at time-step $k$. \\
$\dot \chi^h$ & An infinite binary string. $\dot \chi^h_k = 0$ if an agent will not explore for the next $h$ time-steps. \\
$\alpha$ & An infinite binary string. \\
$\psi$ & An infinite random binary string sampled from the coin flip measure. \\
$t, n, i, j, k$ & Time indices. \\
$\yx_{<t}$ & The interleaved history $y_1 x_1 y_2 x_2 \cdots y_{t-1} x_{t-1}$ before time $t$. \\
$\pi^*_\mu$ & An optimal policy for environment $\mu$. \\
\hline
\end{tabular}
\end{document}