
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%            Universal Algorithmic Intelligence:            %%
%%             A mathematical top->down approach             %%
%%                                                           %%
%% A Gentle Introduction to the Universal Algorithmic Agent AIXI %%
%%                                                           %%
%%              Marcus Hutter: Start: 20.09.02               %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%-------------------------------%
%   Document-Style              %
%-------------------------------%
\newif\ifagibook\agibookfalse
\newif\ifpublished\publishedtrue

\documentclass[12pt,twoside]{article}
\usepackage{latexsym}
\usepackage{makeidx}
\pagestyle{myheadings}
\markboth{\sc Marcus Hutter, Technical Report, IDSIA-01-03
}{\sc Universal Algorithmic Intelligence}
\setcounter{tocdepth}{4}
\setcounter{secnumdepth}{3}
\topmargin=0cm  \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\sloppy
\makeindex

%-------------------------------%
%       My Math-Spacings        %
%-------------------------------%
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip=  4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu                    \medmuskip=  1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}}    \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}}    \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}}  \def\eqan{\end{eqnarray*}\textmuskip}

%Default Math-Spacings
%\def\beq{\begin{equation}}    \def\eeq{\end{equation}}
%\def\beqn{\begin{displaymath}}\def\eeqn{\end{displaymath}}
%\def\bqa{\begin{eqnarray}}    \def\eqa{\end{eqnarray}}
%\def\bqan{\begin{eqnarray*}}  \def\eqan{\end{eqnarray*}}

%-------------------------------%
%   Macro-Definitions           %
%-------------------------------%
\newenvironment{keywords}{\centerline{\bf
Keywords}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip
1ex}

\renewenvironment{abstract}{\centerline{\bf
Abstract}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip
1ex}

\newtheorem{theorem}{Theorem}
\newtheorem{tablex}[theorem]{Table}
\newtheorem{figurex}[theorem]{Figure}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{axiom}[theorem]{Axiom}
\newtheorem{assumption}[theorem]{Assumption}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}[theorem]{Definition}

\def\ftablex#1#2#3{\begin{tablex}[#2]\label{#1} #3 \end{tablex} }
\def\ffigurex#1#2#3#4{{#4} \begin{figurex}[#2]\label{#1} #3 \end{figurex} }
\def\fclaim#1#2#3{\begin{claim}[#2]\label{#1} #3 \end{claim} }
\def\faxiom#1#2#3{\begin{axiom}[#2]\label{#1} #3 \end{axiom} }
\def\fassumption#1#2#3{\begin{assumption}[#2]\label{#1} #3 \end{assumption} }
\def\ftheorem#1#2#3{\begin{theorem}[#2]\label{#1} #3 \end{theorem} }
\def\fcorollary#1#2#3{\begin{corollary}[#2]\label{#1} #3 \end{corollary} }
\def\flemma#1#2#3{\begin{lemma}[#2]\label{#1} #3 \end{lemma} }
\def\fdefinition#1#2#3{\begin{definition}[#2]\label{#1} #3 \end{definition} }

\def\aidx#1{\protect\index{#1}} %Authors & editors to index
\def\idx#1{\index{#1}#1} %\idx{name} for also in text
\def\indxs#1#2{\index{#1!#2}\index{#2!#1}} %\idx{name} for also in text
\def\paragraph#1{\vspace{1ex}\noindent{\bf #1.}}
\def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}
\def\nq{\hspace{-1em}}
\def\qed{\hspace*{\fill}$\Box$}
\def\odt{{\textstyle{1\over 2}}}
\def\odf{{\textstyle{1\over 4}}}
\def\odA{{\textstyle{1\over A}}}
\def\odm{{\textstyle{1\over m}}}
\def\odn{{\textstyle{1\over n}}}
\def\eps{\varepsilon}
\def\epstr{\epsilon}                    % for empty string
\def\pb#1{\def\pb{\underline}\underline{#1}} % probability notation
\def\hh#1{{\dot{#1}}}                     % historic I/O
\def\best{*}                              % or {best}
\def\l{{\ell}}                               % length of string or program
\def\approxleq{\mbox{\raisebox{-0.8ex}{$\stackrel{\displaystyle<}\sim$}}} %% make nicer
\def\approxgeq{\mbox{\raisebox{-0.8ex}{$\stackrel{\displaystyle>}\sim$}}} %% make nicer
\def\equa{\stackrel+=}                 %\eqa unfortunately already used up
\def\leqa{\stackrel+\leq}              % don't put around another braces,
\def\geqa{\stackrel+\geq}              % since this messes up spacings
\def\eqm{\stackrel\times=}             % for some reason
\def\leqm{\stackrel\times\leq}
\def\geqm{\stackrel\times\geq}
%\def\EC{{\text{EC}}}
\def\AI{{\text{AI}}}
\def\SP{{\text{SP}}}
\def\CF{{\text{CF}}}
\def\SG{{\text{SG}}}
\def\FM{{\text{FM}}}
%\def\EX{{\text{EX}}}
\def\rhoai{{\rho}}
\def\M{{\cal M}}
\def\X{{\cal X}}
\def\Y{{\cal Y}}
\def\O{{\cal O}}
\def\F{{\cal F}}
\def\R{{\cal R}}
\def\E{{\bf E}}
\def\P{{\bf P}}
\def\B{{I\!\!B}}                       % Binary set (or \SetB)
\def\SetN{I\!\!N} \def\SetQ{I\!\!\!Q} \def\SetR{I\!\!R} \def\SetZ{Z\!\!\!Z}
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\ifagibook\def\Chapter{Section }\else\def\Chapter{Chapter }\fi
\def\mrcp{the chain rule}
\def\text#1{\mbox{\scriptsize{#1}}}    % if not using amstex

\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                      T i t l e - P a g e                      %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{titlepage}

\title{\vspace{-1cm}
{\small Technical Report IDSIA-01-03 \hfill
  \ifpublished In {\it Artificial General Intelligence}, 2007
  \else 17 January 2003 (minor rev.18Oct'04)
  \fi \\[4mm]}
  \Large\bf\hrule height1pt \vskip 3mm
\mbox{{\LARGE U}NIVERSAL {\LARGE A}LGORITHMIC {\LARGE I}NTELLIGENCE} \\[1mm]
\bf A mathematical top$\to$down approach
\vskip 3mm \hrule height1pt}

\author{Marcus Hutter\\[3ex]
\small IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland%
\\
\small marcus@idsia.ch \hspace{9ex} http://www.idsia.ch/$^{_{_\sim}}\!$marcus
}
\ifpublished\date{17 January 2003}\else\date{}\fi

\maketitle              % typeset the title of the contribution

\begin{keywords}
Artificial intelligence;
algorithmic probability;
sequential decision theory;
rational agents;
value function;
Solomonoff induction;
Kolmogorov complexity;
reinforcement learning;
universal sequence prediction;
strategic games;
function minimization;
supervised learning.
\end{keywords}

\begin{abstract}
Sequential decision theory formally solves the problem of rational
agents in uncertain worlds if the true environmental prior
probability distribution is known. Solomonoff's theory of
universal induction formally solves the problem of sequence
prediction for unknown prior distribution. We combine both ideas
and get a parameter-free theory of universal Artificial
Intelligence. We give strong arguments that the resulting AIXI
model is the most intelligent unbiased agent possible. We outline
how the AIXI model can formally solve a number of problem classes,
including sequence prediction, strategic games, function
minimization, reinforcement and supervised learning. The major
drawback of the AIXI model is that it is uncomputable. To overcome
this problem, we construct a modified algorithm AIXI$tl$ that is
still effectively more intelligent than any other time $t$ and
length $l$ bounded agent. The computation time of AIXI$tl$ is of
the order $t\cdot 2^l$. The discussion includes formal definitions
of intelligence order relations, the horizon problem and relations
of the AIXI theory to other AI approaches.
\end{abstract}

\end{titlepage}

%------------------------------%
%      Table of Contents       %
%------------------------------%
{\parskip=0ex\tableofcontents}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This \ifagibook chapter \else article \fi gives an introduction to
a mathematical theory for intelligence. We present the AIXI model,
a parameter-free optimal reinforcement learning agent embedded in
an arbitrary unknown environment.

The science of Artificial Intelligence (AI) may be defined as the
construction of intelligent systems and their analysis. A natural
definition of a {\it system} is anything that has an input and an
output stream. Intelligence is more complicated. It can have many
faces like creativity, solving problems, pattern recognition,
classification, learning, induction, deduction, building
analogies, optimization, surviving in an environment, language
processing, knowledge and many more. A formal definition
incorporating every aspect of intelligence, however, seems
difficult. Most, if not all known facets of intelligence can be
formulated as goal-driven or, more precisely, as maximizing some
utility function. It is, therefore, sufficient to study
goal-driven AI; e.g.\ the (biological) goal of animals and humans
is to survive and spread. The goal of AI systems should be to be
useful to humans. The problem is that, except for special cases,
we know neither the utility function nor the environment in which
the agent will operate in advance. The mathematical theory,
coined AIXI, is supposed to solve these problems.

Assume the availability of unlimited computational resources. The
first important observation is that this does not make the AI
problem trivial. Playing chess optimally or solving NP-complete
problems become trivial, but driving a car or surviving in nature
don't. This is because it is a challenge itself to
well-define the latter problems, not to mention presenting an
algorithm. In other words: The AI problem has not yet been
well defined. One may view AIXI as a suggestion for such a
mathematical definition of AI.

AIXI is a universal theory of sequential decision making akin to
Solomonoff's celebrated universal theory of induction. Solomonoff
derived an optimal way of predicting future data, given previous
perceptions, provided the data is sampled from a computable
probability distribution. AIXI extends this approach to an optimal
decision making agent embedded in an unknown environment. The {\em
main idea} is to replace the unknown environmental distribution
$\mu$ in the Bellman equations by a suitably generalized
universal Solomonoff distribution $\xi$. The state space is the
space of complete histories. AIXI is a universal theory without
adjustable parameters, making no assumptions about the environment
except that it is sampled from a computable distribution. From an
algorithmic complexity perspective, the AIXI model generalizes
optimal passive universal induction to the case of active agents.
From a decision-theoretic perspective, AIXI is a suggestion of a
new (implicit) ``learning'' algorithm, which may overcome all
(except computational) problems of previous reinforcement learning
algorithms.

There are strong arguments that AIXI is the most intelligent
unbiased agent possible. We outline for a number of problem
classes, including sequence prediction, strategic games, function
minimization, reinforcement and supervised learning, how the AIXI
model can formally solve them. The major drawback of the AIXI
model is that it is incomputable. To overcome this problem, we
construct a modified algorithm AIXI$tl$ that is still
effectively more intelligent than any other time $t$ and length $l$
bounded agent. The computation time of AIXI$tl$ is of the order $t
\cdot 2^l$. Other discussed topics are a formal definition of
an intelligence order relation, the horizon problem and relations of
the AIXI theory to other AI approaches.

The article is meant to be a gentle introduction to and discussion
of the AIXI model. For a mathematically rigorous treatment, many
subtleties, and proofs see the references to the author's works in
the annotated bibliography section at the end of this \ifagibook
chapter\else article\fi, and in particular the book
\cite{Hutter:04uaibook}. This section also provides references to
introductory textbooks and original publications on algorithmic
information theory and sequential decision theory.

%------------------------------%
%\paragraph{Contents} %technical
%------------------------------%
{\em \Chapter \ref{chAImu}} presents the theory of sequential
decisions in a very general form (called AI$\mu$ model) in which
actions and perceptions may depend on arbitrary past events. We
clarify the connection to the Bellman equations and discuss minor
parameters including (the size of) the I/O spaces and the lifetime
of the agent and their universal choice which we have in mind.
Optimality of AI$\mu$ is obvious by construction.

{\em \Chapter \ref{chSU}:} How and in which sense induction is
possible at all has been subject to long philosophical
controversies. Highlights are Epicurus' principle of multiple
explanations, Occam's razor, and probability theory. Solomonoff
elegantly unified all these aspects into one formal theory of
inductive inference based on a universal probability distribution
$\xi$, which is closely related to Kolmogorov complexity $K(x)$,
the length of the shortest program computing $x$. Rapid
convergence of $\xi$ to the unknown true environmental
distribution $\mu$ and tight loss bounds for arbitrary bounded
loss functions and finite alphabet can be shown. Pareto optimality
of $\xi$ in the sense that there is no other predictor that
performs better or equal in all environments and strictly better
in at least one can also be shown. In view of these results it is
fair to say that the problem of sequence prediction possesses a
universally optimal solution.

{\em \Chapter \ref{chAIxi}:} In the active case, reinforcement
learning algorithms are usually used if $\mu$ is unknown. They can
succeed if the state space is either small or has effectively been
made small by generalization techniques. The algorithms work only
in restricted (e.g.\ Markovian) domains, have problems with
optimally trading off exploration versus exploitation, have
nonoptimal learning rate, are prone to diverge, or are otherwise
ad hoc. The formal solution proposed here is to generalize
Solomonoff's universal prior $\xi$ to include action conditions
and replace $\mu$ by $\xi$ in the AI$\mu$ model, resulting in the
AI$\xi\equiv$AIXI model, which we claim to be universally optimal.
We investigate what we can expect from a universally optimal agent
and clarify the meanings of {\em universal}, {\em optimal}, etc.
Other discussed topics are formal definitions of an intelligence
order relation, the horizon problem, and Pareto optimality of
AIXI.

{\em \Chapter \ref{chApply}:} We show how a number of AI problem
classes fit into the general AIXI model. They include sequence
prediction, strategic games, function minimization, and supervised
learning. We first formulate each problem class in its natural way
(for known $\mu$) and then construct a formulation within the
AI$\mu$ model and show their equivalence. We then consider the
consequences of replacing $\mu$ by $\xi$. The main goal is to
understand in which sense the problems are solved by AIXI.

{\em \Chapter \ref{chTime}:} The major drawback of AIXI is that it
is incomputable, or more precisely, only asymptotically
computable, which makes an implementation impossible. To overcome
this problem, we construct a modified model AIXI$tl$, which is
still superior to any other time $t$ and length $l$ bounded
algorithm. The computation time of AIXI$tl$ is of the order
$t\cdot 2^l$. The solution requires an implementation of first-order logic, the definition of a universal Turing machine within
it and a proof theory system.

{\em \Chapter \ref{chDisc}:} Finally we discuss and remark on some
otherwise unmentioned topics of general interest. We
remark on various topics, including concurrent actions and
perceptions, the choice of the I/O spaces, treatment of encrypted
information, and peculiarities of mortal embodies agents. We
continue with an outlook on further research, including
optimality, down-scaling, implementation, approximation, elegance,
extra knowledge, and training of/for AIXI($tl$). We also include
some (personal) remarks on non-computable physics, the number of
wisdom $\Omega$, and consciousness.

\ifagibook An annotated bibliography concludes this chapter.
\else An annotated bibliography and other references conclude this work.
\fi

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Agents in Known Probabilistic Environments}\label{chAImu}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The general framework for AI might be viewed as the design and
study of intelligent agents \cite{Russell:03}. An agent is a
cybernetic system with some internal state, which acts with output
$y_k$ on some environment in cycle $k$, perceives some input $x_k$
from the environment and updates its internal state. Then the next
cycle follows. We split the input $x_k$ into a regular part $o_k$
and a reward $r_k$, often called reinforcement feedback. From time
to time the environment provides nonzero reward to the agent. The
task of the agent is to maximize its utility, defined as the sum
of future rewards. A probabilistic environment can be described by
the conditional probability $\mu$ for the inputs $x_1...x_n$ to
the agent under the condition that the agent outputs $y_1...y_n$.
Most, if not all environments are of this type. We give formal
expressions for the outputs of the agent, which maximize the total
$\mu$-expected reward sum, called value. This model is called the
AI$\mu$ model. As every AI problem can be brought into this form,
the problem of maximizing utility is hence being formally solved,
if $\mu$ is known.
%
Furthermore, we study some special aspects of the AI$\mu$ model.
We introduce factorizable probability distributions describing
environments with independent episodes. They occur in several
problem classes studied in Section~\ref{chApply} and are a special
case of more general separable probability distributions defined
in Section~\ref{secAIsep}. We also clarify the connection to
the Bellman equations of sequential decision theory and discuss
similarities and differences. We discuss minor parameters of our
model, including (the size of) the input and output spaces $\X$
and $\Y$ and the lifetime of the agent, and their universal
choice, which we have in mind.
%
There is nothing remarkable in this section; it is the essence of
sequential decision theory
\cite{VonNeumann:44,Bellman:57,Bertsekas:96,Sutton:98}, presented
in a new form. Notation and formulas needed in later sections are
simply developed. There are two major remaining problems: the
problem of the unknown true probability distribution $\mu$, which
is solved in Section~\ref{chAIxi}, and computational aspects,
which are addressed in Section~\ref{chTime}.

%-------------------------------------------------------------%
\subsection{The Cybernetic Agent Model}
%-------------------------------------------------------------%
A good way to start thinking about intelligent systems is to
consider more generally cybernetic systems,\index{cybernetic
systems} in AI usually called agents.\index{agents} This
avoids having to struggle with the meaning of \idx{intelligence}
from the very beginning. A cybernetic system is a control circuit
with input\index{input} $y$ and output\index{output} $x$
and an internal state.\index{state!internal} From an external
input and the internal state the agent calculates
deterministically or stochastically an output. This output
(\idx{action}) modifies the environment and leads to a new input
(\idx{perception}). This continues ad infinitum or for a finite
number of cycles.

%------------------------------%
\fdefinition{defAgent}{The Agent Model}{
%------------------------------%
An agent is a system that interacts with an environment in cycles
$k=1,2,3,...$. In cycle $k$ the action (output) $y_k \in \Y$ of
the agent is determined by a policy $p$ that depends on the
I/O-history $y_1x_1...y_{k-1}x_{k-1}$. The environment reacts to
this action and leads to a new perception (input) $x_k \in \X$
determined by a deterministic function $q$ or probability
distribution $\mu$, which depends on the history
$y_1x_1...y_{k-1}x_{k-1}y_k$. Then the next cycle $k+1$ starts.
}%------------------------------%

\noindent As explained in the last section, we need some
\idx{reward} assignment to the cybernetic system. The input $x$ is
divided into two parts, the standard input $o$ and some reward
input $r$. If input and output are represented by strings, a
\idx{deterministic} cybernetic system can be modeled by a
\idx{Turing machine} $p$, where $p$ is called the \idx{policy} of the agent,
which determines the (re)action to a perception. If the environment is
also computable it might be modeled by a Turing machine $q$ as
well. The interaction of the agent with the environment can be
illustrated as follows:

\begin{center}\label{cyberpic}
%\input KCUnAI.pic
\small\unitlength=0.95mm\thicklines
\begin{picture}(106,47)
\put( 1,41){\framebox(16,6)[cc]{$r_1 \;|\; o_1$}}
\put(17,41){\framebox(16,6)[cc]{$r_2 \;|\; o_2$}}
\put(33,41){\framebox(16,6)[cc]{$r_3 \;|\; o_3$}}
\put(49,41){\framebox(16,6)[cc]{$r_4 \;|\; o_4$}}
\put(65,41){\framebox(16,6)[cc]{$r_5 \;|\; o_5$}}
\put(81,41){\framebox(16,6)[cc]{$r_6 \;|\; o_6$}}
\put(97,47){\line(1,0){9}}\put(97,41){\line(1,0){9}}\put(102,44){\makebox(0,0)[cc]{...}}
\put( 1,1){\framebox(16,6)[cc]{$y_1$}}
\put(17,1){\framebox(16,6)[cc]{$y_2$}}
\put(33,1){\framebox(16,6)[cc]{$y_3$}}
\put(49,1){\framebox(16,6)[cc]{$y_4$}}
\put(65,1){\framebox(16,6)[cc]{$y_5$}}
\put(81,1){\framebox(16,6)[cc]{$y_6$}}
\put(97,7){\line(1,0){9}}\put(97,1){\line(1,0){9}}\put(102,4){\makebox(0,0)[cc]{...}}
%
%
\put(1,21){\line(1,0){16}}
\put(1,27){\line(1,0){16}}
\put(1,21){\line(0,6){6}}
\put(9,24){\makebox(0,0)[cc]{work}}
\put(17,17){\framebox(20,14)[cc]{$\displaystyle{Agent\atop\bf p}$}}
\put(37,27){\line(1,0){14}}
\put(37,21){\line(1,0){14}}
\put(39,24){\makebox(0,0)[lc]{tape ...}}
%
%
\put(56,21){\line(1,0){16}}
\put(56,27){\line(1,0){16}}
\put(56,21){\line(0,6){6}}
\put(64,24){\makebox(0,0)[cc]{work}}
\put(72,17){\framebox(20,14)[cc]{$\displaystyle{Environ\mbox{-}\atop ment\quad\bf q}$}}
\put(92,27){\line(1,0){14}}
\put(92,21){\line(1,0){14}}
\put(94,24){\makebox(0,0)[lc]{tape ...}}
%
%\normalcolor
\put(46,41){\vector(-2,-1){20}}
\put(81,31){\vector(-2,1){20}}
\put(54,7){\vector(3,1){30}}
\put(24,17){\vector(3,-1){30}}
\end{picture}
\end{center}
%
Both $p$ as well as $q$ have unidirectional input and output tapes
and bidirectional work
tapes.\index{tape!unidirectional}\index{tape!bidirectional} What
entangles the agent with the environment is the fact that the
upper tape serves as input tape for $p$, as well as output tape
for $q$, and that the lower tape serves as output tape for $p$ as
well as input tape for $q$. Further, the reading head must always
be left of the writing head, i.e.\ the symbols must first be
written before they are read. Both $p$ and $q$ have their own mutually
inaccessible work tapes containing their own ``secrets''. The
heads\index{Turing machine!head}\label{defxyr} move in the
following way. In the $k^{th}$ cycle $p$ writes $y_k$, $q$ reads
$y_k$, $q$ writes $x_k \equiv r_k o_k$, $p$ reads $x_k \equiv r_k
o_k$, followed by the $(k + 1)^{th}$ \idx{cycle} and so on. The
whole process starts with the first cycle, all heads on tape start
and work tapes being empty. We call Turing machines
behaving in this way {\it chronological Turing machines}. Before
continuing, some notations on strings are appropriate.

%-------------------------------------------------------------%
\subsection{Strings}\label{secStrings}\index{strings}\index{sequence}
%-------------------------------------------------------------%
We denote strings over the \idx{alphabet} $\X$ by $s =
x_1x_2...x_n$, with $x_k \in \X$, where $\X$ is alternatively
interpreted as a nonempty subset of $\SetN$ or itself as a
prefix-free set\index{set!prefix-free} of binary strings. The
length\index{string!length} of $s$ is $\l(s)=\l(x_1) +...+
\l(x_n)$. Analogous definitions hold for $y_k \in \Y$. We call
$x_k$ the $k^{th}$ input word\index{input!word} and $y_k$ the
$k^{th}$ output word\index{output!word} (rather than letter). The
string $s=y_1x_1...y_nx_n$ represents the input/output in
\idx{chronological} order. Due to the \idx{prefix property} of the
$x_k$ and $y_k$, $s$ can be uniquely separated into its words. The
words appearing in strings are always in chronological order. We
further introduce the following abbreviations:
$\epsilon$\index{string!empty} is the empty string,
$x_{n:m}:=x_nx_{n+1}...x_{m-1}x_m$ for $n\leq m$ and $\epsilon$
for $n>m$. $x_{<n}:=x_1... x_{n-1}$. Analogously for $y$. Further,
$y\!x_n :=y_nx_n$, $y\!x_{n:m} := y_nx_n...y_mx_m$, and so on.

%-------------------------------------------------------------%
\subsection{AI Model for Known Deterministic Environment}
%-------------------------------------------------------------%
\indxs{environment}{deterministic}%
Let us define for the chronological Turing machine\index{Turing
machine!chronological}\index{chronological!Turing machine} $p$ a
partial function also named $p : \X^* \rightarrow \Y^*$ with
$y_{1:k}=p(x_{<k})$, where $y_{1:k}$ is the output of Turing
machine $p$ on input $x_{<k}$ in cycle $k$, i.e.\ where $p$ has
read up to $x_{k-1}$ but no further.$\!$\footnote{Note that a
possible additional dependence of $p$ on $y_{<k}$ as mentioned in
Definition~\ref{defAgent} can be eliminated by recursive
substitution; see below. Similarly for $q$.} In an analogous way,
we define $q : \Y^* \rightarrow \X^*$ with $x_{1:k}=q(y_{1:k})$.
Conversely, for every partial
recursive\index{chronological!function} chronological function we
can define a corresponding chronological Turing machine. Each
(agent,environment) pair $(p,q)$ produces a unique \idx{I/O
sequence} $\omega^{pq}:=y_1^{pq}x_1^{pq}y_2^{pq}x_2^{pq}...$. When
we look at the definitions of $p$ and $q$ we see a nice symmetry
between the cybernetic system and the environment. Until now, not
much intelligence is in our agent. Now the credit assignment comes
into the game and removes the symmetry somewhat. We split the
input $x_k \in \X := \R \times \O$ into a regular part
\index{input!regular}\index{input!reward}%
\index{reward!maximize}\index{maximize!reward}%
$o_k \in \O$ and a reward $r_k \in \R \subset \SetR$. We define
$x_k \equiv r_k o_k$ and $r_k\equiv r(x_k)$. The goal of the agent
should be to maximize received rewards. This is called
reinforcement learning. The reason for the \idx{asymmetry} is that
eventually we (humans) will be the environment with which the
agent will communicate and {\it we} want to dictate what is good
and what is wrong, not the other way round. This one-way learning,
the agent learns from the environment, and not conversely, neither
prevents the agent from becoming more intelligent than the
environment, nor does it prevent the environment learning from the
agent because the environment can itself interpret the outputs
$y_k$ as a regular and a reward part. The environment is just not
forced to learn, whereas the agent is. In cases where we restrict
the reward to two values $r\in\R=\B:=\{0,1\}$, $r=1$ is
interpreted as a positive
feedback,\index{feedback!positive}\index{feedback!negative} called
{\it good} or {\it correct}, and $r = 0$ a negative feedback,
called {\it bad} or {\it error}. Further, let us restrict for a
while the \label{defHorizon}\idx{lifetime} (number of cycles) $m$
of the agent to a large but finite value. Let
\beqn
  V_{km}^{pq} \;:=\; \sum_{i=k}^m r(x_i^{pq})
\eeqn
be the\index{reward!total}\index{reward!future} future total reward
(called future utility), the agent $p$ receives from the
environment $q$ in the cycles $k$ to $m$. It is now natural to
call the agent $p^\best$ that maximizes $V_{1m}$ (called total
\idx{utility}), the {\it best} one.$\!$\footnote{$\arg\max_p V(p)$ is
the $p$ that maximizes $V(\cdot)$. If there is more than one
maximum we might choose the lexicographically smallest one for
definiteness.}
\beq\label{voptq}
 p^\best:=\arg\max_p V_{1m}^{pq} \quad\Rightarrow\quad
 V_{km}^{p^\best q} \geq V_{km}^{pq} \quad \forall p :
 y_{<k}^{pq}=y_{<k}^{p^\best q}
\eeq
For $k=1$ the condition on $p$ is nil. For $k>1$ it states that
$p$ shall be consistent with $p^*$ in the sense that they have the
same history. If $\X$, $\Y$ and $m$ are finite, the number of
different behaviors of the agent, i.e.\ the search space is
finite. Therefore, because we have assumed that $q$ is known,
$p^\best$ can effectively be determined by pre-analyzing all
behaviors. The main reason for restricting to finite $m$ was not
to ensure computability of $p^\best$ but that the limit
$m\to\infty$ might not exist. The ease with which we defined and
computed the optimal policy $p^\best$ is not remarkable. Just the
(unrealistic) assumption of a completely known deterministic
environment $q$ has trivialized everything.

%-------------------------------------------------------------%
\subsection{AI Model for Known Prior Probability}\label{secAIfunc}
%-------------------------------------------------------------%
\indxs{environment}{probabilistic}%
Let us now weaken our assumptions by replacing the deterministic environment $q$
with a probability distribution $\mu(q)$ over chronological
functions. Here $\mu$ might be interpreted in two ways. Either the
environment itself behaves stochastically defined by $\mu$ or the
true environment is deterministic, but we only have subjective
(probabilistic) information of which environment is the true
environment. Combinations of both cases are also possible. We
assume here that $\mu$ is known and describes the true stochastic
behavior of the environment. The case of unknown $\mu$ with the
agent having some beliefs about the environment lies at the heart
of the AI$\xi$ model described in Section~\ref{chAIxi}.

The {\em best} or {\em most intelligent} agent is now the one
that maximizes the {\em expected} utility (called value function)
$
  V_\mu^p\equiv V_{1m}^{p\mu}:=\sum_q\mu(q)V_{1m}^{pq}
$.
%$p^\best=\arg\max_p V_\mu^p$
This defines the AI$\mu$ model.

%------------------------------%
\fdefinition{defAImu}{The AI$\mu$ model}{
%------------------------------%
The AI$\mu$ model is the agent with policy $p^\mu$ that maximizes
the $\mu$-expected total reward $r_1 +...+ r_m$, i.e.\
$p^\best\equiv p^\mu:=\arg\max_p V_\mu^p$. Its value is
$V_\mu^\best:=V_\mu^{p^\mu}$.
}%------------------------------%

\noindent We need the concept of a {\em value function} in a
slightly more general form.

%------------------------------%
\fdefinition{defValue}{The $\mu$/true/generating value function}{
%------------------------------%
The agent's perception $x$ consists of a regular observation $o \in \O$
and a reward $r \in \R \subset \SetR$. In cycle $k$ the {\em
value} $V_{km}^{p\mu}(y\!x_{<k})$ is defined as the
$\mu$-expectation of the future reward sum $r_k +...+ r_m$ with
actions generated by policy $p$, and fixed history $y\!x_{<k}$. We
say that $V_{km}^{p\mu}(y\!x_{<k})$ is the (future) {\em value} of
policy $p$ in environment $\mu$ given history $y\!x_{<k}$, or
shorter, the $\mu$ or true or generating value of $p$ given
$y\!x_{<k}$. $V_\mu^p:=V_{1m}^{p\mu}$ is the (total) value of $p$.
}%------------------------------%

\noindent We now give a more formal definition for
$V_{km}^{p\mu}$. Let us assume we are in cycle $k$ with history
$\hh y\!\hh x_1...\hh y\!\hh x_{k-1}$ and ask for the {\it best}
output $y_k$. Further, let $\hh Q_k := \{q:q(\hh y_{<k})=\hh
x_{<k}\}$ be the set of all environments producing the above
\idx{history}. We say that $q\in\hh Q_k$ is {\em consistent} with
history $\hh y\!\hh x_{<k}$. The expected reward for the next
$m-k+1$ cycles (given the above history) is called the value of
policy $p$ and is given by a conditional probability:
\beq\label{eefunc}
  V_{km}^{p\mu}(\hh y\!\hh x_{<k}) \;:=\;
  { \sum_{q\in \hh Q_k} \mu(q)V_{km}^{pq} \over
    \sum_{q\in \hh Q_k} \mu(q) }.
\eeq
Policy $p$ and environment $\mu$ do not determine history
$\hh y\!\hh x_{<k}$, unlike the
deterministic case, because the history is no longer
deterministically determined by $p$ and $q$, but depends on $p$
and $\mu$ {\it and} on the outcome of a stochastic process.
Every new cycle adds new information ($\hh x_i$) to the
agent. This is indicated by the dots over the symbols.
In cycle $k$ we have to maximize the expected future
rewards, taking into account the information in the history $\hh
y\!\hh x_{<k}$. This information is not already present
in $p$ and $q/\mu$ at the agent's start, unlike in the deterministic
case.

Furthermore, we want to generalize the finite lifetime $m$ to a
dynamic (computable) farsightedness\index{farsightedness!dynamic}
$h_k \equiv m_k - k + 1 \geq 1$, called horizon. For
$m_k = m$ we have our original finite lifetime; for
$h_k = h$ the agent maximizes in every cycle the
next $h$ expected rewards. A discussion of the choices for $m_k$ is
delayed to Section~\ref{secHorizon}.
%
The next $h_k$ rewards are maximized by
\beqn
  p_k^\best \;:=\; \arg\max_{p\in \hh P_k} V_{km_k}^{p\mu}(\hh y\!\hh
  x_{<k}),
\eeqn
where $\hh P_k := \{p:\exists y_k:p(\hh x_{<k})=\hh y_{<k}y_k\}$
is the set of systems consistent with the current history.
Note that $p_k^\best$ depends on $k$ and is used only in step $k$ to
determine $\hh y_k$ by $ p_k^\best(\hh x_{<k}|\hh y_{<k}) = \hh
y_{<k}\hh y_k$. After writing $\hh y_k$ the environment replies
with $\hh x_k$ with (conditional) probability $\mu(\hh
Q_{k+1})/\mu(\hh Q_k)$. This probabilistic outcome provides new
information to the agent. The cycle $k + 1$ starts with
determining $\hh y_{k+1}$ from $p_{k+1}^\best$ (which can differ
from $p_k^\best$ for dynamic $m_k$) and so on. Note that
$p_k^\best$ implicitly also depends on $\hh y_{<k}$ because $\hh
P_k$ and $\hh Q_k$ do so. But recursively inserting
$p_{k-1}^\best$ and so on, we can define
\beq\label{pbestfunc}
  p^\best(\hh x_{<k}) \;:=\;
  p_k^\best(\hh x_{<k}|p_{k-1}^\best(\hh x_{<k-1}|...p_1^\best))
\eeq
It is a chronological function and computable if $\X$, $\Y$ and
$m_k$ are finite and $\mu$ is computable. For constant $m$ one can
show that the policy (\ref{pbestfunc}) coincides with the AI$\mu$
model (Definition \ref{defAImu}). This also proves
\beq\label{voptmu}
  V_{km}^{\best\mu}(y\!x_{<k}) \geq V_{km}^{p\mu}(y\!x_{<k})
  \quad\forall p \;\mbox{consistent with}\; y\!x_{<k}
\eeq
similarly to (\ref{voptq}). For $k = 1$ this is obvious. We also
call (\ref{pbestfunc}) AI$\mu$ model. For
deterministic\footnote{We call a probability distribution
deterministic if it assumes values 0 and 1 only.} $\mu$ this model
reduces to the deterministic case discussed in the last
subsection.

It is important to maximize the sum of future rewards and not, for
instance, to be \idx{greedy} and only maximize the next reward, as is
done e.g.\ in sequence prediction. For example, let the environment
be a sequence of chess games, and each cycle corresponds to one
move. Only at the end of each game is a positive reward $r = 1$
given to the agent if it won the game (and made no illegal move).
For the agent, maximizing all future rewards means trying to win
as many games in as short as possible time (and avoiding illegal
moves). The same performance is reached if we choose $h_k$ much
larger than the typical game lengths. Maximization of only the
next reward would be a very bad chess playing agent. Even if we
would make our reward $r$ finer, e.g.\ by evaluating the number of
chessmen, the agent would play very bad chess for $h_k = 1$,
indeed.

The AI$\mu$ model still depends on $\mu$ and $m_k$; $m_k$ is
addressed in Section~\ref{secHorizon}. To get our final universal
AI model the idea is to replace $\mu$ by the universal probability
$\xi$, defined later. This is motivated by the fact that $\xi$
converges to $\mu$ in a certain sense for any $\mu$. With $\xi$
instead of $\mu$ our model no longer depends on any parameters, so
it is truly universal. It remains to show that it behaves
intelligently. But let us continue step by step. In the following
we develop an alternative but equivalent formulation of the
AI$\mu$ model. Whereas the \idx{functional form} presented above
is more suitable for theoretical considerations, especially for
the development of a time-bounded version in
Section~\ref{secAIXItl}, the iterative and recursive\index{iterative
formulation}\index{recursive formulation} formulation of the next
subsections will be more appropriate for the explicit calculations
in most of the other sections.

%-------------------------------------------------------------%
\subsection{Probability Distributions}\label{secPD}
%-------------------------------------------------------------%
\index{probability!distribution}\index{probability distribution}%
We use Greek letters for probability distributions,
and \idx{underline} their
arguments to indicate that they are probability arguments. Let
$\rho_n(\pb x_1...\pb x_n)$ be the probability that an (infinite)
string starts with $x_1...x_n$. We drop the index on $\rho$ if it
is clear from its arguments:
\beq\label{prop}
  \sum_{x_n\in \X}\rho(\pb x_{1:n}) \equiv
  \sum_{x_n}\rho_n(\pb x_{1:n}) =
  \rho_{n-1}(\pb x_{<n}) \equiv
  \rho(\pb x_{<n})
  ,\quad
  \rho(\epsilon) \equiv \rho_0(\epsilon)=1.
\eeq
We also need conditional probabilities\index{probability
distribution!conditional} derived from the \idx{chain rule}. We prefer a
notation that preserves the chronological
order\index{chronological!order} of the words, in contrast to the
standard notation $\rho(\cdot|\cdot)$ that flips it. We extend
the definition of $\rho$ to the conditional case with the
following convention for its arguments: An underlined argument
$\pb x_k$ is a probability variable, and other non-underlined
arguments $x_k$ represent conditions. With this convention, the
conditional probability has the form $\rho(x_{<n}\pb x_n) =
\rho(\pb x_{1:n})/\rho(\pb x_{<n})$. The equation states that the
probability that a string $x_1...x_{n-1}$ is followed by $x_n$ is
equal to the probability of $x_1...x_n*$ divided by the
probability of $x_1...x_{n-1}*$. We use $x*$ as an abbreviation
for `strings starting with $x$'.

The introduced notation is also suitable for defining the
conditional probability $\rho(y_1\pb x_1...y_n\pb x_n)$ that the
environment reacts with $x_1...x_n$ under the condition that the
output of the agent is $y_1...y_n$. The environment is
chronological, i.e.\ input $x_i$ depends on $y\!x_{<i}y_i$ only. In
the probabilistic case this means that $\rho(y\!\pb
x_{<k}y_k) := \sum_{x_k}\rho(y\!\pb x_{1:k})$ is independent of
$y_k$, hence a tailing $y_k$ in the arguments of $\rho$ can be
dropped. Probability distributions with this property will be
called {\it chronological}. The $y$ are always conditions, i.e.\
are never underlined, whereas additional conditioning for the $x$ can
be obtained with \mrcp
\bqa\label{bayes2}
  \rho(y\!x_{<n}y\!\pb x_n) & = &
  \rho(y\!\pb x_{1:n})/\rho(y\!\pb x_{<n}) \quad\mbox{and}
  \\ \nonumber
  \rho(y\!\pb x_{1:n}) & = &
  \rho(y\!\pb x_1)\!\cdot\!\rho(y\!x_1y\!\pb x_2)\!\cdot...\cdot\!
  \rho(y\!x_{<n}y\!\pb x_n).
\eqa
The second equation is the first equation applied $n$ times.

%-------------------------------------------------------------%
\subsection{Explicit Form of the AI$\mu$ Model}\label{secAImurec}
%-------------------------------------------------------------%
\index{AI$\mu $ model!recursive \& iterative form}%
Let us define the AI$\mu$ model $p^\best$ in a different way:
%
Let $\mu(y\!x_{<k}y\!\pb x_k)$ be the true probability of input
$x_k$ in cycle $k$, given the history $y\!x_{<k}y_k$; $\mu(y\!\pb
x_{1:k})$ is the true chronological prior probability that the
environment reacts with $x_{1:k}$ if provided with actions
$y_{1:k}$ from the agent. We assume the cybernetic model depicted
on page \pageref{cyberpic} to be valid. Next we define the value
$V_{k+1,m}^{\best\mu}(y\!x_{1:k})$ to be the $\mu$-expected reward
sum $r_{k+1} +...+ r_m$ in cycles $k + 1$ to $m$ with outputs
$y_i$ generated by agent $p^\best$ that maximizes the expected
reward sum, and responses $x_i$ from the environment, drawn
according to $\mu$. Adding $r(x_k) \equiv r_k$ we get the reward including
cycle $k$. The probability of $x_k$, given $y\!x_{<k}y_k$, is
given by the conditional probability $\mu(y\!x_{<k}y\!\pb x_k)$. So
the expected reward sum in cycles $k$ to $m$ given $y\!x_{<k}y_k$
is
\beq\label{ebesty}
  V_{km}^{\best\mu}(y\!x_{<k}y_k) \;:=\;
  \sum_{x_k}[r(x_k)+V_{k+1,m}^{\best\mu}(y\!x_{1:k})] \!\cdot\!
  \mu(y\!x_{<k}y\!\pb x_k)
\eeq
Now we ask how $p^\best$ chooses $y_k$: It should choose $y_k$ as
to maximize the future rewards. So the expected reward in cycles
$k$ to $m$ given $y\!x_{<k}$ and $y_k$ chosen by $p^\best$ is $
V_{km}^{\best\mu}(y\!x_{<k})
 := \max_{y_k}V_{km}^{\best\mu}(y\!x_{<k}y_k)$ (see Figure~\ref{figexmax}).

\begin{figure}[tbh]
\ffigurex{figexmax}{Expectimax Tree/Algorithm for $\O=\Y=\B$}{}{%
\small
\unitlength=1mm
\begin{center}
\begin{picture}(145,60)(10,10)
% action $y_k$
\thicklines
\put(50,60){\circle*{1.5}}
\put(50,60){\line(-1,-1){20}}
\put(39,47){\makebox(0,0)[lc]{$y_k\!=0$}}
\put(50,60){\line(1,-1){20}}
\put(62,47){\makebox(0,0)[rc]{$y_k\!=1$}}
\put(50,55){\makebox(0,0)[ct]{$\underbrace{\;max\;}$}}
\put(55,59){\makebox(0,0)[lc]{$\displaystyle V_{km}^{\best\mu}(y\!x_{<k})=\max_{y_k}V_{km}^{\best\mu}(y\!x_{<k}y_k)$}}
\put(65,50){\makebox(0,0)[lc]{action $y_k$ with max value.}}
% perception $x_k$
\put(30,40){\circle*{1}}
\put(30,40){\line(-1,-2){10}}
\put(24,27){\makebox(0,0)[lc]{$o_k\!=0$}}
\put(23,23){\makebox(0,0)[lc]{$r_k\!=...$}}
\put(30,40){\line(1,-2){10}}
\put(37.5,27){\makebox(0,0)[lc]{$o_k\!=1$}}
\put(39,23){\makebox(0,0)[lc]{$r_k\!=...$}}
\put(30,35){\makebox(0,0)[ct]{$\underbrace{\bf E}$}}
\put(70,40){\circle*{1}}
\put(70,40){\line(-1,-2){10}}
\put(62.5,27){\makebox(0,0)[rc]{$o_k\!=0$}}
\put(61,23){\makebox(0,0)[rc]{$r_k\!=...$}}
\put(70,40){\line(1,-2){10}}
\put(76,27){\makebox(0,0)[rc]{$o_k\!=1$}}
\put(77.5,23){\makebox(0,0)[rc]{$r_k\!=...$}}
\put(70,35){\makebox(0,0)[ct]{$\underbrace{\bf E}$}}
\put(75,39){\makebox(0,0)[lc]{$\displaystyle V_{km}^{\best\mu}(y\!x_{<k}y_k)=\sum_{x_k}[r_k+V_{k+1,m}^{\best\mu}(y\!x_{1:k})]\mu(y\!x_{<k}y\!\pb x_k)$}}
\put(80,30){\makebox(0,0)[lc]{$\mu$-expected reward $r_k$ and observation $o_k$.}}
% action $y_{k+1}$
\thinlines
\put(20,20){\circle*{0.8}}
\put(20,20){\line(-1,-2){5}}
\put(20,20){\line(1,-2){5}}
\put(20,14){\makebox(0,0)[ct]{${{\scriptscriptstyle m\!a\!x}\atop \smile}$}}
\put(30,15){\makebox(0,0)[cc]{$\scriptstyle y_{k+1}$}}
\put(40,20){\circle*{0.8}}
\put(40,20){\line(-1,-2){5}}
\put(40,20){\line(1,-2){5}}
\put(40,14){\makebox(0,0)[ct]{${{\scriptscriptstyle m\!a\!x}\atop \smile}$}}
\put(50,15){\makebox(0,0)[cc]{$\scriptstyle y_{k+1}$}}
\put(60,20){\circle*{0.8}}
\put(60,20){\line(-1,-2){5}}
\put(60,20){\line(1,-2){5}}
\put(60,14){\makebox(0,0)[ct]{${{\scriptscriptstyle m\!a\!x}\atop \smile}$}}
\put(70,15){\makebox(0,0)[cc]{$\scriptstyle y_{k+1}$}}
\put(80,20){\circle*{0.8}}
\put(80,20){\line(-1,-2){5}}
\put(80,20){\line(1,-2){5}}
\put(80,14){\makebox(0,0)[ct]{${{\scriptscriptstyle m\!a\!x}\atop \smile}$}}
\put(85,19){\makebox(0,0)[lc]{\scriptsize $\displaystyle V_{k+1,m}^{\best\mu}(y\!x_{1:k})=\max_{y_{k+1}}V_{k+1,m}^{\best\mu}(y\!x_{1:k}y_{k+1})$}}
\put(15,8){\makebox(0,0)[cc]{$\cdots$}}
\put(25,8){\makebox(0,0)[cc]{$\cdots$}}
\put(35,8){\makebox(0,0)[cc]{$\cdots$}}
\put(45,8){\makebox(0,0)[cc]{$\cdots$}}
\put(55,8){\makebox(0,0)[cc]{$\cdots$}}
\put(65,8){\makebox(0,0)[cc]{$\cdots$}}
\put(75,8){\makebox(0,0)[cc]{$\cdots$}}
\put(85,8){\makebox(0,0)[cc]{$\cdots$}}
\end{picture}
\end{center}}
\end{figure}

Together with the induction start
\beq\label{ee0}
  V_{m+1,m}^{\best\mu}(y\!x_{1:m}) \;:=\; 0
\eeq
$V_{km}^{\best\mu}$ is completely defined.
We might summarize one cycle into the formula
\beq\label{defAImuVi}
  V_{km}^{\best\mu}(y\!x_{<k}) \;=\;
  \max_{y_k}\sum_{x_k}
  [r(x_k)+V_{k+1,m}^{\best\mu}(y\!x_{1:k})] \!\cdot\!
  \mu(y\!x_{<k}y\!\pb x_k)
\eeq
We introduce a dynamic (computable)
farsightedness\index{farsightedness!dynamic} $h_k \equiv m_k - k +
1 \geq 1$, called horizon\index{horizon}. For $m_k=m$, where $m$
is the lifetime of the agent, we achieve optimal behavior, for
limited farsightedness $h_k = h$ ($m = m_k =h + k - 1$), the agent
maximizes in every cycle the next $h$ expected rewards. A
discussion of the choices for $m_k$ is delayed to
Section~\ref{secHorizon}. If $m_k$ is our horizon function of
$p^\best$ and $\hh y\!\hh x_{<k}$ is the actual history in cycle
$k$, the output $\hh y_k$ of the agent is explicitly given by
\beq\label{pbestrec}
  \hh y_k \;=\; \arg\max_{y_k}V_{km_k}^{\best\mu}
  (\hh y\!\hh x_{<k}y_k)
\eeq
which in turn defines the policy $p^\best$. Then the environment responds $\hh x_k$ with
probability $\mu(\hh y\!\hh x_{<k}\hh y\!\pb{\hh
x}_k)$. Then cycle $k + 1$ starts. We might
unfold the recursion (\ref{defAImuVi}) further and give $\hh y_k$
nonrecursively as
\beq\label{ydotrec}
  \hh y_k \;\equiv\; \hh y_k^\mu \;:=\;
  \arg\max_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}\;...\;
  \max_{y_{m_k}}\sum_{x_{m_k}}
  (r(x_k)\!+...+\!r(x_{m_k})) \!\cdot\!
  \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m_k})
\eeq
This has a direct interpretation: The probability of inputs
$x_{k:m_k}$ in cycle $k$ when the agent outputs $y_{k:m_k}$ with
actual history $\hh y\!\hh x_{<k}$ is $\mu(\hh y\!\hh x_{<k}y\!\pb
x_{k:m_k})$. The future reward in this case is $r(x_k) +...+
r(x_{m_k})$. The best expected reward is obtained by averaging
over the $x_i$ ($\sum_{x_i}$) and maximizing over the $y_i$. This
has to be done in chronological order to correctly incorporate the
dependencies of $x_i$ and $y_i$ on the history. This is essentially
the expectimax
algorithm/tree\index{expectimax!algorithm}\index{expectimax!tree}
\cite{Michie:66,Russell:03}. The AI$\mu$ model is {\it optimal} in
the sense that no other policy leads to higher expected reward.
%
The value for a general policy $p$ can be written in the form
\beq\label{vdefpmu}
  V_{km}^{p\mu}(y\!x_{<k}) \;:=\;
  \sum_{x_{1:m}}(r_k\!+...+\!r_m)\mu(y\!x_{<k}y\!\pb
  x_{k:m})_{|y_{1:m}=p(x_{<m})}
\eeq
%
\index{AI$\mu $ model!equivalence}%
As is clear from their interpretations, the iterative
environmental probability $\mu$ relates to the functional form in
the following way:
\beq\label{mufr}
  \mu(y\!\pb x_{1:k}) \;=\;
  \nq\sum_{q:q(y_{1:k})=x_{1:k}}\nq \mu(q)
\eeq
%
With this identification one can show \cite{Hutter:00kcunai,Hutter:04uaibook} the
following:

%------------------------------%
\ftheorem{thEqAI}{Equivalence of functional and explicit AI model}{
%------------------------------%
The actions of the functional AI model (\ref{pbestfunc}) coincide
with the actions of the explicit (recursive/iterative) AI model
(\ref{defAImuVi})--(\ref{ydotrec}) with environments identified by
(\ref{mufr}).
}%------------------------------%

\label{secAIspec}
\index{AI$\mu $ model!special aspects}

%-------------------------------------------------------------%
\subsection{Factorizable Environments}\label{secFacmu}
%-------------------------------------------------------------%
\index{factorizable!environment}%
Up to now we have made no restrictions on the form of the prior
probability $\mu$ apart from being a chronological probability
distribution. On the other hand, we will see that, in order to
prove rigorous reward bounds, the prior probability must satisfy
some separability condition to be defined later. Here we introduce
a very strong form of separability, when $\mu$ factorizes into
products.

Assume that the cycles are grouped into
independent episodes $r = 1,2,3,...$, where each episode $r$
consists of the cycles $k = n_r + 1,...,n_{r+1}$ for some
$0=n_0<n_1<...<n_s=n$:
\beq\label{facmu}
  \mu(y\!\pb x_{1:n}) \;=\;
  \prod_{r=0}^{s-1} \mu_r(y\!\pb x_{n_r+1:n_{r+1}})
\eeq
(In the simplest case, when all episodes have the
same length $l$ then $n_r=r \cdot l$). Then $\hh y_k$ depends on
$\mu_r$ and $x$ and $y$ of episode $r$ only, with $r$ such
that $n_r < k \leq n_{r+1}$. One can show that

\beq\label{facydot}
  \hh y_k \;=\; \arg\max_{y_k}V_{km_k}^{\best\mu}
  (\hh y\!\hh x_{<k}y_k) \;=\;
  \arg\max_{y_k}V_{kt}^{\best\mu}(\hh y\!\hh x_{<k}y_k)
\eeq
with $t := \min\{m_k,n_{r+1}\}$. The different episodes
are\index{independent!episodes} completely independent in the
sense that the inputs $x_k$ of different episodes are
statistically independent and depend only on the outputs $y_k$ of
the same \idx{episode}. The outputs $y_k$ depend on the $x$ and
$y$ of the corresponding episode $r$ only, and are independent of
the actual I/O of the other episodes.

Note that $\hh y_k$ is also independent of the choice of $m_k$,
as long as $m_k$ is sufficiently large. If all episodes have a
length of at most $l$, i.e.\ $n_{r+1} - n_r \leq l$ and if we
choose the horizon $h_k$ to be at least $l$, then $m_k \geq k + l
- 1 \geq n_r + l \geq n_{r+1}$ and hence $t=n_{r+1}$ independent
of $m_k$. This means that for factorizable $\mu$ there is no
problem in taking the limit $m_k \to \infty$. Maybe this limit can
also be performed in the more general case of a sufficiently
separable $\mu$. The (problem of the) choice of $m_k$ will be
discussed in more detail later.

Although factorizable $\mu$ are too restrictive to cover all AI
problems, they often occur in practice in the form of repeated
problem solving, and hence, are worthy of study. For example, if
the agent has to play games like chess repeatedly, or has to
minimize different functions, the different games/functions might
be completely independent, i.e.\ the environmental probability
factorizes, where each factor corresponds to a game/function
minimization. For details, see the appropriate sections on
strategic games and function minimization.

Further, for factorizable $\mu$ it is probably easier to derive
suitable reward bounds for the universal AI$\xi$ model defined in
the next section, than for the separable cases that will be
introduced later. This could be a first step toward a definition
and proof for the general case of separable problems. One goal of
this paragraph was to show that the notion of a factorizable
$\mu$ could be the first step toward a definition and analysis of
the general case of separable $\mu$.

%-------------------------------------------------------------%
\subsection{Constants and Limits}\label{secCL}\index{constants}\index{limits}
%-------------------------------------------------------------%
We have in mind a universal agent with complex interactions that
is at least as intelligent and complex as a \idx{human} being. One
might think of an agent whose input $y_k$ comes from a digital
\idx{video camera}, and the output $x_k$ is some \idx{image} to a
\idx{monitor},$\!$\footnote{Humans can only simulate a screen as
output device\index{output!device}\index{input!device} by drawing pictures.}
only for the rewards we might restrict to the most primitive
binary ones, i.e.\ $r_k \in \B$. So we think of the following
constant sizes:
\beqn
\begin{array}{ccccccccc}
  1 & \ll & \langle \l(y_kx_k)\rangle & \ll & k & \leq & m & \ll & |\Y\times \X| \\
  1 & \ll & 2^{16} & \ll & 2^{24} & \le & 2^{32} & \ll & 2^{65536}
\end{array}
\eeqn
The first two limits say that the actual number $k$ of
inputs/outputs should be reasonably large compared to the typical
length $\langle\l\rangle$ of the input/output words, which itself
should be rather sizeable. The last limit expresses the fact that
the total \idx{lifetime} $m$ (number of I/O cycles) of the agent
is far too small to allow every possible input to occur, or to try
every possible output, or to make use of identically repeated
inputs or outputs. We do not expect any useful outputs for
$k\approxleq\langle\l\rangle$. More interesting than the lengths
of the inputs is the complexity $K(x_1...x_k)$ of all inputs until
now, to be defined later. The environment is usually not
``\idx{perfect}''. The agent could either interact with an
\idx{imperfect} human or tackle a \idx{nondeterministic world}
(due to\index{quantum physics}\index{physics!quantum} quantum
mechanics or \idx{chaos}).$\!$\footnote{Whether there exist truly
stochastic processes at all is a difficult question. At least the
quantum indeterminacy comes very close to it.} In either case, the
sequence contains some \idx{noise}, leading to
$K(x_1...x_k)\propto \langle\l\rangle \cdot k$. The complexity of
the probability distribution of the input
sequence\index{complexity!input sequence} is something different.
We assume that this \idx{noisy world} operates according to some
simple computable rules. $K(\mu_k)\ll \langle\l\rangle \cdot k$,
i.e.\ the rules of the world can be highly compressed. We may
allow environments in which new aspects appear for $k \to \infty$,
causing a non-bounded $K(\mu_k)$.

In the following we never use these limits, except when explicitly
stated. In some simpler models and examples the size of the
constants will even violate these limits (e.g.\
$\l(x_k)=\l(y_k)=1$), but it is the limits above that the reader
should bear in mind. We are only interested in theorems that do
not degenerate under the above limits. In order to avoid
cumbersome convergence and existence considerations we make the
following assumptions throughout this work:

\fassumption{assFinite}{Finiteness}{
We assume that
\begin{itemize}\parskip=0ex\parsep=0ex\itemsep=0ex
\item the input/perception space $\X$ is finite,
\item the output/action space $\Y$ is finite,
\item the rewards are nonnegative and bounded,
      i.e.\ $r_k \in \R \subseteq [0,r_{max}]$,
\item the horizon $m$ is finite.
\end{itemize}
}
%
\index{input space!choice}\index{output space!choice}\index{horizon!choice}%
Finite $\X$ and bounded $\R$ (each separately) ensure existence of
$\mu$-expectations but are sometimes needed together. Finite $\Y$
ensures that $\arg\max_{y_k\in\Y}[...]$ exists, i.e.\ that maxima
are attained, while finite $m$ avoids various technical and
philosophical problems (Section~\ref{secHorizon}),
and positive rewards are needed for the time-bounded AIXI$tl$
model (Section~\ref{secAIXItl}). Many theorems can be generalized by
relaxing some or all of the above finiteness assumptions.

%-------------------------------------------------------------%
\subsection{Sequential Decision Theory}
%-------------------------------------------------------------%
One can relate (\ref{defAImuVi}) to the Bellman equations
\cite{Bellman:57} of sequential decision theory by identifying
complete histories $y\!x_{<k}$ with states, $\mu(y\!x_{<k}y\!\pb
x_k)$ with the state transition matrix, $V_\mu^\best$ with the
value function, and $y_k$ with the action in cycle $k$
\cite{Bertsekas:96,Russell:03}. Due to the use of complete
histories as state space, the AI$\mu$ model neither assumes
stationarity, nor the Markov property, nor complete accessibility
of the environment. Every state occurs at most once in the
lifetime of the system. For this and other reasons the explicit
formulation (\ref{ydotrec}) is more natural and useful here than
to enforce a pseudo-recursive Bellman equation form.

%------------------------------%
%\paragraph{Constants, limits, assumptions, computability}
%------------------------------%
As we have in mind a universal system with complex interactions,
the action and perception spaces $\Y$ and $\X$ are huge (e.g.\
video images), and every action or perception itself occurs
usually only once in the lifespan $m$ of the agent. As there is no
(obvious) universal similarity relation on the state space, an
effective reduction of its size is impossible, but there is no
principle problem in determining $y_k$ from (\ref{ydotrec}) as
long as $\mu$ is known and computable, and $\X$, $\Y$ and $m$ are
finite.

\index{optimal!policy}
\index{policy!optimal}
\index{policy}
\index{utility!expected}
\index{expected!utility}
\index{complete!history}
\index{history!complete}
\index{state!environmental}
\index{stationarity}
\index{Markov}
\index{accessibility}
\index{universe}
\index{reduction!state space}
\index{value iteration}
\index{policy iteration}
\index{reinforcement learning}
\index{learning!by reinforcement}
\index{generalization techniques}
\index{restricted domains}
\index{exploration}
\index{exploitation}
\index{learning!rate}

%------------------------------%
%\paragraph{Reinforcement learning}
%------------------------------%
Things drastically change if $\mu$ is unknown. Reinforcement
learning algorithms \cite{Kaelbling:96,Sutton:98,Bertsekas:96} are
commonly used in this case to learn the unknown $\mu$ or directly
its value. They succeed if the state space is either small or has
effectively been made small by generalization or function
approximation techniques. In any case, the solutions are either
{\it ad hoc}, work in restricted domains only, have serious
problems with state space exploration versus exploitation, or are
prone to diverge, or have nonoptimal learning rates. There is no
universal and optimal solution to this problem so far. The central
theme of this article is to present a new model and argue that it
formally solves all these problems in an optimal way. The true
probability distribution $\mu$ will not be learned directly, but
will be replaced by some generalized universal prior $\xi$, which
converges to $\mu$.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Universal Sequence Prediction}\label{chSP}\label{chSU}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This section deals with the question of how to make predictions
in unknown environments.
%
Following a brief description of important philosophical attitudes
regarding inductive reasoning and inference, we describe more
accurately what we mean by induction, and motivate why we can
focus on sequence prediction tasks. The most important concept is
Occam's razor (simplicity) principle.
%
Indeed, one can show that the best way to make predictions is
based on the shortest ($\widehat=$ simplest) description of the
data sequence seen so far. The most general effective descriptions
can be obtained with the help of general recursive functions, or
equivalently by using programs on Turing machines, especially on
the universal Turing machine. The length of the shortest program
describing the data is called the Kolmogorov complexity of the
data.
%
Probability theory is needed to deal with uncertainty.
The environment may be a stochastic process (e.g.\ gambling houses
or quantum physics) that can be described by ``objective''
probabilities. But also uncertain knowledge about the environment,
which leads to beliefs about it, can be modeled by ``subjective''
probabilities.
%
The old question left open by subjectivists of how to choose the a
priori probabilities is solved by Solomonoff's universal prior,
which is closely related to Kolmogorov complexity.
%
Solomonoff's major result is that the universal (subjective) posterior
converges to the true (objective) environment(al probability)
$\mu$. The only assumption on $\mu$ is that $\mu$ (which needs not
be known!) is computable. The problem of the unknown environment
$\mu$ is hence solved for all problems of inductive type, like
sequence prediction and classification.

%-------------------------------------------------------------%
\subsection{Introduction}
%-------------------------------------------------------------%

An important and highly nontrivial aspect of intelligence
is inductive inference. Simply speaking, induction is the process
of predicting the future from the past, or more precisely, it is
the process of finding rules in (past) data and using these rules
to guess future data. Weather or stock-market forecasting, or
continuing number series in an IQ test are nontrivial examples.
Making good predictions plays a central role in natural and
artificial intelligence in general, and in machine learning in
particular.
%
All induction problems can be phrased as sequence prediction
tasks. This is, for instance, obvious for time-series prediction,
but also includes classification tasks. Having observed data $x_t$
at times $t<n$, the task is to predict the $n^{th}$ symbol $x_n$
from sequence $x_1...x_{n-1}$. This {\em\idx{prequential
approach}} \cite{Dawid:84} skips over the intermediate step of
learning a model based on observed data $x_1...x_{n-1}$ and then
using this model to predict $x_n$. The prequential approach avoids
problems of model consistency, how to separate noise from useful
data, and many other issues. The goal is to make ``good''
predictions, where the prediction quality is usually measured by a
loss function, which shall be minimized.
%
The key concept to well-define and solve induction problems is
{\em Occam's razor} (simplicity) principle, which says that ``{\em
Entities should not be multiplied beyond necessity},'' which may
be interpreted as to keep the simplest theory consistent with the
observations $x_1...x_{n-1}$ and to use this theory to predict
$x_n$.
%
Before we can present Solomonoff's formal solution, we have to
quantify Occam's razor in terms of Kolmogorov complexity, and
introduce the notion of subjective/objective probabilities.

%-------------------------------------------------------------%
\subsection{Algorithmic Information Theory}
%-------------------------------------------------------------%
\index{Kolmogorov complexity}\index{complexity!Kolmogorov}

Intuitively, a string is simple if it can be described in a few
words, like ``the string of one million ones'', and is complex if
there is no such short description, like for a random string whose
shortest description is specifying it bit by bit. We can restrict
the discussion to binary strings, since for other (non-stringy
mathematical) objects we may assume some default coding as binary
strings. Furthermore, we are only interested in effective
descriptions, and hence restrict decoders to be Turing machines.
%
Let us choose some universal (so-called prefix) {\em Turing
machine $U$} with unidirectional binary input and output tapes and
a bidirectional work tape \cite{Li:97,Hutter:04uaibook}. We can
then define the (conditional) {\em prefix Kolmogorov complexity}
\cite{Chaitin:75,Gacs:74,Kolmogorov:65,Levin:74} of a binary
string $x$ as the length $l$ of the shortest program $p$, for
which $U$ outputs the binary string $x$ (given $y$)

%------------------------------%
\fdefinition{defKC}{Kolmogorov complexity}{
%------------------------------%
Let $U$ be a universal prefix Turing machine $U$. The
(conditional) prefix Kolmogorov
complexity\index{Kolmogorov complexity} is defined as the shortest
program $p$, for which $U$ outputs $x$ (given $y$):
\beqn
  K(x) \;:=\; \min_p\{\l(p): U(p)=x\},\quad
  K(x|y) \;:=\; \min_p\{\l(p): U(y,p)=x\}
\eeqn\vspace{-3ex}
}%------------------------------%
%
Simple strings like 000...0 can be generated by short programs,
and hence have low Kolmogorov complexity, but irregular (e.g.\
random) strings are their own shortest description, and hence have
high Kolmogorov complexity. An important property of $K$ is that
it is nearly independent of the choice of $U$. Furthermore, it
shares many properties with Shannon's entropy (information
measure) $S$, but $K$ is superior to $S$ in many respects. To be
brief, $K$ is an excellent universal complexity measure, suitable
for quantifying Occam's razor. There is (only) one severe
disadvantage: $K$ is not finitely computable. The major
algorithmic property of $K$ is that it is (only) co-enumerable,
i.e.\ it is approximable from above.

For general (non-string) objects one can specify some default
coding $\langle\cdot\rangle$ and define $K(\mbox{\it
object}):=K(\langle\mbox{\it object}\rangle)$, especially for
numbers and pairs, e.g.\ we abbreviate $K(x,y):=K(\langle
x,y\rangle)$. The most important information-theoretic properties
of $K$ are listed below, where we abbreviate $f(x)\leq g(x)+O(1)$
by $f(x)\leqa g(x)$. We also later abbreviate $f(x)=O(g(x))$ by
$f(x)\leqm g(x)$.

%------------------------------%
\ftheorem{thKCprop}{Information properties of Kolmogorov complexity}{\hfill
%------------------------------%
\begin{itemize}\parskip=0ex\parsep=0ex\itemsep=1ex
% upper bound
\item[$i)$] $K(x) \;\leqa\; \l(x) + 2\log\,\l(x),\qquad
             K(n) \;\leqa\; \log n + 2\log\log n$
% lower bound for most $n$, Kraft inequality
\item[$ii)$] $\sum_x 2^{-K(x)}\;\leq\;1,\quad$
             $K(x)\;\geq\;l(x)$ for `most' $x$,
%             $\quad K(n)\;\geq\;\log n\quad$ for `most' $x,n,\quad$
             $\quad K(n)\to\infty$ for $n\to\infty$.
% extra information
\item[$iii)$] $K(x|y) \;\leqa\; K(x) \;\leqa\; K(x,y)$
% subadditivity
\item[$iv)$] $K(x,y) \;\leqa\; K(x)+K(y),\qquad
            K(xy) \;\leqa\; K(x)+K(y)$
% symmetry of information
\item[$v)$] $K(x|y,K(y))+K(y) \;\equa\; K(x,y)
              \;\equa\; K(y,x) \;\equa\; K(y|x,K(x))+K(x)$
% information non-increase
\item[$vi)$] $K(f(x)) \;\leqa\; K(x)+K(f)\;$
     if $\;f:\B^*\to\B^*$ is recursive/computable
% coding relative to probability distribution / MDL
\item[$vii)$] $K(x) \;\leqa\; -\log_2P(x)+K(P)$ if $P:\B^*\to[0,1]$
     is recursive and $\sum_x P(x)\leq 1$ %\cite[p255]{Li:97}
\end{itemize}
}%------------------------------%

% Explanation
\noindent All (in)equalities remain valid if $K$ is (further)
conditioned under some $z$, i.e.\ $K(...)\leadsto K(...|z)$ and
$K(...|y)\leadsto K(...|y,z)$. Those stated are all valid within
an additive constant of size $O(1)$, but there are others which
are only valid to logarithmic accuracy.
%
$K$ has many properties in common with Shannon entropy as it
should be, since both measure the information content of a
string.
%
Property $(i)$ gives an upper bound on $K$, and property $(ii)$ is
Kraft's inequality which implies a lower bound on $K$ valid for
`most' $n$, where `most' means that there are only $o(N)$
exceptions for $n\in\{1,..., N\}$. Providing side information $y$
can never increase code length, requiring extra information $y$
can never decrease code length $(iii)$. Coding $x$ and $y$
separately never helps $(iv)$, and transforming $x$ does not
increase its information content $(vi)$. Property $(vi)$ also
shows that if $x$ codes some object $o$, switching from one coding
scheme to another by means of a recursive bijection leaves $K$
unchanged within additive $O(1)$ terms. The first nontrivial
result is the symmetry of information $(v)$, which is the analogue
of the multiplication/chain rule for conditional probabilities.
Property $(vii)$ is at the heart of the MDL principle
\cite{Rissanen:89}, which approximates $K(x)$ by
$-\log_2P(x)+K(P)$. See \cite{Li:97} for proofs.

%-------------------------------------------------------------%
\subsection{Uncertainty \& Probabilities}
%-------------------------------------------------------------%

For the {\em\idx{objectivist}} probabilities are real aspects of
the world. The outcome of an observation or an \idx{experiment} is
not deterministic, but involves \idx{physical random processes}.
%
Kolmogorov's axioms of probability theory formalize the properties
that probabilities should have. In the case of i.i.d.\
experiments the probabilities assigned to events can be
interpreted as limiting frequencies ({\em frequentist} view), but
applications are not limited to this case. Conditionalizing
probabilities and Bayes' rule\index{chain rule} are the major tools in
computing posterior probabilities from prior ones.
%
For instance, given the initial binary sequence $x_1...x_{n-1}$,
what is the probability of the next bit being $1$? The probability
of observing $x_n$ at time $n$, given past observations
$x_1...x_{n-1}$ can be computed with the multiplication or chain
rule\footnote{Strictly speaking it is just the definition of
conditional probabilities.} if the true generating distribution
$\mu$ of the sequences $x_1x_2x_3...$ is known: $\mu(x_{<n}\pb
x_n)=\mu(\pb x_{1:n})/\mu(\pb x_{<n})$ (see
Sections~\ref{secStrings} and \ref{secPD}). The problem, however,
is that one often does not know the true distribution $\mu$ (e.g.\
in the cases of weather and stock-market forecasting).

The {\em\idx{subjectivist}} uses probabilities to characterize an
agent's \idx{degree of belief} in (or plausibility of) something,
rather than to characterize physical random processes. This is the
most relevant interpretation of probabilities in AI.
%
It is somewhat surprising that plausibilities can be shown to also
respect Kolmogorov's axioms of probability and \mrcp\ for
conditional probabilities by assuming only a few plausible
qualitative rules they should follow \cite{Cox:46}. Hence, if the
plausibility of $x_{1:n}$ is $\xi(\pb x_{1:n})$, the degree of
belief in $x_n$ given $x_{<n}$ is, again, given by the conditional
probability: $\xi(x_{<n}\pb x_n)=\xi(\pb x_{1:n})/\xi(\pb
x_{<n})$.

The \mrcp\ allows determining posterior
probabilities/plausibilities from prior ones, but leaves open the
question of how to determine the priors themselves. In statistical
physics, the principle of indifference (symmetry principle) and
the maximum entropy principle can often be exploited to determine
prior probabilities, but only Occam's razor is general enough to
assign prior probabilities in {\em every} situation, especially to
cope with complex domains typical for AI.

%Neither objective nor subjective probabilities are essential in our treatment.

%-------------------------------------------------------------%
\subsection{Algorithmic Probability \& Universal Induction}
%-------------------------------------------------------------%

Occam's razor (appropriately interpreted and in compromise with
Epicurus' principle of indifference) tells us to assign high/low a
priori plausibility to simple/complex strings $x$. Using $K$ as
the complexity measure, any monotone decreasing function of $K$,
e.g.\ $\xi(\pb x)=2^{-K(x)}$ would satisfy this criterion. But $\xi$
also has to satisfy the probability axioms, so we have to be a bit
more careful.
%
Solomonoff \cite{Solomonoff:64,Solomonoff:78} defined the {\em
universal prior} $\xi(\pb x)$ as the probability that the output
of a universal Turing machine $U$ starts with $x$ when provided
with \idx{fair coin flips} on the input tape. Formally, $\xi$
can be defined as
\beq\label{xidef}
  \xi(\pb x)\;:=\;\sum_{p\;:\;U(p)=x*}\nq 2^{-\l(p)} \;\geq\; 2^{-K(x)}
\eeq
where the sum is over all (so-called minimal) programs $p$ for
which $U$ outputs a string starting with $x$.
The inequality follows by dropping all terms in $\sum_p$ except
for the shortest $p$ computing $x$. Strictly speaking $\xi$ is only
a {\em semimeasure} since it is not normalized to 1, but this is
acceptable/correctable.
%
We derive the following bound:
\beqn
  \sum_{t=1}^\infty(1\!-\!\xi(x_{<t}\pb x_t))^2 \;\leq\;
  -\odt \sum_{t=1}^\infty\ln \xi(x_{<t}\pb x_t) \;=\;
  -\odt\ln\xi(\pb x_{1:\infty}) \;\leq\;
  \odt\ln 2\!\cdot\!K(x_{1:\infty})
\eeqn
In the first inequality we have used $(1-a)^2\leq-\odt\ln a$ for
$0\leq a\leq 1$. In the equality we exchanged the sum with the
logarithm and eliminated the resulting product by \mrcp\
(\ref{bayes2}). In the last inequality we used (\ref{xidef}). If
$x_{1:\infty}$ is a computable sequence, then $K(x_{1:\infty})$ is
finite, which implies $\xi(x_{<t}\pb x_t)\to 1$
($\sum_{t=1}^\infty(1-a_t)^2<\infty\Rightarrow a_t\to 1$). This
means, that if the environment is a computable sequence
(whichsoever, e.g.\ the digits of $\pi$ or $e$ in binary
representation), after having seen the first few digits, $\xi$
correctly predicts the next digit with high probability, i.e.\ it
recognizes the structure of the sequence.

Assume now that the true sequence is drawn from the distribution
$\mu$, i.e.\ the true (objective) probability of $x_{1:n}$ is
$\mu(\pb x_{1:n})$, but $\mu$ is unknown. How is the posterior
(subjective) belief $\xi(x_{<n}\pb x_n)=\xi(\pb x_n)/\xi(\pb
x_{<n})$ related to the true (objective) posterior probability
$\mu(x_{<n}\pb x_n)$?
%
Solomonoff's \cite{Solomonoff:78} crucial result is that the
posterior (subjective) beliefs converge to the true (objective)
posterior probabilities, if the latter are computable. More precisely, he showed that
\beq\label{eukdistxi}
  \sum_{t=1}^\infty\sum_{x_{<t}}\mu(\pb x_{<t})
  \Big(\xi(x_{<t}\pb 0)-\mu(x_{<t}\pb 0)\Big)^2 \;\leqa\;
  {\odt}\ln 2\!\cdot\!K(\mu),
\eeq
$K(\mu)$ is finite if $\mu$ is computable, but the infinite sum on
the l.h.s.\ can only be finite if the difference $\xi(x_{<t}\pb
0)-\mu(x_{<t}\pb 0)$ tends to zero for $t\to\infty$ with
$\mu$-probability $1$. This shows that using $\xi$ as an estimate
for $\mu$ may be a reasonable thing to do.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Loss Bounds \& Pareto Optimality}\label{secLBPO}\label{secErr}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Most predictions are eventually used as a basis for some decision
or action, which itself leads to some reward or loss. Let
$\ell_{x_t y_t}\in[0,1]\subset\SetR$ be the received loss when
performing prediction/decision/action $y_t\in\Y$ and $x_t\in\X$ is
the $t^{th}$ symbol of the sequence. Let $y_t^\Lambda \in \Y$ be
the prediction of a (causal) prediction scheme $\Lambda$. The true
probability of the next symbol being $x_t$, given $x_{<t}$, is
$\mu(x_{<t}\pb x_t)$. The expected loss when predicting $y_t$ is
$\E[\ell_{x_t y_t}]$. The total $\mu$-expected loss suffered by
the $\Lambda$ scheme in the first $n$ predictions is
\beq\label{rholossi}
  L_n^\Lambda \;:=\; \sum_{t=1}^n
  \E[\ell_{x_t y_t^\Lambda}] \;=\;
  \sum_{t=1}^n\sum_{x_{1:t}\in\X^t} \mu(\pb x_{1:t}) \ell_{x_t y_t^\Lambda}
\eeq
For instance, for the error-loss $\l_{xy}=1$ if $x=y$ and 0 else,
$L_n^\Lambda$ is the expected number of prediction errors, which
we denote by $E_n^\Lambda$. The goal is to minimize the expected
loss. More generally, we define the $\Lambda_\rho$ sequence
prediction scheme (later also called SP$\rho$)
$
  y_t^{\smash{\Lambda_\rho}} :=
  \arg\min_{y_t\in\Y}\sum_{x_t}\rho(x_{<t}\pb x_t)\ell_{x_t y_t}
$
which minimizes the $\rho$-expected loss.
If $\mu$ is known, $\Lambda_\mu$ is obviously the
best prediction scheme in the sense of achieving minimal expected
loss ($L_n^{\smash{\Lambda_\mu}}\leq L_n^\Lambda$ for any $\Lambda$).
%
One can prove the following loss bound for the universal
$\Lambda_\xi$ predictor \cite{Hutter:01loss,Hutter:01alpha,Hutter:02spupper}
\beq\label{thULoss}
  0 \;\leq\; L_n^{\smash{\Lambda_\xi}}-L_n^{\smash{\Lambda_\mu}} \;\leq\;
  2\ln 2\!\cdot\! K(\mu)+2\sqrt{L_n^{\smash{\Lambda_\mu}}\ln 2\!\cdot\! K(\mu)}
\eeq
Together with $L_n\leq n$ this shows that $\odn
L_n^{\smash{\Lambda_\xi}}-\odn L_n^{\smash{\Lambda_\mu}}=O(n^{-1/2})$, i.e.\
asymptotically $\Lambda_\xi$ achieves the optimal average loss of
$\Lambda_\mu$ with rapid convergence. Moreover
$L_\infty^{\smash{\Lambda_\xi}}$ is finite if $L_\infty^{\smash{\Lambda_\mu}}$ is
finite and $L_n^{\smash{\Lambda_\xi}}/L_n^{\smash{\Lambda_\mu}}\to 1$ if
$L_\infty^{\smash{\Lambda_\mu}}$ is not finite. Bound (\ref{thULoss}) also
implies $L_n^\Lambda\geq L_n^{\smash{\Lambda_\xi}} -
2\sqrt{L_n^{\smash{\Lambda_\xi}}\ln 2\cdot K(\mu)}$, which shows that {\em
no} (causal) predictor $\Lambda$ whatsoever achieves significantly
less (expected) loss than $\Lambda_\xi$. In view of these results
it is fair to say that, ignoring computational issues, the problem
of sequence prediction has been solved in a universal way.

A different kind of optimality is {\em Pareto optimality}. The
universal prior $\xi$ is Pareto optimal in the sense that there is
no other predictor that leads to equal or smaller loss in {\em
all} environments. Any improvement achieved by some predictor
$\Lambda$ over $\Lambda_\xi$ in some environments is
balanced by a deterioration in other environments
\cite{Hutter:03optisp}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Universal Algorithmic Agent AIXI}\label{chAIxi}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Active systems, like game playing (SG) and optimization (FM),
cannot be reduced to induction systems. The {\it main idea of this
work} is to generalize universal induction to the general agent
model described in Section~\ref{chAImu}. For this, we generalize
$\xi$ to include actions as conditions and replace $\mu$ by $\xi$ in the
rational agent model, resulting in the AI$\xi$(=AIXI) model. In
this way the problem that the true prior probability $\mu$ is
usually unknown is solved. Convergence of $\xi\to\mu$ can be
shown, indicating that the AI$\xi$ model could behave optimally
in any computable but unknown environment with reinforcement
feedback.

The main focus of this section is to investigate what we can
expect from a universally optimal agent and to clarify the
meanings of {\em universal}, {\em optimal}, etc. Unfortunately
bounds similar to the loss bound (\ref{thULoss}) in the SP case
can hold for {\em no} active agent. This forces us to lower our
expectation about universally optimal agents and to introduce
other (weaker) performance measures.
%
Finally, we show that AI$\xi$ is Pareto optimal in the sense that
there is no other policy yielding higher or equal value in {\em
all} environments and a strictly higher value in at least one.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The Universal AI$\xi$ Model}\label{secAIxi}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\index{AI$\xi $ model}\index{universal!AI$\xi$ model}
\index{model!universal}\index{model!AI$\xi$}

%------------------------------%
\paragraph{Definition of the AI$\xi$ model}
%------------------------------%
We have developed enough formalism to suggest our universal
AI$\xi$ model. All we have to do is to suitably generalize the
universal semimeasure $\xi$ from the last section and
replace the true but unknown prior probability $\mu^\AI$ in the
AI$\mu$ model by this generalized $\xi^\AI$. In what sense
this AI$\xi$ model is universal will be discussed subsequently.
\index{universal!generalized prior}
\index{generalized universal prior}

In the functional formulation we define the universal probability
$\xi^\AI$ of an environment $q$ just as $2^{-\l(q)}$
\beqn
  \xi(q) \;:=\; 2^{-\l(q)}
\eeqn
The definition could not be easier\footnote{It is not necessary to
use $2^{-K(q)}$ or something similar as some readers may expect at
this point, because for every program $q$ there exists
a functionally equivalent program $\tilde q$ with
$K(q)\equa\l(\tilde q)$.}!\footnote{Here and later we identify
objects with their coding relative to some fixed Turing machine
$U$. For example, if $q$ is a function $K(q):=K(\langle q\rangle)$
with $\langle q\rangle$ being a binary coding of $q$ such that
$U(\langle q\rangle,y)=q(y)$. Reversely, if $q$ already is a
binary string we define $q(y) :=U(q,y)$.} Collecting the formulas
of Section~\ref{secAIfunc} and replacing $\mu(q)$ by $\xi(q)$ we
get the definition of the AI$\xi$ agent in functional form. Given
the history $\hh y\!\hh x_{<k}$ the policy $p^\xi$ of the
functional AI$\xi$ agent is given by
\beq\label{eefuncxi}
  \hh y_k \;:=\;
  \arg\max_{y_k}\max_{p:p(\hh x_{<k})=\hh y_{<k}y_k}
  \sum_{q:q(\hh y_{<k})=\hh x_{<k}}
  \nq 2^{-\l(q)}\cdot V_{km_k}^{pq}
\eeq
in cycle $k$, where $V_{km_k}^{pq}$ is the total reward of cycles $k$ to $m_k$ when
agent $p$ interacts with environment $q$. We have dropped the
denominator $\sum_q\mu(q)$ from (\ref{eefunc}) as it is
independent of the $p \in \hh P_k$ and a constant multiplicative
factor does not change $\arg\max_{y_k}$.

For the iterative formulation, the universal probability
$\xi$ can be obtained by inserting the functional $\xi(q)$ into
(\ref{mufr})
\beq\label{uniMAI}
  \xi(y\!\pb x_{1:k}) \;=\;
  \nq\sum_{q:q(y_{1:k})=x_{1:k}}\nq 2^{-\l(q)}
\eeq
Replacing $\mu$ by $\xi$ in (\ref{ydotrec}) the
iterative AI$\xi$ agent outputs
\beq\label{ydotxi}
  \hh y_k \;\equiv\; \hh y_k^\xi \;:=\;
  \arg\max_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}\;...\;
  \max_{y_{m_k}}\sum_{x_{m_k}}
  (r(x_k)\!+...+\!r(x_{m_k})) \!\cdot\!
  \xi(\hh y\!\hh x_{<k}y\!\pb x_{k:m_k})
\eeq
in cycle $k$ given the history $\hh y\!\hh x_{<k}$.

The equivalence of the functional and iterative AI model (Theorem
\ref{thEqAI}) is true for every chronological semimeasure $\rho$,
especially for $\xi$, hence we can talk about {\it the} AI$\xi$
model in this respect. It (slightly) depends on the choice of the
universal Turing machine. $\l(\langle q\rangle)$ is defined only up
to an additive constant. The AI$\xi$ model also depends on the
choice of $\X = \R \times \O$ and $\Y$, but we do not expect any
\idx{bias} when the spaces are chosen sufficiently simple, e.g.\
all strings of length $2^{16}$. Choosing $\SetN$ as the word
space would be ideal, but whether the maxima (suprema) exist in
this case, has to be shown beforehand. The only nontrivial
dependence is on the horizon function $m_k$ which will be
discussed later. So apart from $m_k$ and unimportant details the
AI$\xi$ agent is uniquely defined by (\ref{eefuncxi}) or
(\ref{ydotxi}). It does not depend on any assumption about the
environment apart from being generated by some computable (but
unknown!) probability distribution.

%------------------------------%
\paragraph{Convergence of $\xi$ to $\mu$}
%------------------------------%
Similarly to (\ref{eukdistxi}) one can show that the $\mu$-expected
squared difference of $\mu$ and $\xi$ is finite for computable
$\mu$. This, in turn, shows that $\xi(y\!x_{<k}y\!\pb x_k)$
converges rapidly to $\mu(y\!x_{<k}y\!\pb x_k)$ for $k \to \infty$ with
$\mu$-probability 1. The line of reasoning is the same; the $y$
are pure spectators. This will change when we analyze
loss/reward bounds analogous to (\ref{thULoss}).
%
More generally, one can show \cite{Hutter:04uaibook} that\footnote{Here,
and everywhere else, with $\xi_k\to\mu_k$ we mean $\xi_k-\mu_k\to
0$, and not that $\mu_k$ (and $\xi_k$) itself converge to a
limiting value.}
\beq\label{aixitomu}
  \xi(y\!x_{<k}y\!\pb x_{k:m_k}) \toinfty{k} \mu(y\!x_{<k}y\!\pb x_{k:m_k})
\eeq
This gives hope that the outputs $\hh y_k$ of the AI$\xi$ model
(\ref{ydotxi}) could converge to the outputs $\hh y_k$ from the
AI$\mu$ model (\ref{ydotrec}).

\indxs{solvable}{problem}\indxs{learnable}{task}
\indxs{universal}{optimality}
We want to call an AI model {\it universal}, if it is
$\mu$-independent (unbiased, model-free) and is able
to solve any solvable problem and learn any learnable task.
Further, we call a universal model, {\it universally optimal}, if
there is no program, which can solve or learn significantly faster
(in terms of interaction cycles). Indeed, the AI$\xi$ model is
parameter free, $\xi$ converges to $\mu$ (\ref{aixitomu}), the
AI$\mu$ model is itself optimal, and we expect no other model to
converge faster to AI$\mu$ by analogy to SP (\ref{thULoss}):
%\beqn
%  \mbox{\it we expect AI$\xi$ to be universally optimal.}
%\eeqn

%------------------------------%
\fclaim{clMain}{We expect AIXI to be universally optimal}{\vspace{-1ex}}
%------------------------------%
%
This is our main claim. In a sense, the intention of the remaining
sections is to define this statement more rigorously and to give
further support.

%------------------------------%
\paragraph{Intelligence order relation}\label{secIOR}
%------------------------------%
\indxs{intelligence}{order relation}%
\indxs{consistent}{policy}\indxs{inconsistent}{policy}%
We define the $\xi$-expected reward in cycles $k$ to $m$ of a
policy $p$ similar to (\ref{eefunc}) and (\ref{eefuncxi}).
We extend the definition to programs $p \not\in \hh P_k$ that
are not consistent with the current history.
\beq\label{cxi}
  V_{km}^{p\xi}(\hh y\!\hh x_{<k}) \;:=\;
  {1\over\cal N}
  \sum_{q:q(\hh y_{<k})=\hh x_{<k}}
  \nq 2^{-\l(q)}\cdot V_{km}^{\tilde p q}
\eeq
The normalization $\cal N$ is again only necessary for
interpreting $V_{km}$ as the expected reward but is otherwise
unneeded. For consistent policies $p\in\hh P_k$ we define $\tilde
p := p$. For $p \not\in \hh P_k$, $\tilde p$ is a modification of
$p$ in such a way that its outputs are consistent with the current
history $\hh y\!\hh x_{<k}$, hence $\tilde p \in \hh P_k$, but
unaltered for the current and future cycles $\geq k$. Using this
definition of $V_{km}$ we could take the maximium over all
policies $p$ in (\ref{eefuncxi}), rather than only the consistent
ones.

%------------------------------%
\fdefinition{defaiorder}{Intelligence order relation}{
%------------------------------%
We call a policy $p$ {\it more or equally intelligent} than $p'$ and write
\beqn
  p\succeq p' \quad:\Leftrightarrow\quad
  \forall k\forall\hh y\!\hh x_{<k}:
  V_{km_k}^{p\xi}(\hh y\!\hh x_{<k}) \geq
  V_{km_k}^{p'\xi}(\hh y\!\hh x_{<k}).
\eeqn
i.e.\ if $p$ yields in any circumstance higher $\xi$-expected
reward than $p'$.
}%------------------------------%

\indxs{most intelligent}{agent}
\indxs{universal}{order relation}
\indxs{intermediate}{intelligence}%
\noindent As the algorithm $p^\best$ behind the AI$\xi$
agent maximizes $V_{km_k}^{p\xi}$ we have $p^\xi\succeq p$ for all
$p$. The AI$\xi$ model is hence the most intelligent agent
w.r.t.\ $\succeq$. Relation $\succeq$ is a universal order relation in the
sense that it is free of any parameters (except $m_k$) or specific
assumptions about the environment. A proof, that $\succeq$ is a
reliable intelligence order (which we believe to be true), would
prove that AI$\xi$ is universally optimal. We could further ask:
How useful is $\succeq$ for ordering policies of practical
interest with intermediate intelligence, or how can $\succeq$ help
to guide toward constructing more intelligent systems with
reasonable computation time? An effective intelligence order
relation $\succeq^c$ will be defined in Section~\ref{secAIXItl},
which is more useful from a practical point of view.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{On the Optimality of AIXI}\label{secOOAIXI}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\indxs{optimality}{AI$\xi$ model}
\indxs{adaptive}{control}

In this section we outline ways toward an optimality proof of
AIXI. Sources of inspiration are the SP loss bounds proven in
Section~\ref{chSP} and optimality criteria from the adaptive control
literature (mainly) for linear systems \cite{Kumar:86}.
%
The value bounds for AIXI are expected to be, in a sense, weaker
than the SP loss bounds because the problem class covered by AIXI
is much larger than the class of induction problems. Convergence
of $\xi$ to $\mu$ has already been proven, but is not sufficient
to establish convergence of the behavior of the AIXI model to the
behavior of the AI$\mu$ model. We will focus on three approaches
toward a general optimality proof:

\indxs{universal}{optimality}
%------------------------------%
\paragraph{What is meant by universal optimality}
%------------------------------%
The first step is to investigate what we can expect from AIXI,
i.e.\ what is meant by {\it universal optimality}. A ``learner''
(like AIXI) may converge to the optimal informed decision-maker
(like AI$\mu$) in several senses. Possibly relevant concepts from
statistics are, {\em\idx{consistency}},
{\em\idx{self-tunability}}, {\em\idx{self-optimization}},
{\em\idx{efficiency}}, {\em\idx{unbiasedness}}, {\em asymptotic}
or\indxs{asymptotic}{convergence}\indxs{finite}{convergence} {\em
finite} convergence \cite{Kumar:86}, \idx{Pareto optimality}, and
some more defined in Section~\ref{secAIsep}. Some concepts are
stronger than necessary, others are weaker than desirable but
suitable to start with. Self-optimization is defined as the
asymptotic convergence of the average true value $\odm
V_{1m}^{p^\xi \mu}$ of AI$\xi$ to the optimal value $\odm
V_{1m}^{\best\mu}$. Apart from convergence speed,
self-optimization of AIXI would most closely correspond to the
loss bounds proven for SP. We investigate which properties are
desirable and under which circumstances the AIXI model satisfies
these properties. We will show that no universal model, including
AIXI, can in general be self-optimizing. On the other hand, we show
that AIXI is Pareto optimal in the sense that there is no other
policy that performs better or equal in all environments, and
strictly better in at least one.

\indxs{limited}{environmental class}
\indxs{restricted}{concept class}
%------------------------------%
\paragraph{Limited environmental classes}
%------------------------------%
The problem of defining and proving general value bounds becomes
more feasible by considering, in a first step, restricted concept
classes. We analyze AIXI for known classes (like Markovian or
factorizable environments) and especially for the new classes
(forgetful, relevant, asymptotically learnable, farsighted,
uniform, pseudo-passive, and passive) defined later in
Section~\ref{secAIsep}. In Section~\ref{chApply} we study the
behavior of AIXI in various standard problem classes, including
sequence prediction, strategic games, function minimization, and
supervised learning.

\indxs{general}{Bayes mixture}
\indxs{general Bayes mixture}{AI$\xi$ model}
%------------------------------%
\paragraph{Generalization of AIXI to general Bayes mixtures}
%------------------------------%
The other approach is to generalize AIXI to AI$\zeta$, where
$\zeta()=\sum_{\nu\in\M}w_\nu\nu()$ is a general Bayes mixture of
distributions $\nu$ in some class $\M$. If $\M$ is the multi-set
of enumerable semimeasures enumerated by a Turing machine, then
AI$\zeta$ coincides with AIXI. If $\M$ is the (multi)set of
passive effective environments, then AIXI reduces to the
$\Lambda_\xi$ predictor that has been shown to perform well. One
can show that these loss/value bounds generalize to wider classes,
at least asymptotically \cite{Hutter:02selfopt}. Promising
classes are, again, the ones described in Section~\ref{secAIsep}.
In particular, for ergodic {\sc mdp}s we showed that AI$\zeta$ is
self-optimizing\indxs{self-optimizing}{policy}. Obviously, the
least we must demand from $\M$ to have a chance of finding a
self-optimizing policy is that there exists some self-optimizing
policy at all. The key result in \cite{Hutter:02selfopt} is that
this necessary condition is also sufficient. More generally, the key is
not to prove absolute results for specific problem classes, but to
prove relative results of the form ``if there exists a policy with
certain desirable properties, then AI$\zeta$ also possesses these
desirable properties''. If there are tasks that cannot be solved
by any policy, AI$\zeta$ cannot be blamed for failing.
Environmental classes that allow for self-optimizing policies
include bandits, i.i.d.\ processes, classification tasks, certain
classes of {\sc pomdp}s, $k^{th}$-order ergodic {\sc mdp}s,
factorizable environments, repeated games, and prediction
problems. Note that in this approach we have for each
environmental class a corresponding model AI$\zeta$, whereas in
the approach pursued in this article the same universal AIXI model
is analyzed for all environmental classes.

\index{optimality!by construction}
\index{Bandit problem}
%------------------------------%
\paragraph{Optimality by construction}
%------------------------------%
A possible further approach toward an optimality ``proof'' is to
regard AIXI as {\em optimal by construction}.
This perspective is common in various (simpler) settings.
%
For instance, in bandit problems, where pulling arm $i$ leads to
reward $1$ ($0$) with unknown probability $p_i$ ($1-p_i$), the
traditional Bayesian solution to the uncertainty about $p_i$ is to
assume a uniform (or Beta) prior over $p_i$ and to maximize
the (subjectively) expected reward sum over multiple trials. The
exact solution (in terms of Gittins indices) is widely regarded as
``optimal'', although justified alternative approaches exist.
%
Similarly, but simpler, assuming a uniform subjective prior over
the Bernoulli parameter $p_{(i)}\in[0,1]$, one arrives at the
reasonable, but more controversial, Laplace rule for predicting
i.i.d.\ sequences.
%
AIXI is similar in the sense that the unknown $\mu\in\M$ is the
analogue of the unknown $p\in[0,1]$, and the prior beliefs
$w_\nu=2^{-K(\nu)}$ justified by Occam's razor are the analogue of
a uniform distribution over [0,1].
%
In the same sense as Gittins' solution to the bandit problem and
Laplace' rule for Bernoulli sequences, AIXI may also be
regarded as optimal by construction.
%
Theorems relating AIXI to AI$\mu$ would not be regarded as
optimality proofs of AIXI, but just as how much harder it becomes
to operate when $\mu$ is unknown, i.e.\
%
the achievements of the first three approaches are simply
reinterpreted.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Value Bounds and Separability Concepts}\label{secAIsep}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\indxs{value}{bound}\indxs{separability}{concepts}

%------------------------------%
\paragraph{Introduction}
%------------------------------%
The values $V_{km}$ associated with the AI systems correspond
roughly to the negative loss $-L_n^\Lambda$ of the SP
systems. In SP, we were interested in small bounds for the loss
excess $L_n^{\smash{\Lambda_\xi}} - L_n^\Lambda$. Unfortunately, simple
value bounds for AI$\xi$ in terms of $V_{km}$ analogous to the
loss bound (\ref{thULoss}) do not hold. We even have difficulties in
specifying what we can expect to hold for AI$\xi$ or any AI system
that claims to be universally optimal. Consequently, we cannot
have a proof if we don't know what to prove. In SP, the only
important property of $\mu$ for proving loss bounds was its
complexity $K(\mu)$. We will see that in the AI case, there are no
useful bounds in terms of $K(\mu)$ only. We either have to study
restricted problem classes or consider bounds depending on other
properties of $\mu$, rather than on its complexity only. In the
following, we will exhibit the difficulties by two examples and
introduce concepts that may be useful for proving value bounds.
Despite the difficulties in even claiming useful value bounds, we
nevertheless, firmly believe that the order relation
(Definition~\ref{defaiorder}) correctly formalizes the intuitive meaning of
intelligence and, hence, that the AI$\xi$ agent is universally optimal.

%------------------------------%
\paragraph{(Pseudo) Passive $\mu$ and the HeavenHell example}
%------------------------------%
\indxs{pseudo-passive}{environment}\index{HeavenHell example}%
In the following we choose $m_k = m$. We want to compare the
true, i.e.\ $\mu$-expected value $V^\mu_{1m}$ of a $\mu$-independent
universal policy $p^{best}$ with any other policy $p$.
Naively, we might expect the existence of a policy $p^{best}$
that maximizes $V^\mu_{1m}$, apart from additive corrections of
lower order for $m\to\infty$
\beq\label{cximu}
  V_{1m}^{p^{best}\mu} \;\geq\; V_{1m}^{p\mu} - o(...)
  \quad \forall\mu,p
\eeq
Such policies are sometimes called self-optimizing
\cite{Kumar:86}. Note that $V_{1m}^{p^\mu\mu} \geq
V_{1m}^{p\mu}\,\forall p$, but $p^\mu$ is not a candidate for (a
universal) $p^{best}$ as it depends on $\mu$. On the other hand,
the policy $p^\xi $ of the AI$\xi$ agent maximizes $V^\xi_{1m}$ by
definition ($p^\xi \succeq p$). As $V^\xi_{1m}$ is thought to be a
guess of $V^\mu_{1m}$, we might expect $p^{best} = p^\xi $ to
approximately maximize $V^\mu_{1m}$, i.e.\ (\ref{cximu}) to hold.
Let us consider the problem class (set of environments)
$\M=\{\mu_0,\mu_1\}$ with $\Y = \R =\{0,1\}$ and $r_k
=\delta_{iy_1}$ in environment $\mu_i$, where the Kronecker symbol
$\delta_{xy}$ is defined as 1 for $x=y$ and 0 otherwise. The first
action $y_1$ decides whether you go to heaven with all future
rewards $r_k$ being $1$ (good) or to hell with all future rewards
being $0$ (bad). Note that $\mu_i$ are
(deterministic, non-ergodic) {\sc mdp}s:

\begin{center}
\footnotesize\unitlength=1.5ex
\begin{picture}(34,7)(0,0)
\thicklines
\put(1,2){\makebox(0,0)[rc]{\normalsize$\mu_i\quad=$}}
\put(7,2){\oval(6,4)}\put(7,2){\makebox(0,0)[cc]{\small Hell}}
\put(7,6){\oval(2,2)[t]}\put(6,6){\vector(0,-1){2}}\put(8,6){\line(0,-1){2}}
\put(8,6){\makebox(0,0)[lb]{ $y=*$}}
\put(8,6){\makebox(0,0)[lt]{ $r=0$}}
\put(16,2){\vector(-1,0){6}}
\put(13,2.3){\makebox(0,0)[cb]{$y=1-i$}}
\put(13,1.7){\makebox(0,0)[ct]{$r=0$}}
%
\put(19,2){\oval(6,4)}\put(19,2){\makebox(0,0)[cc]{\small Start}}
\put(22,2){\vector(1,0){6}}
\put(25,2.3){\makebox(0,0)[cb]{$y=i$}}
\put(25,1.7){\makebox(0,0)[ct]{$r=1$}}
%
\put(31,2){\oval(6,4)}\put(31,2){\makebox(0,0)[cc]{\small Heaven}}
\put(31,6){\oval(2,2)[t]}\put(32,6){\vector(0,-1){2}}\put(30,6){\line(0,-1){2}}
\put(30,6){\makebox(0,0)[rb]{ $y=*$}}
\put(30,6){\makebox(0,0)[rt]{ $r=1$}}
\end{picture}
\end{center}
\noindent It is clear that if $\mu_i$, i.e.\ $i$ is known,
the optimal policy $p^{\mu_i}$ is to output $y_1 = i$ in the first
cycle with $V_{1m}^{p^{\mu_i}\mu} = m$. On the other hand, any
unbiased policy $p^{best}$ independent of the actual $\mu$ either
outputs $y_1 = 1$ or $y_1 = 0$. Independent of the actual choice
$y_1$, there is always an environment ($\mu = \mu_{1-y_1}$) for
which this choice is catastrophic ($V_{1m}^{p^{best}\mu} = 0$). No
single agent can perform well in both environments $\mu_0$ {\it
and} $\mu_1$. The r.h.s.\ of (\ref{cximu}) equals $m - o(m)$ for
$p = p^\mu$. For all $p^{best}$ there is a $\mu$ for which the
l.h.s.\ is zero. We have shown that no $p^{best}$ can satisfy
(\ref{cximu}) for all $\mu$ and $p$, so we cannot expect $p^\xi $
to do so. Nevertheless, there are problem classes for which
(\ref{cximu}) holds, for instance SP. For SP, (\ref{cximu}) is
just a reformulation of (\ref{thULoss}) with an appropriate choice
for $p^{best}$, namely $\Lambda_\xi$ (which differs from $p^\xi $,
see next section). We expect (\ref{cximu}) to hold for all
inductive problems in which the environment is not
influenced\footnote{Of course, the reward feedback $r_k$ depends
on the agent's output. What we have in mind is, like in sequence
prediction, that the true sequence is not influenced by the
agent.} by the output of the agent.
\indxs{passive}{environment}\indxs{inductive}{environment}%
We want\indxs{pseudo-passive}{environment} to call these $\mu$,
{\it passive} or {\it inductive} environments. Further, we want to
call $\M$ and $\mu\in\M$ satisfying (\ref{cximu}) with $p^{best} =
p^\xi $ {\it pseudo-passive}. So we expect inductive $\mu$ to be
pseudo-passive.

%------------------------------%
\paragraph{The OnlyOne example}\index{OnlyOne example}
%------------------------------%
Let us give a further example to demonstrate the difficulties in
establishing value bounds. Let $\X=\R =\{0,1\}$ and $|\Y|$ be large.
We consider all (deterministic) environments in which a single
complex output $y^*$ is correct ($r = 1$) and all others are wrong
($r = 0$). The problem class $\M$ is defined by
\beqn
%  \M:=\{\mu:\mu(y\!x_{<k}y_k\pb 1)=
%       \delta_{y_ky^*}\,\forall k,\; y^*\!\in\!\Y,\; K(y^*)\!=\!_\lfloor\log_2|\Y|_\rfloor\}
  \M:=\{\mu_{y^*}: y^*\!\in\!\Y,\; K(y^*)=_\lfloor\!\!\log|\Y|_\rfloor\},
  \qmbox{where} \mu_{y^*}(y\!x_{<k}y_k\pb 1):=\delta_{y_ky^*}\,\forall k.
\eeqn
There are $N\eqm|\Y|$ such $y^*$. The only way a $\mu$-independent
policy $p$ can find the correct $y^*$, is by trying one $y$ after
the other in a certain order. In the first $N - 1$ cycles, at most
$N - 1$ different $y$ are tested. As there are $N$ different
possible $y^*$, there is always a $\mu\in\M$ for which $p$ gives
erroneous outputs in the first $N - 1$ cycles. The number of
errors is $E_\infty^p \geq N - 1 \eqm|\Y|\eqm 2^{K(y^*)}\eqm
2^{K(\mu)}$ for this $\mu$. As this is true for any $p$, it is
also true for the AI$\xi$ model, hence $E_k^{p^\xi} \leq
2^{K(\mu)}$ is the best possible error bound we can expect that
depends on $K(\mu)$ only. Actually, we will derive such a bound in
Section~\ref{secSP} for inductive environments. Unfortunately, as
we are mainly interested in the cycle region $k\ll|\Y|\eqm
2^{K(\mu)}$ (see Section~\ref{secCL}) this bound is vacuous. There
are no interesting bounds for deterministic $\mu$ depending on
$K(\mu)$ only, unlike the SP case. Bounds must either depend on
additional properties of $\mu$ or we have to consider specialized
bounds for restricted problem classes. The case of probabilistic
$\mu$ is similar. Whereas for SP there are useful bounds in terms
of $L_k^{\smash{\Lambda_\mu}}$ and $K(\mu)$, there are no such
bounds for AI$\xi$. Again, this is not a drawback of AI$\xi$ since
for no unbiased AI system could the errors/rewards be bound in
terms of $K(\mu)$ and the errors/rewards of AI$\mu$ only.

\index{posterization}\index{bound!boost}\index{boosting!bound}%
There is a way to make use of gross (e.g.\ $2^{K(\mu)}$) bounds.
Assume that after a reasonable number of cycles $k$, the
information $\hh x_{<k}$ perceived by the AI$\xi$ agent contains a
lot of information about the true environment $\mu$. The
information in $\hh x_{<k}$ might be coded in any form. Let us
assume that the complexity $K(\mu|\hh x_{<k})$ of $\mu$ under the
condition that $\hh x_{<k}$ is known, is of order 1. Consider a
theorem, bounding the sum of rewards or of other quantities over
cycles $1...\infty$ in terms of $f(K(\mu))$ for a function $f$
with $f(O(1)) = O(1)$, like $f(n) = 2^n$. Then, there will be a
bound for cycles $k...\infty$ in terms of $\approx f(K(\mu|\hh
x_{<k})) = O(1)$. Hence, a bound like $2^{K(\mu)}$ can be replaced
by small bound $\approx 2^{K(\mu|\hh x_{<k})} = O(1)$ after $k$
cycles. All one has to show/ensure/assume is that enough
information about $\mu$ is presented (in any form) in the first
$k$ cycles. In this way, even a gross bound could become useful.
In Section~\ref{secEX} we use a similar argument to prove that
AI$\xi$ is able to learn supervised.

%------------------------------%
\paragraph{Asymptotic learnability}
%------------------------------%
\index{asymptotic!learnability}\index{learnable!asymptotically}%
In the following, we weaken (\ref{cximu}) in the hope of getting a
bound applicable to wider problem classes than the passive one.
Consider the I/O sequence $\hh y_1\hh x_1...\hh y_n\hh x_n$ caused
by AI$\xi$. On history $\hh y\!\hh x_{<k}$, AI$\xi$ will output
$\hh y_k \equiv\hh y^\xi_k$ in cycle $k$. Let us compare this to
$\hh y^\mu_k$ what AI$\mu$ would output, still on the same history
$\hh y\!\hh x_{<k}$ produced by AI$\xi$. As AI$\mu$ maximizes the
$\mu$-expected value, AI$\xi$ causes lower (or at best equal)
$V_{km_k}^\mu$ if $\hh y^\xi_k$ differs from $\hh y^\mu_k$. Let
$D_{n\mu\xi} := \E[\sum_{k=1}^n 1 - \delta_{\hh y^\mu_k,\hh
y^\xi_k}]$ be the $\mu$-expected number of suboptimal
choices of AI$\xi$, i.e.\ outputs different from AI$\mu$ in the
first $n$ cycles. One might weigh the deviating cases by their
severity. In particular, when the $\mu$-expected rewards
$V_{km_k}^{p\mu}$ for $\hh y^\xi_k$ and $\hh y^\mu_k$ are equal or
close to each other, this should be taken into account in a
definition of $D_{n\mu\xi}$, e.g.\ by a weight factor
$[V_{km}^{\best\mu}(y\!x_{<k}) - V_{km}^{p^\xi\mu}(y\!x_{<k})]$.
These details do not matter in the following qualitative
discussion.\indxs{suboptimal}{decision}\index{decision!wrong} The
important difference to (\ref{cximu}) is that here we stick to the
history produced by AI$\xi$ and count a wrong decision as, at
most, one error. The wrong decision in the HeavenHell example in
the first cycle no longer counts as losing $m$ rewards, but counts
as one wrong decision. In a sense, this is fairer. One shouldn't
blame somebody too much who makes a single wrong decision for
which he just has too little information available, in order to
make a correct decision. The AI$\xi$ model would deserve to be
called asymptotically optimal if the probability of making a
wrong decision tends to zero, i.e.\ if
\beq\label{Doon}
  D_{n\mu\xi}/n\to 0 \qmbox{for} n\to\infty, \quad\mbox{i.e.}\quad
  D_{n\mu\xi} \;=\; o(n).
\eeq
We say that $\mu$ can be {\it asymptotically learned} (by AI$\xi$)
if (\ref{Doon}) is satisfied. We claim that AI$\xi$ (for $m_k \to
\infty$) can asymptotically learn every problem $\mu$ of
relevance, i.e.\ AI$\xi$ is asymptotically
optimal.\indxs{relevant}{problem} We included the qualifier {\it
of relevance}, as we are not sure whether there could be strange
$\mu$ spoiling (\ref{Doon}) but we expect those $\mu$ to be
irrelevant from the perspective of AI. In the field of Learning,
there are many asymptotic learnability theorems, often not too
difficult to prove. So a proof of (\ref{Doon}) might also be
feasible. Unfortunately, asymptotic learnability theorems are
often too weak to be useful from a practical point of view.
Nevertheless, they point in the right direction.

%------------------------------%
\paragraph{Uniform $\mu$}\indxs{uniform}{environment}
%------------------------------%
From the convergence (\ref{aixitomu}) of $\xi\to\mu$ we might
expect $V_{km_k}^{p\xi} \to V_{km_k}^{p\mu}$ for all $p$, %only true on-policy
and hence we might also expect $\hh y^\xi_k$
defined in (\ref{ydotxi}) to converge to $\hh y^\mu_k$ defined in
(\ref{ydotrec}) for $k \to \infty$. The
first problem is that if the $V_{km_k}$ for the different choices
of $y_k$ are nearly equal, then even if $V_{km_k}^{p\xi} \approx
V_{km_k}^{p\mu}$, $\hh y^\xi_k \neq \hh y^\mu_k$ is possible due to
the non-continuity of $\arg\max_{y_k}$. This can be cured by a
weighted $D_{n\mu\xi}$ as described above. More serious is the
second problem we explain for $h_k = 1$ and $\X = \R = \{0,1\}$.
For $\hh y^\xi_k \equiv \arg\max_{y_k}\xi(\hh y\!\hh r_{<k}y_k\pb
1)$ to converge to $\hh y^\mu_k \equiv \arg\max_{y_k}\mu(\hh
y\!\hh r_{<k}y_k\pb 1)$, it is not sufficient to know that
$\xi(\hh y\!\hh r_{<k}\hh y\!\hh{\pb r}_k) \to \mu(\hh y\!\hh
r_{<k}\hh y\!\hh{\pb r}_k)$ as proven in (\ref{aixitomu}). We need
convergence not only for the true output $\hh y_k$,
%and arbitrary reward $\hh r'_k$,  and reward 1
but also for alternative outputs $y_k$. $\hh
y^\xi_k$ converges to $\hh y^\mu_k$ if $\xi$ converges uniformly
to $\mu$, i.e.\ if in addition to (\ref{aixitomu})
\indxs{uniform}{convergence}
\beq\label{uniform}
  \big|\mu(y\!x_{<k}y'_k\pb x'_k)-\xi(y\!x_{<k}y'_k\pb x'_k)\big|
  \;<\; c\!\cdot\!
  \big|\mu(y\!x_{<k}y\!\pb x_k)-\xi(y\!x_{<k}y\!\pb x_k)\big|
  \quad\forall y'_kx'_k
\eeq
holds for some constant $c$ (at least in a $\mu$-expected sense).
We call $\mu$ satisfying (\ref{uniform}) {\it uniform}. For
uniform $\mu$ one can show (\ref{Doon}) with appropriately
weighted $D_{n\mu\xi}$ and bounded horizon $h_k < h_{max}$.
Unfortunately there are relevant $\mu$ that are not uniform.

%------------------------------%
\paragraph{Other concepts}
%------------------------------%
\indxs{Markov}{environment}\index{Markov!$k$-th order}%
\indxs{factorizable}{environment}%
\indxs{stationary}{environment}%
\indxs{forgetful}{environment}%
\indxs{farsighted}{environment}%
In the following, we briefly mention some further concepts. A {\it
Markovian} $\mu$ is defined as depending only on the last cycle,
i.e.\ $\mu(y\!x_{<k}y\!\pb x_k) = \mu_k(x_{k-1}y\!\pb x_k)$. We
say $\mu$ is {\it generalized ($l^{th}$-order) Markovian}, if
$\mu(y\!x_{<k}y\!\pb x_k) = \mu_k(x_{k-l}y\!x_{k-l+1:k-1}y\!\pb
x_k)$ for fixed $l$. This property has some similarities to {\it
factorizable} $\mu$ defined in (\ref{facmu}). If further $\mu_k
\equiv \mu_1\forall k$, $\mu$ is called {\it stationary}. Further,
we call $\mu$ ($\xi$) {\it forgetful} if $\mu(y\!x_{<k}y\!\pb
x_k)$ ($\xi(y\!x_{<k}y\!\pb x_k)$) become(s) independent of $y\!x_{<l}$
for fixed $l$ and $k\to\infty$ with $\mu$-probability 1. Further,
we say $\mu$ is {\it farsighted} if $\lim_{m_k\to\infty}\hh
y_k^{(m_k)}$ exists. More details will be given in
Section~\ref{secHorizon}, where we also give an example of a farsighted
$\mu$ for which nevertheless the limit $m_k \to \infty$ makes no
sense.

%------------------------------%
\paragraph{Summary}
%------------------------------%
We have introduced several concepts that might be useful for
proving value bounds, including forgetful, relevant, asymptotically
learnable, farsighted, uniform, (generalized) Markovian, factorizable
and (pseudo)passive $\mu$. We have sorted them here, approximately in
the order of decreasing generality. We will call them {\it
separability concepts}. The more general (like relevant,
asymptotically learnable and farsighted) $\mu$ will be called
weakly separable, the more restrictive (like (pseudo) passive and
factorizable) $\mu$ will be called strongly separable, but we will
use these qualifiers in a more qualitative, rather than rigid
sense. Other (non-separability) concepts are deterministic $\mu$
and, of course, the class of all chronological $\mu$.


\indxs{Pareto optimality}{AI$\xi$ model}
%-------------------------------------------------------------%
\subsection{Pareto Optimality of AI$\xi$}
%-------------------------------------------------------------%
This subsection shows Pareto-opimtality of AI$\xi$ analogous to
SP. The total $\mu$-expected reward $V_\mu^{p^\xi}$ of
policy $p^\xi$ of the AI$\xi$ model is of central interest in
judging the performance of AI$\xi$. We know that there {\em are}
policies (e.g.\ $p^\mu$ of AI$\mu$) with higher $\mu$-value
($V_\mu^\best\geq V_\mu^{p^\xi}$). In general, every
policy based on an estimate $\rho$ of $\mu$ that is closer to
$\mu$ than $\xi$ is, outperforms $p^\xi$ in environment $\mu$,
simply because it is more tailored toward $\mu$. On the other
hand, such a system probably performs worse than $p^\xi$ in other
environments.
%
Since we do not know $\mu$ in advance we may ask whether there
exists a policy $p$ with better or equal performance than $p^\xi$
in {\em all} environments $\nu\in\M$ and a strictly better
performance for one $\nu \in\M$. This would clearly render $p^\xi$
suboptimal. One can show that there is no such $p$ \cite{Hutter:02selfopt}

%------------------------------%
\fdefinition{defaiPareto}{Pareto Optimality}{
%------------------------------%
A policy $\tilde p$ is called Pareto optimal if there is no other
policy $p$ with $V_{\nu}^p\geq V_{\nu}^{\tilde
p}$ for all $\nu \in \M$ and strict inequality for at
least one $\nu$.
}%------------------------------%

%------------------------------%
\ftheorem{thaiPareto}{Pareto Optimality}{
%------------------------------%
AI$\xi$ alias $p^\xi$ is Pareto optimal.
}%------------------------------%

\noindent Pareto optimality should be regarded as a necessary
condition for an agent aiming to be optimal. From a practical
point of view, a significant increase of $V$ for many environments
$\nu$ may be desirable, even if this causes a small decrease of $V$
for a few other $\nu$. The impossibility of such a ``balanced''
improvement is a more demanding condition on $p^\xi$ than pure
Pareto optimality. In \cite{Hutter:02selfopt} it has been shown
that AI$\xi$ is also balanced Pareto optimal.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The Choice of the Horizon}\label{secHorizon}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\indxs{horizon}{problem}%
The only significant arbitrariness in the
AI$\xi$ model lies in the choice of the horizon function $h_k
\equiv m_k - k + 1$. We discuss some choices that seem to be
natural and give preliminary conclusions at the end. We will not
discuss ad hoc choices of $h_k$ for specific problems (like the
discussion in Section~\ref{secSG} in the
context of finite strategic games). We are interested in universal
choices of $m_k$.

\indxs{fixed}{horizon}
%------------------------------%
\paragraph{Fixed horizon}
%------------------------------%
If the lifetime of the agent is known to be $m$, which is in
practice always large but finite, then the choice $m_k = m$
maximizes correctly the expected future reward. Lifetime $m$ is usually not
known in advance, as in many cases the time we are willing to run
an agent depends on the quality of its outputs. For this reason,
it is often desirable that good outputs are not delayed too much,
if this results in a marginal reward increase only. This can be
incorporated by damping the future rewards. If, for instance, the
probability of survival in a cycle is $\gamma<1$, an exponential
damping (geometric discount) $r_k := r'_k \cdot \gamma^k$ is
appropriate, where $r'_k$ are bounded, e.g.\ $r'_k \in [0,1]$.
Expression (\ref{ydotxi}) converges for $m_k \to \infty$ in this
case.$\!$\footnote{More precisely, $\displaystyle \hh
y_k=\arg\max_{y_k}\lim_{m_k\to\infty}V_{km_k}^{\best\xi}(\hh y \hh
x_{<k}y_k)$ exists.} But this does not solve the problem, as we
introduced a new arbitrary time scale $(1-\gamma)^{-1}$. Every
damping introduces a time scale. Taking $\gamma\to 1$ is prone
to the same problems as $m_k\to\infty$ in the undiscounted case
discussed below.

\indxs{dynamic}{horizon}
\indxs{universal}{discounting}
\indxs{harmonic}{discounting}
%------------------------------%
\paragraph{Dynamic horizon (universal \& harmonic discounting)}
%------------------------------%
The largest horizon with guaranteed finite and enumerable reward
sum can be obtained by the universal discount $r_k \leadsto r_k
\cdot 2^{-K(k)}$. This discount results in truly farsighted agent
with effective horizon that grows faster than any computable
function.
%
It is similar to a near-harmonic discount $r_k \leadsto
r_k \cdot k^{-(1+\eps)}$, since $2^{-K(k)} \leq 1/k$ for most $k$
and $2^{-K(k)} \geq c/(k\,\log^2 k)$. More
generally, the time-scale invariant damping factor $r_k = r'_k
\cdot k^{-\alpha}$ introduces a dynamic time scale. In cycle $k$
the contribution of cycle $2^{1/\alpha} \cdot k$ is damped by a
factor $\odt$. The effective horizon $h_k$ in this case is $\sim
k$. The choice $h_k = \beta \cdot k$ with $\beta \sim
2^{1/\alpha}$ qualitatively models the same behavior. We have not
introduced an arbitrary time scale $m$, but limited the
farsightedness to some multiple (or fraction) of the length of the
current history. This avoids the preselection of a global
time scale $m$ or ${1\over 1-\gamma}$. This choice has some appeal, as
it seems that humans of age $k$ years usually do not plan their
lives for more than, perhaps, the next $k$ years ($\beta_{human}
\approx 1$). From a practical point of view this model might serve
all needs, but from a theoretical point we feel uncomfortable with
such a limitation in the horizon from the very beginning. Note
that we have to choose $\beta = O(1)$ because otherwise we would
again introduce a number $\beta$, which has to be justified.
%
We favor the universal discount $\gamma_k=2^{-K(k)}$, since
it allows us, if desired, to ``mimic'' all other more greedy
behaviors based on other discounts $\gamma_k$ by choosing $r_k\in
[0,c\cdot\gamma_k]\subseteq[0,2^{-K(k)}]$.

\indxs{infinite}{horizon}
%------------------------------%
\paragraph{Infinite horizon}
%------------------------------%
The naive limit $m_k \to \infty$ in (\ref{ydotxi}) may turn out
to be well defined and the previous discussion superfluous. In the
following, we suggest a limit that is always well defined (for
finite $\Y$). Let $\hh y_k^{(m_k)}$ be defined as in (\ref{ydotxi})
with dependence on $m_k$ made explicit. Further, let $\hh
\Y_k^{(m)} := \{\,\hh y_k^{(m_k)} : m_k \geq m\}$ be the set
of outputs in cycle $k$ for the choices $m_k = m,m+1,m+2,...$.
Because $\hh \Y_k^{(m)} \supseteq \hh \Y_k^{(m+1)} \neq \{\}$,
we have $\hh \Y_k^{(\infty)} := \bigcap_{m=k}^\infty\hh
\Y_k^{(m)} \neq \{\}$. We define the $m_k = \infty$ model to
output any $\hh y_k^{(\infty)} \in \hh \Y_k^{(\infty)}$. This is
the best output consistent with some arbitrary large choice of
$m_k$. Choosing the lexicographically smallest $\hh
y_k^{(\infty)} \in \hh \Y_k^{(\infty)}$ would correspond to the
lower limit $\underline\lim_{m\to\infty}\hh y_k^{(m)}$, which
always exists (for finite $\Y$). Generally $\hh
y_k^{(\infty)} \in \hh \Y_k^{(\infty)}$ is unique, i.e.\ $|\hh
\Y_k^{(\infty)}| = 1$ iff the naive limit $\lim_{m\to\infty}\hh
y_k^{(m)}$ exists. Note that the limit
$\lim_{m\to\infty}V_{km}^\best(y\!x_{<k})$ need not exist for
this construction.

\indxs{average}{reward}\indxs{differential}{gain}
%------------------------------%
\paragraph{Average reward and differential gain}
%------------------------------%
Taking the raw average reward $(r_k +...+ r_m)/(m - k + 1)$ and
$m\to\infty$ also does not help: consider an arbitrary policy for the
first $k$ cycles and the/an optimal policy for the remaining
cycles $k+1...\infty$. In e.g.\ i.i.d.\ environments the limit
exists, but all these policies give the same average value, since
changing a finite number of terms does not affect an infinite
average. In {\sc mdp} environments with a single recurrent class
one can define the relative or differential gain
\cite{Bertsekas:96}. In more general environments (we are
interested in) the differential gain can be infinite, which is
acceptable, since differential gains can still be totally ordered.
The major problem is the {\it existence} of the differential gain,
i.e.\ whether it converges for $m\to\infty$ in
$\SetR\cup\{\infty\}$ at all (and does not oscillate). This is
just the old convergence problem in slightly different form.

\indxs{immortal}{agents}\indxs{lazy}{agents}
%------------------------------%
\paragraph{Immortal agents are lazy}
%------------------------------%
The construction in the next to previous paragraph leads to a mathematically elegant,
no-parameter AI$\xi$ model. Unfortunately this is not the end of
the story. The limit $m_k \to \infty$ can cause undesirable
results in the AI$\mu$ model for special $\mu$, which might also
happen in the AI$\xi$ model whatever we define $m_k \to \infty$.
Consider an agent who for every $\sqrt{l}$ consecutive days of
work, can thereafter take $l$ days of holiday. Formally, consider
$\Y = \X = \R = \{0,1\}$. Output $y_k = 0$ shall give reward $r_k
= 0$ and output $y_k = 1$ shall give $r_k = 1$ iff $\hh
y_{k-l-\sqrt l}...\hh y_{k-l} = 0...0$ for some $l$, i.e.\ the
agent can achieve $l$ consecutive positive rewards if there was a
preceding sequence of length at least $\sqrt l$ with $y_k = r_k =
0$. If the lifetime of the AI$\mu$ agent is $m$, it outputs $\hh
y_k = 0$ in the first $s$ cycles and then $\hh y_k = 1$ for the
remaining $s^2$ cycles with $s$ such that $s+s^2=m$. This will
lead to the highest possible total reward $V_{1m} = s^2 =
m+\odt-\sqrt{m+^1\!\!/\!_4}$. Any fragmentation of the $0$ and $1$
sequences would reduce $V_{1m}$, e.g.\ alternatingly working for 2
days and taking 4 days off would give $V_{1m}={2\over 3}m$. For $m
\to \infty$ the AI$\mu$ agent can and will delay the point $s$ of
switching to $\hh y_k = 1$ indefinitely and always output $0$
leading to total reward $0$, obviously the worst possible
behavior. The AI$\xi$ agent will explore the above rule after a
while of trying $y_k = 0/1$ and then applies the same behavior as
the AI$\mu$ agent, since the simplest rules covering past data
dominate $\xi$. For finite $m$ this is exactly what we want, but
for infinite $m$ the AI$\xi$ model (probably) fails, just as the
AI$\mu$ model does. The good point is that this is not a weakness
of the AI$\xi$ model in particular, as AI$\mu$ fails too. The bad
point is that $m_k\to\infty$ has far-reaching consequences, even
when starting from an already very large $m_k=m$. This is because
the $\mu$ of this example is highly nonlocal in time, i.e.\
it may violate one of our weak separability conditions.

%------------------------------%
\paragraph{Conclusions}
%------------------------------%
We are not sure whether the choice of $m_k$ is of marginal
importance, as long as $m_k$ is chosen sufficiently large and of
low complexity, $m_k=2^{2^{16}}$ for instance, or whether the
choice of $m_k$ will turn out to be a central topic for the
AI$\xi$ model or for the planning aspect of any AI system in
general. We suppose that the limit $m_k\to\infty$ for the
AI$\xi$ model results in correct behavior for weakly separable
$\mu$. A proof of this
conjecture, if true, would probably give interesting insights.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Outlook}\label{secAIxiOut}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\indxs{prediction}{expert advice}
%------------------------------%
\paragraph{Expert advice approach}
%------------------------------%
We considered expected performance bounds for predictions based on
Solomonoff's prior. The other, dual, currently very popular
approach, is ``prediction with expert advice'' (PEA) invented by
Littlestone and Warmuth (1989)\nocite{Littlestone:89}, and Vovk
(1992)\nocite{Vovk:92}. Whereas PEA performs well in any
environment, but only relative to a given set of experts , our
$\Lambda_\xi$ predictor competes with {\em any} other predictor,
but only in expectation for environments with computable
distribution. It seems philosophically less compromising to make
assumptions on prediction strategies than on the environment,
however weak. One could investigate whether PEA can be generalized
to the case of active agents, which would result in a model dual
to AIXI. We believe the answer to be negative, which on the
positive side would show the necessity of Occam's razor
assumption, and the distinguishedness of AIXI.

\indxs{random}{action}
%------------------------------%
\paragraph{Actions as random variables}
%------------------------------%
The uniqueness for the choice of the generalized $\xi$
(\ref{xidef}) in the AIXI model could be explored. From the
originally many alternatives, which could all be ruled out, there
is one alternative which still seems possible. Instead of defining
$\xi$ as in (\ref{uniMAI}) one could treat the agent's
actions $y$ also as universally distributed random variables and
then conditionalize $\xi$ on $y$ by \mrcp.

\indxs{structure}{AI$\xi$ model}
\indxs{axiomatic approach}{AI$\xi$ model}
%------------------------------%
\paragraph{Structure of AIXI}
%------------------------------%
The algebraic properties and the structure of AIXI could be
investigated in more depth. This would extract the essentials
from AIXI which finally could lead to an axiomatic
characterization of AIXI. The benefit is as in any axiomatic
approach. It would clearly exhibit the assumptions, separate the
essentials from technicalities, simplify understanding and,
most important, guide in finding proofs.

\index{policy!restricted class}
%------------------------------%
\paragraph{Restricted policy classes}
%------------------------------%
The development in this section could be scaled down to restricted
classes of policies $\cal P$. One may define
$V^*=\arg\max_{p\in\cal P}V^p$. For instance, consider a finite
class of quickly computable policies. For {\sc mdp}s, $\xi$ is
quickly computable and $V_\xi^p$ can be (efficiently) computed by
\idx{Monte Carlo} sampling. Maximizing over the finitely many
policies $p\in\cal P$ selects the asymptotically best policy
$p^\xi$ from $\cal P$ for all (ergodic) {\sc mdp}s
\cite{Hutter:02selfopt}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Conclusions}\label{secAIxiCon}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
All tasks that require intelligence to be solved can naturally be
formulated as a maximization of some expected utility in the
framework of agents. We gave an explicit expression
(\ref{ydotrec}) of such a decision-theoretic agent. The main
remaining problem is the unknown prior probability distribution
$\mu^\AI$ of the environment(s). Conventional learning algorithms
are unsuitable, because they can neither handle large
(unstructured) state spaces nor do they converge in the
theoretically minimal number of cycles nor can they handle
non-stationary environments appropriately. On the other hand, the
universal semimeasure $\xi$ (\ref{xidef}), based on ideas from
algorithmic information theory, solves the problem of the unknown
prior distribution for induction problems. No explicit learning
procedure is necessary, as $\xi$ automatically converges to $\mu$.
We unified the theory of universal sequence prediction with the
decision-theoretic agent by replacing the unknown true prior
$\mu^\AI$ by an appropriately generalized universal semimeasure
$\xi^\AI$. We gave strong arguments that the resulting AI$\xi$
model is universally optimal. Furthermore, possible solutions to
the horizon problem were discussed. In Section~\ref{chApply}
we present a number of problem classes, and outline how the
AI$\xi$ model can solve them. They include sequence prediction,
strategic games, function minimization and, especially, how
AI$\xi$ learns to learn supervised. In Section~\ref{chTime} we
develop a modified time-bounded (computable) AIXI$tl$ version.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Important Problem Classes}\label{chApply}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In order to give further support for the universality and
optimality of the AI$\xi$ theory, we apply AI$\xi$ in this section
to a number of problem classes. They include sequence
prediction, strategic games, function minimization and,
especially, how AI$\xi$ learns to learn supervised. For some
classes we give concrete examples to illuminate the scope of the
problem class. We first formulate each problem class in its
natural way (when $\mu^{\mbox{\tiny problem}}$ is known) and then
construct a formulation within the AI$\mu$ model and prove its
equivalence. We then consider the consequences of replacing $\mu$
by $\xi$. The main goal is to understand why and how the problems
are solved by AI$\xi$. We only highlight special aspects of each
problem class. Sections~\ref{secSP}--\ref{secOther} together
should give a better picture of the AI$\xi$ model. We do not study
every aspect for every problem class. The subsections may be read
selectively, and are not essential to understand the remainder.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Sequence Prediction (SP)}\label{secSP}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We introduced the AI$\xi$ model as a unification of ideas
of sequential decision theory and universal probability
distribution. We might expect AI$\xi$ to behave identically to
SP$\xi$, when faced with a sequence prediction problem, but
things are not that simple, as we will see.

%------------------------------%
\paragraph{Using the AI$\mu$ model for sequence prediction}
%------------------------------%
% 9910(15) 9911(7)
We saw in Section~\ref{chSP} how to predict sequences for
known and unknown prior distribution $\mu^\SP$. Here we consider
binary sequences\footnote{We use $z_k$ to avoid notational
conflicts with the agent's inputs $x_k$.} $z_1z_2z_3...\in
\B^\infty$ with known prior probability
$\mu^\SP(\pb{z_1z_2z_3...})$.

We want to show how the AI$\mu$ model can be used for sequence
prediction. We will see that it makes the same prediction as the
SP$\mu$ agent. For simplicity we only discuss the special
error loss $\ell_{xy}=1-\delta_{xy}$, where $\delta$ is the
Kronecker symbol, defined as $\delta_{ab} = 1$ for $a = b$ and $0$
otherwise. First, we have to specify {\it how} the AI$\mu$ model
should be used for sequence prediction. The following choice is
natural:

The system's output $y_k$ is interpreted as a prediction for the
$k^{th}$ bit $z_k$ of the string under consideration. This
means that $y_k$ is binary ($y_k \in \B =: \Y$). As a
reaction of the environment, the agent receives reward $r_k = 1$
if the prediction was correct ($y_k = z_k$), or $r_k = 0$ if
the prediction was erroneous ($y_k \neq z_k$). The question is
wh