%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% U n i v e r s a l L e a r n i n g T h e o r y %%
%% Marcus Hutter, Start: June 2009 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
% Document-Style %
%-------------------------------%
\documentclass[12pt,twoside]{article}
\usepackage{latexsym,amsmath,amssymb,hyperref}
\newdimen\paravsp \paravsp=1.3ex
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\sloppy\lineskip=0pt
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
\newenvironment{keywords}{\centerline{\bf\small
Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\def\paradot#1{\vspace{\paravsp plus 0.5\paravsp minus 0.5\paravsp}\noindent{\bf\boldmath{#1.}}}
\def\paranodot#1{\vspace{\paravsp plus 0.5\paravsp minus 0.5\paravsp}\noindent{\bf\boldmath{#1}}}
\def\req#1{(\ref{#1})}
\def\eps{\varepsilon}
\def\epstr{\epsilon} % for empty string
\def\nq{\hspace{-1em}}
\def\fr#1#2{{\textstyle{#1\over#2}}}
\def\frs#1#2{{^{#1}\!/\!_{#2}}}
\def\SetR{I\!\!R}
\def\SetN{I\!\!N}
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\def\E{{\bf E}} % Expectation
\def\P{{\rm P}} % Probability
\def\lb{{\log_2}}
\def\g{\gamma}
\def\t{\theta}
\def\l{{\ell}} % length of string or program
\def\M{{\cal M}}
\def\X{{\cal X}}
\def\Y{{\cal Y}} % generic set
\def\o{\omega}
\def\Km{K\!m}
\def\Loss{\mbox{Loss}}
\def\u#1{{\it #1}}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% T i t l e - P a g e %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\vskip 2mm\bf\Large\hrule height5pt \vskip 4mm
Universal Learning Theory
\vskip 4mm \hrule height2pt}
\author{{\bf Marcus Hutter}\\[3mm]
\normalsize RSISE$\,$@$\,$ANU and SML$\,$@$\,$NICTA \\
\normalsize Canberra, ACT, 0200, Australia \\
\normalsize \texttt{marcus@hutter1.net \ \ www.hutter1.net}
}
\date{February 2011}
\maketitle
\begin{abstract}
This encyclopedic article gives a mini-introduction into the
theory of universal learning, founded by Ray Solomonoff in the
1960s and significantly developed and extended in the last
decade. It explains the spirit of universal learning, but
necessarily glosses over technical subtleties.
\vspace{2ex}\def\contentsname{\centering\normalsize Contents\\}
{\parskip=-2.7ex\tableofcontents}
\end{abstract}
\begin{keywords}
Algorithmic probability;
Ray Solomonoff;
induction;
prediction;
decision;
action;
Turing machine;
Kolmogorov complexity;
universal prior;
Bayes' rule.
\end{keywords}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Definition, Motivation and Background}\label{secIntro}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Universal (machine) learning is concerned with the development and
study of algorithms that are able to learn from data in a very large
range of environments with as few assumptions as possible. The
class of environments typically considered includes all computable
stochastic processes. The investigated learning tasks range from
\u{inductive inference}, sequence prediction, sequential decisions, to
(re)active problems like \u{reinforcement learning}
\cite{Hutter:04uaibook}, but also include \u{clustering}, \u{regression},
and others \cite{Li:08}.
%
Despite various no-free-lunch theorems \cite{Wolpert:97}, universal
learning is {\em possible} by assuming that the data possess {\em
some} effective structure, but without specifying any further, {\em
which} structure.
%
Learning algorithms that are universal (at least to some degree) are
also {\em necessary} for developing autonomous general
intelligent systems, required e.g.\ for exploring other planets, as
opposed to decision {\em support} systems which keep a human in the
loop.
%
There is also an {\em intrinsic} interest in striving for
generality: Finding new learning algorithms for every particular
(new) problem is possible but cumbersome and prone to disagreement
or contradiction. A sound formal general and ideally complete theory
of learning can unify existing approaches, guide the development
of practical learning algorithms, and last but not least lead to
novel and deep insights.
This encyclopedic article gives a mini-introduction into the theory
of universal learning, founded by Ray Solomonoff in the 1960s
\cite{Solomonoff:64,Solomonoff:78} and significantly developed and
extended by the author and his colleagues in the last decade.
%
It is based on \cite{Hutter:04uaibook}. It explains the spirit of
universal learning, but necessarily glosses over many technical
subtleties. Precise formulation of the results with proofs and/or
references to original publications can be found in
\cite{Hutter:04uaibook}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Deterministic Environments}\label{secDE}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Let $t,n\in\SetN$ be natural numbers, $\X^*$ be the set of finite
strings and $\X^\infty$ be the set of infinite sequences over some
alphabet $\X$ of size $|\X|$. For a string $x\in\X^*$
of length $\l(x)=n$ we write $x_1x_2...x_n$ with $x_t\in\X$, and
further abbreviate $x_{t:n}:=x_t x_{t+1}...x_{n-1}x_n$ and
$x_{0$ with $\sum_i w_i\leq 1$. Let
$W:=\sum_{i:H_i\in\M_t}w_i$ be the total weight of the hypotheses in
$\M_t$. Let $\M_t^a:=\{H_i\in\M_t: x_t^{H_i}=a\}$ be the consistent
hypotheses predicting $x_t=a$, and $W_a$ their weight, and take the
weighted majority prediction $x_t=\arg\max_a W_a$. Similarly as
above, a prediction error decreases $W$ by a factor of $1-1/|\X|$,
since $\max_a W_a\geq W/|\X|$. Since $w_m\leq W\leq 1$, this
algorithm can at most make $\log_{1-1/|\X|} w_m=O(\log w_m^{-1})$
prediction errors. If we choose for instance $w_i=(i+1)^{-2}$, the
number of errors is $O(\log m)$, which is an exponential improvement
over the Gold-style learning by enumeration above.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Algorithmic Probability}\label{secAP}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Algorithmic probability has been founded by Solomonoff
\cite{Solomonoff:64}.
%
The so-called universal probability or a-priori probability is the
key quantity for universal learning. Its philosophical and technical
roots
are \u{Ockham's razor} (choose the simplest model consistent with the data), %
Epicurus' principle of multiple explanations (keep all explanations consistent with the data), %
(Universal) Turing machines (to compute, quantify and assign codes to all quantities of interest), and %
Kolmogorov complexity (to define what simplicity/complexity means).
%
This section considers deterministic computable sequences, and the
next section the general setup of computable probability
distributions.
%-------------------------------%
\paradot{(Universal) monotone Turing machines}
%-------------------------------%
Since we consider infinite computable sequences, we need devices
that convert input data streams to output data streams. For this we
define the following variants of a classical deterministic Turing
Machine:
%
A monotone Turing machine $T$ is defined as a Turing machine
with one unidirectional input tape, one unidirectional output tape,
and some bidirectional work tapes. The input tape is binary (no
blank) and read only, the output tape is over finite alphabet $\X$
(no blank) and write only, unidirectional tapes are those where the
head can only move from left to right, work tapes are initially
filled with zeros and the output tape with some fixed element from
$\X$.
%
We {\em say} that {\em monotone Turing machine} $T$ outputs/computes
a string starting with $x$ on input $p$, and write $T(p)=x*$ if $p$
is to the left of the input head when the last bit of $x$ is output
($T$ reads all of $p$ but no more). $T$ may continue operation and
need not halt. For a given $x$, the set of such $p$ forms a prefix
code. Such codes are called {\em minimal} programs. Similarly we
write $T(p)=\omega$ if $p$ outputs the infinite sequence $\omega$.
%
% Prefix codes and Kraft inequality
A {\em prefix code} $\cal P$ is a set of binary strings such that no
element is proper prefix of another. It satisfies {\em Kraft's inequality}
$\sum_{p\in\cal P} 2^{-\l(p)}\leq 1$.
% Universal monotone Turing machine
The table of rules of a Turing machine $T$ can be prefix encoded in
a canonical way as a binary string, denoted by $\langle T\rangle$.
Hence, the set of Turing machines $\{T_1,T_2,...\}$ can be
effectively enumerated. There are so-called universal Turing
machines that can ``simulate'' all other Turing machines. We define
a particular one which simulates monotone Turing machine $T(q)$ if
fed with input $\langle T\rangle q$, i.e.\ $U(\langle T \rangle
q)=T(q)$ $\forall T,q$. Note that for $p$ not of the form $\langle
T\rangle q$, $U(p)$ does not output anything.
%
We call this particular $U$ the {\em reference universal Turing
machine}.
%-------------------------------%
\paradot{Universal weighted majority learning}
%-------------------------------%
$T_1(\epstr), T_2(\epstr), ...$ constitutes an effective
enumeration of all finite and infinite computable sequences, hence
also monotone $U(p)$ for $p\in\{0,1\}^*$. As argued below, the class of
computable infinite sequences, is conceptually very interesting.
The halting problem implies that there is no recursive enumeration of all
partial-recursive functions with infinite domain; hence
we cannot remove the finite sequences algorithmically.
It is very fortunate that we don't have to.
Hypothesis $H_p$ is identified with the sequence $U(p)$,
which may be finite, infinite, or possibly even empty.
The class of considered hypotheses is $\M:=\{H_p:p\in\{0,1\}^*\}$.
The weighted majority algorithm also needs weights $w_p$ for each
$H_p$. Ockham's razor combined with Epicurus' principle demand
to assign a high (low) prior weight to a simple (complex) hypothesis.
If complexity is identified with program length, then $w_p$ should be
a decreasing function of $\l(p)$. It turns out that $w_p=2^{-\l(p)}$
is the ``right'' choice, since minimal $p$ form a prefix code and
therefore $\sum_p w_p\leq 1$ as required.
Using $H_p$ for prediction can now fail in two ways. $H_p$ may make
a wrong prediction or no prediction at all for $x_t$. The true
hypothesis $H_m$ is still assumed to produce an infinite sequence.
%
The weighted majority algorithm in this setting makes at most
$O(\log w_p^{-1})=O(\l(p))$ errors. It is also plausible
that learning $\l(p)$ bits requires $O(\l(p))$ ``trials''.
%-------------------------------%
\paradot{Universal mixture prediction}
%-------------------------------%
Solomonoff \cite{Solomonoff:78} defined the following universal a-priori probability
\beq\label{Mdef}
M(x) \;:=\; \sum_{p:U(p)=x*} 2^{-\l(p)}
\eeq
That is, $M(x)=W$ is the total weight of the computable
deterministic hypotheses consistent with $x$ for the universal
weight choice $w_p=2^{-\l(p)}$. The universal weighted majority
algorithm predicted $\arg\max_a M(\dot x_{0$ at more
than $c/\eps\delta$ times, is bounded by $\delta$.
This might loosely be called the number of errors.
%
% convergence
Hence Solomonoff's bounds implies
\beqn\label{eqconv2}
M(x_t|\o_{0$
is bounded by $O(K(\mu))$, i.e.\ is proportional to the complexity
of the environment, which is again reasonable. A counting argument
shows that $O(K(\mu))$ errors for most $\mu$ are unavoidable.
No other choice for $w_\nu$ would lead to significantly better
bounds. Again, in general it is not possible to determine {\em when}
these ``errors'' occur.
%
% Multi-step prediction
Multi-step lookahead convergence
$M(x_{t:n_t}|\o_{n$, hence
we can write $\rho(x_{1:n}|y_{1:n})$.
%
In classification and regression, $\rho$ is typically
(conditionally) i.i.d., i.e.\
$\rho(x_{1:n}|y_{1:n})=\prod_{t=1}^n\rho(x_t|y_t)$, which is
chronological, but note that the Bayes mixture $\xi$ is {\em not} i.i.d.
One can show that the class of lower semi-computable chronological
semimeasures $\M_U^|=\{\nu_1(\cdot|\cdot),\nu_2(\cdot|\cdot),...\}$
is effectively enumerable.
The generalized universal a-priori semimeasure also has two equivalent definitions:
\beq\label{defMg}
M(x_{1:n}|y_{1:n}) \;:=\; \sum_{p:U(p,y_{1:n})=x_{1:n}\nq\nq} 2^{-\l(p)}
\;=\; \sum_{\nu\in\M} 2^{-K(\nu)}\nu(x_{1:n}|y_{1:n})
\eeq
which is again in $\M_U^|$. In case of $|\Y|=1$, this reduces to
\req{Mdef} and \req{xiUdef}. The bounds \req{SolBnd} and \req{lbnd}
and others continue to hold, now for all individual $y$'s,
i.e.\ $M$ predicts asymptotically $x_t$ from $y_t$ and
$(y_{0$ for computable $\t$ (and $\nu_\t$), and 0 for
uncomputable ones. This effectively reduces $\M$ to a discrete class
$\{\nu_\t\in\M:w_\t^U>0\}\subseteq\M_U$ which is typically dense in
$\M$.
%
There are various fundamental philosophical and statistical problems
and paradoxes around (Bayesian) induction, which nicely disappear in
the universal framework. For instance, universal induction has no
zero and no improper p(oste)rior problem, i.e.\ can confirm
universally quantified hypotheses, is reparametrization and
representation invariant, and avoids the old-evidence and updating
problem, in contrast to most classical continuous prior densities.
It even performs well in incomputable environments, actually better
than latter \cite{Hutter:07uspx}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Discussion and Future Directions}\label{secDisc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Prior Knowledge and subjective priors
Universal learning is designed to work for a wide range of problems
without any a-priori knowledge.
%
In practice we often have extra information about the problem at
hand, which could and should be used to guide the forecasting.
%
One can incorporate it by explicating all our prior
knowledge $z$, and place it on an extra input tape of our universal
Turing machine $U$, or prefix our observation sequence $x$ by $z$ and
use $M(zx)$ for prediction.
% On constants and $U$ dependence
Another concern is the dependence of $K$ and $M$ on $U$. The good
news is that a change of $U$ changes $K(x)$ only within an additive
and $M(x)$ within a multiplicative constant independent of $x$. This
makes the theory practically immune to any ``reasonable'' choice of
$U$ for large data sets $x$, but predictions for short sequences
(shorter than typical compiler lengths) can be arbitrary. One
solution is to take into account our (whole) scientific prior
knowledge $z$ \cite{Hutter:06hprize}, and predicting the now long
string $zx$ leads to good (less sensitive to ``reasonable'' $U$)
predictions. This is a kind of grand transfer learning scheme. It is
unclear whether a more elegant theoretical solution is possible.
% Incomputability
Finally, the incomputability of $K$ and $M$ prevents a {\em direct}
implementation of Solomonoff induction. Most fundamental theories
have to be approximated for practical use, sometimes systematically
like polynomial time approximation algorithms or numerical
integration, and sometimes heuristically like in many AI-search
problems or in non-convex optimization problems. Universal machine
learning is similar, except that its core quantities are only
semi-computable. This makes them often hard, but as described in the
previous section, not impossible, to approximate.
% Conclusion
In any case, universal induction can serve as a ``gold standard''
which practitioners can aim at.
%
Solomonoff's theory considers the class of all computable
(stochastic) models, and a universal prior inspired by Ockham and
Epicurus, quantified by Kolmogorov complexity. This lead to a
universal theory of induction, prediction, decisions, and, by
including Bellman, to universal actions in reactive environments.
%
Future progress on the issues above (incorporating prior knowledge,
getting rid of the compiler constants, and finding better
approximations) will lead to new insights and will continually
increase the number of applications.
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Recommended Reading %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\def\refname{Recommended Reading}
{\addcontentsline{toc}{section}{\refname}
\begin{small}
\begin{thebibliography}{ABCD} %\parskip=0ex
\bibitem[CV05]{Cilibrasi:05}
R.~Cilibrasi and P.~M.~B. Vit\'anyi.
\newblock Clustering by compression.
\newblock {\em IEEE Trans. Information Theory}, 51(4):1523--1545, 2005.
\bibitem[Gag07]{Gaglio:07}
M.~Gaglio.
\newblock Universal search.
\newblock {\em Scholarpedia}, 2(11):2575, 2007.
\bibitem[Gr{\"u}07]{Gruenwald:07book}
P.~D. Gr{\"u}nwald.
\newblock {\em The Minimum Description Length Principle}.
\newblock The MIT Press, Cambridge, 2007.
\bibitem[Hut05]{Hutter:04uaibook}
M.~Hutter.
\newblock {\em Universal Artificial Intelligence: Sequential Decisions based on
Algorithmic Probability}.
\newblock Springer, Berlin, 2005.
\bibitem[Hut06]{Hutter:06hprize}
M.~Hutter.
\newblock Human knowledge compression prize, 2006.
\newblock open ended, http://prize.hutter1.net/.
\bibitem[Hut07]{Hutter:07uspx}
M.~Hutter.
\newblock On universal prediction and {B}ayesian confirmation.
\newblock {\em Theoretical Computer Science}, 384(1):33--48, 2007.
\bibitem[LV08]{Li:08}
M.~Li and P.~M.~B. Vit\'anyi.
\newblock {\em An Introduction to {K}olmogorov Complexity and its
Applications}.
\newblock Springer, Berlin, 3rd edition, 2008.
\bibitem[Sch02]{Schmidhuber:02gtm}
J.~Schmidhuber.
\newblock Hierarchies of generalized {Kolmogorov} complexities and
nonenumerable universal measures computable in the limit.
\newblock {\em International Journal of Foundations of Computer Science},
13(4):587--612, 2002.
\bibitem[Sol64]{Solomonoff:64}
R.~J. Solomonoff.
\newblock A formal theory of inductive inference: Parts 1 and 2.
\newblock {\em Information and Control}, 7:1--22 and 224--254, 1964.
\bibitem[Sol78]{Solomonoff:78}
R.~J. Solomonoff.
\newblock Complexity-based induction systems: Comparisons and convergence
theorems.
\newblock {\em IEEE Transactions on Information Theory}, IT-24:422--432, 1978.
\bibitem[VNHS10]{Hutter:10aixictw}
J.~Veness, K.~S. Ng, M.~Hutter, and D.~Silver.
\newblock Reinforcement learning via {AIXI} approximation.
\newblock In {\em Proc. 24th AAAI Conference on Artificial Intelligence},
Atlanta, 2010. AAAI Press.
\bibitem[WM97]{Wolpert:97}
D.~H. Wolpert and W.~G. Macready.
\newblock No free lunch theorems for optimization.
\newblock {\em IEEE Transactions on Evolutionary Computation}, 1(1):67--82,
1997.
\bibitem[ZL70]{Zvonkin:70}
A.~K. Zvonkin and L.~A. Levin.
\newblock The complexity of finite objects and the development of the concepts
of information and randomness by means of the theory of algorithms.
\newblock {\em Russian Mathematical Surveys}, 25(6):83--124, 1970.
\end{thebibliography}
\end{small}
\end{document}
%---------------------End-of-UniLearn.tex---------------------%