%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Sequential Predictions based on Algorithmic Complexity %%
%% Marcus Hutter: Start: November 2002 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newif\ifjcss\jcssfalse % techreport versus jcss version
%-------------------------------%
% Document-Style %
%-------------------------------%
\ifjcss
\documentclass{elsart}
\sloppy\lineskip=0pt
\else
\documentclass[12pt,twoside]{article}
\usepackage{latexsym}
\pagestyle{myheadings}
\markboth{\sc Marcus Hutter, IDSIA-16-04
}{\sc Predictions based on Algorithmic Complexity}
\setcounter{tocdepth}{1}\setcounter{secnumdepth}{2}
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\sloppy\lineskip=0pt
\fi
%-------------------------------%
% My Math-Environments %
%-------------------------------%
\ifjcss %normal math spacing
\def\beq{\begin{equation}} \def\eeq{\end{equation}}
\def\beqn{\begin{displaymath}}\def\eeqn{\end{displaymath}}
\def\bqa{\begin{eqnarray}} \def\eqa{\end{eqnarray}}
\def\bqan{\begin{eqnarray*}} \def\eqan{\end{eqnarray*}}
\else %condensed math spacing
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
\fi
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\ifjcss
\def\cal{\mathcal}
\else
\newenvironment{keyword}{\centerline{\bf\small
Keyword}\vspace{-1ex}\begin{quote}\small}{\par\end{quote}\vskip
1ex}
\fi
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{figurex}{Figure}
\def\ftheorem#1#2#3{\begin{theorem}[\boldmath #2]\label{#1} #3 \end{theorem} }
\def\fcorollary#1#2#3{\begin{corollary}[\boldmath #2]\label{#1} #3 \end{corollary} }
\def\fdefinition#1#2#3{\begin{definition}[\boldmath #2]\label{#1} #3 \end{definition} }
\long\def\ffigurex#1#2#3#4{{#4}\begin{figurex}[\boldmath #2]\label{#1}#3\end{figurex}}
\ifjcss
\def\paradot#1{{\itshape{#1.}}}
\def\paranodot#1{{\itshape{#1}}}
\else
\def\myparskip{\vspace{1.5ex plus 0.5ex minus 0.5ex}\noindent}
\def\paradot#1{\myparskip{\bfseries\boldmath{#1.}}}
\def\paranodot#1{\myparskip{\bfseries\boldmath{#1}}}
\fi
\def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}
\def\nq{\hspace{-1em}}
\def\qed{\hspace*{\fill}$\Box\quad$}
\def\odt{{\textstyle{1\over 2}}}
\def\odf{{\textstyle{1\over 4}}}
\def\eps{\varepsilon} % for small positive number
\def\epstr{\epsilon} % for empty string
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\def\equa{\stackrel+=} %\eqa unfortunately already used up
\def\leqa{\stackrel+\leq} % don't put around another braces,
\def\geqa{\stackrel+\geq} % since this messes up spacings
\def\eqm{\stackrel\times=} % for some reason
\def\leqm{\stackrel\times\leq}
\def\geqm{\stackrel\times\geq}
\def\odn{{\textstyle{1\over n}}}
\def\v#1{{\bf #1}}
\def\l{{\ell}} % length of string or program
\def\M{{\cal M}} % Set of prob. distributions
\def\X{{\cal X}} % input/perception set/alphabet
\def\Y{{\cal Y}} % output/action set/alphabet
\def\I{{\cal I}} % some set
\def\S{{\cal S}} % some set
\def\Q{{\cal Q}}
\def\A{{\cal A}}
\def\E{{\bf E}} % Expectation value
\def\P{{\bf P}} % Expectation value
%\def\B{\SetB} % Binary set (or \SetB)
\def\B{\{0,1\}} % Binary set (or \SetB)
\def\MM{M} % Solomonoff's prior
\def\KM{K\!M}
\def\Km{K\!m}
\def\SetN{{I\!\!N}} \def\SetQ{{I\!\!\!Q}} \def\SetR{{I\!\!R}} \def\SetZ{{Z\!\!\!Z}}
\def\lb{\log}
\def\mrcp{the chain rule}
\def\SL{Solomonoff}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\ifjcss
\begin{frontmatter}
\title{Sequential Predictions based \\ on Algorithmic Complexity}
\author{Marcus Hutter}
\address{IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland \\
marcus@idsia.ch \hspace{9ex} http://www.idsia.ch/$^{_{_\sim}}\!$marcus}
\thanks{Part of this work appeared in the proceedings of
the 2003 COLT conference \cite{Hutter:03unimdl}}
\else
\title{\vskip -10mm\normalsize\sc Technical Report \hfill IDSIA-16-04
\vskip 2mm\bf\LARGE\hrule height5pt \vskip 5mm
\sc Sequential Predictions based \\ on Algorithmic Complexity
\vskip 6mm \hrule height2pt \vskip 4mm}
\author{{\bf Marcus Hutter}\\[3mm]
\normalsize IDSIA, Galleria 2, CH-6928\ Manno-Lugano, Switzerland\thanks{Part of this work appeared in the proceedings of
the 2003 COLT conference \cite{Hutter:03unimdl}.}\\
\normalsize marcus@idsia.ch \hspace{8.5ex} http://www.idsia.ch/$^{_{_\sim}}\!$marcus}
\date{\normalsize Submitted: Oct.\ 2003 \hspace{11ex} Published: Oct.\ 2005}
\maketitle
\fi
\begin{abstract}
\noindent This paper studies sequence prediction based on the
monotone Kolmogorov complexity $\Km=-\lb\,m$, i.e.\ based on
universal deterministic/one-part MDL. $m$ is extremely close to
\SL's universal prior $M$, the latter being an excellent predictor
in deterministic as well as probabilistic environments, where
performance is measured in terms of convergence of posteriors or
losses. Despite this closeness to $M$, it is difficult to assess
the prediction quality of $m$, since little is known about the
closeness of their posteriors, which are the important quantities
for prediction. We show that for deterministic computable
environments, the ``posterior'' and losses of $m$ converge, but
rapid convergence could only be shown on-sequence; the
off-sequence convergence can be slow. In probabilistic
environments, neither the posterior nor the losses converge, in
general.
\end{abstract}
\ifjcss\else\def\sep{; }\fi
\begin{keyword}
Sequence prediction\sep
Algorithmic Information Theory\sep
\SL's prior\sep
Monotone Kolmogorov Complexity\sep
Minimal Description Length\sep
Convergence\sep
Self-Optimization.%
\end{keyword}
\ifjcss\end{frontmatter}\else\hfill\fi
\ifjcss\else\pagebreak
%------------------------------%
% Table of Contents %
%------------------------------%
\setcounter{tocdepth}{1}
\begin{quote}\begin{quote}
\def\contentsname{\normalsize \hfil Contents \hfil}
{\ifjcss\else\parskip=-2.5ex\fi\boldmath\tableofcontents}
\end{quote}\end{quote}
\fi
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secIntro}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
%\paradot{Complexity based sequence prediction}
%------------------------------%
In this work we study the performance of Occam's razor based
sequence predictors. Given a data sequence $x_1$, $x_2$, ...,
$x_{n-1}$ we want to predict (certain characteristics of) the next
data item $x_n$. Every $x_t$ is an element of some domain $\X$, for
instance weather data or stock-market data at time $t$, or the
$t^{th}$ digit of $\pi$. Occam's razor \cite{Li:97}, appropriately
interpreted, tells us to search for the simplest explanation
(model) of our data $x_1,...,x_{n-1}$ and to use this model for
predicting $x_n$. Simplicity, or more precisely, effective
complexity can be measured by the length of the shortest program
computing sequence $x:=x_1...x_{n-1}$. This length is called the
algorithmic information content of $x$, which we denote by $\tilde
K(x)$. $\tilde K$ stands for one of the many variants of
``Kolmogorov'' complexity (plain, prefix, monotone, ...) or for
$-\lb\,\tilde k(x)$ of universal distributions/measures $\tilde k(x)$.
Algorithmic information theory mainly considers binary sequences.
For finite alphabet $\X$ one could code each $x_t\in\X$ as a
binary string of length $\lceil\lb|\X|\rceil$, but this would
not simplify the analysis in this work. The reason is that
binary coding would {\em not} reduce the setting to bit-by-bit
prediction, but to predicting a block of bits before observing the
true block of bits. The only difference in the analysis of general
alphabet versus binary block-prediction is in the convention of
how the length of a string is defined.
The complexity best studied with respect to its predictive
properties is $\KM(x)=-\lb M(x)$, where $M(x)$ is Solomonoff's
\cite[Eq.(7)]{Solomonoff:64} universal prior. Solomonoff has shown
that the posterior $M(x_t|x_1...x_{t-1})$ rapidly converges to the
true data generating distribution \cite{Solomonoff:78}. In
\cite{Hutter:99errbnd,Hutter:02spupper} it has been shown that $M$
is also an excellent predictor from a decision-theoretic point of
view, where the goal is to minimize loss. In any case, for
prediction, the posterior $M(x_t|x_1...x_{t-1})$, rather than the
prior $M(x_1...x_t)$, is the more important quantity.
%------------------------------%
%\paradot{Other complexity measures}
%------------------------------%
Most complexities $\tilde K$ coincide within an additive
logarithmic term, which implies that their ``priors'' $\tilde
k=2^{-\tilde K}$ are close within polynomial accuracy. Some of them
are extremely close to each other.
Many papers deal with the proximity of various complexity measures
\cite[...]{Levin:73random,Gacs:83}. Closeness of two complexity
measures is regarded as indication that the quality of their
prediction is similarly good \cite[p.334]{Li:97}.
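For instance, if two complexity variants satisfy $\tilde
K_1(x)\leq\tilde K_2(x)+c\,\lb\,\l(x)+O(1)$, then exponentiation
gives
\beqn
  \tilde k_1(x) \;=\; 2^{-\tilde K_1(x)} \;\geqm\;
  2^{-\tilde K_2(x)}\cdot 2^{-c\,\lb\,\l(x)} \;=\;
  \tilde k_2(x)\cdot\l(x)^{-c},
\eeqn
i.e.\ the corresponding priors coincide within a factor polynomial
in the length of $x$, but in general not within a constant factor.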
On the other hand, besides $M$, little is really known about the
closeness of ``posteriors'', relevant for prediction.
%------------------------------%
\paradot{Aim and conclusion}
%------------------------------%
The main aim of this work is to study the predictive properties of
complexity measures other than $\KM$. The monotone complexity
$\Km$ is, in a sense, closest to \SL\ complexity $\KM$. While
$\KM$ is defined via a mixture of infinitely many programs, the
conceptually simpler $\Km$ approximates $\KM$ by the contribution
of the single shortest program. This is also closer to the spirit
of Occam's razor. $\Km$ is a universal deterministic/one-part
version of the popular Minimal Description Length (MDL) principle.
We mainly concentrate on $\Km$ because it has a direct
interpretation as a universal deterministic/one-part MDL
predictor, and it is closest to the excellent performing $\KM$, so
we expect predictions based on other $\tilde K$ not to be better.
%------------------------------%
%\paradot{Conclusion}
%------------------------------%
The main conclusion we will draw is that closeness of priors
neither necessarily implies closeness of posteriors, nor good
performance from a decision-theoretic perspective. It is far from
obvious whether $\Km$ is a good predictor in general, and indeed
we show that $\Km$ can fail (with probability strictly greater
than zero) in the presence of noise, as opposed to $\KM$. We do
not suggest that $\Km$ fails for sequences occurring in practice.
It is not implausible that (from a practical point of view) minor
extra (apart from complexity) assumptions on the environment or
loss function are sufficient to prove good performance of $\Km$.
Some complexity measures, like the prefix complexity $K$, fail
completely for prediction.
%------------------------------%
\paradot{Contents}
%------------------------------%
{\em Section~\ref{secSetup}} introduces notation and describes how
prediction performance is measured in terms of convergence of
posteriors or losses.
%
{\em Section~\ref{secMProp}} summarizes known predictive
properties of \SL's prior $M$.
%
{\em Section~\ref{secASP}} introduces the monotone complexity
$\Km$ and the prefix complexity $K$ and describes how they and
other complexity measures can be used for prediction.
%
In {\em Section~\ref{secGPF}} we enumerate and relate eight
important properties, which general predictive functions may
possess or not: proximity to $M$, universality, monotonicity, being
a semimeasure, \mrcp, enumerability, convergence, and
self-optimization. Some normalization issues needed later are also
discussed. Furthermore, convergence of non-semimeasures that are
close to $M$ is proven.
%
{\em Section~\ref{secmProp}} contains our main results. Monotone
complexity $\Km$ is analyzed quantitatively w.r.t.\ the eight
predictive properties. Qualitatively, for deterministic,
computable environments, the posterior converges and is
self-optimizing, but rapid convergence could only be shown
on-sequence; the (for prediction equally important) off-sequence
convergence can be slow. In probabilistic environments, $m$
neither converges, nor is it self-optimizing, in general.
%
{\em Section~\ref{secFurther}} presents some further results: Poor
predictive performance of the prefix complexity $K$ is shown and a
simpler MDL-inspired way of using $\Km$ for prediction is briefly
discussed.
%
{\em Section~\ref{secOpen}} contains an outlook and a list of open
questions, including the convergence speed of $m$, natural Turing
machines, non-self-optimization for general Turing machines and
losses, other complexity measures, two-part MDL, extra conditions
on environments, and other generalizations.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Notation and Setup}\label{secSetup}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
\paradot{Strings and natural numbers}
%------------------------------%
We write $\X^*$ for the set of finite strings over finite alphabet
$\X$, and $\X^\infty$ for the set of infinite sequences. We use
letters $i,t,n$ for natural numbers, $x,y,z$ for finite strings,
$\epstr$ for the empty string, $\l(x)$ for the length of string
$x$, and $\omega=x_{1:\infty}$ for infinite sequences. We write
$xy$ for the concatenation of string $x$ with $y$. For a string of
length $n$ we write $x_1x_2...x_n$ with $x_t\in\X$ and further
abbreviate $x_{1:n}:=x_1x_2...x_{n-1}x_n$ and $x_{<n}:=x_1...x_{n-1}$.
The big $O$-notation $f(x)=O(g(x))$ means that there are constants
$c,x_0>0$ such that $|f(x)|\leq c|g(x)|\,\forall x>x_0$. The
small $o$-notation $f(x)=o(g(x))$ abbreviates
$\lim_{x\to\infty}f(x)/g(x)=0$.
We write $f(x)\leqm g(x)$ for $f(x)=O(g(x))$ and
$f(x)\leqa g(x)$ for $f(x)\leq g(x)+O(1)$.
Corresponding equalities can be defined similarly. They hold if
the corresponding inequalities hold in both directions.
$\sum_{t=1}^\infty a_t^2<\infty$ implies $a_t\toinfty{t} 0$. We say
that $a_t$ converges fast or rapidly to zero if
$\sum_{t=1}^\infty a_t^2\leq c$, where $c$ is a constant of
reasonable size; $c=100$ is reasonable, maybe even $c=2^{30}$, but
$c=2^{500}$ is not.$\!$\footnote{Environments
of interest have reasonable complexity $K$,
but $2^K$ is not of reasonable size.}
The number of times for which $a_t$ deviates
from 0 by more than $\eps$ is finite and bounded by $c/\eps^2$;
no statement is possible for {\em which} $t$ these deviations occur.
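The bound on the number of deviations follows from a Markov-type
argument: if $|a_t|>\eps$ for all $t$ in some index set $T$, then
\beqn
  |T|\cdot\eps^2 \;\leq\; \sum_{t\in T}a_t^2 \;\leq\;
  \sum_{t=1}^\infty a_t^2 \;\leq\; c
  \qmbox{implies} |T| \;\leq\; c/\eps^2 .
\eeqn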
The cardinality of a set $\cal S$ is denoted by $|{\cal S}|$ or
$\#\cal S$.
For properties $A(t)\in\{true,false\}$ we say
\begin{center}
\begin{tabular}{c||c|c|c|c}
$A(t)$ is valid for ... $t$ & almost all & most & many & finitely many \\\hline
{\em iff} $\;\;\#\{t\leq n:A(t)\}$ & $\equa\; n$ & $=\; n-o(n)$ & $\eqm\; n$ & $\leq\; c\quad (\exists c)$
\end{tabular}
\end{center}
%------------------------------%
\paradot{(Semi)measures}
%------------------------------%
We call $\rho:\X^*\to[0,1]$ a (semi)measure {\em iff}
$\sum_{x_n\in\X}\rho(x_{1:n}) \stackrel{(\leq)}= \rho(x_{<n})$ and
$\rho(\epstr)\stackrel{(\leq)}=1$.
Among the properties a predictive function $b$ may possess are:
\begin{list}{}{\parsep=1ex\itemsep=0ex\leftmargin=5ex\labelwidth=5ex}
\item[$i)$] {\em Dominance:} $b\geqm\M$ {\em iff}
$\forall\nu\in\M\;\exists c>0 \,:\, b(x)\geq c\cdot\nu(x)\,\forall x$.
\item[$ii)$] {\em Monotonicity:} $b(x_{1:t})\leq
b(x_{<t})\,\forall t,x_{1:t}$.
\end{list}
A predictive function $b$ may be scaled to
$b_{norm}(x_t|x_{<t}):=c_t\cdot b(x_t|x_{<t})$, where $c_t>0$ is
independent of $x_t$. Such a scaling does not affect the
prediction scheme $\Lambda_b$ (\ref{xlrdef}), i.e.\
$y_t^{\smash{\Lambda_b}}=y_t^{\smash{\Lambda_{b_{norm}}}}$, which implies
$l_t^{\smash{\Lambda_{b_{norm}}}}=l_t^{\smash{\Lambda_b}}$. Convergence
$b(x'_t|x_{<t})\to\mu(x'_t|x_{<t})$ for a measure $\mu$ carries over
to $b_{norm}(x'_t|x_{<t})\to\mu(x'_t|x_{<t})$, but not conversely.

This yields $l_t^{\smash{\Lambda_m}}>{1\over
3}=l_t^{\smash{\Lambda_\mu}}$. Extending $U$ to a universal Turing
machine by $U(0^{s+1}p)=U'(p)$ leaves this result intact with
probability $\geq 1-2^{-s}$, since random strings cannot be
compressed (by $U'$). \qed
\setcounter{subsection}{-1}
%-------------------------------------------------------------%
\subsection{Proximity of $m=2^{-\Km}$}
%-------------------------------------------------------------%
The following closeness/separation results between $\Km$ and $\KM$
are known:
%------------------------------%
\ftheorem{thmo}{o) (Proximity of $m=2^{-\Km}$)}{\hfill
%------------------------------%
\begin{list}{}{\parsep=1ex\itemsep=0ex\leftmargin=5ex\labelwidth=5ex}
\item[$\scriptstyle(1)$]
$\forall\mu\in\M_{comp}^{msr}\,\forall\mu$-random $\omega\,\exists
c_\omega \,:\, \Km(\omega_{1:n})\leq \KM(\omega_{1:n})+c_\omega\,\forall
n$, \hfill \cite{Levin:73random}
\item[$\scriptstyle(2)$]
$\KM(x)\leq \Km(x)\leq \KM(x)+2\,\lb \Km(x)+O(1)\,\forall x$. \hfill \cite[Thm.3.4]{Zvonkin:70}
\item[$\scriptstyle\neg(3)$]
$\forall c \,:\, \Km(x)-\KM(x)\geq c$ for infinitely many $x$. \hfill \cite{Gacs:83}
\end{list}
}%------------------------------%
\paradot{Remarks}
% line 1
The first line $(o_1)$ shows that $m$ is close to $M$
within a multiplicative constant for nearly all strings in a very
strong sense. $\sup_n{M(\omega_{1:n})\over m(\omega_{1:n})}\leq
2^{c_\omega}$ is finite for every $\omega$ which is random (in
the sense of Martin-L{\"o}f) w.r.t.\ {\em any} computable $\mu$,
but note that the constant $c_\omega$ depends on $\omega$.
% line 2
Levin falsely conjectured the result to be true for {\em all}
$\omega$, but could only prove it to hold within logarithmic
accuracy $(o_2)$.
% line 3
A later result by G\'acs $\neg(o_3)$, indeed, shows that
$\Km-\KM$ is unbounded (for infinite alphabet it can even increase
logarithmically).
%------------------------------%
\paradot{Proof}
%------------------------------%
The first two properties are due to Levin and are proven in
\cite{Levin:73random} and \cite[Thm.3.4]{Zvonkin:70}, respectively.
The third property follows easily from G\'acs' result
\cite{Gacs:83}, which says that if $g$ is some monotone
co-enumerable function for which $\Km(x)-\KM(x)\leq g(\l(x))$
holds for all $x$, then $g(n)$ must be $\geqa K(n)$. Assume
$\Km(x)-\KM(x)\geq \lb\,\l(x)$ only for finitely many $x$.
Then there exists a $c$ such that $\Km(x)-\KM(x)\leq \lb\,\l(x)+c$
for {\em all} $x$. G\'acs' theorem now implies $\lb\,n+c\geqa
K(n)\,\forall n$, which is wrong due to Kraft's inequality $\sum_n
2^{-K(n)}\leq 1$.
\qed
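The contradiction in the last step can be made explicit:
$K(n)\leqa\lb\,n+c$ would imply
\beqn
  \sum_n 2^{-K(n)} \;\geqm\; \sum_n 2^{-\lb\,n-c} \;=\;
  2^{-c}\sum_{n=1}^\infty{\textstyle{1\over n}} \;=\; \infty,
\eeqn
which is incompatible with Kraft's inequality $\sum_n 2^{-K(n)}\leq 1$.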
%-------------------------------------------------------------%
\subsection{Universality of $m=2^{-\Km}$}
%-------------------------------------------------------------%
%------------------------------%
\addtocounter{theorem}{-1}
\ftheorem{thmi}{i) (Universality of $m=2^{-\Km}$)}{\hfill
%------------------------------%
\begin{list}{}{\parsep=1ex\itemsep=0ex\leftmargin=5ex\labelwidth=5ex}
\item[$\scriptstyle(1)$]
$\Km(x)\leqa -\lb\,\mu(x)+K(\mu) \qmbox{if}
\mu\in\M_{comp}^{msr}$, \hfill \cite[Thm.4.5.4]{Li:97}
\item[$\scriptstyle(2)$]
$m\geqm\M_{comp}^{msr}, \qmbox{but}
m\not\geqm\M_{enum}^{semi}$ (unlike
$\MM\geqm\M_{enum}^{semi}$).
\end{list}
}%------------------------------%
\paradot{Remarks} The first line $(i_1)$ can be interpreted as a
``continuous'' coding theorem for $\Km$ and recursive $\mu$. It
implies (by exponentiation) that $m$ dominates all computable
measures $(i_2)$. Unlike $\MM$ it does {\em not} dominate all
enumerable semimeasures. Dominance is a key feature for good
predictors. From a practical point of view the assumption that the
true generating distribution $\mu$ is a proper measure and
computable seems not to be restrictive. The problem will be that
$m$ is not a semimeasure.
%------------------------------%
\paradot{Proof}
%------------------------------%
The first line is proven in \cite[Thm.4.5.4]{Li:97}. Exponentiating
this result gives $m(x)\geq c_\mu\mu(x)\,\forall
x,\mu\in \M_{comp}^{msr}$, i.e.\
$m\geqm\M_{comp}^{msr}$. Exponentiation of $\neg(o_3)$
implies
$m(x)\not\geqm\MM(x)\in\M_{enum}^{semi}$, i.e.\
$m\not\geqm\M_{enum}^{semi}$.
\qed
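The exponentiation step, spelled out: $(i_1)$ asserts
$\Km(x)\leq-\lb\,\mu(x)+K(\mu)+c'$ for some constant $c'$
independent of $x$, hence
\beqn
  m(x) \;=\; 2^{-\Km(x)} \;\geq\; 2^{-c'-K(\mu)}\,\mu(x) \;=:\;
  c_\mu\,\mu(x) \qmbox{with} c_\mu \;=\; 2^{-c'-K(\mu)} \;>\; 0 .
\eeqn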
%-------------------------------------------------------------%
\subsection{Monotonicity of $m=2^{-\Km}$}
%-------------------------------------------------------------%
Monotonicity of $\Km$ is obvious from the definition of
$\Km$ and is the origin of calling $\Km$ monotone complexity:
%------------------------------%
\addtocounter{theorem}{-1}
\ftheorem{thmii}{ii) (Monotonicity of $m=2^{-\Km}$)}{\hfill
%------------------------------%
\begin{list}{}{\parsep=1ex\itemsep=0ex\leftmargin=5ex\labelwidth=5ex}
\item[] $\Km(xy)\geq \Km(x)\in\SetN_0$, i.e.\
$0<m(xy)\leq m(x)\leq 1\,\forall x,y\in\X^*$.
\end{list}
}%------------------------------%
\paradot{Remarks}
Hence, whenever $\Km(x_{1:n})=\Km(x_{<n})$, we have
$m(x_{1:n})=m(x_{<n})$, i.e.\ the ``posterior'' $m(x_n|x_{<n})$
equals $1$. Since $m$ only assumes values in $2^{-\SetN_0}$, if
$\mu(x_t|x_{<t})\not\in 2^{-\SetN_0}$, there is a $\gamma>0$ such
that $(\mu-\gamma,\mu+\gamma)\cap 2^{-\SetN_0}=\{\}$, hence
$|m(x_t|x_{<t})-\mu(x_t|x_{<t})|\geq\gamma>0$.
For $m_{norm}$ we proceed as follows: With
$z_i:=\Km(i|x_{<t})$, the normalized posterior
$m_{norm}(i|x_{<t})=2^{-z_i}/\sum_j 2^{-z_j}$ ranges over a
discrete set $\I$, so one can place a box of side length
$2\gamma>0$ surrounding $\mu$, such that
Box$\,\cap\,\I^{|\X|}=\{\}$, which implies the desired result
$|m_{norm}(x'_t|x_{<t})-\mu(x'_t|x_{<t})|\geq\gamma>0$.
%-------------------------------------------------------------%
\subsection{Self-optimization of $\Lambda_m$}
%-------------------------------------------------------------%
\addtocounter{theorem}{-1}
\ftheorem{thmvii}{vii) (Self-optimization of $\Lambda_m$)}{\hfill
\begin{list}{}{\parsep=1ex\itemsep=0ex\leftmargin=5ex\labelwidth=5ex}
\item[$\scriptstyle(1)$]
$l_t^{\smash{\Lambda_m}}-l_t^{\smash{\Lambda_\mu}}\toinfty{t}0$
for deterministic computable environments.
\item[$\scriptstyle(2)$]
$y_t^{\smash{\Lambda_{m_{norm}}}}=y_t^{\smash{\Lambda_m}}$ and
$l_t^{\smash{\Lambda_{m_{norm}}}}=l_t^{\smash{\Lambda_m}}\,\forall t$.
\item[$\scriptstyle\neg(3)$]
$\forall\,|\Y|>2\;\exists\ell,
\mu \;:\; {l_t^{\smash{\Lambda_m}}/l_t^{\smash{\Lambda_\mu}}}
= c > 1\,\forall t\quad$
$(c={6\over 5}-\eps$ possible$)$.
\item[$\scriptstyle\neg(4)$]
$\exists\ell,\mu \;:\; {l_t^{\smash{\Lambda_m}}/l_t^{\smash{\Lambda_\mu}}}
= c > 1$ for many $t$
with $\mu$-probability $\geq\odt$ \ifjcss \\ \fi
$(c=\sqrt{2}-\eps$ possible$)$.
\item[$\scriptstyle\neg(5)$]
$\forall$ non-degenerate\footnote{A formal definition of {\em non-degenerate} is
given in the remarks after the theorem.} $\ell\;\exists U,\mu \;:\;
{l_t^{\smash{\Lambda_m}}/l_t^{\smash{\Lambda_\mu}}} \;\;\not\!\!\!\toinfty{t} 1$
with high probability.
\end{list}
}%------------------------------%
\paradot{Remarks} Since $(vi)$ implies $(vii_1)$ by continuity,
we have convergence of the instantaneous losses for computable
environments $x_{1:\infty}$, but since convergence off-sequence is
potentially slow, the convergence of the losses to optimum is
potentially slow.
Non-convergence $\neg(vi_6)$ in probabilistic environments does
not necessarily imply that $\Lambda_m$ is not self-optimizing,
since different predictive functions can lead to the same
predictor $\Lambda$. But $\neg(vii_4)$ shows that $\Lambda_m$ is
not self-optimizing even in Bernoulli environments $\mu$ for
particular losses $\ell$ with probability $\geq\odt$.
Interestingly, excluding binary action alphabets allows for a
stronger for-sure statement $\neg(vii_3)$.
In $\neg(vii_5)$, non-self-optimization is shown for {\em any}
{\em non-degenerate loss function} (especially for the error
loss, cf.\ (\ref{eqMDLk})), for specific choices of the universal
Turing machine $U$. Loss $\ell$ is defined to be non-degenerate
{\em iff} $\bigcap_{x\in\X}\{\tilde y\,:\,\ell_{x\tilde
y}=\min_y\ell_{xy}\} = \{\}$. Assume, to the contrary, that a {\it
single} action $\tilde y$ is optimal for {\it every} outcome $x$,
i.e.\ that ($\arg\min_y$ can be chosen such that)
$\arg\min_y\ell_{xy}=\tilde y\,\forall x$. This implies
$y_t^{\smash{\Lambda_\rho}}=\tilde y\,\forall\rho$, which implies
$l_t^{\smash{\Lambda_m}}/l_t^{\smash{\Lambda_\mu}}\equiv 1$. So
the non-degeneracy assumption is necessary (and sufficient).
%------------------------------%
\paranodot{Proof $\bf(vii_1)$}
%------------------------------%
follows from $(vi_{1\&3})$ and Theorem~\ref{thPredRel}d.
%------------------------------%
$\bf(vii_2)$
That normalization does not affect the predictor,
follows from the definition of $y_t^{\smash{\Lambda_\rho}}$ (\ref{xlrdef})
and the fact that $\arg\min()$ is not affected by scaling its
argument.
%------------------------------%
$\bf\neg(vii_3)$
Non-convergence of $m$ does not necessarily imply non-convergence
of the losses. For instance, for $\X=\Y=\{0,1\}$, and
$\omega'_t:=1/0$ for
$\mu(1|x_{<t})\,{\textstyle{>\atop<}}\,\gamma := {\ell_{01}-\ell_{00}\over
\ell_{01}-\ell_{00}+\ell_{10}-\ell_{11}}$, one can show
that $y_t^{\smash{\Lambda_\mu}}=y_t^{\smash{\Lambda_{\omega'}}}$, hence
convergence $m(1|x_{<t})-\omega'_t\to 0$ would imply
$l_t^{\smash{\Lambda_m}}/l_t^{\smash{\Lambda_\mu}}\to 1$.
Since $l_\mu^0>{3\over 8}=l_\mu^1$, we have
$y_t^{\smash{\Lambda_\mu}}=1$ and $l_t^{\smash{\Lambda_\mu}}=l_\mu^1={3\over 8}$.
For $\rho\leq{1\over 3}$, we have
$l_\rho^0<l_\rho^1$, hence $y_t^{\smash{\Lambda_m}}=0$ and
$l_t^{\smash{\Lambda_m}}/l_t^{\smash{\Lambda_\mu}}=l_\mu^0/l_\mu^1
={16\over 15}>1$. The constant ${16\over 15}$ can be enlarged to ${6\over
5}-\eps$ by setting $\ell_{x1}={1\over 3}+\eps$ instead of
${3\over 8}$.
For $\Y=\{0,...,|\Y|-1\}$, $|\Y|>3$, we extend the loss function
by defining $\ell_{xy}=1$ $\forall y\geq 3$, ensuring that actions
$y\geq 3$ are never favored.
For $\X=\{0,...,|\X|-1\}$, $|\X|>2$,
we extend $\mu$ and define $\mu(x_t|x_{<t})=0$ for all $x_t\geq 2$.
%------------------------------%
$\bf\neg(vii_4)$
Choose $\ell$ and a Bernoulli($\mu$) environment such that
$l_a^1>l_a^0$ and $l_\mu^1<l_\mu^0$; then $l_m^1>l_m^0$ whenever $m\leq a$,
which is the case for many $t$ by assumption. Hence
$l_t^{\smash{\Lambda_m}}/l_t^{\smash{\Lambda_\mu}}=l_\mu^0/l_\mu^1=c>1$.
For instance, choose $\mu=\sqrt{2}-1$ and $\ell_{00}=0$ and
$\ell_{10}=1$ ($\Rightarrow l_\rho^0=\rho$). We get
$c=\sqrt{2}-O(\eps)$ by choosing $\ell_{01}=\odt+\eps$ and
$\ell_{11}=0$ ($\Rightarrow l_\rho^1=(\odt+\eps)(1-\rho)$) in the
former case with $a={1\over 3}$ (and $\ell_{01}=1-\eps$ and
$\ell_{11}=0$ ($\Rightarrow l_\rho^1=(1-\eps)(1-\rho)$) in the
latter case with $b=\odt$ and $l_b^1<l_b^0$, but $l_\mu^1>l_\mu^0$).
The generalization to general $\X$ and $\Y$ can be performed
similarly to $\neg(vii_3)$.
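In the former case the constant can be checked directly: with
$\mu=\sqrt{2}-1$ we have $l_\mu^0=\mu$ and
$l_\mu^1=(\odt+\eps)(1-\mu)$, hence
\beqn
  {l_\mu^0\over l_\mu^1} \;=\;
  {\sqrt{2}-1\over\odt(2-\sqrt{2})}-O(\eps) \;=\;
  {2(\sqrt{2}-1)\over\sqrt{2}(\sqrt{2}-1)}-O(\eps) \;=\;
  \sqrt{2}-O(\eps) .
\eeqn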
%------------------------------%
$\bf\neg(vii_5)$
%specific loss
We first present a simple proof for a particular loss function and
$\X=\Y=\B$, which contains the main idea also used to prove the
general result. We define a monotone Turing machine $U$ by
$U(1x0)=x0$ for all $x\in\X^*$. More precisely, if the first bit
of the input tape of $U$ contains 1, $U$ copies the half-infinite
input tape (without the first 1) to the output tape, but always
withholds the output until a $0$ appears. We have
$\Km(x1)=\Km(x10)=\l(x)+2=\Km(x0)+1$, which implies
$m_{norm}(1|x)={1\over 3}$ and $m_{norm}(0|x)={2\over 3}$. For the
loss function $\ell_{00}=\ell_{11}=0$, $\ell_{10}=1$,
$\ell_{01}={2\over 3}$ and a Bernoulli($\odt$) process
$\mu(x_t|x_{<t})=\odt$, we get $l_{m_{norm}}^1={4\over 9}>{1\over
3}=l_{m_{norm}}^0$, hence
$l_t^{\smash{\Lambda_m}}/l_t^{\smash{\Lambda_\mu}}=l_\mu^0/l_\mu^1={3\over
2}>1$. $U$ is not yet universal. We make $U$ universal by
additionally defining $U(0^{s+1}p)=U'(p)$ for some (large, but
reasonable) $s\in\SetN$ and some (other) universal monotone TM
$U'$. We have to check whether this can alter (lower) the monotone
complexity. Fix $n$. Every $x$ of length $n$ has description $1x0$
of length $n+2$, so $U'$ only matters if $U'(p)=x*$ for some $p$
of length $1$ for $t-1\in(s+1)\SetN$, i.e.\ for (infinitely) many $t$.
What remains to do is to extend $U$ to a universal Turing machine.
We extend $U$ by defining $U(0zp)=U'(p)$ for any $z\in\B^{3s}$,
where $U'$ is some universal Turing machine.
Clearly, $U$ is now universal. We have to show that this extension
does not spoil the preceding consideration, i.e.\ that the shortest
code of $x$ has sufficiently often the form $1p$ and sufficiently
seldom the form $0p$. Above, $\mu$ has been chosen in such a way that
$c(x)$ is a Shannon-Fano code for $\mu$-distributed strings,
i.e.\ $c(x)$ is with high $\mu$-probability a shortest code of $x$.
More precisely, $\l(c(x))\leq\Km_T(x)+s$ with $\mu$-probability at
least $1-2^{-s}$, where $\Km_T$ is the monotone complexity w.r.t.\
any decoder $T$, especially $T=U'$. This implies
$
\min_p\{\l(0p) : U(0p)=x*\}
= 3s+1+\Km_{U'}(x)
\geq 3s+1+\l(c(x))-s
> \l(c(x))+s+1
\geq \min_p\{\l(1p) : U(1p)=x*\},
$
where the first $\geq$ holds with high probability ($1-2^{-s}$)
and the last $\geq$ holds with $\mu$-probability 1. This shows
that the expressions (\ref{eqEUKm}) for $\Km$ are with high
probability (w.h.p.) not affected by the extension of $U$.
Altogether this
shows ${l_t^{\smash{\Lambda_m}}/l_t^{\smash{\Lambda_\mu}}}=c>1$ w.h.p.
A martingale argument can strengthen this result to yield
non-self\-opti\-mizing\-ness. For
$z_t:={M(\omega_{1:t})\over\mu(\omega_{1:t})}$ we have $z_0=1$,
$\E[z_t]\leq 1$, and $\E[z_t|\omega_{<t}]\leq z_{t-1}$, i.e.\
$z_t$ is a positive supermartingale.