%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Algorithmic Complexity Bounds %%
%% on Future Prediction Errors %%
%% Alexey Chernov & Marcus Hutter: Start 2005 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
% Document-Style %
%-------------------------------%
\documentclass[12pt,twoside]{article}
\usepackage{amsmath,amssymb,amsthm}
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\sloppy\lineskip=0pt
%-------------------------------%
% Spacings %
%-------------------------------%
\edef\,{\thinspace} \edef\;{\thickspace} \edef\!{\negthinspace} %\def\>{\mskip 4mu plus 2mu minus 4mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
\def\paradot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1.}}}
\def\paranodot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1}}}
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{claim}[theorem]{Claim}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\theoremstyle{remark}
\newtheorem*{note*}{Note}
\renewcommand{\proofname}{\bfseries Proof}
\def\qedsymbol{$\Box\quad$}
\def\qedx{}
\newenvironment{keywords}%
{\centerline{\bf\small Keywords}\begin{quote}\small}%
{\par\end{quote}\vskip 1ex}
\def\equa{\smash{\stackrel{\raisebox{0.8ex}{$\scriptstyle+$}}{\smash=}}}
\def\leqa{\smash{\stackrel{\raisebox{1ex}{$\scriptstyle\!\!\;+$}}{\smash\leq}}}
\def\geqa{\smash{\stackrel{\raisebox{1ex}{$\scriptstyle+\!\!\;$}}{\smash\geq}}}
\def\eqm{\smash{\stackrel{\raisebox{0.6ex}{$\scriptstyle\times$}}{\smash=}}}
\def\leqm{\smash{\stackrel{\raisebox{1ex}{$\scriptstyle\!\!\;\times$}}{\smash\leq}}}
\def\geqm{\smash{\stackrel{\raisebox{1ex}{$\scriptstyle\times\!\!\;$}}{\smash\geq}}}
\def\eps{\varepsilon}
\def\nq{\hspace{-1em}}
\def\odt{{\textstyle{1\over 2}}}
\def\SetB{I\!\!B}
\def\SetR{I\!\!R}
\def\SetN{I\!\!N}
\def\E{{\bf E}} % Expectation value
\def\M{{\cal M}} % Set of prob. distributions
\def\P{{\bf P}} % Expectation value
\def\X{{\cal X}} % input/perception set/alphabet
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\def\l{{\ell}} % length of string or program
\def\lb{{\log_2}}
\def\eps{\varepsilon} % for small positive number
\def\epstr{\epsilon} % for empty string
\def\toinfty#1{\stackrel{\smash{#1}\to\infty}{\longrightarrow}}
\def\a{\alpha}
\def\o{\omega}
\def\e{{\rm e}} % natural e
\def\KM{K\!M}
\def\Km{K\!m}
\def\KP{K}
\def\C{C}
\def\KPM{{K_*}}
\newcommand{\kpc}[2]{\KP(#1|#2)} % usual conditional compl.
\newcommand{\kpm}[2]{\KPM(#1|#2*)} % new conditional compl.
\newcommand{\bin}{\{0,1\}}
\newcommand{\fin}{\bin^\ast}
\newcommand{\infin}{\bin^\infty}
\newcommand{\emptyword}{\epstr} %{\Lambda}
\newcommand*{\pair}[1]{\langle #1\rangle}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\normalsize
\vskip 2mm\bf\Large\hrule height5pt \vskip 6mm
Algorithmic Complexity Bounds \\
on Future Prediction Errors\thanks{
This work was supported by SNF grants
200020-107590/1,
2100-67712 and 200020-107616.
A shorter version appeared in the proceedings of the ALT'05 conference~\cite{Chernov:05postbnd}.
}
\vskip 6mm \hrule height2pt \vskip 5mm}
\author{{{\bf Alexey Chernov}$^{1,4}$
\; {\bf Marcus Hutter}$^{1,3}$
\; {\bf J\"{u}rgen Schmidhuber}$^{1,2}$}\\[3mm]
\normalsize $^1$IDSIA, Galleria 2, CH-6928\ Manno-Lugano, Switzerland\\
\normalsize $^2$TU Munich, Boltzmannstr.\ 3, 85748 Garching, M\"{u}nchen, Germany\\
\normalsize $^3$RSISE/ANU/NICTA, Canberra, ACT, 0200, Australia\\
\normalsize $^4$LIF, CMI, 39 rue Joliot Curie, 13453 Marseille cedex 13, France\\
\normalsize \{alexey,marcus,juergen\}@idsia.ch, \ http://www.idsia.ch/$^{_{_\sim}}\!$\{alexey,marcus,juergen\}
}
\date{19 January 2007}
\maketitle
\begin{abstract}
We bound the future loss when predicting any (computably)
stochastic sequence online. Solomonoff finitely bounded the total
deviation of his universal predictor $M$ from the true
distribution $\mu$ by the algorithmic complexity of $\mu$. Here we
assume that we are at a time $t>1$ and have already observed $x=x_1...x_t$.
We bound the future prediction performance on $x_{t+1}x_{t+2}...$
by a new variant of algorithmic complexity of $\mu$ given $x$,
plus the complexity of the randomness deficiency of $x$. The new
complexity is monotone in its condition in the sense that this
complexity can only decrease if the condition is prolonged. We
also briefly discuss potential generalizations to Bayesian model
classes and to classification problems.
\end{abstract}
\begin{keywords}
Kolmogorov complexity,
posterior bounds,
online sequential prediction,
Solomonoff prior,
monotone conditional complexity,
total error,
future loss,
randomness deficiency.
\end{keywords}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
%\paradot{Historical Survey}
%-------------------------------%
We consider the problem of online=sequential predictions. We
assume that the sequences $x=x_1x_2x_3...$ are drawn from some ``true''
but unknown probability distribution $\mu$. Bayesians proceed by
considering a class $\M$ of models=hypotheses=distributions,
sufficiently large such that $\mu\in\M$, and a prior over $\M$.
%
Solomonoff considered the truly large class that contains all
computable probability distributions~\cite{Solomonoff:64}.
He showed that his universal distribution $M$ converges rapidly to
$\mu$~\cite{Solomonoff:78}, i.e.\ predicts well in any environment
as long as it is computable or can be modeled by a computable
probability distribution (all physical theories are of this sort).
%
$M(x)$ is roughly $2^{-\KP(x)}$, where $\KP(x)$ is the length of
the shortest description of $x$, called the Kolmogorov complexity of
$x$. Since $\KP$ and $M$ are incomputable, they have to be
approximated in practice.
See e.g.~\cite{Schmidhuber:02speed,Hutter:04uaibook,Li:97,Cilibrasi:05} and
references therein.
%
The universality of $M$ also precludes useful statements about the
prediction quality at particular time
instances~$n$~\cite[p.\,62]{Hutter:04uaibook},
as opposed to simple classes like
i.i.d.\ sequences (data) of size $n$, where accuracy is typically
$O(n^{-1/2})$.
%
Luckily, bounds on the expected \emph{total}=cumulative loss
(e.g.\ number of prediction errors) for $M$ can be
derived~\cite{Solomonoff:78,Hutter:01errbnd,Hutter:02spupper,Hutter:03optisp},
which is often sufficient in an online setting. The bounds are in
terms of the (Kolmogorov) complexity of $\mu$. For instance, for
deterministic $\mu$, the number of errors is (in a sense tightly)
bounded by $\KP(\mu)$ which measures in this case the information
(in bits) in the observed infinite sequence~$x$.
%-------------------------------%
\paradot{What's new}
%-------------------------------%
In this paper we assume we are at a time $t>1$ and have already
observed $x=x_1...x_t$. Hence we are interested in the future
prediction performance on $x_{t+1}x_{t+2}...$, since typically we
don't care about past errors.
%
If the total loss is finite, the future loss must necessarily be
small for large $t$. In a sense the paper intends to quantify this
apparent triviality.
%
If the complexity of $\mu$ bounds the total loss, a natural guess is
that something like the conditional complexity of $\mu$ given $x$
bounds the future loss. (If $x$ contains a lot of (or even all)
information about $\mu$, we should make fewer (or even no) errors
from then on.)
%
Indeed, we prove two bounds of this kind but with additional terms
describing structural properties of $x$. These additional terms
appear since the total loss is bounded only in expectation, and
hence the future loss is small only for ``most'' $x_1...x_t$. In the
first bound (Theorem~\ref{thm:lbnd}), the additional term is the
complexity of the length of $x$ (a kind of worst-case estimation).
The second bound (Theorem~\ref{thm:KPMbound}) is finer: the
additional term is the complexity of the randomness deficiency of
$x$. The advantage is that the deficiency is small for ``typical''
$x$ and bounded on average (in contrast to the length). But in this
case the conventional conditional complexity turned out to be
unsuitable.
%
So we introduce a new natural modification of conditional
Kolmogorov complexity, which is monotone as a function of
condition. Informally speaking, we require programs
(=descriptions) to be consistent in the sense that if a program
generates some $\mu$ given $x$, then it must generate the same
$\mu$ given any prolongation of $x$.
%
The new posterior bounds also significantly improve upon the
previous total bounds.
%-------------------------------%
\paradot{Contents}
%-------------------------------%
The paper is organized as follows. Some basic notation and
definitions are given in Sections~\ref{secNot} and~\ref{secSetup}.
In Section~\ref{secPostBnd} we prove and discuss the length-based
bound Theorem~\ref{thm:lbnd}. In Section~\ref{secKpmBnd} we show why
a new definition of complexity is necessary and formulate the
deficiency-based bound Theorem~\ref{thm:KPMbound}. We discuss the
definition and basic properties of the new complexity in
Section~\ref{secKpmDisc}, and prove Theorem~\ref{thm:KPMbound} in
Section~\ref{secKpmProof}. We briefly discuss potential
generalizations to general model classes $\M$ and classification in
the concluding Section~\ref{secDisc}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Notation \& Definitions}\label{secNot}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We essentially follow the notation of~\cite{Li:97,Hutter:04uaibook}.
%------------------------------%
\paradot{Strings and natural numbers}
%------------------------------%
We write $\X^*$ for the set of finite strings over a finite
alphabet $\X$, and $\X^\infty$ for the set of infinite sequences.
The cardinality of a set $\cal S$ is denoted by $|{\cal S}|$. We
use letters $i,k,l,n,t$ for natural numbers, $u,v,x,y,z$ for
finite strings, $\epstr$ for the empty string, and
$\a=\a_{1:\infty}$ etc.\ for infinite sequences. For a string $x$
of length $\l(x)=n$ we write $x_1x_2...x_n$ with $x_t\in\X$ and
further abbreviate $x_{k:n}:=x_kx_{k+1}...x_{n-1}x_n$ and
$x_{<n}:=x_{1:n-1}$.
%------------------------------%
\paradot{(Semi)measures}
%------------------------------%
We call $\mu:\X^*\to[0,1]$ a \emph{semimeasure} if
$\mu(\epstr)\le 1$ and $\sum_{a\in\X}\mu(xa)\le\mu(x)$ for all
$x\in\X^*$, and a (probability) \emph{measure} if both hold with
equality. $\mu(x)$ denotes the $\mu$-probability that a sequence
starts with $x$, and $\mu(y|x):=\mu(xy)/\mu(x)$ is the
$\mu$-probability of $y$ given $x$.
A function $f:\X^*\to\SetR$ is called \emph{enumerable} (or lower
semicomputable) if the set
$\{(x,\frac{k}{n})\mid f(x)>\frac{k}{n}\}$ is enumerable,
and \emph{co-enumerable} if $-f$ is enumerable.
A measure $\mu$ is called \emph{computable} if it is enumerable and
co-enumerable and the set $\{x\mid\mu(x)=0\}$ is decidable (i.\,e.\
enumerable and co-enumerable).
To simplify the statements of the theorems below, we assume that for
every computable measure $\mu$, there is one fixed computable
version of conditional probability $\mu(y|x)$, for example,
$\mu(y|x)$ is the uniform measure on $y$'s for $\mu(x)=0$.
%------------------------------%
\paradot{Prefix Kolmogorov complexity}
%------------------------------%
The conditional prefix complexity
$\kpc{y}{x}:=\min\{\l(p):U(p,x)=y\}$
is the length of the shortest binary
(self-delimiting) program $p\in\fin$ on a universal prefix Turing
machine $U$ with output $y\in\X^*$ and input $x\in\X^*$~\cite{Li:97}.
$\KP(x):=\kpc{x}{\epstr}$.
%
For non-string objects $o$ we define $\KP(o):=\KP(\langle
o\rangle)$, where $\langle o\rangle\in\X^*$ is some standard code
for $o$. In particular, if $(f_i)_{i=1}^\infty$ is an enumeration
of all (co-)enumerable functions, we define
$\KP(f_i):=\KP(i)$.
%
We need the following properties: %
The co-enumerability of $\KP$, %
the upper bounds $\kpc{x}{\l(x)}\leqa\l(x)\lb|\X|$ %
and $\KP(n)\leqa 2\lb n$, %
Kraft's inequality $\sum_x 2^{-\KP(x)}\leq 1$, %
the lower bound $\KP(x)\geq\l(x)$ for ``most'' $x$ %
and $\KP(n)\toinfty{n}\infty$, %
extra information bounds $\kpc{x}{y}\leqa \KP(x)\leqa \KP(x,y)$, %
subadditivity $\KP(xy)\leqa \KP(x,y)\leqa \KP(y)+\kpc{x}{y}$, %
%symmetry of information $\KP(x,y)\equa \kpc{x}{y,\KP(y)}+\KP(y)\equa \KP(y,x)$, %
information non-increase $\KP(f(x))\leqa \KP(x)+\KP(f)$
for computable $f:\X^*\to\X^*$, %
and coding relative to a probability distribution (MDL):
if $P:\X^*\to[0,1]$ is enumerable and $\sum_x P(x)\leq 1$,
then
$\KP(x)\leqa -\lb P(x)+\KP(P)$.
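The MDL bound is an instance of the Kraft--Chaitin theorem: the
code lengths $\lceil-\lb P(x)\rceil$ satisfy Kraft's inequality,
\beqn
  \sum_x 2^{-\lceil-\lb P(x)\rceil} \;\le\; \sum_x P(x) \;\le\; 1\,,
\eeqn
so a prefix-free code with these lengths exists and can be
constructed effectively from an enumeration of $P$; decoding it on
$U$ costs the additional $\KP(P)$ bits.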
%------------------------------%
\paradot{Monotone and Solomonoff complexity}
%------------------------------%
The monotone complexity $\Km(x):=\min\{\l(p):U(p)=x*\}$ is the
length of the shortest binary (possibly non-halting) program
$p\in\fin$ on a universal monotone Turing machine $U$ which
outputs a string starting with $x$. Solomonoff's prior
$M(x):=\sum_{p:U(p)=x*}2^{-\l(p)}=:2^{-\KM(x)}$ is the probability
that $U$ outputs a string starting with $x$ if provided with fair
coin flips on the input tape. Most complexities coincide within an
additive term $O(\log\l(x))$, e.g.\
$\kpc{x}{\l(x)}\leqa\KM(x)\leq\Km(x)\leq \KP(x)$, hence similar
relations as for $\KP$ hold.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Setup}\label{secSetup}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
\paradot{Convergent predictors}
%------------------------------%
We assume that $\mu$ is a ``true''\footnote{Also called
\emph{objective} or \emph{aleatory} probability or \emph{chance}.}
sequence generating measure, also called an environment. If we know
the generating process $\mu$, and given past data $x_{<t}$, we can
predict the probability $\mu(x_t|x_{<t})$ of the next symbol $x_t$.
Usually $\mu$ is unknown, and we predict with the help of some
other computable distribution $\rho$. For a sequence $\o$ sampled
from $\mu$, a natural measure of the discrepancy at time $t$ is the
expected squared distance
$s_t:=\sum_{a\in\X}\big(\mu(a|\o_{<t})-\rho(a|\o_{<t})\big)^2$,
and the cumulative discrepancy is bounded by the total relative
entropy:
$\sum_{t=1}^n\E[s_t]\le
D_{1:n}:=\E\big[\ln{\mu(\o_{1:n})\over\rho(\o_{1:n})}\big]$.
In particular, $D_{1:\infty}<\infty$ implies that
$\rho(\o_t|\o_{<t})-\mu(\o_t|\o_{<t})\to 0$ with $\mu$-probability
$1$. For illustration we will
sometimes loosely interpret $D_{1:\infty}$ and other quantities as
the number of prediction errors, as for the error-loss they are
closely related to it~\cite{Hutter:01errbnd,Hutter:01alpha}.
%------------------------------%
\paradot{Bayes mixtures}
%------------------------------%
A Bayesian considers a class of distributions $\M :=
\{\nu_1,\nu_2,...\}$, large enough to contain $\mu$, and uses
the Bayes mixture
\beq\label{xidefsp}
\xi(x) \;:=\;
\sum_{\nu\in\M}w_\nu\!\cdot\!\nu(x),\quad
\sum_{\nu\in\M}w_\nu=1,\quad w_\nu>0
\eeq
for prediction, where $w_\nu$ can be interpreted as the prior of
(or initial belief in) $\nu$. The dominance
\beq\label{unixi}
\xi(x) \;\geq\;
w_\mu\!\cdot\!\mu(x) \quad \forall x\in\X^*
\eeq
is its most important property. Using $\rho=\xi$ for prediction,
this implies $D_{1:\infty}\leq\ln w_\mu^{-1}<\infty$, hence
$\xi_t\to\mu_t$. If $\M$ is chosen sufficiently large, then
$\mu\in\M$ is not a serious constraint.
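Indeed, with the total expected discrepancy written as a relative
entropy, $D_{1:n}:=\E\big[\ln{\mu(\o_{1:n})/\xi(\o_{1:n})}\big]$,
dominance~(\ref{unixi}) gives, for every $n$,
\beqn
  D_{1:n} \;=\; \E\Big[\ln{\mu(\o_{1:n})\over\xi(\o_{1:n})}\Big]
  \;\le\; \E\Big[\ln{\mu(\o_{1:n})\over w_\mu\,\mu(\o_{1:n})}\Big]
  \;=\; \ln w_\mu^{-1}\,.
\eeqn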
%------------------------------%
\paradot{Solomonoff prior}
%------------------------------%
So we consider the largest (from a computational point of view)
relevant class, the class $\M_U$ of all enumerable semimeasures
(which includes all computable probability distributions) and
choose $w_\nu=2^{-\KP(\nu)}$ which is biased towards simple
environments (Occam's razor). This gives us Solomonoff-Levin's
prior $M$~\cite{Solomonoff:64,Zvonkin:70} (this definition
coincides within an irrelevant multiplicative constant with the
one in Section~\ref{secNot}). In the following we assume
$\M=\M_U$, $\rho=\xi=M$, $w_\nu=2^{-\KP(\nu)}$ and $\mu\in\M_U$
being a computable (proper) measure, hence $M(x)\geq
2^{-K(\mu)}\mu(x)\,\forall x$ by~(\ref{unixi}).
%------------------------------%
\paradot{Prediction of deterministic environments}
%------------------------------%
Consider a computable sequence $\a=\a_{1:\infty}$ ``sampled from
$\mu\in\M$'' with $\mu(\a)=1$, i.e.\ $\mu$ is deterministic,
then from~(\ref{unixi}) we get
\beq\label{detBnd}
\sum_{t=1}^\infty |1-M(\a_t|\a_{<t})|
\;\leq\; \ln 2\cdot\Km(\a_{1:\infty})
\;\leqa\; \ln 2\cdot\KP(\mu)\,,
\eeq
i.e.\ $M(\a_t|\a_{<t})$ converges to $1$ fast if the sequence $\a$
is computable by a short program.
\begin{corollary}
The future deviation of $M_t$ from $\mu_t$ is bounded by
\bqan
\sum_{t=l+1}^\infty \E[s_t|\o_{1:l}]
& \leqa &
[\kpm{\mu}{\o_{1:l}}+\KP(\lceil d_\mu(\o_{1:l})\rceil)]\ln 2
\\
& \leqa &
[\min_{i\le l}\{\kpc{\mu}{\o_{1:i}}\!+\!\KP(i)\}
+\KP(\lceil d_\mu(\o_{1:l})\rceil)]\ln 2\,.
\eqan
\end{corollary}
Let us note that if $\o$ is $\mu$-random, then
${\KP(\lceil d_\mu(\o_{1:l})\rceil)}\leqa
{\KP(\lceil d_\mu(\o_{1:\infty})\rceil)}+{\KP(\KP(\mu))}$,
and therefore we obtain a bound that does not grow
with $l$, in contrast to bound~$(i)$
in~Corollary~\ref{cor:lbnd}.
Finally, let us point out one more approach
to defining the complexity $\KPM$.
The survey~\cite{UspShen:96} provides ``encoding-free'' definitions
of the main complexities.
In a similar fashion, $\KPM$ could be defined as a minimal
(up to an additive constant) function with the following properties:
\begin{enumerate}
\item The function $\kpm{y}{x}$ is non-negative and co-enumerable;
\item $\kpm{y}{xz}\leq\kpm{y}{x}$ for all $x$, $y$, $z$;
\item $\sum\limits_{y}2^{-\kpm{y}{x}}\le 1$ for all $x$.
\end{enumerate}
Probably, condition~2, which expresses monotonicity, is
superfluous, and conditions~2 and~3 together can be replaced by
\begin{itemize}
\item[$2'$.] For any set $A$ of pairs $\pair{x,y}$ such that
all the first elements $x$ have a common prolongation
and all the second elements $y$ are pairwise distinct,
we have $\sum\limits_{\pair{x,y}\in A}2^{-\kpm{y}{x}}\le 1$.
\end{itemize}
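Note that condition~$2'$ subsumes condition~3: for a fixed $x$, any
set $A:=\{\pair{x,y}\mid y\in Y\}$ with pairwise distinct $y$'s
satisfies the assumptions of~$2'$ (all first elements coincide),
whence
\beqn
  \sum_{y\in Y}2^{-\kpm{y}{x}} \;\le\; 1 \qmbox{for all} Y\,,
\eeqn
which is condition~3.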
It is easy to check that these properties are satisfied for all
previously defined ``versions'' of $\KPM$.
We conjecture that all the definitions are equivalent,
though we cannot prove this.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Proof of Theorem~\protect\ref{thm:KPMbound}}\label{secKpmProof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
If $\mu(x)=0$, then $d_\mu(x)=\infty$ and the bound trivially holds.
Below assume that $\mu(x)\ne 0$ and thus $d_\mu(x)$ is finite.
The plan is to get a statement of the form $2^{d}\mu(xy)\leqm M(xy)$,
where $d\approx d_\mu(x)=\lb\frac{M(x)}{\mu(x)}$.
To this end, we define a new semimeasure $\nu$:
we take the set $S=\{z|d_\mu(z)>d\}$
and put $\nu$ to be $2^d\mu$ on prolongations of $z\in S$;
this is possible since $S$ has $\mu$-measure at most $2^{-d}$.
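Indeed, if $\hat S$ denotes the set of minimal (hence pairwise
incomparable) elements of $S$, then $d_\mu(z)>d$ means
$\mu(z)<2^{-d}M(z)$, so
\beqn
  \sum_{z\in\hat S}\mu(z) \;<\; 2^{-d}\sum_{z\in\hat S}M(z)
  \;\le\; 2^{-d}\,,
\eeqn
using that $M$ is a semimeasure and $\hat S$ is prefix free.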
Then we have $\nu(z)\le C\cdot M(z)$ by universality of $M$.
However, the constant $C$ depends on $\mu$ and also on $d$.
To make the dependence explicit,
we repeat the above construction for all numbers $d$
and all semimeasures $\mu^T$,
obtaining semimeasures $\nu_{d,T}$,
and take $\nu=\sum 2^{-\KP(d)}\cdot 2^{-\KP(T)}\nu_{d,T}$.
This construction would give us the term $\KP(\mu)$
in the right-hand side of Theorem~\ref{thm:KPMbound}.
To get $\kpm{\mu}{x}$,
we need a more complicated strategy:
instead of a sum of semimeasures $\nu_{d,T}$,
for every fixed $d$
we sum ``pieces'' of $\nu_{d,T}$ at each point $z$,
with coefficients depending on $z$ as well as on $d$ and $T$.
Now proceed with the formal proof.
Let $\{\mu^T\}_{T\in\SetN}$ be any
(effective) enumeration of all enumerable
semimeasures.
%
For any integer $d$ and any $T$, put
\beqn
S_{d,T} \;:=\; \{z\mid \sum_{v\in\X^{\l(z)}\setminus\{z\}}
\mu^T(v) + 2^{-d}M(z) > 1\}\,.
\eeqn
The set $S_{d,T}$ is enumerable given $d$ and $T$.
Let $E$ be the optimal $\KPM$-correct set
(satisfying all three requirements), and let
$E(p,z)$ be the corresponding partial computable function.
%
For any $z\in\X^*$ and $T$, put
\beqn
\lambda_{d,T}(z) \;:=\;
\max\{2^{-\l(p)}\mid\exists k\le\l(z)\colon
z_{1:k}\in S_{d,T}\text{ and } E(p,z_{1:k})=T\}
\eeqn
(if there is no such $p$, then $\lambda_{d,T}(z)=0$).
%
Put
\beqn
\tilde\nu_d(z) \;:=\; \sum_{T}
\lambda_{d,T}(z)\cdot 2^{d}\mu^{T}(z)\,.
\eeqn
Obviously, this value is enumerable.
It is not a semimeasure, but it has the following property.%
\begin{claim}\label{claim:prefix}
For any prefix-free set $A$,
\beqn
\sum_{z\in A}\tilde\nu_d(z) \;\le\; 1\,.
\eeqn
\end{claim}
This implies that there exists an enumerable semimeasure $\nu_d$
such that $\nu_d(z)\ge\tilde\nu_d(z)$ for all $z$.
Actually, to enumerate $\nu_d$,
one enumerates $\tilde\nu_d(z)$ for all $z$,
and at each step the current approximation of $\nu_d(z)$
is the maximum of the current approximations of $\tilde\nu_d(z)$
and $\sum_{u\in\X}\nu_d(zu)$.
Trivially, this provides $\nu_d(z)\ge\sum_{u\in\X}\nu_d(zu)$.
To show that $\nu_d(\emptyword)\le 1$,
let us note that at any step of enumeration the current approximation
of $\nu_d(\emptyword)$
is the sum of current approximations of $\tilde\nu_d(z)$ over some
prefix-free set, and thus is bounded by $1$.
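In other words, the enumeration computes, in the limit, a solution
of
\beqn
  \nu_d(z) \;=\;
  \max\Big\{\tilde\nu_d(z),\;\sum_{u\in\X}\nu_d(zu)\Big\}\,,
\eeqn
and the two observations above verify the semimeasure properties
$\nu_d(z)\ge\sum_{u\in\X}\nu_d(zu)$ and $\nu_d(\emptyword)\le 1$.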
%
Put
\beqn
\nu(z) \;:=\; \sum_{d} 2^{-\KP(d)}\nu_{d}(z)\,.
\eeqn
Clearly, $\nu$ is an enumerable semimeasure,
thus $\nu(z)\leqm M(z)$.
%
Let $\mu$ be an arbitrary computable measure,
and
$x,y\in\X^*$.
Let $p\in\fin$ be a string such that $\kpm{\mu}{x}=\l(p)$,
$E(p,x)=T$, and $\mu=\mu^T$.
%
Put $d=\lceil d_\mu(x)\rceil-1$, i.e.,
${d_\mu(x)-1}\le d < d_\mu(x)$. Hence $\mu(x) < 2^{-d}M(x)$.
Since $\mu=\mu^{T}$ is a measure,
we have
$\sum_{v\in\X^{\l(x)}} \mu^T(v)=1$,
and therefore $x\in S_{d,T}$.
By definition, $\lambda_{d,T}(xy)\ge 2^{-\l(p)}$,
thus $\tilde\nu_{d}(xy)\ge 2^{-\l(p)}2^d\mu(xy)$,
and
\beqn
2^{-\KP(d)} 2^{-\l(p)} 2^d\mu(xy)
\;\le\; \nu(xy) \;\leqm\; M(xy)\,.
\eeqn
Replacing $2^d$ in the left-hand side by a smaller value
$2^{d_\mu(x)-1}$,
after trivial transformations we get
\beqn
\lb\frac{\mu(y|x)}{M(y|x)} \;\leqa \kpm{\mu}{x}+\KP(d)\,,
\eeqn
which completes the proof of Theorem~\ref{thm:KPMbound}.
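In detail, the ``trivial transformations'' are the following:
substituting $\l(p)=\kpm{\mu}{x}$ and
$2^d\ge 2^{d_\mu(x)-1}=\odt M(x)/\mu(x)$ into the last display
gives
\beqn
  2^{-\KP(d)}\,2^{-\kpm{\mu}{x}}\,{M(x)\over 2\,\mu(x)}\,\mu(xy)
  \;\leqm\; M(xy)\,,
  \qmbox{i.e.}
  {\mu(y|x)\over M(y|x)} \;\leqm\; 2^{\kpm{\mu}{x}+\KP(d)+1}\,,
\eeqn
and taking binary logarithms yields the stated inequality.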
\renewcommand{\proofname}{\bfseries Proof of Claim~\ref{claim:prefix}}
\begin{proof}
First observe that for all $z\in S_{d,T}$
\beqn
M(z) > 2^d \mu^T(z)\,,
\eeqn
since
\beqn
\sum_{v\in\X^{\l(z)}\setminus\{z\}}
\mu^T (v)
+ 2^{-d}M(z) > 1
\quad\text{and}\quad
\sum_{v\in\X^{\l(z)}} \mu^T (v) \le 1
\eeqn
by definition of $S_{d,T}$ and by the semimeasure property,
respectively.
To prove the claim we will group the terms with the same $\mu^T$,
replace sums of $\mu^T$-measures of several $z$ by the
$\mu^T$-measure of their common prefix from $S_{d,T}$, pass from
$\mu^T$ to $M$ using the inequality above,
and finally show (using ``prefix-free'' properties of $\KPM$)
that the coefficients of $M(z)$ in the sum are small.
%
By definition,
\beqn
\sum_{z\in A}\tilde\nu_d(z)=
\sum_{z\in A}\sum_{T}
\lambda_{d,T}(z)\cdot 2^{d}\mu^T(z)=
\sum_{T}\sum_{z\in A}
\lambda_{d,T}(z)\cdot 2^{d}\mu^T(z)\,.
\eeqn
Let us estimate the inner sum.
Let $\pi_{d,T}(z)$ be the string $p$ that gives the maximum
in the definition of $\lambda_{d,T}(z)$
(if there are several such $p$ we always take,
say, the lexicographically first),
that is $\lambda_{d,T}(z)=2^{-\l(p)}$
and there exists $z'$ being a prefix of $z$
such that $z'\in S_{d,T}$ and $E(p,z')=T$.
Let $\zeta_{d,T}(z)$ be the shortest of such $z'$.
It is easy to see that
$\zeta_{d,T}(\zeta_{d,T}(z))=\zeta_{d,T}(z)$
and $\lambda_{d,T}(\zeta_{d,T}(z))=\lambda_{d,T}(z)$.
\bqan
\sum_{z\in A} \lambda_{d,T}(z)\cdot 2^{d}\mu^T(z)
&=&
\sum_{v}\sum_{z\in A:\zeta_{d,T}(z)=v\hspace{-6ex}}
\lambda_{d,T}(z)\cdot 2^{d}\mu^T(z)
\;=\; \sum_{v}\sum_{z\in A:\zeta_{d,T}(z)=v\hspace{-6ex}}
\lambda_{d,T}(v)\cdot 2^{d}\mu^T(z)
\\
&\leq& \sum_{v\colon \exists z\in A:\zeta_{d,T}(z)=v\hspace{-8ex}}
\lambda_{d,T}(v)\cdot 2^{d}\mu^T(v)
\;\leq\; \sum_{v\colon \zeta_{d,T}(v)=v\hspace{-3ex}}
\lambda_{d,T}(v)\cdot 2^{d}\mu^T(v)
\\ &<& \sum_{v\colon \zeta_{d,T}(v)=v\hspace{-3ex}}
\lambda_{d,T}(v) M(v)\,.
\eqan
In the first inequality we used that
$\zeta_{d,T}(z)$ is a prefix of $z$ and that the set $A$ is prefix
free, and bounded the sum of the $\mu^T(z)$ by $\mu^T(v)$.
Now we can forget about $A$.
If $\zeta_{d,T}(z)=v$ for some $z$, then
$\zeta_{d,T}(v)=\zeta_{d,T}(\zeta_{d,T}(z))=v$,
and we get the second inequality.
The last inequality holds since $\zeta_{d,T}(v)$
belongs to $S_{d,T}$.
%
Thus, we need to bound the sum
\beqn
\sum_T\sum_{v\colon v=\zeta_{d,T}(v)} \lambda_{d,T}(v) M(v)
\;=\; \sum_v\left(\sum_{T\colon v=\zeta_{d,T}(v)\nq\nq} \lambda_{d,T}(v)\right) M(v)\,.
\eeqn
We say that a function $f\colon\X^*\to[0,1]$
is \emph{unit-summable along any sequence}
if for any $z\in\X^*$
\beqn
\sum_{i=1}^{\l(z)} f(z_{1:i}) \;\le\; 1\,.
\eeqn
\begin{claim}\label{claim:conv}
The function $f(v)=\sum\limits_{T\colon v=\zeta_{d,T}(v)\hspace{-1em}}\lambda_{d,T}(v)$
is unit-summable along any sequence.
\end{claim}
\begin{lemma}\label{lem:sum}
Let $\nu$ be a semimeasure.
If a function $f$ is unit-summable along any sequence,
then
\beqn
\sum_{z\in\X^*} f(z)\nu(z) \;\le\; 1\,.
\eeqn
\end{lemma}
By Lemma~\ref{lem:sum}, applied to this $f$ and the semimeasure
$M$, the last sum is bounded by $1$.
This concludes the proof of Claim~\ref{claim:prefix}.
\qedx\end{proof}% of Claim~\ref{claim:prefix}
\renewcommand{\proofname}{\bfseries Proof}
\renewcommand{\proofname}{\bfseries Proof of Lemma~\ref{lem:sum}}
\begin{proof}
Since $f(z)$ and $\nu(z)$ are non-negative,
it is sufficient to prove
$\sum_{\l(z)\le n}f(z)\nu(z)\le 1$ for all $n$.
Also we can assume that $\nu$ is a measure
(the sum does not decrease if $\nu$ is increased to a measure).
\bqan
\sum_{\l(z)\le n}f(z)\nu(z)
&\;=\;& \sum_{\l(z)\le n}f(z)
\sum_{\substack{\l(v)=n,\\ \text{$z$ prefix of $v$}}}\nu(v)
\;=\; \sum_{\l(v) = n}
\sum_{\substack{\l(z)\le n,\\ \text{$z$ prefix of $v$}}}f(z)\nu(v)
\\
&\;=\;& \sum_{\l(v)=n}\sum_{i=1}^n f(v_{1:i})\nu(v)
\;\le\; \sum_{\l(v)=n}\nu(v)\le 1\,.
\eqan
\qedx\end{proof}% of Lemma~\ref{lem:sum}
\renewcommand{\proofname}{\bfseries Proof}
\renewcommand{\proofname}{\bfseries Proof of Claim~\ref{claim:conv}}
\begin{proof}
Take any $z\in\X^*$. Let us show that
\beqn
\sum_{\substack{ v\text{ prefix of }z,\\
T\colon v=\zeta_{d,T}(v)}} \lambda_{d,T}(v)
\;\le\; 1\,.
\eeqn
Recall that if $\lambda_{d,T}(v)\ne 0$, then
$\lambda_{d,T}(v)=2^{-\l(\pi_{d,T}(v))}$.
We will show that the set
$B(z)=\{\pi_{d,T}(v)\mid v=\zeta_{d,T}(v), \text{ $v$ is a prefix of }z\}$
is prefix free,
and if
$\pi_{d,T_1}(v_1)=\pi_{d,T_2}(v_2)\in B(z)$,
then $v_1=v_2$ and $T_1=T_2$.
Consequently,
\beqn
\sum_{\substack{ v\text{ prefix of }z,\\
T\colon v=\zeta_{d,T}(v)}} \lambda_{d,T}(v)
\;=\; \sum_{p\in B(z)} 2^{-\l(p)} \;\le\; 1\,.
\eeqn
Suppose, on the contrary, that there exist
$v_i$, $T_i$, $i=1,2$, with $(v_1,T_1)\neq(v_2,T_2)$,
such that
$p_1=\pi_{d,T_1}(v_1)$ is a prefix
(proper or not) of $p_2=\pi_{d,T_2}(v_2)$,
$v_1$ and $v_2$ are prefixes of $z$,
and $v_i=\zeta_{d,T_i}(v_i)$.
By definition of $\zeta$,
we have $v_i\in S_{d,T_i}$
and $T_i=E(p_i,v_i)$.
Hence, by the second requirement of $\KPM$-correctness,
$T_1=E(p_1,v_1)=E(p_2,z)=E(p_2,v_2)=T_2$.
Let $T=T_1=T_2$.
Let us show that $v_1=v_2$ too.
Since they both are prefixes of $z$,
one of them is a prefix of the other.
%
Suppose $v_1$ is a prefix of $v_2$:
By the second requirement of $\KPM$-correctness,
$E(p_2,v_1)=E(p_1,v_1)=T$.
By definition, $\zeta_{d,T}(v_2)$ is the shortest prefix
of $v_2$ belonging to $S_{d,T}$ and such that $E(p_2,\cdot)=T$,
therefore $\zeta_{d,T}(v_2)$ is a prefix of $v_1$,
and thus $v_1=v_2$.
%
Suppose $v_2$ is a prefix of $v_1$.
Since $E(p_1,v_1)=T$ and $E(p_2,v_2)=T$,
we have $E(p_1,v_2)=T$ by the third requirement of $\KPM$-correctness.
As before, we get $\zeta_{d,T}(v_1)$ is a prefix of $v_2$,
and $v_1=v_2$.
\qedx\end{proof}
\renewcommand{\proofname}{\bfseries Proof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Discussion}\label{secDisc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
\paradot{Conclusion}
%-------------------------------%
We evaluated the quality of predicting a stochastic sequence at an
intermediate time, when some beginning of the sequence has already
been observed, by estimating the future loss of the universal
Solomonoff predictor $M$. We proved general upper bounds for the
discrepancy between conditional values of the predictor $M$ and
the true environment $\mu$, and demonstrated a kind of tightness
for these bounds. One of the bounds is based on a new variant of
conditional algorithmic complexity $\KPM$, which has interesting
properties on its own. In contrast to standard prefix complexity
$\KP$, $\KPM$ is a monotone function of conditions:
$\kpm{y}{xz}\leq\kpm{y}{x}$.
%-------------------------------%
\paradot{General Bayesian posterior bounds}
%-------------------------------%
A natural question is whether posterior bounds for general
Bayes mixtures based on general $\M\ni\mu$ could also be derived.
The mixture representation~(\ref{xidefsp}) can be written
as a posterior representation
\beqn
\xi(y|x) \;=\; \sum_{\nu\in\M} w_\nu(x)\nu(y|x)
\;\geq\; w_\mu(x)\mu(y|x), \qmbox{where}
w_\nu(x) := w_\nu{\nu(x)\over\xi(x)}
\eeqn
is the posterior belief in $\nu$ after observing $x$ (and $w_\nu$ is the prior).
This immediately implies the bound
$D_{l:\infty}\leq\ln w_\mu(\o_{<l})^{-1}$.