%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% On Sequence Prediction for Arbitrary Measures %%
%% Daniil Ryabko & Marcus Hutter: Start 2006 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt,twoside]{article}
\usepackage{amssymb} % provides \varnothing used below
\usepackage{hyperref}
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm \sloppy\lineskip=0pt
%-------------------------------%
% Spacings %
%-------------------------------%
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
\def\paradot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1.}}}
\def\paranodot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1}}}
%-------------------------------%
% Environments %
%-------------------------------%
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{definition}[theorem]{Definition}
\def\qed{\hspace*{\fill}\rule{1.4ex}{1.4ex}$\quad$\\}
\def\qedx{}
\newenvironment{proof}{\paradot{Proof}}{\qed}
\newenvironment{keywords}%
{\centerline{\bf\small Keywords}\begin{quote}\small}%
{\par\end{quote}\vskip 1ex}
%-------------------------------%
% More Macro-Definitions %
%-------------------------------%
\let\emptyset\varnothing
\let\phi\varphi
\def\as{\mbox{ a.s.}}
\def\argmax{\operatorname{argmax}}
\def\P{{\bf P}}
\def\E{{\bf E}}
\def\up{\overline}
\def\low{\underline}
\def\r#1#2{r_{#1..#2}}
\def\nq{\hspace{-1em}}
\def\odt{{\textstyle{1\over 2}}}
\def\odn{{\textstyle{1\over n}}}
\def\odm{{\textstyle{1\over m}}}
\def\SetB{I\!\!B}
\def\SetR{I\!\!R}
\def\SetN{I\!\!N}
\def\C{{\cal C}} % Set of prob. distributions
\def\X{{\cal X}} % generic set
\def\O{{\cal O}} % observation space
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\def\l{{\ell}} % length of string or program
\def\eps{\varepsilon} % for small positive number
\def\epstr{\epsilon} % for empty string
\def\toinfty#1{\stackrel{\smash{#1}\to\infty}{\longrightarrow}}
\def\a{\alpha}
\def\o{\omega}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\vspace{-3ex}\normalsize\sc Technical Report \hfill IDSIA-13-06
\vskip 2mm\bf\Large\hrule height5pt \vskip 6mm
On Sequence Prediction for Arbitrary Measures
\vskip 6mm \hrule height2pt \vskip 5mm}
\author{{{\bf Daniil Ryabko} and {\bf Marcus Hutter}}\\[3mm]
\normalsize IDSIA, Galleria 2, CH-6928\ Manno-Lugano, Switzerland%
\thanks{This work was supported by the Swiss NSF grant 200020-107616.
}\\
\normalsize \{daniil,marcus\}@idsia.ch, \ http://www.idsia.ch/$^{_{_\sim}}\!$\{daniil,marcus\} }
\date{16 June 2006}
\maketitle
\vspace{-4ex}
\begin{abstract}
Suppose we are given two probability measures on the set of
one-way infinite finite-alphabet sequences and consider the
question when one of the measures predicts the other, that is,
when conditional probabilities converge (in a certain sense) when
one of the measures is chosen to generate the sequence. This
question may be considered a refinement of the problem of sequence
prediction in its most general formulation: for a given class of
probability measures, does there exist a measure which predicts
all of the measures in the class? To address this problem, we find
some conditions on local absolute continuity which are sufficient
for prediction and which generalize several different notions
which are known to be sufficient for prediction. We also formulate
some open questions to outline a direction for finding the
conditions on classes of measures for which prediction is
possible.
\vspace{3ex}
\def\contentsname{\centering\normalsize Contents}
{\parskip=-2.5ex\tableofcontents}
\end{abstract}
\begin{keywords}
Sequence prediction, %
local absolute continuity, %
non-stationary measures, %
average/expected criteria, %
absolute/KL divergence, %
mixtures of measures. %
\end{keywords}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Let a sequence $x_t$, $t\in\SetN$, of letters from some finite
alphabet $\X$ be generated by some probability measure $\mu$.
Having observed the first $n$ letters $x_1,\dots,x_n$ we want to
predict the probability that the next letter is $x$, for each
$x\in\X$. This task is motivated by numerous applications ---
from weather forecasting and stock market prediction to data
compression.
If the measure $\mu$ is known completely, then the best forecast
one can make for the $(n+1)$st outcome of a sequence
$x_1,\dots,x_n$ is given by the $\mu$-conditional probabilities of
$x\in\X$ given $x_1,\dots,x_n$. On the other hand, it is immediately
apparent that if nothing is known about the distribution $\mu$
generating the sequence then no prediction is possible, since for
any predictor there is a measure on which it errs (gives
inadequate probability forecasts) on every step. Thus one has to
restrict the attention to some class of measures. Laplace was
perhaps the first to address the question of sequence prediction,
his motivation being as follows: Suppose that we know that the Sun
has risen every day for 5000 years, what is the probability that
it will rise tomorrow? He suggested assuming that the probability
that the Sun rises is the same every day and the trials are
independent of each other. Thus Laplace considered the task of
sequence prediction when the true generating measure belongs to
the family of Bernoulli i.i.d.\ measures with binary alphabet
$\X=\{0,1\}$. The predicting measure suggested by Laplace was
$\rho_L(x_{n+1}=1|x_1,\dots,x_n)=\frac{k+1}{n+2}$ where $k$ is the
number of 1s in $x_1,\dots,x_n$. The conditional probabilities of
Laplace's measure $\rho_L$ converge to the true conditional
probabilities $\mu$-almost surely under any Bernoulli i.i.d.\
measure $\mu$. This approach generalizes to the problem of
predicting any finite-memory (e.g.\ Markovian) measure. Moreover,
in \cite{BRyabko:88} a measure $\rho_R$ was constructed for
predicting an arbitrary stationary measure. The conditional
probabilities of $\rho_R$ converge to the true ones {\em on
average}, where average is taken over time steps (that is, in
Cesaro sense), $\mu$-almost surely for any stationary measure
$\mu$. However, as shown in the same work, there is no
measure for which conditional probabilities converge to the true
ones $\mu$-a.s.\ for every stationary $\mu$. Thus we can see that
already for the problem of predicting outcomes of a stationary
measure two criteria of prediction arise: prediction in the
average (or in Cesaro sense) and prediction on each step, and the
solution exists only for the former problem.
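As a small illustration (our own sketch, not part of the original discussion; the function name `laplace_prob` is ours), Laplace's rule for the binary alphabet can be computed exactly in rational arithmetic:

```python
from fractions import Fraction

def laplace_prob(seq, x):
    """Laplace's rule for a binary alphabet: the predicted probability
    that the next symbol equals x is (k + 1) / (n + 2), where k is the
    number of occurrences of x among the n observed symbols."""
    return Fraction(seq.count(x) + 1, len(seq) + 2)

# With no data the prediction is uniform: 1/2.
print(laplace_prob([], 1))
# After 5000 observed sunrises, the probability of another one
# is 5001/5002.
print(laplace_prob([1] * 5000, 1))
```

Note that the forecast is never exactly 0 or 1, which is what makes $\rho_L$ usable as a predicting measure for the whole Bernoulli class.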
But what if the measure generating the sequence is not stationary?
A different assumption one can make is that the measure $\mu$
generating the sequence is computable. Solomonoff
\cite[Eq.(13)]{Solomonoff:64} suggested a measure $\xi$ for
predicting any computable probability measure. The key observation
here is that the class of all computable probability measures is
countable; let us denote it by $(\nu_i)_{i\in\SetN}$. A Bayesian
predictor $\xi$ for a countable class of measures
$(\nu_i)_{i\in\SetN}$ is constructed as follows:
$\xi(A)=\sum_{i=1}^\infty w_i\nu_i(A)$ for any measurable set $A$,
where the weights $w_i$ are positive and sum to one\footnote{It
is not necessary for prediction that the weights sum to one. In
\cite{Solomonoff:78} and \cite{Zvonkin:70} $w_i=2^{-K(i)}$
where $K$ stands for the prefix Kolmogorov complexity, and so the
weights do not sum to 1. Further, the $\nu_i$ and $\xi$ are then only
semi-measures.}. The best predictor for a measure $\mu$ is the
measure $\mu$ itself. The Bayesian predictor simply takes the
weighted average of the predictors for all measures in the class
--- for countable classes this is possible. It was shown by
Solomonoff \cite{Solomonoff:78} that $\xi$-conditional
probabilities converge to $\mu$-conditional probabilities almost
surely for any computable measure $\mu$. In fact this is a special
case of a more general (though without convergence rate) result of
Blackwell and Dubins \cite{Blackwell:62} which states that if a
measure $\mu$ is absolutely continuous with respect to a measure
$\rho$, then the $\rho$-conditional probabilities converge to the
$\mu$-conditional probabilities in total variation
$\mu$-almost surely. Convergence in total variation means
prediction in a very strong sense~--- convergence of conditional
probabilities of arbitrary events (not just the next outcome), or
prediction with arbitrary fast growing horizon. Since for $\xi$ we
have $\xi(A)\ge w_i\nu_i(A)$ for every measurable set $A$ and for
every $\nu_i$, each $\nu_i$ is absolutely continuous with respect
to $\xi$.
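The Bayesian mixture construction can be sketched as follows (a minimal illustration of ours: a finite class of Bernoulli measures stands in for the countable class, and the function names are ours, not from the literature):

```python
def bernoulli_prob(seq, p):
    """nu_p(x_{1:n}) for the Bernoulli i.i.d. measure with parameter p."""
    prob = 1.0
    for x in seq:
        prob *= p if x == 1 else 1.0 - p
    return prob

def mixture_next_prob(seq, params, weights):
    """xi(x_{n+1}=1 | x_{1:n}) = xi(x_{1:n} 1) / xi(x_{1:n}),
    where xi = sum_i w_i * nu_i."""
    num = sum(w * bernoulli_prob(list(seq) + [1], p)
              for p, w in zip(params, weights))
    den = sum(w * bernoulli_prob(seq, p) for p, w in zip(params, weights))
    return num / den

params, weights = [0.1, 0.5, 0.9], [1 / 3, 1 / 3, 1 / 3]
# After observing many 1s the mixture concentrates on the measure in the
# class that fits the data best, and its forecast approaches 0.9.
print(mixture_next_prob([1] * 100, params, weights))
```

By construction $\xi(x_{1:n})\ge w_i\nu_i(x_{1:n})$ for each $i$, which is exactly the dominance-with-a-constant property discussed in the text.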
Thus the problem of sequence prediction for certain classes of
measures (such as the class of all stationary measures or the
class of all computable measures) was often addressed in the
literature. Although the mentioned classes of measures are
interesting in their own right, in applications it is often hard
to decide which of these assumptions a problem at hand satisfies;
not to mention such practical issues as the fact that a predicting
measure for all computable measures is necessarily non-computable
itself.
Moreover, to be able to generalize the solutions of the sequence
prediction problem to such problems as active learning, where
outcomes of a sequence may depend on actions of the predictor, one
has to understand better under which conditions the problem of
sequence prediction is solvable. In particular, in active learning,
the stationarity assumption does not seem to be applicable (since
the predictions are non-stationary), although, say, the Markov
assumption is often applicable and is extensively studied. Thus, we
formulate the following general questions which we start to
address in the present work:
%-------------------------------%
\paradot{General motivating questions}
%-------------------------------%
For which classes of measures is sequence prediction possible?
Under which conditions does a measure $\rho$ predict a measure
$\mu$?
As we have seen, these questions have many facets, and in
particular there are many criteria of prediction to be considered,
such as almost sure convergence of conditional probabilities,
convergence in average, etc. Extensive as the literature on
sequence prediction is, these questions in their full generality
have not received much attention. One line of research which
exhibits this kind of generality consists in extending the result
of Blackwell and Dubins mentioned above, which states that if
$\mu$ is absolutely continuous with respect to $\rho$, then $\rho$
predicts $\mu$ in total variation distance. In \cite{Jackson:99}
the following question is studied: given a class of measures $\C$
and a prior (``meta''-measure) $\lambda$ over this class, do the
conditional probabilities of the Bayesian mixture of $\C$
w.r.t.\ $\lambda$ converge to the true $\mu$-probabilities
(weakly merge, in the terminology of \cite{Jackson:99}) for
$\lambda$-almost every measure $\mu$ in $\C$? This question can be
considered solved, since the
authors provide necessary and sufficient conditions on the
measure given by the mixture of the class $\C$ w.r.t.
$\lambda$ under which prediction is possible. The major difference
from the general questions we posed above is that we do not wish
to assume that we have a measure on our class of measures. For
large (non-parametric) classes of measures it may not be intuitive
which measure over it is natural; rather, the question is whether
a ``natural'' measure which can be used for prediction exists.
To address the general questions posed, we start with the
following observation. As it was mentioned, for a Bayesian mixture
$\xi$ of a countable class of measures $\nu_i$, $i\in\SetN$, we
have $\xi(A)\ge w_i\nu_i(A)$ for any $i$ and any measurable set
$A$, where $w_i$ is a constant. This condition is stronger than
the assumption of absolute continuity and is sufficient for
prediction in a very strong sense. Since we are willing to be
satisfied with prediction in a weaker sense (e.g.\ convergence of
conditional probabilities), let us make a weaker assumption: Say
that {\em a measure $\rho$ dominates a measure $\mu$ with
coefficients $c_n>0$} if
\beq\label{eq:dom}
\rho(x_1,\dots,x_n) \;\geq\; c_n \mu(x_1,\dots,x_n)
\eeq
for all $x_1,\dots,x_n$.
%-------------------------------%
\paranodot{The first concrete question}
%-------------------------------%
we pose is, under what conditions on $c_n$ does (\ref{eq:dom})
imply that $\rho$ predicts $\mu$? Observe that if
$\rho(x_1,\dots,x_n)>0$ for any $x_1,\dots,x_n$ then any measure
$\mu$ is {\em locally} absolutely continuous with respect to
$\rho$ (that is, the measure $\mu$ restricted to the first $n$
trials $\mu|_{\X^n}$ is absolutely continuous w.r.t.
$\rho|_{\X^n}$ for each $n$), and moreover, for any measure $\mu$
some constants $c_n$ can be found that satisfy (\ref{eq:dom}). For
example, if $\rho$ is the Bernoulli i.i.d.\ measure with parameter
$\odt$ and $\mu$ is any other measure, then (\ref{eq:dom}) is
(trivially) satisfied with $c_n=2^{-n}$. Thus we know that if
$c_n\equiv c$ then $\rho$ predicts $\mu$ in a very strong sense,
whereas exponentially decreasing $c_n$ are not enough for
prediction. Perhaps somewhat surprisingly, we will show that
dominance with any subexponentially decreasing coefficients is
sufficient for prediction, in a weak sense of convergence of
expected averages. Dominance with any polynomially decreasing
coefficients, and also with coefficients decreasing (for example)
as $c_n=\exp(-\sqrt{n}/\log n)$, is sufficient for (almost sure)
prediction on average (i.e.\ in Cesaro sense). However, for
prediction on every step we have a negative result: for any
dominance coefficients that go to zero there exists a pair of
measures $\rho$ and $\mu$ which satisfy~(\ref{eq:dom}) but $\rho$
does not predict $\mu$ in the sense of almost sure convergence of
probabilities. Thus the situation is similar to that for
predicting any stationary measure: prediction is possible in the
average but not on every step.
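Indeed, for the coefficients $c_n=\exp(-\sqrt{n}/\log n)$ just
mentioned, both the subexponential decrease and the stronger
summability condition used in Section~\ref{sec:dom} can be checked
directly:
\beqn
\odn\log c_n^{-1} \;=\; \frac{1}{\sqrt{n}\log n} \;\toinfty{n}\; 0,
\qquad
\sum_{n=2}^\infty\frac{\log c_n^{-1}}{n^2}
\;=\; \sum_{n=2}^\infty\frac{1}{n^{3/2}\log n} \;<\; \infty.
\eeqn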
Note also that for Laplace's measure $\rho_L$ it can be shown that
$\rho_L$ dominates any i.i.d.\ measure $\mu$ with linearly
decreasing coefficients $c_n={1\over n+1}$; a generalization of
$\rho_L$ for predicting all measures with memory $k$ (for a given
$k$) dominates them with polynomially decreasing coefficients.
Thus dominance with decreasing coefficients generalizes (in a
sense) predicting countable classes of measures (where we have
dominance with a constant), absolute continuity (via local
absolute continuity), and predicting i.i.d.\ and finite-memory
measures.
Another way to look for generalizations is as follows. The Bayes
mixture $\xi$, being a sum of countably many measures
(predictors), possesses some of their predicting properties. In
general, which predictive properties are preserved under
summation? In particular, if we have two predictors $\rho_1$ and
$\rho_2$ for two classes of measures, we are interested in the
question whether $\odt(\rho_1+\rho_2)$ is a predictor for the
union of the two classes. An answer to this question would improve
our understanding of how far a class of measures for which a
predicting measure exists can be extended without losing this
property.
%-------------------------------%
\paranodot{{\rm Thus,} the second question}
%-------------------------------%
we consider is the following: suppose that a measure $\rho$
predicts $\mu$ (in some weak sense), and let $\chi$ be some other
measure (e.g.\ a predictor for a different class of measures). Does
the measure $\rho'=\odt(\rho+\chi)$ still predict $\mu$?
That is, we ask to which criteria of prediction quality the idea
of taking a Bayesian sum generalizes. Absolute continuity is
preserved under summation, along with its (strong) prediction
ability. It was mentioned in \cite{BRyabko:06} that prediction in
the (weak) sense of convergence of expected averages of
conditional probabilities is preserved under summation. Here we
find that several stronger notions of prediction are not preserved
under summation.
Thus we address the following two questions. Is dominance with
decreasing coefficients sufficient for prediction in some sense,
under some conditions on the coefficients? And, if a measure
$\rho$ predicts a measure $\mu$ in some sense, does the measure
$\odt(\rho+\chi)$ also predict $\mu$ in the same sense, where
$\chi$ is an arbitrary measure? Considering different criteria of
prediction (a.s.\ convergence of conditional probabilities, a.s.\
convergence of averages, etc.) in the above two questions we
obtain not two but many different questions, some of which we
answer in the positive and some in the negative, yet some are
left open.
%-------------------------------%
\paradot{Contents}
%-------------------------------%
The paper is organized as follows. Section~\ref{sec:def}
introduces necessary notation and measures of divergence of
probability measures. Section~\ref{sec:dom} addresses the question
of whether dominance with decreasing coefficients is sufficient
for prediction, while in Section~\ref{sec:sum} we consider the
question of summing a predictor with an arbitrary measure. Both
sections~\ref{sec:dom} and~\ref{sec:sum} also propose some open
questions and directions for future research. In
Section~\ref{sec:misc} we discuss some interesting special cases
of the questions considered, and also some related problems.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Notation and Definitions}\label{sec:def}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We consider processes on the set of one-way infinite sequences
$\X^\infty$ where $\X$ is a finite set (alphabet).
In the examples we will often assume $\X=\{0,1\}$. The notation
$x_{1:n}$ is used for $x_1,\dots,x_n$, and $x_{<n}$ for
$x_1,\dots,x_{n-1}$.
We use the following measures of divergence of probability
measures: the absolute distance
\beqn
a_n(\mu,\rho|x_{<n}) \;:=\; \sum_{x\in\X}|\mu(x_n=x|x_{<n})-\rho(x_n=x|x_{<n})|,
\eeqn
the KL divergence
\beqn
d_n(\mu,\rho|x_{<n}) \;:=\; \sum_{x\in\X}\mu(x_n=x|x_{<n})\log\frac{\mu(x_n=x|x_{<n})}{\rho(x_n=x|x_{<n})},
\eeqn
their averages over time
\beqn
\bar a_n(\mu,\rho|x_{1:n}) \;:=\; \odn\sum_{t=1}^n a_t(\mu,\rho|x_{<t})
\qmbox{and}
\bar d_n(\mu,\rho|x_{1:n}) \;:=\; \odn\sum_{t=1}^n d_t(\mu,\rho|x_{<t}),
\eeqn
and their expectations $\E_\mu a_n$, $\E_\mu d_n$, $\E_\mu\bar a_n$
and $\E_\mu\bar d_n$, where the expectation is taken with respect
to $x_{1:n}$ drawn according to $\mu$. We say that $\rho$ {\em
predicts} $\mu$ in one of these senses if the corresponding
quantity tends to zero ($\mu$-almost surely for the non-expected
versions) as $n\to\infty$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Dominance with Decreasing Coefficients}\label{sec:dom}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{definition}[Dominance]\label{def:dom}
We say that a measure $\rho$ {\em dominates} a measure $\mu$ with
coefficients $c_n>0$ iff
\beqn
\rho(x_{1:n}) \;\geq\; c_n \mu(x_{1:n})
\eeqn
for all $x_{1:n}$.
\end{definition}
Suppose that $\rho$ dominates $\mu$ with decreasing coefficients
$c_n$. Does $\rho$ predict $\mu$ in (expected, expected average)
KL divergence (absolute distance)?
%
First let us give an example.
\begin{proposition}[Dominance of Laplace's measure]\label{prop:Laplace}
Let $\rho_L$ be the Laplace measure, given by
$\rho_L(x_{n+1}=a|x_{1:n})=\frac{k+1}{n+|\X|}$ for any $a\in\X$ and any
$x_{1:n}\in\X^n$, where $k$ is the number of occurrences of $a$
in $x_{1:n}$. Then
\beqn
\rho_L(x_{1:n}) \;\geq\; \frac{n!}{(n+|\X|-1)!} \; \mu({x_{1:n}})
\eeqn
for any measure $\mu$ which generates independently and
identically distributed symbols. This bound is sharp.
\end{proposition}
\begin{proof}
We will only give the proof for $\X=\{0,1\}$, the general case is
analogous. To calculate $\rho_L(x_{1:n})$ observe that it only
depends on the number of 0s and 1s in $x_{1:n}$ and not on their
order. Thus we compute $\rho_L(x_{1:n})=\frac{k!(n-k)!}{(n+1)!}$
where $k$ is the number of 1s. For any measure $\mu$ such that
$\mu(x_n=1)=p$ for some $p\in[0,1]$ independently for all $n$, and
for Laplace measure $\rho_L$ we have
\bqan
\frac{\mu(x_{1:n})}{\rho_L(x_{1:n})}
& = & \frac{(n+1)!}{k!(n-k)!}p^k(1-p)^{n-k} \\
& = & (n+1){n\choose k}p^k(1-p)^{n-k} \\
& \le & (n+1) \sum_{k=0}^n{n\choose k}p^k(1-p)^{n-k}=n+1,
\eqan
for any $n$-letter word $x_1,\dots,x_n$ where $k$ is the number of 1s in it.
%
The bound is attained when $p=1$, so that $k=n$, $\mu(x_{1:n})=1$,
and $\rho_L(x_{1:n})=\frac{1}{n+1}$. \qedx
\end{proof}
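The bound of Proposition~\ref{prop:Laplace} is also easy to verify exhaustively for small $n$. The following sketch (ours, in exact rational arithmetic; all function names are ours) checks the binary case:

```python
from fractions import Fraction
from itertools import product

def laplace_word_prob(seq):
    """rho_L(x_{1:n}) for a binary word: the product of Laplace's
    conditional probabilities, which equals k!(n-k)!/(n+1)! with k ones."""
    prob = Fraction(1)
    for i, x in enumerate(seq):
        prob *= Fraction(seq[:i].count(x) + 1, i + 2)
    return prob

def bernoulli_word_prob(seq, p):
    prob = Fraction(1)
    for x in seq:
        prob *= p if x == 1 else 1 - p
    return prob

# Exhaustively check rho_L(x_{1:n}) >= mu(x_{1:n}) / (n+1) over all binary
# words of length n and a grid of Bernoulli parameters p.
n = 6
for word in product([0, 1], repeat=n):
    for p in (Fraction(i, 10) for i in range(11)):
        assert laplace_word_prob(word) >= bernoulli_word_prob(word, p) / (n + 1)

# Sharpness: at p = 1 on the all-ones word, rho_L = 1/(n+1) exactly.
print(laplace_word_prob((1,) * n))
```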
Thus for Laplace's measure $\rho_L$ and binary $\X$ we have
$c_n=\O(\frac{1}{n})$. As mentioned in the
introduction, in general, exponentially decreasing coefficients
$c_n$ are not sufficient for prediction, since (\ref{eq:dom}) is
satisfied with $\rho$ being a Bernoulli i.i.d.\ measure and $\mu$
any other measure. On the other hand, the following proposition
shows that in a weak sense of convergence in expected average KL
divergence (or absolute distance) the property~(\ref{eq:dom}) with
subexponentially decreasing $c_n$ is sufficient. We also recall
that if the $c_n$ are bounded away from zero then prediction in the
strong sense of total variation is possible.
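The insufficiency of exponential coefficients can be made concrete numerically (a sketch of ours, not from the paper): with $\rho$ the Bernoulli($\odt$) measure and $\mu$ a Bernoulli($p$) measure, the expected average KL divergence equals the constant $\mathrm{KL}(p\,\|\,\odt)$ for every $n$, so it converges to zero only when $p=\odt$:

```python
import math

# rho = Bernoulli(1/2) dominates every mu with c_n = 2^{-n}, since
# rho(x_{1:n}) = 2^{-n} >= 2^{-n} * mu(x_{1:n}).  Yet each step contributes
# the same one-step KL divergence KL(p || 1/2), so the average never decays.
def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), natural log."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total

print(kl_bernoulli(0.9, 0.5))  # a positive constant, roughly 0.368
print(kl_bernoulli(0.5, 0.5))  # 0.0
```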
\begin{theorem}[\boldmath $\E\bar d\to 0$ and $\E\bar a\to 0$]\label{th:dom}
Let $\mu$ and $\rho$ be two measures on $\X^\infty$ and suppose that
$
\rho(x_{1:n})\ge c_n \mu(x_{1:n})
$
for any $x_{1:n}$, where $c_n$ are positive constants satisfying
$\frac{1}{n}\log c_n^{-1}\rightarrow0$. Then $\rho$ predicts
$\mu$ in expected average KL divergence $\E_\mu \bar
d_n(\mu,\rho)\rightarrow0$ and in expected average absolute
distance $\E_\mu \bar a_n(\mu,\rho)\rightarrow0$.
\end{theorem}
The proof of this theorem is based on the same idea as the
proof of convergence of the Solomonoff predictor to any of its
summands in \cite{BRyabko:88}; see also \cite{Hutter:04uaibook}.
\begin{proof}
For convergence in average expected KL divergence we have
\bqan
\E_\mu\bar d_n(\mu,\rho)
& = & \frac{1}{n}\E\sum_{t=1}^n \sum_{x_t\in\X} \mu(x_t|x_{<t})\log\frac{\mu(x_t|x_{<t})}{\rho(x_t|x_{<t})} \\
& = & \frac{1}{n}\E\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}
\;\le\; \frac{1}{n}\log c_n^{-1} \;\rightarrow\; 0,
\eqan
where the second equality follows from the chain rule for
conditional probabilities, and the inequality from the dominance
condition. Convergence in expected average absolute distance
follows from Pinsker's inequality $a_t^2\le 2d_t$, which together
with Jensen's inequality gives
$\E_\mu\bar a_n \le \sqrt{2\,\E_\mu\bar d_n}\rightarrow0$. \qedx
\end{proof}
In contrast, dominance with coefficients that decrease to zero, no
matter how slowly, is not sufficient for prediction on every step:
\begin{proposition}[\boldmath $d_n\not\to0$, $a_n\not\to0$]\label{th:nodom}
For any sequence of positive constants $c_n\rightarrow0$ there
exist measures $\mu$ and $\rho$ and an $\epsilon>0$ such
that
\beqn
\rho(x_{1:n}) \;\ge\; c_n \mu(x_{1:n})
\eeqn
for all $x_{1:n}$, yet $a_n(\mu,\rho|x_{<n})>\epsilon$ and
$d_n(\mu,\rho|x_{<n})>\epsilon$ infinitely often $\mu$-a.s.
\end{proposition}
\begin{proof}
Let $\mu$ be concentrated on the sequence $11111\dots$ (that is
$\mu(x_n=1)=1$ for all $n$), and let $\rho(x_n=1)=1$ for all $n$
except for a subsequence of steps $n=n_k$, $k\in\SetN$ on which
$\rho(x_{n_k}=1)=1/2$ independently of each other. It is easy to
see that choosing $n_k$ sparse enough we can make
$\rho(1_1\dots1_n)$ decrease arbitrarily slowly; yet
$|\mu(x_{n_k}=1|x_{<n_k})-\rho(x_{n_k}=1|x_{<n_k})|=1/2$ for all $k$. \qedx
\end{proof}
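The construction in the proof is easy to simulate. The following sketch (ours) takes $n_k=2^k$ and shows that $\rho(1^n)$, and hence the dominance coefficients, decrease only polynomially, even though the one-step error stays $1/2$ at every $n_k$:

```python
# mu is concentrated on 111..., while rho predicts 1 with probability 1
# except at the sparse steps n_k (powers of two here), where it predicts
# 1 with probability 1/2.
def is_sparse(t):
    return t & (t - 1) == 0          # t is a power of two

def rho_all_ones(n):
    """rho(1^n): product over t <= n of rho(x_t = 1 | 1^{t-1})."""
    prob = 1.0
    for t in range(1, n + 1):
        prob *= 0.5 if is_sparse(t) else 1.0
    return prob

# rho(1^n) = 2^{-(#powers of two <= n)}, roughly 1/(2n); making the n_k
# sparser makes this decrease arbitrarily slowly.
print(rho_all_ones(1024))  # 2**-11 = 0.00048828125
```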
Thus for the first question --- whether dominance with some
coefficients decreasing to zero is sufficient for prediction, we
have the following table of questions and answers, where, in
fact, positive answers for $a_n$ are implied by positive answers
for $d_n$ and vice versa for the negative answers: \vskip 2mm
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|}\hline
$\E \bar d_n $ & $\bar d_n \vphantom{\bar{\hat d}} $ & $d_n$ & $\E\bar a_n$& $\bar a_n$& $a_n$ \\\hline
+ & + & $-$ & + & + & $-$\\\hline
\end{tabular}
\end{center}
However, if we take into account the conditions on the
coefficients, we see some open problems left, and different
answers for $\bar d_n$ and $\bar a_n$ may be obtained. Following
is the table of conditions on dominance coefficients and answers
to the questions whether these conditions are sufficient for
prediction (coefficients bounded from below are included for the
sake of completeness).
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}\hline
& $\E \bar d_n $ & $\bar d_n \vphantom{\bar{\hat d}} $ & $d_n$ & $\E\bar a_n$& $\bar a_n$& $a_n$\\\hline
$\log c_n^{-1}=o(n)$ & + & ? & $-$ & + & ? & $-$ \\\hline
$\sum_{n=1}^\infty\frac{\log c_n^{-1}}{n^2}< \infty$ & + & + & $-$ & + & + & $-$ \\\hline\hline
$c_n\ge c>0$ & + & + & $+$ & + & + &+ \\\hline
\end{tabular}
\end{center}
We know from Proposition~\ref{th:nodom} that the condition
$c_n\ge c>0$ for convergence in $d_n$ cannot be improved; thus
the open problem left is to find whether $\log c_n^{-1}=o(n)$ is
sufficient for prediction in $\bar d_n$ or at least in $\bar a_n$.
Another open problem is to find whether any conditions on
dominance coefficients are necessary for prediction; so far we
only have some sufficient conditions.
On the one hand, the
obtained results suggest that some form of dominance with decreasing
coefficients may be necessary for prediction, at least in the
sense of convergence of averages. On the other hand, the
condition~(\ref{eq:dom}) is uniform over all sequences which probably
is not necessary for prediction. As for prediction in the sense of
almost sure convergence, perhaps more subtle behavior of the
ratio $\frac{\mu(x_{1:n})}{\rho(x_{1:n})}$ should be analyzed,
since dominance with decreasing coefficients is not sufficient for
prediction in this sense.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Preservation of the Predictive Ability under Summation with an Arbitrary Measure}\label{sec:sum}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Now we turn to the question whether, given a measure $\rho$ that
predicts a measure $\mu$ in some sense, the ``contaminated''
measure $(1-\eps)\rho+\eps\chi$ for some $0<\eps<1$ also predicts
$\mu$ in the same sense, where $\chi$ is an arbitrary measure.
Since most considerations are independent of the choice of $\eps$,
in particular the results in this section, we set $\eps=\odt$
for simplicity. We define
\begin{definition}[Contamination]\label{def:cont}
By ``$\rho$ contaminated with $\chi$'' we mean
$\rho':=\odt(\rho+\chi)$.
\end{definition}
Positive results can be obtained for convergence in expected
average KL divergence. The statement of the next proposition in a
different form was mentioned in \cite{BRyabko:06,Hutter:06usp}.
Since the proof is simple we present it here for the sake of
completeness; it is based on the same ideas as the proof of
Theorem~\ref{th:dom}.
\begin{proposition}[\boldmath $\E\bar d\to 0$]\label{th:expaverklsum}
Let $\mu$ and $\rho$ be two measures on $\X^\infty$ and suppose
that $\rho$ predicts $\mu$ in expected average KL divergence. Then
so does the measure $\rho'=\odt(\rho +\chi)$ where $\chi$
is any other measure on $\X^\infty$.
\end{proposition}
\begin{proof}
\bqan
0 \;\le\; \E \bar d_n(\mu,\rho')
& = & \frac{1}{n}\E\sum_{t=1}^n \sum_{x_t\in\X} \mu(x_t|x_{<t})\log\frac{\mu(x_t|x_{<t})}{\rho'(x_t|x_{<t})} \\
& = & \frac{1}{n}\E\log\frac{\mu(x_{1:n})}{\rho'(x_{1:n})}
\;\le\; \frac{1}{n}\E\log\frac{2\mu(x_{1:n})}{\rho(x_{1:n})}
\;=\; \frac{\log 2}{n} + \E\bar d_n(\mu,\rho) \;\rightarrow\; 0,
\eqan
using $\rho'(x_{1:n})\ge\odt\rho(x_{1:n})$ and the chain rule for
conditional probabilities. \qedx
\end{proof}
In contrast, several stronger senses of prediction turn out not to
be preserved under contamination with an arbitrary measure. A
question we leave open is whether prediction in absolute distance
at least implies that the contaminated measure predicts in average
absolute distance:
\begin{conjecture}[\boldmath $\bar a\to 0$]\label{con:sumad}
Let $\mu$ and $\rho$ be two measures on $\X^\infty$, and suppose
that $\rho$ predicts $\mu$ in absolute distance. Then the measure
$\rho'=\odt(\rho+\chi)$ predicts $\mu$ in average absolute
distance, where $\chi$ is any other measure on $\X^\infty$.
\end{conjecture}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Special Cases and Related Problems}\label{sec:misc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
An interesting special case of the general questions posed is the
Solomonoff-style mixture over the class of all computable
measures. Let $(\nu_i)_{i\in\SetN}$ be this class and define
\beq\label{eq:xi}
\xi \;=\; \sum_{i=1}^\infty w_i\nu_i,
\eeq
where the $w_i$ are any summable positive weights,
$w_i>0$. Since $\xi(A)\ge
w_i\nu_i(A)$ for any measurable set $A$ ($\xi$ dominates any
$\nu_i$ with a constant $w_i$), it predicts all $\nu_i$ in the
sense of convergence in total variation, in KL divergence and
absolute distance. The question is what else it predicts: which
other measures? Laplace's measure $\rho_L$ is computable,
and hence is present in the sum~(\ref{eq:xi}). We know that
$\rho_L$ predicts any Bernoulli i.i.d.\ measure, so we can ask
whether $\xi$, being a sum of $\rho_L$ and some other measures,
still predicts all Bernoulli measures. The predictor $\rho_R$ from
\cite{BRyabko:88} is computable and predicts all stationary
measures in average KL divergence (and average absolute distance).
Thus $\xi$ is a sum of $\rho_R$ and some other measure, and we can
ask whether it still predicts all stationary measures. (In
expected average KL divergence this follows from
Proposition~\ref{th:expaverklsum} as also pointed out in
\cite{Hutter:06usp}.)
\begin{conjecture}[\boldmath $a\to 0$ for i.i.d.\ and $\bar d\to 0$ for stationary]\label{conj:adb}
For every i.i.d.\ measure $\nu$, the measure $\xi$ as defined
in~(\ref{eq:xi}) predicts $\nu$ in absolute distance. For every
stationary measure $\nu$, the measure $\xi$ predicts $\nu$ in
average KL divergence.
\end{conjecture}
\paradot{Proof idea}
For the first question, consider any Bernoulli i.i.d.\ measure
$\nu$. From Conjecture~\ref{con:sumad} and from the fact that
$\rho_L$ predicts $\nu$ in absolute distance we conclude that
$\xi$ predicts $\nu$ in average absolute distance. Since $\nu$ is
a Bernoulli i.i.d.\ measure, that is, probabilities on all steps are
equal and independent of each other, from any measure $\theta$
that predicts $\nu$ in average absolute distance we can make a
measure $\theta'$ which predicts $\nu$ in absolute distance as
follows
$\theta'(x_n=x|x_{