
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%             Predicting Non-Stationary Processes           %%
%%        Daniil Ryabko & Marcus Hutter: Start 2007          %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[12pt]{article}
\usepackage{amsmath,amssymb,amsthm}

\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm \sloppy\lineskip=0pt

\newenvironment{keyword}{\centerline{\bf\small
Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\def\qedx{}
\def\as{\mbox{ a.s.}}
\def\argmax{\operatorname{argmax}}
\def\P{{\bf P}}
\def\E{{\bf E}}
\def\odt{{\textstyle{1\over 2}}}
\def\odn{{\textstyle{1\over n}}}
\def\SetN{I\!\!N}
\def\C{{\cal C}}                        % Set of prob. distributions
\def\X{{\cal X}}                        % generic set
\def\O{{\cal O}}                        % observation space

\begin{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                      T i t l e - P a g e                      %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{\bf\Large\hrule height5pt \vskip 4mm
Predicting Non-Stationary Processes
\vskip 4mm \hrule height2pt}

\author{{ Daniil Ryabko${}^{a}$\footnote{This research was supported by the Swiss NSF grants 200020-107616 and 200021-113364.} \ \  Marcus Hutter${}^{b}$ }\\
\normalsize $^{a}$IDSIA, Galleria 2, CH-6928\ Manno, Switzerland, daniil@ryabko.net\\
\normalsize $^{b}$ RSISE@ANU and SML@NICTA,  Canberra, Australia, marcus@hutter1.net
}
\date{\vspace{3ex} May 2008}
\maketitle

\begin{abstract} \footnote{Parts of the results were reported at the Dagstuhl Seminar on Combinatorial and Algorithmic Foundations of Pattern and Association Discovery, Dagstuhl, Germany, 2006, see
\cite{dag}.}
Suppose we are given two probability measures on the set of
one-way infinite finite-alphabet sequences. Consider the question
of when one of the measures predicts the other, that is, when the
conditional probabilities converge (in a certain sense) if one of
the measures is chosen to generate the sequence. This
question may be considered a refinement of the problem of sequence
prediction in its most general formulation: for a given  class of
probability measures, does there exist a measure which predicts
all of the measures in the class? To address this problem, we find
some conditions on local absolute continuity which are sufficient
for prediction and  generalize several different notions
that are known to be sufficient for prediction. We also formulate
some open questions to outline a direction for finding the
conditions on classes of measures for which prediction is
possible.
\end{abstract}

\begin{keyword}
Sequence prediction, %
local absolute continuity, %
non-stationary measures, %
average/expected criteria, %
absolute/KL divergence, %
mixtures of measures. %
\end{keyword}

\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Let a sequence $x_t$, $t\in\SetN$ of letters from some finite
alphabet $\X$ be generated by some probability measure $\mu$ on
$\X^\infty$. Having observed the first $n$ letters we want to
predict what is the probability of the next letter being $x$, for
each $x\in\X$. This task is motivated by numerous applications~---
from weather forecasting and stock market prediction to source
coding and data compression.

If the measure $\mu$ is known then the best forecasts one can make
for the $(n+1)$st outcome are the $\mu$-conditional probabilities of
$x_{n+1}$ being $x\in\X$ given $x_1,\dots,x_n$. However, it is
clear that if nothing is known about the distribution $\mu$ then no
prediction is possible, since for any predictor there is a measure
on which it errs (gives grossly wrong probability forecasts) on
every step. Thus one has to restrict the attention to some class of
measures. Laplace was perhaps the first to address the question of
sequence prediction, asking the question of what is the probability
that the Sun will rise tomorrow given that it has risen every day
for 5000 years. He suggested to assume that the probability that the
Sun rises is the same every day and the trials are independent of
each other. Thus Laplace considered the task of sequence prediction
when the true generating measure belongs to the family of Bernoulli
i.i.d.\ measures  with binary alphabet $\X=\{0,1\}$. The
predicting measure he suggested was
$\rho_L(x_{n+1}=1|x_1,\dots,x_n)=\frac{k+1}{n+2}$, where $k$ is the
number of 1s in $x_1,\dots,x_n$. The conditional probabilities of
$\rho_L$ converge to the true conditional probabilities $\mu$-a.s.\
under any Bernoulli i.i.d.\ measure $\mu$. This approach generalizes
to the problem of predicting any finite-memory (e.g.\ Markovian)
measure. Moreover, in \cite{BRyabko:88} a measure $\rho_R$ was
constructed for predicting an arbitrary stationary measure. The
conditional probabilities of $\rho_R$ converge to the true ones {\em
on average}, where the average is taken over time steps, $\mu$-a.s.\
for any stationary measure $\mu$. However, as was shown in the
same work, there is no measure for which conditional probabilities
converge to the true ones $\mu$-a.s. for every stationary $\mu$.
Thus already for the problem of predicting outcomes of a stationary
measure two criteria of prediction arise: prediction in the average
(or in Cesaro sense) and prediction on each step, and the solution
exists only for the former problem.
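Laplace's rule is easy to simulate. The following sketch (the Bernoulli parameter $0.7$, the seed and the sample size are our illustrative choices) shows the predictor's conditional probability approaching the true parameter of an i.i.d.\ source:

```python
import random

def laplace_pred(ones, n):
    """Laplace's predictor for a binary alphabet: probability that
    the (n+1)st symbol is 1 after observing `ones` ones in n symbols."""
    return (ones + 1) / (n + 2)

random.seed(1)
p = 0.7      # true Bernoulli parameter (arbitrary choice)
n = 10000
ones = sum(1 for _ in range(n) if random.random() < p)

print(laplace_pred(0, 0))                      # prior forecast: 0.5
print(abs(laplace_pred(ones, n) - p) < 0.05)   # close to the true p
```

With no data the predictor outputs the uniform forecast $1/2$, and after many observations it tracks the empirical frequency, hence the true parameter.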

What if the measure generating the sequence is not stationary?
Another possible assumption is that the measure $\mu$ generating the
sequence is computable. Solomonoff \cite[Eq.(13)]{Solomonoff:64}
suggested a measure $\xi$ for predicting any computable probability
measure. Observe that the class of all computable probability
measures is countable; denote it by $(\nu_i)_{i\in\SetN}$. A
Bayesian predictor $\xi$ for such a class
 is given by
$\xi(A)=\sum_{i=1}^\infty w_i\nu_i(A)$ for any measurable set $A$,
where the weights $w_i$ are positive  and sum to one\footnote{It
is not necessary for prediction that the weights sum to one. In
\cite{Solomonoff:78} and \cite{Zvonkin:70} $w_i=2^{-K(i)}$
where $K$ stands for the prefix Kolmogorov complexity, and so the
weights do not sum to 1. Further, the $\nu$ and $\xi$ are only
semi-measures.}. It was shown in \cite{Solomonoff:78} that $\xi$-conditional
probabilities converge to $\mu$-conditional probabilities almost
surely for any computable measure $\mu$. In fact this is a special
case of a more general (though without convergence rate) result of
Blackwell and Dubins \cite{Blackwell:62}: if a
measure $\mu$ is absolutely continuous with respect to a measure
$\rho$, then the conditional measure of $\rho$ given $x_1,\dots,x_n$
converges to that of $\mu$ given $x_1,\dots,x_n$ in total variation
$\mu$-a.s.
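A minimal sketch of such a Bayesian predictor, with a finite class in place of the countable one for simplicity (the Bernoulli parameters, the equal weights and the data are our illustrative choices):

```python
from fractions import Fraction

# Toy class of measures: Bernoulli with parameters 1/3, 1/2, 2/3,
# mixed with equal weights w_i = 1/3 (illustrative choices).
params  = [Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)]
weights = [Fraction(1, 3)] * 3

def nu(p, seq):
    """Probability of the finite sequence seq under Bernoulli(p)."""
    prob = Fraction(1)
    for x in seq:
        prob *= p if x == 1 else 1 - p
    return prob

def xi(seq):
    """Bayesian mixture: xi(seq) = sum_i w_i nu_i(seq)."""
    return sum(w * nu(p, seq) for w, p in zip(weights, params))

def xi_cond_one(seq):
    """xi-conditional probability that the next symbol is 1."""
    return xi(seq + [1]) / xi(seq)

# On data with a 2/3 fraction of ones, the mixture's conditional
# probability approaches the parameter 2/3 of the best measure.
seq = [1, 1, 0] * 200
print(float(xi_cond_one(seq)))
```

The posterior weight concentrates on the measure with the highest likelihood, so the mixture's one-step forecast converges to that measure's conditional probability.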

Thus the problem of sequence prediction for certain classes of
measures has often been addressed in the literature. Although the
mentioned classes of measures are sufficiently interesting, in
applications it is often hard to decide which assumptions a problem
at hand complies with; not to mention such practical issues as the
fact that a predicting measure for all computable measures is
necessarily non-computable itself. Also, a more general approach may
be easier to extend to the problems of active learning, which is a
rather hard problem in itself (see e.g.\ \cite{Ryabko:06actopt}).

In this work  we start to address the following {\bf general
questions}: For which classes of measures is sequence prediction
possible? Under which conditions does a measure $\rho$ predict a
measure $\mu$?

Extensive as the literature on sequence prediction is, these
questions have not been formulated explicitly, and so the general
problem posed has not received much attention. One line of research which
exhibits this kind of generality consists in extending the result of
Blackwell and Dubins mentioned above, which states that if $\mu$ is
absolutely continuous with respect to $\rho$, then $\rho$ predicts
$\mu$ in total variation distance. In \cite{Jackson:99} the question
is studied of whether, given a class of measures $\C$ and a prior
(``meta''-measure) $\lambda$ over this class of measures, the
conditional probabilities of a Bayesian mixture of the class $\C$
w.r.t.\ $\lambda$ converge to the true $\mu$-probabilities (weakly
merge, in the terminology of \cite{Jackson:99}) for $\lambda$-almost
every measure $\mu$ in $\C$. This question can be considered solved, since
the authors provide necessary and sufficient conditions on the
measure given by the mixture of the class $\C$ w.r.t.\ $\lambda$
under which prediction is possible. The major difference from the
general  questions we posed above is that we do not wish to assume
that we have a measure on our class of measures. For large
(non-parametric) classes of measures it may not be intuitive which
measure over it is natural; rather, the question is  whether a
``natural'' measure which can be used for prediction exists.

We start with the following observation. For a Bayesian mixture
$\xi$ of a countable class of measures $\nu_i$, $i\in\SetN$, we have
$\xi(A)\ge w_i\nu_i(A)$ for any $i$ and any measurable set $A$,
where $w_i$ is a constant. This condition is stronger than the
assumption of absolute continuity and is sufficient for prediction
in a very strong sense (in total variation). Since we are willing to
be satisfied with prediction in a weaker sense (e.g.\ convergence of
conditional probabilities), we make a weaker assumption: Say that
{\em a measure $\rho$ dominates a measure $\mu$ with coefficients
$c_n>0$} if
\begin{equation}\label{eq:dom}
  \rho(x_1,\dots,x_n) \;\geq\; c_n \mu(x_1,\dots,x_n)
\end{equation}
for all $x_1,\dots,x_n$.

%-------------------------------%
{\bf The concrete question}
%-------------------------------%
we pose is, under what conditions on $c_n$ does (\ref{eq:dom}) imply
that $\rho$ predicts $\mu$? Observe that if $\rho(x_1,\dots,x_n)>0$
for any $x_1,\dots,x_n$ then any measure $\mu$ is {\em locally}
absolutely continuous with respect to $\rho$, and moreover, for any
measure $\mu$ some constants $c_n$ can be found that satisfy
(\ref{eq:dom}). For example, if $\rho$ is the Bernoulli i.i.d.\ measure
with parameter $\odt$ and $\mu$ is any other measure, then
(\ref{eq:dom}) is (trivially) satisfied with $c_n=2^{-n}$. Thus  if
$c_n\equiv c$ then $\rho$ predicts $\mu$ in a very strong sense,
whereas exponentially decreasing $c_n$ are not enough for
prediction. We will show that dominance with any subexponentially
decreasing coefficients is sufficient for prediction, in the weak
sense of convergence of expected averages. Dominance with any
polynomially decreasing coefficients (and some others) is sufficient
for (almost sure) prediction on time-average. However, for
prediction on every step we have a negative result: for any
dominance coefficients that go to zero there exists a pair of
measures $\rho$ and $\mu$ which satisfy~(\ref{eq:dom}) but $\rho$
does not predict $\mu$ in this sense. Thus the situation is similar
to that for predicting any stationary measure: prediction is
possible in the average but not on every step.

Note also that for Laplace's measure $\rho_L$ it can be shown that
$\rho_L$ dominates any i.i.d.\ measure $\mu$ with linearly
decreasing coefficients $c_n={1\over n+1}$. Thus dominance with
decreasing coefficients generalizes (in a sense) predicting
countable classes of measures (where we have dominance with a
constant), absolute continuity (via local absolute continuity), and
predicting i.i.d.\ and finite-memory measures.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Main results}\label{sec:def}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We consider processes on the set of  one-way infinite sequences
$\X^\infty$ where $\X$ is a finite set (alphabet). We use $x_{1:n}$
for $x_1,\dots,x_n$ and $x_{<n}$ for $x_1,\dots,x_{n-1}$,
$x_t\in\X$. The symbol $\mu$ is reserved for the ``true'' measure
generating examples. The symbol  $\E_\nu$ stands for expectation
with respect to a measure $\nu$ and  $\E$ is for $\E_\mu$
(expectation with respect to the ``true'' measure).

For two measures $\mu$ and $\rho$ define the following measures of divergence.
\begin{itemize}
\item[($d$)]{ Kullback-Leibler (KL) divergence} $  d_n(\mu,\rho|x_{<n})=\sum_{x\in\X}\mu(x_n=x|x_{<n})\log\frac{\mu(x_n=x|x_{<n})}{\rho(x_n=x|x_{<n})}$,
\item[($\bar d$)]{ average KL divergence} $  \bar d_n(\mu,\rho|x_{1:n})={1\over n} \sum_{t=1}^n d_t(\mu,\rho|x_{<t})$,
\item[($a$)]{ absolute distance} $ a_n(\mu,\rho|x_{<n})=\sum_{x\in\X}|\mu(x_n=x|x_{<n})-\rho(x_n=x|x_{<n})|$,
\item[($\bar a$)]{ average absolute distance} $ \bar a_n(\mu,\rho|x_{1:n})={1\over n}\sum_{t=1}^n a_t(\mu,\rho|x_{<t})$.
\end{itemize}
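For a single step these quantities are straightforward to compute; the following sketch implements $d_n$ and $a_n$ for one-step conditional distributions (the concrete distributions are our illustrative choices; natural logarithms are used):

```python
from math import log

def kl_step(mu_cond, rho_cond):
    """One-step KL divergence d_n between conditional distributions,
    given as dicts mapping symbols to probabilities (natural log)."""
    return sum(p * log(p / rho_cond[x]) for x, p in mu_cond.items() if p > 0)

def abs_step(mu_cond, rho_cond):
    """One-step absolute distance a_n between conditional distributions."""
    return sum(abs(p - rho_cond[x]) for x, p in mu_cond.items())

def average(vals):
    """Cesaro average, as used in bar d_n and bar a_n."""
    return sum(vals) / len(vals)

mu_c  = {0: 0.3, 1: 0.7}   # illustrative conditional distributions
rho_c = {0: 0.5, 1: 0.5}
d = kl_step(mu_c, rho_c)
a = abs_step(mu_c, rho_c)
print(a ** 2 <= 2 * d)     # Pinsker-type inequality: a^2 <= 2d
```

For this pair the inequality is nearly tight: $a^2=0.16$ while $2d\approx0.165$.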

The argument $x_{1:n}$ will often be omitted. The following
implications hold (and are complete):
$$
\begin{array}{ccccc}
                   d & \Rightarrow & \bar d &             & \E\bar d \\
               \Downarrow  &   & \Downarrow &             & \Downarrow \\
     a & \Rightarrow & \bar a & \Rightarrow & \E\bar a \\
\end{array}
$$
to be understood as, e.g.: if $\bar d_n\to 0$ a.s.\ then $\bar a_n\to
0$ a.s., or, if $\E\bar d_n\to 0$ then $\E\bar a_n\to 0$. The
horizontal implications $\Rightarrow$ follow immediately from the
definitions, and the vertical implications $\Downarrow$ follow from
the following lemma:

%-------------------------------%
\begin{lemma}[\boldmath $a^2\leq 2d$]\label{th:da}
%-------------------------------%
For all measures $\rho$ and $\mu$ and sequences $x_{1:\infty}$ we
have: $a_t^2\leq 2d_t$ and $\bar a_n^2\leq 2\bar d_n$ and $(\E\bar
a_n)^2\leq 2\E\bar d_n$.
\end{lemma}

\begin{proof}
Pinsker's inequality \cite[Lem.3.11$a$]{Hutter:04uaibook} implies
$a_t^2\leq 2d_t$. Using this and Jensen's inequality for the average
$\odn\sum_{t=1}^n [...]$ we get
$$
  2\bar d_n
  \;=\; {1\over n}\sum_{t=1}^n 2d_t
  \;\geq\;   {1\over n}\sum_{t=1}^n a_t^2
  \;\geq\;  \left({1\over n}\sum_{t=1}^n a_t\right)^2
  \;=\; \bar a_n^2.
$$
Using this and Jensen's inequality for the expectation $\E$ we get
$2\E\bar d_n\geq \E\bar a_n^2 \geq (\E\bar a_n)^2$.
\qedx
\end{proof}

The main concept we introduce is the following.

%-------------------------------%
\begin{definition}\label{def:dom}
%-------------------------------%
We say that a measure $\rho$ dominates a measure $\mu$
with coefficients $c_n>0$ iff
$
 \rho(x_{1:n}) \;\geq\; c_n \mu(x_{1:n})
$ for all $x_{1:n}$.
\end{definition}
Suppose that $\rho$ dominates $\mu$ with decreasing coefficients
$c_n$. Does $\rho$ predict $\mu$ in (expected, expected average)
KL divergence (absolute distance)?
First let us give an example.

%-------------------------------%
\begin{proposition}\label{prop:Laplace}
%-------------------------------%
Let $\rho_L$ be the Laplace measure
$\rho_L(x_{n+1}=a|x_{1:n})=\frac{k+1}{n+|\X|}$ for any $a\in\X$
and any $x_{1:n}\in\X^n$, where $k$ is the number of occurrences
of $a$ in $x_{1:n}$. Then
$
  \rho_L(x_{1:n}) \;\geq\; \frac{n!}{(n+|\X|-1)!} \; \mu({x_{1:n}})
$
for any Bernoulli i.i.d. $\mu$. This bound is sharp.
\end{proposition}

The proof is purely technical and can be found in \cite{ext}.
%
Thus for $\rho_L$ and binary $\X$ we have $c_n=\frac{1}{n+1}$. As
mentioned above, exponentially decreasing coefficients $c_n$ are, in
general, not sufficient for prediction. On the other hand, in the
weak sense of convergence in expected average KL divergence (or
absolute distance), the property~(\ref{eq:dom}) with subexponentially
decreasing $c_n$ is sufficient. We also recall that if the $c_n$ are
bounded from below then prediction in the strong sense of total
variation is possible.
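The bound of Proposition~\ref{prop:Laplace} can be checked by brute force over all sequences of a fixed length; the sketch below does so for the binary alphabet (the i.i.d.\ parameter and the length $n=8$ are our choices):

```python
from itertools import product
from math import factorial

def rho_laplace(seq, m):
    """rho_L(x_{1:n}) for an alphabet of size m, via the chain rule of
    the conditionals (k+1)/(t+m), k = prior count of the next symbol."""
    prob, counts = 1.0, {}
    for t, x in enumerate(seq):
        prob *= (counts.get(x, 0) + 1) / (t + m)
        counts[x] = counts.get(x, 0) + 1
    return prob

def mu_iid(seq, p):
    """i.i.d. measure with symbol probabilities given by the dict p."""
    prob = 1.0
    for x in seq:
        prob *= p[x]
    return prob

n, m = 8, 2
p = {0: 0.2, 1: 0.8}                        # arbitrary i.i.d. measure
c_n = factorial(n) / factorial(n + m - 1)   # = 1/(n+1) for m = 2
ok = all(rho_laplace(s, m) >= c_n * mu_iid(s, p) * (1 - 1e-9)
         for s in product((0, 1), repeat=n))
print(ok)   # the dominance bound holds on all 2^8 sequences
```

The check amounts to the fact that $1/\binom{n}{k}\ge p^k(1-p)^{n-k}$ for each count $k$, since the binomial probabilities sum to one.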

%-------------------------------%
\begin{theorem}\label{th:dom}
%-------------------------------%
Let $\mu$ and $\rho$ be two measures on $\X^\infty$ and suppose that
$
 \rho(x_{1:n})\ge c_n \mu(x_{1:n})
$
for any $x_{1:n}$, where $c_n$ are positive constants satisfying
$\frac{1}{n}\log c_n^{-1}\rightarrow0$. Then $\rho$ predicts
$\mu$ in expected average KL divergence $\E_\mu \bar
d_n(\mu,\rho)\rightarrow0$ and in expected average absolute
distance $\E_\mu \bar a_n(\mu,\rho)\rightarrow0$.
\end{theorem}

The proof can be found in \cite{ext}; it is based on the same idea
as the proof of convergence of Solomonoff predictor to any of its
summands in \cite{BRyabko:88}, see also \cite{Hutter:04uaibook}.

With a stronger condition on $c_n$ prediction in average KL
divergence can be established.

%-------------------------------%
\begin{theorem}[\boldmath $\bar d\to 0$ and $\bar a\to 0$]\label{th:dom2}
%-------------------------------%
Let $\mu$ and $\rho$ be two measures on $\X^\infty$ and suppose that
$
 \rho(x_{1:n})\ge c_n \mu(x_{1:n})
$
for every $x_{1:n}$,
where $c_n$ are positive constants satisfying
\begin{equation}\label{eq:domsum}
 \sum_{n=1}^{\infty} \frac {(\log c_n^{-1})^2}{n^2} \;<\; \infty.
\end{equation}
Then $\rho$ predicts $\mu$  in  average KL divergence $\bar
d_n(\mu,\rho)\rightarrow0$ $\mu$-a.s.\ and in  average absolute
distance $\bar a_n(\mu,\rho)\rightarrow0$ $\mu$-a.s.
\end{theorem}

In particular, the condition~(\ref{eq:domsum}) on the coefficients
is satisfied for polynomially decreasing coefficients, or for
$c_n=\exp(-\sqrt{n}/\log n)$.
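Indeed, for $c_n=\exp(-\sqrt{n}/\log n)$ we have $\log c_n^{-1}=\sqrt{n}/\log n$, so that
$$
  \sum_{n=2}^{\infty} \frac{(\log c_n^{-1})^2}{n^2}
  \;=\; \sum_{n=2}^{\infty} \frac{1}{n\log^2 n} \;<\; \infty
$$
by the integral test, while for polynomial $c_n=n^{-k}$ the summands are $k^2\log^2 n/n^2$, which are likewise summable.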

\begin{proof}
Again the second statement (about absolute distance) follows from
the first one and Lemma~\ref{th:da}, so that we only have to prove
the statement about KL divergence.

Introduce the symbol $\E^n$ for $\mu$-expectation over $x_n$
conditional on $x_{<n}$. Consider random variables
$l_n=\log\frac{\mu(x_n|x_{<n})}{\rho(x_n|x_{<n})}$ and $\bar l_n=
{1\over n} \sum_{t=1}^n l_t$. Observe that $d_n=\E^n l_n$, so that
the random variables $m_n=l_n-d_n$ form a martingale difference
sequence  (that is, $\E^n m_n=0$) with respect to the standard
filtration defined by $x_1,\dots,x_n,\dots$. Let also $\bar m_n=
{1\over n} \sum_{t=1}^n m_t$. We will show that $\bar
m_n\rightarrow0$ $\mu$-a.s.\ and $\bar l_n\rightarrow0$ $\mu$-a.s.
which implies $\bar d_n\rightarrow 0$ $\mu$-a.s.

Note that
$$
 \bar l_n \;=\; {1\over n}\log \frac{\mu(x_{1:n})}{\rho(x_{1:n})}
 \;\leq\; \frac{\log c_n^{-1}}{n} \;\rightarrow\; 0.
$$ Thus to show that $\bar l_n$ goes to $0$ we need to bound it
from below. It is easy to see that $n\bar l_n$ is ($\mu$-a.s.)
bounded from below by a constant, since $\frac{\rho (x_{1:n})}{\mu
(x_{1:n})}$ is a positive $\mu$-martingale whose expectation is 1,
and so it converges to a finite limit $\mu$-a.s.\ by Doob's
submartingale convergence theorem, see e.g.\
\cite[p.508]{Shiryaev:96}.
%
Next we will show that $\bar m_n\rightarrow0$ $\mu$-a.s.
 We have
\begin{multline*}
  m_n  \;=\;  \log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}
          - \log\frac{\mu(x_{<n})}{\rho(x_{<n})}
      - \E^n\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}
      + \E^n\log\frac{\mu(x_{<n})}{\rho(x_{<n})}
\\
       \quad\quad\;\; =\;  \log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}
      - \E^n\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}. \hfill
\end{multline*}
Let $f(n)$ be some function monotonically increasing to infinity such that
\begin{equation}\label{eq:f}
  \sum_{n=1}^\infty\frac{(\log c_n^{-1}+f(n))^2}{n^2} \;<\; \infty.
\end{equation}
For a sequence of random variables  $\lambda_n$ define
$$
  (\lambda_n)^{+(f)} \;=\; \left\{ \begin{array}{ll} \lambda_n & \mbox{ if }
    \lambda_n\ge - f(n) \\ 0 & \mbox{ otherwise }\end{array}\right.
$$
and $\lambda_n^{-(f)}=\lambda_n - \lambda_n^{+(f)}$.
Introduce also
$
  m^+_n \;=\; \left (\log \frac{\mu(x_{1:n})}{\rho(x_{1:n})}\right)^{+(f)}
         - \E^n\left(\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\right)^{+(f)},
$ $m_n^-=m_n-m_n^+$ and the averages $\bar m^+_n$ and $\bar
m_n^-$. Observe that $m_n^+$ is a martingale difference sequence.
Hence to establish the convergence $\bar m^+_n\rightarrow0$ we can
use the martingale strong law of large numbers
\cite[p.501]{Shiryaev:96}, which states that, for a martingale
difference sequence $\gamma_n$, if $\E(n\bar \gamma_n)^2<\infty$
and $\sum_{n=1}^\infty \E \gamma_n^2/n^2<\infty$ then $\bar
\gamma_n\rightarrow0$ a.s. Indeed, for $m^+_n$ the first condition
is trivially satisfied (since the expectation in question is a
finite sum of finite numbers), and the second follows from the
fact that $|m_n^+|\le \log c_n^{-1}+f(n)$ and~(\ref{eq:f}).

Furthermore, we have
$
  m_n^- \;=\; \left(\log \frac{\mu(x_{1:n})}{\rho(x_{1:n})}\right)^{-(f)}
        - \E^n\left(\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\right)^{-(f)}.
$
As mentioned before, $\log
\frac{\mu(x_{1:n})}{\rho(x_{1:n})}$ converges $\mu$-a.s.\ either
to (positive) infinity or to a finite number. Hence $\left(\log
\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\right)^{-(f)}$ is non-zero
only a  finite  number of times, and so  its average goes to zero.
To see that $\E^n\left(\log
\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\right)^{-(f)}\rightarrow0$ we
write
\begin{multline*}
  \E^n\left(\log \frac{\mu(x_{1:n})}{\rho(x_{1:n})}\right)^{-(f)}
  = \sum_{x_n\in\X}\mu(x_n|x_{<n}) \left( \log \frac{\mu(x_{<n})}{\rho(x_{<n})}
        + \log\frac{\mu(x_n|x_{<n})}{\rho(x_n|x_{<n})}\right)^{-(f)}
\\
   \ge \sum_{x_n\in\X}\mu(x_n|x_{<n}) \left( \log \frac{\mu(x_{<n})}{\rho(x_{<n})}
        + \log\mu(x_n|x_{<n})\right)^{-(f)}
\end{multline*}
 and note that the first term in brackets is bounded from
below, and so for the sum in brackets to be less than $-f(n)$
(which is unbounded) the second term $\log\,\mu(x_n|x_{<n})$ has
to go to $-\infty$, but then the expectation goes to zero since
$\lim_{u\rightarrow0} u\log u=0$.

Thus  $\bar m_n^-\rightarrow0$ $\mu$-a.s., which
together with $\bar m_n^+\rightarrow0$ $\mu$-a.s.\ implies $\bar
m_n\rightarrow0$ $\mu$-a.s., which, finally, together with $\bar
l_n\rightarrow0$ $\mu$-a.s.\ implies $\bar d_n\rightarrow0$
$\mu$-a.s. \qedx
\end{proof}

However, no form of dominance with decreasing coefficients is
sufficient for prediction in absolute distance or KL divergence:

%-------------------------------%
\begin{proposition}[\boldmath $d\not\to 0$ and $a\not\to 0$]\label{th:nodom}
%-------------------------------%
For each sequence of positive numbers $c_n$ that goes to 0 there
exist measures $\mu$ and $\rho$ and a number $\epsilon>0$ such
that
$
 \rho(x_{1:n})\ge c_n \mu(x_{1:n})
$
for all $x_{1:n}$, yet $a_n(\mu,\rho|x_{1:n})>\epsilon$ and
$d_n(\mu,\rho|x_{1:n})>\epsilon$ infinitely often $\mu$-a.s.
\end{proposition}

\begin{proof}
Let $\mu$ be concentrated on the sequence $11111\dots$ (that is
$\mu(x_n=1)=1$ for all $n$), and let $\rho(x_n=1)=1$ for all $n$
except for a subsequence of steps $n=n_k$, $k\in\SetN$ on which
$\rho(x_{n_k}=1)=1/2$ independently of each other. It is easy to
see that by choosing $n_k$ sparse enough we can make
$\rho(x_{1:n}=1\dots1)$ decrease to $0$ arbitrarily slowly; yet
$|\mu(x_{n_k})-\rho(x_{n_k})|=1/2$ for all $k$. \qedx
\end{proof}
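A concrete instance of this construction can be sketched numerically, with the sparse steps chosen as $n_k=2^k$ (our choice; the proof only needs them sparse enough): the coefficients $c_n=\rho(x_{1:n}=1\dots1)$ decay only polynomially, yet the one-step absolute distance equals $1$ at every $n_k$.

```python
from math import log2

def num_flips(n):
    """Number of 'coin-flip' steps n_k = 2^k among steps 1..n
    (our illustrative choice of the sparse subsequence)."""
    return int(log2(n)) + 1 if n >= 1 else 0

def rho_ones(n):
    """rho(x_{1:n} = 1...1): rho is deterministic except at the flip
    steps, where rho(x_{n_k} = 1) = 1/2, so each flip halves it."""
    return 0.5 ** num_flips(n)

# mu(1...1) = 1, so the dominance coefficient c_n = rho_ones(n) is at
# least 1/(2n), i.e. decays only polynomially...
print([rho_ones(n) for n in (10, 100, 1000)])
print(all(rho_ones(n) >= 1 / (2 * n) for n in range(1, 2000)))
# ...yet at each flip step the one-step absolute distance is
# |1 - 1/2| + |0 - 1/2| = 1, infinitely often.
```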

The following table summarizes the conditions on the dominance
coefficients and whether each condition is sufficient for prediction
(coefficients bounded from below are included for the sake of
completeness).
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}\hline
                                                      & $\E \bar d_n $ &  $\bar d_n \vphantom{\bar{\hat d}} $ &  $d_n$ &  $\E\bar a_n$& $\bar a_n$& $a_n$\\\hline
 $\log c_n^{-1}=o(n)$                                 &  +             &  ?                                   &   $-$  &       +      &  ?         & $-$  \\\hline
 $\sum_{n=1}^\infty\frac{(\log c_n^{-1})^2}{n^2}< \infty$ &  +             &  +                                   &   $-$ &         +     &  +         & $-$  \\\hline\hline
 $c_n\ge c>0$                                         &  +             &  +                                   &   $+$ & + & + &+ \\\hline
\end{tabular}
\end{center}
An open question is whether $\log c_n^{-1}=o(n)$ is sufficient
for prediction in $\bar d_n$, or at least in $\bar a_n$.
Another problem is to find out whether any conditions on the
dominance coefficients are necessary for prediction; so far we
only have some sufficient conditions. On the one hand, the
obtained results suggest that some form of dominance with
decreasing coefficients may be necessary for prediction, at least
in the sense of convergence of averages. On the other hand, the
condition~(\ref{eq:dom}) is uniform over all sequences, which is
probably not necessary for prediction. As for prediction in the
sense of almost sure convergence, perhaps more subtle behavior of
the ratio $\frac{\mu(x_{1:n})}{\rho(x_{1:n})}$ should be analyzed,
since dominance with decreasing coefficients is not sufficient for
prediction in this sense.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%                 B i b l i o g r a p h y                   %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{thebibliography}{99}

\bibitem{dag}
R.~Ahlswede, A.~Apostolico, V.~I.~Levenshtein (Editors),
Combinatorial and Algorithmic Foundations of Pattern and Association Discovery, Dagstuhl Seminar Proceedings 06201, 2006.

\bibitem{Blackwell:62}
D.~Blackwell and L.~Dubins.
\newblock Merging of opinions with increasing information.
\newblock {\em Annals of Mathematical Statistics}, 33:882--887, 1962.

\bibitem{Hutter:04uaibook}
M.~Hutter.
\newblock {\em Universal Artificial Intelligence: Sequential Decisions based on
  Algorithmic Probability}.
\newblock Springer, Berlin, 2005.
\newblock 300 pages, http://www.idsia.ch/$_{^{\sim}}$marcus/ai/uaibook.htm.

\bibitem{Hutter:06usp}
M.~Hutter.
\newblock On the foundations of universal sequence prediction.
\newblock In {\em Proc. 3rd Annual Conference on Theory and Applications of
  Models of Computation ({TAMC'06})}, volume 3959 of {\em LNCS}, pages
  408--420. Springer, 2006.

\bibitem{Jackson:99}
M.~Jackson, E.~Kalai, and R.~Smorodinsky.
\newblock Bayesian representation of stochastic processes under learning: de
  {F}inetti revisited.
\newblock {\em Econometrica}, 67(4):875--893, 1999.

\bibitem{BRyabko:06}
B.~Ryabko and J.~Astola.
\newblock Universal codes as a basis for time series testing.
\newblock {\em Statistical Methodology}, To appear, available online, 2006.

\bibitem{Ryabko:06actopt}
D.~Ryabko and M.~Hutter.
\newblock Asymptotic learnability of reinforcement problems with arbitrary
  dependence.
\newblock In {\em Proc. 17th International Conference on Algorithmic Learning
  Theory ({ALT'06})}, volume 4264 of {\em LNCS}, pages 334--347. Springer, 2006.

\bibitem{ext}
D.~Ryabko and M.~Hutter.
\newblock On sequence prediction for arbitrary measures.
\newblock http://arxiv.org/abs/cs.LG/0606077.

\bibitem{BRyabko:88}
B.~Ryabko.
\newblock Prediction of random sequences and universal coding.
\newblock {\em Problems of Information Transmission}, 24:87--96, 1988.

\bibitem{Shiryaev:96}
A.~N. Shiryaev.
\newblock {\em Probability}.
\newblock Springer, 1996.

\bibitem{Solomonoff:64}
R.~J. Solomonoff.
\newblock A formal theory of inductive inference: Part 1 and 2.
\newblock {\em Inform. Control}, 7:1--22, 224--254, 1964.

\bibitem{Solomonoff:78}
R.~J. Solomonoff.
\newblock Complexity-based induction systems: comparisons and convergence
  theorems.
\newblock {\em IEEE Trans. Information Theory}, IT-24:422--432, 1978.

\bibitem{Zvonkin:70}
A.~K. Zvonkin and L.~A. Levin.
\newblock The complexity of finite objects and the development of the concepts
  of information and randomness by means of the theory of algorithms.
\newblock {\em Russian Mathematical Surveys}, 25(6):83--124, 1970.

\end{thebibliography}
\end{document}
