%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Predicting Non-Stationary Processes %%
%% Daniil Ryabko & Marcus Hutter: Start 2007 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt]{article}
\usepackage{amsmath,amssymb,amsthm}
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm \sloppy\lineskip=0pt
\newenvironment{keyword}{\centerline{\bf\small
Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\def\qedx{}
\def\as{\mbox{ a.s.}}
\def\argmax{\operatorname{argmax}}
\def\P{{\bf P}}
\def\E{{\bf E}}
\def\odt{{\textstyle{1\over 2}}}
\def\odn{{\textstyle{1\over n}}}
\def\SetN{I\!\!N}
\def\C{{\cal C}} % Set of prob. distributions
\def\X{{\cal X}} % generic set
\def\O{{\cal O}} % observation space
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\bf\Large\hrule height5pt \vskip 4mm
Predicting Non-Stationary Processes
\vskip 4mm \hrule height2pt}
\author{{ Daniil Ryabko${}^{a}$\footnote{This research was supported by the Swiss NSF grants 200020-107616 and 200021-113364.} \ \ Marcus Hutter${}^{b}$ }\\
\normalsize $^{a}$IDSIA, Galleria 2, CH-6928\ Manno, Switzerland, daniil@ryabko.net\\
\normalsize $^{b}$ RSISE@ANU and SML@NICTA, Canberra, Australia, marcus@hutter1.net
}
\date{\vspace{3ex} May 2008}
\maketitle
\begin{abstract} \footnote{Parts of the results were reported at Dagstuhl Seminar on Combinatorial and Algorithmic Foundations of Pattern and Association Discovery, Dagstuhl, Germany, 2006, see
\cite{dag}.}
Suppose we are given two probability measures on the set of
one-way infinite finite-alphabet sequences. Consider the question
of when one of the measures predicts the other, that is, when
conditional probabilities converge (in a certain sense), if
one of the measures is chosen to generate the sequence. This
question may be considered a refinement of the problem of sequence
prediction in its most general formulation: for a given class of
probability measures, does there exist a measure which predicts
all of the measures in the class? To address this problem, we find
some conditions on local absolute continuity which are sufficient
for prediction and generalize several different notions
that are known to be sufficient for prediction. We also formulate
some open questions to outline a direction for finding the
conditions on classes of measures for which prediction is
possible.
\end{abstract}
\begin{keyword}
Sequence prediction, %
local absolute continuity, %
non-stationary measures, %
average/expected criteria, %
absolute/KL divergence, %
mixtures of measures. %
\end{keyword}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Let a sequence $x_t$, $t\in\SetN$ of letters from some finite
alphabet $\X$ be generated by some probability measure $\mu$ on
$\X^\infty$. Having observed the first $n$ letters we want to
predict what is the probability of the next letter being $x$, for
each $x\in\X$. This task is motivated by numerous applications~---
from weather forecasting and stock market prediction to source
coding and data compression.

If the measure $\mu$ is known, then the best forecasts one can make
for the $(n+1)$st outcome are the $\mu$-conditional probabilities of
$x_{n+1}$ being $x\in \X$ given $x_1,\dots,x_n$. However, it is
clear that if nothing is known about the distribution $\mu$ then no
prediction is possible, since for any predictor there is a measure
on which it errs (gives grossly wrong probability forecasts) on
every step. Thus one has to restrict the attention to some class of
measures.

Laplace was perhaps the first to address the question of
sequence prediction, asking what the probability is that the Sun
will rise tomorrow, given that it has risen every day for 5000
years. He suggested assuming that the probability that the
Sun rises is the same every day and that the trials are independent of
each other. Thus Laplace considered the task of sequence prediction
when the true generating measure belongs to the family of Bernoulli
i.i.d.\ measures with binary alphabet $\X=\{0,1\}$. The
predicting measure he suggested was
$\rho_L(x_{n+1}=1|x_1,\dots,x_n)=\frac{k+1}{n+2}$ where $k$ is the
number of 1s in $x_1,\dots,x_n$. The conditional probabilities of
$\rho_L$ converge to the true conditional probabilities $\mu$-a.s.
under any Bernoulli i.i.d.\ measure $\mu$. This approach generalizes
to the problem of predicting any finite-memory (e.g.\ Markovian)
measure. Moreover, in \cite{BRyabko:88} a measure $\rho_R$ was
constructed for predicting an arbitrary stationary measure. The
conditional probabilities of $\rho_R$ converge to the true ones {\em
on average}, where the average is taken over time steps, $\mu$-a.s.\
for any stationary measure $\mu$. However, as was shown in the
same work, there is no measure whose conditional probabilities
converge to the true ones $\mu$-a.s.\ for every stationary $\mu$.
Thus already for the problem of predicting outcomes of a stationary
measure two criteria of prediction arise: prediction on average
(in the Cesaro sense) and prediction on each step, and a solution
exists only for the former problem.
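To make Laplace's rule concrete, here is a small numerical sketch
(the code and the function name are only illustrative, not part of
the formal development):

```python
from fractions import Fraction

def laplace_next(xs):
    """Laplace's rule of succession: forecast probability that the
    next binary outcome is 1, given the observed prefix xs."""
    k = sum(xs)                      # number of 1s observed so far
    n = len(xs)
    return Fraction(k + 1, n + 2)

# Sun-rise example: after n consecutive 1s the forecast is
# (n+1)/(n+2), which tends to 1 as n grows.
print(laplace_next([]))              # 1/2 (no data yet)
print(laplace_next([1] * 5000))      # 5001/5002
```

The same rule, extended to a general finite alphabet, is the measure
$\rho_L$ of Proposition~\ref{prop:Laplace} below.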

What if the measure generating the sequence is not stationary?
Another possible assumption is that the measure $\mu$ generating the
sequence is computable. Solomonoff \cite[Eq.(13)]{Solomonoff:64}
suggested a measure $\xi$ for predicting any computable probability
measure. Observe that the class of all computable probability
measures is countable; denote it by $(\nu_i)_{i\in\SetN}$. A
Bayesian predictor $\xi$ for such a class
is given by
$\xi(A)=\sum_{i=1}^\infty w_i\nu_i(A)$ for any measurable set A,
where the weights $w_i$ are positive and sum to one\footnote{It
is not necessary for prediction that the weights sum to one. In
\cite{Solomonoff:78} and \cite{Zvonkin:70} $w_i=2^{-K(i)}$
where $K$ stands for the prefix Kolmogorov complexity, and so the
weights do not sum to 1. Further, the $\nu$ and $\xi$ are only
semi-measures.}. It was shown in \cite{Solomonoff:78} that $\xi$-conditional
probabilities converge to $\mu$-conditional probabilities almost
surely for any computable measure $\mu$. In fact this is a special
case of a more general (though without a convergence rate) result of
Blackwell and Dubins \cite{Blackwell:62}: if a
measure $\mu$ is absolutely continuous with respect to a measure
$\rho$, then the conditional measure of $\rho$ given $x_1,\dots, x_n$
converges to that of $\mu$ given $x_1,\dots,x_n$ in total variation
$\mu$-a.s.
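A finite analogue of such a Bayesian mixture can be computed
explicitly. The following sketch is only an illustration, under the
simplifying assumption of a three-element class of Bernoulli
measures with equal weights; it shows the mixture forecast tracking
the measure that actually generates the data:

```python
from fractions import Fraction

def bernoulli(p, xs):
    """nu(x_{1:n}) for a Bernoulli(p) i.i.d. measure."""
    prob = Fraction(1)
    for x in xs:
        prob *= p if x == 1 else 1 - p
    return prob

def mixture_next(params, weights, xs):
    """xi(x_{n+1}=1 | x_{1:n}) for the mixture xi = sum_i w_i nu_i:
    a ratio of mixture probabilities of the extended and the
    observed sequence."""
    num = sum(w * bernoulli(p, xs + [1]) for p, w in zip(params, weights))
    den = sum(w * bernoulli(p, xs) for p, w in zip(params, weights))
    return num / den

params = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
weights = [Fraction(1, 3)] * 3
# With no data the forecast is the weighted mean of the parameters.
print(mixture_next(params, weights, []))            # 1/2
# On data from Bernoulli(3/4) the forecast approaches 3/4, since the
# posterior weight concentrates on the best-fitting component.
xs = [1, 1, 1, 0] * 25                              # frequency 3/4
print(float(mixture_next(params, weights, xs)))
```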

Thus the problem of sequence prediction for certain classes of
measures has often been addressed in the literature. Although the
classes of measures mentioned above are sufficiently interesting, it
is often hard to decide in applications which of these assumptions a
problem at hand complies with; not to mention such practical issues
as that a predicting measure for all computable measures is
necessarily non-computable itself. Also, the general approach may be
easier to extend to problems of active learning, which is a rather
hard problem in itself (see e.g.\ \cite{Ryabko:06actopt}).

In this work we start to address the following {\bf general
questions}: For which classes of measures is sequence prediction
possible? Under which conditions does a measure $\rho$ predict a
measure $\mu$?

Extensive as the literature on sequence prediction is, these
questions have not been formulated before, and so the general
problem posed here has not received much attention. One line of
research which exhibits this kind of generality consists in
extending the result of Blackwell and Dubins mentioned above, which
states that if $\mu$ is absolutely continuous with respect to
$\rho$, then $\rho$ predicts $\mu$ in total variation distance. In
\cite{Jackson:99} the question was studied of whether, given a class
of measures $\C$ and a prior (``meta''-measure) $\lambda$ over this
class of measures, the conditional probabilities of a Bayesian
mixture of the class $\C$ w.r.t.\ $\lambda$ converge to the true
$\mu$-probabilities (weakly merge, in the terminology of
\cite{Jackson:99}) for $\lambda$-almost every measure $\mu$ in
$\C$. This question can be considered solved, since
the authors provide necessary and sufficient conditions on the
measure given by the mixture of the class $\C$ w.r.t.\ $\lambda$
under which prediction is possible. The major difference from the
general questions we posed above is that we do not wish to assume
that we have a measure on our class of measures. For large
(non-parametric) classes of measures it may not be intuitive which
measure over it is natural; rather, the question is whether a
``natural'' measure which can be used for prediction exists.

We start with the following observation. For a Bayesian mixture
$\xi$ of a countable class of measures $\nu_i$, $i\in\SetN$, we have
$\xi(A)\ge w_i\nu_i(A)$ for every $i$ and every measurable set $A$,
where $w_i$ is a positive constant. This condition is stronger than the
assumption of absolute continuity and is sufficient for prediction
in a very strong sense (in total variation). Since we are willing to
be satisfied with prediction in a weaker sense (e.g.\ convergence of
conditional probabilities), we make a weaker assumption: Say that
{\em a measure $\rho$ dominates a measure $\mu$ with coefficients
$c_n>0$} if
\begin{equation}\label{eq:dom}
\rho(x_1,\dots,x_n) \;\geq\; c_n \mu(x_1,\dots,x_n)
\end{equation}
for all $x_1,\dots,x_n$.
%-------------------------------%
{\bf The concrete question}
%-------------------------------%
we pose is, under what conditions on $c_n$ does (\ref{eq:dom}) imply
that $\rho$ predicts $\mu$? Observe that if $\rho(x_1,\dots,x_n)>0$
for any $x_1,\dots,x_n$ then any measure $\mu$ is {\em locally}
absolutely continuous with respect to $\rho$, and moreover, for any
measure $\mu$ some constants $c_n$ can be found that satisfy
(\ref{eq:dom}). For example, if $\rho$ is Bernoulli i.i.d.\ measure
with parameter $\odt$ and $\mu$ is any other measure, then
(\ref{eq:dom}) is (trivially) satisfied with $c_n=2^{-n}$. Thus if
$c_n\equiv c$ then $\rho$ predicts $\mu$ in a very strong sense,
whereas exponentially decreasing $c_n$ are not enough for
prediction. We will show that dominance with any subexponentially
decreasing coefficients is sufficient for prediction, in the weak
sense of convergence of expected averages. Dominance with any
polynomially decreasing coefficients (and some others) is sufficient
for (almost sure) prediction on time-average. However, for
prediction on every step we have a negative result: for any
dominance coefficients that go to zero there exists a pair of
measures $\rho$ and $\mu$ which satisfy~(\ref{eq:dom}) but $\rho$
does not predict $\mu$ in this sense. Thus the situation is similar
to that for predicting any stationary measure: prediction is
possible in the average but not on every step.

Note also that for Laplace's measure $\rho_L$ it can be shown that
$\rho_L$ dominates any i.i.d.\ measure $\mu$ with linearly
decreasing coefficients $c_n={1\over n+1}$. Thus dominance with
decreasing coefficients generalizes (in a sense) predicting
countable classes of measures (where we have dominance with a
constant), absolute continuity (via local absolute continuity), and
predicting i.i.d.\ and finite-memory measures.
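The two regimes just described are easy to check numerically. In the
sketch below (illustrative code only), $\rho$ is the Bernoulli
i.i.d.\ measure with parameter $\odt$: dominance~(\ref{eq:dom}) with
$c_n=2^{-n}$ holds for every $\mu$, yet $\rho$ never approaches the
conditionals of a Bernoulli($3/4$) source:

```python
from fractions import Fraction
from itertools import product

def bernoulli(p, xs):
    """mu(x_{1:n}) for a Bernoulli(p) i.i.d. measure."""
    prob = Fraction(1)
    for x in xs:
        prob *= p if x == 1 else 1 - p
    return prob

half, mu_p = Fraction(1, 2), Fraction(3, 4)
# Exhaustive check of rho(x_{1:n}) >= 2^{-n} mu(x_{1:n}) for short
# strings: it holds trivially, since rho(x_{1:n}) = 2^{-n} and
# mu(x_{1:n}) <= 1.
for n in range(1, 9):
    for xs in product([0, 1], repeat=n):
        assert bernoulli(half, list(xs)) >= Fraction(1, 2 ** n) * bernoulli(mu_p, list(xs))

# Yet rho does not predict mu on any step: the forecast is always
# 1/2 while the true conditional probability is always 3/4, so the
# absolute distance between the conditionals stays at 1/2.
a_n = 2 * abs(mu_p - half)           # |3/4-1/2| + |1/4-1/2| = 1/2
print(a_n)                           # 1/2
```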
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Main results}\label{sec:def}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We consider processes on the set of one-way infinite sequences
$\X^\infty$ where $\X$ is a finite set (alphabet). We use $x_{1:n}$
for $x_1,\dots,x_n$ and $x_{<n}$ for $x_{1:n-1}$. For two measures
$\mu$ and $\rho$ on $\X^\infty$, where $\mu$ is the measure
generating the sequence and $\rho$ is the predictor, define the
absolute distance
$$
a_n(\mu,\rho|x_{<n}) \;=\; \sum_{x\in\X}
|\mu(x_n=x|x_{<n})-\rho(x_n=x|x_{<n})|
$$
and the KL divergence
$$
d_n(\mu,\rho|x_{<n}) \;=\; \sum_{x\in\X} \mu(x_n=x|x_{<n})
\log\frac{\mu(x_n=x|x_{<n})}{\rho(x_n=x|x_{<n})},
$$
along with their time-averages $\bar a_n(\mu,\rho)=\odn
\sum_{t=1}^n a_t(\mu,\rho|x_{<t})$ and $\bar d_n(\mu,\rho)=\odn
\sum_{t=1}^n d_t(\mu,\rho|x_{<t})$. We say that $\rho$ predicts
$\mu$ in one of these senses if the corresponding quantity converges
to zero $\mu$-a.s., and in the corresponding expected sense if its
$\mu$-expectation converges to zero. The two distances are related
as follows.
\begin{lemma}\label{th:da}
For all measures $\mu,\rho$ and all $x_{<n}$ we have
$a_n^2(\mu,\rho|x_{<n})\le 2\, d_n(\mu,\rho|x_{<n})$.
\end{lemma}
\begin{definition}
Say that {\em $\rho$ dominates $\mu$ with coefficients $c_n>0$} iff
$
\rho(x_{1:n}) \;\geq\; c_n \mu(x_{1:n})
$
for all $x_{1:n}$.
\end{definition}
Suppose that $\rho$ dominates $\mu$ with decreasing coefficients
$c_n$. Does $\rho$ predict $\mu$ in (expected, expected average)
KL divergence (absolute distance)?
First let us give an example.
%-------------------------------%
\begin{proposition}\label{prop:Laplace}
%-------------------------------%
Let $\rho_L$ be the Laplace measure
$\rho_L(x_{n+1}=a|x_{1:n})=\frac{k+1}{n+|\X|}$ for any $a\in\X$
and any $x_{1:n}\in\X^n$, where $k$ is the number of occurrences
of $a$ in $x_{1:n}$. Then
$
\rho_L(x_{1:n}) \;\geq\; \frac{n!}{(n+|\X|-1)!} \; \mu({x_{1:n}})
$
for any Bernoulli i.i.d. $\mu$. This bound is sharp.
\end{proposition}
The proof is purely technical and can be found in \cite{ext}.
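For the binary alphabet the bound of Proposition~\ref{prop:Laplace}
reads $\rho_L(x_{1:n})\ge\frac{1}{n+1}\mu(x_{1:n})$; the following
sketch (illustrative code, not part of the proof) verifies it
exhaustively for short strings:

```python
from fractions import Fraction
from itertools import product

def bernoulli(p, xs):
    """mu(x_{1:n}) for a Bernoulli(p) i.i.d. measure."""
    prob = Fraction(1)
    for x in xs:
        prob *= p if x == 1 else 1 - p
    return prob

def laplace_prob(xs):
    """rho_L(x_{1:n}) for binary strings: chain of Laplace
    conditionals (k+1)/(t+2), where k counts previous occurrences
    of the current symbol."""
    prob = Fraction(1)
    for t, x in enumerate(xs):
        k = sum(1 for y in xs[:t] if y == x)   # matches of x so far
        prob *= Fraction(k + 1, t + 2)
    return prob

# Exhaustive check of rho_L(x_{1:n}) >= mu(x_{1:n})/(n+1).
for n in range(1, 9):
    for xs in product([0, 1], repeat=n):
        for p in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
            assert laplace_prob(list(xs)) >= Fraction(1, n + 1) * bernoulli(p, list(xs))

# Sharpness: rho_L(1...1) = 1/(n+1) exactly, while mu(1...1) -> 1
# as p -> 1, so the ratio rho_L/mu approaches 1/(n+1).
print(laplace_prob([1] * 6))         # 1/7
```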
%
Thus for $\rho_L$ and binary $\X$ we have $c_n=O(\frac{1}{n})$. As
mentioned above, exponentially decreasing coefficients $c_n$ are in
general not sufficient for prediction. On the other hand, in the
weak sense of convergence in expected average KL divergence (or
absolute distance) the property~(\ref{eq:dom}) with subexponentially
decreasing $c_n$ is sufficient. Recall also that if the $c_n$ are
bounded away from zero then prediction in the strong sense of total
variation is possible.
%-------------------------------%
\begin{theorem}\label{th:dom}
%-------------------------------%
Let $\mu$ and $\rho$ be two measures on $\X^\infty$ and suppose that
$
\rho(x_{1:n})\ge c_n \mu(x_{1:n})
$
for any $x_{1:n}$, where $c_n$ are positive constants satisfying
$\frac{1}{n}\log c_n^{-1}\rightarrow0$. Then $\rho$ predicts
$\mu$ in expected average KL divergence $\E_\mu \bar
d_n(\mu,\rho)\rightarrow0$ and in expected average absolute
distance $\E_\mu \bar a_n(\mu,\rho)\rightarrow0$.
\end{theorem}
The proof can be found in \cite{ext}; it is based on the same idea
as the proof of convergence of Solomonoff predictor to any of its
summands in \cite{BRyabko:88}, see also \cite{Hutter:04uaibook}.
With a stronger condition on $c_n$ prediction in average KL
divergence can be established.
%-------------------------------%
\begin{theorem}[\boldmath $\bar d\to 0$ and $\bar a\to 0$]\label{th:dom2}
%-------------------------------%
Let $\mu$ and $\rho$ be two measures on $\X^\infty$ and suppose that
$
\rho(x_{1:n})\ge c_n \mu(x_{1:n})
$
for every $x_{1:n}$,
where $c_n$ are positive constants satisfying
\begin{equation}\label{eq:domsum}
\sum_{n=1}^{\infty} \frac {(\log c_n^{-1})^2}{n^2} \;<\; \infty.
\end{equation}
Then $\rho$ predicts $\mu$ in average KL divergence $\bar
d_n(\mu,\rho)\rightarrow0$ $\mu$-a.s.\ and in average absolute
distance $\bar a_n(\mu,\rho)\rightarrow0$ $\mu$-a.s.
\end{theorem}
In particular, the condition~(\ref{eq:domsum}) on the coefficients
is satisfied for polynomially decreasing coefficients, or for
$c_n=\exp(-\sqrt{n}/\log n)$.
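As a numerical illustration (the code below is a sketch, not part of
the formal development), one can compute $\E_\mu d_t$ exactly for
the Laplace predictor, whose coefficients $c_n=\frac{1}{n+1}$
satisfy~(\ref{eq:domsum}); for $\mu$ Bernoulli($0.7$) the forecast
depends on the prefix only through the number of ones, and the
expected average KL divergence visibly decreases:

```python
import math
from math import comb

def expected_kl_step(p, t):
    """E_mu d_t for mu = Bernoulli(p) and the Laplace predictor.
    After t-1 symbols containing k ones the forecast for '1' is
    (k+1)/(t+1); under mu, k is Binomial(t-1, p)."""
    total = 0.0
    for k in range(t):
        w = comb(t - 1, k) * p ** k * (1 - p) ** (t - 1 - k)
        q = (k + 1) / (t + 1)
        total += w * (p * math.log(p / q)
                      + (1 - p) * math.log((1 - p) / (1 - q)))
    return total

p, n = 0.7, 200
running, bar_d = 0.0, []
for t in range(1, n + 1):
    running += expected_kl_step(p, t)
    bar_d.append(running / t)        # expected average KL so far
# bar_d decreases towards 0, roughly like (log n)/n.
print(round(bar_d[0], 3), round(bar_d[-1], 3))
```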
\begin{proof}
Again the second statement (about absolute distance) follows from
the first one and Lemma~\ref{th:da}, so that we only have to prove
the statement about KL divergence.
Introduce the symbol $\E^n$ for $\mu$-expectation over $x_n$
conditional on $x_{<n}$. The statement then follows from the strong
law of large numbers for martingale differences (see
e.g.\ \cite{Shiryaev:96}); the details can be found in \cite{ext}. \qedx
\end{proof}
For prediction on every step (rather than on average) we have the
following negative result.
%-------------------------------%
\begin{proposition}
%-------------------------------%
For every sequence of positive numbers $c_n$ decreasing to zero
there exist measures $\mu$ and $\rho$ and a number $\epsilon>0$ such
that
$
\rho(x_{1:n})\ge c_n \mu(x_{1:n})
$
for all $x_{1:n}$, yet $a_n(\mu,\rho|x_{<n})>\epsilon$ and
$d_n(\mu,\rho|x_{<n})>\epsilon$ infinitely often $\mu$-a.s.
\end{proposition}
\begin{proof}
Let $\mu$ be concentrated on the sequence $111\dots$ (that is,
$\mu(x_n=1)=1$ for all $n$), and let $\rho(x_n=1)=1$ for all $n$
except on a subsequence of steps $n=n_k$, $k\in\SetN$, on which
$\rho(x_{n_k}=1)=1/2$ independently of each other. It is easy to
see that by choosing $n_k$ sparse enough we can make
$\rho(x_{1:n}=1\dots1)$ decrease to $0$ arbitrarily slowly; yet
$|\mu(x_{n_k}=1)-\rho(x_{n_k}=1)|=1/2$ for all $k$. \qedx
\end{proof}
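The construction can be made concrete with the sparse steps
$n_k=2^k$ (an illustrative choice; any sparser choice slows the
decrease further): then $\rho(1\dots1)$ decreases only as $1/n$, so
the dominance coefficients decrease polynomially, while the forecast
error at every step $n_k$ remains $1/2$:

```python
from fractions import Fraction

sparse = {2 ** k for k in range(1, 20)}   # n_k = 2, 4, 8, ...

def rho_ones(n):
    """rho(1...1) for n ones: the forecast is 1 except at the
    sparse steps, where it is 1/2, so the probability is halved
    once per sparse step <= n."""
    halvings = sum(1 for m in sparse if m <= n)
    return Fraction(1, 2 ** halvings)

# mu is concentrated on 111..., so mu(x_{1:n}) is 1 on the all-ones
# string and 0 elsewhere; dominance therefore holds with
# c_n = rho_ones(n), which here decreases only like 1/n.
print(rho_ones(100))                  # sparse steps 2..64: 1/64
# Yet at every sparse step the forecast error is
# |mu(x_{n_k}=1) - rho(x_{n_k}=1)| = 1 - 1/2 = 1/2.
```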

The following table summarizes the conditions on the dominance
coefficients and whether each is sufficient for prediction
(coefficients bounded away from zero are included for the sake of
completeness).
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}\hline
& $\E \bar d_n $ & $\bar d_n \vphantom{\bar{\hat d}} $ & $d_n$ & $\E\bar a_n$& $\bar a_n$& $a_n$\\\hline
$\log c_n^{-1}=o(n)$ & + & ? & $-$ & + & ? & $-$ \\\hline
$\sum_{n=1}^\infty\frac{(\log c_n^{-1})^2}{n^2}< \infty$ & + & + & $-$ & + & + & $-$ \\\hline\hline
$c_n\ge c>0$ & + & + & $+$ & + & + &+ \\\hline
\end{tabular}
\end{center}

An open question is whether $\log c_n^{-1}=o(n)$ is
sufficient for prediction in $\bar d_n$, or at least in $\bar a_n$.
Another problem is to find out whether any conditions on the
dominance coefficients are necessary for prediction; so far we
only have some sufficient conditions. On the one hand, the
obtained results suggest that some form of dominance with
decreasing coefficients may be necessary for prediction, at least
in the sense of convergence of averages. On the other hand, the
condition~(\ref{eq:dom}) is uniform over all sequences, which is
probably not necessary for prediction. As for prediction in the
sense of almost sure convergence, perhaps a more subtle behavior of
the ratio $\frac{\mu(x_{1:n})}{\rho(x_{1:n})}$ should be analyzed,
since dominance with decreasing coefficients is not sufficient for
prediction in this sense.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% B i b l i o g r a p h y %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{thebibliography}{99}
\bibitem{dag}
R.~Ahlswede, A.~Apostolico, V.~I.~Levenshtein (Editors),
Combinatorial and Algorithmic Foundations of Pattern and Association Discovery, Dagstuhl Seminar Proceedings 06201, 2006.
\bibitem{Blackwell:62}
D.~Blackwell and L.~Dubins.
\newblock Merging of opinions with increasing information.
\newblock {\em Annals of Mathematical Statistics}, 33:882--887, 1962.
\bibitem{Hutter:04uaibook}
M.~Hutter.
\newblock {\em Universal Artificial Intelligence: Sequential Decisions based on
Algorithmic Probability}.
\newblock Springer, Berlin, 2005.
\newblock 300 pages, http://www.idsia.ch/$_{^{\sim}}$marcus/ai/uaibook.htm.
\bibitem{Hutter:06usp}
M.~Hutter.
\newblock On the foundations of universal sequence prediction.
\newblock In {\em Proc. 3rd Annual Conference on Theory and Applications of
Models of Computation ({TAMC'06})}, volume 3959 of {\em LNCS}, pages
408--420. Springer, 2006.
\bibitem{Jackson:99}
M.~Jackson, E.~Kalai, and R.~Smorodinsky.
\newblock Bayesian representation of stochastic processes under learning: de
{F}inetti revisited.
\newblock {\em Econometrica}, 67(4):875--893, 1999.
\bibitem{BRyabko:06}
B.~Ryabko and J.~Astola.
\newblock Universal codes as a basis for time series testing.
\newblock {\em Statistical Methodology}, To appear, available online, 2006.
\bibitem{Ryabko:06actopt}
D.~Ryabko and M.~Hutter.
\newblock Asymptotic learnability of reinforcement problems with arbitrary
dependence.
\newblock In {\em Proc. 17th International Conference on Algorithmic
  Learning Theory (ALT'06)}, volume 4264 of {\em LNCS}, pages
  334--347. Springer, 2006.
\bibitem{ext}
D.~Ryabko and M.~Hutter.
\newblock On sequence prediction for arbitrary measures.
\newblock http://arxiv.org/abs/cs.LG/0606077, 2006.
\bibitem{BRyabko:88}
B.~Ryabko.
\newblock Prediction of random sequences and universal coding.
\newblock {\em Problems of Information Transmission}, 24:87--96, 1988.
\bibitem{Shiryaev:96}
A.~N. Shiryaev.
\newblock {\em Probability}.
\newblock Springer, 1996.
\bibitem{Solomonoff:64}
R.~J. Solomonoff.
\newblock A formal theory of inductive inference: Part 1 and 2.
\newblock {\em Inform. Control}, 7:1--22, 224--254, 1964.
\bibitem{Solomonoff:78}
R.~J. Solomonoff.
\newblock Complexity-based induction systems: comparisons and convergence
theorems.
\newblock {\em IEEE Trans. Information Theory}, IT-24:422--432, 1978.
\bibitem{Zvonkin:70}
A.~K. Zvonkin and L.~A. Levin.
\newblock The complexity of finite objects and the development of the concepts
of information and randomness by means of the theory of algorithms.
\newblock {\em Russian Mathematical Surveys}, 25(6):83--124, 1970.
\end{thebibliography}
\end{document}