%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Offline to Online Conversion %%
%% Marcus Hutter (2013-2014) %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt,twoside]{article}
\newdimen\paravsp \paravsp=1.3ex % named paragraph spacing
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\usepackage{amsmath,amssymb}
\sloppy\lineskip=0pt
%-------------------------------%
% My Math-Environments %
%-------------------------------%
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\newtheorem{theorem}{Theorem}
\newtheorem{open}[theorem]{Open Problem}
\newenvironment{keywords}{\centerline{\bf\small
Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\newenvironment{proof}{{\noindent\bf Proof.}}{\vskip 1ex}
\def\paradot#1{\vspace{\paravsp plus 0.5\paravsp minus 0.5\paravsp}\noindent{\bf\boldmath{#1.}}} % boldface paragraph.
\def\paranodot#1{\vspace{\paravsp plus 0.5\paravsp minus 0.5\paravsp}\noindent{\bf\boldmath{#1}}} % boldface paragraph
\def\qmbox#1{{\quad\mbox{#1}\quad}} % space text space in math mode
\def\req#1{\eqref{#1}} % reference to equations (obsolete)
\def\eps{\varepsilon} % small real > 0
\def\epstr{\epsilon} % empty string
\def\nq{\hspace{-1em}} % negative 1em space
\def\qed{\hspace*{\fill}\rule{1.4ex}{1.4ex}$\quad$\\} % end of proof
\def\frs#1#2{{^{#1}\!/\!_{#2}}} % ^#1/_#2 fraction
\def\fr#1#2{{\textstyle{#1\over#2}}} % textstyle fraction
\def\SetN{\mathbb{N}} % Set of Natural numbers
\def\SetB{\mathbb{B}} % Set {0,1}
\def\SetQ{\mathbb{Q}} % Set of Rational numbers
\def\e{{\rm e}} % natural base e
\def\v{\boldsymbol} % boldface vector (alt:\boldsymbol\frak\Bbb\pmb\text)
\def\B{\{0,1\}} % {0,1}
\def\E{{\mathbb E}} % Expectation
\def\g{\gamma}
\def\d{\delta}
\def\A{{\cal A}} % measurable set
\def\X{{\cal X}} % alphabet
\def\M{{\cal M}} % class of distributions
\def\N{{\cal N}} % normalization
\def\P{{\cal P}} % partition
\def\t{\tilde} % tilde
\def\Srat{\text{rat}}
\def\Sno{\text{n1}}
\def\Slim{\text{lim}}
\def\Smix{\text{mix}}
\def\DTIME{\text{TIME}}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% T i t l e - P a g e %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\vspace{-4ex}
\vskip 2mm\bf\Large\hrule height5pt \vskip 4mm
Offline to Online Conversion
\vskip 4mm \hrule height2pt}
\author{{\bf Marcus Hutter}\\[3mm]
\normalsize Research School of Computer Science \\[-0.5ex]
\normalsize Australian National University \\[-0.5ex]
\normalsize Canberra, ACT, 0200, Australia \\
\normalsize \texttt{http://www.hutter1.net/}
}
\date{12 July 2014}
\maketitle
\begin{abstract}
We consider the problem of converting offline estimators into an
online predictor or estimator with small extra regret. Formally
this is the problem of merging a collection of probability
measures over strings of length 1,2,3,... into a single
probability measure over infinite sequences. We describe various
approaches and their pros and cons on various examples. As a
side-result we give an elementary non-heuristic purely
combinatoric derivation of Turing's famous estimator. Our main
technical contribution is to determine the computational
complexity of online estimators with good guarantees in general.
\vspace*{1ex}
% tiny top-level table of contents
\def\contentsname{\centering\normalsize Contents}\setcounter{tocdepth}{1}
{\parskip=-2.7ex\tableofcontents}
\end{abstract}
\begin{keywords} %\small
offline, online, batch, sequential, probability, estimation, prediction,
time-consistency, normalization, tractable, regret, combinatorics, Bayes, Laplace, Ristad, Good-Turing.
\end{keywords}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{sec:Intro}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bach/offline estimation
A standard problem in statistics and machine learning is to
estimate or learn an in general non-i.i.d.\ probability
distribution $q_n:\X^n\to[0,1]$ from a batch of data $x_1,...,x_n$.
$q_n$ might be the Bayesian mixture over a class of distributions
$\M$, or the (penalized) maximum likelihood (ML/MAP/MDL/MML)
distribution from $\M$, or a combinatorial probability, or an
exponentiated code length, or else. This is the batch or {\em
offline} setting. An important problem is to predict $x_{n+1}$ from
$x_1,...,x_n$ sequentially for $n=0,1,2...$, called {\em online}
learning if the predictor improves with $n$. A stochastic
prediction $\t q(x_{n+1}|x_{1:n})$ can be useful in itself (e.g.\
weather forecasts), or be the basis for some decision, or be used
for data compression via arithmetic coding, or otherwise. We use
the prediction picture, but could have equally well phrased
everything in terms of log-likelihoods, or perplexity, or
code-lengths, or log-loss.
% Sequential/online prediction
The naive predictor
$\t q^\Srat(x_{n+1}|x_1...x_n):=q_{n+1}(x_1...x_{n+1})/q_n(x_1...x_n)$
is not properly normalized to 1 if $q_n$ and $q_{n+1}$ are not
compatible. We could fix the problem by normalization $\t
q^\Sno(x_{n+1}|x_1...x_n):=\t
q^\Srat(x_{n+1}|x_1...x_n)/\sum_{x_{n+1}}\t
q^\Srat(x_{n+1}|x_1...x_n)$, but this may result in a very poor
predictor. We discuss two further schemes, $\t q^\Slim$ and $\t
q^\Smix$, the latter having good performance guarantees (small
regret), but a direct computation of either is prohibitive.
% Open problem
A major open problem is to find a computationally tractable online
predictor $\t q$ with provably good performance given offline
probabilities ($q_n$). A positive answer would benefit many applications.
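The normalization issue above is easy to demonstrate concretely. The following sketch is ours, not part of the paper: the toy tables $q_1,q_2$ are arbitrary incompatible offline estimators on a binary alphabet. It shows that the naive ratio predictor need not sum to 1 over the next symbol, while its normalization does by construction:

```python
from itertools import product

X = [0, 1]

# Toy offline estimators: probability tables on X^1 and X^2 that are
# deliberately NOT time-consistent (q2's marginal differs from q1).
q1 = {(0,): 0.5, (1,): 0.5}
q2 = dict(zip(product(X, repeat=2), [0.4, 0.2, 0.1, 0.3]))

def q_rat(x2, x1):
    """Naive ratio predictor q2(x1 x2)/q1(x1)."""
    return q2[(x1, x2)] / q1[(x1,)]

def q_no(x2, x1):
    """Naively normalized predictor q_rat / N(x1)."""
    N = sum(q_rat(y, x1) for y in X)
    return q_rat(x2, x1) / N

mass_rat = sum(q_rat(y, 0) for y in X)  # 1.2: not a probability
mass_no = sum(q_no(y, 0) for y in X)    # 1.0 by construction
```

Normalization repairs summing-to-one but, as discussed below, may destroy the quality of the predictions.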
%-------------------------------%
\paradot{Applications}
%-------------------------------%
(i) Being able to use an offline estimator to make stochastic
predictions (e.g.\ weather forecasts) is of course useful. The
predictive probability needs to sum to 1 which $\t q^\Sno$
guarantees, but the regret should also be small, which only $\t
q^\Smix$ guarantees.
(ii) Given a parameterized class of (already) online estimators
$\{\t q^\theta\}$, estimating the parameter $\theta$ from data $x_1...x_n$
(e.g. maximum likelihood) for $n=1,2,3,...$ leads to a sequence of
parameters $(\hat\theta_n)$ and a sequence of estimators
$(q_n):=(\t q^{\hat\theta_n})$ that is usually {\em not} online. They
need to be reconverted to become online to be useful for prediction or
compression, etc.
(iii) Arithmetic coding requires an online estimator, but often is based
on a class of distributions as described in (ii). The default
`trick' to get a fast and online estimator is to use
$\t q^{\hat\theta_n}(x_{n+1}|x_{1:n})$ which is properly
normalized and often very good.
(iv) Online conversions are needed even for some offline purposes.
For instance, computing the cumulative distribution function
$\sum_{y_{1:n}\leq x_{1:n}} q_n(y_{1:n})$ can be hard in general,
but can be computed in time $O(n)$ if $(q_n)$ is (converted to) online.
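Application (iv) can be made concrete: for a time-consistent $\t q$ given through its conditionals, the chain rule yields the cumulative distribution function with $O(n\,|\X|)$ conditional evaluations. A minimal sketch (ours; the uniform measure serves as the toy online estimator):

```python
def cdf(pred, x):
    """Cumulative distribution sum_{y_{1:n} <= x_{1:n}} q(y_{1:n}) for a
    time-consistent q given via conditionals pred(symbol, prefix):
    chain rule, O(n*|X|) conditional evaluations."""
    X = [0, 1]
    total, joint = 0.0, 1.0          # joint tracks q(x_{<t})
    for t, xt in enumerate(x):
        for y in X:
            if y < xt:               # strings branching below x at step t
                total += joint * pred(y, x[:t])
        joint *= pred(xt, x[:t])
    return total + joint             # plus q(x_{1:n}) itself

def uniform(sym, prefix):            # uniform TC measure on {0,1}^infty
    return 0.5

val = cdf(uniform, (1, 0, 1))        # 6 of the 8 strings are <= 101
```

Time-consistency is what licenses adding `joint * pred(y, x[:t])`: it equals the total mass of all length-$n$ continuations of that prefix.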
%-------------------------------%
\paradot{Contributions \& contents}
%-------------------------------%
% Main purpose
The main purpose of this paper is to introduce and discuss the
problem of converting offline estimators $(q_n)$ to an online
predictor $\t q$ (Section~\ref{sec:Form}).
% Presentation contributions
% Section~\ref{sec:Sol}
We compare and discuss the pros and cons of the four conversion proposals (Section~\ref{sec:Sol}).
We also define the worst-case extra regret of online $\t q$ over offline $(q_n)$,
measuring the conversion quality.
% Section~\ref{sec:Ex}
We illustrate their behavior for various classical estimators (Bayes, MDL, Laplace, Good-Turing,
Ristad) (Section~\ref{sec:Ex}). Naive normalization of the triple uniform
estimator interestingly leads to the Good-Turing estimator, but induces huge extra regret,
while naive normalization of Ristad's quadruple uniform estimator induces negligible extra regret.
% Section~\ref{sec:CCqt}
Given that $\t q^\Sno$ can fail for interesting offline estimators,
natural questions to ask are: whether the excellent predictor $\t q^\Smix$
can be computed or approximated (yes), by an efficient algorithm (no),
whether for every $(q_n)$ there exists any fast $\t q$ nearly as good as
$\t q^\Smix$ (no), or whether there exist $(q_n)$ for which no fast
$\t q$ can even slightly beat the trivial uniform predictor (yes)
(Section~\ref{sec:CCqt}).
% Section~\ref{sec:CCqtproofs}
The proofs for these computational complexity results are deferred to
the next section (Section~\ref{sec:CCqtproofs}).
% Section~\ref{sec:Open}
These results do not preclude a satisfactory positive solution in
practice, in particular given the contrived nature of the
constructed $(q_n)$, but as any negative complexity result they show
that a solution requires extra assumptions or to moderate our
demands. This leads to some precise open problems to this effect
(Section~\ref{sec:Open}).
% Section~\ref{app:CCqt}
Proofs for the regret bounds can be found in Appendix~\ref{app:RnGT} and
a list of notation in Appendix~\ref{app:Notation}.
% Convincing derivation of the Good-Turing estimator
As a side-result we give the arguably most convincing (simplest and
least heuristic) derivation of the famous Good-Turing estimator.
Other attempts at deriving the estimator Alan Turing suggested in
1941 to I.J.~Good are less convincing (to us) \cite{Good:53}. They
appear more heuristic or convoluted, or are incomplete, often
assuming something close to what one wants to get out
\cite{Nadas:85}. Our purely combinatorial derivation also feels
right for 1941 and Alan Turing.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Problem Formulation}\label{sec:Form}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now formally state the problem of offline to online conversion
in three equivalent ways and the quality of a conversion.
%
Let $x_t\in\X$ for $t\in\{1,...,n\}$ and $x_{t:n}:=x_t...x_n\in\X^{n-t+1}$,
with $x_{<t}:=x_{1:t-1}$ and $x_{s:n}:=\epstr$ being the empty string for $s>n$.
An online predictor $\t q:\X^*\to[0;1]$ must be {\em time-consistent}
\beq\label{TC}
\t q(x_{1:n}) ~=~ \sum_{x_{n+1}\in\X}\t q(x_{1:n+1}) ~~\forall n~\forall x_{1:n}
\qmbox{and} \t q(\epstr) ~=~ 1
\eeq
which makes it equivalent to a probability measure on $\X^\infty$.
Given offline estimators $q_n:\X^n\to[0,1]$ with $\sum_{x_{1:n}}q_n(x_{1:n})=1$,
we measure the quality of a conversion to online $\t q$ by the worst-case extra regret
\beq\label{eq:ExR}
R_n ~\equiv~ R_n(\t q) ~\equiv~ R_n(\t q||q_n) ~:=~
\max_{x_{1:n}}\ln{q_n(x_{1:n})\over\t q(x_{1:n})}
\eeq
A single $q_s$ can be extended to strings of any length via
\beq\label{eq:qbar}
\bar q_s(x_{1:n}) ~:=~ \left\{
\begin{array}{cl}
\sum_{x_{n+1:s}} q_s(x_{1:s}) & \mbox{if}~ n\leq s \\[1ex]
q_s(x_{1:s})\,Q(x_{s+1:n}|x_{1:s}) & \mbox{if}~ n\geq s
\end{array}
\right.
\eeq
where $Q$ can be an arbitrary measure on $\X^\infty$, e.g.\ uniform $Q(x_{s+1:n}|x_{1:s})=|\X|^{s-n}$.
It is easy to see that $\t q:=\bar q_s$ is \ref{TC} with $R_s(\t q)=R_s(\bar q_s)=R_s(q_s)=0$,
but in general $R_n(\bar q_s)>0$ for $n\neq s$.
Therefore naive minimization of $R_n$ w.r.t.\ $\t q$ does not work.
Minimizing $\lim_{n\to\infty} R_n$ can also fail for a number of reasons:
the limit may not exist or is infinite,
or minimizing it leads to poor finite-$n$ performance
or is not analytically possible or computationally intractable.
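The objects just defined are straightforward to compute for small toy cases. The sketch below is ours: the length-dependent coin family is an arbitrary non-TC example in the spirit of application (ii). It checks that $\bar q_s$ is time-consistent with zero regret at $n=s$ but positive regret at $n\neq s$:

```python
from itertools import product
from math import log

X = [0, 1]

def q(n, x):
    """Toy non-TC offline family: i.i.d. coin whose bias 1/(n+1)
    is re-fitted for every length n (cf. application (ii))."""
    th = 1.0 / (n + 1)
    p = 1.0
    for b in x:
        p *= th if b else 1.0 - th
    return p

def qbar(s, x):
    """Extension bar q_s of q_s to any length, with uniform Q."""
    n = len(x)
    if n <= s:
        return sum(q(s, x + y) for y in product(X, repeat=s - n))
    return q(s, x[:s]) * 0.5 ** (n - s)

def regret(n, qt):
    """Worst-case extra regret R_n(qt||q_n)."""
    return max(log(q(n, x) / qt(x)) for x in product(X, repeat=n))

r_at_s = regret(2, lambda x: qbar(2, x))   # 0: bar q_2 matches q_2
r_off = regret(3, lambda x: qbar(2, x))    # > 0 for n != s
tc_gap = abs(qbar(2, (0,)) - sum(qbar(2, (0, y)) for y in X))  # TC holds
```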
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conversion Methods}\label{sec:Sol}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now consider four methods of converting offline estimators to
online predictors and discuss their pros and cons. They illustrate
the difficulties and serve as a starting point to a more
satisfactory solution.
%-------------------------------%
\paradot{Naive ratio}
%-------------------------------%
The simplest way to define a predictor $\t q$ from $q_n$ is via {\em rat}io
\beq\label{eq:qrat}
\t q^\Srat(x_t|x_{<t}) ~:=~ {q_t(x_{1:t})\over q_{t-1}(x_{<t})},
\qmbox{which telescopes to} \prod_{t=1}^n\t q^\Srat(x_t|x_{<t}) ~=~ q_n(x_{1:n})
\eeq
so $\t q^\Srat$ has zero regret, but as discussed in the introduction it is
not \ref{TC}: the conditionals do not sum to 1 if $q_{t-1}$ and $q_t$ are incompatible.
%-------------------------------%
\paradot{Naive normalization}
%-------------------------------%
We can normalize the ratio to
\beq\label{eq:qn1}
\t q^\Sno(x_t|x_{<t}) ~:=~ {\t q^\Srat(x_t|x_{<t})\over\N(x_{<t})}
\qmbox{with normalizer} \N(x_{<t}) ~:=~ \sum_{x_t\in\X}\t q^\Srat(x_t|x_{<t})
\eeq
which is \ref{TC} by construction, but as the examples in
Section~\ref{sec:Ex} show, its regret can be huge.
%-------------------------------%
\paradot{Limit}
%-------------------------------%
A further choice is the limit $\t q^\Slim(x_{1:n}):=\lim_{s\to\infty}\bar q_s(x_{1:n})$
of the extensions \req{eq:qbar}, which is \ref{TC} whenever the limit exists,
but the limit may not exist, and even when it does, it may be
intractable and have large regret.
%-------------------------------%
\paradot{Mixture}
%-------------------------------%
Finally we can mix the extensions $\bar q_s$ of \req{eq:qbar}:
\beq\label{eq:qmix}
\t q^\Smix(x_{1:n}) ~:=~ \sum_{s=0}^\infty w_s\,\bar q_s(x_{1:n})
\qmbox{with} w_s>0, ~~\sum_{s=0}^\infty w_s=1.
\eeq
$\t q^\Smix$ is \ref{TC} and its regret can easily be upper bounded \cite{Santhanam:06}:
\beq\label{eq:lnwbnd}
R_n(\t q^\Smix) ~=~ \max_{x_{1:n}}\ln {q_n(x_{1:n})\over\sum_{s=0}^\infty \bar q_s(x_{1:n}) w_s}
~\leq \max_{x_{1:n}}\ln {q_n(x_{1:n})\over\bar q_n(x_{1:n}) w_n} ~=~ \ln w_n^{-1}
\eeq
For e.g.\ $w_n:={1\over (n+1)(n+2)}$ we have $\ln w_n^{-1}\leq 2\ln(n+2)$
which usually can be regarded as small.
%
This shows that {\em any} offline estimator can be converted into
an online predictor with very small extra regret \req{eq:ExR}. Note
that while $\t q^\Smix$ depends on arbitrary $Q$ defined in
\req{eq:qbar}, the upper bound \req{eq:lnwbnd} on $R_n$ does not.
%
Unfortunately it is unclear how to convert this heavy construction
into an efficient algorithm.
% variations
A variation is to set $Q\equiv 0$, which makes $\t q^\Smix$ a
semi-measure, which could be made TC by naive normalization \req{eq:qn1}. Bound
\req{eq:lnwbnd} still holds since for $\t q^\Smix$ with $Q\equiv 0$ the
normalizer $\N\leq 1$.
%
Another variation is as follows.
Often $q_n$ violates \ref{TC} only weakly, in
which case a sparser prior, e.g. $w_{2^k}:={1\over(k+1)(k+2)}$ and
$w_n=0$ for all other $n$, can lead to even smaller regret.
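For the same kind of toy family, the mixture and its regret bound can be checked numerically. A sketch (ours; truncating the infinite sum at some $S\geq n$ only removes positive terms, so the bound $R_n\leq\ln w_n^{-1}$ survives truncation):

```python
from itertools import product
from math import log

X = [0, 1]

def q(n, x):
    """Toy non-TC offline family: coin with length-dependent bias."""
    th = 1.0 / (n + 1)
    p = 1.0
    for b in x:
        p *= th if b else 1.0 - th
    return p

def qbar(s, x):
    """Extension bar q_s of q_s to any length, with uniform Q."""
    n = len(x)
    if n <= s:
        return sum(q(s, x + y) for y in product(X, repeat=s - n))
    return q(s, x[:s]) * 0.5 ** (n - s)

w = lambda s: 1.0 / ((s + 1) * (s + 2))   # weights sum to 1 (telescoping)

def qmix(x, S=8):
    """Mixture over extensions, truncated at S >= len(x)."""
    return sum(w(s) * qbar(s, x) for s in range(S + 1))

n = 3
Rn = max(log(q(n, x) / qmix(x)) for x in product(X, repeat=n))
bound = log(1.0 / w(n))                   # ln w_n^{-1} = ln 20
```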
%-------------------------------%
\paradot{Further choices for $\t q$}
%-------------------------------%
Of course the four presented choices for $\t q$ do not exhaust all
options. Indeed, finding a tractable $\t q$ with good properties is
a major open problem. Several estimation procedures do not only
provide $q_n$ on $\X^n$, but measures on $\X^\infty$ or
equivalently for {\em each} $n$ {\em separately} a \ref{TC}
$q_n:\X^*\to[0;1]$ (see Bayes and crude MDL below). While this
opens further options for $\t q$, e.g.\ $\t
q(x_{n+1}|x_{1:n}):=q_n(x_{1:n+1})/q_n(x_{1:n})$ with some (weak)
results for MDL \cite{Hutter:05mdl2px}, it does not solve our main
problem.
%-------------------------------%
\paradot{Notes}
%-------------------------------%
Each solution attempt has its down-sides, and
a solution satisfying all our criteria remains open.
It is easy to verify that, if $q_n$ is already \ref{TC}, the first
three definitions of $\t q$ coincide, and $R_n=0$, which is
reassuring, but $\t q^\Smix$ in general differs due to the
arbitrary $w$ in \req{eq:qmix} and arbitrary $Q$ in $\bar q$ in \req{eq:qbar}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Examples}\label{sec:Ex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
All examples below fall in one of two major strategies for
designing estimators (the introduction mentions others we do not
consider). One strategy is to start with a class $\M$ of
probability measures $\nu$ on $\X^\infty$ in the hope one of them
is good. For instance, $\M$ may contain (a subset of) i.i.d.\
measures
$\nu_{\v\theta}(x_{1:n}):=\theta_{x_1}\cdot...\cdot\theta_{x_n}$
with $\theta_i\geq 0$ and $\theta_1+...+\theta_d=1$ and $d:=|\X|$.
One may either select a $\nu$ from $\M$ informed by given data
$x_{1:n}$ or take an average over the class. The other strategy
assigns uniform probabilities over subsets of $\X^n$. This
combinatorial approach will be described later. Some strategies
lead to \ref{TC} and some examples are \ref{TC}. For the others we
will discuss the various online conversions $\t q$.
%-------------------------------%
\paradot{Bayes}
%-------------------------------%
The Bayesian mixture over $\M$ w.r.t.\ some
prior (density) $w()$ is defined as
\beqn
q_n(x_{1:n}) ~:=~ \int_\M \nu(x_{1:n})\,w(\nu)\,d\nu
\eeqn
Since $q_n$ is \ref{TC},
$\t q^\Srat\equiv\t q^\Sno\equiv\t q^\Slim$ coincide,
$R_n=0$, and $\t q^\Srat$ is tractable if the Bayes mixture is.
Note that $\t q\not\in\M$ in general, in particular it is not
i.i.d. Assume the true sampling distribution $\mu$ is in $\M$. For
countable $\M$ and counting measure $d\nu$, we have
$q_n(x_{1:n})\geq\mu(x_{1:n})w(\mu)$, hence
$R_n^{\text{online}}=R_n^{\text{offline}}\leq\ln w(\mu)^{-1}$. For
continuous classes $\M$ we have
$R_n^{\text{online}}=R_n^{\text{offline}}\lesssim\ln w(\mu)^{-1}+O(\ln n)$ under some mild conditions
\cite{Barron:91,Hutter:03optisp,Hutter:07pquest}.
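For a finite class the bound $R_n\leq\ln w(\mu)^{-1}$ is immediate to verify numerically. A sketch (ours; three Bernoulli coins with an arbitrary prior serve as the class $\M$):

```python
from itertools import product
from math import log

X = [0, 1]
prior = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}   # finite class M with prior w

def iid(th, x):
    """i.i.d. coin nu_theta(x_{1:n})."""
    p = 1.0
    for b in x:
        p *= th if b else 1.0 - th
    return p

def bayes(x):
    """Bayes mixture q_n(x_{1:n}) = sum_nu nu(x_{1:n}) w(nu)."""
    return sum(wn * iid(th, x) for th, wn in prior.items())

# Time-consistency: marginalizing over x_{n+1} recovers q_n.
tc_gap = abs(bayes((1, 0)) - sum(bayes((1, 0, y)) for y in X))

# Regret against a true mu in M is bounded by ln 1/w(mu) = ln 4.
mu = 0.8
Rn = max(log(iid(mu, x) / bayes(x)) for x in product(X, repeat=6))
```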
%-------------------------------%
\paradot{MDL/NML/MAP}
%-------------------------------%
The MAP or MDL estimator is
\beqn
\hat q_n(x_{1:n}) ~:=~ \sup_{\nu\in\M}\{\nu(x_{1:n})\,w(\nu)\}
\qmbox{and} q_n(x_{1:n}) ~:=~ {\hat q_n(x_{1:n})\over\sum_{x_{1:n}}\hat q_n(x_{1:n})}
\eeqn
Since $\hat q_n$ is not even a probability on $\X^n$, we have to
normalize it to $q_n$. For uniform prior density $w()$, $\hat q_n$
is the maximum likelihood (ML) estimator, and $q_n$ is known under
the name normalized maximum likelihood (NML) or modern minimum description length (MDL).
Unlike Bayes, $q_n$ is {\em not} \ref{TC}, which causes all kinds of
complications
\cite{Gruenwald:07book,Hutter:09mdltvp,Hutter:14martoscx},
many of them can be traced back to our main open problem
and the unsatisfactory choices for $\t q$ \cite{Hutter:05mdl2px}.
$R_n^{\text{offline}}$ is essentially the same as for Bayes under similar conditions,
but $R_n^{\text{online}}$ depends on the choice of $\t q$.
%
Crude MDL simply selects
$q_n:=\arg\max_{\nu\in\M}\{\nu(x_{1:n})\,w(\nu)\}$ at time $n$,
which is a probability measure on $\X^\infty$. While this opens
additional options for defining $\t q$, they also can perform
poorly in the worst case \cite{Hutter:05mdl2px}.
%
Note that most versions of MDL often perform very well in practice,
comparable to Bayes; establishing robustness and proving guarantees remain open problems.
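The failure of time-consistency for NML is visible already for a Bernoulli class at $n=2$. A sketch (ours; the maximum over all $\theta\in[0,1]$ has the closed form $(k/n)^k(1-k/n)^{n-k}$ with $k$ the number of ones):

```python
from itertools import product

X = [0, 1]

def ml(x):
    """hat q_n(x) = sup_theta theta^k (1-theta)^(n-k), k = #ones."""
    n, k = len(x), sum(x)
    if k in (0, n):
        return 1.0
    t = k / n
    return t ** k * (1.0 - t) ** (n - k)

def nml(x):
    """Normalized maximum likelihood on X^n."""
    n = len(x)
    Z = sum(ml(y) for y in product(X, repeat=n))
    return ml(x) / Z

lhs = nml((1, 1))                        # 0.4
rhs = sum(nml((1, 1, y)) for y in X)     # 31/78 != 0.4: not TC
```

Marginalizing the length-3 NML does not recover the length-2 NML, which is exactly the offline-to-online conversion problem for MDL.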
%-------------------------------%
\paradot{Uniform}
%-------------------------------%
The uniform probability $q_n(x_{1:n}):=|\X|^{-n}$ is \ref{TC}, hence all
four $\t q$ coincide and $R_n=0$ (only for uniform $Q$ in the case of $\t q^\Smix$).
Unless data is uniform, this is a lousy estimator, since the
predictor $\t q(x_t|x_{<t})=1/|\X|$ ignores all past observations.
%-------------------------------%
\paradot{Ristad's quadruple uniform}
%-------------------------------%
Let $n_i:=|\{t\leq n:x_t=i\}|$ count the occurrences of symbol $i$
in $x_{1:n}$, let $\v n:=(n_1,...,n_d)$, and let $m:=|\A|$ be the
number of symbols $\A:=\{i:n_i>0\}$ that do appear, as opposed to the
symbols $\X\setminus\A$ that don't.
%
For $n>0$, $x_{1:n}$ may contain $m\in\{1,...,\min\{n,d\}\}$ different symbols, so we set
$q_n(m)=1/\min\{n,d\}$.
Now choose uniformly which $m$ symbols $\A$ appear, $q_n(\A|m)={d\choose m}^{-1}$ for $|\A|=m$.
There are ${n-1\choose m-1}$ ways of choosing the frequency of symbols consistent with
$n_1+...+n_d=n$ and $n_i>0\Leftrightarrow i\in\A$, hence $q_n(\v n|\A)={n-1\choose m-1}^{-1}$.
Finally, $q_n(x_{1:n}|\v n)={n\choose n_1...n_d}^{-1}$ as before.
Together
\beq\label{def:uuuu}
q_n(x_{1:n}) ~=~ {n\choose n_1~...~n_d}^{-1} {n-1\choose m-1}^{-1} {d\choose m}^{-1} {1\over\min\{n,d\}},
\qmbox{which implies}
\eeq
\beqn
\t q^\Srat(x_{n+1}=i|x_{1:n}) ~=~ {\min\{n,d\}\over\min\{n+1,d\}}\cdot
\left\{
{ {(n_i+1)(n-m+1)\over n(n+1)} \qmbox{if} n_i>0 \atop
{m(m+1)\over n(n+1)}\cdot{1\over d-m} \qmbox{if} n_i=0 }
\right.
\eeqn
This is not \ref{TC}, since
\beqn
\N(x_{1:n}) ~=~ {\min\{n,d\}\over\min\{n+1,d\}}\cdot
\left\{
{ 1+{2m\over n(n+1)} \qmbox{if} m<d \atop
1-{m(m-1)\over n(n+1)} \qmbox{if} m=d }
\right.
\eeqn
For $n=0$ we have $\t q^\Srat(x_1)=\t q^\Sno(x_1)=q_1(x_1)=1/d$ and $\N(\epstr)=1$.
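The quadruple uniform probability \req{def:uuuu} and its normalizer can be checked numerically. A small sketch (ours; binary alphabet $d=2$) verifies that $q_n$ sums to 1 over $\X^n$ and reproduces the value $\N(000)=\fr76$:

```python
from itertools import product
from math import comb, factorial

d = 2  # binary alphabet

def q(x):
    """Quadruple uniform probability of x in X^n for n >= 1:
    inverse multinomial * compositions * subsets * number of m-values."""
    n = len(x)
    counts = [x.count(i) for i in range(d)]
    m = sum(c > 0 for c in counts)
    multinom = factorial(n)
    for c in counts:
        multinom //= factorial(c)
    return 1.0 / (multinom * comb(n - 1, m - 1) * comb(d, m) * min(n, d))

# The four uniform choices compose to a probability on X^n:
total = sum(q(x) for x in product(range(d), repeat=4))

# Normalizer of the ratio predictor at x = 000 (n = 3, m = 1 < d):
x = (0, 0, 0)
N = sum(q(x + (i,)) for i in range(d)) / q(x)   # = 1 + 2/(3*4) = 7/6
```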
While by construction, the offline estimator should have good performance
(in the intended regime), the performance of the online version depends on how much
the normalizer exceeds 1. The first factor in $\N$ is $\leq 1$ and the $m=d$ case is $\leq 1$.
Therefore $\N(x_{1:n})\leq 1+{2m\over n(n+1)}\leq
1+{2\over n+1}$, where we have used $m\leq n$ in the second step.
The regret can hence be bounded by
\beqn
R_n(\t q^\Sno) ~\leq~ \sum_{t=1}^n \ln\max_{x_{<t}}\N(x_{<t})
~\leq~ \sum_{t=1}^{n-1}\ln\Big(1+{2\over t+1}\Big) ~\leq~ 2\ln n
\eeqn
so naive normalization of Ristad's quadruple uniform estimator
induces only negligible extra regret.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Computational Complexity}\label{sec:CCqt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
\paradot{Computational complexity of $\t q^\Smix$}
%-------------------------------%
The mixture $\t q^\Smix$ achieves the small extra regret \req{eq:lnwbnd},
but involves an infinite sum of exponentially large marginalization sums,
so its computability needs clarification.
%-------------------------------%
\begin{theorem}[Computability of $\t q^\Smix$]\label{thm:ccqtmix}
%-------------------------------%
$\t q^\Smix$ with $w_s={1\over(s+1)(s+2)}$ and uniform $Q$ can be computed to
relative accuracy $\eps$, i.e.\ there is an algorithm $A(x_{1:n},\eps)$ with
$|A(x_{1:n},\eps)/\t q^\Smix(x_{1:n})-1|<\eps$ running in time
$\e^{\e^{O(n)}/\eps}$, for any rational $\eps>0$.
\end{theorem}
The relative accuracy $\eps$ allows us to
compute the predictive distribution $\t q^\Smix(x_t|x_{<t})$ to
relative accuracy as well. It also implies $A(x_{1:n},\eps)>(1-\eps)\t
q^\Smix(x_{1:n})$, hence $R_n(A(\cdot,\eps)||q_n)\leq R_n(\t
q^\Smix||q_n)+{\eps\over 1-\eps}$, and approximate
normalization $|1-\sum_{x_{1:n}}A(x_{1:n},\eps)|<\eps$.
%-------------------------------%
\paradot{Computational complexity of general $\t q$}
%-------------------------------%
% Informal no fast non-oracle qt result
The existence of $\t q^\Smix$ shows that any offline estimator can
be converted into an online estimator with minimal extra regret
$R_n\leq 2\ln(n+2)$. While encouraging and of theoretical interest,
the provided algorithm for $\t q^\Smix$ is prohibitive. Indeed,
Theorem~\ref{thm:subopt} below establishes that there exist offline
$(q_n)$ computable in polynomial time for which the fastest
algorithm for {\em any} online (=TC) $\t q$ with $R_n\leq O(\log
n)$ is at least exponential in time.
% Informal no fast oracle qt result
Trivially $R_n\leq n\ln|\X|$ can always be achieved for any $(q_n)$
by uniform $\t q(x_{1:n})=|\X|^{-n}$. So a very modest quest would
be $R_n\leq (1-\eps)n\ln|\X|$. If we require $\t q$ to run in
polynomial time but with free oracle access to $(q_n)$,
Theorem~\ref{thm:poor} below shows that this is also not possible
for some exponential time $(q_n)$.
% Open problem
Together this does not rule out that for every fast $(q_n)$ there
exists a fast $\t q$ with e.g. $R_n\leq\sqrt{n}$. This is our main
remaining open problem to be discussed in Section~\ref{sec:Open}.
% Proof idea
The main proof idea for both results is as follows: We construct a
deterministic $(q_s)$ that is 1 on the sequence of quasi-independent quasi-random
strings $\dot x_{1:1}^1$, $\dot x_{1:2}^2$, $\dot x_{1:3}^3$,~...~.
The only way for $\t q(x_{1:n})$ to be not too much smaller than $\bar
q_s(\dot x_{1:n}^s)$ is to know $\dot x_{1:s}^s$. If $s=s(n)$ is
exponential in $n$ this costs exponential time. If $\t q$ has only
oracle access to $(q_s)$, it needs exponentially many oracle calls
even for linear $s(n)=(1+\eps)n$.
The general theorem is a bit unwieldy and is stated and proven in
the next section. Here we present and discuss the most interesting
special cases.
%
% Complexity class definitions
$\DTIME(g(n))$ is defined as the class of all algorithms that run
in time $O(g(n))$ on inputs of length $n$. Real-valued algorithms
produce for any rational $\eps>0$ given as an extra argument, an
$\eps$-approximation in this time, as did $A(x_{1:n},\eps)$ for $\t
q^\Smix$ above. Algorithms in $\text{E}^c:=\DTIME(2^{cn})$ run in
exponential time, while $\text{P}:=\bigcup_{k=1}^\infty\DTIME(n^k)$
is the classical class of all algorithms that run in polynomial
time (strictly speaking Function-P or FP \cite{Arora:09}).
%
% P=!=NP note
The theorems don't rest on any complexity separation
assumptions such as P$\neq$NP.
% Binary alphabet only
We only state and prove the theorems for binary alphabet
$\X=\SetB=\{0,1\}$. The generalization to arbitrary finite alphabet
is trivial. `For all large $n$' shall mean `for all but finitely
many $n$', denoted by $\forall'n$. $m>0$ is a constant that depends
on the machine model, e.g. $m=1$ for a random access machine (RAM).
%-------------------------------%
\begin{theorem}[Sub-optimal fast online for fast offline]\label{thm:subopt}
%-------------------------------%
For all $r>0$ and $c>0$ and $\eps>0$
\bqan
& (i) & \exists(q_s)\in\DTIME(s^{b+m})~
\forall\t q\in\text{E}^c: R_n(\t q||q_n)\geq r\ln n~\forall'n,
~\text{where}~ b:=\fr{c+1+\eps}{1-\eps}r
\\
& (ii) & \text{in particular for large $c$ and $r$:}~ \exists(q_s)\in\text{P}~
\forall\t q\in\text{E}^c: R_n\geq r\ln n~\forall'n
\\
& (iii) & \text{in particular for small $c,\eps$:}~ \exists(q_s)\!\in\!\DTIME(s^{r+m+\eps})\,
\forall\t q\!\in\!\text{P}: R_n\geq r\ln n~\forall'n
\\
& (iv) & \text{in particular for $\t q^\Smix$:}~~~ \exists(q_s)\in\text{P} : \t q^\Smix\not\in\text{E}^c
\eqan
\end{theorem}
In particular (iii) implies that there is an offline estimator
$(q_s)$ computable in quartic time $s^4$ on a RAM for which no
polynomial-time online estimator $\t q$ is as good as $\t q^\Smix$.
The slower $(q_s)$ we admit (larger $r$), the higher the lower
bound gets. (ii) says that even algorithms for $\t q$ running in
exponential time $2^{cn}$ cannot achieve logarithmic regret for all
$(q_s)\in\text{P}$. In particular this implies that (iv) any
algorithm for $\t q^\Smix$ requires super-exponential time for some
$(q_s)\in\text{P}$ on some arguments.
The next theorem is much stronger in the sense that it rules out
even very modest demands on $R_n$ but is also much weaker since it
only applies to online estimators for slow $(q_s)$ used as a black box
oracle. That is, $\t q^o(x_{1:n})$ can call $q_s(z_{1:s})$ for any
$s$ and $z_{1:s}$ and receives the correct answer. We define
$\DTIME^o(g(n))$ as the class of all algorithms with such oracle
access that run in time $O(g(n))$, where each oracle call is
counted only as one step, and similarly $\text{P}^o$ and $\text{E}^{c,o}$.
%-------------------------------%
\begin{theorem}[Very poor fast online using offline oracle]\label{thm:poor}
%-------------------------------%
For all $\eps>0$
\bqan
& & \exists o\equiv(q_s)\in\text{E}^1~
\forall \t q^o\in\text{E}^{\eps/2,o}:
R_n(\t q^o||q_n)\geq (1-\eps)n\ln 2~\forall'n
\\
& & \text{or cruder:}~\exists o\equiv(q_s)~
\forall \t q^o\in\text{P}^o:
R_n(\t q^o||q_n)\geq (1-\eps)n\ln 2~\forall'n
\eqan
\end{theorem}
The second line states that the trivial bound $R_n\leq n\ln 2$ achieved
by the uniform distribution can in general not be improved by a fast
$\t q^o$ that (only) has oracle access to the offline estimator.
Usually one does not state the complexity of the oracle, since it
does not matter, but knowing that an $o\in\text{E}^1$ is sufficient
(first line) tells us something: First, the negative result is not
an artifact of some exotic non-computable offline estimator. On the
other hand, if an exponential time offline $o$ is indeed needed to
make the result true, the result wouldn't be particularly
devastating. It is an open question whether an $o\in\text{P}$ can
cause such bad regret.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Computational Complexity Proofs}\label{sec:CCqtproofs}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
\paradot{Proof of Theorem~\ref{thm:ccqtmix}}
%-------------------------------%
The design of an algorithm for $\t q^\Smix$ and the analysis of its
run-time follows standard recipes, so will only be sketched. A
real-valued function $\t q^\Smix:\X^*\to[0;1]$ is (by definition)
computable (also called estimable
\cite{Hutter:04uaibook}), if there is an always halting
algorithm $A:\X^*\times\SetQ^+\to\SetQ$ with $|A(x_{1:n},\eps)-\t
q^\Smix(x_{1:n})|<\eps$ for all rational $\eps>0$. We assume there
is an oracle $q_t^{\eps}$ that provides $q_t$ to $\eps$-accuracy in
time $O(1)$. We assume that real numbers can be processed in unit
time. In reality we need $O(\ln\frs{1}\eps)$ bits to represent, and
time to process, real numbers to accuracy $\eps$. This leads to
some logarithmic factors in
run-time which are dwarfed by our exponentials, so will be ignored.
To compute $\bar q_s(x_{1:n})$ to accuracy $\frs{\eps}{2}$ we need
to call the oracle $q_s^{\eps/2N}$ a total of $N:=\max\{|\X|^{s-n},1\}$ times and
add up all numbers. We can compute $\t q^\Smix$ to $\eps$-accuracy
by the truncated sum $\sum_{s=0}^{2/\eps}\bar
q_s^{\eps/2}(x_{1:n})w_s$ with $w_s={1\over(s+1)(s+2)}$, since the
tail sum is bounded by $\frs{\eps}{2}$. Hence overall runtime is
$O(|\X|^{2/\eps-n})$. But this is not sufficient. For large $n$,
$\t q^\Smix(x_{1:n})$ is typically small, and we need a {\em relative}
accuracy of $\eps$, i.e.\ $|A(x_{1:n},\eps')/\t
q^\Smix(x_{1:n})-1|<\eps$.
%
For $Q(x_{1:n})=|\X|^{-n}$, we have $\t q^\Smix(x_{1:n})\geq
\fr12 Q(x_{1:n})=\fr12|\X|^{-n}$, hence $\eps'=\fr{\eps}{2}|\X|^{-n}$
suffices. Run time becomes $O(|\X|^{{4\over\eps}|\X|^n-n})\leq
\e^{\e^{O(n)}/\eps}$.
\qed
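The truncated-sum computation in the proof can be mimicked directly. A sketch (ours; an i.i.d. toy family whose marginalization is analytic, so the oracle sums are cheap) checks the tail bound $\sum_{s>S}w_s={1\over S+2}\leq\frs{\eps}{2}$ for $S=\lceil 2/\eps\rceil$:

```python
from math import ceil

def w(s):
    return 1.0 / ((s + 1) * (s + 2))

def coin(th, x):
    p = 1.0
    for b in x:
        p *= th if b else 1.0 - th
    return p

def qbar(s, x):
    """bar q_s with uniform Q for the toy family q_s = i.i.d. coin with
    bias 1/(s+1); for n <= s the marginal is again a product, so no
    exponential marginalization sum is needed here."""
    n = len(x)
    th = 1.0 / (s + 1)
    if n <= s:
        return coin(th, x)
    return coin(th, x[:s]) * 0.5 ** (n - s)

def A(x, eps):
    """eps-approximation of qmix: truncate the mixture at S = ceil(2/eps);
    the tail sum_{s>S} w_s = 1/(S+2) <= eps/2 telescopes away."""
    S = ceil(2.0 / eps)
    return sum(w(s) * qbar(s, x) for s in range(S + 1))

x = (1, 0)
coarse = A(x, 0.5)      # S = 4
fine = A(x, 1e-4)       # S = 20000, a proxy for the infinite sum
```

The double-exponential blow-up in the theorem comes precisely from the brute-force marginalization sums that this analytic toy family avoids.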
%-------------------------------%
\begin{theorem}[Fast offline can imply slow online (general)]\label{thm:foso}
%-------------------------------%
Let $s(n)$ and $f(n)$ and $g(n)$ be monotone increasing functions.
$s(n)$ shall be injective and $\geq n$ for large $n$ with inverse
$n(s):=\max\{n:s(n)\leq s\}$ and $g(n)<\fr12 n^{-\d}h(n)$, where
$h(n):=2^{s(n)-n}[n^{-\g}-2^{f(s(n))-n}]$.
$m>0$ is a constant depending on the machine model, e.g. $m=1$ for a RAM.
Then for all $\g>0$ and
$\delta>0$ it holds that
\bqan
& & \exists o\equiv(q_s)\in\DTIME(n(s)^{\g+\delta}2^{n(s)}g(n(s))s^m)~
\\
& & \forall \t q^o\in\DTIME^o(g(n)):
R_n(\t q^o||q_n)\geq f(n)\ln 2~\forall'n
\eqan
\end{theorem}
%-------------------------------%
\paradot{Proof of Theorem~\ref{thm:foso}}
%-------------------------------%
%-------------------------------%
\paradot{Effective quasi-sparse sets}
%-------------------------------%
We need a single set $\{\dot x_{1:s}^s\}_{s\in\SetN}\equiv\{\dot
x_1^1,\dot x_{1:2}^2,\dot x_{1:3}^3,...\}$ of sequences that is
``safe'' against {\em every} polynomial time $\t q$ in a sense to
be clarified below. Let $o=(q_s)$ be any deterministic oracle, i.e.
for every $s$, $q_s$ is 1 on exactly one string, namely $\dot
x_{1:s}^s$. Let $T_1^o,T_2^o,...$ be an enumeration of all Turing
machines with access to oracle $o$, but each $T_k^o(z_{1:n})$ is terminated
after time $k^{\delta/\g}g(n)$. Any $\delta>0$ and $\g>0$
will do. Therefore $(T_k^o)$ enumerates all time-bounded machines.
The idea of the following construction is to return an $\dot x_{1:s}^s$ that
is not in any effective quasi-sparse set of the form
\beqn
L_k^{n,o}:=\left\{
{\tilde L_k^{n,o} \qmbox{if} |\tilde L_k^{n,o}|\leq 2^{f(s(n))}
\atop \{\} ~~\qmbox{else}~~~~~~~~~~~~~~~~~~~~~~ } \right.,
\qmbox{where} \tilde L_k^{n,o} :=\{z\in\SetB^n:T_k^o(z)=1\}
\eeqn
and $f(s)$ is some (linear/logarithmic) monotone increasing function
and $s(n)$ is some injective (linear/exponential) monotone increasing function.
%-------------------------------%
\paradot{Constructing quasi-random sequences $\dot x_{1:s}^s$}
%-------------------------------%
For the construction to work, $\dot x_{1:s}^s$ should also not be
probed by any fast algorithm on any input. Since the algorithms can
probe oracle $o=(q_s)$ before $q_s$ has been constructed, we need a
careful construction in stages $s=1,2,3,...$. Assume $\dot
x_{1:s'}^{s'}$ and $q_{s'}$ have already been constructed for all
$s'<s$, and let $o_s$ be the oracle that answers according to these
$q_{s'}$ for $s'<s$ and answers 0 on all strings of length $\geq s$.
With $n:=n(s)$, let $C_{\geq s'}$ be the set of all strings of length
$\geq s'$ queried by one of the first $n(s')^\g$ (any $\g>0$ will do)
Turing machines $T_k^{o_{s'}}$ on any input $z_{1:n(s')}$. Now let\vspace{-1ex}
\beqn
F_s ~:=~ \SetB^s \setminus \Big( \bigcup_{k=1}^{n^\g} L_k^{n,o_s}\times\SetB^{s-n}\cup\bigcup_{s'=1}^s C_{\geq s'} \Big)
\eeqn
be the set of strings of length $s$ that roughly $(i)$ are not
queried and $(ii)$ whose length $n$ prefix is not in any quasi-sparse
set. If $F_s\neq\{\}$,
\beqn
\text{let} ~\dot x_{1:s}^s ~\text{be the lexicographically first string in}~ F_s
\eeqn
If $F_s=\{\}$, arbitrarily let $\dot x_{1:s}^s=0_{1:s}$.
In any case define $q_s(\dot x_{1:s}^s) := 1$, and 0 on all other sequences of length $s$.
%-------------------------------%
\paradot{Fast good $\t q^o$ implies $F_s=\{\}$}
%-------------------------------%
Let $\t q^o$ be an online (=TC) estimator with access to oracle
$o=(q_s)$ and small regret
\beq\label{eq:Rfbnd}
R_n(\t q^o||q_n)~\equiv~\max_{x_{1:n}}\ln{q_n(x_{1:n})\over\t q^o(x_{1:n})} ~<~ f(n)\ln 2
\qmbox{for all large $n$}
\eeq
This implies $\t q^o(x_{1:s}) > 2^{-f(s)}q_s(x_{1:s}) ~\forall
x_{1:s}\forall's$ and in particular $\t q^o(\dot x_{1:s}^s) >
2^{-f(s)}$. Since $\t q^o$ is TC, we have
$\t q^o(\dot x_{1:n}^s) \geq \t q^o(\dot x_{1:s}^s) > 2^{-f(s(n))}$.
Now let us assume that $\t q^o\in\DTIME^o(g(n))$. Then membership in
\beqn
L^{n,o} ~:=~ \{x_{1:n} : \t q^o(x_{1:n}) > 2^{-f(s)} \}
\eeqn
can be determined in the same (or less) time and
since $\t q^o$ is a probability, $|L^{n,o}|<2^{f(s)}$.
Therefore, there is a $k_0$ such that $T_{k_0}^o$ computes $L^{n,o}=L_{k_0}^{n,o}$.
Now assume $F_s\neq\{\}$ for some $n^\g\geq k_0$.
The construction of $\dot x_{1:s}^s$ is such that
$T_k^{o_s}(z_{1:n})=T_k^o(z_{1:n})$ for all $k\leq n^\g$ and all
$z_{1:n}$, since their oracles coincide
on all queried strings $y_{1:s'}$:
%
For $y_{1:s'}\neq\dot x_{1:s'}^{s'}$ both oracles answer 0.
For $s'<s$ both oracles answer 1 on $\dot x_{1:s'}^{s'}$.
For $s'\geq s$ the string $\dot x_{1:s'}^{s'}\not\in C_{\geq s'}$ is
never queried, so the differing answers are never used.
Hence $T_{k_0}^{o_s}$ also computes $L_{k_0}^{n,o_s}=L_{k_0}^{n,o}=L^{n,o}$.
But $\dot x_{1:s}^s\in F_s$ implies $\dot x_{1:n}^s\not\in
L_{k_0}^{n,o_s}$, i.e.\ $\t q^o(\dot x_{1:n}^s)\leq 2^{-f(s)}$,
contradicting $\t q^o(\dot x_{1:n}^s) > 2^{-f(s)}$. Therefore, $\t
q^o\in\DTIME^o(g(n))$ implies $F_s=\{\}$ for all large $s$.
%-------------------------------%
\paradot{$F_s=\{\}$ implies slow good $\t q^o$}
%-------------------------------%
$t(n):=n^\d g(n)$ upper bounds the running time of $T_k^o$ for $k\leq n^\g$.
It also bounds the number of oracle calls in $T_k^o$,
since each oracle call costs at least one step.
Note that $C_{\geq s'}\subseteq C_{\geq s''}$ if $s'\geq s''$ and $n(s')=n(s'')$,
which implies $\bigcup_{s'=1}^s C_{\geq s'} = \bigcup_{n'=1}^n C_{\geq s(n')}$ due to $s(n(s))\leq s$.
Using
\bqan
& & |L_k^{n,o_s}|\leq 2^{f(s)} \qmbox{and} |C_{\geq s}|\leq n^\g\,2^n t(n) ~~\Rightarrow~~ \Big|\bigcup_{n'=1}^nC_{\geq s(n')}\Big|\leq n^\g\,2^{n+1}t(n)
\\
& & \qmbox{implies} |F_s| ~\geq~ 2^s - n^\g\,2^{f(s)}2^{s-n} - n^\g 2^{n+1}t(n)
\\
& & \qmbox{hence} F_s=\{\} \qmbox{implies} 2t(n)\geq 2^{s-n}[n^{-\g}-2^{f(s)-n}] ~=:~ h(n)
\eqan
This contradicts the assumption on $g(n)$ in the theorem, hence
$F_s\neq\{\}$, hence $\t q^o\not\in\DTIME^o(g(n))$ for all $\t
q^o$ with regret \req{eq:Rfbnd}, whose contrapositive is
\beqn
\forall \t q^o\in\DTIME^o(g(n)): R_n(\t q^o||q_n)\geq f(n)\ln 2~\forall'n
\eeqn
%-------------------------------%
\paradot{Complexity of $(q_s)$}
%-------------------------------%
The construction of $q_s$ requires running $T_k^{o_s}(z_{1:n})$ for
all $z_{1:n}$ for all $k\leq n^\g$, each requiring $k^{\delta/\g}
g(n)\leq t(n)$ steps. Hence
$q_s\in\DTIME^{o_s}(n^\g 2^n t(n))$ where $n=n(s)$. We can
get rid of the self-reference to oracle $o_s$ by considering the
complexity of the iterative construction of $q_{s'}$ and $\dot
x_{1:s'}^{s'}$ for all $s'\leq s$.
Assume we have constructed and stored $\dot x_{1:s'}^{s'}$ for all
$s'<s$. Then every oracle call of $o_s$ can be answered by a table
lookup, which removes the self-reference from the construction.
Now insert $g(n)=2^{cn}$ with $c:=(1-\eps)b/r-1-\eps$ and
$n={r\log s\over 1-\eps}$: for any $\eps>0$ and
sufficiently large $n$ we have
\bqan
n^{\g+\delta}2^n g(n)s^m &=& s^m 2^{(c+1)n+(\g+\delta)\log n} \\
&\leq& s^m 2^{(c+1+\eps)n}
~=~ s^m 2^{(c+1+\eps){r\log s\over 1-\eps}}
~=~ s^{m+{c+1+\eps\over 1-\eps}r} ~=~ s^{b+m}
\eqan
This proves (i). (ii) is just a weaker version of (i) since
$\DTIME(s^{b+m})\subset\text{P}$. (iii) follows from the fact that $b:=r+\eps'$
implies $c>0$ for sufficiently small $\eps>0$, and
$\text{E}^c\supset\text{P}$. (iv) follows from (i) and the fact that
$R_n(\t q^\Smix)\leq 2\ln(n+2)$.
\qed
%-------------------------------%
\paradot{Proof of Theorem~\ref{thm:poor}}
%-------------------------------%
In Theorem~\ref{thm:foso} let $s=(1+\eps)n$ and $f(s)=(1-\eps)s$.
Then $h(n)=2^{\eps n}[n^{-\g}-2^{-\eps^2 n}]$,
so clearly $g(n):=2^{\eps n/2}<\fr12 n^{-\d}h(n)$ for large $n$.
For any $\eps>0$ and sufficiently large $n$ we have
$n^{\g+\delta}2^n g(n)s^m = (1+\eps)^m 2^{n+\eps n/2+(\g+\delta+m)\log n} \leq 2^{(1+\eps)n} = 2^s$.
\qed
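The arithmetic in the last two proofs can be sanity-checked with exact integer arithmetic (a Python check under illustrative, assumed parameter values $\g=\d=2$, $\eps=\fr12$, $n=200$, chosen so that $s$ and $f(s)$ are integers; the theorem's constants are not fixed by this check):

```python
# Sanity check of the counting arguments (assumed gamma = delta = 2, eps = 1/2).
n, eps_num, eps_den = 200, 1, 2          # eps = 1/2, n large enough
gamma = delta = 2                        # illustrative choices
s = n + n * eps_num // eps_den           # s = (1+eps)n = 300
f = s - s * eps_num // eps_den           # f(s) = (1-eps)s = 150
g = 2 ** (eps_num * n // (2 * eps_den))  # g(n) = 2^{eps n/2} = 2^50
t = n ** delta * g                       # t(n) = n^delta g(n)

# 2 t(n) < h(n) = 2^{s-n} (n^{-gamma} - 2^{f(s)-n}); multiplied through by n^gamma:
assert 2 * t * n ** gamma < 2 ** (s - n) - n ** gamma * 2 ** (s - n + f - n)

# |F_s| >= 2^s - n^gamma 2^{f(s)} 2^{s-n} - n^gamma 2^{n+1} t(n) > 0:
F_lower = 2 ** s - n ** gamma * 2 ** f * 2 ** (s - n) - n ** gamma * 2 ** (n + 1) * t
assert F_lower > 0
```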
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Open Problems}\label{sec:Open}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now discuss and quantify the problems raised earlier that are still open.
%
For some {\em specific} collection $(q_n)$ of probabilities,
does there exist a polynomial-time computable time-consistent $\t q$
with $R_n(\t q||q_n)\leq 2\ln(n+2)\,\forall n$?
%
Note that $\t q^\Smix$ satisfies the bound, but a direct computation is prohibitive.
So one way to a positive answer could be to find an efficient approximation of $\t q^\Smix$.
If the answer is negative for a specific $(q_n)$ one could try to weaken the requirements on $R_n$.
We have seen that for some (non-\ref{TC}) $(q_n)$, namely Ristad's, simple normalization $\t q^\Sno$ solves the problem.
% GT example
A concrete unanswered example is given by the triple uniform Good-Turing probabilities $(q_n)$.
Preliminary experiments indicate that they and
therefore $\t q^\Smix$ are more robust than current heuristic
smoothing techniques, so a tractable approximation of $\t q^\Smix$ would
be highly desirable. It would be convenient and insightful if such a
$\t q$ had a traditional GT representation but with a smarter
smoothing function $m_{()}$.
% Natural nasty q_n
The nasty $(q_n)$ constructed in the proof of
Theorem~\ref{thm:foso} is very artificial: It assigns extreme
probabilities (namely 1) to quasi-random sequences. It is unknown
whether there is any offline estimator of practical relevance (such
as Good-Turing) for which no fast online estimator can achieve
logarithmic regret.
%-------------------------------%
%\paradot{Open problem: general}
%-------------------------------%
An open problem for general $(q_n)$ is as follows:
Does there exist for every $(q_n)$ a polynomial-time algorithm that
computes a time-consistent $\t q$ with $R_n(\t q||q_n)\leq
f(n)\,\forall n$? We have shown that this is not possible for
$f(n)=O(\log n)$ and not even for $f(n)=(1-\eps)n\ln 2$ if $\t q$
has only oracle access to $(q_n)$. This still allows for a positive answer to
the following open problem:
%-------------------------------%
\begin{open}[Fast online from offline with small extra regret]\label{open:fofo}
%-------------------------------%
Can every poly\-nomial-time offline estimator $(q_n)$ be converted to
a polynomial-time online estimator $\t q$ with small regret $R_n(\t q||q_n)\leq
\sqrt{n}\,\forall' n$?
%Formally: $\forall(q_n)\exists\t q:R_n\leq\sqrt{n}\forall'n$?
Or weaker: $\forall(q_n)\in\text{P}\,\exists\t q\in\text{P}:R_n=o(n)$? %
Or stronger: $\forall(q_n)\in\text{P}\,\exists\t q\in\text{P}:R_n=O((\log n)^2)$?
\end{open}
%-------------------------------%
A positive answer would reduce once and for all the problem of
finding good online estimators to the apparently easier problem of
finding good offline estimators. We could also weaken our notion of
worst-case regret to e.g. expected regret $\E[\ln(q_n/\t q)]$.
Expectation could be taken w.r.t.\ $(q_n)$, but other choices are
possible. Other losses than logarithmic also have practical
interest, but I do not see how this makes the problem easier.
Ignoring computational considerations, of theoretical interest is
whether $O(\log n)$ is the best one can achieve in general, say
$\exists q_n\forall\t q:R_n(\t q)\geq\ln n$, or whether a constant is achievable.
%-------------------------------%
%\paradot{Other open problems}
%-------------------------------%
Devising general techniques for upper-bounding $R_n(\t q^\Sno||q_n)$,
especially techniques that yield small bounds, is of interest too.
%-------------------------------%
\paradot{Acknowledgements}
%-------------------------------%
Thanks to Jan Leike for feedback on earlier drafts.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{References}\label{sec:Bib}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\addcontentsline{toc}{section}{\refname}
\def\refname{\vspace{-4ex}}
\bibliographystyle{alpha}
\begin{small}
\begin{thebibliography}{ABCD}\parskip=0ex
\bibitem[AB09]{Arora:09}
S.~Arora and B.~Barak.
\newblock {\em Computational Complexity: A Modern Approach}.
\newblock Cambridge University Press, 2009.
\bibitem[AS74]{Abramowitz:74}
M.~Abramowitz and I.~A. Stegun, editors.
\newblock {\em Handbook of Mathematical Functions}.
\newblock Dover Publications, 1974.
\bibitem[BC91]{Barron:91}
A.~R. Barron and T.~M. Cover.
\newblock Minimum complexity density estimation.
\newblock {\em IEEE Transactions on Information Theory}, 37:1034--1054, 1991.
\bibitem[CG99]{Chen:99}
S.~F. Chen and J.~Goodman.
\newblock An empirical study of smoothing techniques for language modeling.
\newblock {\em Computer Speech and Language}, 13:359--394, 1999.
\bibitem[Goo53]{Good:53}
I.~J. Good.
\newblock The population frequencies of species and the estimation of
population parameters.
\newblock {\em Biometrika}, 40(3/4):237--264, 1953.
\bibitem[Gr{\"u}07]{Gruenwald:07book}
P.~D. Gr{\"u}nwald.
\newblock {\em The Minimum Description Length Principle}.
\newblock The MIT Press, Cambridge, 2007.
\bibitem[Hut03]{Hutter:03optisp}
M.~Hutter.
\newblock Optimality of universal {B}ayesian prediction for general loss and
alphabet.
\newblock {\em Journal of Machine Learning Research}, 4:971--1000, 2003.
\bibitem[Hut05]{Hutter:04uaibook}
M.~Hutter.
\newblock {\em Universal Artificial Intelligence: Sequential Decisions based on
Algorithmic Probability}.
\newblock Springer, Berlin, 2005.
\bibitem[Hut09]{Hutter:09mdltvp}
M.~Hutter.
\newblock Discrete {MDL} predicts in total variation.
\newblock In {\em Advances in Neural Information Processing Systems 22
({NIPS'09})}, pages 817--825, Cambridge, MA, USA, 2009. Curran Associates.
\bibitem[LH14]{Hutter:14martoscx}
J.~Leike and M.~Hutter.
\newblock Indefinitely oscillating martingales.
\newblock {\em Technical report:\\ http://www.hutter1.net/publ/martoscx.pdf},
2014.
\bibitem[Nad85]{Nadas:85}
A.~Nadas.
\newblock On {T}uring's formula for word probabilities.
\newblock {\em IEEE Transactions on Acoustics, Speech, and Signal Processing},
33(6):1414--1416, 1985.
\bibitem[PH05]{Hutter:05mdl2px}
J.~Poland and M.~Hutter.
\newblock Asymptotics of discrete {MDL} for online prediction.
\newblock {\em IEEE Transactions on Information Theory}, 51(11):3780--3795,
2005.
\bibitem[RH07]{Hutter:07pquest}
D.~Ryabko and M.~Hutter.
\newblock On sequence prediction for arbitrary measures.
\newblock In {\em Proc. IEEE International Symposium on Information Theory
({ISIT'07})}, pages 2346--2350, Nice, France, 2007. IEEE.
\bibitem[Ris95]{Ristad:95}
E.~S. Ristad.
\newblock A natural law of succession.
\newblock Technical Report CS-TR-495-95, Princeton University, 1995.
\bibitem[San06]{Santhanam:06}
N.~Santhanam.
\newblock {\em Probability Estimation and Compression Involving Large
Alphabets}.
\newblock PhD thesis, University of California, San Diego, USA, 2006.
\bibitem[Sol78]{Solomonoff:78}
R.~J. Solomonoff.
\newblock Complexity-based induction systems: Comparisons and convergence
theorems.
\newblock {\em IEEE Transactions on Information Theory}, IT-24:422--432, 1978.
\end{thebibliography}
\end{small}
\appendix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Proof of Theorem~\ref{thm:Rn1uuu}}\label{app:RnGT}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Upper bound
For GT we prove $\max_{x_{1:n}}\N_n\to 2$, therefore $\max_{x_{1:n}}\N(x_{1:n})\to 2$ due to
${\text{Part}(n)\over\text{Part}(n+1)}\to 1$ for $n\to\infty$.
We can upper bound \req{def:GTNn} as
\bqan
(n+1)\N_n &=& \sum_{r=0,m_r\neq 0\nq}^n(r+1)m_{r+1} + \sum_{r=0,m_r\neq 0\nq}^n r ~~ + \sum_{r=0,m_r\neq 0\nq}^n 1
\\
&\leq& \sum_{r'=1}^{n+1} r'm_{r'} + \sum_{r=0}^n r m_r + |\{r:m_r\neq 0\}|
\\
&=& n + n + |\{r:m_r\neq 0\}| ~\leq~ 2n+\sqrt{2n}+1
\eqan
The count $|\{r:m_r\neq 0\}|$ under the constraint $\sum_{r=0}^n r
m_r=n$ is maximized for $m_0=...=m_k=1$ and $m_{k+1}=...=m_n=0$ for
suitable $k$. We may have to set one $m_r=2$ to meet the
constraint. Therefore $n=\sum_{r=0}^n r m_r\geq\sum_{r=0}^k r =
{k(k+1)\over 2}\geq\fr12 k^2$, hence
$|\{r:m_r\neq 0\}|=k+1\leq\sqrt{2n}+1$.
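This maximization can be verified by brute force over all partitions of $n$ for small $n$ (a Python check; the `+ 1` accounts for the possible $r=0$ term):

```python
def partitions(n, max_part=None):
    """Yield all partitions of n as non-increasing tuples of positive parts."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for p in range(min(n, max_part), 0, -1):
        for rest in partitions(n - p, p):
            yield (p,) + rest

for n in range(1, 26):
    # distinct part sizes = values r >= 1 with m_r != 0; add 1 for r = 0
    most = max(len(set(p)) for p in partitions(n)) + 1
    assert (most - 1) ** 2 <= 2 * n   # hence |{r: m_r != 0}| <= sqrt(2n) + 1
```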
% Lower bound
For the lower bound we construct a sequence that attains the upper bound. For instance,
$x_{1:k(k+1)/2}=1223334444~...~k...k$ has $m_1=...=m_k=1$, hence
$x_{1:\infty}=1223334444...$ has $m_1\geq 1,...,m_k\geq 1$ for all $n\geq\fr12 k(k+1)$.
Conversely, for any $n$ we have $m_1\geq 1,...,m_k\geq 1$ with $k:=\lfloor\sqrt{2n}\rfloor-1$.
For the chosen sequence we therefore have
\beqn
(n+1)\N_n ~\geq~ \sum_{r=0}^{k-1}(r+1)(1+1) ~=~ k(k+1) ~\geq~ 2n-3\sqrt{2n}
\eeqn
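The count profile $m_1=\dots=m_k=1$ of this sequence is easy to check mechanically (a small Python check of the profile only, over an alphabet assumed large enough):

```python
from collections import Counter

def m_profile(x):
    """m_r := number of symbols occurring exactly r times in x."""
    counts = Counter(x)               # symbol -> frequency r
    return Counter(counts.values())   # frequency r -> m_r

k = 6
x = [sym for sym in range(1, k + 1) for _ in range(sym)]  # 1 2 2 3 3 3 ...
assert len(x) == k * (k + 1) // 2                 # n = k(k+1)/2
m = m_profile(x)
assert all(m[r] == 1 for r in range(1, k + 1))    # m_1 = ... = m_k = 1
```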
The upper and lower bounds together imply $\max_{x_{1:n}}\N_n=2\pm
O(n^{-1/2})$, therefore $\max_{x_{1:n}}\N(x_{1:n})=2\pm
O(n^{-1/2})$ due to
${\text{Part}(n)\over\text{Part}(n+1)}=1-O(n^{-1/2})$ \cite{Abramowitz:74}. Inserting
this into \req{eq:RnGT} gives $R_n=n\ln2\pm O(\sqrt{n})$.
% Lower bound for finite d
The upper bound holds for any $d$, but the lower bound requires
$d=\infty$ or at least $d\geq\sqrt{2n}$. We now show linear growth of
$R_n$ even for finite $d\geq 3$.
The lower bound is based on the same sequence as used in \cite{Santhanam:06}:
For $x_{1:\infty}=12(132)^\infty$ elementary algebra gives
$\N_n={5\over 3}+{7/3\over n+1}$ and $\N_{n+1}={5\over 3}+{5/3\over
n+2}$ and $\N_{n+2}={4\over 3}+{1\over n+3}$ for $n$ a multiple of
3, hence $\N_n\N_{n+1}\N_{n+2}\geq{100\over 27}$ (except
$\N_0\N_1\N_2=\fr23$). Together with asymptotics
$\ln(\text{Part}(n))\sim\pi\sqrt{2n/3}$
\cite{Abramowitz:74}, this implies that $R_n\geq {n\over
3}\ln{100\over 27} - O(\sqrt{n})$.
\qed
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{List of Notation}\label{app:Notation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabbing}
\hspace{0.18\textwidth} \= \hspace{0.68\textwidth} \= \kill
{\bf Symbol } \> {\bf Explanation} \\[0.5ex]
$\equiv$ \> identical, equal by definition, trivially equal \\[0.5ex]
${n\choose n_1...n_d}$ \> multinomial coefficient \\[0.5ex]
$n\in\SetN_0$ \> length of sequence \\[0.5ex]
$t\in\{1,...,n\}$ \> current ``time'' \\[0.5ex]
$s\in\SetN$ \> any ``time'' \\[0.5ex]
$\X=\{1,...,d\}$ \> finite alphabet, $d>1$ \\[0.5ex]
$i,x,x_t\in\X$ \> symbol \\[0.5ex]
$x_{t:n}\in\X^{n-t+1}$ \> sequence $x_t...x_n$ \\[0.5ex]
$x_{<t}=x_{1:t-1}$ \> sequence of length $t-1$ \\[0.5ex]
$\epstr=x_{1:0}=x_{<1}$ \> empty string \\[0.5ex]
$Q$ \> any measure on $\X^\infty$ \\[0.5ex]
$q_n:\X^n\to[0;1]$ \> offline estimated probability mass function \\[0.5ex]
$\bar q_s:\X^*\to[0;1]$ \> extends $q_n$ to any \ref{TC} probability on $\X^*$ \\[0.5ex]
$\t q:\X^*\to[0;1]$\> online estimator desired to be close to $q_n$ \\[0.5ex]
$\t q_{|\X^n}$ \> constrains the domain of $\t q$ to $\X^n$ \\[0.5ex]
$\log$, $\ln$ \> binary and natural logarithms, respectively \\[0.5ex]
$\DTIME^o(g(n))$ \> algorithms that run in time $O(g(n))$ with access to oracle $o$ \\[0.5ex]
$\text{P}:=\bigcup_{k=1}^\infty\DTIME(n^k)$ \> ~~~~~~~~polynomial time algorithms \\[0.5ex]
$\text{E}^c:=\DTIME(2^{cn})$ \> ~~~exponential time algorithms (much smaller than EXP or even E!) \\[0.5ex]
$\SetB:=\{0,1\}$ \> binary alphabet \\[0.5ex]
$\forall'n$ \> for all but finitely many $n$, short `for all large $n$' \\[0.5ex]
quasi \> akin to but not necessarily an established definition \\[0.5ex]
\end{tabbing}
\end{document}
%--------------------End-of-Off2Online.tex--------------------%