%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Adaptive Online Prediction by Following the Perturbed Leader%
%% Marcus Hutter & Jan Poland: Start: December 2003 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt,twoside]{article}
\usepackage{latexsym}
\usepackage{amsmath}
\topmargin=-1cm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm \unitlength=1mm
\sloppy\lineskip=0pt
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
%\def\dispmuskip{}\def\textmuskip{} %normal math-spacing
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}[theorem]{Definition}
\newenvironment{keywords}{\centerline{\bf\small
Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\def\citet{\cite}\def\citep{\cite}\def\citealt{\cite}\def\citeauthor{\cite}
\def\myparskip{\vspace{1.5ex plus 0.5ex minus 0.5ex}\noindent}
\def\paragraph#1{\myparskip{\bfseries\boldmath{#1.}}}
\def\paradot#1{\myparskip{\bfseries\boldmath{#1.}}}
\def\paranodot#1{\myparskip{\bfseries\boldmath{#1}}}
\def\eps{\varepsilon}
\def\nq{\hspace{-1em}}
\def\qed{\hspace*{\fill}$\Box\quad$\\}
\def\odt{{\textstyle{1\over 2}}}
\def\v{\boldsymbol}
\def\p{{\scriptscriptstyle+}}
\def\n{{n}}
\def\t{\pi}
\def\pin{{\scriptstyle\Pi}}
\def\Var{{\mbox{Var}}}
\def\Cov{{\mbox{Cov}}}
\def\SetR{I\!\!R}
\def\SetN{I\!\!N}
\def\N{{\cal N}}
\def\D{{\cal D}}
\def\S{{\cal S}}
\def\E{{\cal E}}
\def\X{{\cal X}} % input/perception set/alphabet
\def\Y{{\cal Y}} % output/action set/alphabet
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\def\scp{{\scriptscriptstyle^{\,\circ}}}
\def\sooe{{\textstyle{1\over\eta}}}
\def\FPL{\text{FPL} }
\def\IFPL{\text{IFPL} }
\def\leqt{_{1:t}}
\def\leqtj{_{1:{t_j}}}
\def\leqtjj{_{1:{t_{j-1}}}}
\def\leqT{_{1:T}}
\def\ltt{_{<t}}
\def\ltT{_{<T}}
\def\smin{^{min}}
\def\e{{\rm e}}
%-------------------------------%
\begin{document}
\title{Adaptive Online Prediction by Following the Perturbed Leader}
\author{Marcus Hutter \and Jan Poland}
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Setup and Notation}\label{secSetup}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We consider a finite or countably infinite class of experts
$\{e_i:1\leq i\leq\n\}$, $\n\leq\infty$. Each expert $i$ is assigned a
complexity $k^i\geq 0$, typically satisfying $\sum_i\e^{-k^i}\leq 1$; for
finite $\n$, the uniform choice $k^i=\ln\n$ is possible. At each time $t$,
every expert $i$ makes a prediction $y_t^i\in\Y$, then the true outcome
$x_t\in\X$ is revealed, and expert $i$ suffers loss
$s_t^i=$Loss$(x_t,y_t^i)\in[0,1]$, and $s_t=(s_t^i)_{1\leq i\leq
n}$ is the vector of all losses at time $t$. Let
$s\ltt=s_1+\ldots+s_{t-1}$ be the total past loss vector and
$s\leqt=s_1+\ldots+s_t$ the total loss vector including the current
loss $s_t$, and let
$s\leqt\smin=\min_i\{s\leqt^i\}$ be the loss of the \emph{best
expert in hindsight (BEH)}. Usually we do not know in advance the
time $t\geq 0$ at which the performance of our predictions is
evaluated.
%-------------------------------%
\paradot{General decision spaces}
%-------------------------------%
The setup can be generalized as follows. Let $\S\subset\SetR^n$ be the
\emph{state space} and $\D\subset\SetR^n$ the \emph{decision
space}. At time $t$ the state is $s_t\in\S$, and a decision
$d_t\in\D$ (which is made before the state is revealed) incurs a
loss $d_t\!\scp s_t$, where ``$\scp$" denotes the inner product. This
implies that the loss function is \emph{linear} in the states.
Conversely, each linear loss function can be represented in this
way. The decision which minimizes the loss in state $s\in\S$ is
\beq\label{Mdef}
M(s):=\arg\min_{d\in\D} \{d\scp s\}
\eeq
if the minimum exists. The application of this general framework
to PEA is straightforward: $\D$ is identified with the space of
all unit vectors $\E=\{e_i:1\leq i\leq n\}$, since a decision
consists of selecting a single expert, and $s_t\in[0,1]^n$, so
states are identified with losses. Only Theorems~\ref{thIFPL} and
\ref{thLowFPL} will be stated for a general decision space.
Our main focus is $\D=\E$. (Even for this special case, the scalar
product notation is not too heavy, and it will turn out to be
convenient.) All our results generalize to the simplex
$\D=\Delta=\{v\in[0,1]^n:\sum_i v^i=1\}$, since the minimum of a
linear function on $\Delta$ is always attained on $\E$.
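As a concrete illustration (not part of the original analysis), for $\D=\E$ the minimizer (\ref{Mdef}) is a plain coordinate argmin; a minimal Python sketch:

```python
def M(s):
    """Decision M(s) = argmin over d in E of d*s for the expert case D = E.

    A decision d = e_i simply selects expert i, so d*s = s[i]; the
    minimizing decision is the index of the smallest component.
    Ties are broken by the smallest index.
    """
    return min(range(len(s)), key=lambda i: s[i])

# For D = Delta (the probability simplex), the minimum of the linear
# function d*s is attained at a vertex e_i, so the same routine applies.
print(M([0.7, 0.2, 0.5]))  # -> 1 (expert 1 has the smallest loss)
```

The same coordinate argmin therefore serves for both $\D=\E$ and $\D=\Delta$.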
%-------------------------------%
\paradot{Follow the Perturbed Leader}
%-------------------------------%
Given $s\ltt$ at time $t$, an immediate idea to solve the expert
problem is to ``Follow the Leader'' (FL), i.e.\ to select the
expert $e_i$ which performed best in the past (minimizes
$s\ltt^i$), that is, to predict according to expert $M(s\ltt)$. This
approach fails for two reasons. First, for $n=\infty$ the minimum
in (\ref{Mdef}) may not exist. Second, for $n=2$ and
$s={0\,1\,0\,1\,0\,1\,\ldots\choose\frac{1}{2}\,0\,1\,0\,1\,0\,\ldots}$,
FL always chooses the wrong prediction \citep{Kalai:03}.
We solve the first problem by penalizing each expert by its
complexity, i.e.\ predicting according to expert $M(s\ltt+k)$. The
\emph{FPL (Follow the Perturbed Leader)} approach solves the
second problem by adding to each expert's loss $s\ltt^i$ a random
perturbation.
%
We choose this perturbation to be negative \emph{exponentially
distributed}, drawn either independently in each time step or once
and for all at the very beginning at time $t=0$. The former choice
is preferable in order to protect against an adaptive adversary who
generates the $s_t$, and in order to get bounds with high
probability (Section~\ref{secMisc}). For the main analysis,
however, the latter choice is more convenient. Due to linearity of
expectations, these two possibilities are equivalent when dealing
with {\it expected losses} (this is straightforward for an oblivious
adversary; for an adaptive adversary see Section~\ref{secAdap}), so
we can henceforth assume without loss of generality one initial
perturbation $q$.
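The equivalence of the two randomization schemes for expected losses can be checked empirically. The following Python sketch uses illustrative choices ($k=0$, $\eta_t\equiv 1$, a fixed loss sequence, i.e.\ an oblivious adversary):

```python
import random

def expected_fpl_loss(losses, fresh_q, trials=40000, seed=3):
    """Monte Carlo estimate of the expected total FPL loss against a fixed
    (oblivious) loss sequence, with k = 0 and eta_t = 1.

    fresh_q=False: one initial perturbation q ~ Exp(1)^n for all rounds;
    fresh_q=True:  an independent q is drawn in every round.
    """
    rng = random.Random(seed)
    n = len(losses[0])
    total = 0.0
    for _ in range(trials):
        q = [rng.expovariate(1.0) for _ in range(n)]
        past = [0.0] * n                       # cumulative past losses s_{<t}
        for s_t in losses:
            if fresh_q:
                q = [rng.expovariate(1.0) for _ in range(n)]
            i = min(range(n), key=lambda j: past[j] - q[j])
            total += s_t[i]
            past = [p + s for p, s in zip(past, s_t)]
    return total / trials

losses = [[0, 1], [1, 0], [0.5, 0.5], [1, 0]]
once = expected_fpl_loss(losses, fresh_q=False)
each = expected_fpl_loss(losses, fresh_q=True)
# by linearity of expectation the two estimates agree up to Monte Carlo error
```

Against an oblivious adversary the per-round choice distribution depends only on the marginal distribution of $q$, which is the same in both schemes, so the two expected losses coincide exactly.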
%-------------------------------%
\paranodot{The FPL algorithm} is defined as follows:\\
%-------------------------------%
%
\hspace*{1cm}Choose random vector $q\stackrel{d.}{\sim}\exp$,
i.e.\ $P[q^1...q^n]=\e^{-q^1}\cdot...\cdot\e^{-q^n}$ for $q\geq 0$.\\
\hspace*{1cm}For $t=1,...,T$\\
\hspace*{1cm}- Choose learning rate $\eta_t$.\\
\hspace*{1cm}- Output prediction of expert $i$ which minimizes $s\ltt^i+(k^i-q^i)/\eta_t$.\\
\hspace*{1cm}- Receive the loss vector $s_t$ and incur loss $s_t^i$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{IFPL bounded by Best Expert in Hindsight}\label{secIFPL}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The \emph{infeasible} FPL predictor (IFPL) differs from FPL in that it
uses $s_{1:t}$ instead of $s\ltt$, i.e.\ it additionally knows the
current loss vector $s_t$ when choosing its expert.
\begin{theorem}[IFPL bounded by BEH]\label{thIFPL}
Let $\D\subseteq\SetR^n$, $s_t\in\SetR^n$ for $1\leq t\leq T$, and
$k,q\in\SetR^n$. If $\eta_t>0$ is decreasing in $t$, then the loss of the infeasible FPL
knowing $s_t$ at time $t$ in advance (l.h.s.) can be bounded in
terms of the best predictor in hindsight (first term on r.h.s.) plus
additive corrections:
\beqn
\sum_{t=1}^T M(s_{1:t}+{k\!-\!q\over\eta_t})\scp s_t
\leq \min_{d\in\D}\{d\scp(s_{1:T}+{k\over\eta_T})\}
+ {1\over\eta_T}\max_{d\in\D}\{d\scp(q-k)\}
- {1\over\eta_T} M(s_{1:T}+{k\over\eta_T})\scp q.
\eeqn
\end{theorem}
Note that if $\D=\E$ (or $\D=\Delta$) and $s_t\geq 0$, then
all extrema in the theorem are attained almost surely. The
same holds for all subsequent extrema in the proof and
throughout the paper.
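Before turning to the proof, the FPL algorithm stated above can be simulated directly. A minimal Python sketch under illustrative choices (uniform complexities $k^i=\ln n$, learning rate $\eta_t=1/\sqrt t$, one initial perturbation):

```python
import math, random

def fpl(losses, seed=0):
    """Simulate FPL: at time t predict with the expert minimizing
    s_{<t}^i + (k^i - q^i)/eta_t, then incur that expert's loss s_t^i.

    Illustrative choices: uniform complexities k^i = ln n, learning rate
    eta_t = 1/sqrt(t), one initial perturbation q ~ Exp(1)^n.
    Returns (total FPL loss, loss of the best expert in hindsight).
    """
    n = len(losses[0])
    rng = random.Random(seed)
    q = [rng.expovariate(1.0) for _ in range(n)]
    k = math.log(n)
    past = [0.0] * n                              # s_{<t}
    total = 0.0
    for t, s_t in enumerate(losses, start=1):
        eta = 1.0 / math.sqrt(t)                  # decreasing eta_t
        i = min(range(n), key=lambda j: past[j] + (k - q[j]) / eta)
        total += s_t[i]
        past = [p + s for p, s in zip(past, s_t)]
    return total, min(past)                       # min(past) = s_{1:T}^min

# the shifted alternating sequence on which plain Follow the Leader
# errs in every round after the first
losses = [[0, 0.5]] + [[1, 0], [0, 1]] * 25
total, best = fpl(losses)
```

Unlike FL, the perturbed leader is not locked into the deterministic oscillation, which is exactly the point of the randomization.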
\paradot{Proof} For notational convenience, let $\eta_0=\infty$ and
$\tilde s\leqt=s\leqt+\frac{k-q}{\eta_t}$. Consider the losses
$\tilde s_t=s_t+(k-q)\big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\big)$
for the moment. We first show by induction on $T$ that the infeasible
predictor $M(\tilde s_{1:t})$ has zero regret for any loss $\tilde
s$, i.e.\
\beq\label{eqnoregret}
\sum_{t=1}^T M(\tilde s_{1:t})\scp \tilde s_t \leq M(\tilde s_{1:T})\scp \tilde s_{1:T}.
\eeq
For $T=1$ this is obvious. For the induction step from $T-1$ to $T$
we need to show
\beq\label{eq:noregret1}
M(\tilde s_{1:T})\scp \tilde s_T \leq M(\tilde s_{1:T})\scp
\tilde s_{1:T} - M(\tilde s\ltT)\scp \tilde s\ltT.
\eeq
After rearranging, (\ref{eq:noregret1}) is equivalent to
$M(\tilde s\ltT)\scp\tilde s\ltT\leq M(\tilde s_{1:T})\scp\tilde s\ltT$,
which holds, since $M(\tilde s\ltT)$ minimizes $d\scp\tilde s\ltT$ over
$\D$. Together with the induction hypothesis this proves
(\ref{eqnoregret}). By the telescoping property and ${1\over\eta_0}=0$ we
have $\tilde s_{1:t}=s_{1:t}+{k-q\over\eta_t}$, so the l.h.s.\ of
(\ref{eqnoregret}) equals
\beqn
\sum_{t=1}^T M(s_{1:t}+{\textstyle{k-q\over\eta_t}})\scp s_t
+\sum_{t=1}^T M(\tilde s_{1:t})\scp(k\!-\!q)
\Big({1\over\eta_t}-{1\over\eta_{t-1}}\Big).
\eeqn
Since ${1\over\eta_t}\geq{1\over\eta_{t-1}}$ and
$M(\tilde s_{1:t})\scp(k-q)\geq-\max_{d\in\D}\{d\scp(q-k)\}$, the second
sum is bounded from below by $-{1\over\eta_T}\max_{d\in\D}\{d\scp(q-k)\}$.
The r.h.s.\ of (\ref{eqnoregret}) is bounded from above by
\beqn
M(s_{1:T}+{\textstyle{k\over\eta_T}})\scp(s_{1:T}+{\textstyle{k-q\over\eta_T}})
=\min_{d\in\D}\{d\scp(s_{1:T}+{\textstyle{k\over\eta_T}})\}
-{1\over\eta_T}M(s_{1:T}+{\textstyle{k\over\eta_T}})\scp q.
\eeqn
Combining both estimates yields the assertion.
\qed
\begin{corollary}[IFPL bounded by BEH]\label{corIFPL}
If $\D=\E$ and $s_t\in[0,1]^n$ for all $t$, $P[q]=\e^{-\sum_i q^i}$ for
$q\geq 0$, $\sum_i\e^{-k^i}\leq 1$, and decreasing $\eta_t>0$, then the
expected loss
$r_{1:T}:=E[\sum_{t=1}^T M(s_{1:t}+{k-q\over\eta_t})\scp s_t]$ of the
infeasible FPL exceeds the loss of expert $i$
by at most $k^i/\eta_T$:
\beqn
r_{1:T} \;\leq\; s_{1:T}^i + {1\over\eta_T}k^i \quad\forall i.
\eeqn
\end{corollary}
Theorem~\ref{thIFPL} can be generalized to expert-dependent
factorizable $\eta_t\leadsto \eta_t^i=\eta_t\cdot\eta^i$
by scaling $k^i\leadsto k^i/\eta^i$ and $q^i\leadsto q^i/\eta^i$.
Using $E[\max_i\{{q^i-k^i\over\eta^i}\}]\leq
E[\max_i\{q^i-k^i\}]/\min_i\{\eta^i\}$, Corollary~\ref{corIFPL}
generalizes to
\beqn
E[\sum_{t=1}^T M(s_{1:t}+{k-q\over\eta_t^i})\scp s_t]
\;\leq\; s_{1:T}^i + {1\over\eta_T^i}k^i + {1\over\eta_T^{min}}
\quad\forall i,
\eeqn
where $\eta_T^{min}:=\min_i\{\eta_T^i\}$.
For example, for $\eta_t^i=\sqrt{k^i/t}$
we get the desired bound $s_{1:T}^i+\sqrt{T\cdot(k^i+4)}$.
Unfortunately we were not able to generalize Theorem~\ref{thFIFPL}
to expert-dependent $\eta$, necessary for the final bound on FPL.
In Section~\ref{secHierarchy} we solve this problem by a hierarchy
of experts.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Feasible FPL bounded by Infeasible FPL}\label{secFFPL}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This section establishes the relation between the FPL and IFPL
losses. Recall that $\ell_t=E\big[M(s_{1:t-1}+{k-q\over\eta_t})\scp s_t\big]$
is the expected loss of FPL at time $t$, while
$r_t=E\big[M(s_{1:t}+{k-q\over\eta_t})\scp s_t\big]$ is the expected
loss of IFPL.
\begin{theorem}[FPL bounded by IFPL]\label{thFIFPL}
Let $\D=\E$, $s_t\in[0,1]^n$ for all $t$, and $P[q]=\e^{-\sum_i q^i}$ for
$q\geq 0$. Then the expected loss of FPL at time $t$ is at most a factor
$\e^{\eta_t}$
larger than for the infeasible FPL:
\beqn
\ell_t\leq \e^{\eta_t}r_t, \qmbox{which implies}
\ell_{1:T}-r_{1:T}\leq \sum_{t=1}^T\eta_t \ell_t.
\eeqn
Furthermore, if $\eta_t\leq 1$, then also $\ell_t\leq
(1+\eta_t+\eta_t^2)r_t\leq (1+2\eta_t)r_t$.
\end{theorem}
\paradot{Proof}
Fix $t$ and an expert $i$, and condition on the perturbations $q^j$,
$j\neq i$. FPL chooses expert $i$ iff
$q^i\geq\eta_t s_{1:t-1}^i+k^i-\min_{j\neq i}\{\eta_t s_{1:t-1}^j+k^j-q^j\}=:a$,
while IFPL chooses expert $i$ iff $q^i\geq\tilde a$, where $\tilde a$ is
defined like $a$ with $s_{1:t-1}$ replaced by $s_{1:t}$. Since
$0\leq s_t^j\leq 1$ for all $j$, we have $\tilde a\leq a+\eta_t$, hence by
the memorylessness of the exponential distribution
$P[q^i\geq\tilde a]\geq P[q^i\geq a+\eta_t]\geq\e^{-\eta_t}P[q^i\geq a]$.
Taking the expectation over the $q^j$, $j\neq i$, this shows
$P[\mbox{IFPL chooses }i]\geq\e^{-\eta_t}P[\mbox{FPL chooses }i]$ for each
$i$. Multiplying by $s_t^i\geq 0$ and summing over $i$ yields
$\ell_t\leq\e^{\eta_t}r_t$, and summing
$\ell_t-r_t\leq(1\!-\!\e^{-\eta_t})\ell_t\leq\eta_t\ell_t$ over $t$ gives
the second assertion. The last assertion follows from
$\e^{\eta_t}\leq 1+\eta_t+\eta_t^2\leq 1+2\eta_t$ for $\eta_t\leq 1$.
\qed
\paradot{Remark}
Theorem~\ref{thFIFPL} generalizes to a decision space $\D$ satisfying
$d\scp s_t\leq A$ for all $d\in\D$, $s_t\in\S$, and some $A>0$: In this
case, we have $\ell_t\leq \e^{\eta_t A}r_t$. If $n$ is finite,
then the bound holds for $A=n$. For $n=\infty$, the assertion
holds under the somewhat unnatural assumption that $\S$ is
$l^1$-bounded.
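The one-step bound $\ell_t\leq\e^{\eta_t}r_t$ of Theorem~\ref{thFIFPL} can also be checked by Monte Carlo simulation; the following Python sketch uses arbitrary illustrative numbers:

```python
import math, random

def one_step(past, s_t, k, eta, trials=50000, seed=1):
    """Monte Carlo estimate of the expected one-step losses l_t and r_t.

    FPL  picks argmin_i past[i] + (k[i] - q[i]) / eta,
    IFPL picks argmin_i past[i] + s_t[i] + (k[i] - q[i]) / eta,
    and each incurs the loss s_t of its chosen expert; we average over
    fresh perturbations q ~ Exp(1)^n.
    """
    rng = random.Random(seed)
    n = len(s_t)
    l = r = 0.0
    for _ in range(trials):
        q = [rng.expovariate(1.0) for _ in range(n)]
        i_fpl = min(range(n), key=lambda i: past[i] + (k[i] - q[i]) / eta)
        i_ifpl = min(range(n), key=lambda i: past[i] + s_t[i] + (k[i] - q[i]) / eta)
        l += s_t[i_fpl]
        r += s_t[i_ifpl]
    return l / trials, r / trials

eta = 0.5
k = [math.log(3)] * 3                      # uniform complexities k^i = ln n
l_t, r_t = one_step([1.0, 1.2, 0.3], [1.0, 0.0, 1.0], k, eta)
assert l_t <= math.exp(eta) * r_t          # theorem: l_t <= e^eta * r_t
```

In this example FPL favors the expert with the smallest past loss, which happens to suffer a unit current loss, so $\ell_t>r_t$; the factor $\e^{\eta_t}$ nevertheless covers the gap.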
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\boldmath Combination of Bounds and Choices for $\eta_t$}\label{secBounds}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Throughout this section, we assume
\beq
\label{eq:Assumptions}
\D=\E,\quad s_t\in[0,1]^n\ \forall t,\quad
P[q]=\e^{-\sum_i q^i} \;\mbox{for}\; q\geq 0,\ \qmbox{and}
\sum_i \e^{-k^i}\leq 1.
\eeq
We distinguish \emph{static} and \emph{dynamic} bounds. Static
bounds refer to a constant $\eta_t\equiv\eta$. Since this value
has to be chosen in advance, a static choice of $\eta_t$ requires
certain prior information and therefore is not practical in many
cases. However, the static bounds are very easy to derive, and
they provide a good means to compare different PEA algorithms. If,
on the other hand, the algorithm is to be applied without
appropriate prior knowledge, a dynamic choice of $\eta_t$, depending
only on $t$ and/or past observations, is necessary.
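The following Python sketch illustrates the two regimes on random data (all choices illustrative): the static rate $\eta=\sqrt{K/L}$ requires the bound $L$ in advance, while the dynamic rate $\eta_t=\sqrt{K/t}$ needs only the current $t$:

```python
import math, random

def fpl_loss(losses, eta_of_t, seed=0):
    """Total FPL loss under a learning-rate schedule eta_of_t(t), with
    uniform complexities k^i = ln n and one initial perturbation.
    Returns (FPL loss, loss of the best expert in hindsight)."""
    n = len(losses[0])
    rng = random.Random(seed)
    q = [rng.expovariate(1.0) for _ in range(n)]
    k = math.log(n)
    past = [0.0] * n
    total = 0.0
    for t, s_t in enumerate(losses, start=1):
        eta = eta_of_t(t)
        i = min(range(n), key=lambda j: past[j] + (k - q[j]) / eta)
        total += s_t[i]
        past = [p + s for p, s in zip(past, s_t)]
    return total, min(past)

T, n = 500, 4
rng = random.Random(42)
losses = [[rng.random() for _ in range(n)] for _ in range(T)]
K, L = math.log(n), float(T)              # L: prior upper bound on the loss
static, best = fpl_loss(losses, lambda t: math.sqrt(K / L))
dynamic, _ = fpl_loss(losses, lambda t: math.sqrt(K / t))
# single runs shown; in expectation both regrets grow like sqrt(K * T)
```

Note that the dynamic schedule achieves this without knowing $T$ or any loss bound $L$ beforehand.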
\begin{theorem}[FPL bound for static $\eta_t=\eta\propto 1/\sqrt{L}$]\label{thFPLStatic}
Assume (\ref{eq:Assumptions}) holds. Then the expected loss
$\ell_{1:T}$ of feasible FPL, which employs the prediction of the
expert $i$ minimizing $s\ltt^i+{k^i-q^i\over\eta}$ with
$\eta=\sqrt{K/L}$, where $K\geq k^i\;\forall i$ and $L\geq\ell_{1:T}$,
is bounded in terms of the loss of the best expert in hindsight:
\beqn
\ell_{1:T}\;\leq\;s_{1:T}^i+{\textstyle\sqrt{L\over K}}\,k^i+\sqrt{KL}
\quad\forall i.
\eeqn
\end{theorem}
\paradot{Proof}
Theorem~\ref{thFIFPL} implies
$\ell_{1:T}-r_{1:T}\leq\sum_t\eta\ell_t=\eta\ell_{1:T}\leq\eta L$, and
Corollary~\ref{corIFPL} implies $r_{1:T}\leq s_{1:T}^i+k^i/\eta$.
Adding both bounds and inserting $\eta=\sqrt{K/L}$ yields the assertion.
\qed
In particular, for uniform complexities $k^i=K=\ln n$ we get
$\ell_{1:T}\leq s_{1:T}^{min}+2\sqrt{L\ln n}$. A dynamic choice of
$\eta_t$ can achieve a bound of similar order without prior knowledge of
$L$: For instance, for $\eta_t=\sqrt{K/2\ell\ltt}$ and
$t_0:=\min\{t:\ell\ltt>0\}$ we get
\beqn\label{eqLD}
\ell_{1:T}\!-\!r_{1:T}
\leq \sum_{t=t_0}^T \eta_t\ell_t
\leq \sqrt{K\over 2}\sum_{t=t_0}^T
{\ell_{1:t}\!-\!\ell\ltt\over\sqrt{\ell\ltt}},
\eeqn
and since $\ell_t\leq 1$, the last sum is of order $\sqrt{\ell_{1:T}}$,
suggesting a bound $\ell_{1:T}\leq s_{1:T}^{min}+O(\sqrt{K\ell_{1:T}})$;
making this precise requires some care for small $\ell\ltt$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Lower Bound on FPL}\label{secLowFPL}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A lower bound similar in spirit to the upper bound of
Theorem~\ref{thIFPL} also holds.
\begin{theorem}[Lower bound on FPL]\label{thLowFPL}
Let $\D\subseteq\SetR^n$, $s_t\in\SetR^n$ for $1\leq t\leq T$,
$q\geq 0$, and assume that $\eta_t>0$ is a
decreasing sequence. Then the loss of FPL for uniform
complexities (l.h.s.) can be lower-bounded in terms of the best
predictor in hindsight (first term on r.h.s.) plus/minus additive
corrections:
\beqn
\sum_{t=1}^T M(s\ltt-{\textstyle{q\over\eta_t}})\scp s_t
\;\geq\; \min_{d\in\D}\{d\scp s_{1:T}\}
+ {1\over\eta_T}\min_{d\in\D}\{d\scp q\}
- {1\over\eta_T}\max_{d\in\D}\{d\scp q\}.
\eeqn
\end{theorem}
\begin{corollary}[Lower bound for FPL]\label{corLowFPL}
If $\D=\E$ and $P[q]=\e^{-\sum_i q^i}$ for $q\geq 0$ and uniform
complexities $k^i=K\;\forall i$ and decreasing $\eta_t>0$, the
expected loss of FPL is at most
$\ln n/\eta_T$ lower than the loss of the best expert in hindsight:
\beqn
\ell_{1:T} \;\geq\; s_{1:T}^{min} - {\ln n\over\eta_T}
\eeqn
\end{corollary}
The upper and lower bounds on $\ell_{1:T}$
(Theorem~\ref{thFIFPL} and Corollaries~\ref{corIFPL} and
\ref{corLowFPL}) together show that
\beq\label{eqltos}
{\ell_{1:t}\over s_{1:t}^{min}} \to 1
\quad\qmbox{if}\quad
\eta_t\to 0
\qmbox{and}
\eta_t\!\cdot\!s_{1:t}^{min} \to \infty
\qmbox{and}
k^i=K\;\forall i
\eeq
For instance, $\eta_t=\sqrt{K/2 s_{1:t-1}^{min}}$ satisfies these
conditions if $s_{1:t}^{min}\to\infty$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Miscellaneous}\label{secMisc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paradot{Bounds with high probability}
The results so far bound the \emph{expected} loss $\ell_{1:T}$ of FPL.
By Markov's inequality, the actual loss $u_{1:T}$ exceeds
$c\!\cdot\!\ell_{1:T}$ with probability at most $1/c$:
\beqn
P[u_{1:T}\geq c\!\cdot\!\ell_{1:T}]
\;\leq\; {1/c}.
\eeqn
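This Markov bound is easy to observe in simulation; an illustrative Python sketch with independent per-step randomization ($k=0$, $\eta_t\equiv 1$, arbitrary loss sequence):

```python
import random

def sample_u(losses, trials=5000, seed=7):
    """Sample the actual (randomized) total FPL loss u_{1:T}, drawing an
    independent perturbation q_t ~ Exp(1)^n in every round (k = 0, eta = 1)."""
    rng = random.Random(seed)
    n = len(losses[0])
    samples = []
    for _ in range(trials):
        past = [0.0] * n
        u = 0.0
        for s_t in losses:
            q = [rng.expovariate(1.0) for _ in range(n)]
            i = min(range(n), key=lambda j: past[j] - q[j])
            u += s_t[i]
            past = [p + s for p, s in zip(past, s_t)]
        samples.append(u)
    return samples

u = sample_u([[0, 1], [1, 0]] * 10)   # illustrative loss sequence
ell = sum(u) / len(u)                 # Monte Carlo estimate of l_{1:T}
c = 2.0
frac = sum(x >= c * ell for x in u) / len(u)
assert frac <= 1 / c                  # Markov: P[u >= c * l_{1:T}] <= 1/c
```

Markov's inequality is of course far from tight here; sharper high-probability statements require concentration arguments for the per-step randomization.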
Randomizing independently for each $t$ as described in the
previous Section, the actual loss at time $t$ is
$u_t=M(s\ltt+{k-q_t\over\eta_t})\scp s_t$, where $q_t$ is the
perturbation drawn at time $t$; its expectation is $E[u_t]=\ell_t$.