%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% General Discounting versus Average Reward %%
%% Marcus Hutter, Start: October 2005 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt,twoside]{article}
\usepackage{latexsym,amsmath,amssymb,hyperref}
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\sloppy\lineskip=0pt
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{myexample}[theorem]{Example}
\newtheorem{proposition}[theorem]{Proposition}
\newenvironment{keywords}{\centerline{\bf\small
Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\def\paradot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf\boldmath{#1.}}}
\def\paranodot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf\boldmath{#1}}}
\def\aidx#1{}
\def\req#1{(\ref{#1})}
\def\toinfty#1{\smash{\stackrel{#1\to\infty}{\longrightarrow}}}
\def\iffs{\Leftrightarrow}
\def\eps{\varepsilon}
\def\epstr{\epsilon} % for empty string
\def\nq{\hspace{-1em}}
\def\qed{\hspace*{\fill}\rule{1.4ex}{1.4ex}$\quad$\\}
\def\odt{{\textstyle{1\over 2}}}
\def\odn{{\textstyle{1\over n}}}
\def\SetR{I\!\!R}
\def\SetN{I\!\!N}
\def\SetB{I\!\!B}
\def\SetZ{Z\!\!\!Z}
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\def\e{{\rm e}} % natural e
\def\B{\{0,1\}}
\def\E{{\bf E}} % Expectation
\def\P{{\rm P}} % Probability
\def\v{\vec}
\def\es{\mbox{\o}} % empty set
\def\l{\ell}
\def\lb{{\log_2}}
\def\s{\sigma}
\def\v{\boldsymbol} %\boldsymbol\frak\Bbb\pmb\text
\def\text#1{\mbox{\scriptsize #1}}
\def\EE{I\!\!E}
\def\g{\gamma}
\def\G{\Gamma}
\def\d{\delta}
\def\a{\alpha}
\def\b{\beta}
\def\low{\underline}
\def\up{\overline}
\def\fract#1#2{{\textstyle{#1\over#2}}}
\def\eh{h^{e\!f\!f}} % effective horizon
\def\ph{h^{\text{\it quasi}}} % pseudo or quasi-horizon
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\vspace{-3ex}\normalsize\sc Technical Report \hfill IDSIA-11-06
\vskip 2mm\bf\Large\hrule height5pt \vskip 6mm
General Discounting versus Average Reward
\vskip 6mm \hrule height2pt}
\author{{\bf Marcus Hutter}\\[3mm]
\normalsize IDSIA \ / \ RSISE \ / \ ANU \ / \ NICTA \ / \ \texttt{http://www.hutter1.net}
}
\date{January 2006}
\maketitle
\begin{abstract}\noindent
Consider an agent interacting with an environment in cycles. In
every interaction cycle the agent is rewarded for its performance.
We compare the average reward $U$ from cycle $1$ to $m$ (average
value) with the future discounted reward $V$ from cycle $k$ to
$\infty$ (discounted value). We consider essentially arbitrary
(non-geometric) discount sequences and arbitrary reward sequences
(non-MDP environments). We show that asymptotically $U$ for
$m\to\infty$ and $V$ for $k\to\infty$ are equal, provided both
limits exist. Further, if the effective horizon grows linearly
with $k$ or faster, then the existence of the limit of $U$ implies
that the limit of $V$ exists. Conversely, if the effective horizon
grows linearly with $k$ or slower, then existence of the limit of
$V$ implies that the limit of $U$ exists.
\def\contentsname{\centering\normalsize Contents}
{\parskip=-2.5ex\tableofcontents}
\end{abstract}
\begin{keywords}
reinforcement learning;
average value;
discounted value;
arbitrary environment;
arbitrary discount sequence;
effective horizon;
increasing farsightedness;
consistent behavior.
\end{keywords}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Agent cycle and setup (deterministic)
We consider the reinforcement learning setup
\cite{Russell:03,Hutter:04uaibook}, where an agent interacts with
an environment in cycles. In cycle $k$, the agent outputs (acts)
$a_k$, then it makes observation $o_k$ and receives reward $r_k$,
both provided by the environment. Then the next cycle $k+1$
starts. For simplicity we assume that agent and environment are
deterministic.
% Goal is to maximize reward
Typically one is interested in action sequences, called plans or
policies, for agents that result in high reward. The simplest
reasonable measure of performance is the total reward sum or
equivalently the average reward, called average value
$U_{1m}:={1\over m}[r_1+...+r_m]$, where $m$ should be the
lifespan of the agent. One problem is that the lifetime is often
not known in advance, e.g.\ often the time one is willing to let a
system run depends on its displayed performance. More serious is
that the measure is indifferent to whether an agent receives high
rewards early or late if the values are the same.
% Most common: fixed horizon, Problems: lifetime unknown,
% no bias towards early rewards, no asymptotic considerations.
A natural (non-arbitrary) choice for $m$ is to consider the limit
$m\to\infty$. While the indifference may be acceptable for finite
$m$, it can be catastrophic for $m=\infty$. Consider an agent that
receives no reward until the first cycle $k$ in which it acts $a_k=b$,
and then once receives reward ${k-1\over k}$. For finite $m$, the
optimal cycle $k$ to switch from action $a$ to $b$ is $k_{opt}=m$. Hence
$k_{opt}\to\infty$ for $m\to\infty$, so the reward-maximizing
agent for $m\to\infty$ always acts $a$, and hence
receives zero reward, although a value arbitrarily close to 1 would be
achievable. (Immortal agents are lazy \cite[Sec.5.7]{Hutter:04uaibook}.)
More seriously, in general the limit $U_{1\infty}$ may not even
exist.
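This switching example can be checked numerically. The following Python sketch (the helper name `total_reward` is ours, purely illustrative) enumerates all switch points for several lifespans:

```python
# Numeric check of the switching example: acting b for the first time in
# cycle k yields a single reward (k-1)/k; all other rewards are 0.
def total_reward(k):
    """Total (undiscounted) reward if the agent first acts b in cycle k."""
    return (k - 1) / k

for m in [10, 100, 1000]:
    # For lifespan m, the optimal switch cycle is the largest feasible one.
    k_opt = max(range(1, m + 1), key=total_reward)
    print(m, k_opt)  # k_opt = m for every finite m
```

Since `(k-1)/k` is strictly increasing, the optimal switch point always equals the lifespan, so it diverges with $m$, reproducing the lazy-immortal-agent effect.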
% Non-Solution: moving horizon. Inconsistency
Another approach is to consider a moving horizon. In cycle $k$,
the agent tries to maximize $U_{km}:={1\over m-k+1}[r_k+...+r_m]$,
where $m$ increases with $k$, e.g.\ $m=k+h-1$ with $h$ being the
horizon. This naive truncation is often used in games like chess
(plus a heuristic reward in cycle $m$) to get a reasonably small
search tree. While this can work in practice, it can lead to
inconsistent optimal strategies, i.e.\ to agents that change their
mind. Consider the example above with $h=2$. In every cycle $k$ it
is better first to act $a$ and then $b$
($U_{km}=r_k+r_{k+1}=0+{k\over k+1}$), rather than immediately $b$
($U_{km}=r_k+r_{k+1}={k-1\over k}+0$), or $a,a$ ($U_{km}=0+0$).
But upon entering the next cycle $k+1$, the agent throws its original
plan overboard and again chooses $a$ first, postponing $b$ to cycle $k+2$.
This pattern repeats forever, resulting in no reward at all.
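The mind-changing pattern can be simulated directly. In this sketch (the two-step plan values are hard-coded from the example above; names are illustrative), the horizon-2 agent never actually switches:

```python
# Moving-horizon (h=2) agent on the switching example: in every cycle k it
# compares two-step plans, assuming no b has been played yet.
def plan_value(k, plan):
    """Reward collected in cycles k and k+1 under a 2-step plan."""
    if plan == ('a', 'b'):          # postpone the switch to cycle k+1
        return 0 + k / (k + 1)
    if plan == ('b', 'a'):          # switch now, in cycle k
        return (k - 1) / k + 0
    return 0.0                      # ('a','a'): no reward at all

total = 0.0
for k in range(1, 10_000):
    plans = [('a', 'b'), ('b', 'a'), ('a', 'a')]
    best = max(plans, key=lambda p: plan_value(k, p))
    if best[0] == 'b':              # the agent actually switches
        total += (k - 1) / k
        break
print(total)  # prints 0.0: ('a','b') wins in every cycle, so b never happens
```

Since $k/(k+1)-(k-1)/k = 1/(k(k+1)) > 0$, postponing always looks better, so the collected reward stays zero.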
% Geometric discount.
The standard solution to the above problems is to consider
geometrically (= exponentially) discounted reward
\cite{Samuelson:37,Bertsekas:96,Sutton:98}. One discounts the
reward for every cycle of delay by a factor $\g<1$, i.e.\ one
considers the future discounted reward sum $V_{k\g} :=
(1-\g)\sum_{i=k}^\infty\g^{i-k} r_i$, which models a preference
towards early rewards. The $V_{1\g}$ maximizing policy is
consistent in the sense that its actions $a_k,a_{k+1},...$
coincide with the optimal policy based on $V_{k\g}$. At first
glance, there seems to be no arbitrary lifetime $m$ or horizon
$h$, but this is an illusion. $V_{k\g}$ is dominated by
contributions from rewards $r_k...r_{k+O(\eh)}$, where
$\eh\approx{\ln 2\over\ln\g^{-1}}\approx(1-\g)^{-1}$ is the
effective horizon. While such a sliding
effective horizon does not cause inconsistent policies, it can
nevertheless lead to suboptimal behavior. For every (effective)
horizon, there is a task that needs a larger horizon to be solved.
For instance, while $\eh=5$ is sufficient for tic-tac-toe, it is
definitely insufficient for chess. There are elegant closed form
solutions for Bandit problems, which show that for any $\g<1$, the
Bayes-optimal policy can get stuck with a suboptimal arm (is not
self-optimizing) \cite{Berry:85,Kumar:86}.
% Solution attempt: g->1 <=> m->infty cite
For $\g\to 1$, $\eh\to\infty$, and the defect decreases. There are
various deep papers considering the limit $\g\to 1$
\cite{Kelly:81}, and comparing it to the limit $m\to\infty$
\cite{Kakade:01}. % \cite{Bertsekas:95b}?
The analysis is typically restricted to ergodic MDPs for which the
limits $\lim_{\g\to 1}V_{1\g}$ and $\lim_{m\to\infty}U_{1m}$
exist. But like the limit policy for $m\to\infty$, the limit
policy for $\g\to 1$ can display very poor performance, i.e.\ we
need to choose $\g<1$ fixed in advance (but how?), or consider
higher order terms \cite{Mahadevan:96,Avrachenkov:99}. We also
cannot consistently adapt $\g$ with $k$. Finally, the value limits
may not exist beyond ergodic MDPs.
% Sliding discount
In the computer science literature, geometric discount is
essentially assumed for convenience, without further justification
(sometimes a constant interest rate or a constant probability of
survival is invoked \cite{Kaelbling:96}). In the psychology and economics
literature it has been argued that people discount a delay of one
day (one cycle) in reward more heavily if it concerns rewards now
rather than rewards later, e.g.\ in a year (plus one day)
\cite{Frederick:02}. So there is
some work on ``sliding'' discount sequences $W_{k\g}\propto \g_0
r_k+\g_1 r_{k+1}+...$. One can show that this also leads to
inconsistent policies if $\v\g$ is non-geometric
\cite{Strotz:55,Vieille:04}.
% Only consistent discount
Is there any non-geometric discount leading to consistent
policies? In \cite{Hutter:02selfopt} the generally discounted
value $V_{k\g}:={1\over\G_k}\sum_{i=k}^\infty\g_i r_i$ with
$\G_k:=\sum_{i=k}^\infty\g_i<\infty$ has been introduced. It is
well-defined for arbitrary environments, leads to consistent
policies, and e.g.\ for quadratic discount $\g_k=1/k^2$ to an
increasing effective horizon (proportionally to $k$), i.e.\ the
optimal agent becomes increasingly farsighted in a consistent way,
leads to self-optimizing policies in ergodic ($k$th-order) MDPs in
general, Bandits in particular, and even beyond MDPs. See
\cite{Hutter:02selfopt} for these and \cite{Hutter:04uaibook} for
more results. The only other serious analysis of general discounts
we are aware of is in \cite{Berry:85}, but their analysis is
limited to Bandits and so-called regular discount. This discount
has bounded effective horizon, so also does not lead to
self-optimizing policies.
% Compare value of agents
The {\em asymptotic} total average performance $U_{1\infty}$ and
future discounted performance $V_{\infty\g}$ are of key interest.
For instance, often we do not know the exact environment in
advance but have to {\em learn} it from past experience, which is
the domain of reinforcement learning \cite{Sutton:98} and adaptive
control theory \cite{Kumar:86}. Ideally we would like a learning
agent that performs {\em asymptotically} as well as the optimal
agent that knows the environment in advance.
%-------------------------------%
\paradot{Contents and main results}
%-------------------------------%
% subject of study
The subject of study of this paper is the relation between
$U_{1\infty}$ and $V_{\infty\g}$ for {\em general discount} $\v\g$
and {\em arbitrary environment}. The importance of the performance
measures $U$ and $V$, and general discount $\v\g$ has been
discussed above. There is also a clear need to study general
environments beyond ergodic MDPs, since the real world is neither
ergodic (e.g.\ losing an arm is irreversible) nor completely
observable.
% Main results
The only restriction we impose on the discount sequence $\v\g$ is
summability ($\G_1<\infty$) so that $V_{k\g}$ exists, and
monotonicity ($\g_k\geq\g_{k+1}$). Our main result is that if both
limits $U_{1\infty}$ and $V_{\infty\g}$ exist, then they are
necessarily equal (Section \ref{secAEDV}, Theorem \ref{thUeqV}).
Somewhat surprisingly this holds for {\em any} discount sequence
$\v\g$ and {\em any} environment (reward sequence $\v r$),
whatsoever.
Note that the limit $U_{1\infty}$ may or may not exist,
independently of whether $V_{\infty\g}$ exists. We present examples of the
four possibilities in Section \ref{secEx}. Under certain
conditions on $\v\g$, existence of $U_{1\infty}$ implies existence
of $V_{\infty\g}$, or vice versa. We show that if (a quantity
closely related to) the effective horizon grows linearly with $k$
or faster, then existence of $U_{1\infty}$ implies existence of
$V_{\infty\g}$ and their equality (Section \ref{secAIDV}, Theorem
\ref{thUimpV}). Conversely, if the effective horizon grows
linearly with $k$ or slower, then existence of $V_{\infty\g}$
implies existence of $U_{1\infty}$ and their equality (Section
\ref{secDIAV}, Theorem \ref{thVimpU}). Except for discounts with
oscillating effective horizons, these two results together imply
(and this is actually the path we use to prove) the first-mentioned
main result. In Sections \ref{secAV} and \ref{secDV} we define and
provide some basic properties of average and discounted value,
respectively.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Example Discount and Reward Sequences}\label{secEx}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In order to get a better feeling for general discount sequences,
effective horizons, average and discounted value, and their
relation and existence, we first consider various examples.
%-------------------------------%
\paradot{Notation}
%-------------------------------%
\begin{itemize}\parskip=0ex\parsep=0ex\itemsep=0ex
\item In the following we assume that $i,k,m,n\in\SetN$ are natural numbers.
\item Let $\low F:=\low\lim_n F_n=\lim_{k\to\infty}\inf_{n>k}F_n$ denote the limit inferior and
\item $\up F:=\up\lim_n F_n=\lim_{k\to\infty}\sup_{n>k}F_n$ the limit superior of $F_n$.
\item $\forall'n$ means for all but finitely many $n$.
\item Let $\v\g=(\g_1,\g_2,...)$ denote a summable discount sequence in the sense that
\item $\G_k:=\sum_{i=k}^\infty\g_i<\infty$ and $\g_k\in\SetR^+$ $\forall k$.
\item Further, $\v r=(r_1,r_2,...)$ is a bounded reward sequence w.l.g.\ $r_k\in[0,1]$ $\forall k$.
\item Let constants $\a,\b\in[0,1]$ and boundaries $0\leq k_1<m_1<k_2<m_2<\cdots$ be given.
\end{itemize}
%-------------------------------%
\paradot{Example discount sequences}
%-------------------------------%
The following table lists, for two example discount sequences, the
discount $\g_k$, the normalizer $\G_k$, the effective horizon $\eh_k$
(the smallest $h$ with $\G_{k+h}\leq\odt\G_k$), the quasi-horizon
$\ph_k:=\G_k/\g_k$, and the ratio ${k\g_k\over\G_k}$ with its limit:
\beqn
\begin{array}{c|c|c|c|c|c|c}
\mbox{discount} & \g_k & \G_k & \eh_k & \ph_k & {k\g_k\over\G_k} & \mbox{limit}
\\ \hline
\mbox{power}
& k^{-1-\eps},\;\eps>0
& \sim {1\over\eps}k^{-\eps}
& \sim (2^{1/\eps}-1) k
& \sim {k\over\eps}
& \sim\eps & \to \eps
\\ \hline
\mbox{harmonic$_\approx$}
& {1\over k\ln^2 k}
& \sim{1\over\ln k}
& \sim k^2
& \sim k\ln k
& \sim{1\over\ln k} & \to 0
\end{array}
\eeqn
%
For instance, the standard discount is geometric $\g_k=\g^k$ for
some $0\leq\g<1$, with constant effective horizon
${\ln(1/2)\over\ln\g}$. (An agent with $\g=0.95$ can/will not plan
farther than about 10-20 cycles ahead).
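The quoted horizon can be reproduced numerically. In this sketch, `half_weight_horizon` is an illustrative name for the smallest $h$ such that the first $h$ discount weights carry half of the total mass:

```python
import math

def half_weight_horizon(g):
    """Smallest h such that the weights g^0..g^{h-1} carry at least half
    of the total discount mass sum_{i>=0} g^i = 1/(1-g)."""
    total = 1 / (1 - g)
    acc, h = 0.0, 0
    while acc < total / 2:
        acc += g ** h
        h += 1
    return h

print(half_weight_horizon(0.95))         # 14 cycles
print(math.log(1 / 2) / math.log(0.95))  # closed form ln(1/2)/ln(g): about 13.5
```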
%
Since in this work we allow for general discount, we can even
recover the average value $U_{1m}$ by choosing
$\g_k=\{ {1 \;\mbox{\scriptsize for}\; k\leq m\atop 0 \;\mbox{\scriptsize for}\; k>m}\}$.
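A quick numerical sanity check of this recovery (an illustrative sketch; any bounded reward sequence works):

```python
import random

def V1(gamma, r):
    """V_{1,gamma} = (1/Gamma_1) * sum_i gamma_i * r_i, for a discount
    sequence with finitely many nonzero entries."""
    return sum(g * x for g, x in zip(gamma, r)) / sum(gamma)

random.seed(0)
m, n = 50, 100
r = [random.random() for _ in range(n)]
gamma = [1.0] * m + [0.0] * (n - m)   # gamma_k = 1 for k <= m, 0 for k > m
U_1m = sum(r[:m]) / m
print(abs(V1(gamma, r) - U_1m))       # 0 (up to rounding): V_{1,gamma} = U_{1m}
```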
%
A power discount $\g_k=k^{-\alpha}$ ($\alpha>1$) is very
interesting, since it leads to a linearly increasing effective
horizon $\eh_k\propto k$, i.e.\ to an agent whose farsightedness
increases proportionally with age. This choice has some appeal, as
it avoids preselection of a global time-scale like $m$ or ${1\over
1-\g}$, and it seems that humans of age $k$ years usually do not
plan their lives for more than, perhaps, the next $k$ years. It is
also the boundary case for which $U_{1\infty}$ exists if and only
if $V_{\infty\g}$ exists.
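The linearly growing farsightedness can be observed numerically. In this sketch the effective horizon is taken as the smallest $h$ with $\G_{k+h}\leq\odt\G_k$, and the tail sums are truncated (illustrative only):

```python
# Effective horizon eh(k): smallest h with Gamma_{k+h} <= Gamma_k / 2,
# i.e. half of the remaining discount mass lies within the next h cycles.
alpha = 2.0                      # quadratic discount gamma_k = k^(-alpha)
N = 200_000                      # truncation point for the tail sums
Gamma = [0.0] * (N + 2)
for k in range(N, 0, -1):        # suffix sums Gamma_k ~ sum_{i>=k} i^(-alpha)
    Gamma[k] = Gamma[k + 1] + k ** -alpha

def eff_horizon(k):
    h = 1
    while Gamma[k + h] > Gamma[k] / 2:
        h += 1
    return h

for k in [10, 20, 40, 80]:
    print(k, eff_horizon(k))     # horizon grows proportionally to k
```

For $\g_k=k^{-2}$ one gets $\eh_k\approx k$, matching the table entry $\eh_k\sim(2^{1/\eps}-1)k$ with $\eps=1$.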
%-------------------------------%
\paradot{Example reward sequences}
%-------------------------------%
Most of our (counter)examples will be for binary reward $\v
r\in\{0,1\}^\infty$. We call a maximal consecutive subsequence of
ones a 1-run. We denote start, end, and length of the $n$th run
by $k_n$, $m_n-1$, and $A_n=m_n-k_n$, respectively. The following
0-run starts at $m_n$, ends at $k_{n+1}-1$, and has length
$B_n=k_{n+1}-m_n$. The (non-normalized) discount sum in 1/0-run
$n$ is denoted by $a_n$ / $b_n$, respectively.
%
The following definition and two lemmas facilitate the discussion
of our examples. The proofs contain further useful relations.
\begin{definition}[Value for binary rewards]\label{defBin}
Every binary reward sequence $\v r\in\{0,1\}^\infty$ can be defined by
the sequence of change points $0\leq k_1<m_1<k_2<m_2<\cdots$, where
$r_k=1$ iff $k_n\leq k<m_n$ for some $n$.
\end{definition}
\begin{myexample}[Oscillating effective horizon]
One can construct a monotone $\v\g$ whose effective horizon oscillates:
in phase $n$, continue the discount sequence until the horizon has grown
faster than $k$, then until it has grown slower than $k$, and then
repeat with $n\leadsto n+1$. The proportionality
constants can be chosen to ensure monotonicity of $\v\g$. For
such $\v\g$ neither Theorem \ref{thUimpV} nor Theorem
\ref{thVimpU} is applicable, only Theorem \ref{thUeqV}.
\end{myexample}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Average Value}\label{secAV}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now take a closer look at the (total) average value $U_{1m}$ and
relate it to the future average value $U_{km}$, an intermediate
quantity we need later. We recall the definition of the average value:
\begin{definition}[Average value, \boldmath $U_{1m}$]\label{defAvVal}
Let $r_i\in[0,1]$ be the reward at time $i\in\SetN$. Then
\beqn
U_{1m} \;:=\; {1\over m}\sum_{i=1}^m r_i \;\in[0,1]
\eeqn
is the average value
from time 1 to $m$, and $U_{1\infty}:=\lim_{m\to\infty}U_{1m}$ the
average value if it exists.
\end{definition}
% future average value
We also need the average value $U_{km}:={1\over m-k+1}\sum_{i=k}^m
r_i$ from $k$ to $m$ and the following Lemma.
\begin{lemma}[Convergence of future average value, \boldmath $U_{k\infty}$]\label{lemUkm}
For $k_m\leq m\to\infty$ and every $k$ we have
\beqn
U_{1m}\to\a \quad\iffs\quad U_{km}\to\a \quad\left.
{\Rightarrow\quad U_{k_m m}\to\a \qmbox{if} \smash{\sup\limits_m}{k_m-1\over m}<1 \atop
\Leftarrow\quad U_{k_m m}\to\a \phantom{\qmbox{if} \smash{\sup\limits_m}{k_m-1\over m}<1} } \right.
\eeqn
\end{lemma}
The first equivalence states the obvious fact (and problem) that
any finite initial part has no influence on the average value
$U_{1\infty}$. Chunking together many $U_{k_m m}$ implies the last
$\Leftarrow$. The $\Rightarrow$ only works if we average in
$U_{k_m m}$ over sufficiently many rewards, which the stated
condition ensures ($\v r=101010...$ and $k_m=m$ is a simple
counter-example). Note that $U_{km_k}\to\a$ for $m_k\geq k\to\infty$
implies $U_{1m_k}\to\a$, but not necessarily $U_{1m}\to\a$ (e.g.\
in Example \ref{exVnU}, $U_{1m_k}={1\over 3}$ and ${k-1\over
m_k}\to 0$ imply $U_{km_k}\to{1\over 3}$ by \req{prUkm}, but
$U_{1\infty}$ does not exist).
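The counter-example mentioned in parentheses is easy to verify numerically (a sketch):

```python
# Counter-example r = 101010... with k_m = m: the overall average converges,
# but the length-1 window average U_{m,m} = r_m keeps oscillating.
r = [1 if i % 2 == 1 else 0 for i in range(1, 2001)]   # r_1=1, r_2=0, ...
m = len(r)
U_1m = sum(r) / m          # -> 1/2
U_mm = r[-1]               # window average over the single reward r_m
print(U_1m, U_mm, r[-2])   # 0.5 0 1: the windows at m and m-1 disagree forever
```

Here $\sup_m{k_m-1\over m}=1$, so the stated condition for the $\Rightarrow$ direction is indeed violated.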
\paradot{Proof}
The trivial identity $mU_{1m}=(k-1)U_{1,k-1}+(m-k+1)U_{km}$ implies
$U_{km}-U_{1m}={k-1\over m-k+1}(U_{1m}-U_{1,k-1})$ implies
\beq\label{prUkm}
|U_{km}-U_{1m}|
\;\leq\; {|U_{1m}-U_{1,k-1}|\over{m\over k-1}-1}
\eeq
$\iffs$) The numerator is bounded by 1, and for fixed $k$ and
$m\to\infty$ the denominator tends to $\infty$, which proves
$\iffs$.
$\Rightarrow$) We choose (small) $\eps>0$, $m_\eps$ large enough
so that $|U_{1m}-\a|<\eps$ $\forall m\geq m_\eps$, and
$m\geq{m_\eps\over\eps}$. If $k:=k_m\leq m_\eps$, then \req{prUkm}
is bounded by ${1\over 1/\eps-1}$. If $k:=k_m>m_\eps$, then
\req{prUkm} is bounded by ${2\eps\over 1/c-1}$, where
$c:=\sup_m{k_m-1\over m}<1$. This shows that $|U_{k_m
m}-U_{1m}|=O(\eps)$ for large $m$, which implies $U_{k_m m}\to\a$.
$\Leftarrow$) We partition the time-range
$\{1...m\}=\bigcup_{n=1}^L\{k_{m_n}...m_n\}$, where $m_1:=m$ and
$m_{n+1}:=k_{m_n}-1$. We choose (small) $\eps>0$, $m_\eps$ large
enough so that $|U_{k_m m}-\a|<\eps$ $\forall m\geq m_\eps$,
$m\geq{m_\eps\over\eps}$, and $l$ so that $k_{m_l}\leq m_\eps\leq
m_l$. Then
\bqan
U_{1m} & = & {1\over m} \left[\sum_{n=1}^l+\sum_{n=l+1}^L\right] (m_n\!-\!k_{m_n}\!+\!1) U_{k_{m_n}m_n}
\\
& \leq & {1\over m}\sum_{n=1}^l (m_n\!-\!k_{m_n}\!+\!1)(\a+\eps) + {m_{l+1}\!-\!k_{m_L}\!+\!1\over m}
\\
& \leq & {m_1\!-\!k_{m_l}\!+\!1\over m}(\a+\eps) + {k_{m_l}\over m}
\;\leq\; (\a+\eps) + \eps
\\
\nq\nq \mbox{Similarly}\quad U_{1m} & \geq & {m_1\!-\!k_{m_l}\!+\!1\over m}(\a-\eps)
\geq {m\!-\!m_\eps\over m}(\a-\eps) \geq (1-\eps)(\a-\eps)
\eqan
This shows that $|U_{1m}-\a|\leq 2\eps$ for sufficiently large
$m$, hence $U_{1m}\to\a$.
\qed
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Discounted Value}\label{secDV}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now take a closer look at the (future) discounted value
$V_{k\g}$ for general discounts $\v\g$, and prove some useful
elementary asymptotic properties of discount $\g_k$ and normalizer
$\G_k$. We recall the definition of the discounted value:
\begin{definition}[Discounted value, \boldmath $V_{k\g}$]\label{defDiscVal}
Let $r_i\in[0,1]$ be the reward and $\g_i\geq 0$ a discount at
time $i\in\SetN$, where $\v\g$ is assumed to be summable in the
sense that $0<\G_k:=\sum_{i=k}^\infty\g_i<\infty$. Then
\beqn
V_{k\g} \;:=\; {1\over\G_k}\sum_{i=k}^\infty\g_i r_i \;\in[0,1]
\eeqn
is the $\v\g$-discounted future value and
$V_{\infty\g}:=\lim_{k\to\infty}V_{k\g}$ its limit if it exists.
\end{definition}
We say that $\v\g$ is {\em monotone} if $\g_{k+1}\leq\g_k$ $\forall
k$. Note that monotonicity and $\G_k>0$ $\forall k$ together imply
$\g_k>0$ $\forall k$ and convexity of $k\mapsto\G_k$.
\begin{lemma}[Discount properties, \boldmath $\g/\G$]\label{lemDiscProp}
\bqan
& i) & {\g_{k+1}\over\g_k} \to 1 \quad\Leftrightarrow\quad
{\g_{k+\Delta}\over\g_k} \to 1 \quad\forall\Delta\in\SetN
\\
& ii) & {\g_k\over\G_k}\to 0 \quad\Leftrightarrow\quad
{\G_{k+1}\over\G_k}\to 1 \quad\Leftrightarrow\quad
{\G_{k+\Delta}\over\G_k} \to 1 \quad\forall\Delta\in\SetN
\eqan
Furthermore, $(i)$ implies $(ii)$, but not necessarily the other way around
(even not if $\v\g$ is monotone).
\end{lemma}
\paradot{Proof}
$(i)\Rightarrow$
${\g_{k+\Delta}\over\g_k}=\prod_{i=k}^{k+\Delta-1}{\g_{i+1}\over\g_i} \toinfty{k} 1$, since
$\Delta$ is finite.
\\
$(i)\Leftarrow$ Set $\Delta=1$.
\\
$(ii)$ The first equivalence follows from $\G_k=\g_k+\G_{k+1}$.
The proof for the second equivalence is the same as for $(i)$ with
$\g$ replaced by $\G$.
\\
$(i)\Rightarrow(ii)$ Choose $\eps>0$. $(i)$ implies
${\g_{k+1}\over\g_k}\geq 1-\eps$ $\forall\,'k$ implies
\beqn
\G_k \;=\; \sum_{i=k}^\infty\g_i
\;=\; \g_k\sum_{i=k}^\infty \prod_{j=k}^{i-1}{\g_{j+1}\over\g_j}
\;\geq\; \g_k\sum_{i=k}^\infty (1-\eps)^{i-k}
\;=\; \g_k/\eps
\eeqn
hence ${\g_k\over\G_k}\leq\eps$ $\forall'k$, which implies
${\g_k\over\G_k}\to 0$.
\\
$(i)\not\Leftarrow(ii)$ Consider counter-example
$\g_k \;=\; 4^{-\lceil\lb k\rceil}$, i.e.\
$\g_k=4^{-n}$ for $2^{n-1}<k\leq 2^n$. Then ${\g_{k+1}\over\g_k}={1\over 4}$
for $k=2^n$, so $(i)$ is violated. But
$\G_k\geq\G_{2^n+1}=\sum_{j=n+1}^\infty 2^{j-1}4^{-j}=2^{-n-1}$, hence
${\g_k\over\G_k}\leq 4^{-n}\cdot 2^{n+1}=2^{1-n}\to 0$, i.e.\ $(ii)$ holds.
\qed
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Average Implies Discounted Value}\label{secAIDV}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now show that existence of the limit of the average value implies
existence of the limit of the discounted value (and their equality),
provided the quasi-horizon grows linearly with $k$ or faster.
\begin{theorem}[\bf\boldmath $U_{1\infty}\Rightarrow V_{\infty\g}$]\label{thUimpV}
Assume $\v\g$ is monotone and $c:=\sup_k{k\g_k\over\G_k}<\infty$. Then
existence of $U_{1\infty}=:\a$ implies that $V_{\infty\g}$ exists and
equals $\a$.
\end{theorem}
\paradot{Proof}
Monotonicity of $\v\g$ ensures that $\d_j:=\g_j-\g_{j+1}\geq 0$, and the
condition $\infty>c\geq{k\g_k\over\G_k}=:c_k$ ensures that the excessive
initial part $\propto U_{1,k-1}$ is ``negligible''. It is easy to
show that
\beqn
\sum_{j=i}^\infty\d_j \;=\; \g_i \qmbox{and}
\sum_{j=k}^\infty j\d_j \;=\; (k\!-\!1)\g_k+\G_k
\eeqn
We choose some (small) $\eps>0$, and $m_\eps$ large enough so that
$|U_{1m}-\a|<\eps$ $\forall m\geq m_\eps$. Then, for $k>m_\eps$ we get
\bqan
V_{k\g} & = & {1\over\G_k}\sum_{i=k}^\infty\g_i r_i
\;=\; {1\over\G_k}\sum_{i=k}^\infty\sum_{j=i}^\infty \d_j r_i
\;=\; {1\over\G_k}\sum_{j=k}^\infty \sum_{i=k}^j \d_j r_i
\\
& = & {1\over\G_k}\sum_{j=k}^\infty \d_j[jU_{1j}-(k\!-\!1)U_{1,k-1}]
\\
& \lessgtr & {1\over\G_k}\sum_{j=k}^\infty \d_j[j(\a\pm\eps)-(k\!-\!1)(\a\mp\eps)]
\\
& = & {1\over\G_k}[(k\!-\!1)\g_k+\G_k](\a\pm\eps)-{1\over\G_k}\g_k(k\!-\!1)(\a\mp\eps)
\\
& = & \a \pm \Big( 1+{2(k-1)\g_k\over\G_k}\Big)\eps
\;\lessgtr\; \a \pm (1+2c_k)\eps
\eqan
i.e.\ $|V_{k\g}-\a|<(1+2c_k)\eps\leq (1+2c)\eps$ $\forall k>m_\eps$, which implies
$V_{k\g}\to\a$.
\qed
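The result can be illustrated numerically. The sketch below (truncated tail sums, illustrative only) uses quadratic discount $\g_k=k^{-2}$, for which $k\g_k/\G_k$ is bounded, and alternating rewards with $U_{1\infty}=\odt$:

```python
# Quadratic discount gamma_k = k^-2 (bounded k*gamma_k/Gamma_k) and
# alternating rewards 1,0,1,0,... with U_{1m} -> 1/2: V_{k,gamma} -> 1/2 too.
N = 200_000
gamma = [0.0] + [k ** -2 for k in range(1, N + 1)]
Gamma = [0.0] * (N + 2)
for k in range(N, 0, -1):                    # suffix sums Gamma_k
    Gamma[k] = Gamma[k + 1] + gamma[k]
r = [0] + [k % 2 for k in range(1, N + 1)]   # r_k = 1 for odd k

def V(k):
    """Truncated V_{k,gamma} = (1/Gamma_k) * sum_{i>=k} gamma_i * r_i."""
    return sum(gamma[i] * r[i] for i in range(k, N + 1)) / Gamma[k]

for k in [10, 100, 1000]:
    print(k, V(k))   # approaches the average value 1/2 as k grows
```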
Theorem \ref{thUimpV} can, for instance, be applied to Example \ref{exUeqV}.
Examples \ref{exUnVs}, \ref{exUnV}, and \ref{exUneqV} demonstrate
that the conditions in Theorem \ref{thUimpV} cannot be dropped.
The following proposition shows more strongly, that the sufficient
condition is actually necessary (modulo monotonicity of $\v\g$),
i.e.\ cannot be weakened.
\begin{proposition}[\bf\boldmath $U_{1\infty}\not\Rightarrow V_{\infty\g}$]\label{propUnV}
For every monotone $\v\g$ with $\sup_k{k\g_k\over\G_k}=\infty$, there
are $\v r$ for which $U_{1\infty}$
exists, but not $V_{\infty\g}$.
\end{proposition}
% proof idea
The proof idea is to construct a binary $\v r$ such that all change points
$k_n$ and $m_n$ satisfy $\G_{k_n}\approx 2\G_{m_n}$. This ensures
that $V_{k_n\g}$ receives a significant contribution from 1-run $n$,
i.e.\ is large. Choosing $k_{n+1}\gg m_n$ ensures that $V_{m_n\g}$ is
small, hence $V_{k\g}$ oscillates. Since the quasi-horizon is small
(along a subsequence, $\ph_k\neq\Omega(k)$), the 1-runs are short
enough to keep $U_{1m}$ small, so that $U_{1\infty}=0$.
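The construction can be imitated with geometric discount $\g_k=2^{-k}$, for which $k\g_k/\G_k=k/2\to\infty$: isolated 1-rewards at $k=4^n$ make $V_{k\g}$ jump between about $\odt$ and $0$, while $U_{1m}\to 0$. A sketch with these illustrative choices:

```python
from fractions import Fraction

# Geometric discount gamma_k = 2^{-k}: Gamma_k = 2^{-k+1} exactly, so
# k*gamma_k/Gamma_k = k/2 -> infinity (the quasi-horizon stays bounded).
spikes = [4 ** n for n in range(1, 5)]       # r_k = 1 iff k in {4,16,64,256}

def V(k):
    """Exact V_{k,gamma} for the spike rewards, via rational arithmetic."""
    num = sum(Fraction(1, 2 ** i) for i in spikes if i >= k)
    return float(num / Fraction(1, 2 ** (k - 1)))   # divide by Gamma_k

for k in spikes:
    print(k, V(k), V(k + 1))   # ~1/2 at each spike, ~0 right after it
m = 256
print(sum(1 for s in spikes if s <= m) / m)  # U_{1,256} ~ 0.016, -> 0
```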
\paradot{Proof}
The assumption ensures that there exists a sequence $m_1$, $m_2$, $m_3$, ...
for which
\beqn
{m_n\g_{m_n}\over\G_{m_n}} \;\geq\; n^2
\eeqn
We further (can) require $\G_{m_n}<\odt\G_{m_{n-1}+1}$ $(m_0:=0)$.
For each $m_n$ we choose $k_n$ such that $\G_{k_n}\approx
2\G_{m_n}$. More precisely, since $\G$ is monotone decreasing and
$\G_{m_n}<2\G_{m_n}\leq\G_{m_{n-1}+1}$, there exists (a unique)
$k_n$ in the range $m_{n-1}<k_n\leq m_n$ for which
\beqn
\G_{k_n} \;\geq\; 2\G_{m_n} \;>\; \G_{k_n+1}
\eeqn
We choose a binary reward sequence with $r_k=1$ iff $k_n\leq k<m_n$
for some $n$.