%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Time Consistent Discounting                               %%
%% Tor Lattimore and Marcus Hutter (2011)                    %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt,twoside]{article}
\topmargin=-10mm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\usepackage{latexsym,amsmath,amssymb}
\usepackage{wrapfig}
\usepackage{tikz}
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\newcommand{\eqn}[1]{\begin{align}#1\end{align}}
\newcommand{\eq}[1]{\begin{align*}#1\end{align*}}
\newcommand{\proofof}{Proof of }
\newcommand{\proofend}{}
\def\subsubsect#1{\vspace{1ex plus 0.5ex minus 0.5ex}\noindent{\bf\boldmath{#1.}}}
\usetikzlibrary{trees,arrows,automata}
\usetikzlibrary{decorations.pathmorphing}
\usetikzlibrary{decorations.markings}
\renewcommand{\v}[1]{{\boldsymbol #1}}
\newcommand{\argmax}{\operatornamewithlimits{arg\,max}}
\newcommand{\ind}[1]{[\![ #1 ]\!]}
\newcommand{\R}[0]{\mathbb R}
\newcommand{\N}[0]{\mathbb N}
\usepackage{amsthm}
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\newtheorem{example}[theorem]{Example}
\theoremstyle{remark}
\newenvironment{keywords}{\centerline{\bf\small
Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\newcommand{\tpic}[1]{
\tikzstyle{state} = [circle,draw,minimum width=0.5cm, minimum height=0.5cm]
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=\nodedist, semithick]
\scriptsize
{
#1
}
\end{tikzpicture}
}
\newcommand{\X}[0]{\mathcal X}
\newcommand{\T}[0]{T}
\renewcommand{\H}[0]{\mathcal H}
\newcommand{\Y}[0]{Y}
\renewcommand{\O}[0]{\mathcal O}
\newcommand{\B}[0]{\mathcal B}
\newcommand{\A}[0]{\mathcal A}
\renewcommand{\Re}[0]{\mathcal R}
\newcommand{\Rw}[0]{{\boldsymbol{R}}}
\renewcommand{\S}[0]{\ensuremath{\mathcal S}}
\newcommand{\g}[0]{\gamma}
\renewcommand{\d}[0]{{\mathbf d}}
\newcommand{\du}[1]{{\boldsymbol{d}^{#1}}}
\newcommand{\dt}[2]{{d_{#2}^{#1}}}
\newcommand{\dtp}[2]{{\mathbf {\d^{'}}_{#2}^{#1}}}
\newcommand{\dup}[1]{{{\boldsymbol{d}^{'}}^{#1}}}
\newcommand{\td}[0]{{\tilde {\mathbf d}}}
\newcommand{\tdu}[1]{{\boldsymbol{\tilde d}^{#1}}}
\newcommand{\tdt}[2]{{\tilde d_{#2}^{#1}}}
\newcommand{\D}[2]{{D_{#1,#2}}}
\newcommand{\nodedist}[0]{1.2cm}
\newcommand{\shortnodedist}[0]{0.70cm}
\newcommand{\start}{{\tiny \bfseries $\mathcal S$}}
\sloppy
\raggedbottom
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Title - Page %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{%\vspace{-4ex}
\vskip 2mm\bf\Large\hrule height5pt \vskip 4mm
Time Consistent Discounting
\vskip 4mm \hrule height2pt}
\author{{\bf Tor Lattimore$^1$} and {\bf Marcus Hutter$^{1,2}$}\\[3mm]
\normalsize Research School of Computer Science \\[-0.5ex] %, NICTA \\
\normalsize $^1$Australian National University \hspace{1cm} $^2$ETH Z{\"u}rich \\[-0.5ex]
\normalsize\texttt{\{tor.lattimore,marcus.hutter\}@anu.edu.au}
}
\date{15 July 2011}
%\vspace*{-5ex}
\maketitle
\begin{abstract}
A possibly immortal agent tries to maximise its summed discounted rewards over time, where discounting is used to avoid infinite
utilities and encourage the agent to value current rewards more than future ones. Some commonly used discount functions lead
to time-inconsistent behavior where the agent changes its plan over time. These inconsistencies can lead to very poor behavior.
We generalise the usual discounted utility model to one where the discount function changes with the age of the agent. We then
give a simple characterisation of time-(in)consistent discount functions and show the existence of a rational policy for an
agent that knows its discount function is time-inconsistent.
%
\def\contentsname{\centering\normalsize Contents}
{\parskip=-2.7ex\tableofcontents}
\end{abstract}
\begin{keywords} %\small
Rational agents;
sequential decision theory;
general discounting;
time-consistency;
game theory.
\end{keywords}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The goal of an agent is to maximise its expected utility; but how do we measure utility? One method is to
assign an instantaneous reward to particular events, such as having a good meal, or a pleasant walk. It would be
natural to measure the utility of a plan (policy) by simply summing the expected instantaneous rewards,
but for immortal agents this may lead to infinite utility and also assumes
rewards are equally valuable irrespective of the time at which they are received.
One solution, the discounted utility (DU) model introduced by
Samuelson in \cite{Sam37}, is to take a weighted sum of the rewards with earlier rewards usually valued more than
later ones.
There have been a number of criticisms of the DU model, which we will not discuss. For an excellent
summary, see \cite{FOO02}. Despite the criticisms, the DU model is widely used in both economics and computer science.
A discount function is time-inconsistent if plans chosen to maximise expected discounted utility change over time. For
example, many people express a preference for \$110 in 31 days over \$100 in 30 days, but reverse that preference 30 days later
when given a choice between \$110 tomorrow or \$100 today \cite{FGM94}. This behavior can be caused by a rational agent with
a time-inconsistent discount function.
Unfortunately, time-inconsistent discount functions can lead to extremely bad behavior and so it becomes important to
ask what discount functions are time-inconsistent.
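The \$100/\$110 reversal above can be reproduced with a hyperbolic sliding discount function, a standard example of time-inconsistent discounting. The following sketch is illustrative only (the discount function $d(t) = 1/(1+t)$ and all names are ours, not taken from the cited study):

```python
# Hypothetical illustration of the $100/$110 preference reversal under
# hyperbolic discounting d(t) = 1/(1 + t); purely illustrative.

def present_value(reward, delay):
    """Discounted value of `reward` received `delay` days from now."""
    return reward / (1.0 + delay)

# Viewed from day 0: $110 in 31 days beats $100 in 30 days.
assert present_value(110, 31) > present_value(100, 30)

# Viewed from day 30: $100 today beats $110 tomorrow -- the preference flips.
assert present_value(100, 0) > present_value(110, 1)
```

The same comparison under geometric discounting never flips, which is exactly the time-consistency property characterised later in the paper.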
Previous work has focussed on a continuous model where agents can take actions at any time in a continuous time-space.
We consider a discrete model where agents act in finite time-steps. In general this is not a limitation since any
continuous environment can be approximated arbitrarily well by a discrete one. The discrete setting has the
advantage of easier analysis, which allows us
to consider a very general setup where environments are arbitrary finite or infinite Markov decision processes.
Traditionally, the DU model has assumed a sliding discount function. Formally, a sequence of instantaneous
utilities (rewards)
$R = (r_k, r_{k+1}, r_{k+2}, \cdots)$ starting at time $k$, is given utility equal to $\sum_{t=k}^\infty \dt{}{t-k} r_{t}$ where
$\du{} \in [0, 1]^\infty$.
We generalise this model as in \cite{Hut06} by allowing the discount function to depend on the age of the agent.
The new utility is given by
$\sum_{t=k}^\infty \dt{k}{t} r_{t}$. This generalisation is consistent with how some agents tend to behave; for example, humans becoming temporally less myopic
as they grow older.
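The generalised utility $\sum_{t=k}^\infty \dt{k}{t} r_t$ can be sketched in a few lines; the function and discount matrices below are illustrative assumptions, not code from the paper:

```python
# Minimal sketch of the generalised DU model: the discount vector used at
# time k may depend on k (the agent's age). Names here are illustrative.

def utility(rewards, discount, k):
    """Discounted utility of rewards r_k, r_{k+1}, ... judged at age k.

    `rewards[i]` is r_{k+i} and `discount(k, t)` is d^k_t (1-indexed times).
    """
    return sum(discount(k, t) * r for t, r in enumerate(rewards, start=k))

# Sliding geometric discounting: d^k_t = gamma^(t - k), independent of age.
geometric = lambda k, t, gamma=0.5: gamma ** (t - k)

# An age-dependent matrix: the agent grows less myopic as it ages.
aging = lambda k, t: (k / (k + 1)) ** (t - k)

rewards = [1.0, 1.0, 1.0]
assert utility(rewards, geometric, k=1) == 1 + 0.5 + 0.25
assert utility(rewards, aging, k=10) > utility(rewards, aging, k=1)
```

The second assertion shows the `aging` agent weighting the same future rewards more heavily at age 10 than at age 1.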
Strotz \cite{Str55} showed that the only time-consistent sliding discount function is geometric discounting. We
extend this result to a full characterisation of time-consistent discount functions where the discount function is
permitted to change over time. We also show that discount functions that are ``nearly'' time-consistent incur low
regret from the anticipated future changes of the policy over time.
Another important question is what policy should be adopted by an agent that knows it is time-inconsistent. For example, if it
knows it will become temporarily myopic in the near future then it may benefit from paying a price to pre-commit
to following a particular policy. A number of authors
have examined this question in special continuous cases, including \cite{Gol80,BM73,Pol68,Str55}. We
modify their results to our general, but discrete, setting using game theory.
The paper is structured as follows. First the required notation is introduced (Section 2). Example discount functions and the consequences
of time-inconsistent discount functions are then presented (Section 3).
We next state and prove the main theorems, the complete classification of discount functions and the continuity result (Section 4). The game theoretic view
of what an agent {\it should} do if it knows its discount function is changing is analyzed (Section 5). Finally we offer some discussion and
concluding remarks (Section 6).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Notation and Problem Setup}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The general reinforcement learning (RL) setup involves an agent interacting sequentially with an environment where in each
time-step $t$ the agent chooses some action $a_t \in \A$, whereupon it receives a reward
$r_t \in \Re \subseteq \R$ and observation $o_t \in \O$.
The environment can be formally defined
as a probability distribution $\mu$ where
$\mu(r_t o_t | a_1 r_1 o_1 a_2 r_2 o_2 \cdots a_{t - 1} r_{t - 1} o_{t - 1} a_t)$ is the probability of
receiving reward $r_t$ and observation $o_t$ having taken action $a_t$ after history $h_{<t}a_t$, where $h_{<t} := a_1 r_1 o_1 \cdots a_{t-1} r_{t-1} o_{t-1}$.
\setlength{\intextsep}{0pt}
\begin{wrapfigure}{r}{7.2cm}
\tpic{
\tikzstyle{every state}=[fill=none,draw=black,text=black, node distance=2.0cm]
\node[state] (a) {\start};
\node[state] (b) [above right of=a] {};
\node[state] (c) [below right of=a] {};
\node[state] (d) [right of=b] {};
\node[state] (e) [right of=c] {};
\node[state] (f) [right of=d] {};
\node[state] (g) [right of=e] {};
\node[state, draw=none] (h) [right of=f] {};
\node[state, draw=none] (i) [right of=g] {};
\path (a) edge node {$1.0$} (b)
(a) edge node [left] {$0.8$} (c)
(b) edge node {$0.7$} (d)
(b) edge [bend left=30] node [right] {$0.8$} (e)
(c) edge [bend left=30] node [left] {$1.0$} (d)
(c) edge node {$0.5$} (e)
(d) edge node {$0.7$} (f)
(d) edge [bend left=30] node [right] {$0.8$} (g)
(e) edge [bend left=30] node [right] {$1.0$} (f)
(e) edge node {$0.5$} (g)
(f) edge [dashed] node {} (h)
(g) edge [dashed] node {} (i)
;
}
\end{wrapfigure}
A deterministic environment (where every value of $\mu(\cdot)$ is either 1 or 0) can be represented as a graph with edges for actions, rewards
of each action attached to the corresponding edge, and observations in the nodes. For example, the graph on the right represents a deterministic
environment where either pizza or pasta must be chosen at each time-step (evening). An action leading to an upper node is {\tt eat pizza}
while the ones leading to a lower node are
{\tt eat pasta}. The rewards are for a consumer who prefers
pizza to pasta, but dislikes having the same food twice in a row. The starting node is marked as $\mathcal S$. This example, along with all those
for the remainder of this paper, does not require observations.
The following assumption is required for clean results, but may be relaxed if an $\epsilon$ of slop is permitted in some results.
\begin{assumption}\label{assumption}
We assume that $\A$ and $\O$ are finite and that $\Re = [0, 1]$.
\end{assumption}
\begin{definition}[Policy]
A {\it policy} is a mapping $\pi:\H\to\A$ giving an action for each history.
\end{definition}
Given a policy $\pi$, a history $h_{1:t}$ and $s \leq t$, the probability of reaching history $h_{1:t}$ when starting from history $h_{<s}$ and following policy $\pi$ is determined by $\mu$ and $\pi$ in the natural way.
\begin{definition}[Discount Vector]
A {\it discount vector} is a vector $\du{k} = [\dt{k}{k}, \dt{k}{k+1}, \dt{k}{k+2}, \cdots] \in [0, 1]^\infty$ satisfying $\dt{k}{t} > 0$ for at least one $t \geq k$.
\end{definition}
%
The apparently superfluous superscript $k$ will be useful later when we allow the discount vector to change with time.
We do {\it not} insist that the discount vector be summable, $\sum_{t=k}^\infty \dt{k}{t} < \infty$.
%
\begin{definition}[Expected Values]
The expected discounted reward (or utility or value) when using policy $\pi$ starting in history $h_{<k}$ with discount vector $\du{k}$ is
\eq{
V^\pi_{\du{k}}(h_{<k}) := \mathbf{E}\left[\sum_{t=k}^\infty \dt{k}{t} r_t\right],
}
where the expectation is taken over the histories generated by $\mu$ when actions are chosen according to $\pi$.
\end{definition}
\begin{assumption}\label{assumption2}
We assume $\mu$ and $\d$ are such that $\lim_{t\to\infty} \sup_\pi \mathbf{E}\left[\sum_{i=t}^\infty \dt{k}{i} r_i\right] = 0$ for all $k$; in particular, all values are finite.
\end{assumption}
\begin{definition}[Optimal Policy/Value]\label{defn_optimal_value}
In general, our agent will try to choose a policy $\pi^*_{\du{k}}$ to maximise $V^\pi_{\du{k}}(h_{<k})$. Such a policy is called {\it optimal} with respect to $\du{k}$, and the {\it optimal value} is $V^*_{\du{k}}(h_{<k}) := \sup_\pi V^\pi_{\du{k}}(h_{<k})$.
\end{definition}
A {\it discount matrix} $\d$ is a matrix whose $k$th column is the discount vector $\du{k}$ used by the agent at time $k$.
\begin{definition}[Time-Consistency]\label{defn_consistent}
A discount matrix $\d$ is {\it time-consistent} if for every environment $\mu$ there exists a policy that is optimal with respect to $\du{k}$ for every $k$; otherwise $\d$ is {\it time-inconsistent}.
\end{definition}
We write $\pi_\d$ for the policy actually followed by the agent, which at each time $k$ takes the action of a policy that is optimal with respect to $\du{k}$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Examples}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
\subsubsect{Constant Horizon}
%-------------------------------%
$\dt{k}{t} = 1$ if $t - k < H$ and $0$ otherwise, for some fixed horizon $H \geq 1$.
This discount matrix makes the agent completely myopic beyond $H$ steps into the future and is time-inconsistent for $H > 1$.
%-------------------------------%
\subsubsect{Geometric}
%-------------------------------%
$\dt{k}{t} = \gamma^{t}$ with $\gamma \in (0, 1)$.
Geometric discounting is the most commonly used discount matrix. Philosophically it can be justified by assuming
an agent will die (and not care about the future after death) with probability $1 - \gamma$ at each time-step.
Another justification for geometric discounting is its analytic simplicity: it is summable and leads to time-consistent policies. It
also models fixed interest rates.
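A quick numerical sanity check of why geometric discounting is time-consistent: the relative weight of a reward at time $t$ versus time $t'$ is $\gamma^{t-t'}$, independent of the agent's age $k$, so a preference between two dated rewards never flips. This sketch is illustrative (names are ours), not the paper's proof:

```python
# Under the matrix d^k_t = gamma**t the comparison between two dated
# rewards does not involve k at all, so plans never flip with age.

gamma = 0.9

def value_at_age(k, reward, t):
    """Value at age k of `reward` arriving at time t, using d^k_t = gamma**t.

    Note the result is independent of k -- the source of consistency."""
    return gamma ** t * reward

def prefers_late(k, r_early, t_early, r_late, t_late):
    return value_at_age(k, r_late, t_late) > value_at_age(k, r_early, t_early)

# $100 at t=30 versus $110 at t=31: the same answer at every age.
choices = {prefers_late(k, 100, 30, 110, 31) for k in range(1, 31)}
assert len(choices) == 1
```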
%-------------------------------%
\subsubsect{No Discounting}
%-------------------------------%
$\dt{k}{t} = 1, \text{ for all } k, t$.
\cite{HL07} and \cite{Leg08} point out that discounting future rewards via an explicit discount matrix is unnecessary
since the environment
can capture both temporal preferences for early (or late) consumption, as well as the risk associated with delaying
consumption.
Of course, this ``discount matrix'' is not summable, but
can be made to work by insisting that all environments satisfy Assumption \ref{assumption2}.
%
This approach is elegant in the
sense that it eliminates the need for a discount matrix, essentially admitting far more complex preferences
regarding inter-temporal
rewards than a discount matrix allows. On the other hand, a discount matrix gives the
``controller'' an explicit way to adjust the myopia of the agent.
\setlength{\intextsep}{0pt}
\begin{wrapfigure}[5]{r}{5.2cm}
\topsep=0.0cm
\begin{center}
\vspace{-0.1cm}
\tpic{
\node[state] (a) {\start};
\node[state] (b) [above of=a] {};
\node[state] (c) [right of=a] {};
\node[state] (d) [above of=c] {};
\node[state] (e) [right of=c] {};
\node[state] (f) [above of=e] {};
\node[state] (g) [right of=e] {};
\node[state] (h) [above of=g] {};
\node[state,draw=none] (x) [right of=h] {};
\node[state,draw=none] (y) [right of=g] {};
\path (a) edge node {$1/2$} (b)
(a) edge node {$0$} (c)
(c) edge node {$2/3$} (d)
(c) edge node {$0$} (e)
(e) edge node {$3/4$} (f)
(e) edge node {$0$} (g)
(g) edge node {$4/5$} (h)
(b) edge node {$0$} (d)
(d) edge node {$0$} (f)
(f) edge node {$0$} (h)
(g) edge node {} (y)
(h) edge node {} (x)
;
}
\end{center}
\end{wrapfigure}
To illustrate the potential consequences of time-inconsistent discount matrices we consider the policies of
several agents acting in the following environment.
%
Let agent A use a constant horizon discount matrix with $H = 2$ and agent B a geometric discount matrix with
some discount rate $\gamma$.
In the first time-step agent A prefers to move right with the intention of moving up in the second time-step for a reward of
$2/3$. However, once in second time-step, it will change its plan by moving right again. This continues indefinitely,
so agent A will always delay moving up and receives zero reward forever.
Agent B acts very differently. Let $\pi_t$ be the policy in which the agent moves right until time-step $t$, then up and
right indefinitely. Then $V^{\pi_t}_{\du{k}}(h_{<1}) = \gamma^t {(t + 1) \over (t + 2)}$. This value does not depend on $k$ and
so the agent will move right until $t^* = \argmax_t \left\{\gamma^t {(t + 1) \over (t + 2)} \right\} < \infty$, when it
will move up and receive a reward.
The actions of agent A are an example of the worst possible behavior arising from time-inconsistent discounting.
Nevertheless, agents with a constant horizon discount matrix are used in all kinds of problems; in particular, agents
playing zero-sum games frequently use fixed-depth mini-max searches. In practice, serious time-inconsistent behavior
for game-playing agents seems rare, presumably because most strategic games don't have a reward structure similar
to the example above.
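The contrast between the two agents can be reproduced in a few lines. This is an illustrative reconstruction of the chain environment above (reward $(t+1)/(t+2)$ for moving up at step $t$, zero thereafter); the function names and the finite search range are our assumptions:

```python
# Agent A: constant horizon H=2, so it compares only the next two rewards,
# both undiscounted. 'Up now' pays (t+1)/(t+2); 'right now, up next' pays
# (t+2)/(t+3) one step later -- always larger, so A delays forever.

def agent_A_moves_up(t):
    up_now = (t + 1) / (t + 2)
    up_next = (t + 2) / (t + 3)
    return up_now >= up_next

assert not any(agent_A_moves_up(t) for t in range(1000))  # procrastinates

# Agent B: geometric discounting evaluates the whole plan pi_t, worth
# gamma^t (t+1)/(t+2), and commits to the finite maximiser.
gamma = 0.9
values = [gamma ** t * (t + 1) / (t + 2) for t in range(50)]
t_star = max(range(50), key=lambda t: values[t])
assert t_star == 2   # with gamma = 0.9 the agent moves up at step 2
```

Agent A's myopia never lets the pending reward outweigh the slightly larger one a step ahead, while agent B's geometric weights make further delay eventually unprofitable.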
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Theorems}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The main theorem of this paper is a complete characterisation of time-consistent discount matrices.
\begin{theorem}[Characterisation] \label{thm_main}
Let $\d$ be a discount matrix, then the following are equivalent.
\begin{enumerate}
\item $\d$ is time-consistent (Definition \ref{defn_consistent})
\item For each $k \in \N$ there exists an $\alpha_k \in \R$ such that $\dt{k}{t} = \alpha_k \dt{1}{t}$ for all $t \geq k$.
\end{enumerate}
\end{theorem}
%
Recall that a discount matrix is sliding if $\dt{k}{t} = \dt{1}{t - k + 1}$.
Theorem \ref{thm_main} can be used to show that if a sliding discount matrix is used as in \cite{Str55} then the only
time-consistent discount matrix is geometric.
%
Let $\d$ be a time-consistent sliding discount matrix.
By Theorem \ref{thm_main} and the definition of sliding, $\alpha_1 \dt{1}{t + 1} = \dt{2}{t + 1} = \dt{1}{t}$.
Therefore ${1 \over \alpha_1} \dt{1}{2} = \dt{1}{1}$ and $\dt{1}{3} = {1 \over \alpha_1} \dt{1}{2} = \left(1 \over \alpha_1\right)^2 \dt{1}{1}$ and similarly, $\dt{1}{t} = \left(1\over \alpha_1\right)^{t-1} \dt{1}{1} \propto \gamma^t$ with
$\gamma = 1/\alpha_1$, which is geometric discounting. This is the analogue to the results of \cite{Str55} converted to our setting.
The theorem can also be used to construct time-consistent discount matrices. Let $\du{1}$ be a discount vector, then the
discount matrix defined by
$\dt{k}{t} := \dt{1}{t}$ for all $t \geq k$ will always be time-consistent; for example, the {\it fixed lifetime} discount matrix with $\dt{k}{t} = 1$
if $t \leq H$ for some horizon $H$. Indeed, all time-consistent discount matrices can be constructed in this way (up to scaling).
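The construction and the characterisation can be checked numerically. The snippet below is an illustrative sketch (our names, truncated ranges) verifying that the fixed-lifetime matrix has every row a scalar multiple of the first, here with $\alpha_k = 1$:

```python
# Fixed-lifetime discount matrix: d^k_t = 1 iff t <= H, independent of k.

H = 5  # illustrative fixed lifetime

def d(k, t):
    return 1.0 if t <= H else 0.0

# Characterisation check: for each k, d^k_t = alpha_k * d^1_t on t >= k,
# with alpha_k = 1, so the matrix is time-consistent.
for k in range(1, 10):
    assert all(d(k, t) == 1.0 * d(1, t) for t in range(k, 20))
```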
\begin{proof}[\proofof Theorem \ref{thm_main}]
$2 \Longrightarrow 1$: This direction follows easily from linearity of the scalar product: for any $t \geq k$ and history $h_{<t}$,
\eqn{
\label{eqn-0-1} \pi^*_{\du{k}}(h_{<t}) = \argmax_\pi V^\pi_{\du{k}}(h_{<t}) = \argmax_\pi \alpha_k V^\pi_{\du{1}}(h_{<t}) = \pi^*_{\du{1}}(h_{<t}),
}
where we used $\alpha_k > 0$, which holds since every discount vector has at least one positive component. Thus the optimal policy is independent of $k$ and $\d$ is time-consistent.

$1 \Longrightarrow 2$: Suppose $\d$ is time-consistent and fix $k$. Comparing the agent's preferences in environments offering a choice between a single reward at time $t$ and a single reward at time $t'$, time-consistency forces the relative weights of any two times to agree at ages $1$ and $k$,
\eqn{
\label{eqn-main-eq3} \dt{k}{t}\,\dt{1}{t'} = \dt{k}{t'}\,\dt{1}{t} \qquad \text{for all } t, t' \geq k.
}
Let $t_1 < t_2 < t_3 < \cdots$ be the times $t \geq k$ with $\dt{1}{t} > 0$. By the previous argument we have that,
$[\dt{k}{t_i}, \dt{k}{t_{i+1}}] = \alpha_k [\dt{1}{t_{i}}, \dt{1}{t_{i+1}}]$ and
$[\dt{k}{t_{i+1}}, \dt{k}{t_{i+2}}] = \tilde \alpha_k [\dt{1}{t_{i+1}}, \dt{1}{t_{i+2}}]$.
Therefore $\alpha_k = \tilde \alpha_k$, and by induction, $\dt{k}{t_i} = \alpha_k \dt{1}{t_i}$ for all $i$.
Now if $t \geq k$ and $\dt{1}{t} = 0$ then $\dt{k}{t} = 0$ by equation (\ref{eqn-main-eq3}). By symmetry,
$\dt{k}{t} = 0 \implies \dt{1}{t} = 0$. Therefore $\dt{k}{t} = \alpha_k \dt{1}{t}$ for all $t \geq k$ as required.
\proofend\end{proof}
%
In Section 3 we saw an example where time-inconsistency led to very bad behavior. The discount matrix causing
this was very time-inconsistent. Is it possible that an agent using a ``nearly''
time-consistent discount matrix can exhibit similar bad behavior? For example, could rounding errors when using
a geometric discount matrix seriously affect the agent's behavior? The following theorem shows that this is not
possible.
%
First we require a measure of the cost of time-inconsistent behavior. The regret experienced by the agent at time
zero from following policy $\pi_\d$ rather than $\pi^*_\du{1}$ is $V^*_\du{1}(h_{<1}) - V^{\pi_\d}_\du{1}(h_{<1})$.
We also need a distance measure on the space of discount vectors.
\begin{definition}[Distance Measure]\label{defn_vector_distance}
Let $\du{k}, \du{j}$ be discount vectors then define a distance measure $D$ by
\eq{
D(\du{k}, \du{j}) := \sum_{i = \max\left\{k, j\right\}}^\infty |\dt{k}{i} - \dt{j}{i}|.
}
Note that this is almost the taxicab metric, but the sum is restricted to $i \geq \max\left\{k, j\right\}$.
\end{definition}
\begin{theorem}[Continuity]\label{thm_cont}
Suppose $\epsilon \geq 0$ and $\D{k}{j}:= D(\du{k}, \du{j})$ then
\eq{
V^*_\du{1}(h_{<1}) - V^{\pi_\d}_\du{1}(h_{<1}) \leq \epsilon + \D{1}{t} + \sum_{k=1}^{t-1} \D{k}{k+1}
}
with $t = \min\left\{t : \sup_\pi \mathbf{E}\left[\sum_{i=t}^\infty \dt{1}{i} r_i\right] < \epsilon \right\}$, which for $\epsilon > 0$ is guaranteed to exist by Assumption \ref{assumption2}.
\end{theorem}
Theorem \ref{thm_cont} implies that the regret of the agent at time zero in its future time-inconsistent actions is
bounded by the sum of the differences between the discount vectors used at different times. If these differences
are small then the regret is also small. For example, it implies that small perturbations (such as rounding errors)
in a time-consistent discount matrix lead to minimal bad behavior.
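The rounding-error claim can be illustrated numerically. Below, a geometric matrix is perturbed by a tiny row-dependent error standing in for per-age rounding; the distances $D$ of the continuity bound stay proportionally tiny. This is our own sketch (perturbation scheme and truncation are assumptions), not a computation from the paper:

```python
# Perturbed geometric matrix: each row gets a tiny age-dependent error,
# mimicking rounding. Sums are truncated at T since gamma**T is negligible.

gamma, T = 0.5, 60

def d(k, t):
    noise = ((k * t) % 3 - 1) * 1e-6   # in {-1e-6, 0, 1e-6}, row-dependent
    return gamma ** t * (1 + noise)

def D(k, j):
    """Distance measure: sum over i >= max(k, j) of |d^k_i - d^j_i|."""
    return sum(abs(d(k, i) - d(j, i)) for i in range(max(k, j), T))

# The regret bound D(1,t) + sum of consecutive distances is tiny too.
total = D(1, 10) + sum(D(k, k + 1) for k in range(1, 10))
assert total < 1e-4
```

Since the true regret is bounded by this total (plus $\epsilon$), perturbations on the order of rounding error cannot cause behavior anywhere near as bad as agent A's in Section 3.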
The proof is omitted due to space limitations. It relies on proving the result for finite horizon environments and showing that
this extends to the infinite case by using the horizon, $t$, after which the actions of the agent are no longer important.
The bound in Theorem \ref{thm_cont} is tight in the following sense.
\begin{theorem}\label{thm_lower_bound}
For $\delta > 0$ and $t \in \N$ and any sufficiently small $\epsilon > 0$ there exists an environment and discount matrix such that
\eq{
(t-2)(1 - \epsilon)\delta < V^*_\du{1}(h_{<1}) - V^{\pi_\d}_\du{1}(h_{<1}) &< (t+1)\delta \\
&\equiv \D{1}{t} + \sum_{i=1}^{t-1} \D{i}{i+1}
}
where $t = \min\left\{t : \sup_\pi \mathbf{E}\left[\sum_{i=t}^\infty \dt{1}{i} r_i\right] < \epsilon \right\}$ as in Theorem \ref{thm_cont}.
\end{theorem}
Thus for every sufficiently small $\epsilon > 0$ there exist environments and discount matrices where the regret
due to time-inconsistency is nearly equal to the bound given by Theorem \ref{thm_cont}.
\begin{proof}[\proofof Theorem \ref{thm_lower_bound}]
Define $\d$ by
\eq{
\dt{k}{i} = \begin{cases}
\delta & \text{if } k < i < t \\
0 & \text{otherwise}
\end{cases}
}
Observe that $D(\du{k}, \du{j}) = \delta$ for all $k < j < t$ since $\dt{j}{i} = \dt{k}{i}$ for all $i$ except $i = j$.
Now consider the environment below.
\begin{center}
\tpic{
\tikzstyle{every state}=[fill=none,draw=black,text=black]
\node[state] (a) {\start};
\node[state] (b) [right of=a] {};
\node[state] (c) [right of=b] {};
\node[state,draw=none] (h) [right of=c] {$\cdots$};
\node[state,node distance=\shortnodedist] (i) [right of=h] {};
\node[state] (e) [below of=b] {};
\node[state] (f) [below of=c] {};
\node[state] (j) [below of=i] {};
\path (a) edge node {$0$} (b)
(b) edge node {$0$} (c)
(c) edge node {$0$} (h)
(b) edge node {$1 - \epsilon $} (e)
(c) edge node {$1 - \epsilon^2$} (f)
(i) edge node {$1 - \epsilon^{t-1}$} (j)
(e) edge[loop left] node {$1 - \epsilon$} (e)
(f) edge[loop below] node {$1 - \epsilon^2$} (f)
(j) edge[loop right] node {$0$} (j)
;
}\end{center}
For sufficiently small $\epsilon$, the agent at time zero will plan to move right and then down leading to $\Rw^*_\du{1}(h_{<1}) = [0, 1 - \epsilon, 1 - \epsilon, \cdots]$ and $V^*_\du{1}(h_{<1}) = (t - 1) \delta (1 - \epsilon)$.
To compute $\Rw_\d$ note that $\dt{k}{k} = 0$ for all $k$. Therefore the agent in time-step $k$ doesn't care about the
next instantaneous reward, so prefers to move right with the intention of moving down in the next time-step when the rewards
are slightly better. This leads to
$\Rw_\d(h_{<1}) = [0, 0, \cdots, 1 - \epsilon^{t-1}, 0, 0, \cdots]$.
Therefore,
\eq{
V^*_\du{1}(h_{<1}) - V^{\pi_\d}_{\du{1}}(h_{<1}) &= (t - 1) \delta (1 - \epsilon) - (1 - \epsilon^{t-1}) \delta
\geq (t - 2)\delta (1 - \epsilon)
}
as required.
\proofend\end{proof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Game Theoretic Approach}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
What should an agent do if it knows it is time inconsistent? One option is to treat its future selves as ``opponents''
in an extensive game. The game has one player per time-step who chooses the action for that time-step only. At the end of
the game the agent will have received a reward sequence $\v r \in \Re^\infty$. The utility given to the $k$th player is
then $\v r \cdot \du{k}$. So
each player in this game wishes to maximise the discounted reward with respect to a different
discounting vector.
\setlength{\intextsep}{0pt}
\begin{wrapfigure}[7]{r}{3.2cm}
\topsep=0.0cm
\begin{center}
\vspace{-0.1cm}
\tpic{
\tikzstyle{every state}=[fill=none,draw=black,text=black]
\node[state] (a) {\start};
\node[state] (b) [below of=a] {};
\node[state] (c) [right of=a] {};
\node[state] (d) [below of=c] {};
\node[state] (e) [right of=c] {};
\node[state] (f) [below of=e] {};
\path (a) edge node {$4$} (b)
(a) edge node {$1$} (c)
(c) edge node {$3$} (d)
(c) edge node {$1$} (e)
(f) edge[loop below] node {$0$} (f)
(d) edge[loop below] node {$0$} (d)
(b) edge[loop below] node {$0$} (b)
(e) edge node {$3$} (f);
}
\end{center}
\end{wrapfigure}
For example, let $\du{1} = [2, 1, 2,0,0,\cdots]$ and $\du{2} = [*, 3, 1,0,0,\cdots]$ and consider the environment on the right.
Initially, the agent has two choices. It can either move down to guarantee a reward sequence
of $\v r = [4, 0, 0, \cdots]$ which
has utility of $\du{1} \cdot [4, 0, 0, \cdots] = 8$ or it can move right in which case it will receive a reward sequence of
either $\v r' = [1, 3, 0,0,\cdots]$ with utility $5$ or
$\v r'' = [1, 1, 3, 0,0, \cdots]$ with utility $9$. Which of these two reward sequences it receives is determined by the action taken
in the second time-step. However this
action is chosen to maximise utility with respect to discount vector $\du{2}$, and $\du{2} \cdot \v r' > \du{2} \cdot \v r''$.
This means that if at time $1$ the agent chooses to move right, the final reward sequence will be $[1, 3, 0,0,\cdots]$ and the
final utility with respect to $\du{1}$ will be $5$. Therefore the rational thing to do in time-step 1 is to move down immediately
for a utility of $8$.
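The backwards induction above is easily reproduced; the code below is an illustrative transcription of the example (names are ours, rewards and discount vectors come from the text, with the irrelevant `*` entry of $\du{2}$ set to 0):

```python
d1 = [2, 1, 2]   # time-1 player's discount vector d^1 (first three entries)
d2 = [0, 3, 1]   # time-2 player's d^2 (the '*' entry never matters)

def u(d, r):
    """Utility d . r of reward sequence r under discount vector d."""
    return sum(di * ri for di, ri in zip(d, r))

r_down = [4, 0, 0]      # player 1 moves down immediately
r_prime = [1, 3, 0]     # r':  right, then down
r_dprime = [1, 1, 3]    # r'': right, right, then down

# Backwards induction: player 2 picks between r' and r'' using d^2.
player2_pick = max([r_prime, r_dprime], key=lambda r: u(d2, r))
assert player2_pick == r_prime              # 9 > 6 under d^2

# Player 1 therefore compares 'down' (utility 8 under d^1) with what
# 'right' actually yields, namely r' (utility 5), and moves down.
assert u(d1, r_down) > u(d1, player2_pick)
```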
The technique above is known as backwards induction; a variant of Kuhn's theorem proves that it finds the sub-game
perfect equilibria of any finite extensive game \cite{OR94}. For arbitrary (possibly infinite) extensive games a sub-game perfect
equilibrium need not exist, but we prove a theorem for our particular class of infinite games.
A sub-game perfect equilibrium policy is one the players could agree to play, and subsequently have no incentive to
renege on their agreement during play. It isn't always philosophically clear that a sub-game perfect equilibrium
policy {\it should} be played. For a deeper discussion, including a number of good examples, see \cite{OR94}.
\begin{definition}[Sub-game Perfect Equilibria]
A policy $\pi^*_\d$ is a {\it sub-game perfect equilibrium policy} if and only if for each $t$ and history $h_{<t}$,
\eq{
V^{\pi^*_\d}_{\du{t}}(h_{<t}) \geq V^{\pi}_{\du{t}}(h_{<t})
}
for every policy $\pi$ that differs from $\pi^*_\d$ only on histories of length $t$.
\end{definition}
\begin{theorem}[Existence]\label{thm_existence}
For every environment $\mu$ and discount matrix $\d$ satisfying Assumptions \ref{assumption} and \ref{assumption2}, a sub-game perfect equilibrium policy exists.
\end{theorem}
The proof idea is as follows. For each $t$, consider the environment truncated at time $t$, that is, the environment which gives zero reward always after time $t$. Backwards induction yields a sub-game perfect equilibrium policy $\pi_t$ for this finite game. Since $\A$ is finite, there is a subsequence $t_1 < t_2 < \cdots$ on which the policies $\pi_{t_i}$ converge pointwise to a limit policy $\pi$. By Assumption \ref{assumption2} the values of the truncated games under $\pi$ converge to those of the original game, and therefore $\pi$ is a sub-game perfect equilibrium policy for the original game.