
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Exact Bayesian Regression of Piecewise Constant Functions %%
%%            Marcus Hutter, Start: June 2005                %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%-------------------------------%
%   Document-Style              %
%-------------------------------%
\documentclass[12pt,twoside]{article}
\usepackage{latexsym,amsmath,hyperref}
\usepackage{graphicx}
\topmargin=-10mm  \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\sloppy\lineskip=0pt

%-------------------------------%
%     My Math-Environments      %
%-------------------------------%
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip=  4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu                    \medmuskip=  1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
%\def\dispmuskip{}\def\textmuskip{}
\textmuskip
\def\beq{\dispmuskip\begin{equation}}    \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}}    \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}}  \def\eqan{\end{eqnarray*}\textmuskip}

%-------------------------------%
%   Macro-Definitions           %
%-------------------------------%
\newenvironment{keywords}{\centerline{\bf\small
Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\def\paradot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1.}}}
\def\paranodot#1{\vspace{1.3ex plus 0.5ex minus 0.5ex}\noindent{\bf{#1}}}
\def\aidx#1{}
\def\req#1{(\ref{#1})}
\def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}
\def\eps{\varepsilon}
\def\epstr{\epsilon}                    % for empty string
\def\nq{\hspace{-1em}}
\def\qed{\hspace*{\fill}$\Box\quad$\\}
\def\fr#1#2{{\textstyle{#1\over#2}}}
\def\frs#1#2{{^{#1}\!/\!_{#2}}}
\def\SetR{I\!\!R}
\def\SetN{I\!\!N}
\def\SetB{I\!\!B}
\def\SetZ{Z\!\!\!Z}
\def\qmbox#1{{\quad\mbox{#1}\quad}}
\def\e{{\rm e}}                        % natural e
\def\B{\{0,1\}}
\def\E{{\bf E}}                         % Expectation
\def\P{{\rm P}}                         % Probability
\def\v{\vec}
\def\es{\mbox{\o}}                          % empty set
\def\l{\ell}
\def\s{\sigma}
\def\v{\boldsymbol} %\boldsymbol\frak\Bbb\pmb\text
\def\text#1{\mbox{\scriptsize #1}}
\def\EE{I\!\!E}
\def\lesssim{\mbox{\raisebox{-0.8ex}{$\stackrel{\displaystyle<}\sim$}}} %% if not ams
\def\Var{\mbox{Var}}
\def\aai{{\scriptscriptstyle[]}}         % algorithm array index

\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                      T i t l e - P a g e                      %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{\vspace{-4ex}
\vskip 2mm\bf\Large\hrule height5pt \vskip 3mm
Exact Bayesian Regression of \\ Piecewise Constant Functions
\vskip 3mm \hrule height2pt}
\author{{\bf Marcus Hutter}\\[3mm]
\normalsize RSISE$\,$@$\,$ANU and SML$\,$@$\,$NICTA \\
\normalsize Canberra, ACT, 0200, Australia%
\thanks{A short version appeared in the proceedings of the Valencia 2006 meeting \cite{Hutter:07pcreg}.}\\
\normalsize \texttt{marcus@hutter1.net \ \  www.hutter1.net} }
\date{4 May 2007}
\maketitle
\vspace*{-5ex}

\begin{abstract}\noindent
We derive an exact and efficient Bayesian regression algorithm for
piecewise constant functions of unknown segment number, boundary
locations, and levels. The derivation works for any noise and segment
level prior, e.g.\ Cauchy, which can handle outliers. We derive
simple but good estimates for the in-segment variance. We also
propose a Bayesian regression curve as a better way of smoothing
data without blurring boundaries. The Bayesian approach also allows
straightforward determination of the evidence, break probabilities
and error estimates, useful for model selection and significance and
robustness studies. We discuss the performance on synthetic and
real-world examples. Many possible extensions are discussed.
\def\contentsname{\centering\normalsize Contents}
{\parskip=-2.5ex\tableofcontents}
\end{abstract}

\begin{keywords}
Bayesian regression, exact polynomial algorithm, non-parametric
inference, piecewise constant function, dynamic programming, change point problem.
\vspace{-3ex}
\end{keywords}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{quote}\it
``In science the primary duty of ideas is to be useful and interesting,
even more than to be true.'' \par
\vspace{-2ex}\hfill --- {\sl Wilfred Trotter (1941)}
\end{quote}

\noindent We consider the problem of fitting a piecewise constant function
through noisy one-dimensional data, as e.g.\ in Figure
\ref{figGLPCR}, where the segment number, boundaries and levels are
unknown. Regression with piecewise constant (PC) functions, a
special case of change point detection, has many applications, e.g.\
in seismology, tomography, biology, and econometric modeling.
Determining DNA copy numbers in cancer cells from micro-array data
is a very recent application.

%-------------------------------%
\paradot{Bayesian piecewise constant regression (BPCR)}
%-------------------------------%
We provide a full Bayesian analysis of PC-regression. For a fixed
number of segments we choose some prior over all possible
segment boundary locations. Some prior on the segment levels and
data noise within each segment is assumed. Finally a prior over the
number of segments is chosen. From this we obtain the posterior
segmentation probability distribution (Section \ref{secGM}).
%
In practice we need summaries of this complicated distribution. A
simple maximum (MAP) approximation or mean does not work
here. The right way is to proceed in stages: from determining
the most critical segment number, to the boundary locations, and
finally to the then trivial segment levels. We also extract the
evidence, the boundary probability distribution, and an
interesting non-PC mean regression curve including error estimate
(Section \ref{secQI}).
%
We derive an exact polynomial-time dynamic-programming-type
algorithm for all quantities of interest (Sections \ref{secCR},
\ref{secESQI} and \ref{secAlg}).
%
Our algorithm is applicable for any noise and level prior. We
consider more closely the Gaussian ``standard'' prior and
heavy-tailed robust-to-outliers distributions like the Cauchy, and
briefly discuss the non-parametric case (Sections \ref{secSM} and
\ref{secSSD}).
%
Finally, some hyper-parameters like the global data average and
variability and local within-level noise have to be determined. We
introduce and discuss efficient semi-principled estimators,
thereby avoiding problematic or expensive numerical EM or
Monte-Carlo estimates (Section \ref{secHP}).
%
We test our method on some synthetic examples (Section
\ref{secSE}) and some real-world data sets (Section \ref{secRWE}).
The simulations show that our method handles difficult data with
high noise and outliers well.
%
Our basic algorithm can (easily) be modified in a variety of ways:
for discrete segment levels, segment-dependent variance, piecewise
linear and non-linear regression, non-parametric noise prior, etc.
(Section \ref{secMisc}).

%-------------------------------%
\paradot{Comparison to other work}
%-------------------------------%
Sen and Srivastava \cite{Sen:75} developed a frequentist solution
to the problem of detecting a single (the most prominent) segment
boundary, called change or break point. Olshen et al.\
\cite{Olshen:04} generalize this method to detect pairs of break
points, which improves recognition of short segments. Both methods
are then (heuristically) used to recursively determine further
change points.
%
Another approach is penalized Maximum Likelihood (ML). For a fixed
number of segments, ML chooses the boundary locations that maximize
the data likelihood, which is equivalent to minimizing the mean
square data deviation in case of Gaussian noise. Jong et al.\
\cite{Jong:03} use a population based algorithm as minimizer, while
Picard et al.\ \cite{Picard:05} use dynamic programming, which is
structurally very close to our core recursion, to find the exact
solution in polynomial time. An additional penalty term has to be
added to the likelihood in order to determine the correct number of
segments. The most principled penalty is the Bayesian Information
Criterion \cite{Schwarz:78,Kass:95}. Since it can be biased towards
too simple \cite{Weakliem:99} or too complex \cite{Picard:05}
models, in practice often a heuristic penalty is used. An
interesting heuristic, based on the curvature of the log-likelihood
as a function of the number of segments, has been used in
\cite{Picard:05}.
%
Our Bayesian regressor is a natural response to penalized ML.
%
Another work related to ours is the Bayesian bin density
estimation of Endres and F\"oldi\'ak \cite{Endres:05}, who also
average over all boundary locations, but in the context of density
estimation.
%
Another related series of papers, developing exact Bayesian
algorithms for PC regression is \cite{Yao:84,Barry:92,Fearnhead:06},
which we discuss in more detail in Section \ref{secCompare}. All
three papers consider renewal processes (which have a highly
informed prior for the number of segments $k$) and derive recursions
over the number of data points $n$, while we consider a
noninformative uniform prior over $k$ (which is not a renewal
process) and derive a recursion over $k$.
%
See also \cite{Fearnhead:05} for a similar Bayesian/MAP approach.
%
Many other approximate, heuristic, sampling, or non-Bayesian
approaches to PC regression exist; too many to list them all.

%-------------------------------%
\paradot{Advantages of Bayesian regression}
%-------------------------------%
A full Bayesian approach, when computationally feasible, has various
advantages over others: A generic advantage is that it is more
principled and hence involves fewer heuristic design choices. This
is particularly important for estimating the number of segments.
Another generic advantage is that it can be easily embedded in a
larger framework. For instance, one can decide among competing
models solely based on the Bayesian evidence. Finally, Bayes often
works well in practice, and provably so if the model assumptions are
valid.\footnote{Note that we are not claiming here that BPCR works
better than the other mentioned approaches. In a certain sense Bayes
is optimal if the prior is `true'. Practical superiority likely
depends on the type of application. A comparison for micro-array
data is in progress. The major aim of this paper is to derive an
efficient algorithm, and demonstrate the gains of BPCR beyond bare
PC-regression, e.g.\ the (predictive) regression {\em curve} (which
is better than local smoothing, which wiggles more and blurs jumps).}
%
We can also easily extract other information, like probability
estimates and variances for the various quantities of interest.
%
Particularly interesting is the expected level (and variance) of
each data point. This leads to a regression curve, which is very
flat, i.e.\ smoothes the data, in long and clear segments, wiggles
in less clear segments, follows trends, and jumps at the segment
boundaries. It thus behaves somewhat between local smoothing
and rigid PC-segmentation.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The General Model}\label{secGM}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%-------------------------------%
\paradot{Setup}
%-------------------------------%
We are given a sequence $\v y=(y_1,...,y_n)$, e.g.\ time-series
data or measurements of some function at locations $1...n$, where
each $y_i\in\SetR$ resulted from a noisy ``measurement'', i.e.\
we assume that the $y_i$ are independently (e.g.\ Gaussian)
distributed with means $\mu_i'$ and\footnote{More generally,
$\mu'_i$ and $\s'_i$ are location and scale parameters of a
symmetric distribution.} variances $\s'_i\!\,^2$. The data
likelihood is therefore\footnote{For notational and verbal
simplicity we will not distinguish between probabilities of
discrete variables and densities of continuous variables.}
\beq\label{lh}
  \mbox{likelihood:} \qquad P(\v y|\v\mu',\v\s') \;:=\;
  \prod_{i=1}^n P(y_i|\mu'_i,\s'_i)
\eeq
The estimation of the true underlying function $f=(f_1,...,f_n)$ is
called regression. We assume or model $f$ as piecewise constant.
Consider $k$ segments with segment boundaries
$0=t_0<t_1<...<t_{k-1}<t_k=n$, i.e.\ $f$ is constant on
$\{t_{q-1}+1,...,t_q\}$ for each $0<q\leq k$. If the noise within
each segment is the same, we have
\beq\label{pc}
  \mbox{piecewise constant:}\quad \mu'_i=\mu_q \qmbox{and} \s'_i=\s_q \qmbox{for} t_{q-1}<i\leq t_q
  \quad\forall q
\eeq
We first consider the case in which the variances of all segments
coincide, i.e.\ $\s_q=\s$ $\forall q$. The easier case of different
variance is described in Section \ref{secMisc}.
Our goal is to estimate the segment levels $\v\mu=(\mu_1,...,\mu_k)$, boundaries $\v t=(t_0,...,t_k)$,
and their number $k$. Bayesian regression proceeds by assuming a prior
for these {\em quantities of interest}. We model the segment levels
by a broad (e.g.\ Gaussian) distribution with mean $\nu$ and variance $\rho^2$.
For the segment boundaries we take some (e.g.\ uniform) distribution among all
segmentations into $k$ segments. Finally we take some prior (e.g.\ uniform)
over the segment number $k$. So our prior $P(\v\mu,\v t,k)$ is the product of
\beq\label{prior}
  \mbox{prior:} \qquad P(\mu_q|\nu,\rho)\,\forall q \qmbox{and}
  P(\v t|k) \qmbox{and} P(k)
\eeq
We regard the global variance $\rho^2$ and mean $\nu$ of $\v\mu$ and
the in-segment variance $\s^2$ as fixed hyper-parameters, and
notationally suppress them in the following. We will return to
their determination in Section \ref{secHP}.
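To make the generative model concrete, the following sketch (in
Python with NumPy; purely illustrative and not part of the paper's
algorithm or code) samples data from the prior \req{prior} with
uniform $P(k)$, uniform boundaries, and Gaussian levels and noise:
\begin{verbatim}
import numpy as np

# Sample y from the model: uniform k, uniform boundaries,
# Gaussian levels N(nu,rho^2), Gaussian noise N(0,sigma^2).
def sample_pc_data(n=100, k_max=10, nu=0.0, rho=1.0,
                   sigma=0.3, seed=None):
    rng = np.random.default_rng(seed)
    k = rng.integers(1, k_max + 1)
    inner = np.sort(rng.choice(np.arange(1, n), size=k - 1,
                               replace=False))
    t = np.concatenate(([0], inner, [n]))   # t_0 < ... < t_k
    mu = rng.normal(nu, rho, size=k)        # segment levels
    f = np.repeat(mu, np.diff(t))           # piecewise constant f
    return f + rng.normal(0.0, sigma, n), f, t
\end{verbatim}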

%-------------------------------%
\paradot{Evidence and posterior}
%-------------------------------%
Given the prior and likelihood we can compute the data evidence
and the posterior $P(\v\mu,\v t,k|\v y)$ by Bayes' rule:
\beqn
  \mbox{evidence:} \quad P(\v y) \;=\;
  \sum_{k,\v t} \int P(\v y|\v\mu,\v t,k)P(\v\mu,\v t,k)\, d\v\mu
\eeqn
\beqn
  \mbox{posterior:} \quad P(\v\mu,\v t,k|\v y)
  \;=\; {P(\v y|\v\mu,\v t,k)P(\v\mu,\v t,k)\over P(\v y)}
\eeqn
The posterior contains all information of interest, but is
a complex object for practical use. So we need summaries like the
maximum (MAP) or mean and variances. MAP over continuous
parameters ($\v\mu$) is problematic, since it is not
reparametrization invariant. This is particularly dangerous if MAP
is across different dimensions ($k$), since then even a linear
transformation ($\v\mu\leadsto\alpha\v\mu$) scales the posterior
(density) exponentially in $k$ (by $\alpha^k$). This
severely influences the maximum over $k$, i.e.\ the estimated
number of segments. The mean of $\v\mu$ does not have this
problem. On the other hand, the mean of $\v t$ only makes sense
for fixed (e.g.\ MAP) $k$. The most natural solution is to proceed
in stages, similar to how the prior \req{prior} has been formed.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Quantities of Interest}\label{secQI}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We now define estimators for all quantities of interest in stages
as suggested in Section \ref{secGM}.
%
Our first quantities are the posterior of the number of segments
and the MAP segment number:
\beqn
  \mbox{\# segments:}\quad P(k|\v y) \qmbox{and} \hat k \;=\; \arg\max_k P(k|\v y)
\eeqn
Second, for each boundary $t_q$ its posterior and MAP, given the MAP estimate of $k$:
\beqn
  \mbox{boundaries:}\quad P(t_q|\v y,\hat k) \qmbox{and} \hat t_q \;=\; \arg\max_{t_q} P(t_q|\v y,\hat k)
\eeqn
Different estimates of $t_q$ (e.g.\ the mean or MAP based on the
joint $\v t$ posterior) will be discussed elsewhere.
Finally we want the segment level means for the MAP segmentation:
\beqn
  \mbox{segment level:}\quad P(\mu_q|\v y,\v{\hat t},\hat k) \qmbox{and}
  \hat\mu_q \;=\; \int P(\mu_q|\v y,\v{\hat t},\hat k) \mu_q d\mu_q
\eeqn
The estimate ($\hat{\v\mu},\hat{\v t},\hat k$) uniquely defines the
piecewise constant (PC) function $\hat f$, which is our estimate of
$f$. A quite different quantity is obtained by Bayes-averaging over
all piecewise constant functions and asking for the mean at location
$i$ as an estimate for $f_i$:
\beqn
  \mbox{regression curve:}\quad P(\mu'_i|\v y) \qmbox{and}
  \hat\mu'_i \;=\; \int P(\mu'_i|\v y) \mu'_i d\mu'_i
\eeqn
We will see that $\v\mu'$ behaves similarly to a local smoothing of
$\v y$, but without blurring true jumps. Standard deviations of
all estimates may also be reported.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Specific Models}\label{secSM}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We now complete the specification by discussing some (purely
exemplary) choices for the data noise and prior.

%-------------------------------%
\paradot{Gaussian model}
%-------------------------------%
The standard assumption on the noise is that it is independent Gaussian:
\beq\label{Gn}
  \mbox{Gaussian noise:}\quad
  P(y_i|\mu'_i,\s'_i) \;=\; {1\over\sqrt{2\pi}\s'_i}\;
  \mbox{\large e}^{\textstyle-{(y_i-\mu'_i)^2\over 2\s'_i\!\,^2}}
\eeq
The corresponding standard conjugate prior on the means
$\mu_q$ for each segment $q$ is also Gauss:
\beq\label{Gp}
  \mbox{Gaussian prior:}\quad P(\mu_q|\nu,\rho) \;=\; {1\over\sqrt{2\pi}\rho}\;
  \mbox{\large e}^{\textstyle-{(\mu_q-\nu)^2\over 2\rho^2}}
\eeq

%-------------------------------%
\paradot{Cauchy model}
%-------------------------------%
The standard problem with Gauss is that it does not handle outliers
well. If we do not want to or cannot remove outliers by hand, we
have to properly model them by a prior with heavier tails. This can
be achieved by a mixture of Gaussians or by a Cauchy distribution:
\beq\label{Cn}
  \mbox{Cauchy noise:}\quad
  P(y_i|\mu'_i,\s'_i) \;=\; {1\over\pi}{\s'_i\over\s'_i\!\,^2+(y_i-\mu'_i)^2}
\eeq
Note that $\mu'_i$ and $\s'_i$ determine the location and scale of
Cauchy but are not its mean and variance (which do not exist).
The prior on the levels $\mu_q$ may as well be modeled as Cauchy:
\beq\label{Cp}
  \mbox{Cauchy prior:}\quad P(\mu_q|\nu,\rho) \;=\;
  {1\over\pi}{\rho\over\rho^2+(\mu_q-\nu)^2}
\eeq
Actually, the Gaussian noise model may well be combined with
a non-Gaussian prior and vice versa if appropriate.

%-------------------------------%
\paradot{Segment boundaries}
%-------------------------------%
The most natural assumption is a uniform (non-informative) prior
over all segmentations into $k$ segments. Since there are
$({n-1\atop k-1})$ ways of placing the $k-1$ inner boundaries
(ordered and without repetition) on $(1,...,n-1)$, we have:
\beq\label{ubp}
  \mbox{uniform boundary prior:}\quad
  P(\v t|k)=\textstyle({n-1\atop k-1})^{-1}
\eeq
We discuss more general factorizable $\v t$-priors, including general
renewal processes, later.

%-------------------------------%
\paradot{Number of segments}
%-------------------------------%
Finally, consider the number of segments $k$, which is an integer
between 1 and $n$. Of course, if we have prior knowledge of the
[minimal,maximal] number of segments $[k_{min},k_{max}]$ we
could/should set $P(k)=0$ outside this interval. Otherwise, any non-extreme
choice of $P(k)$ has little influence on the final results, since
it gets swamped by the (implicit) strong (exponential) dependence
of the likelihood on $k$. So we suggest a uniform prior
\beqn
  P(k) \;=\; {1\over k_{max}} \qmbox{for} 1\leq k\leq k_{max}
  \qmbox{and} 0 \qmbox{otherwise}
\eeqn
with $k_{max}=n$ as default (or $k_{max}<n$ discussed later).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Core Recursion}\label{secCR}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We now derive expressions for all quantities of interest, which need
time $O(k_{max}n^2)$ and space $O(n^2)$. The key fact we exploit is
that, fixing a boundary $t_p$, all probabilities $P(\cdot|t_p,k)$
factorize into a $P$ depending only on quantities left of the
boundary times another $P$ depending only on quantities right of the
boundary. This leads to a recursion that terminates with an evidence
matrix and moments $A$ for single segments. The algorithm naturally
breaks into two parts: $(i)$ determination of $A$ from the data $\v y$
for a specific noise distribution and segment level prior, as
described in Section \ref{secSSD}, and $(ii)$ the core recursion for
a specific boundary prior, described in this section. In the next
section we
interest. The dynamic program $(ii)$ depends only on $A$, i.e.\ can
be applied for any noise and level model, as long as $A$ can be
computed.

%-------------------------------%
\paradot{Notation}
%-------------------------------%
Throughout this
and the next two sections we use the following notation: %
$k$ is the total number of segments, %
$t$ some data index, %
$q$ some segment index, %
$0\leq i<h<j\leq n$ are data item indices of segment boundaries
$t_0\leq t_l<t_p<t_m\leq t_k$, %
i.e.\ $t_0=0$, $t_l=i$, $t_p=h$, $t_m=j$, $t_k=n$.
Further, $y_{ij}=(y_{i+1},...,y_j)$ is data %
with segment boundaries $t_{lm}=(t_l,...,t_m)$ %
and segment levels $\mu_{lm}=(\mu_{l+1},...,\mu_m)$.
In particular $y_{0n}=\v y$, $t_{0k}=\v t$, and $\mu_{0k}=\v\mu$.
All introduced matrices below (capital symbols with indices) will be
important in our algorithm.

%-------------------------------%
\paradot{General recursion}
%-------------------------------%
For $m=l+1$, $y_{ij}$ is data from a single segment with mean $\mu_m$
whose joint distribution (given segment boundaries and $m=l+1$) is
\beq\label{ssd}
  \mbox{single segment:}\quad
  P(y_{ij},\mu_m|t_{m-1,m}) \;=\; P(\mu_m)\prod_{t=i+1}^j P(y_t|\mu_m)
\eeq
by the model assumptions \req{lh} and \req{pc}. Here and in the
following, if the last argument of $P$ is an integer, it denotes the
number of segments in the considered data range $i$...$j$. The
probabilities for a general but fixed segmentation are independent,
i.e.\ for $m-l$ segments from $i$ to $j$ with breaks at $t_{lm}$ the
joint data $y_{ij}$ and segment mean $\mu_{lm}$ distribution
factorizes as:
\bqa\label{PPP}
  P(y_{ij},\mu_{lm}|t_{lm})
  &=& \prod_{p=l+1}^m\left[P(\mu_p)\prod_{t=t_{p-1}+1}^{t_p} P(y_t|\mu_p)\right]
\\ \nonumber
  &=& P(y_{ih},\mu_{lp}|t_{lp})P(y_{hj},\mu_{pm}|t_{pm}) \qmbox{(any $p$)}
\eqa
This is our key recursion. We now marginalize over the inner
boundaries $t_{l+1}...t_{m-1}$. If we first keep break $t_p$ fixed
at $h$, then the probability of $y_{ij}\mu_{lm}$ factorizes into
$y_{ih}\mu_{lp}$ and $y_{hj}\mu_{pm}$. We can then average over
location $t_p$. It is more convenient to define the scaled quantity
$Q$ below and combine both steps:
\bqa\label{Qrec}
  Q(y_{ij},\mu_{lm}|m-l)
  &:=& ({\textstyle{j-i-1\atop m-l-1}})P(y_{ij},\mu_{lm}|t_l,t_m)
\\ \label{Qreca}
  &\stackrel{(a)}=& ({\textstyle{j-i-1\atop m-l-1}})
  \sum_{t_{lm}\,:\,i=t_l<...<t_m=j\hspace{-9ex}} P(y_{ij},\mu_{lm}|t_{lm})P(t_{lm}|t_l,t_m)
\\ \nonumber
  &\stackrel{(b)}=& \sum_{t_{lm}\,:\,i=t_l<...<t_m=j\hspace{-9ex}} P(y_{ij},\mu_{lm}|t_{lm})
\\ \nonumber
  &\stackrel{(c)}=& \sum_{t_p=i+p-l}^{j+p-m}\;
  \sum_{t_{lp}\,:\,i=t_l<...<t_p=h\hspace{-11ex}} P(y_{ih},\mu_{lp}|t_{lp})
  \sum_{t_{pm}\,:\,h=t_p<...<t_m=j\hspace{-11ex}} P(y_{hj},\mu_{pm}|t_{pm})
\\ \label{Qrecl}
  &=& \sum_{h=i+p-l}^{j+p-m}
  Q(y_{ih},\mu_{lp}|p-l)Q(y_{hj},\mu_{pm}|m-p)
\eqa
$(a)$ is just an instance of the law of total probability $P(A|B)=\sum_i
P(A|H_i,B)P(H_i|B)$ for a partitioning $(H_i)$ of the sample space,
here $H_i\widehat=t_{l+1}...t_{m-1}$. In $(b)$ we exploited
uniformity \req{ubp}, which implies that
$P(t_{lm}|t_l,t_m)=({j-i-1\atop m-l-1})^{-1}$ is also uniform,
i.e.\ independent of the concrete segmentation $t_{l+1,m-1}$. In
$(c)$ we insert \req{PPP}, fix segment boundary $t_p$, sum over the left and right
segmentations, and finally over $t_p$. The sum starts at $h=i+p-l$,
since $y_{ih}$ must contain at least $p-l$ points in order to allow
for $p-l$ segments, and ends at $h=j+p-m$, since $y_{hj}$ must allow
for $m-p$ segments.

%-------------------------------%
\paradot{Left and right recursions}
%-------------------------------%
If we integrate \req{Qrec} over $\mu_{lm}$, the integral
factorizes and we get a recursion in (a quantity that is
proportional to) the evidence of $y_{ij}$. Let us define more
generally $r^{th}$ ``Q-moments'' of $\mu'_t$. Provided they
exist, for $i<t\leq j$ we have
\bqa\label{Qrrec}
  Q_t^r(y_{ij}|m-l) &:=& \int Q(y_{ij},\mu_{lm}|m-l) \mu'_t\!\,^r d\mu_{lm}
\\ \nonumber
  &=& \nq\sum_{h=i+p-l}^{t-1}\!\! Q^0(y_{ih}|p-l)Q_t^r(y_{hj}|m-p)
  +\!\!\! \sum_{h=t}^{j+p-m}\!\! Q_t^r(y_{ih}|p-l)Q^0(y_{hj}|m-p)
\eqa
Depending on whether $h<t$ or $h\geq t$, the $\mu'_t\!\,^r$ term
combines with the right or left $Q$ in recursion \req{Qrecl} to
$Q_t^r$, while the other $Q$ simply gets integrated to $Q_t^0=Q^0$
independent of $t$. The recursion terminates with
\beq\label{Adef}
  A_{ij}^r \;:=\; Q_t^r(y_{ij}|1) \;=\;
  \int P(\mu_m)\prod_{t=i+1}^j P(y_t|\mu_m) \mu_m^r d\mu_m,
  \quad (0\leq i<j\leq n)
\eeq
Note $A_{ij}^0=P(y_{ij}|t_{m-1,m})$ is the evidence and
$A_{ij}^r/A_{ij}^0=\E[\mu_m^r|y_{ij},t_{m-1,m}]$ the $r^{th}$
moment of $\mu'_t=\mu_m$ in case $y_{ij}$ is modeled by a single
segment.
%
It is convenient to formally start the recursion with
$Q^0(y_{ij}|0)=\delta_{ij}=\{ {1 \text{ if } i=j\atop 0 \text{
else}} \}$ (consistent with the recursion), with the interpretation
that only an empty data set ($i=j$) can have 0 segments.
%
Since $p$ was an arbitrary split number, we can choose it conveniently.
We need a left recursion for $r=0$, $i=0$, $p-l=k$, and $m-p=1$:
\beqn
  L_{k+1,j} \;:=\; Q^0(y_{0j}|k+1)
  \;=\; \sum_{h=k}^{j-1} Q^0(y_{0h}|k)Q^0(y_{hj}|1)
  \;=\; \sum_{h=k}^{j-1} L_{kh}A^0_{hj}
\eeqn
That is, (apart from binomial factors) the evidence of $y_{0j}$
with $k+1$ segments equals the evidence of $y_{0h}$ with $k$
segments times the single-segment evidence of $y_{hj}$, summed
over all locations $h$ of boundary $k$.
%
The recursion starts with $L_{1j}=A_{0j}^0$, or more conveniently
with $L_{0j}=\delta_{j0}$.
%
A related recursion has been derived in \cite{Fearnhead:06}.
%
We also need a right recursion for
$r=0$, $j=n$, $p-l=1$, $m-p=k$:
\beqn
  R_{k+1,i} \;:=\; Q^0(y_{in}|k+1)
  \;=\; \sum_{h=i+1}^{n-k} Q^0(y_{ih}|1)Q^0(y_{hn}|k)
  \;=\; \sum_{h=i+1}^{n-k} A^0_{ih}R_{kh}
\eeqn
The recursion starts with $R_{1i}=A_{in}^0$, or more conveniently
with $R_{0i}=\delta_{in}$.
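
In code, the left and right recursions are two nested loops over the
single-segment evidence matrix. The following Python sketch (names
are ours, not from the reference implementation
\cite{Hutter:05pcrcode}) works in plain probability space; for large
$n$ the log-domain version of Section \ref{secAlg} is required:
\begin{verbatim}
import numpy as np

# A0[i,j] = single-segment evidence A^0_ij  (0 <= i < j <= n)
def left_right(A0, n, k_max):
    L = np.zeros((k_max + 1, n + 1))  # L[k,j] = Q^0(y_{0j}|k)
    R = np.zeros((k_max + 1, n + 1))  # R[k,i] = Q^0(y_{in}|k)
    L[0, 0] = 1.0                     # L_{0j} = delta_{j0}
    R[0, n] = 1.0                     # R_{0i} = delta_{in}
    for k in range(k_max):
        for j in range(n + 1):  # L_{k+1,j} = sum_h L_{kh} A^0_{hj}
            L[k+1, j] = sum(L[k, h] * A0[h, j]
                            for h in range(k, j))
        for i in range(n + 1):  # R_{k+1,i} = sum_h A^0_{ih} R_{kh}
            R[k+1, i] = sum(A0[i, h] * R[k, h]
                            for h in range(i + 1, n - k + 1))
    return L, R
\end{verbatim}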

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Efficient Solution for Quantities of Interest}\label{secESQI}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We now use the core recursion of the last section to derive
expressions for all quantities of interest. We denote them
by capital symbols (with indices).

%-------------------------------%
\paradot{Data evidence}
%-------------------------------%
Note that
\beqn
  L_{kn} \;=\; R_{k0} \;=\; Q^0(\v y|k) \;=\; (\textstyle{n-1\atop k-1})P(\v y|k)
\eeqn
are proportional to the data evidence for fixed $k$. So the data
evidence can be computed as
\beq\label{E}
  E \;:=\; P(\v y) \;=\; \sum_{k=1}^n P(\v y|k)P(k)
  \;=\; {1\over k_{max}}\sum_{k=1}^{k_{max}} {L_{kn}\over({n-1\atop k-1})}
\eeq

%-------------------------------%
\paradot{Number of segments}
%-------------------------------%
The posterior of $k$ and its MAP estimate are
\beq\label{Ck}
  C_k \;:=\; P(k|\v y) \;=\; {P(\v y|k)P(k)\over P(\v y)} \;=\;
  {L_{kn}\over ({n-1\atop k-1})k_{max}E}
  \qmbox{and} \hat k=\mathop{\arg\max}\limits_{k=1..k_{max}} C_k
\eeq

%-------------------------------%
\paradot{Segment boundaries}
%-------------------------------%
We now determine the segment boundaries.
Consider recursion \req{Qrec} for $i=l=0$, $m=k$, $j=n$,
but keep $t_p=h$ fixed, i.e.\
do not sum over it.
Then \req{Qreca} and \req{Qrecl} reduce to the l.h.s.\ and r.h.s.\ of
\beq\label{PQQ}
  ({\textstyle{n-1\atop k-1}})P(\v y,\v\mu,t_p|k) \;=\;
  Q(y_{0h},\mu_{0p}|p)Q(y_{hn},\mu_{pk}|k-p)
\eeq
Integration over $\v\mu$ gives
\beqn
  ({\textstyle{n-1\atop k-1}})P(\v y,t_p|k) \;=\;
  Q^0(y_{0h}|p)Q^0(y_{hn}|k-p)
\eeqn
Hence the posterior probability that boundary $p$ is located at
$t_p=h$, given $\hat k$, is
\beq\label{Bph}
  B_{ph} \;:=\; P(t_p=h|\v y,\hat k) \;=\;
  { ({n-1\atop\hat k-1})P(\v y,t_p|\hat k)\over({n-1\atop\hat k-1})P(\v y|\hat k)}
  \;=\; {L_{ph}R_{\hat k-p,h}\over L_{\hat k n}}
\eeq
So our estimate for segment boundary $p$ is
\beq\label{hattp}
  \hat t_p \;:=\; \arg\max_h P(t_p=h|\v y,\hat k)
  \;=\; \arg\max_h\{ B_{ph} \}
  \;=\; \arg\max_h\{ L_{ph}R_{\hat k-p,h} \}
\eeq
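
Continuing the sketch of Section \ref{secCR}, the evidence \req{E},
the posterior \req{Ck} of $k$, and the MAP boundaries \req{hattp} are
a few lines on top of $L$ and $R$ (again illustrative Python,
assuming the arrays from the previous sketch):
\begin{verbatim}
from math import comb

def quantities(L, R, n, k_max):
    E = sum(L[k, n] / comb(n - 1, k - 1)
            for k in range(1, k_max + 1)) / k_max
    C = [L[k, n] / (comb(n - 1, k - 1) * k_max * E)
         for k in range(1, k_max + 1)]       # C[k-1] = P(k|y)
    k_hat = 1 + max(range(k_max), key=lambda i: C[i])
    # inner boundaries: t_p = argmax_h L_{ph} R_{k_hat-p,h}
    t_hat = [0] + [max(range(1, n),
                       key=lambda h: L[p, h] * R[k_hat - p, h])
                   for p in range(1, k_hat)] + [n]
    return E, C, k_hat, t_hat
\end{verbatim}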

%-------------------------------%
\paradot{Segment levels}
%-------------------------------%
Finally we need the segment levels, given the segment number $\hat k$ and
boundaries $\v{\hat t}$. The $r^{th}$ moment of segment $m$ with boundaries
$i=\hat t_{m-1}$ and $j=\hat t_m$ is
\beq\label{mumr}
  \widehat{\mu_m^r} \;=\; \E[\mu_m^r|\v y,\v{\hat t},\hat k]
  \;=\; \E[\mu_m^r|y_{ij},\hat t_{m-1,m}]
  \;=\; { \int P(y_{ij},\mu_m|\hat t_{m-1,m})\mu_m^r d\mu_m \over
          \int P(y_{ij},\mu_m|\hat t_{m-1,m}) d\mu_m }
  \;=\; {A_{ij}^r\over A_{ij}^0}
\eeq
Note that this expression is independent of the other segment boundaries
and their number, as it should be.

%-------------------------------%
\paradot{Regression curve}
%-------------------------------%
\footnote{Recursion \req{Qrrec} allows in principle to compute the
regression curve $\E[\mu'_t|\v y]$ by defining $(L_t^{r=1})_{kj}$
and $(R_t^{r=1})_{ki}$ analogous to $L_{kj}$ and $R_{ki}$, but
this procedure needs $O(n^3)$ space and $O(k_{max}n^3)$ time, one
$O(n)$ worse than our target performance.}%
We reduce probabilities of $\mu'_t$ to probabilities of $\mu_m$:
We exploit the fact that in every segmentation, $\mu'_t$ lies in
some segment. Let this (unique) segment be $m$ with (unique)
boundaries $i=t_{m-1}<t\leq t_m=j$. Then $\mu'_t=\mu_m$.
Summing now over all such segments we get
\beq\label{reg1}
  P(\mu'_t|\v y,k) \;=\;
  \sum_{m=1}^k\sum_{i=0}^{t-1}\sum_{j=t}^n P(\mu_m,t_{m-1}=i,t_m=j|\v y,k)
\eeq
By fixing $t_p$ in \req{Qreca} we arrived at \req{PQQ}.
Similarly, dividing the data into three parts and fixing $t_l$ and $t_m$
we can derive
\beqn
  ({\textstyle{n-1\atop k-1}})P(\v y,\v\mu,t_l,t_m|k)
  \;=\; Q(y_{0i},\mu_{0l}|l)Q(y_{ij},\mu_{lm}|m-l)Q(y_{jn},\mu_{mk}|k-m)
\eeqn
Setting $l=m-1$, integrating over $\mu_{0l}$ and $\mu_{mk}$, dividing by
$({\textstyle{n-1\atop k-1}})P(\v y|k)$, and inserting into
\req{reg1}, we get
\beqn
  P(\mu'_t|\v y,k) \;=\;  {1\over L_{kn}}\sum_{m=1}^k\sum_{i<t\leq j}
  L_{m-1,i}Q(y_{ij},\mu_m|1)R_{k-m,j}
\eeqn
The posterior moments of $\mu'_t$, given $\hat k$, can hence be computed
by
\beq\label{muptr}
  \widehat{\mu'_t\!\,^r}
  \;=\; \sum_{i<t\leq j} F_{ij}^r
  \qmbox{with} F_{ij}^r:={1\over L_{\hat k n}}\sum_{m=1}^{\hat k} L_{m-1,i}A_{ij}^r R_{\hat k-m,j}
\eeq
While segment boundaries and values make sense only for fixed $k$
(we chose $\hat k$), the regression curve $\hat\mu'_t$ could
actually be averaged over all $k$ instead of fixing $k=\hat k$.
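
As an illustration of \req{muptr}, the regression curve can be
computed from $L$, $R$, and the first-moment matrix $A^1$ of
\req{Adef} in $O(\hat k n^2)$ time. A Python sketch, under the same
assumptions and naming as the previous sketches:
\begin{verbatim}
import numpy as np

def regression_curve(L, R, A1, k_hat):
    n = A1.shape[0] - 1
    # w[i,j] = sum_{m=1}^{k_hat} L_{m-1,i} R_{k_hat-m,j}
    w = sum(np.outer(L[m - 1], R[k_hat - m])
            for m in range(1, k_hat + 1))
    F = w * A1 / L[k_hat, n]       # F^1_ij as defined in the text
    # mu'_t = sum over i < t <= j of F_ij
    return np.array([F[:t, t:].sum() for t in range(1, n + 1)])
\end{verbatim}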

%-------------------------------%
\paradot{Relative log-likelihood}
%-------------------------------%
Another quantity of interest is how likely it is that $\v y$ is
sampled from $\hat f$. The log-likelihood of
$\v y$ is
\beqn
  ll \;:=\; \log P(\v y|\hat f)
      \;=\; \log P(\v y|\v{\hat\mu},\v{\hat t},\hat k)
      \;=\; \sum_{i=1}^n\log P(y_i|\hat\mu'_i,\s)
\eeqn
Like for the evidence, the number itself is hard to interpret. We
need to know how many standard deviations it is away from its
mean (the negative entropy). Since the noise \req{lh} is i.i.d., mean
and variance of $ll$ are just $n$ times the mean and variance of the
log-noise distribution of a single data item. For Gaussian and Cauchy
noise we get
\bqan
  \mbox{Gauss:} & & \E[ll|\hat f] = \textstyle -{n\over 2}\log(2\pi\e\s^2),
         \qquad \Var[ll|\hat f] = {n\over 2}
\\
  \mbox{Cauchy:} & & \E[ll|\hat f] = \textstyle -n\log(4\pi\s), \hspace{2ex}
          \qquad \Var[ll|\hat f] = {n\over 3}\pi^2
\eqan
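
For instance, for Gaussian noise the standardized deviation of $ll$
from its mean can be checked as follows (a sketch; a result of
several standard deviations would indicate a poor fit):
\begin{verbatim}
import numpy as np

# z-score of ll = log P(y|f_hat) under Gaussian noise
def ll_zscore(y, f_hat, sigma):
    n = len(y)
    ll = (-0.5 * np.sum((y - f_hat)**2) / sigma**2
          - 0.5 * n * np.log(2 * np.pi * sigma**2))
    mean = -0.5 * n * np.log(2 * np.pi * np.e * sigma**2)
    return (ll - mean) / np.sqrt(n / 2)
\end{verbatim}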

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Computing the Single Segment Distribution}\label{secSSD}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We now determine expressions (efficient at least in the Gaussian
case) for the moments \req{Adef} of the single-segment distribution
\req{ssd}.

%-------------------------------%
\paradot{Gaussian model}
%-------------------------------%
For Gaussian noise \req{Gn} and prior \req{Gp} we get
\beqn
  A_{ij}^r \;=\;
  \left({1\over\sqrt{2\pi}\s}\right)^d {1\over\sqrt{2\pi}\rho}
  \int_{-\infty}^\infty \mbox{\large e}^{\textstyle-{1\over 2\s^2}\sum_{t=i+1}^j(y_t\!-\!\mu_m)^2
   -{1\over 2\rho^2}(\mu_m\!-\!\nu)^2} \mu_m^r\, d\mu_m
\eeqn
where $d=j-i$. This is an unnormalized Gaussian integral with the
following normalization, mean, and variance \cite[Sec.10.2]{Bolstad:04}:
\bqa\label{GyP}
  P(y_{ij}|t_{m-1,m}) &=&
  A_{ij}^0
  \;=\;
   {\exp\!\big\{ {1\over 2\s^2}\big[{(\sum_t(y_t\!-\!\nu))^2\over d+\s^2/\rho^2} \!-\! \sum_t(y_t\!-\!\nu)^2\big]\big\}
   \over(2\pi\s^2)^{d/2}(1\!+\!d\rho^2/\s^2)^{1/2}}
\\ \label{Gmm}
  \E[\mu_m|y_{ij},t_{m-1,m}]
  &=& {A_{ij}^1\over A_{ij}^0}
  \;=\; {\rho^2(\sum_t y_t)+\s^2\nu \over d\rho^2+\s^2}
  \;\approx\; {1\over d}\sum_{t=i+1}^j y_t
\\ \label{Gmvar}
  \Var[\mu_m|y_{ij},t_{m-1,m}]
  &=& {A_{ij}^2\over A_{ij}^0} - \Big({A_{ij}^1\over A_{ij}^0}\Big)^2
  \;=\; \Big[{d\over\s^2}+{1\over\rho^2}\Big]^{-1}
  \;\approx\; {\s^2\over d}
\eqa
where $\Sigma_t$ runs from $i+1$ to $j$.
The mean/variance is just the weighted average of the
mean/variance of $y_{ij}$ and $\mu_m$.
One may prefer to use the segment prior only for determining
$A_{ij}^0$, but use the unbiased estimators ($\approx$) for the moments.
Higher moments $A_{ij}^r$ can also be computed from the central
moments
\beqn
  \E[(\mu_m-A_{ij}^1/A_{ij}^0)^r|y_{ij},t_{m-1,m}]
  \;=\; {1\!\cdot\!3\!\cdot...\cdot\!(r-1)\over[d\s^{-2}+\rho^{-2}]^{r/2}}
  \;\approx\; 1\!\cdot\!3\!\cdot...\cdot\!(r-1)\!\cdot\!
  \Big({\s^2\over d}\Big)^{r/2}
\eeqn
for even $r$, and 0 for odd $r$.
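
In code, the moments \req{GyP}--\req{Gmvar} can be computed for all
$O(n^2)$ segments with incremental sums $m=\sum_t(y_t\!-\!\nu)$ and
$s=\sum_t(y_t\!-\!\nu)^2$, as in the following Python sketch (the
pseudo-code in Table \ref{PCRegAlg} uses the unbiased variant for
$A^1$ instead of the posterior mean used here):
\begin{verbatim}
import numpy as np

def gauss_moments(y, nu, rho, sigma):
    n = len(y)
    A0 = np.zeros((n+1, n+1)); A1 = A0.copy(); A2 = A0.copy()
    for i in range(n + 1):
        m = s = 0.0
        for j in range(i + 1, n + 1):
            d = j - i
            m += y[j-1] - nu
            s += (y[j-1] - nu)**2
            A0[i, j] = (np.exp((m*m / (d + sigma**2/rho**2) - s)
                               / (2 * sigma**2))
                        / ((2*np.pi*sigma**2)**(d/2)
                           * (1 + d*rho**2/sigma**2)**0.5))
            mean = ((rho**2*(m + d*nu) + sigma**2*nu)
                    / (d*rho**2 + sigma**2))
            var = 1.0 / (d/sigma**2 + 1.0/rho**2)
            A1[i, j] = A0[i, j] * mean
            A2[i, j] = A0[i, j] * (mean**2 + var)
    return A0, A1, A2
\end{verbatim}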

%-------------------------------%
\paradot{Other models}
%-------------------------------%
Analytic expressions for $A_{ij}^r$ are possible for all
distributions in the exponential family. For others like Cauchy
we need to perform integral \req{Adef} numerically.
%
A very simple approximation is to replace the integral by a sum on
a uniform grid: The stepsize/range of the grid should be some
fraction/multiple of the typical scale of the integrand, and the
center of the grid should be around the mean. A crude estimate of
the mean and scale can be obtained from the Gaussian model
\req{Gmm} and \req{Gmvar}.
%
Or even simpler, use the estimated global mean and variance
\req{eGmv}, and in-segment variance \req{sEst} for determining the
range (e.g.\ $[\hat\nu-25\hat\rho,...,\hat\nu+25\hat\rho]$) and
stepsize (e.g.\ $\hat\s/10$) of one grid used for all $A_{ij}^r$.
%
Note that if $y_{ij}$ really stem from one segment, the integrand
is typically unimodal and the above estimates for stepsize and
range are reasonable, hence the approximation will be good. If
$y_{ij}$ ranges over different segments, the discretization may be
crude, but since in this case, $A_{ij}^r$ is (very) small, crude
estimates are sufficient.
%
Note also that even for the heavy-tailed Cauchy distribution, the
first and second moments $A_{ij}^1$ and $A_{ij}^2$ exist, since
the integrand is a product of at least two Cauchy distributions,
one prior and one noise for each $y_t$.
%
Preferably, standard numerical integration routines (which are
faster, more robust and more accurate) should be used.
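
A direct transcription of this grid approximation for the Cauchy
model \req{Cn},\req{Cp} might look as follows (a sketch; in practice
the running product must be rescaled or kept in log-space to avoid
underflow, cf.\ Section \ref{secAlg}):
\begin{verbatim}
import numpy as np

def cauchy_moments(y, nu, rho, sigma):
    n = len(y)
    step = sigma / 10
    grid = np.arange(nu - 25*rho, nu + 25*rho, step)
    prior = rho / (np.pi * (rho**2 + (grid - nu)**2))
    A = np.zeros((3, n+1, n+1))     # A[r] = A^r, r = 0,1,2
    for i in range(n + 1):
        p = prior.copy()            # running integrand on the grid
        for j in range(i + 1, n + 1):
            p *= sigma / (np.pi * (sigma**2 + (y[j-1] - grid)**2))
            for r in range(3):
                A[r, i, j] = step * np.sum(p * grid**r)
    return A
\end{verbatim}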


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Determination of the Hyper-Parameters}\label{secHP}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%-------------------------------%
\paradot{Hyper-Bayes and Hyper-ML}
%-------------------------------%
The developed regression model still contains three
(hyper)parameters, the global variance $\rho^2$ and mean $\nu$ of
$\v\mu$, and the in-segment variance $\s^2$. If they are not known, a
proper Bayesian treatment would be to assume a hyper-prior over
them and integrate them out.
%
Since we do not expect a significant influence of the hyper-prior
(as long as it is chosen reasonably) on the quantities of interest,
one can more easily proceed in an empirical Bayesian way and choose
the parameters such that the evidence $P(\v y|\s,\nu,\rho)$ is
maximized (``hyper-ML'' or ``ML-II''). (We have restored the hitherto
omitted dependency on the hyper-parameters.)

Exhaustive (grid) search for the hyper-ML parameters is expensive.
For data which is indeed noisy piecewise constant, $P(\v
y|\s,\nu,\rho)$ is typically unimodal\footnote{A little care is
necessary with the in-segment variance $\s^2$. If we set it (extremely
close) to zero, all segments will consist of a single data point
$y_i$ with (close to) infinite evidence (see e.g.\ \req{GyP}).
Assuming $k_{max}<n$ eliminates this unwanted maximum. Greedy
hill-climbing with proper initialization will also not be
fooled.} in $(\s,\nu,\rho)$ and the global maximum can be found more
efficiently by greedy hill-climbing, but even this may cost a
factor of 10 to 1000 in efficiency. Below we present a very simple
and excellent heuristic for choosing $(\s,\nu,\rho)$.

%-------------------------------%
\paradot{Estimate of global mean and variance $\nu$ and $\rho$}
%-------------------------------%
A reasonable choice for the level mean and variance $\nu$ and $\rho^2$
is the empirical global mean and variance of the data $\v y$:
\beq\label{eGmv}
  \hat\nu \;\approx\; {1\over n} \sum_{t=1}^n y_t \qmbox{and}
  \hat\rho^2 \;\approx\; {1\over n-1} \sum_{t=1}^n (y_t-\hat\nu)^2
\eeq
This overestimates the variance $\rho^2$ of the segment levels,
since the expression also includes the in-segment variance $\s^2$,
which one may want to subtract from this expression.

%-------------------------------%
\paradot{Estimate of in-segment variance $\s^2$}
%-------------------------------%
At first there seems little hope of estimating the in-segment
variance $\s^2$ from $\v y$ without knowing the segmentation, but
actually we can use a simple trick. If $\v y$ belonged to a
single segment, i.e.\ if the $y_t$ were i.i.d.\ with variance $\s^2$,
then the following expressions for $\s^2$ would hold:
\beqn
  \E\bigg[{1\over n}\sum_{t=1}^n(y_t-\mu_1)^2\bigg]
  \;=\; \s^2
  \;=\; {1\over 2(n-1)}\E\bigg[\sum_{t=1}^{n-1}(y_{t+1}-y_t)^2\bigg]
\eeqn
i.e.\ instead of estimating $\s^2$ by the squared deviation of the
$y_t$ from their mean, we can also estimate $\s^2$ from the
average squared difference of successive $y_t$. This remains true
even for multiple segments if we exclude the segment boundaries
from the sum. On the other hand, if the number of segment boundaries is
small, the error from including the boundaries will be small, i.e.\ the
second expression remains approximately valid.
%
More precisely, we have within a segment and at the boundaries
\bqan
  \E\bigg[\sum_{t=t_{m-1}+1\nq\nq}^{t_m-1} (y_{t+1}-y_t)^2\bigg]
  &=& 2(t_m-t_{m-1}-1)\s^2 \qmbox{and}
\\
  \E[(y_{t_m+1}-y_{t_m})^2] &=& 2\s^2+(\mu_{m+1}-\mu_m)^2
\eqan
Summing over all $k$ segments and $k-1$ boundaries and solving w.r.t.\
$\s^2$ we get
\bqan
  \s^2 &=& {1\over 2(n-1)}\left\{\E\bigg[\sum_{t=1}^{n-1}(y_{t+1}-y_t)^2\bigg]
  - \sum_{m=1}^{k-1}(\mu_{m+1}-\mu_m)^2\right\}
\\
  &=& {1\over 2(n-1)}\E\bigg[\sum_{t=1}^{n-1}(y_{t+1}-y_t)^2\bigg]\cdot
  \bigg[1-O\Big({k\over n}{\rho^2\over\s^2}\Big)\bigg]
\eqan
The last expression holds, since there are $k$ boundaries in $n$
data items, and the ratio between the variance of $\v\mu$ to the
in-segment variance is $\rho^2/\s^2$. Hence we may estimate
$\s^2$ by the upper bound
\beq\label{sEst}
  \hat\s^2 \;\approx\; {1\over 2(n-1)}\sum_{t=1}^{n-1}(y_{t+1}-y_t)^2
\eeq
If there are not too many segments ($k\ll n$) and the regression
problem is hard (high noise $\rho\lesssim\s$), this is a very good
estimate. In case of low noise ($\rho\gg\s$), regression is very
easy, and a crude estimate of $\s^2$ is sufficient. If there are
many segments, $\hat\s^2$ tends to overestimate $\s^2$, resulting
in a (marginal) bias towards estimating fewer segments (which is
then often welcome).

If the estimate is really not sufficient, one may use \req{sEst}
as an initial estimate for determining an initial segmentation
$\hat t$, which then can be used to compute an improved estimate
of $\hat\s^2$, and possibly iterate.
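
The estimates \req{eGmv} and \req{sEst} amount to three lines of code
(a sketch, with NumPy):
\begin{verbatim}
import numpy as np

def gauss_hyper(y):
    nu = np.mean(y)
    rho = np.std(y, ddof=1)      # overestimates rho, see text
    sigma = np.sqrt(np.sum(np.diff(y)**2) / (2 * (len(y) - 1)))
    return nu, rho, sigma
\end{verbatim}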

%-------------------------------%
\paradot{Hyper-ML estimates}
%-------------------------------%
Expressions \req{eGmv} are the standard estimates of mean and
variance of a distribution. They are particularly suitable for
(close to) Gaussian distributions, but also for others, as long as
$\nu$ and $\rho$ parameterize mean and variance. If
mean and variance do not exist or the distribution is quite
heavy-tailed, we need other estimates.
%
The ``ideal'' hyper-ML estimates may be approximated as follows.
If we assume that each data point lies in its own segment,
we get
\beqn
  (\hat\nu,\hat\rho) \;\approx\;
  \mathop{\arg\max}\limits_{(\nu,\rho)}
  \prod_{t=1}^n P(y_t|\hat\s,\nu,\rho) \qmbox{with}
\eeqn
\beq\label{Pysnr}
  P(y_t|\s,\nu,\rho) \;=\; \int P(y_t|\mu,\s)P(\mu|\nu,\rho)d\mu
\eeq
The in-segment variance $\hat\s^2$ can be estimated similarly to the
last paragraph considering data differences and ignoring segment
boundaries:
\beqn
  \hat\s \;\approx\; \arg\max_\s \prod_{t=1}^{n-1} P(y_{t+1}-y_t|\s)
  \qmbox{with}
\eeqn
\beq\label{PDys}
  P(y_{t+1}-y_t=\Delta|\s) \;\approx\; \int_{-\infty}^\infty
  P(y_{t+1}=a+\Delta|\mu,\s) P(y_t=a|\mu,\s)da
\eeq
Note that the last expression is independent of the segment level
(this was the whole reason for considering data differences) and
exact iff $y_t$ and $y_{t+1}$ belong to the same segment.
%
In general (beyond the exponential family)
$(\hat\nu,\hat\rho,\hat\s)$ can only be determined numerically.

%-------------------------------%
\paradot{Using median and quartile}
%-------------------------------%
We present some simpler estimates based on median and quartiles.
Let $[\v y]$ be the data vector $\v y$, but sorted in ascending
order. Then, item $[\v y]_{\alpha n}$ (where the index is assumed
to be rounded up to the next integer) is the $\alpha$-quantile of
the empirical distribution of $\v y$. In particular $[\v y]_{n/2}$ is the
median of $\v y$. It is a consistent and robust-to-outliers
estimator of the mean segment level
\beq\label{nuEstm}
  \hat\nu \;\approx\; [\v y]_{n/2}
\eeq
if noise and segment levels have symmetric distributions. Further,
half of the data points lie in the interval $[a,b]$, where $a:=[\v
y]_{n/4}$ is the first and $b:=[\v y]_{3n/4}$ is the last quartile
of $\v y$. So, using \req{Pysnr}, $\hat\rho$ should (!) be estimated
such that
\beqn
  P(a\leq y_t\leq b|\s,\hat\nu,\hat\rho) \;\stackrel!\approx \fr12
\eeqn
Ignoring data noise (assuming $\s\approx 0$), we get
\beq\label{rhoEstm}
  \hat\rho \;\approx\; {[\v y]_{3n/4}-[\v y]_{n/4}\over 2\alpha}
  \qmbox{with $\alpha=1$ for Cauchy and $\alpha\doteq 0.6744$ for Gauss,}
\eeq
where $\alpha$ is the quartile of the standard
Cauchy/Gauss/other segment prior. For the data noise $\s$ we
again consider the differences $\Delta_t:=y_{t+1}-y_t$. Using
\req{PDys}, $\hat\s$ should be estimated such that
\beqn
  P(a'\leq y_{t+1}-y_t\leq b'|\hat\s) \stackrel!\approx \fr12
\eeqn
where $a'=[\v\Delta]_{n/4}$ and $b'=[\v\Delta]_{3n/4}\approx -a'$.
One can show that
\beq\label{sEstm}
  \hat\s \;\approx\; {[\v\Delta]_{3n/4}-[\v\Delta]_{n/4}\over 2\beta}
  \qmbox{with $\beta=2$ for Cauchy and $\beta\doteq 0.6744\sqrt{2}$ for Gauss,}
\eeq
where $\beta$ is the quartile of the standard Cauchy/Gauss/other
(noise) distribution convolved once with itself.
%
Use of quartiles for estimating $\s$ is very robust to the ``outliers''
caused by the segment boundaries, so it yields better estimates than
\req{sEst} if the noise is low.
%
Again, if the estimates are really not sufficient, one may
iteratively improve them.
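
The quartile estimates \req{nuEstm}, \req{rhoEstm}, and \req{sEstm}
are equally short (a sketch; $\alpha=1$, $\beta=2$ for Cauchy as
derived above):
\begin{verbatim}
import numpy as np

def robust_hyper(y, alpha=1.0, beta=2.0):
    nu = np.median(y)
    q1, q3 = np.quantile(y, [0.25, 0.75])
    rho = (q3 - q1) / (2 * alpha)
    d1, d3 = np.quantile(np.diff(y), [0.25, 0.75])
    sigma = (d3 - d1) / (2 * beta)
    return nu, rho, sigma
\end{verbatim}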

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Algorithm}\label{secAlg}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The computation of $A$, $L$, $R$, $E$, $C$, $B$, $\hat t_p$,
$\widehat{\mu_m^r}$, $F$, and $\widehat{\mu'_t\!\,^r}$ by the
formulas/recursions derived in Sections \ref{secCR}--\ref{secSSD}
is straightforward. In \req{Adef} one should compute the product,
or in \req{GyP}, \req{Gmm}, \req{Gmvar} the sum, incrementally from
$j\leadsto j+1$. Similarly $\widehat{\mu'_t\!\,^r}$ should be
computed incrementally by
\beqn
  \widehat{\mu'_{t+1}\nq\,^r} \;=\;
  \widehat{\mu'_t\!\,^r} - \sum_{i=0}^{t-1}F_{it}^r + \sum_{j=t+1}^n F_{tj}^r
\eeqn
Typically $r=0,1,2$. In this way, all quantities can be computed in
time $O(k_{max}n^2)$ and space $O(n^2)$, as clear from Algorithm
\ref{PCRegAlg}. Space can be reduced to $O(k_{max}n)$ by computing
$A$ and $F$ on-the-fly in the various expressions at the cost of a
slowdown by a constant factor.
%
Table \ref{PCRegAlg} contains the algorithm in pseudo-C code. The
complete code including examples and data is available at
\cite{Hutter:05pcrcode}.
%
Since $A^0$, $L$, $R$, and $E$ can be exponentially large or small
in $n$, i.e.\ huge or tiny, their logarithms actually have to be
computed and stored. In the expressions, the logarithm is pulled in
by $\log(x\cdot y)=\log(x)+\log(y)$ and
$\log(x+y)=\log(x)+\log(1+\exp(\log(y)-\log(x)))$ for $x>y$ and
similarly for $x<y$. Instead of $A_{ij}^r$ we have to compute
$A_{ij}^r/A_{ij}^0$ by pulling the denominator into the integral.
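
For illustration, the left recursion in log-space reads as follows
(a sketch using SciPy's logsumexp; logA0[i,j] holds $\log A^0_{ij}$,
with $-\infty$ for undefined entries):
\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

def left_log(logA0, n, k_max):
    logL = np.full((k_max + 1, n + 1), -np.inf)
    logL[0, 0] = 0.0                    # log L_{00} = log 1
    for k in range(k_max):
        for j in range(k + 1, n + 1):
            # log L_{k+1,j} = logsumexp_h (logL_{kh} + logA^0_{hj})
            logL[k+1, j] = logsumexp(logL[k, k:j] + logA0[k:j, j])
    return logL
\end{verbatim}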

\begin{table} \def\algitsep{\itemsep=0ex}
\vspace{-3.5ex}
{\bf\caption{\label{PCRegAlg}Regression algorithm in pseudo C code\rm}}
\vspace{1ex}
{\bf\boldmath EstGauss($\v y,n$)} and {\bf\boldmath EstGeneral($\v
y,n,\alpha,\beta$)} compute from data $(y_1,...,y_n)$, estimates
for $\nu$, $\rho$, $\s$ (hat `$\hat{\rule{1ex}{0ex}}$' omitted),
and from that the evidence $A_{ij}^0$ of a single segment ranging
from $i+1$ to $j$, and corresponding first and second moments
$A_{ij}^1$ and $A_{ij}^2$. The expressions \req{eGmv}, \req{sEst},
\req{GyP}, \req{Gmm}, \req{Gmvar} are used in EstGauss() for
Gaussian noise and prior, and \req{nuEstm}, \req{rhoEstm},
\req{sEstm} and numerical integration on a uniform Grid in
EstGeneral() for arbitrary noise and prior $P$, e.g.\ Cauchy. $[\v
y]$ denotes the sorted $\v y$ array, Grid is the uniform
integration grid, $+=$ and $*=$ are additive/multiplicative
updates, and $\aai$ denotes arrays.

\vspace*{1ex}
\begin{minipage}[t]{0.53\textwidth}
\begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=0ex\labelwidth=0ex}
  \item {\bf\boldmath EstGauss($\v y_\aai,n$)}
  \begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=2ex\labelwidth=1ex\labelsep=1ex}
    \item[$\lceil$] $\nu={1\over n}\sum_{t=1}^n y_t$;
    \item $\rho^2={1\over n-1}\sum_{t=1}^n (y_t-\nu)^2$;
    \item $\s^2={1\over 2(n-1)}\sum_{t=1}^{n-1}(y_{t+1}-y_t)^2$;
    \item for($i=0..n$)
    \begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=2ex\labelwidth=1ex\labelsep=1ex}
      \item[$\lceil$] $m=0$; $s=0$;
      \item for($j=i+1..n$)
      \begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=2ex\labelwidth=1ex\labelsep=1ex}
        \item[$\lceil$] $d=j-i$; $m+=y_j-\nu$; $s+=(y_j-\nu)^2$;
        \item $A_{ij}^0=\displaystyle{\exp\{ {1\over 2\s^2}[{m^2\over d+\s^2/\rho^2}-s]\}
                        \over (2\pi\s^2)^{d/2}(1+d\rho^2/\s^2)^{1/2}}$;
        \item $A_{ij}^1=A_{ij}^0(\nu+m/d)$; % unbiased
      \end{list}
      \item[$\lfloor$] $\lfloor$ $A_{ij}^2 = A_{ij}^0 ((A_{ij}^1/A_{ij}^0)^2 + \s^2/d)$;
    \end{list}
    \item [$\lfloor$] {\bf\boldmath return ($A_{\aai\aai}^\aai,\nu,\rho,\s$); }
  \end{list}
\end{list}
\end{minipage}
%
\begin{minipage}[t]{0.46\textwidth}
\begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=0ex\labelwidth=0ex}
  \item {\bf\boldmath EstGeneral($\v y_\aai,n,\alpha,\beta$)}
  \begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=2ex\labelwidth=1ex\labelsep=1ex}
    \item[$\lceil$] $\nu=[\v y]_{n/2}$;
    \item $\rho=([\v y]_{3n/4}-[\v y]_{n/4})/2\alpha$;
    \item for($t=1..n-1$) $\Delta_t=y_{t+1}-y_t$;
    \item $\sigma=([\v\Delta]_{3n/4}-[\v\Delta]_{n/4})/2\beta$;
    \item Grid$=({\s\over 10}\SetZ)\cap[\nu-25\rho,\nu+25\rho]$;
    \item for($i=0..n$)
    \begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=2ex\labelwidth=1ex\labelsep=1ex}
      \item[$\lceil$] for($\mu\in$Grid) $R_\mu=P(\mu|\nu,\rho)$;
      \item for($j=i+1..n$)
      \begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=2ex\labelwidth=1ex\labelsep=1ex}
        \item[$\lceil$] for($\mu\in$Grid) $R_\mu*=P(y_j|\mu,\s)$;
      \end{list}
      \item[$\lfloor$] $\lfloor$ $A_{ij}^r={\s\over 10}\sum_{\mu\in\text{Grid}}R_\mu\,\mu^r$; $(r=0,1,2)$
    \end{list}
    \item [$\lfloor$] {\bf\boldmath return ($A_{\aai\aai}^\aai,\nu,\rho,\s$); }
  \end{list}
\end{list}
\end{minipage}

\vspace{-1ex}
\begin{minipage}{0.44\textwidth}
{\bf\boldmath Regression($\v A,n,k_{max}$)} takes $\v A$, $n$, and an upper bound
on the number of segments $k_{max}$, and computes the evidence $E=P(\v y)$ \req{E}, %
the probability $C_k=P(k|\v y)$ of $k$ segments %
and its MAP estimate $\hat k$ \req{Ck}, %
the probability $B_i=P(\exists p:t_p=i|\v y,\hat k)$ that a boundary is at $i$ \req{Bph} %
and the MAP location $\hat t_p$ of the $p^{th}$ boundary \req{hattp}, %
the first and second segment level moments $\mu_p$ and $\mu_p^2$ of all segments $p$ \req{mumr},
and the Bayesian regression curve $\mu'_t$ and its second moment $\mu'_t\!\,^2$ \req{muptr}.
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}{0.54\textwidth}
\begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=0ex\labelwidth=0ex}
  \item {\bf\boldmath Regression($\v A_{\aai\aai}^\aai,n,k_{max}$)}
  \begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=2ex\labelwidth=1ex\labelsep=1ex}
    \item[$\lceil$] for($i=0..n$) \{ $L_{0i}=\delta_{i0}$; $R_{0i}=\delta_{in}$; \}
    \item for($k=0..k_{max}-1$)
    \begin{list}{}{\parskip=0ex\parsep=0ex\algitsep\leftmargin=2ex\labelwidth=1ex\labelsep=1ex}
      \item[$\lceil$] for($i=0..n$) $L_{k+1,i}=\sum_{h=k}^{i-1}L_{kh}A^0_{hi}$;
      \item[$\lfloor$] for($i=0..n$) $R_{k+1,i}=\sum_{h=i+1}^{n-k}A^0_{ih}R_{kh}$;
    \end{list}
    \item $E=k_{max}^{-1}\sum_{k=1}^{k_{max}} L_{kn}/({n-1\atop k-1})$;
    \item for($k=0..k_{max}$) $C_k= L_{kn}/[({n-1\atop k-1})k_{max}E]$;
    \item $\hat k=\mathop{\arg\max}_{k=1..k_{max}}\{C_k\}$;
    \item for($i=0..n$) $B_i=\sum_{p=0}^{\hat k}L_{pi}R_{\hat k-p,i}/L_{\hat k n}$;
    \item for($p=0..\hat k$) $\hat t_p=\arg\max_h\{ L_{ph}R_{\hat k-p,h} \}$;
    \item for($p=1..\hat k$) $\widehat{\mu_p^r}=A_{\hat t_{p-1}\hat t_p}^r/A_{\hat t_{p-1}\hat t_p}^0$; $(r=1,2)$
    \item for($i=0..n)$ for($j=i+1..n$)
    \item $[$ $F_{ij}^r=\sum_{m=1}^{\hat k} L_{m-1,i}A_{ij}^r R_{\hat k-m,j}/L_{\hat k n}$;
    \item $\mu'_0\!^r=0$; $(r=1,2)$
    \item for($t=0..n-1$)
    \item $[$ $\widehat{\mu'_{t+1}\nq\,^r} \;=
           \widehat{\mu'_t\!\,^r}-\sum_{i=0}^{t-1}F_{it}^r + \sum_{j=t+1}^n F_{tj}^r$
    \item [$\lfloor$] {\bf\boldmath return ($E,C_\aai,\hat k,B_\aai,\hat t_\aai,\widehat{\mu_\aai^r},\widehat{\mu_\aai'\!\,^r}$); }
  \end{list}
\end{list}
\end{minipage}

\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Synthetic Examples}\label{secSE}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{figure}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{GLPCR.eps}
\caption{\label{figGLPCR}[GL: low Gaussian noise]
data (blue), PCR (black), BP (red), and variance$^{1/2}$ (green).}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{GMPCR.eps}
\caption{\label{figGMPCR}[GM: medium Gaussian noise]
data (blue), PCR (black), BP (red), and variance$^{1/2}$ (green).}
\end{minipage}
\end{figure}

\begin{figure}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{GMBR.eps}
\caption{\label{figGMBR}[GM: medium Gaussian noise]
data with Bayesian regression $\pm$ 1 std.-deviation.}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{GHData.eps}
\caption{\label{figGHData}[GH: high Gaussian noise]
data.}
\end{minipage}
\end{figure}

\begin{figure}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{GHPCR.eps}
\caption{\label{figGHPCR}[GH: high Gaussian noise]
data (blue), PCR (black), BP (red), and variance$^{1/2}$ (green).}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{GHBR.eps}
\caption{\label{figGHBR}[GH: high Gaussian noise]
data with Bayesian regression $\pm$ 1 std.-deviation.}
\end{minipage}
\end{figure}

%-------------------------------%
\paradot{Description}
%-------------------------------%
In order to test our algorithm we created various synthetic data
sets. We considered piecewise constant functions with noisy
observations.
%
The function considered was defined as $-1$ on its first quarter,
$+1$ on its second quarter, and $0$ on the last half. So the
function consists of two small segments and one large one, with a
large jump at the first boundary and a small jump at the second.
%
For $n$ we chose 100, i.e.\ $f_1...f_{25}=-1$, $f_{26}...f_{50}=+1$,
and $f_{51}...f_{100}=0$. Data $y_t$ was obtained by adding
independent Gaussian/Cauchy noise of same scale $\s$ for all $t$.
We considered low $\s=0.1$, medium $\s=0.32$, and high $\s=1$
noise, resulting in an easy, medium, and hard regression problem
(Figures \ref{figGLPCR}-\ref{figCMFvarwG}).
%
We applied our regression algorithm to these 6 data sets (named
GL,GM,GH,CL,CM,CH), where we modeled noise and prior as Gaussian
or Cauchy with hyper-parameters also estimated by the algorithms
in Table \ref{PCRegAlg}. Table \ref{tabrs} contains these and
other scalar summaries, like the evidence, likelihood, MAP segment
number $\hat k$ and its probability.

%-------------------------------%
\paradot{Three segment Gaussian with low noise}
%-------------------------------%
% Gauss Low
Regression for low Gaussian noise ($\s=0.1$) is very easy. Figure
\ref{figGLPCR} shows the data points $(1,y_1),...,(100,y_{100})$
together with the estimated segment boundaries and levels,
i.e.\ the Piecewise Constant Regression (PCR) curve (black).
%
The red BP curve (with the two spikes) is the posterior probability
that a boundary (break point, BP) is at $t$. It is defined as
$B_t:=\sum_{p=1}^{\hat k} B_{pt}$. Our Bayesian regressor (BPCR) is
virtually sure that the boundaries are at $t_1=25$ ($B_{25}=100\%$)
and $t_2=50$ ($B_{50}=99.9994\%$). The segment levels
$\hat\mu_1=-0.98\approx-1$, $\hat\mu_2=0.97\approx 1$,
$\hat\mu_3=0.01\approx 0$ are determined with high accuracy i.e.\
with low standard deviation (green curve) $\s/\sqrt{25}=2\%$ for the
first two and $\s/\sqrt{50}\doteq 1.4\%$ for the last segment. The
Bayesian regression curve $\hat\mu_t$ (not shown) is
indistinguishable from the PCR.

%-------------------------------%
\paradot{Three segment Gaussian with medium noise}
%-------------------------------%
% Gauss Medium
Little changes for medium Gaussian noise ($\s=0.32$). Figure
\ref{figGMPCR} shows that the number and location of boundaries is
still correctly determined, but the posterior probability of the
second boundary location (red BP curve) starts to get a little
broader ($B_{50}=87\%$). The regression curve in Figure
\ref{figGMBR} is still essentially piecewise constant. At $t=50$
there is a small kink and the error band gets a little wider, as
can better be seen in the (kink of the) green
$\sqrt{\Var[\mu'_t|...]}$ curve in Figure \ref{figGMPCR}.
%
In Figure \ref{figGMFvar} we study the sensitivity of our
regression to the noise estimate $\hat\s$. Keeping everything else
fixed, we varied $\s$ from 0.1 to 1 and plotted the log-evidence
$\log P(\v y|\s)$ and the segment number estimate $\hat k(\s)$ as
a function of $\s$. We see that our estimate $\hat\s\doteq 0.35$
is close to the hyper-ML value $\s_{\text{HML}}=\arg\max_\s P(\v y|\s)
\doteq 0.33$, which itself is close to the true $\s=0.32$. The
number of segments $\hat k$ is correctly recovered for a wide
range of $\s$ around $\hat\s$. If $\s$ is chosen too small (below
the critical value 0.2), BPCR cannot regard typical deviations from the
segment level as noise anymore and has to break segments into
smaller pieces for a better fit ($\hat k$ increases).
%
For higher noise, the critical value gets closer to $\hat\s$, but
also the estimate becomes (even) better. For lower noise, $\hat\s$
overestimates the true $\s$, but BPCR is at the same time even
less sensitive to it.

%-------------------------------%
\paradot{Three segment Gaussian with high noise}
%-------------------------------%
%Gauss High
Figure \ref{figGHData} shows the data with Gaussian noise of the
same order as the jump in levels ($\s=1$). One can imagine some
up-trend in the first quarter, but one can hardly see any
segments. Nevertheless, BPCR still finds the correct boundary
number and location of the first boundary (Figure \ref{figGHPCR}).
The second boundary is one off to the left, since $y_{50}$ was
accidentally close to zero and hence got assigned to the last segment.
The (red) boundary probability (BP) curve is significantly blurred, in
particular at the smaller second jump with quite small
$B_{49}=12\%$ and $B_{50}=10\%$. The levels themselves are within
expected accuracy $\s/\sqrt{25}=20\%$ and $\s/\sqrt{50}\doteq
14\%$, respectively, still yielding a PCR close to the true
function.
%
The Bayesian regression (and error) curve (Figure \ref{figGHBR}),
though, has changed shape completely. It more resembles local data
smoothing, following trends in the data (more on this in the next
section). The variance (green curve in Figure \ref{figGHPCR}) has
a visible bump at $t=25$, but only a broad slight elevation around
$t=50$.

\begin{figure}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{PKY.eps}
\caption{\label{figPKY}Posterior segment number probability
$P(k|\v y)$ for medium Gaussian noise (GM, black), high Cauchy
noise (CH, blue), medium Cauchy noise with Gaussian regression
(CMwG, green), aberrant gene copy \# of chromosome 1 (Gen(3,1),
red), normal gene copy \# of chromosome 9 (Gen(5,9), pink).}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{CMPCR.eps}
\caption{\label{figCMPCR}[CM: medium Cauchy noise]
data (blue), PCR (black), BP (red), and variance$^{1/2}$ (green).}
\end{minipage}
\end{figure}

\begin{figure}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{CHData.eps}
\caption{\label{figCHData}[CH: high Cauchy noise]
data.}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{CHPCR.eps}
\caption{\label{figCHPCR}[CH: high Cauchy noise]
data (blue), PCR (black), BP (red), and variance$^{1/2}$ (green).}
\end{minipage}
\end{figure}

\begin{figure}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{CHBR.eps}
\caption{\label{figCHBR}[CH: high Cauchy noise]
data with Bayesian regression $\pm$ 1 std.-deviation.}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{CMPCRwG.eps}
\caption{\label{figCMPCRwG}[CMwG: medium Cauchy noise]
data (blue), but with Gaussian PCR (black), BP (red),
and variance$^{1/2}$ (green).}
\end{minipage}
\end{figure}

\begin{figure}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{GMFvar.eps}
\caption{\label{figGMFvar}[GM: medium Gaussian noise]
$\log P(\v y)$ (blue) and $\hat k$ (green) as function of $\s$ and
our estimate $\hat\s$ of $(\arg)\max_\s P(\v y)$ and $\hat
k(\hat\s)$ (black triangles).}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{CMFvarWG.eps}
\caption{\label{figCMFvarwG}[CMwG: medium Cauchy noise] with
Gaussian regression, $\log P(\v y)$ (blue) and $\hat k$ (green) as
function of $\s$ and our estimate $\hat\s$ of $(\arg)\max_\s P(\v
y)$ and $\hat k(\hat\s)$ (black triangles).}
\end{minipage}
\end{figure}

%-------------------------------%
\paradot{Three segment Cauchy}
%-------------------------------%
The qualitative results for Cauchy with low noise ($\s=0.1$) are
the same as for Gauss, namely perfect recovery of the underlying
function, and are hence not shown. Worth mentioning is that the
quartile-based estimate $\hat\s$ is excellent (i.e.\ close to the
hyper-ML value) even for this low noise (and of course for higher
noise), i.e.\ it is very robust against the segment boundaries.

Also for medium Cauchy noise ($\s=0.32$, Figure \ref{figCMPCR})
our BPCR does not get fooled (even) by (clusters of) ``outliers''
at $t=16$, $t=48,49$, and $t=86,89,90$. The second boundary is one
off to the right, since $y_{51}$ is slightly too large. Break
probability $B_t$ (red) and variance $\Var[\mu'_t|\v y,\hat k]$
(green) are nicely peaked at $\hat t_1=25$ and $\hat t_2=51$.

For high Cauchy noise ($\s=1$, Figure \ref{figCHData}) it is nearly
impossible to see any segment (levels) at all. Amazingly, BPCR
still recovers three segments (Figure \ref{figCHPCR}), but the
first boundary is significantly displaced ($\hat t_1=14$). $B_t$
and $\Var[\mu'_t|\v y,\hat k]$ contain many peaks indicating that
BPCR was quite unsure where to break. The Bayesian regression in
Figure \ref{figCHBR} identifies an upward trend in the data
$y_{14:35}$, explaining the difficulty/impossibility of recovering
the correct location of the first boundary.

%-------------------------------%
\paradot{Cauchy analyzed with Gauss and vice versa}
%-------------------------------%
In order to test the robustness of BPCR under misspecification,
we analyzed the data with Cauchy noise by Gaussian BPCR (and vice
versa). Gaussian BPCR perfectly recovers the segments for low
Cauchy noise.
%
For medium noise (CMwG, Figure \ref{figCMPCRwG}) the outlier at
$t=49$ is not tolerated and is placed in its own segment, and the last
segment is broken into two parts, but overall the distortion is
less than one might expect (e.g.\ not all outliers end up in their
own segments). This robustness can be attributed to the
way we estimate $\s$. Figure \ref{figCMFvarwG} shows that the
outliers have increased $\hat\s$ far beyond the peak of $P(\v
y|\s)$, which in turn leads to a lower (more reasonable) number of
segments. This is a nice stabilizing property of $\hat\s$.
%
The other way round, segmentation of data with medium Gaussian
noise is essentially insensitive to whether performed with
Gaussian BPCR (Fig.\ \ref{figGMPCR} and \ref{figGMBR}) or Cauchy
BPCR (GMwC, not shown), which confirms (once again) the robustness of
the Cauchy model.
%
But for high noise BPCR fails in both misspecification
directions.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Real-World Example \& More Discussion}\label{secRWE}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%(gene expression)

\begin{figure}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{Gen31PCR.eps}
\caption{\label{figGen31PCR}[Gen31: Aberrant gene copy \# of chromosome 1]
data (blue), PCR (black), BP (red), and variance$^{1/2}$ (green).}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[t]{0.49\textwidth}
\includegraphics[width=\textwidth]{Gen31BR.eps}
\caption{\label{figGen31BR}[Gen31: Aberrant gene copy \# of chromosome 1]
data with Bayesian regression $\pm$ 1 std.-deviation.}
\end{minipage}
\end{figure}

\begin{figure}
\begin{minipage}[c]{0.49\textwidth}
\includegraphics[width=\textwidth]{Gen31Fvar.eps}
\caption{\label{figGen31Fvar}[Gen31: Aberrant gene copy \# of chromosome 1]
$\log P(\v y)$ (blue) and $\hat k$ (green) as function of $\s$ and
our estimate $\hat\s$ of $(\arg)\max_\s P(\v y)$ and $\hat
k(\hat\s)$ (black triangles).}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[c]{0.49\textwidth}
\includegraphics[width=\textwidth]{Gen59BR.eps}
\caption{\label{figGen59BR}[Gen59: normal gene copy \# of chromosome 9]
with Bayesian regression.}
\end{minipage}
\end{figure}

%-------------------------------%
\paradot{Gene copy number data}
%-------------------------------%
% Introduction
We now turn to a real-world data set.
All chromosomes (except for the sex chromosomes in males) in a healthy
human cell come in pairs, but pieces or entire chromosomes can be lost
or multiplied in tumor cells. With modern micro-arrays one can measure
the local copy number along a chromosome. It is important to determine
the breaks where the copy number changes. The measurements are {\em very
noisy} \cite{Pinkel:98}. Hence this is a natural application for
piecewise constant regression of noisy (one-dimensional) data. An
analysis with BPCR of chromosomal aberrations of real tumor samples,
its biological interpretation, and comparison to other methods will be
given elsewhere. Here, we only show the
regression results of one aberrant and one healthy chromosome (without
biological interpretation).

% Regression
Most tumor samples are impure, so the copy number is unfortunately not
an integer. The ``log-ratios'' $\v y$ of the copy numbers of a
normal cell (and also the $\v\Delta$ of any cell) are very close to
Gaussian distributed, so we chose Gaussian BPCR. The log-ratios $\v
y$ of chromosome 1 of a sample known to have multiple myeloma are
shown in Figure \ref{figGen31PCR}, together with the regression
results. Visually, the segmentation is very reasonable. Long
segments (e.g.\ $t=89...408$) as well as very short ones around
$t=87$ and $641$ of length 3 are detected.
%
The Bayesian regression curve in Figure \ref{figGen31BR} also
behaves nicely. It is very flat i.e.\ smoothes the data in long
and clear segments, wiggles in less clear segments, and has jumps
at the segment boundaries. Compare this to local smoothing
techniques \cite{Rinaldi:06}, which wiggle much more within a
segment and severely smooth boundaries. In this sense our Bayesian
regression curve is somewhere in-between local smoothing and hard
segmentation.
%
We also see that the regression curve has a broad dip around
$t=535...565$, although $t=510...599$ has been assigned to a single
segment. This shows that other contributions, which break the
segment, have been mixed into the Bayesian regression curve. The
PCR's preference for a single segment is close to tipping over, as
can be seen from the spikes in the break probability (red curve) in
this segment.

% sensitivity to sigma estimate
The dependence of evidence and segment number on $\s$ is shown in
Figure \ref{figGen31Fvar}. Our estimate $\hat\s$ (black triangle)
perfectly maximizes $P(\v y|\s)$ (blue curve). It lies at a steep
slope of $\hat k(\s)$ (green curve), which means that
the segmentation is sensitive to a good estimate $\hat\s$.
There is no unique (statistically) correct segmentation (number).
Various segmentations within some range are supported by
comparable evidence.

% normal chromosome
Figure \ref{figGen59BR} shows a healthy chromosome 9, correctly lumped
into one big segment.

%-------------------------------%
\paradot{Posterior probability of the number of segments $P(k|\v y)$}
%-------------------------------%
One of the most critical steps for good segmentation is
determining the right segment number, which we did by maximizing
$P(k|\v y)$. The full curves shown in Figure \ref{figPKY} give
additional insight; a representative selection is presented.

For truly piecewise constant functions with $k_0\ll n$ segments
and low to medium noise, $\log P(k|\v y)$ typically rises rapidly
with $k$ up to $k_0$ and thereafter decays approximately linearly
(black GM curve). This shows that BPCR certainly does not
underestimate $k_0$ ($P(k<k_0|\v y)\approx 0$). It also does not
overestimate $k_0$, but only in the sense that $P(k\geq k_0|\v y)\approx 1$;
$P(k_0|\v y)\not\approx 1$, for the following reason:
If a segment is broken into two (or more) and assigned
(approximately) equal levels, the curve and hence the likelihood
does not change. BPCR does not explicitly penalize
this, only implicitly by the Bayesian averaging (Bayes factor
phenomenon \cite{Good:83,Jaynes:03,MacKay:03}). This gives very
roughly an additive term in the log-likelihood of $\fr12\log n$ for
each additional degree of freedom (segment level and boundary).
This observation is the core of the Bayesian Information Criterion
(BIC) \cite{Schwarz:78,Kass:95,Weakliem:99}.
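
To spell out this reasoning step (a generic Laplace-approximation
sketch, not specific to our model): for a model with $d$ continuous
parameters fitted to $n$ data points,
\beqn
  \log P(\v y|k) \;=\; \log\!\int\! P(\v y|\v\theta,k)\,P(\v\theta|k)\,d\v\theta
  \;\approx\; \log P(\v y|\v{\hat\theta},k) - \fr{d}{2}\log n + O(1).
\eeqn
Each additional segment contributes two such degrees of freedom (a
level and a boundary), so increasing $k$ beyond $k_0$ without
improving the fit costs very roughly $\log n$ in log-evidence per
segment, consistent with the approximately linear decay of
$\log P(k|\v y)$ described above.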

With increasing noise, the acute maximum becomes rounder (blue CH
curve), i.e.\ as expected, BPCR becomes less sure about the correct
number of segments. This uncertainty becomes more pronounced under
misspecification (green CMwG curve), and in particular when the true
number of segments is far from clear (or nonexistent), as in the
genome aberration example (red Gen(3,1) curve). The pink Gen(5,9)
curve shows that $\log P(k|\v y)$ is not necessarily unimodal.

\begin{table}[t]
\begin{center}\begin{small}
\caption[Regression summary]{\label{tabrs} {\bf (Regression
summary)} The table summarizes the regression results for the
considered synthetic and real-world data sets described in the main
text: Data consisting of three segments with (low,medium,high)
Gaussian/Cauchy noise (GL,GM,GH)/(CL,CM,CH), cross regressed
(GMwC,CMwG), and the (tumor,healthy) DNA copy number data
(Gen31,Gen59). The second-to-last column contains $\hat k$ with and
without counting multiple breaks at the same point. The last column
contains $P(\hat k)$ and in parentheses $P(\hat k-1)$ and $P(\hat
k+1)$ in percent.
}\vspace{1ex}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|} \hline
  \rotatebox{90}{Gauss, Cauchy,} \rotatebox{90}{Low, Medium,} \rotatebox{90}{High noise, Gene}&
  \rotatebox{90}{true noise} \rotatebox{90}{scale} &
  \rotatebox{90}{data size} &
  \rotatebox{90}{method} &
  \rotatebox{90}{global mean} \rotatebox{90}{estimate} &
  \rotatebox{90}{global deviation} \rotatebox{90}{estimate} &
  \rotatebox{90}{in-segment} \rotatebox{90}{deviation est.} &
  \rotatebox{90}{log-evidence} \rotatebox{90}{$\log P(\v y)$}&
  \rotatebox{90}{rel.\ log-likelihood} \rotatebox{90}{${ll-\E[ll|\hat f]\over\Var[ll|\hat f]^{1/2}}$} &
  \rotatebox{90}{Opt.\#segm.} &
  \rotatebox{90}{Confidence} \rotatebox{90}{$P(\hat k(-1,+1)|\v y)$}
\\ \hline
Name  & $\s$ & $n$ & P &$\hat\nu$&$\hat\rho$& $\hat\s$&$\log E$&${ll-{\bf E}\over\s_{ll}}$&$\hat k$&$ C_{k(-1,+1)}$ \\ \hline\hline
GL    & 0.10 & 100 & G & -0.01 & 0.69 &  0.18 &  39 & 4.9 & $3|3$ & 74\%(0$|$20)  \\ %\hline
GM    & 0.32 & 100 & G & -0.03 & 0.73 &  0.35 & -48 & 1.2 & $3|3$ & 44\%(0$|$29)  \\ %\hline
GH    & 1.00 & 100 & G & -0.10 & 1.15 &  1.03 &-156 & 0.3 & $3|4$ & 13\%(10$|$12) \\ \hline
CL    & 0.10 & 100 & C & -0.02 & 0.58 &  0.09 & -17 & 1.0 & $3|3$ & 69\%(0$|$21)  \\ %\hline
CM    & 0.32 & 100 & C & -0.09 & 0.70 &  0.27 &-127 & 0.8 & $3|3$ & 38\%(0$|$27)  \\ %\hline
CH    & 1.00 & 100 & C & -0.20 & 0.99 &  0.86 &-234 & 0.9 & $3|4$ & 12\%(11$|$11) \\ \hline
GMwC  & 0.32 & 100 & C &  0.00 & 0.49 &  0.17 & -70 & 1.5 & $3|3$ & 27\%(0$|$26)  \\ %\hline
CMwG  & 0.32 & 100 & G &  0.01 & 1.24 &  1.22 &-160 & 2.9 & $5|8$ &  8\%(8$|$8)   \\ \hline
Gen31 &  --  & 769 & G &  0.55 & 0.45 &  0.30 &-283 &-1.5 &$15|34$&  6\%(6$|$6)   \\ %\hline
Gen59 &  --  & 483 & G &  1.05 & 0.47 &  0.44 &-336 &-2.3 & $1|1$ &  8\%(0$|$6)   \\ \hline
\end{tabular}
\end{small}\end{center}
\end{table}

%-------------------------------%
\paradot{Miscellaneous}
%-------------------------------%
Table \ref{tabrs} summarizes the most important quantities of the
considered examples.

While using the variance of
$\v\Delta$ as an estimate for $\hat\s$ tends to overestimate $\s$ for
low noise, the quartile method does not suffer from this
(non)problem.
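
For illustration, a minimal sketch of such a quartile estimator (the
constants follow from the distribution of $\Delta_t$ within a
segment; the exact estimator used by the algorithms in Table
\ref{PCRegAlg} may differ):
\begin{verbatim}
import numpy as np

def sigma_quartile(y, noise='gauss'):
    # within a segment, Delta is the difference of two noise terms,
    # so it has scale sqrt(2)*sigma (Gauss) or 2*sigma (Cauchy);
    # boundaries contaminate only a few of the n-1 differences
    q75, q25 = np.percentile(np.diff(y), [75, 25])
    iqr = q75 - q25
    if noise == 'gauss':
        return iqr / (2 * 0.6745 * np.sqrt(2))  # 0.6745 = Phi^{-1}(3/4)
    return iqr / 4.0  # IQR of a Cauchy with scale s is 2s
\end{verbatim}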

The usefulness of quoting the evidence cannot be overstated.
While the absolute number itself is hard to comprehend,
comparisons (based on this absolute(!) number) are invaluable.
Consider, for instance, the three segment medium Gaussian noise
data $\v y_{\text{GM}}$ from Figure \ref{figGMPCR}. Table \ref{tabrs} shows
that $\log E($GM$)=-48$, while $\log E($GMwC$)=-70$, i.e.\ the
odds that $\v y_{\text{GM}}$ has Cauchy rather than Gaussian noise are
tiny: $\e^{48-70}< 10^{-9}$. Similarly, the odds that $\v
y_{\text{CM}}$ has Gaussian rather than Cauchy noise are
$\e^{127-160}<10^{-14}$. This can be used to decide on the model
to use. For instance, it clearly indicates that the noise in Gen31 and Gen59
is not Cauchy, for which the log-evidences would be $-398$ and $-406$,
respectively.
The smallness of the relative log-likelihoods does not indicate
any gross misspecification.

The indicated 4$^{th}$ segment for GH and CH is spurious, since it
has length zero (two breaks at the same position). In Gen31, only
15 out of the indicated 34 segments are real. The spurious ones
would become real (nonzero-length) segments had we estimated the
breaks $\v{\hat t}$ jointly, rather than the marginals $t_p$
separately. They would often be single-data-point segments at the
current boundaries, since it costs only a single extra break to cut
off an ``outlier'' at a boundary, versus two breaks in the middle of
a segment.

In the last column we indicate the confidence $C_{\hat k}$
$(C_{\hat k-1},C_{\hat k+1})$ of BPCR in the
estimate $\hat k$. For clean data (GL,GM,CL,CM) it is certain that
there are at least 3 segments. We have already explained the general
tendency to also believe in a higher number of segments.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Extensions \& Outlook}\label{secMisc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Generalized in-segment evidence
The core Regression($\v A,n,k_{max}$) algorithm does not care
where the in-segment evidence matrix and moments $\v A$ come from.
This allows for plenty of easy extensions of the basic idea.
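
To illustrate this modularity, here is a minimal Python sketch
(hypothetical code, not the numerically stabilized algorithm of Table
\ref{PCRegAlg}): it computes the evidence $P(\v y|k)$ for each $k$
under our uniform boundary prior, assuming only that \texttt{A[i][j]}
holds the in-segment evidence of a single segment ranging from $i+1$
to $j$:
\begin{verbatim}
from math import comb

def evidence_per_k(A, n, kmax):  # time O(kmax*n^2)
    # L[k][j] = sum over all cuttings 0=t_0<t_1<...<t_k=j of the
    #           product A[t_0][t_1]*...*A[t_{k-1}][t_k]
    L = [[0.0] * (n + 1) for _ in range(kmax + 1)]
    L[0][0] = 1.0
    for k in range(1, kmax + 1):
        for j in range(k, n + 1):
            L[k][j] = sum(L[k-1][i] * A[i][j] for i in range(k-1, j))
    # divide by the number binom(n-1,k-1) of boundary placements
    return [L[k][n] / comb(n-1, k-1) for k in range(1, kmax+1)]
\end{verbatim}
Any way of producing $\v A$, parametric or not, plugs into the same
recursion.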

% Generalized mu and sigma,rho,nu
Instead of the {\em scalar mean} $\mu_q$ of segment $q$, we can
consider any other scalar, vector, or discrete segment property
$\mu_q$. The only relevant property is that, given the segment
boundaries, the $\mu_q$ of different segments are independent.
Similarly, the global parameters may differ from $(\s,\nu,\rho)$.
Some of the generalizations described below, like segment dependent
variance $\s_q$, are of this type.

%known segment levels
If the segment levels are known to belong to a discrete set (e.g.\
integer DNA copy numbers \cite{Picard:05segcl}), this
simply corresponds to a discrete prior on $\mu$ and leads naturally
(rather than by necessity) to a grid sum as in EstGeneral().

%segment dependent variance
If each segment can have its own (unknown) variance $\s_q^2$,
we can assume some prior over $\s_q$ and average \req{Adef} (which
depends on $\s_q$, notationally suppressed) additionally over
$\s_q$. Possibly $P(\s_q|...)$ depends on some hyper-parameter that
now has to be estimated instead of $\s$; all the better if not.

%general t-prior
Our recursion can easily be generalized
to more general factorizable boundary prior
\beq\label{Tdef}
  P(\v t|k) \;=\; T_{t_0 t_1}\cdot...\cdot T_{t_{k-1}t_k}/T_{0n}^{0k}
  \qmbox{with norm} T_{0n}^{0k}:=\sum_{0<t_1<...<t_{k-1}<n\hspace{-8ex}} T_{t_0 t_1}\cdot...\cdot T_{t_{k-1}t_k}
\eeq
for any $T_{ij}\geq 0$, e.g.\ $T_{ij}\equiv 1$ and
$T_{0n}^{0k}=({n-1\atop k-1})$ for our uniform boundary prior
\req{ubp}, or $T_{ij}=P(t_{l+1}=j|t_l=i)$ for renewal processes
discussed in Section \ref{secCompare}. The only change to the
algorithm in Table \ref{PCRegAlg} is to replace $({n-1\atop k-1})$
everywhere by $T_{0n}^{0k}$ (computable in time $O(k_{max}n^2)$)
and call Regression() with argument $(A_{ij}^r
T_{ij})$ instead of $(A_{ij}^r)$.
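
The norm $T_{0n}^{0k}$ itself can be computed by the same kind of
$O(k_{max}n^2)$ dynamic program (a sketch under the same assumptions
as above; for $T_{ij}\equiv 1$ it reproduces $({n-1\atop k-1})$):
\begin{verbatim}
def boundary_norms(T, n, kmax):
    # N[k][j] = sum over 0=t_0<...<t_k=j of T[t_0][t_1]*...*T[t_{k-1}][t_k]
    N = [[0.0] * (n + 1) for _ in range(kmax + 1)]
    N[0][0] = 1.0
    for k in range(1, kmax + 1):
        for j in range(k, n + 1):
            N[k][j] = sum(N[k-1][i] * T[i][j] for i in range(k-1, j))
    return [N[k][n] for k in range(1, kmax + 1)]  # T_{0n}^{0k}, k=1..kmax
\end{verbatim}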

%(Non)constant regressors.
We assumed a constant regression function within a segment.
Actually any other function could be used. We simply choose
likelihood and prior for a single segment and compute its evidence
$A_{ij}^0$. This is all that Regression() needs to determine the
segment number and boundaries. Once we have the segment boundaries
it is easy to compute the in-segment quantities we are
interested in, e.g.\ the MAP or mean regression curve.

%piecewise linear regression
For instance, if we consider all linear functions within a segment,
we get a piecewise linear regression curve. But note that this curve
is not continuous. This model is good, for instance, if the true
function is essentially piecewise constant but there is an additional
underlying trend (slope) within the segments. Non-linear functions
make it possible to handle more complicated trends. See e.g.\ \cite{Fearnhead:05}.

%continuous regression
Piecewise linear (or other) {\em continuous} regression is more
complicated. Assume that $\mu_p$ in \req{Qrec} does not denote
the level of the whole segment $p$, but its level at the right
boundary, which together with $\mu_{p-1}$ determines the linear
function in segment $p$. Only after fixing $\mu_p$, left and right
side decouple. So the recursion analogous to \req{Qrrec} now
involves a quantity $Q$ which in addition to $(i,j)$ also depends on
$(\mu_l,\mu_m)$. This functional recursion may approximately be
solved by discretizing $\{(\mu_l,\mu_m)\in\SetR^2\}$, or by
approximating $Q$ by a 2-dimensional Gaussian in $(\mu_l,\mu_m)$ and
storing only the 2 means and the $2\times 2$ covariance matrix for
each $(i,j)$.
%
%heuristic continuous regression
The following two simpler heuristic approaches may work
sufficiently well in practice: One could ignore the continuity
constraint when determining the boundaries, and take it into
account only in the subsequent (much simpler) regression problem
with known boundaries. Another possibility is to consider, instead
of the continuous piecewise linear function $f$, its piecewise
constant derivative $f'$, i.e.\ to run BPCR on $\Delta_t$ and
finally integrate the result, as sketched below.
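
A minimal sketch of this derivative heuristic, where \texttt{bpcr}
stands for any piecewise constant regressor returning the fitted
levels (the name is hypothetical):
\begin{verbatim}
import numpy as np

def continuous_pl_fit(y, bpcr):
    # fit the differences as a piecewise constant function (= slopes)
    slopes = bpcr(np.diff(y))
    # integrate: cumulating the slopes gives a continuous piecewise
    # linear curve; the free integration constant is set by the data
    f = np.concatenate(([0.0], np.cumsum(slopes)))
    return f + np.mean(y - f)
\end{verbatim}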

%non-parametric prior and noise
It is also not necessary to use a parametric model for the noise. If
different segments can have different noise distributions, we could
compute the in-segment evidence, mean, and variance $A_{ij}^r$ based
on some (fast) non-parametric model. If all segments have the same
distribution, we could non-parametrically estimate a single density
for the differences $\v\Delta$ and then deconvolve the density
(e.g.\ by $\mbox{FT}^{-1}(\sqrt{\mbox{FT(density)}}\,)$, where FT is
the Fourier transform), and henceforth use this as prior for $\s$
in EstGeneral(). As non-parametric density estimator we could
use the fast (linear-time) exact Bayesian tree model
\cite{Hutter:05bayestree}.
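
The deconvolution step can be sketched as follows (assumptions: the
noise is symmetric, and a plain histogram serves as the
non-parametric density estimate; function names are hypothetical):
\begin{verbatim}
import numpy as np

def deconvolve_noise_density(delta, bins=256, span=10.0):
    # Delta is the difference of two i.i.d. noise variables, so
    # FT(Delta-density) = |FT(noise-density)|^2 for symmetric noise
    hist, edges = np.histogram(delta, bins=bins,
                               range=(-span, span), density=True)
    ft = np.fft.rfft(np.fft.ifftshift(hist))   # center zero at index 0
    noise_ft = np.sqrt(np.clip(ft.real, 0.0, None))
    noise = np.fft.fftshift(np.fft.irfft(noise_ft, n=bins))
    noise = np.clip(noise, 0.0, None)
    return noise / (noise.sum() * (edges[1] - edges[0]))  # renormalize
\end{verbatim}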

%Large n
Finally, for (very) large $n$, say $>1000$, the $O(k_{max}n^2)$
algorithm is too slow. Fortunately, there is nearly no interaction
between distant segments; boundary $t_k$ is often practically
independent of where $t_{k\pm 2}$, $t_{k\pm 3}$, etc.\ are placed.
This suggests breaking the whole data set into smaller overlapping
pieces, each long enough to contain at least four segments. The
interior boundaries $t_2^{piece},...,t_{k-2}^{piece}$ of each piece
are then kept and appropriately merged (a sketch of the splitting
follows below). For the Bayesian
regression curve one should use some blending on the overlap. If
single segments are very long, one could coarsen (locally lump
together) the data and later refine around the boundaries.
Alternatively one can, of course, resort to simulation
based (Monte-Carlo) methods \cite{Barry:93,Fearnhead:06}.
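
A sketch of the window splitting only (merging and blending are
omitted, and the window lengths are purely illustrative):
\begin{verbatim}
def overlapping_pieces(n, length=1000, overlap=200):
    # cover 0..n with overlapping windows, each long enough
    # to contain at least four segments
    pieces, start = [], 0
    while True:
        end = min(start + length, n)
        pieces.append((start, end))
        if end == n:
            return pieces
        start = end - overlap
\end{verbatim}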

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Comparison to other work}\label{secCompare}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{table}
\rotatebox{90}%
{\begin{minipage}{0.95\textheight}
\caption[Quantities of Interest for various models]{\label{tabQIkuq}
{\bf (Quantities of Interest for various models)} Rows (1)--(9)
and (A)--(D) contain various quantities of interest (first column)
for our model ($P_u$, fourth column), our model with fixed $k$
($P_k$, third column), and Yao's model with geometrically
distributed segment lengths ($P_q$, fifth column). The second
column contains the definition of, a relation for, or a way to
compute the quantity. Section \ref{secCR} explains the notation.
%
Note that conditional on $k$ all three models are the same, hence
$P_k(\cdot|k)=P_u(\cdot|k)=P_q(\cdot|k)=P_k(\cdot)$ can be read off
from the third column.
}
$$\textmuskip
\begin{array}{|cc|c|c|c|c|} \hline
& \hfill \mbox{Model}
&
& \mbox{Fixed}
&
& \mbox{Markov or}
\\ \cline{3-3}
& \mbox{Function} \hfill
& \mbox{definition or relation}
& \mbox{\#segments $k$}
& \mbox{Uniform $k$}
& \mbox{geometric}
\\ \cline{1-2}\cline{4-6}
& P(\cdots)
& \mbox{or way to compute}
& P_k(\cdots)
& P_u(\cdots)
& P_q(\cdots)
\\ \hline\hline
1)
& P(k)
& \sum_{0<t_1<...<t_{k-1}<n,t_k\geq n} P(t_{1k})
& \mbox{``}\delta_{kk}\mbox{''}
& \frs1n
& {n-1\choose k-1}q^{k-1}(1-q)^{n-k}
\\ \hline
2)
& P(t_{1,k-1}|k)
& P(t_{1,k-1}\wedge t_k\geq n)/P(k)
& {n-1\choose k-1}^{-1}
& {n-1\choose k-1}^{-1}
& {n-1\choose k-1}^{-1}
\\ \hline
3)
& P(t_{lm})
& \sum_{k=m+1}^{n-t_m+1}P(t_{lm}|k)P(k)
& {{t_l-1\choose l-1}{n-t_m-1\choose k-m-1} \over {n-1\choose k-1}}
& {{t_l-1\choose l-1} \over (t_m+1){t_m\choose m}}
& {t_l-1\choose l-1}q^m(1-q)^{t_m-m}
\\ \hline
4)
& P(t_l|t_{1,l-1})
& {P(t_{1l})\over P(t_{1,l-1})}
& {{n-t_l-1\choose k-l-1} \over {n-t_{l-1}-1\choose k-l}}
& {t_{l-1}+1\over t_l+1}  {{t_{l-1}\choose l-1}\over{t_l\choose l}}
& q(1-q)^{t_l-t_{l-1}-1}
\\ \hline
5)
& \E[t_l-t_{l-1}|t_{1,l-1}]
& \sum_{t_l=l}^{n-1}(t_l-t_{l-1})P(t_l|t_{1,l-1})
& {n-t_{l-1}\over k-l+1}
& {t_{l-1}+l\over l-1} \quad (n=\infty)
& {1\over q}-1
\\ \hline
6)
& P(t_l)
& P(t_{ll})
& \uparrow P(t_{ll})
& {l\over (t_l+1)t_l}
& \uparrow P(t_{ll})
\\ \hline
7)
& P(t_{l-1}<i\wedge t_l\geq j)
& \sum_{t_{l-1}=l-1}^{i-1}\sum_{t_l=j+1}^{n-1}P(t_{l-1,l})
& {{i-1\choose l-1}{n-j\choose k-l}\over {n-1\choose k-1}}
& {1\over l}{{i-1\choose l-1} \over {j\choose l}} \quad (n=\infty)
& {i-1\choose l-1}q^{l-1}(1-q)^{j-l}
\\ \hline
8)
& P(\forall l:i\not\leq t_l\not<j)
& \sum_{l=1}^i P(t_{l-1}<i\wedge t_l\geq j)
& {{n-j+i-1\choose k-1} \over {n-1\choose k-1}}
& {1\over j-i+1} \quad (n=\infty)
& (1-q)^{j-i}
\\ \hline
9)
& P(\exists l:t_l=i)
& \sum_{l=1}^i P(t_l=i)
& {k-1\over n-1}
& \frs12
& q
\\ \hline
A)
& p_{0n}
& P(k=1)
& \delta_{k1}
& \frs1n
& (1-q)^{n-1}
\\
B)
& p_{0j}
& P(t_1=j)
& {n-j-1\choose k-2}/{n-1\choose k-1}
& {1\over j(j+1)}
& q(1-q)^{j-1}
\\
C)
& p_{in}
& {\sum_{k=2}^{i+1} P(t_{k-1}=i|k)P(k)  \over \sum_{l=2}^{i+1} P(t_{l-1}=i)}
& {i-1\choose k-2}/{n-2\choose k-2}
& {2\over (n-i)(n-i+1)}
& (1-q)^{n-i-1}
\\
D)
& p_{ij}
& {\sum_{l=2}^{i+1} P(t_{l-1}=i\wedge t_l=j) \over \sum_{l=2}^{i+1} P(t_{l-1}=i)}
& {n+i-j-2\choose k-3}/{n-2\choose k-2}
& {4\over (j-i)(j-i+1)(j-i+2)}
& q(1-q)^{j-i-1}
\\ \hline
\end{array}
$$
\end{minipage}
}\end{table}

% other exact Bayesian work, renewal process
As discussed in the introduction, there are many non-Bayesian
approaches to change point detection in general and PC regression in
particular, e.g.\ \cite{Sen:75,Olshen:04,Jong:03,Picard:05}. There
are also simulation (Monte-Carlo) approaches like \cite{Barry:93}.
In this work we developed an exact Bayesian solution. In the
following we compare our model to a different exact Bayesian model
developed in \cite{Yao:84,Barry:92,Fearnhead:06}. The authors
consider a renewal process: Given that there is a break at $i$, the
prior probability $p_{ij}$ that the next break is at $j>i$ is independent
of where and how many change points occurred before $i$, that is
$P(t_{1l})=p_{0t_1}p_{t_1 t_2}...p_{t_{l-1}t_l}$. They develop
recursions in $n$ for the quantities of interest, while our
recursion is over $k$. Since the number of segments in their model
is variable without a recursion over $k$, their algorithm runs in
time $O(n^2)$.

% Yao's geometric model
Yao \cite{Yao:84} considers identically geometrically distributed
breaks with $p_{ij}=q(1-q)^{j-i-1}$, i.e.\ for all $i$, the
probability of a break at $i$ is $q$, independent of the breaks at
other $j\neq i$. In the following we compare Yao's model ($P_q$) to
our model with uniform $k$-prior ($P_u$) and our model with fixed
segment number ($P_k$). Table \ref{tabQIkuq} summarizes the most
important quantities, derived from the segment prior $P(\v t|k)P(k)$
by exploiting various binomial identities.

% row 2
The first observation is that, given $k$, the breaks in Yao's model
are uniformly distributed as in ours (row 2).
%
% row 1
Hence his model can only differ from ours in the prior of $k$.
Indeed, his prior is strongly concentrated around $k=qn$ for typical
(large) $n$ ($P_q(k)$ in row 1; see the calculation below). It more
resembles a model with fixed $k$. Note that in all three models, probabilities involving
$t_{lm}$ only depend on $t_l$ and $t_m$ and are independent of
$t_{l+1},...,t_{m-1}$.
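
The claimed concentration follows since under $P_q$ each of the $n-1$
interior points is independently a break with probability $q$, so the
number of breaks $k-1$ is Binomial$(n-1,q)$ with
\beqn
  \E_q[k] \;=\; 1+q(n-1) \;\approx\; qn \qmbox{and}
  \sqrt{\Var_q[k]} \;=\; \sqrt{q(1-q)(n-1)} \;=\; O(\sqrt{n}),
\eeqn
i.e.\ the prior mass concentrates in a band of relative width
$O(1/\sqrt{n})$ around $qn$.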
%
% row 4 and 5
Yao's model suppresses long segments exponentially ($P_q($row
$4)=q(1-q)^{\text{length-1}}$) with fixed expected length
$\E_q[$row $5]={1\over q}-1$, while our model has a wide polynomial
tail ($P_u($row $4)\sim l\cdot t_{l-1}^l/t_l^{l+1}$) with expected
length $\E_u[$row $5]\sim {(t_{l-1}+l)/(l-1)}$, adapting to
the past average length with no prior bias ($\E_u[t_1]$ is undefined).
%
% row 8 and D
Similarly, the probability that there is no break in the range from
$i$ to $j$ (row 8) and the probability of a break at $j$ given the
previous one is at $i$ (row D) decay exponentially in Yao's model
but only harmonically and inverse-cubically, respectively, in our model.

% Barry and renewal
Barry and Hartigan \cite{Barry:92} consider renewal processes with
general transition probabilities $p_{ij}$. We can also use our
algorithm for renewal processes by setting $T_{ij}=p_{ij}$ in \req{Tdef}.
Interestingly, Barry and Hartigan's favorite model (stated without
derivation or motivation) essentially coincides with the long-tail
$p_{ij}$ (row A-D) of our model. Nevertheless, their model is
completely different from ours, since our model with uniform prior
is {\em not} a renewal process. In our model
$p^l_{ij}:=P_u(t_l=j|t_{l-1}=i)$ (row 4) depends on $l$, i.e.\
$P_u(t_{1l}) = p^1_{0t_1}p^2_{t_1 t_2}...p^l_{t_{l-1}t_l} \neq
p_{0t_1}p_{t_1 t_2}...p_{t_{l-1}t_l}$. This $l$-dependence also
prevents us from using a recursion over $n$.
%
% row 9 and D
This also explains why the high probability of a break at $i$
($P($row $9)=\frs12$) and of segments of length one ($p_{i-1,i}={4\over
1\cdot 2\cdot 3}$), common to both models, implies a strong bias
towards short segments in Barry and Hartigan's model but not in ours,
and hence why their method overestimates the number of segments.
%the last sentence in their paper,

% general pij
In general, any renewal process $p_{ij}$ has some expected or median
segment length, which leads to a strongly peaked $k$-prior,
in contrast to our uniform prior.
% Bayes-average over q
If we performed a Bayesian average over $q\in[0,1]$ with uniform
prior, Yao's model would reduce to ours. Both algorithms are
$O(n^2)$ with a factor $k_{max}$ in our case and the grid size for
integrating over $q$ in Yao's case. Actually Yao \cite{Yao:84} and
Fearnhead \cite{Fearnhead:06} use a maximum likelihood estimate for
$q$.
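
Indeed, averaging Yao's $k$-prior (row 1) over $q$ with uniform prior
recovers our uniform $k$-prior:
\beqn
  \int_0^1 {n-1\choose k-1} q^{k-1}(1-q)^{n-k}\,dq
  \;=\; {n-1\choose k-1}{(k-1)!\,(n-k)!\over n!} \;=\; {1\over n} \;=\; P_u(k),
\eeqn
and conditional on $k$ all models agree (row 2).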

% what's different/new in our work
So the major difference of our work from
\cite{Yao:84,Barry:92,Fearnhead:06} is that we start with a
non-informative uniform prior over $k$, while they start with a
renewal process necessarily corresponding to a highly informed prior
for $k$. In contrast to their recursion over $n$, our recursion over
$k$ allows for arbitrary (in particular uniform) prior over $k$.
This also naturally allows us to find the MAP $\hat k$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Summary}\label{secDisc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We considered Bayesian regression of piecewise constant functions
with unknown segment number, locations, and levels.
We derived an efficient algorithm that works for any noise and
segment level prior, e.g.\ Cauchy, which can handle outliers. We
derived simple but good estimates for the in-segment variance. We
also proposed a Bayesian regression curve as a better way of
smoothing data without blurring boundaries. The Bayesian approach
also allowed us to straightforwardly determine the global
evidence, break probabilities and error estimates, useful for
model selection and significance and robustness studies. We
demonstrated the performance on synthetic and real-world examples,
and discussed many possible extensions.

%-------------------------------%
\paradot{Acknowledgements}
%-------------------------------%
Thanks to IOSI for providing the gene copy \# data and to Ivo
Kwee for discussions.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%         Bibliography        %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{small}
\begin{thebibliography}{ABCD}\parskip=0ex

\bibitem[BH92]{Barry:92}
D.~Barry and J.~A. Hartigan.
\newblock Product partition models for change point problems.
\newblock {\em Annals of Statistics}, 20:260--279, 1992.

\bibitem[BH93]{Barry:93}
D.~Barry and J.~A. Hartigan.
\newblock A {B}ayesian analysis for change point problems.
\newblock {\em Journal of the American Statistical Association}, 88:309--319,
  1993.

\bibitem[Bol04]{Bolstad:04}
W.~M. Bolstad.
\newblock {\em Introduction to {B}ayesian Statistics}.
\newblock Wiley Interscience, New Jersey, 2004.

\bibitem[EF05]{Endres:05}
D.~Endres and P.~F{\"o}ldi{\'a}k.
\newblock Bayesian bin distribution inference and mutual information.
\newblock {\em IEEE Transactions on Information Theory}, 51(11):3766--3779,
  2005.

\bibitem[Fea05]{Fearnhead:05}
P.~Fearnhead.
\newblock Exact {Bayesian} curve fitting and signal segmentation.
\newblock {\em IEEE Transactions on Signal Processing}, 53:2160--2166, 2005.

\bibitem[Fea06]{Fearnhead:06}
P.~Fearnhead.
\newblock Exact and efficient {B}ayesian inference for multiple changepoint
  problems.
\newblock {\em Statistics and Computing}, 16:203--213, 2006.

\bibitem[Goo83]{Good:83}
I.~J. Good.
\newblock Explicativity, corroboration, and the relative odds of hypotheses.
\newblock In {\em Good thinking: The Foundations of Probability and its
  applications}. University of Minnesota Press, Minneapolis, MN, 1983.

\bibitem[Hut05a]{Hutter:05pcrcode}
M.~Hutter.
\newblock Additional material to article.
\newblock \url{http://www.idsia.ch/~marcus/ai/pcreg.htm}, 2005.

\bibitem[Hut05b]{Hutter:05bayestree}
M.~Hutter.
\newblock Fast non-parametric {B}ayesian inference on infinite trees.
\newblock In {\em Proc. 10th International Conf. on Artificial Intelligence and
  Statistics ({AISTATS-2005})}, pages 144--151. Society for Artificial
  Intelligence and Statistics, 2005.

\bibitem[Hut07]{Hutter:07pcreg}
M.~Hutter.
\newblock {B}ayesian regression of piecewise constant functions.
\newblock In J.M. Bernardo et al., editors, {\em Proc.
  Bayesian Statistics}, volume~8, Benidorm, pages 607--612. Oxford University Press, 2007.

\bibitem[Jay03]{Jaynes:03}
E.~T. Jaynes.
\newblock {\em Probability Theory: The Logic of Science}.
\newblock Cambridge University Press, Cambridge, 2003.

\bibitem[Jon03]{Jong:03}
K.~Jong{ et al.}
\newblock Chromosomal breakpoint detection in human cancer.
\newblock In {\em Applications of Evolutionary Computing: EvoWorkshops'03},
  volume 2611 of {\em LNCS}, pages 54--65. Springer, 2003.

\bibitem[KW95]{Kass:95}
R.~E. Kass and L.~Wasserman.
\newblock A reference {Bayesian} test for nested hypotheses with large samples.
\newblock {\em Journal of the American Statistical Association}, 90:773--795, 1995.

\bibitem[Mac03]{MacKay:03}
D.~J.~C. MacKay.
\newblock {\em Information theory, inference and learning algorithms}.
\newblock Cambridge University Press, Cambridge, 2003.

\bibitem[OVLW04]{Olshen:04}
A.~B. Olshen, E.~S. Venkatraman, R.~Lucito, and M.~Wigler.
\newblock Circular binary segmentation for the analysis of array-based {DNA}
  copy number data.
\newblock {\em Biostatistics}, 5:557--572, 2004.

\bibitem[Pic05]{Picard:05}
F.~Picard{ et al.}
\newblock A statistical approach for array {CGH} data analysis.
\newblock {\em BMC Bioinformatics}, 6(27):1--14, 2005.

\bibitem[Pin98]{Pinkel:98}
D.~Pinkel{ et al.}
\newblock High resolution analysis of {DNA} copy number variation using
  comparative genomic hybridization to microarrays.
\newblock {\em Nature Genetics}, 20:207--211, 1998.

\bibitem[PRLD05]{Picard:05segcl}
F.~Picard, S.~Robin, E.~Lebarbier, and J.~J. Daudin.
\newblock A segmentation-clustering problem for the analysis of array CGH data.
\newblock In {\em Proc. 11th International Symposium on Applied Stochastic
  Models and Data Analysis ({ASMDA'05})}, pages 145--152, Brest, France, 2005.

\bibitem[Rin06]{Rinaldi:06}
A.~Rinaldi{ et al.}
\newblock Genomic and expression profiling identifies the B-cell associated
  tyrosine kinase Syk as a possible therapeutic target in mantle cell lymphoma.
\newblock {\em British Journal of Haematology}, 132(3):303--316, 2006.

\bibitem[Sch78]{Schwarz:78}
G.~Schwarz.
\newblock Estimating the dimension of a model.
\newblock {\em Annals of Statistics}, 6(2):461--464, 1978.

\bibitem[SS75]{Sen:75}
A.~Sen and M.~S. Srivastava.
\newblock On tests for detecting a change in mean.
\newblock {\em Annals of Statistics}, 3:98--108, 1975.

\bibitem[Wea99]{Weakliem:99}
D.~L. Weakliem.
\newblock A critique of the {Bayesian} information criterion for model
  selection.
\newblock {\em Sociological Methods and Research}, 27:359--397, 1999.

\bibitem[Yao84]{Yao:84}
Y.-C. Yao.
\newblock Estimation of a noisy discrete-time step function: {B}ayes and
  empirical {B}ayes approaches.
\newblock {\em Annals of Statistics}, 12:1434--1447, 1984.

\end{thebibliography}
\end{small}

\end{document}

%--------------------End-of-PCRegx.tex------------------------%
