
Optimality of Universal Bayesian Sequence Prediction


Author: Marcus Hutter (2001)
Comments: 15 LaTeX pages
Subj-class: Artificial Intelligence; Learning

ACM-class: I.2; I.2.6; I.2.8; F.1.3; F.2
Paper: PostScript (420kb)  -  PDF (293kb)  -  Html/Gif 
Slides: PostScript - PDF
Remark: Technical Report

Keywords: Bayesian sequence prediction; mixture distributions; Solomonoff induction; Kolmogorov complexity; learning; universal probability; tight loss and error bounds; Pareto-optimality.

Abstract: Various optimality properties of universal sequence prediction based on Bayes-mixtures in general, and Solomonoff's prediction scheme in particular, are studied. The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t-1}$, can be computed with Bayes' rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is unknown but known to belong to a class $\M$, one can base one's prediction on the Bayes-mix $\xi$, defined as a $w_\nu$-weighted sum of the distributions $\nu\in\M$. The number of prediction errors $E_\xi$ made by the optimal universal prediction scheme based on $\xi$ exceeds the number of errors $E_\mu$ of the optimal informed prediction scheme based on $\mu$ by at most $O(\sqrt{E_\mu})$. We show that this bound is tight and that no other predictor can lead to smaller bounds. Furthermore, for various performance measures we show Pareto-optimality of $\xi$ in the sense that there is no other predictor which performs at least as well in all environments $\nu\in\M$ and strictly better in at least one. So optimal predictors can (w.r.t.\ most performance measures in expectation) be based on the mixture $\xi$. Finally, we give an Occam's razor argument that Solomonoff's choice $w_\nu\sim 2^{-K(\nu)}$ for the weights is optimal, where $K(\nu)$ is the length of the shortest program describing $\nu$.
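
To make the mixture construction concrete, below is a minimal sketch of Bayes-mixture prediction over a toy finite class of Bernoulli environments. Everything in it is an illustrative assumption rather than the paper's construction: the class $\M$, the uniform prior weights, the true parameter, and the error-counting experiment are stand-ins, since the universal class with Solomonoff weights $w_\nu\sim 2^{-K(\nu)}$ is not computable.

# A minimal sketch of Bayes-mixture sequence prediction (0/1 loss) over a
# toy finite class M of Bernoulli environments. The class, the uniform
# prior weights, and the true parameter mu are illustrative assumptions,
# not the paper's (incomputable) universal construction.
import random

random.seed(0)

M = [0.1, 0.3, 0.5, 0.7, 0.9]      # candidate environments nu: Bernoulli(p)
w = [1.0 / len(M)] * len(M)        # prior weights w_nu (uniform here;
                                   # Solomonoff would take w_nu ~ 2^{-K(nu)})
mu = 0.7                           # true generating distribution, unknown to xi
T = 10_000

errors_xi = 0                      # errors E_xi of the mixture-based predictor
errors_mu = 0                      # errors E_mu of the informed predictor

for t in range(T):
    # Predictive probability under the mixture:
    # xi(x_t=1 | x_<t) = sum_nu w_nu(x_<t) * nu(1), with posterior weights w.
    p_xi = sum(wn * p for wn, p in zip(w, M))

    # Both predictors output the most probable next bit.
    guess_xi = 1 if p_xi >= 0.5 else 0
    guess_mu = 1 if mu >= 0.5 else 0

    x = 1 if random.random() < mu else 0     # observation drawn from mu
    errors_xi += (guess_xi != x)
    errors_mu += (guess_mu != x)

    # Bayes' rule: w_nu <- w_nu * nu(x_t), renormalized to avoid underflow.
    w = [wn * (p if x == 1 else 1.0 - p) for wn, p in zip(w, M)]
    s = sum(w)
    w = [wn / s for wn in w]

print(f"E_xi = {errors_xi}, E_mu = {errors_mu}, excess = {errors_xi - errors_mu}")

With these toy settings the posterior weight concentrates on $\nu=0.7$ after a few dozen observations, so the observed excess $E_\xi - E_\mu$ stays well within the $O(\sqrt{E_\mu})$ bound of the abstract; the bound itself is what the paper proves tight in the worst case.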



BibTeX Entry

@TechReport{Hutter:02splower,
  author =       "Marcus Hutter",
  number =       "IDSIA-09-01",
  institution =  "Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA)",
  title =        "Optimality of Universal {Bayesian} Sequence Prediction",
  year =         "2002",
  pages =        "15",
  address =      "Manno(Lugano), CH",
  abstract =     "Various optimality properties of universal sequence prediction
                  based on Bayes-mixtures in general, and Solomonoff's prediction
                  scheme in particular, are studied. The probability of observing
                  $x_t$ at time $t$, given past observations $x_1...x_{t-1}$, can
                  be computed with Bayes' rule if the true generating distribution
                  $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is
                  unknown but known to belong to a class $\M$, one can base one's
                  prediction on the Bayes-mix $\xi$, defined as a $w_\nu$-weighted
                  sum of the distributions $\nu\in\M$. The number of prediction
                  errors $E_\xi$ made by the optimal universal prediction scheme
                  based on $\xi$ exceeds the number of errors $E_\mu$ of the
                  optimal informed prediction scheme based on $\mu$ by at most
                  $O(\sqrt{E_\mu})$. We show that this bound is tight and that no
                  other predictor can lead to smaller bounds. Furthermore, for
                  various performance measures we show Pareto-optimality of $\xi$
                  in the sense that there is no other predictor which performs at
                  least as well in all environments $\nu\in\M$ and strictly better
                  in at least one. So optimal predictors can (w.r.t.\ most
                  performance measures in expectation) be based on the mixture
                  $\xi$. Finally, we give an Occam's razor argument that
                  Solomonoff's choice $w_\nu\sim 2^{-K(\nu)}$ for the weights is
                  optimal, where $K(\nu)$ is the length of the shortest program
                  describing $\nu$.",
  url  =         "http://www.hutter1.net/ai/splower.htm",
}
      