Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet (Marcus Hutter)

Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet

Author: Marcus Hutter (2002-2003)

Comments: 34 LaTeX pages

Subj-class: Learning; Artificial Intelligence

ACM-class: I.2; I.2.6; I.2.8; E.4; F.1.3; F.2; G.3

Reference: Journal of Machine Learning Research, 4 (2003) 971-1000

Report-no: IDSIA-02-02 and cs.LG/0311014

Paper: LaTeX (125kb) - PostScript (455kb) - PDF (326kb) - Html/Gif

Slides: PostScript - PDF

Presented at: ICML (1 Jul 2001), ECML (6 Sep 2001)

Keywords: Bayesian sequence prediction; mixture distributions; Solomonoff induction; Kolmogorov complexity; learning; universal probability; tight loss and error bounds; Pareto-optimality; games of chance; classification.

Abstract: Various optimality properties of universal sequence predictors based on Bayes-mixtures in general, and Solomonoff's prediction scheme in particular, will be studied. The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t-1}$, can be computed with the chain rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is unknown, but known to belong to a countable or continuous class $\M$, one can base one's prediction on the Bayes-mixture $\xi$ defined as a $w_\nu$-weighted sum or integral of distributions $\nu\in\M$. The cumulative expected loss of the Bayes-optimal universal prediction scheme based on $\xi$ is shown to be close to the loss of the Bayes-optimal, but infeasible, prediction scheme based on $\mu$. We show that the bounds are tight and that no other predictor can lead to significantly smaller bounds. Furthermore, for various performance measures, we show Pareto-optimality of $\xi$ and give an Occam's razor argument that the choice $w_\nu\sim 2^{-K(\nu)}$ for the weights is optimal, where $K(\nu)$ is the length of the shortest program describing $\nu$. The results are applied to games of chance, defined as a sequence of bets, observations, and rewards. The prediction schemes (and bounds) are compared to the popular predictors based on expert advice. Extensions to infinite alphabets, partial, delayed and probabilistic prediction, classification, and more active systems are briefly discussed.
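
Illustration (not taken from the paper): the construction described in the abstract can be sketched for a toy case. The sketch below assumes a finite class of Bernoulli distributions over a binary alphabet, prior weights $w_\nu$ supplied by the caller, and a loss table indexed as loss[outcome][prediction]. The function names `mixture_predictive` and `bayes_optimal_action` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence, Tuple


def mixture_predictive(
    thetas: Sequence[float],
    weights: Sequence[float],
    history: Sequence[int],
) -> Tuple[List[float], float]:
    """Return posterior weights and the mixture probability xi(1 | history).

    Assumption: each nu in the class is Bernoulli(theta) with 0 < theta < 1,
    so every finite history has positive probability under every nu.
    """
    unnormalised = []
    for theta, w in zip(thetas, weights):
        # nu(x_1 ... x_{t-1}) via the chain rule for an i.i.d. Bernoulli source.
        likelihood = 1.0
        for x in history:
            likelihood *= theta if x == 1 else 1.0 - theta
        unnormalised.append(w * likelihood)

    # xi(history) is the w-weighted sum of the nu(history).
    total = sum(unnormalised)
    posterior = [u / total for u in unnormalised]

    # xi(1 | history) is the posterior-weighted sum of nu(1 | history) = theta.
    p_one = sum(p * theta for p, theta in zip(posterior, thetas))
    return posterior, p_one


def bayes_optimal_action(p_one: float, loss: Sequence[Sequence[float]]) -> int:
    """Pick the prediction y minimising the xi-expected loss.

    loss[x][y] is the loss incurred when the outcome is x and y was predicted.
    """
    probs = [1.0 - p_one, p_one]
    expected = {
        y: sum(probs[x] * loss[x][y] for x in (0, 1))
        for y in (0, 1)
    }
    return min(expected, key=expected.get)
```

With a symmetric 0/1 loss the predicted symbol is simply the more probable one under the mixture. An asymmetric loss table can change the prediction even when the mixture probabilities stay the same, which is why the abstract states its results for general loss functions rather than for error counts alone.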

Table of Contents

Introduction

Setup and Convergence

Error Bounds

Application to Games of Chance

Optimality Properties

Miscellaneous

Summary

BibTeX Entry

@Article{Hutter:03optisp,
  author =       "Marcus Hutter",
  title =        "Optimality of Universal {B}ayesian Sequence Prediction for General Loss and Alphabet",
  volume =       "4",
  year =         "2003",
  pages =        "971--1000",
  journal =      "Journal of Machine Learning Research",
  publisher =    "MIT Press",
  http =         "http://www.hutter1.net/ai/optisp.htm",
  url =          "http://arxiv.org/abs/cs.LG/0311014",
  url2 =         "http://www.jmlr.org/papers/volume4/hutter03a/",
  ftp =          "ftp://ftp.idsia.ch/pub/techrep/IDSIA-02-02.ps.gz",
  keywords =     "Bayesian sequence prediction; mixture distributions; Solomonoff
                  induction; Kolmogorov complexity; learning; universal probability;
                  tight loss and error bounds; Pareto-optimality; games of chance;
                  classification.",
  abstract =     "Various optimality properties of universal sequence predictors
                  based on Bayes-mixtures in general, and Solomonoff's prediction
                  scheme in particular, will be studied. The probability of
                  observing $x_t$ at time $t$, given past observations
                  $x_1...x_{t-1}$ can be computed with the chain rule if the true
                  generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is
                  known. If $\mu$ is unknown, but known to belong to a countable or
                  continuous class $\M$, one can base one's prediction on the
                  Bayes-mixture $\xi$ defined as a $w_\nu$-weighted sum or integral
                  of distributions $\nu\in\M$. The cumulative expected loss of the
                  Bayes-optimal universal prediction scheme based on $\xi$ is shown
                  to be close to the loss of the Bayes-optimal, but infeasible
                  prediction scheme based on $\mu$. We show that the bounds are
                  tight and that no other predictor can lead to significantly
                  smaller bounds. Furthermore, for various performance measures, we
                  show Pareto-optimality of $\xi$ and give an Occam's razor argument
                  that the choice $w_\nu\sim 2^{-K(\nu)}$ for the weights is
                  optimal, where $K(\nu)$ is the length of the shortest program
                  describing $\nu$. The results are applied to games of chance,
                  defined as a sequence of bets, observations, and rewards. The
                  prediction schemes (and bounds) are compared to the popular
                  predictors based on expert advice. Extensions to infinite
                  alphabets, partial, delayed and probabilistic prediction,
                  classification, and more active systems are briefly discussed.",
}
