
## How to Predict (Sequences) with Bayes, MDL, and Experts

Author: Marcus Hutter (2005)
Comments: 90 slides
Lecture at: Machine Learning Summer School (MLSS 2005), ANU/RSISE/NICTA Canberra
Tutorial at: International Conference on Machine Learning (ICML 2005), Bonn
Slides: PostScript - PDF - Video

Keywords: Machine learning; Philosophical foundations; Epicurus+Occam+Bayes+Turing+Solomonoff; Online/sequential prediction; Bayesian objective/subjective probabilities; Mixture distributions; Convergence, consistency, and loss bounds; Minimum Description Length principle; Algorithmic information theory; Prefix codes and priors; Universal similarity metric; Tree-based clustering: genetic, music, text; Prediction with expert advice; Weighted majority algorithm; Follow the perturbed leader.

Abstract: Most passive Machine Learning tasks can be (re)stated as sequence prediction problems, including pattern recognition, classification, time-series forecasting, and others. Moreover, the understanding of passive intelligence also serves as a basis for active learning and decision making. In the recent past, rich theories for sequence prediction have been developed, and this is still an ongoing process; on the other hand, we are arriving at the stage where some important results are already termed classical. While much of current Learning Theory is formulated under the assumption of independent and identically distributed (i.i.d.) observations, this lecture series focuses on situations without this prerequisite (e.g. weather or stock-market time series). There are several general approaches to sequence prediction:

Bayesian framework: Here one considers a (finite, countable, or uncountable) class of probabilistic models. Each model is a distribution which generates sequences. The model class is endowed with a prior distribution. The true model, i.e. the model which actually generates the sequence we want to predict, is assumed to be within the class. Bayes' rule is then the optimal way of predicting, provided that the true model is randomly chosen according to the prior distribution. But even if this does not hold, one can give good performance guarantees for Bayesian prediction which depend on the prior weight of the true model.

MDL/MML: The Minimum Description/Message Length principle is one of the most important concepts in Machine Learning, and serves as a scientific guide in general. The motivation is as follows: making predictions involves finding regularities in past data; regularities in data allow for compression; hence short descriptions of the data should help in making predictions. More formally, from a model class, a model is chosen which minimizes the joint description length of the model and of the data observed so far given the model. These criteria can be related to a MAP (maximum a posteriori) model choice and thus to the Bayesian framework.
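As a toy illustration of the Bayesian mixture idea, here is a minimal sketch in Python; the model class (three Bernoulli biases) and the uniform prior are assumptions made for the example, not taken from the lecture:

```python
# Hypothetical sketch: Bayesian mixture prediction over a finite class of
# Bernoulli models.  Model class and prior are illustrative assumptions.

def bayes_predict(seq, models, prior):
    """Posterior-weighted probability that the next bit is 1."""
    # Posterior weight of model i is proportional to prior_i * P_i(seq).
    post = []
    for theta, w in zip(models, prior):
        likelihood = 1.0
        for x in seq:
            likelihood *= theta if x == 1 else 1.0 - theta
        post.append(w * likelihood)
    z = sum(post)
    post = [p / z for p in post]
    # Mixture prediction: sum_i posterior_i * P_i(next bit = 1).
    return sum(p * theta for p, theta in zip(post, models))

models = [0.2, 0.5, 0.8]   # each model: a Bernoulli bias P(next bit = 1)
prior = [1/3, 1/3, 1/3]    # uniform prior over the class

# A sequence of mostly ones shifts posterior mass toward theta = 0.8.
print(bayes_predict([1, 1, 1, 0, 1], models, prior))
```

After observing mostly ones, the posterior concentrates on the 0.8-model, so the mixture predicts the next bit is 1 with probability above one half.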

The MDL method has been studied at all levels, from very concrete and highly tuned practical applications to general theoretical assertions. Sequence prediction is just one application of MDL. The MDL idea has also been used to define the so-called information distance, or universal similarity metric, measuring the similarity between two individual objects.
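The two-part code length described above can be made concrete in a few lines; the Bernoulli model class, the parameter grid, and the uniform model code below are assumptions made purely for illustration:

```python
import math

# Hypothetical sketch of two-part MDL model selection: candidate models are
# Bernoulli(theta) with theta on a coarse grid; each grid point costs
# log2(len(grid)) bits to encode.  Grid and code lengths are illustrative.

def two_part_mdl(data, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Return (total description length in bits, theta) minimizing
    L(model) + L(data | model)."""
    model_bits = math.log2(len(grid))  # L(model): uniform code over the grid
    best = None
    for theta in grid:
        # L(data | model): Shannon code length -log2 P(data | theta)
        data_bits = sum(-math.log2(theta if x == 1 else 1.0 - theta)
                        for x in data)
        total = model_bits + data_bits
        if best is None or total < best[0]:
            best = (total, theta)
    return best

bits, theta = two_part_mdl([1, 1, 0, 1, 1, 1, 0, 1])  # 6 ones out of 8
print(theta)  # 0.7: the grid point closest to the empirical frequency wins
```

The winning model is the one whose combined cost of stating the model and then coding the data under it is smallest, which is exactly the MAP choice under the matching prior.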

I will present some very impressive recent clustering applications based on standard Lempel-Ziv or bzip2 compression, including completely automatic reconstructions

- of the evolutionary tree of 24 mammals based on complete mtDNA,
- of the classification tree of 52 languages based on the Declaration of Human Rights.
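The universal similarity metric behind such experiments is approximated in practice by the normalized compression distance (NCD); here is a minimal sketch using bzip2, on made-up toy strings rather than the actual mtDNA or text data:

```python
import bz2

# Sketch of the normalized compression distance (NCD).  The toy strings are
# made up; the real experiments used mtDNA and translated texts.

def C(x: bytes) -> int:
    """Compressed length of x in bytes, using bzip2 as the compressor."""
    return len(bz2.compress(x))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox jumps over the lazy cat " * 20    # near-copy of s1
s3 = b"colorless green ideas sleep furiously tonight " * 20  # unrelated text

# Similar objects compress well together, so their NCD is smaller.
print(ncd(s1, s2), ncd(s1, s3))
```

Feeding all pairwise NCD values into a standard tree-building (hierarchical clustering) algorithm yields the automatic reconstructions mentioned above.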
Prediction with expert advice: This very active field of research studies prediction in a worst-case setting, i.e. the actual sequences are generated by an adversary. Instead of trying to predict absolutely well, the goal is good relative performance with respect to a pool of experts. One can prove that the regret is essentially bounded by the square root of the product of a complexity term and the loss of the best expert. This method and the corresponding performance bounds are related, or dual, to the two previous approaches.
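The exponentially weighted average forecaster (weighted majority with real-valued weights) at the heart of this field can be sketched as follows; the learning rate and the loss sequence are illustrative assumptions:

```python
import math

# Sketch of the exponentially weighted average forecaster.  Learning rate eta
# and the loss sequence below are made up for the example.

def exp_weights_loss(expert_losses, eta=0.5):
    """Cumulative loss of the forecaster that follows the weighted average.

    expert_losses: one list of per-expert losses in [0, 1] per round."""
    n = len(expert_losses[0])
    w = [1.0] * n                     # start with equal weights
    total = 0.0
    for losses in expert_losses:
        z = sum(w)
        # The forecaster suffers the weight-averaged loss this round.
        total += sum(wi * li for wi, li in zip(w, losses)) / z
        # Multiplicative update: downweight each expert by its loss.
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    return total

# Expert 0 is perfect, expert 1 is always wrong: the forecaster's cumulative
# loss stays bounded instead of growing with the number of rounds.
rounds = [[0.0, 1.0]] * 20
print(exp_weights_loss(rounds))
```

Even against an adversarial loss sequence, the forecaster's cumulative loss exceeds that of the best expert in hindsight only by a term of order ln(n)/eta plus eta times the horizon, which optimizing eta turns into the square-root regret bound mentioned above.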

References:

- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
- P. D. Gruenwald. Tutorial on Minimum Description Length. Chapters 1 and 2 of: Advances in Minimum Description Length: Theory and Applications, MIT Press, pp. 3-80, 2005.
- R. Cilibrasi and P. M. B. Vitanyi. Clustering by compression. IEEE Trans. Information Theory, 51(4):1523-1545, 2005.
- M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 300 pages, 2004.


@Slides{Hutter:05predict,
  author = "M. Hutter",
  title  = "How to Predict (Sequences) with {Bayes}, {MDL}, and {Experts}",
  year   = "2005",
  pages  = "1--90",
  note   = "Presented at the Machine Learning Summer School (MLSS) and Conference (ICML-2005)",
  http1  = "http://canberra05.mlss.cc/",
  http2  = "http://icml.ais.fraunhofer.de/tutorials.php",
  url    = "http://www.hutter1.net/ai/predict.htm",
  slides = "http://www.hutter1.net/ai/spredict.pdf",
}
