
## How to Predict (Sequences) with Bayes, MDL, and Experts

Author: Marcus Hutter (2005)
Comments: 90 slides
Lecture at: Machine Learning Summer School (MLSS 2005), ANU/RSISE/NICTA Canberra
Tutorial at: International Conference on Machine Learning (ICML 2005), Bonn
Slides: PostScript - PDF - Video

Keywords: Machine learning; Philosophical foundations; Epicurus+Occam+Bayes+Turing+Solomonoff; Online/sequential prediction; Bayesian objective/subjective probabilities; Mixture distributions; Convergence, consistency, and loss bounds; Minimum Description Length principle; Algorithmic information theory; Prefix codes and priors; Universal similarity metric; Tree-based clustering: genetic, music, text; Prediction with expert advice; Weighted majority algorithm; Follow the perturbed leader.

Abstract: Most passive Machine Learning tasks can be (re)stated as sequence prediction problems, including pattern recognition, classification, time-series forecasting, and others. Moreover, the understanding of passive intelligence also serves as a basis for active learning and decision making. In the recent past, rich theories for sequence prediction have been developed, and this is still an ongoing process; on the other hand, we are arriving at the stage where some important results are already termed classical. While much of current Learning Theory is formulated under the assumption of independent and identically distributed (i.i.d.) observations, this lecture series focuses on situations without this prerequisite (e.g. weather or stock-market time series). There are several general approaches to sequence prediction:

Bayesian framework: Here one considers a (finite, countable, or uncountable) class of probabilistic models. Each model is a distribution which generates sequences. The model class is endowed with a prior distribution. The true model, i.e. the model which actually generates the sequence we want to predict, is assumed to be within the class. Bayes' rule is then the optimal way of predicting, provided that the true model is randomly chosen according to the prior distribution. But even if this does not hold, one can give good performance guarantees for Bayesian prediction which depend on the prior weight of the true model.

MDL/MML: The Minimum Description/Message Length principle is one of the most important concepts in Machine Learning, and serves as a scientific guide in general. The motivation is as follows: making predictions involves finding regularities in past data; regularities in data allow for compression; hence short descriptions of the data should help in making predictions. More formally, from a model class, a model is chosen which minimizes the joint description length of the model and of the data observed so far given the model. These criteria can be related to a MAP (maximum a posteriori) model choice and thus to the Bayesian framework.
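As a toy illustration of the Bayesian mixture idea, here is a minimal sketch in Python; the model class (three Bernoulli biases) and the uniform prior are assumptions made for the example, not taken from the lecture:

```python
# Hypothetical sketch: Bayesian mixture prediction over a finite class of
# Bernoulli models.  Model class and prior are illustrative assumptions.

def bayes_predict(seq, models, prior):
    """Posterior-weighted probability that the next bit is 1."""
    # Posterior weight of model i is proportional to prior_i * P_i(seq).
    post = []
    for theta, w in zip(models, prior):
        likelihood = 1.0
        for x in seq:
            likelihood *= theta if x == 1 else 1.0 - theta
        post.append(w * likelihood)
    z = sum(post)
    post = [p / z for p in post]
    # Mixture prediction: sum_i posterior_i * P_i(next bit = 1).
    return sum(p * theta for p, theta in zip(post, models))

models = [0.2, 0.5, 0.8]   # each model: a Bernoulli bias P(next bit = 1)
prior = [1/3, 1/3, 1/3]    # uniform prior over the class

# A sequence of mostly ones shifts posterior mass toward theta = 0.8.
print(bayes_predict([1, 1, 1, 0, 1], models, prior))
```

After observing mostly ones, the posterior concentrates on the 0.8-model, so the mixture predicts the next bit is 1 with probability above one half.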

The MDL method has been studied at all levels, from very concrete and highly tuned practical applications to general theoretical assertions. Sequence prediction is just one application of MDL. The MDL idea has also been used to define the so-called information distance, or universal similarity metric, measuring the similarity between two individual objects.
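The two-part code length described above can be made concrete in a few lines; the Bernoulli model class, the parameter grid, and the uniform model code below are assumptions made purely for illustration:

```python
import math

# Hypothetical sketch of two-part MDL model selection: candidate models are
# Bernoulli(theta) with theta on a coarse grid; each grid point costs
# log2(len(grid)) bits to encode.  Grid and code lengths are illustrative.

def two_part_mdl(data, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Return (total description length in bits, theta) minimizing
    L(model) + L(data | model)."""
    model_bits = math.log2(len(grid))  # L(model): uniform code over the grid
    best = None
    for theta in grid:
        # L(data | model): Shannon code length -log2 P(data | theta)
        data_bits = sum(-math.log2(theta if x == 1 else 1.0 - theta)
                        for x in data)
        total = model_bits + data_bits
        if best is None or total < best[0]:
            best = (total, theta)
    return best

bits, theta = two_part_mdl([1, 1, 0, 1, 1, 1, 0, 1])  # 6 ones out of 8
print(theta)  # 0.7: the grid point closest to the empirical frequency wins
```

The winning model is the one whose combined cost of stating the model and then coding the data under it is smallest, which is exactly the MAP choice under the matching prior.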

I will present some very impressive recent clustering applications based on standard Lempel-Ziv or bzip2 compression, including completely automatic reconstructions

- of the evolutionary tree of 24 mammals based on complete mtDNA,
- of the classification tree of 52 languages based on the Declaration of Human Rights.
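The universal similarity metric behind such experiments is approximated in practice by the normalized compression distance (NCD); here is a minimal sketch using bzip2, on made-up toy strings rather than the actual mtDNA or text data:

```python
import bz2

# Sketch of the normalized compression distance (NCD).  The toy strings are
# made up; the real experiments used mtDNA and translated texts.

def C(x: bytes) -> int:
    """Compressed length of x in bytes, using bzip2 as the compressor."""
    return len(bz2.compress(x))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox jumps over the lazy cat " * 20    # near-copy of s1
s3 = b"colorless green ideas sleep furiously tonight " * 20  # unrelated text

# Similar objects compress well together, so their NCD is smaller.
print(ncd(s1, s2), ncd(s1, s3))
```

Feeding all pairwise NCD values into a standard tree-building (hierarchical clustering) algorithm yields the automatic reconstructions mentioned above.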
Prediction with expert advice: This very active field of research studies prediction in a worst-case setting, i.e. the actual sequences are generated by an adversary. Instead of trying to predict absolutely well, the goal is good relative performance with respect to a pool of experts. One can prove that the regret is essentially bounded by the square root of the product of a complexity term and the loss of the best expert. This method and the corresponding performance bounds are related, or dual, to the two previous approaches.
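The exponentially weighted average forecaster (weighted majority with real-valued weights) at the heart of this field can be sketched as follows; the learning rate and the loss sequence are illustrative assumptions:

```python
import math

# Sketch of the exponentially weighted average forecaster.  Learning rate eta
# and the loss sequence below are made up for the example.

def exp_weights_loss(expert_losses, eta=0.5):
    """Cumulative loss of the forecaster that follows the weighted average.

    expert_losses: one list of per-expert losses in [0, 1] per round."""
    n = len(expert_losses[0])
    w = [1.0] * n                     # start with equal weights
    total = 0.0
    for losses in expert_losses:
        z = sum(w)
        # The forecaster suffers the weight-averaged loss this round.
        total += sum(wi * li for wi, li in zip(w, losses)) / z
        # Multiplicative update: downweight each expert by its loss.
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    return total

# Expert 0 is perfect, expert 1 is always wrong: the forecaster's cumulative
# loss stays bounded instead of growing with the number of rounds.
rounds = [[0.0, 1.0]] * 20
print(exp_weights_loss(rounds))
```

Even against an adversarial loss sequence, the forecaster's cumulative loss exceeds that of the best expert in hindsight only by a term of order ln(n)/eta plus eta times the horizon, which optimizing eta turns into the square-root regret bound mentioned above.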

References:

- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
- P. D. Gruenwald. Tutorial on Minimum Description Length. Chapters 1 and 2 of: Advances in Minimum Description Length: Theory and Applications, MIT Press, pp. 3-80, 2005.
- R. Cilibrasi and P. M. B. Vitanyi. Clustering by compression. IEEE Trans. Information Theory, 51(4):1523-1545, 2005.
- M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 300 pages, 2004.


@Slides{Hutter:05predict,
  author = "M. Hutter",
  title  = "How to Predict (Sequences) with {Bayes}, {MDL}, and {Experts}",
  year   = "2005",
  pages  = "1--90",
  note   = "Presented at the Machine Learning Summer School (MLSS) and Conference (ICML-2005)",
  http1  = "http://canberra05.mlss.cc/",
  http2  = "http://icml.ais.fraunhofer.de/tutorials.php",
  url    = "http://www.hutter1.net/ai/predict.htm",
  slides = "http://www.hutter1.net/ai/spredict.pdf",
}
