Most passive Machine Learning tasks can be (re)stated as sequence
prediction problems. This includes pattern recognition, classification,
time-series forecasting, and others. Moreover, the understanding of
passive prediction also serves as a basis for active learning and
decision making. In the recent past, rich theories for sequence
prediction have been developed, and this is still an ongoing process. On
the other hand, we are arriving at the stage where some important
results are already termed classical. While much of the current Learning
Theory is formulated under the assumption of independent and identically
distributed (i.i.d.) observations, this tutorial focuses on
situations without this prerequisite (e.g. weather or stock-market
time-series). The three most important approaches to sequence
prediction are: Bayesian methods,
Prediction with Expert Advice, and the Minimum Description Length (MDL) principle.
Although they aim at the same (or a similar) goal, comparisons between
them are scarce. A tutorial
that compares these different approaches fairly is needed.
The tutorial shall initiate a discussion among (young) researchers in
each field and may also
trigger research on sequence prediction towards a more unified treatment.
Keywords: Machine learning;
Bayesian objective/subjective probabilities;
Convergence, consistency, and loss bounds;
Minimum Description Length principle;
Algorithmic information theory;
Prefix codes and priors;
Universal similarity metric;
Tree-based clustering: genetic, music, text;
Prediction with expert advice;
Weighted majority algorithm;
Follow the perturbed leader.
The tutorial should be interesting for the majority of the ICML community,
in particular for researchers interested in prediction of non-i.i.d. data.
The philosophical and mathematical differences
and similarities, advantages and disadvantages, relations, and
combinations of the three approaches will be discussed.
Basic prior knowledge of probability theory and statistics is assumed, but no more.
30-60 minutes per item. 4-5 hours in total. More material for a full-day tutorial is available.
The tutorial starts with the philosophical problems concerning
machine learning in general and induction in particular. I
illustrate the problems and their intuitive solution on various
(classical) induction examples. The common principle to their
solution is Occam's simplicity principle.
I briefly describe the formal theory of induction based on Occam's and
Epicurus' principles, Bayesian probability theory, and Turing's
universal machine. Finally I describe the sequential/online setup considered in
this tutorial and place it into the wider machine learning
Bayesian framework: Here one considers a (finite,
countable, or uncountable) class of probabilistic models. Each model is
a distribution which generates sequences. The model class is endowed
with a prior distribution. The true model, i.e. the model which
actually generates the sequence we want to predict, is assumed to be
within the class. Now Bayes' rule is the optimal way of predicting,
provided that the true model is randomly chosen according to the prior
distribution. But even if this does not hold, one can give good
performance guarantees for Bayesian prediction which depend on the
prior of the true model.
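As a minimal sketch of this setup (the Bernoulli model class, the parameter values, and the uniform prior below are illustrative assumptions, not part of the tutorial):

```python
import numpy as np

# Illustrative model class: three Bernoulli models with a uniform prior.
thetas = np.array([0.2, 0.5, 0.8])     # each model: P(next bit = 1) = theta
posterior = np.array([1/3, 1/3, 1/3])  # starts equal to the prior

def bayes_mixture_prediction(posterior, thetas):
    """Mixture probability that the next bit is 1 (Bayes-optimal forecast)."""
    return float(posterior @ thetas)

def bayes_update(posterior, thetas, bit):
    """Reweight each model by its likelihood of the observed bit (Bayes' rule)."""
    likelihood = thetas if bit == 1 else 1.0 - thetas
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# The "true" model (theta = 0.8) generates the sequence; it is in the class.
rng = np.random.default_rng(0)
for _ in range(200):
    bit = int(rng.random() < 0.8)
    posterior = bayes_update(posterior, thetas, bit)

# The posterior concentrates on the true model, so the mixture forecast
# approaches the true probability 0.8.
print(posterior, bayes_mixture_prediction(posterior, thetas))
```

The performance guarantee alluded to above shows up here as the posterior mass of the true model, which starts at its prior weight and grows as data arrives.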
MDL/MML: The Minimum Description/Message
Length principle is one of the most important concepts in Machine
Learning, and serves as a scientific guide in general. The
motivation is as follows: making predictions involves finding
regularities in past data; regularities in data allow for
compression; hence short descriptions of the data should help in
making predictions. More formally, from a model class, a model is
chosen which minimizes the joint description length of the model
and the data observed so far given the model. This criterion can
be related to a MAP (maximum a posteriori) model choice and thus
to the Bayesian framework.
The MDL method has been studied at all levels, from very concrete and
highly tuned practical applications to general theoretical assertions.
Sequence prediction is just one application of MDL.
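A toy sketch of the two-part code just described (the Bernoulli parameter grid and the uniform model cost are illustrative assumptions):

```python
import math

# Illustrative model class: Bernoulli(theta) with theta on a coarse grid.
grid = [i / 10 for i in range(1, 10)]   # theta in {0.1, ..., 0.9}
model_cost = math.log2(len(grid))       # bits to identify a model (uniform code)

def data_cost(theta, heads, tails):
    """Bits to encode the data given the model: -log2 likelihood."""
    return -(heads * math.log2(theta) + tails * math.log2(1.0 - theta))

def mdl_select(heads, tails):
    """Choose the model minimizing model cost + data cost (two-part MDL)."""
    return min(grid, key=lambda th: model_cost + data_cost(th, heads, tails))

print(mdl_select(70, 30))  # -> 0.7, the maximum-likelihood grid point
```

Since the model cost is uniform here, MDL coincides with maximum likelihood; with a non-uniform code for the models, the model cost plays the role of a prior, which is exactly the MAP connection mentioned above.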
Universal Similarity Metric: The MDL idea
has also been used to define the so-called information distance, or
universal similarity metric, which measures the similarity between two
objects. I will present some very impressive recent clustering applications
based on standard Lempel-Ziv or bzip2 compression, including a
completely automatic reconstruction
of the evolutionary tree of 24 mammals based on complete mtDNA,
and of the classification tree of 52 languages based on the
Universal Declaration of Human Rights.
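The metric can be approximated with any off-the-shelf compressor; a minimal sketch using Python's bzip2 bindings (the toy strings below are illustrative, not the datasets from the applications above):

```python
import bz2

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, approximating the universal
    similarity metric with a real compressor (bzip2)."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy data: two similar English strings and one unrelated byte pattern.
a = b"the quick brown fox jumps over the lazy dog " * 50
b = b"the quick brown fox leaps over the lazy cat " * 50
c = bytes((i * 97 + 13) % 256 for i in range(2200))

# Similar objects compress well together, giving a smaller distance.
print(ncd(a, b), ncd(a, c))
```

Feeding such pairwise distances into any standard hierarchical clustering routine yields trees like the mammal and language examples.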
Prediction with expert advice: This very active field of
research studies prediction in a worst-case setting, i.e. the
actual sequences may be generated by an adversary. Instead of trying
to predict well in an absolute sense, the goal is good relative
performance with respect to a pool of experts. One can prove that
the regret is essentially bounded by the square root of the product of a
complexity term and the loss of the best expert. This method and the
corresponding performance bounds are related, or dual, to Bayesian sequence
prediction.
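The weighted majority idea, in its exponentially weighted form, can be sketched as follows (the experts, the absolute loss, and the learning rate eta are illustrative choices):

```python
import math

def exp_weighted_forecaster(expert_preds, outcomes, eta=0.5):
    """Weighted majority / exponentially weighted average forecaster:
    after each round, every expert's weight is scaled by exp(-eta * its loss)."""
    weights = [1.0] * len(expert_preds[0])
    total_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        forecast = sum(w * p for w, p in zip(weights, preds)) / sum(weights)
        total_loss += abs(forecast - y)          # forecaster's absolute loss
        weights = [w * math.exp(-eta * abs(p - y))
                   for w, p in zip(weights, preds)]
    return total_loss, weights

# Two illustrative experts: one always correct, one always wrong.
preds = [(1, 0)] * 30
outcomes = [1] * 30
loss, weights = exp_weighted_forecaster(preds, outcomes)
# The forecaster's total loss stays close to the best expert's loss (here 0),
# even though half the pool is adversarially bad.
print(loss, weights)
```

Note the structural similarity to Bayesian prediction: the multiplicative weight update plays the role of the likelihood update, which is the duality mentioned above.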
Marcus Hutter (*1967) received a master's degree in computer science in 1992 from the University of Technology in Munich, Germany. After his PhD in theoretical
particle physics in 1995, he developed algorithms at a medical software company for 5 years. Since 2000 he has been working as a researcher at the AI
institute IDSIA in Lugano, Switzerland,
where he has published over 35 research papers
in/at ECML, ICML, COLT, ALT, JMLR, JCSS, IEEE-TIT, ..., with half of them on topics of the tutorial. His current interests are centered around reinforcement learning, algorithmic information theory and statistics, universal induction schemes, adaptive control theory, and related areas.
In his recently published book
"Universal Artificial Intelligence"
he unifies sequential decision theory with algorithmic
information theory. Since 2003, he has also been an honorary official lecturer at the
Munich University of Technology.
Marcus Hutter, IDSIA