| [previous] [home] [search] | Research Projects of Marcus Hutter | [contact] [up] [next] |

Personal motivation.
Since my early youth I have been following two great goals in
my life. One is to work on the physical ``Theory of Everything'',
which motivated me to do my BSc and PhD in Theoretical
Particle Physics.
The other is Artificial Intelligence, which motivated me
to do my BSc & MSc & Habilitation in
Theoretical Computer Science. This personal drive ``explains''
most of my CV.
I have also, always been interested in significant open math
problems, most prominently the P=NP question.
My current research is centered around
artificial intelligence,
machine (reinforcement) learning,
information theory and statistics,
algorithmic information theory,
Kolmogorov complexity,
minimal description length (MDL) principle,
Bayesian/robust/expert/MDL/online/sequence prediction,
computational complexity theory,
universal Solomonoff induction,
universal Levin search,
sequential decision theory, and
adaptive control theory (see e.g.
[P04uaibook] how they fit together).
Current and past projects (...-2006)
Below you can find a description of my current and past (and future) projects
with links to corresponding publications. Alternatively you may want to
have a look at the (partial and quite condensed) summary
slides of my projects. Therein I give an overview of what I regard as the
foundations of (statistical) machine learning and of my own
research in this direction.
I start with a survey of the key subfields and dichotomies.
Most, if not all problems can formally be reduced to
predicting sequences or acting well.
Occam's razor, Bayes' rule, utility, information, and computability theory
are key to solving these problems.
I discuss the following developments and applications:
determining (in)dependence of samples by Mutual Information,
non-parametric Bayesian inference,
robust estimation,
Bayesian change point detection,
Bayesian sequence prediction,
the MDL principle,
prediction with expert advice,
applications of algorithmic information theory,
optimization,
computer vision,
and image processing.
I also briefly summarize my past work
in particle physics, medical software development, and others.
Future projects (2006-...)
It is hard to make predictions, which includes future research
projects. Research is not performed in a vacuum, but adapts to
trends, chances, and the local environment.
Some completely new topics may strike my interest and/or some of my former activities on
image processing, 3D graphics, numerical simulations, and others
may be revived.
The Minimum Description Length (2003-...)
(MDL) principle is one of the most important concepts in Machine
Learning, and even serves as a scientific guide, in general. An
MDL estimator seeks a simple probabilistic model for which
simultaneously the data likelihood is large. MDL has been studied
on all possible levels from very concrete and highly tuned
practical applications up to general theoretical assertions. MDL
research mainly concentrates on continuous model classes, coding,
modelling, and independent identically distributed (i.i.d.) data.
On the other hand, the non-i.i.d. case is important for
sequential prediction tasks, and from a computational point of
view the largest relevant model class is countable, namely the
class of all enumerable semimeasures, deeply rooted in Algorithmic
Information Theory and Solomonoff's universal induction scheme.
Convergence and stabilization of MDL based sequential prediction
for arbitrary discrete (deterministic and probabilistic) model
classes has been shown in
[P03unimdl,
P04mdl2p],
but convergence may be slow
[P04unimdlx,
P04mdlspeed,
P05mdl2px,
P06mdlspeedx].
Three MDL variants have
been identified, and the results have been
extended to regression and classification
[P05mdlreg].
Apart from these findings, the countable
non-i.i.d. case has been left mostly untouched for future
exploration.
Key open problems: Characterize the class of coding schemes
and discrete model classes for which MDL performs well (as well
Bayes); further explore the differences between the 3 MDL variants.
Example applications: prevention of overfitting; regression
for large model classes; Bayesian net learning; inference of
decision trees; learning (e.g. grammars) from positive examples
only; sequence prediction; molecular biology; functional genomics
(expression/mapping/interpretation).
Algorithmic Probability and Information Theory (2002-...)
are the mathematical foundations of universal (domain independent)
induction and information.
Kolmogorov defined the information content or complexity K of
(an arbitrary object coded as) a string as the length of the
shortest program on a universal Turing machine computing that
string. This description "finds" all effective
regularities of the string and is no further compressible.
Shannon entropy, specific MDL codings, Lempel-Ziv
compression, and many others can be regarded as efficient
approximations to the incomputable Kolmogorov complexity K.
Solomonoff defined (earlier) the closely related universal a priori
probability M, which assigns high/low probability to simple/complex
strings, thus quantifying Occam's razor and Epicurus'
multiple-explanation principle. Solomonoff's induction scheme based on M
results in optimal sequential predictions (in expectation and with
probability one), the only assumption to be made is that the
sequence is sampled from a computable probability distribution
[P99errbnd,
P03optisp,
P03unipriors,
P05unipriorx,
P05postbnd].
Despite some nearby results and proofs in the literature
[P03mlconv],
the stronger result of convergence for all (Martin-Löf) random
sequences, does not hold in general
[P04mlconvx,
P06mlconvxx].
Key open problems: effective approximations of K which
respect key properties of K; Martin-Löf convergence of M
or related quantities; applicability of other K complexity
variants for prediction; convergence class of time-limited M
variants like Schmidhuber's speed prior; task-dependent
universal distance measures.
Example applications: foundation of information and
induction; Maxwell's demon; incompressibility proof-method;
reversible computing; universal search and optimization; key
ingredient to the AIXI theory (see below).
Prediction with expert advice (2002-...)
(PEA) is currently a very active field.
Many variations known by many names (prediction/learning with
expert advice, weighted majority/average, aggregating strategy,
boosting, hedge algorithm, ...) have meanwhile been invented.
PEA combines forecasts of experts to form its own prediction and
can be shown to perform nearly as well as the best expert in the
considered class of experts, without making any assumption on the
environment(al data sequence). An old rediscovered variant,
"Follow the Perturbed Leader" allowed to elegantly derive
generalized performance/regret bounds for adaptive learning rate
(general weights, alphabet and loss)
[P04expert,
P05expertx],
even in reactive environments
[P05actexp,
P05actexp2].
Dual to this approach, expected
performance bounds for predictors based on Bayes mixtures of
environments rather than experts have been studied (see below).
The duality of both approaches and results has been pointed out in
[P04bayespea] and should be
further explored in the future,
probably leading to novel and fruitful bounds and algorithms.
Key open problems: quantification, interpretation and
exploitation of the duality of PEA and Bayesian sequence
prediction (see below); loss bounds for some PEA variant for
general weights without the hierarchy trick used in FPL;
investigate structures for which PEA is efficient even for large
number of experts; Close the sqrt(2) gap between lower and
upper bounds.
Example applications: combining forecasting software (for
weather, stock market, ...); online shortest path problem; binary
search tree.
Bayesian (Sequence) Prediction (2001-...)
The Bayesian framework is ideally suited for induction problems,
including classification tasks, pattern recognition, time-series
(e.g. stock market and weather) forecasting, and many more, where
(sequential) data is generated from some distribution μ.
If μ is unknown, but known to belong to some
class of distributions, one can base ones prediction on
the Bayes-mixture ξ defined as a weighted average over
these distributions. One can show that the posterior of ξ
converges rapidly to the true posterior μ
[P01alpha,
P03mlconv,
P04mlconvx,
P06mlconvxx].
The expected loss of the Bayes-optimal universal prediction scheme
based on ξ is very close to the loss of the optimal, but
infeasible prediction scheme based on μ
[P01loss,
P02spupper].
Tight bounds have been derived, and
no other predictor can be significantly better. Furthermore, ξ
is Pareto-optimal and an Occam's razor argument shows that the
universal weights based on Algorithmic Information Theory leading
to Solomonoff's universal induction scheme are optimal
[P99errbnd,
P03optisp,
P03unipriors,
P05unipriorx,
P05postbnd].
The results may be generalized and applied in various ways, e.g. to
games of chance in form of a sequence of bets, observations, and
rewards [P03optisp].
In particular for i.i.d. data, a novel non-parametric, consistent,
scale-invariant Bayesian tree prior has been constructed, which
allows very easy, efficient and exact inference and estimation of
most quantities of interest
[P04bayestree,
P05bayestreex].
Finally, an exact and efficient Bayesian regression algorithm
based on Dynamic Programming for piecewise constant functions of
unknown segment number, boundary location, and levels has been
developed [H06pcreg].
It is intended to adapt and
further develop this algorithm within a joint SNF grant requested
with the Oncology Institute of Southern Switzerland (IOSI)
and apply it to microarray-CGH data for determining
cancer relevant genes.
Key open problems: High probability performance guarantees.
Example applications: prediction of Markov processes;
speech recognition; handwritten word/text/character recognition,
microarray-CGH data analysis.
Robust predictions (2002-...),
rather than (risky), best on average predictions, are sometimes
necessary. Various ways (apart from PEA) have been suggested to
go beyond maximizing expected performance. In the Bayesian
framework this means computing quantiles and/or variances.
Higher moments of mutual information under a second order
Dirichlet prior have been derived [P01xentropy],
successfully used for robust feature selection
[P02feature], and extended to the case of incomplete
data [P03mimiss, P02mifs]. For the Imprecise Dirichlet Model,
which goes beyond Bayes, a comprehensive collection of analytical
formulae for computing robust intervals have been derived
[P03idm] and applied to infer robust trees
[P03tree]. These approaches are promising, theoretical
as well as practical, and should be investigated further.
Key open problems: adequate treatment of incomplete
observations; efficient approximations; principles for choosing
acceptance/rejection thresholds.
Example applications: safety critical areas such as planning
and designing medical treatments and atomic reactors.
Optimization (2002-...)
of functions on very large/high-dimensional spaces is crucial in a
plethora of applications, including operations research and
artificial intelligence, to name only two. The functions often
involve all known difficulties at once: multimodality,
deceptiveness, discontinuities, ... Usually one has to be
satisfied with approximate solutions.
For concrete, often NP-complete problems, polynomial time
approximation algorithms (PTAS) can often be found. To find these
algorithms for problem after problem is a laborious, but rewarding
procedure. Finding a PTAS does not necessarily imply possessing a
practically efficient algorithm.
Remarkably, a fully linear time approximation algorithm for
knapsack problems could be developed [P02knapsack],
improving upon previous PTAS algorithms. Other problems
are also expected to possess linear time algorithms.
Since a rigorous theoretical approach is not always
possible/successful, more heuristic approaches are often used,
including simulated annealing and population based optimization.
More effective selection and deletion schemes (FUSS/FUDS) for
evolutionary (or other population based) algorithms, which
generate selection pressure towards sparsely populated fitness
regions, rather than towards higher fitness, have been invented
[P01fuss,
P05fuds,
P05fuo].
First experiments, to be continued in the future, indicate superiority
to all previous selection schemes and to random deletion on many
hard optimization problems, including the Traveling Salesman
Problem, the set covering problem, and maximum CNF3SAT
[P04fussexp,
P05fuo].
The interesting ``go with the winners'' algorithms by Aldous and
Vazirani may bridge the gap between heuristic and rigorous
approaches to complex optimization problems. This connection is
currently being explored.
Key open problems: linear time approximation algorithms;
test FUSS and variants on other hard optimization problems;
Consider ``go with the winners'' on more general practically
important graph structures.
Example applications: operations research; scheduling
problems.
Reinforcement Learning (2005-...)
is a general approach for sequential decision making and active
learning. The setup consists of a decision maker (agent) which
interacts with an environment (world) by an alternating sequence
of actions and observations. The observations consist of an
informative part (e.g. a video image) and occasionally a
potentially delayed reward part (e.g. function value in case
function maximization or battery level in case of self-maintaining
robots). The goal is to learn the decisions which maximize reward.
In case the decision does not influence the environment the setup
reduces to passive predictions, discussed above. In case of known
environment, sequential decision theory provides a formal, but in
general inefficient, optimal answer. Dynamic programming
efficiently solves the case of known Markov environments with few
states. Adaptive/dynamic control theory addresses the case of
unknown/known linear environments with quadratic loss function
(interesting for engineering, but too limited for AI). The general
case of unknown environment and active agents is very difficult.
The AIXI(tl) theory
[P01aixi]
represents an optimal but
inefficient ``solution''. More heuristic approaches to this
problem are typically subsumed under the name
Reinforcement-Learning (RL).
Temporal difference (TD) learning (and variants) are
computationally efficient (but data inefficient) asymptotically
converging methods if the environment is a completely observed
stationary ergodic Markov Decision Process (MDP) with a small
finite number of states. RL/TD for infinite state spaces (function
approximation), partial observability (belief states), and
handling the exploration versus exploitation
[P05aixifoe], is in its infancies.
Key open problems: balancing exploration versus
exploitation; optimal learning rate and eligibility parameter;
convergence in case of function approximation and partial
observability; comparison to model-based learning, non-ergodic
and/or non-stationary environments; beyond (PO)MDPs;
imitation-based learning (e.g. in a multi-agent or symmetric game
setup).
Example applications: board games like backgammon and checkers;
elevator control; robo-soccer; dynamic channel allocation;
inventory problems.
Universal decision theory (2003-...)
My favorite research project is on optimal sequential decision
making, based on universal algorithmic probability (called
AIXI(tl), see below).
Whereas past research was focused on proving optimality for
ultra general problem classes [P02selfopt],
future research will also investigate the applicability of AIXItl.
Variants of universal Levin search have been implemented and successfully
applied
at IDSIA by Schmidhuber to a variety of problems concerning
induction, classification, and others. Analogously, we
expect variants of AIXItl to be successful in solving problems
in ``reactive'' environments, also called active learning.
The goal is to implement AIXItl and apply it to difficult, yet
simply describable problems and compare the performance to other
approaches.
Key open problems: in which sense is AIXI universally optimal;
analyze behavior of AIXI for several concept classes;
axiomatic/algebraic characterization of AIXI; choice of the horizon;
downscale and implement AIXI (like Levin-search and OOPS); more insights
how AIXI works.
Example applications: mathematical and philosophical
foundation of intelligence; repeated prisoner problem and
generalized variants [P05aixifoe];
the n-armed bandit problem; simple strategic games.
Universal Artificial Intelligence (2000-2004)
My current research is focused around a recently introduced
mathematical theory for intelligence, called AIXI. The AIXI model
is a parameter-free optimal reinforcement learning agent embedded
in an arbitrary unknown environment. Disciplines of importance
involved in the development of the AIXI model are artificial
intelligence, reinforcement learning, algorithmic information
theory, Kolmogorov complexity, minimal description length (MDL)
principle, computational complexity theory, information theory and
statistics, universal Solomonoff induction, universal Levin
search, sequential decision theory, and adaptive control theory.
The model was first introduced and discussed in
[P00kcunai]. More succinct descriptions have been
published in
[P01aixi,
P01decision]1.
See also
[P03aixigentle] for an introductory survey
and [P04uaibook] for
a comprehensive treatment. The AIXI
model has been argued to formally solve a number of problem
classes, including sequence prediction, strategic games, function
minimization, reinforcement and supervised learning
[P00kcunai].
It comes along with a universal intelligence order-relation/measure
[P04uaibook,
P05iors].
The generalization AIxi has recently
been shown to be self-optimizing and Pareto-optimal
[P02selfopt].
Tight [P03optisp] error
[P99errbnd,
P01alpha]2
and loss
[P01loss] bounds for Solomonoff's universal sequence
prediction scheme have been proven.
I found a remarkable, asymptotically fastest and shortest
algorithm for all well-defined problems - fastest save for
a constant factor 5 and a problem class-specific additive constant
[P01fast]3.
A closely related idea led to a computable AIXItl model
[P00kcunai,P01aixi].
Ideas, loosely related to AIXI, on
a market/economy based reinforcement learner
[P01market],
a gradient based reinforcement planner
[P01grep],
and a reinforcement learning mobile robot
[P06robot]
have been implemented.
Applied R&D at Company BrainLAB (1996-2000)
During my full-time employment at BrainLAB from 1996 to 2000 I
developed various numerical algorithms in the medical field. They
include a Neuro-Navigation system, a Brachytherapy planning
system, a dose algorithm (PencilBeam) for radiotherapy for IMRT, a
real time software volume renderer, various image processing
modules, and many more.
Image and volume data enhancement and post-antialiasing
algorithms based on finite-element interpolation have been
developed, patented [P01eupatent,P02uspatent],
and implemented.
Due to tough time-schedules and the nondisclosure
policy of the company further publications were unfortunately not
possible.
PhD in Particle Physics (1993-1996)
The universe as a whole is determined by gravity, which is described
by the general relativity theory. Its constituents, the elementary
particles, behave according to the rules of quantum field theory or,
more specifically, according to the standard model of particle physics.
Super string theory could turn out to be the single, elegant Theory
of Everything physicists dream of. This may also solve the old
outstanding (philosophical) problem of the interpretation of quantum
theory (collapse of the wave function, Schrödinger's cat,...).
My interests in physics are centered around these topics. My
active research concentrates on non-perturbative QCD, especially
instantons [P96diss,P96thesis], gluon mass
[P93gluon], quark propagator [P95prop],
eta' mass [P96eta], meson correlators and masses
[P97instanto], and the proton spin
[P95spin]. My favorite paper explains the exponential
fermion mass spectrum between successive generations
[P97family].
Early student work (1983-1992)
A classifier system, allowing for comparison of many popular
variants, was implemented in my Masters thesis [P92cfs].
It also contains the first proof of the equivalence of ranking
and tournament selection.
Hebb nets are unsupervised learners. In [P90fopra] the
possibility of Hebb nets learning by reinforcement feedback
was demonstrated.
Computer aided design demands significant computational resources.
Only two serious CAD programs for 8 bit computers exist.
One of them (written in assembler) being
[P87cad].
Some other mentionable programming jobs include the development of
a member organization program in DBase (1983, Steuerberater
Keller), a user interface for an expert system under GEM (1987,
IABG), and a protection module and organization program for licensing
programs in C (1993, IABG).
| © 2000 by ... | [previous] [home] [search] [science] [calculators] [personal] [contact] [up] [next] | ... Marcus Hutter |