[previous] [home] [search] | Research
Projects of Marcus Hutter |
[contact] [up] [next] |

- Personal motivation
- Current and past projects (...-2010) [Slides]
- Future projects (2004-...)
- Bayesian Statistics
- Information Theory
- Complexity Theory
- Minimum Description Length (2003-...)
- Algorithmic Probability and Information Theory (2002-...)
- Prediction with expert advice (2002-...)
- Bayesian Sequence Prediction (2001-...)
- Robust predictions (2002-...)
- Optimization (2002-...)
- Reinforcement Learning (2005-...)
- Universal decision theory (2003-...)
- Computer vision & image processing
- Bio-statistics & bio-informatics (2006-2010)
- Universal Artificial Intelligence (2000-2006)
- Applied R&D at Company BrainLAB (1996-2000)
- PhD in Particle Physics (1993-1996)
- Early student work (1983-1992)

**Personal motivation.**
Since my early youth I have been following two great goals in
my life. One is to work on the physical ``Theory of Everything'',
which motivated me to do my BSc and PhD in Theoretical
Particle Physics.
The other is Artificial Intelligence, which motivated me
to do my BSc & MSc & Habilitation in
Theoretical Computer Science. This personal drive ``explains''
most of my CV.
I have also, always been interested in significant open math
problems, most prominently the P=NP question.
My current research is centered around
artificial intelligence,
machine (reinforcement) learning,
information theory and statistics,
algorithmic information theory,
Kolmogorov complexity,
minimal description length (MDL) principle,
Bayesian/robust/expert/MDL/online/sequence prediction,
computational complexity theory,
universal Solomonoff induction,
universal Levin search,
sequential decision theory, and
adaptive control theory (see e.g.
[P04uaibook] how they fit together).

**Current and past projects (...-2010)**
Below you can find a description of my current and past (and future) projects
with links to corresponding publications. Alternatively you may want to
have a look at the (partial and quite condensed) summary
slides of my projects. Therein I give an overview of what I regard as the
foundations of (statistical) machine learning and of my own
research in this direction.
I start with a survey of the key subfields and dichotomies.
Most, if not all problems can formally be reduced to
predicting sequences or acting well.
Occam's razor, Bayes' rule, utility, information, and computability theory
are key to solving these problems.
I discuss the following developments and applications:
determining (in)dependence of samples by Mutual Information,
non-parametric Bayesian inference,
robust estimation,
Bayesian change point detection,
Bayesian sequence prediction,
the MDL principle,
prediction with expert advice,
applications of algorithmic information theory,
optimization,
computer vision,
and image processing.
I also briefly summarize my past work
in particle physics, medical software development, and others.

**Future projects (2010-...)**
It is hard to make predictions, which includes future research
projects. Research is not performed in a vacuum, but adapts to
trends, chances, and the local environment.
Some completely new topics may strike my interest and/or some of my former activities on
image processing, 3D graphics, numerical simulations, and others
may be revived.

**Bayesian statistics.**
The Bayesian framework is ideally suited for complex statistical
induction problems, including classification tasks, pattern
recognition, time-series (e.g. stock market and weather)
forecasting, and many more, where data is generated from some
distribution μ. If μ is unknown, but known to belong to
some class of environments=hypotheses=models M, from one's
prior belief w_{ν} in environment ν in M and the observed
data likelihood, Bayes' rule yields one's posterior confidence in
hypothesis ν. In the predictive setting one can base one's
prediction directly on the Bayes-mixture ξ defined as the
w_{ν}-weighted average over these distributions.
For i.i.d. data, a novel, non-parametric, consistent,
scale-invariant Bayesian tree prior has been constructed, which
allows very easy, efficient and exact inference and estimation of
most quantities of interest
[P05bayestree,
P09bayestreex].
An exact and efficient Bayesian regression algorithm
based on Dynamic Programming for piecewise constant functions of
unknown segment number, boundary location, and levels has also been
developed [P07pcreg].

*Key open problems:* The Bayesian framework leaves open how to
choose the model class M and prior w_{ν}. Subjective,
objective, universal, and sets-of priors are the key choices
[P06usp].

*Applications* are abundant, e.g. density estimation and
regression, feature selection, dependency networks; handling small
sample sizes. It is intended to adapt and further develop
[P07pcreg] within a joint SNF grant requested with the
Oncology Institute of Southern Switzerland (IOSI) and apply it to
microarray-CGH data for determining cancer relevant genes.

**Information theory.**
Shannon entropy is the standard measure of the information rate of
a stochastic source. The mutual information between two random
variables is defined similarly, and is the default (in)dependency
measure. If the true probabilities are unknown, they are typically
estimated from observed relative frequencies. The resulting
empirical mutual information is then itself a random variable. For
robust or credible estimates, rather than risky average
predictions, not only the mean but also quantiles or variances are
needed.
These and higher moments under a second order Dirichlet prior have
been derived [P01xentropy], successfully used for
robust feature selection [P02feature], and extended to
the case of incomplete data [P03mimiss,P05mifs].
For the Imprecise Dirichlet Model, which goes beyond Bayes, a
comprehensive collection of analytical formulae for computing
robust intervals of entropy, mutual information and other
quantities has been derived [P03idm] and applied to
infer robust trees [P05tree].

*Key open problems:* better treatment of incomplete
observations; efficient approximations; principles for choosing
acceptance/rejection thresholds.

*Example applications:* safety critical areas such as planning
and designing medical treatments and atomic reactors.

**Complexity theory.**
Computational complexity theory studies the computation time and
space required to solve a given problem.
Remarkably, an asymptotically fastest algorithm for *all*
well-defined problems exists
[P02fast].
A closely related idea led to the computable AIXI*tl* model
[P00kcunai,P01aixi].
The class P of polynomially-time (=efficiently) solvable problems,
and the class NP of problems whose solution can be verified in
polynomial time are the most famous in the ever-increasing zoo of
complexity classes. Whether P=NP, is the most important open
problem. Some NP-hard problems allow for efficient approximations
(PTAS), like the knapsack problem, which can actually be solved in
linear time [P06knapsack].
Algorithmic complexity (or information) theory is concerned with
the information content in strings, programs, and other objects.
Kolmogorov defined the complexity *K* of
(an arbitrary object coded as) a string as the length of the
shortest program on a universal Turing machine computing that
string. This description ``finds'' *all* effective
regularities of the string and is no further compressible.
Shannon entropy, specific MDL codings, Lempel-Ziv or CTW
compression, and many others can be regarded as efficient
approximations to the incomputable Kolmogorov complexity *K*.

*Key open problems:* effective approximations of *K* which
respect key properties of *K*; applicability of other *K*
complexity variants for prediction
[P03unipriors,
P06unipriorx].

*Example applications:* foundation of information and
induction, universal sequence prediction
[P01errbnd,
P03unimdl,
P06unimdlx,
P03mlconv,
P04mlconvx,
P07mlconvxx,
P05postbnd,
P06usp];
Maxwell's demon; incompressibility proof-method; reversible
computing; universal search and optimization [P02fast];
key ingredient to the AIXI theory [P04uaibook];
detecting similarities of objects.

**The Minimum Description Length (2003-...)**
(MDL) principle is one of the most important concepts in Machine
Learning, and even serves as a scientific guide, in general. An
MDL estimator seeks a simple probabilistic model for which
simultaneously the data likelihood is large. MDL has been studied
on all possible levels from very concrete and highly tuned
practical applications up to general theoretical assertions. MDL
research mainly concentrates on continuous model classes, coding,
modelling, and independent identically distributed (i.i.d.) data.
On the other hand, the non-i.i.d. case is important for
sequential prediction tasks, and from a computational point of
view the largest relevant model class is countable, namely the
class of all enumerable semimeasures, deeply rooted in Algorithmic
Information Theory and Solomonoff's universal induction scheme.
Convergence and stabilization of MDL based sequential prediction
for arbitrary discrete (deterministic and probabilistic) model
classes has been shown in
[P03unimdl,
P04mdl2p],
but convergence may be slow
[P04unimdlx,
P04mdlspeed,
P05mdl2px,
P06mdlspeedx].
Three MDL variants have
been identified, and the results have been
extended to regression and classification
[P05mdlreg].
Apart from these findings, the countable
non-i.i.d. case has been left mostly untouched for future
exploration.

*Key open problems:* Characterize the class of coding schemes
and discrete model classes for which MDL performs well (as well
Bayes); further explore the differences between the 3 MDL variants.

*Example applications:* prevention of overfitting; regression
for large model classes; Bayesian net learning; inference of
decision trees; learning (e.g. grammars) from positive examples
only; sequence prediction; molecular biology; functional genomics
(expression/mapping/interpretation).

**Algorithmic Probability and Information Theory (2002-...)**
are the mathematical foundations of universal (domain independent)
induction and information.
Kolmogorov defined the information content or complexity *K* of
(an arbitrary object coded as) a string as the length of the
shortest program on a universal Turing machine computing that
string. This description "finds" *all* effective
regularities of the string and is no further compressible.
Shannon entropy, specific MDL codings, Lempel-Ziv
compression, and many others can be regarded as efficient
approximations to the incomputable Kolmogorov complexity *K*.
Solomonoff defined (earlier) the closely related universal a priori
probability *M*, which assigns high/low probability to simple/complex
strings, thus quantifying Occam's razor and Epicurus'
multiple-explanation principle. Solomonoff's induction scheme based on *M*
results in optimal sequential predictions (in expectation and with
probability one), the only assumption to be made is that the
sequence is sampled from a computable probability distribution
[P99errbnd,
P03optisp,
P03unipriors,
P05unipriorx,
P05postbnd].
Despite some nearby results and proofs in the literature
[P03mlconv],
the stronger result of convergence for all (Martin-Löf) random
sequences, does not hold in general
[P04mlconvx,
P06mlconvxx].

*Key open problems:* effective approximations of *K* which
respect key properties of *K*; Martin-Löf convergence of *M*
or related quantities; applicability of other *K* complexity
variants for prediction; convergence class of time-limited *M*
variants like Schmidhuber's speed prior; task-dependent
universal distance measures.

*Example applications:* foundation of information and
induction; Maxwell's demon; incompressibility proof-method;
reversible computing; universal search and optimization; key
ingredient to the AIXI theory (see below).

**Prediction with expert advice (2002-...)**
(PEA) is currently a very active field.
Many variations known by many names (prediction/learning with
expert advice, weighted majority/average, aggregating strategy,
boosting, hedge algorithm, ...) have meanwhile been invented.
PEA combines forecasts of experts to form its own prediction and
can be shown to perform nearly as well as the best expert in the
considered class of experts, without making any assumption on the
environment(al data sequence). An old rediscovered variant,
"Follow the Perturbed Leader" allowed to elegantly derive
generalized performance/regret bounds for adaptive learning rate
(general weights, alphabet and loss)
[P04expert,
P05expertx],
even in reactive environments
[P05actexp,
P05actexp2].
Dual to this approach, expected
performance bounds for predictors based on Bayes mixtures of
environments rather than experts have been studied (see below).
The duality of both approaches and results has been pointed out in
[P04bayespea] and should be
further explored in the future,
probably leading to novel and fruitful bounds and algorithms.

*Key open problems:* quantification, interpretation and
exploitation of the duality of PEA and Bayesian sequence
prediction (see below); loss bounds for some PEA variant for
general weights without the hierarchy trick used in FPL;
investigate structures for which PEA is efficient even for large
number of experts; Close the sqrt(2) gap between lower and
upper bounds.

*Example applications:* combining forecasting software (for
weather, stock market, ...); online shortest path problem; binary
search tree.

**Bayesian (Sequence) Prediction (2001-...)**
The Bayesian framework is ideally suited for induction problems,
including classification tasks, pattern recognition, time-series
(e.g. stock market and weather) forecasting, and many more, where
(sequential) data is generated from some distribution μ.
If μ is unknown, but known to belong to some
class of distributions, one can base ones prediction on
the Bayes-mixture ξ defined as a weighted average over
these distributions. One can show that the posterior of ξ
converges rapidly to the true posterior μ
[P01alpha,
P03mlconv,
P04mlconvx,
P06mlconvxx].
The expected loss of the Bayes-optimal universal prediction scheme
based on ξ is very close to the loss of the optimal, but
infeasible prediction scheme based on μ
[P01loss,
P02spupper].
Tight bounds have been derived, and
no other predictor can be significantly better. Furthermore, ξ
is Pareto-optimal and an Occam's razor argument shows that the
universal weights based on Algorithmic Information Theory leading
to Solomonoff's universal induction scheme are optimal
[P99errbnd,
P03optisp,
P03unipriors,
P05unipriorx,
P05postbnd].
The results may be generalized and applied in various ways, e.g. to
games of chance in form of a sequence of bets, observations, and
rewards [P03optisp].
In particular for i.i.d. data, a novel non-parametric, consistent,
scale-invariant Bayesian tree prior has been constructed, which
allows very easy, efficient and exact inference and estimation of
most quantities of interest
[P04bayestree,
P05bayestreex].
Finally, an exact and efficient Bayesian regression algorithm
based on Dynamic Programming for piecewise constant functions of
unknown segment number, boundary location, and levels has been
developed [H06pcreg].
It is intended to adapt and
further develop this algorithm within a joint SNF grant requested
with the Oncology Institute of Southern Switzerland (IOSI)
and apply it to microarray-CGH data for determining
cancer relevant genes.

*Key open problems:* High probability performance guarantees.

*Example applications:* prediction of Markov processes;
speech recognition; handwritten word/text/character recognition,
microarray-CGH data analysis.

**Robust predictions (2002-...)**,
rather than (risky), best on average predictions, are sometimes
necessary. Various ways (apart from PEA) have been suggested to
go beyond maximizing expected performance. In the Bayesian
framework this means computing quantiles and/or variances.
Higher moments of mutual information under a second order
Dirichlet prior have been derived [P01xentropy],
successfully used for robust feature selection
[P02feature], and extended to the case of incomplete
data [P03mimiss, P02mifs]. For the Imprecise Dirichlet Model,
which goes beyond Bayes, a comprehensive collection of analytical
formulae for computing robust intervals have been derived
[P03idm] and applied to infer robust trees
[P03tree]. These approaches are promising, theoretical
as well as practical, and should be investigated further.

*Key open problems:* adequate treatment of incomplete
observations; efficient approximations; principles for choosing
acceptance/rejection thresholds.

*Example applications:* safety critical areas such as planning
and designing medical treatments and atomic reactors.

**Optimization (2002-...)**
of functions on very large/high-dimensional spaces is crucial in a
plethora of applications, including operations research and
artificial intelligence, to name only two. The functions often
involve all known difficulties at once: multimodality,
deceptiveness, discontinuities, ... Usually one has to be
satisfied with approximate solutions.
For concrete, often NP-complete problems, polynomial time
approximation algorithms (PTAS) can often be found. To find these
algorithms for problem after problem is a laborious, but rewarding
procedure. Finding a PTAS does not necessarily imply possessing a
practically efficient algorithm.
Remarkably, a fully linear time approximation algorithm for
knapsack problems could be developed [P02knapsack],
improving upon previous PTAS algorithms. Other problems
are also expected to possess linear time algorithms.
Since a rigorous theoretical approach is not always
possible/successful, more heuristic approaches are often used,
including simulated annealing and population based optimization.
More effective selection and deletion schemes (FUSS/FUDS) for
evolutionary (or other population based) algorithms, which
generate selection pressure towards sparsely populated fitness
regions, rather than towards higher fitness, have been invented
[P01fuss,
P05fuds,
P05fuo].
First experiments, to be continued in the future, indicate superiority
to all previous selection schemes and to random deletion on many
hard optimization problems, including the Traveling Salesman
Problem, the set covering problem, and maximum CNF3SAT
[P04fussexp,
P05fuo].
The interesting ``go with the winners'' algorithms by Aldous and
Vazirani may bridge the gap between heuristic and rigorous
approaches to complex optimization problems. This connection is
currently being explored.

*Key open problems:* linear time approximation algorithms;
test FUSS and variants on other hard optimization problems;
Consider ``go with the winners'' on more general practically
important graph structures.

*Example applications:* operations research; scheduling
problems.

**Reinforcement Learning (2005-...)**
is a general approach for sequential decision making and active
learning. The setup consists of a decision maker (agent) which
interacts with an environment (world) by an alternating sequence
of actions and observations. The observations consist of an
informative part (e.g. a video image) and occasionally a
potentially delayed reward part (e.g. function value in case
function maximization or battery level in case of self-maintaining
robots). The goal is to learn the decisions which maximize reward.
In case the decision does not influence the environment the setup
reduces to passive predictions, discussed above. In case of known
environment, sequential decision theory provides a formal, but in
general inefficient, optimal answer. Dynamic programming
efficiently solves the case of known Markov environments with few
states. Adaptive/dynamic control theory typically addresses the case of
unknown/known linear environments with quadratic loss function
(interesting for engineering, but too limited for AI). The general
case of unknown environment and active agents is very difficult.
The AIXI(tl) theory
[P01aixi,
P05aixifoe]
represents an optimal but
inefficient ``solution''.
Efficient learning and planning algorithms exist for completely
observable finite state Markov decision processes (MDPs). While
these environments are very limited, one can reduce many real-world
problems to fit this case. Traditionally this is an art that
involves significant effort by designers, but it is possible to
automate the reduction process to a large degree
[P09phimdp,
P09phimdpx].
Extensions to much more powerful structured MDPs are also possible
[P09phidbn].
Many practical heuristic algorithms have been
developed, in particular, versions of temporal difference learning
[P08qlearn].

*Key open problems:* balancing exploration versus
exploitation; optimal learning rate and eligibility parameter;
convergence in case of function approximation and partial
observability; comparison to model-based learning, non-ergodic
and/or non-stationary environments; beyond (PO)MDPs;
imitation-based learning (e.g. in a multi-agent or symmetric game
setup).

*Example applications:* board games like backgammon and checkers;
elevator control; robo-soccer; dynamic channel allocation;
inventory problems.

**Universal decision theory (2003-...)**
My favorite research project is on optimal sequential decision
making, based on universal algorithmic probability (called
AIXI(tl), see below).
Whereas past research was focused on proving optimality for
ultra general problem classes [P02selfopt],
future research will also investigate the applicability of AIXItl.
Variants of universal Levin search have been implemented and successfully
applied
at IDSIA by Schmidhuber to a variety of problems concerning
induction, classification, and others. Analogously, we
expect variants of AIXItl to be successful in solving problems
in ``reactive'' environments, also called active learning.
The goal is to implement AIXItl and apply it to difficult, yet
simply describable problems and compare the performance to other
approaches.

*Key open problems:* in which sense is AIXI universally optimal;
analyze behavior of AIXI for several concept classes;
axiomatic/algebraic characterization of AIXI; choice of the horizon;
downscale and implement AIXI (like Levin-search and OOPS); more insights
*how* AIXI works.

*Example applications:* mathematical and philosophical
foundation of intelligence; repeated prisoner problem and
generalized variants [P05aixifoe];
the n-armed bandit problem; simple strategic games.

**Computer vision & image processing.**
While humans can easily recognize 3d objects or whole scenes from a
stereographic projection, for computers, the reconstruction of 3d
objects from single or multiple photographic projections is a
challenge. Consider the recognition and 3d reconstruction of a
specific object in a realistic (messy) outdoor photograph. The
picture contains background, other objects, unknown lighting
condition, shades, and texture. Reliably finding ``landmark'' points
of interest for matching them with an idealized 3d model is not
easy in general.
But for cars, two wheels, if visible, can reliably be detected and
provide enough information for performing the 2d-3d registration
[P09wheel].
Alternatively, one may match 2d or 3d images directly by minimizing
a pixel/voxel-based distance measure that is insensitive to
different modalities, like mutual information.
Overlaying medical Xray, CT, MRI, Ultra-Sound, etc. images is an
important application.
The merged multi-modal volume data can be displayed with volume or
(after 3d segmentation) surface rendering algorithms.
Sophisticated anti-aliasing based on finite-element interpolation
can significantly enhance the display quality
[P02uspatent,
P01eupatent,
R&D@BrainLAB].

**Bio-statistics & bio-informatics (2006-2010)**
The advent of high-throughput DNA sequencing and genomic profiling
with micro-arrays resulted in an unprecedented flood of data in
molecular biology. This required to augment or replace traditional
statistical modeling and analysis (bio-statistics) by
computation-intensive algorithmic data processing (bio-informatics).
Micro-arrays can determine the expression, copy-number, and LOH
at millions of locations in our DNA simultaneously.
Comparing the results from cancerous to normal cells can reveal the
genes and molecules responsible for the progression of
cancer.
Since the measurements are extremely noisy, sophisticated
statistical algorithms are required to separate signal from noise
[P09bcnax,
P09mbpcrcode].
By combining various data sources one can improve
the identification of the different types of genomic aberrations
[P09cnloh].

**Universal Artificial Intelligence (2000-2006)**
My all-time favorite research project is about a
mathematical theory for intelligence, called AIXI. The AIXI model
is a parameter-free optimal reinforcement learning agent embedded
in an arbitrary unknown environment. Disciplines of importance
involved in the development of the AIXI model are artificial
intelligence, reinforcement learning, algorithmic information
theory, Kolmogorov complexity, minimal description length (MDL)
principle, computational complexity theory, information theory and
statistics, universal Solomonoff induction, universal Levin
search, sequential decision theory, and adaptive control theory.
The model was first introduced and discussed in
[P00kcunai]. More succinct descriptions have been
published in
[P01aixi,
P01decision].
See also
[P03aixigentle] for an introductory survey
and [P04uaibook] for
a comprehensive treatment. The AIXI
model has been argued to formally solve a number of problem
classes, including sequence prediction, strategic games, function
minimization, reinforcement and supervised learning
[P00kcunai].
It comes along with a universal intelligence order-relation/measure
[P04uaibook,
P05iors].
The generalization AIxi has recently
been shown to be self-optimizing and Pareto-optimal
[P02selfopt].
Tight [P03optisp] error
[P99errbnd,
P01alpha]
and loss
[P01loss] bounds for Solomonoff's universal sequence
prediction scheme have been proven.
I found a remarkable, asymptotically fastest and shortest
algorithm for *all* well-defined problems - fastest save for
a constant factor 5 and a problem class-specific additive constant
[P01fast].
A closely related idea led to a computable AIXItl model
[P00kcunai,P01aixi].
Ideas, loosely related to AIXI, on
a market/economy based reinforcement learner
[P01market],
a gradient based reinforcement planner
[P01grep],
and a reinforcement learning mobile robot
[P06robot]
have been implemented.

**Applied R&D at Company BrainLAB (1996-2000)**
During my full-time employment at BrainLAB from 1996 to 2000 I
developed various numerical algorithms in the medical field. They
include a Neuro-Navigation system, a Brachytherapy planning
system, a dose algorithm (PencilBeam) for radiotherapy for IMRT, a
real time software volume renderer, various image processing
modules, and many more.
Image and volume data enhancement and post-antialiasing
algorithms based on finite-element interpolation have been
developed, patented [P01eupatent,P02uspatent],
and implemented.
Due to tough time-schedules and the nondisclosure
policy of the company further publications were unfortunately not
possible.

**PhD in Particle Physics (1993-1996)**
The universe as a whole is determined by gravity, which is described
by the general relativity theory. Its constituents, the elementary
particles, behave according to the rules of quantum field theory or,
more specifically, according to the standard model of particle physics.
Super string theory could turn out to be the single, elegant
Theory of Everything
[P10ctoex]
physicists dream of. This may also solve the old
outstanding (philosophical) problem of the interpretation of quantum
theory (collapse of the wave function, Schrödinger's cat,...).
My interests in physics are centered around these topics. My
active research concentrates on non-perturbative QCD, especially
instantons [P96diss,P96thesis], gluon mass
[P93gluon], quark propagator [P95prop],
eta' mass [P96eta], meson correlators and masses
[P97instanto], and the proton spin
[P95spin]. My favorite paper explains the exponential
fermion mass spectrum between successive generations
[P97family].
On a lighter note,
The importance of observer localization when developing theories
of everything
see my paper how to find theories of everything.

**Early student work (1983-1992)**
A classifier system, allowing for comparison of many popular
variants, was implemented in my Masters thesis [P92cfs].
It also contains the first proof of the equivalence of ranking
and tournament selection.
Hebb nets are unsupervised learners. In [P90fopra] the
possibility of Hebb nets learning by reinforcement feedback
was demonstrated.
Computer aided design demands significant computational resources.
Only two serious CAD programs for 8 bit computers exist.
One of them (written in assembler) being
[P87cad].
Some other mentionable programming jobs include the development of
a member organization program in DBase (1983, Steuerberater
Keller), a user interface for an expert system under GEM (1987,
IABG), and a protection module and organization program for licensing
programs in C (1993, IABG).

© 2000 by ... | [previous] [home] [search] [science] [personal] [contact] [up] [next] | ... Marcus Hutter |