NIPS 2002 Workshop - (Juergen Schmidhuber & Marcus Hutter)

NIPS 2002 Workshop
Universal Learning Algorithms and Optimal Search
Whistler/Blackcomb Resort, B.C., Canada, Saturday 14, December, 2002
Organizers: Jürgen Schmidhuber & Marcus Hutter, IDSIA, Switzerland

Main Workshop Page

Abstracts of Presentations

Ray Solomonoff A General System For Incremental Machine Learning
Nicolo Cesa-Bianchi Universal Prediction, Game Theory, and Pattern Recognition
Juergen Schmidhuber Optimal Ordered Problem Solver
Ilya Nemenman Universal Learning: A View of a Bayesian
Paul Vitanyi The Similarity Metric
Marcus Hutter Universal Sequential Decisions based on Algorithmic Probability

A General System For Incremental Machine Learning
Ray Solomonoff, Oxbridge Research, Cambridge, Ma., USA

We will describe recent developments in a system for machine learning that we've been working on for some time. The system is designed to learn to solve two kinds of problems. Most, but not all problems in science and engineering are of these two kinds.

The first kind is function inversion. These are the P and NP problems of computational complexity theory. They include theorem proving, solution of equations, symbolic integration, etc.

The second kind of problem is time limited optimization. Inductive inference, surface reconstruction, and image restoration are a few examples of this kind of problem. Designing an automobile in 6 months satisfying certain specifications and having minimal cost, is another.

In the following discussion, we will be using the term "probability" in a special sense: i.e. the estimate given by the best probabilistic model that we can find in the available time.

Our system starts out with a small set of Problem Solving Techniques (PST's) and a simple General Conditional Probability Distribution (GCPD). When the system is given a problem, the description of this problem is the "Condition" for the GCPD. Its output is a probability distribution on PST's - the likelihood that each of them will be the best way to solve the problem. It uses these PST's and their associated probabilities to solve the first problem.

Next, it executes its update algorithm: The PST's are modified, new ones may be added, some may be deleted. The GCPD is modified. These changes attempt to incorporate into the system, information obtained in solving recent problems, and prepare it to solve harder problems.

The next problem presentation and the system's solution is followed by the corresponding update and so on. After the nth update, the system is usually able to work problems more difficult than the nth. Giving the system a suitable training sequence of problems of progressive difficulty, makes it possible for the system to eventually solve very hard problems.

One critical component in the system is the initial set of PST's. These will be Levin's Universal Search Algorithm (Lsearch). Simple Lsearch is only practical for very small problems, since its solution time is exponential in problem size. In the update phase, we modify the probability distribution that is used to guide Lsearch. This effectively reduces the sizes of more difficult problems, so that Lsearch becomes feasible for them.

The update algorithm is another critical component. It has been designed to facilitate "transfer learning", so that the system can utilize any similarities between problems in the same or disparate domains. It has been possible to express this algorithm as an inductive inference problem of a type the system normally solves - so we can simply ask the system to update itself as one of its regular problems.

Perhaps the most critical component of all is the training sequence - To devise a sequence of problems so that each of them challenges the system to devise new concepts to solve it, yet keeping the challenges within the capacity of the machine. For PST's that use simple Lsearch, this is not hard to do. It is always possible to get a good estimate of the time needed for the system to find a known solution to a problem. For certain modifications of Lsearch this is not so easy and designing training sequences can become as difficult as teaching people.

The system's generality invites comparison with Genetic Programming. We will discuss differences in the two approaches to optimization problems, and suggest a way in which Genetic Programming might be made a part of the present system.

Another way to look at the system is to think of it as a "Wrapper" system - a system that modifies its parameters in view of its experience with earlier problem solving. The present system is an "Extreme Wrapper System" in that - through the Magic of Universal Search - it is possible for the entire system to effectively replace itself by one that is more efficient!

R. Solomonoff. Progress in Incremental Machine Learning
Preliminary Report (2002), http://world.std.com/~rjs/nips02.pdf

R. Solomonoff. A System for Incremental Learning Based on Algorithmic Probability
Proceedings of the 6th Israeli Conference on Artificial Intelligence, Computer Vision and Pattern Recognition, Tel Aviv, Israel, pp. 515-527, 1989

R. Solomonoff. The Application of Algorithmic Probability to Problems in Artificial Intelligence
in: L.N. Kanal and J.F. Lemmer (Eds.), Uncertainty in Artificial Intelligence, Elsevier Science Publishers B.V., pp. 473-491, 1986

Universal Prediction, Game Theory, and Pattern Recognition
Nicolo Cesa-Bianchi, Dipartimento di Tecnologie dell'Informazione, Università degli Studi di Milano, Crema, Italy

Prediction of individual sequences is concerned with the problem of guessing, in a sequential fashion, the elements of an unknown data sequence without making any assumptions on the way the sequence is generated. A predictor is universal for a class of reference predictors if its performance on each individual data sequence is close to that of the best reference predictor for that sequence. In the talk we illustrate the evolution of universal prediction from the theory of iterated games, we describe a general methodology for constructing universal predictors, and we show applications to the analysis of algorithms for pattern recognition.

Optimal Ordered Problem Solver
Juergen Schmidhuber, IDSIA, Manno-Lugano, Switzerland

We extend principles of non-incremental universal search to build a novel, optimally fast, incremental learner that is able to improve itself through experience. The Optimal Ordered Problem Solver (OOPS) finds a program that solves each task in a sequence of tasks, such as problems of optimization or prediction. It continually and efficiently organizes and exploits previously found solutions to earlier tasks.

Programs are composed from primitives, which may include assembler-like instructions as well as time-consuming software such as neural network algorithms or theorem provers. It is essential that those primitives whose runtimes are not known in advance can be interrupted by OOPS at any time.

The initial bias is embodied by a task-dependent probability distribution on possible program prefixes (pieces of code that may continue). Prefixes are self-delimiting and executed in online fashion while being generated. Let p(n) denote a found prefix solving the first n tasks. It may exploit previous solutions p(i) (i smaller than n) stored in non-modifiable memory, by calling them as subprograms, or by copying them into modifiable memory and editing the copies before executing them. We provide equal resources for two searches that run in parallel until p(n+1) is discovered and stored in non-modifiable memory. The first search is exhaustive; it systematically tests all possible prefixes on all tasks up to n+1. The second search is much more focused; it only searches for prefixes that start with p(n), and only tests them on task n+1, which is safe, because we already know that such prefixes solve all tasks up to n. Both searches are depth-first and bias-optimal: the branches of the search trees are program prefixes, and backtracking is triggered once the sum of the runtimes of the current prefix on all current tasks exceeds the prefix probability multiplied by the total search time so far.

A prefix may use information conveyed by previously frozen programs to assign self-computed probabilities to its own possible continuations or suffixes, thus temporarily rewriting the search procedure itself. This results in time-optimal search for faster search methods (metasearching).

In illustrative experiments, our self-improver first learns to produce longer and longer samples of a simple context free language. Then it demonstrates incremental knowledge transfer, using its experience with the seemingly unrelated language tasks to find a prefix that rewrites the search procedure such that it discovers a suffix solving all Towers of Hanoi tasks up to solution size 10^9. Previous reinforcement learners and non-learning AI planners already fail for solution sizes exceeding 10^2 and 10^5, respectively.

Since OOPS will scale to larger problems in essentially unbeatable fashion, we also examine its physical limitations.

J. Schmidhuber. Optimal Ordered Problem Solver. TR IDSIA-12-02, 31 July 2002. Gzipped postscript , PDF version , also published in the public computer science archive . See the OOPS webpage for slides and further details.

Universal Learning: A View of a Bayesian
Ilya Nemenman, Kavli Institute for Theoretical Physics, University of California, Santa Barbara, USA

We will discuss applications of universal learning schemes for predicting the next sample point taken from an unknown probability distribution. We will analyze various No Free Lunch theorems, as well as arguments rooted in our ability to reparameterize the sample space to show that, if learning succeeds, it is because there are a priori, non-universal measures (maybe implicit) defined on the space of possible solutions. This simple observation relates Bayesian statistics to the theory of Kolmogorov complexity, and we will discuss the appropriate extension of the theory.

The Similarity Metric
Paul Vitanyi, Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands

A new class of similarity measures appropriate for measuring relations between sequences is studied. The "normalized information distance", based on the noncomputable notion of Kolmogorov complexity is shown to be universal in this class (discovers all effective similarities). We demonstrate that it is a metric and takes values in [0,1]. This is a theory foundation for a new general practical tool. We give two distinctive applications in widely divergent areas (the experiments by necessity use just computable approximations to the target notions).

First, we computationally compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we give fully automatically computed language tree of 52 different language based on translated versions of the "Universal Declaration of Human Rights''.

Ming Li, Xin Chen, Xin Li, Bin Ma, Paul Vitanyi, The similarity metric
Proc. 14th ACM-SIAM Symp. Discrete Algorithms (SODA), 2003.

Universal Sequential Decisions based on Algorithmic Probability
Marcus Hutter, IDSIA, Manno-Lugano, Switzerland

Decision theory formally solves the problem of rational agents in uncertain worlds if the true environmental probability distribution is known. Solomonoff's theory of universal induction formally solves the problem of sequence prediction for unknown distribution. We unify both theories and present strong arguments that the resulting universal AIXI model behaves optimally in any computable environment. The major drawback of the AIXI model is that it is incomputable. To overcome this problem, we construct a modified algorithm AIXI^tl, which is still superior to any other time t and space l bounded agent. The computation time of AIXI^tl is of the order t x 2^l.

M. Hutter. Self-Optimizing and Pareto-Optimal Policies in General Environments based on Bayes-Mixtures
Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT-2002), cs.AI/0204040

M. Hutter. Towards a Universal Theory of Artificial Intelligence based on Algorithmic Probability and Sequential Decisions
Proceedings of the 12th European Conference on Machine Learning (ECML-2001) 226-238

Back to main Workshop Page