We will describe recent developments in a system for machine learning that we've been working on for some time. The system is designed to learn to solve two kinds of problems. Most, but not all problems in science and engineering are of these two kinds.
The first kind is function inversion. These are the P and NP problems of computational complexity theory. They include theorem proving, solution of equations, symbolic integration, etc.
The second kind of problem is time limited optimization. Inductive inference, surface reconstruction, and image restoration are a few examples of this kind of problem. Designing an automobile in 6 months satisfying certain specifications and having minimal cost, is another.
In the following discussion, we will be using the term "probability" in a special sense: i.e. the estimate given by the best probabilistic model that we can find in the available time.
Our system starts out with a small set of Problem Solving Techniques (PST's) and a simple General Conditional Probability Distribution (GCPD). When the system is given a problem, the description of this problem is the "Condition" for the GCPD. Its output is a probability distribution on PST's - the likelihood that each of them will be the best way to solve the problem. It uses these PST's and their associated probabilities to solve the first problem.
Next, it executes its update algorithm: The PST's are modified, new ones may be added, some may be deleted. The GCPD is modified. These changes attempt to incorporate into the system, information obtained in solving recent problems, and prepare it to solve harder problems.
The next problem presentation and the system's solution is followed by the corresponding update and so on. After the nth update, the system is usually able to work problems more difficult than the nth. Giving the system a suitable training sequence of problems of progressive difficulty, makes it possible for the system to eventually solve very hard problems.
One critical component in the system is the initial set of PST's. These will be Levin's Universal Search Algorithm (Lsearch). Simple Lsearch is only practical for very small problems, since its solution time is exponential in problem size. In the update phase, we modify the probability distribution that is used to guide Lsearch. This effectively reduces the sizes of more difficult problems, so that Lsearch becomes feasible for them.
The update algorithm is another critical component. It has been designed to facilitate "transfer learning", so that the system can utilize any similarities between problems in the same or disparate domains. It has been possible to express this algorithm as an inductive inference problem of a type the system normally solves - so we can simply ask the system to update itself as one of its regular problems.
Perhaps the most critical component of all is the training sequence - To devise a sequence of problems so that each of them challenges the system to devise new concepts to solve it, yet keeping the challenges within the capacity of the machine. For PST's that use simple Lsearch, this is not hard to do. It is always possible to get a good estimate of the time needed for the system to find a known solution to a problem. For certain modifications of Lsearch this is not so easy and designing training sequences can become as difficult as teaching people.
The system's generality invites comparison with Genetic Programming. We will discuss differences in the two approaches to optimization problems, and suggest a way in which Genetic Programming might be made a part of the present system.
Another way to look at the system is to think of it as a "Wrapper" system - a system that modifies its parameters in view of its experience with earlier problem solving. The present system is an "Extreme Wrapper System" in that - through the Magic of Universal Search - it is possible for the entire system to effectively replace itself by one that is more efficient!
R. Solomonoff. Progress
in Incremental Machine Learning
Preliminary Report (2002), http://world.std.com/~rjs/nips02.pdf
R. Solomonoff. A System
for Incremental Learning Based on Algorithmic Probability
Proceedings of the 6th Israeli Conference on Artificial Intelligence, Computer Vision and Pattern Recognition, Tel Aviv, Israel, pp. 515-527, 1989
R. Solomonoff. The
Application of Algorithmic Probability to Problems in Artificial Intelligence
in: L.N. Kanal and J.F. Lemmer (Eds.), Uncertainty in Artificial Intelligence, Elsevier Science Publishers B.V., pp. 473-491, 1986
Prediction of individual sequences is concerned
with the problem of guessing, in a sequential
fashion, the elements of an unknown data sequence
without making any assumptions on the way the sequence
is generated. A predictor is universal for a class
of reference predictors if its performance on each
individual data sequence is close to that of the
best reference predictor for that sequence.
In the talk we illustrate the evolution of universal
prediction from the theory of iterated games, we
describe a general methodology for constructing
universal predictors, and we show applications to
the analysis of algorithms for pattern recognition.
We extend principles of non-incremental universal search to build a novel, optimally fast, incremental learner that is able to improve itself through experience. The Optimal Ordered Problem Solver (OOPS) finds a program that solves each task in a sequence of tasks, such as problems of optimization or prediction. It continually and efficiently organizes and exploits previously found solutions to earlier tasks.
Programs are composed from primitives, which may include assembler-like instructions as well as time-consuming software such as neural network algorithms or theorem provers. It is essential that those primitives whose runtimes are not known in advance can be interrupted by OOPS at any time.
The initial bias is embodied by a task-dependent probability distribution on possible program prefixes (pieces of code that may continue). Prefixes are self-delimiting and executed in online fashion while being generated. Let p(n) denote a found prefix solving the first n tasks. It may exploit previous solutions p(i) (i smaller than n) stored in non-modifiable memory, by calling them as subprograms, or by copying them into modifiable memory and editing the copies before executing them. We provide equal resources for two searches that run in parallel until p(n+1) is discovered and stored in non-modifiable memory. The first search is exhaustive; it systematically tests all possible prefixes on all tasks up to n+1. The second search is much more focused; it only searches for prefixes that start with p(n), and only tests them on task n+1, which is safe, because we already know that such prefixes solve all tasks up to n. Both searches are depth-first and bias-optimal: the branches of the search trees are program prefixes, and backtracking is triggered once the sum of the runtimes of the current prefix on all current tasks exceeds the prefix probability multiplied by the total search time so far.
A prefix may use information conveyed by previously frozen programs to assign self-computed probabilities to its own possible continuations or suffixes, thus temporarily rewriting the search procedure itself. This results in time-optimal search for faster search methods (metasearching).
In illustrative experiments, our self-improver first learns to produce longer and longer samples of a simple context free language. Then it demonstrates incremental knowledge transfer, using its experience with the seemingly unrelated language tasks to find a prefix that rewrites the search procedure such that it discovers a suffix solving all Towers of Hanoi tasks up to solution size 10^9. Previous reinforcement learners and non-learning AI planners already fail for solution sizes exceeding 10^2 and 10^5, respectively.
Since OOPS will scale to larger problems in essentially unbeatable fashion, we also examine its physical limitations.
Optimal Ordered Problem Solver.
TR IDSIA-12-02, 31 July 2002.
also published in the
public computer science archive
OOPS webpage for slides and further details.
We will discuss applications of universal learning schemes for
predicting the next sample point taken from an unknown probability
distribution. We will analyze various No Free Lunch theorems, as well as
arguments rooted in our ability to reparameterize the sample space to
show that, if learning succeeds, it is because there are a priori,
non-universal measures (maybe implicit) defined on the space of possible
solutions. This simple observation relates Bayesian statistics to the
theory of Kolmogorov complexity, and we will discuss the appropriate
extension of the theory.
A new class of similarity measures appropriate for measuring relations between sequences is studied. The "normalized information distance", based on the noncomputable notion of Kolmogorov complexity is shown to be universal in this class (discovers all effective similarities). We demonstrate that it is a metric and takes values in [0,1]. This is a theory foundation for a new general practical tool. We give two distinctive applications in widely divergent areas (the experiments by necessity use just computable approximations to the target notions).
First, we computationally compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we give fully automatically computed language tree of 52 different language based on translated versions of the "Universal Declaration of Human Rights''.
Ming Li, Xin Chen, Xin Li, Bin Ma, Paul Vitanyi,
The similarity metric
Proc. 14th ACM-SIAM Symp. Discrete Algorithms (SODA), 2003.
Decision theory formally solves the problem of rational agents in uncertain worlds if the true environmental probability distribution is known. Solomonoff's theory of universal induction formally solves the problem of sequence prediction for unknown distribution. We unify both theories and present strong arguments that the resulting universal AIXI model behaves optimally in any computable environment. The major drawback of the AIXI model is that it is incomputable. To overcome this problem, we construct a modified algorithm AIXI^tl, which is still superior to any other time t and space l bounded agent. The computation time of AIXI^tl is of the order t x 2^l.
M. Hutter. Self-Optimizing and Pareto-Optimal Policies
in General Environments based on Bayes-Mixtures
Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT-2002), cs.AI/0204040
M. Hutter. Towards a Universal Theory of Artificial
Intelligence based on Algorithmic Probability and Sequential Decisions
Proceedings of the 12th European Conference on Machine Learning (ECML-2001) 226-238