"High-dimensional Computational Geometry"
Piotr Indyk
Stanford University

(click here to download the PS file)

Abstract

Consider the following problem: given a database of points in a multidimensional space, construct a data structure which given any query point finds the database point(s) closest to it. This problem, called nearest neighbor search has been of great importance to several areas of computer science, including pattern recognition, databases, vector compression, computational statistics and data mining. Many of the above applications involve data sets whose size and dimensionality are very large. Therefore, it is crucial to design algorithms which scale well with the database size as well as with the dimension.

The nearest neighbor problem is an example of a large class of proximity problems . Apart from the nearest neighbor, the class contains problems like closest pair, diameter (or furthest pair), minimum spanning tree and clustering problems. In the latter case the goal is to find a partition of points into k clusters, in order to minimize a certain function. Example functions are: the sum of the distances from each point to its nearest cluster representative (this problem is called k-median), the maximum such distance (k-center), the sum of all distances between points in same clusters (k-clustering), etc.

Since these problems are ubiquitous, they have been investigated in computer science for a long while (e.g., in computational geometry). As a result of this research effort, many efficient solutions have been discovered for the case when the points lie in a space of small dimension . Unfortunately, their running time grows exponentially with the dimension.

In this thesis we show that it is in fact possible to obtain efficient algorithms for the aforementioned problems, if we are satisfied with answers which are approximate . The running time of our algorithms for the aforementioned problems has only polynomial dependence on the dimension, and sublinear (for the nearest neighbor problem) or subquadratic (for the closest pair, minimum spanning tree, clustering etc.) dependence on the number of input points. The highlights of the results include: