Rafail Ostrovsky - Publications

Subquadratic Approximation Algorithms For Clustering Problems in High Dimensional Spaces

Allan Borodin, Rafail Ostrovsky and Yuval Rabani

Abstract:

One of the central problems in data-mining, computational biology, statistical analysis, computer vision, geographic analysis, pattern recognition, distributed protocols and the web hierarchical search engines (like Yahoo) is the question of classification of data according to some clustering rule. Often the data is noisy and even approximate classification is of extreme importance. The difficulty of such classification stems from the fact that usually the data has many incomparable attributes, and often results in the question of clustering problems in high dimensional spaces. Since they require measuring distance between every pair of data points, standard algorithms for computing the exact clustering solutions use quadratic or "nearly quadratic" running time; i.e. O(d n^2-a(d)) where n is the number of data points, d is the dimension of the space and a(d) approaches 0 as d grows. In this paper, we show (for three fairly natural clustering rules) that computing an approximate solution can be done much more efficiently. More specifically, for the agglomerative clustering (used, for example, in in the Alta Vista search engine), for clustering defined by sparse partitions, and for clustering based on minimum spanning trees we derive randomized (1+∈) approximation algorithms with running times Õ(d^2-y where Y > 0 depends only on the approximation parameter ∈ and is independent of the dimension d.

comment: Appeared in Mahine Learning Journal Special Issue: Theoretical Advances in Data Clustering (Guest Editors: Nina Mishra and Rajeev Motwani) 56 (1-3): 153-167, 2004
Preliminary version in proceedings of The 31'st ACM Symposium on Theory of Computing (STOC-99)

Fetch PostScript file of the paper Fetch PDF file of the paper

Back to Publications List