Large Scale Nearest Neighbor Search - Theories, Algorithms, and Applications
Junfeng He
Published
2014
Publisher
[publisher not identified]
Pages
1
Description
We are witnessing an era of data explosion, in which huge data sets of billions or more samples, represented by high-dimensional feature vectors, can easily be found on the Web, in enterprise data centers, in surveillance sensor systems, and so on. On these large scale data sets, nearest neighbor search is fundamental to many applications, including content-based search/retrieval, recommendation, clustering, graph and social network research, as well as many other machine learning and data mining problems. Exhaustive search is the simplest and most straightforward way to perform nearest neighbor search, but it cannot scale to data sets of the sizes mentioned above.
To make large scale nearest neighbor search practical, the online search step must be sublinear in the database size, which means offline indexing is necessary. Moreover, to achieve sublinear search time, we usually need to sacrifice some search accuracy, and hence we can often obtain only approximate nearest neighbors instead of exact nearest neighbors. In other words, by large scale nearest neighbor search, we mean approximate nearest neighbor search methods with sublinear online search time via offline indexing.
To some extent, indexing a vector data set for (sublinear time) approximate search can be achieved by partitioning the feature space into different regions and mapping each point to its closest regions. There are different kinds of partition structures, for example, tree based partition, hashing based partition, clustering/quantization based partition, etc. From the viewpoint of how the data partition function is generated, the partition methods can be grouped into two main categories: 1. data-independent (random) partition, such as locality sensitive hashing, randomized trees/forests methods, etc.; 2. data-dependent (optimized) partition, such as compact hashing, quantization based indexing methods, and some tree based methods like kd-tree, PCA tree, etc.
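As a rough illustration of the contrast drawn above, the sketch below places an exhaustive linear scan next to a data-independent partition built from random hyperplanes (one common form of locality sensitive hashing). It is not the thesis's own method. The function names (`exhaustive_nn`, `make_hyperplanes`, `region_code`, `build_index`), the use of squared Euclidean distance, and the number of bits are all assumptions made for this example.

```python
import random
from collections import defaultdict


def exhaustive_nn(query, points):
    """Linear scan: index of the point closest to query (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(points)), key=lambda i: sq_dist(query, points[i]))


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian normal vectors (data-independent)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def region_code(point, hyperplanes):
    """Binary code naming the region: one sign bit per hyperplane."""
    return tuple(
        int(sum(w * x for w, x in zip(h, point)) >= 0.0) for h in hyperplanes
    )


def build_index(points, hyperplanes):
    """Offline indexing: map each region code to the ids of the points in it."""
    index = defaultdict(list)
    for i, p in enumerate(points):
        index[region_code(p, hyperplanes)].append(i)
    return index
```

The linear scan touches every point for each query, whereas the index groups points by region once, offline, so that a query only needs to inspect the regions near its own code.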
With the offline indexing/partitioning, online approximate nearest neighbor search usually consists of three steps: locate the query region that the query point falls in, obtain candidates, which are the database points in the regions near the query region, and rerank/return the candidates. For large scale nearest neighbor search, the key question is: how should the offline indexing be designed so that the online search performance is the best, or more specifically, so that the online search is as fast as possible while meeting a required accuracy?
In this thesis, we have studied theories, algorithms, systems and applications for (approximate) nearest neighbor search on large scale data sets, for both indexing with random partition and indexing with learning based partition. Our specific main contributions are:
1. We unify various nearest neighbor search methods into the data partition framework, and provide a general formulation of optimal data partition, which supports the fastest search speed while satisfying a required search accuracy. The formulation is general, and can be used to explain most existing (sublinear) large scale approximate nearest neighbor search methods.
2. For indexing with data-independent partitions, we have developed theories on the lower and upper bounds of their time and space complexity, based on the optimal data partition formulation. The bounds are applicable to a general group of methods called Nearest Neighbor Preferred Hashing and Nearest Neighbor Preferred Partition, including locality sensitive hashing, random forest, and many other random hashing methods. Moreover, we also extend the theory to study how to choose the parameters for indexing methods with random partitions.
3. For indexing with data-dependent partitions, we have applied the same formulation to develop a joint optimization approach with two important criteria: nearest neighbor preserving and region size balancing. We have applied the joint optimization to different partition structures such as hashing and clustering, and obtained several new nearest neighbor search methods that outperform (or are at least comparable to) state-of-the-art solutions for large scale nearest neighbor search.
4. We have further studied fundamental problems for nearest neighbor search beyond search methods, for example: what is the difficulty of nearest neighbor search on a given data set (independent of search methods)? What data properties affect the difficulty, and how? How will the theoretical analysis and algorithm design of the large scale nearest neighbor search problem be affected by the data set difficulty?
5. Finally, we have applied our nearest neighbor search methods to practical applications. We focus on the development of large visual search engines using the new indexing methods developed in this thesis. The techniques can be applied to other domains with data-intensive applications, and moreover, be extended to other applications beyond visual search engines, such as large scale machine learning, data mining, and social network analysis.
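The three online steps named in the abstract (locate the query region, gather candidates from nearby regions, rerank) can be sketched as follows. This is a minimal illustration under stated assumptions, not the indexing method developed in the thesis: it uses random hyperplane hashing as the partition, treats "nearby regions" as buckets within a Hamming radius of the query code, and reranks by squared Euclidean distance. The class name `HashIndex` and the parameters `num_bits`, `k`, and `radius` are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class HashIndex:
    """Toy random-hyperplane index (illustrative, not the thesis's method)."""

    def __init__(self, points, num_bits=8, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.planes = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)
        ]
        # Offline step: assign every database point to its region (bucket).
        self.buckets = defaultdict(list)
        for i, p in enumerate(points):
            self.buckets[self._code(p)].append(i)

    def _code(self, p):
        return tuple(
            int(sum(w * x for w, x in zip(h, p)) >= 0.0) for h in self.planes
        )

    def search(self, query, k=1, radius=1):
        # Step 1: locate the region the query falls in.
        code = self._code(query)
        # Step 2: collect candidates from regions within the Hamming radius.
        candidates = []
        for r in range(radius + 1):
            for flips in combinations(range(len(code)), r):
                probe = list(code)
                for j in flips:
                    probe[j] ^= 1  # flip one bit of the query code
                candidates.extend(self.buckets.get(tuple(probe), []))
        # Step 3: rerank candidates by exact distance and return the top k.
        candidates.sort(key=lambda i: sq_dist(query, self.points[i]))
        return candidates[:k]
```

A small `radius` inspects few buckets and is fast but may miss the true neighbor; setting `radius` to `num_bits` probes every bucket and degenerates to exhaustive search, which makes the speed versus accuracy trade-off discussed above concrete.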
Frequently Asked Questions
Who is the author of Large Scale Nearest Neighbor Search - Theories, Algorithms, and Applications?
Large Scale Nearest Neighbor Search - Theories, Algorithms, and Applications was written by Junfeng He.
When was Large Scale Nearest Neighbor Search - Theories, Algorithms, and Applications published?
The publication date for this specific edition is 2014. The original work may have been published on a different date.