Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search
Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, Houda Benbrahim
Data series are a special type of high-dimensional data present in numerous domains. They pose a challenge to conventional data analysis techniques because, in many applications, their dimensionality reaches the thousands and their size multiple terabytes. A key operation in data series analysis pipelines is similarity search, which has been extensively studied in the data series literature, leading to the development of efficient exact indexing methods. In parallel, the high-dimensional community at large has focused its attention on the study of approximate similarity search techniques. In this paper, we propose a comprehensive taxonomy that reconciles the terminology used in these two domains, and we conduct a thorough experimental evaluation comparing approximate similarity search techniques under a unified framework, on synthetic and real datasets, both in memory and on disk. Although data series differ from generic high-dimensional vectors (they usually exhibit correlation between neighboring values), our results show that simple modifications to exact data series indexing techniques enable them to answer approximate similarity queries with strong guarantees and excellent empirical performance, on data series and vectors alike. In fact, these techniques outperform, in accuracy, efficiency, and memory footprint, the state-of-the-art approximate techniques designed for vectors on disk, while being very competitive in memory.
You may also be interested in our study on exact similarity search.
You may use this code as is for research purposes, provided that you properly acknowledge the authors using the following reference: Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, Houda Benbrahim. Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search. Proceedings of the VLDB Endowment (PVLDB) Journal, 2020.
- Repository with source code for all the algorithms used in the paper.
We produced a set of synthetic datasets with sizes ranging from 50 million to 250 million data series, each a random walk of length 256. Each point in a data series is produced as x_{i+1} = x_i + N(0,1), where N(0,1) is the standard normal distribution (we used seed 1184 for the datasets, and seed 14784 for the queries). The synthetic data generator code is included in the source code we are making available.
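The random-walk construction above can be sketched as follows. This is a hypothetical Python illustration, not the generator shipped with the source code; the function name and use of NumPy are our own choices, while the walk definition, length, and seeds come from the description above.

```python
import numpy as np

def generate_random_walks(n_series, length, seed):
    """Generate n_series random walks of the given length.

    Each step follows the recurrence x_{i+1} = x_i + N(0, 1),
    i.e. the walk is the cumulative sum of standard normal steps.
    """
    rng = np.random.RandomState(seed)          # fixed seed for reproducibility
    steps = rng.standard_normal((n_series, length))
    return np.cumsum(steps, axis=1)            # one walk per row

# Seeds as reported above: 1184 for the datasets, 14784 for the queries.
dataset = generate_random_walks(1000, 256, seed=1184)
queries = generate_random_walks(100, 256, seed=14784)
```

The seed argument fixes the pseudo-random stream, so the same call always reproduces the same walks.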
We used the following four real datasets:
1. SIFT1B: consists of Scale-Invariant Feature Transform (SIFT) vectors representing image feature descriptions. We obtained 200 million data series of size 128; the dataset size was 100GB.
2. Seismic: we used the IRIS Seismic Data Access repository to gather data series representing seismic waves from various locations (seismology). We obtained 100 million data series of size 256; the dataset size was 100GB.
3. SALD: includes brain MRI data (neuroscience). The dataset comprised 200 million data series of size 128; the dataset size was 100GB.
4. Deep1B: contains vectors extracted from the last layers of a convolutional neural network (computer vision). The dataset comprised 267 million vectors of size 96; the dataset size was 100GB.
We provide the full set of experimental results.
All rights reserved, 2019.