Source Code, Datasets, and Comparative Experimental Results

Frequent Items in Streaming Data:
An Experimental Evaluation of the State-of-the-Art

(doi:10.1016/j.datak.2008.11.001)

Nishad Manerikar and Themis Palpanas



The problem of detecting frequent items in streaming data is relevant to many applications across many domains. Several algorithms, diverse in nature, have been proposed in the literature to solve this problem. In this paper, we review these algorithms and present the results of the first extensive comparative experimental study of the most prominent among them. The algorithms were comprehensively tested using a common test framework on several real and synthetic datasets. We studied their performance with respect to both parameters intrinsic to the algorithms and data-related parameters. We report the results and the insights gained through these experiments.


Journal Publication


Source Code

You may freely use this code for research purposes, provided that you acknowledge the authors with the following reference:

Nishad Manerikar, Themis Palpanas. Frequent Items in Streaming Data: An Experimental Evaluation of the State-of-the-Art. Data and Knowledge Engineering (DKE) 68(4), 2009: 415-430.

@ARTICLE{dbTrentoFrequentItemsDKE,
   AUTHOR={Nishad Manerikar and Themis Palpanas},
   TITLE={{Frequent Items in Streaming Data: An Experimental Evaluation of the State-of-the-Art}},
   JOURNAL={Data Knowl. Eng. (DKE)},
   VOLUME={68},
   NUMBER={4},
   PAGES={415--430},
   YEAR=2009}


Synthetic Datasets

The synthetic datasets were generated according to a Zipfian distribution. We generated datasets with size N ranging from 10,000 to 100,000,000 items, item domain cardinality M from 65,000 to 1,000,000, and Zipf parameter Z from 0.6 to 3.5. The parameters used in each run are stated explicitly in the discussion of each experiment. For each choice of these data parameters, we generated several independent datasets and repeated each experiment on all of them.
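
As an illustration of how such a dataset could be produced, the following is a minimal sketch in Python. It is not the generator used for the paper, which is not described on this page. It assumes the usual Zipf definition, in which the item of rank k (1 <= k <= M) is drawn with probability proportional to 1/k^Z. The function name zipf_stream and the seed parameter are hypothetical; N, M, and Z correspond to the parameters above.

```python
import bisect
import itertools
import random
from collections import Counter


def zipf_stream(n, m, z, seed=0):
    """Yield n items from {1, ..., m}; rank k has weight k**(-z).

    Hypothetical helper for illustration only, not the authors' generator.
    """
    rng = random.Random(seed)
    # Cumulative unnormalized weights: cum[k-1] = sum of i**(-z) for i = 1..k.
    cum = list(itertools.accumulate(k ** -z for k in range(1, m + 1)))
    total = cum[-1]
    for _ in range(n):
        # Inverse-CDF sampling: find the first rank whose cumulative
        # weight exceeds a uniform draw from [0, total).
        u = rng.random() * total
        idx = bisect.bisect_right(cum, u)
        # Guard against a floating-point draw that rounds up to total.
        yield min(idx, m - 1) + 1


# One independent dataset with N = 10,000, M = 65,000, Z = 1.1.
sample = list(zipf_stream(n=10_000, m=65_000, z=1.1, seed=42))
print(Counter(sample).most_common(3))
```

Changing the seed yields an independent dataset for the same choice of N, M, and Z, which mirrors the repetition over several datasets described above. The pure-Python loop is practical only for the smaller sizes; the largest datasets (N up to 100,000,000) would call for a vectorized or compiled implementation.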


Real Datasets

If you also use these datasets, please include the proper acknowledgements and references, such as those that appear in our technical report.