Random Sampling for Continuous Streams with Arbitrary Updates

 

Yufei Tao, Xiang Lian, Dimitris Papadias, Marios Hadjieleftheriou

 

IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(1): 96-110, 2007

 
  Abstract

  
The existing random sampling methods have at least one of the following disadvantages: they (i) are applicable only to certain update patterns, (ii) entail large space overhead, or (iii) incur prohibitive maintenance cost. These drawbacks prevent their effective application in stream environments (where a relation is updated by a large volume of insertions and deletions that may arrive in any order), despite the considerable success of random sampling in conventional databases.

Motivated by this, we develop several fully-dynamic algorithms for obtaining random samples from individual relations, and from the join result of two tables. Our solutions can handle any update pattern with small space and computational overhead. We also present an in-depth analysis that provides valuable insight into the characteristics of alternative sampling strategies and leads to precision guarantees. Extensive experiments validate our theoretical findings and demonstrate the efficiency of our techniques in practice.
 

  Paper download

  
 
  Implementation and datasets
 
Before you proceed with downloading, please read and agree to the terms of using our implementation.
 
Source codes for sampling a single relation, and sampling the joined result (implemented: by Xiang Lian).
Datasets and generator used in our experiments: the generator for synthetic datasets (see Note 1), Stock, Tickwise (see Note 2)

Note 1: The generator creates only insertions in the input stream. Deletions are dynamically generated in the implementation of the sampling algorithms.

Note 2
: Stock contains the values of 197 North American stocks during the period from 1993 through 1996, and Tickwise consists of the exchange rates from Swiss francs to US dollars recorded from August 7, 1990 to April 18, 1991. The series Stock (Tickwise) has 512k (279k) values in total. Again, deletions are generated in the sampling implementation.

File format
: All datasets are in the textual format. Each line in a synthetic dataset has the form "id <space> v", where id denotes the id of the tuple being inserted, and v its new value. Each line in Stock and Tickwise has the form "id <space> v <space> dummy", where id and v are the same as mentioned earlier, and dummy a dummy value that can be ignored.
 

 

Back to Yufei's home, or publication list