|
The existing random sampling methods have at least one of the following
disadvantages: they (i) are applicable only to certain update patterns, (ii)
entail large space overhead, or (iii) incur prohibitive maintenance cost.
These drawbacks prevent their effective application in stream environments
(where a relation is updated by a large volume of insertions and deletions
that may arrive in any order), despite the considerable success of random
sampling in conventional databases.
Motivated by this, we develop several fully-dynamic algorithms for obtaining
random samples from individual relations, and from the join result of two
tables. Our solutions can handle any update pattern with small space and
computational overhead. We also present an in-depth analysis that provides
valuable insight into the characteristics of alternative sampling strategies
and leads to precision guarantees. Extensive experiments validate our
theoretical findings and demonstrate the efficiency of our techniques in
practice.
|
Before you proceed with downloading, please read and agree to
the terms of using our implementation.
Source codes for sampling a single
relation, and sampling the
joined result
(implemented: by
Xiang Lian).
Datasets and generator used in our experiments: the
generator for synthetic
datasets (see Note 1), Stock,
Tickwise (see Note 2)Note 1:
The generator creates only insertions in the input stream. Deletions are
dynamically generated in the implementation of the sampling algorithms.
Note 2: Stock contains the values of 197
North American stocks during the period from 1993 through 1996, and
Tickwise consists of the exchange rates from Swiss francs to US dollars
recorded from August 7, 1990 to April 18, 1991. The series Stock (Tickwise)
has 512k (279k) values in total. Again, deletions are generated in the
sampling implementation.
File format: All datasets are in the textual format. Each line in a
synthetic dataset has the form "id <space> v", where id
denotes the id of the tuple being inserted, and v its new value. Each
line in Stock and Tickwise has the form "id <space>
v <space> dummy", where id and v are the same as
mentioned earlier, and dummy a dummy value that can be ignored.
|