|
Let R be a set of objects. An object o in R is an
outlier, if there exist less than k objects in R whose
distances to o are at most r. The values of k, r,
and the distance metric are provided by a user at the run time. The
objective is to return all outliers with the smallest I/O cost.
This paper considers a generic version of the problem, where no information
is available for outlier computation, except for objects' mutual distances.
We prove an upper bound for the memory consumption which permits the
discovery of all outliers by scanning the dataset 3 times. The upper bound
turns out to be extremely low in practice, e.g., less than 1% of R.
Since the actual memory capacity of a realistic DBMS is typically larger, we
develop a novel algorithm, which integrates our theoretical findings with
carefully-designed heuristics that leverage the additional memory to improve
I/O efficiency. Our technique reports all outliers by scanning the dataset
at most twice (in some cases, even once), and significantly outperforms the
existing solutions by a factor up to an order of magnitude.
|
Before you proceed with downloading, please read and agree to the
terms of using
our implementation.
Download the source codes
of our algorithm (implemented by
Xiaokui Xiao).
Real datasets used in our experiments:
CA,
Household,
Server.
Download
the generator for creating the Signature datasets.
Dataset formats:
CA: Each line corresponds to a location in California, in the form of:
tuple-id <space> x- <space> y-coordinate
Household: Each line describes the annual expenditure of an American
family:
tuple-id <space> property-insurance <space> property-tax <space> electricity-cost
<space> gas cost <space> water cost <space> fuel cost
Server: Each line captures the statistics of a network connection:
tuple-id <space> count <space> srv-count <space> dest-host-count
<space> dest-host-srv-count <space> dest-host-same-srv-count
The detailed semantics of the above attributes can be found here.
|