Efficient Skyline and Top-k Retrieval in Subspaces

 

Yufei Tao, Xiaokui Xiao, Jian Pei

 

IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(8): 1072-1088, 2007

 
  Abstract

  
  Skyline and top-k queries are two popular operations for preference retrieval. In practice, applications that require these operations usually provide numerous candidate attributes, whereas, depending on their interests, users may issue queries regarding different subsets of the dimensions. The existing algorithms are inadequate for subspace skyline/top-k search because they have at least one of the following defects: they (i) require scanning the entire database at least once; (ii) are optimized for one subspace but incur significant overhead for other subspaces; (iii) demand expensive maintenance cost or space consumption.

  In this paper, we propose a technique, SUBSKY, which settles both types of queries using purely relational technologies. The core of SUBSKY is a transformation that converts
multidimensional data to 1D values. These values are indexed by a simple B-tree, which allows us to answer subspace queries by accessing a fraction of the database. SUBSKY entails low maintenance overhead, which equals the cost of updating a traditional B-tree. Extensive experiments with real data confirm that our technique outperforms alternative solutions significantly in both efficiency and scalability.
 

  Paper download

  
   The short version appeared in ICDE 2006.
 
  Implementation and datasets
 
  Before you proceed with downloading, please read and agree to the terms of using our implementation.
 
  Source codes SUBSKY and BBS (implemented: by Xiaokui Xiao).
  Datasets used in our experiments: NBA, Color, House.

  Dataset description: NBA contains 17k 8-dimensional points, where each point corresponds to the statistics of a player in 8 categories. These categories include the numbers of points scored, rebounds, assists, steals, blocks, field goals attempted, free throws, and three-point shots, all averaged over the number of minutes played. Household consists of 127k 6-dimensional tuples, each of which represents the percentage of an American family’s annual income spent on 6 types of expenditure: gas, electricity, water, heating, insurance, and property tax. Color is a 9-dimensional dataset with a cardinality 68k, and a tuple captures several properties of an image. Specifically, each image is encoded in the HSV space, and those 9 dimensions record the mean, standard deviation, and skewness of all the pixels in the H, S, and V channels, respectively. All the values are normalized into the unit range [0, 1].

  File format: A d-dimensional dataset has the following format:
               <object id> <space> <a_1> <space> <a_2> <space> ... <space> <a_d>
where a_i stands for the i-th attribute of a point.
 

 

Back to Yufei's home, or publication list