A dataset collection
Maintained by Yufei Tao
This page contains some real and synthetic datasets that I collected. Some of them were used in the experiments of my previous papers.
This site is under construction.
Color histogram (real, 32-dimensional, ~68k)
Release date: March 3, 2005.
[Version 2] (size 67917)
Released on 1 Sep. 2009.
Description.
Each point represents the color histogram of an image in the Corel collection.
Format.
Each line contains the id and coordinates of a point:
id x1 x2 x3 ... x32
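For readers who want to load this file, here is a minimal sketch of a parser for the line format above. It assumes that fields are separated by whitespace and that ids are integers, neither of which is stated on this page; the function name read_points is made up for illustration. Several other text datasets on this page use the same "id followed by coordinates" layout, so the dimensionality is a parameter.

```python
import io


def read_points(stream, dim):
    """Parse lines of the form 'id x1 x2 ... x_dim' into {id: [coordinates]}.

    Assumptions (not confirmed by the page): whitespace-separated fields
    and integer ids.
    """
    points = {}
    for line_no, line in enumerate(stream, start=1):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != dim + 1:
            raise ValueError(
                f"line {line_no}: expected {dim + 1} fields, got {len(fields)}"
            )
        points[int(fields[0])] = [float(v) for v in fields[1:]]
    return points


# Tiny in-memory example with 2-dimensional points.
demo = read_points(io.StringIO("1 0.5 0.25\n2 1 0\n"), dim=2)
```

For the color histogram file one would open the downloaded file and call read_points(f, dim=32).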
Mnist (real, 50-dimensional, 59999)
[Download]
Released on 1 Sep. 2009.
Source
Description
The original Mnist dataset is a set of 60,000 points. Each point is
784-dimensional, capturing the pixel values of a 28x28 image. Since
most pixels are insignificant, we reduced the dimensionality by keeping the 50
dimensions with the largest variances. The version above also removes a single
point that coincides with another one, leaving 59,999 points.
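The reduction described above can be sketched as follows. This is an illustration only, not the script that produced the file: the function names are made up, the population variance is used, the kept dimensions are left in their original order, and the duplicate is dropped after the projection. None of these details is specified on this page.

```python
from statistics import pvariance


def top_variance_dims(points, k):
    """Indices of the k dimensions with the largest variance, ascending."""
    dim = len(points[0])
    variances = [pvariance([p[d] for p in points]) for d in range(dim)]
    ranked = sorted(range(dim), key=lambda d: variances[d], reverse=True)
    return sorted(ranked[:k])


def project(points, dims):
    """Keep only the given dimensions of every point."""
    return [[p[d] for d in dims] for p in points]


def drop_duplicates(points):
    """Keep the first occurrence of each distinct point."""
    seen = set()
    unique = []
    for p in points:
        key = tuple(p)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


# Toy example: dimension 0 is constant, so it is the one discarded for k=2.
toy = [[0, 1, 5], [0, 2, 9], [0, 3, 1], [0, 2, 9]]
kept = top_variance_dims(toy, k=2)
reduced = drop_duplicates(project(toy, kept))
```

On the real data one would call top_variance_dims with k=50 on the 784-dimensional points.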
Format.
Each line contains the id and coordinates of a point:
id x1 x2 x3 ... x50
Signature (synthetic, metric data, 50k)
[download]
Release date: March 3, 2005.
Description: Each object is a string of 65 English letters. We first obtain 20 "anchor signatures", whose letters are randomly chosen from the alphabet. Then, each anchor produces a "cluster" of 2.5k objects (for a total cardinality of 50k), each of which is obtained by randomly changing x positions in its anchor signature to other (random) letters, where x itself is uniformly distributed in the range [1, 18]. The similarity between two strings is measured by their "edit distance", i.e., the smallest number of edits (adding, removing, or modifying a letter) required to convert one string into the other.
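The generation procedure and the distance function above can be sketched as follows. This is not the original generator. It assumes a lowercase 26-letter alphabet, that the x changed positions are distinct, and that each changed letter differs from the one it replaces; the description leaves these points open. The names edit_distance and make_signatures are made up for illustration.

```python
import random
import string


def edit_distance(s, t):
    """Smallest number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def make_signatures(num_anchors=20, cluster_size=2500, length=65,
                    max_changes=18, seed=0):
    """Generate num_anchors clusters of perturbed copies of random anchors."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase  # assumption: 26 lowercase letters
    data = []
    for _ in range(num_anchors):
        anchor = [rng.choice(alphabet) for _ in range(length)]
        for _ in range(cluster_size):
            sig = anchor[:]
            x = rng.randint(1, max_changes)  # uniform in [1, max_changes]
            for pos in rng.sample(range(length), x):  # distinct positions
                sig[pos] = rng.choice([c for c in alphabet if c != sig[pos]])
            data.append("".join(sig))
    return data
```

With the default arguments make_signatures returns 50k strings. Two objects from the same cluster each differ from their anchor in at most 18 positions, so their edit distance is at most 36.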
Format: Binary. The byte ordering is little-endian. An integer or a float takes 4 bytes, and a char takes 1 byte.
The first 8 bytes (2 integers) constitute the header: the first number gives the dataset cardinality, while the second indicates the storage size of an object (in bytes). The object data are then stored sequentially after the header, each object in this format:
1. signature string of 65 char
2. a dummy field of 1 float
3. object id of 1 integer.
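The layout above can be read with a short sketch like the one below. It assumes that the three fields are packed back to back with no padding (65 + 4 + 4 = 73 bytes per object) and that the header integers are signed; the page states neither. The sketch therefore checks the object size stored in the header and refuses to continue on a mismatch, in which case the format string would need adjusting. The name read_signatures is made up for illustration.

```python
import struct

HEADER = struct.Struct("<ii")     # cardinality, object size in bytes
RECORD = struct.Struct("<65sfi")  # signature, dummy float, object id


def read_signatures(buf):
    """Decode a bytes object in the layout above into [(id, signature)]."""
    count, obj_size = HEADER.unpack_from(buf, 0)
    if obj_size != RECORD.size:
        # Assumption violated: the file probably contains padding bytes.
        raise ValueError(
            f"header reports {obj_size}-byte objects, expected {RECORD.size}"
        )
    records = []
    offset = HEADER.size
    for _ in range(count):
        raw, _dummy, obj_id = RECORD.unpack_from(buf, offset)
        records.append((obj_id, raw.decode("ascii")))
        offset += obj_size
    return records
```

To use it on the downloaded file, read the whole file in binary mode (open(path, "rb").read()) and pass the resulting bytes to read_signatures.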
From: Generated by Man Lung Yiu (mlyiu2 AT cs.hku.hk)
Aircraft snapshot (synthesized from real, 3-dimensional, 100k)
[download]
Release date: March 3, 2005.
Description: First, 2000 points are sampled from CA to serve as "airports". The "reported location" of an "airplane" consists of 3 values a, b, c, corresponding to its spatial location (a, b) and altitude c. To obtain the first two numbers, we randomly choose two airports as the aircraft's source and destination; then (a, b) is set to a random point on the segment connecting the two airports. The value c is drawn uniformly from a (normalized) range [0, 10000]. Aircraft contains the 100k aircraft thus obtained.
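The procedure above can be sketched as follows. This is an illustration, not the original generator: it assumes that the source and destination airports are distinct, that the position along the segment is uniform, and that ids start from 0. The function name make_aircraft is made up, and the airports are passed in as a list of (x, y) pairs.

```python
import random


def make_aircraft(airports, count, max_altitude=10000.0, seed=0):
    """Return [(id, a, b, c)] with (a, b) on a segment between two airports."""
    rng = random.Random(seed)
    rows = []
    for obj_id in range(count):
        # Assumption: source and destination are two different airports.
        (x1, y1), (x2, y2) = rng.sample(airports, 2)
        t = rng.random()  # assumption: uniform position along the segment
        a = x1 + t * (x2 - x1)
        b = y1 + t * (y2 - y1)
        c = rng.uniform(0.0, max_altitude)
        rows.append((obj_id, a, b, c))
    return rows
```

With 2000 airports sampled from the CA points and count=100000, this yields a dataset of the same shape as Aircraft.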
Format: Text. Each line represents an object, in the format of "id a b c", where a, b and c are as described above.
From: Generated by me.
California (real, 2-dimensional, ~62.6k)
[download]
Release date: March 3, 2005.
Description: Each point represents a location in CA.
Format: Text. Each line represents an object, in the format of "id x x y y", where x and y are the coordinates of the point (i.e., each coordinate is stored twice).
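Because each coordinate is stored twice, a point line has the same five-field shape as the "id x0 x1 y0 y1" rectangle lines of the Long Beach dataset on this page, with a degenerate extent on both axes. A minimal parser sketch, assuming whitespace-separated fields and integer ids (the name parse_point is made up for illustration):

```python
def parse_point(line):
    """Parse 'id x x y y' into (id, x, y), checking that each pair agrees."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    obj_id = int(fields[0])  # assumption: ids are integers
    x_lo, x_hi, y_lo, y_hi = (float(v) for v in fields[1:])
    if x_lo != x_hi or y_lo != y_hi:
        raise ValueError("coordinates are not duplicated; this is not a point")
    return obj_id, x_lo, y_lo
```

For example, parse_point("17 0.25 0.25 0.75 0.75") returns (17, 0.25, 0.75).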
From: US Census Bureau (Tiger collection).
Long Beach (real, 2-dimensional, ~53k)
[download]
Release date: March 3, 2005.
Description: Each rectangle represents a region in the Long Beach county.
Format: Text. Each line represents a rectangular object, in the format of "id x0 x1 y0 y1", where [x0, x1] is the projection of the rectangle on the x-axis, and [y0, y1] its projection on the y-axis.
From: US Census Bureau (Tiger collection).
North America (real, 2-dimensional, ~569k)
[download]
Release date: March 3, 2005.
Description: Each point represents a location in North America.
Format: Text. Each line represents a point in the format of "id x x y y", i.e., the x and y coordinates are each stored twice.
From: US Census Bureau (Tiger collection).
Color moments (real, 9-dimensional, ~68k)
[download]
Release date: August 1, 2005.
Description: Each object contains 9 numbers representing the color moments of an image.
Format: Text. Each line represents an object, in the format of "id att1 att2 ... att9".
From: http://kdd.ics.uci.edu/
American household expenditure (real, 6-dimensional, ~127k)
[download]
Release date: August 1, 2005.
Description: Each object contains 6 numbers that represent the expenditure of an American family on insurance, property tax, electricity, gas, water, and fuel, respectively. Each expenditure is represented as a percentage of the family's annual income.
Format: Text. Each line represents an object, in the format of "id att1 att2 ... att6".
From: http://www.ipums.org/
NBA (real, 13-dimensional, ~16k)
[download]
Release date: August 1, 2005.
Description: Each object represents the performance statistics of a player (in regular seasons) in 13 categories, including the number of points scored, assists, rebounds, etc., all averaged over the number of minutes played. The dataset contains about 16k such objects.
Format: Text. Each line represents an object, in the format of "id att1 att2 ... att13".
From: Jian Pei and http://www.nba.com/