WST501 Fundamentals of Searching Web-scale Datasets

Offered by Yufei Tao, Spring 2012.

Brief description

In this course, we will study index structures and algorithms that are commonly applied in practice to support queries on large datasets. The course will be divided into three parts, which focus on data streams, multi-dimensional objects, and web-specific applications, respectively. Topics to be covered include sampling, hashing, sketch structures, R-trees, nearest neighbor search, instance optimality, and so on. We will also discuss a series of foundamental techniques for analyzing the performance of algorithms.

Announcements

News 6 (May 13): The final exam has been released.
News 5 (May 13): Exercise list 3 is out.
News 4 (Apr 16): Exercise list 2 is out.
News 3 (Apr 16): Assignment 2 has been released. The due day is May 16, i.e., a month from now.
News 2 (Mar 12): Exercise list 1 is out.
News 1 (Mar 12): Assignment 1 has been released. The due day is Apr 12, i.e., a month from now.

Time and venues

Mon 1pm - 3pm Room 209, Building E11
Wed 1pm - 2pm Same as above

Grading scheme

Assignments: 50%
Final: 50% (see below)

Style of the final exam:
During this course, the instructor will gradually generate a list of exercises without the solutions given. In the final exam, 60% of the marks will come from problems taken directly from that list. The rest 40%, however, will be from new problems.

Textbooks and Lecture notes

Teaching will be based on books, papers, and notes written by the instructor. Regarding books, the following are good references: The following is the list of paper references:
Textbooks, papers, and/or notes
Lecture 1:
Universal hashing

Sections 11.2 and 11.3 of Ref 1.

Lecture 2:
Bloom filter

Notes after typo correction (uploaded at 2:40pm, 14 Feb)

Old version 1 (uploaded at 10am, 8 Feb)

Lecture 3:
Markov, Chebyshev and Chernoff

Notes after correcting an error in Theorem 4 (uploaded at 12pm, 12 Mar)

Old version 1 (uploaded at 7:40pm, 10 Feb)

Lecture 4:
Flajolet-Martin sketch

Notes (uploaded at 8:10pm, 12 Feb)

Lecture 5:
Bloom-filter replacement

Notes (uploaded at 8:20pm, 26 Feb)

Lecture 6:
Count-min sketch (Part I)

Sec 1-4.1 of Ref 3

Lecture 7:
Count-min sketch (Part II)

Sec 4.2-4.3 of Ref 3

Lecture 8:
Count-min sketch (Part III)

Sec 5 of Ref 3

Lecture 9:
Shuffling and sampling

Notes (uploaded at 11:55pm, 1 Apr)


Lecture 10:
Sampling in a sliding window


Sec 1.1, 1.2, Sec 2 of Ref 4.

Lecture 11:
Logarithmic method

Notes (uploaded at 12:10am, 16 Apr)

Lecture 12:
First-in-first-out indexing

Notes after correcting a typo in Lemma 1 (uploaded at 10pm, 22 Apr)

Old version 1 (uploaded at 10pm, 22 Apr)

Lecture 13:
R-tree

Notes (uploaded at 9:30pm, 29 Apr)

Lecture 14:
Best-first algorithm

Notes (uploaded at 3:30pm, 13 May)


Exercises


As mentioned earlier, some final exam problems will be taken directly from the exercise lists below. No solution to any exercise will be given. You, however, are very welcome to present your solutions to the instructor, who will tell you whether the solutions are correct (if a solution is incorrect, the instruct will explain why).

List 1
List 2
List 3

Assignments and Exams

Assignment 1

Download the description here.

Submissions must be sent to the instructor by email before 11:59pm, 12 Apr, 2012.

Assignment 2

Download the description here.

Every student must submit a pdf by 11:59pm, 16 May, 2012 describing her/his solutions, which must be written in her/his own words.



Final exam

Download the description here.

Every student must submit a pdf by 11:59pm, 30 May, 2012 describing her/his solutions, which must be written in her/his own words.