WST540: Web Search and Text Analysis

Spring 2013

Professor: Yufei Tao
TA: Kyoungrok Jang

Brief Description

This course will cover modeling and algorithmic aspects of web search engines. Topics (subject to time permission) include tf-idf scoring, page ranks, inverted indexes, tries, suffix trees/arrays, q-grams, error-tolerant keyword search, and deep webs. On completion, students are expected to have acquired solid understanding on fundamental text retrieval and link analysis.

Announcements

News 1 (18 Mar): Exercise list 1 posted.
News 2 (26 Mar): Starting from the lecture on 27 Mar, class attendance will be taken during the first break.
News 3 (26 Mar): The first quiz will take place in the lecture of 3 Apr. The quiz will be 20 minutes long. Its scope covers Lectures 2 and 3.
News 4 (1 Apr): Exercise list 2 posted.
News 5 (9 Apr): Quiz 1 solutions posted.
News 6 (2 May): Exercise list 3 posted.
News 7 (2 May): Exercise list 4 posted.
News 8 (2 May): The second quiz will take place in the lecture of 15 May. The quiz will be 20 minutes long. Its scope covers Lectures 4-10.
News 9 (2 Jun): The make up class will start at 12pm, 7 June. The classroom is the same as usual, namely, Room 2222, N5.
News 10 (5 Jun): Exercise list 5 posted.
News 11 (5 Jun): The third quiz will take place in the lecture of 12 Jun. The quiz will be 20 minutes long. Its scope covers Lectures 13-15.
News 12 (9 Jun): Exercise list 6 posted.
News 13 (10 Jun): The FINAL EXAM will start at 1pm, 25 Jun. The duration is 2 hours. The venue is Room 2222, Building N5. The scope of the exam covers all lectures.

Time and Venues

Lecture
Wed 1pm - 4pm Room 2222, N5

Grading Scheme

Attendance: 30%
Quizzes: 30%
Final exam: 40%

Lecture Notes

No existing textbook contains all the materials in this course. The following is a good (and fun) reference book that covers about 30% of the contents:

Michael W. Berry, Murray Browne. Understand Search Engines: Mathematical Modeling and Text Retrieval. 2nd edition, SIAM Book Series: Software, Environments, and Tools. ISBN: 0-89871-581-4.

In addition, the following papers are nice reading related to this course:

[LM05] Langville, A., and Meyer, C. A Survey of Eigenvector Methods of Web Information Retrieval. SIAM Review, 47(1): 135-161, 2005.
[SZTJ12] Sheng, C., Zhang, N., Tao, Y., Jin, X. Optimal Algorithms for Crawling a Hidden Database in the Web. Proceedings of the VLDB Endowment (PVLDB), 5(11): 1112-1123, 2012.
[TS] Tao, Y., Sheng, C. Fast Nearest Neighbor Search with Keywords. To appear in IEEE Transactions on Knowledge and Data Engineering (TKDE).
[WF74] Wagner, R., and Fischer, M. The String-to-String Correction Problem. JACM, 21(1): 168-173, 1974.
[ZM06] Zobel, J., and Moffat, A. Inverted files for text search engines. ACM Comput. Surv. 38(2), 2006.

The instructor will make lecture notes available before each class. These notes are concise but complete description of the course contents. As usual, lecture attendance is vital to gaining thorough understanding.

Notes Suggested Readings
Lecture 1
Course Overview (uploaded at 8:30pm, 4 Mar)

--
Lecture 2
Vector Space Model (uploaded at 3:45pm, 5 Mar)

Sections 3.1 and 3.2.1 of the reference book
Lecture 3
Page Ranks (uploaded at 4:20pm, 12 Mar)
Correction: Added an assumption in Slide 8.

Old version (uploaded at 4:20pm, 12 Mar)

Section 7.2 of the reference book, and Section 4 of reference [LM05].
Lecture 4
Web Crawling (uploaded at 4:30pm, 26 Mar)

--
Lecture 5
Inverted Indexes: Part I (uploaded at 12:45pm, 14 May)
Correction 1: In Line 6 of the pseudocode in Slide 14, the division should be multiplication.

Old version 1 (uploaded at 3:30pm, 1 Apr)
Correction 1: Removed subscript i from w in Slide 9.
Correction 2: Fixed indentation in the algorithm of Slide 14

Old version (uploaded at 4:30pm, 26 Mar)

Section 3 of [ZM06]
Lecture 6
Inverted Indexes: Part II (uploaded at 8pm, 26 Mar)

Section 8 of [ZM06]
Lecture 7
Inverted Indexes: Part III (uploaded at 3:30pm, 1 Apr)

Section 10 of [ZM06]
Lecture 8
Inverted Indexes: Part IV (uploaded at 2:30am, 2 Apr)

Section 4 of [ZM06]
Lecture 9
Tries (uploaded at 5:52pm, 17 Apr)
Correction: Revised the formal definition of balanced tries in Slide 17.

Old version 2 (uploaded at 4:13pm, 10 Apr)
Correction: In Slide 9, the number n should be n + 1.

Old version 1 (uploaded at 10am, 9 Apr)

--
Lecture 10
Patricia Tries (uploaded at 7:40pm, 1 May)
Changes: Removed the last slide.

Old version (uploaded at 6pm, 16 Apr)

--
Lecture 11
Prefix Matching (uploaded at 6:55pm, 1 May)

--
Lecture 12
Suffix Trees and Arrays (uploaded at 4:45pm, 8 May)
Correction: Slide 22, "lcplen(left, q)" should be "lcplen(left, mid)".

Old version 2 (uploaded at 5:20pm, 6 May)
Correction 1: In Slide 14, the "S[m]" in the 3rd line should be "S[n]".
Correction 2: In Slide 19, there was a missing reference in the 2nd line, which has been filled in.

Old version 1 (uploaded at 1:25am, 1 May)

--
Lecture 13
Edit Distances: Part I (uploaded at 12am, 6 May)

Sections 2-4 of [WF74]
Lecture 14
Edit Distances: Part II (uploaded at 9am, 13 May)

Old version (uploaded at 6:50pm, 21 May)
Correction: On Slide 9, the cell at the row of m and the column of m should be 2.

--
Lecture 15
Approximate String Matching (uploaded at 6:50pm, 21 May)

--
Lecture 16
Nearest Neighbor Search with Keywords: Part I (uploaded at 7:30am, 3 Jun)

Old version (uploaded at 10:45pm, 26 May)
Correction 1: Problem definition in Slide 3 (now the data space's dimensions have integer domains)
Correction 2: Typo correction in Slide 6.

Sections 1-3, and 5 of [TS]
Lecture 17
Nearest Neighbor Search with Keywords: Part II (uploaded at 7:35am, 3 Jun)

Section 6 of [TS]
Lecture 18
Searching the Deep Web (uploaded at 12:30pm, 9 Jun)

Sections 1-2 of [SZTJ12]

Exercises, Quizzes and Exams

The following exercises are designed to help students understand the key concepts and algorithms taught in the lectures. Students are encouraged to attempt the excercises by themselves and compare their answers to the solutions given by the instructor.

Exercise List 1 (Solutions)
Exercise List 2 (Solutions)
Exercise List 3 (Solutions)
Exercise List 4 (Solutions)
Exercise List 5 (Solutions)
Exercise List 6 (Solutions)

Quiz 1 Solutions
Quiz 2 Solutions
Quiz 3 Solutions