Query Categorization
Pre-processing
- Stemming
- Abbreviation expansion
- Stopword filtering
- Misspelled words
- Location-based queries
- POS
- Person
- Place
- Thing
- Term extraction
Knowledge Base
- Lexicon (e.g., DBpedia person, location, organization, and product lists)
- Stop-word lexicon (e.g., of, the)
- Abbreviation lexicon (e.g., ad for advertisement)
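As a rough illustration of how these three lexicons might be held in memory and applied during pre-processing, here is a minimal Python sketch. The variable and function names, the entity-type labels, and the tiny sample entries are all assumptions made for this example; a real lexicon would be loaded from the DBpedia lists rather than hard-coded.

```python
# Illustrative entries only; names and type labels are assumptions for this sketch.
ENTITY_LEXICON = {
    "new york": "location",
    "the chinese university of hk": "organization",
}
STOPWORDS = {"of", "the"}
ABBREVIATIONS = {"ad": "advertisement"}


def expand_abbreviations(tokens):
    """Replace each known abbreviation with its long form."""
    return [ABBREVIATIONS.get(token, token) for token in tokens]


def remove_stopwords(tokens):
    """Drop every token that appears in the stop-word lexicon."""
    return [token for token in tokens if token not in STOPWORDS]
```

Plain sets and dictionaries give constant-time lookups, which is usually enough for lexicons of this kind.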
Useful tools
- The Porter Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/index-old.html
- Web page structure analysis: http://htmlparser.sourceforge.net/
- KEA for keyword extraction: http://www.nzdl.org/Kea/
- WordNet: http://wordnet.princeton.edu/
- WordNet::Similarity: http://search.cpan.org/dist/WordNet-Similarity/
Input Examples
- the chinese university of hk
- new york pizza
- How do I play mp3 using the java programming language
Crowdsourcing
- Top 1000 queries ⇒ label them into 32 categories
Centroid Method
- Function QuerytoTerm(string query)
Input: a query, Output: terms of this query
Example1: the chinese university of hk → [the chinese university of hk]1 ([]i is the i-th term of this query)
Example2: new york pizza → [new york]1 [pizza]2
Example3: How do I play mp3 using the java programming language → [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6
- Function TermtoCentroid(string terms)
Input: terms of a query, Output: centroid of this query
Example1: [the chinese university of hk]1→ the chinese university of hk
Example2: [new york]1 [pizza]2 → pizza
Example3: [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 → mp3
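The three QuerytoTerm examples above can be reproduced with a small Python sketch. It assumes one possible strategy: match the longest lexicon phrase first, then drop stop words and stem what remains. The lexicon, the stop-word list, and the STEMS lookup table (a stand-in for a real stemmer such as Porter's) are hypothetical and chosen only to cover the three sample queries. TermtoCentroid is not sketched.

```python
# Toy knowledge base; every entry is an assumption made for this sketch.
LEXICON = {"the chinese university of hk", "new york"}
STOPWORDS = {"how", "do", "i", "the", "of"}
STEMS = {"using": "use", "programming": "program"}  # stand-in for a real stemmer


def query_to_terms(query):
    """Split a query into terms: lexicon phrases first, then stemmed content words."""
    tokens = query.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest lexicon phrase that starts at position i.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in LEXICON:
                terms.append(phrase)
                i = j
                break
        else:
            # No phrase matched: keep the word unless it is a stop word.
            word = tokens[i]
            if word not in STOPWORDS:
                terms.append(STEMS.get(word, word))
            i += 1
    return terms
```

Matching phrases before filtering stop words is what keeps "the chinese university of hk" together as a single term, even though "the" and "of" are stop words on their own.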
Similarity-based Method
- Function cateURL(string category, string engine, int n)
Input: a category, a search engine (e.g., Google), and a number n, Output: the top n URLs that the engine returns for this category
Example: cateURL(cuhk, Google, 3)
www.cuhk.edu.hk/
www.cuhk.edu.hk/chinese/
www.cuhk.edu.hk/gss/
- Function keywordsURL(string URL)
Input: a URL, Output: keywords of the Web page at this URL
Example: keywordsURL(http://www.cuhk.edu.hk/english/)
research, education, shatin, campus, college, etc.
- Function synonym(string keyword)
Input: a word, Output: the set of synonyms of this word in WordNet
Example: synonym(car)
auto, automobile, machine, motorcar
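To give a feel for the keyword step in keywordsURL, the following Python sketch extracts the most frequent non-stop-word tokens from an HTML string using only the standard library. It is not KEA: plain term frequency is substituted here as a much simpler technique, fetching the page is left out, and the names TextExtractor and keywords_from_html as well as the stop-word list are assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser
import re

STOPWORDS = {"the", "of", "and", "a", "in", "on", "to", "is", "for"}  # illustrative only


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def keywords_from_html(html, k=5):
    """Return up to k of the most frequent non-stop-word tokens in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]
```

A frequency count like this ignores page structure (titles, headings, anchor text), which a structure-aware parser or a trained extractor such as KEA could exploit.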