Differences
This shows you the differences between two versions of the page.
projs:qcat:home [2011/04/07 17:31] admin |
projs:qcat:home [2011/04/08 17:16] (current) xfyu |
||
---|---|---|---|
Line 2: | Line 2: | ||
===== Pre-processing ===== | ===== Pre-processing ===== | ||
* Stemming | * Stemming | ||
- | * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|Porter stemmer]] | + | * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]] |
* Abbreviation Expansion | * Abbreviation Expansion | ||
+ | * Use [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]] | ||
* Stopword filtering | * Stopword filtering | ||
+ | * Use [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]] | ||
* Misspelled words | * Misspelled words | ||
+ | * [[http://aspell.net/|GNU Aspell]] | ||
* Location-based queries | * Location-based queries | ||
- | * POS | + | * NER for location detection |
- | * Person | + | * Part-of-speech (POS) tagging |
- | * Place | + | * [[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]] |
- | * Thing | + | * Named entity recognition (NER) |
- | * Term extraction | + | * [[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]] |
+ | * Person (e.g., Bill Gates) | ||
+ | * Location (e.g., Hong Kong) | ||
+ | * Thing (e.g., Table) | ||
===== Knowledge Base ===== | ===== Knowledge Base ===== | ||
- | * Lexicon (e.g., DBpedia person, location, organization, and product lists) | + | * Lexicon (e.g., [[http://dbpedia.org/About|DBpedia]] person, location, organization, and product lists) |
- | * Stop-word lexicon (e.g., of, the) | + | * [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]] (e.g., of, the) |
- | * Abbreviation lexicon (e.g., ad for advertisement) | + | * [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]] (e.g., ad for advertisement) |
===== Useful tools ===== | ===== Useful tools ===== | ||
- | * The Porter Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/index-old.html | + | * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]] |
- | * Web page structure analysis: http://htmlparser.sourceforge.net/ | + | * [[http://aspell.net/|GNU Aspell]] |
- | * KEA for keyword extraction: http://www.nzdl.org/Kea/ | + | * [[http://htmlparser.sourceforge.net/|Web page structure analysis]] |
- | * WordNet: http://wordnet.princeton.edu/ | + | * [[http://www.nzdl.org/Kea/|KEA for keyword extraction]] |
- | * WordNet::Similarity: http://search.cpan.org/dist/WordNet-Similarity/ | + | * [[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]] |
+ | * [[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]] | ||
+ | * [[http://wordnet.princeton.edu/|WordNet]] | ||
+ | * [[http://search.cpan.org/dist/WordNet-Similarity/|WordNet::Similarity]] | ||
* [[http://www.nltk.org/|NLTK Toolkit]] | * [[http://www.nltk.org/|NLTK Toolkit]] | ||
* [[http://docs.python.org/library/bsddb.html|bsddb — Interface to Berkeley DB library]] | * [[http://docs.python.org/library/bsddb.html|bsddb — Interface to Berkeley DB library]] | ||
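
As a rough illustration of the pre-processing steps listed above, here is a minimal Python sketch covering abbreviation expansion and stop-word filtering, with stemming left as a pluggable hook. The function name `preprocess`, the tiny word lists, and the order of the steps are assumptions made for illustration only; the linked stop-word and abbreviation lists and a Porter stemmer would presumably take their place in practice.

```python
import re

# Tiny illustrative lists (assumption); the page links to fuller ones.
STOP_WORDS = {"of", "the", "a", "an", "in", "for"}
ABBREVIATIONS = {"ad": "advertisement"}


def preprocess(query, stem=lambda word: word):
    """Tokenize a query, expand abbreviations, drop stop words, then stem.

    `stem` is a hypothetical hook: pass any callable that maps a word to
    its stem (for example a Porter stemmer). By default words are kept as-is.
    """
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    kept = [token for token in expanded if token not in STOP_WORDS]
    return [stem(token) for token in kept]
```

Calling `preprocess("The ad of the day")` expands `ad` and removes `the` and `of`. Spelling correction and NER are left out of this sketch because they depend on external tools such as GNU Aspell and the Stanford taggers.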
Line 36: | Line 46: | ||
* Top 1000 queries => label them into 32 categories | * Top 1000 queries => label them into 32 categories | ||
===== Centroid Method ===== | ===== Centroid Method ===== | ||
- | - Function <color red>QuerytoTerm(string query)</color> \\ **Input**: a query, **Output**: terms of this query \\ \\ Example1: the chinese university of hk -> [the chinese university of hk]1 ([]i is the i-th term of this query) \\ Example2: new york pizza -> [new york]1 [pizza]2 \\ Example3: How do I play mp3 using the java programming language -> [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 \\ \\ | + | - Function <color red>Query2Term(string query)</color> \\ **Input**: a query, **Output**: terms of this query \\ \\ Example1: the chinese university of hk -> [the chinese university of hk]1 ([]i is the i-th term of this query) \\ Example2: new york pizza -> [new york]1 [pizza]2 \\ Example3: How do I play mp3 using the java programming language -> [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 \\ \\ |
- | - Function <color red>TermtoCentroid(string terms)</color> \\ **Input**: terms of a query, **Output**: centroid of this query \\ \\ Example1: [the chinese university of hk]1 -> the chinese university of hk \\ Example2: [new york]1 [pizza]2 -> pizza \\ Example3: [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 -> mp3 | + | - Function <color red>Term2Centroid(string terms)</color> \\ **Input**: terms of a query, **Output**: centroid of this query \\ \\ Example1: [the chinese university of hk]1 -> the chinese university of hk \\ Example2: [new york]1 [pizza]2 -> pizza \\ Example3: [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 -> mp3 \\ \\ |
+ | - Function <color red>synonym(string keyword)</color> \\ **Input**: a word, **Output**: a set of synonyms of this term in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar | ||
+ | |||
===== Similarity-based Method ===== | ===== Similarity-based Method ===== | ||
- | - Function <color red>cateURL(string category, string engine, int n)</color> \\ **Input**: a category, **Output**: top n URLs from search engines (e.g., Google) \\ \\ Example: cateURL(cuhk, Google, 3) \\ www.cuhk.edu.hk/ \\ www.cuhk.edu.hk/chinese/ \\ www.cuhk.edu.hk/gss/ | + | - Function <color red>catURL(string category, string engine, int n)</color> \\ **Input**: a category, **Output**: top n URLs from search engines (e.g., Google) \\ \\ Example: catURL(cuhk, Google, 3) \\ www.cuhk.edu.hk/ \\ www.cuhk.edu.hk/chinese/ \\ www.cuhk.edu.hk/gss/ \\ \\ |
- | - Function <color red>keywordsURL(string URL)</color> \\ **Input**: a URL, **Output**: keywords of the Web page at this URL \\ \\ Example: keywordsURL(http://www.cuhk.edu.hk/english/) \\ research, education, shatin, campus, college, etc. | + | - Function <color red>keywordsURL(string URL)</color> \\ **Input**: a URL, **Output**: keywords of the Web page at this URL \\ \\ Example: keywordsURL(http://www.cuhk.edu.hk/english/) \\ research, education, shatin, campus, college, etc. \\ \\ |
- Function <color red>synonym(string keyword)</color> \\ **Input**: a word, **Output**: a set of synonyms of this term in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar | - Function <color red>synonym(string keyword)</color> \\ **Input**: a word, **Output**: a set of synonyms of this term in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar | ||
+ | |||
+ | ===== Overall Workflow ===== | ||
+ | {{:projs:qcat:figure-updated1.pdf|Workflow for query categorization}} |
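
The Query2Term examples in the Centroid Method section suggest that multi-word lexicon entries are kept together as one term and that stop words are dropped otherwise. One possible way to get that behaviour is greedy longest-match segmentation against the lexicon, sketched below in Python. The page does not say how Query2Term is implemented, so the algorithm, the name `query2term`, the `max_len` parameter, and the sample lexicon and stop-word sets are all assumptions.

```python
def query2term(query, lexicon, stop_words, max_len=6):
    """Split a query into terms by greedy longest match against a lexicon.

    lexicon    -- set of known multi-word phrases (hypothetical sample data)
    stop_words -- set of words dropped when they are not part of a phrase
    max_len    -- longest phrase, in tokens, tried at each position
    """
    tokens = query.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if n > 1 and phrase in lexicon:
                terms.append(phrase)
                i += n
                break
        else:
            # No multi-word entry starts here: keep the token unless it is a stop word.
            if tokens[i] not in stop_words:
                terms.append(tokens[i])
            i += 1
    return terms


# Sample data for illustration only.
LEXICON = {"new york", "the chinese university of hk"}
STOP_WORDS = {"the", "of", "how", "do", "i"}
```

With this sample data, `query2term("new york pizza", LEXICON, STOP_WORDS)` keeps `new york` together, matching Example2. Example3 on the page additionally reduces `using` to `use` and `programming` to `program`, which would need a stemming step that this sketch omits.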