Differences
This shows you the differences between two versions of the page.
projs:qcat:home [2011/04/07 20:01] xfyu |
projs:qcat:home [2011/04/08 17:16] (current) xfyu |
||
---|---|---|---|
Line 2: | Line 2: | ||
===== Pre-processing ===== | ===== Pre-processing ===== | ||
* Stemming | * Stemming | ||
- | * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|Porter stemmer]] | + | * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]] |
* Abbreviation Extension | * Abbreviation Extension | ||
+ | * Use [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]] | ||
* Stopword filtering | * Stopword filtering | ||
+ | * Use [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]] | ||
* Misspelled words | * Misspelled words | ||
+ | * [[http://aspell.net/|GNU Aspell]] | ||
* Location-based queries | * Location-based queries | ||
+ | * NER for location detection | ||
* Part-of-speech (POS) tagging | * Part-of-speech (POS) tagging | ||
- | * Person | + | * [[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]] |
- | * Place | + | * Named entity recognition (NER) |
- | * Thing | + | * [[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]] |
+ | * Person (e.g., Bill Gates) | ||
+ | * Location (e.g., Hong Kong) | ||
+ | * Thing (e.g., Table) | ||
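A minimal sketch of two of the pre-processing steps listed above, abbreviation extension and stop-word filtering, assuming tiny in-memory lexicons. The names `ABBREVIATIONS`, `STOPWORDS`, and `preprocess` are hypothetical and only for illustration. A real setup would load the linked abbreviation and stop-word lists instead, and stemming, spell checking, and tagging are left out here.

```python
import re

# Toy lexicons for illustration only (assumption): a real system would load
# the full abbreviation list and stop-word list linked in this section.
ABBREVIATIONS = {"ad": "advertisement"}
STOPWORDS = {"of", "the", "a", "an"}


def preprocess(query):
    """Lowercase and tokenize a query, expand abbreviations, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    expanded = []
    for tok in tokens:
        # An abbreviation may expand to several words, so split the expansion.
        expanded.extend(ABBREVIATIONS.get(tok, tok).split())
    return [t for t in expanded if t not in STOPWORDS]


print(preprocess("The ad of the university"))
```

Expanding abbreviations before filtering stop words means an expansion that happens to contain a stop word is filtered as well.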
===== Knowledge Base ===== | ===== Knowledge Base ===== | ||
- | * Lexicon (e.g., DBpedia person, location, organization, and product lists) | + | * Lexicon (e.g., [[http://dbpedia.org/About|DBpedia]] person, location, organization, and product lists) |
- | * Stop-word lexicon (e.g., of, the) | + | * [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]] (e.g., of, the) |
- | * Abbreviation lexicon (e.g., ad for advertisement) | + | * [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]] (e.g., ad for advertisement) |
===== Useful tools ===== | ===== Useful tools ===== | ||
*[[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]] | *[[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]] | ||
+ | *[[http://aspell.net/|GNU Aspell]] | ||
*[[http://htmlparser.sourceforge.net/|Web page structure analysis]] | *[[http://htmlparser.sourceforge.net/|Web page structure analysis]] | ||
*[[http://www.nzdl.org/Kea/|KEA for key word extraction]] | *[[http://www.nzdl.org/Kea/|KEA for key word extraction]] | ||
+ | *[[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]] | ||
+ | *[[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]] | ||
*[[http://wordnet.princeton.edu/|WordNet]] | *[[http://wordnet.princeton.edu/|WordNet]] | ||
*[[http://search.cpan.org/dist/WordNet-Similarity/|WordNet::Similarity]] | *[[http://search.cpan.org/dist/WordNet-Similarity/|WordNet::Similarity]] | ||
Line 36: | Line 47: | ||
===== Centroid Method ===== | ===== Centroid Method ===== | ||
- Function <color red>Query2Term(string query)</color> \\ **Input**: a query, **Output**: terms of this query \\ \\ Example1: the chinese university of hk -> [the chinese university of hk]1 ([]i is the i-th term of this query) \\ Example2: new york pizza -> [new york]1 [pizza]2 \\ Example3: How do I play mp3 using the java programming language -> [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 \\ \\ | - Function <color red>Query2Term(string query)</color> \\ **Input**: a query, **Output**: terms of this query \\ \\ Example1: the chinese university of hk -> [the chinese university of hk]1 ([]i is the i-th term of this query) \\ Example2: new york pizza -> [new york]1 [pizza]2 \\ Example3: How do I play mp3 using the java programming language -> [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 \\ \\ | ||
- | - Function <color red>Term2Centroid(string terms)</color> \\ **Input**: terms of a query, **Output**: centroid of this query \\ \\ Example1: [the chinese university of hk]1-> the chinese university of hk \\ Example2: [new york]1 [pizza]2 -> pizza \\ Example3: [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 -> mp3 | + | - Function <color red>Term2Centroid(string terms)</color> \\ **Input**: terms of a query, **Output**: centroid of this query \\ \\ Example1: [the chinese university of hk]1-> the chinese university of hk \\ Example2: [new york]1 [pizza]2 -> pizza \\ Example3: [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 -> mp3 \\ \\ |
+ | - Function <color red>synonym(string keyword)</color> \\ **Input**: a word, **Output**: a set of synonyms of this term in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar | ||
+ | |||
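The page does not say how Query2Term segments a query, so the following is only a sketch under one assumption: multi-word terms are found by greedy longest match against a lexicon of known entities, and any remaining single words are kept unless they are stop words. `LEXICON`, `STOPWORDS`, and `query2term` are hypothetical names. Stemming is omitted, so Example3 above is not reproduced, and Term2Centroid is not covered because its selection rule is not described.

```python
# Hypothetical multi-word entries standing in for a DBpedia-style lexicon.
LEXICON = {
    ("the", "chinese", "university", "of", "hk"),
    ("new", "york"),
}
# Toy stop-word list (assumption); the real list is much longer.
STOPWORDS = {"of", "the", "how", "do", "i"}
MAX_LEN = max(len(entry) for entry in LEXICON)


def query2term(query):
    """Split a query into terms using greedy longest-match on LEXICON."""
    tokens = query.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in LEXICON:
                terms.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word match: keep the single token unless it is a stop word.
            if tokens[i] not in STOPWORDS:
                terms.append(tokens[i])
            i += 1
    return terms


print(query2term("the chinese university of hk"))
print(query2term("new york pizza"))
```

Matching the lexicon before stop-word removal is what keeps "the" and "of" inside the single term in Example1.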
===== Similarity-based Method ===== | ===== Similarity-based Method ===== | ||
Line 44: | Line 57: | ||
===== Overall Workflow ===== | ===== Overall Workflow ===== | ||
+ | {{:projs:qcat:figure-updated1.pdf|Workflow for query categorization}} |