Differences

This shows you the differences between two versions of the page.

projs:qcat:home [2011/04/07 18:13]
xfyu
projs:qcat:home [2011/04/08 17:16] (current)
xfyu
Line 2: Line 2:
===== Pre-processing ===== ===== Pre-processing =====
  * Stemming   * Stemming
-    * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|Porter stemmer]]+    * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]]
  * Abbreviation Extension   * Abbreviation Extension
 +    * Use [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]]
  * Stopword filtering   * Stopword filtering
 +    * Use [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]]
  * Misspelled words   * Misspelled words
 +    * [[http://aspell.net/|GNU Aspell]]
  * Location-based queries   * Location-based queries
-  * POS +    * NER for location detection 
-    * Person +  * Part-of-speech (POS) tagging 
-    * Place +    * [[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]] 
-    * Thing +  * Named entity recognition (NER) 
-  * Term extraction+    * [[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]] 
 +    * Person (e.g., Bill Gates) 
 +    * Location (e.g., Hong Kong) 
 +    * Thing (e.g., Table)
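
The pre-processing steps listed above could be chained roughly as follows. This is only a minimal sketch: the tiny `ABBREVIATIONS`, `STOPWORDS`, and `VOCABULARY` tables are illustrative assumptions standing in for the linked abbreviation and stop-word lists, and `difflib` from the Python standard library is swapped in for GNU Aspell so that the example stays self-contained. The function names `preprocess` and `correct_spelling` are hypothetical.

```python
import difflib

# Toy resources (assumptions for illustration, not the project's real lists).
ABBREVIATIONS = {"ad": "advertisement", "hk": "hong kong"}
STOPWORDS = {"of", "the", "a", "an", "in"}
VOCABULARY = ["pizza", "university", "advertisement", "language"]


def correct_spelling(word):
    """Return the closest vocabulary word, or the word itself if none is close."""
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word


def preprocess(query):
    """Drop stop words, expand abbreviations, then fix likely misspellings."""
    tokens = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue  # stop-word filtering
        word = ABBREVIATIONS.get(word, word)  # abbreviation extension
        tokens.append(correct_spelling(word))  # misspelled words
    return tokens


print(preprocess("the pizza ad"))
```

Stemming, POS tagging, and NER are left out here because they depend on the external tools linked above.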
===== Knowledge Base ===== ===== Knowledge Base =====
-  * Lexicon (e.g., DBpedia person, location, organization, and product lists) +  * Lexicon (e.g., [[http://dbpedia.org/About|DBpedia]] person, location, organization, and product lists) 
-  * Stop-word lexicon (e.g., of, the) +  * [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]] (e.g., of, the) 
-  * Abbreviation lexicon (e.g., ad for advertisement)+  * [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]] (e.g., ad for advertisement) 
===== Useful tools ===== ===== Useful tools =====
-  *The Porter Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/index-old.html +  *[[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]] 
-  *Web page structure analysis: http://htmlparser.sourceforge.net/  +  *[[http://aspell.net/|GNU Aspell]] 
-  *KEA for key word extraction: http://www.nzdl.org/Kea/  +  *[[http://htmlparser.sourceforge.net/|Web page structure analysis]]  
-  *WordNet: http://wordnet.princeton.edu/  +  *[[http://www.nzdl.org/Kea/|KEA for key word extraction]]  
-  *WordNet:: Similarity: http://search.cpan.org/dist/WordNet-Similarity/ +  *[[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]] 
 +  *[[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]] 
 +  *[[http://wordnet.princeton.edu/|WordNet]]  
 +  *[[http://search.cpan.org/dist/WordNet-Similarity/|WordNet::Similarity]]
  * [[http://www.nltk.org/|NLTK Toolkit]]   * [[http://www.nltk.org/|NLTK Toolkit]]
  * [[http://docs.python.org/library/bsddb.html|bsddb — Interface to Berkeley DB library]]   * [[http://docs.python.org/library/bsddb.html|bsddb — Interface to Berkeley DB library]]
Line 37: Line 47:
===== Centroid Method ===== ===== Centroid Method =====
  - Function <color red>Query2Term(string query)</color> \\ **Input**: a query, **Output**: terms of this query \\ \\ Example1: the chinese university of hk -> [the chinese university of hk]1 ([]i is the i-th term of this query) \\ Example2: new york pizza -> [new york]1 [pizza]2 \\ Example3: How do I play mp3 using the java programming language -> [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 \\ \\   - Function <color red>Query2Term(string query)</color> \\ **Input**: a query, **Output**: terms of this query \\ \\ Example1: the chinese university of hk -> [the chinese university of hk]1 ([]i is the i-th term of this query) \\ Example2: new york pizza -> [new york]1 [pizza]2 \\ Example3: How do I play mp3 using the java programming language -> [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 \\ \\
-  - Function <color red>Term2Centroid(string terms)</color> \\ **Input**: terms of a query, **Output**: centroid of this query \\ \\ Example1: [the chinese university of hk]1-> the chinese university of hk \\ Example2: [new york]1 [pizza]2 -> pizza \\ Example3: [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 -> mp3 +  - Function <color red>Term2Centroid(string terms)</color> \\ **Input**: terms of a query, **Output**: centroid of this query \\ \\ Example1: [the chinese university of hk]1-> the chinese university of hk \\ Example2: [new york]1 [pizza]2 -> pizza \\ Example3: [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 -> mp3 \\ \\ 
 +  - Function <color red>synonym(string keyword)</color> \\ **Input**: a word, **Output**: a set of synonyms of this term in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar 
 +
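
One way Query2Term might behave on the three examples above is sketched below. The page does not say how terms are segmented, so greedy longest-match against a lexicon is an assumption, as are the toy `LEXICON`, `STOPWORDS`, and `STEMS` tables (the last is a stand-in for a real stemmer such as the Porter stemmer). The snake_case name `query2term` is used only to follow Python conventions.

```python
# Toy resources (assumptions for illustration, not the project's knowledge base).
LEXICON = {"new york", "the chinese university of hk"}
STOPWORDS = {"how", "do", "i", "the", "of", "a", "an"}
STEMS = {"using": "use", "programming": "program"}  # stand-in for a stemmer


def query2term(query):
    """Split a query into terms, preferring the longest lexicon phrase."""
    words = query.lower().split()
    terms = []
    i = 0
    while i < len(words):
        # Try the longest multi-word phrase starting at position i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if j - i > 1 and phrase in LEXICON:
                terms.append(phrase)
                i = j
                break
        else:
            # No phrase matched: keep the single word unless it is a stop word.
            word = words[i]
            if word not in STOPWORDS:
                terms.append(STEMS.get(word, word))
            i += 1
    return terms


print(query2term("new york pizza"))
```

Matching lexicon phrases before stop-word filtering is what keeps "the chinese university of hk" intact as a single term even though it contains "the" and "of".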
===== Similarity-based Method ===== ===== Similarity-based Method =====
-  - Function <color red>cateURL(string category, string engine, int n)</color> \\ **Input**: a category, **Output**: top n URLs from search engines (e.g., Google) \\ \\ Example: cateURL(cuhk, Google, 3) \\ www.cuhk.edu.hk/ \\ www.cuhk.edu.hk/chinese/ \\ www.cuhk.edu.hk/gss/ \\ \\+  - Function <color red>catURL(string category, string engine, int n)</color> \\ **Input**: a category, **Output**: top n URLs from search engines (e.g., Google) \\ \\ Example: catURL(cuhk, Google, 3) \\ www.cuhk.edu.hk/ \\ www.cuhk.edu.hk/chinese/ \\ www.cuhk.edu.hk/gss/ \\ \\
  - Function <color red>keywordsURL(string URL)</color> \\ **Input**: a URL, **Output**: key words of Web pages for this URL  \\ \\ Example: keywordsURL(http://www.cuhk.edu.hk/english/) \\ research, education, shatin, campus, college, etc \\ \\   - Function <color red>keywordsURL(string URL)</color> \\ **Input**: a URL, **Output**: key words of Web pages for this URL  \\ \\ Example: keywordsURL(http://www.cuhk.edu.hk/english/) \\ research, education, shatin, campus, college, etc \\ \\
  - Function <color red>synonym(string keyword)</color> \\ **Input**: a word, **Output**: a set of synonyms of this term in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar   - Function <color red>synonym(string keyword)</color> \\ **Input**: a word, **Output**: a set of synonyms of this term in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar
 +
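
The keywordsURL step could be approximated as below. This is a hedged sketch rather than the project's method: it ranks words by plain term frequency instead of using KEA, it works on an HTML string instead of fetching a URL, and the name `keywords_from_html` and the toy `STOPWORDS` set are assumptions. Only the Python standard library (`html.parser`, `collections.Counter`) is used.

```python
from collections import Counter
from html.parser import HTMLParser

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}  # toy list (assumption)


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def keywords_from_html(html, n=5):
    """Return up to n most frequent non-stop-words in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    raw = " ".join(parser.chunks).split()
    words = [w.strip(".,;:!?()\"'").lower() for w in raw]
    counts = Counter(w for w in words if w.isalpha() and w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]


page = "<html><body><p>Research and education. Research in the campus.</p></body></html>"
print(keywords_from_html(page, 3))
```

In a fuller pipeline the HTML string would come from the pages returned by catURL, and the resulting keywords could be expanded with synonym() before comparing them with the query terms.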
 +===== Overall Workflow =====
 +{{:projs:qcat:figure-updated1.pdf|Workflow for query categorization}}
 
projs/qcat/home.1302171237.txt.gz · Last modified: 2011/04/07 18:13 by xfyu