====== Query Categorization ======

===== Pre-processing =====

  * Stemming
    * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]]
  * Abbreviation expansion
    * Use [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]]
  * Stop-word filtering
    * Use [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]]
  * Misspelled words
    * [[http://aspell.net/|GNU Aspell]]
  * Location-based queries
    * NER for location detection
  * Part-of-speech (POS) tagging
    * [[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]]
  * Named entity recognition (NER)
    * [[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]]
    * Person (e.g., Bill Gates)
    * Location (e.g., Hong Kong)
    * Thing (e.g., Table)

===== Knowledge Base =====

  * Lexicon (e.g., [[http://dbpedia.org/About|DBpedia]] person, location, organization, and product lists)
  * [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]] (e.g., of, the)
  * [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]] (e.g., ad for advertisement)

===== Useful tools =====

  * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]]
  * [[http://aspell.net/|GNU Aspell]]
  * [[http://htmlparser.sourceforge.net/|Web page structure analysis]]
  * [[http://www.nzdl.org/Kea/|KEA for keyword extraction]]
  * [[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]]
  * [[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]]
  * [[http://wordnet.princeton.edu/|WordNet]]
  * [[http://search.cpan.org/dist/WordNet-Similarity/|WordNet::Similarity]]
  * [[http://www.nltk.org/|NLTK]]
  * [[http://docs.python.org/library/bsddb.html|bsddb: interface to the Berkeley DB library]]

===== Input Examples =====

  * the chinese university of hk
  * new york pizza
  * How do I play mp3 using the java programming language

===== Crowdsourcing =====

  * Take the top 1000 queries and label them into 32 categories

===== Centroid Method =====

  - Function Query2Term(string query) \\ **Input**: a query, **Output**: the terms of this query \\ \\ Example 1: the chinese university of hk -> [the chinese university of hk]1 ([]i marks the i-th term of the query) \\ Example 2: new york pizza -> [new york]1 [pizza]2 \\ Example 3: How do I play mp3 using the java programming language -> [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 \\ \\ 
  - Function Term2Centroid(string terms) \\ **Input**: the terms of a query, **Output**: the centroid of this query \\ \\ Example 1: [the chinese university of hk]1 -> the chinese university of hk \\ Example 2: [new york]1 [pizza]2 -> pizza \\ Example 3: [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 -> mp3 \\ \\ 
  - Function synonym(string keyword) \\ **Input**: a word, **Output**: the set of synonyms of this word in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar

===== Similarity-based Method =====

  - Function catURL(string category, string engine, int n) \\ **Input**: a category, a search engine, and a count n, **Output**: the top n URLs returned by the search engine (e.g., Google) for this category \\ \\ Example: catURL(cuhk, Google, 3) \\ www.cuhk.edu.hk/ \\ www.cuhk.edu.hk/chinese/ \\ www.cuhk.edu.hk/gss/ \\ \\ 
  - Function keywordsURL(string URL) \\ **Input**: a URL, **Output**: the keywords of the Web page at this URL \\ \\ Example: keywordsURL(http://www.cuhk.edu.hk/english/) \\ research, education, shatin, campus, college, etc. \\ \\ 
  - Function synonym(string keyword) \\ **Input**: a word, **Output**: the set of synonyms of this word in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar

===== Overall Workflow =====

{{:projs:qcat:figure-updated1.pdf|Workflow for query categorization}}
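
As a rough illustration of the first workflow step, the Query2Term function from the Centroid Method could be sketched as below. This is only a minimal sketch under stated assumptions, not the project's actual implementation: ''LEXICON'', ''STOPWORDS'', and ''STEMS'' are tiny hypothetical stand-ins for the DBpedia-style lexicon, the stop-word list, and the Porter stemmer, and the greedy longest-match rule for multi-word entities is an assumption chosen to reproduce the three examples.

```python
# Hypothetical stand-in for the lexicon (e.g., DBpedia entity lists).
LEXICON = {
    ("the", "chinese", "university", "of", "hk"),
    ("new", "york"),
}

# Hypothetical stand-in for the stop-word list.
STOPWORDS = {"how", "do", "i", "the", "of", "a", "an"}

# Hypothetical stand-in for a real stemmer such as the Porter stemmer.
STEMS = {"using": "use", "programming": "program"}


def query2term(query):
    """Split a query into terms.

    Multi-word lexicon entries are kept as one term (longest match first);
    remaining tokens are stop-word filtered and stemmed.
    """
    tokens = query.lower().split()
    max_len = max(len(entry) for entry in LEXICON)
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest multi-word candidate first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in LEXICON:
                terms.append(" ".join(candidate))
                i += n
                break
        else:
            # No entity starts here: treat the token as a single word.
            token = tokens[i]
            if token not in STOPWORDS:
                terms.append(STEMS.get(token, token))
            i += 1
    return terms


if __name__ == "__main__":
    for q in (
        "the chinese university of hk",
        "new york pizza",
        "How do I play mp3 using the java programming language",
    ):
        print(q, "->", query2term(q))
```

Matching lexicon entries before stop-word filtering is what keeps "the chinese university of hk" intact as a single term, even though "the" and "of" are stop words on their own. In practice the lookup tables would be replaced by the resources listed under Knowledge Base and Useful tools.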