====== Text Processing and Analysis ======

=== Input ===

  * Text
  * File
  * URL

=== Output ===

  * Statistics
  * Scores

===== Filtering Procedure =====

  * Misspelled words
  * Stop words
  * Stemming
  * Filtering
  * Minimum characters per word
  * Special word or expression
  * Number of words to be analyzed
  * Analyze number
  * Log the query (only for websites)
  * Apply stoplist
  * Apply internal stoplist
  * Extract links
  * Multi-word phrases

===== Basic Text Analysis =====

  * Character, word, syllable, sentence, and paragraph counts
  * Average syllables per word
  * Average sentence length (words)
  * Max sentence length (words)
  * Min sentence length (words)
  * Line feeds, carriage returns, tabs, special characters, etc.
  * Repeated words within a short range
  * Frequent words
  * Word length
  * Bi-grams, tri-grams, n-grams
  * Unique words
  * Lexical density

===== Readability Scores =====

  * [[http://en.wikipedia.org/wiki/Flesch–Kincaid_readability_test|Flesch-Kincaid]]
  * [[http://en.wikipedia.org/wiki/Gunning_fog_index|Gunning-Fog]]
  * [[http://en.wikipedia.org/wiki/Coleman-Liau_Index|Coleman-Liau]]
  * [[http://en.wikipedia.org/wiki/SMOG|SMOG]]
  * Lau-King Chinese Readability

===== Grammar Analysis =====

  * Part-of-speech (POS) tagger
  * Parser
  * Named-entity recognition (NER)
    * Location, person, thing, etc.
    * Event, year, telephone, address, etc.
  * Summarization/annotation
  * Sentiment/opinion analysis
  * Concept clustering

===== Visualization =====

  * Histogram
  * Similar words, concept graph
  * Word cloud

===== Document Segmentation =====

  * Title
  * Name
  * Abstract
  * Conclusion
  * References

===== Resources =====

  * [[http://www.addedbytes.com/lab/readability-score/]]
  * [[http://www.ghacks.net/2008/03/16/text-statistics/]]
  * [[https://github.com/DaveChild/Text-Statistics]]
  * [[http://www.usingenglish.com/resources/text-statistics.php]]
  * [[http://www.usingenglish.com/resources/wordcheck/]]

  - [[http://gnosis.cx/TPiP/]]
  - [[http://www.amazon.com/Text-Processing-Python-David-Mertz/dp/0321112547]]
  - [[http://www.amazon.com/Python-Text-Processing-Beginners-Guide/dp/1849512124]]
  - [[http://gnosis.cx/publish/programming/charming_python_5.html]]
  - [[https://www.packtpub.com/python-text-processing-nltk-20-cookbook/book]]
  - [[http://text-processing.com/]]
  - [[http://www.textarc.org/]]
  - [[http://www.tagcrowd.com/]]
  - [[http://www.wordle.net/]]
  - [[http://en.wikipedia.org/wiki/Text_analytics]]
  - [[http://en.wikipedia.org/wiki/Text_mining]]
  - [[http://en.wikipedia.org/wiki/Content_analysis]]

  * [[https://wiki.projectbamboo.org/display/BPUB/Text+Analysis,+Data-Mining+and+Machine+Learning]]
  * [[http://www.clarabridge.com/]]
  * [[http://contour.dac.us/Explore?ReturnUrl=%2findex.html]]
  * [[http://www.provalisresearch.com/wordstat/wordstat.html]]
  * [[http://www.maxqda.com/products/maxqda10]]
  * [[http://textanalysis.com/]]

===== Things To Do =====

  - Check the web for similar software packages
  - Learn Python and its packages
  - Check Python packages for text analysis (NLTK, etc.)
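
===== Example Sketch: Basic Statistics and Readability =====

As a starting point for the Python item above, several entries under Basic Text Analysis (word, sentence, and syllable counts, sentence lengths, unique words, frequent words, n-grams) and the Flesch-Kincaid grade level can be sketched with the standard library alone. Everything here is an assumption made for illustration, not a settled design: the function names are hypothetical, the tokenizer only recognizes ASCII letters and apostrophes, sentences are split naively on ''.'', ''!'' and ''?'', and syllables are estimated by counting vowel groups. The ''type_token_ratio'' field (unique words divided by total words) is only one possible reading of "lexical density".

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")
SENTENCE_END_RE = re.compile(r"[.!?]+")


def words(text):
    """Return lowercased word tokens (ASCII letters and apostrophes only)."""
    return [w.lower() for w in WORD_RE.findall(text)]


def sentences(text):
    """Naive sentence split on runs of '.', '!' or '?'."""
    return [s.strip() for s in SENTENCE_END_RE.split(text) if s.strip()]


def count_syllables(word):
    """Heuristic estimate: count vowel groups, drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def ngrams(tokens, n):
    """Return all consecutive n-token tuples (bi-grams for n=2, etc.)."""
    return list(zip(*(tokens[i:] for i in range(n))))


def frequent_words(text, top=10):
    """Return the `top` most common words with their counts."""
    return Counter(words(text)).most_common(top)


def text_statistics(text):
    """Collect the basic counts used by the readability formula below."""
    toks = words(text)
    sents = sentences(text)
    if not toks or not sents:
        raise ValueError("text contains no words or sentences")
    syllables = sum(count_syllables(w) for w in toks)
    sentence_lengths = [len(words(s)) for s in sents]
    return {
        "characters": len(text),
        "words": len(toks),
        "sentences": len(sents),
        "syllables": syllables,
        "unique_words": len(set(toks)),
        "avg_syllables_per_word": syllables / len(toks),
        "avg_sentence_length": len(toks) / len(sents),
        "max_sentence_length": max(sentence_lengths),
        "min_sentence_length": min(sentence_lengths),
        "type_token_ratio": len(set(toks)) / len(toks),
    }


def flesch_kincaid_grade(stats):
    """Flesch-Kincaid grade level from average sentence and word length."""
    return (0.39 * stats["avg_sentence_length"]
            + 11.8 * stats["avg_syllables_per_word"]
            - 15.59)


if __name__ == "__main__":
    sample = "The cat sat on the mat. It was happy!"
    stats = text_statistics(sample)
    print(stats)
    print(flesch_kincaid_grade(stats))
    print(frequent_words(sample, top=3))
    print(ngrams(words(sample), 2))
```

The vowel-group heuristic miscounts many English words, so scores from this sketch are rough. A pronunciation dictionary or one of the packages listed under Resources would be the more reliable route once the filtering steps (stop words, stemming, minimum word length) are in place.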