Text Processing and Analysis

Input

  • Text
  • File
  • URL

Output

  • Statistics
  • Scores

Filtering Procedure

  • Misspelled words
  • Stop words
  • Stemming
  • Filtering
    • Minimum characters per word
    • Special word or expression
    • Number of words to be analyzed
    • Analyze numbers
    • Log the query (only for websites)
    • Apply stoplist
    • Apply internal stoplist
    • Extract links
    • Multi-word (polyword) phrases
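
A minimal sketch of how a few of the filtering options above (minimum characters per word, stoplist, numbers, number of words to be analyzed) might fit together. The function name, parameter names, tokenization rule, and the tiny stoplist are all assumptions made for illustration, not part of any existing package.

```python
import re

# Hypothetical, deliberately tiny stoplist; a real one would be much longer.
STOPLIST = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def filter_words(text, min_chars=3, stoplist=STOPLIST,
                 keep_numbers=False, max_words=None):
    """Tokenize text and apply the filtering options in order."""
    # Assumption: a word is a run of letters, digits, or apostrophes.
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    kept = []
    for tok in tokens:
        if len(tok) < min_chars:            # minimum characters per word
            continue
        if stoplist and tok in stoplist:    # apply stoplist
            continue
        if not keep_numbers and tok.isdigit():  # analyze numbers or not
            continue
        kept.append(tok)
    # Number of words to be analyzed: truncate only if a limit is given.
    return kept if max_words is None else kept[:max_words]
```

For example, filter_words("The cat sat on the mat in 2011") keeps only "cat", "sat", and "mat"; passing keep_numbers=True also keeps "2011".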

Basic Text Analysis

  • Character, word, paragraph, sentence, and syllable counts, etc.
    • Average syllables per word
    • Average sentence length (words)
    • Max sentence length (words)
    • Min sentence length (words)
  • Line feeds, carriage returns, tabs, special characters, etc.
  • Repeated words within a short range
  • Frequent words
  • Word length
  • Bi-grams, tri-grams, n-grams
  • Unique words
  • Lexical density
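
Several of the measures above can be computed with the Python standard library alone. The sketch below is one possible reading: the sentence splitter is crude (it splits on ".", "!", and "?"), and "lexical density" is taken here as unique words divided by total words, which is only one of several definitions in use. The function and key names are hypothetical.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")

def basic_stats(text, n=2, top=3):
    """Return a dict of simple counts for the given text."""
    words = WORD_RE.findall(text.lower())
    # Crude sentence split: any run of ., !, or ? ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(WORD_RE.findall(s)) for s in sentences]
    # n-grams as tuples of n consecutive words.
    ngrams = Counter(zip(*(words[i:] for i in range(n))))
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "avg_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_sentence_length": max(lengths, default=0),
        "min_sentence_length": min(lengths, default=0),
        "unique_words": len(set(words)),
        # Assumption: lexical density as unique words / total words.
        "lexical_density": len(set(words)) / len(words) if words else 0.0,
        "frequent_words": Counter(words).most_common(top),
        "frequent_ngrams": ngrams.most_common(top),
    }
```

Calling basic_stats(text, n=3) switches the n-gram count from bi-grams to tri-grams.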

Readability Scores
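
One widely used score that could go here is the Flesch Reading Ease, computed as 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word). The sketch below assumes a rough vowel-group heuristic for syllable counting, which will miscount some English words; the function names are hypothetical.

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # "make" -> 1, but keep the final syllable in words like "table".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text):
    """Return the Flesch Reading Ease score, or None for empty input."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return None
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word
```

Higher scores indicate easier text. A dictionary-based syllable counter would be more accurate than the heuristic used here.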

Grammar Analysis

  • POS tagger
  • Parser
  • Named entity recognition (NER)
    • Location, person, thing, etc.
    • Event, year, telephone, address, etc.
  • Summarization/annotation
  • Sentiment/opinion analysis
  • Concept clustering
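
Entities with a regular surface form, such as years and telephone numbers, can be approximated with plain pattern matching before a trained tagger is brought in. The sketch below is not a real NER model: the patterns (four-digit years from 1000 to 2099, US-style 3-3-4 digit telephone numbers) and the function name are assumptions chosen for illustration.

```python
import re

# Assumption: years of interest are four digits, 1000 through 2099.
YEAR_RE = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")
# Assumption: telephone numbers look like 555-123-4567 (-, ., or space).
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def extract_entities(text):
    """Return the years and telephone numbers found in text."""
    phones = PHONE_RE.findall(text)
    # Blank out telephone numbers first so their last four digits
    # are not mistaken for a year.
    remainder = PHONE_RE.sub(" ", text)
    years = YEAR_RE.findall(remainder)
    return {"year": years, "telephone": phones}
```

Locations, persons, and events have no fixed pattern and would need a dictionary or a statistical tagger instead.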

Visualization

  • Histogram
  • Similar words, concept graph
  • Word cloud
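
Before choosing a plotting package, a histogram can be prototyped as plain text. The sketch below draws a word-length histogram with one bar character per word; the function name and the output layout are assumptions.

```python
import re
from collections import Counter

def word_length_histogram(text, bar_char="#"):
    """Return a text histogram of word lengths, one line per length."""
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(len(w) for w in words)
    lines = []
    for length in sorted(counts):
        bar = bar_char * counts[length]
        lines.append(f"{length:>2} | {bar} ({counts[length]})")
    return "\n".join(lines)

if __name__ == "__main__":
    print(word_length_histogram("a bb bb ccc"))
```

The same Counter could feed a word-frequency histogram by counting the words themselves instead of their lengths.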

Document Segmentation

  • Title
  • Name
  • Abstract
  • Conclusion
  • References
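
A first pass at segmentation could look for heading lines that consist only of a known section name. The sketch below assumes that the headings "Abstract", "Conclusion(s)", and "References" each sit alone on a line and that the title is the first non-blank line; it does not attempt to identify the author name. All names in it are hypothetical.

```python
import re

# Assumption: section headings appear alone on their own line.
HEADING_RE = re.compile(r"^(abstract|conclusions?|references)$", re.IGNORECASE)

def segment_document(text):
    """Split text into sections keyed by lowercased heading name."""
    current = "front_matter"
    sections = {current: []}
    for line in text.splitlines():
        match = HEADING_RE.match(line.strip())
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    result = {name: "\n".join(body).strip() for name, body in sections.items()}
    # Assumption: the title is the first non-blank line before any heading.
    nonblank = [l.strip() for l in sections["front_matter"] if l.strip()]
    result["title"] = nonblank[0] if nonblank else ""
    return result
```

Numbered headings such as "1. Introduction" or headings with trailing punctuation would need a looser pattern.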

Resources

Things To Do

  1. Check the web for similar software packages
  2. Learn Python and its packages
  3. Check Python packages for text analysis (NLTK, etc.)
 
projs/text/home.txt · Last modified: 2011/06/01 13:14 by admin