Introduction to Text Mining in SAS® Enterprise Miner
Mengmeng Liu and Jack Dai
November 10, 2014
Growing Text
• First tweet: 2006
• Tweets per day in 2007: 5,000
• Tweets per day in 2013: 500,000,000
What is Text Mining?
Unstructured Text Data → Numeric Data → Statistical Analysis
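As a minimal sketch of what "converting text to numeric data" can mean, the Python snippet below builds a document-by-term count matrix. The two example reviews and the function name `document_term_matrix` are invented for illustration; this is not what SAS Enterprise Miner runs internally.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct term, in alphabetical order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Hypothetical mini-corpus, invented for illustration only.
docs = ["clean room clean pool", "dirty room"]
vocabulary, matrix = document_term_matrix(docs)
```

Each row of `matrix` is a numeric description of one document, which is the form that ordinary statistical methods expect.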
Text Mining Process Flow in SAS Enterprise Miner
Data Importing
• Clean the text as much as you can before importing
Import node | Data Structure
File Import | All text in one file (CSV)
Text Import | Separate documents (TXT, PDF)
Create New Data Sources | SAS dataset (sas7bdat)
Data Importing
Text Parsing
• Parse the variable with longest length
• Associate similar terms into one group
• Build customized dictionary of relevant terms
• Control number of terms per document
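The parsing steps above can be sketched in Python under some loud assumptions: the stop list, the synonym map, and the function `parse_document` are all hypothetical, and the logic only imitates the ideas (exclude terms, group similar terms, cap terms per document) rather than the Text Parsing node itself.

```python
import re
from collections import Counter

# Hypothetical stop list and synonym map, invented for this sketch.
STOP_LIST = {"the", "a", "was", "and", "of"}
SYNONYMS = {"rooms": "room", "filthy": "dirty"}

def parse_document(text, stop_list=STOP_LIST, synonyms=SYNONYMS,
                   start_list=None, max_terms=None):
    """Tokenize, drop stop-list terms, merge synonyms, and cap the term count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Map each surviving token to its group representative.
    kept = [synonyms.get(tok, tok) for tok in tokens if tok not in stop_list]
    # If a start list is given, keep only the terms it names.
    if start_list is not None:
        kept = [tok for tok in kept if tok in start_list]
    counts = Counter(kept)
    # most_common(None) returns every term; otherwise keep the top max_terms.
    return dict(counts.most_common(max_terms))

parsed = parse_document("The rooms were filthy and the room was dirty", max_terms=2)
```

Here "rooms" and "filthy" are folded into "room" and "dirty", and only the two most frequent terms are kept for the document.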
Text Filter
• Correct misspellings
• Assign frequency weights and term weights
• Manually filter out terms using filter view
Target variable | Term weight option
Present | Mutual information
Not present | Entropy
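To make the two kinds of weights concrete, here is a small Python sketch of the commonly used log frequency weight and entropy term weight. The formulas follow the usual log-entropy weighting scheme; treating them as exactly what the Text Filter node computes is an assumption, and the function names are invented.

```python
import math

def log_frequency_weight(count):
    """Log frequency weight: dampens a term that repeats within one document."""
    return math.log2(count + 1)

def entropy_term_weight(counts_per_document):
    """Entropy term weight for one term, given its count in every document.

    Returns 1.0 for a term concentrated in a single document and 0.0 for a
    term spread evenly over all documents.
    """
    n_docs = len(counts_per_document)
    total = sum(counts_per_document)
    if n_docs < 2 or total == 0:
        return 0.0  # weight is undefined here; 0.0 is a choice made for this sketch
    entropy_sum = 0.0
    for count in counts_per_document:
        if count > 0:
            p = count / total
            entropy_sum += p * math.log2(p)
    return 1.0 + entropy_sum / math.log2(n_docs)

rare_weight = entropy_term_weight([4, 0, 0, 0])    # term appears in one document
common_weight = entropy_term_weight([1, 1, 1, 1])  # term appears everywhere
```

A term that shows up in every review (such as "hotel" in a hotel-review collection) gets a weight near zero, which is one reason such terms are also candidates for manual filtering.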
Text Cluster
• Text Cluster groups documents with similar text contents
• Convert documents to Singular Value Decomposition (SVD) dimensions based on the term weights and frequency weights
• Group documents into mutually exclusive clusters based on the SVD dimensions
• Select the number of SVD dimensions and the number of clusters
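The idea of scoring documents on SVD dimensions and then clustering them can be sketched with NumPy. The weighted matrix below is invented, and k-means is used only as a short stand-in; it is not necessarily the clustering algorithm the Text Cluster node applies.

```python
import numpy as np

def svd_document_scores(weighted_matrix, n_dims):
    """Project documents (rows) onto the first n_dims SVD dimensions."""
    u, s, _vt = np.linalg.svd(weighted_matrix, full_matrices=False)
    return u[:, :n_dims] * s[:n_dims]

def kmeans(points, n_clusters, n_iter=20):
    """Plain k-means; each document ends up in exactly one cluster."""
    # Deterministic start (an assumption of this sketch): first points as centers.
    centers = points[:n_clusters].copy()
    for _ in range(n_iter):
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels

# Hypothetical weighted document-by-term matrix (4 documents, 4 terms).
weighted = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
])
scores = svd_document_scores(weighted, n_dims=2)
labels = kmeans(scores, n_clusters=2)
```

Documents 0 and 2 share their terms, as do documents 1 and 3, so they land in the same clusters; both the number of SVD dimensions and the number of clusters are choices the analyst makes.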
Text Topic
• Create a number of topics that are prevalent in documents
• Score each document on probability of containing the topic
• Each document could have multiple topics
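As a rough illustration of how one document can carry several topics, the Python sketch below scores a document against hand-written topics and keeps every topic whose score reaches a cutoff. The topics, weights, cutoff, and function names are all hypothetical, and the score is a simple weighted sum rather than the probability mentioned above.

```python
# Hypothetical topics: each maps terms to weights (invented for illustration).
TOPICS = {
    "cleanliness": {"dirty": 1.0, "clean": 1.0, "smell": 0.5},
    "location": {"strip": 1.0, "walk": 0.5},
}

def score_topics(term_counts, topics=TOPICS):
    """Score one document on every topic as a weighted sum of its term counts."""
    return {
        name: sum(weight * term_counts.get(term, 0) for term, weight in terms.items())
        for name, terms in topics.items()
    }

def assign_topics(term_counts, cutoff=1.0, topics=TOPICS):
    """Return every topic whose score reaches the cutoff (possibly several)."""
    scores = score_topics(term_counts, topics)
    return sorted(name for name, score in scores.items() if score >= cutoff)

# Hypothetical parsed document: term -> count.
doc = {"dirty": 2, "walk": 2, "pool": 1}
doc_scores = score_topics(doc)
doc_topics = assign_topics(doc)
```

Unlike the mutually exclusive clusters of Text Cluster, this document is assigned to both topics at once.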
Possible statistical analysis methods
For Classification Purposes:
Demo: Hotel Reviews for Riviera
Data Structure
Cleaning the raw data
Manually filtering terms
Terms Eliminated:
• quot
• riviera
• hotel
• stay (verb)
• strip
• vegas
• year
• riv
Results
Alternative: Read the Reviews
Questions?
Resources:
• Dr. Jim Love and Dr. Joni Shreve from LSU ISDS Department
• Data obtained from UCI Data Repository
• http://www.internetlivestats.com/twitter-statistics/
• ‘Text Analytics Using SAS Enterprise Miner’

Editor's Notes

  • #3 With the rising popularity of online services such as social media, the amount of text grows rapidly every day. The first tweet was sent on March 21, 2006 by Jack Dorsey, the creator of Twitter. In Twitter's short history, we went from 5,000 tweets per day in 2007 to 500,000,000 tweets per day in 2013. The volume of tweets is still growing, and the estimated growth rate is around 30%.
  • #4 However, traditional statistical analysis cannot be applied directly to this unstructured text. How would you solve a problem if you were facing a whole bunch of text data? Text mining provides a solution by converting the unstructured text to numeric data, which will hopefully give companies new insights that could lead to business advantages. For example, we recently saw a presentation in which a team text mined consumer complaints about cars submitted to the national highway safety group. They clustered the complaints by car part and used survival analysis to predict whether the car manufacturer would need to conduct a recall. Today, we will show you a demo on analyzing hotel reviews.
  • #8 Text parsing basically identifies which terms are in the documents. SAS Enterprise Miner has its own default dictionary of terms to keep. You can also import your own customized dictionary through a start list or a stop list: a start list is a list of terms to include, and a stop list is a list of terms to exclude. The last red box shows the maximum number of terms you can extract from each document.
  • #9 Text Filter has many functions: it can correct misspellings, reduce the number of terms, and, most importantly, assign frequency weights and term weights. A frequency weight reflects how important a term is within each document, whereas a term weight reflects how important a term is for the collection of documents as a whole. Normally you can just use the default options, but it's better to select the options manually to ensure the correct methodology is used.
  • #10 From this point on, the next two nodes are the only two nodes that can output numeric variables from text data. Text Cluster is one of them; it groups documents with similar text contents. Using the term weights and frequency weights generated by the Text Filter node, documents are scored on SVD dimensions and categorized into clusters. However, each document can belong to only one cluster, and all the documents within a cluster have similar SVD values, in other words, similar contents.
  • #11 The other node is Text Topic.
  • #12 After obtaining numeric variables, you can use different kinds of analysis methods. Text mining is not a simple process of just linking nodes and clicking run; you need to manipulate your data to get meaningful results. Jack will demonstrate an example of text mining.