Text Mining
Barbara Barbosa @bahbbc
BankFacil
26th February 2016
What is it?
The process of deriving information from text. It usually
requires preprocessing the input data.
Learning problem
Figure: Flow chart of learning problem
Corpus
A corpus is a set of n documents. Each document is
defined as a set of m terms (stems, single words, or groups of words).
Here, the corpus is all the text posted by clients on BankFacil's
Facebook page (https://www.facebook.com/bankfacil).
The R code is available at http://bit.ly/1XQ0mWw
Tokenizing - Lexical Analysis
Convert to lower case
Remove punctuation
Remove numbers
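
The three steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code linked in the talk; the function name `tokenize` and the sample sentence are assumptions made for the example.

```python
import re
import string

def tokenize(text):
    """Lowercase, drop punctuation and digits, then split on whitespace."""
    # Step 1: convert to lower case.
    text = text.lower()
    # Step 2: replace every ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Step 3: remove numbers.
    text = re.sub(r"\d+", " ", text)
    return text.split()

# Hypothetical client comment, used only for illustration.
print(tokenize("Olá! Quero um empréstimo de 5000 reais, é possível?"))
```

Replacing punctuation with a space instead of deleting it keeps words such as "reais,é" from being glued together when the original text has no space after a comma.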
Stopwords
Stopwords [1] are words that do not help
characterize the content of a text.
Removing them can reduce the size of texts by 30% to 50%.
[1] Portuguese stopwords available at:
http://snowball.tartarus.org/algorithms/portuguese/stop.txt
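
Stopword removal is a simple filter over the token list. The sketch below is in Python and uses a tiny hand-written stoplist of common Portuguese function words as a stand-in; in practice the full list would be loaded from the file linked above. The names `STOPWORDS` and `remove_stopwords` are assumptions made for the example.

```python
# Tiny illustrative stoplist (an assumption); a real run would load the full list.
STOPWORDS = {"de", "a", "o", "que", "e", "um", "para", "é"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stoplist."""
    return [token for token in tokens if token not in stopwords]

# Hypothetical tokenized comment, used only for illustration.
tokens = ["quero", "um", "empréstimo", "para", "comprar", "a", "casa"]
print(remove_stopwords(tokens))
```

A set is used for the stoplist so that each membership check takes constant time on average.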
Stemming
Experiments show that stemming reduces a document by about 5% of
its original size.
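
To make the idea concrete, here is a deliberately naive suffix-stripping sketch in Python. It is not the Porter or Snowball algorithm, and the English suffix list is a hypothetical choice for illustration only; a real pipeline for this Portuguese corpus would use a proper Portuguese stemmer.

```python
# Hypothetical suffix list, longest first; NOT the Porter or Snowball rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# All four variants collapse to the same stem.
print([toy_stem(w) for w in ["loans", "loaned", "loaning", "loan"]])
```

Mapping several surface forms to one stem is what shrinks the vocabulary, which is the size reduction this slide refers to.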
Vector Space Model
Binary
Frequency
tf-idf
tf-idf normalized
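
The first two weighting schemes in the list can be sketched as follows: each document becomes a vector with one position per vocabulary term, holding either a 0/1 presence flag (binary) or a raw count (frequency). This is a minimal Python illustration; the function names and the toy documents are assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of every distinct term in the corpus."""
    return sorted({term for doc in docs for term in doc})

def frequency_vector(doc, vocabulary):
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

def binary_vector(doc, vocabulary):
    """1 if the vocabulary term occurs in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical pre-tokenized documents, used only for illustration.
docs = [["loan", "rate", "loan"], ["rate", "house"]]
vocab = build_vocabulary(docs)
print(vocab)
print([frequency_vector(d, vocab) for d in docs])
print([binary_vector(d, vocab) for d in docs])
```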
TF-IDF
TF-IDF (Term Frequency - Inverse Document Frequency)
$$\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|\#Tr|}{Tr(t_k)} \qquad (1)$$
Tr - the corpus; |#Tr| is the total number of documents in it
#(t_k, d_j) - the number of times t_k occurs in d_j
Tr(t_k) - the number of documents in Tr in which t_k appears
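
Equation (1) translates almost directly into code. The sketch below is a minimal Python version under two assumptions the slide does not state: the logarithm is natural, and a term that never appears gets a weight of 0. The function name `tfidf` and the toy corpus are illustrative.

```python
import math

def tfidf(term, doc, corpus):
    """Term count in `doc` times log(total documents / documents containing `term`)."""
    term_count = doc.count(term)                        # #(t_k, d_j)
    doc_count = sum(1 for d in corpus if term in d)     # Tr(t_k)
    if term_count == 0 or doc_count == 0:
        return 0.0                                      # assumption: absent terms weigh 0
    return term_count * math.log(len(corpus) / doc_count)  # natural log (assumption)

# Hypothetical pre-tokenized corpus, used only for illustration.
corpus = [["loan", "rate", "loan"], ["rate", "house"], ["rate", "bank"]]
print(tfidf("loan", corpus[0], corpus))  # frequent here, rare elsewhere: high weight
print(tfidf("rate", corpus[0], corpus))  # appears in every document: weight 0
```

A term that occurs in every document gets log(1) = 0, so it contributes nothing, while a term concentrated in few documents is weighted up.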
Luhn’s experiment
Zipf’s law
Zipf’s law states that given some corpus, the frequency of any
word is inversely proportional to its rank in the frequency table.
More about Zipf’s law: https://www.youtube.com/watch?v=fCn8zs912OE
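
One way to check the law on a corpus is to build the rank-frequency table and look at rank × frequency, which should stay roughly constant if frequency is inversely proportional to rank. The Python sketch below uses made-up token counts chosen to follow that pattern; the function name `rank_frequency` is an assumption, and real text will only match approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

# Synthetic token list (an assumption), not data from the BankFacil corpus.
tokens = ["de"] * 6 + ["que"] * 3 + ["loan"] * 2 + ["rate"]
for rank, word, freq in rank_frequency(tokens):
    print(rank, word, freq, rank * freq)
```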
Bibliography
Based on slides from Prof. Sarajane Marques Peres's Data Mining
course.