Introduction to Text Mining
Agenda
• Defining Text Mining
• Structured vs. Unstructured Data
• Why Text Mining
• Some Text Mining Ambiguities
• Pre-processing the Text
Text Mining
• The discovery by computer of new, previously unknown information, by
automatically extracting information from a usually large amount of different
unstructured textual resources
Previously unknown means:
• Discovering genuinely new information
• Discovering new knowledge vs. merely finding patterns is like the difference
between a detective following clues to find the criminal vs. analysts looking at
crime statistics to assess overall trends in car theft
Unstructured means:
• Free naturally occurring text
• As opposed to HTML, XML, …
Text Mining Vs. Data Mining
• Data in Data mining is a series of numbers. Data for text mining is a collection of
documents.
• Data mining methods see data in spreadsheet format. Text mining methods see
data in document format
Structured vs. Unstructured Data
• Structured data
• Loadable into “spreadsheets”
• Arranged into rows and columns
• Each cell filled or could be filled
• Data mining friendly
• Unstructured data
• Microsoft Word, HTML, PDF documents, PPTs
• Usually converted into XML → semi-structured
• Not structured into cells
• Variable record length, notes, free form survey-answers
• Text is relatively sparse, inconsistent and not uniform
• Also images, video, music etc.
Why Text Mining?
• Leveraging text should improve decisions and predictions
• Text mining is gaining momentum
• Sentiment analysis (Twitter, Facebook)
• Predicting the stock market
• Predicting churn
• Customer influence
• Customer service and help desk
• Not to mention Watson
Why Is Text Mining Hard?
• Language is ambiguous
• Context is needed to clarify
• The same words can have different meaning (homographs)
• Bear (verb) – to support or carry
• Bear (noun) – a large animal
• Different words can mean the same (synonyms)
• Language is subtle
• Concept / word extraction usually results in a huge number of dimensions
• Thousands of new fields
• Each field typically has low information content (sparse)
• Misspellings, abbreviations, spelling variants
• Renders search engines, SQL queries, etc. ineffective
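The dimensionality and sparsity points can be made concrete with a minimal sketch, assuming a plain bag-of-words representation in which every distinct word becomes one column. The three-sentence corpus is hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical three-document corpus, invented for this illustration.
docs = [
    "the bank raised its interest rates",
    "mary walked along the bank of the river",
    "the store is next to the bank",
]

# One column ("field") per distinct word across the whole corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document: how often each vocabulary word occurs in it.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

cells = len(docs) * len(vocabulary)
zero_cells = sum(row.count(0) for row in matrix)
sparsity = zero_cells / cells

print(len(vocabulary), "columns for", len(docs), "documents")
print(f"{sparsity:.0%} of the cells are zero")
```

Even three short sentences produce more columns than rows, and most cells are zero. On a realistic corpus the same construction yields thousands of columns.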
Some Text Mining Ambiguities
• Homonymy: same word, different meaning
• Mary walked along the bank of the river
• HarborBank is the richest bank in the city
• Synonymy: different words with the same or similar meaning; one can often be
substituted for the other without changing the meaning, but not always (compare the two sentences below)
• Miss Nelson became a kind of big sister to Benjamin
• Miss Nelson became a kind of large sister to Benjamin
• Polysemy: same word or form, but different, albeit related meaning
• The bank raised its interest rates yesterday
• The store is next to the newly constructed bank
• The bank appeared first in Italy in the Renaissance
• Hyponymy: Concept hierarchy or subclass
• Animal (noun) – cat, dog
• Injury – broken leg, contusion
Seven Types of Text Mining
• Search and Information Retrieval – storage and retrieval of text documents, including
search engines and keyword search
• Document Clustering – Grouping and categorizing terms, snippets, paragraphs or
documents using clustering methods
• Document Classification – grouping and categorizing snippets, paragraphs or documents
using data mining classification methods, based on models trained on labelled
examples
• Web Mining – data and text mining on the internet, with a specific focus on the scale and
interconnectedness of the web
• Information Extraction – identification and extraction of relevant facts and relationships
from unstructured text
• Natural Language Processing – low-level language processing and understanding
tasks (e.g. part-of-speech tagging)
• Concept extraction – Grouping of words and phrases into semantically similar groups
Text Mining – Some Definitions
• Document – a sequence of words and punctuation, following the grammatical
rules of the language.
• Term – usually a word, but can be a word-pair or phrase
• Corpus – a collection of documents
• Lexicon – set of all unique words in corpus
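As a rough illustration of these definitions, the sketch below builds a lexicon and a list of word-pair terms from a tiny corpus. The two documents and the helper names `terms` and `word_pairs` are hypothetical, and the punctuation handling is deliberately simplistic.

```python
# Hypothetical two-document corpus, invented for this illustration.
corpus = [
    "Text mining finds new information.",
    "Data mining finds patterns in numbers.",
]

def terms(document):
    """Lowercased words of a document with surrounding punctuation stripped."""
    words = (w.strip(".,;:!?").lower() for w in document.split())
    return [w for w in words if w]

def word_pairs(document):
    """Adjacent word pairs, the simplest multi-word terms."""
    words = terms(document)
    return list(zip(words, words[1:]))

# Lexicon: every unique word that occurs anywhere in the corpus.
lexicon = sorted({word for document in corpus for word in terms(document)})

print(lexicon)
print(word_pairs(corpus[0]))
```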
Pre-processing the Text
• Text Normalization
• Parts of Speech (POS) Tagging
• Removal of stop words
• Stop words – common words that don’t add meaningful content to the document
• Stemming
• Removing suffixes and prefixes, leaving the root or stem of the word
• Term weighting
• Tokenization
Text Normalization
• Case
• Make all lower case (if you don’t care about proper nouns, titles, etc.)
• Clean up transcription and typing errors
• e.g. “do n’t”, “movei”
• Correct misspelled words
• Phonetically
• Use fuzzy matching algorithms such as Soundex, Metaphone or string edit distance
• Dictionaries
• Use POS and context to make a good guess
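One of the fuzzy matching options named above, string edit distance, can be sketched as follows. This is a minimal illustration assuming plain Levenshtein distance and a hypothetical three-word dictionary; Soundex and Metaphone are not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def correct(word, dictionary):
    """Return the dictionary word closest to `word` after lowercasing."""
    word = word.lower()
    return min(dictionary, key=lambda candidate: edit_distance(word, candidate))

# Hypothetical dictionary, invented for this illustration.
dictionary = ["movie", "music", "actor"]
print(correct("Movei", dictionary))
```

Plain Levenshtein distance counts the swapped letters in "movei" as two edits. With a larger dictionary a shorter word such as "move" would be closer, which is one reason POS and context are also needed to make a good guess.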
Parts of Speech Tagging
• Useful for recognizing names of people, places, organizations, titles
• English language
• Minimum set includes noun, verb, adjective, adverb, preposition, conjunction
POS Tags from the Penn Treebank (partial list)
Tag   Description               Tag   Description               Tag   Description
CC    Coordinating conjunction  CD    Cardinal number           DT    Determiner
EX    Existential there         FW    Foreign word              IN    Preposition or subordinating conjunction
JJ    Adjective                 JJR   Adjective, comparative    JJS   Adjective, superlative
LS    List item marker          MD    Modal                     NN    Noun, singular or mass
NNS   Noun, plural              NNPS  Proper noun, plural       PDT   Predeterminer
POS   Possessive ending         PRP   Personal pronoun          PRP$  Possessive pronoun
RB    Adverb                    RBR   Adverb, comparative       RBS   Adverb, superlative
RP    Particle                  SYM   Symbol                    TO    to
UH    Interjection              VB    Verb, base form           VBD   Verb, past tense
Example of Tagging
• In this talk, Mr. Pole discussed how Target was using Predictive Analytics including
descriptions of using potential value models, coupon models, and yes predicting
when a woman is due
• In/IN this/DT talk/NN, Mr./NNP Pole/NNP discussed/VBD how/WRB Target/NNP
was/VBD using/VBG Predictive/NNP Analytics/NNP including/VBG
descriptions/NNS of/IN using/VBG potential/JJ value/NN models/NNS,
coupon/NN models/NNS, and/CC yes/UH predicting/VBG when/WRB a/DT woman/NN
is/VBZ due/JJ
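To show how tagged output can be used for recognizing names, here is a small sketch that parses the word/TAG notation and joins consecutive proper nouns. It works on text that is already tagged and does not perform the tagging itself. The helper names are hypothetical, and punctuation is left out of the input string for simplicity.

```python
# Part of the tagged sentence from the example, in word/TAG notation.
tagged_text = (
    "In/IN this/DT talk/NN Mr./NNP Pole/NNP discussed/VBD how/WRB "
    "Target/NNP was/VBD using/VBG Predictive/NNP Analytics/NNP"
)

def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]

def proper_noun_phrases(pairs):
    """Join runs of consecutive NNP/NNPS tokens into candidate names."""
    phrases, run = [], []
    for word, tag in pairs:
        if tag in ("NNP", "NNPS"):
            run.append(word)
        elif run:
            phrases.append(" ".join(run))
            run = []
    if run:
        phrases.append(" ".join(run))
    return phrases

pairs = parse_tagged(tagged_text)
print(proper_noun_phrases(pairs))
```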
Tokenization
• Converts streams of characters into words
• Main clues (in English): Whitespace
• No single algorithm always ‘works’
• Some languages do not use whitespace between words (Chinese, Japanese)
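A minimal sketch of the whitespace clue and its limits, assuming English text: splitting on whitespace leaves punctuation glued to words, while a simple regular expression separates it. The pattern below is an illustrative choice, not a standard, and it would not segment Chinese or Japanese text.

```python
import re

# Words (optionally with internal apostrophes or hyphens), or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

def tokenize(text):
    """Return word and punctuation tokens in order of appearance."""
    return TOKEN_PATTERN.findall(text)

sentence = "Mr. Pole didn't say."
print(sentence.split())    # whitespace only: punctuation stays attached
print(tokenize(sentence))  # regex: punctuation becomes separate tokens
```

Note that the regex version splits "Mr." into two tokens, which is one example of why no single rule always works.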
Stemming
• Normalizes / unifies variations of the same word
• ‘walking’, ‘walks’, ‘walked’ → walk
• Inflectional stemming
• Remove plurals
• Normalize verb tenses
• Remove other affixes
• Stemming to root
• Reduce word to most basic element
• More aggressive than inflectional
• ‘denormalization’ → norm
• ‘Apply’, ‘applications’, ‘reapplied’ → apply
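Inflectional stemming can be illustrated with a toy suffix stripper. This is a sketch under simplifying assumptions (four suffixes, a minimum stem length of three), not an established algorithm such as the Porter stemmer, and it does not attempt stemming to root.

```python
def inflectional_stem(word):
    """Toy inflectional stemmer: strips a few common English suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["walking", "walks", "walked", "walk"]:
    print(word, "->", inflectional_stem(word))
```

Rules this crude break quickly: "applies" becomes "appli", for instance, which is why real stemmers add many more rules and exceptions.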
Common English Stop Words
• a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such,
that, the, their, then, these, they, this, to, was, will, with
• Stop words are very common and rarely provide useful information for
information extraction and concept extraction
• Removing stop words also reduces dimensionality
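Stop-word removal amounts to a set lookup per token. The sketch below uses a subset of the words listed above and a sentence from the ambiguity examples. The function name is hypothetical.

```python
# A subset of the stop words listed above.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "these", "they", "this", "to", "was", "will",
    "with",
}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = "The store is next to the newly constructed bank".split()
kept = remove_stop_words(tokens)
print(kept)
print(len(tokens), "tokens before,", len(kept), "after")
```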
Dictionaries and Lexicons
• Highly recommended, but can be very time-consuming
• Reduce the set of keywords to focus on
• Words of interest
• Dictionary words
• Increase the set of keywords to focus on
• Proper nouns
• Acronyms
• Titles
• Numbers
• Key ways to use dictionary
• Local dictionary (specialized words)
• Stop words and too frequent words
• Stemming – reduce stems to dictionary words
• Synonyms – replace synonyms with root words in the list
• Resolve abbreviations and acronyms
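Two of the dictionary uses above, synonym replacement and abbreviation resolution, might look like the following sketch. Both dictionaries are hypothetical and would in practice be built by hand for the domain, which is where the time goes.

```python
# Hypothetical local dictionaries, invented for this illustration.
ABBREVIATIONS = {"pos": "part of speech", "nlp": "natural language processing"}
SYNONYMS = {"automobile": "car", "auto": "car"}

def apply_dictionary(tokens):
    """Expand known abbreviations and map synonyms to their root word."""
    result = []
    for token in tokens:
        key = token.lower()
        if key in ABBREVIATIONS:
            result.extend(ABBREVIATIONS[key].split())
        else:
            result.append(SYNONYMS.get(key, key))
    return result

print(apply_dictionary("NLP tools tag POS in auto reviews".split()))
```

A lookup like this ignores context, so it inherits the ambiguity problems described earlier: a word is replaced whether or not the substitution makes sense in that sentence.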
Sentiment Analysis Workflow
• Steps: Content Retrieval → Content Extraction → Corpus Generation → Corpus Transformation → Corpus Filtering → Sentiment Calculation
• Phases: Web Data Retrieval, Corpus Pre-Processing, Sentiment Analysis
Sentiment Indicators
• $\text{polarity} = \frac{p - n}{p + n}$
• $\text{subjectivity} = \frac{p + n}{N}$
• $\text{positive sentiments per reference} = \frac{p}{N}$
• $\text{negative sentiments per reference} = \frac{n}{N}$
• $\text{sentiment differences per reference} = \frac{p - n}{N}$
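The five indicators can be computed directly once p, n and N are known. The slide does not define these symbols; the sketch below assumes p and n are the counts of positive and negative terms in a document and N is the total number of terms. The word lists, the result keys, and the choice to return 0.0 when a denominator is zero are all assumptions made for illustration.

```python
# Hypothetical sentiment word lists, invented for this illustration.
POSITIVE = {"good", "great", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "sad"}

def sentiment_indicators(tokens):
    """Compute the five indicators from counts p, n and N.

    Assumption: p and n count positive and negative tokens, N counts all tokens.
    """
    p = sum(token in POSITIVE for token in tokens)
    n = sum(token in NEGATIVE for token in tokens)
    N = len(tokens)
    return {
        # Assumption: indicators default to 0.0 when the denominator is zero.
        "polarity": (p - n) / (p + n) if p + n else 0.0,
        "subjectivity": (p + n) / N if N else 0.0,
        "positive_per_reference": p / N if N else 0.0,
        "negative_per_reference": n / N if N else 0.0,
        "difference_per_reference": (p - n) / N if N else 0.0,
    }

text = "great food good service excellent view but terrible parking outside"
indicators = sentiment_indicators(text.split())
for name, value in indicators.items():
    print(f"{name}: {value:.2f}")
```

Polarity ignores neutral words and ranges from -1 to 1, while the per-reference measures are scaled by document length, so the two can rank documents differently.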
