MK99 – Big Data 1 
Big data & cross-platform analytics 
MOOC lectures Pr. Clement Levallois
MK99 – Big Data 2 
A primer on text mining for business 
• Text mining: computational methods to find interesting information in texts
• Quasi synonyms:
  – natural language processing (abbreviated as NLP)
  – computational linguistics (the name of a scientific discipline)
MK99 – Big Data 3 
Text… what kinds? 
• Books
• Tweets
• Product reviews on Amazon
• LinkedIn profiles
• The whole Wikipedia
• Free text answers in the results of a survey
• Tenders, contracts, laws, …
• Print and online media
• Archival material
• …
MK99 – Big Data 4 
What can be done? 
• Sentiment analysis
  – Does this piece of text have a positive or a negative tone?
• Topic modeling / topic detection
  – What is the main theme of this 20-page booklet?
• Semantic disambiguation
  – “Paris” is mentioned in this text. Is this Paris Hilton or Paris, France?
• Named Entity Recognition (NER)
  – Automatically find the individuals, organizations and events named in the text, and the relations between them.
• Semantic enrichment
  – If you search Google for “TV”, results for “television” will also show up
• Language detection
  – “Ich spreche Deutsch” -> this sentence is written in German
• Automatic translation
  – See Google Translate
• Summarizing
  – Shortening a text while keeping its core message intact
• Spelling correction
  – Well, that’s easy
• Topic classification
  – Is this email spam or not?
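To make the first item concrete, here is a minimal sketch of lexicon-based sentiment scoring in Python. It is a toy illustration, not how the tools mentioned in this course work: the word lists `POSITIVE` and `NEGATIVE` and the function names are hypothetical and chosen by hand for this example. Real systems rely on much larger lexicons or on trained models.

```python
import re

# Hypothetical, hand-picked word lists (an assumption for illustration only).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "disappointed"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment(text):
    """Return 'positive', 'negative' or 'neutral' from simple word counts."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("I love this great phone"))
print(sentiment("Terrible battery, I hate it"))
```

Counting words this way ignores negation ("not good") and irony, which is one reason sentiment analysis is harder than it looks.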
MK99 – Big Data 5 
Amaze me! 
• Demo on sentiment analysis
  With a tool by Stanford: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
• Demo on semantic disambiguation
  With a tool built by a collaborative effort: http://dbpedia-spotlight.github.io/demo/
  (click on “annotate”, and also try replacing the text with one of your own)
MK99 – Big Data 6 
What can’t be done yet (but is actively researched) 
• Detection of irony
• Robust translation
• Reasoning beyond Q&A
What makes things harder
• Non-English texts
• Slang and colloquial speech forms
• Real-time processing
MK99 – Big Data 7 
Examples of routine operations when working with text (or, how to follow the most basic conversation in computational linguistics)
• Stemming
  – “liked” and “like” are reduced to a common stem (“lik” with an aggressive stemmer; some stemmers yield “like”) to facilitate further operations
• Lemmatizing
  – Grouping “liked”, “like” and “likes” under their dictionary form (the lemma “like”), to count them as one basic semantic unit
• Part-of-Speech tagging (aka POS tagging)
  – Automatically detecting the grammatical function of the terms used in a sentence, to facilitate translation or other tasks
• “Starting the text analysis with a bag-of-words model”
  – An operation which consists in just listing and counting all the different words in the text.
• N-grams
  – The text “I am Dutch” is made of 3 words: I, am, Dutch. But it can also be interesting to look at bigrams in the text: “I am”, “am Dutch”. Or trigrams: “I am Dutch”.
  – When neighboring words are considered together, as we just did, they are called n-grams. This can reveal interesting things about frequent expressions used in the text.
  – A good example of how useful this can be: visit the Ngram Viewer by Google: https://books.google.com/ngrams
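The operations above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions: `crude_stem` is a hypothetical toy suffix stripper written for this example (it is not the Porter stemmer or any published algorithm), and the function names are our own.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def crude_stem(word):
    """Toy stemmer (an assumption, not a real algorithm): strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def bag_of_words(text):
    """List and count all the different words in the text."""
    return Counter(tokenize(text))


def ngrams(tokens, n):
    """Return every run of n neighboring tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I am Dutch".split()
print(ngrams(tokens, 2))  # ['I am', 'am Dutch']
print(ngrams(tokens, 3))  # ['I am Dutch']

print([crude_stem(w) for w in ("liked", "like", "likes")])  # ['lik', 'lik', 'lik']
print(bag_of_words("I like what I like"))
```

Applying `crude_stem` to each token before counting merges word variants into a single entry of the bag of words, which is the reason stemming is usually done first.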
MK99 – Big Data 8 
Chief benefit: Getting to know individuals better 
• Without text mining, we have access to “external”, “cold” states of the individual
  – Behavior (e.g., clicks), external attributes (address, gender, encyclopedia entry), social networks (but relatively cold ones).
• With text mining, we have access to “internal”, “hot” states:
  – opinions
  – intentions
  – preferences
  – degree of consensus
  – social networks (who mentions whom: how, in which context)
  – implicit attributes of the speaker
MK99 – Big Data 9 
How easy is it? 
• Too easy… the limit is legal and ethical, not technical
  – “Predicting the Political Alignment of Twitter Users” by Conover et al. (2011). http://cnets.indiana.edu/wp-content/uploads/conover_prediction_socialcom_pdfexpress_ok_version.pdf
  – “Political Tendency Identification in Twitter using Sentiment Analysis Techniques” by Pla and Hurtado (2014). http://anthology.aclweb.org/C/C14/C14-1019.pdf
  – “Private traits and attributes are predictable from digital records of human behavior” by Kosinski et al. (2013). http://www.pnas.org/content/110/15/5802.abstract
(and this gets even more powerful when mixing text mining, network analysis and machine learning)
MK99 – Big Data 10 
What use for text mining in a business context? 
1. Client-facing
2. Business management
3. Business development
MK99 – Big Data 11 
1. Market facing activities 
• Refined scoring: propensity scores (including churn), scoring of prospects
• Refined individualization of campaigns
  – ads, email campaigns, coupons, etc.
• Better community management
  – Getting a clear and precise picture of how customers and prospects perceive, talk about, and engage with your brand / product / industry.
MK99 – Big Data 12 
2. Business Management 
• Organizational mapping
  – Getting a view of the organization through text flows.
  – Example: getting a view of the activity of a business school through a map of its scientific publications.
• HRM
  – Finding talent in niche industries, based on the mining of candidates’ profiles
• Marketing research
  – refined segmentation + targeting + positioning, measuring customer satisfaction, perceptual mapping.
MK99 – Big Data 13 
3. Business development 
• Developing adjunct services
  – product recommendation systems (e.g., Amazon’s)
  – detection and matching of needs (e.g., detection of complaints / mood changes)
  – product enhancements (e.g., content enrichment through localization/personalization)
• Developing new products entirely, based on
  – different search engines
  – alert systems / automated systems based on monitoring textual input
  – knowledge databases
  – new forms of content curation / high value info creation + delivery
MK99 – Big Data 14 
Interesting players 
through their “Data Services” package 
+ many APIs listed on www.programmableweb.com
MK99 – Big Data 15 
This slide presentation is part of a course offered by EMLYON Business School (www.em-lyon.com) 
Contact Clement Levallois (levallois [at] em-lyon.com) for more information.
