Big Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICS

Dr. Normand Péladeau
CEO
Provalis Research Corp.
peladeau@provalisresearch.com
Promise and Pitfalls
of Text Analytics

Our Software
Qualitative Analysis (2004)
Content Analysis & Text Mining (1998)

Text Analytics Applications
• Sentiment Analysis (social media)
• Voice of the Customer (emails, chat, call center transcripts)
• Product improvement (warranty claims)
• Competitive Intelligence (patents, web sites)
• Risk management (incident or maintenance reports)
• Fraud detection (insurance claims)
• Survey analysis (open-ended questions)
• Interview & focus group transcripts
• Reputation management (news, blogs, social media)
• Scientometrics studies (journal articles, titles & abstracts)
• Crime analysis (narratives, computer forensics, testimonies)
• Financial prediction (earnings releases, news, press releases)
• Surveillance system (communication, medical reports)
• Many more...

Text Mining in the World of Data Science

State of the art text analytics…
« Regarding language, we start to see some breakthrough, but there is
more work ahead of us than behind us…
…it starts to show some results, but it is still in its infancy. »
Yoshua Bengio, 2017

The Black Box of Text Analysis

THREE MAJOR OBSTACLES
1) Very large number of word forms
2) Polymorphy of language
One idea → multiple forms
3) Polysemy of words
One word → many ideas
Text Analytics Challenge

Challenge #1 – Quantity
31,996 comments about hotels
• 1.7 million words (tokens)
• 20,116 terms or word forms (types)
1.8 million course evaluations
• 35 million words (tokens)
• 78,159 terms or word forms (types)
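The token and type figures above can be reproduced in principle with a simple count. The following is a minimal sketch, assuming a naive lowercase, letters-only tokenizer; the function name `token_type_counts` and the two sample comments are illustrative and do not come from the hotel or course-evaluation data.

```python
import re
from collections import Counter

def token_type_counts(texts):
    """Return (number of tokens, number of types) for an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Lowercase, then treat each run of letters/apostrophes as one token.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # Tokens = every word occurrence; types = distinct word forms.
    return sum(counts.values()), len(counts)

# Hypothetical sample comments, for illustration only.
comments = [
    "The room was clean and the staff was friendly.",
    "Clean room, friendly staff, great location!",
]
tokens, types = token_type_counts(comments)
print(tokens, types)
```

A real tokenizer would also have to decide how to treat numbers, hyphens and accented characters, which changes both counts.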

Irascible by Mel Bochner (2006)

Amazing by Mel Bochner (2011)
Meaningless by Mel Bochner (2016)

Challenge #2 – Polymorphy of Language
Boring

Challenge #2 – Polymorphy of Language
Lack of Preparation & Organization

Challenge #3 – Polysemy of words

Source: https://muse.dillfrog.com/lists/ambiguous

“This fall take a break from the cold, catch a plane and go south.”
Number of senses of the ambiguous words: 44, 44, 75, 16, 39, 10, 7, 5
44 × 44 × 75 × 16 × 39 × 10 × 7 × 5
≈ 31.7 billion possible combinations
Challenge #3 – Polysemy of words
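The figure of 31.7 billion is simply the product of the eight sense counts shown on the slide. A quick check, assuming each ambiguous word can take any of its senses independently of the others:

```python
from math import prod

# Sense counts as listed on the slide, one per ambiguous word in the sentence.
sense_counts = [44, 44, 75, 16, 39, 10, 7, 5]

# Every combination of one sense per word is a candidate reading.
combinations = prod(sense_counts)
print(f"{combinations:,}")  # 31,711,680,000
```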

FOUR MAJOR OBSTACLES
4) Misspellings

1.8 million student comments
• More than 35 million words
• 78,159 word forms
• 46,404 “unknown” words
o 75% misspellings (≈ 35,000)
o 21% proper names (products & people)
o 4% acronyms
Challenge #4 – Misspellings

Challenge #4 – Misspellings
95 ways to be “Enthusiastic”
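One common way to fold spelling variants back onto a known word is approximate string matching. The sketch below uses Python's standard `difflib` module; it is not necessarily the method used in the talk, and the vocabulary, the misspelled variants and the 0.8 cutoff are all assumptions chosen for illustration.

```python
from difflib import get_close_matches

def normalize(word, vocabulary, cutoff=0.8):
    """Map a possibly misspelled word to its closest vocabulary entry."""
    word = word.lower()
    if word in vocabulary:
        return word
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    # Leave the word untouched when nothing is similar enough.
    return matches[0] if matches else word

# Hypothetical vocabulary and misspellings, not the 95 forms from the slide.
vocabulary = ["enthusiastic", "boring", "organized"]
for variant in ["enthousiastic", "enthusiatic", "entusiastic", "xyz"]:
    print(variant, "->", normalize(variant, vocabulary))
```

A lower cutoff catches more variants but also merges unrelated words, so the threshold has to be tuned against the data.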

Accuracy of Sentiment Analysis

Sentiment Analysis using Machine Learning
What about machine learning?
• Training set: 20,000 beauty product reviews
• Algorithm: Naïve Bayes
• Predictive features: 4,553 most frequent words
• Optimisation: Leave-one-out cross-validation
• Evaluation: 20,000 new reviews
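The setup above can be sketched in a few lines. This is a toy multinomial Naïve Bayes with Laplace smoothing and a leave-one-out loop, written in plain Python; it is not the software used in the experiment, the six reviews are invented, and the selection of the most frequent words as features is omitted for brevity.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label). Returns class counts, word counts, vocab."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # log prior
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothed word likelihood.
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def leave_one_out_accuracy(docs):
    """Hold out each document in turn, train on the rest, and score it."""
    hits = 0
    for i, (tokens, label) in enumerate(docs):
        model = train(docs[:i] + docs[i + 1:])
        hits += predict(model, tokens) == label
    return hits / len(docs)

# Invented toy reviews, for illustration only.
reviews = [
    ("love this cream".split(), "pos"),
    ("great product love it".split(), "pos"),
    ("works great".split(), "pos"),
    ("hate this cream".split(), "neg"),
    ("terrible product hate it".split(), "neg"),
    ("broke out terrible".split(), "neg"),
]
model = train(reviews)
print(predict(model, "love it great".split()))
print(leave_one_out_accuracy(reviews))
```

Leave-one-out only estimates performance on data resembling the training set, which is why the slide lists a separate evaluation on 20,000 new reviews.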

Accuracy of Sentiment Analysis
Machine Learning

Bias in Sentiment Analysis
Machine Learning

The Origin of Topic Modeling
2003 – Latent Dirichlet Allocation (LDA)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
1999 – Probabilistic Latent Semantic Analysis (pLSA)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57 (1999)
1990 – Latent Semantic Analysis (LSA)
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391 (1990)

The Forgotten Origins of Topic Modeling
1963 – Information Retrieval
Borko, H., Bernick, M.: Automatic document classification. Journal of the ACM 10(2), 151–162 (1963)
Borko, H., Bernick, M.: Automatic document classification part II. Additional experiments. Journal of the ACM 11(2), 138–151 (1964)
1964 – Psychology
Harway, N.I., Iker, H.P.: Computer analysis of content in psychotherapy. Psychological Reports 14(3), 720–722 (1964)
Iker, H.P., Harway, N.I.: A computer approach towards the analysis of content. Systems Research and Behavioral Science 10(2), 173–182 (1965)
1972 – Communication
Jandt, F.E.: Sources for computer utilization in interpersonal communication instruction and research. Communication Quarterly 20(2), 25–31 (1972)
1973 – Literary Analysis
Sainte-Marie, P., Robillard, P., Bratley, P.: An application of principal component analysis to the work of Molière. Computers and the Humanities 7, 131–137 (1973)

Experimental Data
1. Subset of TREC-AP
• 2,250 Associated Press news articles
2. HICSS Abstracts
• 1,750 abstracts from the HICSS conference (2014–2016)
3. Hotel Reviews
• 31,898 reviews of hotels in Las Vegas (Expedia)

The State of the Art in Artificial Intelligence
AI
Artificial Intelligence Augmented Intelligence
