Dr. Normand Péladeau
CEO
Provalis Research Corp.
peladeau@provalisresearch.com
Promise and Pitfalls
of Text Analytics
Our Software
Qualitative Analysis
2004
Content Analysis & Text Mining
1998
Some of our Clients
Text Analytics Applications
• Sentiment Analysis (social media)
• Voice of the Customer (emails, chat, call center transcripts)
• Product improvement (warranty claims)
• Competitive Intelligence (patents, web sites)
• Risk management (incident or maintenance reports)
• Fraud detection (insurance claims)
• Survey analysis (open-ended questions)
• Interview & focus group transcripts
• Reputation management (news, blogs, social media)
• Scientometrics studies (journal articles, titles & abstracts)
• Crime analysis (narratives, computer forensics, testimonies)
• Financial prediction (earnings releases, news, press releases)
• Surveillance system (communication, medical reports)
• Many more...
Text Mining in the World of Data Science
What you think you get…
What you pay…
What you really get…
State of the art text analytics…
“Regarding language, we start to see some breakthrough, but there is
more work ahead of us than behind us…
…it starts to show some results, but it is still in its infancy.”
Yoshua Bengio, 2017
The Black Box of Text Analysis
THREE MAJOR OBSTACLES
1) Very large number of word forms
2) Polymorphy of language
One idea → multiple forms
3) Polysemy of words
One word → many ideas
Text Analytics Challenge
Challenge #1 – Quantity
31,996 comments about hotels
• 1.7 million words (tokens)
• 20,116 terms or word forms (types)
1.8 million course evaluations
• 35 million words (tokens)
• 78,159 terms or word forms (types)
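The token and type counts above can be reproduced on any corpus with a few lines of code. The following is a minimal sketch, assuming a simple lowercase, letters-only tokenizer; the two sample comments are hypothetical and the tokenization rules used for the figures on the slide are not known.

```python
import re
from collections import Counter

def token_type_counts(texts):
    """Return (number of tokens, number of distinct word forms) for a corpus."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return sum(counts.values()), len(counts)

# Hypothetical sample comments, for illustration only.
comments = [
    "The room was clean and the staff was friendly.",
    "Clean room, friendly staff, great location!",
]
tokens, types = token_type_counts(comments)
print(tokens, types)
```

Tokens grow with every word written, while types grow only when a new word form appears, which is why the type count stays far smaller than the token count.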
Irascible by Mel Bochner (2006)
Amazing by Mel Bochner (2011)
Meaningless by Mel Bochner (2016)
Challenge #2 – Polymorphy of Language
Boring
Lack of Preparation & Organization
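One common way to handle polymorphy is a categorization dictionary that maps many surface forms to one idea. The sketch below is illustrative only: the category names and expression lists are hypothetical, not taken from the slides or from any product.

```python
import re

# Hypothetical mini-dictionary: several surface forms, one idea.
CATEGORIES = {
    "BORING": ["boring", "dull", "tedious", "monotonous", "put me to sleep"],
    "LACK_OF_PREPARATION": ["unprepared", "disorganized", "not prepared",
                            "all over the place"],
}

def code_comment(comment):
    """Return the set of categories whose expressions occur in the comment."""
    text = comment.lower()
    found = set()
    for category, expressions in CATEGORIES.items():
        for expr in expressions:
            # Match whole words or whole phrases only.
            if re.search(r"\b" + re.escape(expr) + r"\b", text):
                found.add(category)
                break
    return found

print(code_comment("The lectures were dull and he was often unprepared."))
```

A dictionary like this only covers the forms someone thought to list, which is the point of the challenge: the number of ways to express one idea is open-ended.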
Challenge #3 – Polysemy of words
Source: https://muse.dillfrog.com/lists/ambiguous
“This fall take a break from the cold, catch a plane and go south.”
Number of senses of each ambiguous word: 44, 44, 75, 16, 39, 10, 7, 5
44 × 44 × 75 × 16 × 39 × 10 × 7 × 5
≈ 31.7 billion possible interpretations
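The arithmetic on this slide can be checked directly. This short sketch multiplies the eight sense counts shown for the example sentence; the variable names are illustrative.

```python
import math

# Sense counts for the eight ambiguous words, as listed on the slide.
senses_per_word = [44, 44, 75, 16, 39, 10, 7, 5]

# Each word's senses combine independently, so the counts multiply.
combinations = math.prod(senses_per_word)
print(f"{combinations:,}")  # 31,711,680,000
```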
FOUR MAJOR OBSTACLES
1) Very large number of word forms
2) Polymorphy of language
One idea → multiple forms
3) Polysemy of words
One word → many ideas
4) Misspellings
Text Analytics Challenge
1.8 million student comments
• More than 35 million words
• 78,159 word forms
• 46,404 “unknown” words
o 75% misspellings (≈ 35,000)
o 21% proper names (products & people)
o 4% acronyms
Challenge #4 – Misspellings
95 ways to be “Enthusiastic”
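One way to fold misspelled variants back onto a known word form is approximate string matching. The sketch below uses the standard-library `difflib.get_close_matches`; the vocabulary, the misspelled variants, and the 0.8 similarity cutoff are assumptions chosen for illustration, not the 95 forms shown on the slide.

```python
import difflib

def normalize(word, vocabulary, cutoff=0.8):
    """Map a possibly misspelled word to its closest known form, if any."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    # Leave the word unchanged when nothing in the vocabulary is close enough.
    return matches[0] if matches else word

# Hypothetical vocabulary and misspellings, for illustration only.
vocabulary = ["enthusiastic", "organized", "boring"]
variants = ["enthusiatic", "enthousiastic", "entusiastic", "organised"]
normalized = [normalize(w, vocabulary) for w in variants]
print(normalized)
```

The cutoff is a trade-off: a lower value catches more distant misspellings but also risks merging distinct words.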
Accuracy of Sentiment Analysis
Sentiment Analysis using Machine Learning
What about machine learning?
• Training set: 20,000 beauty product reviews
• Algorithm: Naïve Bayes
• Predictive features: 4,553 most frequent words
• Optimisation: Leave-one-out cross-validation
• Evaluation: 20,000 new reviews
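The setup listed above (Naïve Bayes on the most frequent words, tuned with leave-one-out cross-validation) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the experiment from the talk: the four tokenized reviews are hypothetical, the model is a multinomial Naïve Bayes with add-one smoothing, and all function names are invented for the example.

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab_size=4553):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    # Keep only the most frequent words as predictive features.
    freq = Counter(w for doc in docs for w in doc)
    vocab = {w for w, _ in freq.most_common(vocab_size)}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        word_counts[label].update(w for w in doc if w in vocab)
    return {"vocab": vocab, "class_docs": class_docs, "word_counts": word_counts}

def predict_nb(model, doc):
    """Return the label with the highest log-posterior for a tokenized doc."""
    vocab = model["vocab"]
    n_docs = sum(model["class_docs"].values())
    best_label, best_score = None, -math.inf
    for label, n in model["class_docs"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n / n_docs)  # class prior
        for w in doc:
            if w in vocab:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((counts[w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def leave_one_out_accuracy(docs, labels, vocab_size=4553):
    """Train on all documents but one, test on the held-out one, repeat."""
    hits = 0
    for i in range(len(docs)):
        model = train_nb(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:],
                         vocab_size)
        hits += predict_nb(model, docs[i]) == labels[i]
    return hits / len(docs)

# Hypothetical toy reviews, already tokenized.
docs = [["love", "this", "cream"], ["great", "smell", "love"],
        ["terrible", "smell"], ["hate", "this", "cream"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
print(predict_nb(model, ["love", "great"]), leave_one_out_accuracy(docs, labels))
```

With only four documents the leave-one-out estimate is not meaningful; it is included to show the mechanics of the procedure named on the slide.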
Accuracy of Sentiment Analysis
Machine Learning
Bias in sentiment analysis
Machine Learning
What about topic modeling?
The Origin of Topic Modeling
2003 – Latent Dirichlet Allocation (LDA)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research
3(Jan), 993–1022 (2003)
1999 – Probabilistic Latent Semantic Analysis (pLSA)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57 (1999)
1990 – Latent Semantic Analysis (LSA)
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic
analysis. Journal of the American Society for Information Science 41(6), 391 (1990)
The Forgotten Origins of Topic Modeling
1963 – Information Retrieval
Borko, H., Bernick, M.: Automatic document classification. Journal of the ACM 10(2), 151–162 (1963)
Borko, H., Bernick, M.: Automatic document classification part ii. Additional experiments. Journal of the ACM
(JACM) 11(2), 138–151 (1964)
1964 – Psychology
Harway, N.I., Iker, H.P.: Computer analysis of content in psychotherapy. Psychological Reports 14(3), 720–722 (1964)
Iker, H.P., Harway, N.I.: A computer approach towards the analysis of content. Systems Research and Behavioral
Science 10(2), 173–182 (1965)
1972 – Communication
Jandt, F.E.: Sources for computer utilization in interpersonal communication instruction and research.
Communication Quarterly 20(2), 25–31 (1972)
1973 – Literary Analysis
Sainte-Marie, P., Robillard, P., Bratley, P.: An application of principal component analysis to the work of Molière.
Computers and the Humanities 7, 131–137 (1973)
Experimental Data
1. Subset of TREC-AP
• 2,250 Associated Press news articles
2. HICSS Abstracts
• 1,750 abstracts from the HICSS conference, 2014–2016
3. Hotel Reviews
• 31,898 reviews of hotels in Las Vegas (Expedia)
The State of the Art in Artificial Intelligence
AI
Artificial Intelligence Augmented Intelligence
QUESTIONS?

Big Data LDN 2018: Promise and Pitfalls of Text Analytics

 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 

Big Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICS

  • 1. Dr. Normand Péladeau CEO Provalis Research Corp. peladeau@provalisresearch.com Promise and Pitfalls of Text Analytics
  • 3. Some of our Clients
  • 4. Text Analytics Applications • Sentiment Analysis (social media) • Voice of the Customer (emails, chat, call center transcripts) • Product improvement (warranty claims) • Competitive Intelligence (patents, web sites) • Risk management (incident or maintenance reports) • Fraud detection (insurance claims) • Survey analysis (open-ended questions) • Interview & focus group transcripts • Reputation management (news, blogs, social media) • Scientometrics studies (journal articles, titles & abstracts) • Crime analysis (narratives, computer forensics, testimonies) • Financial prediction (earnings releases, news, press releases) • Surveillance system (communication, medical reports) • Many more...
  • 5. Text Mining in the World of Data Sciences
  • 6. What you think you get…
  • 9. State of the art text analytics… « Regarding language, we start to see some breakthrough, but there is more work ahead of us than behind us… …it starts to show some results, but it is still in its infancy. » Yoshua Bengio, 2017
  • 10. The black box of text analysis
  • 11. THREE MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language One idea → multiple forms 3) Polysemy of words One word → many ideas Text Analytics Challenge
  • 12. THREE MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language One idea → multiple forms 3) Polysemy of words One word → many ideas Text Analytics Challenge
  • 13. Challenge #1 – Quantity 31,996 comments about hotels • 1.7 million words (tokens) • 20,116 terms or word forms (types) 1.8 million course evaluations • 35 million words (tokens) • 78,159 terms or word forms (types)
  • 16. THREE MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language One idea → multiple forms Text Analytics Challenge
  • 17. Irascible by Mel Bochner (2006)
  • 18. Amazing by Mel Bochner (2011) Meaningless by Mel Bochner (2016)
  • 19. Challenge #2 – Polymorphy of Language Boring
  • 20. Challenge #2 – Polymorphy of Language Lack of Preparation & Organization
  • 21. THREE MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language One idea → multiple forms Text Analytics Challenge
  • 22. THREE MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language One idea → multiple forms 3) Polysemy of words One word → many ideas Text Analytics Challenge
  • 23. Challenge #3 – Polysemy of words
  • 25. Challenge #3 – Polysemy of words “This fall take a break from the cold, catch a plane and go south.” 44, 44, 75, 16, 39, 10, 7, 5: 44 × 44 × 75 × 16 × 39 × 10 × 7 × 5 ≈ 31.7 billion
  • 26. THREE MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language One idea → multiple forms 3) Polysemy of words One word → many ideas Text Analytics Challenge
  • 27. FOUR MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language One idea → multiple forms 3) Polysemy of words One word → many ideas Text Analytics Challenge
  • 28. FOUR MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language One idea → multiple forms 3) Polysemy of words One word → many ideas 4) Misspellings Text Analytics Challenge
  • 29. Challenge #4 – Misspellings 1.8 million student comments • More than 35 million words • 78,159 word forms • 46,404 “unknown” words: 75% misspellings (≈ 35,000), 21% proper names (products & people), 4% acronyms
  • 30. Challenge #4 – Misspellings 95 ways to be “Enthusiastic”
  • 33. What about machine learning? Sentiment Analysis using Machine Learning • Training set: 20,000 beauty product reviews • Algorithm: Naïve Bayes • Predicting features: 4,553 most frequent words • Optimisation: Leave-one-out cross-validation • Evaluation: 20,000 new reviews
  • 34. Accuracy of sentiment analysis (Machine Learning)
  • 35. Bias in sentiment analysis (Machine Learning)
  • 36. What about topic modeling?
  • 39. The Origin of Topic Modeling 2003 – Latent Dirichlet Allocation (LDA) Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003) 1999 – Probabilistic Latent Semantic Analysis (pLSA) Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57 (1999) 1990 – Latent Semantic Analysis (LSA) Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391 (1990)
  • 40. The Forgotten Origins of Topic Modeling 1963 – Information Retrieval Borko, H., Bernick, M.: Automatic document classification. Journal of the ACM 10(2), 151–162 (1963) Borko, H., Bernick, M.: Automatic document classification part II. Additional experiments. Journal of the ACM 11(2), 138–151 (1964) 1964 – Psychology Harway, N.I., Iker, H.P.: Computer analysis of content in psychotherapy. Psychological Reports 14(3), 720–722 (1964) Iker, H.P., Harway, N.I.: A computer approach towards the analysis of content. Systems Research and Behavioral Science 10(2), 173–182 (1965) 1972 – Communication Jandt, F.E.: Sources for computer utilization in interpersonal communication instruction and research. Communication Quarterly 20(2), 25–31 (1972) 1973 – Literary Analysis Sainte-Marie, P., Robillard, P., Bratley, P.: An application of principal component analysis to the work of Molière. Computers and the Humanities 7, 131–137 (1973)
  • 41. Experimental Data 1. Subset of TREC-AP • 2,250 Associated Press news articles 2. HICSS Abstracts • 1,750 abstracts of the HICSS conference, 2014–2016 3. Hotel Reviews • 31,898 reviews of hotels in Las Vegas (Expedia)
  • 45. The state of the art in artificial intelligence AI: Artificial Intelligence → Augmented Intelligence
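
The figures quoted on slides 13, 25 and 29 can be re-derived with a few lines of Python. This is only a rough check, not part of the original talk: the token counts are the rounded values given on the slides, and the eight numbers on slide 25 are assumed here to be the number of possible senses of the sentence's ambiguous words. All variable names are illustrative.

```python
from math import prod

# Slide 13: corpus sizes (tokens = running words, types = distinct word forms).
# Token counts are the rounded figures quoted on the slide.
corpora = {
    "hotel comments": {"tokens": 1_700_000, "types": 20_116},
    "course evaluations": {"tokens": 35_000_000, "types": 78_159},
}
for name, size in corpora.items():
    tokens_per_type = size["tokens"] / size["types"]
    print(f"{name}: roughly {tokens_per_type:.0f} tokens per word form")

# Slide 25: assumed to be sense counts for the ambiguous words of
# "This fall take a break from the cold, catch a plane and go south."
sense_counts = [44, 44, 75, 16, 39, 10, 7, 5]
readings = prod(sense_counts)  # every combination of one sense per word
print(f"{readings:,} possible sense combinations")  # 31,711,680,000

# Slide 29: breakdown of the 46,404 "unknown" word forms.
unknown = 46_404
shares = {"misspellings": 0.75, "proper names": 0.21, "acronyms": 0.04}
estimates = {kind: round(unknown * share) for kind, share in shares.items()}
for kind, count in estimates.items():
    print(f"{kind}: about {count:,} word forms")
```

The product on slide 25 grows multiplicatively with each ambiguous word, which is why a short, ordinary sentence already yields tens of billions of candidate readings.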