Date: 13th November 2018
Location: AI Lab Theatre
Time: 13:10 - 13:40
Speaker: Normand Peladeau
Organisation: Provalis Research
About: Over the last 10 years, text analytics has become quite popular, as witnessed by the numerous offerings from commercial companies and open source libraries for automatic information extraction, sentiment analysis, and relation extraction, to name a few applications. Many of these products make bold claims about their high accuracy and impressive ability to tackle the most difficult challenges in the analysis of human language (polysemy, entity resolution, sarcasm, etc.). Their use of buzzwords like AI, NLP, and deep semantics gives them an aura of scientific credibility, yet users who dare to look closely are often disappointed by the performance. In this presentation, we will discuss why human language represents such a challenge for data analysts. We will look inside the black box of some text analytics techniques to get a better understanding of the main challenges that still need to be solved. We will also illustrate some successful applications to help the audience appreciate the true value text analytics can offer. We will go behind the curtain to show you what is questionable so that you can establish realistic expectations and appreciate the real power and potential of text analytics.
9. State of the art text analytics…
« Regarding language, we start to see some breakthrough, but there is more work ahead of us than behind us…
…it starts to show some results, but it is still in its infancy. »
Yoshua Bengio, 2017
11. THREE MAJOR OBSTACLES
1) Very large number of word forms
2) Polymorphy of language
One idea → multiple forms
3) Polysemy of words
One word → many ideas
Text Analytics Challenge
13. Challenge #1 – Quantity
31,996 comments about hotels
• 1.7 million words (tokens)
• 20,116 terms or word forms (types)
1.8 million course evaluations
• 35 million words (tokens)
• 78,159 terms or word forms (types)
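The token and type figures above can be illustrated with a minimal sketch. The two sample comments, the helper name `token_type_counts`, and the simple letter-based tokenizer are assumptions made for illustration; they are not the tool or the data used in the talk.

```python
import re
from collections import Counter


def token_type_counts(texts):
    """Return (tokens, types): total words and distinct word forms."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return sum(counts.values()), len(counts)


# Hypothetical sample comments, not taken from the hotel dataset.
comments = [
    "The room was clean and the staff was friendly.",
    "Clean room, friendly staff, great pool!",
]
tokens, types = token_type_counts(comments)
print(f"{tokens} tokens, {types} types")
```

On real corpora the number of types keeps growing as more text is added, which is why 35 million tokens still yield tens of thousands of distinct word forms.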
20. Challenge #2 – Polymorphy of Language
Lack of Preparation & Organization
25. “This fall take a break from the cold, catch a plane and go south.”
44 44 75 16 39 10 7 5
44 × 44 × 75 × 16 × 39 × 10 × 7 × 5
≈ 31.7 billion
Challenge #3 – Polysemy of words
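The arithmetic on this slide can be checked with a short sketch. It assumes that the eight numbers are the sense counts of the eight content words in the sentence, and that senses combine independently; the variable names are illustrative.

```python
import math

# Figures from the slide, read here as the number of senses per content word
# (an assumption about what each number stands for).
sense_counts = [44, 44, 75, 16, 39, 10, 7, 5]

# If every word's sense could be chosen independently, the number of
# candidate readings of the sentence is the product of the counts.
combinations = math.prod(sense_counts)
print(f"{combinations:,}")  # 31,711,680,000
```

The product comes to about 31.7 billion, matching the figure on the slide.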
28. FOUR MAJOR OBSTACLES
1) Very large number of word forms
2) Polymorphy of language
One idea → multiple forms
3) Polysemy of words
One word → many ideas
4) Misspellings
Text Analytics Challenge
29. 1.8 million student comments
• More than 35 million words
• 78,159 word forms
• 46,404 “unknown” words
o 75% misspellings (≈ 35,000)
o 21% proper names (products & people)
o 4% acronyms
Challenge #4 – Misspellings
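As a rough check of the breakdown above, the percentages can be turned back into approximate counts. This is a sketch using only the slide's figures; the variable names are illustrative.

```python
word_forms = 78_159       # distinct word forms in the student comments
unknown_words = 46_404    # word forms flagged as "unknown"

# Percentages as reported on the slide.
shares_pct = {"misspellings": 75, "proper names": 21, "acronyms": 4}

# Approximate number of unknown word forms in each category.
estimates = {name: round(unknown_words * pct / 100)
             for name, pct in shares_pct.items()}
unknown_share_pct = round(100 * unknown_words / word_forms)

for name, count in estimates.items():
    print(f"{name}: about {count:,}")
print(f"unknown share of all word forms: about {unknown_share_pct}%")
```

By these figures, roughly 59% of all distinct word forms are unknown words, and about 34,800 of them are misspellings, in line with the ≈ 35,000 on the slide.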
30. Challenge #4 – Misspellings
95 ways to be “Enthusiastic”
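One common way to fold spelling variants back onto a single word is fuzzy string matching. The sketch below uses Python's standard `difflib` module. The vocabulary, the misspellings, the helper name `normalize`, and the 0.8 cutoff are all hypothetical choices for illustration; this is not the method presented in the talk.

```python
import difflib

# Hypothetical reference vocabulary, not taken from the slides.
VOCABULARY = ["enthusiastic", "energetic", "organized", "prepared"]


def normalize(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Map a word to its closest vocabulary entry, or return it unchanged."""
    matches = difflib.get_close_matches(word.lower(), vocabulary,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical misspellings, plus one unrelated word that should pass through.
for variant in ["enthusiatic", "enthousiastic", "entusiastic", "boring"]:
    print(variant, "->", normalize(variant))
```

A high cutoff keeps unrelated words from being merged, at the cost of missing heavily distorted spellings; choosing that threshold is part of the challenge.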
39. The Origin of Topic Modeling
2003 – Latent Dirichlet Allocation (LDA)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
1999 – Probabilistic Latent Semantic Analysis (pLSA)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57 (1999)
1990 – Latent Semantic Analysis (LSA)
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391 (1990)
40. The Forgotten Origins of Topic Modeling
1963 – Information Retrieval
Borko, H., Bernick, M.: Automatic document classification. Journal of the ACM 10(2), 151–162 (1963)
Borko, H., Bernick, M.: Automatic document classification part II. Additional experiments. Journal of the ACM 11(2), 138–151 (1964)
1964 – Psychology
Harway, N.I., Iker, H.P.: Computer analysis of content in psychotherapy. Psychological Reports 14(3), 720–722 (1964)
Iker, H.P., Harway, N.I.: A computer approach towards the analysis of content. Systems Research and Behavioral
Science 10(2), 173–182 (1965)
1972 – Communication
Jandt, F.E.: Sources for computer utilization in interpersonal communication instruction and research.
Communication Quarterly 20(2), 25–31 (1972)
1973 – Literary Analysis
Sainte-Marie, P., Robillard, P., Bratley, P.: An application of principal component analysis to the work of Molière. Computers and the Humanities 7, 131–137 (1973)
41. Experimental Data
1. Subset of TREC-AP
• 2,250 Associated Press news articles
2. HICSS Abstracts
• 1,750 abstracts from the HICSS conference, 2014–2016
3. Hotel Reviews
• 31,898 reviews of hotels in Las Vegas (Expedia)
45. The State of the Art in Artificial Intelligence
AI
Artificial Intelligence Augmented Intelligence