Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Introduction to Text Mining and Visualization with Interactive Web Application

256 views

Published on

Introduction to Interactive Text Mining tool built with R and Shiny.

Published in: Data & Analytics
  • Be the first to comment

Introduction to Text Mining and Visualization with Interactive Web Application

  1. 1. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Interactive Visual Data Analysis Part Two Interactive Text Mining Suite Olga Scrivner Indiana University Workshop in Methods 1 / 33
  2. 2. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Outline 1 Introduce a web application for text processing and mining 2 Learn about natural language processing techniques 3 Develop practical skills 2 / 33
  3. 3. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Data Mining “As our collective knowledge continues to be digitized and stored (...) it becomes more difficult to find and discover what we are looking for.” (Blei 2012) 3 / 33
  4. 4. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References New Ways of Exploring Data Collections Word clouds (Vuillemot et al., 2009) 4 / 33
  5. 5. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Visualization Methods Social network graphs (Rydberg-Cox, 2011) 5 / 33
  6. 6. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Visualization Methods Tracking emotion and sentiment in fairy tales (Mohammad, 2012) 6 / 33
  7. 7. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Topic Modeling Discovering underlying theme of collection from Science magazine 1990-2000 (Blei 2012) 7 / 33
  8. 8. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Technological and Methodological Obstacles Many tools require some programming skills (Mallet, Meta, R and Python libraries) GUI tools are limited to certain formats and functions (Voyant, PaperMachine) Lack of active control by users 8 / 33
  9. 9. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Interactive Text Mining Suite A user-friendly tool for quantitative analysis and visualization of unstructured data Platform-independent Interactive 9 / 33
  10. 10. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References ITMS Structure 1 File Uploads Upload files (txt, pdf, rdf and Google books API) 2 Data Preparation Data preprocessing (stopwords, stemming, metadata) 3 Data Visualization Word frequencies, Cluster analysis and topic modeling 10 / 33
  11. 11. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References ITMS Structure 1 File Uploads Upload files (txt, pdf, rdf and Google books API) 2 Data Preparation Data preprocessing (stopwords, stemming, metadata) 3 Data Visualization Word frequencies, Cluster analysis and topic modeling 10 / 33
  12. 12. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Workshop Files Download 3 text files http://ssrc.indiana.edu/seminars/wim.shtml NY Times articles (3 documents in a plain text format) ITMS Web site: http://www.interactivetextminingsuite.com 11 / 33
  13. 13. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Upload File 12 / 33
  14. 14. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Upload File 12 / 33
  15. 15. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Upload File 12 / 33
  16. 16. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Preprocessing Data Before performing data analysis we should preprocess data. 13 / 33
  17. 17. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Preprocessing Options Select preprocessing options and click apply. 14 / 33
  18. 18. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Stopwords Stopwords (e.g. the, and): select Default for English 15 / 33
  19. 19. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Manual Removal of Stopwords Based on the need, remove any additional stopwords that you may consider a noise, e,g, paper, shows etc Select apply 16 / 33
  20. 20. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Stemming To improve analytics, you can stem all your tokens, ex. instead of worked, works, working, you will have only one relevant stem work 17 / 33
  21. 21. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Metadata Extraction You can extract or upload metadata. You will need datestamp (year) information for chronological topic modeling. 18 / 33
  22. 22. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Visualization 19 / 33
  23. 23. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Word Cloud Representation 20 / 33
  24. 24. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Customization 21 / 33
  25. 25. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Cluster Analysis You need to have at least three documents Documents will be grouped based on their term similarity measures 22 / 33
  26. 26. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Cluster Analysis 23 / 33
  27. 27. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Topic Modeling LDA (Latent Dirichlet allocation) STM (Structural Topic model) Chronological topic visualization (lda): requires metadata 24 / 33
  28. 28. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Topic Modeling Tuning Selection of topics (how many different themes) Selection of words per theme (how many words per topic) Selection of iteration 25 / 33
  29. 29. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Topic Model Selection 26 / 33
  30. 30. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References LDA Topic Model 27 / 33
  31. 31. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References STM Topic Model 28 / 33
  32. 32. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Other Formats - Google Books Before switching to other data formats, refresh your local browser. Start with File Uploads and select Structured Data 29 / 33
  33. 33. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Other Formats - Google Books Select your search terms and submit Current limitation is 40 books 30 / 33
  34. 34. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Future Options Shiny Web Application is highly customizable 1 Part-of-speech tagging (tm package) 2 Network analysis (igraph package) 3 Name Entity Recognition (NLP package) 4 Twitter Streaming (twitterR package) - will requires user’s twitter set-up for streaming but information will be provided how to set it up Open for other suggestions and collaboration - contact obscrivn@indiana.edu 31 / 33
  35. 35. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References Acknowledgements I would like to thank WIM for providing this opportunity. Contributors: Jefferson Davis, Irina Trapido, Jay Lee 32 / 33
  36. 36. Introduction ITMS Preprocessing Data Data Visualization Cluster Analysis Topic Modeling Google Book API Future Directions References References I [1] Many open source R packages: tm, shiny, NLP, stringi, stringr, topicmodels, lda and many more [2] Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press [3] Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press [4] Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer International Publishing, Cham [5] Moretti, Franco. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso [6] Oelke, Daniella, Dimitrios Kokkinakis, and Mats Malm. 2012. Advanced visual analytics methods for literature analysis. Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social 561Sciences, and Humanities, pages 35–44 image credits: https://media.giphy.com/media/10zsjaH4g0GgmY/giphy.gif 33 / 33

×