Building an LDA topic model using Wikipedia


Presentation given at the Data Harmony User Group 2018 meeting about creating training data for use in JSTOR's new Text Analyzer, a tool that allows users to upload a document, have it automatically analyzed, and find relevant content in JSTOR. Using the JSTOR Thesaurus terms, the team identified and reviewed Wikipedia articles to be used as training data for a topic model.

Published in: Education

  1. 1. February 6, 2018 Ron Snyder and Sharon Garewal Building an LDA topic model using Wikipedia
  2. 2. ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. JSTOR is a not-for-profit digital library of academic journals, books, and primary sources. Ithaka S+R is a not-for-profit research and consulting service that helps academic, cultural, and publishing communities thrive in the digital environment. Portico is a not-for-profit preservation service for digital publications, including electronic journals, books, and historical collections. Artstor provides 2+ million high-quality images and digital asset management software to enhance scholarship and teaching.
  3. 3. JSTOR Labs works with partner publishers, libraries and labs to create tools for researchers, teachers and students that are immediately useful – and a little bit magical.
  4. 4. Presentation Outline • Text Analyzer – What it is and how it works, including a short demo • LDA topic models – What they are and how we’re creating/using them • Topic data curation – Our process, lessons learned, and future work • Multilingual Text Analysis – An experimental approach leveraging LDA topic models and Wikipedia relationships (with short demo)
  5. 5. Text Analyzer
  6. 6. Text Analyzer - beta
  7. 7. Text Analyzer Analyzes arbitrary text to find related content in the JSTOR archive • Drag-n-drop • File select
  8. 8. Text Analyzer • Text Analyzer extracts topics and named entities from submitted text to find related/similar documents in the JSTOR archive
  9. 9. Text Analyzer • Text Analyzer extracts topics and named entities from submitted text to find related/similar documents in the JSTOR archive • Topics are based on the terms in the JSTOR Thesaurus
  10. 10. Text Analyzer Text is submitted via: • Direct input • Copy/paste • Local file • Drag and drop from local computer filesystem or a web URL • Photo of text, via phone camera A variety of document types are supported: • PDF • MS Word • HTML • RTF • Plain text, PowerPoint, and Excel • Images (on-the-fly OCR is performed)
  11. 11. Another example: using a smartphone photo as input… [Image: my bookshelf at work] [Image: topics inferred]
  12. 12. Image analysis [Image: my bookshelf at work] [Image: topics inferred]
  13. 13. Text Analyzer recommendations Recommendations are based on a “best fit” of all prioritized terms (topics and entities) and weights • The selection of documents in results represents an ‘OR’ of documents containing one or more terms • Results ordering is based on a score representing the number of terms matched, the importance of the term(s) to the document (based on LDA weight), and the user-specified importance • A user is able to quickly refine the terms and weights to tailor the results to a specific need • Values used in relevancy and ranking calculations are available for inspection
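The slide names the inputs to the ranking score but not the formula. The following is a minimal sketch of one way those inputs could be combined, assuming a simple sum of (user weight × document weight) over matched terms. The `Doc` class, the function names, and all numbers are hypothetical, not the Text Analyzer implementation.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    term_weights: dict  # term -> weight of that term in the document (e.g. LDA weight)


def score(doc, query):
    """Sum user weight x document weight over the matched terms.

    `query` maps term -> user-specified importance. The additive form is an
    assumption; the slide only lists the inputs to the score.
    """
    return sum(user_w * doc.term_weights[t]
               for t, user_w in query.items() if t in doc.term_weights)


def recommend(docs, query):
    # 'OR' selection: keep any document containing at least one query term.
    hits = [d for d in docs if any(t in d.term_weights for t in query)]
    # Order by score, best first. More matched terms means more summands.
    return sorted(hits, key=lambda d: score(d, query), reverse=True)


# Invented example data.
docs = [
    Doc("a", {"Climatology": 0.6, "Viticulture": 0.3}),
    Doc("b", {"Viticulture": 0.8}),
    Doc("c", {"Mycology": 0.9}),
]
query = {"Climatology": 1.0, "Viticulture": 0.5}
ranked = [d.doc_id for d in recommend(docs, query)]
```

Raising or lowering a value in `query` plays the role of the user refining term weights; document "c" never appears because it shares no term with the query.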
  14. 14. Latent Dirichlet allocation (LDA) • LDA is one of the most common algorithms for topic modeling • The Latent part of LDA comes into play because in statistics, a variable we have to infer rather than directly observing is called a "latent variable". We're only directly observing the words and not the topics, so the topics themselves are latent variables (along with the distributions themselves).* • LDA is based on the concept that: • Every document is a mixture of topics • Every topic is a mixture of words • LDA is a mathematical method for estimating both of these at the same time: • Finding the mixture of words that is associated with each topic, • while also determining the mixture of topics that describes each document *
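The two "mixture" statements above can be made concrete with a toy calculation. This sketch only shows the forward direction (given both mixtures, what word probabilities does a document have); LDA itself works backwards from observed words to estimate the mixtures. All topic names and probabilities here are invented for illustration.

```python
# "Every topic is a mixture of words": P(word | topic), each row sums to 1.
topic_words = {
    "Climatology": {"climate": 0.60, "warming": 0.30, "grape": 0.05, "wine": 0.05},
    "Viticulture": {"climate": 0.10, "warming": 0.05, "grape": 0.40, "wine": 0.45},
}

# "Every document is a mixture of topics": P(topic | document), sums to 1.
doc_topics = {"Climatology": 0.7, "Viticulture": 0.3}


def word_probs(doc_topics, topic_words):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    probs = {}
    for topic, p_topic in doc_topics.items():
        for word, p_word in topic_words[topic].items():
            probs[word] = probs.get(word, 0.0) + p_topic * p_word
    return probs


p = word_probs(doc_topics, topic_words)
```

Because both mixtures are proper probability distributions, the resulting word probabilities also sum to 1.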
  15. 15. Latent Dirichlet allocation (LDA) • Model training • Can be supervised or unsupervised • When performed unsupervised, a predefined number of topics is identified, each represented by word probabilities • Supervised training involves the use of a tagged corpus where the tags will be used as topic labels • For topics we use a subset of the JSTOR Thesaurus • For the model training documents we’re now using Wikipedia articles associated with each topic • Topic inferencing • Using a trained topic model, ‘latent’ topics in a document can be inferred using the words in the text • Topics are expressed as a probability distribution
  16. 16. What is an LDA topic? It’s simple…
  17. 17. OK, the math isn’t so simple, but conceptually a topic is just a set of “word” relationships. For example, the top “words” associated with the topic Climatology: climate temperature earth ice warming global weather atmosphere climate_change ocean cycle oscillation carbon greenhouse model scientist water age ice_age atmospheric event tropical gas pacific heat dioxide carbon_dioxide pattern average region wave surface extreme variation air classification cold precipitation latitude cooling global_warming land radiation solar science greenhouse_gas rainfall determine condition enso report emission current dry theory variability force summer polar infrared annual future north range mass climatic feedback atlantic northern record sea rise scientific natural evidence scale factor planet winter cold_wave glacial air_mass warm climate_model regional cfc extreme_weather climatology niño interglacial oceanic assessment pollution phase absorption location published america ancient result energy arrhenius surface_temperature climate_oscillation vegetation seasonal trend moon south shift activity sun assessment_report climate_pattern humid hurricane fluctuation anomaly methane conclude decadal maritime tree arctic concentration month short infrared_radiation glacier future_climate monsoon forecasting global_temperature continental milankovitch orbital water_vapor vapor james estimate normal observe maximum variable pdo heat_wave pacific_ocean climatologist tree_ring arid convince forcings holocene ice_sheet cloud fourier climate_sensitivity icehouse weather_event wmo southern_oscillation climate_climate solar_variation climate_cycle north_america croll human_emission icehouse_climate agassiz enso_event climate_index global_climate climate_variability mid_latitude paleoclimatology thornthwaite köppen current_climate indian_ocean niño_southern_oscillation niño_southern interglacial_period climate_classification electromagnetic_radiation term_climate ocean_atmosphere wind_shear fossil_fuel ramanathan wetherald manabe keeling absorbing_infrared james_croll charpentier scientific_opinion buckland level_pressure sea_level_pressure inter_decadal tropical_pacific decadal_oscillation mjo climate_science ozone_depletion nao extreme_weather_event change_climate climate_proxy east_pacific annual_basis ice_cap subarctic_climate oceanic_climate humid_subtropical modern_climate climate_zone bergeron polar_ice regular_cycle scientific_literature lake_bed current_interglacial temperature_fluctuation warm_period shorter_term classification_include climate_force milankovitch_cycle projected_increase excessive_heat heatwave bioclimatology cfc_focused dioxide_molecule carbon_dioxide_molecule absorbing_infrared_radiation lovelock_speculate james_lovelock_speculate scientist_james_lovelock_speculate scientist_james_lovelock british_scientist_james_lovelock british_scientist_james core_drilled ice_core_drilled particulate_pollution aerosol_pollution sea_core deep_sea_core david_keeling charles_david_keeling charles_david callendar varves högbom infrared_absorption measure_infrared cycle_lasting venetz perraudin change_climate_change climate_change_climate_change climate_change_climate change_science climate_change_science century_scientist background_climate bake_crust british_scientist langley james_lovelock cfc_molecule chlorofluorocarbon_cfc tyndall john_tyndall extreme_event scientist_james hothouse energy_budget teleconnections sst_anomaly
  18. 18. LDA topics An LDA topic can then be thought of as the density of associated terms in an analyzed text. For example, in this article on climate change and wine from a recent edition of the JSTOR Daily we see the top words for 2 topics highlighted: Climate change and Viticulture
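The "density of associated terms" idea can be sketched as the share of a text's tokens that fall in each topic's word list. This is a simplification for intuition only: it ignores word weights and multi-word terms such as climate_change, and it is not how LDA inference is actually computed. The word lists and the sample sentence are hypothetical.

```python
import re


def topic_density(text, topic_terms):
    """Fraction of tokens in `text` that belong to each topic's term set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return {topic: 0.0 for topic in topic_terms}
    return {topic: sum(tok in terms for tok in tokens) / len(tokens)
            for topic, terms in topic_terms.items()}


# Hypothetical top words per topic (tiny subsets, for illustration).
topic_terms = {
    "Climate change": {"climate", "warming", "temperature", "carbon"},
    "Viticulture": {"grape", "vineyard", "wine", "harvest"},
}
text = "Warming temperature is moving the grape harvest earlier, and wine regions shift."
density = topic_density(text, topic_terms)
```

Here 2 of the 12 tokens belong to the first list and 3 of 12 to the second, so both topics register, which mirrors the two-topic highlighting described on the slide.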
  19. 19. Named Entity Recognition (NER) • Entities in a submitted text are identified and available for document selection • Persons • Locations • Organizations • Results from multiple entity recognition engines are merged during analysis • IBM Alchemy • OpenCalais (Thomson Reuters) • OpenNLP (Apache) • Stanford NER
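The slide does not say how results from the different engines are merged. One plausible approach, sketched here under that assumption, is to key each entity on its type plus a normalized form of its text and record which engines reported it. The engine names, entity tuples, and function name are placeholders; no real engine API is called.

```python
from collections import defaultdict


def merge_entities(engine_results):
    """Merge per-engine entity lists into one de-duplicated, ranked list.

    engine_results: engine name -> list of (entity_type, surface_text).
    Returns a list of ((entity_type, normalized_text), {engines}) pairs.
    """
    merged = defaultdict(set)
    for engine, entities in engine_results.items():
        for entity_type, text in entities:
            # Normalize whitespace and case so variants collapse to one key.
            key = (entity_type, " ".join(text.split()).casefold())
            merged[key].add(engine)
    # Entities reported by more engines first, then alphabetical for stability.
    return sorted(merged.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Invented output from two hypothetical engines.
results = {
    "engine_a": [("PERSON", "John Tyndall"), ("LOCATION", "Pacific Ocean")],
    "engine_b": [("PERSON", "john  tyndall"), ("ORGANIZATION", "WMO")],
}
merged = merge_entities(results)
```

Counting agreeing engines gives a rough confidence signal: an entity found by several engines is less likely to be a false positive than one found by a single engine.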
  20. 20. Using Wikipedia for LDA training data Early versions of LDA topic models were trained with JSTOR documents using MAIstro indexing terms. This worked fairly well but had some significant limitations/challenges inhibiting further improvement of the models and inferencing: • Many tagged articles were only semi-related to the topic • Documents often contained too many topics • The JSTOR document text was often too “noisy”: 1. OCR errors 2. Running headers/footers 3. Citations and references
  21. 21. Using Wikipedia for LDA training data Early experimentation with Wikipedia articles for training data in mid-summer proved promising: • Performed comparison tests of models built from JSTOR-only, Wikipedia-only, and hybrid training datasets. Converted to the use of Wikipedia training data for Text Analyzer in September: • Initially hoped for 100% automated mapping from topic to training docs • Eventually concluded that some level of manual curation would be needed • Training data curation performed in Q4 with JSTOR and Access Innovations staff using an internal tool (more on that in a bit)
  22. 22. Wikipedia and Wikidata Wikidata provides rich machine-readable (semantic) data for augmenting and linking Wikipedia training data For example:
  23. 23. Downloadable Wikipedia data dumps provide clean text for model training: uncluttered and error-free • No HTML markup, hyperlinks, etc. • Ideal for text processing • As a bonus, summary snippets are easily extracted • These snippets are not currently exposed in the interface but could be used in a number of ways in the future
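As an illustration of how a summary snippet might be pulled from already-cleaned article text, here is a minimal sketch. It assumes the text is plain text with paragraphs separated by blank lines and that the first paragraph serves as the summary; the function name, the length limit, and the sample text are hypothetical, and the slides do not describe the actual extraction.

```python
def summary_snippet(article_text, max_chars=200):
    """Return the first non-empty paragraph, cut at a word boundary if too long."""
    paragraphs = [p.strip() for p in article_text.split("\n\n") if p.strip()]
    if not paragraphs:
        return ""
    first = " ".join(paragraphs[0].split())  # collapse internal whitespace
    if len(first) <= max_chars:
        return first
    # Drop the trailing partial word and mark the truncation.
    return first[:max_chars].rsplit(" ", 1)[0] + "…"


article = (
    "Mycology is the branch of biology concerned with the study of fungi.\n\n"
    "History\nLater sections of the article follow here."
)
snippet = summary_snippet(article)
```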
  24. 24. Compare that with some OCR text from a typical JSTOR article
  25. 25. Goal: Train a new topic model • Produce a “super set” of terms • Find training articles using Wikipedia • New model will catch nuances & more subtle language
  26. 26. Project phases Spreadsheets, curation tool, thesaurus and Wikipedia 1. Mapping thesaurus terms to Wikipedia categories 2. Identifying Wikipedia training articles for thesaurus terms 3. Whitelisting terms 4. Working in the curation tool 5. Spreadsheet validation
  27. 27. Mapping terms to Wikipedia categories
  28. 28. Research [Diagram: JSTOR Thesaurus, Category page, Article level page]
  29. 29. Wikipedia category | Thesaurus term | Wikipedia category link | Notes
      Musical instruments | Musical instruments | |
      Musical notation | Music notation | |
      Musical scales | Musical scales | |
      Musical theatre | Musical theater | |
      Musical tuning | Musical tuning | |
      Musicians | Musicians | |
      Musicologists | Musicology | |
      Musicology | Musicology | |
      Mustaali | ENTITY | | named entity
      Mustard (condiment) | MATCH | | Didn't match due to parens; "Mustards" in jthes
      Muswell Hill | ENTITY | | named entity
      Mutilation | NO MATCH | | Don't have this term but variations of term
      Mutineers | Mutiny | |
      Mutinies | Mutiny | |
      Mutualism (biology) | MATCH | | Didn't match due to parens; "Mutualism" in jthes
      Mutualism (movement) | NO MATCH | | Didn't match due to parens; "Mutualism" within Ecology
      Mycology | Mycology | |
      Myeloid neoplasia | NO MATCH | | no match in jthes
      Myoneural junction and neuromuscular diseases | NO MATCH | | comprised too many concepts
      Myrmecophagous mammals | NO MATCH | |
      MySQL | SQL | |
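Several notes above say a category "didn't match due to parens". A minimal sketch of an automated first pass that avoids that failure, assuming exact matching after dropping a trailing parenthetical qualifier and folding case, could look like this. The function names and the short term lists are hypothetical, and this is not the team's actual matching procedure.

```python
import re


def normalize(label):
    """Drop a trailing parenthetical qualifier and fold case."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", label).strip().casefold()


def map_categories(categories, thesaurus_terms):
    """Map each Wikipedia category to a thesaurus term, or 'NO MATCH'."""
    index = {normalize(term): term for term in thesaurus_terms}
    return {cat: index.get(normalize(cat), "NO MATCH") for cat in categories}


thesaurus = ["Musical instruments", "Mutualism", "Mycology"]
categories = [
    "Musical instruments",
    "Mutualism (biology)",
    "Mutualism (movement)",
    "Myrmecophagous mammals",
]
mapping = map_categories(categories, thesaurus)
```

Note that this pass maps both "Mutualism (biology)" and "Mutualism (movement)" to "Mutualism", whereas the reviewers marked the second as NO MATCH because the thesaurus term sits within Ecology. That is the kind of case where string matching alone is not enough and manual review is needed.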
  30. 30. Identifying Wikipedia training articles • Choose 10+ articles • 1 week of Labs time • Try to cover the first four levels of the hierarchy
  31. 31. The Whitelist • Cut down the list of thesaurus terms • Used high/low count to help with assessment • Chose 18k of original 48k
  32. 32. Updated curation tool Levels 1-4 – Full coverage. Back in the curation tool we decided, for efficiency, to do all top-level branches down to the 4th level so that all subjects were covered to the same depth. We learned from the Labs week training documents that a better target is 1-5 training docs per term, and that being selective is better than including documents that may be only tangentially related. Some terms only have one or two strong documents.
  33. 33. Spreadsheet validation
  34. 34. Lessons learned Challenges: • Size of the thesaurus • Lack of knowledge of some subject areas • Wikipedia-only articles • Time/staffing constraints • Tool glitches The future: • Coverage of all thesaurus terms • Other articles outside of Wikipedia • Integrated as part of our weekly workflow • Working with Subject Matter Experts to choose training documents
  35. 35. Multilingual topic inferencing
  36. 36. Topic model training [Diagram: JSTOR Thesaurus (16,000 topics) + Training docs (30,000 Wikipedia articles); LLDA Topic Model]
  37. 37. Multilingual topic inferencing [Diagram: JSTOR Thesaurus + Training docs (30,000 Wikipedia articles); LLDA Topic Model; English] Languages shown: Arabic (74%), Chinese (82%), Dutch (78%), French (86%), German (86%), Hebrew (63%), Italian (76%), Japanese (82%), Korean (66%), Polish (74%), Portuguese (75%), Russian (81%), Spanish (84%), Turkish (55%)
  38. 38. Multilingual topic inferencing Demo…
  39. 39. Multilingual topic inferencing
  40. 40. Multilingual topic inferencing
  41. 41. Thank You