Analyzing the Biological Literature with evoText

A talk presenting the technical infrastructure behind the evoText project, as well as some sample results.

Published in: Education

  1. 1. Analyzing the Biological Literature with evoText. UQAM, 1/11/2019. Charles H. Pence (@pencechp • @pencelab)
  2. 2. Outline: (1) Tool development and future work; (2) Current corpus status; (3) evoText in action: analyzing biodiversity. The take-home: Could evoText be useful for your work? Get in touch! Charles H. Pence 2 / 33
  3. 3. Tool Development
  4. 4. Why a New Tool? Journal articles are weird: they’re small; they carry unusual metadata; people are used to searching and indexing them in particular ways; and they’re hard to get and come in an infinite variety of formats.
  5. 5. Tricks: We have to analyze using only basic, plain text (no guarantee we can always get anything better). We need ways to search and filter by journal and year (categories coming soon). We have to be robust to low-quality data (we often depend entirely on publishers for data quality, and there are too many documents for manual cleaning). Also coming soon: citation analysis support.
  6. 6. Current Infrastructure. Frontend: Ruby on Rails web application. Analysis backend: Ruby. Article database: Apache Solr. https://www.evotext.org/
  7. 7. [No text transcribed for this slide; section: Tool Development]
  8. 8. Coming soon: New version! Frontend: Angular web app (new interface!). Analysis backend: Go (faster analysis!). Article database: Solr (or maybe MongoDB?).
  9. 9. [No text transcribed for this slide; section: Tool Development]
  10. 10. Current Corpus
  11. 11. The Corpus: 1,625,790 documents; 334 journals.
  12. 12. Access
  13. 13. Data Sources
  14. 14. Journals
  15. 15. Example: Biodiversity
  16. 16. Conceptual Analysis. Classic conceptual analysis question: What do scientists mean by biodiversity?
  17. 17. Corpus: Articles from Conservation Biology, 1987–2012. 5,459 articles; 27M tokens; 427k types.
  18. 18. Craig-Zeta Algorithm. Take a corpus that you believe to consist of two “sub-corpora,” A and B. Goal: Find ‘marker words’ that distinguish A papers from B papers, and vice versa.
  19. 19. Craig-Zeta Algorithm. Idea: Analyze as though papers that mention biodiversity are a different sub-corpus from those that do not.
  20. 20. CZ Marker Words. Markers for biodiversity: diversity, richness, ecological, protected, planning, conservation, development, policy, economic, assessment, international, management (year numbers, 1997–2007). Markers for not-biodiversity: population, genetic, breeding, individuals, survival, rate, variation, reproductive, mortality, adult, mean, demographic (year numbers, 1980–1988).
  21. 21. Craig-Zeta Algorithm. How do we know that the algorithm was successful?
  22. 22. [No text transcribed for this slide; section: Example: Biodiversity]
  23. 23. Cooccurrence. What words occur (significantly often) within a given distance of ‘biodiversity’ (our focal word of interest)?
  24. 24. Cooccurrences (500-word window): biointegrity, bioresources, macroclimatic, distributive, bureaucratically, countdown, neoliberalization, postmodernism, manifesto, cataloguing, underprotected, biopiracy, hotspots, coextinctions.
  26. 26. A first suggestion... Biodiversity is unusually closely related to words indicating social and political context. Good! This lines up both with what historians of science and the practitioners themselves have told us about the self-image of the biodiversity paradigm.
  28. 28. Comparing Definitions. Could we look for places in the literature where different definitions of biodiversity have been used? One idea: Biodiversity is an inherently spatial concept. What parts of the globe has it been applied to in different contexts?
  30. 30. Named Entity Recognition. Look for any instances of proper place names throughout the Conservation Biology corpus, and map them. Unsolved problem for evoText: Geocoding! How do we get from “Australia” to a set of geographic coordinates on a map? Currently using a third-party API (with a large free service tier).
  31. 31. [No text transcribed for this slide; section: Example: Biodiversity]
  32. 32. [No text transcribed for this slide; section: Example: Biodiversity]
  33. 33. [No text transcribed for this slide; section: Example: Biodiversity]
  34. 34. A second suggestion... Various definitions of biodiversity seem to come apart when we compare scientific and political uses of the concept.
  35. 35. Back to evoText. Other tools available: general word frequency analysis (1-gram or N-gram, with tf-idf scores, configurable block creation behavior, etc.); collocation analysis (like cooccurrence, but for direct bigrams); some simple graphing by date; export to various reference manager citation formats.
  36. 36. Questions? charles@charlespence.net • https://pencelab.be • @pencechp • @pencelab • The Pence Lab
  37. 37. Craig-Zeta Algorithm (bonus slide 1 / 2). (1) Divide corpus A and corpus B into blocks of 500 words. (2) For each word, count the blocks in which that word appears. (Discard any words that appear in every block.) (3) For each word w, compute the Zeta score: $Z_w = \frac{\mathrm{count}_A}{n_A} + \frac{n_B - \mathrm{count}_B}{n_B}$ (1), where $\mathrm{count}_A$ is the number of blocks of A that contain w and $n_A$ is the total number of blocks of A (likewise for B). That is: the fraction of blocks of A that contain w plus the fraction of blocks of B that do not contain w. $Z_w$ values range from 2 (appears in every block of A and no blocks of B) to 0 (appears in no blocks of A and every block of B).
  38. 38. Collocation Algorithm (bonus slide 2 / 2). (1) Inputs: word of interest t, window s. (2) Divide the corpus into n blocks of size s. (3) For every word w in the corpus, compute both $f_w$ (how many blocks contain w) and $f_{wt}$ (how many blocks contain both w and t). (4) Compute a score for the significance of the wt pair. In this case, we used mutual information: $I_{wt} = \log\left(\frac{f_{wt} \cdot n}{f_t \cdot f_w}\right)$ (2), where $f_t$ is the number of blocks that contain t.
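
The two bonus-slide procedures can be sketched in a few lines of Python. This is only an illustrative sketch, not evoText's actual code (the slides say the current backend is written in Ruby). The function names `blocks`, `block_counts`, `zeta_scores`, and `mutual_information` are hypothetical. It assumes the text is already tokenized into a list of words, that both sub-corpora are non-empty, that a final block shorter than the block size is kept, and that "appears in every block" means every block of both sub-corpora.

```python
import math
from collections import Counter


def blocks(tokens, size):
    """Split a token list into consecutive blocks of `size` tokens.

    Assumption: a shorter final block is kept rather than dropped.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def block_counts(block_list):
    """For each word, count how many blocks contain it at least once."""
    counts = Counter()
    for block in block_list:
        counts.update(set(block))
    return counts


def zeta_scores(tokens_a, tokens_b, size=500):
    """Zeta score per word: count_A / n_A + (n_B - count_B) / n_B."""
    blocks_a, blocks_b = blocks(tokens_a, size), blocks(tokens_b, size)
    n_a, n_b = len(blocks_a), len(blocks_b)
    count_a, count_b = block_counts(blocks_a), block_counts(blocks_b)
    scores = {}
    for w in set(count_a) | set(count_b):
        # Discard words that appear in every block (assumed: of both sub-corpora).
        if count_a[w] == n_a and count_b[w] == n_b:
            continue
        scores[w] = count_a[w] / n_a + (n_b - count_b[w]) / n_b
    return scores


def mutual_information(tokens, target, size):
    """Score each word w that shares a block with `target`:
    I_wt = log(f_wt * n / (f_t * f_w))."""
    all_blocks = blocks(tokens, size)
    n = len(all_blocks)
    f = block_counts(all_blocks)  # f[w]: number of blocks containing w
    f_wt = Counter()              # f_wt[w]: blocks containing both w and target
    for block in all_blocks:
        words = set(block)
        if target in words:
            f_wt.update(words - {target})
    return {w: math.log(f_wt[w] * n / (f[target] * f[w])) for w in f_wt}
```

With a toy block size of 3, a word found in every block of A and in none of B scores 2.0, and the reverse case scores 0.0, matching the range stated on the slide. Words that never share a block with the target are simply left out of the mutual-information result, since the logarithm would be undefined for them.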
