On Beyond Keyword Search: The Thinking Behind JSTOR Labs' Text Analyzer - NFAIS Webinar 2017

How Text Analyzer uses natural language processing to let researchers upload a document and get relevant results, including content, topics and subjects. JSTOR pushed the envelope of traditional searching and will share the challenges and opportunities it encountered during the beta test of this new tool.

Published in: Internet

  1. 1. ON BEYOND KEYWORD SEARCH: THE THINKING BEHIND JSTOR LABS’ TEXT ANALYZER NFAIS Webinar: Shifting Patterns in Search and Discovery June 15, 2017 @abhumphreys Alex Humphreys, JSTOR Labs
  2. 2. ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. JSTOR is a not-for-profit digital library of academic journals, books, and primary sources. Ithaka S+R is a not-for-profit research and consulting service that helps academic, cultural, and publishing communities thrive in the digital environment. Portico is a not-for-profit preservation service for digital publications, including electronic journals, books, and historical collections. Artstor provides 2+ million high-quality images and digital asset management software to enhance scholarship and teaching.
  3. 3. JSTOR Labs works with partner publishers, libraries and labs to create tools for researchers, teachers and students that are immediately useful – and a little bit magical.
  4. 4. WHAT’S A TEXT ANALYZER?
  5. 5. LET’S JUST START WITH A DEMO www.jstor.org/analyze
  6. 6. WHAT’S IT GOOD FOR?
  7. 7. SCHOLARS DOING LITERATURE REVIEWS
  8. 8. FINDING KEYWORDS IN UNFAMILIAR FIELDS https://publish.illinois.edu/commonsknowledge/2017/04/04/spotlight-jstor-labs-text-analyzer/
  9. 9. ESL RESEARCHERS FINDING KEYWORDS
  10. 10. OK, I GUESS IT’S KINDA COOL. SO HOW’D YOU COME UP WITH IT?
  11. 11. THIS IS HOW:
  12. 12. The Design Squiggle Damien Newman: http://cargocollective.com/central/The-Design-Squiggle/
  13. 13. THE SEED…
  14. 14. CONCEPT TESTING
  15. 15. ITERATING ON INTERACTION DESIGN
  16. 16. RELEASE AS JSTOR BETA
  17. 17. STILL A LONG WAY TO GO!
  18. 18. A BRIEF ASIDE (OR YOU COULD CALL IT A RANT)
  19. 19. #devops is great.
  20. 20. #devops is great. Can we please try #userresearchproddesigndevopscustservice?
  21. 21. WHAT HAVE WE LEARNED SO FAR?
  22. 22. Combining semantic indexing with topic modeling can be powerful.
  23. 23. THREE STEPS FOR EACH SEARCH: 1. Extract text 2. Identify terms 3. Generate results
      • From many textual formats (PDF, Word, HTML, etc.)
      • OCR, if needed (e.g. a picture of a page in a magazine)
      • Topics: JSTOR Thesaurus & an LDA Topic Model
      • Entities: Alchemy (Watson), OpenCalais, Stanford, Apache
      • TF-IDF to select 5 terms
      • “OR” search
      • Relevance ranked based on “equalizer”
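The "TF-IDF to select 5 terms" and "OR search" bullets above can be sketched roughly as follows. This is a minimal illustration, not JSTOR's implementation: the function names, the smoothed IDF variant, and every document-frequency number are assumptions made up for the example.

```python
import math
from collections import Counter


def top_terms_by_tfidf(doc_terms, corpus_doc_freq, corpus_size, k=5):
    """Rank the terms found in one document by TF-IDF and keep the top k.

    doc_terms: terms identified in the uploaded text (repeats allowed).
    corpus_doc_freq: hypothetical map of term -> number of corpus documents containing it.
    """
    tf = Counter(doc_terms)
    scores = {}
    for term, count in tf.items():
        df = corpus_doc_freq.get(term, 0)
        # Smoothed IDF (one common variant, assumed here): rare terms score higher.
        idf = math.log((1 + corpus_size) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_terms)) * idf
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def build_or_query(terms):
    """Join the selected terms into a single quoted "OR" query string."""
    return " OR ".join(f'"{t}"' for t in terms)


# Illustrative input only: the counts below are invented.
doc_terms = (["gerrymandering"] * 3 + ["redistricting"] * 3 + ["court"] * 2
             + ["election"] * 2 + ["census"] + ["policy"])
corpus_doc_freq = {"gerrymandering": 10, "redistricting": 20, "court": 5000,
                   "election": 800, "census": 300, "policy": 6000}

selected = top_terms_by_tfidf(doc_terms, corpus_doc_freq, corpus_size=10_000)
query = build_or_query(selected)
print(query)
```

Terms that are frequent in the uploaded text but rare in the corpus rise to the top, so a very common term such as "policy" drops out of the five that feed the query.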
  24. 24. WHERE DO THE TOPICS COME FROM? THE JSTOR THESAURUS
      • A controlled vocabulary containing 40,000+ terms, representing concepts (no entities, currently) found in the JSTOR corpus
      • Constructed from 20 thesauri obtained from various sources, including ERIC, MeSH, and NASA
      • Developed in collaboration with Access Innovations
      • Key branches in the thesaurus are reviewed and corrected by subject matter experts
  25. 25. JSTOR THESAURUS
  26. 26. WHY THESE TOPICS? AND, WHERE DID THEY COME FROM? Human-curated tagging rules have been developed for each concept in the JSTOR Thesaurus, enabling concepts to be extracted from unstructured text. All documents in the JSTOR corpus have been tagged with thesaurus concepts using a rules-based indexer.
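To make the idea of a rules-based indexer concrete, here is a toy sketch in which each thesaurus concept is triggered by a short list of phrases. The rule format, the `RULES` table, and the function names are all hypothetical; the slides do not show the real rule syntax, which is likely richer than simple phrase matching.

```python
import re

# Hypothetical rules: concept label -> phrases that trigger it.
RULES = {
    "Gerrymandering": ["gerrymander", "gerrymandering", "gerrymandered"],
    "Congressional districts": ["congressional district", "congressional districts"],
    "Census": ["census"],
}


def compile_rules(rules):
    """Turn each concept's phrase list into one case-insensitive, whole-word regex."""
    compiled = {}
    for concept, phrases in rules.items():
        # Longest phrases first so the alternation prefers the most specific match.
        ordered = sorted(phrases, key=len, reverse=True)
        alternatives = "|".join(re.escape(p) for p in ordered)
        compiled[concept] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return compiled


def tag_text(text, compiled_rules):
    """Return {concept: number of matches} for every concept that fires on the text."""
    counts = {}
    for concept, pattern in compiled_rules.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[concept] = hits
    return counts


sample = ("The court held that the congressional districts were gerrymandered. "
          "Gerrymandering of a congressional district is contested.")
tags = tag_text(sample, compile_rules(RULES))
print(tags)
```

Running the same tagger over every document in a corpus yields the kind of concept-tagged collection the slide describes.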
  27. 27. THESAURUS TAGGER RULE BUILDER
  28. 28. WHY THESE TOPICS? AND, WHERE DID THEY COME FROM? This tagged corpus is then used to select training documents for building an LDA topic model. The LDA topic model enables us to identify latent topics found in text, in addition to those explicitly identified with the human-generated rules.
  29. 29. TOPIC MODEL
      • Labeled LDA topic model
      • Model trained using documents selected from the JSTOR corpus with tagged thesaurus concepts
      • Using the open-source MALLET tool
      • Current version of the model includes approximately 11,000 topics
      • Each topic represents a distribution of word probabilities
      Top words from some sample topics:
      Redistricting: redistricting district congressional minority political majority house legislative racial gerrymandering court republican plan electoral districting seat representative black voter democrat partisan election democratic representation line supreme legislature drawn control population voting drawing policy texas draw map claim boundary following commission outcome shaw race census legal principle creation decision create finding elect lublin polarization optimal elected composition affect member measure vote gain previous legislator geographic southern section every approach controlled round note gerrymander reapportionment compactness decennial bipartisan constitutional find substantive california roll competitive county competition party requirement federal north post redrawn incumbent criterion consequence likely formal safe delegation georgia justice influence shotts equal favor might scholar equality south power law judicial bias king carolina call according voss baker panel professor rule mandate creating increased determine constraint politics argue standard redis grofman reno cain redrawing margin share ing tricting decrease congress geographical requires simple held critic empirical david niemi perverse latino analyze examine debate rather impact next provides give balance affected subsequent possible take practice community robbins constitution computer evenly fraction constituent illinois supporter shape responsiveness typically various proposed despite either focus conclusion african opportunity redistrict mcdonald white numerous test statewide percent suggests thus choice largely develop decade conclude fact four reached
      Congressional districts: district congressional congress house representative member federal districting seat majority plan representation population congressman apportionment elected court president washington columbia legislative census party interest political gerrymandering redistricting home thomas affect every black democrat dis foley carolina find reapportionment constituency supreme constitution voting geographic active dinner responsiveness south force john gingrich legislature equal membership neighborhood testimony north james service decennial constituent passed boundary law creation firm charles spending congruent election politically addition april contact proportion con assistant position following york land unconstitutional resident miller voter pledge stephen city official minority respective mainland kentucky post clause better divisor perimeter yao secretary republican senate moderate congruence map county grant senior drawing portion speaker feature decision professor became gerrymander swain trict leapfrog federalist partisan senator vote captain compelling lucas candidate race create harm require fourth shape you traditional purpose shaped concern people shaw historical simply policy henry david allocation vetoed arkansas smiley serra carl volunteer politician budget burden electoral leaf education reduced principle proximity november significant just represented second gathered fiorina representa gressional glazer apportion gerrymandered boris bronx issn rank redrawing twice refused eliminates provincial jefferson returned witness campaign fletcher georgia empirically personnel size maximize half reserve read demographic percent contrary required determining throughout …
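"Each topic represents a distribution of word probabilities" can be pictured with a small sketch. The word weights below are invented for illustration and are not taken from the JSTOR model; only the topic labels and the leading words echo the slide. The helper names are hypothetical.

```python
def normalize(word_weights):
    """Convert raw per-topic word weights into probabilities that sum to 1."""
    total = sum(word_weights.values())
    return {word: weight / total for word, weight in word_weights.items()}


def top_words(topic, n=5):
    """List the n most probable words in a topic, highest probability first."""
    ranked = sorted(topic.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Made-up weights; a trained model would supply these for every vocabulary word.
topics = {
    "Redistricting": normalize({
        "redistricting": 40, "district": 30, "congressional": 15,
        "minority": 10, "political": 5,
    }),
    "Congressional districts": normalize({
        "district": 35, "congressional": 30, "congress": 20,
        "house": 10, "representative": 5,
    }),
}

for label, distribution in topics.items():
    print(label, "->", " ".join(top_words(distribution, 3)))
```

Two topics can share many of the same words, as the sample lists do; what distinguishes them is how much probability each topic assigns to each word.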
  30. 30. Keyword searching is great, but it ain’t perfect. There’s more we can do for users.
  31. 31. THANKS, DESIGN THINKING! FOCUS ON A USER’S GOALS… This article needs to pass peer review. I need more sources to back up my argument. I need to make sure I’m not missing anything.
  32. 32. THANKS, DESIGN THINKING! …AND WHAT’S STANDING IN THEIR WAY This research touches on disciplines I’m new to. How do I know if I’m finding everything? I know what I’m interested in, but the search terms I’m using aren’t working. Blergh, boolean search is too complicated.
  33. 33. THANKS, DESIGN THINKING! UNDERSTAND THE USER’S CONTEXT Hey, I’ve got my first draft right here. At least I’ve found ONE article I can use. All I have to work with is the assignment my teacher handed out. I’m nowhere near my laptop.
  34. 34. WHAT ARE WE STILL LEARNING?
  35. 35. Can we improve the topic model & the recommendations?
  36. 36. How can we embed this deeper within a user’s workflow?
  37. 37. Is this a feature, a product or a business?
  38. 38. Thank you Alex Humphreys Director, JSTOR Labs ITHAKA labs.jstor.org @abhumphreys alex.humphreys@ithaka.org
  39. 39. APPENDIX (OPEN IN CASE OF NO INTERNET CONNECTION) (BUT THIS IS A BIT SILLY, SINCE THIS IS A WEBINAR)
