ALA 2010 -- Jeremy York

793 views

Published on

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
793
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

ALA 2010 -- Jeremy York

  1. 1. HATHI TRUST A Shared Digital Repository Delivering Data For  New Generations of Research New Generations of Research Strategies and Challenges Strategies and Challenges Jeremy York NISO/BISG Forum NISO/BISG Forum ALA 2010
  2. 2. Introduction • Digital Repository Digital Repository – Initial focus on digitized book and journal content – “Light” archive Light archive • Collections and Collaboration – Comprehensive collection C h i ll ti – Shared strategies – Local services Local services – Public Good
  3. 3. Content Distribution Content Distribution 19% In Copyright Public Domain 81% 6,173,575 – Total 1,177,667 – Public Domain      * As of June 15, 2010
  4. 4. Language Distribution (1) Language Distribution (1) The top 10 languages make up ~86%  p g g p % of all content  Polish Arabic 1% Remaining  2% Italian Languages 3% Japanese 14% 4% English 48% Chinese h 4% Spanish French 4% 7% German 8% Russian 5% * As of June 15, 2010
  5. 5. Language Distribution (2) Language Distribution (2) The next 40  Serbian Romanian Ancient‐Greek Slovenian Multiple Yiddish p languages make up  2% % 1% % 1%1%% Portuguese ~13% of total Panjabi 1% Malayalam 1% Bulgarian 6% 1%1% Slovak Finnish Vietnamese 2% 1% Hindi Greek Catalan Armenian Malay 1% 2% Ukrainian 1% 6% 1% 1%% 1% 1 Hebrew Hungarian 2% 6% 2% Sanskrit Indonesian 2% 6% Norwegian Dutch D t h 2% 5% Bengali 2% Korean Latin 2% 5% Persian Urdu 3% Undetermined 4% 3% Swedish Tamil Danish Thai Czech Turkish 4% 3% Croatian 3% 3% Unknown 3% 4% 3% 4% * As of June 15, 2010
  6. 6. Originating Institution Originating Institution Penn State  Uni ersit of Indiana  University of University of  University of  University Wisconsin University Minnesota 3% 1% 0% 6% University of  California 25% University of  Michigan 65% * As of June 15, 2010
  7. 7. Content over time Content over time 100% 80% 60% Minnesota Penn State 40% California 20% Indiana 0% Wisconsin Michigan Sep‐04 4 Nov‐04 Jan‐05 Mar‐05 May‐05 Jul‐05 Sep‐05 Nov‐05 an‐06 ar‐06 y‐06 Ja May Ma N * As of June 15, 2010
  8. 8. Content Growth Content Growth
  9. 9. Data Distribution & APIs Data Distribution & APIs • OAI PMH OAI‐PMH • Metadata files • Bibliographic API ibli hi • Data API
  10. 10. Extended Services Extended Services • Community Development Environment Community Development Environment • Non‐Google Ingest • Non‐Book/Non‐Journal Ingest k/ l • Computational Research
  11. 11. Strategies for Computational Research Strategies for Computational Research • Data distribution Data distribution • Protocol‐based access • Research Center hC
  12. 12. SEASR Architecture Visualizations User Interfaces Web  Apps Plugins Services Apps Meandre Workbench r Tools Meandre Data‐Intensive Flows Repositories Components Developer Data Data Analytics Visualization Analysis Component Repository Component Discovery Components Flows Meandre Infrastructure Virtualization Infrastructure Cloud Computing
  13. 13. SEASR @ Work – Tag Cloud • Count tokens • Filter options supported • St Stem words d
  14. 14. SEASR @ Work – Entity Mash-up • E tit E t ti with Entity Extraction ith OpenNLP or Stanford NER • Locations viewed on Google Map • D Dates viewed on i d Simile Timeline
  15. 15. SEASR @ Work – Entities To Network • Identify entities • Define relationships between entities within same sentence
  16. 16. SEASR @ Work – Text Clustering • Clustering of Text by token counts • Filtering options for stop words Part of Speech words, • Dendogram Visualization
  17. 17. SEASR @ Work – Audio Analysis • NEMA: Executes a SEASR flow for each run – Loads audio data – Extracts features for every 10 sec moving window of audio i d f di – Loads and applies the models – Sends results back to the WebUI • NESTER: Annotation of Audio via Spectral Analysis
  18. 18. SEASR @ Work – Zotero • Plugin to Firefox • Zotero manages the collection • Launch SEASR Analytics – Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR – Zotero Export to Fedora through SEASR – Saves results from SEASR Analytics to a Collection • Launch MONK Processing – MONK DB Ingestion Workflow
  19. 19. SEASR @ Work – Emotion Tracking Goal is to have this type of Visualization to track emotions across  a text document (Leveraging flare.prefuse.org)
  20. 20. Sentiment Analysis: Visualization
  21. 21. Person Extraction: Scott's Waverley, Ivanhoe, and The Heart of Midlothian. 
  22. 22. Location Extraction: Top: Walter Scott's Waverley Bottom: Maria Edgeworth's Castle Rackrent
  23. 23. Thank you! hathitrust‐info@umich.edu jjyork@umich.edu

×