ALA 2010 -- Jeremy York
Transcript

  • 1. HATHI TRUST: A Shared Digital Repository. Delivering Data for New Generations of Research: Strategies and Challenges. Jeremy York, NISO/BISG Forum, ALA 2010
  • 2. Introduction • Digital Repository – Initial focus on digitized book and journal content – “Light” archive • Collections and Collaboration – Comprehensive collection – Shared strategies – Local services – Public Good
  • 3. Content Distribution • Public Domain: 19% • In Copyright: 81% • 6,173,575 – Total • 1,177,667 – Public Domain * As of June 15, 2010
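The two percentages on this slide follow from the two counts it reports: 1,177,667 out of 6,173,575 is about 19%, which leaves about 81% in copyright. A minimal check of that arithmetic (the variable names are illustrative only):

```python
# Counts as reported on the slide (as of June 15, 2010).
total = 6_173_575
public_domain = 1_177_667

# Share of the collection in the public domain, and the remainder in copyright.
public_domain_share = public_domain / total
in_copyright_share = 1 - public_domain_share

print(f"Public domain: {public_domain_share:.0%}")
print(f"In copyright:  {in_copyright_share:.0%}")
```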
  • 4. Language Distribution (1) The top 10 languages make up ~86% of all content • English 48% • German 8% • French 7% • Russian 5% • Chinese 4% • Japanese 4% • Spanish 4% • Italian 3% • Arabic 2% • Polish 1% • Remaining Languages 14% * As of June 15, 2010
  • 5. Language Distribution (2) The next 40 languages make up ~13% of total [Pie chart; individual slices range from 1% to 6%. Labels: Ancient Greek, Armenian, Bengali, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Finnish, Greek, Hebrew, Hindi, Hungarian, Indonesian, Korean, Latin, Malay, Malayalam, Multiple, Norwegian, Panjabi, Persian, Portuguese, Romanian, Sanskrit, Serbian, Slovak, Slovenian, Swedish, Tamil, Thai, Turkish, Ukrainian, Undetermined, Unknown, Urdu, Vietnamese, Yiddish] * As of June 15, 2010
  • 6. Originating Institution • University of Michigan 65% • University of California 25% • Penn State University, Indiana University, University of Wisconsin, and University of Minnesota: the remaining slices (6%, 3%, 1%, and 0%; the pairing of these values with institutions is not recoverable from the transcript) * As of June 15, 2010
  • 7. Content over time [Chart: share of content (0% to 100%) by originating institution over time, starting Sep‑04. Series: Michigan, Wisconsin, Indiana, California, Penn State, Minnesota] * As of June 15, 2010
  • 8. Content Growth
  • 9. Data Distribution & APIs • OAI‐PMH • Metadata files • Bibliographic API • Data API
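OAI‐PMH, the first item on this slide, is a standard harvesting protocol: a client sends a `ListRecords` request and pages through the results with a `resumptionToken`. The sketch below builds such a request and parses a response using only the Python standard library. The base URL and the sample response are invented placeholders, not HathiTrust's actual endpoint or data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    Per the protocol, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_list_records(xml_text):
    """Return ([(identifier, [titles])], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        titles = [t.text for t in rec.iterfind(".//dc:title", NS)]
        records.append((identifier, titles))
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)

# Invented sample response, used here so the sketch runs without a network.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Waverley</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url(BASE_URL, resumption_token=token)
```

A real harvester would fetch each URL, parse the response, and repeat until no resumption token is returned.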
  • 10. Extended Services • Community Development Environment • Non‐Google Ingest • Non‐Book/Non‐Journal Ingest • Computational Research
  • 11. Strategies for Computational Research • Data distribution • Protocol‐based access • Research Center
  • 12. SEASR Architecture [Diagram. Labels: Visualizations; User Interfaces (Web Apps, Plugins, Services, Apps); Tools (Meandre Workbench); Meandre Data‐Intensive Flows; Repositories; Components (Data, Data Analytics, Visualization, Analysis); Developer; Component Repository; Component Discovery; Components; Flows; Meandre Infrastructure; Virtualization Infrastructure; Cloud Computing]
  • 13. SEASR @ Work – Tag Cloud • Count tokens • Filter options supported • Stem words
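The three tag cloud steps (count tokens, filter, stem) can be sketched in a few lines. This is an illustration of the idea, not SEASR's implementation: the stop-word list is an assumption, and the stemmer is a deliberately naive suffix stripper standing in for a real algorithm such as Porter's.

```python
import re
from collections import Counter

# Assumed stop-word list for the example; a real flow would make this configurable.
STOP_WORDS = frozenset({"a", "an", "and", "for", "of", "the", "to"})

def naive_stem(token):
    """Strip one common English suffix. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tag_cloud_counts(text, stop_words=STOP_WORDS):
    """Tokenize, drop stop words, stem, and count what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in stop_words)

sample = "Digitized books and digitized journals: books for research, research for books."
counts = tag_cloud_counts(sample)
```

The resulting counts would drive the font size of each term in the cloud.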
  • 14. SEASR @ Work – Entity Mash-up • Entity Extraction with OpenNLP or Stanford NER • Locations viewed on Google Map • Dates viewed on Simile Timeline
  • 15. SEASR @ Work – Entities To Network • Identify entities • Define relationships between entities within same sentence
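A minimal sketch of the second step, assuming the entities have already been identified (the slides use OpenNLP or Stanford NER for that; here the entity list is simply passed in). Two entities are linked whenever they appear in the same sentence, and the edge weight counts how often that happens. The sentence splitter, the substring matching, and the sample text are simplifications made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_edges(text, entities):
    """Count, for each pair of entities, the sentences in which both appear."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    edges = Counter()
    for sentence in sentences:
        present = sorted(e for e in entities if e in sentence)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

# Invented sample text; the names are borrowed from Scott's Waverley.
sample = (
    "Waverley met Fergus in the Highlands. "
    "Fergus wrote to Flora. "
    "Flora and Waverley spoke of Fergus."
)
edges = cooccurrence_edges(sample, ["Waverley", "Fergus", "Flora", "Highlands"])
```

Each key of `edges` is an alphabetically ordered pair, so the same relationship is never counted under two different keys.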
  • 16. SEASR @ Work – Text Clustering • Clustering of Text by token counts • Filtering options for stop words, Part of Speech • Dendrogram Visualization
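Clustering by token counts can be illustrated with a small agglomerative procedure: each document becomes a vector of token counts, and the two closest clusters are merged repeatedly. The slide does not say which distance or linkage SEASR uses, so cosine distance and single linkage are assumptions here. The list of merges, with their distances, is exactly the information a dendrogram draws.

```python
import math
import re
from collections import Counter

def token_counts(text, stop_words=frozenset()):
    """Token-count vector for one document, with optional stop-word filtering."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

def cosine_distance(a, b):
    """1 - cosine similarity of two count vectors (1.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def agglomerate(docs, stop_words=frozenset()):
    """Single-linkage clustering; returns merges as (cluster_a, cluster_b, distance)."""
    vectors = {name: token_counts(text, stop_words) for name, text in docs.items()}
    clusters = [(name,) for name in docs]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(cosine_distance(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Invented toy documents.
docs = {
    "a": "castle knight castle sword",
    "b": "knight castle sword shield",
    "c": "ship sea sail harbor",
}
merges = agglomerate(docs)
```

In this toy input, documents `a` and `b` share most of their vocabulary and merge first; `c` shares none and joins last.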
  • 17. SEASR @ Work – Audio Analysis • NEMA: Executes a SEASR flow for each run – Loads audio data – Extracts features for every 10 sec moving window of audio – Loads and applies the models – Sends results back to the WebUI • NESTER: Annotation of Audio via Spectral Analysis
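The "10 sec moving window" step can be sketched as a generator that slides a fixed-length window over the samples and computes one feature per window. The slide does not give the hop size or the features NEMA extracts, so the 5-second hop and the RMS energy feature below are assumptions chosen only to make the example concrete.

```python
import math

def moving_windows(samples, sample_rate, window_sec=10.0, hop_sec=5.0):
    """Yield (start_time_sec, window) for each full window in the signal."""
    win = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must each cover at least one sample")
    for start in range(0, len(samples) - win + 1, hop):
        yield start / sample_rate, samples[start:start + win]

def rms(window):
    """Root-mean-square energy; a stand-in for the real feature extractors."""
    return math.sqrt(sum(x * x for x in window) / len(window))

# Toy signal: 30 seconds of constant amplitude at a 2 Hz "sample rate".
signal = [1.0] * 60
features = [(t, rms(w)) for t, w in moving_windows(signal, sample_rate=2)]
```

A model would then be applied to each per-window feature vector, matching the "loads and applies the models" step on the slide.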
  • 18. SEASR @ Work – Zotero • Plugin to Firefox • Zotero manages the collection • Launch SEASR Analytics – Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR – Zotero Export to Fedora through SEASR – Saves results from SEASR Analytics to a Collection • Launch MONK Processing – MONK DB Ingestion Workflow
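The citation analysis ranks authors by their importance in the citation network. The slide names JUNG's network importance algorithms without saying which one is used, so the sketch below uses PageRank by power iteration as an assumed stand-in. The author names and citation edges are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given as (citing, cited) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the same baseline share each round.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node that cites no one spreads its rank evenly over all nodes.
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical network: three authors all cite author_c.
citations = [
    ("author_a", "author_c"),
    ("author_b", "author_c"),
    ("author_d", "author_c"),
]
ranks = pagerank(citations)
top_author = max(ranks, key=ranks.get)
```

In the workflow on the slide, the edge list would come from the RDF data exported from Zotero rather than from a hard-coded list.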
  • 19. SEASR @ Work – Emotion Tracking Goal is to have this type of Visualization to track emotions across a text document (leveraging flare.prefuse.org)
  • 20. Sentiment Analysis: Visualization
  • 21. Person Extraction: Scott's Waverley, Ivanhoe, and The Heart of Midlothian. 
  • 22. Location Extraction: Top: Walter Scott's Waverley Bottom: Maria Edgeworth's Castle Rackrent
  • 23. Thank you! hathitrust‐info@umich.edu jjyork@umich.edu