Linked Open Communism - c4l13

  1. Linked Open Communism: Better discovery through data dis- and re-aggregation --- or --- How I learned to shut up about linked data AND BUILD SOMETHING!! Presented at code4lib2013 by Corey A Harper, 2013-02-13
  2. Linked Data
     • Metadata as a Graph
     • Typed “things”, named by URIs
     • The relationships between those things, also built on URIs
     • Ease of integration *across* data sources – “merging graphs”
  3. (No text on slide.)
  4. Refine
  5. ViewShare
  6. Context. Narrative. Story telling. Context. The archive’s story, the library’s story, but also…
  7. Users’ stories. Adding context through recombinant metadata.
  8. Backing Away from Evangelism... Image NOT used by permission. Probably a violation of several copyrights & trademarks.
  9. Aside on metaphors. Image by Jonestown Institute via Wikimedia Commons: http://en.wikipedia.org/wiki/File:Jonestown_entrance.jpg
  10. Aside on metaphors. Image by Joe Mabel via Wikimedia Commons: http://en.wikipedia.org/wiki/File:Furthur_05.jpg
  11. (No text on slide.)
  12. Premise: Context is so central
  13. And yet our Controlled Vocabs are nearly gone, because the interfaces to them were broken
  14. (No text on slide.)
  15. The Death of Browse
      • Next-Gen Discovery Systems don’t make use of Authority Control
      • “Browse” was/is broken as a UI Design
      • Rich data in Authorities, disconnected from narrative, context, search
      • Richer “Authority” type data outside libraries...
  16. Linked Data Based UI Design For Boutique Collections
  17. A research leave. Public Domain image of Paulette Goddard via Wikimedia Commons: http://en.wikipedia.org/wiki/File:Paulette_Goddard-publicity.JPG
  18. Initial Scope. Public Domain image via Wikimedia Commons: http://en.wikipedia.org/wiki/File:Symbol-hammer-and-sickle.svg
  19. Linked Open Communism
      • Dis-aggregate EAD records into Collections & Components
      • Create a broad set of resource “types”
      • Extract key “entities” from EAD
         - People, Places, Topics, Corporate Bodies
         - Incorporate additional data about entities
      • Put this in Blacklight
      • Load MARC & other data
  20. (No text on slide.)
  21. (No text on slide.)
  22. (No text on slide.)
  23. Technology Stack - UI
      • Vanilla Blacklight
         - Minor SOLR Index Tweaks / Additions
         - Minor View Hacks
      • “pre-beta”
         - Only on localhost right now
  24. Technology Stack – Support Tools
  25. Gadget!
  26. Technology Stack - Backend
      • Python & RDFLib
      • 4Store & HTTP4Store
      • Sunburnt
      • FuzzyWuzzy
      • (Lots of other Python modules....)
  27. FuzzyWuzzy & SeatGeek!
      Fuzzy Wuzzy – Awesome Library from SeatGeek
      https://github.com/seatgeek/fuzzywuzzy
      http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
  28. Data Flow
  29. Object Oriented Python
      • Classes: Collections, Components, Entities
      • Class methods
         - makeGraph
         - makeSolr
         - to4store
         - output (turtle, rdf/xml, etc)
  30. Performance Benchmarks
      • EAD -> SOLR:
         - ~26 hrs to parse 1600 EAD, push 385k “records” to SOLR
      • DBPedia matching
         - Cross-reference label variants for entities against 9.4 million DBPedia labels (labels-en.ttl).
         - Should be using Hadoop
         - Other ideas?
      • Re-solr-izing entities: ~10 minutes
         - Pulls local copy of dbpedia data from 4store
  31. 4Store
      • Provenance-ish
         - Naming of sub-graphs
         - Default context is everything
      • First EAD cut produced ~4m triples
      • Easy to delete whole graphs, or individual triples
      • SPARQL-able – good for stats:
         - 992 DBPedia links for 6331 “Entities”
  32. Image by wallygrom via flickr: http://www.flickr.com/photos/33037982@N04/3669790240/
      https://github.com/chrpr/ead2rdf2solr
  33. Future Steps: Code to Incorporate
      • Components: Inheritance of access points
         - fuzzywuzzy string match to unittitle
         - matched about 10%
         - Extend to cross-EAD match via 4Store
      • VIAF, id.loc, fast reconciliation
      • Override configs for DBPedia matching
  34. DBPedia Override Examples
      • Germany. |t Treaties, etc. |g Soviet Union, |d 1939 Aug. 23.
         - http://dbpedia.org/page/Treaty_of_Non-Aggression_between_Germany_and_the_Soviet_Union
      • Textile Workers Strike, Gastonia, N.C., 1929.
         - http://dbpedia.org/page/Loray_Mill_Strike
  35. Further Development Next Steps
      • EAC-CPF reconciliation, record creation
      • Possibly relationship to Hydra?
         - Annotation Interface, DBP Overrides
      • SOLR Relevancy Ranking
      • SOLR-Marc Modifications
      • Update mechanism
      • Test with other Datasets (NYPL/NYU/METRO project)
  36. Thanks! corey.harper@nyu.edu 212.998.2479 @chrpr
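
Slide 2 describes metadata as a graph of URI-named things whose data sources integrate by "merging graphs". The sketch below is one way to picture that, assuming triples are plain Python tuples held in sets rather than in an RDF library such as RDFLib (which the deck lists in its backend stack). All example.org URIs and the label string are hypothetical illustration data.

```python
# Minimal sketch: a graph is a set of (subject, predicate, object) triples.
# Plain sets stand in for an RDF library; example.org URIs are hypothetical.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

COLLECTION = "http://example.org/collection/1"  # hypothetical local URI
STRIKE = "http://dbpedia.org/resource/Loray_Mill_Strike"

# Triples produced locally, e.g. from a finding aid.
local_graph = {
    (COLLECTION, RDF_TYPE, "http://example.org/types/Collection"),
    (COLLECTION, DCT_SUBJECT, STRIKE),
}

# Triples from an outside source that talk about the same URI
# (the label text is illustrative, not copied from DBpedia).
external_graph = {
    (STRIKE, RDFS_LABEL, "Loray Mill strike"),
}


def merge_graphs(*graphs):
    """Merging graphs is a set union: shared URIs join the data."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def subject_labels(graph, resource_uri):
    """Labels of everything the resource lists as a subject."""
    subjects = {o for s, p, o in graph
                if s == resource_uri and p == DCT_SUBJECT}
    return sorted(o for s, p, o in graph
                  if s in subjects and p == RDFS_LABEL)


merged_graph = merge_graphs(local_graph, external_graph)
print(subject_labels(merged_graph, COLLECTION))
```

Before the merge the local graph knows only the subject URI; after it, the label from the external source is reachable from the collection because both graphs use the same URI.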
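
Slide 19 lists the first steps as dis-aggregating EAD records into Collections & Components and extracting key entities. A rough sketch of that step, using only the standard library, might look like the following. The sample finding aid, the tag-to-type mapping, and the function name `disaggregate` are all assumptions made for illustration; this is not the code in the ead2rdf2solr repository.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EAD fragment (no namespaces).
SAMPLE_EAD = """\
<ead>
  <archdesc level="collection">
    <did><unittitle>Example Labor Collection</unittitle></did>
    <controlaccess>
      <persname>Doe, Jane</persname>
      <corpname>Example Textile Union</corpname>
      <geogname>Gastonia (N.C.)</geogname>
    </controlaccess>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letters from Jane Doe</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""

# Assumed mapping from EAD access-point tags to resource "types".
ENTITY_TYPES = {
    "persname": "Person",
    "corpname": "CorporateBody",
    "geogname": "Place",
    "subject": "Topic",
}

# Component elements are <c> or numbered <c01> ... <c12>.
COMPONENT_TAG = re.compile(r"c(\d\d)?")


def disaggregate(ead_xml):
    """Split one EAD document into a collection, components, and entities."""
    archdesc = ET.fromstring(ead_xml).find("archdesc")
    collection = {"type": "Collection",
                  "title": archdesc.findtext("did/unittitle")}

    components = []
    dsc = archdesc.find("dsc")
    if dsc is not None:
        for element in dsc.iter():  # document order, all nesting levels
            if COMPONENT_TAG.fullmatch(element.tag):
                components.append({"type": "Component",
                                   "level": element.get("level"),
                                   "title": element.findtext("did/unittitle")})

    entities = []
    for access in archdesc.findall("controlaccess"):
        for element in access:
            if element.tag in ENTITY_TYPES:
                entities.append({"type": ENTITY_TYPES[element.tag],
                                 "label": element.text})
    return collection, components, entities


collection, components, entities = disaggregate(SAMPLE_EAD)
print(collection["title"], len(components), len(entities))
```

Each returned dictionary is a separate record, so nested components can be indexed and displayed on their own.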
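
Slide 29 names the classes (Collections, Components, Entities) and the methods makeGraph, makeSolr, to4store and output, but does not show their bodies. The sketch below guesses at what two of those methods could return for an entity. Only the method names come from the slide; the field names, Solr keys, and example URI are hypothetical.

```python
from dataclasses import dataclass, field

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


@dataclass
class Entity:
    """One extracted entity; attribute names are assumptions."""
    uri: str
    entity_type: str
    label: str
    same_as: list = field(default_factory=list)  # e.g. DBpedia links

    def makeGraph(self):
        """Return the entity as a list of (s, p, o) triples."""
        triples = [
            (self.uri, RDF_TYPE, self.entity_type),
            (self.uri, RDFS_LABEL, self.label),
        ]
        triples += [(self.uri, OWL_SAME_AS, link) for link in self.same_as]
        return triples

    def makeSolr(self):
        """Return a flat document suitable for posting to a Solr index.

        The key names are hypothetical, not the project's schema.
        """
        return {
            "id": self.uri,
            "type_s": self.entity_type,
            "label_t": self.label,
            "same_as_ss": list(self.same_as),
        }


strike = Entity(
    uri="http://example.org/entity/42",  # hypothetical local URI
    entity_type="http://example.org/types/Topic",
    label="Textile Workers Strike, Gastonia, N.C., 1929.",
    same_as=["http://dbpedia.org/resource/Loray_Mill_Strike"],
)
print(strike.makeGraph())
print(strike.makeSolr())
```

Keeping the graph and Solr views on the same object means one parsed record can feed both the triple store and the search index.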
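
Slide 33 mentions letting components inherit collection-level access points by fuzzy string matching against each component's unittitle. The sketch below illustrates the idea with the standard library's `difflib` standing in for FuzzyWuzzy, so the scores will not equal FuzzyWuzzy's. The scoring function, the threshold of 85, and the sample strings are assumptions.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase alphanumeric tokens, ignoring punctuation and word order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_score(label, title):
    """Score 0-100: average, over the label's tokens, of each token's
    best similarity to any token in the title."""
    label_tokens, title_tokens = tokens(label), tokens(title)
    if not label_tokens or not title_tokens:
        return 0
    best = [
        max(SequenceMatcher(None, lt, tt).ratio() for tt in title_tokens)
        for lt in label_tokens
    ]
    return round(100 * sum(best) / len(best))


def inherit_access_points(access_points, unittitle, threshold=85):
    """Keep the collection-level access points that appear, approximately,
    in a component's unittitle. The threshold is an arbitrary assumption."""
    return [ap for ap in access_points
            if token_score(ap, unittitle) >= threshold]


collection_access_points = [
    "Doe, Jane",
    "Example Textile Union",
    "Gastonia (N.C.)",
]
matched = inherit_access_points(collection_access_points,
                                "Letters from Jane Doe")
print(matched)
```

Scoring token by token lets an inverted heading such as "Doe, Jane" match a title that says "Jane Doe", and tolerates small spelling differences in either string.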