Linked Open Communism:
    Better discovery through data dis- and re-aggregation
                            --- or ---
      How I learned to shut up about linked data
            AND BUILD SOMETHING!!




Presented at code4lib2013
by Corey A Harper
2013-02-13
Linked Data




   • Metadata as a Graph
   • Typed “things”, named by URIs
   • The relationships between those
     things, also built on URIs
   • Ease of integration *across* data
     sources – “merging graphs”
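A rough illustration of "merging graphs", assuming plain Python sets of (subject, predicate, object) tuples as a stand-in for an RDF library's graph object. The example.org URIs and the names archive_graph and authority_graph are hypothetical.

```python
# Each graph is a set of (subject, predicate, object) triples.
# The example.org URIs are illustrative placeholders, not real identifiers.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
DC_CREATOR = "http://purl.org/dc/terms/creator"

person = "http://example.org/entity/1"

# Data from a finding aid: a collection and the thing that created it.
archive_graph = {
    ("http://example.org/collection/1", DC_CREATOR, person),
    (person, RDF_TYPE, FOAF_PERSON),
}

# Data from an outside source about the same URI.
authority_graph = {
    (person, RDF_TYPE, FOAF_PERSON),
    (person, FOAF_NAME, "Example Person"),
}

# Because both sources name the thing with the same URI,
# merging is just set union; the shared triple collapses to one.
merged = archive_graph | authority_graph


def describe(graph, subject):
    """Return every (predicate, object) pair known about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)
```

The merged graph holds three triples rather than four, and describe(merged, person) returns statements contributed by both sources.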

   2013-02-13   ☭ code4lib2013 ☭   2
Refine




ViewShare




Context




                     Narrative
                    Storytelling
                      Context

                The archive’s story,
                The library's story,
                    but also…

Users’ stories

       Adding context through recombinant
                    metadata



Backing Away from Evangelism...




 Image NOT used by permission.
 Probably a violation of several copyrights & trademarks.
Aside on metaphors


   Image by Jonestown Institute via Wikimedia Commons
   http://en.wikipedia.org/wiki/File:Jonestown_entrance.jpg

Aside on metaphors


   Image by Joe Mabel via Wikimedia Commons.
   http://en.wikipedia.org/wiki/File:Furthur_05.jpg

Premise




                Context is so central




And yet our Controlled Vocabs
                    Are nearly gone




             Because the interfaces to them
                     were broken

The Death of Browse




    • Next-Gen Discovery Systems don't
      make use of Authority Control
    • “Browse” was/is broken as a UI Design
    • Rich data in Authorities, disconnected
      from narrative, context, search
    • Richer “Authority” type data outside
      libraries...

Linked Data Based UI Design
For Boutique Collections




A research leave


   Public Domain image of Paulette Goddard via Wikimedia Commons.
   http://en.wikipedia.org/wiki/File:Paulette_Goddard-publicity.JPG

Initial Scope


   Public Domain image via Wikimedia Commons.
   http://en.wikipedia.org/wiki/File:Symbol-hammer-and-sickle.svg

Linked Open Communism




  • Dis-aggregate EAD records into
    Collections & Components
  • Create a broad set of resource “types”
  • Extract key “entities” from EAD
        People, Places, Topics, Corporate Bodies
        Incorporate additional data about entities
  • Put this in Blacklight
  • Load MARC & other data
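The dis-aggregation step above can be sketched roughly as follows. This is a minimal illustration using only the standard library, assuming an un-namespaced EAD 2002 file; the sample record, the ENTITY_TAGS mapping, and the disaggregate function are hypothetical and are not taken from the project's code.

```python
import xml.etree.ElementTree as ET

# Tiny made-up finding aid; real EAD files are far larger and often namespaced.
SAMPLE_EAD = """<ead>
  <archdesc level="collection">
    <did><unittitle>Example Union Records</unittitle></did>
    <controlaccess>
      <persname>Doe, Jane</persname>
      <geogname>Gastonia (N.C.)</geogname>
      <subject>Strikes and lockouts</subject>
      <corpname>Example Workers Union</corpname>
    </controlaccess>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letters, 1929</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""

# EAD access-point elements mapped to the entity types named on the slide.
ENTITY_TAGS = {
    "persname": "Person",
    "geogname": "Place",
    "subject": "Topic",
    "corpname": "Corporate Body",
}


def is_component(tag):
    """True for <c> and the numbered forms <c01> through <c12>."""
    return tag == "c" or (len(tag) == 3 and tag[0] == "c" and tag[1:].isdigit())


def disaggregate(ead_xml):
    """Split one EAD document into a collection, its components, and entities."""
    archdesc = ET.fromstring(ead_xml).find("archdesc")
    collection = {"type": "Collection",
                  "title": archdesc.findtext("did/unittitle")}
    components = [
        {"type": "Component", "level": el.get("level"),
         "title": el.findtext("did/unittitle")}
        for el in archdesc.iter() if is_component(el.tag)
    ]
    entities = [
        {"type": ENTITY_TAGS[el.tag], "label": (el.text or "").strip()}
        for el in archdesc.iter() if el.tag in ENTITY_TAGS
    ]
    return collection, components, entities


collection, components, entities = disaggregate(SAMPLE_EAD)
```

Each returned dict is a candidate "record" that could be typed, serialized as RDF, or indexed on its own.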

Technology Stack - UI




    • Vanilla Blacklight
         Minor SOLR Index Tweaks / Additions
         Minor View Hacks
    • “pre-beta”
         Only on localhost right now




Technology Stack – Support Tools




Gadget!




Technology Stack - Backend




    • Python & RDFLib
    • 4Store & HTTP4Store
    • Sunburnt
    • FuzzyWuzzy
    • (Lots of other Python modules....)



FuzzyWuzzy & SeatGeek!


   Fuzzy Wuzzy – Awesome Library from SeatGeek
   https://github.com/seatgeek/fuzzywuzzy
   http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python

Data Flow




Object Oriented Python




    • Classes: Collections, Components,
      Entities
    • Class methods
         makeGraph
         makeSolr
         to4store
         output (turtle, rdf/xml, etc)
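One way the class layout above might look, shown for an entity only. The method names makeGraph, makeSolr, and output come from the slide; their signatures, the triple representation, and the Solr field names are assumptions made for illustration. to4store is left out because it needs a running store.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


class Entity:
    """One extracted entity (person, place, topic, or corporate body)."""

    def __init__(self, uri, label, type_uri):
        self.uri = uri
        self.label = label
        self.type_uri = type_uri

    def makeGraph(self):
        """Return triples; each object is tagged as a URI or a literal."""
        return [
            (self.uri, RDF_TYPE, ("uri", self.type_uri)),
            (self.uri, RDFS_LABEL, ("literal", self.label)),
        ]

    def makeSolr(self):
        """Return a flat dict shaped like a Solr document (field names assumed)."""
        return {"id": self.uri, "type": self.type_uri, "label": self.label}

    def output(self, fmt="nt"):
        """Serialize the graph; only N-Triples is sketched here."""
        if fmt != "nt":
            raise ValueError("only 'nt' is implemented in this sketch")
        lines = []
        for s, p, (kind, value) in self.makeGraph():
            if kind == "uri":
                obj = "<%s>" % value
            else:
                # Escape backslashes and quotes inside the literal.
                obj = '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append("<%s> <%s> %s ." % (s, p, obj))
        return "\n".join(lines)


entity = Entity("http://example.org/entity/1", "Example Person",
                "http://xmlns.com/foaf/0.1/Person")
```

Keeping graph building, Solr document building, and serialization on the same object means each resource type can be re-indexed or re-exported without re-parsing the EAD.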


Performance Benchmarks


   • EAD -> SOLR:
         ~26 hrs to parse 1600 EAD, push 385k
          “records” to SOLR
   • DBPedia matching
         X-ref label variants for entities against 9.4
          million DBPedia labels (labels-en.ttl).
         Should be using Hadoop
         Other ideas?
   • Re-solr-izing entities: ~10 minutes
         Pulls local copy of dbpedia data from 4store
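One possible answer to "Other ideas?", offered as a sketch rather than as what the project does: read the labels file once into a dictionary keyed on a normalized label, so each variant costs one lookup instead of a scan. The line format assumed for labels-en.ttl, the normalize rule, and the function names are all assumptions, and the approach presumes the index fits in memory.

```python
import re

# Assumed shape of one line in the labels file:
# <resource-uri> <rdfs:label-uri> "Label text"@en .
LABEL_LINE = re.compile(
    r'^<(?P<uri>[^>]+)>\s+'
    r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+'
    r'"(?P<label>.*)"@en\s+\.\s*$'
)


def normalize(label):
    """Lowercase and collapse punctuation and whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", label.lower()).split())


def build_index(lines):
    """Map normalized label -> first URI seen with that label."""
    index = {}
    for line in lines:
        m = LABEL_LINE.match(line)
        if m:
            index.setdefault(normalize(m.group("label")), m.group("uri"))
    return index


def match_entity(variants, index):
    """Return the URI for the first label variant found in the index."""
    for variant in variants:
        uri = index.get(normalize(variant))
        if uri:
            return uri
    return None


# Two sample lines in the assumed format.
SAMPLE = [
    '<http://dbpedia.org/resource/Loray_Mill_Strike> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Loray Mill Strike"@en .',
    '<http://dbpedia.org/resource/Soviet_Union> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Soviet Union"@en .',
]
index = build_index(SAMPLE)
```

This only catches variants that are equal after normalization; near-misses would still need fuzzy matching or a hand-maintained override.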
4Store




    • Provenance-ish
          Naming of sub-graphs
          Default context is everything
    • First EAD cut produced ~4m triples
    • Easy to delete whole graphs, or individual
      triples
    • SPARQL-able – good for stats:
          992 DBPedia links for 6331 “Entities”
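The named-graph idea above, sketched with a toy in-memory quad store. This is not 4store's API; the QuadStore class, graph names, and prefixed terms are hypothetical and only show why naming sub-graphs makes provenance and bulk deletion easy.

```python
class QuadStore:
    """Toy in-memory quad store: every triple lives in a named graph."""

    def __init__(self):
        self.quads = set()

    def add(self, graph, s, p, o):
        self.quads.add((graph, s, p, o))

    def triples(self, graph=None):
        """With no graph name, query the union of all graphs."""
        return {(s, p, o) for g, s, p, o in self.quads
                if graph is None or g == graph}

    def delete_graph(self, graph):
        """Drop every triple that came from one source."""
        self.quads = {q for q in self.quads if q[0] != graph}


store = QuadStore()
store.add("http://example.org/graph/ead",
          "ex:e1", "rdfs:label", "Example Person")
store.add("http://example.org/graph/dbpedia",
          "ex:e1", "owl:sameAs", "dbpedia:Example")

total_before = len(store.triples())   # default context sees both graphs
store.delete_graph("http://example.org/graph/dbpedia")
total_after = len(store.triples())    # only the EAD-derived triple remains

# Figures from the slide: 992 DBPedia links across 6331 entities.
links_per_100_entities = round(100 * 992 / 6331, 1)
```

With the slide's figures, that works out to roughly 15.7 DBPedia links per 100 entities.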

   Image by wallygrom via flickr
   http://www.flickr.com/photos/33037982@N04/3669790240/

   https://github.com/chrpr/ead2rdf2solr

Future Steps: Code to Incorporate




    • Components: Inheritance of
      access points
         fuzzywuzzy string match to unittitle
         matched about 10%
         Extend to cross-EAD match via 4Store
    • VIAF, id.loc.gov, FAST reconciliation
    • Override configs for DBPedia matching
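A sketch of how access points might be inherited by a component when they fuzzy-match its unittitle. It uses the standard library's difflib as a stand-in for fuzzywuzzy; partial_ratio here is a hand-rolled approximation, and the threshold of 85 and the sample strings are assumptions.

```python
from difflib import SequenceMatcher


def partial_ratio(a, b):
    """Best 0-100 similarity of the shorter string against any
    same-length window of the longer one (case-insensitive)."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    if not short:
        return 0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return int(round(100 * best))


def inherit_access_points(unittitle, access_points, threshold=85):
    """Keep the collection-level access points that appear,
    at least approximately, in a component's title."""
    return [ap for ap in access_points
            if partial_ratio(ap, unittitle) >= threshold]


# Hypothetical collection-level access points and one component title.
access_points = ["Example Workers Union", "Gastonia (N.C.)"]
inherited = inherit_access_points(
    "Letters from the Example Workers Union, 1929", access_points)
```

Here only "Example Workers Union" is inherited, since the place name does not occur in the title. Lowering the threshold trades precision for recall.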


DBPedia Override Examples




      Germany. |t Treaties, etc. |g Soviet Union, |d
      1939 Aug. 23.
      http://dbpedia.org/page/Treaty_of_Non-Aggression_between_Germany_and_the_Soviet_Union

      Textile Workers' Strike, Gastonia, N.C., 1929.
      http://dbpedia.org/page/Loray_Mill_Strike
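An override config for headings like these could be as simple as a lookup table consulted before any automatic matching. The sketch below uses the two examples from the slide; the OVERRIDES structure and the resolve function are hypothetical, not the project's actual config format.

```python
# Hand-curated heading -> DBPedia URL pairs, taken from the slide.
OVERRIDES = {
    "Germany. |t Treaties, etc. |g Soviet Union, |d 1939 Aug. 23.":
        "http://dbpedia.org/page/"
        "Treaty_of_Non-Aggression_between_Germany_and_the_Soviet_Union",
    "Textile Workers' Strike, Gastonia, N.C., 1929.":
        "http://dbpedia.org/page/Loray_Mill_Strike",
}


def resolve(heading, automatic_match):
    """Prefer a curated override; otherwise fall back to the matcher.

    automatic_match is any callable taking a heading and returning
    a URL or None.
    """
    key = " ".join(heading.split())  # collapse line wraps and extra spaces
    if key in OVERRIDES:
        return OVERRIDES[key]
    return automatic_match(heading)
```

Headings whose authorized form shares little text with the DBPedia label, as in both examples, are the cases where string matching is unlikely to succeed on its own.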


Further Development Next Steps




    • EAC-CPF reconciliation, record creation
    • Possible relationship to Hydra?
         Annotation Interface, DBPedia Overrides
    • SOLR Relevancy Ranking
    • SOLR-Marc Modifications
    • Update mechanism
    • Test with other Datasets
      (NYPL/NYU/METRO project)

Thanks!




                corey.harper@nyu.edu
                    212.998.2479
                       @chrpr




