Linked Open Communism:
    Better discovery through data dis- and re-aggregation
                            --- or ---
      How I learned to shut up about linked data
            AND BUILD SOMETHING!!




Presented at code4lib2013
by Corey A Harper
2013-02-13
Linked Data




   • Metadata as a Graph
   • Typed “things”, named by URIs
   • The relationships between those
     things, also built on URIs
   • Ease of integration *across* data
     sources – “merging graphs”
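
The "merging graphs" bullet can be shown with a minimal sketch. It assumes triples are modeled as plain Python tuples held in sets, a stdlib stand-in for an RDFLib graph; every example.org URI and the choice of predicates are hypothetical. With URI-named things only, merging reduces to set union.

```python
# Toy stand-in for an RDF graph: a set of (subject, predicate, object) triples.
# All example.org URIs below are hypothetical.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
DCT_CREATOR = "http://purl.org/dc/terms/creator"

person = "http://example.org/entity/person-1"

# Triples taken from a local archival description.
archive_graph = {
    (person, RDF_TYPE, FOAF_PERSON),
    ("http://example.org/collection/1", DCT_CREATOR, person),
}

# Triples about the same URI from an outside data source.
external_graph = {
    (person, RDF_TYPE, FOAF_PERSON),
    (person, FOAF_NAME, "Example Person"),
}

# Merging is set union: shared URIs line up and duplicate triples collapse.
merged = archive_graph | external_graph


def describe(graph, subject):
    """Return every (predicate, object) pair recorded for a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)
```

Because both sources name the person with the same URI, the merged graph holds one description of that person rather than two.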

   2013-02-13   ☭ code4lib2013 ☭   2
Refine




ViewShare




Context




                     Narrative
                    Story telling
                      Context

                The archive’s story,
                The library's story,
                    but also…

Users’ stories

       Adding context through recombinant
                    metadata



Backing Away from Evangelism...




 Image NOT used by permission.
 Probably a violation of several copyrights & trademarks.
Aside on metaphors

 Image by Jonestown Institute via Wikimedia Commons
 http://en.wikipedia.org/wiki/File:Jonestown_entrance.jpg
Aside on metaphors

 Image by Joe Mabel via Wikimedia Commons.
 http://en.wikipedia.org/wiki/File:Furthur_05.jpg
Premise




                Context is so central




And yet our Controlled Vocabs
                    Are nearly gone




             Because the interfaces to them
                     were broken

The Death of Browse




    • Next-Gen Discovery Systems don't
      make use of Authority Control
    • “Browse” was/is broken as a UI Design
    • Rich data in Authorities, disconnected
      from narrative, context, search
    • Richer “Authority” type data outside
      libraries...

Linked Data Based UI Design
For Boutique Collections




A research leave

 Public Domain image of Paulette Goddard via Wikimedia Commons.
 http://en.wikipedia.org/wiki/File:Paulette_Goddard-publicity.JPG
Initial Scope

 Public Domain image via Wikimedia Commons.
 http://en.wikipedia.org/wiki/File:Symbol-hammer-and-sickle.svg
Linked Open Communism




  • Dis-aggregate EAD records into
    Collections & Components
  • Create a broad set of resource “types”
  • Extract key “entities” from EAD
        People, Places, Topics, Corporate Bodies
        Incorporate additional data about entities
  • Put this in Blacklight
  • Load MARC & other data
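
The dis-aggregation step might look roughly like the sketch below. It is an illustration only, not the project's code: the EAD fragment is invented and heavily trimmed, it assumes DTD-style EAD 2002 without an XML namespace, and the mapping from EAD tags to entity types is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAD fragment (no namespace assumed).
SAMPLE_EAD = """
<ead>
  <archdesc level="collection">
    <did><unittitle>Example Labor Papers</unittitle></did>
    <controlaccess>
      <persname>Doe, Jane</persname>
      <corpname>Example Workers Union</corpname>
      <geogname>Gastonia (N.C.)</geogname>
      <subject>Strikes and lockouts</subject>
    </controlaccess>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letters from Jane Doe</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""

# Assumed mapping from EAD access-point tags to the entity types on the slide.
ENTITY_TAGS = {
    "persname": "Person",
    "geogname": "Place",
    "subject": "Topic",
    "corpname": "Corporate Body",
}


def is_component(tag):
    """True for <c> and numbered component tags such as <c01>..<c12>."""
    return tag == "c" or (tag[:1] == "c" and tag[1:].isdigit())


def disaggregate(ead_xml):
    """Split one EAD record into a collection, its components, and its entities."""
    archdesc = ET.fromstring(ead_xml).find("archdesc")
    collection = {"type": "Collection",
                  "title": archdesc.findtext("did/unittitle")}
    components = [
        {"type": "Component", "level": el.get("level"),
         "title": el.findtext("did/unittitle")}
        for el in archdesc.iter() if is_component(el.tag)
    ]
    entities = [
        {"type": ENTITY_TAGS[el.tag], "label": (el.text or "").strip()}
        for el in archdesc.iter() if el.tag in ENTITY_TAGS
    ]
    return collection, components, entities


collection, components, entities = disaggregate(SAMPLE_EAD)
```

Each returned dictionary is a candidate "record" of its own, which is what lets components and entities be indexed and displayed separately from the finding aid they came from.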

Technology Stack - UI




    • Vanilla Blacklight
         Minor SOLR Index Tweaks / Additions
         Minor View Hacks
    • “pre-beta”
         Only on localhost right now




Technology Stack – Support Tools




Gadget!




Technology Stack - Backend




    • Python & RDFLib
    • 4Store & HTTP4Store
    • Sunburnt
    • FuzzyWuzzy
    • (Lots of other Python modules....)



FuzzyWuzzy & SeatGeek!

 Fuzzy Wuzzy – Awesome Library from SeatGeek
 https://github.com/seatgeek/fuzzywuzzy
 http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
Data Flow




Object Oriented Python




    • Classes: Collections, Components,
      Entities
    • Class methods
         makeGraph
         makeSolr
         to4store
         output (turtle, rdf/xml, etc)
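
A rough sketch of how such classes could hang together, using only the standard library. The method names makeGraph, makeSolr, and output come from the slide; the shared base class, the field names, the example.org vocabulary, and the N-Triples-only output are assumptions. to4store is left out because it needs a running triple store.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_TITLE = "http://purl.org/dc/terms/title"
VOCAB = "http://example.org/vocab/"  # hypothetical vocabulary namespace


class Resource:
    """Hypothetical shared base for collections, components, and entities."""
    kind = "Resource"

    def __init__(self, uri, title):
        self.uri = uri
        self.title = title

    def makeGraph(self):
        """Return this resource as a set of (subject, predicate, object) triples."""
        return {
            (self.uri, RDF_TYPE, VOCAB + self.kind),
            (self.uri, DCT_TITLE, self.title),
        }

    def makeSolr(self):
        """Return a flat dictionary shaped like a Solr document (field names assumed)."""
        return {"id": self.uri, "type": self.kind, "title": self.title}

    def output(self):
        """Serialize the graph as N-Triples, one statement per line."""
        lines = []
        for s, p, o in sorted(self.makeGraph()):
            if o.startswith("http://"):
                # Toy heuristic: anything that looks like a URI is a resource.
                obj = "<" + o + ">"
            else:
                obj = '"' + o.replace("\\", "\\\\").replace('"', '\\"') + '"'
            lines.append("<%s> <%s> %s ." % (s, p, obj))
        return "\n".join(lines)


class Collection(Resource):
    kind = "Collection"


class Component(Resource):
    kind = "Component"


class Entity(Resource):
    kind = "Entity"
```

Keeping the graph and the Solr document as two views of the same object means one parse of the EAD can feed both the triple store and the search index.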


Performance Benchmarks


   • EAD -> SOLR:
         ~26 hrs to parse 1600 EAD, push 385k
          “records” to SOLR
   • DBPedia matching
         Cross-reference label variants for entities against 9.4
          million DBPedia labels (labels-en.ttl).
         Should be using Hadoop
         Other ideas?
   • Re-solr-izing entities: ~10 minutes
         Pulls local copy of dbpedia data from 4store
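
One possible answer to "Other ideas?", offered as a sketch and not as what the project does: read the label file once into a dictionary keyed on a normalized label, so each variant becomes a single lookup instead of a comparison against every label. The line format assumed for labels-en.ttl (one rdfs:label triple per line with an @en tag) and the normalization rule are assumptions; escaped characters inside labels are left untouched.

```python
import re

# Assumed shape of one line in the DBPedia labels dump.
LABEL_LINE = re.compile(
    r'^<(?P<uri>[^>]+)> '
    r'<http://www\.w3\.org/2000/01/rdf-schema#label> '
    r'"(?P<label>.*)"@en \.$'
)


def normalize(label):
    """Lowercase and replace punctuation with spaces so close variants compare equal."""
    return " ".join(re.sub(r"[^\w\s]", " ", label.lower()).split())


def build_index(lines):
    """Map each normalized label to the first URI seen for it."""
    index = {}
    for line in lines:
        m = LABEL_LINE.match(line.strip())
        if m:
            index.setdefault(normalize(m.group("label")), m.group("uri"))
    return index


def match_entity(variants, index):
    """Return the URI for the first label variant found in the index, else None."""
    for variant in variants:
        uri = index.get(normalize(variant))
        if uri:
            return uri
    return None


# Two invented sample lines in the assumed format.
SAMPLE_LINES = [
    '<http://dbpedia.org/resource/Loray_Mill_Strike> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Loray Mill Strike"@en .',
    '<http://dbpedia.org/resource/Gastonia,_North_Carolina> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Gastonia, North Carolina"@en .',
]
index = build_index(SAMPLE_LINES)
```

This only catches variants that normalize to the same string; fuzzy matching would still be needed for the rest, but against a much smaller remainder.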
4Store




    • Provenance-ish
          Naming of sub-graphs
          Default context is everything
    • First EAD cut produced ~4m triples
    • Easy to delete whole graphs, or individual
      triples
    • SPARQL-able – good for stats:
          992 DBPedia links for 6331 “Entities”
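
The named sub-graph idea can be modeled in a few lines. This is a toy in-memory stand-in, not the 4Store or HTTP4Store API; the class, its method names, and the example.org graph names are all hypothetical. The last line works out the link rate from the figures on the slide.

```python
class QuadStore:
    """Toy model of a triple store whose triples live in named graphs."""

    def __init__(self):
        self.graphs = {}  # graph URI -> set of (s, p, o) triples

    def add(self, graph_uri, triple):
        self.graphs.setdefault(graph_uri, set()).add(triple)

    def default_context(self):
        """The default context is the union of every named graph."""
        return set().union(*self.graphs.values())

    def delete_graph(self, graph_uri):
        """Drop a whole named graph, e.g. to reload one source."""
        self.graphs.pop(graph_uri, None)

    def delete_triple(self, graph_uri, triple):
        """Remove a single triple from one named graph."""
        self.graphs.get(graph_uri, set()).discard(triple)


store = QuadStore()
# Graph names record where each triple came from (hypothetical URIs).
store.add("http://example.org/graph/ead",
          ("http://example.org/entity/1",
           "http://www.w3.org/2000/01/rdf-schema#label",
           "Textile Workers' Strike, Gastonia, N.C., 1929."))
store.add("http://example.org/graph/dbpedia-links",
          ("http://example.org/entity/1",
           "http://www.w3.org/2002/07/owl#sameAs",
           "http://dbpedia.org/resource/Loray_Mill_Strike"))

total_before = len(store.default_context())

# Throw away every DBPedia link at once without touching the EAD triples.
store.delete_graph("http://example.org/graph/dbpedia-links")
total_after = len(store.default_context())

# Share of entities with a DBPedia link, from the slide's numbers.
link_rate = round(992 / 6331 * 100, 1)
```

With 992 links across 6331 entities, roughly 15.7% of entities were matched to DBPedia.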

https://github.com/chrpr/ead2rdf2solr

 Image by wallygrom via flickr
 http://www.flickr.com/photos/33037982@N04/3669790240/
Future Steps: Code to Incorporate




    • Components: Inheritance of
      accesspoints
         fuzzywuzzy string match to unittitle
         matched about 10%
         Extend to cross ead match via 4Store
    • VIAF, id.loc.gov, FAST reconciliation
    • Override configs for DBPedia matching
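
Access-point inheritance by string match could be sketched as follows. The standard library's difflib stands in for FuzzyWuzzy here, so the scoring function is a simplified partial match and not FuzzyWuzzy's own; the function names, the threshold of 85, and the sample strings are assumptions.

```python
from difflib import SequenceMatcher


def partial_score(a, b):
    """Score 0-100 for the best match of the shorter string inside the longer one."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    if not short:
        return 0
    best = 0.0
    # Slide a window the size of the shorter string across the longer one.
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(best * 100)


def inherit_access_points(unittitle, access_points, threshold=85):
    """Keep the collection-level access points that appear in a component's unittitle."""
    return [ap for ap in access_points
            if partial_score(ap, unittitle) >= threshold]


matched = inherit_access_points(
    "Letters from the Example Workers Union, 1929",
    ["Example Workers Union", "Doe, Jane"],
)
```

Only the access point that actually occurs in the title is inherited, which is the behavior the roughly 10% match figure on the slide refers to.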


DBPedia Override Examples




      Germany. |t Treaties, etc. |g Soviet Union, |d 1939 Aug. 23.
      http://dbpedia.org/page/Treaty_of_Non-Aggression_between_Germany_and_the_Soviet_Union

      Textile Workers' Strike, Gastonia, N.C., 1929.
      http://dbpedia.org/page/Loray_Mill_Strike
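
An override config could be as simple as a lookup table consulted before any automatic matching. The two entries below are the examples from this slide; the dictionary layout, the resolve function, and its automatic_match parameter are hypothetical.

```python
# Hand-curated overrides: heading as it appears in the data -> DBPedia page.
OVERRIDES = {
    "Germany. |t Treaties, etc. |g Soviet Union, |d 1939 Aug. 23.":
        "http://dbpedia.org/page/"
        "Treaty_of_Non-Aggression_between_Germany_and_the_Soviet_Union",
    "Textile Workers' Strike, Gastonia, N.C., 1929.":
        "http://dbpedia.org/page/Loray_Mill_Strike",
}


def resolve(heading, automatic_match=lambda heading: None):
    """Return the override for a heading if one exists, else try the automatic matcher."""
    key = " ".join(heading.split())  # collapse stray whitespace before lookup
    if key in OVERRIDES:
        return OVERRIDES[key]
    return automatic_match(key)
```

Headings like these share almost no text with the DBPedia label they should point to, so no string matcher would find them; a manual table covers those cases.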


Further Development Next Steps




    • EAC-CPF reconciliation, record creation
    • Possibly relationship to Hydra?
         Annotation Interface, DBP Overrides
    • SOLR Relevancy Ranking
    • SOLR-Marc Modifications
    • Update mechanism
    • Test with other Datasets
      (NYPL/NYU/METRO project)

Thanks!




                corey.harper@nyu.edu
                    212.998.2479
                       @chrpr




