Cni2012


#cni12f SNAC slides

  • In order of importance, Lenny the link head is last.
  • So, this is what happens when you let the programmer design the user interface. In phase two, Rachel Hu, CDL's user experience designer in our in-house assessment group, will be helping.
  • Hopefully this is where the user will focus
  • AZ browse
  • Featured items on home page (rather than 0-9). Note the tabs to limit by record type.
  • Also note the subject and occupation facets
  • Person
  • Advanced search hides, allows On other browsers, hierarchy represented graphically
  • Advanced search help
  • Autocomplete
  • Search results for Oppenheimer
  • View EAD. The "Report data issue" link has been added back. Will come back to the radial graph demo.
  • Sometimes the related resources will come from the EAD, but most of these are from VIAF. This whole section is hard to use when there are lots of related items.
  • This was the first iteration of the graph visualization
  • Transcript

    • 1. Opening Slide
    • 2. Building an Archival Identity Management Network: Transforming Archival Practice and Historical Research
        Daniel Pitti* and Brian Tingle**
        * Institute for Advanced Technology in the Humanities
        ** California Digital Library
        Thanks to Ray R. Larson of the University of California, Berkeley, School of Information for many of the slides here
        12/11/12 2012-11-04 - SLIDE
    • 3. Funding and People
        • Funding and Timeline
            – National Endowment for the Humanities: May 2010-April 2012
            – Andrew W. Mellon Foundation: May 2012-April 2014
        • People
            – Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia)
            – Adrian Turner and Brian Tingle (California Digital Library, University of California)
            – Ray Larson (School of Information, University of California, Berkeley)
    • 4. The Source Data
        • EAD-encoded finding aids (guides to archival records)
            – 150K
            – Primarily from U.S. sources, but also U.K. and France
        • Archival authority records (360K)
            – National Archives and Records Administration
            – State Archive of New York
            – Smithsonian Institution
            – British Library
            – National Archives (France) & BnF
        • WorldCat Archival Descriptions: 2M
    • 5. Library and Museum Authority Records
        • Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
        • Virtual International Authority File (16M+ cluster records)
            – Contributed from around the world by national libraries and others
    • 7. Methods and Processing
        • Extract EAC-CPF records from existing EAD-encoded archival descriptions
            – Extracting both creators and referenced CPF names
        • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
            – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
        • Create a prototype historical resource and access system
            – Historical data and social-professional networks
            – Links to archive, library, and museum resources (by and about)
    • 8. Example EAD Record (Hub)
        <EAD>
        <EADHEADER LANGENCODING = "ISO 639">
        <EADID> GB 0133 TAB </EADID>
        <FILEDESC>
        <TITLESTMT>
        <TITLEPROPER> Tabley Muniments </TITLEPROPER>
        </TITLESTMT>
        <PUBLICATIONSTMT>
        <PUBLISHER> John Rylands University Library of Manchester </PUBLISHER>
        <ADDRESS>
        <ADDRESSLINE> 150 Deansgate </ADDRESSLINE>
        <ADDRESSLINE> Manchester </ADDRESSLINE>
        <ADDRESSLINE> ... (Parts removed )…
        </FRONTMATTER>
        <ARCHDESC LEVEL = "FONDS" LANGMATERIAL = "English">
        <DID>
        <REPOSITORY> University of Manchester, John Rylands University Library of Manchester </REPOSITORY>
        <UNITID ENCODINGANALOG = "ISADG3.1.1." COUNTRYCODE = "GB" REPOSITORYCODE = "0133"> GB 0133 TAB </UNITID>
        <UNITTITLE LABEL = "Title" ENCODINGANALOG = "ISADG3.1.2."> Tabley Muniments </UNITTITLE>
        <UNITDATE LABEL = "Dates of Creation" ENCODINGANALOG = "ISADG3.1.3."> 19th century </UNITDATE>
        <PHYSDESC LABEL = "Extent" ENCODINGANALOG = "ISADG3.1.5."> <EXTENT> 1.24 cu.m </EXTENT> </PHYSDESC>
        <ORIGINATION LABEL = "Creator" ENCODINGANALOG = "ISADG3.2.1.">
        <FAMNAME SOURCE = "NCARULES"> Warren, family, of Tabley, Cheshire </FAMNAME>
        <PERSNAME SOURCE = "NCARULES"> Warren, John Byrne Leicester, 1835-1895, 3rd Baron de Tabley, poet </PERSNAME>
        </ORIGINATION>
        </DID>
    • 9. Example EAD Record (Hub) <BIOGHIST ENCODINGANALOG = "ISADG3.2.2."> <HEAD> Administrative/Biographical History </HEAD> <P> The poet John Byrne Leicester Warren, later 3rd and last Baron de Tabley, of Tabley near Knutsford, Cheshire, was born in 1835, the son of the 2nd Baron de Tabley (1811-1887), and his wife, Catherina. His mother was Italian, the daughter of the count de Soglio, and Warren spent much of his early childhood with her in Italy and Greece. He was educated at Eton and Christ Church, Oxford. At Oxford he published a volume of poetry. Originally he published under the pseudonyms George F. Preston (1859-1862) and William Lancaster (1863-1868), but latterly under his own name. </P> <P> His early verse included <TITLE> Praeterita </TITLE> (1863), <TITLE> Eclogues and Monodramas </TITLE> (1864), <TITLE> Studies in Verse </TITLE> (1865), <TITLE> Philoctetes </TITLE> (1866), and <TITLE> Orestes </TITLE> (1868). His early work was Tennysonian in style, but he was later to be influenced by both Browning and Swinburne. In 1873 he produced …. (some data removed)…
    • 10. Example EAD Record (Hub) <SCOPECONTENT ENCODINGANALOG = "ISADG3.3.1."> <HEAD> Scope and Content </HEAD> <P> The collection consists mainly of the personal papers of the 3rd Baron de Tabley. The papers reflect his interests in literature, politics, botany and numismatics and include correspondence with numerous prominent later Victorian figures. Attention should also be drawn to de Tabley’s extensive and important collection of armorial bookplates. </P> <P> Correspondents include Sir Mountstuart Grant Duff, Edmund Gosse, Lord Houghton, A.C.Benson, and Robert Bridges. There are volumes of Tabley’s essays and verse, as well as a considerable number of notebooks and loose manuscripts of verse and other writings. There are various bundles and boxes relating to &quot;Coins&quot;, &quot;Botany&quot;, &quot;Poetry&quot;, &quot;Literary&quot;, &quot;Financial&quot; and bookplates. </P> </SCOPECONTENT> <ADD> <OTHERFINDAID ENCODINGANALOG = "ISADG3.4.6."> <P> Preliminary survey list. </P> </OTHERFINDAID> <RELATEDMATERIAL ENCODINGANALOG = "ISADG3.5.3."> <P> There is correspondence with the 3rd Baron de Tabley among the Edward Freeman Papers, held at JRULM. The Library also has custody of the important Tabley Book Collection. </P> </RELATEDMATERIAL> <SEPARATEDMATERIAL> <P> The family and estate papers of the Leicester-Warren Family of Tabley are held by Cheshire Record Office. Some of these papers were originally in the custody of the John Rylands University Library of Manchester. </P> </SEPARATEDMATERIAL> </ADD>
    • 11. Example EAD Record (Hub)
        <CONTROLACCESS>
        <HEAD> Index terms </HEAD>
        <GEOGNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "a">Tabley Inferior</EMPH> <EMPH ALTRENDER = "a-">Cheshire SJ7378</EMPH> </GEOGNAME>
        <PERSNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "surname">Benson</EMPH> <EMPH ALTRENDER = "forename">Arthur Christopher</EMPH> <EMPH ALTRENDER = "dates">1862-1923</EMPH> </PERSNAME>
        <PERSNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "surname">Bridges</EMPH> <EMPH ALTRENDER = "forename">Robert Seymour</EMPH> <EMPH ALTRENDER = "dates">1844-1930</EMPH> </PERSNAME>
        <PERSNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "surname">Duff</EMPH> <EMPH ALTRENDER = "title">Sir</EMPH> <EMPH ALTRENDER = "forename">Mountstuart Elphinstone Grant</EMPH> <EMPH ALTRENDER = "dates">1829-1906</EMPH> <EMPH ALTRENDER = "epithet">Knight</EMPH> </PERSNAME>
        <PERSNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "surname">Gosse</EMPH> <EMPH ALTRENDER = "title">Sir</EMPH> <EMPH ALTRENDER = "forename">Edmund William</EMPH> <EMPH ALTRENDER = "dates">1849-1928</EMPH> <EMPH ALTRENDER = "epithet">Knight</EMPH> </PERSNAME>
        <PERSNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "surname">Milnes</EMPH> <EMPH ALTRENDER = "forename">Richard Monckton</EMPH> <EMPH ALTRENDER = "dates">1809-1885</EMPH> <EMPH ALTRENDER = "epithet">1st Baron Houghton</EMPH> </PERSNAME>
        <SUBJECT SOURCE = "LCSH"> <EMPH ALTRENDER = "a">Bookplates</EMPH> </SUBJECT>
        <SUBJECT SOURCE = "LCSH"> <EMPH ALTRENDER = "a">Botany</EMPH> </SUBJECT>
        <SUBJECT SOURCE = "LCSH"> <EMPH ALTRENDER = "a">Numismatics</EMPH> </SUBJECT>
        <SUBJECT SOURCE = "LCSH"> <EMPH ALTRENDER = "a-">Poetry</EMPH> <EMPH ALTRENDER = "a">Modern</EMPH> <EMPH ALTRENDER = "y">19th century</EMPH> </SUBJECT>
        </CONTROLACCESS>
        </ARCHDESC>
        </EAD>
    • 12. 2010-2012 Extraction Results
        • Source data: 30,000 finding aids
        • EAC-CPF records extracted
            – LoC: 43,702 from 1,159 finding aids
            – OAC: 91,811 from ~15,400
            – NWDA: 22,609 from 5,160
            – VH: 15,175 from 8,390
            – Total: 173,297
    • 13. Phase II preliminary results
        • Unmerged SIA Henry Correspondence
            – 32,988 names
        • Unmerged WorldCat MARC
            – 4,548,270 names
    • 14. Methods and Processing
        • Extract EAC-CPF records from existing EAD-encoded archival descriptions
            – Extracting both creators and referenced CPF names
        • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
            – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
        • Create a prototype historical resource and access system
            – Historical data and social-professional networks
            – Links to archive, library, and museum resources (by and about)
    • 15. The Problem
        • Proliferation of the forms of names
            – Different names for the same person
            – Different people with the same names
        • Examples
            – From Books in Print (semi-controlled but not consistent)
            – ERIC author index (not controlled)
    • 16. Goethe …etc…
    • 17. John Muir
    • 18. Library and Archive Authority Control
        • Library (or bibliographic) authority control is almost exclusively about the control of names
        • Archival identity control involves biographical-historical description of the CPF entity
            – Descriptions based on controlled vocabularies, for example, occupations, place of birth and death
            – But also biographical-historical description
                • Prose
                • Chronological list
        • Archival authority control provides context for understanding records, the context of their creation, the provenance
    • 19. Merging EAC-CPF Records
        [Flow diagram]
        • Authority sources, queried through Cheshire Search: LCNAF Repository, VIAF Repository, ULAN Repository
        • Steps: Connect exactly matching name; Connect records using authority information; Merge records
        • Stores: EAC Repository; Repository of connected EAC Records (MongoDB); Repository of merged EAC Records
    • 20. Merging EAC-CPF Records
        [Flow diagram]
        • Authority source, queried through Cheshire Search: VIAF Repository
        • Steps: Connect exactly matching name; Connect records using authority information; Merge records
        • Stores: EAC Repository; Repository of connected EAC Records (MongoDB); Repository of merged EAC Records
    • 21. Connect Exact Matches
        • The EAC-CPF records provide the names without having to parse texts, etc.
        • Allows us to use some simple methods like exact matching
            – Assume identical name entries mean the same person/corporate body/family
            – Enter the full names and record IDs into a database and flag IDs with the same names for merging
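The exact-match step on slide 21 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record IDs (`loc-1`, `oac-7`, `vh-3`) and the function name `flag_exact_matches` are hypothetical, an in-memory dictionary stands in for the database the slide mentions, and the name strings are taken from the examples on slide 23.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name entry.

    `records` is an iterable of (record_id, name_entry) pairs.
    Returns only the names shared by two or more records, i.e. the
    IDs that would be flagged for merging.
    """
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical record IDs; name entries copied from slide 23.
records = [
    ("loc-1", "Tabb, John Banister, 1845-1909."),
    ("oac-7", "Tabb, John Banister, 1845-1909."),
    ("vh-3", "Tabb, John B. (John Banister), 1845-1909."),
]
flagged = flag_exact_matches(records)
print(flagged)
```

Only the two byte-identical entries are flagged; the third form of the same name is missed, which is the limitation the following slides discuss.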
    • 22. But…
        • Exact merging assumes that archives are following LC cataloging practice in their EAD records
            – There are some problems with this assumption
    • 23. Some failures for merging…
        • Different abbreviations:
            – A. & G. Carisch & C.
            – A. & G. Carisch & Co.
        • And spacing issues:
            – A. C. Peters & Bro.
            – A. C. Peters & Brother.
            – A. C. Peters. (??)
            – A. C.Peters & Bro.
        • Completeness and alternate rules
            – Tabb, John B. (John Banister), 1845-1909.
            – Tabb, John Banister, 1845-1909.
        • Also differing transliterations for non-Latin scripts
    • 24. More…
        • Variant romanizations (and spacing):
            – M. P. Belaieff.
            – M. P. Belaïeff.
            – M. P. Bieliaev.
            – M.P. Belaïeff.
            – M.P.Belaïeff.
        • Initials vs. names:
            – Zabolotskii, N.A.
            – Zabolotskii, Nikolai Alekseevich, 1903-1958.
            – Zabolotskii.
    • 25. More…
        • Inverted order vs. uninverted
            – Taylor, Zachary, 1784-1850.
            – Zachary Taylor.
        • Various combinations:
            – Tchaikovsky, Peter I.
            – Tchaikovsky, Pëtr Il.
            – Tchaikovsky, Piotr Ilyich.
            – Tchaikovsky, Pyotr Il.
            – Tchaikovsky, Pyotr Ilyich.
    • 26. Merging EAC-CPF Records
        [Flow diagram]
        • Authority source, queried through Cheshire Search: VIAF Repository
        • Steps: Connect exactly matching name; Connect records using authority information; Merge records
        • Stores: EAC Repository; Repository of connected EAC Records (MongoDB); Repository of merged EAC Records
    • 27. Search Authority Files
        • For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching)
            – Search both the “authoritative” and “non-authoritative” forms
            – Consider any name matching a non-authoritative form to be a candidate match for the authoritative form
            – Flag EAC records that match the same authority record as potential matches
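A rough Python sketch of the flagging logic on slide 27 follows. It is not the Cheshire system: a plain dictionary lookup stands in for Cheshire's probabilistic and Boolean search, and the authority record, its ID (`auth-taylor`), the field names, and the EAC record IDs are all hypothetical. The name forms come from slide 25.

```python
from collections import defaultdict

# Hypothetical stand-in for one authority record; real VIAF clusters are richer.
AUTHORITY_RECORDS = [
    {
        "id": "auth-taylor",
        "authoritative": "Taylor, Zachary, 1784-1850.",
        "non_authoritative": ["Zachary Taylor."],
    },
]


def build_form_index(authority_records):
    """Map every name form (authoritative or not) to its authority record ID."""
    index = {}
    for rec in authority_records:
        for form in [rec["authoritative"], *rec["non_authoritative"]]:
            index[form] = rec["id"]
    return index


def flag_by_authority(eac_records, index):
    """Return authority ID -> EAC record IDs for authority records
    matched by two or more EAC records (the potential merges)."""
    hits = defaultdict(list)
    for eac_id, name in eac_records:
        authority_id = index.get(name)  # exact lookup; Cheshire would rank candidates
        if authority_id is not None:
            hits[authority_id].append(eac_id)
    return {aid: ids for aid, ids in hits.items() if len(ids) > 1}


eac_records = [
    ("eac-1", "Taylor, Zachary, 1784-1850."),
    ("eac-2", "Zachary Taylor."),
    ("eac-3", "Tchaikovsky, Peter I."),
]
potential = flag_by_authority(eac_records, build_form_index(AUTHORITY_RECORDS))
print(potential)
```

The inverted and uninverted forms, which exact matching keeps apart, are connected here because both resolve to the same authority record.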
    • 28. NGRAM or Shingle Matching
        Name: Einstein Albert
        Shingle sequence: ein, ins, nst, ste, tei, ein … , ert
        Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein
        Shingle Language Model for names: Krishna Janakiraman and Sean Marimpietri - Biograph
    • 29. [Diagram: shingle sequences for three variant names]
        Name 1: Einstein Albert
        Name 2: Ainshtain Albert
        Name 3: Albert Einstein
        Shingle Language Model for names: Krishna Janakiraman and Sean Marimpietri - Biograph
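To make the shingle idea on slides 28 and 29 concrete, here is a small Python sketch that breaks each name into character trigrams and compares the resulting sets. It is a simplification, not the shingle language model credited on the slides: it swaps in plain Jaccard set overlap for the probabilistic model, and the function names `shingles` and `jaccard` are made up for this example.

```python
def shingles(name, n=3):
    """Return the set of character n-grams for each whitespace-separated token.

    Tokens shorter than n contribute nothing. Shingling per token makes the
    result independent of word order.
    """
    grams = set()
    for token in name.lower().split():
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def jaccard(name_a, name_b, n=3):
    """Jaccard overlap of the two names' shingle sets (0.0 to 1.0)."""
    set_a, set_b = shingles(name_a, n), shingles(name_b, n)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


same_order_free = jaccard("Einstein Albert", "Albert Einstein")
variant_spelling = jaccard("Einstein Albert", "Ainshtain Albert")
print(same_order_free, variant_spelling)
```

With these inputs the reordered name scores 1.0, while the variant romanization shares only some trigrams and scores between 0 and 1, so a threshold on the score would be needed to decide which pairs count as candidate matches.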
    • 30. Merging EAC-CPF Records
        [Flow diagram]
        • Authority source, queried through Cheshire Search: VIAF Repository
        • Steps: Connect exactly matching name; Connect records using authority information; Merge records
        • Stores: EAC Repository; Repository of connected EAC Records (MongoDB); Repository of merged EAC Records
    • 31. Merge Flagged Records
        • For all of the exact matches and authority matches
            – Use the authoritative form of the name
            – Combine data from each match into a single EAC-CPF record
            – Retain all source record IDs and information
        • Finally, output the merged EAC-CPF records
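The merge step on slide 31 might look roughly like the following Python sketch. The dictionary layout (`id`, `name`, `alternative_names`, `source_ids`, `sources`) and the record IDs are assumptions made for illustration; actual EAC-CPF records are XML documents with a far richer structure.

```python
def merge_records(flagged_records, authoritative_name):
    """Combine flagged records into one merged record.

    The merged record takes the authoritative form as its name, keeps other
    forms as alternatives, and retains every source record ID and the
    source records themselves.
    """
    merged = {
        "name": authoritative_name,
        "alternative_names": [],
        "source_ids": [],
        "sources": [],
    }
    for record in flagged_records:
        name = record["name"]
        if name != authoritative_name and name not in merged["alternative_names"]:
            merged["alternative_names"].append(name)
        merged["source_ids"].append(record["id"])
        merged["sources"].append(record)
    return merged


# Hypothetical flagged records; name forms taken from slide 25.
flagged_records = [
    {"id": "eac-1", "name": "Taylor, Zachary, 1784-1850."},
    {"id": "eac-2", "name": "Zachary Taylor."},
]
merged = merge_records(flagged_records, "Taylor, Zachary, 1784-1850.")
print(merged["name"], merged["source_ids"])
```

Keeping the full source records alongside their IDs means a merge can later be inspected or undone, which fits the slide's point about retaining all source information.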
    • 32. Inputs to SNAC merging
        • LoC: 43,702 EAC-CPF records derived from 1,159 finding aids
        • OAC: 91,814 EAC-CPF records derived from ~15,400 finding aids
        • NWDA: 24,952 EAC-CPF records derived from 5,568 finding aids
        • VH: 15,175 EAC-CPF records
        • Total: 175,688 input EAC records for merging
        • Result: 128,781 “unique” names
    • 33. Another view of the numbers…
        • 95,624 Person names merged from 125,555 Person records
        • 31,287 Institutions merged from 47,189 Institution records
        • 1,980 Families merged from 2,899 Family records
    • 34. Merging Conclusions
        • There will not be a single merging method, but a staged set of approaches that will allow us to go from the simplest exact matches to (we hope) reliably identifying variant forms of a name when corroborated by contextual information (dates, etc.)
    • 35. Next
        • Developing an updateable database of merged EAC data (dumping Mongo for PostgreSQL)
            – Will permit incremental addition of new data and support editing and “forced” merges
        • Process the 2M WorldCat archival descriptions
        • Process the 150,000 finding aids
        • Convert several hundred thousand archival authority records into EAC-CPF and run them through the match/merge process
    • 36. Methods and Processing
        • Extract EAC-CPF records from existing EAD-encoded archival descriptions
            – Extracting both creators and referenced CPF names
        • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
            – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
        • Create a prototype historical resource and access system
            – Historical data and social-professional networks
            – Links to archive, library, and museum resources (by and about)
    • 37. Outline
        • User Persona
        • Search and Display
        • Network graph visualization
        • Linked Data / RDF
        • Future Plans
    • 38. Meet the target users
        Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)
        • Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks.
            – Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event.
            – He also TAs an undergraduate history class and sometimes has to help students find topics for papers.
        • Connie: Works at an institution that contributed records to the project.
            – Will be asking how this site would be useful to her institution's users.
            – Wants to understand how their records were used and what the added value is.
        • Quincy: Library school student working to QA record matching.
        • Adele: Person doing authority work during collection processing.
        • Lenny: Lenny likes linked data, and wants to be able to mine the links that have been established programmatically.
    • 39. Outline
        • User Persona
        • Search and Display
        • Network graph visualization
        • Linked Data / RDF
        • Future Plans
    • 43. Advanced limits match EAC sections
    • 44. Outline
        • User Persona
        • Search and Display
        • Network graph visualization
            • Context widget (needs new name)
        • Linked Data / RDF
        • Future Plans
    • 45. Tinkerpop graph database stack
        • Simple "property graph" model
        • "JDBC for graph databases" [SNAC is using Neo4J for the graphDB]
        • XPath-like "gremlin" for graph query
        • REST interfaces with "Rexster"
        • For me, this was 10 to 100 times easier than using RDF
    • 46. Outline
        • User Persona
        • Search and Display
        • Network graph visualization
        • Linked Data / RDF
        • Future Plans
    • 47. What is Linked Open Data?
        • W3C Semantic Web Technology Stack
        • Web of atomized data, not a web of documents
        • RDF; OWL ontologies; SPARQL queries; triple/quad/quint stores
        • httpRange14; content negotiation; CURIE
        • No restrictions on data use; free and easy license
        • Lenny wants it, but does Randy?
    • 48. What is Linked Open Data?
        • Getting to the good stuff
            • Blue underlined text
            • Pulling in data from multiple sources, in an intelligent way, into a "document"
        • Understand and discover relationships
        • Open access for research, education, private study and other fair use
    • 49. RDFa owl:sameAs
    • 50. HTML 5 microdata in chron list
    • 51. RDF of the social graph Thanks Ed Summers!
    • 52. Silvia Mazzini, regesta.exe srl: http://templates.xdams.net/IBC/ontology/eac-cpf.rdf
    • 53. &mode=xml2owl [experimental]
    • 54. My opinion on the use cases for W3C RDF tech
        • Good for publishing data
        • Good for controlled vocabularies
        • Data models?
        • Most people with open source RDF-store type systems do the real stuff with solr
        • Consider a graph database
    • 55. Outline
        • User Persona
        • Search and Display
        • Linked Data / RDF
        • Network graph visualization
        • Future Plans
    • 56. Future Plans
        • Conduct assessment activities involving members of target audiences to establish mental model of users for design work
        • Scale interface to millions of names
        • Visualizations useful and integrated (network and geospatial)
        • Stable URLs between batches for linked data
        • Social and personalization features (gateway to crowdsourcing)
        • Integration with local systems (such as with the context
    • 57. Links
        • Photo attribution http://www.flickr.com/photos/dsevilla
        • http://xtf.cdlib.org/
        • http://code.google.com/p/eac-graph-load/source/bro
        • http://tinkerpop.com/
        • http://thejit.org/
        • https://github.com/tingletech/snac-related-widget
