saa-2011-snac

  • Flexible description: series description; dispersed collections. Cooperative authority control: dispersed collections, but also cases where the creator of one collection is referenced in a collection created by someone else (co-referencing); economic and descriptive benefits. Integrated access to cultural heritage: context is essential for archival records, but the descriptions can also provide context for all types of resources. Archival authority records, like museum authority records, provide historical and biographical data that can enhance identification and understanding (biographical dictionary; administrative histories).
  • Remember that we will solicit public evaluation and suggestions on drafts of the public interface, starting in the fall.

    1. EAC-CPF and Social Networks
        • Society of American Archivists
        • Chicago
        • August 2011
       Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia
    2. SNAC Overview
        • Funding and Timeline
        • Project Team
        • Project Objectives and Rationale
        • Data Contributing Institutions
        • Archival Standards Employed
        • Methods, Processing, and Products
        • Year One Extraction Results
        • Basic Observations on Extraction
    3. Funding and Timeline
        • National Endowment for the Humanities
        • A Preservation and Access, Research and Development grant
        • Two-year project
        • May 2010-April 2012
    4. Project Team
        • Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia)
        • Adrian Turner and Brian Tingle (California Digital Library, University of California)
        • Ray Larson (School of Information, University of California, Berkeley)
    5. Project Objectives
        • Archival finding aids currently intermix description of records with description of the creators of records and persons evident in the records
        • Further the ongoing process of transforming archival description using advanced technologies
        • By facilitating the separation of the description of people from the description of records
        • Using EAC-CPF, an international archival authority control standard
        • Goal: make archival description more economical and effective, to improve access and understanding for users of archives, libraries, and museums
    6. Rationale for Separation
        • Authority control of forms of names
        • Flexible description
        • Cooperative authority control
        • Integrated access to cultural heritage
        • Biographical/historical resource
        • Social/historical context (social-professional networks)
    7. The Data
        • EAD-encoded finding aids
            • Library of Congress (1,159)
            • Online Archive of California (~15,400)
            • Northwest Digital Archive (5,160)
            • Virginia Heritage (8,390)
        • Authority records
            • Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
            • Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
            • Virtual International Authority File (5M+ personal names)
    8. Methods and Processing
        • Extract EAC-CPF records from existing EAD-encoded archival descriptions
            • Extracting both creators and referenced CPF names
        • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF); merge records for the same entity
            • Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
            • Key challenge: two or more people with the same name; two or more names for the same person
        • Create a prototype historical resource and access system
            • Historical data and social-professional networks
            • Links to archive, library, and museum resources (by and about)
    9. EAD Source Data
        • Encoded Archival Description
            • Intermixes description of creators of records and, at the discretion of the archivists, names associated with the content of the records
            • Detailed description of creators of records
        • Widely varying quality
            • In the number of names identified and encoded
            • In the formation of the names (direct or inverted, capitalization, punctuation, and so on)
            • In the categorization of names (personal, corporate, or family)
        • Many names given but not identified as such
        • Most important of these are in biographies/histories and in correspondence description
        • Extraction has focused on the "low-hanging fruit," that is, the names tagged as names
        • Attention is shifting to names not identified as such
    10. Archival Records
        • Records are the by-products of people living and working as individuals, in organized groups, in families
        • Records document people living and working
        • People exist in social-professional contexts, in relation to others
        • Records document these relations
        • All records created by the same entity are described together (a fonds or collection)
            • Creators documented in detail
            • Many of the people documented in the records are referenced in the description
        • Archival descriptions document interrelations among people and records (documents)
    11. Source: J. Robert Oppenheimer Papers (LoC)
        <origination>
          <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
        </origination>
        <controlaccess>
          <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
          <persname source="lcnaf" encodinganalog="600" role="subject">Bethe, Hans Albrecht, 1906- --Correspondence</persname>
          <!-- […] -->
          <persname source="lcnaf" encodinganalog="600" role="subject">Born, Max, 1882-1970 --Correspondence</persname>
          <persname source="lcnaf" encodinganalog="600" role="subject">Boyd, Julian P. (Julian Parks), 1903- --Correspondence</persname>
          <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
          <persname source="lcnaf" encodinganalog="600" role="subject">Casals, Pablo, 1876-1973 --Correspondence</persname>
          <!-- […] -->
          <corpname source="lcnaf" encodinganalog="610" role="subject">Institute for Advanced Study (Princeton, N.J.)</corpname>
          <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
          <!-- […] -->
        </controlaccess>
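Pulling the tagged names (the "low-hanging fruit") out of an excerpt like the one above can be sketched as follows. This is a minimal illustration using Python's standard `xml.etree.ElementTree`, not the project's actual extraction code. The function name `extract_names`, the `SAMPLE_EAD` string, and the rule of cutting the name at " --" are assumptions made for the example, and real EAD files may carry an XML namespace that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Abbreviated, namespace-free sample modelled on the slide (illustrative only).
SAMPLE_EAD = """<ead>
  <origination>
    <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
  </origination>
  <controlaccess>
    <persname source="lcnaf" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
    <persname source="lcnaf" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
    <corpname source="lcnaf" role="subject">Los Alamos Scientific Laboratory</corpname>
  </controlaccess>
</ead>"""

# EAD name elements mapped to EAC-CPF entity types.
TAG_TO_ENTITY_TYPE = {
    "persname": "person",
    "corpname": "corporateBody",
    "famname": "family",
}


def extract_names(ead_xml):
    """Return {(entity_type, name): set_of_roles} for every tagged name."""
    root = ET.fromstring(ead_xml)
    found = {}
    for elem in root.iter():
        entity_type = TAG_TO_ENTITY_TYPE.get(elem.tag)
        if entity_type is None:
            continue
        # Collapse whitespace, then drop subdivisions such as "--Correspondence"
        # (an assumed convention based on the sample above).
        text = " ".join("".join(elem.itertext()).split())
        name = text.split(" --")[0].strip()
        roles = found.setdefault((entity_type, name), set())
        role = elem.get("role")
        if role:
            roles.add(role)
    return found


if __name__ == "__main__":
    for (entity_type, name), roles in extract_names(SAMPLE_EAD).items():
        print(entity_type, "|", name, "|", sorted(roles))
```

Keying on the pair (entity type, name) means the creator, who appears in both `<origination>` and `<controlaccess>`, yields a single entry.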
    12. Source: Leonard Bernstein Collection (LoC)
        <c02>
          <did>
            <container type="box">1</container>
            <unittitle>Aaltonen, Erkki <unitdate era="ce" calendar="gregorian">1981</unitdate></unittitle>
            <physdesc><extent>1</extent></physdesc>
          </did>
        </c02>
        <c02>
          <did>
            <unittitle>Abbado, Claudio <unitdate era="ce" calendar="gregorian">1963-90</unitdate></unittitle>
            <physdesc><extent>5</extent></physdesc>
          </did>
        </c02>
        […]
    13. <bioghist>
          <head>Biographical Sketch</head>
          <p>José Marcos Mugarrieta, prior to his term as Mexican consul in San Francisco 1857-1863, served in the Mexican army from 1837. He saw action in numerous battles and campaigns – Jamaica, under General Canalizo in 1841; Campeche, 1842-1843; Merida, 1843; Veracruz, 1845; Mexico City, 1846; Angostura and Cerro-gordo, 1847; Guanajuato, 1848, and Sierra-Gorda under Bustamante, 1848-1849; and Matamoros, 1849-1850. […]</p>
          <p>In April 1857 Mugarrieta received an appointment from the Comonfort government for the consulship in San Francisco. He did not actually begin his new duties until September 1, 1859, due to illness and to the political situation in Mexico. […]</p>
        </bioghist>
    14. <bioghist>
          <head>Chronology</head>
          <chronlist>
            <chronitem>
              <date>1900</date>
              <event>Born on Jan. 20 in Hastings, Minnesota.</event>
            </chronitem>
            <chronitem>
              <date>1922</date>
              <event>Received baccalaureate from Princeton University, major in philosophy.</event>
            </chronitem>
            […]
            <chronitem>
              <date>1965</date>
              <event>Died on April 4.</event>
            </chronitem>
          </chronlist>
        </bioghist>
    15. EAC-CPF
        • Encoded Archival Context: Corporate Bodies, Persons, and Families
        • An international communication standard for archival authority control
        • Based on the International Council on Archives' International Standard Archival Authority Record for Corporate Bodies, Persons and Families (ISAAR(CPF))
        • SAA Standards Committee, Technical Subcommittee on Encoded Archival Context
        • Co-chairs
            • Katherine Wisser, Simmons College
            • Anila Angjeli, Bibliothèque nationale de France
    16. Library and Archive Authority Control
        • Library (or bibliographic) authority control is almost exclusively about the control of names
        • Archival authority control involves biographical-historical description of the CPF entity
            • Descriptions based on controlled vocabularies or values, for example, occupations, place of birth and death
            • But also biographical-historical description
                • Prose
                • Chronological list
        • Archival authority control provides context for understanding records, the context of their creation, the provenance
    17. <identity>
          <entityType>person</entityType>
          <nameEntry scriptCode="Latn" xml:lang="eng">
            <part>Oppenheimer, J. Robert, 1904-1967.</part>
            <authorizedForm>AACR2</authorizedForm>
          </nameEntry>
          <nameEntry localType="VIAF:MainHeading">
            <part>Oppenheimer, J. Robert (Julius Robert), 1904-1967</part>
            <alternativeForm>VIAF</alternativeForm>
          </nameEntry>
          <nameEntry localType="VIAF:MainHeading">
            <part>Oppenheimer, Julius Robert, 1904-1967</part>
            <alternativeForm>VIAF</alternativeForm>
          </nameEntry>
          <nameEntry localType="VIAF:x400">
            <part>Oppenheimer, Robert</part>
            <alternativeForm>VIAF</alternativeForm>
          </nameEntry>
          <nameEntry localType="VIAF:x400">
            <part>Ou-pẽn-hai-mo, 1904-1967</part>
            <alternativeForm>VIAF</alternativeForm>
          </nameEntry>
        </identity>
    18. <existDates>
          <dateRange>
            <fromDate standardDate="1904-04-22">1904, Apr. 22</fromDate>
            <toDate standardDate="1967-02-18">1967, Feb. 18</toDate>
          </dateRange>
        </existDates>
        <!-- ... -->
        <localDescription localType="subject">
          <term>Science--Societies, etc.</term>
        </localDescription>
        <localDescription localType="VIAF:nationality">
          <placeEntry countryCode="US"/>
        </localDescription>
        <localDescription localType="VIAF:gender">
          <term>Male</term>
        </localDescription>
        <languageUsed>
          <language languageCode="eng"/>
        </languageUsed>
        <occupation>
          <term>Physicists.</term>
        </occupation>
        <!-- ... -->
    19. <chronList>
          <chronItem>
            <date>1904, Apr. 22</date>
            <placeEntry>New York, N.Y.</placeEntry>
            <event>Born, New York, N.Y.</event>
          </chronItem>
          <!-- ... -->
          <chronItem>
            <date>1943-1945</date>
            <placeEntry>Los Alamos, N. Mex.</placeEntry>
            <event>Director, Los Alamos Scientific Laboratory, Los Alamos, N. Mex.</event>
          </chronItem>
          <!-- ... -->
          <chronItem>
            <date>1954</date>
            <event>(1) Denied security clearance […] (2) Published Science and the Common Understanding […]</event>
          </chronItem>
          <!-- ... -->
          <chronItem>
            <date>1967, Feb. 18</date>
            <placeEntry>Princeton, N.J.</placeEntry>
            <event>Died, Princeton, N.J.</event>
          </chronItem>
        </chronList>
    20. <cpfRelation xmlns:xlink="http://www.w3.org/1999/xlink"
                     xlink:type="simple"
                     xlink:role="http://RDVocab.info/uri/schema/FRBRentitiesRDA/Person"
                     xlink:arcrole="correspondedWith">
          <relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
          <descriptiveNote>
            <p>recordId: DLC.ms998007.r007</p>
          </descriptiveNote>
        </cpfRelation>
    21. <resourceRelation xmlns:xlink="http://www.w3.org/1999/xlink"
                          xlink:arcrole="creatorOf"
                          xlink:role="archivalRecords"
                          xlink:type="simple"
                          xlink:href="http://hdl.loc.gov/loc.mss/eadmss.ms998007">
          <relationEntry>J. Robert Oppenheimer Papers, 1799-1980 (bulk 1947-1967)</relationEntry>
          <objectXMLWrap>
            <did xmlns="urn:isbn:1-931666-22-9">
              <unittitle>Papers <unitdate normal="1799/1980" era="ce" calendar="gregorian">1799-1980</unitdate> <unitdate label="Bulk Dates" type="bulk" normal="1947/1967" era="ce" calendar="gregorian">(bulk 1947-1967)</unitdate></unittitle>
              <unitid countrycode="US" repositorycode="US-DLC">MSS35188</unitid>
              <origination label="Creator">
                <persname>Oppenheimer, J. Robert, 1904-1967</persname>
              </origination>
              <!-- ... -->
              <repository><corpname>Manuscript Division. Library of Congress</corpname></repository>
              <abstract>Physicist and director of the Institute for Advanced Study, Princeton, New Jersey. [...] Topics include theoretical physics, development of the atomic bomb, the relationship between government and science, nuclear energy, security, and national loyalty.</abstract>
            </did>
          </objectXMLWrap>
        </resourceRelation>
    22. Year One Results: Extraction
        • EAC-CPF records extracted
            • LoC: 43,702 from 1,159 finding aids
            • OAC: 91,811 from ~15,400
            • NWDA: 22,609 from 5,160
            • VH: 15,175 from 8,390
            • Total: 173,297
            • Note: a more recent extraction produced 196,218, but we have not had time to analyze the results
    23. Early Observations: Extraction
        • Depth of analysis and quality of description of CPF entities vary widely in EAD-encoded finding aids
            • LoC: many names under authority control
            • OAC and NWDA: fewer names, and control varies
        • To be fair, the finding aids were created without SNAC processing in mind!
    24. Next on Extraction
        • Refine extraction processing, incorporating some NLP-like processing, for example
            • Verifying type of name: C or P or F
            • Massaging poorly formed names into better formed names
            • Identifying names in strings that are names-plus (but name not identified as such)
            • Providing context information to enhance matching, for example, date or dates of correspondence, or occupation of creator of records
    25. Beyond the Project
        • Building a National Archival Authorities Infrastructure
            • IMLS-funded two-year project, October 2011-September 2013
            • EAC-CPF SAA workshops: 140 scholarships
            • National Archival Authorities Cooperative planning
        • SNAC II: a proposal to expand SNAC
            • A lot more data
            • NARA, SI, MARC WorldCat records, a lot more finding aids
    26. For More Information
        • http://socialarchive.iath.virginia.edu/ (project website)
        • http://socialarchive.iath.virginia.edu/xtf/search (public prototype)
    27. Social Networks and Archival Context Project: Matching and Merging EAC-CPF Records
        • Ray R. Larson
        • Krishna Janakiraman
        • University of California, Berkeley
        • School of Information
       SAA 2011, Chicago, 2011-08-27
       Thanks to Daniel V. Pitti of the Institute for Advanced Technology in the Humanities, University of Virginia, for many of the slides here
    28. SNAC Project
        • The outlines of the project have been discussed by Daniel Pitti previously
        • The primary focus of the Berkeley group for the project is on combining data resources from multiple archives and other information sources
        • In this talk I will focus on our current methods used in the prototype (to be described by Brian Tingle later)
    29. Data Contributing Institutions
        • EAD-encoded finding aids
            • Library of Congress (1,159)
            • Online Archive of California (15,400+)
            • Northwest Digital Archive (5,563+)
            • Virginia Heritage (8,390+)
        • Authority records
            • Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
            • Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
            • Virtual International Authority File (intersection with NACO/LCNAF, 5M personal names)
        • Other biographical sources (e.g., DBPedia, IMDB)
    30. Methods and Processing
        • Extract EAC-CPF records from existing EAD-encoded archival descriptions
            • Extracting both creators and referenced CPF names
        • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
            • Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
        • Create a prototype historical resource and access system
            • Historical data and social-professional networks
            • Links to archive, library, and museum resources (by and about)
    31. Merging EAC-CPF Records
        [Diagram labels: LCNAF Repository; ULAN Repository; Connect exactly matching records; Connect records using name authority information; Merge; Cheshire Search]
    32. Authority Control
        • Identifying creator entities and referenced entities (correspondents, etc.)
        • Recording name or names used by and for them
        • Rule-based heading or entry formation and control
    33. Controlled Vocabularies
        • Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information
        • That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
    34. The Problem
        • Proliferation of the forms of names
            • Different names for the same person
            • Different people with the same names
        • Examples
            • From Books in Print (semi-controlled but not consistent)
            • ERIC author index (not controlled)
    35. Goethe … etc. …
    36. John Muir
    37. Pauline Cochrane née Atherton
    38. Pauline Cochrane née Atherton
    39. Name Authority Files: different names for the same person
        ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242 KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80 RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-21-91
        Other Versions: earlier
        040    DLC$cDLC$dDLC$dOCoLC
        053    PR6005.R517
        100 10 Creasey, John
        400 10 Cooke, M. E.
        400 10 Cooke, Margaret,$d1908-1973
        400 10 Cooper, Henry St. John,$d1908-1973
        400 00 Credo,$d1908-1973
        400 10 Fecamps, Elise
        400 10 Gill, Patrick,$d1908-1973
        400 10 Hope, Brian,$d1908-1973
        400 10 Hughes, Colin,$d1908-1973
        400 10 Marsden, James
        400 10 Matheson, Rodney
        400 10 Ranger, Ken
        400 20 St. John, Henry,$d1908-1973
        400 10 Wilde, Jimmy
        500 10 $wnnnc$aAshe, Gordon,$d1908-1973
    40. Merging EAC-CPF Records
        [Diagram labels: Connect exactly matching records; Connect records using name authority information; Merge; Cheshire Search]
    41. Connect Exact Matches
        • The EAC-CPF records provide the names without having to parse texts, etc.
        • This allows us to use some simple methods like exact matching
            • Assume identical name entries mean the same person/corporate body/family
            • Enter the full names and record IDs into a database and flag IDs with the same names for merging
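The exact-matching step described above can be sketched in a few lines. This is an illustrative sketch only: an in-memory dictionary stands in for the database the slide mentions, and the function name `flag_exact_matches` and the record IDs are hypothetical.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name string.

    `records` is an iterable of (record_id, name) pairs. Only names shared
    by two or more records are returned, since those are the merge candidates.
    """
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical record IDs, for illustration only.
    sample = [
        ("rec-1", "Bush, Vannevar, 1890-1974."),
        ("rec-2", "Oppenheimer, J. Robert, 1904-1967."),
        ("rec-3", "Bush, Vannevar, 1890-1974."),
    ]
    print(flag_exact_matches(sample))
```

Because the comparison is on the raw string, any difference in spacing, punctuation, or dates keeps two records apart.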
    42. Merging EAC-CPF Records
        [Diagram labels: Connect exactly matching records; Connect records using name authority information; Merge; Cheshire Search]
    43. Search Authority Files
        • For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching)
            • Search both the "authoritative" and "non-authoritative" forms
            • Consider any name matching a non-authoritative form to be a candidate match for the authoritative form
            • Flag EAC records that match the same authority record as potential matches
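The idea of linking records through authoritative and non-authoritative forms can be illustrated with a toy lookup. This sketch replaces the Cheshire search of VIAF with a plain dictionary, so it shows only the flagging logic, not probabilistic retrieval. The authority ID `auth-1`, the record IDs, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy authority file: one authorized form plus variant forms (illustrative).
AUTHORITY = {
    "auth-1": {
        "authorized": "Oppenheimer, J. Robert, 1904-1967",
        "variants": [
            "Oppenheimer, Julius Robert, 1904-1967",
            "Oppenheimer, Robert",
        ],
    },
}


def build_index(authority):
    """Map every authorized and variant form to its authority record ID."""
    index = {}
    for auth_id, record in authority.items():
        for form in [record["authorized"], *record["variants"]]:
            index[form] = auth_id
    return index


def flag_by_authority(eac_names, index):
    """Return {authority_id: [record_ids]} where two or more EAC records
    resolve to the same authority record."""
    hits = defaultdict(list)
    for record_id, name in eac_names:
        auth_id = index.get(name)
        if auth_id is not None:
            hits[auth_id].append(record_id)
    return {auth_id: ids for auth_id, ids in hits.items() if len(ids) > 1}


if __name__ == "__main__":
    index = build_index(AUTHORITY)
    names = [
        ("rec-1", "Oppenheimer, J. Robert, 1904-1967"),
        ("rec-2", "Oppenheimer, Robert"),
        ("rec-3", "Bush, Vannevar, 1890-1974"),
    ]
    print(flag_by_authority(names, index))
```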
    44. Merging EAC-CPF Records
        [Diagram labels: Connect exactly matching records; Connect records using name authority information; Merge; Cheshire Search]
    45. Merge Flagged Records
        • For all of the exact matches and authority matches
            • Use the authoritative form of the name
            • Combine data from each match into a single EAC-CPF record
            • Retain all source record IDs and information
        • Finally, output the merged EAC-CPF records
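The merge step above can be sketched with plain dictionaries standing in for EAC-CPF XML records. The field names (`id`, `name`, `relations`) and the function `merge_records` are assumptions made for this example and do not reflect the project's actual record layout.

```python
def merge_records(records, authorized_name):
    """Combine flagged records into one, keeping every source record ID.

    Each input record is a dict with "id", "name", and an optional
    "relations" list. Names that differ from the authorized form are kept
    as alternative names rather than discarded.
    """
    merged = {
        "name": authorized_name,
        "alternative_names": [],
        "source_ids": [],
        "relations": [],
    }
    for record in records:
        merged["source_ids"].append(record["id"])
        name = record["name"]
        if name != authorized_name and name not in merged["alternative_names"]:
            merged["alternative_names"].append(name)
        for relation in record.get("relations", []):
            if relation not in merged["relations"]:
                merged["relations"].append(relation)
    return merged


if __name__ == "__main__":
    flagged = [
        {"id": "rec-1", "name": "Oppenheimer, J. Robert, 1904-1967",
         "relations": ["Bush, Vannevar, 1890-1974"]},
        {"id": "rec-2", "name": "Oppenheimer, Robert",
         "relations": ["Bush, Vannevar, 1890-1974", "Born, Max, 1882-1970"]},
    ]
    print(merge_records(flagged, "Oppenheimer, J. Robert, 1904-1967"))
```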
    46. Inputs to SNAC merging
        • LoC: 43,702 EAC-CPF records derived from 1,159 finding aids
        • OAC: 91,811 EAC-CPF records derived from ~15,400 finding aids
        • NWDA: 22,609 EAC-CPF records derived from 5,568 finding aids
        • Result: 123,920 "unique" names
    47. Another view of the numbers…
        • 93,033 person names merged from 114,639 person records
        • 30,161 institutions merged from 41,177 institution records
        • 1,669 families merged from 2,263 family records
    48. But…
        • Exact merging assumes that archives are following LC cataloging practice in their EAD records
            • There are some problems with this assumption
    49. Some failures for merging…
        • Different abbreviations:
            • A. & G. Carisch & C.
            • A. & G. Carisch & Co.
        • And spacing issues:
            • A. C. Peters & Bro.
            • A. C. Peters & Brother.
            • A. C. Peters. (??)
            • A. C.Peters & Bro.
        • Completeness and alternate rules:
            • Tabb, John B. (John Banister), 1845-1909.
            • Tabb, John Banister, 1845-1909.
    50. More…
        • Variant romanizations (and spacing):
            • M. P. Belaieff.
            • M. P. Belaïeff.
            • M. P. Bieliaev.
            • M.P. Belaïeff.
            • M.P.Belaïeff.
        • Initials vs. names:
            • Zabolotskii, N.A.
            • Zabolotskii, Nikolai Alekseevich, 1903-1958.
            • Zabolotskii.
    51. More…
        • Inverted order vs. uninverted:
            • Taylor, Zachary, 1784-1850.
            • Zachary Taylor.
        • Various combinations:
            • Tchaikovsky, Peter I.
            • Tchaikovsky, Pëtr Il.
            • Tchaikovsky, Piotr Ilyich.
            • Tchaikovsky, Pyotr Il.
            • Tchaikovsky, Pyotr Ilyich.
    52. Another kind of failure
        • Entry for "Zaphiropoulos": no dates, no first name
            • The entry from VIAF was for "Zaphiropoulos, Lela, 1941-"
            • But the name in EAD came as an attribution for photos:
                    Box 113
                    Lot PP13 Zaphiropoulos. [Bas-relief at Troy], 1872.
                    Physical Description: 2 photographs
                    Scope and Content Note
                    Photographs taken for Schliemann.
        • Not sure that the Zaphiropoulos indicated is a person, and definitely not one born in 1941.
    53. Addressing the failures
        • First we need to know where things are not working, and why
            • We are planning to do a random sample and detailed evaluation of the database to help identify the problems
        • Many of the problems we have seen already appear to be solvable using:
            • Additional contextual clues from the EAD records
            • More sophisticated matching for phonetic variants
                • Such as n-grams or phonetic schemes like phonex
            • Additional normalization of names before merging
                • For name order, etc.
            • Use of advanced matching methods
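One way to picture "additional normalization of names before merging" is a match key that ignores case, diacritics, punctuation, spacing, and name order. The sketch below is an assumption-laden illustration, not the project's method: the function name `match_key` is hypothetical, and dropping date tokens and sorting the remaining tokens is a deliberately crude choice that would need dates compared separately in practice.

```python
import re
import unicodedata


def match_key(name):
    """Build a crude, order-insensitive key for comparing name strings."""
    # Decompose accented characters and drop the combining marks (ï -> i).
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercase, then turn every run of punctuation or spaces into one space.
    tokens = re.sub(r"[^a-z0-9]+", " ", without_marks.lower()).split()
    # Drop tokens containing digits (life dates); compare those separately.
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    # Sorting makes "Taylor, Zachary" and "Zachary Taylor" produce one key.
    return " ".join(sorted(tokens))


if __name__ == "__main__":
    pairs = [
        ("Taylor, Zachary, 1784-1850.", "Zachary Taylor."),
        ("M.P.Belaïeff.", "M. P. Belaieff."),
        ("A. C.Peters & Bro.", "A. C. Peters & Bro."),
    ]
    for left, right in pairs:
        print(left, "|", right, "|", match_key(left) == match_key(right))
```

A key like this does not help with abbreviation differences ("Bro." vs. "Brother") or variant romanizations ("Belaieff" vs. "Bieliaev"), which is where n-gram or phonetic matching would come in.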
    54. Testing new merging methods
        • Work done in conjunction with SNAC for an I School Master's project called Biograph
            • Krishna Janakiraman and Sean Marimpietri
        • Using SNAC and merging with FreeBase and IMDB
    55. Krishna Janakiraman and Sean Marimpietri, Biograph
        • Einstein, Albert, 1879-1955.
        • Einstein, Albert.
        • Ainshutain, A. 1879-1955
        • Aiyinsitan 1879-1955
        • Einstein, A.
        • Albert Einstein
        • Albert Einstein
    56. Our approach
        • Learn binary classifiers over varying names and existence dates
        • Perturb existing information to generate additional samples within specific error levels
    57. [Diagram labels: Features, names: shingle language model; Features: birth and death dates; Features, names: string distance metrics; TRAIN: learn decision tree classifiers; PREDICT: link records]
    58. Shingle Language Model for names
        • Name: Einstein Albert
        • Shingle sequence: ein, ins, nst, ste, tei, ein, …, ert
        • Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein
    59. Shingle Language Model for names
        • Name 1: Einstein Albert
        • Name 2: Ainshtain Albert
        • Name 3: Albert Einstein
        [Diagram of the shingle sequences for the three names, e.g. "ein ins nst ste tei ein" for Einstein and "ain ins nsh sht hta tai ain" for Ainshtain]
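The three-character shingles shown on the two slides above can be generated as below. The slides describe a shingle language model (probabilities over shingle sequences); this sketch swaps in a much simpler technique, Jaccard overlap of shingle sets, purely to show how shingles make name variants comparable. The function names are hypothetical.

```python
def shingles(word, k=3):
    """Return the overlapping k-character shingles of one word, in order."""
    word = word.lower()
    return [word[i:i + k] for i in range(len(word) - k + 1)]


def name_shingles(name, k=3):
    """Return the set of shingles over all whitespace-separated words."""
    result = set()
    for word in name.split():
        result.update(shingles(word, k))
    return result


def jaccard(name_a, name_b, k=3):
    """Jaccard similarity of two names' shingle sets (not a language model)."""
    set_a, set_b = name_shingles(name_a, k), name_shingles(name_b, k)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(shingles("einstein"))
    print(jaccard("Einstein Albert", "Albert Einstein"))
    print(jaccard("Einstein Albert", "Ainshtain Albert"))
```

Because shingles are taken per word and collected into a set, word order does not affect the score, while a variant romanization lowers it without driving it to zero.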
    60. Example Decision Tree for Von Neumann
        [Figure; node labels: Date, String Distance]
    61. Results (as printed on the slide)
        • Albert Einstein: TPR 75.7%, FPR 7%
        • George W Bush: TPR 86.6%, FPR 13%
        • Von Neumann: TPR 75.7%, FPR 7%
        • Corpus Average: TPR 72.7%, FPR 17%
        • Confusion-matrix counts, in slide order:
            • TP:78 FP:11 FN:25 TN:145
            • TP:39 FP:9 FN:6 TN:60
            • TP:182 FP:14 FN:27 TN:301
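The true-positive and false-positive rates follow from the confusion-matrix counts by TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below recomputes them from the three sets of counts on the slide. The first set reproduces the Albert Einstein figures (75.7%, 7%) and the second is close to the George W Bush figures (about 86.7%, 13%). The third set works out to roughly 87.1% and 4.4%, which matches none of the printed pairs, so which row it belongs to is left open here.

```python
def rates(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate)."""
    tpr = tp / (tp + fn)  # share of true links that were found
    fpr = fp / (fp + tn)  # share of non-links wrongly linked
    return tpr, fpr


# Counts copied from the slide, in slide order.
MATRICES = [
    {"tp": 78, "fp": 11, "fn": 25, "tn": 145},
    {"tp": 39, "fp": 9, "fn": 6, "tn": 60},
    {"tp": 182, "fp": 14, "fn": 27, "tn": 301},
]

if __name__ == "__main__":
    for counts in MATRICES:
        tpr, fpr = rates(**counts)
        print(f"{counts}: TPR {tpr:.1%}, FPR {fpr:.1%}")
```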
    62. 62. How many did we link? <ul><li>15,300 records, thresh = 0.85 </li></ul><ul><li>1,100 records, thresh = 0.9 </li></ul>
    63. 63. Conclusions <ul><li>There will not be a single merging method but a staged set of approaches that will allow us to go from the simplest exact matches to (we hope) reliably identifying variant forms of a name when corroborated by contextual information (dates, etc.) </li></ul><ul><li>Once records are merged, they are passed along to Brian for search and display… </li></ul>
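One way to picture a staged approach is the hedged Python sketch below: an exact match on normalized names comes first, and a fuzzy name match is accepted only when the existence dates agree. The record layout, the `normalize` and `should_merge` names, the use of `difflib.SequenceMatcher`, and the 0.8 threshold are all assumptions for illustration, not the project's actual merging code.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and sort the tokens."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def should_merge(rec_a, rec_b, threshold=0.8):
    """Decide whether two records (dicts with 'name', optional 'dates') match."""
    dates_a, dates_b = rec_a.get("dates"), rec_b.get("dates")
    # Conflicting dates rule out a merge regardless of the names.
    if dates_a and dates_b and dates_a != dates_b:
        return False
    name_a, name_b = normalize(rec_a["name"]), normalize(rec_b["name"])
    # Stage 1: simplest case, identical names after normalization.
    if name_a == name_b:
        return True
    # Stage 2: variant name forms, accepted only when dates corroborate.
    if dates_a and dates_a == dates_b:
        return SequenceMatcher(None, name_a, name_b).ratio() >= threshold
    return False


record = {"name": "Einstein, Albert", "dates": "1879-1955"}
same_person = should_merge(record, {"name": "Albert Einstein"})
```

Later stages, such as the classifier-based linking described earlier, would slot in after stage 2 for the pairs that simple string similarity cannot settle.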
    64. 64. Discovering Historic Social Networks <ul><li>Prototype Historical Resource Demo </li></ul><ul><li>Brian Tingle, California Digital Library </li></ul><ul><li>Society of American Archivists 2011 Annual Meeting </li></ul><ul><li>August 27, 2011 </li></ul><ul><li>Chicago </li></ul>
    65. 65. Meet the target users <ul><li>Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers. </li></ul><ul><li>Connie: Works at an institution that contributed records to the project. Will be asking how this site would be useful to the institution’s users. Wants to understand how the institution’s records were used and what the added value is. </li></ul><ul><li>Quincy: Library school student working to QA record matching. </li></ul><ul><li>Adele: Person doing authority work during collection processing. </li></ul><ul><li>Lenny: Likes linked data and wants to be able to mine the established links programmatically. </li></ul>Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)
    66. 66. Home Page
    67. 67. Facet tabs
    68. 68. Facet tabs
    69. 69. Advanced Search
    70. 70. Advanced limits match EAC sections
    71. 71. XTF result
    72. 72. XTF query in the crossQueryResult
    73. 73. doing a search
    74. 74. spellcheck
    75. 75. search results
    76. 76. search results
    77. 77. EAC record view Identity
    78. 78. EAC record view alternative forms of name
    79. 79. EAC record view Biographical History
    80. 80. HTML 5 microdata in chron list
    81. 81. EAC record view Related Entries
    82. 82. EAC record view Related Entries
    83. 83. <ul><li>RDFa owl:sameAs </li></ul>
    84. 84. EAC record view View EAC XML
    85. 85. EAC record view Graph Demo
    86. 87. TinkerPop Graph Stack http://www.tinkerpop.com/ <ul><li>Property Graph Model </li></ul><ul><li>GraphML </li></ul><ul><li>RDF Sail support </li></ul>
    87. 88. https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph vertex edge
    88. 89. Graph Schema <ul><li>vertex: _id: auto-assigned by Neo4j; _type: vertex; identity: the name of the entity (string) [indexed]; urls: \n-separated list of source EAD files; entityType: 'corporateBody', 'family', or 'person' </li></ul><ul><li>edge: _id: auto-assigned by Neo4j; _type: edge; _label: 'correspondedWith' or 'associatedWith'; _inV: incoming vertex _id (from); _outV: outgoing vertex _id (to); from_name: from identity (string), denormalized; to_name: to identity (string), denormalized </li></ul>
    89. 90. indices/name-idx is an index on “identity”; used to look up the Neo4j record's internal id
    90. 91. vertices/103994/bothE <ul><li>“bothE” shows both in and out edges </li></ul><ul><li>redundant (denormalized) data on each edge saves repeated lookups </li></ul>
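The schema and lookups on the last few slides can be mimicked with plain Python dictionaries, as a sketch of the property graph idea rather than of the Neo4j or TinkerPop APIs. The sample entities, the `from_id`/`to_id` keys, and the helper names are hypothetical; only `identity`, `entityType`, `_type`, `_label`, `from_name`, and `to_name` come from the slides.

```python
# Vertices keyed by a made-up internal id (the real ids are assigned by Neo4j).
vertices = {
    1: {"_type": "vertex", "identity": "Person A", "entityType": "person"},
    2: {"_type": "vertex", "identity": "Family B", "entityType": "family"},
    3: {"_type": "vertex", "identity": "Organization C", "entityType": "corporateBody"},
}

# Each edge repeats the endpoint names (denormalized) to avoid extra lookups.
edges = [
    {"_type": "edge", "_label": "correspondedWith", "from_id": 1, "to_id": 2,
     "from_name": "Person A", "to_name": "Family B"},
    {"_type": "edge", "_label": "associatedWith", "from_id": 3, "to_id": 1,
     "from_name": "Organization C", "to_name": "Person A"},
]

# Stand-in for the name index: identity string -> internal vertex id.
name_index = {props["identity"]: vid for vid, props in vertices.items()}


def both_edges(vertex_id):
    """Return incoming and outgoing edges of a vertex, like a bothE request."""
    return [e for e in edges if vertex_id in (e["from_id"], e["to_id"])]


def neighbor_names(identity):
    """List the names connected to an entity using only the edge records."""
    vid = name_index[identity]
    return [e["to_name"] if e["from_id"] == vid else e["from_name"]
            for e in both_edges(vid)]


connected = neighbor_names("Person A")
```

Because the names are stored on the edges, `neighbor_names` never has to fetch the neighboring vertices, which is the saving the denormalized `from_name` and `to_name` properties are meant to provide.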
    91. 94. Thanks Ed Summers! RDF of the social graph
    92. 96. http://templates.xdams.net/IBC/ontology/eac-cpf.rdf Silvia Mazzini, regesta.exe srl
    93. 97. Front End Stack <ul><li>golden grid http://code.google.com/p/the-golden-grid/ </li></ul><ul><li>form style http://formalize.me/ </li></ul><ul><li>jQuery and jQuery UI </li></ul><ul><li>hoverIntent for advanced search </li></ul><ul><li>Google Analytics with event tracking </li></ul><ul><li>UserVoice forum, Google Spreadsheets for feedback </li></ul>
    94. 98. XTF XSLT Framework <ul><li>pre filter - do special tokenization to create custom EAC facets </li></ul><ul><ul><li>https://docs.google.com/document/d/1wP9x6sdOZTagJNQXoyJfPh0Y6UzQgqLwLI86WSlIPbk/edit?hl=en_US </li></ul></ul><ul><li>query parser - CGI params to XTF query XML </li></ul><ul><li>result formatter - XTF results to HTML </li></ul><ul><li>doc formatter - EAC-CPF to HTML </li></ul><ul><li>http://code.google.com/p/xtf-cpf/source/browse/?name=xtf-cpf </li></ul>
    95. 99. Social graph visualization <ul><li>EAC to GraphML https://code.google.com/p/eac-graph-load/ </li></ul><ul><li>GraphML file with an open license should be viewable in other tools </li></ul><ul><li>Old demo uses Dracula Graph Library </li></ul><ul><li>New demo uses JavaScript InfoVis Toolkit </li></ul><ul><li>Ed Summers’s “snac hacks” post </li></ul>
    96. 100. EAD to EAC XSLT <ul><li>forthcoming from Virginia </li></ul>
    97. 101. Record Merging <ul><li>forthcoming from Berkeley </li></ul>
    98. 102. Demo <ul><li>http://socialarchive.iath.virginia.edu/xtf/search </li></ul>