saa-2011-snac
  • Flexible description: series description; dispersed collections
  • Cooperative authority control: dispersed collections; also, the creator of one collection is referenced in a collection created by someone else (co-referencing); economic and descriptive benefits
  • Integrated access to cultural heritage: context for archival records is essential, but the descriptions can also provide context for all types of resources
  • Archival authority records, like museum authority records, provide historical and biographical data that can enhance identification and understanding (biographical dictionary; administrative histories)
  • Remember that we will solicit public evaluation and suggestions on drafts of the public interface, starting in the fall.

saa-2011-snac Presentation Transcript

    • EAC-CPF and Social Networks
    • Society of American Archivists
    • Chicago
    • August 2011
    Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia
  • SNAC Overview
    • Funding and Timeline
    • Project Team
    • Project Objectives and Rationale
    • Data Contributing Institutions
    • Archival Standards Employed
    • Methods, Processing, and Products
    • Year One Extraction Results
    • Basic Observations on Extraction
  • Funding and Timeline
    • National Endowment for the Humanities
    • A Preservation and Access, Research and Development grant
    • Two-year project
    • May 2010-April 2012
  • Project Team
    • Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia)
    • Adrian Turner and Brian Tingle (California Digital Library, University of California)
    • Ray Larson (School of Information, University of California, Berkeley)
  • Project Objectives
    • Archival finding aids currently intermix description of records with description of the creators of records and persons evident in the records
    • Further the ongoing process of transforming archival description using advanced technologies
    • By facilitating the separation of the description of people from the description of records
    • Using EAC-CPF, an international archival authority control standard
    • Goal: improve the economy and effectiveness of archival description in order to enhance access and understanding for users of archives, libraries, and museums
  • Rationale for Separation
    • Authority control of forms of names
    • Flexible description
    • Cooperative authority control
    • Integrated access to cultural heritage
    • Biographical/historical resource
    • Social/historical context (social-professional networks)
  • The Data
    • EAD-encoded finding aids
      • Library of Congress (1,159)
      • Online Archive of California (~15,400)
      • Northwest Digital Archive (5,160)
      • Virginia Heritage (8,390)
    • Authority records
      • Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
      • Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
      • Virtual International Authority File (5M+ personal names)
  • Methods and Processing
    • Extract EAC-CPF records from existing EAD-encoded archival descriptions
      • Extracting both creators and referenced CPF names
    • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF); merge records for the same entity
      • Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
      • Key challenge: two or more people with the same name; two or more names for the same person
    • Create a prototype historical resource and access system
      • Historical data and social-professional networks
      • Links to archive, library, and museum resources (by and about)
  • EAD Source Data
    • Encoded Archival Description
      • Intermixes description of creators of records and, at the discretion of the archivists, names associated with the content of the records
      • Detailed description of creators of records
    • Widely varying quality
      • In the number of names identified and encoded
      • In the formation of the names (direct or inverted, capitalization, punctuation, and so on)
      • In the categorization of names (personal, corporate, or family)
    • Many names given but not identified as such
    • Most important of these in biographies/histories and in correspondence description
    • Extraction has focused on the “low-hanging fruit,” that is, the names tagged as names
    • Attention shifting to names not identified as such
  • Archival Records
    • Records are the by-products of people living and working as individuals, in organized groups, in families
    • Records document people living and working
    • People exist in social-professional contexts, in relation to others
    • Records document these relations
    • All records created by the same entity are described together (a fonds or collection)
      • Creators documented in detail
      • Many of the people documented in the records are referenced in the description
    • Archival descriptions document interrelations among people and records (documents)
  • Source: J. Robert Oppenheimer Papers (LoC)
    <origination>
      <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
    </origination>
    <controlaccess>
      <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
      <persname source="lcnaf" encodinganalog="600" role="subject">Bethe, Hans Albrecht, 1906- --Correspondence</persname>
      <!-- […] -->
      <persname source="lcnaf" encodinganalog="600" role="subject">Born, Max, 1882-1970 --Correspondence</persname>
      <persname source="lcnaf" encodinganalog="600" role="subject">Boyd, Julian P. (Julian Parks), 1903- --Correspondence</persname>
      <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
      <persname source="lcnaf" encodinganalog="600" role="subject">Casals, Pablo, 1876-1973 --Correspondence</persname>
      <!-- […] -->
      <corpname source="lcnaf" encodinganalog="610" role="subject">Institute for Advanced Study (Princeton, N.J.)</corpname>
      <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
      <!-- […] -->
    </controlaccess>
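The "low hanging fruit" extraction described earlier (names already tagged as names) can be sketched roughly as follows. This is a minimal illustration, not the project's extraction code: the helper `extract_names`, the abbreviated fragment, and the rule for dropping a trailing subdivision such as "--Correspondence" are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the <controlaccess> fragment from the slide.
EAD_FRAGMENT = """
<controlaccess>
  <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
  <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
  <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
</controlaccess>
"""

# EAD name element -> EAC-CPF entity type.
ENTITY_TYPES = {"persname": "person", "corpname": "corporateBody", "famname": "family"}


def extract_names(ead_xml):
    """Return one dict per tagged name (hypothetical helper)."""
    root = ET.fromstring(ead_xml)
    names = []
    for element in root.iter():
        if element.tag not in ENTITY_TYPES:
            continue
        text = " ".join((element.text or "").split())
        # Assumption: a trailing " --Subdivision" is not part of the name.
        heading = text.split(" --")[0].strip()
        names.append({
            "name": heading,
            "entity_type": ENTITY_TYPES[element.tag],
            "role": element.get("role"),
            "source": element.get("source"),
        })
    return names


names = extract_names(EAD_FRAGMENT)
```

Each resulting dict is a candidate seed for one EAC-CPF record; real finding aids would also need namespace handling and the untagged names discussed earlier.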
  • Source: Leonard Bernstein Collection (LoC)
    <c02>
      <did>
        <container type="box">1</container>
        <unittitle>Aaltonen, Erkki <unitdate era="ce" calendar="gregorian">1981</unitdate></unittitle>
        <physdesc><extent>1</extent></physdesc>
      </did>
    </c02>
    <c02>
      <did>
        <unittitle>Abbado, Claudio <unitdate era="ce" calendar="gregorian">1963-90</unitdate></unittitle>
        <physdesc><extent>5</extent></physdesc>
      </did>
    </c02>
    […]
  • <bioghist>
      <head>Biographical Sketch</head>
      <p>José Marcos Mugarrieta, prior to his term as Mexican consul in San Francisco 1857-1863, served in the Mexican army from 1837. He saw action in numerous battles and campaigns – Jamaica, under General Canalizo in 1841; Campeche, 1842-1843; Merida, 1843; Veracruz, 1845; Mexico City, 1846; Angostura and Cerro-gordo, 1847; Guanajuato, 1848, and Sierra-Gorda under Bustamante, 1848-1849; and Matamoros, 1849-1850. […]</p>
      <p>In April 1857 Mugarrieta received an appointment from the Comonfort government for the consulship in San Francisco. He did not actually begin his new duties until September 1, 1859, due to illness and to the political situation in Mexico. […]</p>
    </bioghist>
  • <bioghist>
      <head>Chronology</head>
      <chronlist>
        <chronitem>
          <date>1900</date>
          <event>Born on Jan. 20 in Hastings, Minnesota.</event>
        </chronitem>
        <chronitem>
          <date>1922</date>
          <event>Received baccalaureate from Princeton University, major in philosophy.</event>
        </chronitem>
        […]
        <chronitem>
          <date>1965</date>
          <event>Died on April 4.</event>
        </chronitem>
      </chronlist>
    </bioghist>
  • EAC-CPF
    • Encoded Archival Context-Corporate bodies, Persons, Families
    • An international communication standard for archival authority control
    • Based on International Council for Archives, International Standard Archival Authority Records-Corporate bodies, persons, families (ISAAR(CPF))
    • SAA Standards Committee, Technical Subcommittee on Encoded Archival Context
    • Co-chairs
      • Katherine Wisser, Simmons College
      • Anila Angjeli, Bibliothèque nationale de France
  • Library and Archive Authority Control
    • Library (or bibliographic) authority control is almost exclusively about the control of names
    • Archival authority control involves biographical-historical description of the CPF entity
      • Descriptions based on controlled vocabularies or values, for example, occupations, place of birth and death
      • But also biographical-historical description
        • Prose
        • Chronological list
    • Archival authority control provides context for understanding records, the context of their creation, the provenance
  • <identity>
      <entityType>person</entityType>
      <nameEntry scriptCode="Latn" xml:lang="eng">
        <part>Oppenheimer, J. Robert, 1904-1967.</part>
        <authorizedForm>AACR2</authorizedForm>
      </nameEntry>
      <nameEntry localType="VIAF:MainHeading">
        <part>Oppenheimer, J. Robert (Julius Robert), 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
      </nameEntry>
      <nameEntry localType="VIAF:MainHeading">
        <part>Oppenheimer, Julius Robert, 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
      </nameEntry>
      <nameEntry localType="VIAF:x400">
        <part>Oppenheimer, Robert</part>
        <alternativeForm>VIAF</alternativeForm>
      </nameEntry>
      <nameEntry localType="VIAF:x400">
        <part>Ou-pẽn-hai-mo, 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
      </nameEntry>
    </identity>
  • <existDates>
      <dateRange>
        <fromDate standardDate="1904-04-22">1904, Apr. 22</fromDate>
        <toDate standardDate="1967-02-18">1967, Feb. 18</toDate>
      </dateRange>
    </existDates>
    <!-- ... -->
    <localDescription localType="subject">
      <term>Science--Societies, etc.</term>
    </localDescription>
    <localDescription localType="VIAF:nationality">
      <placeEntry countryCode="US"/>
    </localDescription>
    <localDescription localType="VIAF:gender">
      <term>Male</term>
    </localDescription>
    <languageUsed>
      <language languageCode="eng"/>
    </languageUsed>
    <occupation>
      <term>Physicists.</term>
    </occupation>
    <!-- ... -->
  • <chronList>
      <chronItem>
        <date>1904, Apr. 22</date>
        <placeEntry>New York, N.Y.</placeEntry>
        <event>Born, New York, N.Y.</event>
      </chronItem>
      <!-- ... -->
      <chronItem>
        <date>1943-1945</date>
        <placeEntry>Los Alamos, N. Mex.</placeEntry>
        <event>Director, Los Alamos Scientific Laboratory, Los Alamos, N. Mex.</event>
      </chronItem>
      <!-- ... -->
      <chronItem>
        <date>1954</date>
        <event>(1) Denied security clearance […] (2) Published Science and the Common Understanding […]</event>
      </chronItem>
      <!-- ... -->
      <chronItem>
        <date>1967, Feb. 18</date>
        <placeEntry>Princeton, N.J.</placeEntry>
        <event>Died, Princeton, N.J.</event>
      </chronItem>
    </chronList>
  • <cpfRelation xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:role="http://RDVocab.info/uri/schema/FRBRentitiesRDA/Person" xlink:arcrole="correspondedWith">
      <relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
      <descriptiveNote>
        <p>recordId: DLC.ms998007.r007</p>
      </descriptiveNote>
    </cpfRelation>
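A `<cpfRelation>` like this one is the raw material for a social-professional network: each relation can be read as an edge between two named entities. The sketch below shows one way to do that with the standard library. The function `relation_to_edge` and the edge dictionary layout are illustrative assumptions, and the fragment omits the EAC-CPF default namespace that a full record would declare.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

CPF_RELATION = """
<cpfRelation xmlns:xlink="http://www.w3.org/1999/xlink"
             xlink:type="simple"
             xlink:role="http://RDVocab.info/uri/schema/FRBRentitiesRDA/Person"
             xlink:arcrole="correspondedWith">
  <relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
  <descriptiveNote><p>recordId: DLC.ms998007.r007</p></descriptiveNote>
</cpfRelation>
"""


def relation_to_edge(from_name, relation_xml):
    """Turn one cpfRelation into a labeled edge (hypothetical helper)."""
    relation = ET.fromstring(relation_xml)
    return {
        "from_name": from_name,
        "to_name": relation.findtext("relationEntry", default="").strip(),
        # ElementTree exposes namespaced attributes as "{namespace-uri}local".
        "label": relation.get(f"{{{XLINK}}}arcrole"),
    }


edge = relation_to_edge("Oppenheimer, J. Robert, 1904-1967.", CPF_RELATION)
```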
  • <resourceRelation xmlns:xlink="http://www.w3.org/1999/xlink" xlink:arcrole="creatorOf" xlink:role="archivalRecords" xlink:type="simple" xlink:href="http://hdl.loc.gov/loc.mss/eadmss.ms998007">
      <relationEntry>J. Robert Oppenheimer Papers, 1799-1980 (bulk 1947-1967)</relationEntry>
      <objectXMLWrap>
        <did xmlns="urn:isbn:1-931666-22-9">
          <unittitle>Papers <unitdate normal="1799/1980" era="ce" calendar="gregorian">1799-1980 </unitdate><unitdate label="Bulk Dates" type="bulk" normal="1947/1967" era="ce" calendar="gregorian">(bulk 1947-1967)</unitdate></unittitle>
          <unitid countrycode="US" repositorycode="US-DLC">MSS35188</unitid>
          <origination label="Creator">
            <persname>Oppenheimer, J. Robert, 1904-1967</persname>
          </origination>
          <!-- ... -->
          <repository><corpname>Manuscript Division. Library of Congress</corpname></repository>
          <abstract>Physicist and director of the Institute for Advanced Study, Princeton, New Jersey. [...] Topics include theoretical physics, development of the atomic bomb, the relationship between government and science, nuclear energy, security, and national loyalty.</abstract>
        </did>
      </objectXMLWrap>
    </resourceRelation>
  • Year One Results-Extraction
    • EAC-CPF records extracted
      • LoC: 43,702 from 1,159 finding aids
      • OAC: 91,811 from ~15,400
      • NWDA: 22,609 from 5,160
      • VH: 15,175 from 8,390
      • Total 173,297
      • Note: a more recent extraction yielded 196,218, but we have not had time to analyze the results
  • Early Observations-Extraction
    • Depth of analysis and quality of description of CPF entities varies widely in EAD-encoded finding aids
      • LoC has many names under authority control
      • OAC and NWDA have fewer names, and control varies
    • To be fair, the finding aids were created without SNAC processing in mind!
  • Next on Extraction
    • Refine extraction processing, incorporating some NLP-like processing, for example
      • Verifying type of name: C or P or F
      • Massaging poorly formed names into better formed names
      • Identifying names in strings that are names-plus (but name not identified as such)
      • Provide context information to enhance matching, for example, date or dates of correspondence, or occupation of creator of records
  • Beyond the Project
    • Building a National Archival Authorities Infrastructure
      • IMLS funded two-year project, October 2011-September 2013
      • EAC-CPF SAA workshops: 140 scholarships
      • National Archival Authorities Cooperative planning
    • SNAC II: a proposal to expand SNAC
      • A lot more data
      • NARA, SI, MARC WorldCat records, a lot more finding aids
  • For More Information
    • http://socialarchive.iath.virginia.edu/ (Project website)
    • http://socialarchive.iath.virginia.edu/xtf/search (public prototype)
  • Social Networks and Archival Context Project: Matching and Merging EAC-CPF Records
    • Ray R. Larson
    • Krishna Janakiraman
    • University of California, Berkeley
    • School of Information
    SAA 2011 - Chicago 2011-08-27 - SLIDE Thanks to Daniel V. Pitti of the Institute for Advanced Technology in the Humanities, University of Virginia, for many of the slides here
  • SNAC Project
    • The outlines of the project have been discussed by Daniel Pitti previously
    • The primary focus of the Berkeley group for the project is on combining data resources from multiple archives and other information sources
    • In this talk I will focus on our current methods used in the prototype (to be described by Brian Tingle later)
  • Data Contributing Institutions
    • EAD-encoded finding aids
      • Library of Congress (1159)
      • Online Archive of California (15,400+)
      • Northwest Digital Archive (5,563+)
      • Virginia Heritage (8,390+)
    • Authority records
      • Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
      • Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
      • Virtual International Authority File (intersection with NACO/LCNAF, 5M personal names)
    • Other biographical sources (e.g., DBPedia, IMDB)
  • Methods and Processing
    • Extract EAC-CPF records from existing EAD-encoded archival descriptions
      • Extracting both creators and referenced CPF names
    • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
      • Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
    • Create a prototype historical resource and access system
      • Historical data and social-professional networks
      • Links to archive, library, and museum resources (by and about)
  • Merging EAC-CPF Records
    • LCNAF Repository
    • ULAN Repository
    • Connect exactly matching records
    • Connect records using name authority information
    • Merge
    • Cheshire Search
  • Authority Control
    • Identifying creator entities and referenced entities (correspondents, etc.)
    • Recording name or names used by and for them
    • Rule-based heading or entry formation and control
  • Controlled Vocabularies
    • Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information
    • That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
  • The Problem
    • Proliferation of the forms of names
      • Different names for the same person
      • Different people with the same names
    • Examples
      • from Books in Print (semi-controlled but not consistent)
      • ERIC author index (not controlled)
  • Goethe … etc…
  • John Muir
  • Pauline Cochrane nee Atherton
  • Name Authority Files
    ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242 KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80 RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-21-91
    Other Versions: earlier
    040 DLC$cDLC$dDLC$dOCoLC
    053 PR6005.R517
    100 10 Creasey, John
    400 10 Cooke, M. E.
    400 10 Cooke, Margaret,$d1908-1973
    400 10 Cooper, Henry St. John,$d1908-1973
    400 00 Credo,$d1908-1973
    400 10 Fecamps, Elise
    400 10 Gill, Patrick,$d1908-1973
    400 10 Hope, Brian,$d1908-1973
    400 10 Hughes, Colin,$d1908-1973
    400 10 Marsden, James
    400 10 Matheson, Rodney
    400 10 Ranger, Ken
    400 20 St. John, Henry,$d1908-1973
    400 10 Wilde, Jimmy
    500 10 $wnnnc$aAshe, Gordon,$d1908-1973
    • Different names for the same person
  • Connect Exact Matches
    • The EAC-CPF records provide the names without having to parse texts, etc.
    • Allows us to use some simple methods like exact matching
      • Assume identical name entries means the same person/corporate body/family
      • Enter the full names and record IDs into a database and flag IDs with same names for merging
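The exact-match step can be sketched in a few lines. The slide describes a database of full names and record IDs; here a dictionary stands in for that database, and the record IDs are hypothetical. Only names that are character-for-character identical are flagged.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name entry.

    `records` is an iterable of (record_id, name_entry) pairs. Returns only
    the names shared by more than one record, which are the merge candidates.
    """
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical record IDs; the name strings are taken from later slides.
records = [
    ("loc-1", "Tabb, John Banister, 1845-1909."),
    ("oac-7", "Tabb, John Banister, 1845-1909."),
    ("oac-9", "Tabb, John B. (John Banister), 1845-1909."),
]
flagged = flag_exact_matches(records)
```

Note that the third record is not flagged even though it plainly names the same person, which is the limitation the later "failures" slides illustrate.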
  • Search Authority Files
    • For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching)
      • Search both the “authoritative” and “non-authoritative” forms
      • Consider any name matching a non-authoritative form to be a candidate match for the authoritative form
      • Flag EAC records that match the same authority record as potential matches
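The idea of treating a hit on a non-authoritative form as a candidate match for the authoritative form can be illustrated without a retrieval system. The sketch below substitutes a plain dictionary lookup for the Cheshire search, so it captures only the flagging logic and none of the probabilistic matching. The authority entry reuses a few headings from the Creasey record shown earlier; the EAC record IDs are hypothetical.

```python
from collections import defaultdict

# Tiny stand-in for an authority file: one record with its variant forms.
AUTHORITY_FILE = {
    "NAFL8057230": {
        "authorized": "Creasey, John",
        "variants": ["Cooke, M. E.", "Marsden, James", "Wilde, Jimmy"],
    },
}


def build_form_index(authority_file):
    """Map every authorized and variant form to its authority record ID."""
    index = {}
    for authority_id, record in authority_file.items():
        for form in [record["authorized"], *record["variants"]]:
            index[form] = authority_id
    return index


def flag_authority_matches(eac_records, form_index):
    """Group EAC record IDs that resolve to the same authority record."""
    ids_by_authority = defaultdict(list)
    for record_id, name in eac_records:
        authority_id = form_index.get(name)
        if authority_id is not None:
            ids_by_authority[authority_id].append(record_id)
    return dict(ids_by_authority)


form_index = build_form_index(AUTHORITY_FILE)
eac_records = [("a1", "Creasey, John"), ("a2", "Marsden, James"), ("a3", "Muir, John")]
candidates = flag_authority_matches(eac_records, form_index)
```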
  • Merge Flagged Records
    • For all of the exact matches and authority matches
      • Use the Authoritative form of the name
      • Combine data from each match into a single EAC-CPF record
      • Retain all source record IDs and information
    • Finally, output the merged EAC-CPF records
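A minimal sketch of the merge step, assuming each flagged record has already been reduced to a dictionary with `id`, `name`, and `relations` keys (that layout and the sample values are hypothetical). It keeps the authoritative name, preserves other forms as alternatives, pools the relations, and retains every source record ID.

```python
def merge_records(records, authorized_name=None):
    """Combine flagged records for one entity into a single record."""
    merged = {
        "name": authorized_name or records[0]["name"],
        "alternative_names": [],
        "relations": [],
        "source_ids": [],
    }
    for record in records:
        name = record["name"]
        if name != merged["name"] and name not in merged["alternative_names"]:
            merged["alternative_names"].append(name)
        for relation in record["relations"]:
            if relation not in merged["relations"]:
                merged["relations"].append(relation)
        # Every contributing record stays traceable after the merge.
        merged["source_ids"].append(record["id"])
    return merged


merged = merge_records(
    [
        {"id": "a1", "name": "Creasey, John", "relations": ["correspondedWith: X"]},
        {"id": "a2", "name": "Marsden, James", "relations": ["correspondedWith: X", "associatedWith: Y"]},
    ],
    authorized_name="Creasey, John",
)
```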
  • Inputs to SNAC merging
    • LoC: 43,702 EAC-CPF records derived from 1159 finding aids
    • OAC: 91,811 EAC-CPF records derived from ~15,400 finding aids
    • NWDA: 22,609 EAC-CPF records derived from 5,568 finding aids
    • Result: 123,920 “unique” names
  • Another view of the numbers…
    • 93,033 Person names merged from 114,639 Person records
    • 30,161 Institutions merged from 41,177 Institution records
    • 1,669 Families merged from 2,263 Family records
  • But…
    • Exact merging assumes that archives are following LC cataloging practice in their EAD records
      • There are some problems with this assumption
  • Some failures for merging…
    • Different abbreviations:
      • A. & G. Carisch & C.
      • A. & G. Carisch & Co.
    • And spacing issues:
      • A. C. Peters & Bro.
      • A. C. Peters & Brother.
      • A. C. Peters. (??)
      • A. C.Peters & Bro.
    • Completeness and alternate rules
      • Tabb, John B. (John Banister), 1845-1909.
      • Tabb, John Banister, 1845-1909.
  • More…
    • Variant romanizations (and spacing):
      • M. P. Belaieff.
      • M. P. Belaïeff.
      • M. P. Bieliaev.
      • M.P. Belaïeff.
      • M.P.Belaïeff.
    • Initials vs. names:
      • Zabolotskii, N.A.
      • Zabolotskii, Nikolai Alekseevich, 1903-1958.
      • Zabolotskii.
  • More…
    • Inverted order vs. uninverted
      • Taylor, Zachary, 1784-1850.
      • Zachary Taylor.
    • Various combinations:
      • Tchaikovsky, Peter I.
      • Tchaikovsky, Pëtr Il.
      • Tchaikovsky, Piotr Ilyich.
      • Tchaikovsky, Pyotr Il.
      • Tchaikovsky, Pyotr Ilyich.
  • Another kind of failure
    • Entry for “Zaphiropoulos” - no dates, no first name:
      • The entry from VIAF was for “Zaphiropoulos, Lela, 1941-”
      • But the name in EAD came as an attribution for photos:
          • Box 113
          • Lot PP13 Zaphiropoulos. [Bas-relief at Troy], 1872.
          • Physical Description: 2 photographs
          • Scope and Content Note
          • Photographs taken for Schliemann.
    • Not sure that the Zaphiropoulos indicated is a person, and definitely not one born in 1941.
  • Addressing the failures
    • First we need to know where things are not working, and why
      • We are planning to do a random sample and detailed evaluation of the database to help identify the problems
    • Many of the problems we have seen already appear to be solvable using:
      • Additional contextual clues from the EAD records
      • More sophisticated matching for phonetic variants
        • Such as n-grams or phonetic schemes like phonex
      • Additional normalization of names before merging
        • For name order, etc.
      • Use of advanced matching methods
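Two of the remedies listed above, extra normalization before merging and n-gram matching, can be sketched with the standard library alone. The specific rules here (strip diacritics, lowercase, turn punctuation into spaces, compare character trigrams with a Jaccard ratio) are illustrative assumptions, not the project's actual rules.

```python
import re
import unicodedata


def normalize_name(name):
    """Fold case, diacritics, punctuation, and spacing differences."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punctuation = re.sub(r"[^\w\s]", " ", without_marks.lower())
    return " ".join(no_punctuation.split())


def char_ngrams(text, n=3):
    """Set of overlapping character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams of the normalized names."""
    grams_a = char_ngrams(normalize_name(a), n)
    grams_b = char_ngrams(normalize_name(b), n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Spacing and diacritic variants from the slides collapse to one key.
same_key = normalize_name("M.P.Belaïeff.") == normalize_name("M. P. Belaieff.")
# Romanization variants do not collapse, but they score as close.
close = ngram_similarity("Tchaikovsky, Pyotr Ilyich.", "Tchaikovsky, Piotr Ilyich.")
```

Normalization alone handles the spacing cases; the romanization and initials cases still need a similarity threshold or corroborating context such as dates.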
  • Testing new merging methods
    • Work done in conjunction with SNAC for an I School Master’s project called Biograph
      • Krishna Janakiraman and Sean Marimpietri
    • Using SNAC and merging with FreeBase and IMDB
  • Krishna Janakiraman and Sean Marimpietri - Biograph
    • Einstein, Albert, 1879-1955.
    • Einstein, Albert.
    • Ainshutain, A. 1879-1955
    • Aiyinsitan 1879-1955
    • Einstein, A.
    • Albert Einstein
    • Albert Einstein
  • Our approach
    • Learn binary classifiers over varying names and existence dates
    • Perturb existing information to generate additional samples within specific error levels
  • Features
    • Names: shingle language model
    • Birth and death dates
    • Names: string distance metrics
    • Train: learn decision tree classifiers
    • Predict: link records
  • Shingle Language Model for names
    • Name: Einstein Albert
    • Shingle sequence: ein, ins, nst, ste, tei, ein, …, ert
    • Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein
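One way to read the shingle language model is as a bigram model over three-character shingles: count which shingle follows which in known forms of a name, then score a candidate by how probable its shingle transitions are. The sketch below is an interpretation of the slide, not the Biograph implementation. Shingling each token separately, ignoring case, and averaging unsmoothed transition probabilities are all assumptions.

```python
from collections import Counter, defaultdict


def shingles(token, size=3):
    """Overlapping character shingles of one name token, lowercased."""
    token = token.lower()
    return [token[i:i + size] for i in range(len(token) - size + 1)]


def train(names, size=3):
    """Count shingle-to-shingle transitions within each token of each name."""
    transitions = defaultdict(Counter)
    for name in names:
        for token in name.split():
            sequence = shingles(token, size)
            for previous, following in zip(sequence, sequence[1:]):
                transitions[previous][following] += 1
    return transitions


def transition_probability(transitions, previous, following):
    """Estimate P(following | previous); 0.0 if `previous` was never seen."""
    total = sum(transitions[previous].values())
    return transitions[previous][following] / total if total else 0.0


def score(transitions, name, size=3):
    """Mean transition probability over all shingle pairs in `name`."""
    probabilities = []
    for token in name.split():
        sequence = shingles(token, size)
        probabilities.extend(
            transition_probability(transitions, p, f)
            for p, f in zip(sequence, sequence[1:])
        )
    return sum(probabilities) / len(probabilities) if probabilities else 0.0


model = train(["Einstein Albert"])
inverted = score(model, "Albert Einstein")
romanized = score(model, "Ainshtain Albert")
```

Because tokens are shingled independently, the inverted form scores as high as the training name, while the variant romanization scores lower; a score like this would be one feature alongside dates and string distances.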
  • Shingle Language Model for names
    • Name 1: Einstein Albert
    • Name 2: Ainshtain Albert
    • Name 3: Albert Einstein
    • [Figure: shingle sequences for the three names]
  • Example Decision Tree For Von Neumann
    • [Figure: decision tree over Date and String Distance features]
  • Krishna Janakiraman and Sean Marimpietri - Biograph
    • Albert Einstein: TPR: 75.7%, FPR: 7%
    • George W Bush: TPR: 86.6%, FPR: 13%
    • Von Neumann: TPR: 75.7%, FPR: 7%
    • Corpus Average: TPR: 72.7%, FPR: 17%
    • TP:78 FP:11 FN:25 TN:145
    • TP:39 FP:9 FN:6 TN:60
    • TP:182 FP:14 FN:27 TN:301
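The true positive rate (TPR) and false positive rate (FPR) follow directly from the confusion counts: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The short check below recomputes them. Pairing the three sets of counts with the three names in the order listed is an assumption, since the transcript does not say which counts belong to which name.

```python
def rates(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate)."""
    return tp / (tp + fn), fp / (fp + tn)


# Confusion counts in the order they appear on the slide.
first = rates(tp=78, fp=11, fn=25, tn=145)
second = rates(tp=39, fp=9, fn=6, tn=60)
third = rates(tp=182, fp=14, fn=27, tn=301)

for label, (tpr, fpr) in [("first", first), ("second", second), ("third", third)]:
    print(f"{label}: TPR {tpr:.1%}, FPR {fpr:.1%}")
```

Under that pairing, the first set gives about 75.7% and 7.1%, agreeing with the Albert Einstein figures, and the second gives about 86.7% and 13.0%, agreeing with the George W Bush figures to rounding. The third set gives about 87.1% and 4.4%, which does not correspond to any TPR/FPR pair in the transcript.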
  • How many did we link?
    • 15,300 records, thresh = 0.85
    • 1,100 records, thresh = 0.9
  • Conclusions
    • There will not be a single merging method, but a staged set of approaches that will allow us to go from the simplest exact matches, to (we hope) reliably identifying various variant forms of a name, etc. when corroborated by contextual (date, etc.) information
    • Once records are merged, they are passed along to Brian for search and display…
  • Discovering Historic Social Networks
    • Prototype Historical Resource Demo
    • Brian Tingle, California Digital Library
    • Society of American Archivists 2011 Annual Meeting
    • August 27, 2011
    • Chicago
  • Meet the target users
    • Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks.  Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event.  He also TAs an undergraduate history class and sometimes has to help students find topics for papers. 
    • Connie: Works at an institution that contributed records to the project.  Is going to be asking themselves how this site would be useful to their users.  Wants to understand how their records were used and what the added value is.
    • Quincy: Library School Student working to QA record matching.
    • Adele: Person doing authority work during collection processing.
    • Lenny: Lenny likes linked data, and wants to be able to mine the links that have been established programmatically.
    Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)
  • Home Page
  • Facet tabs
  • Facet tabs
  • Advanced Search
  • Advanced limits match EAC sections
  • XTF result
  • XTF query in the crossQueryResult
  • doing a search
  • spellcheck
  • search results
  • search results
  • EAC record view Identity
  • EAC record view alternative forms of name
  • EAC record view Biographical History
  • HTML 5 microdata in chron list
  • EAC record view Related Entries
  • EAC record view Related Entries
    • RDFa owl:sameAs
  • EAC record view View EAC XML
  • EAC record view Graph Demo
  • Tinkerpop Graph Stack http://www.tinkerpop.com/
    • Property Graph Model
    • graphML
    • RDF Sail support
  • https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph vertex edge
  • Graph Schema
    • vertex
      • _id: auto-assigned by neo4j
      • _type: vertex
      • identity: the name of the entity (string) [indexed]
      • urls: \n-separated list of source EAD files
      • entityType: 'corporateBody', 'family', or 'person'
    • edge
      • _id: auto-assigned by neo4j
      • _type: edge
      • _label: 'correspondedWith' or 'associatedWith'
      • _inV: incoming vertex _id (from)
      • _outV: outgoing vertex _id (to)
      • from_name: from identity (string) denormalized
      • to_name: to identity (string) denormalized
  • indices/name-idx is an index on “identity”; used to look up the neo4j internal record id
  • vertices/103994/bothE
    • “bothE” shows in and out edges
    • redundant data to save repeated lookups
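The schema, name index, and “bothE” lookup can be mimicked with an in-memory structure to show how the pieces fit together. This is only a sketch: the class `PropertyGraph` and its method names are hypothetical, it does not use neo4j or the TinkerPop stack, and it uses plain `from_id`/`to_id` fields for the edge endpoints.

```python
class PropertyGraph:
    """Toy property graph with a name index and denormalized edge names."""

    def __init__(self):
        self.vertices = {}    # vertex id -> properties
        self.edges = []       # list of edge dicts
        self.name_index = {}  # identity string -> vertex id

    def add_vertex(self, identity, entity_type, urls=()):
        """Add a vertex, or return the existing id for a known identity."""
        if identity in self.name_index:
            return self.name_index[identity]
        vertex_id = len(self.vertices) + 1
        self.vertices[vertex_id] = {
            "identity": identity,
            "entityType": entity_type,
            "urls": "\n".join(urls),
        }
        self.name_index[identity] = vertex_id
        return vertex_id

    def add_edge(self, from_id, to_id, label):
        """Add a labeled edge, copying both names onto the edge itself."""
        edge = {
            "label": label,
            "from_id": from_id,
            "to_id": to_id,
            # Redundant copies so a display needs no extra vertex lookups.
            "from_name": self.vertices[from_id]["identity"],
            "to_name": self.vertices[to_id]["identity"],
        }
        self.edges.append(edge)
        return edge

    def both_edges(self, vertex_id):
        """All edges touching a vertex, incoming and outgoing."""
        return [e for e in self.edges if vertex_id in (e["from_id"], e["to_id"])]


graph = PropertyGraph()
oppenheimer = graph.add_vertex("Oppenheimer, J. Robert, 1904-1967.", "person")
bush = graph.add_vertex("Bush, Vannevar, 1890-1974.", "person")
graph.add_edge(oppenheimer, bush, "correspondedWith")
neighbors = graph.both_edges(bush)
```

Storing `from_name` and `to_name` on each edge is the same trade the slide describes: a little redundancy in exchange for rendering a neighborhood from a single edge query.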
  • Thanks Ed Summers! RDF of the social graph
  • http://templates.xdams.net/IBC/ontology/eac-cpf.rdf Silvia Mazzini, regesta.exe srl
  • Front End Stack
    • golden grid http://code.google.com/p/the-golden-grid/
    • form style http://formalize.me/
    • jquery and jquery ui
    • hoverIntent for advanced search
    • google analytics with event tracking
    • Uservoice forum, google spreadsheets for feedback
  • XTF XSLT Framework
    • pre filter - do special tokenization to create custom EAC facets
      • https://docs.google.com/document/d/1wP9x6sdOZTagJNQXoyJfPh0Y6UzQgqLwLI86WSlIPbk/edit?hl=en_US
    • query parser - CGI params to XTF query XML
    • result formatter - XTF results to HTML
    • doc formatter - EAC-CPF to HTML
    • http://code.google.com/p/xtf-cpf/source/browse/?name=xtf-cpf
  • social graph visualization
    • EAC to graphML https://code.google.com/p/eac-graph-load/
    • graphML file with open license should be viewable in other tools
    • old demo uses Dracula Graph Library
    • New demo uses Javascript InfoVis Toolkit
    • Ed Summers’s “snac hacks” post
  • EAD to EAC XSLT
    • forthcoming from Virginia
  • Record Merging
    • forthcoming from Berkeley
  • Demo
    • http://socialarchive.iath.virginia.edu/xtf/search