Snac saa-aug-2011-try 3 keynote
 

  • Flexible description: series description; dispersed collections
    Cooperative authority control: dispersed collections; also, the creator of one collection may be referenced in a collection created by someone else (co-referencing); economic and descriptive benefits
    Integrated access to cultural heritage: context for archival records is essential, but the descriptions can also provide context for all types of resources
    Archival authority records, like museum authority records, provide historical and biographical data that can enhance identification and understanding (biographical dictionary; administrative histories)
  • Remember that we will solicit public evaluation and suggestions on drafts of the public interface, starting in the fall.

Presentation Transcript

  • EAC-CPF and Social Networks
    Society of American Archivists, Chicago, August 2011
    Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia

  • SNAC Overview
    • Funding and Timeline
    • Project Team
    • Project Objectives and Rationale
    • Data Contributing Institutions
    • Archival Standards Employed
    • Methods, Processing, and Products
    • Year One Extraction Results
    • Basic Observations on Extraction

  • Funding and Timeline
    • National Endowment for the Humanities
    • A Preservation and Access, Research and Development grant
    • Two-year project
    • May 2010-April 2012

  • Project Team
    • Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia)
    • Adrian Turner and Brian Tingle (California Digital Library, University of California)
    • Ray Larson (School of Information, University of California, Berkeley)

  • Project Objectives
    • Archival finding aids currently intermix description of records with description of the creators of records and persons evident in the records
    • Further the ongoing process of transforming archival description using advanced technologies
    • By facilitating the separation of the description of people from the description of records
    • Using EAC-CPF, an international archival authority control standard
    • Goal: enhance the economy and effectiveness of archival description to enhance access and understanding of users of archives, libraries, and museums

  • Rationale for Separation
    • Authority control of forms of names
    • Flexible description
    • Cooperative authority control
    • Integrated access to cultural heritage
    • Biographical/historical resource
    • Social/historical context (social-professional networks)

  • The Data
    • EAD-encoded finding aids
      – Library of Congress (1,159)
      – Online Archive of California (~15,400)
      – Northwest Digital Archive (5,160)
      – Virginia Heritage (8,390)
    • Authority records
      – Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
      – Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
      – Virtual International Authority File (5M+ personal names)

  • Methods and Processing
    • Extract EAC-CPF records from existing EAD-encoded archival descriptions
      – Extracting both creators and referenced CPF names
    • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF); merge records for the same entity
      – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
      – Key challenge: two or more people with the same name; two or more names for the same person
    • Create a prototype historical resource and access system
      – Historical data and social-professional networks
      – Links to archive, library, and museum resources (by and about)

  • EAD Source Data
    • Encoded Archival Description
      – Intermixes description of creators of records and, at the discretion of the archivists, names associated with the content of the records
      – Detailed description of creators of records
    • Widely varying quality
      – In the number of names identified and encoded
      – In the formation of the names (direct or inverted, capitalization, punctuation, and so on)
      – In the categorization of names (personal, corporate, or family)
    • Many names given but not identified as such
    • Most important of these in biographies/histories and in correspondence description
    • Extraction has focused on the “low hanging fruit,” that is, the names tagged as names
    • Attention shifting to names not identified as such

  • Archival Records
    • Records are the by-products of people living and working as individuals, in organized groups, in families
    • Records document people living and working
    • People exist in social-professional contexts, in relation to others
    • Records document these relations
    • All records created by the same entity are described together (a fonds or collection)
      – Creators documented in detail
      – Many of the people documented in the record referenced in description
    • Archival descriptions document interrelations among people and records (documents)

  • Source: J. Robert Oppenheimer Papers (LoC)
    <origination>
      <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
    </origination>
    <controlaccess>
      <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
      <persname source="lcnaf" encodinganalog="600" role="subject">Bethe, Hans Albrecht, 1906- --Correspondence</persname>
      <!-- […] -->
      <persname source="lcnaf" encodinganalog="600" role="subject">Born, Max, 1882-1970 --Correspondence</persname>
      <persname source="lcnaf" encodinganalog="600" role="subject">Boyd, Julian P. (Julian Parks), 1903- --Correspondence</persname>
      <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
      <persname source="lcnaf" encodinganalog="600" role="subject">Casals, Pablo, 1876-1973 --Correspondence</persname>
      <!-- […] -->
      <corpname source="lcnaf" encodinganalog="610" role="subject">Institute for Advanced Study (Princeton, N.J.)</corpname>
      <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
      <!-- […] -->
    </controlaccess>
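As a rough illustration of the "low hanging fruit" extraction, the sketch below pulls the tagged names out of a `<controlaccess>` fragment like the Oppenheimer example. It is not the SNAC extraction code: the function name `extract_names`, the tag-to-type mapping, and the choice to drop the subdivision after `--` are assumptions made for this example, and the sample fragment omits namespaces.

```python
import xml.etree.ElementTree as ET

# Small fragment modeled on the Oppenheimer finding aid shown in the slides.
SAMPLE = """<controlaccess>
  <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
  <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
  <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
</controlaccess>"""

# Assumed mapping from EAD name elements to EAC-CPF entity types.
NAME_TAGS = {"persname": "person", "corpname": "corporateBody", "famname": "family"}


def extract_names(xml_text):
    """Return one dict per tagged name found in the EAD fragment."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter():
        entity_type = NAME_TAGS.get(element.tag)
        if entity_type is None:
            continue
        # Collapse whitespace, then drop any subject subdivision after "--".
        text = " ".join("".join(element.itertext()).split())
        heading = text.split("--")[0].strip()
        names.append(
            {"entityType": entity_type, "role": element.get("role"), "name": heading}
        )
    return names


names = extract_names(SAMPLE)
for entry in names:
    print(entry["entityType"], entry["role"], entry["name"])
```

Names that appear only in prose, such as those in a `<bioghist>` paragraph or a `<unittitle>`, are not found by this approach, which is the gap the later slides describe.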

  • Source: Leonard Bernstein Collection (LoC)
    <c02>
      <did>
        <container type="box">1</container>
        <unittitle>Aaltonen, Erkki <unitdate era="ce" calendar="gregorian">1981</unitdate></unittitle>
        <physdesc>
          <extent>1</extent>
        </physdesc>
      </did>
    </c02>
    <c02>
      <did>
        <unittitle>Abbado, Claudio <unitdate era="ce" calendar="gregorian">1963-90</unitdate></unittitle>
        <physdesc>
          <extent>5</extent>
        </physdesc>
      </did>
    </c02>
    […]

  • <bioghist>
      <head>Biographical Sketch</head>
      <p>José Marcos Mugarrieta, prior to his term as Mexican consul in San Francisco 1857-1863, served in the Mexican army from 1837. He saw action in numerous battles and campaigns – Jamaica, under General Canalizo in 1841; Campeche, 1842-1843; Merida, 1843; Veracruz, 1845; Mexico City, 1846; Angostura and Cerro-gordo, 1847; Guanajuato, 1848, and Sierra-Gorda under Bustamante, 1848-1849; and Matamoros, 1849-1850. […]</p>
      <p>In April 1857 Mugarrieta received an appointment from the Comonfort government for the consulship in San Francisco. He did not actually begin his new duties until September 1, 1859, due to illness and to the political situation in Mexico. […]</p>
    </bioghist>

  • <bioghist>
      <head>Chronology</head>
      <chronlist>
        <chronitem>
          <date>1900</date>
          <event>Born on Jan. 20 in Hastings, Minnesota.</event>
        </chronitem>
        <chronitem>
          <date>1922</date>
          <event>Received baccalaureate from Princeton University, major in philosophy.</event>
        </chronitem>
        […]
        <chronitem>
          <date>1965</date>
          <event>Died on April 4.</event>
        </chronitem>
      </chronlist>
    </bioghist>

  • EAC-CPF
    • Encoded Archival Context-Corporate bodies, Persons, Families
    • An international communication standard for archival authority control
    • Based on the International Council on Archives' International Standard Archival Authority Record for Corporate Bodies, Persons and Families (ISAAR(CPF))
    • SAA Standards Committee, Technical Subcommittee on Encoded Archival Context
    • Co-chairs
      – Katherine Wisser, Simmons College
      – Anila Angjeli, Bibliothèque nationale de France

  • Library and Archive Authority Control
    • Library (or bibliographic) authority control is almost exclusively about the control of names
    • Archival authority control involves biographical-historical description of the CPF entity
      – Descriptions based on controlled vocabularies or values, for example, occupations, place of birth and death
      – But also biographical-historical description
        • Prose
        • Chronological list
    • Archival authority control provides context for understanding records, the context of their creation, the provenance

  • <identity>
      <entityType>person</entityType>
      <nameEntry scriptCode="Latn" xml:lang="eng">
        <part>Oppenheimer, J. Robert, 1904-1967.</part>
        <authorizedForm>AACR2</authorizedForm>
      </nameEntry>
      <nameEntry localType="VIAF:MainHeading">
        <part>Oppenheimer, J. Robert (Julius Robert), 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
      </nameEntry>
      <nameEntry localType="VIAF:MainHeading">
        <part>Oppenheimer, Julius Robert, 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
      </nameEntry>
      <nameEntry localType="VIAF:x400">
        <part>Oppenheimer, Robert</part>
        <alternativeForm>VIAF</alternativeForm>
      </nameEntry>
      <nameEntry localType="VIAF:x400">
        <part>Ou-pẽn-hai-mo, 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
      </nameEntry>
    </identity>

  • <existDates>
      <dateRange>
        <fromDate standardDate="1904-04-22">1904, Apr. 22</fromDate>
        <toDate standardDate="1967-02-18">1967, Feb. 18</toDate>
      </dateRange>
    </existDates>
    <!-- ... -->
    <localDescription localType="subject">
      <term>Science--Societies, etc.</term>
    </localDescription>
    <localDescription localType="VIAF:nationality">
      <placeEntry countryCode="US"/>
    </localDescription>
    <localDescription localType="VIAF:gender">
      <term>Male</term>
    </localDescription>
    <languageUsed>
      <language languageCode="eng"/>
    </languageUsed>
    <occupation>
      <term>Physicists.</term>
    </occupation>
    <!-- ... -->

  • <chronList>
      <chronItem>
        <date>1904, Apr. 22</date>
        <placeEntry>New York, N.Y.</placeEntry>
        <event>Born, New York, N.Y.</event>
      </chronItem>
      <!-- ... -->
      <chronItem>
        <date>1943-1945</date>
        <placeEntry>Los Alamos, N. Mex.</placeEntry>
        <event>Director, Los Alamos Scientific Laboratory, Los Alamos, N. Mex.</event>
      </chronItem>
      <!-- ... -->
      <chronItem>
        <date>1954</date>
        <event>(1) Denied security clearance […] (2) Published Science and the Common Understanding […]</event>
      </chronItem>
      <!-- ... -->
      <chronItem>
        <date>1967, Feb. 18</date>
        <placeEntry>Princeton, N.J.</placeEntry>
        <event>Died, Princeton, N.J.</event>
      </chronItem>
    </chronList>

  • <cpfRelation xmlns:xlink="http://www.w3.org/1999/xlink"
        xlink:type="simple"
        xlink:role="http://RDVocab.info/uri/schema/FRBRentitiesRDA/Person"
        xlink:arcrole="correspondedWith">
      <relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
      <descriptiveNote>
        <p>recordId: DLC.ms998007.r007</p>
      </descriptiveNote>
    </cpfRelation>

  • <resourceRelation xmlns:xlink="http://www.w3.org/1999/xlink" xlink:arcrole="creatorOf"
        xlink:role="archivalRecords" xlink:type="simple"
        xlink:href="http://hdl.loc.gov/loc.mss/eadmss.ms998007">
      <relationEntry>J. Robert Oppenheimer Papers, 1799-1980 (bulk 1947-1967)</relationEntry>
      <objectXMLWrap>
        <did xmlns="urn:isbn:1-931666-22-9">
          <unittitle>Papers <unitdate normal="1799/1980" era="ce" calendar="gregorian">1799-1980</unitdate><unitdate label="Bulk Dates" type="bulk" normal="1947/1967" era="ce" calendar="gregorian">(bulk 1947-1967)</unitdate></unittitle>
          <unitid countrycode="US" repositorycode="US-DLC">MSS35188</unitid>
          <origination label="Creator">
            <persname>Oppenheimer, J. Robert, 1904-1967</persname>
          </origination>
          <!-- ... -->
          <repository><corpname>Manuscript Division. Library of Congress</corpname></repository>
          <abstract>Physicist and director of the Institute for Advanced Study, Princeton, New Jersey. [...] Topics include theoretical physics, development of the atomic bomb, the relationship between government and science, nuclear energy, security, and national loyalty.</abstract>
        </did>
      </objectXMLWrap>
    </resourceRelation>

  • Year One Results-Extraction
    • EAC-CPF records extracted
      – LoC: 43,702 from 1,159 finding aids
      – OAC: 91,811 from ~15,400
      – NWDA: 22,609 from 5,160
      – VH: 15,175 from 8,390
      – Total: 173,297
      – Note: a more recent extraction yielded 196,218, but we have not had time to analyze the results

  • Early Observations-Extraction
    • Depth of analysis and quality of description of CPF entities vary widely in EAD-encoded finding aids
      – LoC: a lot of names under authority control
      – OAC and NWDA have fewer names, and control varies
    • To be fair, the finding aids were created without SNAC processing in mind!

  • Next on Extraction
    • Refine extraction processing, incorporating some NLP-like processing, for example
      – Verifying type of name: C or P or F
      – Massaging poorly formed names into better formed names
      – Identifying names in strings that are names-plus (but name not identified as such)
      – Provide context information to enhance matching, for example, date or dates of correspondence, or occupation of creator of records

  • Beyond the Project
    • Building a National Archival Authorities Infrastructure
      – IMLS funded two-year project, October 2011-September 2013
      – EAC-CPF SAA workshops: 140 scholarships
      – National Archival Authorities Cooperative planning
    • SNAC II: a proposal to expand SNAC
      – A lot more data
      – NARA, SI, MARC WorldCat records, a lot more finding aids

  • For More Information
    • http://socialarchive.iath.virginia.edu/ (Project website)
    • http://socialarchive.iath.virginia.edu/xtf/search (public prototype)

  • Social Networks and Archival Context Project: Matching and Merging EAC-CPF Records
    Ray R. Larson, Krishna Janakiraman
    University of California, Berkeley, School of Information
    Thanks to Daniel V. Pitti of the Institute for Advanced Technology in the Humanities, University of Virginia, for many of the slides here
    SAA 2011 - Chicago 2011-08-27
  • SNAC Project
    • The outlines of the project have been discussed by Daniel Pitti previously
    • The primary focus of the Berkeley group for the project is on combining data resources from multiple archives and other information sources
    • In this talk I will focus on our current methods used in the prototype (to be described by Brian Tingle later)
  • Data Contributing Institutions
    • EAD-encoded finding aids
      – Library of Congress (1,159)
      – Online Archive of California (15,400+)
      – Northwest Digital Archive (5,563+)
      – Virginia Heritage (8,390+)
    • Authority records
      – Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
      – Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
      – Virtual International Authority File (intersection with NACO/LCNAF, 5M personal names)
    • Other biographical sources (e.g., DBPedia, IMDB)
  • Methods and Processing
    • Extract EAC-CPF records from existing EAD-encoded archival descriptions
      – Extracting both creators and referenced CPF names
    • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
      – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
    • Create a prototype historical resource and access system
      – Historical data and social-professional networks
      – Links to archive, library, and museum resources (by and about)
  • Merging EAC-CPF Records
    [Figure: merging pipeline. Recoverable labels: LCNAF Repository, ULAN Repository, Cheshire Search, connect records exactly, name matching, authority records, merge information]
  • Authority Control
    • Identifying creator entities and referenced entities (correspondents, etc.)
    • Recording name or names used by and for them
    • Rule-based heading or entry formation and control
  • Controlled Vocabularies
    • Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information
    • That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
  • The Problem
    • Proliferation of the forms of names
      – Different names for the same person
      – Different people with the same names
    • Examples
      – from Books in Print (semi-controlled but not consistent)
      – ERIC author index (not controlled)
  • Goethe …etc…
  • John Muir
  • Pauline Cochrane nee Atherton
  • Pauline Cochrane nee Atherton
  • Name Authority Files
    ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
    KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
    RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
    VST:d 08-21-91 Other Versions: earlier
    040    DLC$cDLC$dDLC$dOCoLC
    053    PR6005.R517
    100 10 Creasey, John
    400 10 Cooke, M. E.
    400 10 Cooke, Margaret,$d1908-1973
    400 10 Cooper, Henry St. John,$d1908-1973
    400 00 Credo,$d1908-1973
    400 10 Fecamps, Elise
    400 10 Gill, Patrick,$d1908-1973
    400 10 Hope, Brian,$d1908-1973
    400 10 Hughes, Colin,$d1908-1973
    400 10 Marsden, James
    400 10 Matheson, Rodney
    400 10 Ranger, Ken
    400 20 St. John, Henry,$d1908-1973
    400 10 Wilde, Jimmy
    500 10 $wnnnc$aAshe, Gordon,$d1908-1973
    (Callout on the 400 fields: different names for the same person)
  • Merging EAC-CPF Records [pipeline diagram repeated]
  • Connect Exact Matches
    • The EAC-CPF records provide the names without having to parse texts, etc.
    • Allows us to use some simple methods like exact matching
      – Assume identical name entries mean the same person/corporate body/family
      – Enter the full names and record IDs into a database and flag IDs with same names for merging
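The exact-match step can be pictured as grouping record IDs by identical name string. The following is a minimal in-memory sketch, assuming a plain dictionary in place of the database the slide mentions; the record IDs and the function name `flag_exact_matches` are hypothetical, while the name strings come from the slides.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name entry; keep names seen 2+ times."""
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical record IDs paired with name entries taken from the slides.
records = [
    ("loc-001", "Oppenheimer, J. Robert, 1904-1967"),
    ("oac-417", "Oppenheimer, J. Robert, 1904-1967"),
    ("oac-902", "Tabb, John B. (John Banister), 1845-1909."),
    ("nwda-33", "Tabb, John Banister, 1845-1909."),
]

flagged = flag_exact_matches(records)
print(flagged)
```

The two Tabb entries are not flagged because the strings differ, which is exactly the kind of miss the later "failures" slides describe.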
  • Merging EAC-CPF Records [pipeline diagram repeated]
  • Search Authority Files
    • For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching)
      – Search both the “authoritative” and “non-authoritative” forms
      – Consider any name matching a non-authoritative form to be a candidate match for the authoritative form
      – Flag EAC records that match the same authority record as potential matches
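A much-simplified picture of this step: index every authoritative and non-authoritative form by its authority record, then flag EAC records that land on the same authority record. This sketch uses an exact dictionary lookup and does not reproduce Cheshire's probabilistic or Boolean search; the function names, the EAC record IDs, and the dictionary layout are assumptions, and the name forms come from the Creasey authority record shown earlier.

```python
def build_variant_index(authority_records):
    """Map every authorized and variant form to the authority IDs that carry it."""
    index = {}
    for authority_id, record in authority_records.items():
        for form in [record["authorized"], *record["variants"]]:
            index.setdefault(form, set()).add(authority_id)
    return index


def flag_by_authority(eac_names, index):
    """Return authority IDs matched by two or more EAC records."""
    matches = {}
    for eac_id, name in eac_names.items():
        for authority_id in sorted(index.get(name, ())):
            matches.setdefault(authority_id, []).append(eac_id)
    return {aid: ids for aid, ids in matches.items() if len(ids) > 1}


authority_records = {
    "NAFL8057230": {
        "authorized": "Creasey, John",
        "variants": ["Cooke, M. E.", "Marsden, James", "Wilde, Jimmy"],
    },
}

# Hypothetical EAC record IDs.
eac_names = {
    "eac-1": "Creasey, John",
    "eac-2": "Marsden, James",
    "eac-3": "Zachary Taylor.",
}

candidates = flag_by_authority(eac_names, build_variant_index(authority_records))
print(candidates)
```

A variant form shared by two different authority records would make a name a candidate for both, so the output is best read as candidates for review, not confirmed matches.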
  • Merging EAC-CPF Records [pipeline diagram repeated]
  • Merge Flagged Records
    • For all of the exact matches and authority matches
      – Use the authoritative form of the name
      – Combine data from each match into a single EAC-CPF record
      – Retain all source record IDs and information
    • Finally, output the merged EAC-CPF records
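The merge rules above can be sketched with plain dictionaries standing in for EAC-CPF records. The field names (`alternative_names`, `source_ids`, `relations`), the record IDs, and the function `merge_records` are illustrative assumptions, not the project's data model.

```python
def merge_records(authorized_name, records):
    """Combine flagged records into one, keeping every source ID."""
    merged = {
        "name": authorized_name,
        "alternative_names": [],
        "source_ids": [],
        "relations": [],
    }
    for record in records:
        merged["source_ids"].append(record["id"])
        # Keep differing name forms as alternatives instead of discarding them.
        name = record["name"]
        if name != authorized_name and name not in merged["alternative_names"]:
            merged["alternative_names"].append(name)
        for relation in record.get("relations", []):
            if relation not in merged["relations"]:
                merged["relations"].append(relation)
    return merged


# Hypothetical flagged records for one entity.
flagged = [
    {
        "id": "loc-001",
        "name": "Oppenheimer, J. Robert, 1904-1967",
        "relations": [("correspondedWith", "Bush, Vannevar, 1890-1974.")],
    },
    {
        "id": "oac-417",
        "name": "Oppenheimer, Robert",
        "relations": [("correspondedWith", "Bush, Vannevar, 1890-1974.")],
    },
]

merged = merge_records("Oppenheimer, J. Robert, 1904-1967", flagged)
print(merged)
```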
  • Inputs to SNAC merging
    • LoC: 43,702 EAC-CPF records derived from 1159 finding aids
    • OAC: 91,811 EAC-CPF records derived from ~15,400 finding aids
    • NWDA: 22,609 EAC-CPF records derived from 5,568 finding aids
    • Result: 123,920 “unique” names
  • Another view of the numbers…
    • 93033 Person names merged from 114639 Person records
    • 30161 Institutions merged from 41177 Institution records
    • 1669 Families merged from 2263 Family records
  • But…
    • Exact merging assumes that archives are following LC cataloging practice in their EAD records
      – There are some problems with this assumption
  • Some failures for merging…
    • Different abbreviations:
      – A. & G. Carisch & C.
      – A. & G. Carisch & Co.
    • And spacing issues:
      – A. C. Peters & Bro.
      – A. C. Peters & Brother.
      – A. C. Peters. (??)
      – A. C.Peters & Bro.
    • Completeness and alternate rules
      – Tabb, John B. (John Banister), 1845-1909.
      – Tabb, John Banister, 1845-1909.
  • More…
    • Variant romanizations (and spacing):
      – M. P. Belaieff.
      – M. P. Belaïeff.
      – M. P. Bieliaev.
      – M.P. Belaïeff.
      – M.P.Belaïeff.
    • Initials vs. names:
      – Zabolotskii, N.A.
      – Zabolotskii, Nikolai Alekseevich, 1903-1958.
      – Zabolotskii.
  • More…
    • Inverted order vs. uninverted
      – Taylor, Zachary, 1784-1850.
      – Zachary Taylor.
    • Various combinations:
      – Tchaikovsky, Peter I.
      – Tchaikovsky, Pëtr Il.
      – Tchaikovsky, Piotr Ilyich.
      – Tchaikovsky, Pyotr Il.
      – Tchaikovsky, Pyotr Ilyich.
  • Another kind of failure
    • Entry for “Zaphiropoulos” - no dates, no first name:
      – The entry from VIAF was for “Zaphiropoulos, Lela, 1941-”
      – But the name in EAD came as an attribution for photos:
        Box 113
        Lot PP13 Zaphiropoulos. [Bas-relief at Troy], 1872.
        Physical Description: 2 photographs
        Scope and Content Note
        Photographs taken for Schliemann.
    • Not sure that the Zaphiropoulos indicated is a person, and definitely not one born in 1941.
  • Addressing the failures
    • First we need to know where things are not working, and why
      – We are planning to do a random sample and detailed evaluation of the database to help identify the problems
    • Many of the problems we have seen already appear to be solvable using:
      – Additional contextual clues from the EAD records
      – More sophisticated matching for phonetic variants
        • Such as n-grams or phonetic schemes like phonex
      – Additional normalization of names before merging
        • For name order, etc.
      – Use of advanced matching methods
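To illustrate what "additional normalization of names before merging" might look like, here is a minimal sketch that folds case, strips diacritics, and collapses punctuation and spacing so that several of the Belaieff and Peters variants map to one key. The specific rules and the function name `normalize_name` are assumptions for this example, not the project's normalization; dropping diacritics in particular is an aggressive choice that may merge names that should stay distinct.

```python
import re
import unicodedata


def normalize_name(name):
    """Build a comparison key: no diacritics, lower case, single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Treat periods and commas as separators so "M.P." and "M. P." agree.
    spaced = re.sub(r"[.,]", " ", without_marks.lower())
    return re.sub(r"\s+", " ", spaced).strip()


variants = ["M. P. Belaieff.", "M. P. Belaïeff.", "M.P. Belaïeff.", "M.P.Belaïeff."]
keys = {normalize_name(v) for v in variants}
print(keys)
```

The variant romanization "M. P. Bieliaev." still produces a different key, so cases like it would need the n-gram or phonetic matching listed above.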
  • Testing new merging methods
    • Work done in conjunction with SNAC for an I School Master's project called Biograph
      – Krishna Janakiraman and Sean Marimpietri
    • Using SNAC and merging with FreeBase and IMDB
  • Albert Einstein
    Einstein, Albert, 1879-1955.
    Einstein, Albert.
    Ainshutain, A. 1879-1955
    Aiyinsitan 1879-1955
    Einstein, A.
    Krishna Janakiraman and Sean Marimpietri - Biograph
  • Our approach
    • Learn binary classifiers over varying names and existence dates
    • Perturb existing information to generate additional samples within specific error levels
  • [Figure: TRAIN and PREDICT pipeline. Features: names; birth and death dates. Metrics: string distance; shingle language model. Learn decision tree classifiers; link records.]
  • Shingle Language Model for names
    Name: Einstein Albert
    Shingle sequence: ein, ins, nst, ste, tei, ein … , ert
    Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein
  • Shingle Language Model for names
    Name 1: Einstein Albert
    Name 2: Ainshtain Albert
    Name 3: Albert Einstein
    [Figure: overlapping three-character shingle sequences for the three names, for example ein, ins, nst, ste, tei; nsh, sht, hta, tai; alb, lbe, ert]
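The shingle sequence on the slide can be reproduced with a sliding three-character window. The sketch below assumes the name is lower-cased and stripped of spaces and punctuation first, and it compares two names with a simple Jaccard overlap of their shingle sets. That overlap is a stand-in chosen for brevity; it is not the shingle language model or the classifier used in Biograph, and the function names are hypothetical.

```python
def shingles(name, size=3):
    """Return overlapping character shingles of a lower-cased, letters-only name."""
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    return [letters[i:i + size] for i in range(len(letters) - size + 1)]


def jaccard(name_a, name_b):
    """Share of distinct shingles that the two names have in common."""
    set_a, set_b = set(shingles(name_a)), set(shingles(name_b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


print(shingles("Einstein Albert"))
print(jaccard("Einstein Albert", "Albert Einstein"))
print(jaccard("Einstein Albert", "Ainshtain Albert"))
```

Because shingles ignore word order, the inverted and direct forms of the name score high, while the variant romanization scores lower but still shares the shingles from "Albert".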
  • Example Decision Tree for Von Neumann
    [Figure: decision tree with nodes labeled Date and String Distance]
  • Krishna Janakiraman and Sean Marimpietri - Biograph
                       TP     FP    FN    TN     TPR      FPR
    Albert Einstein    78     11    25    145    75.7%    7%
    George W Bush      39     9     6     60     86.6%    13%
    Von Neumann        182    14    27    301    75.7%    7%
    Corpus Average: TPR 72.7%, FPR 17%
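For readers unfamiliar with the abbreviations, the true positive rate is TP / (TP + FN) and the false positive rate is FP / (FP + TN). The short sketch below applies these standard definitions to the Albert Einstein counts from the slide; the function name `rates` is an assumption for this example.

```python
def rates(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate) from confusion counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


# Counts for the Albert Einstein test set, as given on the slide.
tpr, fpr = rates(tp=78, fp=11, fn=25, tn=145)
print(f"TPR {tpr:.1%}, FPR {fpr:.1%}")  # TPR 75.7%, FPR 7.1%
```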
  • How many did we link?
    • 15,300 records, thresh = 0.85
    • 1100 records, thresh = 0.9
  • Conclusions
    • There will not be a single merging method, but a staged set of approaches that will allow us to go from the simplest exact matches, to (we hope) reliably identifying various variant forms of a name, etc. when corroborated by contextual (date, etc.) information
    • Once records are merged, they are passed along to Brian for search and display…
  • Discovering Historic Social Networks: Prototype Historical Resource Demo
    Brian Tingle, California Digital Library
    Society of American Archivists 2011 Annual Meeting, August 27, 2011, Chicago
  • Meet the target users
    Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)
    • Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers.
    • Connie: Works at an institution that contributed records to the project. Will be asking how this site would be useful to their users. Wants to understand how their records were used and what the added value is.
    • Quincy: Library school student working to QA record matching.
    • Adele: Person doing authority work during collection processing.
    • Lenny: Likes linked data, and wants to be able to mine the links that have been established programmatically.
  • Home Page
  • Facet tabs
  • Facet tabs
  • Advanced Search
  • Advanced limits match EAC sections
  • XTF result
  • XTF query in the crossQueryResult
  • doing a search
  • spellcheck
  • search results
  • search results
  • EAC record view Identity
  • EAC record view alternative forms of name
  • EAC record view Biographical History
  • HTML 5 microdata in chron list
  • EAC record view Related Entries
  • EAC record view Related Entries
  • RDFa owl:sameAs
  • EAC record view View EAC XML
  • EAC record view Graph Demo
  • Tinkerpop Graph Stack http://www.tinkerpop.com/
    • Property Graph Model
    • graphML
    • RDF Sail support
  • vertex, edge https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph
  • Graph Schema
    vertex
      _id: auto-assigned by neo4j
      _type: vertex
      identity: the name of the entity (string) [indexed]
      urls: \n-separated list of source EAD files
      entityType: corporateBody, family, or person
    edge
      _id: auto-assigned by neo4j
      _type: edge
      _label: correspondedWith or associatedWith
      _inV: incoming vertex _id (from)
      _outV: outgoing vertex _id (to)
      from_name: from identity (string), denormalized
      to_name: to identity (string), denormalized
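The schema on this slide can be sketched as plain Python records. This is only an illustration under stated assumptions: the class names `Vertex` and `Edge`, the helper `make_edge`, and any sample values are hypothetical and not taken from the project's code. Field names follow the slide, including its reading of `_inV` as "from" and `_outV` as "to".

```python
from dataclasses import dataclass, field
from typing import List

# Allowed values, as listed on the slide.
ENTITY_TYPES = {"corporateBody", "family", "person"}
EDGE_LABELS = {"correspondedWith", "associatedWith"}


@dataclass
class Vertex:
    _id: int                 # auto-assigned by neo4j in the real graph
    identity: str            # name of the entity (the indexed property)
    entityType: str          # corporateBody, family, or person
    urls: List[str] = field(default_factory=list)  # source EAD files
    _type: str = "vertex"


@dataclass
class Edge:
    _id: int                 # auto-assigned by neo4j in the real graph
    _label: str              # correspondedWith or associatedWith
    _inV: int                # "from" vertex _id, following the slide
    _outV: int               # "to" vertex _id, following the slide
    from_name: str           # denormalized copy of the "from" identity
    to_name: str             # denormalized copy of the "to" identity
    _type: str = "edge"


def make_edge(edge_id: int, label: str, source: Vertex, target: Vertex) -> Edge:
    """Build an edge and copy both identities onto it (hypothetical helper)."""
    if label not in EDGE_LABELS:
        raise ValueError(f"unknown edge label: {label}")
    return Edge(
        _id=edge_id,
        _label=label,
        _inV=source._id,
        _outV=target._id,
        from_name=source.identity,
        to_name=target.identity,
    )
```

Copying `from_name` and `to_name` onto the edge mirrors the denormalization noted on the slide: a client that has an edge in hand can label both ends without fetching either vertex.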
  • internal id
    indices/name-idx is an index on “identity”; used to look up neo4j record id
  • “bothE” shows in and out edges
    vertices/103994/bothE
    redundant data to save repeated lookups
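The two-step lookup on these slides (name index to internal id, then both in and out edges for that id) can be imitated in memory. This is a minimal sketch, not the graph server's API: the class `TinyGraph`, its method names, and the sample names are assumptions made for illustration; only the id 103994 and the property names come from the slides.

```python
from collections import defaultdict


class TinyGraph:
    """In-memory stand-in for the name index and the bothE lookup."""

    def __init__(self):
        self.name_idx = {}                       # identity -> internal id
        self.edges_by_vertex = defaultdict(list)  # internal id -> edges

    def add_vertex(self, vertex_id, identity):
        self.name_idx[identity] = vertex_id

    def add_edge(self, label, from_id, from_name, to_id, to_name):
        # Both names are stored on the edge (denormalized) so a caller
        # never needs a second lookup to label the other end.
        edge = {
            "_label": label,
            "_inV": from_id,    # "from", following the slide's schema
            "_outV": to_id,     # "to", following the slide's schema
            "from_name": from_name,
            "to_name": to_name,
        }
        self.edges_by_vertex[from_id].append(edge)
        if to_id != from_id:
            self.edges_by_vertex[to_id].append(edge)

    def both_e(self, identity):
        """Resolve a name through the index, then return all its edges."""
        vertex_id = self.name_idx[identity]
        return list(self.edges_by_vertex[vertex_id])


graph = TinyGraph()
graph.add_vertex(103994, "Example Person")      # hypothetical names
graph.add_vertex(1, "Example Family")
graph.add_vertex(2, "Example Corporate Body")
graph.add_edge("correspondedWith", 103994, "Example Person", 1, "Example Family")
graph.add_edge("associatedWith", 2, "Example Corporate Body", 103994, "Example Person")

for edge in graph.both_e("Example Person"):
    print(edge["from_name"], edge["_label"], edge["to_name"])
```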
  • RDF of the social graph
    Thanks Ed Summers!
  • Silvia Mazzini, regesta.exe srl http://templates.xdams.net/IBC/ontology/eac-cpf.rdf
  • Front End Stack
    • golden grid http://code.google.com/p/the-golden-grid/
    • form style http://formalize.me/
    • jquery and jquery ui
    • hoverIntent for advanced search
    • google analytics with event tracking
  • XTF XSLT Framework
    • pre filter - do special tokenization to create custom EAC facets
    • https://docs.google.com/document/d/1wP9x6sdOZTagJNQXoyJfPh0Y6UzQgqLwLI86WSlIPbk/edit?hl=en_US
    • query parser - CGI params to XTF query XML
    • result formatter - XTF results to HTML
    • doc formatter - EAC-CPF to HTML
    • http://code.google.com/p/xtf-cpf/source/browse/?name=xtf-cpf
  • social graph visualization
    • EAC to graphML https://code.google.com/p/eac-graph-load/
    • graphML file with open license should be viewable in other tools
    • old demo uses Dracula Graph Library
    • new demo uses JavaScript InfoVis Toolkit
    • Ed Summers’s “snac hacks” post
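To make the graphML step concrete, here is a minimal sketch of writing vertices and edges out as graphML with Python's standard library. It is not the eac-graph-load code: the function name `to_graphml`, the input shapes, and the key ids are assumptions chosen for illustration, and the property names are borrowed from the graph schema slide.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def _tag(name):
    """Qualify an element name with the graphML namespace."""
    return f"{{{GRAPHML_NS}}}{name}"


def to_graphml(vertices, edges):
    """Serialize vertices and edges to a graphML string.

    vertices: iterable of dicts with "id", "identity", "entityType"
    edges: iterable of (source_id, target_id, label) tuples
    """
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(_tag("graphml"))

    # graphML requires each data attribute to be declared as a <key>.
    for key_id, domain in (("identity", "node"),
                           ("entityType", "node"),
                           ("label", "edge")):
        ET.SubElement(root, _tag("key"), {
            "id": key_id,
            "for": domain,
            "attr.name": key_id,
            "attr.type": "string",
        })

    graph = ET.SubElement(root, _tag("graph"),
                          {"id": "G", "edgedefault": "directed"})

    for vertex in vertices:
        node = ET.SubElement(graph, _tag("node"), {"id": str(vertex["id"])})
        for key_id in ("identity", "entityType"):
            data = ET.SubElement(node, _tag("data"), {"key": key_id})
            data.text = vertex[key_id]

    for number, (source, target, label) in enumerate(edges):
        edge = ET.SubElement(graph, _tag("edge"), {
            "id": f"e{number}",
            "source": str(source),
            "target": str(target),
        })
        data = ET.SubElement(edge, _tag("data"), {"key": "label"})
        data.text = label

    return ET.tostring(root, encoding="unicode")
```

Because the output is ordinary graphML, the same file can be opened in other graph tools, which is the point the slide makes about an openly licensed graphML file.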
  • EAD to EAC XSLT
    • forthcoming from Virginia
  • Record Merging
    • forthcoming from Berkeley
  • Demo
    • http://socialarchive.iath.virginia.edu/xtf/search