EAC-CPF and Social Networks, Society of American Archivists, Chicago, August 201...
But…• Exact merging assumes that archives are  following LC cataloging practice in their EAD  records    –There are some p...
Some failures for merging…• Different abbreviations:    – A. & G. Carisch & C.    – A. & G. Carisch & Co.• And spacing iss...
More…• Variant romanizations (and spacing):    –M. P. Belaieff.    –M. P. Belaïeff.    –M. P. Bieliaev.    –M.P. Belaïeff....
More…• Inverted order vs. uninverted    –Taylor, Zachary, 1784-1850.    –Zachary Taylor.• Various combinations:    –Tchaik...
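Several of the merge failures listed above (spacing around initials, diacritics, inverted versus direct order) can be reduced by comparing a normalized key instead of the raw string. The following is a minimal sketch under stated assumptions, not the project's method: `normalize_name` is a hypothetical helper, and dropping dates and sorting tokens is an illustrative choice that will also increase false merges between different people with the same name. It does not handle variant romanizations or abbreviations such as "C." versus "Co.".

```python
import re
import unicodedata


def normalize_name(name):
    """Reduce a name string to a crude comparison key (illustrative only)."""
    # Strip diacritics: "Belaïeff" becomes "Belaieff".
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lower-case and turn punctuation into spaces so "M.P." equals "M. P.".
    cleaned = re.sub(r"[^\w\s]", " ", ascii_only.lower())
    # Drop purely numeric tokens (dates), an assumption made for this sketch.
    tokens = [tok for tok in cleaned.split() if not tok.isdigit()]
    # Sort tokens so inverted and direct order produce the same key.
    return " ".join(sorted(tokens))


if __name__ == "__main__":
    for raw in ["M. P. Belaieff.", "M.P. Belaïeff.", "M. P. Bieliaev.",
                "Taylor, Zachary, 1784-1850.", "Zachary Taylor."]:
        print(repr(raw), "->", repr(normalize_name(raw)))
```

Keys that still differ after normalization (for example "Belaieff" and "Bieliaev") would need a fuzzier comparison or an authority file lookup.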
Another kind of failure• Entry for “Zaphiropoulos” - no dates, no first name:    – The entry from VIAF was for “Zaphiropou...
Addressing the failures• First we need to know where things are not working,  and why    – We are planning to do a random ...
Testing new merging methods• Work done in conjunction with SNAC for a I  School Masters’ project called Biograph    –Krish...
Einstein, Albert, 1879-1955.                     Einstein, Albert.                     Ainshutain, A. 1879-1955           ...
Learn binary classifiers over varying names                                       and existence dates             Our appr...
0T                                                                           FeaturesR                  FeaturesA         ...
Name: Einstein Albert Shingle sequence: ein, ins, nst, ste, tei, ein … , ert  Probability that the sequence (ins, nst, ste...
Name 1 : Einstein Albert                       Name 2 : Ainshtain Albert                     Name 3 : Albert Einstein     ...
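The shingle sequence shown for "Einstein Albert" corresponds to overlapping three-character windows over the name with the space removed. The sketch below generates those shingles and compares two names by Jaccard overlap of their shingle sets. The Jaccard score is a swapped-in, simpler measure chosen for illustration; the slides describe a probability over shingle sequences, which this code does not implement. The function names are hypothetical.

```python
def shingles(name, k=3):
    """Character k-grams over the name with case and whitespace removed."""
    s = "".join(name.lower().split())
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def jaccard(a, b, k=3):
    """Jaccard overlap of the two names' shingle sets, from 0.0 to 1.0."""
    sa, sb = set(shingles(a, k)), set(shingles(b, k))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(shingles("Einstein Albert"))
    print(jaccard("Einstein Albert", "Ainshtain Albert"))
    print(jaccard("Einstein Albert", "George W Bush"))
```

Because the shingles are taken in sequence, "Albert Einstein" and "Einstein Albert" share most but not all shingles; a set-based score tolerates that reordering better than an exact string comparison does.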
Date                                                              String Distance            Example Decision Tree For Kri...
Albert Einstein         George W Bush            Von Neumann TP:78         FP:11    TP:39    FP:9              TP:182     ...
15,300 records, thresh = 0.85                                  1100 records, thresh = 0.9                     How many did...
Conclusions• There will not be a single merging method,  but a staged set of approaches that will allow  us to go from the...
Discovering Historic  Social Networks       Prototype Historical Resource Demo       Brian Tingle, California Digital Libr...
Meet the target users: Personas are fictional characters created to represent the different user types within a targeted dem...
Home Page
Facet tabs
Facet tabs
Advanced Search
Advanced limits match EAC        sections
XTF result
XTF query in the crossQueryResult
doing a search
spellcheck
search results
search results
EAC record view                  Identity
EAC record view           alternative forms of name
EAC record view  Biographical History
HTML 5 microdata in chron list
EAC record view  Related Entries
EAC record view  Related Entries
RDFa owl:sameAs
EAC record view      View EAC XML
EAC record view       Graph Demo
Tinkerpop Graph Stack: http://www.tinkerpop.com/ • Property Graph Model • graphML • RDF Sail support
vertex                                                       edgehttps://github.com/tinkerpop/gremlin/wiki/Defining-a-Prop...
Graph Schema   vertex  _id: auto-assigned by neo4j  _type: vertex  identity: the name of the entity (string) [indexed]  ur...
internal id    indices/name-idx is an index on“identity”; used to look up neo4j record                    id
“bothE” shows in and out edges               vertices/103994/bothE                      redundant data to save repeated   ...
RDF of the social graph                          Thanks Ed Summers!
Silvia Mazzini                                    regesta.exe srlhttp://templates.xdams.net/IBC/ontology/eac-cpf.rdf
Front End Stack• golden grid  http://code.google.com/p/the-golden-grid/• form style http://formalize.me/• jquery and jquer...
XTF XSLT Framework• pre filter - do special tokenization to create custom   EAC facets  • https://docs.google.com/document/...
social graph visualization• EAC to graphML  https://code.google.com/p/eac-graph-load/• graphML file with open license shoul...
EAD to EAC XSLT• forthcoming from Virginia
Record Merging• forthcoming from Berkeley
Demo• http://socialarchive.iath.virginia.edu/xtf/search
  • Speaker notes for slide 6:
      Flexible description: series description; dispersed collections
      Cooperative authority control: dispersed collections; but also creator of one collection is referenced in a collection created by someone else (co-referencing); economic and descriptive benefits
      Integrated access to cultural heritage: context for archival records, essential, but the descriptions can also provide context for all types of resources
      Archival authority records, like museum authority records, provide historical and biographical data that can enhance identification and understanding (biographical dictionary; administrative histories)
  • Speaker notes for slide 23: Remember that we will solicit public evaluation and suggestions on drafts of the public interface, starting in the fall.

    1. EAC-CPF and Social Networks
       Society of American Archivists, Chicago, August 2011
       Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia

    2. SNAC Overview
       • Funding and Timeline
       • Project Team
       • Project Objectives and Rationale
       • Data Contributing Institutions
       • Archival Standards Employed
       • Methods, Processing, and Products
       • Year One Extraction Results
       • Basic Observations on Extraction

    3. Funding and Timeline
       • National Endowment for the Humanities
       • A Preservation and Access, Research and Development grant
       • Two-year project
       • May 2010-April 2012

    4. Project Team
       • Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia)
       • Adrian Turner and Brian Tingle (California Digital Library, University of California)
       • Ray Larson (School of Information, University of California, Berkeley)

    5. Project Objectives
       • Archival finding aids currently intermix description of records with description of the creators of records and persons evident in the records
       • Further the ongoing process of transforming archival description using advanced technologies
       • By facilitating the separation of the description of people from the description of records
       • Using EAC-CPF, an international archival authority control standard
       • Goal: enhance the economy and effectiveness of archival description to enhance access and understanding of users of archives, libraries, and museums

    6. Rationale for Separation
       • Authority control of forms of names
       • Flexible description
       • Cooperative authority control
       • Integrated access to cultural heritage
       • Biographical/historical resource
       • Social/historical context (social-professional networks)

    7. The Data
       • EAD-encoded finding aids
         – Library of Congress (1,159)
         – Online Archive of California (~15,400)
         – Northwest Digital Archive (5,160)
         – Virginia Heritage (8,390)
       • Authority records
         – Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
         – Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
         – Virtual International Authority File (5M+ personal names)

    8. Methods and Processing
       • Extract EAC-CPF records from existing EAD-encoded archival descriptions
         – Extracting both creators and referenced CPF names
       • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF); merge records for the same entity
         – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
         – Key challenge: two or more people with the same name; two or more names for the same person
       • Create a prototype historical resource and access system
         – Historical data and social-professional networks
         – Links to archive, library, and museum resources (by and about)

    9. EAD Source Data
       • Encoded Archival Description
         – Intermixes description of creators of records and, at the discretion of the archivists, names associated with the content of the records
         – Detailed description of creators of records
       • Widely varying quality
         – In the number of names identified and encoded
         – In the formation of the names (direct or inverted, capitalization, punctuation, and so on)
         – In the categorization of names (personal, corporate, or family)
       • Many names given but not identified as such
       • Most important of these in biographies/histories and in correspondence description
       • Extraction has focused on the “low hanging fruit,” that is, the names tagged as names
       • Attention shifting to names not identified as such

    10. Archival Records
        • Records are the by-products of people living and working as individuals, in organized groups, in families
        • Records document people living and working
        • People exist in social-professional contexts, in relation to others
        • Records document these relations
        • All records created by the same entity are described together (a fonds or collection)
          – Creators documented in detail
          – Many of the people documented in the record referenced in description
        • Archival descriptions document interrelations among people and records (documents)

    11. Source: J. Robert Oppenheimer Papers (LoC)
        <origination>
          <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
        </origination>
        <controlaccess>
          <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
          <persname source="lcnaf" encodinganalog="600" role="subject">Bethe, Hans Albrecht, 1906- --Correspondence</persname>
          <!-- […] -->
          <persname source="lcnaf" encodinganalog="600" role="subject">Born, Max, 1882-1970 --Correspondence</persname>
          <persname source="lcnaf" encodinganalog="600" role="subject">Boyd, Julian P. (Julian Parks), 1903- --Correspondence</persname>
          <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
          <persname source="lcnaf" encodinganalog="600" role="subject">Casals, Pablo, 1876-1973 --Correspondence</persname>
          <!-- […] -->
          <corpname source="lcnaf" encodinganalog="610" role="subject">Institute for Advanced Study (Princeton, N.J.)</corpname>
          <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
          <!-- […] -->
        </controlaccess>
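The "low hanging fruit" extraction described earlier, pulling out names that are already tagged as names, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the project's EAD-to-EAC transformation: the sample fragment is a trimmed, hypothetical stand-in for a real finding aid, it carries no XML namespace (EAD 2002 schema documents do), and `extract_names` and `ENTITY_TYPES` are names invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down EAD fragment modelled on the slide above.
EAD_SAMPLE = """
<archdesc>
  <did>
    <origination>
      <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
    </origination>
  </did>
  <controlaccess>
    <persname source="lcnaf" role="subject">Born, Max, 1882-1970 --Correspondence</persname>
    <corpname source="lcnaf" role="subject">Los Alamos Scientific Laboratory</corpname>
  </controlaccess>
</archdesc>
"""

# EAD name elements mapped to EAC-CPF entity types.
ENTITY_TYPES = {"persname": "person", "corpname": "corporateBody", "famname": "family"}


def extract_names(ead_xml):
    """Return (entity_type, name, is_creator) tuples for tagged names."""
    root = ET.fromstring(ead_xml)
    found = []
    # Names under <origination> are creators; names under <controlaccess>
    # are treated here as referenced entities.
    for section, is_creator in (("origination", True), ("controlaccess", False)):
        for parent in root.iter(section):
            for el in parent:  # direct children only
                if el.tag in ENTITY_TYPES:
                    text = " ".join("".join(el.itertext()).split())
                    # Drop subject subdivisions such as "--Correspondence".
                    name = text.split("--")[0].strip()
                    found.append((ENTITY_TYPES[el.tag], name, is_creator))
    return found


if __name__ == "__main__":
    for row in extract_names(EAD_SAMPLE):
        print(row)
```

Names that appear only in prose or in component titles are not caught by this approach, which is the gap the later slides on extraction refer to.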

    12. Source: Leonard Bernstein Collection (LoC)
        <c02>
          <did>
            <container type="box">1</container>
            <unittitle>Aaltonen, Erkki <unitdate era="ce" calendar="gregorian">1981</unitdate></unittitle>
            <physdesc>
              <extent>1</extent>
            </physdesc>
          </did>
        </c02>
        <c02>
          <did>
            <unittitle>Abbado, Claudio <unitdate era="ce" calendar="gregorian">1963-90</unitdate></unittitle>
            <physdesc>
              <extent>5</extent>
            </physdesc>
          </did>
        </c02>
        […]

    13. <bioghist>
          <head>Biographical Sketch</head>
          <p>José Marcos Mugarrieta, prior to his term as Mexican consul in San Francisco 1857-1863, served in the Mexican army from 1837. He saw action in numerous battles and campaigns – Jamaica, under General Canalizo in 1841; Campeche, 1842-1843; Merida, 1843; Veracruz, 1845; Mexico City, 1846; Angostura and Cerro-gordo, 1847; Guanajuato, 1848, and Sierra-Gorda under Bustamante, 1848-1849; and Matamoros, 1849-1850. […]</p>
          <p>In April 1857 Mugarrieta received an appointment from the Comonfort government for the consulship in San Francisco. He did not actually begin his new duties until September 1, 1859, due to illness and to the political situation in Mexico. […]</p>
        </bioghist>

    14. <bioghist>
          <head>Chronology</head>
          <chronlist>
            <chronitem>
              <date>1900</date>
              <event>Born on Jan. 20 in Hastings, Minnesota.</event>
            </chronitem>
            <chronitem>
              <date>1922</date>
              <event>Received baccalaureate from Princeton University, major in philosophy.</event>
            </chronitem>
            […]
            <chronitem>
              <date>1965</date>
              <event>Died on April 4.</event>
            </chronitem>
          </chronlist>
        </bioghist>

    15. EAC-CPF
        • Encoded Archival Context-Corporate bodies, Persons, Families
        • An international communication standard for archival authority control
        • Based on the International Council on Archives' International Standard Archival Authority Record for Corporate Bodies, Persons and Families (ISAAR(CPF))
        • SAA Standards Committee, Technical Subcommittee on Encoded Archival Context
        • Co-chairs
          – Katherine Wisser, Simmons College
          – Anila Angjeli, Bibliothèque nationale de France

    16. Library and Archive Authority Control
        • Library (or bibliographic) authority control is almost exclusively about the control of names
        • Archival authority control involves biographical-historical description of the CPF entity
          – Descriptions based on controlled vocabularies or values, for example, occupations, place of birth and death
          – But also biographical-historical description
            • Prose
            • Chronological list
        • Archival authority control provides context for understanding records, the context of their creation, the provenance

    17. <identity>
          <entityType>person</entityType>
          <nameEntry scriptCode="Latn" xml:lang="eng">
            <part>Oppenheimer, J. Robert, 1904-1967.</part>
            <authorizedForm>AACR2</authorizedForm>
          </nameEntry>
          <nameEntry localType="VIAF:MainHeading">
            <part>Oppenheimer, J. Robert (Julius Robert), 1904-1967</part>
            <alternativeForm>VIAF</alternativeForm>
          </nameEntry>
          <nameEntry localType="VIAF:MainHeading">
            <part>Oppenheimer, Julius Robert, 1904-1967</part>
            <alternativeForm>VIAF</alternativeForm>
          </nameEntry>
          <nameEntry localType="VIAF:x400">
            <part>Oppenheimer, Robert</part>
            <alternativeForm>VIAF</alternativeForm>
          </nameEntry>
          <nameEntry localType="VIAF:x400">
            <part>Ou-pẽn-hai-mo, 1904-1967</part>
            <alternativeForm>VIAF</alternativeForm>
          </nameEntry>
        </identity>

    18. <existDates>
          <dateRange>
            <fromDate standardDate="1904-04-22">1904, Apr. 22</fromDate>
            <toDate standardDate="1967-02-18">1967, Feb. 18</toDate>
          </dateRange>
        </existDates>
        <!-- ... -->
        <localDescription localType="subject">
          <term>Science--Societies, etc.</term>
        </localDescription>
        <localDescription localType="VIAF:nationality">
          <placeEntry countryCode="US"/>
        </localDescription>
        <localDescription localType="VIAF:gender">
          <term>Male</term>
        </localDescription>
        <languageUsed>
          <language languageCode="eng"/>
        </languageUsed>
        <occupation>
          <term>Physicists.</term>
        </occupation>
        <!-- ... -->

    19. <chronList>
          <chronItem>
            <date>1904, Apr. 22</date>
            <placeEntry>New York, N.Y.</placeEntry>
            <event>Born, New York, N.Y.</event>
          </chronItem>
          <!-- ... -->
          <chronItem>
            <date>1943-1945</date>
            <placeEntry>Los Alamos, N. Mex.</placeEntry>
            <event>Director, Los Alamos Scientific Laboratory, Los Alamos, N. Mex.</event>
          </chronItem>
          <!-- ... -->
          <chronItem>
            <date>1954</date>
            <event>(1) Denied security clearance […] (2) Published Science and the Common Understanding […]</event>
          </chronItem>
          <!-- ... -->
          <chronItem>
            <date>1967, Feb. 18</date>
            <placeEntry>Princeton, N.J.</placeEntry>
            <event>Died, Princeton, N.J.</event>
          </chronItem>
        </chronList>

    20. <cpfRelation xmlns:xlink="http://www.w3.org/1999/xlink"
            xlink:type="simple"
            xlink:role="http://RDVocab.info/uri/schema/FRBRentitiesRDA/Person"
            xlink:arcrole="correspondedWith">
          <relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
          <descriptiveNote>
            <p>recordId: DLC.ms998007.r007</p>
          </descriptiveNote>
        </cpfRelation>

    21. <resourceRelation xmlns:xlink="http://www.w3.org/1999/xlink" xlink:arcrole="creatorOf"
            xlink:role="archivalRecords" xlink:type="simple"
            xlink:href="http://hdl.loc.gov/loc.mss/eadmss.ms998007">
          <relationEntry>J. Robert Oppenheimer Papers, 1799-1980 (bulk 1947-1967)</relationEntry>
          <objectXMLWrap>
            <did xmlns="urn:isbn:1-931666-22-9">
              <unittitle>Papers <unitdate normal="1799/1980" era="ce" calendar="gregorian">1799-1980</unitdate><unitdate label="Bulk Dates" type="bulk" normal="1947/1967" era="ce" calendar="gregorian">(bulk 1947-1967)</unitdate></unittitle>
              <unitid countrycode="US" repositorycode="US-DLC">MSS35188</unitid>
              <origination label="Creator">
                <persname>Oppenheimer, J. Robert, 1904-1967</persname>
              </origination>
              <!-- ... -->
              <repository><corpname>Manuscript Division. Library of Congress</corpname></repository>
              <abstract>Physicist and director of the Institute for Advanced Study, Princeton, New Jersey. [...] Topics include theoretical physics, development of the atomic bomb, the relationship between government and science, nuclear energy, security, and national loyalty.</abstract>
            </did>
          </objectXMLWrap>
        </resourceRelation>

    22. Year One Results-Extraction
        • EAC-CPF records extracted
          – LoC: 43,702 from 1,159 finding aids
          – OAC: 91,811 from ~15,400
          – NWDA: 22,609 from 5,160
          – VH: 15,175 from 8,390
          – Total: 173,297
          – Note: a more recent extraction yielded 196,218, but we have not had time to analyze the results

    23. Early Observations-Extraction
        • Depth of analysis and quality of description of CPF entities varies widely in EAD-encoded finding aids
          – LoC: a lot of names under authority control
          – OAC and NWDA have fewer names, and control varies
        • To be fair, the finding aids were created without SNAC processing in mind!

    24. Next on Extraction
        • Refine extraction processing, incorporating some NLP-like processing, for example
          – Verifying type of name: C or P or F
          – Massaging poorly formed names into better formed names
          – Identifying names in strings that are names-plus (but name not identified as such)
          – Provide context information to enhance matching, for example, date or dates of correspondence, or occupation of creator of records

    25. Beyond the Project
        • Building a National Archival Authorities Infrastructure
          – IMLS funded two-year project, October 2011-September 2013
          – EAC-CPF SAA workshops: 140 scholarships
          – National Archival Authorities Cooperative planning
        • SNAC II: a proposal to expand SNAC
          – A lot more data
          – NARA, SI, MARC WorldCat records, a lot more finding aids

    26. For More Information
        • http://socialarchive.iath.virginia.edu/ (Project website)
        • http://socialarchive.iath.virginia.edu/xtf/search (public prototype)

    27. Social Networks and Archival Context Project: Matching and Merging EAC-CPF Records
        Ray R. Larson, Krishna Janakiraman
        University of California, Berkeley, School of Information
        Thanks to Daniel V. Pitti of the Institute for Advanced Technology in the Humanities, University of Virginia, for many of the slides here
        SAA 2011 - Chicago, 2011-08-27
    28. SNAC Project
        • The outlines of the project have been discussed by Daniel Pitti previously
        • The primary focus of the Berkeley group for the project is on combining data resources from multiple archives and other information sources
        • In this talk I will focus on our current methods used in the prototype (to be described by Brian Tingle later)
    29. Data Contributing Institutions
        • EAD-encoded finding aids
          – Library of Congress (1159)
          – Online Archive of California (15,400+)
          – Northwest Digital Archive (5,563+)
          – Virginia Heritage (8,390+)
        • Authority records
          – Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
          – Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
          – Virtual International Authority File (intersection with NACO/LCNAF, 5M personal names)
        • Other biographical sources (e.g., DBPedia, IMDB)
    30. Methods and Processing
        • Extract EAC-CPF records from existing EAD-encoded archival descriptions
          – Extracting both creators and referenced CPF names
        • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
          – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
        • Create a prototype historical resource and access system
          – Historical data and social-professional networks
          – Links to archive, library, and museum resources (by and about)
    31. Merging EAC-CPF Records
        [Diagram; recoverable labels: LCNAF Repository, ULAN Repository, Cheshire Search, connect exactly matching names, connect records using authority information, merge records]
    32. Authority Control
        • Identifying creator entities and referenced entities (correspondents, etc.)
        • Recording name or names used by and for them
        • Rule-based heading or entry formation and control
    33. Controlled Vocabularies
        • Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information
        • That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
    34. The Problem
        • Proliferation of the forms of names
          – Different names for the same person
          – Different people with the same names
        • Examples
          – from Books in Print (semi-controlled but not consistent)
          – ERIC author index (not controlled)
    35. Goethe …etc…
    36. John Muir
    37. Pauline Cochrane nee Atherton
    38. Pauline Cochrane nee Atherton
    39. Name Authority Files
        ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
        KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
        RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
        VST:d 08-21-91
        Other Versions: earlier
        040 DLC$cDLC$dDLC$dOCoLC
        053 PR6005.R517
        100 10 Creasey, John
        400 10 Cooke, M. E.
        400 10 Cooke, Margaret,$d1908-1973
        400 10 Cooper, Henry St. John,$d1908-1973
        400 00 Credo,$d1908-1973
        400 10 Fecamps, Elise
        400 10 Gill, Patrick,$d1908-1973
        400 10 Hope, Brian,$d1908-1973
        400 10 Hughes, Colin,$d1908-1973
        400 10 Marsden, James
        400 10 Matheson, Rodney
        400 10 Ranger, Ken
        400 20 St. John, Henry,$d1908-1973
        400 10 Wilde, Jimmy
        500 10 $wnnnc$aAshe, Gordon,$d1908-1973
        (Callout on the 400 fields: different names for the same person)
    40. 40. Merging EAC-CPF Records [diagram of the merging pipeline: connect records using exactly matching names; search authority records with Cheshire; merge record information]
    41. 41. Connect Exact Matches • The EAC-CPF records provide the names without having to parse texts, etc. • Allows us to use some simple methods like exact matching – Assume identical name entries mean the same person/corporate body/family – Enter the full names and record IDs into a database and flag IDs with the same names for merging
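The exact-matching step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record IDs are hypothetical, a dictionary stands in for the database, and names are compared as raw strings with no normalization.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name entry.

    `records` is an iterable of (record_id, name) pairs. Only names that
    occur on more than one record are returned, i.e. the IDs flagged
    for merging.
    """
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical record IDs; the name strings are taken from the slides.
records = [
    ("loc-001", "Oppenheimer, J. Robert, 1904-1967"),
    ("oac-417", "Oppenheimer, J. Robert, 1904-1967"),
    ("nwda-09", "Taylor, Zachary, 1784-1850."),
    ("loc-222", "Zachary Taylor."),
]
flagged = flag_exact_matches(records)
```

Here only the two Oppenheimer records are flagged; the two Taylor entries differ as strings and stay separate, which is the kind of miss discussed in the later failure slides.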
    42. 42. Merging EAC-CPF Records [diagram of the merging pipeline: connect records using exactly matching names; search authority records with Cheshire; merge record information]
    43. 43. Search Authority Files • For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching) – Search both the “authoritative” and “non-authoritative” forms – Consider any name matching a non-authoritative form to be a candidate match for the authoritative form – Flag EAC records that match the same authority record as potential matches
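A minimal sketch of the flagging logic described above, assuming a small in-memory dictionary in place of VIAF and a plain exact lookup in place of Cheshire's probabilistic and Boolean search. The authority entry reuses a few headings from the Name Authority Files slide; the EAC record IDs and the data layout are hypothetical.

```python
from collections import defaultdict

# Toy stand-in for an authority file: one authoritative heading plus
# some of its non-authoritative (variant) forms.
AUTHORITY_FILE = {
    "NAFL8057230": {
        "authoritative": "Creasey, John",
        "variants": ["Cooke, M. E.", "Marsden, James", "Ranger, Ken"],
    },
}


def build_lookup(authority_file):
    """Map every authoritative and variant form to its authority ID."""
    lookup = {}
    for auth_id, entry in authority_file.items():
        lookup[entry["authoritative"]] = auth_id
        for variant in entry["variants"]:
            lookup[variant] = auth_id
    return lookup


def flag_authority_matches(records, authority_file):
    """Return authority IDs matched by more than one EAC record."""
    lookup = build_lookup(authority_file)
    ids_by_authority = defaultdict(list)
    for record_id, name in records:
        auth_id = lookup.get(name)
        if auth_id is not None:
            ids_by_authority[auth_id].append(record_id)
    return {a: ids for a, ids in ids_by_authority.items() if len(ids) > 1}


eac_records = [("r1", "Creasey, John"), ("r2", "Ranger, Ken"), ("r3", "Muir, John")]
potential = flag_authority_matches(eac_records, AUTHORITY_FILE)
```

Records `r1` and `r2` carry different name strings but resolve to the same authority record, so they are flagged as potential matches even though exact matching would miss them.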
    44. 44. Merging EAC-CPF Records [diagram of the merging pipeline: connect records using exactly matching names; search authority records with Cheshire; merge record information]
    45. 45. Merge Flagged Records • For all of the exact matches and authority matches – Use the authoritative form of the name – Combine data from each match into a single EAC-CPF record – Retain all source record IDs and information • Finally, output the merged EAC-CPF records
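As a rough illustration of the merge step, the sketch below combines flagged records into one. It assumes each record is a simple dictionary rather than EAC-CPF XML; the field names (`id`, `name`, `relations`), the record IDs, and the choice of authoritative form are all hypothetical.

```python
def merge_records(records, authoritative_name):
    """Combine flagged records into one, keeping every source record ID."""
    merged = {
        "name": authoritative_name,
        "alternative_names": [],
        "source_ids": [],
        "relations": [],
    }
    for rec in records:
        merged["source_ids"].append(rec["id"])
        # Keep non-authoritative forms as alternative names, without duplicates.
        if rec["name"] != authoritative_name and rec["name"] not in merged["alternative_names"]:
            merged["alternative_names"].append(rec["name"])
        # Union the related entries while preserving first-seen order.
        for relation in rec.get("relations", []):
            if relation not in merged["relations"]:
                merged["relations"].append(relation)
    return merged


flagged = [
    {"id": "loc-1", "name": "Tabb, John B. (John Banister), 1845-1909.", "relations": ["x"]},
    {"id": "oac-2", "name": "Tabb, John Banister, 1845-1909.", "relations": ["x", "y"]},
]
# The authoritative form is picked arbitrarily here for illustration.
merged = merge_records(flagged, "Tabb, John Banister, 1845-1909.")
```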
    46. 46. Inputs to SNAC merging • LoC: 43,702 EAC-CPF records derived from 1,159 finding aids • OAC: 91,811 EAC-CPF records derived from ~15,400 finding aids • NWDA: 22,609 EAC-CPF records derived from 5,568 finding aids • Result: 123,920 “unique” names
    47. 47. Another view of the numbers… • 93,033 Person names merged from 114,639 Person records • 30,161 Institutions merged from 41,177 Institution records • 1,669 Families merged from 2,263 Family records
    48. 48. But… • Exact merging assumes that archives are following LC cataloging practice in their EAD records – There are some problems with this assumption
    49. 49. Some failures for merging… • Different abbreviations: – A. & G. Carisch & C. – A. & G. Carisch & Co. • And spacing issues: – A. C. Peters & Bro. – A. C. Peters & Brother. – A. C. Peters. (??) – A. C.Peters & Bro. • Completeness and alternate rules – Tabb, John B. (John Banister), 1845-1909. – Tabb, John Banister, 1845-1909.
    50. 50. More… • Variant romanizations (and spacing): – M. P. Belaieff. – M. P. Belaïeff. – M. P. Bieliaev. – M.P. Belaïeff. – M.P.Belaïeff. • Initials vs. names: – Zabolotskii, N.A. – Zabolotskii, Nikolai Alekseevich, 1903-1958. – Zabolotskii.
    51. 51. More… • Inverted order vs. uninverted – Taylor, Zachary, 1784-1850. – Zachary Taylor. • Various combinations: – Tchaikovsky, Peter I. – Tchaikovsky, Pëtr Il. – Tchaikovsky, Piotr Ilyich. – Tchaikovsky, Pyotr Il. – Tchaikovsky, Pyotr Ilyich.
    52. 52. Another kind of failure • Entry for “Zaphiropoulos” - no dates, no first name: – The entry from VIAF was for “Zaphiropoulos, Lela, 1941-” – But the name in EAD came as an attribution for photos: – Box 113 – Lot PP13 Zaphiropoulos. [Bas-relief at Troy], 1872. – Physical Description: 2 photographs – Scope and Content Note – Photographs taken for Schliemann. • Not sure that the Zaphiropoulos indicated is a person, and definitely not one born in 1941.
    53. 53. Addressing the failures • First we need to know where things are not working, and why – We are planning to do a random sample and detailed evaluation of the database to help identify the problems • Many of the problems we have seen already appear to be solvable using: – Additional contextual clues from the EAD records – More sophisticated matching for phonetic variants • Such as n-grams or phonetic schemes like Phonex – Additional normalization of names before merging • For name order, etc. – Use of advanced matching methods
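Two of the remedies listed above, normalization before merging and n-gram matching, can be sketched as follows. This is a minimal example under stated assumptions: the normalization rules (strip diacritics, lowercase, collapse punctuation and spacing) and the Jaccard measure over character trigrams are illustrative choices, not the project's actual methods.

```python
import re
import unicodedata


def normalize_name(name):
    """Strip diacritics, lowercase, and collapse punctuation and spacing."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9]+", " ", without_marks.lower())
    return " ".join(cleaned.split())


def char_ngrams(text, n=3):
    """Set of overlapping character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the character n-grams of two normalized names."""
    grams_a = char_ngrams(normalize_name(a), n)
    grams_b = char_ngrams(normalize_name(b), n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


same_after_cleanup = normalize_name("M.P.Belaïeff.") == normalize_name("M. P. Belaieff.")
close = ngram_similarity("Tchaikovsky, Pyotr Ilyich.", "Tchaikovsky, Piotr Ilyich.")
far = ngram_similarity("Tchaikovsky, Pyotr Ilyich.", "Taylor, Zachary, 1784-1850.")
```

Normalization alone resolves the spacing and diacritic variants, while the trigram score ranks the two Tchaikovsky romanizations as much closer than an unrelated name. Inverted versus uninverted name order is not handled by this sketch.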
    54. 54. Testing new merging methods • Work done in conjunction with SNAC for an I School Master's project called Biograph – Krishna Janakiraman and Sean Marimpietri • Using SNAC and merging with Freebase and IMDb
    55. 55. • Einstein, Albert, 1879-1955. • Einstein, Albert. • Ainshutain, A. 1879-1955 • Aiyinsitan 1879-1955 • Einstein, A. • Albert Einstein • Albert Einstein (Krishna Janakiraman and Sean Marimpietri - Biograph)
    56. 56. Our approach • Learn binary classifiers over varying names and existence dates • Perturb existing information to generate additional samples within specific error levels (Krishna Janakiraman and Sean Marimpietri - Biograph)
    57. 57. [diagram: TRAIN: names and birth and death dates → features (string distance, Shingle Language Model metrics) → learn decision tree classifiers; PREDICT: names → features → link records] (Krishna Janakiraman and Sean Marimpietri - Biograph)
    58. 58. Shingle Language Model for names • Name: Einstein Albert • Shingle sequence: ein, ins, nst, ste, tei, ein, …, ert • Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein (Krishna Janakiraman and Sean Marimpietri - Biograph)
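The shingle sequence on this slide can be reproduced with a short Python sketch. The transcript does not say how Biograph built its language model, so the details here are assumptions: names are lowercased with spaces dropped, shingles are character trigrams, and the model is a simple first-order estimate of how often one shingle follows another.

```python
from collections import Counter, defaultdict


def shingles(name, n=3):
    """Ordered list of overlapping character n-grams, ignoring case and spaces."""
    text = "".join(name.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_transitions(names, n=3):
    """Count how often each shingle is followed by each other shingle."""
    counts = defaultdict(Counter)
    for name in names:
        sequence = shingles(name, n)
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return counts


def transition_probability(counts, current, following):
    """Estimate P(following | current) from the transition counts."""
    successors = counts.get(current, Counter())
    total = sum(successors.values())
    return successors[following] / total if total else 0.0


sequence = shingles("Einstein Albert")
model = train_transitions(["Einstein Albert"])
p_ins_after_ein = transition_probability(model, "ein", "ins")
```

With this toy training set of one name, `ein` occurs twice and is followed by `ins` only once, so the estimate is 0.5; a model trained on many name variants would give more meaningful probabilities.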
    59. 59. Shingle Language Model for names • Name 1: Einstein Albert • Name 2: Ainshtain Albert • Name 3: Albert Einstein [figure: shingle sequences of the three names shown side by side] (Krishna Janakiraman and Sean Marimpietri - Biograph)
    60. 60. Example Decision Tree For Von Neumann [figure: decision tree with nodes labeled Date and String Distance] (Krishna Janakiraman and Sean Marimpietri - Biograph)
    61. 61. • Albert Einstein: TP: 78, FP: 11, FN: 25, TN: 145; TPR: 75.7%, FPR: 7% • George W Bush: TP: 39, FP: 9, FN: 6, TN: 60; TPR: 86.6%, FPR: 13% • Von Neumann: TP: 182, FP: 14, FN: 27, TN: 301; TPR: 75.7%, FPR: 7% • Corpus Average: TPR: 72.7%, FPR: 17% (Krishna Janakiraman and Sean Marimpietri - Biograph)
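The rates on this slide follow from the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below recomputes them from the counts as transcribed, assuming the three groups of counts belong to the three names in the order listed. Under that reading the Einstein and Bush figures agree with the slide, while the Von Neumann counts work out to roughly 87.1% and 4.4% rather than the 75.7% and 7% shown; the transcript alone does not reveal whether the counts or the rates are the intended values.

```python
def rates(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate) for one confusion matrix."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


# Counts in (TP, FP, FN, TN) order, as read from the slide.
confusion = {
    "Albert Einstein": (78, 11, 25, 145),
    "George W Bush": (39, 9, 6, 60),
    "Von Neumann": (182, 14, 27, 301),
}

for name, counts in confusion.items():
    tpr, fpr = rates(*counts)
    print(f"{name}: TPR {tpr:.1%}, FPR {fpr:.1%}")
```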
    62. 62. How many did we link? • 15,300 records, thresh = 0.85 • 1,100 records, thresh = 0.9
    63. 63. Conclusions • There will not be a single merging method, but a staged set of approaches that will allow us to go from the simplest exact matches to (we hope) reliably identifying variant forms of a name when corroborated by contextual information (dates, etc.) • Once records are merged, they are passed along to Brian for search and display…
    64. 64. Discovering Historic Social Networks: Prototype Historical Resource Demo • Brian Tingle, California Digital Library • Society of American Archivists 2011 Annual Meeting, August 27, 2011, Chicago
    70. 70. Meet the target users • Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing) • Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers. • Connie: Works at an institution that contributed records to the project. Will be asking themselves how this site would be useful to their users. Wants to understand how their records were used and what the added value is. • Quincy: Library school student working to QA record matching. • Adele: Person doing authority work during collection processing. • Lenny: Likes linked data, and wants to be able to mine the established links programmatically.
    71. 71. Home Page
    72. 72. Facet tabs
    73. 73. Facet tabs
    74. 74. Advanced Search
    75. 75. Advanced limits match EAC sections
    76. 76. XTF result
    77. 77. XTF query in the crossQueryResult
    78. 78. doing a search
    79. 79. spellcheck
    80. 80. search results
    81. 81. search results
    82. 82. EAC record view Identity
    83. 83. EAC record view alternative forms of name
    84. 84. EAC record view Biographical History
    85. 85. HTML 5 microdata in chron list
    86. 86. EAC record view Related Entries
    87. 87. EAC record view Related Entries
    88. 88. RDFa owl:sameAs
    89. 89. EAC record view View EAC XML
    90. 90. EAC record view Graph Demo
    91. 91. Tinkerpop Graph Stack http://www.tinkerpop.com/ • Property Graph Model • graphML • RDF Sail support
    92. 92. vertex, edge https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph
    93. 93. Graph Schema
        vertex
            _id: auto-assigned by neo4j
            _type: vertex
            identity: the name of the entity (string) [indexed]
            urls: \n-separated list of source EAD files
            entityType: corporateBody, family, or person
        edge
            _id: auto-assigned by neo4j
            _type: edge
            _label: correspondedWith or associatedWith
            _inV: incoming vertex _id (from)
            _outV: outgoing vertex _id (to)
            from_name: from identity (string), denormalized
            to_name: to identity (string), denormalized
    94. 94. internal id • indices/name-idx is an index on “identity”; used to look up the neo4j record id
    95. 95. “bothE” shows in and out edges • vertices/103994/bothE • redundant data to save repeated lookups
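The role of the name index and of the denormalized names on edges can be illustrated with a small in-memory model. This is a sketch only: it is not the neo4j or Tinkerpop API, the class, its methods, and the `from_id` / `to_id` keys are hypothetical, and the two entities are placeholders.

```python
class SocialGraph:
    """Tiny in-memory property graph with a name index (illustrative only)."""

    def __init__(self):
        self.vertices = {}    # vertex id -> property dict
        self.edges = []       # list of edge property dicts
        self.name_index = {}  # identity string -> vertex id

    def add_vertex(self, identity, entity_type):
        vertex_id = len(self.vertices) + 1
        self.vertices[vertex_id] = {"identity": identity, "entityType": entity_type}
        self.name_index[identity] = vertex_id
        return vertex_id

    def add_edge(self, from_id, to_id, label):
        # Copy both names onto the edge so that listing a vertex's edges
        # needs no further vertex lookups (the "redundant data").
        self.edges.append({
            "label": label,
            "from_id": from_id,
            "to_id": to_id,
            "from_name": self.vertices[from_id]["identity"],
            "to_name": self.vertices[to_id]["identity"],
        })

    def both_edges(self, vertex_id):
        """Edges that start or end at the given vertex, like "bothE"."""
        return [e for e in self.edges if vertex_id in (e["from_id"], e["to_id"])]


graph = SocialGraph()
a = graph.add_vertex("Example, Person A", "person")
b = graph.add_vertex("Example Family", "family")
graph.add_edge(a, b, "associatedWith")

vertex_id = graph.name_index["Example Family"]  # name -> internal id
neighbours = [(e["from_name"], e["label"], e["to_name"]) for e in graph.both_edges(vertex_id)]
```

Looking up the internal id by name and then reading names straight off the edges mirrors the two-step access pattern on these slides: one index lookup, then one edge listing.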
    96. 96. RDF of the social graph Thanks Ed Summers!
    97. 97. Silvia Mazzini, regesta.exe srl http://templates.xdams.net/IBC/ontology/eac-cpf.rdf
    98. 98. Front End Stack• golden grid http://code.google.com/p/the-golden-grid/• form style http://formalize.me/• jquery and jquery ui• hoverIntent for advanced search• google analytics with event tracking
    99. 99. XTF XSLT Framework • pre filter - do special tokenization to create custom EAC facets • https://docs.google.com/document/d/1wP9x6sdOZTagJNQXoyJfPh0Y6UzQgqLwLI86WSlIPbk/edit?hl=en_US • query parser - CGI params to XTF query XML • result formatter - XTF results to HTML • doc formatter - EAC-CPF to HTML • http://code.google.com/p/xtf-cpf/source/browse/?name=xtf-cpf
    100. 100. social graph visualization • EAC to graphML https://code.google.com/p/eac-graph-load/ • graphML file with open license should be viewable in other tools • old demo uses Dracula Graph Library • new demo uses JavaScript InfoVis Toolkit • Ed Summers’s “snac hacks” post
    101. 101. EAD to EAC XSLT• forthcoming from Virginia
    102. 102. Record Merging• forthcoming from Berkeley
    103. 103. Demo• http://socialarchive.iath.virginia.edu/xtf/search
