Bridging Digital Humanities Research and Big Data Repositories of Digital Text


Keynote, 2014 Encuentro de Humanistas Digitales, Mexico City


  1. Bridging Digital Humanities Research and Large Repositories of Digital Text
     2nd Encuentro de Humanistas Digitales | 21.May.14
     Biblioteca Vasconcelos, Mexico City
     Beth Plale
     Professor, School of Informatics and Computing
     Director, Data To Insight Center
     Indiana University
     Tweet us - @HathiTrust #HTRC
     HATHI TRUST RESEARCH CENTER
  2. Setting Stage
     • "Informatics" is the application of computer and information science (CIS) to the data that constitutes the primary research material of that field.
     • In Europe, digital humanities is sometimes called "cultural informatics", but that misses the point that an informatics researcher brings CIS methodologies to problems in the humanities, whereas DH researchers bring humanities methodologies to problems.
     • I am an informatics researcher (CIS methodologies) with a 15-year record in geo-informatics and, over the last 5 years, a growing understanding of the methodology and motivations of the digital humanities researcher
  3. Digital humanities is an emerging discipline that applies computation to research in the humanities. More than simply conducting research with computers, digital humanities scholars use information technology as a central part of their methodology.
     University of Illinois Library web site, 2014
  4. Digital Humanities activities categorized
     • Access: big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc. Pretty much everything on that list is being digitized in very large numbers.
     • Production: we're already seeing more and more scholars producing their work for the Web. It might take the form of scholarly websites, blogs, wikis, or whatever. […] the entire production cycle uses technology (collecting, editing, discussing with others) before the final product is created.
     • Consumption: people get their materials in all kinds of new ways. Reading has changed with the Web. The way we read is changing. Bits and pieces of varied content from so many places and perspectives.
     Interview with Brett Bobley, NEH, 2009
     http://www.hastac.org/node/1934
  5. Why does it matter?
     "If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before."
     2009 interview with Brett Bobley, Nat'l Endowment for the Humanities, US, on predictions for the future for Digital Humanities
  6. Bobley's Prediction, cont.
     In a world of big, massive scale, he asks:
     • "How might quantitative technology-based methodologies like data mining help you to better understand a giant corpus? Help you zero in on issues?"
     • "What if you are a historian and you now have access to every newspaper around the world?"
     • "How might searching and mining that kind of dataset radically change your results?"
  7. Goal of Talk
     Introduce technical architectural big data developments around HathiTrust, emerging examples of use,
     … to facilitate discussion around whether Brett Bobley's 2009 prediction of "scale. Big, massive, scale", which is here today, can now deliver on advances for digital humanities
  8. HathiTrust
     • HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
       – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
     http://www.hathitrust.org/htrc
     http://www.hathitrust.org
     → Distinguished from
  9. #HTRC @HathiTrust
  10. Content of HathiTrust
     • Books and journals
       – Plus pilots around images, audio, born-digital
     • Digitization sources
       – Google (96.8%, 10,162,104)
       – Internet Archive (2.9%, 301,972)
       – Local (0.3%, 31,840)
  11. Content Sources
  12. Content distribution
     360,000 volumes in Spanish
  13. Motivation for HTRC
     → HathiTrust repository is massive scale: a latent goldmine for text-based research
     → Restricted nature of parts of HathiTrust content suggests need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access
     → Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
  14. HathiTrust Research Center
     • The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
     • HTRC Executive Committee
       – Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
       – J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
       – Robert McDonald, Indiana University Libraries
       – Beth Sandore Namachchivaya, University of Illinois Library
       – John Unsworth, CIO, Dean of Library, Brandeis University
  15. HTRC system
     Complexity hiding interface
     The complexity
     Tabular info
     Statistical plots
     Spatial plots
     Request
  16. Complexity hiding interface
  17. Return to categories of DH activity
     HTRC in current form best at supporting:
     • Access: by narrowing down to essential materials quickly – separating wheat from chaff
       "big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc."
     • Production: by supporting computational investigation over massive scale of texts that will require large-scale computers (cloud computing)
     • Consumption: by tracking the bits and pieces (i.e., the HTRC workset)
       "The way we read is changing. Bits and pieces of varied content from so many places and perspectives."
     Interview with Brett Bobley, NEH, 2009
  18. Workset manages engagement with texts
  19. EXAMPLES OF RESEARCH THAT IS POSSIBLE AT SCALE
     • Topic modeling
     • Author Gender Identification
     • Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
  20. Topic Modeling
     • Can answer more complex or nuanced questions
       – What are the primary themes of an author?
       – What are the primary themes of a research domain?
       – When did a new topic enter a research domain?
     • Provides more data than word counts
       – 100s of topics can be extracted.
       – Underlying data (topics, volume, and page) is available
  21. Themes for Authors
     Two topics with identical centralities (e.g., Dickens) but separate themes
     More strongly focused on book (illustrations, volume, literature)
     More strongly focused on author himself (letters, household, house)
  22. Ted Underwood, Univ of Illinois
  23. GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES
     Stacy Kowalczyk, Asst. Professor, Dominican University
     Zong Peng, HTRC, Indiana University
     Talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
  24. Gender Identification of Text
     • Question Investigated: Can we use author names in bibliographic records to identify gender?
     • Looked at 2.6 million bibliographic records
       – Extracted personal author data
       – MARC 100 abcd and 700 abcd
     • 606,437 unique personal author strings
     • Bibliographic data is not fielded like patent names
     • Relying on standard cataloging practice
       – Last name, first name middle name, titles/honorifics, dates
  25. Authors vs Names
     There is the author, then there are the names under which the author is published…
     • Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
     • Methuem, Algernon
     • Methuen Algernon
     • Methuen Marshall, Sir, bart., 1856-
     • Methuen, A. Sir, 1856-1924
     • Methuen, A. Sir, bart., 1856-1924
     • Methuen Marshall, Sir bart 1856-1924
     • Methuen, Algernon Methuen Marshall, Sir, 1856-1924
     • Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
     • Methuen, Algernon, 1856-1924
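The name variants on this slide suggest why a normalization step is needed before any gender lookup. The following is a minimal sketch, not the study's code: it assumes the cataloging order stated in the talk (last name, forenames, titles/honorifics, dates), and the function name `parse_author_string` and the small `HONORIFICS` set are hypothetical.

```python
import re

# Illustrative honorific tokens only; the study's full list is not given in the slides.
HONORIFICS = {"sir", "bart", "lady", "mrs", "miss"}

# Matches life dates such as "1856-1924" or an open range such as "1856-".
DATE = re.compile(r"^\d{4}-(\d{4})?$")

def parse_author_string(raw):
    """Split a catalog name string into surname, forenames, honorifics, dates.

    Assumption: the first whitespace-separated token is the surname.
    """
    tokens = [t.strip(",") for t in raw.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        raise ValueError("empty author string")
    parsed = {"surname": tokens[0], "forenames": [], "honorifics": [], "dates": None}
    for token in tokens[1:]:
        if DATE.match(token):
            parsed["dates"] = token
        elif token.rstrip(".").lower() in HONORIFICS:
            parsed["honorifics"].append(token.rstrip("."))
        else:
            parsed["forenames"].append(token)
    return parsed

print(parse_author_string("Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924"))
```

A rule this simple will not reconcile misspellings such as "Methuem", and it treats every non-honorific, non-date token after the surname as a forename, so it is only a starting point for the kind of cleanup the slide motivates.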
  26. Sources of Data
     • The Virtual International Authority File
       – Hosted by OCLC
     • Harvested names from multiple data sources
       – Census bureau
       – Baby name sites
     • EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
       – Developed an extensive list of European names
     • Titles and honorifics
       – Multiple web resources
       – Sir, Baron, Count, Duke, Father, Cardinal, etc.
       – Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
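One way to combine these sources is a lookup that checks honorifics first and first names second. The sketch below is an assumption about how such a lookup could work, not the method used in the study: the honorific sets come from the slide, while the first-name sets are tiny hypothetical stand-ins for the harvested lists, and the precedence of titles over names is my own choice. It returns one of four outcomes: female, male, unknown, or ambiguous.

```python
# Honorifics as listed on the slide.
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}

# Hypothetical stand-ins for the harvested name lists (census, baby-name sites, etc.).
MALE_NAMES = {"algernon", "charles", "leslie"}
FEMALE_NAMES = {"jane", "mary", "leslie"}

def identify_gender(forenames, honorifics):
    """Return 'male', 'female', 'ambiguous', or 'unknown' for one parsed name."""
    titles = {h.rstrip(".").lower() for h in honorifics}
    has_male_title = bool(titles & MALE_TITLES)
    has_female_title = bool(titles & FEMALE_TITLES)
    # Assumption: an unambiguous title outranks the first name.
    if has_male_title and not has_female_title:
        return "male"
    if has_female_title and not has_male_title:
        return "female"
    first = forenames[0].lower() if forenames else ""
    in_male, in_female = first in MALE_NAMES, first in FEMALE_NAMES
    if in_male and in_female:
        return "ambiguous"
    if in_male:
        return "male"
    if in_female:
        return "female"
    return "unknown"

print(identify_gender(["A."], ["Sir"]))   # title decides
print(identify_gender(["Leslie"], []))    # name appears in both lists
```

Checking titles first means an initial-only forename such as "A." can still be resolved when an honorific is present.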
  27. Initial Gender Results
     • Approximately 80% of name strings have initial gender identification
       – Female: 59,365 (10%)
       – Male: 425,994 (70%)
       – Unknown: 114,204 (19%)
       – Ambiguous: 5,965 (less than 1%)
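The percentages can be checked directly from the four counts on the slide. This short calculation uses only those counts; the variable names are illustrative.

```python
# Counts as reported on the slide.
counts = {"female": 59_365, "male": 425_994, "unknown": 114_204, "ambiguous": 5_965}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}
identified = shares["female"] + shares["male"]

for label, share in shares.items():
    print(f"{label:10s} {counts[label]:>8,d} {share:5.1f}%")
print(f"identified (female + male): {identified:.1f}%")
```

The female and male shares together come to roughly 80%, matching the headline figure. The four counts sum to 605,528, slightly below the 606,437 unique personal author strings reported earlier in the talk; the slides do not explain the difference.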
  28. Results by Data Source
     Against the whole set of name strings
     • VIAF
       – 19% hit rate
     • Web Names
       – 54% hit rate
     • Patents Names
       – 8%
  29. Colin Allen, Jamie Murdock
     Cognitive Science, Indiana University
     Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013
  30. Digging into philosophy of science
     • Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
     • Use topic modeling to identify the volumes and pages within these volumes that are "rich" in a chosen topic
     • Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
  31. The How
     • 1315 volumes from HTRC selected using keyword search for 'darwin', 'romanes', 'anthropomorphism', and 'comparative psychology'
     • Set contains lots of uninteresting books: e.g., college course catalogs
     • Apply topic modeling on 86 volume subset
     • Using iPy (IPython) Notebook
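The first selection step is a plain keyword search, which also explains why uninteresting volumes slip through. A minimal sketch follows, assuming an OR match over lowercased text; the slides do not say whether the search combined terms with OR or AND, and the volume identifiers and texts are invented for illustration.

```python
KEYWORDS = ("darwin", "romanes", "anthropomorphism", "comparative psychology")

def matches_keywords(text, keywords=KEYWORDS):
    """True if any keyword occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

def select_volumes(volumes):
    """volumes maps a volume id to its text; returns the ids that match."""
    return [vol_id for vol_id, text in volumes.items() if matches_keywords(text)]

# Hypothetical sample data.
sample = {
    "vol-a": "Romanes on animal intelligence",
    "vol-b": "College catalog: a course in Comparative Psychology",
    "vol-c": "A treatise on bridges",
}
print(select_volumes(sample))
```

The course catalog in the sample matches just as a monograph would, which is the kind of noise that motivates the topic-modeling pass over a smaller subset.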
  32. Volume level topic modeling on 'anthropomorphism' yields set of topics
  33. … Of set of topics, choose '16' as best
  34. Volumes most similar to topic 16
  35. Repeat topic modeling at page level
  36. Topic model at page level for topics anthropomorphism, animal, and psychology
  37. Pick top 3: topics 16, 10, 26
  38. Show documents of topics 10, 16, 26
  39. Drop to sentence level
     • Select three books* with highest aggregate of 20-40 topic-relevant pages for more precise analysis
     • Model the three books at the sentence level (uses machine learning)
     * Start from 1,315 texts, narrow to 86, then to the 3 most relevant
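The narrowing from page-level topic weights to three books can be sketched as a ranking problem. This is one possible reading of "highest aggregate of topic-relevant pages", not the authors' procedure: each volume is scored by the sum of its strongest page weights for the chosen topic, and the top-scoring volumes are kept. The function names, the `pages_per_volume` parameter, and the sample weights are hypothetical.

```python
import heapq

def volume_score(page_weights, pages_per_volume=20):
    """Sum of the largest per-page topic weights for one volume."""
    return sum(heapq.nlargest(pages_per_volume, page_weights))

def top_volumes(weights_by_volume, n=3, pages_per_volume=20):
    """Return the ids of the n volumes with the highest aggregate score."""
    scores = {
        vol_id: volume_score(weights, pages_per_volume)
        for vol_id, weights in weights_by_volume.items()
    }
    return heapq.nlargest(n, scores, key=scores.get)

# Hypothetical per-page weights for a single topic.
weights = {
    "a": [0.9, 0.8, 0.1],
    "b": [0.2, 0.2],
    "c": [0.5, 0.5, 0.5],
    "d": [0.05],
}
print(top_volumes(weights, n=3, pages_per_volume=2))
```

Capping the number of pages per volume keeps a very long book with many weakly relevant pages from outranking a short book with a few strongly relevant ones.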
  40. Promising early results …
  41. Copyright: A Reality
     Full text download is limited by both size and copyright
  42. Computing with Copyrighted materials: HTRC Data Capsule
     • Copyrighted materials can be computed on, but cannot be shared by humans for human (reading) consumption
     • Needs a computational framework that enables computing but restricts human consumption
     • A secure computing framework that:
       – Trusts that the researcher will not deliberately leak data
       – Prevents malware acting on the user's behalf from leaking data
     • Supports openness: accepts user-contributed analysis
     • Supports large scale and low cost: protections can be extended to utilization of public supercomputers
  43. HTRC Data Capsule Architectural Components
     VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; SSH; Research results; Researcher; Registry Services, worksets
  44. HTRC Data Capsule interaction
     Components: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; SSH; Research results; Researcher; Registry Services, worksets
     • Researcher requests new VM of type X
     • Image instance is created
     • Researcher installs tools onto VM through window on her desktop
     • Upon run, Secure Capsule controls I/O behind the scenes
     • Final location of results is registry
     Step markers: 1) 2) 3) 4)
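The interaction described on this slide can be read as an ordered workflow. The sketch below models it as a small state machine purely for illustration: the stage names paraphrase the slide's annotations, their ordering is my reading of the diagram (the slide's 1) to 4) markers cannot be matched to individual steps from the transcript), and the class and property names are hypothetical rather than part of any HTRC API.

```python
from enum import Enum

class Stage(Enum):
    REQUESTED = "researcher requests new VM of type X"
    CREATED = "image instance is created"
    TOOLS_INSTALLED = "researcher installs tools onto VM"
    RUNNING = "secure capsule controls I/O during the run"
    RESULTS_IN_REGISTRY = "results are placed in the registry"

_ORDER = list(Stage)  # Enum members iterate in definition order

class CapsuleSession:
    """Toy model of one researcher's pass through the capsule workflow."""

    def __init__(self, image_type):
        self.image_type = image_type
        self.stage = Stage.REQUESTED
        self.history = [Stage.REQUESTED]

    def advance(self):
        """Move to the next stage; raise once the workflow is complete."""
        index = _ORDER.index(self.stage)
        if index == len(_ORDER) - 1:
            raise RuntimeError("workflow already complete")
        self.stage = _ORDER[index + 1]
        self.history.append(self.stage)
        return self.stage

    @property
    def io_controlled(self):
        # Assumption: I/O is mediated by the capsule only while the analysis runs.
        return self.stage is Stage.RUNNING

session = CapsuleSession("X")
while session.stage is not Stage.RESULTS_IN_REGISTRY:
    print(session.advance().value)
```

The point of the model is the asymmetry the talk describes: the researcher is free to set up tools, but once the analysis runs against restricted text, output leaves only through the registry.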
  45. HTRC secure data capsule: view from researcher desktop
  46. Thanks to our sponsors
  47. 2009: "If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before."
     Reality: Full text download is limited by size and copyright
     → Paradigm: computation moves to the data (not vice versa)
     2014: We are at massive scale of data, but data access is constrained. Can digital humanities researchers work within constraints? Will they find it worthwhile to do so?
