Bridging Digital Humanities Research and Big Data Repositories of Digital Text

  1. 1. Bridging Digital Humanities Research and Large Repositories of Digital Text
     2nd Encuentro de Humanistas Digitales | 21 May 2014 | Biblioteca Vasconcelos, Mexico City
     Beth Plale, Professor, School of Informatics and Computing; Director, Data To Insight Center, Indiana University
     Tweet us: @HathiTrust #HTRC
     HathiTrust Research Center
  2. 2. Setting the Stage
     • “Informatics” is the application of computer and information science (CIS) to the data that constitutes the primary research material of a field.
     • In Europe, digital humanities is sometimes called “cultural informatics”, but that misses the point that the informatics researcher brings CIS methodologies to problems in the humanities, whereas DH researchers bring humanities methodologies to problems.
     • I am an informatics researcher (CIS methodologies) with a 15-year record in geo-informatics and, over the last 5 years, a growing understanding of the methodology and motivations of the digital humanities researcher.
  3. 3. “Digital humanities is an emerging discipline that applies computation to research in the humanities. More than simply conducting research with computers, digital humanities scholars use information technology as a central part of their methodology.”
     Source: University of Illinois Library web site, 2014
  4. 4. Digital Humanities activities categorized
     • Access: big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc. Pretty much everything on that list is being digitized in very large numbers.
     • Production: we're already seeing more and more scholars producing their work for the Web. It might take the form of scholarly websites, blogs, wikis, or whatever. […] the entire production cycle uses technology (collecting, editing, discussing with others) before the final product is created.
     • Consumption: people get their materials in all kinds of new ways. Reading has changed with the Web. The way we read is changing. Bits and pieces of varied content from so many places and perspectives.
     Interview with Brett Bobley, NEH, 2009, http://www.hastac.org/node/1934
  5. 5. Why does it matter?
     “If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
     2009 interview with Brett Bobley, National Endowment for the Humanities, US, on predictions for the future of Digital Humanities
  6. 6. Bobley’s Prediction, cont.
     In a world of big, massive scale, he asks:
     • “How might quantitative technology-based methodologies like data mining help you to better understand a giant corpus? Help you zero in on issues?”
     • “What if you are a historian and you now have access to every newspaper around the world?”
     • “How might searching and mining that kind of dataset radically change your results?”
  7. 7. Goal of Talk
     Introduce the big data architecture being developed around HathiTrust, along with emerging examples of its use, to facilitate discussion of whether the “scale. Big, massive, scale” that Brett Bobley predicted in 2009, and which is here today, can now deliver advances for digital humanities.
  8. 8. HathiTrust
     • HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
       – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
     http://www.hathitrust.org/htrc
     http://www.hathitrust.org
     → Distinguished from
  10. 10. Content of HathiTrust
     • Books and journals
       – Plus pilots around images, audio, born-digital
     • Digitization sources
       – Google (96.8%, 10,162,104)
       – Internet Archive (2.9%, 301,972)
       – Local (0.3%, 31,840)
  11. 11. Content Sources
  12. 12. Content distribution: 360,000 volumes in Spanish
  13. 13. Motivation for HTRC
     → The HathiTrust repository is massive in scale: a latent goldmine for text-based research
     → The restricted nature of parts of the HathiTrust content suggests a need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access
     → Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
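The "computation moves to the data" paradigm can be pictured with a small sketch. It is an illustration only, not HTRC code: `RestrictedRepository`, `run`, and `count_words` are hypothetical names, and the two "volumes" are placeholder strings. The point it tries to show is that the researcher hands an analysis function to the repository and gets back derived counts, while the full text never leaves.

```python
from collections import Counter
from typing import Callable, Dict


class RestrictedRepository:
    """Toy stand-in for a repository whose full text may not be downloaded."""

    def __init__(self, volumes: Dict[str, str]) -> None:
        self._volumes = volumes  # full text stays inside the repository

    def run(self, analysis: Callable[[str], Counter]) -> Counter:
        """Run `analysis` next to the data and return only aggregated counts."""
        total: Counter = Counter()
        for text in self._volumes.values():
            total.update(analysis(text))
        return total


def count_words(text: str) -> Counter:
    """Example analysis: lowercase word frequencies for one volume."""
    return Counter(text.lower().split())


# Placeholder texts; a real repository would hold millions of volumes.
repo = RestrictedRepository({
    "vol-1": "the origin of species",
    "vol-2": "the descent of man",
})
word_counts = repo.run(count_words)
print(word_counts.most_common(3))
```

The caller only ever sees `word_counts`; a real system would also need to vet what an analysis is allowed to return, which this sketch does not attempt.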
  14. 14. HathiTrust Research Center
     • The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
     • HTRC Executive Committee
       – Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
       – J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
       – Robert McDonald, Indiana University Libraries
       – Beth Sandore Namachchivaya, University of Illinois Library
       – John Unsworth, CIO, Dean of Library, Brandeis University
  15. 15. HTRC system [diagram labels: Request; Complexity-hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]
  16. 16. Complexity-hiding interface
  17. 17. Return to categories of DH activity
     HTRC in current form best at supporting:
     • Access: by narrowing down to essential materials quickly, separating wheat from chaff
       “big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc.”
     • Production: by supporting computational investigation over massive scale of texts that will require large-scale computers (cloud computing)
     • Consumption: by tracking the bits and pieces (i.e., the HTRC workset)
       “The way we read is changing. Bits and pieces of varied content from so many places and perspectives.”
     Interview with Brett Bobley, NEH, 2009
  18. 18. Workset manages engagement with texts
  19. 19. EXAMPLES OF RESEARCH THAT IS POSSIBLE AT SCALE
     • Topic modeling
     • Author Gender Identification
     • Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
  20. 20. Topic Modeling
     • Can answer more complex or nuanced questions
       – What are the primary themes of an author?
       – What are the primary themes of a research domain?
       – When did a new topic enter a research domain?
     • Provides more data than word counts
       – 100s of topics can be extracted.
       – Underlying data (topics, volume, and page) is available
  21. 21. Themes for Authors
     Two topics with identical centralities (e.g., Dickens) but separate themes:
     • More strongly focused on book (illustrations, volume, literature)
     • More strongly focused on author himself (letters, household, house)
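To make "topic modeling" concrete, here is a minimal sketch of one common technique, latent Dirichlet allocation fitted by collapsed Gibbs sampling, written in plain Python. The slides do not say which algorithm or toolkit HTRC uses, so this is an assumption for illustration; `fit_lda`, `top_words`, and the four toy documents (built from the theme words on the slide above) are hypothetical.

```python
import random
from typing import List


def fit_lda(docs: List[List[str]], n_topics: int, n_iter: int = 200,
            alpha: float = 0.1, beta: float = 0.01, seed: int = 0):
    """Collapsed Gibbs sampling for LDA; returns (doc_topic, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]              # token counts per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # token counts per (topic, word)
    topic_total = [0] * n_topics
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = word_id[w], assignments[d][i]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, topic: int, n: int = 3) -> List[str]:
    """Most frequent words assigned to one topic."""
    ranked = sorted(range(len(vocab)), key=lambda v: topic_word[topic][v], reverse=True)
    return [vocab[v] for v in ranked[:n]]


# Toy documents; real input would be volumes or pages.
docs = [
    "letters household house letters family".split(),
    "illustrations volume literature edition volume".split(),
    "house family letters household".split(),
    "literature illustrations edition volume".split(),
]
doc_topic, topic_word, vocab = fit_lda(docs, n_topics=2)
for k in range(2):
    print(k, top_words(topic_word, vocab, k))
```

The two returned tables correspond to the "underlying data" mentioned on the slide: `doc_topic` says how strongly each document (a volume or a page) expresses each topic, and `topic_word` says which words characterize a topic.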
  22. 22. Ted Underwood, Univ of Illinois
  23. 23. GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES
     Stacy Kowalczyk, Asst. Professor, Dominican University
     Zong Peng, HTRC, Indiana University
     Talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
  24. 24. Gender Identification of Text
     • Question investigated: Can we use author names in bibliographic records to identify gender?
     • Looked at 2.6 million bibliographic records
       – Extracted personal author data
       – MARC fields 100 and 700 (personal name entries), subfields a, b, c, and d
     • 606,437 unique personal author strings
     • Bibliographic data is not fielded like patent names
     • Relying on standard cataloging practice
       – Last name, first name middle name, titles/honorifics, dates
  25. 25. Authors vs Names
     There is the author, then there are the names under which the author is published…
     • Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
     • Methuem, Algernon
     • Methuen Algernon
     • Methuen Marshall, Sir, bart., 1856-
     • Methuen, A. Sir, 1856-1924
     • Methuen, A. Sir, bart., 1856-1924
     • Methuen Marshall, Sir bart 1856-1924
     • Methuen, Algernon Methuen Marshall, Sir, 1856-1924
     • Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
     • Methuen, Algernon, 1856-1924
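A rough sketch of how such name strings might be split under the cataloging pattern "last name, first names, titles/honorifics, dates". This is not the authors' parser: `parse_author`, `DATE_RE`, and the small `TITLE_WORDS` set are assumptions made for illustration, and the sketch deliberately shows where the pattern breaks down.

```python
import re
from typing import Dict, List, Optional

# A life-date token such as "1856-1924" or an open-ended "1856-".
DATE_RE = re.compile(r"^\d{4}-(\d{4})?$")
# Illustrative subset only; a real list of titles would be far longer.
TITLE_WORDS = {"sir", "bart", "lady", "baron"}


def parse_author(heading: str) -> Dict[str, object]:
    """Split 'Surname, Forenames, titles, dates' into its parts."""
    parts = [p.strip() for p in heading.split(",") if p.strip()]
    surname = parts[0] if parts else ""
    forenames: List[str] = []
    titles: List[str] = []
    dates: Optional[str] = None
    for part in parts[1:]:
        for token in part.split():
            if DATE_RE.match(token):
                dates = token
            elif token.rstrip(".").lower() in TITLE_WORDS:
                titles.append(token.rstrip("."))
            else:
                forenames.append(token)
    return {"surname": surname, "forenames": forenames,
            "titles": titles, "dates": dates}


print(parse_author("Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924"))
print(parse_author("Methuen, A. Sir, 1856-1924"))
print(parse_author("Methuen Algernon"))  # no comma: everything lands in "surname"
```

The last call shows why the variants on the slide are hard: a missing comma or a misspelling ("Methuem") yields a different record for the same person, so parsing alone does not merge names into authors.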
  26. 26. Sources of Data
     • The Virtual International Authority File
       – Hosted by OCLC
     • Harvested names from multiple data sources
       – Census bureau
       – Baby name sites
     • EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
       – Developed an extensive list of European names
     • Titles and honorifics
       – Multiple web resources
       – Sir, Baron, Count, Duke, Father, Cardinal, etc.
       – Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
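One way these sources could be combined is a lookup that checks honorifics first and then the first forename, returning one of the four labels reported in the results (female, male, unknown, ambiguous). The precedence of titles over names is an assumption, not something the slides state. The title sets come from the slide above; the name sets are tiny hypothetical stand-ins for VIAF, web, and patent name lists, and `classify_gender` is a hypothetical function.

```python
from typing import Sequence

MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}
# Hypothetical miniature name lists; real lists would be harvested at scale.
MALE_NAMES = {"algernon", "charles", "leslie"}
FEMALE_NAMES = {"jane", "mary", "leslie"}


def classify_gender(forenames: Sequence[str], titles: Sequence[str] = ()) -> str:
    """Return 'female', 'male', 'ambiguous', or 'unknown' for one name string."""
    title_set = {t.rstrip(".").lower() for t in titles}
    has_male_title = bool(title_set & MALE_TITLES)
    has_female_title = bool(title_set & FEMALE_TITLES)
    # Assumption: an unambiguous honorific outranks the forename lists.
    if has_male_title and not has_female_title:
        return "male"
    if has_female_title and not has_male_title:
        return "female"
    first = forenames[0].lower() if forenames else ""
    in_male, in_female = first in MALE_NAMES, first in FEMALE_NAMES
    if in_male and in_female:
        return "ambiguous"
    if in_male:
        return "male"
    if in_female:
        return "female"
    return "unknown"  # initials, or names missing from every list


print(classify_gender(["Algernon", "Methuen", "Marshall"], ["Sir", "bart"]))
print(classify_gender(["Leslie"]))
print(classify_gender(["A."]))
```

Initials such as "A." fall through to "unknown" unless a title is present, which is one plausible reason a share of name strings stays unidentified.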
  27. 27. Initial Gender Results
     • Approximately 80% of name strings have initial gender identification
       – Female: 59,365 (10%)
       – Male: 425,994 (70%)
       – Unknown: 114,204 (19%)
       – Ambiguous: 5,965 (less than 1%)
  28. 28. Results by Data Source
     Against the whole set of name strings:
     • VIAF: 19% hit rate
     • Web Names: 54% hit rate
     • Patents Names: 8%
  29. 29. Colin Allen, Jaimie Murdock
     Cognitive Science, Indiana University
     Ref talk by Jaimie Murdock, http://www.hathitrust.org/htrc_uncamp2013
  30. 30. Digging into philosophy of science
     • Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
     • Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic
     • Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
  31. 31. The How
     • 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
     • Set contains lots of uninteresting books, e.g., college course catalogs
     • Apply topic modeling to an 86-volume subset
     • Using IPython Notebook
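The keyword-selection step can be sketched as a simple substring filter. The four keywords are the ones listed on the slide; `select_volumes` and the toy volumes are hypothetical, and the real selection presumably ran against a full-text search index rather than in-memory strings.

```python
from typing import Dict, List, Sequence

KEYWORDS = ("darwin", "romanes", "anthropomorphism", "comparative psychology")


def select_volumes(volumes: Dict[str, str],
                   keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return ids of volumes whose text mentions any keyword (case-insensitive)."""
    selected = []
    for vol_id, text in volumes.items():
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            selected.append(vol_id)
    return selected


# Toy volumes for illustration only.
toy_volumes = {
    "v1": "Darwin on the expression of the emotions",
    "v2": "College course catalog: Comparative Psychology 101",
    "v3": "A treatise on bridge construction",
}
print(select_volumes(toy_volumes))
```

The course catalog in `v2` matches just as the monograph in `v1` does, which mirrors the slide's remark that a keyword hit list contains many uninteresting books and needs topic modeling to narrow it further.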
  32. 32. Volume-level topic modeling on ‘anthropomorphism’ yields a set of topics
  33. 33. … Of the set of topics, choose topic 16 as best
  34. 34. Volumes most similar to topic 16
  35. 35. Repeat topic modeling at page level
  36. 36. Topic model at page level for topics anthropomorphism, animal, and psychology
  37. 37. Pick top 3: topics 16, 10, 26
  38. 38. Show documents of topics 10, 16, 26
  39. 39. Drop to sentence level
     • Select three books* with highest aggregate of 20-40 topic-relevant pages for more precise analysis
     • Model the three books at the sentence level (uses machine learning)
     * Start from 1315 texts, narrow to 86, then to the 3 most relevant
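The book-selection step can be read as: score each book by the summed topic weight of its most topic-relevant pages, then keep the three highest-scoring books. That reading is an assumption, since the slide does not define "aggregate"; `aggregate_top_pages`, `select_books`, and the page weights below are hypothetical.

```python
from typing import Dict, List


def aggregate_top_pages(page_weights: List[float], n_pages: int) -> float:
    """Sum the topic weights of a book's n_pages most topic-relevant pages."""
    return sum(sorted(page_weights, reverse=True)[:n_pages])


def select_books(books: Dict[str, List[float]],
                 n_books: int = 3, n_pages: int = 20) -> List[str]:
    """Rank books by aggregate weight of their top pages; return the best ids."""
    scores = {book: aggregate_top_pages(weights, n_pages)
              for book, weights in books.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_books]


# Toy page-level topic weights, e.g. taken from a page-level topic model.
toy_books = {
    "a": [0.9, 0.8, 0.1],
    "b": [0.2, 0.1],
    "c": [0.5, 0.5, 0.5],
    "d": [0.05],
}
print(select_books(toy_books, n_books=2, n_pages=2))
```

With the slide's numbers, `n_pages` would sit somewhere between 20 and 40 and `n_books` would be 3; capping the page count keeps a long book from winning on length alone.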
  40. 40. Promising early results …
  41. 41. Copyright: A Reality
     Full text download is limited by both size and copyright
  42. 42. Computing with Copyrighted materials: HTRC Data Capsule
     • Copyrighted materials can be computed on, but cannot be shared by humans for human (reading) consumption
     • Needs a computational framework that enables computing while restricting human consumption
     • A secure computing framework that:
       – Trusts that researcher will not deliberately leak data
       – Prevents malware acting on user's behalf from leaking data
     • Supports openness: accepts user-contributed analysis
     • Supports large scale and low cost: protections can be extended to utilization of public supercomputers
  43. 43. HTRC Data Capsule Architectural Components [diagram labels: Researcher; SSH; VM Manager; VM instance; VM Image Manager; VM Image Store; VM Image Builder; Secure Capsule cluster; Registry Services, worksets; Research results]
  44. 44. HTRC Data Capsule interaction [same component diagram as the previous slide; the figure numbers its steps 1) to 4)]
     • Researcher requests new VM of type X
     • Image instance is created
     • Researcher installs tools onto VM through window on her desktop
     • Upon run, Secure Capsule controls I/O behind the scenes
     • Final location of results is registry
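The interaction can be walked through with a toy model. Nothing here is the HTRC implementation: `Capsule`, `request_vm`, `install_tool`, `run_capsule`, the image type, and the tool name are all hypothetical, and the "I/O control" is reduced to a single rule, namely that only the analysis result is written out, and only to the registry.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Capsule:
    """Toy model of one VM instance created for a researcher."""
    image_type: str
    tools: List[str] = field(default_factory=list)


def request_vm(image_type: str) -> Capsule:
    """Researcher requests a VM of a given type; an instance is created."""
    return Capsule(image_type=image_type)


def install_tool(capsule: Capsule, tool: str) -> None:
    """Researcher installs her own analysis tools onto the VM."""
    capsule.tools.append(tool)


def run_capsule(capsule: Capsule,
                analysis: Callable[[Dict[str, str]], int],
                workset: Dict[str, str],
                registry: Dict[str, int],
                result_key: str) -> None:
    """Run the analysis; only its return value leaves, and only to the registry."""
    registry[result_key] = analysis(workset)


def longest_volume_length(workset: Dict[str, str]) -> int:
    """Example analysis: word count of the longest volume in the workset."""
    return max(len(text.split()) for text in workset.values())


registry: Dict[str, int] = {}
capsule = request_vm("text-mining-image")   # hypothetical image type
install_tool(capsule, "my-tokenizer")       # hypothetical tool name
toy_workset = {"vol-1": "on the origin of species", "vol-2": "the descent of man"}
run_capsule(capsule, longest_volume_length, toy_workset, registry, "run-1")
print(registry)
```

A real capsule has to enforce this rule at the VM and network level, because the threat named on the earlier slide is malware acting on the researcher's behalf; a Python function boundary like the one above offers no such protection.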
  45. 45. HTRC secure data capsule: view from researcher desktop
  46. 46. Thanks to our sponsors
  47. 47. 2009: “If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
     → Paradigm: computation moves to the data (not vice versa)
     2014: We are at massive scale of data, but data access is constrained. Can digital humanities researchers work within constraints? Will they find it worthwhile to do so?
     Reality: Full text download is limited by size and copyright
