More Related Content

Slideshows for you(20)

Similar to Plale HathiTrust El Colegio de Mexico May2014(20)

Plale HathiTrust El Colegio de Mexico May2014

  1. HathiTrust  and  HTRC:  the  changing  Digital   Library   El  Colegio  de  Mexico  |  20.May.14       Beth  Plale  –  @bplale     Professor,  School  of  InformaCcs  and  CompuCng   Director,  HathiTrust  Research  Center     Indiana  University   Tweet  us  -­‐  @HathiTrust    #HTRC   HATHI TRUST RESEARCH CENTER!
  2. #HTRC    @HathiTrust   HathiTrust   •  HathiTrust  is  a  consorBum  of  academic  &   research  insBtuBons,  offering  a  collecBon  of   millions  of  Btles  digiBzed  from  libraries   around  the  world.   – Founding  members:  University  of  Michigan,   Indiana  University,  University  of  California,  and   University  of  Virginia   http://www.hathitrust.org/htrc   http://www.hathitrust.org   à  DisBnguished   from  
  3. #HTRC    @HathiTrust  
  4. Take  look  at  Details  of   HathiTrust  CollecBon    
  5. #HTRC    @HathiTrust   Content   •  Books  and  journals   – Pilots  around  images,  audio,  born-­‐digital   •  DigiBzaBon  sources   – Google  (96.8%,  10,162,104)   – Internet  Archive  (2.9%,  301,972)   – Local  (0.3%,  31,840)  
  6. #HTRC    @HathiTrust   Content  Sources  
  7. #HTRC    @HathiTrust   Content  Package  
  8. #HTRC    @HathiTrust   Metadata   •  Bibliographic   •  Structural   •  Rights   •  AdministraBve  (preservaBon)   •  Holdings  
  9. HathiTrust     Repository  OrganizaBon  
  10. #HTRC    @HathiTrust   HathiTrust  Repository  OrganizaBon  
  11. #HTRC    @HathiTrust   File  System  
  12. #HTRC    @HathiTrust   Content  distribuBon  
  13. #HTRC    @HathiTrust   Content  distribuBon   Not  public  domain   outside  available  
  14. #HTRC    @HathiTrust   à HathiTrust repository is a latent goldmine for text mining analysis, analysis of large-scale corpi through computational tools, and time-based analysis à Restricted nature of HT content suggests need for new forms of access that preserve intimate nature of research investigation while honoring restrictions à Paradigm: computation moves to the data (not vice versa)
  15. #HTRC    @HathiTrust    Mission  of  HT  Research  Center   •  Research  arm  of  HathiTrust     •  Goal:    enable  researchers  world-­‐wide  to  carry  out   computaBonal  invesBgaBon  of  HT  repository  through   –  Develop  model  for  access:  the  ‘workset’   –  Develop  tools  that  facilitate  research  by  digital  humaniBes   and  informaBcs  communiBes   –  Develop  secure  cyberinfrastructure  that  allows   computaBonal  invesBgaBon  of  enBre  copyrighted  and   public  domain  HathiTrust  repository   •  Established:    July,  2011   •  CollaboraBve  effort  of  Indiana  University  and   University  of  Illinois      
  16. HTRC  system     Complexity  hiding  interface   The  complexity   Tabular  info   StaBsBcal  plots   SpaBal  plots   Request  
  17.     Complexity  hiding  interface      
  18. Workset  builder  
  19. #HTRC    @HathiTrust   HTRC  Timeline   •  Phase  I:    development  01  Jul  2011  –  31  Mar  2013       –  HTRC  soiware  and  services  release  v1.0   hjp://sourceforge.net/p/htrc/code/     •  Phase  II:    outreach,  01  Apr  2013  -­‐  present   –  2nd  HTRC  UnCamp  Sep  ‘13     Ajendees  of  UnCamp’13  
  20. #HTRC    @HathiTrust   Access  to  copyrighted  materials:  HTRC   Data  Capsule   A  secure  compuBng  framework  that:   •  Trusts  that  researcher  will  not  deliberately  leak  repository  data,  but   •  Prevents  malware  acBng  on  user's  behalf  from  leaking  data.     Enforces:   •  Non-­‐consumpBve  use:    framework  provides  safe  handling  of  large   volumes  of  protected  data   •  Openness:  framework  supports  user-­‐contributed  analysis  tools   (that  is,  not  limit  uses  to  a  known  set  of  algorithms)   •  Efficiency:  framework  supports  user-­‐contributed  analysis  tools   without  resorBng  to  code  walkthroughs  prior  to  acceptance   •  Large-­‐scale  and  low  cost:    protecBons  can  be  extended  to  uBlizaBon   of  large-­‐scale  naBonal  (public)  supercomputers  
  21. VM  Image   Manager   VM  Image   Store   VM  Image   Builder   VM   Manager   VM   instance   Secure   Capsule   cluster   SSH   Research   results   Researcher   HTRC  Secure   Capsule   Architectural   Components       Registry     Services,   worksets      
  22. VM   Image   Manager   VM   Image   Store   VM   Image   Builder   VM   Manager   VM   instance   Upon  run,   Secure   Capsule:   controls  I/O   behind   scenes   SSH   Research   results   Researcher   HTRC  Secure   Capsule   Architecture   Researcher   requests     new  VM  of   type  X   Researcher  install  tools  onto   VM  through  window  on  her   desktop.         Registry     Services,   worksets       Final  locaBon   of  results  is   registry   1)   2)   Image   instance  is   created   3)   4)  
  23. 23   HTRC  secure  data  capsule:  view  from  researcher  desktop  
  24. EXAMPLES  OF  RESEARCH  CARRIED  OUT   THROUGH  HATHI  TRUST  RESEARCH   CENTER   •  Author  Gender  IdenBficaBon   •  Using  Topic  Modeling  to  Locate  (down  to   sentence  level)  Philosophical  Arguments  in   Science  Texts  
  25. GENDER  IDENTIFICATION  OF  HTRC   AUTHORS  BY  NAMES     Stacy  Kowalczyk,  Asst.  Professor,  Dominican  University   Zong  Peng,  HTRC,  Indiana  University   Ref  talk  by  Stacy  Kowalczyk,  hjp://www.hathitrust.org/htrc_uncamp2013  
  26. #HTRC    @HathiTrust   Gender  IdenBficaBon  of  Text   •  QuesBon  InvesBgated:  Can  we  use  author  names  in     bibliographic  records  to  idenBfy  gender?   •  2.6  million  bibliographic  records   –  Extracted  personal  author  data     –  Marc  100  abcd  and  700  abcd   •  606,437  unique  personal  author  strings   •  Bibliographic  data  is  not  fielded  like  patent  names   •  Relying  on  Standard  cataloging  pracBce   –  Last  name,  first  name  middle  name,    Btles/honorifics,   dates  
  27. Why  interesBng  to  HTRC?  Introduces  new   source  of  metadata  and  from  sources  with   varying  authority               Raises  quesBons:   1)  How  should  community  contributed  metadata   be  disBnguished  from  more  authoritaBve   sources?     2)  How  should  variability  of  quality  even  within  a   single  contribuBon  be  conveyed  to  community?  
  28. #HTRC    @HathiTrust   Authors  vs  Names   •  Methuen,  Algernon  Methuen  Marshall,  Sir  bart.,   1856-­‐1924   •  Methuem,  Algernon     •  Methuen  Algernon     •  Methuen  Marshall,  Sir,  bart.,  1856-­‐     •  Methuen,  A.  Sir,  1856-­‐1924     •  Methuen,  A.  Sir,  bart.,  1856-­‐1924     •  Methuen  Marshall,  Sir  bart  1856-­‐1924     •  Methuen,  Algernon  Methuen  Marshall,  Sir,  1856-­‐1924   •  Methuen,  Algernon  Methuen  Marshall,  Sir,  bart.,   1856-­‐1924   •  Methuen,  Algernon,  1856-­‐1924      
  29. #HTRC    @HathiTrust   Sources  of  Data   •  The  Virtual  InternaBonal  Authority  File   –  Hosted  by  OCLC   •  Harvested  names  from  mulBple  data  sources   –  Census  bureau     –  Baby  name  sites   •  EU  Patent  Research  names  list  (Frietsch  et  al,  2009;   Naldi  et  al.  2005)   –  Developed  an  extensive  list  of  European  names   •  Titles  and  honorifics   –  MulBple  web  resources     –  Sir,  Baron,  Count,  Duke,  Father,  Cardinal,  etc   –  Lady,  Mrs.  Miss,  Countess,  Duchess,  Sister,  etc  
  30. #HTRC    @HathiTrust   IniBal  Gender  Results   •  Approximately  80%  of  name  strings  have  iniBal   gender  idenBficaBon   –  Female   •  59,365   •  10%   –  Male   •  425,994   •  70%   –  Unknown   •  114,204   •  19%   –  Ambiguous   •  5,965   •  Less  than  1%  
  31. #HTRC    @HathiTrust   Results  by  Data  Source   Against  the  whole  set  of  name  strings   •  VIAF       – 19%  hit  rate     •  Web  Names   – 54%  hit  rate   •  Patents  Names   – 8%    
  32. Colin  Allen,  Jamie  Murdock   CogniCve  Science,  Indiana  University   Ref  talk  by  Jamie  Murdock,  hjp://www.hathitrust.org/htrc_uncamp2013  
  33. The  InPhO  project  is  instrucBve  because  it   demonstrates  an  interacBon  sequence  between   a  researcher  and  his/her  corpus  that  is   nuanced,  is  mulBstep,  and  mulB-­‐modal.         The  HTRC  cyberinfrastructure  must  be  able  to   handle  such  a  nuanced  form  of  interacBon   between  a  researcher  and  their  texts.  
  34. Digging  into  philosophy  of  science   •  Establish  points  of  contact  between  philosophy   and  science:  where  philosophical  arguments  on   anthropomorphism  appear  in  science  texts   •  Use  topic  modeling  to  idenBfy  the  volumes  and   pages  within  these  volumes  that  are  “rich”  in  a   chosen  topic   •  Use  semi-­‐formal  discourse  analysis  technique  to   idenBfy  key  arguments  in  selected  pages  to   incrementally  expose  and  represent  argument   structures  
  35. The  How   •  1315  volumes  from  HTRC  selected  using   keyword  search  for  ‘darwin’,  ‘romanes’,   ‘anthropomorphism’,  and  ‘comparaBve   psychology’   •  Set  contains  lots  of  uninteresBng  books:    e.g.,   college  course  catalogs   •  Apply  LDA  on  86  volume  subset     •  Using  iPy  Notebook  
  36. LDA  topic  modeling   •  LDA  (Latent  Dirichlet  Analysis)  uses  a  Bayesian   updaBng  method  to  generate  a  set  of  “topics”  –   probability  distribuBons  over  set  of  terms  in  a  corpus   •  Number  of  topics  is  a  parameter  in  the  modeling   technique   •  Method  finds  set  of  topics  that  is  best  able  to   reproduce  the  term  distribuBons  in  documents   belonging  to  the  corpus   •  Documents  may  be  whole  volumes,  chapters,  arBcles,   single  pages,  even  individual  sentences  –  modeler’s   choice  
  37. Volume  level  topic  modeling  on   ‘anthropomorphism’  yields  set  of   topics  
  38. ..  Of  set  of  topics,  choose  ‘16’  as  best  
  39. Volumes  most  similar  to  topic  16  
  40. Repeat  LDA  at  page  level  
  41. Topic  model  at  page  level  for  topics   anthropomorphism,  animal,  and  psychology  
  42. Words  sorted  by  similarity  
  43. Pick  top  3:  topics  16,  10,  26  
  44. Show  documents  of  topics  10,  16,  26  
  45. Drop  to  sentence  level   •  Select  three  books  with  highest  aggregate  of   20-­‐40  topic-­‐relevant  pages  for  more  precise   analysis   •  Manually  augment  argument  analysis   – Remodeling  of  three  volumes  at  sentence  level   – Training  other  methods  using  human  analysis  plus   sentence  similarity  
  46. Promising  early  results  …  
  47. Scholarly  Commons     User  Support  Service   •  Develop  training  materials     •  EducaBonal  workshops   •  Tool  and  workset  creaBon   •  Collaborate  with  librarians  and   DH  centers  at  HT  insBtuBons   •  Assist  researchers  in  HTRC  text   data  mining  research  projects   •  Based  at  University  of  Illinois   Library     47  
  48. Scholarly  Commons  User  Support   •  Gives  HT  insBtuBons  exclusive  access  to  training  and  learning  materials   that  help  them  establish  programs  that  integrate  HTRC  tools  and  services   into  their  scholarly  commons  programs  in  libraries  and  digital  humaniBes   centers.       •  Physically  located  on  the  University  of  Illinois  Library’s  Scholarly  commons.       •  Supported  by  several  Library  staff  and  faculty.    Key  among  these  is  the   Digital  Humani,es  Research  Specialist  who  will  assist  with  the   development  of  training  and  outreach  iniBaBves  in  support  of  researchers   working  with  the  Hathi  Trust  Research  Center  and  HathiTrust  digital  library   affiliates  who  seek  to  start  their  own  HTRC  research  services.     •  Effort  involves  planning,  implementaBon  and  conBnuous  development  of   training  materials,  educaBonal  workshops,  and  potenBal  tools,  and   outreach  acBviBes  in  support  of  the  usage  of  HTRC  tools  and  datasets.  
  49. Thanks  to  sponsors  
  50. #HTRC    @HathiTrust   http://www.hathitrust.org/htrc   http://www.hathitrust.org