Plale HathiTrust El Colegio de Mexico May2014

HathiTrust digital library and analytics with HTRC. The data, the uses.


  1. HathiTrust and HTRC: the changing Digital Library
     El Colegio de Mexico | 20.May.14
     Beth Plale – @bplale
     Professor, School of Informatics and Computing
     Director, HathiTrust Research Center
     Indiana University
     Tweet us - @HathiTrust #HTRC
     HATHI TRUST RESEARCH CENTER
  2. HathiTrust
     • HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
       – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
     http://www.hathitrust.org/htrc
     http://www.hathitrust.org
     → Distinguished from
  3. [Image slide; no transcript text]
  4. Take a look at Details of HathiTrust Collection
  5. Content
     • Books and journals
       – Pilots around images, audio, born-digital
     • Digitization sources
       – Google (96.8%, 10,162,104)
       – Internet Archive (2.9%, 301,972)
       – Local (0.3%, 31,840)
  6. Content Sources
  7. Content Package
  8. Metadata
     • Bibliographic
     • Structural
     • Rights
     • Administrative (preservation)
     • Holdings
  9. HathiTrust Repository Organization
  10. HathiTrust Repository Organization
  11. File System
  12. Content distribution
  13. Content distribution
      Not public domain outside available
  14. → HathiTrust repository is a latent goldmine for text mining analysis, analysis of large-scale corpora through computational tools, and time-based analysis
      → Restricted nature of HT content suggests need for new forms of access that preserve intimate nature of research investigation while honoring restrictions
      → Paradigm: computation moves to the data (not vice versa)
  15. Mission of HT Research Center
      • Research arm of HathiTrust
      • Goal: enable researchers world-wide to carry out computational investigation of HT repository through
        – Develop model for access: the ‘workset’
        – Develop tools that facilitate research by digital humanities and informatics communities
        – Develop secure cyberinfrastructure that allows computational investigation of entire copyrighted and public domain HathiTrust repository
      • Established: July, 2011
      • Collaborative effort of Indiana University and University of Illinois
  16. HTRC system
      [diagram labels: Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots; Request]
  17. Complexity hiding interface
  18. Workset builder
  19. HTRC Timeline
      • Phase I: development, 01 Jul 2011 – 31 Mar 2013
        – HTRC software and services release v1.0 http://sourceforge.net/p/htrc/code/
      • Phase II: outreach, 01 Apr 2013 - present
        – 2nd HTRC UnCamp Sep ‘13
      Attendees of UnCamp’13
  20. Access to copyrighted materials: HTRC Data Capsule
      A secure computing framework that:
      • Trusts that the researcher will not deliberately leak repository data, but
      • Prevents malware acting on the user's behalf from leaking data.
      Enforces:
      • Non-consumptive use: framework provides safe handling of large volumes of protected data
      • Openness: framework supports user-contributed analysis tools (that is, it does not limit uses to a known set of algorithms)
      • Efficiency: framework supports user-contributed analysis tools without resorting to code walkthroughs prior to acceptance
      • Large-scale and low cost: protections can be extended to utilization of large-scale national (public) supercomputers
  21. HTRC Secure Capsule Architectural Components
      [diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; SSH; Research results; Researcher; Registry Services, worksets]
  22. HTRC Secure Capsule Architecture
      [diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; SSH; Research results; Researcher; Registry Services, worksets]
      Annotations, marked 1) to 4) on the slide:
      • Researcher requests new VM of type X
      • Image instance is created
      • Researcher installs tools onto VM through window on her desktop.
      • Upon run, Secure Capsule controls I/O behind scenes
      • Final location of results is registry
  23. HTRC secure data capsule: view from researcher desktop
  24. EXAMPLES OF RESEARCH CARRIED OUT THROUGH HATHI TRUST RESEARCH CENTER
      • Author Gender Identification
      • Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
  25. GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES
      Stacy Kowalczyk, Asst. Professor, Dominican University
      Zong Peng, HTRC, Indiana University
      Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
  26. Gender Identification of Text
      • Question Investigated: Can we use author names in bibliographic records to identify gender?
      • 2.6 million bibliographic records
        – Extracted personal author data
        – MARC 100 abcd and 700 abcd
      • 606,437 unique personal author strings
      • Bibliographic data is not fielded like patent names
      • Relying on standard cataloging practice
        – Last name, first name middle name, titles/honorifics, dates
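The cataloging convention on this slide (last name, given names, titles/honorifics, dates) can be sketched as a small parser. This is an illustration only, not the project's code: the function name `parse_author`, the honorific set, and the date pattern are assumptions made for the example.

```python
import re

# Hypothetical honorific set for illustration; the project's real lists
# are not given in the slides.
HONORIFICS = {"sir", "bart", "lady", "mrs", "miss", "baron", "count", "duke"}
# Matches life dates such as "1856-1924" or an open range such as "1856-".
DATE_RE = re.compile(r"^\d{4}-(\d{4})?$")


def parse_author(raw):
    """Split 'Last, First Middle, titles, dates' into labeled parts."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    result = {"last": None, "given": [], "titles": [], "dates": None}
    if not parts:
        return result
    result["last"] = parts[0]
    for part in parts[1:]:
        if DATE_RE.match(part):
            result["dates"] = part
            continue
        for token in part.split():
            if token.rstrip(".").lower() in HONORIFICS:
                result["titles"].append(token.rstrip("."))
            else:
                result["given"].append(token)
    return result


record = parse_author("Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924")
print(record)
```

Because the data is not fielded, a parser like this can only rely on comma positions and word lists, so strings that depart from the convention will be split imperfectly.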
  27. Why interesting to HTRC? Introduces new source of metadata and from sources with varying authority
      Raises questions:
      1) How should community contributed metadata be distinguished from more authoritative sources?
      2) How should variability of quality even within a single contribution be conveyed to community?
  28. Authors vs Names
      • Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
      • Methuem, Algernon
      • Methuen Algernon
      • Methuen Marshall, Sir, bart., 1856-
      • Methuen, A. Sir, 1856-1924
      • Methuen, A. Sir, bart., 1856-1924
      • Methuen Marshall, Sir bart 1856-1924
      • Methuen, Algernon Methuen Marshall, Sir, 1856-1924
      • Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
      • Methuen, Algernon, 1856-1924
  29. Sources of Data
      • The Virtual International Authority File
        – Hosted by OCLC
      • Harvested names from multiple data sources
        – Census bureau
        – Baby name sites
      • EU Patent Research names list (Frietsch et al, 2009; Naldi et al. 2005)
        – Developed an extensive list of European names
      • Titles and honorifics
        – Multiple web resources
        – Sir, Baron, Count, Duke, Father, Cardinal, etc
        – Lady, Mrs. Miss, Countess, Duchess, Sister, etc
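One way these sources could be combined is a lookup that checks honorifics first and falls back to a first-name table. The sketch below is an assumption about how such a lookup might work, not the study's method: the title sets come from this slide, while the `FIRST_NAMES` table, the function name `guess_gender`, and the precedence of titles over names are invented for illustration.

```python
# Title lists taken from the slide.
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}
# Toy stand-in for the VIAF, census, baby-name and patent name lists
# (hypothetical entries).
FIRST_NAMES = {"algernon": "male", "mary": "female", "leslie": "ambiguous"}


def guess_gender(given_names, titles):
    """Return 'male', 'female', 'ambiguous' or 'unknown' for one author."""
    title_set = {t.rstrip(".").lower() for t in titles}
    has_male = bool(title_set & MALE_TITLES)
    has_female = bool(title_set & FEMALE_TITLES)
    if has_male and has_female:
        return "ambiguous"
    if has_male:
        return "male"
    if has_female:
        return "female"
    # No decisive title: fall back to the first given name that is listed.
    for name in given_names:
        hit = FIRST_NAMES.get(name.rstrip(".").lower())
        if hit:
            return hit
    return "unknown"


print(guess_gender(["A."], ["Sir"]))    # title decides
print(guess_gender(["Algernon"], []))   # name table decides
```

Checking titles first lets strings with only an initial, such as "Methuen, A. Sir", still receive a label.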
  30. Initial Gender Results
      • Approximately 80% of name strings have initial gender identification
        – Female: 59,365 (10%)
        – Male: 425,994 (70%)
        – Unknown: 114,204 (19%)
        – Ambiguous: 5,965 (less than 1%)
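The percentages on this slide can be rechecked from the counts. The short calculation below uses only the four counts shown; taking their sum as the denominator is an assumption, since the slide does not state one. That sum is 605,528, slightly below the 606,437 unique strings reported on slide 26, and the slides do not explain the difference.

```python
# Counts as given on the slide.
counts = {"female": 59365, "male": 425994, "unknown": 114204, "ambiguous": 5965}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}
# "Identified" here means a female or male label was assigned.
identified = shares["female"] + shares["male"]

for label, pct in shares.items():
    print(f"{label:<10}{counts[label]:>9,}{pct:7.1f}%")
print(f"identified (female + male): {identified:.1f}%")
```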
  31. Results by Data Source
      Against the whole set of name strings
      • VIAF
        – 19% hit rate
      • Web Names
        – 54% hit rate
      • Patents Names
        – 8%
  32. Colin Allen, Jamie Murdock
      Cognitive Science, Indiana University
      Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013
  33. The InPhO project is instructive because it demonstrates an interaction sequence between a researcher and his/her corpus that is nuanced, is multistep, and multi-modal.
      The HTRC cyberinfrastructure must be able to handle such a nuanced form of interaction between a researcher and their texts.
  34. Digging into philosophy of science
      • Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
      • Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic
      • Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
  35. The How
      • 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
      • Set contains lots of uninteresting books: e.g., college course catalogs
      • Apply LDA on 86 volume subset
      • Using iPy Notebook
  36. LDA topic modeling
      • LDA (Latent Dirichlet Allocation) uses a Bayesian updating method to generate a set of “topics”: probability distributions over the set of terms in a corpus
      • Number of topics is a parameter in the modeling technique
      • Method finds the set of topics that is best able to reproduce the term distributions in documents belonging to the corpus
      • Documents may be whole volumes, chapters, articles, single pages, even individual sentences (modeler’s choice)
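To make the description above concrete, here is a minimal pure-Python LDA sketch. It uses collapsed Gibbs sampling, one standard inference method for LDA; the slides do not say which inference method or library the InPhO work used, so this is a stand-in. The function name `lda_gibbs`, the hyperparameter defaults, and the four toy documents are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: list of token lists. Returns (theta, phi, vocab) where theta[d] is
    the topic distribution of document d and phi[k] is the term distribution
    of topic k, both smoothed by the priors alpha and beta.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove this token, then resample its topic given all others.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab


# Toy corpus (invented): two "animal mind" texts and two course catalogs.
docs = [
    "animal mind instinct animal behavior".split(),
    "instinct animal mind behavior mind".split(),
    "course catalog credit course semester".split(),
    "semester credit catalog course credit".split(),
]
theta, phi, vocab = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

The number of topics is passed in as a parameter, and each "document" is simply a list of tokens, so the same function applies whether the unit is a volume, a page, or a sentence.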
  37. Volume level topic modeling on ‘anthropomorphism’ yields set of topics
  38. .. Of set of topics, choose ‘16’ as best
  39. Volumes most similar to topic 16
  40. Repeat LDA at page level
  41. Topic model at page level for topics anthropomorphism, animal, and psychology
  42. Words sorted by similarity
  43. Pick top 3: topics 16, 10, 26
  44. Show documents of topics 10, 16, 26
  45. Drop to sentence level
      • Select three books with highest aggregate of 20-40 topic-relevant pages for more precise analysis
      • Manually augment argument analysis
        – Remodeling of three volumes at sentence level
        – Training other methods using human analysis plus sentence similarity
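The volume selection step can be sketched as an aggregation over page-level topic weights. The slide does not define "aggregate", so the sketch assumes one plausible reading: sum the weights of pages above a relevance threshold and keep the three highest-scoring volumes. The function name `top_volumes`, the threshold value, and the sample scores are hypothetical.

```python
from collections import defaultdict


def top_volumes(page_scores, threshold=0.5, k=3):
    """Rank volumes by the summed weight of their topic-relevant pages.

    page_scores: iterable of (volume_id, page_number, topic_weight).
    Pages below `threshold` (an assumed cutoff) are ignored.
    """
    totals = defaultdict(float)
    for volume_id, _page, weight in page_scores:
        if weight >= threshold:
            totals[volume_id] += weight
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [volume_id for volume_id, _ in ranked[:k]]


# Invented page-level weights for five volumes.
pages = [
    ("v1", 1, 0.9), ("v1", 2, 0.8),
    ("v2", 1, 0.6),
    ("v3", 1, 0.4),
    ("v4", 1, 0.95), ("v4", 2, 0.9), ("v4", 3, 0.7),
    ("v5", 1, 0.55),
]
print(top_volumes(pages))
```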
  46. Promising early results …
  47. Scholarly Commons User Support Service
      • Develop training materials
      • Educational workshops
      • Tool and workset creation
      • Collaborate with librarians and DH centers at HT institutions
      • Assist researchers in HTRC text data mining research projects
      • Based at University of Illinois Library
  48. Scholarly Commons User Support
      • Gives HT institutions exclusive access to training and learning materials that help them establish programs that integrate HTRC tools and services into their scholarly commons programs in libraries and digital humanities centers.
      • Physically located on the University of Illinois Library’s Scholarly Commons.
      • Supported by several Library staff and faculty. Key among these is the Digital Humanities Research Specialist who will assist with the development of training and outreach initiatives in support of researchers working with the Hathi Trust Research Center and HathiTrust digital library affiliates who seek to start their own HTRC research services.
      • Effort involves planning, implementation and continuous development of training materials, educational workshops, and potential tools, and outreach activities in support of the usage of HTRC tools and datasets.
  49. Thanks to sponsors
  50. #HTRC @HathiTrust
      http://www.hathitrust.org/htrc
      http://www.hathitrust.org
