Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Invited talk at TRUST Women’s Institute for Summer Enrichment (WISE), Cornell, NY, Jun 16, 2014. Infrastructure support for text mining research on a big data repository such as HathiTrust raises challenges in access and security when the bulk of the repository is protected by copyright.

  1. Case Study in Big Data: the Socio-Technical Issues of HathiTrust Digital Texts
     Women's Institute for Summer Enrichment, Cornell University, Jun 16, 2014
     Beth Plale, Professor, School of Informatics and Computing; Director, Data To Insight Center, Indiana University
     HathiTrust Research Center
  2. • Who are the Players? HathiTrust, Google, Authors Guild
     • The Object of Attention: 11 M books from university libraries
     • Rulings around copyright
     • HTRC, or why I care
     • Is security of HTRC Data Capsule good enough?
  3. The Players
  4. Books Digitization Project (2007)
     Libraries of U Michigan, U California, Virginia, Wisconsin, Indiana, …
     [Diagram labels: digitized books; digitized books; digitize]
  5. Legal action: Google / Authors Guild
     • Mar 2011: New York federal judge rejected a $125 million legal settlement that Google had worked out with the authors and publishers over the copyright issues.
     • Nov 2013: the same judge issued a ruling saying that Google's use of the works was a "fair use" under copyright law.
     [Diagram labels: digitized books; digitized books]
  6. Legal action
     • June 2014: 2nd Circuit Court of Appeals ruling on Authors Guild versus HathiTrust (Cornell, U Michigan, U California, U Wisconsin, Indiana) is a major victory for fair use.
     [Diagram label: digitized books]
  7. Highlights, 2014 ruling
     • With respect to the full-text database, the court found that although a copy of the entire work is made, the purpose of a full-text searchable database is so different from that of the underlying works that the use must be considered transformative. In fact, the court wrote, "the creation of a full-text searchable database is a quintessentially transformative use".
     Source: Parker Higgins, "Another Fair Use Victory for Book Scanning in HathiTrust", June 10, 2014
  8. Highlights, 2014 ruling
     • The Authors Guild argued that HathiTrust's use of an identical server and two tape back-ups constituted "excessive" copying.
     • Thankfully, the court rejected that premise, acknowledging that when it comes to digital technology, an approach that focuses only on individual copies made is insufficient.
     Source: Parker Higgins, "Another Fair Use Victory for Book Scanning in HathiTrust", June 10, 2014
  9. Does Authors Guild Represent All Authors?
     • The Authors Guild members are overwhelmingly trade-book authors; the books scanned by the Hathi Trust are overwhelmingly scholarly books written as part of an academic tradition that takes free access and sharing as its foundation.
     • The Authors Alliance: new organization representing authors who are primarily concerned with being read.
     Source: Cory Doctorow, "Court finds full-book scanning is fair use", 3:00 pm Sat, Jun 14, 2014
  10. Highlight, 2014 ruling
     • Given that consistent fair use record for book digitization, today's ruling might not be totally surprising. Still, the text of the opinion is encouraging, and reflects a court that respects the Constitutional purpose of copyright as a tool to promote the progress of science and the useful arts, not a blunt instrument for rightsholders to regulate all downstream uses.
     Source: Parker Higgins, "Another Fair Use Victory for Book Scanning in HathiTrust", June 10, 2014
  11. • Who are the Players? HathiTrust, Google, Authors Guild
     • The Object of Attention: 11 M books from university libraries
     • Rulings around copyright
     • HTRC, or why I care
     • Is security of HTRC Data Capsule good enough?
  12. HTRC, or why I care:
     HathiTrust digital library is "big data"; and text mining is the new library catalog search
  13. Similar model, different ends
     HTRC goes beyond "full text searchable database"
     [Diagram labels: $$; Scholarly search; Scholarly mining]
  14. HathiTrust (#HTRC @HathiTrust)
     • HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
       – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
     http://www.hathitrust.org/htrc
     http://www.hathitrust.org
     → Distinguished from
  15. #HTRC @HathiTrust
  16. Content of HathiTrust
     • Books and journals
       – Plus pilots around images, audio, born-digital
     • Digitization sources
       – Google (96.8%, 10,162,104)
       – Internet Archive (2.9%, 301,972)
       – Local (0.3%, 31,840)
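The percentages on the content slide can be re-derived from the volume counts. The following is a minimal sketch, assuming the three listed sources account for the whole collection; the names `SOURCES` and `source_shares` are illustrative and not part of any HathiTrust tooling.

```python
# Volume counts per digitization source, copied from the slide.
SOURCES = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}


def source_shares(counts):
    """Return (total, {source: percentage rounded to one decimal})."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, shares


total, shares = source_shares(SOURCES)
print(total)   # 10495916
print(shares)  # {'Google': 96.8, 'Internet Archive': 2.9, 'Local': 0.3}
```

The total of roughly 10.5 million volumes is in line with the "11 M books" figure quoted in the agenda.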
  17. Content Sources
  18. Content distribution
     360,000 volumes in Spanish
  19. Motivation for HTRC
     → HathiTrust repository is massive scale, a latent goldmine for text-based research
     → Restricted nature of parts of HathiTrust content suggests need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access
     → Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
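The "computation moves to the data" idea can be illustrated with a toy example: the researcher hands an analysis function to the repository, and only the derived output comes back. This is a hedged sketch, not the HTRC interface; `Repository`, `run`, and `word_counts` are hypothetical names, and the two "volumes" are made-up strings.

```python
from collections import Counter


class Repository:
    """Hypothetical stand-in for a restricted text repository."""

    def __init__(self, volumes):
        # volume id -> full text; the raw mapping is never handed to callers
        self._volumes = volumes

    def run(self, analysis):
        """Apply `analysis` next to the data and return only its output."""
        return {vid: analysis(text) for vid, text in self._volumes.items()}


def word_counts(text):
    """Derived data: how often each lower-cased word occurs."""
    return Counter(text.lower().split())


repo = Repository({
    "v1": "the origin of the species",
    "v2": "letters from the house",
})
results = repo.run(word_counts)
print(results["v1"].most_common(1))
```

Nothing in this toy stops an analysis function from returning the text itself, which is the gap the security questions later in the deck are concerned with.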
  20. HathiTrust Research Center
     • The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
     • HTRC Executive Committee
       – Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
       – J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
       – Robert McDonald, Indiana University Libraries
       – Beth Sandore Namachchivaya, University of Illinois Library
       – John Unsworth, CIO, Dean of Library, Brandeis University
  21. HTRC system
     [Diagram labels: Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots; Request]
  22. Complexity hiding interface
  23. Text mining at scale: quick tutorial on topic modeling of texts
  24. Topic Modeling
     • Can answer more complex or nuanced questions
       – What are the primary themes of an author?
       – What are the primary themes of a research domain?
       – When did a new topic enter a research domain?
     • Provides more data than word counts
       – 100s of topics can be extracted.
       – Underlying data (topics, volume, and page) is available
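To make the idea of extracting topics concrete, here is a minimal sketch of one common topic-modeling technique, latent Dirichlet allocation fitted with a collapsed Gibbs sampler. The deck does not say which algorithm or toolkit HTRC uses, so the choice of LDA, the function `fit_lda`, its parameters, and the four tiny "documents" are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Per-document topic proportions and the most frequent words per topic.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [w for w, c in tw.most_common(3) if c > 0] for tw in topic_word
    ]
    return theta, top_words


DOCS = [
    "illustrations volume literature volume".split(),
    "letters household house letters".split(),
    "literature illustrations volume".split(),
    "house household letters".split(),
]
theta, top_words = fit_lda(DOCS, num_topics=2)
for k, words in enumerate(top_words):
    print(k, words)
```

Each row of `theta` gives one document's mix of topics, which is the kind of underlying data (topic by volume) that the slide says is made available. A real corpus would need tokenization, stop-word removal, and a tuned library implementation rather than this toy sampler.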
  25. Themes for Authors
     Two topics with identical centralities (e.g., Dickens) but separate themes
     • More strongly focused on book (illustrations, volume, literature)
     • More strongly focused on author himself (letters, household, house)
  26. Ted Underwood, Univ of Illinois
  27. Digging into philosophy of science
     Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
     Colin Allen, IU
  28. The How
     • 1315 volumes from HTRC selected using keyword search for 'darwin', 'romanes', 'anthropomorphism', and 'comparative psychology'
     • Set contains lots of uninteresting books, e.g., college course catalogs
     • Apply topic modeling on 86-volume subset
     • Using iPy Notebook
  29. Of the set of topics, choose '16' as best
  30. Volumes most similar to topic 16
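The workflow on the preceding slides (keyword selection, then ranking volumes against a chosen topic) can be sketched in a few lines. This is an illustration under stated assumptions: the volume texts and topic weights below are invented, `select_volumes` and `most_similar` are hypothetical helpers, and "similarity" is taken to mean the weight a volume gives to the topic, which the slides do not specify.

```python
KEYWORDS = ("darwin", "romanes", "anthropomorphism", "comparative psychology")


def select_volumes(volumes, keywords=KEYWORDS):
    """Return ids of volumes whose text mentions any keyword (case-insensitive)."""
    return [
        vid
        for vid, text in volumes.items()
        if any(k in text.lower() for k in keywords)
    ]


def most_similar(doc_topics, topic, n=3):
    """Rank volumes by the weight they give to `topic`, highest first."""
    ranked = sorted(
        doc_topics.items(),
        key=lambda item: item[1].get(topic, 0.0),
        reverse=True,
    )
    return [vid for vid, _ in ranked[:n]]


# Made-up stand-ins for volume texts and for topic-model output.
VOLUMES = {
    "vol-a": "Darwin on the expression of the emotions in animals",
    "vol-b": "A gardening manual for the northern states",
    "vol-c": "Romanes and comparative psychology",
}
DOC_TOPICS = {
    "vol-a": {16: 0.62, 3: 0.38},
    "vol-c": {16: 0.15, 3: 0.85},
}

selected = select_volumes(VOLUMES)
print(selected)
print(most_similar(DOC_TOPICS, topic=16, n=1))
```

As the slide notes, plain keyword matching also pulls in uninteresting volumes such as course catalogs, which is why the topic model is applied to a smaller curated subset.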
  31. Copyright: A Reality
     Full text download is limited by both size and by copyright
  32. HTRC solution to fully-flexible text mining research on entire HT digital repository: HTRC Data Capsule
     Funded by Alfred P. Sloan Foundation; in collaboration with Atul Prakash, University of Michigan
  33. Questions driving HTRC Data Capsule
     • Non-consumptive use: can framework provide safe handling of large amounts of protected data?
     • Openness: can framework support user-contributed analysis without resorting to code walkthroughs prior to acceptance?
     • Large-scale and low cost: can protections be extended to utilization of large-scale national (public) computational resources?
  34. HTRC Data Capsules
     • Trusts text mining researcher to not deliberately leak repository data
     • Prevents malware acting on user's behalf from leaking data.
     • V1.0 limits analysis to running within single VM
  35. HTRC Data Capsule Architectural Components
     [Diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; SSH; Research results; Researcher; Registry Services, worksets]
  36. HTRC Data Capsule interaction
     [Diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; SSH; Research results; Researcher; Registry Services, worksets]
     Annotations (the slide marks steps 1 to 4):
     • Researcher requests new VM of type X
     • Image instance is created
     • Researcher installs tools onto VM through window on her desktop
     • Upon run, Secure Capsule controls I/O behind the scenes
     • Final location of results is registry
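One plausible reading of the interaction slide is a small life cycle: request a VM, install tools, start the run, and let results leave only through the registry. The sketch below models that reading as a toy class. It is not the HTRC implementation; `Capsule`, its methods, the tool name, and the `"registry"` destination string are all hypothetical, and the real system enforces I/O control at the VM level rather than in application code.

```python
class Capsule:
    """Toy model of a capsule VM (hypothetical; not the HTRC implementation)."""

    def __init__(self, image_type):
        self.image_type = image_type  # "VM of type X" requested by the researcher
        self.tools = []
        self.running = False

    def install(self, tool):
        """Researcher installs an analysis tool onto the VM."""
        self.tools.append(tool)

    def start_run(self):
        """From here on, the capsule controls where output may go."""
        self.running = True

    def emit(self, destination, payload, registry):
        """Release a result; only the registry is an allowed destination."""
        if not self.running:
            raise RuntimeError("no run in progress")
        if destination != "registry":
            raise PermissionError(f"output to {destination!r} is blocked")
        registry.append(payload)


registry = []
capsule = Capsule(image_type="X")
capsule.install("topic-modeling-toolkit")
capsule.start_run()
capsule.emit("registry", {"topic": 16, "top_volume": "vol-a"}, registry)

try:
    capsule.emit("researcher-desktop", "raw page text", registry)
    blocked = False
except PermissionError:
    blocked = True

print(registry, blocked)
```

The point of the sketch is the single controlled exit: whatever runs inside the capsule, including malware acting on the researcher's behalf, has only one channel through which data can leave.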
  37. setup
  38. HTRC secure data capsule: view from researcher desktop
  39. Thanks to our sponsors
  40. HTRC goes beyond "full text searchable database". Security has to be top concern.
     [Diagram label: scholarly research]
