HathiTrust Research Center
Secure Commons

Beth Plale
Co-Director, HathiTrust Research Center
Professor of Informatics
Director, Data To Insight Center
Indiana University
@bplale
University of Toronto, 25 June 2015
  	
  
HathiTrust is...
•  A trusted digital preservation service enabling the broadest possible access worldwide.
•  An organization with over 100 research libraries making up its membership.
•  A distributed set of services operated by different members (California Digital Library, Illinois, Indiana, Michigan).
•  A range of programs enabled by the large-scale collection of digitized materials.
  	
  
Mission
To contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge.
…building comprehensive collections and infrastructure co-owned and managed by partners.
…infrastructure for digital content of value to scholars and researchers.
…enabling access by users with print disabilities.
…supporting research with the collections.
…stimulating shared collection storage strategies.
  
Collections
  
  	
  
Preservation with Access
•  Preservation
– TRAC-certified
•  Discovery
– Bibliographic and full-text search of all materials
•  Access and Use
– Full-text search (all users)
– Public domain and open access works (all users)
– Collections and APIs (all users)
– Lawful uses of in-copyright works (members)
  
HathiTrust in April 2015
•  13.3 million total items
– 6.8 million book titles
– 355,000 serial titles
– 612,000 US federal government documents
– 5.03 million items open (public domain & CC licenses)

The collection primarily includes published materials in bound form, digitized from library collections.
  
  
HathiTrust Research Center Secure Commons
  
  	
  
Mission of the HT Research Center
•  Research arm of HathiTrust
•  Established: July 2011
•  Collaborative center: Indiana University & University of Illinois
•  Mission: Enable researchers worldwide to accomplish tera-scale text data mining and analysis
•  Major effort to date:
–  Build a secure and trusted environment surrounding the sensitive text and image data: the Trust Ring
–  Make the data more usable and accessible to researchers
  
	
  
	
  
Secure Commons “Trust Ring”
•  Logical ring within which exist trusted services and computers that protect and provide access to the sensitive (copyright) data
•  Computation moves to the data, not vice versa
•  Computation is carried out in the Trust Ring
  
	
  
Secure Commons Services Stack
[Diagram labels: IU, UIUC]
•  Raw copyright data 1) on file system and in archive in pairtree form, 2) in chunked form for parallel processing, and 3) in full-text Solr index
•  Knowledge product services
•  Data Capsule VMs
•  Services and Tools: data discovery, extraction, cleaning, mining/analysis, visualization
•  Knowledge products (public): workset, ontology, feature sets
•  HTRC Portal (for authentication)
•  Knowledge products (private): personal worksets
•  External data cache
•  DH, CS, NLP, R, ...
•  Data management services

The Trust Ring gains the core of its trustworthiness from the highly secure and heavily managed storage and compute environment at Indiana University.
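The diagram mentions raw data stored "in pairtree form". As a minimal sketch, assuming the character cleaning and two-character splitting described in the public Pairtree specification, an identifier could be mapped to a directory path as follows. The function names are illustrative, and HTRC's actual layout (namespace prefixes, root directory names) is not shown in the slides and is not modeled here.

```python
# Characters that the Pairtree spec hex-encodes as ^hh.
HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
SINGLE_CHAR = {"/": "=", ":": "+", ".": ","}


def clean_id(identifier: str) -> str:
    """Apply Pairtree identifier cleaning to one identifier string."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODE:
            out.append("^%02x" % byte)        # non-visible or reserved character
        else:
            out.append(SINGLE_CHAR.get(ch, ch))
    return "".join(out)


def pairtree_path(identifier: str, width: int = 2) -> str:
    """Split the cleaned identifier into fixed-width path segments."""
    cleaned = clean_id(identifier)
    return "/".join(cleaned[i:i + width] for i in range(0, len(cleaned), width))


if __name__ == "__main__":
    print(pairtree_path("ark:/13960/t77s8cw40"))
```

For the local identifier `ark:/13960/t77s8cw40`, the cleaned form is `ark+=13960=t77s8cw40`, which splits into `ar/k+/=1/39/60/=t/77/s8/cw/40`.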
  
Researcher Interaction
Interaction with HTRC is through one of three options:
1.  Services and tools for data extraction, data cleaning, data analysis, and results visualization. Self-service, browser-based.
2.  Check out a Data Capsule VM. The researcher checks out a VM and configures it for their use (currently for the technology savvy).
3.  Direct engagement with HTRC staff
  
	
  
	
  	
  	
  	
HTRC Portal: https://sharc.hathitrust.org/
  
Self-service portal for services and tools
  
[Diagram: four stages, Data Extraction → Data Cleaning → Data Analysis → Visualization. Source: HT DLib. Each task takes input parameters (JSON) and produces task output (JSON). Result: stored to workset.]
Overall Result
•  Graphs
•  Raw data
•  Structured data
•  etc.

Tasks can be programs written in any language
•  Python, R, Java, C#, …
The current solution of SEASR workflows is being deprecated.
New solution: a four-stage framework for the researcher to plug together desired tasks.
New tools in each task come from the HTRC community, open source, etc.
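To make the four-stage idea concrete, here is a minimal sketch, assuming each task is a callable that receives its own JSON parameters plus the previous task's JSON output. The task bodies, parameter names, and the `run_pipeline` helper are all hypothetical; the slides do not specify the actual HTRC task interface.

```python
import json


def extract(params, _previous):
    # Hypothetical extraction task: pages are passed in directly as parameters.
    return {"pages": params["pages"]}


def clean(_params, previous):
    # Hypothetical cleaning task: normalize case and surrounding whitespace.
    return {"pages": [p.strip().lower() for p in previous["pages"]]}


def analyze(params, previous):
    # Hypothetical analysis task: count occurrences of one term per page.
    term = params["term"]
    return {"counts": [p.count(term) for p in previous["pages"]]}


def visualize(_params, previous):
    # Hypothetical visualization task: one text bar per page.
    return {"bars": ["#" * c for c in previous["counts"]]}


def run_pipeline(stages, params_json):
    """Run the stages in order, passing JSON-serializable output along."""
    params = json.loads(params_json)
    output = {}
    for name, task in stages:
        result = task(params.get(name, {}), output)
        # Round-trip through JSON so every stage boundary is language-neutral.
        output = json.loads(json.dumps(result))
    return output


STAGES = [("extract", extract), ("clean", clean),
          ("analyze", analyze), ("visualize", visualize)]

if __name__ == "__main__":
    request = json.dumps({
        "extract": {"pages": ["  Diesel engine and diesel ENGINE ", "steam"]},
        "analyze": {"term": "diesel engine"},
    })
    print(run_pipeline(STAGES, request))
```

Serializing at every stage boundary is one way to keep tasks swappable across languages, which is the property the slide emphasizes.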
  
Data Capsule
Foundations of HT Data Capsule:
K. Borders, E. V. Weele, B. Lau, and A. Prakash. Protecting confidential data on personal computers with storage capsules. 18th USENIX Security Symposium, pp. 367–382. USENIX Association, 2009.
  
HathiTrust Data Capsule concept
•  Researcher “checks out” a virtual machine (VM)
•  VM runs in the Trust Ring
•  Researcher owns their VM through weeks/months of analysis
•  Getting stuff into the VM is easy, but there is a controlled and audited process for getting results out of the VM
  
Data Capsule with IPython installed
  
Mode switch protection: maintenance mode
[Diagram labels: Data Capsule; HTRC raw data sources]
•  User traffic from desktop allowed
•  Arbitrary network download allowed
•  Arbitrary network upload allowed
During maintenance mode, the researcher installs new software and loads data into the capsule.
  	
  
Mode switch protection: secure mode
[Diagram labels: Data Capsule; HTRC raw data sources]
•  User traffic from desktop allowed
•  Arbitrary network download not allowed
•  Arbitrary network upload not allowed
The researcher switches to secure mode when ready to run her tools.
Results: researcher tools must write results to a special directory; these are reviewed before release.
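The allow/deny rules on the two mode slides can be summarized as a small policy table. This is only a sketch of the rules as stated on the slides; the enum and function names are made up for illustration, and the real Data Capsule enforces these rules at the VM manager level, not in application code.

```python
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"


class Traffic(Enum):
    DESKTOP = "user traffic from desktop"
    NETWORK_DOWNLOAD = "arbitrary network download"
    NETWORK_UPLOAD = "arbitrary network upload"


# Which traffic classes each mode allows, as listed on the slides.
POLICY = {
    Mode.MAINTENANCE: {Traffic.DESKTOP, Traffic.NETWORK_DOWNLOAD, Traffic.NETWORK_UPLOAD},
    Mode.SECURE: {Traffic.DESKTOP},
}


def is_allowed(mode: Mode, traffic: Traffic) -> bool:
    """Return True if the given traffic class is permitted in the given mode."""
    return traffic in POLICY[mode]


if __name__ == "__main__":
    for mode in Mode:
        for traffic in Traffic:
            print(mode.value, "|", traffic.value, "|", is_allowed(mode, traffic))
```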
  
Threat Model
•  User is trustworthy
•  Virtual machine (VM) manager and the host it runs on are also trusted.
•  VM is NOT trusted. We assume the possibility of malware being installed as well as other remotely initiated attacks on the VM, which are undetectable to the user.
  	
  
HTRC Data Capsules
See Data Capsule Tutorial for step-by-step instructions:

Go to wiki:
https://wiki.htrc.illinois.edu
Navigate to:
Community > HTRC Data Capsule > HTRC Data Capsule Tutorial
  
Direct engagement with HTRC staff
  
  	
  
HTRC Advanced Collaborative Support
Awards for HTRC developer time
1st round awards:
•  Detecting Literary Plagiarisms: The Case of Oliver Goldsmith
•  Taxonomizing the Texts: Towards Cultural-Scale Models of Full Text
•  The Trace of Theory
•  Tracking technology diffusion thru time using HT Corpus
Coming: call for 2nd round proposals.

http://hathitrust.org/htrc for details
… or Dr. Miao Chen, miaochen@indiana.edu
  
Advanced Collaborative Support
•  Pairs HT institution researchers with expert staff for an extended period during which they work together to address a particularly vexing issue (e.g., efficient parallelization and optimization of a machine learning algorithm)
•  20 hours/week available. Example: at any one time, 4 active projects, each receiving 5 hours a week for up to 2 months.
•  Resourced at 1.25 FTE
•  Staffed by HTRC staff who have signed the staff agreement
  	
  
[Organization chart: HTRC Advisory Board]
•  Advanced Collaborative Support (coordinated by M. Chen)
•  Research Programmer (.5 FTE)
•  Computational Research Liaison (.5 FTE)
•  Asst Dir Outreach & Education (M. Chen) (1 year at .25 FTE)
•  Scholarly Commons Dig Humanities Specialist (1.0 FTE)
•  CLIR Postdoctoral Research Associate (2 years at 1.0 FTE)
•  Digital Research Librarian support (.2 FTE)
•  Scholars Commons Support (.5 FTE)
•  LIS MS Students
•  UI Managing Director (.11 FTE)
Scholarly Commons
User Support Services
•  Develop training materials
•  Educational workshops
•  Tool and workset support
•  Collaborate with librarians and DH centers at HT institutions
•  Assist researchers in HTRC text data mining research projects
•  Collaboration: University Libraries, Illinois and Indiana
  
	
  
Knowledge Products: ongoing projects to improve
  
  	
  
Worksets
•  The ability to slice through a massive corpus constructed from many different library collections, and out of that to construct the precise workset required for a particular scholarly investigation, is an example of the “game changing” potential of the HathiTrust...
  
Dimensions of Workset Creation (Illustrative)
My workset should contain (inspired by 2012 UnCamp):
•  Volumes pertaining to Japan / in Japanese
•  All volumes relevant to the study of Francis Bacon
•  Music scores or notation extracted from HT volumes
•  Images of Victorian England extracted from HT vols.
•  Volumes in HT similar to TCP-ECCO novels
•  19th c. English-language novels by female authors
•  Representative sample (by pub date & genre) of French-language items in HT
  
What is a Workset? #1
•  A workset is an aggregation of materials brought together for the purpose of discovery and analysis.

What is a Workset? #2
•  Worksets are conceptual and must be expressible in a variety of ways
•  Need to facilitate inclusion of resources beyond HathiTrust
•  Need to facilitate the inclusion of resources at many different levels of granularity beyond the book

What is a Workset? #3
•  Worksets encapsulate the specific materials that underwent analysis
•  Need to capture provenance information
•  Possible recording of parameters

What is a Workset? #4
•  Worksets should be able to spawn descendants but otherwise be immutable
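One way to read "spawn descendants but otherwise immutable" is as a frozen record whose only way to change is to create a child that remembers its parent. The sketch below assumes that reading; the `Workset` class, its fields, and the `derive` method are illustrative and are not the HTRC workset model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Workset:
    """Illustrative immutable workset with a provenance link to its parent."""
    title: str
    creator: str
    volume_ids: Tuple[str, ...]
    parent: Optional["Workset"] = None
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def derive(self, title: str, keep: Callable[[str], bool]) -> "Workset":
        """Spawn a descendant holding only the volumes for which keep() is true."""
        kept = tuple(v for v in self.volume_ids if keep(v))
        return Workset(title=title, creator=self.creator,
                       volume_ids=kept, parent=self)


if __name__ == "__main__":
    base = Workset("Agrippa", "rkfritz",
                   ("dul1.ark:/13960/t77s8cw40", "uc1.b4125277"))
    child = base.derive("UC volumes only", lambda v: v.startswith("uc1."))
    print(child.volume_ids, "derived from", child.parent.title)
```

Because the dataclass is frozen, assigning to a field after construction raises an error, so the materials that underwent analysis stay fixed while descendants record where they came from.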
  
Scope
[Diagram: Draft Workset Data Model V. 0.2]
Resources: :_workset1, :_desc1, :_curator1, dul1.ark:/13960/t77s8cw40, http://catalog.hathitrust.org/Record/010944168
Classes: htrc:Collection, htrc:BibliographicResource, htrc:BibliographicRecord, cnt:ContentAsText, foaf:Agent
Properties: rdf:type, rdf:about, dc:title, dc:creator, dcterms:abstract, dcterms:created, dcterms:extent, cnt:content, foaf:accountName, htrc:isGatheredInto
Literals: “Agrippa”^^xsd:string, “Agrippa and Mexia”^^xsd:string, “rkfritz”^^xsd:string, 9^^xsd:integer, “2013-11-11T15:55:48-5:00Z”^^xsd:dateTime
  
Page-level Statistics Extraction Over HathiTrust Corpus for Tech Terms
Acknowledgements: collaboration with Michelle Alexopoulos, University of Toronto. Extraction and analysis by Guangchen Ruan, CS PhD student at Indiana University
  
  	
  
Motivation and Problem
•  Given a list of terms (n-grams), extract page-level statistics for each term.
For instance, seek the frequency of appearance of the term “diesel engine” at volume level and page level: in which volumes, and on which pages, with a frequency count per page.

•  We undertook to compare the accuracy of two approaches: one that extracts terms from a Solr index, and another that extracts terms using a single-pass processing framework we developed to work directly on the raw data.

•  Sample output for page-level stats for “diesel engine”
  
{
  "tech_name": "diesel engine",
  "volumes": [
    {
      "volumeID": "uc1.b4125277",
      "pageLevelStats": [
        {
          "pageSeq": "146",
          "pageLabel": "136",
          "count": 3},
        {
          "pageSeq": "649",
          "pageLabel": "639",
          "count": 2}]
    }, …
  ]}
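Since the page-level output is plain JSON, volume-level counts can be derived from it by summing per-page counts. The sketch below uses only the one volume visible in the sample above (the elided entries are left out), and the helper name `volume_totals` is an illustrative choice.

```python
import json

# The single volume shown in the sample output; elided volumes are omitted.
SAMPLE = """
{
  "tech_name": "diesel engine",
  "volumes": [
    {
      "volumeID": "uc1.b4125277",
      "pageLevelStats": [
        {"pageSeq": "146", "pageLabel": "136", "count": 3},
        {"pageSeq": "649", "pageLabel": "639", "count": 2}
      ]
    }
  ]
}
"""


def volume_totals(doc):
    """Map each volumeID to the sum of its per-page counts."""
    return {
        vol["volumeID"]: sum(page["count"] for page in vol["pageLevelStats"])
        for vol in doc["volumes"]
    }


if __name__ == "__main__":
    print(volume_totals(json.loads(SAMPLE)))
```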
  
Approach one: page-level index using Solr
§  Build page-level index from raw text. Obtain page-level stats through Solr query
§  Computation and time costs are high to build a page-level index for each tech term, so build a single page-level index for a group of words with similar semantics
§  e.g., “diesel engine”, “diesel motor”, “diesel powered engine”
  
Approach two: single-pass processing distributed computing framework
§  For each volume, directly scan the content of each page to check for matches using regular expressions
§  Divide volumes and computation across multiple machines to speed up
§  Computation is not sensitive to the # of tech terms being searched, so it can provide page-level stats for each tech term rather than one for a group
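A minimal sketch of the per-volume scan, assuming pages arrive as a list of raw text strings and that a term matches on word boundaries with any whitespace between its words. The function names and the matching rules are assumptions for illustration; the actual HTRC framework and its distribution across machines are not shown in the slides.

```python
import re
from collections import defaultdict


def compile_terms(terms):
    """Build one case-insensitive, word-bounded pattern per tech term."""
    patterns = {}
    for term in terms:
        body = r"\s+".join(re.escape(word) for word in term.split())
        patterns[term] = re.compile(r"\b" + body + r"\b", re.IGNORECASE)
    return patterns


def scan_volume(pages, patterns):
    """Read each page once and record per-term match counts for that page."""
    stats = defaultdict(list)
    for seq, text in enumerate(pages, start=1):
        for term, pattern in patterns.items():
            count = len(pattern.findall(text))
            if count:
                stats[term].append({"pageSeq": str(seq), "count": count})
    return dict(stats)


if __name__ == "__main__":
    pages = [
        "The diesel engine and Diesel\nengine.",
        "diesel engines only",
        "except diesel; engine generator sets",
    ]
    patterns = compile_terms(["diesel engine", "diesel engines"])
    print(scan_volume(pages, patterns))
```

Each term keeps its own pattern, so statistics come out per term rather than per group, and the third page produces no match because punctuation separates the two words in the raw text.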
  
Quality evaluation: Solr-based vs. single-pass processing
•  We compare results of the two approaches under 8 tech term groups, or equivalently 57 tech terms
•  Overall, results from the two approaches have over 95% consistency
•  For the inconsistent portion, we manually inspect the raw text content to verify the ground truth
•  Evaluation shows that the single-pass processing approach is more accurate (fewer false positives and negatives) than the Solr approach
  
Tech term groups
Tech term group | Tech terms
diesel engine (6 terms) | “diesel engine”, “diesel engines”, “diesel motor”, “diesel motors”, “diesel powered engine”, “diesel powered engines”
gas engine (20 terms) | “gas engine”, “gas engines”, “gas motor”, “gas motors”, “gas powered engine”, “gas powered engines”, “gas powered motor”, “gas powered motors”, “gasoline engine”, “gasoline engines”, “gasoline motor”, “gasoline motors”, “gasoline powered engine”, “gasoline powered engines”, “gasoline powered motor”, “gasoline powered motors”, “gasoline-powered engine”, “gasoline-powered engines”, “gas-powered engines”, “gas-powered motors”
internal-combustion-engine (4 terms) | “internal combustion engine”, “internal combustion engines”, “internal combustion motor”, “internal combustion motors”
steam boat (2 terms) | “steam boat”, “steam boats”
steam engine (12 terms) | “Corliss engine”, “Corliss engines”, “Corliss steam engine”, “Corliss steam engines”, “Newcomen steam engine”, “Newcomen steam engines”, “steam engine”, “steam engines”, “watt engine”, “watt engines”, “watt steam engine”, “watt steam engines”
steam locomotive (4 terms) | “steam locomotive”, “steam locomotives”, “steam train”, “steam trains”
steam ship (2 terms) | “steam ship”, “steam ships”
Telegraph (7 terms) | “cable gram”, “cablegram”, “cable grams”, “telegram”, “telegrams”, “telegraph”, “telegraphs”
  
Volume-level and page-level comparison
Columns:
(A) Volume level: (vols appearing in s1 but not s2) / (total # of volumes in s1)
(B) Volume level: (vols appearing in s2 but not s1) / (total # of volumes in s2)
(C) Page level in common volume set: (# of page records in s1 but not s2) / (total # of page records in s1)
(D) Page level in common volume set: (# of page records in s2 but not s1) / (total # of page records in s2)

Tech term group | (A) | (B) | (C) | (D)
diesel engine (6 terms) | 472/19,869 (2.38%) | 98/19,495 (0.50%) | 2,503/84,142 (2.97%) | 1,986/83,625 (2.37%)
gas engine (20 terms) | 1,132/45,321 (2.50%) | 846/45,035 (1.88%) | 7,065/187,735 (3.76%) | 5,831/186,501 (3.12%)
internal-combustion-engine (4 terms) | 2,418/21,122 (11.4%) | 80/18,784 (0.43%) | 8,480/75,567 (11.2%) | 3,994/71,081 (5.61%)
steam boat (2 terms) | 4,209/176,652 (2.38%) | 1,166/173,609 (0.64%) | 25,808/794,832 (3.25%) | 33,194/802,218 (4.14%)
steam engine (12 terms) | 9,259/127,385 (7.27%) | 828/118,954 (0.70%) | 31,475/476,992 (6.60%) | 19,145/464,662 (4.12%)
steam locomotive (4 terms) | 707/13,539 (5.22%) | 169/13,001 (1.30%) | 2,880/36,294 (7.93%) | 1,752/35,166 (4.98%)
steam ship (2 terms) | 1,876/134,220 (1.39%) | 920/133,264 (0.69%) | 12,024/573,453 (2.09%) | 15,141/576,570 (2.63%)
Telegraph (7 terms) | 67,293/361,483 (18.6%) | 375/294,565 (0.13%) | 787,481/2,140,284 (36.7%) | 164,601/1,517,404 (10.8%)

** s1 refers to single-pass processing; s2 refers to the Solr-based approach
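The four ratios in the comparison are one-sided set differences. A minimal sketch, assuming each approach's result is a set of (volume ID, page sequence) records; the record shape and function names are assumptions for illustration, not the authors' evaluation code.

```python
def ratio(only, total):
    """Return (numerator, denominator, percentage rounded to 2 places)."""
    percent = round(100.0 * only / total, 2) if total else 0.0
    return only, total, percent


def compare(s1_records, s2_records):
    """Compare two result sets of (volume_id, page_seq) records."""
    vols1 = {vol for vol, _ in s1_records}
    vols2 = {vol for vol, _ in s2_records}
    common = vols1 & vols2
    # Page-level comparison is restricted to volumes found by both approaches.
    pages1 = {rec for rec in s1_records if rec[0] in common}
    pages2 = {rec for rec in s2_records if rec[0] in common}
    return {
        "vols_s1_not_s2": ratio(len(vols1 - vols2), len(vols1)),
        "vols_s2_not_s1": ratio(len(vols2 - vols1), len(vols2)),
        "pages_s1_not_s2": ratio(len(pages1 - pages2), len(pages1)),
        "pages_s2_not_s1": ratio(len(pages2 - pages1), len(pages2)),
    }


if __name__ == "__main__":
    s1 = {("a", "1"), ("a", "2"), ("b", "1"), ("c", "9")}
    s2 = {("a", "1"), ("b", "1"), ("b", "2"), ("d", "3")}
    print(compare(s1, s2))
    # The first diesel engine cell in the table: 472 of 19,869 volumes.
    print(ratio(472, 19869))
```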
  
Analysis of Solr false positives/negatives
•  False positive
§  Example one: false positive match for “diesel engine”
“17 Engines and Turbines (Excludes aircraft and rocket engines; automotive engines, except diesel; engine generator sets; and locomotives.)”
§  Example two: false positive match for “diesel motor”
“Fossil fuel consumption (gasoline, diesel, motor oil) would decrease as a result of this alternative.”
§  Cause analysis: Solr builds its page-level index using Lucene tokenization, which removes non-word characters. One-pass processing correctly handles such cases because it matches by regular expression against the raw text
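The false positives above can be reproduced with a toy comparison. The tokenizer here is a deliberately simplified stand-in that just drops non-word characters; it is not Lucene's tokenizer, and the function names are illustrative.

```python
import re


def simple_tokens(text):
    """Simplified stand-in for index tokenization: keep only word characters."""
    return re.findall(r"\w+", text.lower())


def phrase_in_tokens(phrase, text):
    """Phrase match over the token stream, where punctuation is already gone."""
    words = phrase.lower().split()
    tokens = simple_tokens(text)
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def phrase_in_raw(phrase, text):
    """Phrase match against raw text: words separated by whitespace only."""
    body = r"\s+".join(re.escape(word) for word in phrase.split())
    return re.search(r"\b" + body + r"\b", text, re.IGNORECASE) is not None


if __name__ == "__main__":
    page = "automotive engines, except diesel; engine generator sets"
    print(phrase_in_tokens("diesel engine", page))  # token-based match
    print(phrase_in_raw("diesel engine", page))     # raw-text match
```

On the two example sentences, the token-based check reports a match because the semicolon or comma has been discarded, while the raw-text check does not.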
  
Cause analysis of Solr’s false positives/negatives (cont.)
•  False negative
§  Example one: false negative match for “diesel engine”
“Steam boilers and equipment, steam and gas turbines, nuclear reactors, steam engines, diesel en-
gines, and other prime movers”
Lucene tokenization splits “en-gines”, which is broken at the end of a line, into “en” and “gines”, which leads to a false negative. Single-pass processing handles the word continuation case by concatenating “en-gines” into “engines” before matching.
§  Example two: false negative match for “diesel motor”
“The introduction of commercial-model diesel engines, in a relatively small quantity of trucks.”
Solr failed to detect this straightforward case; we do not know the reason, and it needs further investigation.
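The word continuation handling can be sketched as a small preprocessing step that joins a word split by a hyphen at the end of a line. This is an assumed implementation for illustration; the slides do not show the actual rule used, and a rule this simple would also merge legitimately hyphenated words that happen to break across lines.

```python
import re

# A word character, a hyphen at end of line, then the continuation on the next line.
LINE_BREAK_HYPHEN = re.compile(r"(\w)-[ \t]*\n[ \t]*(\w)")


def dehyphenate(text):
    """Join words that were split with a hyphen across a line break."""
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)


if __name__ == "__main__":
    page = "steam engines, diesel en-\ngines, and other prime movers"
    print(dehyphenate(page))
```

Hyphens inside a line, as in "gasoline-powered engine", are left untouched because the pattern requires a newline after the hyphen.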
  
False negatives caused by OCR errors
•  Example one:
“Burdick, R. H. Performance of diesel.engine plants in Texas.”

•  Example two:
“from gasoline-powered to fuel-efficient diesel-_powered engines”

•  The one-pass processing approach failed to detect the terms in such cases. (“diesel.engine” and “diesel-_powered” are tokens generated as a result of OCR errors.)
  
[Diagram: Secure Commons Services Stack, repeated]
•  Raw copyright data 1) on file system and in archive in pairtree form, 2) in chunked form for parallel processing, and 3) in full-text Solr index
•  Knowledge product services
•  Data Capsule VMs
•  Services and Tools for data extraction, data cleaning, data analysis, and results visualization
•  Knowledge products (public): workset, ontology, feature sets
•  Portal (for authentication)
•  Knowledge products (private): personal worksets
•  External data cache
•  DH, CS, NLP, R, ...
•  Data management services
  
Thank you!

plale@indiana.edu
@bplale
https://sharc.hathitrust.org/
  

 
3 tu.dc 5min nordbib jp rombouts
3 tu.dc 5min nordbib jp rombouts3 tu.dc 5min nordbib jp rombouts
3 tu.dc 5min nordbib jp rombouts
 

HathiTrust Research Center Secure Commons

  • 1. HathiTrust Research Center Secure Commons. Beth Plale, Co-Director, HathiTrust Research Center; Professor of Informatics; Director, Data To Insight Center; Indiana University. @bplale. University of Toronto, 25 June 2015
  • 2. HathiTrust is... • A trusted digital preservation service enabling the broadest possible access worldwide. • An organization with over 100 research libraries making up its membership. • A distributed set of services operated by different members (California Digital Library, Illinois, Indiana, Michigan). • A range of programs enabled by the large-scale collection of digitized materials.
  • 3. Mission: To contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge. …building comprehensive collections and infrastructure co-owned and managed by partners. …infrastructure for digital content of value to scholars and researchers. …enabling access by users with print disabilities. …supporting research with the collections. …stimulating shared collection storage strategies.
  • 4. Collections
  • 5. Preservation with Access • Preservation – TRAC-certified • Discovery – Bibliographic and full-text search of all materials • Access and Use – Full-text search (all users) – Public domain and open access works (all users) – Collections and APIs (all users) – Lawful uses of in-copyright works (members)
  • 6. HathiTrust in April 2015 • 13.3 million total items – 6.8 million book titles – 355,000 serial titles – 612,000 US federal government documents – 5.03 million items open (public domain & CC-licenses). The collection primarily includes published materials in bound form, digitized from library collections.
  • 7. (Image-only slide, dated 16 April 2015)
  • 8. (Image-only slide, dated 16 April 2015)
  • 9. HathiTrust Research Center Secure Commons
  • 10. Mission of the HT Research Center • Research arm of HathiTrust • Established: July 2011 • Collaborative center: Indiana University & University of Illinois • Mission: Enable researchers worldwide to accomplish tera-scale text data mining and analysis • Major effort to date: – Build a secure and trusted environment surrounding the sensitive text and image data: the Trust Ring – Make the data more usable and accessible to researchers
  • 11. Secure Commons “Trust Ring” • Logical ring within which exist trusted services and computers that protect and provide access to the sensitive (copyright) data • Computation moves to the data, not vice versa • Computation is carried out in the trust ring. (Diagram labels: IU, UIUC)
  • 12. Secure Commons Services Stack (diagram labels): DH, CS, NLP, R, ... • HTRC Portal (for authentication) • Services and Tools: data discovery, extraction, cleaning, mining/analysis, visualization • Data Capsule VMs • Knowledge product services • Knowledge products (public): workset, ontology, feature sets • Knowledge products (private): personal worksets • Data management services • External data cache • Raw copyright data 1) on file system and in archive in pairtree form, 2) in chunked form for parallel processing, and 3) in full-text Solr index
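The "pairtree form" mentioned on slide 12 maps an object identifier to a directory path made of two-character segments. The following is a minimal sketch of that mapping as described in the public Pairtree specification; the function names are hypothetical and this is not HTRC's own tooling, which the slides do not show. It also assumes the identifier passed in is the part stored in the tree (for example the `ark:/...` portion of a volume ID).

```python
# Illustrative sketch of the Pairtree identifier-to-path mapping.
# Function names are hypothetical; this is not HTRC code.

# Characters that Pairtree hex-encodes as ^hh before the single-character swaps.
_HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character swaps applied after hex-encoding.
_SWAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_id(identifier: str) -> str:
    """Apply Pairtree's two-step character cleaning to an identifier."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        # Hex-encode reserved characters and anything outside visible ASCII.
        if ch in _HEX_ENCODE or not (0x21 <= byte <= 0x7E):
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(_SWAP)


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character path segments."""
    cleaned = clean_id(identifier)
    return "/".join(cleaned[i:i + 2] for i in range(0, len(cleaned), 2))


if __name__ == "__main__":
    print(pairtree_path("ark:/13960/t77s8cw40"))
```

Short, fixed-width directory names keep any single directory from growing too large, which matters for a collection of millions of volumes.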
  • 13. Trust Ring gains the core of its trustworthiness from the highly secure and heavily managed storage and compute environment at Indiana University
  • 14. Researcher Interaction. Interaction with HTRC is through one of three options: 1. Services and tools for data extraction, data cleaning, data analysis and results visualization. Self-service, browser-based. 2. Check out a Data Capsule VM. Researcher checks out and configures for their use (currently for the technology savvy). 3. Direct engagement with HTRC staff. HTRC Portal: https://sharc.hathitrust.org/
  • 15. Self-service portal for services and tools
  • 16. Pipeline stages: Data Extraction → Data Cleaning → Data Analysis → Visualization. Source: HT DLib. Each task takes Input Parameters (JSON) and produces Task output (JSON). Result: stored to workset. Overall Result • Graphs • Raw data • Structured data • etc. Tasks can be programs written in any language • Python, R, Java, C#, … Current solution of SEASR workflows is being deprecated. New solution: four-stage framework for researchers to plug together desired tasks. New tools in each task come from the HTRC community, open source, etc.
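The four-stage framework on slide 16 can be pictured as a chain of tasks that each read JSON and write JSON. The sketch below is only an illustration of that idea under stated assumptions: the toy library, the task functions, and the JSON field names (`volumes`, `texts`, `counts`, `chart`) are all hypothetical and do not reflect the actual HTRC task interface.

```python
import json
import string
from collections import Counter

# Toy stand-in for the digital library (hypothetical data).
TOY_LIBRARY = {
    "vol1": "The Diesel engine, the steam engine.",
    "vol2": "An engine of change!",
}


def extract(task_input: str) -> str:
    """Stage 1: pull the requested volumes out of the library."""
    params = json.loads(task_input)
    texts = {v: TOY_LIBRARY[v] for v in params["volumes"]}
    return json.dumps({"texts": texts})


def clean(task_input: str) -> str:
    """Stage 2: lowercase the text and strip ASCII punctuation."""
    texts = json.loads(task_input)["texts"]
    table = str.maketrans("", "", string.punctuation)
    cleaned = {v: t.lower().translate(table) for v, t in texts.items()}
    return json.dumps({"texts": cleaned})


def analyze(task_input: str) -> str:
    """Stage 3: count word occurrences across all selected volumes."""
    texts = json.loads(task_input)["texts"]
    counts = Counter()
    for text in texts.values():
        counts.update(text.split())
    return json.dumps({"counts": dict(counts)})


def visualize(task_input: str) -> str:
    """Stage 4: render a plain-text bar chart, most frequent word first."""
    counts = json.loads(task_input)["counts"]
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    chart = [f"{word:<8}{'#' * n}" for word, n in rows]
    return json.dumps({"chart": chart, "counts": counts})


STAGES = [extract, clean, analyze, visualize]


def run_pipeline(params: dict) -> dict:
    """Feed each stage's JSON output to the next stage as its JSON input."""
    payload = json.dumps(params)
    for stage in STAGES:
        payload = stage(payload)
    return json.loads(payload)


if __name__ == "__main__":
    result = run_pipeline({"volumes": ["vol1", "vol2"]})
    print("\n".join(result["chart"]))
```

Because the only contract between stages is JSON in and JSON out, any stage could in principle be swapped for a program in another language, which is the property the slide emphasizes.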
  • 17. Data Capsule. Foundations of HT Data Capsule: K. Borders, E. V. Weele, B. Lau, and A. Prakash. Protecting confidential data on personal computers with storage capsules. 18th USENIX Security Symposium, pp. 367–382. USENIX Association, 2009.
  • 18. HathiTrust Data Capsule concept • Researcher “checks out” a virtual machine (VM) • VM runs in the Trust Ring • Researcher owns their VM through weeks/months of analysis • Getting stuff into the VM is easy, but there is a controlled and audited process for getting results out of the VM
  • 19. Data Capsule with IPython installed
  • 20. Mode switch protection: maintenance mode. Data Capsule: user traffic from desktop allowed; arbitrary network download allowed; arbitrary network upload allowed. During maintenance mode, researcher installs new software and loads data into capsule. (Diagram also shows HTRC raw data sources.)
  • 21. Mode switch protection: secure mode. Data Capsule: user traffic from desktop allowed; arbitrary network download not allowed; arbitrary network upload not allowed. Researcher switches to secure mode when ready to run her tools. Results: researcher tools must write results to a special directory; these are reviewed before release. (Diagram also shows HTRC raw data sources.)
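The two capsule modes on slides 20 and 21 amount to a small policy table plus a review gate on results. The sketch below models only the three traffic classes the slides name; the class and method names are hypothetical, and the real Data Capsule enforces its policy at the VM manager and network level rather than in application code like this.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"


class Traffic(Enum):
    DESKTOP = "user traffic from desktop"
    NETWORK_DOWNLOAD = "arbitrary network download"
    NETWORK_UPLOAD = "arbitrary network upload"


# Policy table transcribed from the two slides.
ALLOWED = {
    Mode.MAINTENANCE: {Traffic.DESKTOP, Traffic.NETWORK_DOWNLOAD, Traffic.NETWORK_UPLOAD},
    Mode.SECURE: {Traffic.DESKTOP},
}


@dataclass
class Capsule:
    """Hypothetical model of a capsule's mode and its result-release gate."""
    mode: Mode = Mode.MAINTENANCE
    review_queue: list = field(default_factory=list)
    approved: set = field(default_factory=set)

    def switch(self, mode: Mode) -> None:
        self.mode = mode

    def is_allowed(self, traffic: Traffic) -> bool:
        return traffic in ALLOWED[self.mode]

    def submit_result(self, name: str) -> None:
        # Results go to a special location and wait for review.
        self.review_queue.append(name)

    def approve(self, name: str) -> None:
        # Stands in for the human review step.
        self.review_queue.remove(name)
        self.approved.add(name)

    def release(self, name: str) -> str:
        if name not in self.approved:
            raise PermissionError(f"{name} has not been reviewed")
        return name


if __name__ == "__main__":
    capsule = Capsule()
    capsule.switch(Mode.SECURE)
    print(capsule.is_allowed(Traffic.NETWORK_UPLOAD))
```

Closing the network in secure mode means the reviewed results directory is the only path out of the capsule while analysis tools are running.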
  • 22. Threat Model • User is trustworthy • Virtual machine (VM) manager and the host it runs on are also trusted. • VM is NOT trusted. We assume the possibility of malware being installed as well as other remotely initiated attacks on the VM, which are undetectable to the user.
  • 23. HTRC Data Capsules. See the Data Capsule Tutorial for step-by-step instructions. Go to wiki: https://wiki.htrc.illinois.edu. Navigate to: Community > HTRC Data Capsule > HTRC Data Capsule Tutorial
  • 24. Direct engagement with HTRC staff
  • 25. HTRC Advanced Collaborative Support. Awards for HTRC developer time. 1st round awards: • Detecting Literary Plagiarisms: The Case of Oliver Goldsmith • Taxonomizing the Texts: Towards Cultural-Scale Models of Full Text • The Trace of Theory • Tracking technology diffusion thru time using HT Corpus. Coming: call for 2nd round proposals. http://hathitrust.org/htrc for details … or Dr. Miao Chen, miaochen@indiana.edu
  • 26. Advanced Collaborative Support • Pairs HT institution researchers with expert staff for an extended period during which they work together to address a particularly vexing issue (e.g., efficient parallelization and optimization of a machine learning algorithm) • 20 hours/week available; for example, at any one time 4 active projects, each receiving 5 hours a week for up to 2 months. • Resourced at 1.25 FTE • Staffed by HTRC staff who have signed the staff agreement. Org chart (partially legible): HTRC Advisory Board • Advanced Collaborative Support (coordinated by M. Chen): Research Programmer (.5 FTE); Computational Research Liaison (.5 FTE); Asst Dir Outreach & Education (M. Chen) (1 year at .25 FTE) • Scholarly Commons: Dig Humanities Specialist (1.0 FTE); CLIR Postdoctoral Research Associate (2 years at 1.0 FTE); Digital Research Librarian support (.2 FTE); Scholars Commons Support (.5 FTE); LIS MS Students; UI Managing Director (.11 FTE)
  • 27. Scholarly  Commons     User  Support  Services   •  Develop  training  materials     •  Educa:onal  workshops   •  Tool  and  workset  support   •  Collaborate  with  librarians  and   DH  centers  at  HT  ins:tu:ons   •  Assist  researchers  in  HTRC  text   data  mining  research  projects   •  Collabora:on:  University   Libraries,  Illinois  and  Indiana    
  • 28. Knowledge  Products:   ongoing  projects  to   improve   University  of  Toronto,  25  June  2015    
  • 29. Worksets   •  The  ability  to  slice  through  a  massive  corpus   constructed  from  many  different  library   collecHons,  and  out  of  that  to  construct  the   precise  workset  required  for  a  parHcular   scholarly  invesHgaHon,  is  an  example  of  the   “game  changing”  potenHal  of  the  HathiTrust...  
  • 30. Dimensions  of  Workset  Crea:on  (Illustra:ve)   My  workset  should  contain  (inspired  by  2012  UnCamp):     •  Volumes  pertaining  to  Japan  /  in  Japanese   •  All  volumes  relevant  to  the  study  of  Francis  Bacon   •  Music  scores  or  nota:on  extracted  from  HT  volumes   •  Images  of  Victorian  England  extracted  from  HT  vols.   •  Volumes  in  HT  similar  to  TCP-­‐ECCO  novels   •  19th  c.  English-­‐language  novels  by  female  authors     •  Representa:ve  sample  (by  pub  date  &  genre)  of   French  language  items  in  HT  
  • 31. What is a Workset? #1 • A workset is an aggregation of materials brought together for the purpose of discovery and analysis.
  • 32. What is a Workset? #2 • Worksets are conceptual and must be expressible in a variety of ways • Need to facilitate inclusion of resources beyond HathiTrust • Need to facilitate the inclusion of resources at many different levels of granularity beyond the book
  • 33. What is a Workset? #3 • Worksets encapsulate the specific materials that underwent analysis • Need to capture provenance information • Possible recording of parameters
  • 34. What is a Workset? #4 • Worksets should be able to spawn descendants but otherwise be immutable
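The four workset slides can be read together as describing an immutable record that changes only by deriving a new workset. The following is a minimal Python sketch of that reading, not an HTRC API: the names `Workset`, `spawn`, `parent_id`, and `parameters` are hypothetical, and the sample identifiers are borrowed from elsewhere in this deck purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Tuple
import uuid


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Workset:
    """Hypothetical immutable workset: an aggregation of item identifiers."""
    title: str
    creator: str
    items: Tuple[str, ...]                         # volume, page, or external IDs
    parameters: Tuple[Tuple[str, str], ...] = ()   # recorded analysis parameters
    parent_id: Optional[str] = None                # provenance: the workset this came from
    workset_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def spawn(self, title, add=(), remove=()):
        """Return a descendant workset; the original is left untouched."""
        removed = set(remove)
        kept = [item for item in self.items if item not in removed]
        # dict.fromkeys de-duplicates while preserving order.
        new_items = tuple(dict.fromkeys(kept + list(add)))
        return Workset(title=title, creator=self.creator, items=new_items,
                       parameters=self.parameters, parent_id=self.workset_id)


# Illustrative values only.
base = Workset("Agrippa", "rkfritz", ("dul1.ark:/13960/t77s8cw40",))
child = base.spawn("Agrippa and Mexia", add=["uc1.b4125277"])
```

Recording `parent_id` on every descendant is one simple way to keep the provenance chain that slide 33 asks for; a fuller design would also record which analysis produced the descendant.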
  • 36. Draft Workset Data Model V. 0.2 [RDF graph diagram] • Resources: :_workset1, :_desc1, :_curator1, dul1.ark:/13960/t77s8cw40, http://catalog.hathitrust.org/Record/010944168 • Classes: htrc:Collection, htrc:BibliographicResource, htrc:BibliographicRecord, cnt:ContentAsText, foaf:Agent • Properties: rdf:type, rdf:about, dc:title, dc:creator, dcterms:abstract, dcterms:created, dcterms:extent, cnt:content, foaf:accountName, htrc:isGatheredInto • Literals: “Agrippa”^^xsd:string, “Agrippa and Mexia”^^xsd:string, “rkfritz”^^xsd:string, 9^^xsd:integer, “2013-11-11T15:55:48-5:00Z”^^xsd:dateTime
  • 37. Page-level Statistics Extraction Over HathiTrust Corpus for Tech Terms • Acknowledgements: collaboration with Michelle Alexopoulos, University of Toronto. Extraction and analysis by Guangchen Ruan, CS PhD student at Indiana University
  • 38. Motivation and Problem • Given a list of terms (n-grams), extract page-level statistics for each term. For instance, find the frequency of the term “diesel engine” at volume level and page level: in which volumes and on which pages it appears, with a frequency count per page • We compared the accuracy of two approaches: one that extracts terms from a Solr index, and another that extracts terms using a single-pass processing framework we developed to work directly on the raw data
  • 39. Sample output of page-level stats for “diesel engine”
    {
      "tech_name": "diesel engine",
      "volumes": [
        {
          "volumeID": "uc1.b4125277",
          "pageLevelStats": [
            {"pageSeq": "146", "pageLabel": "136", "count": 3},
            {"pageSeq": "649", "pageLabel": "639", "count": 2}
          ]
        },
        …
      ]
    }
  • 40. Approach one: page-level index using Solr • Build a page-level index from raw text. Obtain page-level stats through Solr queries • The computation and time cost of building a page-level index for each tech term is high, so build a single page-level index for a group of words with similar semantics • e.g., “diesel engine”, “diesel motor”, “diesel powered engine”
  • 41. Approach two: single-pass processing distributed computing framework • For each volume, directly scan the content of each page and check for matches using regular expressions • Divide volumes and computation across multiple machines to speed up processing • Computation is not sensitive to the number of tech terms being searched, so it can provide page-level stats for each tech term rather than one set for a group
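The single-pass idea on slide 41 can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the framework the authors built: the function names `compile_terms` and `scan_volume`, the `(page_seq, page_label, text)` input layout, case-insensitive matching, and tolerance for runs of whitespace are all assumptions, and the distribution of volumes across machines is not shown. The page text in the example is made up.

```python
import re
from collections import defaultdict


def compile_terms(terms):
    """Compile one whole-word regex per term; any run of whitespace may separate words."""
    patterns = {}
    for term in terms:
        body = r"\s+".join(re.escape(word) for word in term.split())
        patterns[term] = re.compile(r"\b" + body + r"\b", re.IGNORECASE)
    return patterns


def scan_volume(volume_id, pages, patterns):
    """Scan each page once and return {term: volume entry} for terms that occur.

    pages: iterable of (page_seq, page_label, text) tuples.
    """
    hits = defaultdict(list)
    for page_seq, page_label, text in pages:       # every page is read exactly once
        for term, pattern in patterns.items():
            count = len(pattern.findall(text))
            if count:
                hits[term].append(
                    {"pageSeq": page_seq, "pageLabel": page_label, "count": count})
    return {term: {"volumeID": volume_id, "pageLevelStats": stats}
            for term, stats in hits.items()}


# Made-up page text, shaped like the sample output on slide 39.
patterns = compile_terms(["diesel engine", "diesel engines"])
pages = [
    ("146", "136", "A diesel engine drives the pump; the diesel  engine is new."),
    ("147", "137", "Automotive engines, except diesel; engine generator sets."),
]
result = scan_volume("uc1.b4125277", pages, patterns)
```

Here `result` has an entry only for "diesel engine", with a count of 2 on page sequence 146; the text "diesel; engine" on the second page is not counted because only whitespace may separate the words. Adding terms adds regex work per page but no extra pass over the raw data.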
  • 42. Quality evaluation: Solr-based vs. single-pass processing • We compare the results of the two approaches for 8 tech term groups, or equivalently 57 tech terms • Overall, the results of the two approaches are over 95% consistent • For the inconsistent portion, we manually inspect the raw text content to establish the ground truth • The evaluation shows that the single-pass processing approach is more accurate (fewer false positives and false negatives) than the Solr approach
  • 43. Tech term groups
    Tech term group | Tech terms
    diesel engine (6 terms) | “diesel engine”, “diesel engines”, “diesel motor”, “diesel motors”, “diesel powered engine”, “diesel powered engines”
    gas engine (20 terms) | “gas engine”, “gas engines”, “gas motor”, “gas motors”, “gas powered engine”, “gas powered engines”, “gas powered motor”, “gas powered motors”, “gasoline engine”, “gasoline engines”, “gasoline motor”, “gasoline motors”, “gasoline powered engine”, “gasoline powered engines”, “gasoline powered motor”, “gasoline powered motors”, “gasoline-powered engine”, “gasoline-powered engines”, “gas-powered engines”, “gas-powered motors”
  • 44. Tech term groups
    Tech term group | Tech terms
    internal-combustion-engine (4 terms) | “internal combustion engine”, “internal combustion engines”, “internal combustion motor”, “internal combustion motors”
    steam boat (2 terms) | “steam boat”, “steam boats”
    steam engine (12 terms) | “Corliss engine”, “Corliss engines”, “Corliss steam engine”, “Corliss steam engines”, “Newcomen steam engine”, “Newcomen steam engines”, “steam engine”, “steam engines”, “Watt engine”, “Watt engines”, “Watt steam engine”, “Watt steam engines”
    steam locomotive (4 terms) | “steam locomotive”, “steam locomotives”, “steam train”, “steam trains”
    steam ship (2 terms) | “steam ship”, “steam ships”
    Telegraph (7 terms) | “cable gram”, “cablegram”, “cable grams”, “telegram”, “telegrams”, “telegraph”, “telegraphs”
  • 45. Volume-level and page-level comparison (s1 refers to single-pass processing, s2 refers to the Solr-based approach; page-level comparison is within the common volume set)
    Tech term group | Volume level: vols in s1 but not s2 / total vols in s1 | Volume level: vols in s2 but not s1 / total vols in s2 | Page level: page records in s1 but not s2 / total page records in s1 | Page level: page records in s2 but not s1 / total page records in s2
    diesel engine (6 terms) | 472/19,869 (2.38%) | 98/19,495 (0.50%) | 2,503/84,142 (2.97%) | 1,986/83,625 (2.37%)
    gas engine (20 terms) | 1,132/45,321 (2.50%) | 846/45,035 (1.88%) | 7,065/187,735 (3.76%) | 5,831/186,501 (3.12%)
    internal-combustion-engine (4 terms) | 2,418/21,122 (11.4%) | 80/18,784 (0.43%) | 8,480/75,567 (11.2%) | 3,994/71,081 (5.61%)
    steam boat (2 terms) | 4,209/176,652 (2.38%) | 1,166/173,609 (0.64%) | 25,808/794,832 (3.25%) | 33,194/802,218 (4.14%)
  • 46. Volume-level and page-level comparison (page-level comparison is within the common volume set)
    Tech term group | Volume level: vols in s1 but not s2 / total vols in s1 | Volume level: vols in s2 but not s1 / total vols in s2 | Page level: page records in s1 but not s2 / total page records in s1 | Page level: page records in s2 but not s1 / total page records in s2
    steam engine (12 terms) | 9,259/127,385 (7.27%) | 828/118,954 (0.70%) | 31,475/476,992 (6.60%) | 19,145/464,662 (4.12%)
    steam locomotive (4 terms) | 707/13,539 (5.22%) | 169/13,001 (1.30%) | 2,880/36,294 (7.93%) | 1,752/35,166 (4.98%)
    steam ship (2 terms) | 1,876/134,220 (1.39%) | 920/133,264 (0.69%) | 12,024/573,453 (2.09%) | 15,141/576,570 (2.63%)
    Telegraph (7 terms) | 67,293/361,483 (18.6%) | 375/294,565 (0.13%) | 787,481/2,140,284 (36.7%) | 164,601/1,517,404 (10.8%)
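Each cell in the two comparison tables is a set difference divided by a total. The sketch below shows one way to compute such cells, assuming each approach's output is available as a set of volume IDs and a set of `(volume_id, page_seq)` records; the function names and the toy data are hypothetical and do not come from the deck.

```python
def only_in_ratio(a, b):
    """Return (count of items in a but not b, size of a, that share as a percent)."""
    only = len(a - b)
    percent = 100.0 * only / len(a) if a else 0.0
    return only, len(a), percent


def compare(s1, s2):
    """Both directions of the comparison, as in the table columns."""
    return {"s1_not_s2": only_in_ratio(s1, s2),
            "s2_not_s1": only_in_ratio(s2, s1)}


def page_level_compare(pages1, pages2):
    """Compare page records, restricted to volumes found by both approaches."""
    common = {vol for vol, _ in pages1} & {vol for vol, _ in pages2}
    p1 = {rec for rec in pages1 if rec[0] in common}
    p2 = {rec for rec in pages2 if rec[0] in common}
    return compare(p1, p2)


# Toy data: s1 = single-pass results, s2 = Solr-based results.
vols_s1, vols_s2 = {"v1", "v2", "v3"}, {"v2", "v3", "v4"}
pages_s1 = {("v1", "1"), ("v2", "5"), ("v2", "6"), ("v3", "9")}
pages_s2 = {("v2", "5"), ("v3", "9"), ("v3", "10"), ("v4", "2")}

volume_cells = compare(vols_s1, vols_s2)
page_cells = page_level_compare(pages_s1, pages_s2)
```

As a check against the first table, the "diesel engine" volume-level cell 472/19,869 evaluates to about 2.38% under this definition.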
  • 47. Analysis of Solr false positives/negatives • False positive – Example one: false positive match for “diesel engine”: “17 Engines and Turbines (Excludes aircraft and rocket engines; automotive engines, except diesel; engine generator sets; and locomotives.)” – Example two: false positive match for “diesel motor”: “Fossil fuel consumption (gasoline, diesel, motor oil) would decrease as a result of this alternative.” – Cause analysis: Solr builds the page-level index using Lucene tokenization, which removes non-word characters. One-pass processing handles such cases correctly because it matches by regular expression against the raw text
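The cause described on slide 47 can be reproduced with a simplified stand-in for tokenized phrase matching. The sketch below is not Lucene's analyzer or Solr's query logic: `token_phrase_match` just splits on non-word characters and compares adjacent tokens, and `raw_regex_match` is an assumed form of the raw-text regular expression. Both function names are hypothetical; the two example strings are excerpts from the slide.

```python
import re


def token_phrase_match(text, phrase):
    """Phrase match over tokens with punctuation discarded (simplified stand-in)."""
    tokens = re.findall(r"\w+", text.lower())
    words = phrase.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def raw_regex_match(text, phrase):
    """Match on raw text: only whitespace may separate the words of the phrase."""
    body = r"\s+".join(re.escape(word) for word in phrase.split())
    return re.search(r"\b" + body + r"\b", text, re.IGNORECASE) is not None


ex1 = "automotive engines, except diesel; engine generator sets; and locomotives."
ex2 = "Fossil fuel consumption (gasoline, diesel, motor oil) would decrease"

fp1 = (token_phrase_match(ex1, "diesel engine"), raw_regex_match(ex1, "diesel engine"))
fp2 = (token_phrase_match(ex2, "diesel motor"), raw_regex_match(ex2, "diesel motor"))
```

In both cases the token-based check reports a match while the raw-text regex does not, because the semicolon or comma between the two words is lost during tokenization.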
  • 48. Cause analysis of Solr’s false positives/negatives (cont.) • False negative – Example one: false negative match for “diesel engine”: “Steam boilers and equipment, steam and gas turbines, nuclear reactors, steam engines, diesel en- gines, and other prime movers” (the line ends after “en-”). Lucene tokenization splits “en-gines” into “en” and “gines”, which leads to a false negative. Single-pass processing handles this word-continuation case by concatenating “en-gines” into “engines” before matching – Example two: false negative match for “diesel engines”: “The introduction of commercial-model diesel engines, in a relatively small quantity of trucks.” Solr failed to detect this straightforward case; we do not know the reason, and it needs further investigation
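The word-continuation handling described on slide 48 can be sketched as a preprocessing step that joins a word split by a hyphen at the end of a line before any term matching runs. This is an assumed implementation, not the authors' code: the pattern and the name `join_continuations` are hypothetical, and a rule this simple would also join genuine hyphenated compounds that happen to break across lines.

```python
import re

# A word character, a hyphen, optional trailing blanks, a line break,
# optional leading blanks, then another word character.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*(\w)")


def join_continuations(text):
    """Rejoin words hyphenated across a line break, e.g. 'en-' + 'gines' -> 'engines'."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


raw = ("nuclear reactors, steam engines, diesel en-\n"
       "gines, and other prime movers")
term = re.compile(r"\bdiesel\s+engines\b", re.IGNORECASE)

found_before = term.search(raw) is not None
found_after = term.search(join_continuations(raw)) is not None
```

On this excerpt the term is missed in the raw text and found after the join. Ordinary line breaks between whole words are left alone, since the `\s+` in the term pattern already tolerates them.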
  • 49. False negatives caused by OCR errors • Example one: “Burdick, R. H. Performance of diesel.engine plants in Texas.” • Example two: “from gasoline-powered to fuel-efficient diesel-_powered engines” • The tokens “diesel.engine” and “diesel-_powered” were generated as a result of OCR errors • The one-pass processing approach failed to detect the terms in such cases
  • 50. [Architecture diagram] • Raw copyright data 1) on file system and in archive in pairtree form, 2) in chunked form for parallel processing, and 3) in full-text Solr index • Data management services • Knowledge product services • Data Capsule VMs • Services and tools for data extraction, data cleaning, data analysis and results visualization (DH CS NLP R . . .) • Knowledge products (public): workset, ontology, feature sets • Knowledge products (private): personal worksets • Portal (for authentication) • External data cache
  • 52. Thank you! plale@indiana.edu @bplale https://sharc.hathitrust.org/