  1. Digital Libraries: History, Technology, R&D
      Edward A. Fox, Professor, Computer Science, Virginia Tech, Blacksburg, VA 24061 USA
      fox@vt.edu   http://fox.cs.vt.edu
      6 Jan. 2014
  2. Outline
      Acknowledgments
      Introduction
      History
      Technology
      Research
      Development
      Summary and Discussion
  3. Sponsored by Qatar University & Qatar National Library
      HTTP://qnl.qa   HTTP://WWW.QU.EDU.QA/
      Funding provided through the ELISQ project: Electronic Library Institute - SeerQ
      HTTP://WWW.VT.EDU/   HTTP://WWW.PSU.EDU/   HTTP://WWW.TAMU.EDU/
  4. ELISQ Project Team
      Qatar University, Qatar: Mohammed Samaka (Ph.D., Co-Lead PI); Sumaya Ali S A Al-Maadeed (Ph.D., PI); Myrna Tabet; Asad Nafees; Tahseena Moideen
      Qatar National Library, Qatar: Claudia Lux (PI); Krishna Roy Chowdhury; Postdoc - TBA
      Virginia Tech, USA: Edward Fox (Ph.D., Lead-PI); Tarek Kanan
      Penn. State University, USA: C. Lee Giles (Ph.D., PI); Sagnik Ray Choudhury
      Texas A&M, USA: Richard Furuta (Ph.D., PI); Hamed Alhoori
      Consultants: John Impagliazzo (Ph.D., Key Investigator); Susan Lukesh (Ph.D.); Carole Thompson
      This project was made possible by NPRP Grant # 4-029-1-007 from the Qatar National Research Fund (a member of Qatar Foundation).
  5. Acknowledgements
      Dr. Mazen Hasna, VP and Chief Academic Officer, Qatar University
      Dr. Rashid Alammari, Dean, College of Engineering, Qatar University
      Dr. Moumen Hasnah, Director of Academic Research, Qatar University
      Dr. Claudia Lux, Qatar National Library Director
      Dr. Imad Bachir, Qatar University Library Director
      Dr. Munir Tag, Acting Director Technical, ICT Program Manager (QNRF)
      Ms. Krishna Roy Chowdhury, Associate Director for Library IT, Qatar National Library
      Prof. Sebti Foufou, Head of Department of Computer Science and Engineering, Qatar University
  6. Additional Thanks
      Qscience (providing collection): Christopher J. Leonard, Editorial Director; Paul Coyne, CTO
      US National Science Foundation (recent and current grants to Fox): IIS-1319578, IIS-0916733, DUE-0840719, OCI-1032677, plus those to PSU, TAMU
  8. Introduction
      Reasons to be here
        o Interested
        o Find what to do with your content
        o Find how to help your user community
      http://www.morganclaypool.com/toc/icr/1/1
        1. DL Introduction, 5S framework (2012)
        2. DL Quality, Integration (2013)
        3. DL Technologies (in press)
        4. DL Applications (in press)
  13. DLs Shorten the Chain to
      Figure labels: Author, Teacher, Reader, Editor, Reviewer, Learner, Digital Library, Librarian
  14. Digital Library Content
      Content types: Text Documents; Video; Audio; Geographic Information; Software, Programs; Bio Information; Images and Graphics
      Examples: Articles, Reports, Books; Speech, Music; (Aerial) Photos; Models, Simulations; Genome (human, animal, plant); 2D, 3D, VR, CAT
  15. Content-Based Information Retrieval
  17. Digital Library Reference Model 1.0, p. 30 of 234
  18. Informal 5S DL Definitions
      DLs are complex systems that:
        o help satisfy info needs of users (societies)
        o provide info services (scenarios)
        o organize info in usable ways (structures)
        o present info in usable ways (spaces)
        o communicate info with users (streams)
  19. Information Life Cycle
      Figure labels: Authoring, Modifying, Using, Creating, Organizing, Indexing, Retention / Mining, Storing, Retrieving, Accessing, Filtering, Distributing, Networking
  20. Infrastructure Services and Information Satisfaction Services
      Infrastructure Services
        o Repository-Building, Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
        o Repository-Building, Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
        o Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)
      Information Satisfaction Services
        o Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing
  21. SeerSuite is Not Google
      Metadata (as in library catalogs) as well as content
      Sets of collections, rather than the Web as a whole
        o Provided by a curator (e.g., publisher, museum)
        o Provided by user submissions
        o Or collected by focused ‘crawling’
      Tailored services, rather than the same for everyone
        o Browsing using categories, preserving, adding value
        o Based on studying user requirements, e.g., chemists
      Working with entities, rather than just words
        o Citations, tables, figures, names, chemical formulae
        o Using knowledge bases, machine learning, artificial intelligence
  23. History Overview
      Since 1991, esp. from Information Retrieval
      Connecting computer, library, and information science communities
      NSF DL Initiative 1 in 1994 included funding for Stanford, where Google was prototyped
      International conferences in the Americas (JCDL), as well as Europe (TPDL, by DELOS), Asia (ICADL)
      Publishers: ACM, …
      DOIs, (Institutional) Repositories
      Spinoffs: content & courseware management systems
      Recently including (linked) data
  24. www.nsdl.org
  26. Institutional Repositories
      “Institutional repositories are digital collections that capture and preserve the intellectual output of a single university or a multiple institution community of colleges and universities.”
      Crow, R. “Institutional repository checklist and resource guide”, SPARC, Washington, D.C., USA
      www.arl.org/sparc/IR/IR_Guide_v1.pdf
  27. NDLTD: www.ndltd.org
      Networked Digital Library of Theses and Dissertations (NDLTD)
      Vision: Every thesis and dissertation in the world is:
        o Devised to take advantage of the most helpful electronic publishing methods
        o Shared globally and easily found
        o Supported by a suite of digital library services to aid authors, researchers, learners, universities
        o Preserved and migrated permanently
  28. Crisis, Tragedy, and Recovery (CTR) Network / Integrated Digital Event Archive & Library (IDEAL)
      Human tragedies that result from man-made and natural events affect humans and communities significantly.
      During and after a tragic event, there is a series of needs that have to be addressed.
        o Compounded by communication failures and a confusing plethora of data and information
  29. CTRnet (Crisis, Tragedy & Recovery Net): Disaster Locations
  30. CTRnet (Crisis, Tragedy & Recovery Net): Word Clouds of Japan Earthquake and Libya Revolution (using tweets)
      Japan Earthquake, Tsunami Disaster
      Libya Revolution
      Updated every 10 minutes
  31. CTR stakeholders
  32. CINET: Network Science Middleware
  33. CINET: Network Science Middleware
      Netviz: Course project that aims to develop a visualization component for CINET, which contains large network graphs. The visualization service will get networks from CINET, convert them from Galib to GEXF format, then visualize the graphs using Gephi.
      Figure: CINET network displayed using Gephi
  35. Web Archiving
      Introduction: Web archiving is the process of
        o gathering up data recorded on the World Wide Web,
        o storing it,
        o ensuring the data is preserved in an archive, and
        o making the collected data available for future research.
      The Internet Archive and several national libraries initiated Web archiving practices in 1996.
  36. Crawler (Heritrix) (for search engines & Web archives)
      A Web crawler starts with a list of URLs to visit, called the seeds.
      On those pages, it
        o identifies all the hyperlinks,
        o adds them to the list of URLs to visit, and
        o recursively visits the pages pointed to, according to a set of policies.
      It prioritizes its downloads, since some pages change often.
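The crawl loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Heritrix is implemented: the in-memory `WEB` dictionary stands in for real HTTP fetching and link extraction, and the names `crawl`, `fetch_links`, `allowed`, and `max_pages` are hypothetical. The `allowed` predicate plays the role of the crawl policies; a production crawler would also replace the plain queue with a priority queue so that frequently changing pages are revisited sooner.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List

# Toy "Web": page URL -> hyperlinks found on that page.
# Assumption: this replaces real HTTP fetching and HTML link extraction.
WEB: Dict[str, List[str]] = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def fetch_links(url: str) -> List[str]:
    """Return the hyperlinks on a page (looked up in the toy Web)."""
    return WEB.get(url, [])


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], List[str]],
    allowed: Callable[[str], bool] = lambda url: True,
    max_pages: int = 100,
) -> List[str]:
    """Breadth-first crawl from the seeds; returns URLs in visit order."""
    frontier = deque(dict.fromkeys(seeds))  # URLs still to visit, seeds deduplicated
    seen = set(frontier)                    # URLs already queued or visited
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch(url):
            # Policy check: only follow links that are new and permitted.
            if link not in seen and allowed(link):
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    print(crawl(["http://a.example/"], fetch_links))
```

Passing a stricter `allowed` function, for example one that accepts only URLs under a single host, restricts the crawl to that site.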
  37. Focused Crawlers
      For a particular topic or event, to build a Web collection focused in that area
      Start with URLs of interest, viewed as seeds to grow from
      Expand in a ‘smart’ way to get all and only what is relevant
      Use information retrieval / artificial intelligence / machine learning
        o Require ‘knowledge bases’ and/or human training examples
      Nevertheless, there is a tradeoff between the resulting
        o Recall (i.e., coverage of what is out there)
        o Precision (i.e., freedom from noise in what is collected)
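The recall/precision tradeoff named on this slide can be made concrete with the standard set-based definitions. The sketch below assumes that a judged set of relevant pages is available for evaluation; the function name `precision_recall` and the page identifiers are hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall(collected: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Precision = share of collected pages that are relevant.
    Recall = share of relevant pages that were collected."""
    collected_set, relevant_set = set(collected), set(relevant)
    hits = collected_set & relevant_set
    precision = len(hits) / len(collected_set) if collected_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical crawl: four pages collected, three pages judged relevant.
    collected = ["p1", "p2", "p3", "p4"]
    relevant = ["p1", "p2", "p5"]
    print(precision_recall(collected, relevant))
```

Crawling more aggressively tends to raise recall while admitting more noise, which lowers precision; a stricter relevance test usually does the opposite.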
  38. SeerSuite Instantiations
      CiteSeerX   http://citeseerx.ist.psu.edu
        o A scientific literature digital library and search engine
      ChemXSeer   http://chemxseer.ist.psu.edu
        o Portal for researchers in environmental chemistry integrating the scientific literature with experimental, analytical, and simulation results and tools
      ArchSeer   http://archseer.ist.psu.edu/
        o Archeology literature
      TableSeer
        o Any fields with tables
  39. CiteSeerX   http://citeseerx.ist.psu.edu
      CiteSeerX crawls researcher homepages on the Web for scholarly papers, formerly in computer science
      Converts PDF to text
      Automatically extracts OAI metadata and other data
      Automatic citation indexing, links to cited documents, creation of document page, author disambiguation
      Software is open source and can be used to build other such tools
      3 M documents
      Millions of files
      60 M citations
      3 to 6 M authors
      2 to 4 M hits per day
      100K documents added monthly
      800K individual users
      Several Tbytes
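As a rough illustration of what citation indexing produces, the sketch below inverts a map of reference lists into a "cited by" index. It assumes the reference lists have already been extracted and resolved to document identifiers, which is the hard part in practice; it is not CiteSeerX's implementation, and the names `build_citation_index` and `rank_by_citations` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_citation_index(references: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert 'paper -> papers it cites' into 'paper -> papers citing it'."""
    cited_by: Dict[str, Set[str]] = defaultdict(set)
    for paper, cited in references.items():
        for target in cited:
            cited_by[target].add(paper)
    return dict(cited_by)


def rank_by_citations(cited_by: Dict[str, Set[str]]) -> List[str]:
    """Order papers by citation count (descending), ties broken by identifier."""
    return sorted(cited_by, key=lambda paper: (-len(cited_by[paper]), paper))


if __name__ == "__main__":
    # Hypothetical corpus: A cites B and C, B cites C.
    references = {"A": ["B", "C"], "B": ["C"], "C": []}
    index = build_citation_index(references)
    print(index)
    print(rank_by_citations(index))
```

With such an index, a document page can link both to the documents a paper cites and to the documents that cite it.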
  42. SeerSuite
      Tool kit used to build search engines and digital libraries
        o CiteSeerX, MyCiteSeerX, ChemXSeer, ArchSeer, AlgoSeer, AckSeer, BizSeer, CSSeer, CollabSeer, RefSeer, GrantSeer, SeerSeer, YouSeer, etc.
      Built on commercial grade open source tools (Solr/Lucene)
      Penn State expertise: automated specialized metadata extraction
      Supports research in
        o Indexing and search
        o Data mining & structures
        o Information and knowledge extraction
        o Social networks: Name/entity disambiguation
        o Scientometrics/infometrics
        o Systems engineering
        o User interface design (HCI = human-computer interaction)
        o Software engineering and management
  43. ChemXSeer Highlights
      Portal for academic researchers in chemistry which integrates the scientific literature with experimental, analytical and simulation results and tools
      Provides unique metadata extraction, indexing and searching pertinent to the chemical literature by using heuristics combined with machine learning
        o Chemical formulae and names
        o Tables
        o Figures
      Publication functions as in CiteSeerX
      Expert and expertise search
      After extraction, data is stored as XML that users can access through an API
      Hybrid repository: serves as a federated information interoperational system
        o Scientific papers crawled and indexed from the Web
        o User-submitted papers and datasets (e.g., Excel worksheets, Gaussian and CHARMM toolkit outputs)
        o Scientific documents and metadata from publishers, Web or archives
      Access control for proprietary content and user-submitted experiment data
      Takes advantage of in-house open source projects such as CiteSeerX/SeerSuite
  44. Example Formula Search
  46. Users - TAMU
      Requirements (content, services)
      Practices (scholarly, information seeking)
      Social framework (collaboration, recommendation)
      Interviews, surveys
      Evaluations: usability, benefits
  47. Infrastructure - PSU
      Computers, software, launching infrastructure at:
        o QU: powerful server, now crawling, and ready to help any group interested in curating a collection
        o VT, QNL (postdoc), QCRI (Prof. Mitra), …
      Adapt to disciplines, interesting parts of documents
      Adapt to each collection
        o Develop knowledge base and heuristics for the collection
        o Change document parser
        o Change database to match what occurs
        o Change extractors: document -> database
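The "extractors: document -> database" step can be pictured as a function that turns parsed text into a record, plus a schema that matches what the extractor finds. The sketch below is a toy stand-in, not SeerSuite code: the heuristics (first non-empty line as title, first four-digit year), the `documents` table, and the names `extract_record` and `store` are all assumptions made for illustration. Adapting to a new collection would mean swapping the heuristics and the schema together.

```python
import re
import sqlite3
from typing import Dict, Optional, Union

Record = Dict[str, Union[str, Optional[int]]]


def extract_record(text: str) -> Record:
    """Toy extractor: first non-empty line is the title, first 4-digit year is the year."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title = lines[0] if lines else ""
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    year = int(match.group(0)) if match else None
    return {"title": title, "year": year}


def store(conn: sqlite3.Connection, record: Record) -> None:
    """Insert one extracted record; the schema mirrors the extractor's fields."""
    conn.execute("CREATE TABLE IF NOT EXISTS documents (title TEXT, year INTEGER)")
    conn.execute(
        "INSERT INTO documents (title, year) VALUES (?, ?)",
        (record["title"], record["year"]),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store(conn, extract_record("Digital Libraries in Qatar\nTechnical report, 2013"))
    print(conn.execute("SELECT title, year FROM documents").fetchall())
```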
  48. Arabic - VT
      Handle Arabic text documents
        o Obtain a suitable category/classification system
        o Have people provide ‘training set’
        o Use machine learning to automatically classify future Arabic text documents
      Support cross-language information retrieval
        o Arabic question against English documents
        o English question against Arabic documents
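The "training set, then automatic classification" workflow can be sketched with a small multinomial naive Bayes classifier. This is one common technique chosen for illustration; the slide does not say which learning method the project uses. The category labels and the four training texts are hypothetical, and splitting on whitespace is a simplifying assumption: real Arabic text would normally need normalization and stemming first.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train(examples: List[Tuple[str, str]]) -> Model:
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for text, label in examples:
        tokens = text.split()  # assumption: whitespace tokenization
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(model: Model, text: str) -> str:
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = "", -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.split():
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical human-labeled training set.
    training_set = [
        ("مباراة كرة القدم", "sports"),
        ("فريق كرة", "sports"),
        ("أسعار سوق", "economy"),
        ("بنك اقتصاد سوق", "economy"),
    ]
    model = train(training_set)
    print(classify(model, "مباراة فريق"))
    print(classify(model, "سوق بنك"))
```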
  49. Arabic Handwriting - QU
      Images of historic documents
        o Arabic text extracted
        o Mapping from a part of the text to the corresponding part of the image
      Special tools for
        o Those processing the original documents
        o Those doing research with the collection
      Will allow work on non-textual collections too, e.g., museum images, set of photos for teaching architecture
  50. Accessible Collections in Qatar - QNL
      What collections have the highest priority?
      What special handling is needed for each class, for each subclass of collection type?
      How do DLs best fit into the activities of the National Library?
      Can .qa be fully archived for Wayback Machine use?
  52. DL Curriculum Framework
      Column headings: COURSE STRUCTURE | CORE DL TOPICS | RELATED TOPICS
      Course structure
        o Semester 1: DL collections: development/creation
        o Semester 2: DL services and sustainability
      Topics
        o Digitization, Storage, Interchange
        o Metadata, Cataloging, Author submission
        o Digital objects, Composites, Packages
        o Architectures (agents, buses, wrappers/mediators), Interoperability
        o Spaces (conceptual, geographic, 2/3D, VR)
        o Documents, E-publishing, Markup
        o Multimedia streams/structures, Capture/representation, Compression/coding
        o Bibliographic information, Bibliometrics, Citations
        o Content-based analysis, Multimedia indexing
        o Naming, Repositories, Archives
        o Services (searching, linking, browsing, etc.)
        o Archiving and preservation, Integrity
        o Thesauri, Ontologies, Classification, Categorization
        o Info. needs, Relevance, Evaluation, Effectiveness
        o Intellectual property rights mgmt., Privacy, Protection (watermarking)
        o Routing, Filtering, Community filtering
        o Search & search strategy, Info seeking behavior, User modeling, Feedback
        o Multimedia presentation, rendering
        o Info summarization, Visualization
  53. Modules
      http://en.wikiversity.org/wiki/Curriculum_on_Digital_Libraries
        o Table 1: Core DL Modules
        o Table 2: Information Retrieval Packages
        o Table 3: Big Data
        o Table 4: Multimedia Software
      Like lesson plans, for a training session or lecture
      Can be used for self-study, refreshers
  54. http://curric.dlib.vt.edu/modDev/modDev.html
  55. ELISQ Audience (Users)   http://elisq.qu.edu.qa/
      Primary:
        o Librarians and libraries in Qatar
        o Researchers and academics
        o Government organizations
        o Non-Governmental organizations (such as http://www.fsd.org.qa/)
      Secondary:
        o University / School Students
        o Teachers / Faculty
        o Managers
        o Qatari citizens
        o Other stakeholders
  56. ELISQ Project (1 of 2)
      Project Objectives/Aims
      A. Research and prototype digital library systems and infrastructure for Qatar, focusing initially on Qatari information related to government and scholarly activities.
      Leverage the crawling engine from Penn State's SeerSuite software infrastructure, and extend it beyond its current focus on English to support Arabic-English collections, and to cover a broad range of scholarly disciplines, and all types of government information.
  57. ELISQ Project (2 of 2)
      Project Objectives/Aims (continued)
      B. Research and build the digital library community in Qatar, supporting digital library use, services, collection development, tailored systems, and advancing toward a Knowledge Society.
      Study scholarly activities, and engage in community building in Qatar, so DLs can be tailored to specific domains and to the unique needs of Qatar. Through workshops, a consulting center at the proposed Institute, and collaborative efforts with libraries and museums in Qatar, we will identify particular needs and uses, and tailor collections, systems, and services, to lead toward the Qatari Knowledge Society.
  58. Significance to Librarians, Corporations, and Governmental Agencies
      The need to preserve cultural and historical heritage =>
        o Collections of fragile and precious artifacts =>
        o Libraries, museums, and archives developing digital collections =>
        o Users from all over the world accessing and studying
      A one stop search of:
        o Information about Qatar
        o Information to preserve the culture of Qatar
      Deep indexing, analysis, and retrieval of:
        o Resources, reports, statistics, and other types of information
        o Information in the Arabic language as well as in English
  59. ELISQ Content
      Metadata, data, and many types of documents (including full text)
      Qatari resources that first appeared in digital form - ‘born’ digital
      At a later stage the project will include:
        o Digital versions of material already existing in print
        o Multimedia (image, audio, video) forms
      Free and open as well as content with limited access
  60. ELISQ Focus
      Community in Qatar
        o Identify interested stakeholders, to tailor to needs
        o Train next generation of digital librarians, archivists, and curators
        o Partners helping with additional collection development
      Advanced Technology for Enhanced Access
        o “Low hanging fruit” by crawling Qatar-related Web
        o Improved analysis (citations, tables, chemicals, …)
        o Support for both Arabic and English
  62. Summary (some highlights)
      Introduction to digital libraries: 5S, any content
      History: since 1991, Google, repositories
      Technology: SeerSuite, Heritrix, Solr, HCI
      Initial collections: Qscience, news, …
      Research: extend SeerSuite; Arabic
        o Adapt other tools for handwriting collection, non-text collections
      Development: consulting center (addressing needs)
  63. Questions for You
      What communities should be served?
      What collections should be made accessible?
      What services are required?
      What are the priorities in the above?
      Can you help us find suitable partners, content owners, curators, user groups?
  64. Questions for Us?
      http://elisq.qu.edu.qa/   fox@vt.edu   http://fox.cs.vt.edu