Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale Article Digital Library


Presentation at Code4Lib-North at Queen's University, Ontario
May 7 2010


  1. Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale Article Digital Library
     Glen Newton, glen.newton@gmail.com
     Biology Dept, Carleton University
     http://zzzoot.blogspot.com/
     Code4Lib-North, Queen's University, Kingston, Ontario
     Friday May 7 2010
     Based on VLDL2009 Workshop Presentation at ECDL2009
  2. Outline
     • Maps of Science
     • Broad Research Interests
     • Research Goals
     • Process
     • Scalability issues
     • Open Source Tools
     • Environment
     • Results
     • Conclusions
     • Future Work
  3. From Bollen et al 2009, PLoS ONE
  4. From Leydesdorff & Rafols 2006
  5. From Leydesdorff & Rafols 2006
  6. Broad Research Interests
     • Search results visualization & refinement
     • Domain-specific discovery, with a particular interest in genomics and drug discovery
     • Improved discovery in STM domains through results visualization and contextualization, browse/explore/refine
     • Use of Open Source tools in complex research problem spaces
  7. Research Goals
     • Use Open Source tools to support large scale semantic text analysis and visualization
     • Find a way to extract a journal (& article) semantic vector space (semantic representations of natural language are much better than keyword- or tf-idf-based ones)
     • Latent Semantic Analysis (LSA) works for small/medium sized corpora, but does not scale to large numbers of items and/or terms
     • New alternative: Semantic Vectors (SV), which uses random vectors & avoids the expensive singular value decomposition (SVD)
     • Can SV scale & generate a sensible semantic vector space of journals on a corpus of this size?
     • Can the visualization produced be useful for query results visualization, refinement, discovery?
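The idea of building a semantic space from random vectors instead of an SVD can be illustrated with a simplified random-indexing sketch. This is not the Semantic Vectors package's actual procedure, which differs in detail. The sparsity setting `NONZERO`, the hash-based seeding, and the function names are assumptions made for illustration; only the 512 dimensions come from the talk.

```python
import hashlib
import math
import random
from collections import Counter

DIM = 512      # dimensionality used in the talk
NONZERO = 10   # assumed number of non-zero entries per index vector


def index_vector(term, dim=DIM, nonzero=NONZERO):
    """Return a sparse random +1/-1 vector for a term as {position: sign}.

    Seeding from a hash of the term makes the vector reproducible.
    """
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    positions = rng.sample(range(dim), nonzero)
    return {pos: (1 if i % 2 == 0 else -1) for i, pos in enumerate(positions)}


def document_vector(tokens, dim=DIM):
    """Sum the index vectors of a document's terms, weighted by term count."""
    vec = [0.0] * dim
    for term, count in Counter(tokens).items():
        for pos, sign in index_vector(term, dim).items():
            vec[pos] += sign * count
    return vec


def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


biology_a = document_vector(["protein", "gene", "cell", "dna"])
biology_b = document_vector(["protein", "gene", "cell", "rna"])
economics = document_vector(["market", "price", "trade", "bank"])
print(cosine(biology_a, biology_b))  # high: three of four terms are shared
print(cosine(biology_a, economics))  # near zero: no shared terms
```

No matrix factorization is involved: each document vector is a running sum, so the cost grows linearly with the amount of text, which is the property that makes this family of methods attractive for large corpora.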
  8. Corpus
     • Licensed journal articles from STM publishers: Elsevier, Springer, etc
     • ~4100 journal titles, classified into 23 categories (by publishers)
     • ~8.4m journal articles
     • Selection of articles/journals:
         – Only those with authors, abstract (no notices, obituaries, etc)
         – Only English language articles
         – Only journals with >50 articles in corpus
         – Resulting corpus: 5,733,721 articles from 2231 journals
         – Categories overlap: 1.53 categories per journal on average
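The selection rules above can be sketched as a two-pass filter. The record layout (dicts with `authors`, `abstract`, `language`, and `journal` keys) is hypothetical, and the slide does not say whether the >50 threshold is counted before or after the article-level filters; this sketch assumes after.

```python
from collections import Counter

MIN_ARTICLES_PER_JOURNAL = 50  # slide: keep journals with >50 articles


def select_articles(articles):
    """Apply the selection rules to a list of article records (dicts)."""
    # Hypothetical field names: "authors", "abstract", "language", "journal".
    eligible = [
        a for a in articles
        if a.get("authors") and a.get("abstract") and a.get("language") == "en"
    ]
    # Assumption: the threshold is counted after the article-level filters.
    per_journal = Counter(a["journal"] for a in eligible)
    return [
        a for a in eligible
        if per_journal[a["journal"]] > MIN_ARTICLES_PER_JOURNAL
    ]
```

A journal with exactly 50 eligible articles is dropped, matching the strict ">50" on the slide.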
  9. Corpus
     Category                                        # Journals per category
     Agriculture & Biological Sciences               358
     Arts and Humanities                             70
     Biochemistry, Genetics and Molecular Biology    240
     Business, Management and Accounting             106
     Chemical Engineering                            126
     Chemistry                                       226
     Civil Engineering                               64
     Computer Science                                218
     Decision Science                                50
     Earth and Planetary Science                     146
     Economics, Econometrics and Finance             112
  10. Corpus (continued)
     Category                                        # Journals per category
     Energy and Power                                73
     Engineering and Technology                      328
     Environmental Science                           138
     Immunology and Microbiology                     104
     Materials Science                               160
     Mathematics                                     205
     Medicine                                        671
     Neuroscience                                    103
     Pharmacology, Toxicology and Pharmaceutics      73
     Physics and Astronomy                           210
     Psychology                                      126
     Social Science                                  222
  11. Process
     • Index full-text (only) with Lucene 2.4 using the LuSql tool, with an aggressive stopword list and Porter stemming
     • Build Semantic Vectors (v1.18, parallelized) index from Lucene index, with 512 semantic dimensions
     • Find item x item distance matrix from SV index of 512-dimensional vectors
     • Using R, use multidimensional scaling (MDS) to reduce from 512-D to 2-D
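The last two steps (distance matrix, then MDS down to 2-D) can be sketched in a few lines. The talk did this in R and does not name the distance measure or the MDS variant, so cosine distance and classical (Torgerson) MDS are assumptions here, as are all function names. The eigenvectors are found by simple power iteration, which suits only small, illustrative matrices.

```python
import math
import random


def cosine_distance_matrix(vectors):
    """Pairwise cosine distance (1 - cosine similarity) of non-zero vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            dist[i][j] = dist[j][i] = 1.0 - dot / (norms[i] * norms[j])
    return dist


def classical_mds(dist, dims=2, iterations=500, seed=0):
    """Classical MDS of a symmetric distance matrix.

    Uses power iteration with deflation; axes whose eigenvalue is not
    positive are set to zero.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # The Rayleigh quotient gives the eigenvalue of the converged vector.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(x * y for x, y in zip(v, bv))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Three toy "journal" vectors: the first two are close, the third is not.
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
layout = classical_mds(cosine_distance_matrix(vectors))
for point in layout:
    print(point)
```

In R, classical MDS is available as `cmdscale` in the `stats` package; the slides do not state which function was actually used.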
  12. Scalability Issues
     • #items, #unique terms
         – #unique terms: SV handles easily
         – #items: SV handles fairly well
         – #items: impacts size of distance matrix (#items x #items)
         – R cannot handle a huge article distance matrix in MDS (i.e. millions of articles vs. thousands of journals)
     • Instead of using articles for items, use journals for items
     • Make a single large full-text document from the concatenation of all articles of a particular journal & index these
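A rough calculation shows why the switch from articles to journals matters. It assumes a full dense square matrix of 8-byte doubles, which is how R stores a numeric matrix; the item counts are the corpus figures from the talk.

```python
BYTES_PER_DOUBLE = 8  # assumed storage per matrix entry


def dense_matrix_bytes(n_items, bytes_per_value=BYTES_PER_DOUBLE):
    """Memory needed for a full n_items x n_items distance matrix."""
    return n_items * n_items * bytes_per_value


articles = 5_733_721  # articles in the filtered corpus
journals = 2_231      # journals in the filtered corpus

print(dense_matrix_bytes(articles) / 1e12)  # terabytes, roughly 263
print(dense_matrix_bytes(journals) / 1e6)   # megabytes, roughly 39.8
```

Storing only one triangle of the symmetric matrix would halve both figures, which still leaves the article-level matrix far beyond the 32GB of RAM on the server described in the talk.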
  13. Open Source Tools
     • Lucene
     • LuSql (High performance Lucene index building tool)
     • Semantic Vectors
     • R
     • Processing
     • Linux
  14. Environment
     • Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM, attached to a Dell EMC AX150 storage array via a SilkWorm 200E Series 16-Port Capable 4Gb Fabric Switch
     • Operating system: Linux openSUSE 10.2 (64-bit x86-64), kernel 2.6.18.8-0.10-default #1 SMP
     • Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit Server VM (build 10.0-b23, mixed mode)
     • Processing 1.0 (processing.org)
  15. Results: Scalability
     • Corpus: ~600GB full-text
     • Lucene index: 43GB
         – LuSql: 13 hours 51 minutes to produce
     • SV index: 58 minutes, 885 MB, 21.6m terms
         – Distance matrix: 6 minutes
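To put the indexing figures in perspective, the short calculation below derives an approximate throughput and the index-to-corpus size ratio. It uses only the numbers reported above; since the corpus size is given as approximate, so are the results.

```python
corpus_gb = 600              # approximate full-text corpus size
index_gb = 43                # Lucene index size
index_hours = 13 + 51 / 60   # LuSql indexing time: 13 hours 51 minutes

throughput_gb_per_hour = corpus_gb / index_hours
index_ratio_percent = 100 * index_gb / corpus_gb

print(round(throughput_gb_per_hour, 1))  # 43.3 GB of full text per hour
print(round(index_ratio_percent, 1))     # index is 7.2% of the corpus size
```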
  16. Results: Visualization
     • Using Processing environment, built simple validation/visualization tool
  17. Harder sciences and engineering categories
  18. Chemistry
  19. Materials Science
  20. Physics and Astronomy
  21. Engineering and Technology
  22. Mathematics
  23. Computer Science
  24. Civil Engineering
  25. Chemical Engineering
  26. Agriculture and biomedical categories
  27. Agriculture and Biological Sciences
  28. Biochemistry, Genetics and Molecular Biology
  29. Immunology and Microbiology
  30. Pharmacology
  31. Neuroscience
  32. Medicine
  33. Psychology
  34. Interdisciplinary and non-science categories
  35. Environmental Science
  36. Earth and Planetary Science
  37. Energy and Power
  38. Decision Science
  39. Economics, Econometrics and Finance
  40. Social Sciences
  41. Business, Management and Accounting
  42. Arts and Humanities
  43. Examination of outliers, extrema and cataloging errors
  44. Ecotoxicology and Environmental Safety; Organic Geochemistry; Corporate Environmental Strategy; Environmental Science
  45. Journal of Biomolecular NMR; Journal of X-Ray Science and Technology; Medicine
  46. Colloidal and Polymer Science; Annales Henri Poincare; Medicine
  47. Medicine; French language Medical & Psychology Journals
  48. Bulletin of Mathematical Biology; Journal of Medical Ultrasonics; Mathematics
  49. Conclusions
     • Reasonable mapping results
     • Full-text only (no citations, metadata) gives good results
     • Scalable to significant size
     • Open Source tools supported a complex research process and were easy to modify to deal with scalability issues
  50. Future Work
     • Proper precision and recall evaluation using same corpus
     • Validate with NetNews-20 collection for P & R
     • Evaluate non-metric MDS
     • Project articles onto semantic journal space & build interactive discovery interface & evaluate
         – Index journal 'documents' and journal articles
         – SV on all
         – Distance matrix only on journals
         – Do MDS
         – Use eigenvectors to transform N-d article vector to 2-D
     • Explore 3-D interface (MDS N-d → 3D)
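One possible reading of "use eigenvectors to transform N-d article vector to 2-D" is the standard out-of-sample step for classical MDS: given the journal layout and an article's squared distances to every journal, solve for the article's 2-D position. The sketch below is an assumption about how this could be done, not the author's stated method, and `project_point` is a hypothetical name. It is exact only when the distances are Euclidean within the plane of the layout, and approximate otherwise.

```python
def project_point(coords, sq_dists):
    """Place a new item in an existing classical-MDS layout.

    coords:   centred layout of the reference items, one row per item, with
              mutually orthogonal axes (as classical MDS produces).
    sq_dists: squared distances from the new item to each reference item.
    """
    dims = len(coords[0])
    # For such a layout, each axis's eigenvalue equals its sum of squares.
    eigenvalues = [sum(row[k] ** 2 for row in coords) for k in range(dims)]
    sq_norms = [sum(x * x for x in row) for row in coords]
    return [
        0.5
        * sum(row[k] * (q - d) for row, q, d in zip(coords, sq_norms, sq_dists))
        / eigenvalues[k]
        for k in range(dims)
    ]
```

With this approach the expensive MDS runs only on the journal-by-journal matrix, and each article costs one pass over its distances to the journals.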
  51. Acknowledgements
     • Collaborators: Michel Dumontier, Alison Callahan @ Carleton
     • Support: Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
  52. Demo
     • Link to project demo page
  53. License
     Creative Commons Attribution-Noncommercial-No Derivative Works 2.
