0
Semantic Journal Mapping for Search Visualization
in a Large Scale Article Digital Library

              Glen Newton1,2, ...
Outline

•   Maps of Science
•   Background
•   Research Interests
•   Research Goals
•   Process
•   Scalability issues
•...
From Bollen et al 2009 PLOS1
From Leydesdorff
From Leydesdorff & Rafols 2006   & Rafols 2006
From Leydesdorff & Rafols 2006
Background

• Canada Institute of Science and Technical Information (CISTI) ==
    Canadian national science library
• ~30...
Research Interests

• Domain-specific discovery
• Improved discovery in STM domains through results visualization
    and ...
Research Goals

• Find way to extract journal (& article) semantic vector space
• Latent Semantic Analysis (LSA) works for...
Corpus

• Licensed journal articles from STM publishers: Elsevier, Springer,
     etc
• ~4100 journal titles, classified i...
Category                                       # Journals
                                               per category
Agri...
Category                       # Journals per category
Energy and Power               73
Engineering and Technology     32...
Process

• Index full-text (only) with Lucene 2.4, aggressive stopword list,
     Porter stemming using LuSql tool
• Build...
Scalability Issues

•  #items, #unique terms
        – #unique terms: SV easily handles very well
        – #items: SV han...
Environment

• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050
    processors with 2x2MB cache, 3.0 Ghz 64bit, 3...
Results: Scalability

• Corpus: ~600GB full-text
• Lucene index: 43GB
      – LuSql: 13 hours 51 minutes to produce
• SV i...
Results: Visualization

• Using Processing environment, built simple
    validation/visualization tool
Harder sciences and
engineering categories
Chemistry
Material Science
Physics and
Astronomy
Engineering and
Technology
Mathematics
Computer Science
Civil Engineering
Chemical Engineering
Agriculture and
biomedical categories
Agriculture and
Biological Sciences
Biochemistry, Genetics
and Molecular Biology
Immunology and
Microbiology
Pharmacology
Neuroscience
Medicine
Medicine
Psychology
Interdisciplinary and
non-science categories
Environmental Science
Earth and
Planetary Science
Energy and Power
Decision Science
Economics,
Econometrics
And Finance
Social Sciences
Business, Management
and Accounting
Arts and Humanities
Examination of outliers,
extrema and cataloging
errors
Ecotoxicology and
Environmental Safety
                       Organic Geochemistry




                              Corpo...
Journal of Biomolecular NMR



              Journal of X-Ray
              Science and Technology




           Medicine...
Colloidal and
Polymer Science




                  Annales Henri Poincare




        Medicine
        Medicine
Medicine
         Medicine
French language Medical
& Psychology Journals
Bulletin of
              Mathematical Biology




Journal of
Medical
Ultrasonics




                 Mathematics
Conclusions

• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to signi...
Future Work

• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• ...
Acknowledgements

• Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
Demo

• Link to project demo page
License




Creative Commons Attribution-Noncommercial-No Derivative Works 2.5
Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library
Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library
Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library
Upcoming SlideShare
Loading in...5
×

Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library

1,839

Published on

In this paper we examine the scalability and utility of semantically mapping (visualizing) journals in a large scale (5.7+ million) science, technology and medical article digital library. This work is part of a larger research effort to evaluate semantic journal and article mapping for search query results refinement and visual contextualization in a large scale digital library. In this work the Semantic Vectors software package is parallelized and evaluated to create semantic distances between 2365 journals, from the sum of their full-text. This is used to create a journal semantic map whose production does scale and whose results are comparable to other maps of the scientific literature.

Published in: Technology, Education
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,839
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
54
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Transcript of "Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library"

  1. 1. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library Glen Newton1,2, Alison Callahan1, Michel Dumontier2 1 National Research Council Canada, 2Carleton University Second Workshop on Very Large Digital Libraries (VLDL) 2009 at ECDL 2009 Oct 2 2009 Corfu, Greece
  2. 2. Outline • Maps of Science • Background • Research Interests • Research Goals • Process • Scalability issues • Environment • Results • Conclusions • Future Work
  3. 3. From Bollen et al 2009 PLOS1
  4. 4. From Leydesdorff From Leydesdorff & Rafols 2006 & Rafols 2006
  5. 5. From Leydesdorff & Rafols 2006
  6. 6. Background • Canada Institute of Science and Technical Information (CISTI) == Canadian national science library • ~3000 active researchers at NRC • Large full text collection of ~8.4m full-text + metadata articles, in science, technology, medicine (STM) • 4100 journal titles • ~1995 to 2009
  7. 7. Research Interests • Domain-specific discovery • Improved discovery in STM domains through results visualization and contextualization, browse/explore/refine • Results set visualization: “mapping”
  8. 8. Research Goals • Find way to extract journal (& article) semantic vector space • Latent Semantic Analysis (LSA) works for small/medium sized corpora, does not scale to large scale of items and/or terms • New alternative: Semantic Vectors (SV): uses random vectors & avoids expensive singular value decomposition (SVD) • Can SV scale & generate sensible semantic vector space of journals on corpus of this size? • Can the visualization produced be useful for results query visualization, refinement, discovery?
  9. 9. Corpus • Licensed journal articles from STM publishers: Elsevier, Springer, etc • ~4100 journal titles, classified into 23 categories (by librarians) • ~8.4m journal articles • Selection of articles/journals: – Only those with authors, abstract (no notices, obituaries, etc) – Only English language articles – Only journals with >50 articles in corpus – Resulting corpus: 5,733,721 articles from 2231 journals – Categories overlapping: 1.53 categories per journal
  10. 10. Category # Journals per category Agriculture & Biological Sciences 358 Arts and Humanities 70 Biochemistry, Genetics and Molecular Biology 240 Business, Management and Accounting 106 Chemical Engineering 126 Chemistry 226 Civil Engineering 64 Computer Science 218 Decision Science 50 Earth and Planetary Science 146 Economics, Econometrics and Finance 112
  11. 11. Category # Journals per category Energy and Power 73 Engineering and Technology 328 Environmental Science 138 Immunology and Microbiology 104 Materials Science 160 Mathematics 205 Medicine 671 Neuroscience 103 Pharmacology, Toxicology and 73 Pharmaceutics Physics and Astronomy 210 Psychology 126 Social Science 222
  12. 12. Process • Index full-text (only) with Lucene 2.4, aggressive stopword list, Porter stemming using LuSql tool • Build Semantic Vectors (v1.18, parallelized) index from Lucene index, with 512 semantic dimensions • Find item x item distance matrix from SV index of 512- dimensional vectors • Using R, use multidimensional scaling (MDS) to reduce from 512- D to 2-D
  13. 13. Scalability Issues • #items, #unique terms – #unique terms: SV easily handles very well – #items: SV handles fairly well – #items: impacts size of distance matrix (#items x #items) – R cannot handle huge article distance matrix in MDS (i.e. millions of articles vs. thousands of journals) • Instead of using articles for items, use journals for items • Make single large full-text document from concatenation of all articles of particular journal & index these
  14. 14. Environment • Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 Ghz 64bit, 32GB RAM, attached to a Dell EMC AX150 storage arrays via SilkWorm 200E Series 16-Port Capable 4Gb Fabric Switch. • Operating system: Linux openSUSE 10.2 (64-bit X86-64), kernel 2.6.18.8-0.10-default #1 SMP • Java version 1.6.0.07 (build 1.6.0 07-b06) Java HotSpot 64-Bit Server VM (build 10.0-b23, mixed mode). • Processing 1.0 (processing.org)
  15. 15. Results: Scalability • Corpus: ~600GB full-text • Lucene index: 43GB – LuSql: 13 hours 51 minutes to produce • SV index: 58 minutes, 885 MB, 21.6m terms – Distance matrix: 6 minutes
  16. 16. Results: Visualization • Using Processing environment, built simple validation/visualization tool
  17. 17. Harder sciences and engineering categories
  18. 18. Chemistry
  19. 19. Material Science
  20. 20. Physics and Astronomy
  21. 21. Engineering and Technology
  22. 22. Mathematics
  23. 23. Computer Science
  24. 24. Civil Engineering
  25. 25. Chemical Engineering
  26. 26. Agriculture and biomedical categories
  27. 27. Agriculture and Biological Sciences
  28. 28. Biochemistry, Genetics and Molecular Biology
  29. 29. Immunology and Microbiology
  30. 30. Pharmacology
  31. 31. Neuroscience
  32. 32. Medicine Medicine
  33. 33. Psychology
  34. 34. Interdisciplinary and non-science categories
  35. 35. Environmental Science
  36. 36. Earth and Planetary Science
  37. 37. Energy and Power
  38. 38. Decision Science
  39. 39. Economics, Econometrics And Finance
  40. 40. Social Sciences
  41. 41. Business, Management and Accounting
  42. 42. Arts and Humanities
  43. 43. Examination of outliers, extrema and cataloging errors
  44. 44. Ecotoxicology and Environmental Safety Organic Geochemistry Corporate Environmental Strategy Environmental Science
  45. 45. Journal of Biomolecular NMR Journal of X-Ray Science and Technology Medicine Medicine
  46. 46. Colloidal and Polymer Science Annales Henri Poincare Medicine Medicine
  47. 47. Medicine Medicine French language Medical & Psychology Journals
  48. 48. Bulletin of Mathematical Biology Journal of Medical Ultrasonics Mathematics
  49. 49. Conclusions • Reasonable mapping results • Full-text only (no citations, metadata) gives good results • Scalable to significant size
  50. 50. Future Work • Proper precision and recall evaluation using same corpus • Validate with NetNews-20 collection for P & R • Evaluate non-metric MDS • Project articles onto semantic journal space & build interactive discovery interface & evaluate – Index journal 'documents' and journal articles – SV on all – Distance matrix only on journals – Do MDS – Use eigenvectors to transform N-d article vector to 2-D • Explore 3-D interface (MDS N-d → 3D)
  51. 51. Acknowledgements • Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
  52. 52. Demo • Link to project demo page
  53. 53. License Creative Commons Attribution-Noncommercial-No Derivative Works 2.5
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×