1. Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale Article Digital Library
Glen Newton
glen.newton@gmail.com
Biology Dept, Carleton University
http://zzzoot.blogspot.com/
Code4Lib-North
Queen's University, Kingston, Ontario
Friday May 7 2010
Based on VLDL2009 Workshop
Presentation at ECDL2009
2. Outline
• Maps of Science
• Broad Research Interests
• Research Goals
• Process
• Scalability issues
• Open Source Tools
• Environment
• Results
• Conclusions
• Future Work
6. Broad Research Interests
• Search results visualization & refinement
• Domain-specific discovery, with a particular interest in genomics
and drug discovery
• Improved discovery in STM domains through results visualization
and contextualization, browse/explore/refine
• Use of Open Source tools in complex research problem spaces
7. Research Goals
• Use Open Source tools to support large scale semantic text analysis and
visualization
• Find a way to extract a journal (& article) semantic vector space (semantics
are much better than keyword- or tf-idf-based representations of natural
language)
• Latent Semantic Analysis (LSA) works for small/medium sized corpora, but
does not scale to large numbers of items and/or terms
• New alternative: Semantic Vectors (SV): uses random vectors & avoids
expensive singular value decomposition (SVD)
• Can SV scale & generate sensible semantic vector space of journals on
corpus of this size?
• Can the resulting visualization be useful for query results visualization,
refinement, and discovery?
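The random-vector idea behind SV can be sketched as follows. This is a minimal illustration of random indexing in general, not the Semantic Vectors package's own code: the function names, the `seeds` parameter, and the dense count matrix are assumptions made for the example. Each document gets a sparse random vector, term vectors are weighted sums of those, and document vectors are then weighted sums of term vectors, so no SVD is needed.

```python
import numpy as np

def random_index_vectors(n_items, dim=512, seeds=10, rng=None):
    """Sparse ternary vectors: `seeds` entries per row set to +1 or -1, rest 0."""
    rng = np.random.default_rng(rng)
    vecs = np.zeros((n_items, dim))
    for row in vecs:  # each row is a view, so assignment updates vecs in place
        idx = rng.choice(dim, size=seeds, replace=False)
        row[idx] = rng.choice([-1.0, 1.0], size=seeds)
    return vecs

def semantic_vectors(term_doc_counts, dim=512, seeds=10, rng=0):
    """term_doc_counts: (n_terms, n_docs) matrix of term counts (hypothetical input).

    Returns (term_vectors, doc_vectors), both with `dim` columns.
    """
    n_terms, n_docs = term_doc_counts.shape
    doc_index = random_index_vectors(n_docs, dim, seeds, rng)
    # A term vector is the count-weighted sum of the random vectors
    # of the documents in which the term occurs.
    term_vecs = term_doc_counts @ doc_index
    # A document vector is the count-weighted sum of its term vectors.
    doc_vecs = term_doc_counts.T @ term_vecs
    return term_vecs, doc_vecs
```

The cost is a pair of matrix products over the term-document counts, which is why the approach can be expected to scale where SVD does not.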
8. Corpus
• Licensed journal articles from STM publishers: Elsevier, Springer,
etc.
• ~4100 journal titles, classified into 23 categories (by publishers)
• ~8.4m journal articles
• Selection of articles/journals:
– Only those with authors and an abstract (no notices, obituaries, etc.)
– Only English language articles
– Only journals with >50 articles in corpus
– Resulting corpus: 5,733,721 articles from 2231 journals
– Categories overlap: 1.53 categories per journal on average
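The selection rules above can be sketched as a simple filter. The record layout (`journal`, `authors`, `abstract`, `language` keys) is hypothetical, and applying the >50 threshold after the other filters is an assumption; the slides do not say in which order the rules were applied.

```python
from collections import Counter

def select_articles(articles, min_per_journal=50):
    """Keep English articles with authors and an abstract,
    from journals with more than `min_per_journal` such articles."""
    eligible = [
        a for a in articles
        if a.get("authors") and a.get("abstract") and a.get("language") == "en"
    ]
    # Count eligible articles per journal, then drop small journals.
    per_journal = Counter(a["journal"] for a in eligible)
    return [a for a in eligible if per_journal[a["journal"]] > min_per_journal]
```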
9. Corpus
Category  # Journals per category
Agriculture & Biological Sciences 358
Arts and Humanities 70
Biochemistry, Genetics and Molecular Biology 240
Business, Management and Accounting 106
Chemical Engineering 126
Chemistry 226
Civil Engineering 64
Computer Science 218
Decision Science 50
Earth and Planetary Science 146
Economics, Econometrics and Finance 112
10. Corpus (continued)
Category  # Journals per category
Energy and Power 73
Engineering and Technology 328
Environmental Science 138
Immunology and Microbiology 104
Materials Science 160
Mathematics 205
Medicine 671
Neuroscience 103
Pharmacology, Toxicology and Pharmaceutics 73
Physics and Astronomy 210
Psychology 126
Social Science 222
11. Process
• Index full text (only) with Lucene 2.4 via the LuSql tool, using an
aggressive stopword list and Porter stemming
• Build Semantic Vectors (v1.18, parallelized) index from Lucene
index, with 512 semantic dimensions
• Find item x item distance matrix from SV index of 512-dimensional
vectors
• In R, use multidimensional scaling (MDS) to reduce from 512-D
to 2-D
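The last two steps can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in the project: the slides do the MDS in R and do not name the distance measure, so cosine distance and classical (metric) MDS are both assumptions here, and the function names are invented for the example.

```python
import numpy as np

def cosine_distance_matrix(vectors):
    """Item x item matrix of 1 - cosine similarity (assumed distance measure)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = np.clip(unit @ unit.T, -1.0, 1.0)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    return dist

def classical_mds(dist, k=2):
    """Classical MDS: double-centre the squared distances, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]        # k largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale             # (n, k) coordinates

# Usage: 2-D coordinates for a small set of 512-D vectors.
if __name__ == "__main__":
    vectors = np.random.default_rng(0).normal(size=(20, 512))
    xy = classical_mds(cosine_distance_matrix(vectors), k=2)
    print(xy.shape)
```

Note that the MDS step works on the full item x item matrix, so its memory and time cost grow with the square of the number of items.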
12. Scalability Issues
• #items, #unique terms
– #unique terms: SV handles very well
– #items: SV handles fairly well
– #items: impacts size of distance matrix (#items x #items)
– R cannot handle huge article distance matrix in MDS (i.e.
millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make single large full-text document from concatenation of all
articles of particular journal & index these
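A back-of-the-envelope sketch of why the switch from articles to journals matters. It assumes a dense matrix of 8-byte doubles, which is an assumption about how the matrix would be stored; the `(journal_id, text)` pair layout and the function names are likewise hypothetical.

```python
from collections import defaultdict

def distance_matrix_bytes(n_items, bytes_per_value=8):
    """Size in bytes of a dense n_items x n_items matrix."""
    return n_items * n_items * bytes_per_value

def journal_documents(articles):
    """Concatenate article full text into one document per journal.

    `articles` is an iterable of (journal_id, text) pairs.
    """
    parts = defaultdict(list)
    for journal_id, text in articles:
        parts[journal_id].append(text)
    return {journal_id: "\n".join(texts) for journal_id, texts in parts.items()}

if __name__ == "__main__":
    # Item counts taken from the corpus slide.
    for label, n in [("articles", 5_733_721), ("journals", 2_231)]:
        print(f"{label}: {distance_matrix_bytes(n) / 2**30:,.2f} GiB")
```

Under these assumptions the article-level matrix is on the order of hundreds of terabytes, while the journal-level matrix is tens of megabytes and fits easily in the 32GB of RAM on the server.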
13. Open Source Tools
• Lucene
• LuSql (High performance Lucene index building tool)
• Semantic Vectors
• R
• Processing
• Linux
14. Environment
• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050
processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM,
attached to a Dell EMC AX150 storage array via a SilkWorm
200E Series 16-Port Capable 4Gb Fabric Switch.
• Operating system: Linux openSUSE 10.2 (64-bit X86-64), kernel
2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit
Server VM (build 10.0-b23, mixed mode).
• Processing 1.0 (processing.org)
49. Medicine
[Map detail. Labels: Medicine; French language Medical & Psychology Journals]
50. [Map detail. Labels: Bulletin of Mathematical Biology; Journal of Medical Ultrasonics; Mathematics]
51. Conclusions
• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size
• Open Source tools supported a complex research process and
were easy to modify to deal with scalability issues
52. Future Work
• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive
discovery interface & evaluate
– Index journal 'documents' and journal articles
– SV on all
– Distance matrix only on journals
– Do MDS
– Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
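One possible reading of the "use eigenvectors" step, sketched under assumptions: the slides do not specify the method, so this uses a PCA-style projection in which the axes are fitted on the journal vectors only and then applied to article vectors. For Euclidean distances this gives the journals the same layout as classical MDS, up to sign. The function names are invented for the example, and `k=3` would cover the 3-D case.

```python
import numpy as np

def fit_journal_axes(journal_vecs, k=2):
    """Fit the mean and top-k principal axes of the journal vectors."""
    mean = journal_vecs.mean(axis=0)
    centered = journal_vecs - mean
    eigvals, eigvecs = np.linalg.eigh(centered.T @ centered)
    axes = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # (n_dims, k)
    return mean, axes

def project(vecs, mean, axes):
    """Map N-d vectors (journals or articles) into the k-D journal space."""
    return (vecs - mean) @ axes

# Usage: fit on journals, then place articles on the same map.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    journals = rng.normal(size=(30, 512))
    articles = rng.normal(size=(1000, 512))
    mean, axes = fit_journal_axes(journals, k=2)
    print(project(articles, mean, axes).shape)
```

The decomposition here is over the vector dimensions rather than the items, so no article x article distance matrix is ever built.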