SlideShare a Scribd company logo
1 of 56
Download to read offline
Semantic Journal Mapping for Search Visualization
in a Large Scale Article Digital Library

              Glen Newton1,2, Alison Callahan1, Michel
                              Dumontier2
          1
           National Research Council Canada, 2Carleton University
        Second Workshop on Very Large Digital Libraries (VLDL) 2009
                               at ECDL 2009
                         Oct 2 2009 Corfu, Greece
Outline

•   Maps of Science
•   Background
•   Research Interests
•   Research Goals
•   Process
•   Scalability issues
•   Environment
•   Results
•   Conclusions
•   Future Work
From Bollen et al 2009 PLOS1
From Leydesdorff
From Leydesdorff & Rafols 2006   & Rafols 2006
From Leydesdorff & Rafols 2006
Background

• Canada Institute of Science and Technical Information (CISTI) ==
    Canadian national science library
• ~3000 active researchers at NRC
• Large full text collection of ~8.4m full-text + metadata articles, in
    science, technology, medicine (STM)
• 4100 journal titles
• ~1995 to 2009
Research Interests

• Domain-specific discovery
• Improved discovery in STM domains through results visualization
    and contextualization, browse/explore/refine
• Results set visualization: “mapping”
Research Goals

• Find way to extract journal (& article) semantic vector space
• Latent Semantic Analysis (LSA) works for small/medium sized
     corpora, does not scale to large scale of items and/or terms
• New alternative: Semantic Vectors (SV): uses random vectors &
     avoids expensive singular value decomposition (SVD)
• Can SV scale & generate sensible semantic vector space of
     journals on corpus of this size?
• Can the visualization produced be useful for results query
     visualization, refinement, discovery?
Corpus

• Licensed journal articles from STM publishers: Elsevier, Springer,
     etc
• ~4100 journal titles, classified into 23 categories (by librarians)
• ~8.4m journal articles
• Selection of articles/journals:
       – Only those with authors, abstract (no notices, obituaries, etc)
       – Only English language articles
       – Only journals with >50 articles in corpus
       – Resulting corpus: 5,733,721 articles from 2231 journals
       – Categories overlapping: 1.53 categories per journal
Category                                       # Journals
                                               per category
Agriculture & Biological Sciences              358
Arts and Humanities                            70
Biochemistry, Genetics and Molecular Biology   240
Business, Management and Accounting            106
Chemical Engineering                           126
Chemistry                                      226
Civil Engineering                              64
Computer Science                               218
Decision Science                               50
Earth and Planetary Science                    146
Economics, Econometrics and Finance            112
Category                       # Journals per category
Energy and Power               73
Engineering and Technology     328
Environmental Science          138
Immunology and Microbiology    104
Materials Science              160
Mathematics                    205
Medicine                       671
Neuroscience                   103
Pharmacology, Toxicology and   73
Pharmaceutics
Physics and Astronomy          210
Psychology                     126
Social Science                 222
Process

• Index full-text (only) with Lucene 2.4, aggressive stopword list,
     Porter stemming using LuSql tool
• Build Semantic Vectors (v1.18, parallelized) index from Lucene
     index, with 512 semantic dimensions
• Find item x item distance matrix from SV index of 512-
     dimensional vectors
• Using R, use multidimensional scaling (MDS) to reduce from 512-
     D to 2-D
Scalability Issues

•  #items, #unique terms
        – #unique terms: SV easily handles very well
        – #items: SV handles fairly well
        – #items: impacts size of distance matrix (#items x #items)
        – R cannot handle huge article distance matrix in MDS (i.e.
             millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make single large full-text document from concatenation of all
      articles of particular journal & index these
Environment

• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050
    processors with 2x2MB cache, 3.0 Ghz 64bit, 32GB RAM,
    attached to a Dell EMC AX150 storage arrays via SilkWorm
    200E Series 16-Port Capable 4Gb Fabric Switch.
• Operating system: Linux openSUSE 10.2 (64-bit X86-64), kernel
    2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0.07 (build 1.6.0 07-b06) Java HotSpot 64-Bit
  Server VM (build 10.0-b23, mixed mode).
• Processing 1.0 (processing.org)
Results: Scalability

• Corpus: ~600GB full-text
• Lucene index: 43GB
      – LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
      – Distance matrix: 6 minutes
Results: Visualization

• Using Processing environment, built simple
    validation/visualization tool
Harder sciences and
engineering categories
Chemistry
Material Science
Physics and
Astronomy
Engineering and
Technology
Mathematics
Computer Science
Civil Engineering
Chemical Engineering
Agriculture and
biomedical categories
Agriculture and
Biological Sciences
Biochemistry, Genetics
and Molecular Biology
Immunology and
Microbiology
Pharmacology
Neuroscience
Medicine
Medicine
Psychology
Interdisciplinary and
non-science categories
Environmental Science
Earth and
Planetary Science
Energy and Power
Decision Science
Economics,
Econometrics
And Finance
Social Sciences
Business, Management
and Accounting
Arts and Humanities
Examination of outliers,
extrema and cataloging
errors
Ecotoxicology and
Environmental Safety
                       Organic Geochemistry




                              Corporate Environmental
                              Strategy


                         Environmental Science
Journal of Biomolecular NMR



              Journal of X-Ray
              Science and Technology




           Medicine
           Medicine
Colloidal and
Polymer Science




                  Annales Henri Poincare




        Medicine
        Medicine
Medicine
         Medicine
French language Medical
& Psychology Journals
Bulletin of
              Mathematical Biology




Journal of
Medical
Ultrasonics




                 Mathematics
Conclusions

• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size
Future Work

• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive
    discovery interface & evaluate
       – Index journal 'documents' and journal articles
       – SV on all
       – Distance matrix only on journals
       – Do MDS
       – Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
Acknowledgements

• Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
Demo

• Link to project demo page
License




Creative Commons Attribution-Noncommercial-No Derivative Works 2.5
Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library

More Related Content

Similar to Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library

A new software tool for large-scale analysis of citation networks
A new software tool for large-scale analysis of citation networksA new software tool for large-scale analysis of citation networks
A new software tool for large-scale analysis of citation networks
Nees Jan van Eck
 
NASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & EngineeringNASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & Engineering
inside-BigData.com
 

Similar to Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library (20)

A new software tool for large-scale analysis of citation networks
A new software tool for large-scale analysis of citation networksA new software tool for large-scale analysis of citation networks
A new software tool for large-scale analysis of citation networks
 
Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011
 
10 Years of Multi-Label Learning
10 Years of Multi-Label Learning10 Years of Multi-Label Learning
10 Years of Multi-Label Learning
 
1. Intro DS.pptx
1. Intro DS.pptx1. Intro DS.pptx
1. Intro DS.pptx
 
NASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & EngineeringNASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & Engineering
 
Modern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High PerformanceModern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High Performance
 
International Journal of Advanced Smart Sensor Network Systems (IJASSN)-Free ...
International Journal of Advanced Smart Sensor Network Systems (IJASSN)-Free ...International Journal of Advanced Smart Sensor Network Systems (IJASSN)-Free ...
International Journal of Advanced Smart Sensor Network Systems (IJASSN)-Free ...
 
EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017 EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017
 
Call for paper-International Journal of Advanced Smart Sensor Network Systems...
Call for paper-International Journal of Advanced Smart Sensor Network Systems...Call for paper-International Journal of Advanced Smart Sensor Network Systems...
Call for paper-International Journal of Advanced Smart Sensor Network Systems...
 
Call for paper-International Journal of Advanced Smart Sensor Network Systems...
Call for paper-International Journal of Advanced Smart Sensor Network Systems...Call for paper-International Journal of Advanced Smart Sensor Network Systems...
Call for paper-International Journal of Advanced Smart Sensor Network Systems...
 
International Journal of Advanced Smart Sensor Network Systems (IJASSN)free p...
International Journal of Advanced Smart Sensor Network Systems (IJASSN)free p...International Journal of Advanced Smart Sensor Network Systems (IJASSN)free p...
International Journal of Advanced Smart Sensor Network Systems (IJASSN)free p...
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
 
Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...
 
International Journal of Advanced Smart Sensor Network Systems (IJASSN)
International Journal of Advanced Smart Sensor Network Systems (IJASSN)International Journal of Advanced Smart Sensor Network Systems (IJASSN)
International Journal of Advanced Smart Sensor Network Systems (IJASSN)
 
Call for presentation-International Journal of Advanced Smart Sensor Network ...
Call for presentation-International Journal of Advanced Smart Sensor Network ...Call for presentation-International Journal of Advanced Smart Sensor Network ...
Call for presentation-International Journal of Advanced Smart Sensor Network ...
 
Application of a Novel Subject Classification Scheme for a Bibliographic Data...
Application of a Novel Subject Classification Scheme for a Bibliographic Data...Application of a Novel Subject Classification Scheme for a Bibliographic Data...
Application of a Novel Subject Classification Scheme for a Bibliographic Data...
 
Science Mapping and Research Positioning
Science Mapping and Research PositioningScience Mapping and Research Positioning
Science Mapping and Research Positioning
 
Bme451 Fall07 Final
Bme451 Fall07 FinalBme451 Fall07 Final
Bme451 Fall07 Final
 
Call for paper-3rd International Conference on Big Data and Applications (BDA...
Call for paper-3rd International Conference on Big Data and Applications (BDA...Call for paper-3rd International Conference on Big Data and Applications (BDA...
Call for paper-3rd International Conference on Big Data and Applications (BDA...
 
Collins seattle-2014-final
Collins seattle-2014-finalCollins seattle-2014-final
Collins seattle-2014-final
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library