Using Open Source Tools for Visualization
 and Semantic Mapping in a Large Scale
         Article Digital Library

                  Glen Newton
               glen.newton@gmail.com
            Biology Dept, Carleton University
           http://zzzoot.blogspot.com/

                    Code4Lib-North
           Queen's University, Kingston, Ontario
                   Friday May 7 2010

             Based on VLDL2009 Workshop
               Presentation at ECDL2009
Outline

•   Maps of Science
•   Broad Research Interests
•   Research Goals
•   Process
•   Scalability issues
•   Open Source Tools
•   Environment
•   Results
•   Conclusions
•   Future Work
From Bollen et al. 2009, PLoS ONE
From Leydesdorff & Rafols 2006
From Leydesdorff & Rafols 2006
Broad Research Interests
• Search results visualization & refinement
• Domain-specific discovery, with a particular interest in genomics
    and drug discovery
• Improved discovery in STM domains through results visualization
    and contextualization, browse/explore/refine
• Use of Open Source tools in complex research problem spaces
Research Goals

• Use Open Source tools to support large scale semantic text analysis and
     visualization
• Find a way to extract a journal (& article) semantic vector space (semantics
     represent natural language much better than keyword- or tf-idf-based
     representations)
• Latent Semantic Analysis (LSA) works for small/medium sized corpora, but
     does not scale to large numbers of items and/or terms
• New alternative: Semantic Vectors (SV): uses random vectors & avoids
     expensive singular value decomposition (SVD)
• Can SV scale & generate sensible semantic vector space of journals on
     corpus of this size?
• Can the visualization produced be useful for query results visualization,
     refinement, discovery?
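
The contrast between LSA and random-vector approaches can be sketched in a few lines. The following is a minimal Python illustration of random indexing (summing sparse random term vectors), assuming NumPy. It is not the Semantic Vectors package's code; the function names, the ternary vector scheme and the `seeds=10` setting are illustrative assumptions.

```python
import numpy as np

def random_index_vectors(vocab, dim=512, seeds=10, rng=None):
    """Give each term a sparse random vector: a few +1 and -1 entries, rest 0."""
    rng = rng or np.random.default_rng(0)
    vectors = {}
    for term in vocab:
        v = np.zeros(dim)
        slots = rng.choice(dim, size=seeds, replace=False)
        v[slots[: seeds // 2]] = 1.0
        v[slots[seeds // 2:]] = -1.0
        vectors[term] = v
    return vectors

def document_vectors(docs, term_vectors, dim=512):
    """Sum the random vectors of each document's terms (no SVD involved)."""
    out = np.zeros((len(docs), dim))
    for i, tokens in enumerate(docs):
        for tok in tokens:
            out[i] += term_vectors[tok]
        norm = np.linalg.norm(out[i])
        if norm > 0:
            out[i] /= norm  # unit length, so dot products are cosines
    return out

# Toy documents (hypothetical data, already tokenized)
docs = [
    ["gene", "protein", "sequence", "gene"],
    ["protein", "gene", "expression"],
    ["galaxy", "star", "orbit", "star"],
]
vocab = sorted({t for d in docs for t in d})
tv = random_index_vectors(vocab)
dv = document_vectors(docs, tv)
sim = dv @ dv.T  # pairwise cosine similarities
```

In this toy example the two biology documents share terms and so end up much closer to each other than to the astronomy document, without any matrix decomposition.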
Corpus

• Licensed journal articles from STM publishers: Elsevier, Springer,
     etc
• ~4100 journal titles, classified into 23 categories (by publishers)
• ~8.4m journal articles
• Selection of articles/journals:
       – Only those with authors, abstract (no notices, obituaries, etc)
       – Only English language articles
       – Only journals with >50 articles in corpus
       – Resulting corpus: 5,733,721 articles from 2231 journals
       – Categories overlapping: 1.53 categories per journal
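
The selection rules above can be sketched as a simple filter. This is a hedged Python sketch: the record fields (`journal`, `authors`, `abstract`, `language`) and the function name `select_corpus` are hypothetical, and it assumes the >50 threshold is applied after the article-level filters, which the slide does not state.

```python
from collections import Counter

def select_corpus(articles, min_articles=50):
    """Apply the slide's selection rules to a list of article records."""
    # Keep only articles with authors and an abstract, written in English.
    kept = [
        a for a in articles
        if a["authors"] and a["abstract"] and a["language"] == "en"
    ]
    # Assumption: journals are counted after the article-level filters.
    per_journal = Counter(a["journal"] for a in kept)
    return [a for a in kept if per_journal[a["journal"]] > min_articles]

# Tiny hypothetical sample; the threshold is lowered so the effect is visible.
sample = [
    {"journal": "J1", "authors": ["A"], "abstract": "text", "language": "en"},
    {"journal": "J1", "authors": ["B"], "abstract": "text", "language": "en"},
    {"journal": "J1", "authors": [], "abstract": "", "language": "en"},
    {"journal": "J2", "authors": ["C"], "abstract": "text", "language": "fr"},
    {"journal": "J3", "authors": ["D"], "abstract": "text", "language": "en"},
]
selected = select_corpus(sample, min_articles=1)
```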
Corpus
 Category                                       # Journals per category
 Agriculture & Biological Sciences              358
 Arts and Humanities                            70
 Biochemistry, Genetics and Molecular Biology   240
 Business, Management and Accounting            106
 Chemical Engineering                           126
 Chemistry                                      226
 Civil Engineering                              64
 Computer Science                               218
 Decision Science                               50
 Earth and Planetary Science                    146
 Economics, Econometrics and Finance            112
 Energy and Power                               73
 Engineering and Technology                     328
 Environmental Science                          138
 Immunology and Microbiology                    104
 Materials Science                              160
 Mathematics                                    205
 Medicine                                       671
 Neuroscience                                   103
 Pharmacology, Toxicology and Pharmaceutics     73
 Physics and Astronomy                          210
 Psychology                                     126
 Social Science                                 222
Process

• Index full text (only) with Lucene 2.4 using the LuSql tool, with an
     aggressive stopword list and Porter stemming
• Build Semantic Vectors (v1.18, parallelized) index from Lucene
     index, with 512 semantic dimensions
• Find item x item distance matrix from SV index of 512-dimensional
     vectors
• Using R, use multidimensional scaling (MDS) to reduce from 512-D
     to 2-D
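
The last two steps (distance matrix, then MDS) can be sketched as follows. The talk used R for MDS; this Python/NumPy version is only an illustration. It assumes cosine distance and classical (metric) MDS, neither of which the slides specify, and it uses random vectors as a stand-in for the real SV index.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Item x item cosine distances for row vectors in X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - unit @ unit.T
    np.fill_diagonal(D, 0.0)
    return np.clip(D, 0.0, None)  # guard against tiny negative round-off

def classical_mds(D, k=2):
    """Classical MDS: double-center squared distances, keep top-k eigenpairs."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]            # indices of the k largest
    w_top = np.clip(w[idx], 0.0, None)
    return V[:, idx] * np.sqrt(w_top)

# Stand-in data: 20 hypothetical "journals" with 512-dimensional vectors
rng = np.random.default_rng(42)
journals = rng.normal(size=(20, 512))
D = cosine_distance_matrix(journals)
coords = classical_mds(D, k=2)               # one (x, y) pair per journal
```

The eigendecomposition works on the full n x n matrix, which is why the number of items is the limiting factor in this step.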
Scalability Issues

•  #items, #unique terms
        – #unique terms: SV handles very well
        – #items: SV handles fairly well
        – #items: impacts size of distance matrix (#items x #items)
        – R cannot handle huge article distance matrix in MDS (i.e.
             millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make single large full-text document from concatenation of all
      articles of particular journal & index these
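
A rough calculation shows why the item count matters. The sketch below assumes a dense matrix of 8-byte doubles (an assumption; the slides do not give the storage format) and uses the corpus figures from the earlier slide. It also sketches the journal-as-document concatenation; the function names and the (journal, text) input shape are hypothetical.

```python
from collections import defaultdict

def dense_matrix_bytes(n_items, bytes_per_value=8):
    """Memory for a full n x n distance matrix (assumes 8-byte doubles)."""
    return n_items * n_items * bytes_per_value

def journal_documents(articles):
    """Concatenate article texts into one large document per journal.

    `articles` is an iterable of (journal, full_text) pairs (hypothetical shape).
    """
    parts = defaultdict(list)
    for journal, text in articles:
        parts[journal].append(text)
    return {journal: "\n".join(texts) for journal, texts in parts.items()}

article_bytes = dense_matrix_bytes(5_733_721)   # articles as items
journal_bytes = dense_matrix_bytes(2_231)       # journals as items

print(f"articles as items: {article_bytes / 1e12:.0f} TB")
print(f"journals as items: {journal_bytes / 1e6:.1f} MB")

docs = journal_documents([
    ("J1", "first article"),
    ("J2", "another article"),
    ("J1", "second article"),
])
```

Under these assumptions the article-level matrix needs on the order of hundreds of terabytes, while the journal-level matrix fits in tens of megabytes.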
Open Source Tools

•   Lucene
•   LuSql (High performance Lucene index building tool)
•   Semantic Vectors
•   R
•   Processing
•   Linux
Environment

• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050
    processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM,
    attached to a Dell EMC AX150 storage array via a SilkWorm
    200E Series 16-Port Capable 4Gb Fabric Switch
• Operating system: Linux openSUSE 10.2 (64-bit x86-64), kernel
    2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit
  Server VM (build 10.0-b23, mixed mode)
• Processing 1.0 (processing.org)
Results: Scalability

• Corpus: ~600GB full-text
• Lucene index: 43GB
      – LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
      – Distance matrix: 6 minutes
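
The figures above imply some rough rates, sketched below. This is back-of-envelope arithmetic only: it assumes "~600GB" means decimal gigabytes and that the 13 h 51 min run covered the whole ~600GB corpus and the 5,733,721 selected articles, which the slides do not state explicitly.

```python
# Indexing time reported on the slide: 13 hours 51 minutes
indexing_seconds = 13 * 3600 + 51 * 60

articles = 5_733_721      # selected articles (assumed to be what was indexed)
corpus_bytes = 600e9      # "~600GB", read as decimal gigabytes (assumption)
index_bytes = 43e9        # Lucene index size

articles_per_second = articles / indexing_seconds
megabytes_per_second = corpus_bytes / indexing_seconds / 1e6
index_to_corpus_ratio = index_bytes / corpus_bytes

print(f"{articles_per_second:.0f} articles/s")
print(f"{megabytes_per_second:.1f} MB/s")
print(f"index is {index_to_corpus_ratio:.1%} of corpus size")
```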
Results: Visualization

• Using Processing environment, built simple
    validation/visualization tool
Harder sciences and engineering categories

• Chemistry
• Materials Science
• Physics and Astronomy
• Engineering and Technology
• Mathematics
• Computer Science
• Civil Engineering
• Chemical Engineering

Agriculture and biomedical categories

• Agriculture and Biological Sciences
• Biochemistry, Genetics and Molecular Biology
• Immunology and Microbiology
• Pharmacology
• Neuroscience
• Medicine
• Psychology

Interdisciplinary and non-science categories

• Environmental Science
• Earth and Planetary Science
• Energy and Power
• Decision Science
• Economics, Econometrics and Finance
• Social Sciences
• Business, Management and Accounting
• Arts and Humanities
Examination of outliers, extrema and cataloging errors

• Environmental Science: Ecotoxicology and Environmental Safety; Organic
    Geochemistry; Corporate Environmental Strategy
• Medicine: Journal of Biomolecular NMR; Journal of X-Ray Science and
    Technology
• Medicine: Colloidal and Polymer Science; Annales Henri Poincare
• Medicine: French language Medical & Psychology Journals
• Mathematics: Bulletin of Mathematical Biology; Journal of Medical
    Ultrasonics
Conclusions

•   Reasonable mapping results
•   Full-text only (no citations, metadata) gives good results
•   Scalable to significant size
•   Open Source tools supported a complex research process and
      were easy to modify to deal with scalability issues
Future Work

• Proper precision and recall evaluation using same corpus
• Validate with the 20 Newsgroups collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive
    discovery interface & evaluate
       – Index journal 'documents' and journal articles
       – SV on all
       – Distance matrix only on journals
       – Do MDS
       – Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
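
The article-projection idea in the list above can be sketched as follows. This is one possible reading, not the author's method: it assumes Euclidean distances between journal vectors, in which case classical MDS coincides with PCA and the leading eigenvectors can be reused to map any N-d vector to 2-D. The function names and the random stand-in data are hypothetical.

```python
import numpy as np

def fit_journal_space(J, k=2):
    """Find the top-k principal axes of the journal vectors.

    For Euclidean distances, these axes give the same layout as classical MDS.
    """
    mean = J.mean(axis=0)
    _, _, Vt = np.linalg.svd(J - mean, full_matrices=False)
    return mean, Vt[:k].T          # axes has shape (dim, k)

def project(X, mean, axes):
    """Map N-d vectors into the k-d journal space."""
    return (X - mean) @ axes

# Stand-in data: 30 "journals", plus 5 "articles" lying near the first 5 journals
rng = np.random.default_rng(7)
journals = rng.normal(size=(30, 512))
articles = journals[:5] + 0.01 * rng.normal(size=(5, 512))

mean, axes = fit_journal_space(journals)
journal_xy = project(journals, mean, axes)
article_xy = project(articles, mean, axes)   # no article x article matrix needed
```

The point of this design is that only the journal-level decomposition is computed; each article is then placed with a single matrix product.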
Acknowledgements

• Collaborators: Michel Dumontier, Alison Callahan @ Carleton
• Support: Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
Demo

• Link to project demo page
License




Creative Commons Attribution-Noncommercial-No Derivative Works 2.

Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale Article Digital Library

  • 1. Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale Article Digital Library Glen Newton glen.newton@gmail.com Biology Dept, Carleton University http://zzzoot.blogspot.com/ Code4Lib-North Queen's University, Kingston, Ontario Friday May 7 2010 Based on VLDL2009 Workshop Presentation at ECDL2009
  • 2. Outline • Maps of Science • Broad Research Interests • Research Goals • Process • Scalability issues • Open Source Tools • Environment • Results • Conclusions • Future Work
  • 3. From Bollen et al 2009 PLOS1
  • 4. From Leydesdorff From Leydesdorff & Rafols 2006 & Rafols 2006
  • 5. From Leydesdorff & Rafols 2006
  • 6. Broad Research Interests • Search results visualization & refinement • Domain-specific discovery, with a particular interest in genomics and drug discovery • Improved discovery in STM domains through results visualization and contextualization, browse/explore/refine • Use of Open Source tools in complex research problem spaces
  • 7. Research Goals • Use Open Source tools to support large scale semantic text analysis and visualization • Find way to extract journal (& article) semantic vector space (semantics much better than keyword or tf-idf -based representations natural language) • Latent Semantic Analysis (LSA) works for small/medium sized corpora, does not scale to large scale of items and/or terms • New alternative: Semantic Vectors (SV): uses random vectors & avoids expensive singular value decomposition (SVD) • Can SV scale & generate sensible semantic vector space of journals on corpus of this size? • Can the visualization produced be useful for results query visualization, refinement, discovery?
  • 8. Corpus • Licensed journal articles from STM publishers: Elsevier, Springer, etc • ~4100 journal titles, classified into 23 categories (by publishers) • ~8.4m journal articles • Selection of articles/journals: – Only those with authors, abstract (no notices, obituaries, etc) – Only English language articles – Only journals with >50 articles in corpus – Resulting corpus: 5,733,721 articles from 2231 journals – Categories overlapping: 1.53 categories per journal
  • 9. Corpus: journals per category • Agriculture & Biological Sciences: 358 • Arts and Humanities: 70 • Biochemistry, Genetics and Molecular Biology: 240 • Business, Management and Accounting: 106 • Chemical Engineering: 126 • Chemistry: 226 • Civil Engineering: 64 • Computer Science: 218 • Decision Science: 50 • Earth and Planetary Science: 146 • Economics, Econometrics and Finance: 112
  • 10. Journals per category • Energy and Power: 73 • Engineering and Technology: 328 • Environmental Science: 138 • Immunology and Microbiology: 104 • Materials Science: 160 • Mathematics: 205 • Medicine: 671 • Neuroscience: 103 • Pharmacology, Toxicology and Pharmaceutics: 73 • Physics and Astronomy: 210 • Psychology: 126 • Social Science: 222
  • 11. Process • Index full-text (only) with Lucene 2.4, aggressive stopword list, Porter stemming using LuSql tool • Build Semantic Vectors (v1.18, parallelized) index from Lucene index, with 512 semantic dimensions • Find item x item distance matrix from SV index of 512-dimensional vectors • Using R, use multidimensional scaling (MDS) to reduce from 512-D to 2-D
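The last two steps of the process above (distance matrix, then MDS to 2-D) can be sketched as follows. The talk did this step in R; this NumPy version is only an illustration. The use of cosine distance, the classical (Torgerson) MDS variant, and the random stand-in vectors are assumptions, since the slides do not say which distance or MDS variant was used.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms
    D = 1.0 - unit @ unit.T
    np.fill_diagonal(D, 0.0)
    return np.clip(D, 0.0, None)  # guard against tiny negative values from rounding

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed an n x n distance matrix in k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 512))   # stand-in for 20 journal vectors with 512 dimensions
D = cosine_distance_matrix(X)
coords = classical_mds(D, k=2)   # one (x, y) point per journal
```

Each row of `coords` is the 2-D position of one item and can be handed directly to a plotting tool.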
  • 12. Scalability Issues • #items, #unique terms – #unique terms: SV handles these very well – #items: SV handles these fairly well – #items: impacts size of distance matrix (#items x #items) – R cannot handle a huge article distance matrix in MDS (i.e. millions of articles vs. thousands of journals) • Instead of using articles for items, use journals for items • Make a single large full-text document from the concatenation of all articles of a particular journal & index these
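A quick back-of-the-envelope calculation shows why the #items x #items distance matrix forces the switch from articles to journals. The item counts come from the Corpus slide; the 8 bytes per value and the dense, full-matrix storage are assumptions (a lower-triangle representation would need about half as much).

```python
def dense_matrix_bytes(n_items, bytes_per_value=8):
    """Memory for a dense n x n matrix of 8-byte floats (assumed storage format)."""
    return n_items * n_items * bytes_per_value

articles = 5_733_721   # articles in the filtered corpus
journals = 2_231       # journals in the filtered corpus

article_bytes = dense_matrix_bytes(articles)
journal_bytes = dense_matrix_bytes(journals)

print(f"articles: {article_bytes / 1e12:.1f} TB")  # articles: 263.0 TB
print(f"journals: {journal_bytes / 1e6:.1f} MB")   # journals: 39.8 MB
```

Under these assumptions the article-level matrix is far beyond the 32GB of RAM on the server described in the Environment slide, while the journal-level matrix fits easily.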
  • 13. Open Source Tools • Lucene • LuSql (High performance Lucene index building tool) • Semantic Vectors • R • Processing • Linux
  • 14. Environment • Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM, attached to a Dell EMC AX150 storage array via a SilkWorm 200E Series 16-Port Capable 4Gb Fabric Switch. • Operating system: Linux openSUSE 10.2 (64-bit x86-64), kernel 2.6.18.8-0.10-default #1 SMP • Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit Server VM (build 10.0-b23, mixed mode). • Processing 1.0 (processing.org)
  • 15. Results: Scalability • Corpus: ~600GB full-text • Lucene index: 43GB – LuSql: 13 hours 51 minutes to produce • SV index: 58 minutes, 885 MB, 21.6m terms – Distance matrix: 6 minutes
  • 16. Results: Visualization • Using Processing environment, built simple validation/visualization tool
  • 17.
  • 18.
  • 45. Examination of outliers, extrema and cataloging errors
  • 46. Ecotoxicology and Environmental Safety; Organic Geochemistry; Corporate Environmental Strategy; Environmental Science
  • 47. Journal of Biomolecular NMR; Journal of X-Ray Science and Technology; Medicine; Medicine
  • 48. Colloidal and Polymer Science; Annales Henri Poincare; Medicine; Medicine
  • 49. Medicine; Medicine; French language Medical & Psychology Journals
  • 50. Bulletin of Mathematical Biology; Journal of Medical Ultrasonics; Mathematics
  • 51. Conclusions • Reasonable mapping results • Full-text only (no citations, metadata) gives good results • Scalable to significant size • Open Source tools supported a complex research process and were easy to modify to deal with scalability issues
  • 52. Future Work • Proper precision and recall evaluation using same corpus • Validate with NetNews-20 collection for P & R • Evaluate non-metric MDS • Project articles onto semantic journal space & build interactive discovery interface & evaluate – Index journal 'documents' and journal articles – SV on all – Distance matrix only on journals – Do MDS – Use eigenvectors to transform N-d article vector to 2-D • Explore 3-D interface (MDS N-d → 3D)
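The planned "use eigenvectors to transform N-d article vector to 2-D" step could look roughly like the sketch below. It fits the two leading eigenvectors on the journal vectors only and then applies the same linear map to article vectors. This is a PCA-style projection, which matches classical MDS only when the distances are Euclidean; that equivalence, the function names, and the random stand-in data are assumptions, not the author's stated method.

```python
import numpy as np

def fit_projection(journal_vectors, k=2):
    """Find the k leading eigenvectors of the journal vectors' covariance matrix."""
    mean = journal_vectors.mean(axis=0)
    centered = journal_vectors - mean
    cov = centered.T @ centered / (len(journal_vectors) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    return mean, eigvecs[:, order]

def project(vectors, mean, axes):
    """Map N-d vectors into the k-d space defined by the journal eigenvectors."""
    return (vectors - mean) @ axes

rng = np.random.default_rng(7)
journals = rng.normal(size=(30, 512))     # stand-in journal vectors
articles = rng.normal(size=(1000, 512))   # stand-in article vectors

mean, axes = fit_projection(journals, k=2)
journal_xy = project(journals, mean, axes)   # journal layout
article_xy = project(articles, mean, axes)   # articles placed in the same 2-D space
```

The appeal of this design is that the expensive step depends only on the number of journals and dimensions; each article then costs a single vector-matrix product, so no article-level distance matrix is needed.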
  • 53. Acknowledgements • Collaborators: Michel Dumontier, Alison Callahan @ Carleton • Support: Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
  • 54. Demo • Link to project demo page