Shenghui Wang
Rob Koopman
Exploring a world of
networked information
built from free-text
metadata
OCLC Research EMEA
ELAG2015
What would you do if you were
interested in a topic?
Difficult to answer these questions:
• What are the different aspects of this topic?
• Are there related aspects missing in my search terms?
• Who are the most prominent authors about this topic?
• Which journals publish most about this topic?
• How have others — e.g. librarians — described and classified
this topic?
Demo
• http://thoth.pica.nl/relate?input=opac
How do we do this?
• OFFLINE: generate a semantic representation
for each entity
• ONLINE: find the most related entities and
display them using multidimensional scaling
Build semantic representation
• Basic assumptions
– Each entity can be represented by its context
– Entities which share more context are more likely
to be related
• Context is the textual environment where an
entity occurs
• The effects of state prekindergarten programs on young
children’s school readiness in five states
• [author:jung kwanghee]
• [subject:readiness for school]
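The example record above can be turned into context counts roughly as follows. This is a minimal sketch, not the authors' pipeline: the record layout, the toy stop list, and the helper names `context_terms` and `build_cooccurrence` are assumptions made for illustration, and only authors and subjects are treated as entities here.

```python
from collections import Counter, defaultdict

# Hypothetical record layout; the field names are assumptions for illustration.
records = [
    {
        "title": "The effects of state prekindergarten programs on young "
                 "children's school readiness in five states",
        "authors": ["jung kwanghee"],
        "subjects": ["readiness for school"],
    },
]

STOPWORDS = {"the", "of", "on", "in", "a", "an", "and"}  # toy stop list


def context_terms(title):
    """Lower-case the title and keep the words that are not stop words."""
    words = (w.strip(".,:;!?").lower() for w in title.split())
    return [w for w in words if w and w not in STOPWORDS]


def build_cooccurrence(records):
    """Map each entity label to a Counter of the context terms it occurs with."""
    cooc = defaultdict(Counter)
    for rec in records:
        terms = context_terms(rec["title"])
        entities = [f"[author:{a}]" for a in rec["authors"]]
        entities += [f"[subject:{s}]" for s in rec["subjects"]]
        for entity in entities:
            cooc[entity].update(terms)
    return cooc


cooc = build_cooccurrence(records)
print(cooc["[author:jung kwanghee]"].most_common(3))
```

Each entity ends up with one sparse row of counts over context terms, which is the kind of row the co-occurrence matrix C on the later slide would hold.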
Dataset
● ArticleFirst, 65 million articles
● Selected 4 million entities (topical terms,
authors, ISSNs, Dewey decimal codes)
● Represented by 1 million topical terms
But a matrix of 4M x 1M is too big to process
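A rough size estimate shows why. The cell count follows directly from the figures on the slide; the 4 bytes per cell is an assumption chosen only to give an order of magnitude for a dense matrix.

```latex
\begin{aligned}
\text{cells} &= (4 \times 10^{6}) \times (1 \times 10^{6}) = 4 \times 10^{12} \\
\text{dense size at 4 bytes per cell} &= 4 \times 10^{12} \times 4\ \text{bytes}
  = 1.6 \times 10^{13}\ \text{bytes} \approx 16\ \text{TB}
\end{aligned}
```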
Dimension reduction based on Random Projection
C: a co-occurrence matrix
R: a random matrix of +/-1
C’: approximation of C
after random projection
-- Semantic matrix
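In matrix terms the slide describes C' = C R, where C has one row per entity and one column per context term, and R has one row per context term and k columns of +1/-1 entries. The sketch below computes C' row by row from sparse counts, so the dense C is never built. It is an illustration under assumptions: the toy counts, the target dimension k = 256, and the function names are invented here and are not taken from the talk.

```python
import random


def random_sign_vectors(terms, k, seed=0):
    """Give every context term a fixed random vector of +1/-1 entries (a row of R)."""
    rng = random.Random(seed)
    return {t: [rng.choice((-1, 1)) for _ in range(k)] for t in terms}


def project(cooc, R, k):
    """Compute C' = C R row by row, without building the dense matrix C."""
    projected = {}
    for entity, counts in cooc.items():
        vec = [0.0] * k
        for term, count in counts.items():
            row = R[term]
            for j in range(k):
                vec[j] += count * row[j]
        projected[entity] = vec
    return projected


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy sparse co-occurrence counts (made-up numbers, not from ArticleFirst).
cooc = {
    "entity_a": {"school": 3, "children": 2, "readiness": 1},
    "entity_b": {"school": 2, "children": 3},
    "entity_c": {"galaxy": 4, "star": 2},
}
terms = sorted({t for counts in cooc.values() for t in counts})
k = 256  # assumed target dimension
R = random_sign_vectors(terms, k)
semantic = project(cooc, R, k)

print(cosine(semantic["entity_a"], semantic["entity_b"]))
print(cosine(semantic["entity_a"], semantic["entity_c"]))
```

Entities that share context terms stay close after projection, while entities with disjoint contexts stay near-orthogonal. The usual 1/sqrt(k) scaling is omitted because cosine similarity is unaffected by it.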
Online interface
• Find mutual nearest neighbors
• Use multidimensional scaling to display
Nearest neighbors
Mutual nearest neighbors
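The difference between the two views can be sketched as follows, assuming one common reading of "mutual": B is a mutual nearest neighbor of A if B is among A's n nearest neighbors and A is among B's. The vectors, the function names, and the use of cosine similarity are assumptions for illustration; the talk does not spell out the exact definition used in the interface.

```python
def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(vectors, n):
    """Return, for each entity, the set of its n most cosine-similar other entities."""
    result = {}
    for name, vec in vectors.items():
        ranked = sorted(
            (other for other in vectors if other != name),
            key=lambda other: cosine(vec, vectors[other]),
            reverse=True,
        )
        result[name] = set(ranked[:n])
    return result


def mutual_nearest_neighbors(vectors, query, n):
    """Keep only those neighbors of `query` that also list `query` as a neighbor."""
    nn = nearest_neighbors(vectors, n)
    return {other for other in nn[query] if query in nn[other]}


# Toy 2-dimensional semantic vectors (made-up values).
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
    "e": [0.8, 0.6],
}

print(nearest_neighbors(vectors, 1)["e"])
print(mutual_nearest_neighbors(vectors, "e", 1))
print(mutual_nearest_neighbors(vectors, "a", 1))
```

In this toy data the nearest neighbor of "e" is "b", but the nearest neighbor of "b" is "a", so "e" has no mutual nearest neighbor at n = 1. The mutual filter drops such one-sided links. This brute-force version compares every pair and is only meant for small examples.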
Possible applications
• Explorative interface
• Context based search:
– brain
• Journal finder
– Arctic ice journals
– http://brain.oxfordjournals.org/
• Author name disambiguation
– pre kindergarten
Context matters!
• What does “young” mean in
- ArticleFirst
- WorldCat
- Astrophysics
- Art
Ariadne
(demo) http://thoth.pica.nl/relate
• An extremely fast way of navigating large-scale
heterogeneous entities
• Generalisable to different datasets
– Full WorldCat
– Small but highly curated astrophysics dataset
• Supports explorative information retrieval and
entity disambiguation
References
• Koopman, Rob, and Shenghui Wang. 2014. “Where Should I Publish? Detecting
Journal Similarity Based on What Has Been Published There.” In Proceedings of
Digital Libraries 2014, 483–484. London, United Kingdom. Association for
Computing Machinery.
• Koopman, Rob, Shenghui Wang, Andrea Scharnhorst, and Gwenn Englebienne.
2015. “Ariadne’s Thread — Interactive Navigation in a World of Networked
Information”. In CHI '15 Extended Abstracts on Human Factors in Computing
Systems. ACM, Seoul, South Korea.
• Koopman, Rob, Shenghui Wang and Andrea Scharnhorst. 2015. “Contextualization
of topics - browsing through terms, authors, journals and cluster allocations”. In
Proceedings of 15th International Conference on Scientometrics & Informetrics.
Istanbul, Turkey.
Explore. Share. Magnify.
Thank you
Shenghui Wang
Rob Koopman
OCLC Research EMEA
shenghui.wang@oclc.org
rob.koopman@oclc.org

Editor's Notes

  • #7 Opac -> journal -> author -> [author:medeiros norm] -> worldcat Ambiguous names: [author:balas janet l] [author:balas j l]
  • #16 Journal finder Name disam