Taxonomy and Corpus Assessment: Using Visualization. Kevin W. Boyack, DHUG Meeting, February 9, 2012
Using visualization to show distributions of taxonomic data to give context and show trends in your data. Presented by Kevin W. Boyack of SciTech Strategies, Inc. at the 2012 Data Harmony User Group meeting on February 9, 2012 at the Access Innovations, Inc. offices.

  • This list is not complete. Choices and tradeoffs are oversimplified here, but are representative. Comments are obvious in most cases, but often ignored.
  • Taxonomy and Corpus Assessment: Using Visualization

    1. Taxonomy and Corpus Assessment: Using Visualization. Kevin W. Boyack, DHUG Meeting, February 9, 2012
    2. SciTech Strategies, Inc.
         • Richard Klavans, Henry Small, Kevin Boyack
         • Decades of experience in
             • Citation analysis
             • Science mapping
             • Metrics
         • People associate us with “visualization”
         • Current interests
             • Structure and dynamics of STI (science, technology, innovation)
             • Impacts of funding on structure and dynamics
             • Emergence, evolution
             • How does access to full text change our view of, or ability to model, the above?
    3. Our maps: SciVal Spotlight®, UCSD Map of Science
    4. Recurring questions
         • Does our coverage match our mission, vision, or charter?
         • Is our thesaurus / taxonomy up to date / sufficient?
         • Where is the industry headed? Does it vary by sector? Can we forecast our own direction?
         • What are the trends: are topics emerging / dying?
         • Can we use our own data to answer these questions?
         • Can we become smarter about our data and potential markets by using our collection in new ways?
         • Can we enhance our data?
    5. Visualization
         • Can visualization play a part in the answers?
         • Yes, but just what is visualization?
         • It’s not just the picture!
         • It’s the entire process that leads to the picture
             • Define the question (content or context)
             • Data selection
             • Data cleaning and processing
             • Analysis
             • Then, you can create the visual
    6. Some questions with visual answers
         • From a thesaurus / taxonomy perspective
             • What terms are too broadly defined?
             • How do actual topical relationships differ from the thesaurus structure?
             • What terms could / should be added?
         • From a vendor / publisher perspective
             • Which topical areas form our core? periphery?
             • Where is the coverage dense? thin?
             • Which topical areas are most active? least active?
             • Which topical areas seem to be emerging? declining?
             • Which topical areas are interrelated? isolated?
             • What are the overlaps between journals / segments?
             • Where are the potential expansion points?
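
One hedged way to approach the first thesaurus question (which terms are too broadly defined) is to measure the share of indexed documents each term is assigned to and flag the terms above a cutoff. The sketch below is an illustration only: the `doc_terms` mapping, the function names, and the 0.6 cutoff are assumptions, not something taken from the slides.

```python
from collections import Counter


def term_document_share(doc_terms):
    """Return {term: fraction of documents indexed with that term}."""
    n_docs = len(doc_terms)
    counts = Counter()
    for terms in doc_terms.values():
        counts.update(set(terms))  # count each term once per document
    return {term: count / n_docs for term, count in counts.items()}


def broad_terms(doc_terms, cutoff=0.6):
    """Terms assigned to at least `cutoff` of all documents (hypothetical rule)."""
    shares = term_document_share(doc_terms)
    return sorted(term for term, share in shares.items() if share >= cutoff)


# Hypothetical toy index: document ID -> assigned thesaurus terms.
doc_terms = {
    "d1": ["circuits", "power systems"],
    "d2": ["circuits", "antennas"],
    "d3": ["circuits", "signal processing"],
    "d4": ["antennas", "signal processing"],
}

print(broad_terms(doc_terms))
```

A term that covers most of a collection discriminates poorly between documents, so it is a candidate for splitting into narrower terms. The right cutoff depends on the collection and would need to be chosen by inspection.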
    7. Content vs. context
         • Content
             • Requires only your own data
             • Focused
             • Computationally tractable
             • Good if you don’t need context
         • Context
             • Requires more data than just your own
             • Broad
             • More computationally intensive
             • but … you can see results in a broader (competitive) context
    8. Content examples
    9. Thesaurus structure / balance
    10. Thesaurus structure / balance
    11. Term relationship structure
    12. Society overlay on term structure
    13. Partitioning and overlaps
    14. Partitioning and overlaps
    15. Partitioning and overlaps
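
Overlap between partitions (journals, societies, or segments) can be quantified in several ways, and the slides do not say which measure was used. As a minimal sketch, assuming each segment is represented by the set of thesaurus terms assigned to its documents, the Jaccard index gives a simple pairwise overlap score. The segment names and term sets below are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(segments):
    """Return {(segment_x, segment_y): Jaccard overlap} for every pair."""
    return {
        (x, y): jaccard(segments[x], segments[y])
        for x, y in combinations(sorted(segments), 2)
    }


# Hypothetical segments, each described by the terms used in its documents.
segments = {
    "Journal A": {"antennas", "radar", "signal processing"},
    "Journal B": {"signal processing", "machine learning", "radar"},
    "Journal C": {"power systems", "smart grid"},
}

for pair, score in pairwise_overlap(segments).items():
    print(pair, round(score, 2))
```

A score near 1 means two segments cover almost the same terms, while a score of 0 means they share none. The same function works on sets of document IDs or cited references if those describe the segments better.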
    16. Context examples
         • Requires more data than just your own
             • Adding other focused datasets
             • Placing your data in the context of a global dataset (e.g. Scopus)
         • We have done both
         • A micro-structural model of a global database lends itself to analysis at the micro-level
             • Micro-level indicators and trends
    17. Term expansion (IEEE, USPTO, MeSH)
    18. Global micro-model (Scopus)
         • 2010 file year
         • 1.8 M documents
         • 115k clusters
         • Cluster histories
         • Cluster metrics
         • Cluster memberships
    19. PubMed overlay
    20. Cluster histories (threads)
    21. Map of threads containing IEEE literature
    22. IEEE Society overlays
    23. IEEE Society overlays
    24. IEEE Society overlays
    25. IEEE Society overlays
    26. Science mapping
         • 30-40 year tradition of science mapping
             • Well-established methodologies
             • Current computing power and data availability enable large-scale mapping and analysis
         • Science maps can be, and have been, created using
             • Articles, Journals, Authors, Terms
         • A map can be thought of as a visual representation of a classification system, provided …
             • The classification system DOES contain a relationship structure
         • Maps are used for communication, strategy, planning, evaluation …
    27. Visualization choices (each choice with its trade-offs and comments)
         • Data source
             • Trade-offs: Free vs. costly; single vs. multiple sources; content vs. context
             • Comments: Citation data isn’t free; de-duplication isn’t trivial; coverage comes at a cost
         • Unit of analysis
             • Trade-offs: Breadth vs. detail
             • Comments: Base on the research question; don’t base on the data available
         • Sample size
             • Trade-offs: Specialty vs. all of science; content vs. context; single year vs. multiple years
             • Comments: Mapping ALL is costly; specialty maps lack context; stability or instability?
         • Similarity approach
             • Trade-offs: Simple vs. complex; citation vs. text vs. hybrid; threshold vs. accuracy
             • Comments: Computational cost: index < vector; hybrid is costly, but likely best; how much is really needed?
         • Partitioning / Layout
             • Trade-offs: Simple vs. complex; accuracy vs. usability
             • Comments: Simple is often size limited; is intuition satisfied? Are distributions reasonable? Useful levels of aggregation?
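
The "Similarity approach" choice contrasts citation-based and text-based measures. The slides do not name the specific measures used, so the sketch below picks one common example of each as an assumption: bibliographic coupling (the number of shared cited references) for the citation side, and cosine similarity over word-count vectors for the text side. The two toy documents and their field names are hypothetical.

```python
import math
from collections import Counter


def bibliographic_coupling(refs_a, refs_b):
    """Citation-based: number of cited references two documents share."""
    return len(set(refs_a) & set(refs_b))


def cosine_similarity(text_a, text_b):
    """Text-based: cosine of the angle between word-count vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents: cited reference IDs plus a short text field.
doc1 = {"refs": ["r1", "r2", "r3"], "text": "science mapping of citation data"}
doc2 = {"refs": ["r2", "r3", "r4"], "text": "mapping science with text data"}

print(bibliographic_coupling(doc1["refs"], doc2["refs"]))
print(round(cosine_similarity(doc1["text"], doc2["text"]), 2))
```

The coupling count is a set intersection on an index of references, while the cosine requires building and normalizing a vector per document, which is one way to read the "index < vector" cost comment. A hybrid measure would combine the two scores, and how to weight them is a further choice the slides leave open.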
    28. Visualization choices
         • Are often made based on what is available
             • Data, algorithms, expertise
         • When they should be made based on
             • The research question or application
             • Balancing of the applicable trade-offs
    29. Summary
         • Term and document spaces can be mapped effectively
         • The mapped space can be used to show distributions and trends that give answers to questions regarding
             • Context
             • Distributions and overlaps
             • Trends
             • Etc.
         • Historically, this has all been done from metadata
         • You all have full text: a gold mine of information
         • How will you use it?
    30. Thank you!
