Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

A detailed look at the graphic representation of text and term mining data. Originally presented by Marjorie M.K. Hlava and Dr. Jay Ven Eman at the 2012 International Information Conference on Search, Data Mining and Visualization in Nice, France.

Published in: Education, Technology

  • Access Innovations and its software brand Data Harmony are known for high-caliber data: clean, well formed, and accurately semantically enriched. They updated the IEEE thesaurus in 2005 and built a rule base for indexing at the same time. At the launch of the project, applying the terms to IEEE content was 90% accurate (that is, 90% of the suggested terms were ones that well-trained indexers would choose from a controlled vocabulary) and 80% accurate on the more difficult proceedings data. The rule base has improved since then, and the IEEE production team now only needs to spot-check about 10% of the documents to ensure that a high standard of indexing is maintained. This has allowed IEEE to process many more documents with the same team and has made the process more enjoyable. Because the Data Harmony software removes many of the clerical aspects of indexing, the indexers have time to think about the content, the thesaurus terms, what should be added, and what other information can be collected to keep enriching the files. The accuracy is high enough that we indexed the entire contents of the Xplore database, back to the earliest records, in a single overnight process. Then, to explore the edges of the science, we also indexed the 1.2 million records using the Medical Subject Headings and the Defense Technical Information Center thesauri, with similar accuracy results.
  • Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

    1. TEXT MINING, TERM MINING, AND VISUALIZATION - IMPROVING THE IMPACT OF SCHOLARLY PUBLISHING • Monday 16 April 2012, Nice, France • Marjorie M.K. Hlava, President • Jay Ven Eman, CEO • Access Innovations, Inc. • mhlava@accessinn.com • J_ven_eman@accessinn.com
    2. What we will cover today • Term and text mining • The basics of visualization • Case studies • Using subject terms as metrics • Applications • Visualizing the results
    3. Definitions • Term mining: using controlled vocabulary tags in text to find patterns and directions • Text mining: a systematic, algorithmic comparison method for finding patterns in text • Term and text mining − Many similarities − Can be complementary; not mutually exclusive
    4. Term mining • Precise • Meaningful semantic relationships; contextual • Replicable; repeatable; consistent • Vetted; controlled • Based on a controlled vocabulary • Trends; gaps; relationship analysis; visualizations • Less data processing load
    5. Text mining • Algorithmic; formulaic • Neural nets, statistical, latent semantic, co-occurrence • Serendipitous relationships • Sentiment; hot topics; trends • False drops; noise • Misleading semantic relationships • Heavy processing load
    6. Why take a visual look? • Humans can process information 17 times faster in visual presentations • Data can now be analyzed, manipulated, and presented as visual displays • To see the trends effectively, we need to put the data into rich, graphable formats
    7. Visualization of data • Needs − Measurement − Metrics − Numbers • Shows − Adjacency − Relationships − Trends − Co-occurrence − Conceptual distance • Is richer with − Linking − Semantic enrichment − Classification • Supports − Forecasting − Trend analysis − Segmentation − Distribution
    8. [Slide footer: Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony] Man’s attention to visual display to convey knowledge is ancient
    9. The art in maps is a longstanding tradition
    10. Superimposing data is now common • A mash-up example: Traffic Injury Map • UK Data Archive • US National Highway Safety Administration • Google Maps base • Accident categories include children, automobile, bicycle, etc. • Data: time, place, type • Source: JISC TechWatch: Data Mash-ups, September 2010
    11. Mash-up of bird flight migrations and weather patterns: http://www.youtube.com/watch?v=uPff1t4pXiI&feature=youtu.be
    12. http://www.youtube.com/watch?v=nokQBjk1s_8&feature=player_embedded
    13. How does it work? • Develop a controlled vocabulary » Prefer one with hierarchy • Apply to full text » Or to the “heads” • Decide on data points to convey information • Divide the XML into graphable sections
    14. Start with data, like this XML file
    15. Index or tag using subject terms from a thesaurus or taxonomy • Date, category, taxonomy term, frequency
    16. Many views of one set of data
    17. Load to a visualization program, like Prefuse
    18. Or Pajek
    19. [No slide text]
    20. National Information Center for Educational Media • Albuquerque’s own » Sandia developed VxInsight » Access Innovations = NICEM • Same data, several views • Primary and secondary education in the US • Shows the US Valley of Science • Little science taught in elementary years
    21. [No slide text]
    22. [No slide text]
    23. Using visualization to show • From a society / publisher perspective » Identify core, boundary, and cross-border » Provides indicators − Activity − Growth − Relatedness − Centrality » Locates journal domains • From a thesaurus perspective » Identifies terms that are too broadly defined » Potential improvements in thesaurus structure using topic structures
    24. Case study: Mapping IEEE thesaurus space • We are interested in an expanded map that includes adjacencies to the IEEE data » Expanded term set shows adjacent white space; opportunities for expansion • Overlaps and edges of the science » We need comparison data • Learn the directions in the field » Low occurrence rate in IEEE documents? » Linkage to terms in IEEE documents? • Where do we find these terms? How can we add them?
    25. The process • Built a rule base to auto-index IEEE content » “90% accuracy out of the box on journal data”* » “80% out of the box on proceedings data”* • The overlapping data sets » Auto-indexed 1.2 million Xplore records » Auto-indexed 10 years of US patent data » Auto-indexed 10 years of Medline • Term sets used » IEEE thesaurus terms rule base » Medical Subject Headings (MeSH) (and simple rule base) » Defense Technical Information Center (DTIC) Thesaurus (and simple rule base) » Similar level of detail to current IEEE thesaurus terms
    26. Defining expanded term space • 1. The data: select related corpus • IEEE: 2k terms, 1.2M documents • 14k DTIC • 475k patents • 24k MeSH • PubMed: 525k docs
    27. Defining expanded term space • 2. Identify related terms: use the IEEE Thesaurus to index the three collections • IEEE: 2k terms, 1.2M documents
    28. Defining expanded term space • 2. Identify related terms: use MeSH and DTIC to also index the three collections • IEEE: 2k terms, 1.2M documents
    29. Defining expanded term space • 3. Resulting term set: the co-indexed items from the three collections • IEEE: 2k terms, 1.2M documents
    30. Defining expanded term space • 4. Term:Term matrix: where do the articles and their indexing intersect?
    31. Visualization Strategies • Matrix • Visualization Software
    32. All data up-posted to the top level
    33. Many map options • IEEE Experience • Previous Experience
    34. IEEE Portfolio • Sensors Council • Nucl Plasma Sci Soc • Nanotech Council • Ultrason, Ferro … • Prod Saf Engng Soc • Oceanic Engng Soc • Geosci Rem Sens Soc • Council Supercond • Compon, Packag … • Instr Measur Soc • Magnetics Soc • Dielectr El Insul Soc • Electromag Compat Soc • Antennas Propag Soc • Power Electron Soc • Electron Dev Soc • Circuits & Systems • Power & Energy Soc • Industry Appl Soc • Solid St Circuits Soc • Industr Electr Soc • Microwave Theory Soc • Aerosp Electr Sys Soc • Sys Man Cyber Society • Computer Intelligence Society • Systems Council • Reliability Society • Education Society • Prof Commun Society • Computer Society • Robot Autom Soc • Social Impl Techn • Council Electr Design Auto • Signal Proc Soc • Intell Transp Sys Soc • Commun Soc • Info Theory Soc • Vehicular Techn Soc • Consumer Electr Soc • Broadcast Techn Soc • Photonics Soc • Eng Med Biol Sci
    35. Radial Visualization
    36. Publication Strategy • JASIST reference
    37. Conference Strategy
    38. Use a Thesaurus to Label Maps • Map labels: Turbines Measurement Circuits Amplifiers Displays Games Toys Flow Cooling Heating Components Gearing Brakes Dynamics Vehicles, Parts Disk Optics Photochem Molding Conductors Coatings Lasers Lamps Motors Plants, Micro-orgs Control Boats Oilfield Services Med Instruments Welding Conveyers Rubber Acyclic Comp Footwear Lubricants Radiology Catalysis Macromolecules Sprayers Electrochem Fitness Hygiene Cleaning Printing Paper IC Engines Magn/Elect Magnets Textiles Layers Medical Devices Clocks Pipes Valves Blasting Cables Appliances Outerwear Exhaust Pumps Packaging Aircraft Semiconductors • Agriculture Food Consumer Products Construction Automotive + Defense Industrial Products Leisure Energy Telecom Computer HW/SW Electronics Chemicals Pharma Metals Health Care
    39. Questions answered • Is there a way, using our own information, to forecast our direction? • Where is the industry headed? What about by technology sector? • Does our coverage match our mission and vision? • Can we become smarter about our data and potential markets using our collection in new ways? • Are the societies publishing and talking about what their charter indicates they cover? • What are the trends: are topics emerging or cooling? • Can we use technology and our own data to explore these questions while enhancing our data?
    40. The research team • Access Innovations / Data Harmony » Founded in 1978 » Data enrichment and normalization » Suite of semantic enrichment tools • SciTechStrategies » Understanding data through visualization • IEEE Indexing & Abstracting Group
    41. We looked at visualization of data • Finding the metrics » Measurement » Numbers » Terms as indicators • Ways to show » Adjacency » Relationships » Trends » Co-occurrence » Conceptual distance • How to enrich with » Linking » Semantic enrichment » Classification • Maps supporting » Forecasting » Trend analysis » Segmentation » Distribution
    42. Effective maps require • Contextual data • Detailed data • Classification methods • At least two directions in the matrix • A little art for fun
    43. It just takes a little imagination • Thank you • Marjorie M.K. Hlava, President, mhlava@accessinn.com • Jay Ven Eman, CEO, J_ven_eman@accessinn.com • Access Innovations, 505-998-0800
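
The workflow on slides 13 to 15 (develop a controlled vocabulary, apply it to the text, then emit date, category, taxonomy term, and frequency) and the up-posting mentioned on slide 32 can be sketched roughly as follows. This is only an illustration, not the Data Harmony rule base: the tiny `THESAURUS` dictionary, the sample `RECORDS`, and the plain case-insensitive phrase matching are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus: preferred term -> top-level category.
THESAURUS = {
    "neural networks": "Computing",
    "signal processing": "Computing",
    "antennas": "Electromagnetics",
    "microwave circuits": "Electromagnetics",
}

# Hypothetical records standing in for the XML documents on slide 14.
RECORDS = [
    {"year": 2010, "text": "Neural networks for signal processing. Neural networks scale."},
    {"year": 2011, "text": "Antennas and microwave circuits; antennas for radar."},
    {"year": 2011, "text": "Signal processing with neural networks."},
]


def index_record(text):
    """Count whole-phrase, case-insensitive matches of each thesaurus term."""
    counts = Counter()
    for term in THESAURUS:
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[term] = hits
    return counts


def graphable_rows(records):
    """Aggregate to (date, category, taxonomy term, frequency) rows."""
    totals = Counter()
    for record in records:
        for term, hits in index_record(record["text"]).items():
            totals[(record["year"], THESAURUS[term], term)] += hits
    return sorted((year, cat, term, freq) for (year, cat, term), freq in totals.items())


def up_post(rows):
    """Roll term frequencies up to their top-level category, per year."""
    rolled = Counter()
    for year, category, _term, freq in rows:
        rolled[(year, category)] += freq
    return dict(rolled)


rows = graphable_rows(RECORDS)
for row in rows:
    print(row)
print(up_post(rows))
```

Each row is one plottable data point; the up-posted totals give the coarser, category-level view of the same data.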
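
Slide 30 asks where the articles and their indexing intersect. One common way to build such a term:term matrix is to count, for each pair of terms, how many documents were indexed with both. The sketch below assumes each document has already been reduced to its set of assigned terms; the sample `DOC_TERMS` is hypothetical, and `to_pajek_net` is an illustrative helper that writes the plain `*Vertices` / `*Edges` text layout read by Pajek (slide 18), as one possible hand-off to a visualization program.

```python
from collections import Counter
from itertools import combinations

# Hypothetical indexing output: the set of thesaurus terms assigned to each document.
DOC_TERMS = [
    {"antennas", "radar", "signal processing"},
    {"antennas", "radar"},
    {"neural networks", "signal processing"},
    {"radar", "signal processing"},
]


def cooccurrence(doc_terms):
    """Count, for every unordered term pair, the documents indexed with both."""
    pairs = Counter()
    for terms in doc_terms:
        for a, b in combinations(sorted(terms), 2):
            pairs[(a, b)] += 1
    return pairs


def to_matrix(pairs):
    """Expand pair counts into a symmetric term-by-term matrix (list of lists)."""
    terms = sorted({t for pair in pairs for t in pair})
    pos = {t: i for i, t in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in terms]
    for (a, b), n in pairs.items():
        matrix[pos[a]][pos[b]] = n
        matrix[pos[b]][pos[a]] = n
    return terms, matrix


def to_pajek_net(pairs):
    """Serialize the pairs as weighted, undirected edges in Pajek .net text."""
    terms = sorted({t for pair in pairs for t in pair})
    ids = {t: i + 1 for i, t in enumerate(terms)}  # Pajek vertex ids start at 1
    lines = [f"*Vertices {len(terms)}"]
    lines += [f'{ids[t]} "{t}"' for t in terms]
    lines.append("*Edges")
    lines += [f"{ids[a]} {ids[b]} {n}" for (a, b), n in sorted(pairs.items())]
    return "\n".join(lines)


pairs = cooccurrence(DOC_TERMS)
terms, matrix = to_matrix(pairs)
print(terms)
for row in matrix:
    print(row)
print(to_pajek_net(pairs))
```

Terms that never share a document with another term do not appear in this matrix; whether to keep them as isolated nodes is a design choice the slides do not address.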
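
The speaker notes describe accuracy as the share of suggested terms that a well-trained indexer would also choose, which in retrieval terms is precision. A minimal sketch of that calculation for a single document follows; the function name and the two term lists are hypothetical and are not taken from the IEEE project.

```python
def suggestion_accuracy(suggested, indexer_terms):
    """Share of machine-suggested terms that the human indexer also chose."""
    suggested = set(suggested)
    if not suggested:
        return 0.0
    return len(suggested & set(indexer_terms)) / len(suggested)


# Hypothetical example: five suggested terms, four of them confirmed by the indexer.
machine = ["antennas", "radar", "signal processing", "neural networks", "toys"]
human = ["antennas", "radar", "signal processing", "neural networks", "microwave circuits"]
print(suggestion_accuracy(machine, human))
```

Averaging this value over a spot-checked sample of documents would give a collection-level figure comparable in spirit to the percentages quoted in the notes.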
