Interactive Visualization Systems and Data Integration Methods for Supporting Discovery in Collections of Scientific Information
 


Slides from Don Pellegrino's Dissertation Defense.



    Presentation Transcript

    • Interactive Visualization Systems and Data Integration Methods for Supporting Discovery in Collections of Scientific Information
      A Thesis Submitted to the Faculty of Drexel University by Donald Anthony Pellegrino Jr. in partial fulfillment of the requirements for the degree of Doctor of Philosophy, May 2011
      visit us at: www.ischool.drexel.edu
    • Committee
      Chaomei Chen (Chair)
      Robert Allen (IST)
      Xia Lin (IST)
      Jean-Claude Bradley (Chemistry)
      Longjian Liu (Epidemiology and Biostatistics)
    • Problem
      • Technological developments enable sharing and reuse of scientific information.
      • Current indexing methods support query-based search and filtering; however, they do not support overviews and exploration.
      • Due to these limitations of existing indexing methods, it is challenging to discover records and connections that relate information in new and potentially insightful ways.
      Solution
      • New Indexing Methods
      • Instantiation of graph structures from real-world, real-scale scientific collections.
      • Interactive visual exploration of structure.
      • Quantitative and semantic guidance for exploration of the graph.
      • Demonstrate feasibility of new methods for finding novel and significant connections and records in the collections.
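
As a rough illustration of what instantiating a graph structure from a collection could look like, the sketch below links each record to its author and to the records it references. The record fields (`id`, `author`, `refs`) and the sample values are hypothetical; the slides do not describe the actual data model or tooling used in the dissertation.

```python
from collections import defaultdict

# Hypothetical records; field names and values are assumptions for illustration only.
records = [
    {"id": "r1", "author": "A", "refs": ["r2"]},
    {"id": "r2", "author": "B", "refs": []},
    {"id": "r3", "author": "A", "refs": ["r2"]},
]


def build_graph(records):
    """Return an undirected adjacency map linking records, authors, and referenced records."""
    graph = defaultdict(set)
    for rec in records:
        rec_node = ("record", rec["id"])
        author_node = ("author", rec["author"])
        # Connect the record to its author in both directions.
        graph[rec_node].add(author_node)
        graph[author_node].add(rec_node)
        # Connect the record to every record it references.
        for ref in rec["refs"]:
            ref_node = ("record", ref)
            graph[rec_node].add(ref_node)
            graph[ref_node].add(rec_node)
    return graph


graph = build_graph(records)
```

Typing each node as a `(kind, value)` tuple keeps records and authors in one graph without name collisions, which is one simple way to let exploration move between entity types.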
    • “Another key is addressing the volume of information – a veritable tsunami – and the need for tools. In short, the totality of information far exceeds the ability of any organization to effectively and completely analyze and render judgments. And there are several aspects to this issue. One is that textual information must be captured and must be retrievable. Another is that the textual information or structured data quickly outstrips the working capability of the mind to retain and thus analyze. Yet another is the necessity to integrate that unstructured text information with structured data. These issues present a critical requirement: analytical software (tools) to work on the problems of entity and relationship extraction from texts as well as the analysis of the resulting data (e.g., the discovery of trends or links that are quite simply not obvious to the human analyst) (Strickland, 2005, p.164, emphasis added).”
      Strickland, L. S. (2005). Knowledge Transfer: Information Science Shapes Intelligence in the Cold War Era. In R. V. Williams & B.-A. Lipetz (Eds.), Covert and Overt: Recollecting and Connecting Intelligence Service and Information Science (pp. 147-166). Medford, NJ: Information Today Inc.
    • 2003 Model
      1971 Model
      Søndergaard, T. F., Andersen, J., & Hjørland, B. (2003). Documents and the communication of scientific and scholarly information: Revising and updating the UNISIST model. Journal of Documentation, 59(3), 278-320.
    • Theme 1: Advancements in technology can lead to increases in the volume and/or type of artifacts that need to be discoverable.
      “Technology has a profound effect on how scientists can communicate with each other. This affects how quickly science can progress and what kinds of collaboration are possible (Bradley, Lang, Koch, & Neylon, 2011, p.426).”
      IDC predicted, “… in 2011, the amount of digital information produced in the year should equal nearly 1,800 exabytes, or 10 times that produced in 2006. The compound annual growth rate between now [2008] and 2011 is expected to be almost 60% (Gantz et al., 2008).”
      Recent Technological Advancements: Cloud Computing, Cyberinfrastructure, Big Data, eScience, Data Driven Science, Open Notebook Science – i.e., More Data
      Bradley, J.-C., Lang, A. S. I. D., Koch, S., & Neylon, C. (2011). Collaboration Using Open Notebook Science in Academia. In S. Ekins, M. A. Z. Hupcey & A. J. Williams (Eds.), Collaborative Computational Technologies for Biomedical Research (pp. 425-452): John Wiley & Sons, Inc.
      Gantz, J. F., Chute, C., Manfrediz, A., Minton, S., Reinsel, D., Schlichting, W., & Toncheva, A. (2008). The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth Through 2011: IDC.
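
The two figures in the IDC quote can be cross-checked with a little arithmetic. The sketch below assumes, purely for illustration, that growth was uniform over the five years from 2006 to 2011; the quote itself states the rate only for 2008 to 2011, so this is a consistency check rather than a reconstruction of IDC's method.

```python
def cagr(growth_factor, years):
    """Compound annual growth rate implied by a total growth factor over a number of years."""
    return growth_factor ** (1.0 / years) - 1.0


# Assumption: a tenfold increase spread evenly over 2006 -> 2011 (5 years).
rate = cagr(10.0, 5)
print(f"Implied annual growth: {rate:.1%}")  # about 58.5%, in line with "almost 60%"
```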
    • Theme 2: The introduction of new kinds of artifacts and increases in volume lead to advancements in the methods used for indexing.
      “One of the most serious problems confronting science at the present time is the difficulty in keeping abreast of all the research that is being done and in bringing the published results into some workable order. If the results of research are buried or lost for some reason or other, the research, and the money spent on it, is entirely wasted. To prevent such a loss we need adequate guides to the vast amount of scientific literature and must make intelligent and effective use of them. … It is becoming increasingly difficult for our indexes and abstract journals to keep up with the growing number of medical publications and with articles of medical importance in other scientific journals. … The aspect of the problem which is our immediate concern today and which is particularly important to the Army Medical Library is that of the role of indexes in meeting the needs of the present and of the future (Larkey, 1949).”
      Larkey, S. V. (1949). The Army Medical Library Research Project at the Welch Medical Library. Bulletin of the Medical Library Association, 37(2), 121-124.
    • Preliminary Study
      VAST Challenge 2008
    • Custom Improvise visualization developed by Chris Weaver and the NEVAC team for analysis of the wiki collection.
    • Custom Improvise visualization developed by Chris Weaver and the NEVAC team for analysis of the coast guard intercept collection.
    • Custom Improvise visualization developed by Chris Weaver and the NEVAC team for analysis of the cell phone call collection.
    • Custom Improvise visualization developed by Chris Weaver and the NEVAC team for analysis of the RFID movement collection.
    • All of the mini-challenge data collections were loaded into a single Maple worksheet. (Pellegrino, Chen, et al., 2008, Figure 1)
      Pellegrino, D., Chen, C., MacEachren, A., Mitra, P., Pan, C.-C., Robinson, A., . . . Weaver, C. (2008). North-East Visualization and Analytics Center (NEVAC) Team Entry VAST Challenge Portal: National Institute of Standards and Technology.
    • “Modeling the evacuation mini-challenge hypotheses in an associative network (Pellegrino, Chen, et al., 2008, Figure 7).”
      Pellegrino, D., Chen, C., MacEachren, A., Mitra, P., Pan, C.-C., Robinson, A., . . . Weaver, C. (2008). North-East Visualization and Analytics Center (NEVAC) Team Entry VAST Challenge Portal: National Institute of Standards and Technology.
    • Graph representation of data and hypotheses (Pellegrino, Chen, et al., 2008, Figure 8).
      Pellegrino, D., Chen, C., MacEachren, A., Mitra, P., Pan, C.-C., Robinson, A., . . . Weaver, C. (2008). North-East Visualization and Analytics Center (NEVAC) Team Entry VAST Challenge Portal: National Institute of Standards and Technology.
    • “Path from RFID 21 to RFID 62 (Pellegrino, Chen, et al., 2008, Figure 10).”
      Pellegrino, D., Chen, C., MacEachren, A., Mitra, P., Pan, C.-C., Robinson, A., . . . Weaver, C. (2008). North-East Visualization and Analytics Center (NEVAC) Team Entry VAST Challenge Portal: National Institute of Standards and Technology.
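
A path query such as the one in this figure can be sketched as a breadth-first search over an adjacency map. The graph below is a made-up toy: only the endpoint labels come from the slide, while the intermediate nodes and the edges are hypothetical. The original analysis was done in a Maple worksheet, so this is an illustration of the idea rather than the method actually used.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None if unreachable."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                if neighbor == goal:
                    # Walk the parent links back to the start, then reverse.
                    path = [goal]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None


# Hypothetical toy graph; edges and intermediate tags are invented for illustration.
toy_graph = {
    "RFID 21": ["RFID 7", "RFID 30"],
    "RFID 7": ["RFID 21", "RFID 62"],
    "RFID 30": ["RFID 21"],
    "RFID 62": ["RFID 7"],
}

path = shortest_path(toy_graph, "RFID 21", "RFID 62")
```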
    • “k-Neighbors within 4 of RFID 56 (Pellegrino, Chen, et al., 2008, Figure 11).”
      Pellegrino, D., Chen, C., MacEachren, A., Mitra, P., Pan, C.-C., Robinson, A., . . . Weaver, C. (2008). North-East Visualization and Analytics Center (NEVAC) Team Entry VAST Challenge Portal: National Institute of Standards and Technology.
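
A "k-neighbors within 4" query can be read as collecting every node at most four hops from a source. The sketch below uses a depth-limited breadth-first search on a hypothetical chain of nodes; only the label "RFID 56" is taken from the slide, and the other node names are placeholders.

```python
from collections import deque


def k_neighbors(graph, source, k):
    """Return {node: hop distance} for every node within k hops of source, excluding source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # Do not expand beyond the hop limit.
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    del dist[source]
    return dist


# Hypothetical chain: RFID 56 - n1 - n2 - n3 - n4 - n5
chain = {
    "RFID 56": ["n1"],
    "n1": ["RFID 56", "n2"],
    "n2": ["n1", "n3"],
    "n3": ["n2", "n4"],
    "n4": ["n3", "n5"],
    "n5": ["n4"],
}

nearby = k_neighbors(chain, "RFID 56", 4)
```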
    • Limitations
      • Synthetic Data
      • Only tested in one domain.
      • Significant manual effort required.
    • Scale-Up and Scale-Out
      Influenza Protein Sequence Mapping Study
    • Study Objectives
      • Real-world data.
      • New domain.
      • Reduce manual effort – create a tool.
    • MOVIE
      Temporal Analysis
    • Lessons Learned
      • Real-world data.
      • Suitable domain.
      • Prototype tool developed.
      • Method provides an overview which cannot be achieved using other tools.
      • Method provides insight into macroscopic temporal characteristics of the collection.
      • Method provides means for exploring specific records.
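
One simple way to picture a macroscopic temporal overview with drill-down to specific records is to bin records by year and then list the records in a chosen bin. The record identifiers and years below are placeholders, not influenza data, and the prototype tool shown in the defense may work quite differently.

```python
from collections import Counter

# Placeholder (record id, year) pairs; not real sequence data.
sample_records = [
    ("seq-1", 1999),
    ("seq-2", 2005),
    ("seq-3", 2005),
    ("seq-4", 2009),
    ("seq-5", 2009),
    ("seq-6", 2009),
]


def records_per_year(records):
    """Overview: count how many records fall in each year."""
    return Counter(year for _, year in records)


def records_in_year(records, year):
    """Drill-down: list the record ids for one year, in input order."""
    return [record_id for record_id, record_year in records if record_year == year]


overview = records_per_year(sample_records)
detail = records_in_year(sample_records, 2005)
```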
    • Scale-Out and Evaluate.
      Open Notebook Science Study
    • Study Objectives
      • Real-world data.
      • New domain.
      • Find a hidden ground truth – variation of the VAST evaluation model.
    • UsefulChem Experiment 262 Notebook Entry by Evan Curtin.
      Curtin, E., “Exp262,” [Online]. Available: http://usefulchem.wikispaces.com/Exp262, Retrieved 20 April 2011.
    • Inventory and model some of the core UsefulChem and Open Notebook Science data.
    • Objective
      To synthesize the precursor diamide to be used subsequently in the Pictet-Spengler reaction affording praziquantel.
      Conclusion
      After two days of reaction time, it is not clear if a Ugi product is formed. Owing to the small scale on which this reaction was carried out (total volume <175uL), and the minuscule amount of precipitate obtained, further work-up seems impractical.
      Experiment aborted.
    • Overview Graph.
    • A disconnected Khalid Mirza - Marshall Moritz cluster.
    • A disconnected Dustin Sprouse cluster.
    • A Sebastian Petrik cluster.
    • David Bulger cluster.
    • Khalid Mirza - Aneh cluster.
    • Marshall Moritz cluster.
    • James Giammarco - Jessica Colditz and David Bulger - Khalid Mirza connections group.
    • Michael Wolfle cluster.
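
Disconnected clusters like those in the preceding slides correspond to connected components of an undirected graph. The sketch below finds them with an iterative depth-first search; the node names are placeholders, and the slides do not say how the clusters in the overview graph were actually derived (they may come from the layout or a clustering method instead).

```python
def connected_components(graph):
    """Return the connected components of an undirected adjacency map, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Explore everything reachable from this unseen node.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Placeholder graph with three components, one of them an isolated node.
toy = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b"},
    "d": {"e"},
    "e": {"d"},
    "f": set(),
}

clusters = connected_components(toy)
```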
    • “We just tried this exact reaction 2 weeks ago :) http://usefulchem.wikispaces.com/Exp258 [JCB]”
    • Lessons Learned
      • Real-world data.
      • New domain.
      • Find a hidden ground truth – variation of the VAST evaluation model.
      • Extensive opportunity for future work.
      • Social component is key.
    • Systematize and Evaluate
      Pfizer Drug Discovery Study
    • Study Objectives
      • Real-world data.
      • New domain.
      • Explore use of quantitative measures to guide exploration.
    • Timeline view.
    • Coordinated views of clusters and the timeline.
    • Screenshot of in-degree view.
    • Screenshot of out-degree view.
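
In-degree and out-degree are counts of incoming and outgoing edges per node. The sketch below computes both from a directed edge list and ranks nodes by in-degree; the node labels `c1` to `c4` are hypothetical stand-ins, and nothing here reflects the actual Pfizer data or the tool's implementation.

```python
from collections import Counter

# Hypothetical directed edges (source, target); labels are placeholders.
edges = [("c1", "c2"), ("c1", "c3"), ("c2", "c3"), ("c4", "c3")]


def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree


in_degree, out_degree = degree_counts(edges)
# Rank nodes by how many edges point at them.
top_by_in_degree = in_degree.most_common(1)  # [("c3", 3)]
```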
    • Screenshot of betweenness view.
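
Betweenness scores a node by the share of shortest paths between other nodes that pass through it. The sketch below is a plain-Python version of Brandes' algorithm for an unweighted directed graph, without normalization. It is offered only to make the measure concrete; the slides do not state which implementation or normalization the study used, and the toy graph is invented.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness for an unweighted directed graph given as {node: [successors]}."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths and recording predecessors.
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s] = 1
        dist[s] = 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Invented diamond graph: two equally short routes from "a" to "d".
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
scores = betweenness(diamond)
```

In the diamond, the single a-to-d pair is split evenly between the two routes, so `b` and `c` each receive half a path's worth of credit while the endpoints receive none.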
    • Lessons Learned
      • Real-world data.
      • New domain.
      • Explore use of quantitative measures to guide exploration.
      • Indegree and outdegree can be useful for design meetings.
      • Betweenness did not appear to add value.
      • May be particularly useful for researchers who are not yet familiar with a collection.
    • “These issues present a critical requirement: analytical software (tools) to work on the problems of entity and relationship extraction from texts as well as the analysis of the resulting data (e.g., the discovery of trends or links that are quite simply not obvious to the human analyst) (Strickland, 2005, p.164, emphasis added).”
      Conclusions
      • Influenza Study yielded the identification of both macroscopic trends and specific records that were not readily identifiable using a search and filter modality.
      • Open Notebook Science Study yielded a structure which may have improved the likelihood that a critical link (Ugi reaction for Praziquantel intermediate) would be discovered.
      • Pfizer Study demonstrated the potential utility of indegree for systematic identification of key compounds.