Towards Automating Data Narratives

We propose a new area of research on automating data narratives. Data narratives are containers of information about computationally generated research findings. They have three major components: 1) A record of events that describes a new result through a workflow and/or the provenance of all the computations executed; 2) Persistent entries for the key entities involved: data, software versions, and workflows; 3) A set of narrative accounts, which are automatically generated human-consumable renderings of the record and entities and can be included in a paper. Different narrative accounts can be used for different audiences, with different content and details based on the level of interest or expertise of the reader. Data narratives can make science more transparent and reproducible, because they ensure that the text description of the computational experiment reflects with high fidelity what was actually done. Data narratives can be incorporated in papers, either in the methods section or as supplementary materials. We introduce DANA, a prototype that illustrates how to generate data narratives automatically, and describe the information it uses from the computational records. We also present a formative evaluation of our approach and discuss potential uses of automated data narratives.
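
As a rough illustration of the three components described above, the following Python sketch models a data narrative as a record of events plus entities with persistent identifiers, and renders two accounts from them. The class names, the dictionary keys, and the "overview" and "detailed" audience labels are assumptions made for this example; they are not DANA's data model. The step name and the two figshare identifiers are taken from the example in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    """A key entity (data, software version, or workflow) with a persistent identifier."""
    name: str
    kind: str        # hypothetical labels: "data", "software", "workflow"
    identifier: str  # e.g. a DOI or persistent URL


@dataclass
class Event:
    """One executed computation in the record of events."""
    step: str
    inputs: List[str]   # keys into DataNarrative.entities
    outputs: List[str]  # keys into DataNarrative.entities


@dataclass
class DataNarrative:
    record: List[Event] = field(default_factory=list)
    entities: Dict[str, Entity] = field(default_factory=dict)

    def _cite(self, key: str) -> str:
        entity = self.entities[key]
        return f"{entity.name} ({entity.identifier})"

    def account(self, audience: str) -> str:
        """Render the record as text; the two audience levels are assumptions."""
        if audience not in ("overview", "detailed"):
            raise ValueError(f"unknown audience: {audience}")
        sentences = []
        for event in self.record:
            if audience == "overview":
                sentences.append(f"{event.step} was run.")
            else:
                ins = ", ".join(self._cite(k) for k in event.inputs)
                outs = ", ".join(self._cite(k) for k in event.outputs)
                sentences.append(f"{event.step} was run on {ins}, producing {outs}.")
        return " ".join(sentences)


narrative = DataNarrative(
    record=[Event("Train Topics", ["reuters_r8"], ["topics"])],
    entities={
        "reuters_r8": Entity("the Reuters R8 dataset", "data", "10.6084/m9.figshare.776887"),
        "topics": Entity("the topics", "data", "10.6084/m9.figshare.776856"),
    },
)
overview = narrative.account("overview")
detailed = narrative.account("detailed")
print(overview)
print(detailed)
```

Because both accounts are derived from the same record, the text cannot drift away from what was executed; only the level of detail changes with the audience.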

Published in: Education

  1. TOWARDS AUTOMATING DATA NARRATIVES
     Yolanda Gil, Daniel Garijo
     Information Sciences Institute, University of Southern California
     @yolandagil, @dgarijov
     {gil,dgarijo}@isi.edu
  2. The Scientific Research Process
     • Formulate hypothesis
     • Define the experiment (data + method)
     • Find data
     • Run experiments (methods)
     • Meta-analysis of results
     • Revise hypothesis
  3. The products of scientific research
     • Process: Formulate hypothesis, Define the experiment (data + method), Find data, Run experiments (methods), Meta-analysis of results, Revise hypothesis
     • Products: Publication, Methods, Data, Software, Execution traces
  4. Reconstructing the Computations from the Text in the Paper
     Comparison of Ligand Binding Sites
     The SMAP software was used to compare the binding sites of the 749 M.tb protein structures plus 1,446 homology models (a total of 2,195 protein structures) with the 962 binding sites of 274 approved drugs, in an all-against-all manner. While the binding sites of the approved drugs were already defined by the bound ligand, the entire protein surface of each of the 2,195 M.tb protein structures was scanned in order to identify alternative binding sites. For each pairwise comparison, a P-value representing the significance of the binding site similarity was calculated.
     Source: “The Mycobacterium Tuberculosis Drugome and Its Polypharmacological Implications.” Kinnings, S. L.; Xie, L.; Fung, K. H.; Jackson, R. M.; Xie, L.; and Bourne, P. E. PLoS Computational Biology, 2011.
  5. Problem with current approaches: what the paper said vs what the software did
     Source: “The Mycobacterium Tuberculosis Drugome and Its Polypharmacological Implications.” Kinnings, S. L.; Xie, L.; Fung, K. H.; Jackson, R. M.; Xie, L.; and Bourne, P. E. PLoS Computational Biology, 2011.
     Figure label: Actual computation
  6. Problem with current approaches
     • Incomplete
       • Missing steps and intermediate data
     • Ambiguous
       • Several interpretations about how computations are done
     • Inconsistent level of detail
       • Mixing of general methods with execution details
     Diagram labels: Step1, Step ??, Step 2 ?; Step1, Step 2, Step1’, Step 2’, Implementation 1?, Implementation 2?; Step1, Step 2, Param1 = 2, File = “Input.txt”
  7. Our approach: From research outputs to text
     • Process: Formulate hypothesis, Define the experiment (data + method), Find data, Run experiments (methods), Meta-analysis of results, Revise hypothesis
     • Products: Publication, Methods, Data, Execution traces
     • Report generation
  8. Our approach: From research outputs to text
     • Process: Formulate hypothesis, Define the experiment (data + method), Find data, Run experiments (methods), Meta-analysis of results, Revise hypothesis
     • Products: Publication, Methods, Data, Execution traces
     • Report generation
     Reports must:
     • Be true to actual events
     • Enable inspection
     • Be human-understandable
     • Abstract details
  9. Data Narratives
     • Interlinked record of
       • High level workflows (methods)
       • Provenance of results (method executions)
       • Data
       • Software metadata
     • Persistent identifiers
     • Data narrative accounts
       • Alternative descriptions of a result with a different level of detail.
     Callouts: Truth to actual records; Inspectability; Human readable, levels of abstraction
  10. Data Narrative Accounts: An example
      How was the dataset used in this visualization generated?
  11. Data Narrative Accounts: An example
      • Execution view
        • Inputs, parameters and main outputs
        • “Topic modeling was run on the Reuters R8 dataset (10.6084/m9.figshare.776887), and English Words dataset (10.6084/m9.figshare.776888), with iterations set to 100, stop word size set to 3, number of topics set to 10 and batch size set to 10. The results are at 10.6084/m9.figshare.776856”
      • Data view
        • Just the data that influenced the results
        • “The topics at 10.6084/m9.figshare.776856 were found in the Reuters R8 dataset (10.6084/m9.figshare.776887) and English Words dataset (10.6084/m9.figshare.776888)”
      • Method view
        • Main steps based on their functionality
        • “Topic training was run on the input dataset. The results are product of PlotTopics, a visualization step”
  12. Data Narrative Accounts: An example
      • Dependency view
        • How the steps depend on each other
        • “First, the input data is filtered by Stop Words, followed by Small Words, Format Dataset, and Train Topics. The final results are produced by Plot Topics”
      • Implementation view
        • How the steps were implemented in the execution
        • “Train topics was implemented using Latent Dirichlet allocation”
      • Software view
        • Details on the software used to implement the steps
        • “The train topics step was generated with Online LDA open source software, written in Java. Plot topics was generated with the Termite software.”
  13. DANA: DAta NArratives
      Diagram components: Experiment Records, Provenance Repository, Experiment-specific Knowledge Base, DANA Generator, Narrative accounts, Software registry, Query patterns, Data Narrative aggregator
      Diagram arrow labels: Input, Resource request, Response, Output, Get query, Pattern result, Get pattern
      1. Identify which experiment records to describe
      2. Generation of an Experiment-specific knowledge base
      3. Creation of the Data Narrative from templates
      4. Produce narrative accounts
  14. Generation of an experiment-specific knowledge base: scientific workflows
      WINGS workflow system
      • High level workflow templates that can be elaborated through component ontologies
      http://www.wings-workflows.org/
  15. Generation of an experiment-specific knowledge base: provenance records as RDF
      See a hyperlinked description/visualization at its persistent URL: https://goo.gl/v8EPg5
      http://www.opmw.org/export/page/resource/WorkflowExecutionAccount/ACCOUNT1348628778528
      10.6084/m9.figshare.776887
  16. Generation of an experiment-specific knowledge base: Software metadata
      • Catalog of motifs [Garijo et al 2013]
        • A catalog of common domain independent workflow patterns based on the functionality of workflow steps
      • Ontosoft distributed software registry [Gil et al 2016]
        • Descriptions of hundreds of software components
        • Key metadata of software:
          • License
          • Usage
          • Authors
          • Web page
          • Code repository
          • Etc.
      [Garijo et al 2013]: Common Motifs in Scientific Workflows: An Empirical Analysis. Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013. http://purl.org/net/wf-motifs
      [Gil et al 2016]: OntoSoft: A Distributed Semantic Registry for Scientific Software. Gil, Y.; Garijo, D.; Mishra, S.; and Ratnakar, V. In Proceedings of the Twelfth IEEE Conference on eScience, Baltimore, MD, 2016. http://www.ontosoft.org/portal
  17. Generating narrative accounts
      Diagram labels: RDF, Account template
  18. Formative evaluation
      • Survey with 6 target scenarios
      • Each scenario:
        • Description of a situation where a user has to do a task
  19. Formative evaluation
      • Survey with 6 target scenarios
      • Each scenario:
        • Description of a situation where a user has to do a task
        • A workflow sketch of the analysis done
        • Six candidate narratives of that workflow sketch.
  20. Formative evaluation
      • Survey with 6 target scenarios
      • Each scenario:
        • Description of a situation where a user has to do a task
        • A workflow sketch of the analysis done
        • Six candidate narratives of that workflow sketch.
      • 12 responses from users
      • Results
        • Each narrative is considered appropriate for describing some scenario
        • Different users chose different narratives for each scenario
  21. Summary: Benefits of Data Narratives
      Features                 Data Narratives  Provenance Records  Visualizations  Articles  Electronic Notebooks
      Truth to actual records  Y                Y                   Just data       Maybe     Maybe
      Enable inspection        Y                Y                   Just data       N         Y
      Human understandable     Y                N                   Y               Y         Y
      Abstract details         Y                N                   Y               Y         N
      Part of papers           Y                N                   Y               Y         Maybe
      Persistent               Y                Maybe               N               Y         Maybe
      Different audiences      Y                N                   N               N         N
      Automatically generated  Y                Y                   Maybe           N         N
  22. Conclusions and future work
      • Data Narratives
        • Interlink data, software, workflows and provenance of a scientific experiment
        • Persistent identifiers
        • Narrative accounts
      • Future work:
        • Ease navigation through levels of detail
        • Mixing details of different narratives
        • Improve summarization of results
        • Additional evaluation of narrative usefulness
      See more: http://dgarijo.github.io/DataNarratives/
  23. TOWARDS AUTOMATING DATA NARRATIVES
      Yolanda Gil, Daniel Garijo
      Information Sciences Institute and Department of Computer Science
      @yolandagil, @dgarijov
      {gil,dgarijo}@isi.edu
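
Steps 3 and 4 of the DANA pipeline (slide 13) and the template-based generation named in slide 17 can be illustrated with a minimal Python sketch. It is not DANA's implementation: a plain dictionary stands in for the experiment-specific knowledge base that DANA builds from RDF provenance records, and the template wording, `TEMPLATES`, `join_phrases`, and `generate_account` are hypothetical names chosen for this example. The dataset names, parameter values, and identifiers are those of the example in slide 11.

```python
from string import Template

# Assumption: a dictionary standing in for the experiment-specific knowledge base.
knowledge_base = {
    "method": "Topic modeling",
    "inputs": [
        ("Reuters R8 dataset", "10.6084/m9.figshare.776887"),
        ("English Words dataset", "10.6084/m9.figshare.776888"),
    ],
    "parameters": [
        ("iterations", 100),
        ("stop word size", 3),
        ("number of topics", 10),
        ("batch size", 10),
    ],
    "result": "10.6084/m9.figshare.776856",
}

# Hypothetical account templates, one per view.
TEMPLATES = {
    "execution": Template(
        "$method was run on the $inputs, with $parameters. "
        "The results are at $result."
    ),
    "data": Template("The topics at $result were found in the $inputs."),
}


def join_phrases(items):
    """Join phrases as 'a, b and c'."""
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + " and " + items[-1]


def generate_account(view, kb):
    """Fill the template for the requested view with facts from the knowledge base."""
    inputs = join_phrases([f"{name} ({doi})" for name, doi in kb["inputs"]])
    parameters = join_phrases(
        [f"{name} set to {value}" for name, value in kb["parameters"]]
    )
    # substitute() ignores keyword arguments that a template does not use.
    return TEMPLATES[view].substitute(
        method=kb["method"],
        inputs=inputs,
        parameters=parameters,
        result=kb["result"],
    )


execution_account = generate_account("execution", knowledge_base)
data_account = generate_account("data", knowledge_base)
print(execution_account)
print(data_account)
```

Each view is one template over the same facts, so adding an account for a new audience means adding a template rather than writing new text by hand.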
