grizzly - informal overview - pydata boston 2013

Transcript

  • 1. grizzly: statistical analysis with multidimensional dataflows in python. Adrian Heilbut, Boston University and Broad Institute, http://www.empiricist.ca (graphs for reproducible interactive visualization and analysis). PyData Boston 2013
  • 2. 1. Motivation: Biological discovery from complex, multidimensional data; common features of complex biological data and analyses. 2. Problems and Goals: Reproducible, efficient, elegant, collaborative, interactive analysis; data + analysis evolving over time. 3. Toy Dataset: A simple dataset with hierarchical and temporal structure. 4. Strategies: Separate concerns; represent types and structure explicitly; abstract away data management; formalize. 5. Inspirations: OLAP and data cube models; declarative visualization grammars; scientific workflow systems. 6. Core Ideas: Dataflows + Temporal Graphs + Multidimensional Types + Syntactic syrup. 7. Toy Demos. 8. Implementation. 9. Biology application: Mechanisms of drug side effects in Parkinson’s Disease. 10. Summary and Conclusions
  • 3. Motivation • Common and unique features of scientific data • Examples of complex datasets and analyses in computational biology • Data analysis desiderata Motivation Problem & Goals Toy Dataset Strategies Inspirations Core Ideas Implementation Demo Biological Application
  • 4. Biological data is increasingly complex; Many datasets and analyses share common structural features • High-dimensional measurements • Longitudinal / time-course measurements • Hierarchical structure of dimensions • Multiple modalities (expression, protein concentration, phosphorylation) • Complex experimental designs • Complex analysis designs • Complex pre-processing pipelines • Many parameter choices • Many cell types • Many treatments • Many organisms • Many patients • Many replicates
  • 5. Ex 1. Cancer Profiling and Signatures: Cancer Cell Line Encyclopedia (CCLE), Broad / Novartis, Barretina 2012. 1000 cell lines; expression for 20,000 genes, mutation status, drug response
  • 6. P0 P07 P12 P18 P21 P56; proliferation, differentiation, migration & patterning; P0 P07 P15 P21; E0 E11 E15 E18; 3 reps, 40k probes
  • 7. Saline Acute (9); Low Dose Levodopa Chronic (12); Saline Chronic (11); 6-OHDA; Ascorbate; Day 1 Expression + AIM; CP73; Day 8 Expression + AIM; High Dose Levodopa Acute (10); High Dose Levodopa Chronic (11); Saline Chronic (10); Low Levodopa Chronic (8); Saline Chronic (7); 6-OHDA; Ascorbate; CP101; Day 8 Expression + AIM; High Levodopa Chronic (8); Saline Chronic (10). Statistics (per gene): change in expression between treatment groups; expression vs. AIM (correlation) within treatment groups / cell types; expression vs. AIM (correlation) within combined treatment groups. ~ 23,000 x 200 matrix of stats for different contrasts between groups
  • 8. Unique characteristics of scientific data • Relatively short half-life of data and projects • Uncertain and complex analysis methods • Constantly changing data • Lots of internal and external structure over dimensions • Teams with diverse backgrounds and skills over multiple institutions and locations • Communication of data is a primary goal • High risk and high value outcomes: project selection / experimental followup, clinical decisions. The distinctive characteristics, uses, and problems of scientific data analysis motivate the need for tailored abstractions and tools
  • 9. Desiderata for Data Analysis • Correctness • Thoroughness (scientific hypothesis space + analysis space) • Reproducibility • Verifiability (analysis and underlying data, others and oneself) • Clarity • Provenance (of the data, and of the analysis) • Interactivity (Exploration, Drill-down) • Computational Efficiency • Scientist Efficiency
  • 10. Vision: Every figure, every table, and every quantitative claim in a scientific analysis or publication should be verifiable and explorable; it should link to an understandable, executable, modifiable representation of the data analysis pipeline by which it was generated; one should be able to trace back all the way to the primary experimental data; it should be easy and fun to play with
  • 11. Problems and Goals • Errors have serious consequences • Practical problems in day-to-day analysis • Unmet need for better tools
  • 12. Mistakes even happen in Cambridge... Reinhart / Rogoff; Herndon, Ash, Pollin. Original; Correct
  • 13. It’s even worse than it appears... (Kimball, 2013). The ability to easily drill down to view and assess the underlying data is critical
  • 14. Elements of statistical analysis: input data, statistical algorithms, output data, visualizations, summary tables
  • 15. Version 2. (Diagram labels: input data, statistical algorithm, output data, each repeated many times)
  • 16. Version 247... (ah_2013_09_13_v247_ 3-17am) (Diagram labels: input data, statistical algorithm, output data, each repeated many times)
  • 17. v247_figs.pdf, 75 MB (450 pages); v247_table_1.tab (repeated many times)
  • 18. Toy Dataset: Multidimensional profiling of fermentation metabolites of S. cerevisiae
  • 19. Beer ratings: BeerAdvocate.com & RateBeer.com, via Stanford SNAP & a very kind blogger. Multidimensional: Appearance, Aroma, Palate, Taste, Overall. Hierarchies: Location -> Brewery -> Beer; Beer style -> Beer. Temporal.
  • 20. Strategies • Separate concerns • Abstract away data management problems • Formalize • Optimize representations (logical and physical)
  • 21. Separation of Concerns • Each of these components evolves over time • Each may be modified independently by different people with different goals. Components: input data, statistical algorithms, output data, visualizations, summary tables
  • 22. Abstract and automate data management: Deciding and remembering how to name columns and files and track changes over time is not what I’d like to spend time on, especially since I’ll probably do it inconsistently with what I decided to do last week. If the system is responsible for persisting data, caching and memoization can be done automatically.
  • 23. Logical and physical representations matter • Choice of representation and notation has a major effect on ease and efficiency with which concepts can be manipulated, by either a person or a computer • Given our goals for an analysis system, and engineering instinct to separate independent concerns, what are optimal representations for • data? • analysis programs? • visualizations and summary tables?
  • 24. How do scientists actually think about analyses? (Diagram labels: input data, statistical algorithm, output data, each repeated many times)
  • 25. Inspirations (and their deficiencies..) 1. OLAP (On-Line Analytical Processing) and MDX (Multidimensional Expressions) 2. Tableau / Polaris 3. Scientific workflow systems: VisTrails, KNIME, Galaxy, GenePattern
  • 26. 1: OLAP (on-line analytical processing)
  • 27. 2. Declarative Visualization Grammars (Polaris/Tableau; Stolte 2003) • key idea: declarative specification of visualizations is possible and works well • recent focus has been on business analytics, rather than statistical graphics • assumes a static, structured database (i.e. OLAP star schema) Stolte 2000
  • 28. 3. Scientific Workflow Systems VisTrails
  • 29. Hypothesis: Careful design and selection of representations for data, programs, and visualizations will make it possible to satisfy our data analysis objectives: • multidimensional cubes with static, semantic types for conceptual representation of data • directed acyclic graphs of functions with static, multidimensional input and output type signatures for our statistical programs • declarative queries to generate summary tables • declarative visualization grammar to generate graphics (this is not how most researchers represent their analyses today). Correctness, Thoroughness, Reproducibility, Verifiability, Clarity, Provenance, Interactivity, Computational Efficiency, Scientist Efficiency
  • 30. Core Ideas: Multidimensional Cubes and OLAP; Semantic Types; Dataflow Programming
  • 31. Data consists of facts about the world. 1 5.5 3 3 4 5 2 6 2 3 2 2 3 8 5 5 4 4.5 ceci n’est pas data
  • 32. Data consists of facts about the world.
      BeerID  ABV  Smell  Color  Taste  Overall
      1       5.5  3      3      4      5
      2       6    2      3      2      2
      3       8    5      5      4      4.5
  • 33. Facts lie in specific domains defined by the structure of the real world or experimental design
      BeerID: Integer (BeerAdvocate BeerID)
      ABV: float (%EtOH)
      Smell, Color, Taste, Overall: ordinal (1-5), 5 is best
      BeerID  ABV  Smell  Color  Taste  Overall
      1       5.5  3      3      4      5
      2       6    2      3      2      2
      3       8    5      5      4      4.5
  • 34. There are a number of possible representations; logically but not practically equivalent
      Column types: BeerID Integer (BeerAdvocate); ABV float (%EtOH); Smell, Color, Taste, Overall ordinal (1-5), 5 is best
      BeerID  ABV  Smell  Color  Taste  Overall
      1       5.5  3      3      4      5
      2       6    2      3      2      2
      3       8    5      5      4      4.5
      ≈
      BeerID  Measure  Value
      1       ABV      5.5
      1       Smell    3
      1       Color    3
      1       Taste    4
      1       Overall  5
      2       ABV      6
      2       Smell    2
      2       Color    3
      2       Taste    2
      2       Overall  2
      3       ABV      8
      3       Smell    5
      3       Color    5
      3       Taste    4
      3       Overall  4.5
      cf. pandas reshape, plyr melt/cast
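The equivalence between the wide and long layouts on this slide can be sketched with pandas, which the slide itself points to. This is only an illustration on the three toy beers shown above; the variable names `wide`, `long`, and `roundtrip` are made up here and are not part of Grizzly.

```python
import pandas as pd

# Wide ("matrix") layout: one row per beer, one column per measure.
wide = pd.DataFrame({
    "BeerID": [1, 2, 3],
    "ABV": [5.5, 6.0, 8.0],
    "Smell": [3, 2, 5],
    "Color": [3, 3, 5],
    "Taste": [4, 2, 4],
    "Overall": [5.0, 2.0, 4.5],
})

# Long ("relational") layout: one row per (BeerID, Measure) fact.
long = wide.melt(id_vars=["BeerID"], var_name="Measure", value_name="Value")

# Pivot back to the wide layout; measure columns come back sorted by name.
roundtrip = long.pivot(index="BeerID", columns="Measure", values="Value")
```

The two layouts carry the same facts, but the long form has 15 rows where the wide form has 3, which is one reason they are "not practically equivalent" for storage and computation.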
  • 35. Data Representations • Scientific / statistical data is usually in matrix format, and it must be for efficient storage and computation • Relational model is good for precisely encoding logical structure of data, but • moving between relations and matrices is cumbersome • defining a relational schema for all intermediate data would be a lot of work, especially as it changes over time • on its own, the relational model does not explicitly represent semantics and units
  • 36. Conceptual Model: OLAP Data Cubes. The Cartesian product of a set of dimensions (finite discrete sets) defines an N-dimensional grid. A multidimensional dataset is a function mapping locations in that grid to typed values called measures (identities of the measures can also be considered as just a special kind of dimension). Example cubes: dimensions (Beer ID, UserID, Time), measure: overall beer rating; dimensions (Gene, Brain Region, Stage of Development), measure: log2 gene expression
  • 37. Conceptual Model: Data Cubes as functions mapping dimensions to measures
      def BeerRatingsByUser(UserID, BeerID): return (Taste, Smell, Color, Texture, Overall)
      def BeerRatingsByBeer(BeerID): return (mean Taste, mean Smell, mean Color, mean Texture, mean Overall)
      def ExpressionBySample(Gene, Region, SampleID): return (log2 expression)
      def ExpressionByRegionTime(Gene, Region, Timepoint): return (median expression, mean expression, std deviation, median abs deviation, # replicates)
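The signatures on this slide are schematic. A minimal runnable sketch of the same idea, assuming a cube is stored as a plain dictionary from dimension coordinates to a tuple of measures, might look like the following. The toy values, the reduced measure set, and the snake_case names are illustrative assumptions, not Grizzly's API.

```python
from statistics import mean

# Hypothetical toy cube: (UserID, BeerID) -> (Taste, Smell, Overall).
ratings_by_user = {
    ("u1", 1): (4.0, 3.0, 5.0),
    ("u2", 1): (2.0, 3.0, 3.0),
    ("u1", 2): (2.0, 2.0, 2.0),
}

def beer_ratings_by_user(user_id, beer_id):
    """Look up the measures at one location in the (UserID, BeerID) grid."""
    return ratings_by_user[(user_id, beer_id)]

def beer_ratings_by_beer(beer_id):
    """Collapse the UserID dimension by taking the mean of each measure."""
    rows = [m for (_, b), m in ratings_by_user.items() if b == beer_id]
    return tuple(mean(col) for col in zip(*rows))
```

Dropping a dimension from the function's arguments corresponds to aggregating over it, which is why the by-beer cube returns means rather than raw ratings.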
  • 38. Hierarchies Dimensions are related to each other in structures that reflect: • the nature of the world • experimental methods and designs • analysis processes and decisions These hierarchical relationships are critical to understanding and performing analyses, and need to be represented explicitly.
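One way to represent a hierarchy explicitly, sketched here for the Location -> Brewery -> Beer hierarchy of the toy dataset, is a child-to-parent mapping per level plus a generic roll-up function. The beer, brewery, and location names and the `roll_up` helper are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hierarchy edges (child -> parent), one mapping per level.
beer_to_brewery = {"beer1": "breweryA", "beer2": "breweryA", "beer3": "breweryB"}
brewery_to_location = {"breweryA": "Boston", "breweryB": "Montreal"}

# Hypothetical measure: overall rating per beer.
overall = {"beer1": 5.0, "beer2": 2.0, "beer3": 4.5}

def roll_up(values, child_to_parent, agg=mean):
    """Aggregate a measure from one hierarchy level to its parent level."""
    groups = defaultdict(list)
    for child, value in values.items():
        groups[child_to_parent[child]].append(value)
    return {parent: agg(vals) for parent, vals in groups.items()}

by_brewery = roll_up(overall, beer_to_brewery)
# Note: this is a mean of brewery means, not weighted by beers per brewery.
by_location = roll_up(by_brewery, brewery_to_location)
```

Because the hierarchy is data rather than a naming convention, the same `roll_up` works at every level.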
  • 39. Multidimensional Semantic Types. 1970s / 80s: Semantic Database formalisms specify different kinds of relationships and interactions between objects (e.g. containment, is-a, relations / cross-products); overshadowed by ER model and later, UML. 1990s: OLAP
  • 40. Dataflow: Lots of domains model computation as ‘declarative’ dataflows, e.g. circuit design, audio / video processing
  • 41. Grizzly Computation Model • Directed Acyclic Graph of processing nodes • Inputs and outputs of every node are typed cubes • Function nodes add type information to describe their output dimensions • ‘Apply’ nodes propagate the types of any input dimensions that they do not modify to the outputs • Computation is declarative / intensional, not imperative; nodes automatically process whatever is on their inputs, like an electrical circuit. Example: the function node CalcMedian Ratings maps (ReviewID, BeerID) --> (Appearance, Aroma, Palate, Taste, Overall) to (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall); under Apply, the input (ReviewID, BeerID, SourceID) --> (Appearance, Aroma, Palate, Taste, Overall) yields (SourceID, BeerID) --> (MedianAppearance, MedianAroma, MedianPalate, MedianTaste, MedianOverall)
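The type propagation described on this slide can be sketched in plain Python. The following is a toy model under stated assumptions, not Grizzly's implementation: `TypedCube`, `calc_median_ratings`, and `apply_node` are hypothetical names, only two measures are used, and the "Median" prefix is added inside the function node for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median

@dataclass
class TypedCube:
    dims: tuple      # dimension names, e.g. ("ReviewID", "BeerID")
    measures: tuple  # measure names, e.g. ("Taste", "Overall")
    data: dict       # coordinates tuple -> measure values tuple

def calc_median_ratings(cube):
    """Function node: (ReviewID, BeerID) -> measures becomes (BeerID) -> medians."""
    if cube.dims != ("ReviewID", "BeerID"):
        raise TypeError("expected dimensions (ReviewID, BeerID)")
    groups = defaultdict(list)
    for (_, beer), values in cube.data.items():
        groups[beer].append(values)
    data = {(beer,): tuple(median(col) for col in zip(*rows))
            for beer, rows in groups.items()}
    return TypedCube(("BeerID",), tuple("Median" + m for m in cube.measures), data)

def apply_node(func, func_dims, cube):
    """Apply node: run func once per value of the dimensions it does not
    declare, and carry those extra dimensions through to the output type."""
    extra = tuple(d for d in cube.dims if d not in func_dims)
    extra_idx = [cube.dims.index(d) for d in extra]
    func_idx = [cube.dims.index(d) for d in func_dims]
    slices = defaultdict(dict)
    for key, values in cube.data.items():
        outer = tuple(key[i] for i in extra_idx)
        inner = tuple(key[i] for i in func_idx)
        slices[outer][inner] = values
    out_data, out_dims, out_measures = {}, None, None
    for outer, data in slices.items():
        result = func(TypedCube(func_dims, cube.measures, data))
        out_dims, out_measures = extra + result.dims, result.measures
        for key, values in result.data.items():
            out_data[outer + key] = values
    return TypedCube(out_dims, out_measures, out_data)

# Toy input with an extra SourceID dimension the function knows nothing about.
reviews = TypedCube(
    dims=("ReviewID", "BeerID", "SourceID"),
    measures=("Taste", "Overall"),
    data={(1, "b1", "BA"): (4, 5), (2, "b1", "BA"): (2, 3), (3, "b1", "RB"): (3, 3)},
)
medians = apply_node(calc_median_ratings, ("ReviewID", "BeerID"), reviews)
```

The function node never mentions SourceID, yet the output cube is typed (SourceID, BeerID), which mirrors the signature change shown in the slide's example.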
  • 42. Advantages of DAG representation • Static type specifications allow precise and clear modeling / design of an analysis pipeline before having to write all the code needed to implement it • Model can be turned into an actual working program, instead of just being a schematic diagram • Provenance tracking without extra instrumentation • Memoization of intermediate results is easy because data dependencies are already explicit • Easier to understand, reason about, and explain to others • Easier to track modification history as graph edits
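The memoization and provenance points can be illustrated with a small sketch: since each node lists its inputs explicitly, a cache key and a provenance trail can both be derived from the graph itself. The `Node` class, `evaluate`, and `provenance` are hypothetical helpers, and the key below assumes node names are unique and does not detect changes to a node's function body.

```python
import hashlib
import json

class Node:
    """One processing step; `inputs` are upstream Node objects."""
    def __init__(self, name, func, inputs=()):
        self.name, self.func, self.inputs = name, func, tuple(inputs)

    def key(self):
        # The key covers the node name and, recursively, its whole ancestry,
        # so it doubles as a fingerprint of how the result was produced.
        payload = json.dumps([self.name, [n.key() for n in self.inputs]])
        return hashlib.sha256(payload.encode()).hexdigest()

cache = {}   # key -> computed result
calls = []   # names of nodes that were actually executed

def evaluate(node):
    """Compute a node's value, reusing cached results of shared ancestors."""
    k = node.key()
    if k not in cache:
        args = [evaluate(n) for n in node.inputs]
        calls.append(node.name)
        cache[k] = node.func(*args)
    return cache[k]

def provenance(node):
    """List the upstream node names that contributed to `node`, in order."""
    names = []
    for n in node.inputs:
        names.extend(provenance(n))
    return names + [node.name]

raw = Node("raw", lambda: [3, 1, 2])
ordered = Node("ordered", sorted, [raw])
top = Node("top", lambda xs: xs[-1], [ordered])
low = Node("low", lambda xs: xs[0], [ordered])

top_value = evaluate(top)
low_value = evaluate(low)  # reuses the cached "raw" and "ordered" results
```

No extra instrumentation is needed for either feature: the dependency edges already carry the information.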
  • 43. Syntactic Syrup: CubeApply takes the cross-product of a set of input cubes / vectors and applies a function to all results. Example: the function BeerRank takes (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall) and weights (AppWeight, AromaWt, PalWt, TasteWt, OverallWt) and yields (BeerID) --> (RankScore); under CubeApply, the inputs (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall) and (RankModelID) --> (AppWt, AromaWt, PalWt, TasteWt, OverallWt) yield (BeerID, RankModelID) --> (RankScore)
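A rough sketch of the cross-product behaviour, assuming cubes are dictionaries keyed by coordinate tuples, is shown below. The names `beer_rank` and `cube_apply`, the weighted-sum scoring rule, and the toy ratings and weights are assumptions made for illustration.

```python
from itertools import product

def beer_rank(ratings, weights):
    """Scalar function: one beer's ratings and one weight vector -> RankScore."""
    return sum(r * w for r, w in zip(ratings, weights))

def cube_apply(func, cube_a, cube_b):
    """Apply func to every pairing of entries from the two input cubes.
    Output coordinates are the concatenated input coordinates."""
    return {ka + kb: func(va, vb)
            for (ka, va), (kb, vb) in product(cube_a.items(), cube_b.items())}

# (BeerID) -> (Appearance, Aroma, Palate, Taste, Overall)
ratings = {("b1",): (4, 4, 3, 5, 4), ("b2",): (2, 3, 3, 2, 2)}

# (RankModelID) -> (AppWt, AromaWt, PalWt, TasteWt, OverallWt)
models = {
    ("taste_only",): (0, 0, 0, 1, 0),
    ("equal",): (0.2, 0.2, 0.2, 0.2, 0.2),
}

# (BeerID, RankModelID) -> RankScore
scores = cube_apply(beer_rank, ratings, models)
```

The scoring function is written for a single beer and a single model; the cross-product over beers and models is supplied by `cube_apply`, so adding a third ranking model requires no change to `beer_rank`.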
  • 44. Slicing, Dicing Since semantic type data is always propagated, in principle we can define the schema for any intermediate data (including hierarchy structure) and make use of existing OLAP tools to run declarative queries
  • 45. Implementation • Type system • DAGs • Execution • Data Management • Visualizations • ...queries?
  • 46. Requirements for a practical system • Programmable and extensible, without requiring discontinuous changes to existing habits • OLAP systems not general enough; energy barrier to setting up a ‘data warehouse’ for a particular scientific analysis is too high; arbitrary, complex statistics not supported • System must be deployable over the web, so analyses and results can be easily shared with geographically dispersed collaborators and the scientific community • Free and open source
  • 47. Current Support for Hierarchies in Pandas • Hierarchical dataframes only support ‘uniform’ hierarchies • lots of real analysis requires comparisons across many different types • Metadata is unstructured • can’t compute effectively on column names • Manual management • consistency of column naming and interpretation depends entirely on programmer discipline
  • 48. Simple Semantic Types over Pandas
      '[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bh"], ["st", "pval"], ["tt", "welch ttest"]]'
      '[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "nominal"], ["st", "pval"], ["tt", "student ttest"]]'
      '[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bonf"], ["st", "pval"], ["tt", "student ttest"]]'
      '[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bh"], ["st", "pval"], ["tt", "student ttest"]]'
      '[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["st", "pval"], ["tt", "levene"]]'
      Diagram labels: ct (CP73, CP101); tt (student ttest, welch ttest); st (pval, t-stat); mc (bonf, bh, nom); cross-product (X) of cmp, ct, tt, mc, st
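The column labels on this slide look like JSON-encoded lists of [attribute, value] pairs. A minimal sketch of encoding, decoding, and selecting such labels with the standard library follows; `encode_label`, `decode_label`, and `select` are hypothetical helper names, and the example omits the "cmp" attribute to stay short.

```python
import json

def encode_label(**attrs):
    """Serialize attribute/value pairs as a JSON list sorted by attribute name."""
    return json.dumps(sorted(attrs.items()))

def decode_label(label):
    """Turn a JSON label back into an attribute -> value dictionary."""
    return dict(json.loads(label))

def select(labels, **wanted):
    """Return the labels whose attributes match every requested value."""
    return [label for label in labels
            if all(decode_label(label).get(k) == v for k, v in wanted.items())]

columns = [
    encode_label(ct="cp73", mc="bh", st="pval", tt="welch ttest"),
    encode_label(ct="cp73", mc="bonf", st="pval", tt="student ttest"),
    encode_label(ct="cp73", mc="bh", st="pval", tt="student ttest"),
]

# All columns that used Benjamini-Hochberg ("bh") multiple-testing correction.
bh_columns = select(columns, mc="bh")
```

Because the labels are structured rather than free-form strings, columns can be selected by meaning instead of by remembering a naming convention.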
  • 49. Temporal Graph Database • Canonical representations for types, ‘programs’, and pointers to data are all typed property graphs (DAGs) that can hold JSON payloads • All edits to the graph are recorded, so the user can rewind / replay and branch
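The rewind / replay / branch behaviour can be sketched as an append-only edit log from which any past state of the graph is rebuilt. This in-memory `TemporalGraph` class is a hypothetical illustration and says nothing about how GZDB actually stores its history.

```python
import copy

class TemporalGraph:
    """Property graph whose state is derived by replaying an edit log."""

    def __init__(self, log=None):
        self.log = list(log or [])

    def add_node(self, node_id, payload):
        # Payloads are JSON-like dicts; copy so later mutation cannot alter history.
        self.log.append(("add_node", node_id, copy.deepcopy(payload)))

    def add_edge(self, src, dst):
        self.log.append(("add_edge", src, dst))

    def state(self, version=None):
        """Replay the first `version` edits (all edits if version is None)."""
        nodes, edges = {}, set()
        for op, a, b in self.log[:version]:
            if op == "add_node":
                nodes[a] = b
            else:
                edges.add((a, b))
        return nodes, edges

    def branch(self, version):
        """Start a new history that shares the first `version` edits."""
        return TemporalGraph(self.log[:version])

g = TemporalGraph()
g.add_node("reviews", {"dims": ["ReviewID", "BeerID"]})
g.add_node("medians", {"dims": ["BeerID"]})
g.add_edge("reviews", "medians")

# Branch from the state after two edits and take the analysis elsewhere.
alt = g.branch(2)
alt.add_node("means", {"dims": ["BeerID"]})
```

Rewinding is just replaying a prefix of the log, and a branch is a new log that shares that prefix.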
  • 50. Generic Visualization Components to compose visualizations & reports
  • 51. Architecture Overview: GZDB, Graph Editor, Grizzly Webapp, SQLAlchemy, Postgres, IPython, Pandas, HTML Viz Widgets, GZData, GZFlow, CherryPy, D3, SlickGrid, Flot, jsPlumb, Filesystem
  • 52. Biological Applications
  • 53. Bio Example 1: Striatal Gene Expression w. L-DOPA. Summary tables; drilldown and provenance from summary tables to primary data
  • 54. Drilldown from summary to statistical tables
  • 55. Drilldown from statistical tables to plots of primary data
  • 56. Bio Example 2: Complex, interactive visualizations: BOMBASTIC, subspace clustering of time-series data. A. Define blocks and an ordering B. Cluster each block independently C. Represent resulting clusters in a tree and explore/filter interactively. Each (predefined) subspace has unique information; we want to understand patterns both within and between blocks
  • 57. Summary • Increasing complexity of biological data presents critical requirements for better systems for collaborative analysis of high-dimensional, multi-factor, dynamic data • A dataflow computation model with semantic, multidimensional types offers significant advantages for meeting these requirements • Grizzly defines a simple, formal model for multidimensional data and DAGs of operations on that data, adapting and combining ideas from OLAP, declarative visualization, and dataflow programming • Proof-of-concept implementation in Python establishes feasibility • Applications to analysis of real biological experiments (PD, Neuro, Cancer) will evaluate practical utility and benefits. Correctness, Thoroughness, Reproducibility, Verifiability, Clarity, Provenance, Interactivity, Computational Efficiency, Scientist Efficiency
  • 58. Acknowledgements: Software • IPython • NumPy • Pandas • Statsmodels • Patsy • CherryPy • SQLAlchemy • postgres • NetworkX • igraph • backbone • underscore • jsPlumb • flot • D3.js
  • 59. Acknowledgements
  • 60. @adrian_h http://www.grizzly.io