Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discovery Using Spark by Josh Snyder, Victor Hong and Laurent Galafassi

Spark Summit East Talk

Published in: Data & Analytics

  1. Escaping Flatland: interactive high-dimensional data analysis in drug discovery using Spark. Josh Snyder, Victor Hong, Laurent Galafassi. Novartis Institutes for BioMedical Research (NIBR)
  2. Overview • Use case – High-dimensional screening data • Goals – Production data pipelines for scientists – Reusable analysis platform for informaticians • High level architecture – Spark and other components • Outcome – Achievements & impact – Future work
  3. Screening is scale-out for bench science (image attributions 1 to 4)
  4. Data size depends on readout technology, structure is standard • Microscopy: cell morphometrics, image texture, ... • Sequencing: multiple gene expression • Cytometry: multiple protein expression (image attributions 5 and 6)
  5. Datasets can be large: 1000 plates, 1536 wells/plate, 1k to 5k cells/well, 50 to 2000 features/cell; 1 to 10 billion observations, 10 to 2000 features; 10b to 20 trillion data points; 10 GB to 20 TB, + time points (x10 = 200 TB), + ?? (1 screen)
  6. Many features can be used to quantify activity: Active Control, Neutral Control; Nucleus/Cytoplasm Intensity, Cell Texture Variance (3 pixel), … n = 1000’s
  7. We can only see what we look at: Cell Texture Variance (3 pixel), Nucleus/Cytoplasm Intensity; Average Z’: 0.65, Average Z’: 0.78 (image attribution 7)
  8. So we need to look at everything: Input • All observations, all features; QC • Mask problem observations • Mask problem features • Calculate aggregate measures for review (per feature, per observation group); Normalization • Pattern correction and scoring for each feature • Eliminate uninformative features; Classification • Use full feature vectors to find cases showing desired activity/phenotype
  9. Smells like Spark… Data Pipeline • Rows = observations • Columns = features; Data Pipeline • Column-wise filtering and aggregation; Data Pipeline • Column-wise correction and scoring • Column to column correlation over rows; Data Pipeline • Row-wise aggregation over features to compute distance metrics
  10. Spark is not a tool for bench scientists [Diagram labels: Data Pipeline (×4), Visualization & Control (×4), Algorithms, Workflow]
  11. High-dimensional data-driven architecture • Pipelines for large data → Spark – Distribute computation – Minimize IO for intermediate results – Declarative API – Support for popular data analysis languages – Ecosystem: MLlib, Spark Job Server, etc. • Visualization & control → WebGL – Web UI flexibility – Render millions of data points • Query → Cassandra – Spark Connector – Distributed, fast, mature, key-value / column family store
  12. Simple workflow
  13. Rich, interactive visualizations
  14. Methods implementations • Classification – Mahalanobis Distance – Gaussian Naïve Bayes • Coarse-grained utilities – findNearLinearCombos – findCorrelation • Fine-grained utilities – Streaming models for incrementally integrating data (pairwise correlation, Greenwald-Khanna quantile estimations, et al.) – Robust statistical measures (MAD, IQR, et al.) – Data masking, missing values handlers (casewise, pairwise, imputation)
  15. The big picture • Achievements – Multi-day batch jobs → multi-hour jobs – Unified data format & workflow across readout technologies – End user application for bench scientists • Future work – Elastic infrastructure – Supervised learning of cell phenotypes – Methods APIs for informaticians – Contributions back to open source
  16. The really big picture: Discovery of therapeutics for patients in need; Informatics applications; Distributed complex analytics; Spark
  17. Acknowledgments: Nabil Hachem, Fred Harbinski, Ioannis Moutsatsos, Hanspeter Gubler, Sergey Kokorin, Leonid Volobuev, Marat Gazimullin, Evgeniya Condrashina, Alexey Girin, David Wilson, and the entire NIBR project team, stakeholders, & sponsors
  18. Attributions 1. "1905 Otto Folin in biochemistry lab at McLean Hospital byAHFolsom Harvard" by A H Folsom - http://preserve.harvard.edu/photographs/McLean.html. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png#/media/File:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png 2. "Petri dish at the Pacific Northwest National Laboratory" by Pacific Northwest National Laboratory, US Department of Energy - http://picturethis.pnl.gov/picturet.nsf/by+id/DRAE-8DBTWP. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Petri_dish_at_the_Pacific_Northwest_National_Laboratory.jpg#/media/File:Petri_dish_at_the_Pacific_Northwest_National_Laboratory.jpg 3. "Chemical Genomics Robot" by Maggie Bartlett, National Human Genome Research Institute - http://www.genome.gov/dmd/img.cfm?node=Photos/Technology/Research%20laboratory&id=79299. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Chemical_Genomics_Robot.jpg#/media/File:Chemical_Genomics_Robot.jpg 4. "385 multiwell plate 1" by real name: Nadina Wiórkiewicz; pl.wiki: Nadine90; commons: Nadine90 - Own work (dzięki współpracy ze szkołą fotograficzną - Fotoedukacja / in cooperation with the school of photography - Fotoedukacja). Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:385_multiwell_plate_1.jpg#/media/File:385_multiwell_plate_1.jpg 5. "Automated confocal image reader" by Neil Emans IPK - self-made. Original image cropped in this usage. Licensed under CC BY-SA 3.0 via Wikipedia - https://en.wikipedia.org/wiki/File:Automated_confocal_image_reader.jpg#/media/File:Automated_confocal_image_reader.jpg 6. By Kierano - Own work. Original image cropped and resized in this usage. CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=25180061 7. "Flatland sphere". Licensed under Public Domain via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Flatland_sphere.JPEG#/media/File:Flatland_sphere.JPEG
  19. THANK YOU. Presentation and project: josh.snyder@novartis.com, victor.hong@novartis.com, laurent.galafassi@novartis.com. NIBR Data Engineering: nabil.hachem@novartis.com
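
The size estimate on slide 5 can be checked with simple arithmetic. The sketch below multiplies out the plate, well, cell, and feature counts quoted on the slide; the constant and function names are illustrative and not part of the original deck. The exact products (about 1.5 to 7.7 billion observations, about 15 billion to 15 trillion data points) sit inside the rounded ranges the slide reports.

```python
# Back-of-envelope check of the slide 5 estimate.
# All figures come from the slide; names are illustrative only.
PLATES = 1000
WELLS_PER_PLATE = 1536
CELLS_PER_WELL = (1_000, 5_000)   # low and high ends quoted on the slide
FEATURES = (10, 2_000)            # low and high ends quoted on the slide


def observations(cells_per_well: int) -> int:
    """One observation per cell: plates x wells x cells."""
    return PLATES * WELLS_PER_PLATE * cells_per_well


def data_points(cells_per_well: int, features: int) -> int:
    """One data point per (observation, feature) pair."""
    return observations(cells_per_well) * features


low_obs = observations(CELLS_PER_WELL[0])
high_obs = observations(CELLS_PER_WELL[1])
low_points = data_points(CELLS_PER_WELL[0], FEATURES[0])
high_points = data_points(CELLS_PER_WELL[1], FEATURES[1])

print(f"observations: {low_obs:,} to {high_obs:,}")
print(f"data points:  {low_points:,} to {high_points:,}")
```

The slide's 10 GB to 20 TB range lines up with roughly one byte per data point; that ratio is an inference from the quoted numbers, not something the slide states.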
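
Slide 7 compares features by their average Z’ value. The deck does not show how Z’ was computed; the sketch below assumes the commonly used definition, Z’ = 1 − 3(σ_active + σ_neutral) / |μ_active − μ_neutral|, applied to the active and neutral control wells of one plate for one feature. The function name and the sample values are made up for illustration.

```python
from statistics import mean, stdev


def z_prime(active, neutral):
    """Z' for one feature on one plate, from its two control groups.

    Assumes the standard definition; uses the sample standard deviation.
    """
    separation = abs(mean(active) - mean(neutral))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    return 1.0 - 3.0 * (stdev(active) + stdev(neutral)) / separation


# Illustrative control readings, not data from the talk.
active_controls = [9.0, 10.0, 11.0]
neutral_controls = [1.0, 2.0, 3.0]
print(z_prime(active_controls, neutral_controls))
```

Values closer to 1 indicate that a feature separates the two control groups more cleanly, which is why the slide ranks features by it.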
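
Slides 8 and 14 mention masking problem observations, scoring each feature, and robust measures such as MAD. The deck does not give the scoring formula, so the following is only one plausible sketch: it scores a feature's values against the median and MAD of the neutral controls and leaves masked observations unscored. The function names, the choice of controls as the reference, and the sample data are all assumptions.

```python
from statistics import median

# Scales MAD so it estimates the standard deviation for normally distributed data.
NORMAL_CONSISTENCY = 1.4826


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def robust_scores(values, controls, masked):
    """Score one feature column against its controls, skipping masked rows.

    Returns None for masked observations so they stay out of later steps.
    """
    center = median(controls)
    scale = NORMAL_CONSISTENCY * mad(controls)
    if scale == 0:
        # A feature with no spread in the controls carries no usable signal.
        raise ValueError("controls have zero spread; feature is uninformative")
    return [None if m else (v - center) / scale for v, m in zip(values, masked)]


# Illustrative values, not data from the talk.
controls = [10.0, 11.0, 9.0, 10.0, 12.0]
values = [10.0, 13.0, 25.0]
masked = [False, False, True]   # third observation flagged during QC
scores = robust_scores(values, controls, masked)
print(scores)
```

Using the median and MAD rather than the mean and standard deviation keeps a few extreme wells from distorting the scores of the rest.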
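
Slide 14 lists streaming models for incrementally integrating data, with pairwise correlation as one example. The deck does not show the implementation; the sketch below uses a Welford-style running update plus a merge step for combining partial results, which is one standard way to compute a Pearson correlation in a single pass over partitioned data. The class and method names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class StreamingCorrelation:
    """Running Pearson correlation between two feature columns."""
    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    m2_x: float = 0.0   # sum of squared deviations of x
    m2_y: float = 0.0   # sum of squared deviations of y
    c_xy: float = 0.0   # sum of cross-deviations

    def update(self, x: float, y: float) -> None:
        """Fold in one observation without storing earlier ones."""
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def merge(self, other: "StreamingCorrelation") -> "StreamingCorrelation":
        """Combine two partial results, e.g. from separate data partitions."""
        if other.n == 0:
            return self
        if self.n == 0:
            return other
        n = self.n + other.n
        dx = other.mean_x - self.mean_x
        dy = other.mean_y - self.mean_y
        weight = self.n * other.n / n
        return StreamingCorrelation(
            n=n,
            mean_x=self.mean_x + dx * other.n / n,
            mean_y=self.mean_y + dy * other.n / n,
            m2_x=self.m2_x + other.m2_x + dx * dx * weight,
            m2_y=self.m2_y + other.m2_y + dy * dy * weight,
            c_xy=self.c_xy + other.c_xy + dx * dy * weight,
        )

    def correlation(self) -> float:
        """Pearson r, or NaN when it is undefined."""
        if self.n < 2 or self.m2_x == 0 or self.m2_y == 0:
            return math.nan
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)


def correlate(pairs):
    """Convenience helper: run the accumulator over an iterable of (x, y)."""
    acc = StreamingCorrelation()
    for x, y in pairs:
        acc.update(x, y)
    return acc


# Illustrative data split into two "partitions".
pairs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
whole = correlate(pairs)
merged = correlate(pairs[:2]).merge(correlate(pairs[2:]))
print(whole.correlation(), merged.correlation())
```

The merge step is what makes this shape suit a distributed pipeline: each partition keeps only six numbers, and partial results can be combined in any order.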
