SlideShare a Scribd company logo
Escaping Flatland: interactive
high-dimensional data analysis in
drug discovery using Spark
Josh Snyder, Victor Hong, Laurent Galafassi
Novartis Institutes for BioMedical Research (NIBR)
Overview
• Use case
– High-dimensional screening data
• Goals
– Production data pipelines for scientists
– Reusable analysis platform for informaticians
• High level architecture
– Spark and other components
• Outcome
– Achievements & impact
– Future work
Screening is scale-out for bench science
1
2
3
4
Data size depends on readout technology,
structure is standard
• Microscopy
• Cell morphometrics
• Image texture
• ...
• Sequencing
• Multiple gene expression
• Cytometry
• Multiple protein expression
5
6
Datasets can be large
1000 plates 1536 wells/plate 1k to 5k cells/well
50 to 2000 features/cell
1 to 10 billion observations
10 to 2000 features
10b to 20 trillion data points
10 GB to 20 TB
+ time points (x10 = 200TB)
+ ??
1 screen
Many features can be used to quantify activity
Active
Control
Neutral
Control
Nucleus/Cytoplasm Intensity
Cell Texture Variance (3 pixel)
…
n = 1000’s
We can only see what we look at
Cell Texture
Variance (3 pixel)
Nucleus/Cytoplasm
Intensity
Average Z’: 0.65Average Z’: 0.78
7
So we need to look at everything
Input
• All observations, all
features
QC
• Mask problem
observations
• Mask problem
features
• Calculate aggregate
measures for review
• Per feature
• Per observation
group
Normalization
• Pattern correction
and scoring for each
feature
• Eliminate
uninformative
features
Classification
• Use full feature
vectors to find cases
showing desired
activity/phenotype
Smells like Spark…
Data Pipeline
• Rows =
observations
• Columns =
features
Data Pipeline
• Column-wise
filtering and
aggregation
Data Pipeline
• Column-wise
correction and
scoring
• Column to column
correlation over
rows
Data Pipeline
• Row-wise
aggregation over
features to
compute distance
metrics
Spark is not a tool for bench scientists
Data Pipeline Data Pipeline Data Pipeline Data Pipeline
Visualization &
Control
Visualization &
Control
Visualization &
Control
Visualization &
Control
Algorithms
Workflow
High-dimensional data-driven architecture
• Pipelines for large data à
Spark
– Distribute computation
– Minimize IO for intermediate
results
– Declarative API
– Support for popular data analysis
languages
– Ecosystem: MLlib, Spark Job
Server, etc.
• Visualization & control à
WebGL
– Web UI flexibility
– Render millions of data points
• Query à Cassandra
– Spark Connector
– Distributed, fast, mature, key-value
/ column family store
Simple workflow
Rich, interactive visualizations
Methods implementations
• Classification
– Mahalanobis Distance
– Gaussian Naïve Bayes
• Coarse-grained utilities
– findNearLinearCombos
– findCorrelation
• Fine-grained utilities
– Streaming models for incrementally integrating data (pairwise
correlation, Greenwald-Khanna quantile estimations, et al.)
– Robust statistical measures (MAD, IQR, et al.)
– Data masking, missing values handlers (casewise, pairwise, imputation)
The big picture
• Achievements
– Multi-day batch jobs à multi-hour jobs
– Unified data format & workflow across readout technologies
– End user application for bench scientists
• Future work
– Elastic infrastructure
– Supervised learning of cell phenotypes
– Methods APIs for informaticians
– Contributions back to open source
The really big picture
Discovery of therapeutics
for patients in need
Informatics applications
Distributed complex
analytics
Spark
Acknowledgments
Nabil Hachem
Fred Harbinski
Ioannis Moutsatsos
Hanspeter Gubler
Sergey Kokorin
Leonid Volobuev
Marat Gazimullin
Evgeniya Condrashina
Alexey Girin
David Wilson
and the entire NIBR project team, stakeholders, & sponsors
Attributions
1. "1905 Otto Folin in biochemistry lab at McLean HospitalbyAHFolsom Harvard" by A H Folsom -
http://preserve.harvard.edu/photographs/McLean.html. Licensed under Public Domain via Commons -
https://commons.wikimedia.org/wiki/File:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png#/media/Fil
e:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png
2. "Petri dish at the Pacific Northwest NationalLaboratory" by Pacific Northwest NationalLaboratory, US Department of Energy -
http://picturethis.pnl.gov/picturet.nsf/by+id/DRAE-8DBTWP. Licensed under Public Domain via Commons -
https://commons.wikimedia.org/wiki/File:Petri_dish_at_the_Pacific_Northwest_National_Laboratory.jpg#/media/File:Petri_dish_at_the_Pacifi
c_Northwest_National_Laboratory.jpg
3. "ChemicalGenomics Robot" by Maggie Bartlett, National Human Genome Research Institute -
http://www.genome.gov/dmd/img.cfm?node=Photos/Technology/Research%20laboratory&id=79299. Licensed under Public Domain via
Commons - https://commons.wikimedia.org/wiki/File:Chemical_Genomics_Robot.jpg#/media/File:Chemical_Genomics_Robot.jpg
4. "385 multiwell plate 1" by real name: Nadina Wiórkiewiczpl.wiki: Nadine90commons: Nadine90 - Own work(dziękiwspółpracy ze szkołą
fotograficzną - Fotoedukacja /in cooperation with the schoolof photography - Fotoedukacja). Licensed under CC BY-SA 3.0 via Wikimedia
Commons - https://commons.wikimedia.org/wiki/File:385_multiwell_plate_1.jpg#/media/File:385_multiwell_plate_1.jpg
5. "Automated confocalimage reader" by Neil Emans IPK - self-made. Original image cropped in this usage. Licensed under CC BY-SA 3.0
via Wikipedia - https://en.wikipedia.org/wiki/File:Automated_confocal_image_reader.jpg#/media/File:Automated_confocal_image_reader.jpg
6. By Kierano - Own work. Original image cropped and resized in this usage. CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=25180061
7. "Flatland sphere". Licensed under Public Domain via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:Flatland_sphere.JPEG#/media/File:Flatland_sphere.JPEG
THANK YOU.
josh.snyder@novartis.com Presentation and project
victor.hong@novartis.com
laurent.galafassi@novartis.com
nabil.hachem@novartis.com NIBR Data Engineering

More Related Content

What's hot

Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
Databricks
 
Making Homes Efficient and Comfortable Using AI and IoT Data
Making Homes Efficient and Comfortable Using AI and IoT DataMaking Homes Efficient and Comfortable Using AI and IoT Data
Making Homes Efficient and Comfortable Using AI and IoT Data
Databricks
 
Optier presentation for open analytics event
Optier presentation for open analytics eventOptier presentation for open analytics event
Optier presentation for open analytics event
Open Analytics
 
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
confluent
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
Pavel Hardak
 
Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
Open Analytics
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
Open Analytics
 

What's hot (20)

Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016
 
Making Homes Efficient and Comfortable Using AI and IoT Data
Making Homes Efficient and Comfortable Using AI and IoT DataMaking Homes Efficient and Comfortable Using AI and IoT Data
Making Homes Efficient and Comfortable Using AI and IoT Data
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Optier presentation for open analytics event
Optier presentation for open analytics eventOptier presentation for open analytics event
Optier presentation for open analytics event
 
Growing Data Scientists by Amparo Alonso Betanzos
Growing Data Scientists by Amparo Alonso BetanzosGrowing Data Scientists by Amparo Alonso Betanzos
Growing Data Scientists by Amparo Alonso Betanzos
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
 
Qo comparision
Qo comparisionQo comparision
Qo comparision
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
 
Apache Spark and future of advanced analytics
Apache Spark and future of advanced analyticsApache Spark and future of advanced analytics
Apache Spark and future of advanced analytics
 
Analytics graph databases
Analytics graph databasesAnalytics graph databases
Analytics graph databases
 
Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 

Viewers also liked

ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
Spark Summit
 
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Spark Summit
 
Mentoring I
Mentoring IMentoring I
Mentoring I
shendin
 
Xaneiro 2015
Xaneiro 2015Xaneiro 2015
Xaneiro 2015
iesasorey
 

Viewers also liked (20)

ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
 
Designing Progressive and Interactive Analytics Processes for High-Dimensiona...
Designing Progressive and Interactive Analytics Processes for High-Dimensiona...Designing Progressive and Interactive Analytics Processes for High-Dimensiona...
Designing Progressive and Interactive Analytics Processes for High-Dimensiona...
 
High Dimensional Data Visualization
High Dimensional Data VisualizationHigh Dimensional Data Visualization
High Dimensional Data Visualization
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
Predict Repeat Shoppers with H20 and Spark
Predict Repeat Shoppers with H20 and SparkPredict Repeat Shoppers with H20 and Spark
Predict Repeat Shoppers with H20 and Spark
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
Scala Programming Introduction
Scala Programming IntroductionScala Programming Introduction
Scala Programming Introduction
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Mentoring I
Mentoring IMentoring I
Mentoring I
 
碳酸鈉81027
碳酸鈉81027碳酸鈉81027
碳酸鈉81027
 
Xaneiro 2015
Xaneiro 2015Xaneiro 2015
Xaneiro 2015
 
Xander santiago 7 b
Xander santiago 7 bXander santiago 7 b
Xander santiago 7 b
 
Hi, today i present you five famous
Hi, today i present you five famousHi, today i present you five famous
Hi, today i present you five famous
 
Presentación sobre Diabetes
Presentación sobre DiabetesPresentación sobre Diabetes
Presentación sobre Diabetes
 
Factors that influence path to purchase
Factors that influence path to purchaseFactors that influence path to purchase
Factors that influence path to purchase
 
Factura
FacturaFactura
Factura
 
HP LaserJet Pro P1606dn – CE278A Toner Replacement
HP LaserJet Pro P1606dn – CE278A Toner ReplacementHP LaserJet Pro P1606dn – CE278A Toner Replacement
HP LaserJet Pro P1606dn – CE278A Toner Replacement
 

Similar to Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discovery Using Spark by Josh Snyder, Victor Hong and Laurent Galafassi

FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_Presentation
Yatpang Cheung
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
Michael Atkins
 

Similar to Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discovery Using Spark by Josh Snyder, Victor Hong and Laurent Galafassi (20)

Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
 
Data-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemData-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystem
 
Ramil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientistsRamil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientists
 
iMicrobe_ASLO_2015
iMicrobe_ASLO_2015iMicrobe_ASLO_2015
iMicrobe_ASLO_2015
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven Research
 
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...
 
Neurosciences Information Framework (NIF): An example of community Cyberinfra...
Neurosciences Information Framework (NIF): An example of community Cyberinfra...Neurosciences Information Framework (NIF): An example of community Cyberinfra...
Neurosciences Information Framework (NIF): An example of community Cyberinfra...
 
Databases and Ontologies: Where do we go from here?
Databases and Ontologies:  Where do we go from here?Databases and Ontologies:  Where do we go from here?
Databases and Ontologies: Where do we go from here?
 
FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_Presentation
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
 
HPC at NIBR
HPC at NIBRHPC at NIBR
HPC at NIBR
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
 
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
 
Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery Labs
 
Big Data
Big Data Big Data
Big Data
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
 

More from Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
MAQIB18
 

Recently uploaded (20)

一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Uber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportUber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis Report
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 

Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discovery Using Spark by Josh Snyder, Victor Hong and Laurent Galafassi

  • 1. Escaping Flatland: interactive high-dimensional data analysis in drug discovery using Spark Josh Snyder, Victor Hong, Laurent Galafassi Novartis Institutes for BioMedical Research (NIBR)
  • 2. Overview • Use case – High-dimensional screening data • Goals – Production data pipelines for scientists – Reusable analysis platform for informaticians • High level architecture – Spark and other components • Outcome – Achievements & impact – Future work
  • 3. Screening is scale-out for bench science 1 2 3 4
  • 4. Data size depends on readout technology, structure is standard • Microscopy • Cell morphometrics • Image texture • ... • Sequencing • Multiple gene expression • Cytometry • Multiple protein expression 5 6
  • 5. Datasets can be large 1000 plates 1536 wells/plate 1k to 5k cells/well 50 to 2000 features/cell 1 to 10 billion observations 10 to 2000 features 10b to 20 trillion data points 10 GB to 20 TB + time points (x10 = 200TB) + ?? 1 screen
  • 6. Many features can be used to quantify activity Active Control Neutral Control Nucleus/Cytoplasm Intensity Cell Texture Variance (3 pixel) … n = 1000’s
  • 7. We can only see what we look at Cell Texture Variance (3 pixel) Nucleus/Cytoplasm Intensity Average Z’: 0.65Average Z’: 0.78 7
  • 8. So we need to look at everything Input • All observations, all features QC • Mask problem observations • Mask problem features • Calculate aggregate measures for review • Per feature • Per observation group Normalization • Pattern correction and scoring for each feature • Eliminate uninformative features Classification • Use full feature vectors to find cases showing desired activity/phenotype
  • 9. Smells like Spark… Data Pipeline • Rows = observations • Columns = features Data Pipeline • Column-wise filtering and aggregation Data Pipeline • Column-wise correction and scoring • Column to column correlation over rows Data Pipeline • Row-wise aggregation over features to compute distance metrics
  • 10. Spark is not a tool for bench scientists Data Pipeline Data Pipeline Data Pipeline Data Pipeline Visualization & Control Visualization & Control Visualization & Control Visualization & Control Algorithms Workflow
  • 11. High-dimensional data-driven architecture • Pipelines for large data à Spark – Distribute computation – Minimize IO for intermediate results – Declarative API – Support for popular data analysis languages – Ecosystem: MLlib, Spark Job Server, etc. • Visualization & control à WebGL – Web UI flexibility – Render millions of data points • Query à Cassandra – Spark Connector – Distributed, fast, mature, key-value / column family store
  • 14. Methods implementations • Classification – Mahalanobis Distance – Gaussian Naïve Bayes • Coarse-grained utilities – findNearLinearCombos – findCorrelation • Fine-grained utilities – Streaming models for incrementally integrating data (pairwise correlation, Greenwald-Khanna quantile estimations, et al.) – Robust statistical measures (MAD, IQR, et al.) – Data masking, missing values handlers (casewise, pairwise, imputation)
  • 15. The big picture • Achievements – Multi-day batch jobs à multi-hour jobs – Unified data format & workflow across readout technologies – End user application for bench scientists • Future work – Elastic infrastructure – Supervised learning of cell phenotypes – Methods APIs for informaticians – Contributions back to open source
  • 16. The really big picture Discovery of therapeutics for patients in need Informatics applications Distributed complex analytics Spark
  • 17. Acknowledgments Nabil Hachem Fred Harbinski Ioannis Moutsatsos Hanspeter Gubler Sergey Kokorin Leonid Volobuev Marat Gazimullin Evgeniya Condrashina Alexey Girin David Wilson and the entire NIBR project team, stakeholders, & sponsors
  • 18. Attributions 1. "1905 Otto Folin in biochemistry lab at McLean HospitalbyAHFolsom Harvard" by A H Folsom - http://preserve.harvard.edu/photographs/McLean.html. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png#/media/Fil e:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png 2. "Petri dish at the Pacific Northwest NationalLaboratory" by Pacific Northwest NationalLaboratory, US Department of Energy - http://picturethis.pnl.gov/picturet.nsf/by+id/DRAE-8DBTWP. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Petri_dish_at_the_Pacific_Northwest_National_Laboratory.jpg#/media/File:Petri_dish_at_the_Pacifi c_Northwest_National_Laboratory.jpg 3. "ChemicalGenomics Robot" by Maggie Bartlett, National Human Genome Research Institute - http://www.genome.gov/dmd/img.cfm?node=Photos/Technology/Research%20laboratory&id=79299. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Chemical_Genomics_Robot.jpg#/media/File:Chemical_Genomics_Robot.jpg 4. "385 multiwell plate 1" by real name: Nadina Wiórkiewiczpl.wiki: Nadine90commons: Nadine90 - Own work(dziękiwspółpracy ze szkołą fotograficzną - Fotoedukacja /in cooperation with the schoolof photography - Fotoedukacja). Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:385_multiwell_plate_1.jpg#/media/File:385_multiwell_plate_1.jpg 5. "Automated confocalimage reader" by Neil Emans IPK - self-made. Original image cropped in this usage. Licensed under CC BY-SA 3.0 via Wikipedia - https://en.wikipedia.org/wiki/File:Automated_confocal_image_reader.jpg#/media/File:Automated_confocal_image_reader.jpg 6. By Kierano - Own work. Original image cropped and resized in this usage. CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=25180061 7. "Flatland sphere". Licensed under Public Domain via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Flatland_sphere.JPEG#/media/File:Flatland_sphere.JPEG
  • 19. THANK YOU. josh.snyder@novartis.com Presentation and project victor.hong@novartis.com laurent.galafassi@novartis.com nabil.hachem@novartis.com NIBR Data Engineering