SlideShare a Scribd company logo
Agile Data Science with Scala
by @DataFellas
Xavier Tordoir
xtordoir@data-fellas.guru
@xtordoir
Andy Petrella
noootsab@data-fellas.guru
@noootsab
Data Fellas
Andy Petrella
Maths
Geospatial
Distributed Computing
Spark Notebook
Trainer Spark/Scala
Machine Learning
Xavier Tordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
trainer Spark
Machine Learning
© Data Fellas SPRL 2016
● Pipeline: productizing Data Science
● Demo of Distributed Pipeline (Spark, Mesos, Akka, Cassandra, Kafka, Spark Notebook)
● Why Micro Services?
● Painful points:
○ Data science is Discontiguous
○ Context Lost in Translation
● Solution: Data Fellas’ Agile Data Science Toolkit
Lineup
So if you’re not sure you want to stay...
© Data Fellas SPRL 2016
Pipeline
Productizing Data Science
Modelling Coding Deploying
Finding Data
Parsing structures
Cleaning
(Reducing)
Learning
Predicting
Connect PROD data
Tuning training parameters
Create Prediction Service
Generate Deployable
Connect to PROD infrastructure
Integration with existing env
Allocate (schedule) resources
Ensure availability
© Data Fellas SPRL 2016
Distributed Data Science
Demo
All-In Spark Notebooks
Get data: Source → Kafka
Prepare View: Kafka → Cassandra
Train Model: Cassandra → ML...
Create Server: Cassandra/ML/... → Akka Http
Create Client: Json → Html Form, Chart, table, ...
© Data Fellas SPRL 2016
Bad Pipeline
Targeting Dashboard
Modelling Coding Deploying Dashboard
»»»
Data Scientist focusing on the dashboard/report instead of content
breaks reusability of data
time wasted on learning viz instead of increasing accuracy (or velocity)
monolithic instead of service oriented
© Data Fellas SPRL 2016
Extended Pipeline
Micro Services
Modelling Coding Deploying Integrating
Application
Creating
Services
Abstracts access to prepared views
Exposes Prediction capabilities
Highly horizontally scalable
Scaling micro services cluster
→ cheaper than computing cluster
Customer integration
Can be any technologies
Can even be another pipeline!
© Data Fellas SPRL 2016
Painful points
Data science is Discontiguous
➔ Highly heterogeneous environment
➔ Too many friction areas
➔ Time to market too long
Modelling Coding Deploying Integrating
Application
Scientist Data Eng. Ops. Eng. Web Eng. Customers
➔ No integration
➔ Error prone
➔ Schedule delays
Creating
Services
Frictions
Result: Lack of Agility
Collecting
Data Eng.
© Data Fellas SPRL 2016
Painful points
Context Lost in Translation
Data Lake Processing
Machine
Learning
Model
Output
Data
Input
Data
No contextual discovery No quality info
No lineage
(origin of the data)
Link to
process and
input discarded
Huge gap in architecture:
binary and schema aware
serving layer
Accuracy depends on
concealed quality of inputs
No schema!
hard and long integration,
poor satisfaction
Moreover:
No backward links → no agility and no context awareness
Result: Lack of Reproducibility
Application
Data Fellas…
Agile Data Science Toolkit
© Data Fellas SPRL 2016
Our Approach
Agile Data Science Toolkit
Automatic
Semantics
Engine
+ Autogenerated
Microservices
Integrated
End-to-End
Environment
Huge gain
in Time and
Reliability
+ =
Notebook
Computing
Cluster
Access
Layer
Knowledge
Base
Consumers
Customers
Exposes
database,
learning models,
stream sources,
notebooks, ...
data type
process
lineage
usage
Easy to Release
Easy to (Re)Use
Notebook
Version Control
(Git)
Spark Job Project
(SBT)
Service Projects
(SBT)
Metadata
(Doc, Logic, Schema, ...)
Catalog
(ElasticSearch)
Deployable
(Jar, Docker)
Repository
(Nexus, Docker Repo,
Pypi, Gem Server)
Client Projects
(Node.Js, Java, Scala,
Python, Ruby)
Publishable
(NPM, Jar,
Pip/EasyInstall, Gem)
scientist
data
Engineer
ops
Engineer
© Data Fellas SPRL 2016
Agile Data Science Toolkit
In a nutshell
© Data Fellas SPRL 2016
Agile Data Science Toolkit
In a nutshell
© Data Fellas SPRL 2016
Agile Data Science Toolkit
In a nutshell
© Data Fellas SPRL 2016
Agile Data Science Toolkit
In a nutshell
© Data Fellas SPRL 2016
Agile Data Science Toolkit
In a nutshell
© Data Fellas SPRL 2016
Agile Data Science Toolkit
In a nutshell
Data Fellas…
Announcements!!!
© Data Fellas SPRL 2016
O’Reilly
Online seminar
© Data Fellas SPRL 2016
Growing
We’re Hiring! http://www.data-fellas.guru/#skillsjobs
Q/A
References
http://www.data-fellas.guru/
http://spark-notebook.io/
https://github.com/andypetrella/spark-notebook/
https://gitter.im/andypetrella/spark-notebook
Come at Strata
-- London at least
-- We have two talks :-)

More Related Content

What's hot

Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
Paco Nathan
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
elephantscale
 
Introduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonIntroduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and Python
Jen Stirrup
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Databricks
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
Sri Ambati
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Rodney Joyce
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
Mark Kerzner
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
Sarah Guido
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners
Jen Stirrup
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
Josh Patterson
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Wes McKinney
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MD
Donald Miner
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Big Data Spain
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2O
Sri Ambati
 
Skutil - H2O meets Sklearn - Taylor Smith
Skutil - H2O meets Sklearn - Taylor SmithSkutil - H2O meets Sklearn - Taylor Smith
Skutil - H2O meets Sklearn - Taylor Smith
Sri Ambati
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
Impetus Technologies
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
Paco Nathan
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
Turi, Inc.
 
Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
Brian O'Neill
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
Vijay Srinivas Agneeswaran, Ph.D
 

What's hot (20)

Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Introduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonIntroduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and Python
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MD
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2O
 
Skutil - H2O meets Sklearn - Taylor Smith
Skutil - H2O meets Sklearn - Taylor SmithSkutil - H2O meets Sklearn - Taylor Smith
Skutil - H2O meets Sklearn - Taylor Smith
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 

Similar to Agile data science with scala

Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 
Part 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science WorkbenchPart 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science Workbench
Cloudera, Inc.
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
eRic Choo
 
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine LearningAUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
Sandesh Rao
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric Teams
Data Con LA
 
Breed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptxBreed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptx
GautamPopli1
 
Big Data for Data Scientists - Info Session
Big Data for Data Scientists - Info SessionBig Data for Data Scientists - Info Session
Big Data for Data Scientists - Info Session
WeCloudData
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
Seldon
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
Dr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
Dr. Mirko Kämpf
 
Sparkflows.io
Sparkflows.ioSparkflows.io
Sparkflows.io
sparkflows
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
James Serra
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
Stepan Pushkarev
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure Databricks
Alberto Diaz Martin
 
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAi & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Alberto Diaz Martin
 
AzureML TechTalk
AzureML TechTalkAzureML TechTalk
AzureML TechTalk
Udaya Kumar
 
Analytics and Lakehouse Integration Options for Oracle Applications
Analytics and Lakehouse Integration Options for Oracle ApplicationsAnalytics and Lakehouse Integration Options for Oracle Applications
Analytics and Lakehouse Integration Options for Oracle Applications
Ray Février
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Durga Gadiraju
 
Embedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern StaenderEmbedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern Staender
Dataconomy Media
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
Jason Hubbard
 

Similar to Agile data science with scala (20)

Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
 
Part 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science WorkbenchPart 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science Workbench
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine LearningAUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric Teams
 
Breed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptxBreed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptx
 
Big Data for Data Scientists - Info Session
Big Data for Data Scientists - Info SessionBig Data for Data Scientists - Info Session
Big Data for Data Scientists - Info Session
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Sparkflows.io
Sparkflows.ioSparkflows.io
Sparkflows.io
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure Databricks
 
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAi & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientist
 
AzureML TechTalk
AzureML TechTalkAzureML TechTalk
AzureML TechTalk
 
Analytics and Lakehouse Integration Options for Oracle Applications
Analytics and Lakehouse Integration Options for Oracle ApplicationsAnalytics and Lakehouse Integration Options for Oracle Applications
Analytics and Lakehouse Integration Options for Oracle Applications
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Embedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern StaenderEmbedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern Staender
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 

More from Andy Petrella

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best Pracices
Andy Petrella
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data Mapping
Andy Petrella
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooks
Andy Petrella
 
Governance compliance
Governance   complianceGovernance   compliance
Governance compliance
Andy Petrella
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
Data science governance and GDPR
Andy Petrella
 
Data science governance : what and how
Data science governance : what and howData science governance : what and how
Data science governance : what and how
Andy Petrella
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
Andy Petrella
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platform
Andy Petrella
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Andy Petrella
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
Andy Petrella
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browser
Andy Petrella
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open Science
Andy Petrella
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
Andy Petrella
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
Andy Petrella
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
Andy Petrella
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Quanti-litative Revolution in GIS
Quanti-litative Revolution in GISQuanti-litative Revolution in GIS
Quanti-litative Revolution in GIS
Andy Petrella
 
Scala and-fp-in-big-data
Scala and-fp-in-big-dataScala and-fp-in-big-data
Scala and-fp-in-big-data
Andy Petrella
 
Software Crafted And Libraries Available
Software Crafted And Libraries AvailableSoftware Crafted And Libraries Available
Software Crafted And Libraries Available
Andy Petrella
 

More from Andy Petrella (20)

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best Pracices
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data Mapping
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooks
 
Governance compliance
Governance   complianceGovernance   compliance
Governance compliance
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
Data science governance and GDPR
 
Data science governance : what and how
Data science governance : what and howData science governance : what and how
Data science governance : what and how
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platform
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browser
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open Science
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Quanti-litative Revolution in GIS
Quanti-litative Revolution in GISQuanti-litative Revolution in GIS
Quanti-litative Revolution in GIS
 
Scala and-fp-in-big-data
Scala and-fp-in-big-dataScala and-fp-in-big-data
Scala and-fp-in-big-data
 
Software Crafted And Libraries Available
Software Crafted And Libraries AvailableSoftware Crafted And Libraries Available
Software Crafted And Libraries Available
 

Recently uploaded

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 

Recently uploaded (20)

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 

Agile data science with scala

  • 1. Agile Data Science with Scala by @DataFellas Xavier Tordoir xtordoir@data-fellas.guru @xtordoir Andy Petrella noootsab@data-fellas.guru @noootsab
  • 2. Data Fellas Andy Petrella Maths Geospatial Distributed Computing Spark Notebook Trainer Spark/Scala Machine Learning Xavier Tordoir Physics Bioinformatics Distributed Computing Scala (& Perl) trainer Spark Machine Learning
  • 3. © Data Fellas SPRL 2016 ● Pipeline: productizing Data Science ● Demo of Distributed Pipeline (Spark, Mesos, Akka, Cassandra, Kafka, Spark Notebook) ● Why Micro Services? ● Painful points: ○ Data science is Discontiguous ○ Context Lost in Translation ● Solution: Data Fellas’ Agile Data Science Toolkit Lineup So if you’re not sure you want to stay...
  • 4. © Data Fellas SPRL 2016 Pipeline Productizing Data Science Modelling Coding Deploying Finding Data Parsing structures Cleaning (Reducing) Learning Predicting Connect PROD data Tuning training parameters Create Prediction Service Generate Deployable Connect to PROD infrastructure Integration with existing env Allocate (schedule) resources Ensure availability
  • 5. © Data Fellas SPRL 2016 Distributed Data Science Demo All-In Spark Notebooks Get data: Source → Kafka Prepare View: Kafka → Cassandra Train Model: Cassandra → ML... Create Server: Cassandra/ML/... → Akka Http Create Client: Json → Html Form, Chart, table, ...
  • 6. © Data Fellas SPRL 2016 Bad Pipeline Targeting Dashboard Modelling Coding Deploying Dashboard »»» Data Scientist focusing on the dashboard/report instead of content breaks reusability of data time wasted on learning viz instead of increasing accuracy (or velocity) monolithic instead of service oriented
  • 7. © Data Fellas SPRL 2016 Extended Pipeline Micro Services Modelling Coding Deploying Integrating Application Creating Services Abstracts access to prepared views Exposes Prediction capabilities Highly horizontally scalable Scaling micro services cluster → cheaper than computing cluster Customer integration Can be any technologies Can even be another pipeline!
  • 8. © Data Fellas SPRL 2016 Painful points Data science is Discontiguous ➔ Highly heterogeneous environment ➔ Too many friction areas ➔ Time to market too long Modelling Coding Deploying Integrating Application Scientist Data Eng. Ops. Eng. Web Eng. Customers ➔ No integration ➔ Error prone ➔ Schedule delays Creating Services Frictions Result: Lack of Agility Collecting Data Eng.
  • 9. © Data Fellas SPRL 2016 Painful points Context Lost in Translation Data Lake Processing Machine Learning Model Output Data Input Data No contextual discovery No quality info No lineage (origin of the data) Link to process and input discarded Huge gap in architecture: binary and schema aware serving layer Accuracy depends on concealed quality of inputs No schema! hard and long integration, poor satisfaction Moreover: No backward links → no agility and no context awareness Result: Lack of Reproducibility Application
  • 10. Data Fellas… Agile Data Science Toolkit
  • 11. © Data Fellas SPRL 2016 Our Approach Agile Data Science Toolkit Automatic Semantics Engine + Autogenerated Microservices Integrated End-to-End Environment Huge gain in Time and Reliability + = Notebook Computing Cluster Access Layer Knowledge Base Consumers Customers Exposes database, learning models, stream sources, notebooks, ... data type process lineage usage Easy to Release Easy to (Re)Use Notebook Version Control (Git) Spark Job Project (SBT) Service Projects (SBT) Metadata (Doc, Logic, Schema, ...) Catalog (ElasticSearch) Deployable (Jar, Docker) Repository (Nexus, Docker Repo, Pypi, Gem Server) Client Projects (Node.Js, Java, Scala, Python, Ruby) Publishable (NPM, Jar, Pip/EasyInstall, Gem) scientist data Engineer ops Engineer
  • 12. © Data Fellas SPRL 2016 Agile Data Science Toolkit In a nutshell
  • 13. © Data Fellas SPRL 2016 Agile Data Science Toolkit In a nutshell
  • 14. © Data Fellas SPRL 2016 Agile Data Science Toolkit In a nutshell
  • 15. © Data Fellas SPRL 2016 Agile Data Science Toolkit In a nutshell
  • 16. © Data Fellas SPRL 2016 Agile Data Science Toolkit In a nutshell
  • 17. © Data Fellas SPRL 2016 Agile Data Science Toolkit In a nutshell
  • 19. © Data Fellas SPRL 2016 O’Reilly Online seminar
  • 20. © Data Fellas SPRL 2016 Growing We’re Hiring! http://www.data-fellas.guru/#skillsjobs