More Data, More Problems:
Evolving big data machine learning pipelines with Spark & Luigi
Alex Sadovsky
Director of Data Science: Oracle Data Cloud
alex.sadovsky@oracle.com
It's like the more data we come across
The more problems we see
Data Science is growing up
Data Science is growing up
For Data Science to succeed, we need to learn to play well with others.
Important Business Decisions
How will operations adapt to this code change?
We’ll need a classifier capable of capturing non-linear interactions!
Who are the players?
• Data Ingest
• Operations
• Product
• Architecture
• Finance / Investors
• Data Scientists
Litmus Test: These are the parties involved in every Data Science Product
What is success?
Success is getting from A to B with everyone staying happy
Data In
Product Goals & Data Science Realization
Operational Insight
Utilization of Existing Architecture
Don’t Break the Bank
Data Out
Are we even talking about Spark today?
Outline:
1. Automated ML services
2. Scikit-Learn Pipelines
3. Spark Pipelines
4. Spotify’s Luigi
5. Data Science Pipelines: Spark + Luigi
Spoiler alert:
Spark is still going to be the answer to all of our big data problems
Amazon Machine Learning
ML on AWS: A simple model
{
  "version" : "1.0",
  "rowId" : null,
  "rowWeight" : null,
  "targetAttributeName" : "y",
  "dataFormat" : "CSV",
  "dataFileContainsHeader" : true,
  "attributes" : [ {
    "attributeName" : "age",
    "attributeType" : "NUMERIC"
  }
  …
}

{
  "MLModelId": "string",
  "MLModelName": "string",
  "MLModelType": "string",
  "Parameters": {
    "string" : "string"
  },
  "Recipe": "string",
  "RecipeUri": "string",
  "TrainingDataSourceId": "string"
}
Data Models
{
  "groups": {
    "LONGTEXT": "group_remove(ALL_TEXT, title, subject)",
    "SPECIALTEXT": "group(title, subject)",
    "BINCAT": "group(ALL_CATEGORICAL, ALL_BINARY)"
  },
  "assignments": {
    "binned_age" : "quantile_bin(age,30)",
    "country_gender_interaction" : "cartesian(country, gender)"
  },
  "outputs": [
    "lowercase(no_punct(LONGTEXT))",
    "ngram(lowercase(no_punct(SPECIALTEXT)),3)",
    "quantile_bin(hours-per-week, 10)",
    "cartesian(binned_age, quantile_bin(hours-per-week,10)) // this one is critical",
    "country_gender_interaction",
    "BINCAT"
  ]
}
Recipes
ML on AWS: A simple model
ML on AWS: Who’s happy?
• Data Ingest
• Operations
• Product*
• Architecture
• Finance / Investors
• Data Scientists
Scoring 3 billion records = ($0.10 / 1000) * 3 000 000 000 = $300,000 USD + compute fees
*Amazon Machine Learning can train models on datasets up to 100 GB in size.
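The scoring estimate above can be reproduced in a few lines. This is a minimal sketch that assumes the per-prediction price quoted on the slide ($0.10 per 1,000 records); the function name `scoring_cost` is made up for illustration, and compute fees are left out.

```python
from decimal import Decimal

# Assumed from the slide: predictions are billed at $0.10 per 1,000 records.
PRICE_PER_1000 = Decimal("0.10")


def scoring_cost(records: int, price_per_1000: Decimal = PRICE_PER_1000) -> Decimal:
    """Return the prediction fee in USD, excluding compute fees."""
    return price_per_1000 * records / 1000


if __name__ == "__main__":
    # 3 billion records, as on the slide.
    print(f"${scoring_cost(3_000_000_000):,.2f}")
```

Decimal arithmetic is used so the dollar figure is not affected by binary floating-point rounding.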
Scikit-Learn Pipelines
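from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Xtrain / Xtest are lists of documents; ytrain (the training labels) is
# assumed here, since the slide omits it.

# Manual approach: every step is fitted and applied by hand
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
predicted = clf.fit(tfidfX, ytrain).predict(tfidfX)
# Now evaluate all steps on test set (transform only; never refit on test data)
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)

VS

# Pipeline approach: the same steps chained into a single estimator
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)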
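What the pipeline side of the comparison buys you is that each step is fitted once on the training data and only applied to the test data. The sketch below shows that mechanism with toy classes in plain Python; `Scale`, `Shift` and `ToyPipeline` are hypothetical names for illustration and are not scikit-learn's implementation.

```python
class Scale:
    """Toy transformer: learns the training maximum and divides by it."""

    def fit(self, xs):
        self.max_ = max(xs)
        return self

    def transform(self, xs):
        return [x / self.max_ for x in xs]


class Shift:
    """Toy transformer: learns the training mean and subtracts it."""

    def fit(self, xs):
        self.mean_ = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean_ for x in xs]


class ToyPipeline:
    """Chain steps so that fit() learns parameters and transform() only applies them."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, xs):
        # Each step is fitted on the output of the previous step.
        for step in self.steps:
            xs = step.fit(xs).transform(xs)
        return self

    def transform(self, xs):
        # No refitting here: parameters learned on the training data are reused.
        for step in self.steps:
            xs = step.transform(xs)
        return xs


pipe = ToyPipeline([Scale(), Shift()]).fit([2.0, 4.0])
test_output = pipe.transform([8.0])
print(test_output)
```

Calling `transform` on new data leaves the learned maximum and mean untouched, which is the property the manual version has to preserve by hand.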
Scikit-Learn Pipelines: Who’s happy?
• Data Ingest
• Operations
• Product
• Architecture
• Finance / Investors
• Data Scientists
I thought we were going to have Big Data…
Big Data? Sounds like we need Spark.
• Great for data manipulation
• Great for large scale modeling
Spark Pipelines
Awesome… but don’t really take us end to end for anything but modeling
Spark Pipelines: Who’s happy?
• Data Ingest
• Operations
• Product
• Architecture
• Finance / Investors
• Data Scientists
Just not flexible enough to comprise a whole product
So why (not) Spark?
• Great for data manipulation
• Great for large scale modeling
• Not a data warehouse
• Not needed for reporting
• Not needed for operational insight
– If anything, it’s an error source!
Pipelines
Spotify’s Luigi
https://github.com/spotify/luigi
Luigi is a pipeline tool for workflow management
• Apache 2.0 License
• Similar to the Make utility on Linux
• You have tasks which have dependencies
• Luigi makes sure those dependencies are met
• Similar to Spark
• It creates a directed acyclic graph and executes accordingly
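To make the "directed acyclic graph" idea concrete, here is a small sketch of deriving an execution order from declared dependencies. It uses Python's standard-library `graphlib` (Python 3.9+) rather than Luigi itself, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each key maps to the set of tasks it requires.
requires = {
    "score": {"train_model", "build_features"},
    "train_model": {"build_features"},
    "build_features": {"ingest"},
    "ingest": set(),
}


def execution_order(graph):
    """Return the tasks in an order where every dependency comes first."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(requires)
print(order)
```

A cycle in the dependency mapping would make `static_order()` raise `graphlib.CycleError`, which is why the graph has to be acyclic.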
Or in meme form:
How does it work: Tasks & Targets
• Tasks
– Code we want to run (that requires other tasks)
– Tasks output targets
• Targets
– A desired state
Luigi works with anything
• Tasks
– Hadoop commands
– Spark jobs
– Python, Perl, Fortran, shell scripts
– Anything that can be wrapped in a Python “run” method
• Targets
– local, S3, FTP, HDFS files
– database entries
– Anything that can let a Python wrapper return “true” when it exists
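The task/target contract described above can be sketched without Luigi at all: a target is anything that can answer "do I exist?", and a task only needs to run when its target is missing. The classes below (`FileTarget`, `WriteGreeting`) and the helper `run_if_needed` are toy stand-ins invented for this illustration, not Luigi's own classes.

```python
import os
import tempfile


class FileTarget:
    """Toy target: the desired state is 'this file exists'."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class WriteGreeting:
    """Toy task: running it produces its target."""

    def __init__(self, path):
        self.target = FileTarget(path)

    def complete(self):
        # A task counts as done when its target already exists.
        return self.target.exists()

    def run(self):
        with open(self.target.path, "w") as f:
            f.write("hello\n")


def run_if_needed(task):
    """Run the task only when its target is missing; report whether it ran."""
    if task.complete():
        return False
    task.run()
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        task = WriteGreeting(os.path.join(tmp, "greeting.txt"))
        print(run_if_needed(task))  # first call creates the file
        print(run_if_needed(task))  # second call finds it and skips
```

Swapping `FileTarget` for a class whose `exists()` checks a database row or an object store would leave the rest unchanged, which is the flexibility the slide is pointing at.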
It’s all Python too
• No XML or YAML
• Configurable via code
Foo and Bar
import luigi


class Foo(luigi.WrapperTask):
    def run(self):
        print("Running Foo")

    def requires(self):
        # Foo depends on Bar, so Bar is scheduled first
        yield Bar()
Foo and Bar
class Bar(luigi.Task):
    def run(self):
        # Write the file that output() declares as this task's target
        f = self.output().open('w')
        f.write("hello, foobar world\n")
        f.close()

    def output(self):
        return luigi.LocalTarget('/tmp/bar')
/anaconda/bin/python /Users/alexander.sadovsky/PycharmProjects/megalodon/foo2.py Foo
DEBUG: Checking if examples.Foo() is complete
DEBUG: Checking if examples.Bar() is complete
INFO: Informed scheduler that task examples.Foo() has status PENDING
INFO: Informed scheduler that task examples.Bar() has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running
examples.Bar()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done
examples.Bar()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task examples.Bar() has status DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running
examples.Foo()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done
examples.Foo()
DEBUG: 1 running tasks, waiting for next task to finish
Running Foo
INFO: Informed scheduler that task examples.Foo() has status DONE
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) was stopped. Shutting down
Keep-Alive thread
INFO:
===== Luigi Execution Summary =====
Scheduled 2 tasks of which:
* 2 ran successfully:
- 1 examples.Bar()
- 1 examples.Foo()
This progress looks :) because there were no failed tasks or missing external dependencies
===== Luigi Execution Summary =====
Process finished with exit code 0
Foo and Bars
Foo and Bars
class Foo(luigi.WrapperTask):
    def run(self):
        print("Running Foo")

    def requires(self):
        # Ten independent Bar tasks, one per value of num
        for i in range(10):
            yield Bar(i)


class Bar(luigi.Task):
    # IntParameter so that the '%d' formatting in output() always gets an integer
    num = luigi.IntParameter()

    def run(self):
        f = self.output().open('w')
        f.write("hello, foobar world\n")
        f.close()

    def output(self):
        return luigi.LocalTarget('/tmp/bar/%d' % self.num)
Foo and Bars: Parallel Processing
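Because the ten Bar tasks do not depend on each other, they can run side by side. The sketch below imitates that with a standard-library thread pool; it is only an analogy for illustration (Luigi uses its own workers), and `write_bar` and `run_bars` are hypothetical helper names.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_bar(directory, num):
    """Stand-in for one Bar(num) task: write a single output file."""
    path = os.path.join(directory, str(num))
    with open(path, "w") as f:
        f.write("hello, foobar world\n")
    return path


def run_bars(directory, count=10, workers=4):
    """Run `count` independent tasks on a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: write_bar(directory, n), range(count)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = run_bars(tmp)
        print(len(paths), "files written")
```

Each task writes to its own file, so no coordination between workers is needed beyond waiting for all of them to finish.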
What about Spark?
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext()
    # Classic word count: input path in argv[1], output path in argv[2]
    sc.textFile(sys.argv[1]) \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .saveAsTextFile(sys.argv[2])
What about Spark?
from luigi.contrib.s3 import S3Target
from luigi.contrib.spark import PySparkTask


class InlinePySparkWordCount(PySparkTask):
    def input(self):
        return S3Target("s3n://bucket.example.org/wordcount.input")

    def output(self):
        return S3Target('s3n://bucket.example.org/wordcount.output')

    def main(self, sc, *args):
        # Same word count as before, but paths come from the Luigi targets
        sc.textFile(self.input().path) \
            .flatMap(lambda line: line.split()) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .saveAsTextFile(self.output().path)
[Diagram: Modeling Pipeline, with the stages Data, Modeling ID Selection, Variable Creation, Variable Reduction, Model, Scoring ID Selection, Scoring Variable Creation, Scoring]
But wait! There’s more!
• Failure retries are built in
• Upstream failures will stop downstream processing
• If tasks are files/filesystem/database states, entire pipelines can be rerun without actually “re-running” every step
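The two behaviours in the last bullets, blocking downstream work after a failure and skipping finished steps on a rerun, can be illustrated with a toy runner. This is a sketch under simplifying assumptions: the function `run_pipeline`, the task names, and the in-memory `done` set (standing in for targets that exist on disk or in a database) are all hypothetical and do not reflect Luigi's scheduler.

```python
def run_pipeline(order, requires, actions, done):
    """Run tasks in dependency order.

    order    -- task names, upstream first
    requires -- task name -> list of upstream task names
    actions  -- task name -> zero-argument callable
    done     -- set of tasks whose outputs already exist (updated in place)
    Returns a dict mapping each task to 'skipped', 'ran', 'failed' or 'blocked'.
    """
    status = {}
    for name in order:
        if name in done:
            status[name] = "skipped"  # output already exists, nothing to do
        elif any(status[up] in ("failed", "blocked") for up in requires[name]):
            status[name] = "blocked"  # an upstream failure stops this task
        else:
            try:
                actions[name]()
                done.add(name)
                status[name] = "ran"
            except Exception:
                status[name] = "failed"
    return status


attempts = {"features": 0}


def flaky_features():
    """Fails on the first call only, to simulate a transient error."""
    attempts["features"] += 1
    if attempts["features"] == 1:
        raise RuntimeError("transient failure")


order = ["ingest", "features", "model"]
requires = {"ingest": [], "features": ["ingest"], "model": ["features"]}
actions = {"ingest": lambda: None, "features": flaky_features, "model": lambda: None}
done = set()

first = run_pipeline(order, requires, actions, done)
second = run_pipeline(order, requires, actions, done)
print(first)
print(second)
```

On the second pass only the steps that did not finish are executed, which is what makes rerunning a whole pipeline cheap.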
Spark + Luigi: Who’s happy?
• Data Ingest
• Operations
• Product
• Architecture
• Finance / Investors
• Data Scientists
Everybody!
How does it scale?
Shout out to my awesome team
I’m (almost) always hiring
Questions?

More Data, More Problems: Evolving big data machine learning pipelines with Spark & Luigi

  • 1. More Data, More Problems: Evolving big data machine learning pipelines with Spark & Luigi Alex Sadovsky Director of Data Science: Oracle Data Cloud alex.sadovsky@oracle.com It's like the more data we come across The more problems we see
  • 2. Data Science is growing up
  • 3. Data Science is growing up For Data Science to succeed, we need to learn to play well with others. Important Business Decisions How will operations adapt to this code change? We’ll need a classifier capable of capturing non- linear interactions!
  • 4. Who are the players? Data Ingest Operations Product Architecture Finance / Investors Data Scientists Litmus Test: These are the parties involved in every Data Science Product
  • 5. What is success? Success is getting from A to B with everyone staying happy Data In Product Goals & Data Science Realization Operational Insight Utilization of Existing Architecture Don’t Break the Bank Data Out
  • 6. Are we even talking about Spark today? Outline: 1. Automated ML services 2. Scikit-Learn Pipelines 3. Spark Pipelines 4. Spotify’s Luigi 5. Data Science Pipelines: Spark + Luigi Spoiler alert: Spark is still going to be the answer to all of our big data problems
  • 8. ML on AWS: A simple model Data: { "version" : "1.0", "rowId" : null, "rowWeight" : null, "targetAttributeName" : "y", "dataFormat" : "CSV", "dataFileContainsHeader" : true, "attributes" : [ { "attributeName" : "age", "attributeType" : "NUMERIC" } … } Models: { "MLModelId": "string", "MLModelName": "string", "MLModelType": "string", "Parameters": { "string" : "string" }, "Recipe": "string", "RecipeUri": "string", "TrainingDataSourceId": "string" } Recipes: { "groups": { "LONGTEXT": "group_remove(ALL_TEXT, title, subject)", "SPECIALTEXT": "group(title, subject)", "BINCAT": "group(ALL_CATEGORICAL, ALL_BINARY)" }, "assignments": { "binned_age" : "quantile_bin(age,30)", "country_gender_interaction" : "cartesian(country, gender)" }, "outputs": [ "lowercase(no_punct(LONGTEXT))", "ngram(lowercase(no_punct(SPECIALTEXT)),3)", "quantile_bin(hours-per-week, 10)", "cartesian(binned_age, quantile_bin(hours-per-week,10)) // this one is critical", "country_gender_interaction", "BINCAT" ] }
  • 9. ML on AWS: A simple model
  • 10. ML on AWS: Who’s happy? Data Ingest Operations Product* Architecture Finance / Investors Data Scientists Scoring 3 billion records = ($0.10 / 1000) * 3 000 000 000 = $300,000 USD + compute fees *Amazon Machine Learning can train models on datasets up to 100 GB in size.
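The scoring estimate on this slide is plain per-prediction arithmetic. A quick sketch, taking the slide's figure of $0.10 per 1,000 predictions as given and leaving compute fees out; the function name is illustrative, not part of any AWS SDK:

```python
def batch_scoring_cost(records, price_per_thousand=0.10):
    """Return the prediction fee in USD for scoring `records` rows.

    price_per_thousand is the slide's figure ($0.10 per 1,000 predictions);
    compute fees are charged separately and are not included here.
    """
    return records / 1000 * price_per_thousand


cost = batch_scoring_cost(3_000_000_000)
print(f"${cost:,.0f} USD + compute fees")
```

The fee grows linearly with the number of records, which is why a per-prediction price that looks small becomes the dominant cost at billions of rows.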
  • 11. Scikit-Learn Pipelines vect = CountVectorizer() tfidf = TfidfTransformer() clf = SGDClassifier() vX = vect.fit_transform(Xtrain) tfidfX = tfidf.fit_transform(vX) clf.fit(tfidfX, ytrain) predicted = clf.predict(tfidfX) # Now evaluate all steps on test set (transform only, no refitting) vX = vect.transform(Xtest) tfidfX = tfidf.transform(vX) predicted = clf.predict(tfidfX) VS pipeline = Pipeline([ ('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SGDClassifier()), ]) predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain) # Now evaluate all steps on test set predicted = pipeline.predict(Xtest)
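What the pipeline object buys you is that every step is fitted on the training data once and then only applied to new data. A minimal pure-Python sketch of that idea follows; `VocabCounter`, `NearestCentroid`, and `MiniPipeline` are toy classes invented for illustration and are not scikit-learn's implementation:

```python
class VocabCounter:
    """Toy transformer: learns a vocabulary in fit, counts words in transform."""

    def fit(self, docs):
        self.vocab = sorted({w for d in docs for w in d.split()})
        return self

    def transform(self, docs):
        return [[d.split().count(w) for w in self.vocab] for d in docs]


class NearestCentroid:
    """Toy classifier: predicts the label whose mean vector is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda lab: dist(x, self.centroids[lab]))
            for x in X
        ]


class MiniPipeline:
    """Chain transformers and a final estimator behind one fit/predict pair."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, X, y):
        for t in self.transformers:
            X = t.fit(X).transform(X)  # learn state from training data only
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        for t in self.transformers:
            X = t.transform(X)  # reuse learned state, never refit
        return self.estimator.predict(X)


train = ["spark is fast", "luigi runs tasks", "spark scales", "luigi schedules tasks"]
labels = ["spark", "luigi", "spark", "luigi"]
pipe = MiniPipeline([VocabCounter()], NearestCentroid()).fit(train, labels)
print(pipe.predict(["spark is big", "tasks for luigi"]))
```

Because `predict` only ever calls `transform`, there is no way to refit the vocabulary on test data by accident, which is the mistake the step-by-step version invites.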
  • 12. Scikit-Learn Pipelines: Who’s happy? Data Ingest Operations Product Architecture Finance / Investors Data Scientists I thought we were going to have Big Data…
  • 13. Big Data? Sounds like we need Spark. • Great for data manipulation • Great for large scale modeling
  • 14. Spark Pipelines Awesome… but don’t really take us end to end for anything but modeling
  • 15. Spark Pipelines: Who’s happy? Data Ingest Operations Product Architecture Finance / Investors Data Scientists Just not flexible enough to comprise a whole product
  • 16. So why (not) Spark? • Great for data manipulation • Great for large scale modeling • Not a data warehouse • Not needed for reporting • Not needed for operational insight – If anything, it’s an error source!
  • 18. Spotify’s Luigi https://github.com/spotify/luigi Luigi is a pipeline tool for workflow management • Apache 2.0 License • Similar to the Make utility on Linux • You have tasks which have dependencies • Luigi makes sure those dependencies are met • Similar to Spark • It creates a directed acyclic graph and executes accordingly
  • 19. Or in meme form:
  • 20. How does it work: Tasks & Targets • Tasks – Code we want to run (that requires other tasks) – Tasks output targets • Targets – A desired state
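The task/target relationship can be sketched in a few lines of plain Python. This is a hypothetical stand-in, not Luigi's API or scheduler: `FileTarget`, `MiniTask`, and `build` are names made up for the sketch, and the `Foo`/`Bar` tasks simply mirror the deck's later example.

```python
import os
import tempfile


class FileTarget:
    """A desired state: the file at `path` exists."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class MiniTask:
    """Hypothetical stand-in for a task (not Luigi's real base class)."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return self.output().exists()


def build(task, log):
    """Run `task` after its dependencies, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


workdir = tempfile.mkdtemp()


class Bar(MiniTask):
    def output(self):
        return FileTarget(os.path.join(workdir, "bar"))

    def run(self):
        with open(self.output().path, "w") as f:
            f.write("hello, foobar world\n")


class Foo(MiniTask):
    def requires(self):
        return [Bar()]

    def output(self):
        return FileTarget(os.path.join(workdir, "foo"))

    def run(self):
        with open(self.output().path, "w") as f:
            f.write("foo done\n")


first, second = [], []
build(Foo(), first)   # Bar runs first, then Foo
build(Foo(), second)  # nothing runs: both targets already exist
print(first, second)
```

The second call does no work because completeness is decided by whether the target exists, not by any record of what ran before.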
  • 21. Luigi works with anything • Tasks – Hadoop commands – Spark jobs – Python, Perl, Fortran, shell scripts – Anything that can be wrapped in a Python "run" method • Targets – local, S3, FTP, HDFS files – database entries – Anything that can let a Python wrapper return "true" when it exists
  • 22. It’s all Python too • No XML or YAML • Configurable via code
  • 23. Foo and Bar class Foo( luigi.WrapperTask ): def run(self): print("Running Foo") def requires(self): yield Bar()
  • 24. Foo and Bar class Bar(luigi.Task): def run(self): f = self.output().open('w') f.write("hello, foobar world\n") f.close() def output(self): return luigi.LocalTarget('/tmp/bar')
  • 25. /anaconda/bin/python /Users/alexander.sadovsky/PycharmProjects/megalodon/foo2.py Foo DEBUG: Checking if examples.Foo() is complete DEBUG: Checking if examples.Bar() is complete INFO: Informed scheduler that task examples.Foo() has status PENDING INFO: Informed scheduler that task examples.Bar() has status PENDING INFO: Done scheduling tasks INFO: Running Worker with 1 processes DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 2 INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running examples.Bar() INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done examples.Bar() DEBUG: 1 running tasks, waiting for next task to finish INFO: Informed scheduler that task examples.Bar() has status DONE DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running examples.Foo() INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done examples.Foo() DEBUG: 1 running tasks, waiting for next task to finish Running Foo INFO: Informed scheduler that task examples.Foo() has status DONE DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time INFO: Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) was stopped. Shutting down Keep-Alive thread INFO: ===== Luigi Execution Summary ===== Scheduled 2 tasks of which: * 2 ran successfully: - 1 examples.Bar() - 1 examples.Foo() This progress looks :) because there were no failed tasks or missing external dependencies ===== Luigi Execution Summary ===== Process finished with exit code 0
  • 26.
  • 28. Foo and Bars class Foo(luigi.WrapperTask): def run(self): print("Running Foo") def requires(self): for i in range(10): yield Bar(i) class Bar(luigi.Task): num = luigi.IntParameter() def run(self): f = self.output().open('w') f.write("hello, foobar world\n") f.close() def output(self): return luigi.LocalTarget('/tmp/bar/%d' % self.num)
  • 29. Foo and Bars: Parallel Processing
  • 30. What about Spark? import sys from pyspark import SparkContext if __name__ == "__main__": sc = SparkContext() sc.textFile(sys.argv[1]) .flatMap(lambda line: line.split()) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a + b) .saveAsTextFile(sys.argv[2])
  • 31. What about Spark? class InlinePySparkWordCount(PySparkTask): def input(self): return S3Target("s3n://bucket.example.org/wordcount.input") def output(self): return S3Target('s3n://bucket.example.org/wordcount.output') def main(self, sc, *args): sc.textFile(self.input().path) .flatMap(lambda line: line.split()) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a + b) .saveAsTextFile(self.output().path)
  • 32. [Diagram: modeling and scoring pipeline. Boxes: Data, Modeling ID Selection, Variable Creation, Variable Reduction, Model, Scoring ID Selection, Scoring Variable Creation, Scoring]
  • 33.
  • 34.
  • 35. But wait! There’s more! • Failure retries are built in • Upstream failures will stop downstream processing • If tasks are files/filesystem/database states, entire pipelines can be rerun without actually “re-running” every step
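The first two behaviours, retrying a failed task and halting everything downstream of a task that keeps failing, can be illustrated with a generic sketch. This is not Luigi's scheduler; `run_with_retries`, `run_pipeline`, and the toy steps are hypothetical names used only to show the control flow:

```python
def run_with_retries(step, max_attempts=3):
    """Call `step` until it succeeds or attempts run out; return True on success."""
    for _ in range(max_attempts):
        try:
            step()
            return True
        except Exception:
            continue  # try again until the attempts are used up
    return False


def run_pipeline(steps, max_attempts=3):
    """Run named steps in order; stop at the first step that keeps failing."""
    finished = []
    for name, step in steps:
        if not run_with_retries(step, max_attempts):
            break  # upstream failure: downstream steps never start
        finished.append(name)
    return finished


calls = {"flaky": 0}


def flaky():
    """Fails on the first call, succeeds afterwards."""
    calls["flaky"] += 1
    if calls["flaky"] < 2:
        raise RuntimeError("transient failure")


def broken():
    """Fails on every call."""
    raise RuntimeError("permanent failure")


def downstream():
    pass


recovered = run_pipeline([("flaky", flaky), ("downstream", downstream)])
halted = run_pipeline([("broken", broken), ("downstream", downstream)])
print(recovered, halted)
```

In the first run the transient failure is absorbed by a retry and the downstream step proceeds; in the second, the downstream step is never started.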
  • 36. Spark + Luigi: Who’s happy? Data Ingest Operations Product Architecture Finance / Investors Data Scientists Everybody!
  • 37. How does it scale?
  • 38. Shout out to my awesome team

Editor's Notes

  1. One guy writing horrible code in R silo’ed away (story)…. Productionized, deployed code that a business depends on
  2. One guy writing horrible code in R silo’ed away (story)…. Productionized, deployed code that a business depends on
  3. One person or multiple teams:
  4. ML pipeline
  5. Bear with me, I’m getting there
  6. diagram.
  7. diagram
  8. Lost pic? One person or multiple teams:
  9. diagram.
  10. Lost pic? One person or multiple teams:
  11. diagram.
  12. Lost pic? One person or multiple teams:
  13. Mario is off getting the girl; Luigi is off creating world-class data science pipeline products
  14. Apache Oozie / LinkedIn Azkaban
  15. Apache Oozie / LinkedIn Azkaban
  16. Lost pic? One person or multiple teams: