More Data, More Problems: Evolving big data machine learning pipelines with Spark & Luigi

Alex Sadovsky
Director of Data Science: Oracle Data Cloud
alex.sadovsky@oracle.com
It's like the more data we come across
The more problems we see

Data Science is growing up
For Data Science to succeed, we need to learn to play
well with others.
Important Business Decisions
How will operations adapt to this code change?
We’ll need a classifier capable of capturing non-linear interactions!

Who are the players?
• Data Ingest
• Operations
• Product
• Architecture
• Finance / Investors
• Data Scientists
Litmus Test: These are the parties involved in every Data Science Product

What is success?
Success is getting from A to B with everyone staying happy
Data In
• Product Goals & Data Science Realization
• Operational Insight
• Utilization of Existing Architecture
• Don’t Break the Bank
Data Out

Are we even talking about Spark today?
Outline:
1. Automated ML services
2. Scikit-Learn Pipelines
3. Spark Pipelines
4. Spotify’s Luigi
5. Data Science Pipelines: Spark + Luigi
Spoiler alert:
Spark is still going to be the answer to all of our big data problems

ML on AWS: A simple model
{
  "version" : "1.0",
  "rowId" : null,
  "rowWeight" : null,
  "targetAttributeName" : "y",
  "dataFormat" : "CSV",
  "dataFileContainsHeader" : true,
  "attributes" : [ {
    "attributeName" : "age",
    "attributeType" : "NUMERIC"
  }
  …
}
{
  "MLModelId": "string",
  "MLModelName": "string",
  "MLModelType": "string",
  "Parameters":
  {
    "string" :
    "string"
  },
  "Recipe": "string",
  "RecipeUri": "string",
  "TrainingDataSourceId": "string"
}
Data Models
{
  "groups": {
    "LONGTEXT": "group_remove(ALL_TEXT, title, subject)",
    "SPECIALTEXT": "group(title, subject)",
    "BINCAT": "group(ALL_CATEGORICAL, ALL_BINARY)"
  },
  "assignments": {
    "binned_age" : "quantile_bin(age,30)",
    "country_gender_interaction" : "cartesian(country, gender)"
  },
  "outputs": [
    "lowercase(no_punct(LONGTEXT))",
    "ngram(lowercase(no_punct(SPECIALTEXT)),3)",
    "quantile_bin(hours-per-week, 10)",
    "cartesian(binned_age, quantile_bin(hours-per-week,10)) // this one is critical",
    "country_gender_interaction",
    "BINCAT"
  ]
}
Recipes

ML on AWS: Who’s happy?
Product* Architecture
Scoring 3 billion records = ($0.10 / 1,000) * 3,000,000,000 = $300,000 USD + compute fees
*Amazon Machine Learning can train models on datasets up to 100 GB in size.
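
The scoring estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the $0.10 per 1,000 predictions rate and the 3 billion record count are the figures quoted on the slide, the function name is hypothetical, and compute fees are left out.

```python
from decimal import Decimal

def batch_scoring_cost(num_records, price_per_thousand):
    """Prediction fee in USD for scoring num_records (compute fees excluded)."""
    return Decimal(num_records) / Decimal(1000) * Decimal(price_per_thousand)

# Figures from the slide: 3 billion records at $0.10 per 1,000 predictions.
cost = batch_scoring_cost(3_000_000_000, "0.10")
print(f"${cost:,.2f}")
```

Using `Decimal` rather than floats keeps the dollar amount exact, which matters once the per-record price is a small fraction of a cent.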

Scikit-Learn Pipelines
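from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Without a pipeline (ytrain: training labels, assumed to exist alongside Xtrain)
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
predicted = clf.fit(tfidfX, ytrain).predict(tfidfX)
# Now evaluate all steps on test set (transform only; never refit on test data)
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)

VS

# With a pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)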

Scikit-Learn Pipelines: Who’s happy?
I thought we were going to have Big Data…

Big Data? Sounds like we need Spark.
• Great for data manipulation
• Great for large scale modeling

Spark Pipelines
Awesome… but they don’t really take us end to end for anything but modeling

Spark Pipelines: Who’s happy?
Just not flexible enough to comprise a whole product

So why (not) Spark?
• Great for data manipulation
• Great for large scale modeling
• Not a data warehouse
• Not needed for reporting
• Not needed for operational insight
– If anything, it’s an error source!

Spotify’s Luigi
https://github.com/spotify/luigi
Luigi is a pipeline tool for workflow management
• Apache 2.0 License
• Similar to the Make utility on Linux
• You have tasks which have dependencies
• Luigi makes sure those dependencies are met
• Similar to Spark
– It creates a directed acyclic graph and executes accordingly

How does it work: Tasks & Targets
• Tasks
– Code we want to run (that requires other tasks)
– Tasks output targets
• Targets
– A desired state
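
As a rough illustration of the task and target idea (a toy model in plain Python, not Luigi's actual implementation), each task below names the tasks it requires and the target it produces, and a small runner executes a task only if its target does not exist yet. The class names, the `build` function, and the file names are all hypothetical.

```python
import os
import tempfile

class FileTarget:
    """A desired state: 'this file exists'."""
    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)

class ToyTask:
    """Code to run, the tasks it requires, and the target it outputs."""
    def __init__(self, name, workdir, requires=()):
        self.name = name
        self.requires = list(requires)
        self.target = FileTarget(os.path.join(workdir, name + ".done"))

    def run(self):
        with open(self.target.path, "w") as f:
            f.write(self.name + " finished\n")

def build(task, executed):
    """Run dependencies first (depth-first), skipping tasks whose target exists."""
    if task.target.exists():
        return
    for dep in task.requires:
        build(dep, executed)
    task.run()
    executed.append(task.name)

workdir = tempfile.mkdtemp()
ingest = ToyTask("ingest", workdir)
features = ToyTask("features", workdir, requires=[ingest])
model = ToyTask("model", workdir, requires=[features])

first_run = []
build(model, first_run)
print(first_run)   # dependencies run before the tasks that need them

second_run = []
build(model, second_run)
print(second_run)  # nothing to do: every target already exists
```

The second call does no work because completeness is judged by the target's state, not by whether the code has been invoked in this process.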

Luigi works with anything
• Tasks
– Hadoop commands
– Spark jobs
– Python, Perl, Fortran, shell scripts
– Anything that can be wrapped in a Python “run” method
• Targets
– local, S3, FTP, HDFS files
– database entries
– Anything for which a Python wrapper can return “true” when it exists

It’s all Python too
• No XML or YAML
• Configurable via code

Foo and Bar
import luigi

class Foo(luigi.WrapperTask):
    def run(self):
        print("Running Foo")

    def requires(self):
        yield Bar()

Foo and Bar
class Bar(luigi.Task):
    def run(self):
        # Writing the output file is what marks this task as complete
        f = self.output().open('w')
        f.write("hello, foobar world\n")
        f.close()

    def output(self):
        return luigi.LocalTarget('/tmp/bar')

/anaconda/bin/python /Users/alexander.sadovsky/PycharmProjects/megalodon/foo2.py Foo
DEBUG: Checking if examples.Foo() is complete
DEBUG: Checking if examples.Bar() is complete
INFO: Informed scheduler that task examples.Foo() has status PENDING
INFO: Informed scheduler that task examples.Bar() has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running examples.Bar()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done examples.Bar()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task examples.Bar() has status DONE
DEBUG: Pending tasks: 1
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running examples.Foo()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done examples.Foo()
DEBUG: 1 running tasks, waiting for next task to finish
Running Foo
INFO: Informed scheduler that task examples.Foo() has status DONE
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====
Scheduled 2 tasks of which:
* 2 ran successfully:
- 1 examples.Bar()
- 1 examples.Foo()
This progress looks :) because there were no failed tasks or missing external dependencies
===== Luigi Execution Summary =====
Process finished with exit code 0

Foo and Bars
import luigi

class Foo(luigi.WrapperTask):
    def run(self):
        print("Running Foo")

    def requires(self):
        # Ten independent Bar tasks, one per value of num
        for i in range(10):
            yield Bar(i)

class Bar(luigi.Task):
    # IntParameter so that '%d' formatting also works when num comes from the command line
    num = luigi.IntParameter()

    def run(self):
        f = self.output().open('w')
        f.write("hello, foobar world\n")
        f.close()

    def output(self):
        return luigi.LocalTarget('/tmp/bar/%d' % self.num)

Foo and Bars: Parallel Processing

What about Spark?
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext()
    # Parentheses let the method chain span several lines
    (sc.textFile(sys.argv[1])
       .flatMap(lambda line: line.split())
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
       .saveAsTextFile(sys.argv[2]))
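
To see what the chain of transformations computes, the same steps can be mimicked on a small in-memory list with the standard library. This is only an analogy for `flatMap`, `map`, and `reduceByKey`; it does not use Spark, and the sample lines are made up.

```python
from collections import defaultdict

lines = ["more data more problems", "more spark"]

# flatMap: split every line and flatten the results into one list of words
words = [word for line in lines for word in line.split()]

# map: pair each word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of equal keys with a + b
counts = defaultdict(int)
for word, n in pairs:
    counts[word] = counts[word] + n

print(dict(counts))
```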

What about Spark?
from luigi.contrib.s3 import S3Target
from luigi.contrib.spark import PySparkTask

class InlinePySparkWordCount(PySparkTask):
    def input(self):
        return S3Target("s3n://bucket.example.org/wordcount.input")

    def output(self):
        return S3Target('s3n://bucket.example.org/wordcount.output')

    def main(self, sc, *args):
        # Same word count as before; Luigi supplies the SparkContext as sc
        (sc.textFile(self.input().path)
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b)
           .saveAsTextFile(self.output().path))

Modeling Pipeline
[Diagram: the modeling pipeline drawn as a task graph. Stages shown: Data, Modeling ID Selection, Variable Creation, Variable Reduction, Model, Scoring ID Selection, Scoring Variable Creation, Scoring.]

But wait! There’s more!
• Failure retries are built in
• Upstream failures will stop downstream processing
• If targets are files, filesystem states, or database states, entire pipelines can be rerun without actually “re-running” every step
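
A toy sketch of the first two points, in plain Python rather than Luigi itself: a step is retried a fixed number of times, and if it still fails, the steps after it are never started. The function names, the retry count, and the step names are hypothetical, and Luigi's own retry handling is not reproduced here.

```python
def run_with_retries(step, retries):
    """Call step() up to retries + 1 times; return True on the first success."""
    for attempt in range(retries + 1):
        try:
            step()
            return True
        except Exception as exc:
            print(f"{step.__name__} failed on attempt {attempt + 1}: {exc}")
    return False

def run_pipeline(steps, retries=2):
    """Run steps in order; stop at the first step that exhausts its retries."""
    completed = []
    for step in steps:
        if not run_with_retries(step, retries):
            break  # upstream failure: downstream steps are never started
        completed.append(step.__name__)
    return completed

calls = {"flaky_ingest": 0, "score": 0}

def flaky_ingest():
    # Fails once, then succeeds, to exercise the retry path.
    calls["flaky_ingest"] += 1
    if calls["flaky_ingest"] < 2:
        raise RuntimeError("temporary outage")

def broken_features():
    # Always fails, so it exhausts its retries.
    raise ValueError("bad input schema")

def score():
    calls["score"] += 1

completed = run_pipeline([flaky_ingest, broken_features, score])
print(completed)
```

Here `flaky_ingest` recovers on its second attempt, `broken_features` uses up every attempt, and `score` is never called.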

Spark + Luigi: Who’s happy?
Everybody!
