Vladislav Supalov introduces data pipeline architecture and workflow engines like Luigi. He discusses how custom scripts are problematic for maintaining data pipelines and recommends using workflow engines instead. Luigi is presented as a Python-based workflow engine that was created at Spotify to manage thousands of daily Hadoop jobs. It provides features like parameterization, email alerts, dependency resolution, and task scheduling through a central scheduler. Luigi aims to minimize boilerplate code and make pipelines testable, versioning-friendly, and collaborative.
Workflow Engines + Luigi
1. Workflow Engines + Luigi
A broad overview and a brief introduction.
Vladislav Supalov, 8 March 2016
2. Hi, I’m Vladislav
● Wow, that’s neat. We can do cool stuff with data.
○ Machine Learning
○ Data Mining
○ Computer Vision
● DevOps? Shiny!
○ Lots of servers being useful and reliable <3
○ Automation
● Oh, so this is how businesses perceive things. I WAS BLIND.
○ Business goals and values
○ Measurable impact
3. Here’s What I Do
● Yes, please. All of this. Data engineering consulting.
○ “We built data stuff in-house and it delivers lots of value!”
■ “But it also sucks. We are losing money.”
■ “How can we do better?”
○ Mobile application marketing agencies
■ Not necessarily huge data
■ Very valuable and worthwhile topic
■ datapipelinearchitect.com
4. Not Necessarily Big Data
● There’s Big Data
○ It’s pretty fascinating, alright
○ Most companies are a few steps away from having these problems
● Let’s talk more about
○ Messy data (multiple data sources, no overview)
○ Tedious-to-handle data (multiple data sources, lots of manual work)
5. The Big Picture
Actually handling the data is a very small part.
Straightforward, once the business needs are clear.
It’s about communication and people.
6. My First Data Pipeline
● ~20 GB per day
● Legacy MongoDB setup
● BEFORE: “It takes HOURS to get query results!”
● AFTER: “Already done. That was hardly a minute.”
● Google BigQuery
○ Batch: daily
○ Streaming: near real-time
● … So I wrote some modular scripts from scratch (Python & Bash)
○ It worked alright
○ I’m so sorry!
7. What’s Wrong with Custom Scripts?
● What happens when the original author leaves?
○ the hit-by-bus criterion
● Cost of ownership
○ Learning curve, uniqueness
○ Maintenance time, tricky bugs, code duplication
○ Unexpected failure modes
● Extensibility
● Growth
● Metadata?
● You’re reinventing the wheel
8. Here’s What Most People Don’t Search For
“I need to get data from A to B on a regular basis!”
● ETL
○ Extract
○ Transform
○ Load
● Long history
● Even longer beard
● A lot of enterprise-grade tools
→ Data pipelines
9. Data Plumbing - There Are Many Approaches
● Data Virtuality
○ Access data across multiple sources transparently
○ Redshift used in the background intelligently
● Snowplow
○ “Event analytics platform” - designed to run on AWS services
○ Generate special events instead of plumbing existing data
● Segment.io
○ “Collect customer data with one API and send it to hundreds of tools for analytics, marketing, and data warehousing.”
http://datapipelinearchitect.com/tools-for-combining-multiple-data-sources/
10. Workflow Engines!
● Workflow = “[..] orchestrated and repeatable pattern of business activity [..]” [1]
● Data flow = “bunch of data processing tasks with inter-dependencies” [2]
● Pipelines of batch jobs
○ complex, long-running
● Dependency management
● Reusability of intermediate steps
● Logging and alerting
● Failure handling
● Monitoring
● Lots of effort went into them (Broken data? Crashes? Partial failures?)
[1] https://en.wikipedia.org/wiki/Workflow
[2] Elias Freider, 2013, “Luigi - Batch Data Processing in Python“
11. Workflow Engine Specimens
● Oozie
● Azkaban
○ XML, strong Hadoop ecosystem focus.
● Luigi
● Airflow
● Pinball
○ Glue!
● Google Cloud Dataflow
● AWS Data Pipeline
○ Managed! Fancy.
A nice comparison: http://bytepawn.com/luigi-airflow-pinball.html
12. Let’s Talk Luigi!
● Spotify
○ Lots of data!
○ 10k+ Hadoop jobs every day [1]
● Battle hardened
○ Published 2009
○ Has been used in production by large companies for a while
● Python
● Modular & extensible
● Dependency graph
● Not just for data tasks
[1] Erik Bernhardsson, 2013, “Building Data Pipelines with Python and Luigi”
13. Core Goals and Concepts
● Goals [1]
○ Minimize boilerplate code
○ As general as possible
○ Easy to go from test to production
● Dependencies modeled as directed acyclic graph (DAG)
● Tasks, Targets (see the sketch after this list)
● Assumptions:
○ Idempotency
○ Atomic file operations
■ File X is there? I’m done forever.
[1] Elias Freider, 2013, “Luigi - Batch Data Processing in Python“
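A minimal sketch of these ideas, with a made-up task name, parameter, and path: the task counts as done exactly when its output target exists, and the target writes through a temporary file, which is what keeps the file operation atomic.
import luigi

class MakeReport(luigi.Task):  # hypothetical task
    date = luigi.DateParameter()

    def output(self):
        # the existence of this file is the whole completeness check
        return luigi.LocalTarget("/data/report-%s.txt" % self.date)

    def run(self):
        # Target.open("w") stages a temporary file and moves it into place
        # on close, so a crash mid-run never leaves a half-written output
        with self.output().open("w") as out_file:
            out_file.write("report for %s\n" % self.date)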
14. What Luigi Provides
● Parametrization (command line arguments; example after this list)
● Email alerts
● Dependency resolution
● Retries
● History
● Visualizations
● Preventing duplication of effort
● Testable
● Versioning-friendly
● Collaborative
● Community!
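For instance, parametrization works by declaring parameters on the task class: the MakeReport sketch above declares a date parameter, which Luigi exposes as a --date command line flag, and each distinct value is a separate task instance with its own output (dataflow.py is just an assumed module name here):
$ python dataflow.py MakeReport --date 2016-03-07
$ python dataflow.py MakeReport --date 2016-03-08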
15. Workers and the Scheduler
(diagram from https://github.com/spotify/luigi)
16. Workers and the Scheduler
(diagram from https://github.com/spotify/luigi)
17. Workers and the Central Scheduler
● Workers
○ Crunch data
○ Started via cron, or by hand
● Scheduler
○ Not cron
○ Doesn’t do any data processing
○ Synchronization
○ Web interface - dashboard, visualizations
○ Prevents the same task from running multiple times
○ Edit configuration → run luigid (example below)
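A minimal sketch of that setup (paths and host name are just examples): start luigid once, then let cron or a human kick off workers that report to it.
$ luigid --background --pidfile /var/run/luigid.pid --logdir /var/log/luigi
$ python dataflow.py MyTask --scheduler-host localhost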
18. A Luigi Script
import luigi

# structure
class MyTask(luigi.Task):
    def requires(self):  # a list of Task(s) this one depends on
        return []
    def output(self):  # a Target that marks the task as done
        return luigi.LocalTarget("/data/mytask-output.txt")  # illustrative path
    def run(self):  # the work happens here
        with self.output().open("w") as out_file:
            out_file.write("done\n")

if __name__ == "__main__":
    luigi.run()
---
$ python dataflow.py MyTask
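The command above expects a central scheduler (luigid) to be reachable; for a quick test on a single machine you can pass the --local-scheduler flag instead:
$ python dataflow.py MyTask --local-scheduler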
20. Task Inputs and Outputs
[...]
    # where the data goes
    def output(self):
        return luigi.LocalTarget("/data/out1-%s.txt" % self.param)

    # what needs to run beforehand
    def requires(self):
        return OtherTask(self.param)
[...]
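Put together, the fragment above assumes a declared parameter and an upstream task roughly like the following (names and paths are illustrative); requires() is what Luigi uses to build the dependency graph and run OtherTask first.
import luigi

class OtherTask(luigi.Task):  # hypothetical upstream task
    param = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("/data/in-%s.txt" % self.param)

    def run(self):
        with self.output().open("w") as out_file:
            out_file.write("raw data for %s\n" % self.param)

class MyTask(luigi.Task):
    param = luigi.Parameter()

    def requires(self):  # what needs to run beforehand
        return OtherTask(self.param)

    def output(self):  # where the data goes
        return luigi.LocalTarget("/data/out1-%s.txt" % self.param)

    def run(self):
        # self.input() is the output() of the required task
        with self.input().open("r") as in_file, self.output().open("w") as out_file:
            out_file.write(in_file.read().upper())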
21. Doing the Work
[...]
    def run(self):
        with self.input().open("r") as in_file:
            with self.output().open("w") as out_file:
                # read from in_file, ???, write to out_file
[...]
---
run() can yield tasks to create dynamic dependencies
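A sketch of such a dynamic dependency (task names and the chunk count are made up): run() becomes a generator, the yielded tasks are scheduled on the fly, and Luigi suspends this task until they have all finished.
import luigi

class FetchChunk(luigi.Task):  # hypothetical upstream task
    idx = luigi.IntParameter()

    def output(self):
        return luigi.LocalTarget("/data/chunk-%d.txt" % self.idx)

    def run(self):
        with self.output().open("w") as out_file:
            out_file.write("chunk %d\n" % self.idx)

class MergeChunks(luigi.Task):
    n_chunks = luigi.IntParameter(default=3)

    def output(self):
        return luigi.LocalTarget("/data/merged.txt")

    def run(self):
        # decide the upstream tasks at runtime, then yield them;
        # Luigi pauses here until all of them are complete
        chunks = [FetchChunk(idx=i) for i in range(self.n_chunks)]
        yield chunks
        with self.output().open("w") as out_file:
            for chunk in chunks:
                with chunk.output().open("r") as in_file:
                    out_file.write(in_file.read())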
23. Takeaways
● Don’t consider data plumbing in isolation
● Technical decisions should be informed by business needs & goals
● Don’t go with home-baked scripts
○ “Quick and easy”? No.
● ETL is a thing
● There are workflow engines
○ Lots of them
○ Not only for data
● There are other approaches and services
● Luigi is a useful tool
24. Thanks! Let’s stay in touch :)
You’ll also get a step-by-step guide on learning Luigi.
http://datapipelinearchitect.com/big-data-eindhoven/