This document discusses data pipelines and shows how to design and implement them with Python tools. It defines a data pipeline as a set of dependent operations that move data from an input source to an output target. Common uses of pipelines include data aggregation, cleansing, copying, analytics processing, and AI modeling. The operations within a pipeline can be executed sequentially, concurrently using threads, or in parallel across multiple machines. The document recommends designing each operation to be atomic (it either completes fully or leaves no partial output behind) and idempotent (re-running it yields the same result). It presents ETL and periodic/event-driven workflows as common pipeline patterns, and introduces Python tools such as Celery, Luigi, and Airflow for building scalable data pipelines.
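
As a minimal sketch of these ideas, the Luigi example below wires two dependent operations into a small pipeline; the file names and the toy transformation are illustrative, not taken from the document. It leans on two properties Luigi provides out of the box: a `LocalTarget` opened for writing goes through a temporary file that is renamed on close (atomic), and a task whose output already exists is skipped on re-runs (idempotent at the scheduler level).

```python
import luigi


class Extract(luigi.Task):
    """Pull raw records from the input source into a local file."""

    def output(self):
        # Luigi skips a task whose output already exists, which makes
        # each step idempotent when the pipeline is re-run.
        return luigi.LocalTarget("raw.csv")

    def run(self):
        # Stand-in for a real extraction step (API call, DB query, ...).
        # LocalTarget.open("w") writes to a temp file and renames it on
        # close, so downstream tasks never see partial output.
        with self.output().open("w") as out:
            out.write("id,value\n1,42\n2,7\n")


class Transform(luigi.Task):
    """Cleanse the raw records; depends on Extract's output."""

    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget("clean.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as out:
            for line in src:
                out.write(line.strip().lower() + "\n")


if __name__ == "__main__":
    # Run locally; Luigi resolves the dependency graph and executes
    # Extract before Transform.
    luigi.build([Transform()], local_scheduler=True)
```

Asking for `Transform` is enough: Luigi walks `requires()` to build the dependency graph, runs only tasks whose outputs are missing, and the same pattern scales to larger DAGs or, with Celery or Airflow, to distributed execution.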