Data Pipelines with Python - NWA TechFest 2017
Tools and design patterns for building data pipelines with Python.
Data Pipelines with Python
NWA Techfest 2017
What is a Data Pipeline?
What is a Data Pipeline?
• Discrete set of dependent operations
• Directional (Inputs -> [Operations] -> Outputs)
• One or more input sources and one or more outputs
Pipelines are Used For
• Data aggregation / augmentation
• Data cleansing / de-duplication
• Data copying / synchronization
• Analytics processing
• AI Modeling
Sources and Targets
• Sources: Initial inputs into a pipeline
  • REST API, Excel Sheet, Filesystem, HDFS, RDBMS, etc.
• Targets: Terminal outputs of a pipeline
  • REST API, Excel Sheet, [...], Email, Slack
Operations
• Operations are the fundamental units of work within a pipeline.
• Operations can be domain specific.
• Operations can be composable.
[Diagram: Simple Linear Pipeline. Sources (S) feed a straight chain of operations (O) into targets (T).]
[Diagram: Complex Pipeline. Multiple sources (S) feed branching, interdependent operations (O) into multiple targets (T).]
DAGs
• Directed Acyclic Graphs
• Transitive reduction enables smart dependency resolution
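A minimal sketch of DAG-based dependency resolution, using the stdlib `graphlib` module (Python 3.9+). The operation names here are illustrative, not from the deck:

```python
# Resolve execution order for pipeline operations expressed as a DAG.
# Each key depends on the operations in its set of predecessors.
from graphlib import TopologicalSorter

dag = {
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
}

# static_order() yields operations with all dependencies satisfied first.
order = list(TopologicalSorter(dag).static_order())
print(order)  # "extract" first, "load" last; clean/enrich in either order
```

`TopologicalSorter` also exposes `prepare()`/`get_ready()` for running independent operations concurrently as soon as their inputs are available.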
DAG Reduced Form
Atomicity
• An entire operation fails or succeeds as a whole.
• There is no partial state in the event of a failure.
"the state or fact of being composed of indivisible units."
Atomicity
https://www.postgresql.org/docs/8.3/static/tutorial-transactions.html
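A minimal sketch of atomicity via a database transaction, using stdlib sqlite3 as a stand-in for the PostgreSQL transactions linked above. The table and account names are illustrative:

```python
# An atomic transfer: either both writes commit, or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back on any exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'"
        )
        # the second half of the transfer never runs
        raise RuntimeError("crash mid-operation")
except RuntimeError:
    pass

# The partial debit was rolled back: no partial state after failure.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 100
```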
Idempotency
• An operation can be run multiple times without failure.
• An operation can be run multiple times without duplication of output.
Q: What is the correct way to pronounce 'idempotent'?
A: The same way every time.
"denoting an element of a set that is unchanged in value when multiplied or otherwise operated on by itself."
Idempotency
[Image slide: "Idempotent"]
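A minimal sketch of an idempotent load operation: keying the write on a primary key makes it safe to rerun without duplicating output. The table and record shape are illustrative:

```python
# Re-running the same load three times leaves exactly one row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INT PRIMARY KEY, email TEXT)")

def load_user(record):
    # Upsert keyed on the primary key: same input, same final state,
    # no matter how many times the operation runs.
    conn.execute(
        "INSERT OR REPLACE INTO users VALUES (?, ?)",
        (record["id"], record["email"]),
    )
    conn.commit()

for _ in range(3):
    load_user({"id": 1, "email": "casey@example.com"})

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 1 - no duplicated output
```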
Concurrency
• Execute a non-resource bound operation via many threads on the same core.
• Performant pipelines find concurrency within an operation.
"the decomposability property of a program, algorithm, or problem into order-independent or partially-ordered components or units."
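A minimal sketch of finding concurrency within an operation: IO-bound work (a simulated network fetch here; the URLs are fake) overlaps on many threads instead of running back to back:

```python
# Ten blocking 0.1 s "fetches" overlap on threads instead of
# summing to ~1 s of sequential waiting.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    time.sleep(0.1)  # stand-in for a blocking network call
    return f"payload from {url}"

urls = [f"https://example.com/{i}" for i in range(10)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

print(len(results), round(elapsed, 2))  # 10 results in ~0.1 s
```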
Parallelism
• Execute operations on multiple cores / machines simultaneously.
• Operations can run in parallel as soon as a new input is available.
"a computation architecture in which many calculations or the execution of processes are carried out simultaneously"
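A minimal sketch of parallelism with a process pool: a CPU-bound toy transform runs simultaneously across cores, one worker per input. The "fork" start method keeps the sketch runnable as a single self-contained script on Linux:

```python
# CPU-bound work dispatched across worker processes.
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def checksum(n):
    # CPU-bound stand-in for a real transform step
    return sum(i * i for i in range(n)) % 97

inputs = [10_000, 20_000, 30_000, 40_000]
with ProcessPoolExecutor(mp_context=mp.get_context("fork")) as pool:
    # each input is handed to the next free worker process
    results = list(pool.map(checksum, inputs))
print(results)
```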
Design Patterns
Periodic Workflows
• Pipeline executes on a timed interval
• Great for exhaustive data processing
• Easy backfilling
Event-Driven Workflows
• Pipeline handles inputs (events) as they are received
• Real-time data
• Best suited for non-exhaustive data processing
• Backfills?
ETL: Extract, Transform, Load
• Extract, Transform, and Load are distinct steps with no shared operations
• Each step can be performed one or more times before the following step is performed.
ETL
• Intermediate data is stored between steps, and audit data is tracked for each step.
• Enables independent processing of Extract, Transform, and Load.
• Don't transform during extraction.
• Don't transform during loading!
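A minimal sketch of the ETL rules above: three distinct steps with intermediate data stored between them, so each step can be rerun independently. The file names, record shape, and hard-coded source are all illustrative:

```python
# Extract -> stored intermediate -> Transform -> stored intermediate -> Load.
import json
import os
import tempfile

workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "extracted.json")
clean_path = os.path.join(workdir, "transformed.json")

def extract():
    # pull from a source exactly as-is; no transformation here
    records = [{"name": " Alice ", "score": "10"}, {"name": "Bob", "score": "7"}]
    with open(raw_path, "w") as f:
        json.dump(records, f)

def transform():
    # read the stored extract; write a cleaned intermediate
    with open(raw_path) as f:
        records = json.load(f)
    cleaned = [{"name": r["name"].strip(), "score": int(r["score"])} for r in records]
    with open(clean_path, "w") as f:
        json.dump(cleaned, f)

def load():
    # move the transformed intermediate into the target; no transformation here
    with open(clean_path) as f:
        return json.load(f)

extract()
transform()
loaded = load()
print(loaded)
```

Because each step reads and writes durable intermediates, a failed Transform can be rerun without re-extracting, and a failed Load without re-transforming.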
[Diagram: Complex Pipeline, repeated for comparison. Sources (S), operations (O), and targets (T).]
[Diagram: ETL Pipeline. The same flow recast as distinct Extract (E), Transform (T), and Load (L) operations between sources (S) and targets (T).]
[Diagram: ETL Pipeline(s). The ETL flow decomposed into multiple smaller, independent pipelines.]
Why Python?
Scientific / Stats Ecosystem
• NumPy and Pandas
• SciKitLearn
• spaCy
Web Development Ecosystem
• Django, Flask, Pyramid
• Django REST Framework
• Scrapy
• Celery
In web development, we started solving distributed processing problems a long time ago.
Numba: JIT compiler to LLVM
http://numba.pydata.org/
Python Libraries
Celery
• Task queueing / Asynchronous processing
• Native Python executed on distributed workers
• Retrying, Throttling, Pooling
🐶 > 😿
Luigi
• Open sourced by Spotify in 2012
• Lightweight configuration
• Does not support worker pooling
"Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more."
Airflow
• Open sourced by Airbnb in 2015
• In the Apache Incubator since March 2016
• Implements workflows as strict DAGs
• Visualization / Audit / Backfill tools
• Scales with Celery
"Airflow is a platform to programmatically author, schedule and monitor data pipelines."
Our Tools of Choice: Celery + Airflow
How To Attack a Pipeline Problem
1. Pure Python functions
2. Convert to Celery (Parallel for free!)
3. Layer in Concurrency / Optimizations
4. Escalate to Airflow
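Step 1 above can be sketched as plain Python functions wired together, with no framework at all. The function names and data are illustrative; each function maps naturally onto a Celery task (step 2) or an Airflow operator (step 4) later:

```python
# Start simple: pure functions composed into a pipeline.
def extract():
    # stand-in for reading a real source
    return ["  ALICE@example.com", "bob@EXAMPLE.com "]

def transform(emails):
    # cleansing: strip whitespace, normalize case
    return [e.strip().lower() for e in emails]

def load(emails):
    # de-duplicated "target store"
    return set(emails)

def run_pipeline():
    return load(transform(extract()))

print(run_pipeline())
```

Because each stage is a pure function of its input, atomicity and idempotency come almost for free, and escalating to a task queue is a matter of decorating the same functions.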
Things are going to fail.
• Log often and frequently.
• Remember Atomicity.
• Leverage aggregation / visualization tools.
Pipeline Takeaways
• Build operations with Atomicity and Idempotency in mind.
• Optimize throughput with concurrency and parallelism.
• Log and visualize (or just use Airflow).
Come talk to us about your data.
Casey Kinsey, Principal Consultant
hirelofty.com
@loftylabs
@quesokinsey