
Data Pipelines with Python - NWA TechFest 2017

Tools and design patterns for building data pipelines with Python.


  1. Data Pipelines with Python (NWA TechFest 2017)
  2. What is a Data Pipeline?
  3. What is a Data Pipeline? • Discrete set of dependent operations • Directional (Inputs -> [Operations] -> Outputs) • One or more input sources and one or more outputs
  4. Pipelines are Used For • Data aggregation / augmentation • Data cleansing / de-duplication • Data copying / synchronization • Analytics processing • AI Modeling
  5. Sources and Targets • Sources: Initial inputs into a pipeline • REST API, Excel Sheet, Filesystem, HDFS, RDBMS, etc. • Targets: Terminal outputs of a pipeline • REST API, Excel Sheet, [...], Email, Slack
  6. Operations • Operations are the fundamental units of work within a pipeline. • Operations can be domain specific. • Operations can be composable.
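
Slide 6's idea of composable operations maps naturally onto plain Python functions. A minimal sketch, where the operation names and row format are illustrative assumptions rather than anything from the deck:

```python
# Each operation takes rows in and returns rows out, so they compose freely.

def clean_rows(rows):
    """Strip stray whitespace from every field (illustrative cleansing step)."""
    return [[field.strip() for field in row] for row in rows]

def dedupe_rows(rows):
    """Drop duplicate rows while preserving order (illustrative de-duplication)."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def pipeline(rows):
    """Compose operations: the output of one is the input of the next."""
    return dedupe_rows(clean_rows(rows))

print(pipeline([[" a ", "1"], ["a", "1"], ["b", "2"]]))  # [['a', '1'], ['b', '2']]
```
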
  7. Simple Linear Pipeline [diagram: a single chain of Source -> Operation -> Operation -> Target; legend: S = Source, O = Operation, T = Target]
  8. Complex Pipeline [diagram: multiple sources fanning into a web of interconnected operations that fan out to multiple targets; same legend]
  9. DAGs • Directed Acyclic Graphs • Transitive reduction enables smart dependency resolution
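
To make transitive reduction concrete, here is a hedged sketch using networkx (a library choice assumed here, not one named in the deck): the redundant extract -> load edge is removed because it is already implied by the path through transform.

```python
import networkx as nx  # assumed dependency, not from the deck

dag = nx.DiGraph()
dag.add_edges_from([
    ("extract", "transform"),
    ("transform", "load"),
    ("extract", "load"),  # redundant: implied by extract -> transform -> load
])

reduced = nx.transitive_reduction(dag)
print(sorted(reduced.edges()))             # [('extract', 'transform'), ('transform', 'load')]
print(list(nx.topological_sort(reduced)))  # a valid execution order for the pipeline
```
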
  10. DAG / Reduced Form [diagram: a DAG shown beside its transitive reduction]
  11. Atomicity • An entire operation fails or succeeds as a whole. • There is no partial state in the event of a failure. "the state or fact of being composed of indivisible units."
  12. Atomicity: https://www.postgresql.org/docs/8.3/static/tutorial-transactions.html
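
The PostgreSQL tutorial linked above uses a bank transfer to show transactions; a hedged Python equivalent with psycopg2, where the connection string and table are assumptions for illustration:

```python
import psycopg2  # assumed driver choice

conn = psycopg2.connect("dbname=pipeline user=etl")  # illustrative credentials
try:
    with conn:  # commits on success, rolls back on any exception: all or nothing
        with conn.cursor() as cur:
            cur.execute("UPDATE accounts SET balance = balance - 100 WHERE name = %s", ("alice",))
            cur.execute("UPDATE accounts SET balance = balance + 100 WHERE name = %s", ("bob",))
finally:
    conn.close()
```
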
  13. Idempotency • An operation can be run multiple times without failure. • An operation can be run multiple times without duplication of output. Q: What is the correct way to pronounce 'idempotent'? A: The same way every time. "denoting an element of a set that is unchanged in value when multiplied or otherwise operated on by itself."
  14. Idempotency [illustration]
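
A hedged sketch of an idempotent load step: re-running it neither fails nor duplicates output. The table, columns, and the use of PostgreSQL's upsert (9.5+) are illustrative assumptions:

```python
def load_users(cur, users):
    """Insert (user_id, email) pairs; safe to run any number of times."""
    for user_id, email in users:
        cur.execute(
            """
            INSERT INTO users (id, email) VALUES (%s, %s)
            ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email
            """,  # upsert: a re-run overwrites with identical data instead of duplicating
            (user_id, email),
        )
```
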
  15. Concurrency • Execute a non-resource-bound operation via many threads on the same core. • Performant pipelines find concurrency within an operation. "the decomposability property of a program, algorithm, or problem into order-independent or partially-ordered components or units."
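
A hedged sketch of finding concurrency within a single operation: overlapping network-bound fetches with a standard-library thread pool. The URLs are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = ["https://example.com/", "https://example.org/"]  # placeholders

def fetch(url):
    with urlopen(url) as resp:  # I/O-bound: threads spend their time waiting
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, size in pool.map(fetch, URLS):  # many fetches in flight on one core
        print(url, size)
```
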
  16. Parallelism • Execute operations on multiple cores / machines simultaneously. • Operations can run in parallel as soon as new input is available. "a computation architecture in which many calculations or the execution of processes are carried out simultaneously"
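
And a hedged sketch of parallelism: spreading a CPU-bound operation across cores with multiprocessing. The workload is a stand-in:

```python
from multiprocessing import Pool

def crunch(n):
    return sum(i * i for i in range(n))  # stand-in for real CPU-bound work

if __name__ == "__main__":
    with Pool() as pool:  # defaults to one worker process per core
        print(pool.map(crunch, [10**6, 2 * 10**6, 3 * 10**6]))
```
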
  17. Design Patterns
  18. Periodic Workflows • Pipeline executes on a timed interval • Great for exhaustive data processing • Easy backfilling
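
One way to run a pipeline on a timed interval is Celery beat; this is an assumed tool choice (the slide names none), and the task and schedule below are illustrative:

```python
from celery import Celery
from celery.schedules import crontab

app = Celery("pipelines", broker="redis://localhost:6379/0")  # broker assumed

@app.task
def nightly_aggregate():
    ...  # exhaustively reprocess yesterday's data (illustrative body)

app.conf.beat_schedule = {
    "nightly-aggregate": {
        "task": nightly_aggregate.name,
        "schedule": crontab(hour=2, minute=0),  # every night at 02:00
    },
}
```
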
  19. Event-Driven Workflows • Pipeline handles inputs (events) as they are received • Real-time data • Best suited for non-exhaustive data processing • Backfills?
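
A hedged sketch of the event-driven shape: each event is handled the moment it arrives by queueing a task, rather than waiting for a timed sweep. Names are illustrative:

```python
from celery import Celery

app = Celery("pipelines", broker="redis://localhost:6379/0")  # broker assumed

@app.task
def handle_upload(file_path):
    ...  # process just this one file, in near real time (illustrative body)

# In the producer (web app, webhook handler, file watcher), fire as events arrive:
# handle_upload.delay("/incoming/report.csv")
```
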
  20. ETL (Extract, Transform, Load) • Extract, Transform, and Load are distinct steps with no shared operations • Each step can be performed one or more times before the following step is performed.
  21. ETL • Intermediate data is stored between steps, and audit data is tracked for each step. • Enables independent processing of Extract, Transform, and Load. • Don't transform during extraction. • Don't transform during loading!
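
A hedged skeleton of that discipline: each step reads and writes intermediate files, extraction and loading do no reshaping, and every transform lives in one place. Paths, formats, and table names are assumptions:

```python
import csv
import json

def extract(src_csv, staging_path):
    """Copy raw rows to staging; deliberately no transforms here."""
    with open(src_csv, newline="") as f, open(staging_path, "w") as out:
        for row in csv.DictReader(f):
            out.write(json.dumps(row) + "\n")

def transform(staging_path, transformed_path):
    """All reshaping happens here, and only here."""
    with open(staging_path) as f, open(transformed_path, "w") as out:
        for line in f:
            row = json.loads(line)
            row["email"] = row["email"].lower()  # illustrative transform
            out.write(json.dumps(row) + "\n")

def load(transformed_path, cur):
    """Straight inserts from transformed data; again, no transforms here."""
    with open(transformed_path) as f:
        for line in f:
            row = json.loads(line)
            cur.execute(
                "INSERT INTO users (id, email) VALUES (%s, %s) ON CONFLICT (id) DO NOTHING",
                (row["id"], row["email"]),
            )
```
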
  22. Complex Pipeline [diagram repeated from slide 8]
  23. ETL Pipeline [diagram: the complex pipeline recast with distinct Extract, Transform, and Load operations between sources and targets; legend: E = Extract, T = Transform, L = Load]
  24. ETL Pipeline(s) [diagram: multiple ETL pipelines chained together, with the targets of one pipeline serving as sources for the next]
  25. Why Python?
  26. Scientific / Stats Ecosystem • NumPy and Pandas • scikit-learn • spaCy
  27. Web Development Ecosystem • Django, Flask, Pyramid • Django REST Framework • Scrapy • Celery. In web development, we started solving distributed processing problems a long time ago.
  28. Numba: JIT compiler to LLVM (http://numba.pydata.org/)
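
A hedged taste of what Numba does (the function is illustrative): decorate a numeric hot loop and it is compiled to machine code via LLVM on first call.

```python
import numpy as np
from numba import jit

@jit(nopython=True)  # compile to native code; no Python-object fallback
def dot(xs, ys):
    total = 0.0
    for i in range(xs.shape[0]):
        total += xs[i] * ys[i]
    return total

print(dot(np.arange(3.0), np.arange(3.0)))  # 5.0
```
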
  29. Python Libraries
  30. Celery • Task queueing / Asynchronous processing • Native Python executed on distributed workers • Retrying, Throttling, Pooling
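
A hedged sketch of the retrying and throttling the slide mentions; the task body, rate, and backoff are illustrative assumptions:

```python
from celery import Celery

app = Celery("pipelines", broker="redis://localhost:6379/0")  # broker assumed

@app.task(bind=True, max_retries=3, rate_limit="10/m")  # throttle: 10 runs/minute
def sync_record(self, record_id):
    try:
        ...  # push record_id to an external API (illustrative body)
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)  # retry up to 3 times, 60 s apart
```
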
  31. 🐶 > 😿
  32. Luigi • Open sourced by Spotify in 2012 • Lightweight configuration • Does not support worker pooling "Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more."
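
A hedged minimal Luigi pipeline: the dependency is declared in requires(), and completion is tracked through output() targets so re-runs skip finished work. Filenames and task bodies are illustrative:

```python
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/raw.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("hello\nworld\n")  # illustrative extraction

class Transform(luigi.Task):
    def requires(self):
        return Extract()  # Luigi resolves this dependency before running us

    def output(self):
        return luigi.LocalTarget("data/upper.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(line.upper())  # illustrative transform

if __name__ == "__main__":
    luigi.build([Transform()], local_scheduler=True)
```
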
  33. Airflow • Open sourced by Airbnb in 2015 • In the Apache Incubator since March 2016 • Implements workflows as strict DAGs • Visualization / Audit / Backfill tools • Scales with Celery "Airflow is a platform to programmatically author, schedule and monitor data pipelines."
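
A hedged sketch of an Airflow DAG in the 1.x API that was current at the time of the talk; the DAG id, schedule, and callables are illustrative:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path

def do_extract():
    ...  # illustrative body

def do_load():
    ...  # illustrative body

dag = DAG("etl_example", start_date=datetime(2017, 1, 1), schedule_interval="@daily")

extract = PythonOperator(task_id="extract", python_callable=do_extract, dag=dag)
load = PythonOperator(task_id="load", python_callable=do_load, dag=dag)

extract >> load  # declares the DAG edge: extract runs before load
```
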
  34. Our Tools of Choice: Celery + Airflow
  35. How To Attack a Pipeline Problem 1. Pure Python functions 2. Convert to Celery (Parallel for free!) 3. Layer in Concurrency / Optimizations 4. Escalate to Airflow
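
A hedged sketch of steps 1 and 2 (names are illustrative): the same pure function, first callable directly, then registered as a Celery task so workers run it in parallel:

```python
from celery import Celery

app = Celery("pipelines", broker="redis://localhost:6379/0")  # broker assumed

def normalize(record):
    """Step 1: a pure Python function, trivial to test in isolation."""
    return {key.lower(): value for key, value in record.items()}

normalize_task = app.task(normalize)  # step 2: same function, now distributable

# normalize_task.delay({"Name": "Ada"})  # queued; runs on any available worker
```
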
  36. Things are going to fail. • Log early and often. • Remember Atomicity. • Leverage aggregation / visualization tools.
  37. Pipeline Takeaways • Build operations with Atomicity and Idempotency in mind. • Optimize throughput with concurrency and parallelism. • Log and visualize (or just use Airflow).
  38. Come talk to us about your data. Casey Kinsey, Principal Consultant hirelofty.com @loftylabs @quesokinsey
