Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Building Data Pipelines in Python

6,015 views

Published on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2nwSwEh.

Marco Bonzanini discusses the process of building data pipelines, e.g. extraction, cleaning, integration, pre-processing of data; in general, all the steps necessary to prepare data for a data-driven product. In particular, he focuses on data plumbing and on the practice of going from prototype to production. Filmed at qconlondon.com.

Marco Bonzanini is Data Scientist and co-organizer of PyData London Meetup.

Published in: Technology

Building Data Pipelines in Python

  1. 1. Building Data Pipelines in Python Marco Bonzanini QCon London 2017
  2. 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations /data-pipelines-python • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  4. 4. Nice to meet you
  5. 5. R&D ≠ Engineering
  6. 6. R&D ≠ Engineering R&D results in production = high value
  7. 7. Big Data Problems vs Big Data Problems
  8. 8. Data Pipelines (from 30,000ft) Data ETL Analytics
  9. 9. Data Pipelines (zooming in) ETL {Extract Transform Load { Clean Augment Join
  10. 10. Good Data Pipelines Easy to Reproduce Productise{
  11. 11. Towards Good Data Pipelines
  12. 12. Towards Good Data Pipelines (a) Your Data is Dirty unless proven otherwise “It’s in the database, so it’s already good”
  13. 13. Towards Good Data Pipelines (b) All Your Data is Important unless proven otherwise
  14. 14. Towards Good Data Pipelines (b) All Your Data is Important unless proven otherwise Keep it. Transform it. Don’t overwrite it.
  15. 15. Towards Good Data Pipelines (c) Pipelines vs Script Soups
  16. 16. Tasty, but not a pipeline Pic: Romanian potato soup from Wikipedia
  17. 17. $ ./do_something.sh $ ./do_something_else.sh $ ./extract_some_data.sh $ ./join_some_other_data.sh ... Anti-pattern: the script soup
  18. 18. Script soups kill replicability
  19. 19. $ cat ./run_everything.sh ./do_something.sh ./do_something_else.sh ./extract_some_data.sh ./join_some_other_data.sh $ ./run_everything.sh Anti-pattern: the master script
  20. 20. Towards Good Data Pipelines (d) Break it Down setup.py and conda
  21. 21. Towards Good Data Pipelines (e) Automated Testing i.e. why scientists don’t write unit tests
  22. 22. Intermezzo Let me rant about testing Icon by Freepik from flaticon.com
  23. 23. (Unit) Testing Unit tests in three easy steps: • import unittest • Write your tests • Quit complaining about lack of time to write tests
  24. 24. Benefits of (unit) testing • Safety net for refactoring • Safety net for lib upgrades • Validate your assumptions • Document code / communicate your intentions • You’re forced to think
  25. 25. Testing: not convinced yet?
  26. 26. Testing: not convinced yet?
  27. 27. Testing: not convinced yet? 
 f1 = fscore(p, r) min_bound, max_bound = sorted([p, r]) assert min_bound <= f1 <= max_bound
  28. 28. Testing: I’m almost done • Unit tests vs Defensive Programming • Say no to tautologies • Say no to vanity tests • The Python ecosystem is rich: 
 py.test, nosetests, hypothesis, coverage.py, …
  29. 29. </rant>
  30. 30. Towards Good Data Pipelines (f) Orchestration Don’t re-invent the wheel
  31. 31. You need a workflow manager Think: 
 GNU Make + Unix pipes + Steroids
  32. 32. Intro to Luigi • Task dependency management • Error control, checkpoints, failure recovery • Minimal boilerplate • Dependency graph visualisation $ pip install luigi
  33. 33. Luigi Task: unit of execution class MyTask(luigi.Task): def requires(self): return [SomeTask()] def output(self): return luigi.LocalTarget(…) def run(self): mylib.run()
  34. 34. Luigi Target: output of a task class MyTarget(luigi.Target): def exists(self): ... # return bool Great off the shelf support 
 local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
  35. 35. Intro to Airflow • Like Luigi, just younger • Nicer (?) GUI • Scheduling • Apache Project
  36. 36. Towards Good Data Pipelines (g) When things go wrong The Joy of debugging
  37. 37. import logging
  38. 38. Who reads the logs? You’re not going to read the logs, unless… • E-mail notifications (built-in in Luigi) • Slack notifications $ pip install luigi_slack # WIP
  39. 39. Towards Good Data Pipelines (h) Static Analysis The Joy of Duck Typing
  40. 40. If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. — somebody on the Web
  41. 41. >>> 1.0 == 1 == True True >>> 1 + True 2
  42. 42. >>> '1' * 2 '11' >>> '1' + 2 Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: Can't convert 'int' object to str implicitly
  43. 43. def do_stuff(a: int, b: int) -> str: ... return something PEP 3107 — Function Annotations
 (since Python 3.0) (annotations are ignored by the interpreter)
  44. 44. typing module: semantically coherent PEP 484 — Type Hints
 (since Python 3.5) (still ignored by the interpreter)
  45. 45. pip install mypy
  46. 46. • Add optional types • Run: mypy --follow-imports silent mylib • Refine gradual typing (e.g. Any)
  47. 47. Summary Basic engineering principles help
 (packaging, testing, orchestration, logging, static analysis, ...)
  48. 48. Summary R&D is not Engineering:
 can we meet halfway?
  49. 49. Vanity Slide • speakerdeck.com/marcobonzanini • github.com/bonzanini • marcobonzanini.com • @MarcoBonzanini
  50. 50. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/data- pipelines-python

×