Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Simple ETL in python 3.5+ with Bonobo, Romain Dorgueil

526 views

Published on

PyParis 2017
http://pyparis.org

Published in: Technology
  • Be the first to comment

Simple ETL in python 3.5+ with Bonobo, Romain Dorgueil

  1. 1. bonobo Simple ETL in Python 3.5+
  2. 2. Romain Dorgueil @rdorgueil CTO/Hacker in Residence Technical Co-founder (Solo) Founder Eng. Manager Developer L’Atelier BNP Paribas WeAreTheShops RDC Dist. Agency Sensio/SensioLabs AffiliationWizard Felt too young in a Linux Cauldron Dismantler of Atari computers Basic literacy using a Minitel Guitars & accordions Off by one baby Inception
  3. 3. #Grayscales #Open-Source #Trapeze shapes #Data Engineering
  4. 4. STARTUP ACCELERATION PROGRAMS NO HYPE, JUST BUSINESS launchpad.atelier.net
  5. 5. bonobo Simple ETL in Python 3.5+
  6. 6. • History & Market • Extract, transform, load • Basics – Bonobo One-O-One • Concepts & Candies • The Future – State & Plans
  7. 7. Once upon a time…
  8. 8. Extract Transform Load • Not new. Popular concept in the 1970s [1] [2] • Everywhere. Commerce, websites, marketing, finance, … • Update secondary data-stores from a master data-store. • Insert initial data somewhere • Bi-directional API calls • Every day, compute time reports, edit invoices if threshold then send e-mail. • … [1] https://en.wikipedia.org/wiki/Extract,_transform,_load [2] https://www.sas.com/en_us/insights/data-management/what-is-etl.html
  9. 9. Small Automation Tools • Mostly aimed at simple recurring tasks. • Cloud / SaaS only.

  10. 10. Data Integration Tools • Pentaho Data Integration (IDE/Java) • Talend Open Studio (IDE/Java) • CloverETL (IDE/Java)
  11. 11. Talend Open Studio
  12. 12. Data Integration Tools • Java + IDE based, for most of them • Data transformations are blocks • IO flow managed by connections • Execution
  13. 13. In the Python world … • Bubbles (https://github.com/stiivi/bubbles) • PETL (https://github.com/alimanfoo/petl) • and now… Bonobo (https://www.bonobo-project.org/) You can also use amazing libraries including 
 Joblib, Dask, Pandas, Toolz, 
 but ETL is not their main focus.
  14. 14. Big Data Tools • Can do anything. And probably more. Fast. • Either needs an infrastructure, or cloud based.
  15. 15. Story time
  16. 16. Partner 1 Data Integration
  17. 17. WE GOT DEALS !!!
  18. 18. Partner 1 Partner 2 Partner 3 Partner 4 Partner 5 Partner 6 Partner 7 Partner 8 Partner 9 …
  19. 19. Tiny bug there… Can you fix it ?
  20. 20. I want… • A data integration / ETL tool using code as configuration. • Preferably Python code. • Something that can be tested (I mean, by a machine). • Something that can use inheritance. • Fast install on laptop, thought to run on servers too.
  21. 21. And that’s Bonobo
  22. 22. It is … • A framework to write ETL jobs in Python 3 (3.5+) • Using the same concepts as the old ETLs • But for coders! • You can use OOP!
  23. 23. It is NOT … • Pandas / R Dataframes • Dask • Luigi / Airflow • Hadoop / Big Data • A monkey
  24. 24. Bonobo - One - O - One
  25. 25. Let’s create a project
  26. 26. Let’s create a project ~ $ pip install bonobo cookiecutter

  27. 27. Let’s create a project ~ $ pip install bonobo cookiecutter
 ~ $ bonobo init pyparis

  28. 28. Let’s create a project ~ $ pip install bonobo cookiecutter
 ~ $ bonobo init pyparis
 ~ $ cd pyparis

  29. 29. Let’s create a project ~ $ pip install bonobo cookiecutter
 ~ $ bonobo init pyparis
 ~ $ cd pyparis
 ~/pyparis $ ls -lsah
 total 32
 0 drwxr-xr-x 7 rd staff 238B 30 mai 18:02 .
 0 drwxr-xr-x 4 rd staff 136B 30 mai 18:03 ..
 0 -rw-r--r-- 1 rd staff 0B 30 mai 18:02 .env
 8 -rw-r--r-- 1 rd staff 6B 30 mai 18:02 .gitignore
 8 -rw-r--r-- 1 rd staff 333B 30 mai 18:02 main.py
 8 -rw-r--r-- 1 rd staff 140B 30 mai 18:02 _services.py
 8 -rw-r--r-- 1 rd staff 7B 30 mai 18:02 requirements.txt
  30. 30. It works! $ bonobo run . 1 3 5 ... 37 39 41 - range in=1 out=42 - Filter in=42 out=21 - print in=21
  31. 31. Let’s rewrite main.py
  32. 32. Let’s rewrite main.py import bonobo
  33. 33. Let’s rewrite main.py import bonobo def extract(): for i in range(42): yield i
  34. 34. Let’s rewrite main.py import bonobo def extract(): for i in range(42): yield i def transform(n): if n % 2: yield n
  35. 35. Let’s rewrite main.py import bonobo def extract(): for i in range(42): yield i def transform(n): if n % 2: yield n def load(n): print(n)
  36. 36. Let’s rewrite main.py import bonobo def extract(): for i in range(42): yield i def transform(n): if n % 2: yield n def load(n): print(n) graph = bonobo.Graph( extract, transform, load, )
  37. 37. Let’s rewrite main.py import bonobo def extract(): for i in range(42): yield i def transform(n): if n % 2: yield n def load(n): print(n) graph = bonobo.Graph( extract, transform, load, ) if __name__ == '__main__': bonobo.run(graph)
  38. 38. Let’s rewrite main.py import bonobo def extract(): for i in range(42): yield i def transform(n): if n % 2: yield n def load(n): print(n) graph = bonobo.Graph( extract, transform, load, ) if __name__ == '__main__': bonobo.run(graph) range(42)
  39. 39. Let’s rewrite main.py import bonobo def extract(): for i in range(42): yield i def transform(n): if n % 2: yield n def load(n): print(n) graph = bonobo.Graph( extract, transform, load, ) if __name__ == '__main__': bonobo.run(graph) range(42) bonobo.Filter(lambda n: n % 2)
  40. 40. Let’s rewrite main.py import bonobo def extract(): for i in range(42): yield i def transform(n): if n % 2: yield n def load(n): print(n) graph = bonobo.Graph( extract, transform, load, ) if __name__ == '__main__': bonobo.run(graph) range(42) bonobo.Filter(lambda n: n % 2) print
  41. 41. Let’s rewrite main.py import bonobo def extract(): for i in range(42): yield i def transform(n): if n % 2: yield n def load(n): print(n) graph = bonobo.Graph( extract, transform, load, ) if __name__ == '__main__': bonobo.run(graph) range(42) bonobo.Filter(lambda n: n % 2) print graph = bonobo.Graph( range(42), bonobo.Filter(filter=lambda x: x % 2), print, ) if __name__ == '__main__': bonobo.run(graph)
  42. 42. graph = bonobo.Graph(…)
  43. 43. BEGIN CsvReader( 'clients.csv' ) InsertOrUpdate( 'db.site', 'clients', key='guid' ) update_crm retrieve_orders
  44. 44. bonobo.run(graph) or in a shell… $ bonobo run main.py
  45. 45. BEGIN CsvReader( 'clients.csv' ) InsertOrUpdate( 'db.site', 'clients', key='guid' ) update_crm retrieve_orders
  46. 46. BEGIN CsvReader( 'clients.csv' ) InsertOrUpdate( 'db.site', 'clients', key='guid' ) update_crm retrieve_orders Context + Thread Context + Thread Context + Thread Context + Thread
  47. 47. Transformations ? a.k.a nodes in the graph
  48. 48. Functions! def get_more_infos(api, **row): more = api.query(row.get('id')) return { **row, **(more or {}), }
  49. 49. Generators! def join_orders(order_api, **row): for order in order_api.get(row.get('customer_id')): yield { **row, **order, }
  50. 50. Iterators! extract = ( 'foo', 'bar', 'baz', ) extract = range(0, 1001, 7)
  51. 51. Classes! from bonobo.config import Configurable, Option, Service class QueryDatabase(Configurable): # Simple string option table_name = Option(str, default='customers') # External dependency database = Service('database.default') def call(self, database, **row): customer = database.query(self.table_name, customer_id=row['clientId']) return { **row, 'is_customer': bool(customer), }
  52. 52. Services ?
  53. 53. get_services() COINBASE_API_KEY, COINBASE_API_SECRET = environ.get('...'), environ.get('...') KRAKEN_API_KEY, KRAKEN_API_SECRET = environ.get('...'), environ.get('...') POLONIEX_API_KEY, POLONIEX_API_SECRET = environ.get('...'), environ.get('...') def get_services(): return { 'fs': bonobo.open_fs( path.join(path.dirname(__file__), './data') ), 'coinbase.client': coinbase.wallet.client.Client( COINBASE_API_KEY, COINBASE_API_SECRET ), 'kraken.client': krakenex.API( KRAKEN_API_KEY, KRAKEN_API_SECRET ), 'poloniex.client': poloniex.Poloniex( apikey=POLONIEX_API_KEY, secret=POLONIEX_API_SECRET ), 'google.client': get_google_api_client(), }
  54. 54. run() will handle injection import bonobo from bonobo.commands.run import get_default_services graph = bonobo.Graph() if __name__ == '__main__': bonobo.run( graph, services=get_default_services(__file__) )
  55. 55. … class QueryDatabase(Configurable): # External dependency database = Service('database.default') def call(self, database, **row): return { … }
  56. 56. Bananas!
  57. 57. Files bonobo.FileReader( path: str, *, eol: str, encoding: str, ioformat, mode: str ) bonobo.FileWriter( path: str, *, eol: str, encoding: str, ioformat, mode: str )
  58. 58. CSV bonobo.CsvReader( path: str, *, eol: str, encoding: str, ioformat, mode: str, delimiter: str, quotechar: str, headers: tuple, skip: int ) bonobo.CsvWriter( path: str, *, eol: str, encoding: str, ioformat, mode: str, delimiter: str, quotechar: str, headers: tuple )
  59. 59. JSON bonobo.JsonReader( path: str, *, eol: str, encoding: str, ioformat, mode: str ) bonobo.JsonWriter( path: str, *, eol: str, encoding: str, ioformat, mode: str )
  60. 60. Pickle bonobo.PickleReader( path: str, *, eol: str, encoding: str, ioformat, item_names: tuple, mode: str ) bonobo.PickleWriter( path: str, *, eol: str, encoding: str, ioformat, item_names: tuple, mode: str )
  61. 61. bonobo.Limit(limit)
  62. 62. bonobo.PrettyPrinter()
  63. 63. Filters @bonobo.Filter def NoneShallPass(username): return username == 'romain' bonobo.graph( # ... NoneShallPass(), # ... ) bonobo.Filter(lambda username: username == 'romain')
  64. 64. More bananas!
  65. 65. Console & Logging
  66. 66. Jupyter Notebook
  67. 67. SQLAlchemy Extension bonobo_sqlalchemy.Select( query, *, pack_size=1000, limit=None ) bonobo_sqlalchemy.InsertOrUpdate( table_name, *, fetch_columns, insert_only_fields, discriminant, … ) PREVIEW
  68. 68. Docker Extension $ pip install bonobo[docker] $ bonobo runc myjob.py PREVIEW
  69. 69. State & Plans
  70. 70. RIP rdc.etl • 2013-2015, python 2.7. • Brings most of the execution algorithms. • Different context. • Learn from mistakes.
  71. 71. Bonobo is young • First commit : December 2016 • 21 releases, ~400 commits, 4 contributors • Current stable 0.4.1, released yesterday
  72. 72. __future__ • Stay 100% Open-Source • Light & Focused : Graphs and Executions • The rest goes to Extensions.
  73. 73. Pre 1.0 Roadmap • Towards stable API • More tests, more documentation • Developer experience first Target: 1.0 before 2018
  74. 74. Post 1.0 Roadmap • Scheduling • Monitoring • More execution strategies • Optimizations
  75. 75. Hacker News is crazy
  76. 76. Data Processing for Humans
  77. 77. www.bonobo-project.org docs.bonobo-project.org bonobo-slack.herokuapp.com github.com/python-bonobo Let me know what you think!
  78. 78. Thank you! @monkcage @rdorgueil https://goo.gl/e25eoa
  79. 79. bonobo @monkcage

×