Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
bonobo
Simple ETL in Python 3.5+
Romain Dorgueil
@rdorgueil
CTO/Hacker in Residence
Technical Co-founder
(Solo) Founder
Eng. Manager
Developer
L’Atelier BN...
STARTUP ACCELERATION PROGRAMS
NO HYPE, JUST BUSINESS
launchpad.atelier.net
bonobo
Simple ETL in Python 3.5+
• History of Extract Transform Load
• Concept ; Existing tools ; Related tools ; Ignition
• Practical Bonobo
• Tutorial ; ...
Once upon a time…
Extract Transform Load
• Not new. Popular concept in the 1970s [1] [2]
• Everywhere. Commerce, websites, marketing, finance...
Extract Transform Load
foo
bar
baz
Extract Transform Load
Extract Transform Load
foo
bar
baz
Extract
Transform Load
Transform

more
Join


DB
HTTP POST
log?
Data Integration Tools
• Pentaho Data Integration (IDE/Java)
• Talend Open Studio (IDE/Java)
• CloverETL (IDE/Java)
Talend Open Studio
Data Integration Tools
• Java + IDE based, for most of them
• Data transformations are blocks
• IO flow managed by connecti...
In the Python world …
• Bubbles (https://github.com/stiivi/bubbles)
• PETL (https://github.com/alimanfoo/petl)
• (insert a...
Other scales…
Small Automation Tools
• Mostly aimed at simple recurring tasks.
• Cloud / SaaS only.

Big Data Tools
• Can do anything. And probably more. Fast.
• Either needs an infrastructure, or cloud based.
Story time
Partner 1 Data Integration
WE GOT DEALS !!!
Partner 1 Partner 2 Partner 3 Partner 4 Partner 5
Partner 6 Partner 7 Partner 8 Partner 9 …
Tiny bug there…
Can you fix it ?
My need
• A data integration / ETL tool using code as configuration.
• Preferably Python code.
• Something that can be test...
And that’s Bonobo
It is …
• A framework to write ETL jobs in Python 3 (3.5+)
• Using the same concepts as the old ETLs.
• You can use OOP!
C...
It is NOT …
• Pandas / R Dataframes
• Dask (but will probably implement a dask.distributed
strategy someday)
• Luigi / Air...
Let’s see…
Create a project
~ $ pip install bonobo
~ $ bonobo init europython/tutorial
~ $ bonobo run europython/tutorial
TEMPLATE
~ $ bonobo run .
…demo
Write our own
import bonobo
def extract():
yield 'euro'
yield 'python'
yield '2017'
def transform(s):
return s.title()
def...
EXAMPLE_1
~ $ bonobo run .
…demo
EXAMPLE_1
~ $ bonobo run first.py
…demo
Under the hood…
graph = bonobo.Graph(…)
BEGIN
CsvReader(
'clients.csv'
)
InsertOrUpdate(
'db.site',
'clients',
key='guid'
)
update_crm
retrieve_orders
Graph…
class Graph:
def __init__(self, *chain):
self.edges = {}
self.nodes = []
self.add_chain(*chain)
def add_chain(self,...
bonobo.run(graph)
or in a shell…
$ bonobo run main.py
BEGIN
CsvReader(
'clients.csv'
)
InsertOrUpdate(
'db.site',
'clients',
key='guid'
)
update_crm
retrieve_orders
BEGIN
CsvReader(
'clients.csv'
)
InsertOrUpdate(
'db.site',
'clients',
key='guid'
)
update_crm
retrieve_orders
Context
+
T...
Context…
class GraphExecutionContext:
def __init__(self, graph, plugins, services):
self.graph = graph
self.nodes = [
Node...
Strategy…
class ThreadPoolExecutorStrategy(Strategy):
def execute(self, graph, plugins, services):
context = self.create_c...
</ implementation details >
Transformations
a.k.a
nodes in the graph
Functions
def get_more_infos(api, **row):
more = api.query(row.get('id'))
return {
**row,
**(more or {}),
}
Generators
def join_orders(order_api, **row):
for order in order_api.get(row.get('customer_id')):
yield {
**row,
**order,
}
Iterators
extract = (
'foo',
'bar',
'baz',
)
extract = range(0, 1001, 7)
Classes
class RiminizeThis:
def __call__(self, **row):
return {
**row,
'Rimini': 'Woo-hou-wo...',
}
Anything, as long as i...
Configurable classes
from bonobo.config import Configurable, Option, Service
class QueryDatabase(Configurable):
table_name...
Configurable classes
from bonobo.config import Configurable, Option, Service
class QueryDatabase(Configurable):
table_name...
Configurable classes
from bonobo.config import Configurable, Option, Service
class QueryDatabase(Configurable):
table_name...
Configurable classes
from bonobo.config import Configurable, Option, Service
class QueryDatabase(Configurable):
table_name...
Configurable classes
query_database = QueryDatabase(
table_name='test_customers',
database='database.testing',
)
Services
Define as names
class QueryDatabase(Configurable):
database = Service('database.default')
def call(self, database, **row):...
Runtime injection
import bonobo
graph = bonobo.Graph(...)
def get_services():
return {
‘database.default’: MyDatabaseImpl(...
Bananas!
Library
bonobo.FileReader(…)
bonobo.CsvReader(…)
bonobo.JsonReader(…)
bonobo.PickleReader(…)
bonobo.ExcelReader(…)
bonobo....
Library
bonobo.Limit(limit)
bonobo.PrettyPrinter()
bonobo.Filter(…)
… more to come
Extensions & Plugins
Console Plugin
Jupyter Plugin
SQLAlchemy Extension
bonobo_sqlalchemy.Select(
query,
*,
pack_size=1000,
limit=None
)
bonobo_sqlalchemy.InsertOrUpdate(
ta...
Docker Extension
$ pip install bonobo[docker]
$ bonobo runc myjob.py
PREVIEW
Dev Kit
PREVIEW
https://github.com/python-bonobo/bonobo-devkit
More examples
?
EXAMPLE_1 -> EXAMPLE_2
…demo
• Use filesystem service.
• Write to a CSV
• Also write to JSON
EXAMPLE_3
Rimini open data
~/bdk/demos/europython2017
Europython attendees
featuring…
jupyter notebook

selenium & firefox
~/bdk/demos/sirene
French companies registry
featuring…
docker
postgresql
sql alchemy
Wrap up
Young
• First commit : December 2016
• 23 releases, ~420 commits, 4 contributors
• Current « stable » 0.4.3
• Target : 1.0...
Python 3.5+
• {**}
• async/await
• (…, *, …)
• GIL :(
1.0
• 100% Open-Source.
• Light & Focused.
• Very few dependencies.
• Comprehensive standard library.
• The rest goes to p...
Small scale
• 1 minute to install
• Easy to deploy
• NOT : Big Data, Statistics, Analytics …
• IS : Lean manufacturing for...
Interwebs are crazy
Data Processing for Humans
www.bonobo-project.org
docs.bonobo-project.org
bonobo-slack.herokuapp.com
github.com/python-bonobo
Let me know what you th...
Sprint
• Sprints at Europython are amazing
• Nice place to learn about Bonobo, basics, etc.
• Nice place to contribute whi...
Thank you!
@monkcage @rdorgueil
https://goo.gl/e25eoa
bonobo
@monkcage
EuroPython 2017 - Bonono - Simple ETL in python 3.5+
EuroPython 2017 - Bonono - Simple ETL in python 3.5+
EuroPython 2017 - Bonono - Simple ETL in python 3.5+
You’ve finished this document.
Download and read it offline.
Upcoming SlideShare
What to Upload to SlideShare
Next
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

Share

EuroPython 2017 - Bonono - Simple ETL in python 3.5+

Download to read offline

Presentation of Bonobo, the Extract Transform Load (ETL) framework for python 3.5+.
https://www.bonobo-project.org/

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

EuroPython 2017 - Bonono - Simple ETL in python 3.5+

  1. 1. bonobo Simple ETL in Python 3.5+
  2. 2. Romain Dorgueil @rdorgueil CTO/Hacker in Residence Technical Co-founder (Solo) Founder Eng. Manager Developer L’Atelier BNP Paribas WeAreTheShops RDC Dist. Agency Sensio/SensioLabs AffiliationWizard Felt too young in a Linux Cauldron Dismantler of Atari computers Basic literacy using a Minitel Guitars & accordions Off by one baby Inception
  3. 3. STARTUP ACCELERATION PROGRAMS NO HYPE, JUST BUSINESS launchpad.atelier.net
  4. 4. bonobo Simple ETL in Python 3.5+
  5. 5. • History of Extract Transform Load • Concept ; Existing tools ; Related tools ; Ignition • Practical Bonobo • Tutorial ; Under the hood ; Demo ; Plugins & Extensions ; More demos • Wrap up • Present & future ; Resources ; Sprint ; Feedback Plan
  6. 6. Once upon a time…
  7. 7. Extract Transform Load • Not new. Popular concept in the 1970s [1] [2] • Everywhere. Commerce, websites, marketing, finance, … [1] https://en.wikipedia.org/wiki/Extract,_transform,_load [2] https://www.sas.com/en_us/insights/data-management/what-is-etl.html
  8. 8. Extract Transform Load foo bar baz Extract Transform Load
  9. 9. Extract Transform Load foo bar baz Extract Transform Load Transform
 more Join 
 DB HTTP POST log?
  10. 10. Data Integration Tools • Pentaho Data Integration (IDE/Java) • Talend Open Studio (IDE/Java) • CloverETL (IDE/Java)
  11. 11. Talend Open Studio
  12. 12. Data Integration Tools • Java + IDE based, for most of them • Data transformations are blocks • IO flow managed by connections • Execution GUI first, eventually code :-(
  13. 13. In the Python world … • Bubbles (https://github.com/stiivi/bubbles) • PETL (https://github.com/alimanfoo/petl) • (insert a few more here) • and now… Bonobo (https://www.bonobo-project.org/) You can also use amazing libraries including 
 Joblib, Dask, Pandas, Toolz, 
 but ETL is not their main focus.
  14. 14. Other scales…
  15. 15. Small Automation Tools • Mostly aimed at simple recurring tasks. • Cloud / SaaS only.

  16. 16. Big Data Tools • Can do anything. And probably more. Fast. • Either needs an infrastructure, or cloud based.
  17. 17. Story time
  18. 18. Partner 1 Data Integration
  19. 19. WE GOT DEALS !!!
  20. 20. Partner 1 Partner 2 Partner 3 Partner 4 Partner 5 Partner 6 Partner 7 Partner 8 Partner 9 …
  21. 21. Tiny bug there… Can you fix it ?
  22. 22. My need • A data integration / ETL tool using code as configuration. • Preferably Python code. • Something that can be tested (I mean, by a machine). • Something that can use inheritance. • Fast & cheap install on laptop, thought for servers too.
  23. 23. And that’s Bonobo
  24. 24. It is … • A framework to write ETL jobs in Python 3 (3.5+) • Using the same concepts as the old ETLs. • You can use OOP! Code first. Eventually a GUI will come.
  25. 25. It is NOT … • Pandas / R Dataframes • Dask (but will probably implement a dask.distributed strategy someday) • Luigi / Airflow • Hadoop / Big Data / Big Query / … • A monkey (spoiler : it’s an ape, damnit french language…)
  26. 26. Let’s see…
  27. 27. Create a project ~ $ pip install bonobo ~ $ bonobo init europython/tutorial ~ $ bonobo run europython/tutorial
  28. 28. TEMPLATE ~ $ bonobo run . …demo
  29. 29. Write our own import bonobo def extract(): yield 'euro' yield 'python' yield '2017' def transform(s): return s.title() def load(s): print(s) graph = bonobo.Graph( extract, transform, load, )
  30. 30. EXAMPLE_1 ~ $ bonobo run . …demo
  31. 31. EXAMPLE_1 ~ $ bonobo run first.py …demo
  32. 32. Under the hood…
  33. 33. graph = bonobo.Graph(…)
  34. 34. BEGIN CsvReader( 'clients.csv' ) InsertOrUpdate( 'db.site', 'clients', key='guid' ) update_crm retrieve_orders
  35. 35. Graph… class Graph: def __init__(self, *chain): self.edges = {} self.nodes = [] self.add_chain(*chain) def add_chain(self, *nodes, _input=None, _output=None): # ...
  36. 36. bonobo.run(graph) or in a shell… $ bonobo run main.py
  37. 37. BEGIN CsvReader( 'clients.csv' ) InsertOrUpdate( 'db.site', 'clients', key='guid' ) update_crm retrieve_orders
  38. 38. BEGIN CsvReader( 'clients.csv' ) InsertOrUpdate( 'db.site', 'clients', key='guid' ) update_crm retrieve_orders Context + Thread Context + Thread Context + Thread Context + Thread
  39. 39. Context… class GraphExecutionContext: def __init__(self, graph, plugins, services): self.graph = graph self.nodes = [ NodeExecutionContext(node, parent=self) for node in self.graph ] self.plugins = [ PluginExecutionContext(plugin, parent=self) for plugin in plugins ] self.services = services
  40. 40. Strategy… class ThreadPoolExecutorStrategy(Strategy): def execute(self, graph, plugins, services): context = self.create_context(graph, plugins, services) executor = self.create_executor() for node_context in context.nodes: executor.submit( self.create_runner(node_context) ) while context.alive: self.sleep() executor.shutdown() return context
  41. 41. </ implementation details >
  42. 42. Transformations a.k.a nodes in the graph
  43. 43. Functions def get_more_infos(api, **row): more = api.query(row.get('id')) return { **row, **(more or {}), }
  44. 44. Generators def join_orders(order_api, **row): for order in order_api.get(row.get('customer_id')): yield { **row, **order, }
  45. 45. Iterators extract = ( 'foo', 'bar', 'baz', ) extract = range(0, 1001, 7)
  46. 46. Classes class RiminizeThis: def __call__(self, **row): return { **row, 'Rimini': 'Woo-hou-wo...', } Anything, as long as it’s callable().
  47. 47. Configurable classes from bonobo.config import Configurable, Option, Service class QueryDatabase(Configurable): table_name = Option(str, default=‘customers') database = Service('database.default') def call(self, database, **row): customer = database.query(self.table_name, customer_id=row['clien return { **row, 'is_customer': bool(customer), }
  48. 48. Configurable classes from bonobo.config import Configurable, Option, Service class QueryDatabase(Configurable): table_name = Option(str, default=‘customers') database = Service('database.default') def call(self, database, **row): customer = database.query(self.table_name, customer_id=row['clien return { **row, 'is_customer': bool(customer), }
  49. 49. Configurable classes from bonobo.config import Configurable, Option, Service class QueryDatabase(Configurable): table_name = Option(str, default=‘customers') database = Service('database.default') def call(self, database, **row): customer = database.query(self.table_name, customer_id=row['clien return { **row, 'is_customer': bool(customer), }
  50. 50. Configurable classes from bonobo.config import Configurable, Option, Service class QueryDatabase(Configurable): table_name = Option(str, default=‘customers') database = Service('database.default') def call(self, database, **row): customer = database.query(self.table_name, customer_id=row['clien return { **row, 'is_customer': bool(customer), }
  51. 51. Configurable classes query_database = QueryDatabase( table_name='test_customers', database='database.testing', )
  52. 52. Services
  53. 53. Define as names class QueryDatabase(Configurable): database = Service('database.default') def call(self, database, **row): return { … }
  54. 54. Runtime injection import bonobo graph = bonobo.Graph(...) def get_services(): return { ‘database.default’: MyDatabaseImpl() }
  55. 55. Bananas!
  56. 56. Library bonobo.FileReader(…) bonobo.CsvReader(…) bonobo.JsonReader(…) bonobo.PickleReader(…) bonobo.ExcelReader(…) bonobo.XMLReader(…) … more to come bonobo.FileWriter(…) bonobo.CsvWriter(…) bonobo.JsonWriter(…) bonobo.PickleWriter(…) bonobo.ExcelWriter(…) bonobo.XMLWriter(…) … more to come
  57. 57. Library bonobo.Limit(limit) bonobo.PrettyPrinter() bonobo.Filter(…) … more to come
  58. 58. Extensions & Plugins
  59. 59. Console Plugin
  60. 60. Jupyter Plugin
  61. 61. SQLAlchemy Extension bonobo_sqlalchemy.Select( query, *, pack_size=1000, limit=None ) bonobo_sqlalchemy.InsertOrUpdate( table_name, *, fetch_columns, insert_only_fields, discriminant, … ) PREVIEW
  62. 62. Docker Extension $ pip install bonobo[docker] $ bonobo runc myjob.py PREVIEW
  63. 63. Dev Kit PREVIEW https://github.com/python-bonobo/bonobo-devkit
  64. 64. More examples ?
  65. 65. EXAMPLE_1 -> EXAMPLE_2 …demo • Use filesystem service. • Write to a CSV • Also write to JSON
  66. 66. EXAMPLE_3 Rimini open data
  67. 67. ~/bdk/demos/europython2017 Europython attendees featuring… jupyter notebook
 selenium & firefox
  68. 68. ~/bdk/demos/sirene French companies registry featuring… docker postgresql sql alchemy
  69. 69. Wrap up
  70. 70. Young • First commit : December 2016 • 23 releases, ~420 commits, 4 contributors • Current « stable » 0.4.3 • Target : 1.0 early 2018
  71. 71. Python 3.5+ • {**} • async/await • (…, *, …) • GIL :(
  72. 72. 1.0 • 100% Open-Source. • Light & Focused. • Very few dependencies. • Comprehensive standard library. • The rest goes to plugins and extensions.
  73. 73. Small scale • 1 minute to install • Easy to deploy • NOT : Big Data, Statistics, Analytics … • IS : Lean manufacturing for data
  74. 74. Interwebs are crazy
  75. 75. Data Processing for Humans
  76. 76. www.bonobo-project.org docs.bonobo-project.org bonobo-slack.herokuapp.com github.com/python-bonobo Let me know what you think!
  77. 77. Sprint • Sprints at Europython are amazing • Nice place to learn about Bonobo, basics, etc. • Nice place to contribute while learning. • You’re amazing.
  78. 78. Thank you! @monkcage @rdorgueil https://goo.gl/e25eoa
  79. 79. bonobo @monkcage
  • selvakumarkandsamy

    Mar. 6, 2019
  • PaulKrause25

    Feb. 23, 2019
  • TakuroWada

    Jul. 12, 2017

Presentation of Bonobo, the Extract Transform Load (ETL) framework for python 3.5+. https://www.bonobo-project.org/

Views

Total views

1,385

On Slideshare

0

From embeds

0

Number of embeds

473

Actions

Downloads

26

Shares

0

Comments

0

Likes

3

×