bonobo
Simple ETL in Python 3.5+
Romain Dorgueil
@rdorgueil
CTO/Hacker in Residence
Technical Co-founder
(Solo) Founder
Eng. Manager
Developer
L’Atelier BNP Paribas
WeAreTheShops
RDC Dist. Agency
Sensio/SensioLabs
AffiliationWizard
Felt too young in a Linux Cauldron
Dismantler of Atari computers
Basic literacy using a Minitel
Guitars & accordions
Off by one baby
Inception
#Grayscales
#Open-Source
#Trapeze shapes
#Data Engineering
STARTUP ACCELERATION PROGRAMS
NO HYPE, JUST BUSINESS
launchpad.atelier.net
bonobo
Simple ETL in Python 3.5+
• History & Market
• Extract, transform, load
• Basics – Bonobo One-O-One
• Concepts & Candies
• The Future – State & Plans
Once upon a time…
Extract Transform Load
• Not new. Popular concept in the 1970s [1] [2]
• Everywhere. Commerce, websites, marketing, finance, …
• Update secondary data-stores from a master data-store.
• Insert initial data somewhere
• Bi-directional API calls
• Every day, compute time reports, edit invoices if threshold then send e-mail.
• …
[1] https://en.wikipedia.org/wiki/Extract,_transform,_load
[2] https://www.sas.com/en_us/insights/data-management/what-is-etl.html
Small Automation Tools
• Mostly aimed at simple recurring tasks.
• Cloud / SaaS only.

Data Integration Tools
• Pentaho Data Integration (IDE/Java)
• Talend Open Studio (IDE/Java)
• CloverETL (IDE/Java)
Talend Open Studio
Data Integration Tools
• Java + IDE based, for most of them
• Data transformations are blocks
• IO flow managed by connections
• Execution
In the Python world …
• Bubbles (https://github.com/stiivi/bubbles)
• PETL (https://github.com/alimanfoo/petl)
• and now… Bonobo (https://www.bonobo-project.org/)
You can also use amazing libraries including 

Joblib, Dask, Pandas, Toolz, 

but ETL is not their main focus.
Big Data Tools
• Can do anything. And probably more. Fast.
• Either needs an infrastructure, or cloud based.
Story time
Partner 1 Data Integration
WE GOT DEALS !!!
Partner 1 Partner 2 Partner 3 Partner 4 Partner 5
Partner 6 Partner 7 Partner 8 Partner 9 …
Tiny bug there…
Can you fix it ?
I want…
• A data integration / ETL tool using code as configuration.
• Preferably Python code.
• Something that can be tested (I mean, by a machine).
• Something that can use inheritance.
• Fast install on laptop, thought to run on servers too.
And that’s Bonobo
It is …
• A framework to write ETL jobs in Python 3 (3.5+)
• Using the same concepts as the old ETLs
• But for coders!
• You can use OOP!
It is NOT …
• Pandas / R Dataframes
• Dask
• Luigi / Airflow
• Hadoop / Big Data
• A monkey
Bonobo - One - O - One
Let’s create a project
Let’s create a project
~ $ pip install bonobo cookiecutter

Let’s create a project
~ $ pip install bonobo cookiecutter

~ $ bonobo init pyparis

Let’s create a project
~ $ pip install bonobo cookiecutter

~ $ bonobo init pyparis

~ $ cd pyparis

Let’s create a project
~ $ pip install bonobo cookiecutter

~ $ bonobo init pyparis

~ $ cd pyparis

~/pyparis $ ls -lsah

total 32

0 drwxr-xr-x 7 rd staff 238B 30 mai 18:02 .

0 drwxr-xr-x 4 rd staff 136B 30 mai 18:03 ..

0 -rw-r--r-- 1 rd staff 0B 30 mai 18:02 .env

8 -rw-r--r-- 1 rd staff 6B 30 mai 18:02 .gitignore

8 -rw-r--r-- 1 rd staff 333B 30 mai 18:02 main.py

8 -rw-r--r-- 1 rd staff 140B 30 mai 18:02 _services.py

8 -rw-r--r-- 1 rd staff 7B 30 mai 18:02 requirements.txt
It works!
$ bonobo run .
1
3
5
...
37
39
41
- range in=1 out=42
- Filter in=42 out=21
- print in=21
Let’s rewrite main.py
Let’s rewrite main.py
import bonobo
Let’s rewrite main.py
import bonobo
def extract():
for i in range(42):
yield i
Let’s rewrite main.py
import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
Let’s rewrite main.py
import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
Let’s rewrite main.py
import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
graph = bonobo.Graph(
extract,
transform,
load,
)
Let’s rewrite main.py
import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
graph = bonobo.Graph(
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)
Let’s rewrite main.py
import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
graph = bonobo.Graph(
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)
range(42)
Let’s rewrite main.py
import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
graph = bonobo.Graph(
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)
range(42)
bonobo.Filter(lambda n: n % 2)
Let’s rewrite main.py
import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
graph = bonobo.Graph(
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)
range(42)
bonobo.Filter(lambda n: n % 2)
print
Let’s rewrite main.py
import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
graph = bonobo.Graph(
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)
range(42)
bonobo.Filter(lambda n: n % 2)
print
graph = bonobo.Graph(
range(42),
bonobo.Filter(filter=lambda x: x % 2),
print,
)
if __name__ == '__main__':
bonobo.run(graph)
graph = bonobo.Graph(…)
BEGIN
CsvReader(
'clients.csv'
)
InsertOrUpdate(
'db.site',
'clients',
key='guid'
)
update_crm
retrieve_orders
bonobo.run(graph)
or in a shell…
$ bonobo run main.py
BEGIN
CsvReader(
'clients.csv'
)
InsertOrUpdate(
'db.site',
'clients',
key='guid'
)
update_crm
retrieve_orders
BEGIN
CsvReader(
'clients.csv'
)
InsertOrUpdate(
'db.site',
'clients',
key='guid'
)
update_crm
retrieve_orders
Context
+
Thread
Context
+
Thread
Context
+
Thread
Context
+
Thread
Transformations ?
a.k.a
nodes in the graph
Functions!
def get_more_infos(api, **row):
more = api.query(row.get('id'))
return {
**row,
**(more or {}),
}
Generators!
def join_orders(order_api, **row):
for order in order_api.get(row.get('customer_id')):
yield {
**row,
**order,
}
Iterators!
extract = (
'foo',
'bar',
'baz',
)
extract = range(0, 1001, 7)
Classes!
from bonobo.config import Configurable, Option, Service
class QueryDatabase(Configurable):
# Simple string option
table_name = Option(str, default='customers')
# External dependency
database = Service('database.default')
def call(self, database, **row):
customer = database.query(self.table_name, customer_id=row['clientId'])
return {
**row,
'is_customer': bool(customer),
}
Services ?
get_services()
COINBASE_API_KEY, COINBASE_API_SECRET = environ.get('...'), environ.get('...')
KRAKEN_API_KEY, KRAKEN_API_SECRET = environ.get('...'), environ.get('...')
POLONIEX_API_KEY, POLONIEX_API_SECRET = environ.get('...'), environ.get('...')
def get_services():
return {
'fs': bonobo.open_fs(
path.join(path.dirname(__file__), './data')
),
'coinbase.client': coinbase.wallet.client.Client(
COINBASE_API_KEY, COINBASE_API_SECRET
),
'kraken.client': krakenex.API(
KRAKEN_API_KEY, KRAKEN_API_SECRET
),
'poloniex.client': poloniex.Poloniex(
apikey=POLONIEX_API_KEY, secret=POLONIEX_API_SECRET
),
'google.client': get_google_api_client(),
}
run() will handle injection
import bonobo
from bonobo.commands.run import get_default_services
graph = bonobo.Graph()
if __name__ == '__main__':
bonobo.run(
graph,
services=get_default_services(__file__)
)
…
class QueryDatabase(Configurable):
# External dependency
database = Service('database.default')
def call(self, database, **row):
return { … }
Bananas!
Files
bonobo.FileReader(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str
)
bonobo.FileWriter(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str
)
CSV
bonobo.CsvReader(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str,
delimiter: str,
quotechar: str,
headers: tuple,
skip: int
)
bonobo.CsvWriter(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str,
delimiter: str,
quotechar: str,
headers: tuple
)
JSON
bonobo.JsonReader(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str
)
bonobo.JsonWriter(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str
)
Pickle
bonobo.PickleReader(
path: str,
*,
eol: str,
encoding: str,
ioformat,
item_names: tuple,
mode: str
)
bonobo.PickleWriter(
path: str,
*,
eol: str,
encoding: str,
ioformat,
item_names: tuple,
mode: str
)
bonobo.Limit(limit)
bonobo.PrettyPrinter()
Filters
@bonobo.Filter
def NoneShallPass(username):
return username == 'romain'
bonobo.graph(
# ...
NoneShallPass(),
# ...
)
bonobo.Filter(lambda username: username == 'romain')
More bananas!
Console & Logging
Jupyter Notebook
SQLAlchemy Extension
bonobo_sqlalchemy.Select(
query,
*,
pack_size=1000,
limit=None
)
bonobo_sqlalchemy.InsertOrUpdate(
table_name,
*,
fetch_columns,
insert_only_fields,
discriminant,
…
)
PREVIEW
Docker Extension
$ pip install bonobo[docker]
$ bonobo runc myjob.py
PREVIEW
State & Plans
RIP rdc.etl
• 2013-2015, python 2.7.
• Brings most of the execution algorithms.
• Different context.
• Learn from mistakes.
Bonobo is young
• First commit : December 2016
• 21 releases, ~400 commits, 4 contributors
• Current stable 0.4.1, released yesterday
__future__
• Stay 100% Open-Source
• Light & Focused : Graphs and Executions
• The rest goes to Extensions.
Pre 1.0 Roadmap
• Towards stable API
• More tests, more documentation
• Developer experience first
Target: 1.0 before 2018
Post 1.0 Roadmap
• Scheduling
• Monitoring
• More execution strategies
• Optimizations
Hacker News is crazy
Data Processing for Humans
www.bonobo-project.org
docs.bonobo-project.org
bonobo-slack.herokuapp.com
github.com/python-bonobo
Let me know what you think!
Thank you!
@monkcage @rdorgueil
https://goo.gl/e25eoa
bonobo
@monkcage

Simple ETL in python 3.5+ with Bonobo - PyParis 2017