Simple ETL in python 3.5+ with Bonobo - PyParis 2017

bonobo
Simple ETL in Python 3.5+

Romain Dorgueil
@rdorgueil
CTO/Hacker in Residence
Technical Co-founder
(Solo) Founder
Eng. Manager
Developer
L’Atelier BNP Paribas
WeAreTheShops
RDC Dist. Agency
Sensio/SensioLabs
AﬃliationWizard
Felt too young in a Linux Cauldron
Dismantler of Atari computers
Basic literacy using a Minitel
Guitars & accordions
Oﬀ by one baby
Inception

#Grayscales
#Open-Source
#Trapeze shapes
#Data Engineering

STARTUP ACCELERATION PROGRAMS
NO HYPE, JUST BUSINESS
launchpad.atelier.net

• History & Market
• Extract, transform, load
• Basics – Bonobo One-O-One
• Concepts & Candies
• The Future – State & Plans

Extract Transform Load
• Not new. Popular concept in the 1970s [1] [2]
• Everywhere. Commerce, websites, marketing, ﬁnance, …
• Update secondary data-stores from a master data-store.
• Insert initial data somewhere
• Bi-directional API calls
• Every day, compute time reports, edit invoices if threshold then send e-mail.
• …
[1] https://en.wikipedia.org/wiki/Extract,_transform,_load
[2] https://www.sas.com/en_us/insights/data-management/what-is-etl.html

Small Automation Tools
• Mostly aimed at simple recurring tasks.
• Cloud / SaaS only.

Data Integration Tools
• Pentaho Data Integration (IDE/Java)
• Talend Open Studio (IDE/Java)
• CloverETL (IDE/Java)

Data Integration Tools
• Java + IDE based, for most of them
• Data transformations are blocks
• IO ﬂow managed by connections
• Execution

In the Python world …
• Bubbles (https://github.com/stiivi/bubbles)
• PETL (https://github.com/alimanfoo/petl)
• and now… Bonobo (https://www.bonobo-project.org/)
You can also use amazing libraries including  
Joblib, Dask, Pandas, Toolz,  
but ETL is not their main focus.

Big Data Tools
• Can do anything. And probably more. Fast.
• Either needs an infrastructure, or cloud based.

Partner 1 Partner 2 Partner 3 Partner 4 Partner 5
Partner 6 Partner 7 Partner 8 Partner 9 …

Tiny bug there…
Can you fix it ?

I want…
• A data integration / ETL tool using code as conﬁguration.
• Preferably Python code.
• Something that can be tested (I mean, by a machine).
• Something that can use inheritance.
• Fast install on laptop, thought to run on servers too.

It is …
• A framework to write ETL jobs in Python 3 (3.5+)
• Using the same concepts as the old ETLs
• But for coders!
• You can use OOP!

It is NOT …
• Pandas / R Dataframes
• Dask
• Luigi / Airﬂow
• Hadoop / Big Data
• A monkey

Let’s create a project
~ $ pip install bonobo cookiecutter

~ $ pip install bonobo cookiecutter 
~ $ bonobo init pyparis

~ $ bonobo init pyparis 
~ $ cd pyparis

~ $ bonobo init pyparis 
~ $ cd pyparis 
~/pyparis $ ls -lsah 
total 32 
0 drwxr-xr-x 7 rd staff 238B 30 mai 18:02 . 
0 drwxr-xr-x 4 rd staff 136B 30 mai 18:03 .. 
0 -rw-r--r-- 1 rd staff 0B 30 mai 18:02 .env 
8 -rw-r--r-- 1 rd staff 6B 30 mai 18:02 .gitignore 
8 -rw-r--r-- 1 rd staff 333B 30 mai 18:02 main.py 
8 -rw-r--r-- 1 rd staff 140B 30 mai 18:02 _services.py 
8 -rw-r--r-- 1 rd staff 7B 30 mai 18:02 requirements.txt

It works!
$ bonobo run .
1
3
5
...
37
39
41
- range in=1 out=42
- Filter in=42 out=21
- print in=21

Let’s rewrite main.py
import bonobo

import bonobo
def extract():
for i in range(42):
yield i

import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n

import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)

import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
graph = bonobo.Graph(
extract,
transform,
load,
)

import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)

import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)
range(42)

import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)
range(42)
bonobo.Filter(lambda n: n % 2)

import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)
range(42)
print

import bonobo
def extract():
for i in range(42):
yield i
def transform(n):
if n % 2:
yield n
def load(n):
print(n)
extract,
transform,
load,
)
if __name__ == '__main__':
bonobo.run(graph)
range(42)
print
range(42),
bonobo.Filter(filter=lambda x: x % 2),
print,
)
if __name__ == '__main__':
bonobo.run(graph)

BEGIN
CsvReader(
'clients.csv'
)
InsertOrUpdate(
'db.site',
'clients',
key='guid'
)
update_crm
retrieve_orders

bonobo.run(graph)
or in a shell…
$ bonobo run main.py

BEGIN
CsvReader(
'clients.csv'
)
InsertOrUpdate(
'db.site',
'clients',
key='guid'
)
update_crm
retrieve_orders
Context
+
Thread
Context
+
Thread
Context
+
Thread
Context
+
Thread

Transformations ?
a.k.a
nodes in the graph

Functions!
def get_more_infos(api, **row):
more = api.query(row.get('id'))
return {
**row,
**(more or {}),
}

Generators!
def join_orders(order_api, **row):
for order in order_api.get(row.get('customer_id')):
yield {
**row,
**order,
}

Iterators!
extract = (
'foo',
'bar',
'baz',
)
extract = range(0, 1001, 7)

Classes!
from bonobo.config import Configurable, Option, Service
class QueryDatabase(Configurable):
# Simple string option
table_name = Option(str, default='customers')
# External dependency
database = Service('database.default')
def call(self, database, **row):
customer = database.query(self.table_name, customer_id=row['clientId'])
return {
**row,
'is_customer': bool(customer),
}

get_services()
COINBASE_API_KEY, COINBASE_API_SECRET = environ.get('...'), environ.get('...')
KRAKEN_API_KEY, KRAKEN_API_SECRET = environ.get('...'), environ.get('...')
POLONIEX_API_KEY, POLONIEX_API_SECRET = environ.get('...'), environ.get('...')
def get_services():
return {
'fs': bonobo.open_fs(
path.join(path.dirname(__file__), './data')
),
'coinbase.client': coinbase.wallet.client.Client(
COINBASE_API_KEY, COINBASE_API_SECRET
),
'kraken.client': krakenex.API(
KRAKEN_API_KEY, KRAKEN_API_SECRET
),
'poloniex.client': poloniex.Poloniex(
apikey=POLONIEX_API_KEY, secret=POLONIEX_API_SECRET
),
'google.client': get_google_api_client(),
}

run() will handle injection
import bonobo
from bonobo.commands.run import get_default_services
graph = bonobo.Graph()
if __name__ == '__main__':
bonobo.run(
graph,
services=get_default_services(__file__)
)

…
class QueryDatabase(Configurable):
# External dependency
database = Service('database.default')
def call(self, database, **row):
return { … }

Files
bonobo.FileReader(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str
)
bonobo.FileWriter(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str
)

CSV
bonobo.CsvReader(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str,
delimiter: str,
quotechar: str,
headers: tuple,
skip: int
)
bonobo.CsvWriter(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str,
delimiter: str,
quotechar: str,
headers: tuple
)

JSON
bonobo.JsonReader(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str
)
bonobo.JsonWriter(
path: str,
*,
eol: str,
encoding: str,
ioformat,
mode: str
)

Pickle
bonobo.PickleReader(
path: str,
*,
eol: str,
encoding: str,
ioformat,
item_names: tuple,
mode: str
)
bonobo.PickleWriter(
path: str,
*,
eol: str,
encoding: str,
ioformat,
item_names: tuple,
mode: str
)

Filters
@bonobo.Filter
def NoneShallPass(username):
return username == 'romain'
bonobo.graph(
# ...
NoneShallPass(),
# ...
)
bonobo.Filter(lambda username: username == 'romain')

SQLAlchemy Extension
bonobo_sqlalchemy.Select(
query,
*,
pack_size=1000,
limit=None
)
bonobo_sqlalchemy.InsertOrUpdate(
table_name,
*,
fetch_columns,
insert_only_fields,
discriminant,
…
)
PREVIEW

Docker Extension
$ pip install bonobo[docker]
$ bonobo runc myjob.py
PREVIEW

RIP rdc.etl
• 2013-2015, python 2.7.
• Brings most of the execution algorithms.
• Different context.
• Learn from mistakes.

Bonobo is young
• First commit : December 2016
• 21 releases, ~400 commits, 4 contributors
• Current stable 0.4.1, released yesterday

__future__
• Stay 100% Open-Source
• Light & Focused : Graphs and Executions
• The rest goes to Extensions.

Pre 1.0 Roadmap
• Towards stable API
• More tests, more documentation
• Developer experience ﬁrst
Target: 1.0 before 2018

Post 1.0 Roadmap
• Scheduling
• Monitoring
• More execution strategies
• Optimizations

www.bonobo-project.org
docs.bonobo-project.org
bonobo-slack.herokuapp.com
github.com/python-bonobo
Let me know what you think!

Thank you!
@monkcage @rdorgueil
https://goo.gl/e25eoa

Simple ETL in python 3.5+ with Bonobo - PyParis 2017

More Related Content

What's hot

Similar to Simple ETL in python 3.5+ with Bonobo - PyParis 2017

Recently uploaded

Simple ETL in python 3.5+ with Bonobo - PyParis 2017