Gerard Toonstra | Sr. Data Platform Engineer | g.toonstra@coolblue.nl | 08-10-2017
“Seek more understanding
through data analytics
about business processes.”
Where is time spent?
Follow important ETL principles
● Processes are idempotent
● Reusable, parameterisable components
● Data at rest between operations
s3://[my-bucket-name]/[dag-id]/[task-id]/2017/10/13
● Testing strategy
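The first three principles can be sketched together: if a task derives its output location only from its parameters, a rerun for the same date writes to the same place instead of creating a second copy. This is a minimal sketch assuming the path layout shown above; the function name `staging_key` and the bucket, DAG and task ids are hypothetical.

```python
from datetime import date


def staging_key(bucket: str, dag_id: str, task_id: str, execution_date: date) -> str:
    """Build the location where a task leaves its output at rest.

    The key depends only on the parameters, so rerunning the same task
    for the same execution date targets the same location.
    """
    return "s3://{}/{}/{}/{:%Y/%m/%d}".format(bucket, dag_id, task_id, execution_date)


# Hypothetical names, used only for illustration.
key = staging_key("my-bucket-name", "customer_dag", "extract_customer", date(2017, 10, 13))
print(key)
```

Because the function is small and has no side effects, it is also easy to cover with a unit test, which is one possible starting point for a testing strategy.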
“Airflow is a platform to
programmatically author,
schedule and monitor
workflows.”
What is a scheduler?
What is a DAG?
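A DAG (directed acyclic graph) describes which tasks must finish before others may start, and it cannot contain cycles. The sketch below illustrates the idea with Python's standard `graphlib` module (Python 3.9 or later) rather than Airflow's own API; the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; the value is the set of tasks it depends on.
dependencies = {
    "extract_customer": set(),
    "extract_order": set(),
    "transform_sales": {"extract_customer", "extract_order"},
    "load_dwh": {"transform_sales"},
}


def execution_order(deps):
    """Return one valid order in which the tasks could run."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return True if the dependencies form a valid DAG (no cycles)."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


order = execution_order(dependencies)
print(order)
print(is_acyclic({"a": {"b"}, "b": {"a"}}))  # two tasks waiting on each other
```

A scheduler can run tasks that have no unmet dependencies in parallel, which is why both extract tasks may appear in either order in the result.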
Conceptual Architecture
Operator?
Apache Airflow (incubating)
Architecture on AWS
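extract_customer = PostgresToPostgresOperator(
    src_postgres_conn_id='postgres_oltp',     # source connection (OLTP database)
    dest_postgress_conn_id='postgres_dwh',    # destination connection (data warehouse)
    sql='select_customer.sql',                # query file that selects the rows to copy
    pg_table='staging.customer',              # destination table
    parameters={"window_start_date": "{{ ds }}",            # execution date
                "window_end_date": "{{ tomorrow_ds }}"},    # day after the execution date
    pool='postgres_dwh')                      # pool limits concurrent tasks on the DWH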
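The operator call above passes `{{ ds }}` and `{{ tomorrow_ds }}` as window bounds, which Airflow renders as the execution date and the following day in `YYYY-MM-DD` form. The sketch below shows one way such a window can make a load idempotent: delete the window in the destination, then insert it again. It is not the implementation of `PostgresToPostgresOperator`; it uses the standard `sqlite3` module in place of Postgres, and the table layout and function names are assumptions made for illustration.

```python
import sqlite3
from datetime import date, timedelta


def window_for(execution_date: date) -> dict:
    """Mimic the values Airflow renders for {{ ds }} and {{ tomorrow_ds }}."""
    return {
        "window_start_date": execution_date.isoformat(),
        "window_end_date": (execution_date + timedelta(days=1)).isoformat(),
    }


def load_window(src, dest, params):
    """Copy one day's rows from src to dest, replacing any earlier load."""
    rows = src.execute(
        "SELECT id, name, updated_at FROM customer "
        "WHERE updated_at >= :window_start_date AND updated_at < :window_end_date",
        params,
    ).fetchall()
    with dest:  # one transaction: commit on success, roll back on error
        dest.execute(
            "DELETE FROM staging_customer "
            "WHERE updated_at >= :window_start_date AND updated_at < :window_end_date",
            params,
        )
        dest.executemany("INSERT INTO staging_customer VALUES (?, ?, ?)", rows)
    return len(rows)


# In-memory stand-ins for the OLTP source and the warehouse staging table.
src = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customer (id INTEGER, name TEXT, updated_at TEXT)")
dest.execute("CREATE TABLE staging_customer (id INTEGER, name TEXT, updated_at TEXT)")
src.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", "2017-10-13"), (2, "Bob", "2017-10-13"), (3, "Cy", "2017-10-14")],
)
src.commit()

params = window_for(date(2017, 10, 13))
load_window(src, dest, params)
load_window(src, dest, params)  # rerun for the same execution date
count = dest.execute("SELECT COUNT(*) FROM staging_customer").fetchone()[0]
print(count)
```

Running the load twice for the same date leaves two rows in the staging table, not four, because the second run first removes what the first run wrote.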
Where is time spent?
Follow important ETL principles
● Idempotency
● Reusable, parameterizable components
● Data at rest between operations
● Testing strategy

Apache Airflow Architecture