PyWaw #80, @higrys, @sprzedwojski
Attribution: https://github.com/ananthdurai/airflow-training/
Manageable data pipelines with
Airflow
(and Kubernetes)
DAG workflow
Airflow
Airflow is a platform to programmatically author,
schedule and monitor workflows.
Dynamic/Elegant
Extensible
Scalable
Companies using Airflow
(>200 officially)
Data Pipeline
https://xkcd.com/2054/
Cron vs. Airflow
Cron | Airflow
no installation needed | complex setup (but easy with Google Cloud Composer)
locally stored job output | aggregated logs, shown in the UI
no versioning | versioning as usual (for programmers)
no testing | automated testing as usual (for programmers)
hard to re-run and backfill | re-running and backfilling (run all missed runs)
no dependency management | complex data pipelines (task dependencies)
Airflow use cases
● ETL jobs
● ML pipelines
● Regular operations:
○ Delivering data
○ Performing backups
● ...
Core concepts - Directed Acyclic Graph (DAG)
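A workflow DAG is just tasks plus directed dependencies, and the scheduler can only execute it if a topological order exists, i.e. there are no cycles. A minimal pure-Python sketch (hypothetical task names, standard library only, not Airflow's own scheduler):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical ETL tasks; each key maps to the set of its upstream tasks.
deps = {
    "transform": {"extract"},   # transform runs after extract
    "load": {"transform"},
    "notify": {"load"},
}

# A valid execution order exists only because the graph is acyclic.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```

Adding a cycle (e.g. making "extract" depend on "notify") would raise `graphlib.CycleError`, which is exactly why Airflow insists on the "acyclic" part.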
Core concepts - Operators
Core concepts - Tasks, TaskInstances, DagRuns
Operator types
● Action Operators
○ Python, Bash, Docker, GCEInstanceStart, ...
● Sensor Operators
○ S3KeySensor, HivePartitionSensor, BigtableTableWaitForReplicationOperator, ...
● Transfer Operators
○ MsSqlToHiveTransfer, RedshiftToS3Transfer, …
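Action operators do work, transfer operators move data, and sensors wait for a condition to become true. Under the hood a sensor is essentially a poll loop. A hedged pure-Python sketch of that "poke" pattern (not Airflow's actual BaseSensorOperator, and `key_exists` is a stubbed condition):

```python
import time

def wait_for(condition, poke_interval=1.0, timeout=10.0):
    """Poll `condition` until it returns True or `timeout` elapses.

    Mirrors the poke-style sensor pattern: check, sleep, repeat.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poke_interval)
    raise TimeoutError("sensor timed out")

# Stubbed condition: pretend the awaited S3 key appears on the 3rd poke.
calls = {"n": 0}
def key_exists():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_for(key_exists, poke_interval=0.01)
```

A real S3KeySensor does the same thing, with the condition being "does this key exist in the bucket?".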
Operator Properties
Important when writing your operators:
● Make operators idempotent
● Task execution is atomic
● State is not stored locally
● Data is exchanged between tasks via XCom
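XCom is a small key-value store keyed by task: one task pushes a value, a downstream task pulls it. A minimal mock of that exchange (a plain dict and hypothetical task ids, not Airflow's database-backed implementation), which also shows why idempotency matters: re-running the push overwrites state instead of accumulating it:

```python
# Minimal XCom mock: (task_id, key) -> value
xcom = {}

def xcom_push(task_id, key, value):
    # Overwrite semantics: re-running the same task leaves the same state.
    xcom[(task_id, key)] = value

def xcom_pull(task_id, key="return_value"):
    return xcom.get((task_id, key))

# Upstream task pushes its result; pushing twice (a retry) changes nothing.
xcom_push("gcp_service_list_instances_sql", "return_value", "db-1 db-2")
xcom_push("gcp_service_list_instances_sql", "return_value", "db-1 db-2")

# Downstream task pulls it by the upstream task's id.
print(xcom_pull("gcp_service_list_instances_sql"))  # db-1 db-2
```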
Airflow Plugins
Airflow Operator and Sensor
CLI tools
Show me the code!
Source: https://www.logolynx.com/images/logolynx/0b/0b42e766caee6dcd7355c1c95ddaaa1c.png
Source: http://www.faicoach.com/wp-content/uploads/2017/10/cash-burn.jpg
The solution
Sources:
https://services.garmin.cn/appsLibraryBusinessServices_v0/rest/apps/9b5dabf3-925b-4a90-b465-89147178ea0e/icon/d8998af1-9fc7-450e-abd2-c57ae102887e
https://malloc.fi/static/images/slack-memory-management.png
https://i.gifer.com/9GXs.gif
Solution components
● BashOperator
● PythonOperator
● EmailOperator
The solution
Initialize DAG
from airflow import DAG
from airflow.utils import dates

dag = DAG(dag_id='gcp_spy',
          default_args={
              'start_date': dates.days_ago(1),
              'retries': 1
          },
          schedule_interval='0 16 * * *')
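The schedule_interval is a cron expression: minute 0, hour 16, every day of month, month, and weekday, i.e. daily at 16:00 (Airflow parses this with a cron library internally). A sketch of what that implies for the next run time, hardcoded to this one expression rather than a general cron parser:

```python
from datetime import datetime, timedelta

def next_run_at_16(after: datetime) -> datetime:
    """Next moment matching the cron '0 16 * * *' (daily at 16:00)."""
    candidate = after.replace(hour=16, minute=0, second=0, microsecond=0)
    if candidate <= after:          # today's 16:00 already passed
        candidate += timedelta(days=1)
    return candidate

print(next_run_at_16(datetime(2018, 11, 27, 9, 30)))  # 2018-11-27 16:00:00
print(next_run_at_16(datetime(2018, 11, 27, 17, 0)))  # 2018-11-28 16:00:00
```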
List of instances
from airflow.operators.bash_operator import BashOperator

bash_task = BashOperator(
    xcom_push=True,
    bash_command=
        "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
        "| tr '\n' ' '",
    task_id="gcp_service_list_instances_sql",
    dag=dag
)
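The bash_command pipes the gcloud output through tail (drop the header row), grep (keep the leading non-space run of each line, i.e. the first column) and tr (join the lines with spaces); with xcom_push=True, the command's stdout becomes the task's XCom value. The same transformation in pure Python, on a made-up sample of `gcloud sql instances list` output:

```python
# Hypothetical `gcloud sql instances list` output: header + one row per instance.
raw = """NAME        DATABASE_VERSION  LOCATION      TIER
orders-db   MYSQL_5_7         europe-west1  db-n1-standard-1
users-db    POSTGRES_9_6      europe-west1  db-n1-standard-1"""

lines = raw.splitlines()[1:]                  # tail -n +2: skip the header
names = [line.split()[0] for line in lines]   # grep -oE '^[^ ]+': first column
result = " ".join(names)                      # tr '\n' ' ': one space-separated line

print(result)  # orders-db users-db
```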
All services
GCP_SERVICES = [
('sql', 'Cloud SQL'),
('spanner', 'Spanner'),
('bigtable', 'BigTable'),
('compute', 'Compute Engine'),
]
List of instances - all services
for gcp_service in GCP_SERVICES:
    bash_task = BashOperator(
        xcom_push=True,
        bash_command=
            "gcloud {} instances list | tail -n +2 | grep -oE '^[^ ]+' "
            "| tr '\n' ' '".format(gcp_service[0]),
        task_id="gcp_service_list_instances_{}".format(gcp_service[0]),
        dag=dag
    )
Send Slack message
from airflow.operators.python_operator import PythonOperator

send_slack_msg_task = PythonOperator(
    python_callable=send_slack_msg,
    provide_context=True,
    task_id='send_slack_msg_task',
    dag=dag
)
import json
import requests

def send_slack_msg(**context):
    (...)
    for gcp_service in GCP_SERVICES:
        result = context['task_instance'].xcom_pull(
            task_ids='gcp_service_list_instances_{}'.format(gcp_service[0]))
        data = (...)
        requests.post(
            url=SLACK_WEBHOOK,
            data=json.dumps(data),
            headers={'Content-type': 'application/json'}
        )
Prepare email
prepare_email_task = PythonOperator(
python_callable=prepare_email,
provide_context=True,
task_id='prepare_email_task',
dag=dag
)
def prepare_email(**context):
    (...)
    for gcp_service in GCP_SERVICES:
        result = context['task_instance'].xcom_pull(
            task_ids='gcp_service_list_instances_{}'.format(gcp_service[0]))
        (...)
    html_content = (...)
    context['task_instance'].xcom_push(key='email', value=html_content)
Send email
from airflow.operators.email_operator import EmailOperator

send_email_task = EmailOperator(
    to='szymon.przedwojski@polidea.com',
    subject=INSTANCES_IN_PROJECT_TITLE,
    html_content=
        "{{ task_instance.xcom_pull(task_ids='prepare_email_task', key='email') }}",
    task_id='send_email',
    dag=dag
)
Dependencies
for gcp_service in GCP_SERVICES:
    bash_task = BashOperator(
        (...)
    )
    bash_task >> send_slack_msg_task
    bash_task >> prepare_email_task
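The `>>` syntax works because Airflow operators overload Python's right-shift operator: `a >> b` calls `a.__rshift__(b)` and records b as a's downstream task. A toy sketch of that mechanism (not Airflow's actual classes):

```python
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b: record b as a's downstream task; return b so chains work
        self.downstream.append(other)
        return other

bash_task = Task("bash_task")
send_slack = Task("send_slack_msg_task")
prepare_email = Task("prepare_email_task")

bash_task >> send_slack
bash_task >> prepare_email

print([t.task_id for t in bash_task.downstream])
# ['send_slack_msg_task', 'prepare_email_task']
```

Returning `other` from `__rshift__` is what allows longer chains like `a >> b >> c`.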
Dependencies
for gcp_service in GCP_SERVICES:
    bash_task = BashOperator(
        (...)
    )
    bash_task >> send_slack_msg_task
    bash_task >> prepare_email_task

prepare_email_task >> send_email_task
Demo
https://github.com/PolideaInternal/airflow-gcp-spy
Complex DAGs
Source: https://speakerdeck.com/pybay/2016-matt-davis-a-practical-introduction-to-airflow?slide=13
Complex, Manageable, DAGs
Single node - Local Executor
[Diagram] Everything runs on one machine: the web server, the scheduler, the RDBMS holding Airflow's metadata, and the DAGs folder. The scheduler spawns local executor processes via multiprocessing.
Celery Executor
[Diagram] A controller node hosts the web server, the scheduler, the RDBMS and the DAGs folder. The scheduler hands tasks to a Celery broker (RabbitMQ / Redis / Amazon SQS); worker nodes (Node 1, Node 2) pick them up. DAG files must be synced to every node (Chef / Puppet / Ansible / NFS).
(Beta): Kubernetes Executor
[Diagram] A controller hosts the web server, the scheduler and the RDBMS. The scheduler packages each task as a pod and submits it to the Kubernetes master; cluster nodes (Node 1, Node 2) run the pods. DAG files reach the pods via git init containers, a persistent volume, or baked into the image (future).
GCP - Composer
https://github.com/GoogleCloudPlatform/airflow-operator
Thank You!
Follow us @
polidea.com/blog
Source: https://techflourish.com/images/superman-animated-clipart.gif
https://airflow.apache.org/_images/pin_large.png
Questions? :)
Follow us @
polidea.com/blog

Manageable data pipelines with Airflow (and Kubernetes), November 27, 11:45 AM, PyWaw