@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Manageable data pipelines with
Airflow
(and Kubernetes)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Airflow
Airflow is a platform to programmatically author,
schedule and monitor workflows.
Dynamic/Elegant
Extensible
Scalable
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Workflows
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Companies using Airflow
(>200 officially)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Data Pipeline
https://xkcd.com/2054/
GDG DevFest Warsaw 2018 @higrys, @sprzedwojski
Airflow vs. other workflow platforms
● Programming workflows
○ writing code not XML
○ versioning as usual
○ automated testing as usual
○ complex dependencies between tasks
● Managing workflows
○ aggregate logs in one UI
○ tracking execution
○ re-running, backfilling (run all missed runs)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Airflow use cases
● ETL jobs
● ML pipelines
● Regular operations:
○ Delivering data
○ Performing backups
● ...
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Core concepts - Directed Acyclic Graph (DAG)
Source: https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/example_dags/example_twitter_README.md
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Core concepts - Operators
Source: https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Operator types
● Action Operators
○ Python, Bash, Docker, GCEInstanceStart, ...
● Sensor Operators
○ S3KeySensor, HivePartitionSensor,
BigtableTableWaitForReplicationOperator , ...
● Transfer Operators
○ MsSqlToHiveTransfer, RedshiftToS3Transfer, …
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
class ExampleOperator(BaseOperator):
def execute(self, context):
# Do something
pass
Operator and Sensor
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
class ExampleOperator(BaseOperator):
def execute(self, context):
# Do something
pass
class ExampleSensorOperator(BaseSensorOperator):
def poke(self, context):
# Check if the condition occurred
return True
Operator and Sensor
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Operator good practices
● Idempotent
● Atomic
● No direct data sharing
○ Small portions of data between tasks: XCOMs
○ Large amounts of data: S3, GCS, etc.
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Core concepts - Tasks, TaskInstances, DagRuns
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Show me the code!
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Source: https://www.logolynx.com/images/logolynx/0b/0b42e766caee6dcd7355c1c95ddaaa1c.png
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Source: http://www.faicoach.com/wp-content/uploads/2017/10/cash-burn.jpg
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
The solution
Sources:
https://services.garmin.cn/appsLibraryBusinessServices_v0/rest/apps/9b5dabf3-925b
https://malloc.fi/static/images/slack-memory-management.png
https://i.gifer.com/9GXs.gif
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Solution components
● Generic
○ BashOperator
○ PythonOperator
● Specific
○ EmailOperator
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
The DAG
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Initialize DAG
dag = DAG(dag_id='gcp_spy',
...
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Initialize DAG
dag = DAG(dag_id='gcp_spy',
default_args={
'start_date': utils.dates.days_ago(1),
'retries': 1
},
...
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Initialize DAG
dag = DAG(dag_id='gcp_spy',
default_args={
'start_date': utils.dates.days_ago(1),
'retries': 1
},
schedule_interval='0 16 * * *'
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
List of instances
bash_task = BashOperator(
task_id="gcp_service_list_instances_sql",
...
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
List of instances
bash_task = BashOperator(
task_id="gcp_service_list_instances_sql",
bash_command=
"gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
"| tr 'n' ' '",
...
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
List of instances
bash_task = BashOperator(
task_id="gcp_service_list_instances_sql",
bash_command=
"gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
"| tr 'n' ' '",
xcom_push=True,
...
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
List of instances
bash_task = BashOperator(
task_id="gcp_service_list_instances_sql",
bash_command=
"gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
"| tr 'n' ' '",
xcom_push=True,
dag=dag
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
All services
GCP_SERVICES = [
('sql', 'Cloud SQL'),
('spanner', 'Spanner'),
('bigtable', 'BigTable'),
('compute', 'Compute Engine'),
]
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
List of instances - all services
????
bash_task = BashOperator(
task_id="gcp_service_list_instances_sql",
bash_command=
"gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
"| tr 'n' ' '",
xcom_push=True,
dag=dag
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
List of instances - all services
for gcp_service in GCP_SERVICES:
bash_task = BashOperator(
task_id="gcp_service_list_instances_{}".format(gcp_service[0]),
bash_command=
"gcloud {} instances list | tail -n +2 | grep -oE '^[^ ]+' "
"| tr 'n' ' '".format(gcp_service[0]),
xcom_push=True,
dag=dag
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Send Slack message
send_slack_msg_task = PythonOperator(
python_callable=send_slack_msg,
provide_context=True,
task_id='send_slack_msg_task',
dag=dag
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Send Slack message
send_slack_msg_task = PythonOperator(
python_callable=send_slack_msg,
provide_context=True,
task_id='send_slack_msg_task',
dag=dag
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
def send_slack_msg(**context):
for gcp_service in GCP_SERVICES:
result = context['task_instance'].
xcom_pull(task_ids='gcp_service_list_instances_{}'
.format(gcp_service[0]))
...
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
def send_slack_msg(**context):
for gcp_service in GCP_SERVICES:
result = context['task_instance'].
xcom_pull(task_ids='gcp_service_list_instances_{}'
.format(gcp_service[0]))
...
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
def send_slack_msg(**context):
for gcp_service in GCP_SERVICES:
result = context['task_instance'].
xcom_pull(task_ids='gcp_service_list_instances_{}'
.format(gcp_service[0]))
data = ...
...
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
def send_slack_msg(**context):
for gcp_service in GCP_SERVICES:
result = context['task_instance'].
xcom_pull(task_ids='gcp_service_list_instances_{}'
.format(gcp_service[0]))
data = ...
requests.post(
url=SLACK_WEBHOOK,
data=json.dumps(data),
headers={'Content-type': 'application/json'}
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Prepare email
prepare_email_task = PythonOperator(
python_callable=prepare_email,
provide_context=True,
task_id='prepare_email_task',
dag=dag
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Prepare email
prepare_email_task = PythonOperator(
python_callable=prepare_email,
provide_context=True,
task_id='prepare_email_task',
dag=dag
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
def prepare_email(**context):
for gcp_service in GCP_SERVICES:
result = context['task_instance'].
xcom_pull(task_ids='gcp_service_list_instances_{}'
.format(gcp_service[0]))
...
html_content = ...
context['task_instance'].xcom_push(key='email', value=html_content)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
def prepare_email(**context):
for gcp_service in GCP_SERVICES:
result = context['task_instance'].
xcom_pull(task_ids='gcp_service_list_instances_{}'
.format(gcp_service[0]))
...
html_content = ...
context['task_instance'].xcom_push(key='email', value=html_content)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Send email
send_email_task = EmailOperator(
task_id='send_email',
to='szymon.przedwojski@polidea.com',
subject=INSTANCES_IN_PROJECT_TITLE,
html_content=...,
dag=dag
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Send email
send_email_task = EmailOperator(
task_id='send_email',
to='szymon.przedwojski@polidea.com',
subject=INSTANCES_IN_PROJECT_TITLE,
html_content=
"{{ task_instance.xcom_pull(task_ids='prepare_email_task', key='email') }}",
dag=dag
)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Dependencies
for gcp_service in GCP_SERVICES:
bash_task = BashOperator(
...
)
bash_task >> send_slack_msg_task
bash_task >> prepare_email_task
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Dependencies
for gcp_service in GCP_SERVICES:
bash_task = BashOperator(
...
)
bash_task >> send_slack_msg_task
bash_task >> prepare_email_task
prepare_email_task >> send_email_task
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Dependencies
for gcp_service in GCP_SERVICES:
bash_task = BashOperator(
...
)
bash_task >> send_slack_msg_task
bash_task >> prepare_email_task
prepare_email_task >> send_email_task
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Demo
https://github.com/PolideaInternal/airflow-gcp-spy
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Complex DAGs
Source: https://speakerdeck.com/pybay/2016-matt-davis-a-practical-introduction-to-airflow?slide=13
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Complex, Manageable, DAGs
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Single node
Local Executor
Web
server
RDBMS DAGs
Scheduler
Local executors`
Local executors
Local executors
Local executors
multiprocessing
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Celery Executor
Controller
Web
server
RDBMS DAGs
Scheduler
Celery Broker
RabbitMQ/Redis/AmazonSQS
Node 1 Node 2
DAGs DAGs
Worker Worker
Sync files
(Chef/Puppet/Ansible/NFS)
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
(Beta): Kubernetes Executor
Controller
Web
server
RDBMS
DAGs
Scheduler
Kubernetes Cluster
Node 1 Node 2
Pod
Sync files
● Git Init
● Persistent Volume
● Baked-in (future)
Package
as pods
Kubernetes Master
DAGs DAGs
Pod
Pod
Pod
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
GCP - Composer
https://github.com/GoogleCloudPlatform/airflow-operator
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Thank You!
Follow us @
polidea.com/blog
Source: https://techflourish.com/images/superman-animated-clipart.gif
https://airflow.apache.org/_images/pin_large.png
GDG DevFest Warsaw 2018 @higrys, @sprzedwojski
@higrys, @sprzedwojskiGDG DevFest Warsaw 2018
Questions? :)
Follow us @
polidea.com/blog
Source: https://techflourish.com/images/superman-animated-clipart.gif
https://airflow.apache.org/_images/pin_large.png

Manageable Data Pipelines With Airflow (and kubernetes) - GDG DevFest

  • 1.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Manageable data pipelines with Airflow (and Kubernetes)
  • 2.
  • 3.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Airflow Airflow is a platform to programmatically author, schedule and monitor workflows. Dynamic/Elegant Extensible Scalable
  • 4.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Workflows Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
  • 5.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Companies using Airflow (>200 officially)
  • 6.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Data Pipeline https://xkcd.com/2054/
  • 7.
    GDG DevFest Warsaw2018 @higrys, @sprzedwojski Airflow vs. other workflow platforms ● Programming workflows ○ writing code not XML ○ versioning as usual ○ automated testing as usual ○ complex dependencies between tasks ● Managing workflows ○ aggregate logs in one UI ○ tracking execution ○ re-running, backfilling (run all missed runs)
  • 8.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Airflow use cases ● ETL jobs ● ML pipelines ● Regular operations: ○ Delivering data ○ Performing backups ● ...
  • 9.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Core concepts - Directed Acyclic Graph (DAG) Source: https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/example_dags/example_twitter_README.md
  • 10.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Core concepts - Operators Source: https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c
  • 11.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Operator types ● Action Operators ○ Python, Bash, Docker, GCEInstanceStart, ... ● Sensor Operators ○ S3KeySensor, HivePartitionSensor, BigtableTableWaitForReplicationOperator , ... ● Transfer Operators ○ MsSqlToHiveTransfer, RedshiftToS3Transfer, …
  • 12.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 class ExampleOperator(BaseOperator): def execute(self, context): # Do something pass Operator and Sensor
  • 13.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 class ExampleOperator(BaseOperator): def execute(self, context): # Do something pass class ExampleSensorOperator(BaseSensorOperator): def poke(self, context): # Check if the condition occurred return True Operator and Sensor
  • 14.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Operator good practices ● Idempotent ● Atomic ● No direct data sharing ○ Small portions of data between tasks: XCOMs ○ Large amounts of data: S3, GCS, etc.
  • 15.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Core concepts - Tasks, TaskInstances, DagRuns Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
  • 16.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Show me the code!
  • 17.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Source: https://www.logolynx.com/images/logolynx/0b/0b42e766caee6dcd7355c1c95ddaaa1c.png
  • 18.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Source: http://www.faicoach.com/wp-content/uploads/2017/10/cash-burn.jpg
  • 19.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 The solution Sources: https://services.garmin.cn/appsLibraryBusinessServices_v0/rest/apps/9b5dabf3-925b https://malloc.fi/static/images/slack-memory-management.png https://i.gifer.com/9GXs.gif
  • 20.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Solution components ● Generic ○ BashOperator ○ PythonOperator ● Specific ○ EmailOperator
  • 21.
  • 22.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Initialize DAG dag = DAG(dag_id='gcp_spy', ... )
  • 23.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Initialize DAG dag = DAG(dag_id='gcp_spy', default_args={ 'start_date': utils.dates.days_ago(1), 'retries': 1 }, ... )
  • 24.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Initialize DAG dag = DAG(dag_id='gcp_spy', default_args={ 'start_date': utils.dates.days_ago(1), 'retries': 1 }, schedule_interval='0 16 * * *' )
  • 25.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 List of instances bash_task = BashOperator( task_id="gcp_service_list_instances_sql", ... )
  • 26.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 List of instances bash_task = BashOperator( task_id="gcp_service_list_instances_sql", bash_command= "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '", ... )
  • 27.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 List of instances bash_task = BashOperator( task_id="gcp_service_list_instances_sql", bash_command= "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '", xcom_push=True, ... )
  • 28.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 List of instances bash_task = BashOperator( task_id="gcp_service_list_instances_sql", bash_command= "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '", xcom_push=True, dag=dag )
  • 29.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 All services GCP_SERVICES = [ ('sql', 'Cloud SQL'), ('spanner', 'Spanner'), ('bigtable', 'BigTable'), ('compute', 'Compute Engine'), ]
  • 30.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 List of instances - all services ???? bash_task = BashOperator( task_id="gcp_service_list_instances_sql", bash_command= "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '", xcom_push=True, dag=dag )
  • 31.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 List of instances - all services for gcp_service in GCP_SERVICES: bash_task = BashOperator( task_id="gcp_service_list_instances_{}".format(gcp_service[0]), bash_command= "gcloud {} instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '".format(gcp_service[0]), xcom_push=True, dag=dag )
  • 32.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Send Slack message send_slack_msg_task = PythonOperator( python_callable=send_slack_msg, provide_context=True, task_id='send_slack_msg_task', dag=dag )
  • 33.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Send Slack message send_slack_msg_task = PythonOperator( python_callable=send_slack_msg, provide_context=True, task_id='send_slack_msg_task', dag=dag )
  • 34.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 def send_slack_msg(**context): for gcp_service in GCP_SERVICES: result = context['task_instance']. xcom_pull(task_ids='gcp_service_list_instances_{}' .format(gcp_service[0])) ...
  • 35.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 def send_slack_msg(**context): for gcp_service in GCP_SERVICES: result = context['task_instance']. xcom_pull(task_ids='gcp_service_list_instances_{}' .format(gcp_service[0])) ...
  • 36.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 def send_slack_msg(**context): for gcp_service in GCP_SERVICES: result = context['task_instance']. xcom_pull(task_ids='gcp_service_list_instances_{}' .format(gcp_service[0])) data = ... ...
  • 37.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 def send_slack_msg(**context): for gcp_service in GCP_SERVICES: result = context['task_instance']. xcom_pull(task_ids='gcp_service_list_instances_{}' .format(gcp_service[0])) data = ... requests.post( url=SLACK_WEBHOOK, data=json.dumps(data), headers={'Content-type': 'application/json'} )
  • 38.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Prepare email prepare_email_task = PythonOperator( python_callable=prepare_email, provide_context=True, task_id='prepare_email_task', dag=dag )
  • 39.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Prepare email prepare_email_task = PythonOperator( python_callable=prepare_email, provide_context=True, task_id='prepare_email_task', dag=dag )
  • 40.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 def prepare_email(**context): for gcp_service in GCP_SERVICES: result = context['task_instance']. xcom_pull(task_ids='gcp_service_list_instances_{}' .format(gcp_service[0])) ... html_content = ... context['task_instance'].xcom_push(key='email', value=html_content)
  • 41.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 def prepare_email(**context): for gcp_service in GCP_SERVICES: result = context['task_instance']. xcom_pull(task_ids='gcp_service_list_instances_{}' .format(gcp_service[0])) ... html_content = ... context['task_instance'].xcom_push(key='email', value=html_content)
  • 42.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Send email send_email_task = EmailOperator( task_id='send_email', to='szymon.przedwojski@polidea.com', subject=INSTANCES_IN_PROJECT_TITLE, html_content=..., dag=dag )
  • 43.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Send email send_email_task = EmailOperator( task_id='send_email', to='szymon.przedwojski@polidea.com', subject=INSTANCES_IN_PROJECT_TITLE, html_content= "{{ task_instance.xcom_pull(task_ids='prepare_email_task', key='email') }}", dag=dag )
  • 44.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Dependencies for gcp_service in GCP_SERVICES: bash_task = BashOperator( ... ) bash_task >> send_slack_msg_task bash_task >> prepare_email_task
  • 45.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Dependencies for gcp_service in GCP_SERVICES: bash_task = BashOperator( ... ) bash_task >> send_slack_msg_task bash_task >> prepare_email_task prepare_email_task >> send_email_task
  • 46.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Dependencies for gcp_service in GCP_SERVICES: bash_task = BashOperator( ... ) bash_task >> send_slack_msg_task bash_task >> prepare_email_task prepare_email_task >> send_email_task
  • 47.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Demo https://github.com/PolideaInternal/airflow-gcp-spy
  • 48.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Complex DAGs Source: https://speakerdeck.com/pybay/2016-matt-davis-a-practical-introduction-to-airflow?slide=13
  • 49.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Complex, Manageable, DAGs
  • 50.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
  • 51.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Single node Local Executor Web server RDBMS DAGs Scheduler Local executors` Local executors Local executors Local executors multiprocessing
  • 52.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Celery Executor Controller Web server RDBMS DAGs Scheduler Celery Broker RabbitMQ/Redis/AmazonSQS Node 1 Node 2 DAGs DAGs Worker Worker Sync files (Chef/Puppet/Ansible/NFS)
  • 53.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 (Beta): Kubernetes Executor Controller Web server RDBMS DAGs Scheduler Kubernetes Cluster Node 1 Node 2 Pod Sync files ● Git Init ● Persistent Volume ● Baked-in (future) Package as pods Kubernetes Master DAGs DAGs Pod Pod Pod
  • 54.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 GCP - Composer https://github.com/GoogleCloudPlatform/airflow-operator
  • 55.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Thank You! Follow us @ polidea.com/blog Source: https://techflourish.com/images/superman-animated-clipart.gif https://airflow.apache.org/_images/pin_large.png
  • 56.
    GDG DevFest Warsaw2018 @higrys, @sprzedwojski
  • 57.
    @higrys, @sprzedwojskiGDG DevFestWarsaw 2018 Questions? :) Follow us @ polidea.com/blog Source: https://techflourish.com/images/superman-animated-clipart.gif https://airflow.apache.org/_images/pin_large.png