Running Airflow Workflows as ETL Processes on Hadoop (clairvoyantllc)
While working with Hadoop, you'll eventually encounter the need to schedule and run workflows to perform various operations like ingesting data or performing ETL. There are a number of tools available to assist you with this type of requirement, and one such tool that we at Clairvoyant have been looking to use is Apache Airflow. Apache Airflow is an Apache Incubator project that allows you to programmatically create workflows through a Python script. This provides a flexible and effective way to design your workflows with little code and setup. In this talk, we will discuss Apache Airflow and how we at Clairvoyant have utilized it for ETL pipelines on Hadoop.
In the session, we discussed the end-to-end working of Apache Airflow, focusing mainly on the "why, what, and how". It covers DAG creation and implementation, the architecture, and the pros and cons. It also covers how a DAG is created to schedule a job, the steps required to create a DAG with a Python script, and finally a working demo.
(ATS6-PLAT07) Managing AEP in an enterprise environment (BIOVIA)
Accelrys Enterprise Platform use within an enterprise environment spans from power users of Pipeline Pilot to web applications and high-performance computing. Managing the balance between productivity and enterprise policies can be tricky. This session will focus on the tools and processes administrators need to keep users productive while allowing IT to remain in control.
Capacity planning is a difficult challenge faced by most companies. If you have too few machines, you will not have enough compute resources available to deal with heavy loads. On the other hand, if you have too many machines, you are wasting money. This is why companies have started investing in automatically scaling services and infrastructure to minimize the amount of wasted money and resources.
In this talk, Nathan will describe how Yelp is using PaaSTA, a PaaS built on top of open source tools including Docker, Mesos, Marathon, and Chronos, to automatically and gracefully scale services and the underlying cluster. He will go into detail about how this functionality was implemented and the design decisions that were made while architecting the system. He will also provide a brief comparison of how this approach differs from existing solutions.
Introduction to Apache Airflow, its main concepts and features, and an example of a DAG. Afterwards, some lessons and best practices learned from the three years I have been using Airflow to power workflows in production.
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes (DataWorks Summit)
The importance of ingesting and processing streaming data in the telecommunications industry is ever increasing. At SK Telecom, Korea's number-one telecommunications provider, we face the question of how to use infrastructure resources more efficiently. Apache Druid supports an auto-scaling feature for data ingestion, but it is only available on AWS EC2, so we cannot rely on it on our private cloud.
In this talk, we introduce auto scale-out/in on Kubernetes. This approach improves on Druid's own scaling implementation in several ways. First, it can be used anywhere: on a private cloud or on (managed) Kubernetes in Azure, AWS, and GKE. Second, AWS EC2 startup and termination take a few minutes, whereas our approach takes a few seconds. Third, the scaling mechanism is decoupled from Druid's source code. We will also share the development of a Druid Helm chart, rolling updates, and the use of custom metrics for horizontal auto scaling.
In more detail, the benefits compared with Druid's auto-scaling approach are:
1. Druid's auto scaling is only available on AWS, but our approach does not have that limitation. It can be used on a private cloud (on-premises) or on (managed) Kubernetes in Azure, AWS, and GKE.
2. An AWS EC2 instance is a virtual machine, so startup is slower than for a Docker container: starting or terminating an EC2 instance takes a few minutes, while a lightweight Docker container takes only a few seconds.
3. Druid's auto scaling is tightly coupled to the AWS API because the Druid engine code calls it directly. Our scale-out/in algorithm is conceptually equivalent to Druid's, but we removed that dependency: Kubernetes communicates with one of the dispatcher nodes (i.e., the Overlord node) over a REST API.
Heart of the SwarmKit: Store, Topology & Object Model (Docker, Inc.)
Heart of the SwarmKit: Store, Topology & Object Model by Aaron, Andrea, Stephen D (Docker)
Swarmkit repo - https://github.com/docker/swarmkit
Liveblogging: http://canopy.mirage.io/Liveblog/SwarmKitDDS2016
Tempto is a product test framework that allows developers to write and execute tests for SQL databases running on Hadoop. Individual test requirements such as data generation, HDFS file copy/storage of generated data, and schema creation are expressed declaratively and are automatically fulfilled by the framework. Developers can write tests using Java (with a TestNG-like paradigm and AssertJ-style assertions) or by providing query files with expected results. We will show how we use it for Presto product tests.
Benchto is a benchmark framework that provides an easy and manageable way to define, run and analyze macro benchmarks in a clustered environment. Understanding the behavior of distributed systems is hard and requires good visibility into the state of the cluster and the internals of the tested system. This project was developed for repeatable benchmarking of Hadoop SQL engines, most importantly Presto.
An admin application for editing databases, with configurable features (grouping and ordering of tables and columns, hyperlink navigation between related records, etc.).
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric... (Databricks)
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but it also adds substantial performance overheads because all data and intermediate state of compute tasks are stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
Water scarcity is the lack of fresh water resources to meet standard water demand. There are two types of water scarcity: physical and economic.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL) (MdTanvirMahtab2)
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL), a government-owned company of Bangladesh Chemical Industries Corporation under the Ministry of Industries.
Student information management system project report ii.pdf (Kamal Acharya)
Our project is about student management. It covers the various actions related to student details, makes it easy to add, edit and delete student records, and provides a less time-consuming process for viewing, adding, editing and deleting students' marks.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible with the IDM8000 CCR. Backplane-mounted serial and TCP/Ethernet communication module for CCR remote access; IDM8000 CCR remote control over serial and TCP protocols.
• Remote control: parallel or serial interface.
• Compatible with the MAFI CCR system.
• Compatible with the IDM8000 CCR.
• Compatible with backplane-mounted serial communication.
• Compatible with commercial and defence aviation CCR systems.
• Remote control system for accessing CCR and allied systems over serial or TCP.
• Indigenized local support/presence in India.
• Easy configuration using DIP switches.
2. Index
- Workflow Management Systems
- Architecture
- Building blocks
- More features
- User Interface
- Security
- CLI
- Demo
3. WTH is a Workflow Management System?
A workflow management system is data-centric software (a framework) for:
- Setting up
- Performing
- Monitoring
of a defined sequence of processes and tasks
12. Building blocks
Operators:
- Describe a single task in a workflow
- Determine what actually gets done
- Operators generally run independently (atomic)
- The DAG makes sure that operators run in the correct order
- They may run on completely different machines (see the sketch below)
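A minimal sketch of a DAG wiring two operators together (Airflow 1.x-style imports; the DAG id, schedule and commands are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # import path in Airflow 1.x

dag = DAG(
    dag_id="example_etl",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# Two atomic tasks; the DAG only defines the order in which they run.
extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

extract >> load  # load runs only after extract has succeeded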
13. Building blocks
Operators: there are 3 main types of operators:
● Operators that perform an action, or tell another system to perform an action
● Transfer operators move data from one system to another
● Sensors are a type of operator that will keep running until a certain criterion is met (see the sensor sketch below). Examples include:
○ A specific file landing in HDFS or S3
○ A partition appearing in Hive
○ A specific time of the day
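A sensor sketch, assuming an Airflow Connection named "my_s3" and an illustrative bucket/key (the S3KeySensor import path differs between Airflow releases):

from datetime import datetime
from airflow import DAG
from airflow.sensors.s3_key_sensor import S3KeySensor  # airflow.operators.sensors in older 1.x releases

dag = DAG("sensor_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

# Keeps poking S3 every 60 seconds until the key exists (or the 1-hour timeout is reached).
wait_for_input = S3KeySensor(
    task_id="wait_for_input",
    bucket_key="s3://my-bucket/input/DONE.FLAG",
    aws_conn_id="my_s3",
    poke_interval=60,
    timeout=60 * 60,
    dag=dag,
)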
18. Building blocks
Tasks: a parameterized instance of an operator
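A sketch of two tasks created from the same operator class with different parameters (DAG id, task ids, table names and the callable are illustrative; Airflow 1.x-style imports):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

dag = DAG("task_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

def load_table(table, **context):
    # `ds` (the execution date) is part of the context passed in by Airflow.
    print("loading %s for %s" % (table, context["ds"]))

# Same operator class, two differently parameterized tasks.
load_clicks = PythonOperator(task_id="load_clicks", python_callable=load_table,
                             op_kwargs={"table": "clicks"}, provide_context=True, dag=dag)
load_orders = PythonOperator(task_id="load_orders", python_callable=load_table,
                             op_kwargs={"table": "orders"}, provide_context=True, dag=dag)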
19. Building blocks
Task Instance: DAG + Task + point in time
- A specific run of a task
- A task assigned to a DAG
- Has a state associated with a specific run of the DAG
- States: it could be
- running
- success
- failed
- skipped
- up for retry
- …
20. Building blocks
Workflows :
● DAG: a description of the order in which work should take place
● Operator: a class that acts as a template for carrying out some work
● Task: a parameterized instance of an operator
● Task Instance: a task that
○ Has been assigned to a DAG
○ Has a state associated with a specific run of the DAG
● By combining DAGs and Operators to create Task Instances, you can build complex workflows.
23. More features
Features:
- Hooks
- Connections
- Variables
- XComs
- SLA
- Pools
- Queues
- Trigger Rules
- Branching
- SubDags
24. More features
Hooks:
- Interface to external platforms and databases:
- Hive
- S3
- MySQL
- PostgreSQL
- HDFS
- Pig
- …
- Act as building blocks for Operators
- Use a Connection to retrieve authentication information
- Keep authentication information out of pipelines (see the sketch below)
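A minimal sketch of using a hook inside a PythonOperator callable, assuming an Airflow Connection named "my_s3" and an illustrative bucket (the S3Hook import path varies across Airflow versions):

from airflow.hooks.S3_hook import S3Hook  # Airflow 1.x import path

def list_input_files(**context):
    # Credentials come from the "my_s3" Connection, so no secrets live in the DAG file.
    hook = S3Hook(aws_conn_id="my_s3")
    return hook.list_keys(bucket_name="my-bucket", prefix="input/")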
28. More features
Variables:
- A generic way to store and retrieve arbitrary content or settings as a simple key-value store within Airflow.
- Variables can be listed, created, updated and deleted from the UI (Admin -> Variables), code or CLI.
- While your pipeline code definition and most of your constants and variables should be defined in code and stored in source control, it can be useful to have some variables or configuration items accessible and modifiable through the UI.
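A short sketch of the Variable API (key names and values are illustrative):

from airflow.models import Variable

# Set from code (also possible via the UI under Admin -> Variables, or the CLI).
Variable.set("etl_target_schema", "staging")

# Read back, with a default in case the key does not exist.
schema = Variable.get("etl_target_schema", default_var="staging")

# JSON values can be deserialized directly; the default is returned as-is if the key is missing.
etl_config = Variable.get("etl_config", default_var={}, deserialize_json=True)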
30. More features
XCom, or cross-communication:
● Lets tasks exchange messages, allowing shared state.
● Defined by a key, a value, and a timestamp.
● Also tracks attributes like the task/DAG that created the XCom and when it should become visible.
● Any object that can be pickled can be used as an XCom value.
XComs can be:
● Pushed (sent):
○ By calling xcom_push()
○ When a task returns a value (from its operator's execute() method or a PythonOperator's python_callable)
● Pulled (received): by calling xcom_pull() (see the sketch below)
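A sketch of two PythonOperator tasks sharing state through XCom (DAG id and task names are illustrative; Airflow 1.x-style imports):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG("xcom_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

def extract(**context):
    # The return value is pushed automatically as an XCom under the key "return_value".
    return ["file_a.csv", "file_b.csv"]

def load(**context):
    # Pull the value pushed by the upstream task.
    files = context["ti"].xcom_pull(task_ids="extract")
    print("loading", files)

extract_task = PythonOperator(task_id="extract", python_callable=extract,
                              provide_context=True, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load,
                           provide_context=True, dag=dag)
extract_task >> load_task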
31. More features
SLA :
- Service Level Agreements, or the time by which a task or DAG should have succeeded.
- Can be set at a task level as a timedelta.
- An alert email is sent detailing the list of tasks that missed their SLA.
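A minimal sketch of setting an SLA on a task (DAG id, task and the 2-hour window are illustrative):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("sla_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

# If this task has not succeeded within 2 hours of the scheduled time,
# it is recorded as an SLA miss and an alert email is sent.
daily_report = BashOperator(task_id="daily_report", bash_command="echo report",
                            sla=timedelta(hours=2), dag=dag)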
32. More features
Pools:
- Some systems can get overwhelmed when too many processes hit them at the same time.
- Limit the execution parallelism on arbitrary sets of tasks (see the sketch below).
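A sketch of assigning a task to a pool, assuming a pool named "hive_pool" (with, say, 5 slots) has already been created under Admin -> Pools; the operator and query are illustrative:

from datetime import datetime
from airflow import DAG
from airflow.operators.hive_operator import HiveOperator  # Airflow 1.x import path

dag = DAG("pool_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

# At most 5 tasks assigned to "hive_pool" run at the same time, cluster-wide.
load_partition = HiveOperator(task_id="load_partition",
                              hql="SELECT 1",  # placeholder query
                              pool="hive_pool", dag=dag)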
34. More features
Queues (only with the CeleryExecutor):
- Every task can be assigned a specific queue name
- By default, both workers and tasks are assigned the default_queue queue
- Workers can be assigned multiple queues
- A very useful feature when specialized workers are needed (GPU, Spark…) (see the sketch below)
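A sketch of routing a task to a dedicated queue (DAG id, queue name and command are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("queue_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

# Only Celery workers listening on "spark_queue" will pick this task up,
# e.g. a worker started with: airflow worker -q spark_queue
spark_job = BashOperator(task_id="spark_job", bash_command="spark-submit job.py",
                         queue="spark_queue", dag=dag)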
35. More features
Trigger Rules:
Though the normal workflow behavior is to trigger tasks when all their directly upstream tasks have succeeded, Airflow allows for more complex dependency settings.
All operators have a trigger_rule argument which defines the rule by which the generated task gets triggered. The default value for trigger_rule is all_success and can be defined as "trigger this task when all directly upstream tasks have succeeded". All other rules described here are based on direct parent tasks and are values that can be passed to any operator while creating tasks (see the sketch after this list):
● all_success: (default) all parents have succeeded
● all_failed: all parents are in a failed or upstream_failed state
● all_done: all parents are done with their execution
● one_failed: fires as soon as at least one parent has failed; it does not wait for all parents to be done
● one_success: fires as soon as at least one parent succeeds; it does not wait for all parents to be done
● dummy: dependencies are just for show, trigger at will
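A sketch of a cleanup task that runs regardless of whether its parents succeed or fail (DAG id and task names are illustrative; Airflow 1.x-style imports):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("trigger_rule_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

task_a = BashOperator(task_id="task_a", bash_command="echo a", dag=dag)
task_b = BashOperator(task_id="task_b", bash_command="exit 1", dag=dag)  # deliberately fails

# all_done: run once both parents have finished, whether they succeeded or failed.
cleanup = DummyOperator(task_id="cleanup", trigger_rule="all_done", dag=dag)

task_a >> cleanup
task_b >> cleanup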
48. Security
By default, all access is open.
Supports:
● Web authentication with:
○ Password
○ LDAP
○ Custom auth
○ Kerberos
○ OAuth
■ GitHub Enterprise authentication
■ Google authentication
● Impersonation (run as another $USER)
● Secure access via SSL
50. Demo
1. Facebook Ads Insights data pipeline.
2. Run a PySpark script on an ephemeral Dataproc cluster only when S3 input data is available.
3. Useless workflow: Hook + Connection + Operators + Sensors + XCom + (SLA), as sketched below:
○ List S3 files (hooks)
○ Share state with the next task (XCom)
○ Write content to S3 (hooks)
○ Resume the workflow when an S3 DONE.FLAG file is ready (sensor)
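A minimal sketch of demo workflow 3, assuming an Airflow Connection named "my_s3" and an illustrative bucket; import paths follow Airflow 1.x and vary between releases:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.s3_key_sensor import S3KeySensor  # airflow.operators.sensors in older 1.x releases

dag = DAG("useless_workflow", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

def list_files(**context):
    # Hook + Connection: credentials come from the "my_s3" Connection, not the DAG file.
    return S3Hook(aws_conn_id="my_s3").list_keys(bucket_name="my-bucket")

def write_listing(**context):
    # XCom: pull the listing returned (and therefore pushed) by the previous task.
    keys = context["ti"].xcom_pull(task_ids="list_s3_files") or []
    S3Hook(aws_conn_id="my_s3").load_string("\n".join(keys), key="listing.txt",
                                            bucket_name="my-bucket", replace=True)

list_s3_files = PythonOperator(task_id="list_s3_files", python_callable=list_files,
                               provide_context=True, sla=timedelta(hours=1), dag=dag)
write_content = PythonOperator(task_id="write_content", python_callable=write_listing,
                               provide_context=True, dag=dag)
# Sensor: downstream work resumes only once DONE.FLAG appears in the bucket.
wait_for_flag = S3KeySensor(task_id="wait_for_flag", bucket_key="s3://my-bucket/DONE.FLAG",
                            aws_conn_id="my_s3", poke_interval=60, timeout=60 * 60, dag=dag)

list_s3_files >> write_content >> wait_for_flag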