This document describes using Apache Airflow to manage hundreds of data pipelines for processing data from virus attacks. It covers building a feeds importer to automate extracting data from a variety of virus data feeds and loading it into target systems. The feeds importer needs to handle many different data feeds, be configurable as code, allow new feeds to be added easily, use reusable components, and scale. It recommends Airflow over alternatives like Luigi or custom solutions due to Airflow's capabilities, provides an example DAG definition, and discusses setting up Airflow as a multi-node cluster on AWS using services like ECS, RDS, and ElastiCache for high availability. Finally, it briefly describes using the processed data to gain insights through the a-Gnostics framework.
1. How to manage hundreds of pipelines
for processing data from virus attacks
with Apache Airflow
Yaroslav Nedashkovsky, Software Solutions Architect
SE-Data
2. Agenda
1. Virus data feeds
2. Feeds importer
3. System architecture for analyzing data
from virus attacks
4. a-Gnostics — framework to analyze
petabytes of data
5. What do we want to achieve?
Create a service that automates extracting data from a
variety of virus feeds and importing it into a variety
of target systems.
7. Challenges & requirements for feeds importer
- Variety of data feeds
- Configuration as code
- Ability to add a new feed without significant effort (see the dynamic-DAG sketch after this list)
- Reusable components (!)
- Task scheduler
- Scalability
- Cloud agnostic (tentative)
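A minimal sketch of how these requirements can map onto Airflow (my illustration, not taken from the slides): feeds are declared as configuration, and one DAG per feed is generated from shared, reusable extract/load steps. The feed names, URLs, and helper bodies below are hypothetical.

```python
# Hypothetical sketch: one DAG per feed, generated from configuration-as-code.
# Feed names, URLs, and the extract/load helpers are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

# Configuration as code: adding a new feed is one more dict entry.
FEEDS = {
    "feed_a": {"url": "https://example.com/feed_a", "schedule": "@hourly"},
    "feed_b": {"url": "https://example.com/feed_b", "schedule": "@daily"},
}

def extract(feed_name, url, **context):
    """Reusable extract step shared by every feed (placeholder body)."""
    print(f"extracting {feed_name} from {url}")

def load(feed_name, **context):
    """Reusable load step shared by every feed (placeholder body)."""
    print(f"loading {feed_name} into the target system")

# One DAG per feed; Airflow picks up every DAG object placed in globals().
for name, cfg in FEEDS.items():
    dag = DAG(
        dag_id=f"import_{name}",
        schedule_interval=cfg["schedule"],
        start_date=datetime(2019, 1, 1),
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
        catchup=False,
    )
    with dag:
        extract_task = PythonOperator(
            task_id="extract",
            python_callable=extract,
            op_kwargs={"feed_name": name, "url": cfg["url"]},
        )
        load_task = PythonOperator(
            task_id="load",
            python_callable=load,
            op_kwargs={"feed_name": name},
        )
        extract_task >> load_task
    globals()[f"import_{name}"] = dag
```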
8. Candidates for feeds importer
- Apache Airflow
- Spotify Luigi
- Pinterest Pinball
- AWS Glue / AWS data pipelines
- Custom solutions
9. Airflow vs Luigi
Luigi cons:
• Poor UI
• No built-in scheduler
• Testing pipelines is not trivial
• Problems with scalability
• No trivial way to rerun old pipelines
Airflow cons:
• It can sometimes be unclear how to organize tasks into pipelines
• Be ready to meet bugs (this is open source ☺)
• That’s all?
Better to create a POC for each solution and test all the required options.
Airflow is a platform to programmatically author,
schedule and monitor workflows.
Based on DAGs.
Luigi is a Python module that helps you build
complex pipelines of batch jobs.
Based on pipelines.
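To make the "pipelines vs. DAGs" distinction concrete, here is a sketch (mine, not from the slides) of the same two-step feed import expressed in Luigi: dependencies are declared bottom-up via requires(), and completion is tracked through output targets, in contrast to the explicitly wired Airflow DAG shown earlier. Task and file names are illustrative.

```python
# Illustrative Luigi version of a two-step feed import; names are hypothetical.
import luigi

class ExtractFeed(luigi.Task):
    feed_name = luigi.Parameter()

    def output(self):
        # Luigi tracks completion via targets rather than a scheduler database.
        return luigi.LocalTarget(f"/tmp/{self.feed_name}.raw")

    def run(self):
        with self.output().open("w") as f:
            f.write("raw feed data")  # placeholder for the real extraction

class LoadFeed(luigi.Task):
    feed_name = luigi.Parameter()

    def requires(self):
        # The pipeline is declared bottom-up: LoadFeed pulls in ExtractFeed.
        return ExtractFeed(feed_name=self.feed_name)

    def output(self):
        return luigi.LocalTarget(f"/tmp/{self.feed_name}.loaded")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())  # placeholder for the real load step

if __name__ == "__main__":
    # No built-in scheduler: runs are triggered externally (e.g. cron).
    luigi.build([LoadFeed(feed_name="feed_a")], local_scheduler=True)
```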
15. How will Airflow HA work with AWS?
- PostgreSQL RDS as the backend for the metastore, with a Multi-AZ configuration
- ElastiCache (Redis) as the Celery backend
- Scheduler, workers, and web server put into containers and run on ECS
- Elastic File System as long-term storage for the containers
The weak place here is the scheduler: running two schedulers simultaneously could lead to
unpredictable results, so the scheduler needs to be monitored and a notification raised in case of failure (see the health-check sketch below).
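One way to implement that monitoring (a sketch of mine, not from the slides) is to poll the Airflow metastore for the scheduler's heartbeat. This assumes an Airflow 1.x-era PostgreSQL metastore where the job table records a latest_heartbeat per SchedulerJob; the connection details and the notify() hook are placeholders.

```python
# Hypothetical scheduler health check against the Airflow metastore.
# Assumes the Airflow 1.x `job` table with `job_type`, `state`, and
# `latest_heartbeat` columns; connection details and notify() are placeholders.
import psycopg2

def notify(message):
    """Placeholder alert hook (SNS, Slack, PagerDuty, ...)."""
    print(f"ALERT: {message}")

def scheduler_is_healthy(conn):
    with conn.cursor() as cur:
        # Compare in SQL to avoid timezone pitfalls: healthy means at least
        # one running SchedulerJob heartbeat within the last 5 minutes.
        cur.execute(
            "SELECT COALESCE(MAX(latest_heartbeat) > NOW() - INTERVAL '5 minutes', FALSE) "
            "FROM job WHERE job_type = 'SchedulerJob' AND state = 'running'"
        )
        (healthy,) = cur.fetchone()
    return healthy

if __name__ == "__main__":
    conn = psycopg2.connect(host="airflow-db.example.com", dbname="airflow",
                            user="airflow", password="change-me")
    try:
        if not scheduler_is_healthy(conn):
            notify("Airflow scheduler heartbeat is stale or missing")
    finally:
        conn.close()
```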
17. ECS key points
- Amazon «Docker» as a service ☺
- Launch types: EC2, Fargate ☺
- Supports Docker Compose ☺
- Good integration with other AWS services ☺
- Separate clusters for CPU-optimized and RAM-optimized tasks ☺
- Setting correct hard/soft CPU/RAM limits can be problematic
- Need to keep a reserved EC2 instance in the cluster for quick scaling
- Scale-in problem (see the sketch after this list)
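A common mitigation for the scale-in problem (my sketch, not from the slides) is to protect container instances that are still running tasks from Auto Scaling scale-in, so long-running imports are not killed mid-task. The cluster and Auto Scaling group names below are placeholders, assuming an EC2-backed ECS cluster.

```python
# Hypothetical mitigation for the ECS scale-in problem: protect instances that
# still run tasks so Auto Scaling does not terminate them mid-task.
# Cluster and Auto Scaling group names are placeholders.
import boto3

CLUSTER = "feeds-importer"          # placeholder ECS cluster name
ASG_NAME = "feeds-importer-asg"     # placeholder Auto Scaling group name

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

def sync_scale_in_protection():
    # List all container instances registered in the cluster
    # (a real version would paginate past 100 instances).
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return
    instances = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]

    busy, idle = [], []
    for inst in instances:
        target = busy if inst["runningTasksCount"] > 0 else idle
        target.append(inst["ec2InstanceId"])

    # Protect busy instances from scale-in; release idle ones.
    if busy:
        autoscaling.set_instance_protection(
            AutoScalingGroupName=ASG_NAME, InstanceIds=busy,
            ProtectedFromScaleIn=True)
    if idle:
        autoscaling.set_instance_protection(
            AutoScalingGroupName=ASG_NAME, InstanceIds=idle,
            ProtectedFromScaleIn=False)

if __name__ == "__main__":
    sync_scale_in_protection()
```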
21. Feeds importer in numbers
- more than 100 data feeds
- 10 workers, 1 scheduler, 2 web servers
- processes about 500 GB of data each day
- 4 m5.2xlarge (8 vCPU, 32 GB RAM) and
3 t2.xlarge (4 vCPU, 16 GB RAM) instances in ECS
- PostgreSQL (RDS) db.m4.xlarge with Multi-AZ
- ElastiCache cache.m4.large with 2 replicas and Multi-AZ