Challenges & requirements for feeds importer
- Variety of data feeds
- Configuration as code
- Ability to add a new feed with little effort
- Reusable components (!)
- Task scheduler
- Cloud agnostic (tentative)
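One way to read "configuration as code" together with "reusable components" is sketched below: each feed is a small declarative object, and parsing logic lives in a shared registry, so adding a feed means adding one entry. All names here (FeedConfig, PARSERS, the bucket paths, the cron strings) are hypothetical and only illustrate the idea; this is not the importer's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registry of reusable parsers, keyed by format name.
PARSERS: Dict[str, Callable[[str], List[dict]]] = {}


def parser(name: str):
    """Register a parser so that any feed config can refer to it by name."""
    def register(func):
        PARSERS[name] = func
        return func
    return register


@parser("csv")
def parse_csv(raw: str) -> List[dict]:
    # Naive CSV parsing (no quoting support); enough for an illustration.
    header, *rows = [line.split(",") for line in raw.strip().splitlines()]
    return [dict(zip(header, row)) for row in rows]


@dataclass(frozen=True)
class FeedConfig:
    name: str       # feed identifier
    source: str     # where the raw data lives (illustrative path)
    format: str     # key into PARSERS
    schedule: str   # cron-style expression for the task scheduler


# Adding a new feed is one more entry in this list.
FEEDS = [
    FeedConfig("prices", "s3://example-bucket/prices.csv", "csv", "0 2 * * *"),
    FeedConfig("stock", "s3://example-bucket/stock.csv", "csv", "0 3 * * *"),
]


def run_feed(feed: FeedConfig, raw: str) -> List[dict]:
    """Parse raw feed content with the parser named in the feed's config."""
    return PARSERS[feed.format](raw)
```

Because the feed list is ordinary code, it can be reviewed, versioned and tested like the rest of the importer.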
Candidates for feeds importer
- Apache Airflow
- Spotify Luigi
- Pinterest Pinball
- AWS Glue / AWS data pipelines
- Custom solutions
Airflow vs Luigi
• Poor UI
• Scheduling is not available
• Testing a pipeline is not trivial
• Problems with scalability
• No trivial way of rerunning old
• Sometimes it can be unclear how to organize tasks in pipelines
• Be ready to meet bugs (this is open source ☺)
• That's all?
It is better to build a POC for each solution and test all required options.
Airflow is a platform to programmatically author,
schedule and monitor workflows.
Based on DAGs.
Luigi is a Python module that helps you build
complex pipelines of batch jobs.
Based on pipelines.
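To make "based on DAGs" concrete, here is a minimal sketch that uses only the Python standard library, not the Airflow or Luigi API. Tasks declare what they depend on, and a valid execution order is derived from the graph. The task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on (hypothetical names).
dependencies = {
    "download": set(),
    "validate": {"download"},
    "archive": {"download"},    # independent branch, could run in parallel
    "transform": {"validate"},
    "load": {"transform"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A workflow engine adds scheduling, retries and monitoring on top of this ordering; the graph itself is the part both tools have in common.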
How will Airflow HA work with AWS?
- PostgreSQL on RDS as the metastore backend, in a Multi-AZ configuration
- ElastiCache (Redis) as the Celery backend
- Scheduler, workers and web server are put into containers and run on ECS
- Elastic File System (EFS) as long-term storage for the containers
The weak point here is the scheduler. Running two schedulers simultaneously could lead to
unpredictable results. The scheduler needs to be monitored, and a notification raised if it fails.
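The scheduler monitoring mentioned above could be sketched as a heartbeat staleness check. This is only an illustration: where the last heartbeat timestamp comes from, the 5-minute threshold, and the notify callback are all assumptions, not part of the described setup.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold; tune it to how often the scheduler reports a heartbeat.
MAX_HEARTBEAT_AGE = timedelta(minutes=5)


def scheduler_is_stale(last_heartbeat, now, max_age=MAX_HEARTBEAT_AGE):
    """True if the scheduler has not reported a heartbeat within max_age."""
    return now - last_heartbeat > max_age


def check_scheduler(last_heartbeat, now, notify, max_age=MAX_HEARTBEAT_AGE):
    """Call notify(message) when the heartbeat is stale; return the health flag."""
    if scheduler_is_stale(last_heartbeat, now, max_age):
        notify(f"Scheduler heartbeat is stale: last seen {last_heartbeat.isoformat()}")
        return False
    return True


# Usage with fixed timestamps and a list standing in for a real alert channel.
alerts = []
now = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
healthy = check_scheduler(now - timedelta(minutes=1), now, alerts.append)
failed = check_scheduler(now - timedelta(minutes=30), now, alerts.append)
```

Since only one scheduler may run at a time, a check like this detects the failure quickly instead of masking it with a second instance.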
ECS key points
- Amazon's «Docker» as a service ☺
- Launch types: EC2, Fargate ☺
- Supports Docker Compose ☺
- Good integration with other AWS services ☺
- Separate clusters for CPU-optimized and RAM-optimized tasks ☺
- Setting correct hard/soft limits for CPU/RAM can be problematic
- A reserved EC2 instance is needed in the cluster for quick scaling
- Scale-in problem
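The hard/soft limit point can be illustrated with the container definition fields ECS uses: cpu (CPU units, 1024 per vCPU), memory (hard limit, MiB) and memoryReservation (soft limit, MiB). The sketch below checks a container fragment for the common mistakes. The container values and the helper name validate_limits are hypothetical.

```python
def validate_limits(container):
    """Return a list of problems found in one ECS container definition fragment."""
    problems = []
    hard = container.get("memory")             # hard limit: container is killed above it
    soft = container.get("memoryReservation")  # soft limit: amount reserved for the container
    if hard is None and soft is None:
        problems.append("set memory and/or memoryReservation")
    if hard is not None and soft is not None and hard <= soft:
        problems.append("memory (hard) must be greater than memoryReservation (soft)")
    if container.get("cpu", 0) <= 0:
        problems.append("cpu units not set (1024 units = 1 vCPU)")
    return problems


# Hypothetical worker container: 1 vCPU, 2 GiB reserved, 4 GiB hard cap.
worker = {"name": "worker", "cpu": 1024, "memory": 4096, "memoryReservation": 2048}
# Misconfigured container: hard limit below the soft limit, no CPU units.
broken = {"name": "broken", "memory": 1024, "memoryReservation": 2048}

worker_problems = validate_limits(worker)
broken_problems = validate_limits(broken)
```

Running a check like this over task definitions before deployment catches limit mistakes earlier than a failed task placement would.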
Feeds importer in numbers
- more than 100 data feeds
- 10 workers, 1 scheduler, 2 web servers
- each day it processes about 500 GB of data
- 4 m5.2xlarge (8 vCPU, 32 GB RAM),
3 t2.xlarge (4 vCPU, 16 GB RAM) instances in ECS
- PostgreSQL (RDS) db.m4.xlarge with Multi-AZ
- ElastiCache cache.m4.large, 2 replicas, with Multi-AZ
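As a quick sanity check of the figures above, the cluster totals and the average daily throughput can be worked out directly. The only assumptions are decimal units (1 GB = 1000 MB) and an even spread of load over 24 hours and over the 10 workers, which real feeds will not follow exactly.

```python
# (instance type, count, vCPU per instance, RAM in GB per instance), from the list above
instances = [
    ("m5.2xlarge", 4, 8, 32),
    ("t2.xlarge", 3, 4, 16),
]

total_vcpu = sum(count * vcpu for _, count, vcpu, _ in instances)
total_ram_gb = sum(count * ram for _, count, _, ram in instances)

daily_gb = 500
workers = 10
seconds_per_day = 24 * 60 * 60

avg_mb_per_s = daily_gb * 1000 / seconds_per_day  # average over a full day
gb_per_worker = daily_gb / workers                # if load were spread evenly

print(total_vcpu, total_ram_gb, round(avg_mb_per_s, 2), gb_per_worker)
# prints: 44 176 5.79 50.0
```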
3. System architecture for analyzing data from
4. a-Gnostics — Get Value from Petabytes of
Comparison of Airflow and Luigi - https://towardsdatascience.com/why-quizlet-chose-
Comparison of ECS (EC2) vs ECS (Fargate) vs EKS - https://medium.com/devopslinks/ecs-
Scale-in ECS hosts - https://medium.com/prodopsio/how-to-scale-in-ecs-hosts-