
Yaroslav Nedashkovsky, "How to manage hundreds of pipelines for processing data from virus attacks with Apache Airflow"

BigData & Data Engineering



  1. 1. How to manage hundreds of pipelines for processing data from virus attacks with Apache Airflow. Yaroslav Nedashkovsky, Software Solutions Architect, SE-Data
  2. 2. Agenda 1. Virus data feeds 2. Feeds importer 3. System architecture for analyzing data from virus attacks 4. a-Gnostics — framework to analyze petabytes of data
  3. 3. Virus data feeds
  4. 4. Microsoft feed example - darkbot, ramnit - 700-800 MB generated each hour, of which only 50-100 MB is useful
  5. 5. What do we want to achieve? Create a service that automates data extraction from a variety of virus feeds and imports the data into a variety of target sources
  6. 6. 2. Feeds Importer
  7. 7. Challenges & requirements for the feeds importer - Variety of data feeds - Configuration as code - Ability to add a new feed without big effort (a DAG-factory sketch follows after this slide) - Reusable components (!) - Task scheduler - Scalability - Cloud agnostic (tentative)
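
A minimal sketch of how "configuration as code" plus "add a new feed without big effort" could look in Airflow: one DAG-factory loop that generates a DAG per feed from a plain dictionary. The FEEDS registry, the extract_feed callable and the schedules are hypothetical placeholders, not the presenter's actual configuration.

    from datetime import timedelta
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.utils.dates import days_ago

    # Hypothetical feed registry: adding a feed is one more dictionary entry.
    FEEDS = {
        'microsoft': {'schedule': '@hourly', 'queue': 'group1_feed_queue'},
        'darkbot':   {'schedule': '@daily',  'queue': 'group2_feed_queue'},
    }

    def extract_feed(feed_name, **context):
        # Placeholder for the real, feed-specific extraction logic.
        print('extracting feed %s for %s' % (feed_name, context['ds']))

    def build_dag(feed_name, conf):
        dag = DAG(
            dag_id='%s_feed_importer' % feed_name,
            default_args={'retries': 1, 'retry_delay': timedelta(minutes=5),
                          'start_date': days_ago(1), 'queue': conf['queue']},
            schedule_interval=conf['schedule'])
        PythonOperator(
            task_id='%s_feed_extractor' % feed_name,
            python_callable=extract_feed,
            op_kwargs={'feed_name': feed_name},
            provide_context=True,
            dag=dag)
        return dag

    # Register one DAG per feed in the module namespace so the scheduler picks them up.
    for name, conf in FEEDS.items():
        globals()['%s_feed_importer' % name] = build_dag(name, conf)
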
  8. 8. Candidates for feeds importer - Apache Airflow - Spotify Luigi - Pinterest Pinball - AWS Glue / AWS data pipelines - Custom solutions
  9. 9. Airflow vs Luigi. Airflow is a platform to programmatically author, schedule and monitor workflows; it is based on DAGs. Luigi is a Python module that helps you build complex pipelines of batch jobs; it is based on pipelines. Luigi downsides: • Poor UI • No built-in time-based scheduling • Testing pipelines is not trivial • Problems with scalability • No trivial way to rerun old pipelines. Airflow downsides: • Sometimes it can be unclear how to organize tasks into pipelines • Be ready to meet bugs (this is open source ☺) • That's all? Better to create a POC for each solution and test all required options. A minimal Luigi pipeline is sketched below for contrast with the Airflow DAG on the next slide.
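
For contrast with the Airflow DAG declaration on the next slide, a minimal Luigi pipeline might look roughly like this; the FeedExtract/FeedLoad task names and the local file targets are illustrative only, not part of the presented system.

    import datetime
    import luigi

    class FeedExtract(luigi.Task):
        date = luigi.DateHourParameter()

        def output(self):
            # Luigi decides whether a task still needs to run by checking its output target.
            return luigi.LocalTarget('raw_feed_%s.json' % self.date.strftime('%Y%m%dT%H'))

        def run(self):
            with self.output().open('w') as f:
                f.write('{"feed": "example"}')

    class FeedLoad(luigi.Task):
        date = luigi.DateHourParameter()

        def requires(self):
            # Dependencies are declared per task instead of on a central DAG object.
            return FeedExtract(date=self.date)

        def output(self):
            return luigi.LocalTarget('loaded_feed_%s.txt' % self.date.strftime('%Y%m%dT%H'))

        def run(self):
            with self.input().open() as src, self.output().open('w') as dst:
                dst.write(src.read())

    if __name__ == '__main__':
        # There is no built-in time-based scheduler: something external (e.g. cron)
        # has to invoke this for every hour that should be processed.
        luigi.build([FeedLoad(date=datetime.datetime(2019, 1, 1, 0))], local_scheduler=True)
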
  10. 10. Airflow key concepts Base definitions - DAGs - Operators - Tasks - Pools - Queues Runtime - Scheduler - Worker - Webserver - Metastore Features - Xcom - Variables - Connections - Hooks
  11. 11. DAGs declaration example

      from datetime import timedelta
      from airflow import DAG
      from airflow.operators.dummy_operator import DummyOperator
      from airflow.utils.dates import days_ago
      # MicrosoftFeedExtractorOperator, KinesisStreamOperator and S3FeedDataLoaderOperator
      # are custom operators from the feeds importer code base.

      args = {
          'retries': 1,
          'retry_delay': timedelta(minutes=5),
          'execution_timeout': timedelta(hours=1),
          'queue': 'group1_feed_queue',
          'pool': 'backfill',
          'priority_weight': 10,
          'start_date': days_ago(1),  # required by Airflow; the value here is a placeholder
      }

      dag = DAG(
          dag_id='microsoft_feed_importer',
          default_args=args,
          schedule_interval='@hourly')

      extract_task = MicrosoftFeedExtractorOperator(
          task_id='microsoft_feed_extractor',
          dag=dag)

      kinesis_stream_task = KinesisStreamOperator(
          task_id='microsoft_feed_kinesis_stream_loader',
          dag=dag)

      s3_loader_task = S3FeedDataLoaderOperator(
          task_id='microsoft_feed_s3_loader',
          dag=dag)

      # Placeholder for the cleanup step; its operator definition is not shown on the slide.
      clean_task = DummyOperator(task_id='microsoft_feed_clean', dag=dag)

      extract_task.set_downstream(kinesis_stream_task)
      extract_task.set_downstream(s3_loader_task)
      s3_loader_task.set_downstream(clean_task)
      kinesis_stream_task.set_downstream(clean_task)
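
The custom operators used above are not shown on the slides. Purely as an illustration, a feed-extractor operator in Airflow 1.x might be written roughly as below, using a hook for the outbound connection, an Airflow Variable for the feed endpoint, and XCom to hand the staged file location to downstream tasks; the class body, the S3 bucket name and the Variable key are assumptions, not the presenter's code.

    import requests
    from airflow.models import BaseOperator, Variable
    from airflow.hooks.S3_hook import S3Hook
    from airflow.utils.decorators import apply_defaults

    class MicrosoftFeedExtractorOperator(BaseOperator):
        """Illustrative sketch: download one hourly feed chunk and stage it in S3."""

        @apply_defaults
        def __init__(self, s3_conn_id='aws_default', bucket='feeds-staging', *args, **kwargs):
            super(MicrosoftFeedExtractorOperator, self).__init__(*args, **kwargs)
            self.s3_conn_id = s3_conn_id
            self.bucket = bucket  # placeholder bucket name

        def execute(self, context):
            # Feed endpoint kept in an Airflow Variable so it can change without a redeploy.
            url = Variable.get('microsoft_feed_url')
            payload = requests.get(url, timeout=300).content

            key = 'microsoft/%s.json' % context['ts_nodash']
            S3Hook(aws_conn_id=self.s3_conn_id).load_bytes(
                payload, key=key, bucket_name=self.bucket, replace=True)

            # The return value is pushed to XCom, so downstream tasks can read the location.
            return 's3://%s/%s' % (self.bucket, key)
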
  12. 12. Feed pipeline example
  13. 13. One node setup Scaling: - Vertically: executors 1-n
  14. 14. Multi-node setup Scaling: - Horizontally: workers 1-m - Vertically: executors 1-n
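
In practice the vertical knobs map to a few Airflow settings, which in a containerized setup are commonly passed as environment variables; the concrete numbers below are placeholders, not the presenter's values.

    # Environment overrides read by Airflow as AIRFLOW__<SECTION>__<KEY>.
    # Values here are illustrative placeholders.
    SCALING_ENV = {
        # Vertical scaling: how many task instances the whole installation
        # and a single Celery worker may run at once.
        'AIRFLOW__CORE__PARALLELISM': '64',
        'AIRFLOW__CORE__DAG_CONCURRENCY': '16',
        'AIRFLOW__CELERY__WORKER_CONCURRENCY': '8',
        # Horizontal scaling means adding more worker containers that consume
        # the same Celery queues (e.g. group1_feed_queue) via the shared broker.
        'AIRFLOW__CELERY__BROKER_URL': 'redis://elasticache-endpoint:6379/0',
    }
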
  15. 15. How does Airflow HA work with AWS? - PostgreSQL RDS as the backend for the metastore, with Multi-AZ configuration - ElastiCache (Redis) as the Celery backend - Scheduler, workers and web server packaged as containers and run on ECS - Elastic File System as long-term storage for the containers. The weak point here is the scheduler: running two schedulers simultaneously could lead to unpredictable results, so you need to monitor the scheduler and raise a notification if it fails (a monitoring sketch follows below).
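
One possible shape for that scheduler watchdog is sketched below. It assumes the Airflow 1.x metastore layout (scheduler heartbeats in the `job` table, stored in UTC); the connection string and the SNS topic ARN are placeholders.

    import boto3
    import psycopg2

    MAX_LAG_SECONDS = 300
    SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:airflow-alerts'  # placeholder ARN

    def scheduler_heartbeat_lag(dsn):
        """Seconds since the last heartbeat of a running SchedulerJob, or None if there is none."""
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                # Airflow 1.x records scheduler heartbeats in the `job` table of the metastore.
                cur.execute(
                    "SELECT EXTRACT(EPOCH FROM (now() - max(latest_heartbeat))) "
                    "FROM job WHERE job_type = 'SchedulerJob' AND state = 'running'")
                lag = cur.fetchone()[0]
        finally:
            conn.close()
        return lag

    def alert_if_scheduler_down(dsn):
        lag = scheduler_heartbeat_lag(dsn)
        if lag is None or lag > MAX_LAG_SECONDS:
            boto3.client('sns').publish(
                TopicArn=SNS_TOPIC_ARN,
                Subject='Airflow scheduler appears to be down',
                Message='Scheduler heartbeat lag: %s seconds' % lag)

    if __name__ == '__main__':
        # Placeholder RDS connection string; run this from cron or a scheduled Lambda.
        alert_if_scheduler_down('postgresql://airflow:password@metastore-host:5432/airflow')
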
  16. 16. Elastic container service
  17. 17. ECS key points - Amazon «Docker» as a service ☺ - Launch types: EC2, Fargate ☺ - Supports Docker Compose ☺ - Good integration with other AWS services ☺ - Separate clusters for CPU-optimized and RAM-optimized tasks ☺ - Can be problematic to set hard/soft limits for CPU/RAM correctly ☹ - Need to keep a reserved EC2 instance in the cluster for quick scaling ☹ - Scale-in problem ☹ (a task-definition sketch of the hard/soft memory limits follows below)
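
On the hard/soft limit point: in an ECS container definition, `memory` is the hard limit (the container is killed if it exceeds it) and `memoryReservation` is the soft limit used for task placement. A sketch with boto3; the task family, image and numbers are placeholders.

    import boto3

    ecs = boto3.client('ecs', region_name='us-east-1')

    response = ecs.register_task_definition(
        family='airflow-worker',  # placeholder task family
        containerDefinitions=[{
            'name': 'airflow-worker',
            'image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/airflow:latest',
            'cpu': 1024,                 # 1 vCPU worth of CPU units
            'memory': 4096,              # hard limit (MiB): container is killed above this
            'memoryReservation': 2048,   # soft limit (MiB): used for task placement
            'command': ['airflow', 'worker', '-q', 'group1_feed_queue'],
            'essential': True,
        }],
    )
    print(response['taskDefinition']['taskDefinitionArn'])
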
  18. 18. Feeds Importer Architecture
  19. 19. Continuous Integration/DevOps
  20. 20. Feeds importer in numbers - more than 100 data feeds - 10 workers, 1 scheduler, 2 web servers - processes about 500 GB of data each day - 4 m5.2xlarge (8 vCPU, 32 GB RAM) and 3 t2.xlarge (4 vCPU, 16 GB RAM) instances in ECS - PostgreSQL (RDS) db.m4.xlarge with Multi-AZ - ElastiCache cache.m4.large, 2 replicas, with Multi-AZ
  21. 21. 3. System architecture for analyzing data from virus attacks
  22. 22. 4. a-Gnostics — Get Value from Petabytes of Data
  23. 23. URLs: Comparison of Airflow and Luigi - https://towardsdatascience.com/why-quizlet-chose-apache-airflow-for-executing-data-workflows-3f97d40e9571 ; Comparison of ECS (EC2) vs ECS (Fargate) vs EKS - https://medium.com/devopslinks/ecs-vs-eks-vs-fargate-the-good-the-bad-the-ugly-9f68bfc3bb73 ; Scale-in ECS hosts - https://medium.com/prodopsio/how-to-scale-in-ecs-hosts-2d0906d2ba
  24. 24. COLLECT. ANALYZE. INSIGHT. data@softelegance.com www.a-gnostics.com www.fb.com/agnosticstech
