Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Resilient Predictive Data Pipelines

483 views

Published on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1OjUNdh.

Sid Anand discusses how Agari is applying Big Data best practices to the problem of securing its customers from email-born threats. Anand presents a system that leverages the best of both the cloud (AWS's SNS, SQS, Kinesis, Auto-scaling, S3, Lambda, API Gateway, etc...) and Big Data (Spark, Airbnb's Airflow, etc...) in a maintainable way. Filmed at qconlondon.com.

Sid Anand currently serves as the Data Architect for Agari, a rising email security company. Prior to joining Agari, Sid held several technical and leadership positions including LinkedIn’s Search Architect, Netflix’s Cloud Data Architect, Etsy’s VP of Engineering, and several technical roles at eBay. He has over 15 years of experience in building websites that millions of people visit every day.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Resilient Predictive Data Pipelines

  1. 1. Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /agari-resilient-predictive-data-pipes
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  4. 4. About Me 2 Work [ed | s] @ Maintainer on Report to Co-Chair for airbnb/airflow
  5. 5. Motivation Why is a Data Pipeline talk in this High Availability Track? 3
  6. 6. Different Types of Data Pipelines 4 ETL • used for : loading data related to business health into a Data Warehouse • user-engagement stats (e.g. social networking) • product success stats (e.g. e-commerce) • audience : Business, BizOps • downtime? : 1-3 days Predictive • used for : •building recommendation products (e.g. social networking, shopping) •updating fraud prevention endpoints (e.g. security, payments, e-commerce) • audience : Customers • downtime? : < 1 hour VS
  7. 7. 5 ETL • used for : loading data into a Data Warehouse —> Reports • user-engagement stats (e.g. social networking) • product success stats (e.g. e-commerce) • audience : Business • downtime? : 1-3 days Predictive • used for : •building recommendation products (e.g. social networking, shopping) •updating fraud prevention endpoints (e.g. security, payments, e-commerce) • audience : Customers • downtime? : < 1 hour VS Different Types of Data Pipelines
  8. 8. Why Do We Care About Resilience? 6 d4d5d6d7d8 DB Search Engines Recommenders d1 d2 d3
  9. 9. DB Why Do We Care About Resilience? 7 d4 d6d7d8 Search Engines Recommenders d1 d2 d3 d5
  10. 10. Why Do We Care About Resilience? 8 d7d8 DB Search Engines Recommenders d1 d2 d3 d4 d5 d6
  11. 11. Why Do We Care About Resilience? 9 d7d8 DB Search Engines Recommenders d1 d2 d3 d4 d5 d6
  12. 12. Any Take-aways? 10 d7d8 DB Search Engines Recommenders d1 d2 d3 d4 d5 d6 • Bugs happen! • Bugs in Predictive Data Pipelines have a large blast radius • The bugs can affect customers and a company’s profits & reputation!
  13. 13. Design Goals Desirable Qualities of a Resilient Data Pipeline 11
  14. 14. 12 • Scalable • Available • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable Desirable Qualities of a Resilient Data Pipeline
  15. 15. 13 • Scalable • Build your pipelines using [infinitely] scalable components • The scalability of your system is determined by its least-scalable component • Available • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable Desirable Qualities of a Resilient Data Pipeline
  16. 16. Desirable Qualities of a Resilient Data Pipeline 14 • Scalable • Build your pipelines using [infinitely] scalable components • The scalability of your system is determined by its least-scalable component • Available • Ditto • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable
  17. 17. Instrumented 15 Instrumentation must reveal SLA metrics at each stage of the pipeline! What SLA metrics do we care about? Correctness & Timeliness • Correctness • No Data Loss • No Data Corruption • No Data Duplication • A Defined Acceptable Staleness of Intermediate Data • Timeliness • A late result == a useless result • Delayed processing of now()’s data may delay the processing of future data
  18. 18. Instrumented, Monitored, & Alert- enabled 16 • Instrument • Instrument Correctness & Timeliness SLA metrics at each stage of the pipeline • Monitor • Continuously monitor that SLA metrics fall within acceptable bounds (i.e. pre-defined SLAs) • Alert • Alert when we miss SLAs
  19. 19. Desirable Qualities of a Resilient Data Pipeline 17 • Scalable • Build your pipelines using [infinitely] scalable components • The scalability of your system is determined by its least-scalable component • Available • Ditto • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable
  20. 20. Quickly Recoverable 18 • Bugs happen! • Bugs in Predictive Data Pipelines have a large blast radius • Optimize for MTTR
  21. 21. Implementation Using AWS to meet Design Goals 19
  22. 22. SQS Simple Queue Service 20
  23. 23. SQS - Overview 21 AWS’s low-latency, highly scalable, highly available message queue Infinitely Scalable Queue (though not FIFO) Low End-to-end latency (generally sub-second) Pull-based How it Works! Producers publish messages, which can be batched, to an SQS queue Consumers consume messages, which can be batched, from the queue commit message contents to a data store ACK the messages as a batch
  24. 24. visibility timer SQS - Typical Operation Flow 22 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 1: A consumer reads a message from SQS. This starts a visibility timer!
  25. 25. visibility timer SQS - Typical Operation Flow 23 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 2: Consumer persists message contents to DB
  26. 26. visibility timer SQS - Typical Operation Flow 24 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 3: Consumer ACKs message in SQS
  27. 27. visibility timer SQS - Time Out Example 25 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 1: A consumer reads a message from SQS
  28. 28. visibility timer SQS - Time Out Example 26 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 2: Consumer attempts persists message contents to DB
  29. 29. visibility time out SQS - Time Out Example 27 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 3: A Visibility Timeout occurs & the message becomes visible again.
  30. 30. visibility timer SQS - Time Out Example 28 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 m1 SQS Step 4: Another consumer reads and persists the same message
  31. 31. visibility timer SQS - Time Out Example 29 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 5: Consumer ACKs message in SQS
  32. 32. SQS - Dead Letter Queue 30 SQS - DLQ visibility timer Producer Producer Producer m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Redrive rule : 2x m1
  33. 33. SNS Simple Notification Service 31
  34. 34. SNS - Overview 32 Highly Scalable, Highly Available, Push-based Topic Service Whereas SQS ensures each message is seen by at least 1 consumer SNS ensures that each message is seen by every consumer Reliable Multi-Push Whereas SQS is pull-based, SNS is push-based There is no message retention & there is a finite retry count No Reliable Message Delivery Can we work around this limitation?
  35. 35. SNS + SQS Design Pattern 33 m1m2 m1m2 m1m2 SQS Q1 SQS Q2 SNS T1 Reliable Multi Push Reliable Message Delivery
  36. 36. SNS + SQS 34 Producer Producer Producer m1m2 Consumer Consumer Consumer DB m1 m1m2 m1m2 SQS Q1 SQS Q2 SNS T1 Consumer Consumer Consumer ES m1
  37. 37. S3 + SNS + SQS Design Pattern 35 m1m2 m1m2 m1m2 SQS Q1 SQS Q2 SNS T1 Reliable Multi Push Transactions S3 d1 d2
  38. 38. Batch Pipeline Architecture Putting the Pieces Together 36
  39. 39. Architecture 37
  40. 40. Architectural Elements 38 A Schema-aware Data format for all data (Avro) The entire pipeline is built from Highly-Available/Highly-Scalable components S3, SNS, SQS, ASG, EMR Spark (exception DB) The pipeline is never blocked because we use a DLQ for messages we cannot process We use queue-based auto-scaling to get high on-demand ingest rates We manage everything with Airflow Every stage in the pipeline is idempotent Every stage in the pipeline is instrumented
  41. 41. ASG Auto Scaling Group 39
  42. 42. ASG - Overview 40 What is it? A means to automatically scale out/in clusters to handle variable load/traffic A means to keep a cluster/service always up Fulfills AWS’s pay-per-use promise! When to use it? Feed-processing, web traffic load balancing, zone outage, etc…
  43. 43. ASG - Data Pipeline 41 importer importer importer importer Importer ASG scaleout/in SQS DB
  44. 44. ASG : CPU-based 42 Sent CPU ACKd/Recvd CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant
  45. 45. ASG : CPU-based 43 Sent CPU Recv Premature Scale-in Premature Scale-in: The CPU drops to noise-levels before all messages are consumed. This causes scale in to occur while the last few messages are still being committed resulting in a long time- to-drain for the queue!
  46. 46. ASG - Queue-based 44 Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0) Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d) This causes the ASG to grow This causes the ASG to shrink
  47. 47. Architecture 45
  48. 48. Reliable Hourly Job Scheduling Workflow Automation & Scheduling 46
  49. 49. 47 Historical Context Our first cut at the pipeline used cron to schedule hourly runs of Spark Problem We only knew if Spark succeeded. What if a downstream task failed? We needed something smarter than cron that Reliably managed a graph of tasks (DAG - Directed Acyclic Graph) Orchestrated hourly runs of that DAG Retried failed tasks Tracked the performance of each run and its tasks Reported on failure/success of runs Our Needs
  50. 50. Airflow Workflow Automation & Scheduling 48
  51. 51. Airflow - DAG Dashboard 49 Airflow: It’s easy to manage multiple DAGs
  52. 52. Airflow - Authoring DAGs 50 Airflow: Visualizing a DAG
  53. 53. 51 Airflow: Author DAGs in Python! No need to bundle many config files! Airflow - Authoring DAGs
  54. 54. Airflow - Performance Insights 52 Airflow: Gantt chart view reveals the slowest tasks for a run!
  55. 55. 53 Airflow: …And we can easily see performance trends over time Airflow - Performance Insights
  56. 56. Airflow - Alerting 54 Airflow: …And easy to integrate with Ops tools!
  57. 57. Airflow - Monitoring 55
  58. 58. 56 Airflow - Join the Community With >30 Companies, >100 Contributors , and >2500 Commits, Airflow is growing rapidly! We are looking for more contributors to help support the community! Disclaimer : I’m a maintainer on the project
  59. 59. Design Goal Scorecard Are We Meeting Our Design Goals? 57
  60. 60. 58 • Scalable • Available • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable Desirable Qualities of a Resilient Data Pipeline
  61. 61. 59 • Scalable • Build using scalable components from AWS • SQS, SNS, S3, ASG, EMR Spark • Exception = DB (WIP) • Available • Build using available components from AWS • Airflow for reliable job scheduling • Instrumented, Monitored, & Alert-enabled • Airflow • Quickly Recoverable • Airflow, DLQs, ASGs, Spark & DB Desirable Qualities of a Resilient Data Pipeline
  62. 62. Questions? (@r39132) 60
  63. 63. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/agari- resilient-predictive-data-pipes

×