Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)

Presented at QCon Shanghai & Tokyo 2016

Published in: Engineering

  1. Cloud Native Data Pipelines. Sid Anand, QCon Shanghai & Tokyo 2016.
  2. About Me. Work[ed|s] @ (company logos not captured); Committer & PPMC on Apache Airflow; Father of 2; Co-Chair for (logo not captured).
  3. Live broadcast (ライブ中継)
  4. Agari : What We Do!
  5. Agari : What We Do
  6. Agari : What We Do
  7. Agari : What We Do
  8. Agari : What We Do
  9. Agari : What We Do
  10. Agari : What We Do. Agari’s Previous EP Version (Batch). Diagram labels: Enterprise Customers; email metadata; apply trust models; email metadata + trust score.
  11. Agari : What We Do. Agari’s Current EP Version (Near-real time). Diagram labels: Enterprise Customers; email metadata; apply trust models; email metadata + trust score; Quarantine.
  12. Data Pipelines: BI vs Predictive
  13. Data Pipelines (BI). Web Servers; OLTP DB (MySQL, Oracle, Cassandra); ETL (batch); Data Warehouse (Teradata, Redshift, BigQuery); Reporting Tools; Query Browsers.
  14. Data Pipelines (Predictive). Data Source; Web Servers; OLTP DB or cache (MySQL, Oracle, Cassandra, Redis); ETL, batch or streaming (Spark, Flink, Beam, Storm); Data Products: Ranking (Search, News Feed), Recommender Products, Fraud Detection / Prevention.
  15. Data Products
  16. Data Pipelines. Labels: BI; Predictive; Common; Focus of this talk. BI: Web Servers; OLTP DB (MySQL, Oracle, Cassandra); ETL (batch); Data Warehouse (Teradata, Redshift, BigQuery); Reporting Tools; Query Browsers. Predictive: Data Source; Web Servers; OLTP DB or cache (MySQL, Oracle, Cassandra, Redis); ETL, batch or streaming (Spark, Flink, Beam, Storm); Ranking (Search, News Feed), Recommender Products, Fraud Detection / Prevention.
  17. Motivation: Cloud Native Data Pipelines
  18. Cloud Native Data Pipelines. Big Data companies like LinkedIn, Facebook, Twitter, & Google build custom, large-scale data pipelines that run in their own data centers.
  19. Cloud Native Data Pipelines. Big Data companies like LinkedIn, Facebook, Twitter, & Google build custom, large-scale data pipelines that run in their own data centers. Most start-ups run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?
  20. Cloud Native Data Pipelines. Cloud Native Techniques and Open Source Technologies ~ Custom Data Pipeline Stacks seen in Big Data companies.
  21. Design Goals: Desirable Qualities of a Resilient Data Pipeline
  22. Desirable Qualities of a Resilient Data Pipeline: Operability, Correctness, Timeliness, Cost.
  23. Desirable Qualities of a Resilient Data Pipeline. Correctness: • Data Integrity (no loss, etc…) • Expected data distributions. Timeliness: • All output within time-bound SLAs. Operability: • Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs • Quick Recoverability. Cost: • Pay-as-you-go.
  24. Quickly Recoverable. • Bugs happen! • Bugs in Predictive Data Pipelines have a large blast radius • Optimize for MTTR (mean time to recovery)
  25. Predictive Analytics @ Agari: Use Cases
  26. Use Cases (Enterprise Protect). Apply trust models (message scoring): batch + near real time. Build trust models: batch.
  27. Use-Case : Message Scoring (batch). Batch Pipeline Architecture
  28. Use-Case : Message Scoring. An Avro file is uploaded to S3 every 15 minutes. (Diagram: enterprise A, enterprise B, enterprise C; S3)
  29. Use-Case : Message Scoring. Airflow kicks off a Spark message scoring job every hour (EMR). (Diagram: enterprise A, enterprise B, enterprise C; S3)
  30. Use-Case : Message Scoring. The Spark job writes scored messages and stats to another S3 bucket. (Diagram: enterprise A, enterprise B, enterprise C; S3; S3)
  31. Use-Case : Message Scoring. This triggers SNS/SQS message events. (Diagram: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS)
  32. Use-Case : Message Scoring. An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages. (Diagram: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers ASG)
  33. Use-Case : Message Scoring. The importers rapidly ingest scored messages and aggregate statistics into the DB. (Diagram: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers ASG; DB)
  34. Use-Case : Message Scoring. Users receive alerts of untrusted emails & can review them in the web app. (Diagram: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers ASG; DB)
  35. Use-Case : Message Scoring. Airflow manages the entire process. (Diagram: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers ASG; DB)
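
The batch flow walked through on slides 28 to 35 is a chain of dependent steps, which is the shape a workflow scheduler such as Airflow manages. The sketch below is not Airflow code and not Agari's DAG; it only models the ordering with Python's standard-library `graphlib` (Python 3.9+). All task names are hypothetical labels invented here for the steps on the slides.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the batch flow on slides 28-35. Each task maps to
# the set of tasks that must finish before it can start.
BATCH_PIPELINE = {
    "wait_for_avro_uploads": set(),
    "spark_score_messages": {"wait_for_avro_uploads"},
    "write_scored_output_to_s3": {"spark_score_messages"},
    "notify_sns_sqs": {"write_scored_output_to_s3"},
    "importers_load_db": {"notify_sns_sqs"},
    "alert_users": {"importers_load_db"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step in execution_order(BATCH_PIPELINE):
        print(step)
```

Because this particular graph is a straight chain, there is only one valid order; a scheduler adds retries, timing, and monitoring on top of that ordering.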
  36. Tackling Cost & Timeliness: Leveraging the AWS Cloud
  37. Tackling Cost (Between Daily Runs vs. During Daily Runs). When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR.
  38. Tackling Cost (Between Hourly Runs vs. During Hourly Runs). When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR. This does not help when runs are hourly, since AWS charges at an hourly rate for EC2 instances!
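
The cost argument on slides 37 and 38 is simple arithmetic, sketched below under the hourly billing described on the slide (every started instance-hour is charged in full). The run length of 20 minutes and the instance count are made-up inputs, and the function names are hypothetical; the sketch also assumes runs do not overlap and that instances are terminated between runs.

```python
import math


def billed_instance_hours(runs_per_day, run_minutes, instances):
    """Instance-hours billed per day when each started hour is charged in full.

    Assumes every run launches fresh instances and terminates them when done.
    """
    hours_per_run = math.ceil(run_minutes / 60)
    return runs_per_day * hours_per_run * instances


def savings_vs_always_on(runs_per_day, run_minutes, instances):
    """Fraction of an always-on fleet's daily cost that is avoided."""
    always_on = 24 * instances
    return 1 - billed_instance_hours(runs_per_day, run_minutes, instances) / always_on


if __name__ == "__main__":
    # One run per day lasting up to an hour: 23 of 24 hours are not billed.
    print(savings_vs_always_on(runs_per_day=1, run_minutes=60, instances=10))
    # Hourly runs of 20 minutes (hypothetical): every hour is still billed.
    print(savings_vs_always_on(runs_per_day=24, run_minutes=20, instances=10))
```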
  39. Tackling Timeliness: Auto Scaling Group (ASG)
  40. ASG - Overview. What is it? A means to automatically scale out/in clusters to handle variable load/traffic. A means to keep a cluster/service of a fixed size always up.
  41. ASG - Data Pipeline. (Diagram: SQS; Importer ASG with four importers, scale out/in; DB)
  42. ASG : CPU-based. CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant. (Chart series: Sent; ACKd/Recvd; CPU)
  43. ASG : CPU-based. Premature Scale-in: • The CPU drops to noise levels before all messages are consumed • This causes scale-in to occur while the last few messages are still being committed. (Chart series: Sent; Recv; CPU)
  44. ASG : Queue-based. Scale-out: when Visible Messages > 0 (a.k.a. when queue depth > 0); this causes the ASG to grow. Scale-in: when Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d); this causes the ASG to shrink.
  45. ASG : Queue-based. Shoyu Koto Da!!!! (しょうゆうことだ!!, roughly "So that's how it is!!")
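
The queue-based rule on slide 44 can be written as a small decision function. This is a minimal sketch, not the scaling policy Agari deployed: the enum, the function name, and the precedence (scale-out wins when both conditions hold) are assumptions made here. With SQS, the two counts would typically come from the queue attributes `ApproximateNumberOfMessages` (visible) and `ApproximateNumberOfMessagesNotVisible` (in flight).

```python
from enum import Enum


class ScalingAction(Enum):
    SCALE_OUT = "scale_out"
    SCALE_IN = "scale_in"
    HOLD = "hold"


def queue_based_action(visible, in_flight):
    """Pick a scaling action from queue counts, following slide 44.

    visible   -- messages waiting in the queue (queue depth)
    in_flight -- messages received by a consumer but not yet ACK'd
    """
    if visible < 0 or in_flight < 0:
        raise ValueError("message counts cannot be negative")
    if visible > 0:
        # Work is waiting: grow the ASG.
        return ScalingAction.SCALE_OUT
    if in_flight == 0:
        # Nothing waiting and the last in-flight message is ACK'd: shrink.
        return ScalingAction.SCALE_IN
    # Queue is drained but some messages are still being committed: do not
    # shrink yet. This is the case CPU-based scaling got wrong (slide 43).
    return ScalingAction.HOLD


if __name__ == "__main__":
    print(queue_based_action(visible=120, in_flight=8))
    print(queue_based_action(visible=0, in_flight=3))
    print(queue_based_action(visible=0, in_flight=0))
```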
  46. Desirable Qualities of a Resilient Data Pipeline: Operability, Correctness, Timeliness, Cost. • ASG • EMR Spark. Daily: • ASG • EMR Spark. Hourly: ASG • No Cost Savings.
  47. Tackling Operability & Correctness: Leveraging Tooling
  48. Tackling Operability : Requirements. A simple way to author and manage workflows. Provides visual insight into the state & performance of workflow runs. Integrates with our alerting and monitoring tools.
  49. Apache Airflow: Workflow Automation & Scheduling
  50. Apache Airflow - Authoring DAGs. Airflow: Author DAGs in Python! No need to bundle many config files!
  51. Apache Airflow - Authoring DAGs. Airflow: Visualizing a DAG
  52. Apache Airflow - Managing DAGs. Airflow: It’s easy to manage multiple DAGs
  53. Apache Airflow - Perf. Insights. Airflow: Gantt chart view reveals the slowest tasks for a run!
  54. Apache Airflow - Perf. Insights. Airflow: Task Duration chart view shows task completion time trends!
  55. Apache Airflow - Alerting. Airflow: …And easy to integrate with Ops tools!
  56. Apache Airflow - Correctness
  57. Desirable Qualities of a Resilient Data Pipeline: Operability, Correctness, Timeliness, Cost.
  58. Use-Case : Message Scoring (near-real time). NRT Pipeline Architecture
  59. Use-Case : Message Scoring. Kinesis batch put every second. (Diagram: enterprise A, enterprise B, enterprise C; Kinesis)
  60. Use-Case : Message Scoring. An ASG of scorers is scaled up to one process per core per Kinesis shard. (Diagram: enterprise A, enterprise B, enterprise C; Kinesis; Scorers ASG)
  61. Use-Case : Message Scoring. Scorers apply the trust model and send scored messages downstream. (Diagram: enterprise A, enterprise B, enterprise C; Kinesis; Scorers ASG; Kinesis)
  62. Use-Case : Message Scoring. An ASG of importers is scaled up to rapidly import messages. (Diagram: enterprise A, enterprise B, enterprise C; Kinesis; Scorers ASG; Kinesis; Importers ASG; DB)
  63. Use-Case : Message Scoring. Imported messages are also consumed by the alerter. (Diagram: enterprise A, enterprise B, enterprise C; Kinesis; Scorers ASG; Kinesis; Importers ASG; DB; Kinesis; Alerters ASG)
  64. Use-Case : Message Scoring. Imported messages are also consumed by the alerter. (Diagram: enterprise A, enterprise B, enterprise C; Kinesis; Scorers ASG; Kinesis; Importers ASG; DB; Kinesis; Alerters ASG; Quarantine Email)
  65. Innovations: NRT Pipeline Architecture
  66. Apache Avro: What is Avro?
  67. What is Avro? Avro is a self-describing serialization format that supports: primitive data types (int, long, boolean, float, string, bytes, etc…); complex data types (records, arrays, unions, maps, enums, etc…); many language bindings (Java, Scala, Python, Ruby, etc…).
  68. What is Avro? Avro is a self-describing serialization format that supports: primitive data types (int, long, boolean, float, string, bytes, etc…); complex data types (records, arrays, unions, maps, enums, etc…); many language bindings (Java, Scala, Python, Ruby, etc…). The most common format for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc… Supports Schema Evolution!
  69. Avro Schema Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
  70. Avro Schema Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Annotation: complex type (record).
  71. Avro Schema Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Annotations: complex type (record); schema name: User.
  72. Avro Schema Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Annotations: complex type (record); schema name: User; 3 fields in the record: 1 required, 2 optional.
  73. Avro Schema Data File Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Schema: 0.0001 %. Data (x 1,000,000,000): 99.999 %.
  74. Avro Schema Streaming Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Schema: 99 %. Data (binary data block): 1 %.
  75. Avro Schema Streaming Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Schema: 99 %. Data (binary data block): 1 %. OVERHEAD!!
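
The overhead point on slides 73 to 75 can be checked with a little arithmetic. The sketch below hand-encodes one `User` record using Avro's binary rules (zigzag varints for int/long, length-prefixed UTF-8 strings, a branch index before each union value) and compares its size with the JSON schema text. The sample values are made up, the helper names are hypothetical, and container-file details such as headers and sync markers are ignored, so the resulting shares are illustrative and will not match the rounded percentages on the slides.

```python
import json

SCHEMA = {
    "namespace": "agari",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}


def zigzag_varint(n):
    """Encode an integer as Avro encodes int/long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_user(name, favorite_number, favorite_color):
    """Binary-encode one User record for SCHEMA, with no schema attached."""
    out = bytearray()
    raw = name.encode("utf-8")
    out += zigzag_varint(len(raw)) + raw
    # A union is written as its branch index, then the value; "null" writes nothing.
    if favorite_number is None:
        out += zigzag_varint(1)
    else:
        out += zigzag_varint(0) + zigzag_varint(favorite_number)
    if favorite_color is None:
        out += zigzag_varint(1)
    else:
        raw = favorite_color.encode("utf-8")
        out += zigzag_varint(0) + zigzag_varint(len(raw)) + raw
    return bytes(out)


def schema_share(record_count):
    """Fraction of total bytes taken by the schema when sent with record_count records."""
    schema_bytes = len(json.dumps(SCHEMA, separators=(",", ":")).encode("utf-8"))
    data_bytes = record_count * len(encode_user("Alyssa", 256, None))
    return schema_bytes / (schema_bytes + data_bytes)


if __name__ == "__main__":
    print(len(encode_user("Alyssa", 256, None)), "bytes per record")
    print("schema share, 1 record per message:", schema_share(1))
    print("schema share, 1,000,000,000 records per file:", schema_share(10**9))
```

With one record per message the schema dominates the bytes on the wire; with a billion records in a file it is negligible, which is the contrast the two slides draw.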
  76. Innovation 1 : Avro Schema Registry. The Message Producer (P) calls register_schema on the Schema Registry (Lambda) with the schema: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
  77. Innovation 1 : Avro Schema Registry. register_schema returns a UUID. (Diagram: Schema Registry (Lambda); Message Producer (P))
  78. Innovation 1 : Avro Schema Registry. Message Producer sends UUID + Data. (Diagram: Schema Registry (Lambda); Message Producer (P); Data; Message Consumer (C))
  79. Innovation 1 : Avro Schema Registry. The Message Consumer (C) calls getSchemaById (UUID). (Diagram: Schema Registry (Lambda); Message Producer (P); Data; Message Consumer (C))
  80. Innovation 1 : Avro Schema Registry. getSchemaById (UUID) returns the schema: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } (Diagram: Schema Registry (Lambda); Message Producer (P); Data; Message Consumer (C))
  81. Innovation 1 : Avro Schema Registry. getSchemaById (UUID) returns the schema: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Message Consumers: • download & cache the schema • then decode the data.
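
The register / send / look up / cache flow on slides 76 to 81 can be sketched with an in-memory stand-in for the registry. Everything below is an assumption for illustration: the class names, the use of `uuid.uuid4`, de-duplication by schema text, and a JSON payload standing in for the Avro binary block. The slides only show that `register_schema` returns a UUID and that consumers call `getSchemaById` and cache the result; the real Lambda interface is not shown.

```python
import json
import uuid


class InMemorySchemaRegistry:
    """Hypothetical stand-in for the Lambda-based registry on the slides."""

    def __init__(self):
        self._schema_by_id = {}
        self._id_by_schema_text = {}
        self.lookups = 0  # counts get_schema_by_id calls, to show caching works

    def register_schema(self, schema):
        """Return the ID for this schema, creating one on first registration."""
        text = json.dumps(schema, sort_keys=True)
        if text not in self._id_by_schema_text:
            schema_id = str(uuid.uuid4())
            self._id_by_schema_text[text] = schema_id
            self._schema_by_id[schema_id] = schema
        return self._id_by_schema_text[text]

    def get_schema_by_id(self, schema_id):
        self.lookups += 1
        return self._schema_by_id[schema_id]


class Consumer:
    """Downloads each schema once, caches it, then decodes messages."""

    def __init__(self, registry):
        self._registry = registry
        self._cache = {}

    def decode(self, message):
        schema_id, payload = message
        if schema_id not in self._cache:
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
        schema = self._cache[schema_id]
        record = json.loads(payload)  # JSON stands in for Avro binary here
        expected = {field["name"] for field in schema["fields"]}
        if set(record) != expected:
            raise ValueError("payload does not match the registered schema")
        return record


def make_message(schema_id, record):
    """Producer side: ship the small schema ID with the data, not the schema."""
    return (schema_id, json.dumps(record).encode("utf-8"))


if __name__ == "__main__":
    schema = {
        "namespace": "agari", "type": "record", "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "favorite_number", "type": ["int", "null"]},
            {"name": "favorite_color", "type": ["string", "null"]},
        ],
    }
    registry = InMemorySchemaRegistry()
    schema_id = registry.register_schema(schema)
    consumer = Consumer(registry)
    for number in (7, 42):
        record = {"name": "Alyssa", "favorite_number": number, "favorite_color": None}
        print(consumer.decode(make_message(schema_id, record)))
    print("registry lookups:", registry.lookups)
```

The design point is that each message carries only a fixed-size ID, so the per-message schema overhead shown on the streaming slides goes away after the first lookup.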
  82. Innovation 1 : Avro Schema Registry. Imported messages are also consumed by the alerter. (Diagram: enterprise A, enterprise B, enterprise C; Kinesis; Scorers ASG; Kinesis; Importers ASG; DB; Kinesis; Alerters ASG; a Schema Registry (SR) at each of three stages)
  83. Innovation 2 : Repeatable Units. The architecture is composed of repeated patterns of: ASG-based compute consumers; Kinesis transport streams (i.e. AWS’ managed “Kafka”); a Lambda-based Avro Schema Registry. (Diagram: Kinesis i; ASG Compute i; SR)
  84. Innovation 2 : Repeatable Units. You can chain these repeatable units together to make arbitrary DAGs (Directed Acyclic Graphs). Use HashiCorp’s Terraform to compose your DAG through automation. The example above is a simple linear DAG with 3 units. (Diagram: three units, each with Kinesis i; ASG Compute i; SR)
  85. Innovation 3 : Reactive-Scaling (WIP). Airflow Job Reactively Scales. (Diagram: enterprise A, enterprise B, enterprise C; Kinesis; Scorers ASG; Kinesis; Importers ASG; DB; Kinesis; Alerters ASG; SR at each of three stages)
  86. Innovation 4 : Anomaly-based Rollback (WIP). If the Anomaly-detector & Reverter (ADR) is triggered and a model build or code push was recently done to Compute 1, ADR will revert the last code or model push to ASG Compute 1. (Diagram: ASG Compute 1; Kinesis; ASG Compute 2; SR; Anomaly-detector & Reverter)
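
The rollback rule on slide 86 reduces to a two-part condition: an anomaly was detected, and a push happened recently. The slide does not say how "recently" is defined and the feature is marked work in progress, so the one-hour window, the function name, and the parameters below are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the slide does not define "recently"; one hour is a placeholder.
RECENT_PUSH_WINDOW = timedelta(hours=1)


def should_revert(anomaly_triggered, last_push_at, now, window=RECENT_PUSH_WINDOW):
    """Return True when the last code or model push should be rolled back.

    anomaly_triggered -- whether the anomaly detector fired
    last_push_at      -- time of the last code/model push, or None if unknown
    now               -- current time, passed in so the rule is easy to test
    """
    if not anomaly_triggered or last_push_at is None:
        return False
    age = now - last_push_at
    return timedelta(0) <= age <= window


if __name__ == "__main__":
    now = datetime(2016, 10, 20, 12, 0)
    print(should_revert(True, now - timedelta(minutes=10), now))
    print(should_revert(True, now - timedelta(days=3), now))
    print(should_revert(False, now - timedelta(minutes=10), now))
```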
  87. Open Source Plans. Follow us to be notified when the following are open-sourced: • Avro Schema Registry • Agari (Kinesis+ASG) scaling tool (Airflow Job) • Anomaly-detector & Reverter. To be notified, follow @AgariEng & @r39132
  88. Acknowledgments. None of this work would be possible without the contributions of the strong team below: • Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones • Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle
  89. Questions? (@r39132)
