
Event Detection Pipelines with Apache Kafka

Jeff Holoman
Cloudera


  1. Event Detection Pipelines with Apache Kafka. Hadoop Summit, Brussels 2015. Jeff Holoman
  2. The “Is this talk interesting enough to sit through?” slide
     What I’m going to say:
     • How we got here
     • Why Kafka
     • Use Case
     • Challenges
     • Kafka in Context
     Buzzword Bingo! If I don’t say all of these I owe you a beverage:
     Kafka, Machine Learning, Real-time, Delivery Semantics, Spark Streaming, Hadoop, Storm, Durability Guarantees, Ingest Pipelines, Event Detection, Avro, JSON
     © Cloudera, Inc. All rights reserved.
  3. How we got here
     We wanted to do some stuff in Hadoop.
     [Diagram labels: Application, RDBMS, Hadoop, Batch File Transfer, Reporting]
  5. About Kafka
     • Publish/subscribe messaging system from LinkedIn
     • High throughput (hundreds of thousands of messages/sec)
     • Low latency (sub-second to low seconds)
     • Fault-tolerant (replicated and distributed)
     • Supports agnostic messaging
     • Standardizes format and delivery
  6. Why Kafka
     Kafka decouples data pipelines.
     [Diagram labels: Producers (four Source Systems), Kafka Broker, Consumers (Hadoop, Security Systems, Real-time monitoring, Data Warehouse)]
  7. Use Case: Fraud Detection in Consumer Banking
  8. Event Detection - Fraud
     • Offline
       • Model Building
       • Discovery
       • Forensics
       • Case Management
       • Pattern Analysis
     • Online
  9. [Diagram labels: Online, Mobile, ATM, POS; Integration]
  10. [Diagram labels: Online, Mobile, ATM, POS; Integration; Event Processing]
  11. [Diagram labels: Online, Mobile, ATM, POS; Integration; Event Processing; Repository; Reporting, Forensics, Analytics]
  12. [Diagram labels: Online, Mobile, ATM, POS; Integration; Event Processing; Storage (HDFS, Solr); Processing (Impala, Map/Reduce, Spark); 3rd Party (R, SAS, etc.); Mainframe/RDBMS]
  13. [Diagram labels: Online, Mobile, ATM, POS; Integration; Event Processing; Storage (HDFS, Solr); Processing (Impala, Map/Reduce, Spark); 3rd Party (R, SAS, etc.); Mainframe/RDBMS; Rules / Models; Case Management]
      Automated & manual analytical adjustments and pattern detection
  14. Event Detection - Fraud
      • Offline
        • Model Building
        • Discovery
        • Forensics
        • Case Management
        • Pattern Analysis
      • Online
        • Ingest
        • Enrichment (profiles, feature selection, etc.)
        • Early warning / detection (model serving / model application)
        • Persistence
  15. [Diagram labels: Online, Mobile, ATM, POS; Integration; Event Processing; Repository; Case Management; Reporting, Forensics, Analytics; Alerting; Reference Data; Rules / Models]
  16. Event Detection: A Concrete Example
  18. This is not a Data Science Talk. But let's talk about it anyway.
  19. Event Detection
      • Attempt to detect if an event of interest has occurred
      • Temporal or spatial (or both)
      • High number of non-events creates challenges
      • Fraud detection: semi-supervised ML
      • You want to optimize for accuracy but also balance the risk of false positives
      • Very important to monitor the model
  20. Generally
      • Learn a model for an expected signal value
      • Calculate a score based on the current event
      • Alert (or don’t) on that value
      • Simple, right?
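The three steps on the "Generally" slide can be sketched in a few lines. This is only an illustration: the talk does not say which model was used, so the sketch assumes the "expected signal value" is the mean and standard deviation of past transaction amounts and the score is a z-score. The function names, the sample amounts and the threshold of 3.0 are all hypothetical.

```python
from statistics import mean, stdev

def learn_model(history):
    """Learn the expected signal value: here, mean and spread of past amounts."""
    return {"mean": mean(history), "std": stdev(history)}

def score(model, amount):
    """Score the current event as its distance from the expected value (z-score)."""
    if model["std"] == 0:
        return 0.0
    return abs(amount - model["mean"]) / model["std"]

def should_alert(model, amount, threshold=3.0):
    """Alert (or don't) on the score; the threshold is an assumed tuning knob."""
    return score(model, amount) > threshold

history = [20.0, 25.0, 22.0, 30.0, 18.0, 27.0]  # made-up past transaction amounts
model = learn_model(history)

print(should_alert(model, 24.0))   # a typical amount
print(should_alert(model, 500.0))  # an amount far outside the learned range
```

Raising the threshold trades missed events for fewer false positives, which is the balance the previous slide calls out.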
  21. Some Numbers
      • No data loss is acceptable
      • Event processing must complete ASAP, <500 ms
      • Support approximately 400M transactions per day in aggregate
      • Highest volume flow:
        • Current: 2k transactions/s
        • Projected: 10k transactions/s
      • Each flow has at least three steps
        • Adapter, Persistence, Hadoop Persistence
      • Most complex with approximately seven steps
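A quick back-of-the-envelope check puts these numbers in perspective. The inputs come from the slide; splitting the latency budget evenly across steps is an assumption made here for illustration, not something the talk states.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_transactions = 400_000_000   # "approximately 400M transactions per day"
latency_budget_ms = 500            # "<500 ms" end-to-end target
steps_in_most_complex_flow = 7     # "approximately seven steps"

# Average aggregate rate if the load were spread evenly over the day.
avg_tps = daily_transactions / SECONDS_PER_DAY

# Assumed even split of the end-to-end budget across the steps of the longest flow.
per_step_budget_ms = latency_budget_ms / steps_in_most_complex_flow

print(f"average aggregate load: {avg_tps:,.0f} transactions/s")
print(f"per-step budget if split evenly: {per_step_budget_ms:.0f} ms")
```

Real traffic is rarely flat across the day, so peak rates will sit above this average.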
  22. Technology Stack
  23. [Diagram labels: Online, Mobile, ATM, POS; Spring Integration; JVMs; HBase (RPC, Java API); Flume via Avro RPC client (Netty); Files; Sqoop; Web Applications (JDBC, REST); Java / PMML; Storage (HDFS, Solr); Processing (Impala, Map/Reduce, Spark); 3rd Party (R, SAS, etc.); Mainframe / DB2, Oracle]
  24. Flume
      [Diagram labels: Hosts 1-4 running JVM 1..N; Flume Agents 1-N with File Channel on Prod Edge Nodes and DR Edge Nodes; Production Hadoop and DR Hadoop, each with Storage (HDFS) and Processing (Impala, Map/Reduce, Spark)]
  25. Challenges
      • Fraud prevention is very difficult due to response time requirements.
  26. Fraud Processing System
      [Diagram: response-time scale of ~50 ms, >500 ms, >30,000 ms, >90,000 ms, running from Prevention to Detection; difficulty runs from High to Low(er)]
  27. Challenges
      • Fraud prevention is very difficult due to response time requirements.
      • Disruptions in downstream systems can impact actual processing.
        • Problems with HDFS, network problems, SAN, agents, etc.
      • Integrating data across multiple systems increases complexity.
      • Other systems want / need the data.
        • System has all of the transactions! Can be used for customer events, analytics, etc.
      • Tracking data and metrics is difficult with different protocols.
        • We need to true up the transaction data with what ends up in HDFS.
  28. [Diagram labels: Incoming Events; Event Processing (JVMs); Kafka; HBase; Model Serving; Model Building; Repository; Outgoing Events; Storage (HDFS, Solr); Processing (Impala, MR, Spark); 3rd Party; flows: Txn In, Txn Updates, All Events, Txn Out, Alerts, Case / Alert Management, Other]
  29. Kafka
      [Diagram labels: Hosts 1-4 running JVM 1..N; Kafka Cluster (Broker 1, Broker 2, Broker 3, Broker N); Flume Agents 1-N with File Channel on Prod Edge Nodes; Production Hadoop and DR Hadoop, each with Storage (HDFS) and Processing (Impala, Map/Reduce, Spark)]
  30. Kafka - Considerations
      • Data Exchange
  31. Data Exchange in Distributed Architectures
      • Multiple systems interacting together benefit from a common data exchange format.
      • Choosing the correct standard can significantly impact application design and TCO.
      [Diagram labels: Client, serialize / deserialize, Common Data Format, serialize / deserialize, Client]
  32. Goals
      • Simple
      • Flexible
      • Efficient
      • Change Tolerant
      • Interoperable
      As systems become more complex, data endpoints need to be decoupled.
  33. He means traffic lights
  34. Use Avro
      • A data serialization system
      • Data always* accompanied by a schema
      • Provides
        • A compact, fast, binary data format
        • A container file to store persistent data
        • Remote Procedure Call (RPC)
        • Simple integration with dynamic languages
        • Schema evolution
      • Similar to Thrift or Protocol Buffers, but differs by
        • Dynamic typing
        • Untagged data
        • No manually-assigned field IDs: when a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
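To make "resolved symbolically, using field names" concrete, here is a toy sketch of the idea in plain Python. It does not use the Avro library and is not Avro's implementation; it only mimics the rule that fields are matched by name, that a field missing from the written data takes the reader's default, and that fields the reader does not know are ignored. The record contents and the `resolve` helper are hypothetical.

```python
REQUIRED = object()  # sentinel: the reader schema declares no default for this field

def resolve(written_record, reader_fields):
    """Project a record onto the reader's schema, matching fields by name.

    reader_fields maps field name -> default value, or REQUIRED when the
    reader schema has no default for that field.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in written_record:
            resolved[name] = written_record[name]   # field present in both schemas
        elif default is not REQUIRED:
            resolved[name] = default                # new field: fall back to the default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved  # fields only the writer knew about are dropped

# Record written with an older schema, read with a newer one.
old_record = {"txn_id": "t1", "amount": 12.5, "legacy_flag": True}
new_reader = {"txn_id": REQUIRED, "amount": REQUIRED, "channel": "unknown"}
resolved = resolve(old_record, new_reader)
print(resolved)
```

Because matching is by name rather than by numeric tag, adding a field with a default or dropping an unused one does not break existing producers or consumers.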
  35. Schema Registry
      • Use a Schema Registry / Repository
      • There are open-source options out there
      • Exposes a REST interface
      • Backend storage can be just about anything
      • Can be heavily customized for your environment
  36. Deploying Kafka
      • Data Exchange
      • Provide Common Libraries
        • There are a number of Kafka clients out there… standardize on a consistent producer / consumer library so developers aren’t reinventing the wheel.
  37. Deploying Kafka
      • Data Exchange
      • Provide Common Libraries
      • Understand Durability Guarantees and Delivery Semantics
  38. Durable Writes
      • Producers can choose to trade throughput for durability of writes:

        Durability   Behaviour                          Per Event Latency   Required Acknowledgements (request.required.acks)
        Highest      ACK all ISRs have received         Highest             -1
        Medium       ACK once the leader has received   Medium              1
        Lowest       No ACKs required                   Lowest              0

      • A sane configuration:

        Property                Value
        replication             3
        min.insync.replicas     2
        request.required.acks   -1
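The table can be read as "how many copies of a record exist when the producer is told the write succeeded". The sketch below is a simplified model of that behaviour, not broker code: the function name is hypothetical, and `in_sync_replicas` is assumed to count the leader as well as its followers.

```python
def copies_guaranteed_at_ack(acks, in_sync_replicas, min_insync_replicas=2):
    """Copies of a record guaranteed to exist when the producer treats the send as done.

    acks =  0: the producer does not wait for any acknowledgement
    acks =  1: the leader acknowledges once it has the record
    acks = -1: the leader waits for every in-sync replica (ISR); the write is
               rejected if the ISR is smaller than min.insync.replicas
    """
    if acks == 0:
        return 0
    if in_sync_replicas < 1:
        raise RuntimeError("no leader available for the partition")
    if acks == 1:
        return 1
    if acks == -1:
        if in_sync_replicas < min_insync_replicas:
            raise RuntimeError("not enough in-sync replicas; write rejected")
        return in_sync_replicas
    raise ValueError("acks must be -1, 0 or 1")

# The "sane configuration": replication 3, min.insync.replicas 2, acks -1.
print(copies_guaranteed_at_ack(-1, in_sync_replicas=3))  # all brokers healthy
print(copies_guaranteed_at_ack(-1, in_sync_replicas=2))  # one replica has fallen behind
```

With this configuration one replica can be lost and writes still succeed with two copies; if a second is lost, writes are refused rather than accepted with a single copy.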
  39. Producer Performance: Single Thread

        Type             Records/sec   MB/s   Avg Latency (ms)   Max Latency   Median Latency   95th %tile
        No Replication   1,100,182     104    42                 1070          1                362
        3x Async         1,056,546     101    42                 1157          2                323
        3x Sync          493,855       47     379                4483          192              1692
  40. Delivery Semantics
      • At least once
        • Messages are never lost but may be redelivered
      • At most once
        • Messages may be lost but are never redelivered
      • Exactly once
        • Messages are delivered once and only once
        • Much harder (impossible??)
  41. Getting Exactly Once Semantics
      • Must consider two components
        • Durability guarantees when publishing a message
        • Durability guarantees when consuming a message
      • Producer
        • What happens when a produce request was sent but a network error returned before an ack?
        • Use a single writer per partition and check the latest committed value after network errors
      • Consumer
        • Include a unique ID (e.g. UUID) and de-duplicate.
        • Consider storing offsets with data
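The consumer-side advice (unique IDs plus offsets stored with the data) can be sketched without any Kafka client. This is a toy, in-memory illustration for a single partition; the `IdempotentSink` class and the message layout are hypothetical, and a real system would persist the offset and the records in one atomic write.

```python
import uuid

class IdempotentSink:
    """Toy sink that keeps results and the next expected offset together."""

    def __init__(self):
        self.seen_ids = set()
        self.records = []
        self.next_offset = 0  # stored alongside the data, so both advance together

    def apply(self, offset, message):
        """Apply a message at most once; return True if it was stored."""
        if offset < self.next_offset:
            return False  # position already processed (redelivery after a restart)
        self.next_offset = offset + 1
        if message["id"] in self.seen_ids:
            return False  # same logical event written twice (producer retry)
        self.seen_ids.add(message["id"])
        self.records.append(message["payload"])
        return True

log = [{"id": str(uuid.uuid4()), "payload": p} for p in ("txn-1", "txn-2", "txn-3")]
sink = IdempotentSink()
for offset, message in enumerate(log):
    sink.apply(offset, message)

# Simulate at-least-once redelivery: offsets 1 and 2 arrive a second time.
redelivered = [sink.apply(offset, log[offset]) for offset in (1, 2)]
print(sink.records, redelivered)
```

The offset check handles replays of the same log position, while the ID check handles the producer case from the slide, where a retried send lands at a new offset.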
  42. Deploying Kafka
      • Data Exchange
      • Provide Common Libraries
      • Understand Durability Guarantees and Delivery Semantics
      • Build in auditing from the start
        • We can use Kafka in-stream to save some reporting and analytics later
        • This will increase your development time but pay off in the long run
  43. Auditing and Tracking
      • Embed timings in the message itself, e.g.:

        {
          "name": "timings",
          "type": [ "null", { "type": "map", "values": "long" } ],
          "default": null
        }

      • Adopt LinkedIn-style auditing
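One way the `timings` map could be used is for each stage to stamp the message as it passes through, so per-stage latency can be computed downstream. The sketch below assumes the map holds stage name to epoch milliseconds; the stage names and the helper functions are hypothetical, and messages are plain dicts rather than Avro records.

```python
import time

def stamp(message, stage, now_ms=None):
    """Record when `stage` handled the message, in epoch milliseconds."""
    if message.get("timings") is None:  # the field defaults to null in the schema
        message["timings"] = {}
    message["timings"][stage] = int(time.time() * 1000) if now_ms is None else now_ms
    return message

def stage_latencies(message):
    """Elapsed milliseconds between consecutive stages, ordered by timestamp."""
    ordered = sorted(message["timings"].items(), key=lambda item: item[1])
    return {f"{a}->{b}": tb - ta for (a, ta), (b, tb) in zip(ordered, ordered[1:])}

msg = {"txn_id": "abc", "timings": None}
stamp(msg, "adapter", now_ms=1_000)            # fixed times keep the example repeatable
stamp(msg, "model_serving", now_ms=1_040)
stamp(msg, "hadoop_persistence", now_ms=1_300)
print(stage_latencies(msg))
```

Comparing timestamps taken on different hosts only works if their clocks are reasonably synchronized.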
  44. Deploying Kafka
      • Data Exchange
      • Provide Common Libraries
      • Understand Durability Guarantees and Delivery Semantics
      • Build in auditing from the start
      • Use Flume for easy ingest into HDFS / Solr
  45. Flume (Flafka)
      • Source
      • Sink
      • Channel
  46. Flafka
      [Diagram labels: Data Sources (Logs, JMS, WebServer, etc.); Kafka Producers; Kafka; Flume Agent (Sources, Interceptors, Selectors, Channels, Sinks); HDFS]
  47. Deploying Kafka
      • Data Exchange
      • Provide Common Libraries
      • Understand Durability Guarantees and Delivery Semantics
      • Build in auditing from the start
      • Use Flume for easy ingest into HDFS / Solr
      • Benchmark based on your message size
  48. Benchmark Results
  49. Benchmark Results
  50. Deploying Kafka
      • Data Exchange
      • Provide Common Libraries
      • Understand Durability Guarantees and Delivery Semantics
      • Build in auditing from the start
      • Benchmark based on your message size
      • Take the time to set up Kafka metrics
  51. Things like
      • Consumer lag
      • Message in rate
      • Bytes in rate
      • Bytes out rate
      • (you can publish your own as well)
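Consumer lag is the simplest of these to reason about: per partition, it is the log end offset minus the offset the consumer group has committed. A minimal sketch with made-up offsets follows; the function name is hypothetical, and treating a partition with no commit as starting from offset 0 is an assumption.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet committed by the consumer group."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }

log_end_offsets = {0: 1500, 1: 1480, 2: 1510}   # latest offset per partition (made up)
committed_offsets = {0: 1500, 1: 1400}          # partition 2 has no commit yet

lag = consumer_lag(log_end_offsets, committed_offsets)
total_lag = sum(lag.values())
print(lag, total_lag)
```

Lag that keeps growing on one partition usually points at a slow or stuck consumer, which matters when event processing has a sub-second target.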
  52. Deploying Kafka
      • Data Exchange
      • Provide Common Libraries
      • Understand Durability Guarantees and Delivery Semantics
      • Build in auditing from the start
      • Benchmark based on your message size
      • Take the time to set up Kafka metrics
      • Security
  53. Security
      • Out-of-the-box security is pretty weak
      • Currently must rely on network security
      • Upcoming improvements add:
        • Authentication
        • Authorization
        • SSL
  54. Recap
      • Fraud prevention is very difficult due to response time requirements.
      • Disruptions in downstream systems can impact actual processing.
      • Integrating data across multiple systems increases complexity.
      • Other systems want / need the data.
      • Tracking data and metrics is difficult with different protocols.
  55. Thank you.
