Event Detection Pipelines with Apache Kafka


  1. Event Detection Pipelines with Apache Kafka, Hadoop Summit, Brussels 2015, Jeff Holoman
  2. © Cloudera, Inc. All rights reserved. The “Is this talk interesting enough to sit through?” slide • How we got here • Why Kafka • Use Case • Challenges • Kafka in Context What I’m going to say: Buzzword Bingo! If I don’t say all of these I owe you a beverage: Kafka, Machine Learning, Real-time, Delivery Semantics, Spark Streaming, Hadoop, Storm, Durability Guarantees, Ingest Pipelines, Event Detection, Avro, JSON
  3. How we got here: Applications and RDBMSs • We wanted to do some stuff in Hadoop • Batch file transfer into Hadoop • Application reporting
  4. How we got here: Applications and RDBMSs • We wanted to do some stuff in Hadoop • Batch file transfer into Hadoop • Application reporting
  5. About Kafka • Publish/subscribe messaging system from LinkedIn • High throughput (100’s of k messages/sec) • Low latency (sub-second to low seconds) • Fault-tolerant (replicated and distributed) • Supports agnostic messaging • Standardizes format and delivery
  6. Why Kafka: Kafka decouples data pipelines • Source Systems → Producers → Broker → Consumers → Hadoop, Security Systems, Real-time monitoring, Data Warehouse
  7. Use Case: Fraud Detection in Consumer Banking
  8. Event Detection - Fraud • Offline • Model Building • Discovery • Forensics • Case Management • Pattern Analysis • Online
  9. Online, Mobile, ATM, POS → Integration
  10. Online, Mobile, ATM, POS → Integration → Event Processing
  11. Online, Mobile, ATM, POS → Integration → Event Processing → Repository → Reporting, Forensics, Analytics
  12. Online, Mobile, ATM, POS → Integration → Event Processing → Storage (HDFS, Solr) / Processing (Impala, Map/Reduce, Spark, 3rd Party: R, SAS etc.) ← Mainframe/RDBMS
  13. Online, Mobile, ATM, POS → Integration → Event Processing → Storage (HDFS, Solr) / Processing (Impala, Map/Reduce, Spark, 3rd Party: R, SAS etc.) • Rules / Models • Automated & Manual Analytical Adjustments and Pattern detection • Case Management ← Mainframe/RDBMS
  14. Event Detection - Fraud • Offline • Model Building • Discovery • Forensics • Case Management • Pattern Analysis • Online • Ingest • Enrichment (Profiles, feature selection, etc.) • Early warning / detection (model serving / model application) • Persistence
  15. Online, Mobile, ATM, POS → Integration → Event Processing (Rules / Models, Reference Data) → Repository → Case Management, Reporting, Forensics, Analytics, Alerting
  16. Event Detection: A Concrete Example
  17. (image slide)
  18. This is not a Data Science Talk. But let’s talk about it anyway
  19. Event Detection • Attempt to detect if an event of interest has occurred • Temporal or Spatial (or both) • High number of non-events creates challenges • Fraud Detection - semi-supervised ML • You want to optimize for accuracy but also balance the risk of false positives • Very important to monitor the model
  20. Generally • Learn a model for an expected signal value • Calculate a score based on the current event • Alert (or don’t) on that value • Simple right?
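The learn/score/alert loop on slide 20 can be sketched with a simple z-score threshold. This is an illustrative stand-in for the real models in the talk, not the production approach; the history values and threshold are invented:

```python
from statistics import mean, stdev

def alert_score(history, value, threshold=3.0):
    """Score a new event against the expected signal learned from history.

    Returns (score, alert): score is the z-score of `value` against the
    historical mean; alert is True when it exceeds `threshold` deviations.
    """
    mu = mean(history)
    sigma = stdev(history)
    score = abs(value - mu) / sigma if sigma > 0 else 0.0
    return score, score > threshold

# A typical card-spend profile: small purchases, then an outlier.
history = [20, 25, 22, 30, 18, 27, 24, 21]
score, alert = alert_score(history, 500)   # alert is True
```

As the slide says: simple, right? The hard parts, covered next, are doing this at volume within the latency budget while monitoring the model itself.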
  21. Some Numbers • No data loss is acceptable • Event processing must complete ASAP, <500 ms • Support approximately 400M transactions per day in aggregate • Highest Volume Flow: • Current – 2k transactions/s • Projected – 10k transactions/s • Each flow has at least three steps • Adapter, Persistence, Hadoop Persistence • Most complex has approximately seven steps
  22. Technology Stack
  23. Online, Mobile, ATM, POS → Spring Integration (JVMs) → Storage (HDFS, Solr) / Processing (Impala, Map/Reduce, Spark, 3rd Party: R, SAS etc.) • HBase via Java API • Flume via Avro RPC client (Netty) • Mainframe/DB2, Oracle via files / Sqoop • Web Applications via JDBC / REST • Java / PMML
  24. Flume: producing hosts (JVMs 1-N) → Agents 1-N on Prod Edge Nodes (File Channel) → Production Hadoop (Storage: HDFS; Processing: Impala, Map/Reduce, Spark) and DR Edge Nodes → DR Hadoop
  25. Challenges • Fraud prevention is very difficult due to response time requirements.
  26. Fraud Processing System: Prevention (high difficulty) shades into Detection (lower difficulty) along a response-time scale of ~50 ms, >500 ms, >30,000 ms, >90,000 ms
  27. Challenges • Fraud prevention is very difficult due to response time requirements. • Disruptions in downstream systems can impact actual processing. • Problems with HDFS, network, SAN, agents etc. • Integrating data across multiple systems increases complexity • Other systems want / need the data. • System has all of the transactions! Can be used for Customer Events, Analytics etc. • Tracking data and metrics is difficult with different protocols • We need to true up the transaction data with what ends up in HDFS
  28. Incoming Events → Event Processing (JVMs) → Kafka → Storage (HDFS, Solr) / Processing (Impala, MR, Spark, 3rd Party) • HBase • Model Serving • Model Building • Repository • Outgoing Events: Txn In, Txn Updates, All Events, Txn Out, Alerts, Case / Alert Management, Other
  29. Kafka: a Kafka Cluster (Brokers 1-N) between producing hosts (JVMs 1-N) and Agents 1-N on Prod Edge Nodes (File Channel) → Production Hadoop and DR Hadoop (Storage: HDFS; Processing: Impala, Map/Reduce, Spark)
  30. Kafka - Considerations • Data Exchange
  31. Data Exchange in Distributed Architectures • Multiple systems interacting together benefit from a common data exchange format. • Choosing the correct standard can significantly impact application design and TCO • Client serialize → Common Data Format → deserialize Client
  32. Goals • Simple • Flexible • Efficient • Change Tolerant • Interoperable • As systems become more complex, data endpoints need to be decoupled
  33. He means traffic lights
  34. Use Avro • A data serialization system • Data always* accompanied by a schema • Provides • A compact, fast, binary data format • A container file to store persistent data • Remote Procedure Call (RPC) • Simple integration with dynamic languages • Schema Evolution • Similar to Thrift or Protocol Buffers but differs by • Dynamic typing • Untagged data • No manually-assigned field IDs: • When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
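The last bullet, resolving schema differences by field name rather than numeric tag, is the key contrast with Thrift and Protocol Buffers. A toy sketch of that resolution rule (this is not the avro library itself, and the `Txn` schema is invented for illustration):

```python
def resolve_by_name(record, reader_schema):
    """Illustrative Avro-style schema resolution: fields are matched by
    name (no numeric tags), and fields absent from the writer's data
    fall back to the reader schema's default."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

# An old producer wrote v1 records; a v2 reader adds `channel` with a default.
v2 = {"type": "record", "name": "Txn", "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "channel", "type": "string", "default": "unknown"},
]}
old_record = {"id": "t-1", "amount": 42.0}
resolved = resolve_by_name(old_record, v2)
# → {'id': 't-1', 'amount': 42.0, 'channel': 'unknown'}
```

Because both schemas are present at read time, old producers and new consumers (and vice versa) can coexist on the same topic, which is exactly what a decoupled pipeline needs.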
  35. Schema Registry • Use a Schema Registry / Repository • There are open-source options out there • Exposes a REST interface • Backend storage can be just about anything • Can be heavily customized for your environment
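The core of any such registry is small: versioned schemas keyed by subject. A minimal in-memory sketch of the contract (real registries expose these same operations over the REST interface mentioned above; all names here are illustrative):

```python
class SchemaRegistry:
    """Minimal in-memory sketch of a schema registry: register new schema
    versions under a subject, fetch a specific or the latest version."""

    def __init__(self):
        self._schemas = {}   # (subject, version) -> schema
        self._latest = {}    # subject -> latest version number

    def register(self, subject, schema):
        """Store a new version of `subject`'s schema; return its version."""
        version = self._latest.get(subject, 0) + 1
        self._schemas[(subject, version)] = schema
        self._latest[subject] = version
        return version

    def get(self, subject, version=None):
        """Fetch a schema; None means the latest registered version."""
        if version is None:
            version = self._latest[subject]
        return self._schemas[(subject, version)]

registry = SchemaRegistry()
registry.register("txn-value", {"name": "Txn", "fields": ["id", "amount"]})
latest = registry.get("txn-value")
```

Producers then ship a small schema ID with each message instead of the full schema, and consumers resolve it through the registry.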
  36. Deploying Kafka • Data Exchange • Provide Common Libraries • There are a number of Kafka clients out there… standardize and develop a producer / consumer library that is consistent so developers aren’t reinventing the wheel
  37. Deploying Kafka • Data Exchange • Provide Common Libraries • Understand Durability Guarantees and Delivery Semantics
  38. Durable Writes • Producers can choose to trade throughput for durability of writes (request.required.acks): • acks = -1: ACK after all ISRs have received the write; highest durability, highest per-event latency • acks = 1: ACK once the leader has received the write; medium durability and latency • acks = 0: no ACKs required; lowest durability, lowest latency • A sane configuration: replication = 3, min.insync.replicas = 2, request.required.acks = -1
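The "sane configuration" translates directly into broker/topic and producer properties. A sketch using the 2015-era (Scala producer) property names shown on the slide:

```properties
# Broker/topic side: keep three copies of each partition, and fail
# acks=-1 writes when fewer than two replicas are in sync.
default.replication.factor=3
min.insync.replicas=2

# Producer side (old Scala producer): -1 waits for all in-sync replicas.
request.required.acks=-1
```

With this setup a write is only acknowledged once it survives the loss of any single broker, which is what "no data loss is acceptable" demands.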
  39. Producer Performance – Single Thread • No Replication: 1,100,182 records/s, 104 MB/s; latency avg 42 ms, max 1070 ms, median 1 ms, 95th %tile 362 ms • 3x Async: 1,056,546 records/s, 101 MB/s; latency avg 42 ms, max 1157 ms, median 2 ms, 95th %tile 323 ms • 3x Sync: 493,855 records/s, 47 MB/s; latency avg 379 ms, max 4483 ms, median 192 ms, 95th %tile 1692 ms
  40. Delivery Semantics • At least once: messages are never lost but may be redelivered • At most once: messages may be lost but are never redelivered • Exactly once: messages are delivered once and only once - much harder (impossible??)
  41. Getting Exactly Once Semantics • Must consider two components • Durability guarantees when publishing a message • Durability guarantees when consuming a message • Producer • What happens when a produce request was sent but a network error returned before an ack? • Use a single writer per partition and check the latest committed value after network errors • Consumer • Include a unique ID (e.g. UUID) and de-duplicate. • Consider storing offsets with data
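The consumer-side bullet, a unique ID plus de-duplication, can be sketched in a few lines. This is an illustrative handler, not a real Kafka consumer; the message shape is invented:

```python
import uuid

class DedupingHandler:
    """At-least-once delivery made effectively exactly-once: each message
    carries a producer-assigned UUID, and redeliveries are dropped."""

    def __init__(self):
        self.seen = set()       # IDs already processed
        self.processed = []     # payloads actually handled

    def handle(self, message):
        """Process a message; return True if handled, False if duplicate."""
        if message["id"] in self.seen:
            return False        # redelivery of an already-handled message
        self.seen.add(message["id"])
        self.processed.append(message["payload"])
        return True

msg = {"id": str(uuid.uuid4()), "payload": "txn-1"}
handler = DedupingHandler()
handler.handle(msg)   # True: first delivery is processed
handler.handle(msg)   # False: redelivery is dropped
```

In production the seen-ID set must be bounded (a time or count window) and persisted alongside the offsets, per the "store offsets with data" bullet, so a consumer restart does not forget what it already processed.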
  42. Deploying Kafka • Data Exchange • Provide Common Libraries • Understand Durability Guarantees and Delivery Semantics • Build in auditing from the start • We can use Kafka in-stream to save some reporting and analytics later • This will increase your development time but pay off in the long run
  43. Auditing and Tracking • Embed timings in the message itself, e.g.: { "name": "timings", "type": [ "null", { "type": "map", "values": "long" } ], "default": null } • Adopt LinkedIn-style Auditing
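With a `timings` map in the message, each hop stamps its own entry as the event passes through. A minimal sketch (the stage names are made up; in the real pipeline they would be the adapter/persistence steps described earlier):

```python
import time

def stamp(message, stage):
    """Record when `stage` saw the message, inside the message's own
    `timings` map (mirroring the Avro field above), in epoch millis."""
    message.setdefault("timings", {})[stage] = int(time.time() * 1000)
    return message

msg = {"payload": "txn-1"}
stamp(msg, "adapter")
stamp(msg, "event-processing")

# Per-hop latency falls straight out of the embedded timings.
lag_ms = msg["timings"]["event-processing"] - msg["timings"]["adapter"]
```

Because the timings travel with the message, the true-up against what lands in HDFS (one of the challenges above) becomes a query over the data itself rather than a cross-system log correlation.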
  44. Deploying Kafka • Data Exchange • Provide Common Libraries • Understand Durability Guarantees and Delivery Semantics • Build in auditing from the start • Use Flume for easy ingest into HDFS / Solr
  45. Flume (Flafka) • Source • Sink • Channel
  46. Flafka: Data Sources (logs, JMS, web servers etc.) → Producer → Kafka → Flume Agent (Sources, Interceptors, Selectors, Channels, Sinks) → HDFS / Kafka
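A minimal Flafka wiring for the Kafka-to-HDFS leg of that diagram, sketched against the Flume 1.6-era Kafka source (the agent name, topic, ZooKeeper host, and HDFS path are placeholders):

```properties
# Kafka -> Flume -> HDFS: read the events topic, land files in HDFS.
tier1.sources  = kafkaSource
tier1.channels = ch1
tier1.sinks    = hdfsSink

tier1.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.kafkaSource.zookeeperConnect = zk1:2181
tier1.sources.kafkaSource.topic = events
tier1.sources.kafkaSource.channels = ch1

tier1.channels.ch1.type = memory

tier1.sinks.hdfsSink.type = hdfs
tier1.sinks.hdfsSink.hdfs.path = /data/events
tier1.sinks.hdfsSink.channel = ch1
```

This keeps the ingest path declarative: no custom consumer code is needed to get events from Kafka into HDFS or Solr.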
  47. Deploying Kafka • Data Exchange • Provide Common Libraries • Understand Durability Guarantees and Delivery Semantics • Build in auditing from the start • Use Flume for Easy Ingest to HDFS / Solr • Benchmark based on your message size
  48. Benchmark Results (chart)
  49. Benchmark Results (chart)
  50. Deploying Kafka • Data Exchange • Provide Common Libraries • Understand Durability Guarantees and Delivery Semantics • Build in auditing from the start • Benchmark based on your message size • Take the time to set up Kafka metrics
  51. Things like • Consumer Lag • Message in Rate • Bytes in Rate • Bytes out Rate • (you can publish your own as well)
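Consumer lag, the first metric on the list, is just the gap between the log end offset and the group's committed offset, per partition. A sketch with invented offsets (a real deployment would read these values from Kafka's JMX metrics or offset APIs):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition consumer lag: messages appended to the log that the
    consumer group has not yet committed."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag({0: 1500, 1: 900}, {0: 1450, 1: 900})
total = sum(lag.values())   # 50: partition 0 is 50 messages behind
```

A steadily growing total is the earliest warning that event processing is falling behind its <500 ms budget.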
  52. Deploying Kafka • Data Exchange • Provide Common Libraries • Understand Durability Guarantees and Delivery Semantics • Build in auditing from the start • Benchmark based on your message size • Take the time to set up Kafka metrics • Security
  53. Security • Out-of-the-box security is pretty weak • Currently must rely on network security • Upcoming improvements add: • Authentication • Authorization • SSL
  54. Recap • Fraud prevention is very difficult due to response time requirements. • Disruptions in downstream systems can impact actual processing. • Integrating data across multiple systems increases complexity • Other systems want / need the data • Tracking data and metrics is difficult with different protocols
  55. Thank you.

Editor's Notes

  • Good afternoon. Welcome to Event Detection Pipelines with Apache Kafka. Thank you for coming, and I hope the next 30 or so minutes will be informative and enjoyable. Like the other talks here this week in Brussels we have around 40 minutes, so I’m going to get through the content that we have here and then take some questions towards the end. So let’s get started.
  • Almost done with the preamble. Today we’re going to blah blah blah

  • So all of you here are interested in Hadoop and have either deployed it or are thinking about doing so.
    Most Hadoop use cases I know of started with doing batch ingest from some type of database, usually doing some ETL offloading. Then perhaps we even move things back to some other database for reporting.
    We of course realize that Hadoop is capable of integrating multiple data sources, so then we end up integrating with another system or application.

    And we realize that we can do some reporting directly from Hadoop as well.
    We might even build other applications that pull data from Hadoop.
    Soon we have a myriad of applications and upstream systems feeding into Hadoop.


  • But this original box that I drew is a little bit simplified. In reality these applications tend to be tied together. Particularly as organizations move towards services and micro-services, we have interdependencies with one another, and unless we are fairly disciplined, we likely have different ways that these applications talk to one another. If we believe, as I imagine most of us here in the audience today do, that data is extremely valuable, we want to make it easy to exchange data within our overall system and also be flexible and nimble in this process.

    Unfortunately, all too often, our application stack ends up looking something like this, where applications are coupled together tightly, and changes in one system can have drastic impact on other downstream systems. I tend to work with very large-scale enterprises; usually these applications are separated not just by technology, but by political or organizational barriers as well.

  • Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn. One of the engineers at LinkedIn has said, “if data is the lifeblood of the organization then Kafka is the circulatory system.”

    Kafka can handle 100’s of thousands of messages per second, if not more, with very low latency, sub-second in many cases. It is also fault-tolerant: it runs as a cluster of machines and messages are replicated across multiple machines.

    When I say agnostic messaging, I mean that producers of messages are not concerned with consumers of messages, and vice versa; there is no dependency on each other.
  • Producers

    Broker

    Consumers

    Importantly, it gives us a solid system on which to standardize our data exchange. As we’ll discuss, we use it as the foundation for moving data between our systems, which allows us to reuse code and design patterns across our systems.
  • Today we’ll talk about fraud detection. I have the most experience in this space, as I mentioned previously, as it relates to consumer banking, but the architecture here could easily be applied to other businesses. Whenever we need to build systems that take inputs of data in real time and efficiently ingest them into Hadoop, this will be applicable.
  • When building Fraud systems, you can broadly classify them into two categories, the offline aspect and the online aspect. Another way to think about this is that the offline system is Human or Operator Driven, and the online system is happening in an automated fashion, during the flow of the actual event.

    I’ll briefly cover the offline aspect to show the architecture of a fraud system and then we’ll get into the details of building the online system.

    Note this isn’t a contrived example, this type of system is in use today in large banks back in the United States
  • So we want to build a multi-channel fraud system. In this system we accept input from online transactions, mobile devices, ATMs, and credit and debit cards. Each of these has a different exchange format, and so we have an integration layer that is responsible for performing conversions on the data feeds into the appropriate formats for processing. More on this a bit later.
  • So the next stage in our system is the event processing. In this segment we take in incoming transactions, and based on the information we have, either from the transaction itself or other data in our systems we make a decision about the event as it comes in, and this is returned back to the source systems.
  • Every transaction then is persisted into a repository. The majority of the reporting that we do is really focused on a relatively short time window, however, we keep the data forever so that we can do forensics, discovery, and analytics on all of the transaction data
  • So in our case, the repository is Hadoop, and forgive me here as I’ve overlaid system components with functional boxes, but we store all of the transactions in HDFS and also build Solr indexes to allow faceted searching to assist in our forensics.
  • So the output of our system is really threefold.

    We generate alerts to send over to the case management system. “Fraud” is actually quite broad. A good portion of it is really handling suspected fraud; we send updates to the case management system, and they work through their investigations.
    The second is end-user access. Analysts run Hive queries and Impala queries and view the search GUI to look for patterns and see the incoming data as close to real time as possible, given the ingestion rates.
    And finally, we use our Hadoop cluster for two primary actions. First we generate rules to feed into a rules engine system to check during our event processing. Second, we use the system to build our ML models and fit them with the appropriate parameters. For this we use SAS, or perhaps R, or whatever data analysis tools we need. This brings us to the online system.
  • This might not be the place to put this slide in.
  • If only it were as easy as just dropping in Kafka and making all of our problems go away.
  • Replication -> ACKs wait on all of the min.insync.replicas; there is a timeout.

    The single digit
  • This is doable with an idempotent producer where the producer tracks committed messages within some configurable window
