Streaming Analytics for Financial Enterprises

Streaming Analytics (or Fast Data processing) is becoming an increasingly popular subject in the financial sector. There are two main reasons for this development. First, more and more data has to be analyzed in real time to prevent fraud; all transactions processed by banks have to pass an ever-growing number of tests to make sure that the money is coming from and going to legitimate sources. Second, customers want frictionless mobile experiences while managing their money, such as immediate notifications and personalized advice based on their online behavior and other users’ actions.

A typical streaming analytics solution follows a ‘pipes and filters’ pattern that consists of three main steps: detecting patterns in raw event data (Complex Event Processing), evaluating the outcomes with the aid of business rules and machine learning algorithms, and deciding on the next action. At the core of this architecture is the execution of predictive models that operate on enormous, never-ending data streams.

In this talk, I’ll present an architecture for streaming analytics solutions that covers many use cases that follow this pattern: actionable insights, fraud detection, log parsing, traffic analysis, factory data, the IoT, and others. I’ll go through a few architectural challenges that arise when dealing with streaming data, such as latency issues, event time vs. server time, and exactly-once processing. The solution is built on the KISSS stack (Kafka, Ignite, and Spark Structured Streaming) and is open source and available on GitHub.

Streaming Analytics for Financial Enterprises

  1. STREAMING ANALYTICS FOR FINANCIAL ENTERPRISES Bas Geerdink | October 16, 2019 | Spark + AI Summit
  2. WHO AM I? { "name": "Bas Geerdink", "role": "Technology Lead", "background": ["Artificial Intelligence", "Informatics"], "mixins": ["Software engineering", "Architecture", "Management", "Innovation"], "twitter": "@bgeerdink", "linked_in": "bgeerdink" }
  3. AGENDA 1. Fast Data in Finance 2. Architecture and Technology 3. Deep dive: Event Time, Windows, and Watermarks; Model Scoring 4. Wrap-up
  4. BIG DATA: Volume, Variety, Velocity
  5. FAST DATA USE CASES
     Sector             | Data source             | Pattern                         | Notification
     Finance            | Payment data            | Fraud detection                 | Block money transfer
     Finance            | Clicks and page visits  | Trend analysis                  | Actionable insights
     Insurance          | Page visits             | Customer is stuck in a web form | Chat window
     Healthcare         | Patient data            | Heart failure                   | Alert doctor
     Traffic            | Cars passing            | Traffic jam                     | Update route info
     Internet of Things | Machine logs            | System failure                  | Alert to sys admin
  6. FAST DATA PATTERN The common pattern in all these scenarios: 1. Detect a pattern by combining data (CEP) 2. Determine relevancy (ML) 3. Produce a follow-up action
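     A minimal sketch of this three-step pattern as plain Scala functions; the types, field names, and thresholds below are illustrative assumptions, not code from the talk:

       // Hypothetical event types, for illustration only
       case class Transaction(customerId: String, amount: Double, timestamp: Long)
       case class BusinessEvent(customerId: String, transactionCount: Long)
       case class Notification(customerId: String, message: String)

       // 1. Detect a pattern by combining raw events (the CEP step)
       def detect(window: Seq[Transaction]): Option[BusinessEvent] =
         if (window.size > 10) Some(BusinessEvent(window.head.customerId, window.size)) else None

       // 2. Determine relevancy, e.g. with a machine learning model score (the ML step)
       def isRelevant(event: BusinessEvent, score: BusinessEvent => Double): Boolean =
         score(event) > 0.8 // assumed relevancy threshold

       // 3. Produce a follow-up action, e.g. a notification to the customer
       def act(event: BusinessEvent): Notification =
         Notification(event.customerId, s"Unusual number of transactions: ${event.transactionCount}")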
  7. ARCHITECTURE
  8. THE SOFTWARE STACK Data stream storage: Kafka | Persisting cache, rules, models, and config: Cassandra or Ignite | Stream processing: Spark Structured Streaming | Model scoring: PMML and Openscoring.io
  9. APACHE SPARK LIBRARIES
  10. STREAMING ARCHITECTURE
  11. DEEP DIVE PART 1
  12. SPARK-KAFKA INTEGRATION A Fast Data application is a running job that processes events in a data store (Kafka). Jobs can be deployed as ever-running pieces of software in a big data cluster (Spark).
  13. SPARK-KAFKA INTEGRATION A Fast Data application is a running job that processes events in a data store (Kafka). Jobs can be deployed as ever-running pieces of software in a big data cluster (Spark). The basic pattern of a job is: connect to the stream and consume events; group and gather events (windowing); perform analysis (aggregation) on each window; write the result to another stream (sink).
  14. PARALLELISM To get high throughput, we have to process the events in parallel. Parallelism can be configured on cluster level (YARN) and on job level (number of worker threads).

     val conf = new SparkConf()
       .setMaster("local[8]")
       .setAppName("FraudNumberOfTransactions")

     ./bin/spark-submit --name "LowMoneyAlert" --master local[4] \
       --conf "spark.dynamicAllocation.enabled=true" \
       --conf "spark.dynamicAllocation.maxExecutors=2" styx.jar
  15. HELLO SPEED!

     // connect to Spark
     val spark = SparkSession
       .builder
       .config(conf)
       .getOrCreate()

     // for using DataFrames
     import spark.sqlContext.implicits._

     // get the data from Kafka: subscribe to topic
     val df = spark
       .readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "transactions")
       .option("startingOffsets", "latest")
       .load()
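     The value column delivered by the Kafka source is binary; a minimal follow-up sketch (assuming the transactions arrive as JSON with a hypothetical schema, not the actual Styx payload) of turning it into typed columns before windowing:

       import org.apache.spark.sql.functions.from_json
       import org.apache.spark.sql.types._

       // Assumed transaction schema; the real payload may differ
       val transactionSchema = new StructType()
         .add("t_id", StringType)
         .add("customer_id", StringType)
         .add("amount", DoubleType)
         .add("transaction_time", TimestampType)

       // Kafka provides key/value as binary, so cast the value and parse the JSON payload
       val transactionStream = df
         .selectExpr("CAST(value AS STRING) AS json")
         .select(from_json($"json", transactionSchema).as("t"))
         .select("t.*")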
  16. EVENT TIME Events occur at a certain time ... and are processed later
  17. EVENT TIME Events occur at a certain time ⇛ event time ... and are processed later ⇛ processing time
  18. EVENT TIME Events occur at a certain time ⇛ event time ... and are processed later ⇛ processing time
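     As an illustration (not from the slides), the distinction is visible directly in the stream: event time is a field carried in the payload, while processing time is observed when Spark handles the record:

       import org.apache.spark.sql.functions.current_timestamp

       // transaction_time comes from the payload itself: the event time
       // current_timestamp() is evaluated while the record is processed: the processing time
       val withTimes = transactionStream
         .withColumn("processing_time", current_timestamp())
         .select($"t_id", $"transaction_time", $"processing_time")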
  19. OUT-OF-ORDERNESS
  20. WINDOWS In processing infinite streams, we usually look at a time window. A window can be considered a bucket of time.
  21. WINDOWS In processing infinite streams, we usually look at a time window. A window can be considered a bucket of time. There are different types of windows: sliding window, tumbling window, session window.
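     A sketch of how the first two window types look in Spark Structured Streaming, reusing the hypothetical transactionStream from above: a tumbling window only has a size, while a sliding window also has a slide interval. Session windows (gaps of inactivity) have no built-in helper in the Spark version shown here and are usually built with custom state (e.g. flatMapGroupsWithState).

       import org.apache.spark.sql.functions.window

       // Tumbling window: consecutive, non-overlapping buckets of 15 minutes
       val tumbling = transactionStream
         .groupBy(window($"transaction_time", "15 minutes"), $"customer_id")
         .count()

       // Sliding window: 1-day buckets that start every 15 minutes, so they overlap
       val sliding = transactionStream
         .groupBy(window($"transaction_time", "1 day", "15 minutes"), $"customer_id")
         .count()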
  22. WINDOWS
  23. WINDOW CONSIDERATIONS Size: large windows lead to big state and long calculations. Number: many windows (e.g. sliding, session) lead to more calculations. Evaluation: do all calculations within one window, or keep a cache across multiple windows (e.g. when comparing windows, like in trend analysis). Timing: events for a window can appear early or late.
  24. WINDOWS Example: sliding window of 1 day, evaluated every 15 minutes, over the field 'customer_id'. The event time is stored in the field 'transaction_time'.

     // aggregate, produces a sql.DataFrame
     val windowedTransactions = transactionStream
       .groupBy(
         window($"transaction_time", "1 day", "15 minutes"),
         $"customer_id")
       .agg(count("t_id") as "count", $"customer_id", $"window.end")
  25. WATERMARKS Watermarks are timestamps that trigger the computation of the window. They are generated at a time that allows a bit of slack for late events.
  26. WATERMARKS Watermarks are timestamps that trigger the computation of the window. They are generated at a time that allows a bit of slack for late events. Any event that reaches the processor later than the watermark, but with an event time that should belong to the former window, is ignored.
  27. EVENT TIME AND WATERMARKS Example: sliding window of 60 seconds, evaluated every 30 seconds. The watermark is set at 1 second, giving all events some time to arrive.

     val windowedTransactions = transactionStream
       // the watermark is defined on the same event-time column that the window uses
       .withWatermark("transaction_time", "1 second")
       .groupBy(
         window($"transaction_time", "60 seconds", "30 seconds"),
         $"customer_id")
       .agg(...) // e.g. count/sum/...
  28. FAULT-TOLERANCE AND CHECKPOINTING Data is in one of three stages: unprocessed, in transit, or processed.
  29. FAULT-TOLERANCE AND CHECKPOINTING Data is in one of three stages: Unprocessed ⇛ Kafka consumers provide offsets that guarantee no data loss for unprocessed data. In transit ⇛ data can be preserved in a checkpoint, to reload and replay it after a crash. Processed ⇛ Kafka provides an acknowledgement once data is written.
  30. SINK THE OUTPUT TO KAFKA

     businessEvents
       .selectExpr("to_json(struct(*)) AS value") // the Kafka sink expects a 'value' column
       .writeStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("topic", "business_events")
       .option("checkpointLocation", "/hdfs/checkpoint")
       .start() // this triggers the start of the streaming query
  31. DEEP DIVE PART 2
  32. MODEL SCORING To determine the follow-up action of an aggregated business event (e.g. a pattern), we have to enrich the event with customer data. The resulting data object contains the characteristics (features) as input for a model.
  33. MODEL SCORING To determine the follow-up action of an aggregated business event (e.g. a pattern), we have to enrich the event with customer data. The resulting data object contains the characteristics (features) as input for a model. To get the features and score the model, efficiency plays a role again: direct database call > API call; in-memory model cache > model on disk.
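     A minimal sketch of the in-memory model cache idea; the PmmlModel trait and loadPmmlFromBytes are placeholders (assumptions) for whatever PMML evaluator library is used, not the talk's actual code:

       import scala.collection.concurrent.TrieMap

       // Placeholder for the real evaluator type (e.g. a JPMML evaluator)
       trait PmmlModel { def score(features: Map[String, Any]): Double }

       // Hypothetical loader: in practice this would unmarshal the PMML bytes with an evaluator library
       def loadPmmlFromBytes(bytes: Array[Byte]): PmmlModel = ???

       // Keep evaluated models in memory so that each event is scored with a local, in-process call
       object ModelCache {
         private val cache = TrieMap.empty[String, PmmlModel]
         def get(modelName: String, fetchBytes: String => Array[Byte]): PmmlModel =
           cache.getOrElseUpdate(modelName, loadPmmlFromBytes(fetchBytes(modelName)))
       }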
  34. PMML PMML is the glue between data science and data engineering. Data scientists can export their machine learning models to PMML (or PFA) format.

     import pandas
     from sklearn.linear_model import LogisticRegression
     from sklearn2pmml import sklearn2pmml
     from sklearn2pmml.pipeline import PMMLPipeline

     events_df = pandas.read_csv("events.csv")
     pipeline = PMMLPipeline(...)
     pipeline.fit(events_df, events_df["notifications"])
     sklearn2pmml(pipeline, "LogisticRegression.pmml", with_repr = True)
  35. PMML
  36. MODEL SCORING The models can be loaded into memory to enable split-second performance. By applying map functions over the events we can process/transform the data in the windows: 1. enrich each business event by getting more data, 2. filter events based on selection criteria (rules), 3. score a machine learning model on each event, 4. write the outcome to a new event / output stream.
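     A sketch of that sequence as Dataset transformations; the case classes, helper functions, and threshold are illustrative assumptions, not the actual Styx implementation:

       // Hypothetical types and helpers, to show the map/filter chain only
       case class BusinessEvent(customerId: String, transactionCount: Long)
       case class RichBusinessEvent(customerId: String, features: Map[String, Double])
       case class ScoredEvent(customerId: String, relevancy: Double)

       def enrich(e: BusinessEvent): RichBusinessEvent = ???   // 1. add customer data
       def passesRules(e: RichBusinessEvent): Boolean = ???    // 2. selection criteria (rules)
       def scoreModel(e: RichBusinessEvent): Double = ???      // 3. PMML model score

       // businessEventStream is an assumed Dataset[BusinessEvent]; encoders come from spark.implicits
       val notifications = businessEventStream
         .map(enrich _)
         .filter(passesRules _)
         .map(e => ScoredEvent(e.customerId, scoreModel(e)))
         .filter(_.relevancy > 0.5) // assumed relevancy threshold
       // 4. the result is then written to the output stream (see the Kafka sink slide)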
  37. OPENSCORING.IO

     def score(event: RichBusinessEvent, pmmlModel: PmmlModel): Double = {
       val arguments = new util.LinkedHashMap[FieldName, FieldValue]
       for (inputField: InputField <- pmmlModel.getInputFields.asScala) {
         val fieldName = inputField.getField.getName
         arguments.put(fieldName, inputField.prepare(event.customer.all(fieldName.getValue)))
       }
       // evaluate the model and return the relevancy score of the target field
       val results = pmmlModel.evaluate(arguments)
       pmmlModel.getTargetFields.asScala.headOption match {
         case Some(targetField) => results.get(targetField.getName).toString.toDouble
         case _ => throw new Exception("No valid target")
       }
     }
  38. ALTERNATIVE STACKS
     Message bus              | Streaming technology       | Database
     Kafka                    | Spark Structured Streaming | Ignite
     Kafka                    | Flink                      | Cassandra
     Azure Event Hubs         | Azure Stream Analytics     | Cosmos DB
     AWS Kinesis Data Streams | AWS Kinesis Data Firehose  | DynamoDB
  39. WRAP-UP Fraud detection and actionable insights are examples of streaming analytics use cases in financial services. The common pattern is: CEP → ML → Notification. Pick the right tools for the job; Kafka, Ignite, and Spark are amongst the best. Be aware of typical streaming data issues: late events, state management, windows, etc.
  40. THANKS! Please rate my session on the website or app :)
     Read more about streaming analytics at: The world beyond batch: Streaming 101 | Google Dataflow paper
     Source code and presentation are available at: https://github.com/streaming-analytics/Styx
