
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Acorns"

Within fintech catching fraudsters is one of the primary opportunities for us to use streaming applications to apply ML models in real-time. This talk will be a review of our journey to bring fraud decisioning to our tellers at Capital One using Kafka, Flink and AWS Lambda. We will share our learnings and experiences to common problems such as custom windowing, breaking down a monolith app to small queryable state apps, feature engineering with Jython, dealing with back pressure from combining two disparate streams, model/feature validation in a regulatory environment, and running Flink jobs on Kubernetes.

Transcript

  1. FINDING BAD ACORNS / ANDREW GAO & JEFF SHARPE / FLINK FORWARD 2018
  2. ANDREW GAO / JEFF SHARPE
  3. Developing a Fraud Defense Platform / Fraud Defense at the Teller Using Flink: our journey to build a fraud decisioning platform and build out the use cases with Flink
  4. DEVELOPING A FRAUD DEFENSE PLATFORM
  5. OUR USERS: Fraud Operator / Customer / Data Scientist / Data Analyst / Engineer / Product Owner
  6. OUR USERS: Fraud Operator / Customer / Data Scientist / Data Analyst / Engineer / Product Owner
  7. ARCHITECTURE: DATA, ACTIONS, MAGIC!
  8. RUNNING ON
  9. RUNNING ON
  10. PROS: community support for Docker/Kubernetes; resilient; easy to tear down and bring back; maximizes resource efficiency. CONS: maintaining your own Kubernetes solution; containing the blast radius; edge cases when combining many technologies. Developing on Kubernetes has been challenging but very rewarding.
  11. FRAUD DEFENSE AT THE TELLER
  12. A FLINK MONOLITH. Problem: develop a stream-processing workflow for two legacy batch data sources. First attempt: do everything in Flink and take advantage of Flink connected streams.
  13. Using Flink operators to build our application workflow (diagram, steps 1-4)
  14. PROS: cheap; not a lot of code/config; scalability and availability; deployments are a breeze. CONS: not truly stateless; start-up time. AWS Lambda is a good fit for our use case and works well with our underlying technologies. (A Lambda invocation sketch follows the transcript.)
  15. Using Flink operators to build our application workflow (diagram, steps 1-4)
  16. CUSTOM WINDOWS FOR OPTIMIZATION AND PORTABILITY: a 90-day storage window exposing a 30-day virtual view and a 90-day filtered view. (A window sketch follows the transcript.)
  17. CUSTOM WINDOWS FOR OPTIMIZATION AND PORTABILITY: a most-recent-beyond-24-hours window and a 24-hour offset dynamic window
  18. Using Flink operators to build our application workflow (diagram, steps 1-4)
  19. USING JYTHON TO BRIDGE THE GAP TO DATA SCIENTISTS: a Flink Jython adapter loads .py files that turn windowed data into features
  20. GITFLOW AND JYTHON IMPROVE TRACEABILITY: commit, JUnit tests, pull request (denied/failed changes loop back), merge to develop, build a versioned feature JAR (e.g. v1.0.42), Maven import, JUnit tests, build the Flink job JAR
  21. Using Flink operators to build our application workflow (diagram, steps 1-4)
  22. FEATURES EXIST TO FEED MODELS: features feed models, models emit a score; the scoring backend can be H2O, TensorFlow, Seldon (whatever)
  23. BREAKING UP THE MONOLITH. Problem: back pressure leading to delayed transactions. Solution: break up the monolith Flink app into small queryable state apps.
  24. CHIPMUNKS
  25. Features used: connected streams, Flink keyed state, checkpointing/savepointing, queryable state. Issues: Flink versioning (FLINK-7783, FLINK-8487), keyed source function, Kafka offsets. We had a lot of fun and success using Flink, but not without a few hiccups.
  26. Developing a Fraud Defense Platform / Fraud Defense at the Teller Using Flink: our journey to build a fraud decisioning platform and build out the use cases with Flink. QUESTIONS?
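
Slides 16 and 17 name the custom windows but show no code. As a minimal illustrative sketch of the first idea in Java, not the speakers' implementation: one keyed 90-day store from which a 30-day virtual view is computed on the fly. The Tuple2 input shape (accountId, amount), the spend-sum feature, and the class name are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keeps 90 days of (timestamp -> amount) per key, but computes the
// emitted feature over a 30-day "virtual view" of that same store.
// Assumes event-time timestamps are attached to the stream.
public class NinetyDayStoreThirtyDayView
        extends KeyedProcessFunction<String, Tuple2<String, Double>, Double> {

    private static final long DAY_MS = 24L * 60 * 60 * 1000;
    private transient MapState<Long, Double> history;

    @Override
    public void open(Configuration conf) throws Exception {
        history = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("history", Long.class, Double.class));
    }

    @Override
    public void processElement(Tuple2<String, Double> txn, Context ctx,
                               Collector<Double> out) throws Exception {
        long now = ctx.timestamp();
        history.put(now, txn.f1);

        double sum30 = 0.0;
        List<Long> expired = new ArrayList<>();
        for (Long ts : history.keys()) {
            if (now - ts > 90 * DAY_MS) {
                expired.add(ts);              // fell out of the 90-day store
            } else if (now - ts <= 30 * DAY_MS) {
                sum30 += history.get(ts);     // the 30-day virtual view
            }
        }
        for (Long ts : expired) {
            history.remove(ts);
        }
        out.collect(sum30);                   // e.g. a 30-day spend feature
    }
}
```

Because the 30-day view is derived rather than stored separately, other views (the 90-day filtered view, or slide 17's most-recent-beyond-24-hours lookup) can presumably share the same state, which would explain the "optimization and portability" framing.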
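
Slide 14 weighs AWS Lambda without showing the integration. One plausible wiring, a sketch rather than the talk's code, is Flink's async I/O calling Lambda so that invocation latency does not stall the stream; the function name fraud-model-scorer and the JSON payload shape are invented here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.InvokeRequest;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Scores a JSON feature vector by invoking a (hypothetical) Lambda
// function through Flink's async I/O, so a slow invocation or cold
// start holds up only its own in-flight request, not the pipeline.
public class LambdaScorer extends RichAsyncFunction<String, String> {

    private transient AWSLambda lambda;

    @Override
    public void open(Configuration conf) {
        lambda = AWSLambdaClientBuilder.defaultClient();
    }

    @Override
    public void asyncInvoke(String featuresJson, ResultFuture<String> result) {
        InvokeRequest req = new InvokeRequest()
                .withFunctionName("fraud-model-scorer")   // invented name
                .withPayload(featuresJson);
        // A dedicated executor would be better than the common pool here.
        CompletableFuture
                .supplyAsync(() -> lambda.invoke(req))
                .thenAccept(res -> result.complete(Collections.singleton(
                        StandardCharsets.UTF_8.decode(res.getPayload())
                                              .toString())));
    }
}

// Usage sketch, capped at 100 in-flight calls with a 2-second timeout:
// AsyncDataStream.unorderedWait(featureJson, new LambdaScorer(),
//                               2, TimeUnit.SECONDS, 100);
```

Capping in-flight requests (the last argument) is also one lever against the back-pressure problems slide 23 describes.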

Editor's Notes

  • Jeff intro
    Andrew intro
    We are part of the Forest teams (very high-level intro)
    A Kubernetes-based fraud decisioning platform on which you can deploy multiple fraud use cases
    With the goal of being able to rapidly spin up fraud apps
    Running in production since September 2017
  • Our talk today:
    Talk briefly about our journey building out this Forest platform on Kubernetes, and about how we used Flink with Kubernetes at a high level
    Then talk about a specific use case we have on the platform and do a deep dive on what’s inside our Flink app
  • Customers First
    If one day you take a look at your bank account and it's empty, you would expect us to have caught the fraud
    However, if your account was locked for no reason you would be upset
    This balance between catching and stopping fraud and providing a great customer experience is a tension we constantly have to deal with
    If we wanted to stop fraud completely, we could just stop letting people take their money
    On a similar note, we have a limited number of fraud operators
    We do not have the manpower to call every single person up and ask them
    The primary directive of the platform is to empower data scientists and data analysts by building the tools on the platform to help create the models needed to make decisions
    This includes having access to all the data in a fast and easy-to-understand format
    Seeing how their models are performing, and whether the features are being calculated as expected
    When they need to refit the model, they need to be able to do the data transformations quickly so we can turn a refreshed model around
    Lastly, as we are developing a fraud platform, we need to keep in mind the engineers/developers that will be developing the fraud apps
    It should be something that engineers enjoy developing on
    When you have a feature/model/action repository, it's very easy to develop and turn around fraud apps
    To help us balance these different needs, we have our product owners to help bridge the gap
  • 14 EC2 instances
    6 m4.10xlarge for general minions
    5 m4.2xlarge for kafka nodes
    3 m4.large for masters
    Ansible to provision
    200+ pods
    Flink apps in Java/Scala/Kotlin
    Microservices in Golang

  • Holy smokes that’s a lot
    Zookeeper/Kafka/Flink/Nifi
    Kappa Architecture
    Kafka is our primary messaging bus throughout the platform
    NiFi is one of the tools we use to grab data from different sources in the company
    Flink does the calculations and applies the needed transformations
    Minio/Istio to handle HTTP communications throughout the platform
    EFK = Elasticsearch / Fluentd / Kibana
    Docker logs
    Managed AWS service
    Influx / Prometheus / Grafana
    Metrics reporting and Dashboards
    Platform health
    Fraud health
    Drill / Zeppelin / S3 for data analysts to view transactions

    Why are we switching from Influx to Prometheus?
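
Since the note above names Kafka as the primary messaging bus in a Kappa-style setup, here is a minimal sketch of the usual Flink-side wiring; the broker address, group id, and topic name are assumptions, and FlinkKafkaConsumer011 is the Kafka 0.11 connector current around the time of this talk.

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

// Minimal Kafka -> Flink wiring of the kind a Kappa-style platform
// builds on: every app reads its input as a stream off the bus.
public class KafkaIngest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");  // assumption
        props.setProperty("group.id", "fraud-decisioning");    // assumption

        DataStream<String> txns = env.addSource(
                new FlinkKafkaConsumer011<>("transactions",    // assumed topic
                        new SimpleStringSchema(), props));

        txns.print();  // stand-in for the real transformations
        env.execute("kafka-ingest");
    }
}
```
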
  • Kubernetes has been a challenge
    If a task manager goes down, it will auto-heal
    If your configurations are set up correctly you can just delete pods and they’ll come back
    Unless your configurations are completely fleshed out, the blast radius on failure can be rippling
    We hit a situation where Docker logs could not make it out to the Kubernetes logs because the Docker machines were dying
    We developed an internal tool for CI/CD and deployment
  • Use cases tell us the resources they need, and we provision them a Flink cluster
    1 Job Manager per cluster
    5 Task Managers per cluster
    RocksDB backend
    Checkpoint/Savepoint persist on S3
    Job Deployment Options
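
The backend and checkpoint settings listed above can be expressed per job as well as in flink-conf.yaml; a sketch with an assumed S3 bucket and checkpoint interval:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Per-job equivalent of the cluster settings above: RocksDB keyed
// state with incremental checkpoints persisted to S3.
public class CheckpointConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.setStateBackend(new RocksDBStateBackend(
                "s3://fraud-flink/checkpoints", true));  // assumed bucket
        env.enableCheckpointing(60_000);  // checkpoint every minute

        // ... build and execute the job here ...
    }
}
```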

  • Considerations
    People obviously don’t want to wait too long
    But we want to respond with the most data we have available on the customer
  • Two data streams need to share state
    Data stream from online interactions / all other customer interactions
    Data stream that we receive from the branch

    Need to calculate Features
    Need to apply ML model
    Need to respond in real-time
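
The shape described above, two keyed streams sharing per-customer state, is what Flink's connected streams provide (slides 12 and 13). A hedged sketch with invented tuple types and trivial decision logic, not the speakers' implementation:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Both inputs are hypothetical (customerId, payload) tuples: stream 1
// carries general account activity, stream 2 carries teller events.
// Keyed state written by one stream is visible when the other fires,
// which is how the two legacy sources share a customer profile.
public class TellerDecisioner extends
        CoProcessFunction<Tuple2<String, String>, Tuple2<String, String>, String> {

    private transient ValueState<String> profile;

    @Override
    public void open(Configuration conf) {
        profile = getRuntimeContext().getState(
                new ValueStateDescriptor<>("profile", String.class));
    }

    // Account activity updates the shared per-customer profile.
    @Override
    public void processElement1(Tuple2<String, String> activity, Context ctx,
                                Collector<String> out) throws Exception {
        profile.update(activity.f1);
    }

    // A teller transaction is decisioned against whatever profile
    // has accumulated so far.
    @Override
    public void processElement2(Tuple2<String, String> teller, Context ctx,
                                Collector<String> out) throws Exception {
        String p = profile.value();
        out.collect(p == null ? "NO_PROFILE:" + teller.f0
                              : "SCORED:" + teller.f0);
    }
}

// Wiring sketch: activity.keyBy(t -> t.f0)
//        .connect(teller.keyBy(t -> t.f0))
//        .process(new TellerDecisioner());
```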

  • Developed in Python, evaluating Golang
    We developed an internal tool for CI/CD and deployment
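
To make the "Flink Jython adapter" from slide 19 concrete, here is a minimal sketch of evaluating a Python feature function inside a Flink operator via Jython's embedding API; the inline script and function name are illustrative stand-ins for the .py files shipped with the job.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.python.core.PyFloat;
import org.python.core.PyObject;
import org.python.util.PythonInterpreter;

// Evaluates a data scientist's Python feature function inside the
// Flink JVM via Jython, so .py files can ship alongside the job JAR.
public class JythonFeature extends RichMapFunction<Double, Double> {

    private transient PyObject feature;

    @Override
    public void open(Configuration conf) {
        PythonInterpreter interp = new PythonInterpreter();
        // In practice this would be loaded from a .py resource in the JAR.
        interp.exec("def velocity_feature(amount):\n"
                  + "    return amount * 2.0\n");   // illustrative feature
        feature = interp.get("velocity_feature");
    }

    @Override
    public Double map(Double amount) {
        PyObject result = feature.__call__(new PyFloat(amount));
        return result.asDouble();
    }
}
```
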
  • Teller transactions have a real-time SLA
    Connected Streams is the culprit

    Break Up One Flink App into Smaller Flink Queryable State Apps
    Flink Apps as Functions


    Disparate Data Streams: Back Pressure
    In our case, we have all the account-level activity for a given customer from one source, and the data from the teller machine on the other
    Not all transactions are equal, due to their source. However, in an ML world we still want to examine every transaction
    This results in back pressure and uneven transaction flow
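
"Small Flink queryable state apps" and "Flink apps as functions" rest on Flink's queryable state API. A sketch of the client side, with the state name, host, and key invented for illustration; inside the job, the matching descriptor would be registered with desc.setQueryable("risk-score").

```java
import java.util.concurrent.CompletableFuture;
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.queryablestate.client.QueryableStateClient;

// Client side of a "Flink app as a function": another service asks a
// small Flink app for the latest per-customer value it has computed.
public class RiskScoreLookup {
    public static void main(String[] args) throws Exception {
        // The same descriptor the job registered as queryable state.
        ValueStateDescriptor<Double> desc =
                new ValueStateDescriptor<>("riskScore", Double.class);

        QueryableStateClient client =
                new QueryableStateClient("taskmanager-host", 9069); // assumed

        CompletableFuture<ValueState<Double>> future = client.getKvState(
                JobID.fromHexString(args[0]),  // job id of the state app
                "risk-score",                  // queryable state name
                "customer-42",                 // assumed key
                BasicTypeInfo.STRING_TYPE_INFO,
                desc);

        System.out.println("score = " + future.get().value());
    }
}
```
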
  • An Alvin for each data source
    A scurry of Alvins builds out our feature repository
    Theodore builds his own features, adds on features from Alvin, and then passes it down
    Why did we break Simon out?
    We can replace it with anything such as Seldon
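
The point that "Simon" can be swapped for Seldon (or anything else) implies the scoring stage sits behind a narrow seam. A hypothetical illustration of that seam, not the team's actual contract:

```java
import java.util.Map;

// A narrow scoring seam: the feature apps (the "Alvins" and
// "Theodore") produce a feature map, and any backend (a Lambda,
// Seldon, H2O, TensorFlow Serving) can implement the scorer.
public interface ModelScorer {
    double score(Map<String, Double> features);
}
```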

  • https://issues.apache.org/jira/browse/FLINK-7783
    https://issues.apache.org/jira/browse/FLINK-8487
