Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal and Saurabh Dutta

This tech talk covers how we leveraged Spark Streaming and Spark machine learning models to build and operationalize real-time credit card approvals for a major bank. We cover the ML capabilities in Spark and what a typical ML pipeline looks like.

We are going to talk about the domain and the use case: how a major credit card provider is using Spark to calculate card eligibility in real time. We are also going to share the challenges faced by the current system and why Spark is a good fit for these kinds of problems.

We will then take a deep dive into the different tools that were used to design the solution and the architecture of the system. Here, we will also share how a Spark-based workflow was created to address various aspects such as reading from Kafka, parsing, data enrichment, model selection, model scoring, and rule execution to produce the recommended output.

Finally, we are going to talk about the key challenges, learnings, and recommendations when building such a system and taking it to production.



  1. Leveraging Spark ML for Real-Time Credit Card Approvals – Case study from a large financial institution. Anand Venugopal, Saurabh Dutta (Impetus – StreamAnalytix)
  2. Agenda • Use case background • Existing system challenges and new goals • Solution details and lessons learnt • Q&A
  3. Background – Customer • 50M+ credit cards • 50+ countries • ~4M per year
  4. Background – Use case • Acquire legitimate, responsible customers • Decision: Approve? Credit limit? APR? • Sub-second response time to make a decision
  5. Problem statement – Determining card eligibility: a prospective customer submits an application → logic execution → decision + communication
  6. Core logic: Risk scoring • Estimate debt-repayment risk • Limit the lender's credit risk • Ensure the individual's financial well-being
  7. Risk scoring factors • History • Credit usage • Loans • Other credit cards • Job type • Income band • Debt ratio • Credit scores
  8. Decision flow: Receive request → call CB REST API → get Acxiom data → get SOW data → CB derivations → SOW derivations → Decision → Line → Price → Respond. Models produce a risk score between 0 and 1.
  9. Multiple model types • Approve/Decline – by segment and geography • Line – credit limit • Price – APR • Decision trees (numerous instances) • Regression • K-means
  10. Decision tree – Approve? (Y/N). Example splits: salary < 50,000 vs. >= 50,000, other loans Y/N, debt ratio < 0.7 vs. > 0.7.
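
For illustration, a minimal Spark ML sketch of such an approve/decline tree. The column names (salary, otherLoans, debtRatio, approved) and the trainingDf DataFrame are hypothetical stand-ins for the bank's real feature set:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.VectorAssembler

    // Hypothetical feature columns; otherLoans is assumed encoded as 0/1.
    val assembler = new VectorAssembler()
      .setInputCols(Array("salary", "otherLoans", "debtRatio"))
      .setOutputCol("features")
    val tree = new DecisionTreeClassifier()
      .setLabelCol("approved")        // 1 = approve, 0 = decline
      .setFeaturesCol("features")
      .setMaxDepth(5)
    val approvalModel = new Pipeline().setStages(Array(assembler, tree)).fit(trainingDf)
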
  11. Regression: Credit line ($) – the credit limit (e.g., $500 to $1,500) is predicted as a function of the risk model score.
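
A corresponding regression sketch, assuming the risk score has already been assembled into a features vector; the creditLimit label column is hypothetical:

    import org.apache.spark.ml.regression.LinearRegression

    // Fit the credit limit as a function of the risk score features.
    val line = new LinearRegression()
      .setLabelCol("creditLimit")
      .setFeaturesCol("features")
      .setRegParam(0.1)
    val lineModel = line.fit(trainingDf)
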
  12. Clustering model: Price – applicants are clustered into APR tiers (e.g., 18% APR vs. 22% APR).
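
And a clustering sketch along the same lines; mapping each cluster to an APR tier is an assumption based on the slide's 18%/22% example:

    import org.apache.spark.ml.clustering.KMeans

    // Two clusters, each mapped to an APR tier (e.g., 18% vs. 22%).
    val km = new KMeans().setK(2).setFeaturesCol("features").setSeed(1L)
    val priceModel = km.fit(trainingDf)
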
  13. Existing system • Built on traditional technologies • Microsoft .NET stack: C#, MS SQL Server
  14. Top challenges with the existing system • Everything on a single box: not scalable, not flexible • Model training on limited data limits accuracy • Data scientists work in isolation with siloed tools • Model management is manual and cumbersome
  15. Primary goals for the new system • Ease of use for stakeholders (self-service) • Scale: build models on huge datasets • Fast decision response for the end customer • Unified, collaborative platform • Data lineage / audit capability
  16. Proposed tools • Spark Streaming • Spark ML • Kafka • HDFS • HBase • Visual Spark platform – StreamAnalytix
  17. Spark Streaming • Extension of the core Spark API for writing streaming jobs • Scalable, high-throughput, fault-tolerant • Receives input streams and divides them into micro-batches
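
A minimal sketch of the micro-batch model; the one-second batch interval is an assumption, since the slides do not give the actual value:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("CardApprovals")
    val ssc  = new StreamingContext(conf, Seconds(1))  // 1s micro-batches (assumed)
    // ... DStream transformations go here ...
    ssc.start()
    ssc.awaitTermination()
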
  18. Spark ML • Spark's machine learning module (DataFrame-based API) • Algorithms: classification, regression, clustering, collaborative filtering • Utilities: feature selection, feature transformations, hyperparameter tuning, model evaluation, linear algebra, statistics
  19. Spark-based architecture – Training: training data source → HDFS → Spark (Spark ML) → model repository (HDFS)
  20. Model training pipeline – Read from HDFS
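
For example, reading the training set from HDFS, assuming a SparkSession named spark; the path and CSV format are hypothetical, since the slides don't specify the data layout:

    // Hypothetical path and format for the training data.
    val trainingDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/card_applications/")
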
  21. Model training pipeline – Data quality: data validation (null checks, invalid characters such as ♡⚐♯♣)
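
One way to express such checks on a DataFrame; raw and the column names are hypothetical:

    import org.apache.spark.sql.functions._

    // Drop rows missing mandatory fields; reject values with non-printable chars.
    val validated = raw
      .na.drop(Seq("salary", "debtRatio"))
      .filter(!col("name").rlike("[^\\p{Print}]"))
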
  22. Model training pipeline – Eliminate outliers (e.g., a record with Score = 200 and Status = Approved)
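
A simple IQR-based filter is one of several possible approaches (the "Learnings" slide later mentions model-based outlier analysis as well); cleaned and the score column are hypothetical:

    import org.apache.spark.sql.functions.col

    // Keep records within 1.5 * IQR of the interquartile range of the score.
    val Array(q1, q3) = cleaned.stat.approxQuantile("score", Array(0.25, 0.75), 0.01)
    val iqr = q3 - q1
    val inliers = cleaned.filter(
      col("score").between(q1 - 1.5 * iqr, q3 + 1.5 * iqr))
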
  23. Model training pipeline – Impute missing values (e.g., Jack 37, Eva ?, Dirk 42): mean, median, most frequent, constant, or model-based imputation
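
Spark's Imputer transformer covers the mean/median strategies (it was added in Spark 2.2; the project ran on 2.1, where na.fill achieves the same). The people DataFrame and Age column mirror the slide's example:

    import org.apache.spark.ml.feature.Imputer

    // Imputer supports "mean" and "median"; most-frequent/constant need na.fill.
    val imputer = new Imputer()
      .setInputCols(Array("Age"))
      .setOutputCols(Array("AgeImputed"))
      .setStrategy("mean")
    val imputed = imputer.fit(people).transform(people)   // people is hypothetical
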
  24. Model training pipeline – Core logic of training the model: feature selection, transformation, model selection, hyperparameters
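
Putting those steps together as a Spark ML Pipeline with a hyperparameter grid; all column names and grid values are illustrative assumptions:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.tuning.ParamGridBuilder

    // Transformation + model selection, with a grid over one hyperparameter.
    val asm = new VectorAssembler()
      .setInputCols(Array("salary", "debtRatio", "creditScore"))
      .setOutputCol("features")
    val tree = new DecisionTreeClassifier().setLabelCol("approved")
    val pipeline = new Pipeline().setStages(Array(asm, tree))
    val grid = new ParamGridBuilder()
      .addGrid(tree.maxDepth, Array(3, 5, 7))
      .build()
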
  25. Model evaluation
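
Continuing the sketch above, evaluation via cross-validation on area under ROC; the metric and number of folds are assumptions:

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.CrossValidator

    // Pick the best model from the grid by cross-validated AUC.
    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("approved")
      .setMetricName("areaUnderROC")
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(grid)
      .setEvaluator(evaluator)
      .setNumFolds(3)
    val bestModel = cv.fit(trainingDf).bestModel
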
  26. Spark-based architecture – Scoring: user session → Kafka → Spark Streaming on YARN (loading ML models, enriching from 3rd-party providers and the bank's internal repository in HBase) → results to Kafka and HDFS
  27. Model scoring pipeline – Read from Kafka
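
A minimal direct-stream sketch using the spark-streaming-kafka-0-10 integration; the broker address, topic, and group id are hypothetical:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",             // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "card-approvals")
    val applications = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("applications"), kafkaParams))
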
  28. Model scoring pipeline – External web service calls
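
The enrichment calls (credit bureau, Acxiom, SOW) are HTTP requests made from the executors; a sketch with explicit timeouts, which the "Learnings" slide later calls out. The endpoint and helper name are hypothetical:

    import java.net.{HttpURLConnection, URL}

    // Hypothetical bureau endpoint; fail fast to protect the sub-second SLA.
    def callBureau(applicantId: String): String = {
      val conn = new URL(s"https://bureau.example.com/score/$applicantId")
        .openConnection().asInstanceOf[HttpURLConnection]
      conn.setConnectTimeout(500)
      conn.setReadTimeout(500)
      scala.io.Source.fromInputStream(conn.getInputStream).mkString
    }
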
  29. Model scoring pipeline – Internal DB lookup
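
The architecture slide places the bank's internal repository in HBase; a sketch of a point lookup with hypothetical table, column family, and qualifier names:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes

    // Hypothetical table/column names for the internal repository.
    val hbase = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = hbase.getTable(TableName.valueOf("customer_profile"))
    val row   = table.get(new Get(Bytes.toBytes(applicantId)))
    val sow   = Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("sow")))
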
  30. Model scoring pipeline – Conditional filter and model execution
  31. Model scoring pipeline – Decision based on the model's score
  32. Model scoring pipeline – Outcomes: Rejected, Approved, Pending
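
One way to express the score-to-decision branching on a DataFrame; the scored input, the riskScore column, and the 0.3/0.7 thresholds are illustrative assumptions:

    import org.apache.spark.sql.functions.{col, when}

    // Hypothetical thresholds mapping the 0-1 risk score to three outcomes.
    val decided = scored.withColumn("decision",
      when(col("riskScore") < 0.3, "Approved")
        .when(col("riskScore") > 0.7, "Rejected")
        .otherwise("Pending"))
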
  33. Model scoring pipeline – Line and price models
  34. Lineage / Audit – The journey of a single data record through the scoring pipeline
  35. Deployment • Transport: Kafka • Compute: Spark, StreamAnalytix • Storage: HDFS + Hive • Exploration: BI tools • Supporting services: 2 nodes with sticky sessions behind a load balancer, ZooKeeper, Tomcat, MySQL, RabbitMQ
  36. Project details • Q4 2017 • 3 months from start to finish • 3x faster than originally planned • Team size: 4 • Apache Spark 2.1 • On-premise Hadoop cluster with YARN
  37. Learnings • Consistent data format • Add timeouts to third-party API calls • Optimize stragglers • Avoid excessive logging • Checkpointing • Outlier analysis using models • Hyperparameter tuning + metric evaluation • Caching (useNodeIdCache)
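
On the checkpointing learning: a sketch of driver-recovery checkpointing for the streaming job, with a hypothetical HDFS path. (On the caching note, useNodeIdCache corresponds to the cacheNodeIds parameter on Spark ML tree estimators.)

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Recover from the checkpoint if one exists, otherwise build the job fresh.
    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/approvals", () => {
      val ctx = new StreamingContext(
        new SparkConf().setAppName("CardApprovals"), Seconds(1))
      ctx.checkpoint("hdfs:///checkpoints/approvals")   // hypothetical path
      // ... build the scoring pipeline here ...
      ctx
    })
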
  38. Goals: Recap • Ease of use for stakeholders (self-service) • Scale: build models on huge datasets • Fast decision response for the end customer • Unified, collaborative platform • Data lineage / audit capability
  39. Q&A – Visit the Impetus StreamAnalytix booth #209
