1. Leveraging Spark ML for Real-Time Credit Card Approvals
Case study from a large financial institution
Anand Venugopal
Saurabh Dutta
Impetus – StreamAnalytix
#Ent6SAIS
2. Agenda
• Use case background
• Existing system challenges and new goals
• Solution details and lessons learnt
• Q&A
4. Background – Use Case
• Acquire legitimate, responsible customers
• Decision: Approve? Credit limit? APR?
• Sub-second response time to make a decision
10. Decision tree – Approve? Y/N
[Figure: decision tree with nodes numbered 1 to 9. Split labels: Salary >= 50,000 / Salary < 50,000; Other Loans = Y / Other Loans = N; Debt Ratio < 0.7 / Debt Ratio > 0.7]
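A decision tree like the one on this slide can be read as a chain of nested conditions. The sketch below is only an illustration: the split features and thresholds (salary 50,000, other loans, debt ratio 0.7) come from the slide, but the order of the lower splits and every approve/decline outcome at the leaves are assumptions, since the slide text does not preserve them. The `Applicant` class and `approve` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    salary: float       # annual salary, same units as the 50,000 threshold
    other_loans: bool   # True when the applicant already has other loans
    debt_ratio: float   # debt ratio, compared against the 0.7 threshold


def approve(applicant: Applicant) -> bool:
    """Walk a small hand-written tree. Leaf outcomes are ASSUMED, not from the slide."""
    if applicant.salary >= 50_000:
        if applicant.other_loans:
            # Assumption: other loans are acceptable only with a low debt ratio.
            # The slide does not say what happens at exactly 0.7; here it declines.
            return applicant.debt_ratio < 0.7
        # Assumption: high salary and no other loans means approve.
        return True
    if applicant.other_loans:
        # Assumption: low salary with other loans means decline.
        return False
    # Assumption: low salary without other loans depends on the debt ratio.
    return applicant.debt_ratio < 0.7
```

In a trained model the splits and leaf values are learned from data rather than written by hand; the hand-written form is shown only to make the Y/N decision path concrete.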
13. Existing system
• Built using traditional technologies
• Microsoft .NET stack
– C#
– MS SQL Server
14. Top challenges with existing system
• Everything on single box: not scalable, not flexible
• Model training on limited data: limits accuracy
• Data Scientists work in isolation: siloed tools
• Model management: manual and cumbersome
15. Primary goals for the new system
• Ease of use for stakeholders (self-service)
• Scale: Build models on huge datasets
• Fast decision response for the end-customer
• Unified, collaborative platform
• Data Lineage / Audit capability
17. Spark Streaming
• Write streaming jobs
• Extension of core Spark API
– Scalable
– High throughput
– Fault tolerance
• Receives input data and divides it into batches
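The last bullet describes micro-batching: incoming records are grouped by arrival time into small batches, and each batch is then processed as a unit. The plain-Python sketch below illustrates that idea only. It does not use the Spark API, and the function name `micro_batches` and the event format `(timestamp_seconds, payload)` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]], interval: float) -> List[List[str]]:
    """Group (timestamp_seconds, payload) events into fixed-width time batches.

    Batches are returned in time order; intervals with no events are skipped.
    """
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        # Events whose timestamps fall in the same interval share a bucket index.
        buckets[int(timestamp // interval)].append(payload)
    return [buckets[index] for index in sorted(buckets)]


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
batches = micro_batches(events, interval=1.0)
# batches == [["a", "b"], ["c"], ["d"]]
```

In Spark Streaming the width of each batch is set by the batch interval chosen when the streaming context is created, so a smaller interval lowers latency at the cost of more scheduling overhead.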
35. Deployment
• Transport: Kafka
• Compute: Spark
• Storage: HDFS + Hive
• Exploration: BI Tools
StreamAnalytix
- 2 Nodes with Sticky Session
- Load Balancer
- Zookeeper
- Tomcat
- MySQL
- RabbitMQ
36. Project Details
• Q4 2017
• 3 months from start to finish
• 3x faster than originally planned
• Team size: 4
• Apache Spark 2.1
• On-premise Hadoop Cluster with YARN
37. Learnings
• Consistent data format
• Add timeouts to third-party API calls
• Optimize stragglers
• Avoid excessive logging
• Checkpointing
• Outlier Analysis
– Using models
• Hyperparameter tuning + Metric Evaluation
• Caching
– useNodeIdCache
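One of the learnings above is to add timeouts to third-party API calls, so that a slow external service cannot stall a sub-second decision. A minimal sketch of that pattern in plain Python follows. The helper `call_with_timeout` and the stand-in `slow_bureau_lookup` are hypothetical names for illustration, not part of the system described in the slides.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(func, timeout_s, fallback, *args):
    """Run func(*args) in a worker thread; return fallback if it exceeds timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not wait for the slow call; the worker thread finishes in the background.
        pool.shutdown(wait=False)


def slow_bureau_lookup(applicant_id):
    """Hypothetical stand-in for a slow third-party API call."""
    time.sleep(0.3)
    return {"applicant_id": applicant_id, "score": 700}


# The lookup takes about 0.3 s, so a 0.05 s budget yields the fallback value.
result = call_with_timeout(slow_bureau_lookup, 0.05, None, "A-1")
```

The caller still has to decide what a fallback means for the decision, for example declining, retrying, or routing the application to manual review. Many HTTP clients also accept a timeout argument directly, which is usually preferable to wrapping the call in a thread.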
38. Goals: Recap
• Ease of use for stakeholders (self-service)
• Scale: Build models on huge datasets
• Fast decision response for the end-customer
• Unified, collaborative platform
• Data Lineage / Audit capability