Abdulla Al Qawasmeh
Engineering Manager, Machine Learning
Credit Karma
Building Competing Models
using Spark DataFrames
Proprietary & Confidential
● Recommendation Problem
● Case Study: John
● Choice of Evaluation Metrics: A Tale of Two Models
● DataFrames & Type Safety
● Gotchas
Outline
Recommendation Problem
Credit Karma helps 60 million members make financial progress
● Revenue stream: Approved financial products
● Verticals: Credit cards, personal loans, auto, mortgages, and tax
● Goal: Recommend products that
● The member is interested in
● The member has a high probability of being approved for
● Have a long term benefit for the member (e.g., pay credit card debt using a
cheaper personal loan)
Recommendation problem
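The three signals above have to be folded into a single ranking. A minimal, hypothetical sketch in plain Scala (the `ProductScores` type, the field names, and the multiplicative combination are illustrative assumptions, not Credit Karma's actual formula):

```scala
// Hypothetical combination of the three per-product model outputs.
final case class ProductScores(
  name: String,
  pInterest: Double, // P(member is interested)
  pApproval: Double, // P(member is approved)
  eBenefit: Double   // expected long-term benefit
)

object Ranker {
  // One simple combination: expected realized benefit of recommending
  def score(p: ProductScores): Double = p.pInterest * p.pApproval * p.eBenefit

  // Highest combined score first
  def rank(products: Seq[ProductScores]): Seq[ProductScores] =
    products.sortBy(p => -score(p))
}
```

In practice each signal comes from its own model, which is what makes the choice of per-model evaluation metric (next sections) so important.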
Case Study: John
Product Features             P(Interest)   P(Approval)
2% Rewards, $300 Bonus       Very High     Average
1% Rewards, 0% APR for 6m    High          Average
0% Rewards, $20 Annual Fee   Very Low      Very High
Searching for the right options for John
Product Features             P(Interest)   P(Approval)   E[Benefit]
2% Rewards, $300 Bonus       Very High     Average       Average
1% Rewards, 0% APR for 6m    High          Average       Very High
0% Rewards, $20 Annual Fee   Very Low      Very High     Very Low
Long term benefit
We achieved success!
Did we?
What about other financial products?
Product Features                            P(Interest)   P(Approval)   E[Benefit]
Unsecured, 7.5% APR                         High          High          Very High
1% Rewards, 0% APR for 6m (25% APR after)   High          Average       High
Choice of Evaluation Metrics: A Tale of Two Models
A Tale of Two Models
Label   Predicted        Label   Predicted
0       0.05             1       0.95
0       0.9              1       0.55
1       0.95             0       0.05
AUC as a Metric
Label   Predicted        Label   Predicted
0       0.05             1       0.95
0       0.9              1       0.55
1       0.95             0       0.05
AUC = 1.0
What’s Wrong?
Recommendations?
Higher Predicted Score Wins

[Impression distribution chart: 667, 721, 756]
Which metric?
LogLoss Instead of AUC
Label   Predicted        Label   Predicted
0       0.05             1       0.95
0       0.9              1       0.55
1       0.95             0       0.05

LogLoss = 0.8            LogLoss = 0.2
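The slide's numbers can be reproduced in a few lines of plain Scala (no Spark needed); `auc` here is the pairwise-ranking formulation, and the exact log losses come out to about 0.80 and 0.23:

```scala
object Metrics {
  // AUC as the fraction of (positive, negative) pairs ranked correctly
  def auc(labels: Seq[Int], preds: Seq[Double]): Double = {
    val pos = labels.zip(preds).collect { case (1, p) => p }
    val neg = labels.zip(preds).collect { case (0, p) => p }
    val pairs = for (p <- pos; n <- neg)
      yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    pairs.sum / pairs.size
  }

  // Mean negative log-likelihood of the true labels
  def logLoss(labels: Seq[Int], preds: Seq[Double]): Double = {
    val losses = labels.zip(preds).map { case (y, p) =>
      -(y * math.log(p) + (1 - y) * math.log(1 - p))
    }
    losses.sum / losses.size
  }
}

val (labelsA, predsA) = (Seq(0, 0, 1), Seq(0.05, 0.9, 0.95))
val (labelsB, predsB) = (Seq(1, 1, 0), Seq(0.95, 0.55, 0.05))
// Both models get AUC = 1.0 (perfect ranking), yet model A's log loss
// (~0.80) is four times model B's (~0.23): AUC cannot see the miscalibrated
// 0.9 prediction on a negative example, but LogLoss penalizes it heavily.
```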
Calibration

Average Label   Predicted
0               0.05
0.45            0.9
0.95            0.95

AUC = 1.0
LogLoss = 0.8

[Calibration plot: average prediction (probability) per bucket, sorted by prediction, comparing Predicted vs. Experienced]
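The calibration plot boils down to a simple computation. A sketch in plain Scala, assuming equal-size buckets (the bucket count is an arbitrary choice, not from the talk):

```scala
// Bucket examples by predicted score, then compare the average prediction
// to the average observed label per bucket. For a well-calibrated model the
// two values track each other closely in every bucket.
def calibrationBuckets(
    labels: Seq[Double],
    preds: Seq[Double],
    nBuckets: Int): Seq[(Double, Double)] = {
  val sorted = preds.zip(labels).sortBy(_._1) // sort by prediction
  val bucketSize = math.max(1, sorted.size / nBuckets)
  sorted.grouped(bucketSize).map { bucket =>
    val avgPred  = bucket.map(_._1).sum / bucket.size
    val avgLabel = bucket.map(_._2).sum / bucket.size
    (avgPred, avgLabel)
  }.toSeq
}
```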
DataFrames & Type Safety
DataFrames
DataFrames are a flexible data type
● Any number of columns
● Any column types
● No type checking at compile time
Some checking is done at runtime
● E.g., using a schema
val featuresLabelDF: DataFrame
val featuresOnlyDF: DataFrame

def doTraining(featuresLabelDF: DataFrame) = {
  ...
}
...
doTraining(featuresOnlyDF) // Compiles, but passes the wrong data and only fails at runtime
DataFrames - Type Safety Problem
Type Safety Wrappers
- Row Wrapper
- DataFrame Wrapper
Row Wrapper
final case class FeaturesLabelDataRow(facts: Facts, label: Double)
extends DataRow {
def toRow: Row
}
object FeaturesLabelDataRow {
def parseRow(row: Row): FeaturesLabelDataRow
}
Type Safety Wrappers
final case class FeaturesLabelData(dataFrame: DataFrame)

object FeaturesLabelData {
  def apply(rowDataRDD: RDD[FeaturesLabelDataRow]): FeaturesLabelData = {
    val schema: StructType = FeaturesLabelDataRow.getSchema
    // createDataFrame expects RDD[Row], so convert each typed row first
    FeaturesLabelData(sqlContext.createDataFrame(rowDataRDD.map(_.toRow), schema))
  }
}
Type Safety Wrappers
DataFrame Wrapper
val featuresLabelData: FeaturesLabelData
val featuresOnlyData: FeaturesData
def doTraining(featuresLabelData: FeaturesLabelData) = {
...
}
...
doTraining(featuresOnlyData) // Compile time error!
DataFrames - Type Safety Solution
Gotchas
impDF:
ImpressionID   ClickID   ...
1              None
2              1
3              None
...

clickDF:
ClickID   ...
1
...
val impDF: DataFrame
val clickDF: DataFrame
// Gotcha: all rows with a null ClickID hash to the same partition, which can badly skew the join
val joined = impDF.join(clickDF, impDF("ClickID") === clickDF("ClickID"), "left_outer")
DataFrames Left Join
val impDF: DataFrame
val clickDF: DataFrame
// Split out the rows that can never match, join only the rest, then union them back
val impClicks = impDF.where(impDF("ClickID").isNotNull)
val impOnly = impDF.where(impDF("ClickID").isNull)
// Note: impOnly must first be padded with null columns so its schema matches the joined side
val joined = impClicks.join(clickDF, impClicks("ClickID") === clickDF("ClickID")).unionAll(impOnly)
DataFrames Left Join - Solution
new GBTRegressor().fit(df)
StackOverflow
Exception in thread "main" java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst...
new GBTRegressor()
  .setCheckpointInterval(metaData.checkpointInterval)
  .fit(df)
-Dspark.executor.extraJavaOptions='-XX:ThreadStackSize=81920'
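Putting the two mitigations together, a hedged sketch of the configuration (the checkpoint directory, the interval of 10, and the `sc`/`df` names are illustrative placeholders; the talk reads the interval from `metaData`). Checkpointing truncates the long lineage that GBT's many iterations accumulate, which is what overflows the stack during query planning; the larger executor thread stack is the fallback:

```scala
// Sketch only: paths and values are placeholders, not from the talk.
// A checkpoint directory must also be set for setCheckpointInterval to take effect.
sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")

val model = new GBTRegressor()
  .setCheckpointInterval(10) // checkpoint (and truncate lineage) every 10 trees
  .fit(df)

// Fallback if planning still overflows: enlarge the executor thread stack, e.g.
// spark-submit --conf "spark.executor.extraJavaOptions=-XX:ThreadStackSize=81920" ...
```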
val df: DataFrame // 500 columns, 1000 rows
val Array(left, right) = df.randomSplit(Array(.8,.2))
// This crashes
left.count
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
"(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
grows beyond 64 KB
Random Split Janino Error
Upgrade to: 1.6.4, 2.0.3, 2.1.1, or 2.2.0
https://issues.apache.org/jira/browse/SPARK-16845
● The choice of evaluation metric is very important, especially for
competing models
● There are ways to introduce type safety on top of dynamically
typed DataFrames
Takeaways
Thank You.
Questions?
https://engineering.creditkarma.com
abdulla.alqawasmeh@creditkarma.com
https://www.linkedin.com/in/aalqawasmeh
