a talk
Ryan Wang (@ryw90)
If it weighs the same as a duck
Detecting fraud with Python and machine learning
Outline
• Why do we use machine learning?
• Overview of our pipeline
• What does it take to update a model?
What is Stripe?
• Collect payments via API
• Most users charge credit cards
import stripe

stripe.Charge.create(
    amount=100,  # amount in cents
    currency='usd',
    source={
        'object': 'card',
        'number': '4242 4242 4242 4242',
        ...
    }
)
Things fraudsters do
• Typical fraudster buys stolen credit cards then:
• Creates fake Stripe accounts
• Buys goods from legitimate Stripe users
• Others test / brute force credentials
Witches are easier to spot than fraud
Stopping fraud v1
• Manual rules and aggressive blacklisting
• Scaling issues
• Hard to control precision
• Complexity grows quickly
• Little generalization
• But important infrastructure built
• Tools for manual investigation
• Graph search
Stopping fraud v2
• Tree-based models to estimate p(fraud | features)
• Target composite outcome
• Disputes,
• Manual tags
• Information from card networks
• Python as glue
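As a rough sketch of what "tree-based models to estimate p(fraud | features)" can look like with scikit-learn (not Stripe's actual code; the file path, column names, and model choice here are illustrative):

# Sketch: estimate p(fraud | features) with a tree-based classifier.
# Assumes a table of engineered features plus a binary is_fraud label
# built from the composite outcome (disputes, manual tags, card
# network information). Path and column names are made up.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('charge_features.csv')
X = df.drop(columns=['charge_id', 'is_fraud'])
y = df['is_fraud']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Predicted probability of fraud for each held-out charge.
p_fraud = clf.predict_proba(X_test)[:, 1]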
[Diagram: Qualitative feedback → Feature engineering → Model training → Model evaluation → Model deployment]
In order of work required
• Model evaluation
• Feature engineering
• Model training
• Qualitative feedback
• Monitoring / deployment
What does it take to update a model?
Feature engineering aka counting stuff
Types of features
• Static features useful on the margin
• Card from risky country?
• Billing details consistent?
• Dynamic features really useful
• Velocity of charges from email recently?
• Utilize network information
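A toy sketch of one dynamic feature, the recent charge velocity per email (the real versions run as Hadoop jobs over full history; the DataFrame and column names here are assumptions):

# Sketch of a dynamic "velocity" feature: for each charge, how many
# charges the same email made in the preceding 24 hours. Assumes a
# pandas DataFrame of charges with 'email' and 'created' columns.
import pandas as pd

def email_velocity_24h(charges):
    charges = charges.sort_values('created')

    def count_window(group):
        times = group['created']
        return pd.Series(
            [((times >= t - pd.Timedelta(hours=24)) & (times < t)).sum()
             for t in times],
            index=group.index,
        )

    # Counts are aligned back to the original charge rows.
    return charges.groupby('email', group_keys=False).apply(count_window)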
Feature pipeline
• Slow Hadoop jobs compute features
• Sampling doesn’t really help
• Luigi manages dependencies
• Only re-run jobs with changes
• Load results to database
• http://www.github.com/spotify/luigi
[Diagram: Raw charges → static features, card features, email features → joined features + outcomes → training]
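A minimal, generic Luigi sketch of the dependency handling described above: each task declares what it requires() and what it output()s, and Luigi skips any task whose output already exists. Task and file names are illustrative, not the actual feature jobs.

import luigi

class StaticFeatures(luigi.Task):
    def output(self):
        return luigi.LocalTarget('static_features.tsv')

    def run(self):
        with self.output().open('w') as f:
            f.write('charge_id\trisky_country\n')  # placeholder job

class JoinedFeatures(luigi.Task):
    def requires(self):
        # Luigi runs StaticFeatures first, but only if its output is missing.
        return StaticFeatures()

    def output(self):
        return luigi.LocalTarget('joined_features.tsv')

    def run(self):
        with self.input().open('r') as inp, self.output().open('w') as out:
            out.write(inp.read())  # placeholder join

if __name__ == '__main__':
    luigi.run()  # e.g. python pipeline.py JoinedFeatures --local-scheduler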
Feature pipeline (cont.)
import luigi

# `redshift`, `FeatureTask`, and `ScaldingJob` are internal Stripe helpers.
@redshift('transactionfraud.features')
class JoinFeatures(luigi.WrapperTask):
    def requires(self):
        components = [
            'static_features',
            'dynamic_card_features',
            'dynamic_email_features',
            'outcomes',
        ]
        return [FeatureTask(c) for c in components]

    def job(self):
        return ScaldingJob(
            job='JoinFeatures',
            output=self.output().path,
            **self.requires()
        )
Feature pipeline (cont.)
import com.twitter.scalding._
import com.stripe.thrift.Charge

class DynamicIpFeatures(args: Args) extends Job(args) {
  val charges = load[Charge](args("charges"))
  val historicalCounts = getHistoricalCounts(charges)

  historicalCounts
    .map { case (chargeId, counts) =>
      IpFeatures(
        chargeId = chargeId,
        feature1 = counts.feature1,
        feature2 = counts.feature2,
        ...
      )
    }
    .save
}
The curious case of email
Model debugging
• Added dynamic email features to model
• Velocity of charges from email recently?
• Quantitative measures good
• High feature importance
• Overall model performance improved
• Weird issues in staging
• Systematic false positives
• High velocity did not yield higher p(fraud)
Model debugging (cont.)
• Old fashioned data analysis reveals…
• Likelihood of fraud much higher when email undefined than when defined
• p(fraud | email undefined) = ~14%
• p(fraud | email defined) = ~5%
• In other words, a missing email is “predictive” of fraud
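The analysis behind those numbers is just a conditional rate comparison; roughly (the DataFrame and column names are assumptions, and the percentages come from the slide, not from this code):

# Compare fraud rates when the email is missing vs. present.
# Assumes a DataFrame with 'email' and a binary 'is_fraud' column.
import pandas as pd

df = pd.read_csv('charge_features.csv')
missing = df['email'].isna()

p_no_email = df.loc[missing, 'is_fraud'].mean()
p_with_email = df.loc[~missing, 'is_fraud'].mean()
print(f'p(fraud | email undefined) = {p_no_email:.1%}')
print(f'p(fraud | email defined)   = {p_with_email:.1%}')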
Model debugging (cont.)
• Email attribute of Customer
• If the credit card is declined during customer creation*, the call fails with `CardError`
• Fraud correlated with decline, thus missing email
stripe.Customer.create(
    source={
        'object': 'card',
        # Test card for declines
        'number': '4000000000000002',
        'exp_year': '2016',
        'exp_month': 1,
    }
)
* Not exactly accurate, as most users tokenize cards rather than creating customers with cards directly
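The mechanism in miniature, as a hedged sketch (the email value is made up; the exception class is stripe.error.CardError in the Python library): the decline aborts customer creation, so the later charge carries no email.

import stripe

try:
    customer = stripe.Customer.create(
        email='jenny@example.com',  # illustrative
        source={
            'object': 'card',
            'number': '4000000000000002',  # test card that always declines
            'exp_year': '2016',
            'exp_month': 1,
        },
    )
except stripe.error.CardError:
    # Decline: no Customer object, and therefore no email, ever gets stored.
    customer = None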
Model debugging (cont.)
• Data is generated according to: stripe.Customer.create → card declined (correlated with fraud) → no Customer, hence no customer.email
• Apply this model on live traffic: attempt charge without email → p(fraud | no email) >> p(fraud | email) → model blocks charge
Is the model any good?
Model evaluation
• Topmodel
• Flask app that charts and organizes output from binary classifiers
• Cross between a lab notebook and Kaggle
• Feedback / PRs appreciated!
• https://github.com/stripe/topmodel
Model evaluation (cont.)
• Regularly generate ground truth and benchmark existing models
• Newly trained models automatically compared
test_y, test_start, test_end = topmodel_integration.retrieve_actuals(path)
test_X = query_to_df(
    model.spec.sql_query(), test_start, test_end)
metadata = model.metadata()
results = model.score_and_format(test_y, test_X)
topmodel_integration.send_dataframe_to_s3(results, metadata)
Model evaluation (cont.)
• Maintaining reproducibility annoying
• Originally stored pickled models on S3
• But wrapper code sometimes changes
• But sklearn sometimes changes
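For context, the original approach was roughly the following (bucket and key names are made up; the point is that the pickle only loads cleanly against the exact wrapper code and sklearn versions it was written with):

# Original approach, sketched: pickle the trained model and push it to S3.
import pickle
import boto3

blob = pickle.dumps(clf)  # clf: a trained sklearn classifier
boto3.client('s3').put_object(
    Bucket='fraud-models',        # illustrative bucket
    Key='models/fraud-v2.pkl',    # illustrative key
    Body=blob,
)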
Summary
• Python glues together whole pipeline
• Adding a simple feature can be hard
• Spend a lot of time on feature engineering and model evaluation
Questions?
