Pixels.camp - Machine Learning: Building Successful Products at Scale
1. © 2016 Feedzai Confidential
@antonioalegria
Product Lead for Cloud
Machine Learning:
Building Successful Products at Scale
5. © Feedzai Inc. Confidential.
This talk is about
Tips on building a successful Machine Learning product
One that works on multiple use cases within the same ecosystem
One that works on structured data and in classification use cases
Challenges with doing this for a generic SaaS Fraud Prevention
product
Data API design choices
Taking advantage of the data to power Machine Learning
6. © Feedzai Inc. Confidential.
This talk is NOT about
Unstructured data problems
Natural Language Processing
Image Recognition
Speech Recognition
Virtual Assistants
Self-driving cars
Specific technologies
Dataviz or UI
8. © Feedzai Inc. Confidential.
Examples of ML Products
Gmail Spam detection
Recommendations:
Movies
Books
Music
Dating
Automating Access Control @ Amazon
Predicting Heart Attacks / Diseases on a Fitness Tracking Service
Fraud Detection ;-)
15. © Feedzai Inc. Confidential.
Feedzai in a Nutshell
What? Detect fraudulent payments and their customers, in real-time
How?
We receive transaction and behavior data
We continuously update 1:1 profiles for every entity (e.g. cards, IPs, merchants)
A Machine Learning model analyses each payment and its history in real-time
User receives scores immediately, with human-readable explanations
User can give feedback by labeling transactions as “ok” or “fraud” – our models will learn automatically
Where? Deployed on-site or used in the cloud through REST API
When? Our AI never sleeps and responds in a few milliseconds
16. © Feedzai Inc. Confidential.
Huge Data
Securing over $2B per day (over $700B/year, 3.3x Portugal’s GDP)
Growing soon to reach trillion scale
Fighting crime across the globe
US – our clients use Feedzai to process $4 of every $10 in all US
Canada
Brazil
India
Nigeria
Europe
We have to make decisions in 25ms
17. DATA-DRIVEN FRAUD DETECTION
[Diagram: a Request carrying Payments & Actions ($ € ¥ £) passes through Data Enrichment, 1:1 Profiling & Analytics, Machine Learning and Human-built Rules into Risk Analysis; the Response carries the Decision (Approve, Decline, Review); User Feedback flows back into the system]
18. © 2016 Feedzai Confidential 18
White-box Scoring
Human explanations from AI reasoning
19. © Feedzai Inc. Confidential.
Challenges
SaaS for Online Commerce
Fraud Prevention in Online Commerce is a very broad scenario
Different geographies
Widely different use cases
Fraud and abuse are a case of extremely unbalanced classes
It can involve many abuse scenarios:
Payment fraud (e.g. stolen credit cards)
Account Takeovers
Money Laundering
Abusing employee benefits
Solution needs to SCALE
20. © Feedzai Inc. Confidential.
Scale…
I don’t think it means what you think it
means
21. © Feedzai Inc. Confidential. 21
Let’s look at some of the key components of
a good ML product
23. © 2016 Feedzai Confidential 23
Data API Specificity Spectrum
Axes: API Flexibility, Model Shareability
Very use-case specific (e.g. Email spam detection)
• Very specific to a particular use case
• Strict validations
• Clients need to fully adapt to the API
+ Shared Models
+ Shared Model Features
+ Very easy to fully automate and scale
– Low adaptation to clients
Platform for classes of use cases (e.g. Feedzai for Online Commerce)
• API defines common “language”
• Comprehensive set of well-defined optional fields
• Supports custom fields and events
• Clients adapt to the Native fields but can use custom data
+ Potential for high adaptation to clients
+ Tiered model possible (shared + specific)
+ Shared and specific model features
– Automation and scaling are not trivial
Generic ML Platform (e.g. BigML)
• Defines bare minimum generic terms
• Clients have full flexibility to integrate
• Custom events with custom data fields
+ Potential for total adaptation to clients
– Costly adaptation to clients
– Hard to do feature engineering
– Fully Separate Models
– Fully Separate Model Features
24. © Feedzai Inc. Confidential.
Responses should include the following
Score(s) (e.g. probability of being fraud)
Decision:
Accept
Review
Decline
Human-Readable Explanations
Machine-Readable Reason Codes
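A minimal sketch of a response object carrying these four elements. The field names, the 0-1000 score scale and both thresholds are assumptions made for illustration; they are not Feedzai's actual values.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical thresholds on an assumed 0-1000 score scale.
REVIEW_THRESHOLD = 500
DECLINE_THRESHOLD = 900


@dataclass
class ScoringResponse:
    score: int                                              # higher means riskier
    decision: str                                           # "accept", "review" or "decline"
    explanations: List[str] = field(default_factory=list)   # for human analysts
    reason_codes: List[str] = field(default_factory=list)   # for client automation


def decide(score: int) -> str:
    """Map a score to one of the three decisions."""
    if score >= DECLINE_THRESHOLD:
        return "decline"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "accept"


def build_response(score, explanations, reason_codes):
    """Bundle score, decision, explanations and reason codes together."""
    return ScoringResponse(score, decide(score), list(explanations), list(reason_codes))


response = build_response(740, ["Customer used over 3 cards in past week."], ["Fraud"])
print(response.decision)
```

Keeping the reason codes separate from the explanations lets a client automate on the codes while showing the text to reviewers.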
25. Feedzai API Example
Request
POST /v1.1/payments
{
  "id": "1477020120",
  "user_id": "af00-bc14-1245",
  "amount": 280000,
  "currency": "USD",
  "ip": "212.10.114.18",
  "items": [
    {
      "item_id": "cell_400200",
      "name": "Cellphone 1450",
      "price": 25000
    }
  ],
  "payment_methods": [
    {
      "type": "card",
      "card_fullname": "HUGH Howey",
      "card_pan": "4539488752989912",
      "card_exp": "06/17"
    }
  ],
  "user_defined": {
    "is_po_box": true,
    "expedited_delivery": true
  }
}
Response
HTTP 200 (OK)
{
  "id": "1477020120",
  "score": 740,
  "decision": "review",
  "reason_codes": [
    { "name": "Fraud" },
    { "name": "MoneyLaundering" },
    { "name": "AccountTakeover" }
  ],
  "explanation": [
    {
      "description": "Customer used over 3 cards in past week.",
      "risk": 0.4,
      "confidence": 5
    },
    {
      "description": "Customer has used a single internet address in the last 24 hours.",
      "risk": 0.003,
      "confidence": 5
    }
  ]
}
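As an illustration of how a client might consume such a response, the sketch below parses the response body from the example (written with straight quotes) and pulls out what an integration would typically act on. The `summarize` helper and its output keys are hypothetical; no network call is made.

```python
import json

# Response body from the example above.
raw = """
{
  "id": "1477020120",
  "score": 740,
  "decision": "review",
  "reason_codes": [
    {"name": "Fraud"},
    {"name": "MoneyLaundering"},
    {"name": "AccountTakeover"}
  ],
  "explanation": [
    {"description": "Customer used over 3 cards in past week.",
     "risk": 0.4, "confidence": 5},
    {"description": "Customer has used a single internet address in the last 24 hours.",
     "risk": 0.003, "confidence": 5}
  ]
}
"""


def summarize(body: str) -> dict:
    """Extract the decision, the reason codes and the riskiest explanation."""
    data = json.loads(body)
    top = max(data["explanation"], key=lambda e: e["risk"])
    return {
        "id": data["id"],
        "needs_manual_review": data["decision"] == "review",
        "reason_codes": [rc["name"] for rc in data["reason_codes"]],
        "top_explanation": top["description"],
    }


summary = summarize(raw)
print(summary["top_explanation"])
```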
26. © Feedzai Inc. Confidential.
The Machine Learning Algorithm
(in 30 seconds)
27. © Feedzai Inc. Confidential.
Machine Learning Algorithm
Encapsulate the actual ML into an isolated component
Start with algorithms that are fast to train and evaluate, adapt to
different use cases, support classification and regression, and are
white-box
Random Forest
Gradient Boosting Machines (GBMs)
Deep Learning shows potential for more unstructured problems
Though it’s heavyweight to train and requires a lot of pre-processing
Still unclear how much it can “replace” feature engineering
28. © Feedzai Inc. Confidential. 28
Machine Learning is 90% Data Processing
29. © Feedzai Inc. Confidential.
Machine Learning In Production
Naïve ML Pipeline
[Diagram: two paths that apply the same steps (Enrich, Filter, Transform, Aggregate, Project). Training path: Historical Input Data → Instance Vector + Class Annotation → Training. Production path: Live Input Data, together with Historical Data → Instance Vector → Classify.]
30. © Feedzai Inc. Confidential.
Example Input Data
Transaction:
Amount
Currency
User ID
User Name
Credit Card Number
Cardholder Name
IP Address
31. © Feedzai Inc. Confidential.
Example Naïve Features
Amount in USD
Currency
Time of Day
Day of Week
Is IP a Proxy
IP Country == Store Country
IP Country
User ID
Card
Device ID
Some of these features are awful (don’t do this):
Having such high-cardinality categoricals is bad and leads to overfitting
Also, the model isn’t learning patterns, just which users/devices/cards are bad
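A rough sketch of computing the per-transaction features from the list while leaving out the high-cardinality identifiers. The input field names, the static exchange-rate table and the pre-resolved IP attributes are assumptions for illustration only.

```python
from datetime import datetime, timezone

# Hypothetical static FX table; a real system would use dated rates.
USD_RATES = {"USD": 1.0, "EUR": 1.1}


def naive_features(trx: dict) -> dict:
    """Build features that only look at the current transaction."""
    ts = datetime.fromtimestamp(trx["timestamp"], tz=timezone.utc)
    return {
        "amount_usd": trx["amount"] * USD_RATES[trx["currency"]],
        "currency": trx["currency"],
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),  # Monday == 0
        "ip_is_proxy": trx["ip_is_proxy"],
        "ip_country": trx["ip_country"],
        "ip_country_matches_store": trx["ip_country"] == trx["store_country"],
        # Deliberately omitted: user_id, card, device_id (high cardinality).
    }


trx = {
    "timestamp": 1477020120,  # 2016-10-21 03:22:00 UTC, a Friday
    "amount": 250.0,
    "currency": "USD",
    "ip_is_proxy": False,
    "ip_country": "PT",
    "store_country": "US",
    "user_id": "af00-bc14-1245",
}
features = naive_features(trx)
print(features)
```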
33. © Feedzai Inc. Confidential. 33
It can’t distinguish between two equal
transactions from two different people.
We need to go further
34. © Feedzai Inc. Confidential. 34
Goal: the model must see
The current event
+
All* past events
+
All* related events (e.g. same card)
35. © Feedzai Inc. Confidential. 35
1:1 Profiles are a good approximation of this
36. © Feedzai Inc. Confidential.
What’s a profile?
An aggregation or summarization of events over a certain time window
and for a group of entities
Examples:
Number of transactions in last 24h for this card
Number of transactions in past month for this customer with this card
Number of distinct cards for this customer
Last 5 used card countries
1:1 refers to the fact that profiles are tracked by specific entities
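To make the definition concrete, here is a small batch sketch that computes some of the example profiles from a list of events. The event fields (`ts`, `customer`, `card`) and helper names are hypothetical, and timestamps are plain seconds.

```python
from collections import defaultdict

DAY = 24 * 60 * 60  # seconds


def count_in_window(events, now, window, key):
    """Count events per entity with a timestamp in (now - window, now]."""
    counts = defaultdict(int)
    for e in events:
        if now - window < e["ts"] <= now:
            counts[key(e)] += 1
    return dict(counts)


def distinct_cards_per_customer(events):
    """Number of distinct cards seen for each customer."""
    cards = defaultdict(set)
    for e in events:
        cards[e["customer"]].add(e["card"])
    return {customer: len(seen) for customer, seen in cards.items()}


events = [
    {"ts": 0,        "customer": "c1", "card": "A"},
    {"ts": DAY // 2, "customer": "c1", "card": "A"},
    {"ts": DAY,      "customer": "c1", "card": "B"},
    {"ts": DAY + 10, "customer": "c2", "card": "C"},
]
now = DAY + 10

# Number of transactions in the last 24h for each card.
per_card_24h = count_in_window(events, now, DAY, key=lambda e: e["card"])
# Same window, grouped by customer AND card.
per_customer_card_24h = count_in_window(
    events, now, DAY, key=lambda e: (e["customer"], e["card"])
)
distinct_cards = distinct_cards_per_customer(events)
```

Each profile is the same three ingredients: a window, a grouping key and an aggregation.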
37. © Feedzai Inc. Confidential.
Characteristics of a Profile
It’s applied over a (usually time) data window
Sliding
Tumbling
Delayed
It has a set of dimensions or entities to group by
It has an aggregation function
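The difference between tumbling and sliding windows can be sketched as follows, assuming integer timestamps in seconds and events arriving in timestamp order. The names are hypothetical, and delayed windows are not shown.

```python
from collections import deque


def tumbling_window_start(ts: int, size: int) -> int:
    """Tumbling windows are fixed, non-overlapping buckets of `size` seconds."""
    return ts - ts % size


class SlidingCount:
    """Count of events in the last `size` seconds, updated on every event."""

    def __init__(self, size: int):
        self.size = size
        self.timestamps = deque()

    def add(self, ts: int) -> int:
        self.timestamps.append(ts)
        # Evict everything that fell out of (ts - size, ts].
        while self.timestamps[0] <= ts - self.size:
            self.timestamps.popleft()
        return len(self.timestamps)


sliding = SlidingCount(size=60)
counts = [sliding.add(ts) for ts in (0, 30, 61, 200)]
bucket = tumbling_window_start(125, 60)
```

A sliding window moves with every event, while a tumbling window only changes when the bucket boundary is crossed.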
38. © Feedzai Inc. Confidential.
Challenges
How do you calculate these profiles continuously and in real-time?
How do you calculate profiles for both short term and long-term
windows?
How do you reproduce exactly the same processing in training, testing
and in production?
How do you make it so that Data Scientists can ship something to
production without having Developers’ intervention?
How do you easily “code-review” it?
39. © Feedzai Inc. Confidential. 39
Reproducibility between Training and
Production is essential
This is the most important thing
40. © Feedzai Inc. Confidential.
Reproducibility
Without a training pipeline that mirrors real-time processing, the model will
learn something different from what it will see in production
This kind of mismatch between training and production acts like concept drift and can kill your model’s performance
You can fix this in two ways:
Have a very strict (and slow) process of testing and QA
Or use the same code during training, testing and in production
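One way to picture the second option: a single feature function that is called both when replaying history to build the training set and when scoring live events. This is only a sketch under assumed names (`featurize`, `build_training_set`, `LiveScorer`) with a toy stateful feature.

```python
def featurize(event: dict, state: dict) -> dict:
    """Single feature function shared by training replay and live scoring."""
    seen = state.get(event["card"], 0)
    features = {"amount": event["amount"], "prior_trx_for_card": seen}
    state[event["card"]] = seen + 1  # update the profile after reading it
    return features


def build_training_set(history):
    """Replay historical events in time order through the same code."""
    state = {}
    ordered = sorted(history, key=lambda e: e["ts"])
    return [(featurize(e, state), e["is_fraud"]) for e in ordered]


class LiveScorer:
    """Production path: same featurize(), state kept between requests."""

    def __init__(self, model):
        self.model = model
        self.state = {}

    def score(self, event):
        return self.model(featurize(event, self.state))


history = [
    {"ts": 1, "card": "A", "amount": 10, "is_fraud": False},
    {"ts": 2, "card": "A", "amount": 99, "is_fraud": True},
    {"ts": 3, "card": "B", "amount": 5, "is_fraud": False},
]
training = build_training_set(history)

# A stand-in "model" that returns its input, to compare feature vectors.
scorer = LiveScorer(model=lambda features: features)
live = [scorer.score(e) for e in history]
```

Because both paths call the same function, the vectors seen in training match those seen live for the same event stream.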
41. © Feedzai Inc. Confidential. 41
Data Scientists must be able to “code”, test
and ship Feature Engineering logic to
Production
(without Software Engineers having to
implement it from a spec)
42. © Feedzai Inc. Confidential. 42
Complex Event Processing
+
Large Scale Data Processing Platforms
43. © Feedzai Inc. Confidential.
Complex Event Processing
Data Processing methodology and family of stream-based
technologies
Relies on a DSL, sometimes similar to SQL
Instead of applying queries/logic to data, the data goes
through in-memory queries that update state immediately
44. © Feedzai Inc. Confidential.
Example
SELECT user_id,
       card,
       avg(amount) AS avg_amount,
       count() AS num_trx,
       count() / (last().timestamp - first().timestamp) AS velocity
FROM transactions[24 hours]
GROUP BY user_id, card;
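For readers unfamiliar with CEP engines, the query above can be approximated in plain Python as in-memory state that is updated on every incoming event. This is an illustrative sketch, not how a CEP engine is implemented: it assumes timestamps in seconds, in-order events, and reads velocity as the event count divided by the time elapsed between the first and last event in the window.

```python
from collections import defaultdict, deque

WINDOW = 24 * 60 * 60  # 24 hours, in seconds


class TransactionProfile:
    """Per (user_id, card) aggregates over a sliding 24-hour window."""

    def __init__(self):
        # (user_id, card) -> deque of (timestamp, amount)
        self.windows = defaultdict(deque)

    def update(self, trx: dict) -> dict:
        key = (trx["user_id"], trx["card"])
        window = self.windows[key]
        window.append((trx["timestamp"], trx["amount"]))
        # Evict events older than 24 hours; the new event always remains.
        while window[0][0] <= trx["timestamp"] - WINDOW:
            window.popleft()
        num_trx = len(window)
        span = window[-1][0] - window[0][0]
        return {
            "user_id": key[0],
            "card": key[1],
            "avg_amount": sum(amount for _, amount in window) / num_trx,
            "num_trx": num_trx,
            # Undefined when the window holds a single instant.
            "velocity": num_trx / span if span > 0 else None,
        }


profile = TransactionProfile()
first = profile.update({"user_id": "u1", "card": "A", "timestamp": 0, "amount": 10})
second = profile.update({"user_id": "u1", "card": "A", "timestamp": 3600, "amount": 30})
third = profile.update({"user_id": "u1", "card": "A", "timestamp": 90000, "amount": 50})
```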
45. © Feedzai Inc. Confidential.
Common CEP Operations
Filtering
Correlation
Windowing
Transformation
Aggregation/Grouping
Merging/Union
Sorting
Pattern Detection
46. © Feedzai Inc. Confidential.
Complex Event Processing at Scale
CEP technology is usually reliant on in-memory processing
To handle long-term profiles you need to pair this with
distributed data processing platforms
The ability to replay historical data like in production should
be a core requirement for the whole system
48. © Feedzai Inc. Confidential. 48
Support 0-downtime deployment of new
models in staging mode
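One possible reading of "staging mode" is shadow scoring: the new model sees live traffic and its scores are logged, but only the current model's score is returned. The sketch below assumes that reading; `ModelRouter` and its methods are hypothetical names.

```python
class ModelRouter:
    """Serve the live model while a staged model scores in the shadow."""

    def __init__(self, live_model):
        self.live_model = live_model
        self.staging_model = None
        self.shadow_log = []  # (live_score, staged_score) pairs for comparison

    def stage(self, model):
        """Start shadow-scoring with a new model; no restart needed."""
        self.staging_model = model

    def promote(self):
        """Swap the staged model in as the live one."""
        if self.staging_model is None:
            raise RuntimeError("no staged model to promote")
        self.live_model, self.staging_model = self.staging_model, None

    def score(self, features):
        live_score = self.live_model(features)
        if self.staging_model is not None:
            try:
                self.shadow_log.append((live_score, self.staging_model(features)))
            except Exception:
                pass  # a failing staged model must never affect live traffic
        return live_score


router = ModelRouter(live_model=lambda features: 0.1)
router.stage(lambda features: 0.9)
before = router.score({"amount": 10})
router.promote()
after = router.score({"amount": 10})
```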
49. © Feedzai Inc. Confidential. 49
Good (consistent) Data > Lots of Data
50. © Feedzai Inc. Confidential. 50
Do things that don’t scale
• Look at specific data rows
• Open the CSVs
• Use SQL to try to find new insights
51. © Feedzai Inc. Confidential. 51
Throw away good data
(wait, what?)
52. © Feedzai Inc. Confidential.
>99.5% Good Transactions
< 0.5% Fraud
Fraud is extremely unbalanced
Use undersampling to drop good transactions
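A minimal sketch of random undersampling: every fraud row is kept and only a fraction of the good rows. The `is_fraud` field name, the 10% keep ratio and the seed are assumptions for illustration. Note that scores from a model trained on the undersampled set reflect the altered class balance.

```python
import random


def undersample(rows, label_key="is_fraud", good_keep_ratio=0.1, seed=42):
    """Keep all fraud rows and a random fraction of the good rows."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [r for r in rows if r[label_key] or rng.random() < good_keep_ratio]


# 995 good transactions and 5 frauds, i.e. 0.5% fraud.
rows = [{"id": i, "is_fraud": i < 5} for i in range(1000)]
sample = undersample(rows)

fraud_kept = sum(r["is_fraud"] for r in sample)
good_kept = len(sample) - fraud_kept
print(fraud_kept, good_kept)
```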
53. © Feedzai Inc. Confidential.
Key Takeaways
Design APIs with comprehensive native fields but allow custom
data
Data Processing is 90% of Machine Learning
Must Have: full reproducibility of production behavior offline
and for training
Combine CEP and streaming with distributed batch processing
Combine Machine Learning with Human Intelligence
54.
MACHINE LEARNING
MISSION
Keep commerce safe and create a better customer experience through machine learning.
QUICK FACTS
• Top 50 High Growth startups in Europe, FASTEST GROWING startup in Portugal
• Founded by data scientists and aerospace engineers in 2009
• 120+ employees and doubling
• Offices in Portugal, Silicon Valley, New York City, London
• Series B funded by Oak HC/FT and Sapphire Ventures (SAP)
WHAT OTHERS SAY
• The U.S. market fraud prevention just got a new player.
• Feedzai’s machine learning is the next wave.
• Ranked as a cool technology to watch.
• Startups that are owning the data game.
• Payment Card Management: Essential tools for U.S. card issuers
INVESTORS
[Investor logos]
55. © 2016 Feedzai Confidential 55
Want to be a data Samurai?
We’re hiring! feedzai.com/about-us/careers
Editor's Notes
Feedzai’s mission is to keep commerce safe and to stop criminals from stealing money, whether through payment fraud, account takeover or other kinds of abuse