2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Node.js
+
RabbitMQ
Data Science
Pipeline
1. Fraud Detection
2. Search
3. Recommendations
4. Notifications
5. Ratings
6. Merchant Intelligence
7. Engagement Optimization
8. Marketing Optimization
9. App Personalization
10. Ad Network Support
11. Image / Speech Recognition

Node.js
+
RabbitMQ
Data Science
Pipeline
1. Fraud Detection
2. Search
3. Recommendations
4. Notifications
5. Ratings
Theory (Math, Algorithms)
Proof-of-Concept (R, Python, Scala, C++)
Spark Implementation (Scalability, Robustness)
Platform Integration

Node.js
+
RabbitMQ
Data Science
Pipeline
1. Fraud Detection
2. Search
3. Recommendations
4. Notifications
5. Ratings
Theory (Math, Algorithms)
Proof-of-Concept
?

[Diagram: platform architecture]
Transaction-grade APIs + MQs
Data Lake: HBase, Cassandra, OceanBase, etc.
Real-Time Layer / Batch Processing Layer
Components: Stream Processing, Batch Processing, Model Generator, Decision Engine
Flows: (context, event, data), (event), (data)
Steps: Feature Selection, Model Training, Model Evaluation, Model Assembly
Data Science Pipeline: 1. Fraud Detection, 2. Search, 3. Recommendations, 4. Notifications, 5. Ratings
Theory (Math, Algorithms)
Proof-of-Concept

[Diagram: platform architecture, repeated with one added label]
DevOps !!!

Data Science
for Fraud Detection
Armando Benitez - @jabenitez - @paytmlabs

armando@paytm.com - @jabenitez
Supervised learning vs Anomaly detection
Anomaly detection:
๏ Very small number of positive examples
๏ Large number of negative examples.
๏ Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
๏ Ideally large number of positive and negative examples.
๏ Enough positive examples for algorithm to get a sense of what positive examples are like; future positive examples likely to be similar to ones in training set.
* Anomaly Detection - Andrew Ng - Coursera ML Course

What approach to follow?
๏ Not so good: One model to rule them all
๏ Better:
๏ Many models competing against each other
๏ 100s or 1000s of rules running in parallel
๏ Know thy customer
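A minimal sketch of the "many rules" idea, assuming each rule is a small independent predicate over one transaction. The rule names, field names, and thresholds are hypothetical and not taken from the talk; the rules are evaluated one after another here, although nothing stops them from being distributed.

```python
from typing import Callable, Dict, List

Transaction = Dict[str, float]
Rule = Callable[[Transaction], bool]

# Hypothetical rules: field names and thresholds are illustrative assumptions.
def large_amount(tx: Transaction) -> bool:
    return tx["amount"] > 10_000

def high_velocity(tx: Transaction) -> bool:
    return tx["tx_last_hour"] > 20

def new_account(tx: Transaction) -> bool:
    return tx["account_age_days"] < 1

RULES: List[Rule] = [large_amount, high_velocity, new_account]

def fired_rules(tx: Transaction, rules: List[Rule] = RULES) -> List[str]:
    """Evaluate every rule independently; return the names of those that fire."""
    return [rule.__name__ for rule in rules if rule(tx)]

def is_suspicious(tx: Transaction, min_votes: int = 2) -> bool:
    """Flag the transaction when at least min_votes rules fire."""
    return len(fired_rules(tx)) >= min_votes
```

Because each rule only sees the transaction, rules can be added, removed, or re-weighted without touching the others.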

Feature Selection
๏ Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples
๏ Most common problem: comparable distributions for both normal and anomalous examples
๏ Possible solutions:
๏ Apply transformation and variable combinations:
๏ $x_{n+1} = (x_1 + x_4)^2 / x_3$
๏ Focus on variable ratios and transaction velocity
๏ Use deep learning for feature extraction
๏ Dimensionality reduction
๏ your solution here
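The variable combination and the transaction-velocity idea above can be sketched as follows. This is only an illustration: the function names, the use of timestamps in seconds, and the window definition are assumptions, and x_3 is assumed to be non-zero.

```python
from typing import List

def combined_feature(x: List[float]) -> float:
    """x_{n+1} = (x_1 + x_4)^2 / x_3, with 1-based indices as on the slide.

    Assumes x has at least four entries and x_3 is non-zero.
    """
    x1, x3, x4 = x[0], x[2], x[3]
    return (x1 + x4) ** 2 / x3

def transaction_velocity(timestamps: List[float], window_seconds: float) -> float:
    """Transactions per second within window_seconds of the latest timestamp."""
    if not timestamps:
        return 0.0
    latest = max(timestamps)
    recent = [t for t in timestamps if latest - t <= window_seconds]
    return len(recent) / window_seconds
```

Derived features like these are only useful if they separate the normal and anomalous distributions better than the raw variables do.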

Feature Selection

Feature Selection
[Plot: Counts vs. Variable X, with BKG and SIG distributions]

What we have tried
๏ Density estimator
๏ 2D Profiles
๏ Anomaly detection
๏ Clustering
๏ Model ensemble (Random forest)
๏ Deep learning (RBM)
๏ Logistic Regression
Combine

Gaussian distribution
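For reference, the standard univariate Gaussian density and its maximum-likelihood parameter estimates, written out under the usual notation (m training examples $x^{(1)}, \dots, x^{(m)}$; the symbol names are an assumption, following the Coursera course cited earlier):

```latex
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),
\qquad
\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)},
\qquad
\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu \right)^2
```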

Anomaly Detection* - Example
๏ Choose features, $x_i$, that are indicative of anomalous examples.
๏ Fit parameters to a normal distribution
๏ Given new example $x$, compute $p(x)$
๏ Anomaly if $p(x) < \varepsilon$, for a chosen threshold $\varepsilon$
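The steps above can be sketched in a few lines. This is a minimal version that assumes the features are independent, so that p(x) is a product of per-feature Gaussian densities, and that every feature has non-zero variance; the function names are illustrative and not from the talk.

```python
import math
from typing import List, Tuple

def fit_gaussian(rows: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Estimate a mean and a variance for each feature column."""
    m = len(rows)
    n = len(rows[0])
    mu = [sum(r[j] for r in rows) / m for j in range(n)]
    var = [sum((r[j] - mu[j]) ** 2 for r in rows) / m for j in range(n)]
    return mu, var

def density(x: List[float], mu: List[float], var: List[float]) -> float:
    """p(x) as a product of per-feature Gaussian densities (independence assumed)."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p

def is_anomaly(x: List[float], mu: List[float], var: List[float],
               epsilon: float) -> bool:
    """Flag x when its density falls below the threshold epsilon."""
    return density(x, mu, var) < epsilon
```

The threshold epsilon is not learned by the fit; it has to be chosen separately, for example on labeled validation data.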

Algorithm Evaluation
๏ Fit model on training set
๏ On a cross validation/test example, predict $y = 1$ if $p(x) < \varepsilon$ (anomaly), $y = 0$ otherwise
๏ Possible evaluation metrics:
๏ True positive, false positive, false negative, true negative
๏ Precision/Recall
๏ F1-score
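The metrics listed above can be computed directly from the confusion counts. A small sketch, assuming labels are encoded as 1 for anomalous and 0 for normal; the function names are illustrative.

```python
from typing import List, Tuple

def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating label 1 (anomalous) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1; each falls back to 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With very few positive examples, plain accuracy is dominated by the negatives, which is why precision, recall, and F1 are the more informative choices here.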

Implementation

Anomaly Detection*
Assume we have some labeled data, of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.
๏ Training set (assume normal examples/not anomalous)
๏ Cross validation set
๏ Test set
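One way to build the three sets is sketched below. The 60/20/20 split of normal examples and the even split of anomalies between cross validation and test are assumptions for illustration, not figures from the talk; the inputs are assumed to be shuffled already.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Labeled = Tuple[T, int]

def split_for_anomaly_detection(
    normal: Sequence[T], anomalous: Sequence[T], train_pct: int = 60
) -> Tuple[List[T], List[Labeled], List[Labeled]]:
    """Train on normal examples only; keep the anomalies for CV and test.

    Returns (train, cv, test). train is unlabeled (all assumed normal);
    cv and test hold (example, y) pairs with y = 0 normal, y = 1 anomalous.
    """
    n_train = len(normal) * train_pct // 100
    n_cv = (len(normal) - n_train) // 2
    train = list(normal[:n_train])
    cv_normal = normal[n_train:n_train + n_cv]
    test_normal = normal[n_train + n_cv:]
    half = len(anomalous) // 2
    cv = [(x, 0) for x in cv_normal] + [(x, 1) for x in anomalous[:half]]
    test = [(x, 0) for x in test_normal] + [(x, 1) for x in anomalous[half:]]
    return train, cv, test
```

Keeping anomalies out of the training set means the fitted density describes normal behaviour only, while the labeled anomalies remain available for tuning and evaluation.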

Transform, Normalize, Calculate

Scala

Creating Scalable
Architecture
Futures

The lake again
Lake Simcoe going on Lake Superior
Classic Lambda Architecture
Various Processing Frameworks
Near-Realtime Scoring/Alerting*

Fraud Capabilities and Technology
Capabilities:
A. Batch Ingest and Analysis of transaction data from Database
B. Batch Behavioural and Portfolio heuristic fraud detection
C. Near-realtime anomaly and heuristic fraud detection
D. Online Model Scoring
Technology:
A. Traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing
B. Model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage
C. Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup
D. JPMML/Spark Streaming for realtime model scoring

Our framework shopping list
iPython & Scala Notebooks
Explore & Train | Ingest, Store, Score, & Act
Spark ::Core ::MLlib ::Streaming ::GraphX?
Intercept with Storm? Spark Streaming?
Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3
OpenScoring? JPMML? R?

Fin
