Data
Ashwin Tumne
Node.js + RabbitMQ
Data Science Pipeline
1. Fraud Detection
2. Search
3. Recommendations
4. Notifications
5. Ratings
6. Merchant Intelligence
7. Engagement Optimization
8. Marketing Optimization
9. App Personalization
10. Ad Network Support
11. Image / Speech Recognition
Theory (Math, Algorithms)
Proof-of-Concept (R, Python, Scala, C++)
Spark Implementation (Scalability, Robustness)
Platform Integration
?
Transaction grade APIs + MQs
Data Lake
HBase, Cassandra, OceanBase, etc.
Stream Processing
Batch Processing
Model Generator
Decision Engine
(context, event, data)
(event)
(data)
Feature Selection
Model Training
Model Evaluation
Model Assembly
Real-Time Layer
Batch Processing Layer
DevOps !!!
Data Science for Fraud Detection
Armando Benitez - @jabenitez - @paytmlabs
armando@paytm.com - @jabenitez
Supervised learning vs Anomaly detection
Anomaly detection:
๏ Very small number of positive examples.
๏ Large number of negative examples.
๏ Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
๏ Ideally large number of positive and negative examples.
๏ Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.
* Anomaly Detection - Andrew Ng - Coursera ML Course
What approach to follow?
๏ Not so good: One model to rule them all
๏ Better:
๏ Many models competing against each other
๏ 100s or 1000s of rules running in parallel
๏ Know thy customer
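A minimal sketch of the "many rules running in parallel" idea, in Python. The rule names, thresholds, and transaction fields are hypothetical and chosen only for illustration; the slides do not say how the rules are implemented or scheduled.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rules: each takes a transaction dict and returns True when it fires.
RULES = {
    "large_amount": lambda t: t["amount"] > 10_000,
    "many_recent_txns": lambda t: t["txns_last_hour"] > 20,
    "new_account": lambda t: t["account_age_days"] < 1,
}

def evaluate_rules(txn, rules=RULES, max_workers=8):
    """Run every rule against one transaction concurrently.

    Returns the sorted names of the rules that fired.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(rule, txn) for name, rule in rules.items()}
        return sorted(name for name, fut in futures.items() if fut.result())

# Illustrative transaction (made-up values).
txn = {"amount": 25_000, "txns_last_hour": 3, "account_age_days": 0}
fired = evaluate_rules(txn)
```

Because each rule is an independent function of the transaction, adding rule number 1,001 does not require touching the other 1,000; the fired rule names can then be fed to whatever combines or ranks them.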
Feature Selection
๏ Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples
๏ Most common problem: comparable distributions for both normal and anomalous examples
๏ Possible solutions:
  ๏ Apply transformation and variable combinations:
    ๏ x_{n+1} = (x_1 + x_4)^2 / x_3
  ๏ Focus on variable ratios and transaction velocity
  ๏ Use deep learning for feature extraction
  ๏ Dimensionality reduction
  ๏ your solution here
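The derived-feature ideas above (variable combinations, ratios, velocity) can be sketched in a few lines of Python. The function names, the meaning of each input, and the one-hour window are assumptions for illustration; only the (x1 + x4)^2 / x3 combination comes from the slide.

```python
def combine(x1, x3, x4):
    """Variable combination from the slide: (x1 + x4)^2 / x3."""
    return (x1 + x4) ** 2 / x3

def amount_ratio(amount, typical_amount):
    """Ratio of this transaction's amount to a (hypothetical) typical amount."""
    return amount / typical_amount

def velocity(timestamps, window_seconds=3600):
    """Transactions per hour inside the trailing window ending at the latest timestamp.

    `timestamps` are seconds since an arbitrary epoch.
    """
    latest = max(timestamps)
    recent = [t for t in timestamps if latest - t <= window_seconds]
    return len(recent) / (window_seconds / 3600.0)

# Made-up inputs.
new_feature = combine(1.0, 2.0, 3.0)
ratio = amount_ratio(500.0, 100.0)
rate = velocity([0, 1000, 3000, 3500])
```

Ratios and rates like these tend to separate the normal and anomalous distributions better than raw values when the raw values overlap heavily.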
Feature Selection
[Figure: Counts vs. Variable X, with legend entries BKG and SIG]
What we have tried
๏ Density estimator
๏ 2D Profiles
๏ Anomaly detection
๏ Clustering
๏ Model ensemble (Random forest)
๏ Deep learning (RBM)
๏ Logistic Regression
Combine
Gaussian distribution
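For reference, the univariate Gaussian density in its standard form. The symbols (mean \mu, variance \sigma^2) are the conventional ones and are not taken from the slide, whose figure did not survive in this transcript.

```latex
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```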
Anomaly Detection* - Example
๏ Choose features, x_i, that are indicative of anomalous examples.
๏ Fit parameters to a normal distribution.
๏ Given a new example x, compute p(x).
๏ Anomaly if p(x) < ε, for a chosen threshold ε.
* Anomaly Detection - Andrew Ng - Coursera ML Course
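A minimal Python sketch of these steps, assuming the per-feature (independent) Gaussian model from the cited course: fit a mean and variance for each feature, multiply the per-feature densities to get p(x), and flag x when p(x) falls below a threshold. The training rows and the threshold value are made up for illustration.

```python
import math

def fit_gaussian(rows):
    """Per-feature mean and variance (maximum-likelihood estimate, dividing by m)."""
    m = len(rows)
    n = len(rows[0])
    mu = [sum(r[j] for r in rows) / m for j in range(n)]
    var = [sum((r[j] - mu[j]) ** 2 for r in rows) / m for j in range(n)]
    return mu, var

def density(x, mu, var):
    """p(x) as a product of independent univariate Gaussian densities.

    Assumes every variance is non-zero.
    """
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p

def is_anomaly(x, mu, var, epsilon):
    """Flag x as anomalous when its density falls below the threshold epsilon."""
    return density(x, mu, var) < epsilon

# Made-up training data: rows of (feature 1, feature 2) for normal behaviour.
train = [[10.0, 1.0], [12.0, 1.2], [11.0, 0.9], [9.0, 1.1], [13.0, 0.8]]
mu, var = fit_gaussian(train)
epsilon = 1e-3  # hypothetical threshold; in practice chosen on a cross validation set
```

With many features the product can underflow, so a production version would more likely sum log-densities and compare against log ε.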
Algorithm Evaluation
๏ Fit model on training set
๏ On a cross validation/test example x, predict y (1 if anomalous, 0 if normal)
๏ Possible evaluation metrics:
  ๏ True positive, false positive, false negative, true negative
  ๏ Precision/Recall
  ๏ F1-score
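The metrics listed above can be computed directly from the predicted and true labels. This is a plain-Python sketch with made-up labels, using y = 1 for anomalous and y = 0 for normal.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1; each is 0.0 when its denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels for eight cross validation examples.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Plain accuracy is a poor guide here because anomalies are rare: a model that always predicts "normal" scores high accuracy and zero recall, which is why precision, recall, and F1 are listed instead.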
Implementation
Extra Slides
Anomaly Detection*
Assume we have some labeled data, of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.
Training set: (assume normal examples/not anomalous)
Cross validation set:
Test set:
* Anomaly Detection - Andrew Ng - Coursera ML Course
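One way to build the three sets, sketched in Python: train on normal examples only, and divide the remaining normal examples plus all labeled anomalies between the cross validation and test sets. The 60% training fraction and the even CV/test split are assumptions for illustration; the set sizes on the slide did not survive in this transcript.

```python
import random

def split_for_anomaly_detection(normal, anomalous, train_frac=0.6, seed=0):
    """Return (train, cv, test).

    train: unlabeled normal examples only.
    cv, test: lists of (example, y) with y = 0 for normal, y = 1 for anomalous.
    """
    rng = random.Random(seed)
    normal = list(normal)
    anomalous = list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)

    n_train = int(len(normal) * train_frac)
    held_out = normal[n_train:]
    half_normal = len(held_out) // 2
    half_anom = len(anomalous) // 2

    train = normal[:n_train]
    cv = [(x, 0) for x in held_out[:half_normal]] + [(x, 1) for x in anomalous[:half_anom]]
    test = [(x, 0) for x in held_out[half_normal:]] + [(x, 1) for x in anomalous[half_anom:]]
    return train, cv, test

# Made-up data: 100 normal examples and 10 anomalous ones (integers stand in for feature vectors).
normal_examples = list(range(100))
anomalous_examples = list(range(1000, 1010))
train, cv, test = split_for_anomaly_detection(normal_examples, anomalous_examples)
```

Keeping anomalies out of the training set matters because the model is meant to describe normal behaviour; the labeled anomalies are only used to choose the threshold and to measure performance.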
Transform, Normalize, Calculate
Scala
Creating Scalable
Architecture
Futures
The lake again
Lake Simcoe going on Lake Superior
Classic Lambda Architecture
Various Processing Frameworks
Near-Realtime Scoring/Alerting*
Fraud Capabilities and Technology
Capabilities:
A. Batch Ingest and Analysis of transaction data from Database
B. Batch Behavioural and Portfolio heuristic fraud detection
C. Near-realtime anomaly and heuristic fraud detection
D. Online Model Scoring
Technology:
A. Traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing
B. Model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage
C. Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup
D. JPMML/Spark Streaming for realtime model scoring
Our framework shopping list
Explore & Train
Ingest, Store, Score, & Act
iPython & Scala Notebooks
Spark ::Core ::MLLib ::Streaming ::GraphX?
Intercept with Storm? Spark Streaming?
Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3
OpenScoring?
JPMML?
R?
Fin

2015 feb 24_paytm_labs_intro_ashwin_armandoadam

  • 2.
  • 4. Node.js + RabbitMQ. Data Science Pipeline: 1. Fraud Detection; 2. Search; 3. Recommendations; 4. Notifications; 5. Ratings; 6. Merchant Intelligence; 7. Engagement Optimization; 8. Marketing Optimization; 9. App Personalization; 10. Ad Network Support; 11. Image / Speech Recognition.
  • 5. Node.js + RabbitMQ. Data Science Pipeline: 1. Fraud Detection; 2. Search; 3. Recommendations; 4. Notifications; 5. Ratings; 6. Merchant Intelligence; 7. Engagement Optimization; 8. Marketing Optimization; 9. App Personalization; 10. Ad Network Support; 11. Image / Speech Recognition. Theory (Math, Algorithms); Proof-of-Concept (R, Python, Scala, C++); Spark Implementation (Scalability, Robustness); Platform Integration.
  • 6. Node.js + RabbitMQ. Data Science Pipeline: 1. Fraud Detection; 2. Search; 3. Recommendations; 4. Notifications; 5. Ratings; 6. Merchant Intelligence; 7. Engagement Optimization; 8. Marketing Optimization; 9. App Personalization; 10. Ad Network Support; 11. Image / Speech Recognition. Theory (Math, Algorithms); Proof-of-Concept (R, Python, Scala, C++); Spark Implementation (Scalability, Robustness); Platform Integration ?
  • 7. Transaction grade APIs + MQs; Data Lake; HBase, Cassandra, OceanBase, etc.; Stream Processing; Batch Processing; Model Generator; Decision Engine; (context, event, data); (event); (data); Feature Selection; Model Training; Model Evaluation; Model Assembly; Real-Time Layer; Batch Processing Layer. Data Science Pipeline: 1. Fraud Detection; 2. Search; 3. Recommendations; 4. Notifications; 5. Ratings; 6. Merchant Intelligence; 7. Engagement Optimization; 8. Marketing Optimization; 9. App Personalization; 10. Ad Network Support; 11. Image / Speech Recognition. Theory (Math, Algorithms); Proof-of-Concept (R, Python, Scala, C++); Spark Implementation (Scalability, Robustness); Platform Integration.
  • 8. Transaction grade APIs + MQs; Data Lake; Stream Processing; Batch Processing; Model Generator; Decision Engine; (context, event, data); (event); (data); Feature Selection; Model Training; Model Evaluation; Model Assembly; Real-Time Layer; Batch Processing Layer. Data Science Pipeline: 1. Fraud Detection; 2. Search; 3. Recommendations; 4. Notifications; 5. Ratings; 6. Merchant Intelligence; 7. Engagement Optimization; 8. Marketing Optimization; 9. App Personalization; 10. Ad Network Support; 11. Image / Speech Recognition. Theory (Math, Algorithms); Proof-of-Concept (R, Python, Scala, C++); Spark Implementation (Scalability, Robustness); Platform Integration; DevOps !!!; HBase, Cassandra, OceanBase, etc.
  • 9. Data Science for Fraud Detection Armando Benitez - @jabenitez - @paytmlabs
  • 10. armando@paytm.com - @jabenitez Supervised learning vs Anomaly detection
      Anomaly detection:
      ๏ Very small number of positive examples.
      ๏ Large number of negative examples.
      ๏ Many different “types” of anomalies. It is hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
      Supervised learning:
      ๏ Ideally a large number of positive and negative examples.
      ๏ Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.
      * Anomaly Detection - Andrew Ng - Coursera ML Course
  • 11. What approach to follow?
      ๏ Not so good: One model to rule them all
      ๏ Better:
      ๏ Many models competing against each other
      ๏ 100s or 1000s of rules running in parallel
      ๏ Know thy customer
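As a rough illustration of "100s or 1000s of rules running in parallel", the sketch below evaluates a small set of independent rules against one transaction using a thread pool. The rule names, thresholds, and transaction fields (`amount`, `tx_last_hour`, `account_age_days`) are all hypothetical; the slides do not say which rules or which execution engine were used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A rule takes a transaction (a dict of numeric fields) and returns True if it fires.
Rule = Callable[[Dict[str, float]], bool]

# Hypothetical rules and thresholds, for illustration only.
RULES: Dict[str, Rule] = {
    "large_amount": lambda tx: tx["amount"] > 10_000,
    "high_velocity": lambda tx: tx["tx_last_hour"] > 20,
    "new_account_big_spend": lambda tx: tx["account_age_days"] < 2 and tx["amount"] > 1_000,
}


def fired_rules(tx: Dict[str, float], rules: Dict[str, Rule] = RULES) -> List[str]:
    """Evaluate every rule on the transaction concurrently; return names of rules that fired."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda item: (item[0], item[1](tx)), rules.items())
        return sorted(name for name, hit in results if hit)


if __name__ == "__main__":
    tx = {"amount": 5_000, "tx_last_hour": 3, "account_age_days": 1}
    print(fired_rules(tx))
```

Because each rule is independent, adding a rule is a one-line change, and the set of fired rule names can itself be fed to a downstream model as features.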
  • 12. Feature Selection
      ๏ Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples
      ๏ Most common problem: comparable distributions for both normal and anomalous examples
      ๏ Possible solutions:
      ๏ Apply transformations and variable combinations: $x_{n+1} = (x_1 + x_4)^2 / x_3$
      ๏ Focus on variable ratios and transaction velocity
      ๏ Use deep learning for feature extraction
      ๏ Dimensionality reduction
      ๏ your solution here
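The feature ideas on this slide can be sketched in a few lines. The combined feature follows the slide's example, (x1 + x4)^2 / x3; the field names, the zero-denominator fallback, and the sliding-window definition of "transaction velocity" are assumptions made for illustration.

```python
from typing import Dict, List


def add_combined_feature(x: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of x with the slide's example feature (x1 + x4)^2 / x3 added."""
    out = dict(x)
    # Assumption: fall back to 0.0 when the denominator is zero.
    out["x_combined"] = (x["x1"] + x["x4"]) ** 2 / x["x3"] if x["x3"] != 0 else 0.0
    return out


def amount_ratio(amount: float, typical_amount: float) -> float:
    """Ratio of this transaction's amount to the customer's typical amount."""
    return amount / typical_amount if typical_amount > 0 else 0.0


def velocity(timestamps: List[float], now: float, window_s: float) -> int:
    """Number of transactions in the half-open window (now - window_s, now]."""
    return sum(1 for t in timestamps if now - window_s < t <= now)


if __name__ == "__main__":
    print(add_combined_feature({"x1": 1.0, "x3": 2.0, "x4": 3.0}))
    print(amount_ratio(500.0, 50.0))
    print(velocity([0.0, 50.0, 90.0, 100.0], now=100.0, window_s=60.0))
```

Ratios and velocities are relative to a customer's own history, which is one way to separate distributions that look comparable on raw values.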
  • 14. Feature Selection [Plot: Counts versus Variable X, with series labelled BKG and SIG]
  • 15. What we have tried
      ๏ Density estimator
      ๏ 2D Profiles
      ๏ Anomaly detection
      ๏ Clustering
      ๏ Model ensemble (Random forest)
      ๏ Deep learning (RBM)
      ๏ Logistic Regression
      Combine
  • 17. Anomaly Detection* - Example
      ๏ Choose features, x_i, that are indicative of anomalous examples.
      ๏ Fit parameters to a normal distribution
      ๏ Given new example, compute: [formula missing]
      ๏ Anomaly if [formula missing]
      * Anomaly Detection - Andrew Ng - Coursera ML Course
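The formulas on this slide were images and are not in the transcript. A minimal sketch of the standard formulation from the cited course, assuming independent per-feature Gaussians: estimate a mean and variance for each feature from normal examples, compute p(x) as the product of the per-feature densities, and flag an anomaly when p(x) falls below a threshold epsilon. The function names and the sample data are illustrative, and every feature is assumed to have nonzero variance.

```python
import math
from typing import List, Sequence, Tuple


def fit_gaussian(X: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Estimate per-feature mean and variance (maximum-likelihood, divide by m)."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    var = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, var


def density(x: Sequence[float], mu: Sequence[float], var: Sequence[float]) -> float:
    """p(x): product of univariate normal densities, one per feature."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p


def is_anomaly(x: Sequence[float], mu: Sequence[float], var: Sequence[float],
               epsilon: float) -> bool:
    """Flag x as anomalous when its density is below the threshold epsilon."""
    return density(x, mu, var) < epsilon


if __name__ == "__main__":
    normal_examples = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
    mu, var = fit_gaussian(normal_examples)
    print(is_anomaly([2.0, 12.0], mu, var, epsilon=1e-3))
    print(is_anomaly([10.0, 50.0], mu, var, epsilon=1e-3))
```

The threshold epsilon is a tuning parameter; it is usually chosen on a labeled cross validation set rather than fixed in advance.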
  • 18. Algorithm Evaluation
      ๏ Fit model on training set
      ๏ On a cross validation/test example, predict [formula missing]
      ๏ Possible evaluation metrics:
      ๏ True positive, false positive, false negative, true negative
      ๏ Precision/Recall
      ๏ F1-score
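The metrics listed on this slide can be computed directly from predicted and true labels. A small sketch, using the convention y = 1 for anomalous and y = 0 for normal; the function names and the zero-division fallbacks are choices made here, not taken from the slides.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives, with 1 = anomalous."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); each is 0.0 when its denominator is zero."""
    c = confusion_counts(y_true, y_pred)
    precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    print(confusion_counts(y_true, y_pred))
    print(precision_recall_f1(y_true, y_pred))
```

With so few positive examples, plain accuracy is misleading (predicting "normal" for everything scores well), which is why precision, recall, and F1 are the metrics of interest here.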
  • 21. Anomaly Detection*
      Assume we have some labeled data, of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.
      Training set: [formula missing] (assume normal examples/not anomalous)
      Cross validation set: [formula missing]
      Test set: [formula missing]
      * Anomaly Detection - Andrew Ng - Coursera ML Course
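One way to build the three sets described on this slide is sketched below. The training set receives only normal examples, and the labeled anomalies are divided between the cross validation and test sets. The 60/20/20 proportions for normal examples and the even division of anomalies are assumptions (a common convention), not figures from the slides; the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Labeled = List[Tuple[T, int]]


def split_for_anomaly_detection(
    examples: Sequence[T], labels: Sequence[int], seed: int = 0
) -> Tuple[List[T], Labeled, Labeled]:
    """Return (train, cv, test); train holds normal examples only (y = 0)."""
    normal = [x for x, y in zip(examples, labels) if y == 0]
    anomalous = [x for x, y in zip(examples, labels) if y == 1]
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(normal)
    rng.shuffle(anomalous)

    n_train = len(normal) * 6 // 10        # assumed 60% of normal examples
    n_cv = (len(normal) - n_train) // 2    # remainder shared by cv and test
    a_cv = len(anomalous) // 2             # anomalies divided between cv and test

    train = normal[:n_train]
    cv = [(x, 0) for x in normal[n_train:n_train + n_cv]] + [(x, 1) for x in anomalous[:a_cv]]
    test = [(x, 0) for x in normal[n_train + n_cv:]] + [(x, 1) for x in anomalous[a_cv:]]
    return train, cv, test


if __name__ == "__main__":
    examples = list(range(104))
    labels = [0] * 100 + [1] * 4
    train, cv, test = split_for_anomaly_detection(examples, labels)
    print(len(train), len(cv), len(test))
```

Keeping anomalies out of the training set lets the density model describe normal behaviour only, while the cross validation set is used to tune the threshold and the test set to report final metrics.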
  • 22. Transform, Normalize, Calculate
  • 25. The lake again: Lake Simcoe going on Lake Superior
      ๏ Classic Lambda Architecture
      ๏ Various Processing Frameworks
      ๏ Near-Realtime Scoring/Alerting*
  • 26. Fraud Capabilities and Technology
      A. Capability: Batch ingest and analysis of transaction data from database. Technology: traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing.
      B. Capability: Batch behavioural and portfolio heuristic fraud detection. Technology: model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage.
      C. Capability: Near-realtime anomaly and heuristic fraud detection. Technology: Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup.
      D. Capability: Online model scoring. Technology: JPMML/Spark Streaming for realtime model scoring.
  • 27. Our framework shopping list: iPython & Scala Notebooks; Explore & Train; Ingest, Store, Score, & Act; Spark (::Core, ::MLLib, ::Streaming, ::GraphX?); Intercept with Storm? Spark Streaming?; Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3; OpenScoring? JPMML? R?