Building a Real-Time Fraud
Prevention Engine Using Open
Source (Big Data) Software
Kees Jan de Vries
Booking.com
Who am I?
About me
• Physics PhD Imperial College
• Data Scientist at Booking.com for 1.5 years
▸ Security Department
• linkedin.com/in/kees-jan-de-vries-93767240
About Booking.com
• World leader in connecting travellers with the widest
variety of great places to stay
• Part of The Priceline Group, the world’s 3rd largest
e-commerce company (by market capitalisation)
• Employing 14,000 people in 180 countries
• Each day, over 1,200,000 room nights are reserved on
Booking.com
Contents.
• Motivation
• Running Example
▸ Probability to Book
• Real Time Prediction Engine: Lessons Learnt
▸ Aggregate Features
▸ Models Training and Deployment
▸ Interpretation of Individual Scores
Motivation
Motivation.
booking.com
Booking.com
• Empower people to experience the
world
Security
• Provide secure reservation system
(facilitating hundreds of thousands of
reservations on a daily basis)
Fraud Prevention
• Protect customers and property
partners, ensuring the best possible
experience
Motivation for this Talk.
Train awesome Model
Serve in Real Time
at Low Latency using
aggregate features
Running Example
Probability to Book
Disclaimer: although the running example
presented in these slides may seem realistic, it is
only intended to highlight some lessons learnt
about building a real time prediction engine.
Probability to Book.
Let’s book a hotel for
the Spark Summit
East 2017 in Boston
Probability to Book.
OK. Let’s click on the
first hit.
Probability to Book.
Nice. Here’s all the
information I need.
But maybe I’ll browse
a few more, to make
the best choice.
Let’s help the customer
make the best choice
for them!
Probability to Book.
We’ll calculate the
probability of booking
when the Customer
clicks on an
accommodation
Behind the Scenes.
[Diagram: a click sends the user id, page id, ... to the PredictionEngine, which returns an action]
Behind the Scenes.
Labels
• Booked? Yes/No
Features
• Simple
▸ Time of day
▸ ...
• Profile
▸ User
▸ Circumstantial
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Click!
Behind the Scenes.
[Diagram: on a click, the user id and other request fields are turned into features, including “aggregate features”; the model scores them and an action follows]
Aggregate Features
Aggregate Features.
[Diagram: timeline of requests and data streams (clicks, bookings)]
Aggregate Features.
[Diagram: per-user timelines of clicks, each click being a request]
• Clicks coincide with requests, so aggregation is preferably instant
• Small time window, so limited amount of data
Aggregate Features.
SELECT
COUNT(DISTINCT hotel_id)
FROM clicks.win:time(30 min)
GROUP BY user_id
In memory Complex Event Processing
• No lag: instant aggregation
• Scalability: see Esper website
• Not persistent
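The sliding-window count above can be illustrated outside Esper with a minimal in-memory sketch. It is not Esper's implementation; the class name `DistinctHotelsWindow` and its methods are hypothetical, and clicks are assumed to arrive in time order per user. Like the slide's engine, it keeps everything in memory and is not persistent.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 60  # 30-minute window, as in the query above


class DistinctHotelsWindow:
    """Hypothetical in-memory stand-in for a per-user time window."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        # user_id -> deque of (epoch, hotel_id), oldest first
        self._events = defaultdict(deque)

    def add_click(self, user_id, hotel_id, epoch):
        # Assumes clicks for a user arrive in non-decreasing epoch order.
        self._events[user_id].append((epoch, hotel_id))

    def distinct_hotels(self, user_id, now):
        events = self._events.get(user_id)
        if not events:
            return 0
        cutoff = now - self.window_seconds
        # Evict clicks that fell out of the window (now - 30 min, now].
        while events and events[0][0] <= cutoff:
            events.popleft()
        return len({hotel_id for _, hotel_id in events})
```

Because the window is short, each user's deque stays small, which is what makes the instant aggregation cheap.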
Aggregate Features.
In memory Complex Event Processing
From http://espertech.com/esper/faq_esper.php#streamprocessing
“Complex Event Processing and Esper are standing queries and latency to the answer is usually below 10us with more than 99% predictability.”
“The first component of scaling is the throughput that can be achieved running single-threaded. For Esper we think this number is very high and likely between 10k to 200k events per second.”
Aggregate Features.
[Diagram: per-user timelines of requests and bookings]
Very High Cardinality: features for every user who made at least one booking
Aggregate Features
High Cardinality Features with Cassandra
• Very scalable: reads & writes
• Non-instant aggregations
▸ Eventually consistent: replicas converge after a write, so a read can briefly return stale data
• Persistent
Aggregate Features
High Cardinality Features with Cassandra
From http://cassandra.apache.org/
● Proven
● Fault Tolerant
● Performant
● Scalable
● Elastic
● ...
“Some of the largest production
deployments include Apple's, with over
75,000 nodes storing over 10 PB of
data, Netflix (2,500 nodes, 420 TB, over
1 trillion requests per day), …”
Aggregate Features.
[Diagram: per-hotel-page timelines of requests and bookings/clicks]
• Medium Cardinality: limited by number of hotels
• Larger lag is ok with time window of 3 months and large number of events per page.
Aggregate Features
SELECT
page_id
, COUNT(*) AS res_count
FROM reservations
GROUP BY page_id
Low to Medium Cardinality
• No need to go “fancy”
• Batch processing
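As a rough illustration of the batch approach, the sketch below computes the "% booking per page view" feature from two event lists in plain Python. The function name, the `(page_id, epoch)` tuple layout, and the `since_epoch` cutoff are assumptions made for this example; in practice this would be a scheduled SQL job like the query above.

```python
from collections import Counter


def booking_rate_per_page(page_views, reservations, since_epoch):
    """Return {page_id: bookings / views} for events at or after since_epoch.

    page_views and reservations are iterables of (page_id, epoch) tuples
    (a hypothetical layout chosen for this sketch).
    """
    views = Counter(page for page, epoch in page_views if epoch >= since_epoch)
    bookings = Counter(page for page, epoch in reservations if epoch >= since_epoch)
    # Only pages with at least one view get a rate, so no division by zero.
    return {page: bookings[page] / n_views for page, n_views in views.items()}
```

Since a three-month window changes slowly, recomputing this once per batch run and serving the stored result is usually accurate enough.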
Models
Training and deployment
Aggregate Features for Training.
[Diagram: timeline of requests and data streams (clicks, reservations)]
Aggregate Features for Training.
SELECT
request.request_id
, COUNT(DISTINCT event.page_id)
FROM request
JOIN event ON
request.user_id = event.user_id
WHERE event.epoch BETWEEN
request.epoch - 30*60
AND request.epoch
GROUP BY request.request_id
Warning: stragglers
ahead! Distribute wisely!
Scalable Technologies
• Simple SQL
▸ Hive
• Facing stragglers
▸ Spark
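To make the training-time feature concrete, here is a small single-machine sketch of the same idea: for each historical request, count the distinct pages the user viewed in the 30 minutes up to that request. The function name and tuple layouts are assumptions for illustration only. Grouping events by user also hints at where stragglers come from: one very active user becomes one very large group.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def distinct_pages_last_30_min(requests, events, window_seconds=30 * 60):
    """requests: iterable of (request_id, user_id, epoch)
    events:   iterable of (user_id, epoch, page_id)
    Returns {request_id: number of distinct pages in [epoch - window, epoch]}.
    """
    by_user = defaultdict(list)
    for user_id, epoch, page_id in events:
        by_user[user_id].append((epoch, page_id))

    # Sort each user's events once so every request is a binary search.
    times_by_user = {}
    for user_id, rows in by_user.items():
        rows.sort()
        times_by_user[user_id] = [epoch for epoch, _ in rows]

    features = {}
    for request_id, user_id, epoch in requests:
        rows = by_user.get(user_id, [])
        times = times_by_user.get(user_id, [])
        lo = bisect_left(times, epoch - window_seconds)
        hi = bisect_right(times, epoch)
        features[request_id] = len({page for _, page in rows[lo:hi]})
    return features
```

Only events at or before the request time are counted, so the training feature matches what the real-time engine could have known at that moment.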
Model Training & Iteration
Train Model (y, X)
Evaluate
            Predict YES   Predict NO
Actual YES  TP            FN
Actual NO   FP            TN
Investigate
Improve Model
• Features
▸ Engineering
▸ Selection
• Hyperparameters
Sorry: H2O?
The benchmark shown on this slide can be found in:
http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/resources/h2odatasheet.html
More benchmarks:
https://github.com/szilard/benchm-ml
Time to Deploy
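from h2o.estimators.random_forest import H2ORandomForestEstimator
# Assumes h2o.init() has run and trainingDf / validationDf are H2OFrames
model = H2ORandomForestEstimator(ntrees=50, max_depth=10)
model.train(x=inputCols, y="label",
            training_frame=trainingDf,
            validation_frame=validationDf)
# Export the trained model as Java source (POJO) for scoring on the JVM
model.download_pojo("/my/model/path/")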
AWESOME!!!
POJO: Plain Old Java Object. Model export in plain Java code for real-time scoring on the JVM. Very fast!
Model Interpretation
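Score Interpretation.
Logistic Regression
• Score: $\beta \cdot x = -5.12\,x_{\text{feature 1}} + 13.9\,x_{\text{feature 2}} + \dots$
• Contribution of feature $i$: $\beta_i x_i$
Decision Trees
[Diagram: decision tree with root split "feature 15 > 9", child splits "feature 2 > 0.56" and "feature 4 > 200", and a further split "feature 11 > 6"; every split has yes/no branches and the nodes carry probabilities $p_0$ (root) to $p_8$]
• Contribution of feature $i$: $p_{\text{node}} - p_{\text{parent node}}$
• Example: $p_3 = p_0 + (p_1 - p_0) + (p_3 - p_1)$
▸ $p_0$: bias
▸ $(p_1 - p_0)$: feature 15 contribution
▸ $(p_3 - p_1)$: feature 2 contribution
Decision tree interpretation based on:
http://blog.datadive.net/interpreting-random-forests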
We hacked this
interpretation into Spark
ML during a Hackathon
at Booking:) H2O does
not offer it (yet?).
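The per-feature contributions for a single decision tree can be sketched in a few lines of Python. This is not the Spark ML hack mentioned above; the `Node` class, the `explain` function, and the probabilities used are hypothetical. The idea is only that walking from the root to a leaf and recording each change in node probability yields bias plus contributions equal to the final score.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    p: float                          # probability stored at this node
    feature: Optional[str] = None     # None marks a leaf
    threshold: float = 0.0
    yes: Optional["Node"] = None      # branch taken when x[feature] > threshold
    no: Optional["Node"] = None


def explain(root: Node, x: Dict[str, float]) -> Tuple[float, Dict[str, float], float]:
    """Return (bias, contributions per feature, final score) for one sample."""
    contributions: Dict[str, float] = {}
    node = root
    while node.feature is not None:
        child = node.yes if x[node.feature] > node.threshold else node.no
        # Contribution of the split feature: p_node - p_parent_node.
        contributions[node.feature] = (
            contributions.get(node.feature, 0.0) + (child.p - node.p)
        )
        node = child
    return root.p, contributions, node.p
```

For a forest, one would average the bias and the contributions over all trees; that averaging step is left out of this sketch.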
Probability to Book.
The probability of
booking is high; largest
contribution from
#distinct property
pages. Let’s show a
summary!
Probability to Book.
Awesome! This app
really helps me book!
Summary
Summary.
• How to make Aggregate Features available?
▸ Long/short time windows
▸ High and Low cardinality dimensions
• Model Training and Deployment
▸ H2O POJO for fast Real-Time scoring
• Interpretation
▸ Contributions per Feature per Score
Thank You.
Questions?
We’re hiring!
Kees Jan de Vries from Booking.com
linkedin.com/in/kees-jan-de-vries-93767240

More Related Content

What's hot

Introduction to Knowledge Graphs and Semantic AI
Introduction to Knowledge Graphs and Semantic AIIntroduction to Knowledge Graphs and Semantic AI
Introduction to Knowledge Graphs and Semantic AI
Semantic Web Company
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
James Serra
 
Büyük veri(bigdata)
Büyük veri(bigdata)Büyük veri(bigdata)
Büyük veri(bigdata)
Hülya Soylu
 
10 Key Considerations for AI/ML Model Governance
10 Key Considerations for AI/ML Model Governance10 Key Considerations for AI/ML Model Governance
10 Key Considerations for AI/ML Model Governance
QuantUniversity
 
Enterprise Knowledge Graph
Enterprise Knowledge GraphEnterprise Knowledge Graph
Enterprise Knowledge Graph
Benjamin Raethlein
 
Introdução à Neo4j
Introdução à Neo4j Introdução à Neo4j
Introdução à Neo4j
Neo4j
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Introduction to Knowledge Graphs
Introduction to Knowledge GraphsIntroduction to Knowledge Graphs
Introduction to Knowledge Graphs
mukuljoshi
 
Neo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j – The Fastest Path to Scalable Real-Time AnalyticsNeo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j
 
Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance
Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance
Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance
BigID Inc
 
Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ Netflix
Michelle Ufford
 
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality
Building an Enterprise Knowledge Graph @Uber: Lessons from RealityBuilding an Enterprise Knowledge Graph @Uber: Lessons from Reality
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality
Joshua Shinavier
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
DATAVERSITY
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
LibbySchulze
 
Regulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with TransparencyRegulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with Transparency
Debmalya Biswas
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
John Yeung
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
Amazon Web Services
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan Wang
Databricks
 
Open Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasOpen Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache Atlas
DataWorks Summit
 
Fraud Detection and Prevention on AWS using Machine Learning
Fraud Detection and Prevention on AWS using Machine LearningFraud Detection and Prevention on AWS using Machine Learning
Fraud Detection and Prevention on AWS using Machine Learning
Amazon Web Services
 

What's hot (20)

Introduction to Knowledge Graphs and Semantic AI
Introduction to Knowledge Graphs and Semantic AIIntroduction to Knowledge Graphs and Semantic AI
Introduction to Knowledge Graphs and Semantic AI
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
 
Büyük veri(bigdata)
Büyük veri(bigdata)Büyük veri(bigdata)
Büyük veri(bigdata)
 
10 Key Considerations for AI/ML Model Governance
10 Key Considerations for AI/ML Model Governance10 Key Considerations for AI/ML Model Governance
10 Key Considerations for AI/ML Model Governance
 
Enterprise Knowledge Graph
Enterprise Knowledge GraphEnterprise Knowledge Graph
Enterprise Knowledge Graph
 
Introdução à Neo4j
Introdução à Neo4j Introdução à Neo4j
Introdução à Neo4j
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Introduction to Knowledge Graphs
Introduction to Knowledge GraphsIntroduction to Knowledge Graphs
Introduction to Knowledge Graphs
 
Neo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j – The Fastest Path to Scalable Real-Time AnalyticsNeo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j – The Fastest Path to Scalable Real-Time Analytics
 
Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance
Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance
Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance
 
Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ Netflix
 
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality
Building an Enterprise Knowledge Graph @Uber: Lessons from RealityBuilding an Enterprise Knowledge Graph @Uber: Lessons from Reality
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Regulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with TransparencyRegulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with Transparency
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan Wang
 
Open Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasOpen Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache Atlas
 
Fraud Detection and Prevention on AWS using Machine Learning
Fraud Detection and Prevention on AWS using Machine LearningFraud Detection and Prevention on AWS using Machine Learning
Fraud Detection and Prevention on AWS using Machine Learning
 

Viewers also liked

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Spark Summit
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
Spark Summit
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
Databricks
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Spark Summit
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
Spark Summit
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Spark Summit
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Spark Summit
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Spark Summit
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark Summit
 
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Spark Summit
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Spark Summit
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Spark Summit
 
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Spark Summit
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Spark Summit
 
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Spark Summit
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Spark Summit
 
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Spark Summit
 

Viewers also liked (20)

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software: Spark Summit East talk by Kees Jan de Vries

  • 1. Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software Kees Jan de Vries Booking.com
  • 2. Who am I? About me • Physics PhD Imperial College • Data Scientist at Booking.com for 1.5 years ▸ Security Department • linkedin.com/in/kees-jan-de-vries-93767240 About Booking.com • World leader in connecting travellers with the widest variety of great places to stay • Part of The Priceline Group, the world’s 3rd largest e-commerce company (by market capitalisation) • Employing 14,000 people in 180 countries • Each day, over 1,200,000 room nights are reserved on Booking.com
  • 3. Contents. • Motivation • Running Example ▸ Probability to Book • Real Time Prediction Engine: Lessons Learnt ▸ Aggregate Features ▸ Models Training and Deployment ▸ Interpretation of Individual Scores
  • 6. Motivation. booking.com Booking.com • Empower people to experience the world Security • Provide secure reservation system (facilitating hundreds of thousands of reservations on a daily basis) Fraud Prevention • Protect customers and property partners, ensuring the best possible experience
  • 7. Motivation for this Talk. Train awesome Model Serve in Real Time at Low Latency using aggregate features
  • 9. Disclaimer: although the running example presented in these slides may seem realistic, it is only intended to highlight some lessons learnt about building a real time prediction engine.
  • 10. Probability to Book. Let’s book a hotel for the Spark Summit East 2017 in Boston
  • 11. Probability to Book. OK. Let’s click on the first hit.
  • 12. Probability to Book. Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice.
  • 13. Probability to Book. Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice. Let’s help the customer make the best choice for them!
  • 14. Probability to Book. We’ll calculate the probability of booking when the Customer clicks on an accommodation. Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice.
  • 15. Behind the Scenes. User id Page id ... Action PredictionEngine Click!
  • 16. Behind the Scenes. Labels • Booked? Yes/No Features • Simple ▸ Time of day ▸ ... • Profile ▸ User ▸ Circumstantial Click!
  • 17. Behind the Scenes. Labels • Booked? Yes/No Features • Simple ▸ Time of day ▸ ... • Profile ▸ User ▸ Circumstantial Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Click!
  • 19. Behind the Scenes. Features Model Action User id ... Click! Aggregate features “ ” Action
  • 21. Aggregate Features. Time Requests Data Streams Click Booking
  • 22. Aggregate Features. Time Requests Data Streams Click Booking Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 23. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests)
  • 24. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests) Clicks coincide with requests, so aggregation is preferably instant
  • 25. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests) Clicks coincide with requests, so aggregation is preferably instant Small time window, so limited amount of data
  • 26. Aggregate Features. SELECT COUNT(DISTINCT hotel_id) FROM clicks.win:time(30 min) GROUP BY user_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... In-memory Complex Event Processing • No lag: instant aggregation • Scalability: see Esper website • Not persistent
  • 27. Aggregate Features. SELECT COUNT(DISTINCT hotel_id) FROM clicks.win:time(30 min) GROUP BY user_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... From http://espertech.com/esper/faq_esper.php#streamprocessing “Complex Event Processing and Esper are standing queries and latency to the answer is usually below 10us with more than 99% predictability.” “The first component of scaling is the throughput that can be achieved running single-threaded. For Esper we think this number is very high and likely between 10k to 200k events per second.” In-memory Complex Event Processing
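
The 30-minute distinct-hotel count in the query above can be illustrated outside Esper with a small in-memory sliding window. This is only a sketch of the idea, not how Esper implements it: the class and method names are hypothetical, and it assumes clicks for a given user arrive in non-decreasing timestamp order (timestamps in seconds).

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 60  # 30-minute window, as in the query on the slide


class DistinctHotelsWindow:
    """Per-user count of distinct hotels clicked within a sliding time window."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        # user_id -> deque of (timestamp, hotel_id), oldest first
        self.events = defaultdict(deque)
        # user_id -> {hotel_id: number of clicks still inside the window}
        self.counts = defaultdict(lambda: defaultdict(int))

    def _expire(self, user_id, now):
        """Drop clicks that are window_seconds old or older."""
        events = self.events[user_id]
        counts = self.counts[user_id]
        while events and events[0][0] <= now - self.window_seconds:
            _, hotel_id = events.popleft()
            counts[hotel_id] -= 1
            if counts[hotel_id] == 0:
                del counts[hotel_id]

    def add_click(self, user_id, hotel_id, timestamp):
        """Register a click; the aggregate is up to date immediately afterwards."""
        self._expire(user_id, timestamp)
        self.events[user_id].append((timestamp, hotel_id))
        self.counts[user_id][hotel_id] += 1

    def distinct_hotels(self, user_id, now):
        """Number of distinct hotels the user clicked on within the window."""
        self._expire(user_id, now)
        return len(self.counts[user_id])
```

Because the window is short, each user holds only a handful of events, which is what makes keeping this state in memory practical. As on the slide, nothing here is persistent: the state is lost if the process restarts.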
  • 28. Aggregate Features. Users Time ... Requests Bookings Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 29. Aggregate Features. Users Time ... Requests Bookings Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Very High Cardinality: features for every user who made at least one booking
  • 30. Aggregate Features. High Cardinality Features with Cassandra • Very scalable: reads & writes • Non-instant aggregations ▸ Consistency fundamentally bound by gossip protocol • Persistent Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 31. Aggregate Features. High Cardinality Features with Cassandra. From http://cassandra.apache.org/: ● Proven ● Fault Tolerant ● Performant ● Scalable ● Elastic ● ... “Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), …” Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 34. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Medium Cardinality: limited by number of hotels Larger lag is ok with time window of 3 months and large number of events per page.
  • 35. Aggregate Features. Low to Medium Cardinality • No need to go “fancy” • Batch processing: SELECT page_id, COUNT(*) AS res_count FROM reservations GROUP BY page_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
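The batch aggregate for the hotel-page feature can be sketched in a few lines of Python. This is only an illustration of the "% booking per page view" computation; the function name `booking_rate_per_page` and the input shapes are assumptions, and restricting both inputs to the last 3 months is assumed to happen upstream.

```python
from collections import Counter


def booking_rate_per_page(page_views, reservations):
    """Return {page_id: bookings per page view, as a percentage}.

    page_views and reservations are iterables of page_id values, one entry
    per view or per reservation (hypothetical input format).
    """
    views = Counter(page_views)
    bookings = Counter(reservations)
    # Only pages with at least one view get a rate, so no division by zero.
    return {
        page_id: 100.0 * bookings[page_id] / n_views
        for page_id, n_views in views.items()
    }
```

For example, a page with four views and one reservation gets a rate of 25.0. Since the number of pages is bounded by the number of hotels, a periodic job that recomputes this table is usually sufficient.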
  • 37. Aggregate Features for Training. Time Requests Data Streams Click Reservation Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 39. Aggregate Features for Training. SELECT request.request_id, COUNT(DISTINCT event.page_id) FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch - 30*60 AND request.epoch GROUP BY request.request_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 40. Aggregate Features for Training. SELECT request.request_id, COUNT(DISTINCT event.page_id) FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch - 30*60 AND request.epoch GROUP BY request.request_id Warning: stragglers ahead! Distribute wisely! Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 41. Aggregate Features for Training. SELECT request.request_id, COUNT(DISTINCT event.page_id) FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch - 30*60 AND request.epoch GROUP BY request.request_id Scalable Technologies • Simple SQL ▸ Hive • Facing stragglers ▸ Spark
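For training, the same feature has to be rebuilt as it would have looked at the moment of each historical request. The sketch below does this in plain Python, looking back 30 minutes from the request time to match the "last 30 minutes" feature definition. The function name and the tuple layouts are assumptions for illustration; a real pipeline would run the equivalent join in Hive or Spark.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def distinct_pages_before_requests(requests, events, window_seconds=30 * 60):
    """Return {request_id: number of distinct pages viewed in the window}.

    requests: iterable of (request_id, user_id, epoch)   (assumed layout)
    events:   iterable of (user_id, epoch, page_id)      (assumed layout)
    The window is [epoch - window_seconds, epoch], inclusive on both ends,
    mirroring SQL BETWEEN.
    """
    # Group events by user, the same key the SQL join uses.
    events_by_user = defaultdict(list)
    for user_id, epoch, page_id in events:
        events_by_user[user_id].append((epoch, page_id))

    # Sort each user's events by time once, so windows can be found by bisect.
    epochs_by_user = {}
    pages_by_user = {}
    for user_id, user_events in events_by_user.items():
        user_events.sort(key=lambda item: item[0])
        epochs_by_user[user_id] = [epoch for epoch, _ in user_events]
        pages_by_user[user_id] = [page for _, page in user_events]

    features = {}
    for request_id, user_id, epoch in requests:
        epochs = epochs_by_user.get(user_id, [])
        pages = pages_by_user.get(user_id, [])
        lo = bisect_left(epochs, epoch - window_seconds)
        hi = bisect_right(epochs, epoch)
        features[request_id] = len(set(pages[lo:hi]))
    return features
```

Grouping by `user_id` also hints at where stragglers may come from: a user with a very large number of events concentrates work on a single key, so how that key is distributed across workers matters.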
  • 42. Model Training & Iteration. Train Model (features X, labels y) • Evaluate: confusion matrix ▸ Actual YES: Predict YES = TP, Predict NO = FN ▸ Actual NO: Predict YES = FP, Predict NO = TN • Investigate • Improve Model ▸ Features: Engineering, Selection ▸ Hyperparameters
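The evaluation step of this loop can be illustrated with a small Python sketch that fills the confusion matrix from actual and predicted labels and derives precision and recall from it. The function names are hypothetical and the talk does not say which metrics were used; precision and recall are shown only as common examples.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP and TN from two equal-length sequences of booleans."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for is_positive, predicted_positive in zip(actual, predicted):
        if is_positive and predicted_positive:
            counts["TP"] += 1
        elif is_positive:
            counts["FN"] += 1
        elif predicted_positive:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); a ratio is None when its denominator is 0."""
    predicted_yes = counts["TP"] + counts["FP"]
    actual_yes = counts["TP"] + counts["FN"]
    precision = counts["TP"] / predicted_yes if predicted_yes else None
    recall = counts["TP"] / actual_yes if actual_yes else None
    return precision, recall
```

Tracking these numbers across iterations gives a simple check on whether a feature-engineering or hyperparameter change actually improved the model.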
  • 43. Sorry: H2O? The benchmark shown on this slide can be found in: http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/resources/h2odatasheet.html More benchmarks: https://github.com/szilard/benchm-ml
  • 44. Time to Deploy. model = H2ORandomForestEstimator(ntrees=50, max_depth=10) model.train(x=inputCols, y="label", training_frame=trainingDf, validation_frame=validationDf) model.download_pojo("/my/model/path/")
  • 45. Time to Deploy. model = H2ORandomForestEstimator(ntrees=50, max_depth=10) model.train(x=inputCols, y="label", training_frame=trainingDf, validation_frame=validationDf) model.download_pojo("/my/model/path/") AWESOME!!! POJO: Plain Old Java Object. Model export in plain Java code for real-time scoring on the JVM. Very fast!
  • 47. Score Interpretation. Logistic Regression: $\beta \cdot x = -5.12 \times \text{feature 1} + 13.9 \times \text{feature 2} + \dots$ Contribution Feature i: $\beta_i x_i$ Decision Trees: splits feature 15 > 9 (yes/no), feature 2 > 0.56 (yes/no), feature 4 > 200 (yes/no), feature 11 > 6 (yes/no); node probabilities $p_0, p_1, \dots, p_8$ Contribution Feature i: $p_{\text{node}} - p_{\text{parent node}}$ Example: $p_3 = p_0 + (p_1 - p_0) + (p_3 - p_1)$ ▸ $p_0$: bias ▸ $(p_1 - p_0)$: feature 15 contribution ▸ $(p_3 - p_1)$: feature 2 contribution Decision tree interpretation based on: http://blog.datadive.net/interpreting-random-forests
  • 48. Score Interpretation. Logistic Regression: $\beta \cdot x = -5.12 \times \text{feature 1} + 13.9 \times \text{feature 2} + \dots$ Contribution Feature i: $\beta_i x_i$ Decision Trees: splits feature 15 > 9 (yes/no), feature 2 > 0.56 (yes/no), feature 4 > 200 (yes/no), feature 11 > 6 (yes/no); node probabilities $p_0, p_1, \dots, p_8$ Contribution Feature i: $p_{\text{node}} - p_{\text{parent node}}$ Decision tree interpretation based on: http://blog.datadive.net/interpreting-random-forests We hacked this interpretation into Spark ML during a Hackathon at Booking :) H2O does not offer it (yet?).
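Both decompositions are easy to sketch in Python. The code below is an illustration only, not the Spark ML hack mentioned on the slide. The function names, the path representation and all numeric probabilities are hypothetical. For a tree, the prediction at a leaf is written as the root probability (bias) plus one term per split on the path, each credited to the feature that was split on.

```python
def logistic_contributions(coefficients, features):
    """Contribution of feature i to the logit: coefficient_i * value_i."""
    return {name: coefficients[name] * value for name, value in features.items()}


def tree_path_contributions(path):
    """Decompose a leaf probability along the root-to-leaf path.

    path is a list of (split_feature, node_probability) pairs, root first.
    The root entry has split_feature None; every later entry names the
    feature whose split led from the parent to that node.
    Returns (bias, {feature: summed contribution}).
    """
    bias = path[0][1]
    contributions = {}
    for (_, parent_p), (feature, node_p) in zip(path, path[1:]):
        # A feature used more than once on the path accumulates its terms.
        contributions[feature] = contributions.get(feature, 0.0) + (node_p - parent_p)
    return bias, contributions


# Hypothetical probabilities for the path p0 -> p1 -> p3 from the slide.
example_path = [(None, 0.30), ("feature 15", 0.45), ("feature 2", 0.80)]
bias, contributions = tree_path_contributions(example_path)
# bias + sum of contributions reproduces the leaf probability p3.
reconstructed = bias + sum(contributions.values())
```

For a random forest, the same decomposition can be averaged over all trees, which is the approach described in the blog post the slide cites.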
  • 49. Probability to Book. The probability of booking is high; largest contribution from #distinct property pages. Let’s show a summary!
  • 50. Probability to Book. Awesome! This app really helps me book! The probability of booking is high; largest contribution from #distinct property pages. Let’s show a summary!
  • 52. Summary. • How to make Aggregate Features available? ▸ Long/short time windows ▸ High and Low cardinality dimensions • Model Training and Deployment ▸ H2O POJO for fast Real-Time scoring • Interpretation ▸ Contributions per Feature per Score
  • 53. Thank You. Questions? We’re hiring! Kees Jan de Vries from Booking.com linkedin.com/in/kees-jan-de-vries-93767240