Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Roman Studenikin & Ben Teeuwen
Booking.com | Priceline Group
PRODUCTIONIZING BEHAVIOURAL FEATURES FOR
MACHINE LEARNING WIT...
#EUstr4
2014: JoinedBooking.com
Data analysis through Perl
Push for R & python
My first big ML project
Timeline
#EUstr4
Mid 2015: Validating real-time service
Online predictions
Feature inputs
Early 2015: Towards production
R randomFo...
#EUstr4
Today
Integration of components into Feature Store
Trainings, trainings & trainings
2016/2017 continuedDS toolkit ...
#EUstr4
Objectives
Re-usageof DS productsReduceskew between
offline DS world & online
DEV-guarded world
Speed up experimen...
#EUstr4
Our tooling
#EUstr4
Software development Data science
Business
case
Get data
Train a
model
Deploy
and monitor
#EUstr4
What can be improved?
Business
case
Get labeled
instances
Construct
offline
features
Train a
model
Deploy
and moni...
#EUstr4
Reproducible
Business
case Get labeled
instances
Construct
offline
features
Train a
model
Deploy
and monitor
Const...
#EUstr4
● FeatureVader - registry of all features
● Spark ML Transformer
data = spark.sql(....)
fv = FeatureVader()
fv.set...
#EUstr4
Train a model
#EUstr4
Construct online features
Data Features
Feature reads
#EUstr4
Online architecture
#EUstr4
Stable pipeline - at least once semantics
#EUstr4
Reusable features
#EUstr4
Reduce Training-Serving skew
#EUstr4
Can be done by Data Scientist
#EUstr4
Online feature encoder
Deployment tooling
Grafana dashboard
Bosun alerts
#EUstr4
Automated, standardised
Business
case Get labeled
instances
Construct
offline
features
Train a
model
Deploy
and mo...
#EUstr4
Learnings
Yes, it is possible =)
Offline (simple) feature construction is still easier.
Give data, not tooling.
St...
Eight stunning
photos.
roman.studenikin@booking.com ben.teeuwen@booking.com
Upcoming SlideShare
Loading in …5
×

Productionizing Behavioural Features for Machine Learning with Apache Spark Streaming with Ben Teeuwen and Roman Studenikin

1,392 views

Published on

We are using Spark Streaming for building online Machine Learning(ML) features that are used in Booking.com for real-time prediction of behaviour and preferences of our users, demand for hotels and improve processes in customer support. Our initial set of goals was to speedup experimentation with real-time features, make features reusable by Data Scientists (DS) within the company and reduce training/serving data skew problem. The tooling that we’ve built and integrated into company’s infrastructure simplifies development of new features to the level that online feature collection can be implemented and deployed into production by DS with very little or no help from developers. That makes this approach scalable and allows us to iterate fast. We use Kafka as a streaming source of real-time events from the website as well as other sources and with connectivity to Cassandra and Hive we were able to make data more consistent between training and serving phases of ML pipelines. Our key takeaways: – It is possible to design production pipelines in a way that allows DS to build and deploy them without help of a developer. – Constructing online features is a much more complex job than offline construction and business-wise it is not always a priority to invest into their construction even if they are proven to be beneficial to the model performance. We plan to invest further into development of pipelines with Spark Streaming and are happy to see that support for streaming operations in Spark evolves in right direction.

Published in: Data & Analytics
  • Be the first to comment

Productionizing Behavioural Features for Machine Learning with Apache Spark Streaming with Ben Teeuwen and Roman Studenikin

  1. 1. Roman Studenikin & Ben Teeuwen Booking.com | Priceline Group PRODUCTIONIZING BEHAVIOURAL FEATURES FOR MACHINE LEARNING WITH SPARK STREAMING #EUstr4
  2. 2. #EUstr4 2014: JoinedBooking.com Data analysis through Perl Push for R & python My first big ML project Timeline
  3. 3. #EUstr4 Mid 2015: Validating real-time service Online predictions Feature inputs Early 2015: Towards production R randomForest ported Counter-based features implemented > Late 2015: Experimentation .. profit!
  4. 4. #EUstr4 Today Integration of components into Feature Store Trainings, trainings & trainings 2016/2017 continuedDS toolkit expansion Spark, Kafka, Spark Streaming, Cassandra Today
  5. 5. #EUstr4 Objectives Re-usageof DS productsReduceskew between offline DS world & online DEV-guarded world Speed up experimentation cycle with product owners Autonomy forDS (fueled by scarcity of DEV capacity) Objectives
  6. 6. #EUstr4 Our tooling
  7. 7. #EUstr4 Software development Data science Business case Get data Train a model Deploy and monitor
  8. 8. #EUstr4 What can be improved? Business case Get labeled instances Construct offline features Train a model Deploy and monitor Construct online features
  9. 9. #EUstr4 Reproducible Business case Get labeled instances Construct offline features Train a model Deploy and monitor Construct online features + Quality control + Best dev practices
  10. 10. #EUstr4 ● FeatureVader - registry of all features ● Spark ML Transformer data = spark.sql(....) fv = FeatureVader() fv.setNeededColumns([“feature1”, “feature2”]) withFeatures = fv.transform(data) Construct offline features userId unixTime label 1001 1508083925 1 1002 1508189936 0 userId unixTime label feature1 feature2 1001 1508083925 1 123.2 2 kids 1002 1508189936 0 213.5 0 kids
  11. 11. #EUstr4 Train a model
  12. 12. #EUstr4 Construct online features Data Features Feature reads
  13. 13. #EUstr4 Online architecture
  14. 14. #EUstr4 Stable pipeline - at least once semantics
  15. 15. #EUstr4 Reusable features
  16. 16. #EUstr4 Reduce Training-Serving skew
  17. 17. #EUstr4 Can be done by Data Scientist
  18. 18. #EUstr4 Online feature encoder Deployment tooling Grafana dashboard Bosun alerts
  19. 19. #EUstr4 Automated, standardised Business case Get labeled instances Construct offline features Train a model Deploy and monitor Construct online features Manual Partly available Partly available Templated Automated
  20. 20. #EUstr4 Learnings Yes, it is possible =) Offline (simple) feature construction is still easier. Give data, not tooling. Start building your vision from something
  21. 21. Eight stunning photos. roman.studenikin@booking.com ben.teeuwen@booking.com

×