SlideShare a Scribd company logo
1 of 26
Download to read offline
Founded in 2010, TravelBird’s
focus is to bring back the joy of
travel by providing inspiration to
explore and simplicity in
discovering new destinations.
Active in eleven markets across
Europe and inspiring three
million travelers daily via email,
web, and mobile app.
Our Values
Inspiring
Prompting you to visit a place you’d
never thought about before.
Curated & local
Proudly introducing travellers to the
very best their destinations have to
offer, with insider tips and local
insight.
Simple & easy
Taking care of the core elements of
your journey, and there for you
every step of the way.
Our Team’s Role: Applying Data to Solve Problems
● Invoicing and liability risk modeling
● Marketing budgeting/attribution management
● CRM + personalization
● Email channel management
● Business intelligence
● Data gathering + enrichment (ETL)
● Data warehousing / Big data analytics
And all done in Python
Our Architecture (Overall)
● Fully AWS hosted
● Mixture of permanent hosts, auto-scaled,
and dynamically launched (ex for ML jobs)
● Production is built in Django + MySQL
● Data Science architecture (interesting stuff
in red) is:
○ Postgres + Vertica for databases
○ Kinesis for event buffering
○ Spark for ML
○ Airflow + Rundeck for scheduling
○ Redis for RT data
○ S3 + HDFS + GFS for storage
And Python for EVERYTHING
The Why-thon of Python
● We use it in production, so any dev can work on our data stack
● Best libraries available for ML/DL, visualization, data
integration/transformation, anything you want to do with data
● It works for EVERYTHING machine learning, even with big data, so allows
our data scientists to do data engineering as well
● It’s fast enough, and good hardware is cheaper than wasted dev
time/resources
The event pipeline architecture
Python at the heart of it
● Python, python and only python
● Benefit from its great ecosystem (uwsgi, supervisord, Flask, 0mq, boto3, click, etc)
● Some design patterns:
○ Pipelining down to the lowest level (use queues and monitor them)
○ JSON, JSON everywhere
○ Exploit polymorphism
○ Processes, not threads
○ And most of all: keep it lean, easier to understand, easier to maintain
● Building the bibrary: it’s got the BEST modules
○ Abstract from low level AWS details
○ Utilities
○ Centralized configuration
○ Event library
The event library: a standard (2)
The event library: a standard (3)
The event library also standardizes the life
cycle:
● Decoding
● (De)serializing
● Processing
Easy to store, easy to move around, easy to
work with.
Deploy, Test, Quality Assurance
Deployment Testing Monitoring Logging
Our nightly ML job chain (one of many!)
● 93 tasks consisting of:
○ Creation of Spark clusters on spot
workers
○ (Lots of) Spark models
○ Keras models on deployed spot
workers (we LOVE spot!)
○ Database queries and data
aggregations
○ Output merges S3->DWH
● The beauty of python tooling? This is built
and managed by data scientists, not
engineers
Our Tools: PySpark, Keras, and Good Old Python
● Used for all the big, sexy analytics
○ Regression billions of records
○ Collaborative filtering
■ Average domain has 15k
products and 1,5M training
users
● PySpark instead of Scala allows recycling
of all our custom Python libraries into ML
jobs (rather than rewriting)
● In modern Spark, performance in Python
and Scala is about the same (when using
Spark functionality)
● Used for all the small, sexy analytics
○ Deep learning on session purchase
propensity
○ Predicting sellout dates using RNNs
● Keras is easier and cleaner to read than
raw TensorFlow
● Spark deep learning functionality is
underdeveloped at this time
● In deep learning, TF is #1 and Keras #2,
so Keras + TF is … #12? Great
community and development
An example job: User-Item Ratings
Observed User -
Item Propensity
User History
Ratings
Collaborative
Filtering (ALS)
Ratings
Adjustment
Collaborative
Filtering (ALS)
Current Items and
Users
Calculate Scores
for Current User -
Item Pairs
User Features
Item Features
Re-weight based
on feature data (ex
airport preference)
Write out to S3
PySpark, Easy as 1-2-3
● This is a simplified version of
that model in 20 lines of Python,
skipping one step
● A data scientist familiar with
Python can be working
productively in Spark in a few
days
● Easy, fast modeling means we
can keep iteration time low,
increasing number of tests
How mails are built, the short version
● Each domain is built and sent independently, allowing
easy restarts in the event of issues and better
parallelization
● The job on average takes two hours to build and schedule
2,5 million emails and synchronize the same data to
Redis
● It looks complicated, but this complex job chain of 62
steps is 85 lines of Python, easy to modify and maintain
● Tasks consist of:
○ Database creation of 35 million content records
○ Real time generation and capture of 7,5 million
events
○ Launching and spinning down of 16 AWS workers
○ Syncing of >50 million records to Redis
○ Syncing three different APIs (AND google sheets!
BIG DATA)
How do we go from ranks to mails?
Every mail begins as a template
All templates are used to generate dynamic SQL to build the mail content for each recipient
Why dynamic SQL?
● Our database hates transactions
but loves batch
● Our data scientists can understand
what’s going on and contribute
(versus complex frameworks)
Total runtime for 800k mails with >20 different
templates, all personal?
Six minutes
What happens when mails are built?
1. 15k records are picked up via PyODBC into Pandas based on mail, segment,
and desired send hour
2. For each sub:
a. We identify and build a dictionary for each module based on defined template rules
b. Special modules are injected based on upcoming travel, retention campaigns, etc
c. We determine a custom subject line using a Bayesian Bandit based on past subjects, content
being sent, customer segmentation (ex preferred device type), and predicted open rate
d. We add custom URL parameters to trigger experience changes based on past behavior
3. Those dicts are sent to RabbitMQ for consumption by our mailer (based on
Django) to transform JSON into the pretty HTML
4. After successful transfer to RabbitMQ, those mails are marked as sent and
the next batch is started
THIS is a mail
● This mail consists of four content blocks, 50%
were decided at runtime
● Generating this mail took 0,007 seconds
(including all database transactions); rendering
to HTML takes another 0,04 seconds
● Every element in the args are personal except
utm_medium, utm_source, and utm_content
Building data "Py-pelines"

More Related Content

What's hot

Building a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with DatabricksBuilding a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with DatabricksDatabricks
 
Empowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark StreamingEmpowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark StreamingDatabricks
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchSheetal Pratik
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data ArchitectureSplunk
 
Building the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseBuilding the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseDatabricks
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningDatabricks
 
Analytics-Enabled Experiences: The New Secret Weapon
Analytics-Enabled Experiences: The New Secret WeaponAnalytics-Enabled Experiences: The New Secret Weapon
Analytics-Enabled Experiences: The New Secret WeaponDatabricks
 
HP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataHP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataRob Winters
 
Data & AI Platform Concepts
Data & AI Platform ConceptsData & AI Platform Concepts
Data & AI Platform ConceptsAnkit Rathi
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML Meetup
ML Infra @ Spotify: Lessons Learned - Romain Yon -  NYC ML MeetupML Infra @ Spotify: Lessons Learned - Romain Yon -  NYC ML Meetup
ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML MeetupRomain Yon
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanycOpen Analytics
 
Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summitOpen Analytics
 
Data Engineering Challenges - DSE Day at Bandung Institute of Technology
Data Engineering Challenges - DSE Day at Bandung Institute of TechnologyData Engineering Challenges - DSE Day at Bandung Institute of Technology
Data Engineering Challenges - DSE Day at Bandung Institute of TechnologyRendy Bambang Junior
 
Embedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven LogisticsEmbedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven LogisticsDatabricks
 
Learn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleDatabricks
 
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...Databricks
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
 
Predicting Patient Outcomes in Real-Time at HCA
Predicting Patient Outcomes in Real-Time at HCAPredicting Patient Outcomes in Real-Time at HCA
Predicting Patient Outcomes in Real-Time at HCASri Ambati
 

What's hot (20)

Building a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with DatabricksBuilding a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with Databricks
 
Empowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark StreamingEmpowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark Streaming
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data Architecture
 
Building the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseBuilding the Artificially Intelligent Enterprise
Building the Artificially Intelligent Enterprise
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Analytics-Enabled Experiences: The New Secret Weapon
Analytics-Enabled Experiences: The New Secret WeaponAnalytics-Enabled Experiences: The New Secret Weapon
Analytics-Enabled Experiences: The New Secret Weapon
 
HP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataHP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big Data
 
Data & AI Platform Concepts
Data & AI Platform ConceptsData & AI Platform Concepts
Data & AI Platform Concepts
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Data engineering design patterns
Data engineering design patternsData engineering design patterns
Data engineering design patterns
 
ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML Meetup
ML Infra @ Spotify: Lessons Learned - Romain Yon -  NYC ML MeetupML Infra @ Spotify: Lessons Learned - Romain Yon -  NYC ML Meetup
ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML Meetup
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 
Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
 
Data Engineering Challenges - DSE Day at Bandung Institute of Technology
Data Engineering Challenges - DSE Day at Bandung Institute of TechnologyData Engineering Challenges - DSE Day at Bandung Institute of Technology
Data Engineering Challenges - DSE Day at Bandung Institute of Technology
 
Embedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven LogisticsEmbedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven Logistics
 
Learn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML Lifecycle
 
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
Predicting Patient Outcomes in Real-Time at HCA
Predicting Patient Outcomes in Real-Time at HCAPredicting Patient Outcomes in Real-Time at HCA
Predicting Patient Outcomes in Real-Time at HCA
 

Similar to Building data "Py-pelines"

AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data PlatformDani Solà Lagares
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureFei Chen
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformGetInData
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?Ivo Andreev
 
A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientistsStitch Fix Algorithms
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Pôle Systematic Paris-Region
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudRick Bilodeau
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudStreamsets Inc.
 
Traveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analyticsTraveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analyticsRendy Bambang Junior
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Hisham Mardam-Bey
 
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...confluent
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in RetailHari Shreedharan
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanDatabricks
 

Similar to Building data "Py-pelines" (20)

AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data Platform
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML Platform
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
 
A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientists
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Traveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analyticsTraveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analytics
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Activity feeds (and more) at mate1
Activity feeds (and more) at mate1
 
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
 

More from Rob Winters

Data Ops at TripActions
Data Ops at TripActionsData Ops at TripActions
Data Ops at TripActionsRob Winters
 
Building a Personalized Offer Using Machine Learning
Building a Personalized Offer Using Machine LearningBuilding a Personalized Offer Using Machine Learning
Building a Personalized Offer Using Machine LearningRob Winters
 
Design Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseDesign Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseRob Winters
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesRob Winters
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfRob Winters
 
Getting Started with Big Data Analytics
Getting Started with Big Data AnalyticsGetting Started with Big Data Analytics
Getting Started with Big Data AnalyticsRob Winters
 
Billions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowBillions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowRob Winters
 

More from Rob Winters (7)

Data Ops at TripActions
Data Ops at TripActionsData Ops at TripActions
Data Ops at TripActions
 
Building a Personalized Offer Using Machine Learning
Building a Personalized Offer Using Machine LearningBuilding a Personalized Offer Using Machine Learning
Building a Personalized Offer Using Machine Learning
 
Design Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseDesign Principles for a Modern Data Warehouse
Design Principles for a Modern Data Warehouse
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the Bijenkorf
 
Getting Started with Big Data Analytics
Getting Started with Big Data AnalyticsGetting Started with Big Data Analytics
Getting Started with Big Data Analytics
 
Billions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowBillions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right Now
 

Recently uploaded

RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 

Recently uploaded (20)

RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 

Building data "Py-pelines"

  • 1.
  • 2.
  • 3. Founded in 2010, TravelBird’s focus is to bring back the joy of travel by providing inspiration to explore and simplicity in discovering new destinations. Active in eleven markets across Europe and inspiring three million travelers daily via email, web, and mobile app. Our Values Inspiring Prompting you to visit a place you’d never thought about before. Curated & local Proudly introducing travellers to the very best their destinations have to offer, with insider tips and local insight. Simple & easy Taking care of the core elements of your journey, and there for you every step of the way.
  • 4. Our Team’s Role: Applying Data to Solve Problems ● Invoicing and liability risk modeling ● Marketing budgeting/attribution management ● CRM + personalization ● Email channel management ● Business intelligence ● Data gathering + enrichment (ETL) ● Data warehousing / Big data analytics And all done in Python
  • 5.
  • 6. Our Architecture (Overall) ● Fully AWS hosted ● Mixture of permanent hosts, auto-scaled, and dynamically launched (ex for ML jobs) ● Production is built in Django + MySQL ● Data Science architecture (interesting stuff in red) is: ○ Postgres + Vertica for databases ○ Kinesis for event buffering ○ Spark for ML ○ Airflow + Rundeck for scheduling ○ Redis for RT data ○ S3 + HDFS + GFS for storage And Python for EVERYTHING
  • 7. The Why-thon of Python ● We use it in production, so any dev can work on our data stack ● Best libraries available for ML/DL, visualization, data integration/transformation, anything you want to do with data ● It works for EVERYTHING machine learning, even with big data, so allows our data scientists to do data engineering as well ● It’s fast enough, and good hardware is cheaper than wasted dev time/resources
  • 8.
  • 9. The event pipeline architecture
  • 10. Python at the heart of it ● Python, python and only python ● Benefit from its great ecosystem (uwsgi, supervisord, Flask, 0mq, boto3, click, etc) ● Some design patterns: ○ Pipelining down to the lowest level (use queues and monitor them) ○ JSON, JSON everywhere ○ Exploit polymorphism ○ Processes, not threads ○ And most of all: keep it lean, easier to understand, easier to maintain ● Building the bibrary: it’s got the BEST modules ○ Abstract from low level AWS details ○ Utilities ○ Centralized configuration ○ Event library
  • 11. The event library: a standard (2)
  • 12. The event library: a standard (3) The event library also standardizes the life cycle: ● Decoding ● (De)serializing ● Processing Easy to store, easy to move around, easy to work with.
  • 13. Deploy, Test, Quality Assurance Deployment Testing Monitoring Logging
  • 14.
  • 15. Our nightly ML job chain (one of many!) ● 93 tasks consisting of: ○ Creation of Spark clusters on spot workers ○ (Lots of) Spark models ○ Keras models on deployed spot workers (we LOVE spot!) ○ Database queries and data aggregations ○ Output merges S3->DWH ● The beauty of python tooling? This is built and managed by data scientists, not engineers
  • 16. Our Tools: PySpark, Keras, and Good Old Python ● Used for all the big, sexy analytics ○ Regression billions of records ○ Collaborative filtering ■ Average domain has 15k products and 1,5M training users ● PySpark instead of Scala allows recycling of all our custom Python libraries into ML jobs (rather than rewriting) ● In modern Spark, performance in Python and Scala is about the same (when using Spark functionality) ● Used for all the small, sexy analytics ○ Deep learning on session purchase propensity ○ Predicting sellout dates using RNNs ● Keras is easier and cleaner to read than raw TensorFlow ● Spark deep learning functionality is underdeveloped at this time ● In deep learning, TF is #1 and Keras #2, so Keras + TF is … #12? Great community and development
  • 17. An example job: User-Item Ratings Observed User - Item Propensity User History Ratings Collaborative Filtering (ALS) Ratings Adjustment Collaborative Filtering (ALS) Current Items and Users Calculate Scores for Current User - Item Pairs User Features Item Features Re-weight based on feature data (ex airport preference) Write out to S3
  • 18. PySpark, Easy as 1-2-3 ● This is a simplified version of that model in 20 lines of Python, skipping one step ● A data scientist familiar with Python can be working productively in Spark in a few days ● Easy, fast modeling means we can keep iteration time low, increasing number of tests
  • 19.
  • 20. How mails are built, the short version ● Each domain is built and sent independently, allowing easy restarts in the event of issues and better parallelization ● The job on average takes two hours to build and schedule 2,5 million emails and synchronize the same data to Redis ● It looks complicated, but this complex job chain of 62 steps is 85 lines of Python, easy to modify and maintain ● Tasks consist of: ○ Database creation of 35 million content records ○ Real time generation and capture of 7,5 million events ○ Launching and spinning down of 16 AWS workers ○ Syncing of >50 million records to Redis ○ Syncing three different APIs (AND google sheets! BIG DATA)
  • 21. How do we go from ranks to mails?
  • 22. Every mail begins as a template
  • 23. All templates are used to generate dynamic SQL to build the mail content for each recipient Why dynamic SQL? ● Our database hates transactions but loves batch ● Our data scientists can understand what’s going on and contribute (versus complex frameworks) Total runtime for 800k mails with >20 different templates, all personal? Six minutes
  • 24. What happens when mails are built? 1. 15k records are picked up via PyODBC into Pandas based on mail, segment, and desired send hour 2. For each sub: a. We identify and build a dictionary for each module based on defined template rules b. Special modules are injected based on upcoming travel, retention campaigns, etc c. We determine a custom subject line using a Bayesian Bandit based on past subjects, content being sent, customer segmentation (ex preferred device type), and predicted open rate d. We add custom URL parameters to trigger experience changes based on past behavior 3. Those dicts are sent to RabbitMQ for consumption by our mailer (based on Django) to transform JSON into the pretty HTML 4. After successful transfer to RabbitMQ, those mails are marked as sent and the next batch is started
  • 25. THIS is a mail ● This mail consists of four content blocks, 50% were decided at runtime ● Generating this mail took 0,007 seconds (including all database transactions); rendering to HTML takes another 0,04 seconds ● Every element in the args are personal except utm_medium, utm_source, and utm_content