Building data "Py-pelines"

Founded in 2010, TravelBird’s focus is to bring back the joy of travel by providing inspiration to explore and simplicity in discovering new destinations.
We are active in eleven markets across Europe and inspire three million travelers daily via email, web, and mobile app.
Our Values
Inspiring
Prompting you to visit a place you’d never thought about before.
Curated & local
Proudly introducing travellers to the very best their destinations have to offer, with insider tips and local insight.
Simple & easy
Taking care of the core elements of your journey, and there for you every step of the way.

Our Team’s Role: Applying Data to Solve Problems
● Invoicing and liability risk modeling
● Marketing budgeting/attribution management
● CRM + personalization
● Email channel management
● Business intelligence
● Data gathering + enrichment (ETL)
● Data warehousing / Big data analytics
And all done in Python

Our Architecture (Overall)
● Fully AWS hosted
● Mixture of permanent hosts, auto-scaled hosts, and dynamically launched hosts (e.g. for ML jobs)
● Production is built in Django + MySQL
● Data Science architecture (interesting stuff
in red) is:
○ Postgres + Vertica for databases
○ Kinesis for event buffering
○ Spark for ML
○ Airflow + Rundeck for scheduling
○ Redis for RT data
○ S3 + HDFS + GFS for storage
And Python for EVERYTHING

The Why-thon of Python
● We use it in production, so any dev can work on our data stack
● Best libraries available for ML/DL, visualization, data integration/transformation, anything you want to do with data
● It works for EVERYTHING in machine learning, even with big data, so it allows our data scientists to do data engineering as well
● It’s fast enough, and good hardware is cheaper than wasted dev time/resources

The event pipeline architecture

Python at the heart of it
● Python, python and only python
● Benefit from its great ecosystem (uwsgi, supervisord, Flask, 0mq, boto3, click, etc)
● Some design patterns:
○ Pipelining down to the lowest level (use queues and monitor them)
○ JSON, JSON everywhere
○ Exploit polymorphism
○ Processes, not threads
○ And most of all: keep it lean, easier to understand, easier to maintain
● Building the bibrary: it’s got the BEST modules
○ Abstract from low level AWS details
○ Utilities
○ Centralized configuration
○ Event library

The event library: a standard (2)

The event library: a standard (3)
The event library also standardizes the life
cycle:
● Decoding
● (De)serializing
● Processing
Easy to store, easy to move around, easy to
work with.
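
A standardized life cycle like the one above (decode, deserialize, process) might look roughly like this. It is a sketch under assumptions, not the real event library: the class names, fields, base64 transport encoding, and `REGISTRY` lookup are all invented for illustration.

```python
import base64
import json
from dataclasses import asdict, dataclass


@dataclass
class Event:
    """Base event; subclasses share one life cycle and override process()."""
    user_id: int
    timestamp: str

    event_type = "generic"  # plain class attribute, not a dataclass field

    def serialize(self) -> str:
        payload = asdict(self)
        payload["event_type"] = self.event_type
        return json.dumps(payload, sort_keys=True)

    def process(self) -> dict:
        return {"user_id": self.user_id, "action": "ignore"}


@dataclass
class PageView(Event):
    url: str

    event_type = "page_view"

    def process(self) -> dict:
        return {"user_id": self.user_id, "action": "count_view", "url": self.url}


@dataclass
class Purchase(Event):
    amount_eur: float

    event_type = "purchase"

    def process(self) -> dict:
        return {"user_id": self.user_id, "action": "record_sale", "amount_eur": self.amount_eur}


# Maps the type tag stored in the JSON payload back to a class.
REGISTRY = {cls.event_type: cls for cls in (Event, PageView, Purchase)}


def decode(raw: bytes) -> str:
    """Decoding: transport bytes (base64 is an assumption here) to JSON text."""
    return base64.b64decode(raw).decode("utf-8")


def deserialize(text: str) -> Event:
    """Deserializing: JSON text to the matching Event subclass."""
    payload = json.loads(text)
    cls = REGISTRY[payload.pop("event_type")]
    return cls(**payload)
```

A consumer then only needs `deserialize(decode(raw)).process()`; polymorphism picks the right behaviour per event type.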

Deploy, Test, Quality Assurance
● Deployment
● Testing
● Monitoring
● Logging

Our nightly ML job chain (one of many!)
● 93 tasks consisting of:
○ Creation of Spark clusters on spot
workers
○ (Lots of) Spark models
○ Keras models on deployed spot
workers (we LOVE spot!)
○ Database queries and data
aggregations
○ Output merges S3->DWH
● The beauty of Python tooling? This is built and managed by data scientists, not engineers

Our Tools: PySpark, Keras, and Good Old Python
PySpark
● Used for all the big, sexy analytics
○ Regression on billions of records
○ Collaborative filtering
■ Average domain has 15k products and 1.5M training users
● PySpark instead of Scala allows recycling of all our custom Python libraries into ML jobs (rather than rewriting)
● In modern Spark, performance in Python and Scala is about the same (when using Spark functionality)
Keras
● Used for all the small, sexy analytics
○ Deep learning on session purchase propensity
○ Predicting sellout dates using RNNs
● Keras is easier and cleaner to read than raw TensorFlow
● Spark deep learning functionality is underdeveloped at this time
● In deep learning, TF is #1 and Keras #2, so Keras + TF is … #12? Great community and development

An example job: User-Item Ratings
[Diagram: job flow. Box labels as extracted:]
● Observed User-Item Propensity
● User History Ratings
● Collaborative Filtering (ALS)
● Ratings Adjustment
● Collaborative Filtering (ALS)
● Current Items and Users
● Calculate Scores for Current User-Item Pairs
● User Features
● Item Features
● Re-weight based on feature data (e.g. airport preference)
● Write out to S3
PySpark, Easy as 1-2-3
● This is a simplified version of
that model in 20 lines of Python,
skipping one step
● A data scientist familiar with
Python can be working
productively in Spark in a few
days
● Easy, fast modeling means we
can keep iteration time low,
increasing number of tests

How mails are built, the short version
● Each domain is built and sent independently, allowing
easy restarts in the event of issues and better
parallelization
● The job on average takes two hours to build and schedule 2.5 million emails and synchronize the same data to Redis
● It looks complicated, but this complex job chain of 62
steps is 85 lines of Python, easy to modify and maintain
● Tasks consist of:
○ Database creation of 35 million content records
○ Real-time generation and capture of 7.5 million events
○ Launching and spinning down of 16 AWS workers
○ Syncing of >50 million records to Redis
○ Syncing three different APIs (AND google sheets!
BIG DATA)
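
One reason a long job chain can stay short in code is that per-domain tasks are generated in a loop. The sketch below illustrates that idea with the standard library's `graphlib` as a stand-in for a real scheduler; the domain codes and step names are hypothetical, and this is not the actual job definition.

```python
from graphlib import TopologicalSorter

DOMAINS = ["nl", "de", "be"]  # hypothetical market codes
STEPS = ["launch_workers", "build_content", "schedule_mails", "sync_redis", "stop_workers"]


def build_chain(domains=DOMAINS, steps=STEPS):
    """Map each task to the set of tasks it depends on: one linear chain per domain."""
    deps = {}
    for domain in domains:
        previous = None
        for step in steps:
            task = f"{domain}.{step}"
            deps[task] = {previous} if previous else set()
            previous = task
    return deps


def run_order(deps, only_domain=None):
    """Return a valid execution order, optionally limited to one domain for a restart."""
    if only_domain is not None:
        prefix = only_domain + "."
        deps = {task: d for task, d in deps.items() if task.startswith(prefix)}
    return list(TopologicalSorter(deps).static_order())
```

Because no task depends on another domain, `run_order(build_chain(), only_domain="de")` reruns a single market without touching the others.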

How do we go from ranks to mails?

Every mail begins as a template

All templates are used to generate dynamic SQL to build the mail content for each recipient
Why dynamic SQL?
● Our database hates transactions
but loves batch
● Our data scientists can understand
what’s going on and contribute
(versus complex frameworks)
Total runtime for 800k mails with >20 different
templates, all personal?
Six minutes
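
To illustrate the batch-over-transactions idea, here is a minimal sketch of generating set-based SQL from a template definition. The template structure, table names, and column names are invented for this example, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical template: each module fills one slot of the mail from a ranked source table.
TEMPLATE = {
    "name": "daily_deals",
    "modules": [
        {"slot": 1, "source": "ranked_offers", "offer_rank": 1},
        {"slot": 2, "source": "ranked_offers", "offer_rank": 2},
    ],
}
ALLOWED_SOURCES = {"ranked_offers"}  # table names are whitelisted, never free text


def build_statements(template):
    """Return one INSERT ... SELECT per module, covering all recipients at once."""
    statements = []
    for module in template["modules"]:
        source = module["source"]
        if source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source table: {source}")
        sql = (
            "INSERT INTO mail_content (user_id, template, slot, offer_id) "
            f"SELECT user_id, ?, ?, offer_id FROM {source} WHERE offer_rank = ?"
        )
        statements.append((sql, (template["name"], module["slot"], module["offer_rank"])))
    return statements


def build_mail_content(conn, template):
    """Run the generated statements as one batch inside a single transaction."""
    with conn:
        for sql, params in build_statements(template):
            conn.execute(sql, params)
```

The number of statements grows with the number of template modules, not with the number of recipients, which is what makes this approach batch-friendly.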

What happens when mails are built?
1. 15k records are picked up via PyODBC into Pandas based on mail, segment, and desired send hour
2. For each subscriber:
a. We identify and build a dictionary for each module based on defined template rules
b. Special modules are injected based on upcoming travel, retention campaigns, etc.
c. We determine a custom subject line using a Bayesian Bandit based on past subjects, content being sent, customer segmentation (e.g. preferred device type), and predicted open rate
d. We add custom URL parameters to trigger experience changes based on past behavior
3. Those dicts are sent to RabbitMQ for consumption by our mailer (based on
Django) to transform JSON into the pretty HTML
4. After successful transfer to RabbitMQ, those mails are marked as sent and
the next batch is started
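
The subject line step (2c) can be illustrated with Thompson sampling, one common form of Bayesian bandit. This sketch is heavily simplified: it uses only opens and sends per subject, whereas the real selection also considers content, segmentation, and predicted open rate. The function names and the `stats` structure are hypothetical.

```python
import random


def pick_subject(stats, rng=random):
    """Thompson sampling over subject lines.

    stats maps subject -> (opens, sends). Each subject's open rate gets a
    Beta(1 + opens, 1 + sends - opens) posterior; we draw once from every
    posterior and send the subject with the highest draw.
    """
    draws = {
        subject: rng.betavariate(1 + opens, 1 + sends - opens)
        for subject, (opens, sends) in stats.items()
    }
    return max(draws, key=draws.get)


def record_result(stats, subject, opened):
    """Update the counts for a subject after a mail was sent."""
    opens, sends = stats[subject]
    stats[subject] = (opens + int(opened), sends + 1)
```

Subjects with little data have wide posteriors and still get picked occasionally, so exploration happens without a separate test phase.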

THIS is a mail
● This mail consists of four content blocks, 50% of which were decided at runtime
● Generating this mail took 0.007 seconds (including all database transactions); rendering to HTML takes another 0.04 seconds
● Every element in the args is personal except
utm_medium, utm_source, and utm_content
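
As a small illustration of mixing shared and personal URL arguments, the sketch below merges fixed `utm_*` values with per-recipient parameters. The shared values, the personal argument names, and the base URL are all made up for the example.

```python
from urllib.parse import urlencode

# Hypothetical shared values; only the three key names come from the slide.
SHARED_ARGS = {"utm_medium": "email", "utm_source": "newsletter", "utm_content": "daily"}


def personal_link(base_url, personal_args):
    """Append shared utm_* arguments plus per-recipient arguments to a URL."""
    overlap = SHARED_ARGS.keys() & personal_args.keys()
    if overlap:
        raise ValueError(f"personal args may not override shared args: {sorted(overlap)}")
    return f"{base_url}?{urlencode({**SHARED_ARGS, **personal_args})}"
```

For example, `personal_link("https://example.com/offer/42", {"uid": 7})` yields a link carrying both the shared campaign tags and the recipient's own parameters.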
