How to Feed a Data Hungry Organization – by Traveloka Data Team

Traveloka Data Meetup v1.0.0
How to Feed a Data Hungry Organization

Part One
Traveloka Data Culture

Part 1: Traveloka Data Culture
Five Characteristics of a Data Hungry Organization
Driven Decision
Learn from Mistakes
Better Understanding
Uncertainty and Variation
High Quality Data
Data Hungry Organization

Our responsibility is to turn data into consumable insights
DATA TEAM
BETTER BUSINESS DECISION

We need the brightest people to fill our needs and create the future
Mathematics
Business
Programming
Skills

Some of the skills in mathematics
Mathematics
Optimization
Decision Theory
Statistics
Differential Equations
Time Series

Some of the skills in business
Business
Strategy
Finance
Economics

Some of the skills in programming
Programming
Data Wrangling
Modelling
Big Data

This is how we structure our team
Data
Team
Data Governance
Machine Learning Engineering
Data Analysis
Data Science
Data Engineering

Houston, we have a problem.
DW: Tens of Terabytes, Hundreds of ETLs
Kafka: Hundreds of Topics, Millions of Messages per Hour, Hundreds of Megabytes per Second
S3: Hundreds of Terabytes
Redshift: Tens of Thousands of Queries Daily
DOMO: Thousands of Cards, Hundreds of Users
PeriscopeData: Thousands of Dashboards, Hundreds of Users

We need state of the art technology to feed data hungry people
Stages: Ingestion, Processing, Storage, Presentation
Source DB: Mongo, PostgreSQL
App / Services: Java App
Ingestion: Gobblin
Datahub: Pubsub, Kafka
Stream Processing: DataFlow, MemSQL Pipeline
Batch Processing: Spark, Airflow, Hadoop2, Python, Java App
Data Lake: AWS S3
Data Warehouse: Redshift, MongoDB, PostgreSQL
Near Real Time DW: GCP BigQuery, MemSQL
Real Time DB: AWS DynamoDB
Query Engine: Qubole, Presto, Hive
Analytics Tools: PeriscopeData, Spark, R, Domo, Dataiku, Holistics, Keboola
ML Tools, Library, and Services: Jupyter, Zeppelin, Caffe, DataDog, TensorFlow, Cloud Vision API

Part 2: Data Engineering
Fast Food,
Or…?

MINDSETS
Managed service for focus
So we could focus more on the use cases

Real Time Pipeline
5 min data delivery SLA. Real latency ~ 10s
100 ms query SLA. Real latency ~ 10ms (p95)
Key value data, query by service/app
Autoscale
Self service for each engineering team: we provide governance, guidance, building blocks, and consultation
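
A minimal sketch of how a p95 latency figure like the one above could be checked against the 100 ms query SLA. The function names and the sample latencies are assumptions for illustration, not Traveloka's monitoring code or measurements; the percentile uses the simple nearest-rank definition.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Ceiling division with integers avoids floating point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def meets_sla(latencies_ms, sla_ms=100, pct=95):
    """True when the pct-th percentile latency is within the SLA."""
    return percentile(latencies_ms, pct) <= sla_ms


# Hypothetical query latencies in milliseconds: 19 fast queries, one slow outlier.
latencies_ms = [10] * 19 + [250]
print(percentile(latencies_ms, 95), meets_sla(latencies_ms))
```

A p95 target tolerates a small share of slow outliers, which is why one slow query out of twenty does not break the SLA in this sketch, while one out of ten would.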


Near Real Time Pipeline
Raw data, query by BI Tools
5 min data delivery SLA. Real latency ~ 5s
Using YAML for schema definition (built and defined by ourselves)
Self service for data analysts, with guidance and governance!
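
The slides do not show the in-house YAML schema format, so the following is only a sketch of the general idea: a declarative schema that analysts can write themselves, checked against incoming records. The table name, field names, type names, and the `validate` helper are all hypothetical. The schema is shown as the dict a YAML parser would produce, to keep the example free of third-party dependencies.

```python
# Hypothetical schema, equivalent to YAML such as:
#   table: search_events
#   fields:
#     - {name: user_id, type: string, required: true}
#     - {name: result_count, type: integer, required: true}
#     - {name: is_mobile, type: boolean}
SCHEMA = {
    "table": "search_events",
    "fields": [
        {"name": "user_id", "type": "string", "required": True},
        {"name": "result_count", "type": "integer", "required": True},
        {"name": "is_mobile", "type": "boolean"},
    ],
}

TYPE_MAP = {"string": str, "integer": int, "float": float, "boolean": bool}


def validate(record, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    known = set()
    for field in schema["fields"]:
        name = field["name"]
        known.add(name)
        value = record.get(name)
        if value is None:
            if field.get("required", False):
                errors.append(f"missing required field: {name}")
            continue
        expected = TYPE_MAP[field["type"]]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            errors.append(f"field {name} should be {field['type']}")
    for name in record:
        if name not in known:
            errors.append(f"unknown field: {name}")
    return errors


print(validate({"user_id": "u1", "result_count": 12}, SCHEMA))
print(validate({"user_id": "u1", "result_count": "12", "extra": 1}, SCHEMA))
```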

But MemSQL is not a managed service; it runs on EC2.
It is easy to scale, but it does not autoscale yet.
So we are moving to… v2!!
Currently in usability testing by analysts.
Self service, of course!

Analytical Pipeline
Heavy data processing, query by BI Tools
6 hour data delivery SLA

Analytical Pipeline
Interesting features:
• Custom dev/prod environment, for self service!
• Custom framework, on top of Spark
• Custom Airflow, separated queue for backfill
• EMR autoscale for backfill
• Redshift microbatch bulk load
• etc...
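
One way to picture the microbatch bulk load mentioned above: records are grouped into small fixed-size batches, and each batch is loaded in a single bulk operation rather than row by row. This is a generic sketch, not the team's implementation; `micro_batches` is a hypothetical helper, and the actual load step (for Redshift, typically a bulk load from staged files) is left out.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Each batch would be handed to a bulk loader instead of being printed.
for batch in micro_batches(range(7), 3):
    print(batch)
```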

Summary

Part Three
Data Science in Traveloka

Part 3: Data Science in Traveloka
Three Things to Discuss Today
Data Science Purpose
Tools of the Trade
Model Evaluations and Applications

Novia is 25 years old. She is single, outspoken, and
mathematically gifted. As a student, she was deeply
interested in calculus and statistics, and also participated in
the International Mathematical Olympiad.
a. Novia is a data scientist
b. Novia is a data scientist and is active as a mathematical
Olympiad tutor

Consider a regular six-sided die with four green faces and
two red faces. The die will be rolled 20 times and the
sequence of greens (G) and reds (R) will be recorded.
Choose one sequence from a set of three. Which one is the
more likely outcome?
RGRRR
GRGRRR
GRRRRR
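
The die question can be checked with exact arithmetic. The sketch below assumes that a sequence "wins" if it appears as a consecutive run anywhere in the 20 rolls; the slide does not state this, so treat it as an assumption. `exact_probability` gives the chance of rolling a sequence in one specific window, and `appearance_probability` tracks, roll by roll, how much of the pattern has been matched so far.

```python
from fractions import Fraction

# Four green faces and two red faces on a six-sided die.
P = {"G": Fraction(4, 6), "R": Fraction(2, 6)}


def exact_probability(seq):
    """Probability of rolling exactly this sequence in len(seq) rolls."""
    prob = Fraction(1)
    for face in seq:
        prob *= P[face]
    return prob


def appearance_probability(pattern, n_rolls):
    """Probability that pattern occurs as a consecutive run within n_rolls."""
    m = len(pattern)

    def next_state(matched, face):
        # Longest suffix of (matched prefix + face) that is a prefix of pattern.
        s = pattern[:matched] + face
        while not pattern.startswith(s):
            s = s[1:]
        return len(s)

    dist = [Fraction(0)] * m  # dist[k]: first k faces matched, no hit yet
    dist[0] = Fraction(1)
    hit = Fraction(0)
    for _ in range(n_rolls):
        new = [Fraction(0)] * m
        for matched, p_state in enumerate(dist):
            for face, p_face in P.items():
                j = next_state(matched, face)
                if j == m:
                    hit += p_state * p_face
                else:
                    new[j] += p_state * p_face
        dist = new
    return hit


for seq in ("RGRRR", "GRGRRR", "GRRRRR"):
    print(seq, exact_probability(seq), float(appearance_probability(seq, 20)))
```

GRGRRR looks more "representative" of a mostly green die, but every run of rolls that contains GRGRRR also contains RGRRR, so the shorter sequence can never be the less likely one.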

Remember This:
The goal of a data science exercise is to help us make
a good business decision
Logic
Alternatives
Information
Preferences

“if they learn nothing else about decision
analysis from their studies, distinction between
outcome and decisions will have been worth
the price of admission”
Ron Howard, Professor at Stanford University
Father of Decision Analysis
Decisions (columns) vs. Outcome (rows)
                Good decision                                  Bad decision
Good outcome    Took a taxi and arrived safely                 Drove home and arrived safely
Bad outcome     Took a taxi and was involved in an accident    Drove home and was involved in an accident
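
The distinction between decisions and outcomes can be made concrete with a small sketch. The accident probabilities below are hypothetical numbers chosen only for illustration. A decision is judged by the distribution of outcomes it leads to, not by the single outcome that happened to occur.

```python
from fractions import Fraction

# Hypothetical accident probabilities, chosen only for illustration.
P_ACCIDENT = {
    "take a taxi": Fraction(1, 10_000),
    "drive home": Fraction(1, 100),
}


def outcome_distribution(decision):
    """Probability of a good and a bad outcome for one decision."""
    p_bad = P_ACCIDENT[decision]
    return {"good outcome": 1 - p_bad, "bad outcome": p_bad}


def better_decision(options):
    """Pick the option with the lowest accident probability."""
    return min(options, key=lambda decision: P_ACCIDENT[decision])


for decision in P_ACCIDENT:
    print(decision, outcome_distribution(decision))
print("better decision:", better_decision(P_ACCIDENT))
```

Under these assumed numbers the worse decision still ends well 99 times out of 100, which is exactly why a good outcome alone says little about the quality of the decision.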

Data Science Framework: CRISP-DM
Business
Data
Data Prep
Model
Evaluation
Deployment
Common Sense

“Hiding within those
mounds of data is
knowledge that could
change the life of a
patient, or change the
world”
-Atul Butte, Stanford-
We use open source libraries for data science
Wrangling
• data.table
• dplyr
• sparkR
• sparklyr
• pandas
• pyspark
Visualization
• ggplot
• matplotlib
• seaborn
• shiny
Statistics
• R
• JAGS
• STAN
• Python
• Julia
Machine Learning
• scikit-learn
• caret
• e1071
• fbprophet

Are we using the algorithm? Or being used by it?
Classification
• Linear Models
• Naïve Bayes Classifier
• Support Vector Classifier
• Vowpal Wabbit Classifier
• Random Forest
• Decision Trees
• Neural Network
• Extreme Gradient Boosted Trees
• Many more algos!
Prediction
• Linear Models
• Nystroem Regressor
• Support Vector Regressor
• Vowpal Wabbit Regressor
• Random Forest
• Decision Trees
• Neural Network
• Extreme Gradient Boosted Trees
• More algos!
• Scikit-learn
• Caret
• TensorFlow
• …

We need more than just off the shelf libraries to feed data hungry people
Bayesian Network
Markov Chain Monte Carlo

Model Evaluation: judging the usefulness of your model
Rule #1
Never ever peek at the test set during training/validation
Rule #2
You can never satisfy all the metrics;
pick one or two metrics as your decision criteria beforehand
Rule #3
Always do comparative statics on the final model
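
Rule #1 is easiest to follow when the test set is split off once, up front, and then left alone until the final evaluation. A minimal sketch using only the standard library; the function name, fractions, and seed are assumptions for illustration.

```python
import random


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once and return (train, validation, test) lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


train, validation, test = three_way_split(range(100))
# Tune on train and validation only; score on test once, at the very end.
print(len(train), len(validation), len(test))
```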

Comparative Statics
commonly used as feature importance analysis
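
A sketch of what a comparative statics check can look like in code: hold every input at a baseline value, vary one input at a time, and watch how the model's prediction moves. The toy model and the feature names below are hypothetical stand-ins for a fitted model.

```python
def comparative_statics(model, baseline, feature, values):
    """Vary one feature over values while holding the others at baseline."""
    results = []
    for value in values:
        scenario = dict(baseline)  # copy so the baseline stays untouched
        scenario[feature] = value
        results.append((value, model(scenario)))
    return results


def toy_model(x):
    # Hypothetical stand-in for a fitted model's predict function.
    return 2 * x["discount"] + 3 * x["lead_days"]


baseline = {"discount": 1, "lead_days": 2}
for value, prediction in comparative_statics(toy_model, baseline, "discount", [0, 1, 2]):
    print(value, prediction)
```

Features whose changes barely move the prediction are candidates for low importance; features that move it in an implausible direction are a reason to revisit the model.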

Remember the end goal: decisions
What should we do?
What might happen?

“But in my view,
obsessive customer focus
is by far the most protective of
Day 1 vitality”
Our data is telling us:
• What do they want?
• Do we serve their needs?
• Are they trying to leave us?
My name is Jeff
