ANSELMO SILVA
Building a Solid Foundation
Expedia Partner Solutions
Data Platform
The world’s travel platform
600M+
monthly visits in 75+ countries
20k
employees worldwide
90B USD
yearly travel sales
10k
Affiliates
Expedia Group is all
about travel;
Our secret is that
we are also all
about technology
and data science.
600M+
API hotel searches a day
1000+
EPS powered partners
30%+
YOY growth
10TB
Data processed every day
A FEW OF OUR PARTNERS
Travel Industry challenges & opportunities
Augmented Reality and VR
Market growth and consolidation
source: skift.com
Voice search and personalisation
AI - ML AT EXPEDIA PARTNER SOLUTIONS
Sorting
Image Classification
Forecast & Anomaly Detection
Voice & Bots
Recommendations & Cross-Sell
Data Challenges
DATA-SCIENCE
Heterogeneity (Partners and Supply)
Supply size (> 500K Properties)
Partners size (> 1K Partners)
Content size (> 1M Images)
Data size (10TB per day, PBs data lake)
Guiding principles
Data
Platform
CLOUD MIGRATION
Follow the data producers' path
Improve security, scalability and resilience
Promote technology innovation
Separate computing from storage
Solid Foundation
[Diagram: a cloud Data Lake and the On-Premises cluster, each with its own Hive Metastore]
DATA REPLICATION { CIRCUS-TRAIN }
Replicates Hive tables between clusters on request. It
replicates both the table's data and metadata.
It has a light touch, requiring no direct integration
with Hive's core services.
It can copy either entire unpartitioned tables or
user-defined sets of partitions on partitioned tables.
It is not event-driven and does not know how tables
differ between sites.
https://github.com/hotelsdotcom/circus-train
circus_train.yml

    source-catalog:
      name: on_prem_dw
      disable-snapshots: true
      hive-metastore-uris: ${on-prem-dw-foo-params.source-thrift-uris}
    replica-catalog:
      name: usw2_foo
      hive-metastore-uris: ${usw2-foo-params.replica-thrift-uris}
    copier-options:
      tmp-dir: hdfs:///tmp/circus-train/
      region: us-west-2
    table-replications:
      - copier-options:
          task-count: ${usw2-foo-params.task-count}
        source-table:
          database-name: ${usw2-foo-params.database-name}
          table-name: ${usw2-foo-params.table-name}
          partition-filter: ${usw2-foo-params.partition-filter}
        replica-table:
          database-name: ${usw2-foo-params.target-database-name}
          table-location: s3://${usw2-foo-params.s3-bucket-name} …
[Diagram: Data Lake, Data Lake #2 and Data Lake #3, each with its own Hive Metastore]
DATA FEDERATION { WAGGLE-DANCE }
Waggle Dance is a request-routing Hive metastore proxy
that allows tables to be concurrently accessed across
multiple Hive deployments.
It was created to tackle the appearance of dataset silos
that arose as our large organization gradually migrated
from monolithic on-premises clusters to cloud-based
platforms.
https://github.com/hotelsdotcom/waggle-dance
waggle_dance_federation.yml

    primary-meta-store:
      access-control-type: READ_AND_WRITE_ON_DATABASE_WHITELIST
      name: primary
      remote-meta-store-uris: ${ON_PREM_HIVE_METASTORE_URI}
      writable-database-white-list:
        - foo_user_.*
    federated-meta-stores:
      - name: zed-bar-prod
        access-control-type: READ_ONLY
        remote-meta-store-uris: ${USW2_6623552_PROD_HIVE_METASTORE_URI}
        mapped-databases:
          - foo_transaction
          - bar_stream
          - zed_common
          - opp_charles
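
The routing and access rules expressed by this configuration can be pictured with a toy model. This is a sketch only, not Waggle Dance's actual implementation: the `Metastore` class and the `route` and `can_write` functions are hypothetical names, and exact-name matching of mapped databases is a simplifying assumption. The URIs are left as the unresolved placeholders from the slide.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Metastore:
    name: str
    uri: str
    # Regexes of database names that may be written to; empty means read-only.
    writable_patterns: list = field(default_factory=list)
    # Exact database names served by this metastore (federated stores only).
    mapped_databases: list = field(default_factory=list)


# Values mirror the YAML on the slide; nothing here is resolved or contacted.
PRIMARY = Metastore(
    name="primary",
    uri="${ON_PREM_HIVE_METASTORE_URI}",
    writable_patterns=[r"foo_user_.*"],
)
FEDERATED = [
    Metastore(
        name="zed-bar-prod",
        uri="${USW2_6623552_PROD_HIVE_METASTORE_URI}",
        mapped_databases=["foo_transaction", "bar_stream", "zed_common", "opp_charles"],
    )
]


def route(database: str) -> Metastore:
    """Pick a federated metastore if it maps the database, else the primary."""
    for store in FEDERATED:
        if database in store.mapped_databases:
            return store
    return PRIMARY


def can_write(database: str) -> bool:
    """Allow writes only on the primary, and only for whitelisted names."""
    store = route(database)
    if store is not PRIMARY:
        return False  # READ_ONLY federated metastore
    return any(re.fullmatch(p, database) for p in store.writable_patterns)
```

Under these assumptions a query on `bar_stream` is served read-only by `zed-bar-prod`, while a user sandbox such as `foo_user_alice` stays writable on the primary.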
DATA QUALITY FRAMEWORK
Manage core data-assets like any other product,
promoting instrumentation, observability and
alerting.
“First to Know” culture and process, measuring
how data-assets are accessible, fresh,
complete, accurate, enriched and integrated.
#BKG-MART #USR-TABLE
#CLK-STREAM
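
Two of the measures named above, freshness and completeness, can be sketched as simple checks that raise alerts. This is a minimal illustration, not the framework itself: the function names, the row format (a list of dicts) and the default thresholds are all assumptions.

```python
from datetime import datetime, timedelta


def is_fresh(last_update, now, max_lag):
    """True when the asset was updated within the allowed lag."""
    return now - last_update <= max_lag


def completeness(rows, required_fields):
    """Fraction of rows in which every required field is present and non-null."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows if all(row.get(f) is not None for f in required_fields)
    )
    return complete / len(rows)


def check_asset(name, rows, required_fields, last_update, now,
                max_lag=timedelta(hours=1), min_completeness=0.99):
    """Return alert strings; an empty list means the asset looks healthy."""
    alerts = []
    if not is_fresh(last_update, now, max_lag):
        alerts.append(f"{name}: stale, last update {last_update.isoformat()}")
    ratio = completeness(rows, required_fields)
    if ratio < min_completeness:
        alerts.append(
            f"{name}: completeness {ratio:.2%} below {min_completeness:.2%}"
        )
    return alerts
```

Running such checks on a schedule and alerting on a non-empty result is one way to be "first to know" about a degraded data-asset.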
Easy to produce data
[Diagram: Online, Offline and Development lanes; components: Data Producers, NRT Streaming Service, Data Lake Hive Metastore, On-Premises Hive Metastore]
NRT SERVICE
One way to produce data
Scalability - perf/efficiency (Kafka)
Simplified schema management
Support on all environments
Strive for a full hands-off service
PRODUCER CONTRACT
Own the data schema
Own produced data (e2e)
Stream events in real time
Obfuscate sensitive information
Document and update data assets
Monitor data in production
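
The "obfuscate sensitive information" item of the contract can be sketched as a step applied to each event before it is streamed. This is an illustration under stated assumptions: the field names in `SENSITIVE_FIELDS`, the choice of keyed SHA-256 hashing and the JSON encoding are hypothetical, and the actual hand-off to the streaming service is left out.

```python
import hashlib
import hmac
import json

# Hypothetical field names; a real producer would take these from its schema.
SENSITIVE_FIELDS = {"email", "traveller_name"}


def obfuscate(event, secret):
    """Return a copy of the event with sensitive values replaced by keyed hashes."""
    out = {}
    for key, value in event.items():
        if key in SENSITIVE_FIELDS and value is not None:
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            out[key] = digest.hexdigest()
        else:
            out[key] = value
    return out


def to_message(event, secret):
    """Serialise an obfuscated event to bytes, ready to hand to a producer client."""
    return json.dumps(obfuscate(event, secret), sort_keys=True).encode("utf-8")
```

A keyed hash keeps values joinable across events (the same input gives the same digest) without exposing the original value to consumers.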
Easy to consume data
[Diagram: Online, Offline and Development lanes; components: Data Producers, NRT Streaming Service, Orchestrator, Data Lake Hive Metastore, On-Premises Hive Metastore, and a Data Exploration + Pipelines Setup / DS Development workspace running SQL]
CONSUMER CONTRACT
Consume documented data-assets
Use approved access layers/libs
Report back any data quality issue
Annotate outputs with data-sources
Follow data governance guidelines
Adopt schema changes
*Do not duplicate data-assets*
QUERY ENGINES + TOOLS
Hive, Presto, Spark
EMR (data processing)
Databricks (data science)
Qubole (query, insights)
Athena (operational support)
ANALYTICS API (METRICS/DIMS STORE)
Programmatic access to analytical data
with granular ACLs on data-sets,
columns and rows.
Metadata, search, breakdown, filter,
time series, comparison and forecast on key
data-sets (sub-second response time).
ANALYTICS API
curl -g 'analytics.eps/bookings?dateField=created_day&date_range=2018-03-01,2018-05-01|2018-01-01,2018-03-01&groupby=partner[top=10,by=foo]&fields=foo,zed,bar&interval=hour'
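
The same request can be assembled programmatically. The sketch below only builds the URL from the parameters shown on the slide; the `http` scheme, the function names and the reading of `|` as a separator between two date ranges to compare are assumptions, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://analytics.eps/bookings"  # scheme is an assumption


def build_bookings_query(range_a, range_b, group_by, top, by, fields, interval):
    """Build a bookings query URL with percent-encoded parameters."""
    params = {
        "dateField": "created_day",
        # Two date ranges separated by "|", as in the slide's example.
        "date_range": f"{range_a[0]},{range_a[1]}|{range_b[0]},{range_b[1]}",
        "groupby": f"{group_by}[top={top},by={by}]",
        "fields": ",".join(fields),
        "interval": interval,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def decode_query(url):
    """Decode the query string back into a dict of single values."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


url = build_bookings_query(
    ("2018-03-01", "2018-05-01"),
    ("2018-01-01", "2018-03-01"),
    group_by="partner", top=10, by="foo",
    fields=["foo", "zed", "bar"], interval="hour",
)
```

Percent-encoding the separators (`,`, `|`, `[`, `]`) avoids the shell and URL-globbing pitfalls of pasting the raw query string on a command line.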
Data Science pushes the envelope
[Diagram: Online, Offline and Development lanes; components: Data Producers, NRT Streaming Service, Orchestrator, Data Lake Hive Metastore, On-Premises Hive Metastore; DS Development flow: Data Exploration + Pipelines Setup, Training Set, Validation Set, Model Config, Algorithm Training, ML Model Store, Batch Model Execution, Prediction + Backtesting]
DS DEVELOPMENT CYCLE
Model tuning
Algorithm training
ML model storage
FEATURES PIPELINE
Training sets
Validation sets
Parameters
Configuration
BATCH EXECUTION
Prediction backtesting
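
Prediction backtesting can be illustrated with a walk-forward loop: at each step the model sees only the history up to that point and its forecast is scored against the value that actually followed. This is a generic sketch, not the platform's batch job; the function names and the naive last-value predictor are assumptions chosen to keep the example self-contained.

```python
def last_value(history):
    """Naive baseline: predict that the next value equals the latest one."""
    return history[-1]


def walk_forward_backtest(series, min_train, predict=last_value):
    """Mean absolute error of one-step-ahead forecasts over the series.

    At step t the predictor sees only series[:t], so no future data leaks in.
    """
    if not 0 < min_train < len(series):
        raise ValueError("min_train must be between 1 and len(series) - 1")
    errors = []
    for t in range(min_train, len(series)):
        forecast = predict(series[:t])
        errors.append(abs(series[t] - forecast))
    return sum(errors) / len(errors)


# Forecasts are 12 and 11 against actuals 11 and 15: errors 1 and 4.
mae = walk_forward_backtest([10, 12, 11, 15], min_train=2)
```

Any trained model can be dropped in as `predict`, which makes the baseline's error a useful yardstick when comparing candidates.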
MODEL PERFORMANCE
Performance evaluation
Observability
[Diagram: Online, Offline and Development lanes extended with online serving: EPS API (book), Partner(s) Service, Features Store, ML Service, CI/CD, Performance Set, and Model Performance / Monitoring]
ONLINE SERVICE
CI/CD
Online features store
Model serialisation
Model serving
{ Custom, MLeap, TensorFlow, PMML }
“It Takes a Village … ”
CROSS-FUNCTIONAL TEAMS
PROMOTE BEST ENGINEERING PRACTICES
CRITICAL EXECUTION PATH
MEASURE OPERATIONAL COSTS
SOLID PLATFORM TO BUILD ON TOP
#lifeatexpedia

Blueprint Series: Expedia Partner Solutions, Data Platform
