Traveloka’s
Data
Journey
Stories and lessons learned
on building a scalable data
pipeline at Traveloka.
Very Early
Days...
Very Early Days
Applications
& Services
Summarizer
Internal
Dashboard
Report Scripts +
Crontab
- Raw Activity
- Key Value
- Time Series
Full... Split & Shard!
Raw, KV, and Time Series DB
Applications
& Services Internal
Dashboard
Report Scripts +
Crontab
Raw Activity
(Sharded)
Time Series
Summary
Summarizer
Lessons Learned
1. UNIX principle: “Do One Thing and Do It Well”
2. Split use cases based on SLA & query pattern
3. Choose scalable tech based on growth estimation
Key Value DB
(Sharded)
Throughput?
Kafka comes to the rescue
Applications
& Services
Raw Activity
(Sharded)
Lessons Learned
1. Use something that can handle higher throughput for
high-write-volume cases such as tracking
2. Decouple publish and consume (see the sketch below)
Kafka as
Datahub
Raw data
consumer
Key Value
(Sharded)
insert
update
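A minimal sketch of the decoupling above, assuming the kafka-python client and an illustrative "user-activity" topic (none of these names come from the slides): the application only publishes tracking events, while a separate raw-data consumer reads them at its own pace and upserts into the key-value store.

    # Sketch only: assumes a local Kafka broker and an illustrative "user-activity" topic.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Publisher side: the application fires the tracking event and moves on;
    # it does not care who consumes it or how fast.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )
    producer.send("user-activity", {"user_id": "u-123", "action": "search_flight"})
    producer.flush()

    # Consumer side (a separate process in practice): reads at its own pace
    # and upserts into the key-value store (shown here as a plain dict).
    consumer = KafkaConsumer(
        "user-activity",
        bootstrap_servers="localhost:9092",
        group_id="raw-data-consumer",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    key_value_store = {}
    for message in consumer:
        event = message.value
        key_value_store[event["user_id"]] = event  # insert or update per user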
We need a Data Warehouse
and a BI Tool, and we need them fast!
Raw Activity
(Sharded)
Other sources
Python ETL
(temporary
solution)
Star Schema
DW on
Postgres
Periscope BI
Tool
Lessons Learned
1. Think about the DW from the very beginning of the data pipeline
2. BI Tools: Do not reinvent the wheel
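For context, a minimal sketch of what such a "temporary" Python ETL into a star-schema DW on Postgres could look like, using psycopg2; the database, table, and column names are illustrative assumptions, not the actual schema.

    # Sketch of a "temporary" Python ETL into a star-schema DW on Postgres.
    # Database, table, and column names are illustrative assumptions.
    import psycopg2

    source = psycopg2.connect("dbname=raw_activity user=etl")
    target = psycopg2.connect("dbname=dw user=etl")
    run_date = "2014-01-01"  # placeholder run date

    with source.cursor() as src, target.cursor() as dst:
        # Extract: pull one day of raw activity.
        src.execute(
            "SELECT user_id, product, amount, created_at::date "
            "FROM activity WHERE created_at::date = %s",
            (run_date,),
        )
        # Transform + load: append rows into the fact table of the star schema.
        for user_id, product, amount, activity_date in src:
            dst.execute(
                "INSERT INTO fact_activity (user_id, product_key, amount, date_key) "
                "VALUES (%s, %s, %s, %s)",
                (user_id, product, amount, activity_date),
            )
    target.commit()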
“Have” to
adopt big data
Postgres couldn’t handle the load!
Raw Activity
(Sharded)
Other sources
Python ETL
(temporary
solution)
Star Schema
DW on
Redshift
Periscope BI
Tool
Lesson Learned
1. Choose the specific tech that best fits the use case
Scaling out MongoDB
every so often is not manageable...
Lesson Learned
1. MongoDB sharding: scalability needs to be tested!
Kafka as
Datahub
Gobblin as
Consumer
Raw Activity
on S3
“Have” to adopt big data
Lessons Learned
1. Processing has to be easy to scale
2. Scale processing separately for day-to-day jobs and
backfill jobs (see the Spark sketch below)
Kafka as
Datahub
Gobblin as
Consumer
Raw Activity
on S3
Processing on
Spark
Star Schema
DW on
Redshift
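A sketch of the day-to-day vs. backfill split with PySpark, under assumed S3 paths and fields: the job takes a date argument, so the daily run and a backfill over a date range are just different invocations that can be given differently sized clusters.

    # Sketch: a date-parameterized Spark batch job. Running it for one date is the
    # day-to-day job; looping it over a date range on a separate, bigger cluster is
    # the backfill job. Paths and fields are assumptions.
    import sys
    from pyspark.sql import SparkSession, functions as F

    run_date = sys.argv[1]  # e.g. "2016-05-01"
    spark = SparkSession.builder.appName(f"daily-activity-{run_date}").getOrCreate()

    raw = spark.read.json(f"s3://raw-activity/date={run_date}/")  # assumed layout
    daily = (
        raw.groupBy("user_id", "product")
           .agg(F.count("*").alias("events"), F.sum("amount").alias("amount"))
           .withColumn("date", F.lit(run_date))
    )
    # One output partition per date, so a rerun or backfill simply overwrites that day.
    daily.write.mode("overwrite").parquet(
        f"s3://dw-staging/fact_daily_activity/date={run_date}/"
    )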
Near Real Time on Big Data
is challenging
Lesson Learned
1. Dig into requirements until they are very
specific; for data this relates to:
1) latency SLA
2) query pattern
3) accuracy
4) processing requirements
5) tools integration
Kafka as
Datahub
MemSQL for Near
Real Time DB
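Since MemSQL speaks the MySQL wire protocol, the near-real-time path can be sketched as a Kafka consumer doing upserts through a standard MySQL client; the host, table, and field names here are assumptions, not the actual setup.

    # Sketch: Kafka -> MemSQL near-real-time writer. MemSQL is MySQL-wire-compatible,
    # so a standard MySQL client works. Host, table, and field names are assumptions.
    import json
    import pymysql
    from kafka import KafkaConsumer

    db = pymysql.connect(host="memsql-host", user="writer", password="secret",
                         database="realtime")
    consumer = KafkaConsumer(
        "user-activity",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    with db.cursor() as cur:
        for message in consumer:
            event = message.value
            # Upsert keyed by user so dashboards can read the latest state with low latency.
            cur.execute(
                "INSERT INTO user_latest_activity (user_id, action, ts) "
                "VALUES (%s, %s, %s) "
                "ON DUPLICATE KEY UPDATE action = VALUES(action), ts = VALUES(ts)",
                (event["user_id"], event["action"], event["ts"]),
            )
            db.commit()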
No OPS!!!
Open your mind to
any combination of tech!
Lessons Learned
1. Combining cloud providers is possible, but
be careful of latency concerns
2. During a research project, always prepare plans
B and C, plus a proper buffer in the timeline
3. Autoscale!
PubSub as
Datahub
DataFlow for
Stream
Processing
Key Value on
DynamoDB
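A hedged sketch of that PubSub → Dataflow → DynamoDB path with the Apache Beam Python SDK; the subscription, region, table, and field names are assumptions, and boto3 is used inside a DoFn for the DynamoDB writes.

    # Sketch: streaming pipeline on Dataflow reading Pub/Sub and writing to DynamoDB.
    # Subscription, region, table, and field names are assumptions.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class WriteToDynamo(beam.DoFn):
        def setup(self):
            import boto3  # imported on the worker
            self.table = boto3.resource("dynamodb", region_name="ap-southeast-1") \
                              .Table("user_latest_activity")

        def process(self, element):
            event = json.loads(element.decode("utf-8"))
            self.table.put_item(Item={"user_id": event["user_id"], "payload": event})

    options = PipelineOptions(streaming=True)  # plus --runner=DataflowRunner etc. in practice
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | beam.io.ReadFromPubSub(subscription="projects/my-project/subscriptions/activity")
         | beam.ParDo(WriteToDynamo()))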
More autoscale!
Lesson Learned
1. Autoscale means cost monitoring is a must
Caveat
Autoscale != everything solved
e.g. the PubSub default quota is 200 MB/s (it can be
increased, but only via a manual request)
PubSub as
Datahub
BigQuery for Near
Real Time DB
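A minimal sketch of rows landing in the near-real-time BigQuery table through the streaming-insert API; the project, dataset, and table names are assumptions, and the table is assumed to already exist. Storage and compute then scale independently on the BigQuery side.

    # Sketch: streaming inserts into a near-real-time BigQuery table.
    # Project, dataset, and table names are assumptions; the table is assumed to exist.
    from google.cloud import bigquery

    client = bigquery.Client()
    rows = [{"user_id": "u-123", "action": "search_flight", "ts": "2017-06-01T10:00:00Z"}]
    errors = client.insert_rows_json("my-project.realtime.user_activity", rows)
    if errors:
        print("Insert errors:", errors)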
More autoscale!
Lessons Learned
1. Make scalability as granular as
possible; in this case, separate
compute and storage scalability
2. Separate BI with a well-defined
SLA from exploration use cases
(see the Presto sketch below)
Kafka as
Datahub
Gobblin as
Consumer
Raw Activity
on S3
Processing on
Spark
Hive & Presto on
Qubole as Query
Engine
BI & Exploration
Tools
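For the exploration side, a hedged sketch of how an analyst could query the raw activity through Presto (e.g. on Qubole) from Python with PyHive; the coordinator host and table names are assumptions.

    # Sketch: ad-hoc exploration against Presto over the raw activity on S3.
    # Coordinator host and table names are assumptions.
    from pyhive import presto

    conn = presto.connect(host="presto-coordinator.internal", port=8080,
                          catalog="hive", schema="default")
    cursor = conn.cursor()
    cursor.execute("""
        SELECT product, count(*) AS events
        FROM raw_activity
        WHERE dt = '2017-06-01'
        GROUP BY product
        ORDER BY events DESC
        LIMIT 10
    """)
    for product, events in cursor.fetchall():
        print(product, events)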
WRAP UP
Consumer of Data
[Architecture overview diagram; labels below]
Sources: Traveloka App (Android, iOS), Traveloka Services
Paths: Streaming, Batch
Components: Kafka, Batch Ingest, ETL, Data Warehouse, S3 Data Lake, NoSQL DB, Hive/Presto query, DOMO (Analytics UI)
Google Cloud components: Ingest (Cloud Pub/Sub), Storage (Cloud Storage), Pipelines (Cloud Dataflow), Analytics (BigQuery), Monitoring, Logging
Key Lessons Learned
● Scalability in mind -- especially disk full :)
● Scalable as granular as possible -- compute, storage
● Scalability needs to be tested (of course!)
● Do one thing, and do it well; dig into your requirements
-- SLA, query pattern
● Decouple publish and consume
-- publisher availability is very important!
● Choose tech that is specific to the use case
● Be careful of gotchas! There's no silver bullet...
THE FUTURE
Future Roadmap
● In the past, we saw problems/needs, looked at what technology
could solve them, and plugged it into the existing pipeline.
● That worked well.
● But after some time, we end up maintaining a lot of different
components.
● Multiple clusters:
○ Kafka
○ Spark
○ Hive/Presto
○ Redshift
○ etc
● Multiple data entry points for analysts:
○ BigQuery
○ Hive/Presto
○ Redshift
Our Goal
● Simplifying our data architecture.
● Single data entry point for data analysts/scientists,
both streaming and batch data.
● Without compromising what we can do now.
● Reliability, speed, and scale.
● Less or no ops.
● We also want to make migration as simple/easy as
possible.
How will we achieve this?
● There are a few options that we are considering right
now.
● Some of them introduce new
technologies/components.
● Some of them make use of our existing
technology to its maximum potential.
● We are trying exciting (relatively) new technologies
(see the Athena sketch after this list):
○ Google BigQuery
○ Google Dataprep on Dataflow
○ AWS Athena
○ AWS Redshift Spectrum
○ etc
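As one illustration of those experiments, an Athena query over data in S3 can be driven from Python with boto3; the region, database, table, and result bucket below are placeholders, not actual names.

    # Sketch: kicking off an AWS Athena query over data in S3.
    # Region, database, table, and result bucket are placeholders.
    import boto3

    athena = boto3.client("athena", region_name="ap-southeast-1")
    response = athena.start_query_execution(
        QueryString="SELECT count(*) FROM raw_activity WHERE dt = '2017-06-01'",
        QueryExecutionContext={"Database": "datalake"},
        ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
    )
    print("Query execution id:", response["QueryExecutionId"])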
Plan to simplify
Cloud Pub/Sub
Cloud Dataflow
BigQuery
Cloud Storage
Kubernetes Cluster
Collector
Managed services
BI & Analytics UI
BigTable
REST API
ML Models
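One piece of that target picture, sketched under assumptions (project and topic names are illustrative): the Collector on the Kubernetes cluster is just a thin service that publishes incoming events to Cloud Pub/Sub, after which Dataflow, BigQuery, and Cloud Storage are all managed services.

    # Sketch: the "Collector" from the plan above as a thin publisher to Cloud Pub/Sub.
    # Project and topic names are assumptions.
    import json
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "activity")

    def collect(event: dict) -> None:
        """Accept an event from the app/services and hand it off to Pub/Sub."""
        future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
        future.result()  # block until the publish is acknowledged

    collect({"user_id": "u-123", "action": "search_flight"})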
Plan to simplify
● Seems promising, but…
● It needs to be tested.
● Does it cover all the use cases we need?
● Query migration?
● Costs?
● Maintainability?
● Potential problems?
See You at the Next Event!
Thank You
