
Traveloka's data journey — Traveloka data meetup #2

Discover the journey that Traveloka's Data Team has taken so far, and learn from our struggles and triumphs in managing Traveloka's burgeoning data!

In this slide deck, you will learn about the stories and lessons learned on building a scalable data pipeline at Traveloka.

Presenters:
Nisrina Luthfiyati - Data Engineer
Rendy B. Junior - Data System Architect
Wilson Lauw - Lead Data Engineer

To follow our LinkedIn page, visit bit.ly/TravelokaLinkedInPage


Safe Harbor Statement

Our discussion may include predictions, estimates or other information that might be considered conclusive. While these conclusive statements represent our current judgment on the best practices, they are subject to risks and uncertainties that could cause actual results to differ materially. You are cautioned not to place undue reliance on our statements, which reflect our opinions only as of the date of this presentation. Please keep in mind that we are not obligating ourselves to revise or publicly release the results of any revision to these presentation materials in light of new information or future events.


Traveloka's data journey — Traveloka data meetup #2

  1. Traveloka's Data Journey: Stories and lessons learned on building a scalable data pipeline at Traveloka.
  2. Very Early Days...
  3. Very Early Days
     Components: Applications & Services, Summarizer, Internal Dashboard, Report Scripts + Crontab; data stored as Raw Activity, Key Value, and Time Series.
  4. Full... Split & Shard!
     The single Raw, KV, and Time Series DB is split: Applications & Services now write to a sharded Raw Activity store, a sharded Key Value DB, and a Time Series Summary produced by the Summarizer, all serving the Internal Dashboard and Report Scripts + Crontab.
     Lessons learned:
     1. UNIX principle: "Do One Thing and Do It Well."
     2. Split use cases based on SLA and query pattern.
     3. Choose scalable tech based on growth estimation.
  5. Throughput? Kafka comes to the rescue
     Applications & Services publish to Kafka as the datahub; a raw data consumer inserts into the sharded Raw Activity store and updates the sharded Key Value store.
     Lessons learned:
     1. Use something that can handle higher throughput for high-write-volume cases such as tracking.
     2. Decouple publish and consume. (See Sketch 1 after the slide list.)
  6. We need a Data Warehouse and a BI tool, and we need them fast!
     Sharded Raw Activity and other sources feed a Python ETL (temporary solution) into a star-schema DW on Postgres, queried through the Periscope BI tool.
     Lessons learned:
     1. Think about the DW from the beginning of the data pipeline.
     2. BI tools: do not reinvent the wheel. (See Sketch 2 after the slide list.)
  7. "Have" to adopt big data
  8. Postgres couldn't handle the load!
     Same pipeline, but the star-schema DW moves from Postgres to Redshift: sharded Raw Activity and other sources -> Python ETL (temporary solution) -> star-schema DW on Redshift -> Periscope BI tool.
     Lesson learned:
     1. Choose the specific tech that best fits the use case. (See Sketch 3 after the slide list.)
  9. Scaling out MongoDB every so often is not manageable...
     Raw activity moves to S3: Kafka as datahub -> Gobblin as consumer -> Raw Activity on S3.
     Lesson learned:
     1. MongoDB sharding: scalability needs to be tested! (See Sketch 4 after the slide list.)
  10. "Have" to adopt big data
      Kafka as datahub -> Gobblin as consumer -> Raw Activity on S3 -> processing on Spark -> star-schema DW on Redshift.
      Lessons learned:
      1. Processing has to be easily scaled.
      2. Scale processing separately for day-to-day jobs and backfill jobs. (See Sketch 5 after the slide list.)
  11. Near real time on big data is challenging
      Kafka as datahub -> MemSQL as the near-real-time DB.
      Lesson learned:
      1. Dig into the requirements until they are very specific; for data this means: 1) latency SLA, 2) query pattern, 3) accuracy, 4) processing requirements, 5) tools integration. (See Sketch 6 after the slide list.)
  12. No OPS!!!
  13. Open your mind to any combination of tech!
      Pub/Sub as datahub -> Dataflow for stream processing -> Key Value on DynamoDB.
      Lessons learned:
      1. Combining cloud providers is possible, but be careful about latency.
      2. During a research project, always prepare plans B and C, plus a proper buffer in the timeline.
      3. Autoscale! (See Sketch 7 after the slide list.)
  14. More autoscale!
      Pub/Sub as datahub -> BigQuery as the near-real-time DB.
      Lesson learned:
      1. Autoscale = cost monitoring.
      Caveat: autoscale != everything solved, e.g. the Pub/Sub default quota of 200 MB/s (it can be increased, but only by manual request). (See Sketch 8 after the slide list.)
  15. More autoscale!
      Kafka as datahub -> Gobblin as consumer -> Raw Activity on S3 -> processing on Spark -> Hive & Presto on Qubole as the query engine -> BI & exploration tools.
      Lessons learned:
      1. Make scalability as granular as possible; in this case, separate compute and storage scalability.
      2. Separate the BI use case (with a well-defined SLA) from the exploration use case. (See Sketch 9 after the slide list.)
  16. WRAP UP
  17. Consumers of data: streaming and batch
      Sources: Traveloka App (Android, iOS) and Traveloka Services.
      Batch side: Kafka, ETL, Data Warehouse, S3 Data Lake with batch ingest, NoSQL DB, Hive and Presto for query, and DOMO as the analytics UI.
      Streaming side (GCP): Cloud Pub/Sub for ingest, Cloud Storage for storage, Cloud Dataflow for pipelines, BigQuery for analytics, plus monitoring and logging.
  18. Key Lessons Learned
      ● Keep scalability in mind -- especially disk full.. :)
      ● Make scalability as granular as possible -- compute, storage.
      ● Scalability needs to be tested (of course!).
      ● Do one thing, and do it well; dig into your requirements -- SLA, query pattern.
      ● Decouple publish and consume -- publisher availability is very important!
      ● Choose tech that is specific to the use case.
      ● Be careful of gotchas! There's no silver bullet...
  19. THE FUTURE
  20. Future Roadmap
      ● In the past, we would see a problem or need, see what technology could solve it, and plug that technology into the existing pipeline.
      ● It works well.
      ● But after some time, we have to maintain a lot of different components.
      ● Multiple clusters:
        ○ Kafka
        ○ Spark
        ○ Hive/Presto
        ○ Redshift
        ○ etc.
      ● Multiple data entry points for analysts:
        ○ BigQuery
        ○ Hive/Presto
        ○ Redshift
  21. Our Goal
      ● Simplify our data architecture.
      ● A single data entry point for data analysts/scientists, covering both streaming and batch data.
      ● Without compromising what we can do now.
      ● Reliability, speed, and scale.
      ● Less ops, or no ops.
      ● We also want to make the migration as simple and easy as possible.
  22. How will we achieve this?
      ● There are a few options that we are considering right now.
      ● Some of them introduce new technologies/components.
      ● Some of them make use of our existing technology to its maximum potential.
      ● We are trying exciting (relatively) new technologies:
        ○ Google BigQuery
        ○ Google Dataprep on Dataflow
        ○ AWS Athena
        ○ AWS Redshift Spectrum
        ○ etc.
  23. Plan to simplify
      Components: a Collector on a Kubernetes cluster; managed services (Cloud Pub/Sub, Cloud Dataflow, BigQuery, Cloud Storage, BigTable); BI & analytics UI; REST API; ML models.
  24. Plan to simplify
      ● Seems promising, but…
      ● It needs to be tested.
      ● Does it cover all the use cases we need?
      ● Query migration?
      ● Costs?
      ● Maintainability?
      ● Potential problems?
  25. See You At The Next Event! Thank You
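
Sketch 1 (slide 5): a minimal illustration of decoupling publish from consume with Kafka. This is not code from the talk; the broker address, topic name, consumer group, and event fields are hypothetical, and the kafka-python client is used here.

    # Publisher and consumer are separate processes; the publisher never waits for downstream stores.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    BROKERS = "localhost:9092"   # hypothetical broker address
    TOPIC = "user-activity"      # hypothetical topic name

    # Publish side: application services fire events and move on.
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"user_id": 42, "event": "search", "ts": "2018-01-01T00:00:00Z"})
    producer.flush()

    # Consume side: a separate raw-data consumer drains the topic at its own pace
    # and writes to the raw-activity and key-value stores.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="raw-data-consumer",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        event = message.value
        # insert into the raw activity store / update the key-value store here
        print(event)

Because the consumer group tracks its own offsets, the publisher's availability does not depend on the consumers keeping up, which is the decoupling point the slide makes.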
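
Sketch 2 (slide 6): a minimal illustration of a hand-rolled Python ETL step loading a star schema on Postgres. The table and column names (dim_user, fact_booking) and the DSN are hypothetical; psycopg2 is used here.

    import psycopg2

    def load_bookings(rows, dsn="dbname=dw user=etl"):
        """rows: an iterable of dicts extracted from the raw activity store."""
        conn = psycopg2.connect(dsn)
        try:
            with conn, conn.cursor() as cur:
                for r in rows:
                    # upsert the dimension row, then insert the fact row that references it
                    cur.execute(
                        """
                        INSERT INTO dim_user (user_id, country)
                        VALUES (%s, %s)
                        ON CONFLICT (user_id) DO UPDATE SET country = EXCLUDED.country
                        """,
                        (r["user_id"], r["country"]),
                    )
                    cur.execute(
                        """
                        INSERT INTO fact_booking (user_id, product, amount, booked_at)
                        VALUES (%s, %s, %s, %s)
                        """,
                        (r["user_id"], r["product"], r["amount"], r["booked_at"]),
                    )
        finally:
            conn.close()

Row-by-row inserts like this are workable as a temporary solution, which is exactly how the slide labels it.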
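
Sketch 3 (slide 8): swapping the warehouse from Postgres to Redshift mostly changes how it is loaded. A plausible pattern is staging files on S3 and bulk-loading with COPY; the cluster endpoint, bucket, IAM role, and table name below are hypothetical.

    import psycopg2

    REDSHIFT_DSN = ("host=my-cluster.example.redshift.amazonaws.com "
                    "port=5439 dbname=dw user=etl password=secret")

    def copy_fact_booking():
        conn = psycopg2.connect(REDSHIFT_DSN)
        try:
            with conn, conn.cursor() as cur:
                # Bulk-load staged, compressed CSV files instead of inserting row by row.
                cur.execute(
                    """
                    COPY fact_booking
                    FROM 's3://my-dw-staging/fact_booking/'
                    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
                    FORMAT AS CSV
                    GZIP
                    """
                )
        finally:
            conn.close()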
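
Sketch 4 (slide 9): the real pipeline used Apache Gobblin to land Kafka data on S3; the stand-in below only illustrates the shape of that job with kafka-python and boto3. The topic, bucket, and key layout are hypothetical.

    import json
    import time
    import boto3
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "user-activity",
        bootstrap_servers="localhost:9092",
        group_id="s3-archiver",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    s3 = boto3.client("s3")

    batch, BATCH_SIZE = [], 1000
    for message in consumer:
        batch.append(message.value)
        if len(batch) >= BATCH_SIZE:
            # One newline-delimited JSON file per batch, partitioned by date.
            key = "raw-activity/dt=%s/%d.json" % (time.strftime("%Y-%m-%d"), time.time())
            body = "\n".join(json.dumps(e) for e in batch).encode("utf-8")
            s3.put_object(Bucket="my-raw-activity-bucket", Key=key, Body=body)
            batch = []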
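
Sketch 5 (slide 10): a minimal PySpark job that turns raw activity on S3 into a star-schema table on Redshift. The paths, JDBC URL, credentials, and table names are hypothetical, and the Redshift JDBC driver must be on the Spark classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-activity-rollup").getOrCreate()

    # The day-to-day job reads a single date partition; a backfill job would read a wider path.
    raw = spark.read.json("s3a://my-raw-activity-bucket/raw-activity/dt=2018-01-01/")

    daily = (
        raw.groupBy("user_id", "event")
           .agg(F.count("*").alias("event_count"))
    )

    (daily.write
          .format("jdbc")
          .option("url", "jdbc:redshift://my-cluster.example:5439/dw")
          .option("dbtable", "fact_daily_activity")
          .option("user", "etl")
          .option("password", "secret")
          .mode("append")
          .save())

Reading one partition per run is what makes it practical to size the day-to-day job and the backfill job separately, as the slide suggests.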
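
Sketch 6 (slide 11): MemSQL speaks the MySQL wire protocol, so a stream consumer can maintain near-real-time counters with a plain MySQL client. The host, database, table, and upsert key below are hypothetical; PyMySQL is used here.

    import pymysql

    conn = pymysql.connect(host="memsql-host", user="app", password="secret", database="rt")

    def record_event(user_id, event):
        # Upsert a per-user, per-event counter so dashboards can query it with low latency.
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO rt_event_counts (user_id, event, cnt)
                VALUES (%s, %s, 1)
                ON DUPLICATE KEY UPDATE cnt = cnt + 1
                """,
                (user_id, event),
            )
        conn.commit()

Whether a design like this is acceptable depends on exactly the requirements the slide lists: latency SLA, query pattern, accuracy, processing needs, and tool integration.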
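
Sketch 7 (slide 13): a Dataflow-style streaming pipeline written with the Apache Beam Python SDK, reading events from Pub/Sub and writing them to a key-value table on DynamoDB. The subscription, table name, and event fields are hypothetical.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class WriteToDynamo(beam.DoFn):
        def __init__(self, table_name):
            self.table_name = table_name

        def setup(self):
            import boto3  # imported on the worker
            self.table = boto3.resource("dynamodb").Table(self.table_name)

        def process(self, event):
            self.table.put_item(Item={"user_id": str(event["user_id"]),
                                      "payload": json.dumps(event)})

    def run():
        # Pass --runner=DataflowRunner plus project/region options on the command line.
        options = PipelineOptions(streaming=True)
        with beam.Pipeline(options=options) as p:
            (p
             | "ReadPubSub" >> beam.io.ReadFromPubSub(
                   subscription="projects/my-project/subscriptions/user-activity")
             | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
             | "WriteKV" >> beam.ParDo(WriteToDynamo("user-profile-kv")))

    if __name__ == "__main__":
        run()

Mixing GCP (Pub/Sub, Dataflow) with AWS (DynamoDB) like this is exactly the cross-cloud combination the slide says can work but needs latency watched.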
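
Sketch 8 (slide 14): streaming rows into BigQuery for near-real-time queries with the google-cloud-bigquery client. The project, dataset, table, and row schema are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.realtime.user_activity"  # hypothetical project.dataset.table

    rows = [
        {"user_id": 42, "event": "search", "ts": "2018-01-01T00:00:00Z"},
    ]
    errors = client.insert_rows_json(table_id, rows)  # streaming insert
    if errors:
        raise RuntimeError("BigQuery insert errors: %s" % errors)

Autoscaling services still have quotas, as the slide's Pub/Sub example notes, so both the ingestion rate and the resulting cost need monitoring.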
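
Sketch 9 (slide 15): querying raw activity that lives on S3 through Presto, so storage (S3) and compute (the Presto/Hive cluster, run on Qubole in the talk) scale independently. The coordinator host, schema, table, and partition column are hypothetical; PyHive is used here.

    from pyhive import presto

    conn = presto.connect(host="presto-coordinator.example.com", port=8080,
                          catalog="hive", schema="raw")
    cur = conn.cursor()
    cur.execute("""
        SELECT event, count(*) AS event_count
        FROM user_activity  -- an external table over s3://my-raw-activity-bucket/raw-activity/
        WHERE dt = '2018-01-01'
        GROUP BY event
    """)
    for row in cur.fetchall():
        print(row)

Because the table is external, the query cluster can be resized or shut down without touching the data, which is the granular-scalability point on the slide.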
