
Scalable data pipeline at Traveloka - Facebook Dev Bandung


  1. Scalable Data Pipeline @ Traveloka: How We Get There. Stories and lessons learned from building a scalable data pipeline at Traveloka.
  2. Very early days: applications & services, a summarizer, an internal dashboard, and report scripts + crontab, with data kept as raw activity, key-value, and time series.
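As an illustration of this stage, here is a minimal sketch of what a cron-driven summarizer could look like; the file paths, event fields, and output files are hypothetical, not Traveloka's actual scripts.

```python
# summarize.py -- illustrative cron-driven summarizer (all names hypothetical).
import json
from collections import Counter
from datetime import datetime

key_counts = Counter()      # key-value style summary (events per type)
hourly_counts = Counter()   # time-series style summary (events per hour)

with open("raw_activity.log") as f:          # assumed raw-activity dump
    for line in f:
        event = json.loads(line)
        key_counts[event["event_type"]] += 1
        hour = datetime.utcfromtimestamp(event["ts"]).strftime("%Y-%m-%dT%H:00")
        hourly_counts[hour] += 1

with open("summary_kv.json", "w") as f:
    json.dump(dict(key_counts), f)
with open("summary_timeseries.json", "w") as f:
    json.dump(dict(hourly_counts), f)

# Scheduled via crontab, e.g.: 0 * * * * python summarize.py
```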
  3. Full... Split & shard the raw, KV, and time-series DBs! Applications & services now write to a sharded raw-activity store and a sharded key-value DB, with a time-series summary produced by the summarizer; the internal dashboard and report scripts + crontab read from these.
     Lessons learned:
     1. UNIX principle: "Do One Thing and Do It Well"
     2. Split use cases based on SLA & query pattern
     3. Choose scalable tech based on growth estimates
  4. Throughput? Kafka to the rescue. Applications & services publish to Kafka as a datahub; a raw-data consumer does the inserts and updates into the sharded raw-activity and key-value stores.
     Lessons learned:
     1. Use something that can handle high throughput for high-write-volume cases such as tracking
     2. Decouple publish and consume
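To illustrate the "decouple publish and consume" lesson, here is a minimal sketch using the kafka-python client; the broker address, topic name, event shape, and the upsert helper are assumptions, not the actual Traveloka setup.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Publisher side: application services fire-and-forget events to the datahub.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw-activity", {"user_id": 42, "event": "search_flight"})
producer.flush()

def upsert_into_store(event):
    """Hypothetical insert/update into the sharded raw-activity / key-value DBs."""
    print("would upsert:", event)

# Consumer side: a separate process drains the topic at its own pace,
# so a slow store never blocks the publishers.
consumer = KafkaConsumer(
    "raw-activity",
    bootstrap_servers="kafka:9092",
    group_id="raw-data-consumer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    upsert_into_store(message.value)
```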
  5. We need a data warehouse and a BI tool, and we need them fast! A Python ETL (temporary solution) loads the sharded raw activity and other sources into a star-schema DW on Postgres, queried through the Periscope BI tool.
     Lessons learned:
     1. Think about the DW from the very beginning of the data pipeline
     2. BI tools: do not reinvent the wheel
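A minimal sketch of a temporary Python ETL into a star-schema warehouse on Postgres, in the spirit of this slide; the connection strings, table names, and columns are made up for illustration.

```python
# Illustrative extract-transform-load into a star schema (all names hypothetical).
import psycopg2

src = psycopg2.connect("dbname=raw_activity host=raw-shard-1")   # a raw-activity shard
dw = psycopg2.connect("dbname=dw host=postgres-dw")               # the star-schema DW

with src.cursor() as read_cur, dw.cursor() as write_cur:
    read_cur.execute(
        "SELECT user_id, booking_id, amount, created_at::date FROM bookings"
    )
    for user_id, booking_id, amount, booking_date in read_cur:
        write_cur.execute(
            "INSERT INTO fact_bookings (user_key, booking_id, amount, date_key) "
            "VALUES (%s, %s, %s, %s)",
            (user_id, booking_id, amount, booking_date),
        )
dw.commit()
```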
  6. Postgres couldn't handle the load! Same pipeline, but the star-schema DW moves from Postgres to Redshift, still fed by the Python ETL and queried through the Periscope BI tool.
     Lesson learned:
     1. Choose the specific tech that best fits the use case
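Because Redshift speaks the Postgres wire protocol, a psycopg2-based ETL can largely be pointed at the new cluster; bulk loads, though, usually go through S3 and COPY rather than row-by-row inserts. A sketch, with a hypothetical cluster endpoint, bucket, and IAM role:

```python
# Illustrative Redshift bulk load via COPY from S3 (all names hypothetical).
import psycopg2

dw = psycopg2.connect(
    "dbname=dw user=etl port=5439 "
    "host=redshift-cluster.example.us-east-1.redshift.amazonaws.com"
)
with dw.cursor() as cur:
    cur.execute(
        "COPY fact_bookings FROM 's3://etl-staging/fact_bookings/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
        "CSV"
    )
dw.commit()
```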
  7. Scaling out MongoDB every so often is not manageable... Raw activity now flows through Kafka as the datahub, with Gobblin as the consumer landing it on S3.
     Lesson learned:
     1. MongoDB sharding: scalability needs to be tested!
  8. "Have" to adopt big data. Kafka as datahub, Gobblin as consumer, raw activity on S3, processing on Spark, and a star-schema DW on Redshift.
     Lessons learned:
     1. Processing has to scale easily
     2. Scale processing separately for day-to-day jobs and backfill jobs
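A minimal PySpark sketch of the kind of daily job this stage implies: the same script, run with a different date (and on a separately sized cluster), doubles as the backfill job. The S3 paths, event schema, and field names are assumptions.

```python
# Illustrative daily batch job over Gobblin's S3 output (all names hypothetical).
import sys
from pyspark.sql import SparkSession, functions as F

run_date = sys.argv[1]                        # e.g. "2017-06-01"; loop over dates to backfill
spark = SparkSession.builder.appName("daily-bookings").getOrCreate()

raw = spark.read.json(f"s3a://raw-activity/dt={run_date}/")
daily = (
    raw.filter(F.col("event") == "booking_completed")
       .groupBy("product", "currency")
       .agg(
           F.sum("amount").alias("gross_amount"),
           F.count(F.lit(1)).alias("bookings"),
       )
)
# Staged output; a separate step would load this partition into the Redshift DW.
daily.write.mode("overwrite").parquet(f"s3a://dw-staging/daily_bookings/dt={run_date}/")
```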
  9. Near real time on big data is challenging. Kafka as datahub, MemSQL as the near-real-time DB.
     Lesson learned:
     1. Dig into requirements until they are very specific; for data this relates to: 1) latency SLA, 2) query pattern, 3) accuracy, 4) processing requirements, 5) tools integration
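MemSQL speaks the MySQL wire protocol, so a thin Kafka-to-MemSQL writer can be sketched with kafka-python and PyMySQL; the host, credentials, table, and event fields below are placeholders, and a real pipeline would batch inserts to meet the latency SLA.

```python
# Illustrative Kafka -> MemSQL writer for the near-real-time DB (all names hypothetical).
import json
import pymysql
from kafka import KafkaConsumer

db = pymysql.connect(host="memsql-agg", user="etl", password="...", database="nrt")
consumer = KafkaConsumer(
    "raw-activity",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

with db.cursor() as cur:
    for msg in consumer:
        cur.execute(
            "INSERT INTO activity (user_id, event, ts) VALUES (%s, %s, %s)",
            (msg.value["user_id"], msg.value["event"], msg.value["ts"]),
        )
        db.commit()   # per-event commit keeps the sketch simple; batch in practice
```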
  10. Open your mind to any combination of tech! Pub/Sub as datahub, Dataflow for stream processing, key-value store on DynamoDB.
     Lessons learned:
     1. Combining cloud providers is possible, but be careful about latency
     2. During a research project, always prepare plans B & C plus a proper buffer in the timeline
     3. Autoscale!
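A minimal Apache Beam (Python) sketch of the cross-cloud combination on this slide: Dataflow reads from Pub/Sub and a DoFn writes key-value records into DynamoDB with boto3. The subscription, table, region, and record fields are hypothetical, and the inter-cloud hop is exactly the latency concern called out above.

```python
# Illustrative Pub/Sub -> Dataflow -> DynamoDB pipeline (all names hypothetical).
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class WriteToDynamo(beam.DoFn):
    def setup(self):
        import boto3
        self.table = boto3.resource("dynamodb", region_name="ap-southeast-1") \
                          .Table("user-activity")

    def process(self, record):
        # Key-value upsert: latest event per user.
        self.table.put_item(Item={"user_id": record["user_id"],
                                  "last_event": record["event"]})

options = PipelineOptions(streaming=True)   # plus Dataflow runner/project flags in practice
with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(subscription="projects/example/subscriptions/raw-activity")
     | beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | beam.ParDo(WriteToDynamo()))
```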
  11. More autoscale! Pub/Sub as datahub, BigQuery as the near-real-time DB.
     Lesson learned:
     1. Autoscale = cost monitoring
     Caveat: autoscale != everything solved, e.g. Pub/Sub's default quota of 200 MB/s (it can be increased, but requires a manual request)
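A minimal sketch of the BigQuery side using the google-cloud-bigquery streaming-insert API; the project, dataset, table, and row contents are placeholders. The quota caveat above applies to the Pub/Sub side feeding this.

```python
# Illustrative streaming insert into the near-real-time BigQuery table (names hypothetical).
from google.cloud import bigquery

client = bigquery.Client()
table_id = "example-project.near_real_time.raw_activity"

rows = [{"user_id": 42, "event": "search_flight", "ts": "2017-06-01T10:00:00Z"}]
errors = client.insert_rows_json(table_id, rows)   # streaming insert API
if errors:
    print("Insert failed:", errors)
```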
  12. More autoscale! Kafka as datahub, Gobblin as consumer, raw activity on S3, processing on Spark, Hive & Presto on Qubole as the query engine, plus BI & exploration tools.
     Lessons learned:
     1. Make scalability as granular as possible; in this case, separate compute and storage scalability
     2. Separate BI with a well-defined SLA from exploration use cases
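A minimal sketch of querying the S3-resident raw activity through Presto with PyHive, illustrating the compute/storage split: the data stays on S3 as an external table while the Presto/Hive clusters scale independently. The coordinator host, schema, table, and partition column are assumptions.

```python
# Illustrative Presto query over an external table on S3 (all names hypothetical).
from pyhive import presto

conn = presto.connect(host="presto-coordinator", port=8080,
                      catalog="hive", schema="analytics")
cur = conn.cursor()
cur.execute(
    "SELECT dt, count(*) AS events "
    "FROM raw_activity "          # external table backed by s3://raw-activity/
    "WHERE dt >= '2017-06-01' "
    "GROUP BY dt ORDER BY dt"
)
for row in cur.fetchall():
    print(row)
```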
  13. Key Lessons Learned
     ● Keep scalability in mind -- especially disks filling up :)
     ● Make scalability as granular as possible -- compute, storage
     ● Scalability needs to be tested (of course!)
     ● Do one thing and do it well; dig into your requirements -- SLA, query pattern
     ● Decouple publish and consume -- publisher availability is very important!
     ● Choose tech that is specific to the use case
     ● Be careful of gotchas! There's no silver bullet...
  14. Future Roadmap
     - In the past, we saw a problem/need, looked at what technology could solve it, and plugged it into the existing pipeline.
     - It worked well.
     - But after some time, we need to maintain a lot of different components.
     - Multiple clusters: Kafka, Spark, Hive/Presto, Redshift, etc.
     - Multiple data entry points for analysts: BigQuery, Hive/Presto, Redshift
  15. Future Roadmap. Our goal:
     - Simplify our data architecture.
     - A single data entry point for data analysts/scientists, for both streaming and batch data.
     - Without compromising what we can do now: reliability, speed, and scale.
     - Little or no ops.
     - We also want to make the migration as simple/easy as possible.
  16. Future Roadmap. How will we achieve this?
     - There are a few options we are considering right now.
     - Some of them introduce new technologies/components.
     - Some of them make use of our existing technology to its maximum potential.
     - We are trying exciting, relatively new technologies: Google BigQuery, AWS Athena, AWS Redshift Spectrum, etc.
  17. Thanks! See you at the next event.
