
Continuous Optimization for Distributed BigData Analysis


Talk at HighLoad++ 2018, Moscow


  1. Continuous Optimization for Distributed BigData Analysis. Kai Sasaki (Treasure Data)
  2. Bio: Kai Sasaki - Software Engineer at Treasure Data - Hadoop, Presto - Apache Hivemall - Books
  3. Design and Concept https://pixabay.com/en/desktop-tidy-clean-mockup-white-2325627/
  4. Agenda - Who is Treasure Data? - What is distributed data analysis? - What kinds of challenges do we have? - Our approach - Columnar Storage - Partitioning - Repartitioning
  5. Treasure Data
  6. Treasure Data • Founded in Dec 2011 • Mountain View, CA • DMP, CDP, IoT, Cloud • Joined Arm in Oct 2018
  7. Treasure Data • Open Source Lovers
  8. Enterprise Data Analysis
  9. Arm x Treasure Data • Pelion: Device-to-Device Platform
  10. Challenges based on Our Experience https://pixabay.com/en/adventure-height-climbing-mountain-1807524/
  11. Distributed Data Analysis? • Large-Scale Data • High Throughput • High Availability & Reliability • Data Consistency
  12. Distributed Processing Engines • Hadoop • Presto • Spark
  13. Typical Architecture • Master-Worker Model https://www.tutorialspoint.com/apache_presto/apache_presto_architecture.htm
  14. Distributed Plan: select t1.class, t2.features, count(1) from iris t1 join iris t2 on t1.class = t2.class group by 1, 2;
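The distributed plan on slide 14 can be sketched as follows: a shuffle stage hashes rows by the join/group-by key (`class`) so that every worker can join and aggregate its own shard independently. This is a hypothetical illustration of the general technique, not Presto's actual API; names like `shuffle_by_key` are invented for the sketch.

```python
from collections import defaultdict

# Toy data standing in for the iris table from the slide.
iris = [
    {"class": "setosa", "features": [5.1, 3.5]},
    {"class": "setosa", "features": [4.9, 3.0]},
    {"class": "virginica", "features": [6.3, 3.3]},
]

NUM_WORKERS = 2

def shuffle_by_key(rows, key, num_workers):
    """Assign each row to a worker shard by hashing the partitioning key."""
    shards = [[] for _ in range(num_workers)]
    for row in rows:
        shards[hash(row[key]) % num_workers].append(row)
    return shards

shards = shuffle_by_key(iris, "class", NUM_WORKERS)

# Every row with the same class sits in the same shard, so each worker
# can run the self-join and the count(1) aggregation locally; the final
# result is just the union of the per-worker results.
result = defaultdict(int)
for shard in shards:
    for t1 in shard:
        for t2 in shard:
            if t1["class"] == t2["class"]:
                result[(t1["class"], tuple(t2["features"]))] += 1

print(dict(result))
```

Because equal keys always hash to the same shard, no cross-worker communication is needed after the single shuffle, which is exactly why the choice of partitioning key matters for throughput later in the talk.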
  15. Challenges • Network Bandwidth • Throughput • Transactional Processing • Data Consistency • System Reliability • Service Availability
  16. Our Approach • Columnar Storage • MessagePack-based columnar format • Time Index Pushdown • Optimization of Partitioning Layout
  17. Columnar Storage • A common design for OLAP workloads • Saves IO bandwidth • Efficient compression and encoding • e.g. Parquet, ORC
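As a minimal sketch of why columnar layout saves IO for OLAP queries: a query touching one column only scans that column's contiguous values instead of deserializing whole rows. The table and field names below are invented for illustration (prices are stored as integer cents to keep arithmetic exact).

```python
# Row layout stores whole records one after another; column layout stores
# each field's values together, which lets a scan read only the columns
# the query needs and also compresses better (adjacent similar values).
rows = [
    {"time": 1, "user_id": 10, "price_cents": 999},
    {"time": 2, "user_id": 11, "price_cents": 450},
    {"time": 3, "user_id": 10, "price_cents": 125},
]

# Transpose the row-oriented data into a columnar representation.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# SELECT sum(price_cents) scans one of the three columns, i.e. roughly
# a third of the bytes a row-oriented scan would have to read here.
print(sum(columns["price_cents"]))  # 1574
```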
  18. MessagePack • JSON-like binary serialization format • Faster and smaller • 100+ implementations • https://msgpack.org
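As a rough illustration of why MessagePack is smaller than JSON, here is a toy encoder covering just two of the format's types (positive fixint and fixstr). Real libraries listed at https://msgpack.org implement the full specification; this sketch is only meant to show the single-byte framing.

```python
def pack(value):
    """Encode a tiny subset of MessagePack: nil, small ints, short strings."""
    if value is None:
        return b"\xc0"                      # nil
    if isinstance(value, int) and 0 <= value <= 0x7F:
        return bytes([value])               # positive fixint: the byte itself
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) < 32:
            # fixstr: length packed into the low 5 bits of the type byte
            return bytes([0xA0 | len(data)]) + data
    raise NotImplementedError("sketch only handles small ints and strings")

# "abc" is 4 bytes in MessagePack vs. 5 in JSON ('"abc"'); the larger wins
# come from binary numbers and from not repeating field names per record.
print(pack(5).hex())      # 05
print(pack("abc").hex())  # a3616263
```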
  19. MessagePack x Columnar File • Type-embedded file format • Schema-on-Read • -> Saves network bandwidth and storage space
  20. MessagePack x Columnar File
  21. Time Index Pushdown • Skips reads by time range • Fits typical analytical use cases • Saves network bandwidth
  22. Time Index Pushdown • Indexed by PostgreSQL • Transactional updates • Data consistency • A GiST index enables an efficient multi-column index
  23. Time-Range Partitioning
  24. Time Index Pushdown
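The pruning step behind time index pushdown can be sketched as follows, assuming each partition carries a [start, end) time range in the index (in the talk that index lives in PostgreSQL; the data structures and names here are illustrative only).

```python
# Each entry: (partition name, start, end) in seconds since epoch, [start, end).
partitions = [
    ("p0", 0, 3600),
    ("p1", 3600, 7200),
    ("p2", 7200, 10800),
]

def prune(partitions, query_start, query_end):
    """Keep only partitions whose time range overlaps [query_start, query_end)."""
    return [name for name, lo, hi in partitions
            if lo < query_end and hi > query_start]

# A query over [4000, 8000) never opens p0 at all, which is exactly the
# read skipping that saves IO and network bandwidth on the slide above.
print(prune(partitions, 4000, 8000))  # ['p1', 'p2']
```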
  25. Partition Size? • Partition file size affects performance significantly • Rules of thumb: 1,000,000 records / file or 256MB / file • But it depends on the workload
  26. Auto Optimization • The partitioning layout should fit the actual workload • File size • Time range • Partitioning key
  27. Repartitioning • Many small partition files -> high IO overhead • A few large partition files -> high memory pressure • TRADE-OFF PROBLEM
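One simple way to navigate that trade-off is to greedily coalesce small files into merged files near a target size, so that neither extreme dominates. This is a sketch under assumptions: the slides do not describe the actual merging policy, so the greedy batching below is an invented illustration, not the Stella connector's algorithm.

```python
TARGET_BYTES = 256 * 1024 * 1024  # 256MB, the rule of thumb from the slides

def coalesce(file_sizes, target=TARGET_BYTES):
    """Group file sizes (bytes) into batches whose totals stay near the target.

    Closes the current batch as soon as adding the next file would exceed
    the target, trading a little slack for a single linear pass.
    """
    batches, current, total = [], [], 0
    for size in file_sizes:
        if current and total + size > target:
            batches.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        batches.append(current)
    return batches

mb = 1024 * 1024
sizes = [100 * mb, 100 * mb, 100 * mb, 30 * mb, 200 * mb]
print([len(batch) for batch in coalesce(sizes)])  # [2, 2, 1]
```

Five small files become three merged files, each under the 256MB cap: fewer files than before (less IO overhead) without any file growing past the memory-pressure threshold.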
  28. Repartitioning • The partitioning key determines the throughput • e.g. customer segmentation by • User ID • Purchased item • Home address
  29. User Defined Partitioning • A custom partitioning schema defined on the user side (or by ourselves)
  30. User Defined Partitioning
  31. Colocated Join
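The colocated join idea can be sketched like this: when two tables are bucketed by the same user-defined partitioning key (say, a user ID) with the same bucket count, matching rows are guaranteed to land in the same bucket, so each worker joins its local buckets with no network shuffle. The tables and field names below are hypothetical examples, not Treasure Data schemas.

```python
NUM_BUCKETS = 4

def bucketize(rows, key, num_buckets=NUM_BUCKETS):
    """Distribute rows into buckets by the user-defined partitioning key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets

# Both tables use the same key and bucket count, so bucket i of `users`
# can only ever match bucket i of `events`.
users = bucketize([{"user_id": 1, "name": "a"},
                   {"user_id": 5, "name": "b"}], "user_id")
events = bucketize([{"user_id": 1, "action": "click"},
                    {"user_id": 5, "action": "view"}], "user_id")

# Join bucket-by-bucket: no cross-bucket (i.e. cross-worker) traffic needed.
joined = []
for user_bucket, event_bucket in zip(users, events):
    for u in user_bucket:
        for e in event_bucket:
            if u["user_id"] == e["user_id"]:
                joined.append((u["name"], e["action"]))

print(sorted(joined))  # [('a', 'click'), ('b', 'view')]
```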
  32. User Defined Partitioning
  33. User Defined Partitioning • Granularity • Partitioning Key Selection
  34. Stella Connector • Repartitioning & UDP are implemented as a Presto connector • Leverages Presto's high scalability and reliability for such a heavy workload
  35. Stella Connector:
      CREATE TABLE remerged
      WITH (max_file_size = '256MB', max_time_range = '48h') AS
      SELECT * FROM partition.sources
      WHERE table_schema = 'tpch_s1'
        AND table_name = 'lineitem'
        AND TD_TIME_RANGE(time, '1998-10-11', '1998-10-20')
  36. Stella Connector • Scalable • Reliable • Easy to embed into a workflow • Automatic storage optimization!
  37. Recap - Treasure Data Overview - Architecture of Distributed Data Analysis - Challenges - Our Approach - Columnar Storage - Partitioning - Repartitioning
  38. Thanks!
