Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Presentation by TachyonNexus & Intel at Strata Singapore 2015

1,077 views

Published on

Make Tachyon Ready for Next-Gen Data Center Platforms with NVM.

The talk was presented at Strata Singapore, December 2015, focusing on using Tachyon Tiered Storage with NVM as the next generation data center platforms.

Published in: Technology
  • Be the first to comment

Presentation by TachyonNexus & Intel at Strata Singapore 2015

  1. 1. Make Tachyon ready for next-gen data center platforms with NVM Mingfei Shi Intel mingfei.shi@intel.com Bin Fan Tachyon Nexus binfan@tachyonnexus.com 1
  2. 2. Intel Cloud & BigData Engineering Team • Work with community to optimize Apache Spark and Hadoop on Intel platform • Improve Spark scalability and reliability • Deliver better tools for management, benchmarking, tuning • e.g., HiBench, HiMeter • Build Spark based solutions 2
  3. 3. Tachyon Nexus • Team consists of Tachyon creators, top contributors • Series A ($7.5 million) from Andreessen Horowitz • Committed to Tachyon Open Source Project • www.tachyonnexus.com 3
  4. 4. Outline  Memory trend and challenges  Introduction of Tachyon & NVM  Performance testing and key learning  Summary 4
  5. 5. Outline  Memory trend and challenges  Introduction of Tachyon & NVM  Performance testing and key learning  Summary 5
  6. 6. Performance Trend: Memory is Fast • RAM throughput increasing exponentially • Disk throughput increasing slowly Memory-locality is important! 6
  7. 7. Price Trend: Memory is Cheaper source: jcmit.com $/ GB of Memory is lower, possible to handle huge size of data in memory7
  8. 8. New Memory Technology: Intel's Crazy-Fast 3D XPoint Memory • 1,000X faster than NAND • 10X denser than DRAM • less costly than DRAM • Shipments may start in 2017 8
  9. 9. Realized By Many Frameworks: 9
  10. 10. Challenges • Effectively share in-memory data among distributed applications. • GC overhead introduced by in memory caching • Data set could be larger than memory capacity 10
  11. 11. Outline  Memory trend and challenges  Introduction of Tachyon & NVM  Performance testing and key learning  Summary 11
  12. 12. What is Tachyon? 12
  13. 13. An Open Source Memory-centric Distributed Storage System 13
  14. 14. Tachyon Stack 14
  15. 15. How Easy to Use Tachyon in scala> val file = sc.textFile(“hdfs://foo”) scala> val file = sc.textFile(“tachyon://foo”) 15
  16. 16. Issue 1 Spark Job Spark Memory block 1 block 3 Hadoop MR Job YARN HDFS / Amazon S3 block 1 block 3 block 2 block 4 Data Sharing bottleneck in analytics pipeline: Slow writes to disk storage engine & execution engine same process 16
  17. 17. Issue 1 resolved with Tachyon Memory-speed data sharing among different jobs and different frameworks Spark Job Spark mem Hadoop MR Job YARN HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Tachyon in-memory block 1 block 3 block 4 storage engine & execution engine same process 17
  18. 18. Issue 2 Spark Task Spark Memory block manager block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 In-Memory data loss when computation crashes storage engine & execution engine same process 18
  19. 19. Issue 2 crash Spark Memory block manager block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 storage engine & execution engine same process In-Memory data loss when computation crashes 19
  20. 20. HDFS / Amazon S3 Issue 2 block 1 block 3 block 2 block 4 crash storage engine & execution engine same process In-Memory data loss when computation crashes 20
  21. 21. HDFS / Amazon S3 block 1 block 3 block 2 block 4 Tachyon in-memory block 1 block 3 block 4 Issue 2 resolved with Tachyon Spark Task Spark Memory block manager storage engine & execution engine same process Keep in-memory data safe, even when computation crashes 21
  22. 22. Issue 2 resolved with Tachyon HDFS disk block 1 block 3 block 2 block 4 Tachyon in-memory block 1 block 3 block 4 crash HDFS / Amazon S3 block 1 block 3 block 2 block 4 storage engine & execution engine same process Keep in-memory data safe, even when computation crashes 22
  23. 23. HDFS / Amazon S3 Issue 3 In-memory Data Duplication & Java Garbage Collection Spark Job1 Spark Memory block 1 block 3 Spark Job2 Spark Memory block 3 block 1 block 1 block 3 block 2 block 4 storage engine & execution engine same process 23
  24. 24. Issue 3 resolved with Tachyon No in-memory data duplication, much less GC Spark Job1 Spark mem Spark Job2 Spark mem HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Tachyon in-memory block 1 block 3 block 4 storage engine & execution engine same process 24
  25. 25. Challenges • Effectively share in-memory data sharing among distributed applications. => HDFS compatible, Spark/MapReduce can access Tachyon in-memory data with no modification • GC overhead introduced by in memory caching => No GC overhead as working set files are in memory but off-heap • Data set could be larger than memory capacity => Tiered storage 25
  26. 26. Tiered block storage: extend Tachyon space with external storage (e.g., SSD HDD etc.)  It is available since Tachyon 0.6 Release • The JIRA for tiered storage: TACHYON-33 • Different tiers have different speed and priority • “hot” data in upper tier • “warm” data in lower tier • Multiple directories in single tier MEM SSD HDD 26
  27. 27. Data migration among tiers • Data can be evicted to lower layer if it is less “hot” • Data can be promote to upper layer if it is “hot” again. • Pluggable Eviction Policy • LRU, LRFU pre-defined Evict stale data to lower tier Promote hot data to upper tier 27
  28. 28. Non-Volatile Memory (NVM)  NVM is the next era of computer memory • Including 3D NAND, solid-state drives, and Intel 3D XPoint™ technology.  Advantages of NVM • Fast, comes to memory performance, low latency • Non-Volatile, Capable of retrieving stored data even after a power outage. • Inexpensive 28
  29. 29. When Tachyon Meets NVM  Data caching & sharing at memory speed • Local cache for remote data • Data share in job chain / among different applications  Eliminate GC overhead • Storing data off heap on NVM  Efficient cache management • Storage hierarchy knowledge for different storage media, MEM SSD HDD etc • Quota management for different storage 29
  30. 30. Outline  Memory trend and challenges  Introduction of Tachyon & NVM  Performance testing and key learning  Summary 30
  31. 31. PCI-E SSD(P3700) SPEC  Due to availability of NVM hardware, use high speed SSD to simulate the usage of NVM • The NVM can be just used as block device like SSD, with lower latency and higher bandwidth • The performance of NVM will be better than PCI-E SSD, with similar behavior 31
  32. 32. Test Bed  One Tachyon master node  Four Tachyon worker nodes Worker node Configuration CPU IVB E5-2680 @ 2.8G 40 logical cores Memory 192GB DDR3 @ 1066 MHZ Disk P3700 SSD * 1 + HDD * 11 OS Distribution Redhat 6.2 (kernel 2.6.32-220.el6.x86_64) JDK version JDK1.8.0_60 (64 bit server) Spark version 1.4.1 release Tachyon version 0.8.1 release Hadoop version 2.5.0 32
  33. 33. U B C D V X Y A Experiment 1: Big Graph Computation  Background: Calculate similarity between videos  A case of Graph Analysis  Friends-of-friend  Iteratively compute association between nodes  Challenge  Generating huge Intermediate data after each iter 33
  34. 34. Architecture: Bagel over Tachyon  Bagel is a graph processing framework.  Spark implementation of Google Pregel  jobs run as a sequence of iterations.  In each iteration, each vertex runs a user-specified function  update state associated with the vertex and  send messages to other vertices for use in the next iteration.  The message generated and new vertex data will be cached after each iteration  Intermediate data can be cache in Tachyon • Tachyon can make full use of high speed device for data caching. aggregate & generate msg for next superstep compute new vertex data Tachyon OffHeap Bagel 34
  35. 35.  Input data size: 22G  Caching data size: 1.5TB  Four configuration to test Configuration Framework Storage Configuration Spark Spark: 11HDD + 1 SSD Spark Spark: 11HDD + 4 SSD Spark + Tachyon Spark: 11 HDD Tachyon (1 tier): 1 SSD (500GB) Spark + Tachyon Spark: 11 HDD Tachyon (2 tiers): 1 SSD (250GB) + 11 HDD (100GB per disk) Eviction strategy: LRU 35
  36. 36. Test Result  Tachyon with SSD & SSD + HDD outperforms original Spark, gets about 20% performance gain  In Spark + Tachyon(2 tiers), 2/3 of caching data is putted on SSD and 1/3 on Disk  Spark + Tachyon(1 tier) and Spark + Tachyon(2 tiers) have very similar performance 4899 4516 3848 3913 3000 3200 3400 3600 3800 4000 4200 4400 4600 4800 5000 Spark(11HDD + 1 SSD) Spark(11HDD + 4 SSD) Spark + Tachyon(SSD) Spark + Tachyon(SSD + 11 HDD) Duration (Second) 36
  37. 37. Performance Analysis – Disk utilization Spark (11 HDD + 1 SSD) Spark (11 HDD + 4 SSD) Spark + Tachyon (2 tiers) Spark + Tachyon (1 tier) Tachyon helps achieve higher disk utilization 37
  38. 38. Performance Analysis – CPU utilization Spark (11 HDD + 1 SSD) Spark (11 HDD + 4 SSD) Spark + Tachyon (1 tier) Spark + Tachyon (2 tiers) Tachyon helps achieve higher CPU utilization due to less time in IO 38
  39. 39.  Tachyon significantly speeds up Spark  Spark takes all kinds of storage as the same storage media, it randomly allocates space and access to all local directories configured.  With tiered storage and smarter cache management, Tachyon achieves comparable performance while requiring less fast memory resource • Tachyon tries to use the fast device as much as possible, and has efficient data cache management for “hot” and “warm” data. Experiment 1: Summary 39
  40. 40. Experiment 2: Recommendation System  Background: data analysis of user event logs  Job1: Top-K online videos  IO Bound  Job2: Visits of advertisement by unique users in one day/week  CPU Bound  Challenges:  storage and computation cluster are separate.  Data must be transferred from remote data cluster during computation  Same Input data is used by different applications  causing redundant network I/O. 40
  41. 41. Architecture: Tachyon as local cache • Deploy Tachyon in compute cluster to avoid the duplicated network I/O Storage serviceComputation Cluster Spark/MR Tachyon Cache data 41
  42. 42. Configuration  Input data size: 1.2TB  Tachyon • 1 tier with 500GB SSD on each worker  Network • 1 Gb connection between computation cluster and data service cluster • 10 Gb connection inside computation cluster 42
  43. 43. Test Result Job1 (IO Bound) 0 1000 2000 3000 4000 Spark Only Spark + Tachyon Duration Job2 (CPU Bound) 0 1000 2000 3000 4000 5000 6000 Spark Only Spark + Tachyon Duration 43
  44. 44. Performance Analysis – Job1 Spark-only Spark+Tachyon 44
  45. 45. Performance Analysis – Job2 Spark-only Spark+Tachyon 45
  46. 46.  Using Tachyon as local cache accelerates the application • For the application is I/O bound, Tachyon greatly improves the performance(3X) • For the application is CPU bound, Tachyon still brings much performance gain(2X) Experiment 2: Summary 46
  47. 47. Experiment 3: Data Sharing within a Pipeline  Background:  Intermediate data is shared in a job chain , the output of one job is the input of another job  Challenge:  Without Tachyon, output of the previous job needs to be written into HDFS, and read by another job 47
  48. 48. Architecture • With Tachyon, the output can be cached in Tachyon caching space, and makes the jobs share data without heavy network & HDD I/O Spark / MR job HDFS Tachy on Spark/ MR job Input Output Intermediate data read / write 48
  49. 49. Data share in job chain – Terasort  Two jobs in Terasort: • Data-preparation • Data-processing  Data sharing between two jobs: • Written into HDFS, with 1 replication • Written into Tachyon 49
  50. 50. Configuration  Input data size: 1TB  Tachyon caching space • 1 layer with 500G SSD on each worker  Two rounds of testing for the second phase: • Data processing-round1: • Write result into HDFS with 3 replications • Data processing-round2: • Write result into HDFS with 1 replication / tachyon 50
  51. 51. Test Result Data preparation Data processing 0 100 200 300 400 prepare-hdfs-1rep prepare-tachyon Duration 0 200 400 600 800 1000 1200 1400 1600 Run-hdfs-hdfs-1rep Run-tachyon-hdfs-1rep Run-tachyon-tachyon Duration 51
  52. 52. Performance Analysis – Data preparation Prepare-hdfs-1rep Prepare-Tachyon 52
  53. 53. Performance Analysis – Data processing-round1 Run-hdfs-hdfs-1rep Run-tachyon-tachyon 53
  54. 54. Performance Analysis – Data processing-round2 Run-tachyon-hdfs-1repRun-hdfs-hdfs-1rep 54
  55. 55. Experiment 3: Summary  Sharing Intermediate data via Tachyon with NVM boost the execution • For the first phase, which is purely I/O intensive, it brings (2X) performance gain • For the second phase, it brings 1.2X ~1.6X performance gain 55
  56. 56. Outline  Introduction and problem statement  Introduction of Tachyon & NVM  Performance testing and key learning  Summary 56
  57. 57. Conclusion  Memory is the new disk  Tachyon resolves the challenges in in-memory data management  NVM will offer great performance for data access in Tachyon 57
  58. 58. Q & A 58
  59. 59. 59
  60. 60.  Compute associations between two vertices n-hops away in a Graph • Getting weights between Vertices that have N degree association • Weightn(u, v) = (M is the number of paths that have exactly n edges) • = (We is the weight of edge)  A Graph Analysis case • E.g., friends of friend in social network  Graph-parallel implementation • Bagel (Pregel on Spark) U B C D V X Y A Test 1: Friends-of-friend 60
  61. 61. Problem statement  In memory becomes more important and popular • Cost / capacity of Memory is lower and lower, which makes it possible to handle huge size of data in memory • Many computation frameworks leverage memory • How to manage huge caching data is an interesting problem  There are still some challenges • Data sharing among applications • GC overhead introduced by in memory caching • When data is huge, external storage is still needed 61
  62. 62. Introduction of Tachyon  Tachyon is an memory-centric distributed storage system enabling data sharing at memor-speed acorss cluster frameworks, such as Spark Mapreduce etc. • Caches working set files in memory and off-heap • Enables different jobs/queries and frameworks to access cached files at memory speed  Features • Java-like file API • Hadoop file system compatible • Pluggable under layer file system  Tiered block storage http://www.tachyonproject.org/ Source: http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2014-10-16-Strata.pdf 62
  63. 63. Test cases for remote data cache  Input data is used by different applications • Input: event logs • Data format: • TimestampCategoryObjectIdEventID …  Input data location: • Without Tachyon, all data is putted on remote HDFS cluster • With Tachyon, all data is cached in local Tachyon  There are two cases, which share the same input data: • Case1 TopN • Case2 Nreach 63
  64. 64. Intro for TopN  Compute the top N object in each category • Used fields from input data: • <CategoryObjectId...> • select Category, ObjectId, count(*) as events from Input_data group by Category ,ObjectId order by events desc limit N where category='a';(for all categories) • Output: • CategoryObjectIdVisits  It can be used to calculate: • The best selling products in each category • Most popular videos in each category 64
  65. 65. Intro for Nreach  Computing unique event occurrence across the time range • Used fields from input data: • <TimeStamp, ObjectId, EventId, ...> • Output: • <ObjectId, TimeRange, Unique Event#, Unique Event(≥2)#, …, Unique Event(≥n)#> • Accumulated unique event# for each object and certain time range  It is used to calculate: • Visits of certain advertisement by unique users in one day/week 65
  66. 66. Tiered storage in Tachyon  Tiered block storage is used to extend Tachyon’s caching space with external storage, such as SSD HDD etc. • Different tiers have different speed and priority • “Hot” data and “warm” data are putted on different layers • Multiple directories in single tier  Data migration among tiers • “Hot” data can be evicted to lower layer, when it is not “hot” any longer by eviction strategies. • “Warm” data can be promote to top layer, when it becomes “hot” again.  It is available since Tachyon 0.6 Release • The JIRA for tiered storage: TACHYON-33 MEM SSD HDD Evict from top layer to bottom layer Promote to top layer 66

×