Using Spark with Tachyon by Gene Pang

Published in: Data & Analytics

  1. 1. Using Spark with Tachyon: An Open Source Memory-Centric Distributed Storage System • Gene Pang, Tachyon Nexus • gene@tachyonnexus.com • October 29, 2015 @ Spark Summit Europe
  2. 2. Who Am I? • Gene Pang • PhD from UC Berkeley AMPLab • Software Engineer at Tachyon Nexus
  3. 3. Tachyon Nexus • Team consists of Tachyon creators and top contributors • Series A ($7.5 million) from Andreessen Horowitz • Committed to Tachyon Open Source Project • www.tachyonnexus.com
  4. 4. Outline • Introduction to Tachyon • Using Spark with Tachyon • New Tachyon Features • Getting Involved
  5. 5. Outline • Introduction to Tachyon • Using Spark with Tachyon • New Tachyon Features • Getting Involved
  6. 6. History of Tachyon • Started at UC Berkeley AMPLab – From Summer 2012 – Same lab produced Apache Spark and Apache Mesos • Open sourced in April 2013 – Apache License 2.0 – Latest Release: Version 0.8.0 (October 2015) • Deployed at more than 100 companies
  7. 7. Contributors Growth • v0.1 (Dec '12): 1 • v0.2 (Apr '13): 3 • v0.3 (Oct '13): 15 • v0.4 (Feb '14): 30 • v0.5 (Jul '14): 46 • v0.6 (Mar '15): 70 • v0.7 (Jul '15): 111
  8. 8. Contributors Growth 150+ Contributors 50+ Organizations
  9. 9. One of the Fastest Growing Big Data Open Source Projects
  10. 10. Thanks to Contributors and Users!
  11. 11. Reported Tachyon Usage
  12. 12. What is Tachyon?
  13. 13. Open Source Memory-Centric Distributed Storage System
  14. 14. Tachyon Stack
  15. 15. Why Use Tachyon?
  16. 16. Performance Trend: Memory is Fast • RAM throughput increasing exponentially • Disk throughput increasing slowly • Conclusion: memory locality is important!
  17. 17. Price Trend: Memory is Cheaper (source: jcmit.com)
  18. 18. These Memory Trends are Realized By Many…
  19. 19. Is the Problem Solved? Missing a Solution for the Storage Layer
  20. 20. Tachyon enables reliable data sharing at memory speed within and across computation frameworks/jobs
  21. 21. How Does Tachyon Work? • Memory-Centric Storage Architecture • Lineage in Storage Layer
  22. 22. Tachyon Memory-Centric Architecture
  23. 23. Lineage in Tachyon
  24. 24. Outline • Introduction to Tachyon • Using Spark with Tachyon • New Tachyon Features • Getting Involved
  25. 25. Spark: Fast and general engine for large-scale data processing. What are some potential issues?
  26. 26. Issue 1: Data sharing bottleneck in analytics pipeline (slow writes to disk) [Diagram: Spark Job1 and Spark Job2 each hold blocks 1 and 3 in their own Spark memory, above HDFS / Amazon S3 holding blocks 1 to 4; label: "storage engine & execution engine same process"]
  27. 27. Issue 1: Data sharing bottleneck in analytics pipeline (slow writes to disk) [Diagram: a Spark job holding blocks 1 and 3 in Spark memory and a Hadoop MR job on YARN, above HDFS / Amazon S3 holding blocks 1 to 4; label: "storage engine & execution engine same process"]
  28. 28. Issue 1 resolved with Tachyon: Memory-speed data sharing among different jobs and different frameworks [Diagram: a Spark job and a Hadoop MR job on YARN, above Tachyon in-memory holding blocks 1, 3 and 4, above HDFS / Amazon S3 on disk holding blocks 1 to 4; label: "storage engine & execution engine same process"]
  29. 29. Issue 2: In-memory data loss when computation crashes [Diagram: a Spark task whose block manager holds blocks 1 and 3 in Spark memory, above HDFS / Amazon S3 holding blocks 1 to 4; label: "storage engine & execution engine same process"]
  30. 30. Issue 2: In-memory data loss when computation crashes [Diagram: the task is marked "crash"; the block manager's Spark memory with blocks 1 and 3 is still shown, above HDFS / Amazon S3 holding blocks 1 to 4; label: "storage engine & execution engine same process"]
  31. 31. Issue 2: In-memory data loss when computation crashes [Diagram: after the crash, only HDFS / Amazon S3 holding blocks 1 to 4 is shown; label: "storage engine & execution engine same process"]
  32. 32. Issue 2 resolved with Tachyon: Keep in-memory data safe, even when computation crashes [Diagram: a Spark task with Spark memory and block manager, above Tachyon in-memory holding blocks 1, 3 and 4, above HDFS / Amazon S3 holding blocks 1 to 4; label: "storage engine & execution engine same process"]
  33. 33. Issue 2 resolved with Tachyon: Keep in-memory data safe, even when computation crashes [Diagram: the task is marked "crash"; Tachyon in-memory still holds blocks 1, 3 and 4, above HDFS / Amazon S3 on disk holding blocks 1 to 4; label: "storage engine & execution engine same process"]
  34. 34. Issue 3: In-memory data duplication & Java garbage collection [Diagram: Spark Job1 and Spark Job2 each hold their own copies of blocks 1 and 3 in Spark memory, above HDFS / Amazon S3 holding blocks 1 to 4; label: "storage engine & execution engine same process"]
  35. 35. Issue 3 resolved with Tachyon: No in-memory data duplication, much less GC [Diagram: Spark Job1 and Spark Job2, each with its own Spark memory, above Tachyon in-memory holding blocks 1, 3 and 4, above HDFS / Amazon S3 on disk holding blocks 1 to 4; label: "storage engine & execution engine same process"]
  36. 36. Tachyon Use Case: Baidu • Framework: SparkSQL • Under Storage: Baidu’s File System • Tachyon Storage Media: MEM + HDD • 100+ Tachyon nodes • 1PB+ Tachyon managed storage • 30x Performance Improvement
  37. 37. Tachyon Use Case: An Oil Company • Framework: Spark • Under Storage: GlusterFS • Tachyon Storage Media: MEM only • Analyzing data in traditional storage
  38. 38. Tachyon Use Case: A SaaS Company • Framework: Spark • Under Storage: S3 • Tachyon Storage Media: SSD only • Elastic Tachyon deployment
  39. 39. Outline • Introduction to Tachyon • Using Spark with Tachyon • New Tachyon Features • Getting Involved
  40. 40. Tachyon 0.8.0 Just Released! http://tachyon-project.org/
  41. 41. Feature 1: Growing Ecosystem • Use different frameworks to enable workloads on different storage
  42. 42. Feature 2: Tiered Storage • Tachyon manages more than DRAM [Diagram: tiers MEM, SSD, HDD; faster toward MEM, greater capacity toward HDD]
  43. 43. Feature 2: Tiered Storage • Configurable storage tiers, for example MEM only, MEM + HDD, or SSD only
  44. 44. Feature 3: Pluggable Data Management Policy • Evict stale data to lower tier • Promote hot data to upper tier
  45. 45. Feature 4: Transparent Naming • Persisted Tachyon files are mapped to under storage • Tachyon paths are preserved in under storage [Diagram: the same directory tree (Data, Users, Reports, Sales, Alice, Bob) appears under tachyon://host:port/ in Tachyon and under s3n://bucket/directory/ in the storage system (HDFS, S3, …)]
  46. 46. Feature 5: Unified Namespace • Unified namespace for multiple storage systems • Share data across storage systems • On-the-fly mounting/unmounting [Diagram: tachyon://host:port/ in Tachyon shows Data, Users (Alice, Bob), Reports and Sales; Users (Alice, Bob) comes from Storage System A at hdfs://host:port/, and Reports and Sales come from Storage System B at s3n://bucket/directory/]
  47. 47. Additional Features • Remote Write Support • Easy deployment with Mesos and YARN • Initial Security Support • One Command Cluster Deployment • Metrics for Clients/Workers/Master
  48. 48. Outline • Introduction to Tachyon • Using Spark with Tachyon • New Tachyon Features • Getting Involved
  49. 49. Welcome users and collaborators! Memory-Centric Distributed Storage System
  50. 50. Thank you! • Try Tachyon: http://tachyon-project.org • Develop Tachyon: https://github.com/amplab/tachyon • Meet Friends: http://www.meetup.com/Tachyon • Tachyon Nexus: http://www.tachyonnexus.com • Email: gene@tachyonnexus.com
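
Slides 42 to 44 describe tiered storage in which stale data is evicted to a lower tier and hot data is promoted to an upper tier. The following is a minimal Python sketch of that idea, not Tachyon's implementation: the `TieredStore` class, its method names, the per-tier block-count capacities, and the choice of least-recently-used (LRU) as the eviction policy are all assumptions made for illustration. The slides only say the policy is pluggable.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered block store (hypothetical); tiers are listed fastest first."""

    def __init__(self, tiers):
        # tiers: list of (name, max_blocks), e.g. [("MEM", 2), ("SSD", 2)].
        self.names = [name for name, _ in tiers]
        self.caps = [cap for _, cap in tiers]
        # Each tier keeps its blocks in LRU order (oldest entry first).
        self.blocks = [OrderedDict() for _ in tiers]

    def _place(self, level, block_id, data):
        if level >= len(self.blocks):
            return  # past the lowest tier: this sketch simply drops the block
        tier = self.blocks[level]
        tier[block_id] = data  # newest entry goes to the end
        while len(tier) > self.caps[level]:
            stale_id, stale = tier.popitem(last=False)  # least recently used
            self._place(level + 1, stale_id, stale)     # evict to the lower tier

    def put(self, block_id, data):
        for tier in self.blocks:
            tier.pop(block_id, None)  # remove any older copy first
        self._place(0, block_id, data)

    def get(self, block_id):
        for tier in self.blocks:
            if block_id in tier:
                data = tier.pop(block_id)
                self._place(0, block_id, data)  # promote hot data to the top tier
                return data
        return None

    def tier_of(self, block_id):
        for name, tier in zip(self.names, self.blocks):
            if block_id in tier:
                return name
        return None


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 2)])
for block in ("b1", "b2", "b3"):
    store.put(block, block.encode())
print(store.tier_of("b1"))  # SSD: b1 was the stalest block when MEM filled up
store.get("b1")
print(store.tier_of("b1"))  # MEM: reading b1 promoted it again
```

Swapping the `popitem(last=False)` line for a different victim choice is where another policy would plug in under this design.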
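
Slide 45 states that persisted Tachyon files are mapped to under storage with their paths preserved. A rough Python sketch of such a mapping follows. The function name `to_under_storage_path` is hypothetical, and the host `master:19998` is a placeholder standing in for the slide's `host:port`; only the `s3n://bucket/directory/` root comes from the slide.

```python
from urllib.parse import urlparse


def to_under_storage_path(tachyon_uri, under_root):
    """Map a tachyon:// URI to the same path beneath an under-storage root."""
    parsed = urlparse(tachyon_uri)
    if parsed.scheme != "tachyon":
        raise ValueError("expected a tachyon:// URI, got: " + tachyon_uri)
    # Keep the Tachyon path unchanged; only the scheme and authority are swapped.
    relative = parsed.path.lstrip("/")
    return under_root.rstrip("/") + "/" + relative


print(to_under_storage_path("tachyon://master:19998/Data/Sales",
                            "s3n://bucket/directory/"))
# s3n://bucket/directory/Data/Sales
```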
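
Slide 46's unified namespace, with on-the-fly mounting and unmounting of several storage systems, can be pictured as a mount table that resolves a Tachyon path to whichever under storage owns it. The sketch below is an assumption-laden illustration rather than Tachyon's API: the `MountTable` class, its methods, the longest-prefix matching rule, the mount points under `/Data`, and the `host:9000` address are all hypothetical.

```python
class MountTable:
    """Toy mount table (hypothetical): Tachyon path prefix -> under-storage URI."""

    def __init__(self):
        self._mounts = {}

    @staticmethod
    def _normalize(path):
        stripped = path.strip("/")
        return "/" + stripped if stripped else ""  # the root is stored as ""

    def mount(self, tachyon_path, under_uri):
        key = self._normalize(tachyon_path)
        if key in self._mounts:
            raise ValueError("already mounted: " + tachyon_path)
        self._mounts[key] = under_uri.rstrip("/")

    def unmount(self, tachyon_path):
        del self._mounts[self._normalize(tachyon_path)]

    def resolve(self, tachyon_path):
        path = self._normalize(tachyon_path)
        # Try the longest mount point first so nested mounts take precedence.
        for point in sorted(self._mounts, key=len, reverse=True):
            if path == point or path.startswith(point + "/"):
                return self._mounts[point] + path[len(point):]
        raise KeyError("no mount covers " + tachyon_path)


table = MountTable()
table.mount("/Data/Users", "hdfs://host:9000/Users")
table.mount("/Data/Reports", "s3n://bucket/directory/Reports")
print(table.resolve("/Data/Users/Alice"))  # hdfs://host:9000/Users/Alice
print(table.resolve("/Data/Reports"))      # s3n://bucket/directory/Reports
```

Matching on whole path components (`point + "/"`) keeps a path such as `/Data/UsersArchive` from being claimed by the `/Data/Users` mount.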
