
Tachyon workshop 2015-07-19

Tachyon is an open source project to build a reliable, memory-centric distributed storage system. This talk, given at the Tachyon workshop on July 19, 2015, describes the architecture of the project and the growth of its community.

  1. A Reliable Memory-Centric Distributed Storage System
      Bin Fan, Tachyon Nexus
      July 19, 2015 @ Tachyon Workshop
      tachyon-project.org
  2. Tachyon Nexus
      • Founded by Tachyon creators and top contributors
      • $7.5 million Series A from Andreessen Horowitz
      • Committed to Tachyon Open Source
      • www.tachyonnexus.com
  4. Outline
      • Overview
        – Motivation
        – Tachyon Architecture
        – Using Tachyon
      • Open Source
        – Status
        – Production Use Cases
      • Roadmap
  5. Outline (as on slide 4)
  6. Started From UCB AMPLab
      Berkeley Data Analytics Stack (BDAS):
      • Cluster manager
      • Parallel computation framework
      • Reliable, distributed memory-centric storage system
  7. Why Tachyon?
  8. Memory is Fast
      • RAM throughput increasing exponentially
      • Disk throughput increasing slowly
      Memory-locality key to interactive response times
  9. Memory is Cheaper
      source: jcmit.com
  10. Realized by many…
  11. Is the Problem Solved?
  12. Missing a Solution for the Storage Layer
  13. An Example: Spark
      • Fast, in-memory data processing framework
        – Keep one in-memory copy inside JVM
        – Track lineage of operations used to derive data
        – Upon failure, use lineage to recompute data
      [Diagram: lineage tracking across map, filter, map, join, and reduce operations]
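
The lineage idea on this slide can be illustrated with a small toy model. The sketch below is not Spark's implementation; the names `Dataset`, `Source`, and `Derived` are hypothetical and chosen only for illustration. Each dataset remembers how it was derived, so a lost in-memory copy can be rebuilt by replaying the recorded operations.

```scala
// Toy model of lineage-based recovery (an illustration, not Spark's code).
object LineageSketch {
  // A dataset keeps at most one in-memory copy plus the recipe to rebuild it.
  sealed trait Dataset {
    private var cache: Option[Vector[Int]] = None

    // How to rebuild this dataset's contents from its lineage.
    protected def compute(): Vector[Int]

    // Return the cached copy, recomputing from lineage if it was lost.
    def get(): Vector[Int] = cache.getOrElse {
      val data = compute()
      cache = Some(data)
      data
    }

    // Simulate losing the in-memory copy, for example after a crash.
    def evict(): Unit = { cache = None }
  }

  // A root dataset that is loaded from some stable source.
  final class Source(load: () => Vector[Int]) extends Dataset {
    protected def compute(): Vector[Int] = load()
  }

  // A dataset derived from a parent by one recorded operation.
  final class Derived(parent: Dataset, op: Vector[Int] => Vector[Int]) extends Dataset {
    protected def compute(): Vector[Int] = op(parent.get())
  }

  def main(args: Array[String]): Unit = {
    val source  = new Source(() => Vector(1, 2, 3, 4))
    val doubled = new Derived(source, _.map(_ * 2))
    val large   = new Derived(doubled, _.filter(_ > 4))

    val before = large.get()
    large.evict()
    doubled.evict()
    val after = large.get() // rebuilt by replaying map, then filter
    println(before == after)
  }
}
```

Recomputation trades storage for compute time: nothing is replicated, but recovery costs as much as re-running the lost part of the pipeline.
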
  14. Issue 1: Data sharing is the bottleneck in the analytics pipeline (slow writes to disk)
      [Diagram: Spark Job1 and Spark Job2 each keep blocks 1 and 3 in their own Spark memory block manager; HDFS / Amazon S3 holds blocks 1 to 4. Label: storage engine & execution engine same process (slow writes)]
  15. Issue 1 (continued): Data sharing is the bottleneck in the analytics pipeline (slow writes to disk)
      [Diagram: a Spark job (Spark memory block manager with blocks 1 and 3) and a Hadoop MR job (YARN) over HDFS / Amazon S3, which holds blocks 1 to 4. Label: storage engine & execution engine same process (slow writes)]
  16. Issue 2: Cache loss when process crashes
      [Diagram: a Spark task keeps blocks 1 and 3 in its Spark memory block manager; HDFS / Amazon S3 holds blocks 1 to 4. Label: execution engine & storage engine same process]
  17. Issue 2 (continued): Cache loss when process crashes
      [Diagram: the task is marked "crash"; the Spark memory block manager still shows blocks 1 and 3; HDFS / Amazon S3 holds blocks 1 to 4]
  18. Issue 2 (continued): Cache loss when process crashes
      [Diagram: after the crash, only the copies of blocks 1 to 4 in HDFS / Amazon S3 remain]
  19. Issue 3: In-memory data duplication & Java garbage collection
      [Diagram: Spark Task1 and Spark Task2 each keep blocks 1 and 3 in their own Spark memory block manager; HDFS / Amazon S3 holds blocks 1 to 4. Label: execution engine & storage engine same process (duplication & GC)]
  20. Tachyon
      Reliable data sharing at memory-speed within and across cluster frameworks/jobs
  21. Technical Overview
      Ideas
      • A memory-centric storage architecture
      • Push lineage down to storage layer
      • Manage tiered storage
      Facts
      • One data copy in memory
      • Re-computation for fault-tolerance
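
"Manage tiered storage" and "one data copy in memory" can be made concrete with a toy model. The sketch below assumes a simple policy in which a full tier demotes its oldest block to the next tier; that policy, and the names `Tier`, `TieredStore`, `put`, and `locate`, are assumptions for illustration and are not taken from Tachyon.

```scala
import scala.collection.mutable

// Toy model of tiered block storage. The oldest-first demotion policy is an
// assumption made for this sketch; it is not Tachyon's actual policy.
object TieredStoreSketch {
  final class Tier(val name: String, val capacity: Int) {
    require(capacity > 0, "a tier must hold at least one block")
    // Insertion-ordered map, so `head` is the oldest block in this tier.
    val blocks = mutable.LinkedHashMap.empty[String, Array[Byte]]
  }

  // Tiers are ordered fastest first, for example MEM, SSD, HDD.
  final class TieredStore(tiers: Vector[Tier]) {
    // Write a block to the top tier, keeping a single copy across all tiers.
    def put(id: String, data: Array[Byte]): Unit = {
      tiers.foreach(_.blocks.remove(id)) // drop any older copy first
      insert(0, id, data)
    }

    // Place a block at `level`, demoting the oldest block if the tier is full.
    private def insert(level: Int, id: String, data: Array[Byte]): Unit =
      if (level < tiers.length) {
        val tier = tiers(level)
        if (tier.blocks.size >= tier.capacity) {
          val (oldId, oldData) = tier.blocks.head
          tier.blocks.remove(oldId)
          insert(level + 1, oldId, oldData)
        }
        tier.blocks(id) = data
      } // otherwise the block falls off the last tier and is dropped here

    // Name of the tier currently holding the block, if any.
    def locate(id: String): Option[String] =
      tiers.find(_.blocks.contains(id)).map(_.name)
  }

  def main(args: Array[String]): Unit = {
    val store = new TieredStore(
      Vector(new Tier("MEM", 2), new Tier("SSD", 2), new Tier("HDD", 4)))
    Seq("block1", "block2", "block3").foreach(id => store.put(id, Array[Byte](0)))
    println(store.locate("block1")) // demoted out of MEM when block3 arrived
    println(store.locate("block3"))
  }
}
```

In this sketch a block that falls off the last tier is simply lost; a real system would rely on the under storage or on recomputation to get it back.
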
  22. Stack
      • Frameworks: Apache Spark, Apache MR, Apache HBase, H2O, Apache Flink, Impala, …
      • Under storage: S3, GlusterFS, HDFS, Swift, NFS, Ceph, …
  23. Tachyon Memory-Centric Architecture
  24. Tachyon Memory-Centric Architecture (continued)
  25. Lineage in Tachyon
  26. Issue 1 revisited: Memory-speed data sharing among jobs in different frameworks
      [Diagram: a Spark job (Spark mem) and a Hadoop MR job (YARN) above Tachyon in-memory storage holding blocks 1, 3, and 4; HDFS / Amazon S3 on disk holds blocks 1 to 4. Label: execution engine & storage engine same process (fast writes)]
  27. Issue 2 revisited: Keep in-memory data safe, even when a job crashes.
      [Diagram: a Spark task (Spark memory block manager) above Tachyon in-memory storage holding blocks 1, 3, and 4; HDFS / Amazon S3 holds blocks 1 to 4. Label: execution engine & storage engine same process]
  28. Issue 2 revisited (continued): Keep in-memory data safe, even when a job crashes.
      [Diagram: the task is marked "crash"; Tachyon in-memory storage still holds blocks 1, 3, and 4; HDFS / Amazon S3 on disk holds blocks 1 to 4. Label: execution engine & storage engine same process]
  29. Issue 3 revisited: No in-memory data duplication, much less GC
      [Diagram: two Spark tasks (Spark mem) above Tachyon in-memory storage holding blocks 1, 3, and 4; HDFS / Amazon S3 on disk holds blocks 1 to 4. Label: execution engine & storage engine same process (no duplication & GC)]
  30. Comparison with In-Memory HDFS
  31. How easy / hard to use Tachyon?
  32. Spark/MapReduce/Shark without Tachyon
      • Spark
          scala> val file = sc.textFile("hdfs://ip:port/path")
      • Hadoop MapReduce
          $ hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://localhost:19998/input hdfs://localhost:19998/output
      • Shark
          CREATE TABLE orders_cached AS SELECT * FROM orders;
  33. Spark/MapReduce/Shark with Tachyon
      • Spark
          scala> val file = sc.textFile("tachyon://ip:port/path")
      • Hadoop MapReduce
          $ hadoop jar hadoop-examples-1.0.4.jar wordcount tachyon://localhost:19998/input tachyon://localhost:19998/output
      • Shark
          CREATE TABLE orders_tachyon AS SELECT * FROM orders;
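
For Spark and MapReduce, the only difference between the two slides above is the URI scheme and authority of the input and output paths. A minimal sketch of that rewrite using `java.net.URI` follows; the helper name `toTachyon` and the host `namenode:9000` are hypothetical, and whether the same path is visible through Tachyon depends on how the under filesystem is configured, which the slides do not cover.

```scala
import java.net.URI

// Illustrative helper only: rewrites an hdfs:// URI so that it points at a
// Tachyon master while keeping the path. Not part of any Tachyon API.
object SchemeSwapSketch {
  def toTachyon(path: String, masterHost: String, masterPort: Int): String = {
    val uri = new URI(path)
    require(uri.getScheme == "hdfs", s"expected an hdfs:// URI, got: $path")
    // URI(scheme, userInfo, host, port, path, query, fragment)
    new URI("tachyon", null, masterHost, masterPort, uri.getPath, null, null).toString
  }

  def main(args: Array[String]): Unit = {
    // Port 19998 is the one used in the slide's own example.
    println(toTachyon("hdfs://namenode:9000/data/input", "localhost", 19998))
  }
}
```

The application code that consumes the path stays the same, which is the point the slides make about running without code changes.
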
  34. Outline (as on slide 4)
  35. Open Source Status
      • Started at UC Berkeley AMPLab in Summer 2012
      • Apache License 2.0, Version 0.7 (July 2015)
      • Deployed at > 50 companies (July 2014)
      • 30+ Companies Contributing
      • Spark/MapReduce/Flink applications can run without code change
  36. Contributors Growth
      • v0.1 (Dec '12): 1
      • v0.2 (Apr '13): 3
      • v0.3 (Oct '13): 15
      • v0.4 (Feb '14): 30
      • v0.5 (Jul '14): 46
      • v0.6 (Mar '15): 70
      • v0.7 (Jul '15): 100+
  37. Codebase Growth
      • v0.2 (Apr '13): 465 commits
      • v0.3 (Oct '13): 696 commits
      • v0.4 (Feb '14): 1080 commits
      • v0.5 (Jul '14): 1610 commits
      • v0.6 (Mar '15): 2884 commits
      • v0.7 (Jul '15): 4969 commits
  38. Open Community
      [Chart: Berkeley contributors vs. non-Berkeley contributors]
  39. Thanks to Our Contributors!
      Aaron Davidson, Abhiraj Butala, Achal Soni, Albert Chu, Ali Ghodsi, Andrew Ash, Anurag Khandelwal, Aslan Bekirov, Bill Zhao, Bin Fan, Bradley Childs, Calvin Jia, Carson Wang, Chao Chen, Cheng Chang, Cheng Hao, Colin Patrick McCabe, Dan Crankshaw, Darion Yaphet, David Capwell, David Zhu, Dina Leventol, Du Li, Fei Wang, Gene Pang, Gerald Zhang, Grace Huang, Haoyuan Li, Henry Saputra, Hobin Yoon, Huamin Chen, Jacky Li, Jey Kottalam, Jingxin Feng, Joseph Tang, Juan Zhou, Jun Aoki, Kun Xu, Lukasz Jastrzebski, Luogan Kun, Manu Goyal, Mark Hamstra, Mingfei Shi, Mubarak Seyed, Nan Dun, Nick Lanham, Orcun Simsek, Pengfei Xuan, Qianhao Dong, Qifan Pu, Ramaraju Indukuri, Raymond Liu, Rob Vesse, Robert Metzger, Rong Gu, Sean Zhong, Seonghwan Moon, Shaoshan Liu, Shivaram Venkataraman, Shu Peng, Srinivas Parayya, Tao Wang, Thu Kyaw, Timothy St. Clair, Vaishnav Kovvuri, Vikram Sreekanti, Xi Liu, Xiaomeng Huang, Xiaomin Zhang, Xing Lin, Yi Liu, Zhao Zhang
  40. Tachyon Usage
  41. Under Filesystem Choices (Big Data, Cloud, HPC, Enterprise)
  42. Use Case: Baidu
      • Framework: SparkSQL
      • Under Storage: Baidu’s File System
      • Storage Media: MEM + HDD
      • 100+ nodes deployment
      • 1PB+ managed space
      • 30x Performance Improvement
      More Details: www.meetup.com/Tachyon
  43. Use Case: a SaaS Company
      • Framework: Impala
      • Under Storage: S3
      • Storage Media: MEM + SSD
      • 15x Performance Improvement
  44. Use Case: an Oil Company
      • Framework: Spark
      • Under Storage: GlusterFS
      • Storage Media: MEM only
      • Analyzing data in traditional storage
  45. Use Case: a SaaS Company
      • Framework: Spark
      • Under Storage: S3
      • Storage Media: SSD only
      • Elastic Tachyon deployment
  46. Outline (as on slide 4)
  47. New Features
      • Lineage in Storage (alpha)
      • Tiered Storage (alpha)
  48. New Features
      • Lineage in Storage (alpha)
      • Tiered Storage (alpha)
      • Data Serving
      • Support for New Hardware
      • …
      • Your New Feature!
  49. Tachyon’s Goal?
  50. Distributed Memory-Centric Storage: Better Assist Other Components
      • Frameworks: Apache Spark, Apache MR, Apache HBase, H2O, Apache Flink, Impala, …
      • Under storage: S3, GlusterFS, HDFS, Swift, NFS, Ceph, …
      Welcome Collaboration! (JIRA: New Contributor Tasks)
  51. Contact
      • Website: http://tachyon-project.org
      • GitHub: https://github.com/amplab/tachyon
      • Meetup: http://www.meetup.com/Tachyon
      • Training program: coming soon
      • Tachyon Nexus is hiring
      • Newsletter subscription: http://goo.gl/mwB2sX
      • Email: binfan@tachyonnexus.com
