Haoyuan Li, Tachyon Nexus & UC Berkeley



November 19, 2015 @ AMPCamp 6
An Open Source Memory-Centric
Distributed Storage...
Outline
•  Open Source
•  Introduction to Tachyon (Before 2015)
•  Deployments and New Features
•  Getting Involved
2
Background
•  Started at UC Berkeley AMPLab
–  From summer 2012
•  Open sourced
–  April 2013 (two and a half years ago)
–  ...
4
Tachyon: one of the
Fastest Growing

Big Data
Open Source
Projects
Contributors Growth
5
v0.1 (Dec ’12)
v0.2 (Apr ’13)
v0.3 (Oct ’13)
v0.4 (Feb ’14)
v0.5 (Jul ’14)
v0.6 (Mar ’15)
v0.7 (Jul ’15)
1
 3...
Contributors Growth
6
> 170 Contributors (V0.8)
(3x increase since the last AMPCamp)

> 50 Organizations
Thanks to Contributors and Users!
7
http://tachyon-project.org/community/
One Tachyon Production

Deployment Example
•  Baidu (Dominant Search Engine in China,
~ 50 Billion USD Market Cap)
•  Fram...
Outline
•  Open Source
•  Introduction to Tachyon (Before 2015)
•  New Features
•  Getting Involved
9
Tachyon is an
Open Source

Memory-centric

Distributed
Storage System
10
11
Why Tachyon?
Performance Trend: 

Memory is Fast
•  RAM throughput 

increasing exponentially
•  Disk throughput
increasing slowly
12
M...
Price Trend: Memory is Cheaper
source: jcmit.com
13
Realized by many…
14
15
Is the
Problem Solved?
16
Missing a Solution
for the Storage Layer
A Use Case Example with - 
•  Fast, in-memory data processing framework
– Keep one in-memory copy inside JVM
– Track linea...
Issue 1
18
Data Sharing is the bottleneck in
analytics pipeline:

Slow writes to disk
Spark Job1
Spark mem
block manager
b...
Issue 1
19
Spark Job
Spark mem
block manager
block 1
block 3
Hadoop MR Job
YARN
HDFS / Amazon S3
block 1
block 3
block 2
b...
Issue 1 resolved with Tachyon
20
Memory-speed data sharing

among jobs in different
frameworks
execution engine & 

storage...
Issue 2
21
Spark Task
Spark memory
block manager
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
executio...
Issue 2
22
crash
Spark memory
block manager
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
execution eng...
HDFS / Amazon S3
Issue 2
23
block 1
block 3
block 2
block 4
execution engine & 

storage engine
same process
crash
Cache l...
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Tachyon!
in-memory
block 1
block 3
 block 4
Issue 2 resolved with Tachyon...
Issue 2 resolved with Tachyon
25
HDFS
disk
block 1
block 3
block 2
block 4
execution engine & 

storage engine
same ...
HDFS / Amazon S3
Issue 3
26
In-memory Data Duplication &
Java Garbage Collection
Spark Job1
Spark mem
block manager
block ...
Issue 3 resolved with Tachyon
27
No in-memory data duplication,

much less GC
Spark Job1
Spark mem
Spark Job2
Spark mem
HD...
Previously Mentioned
•  A memory-centric storage architecture
•  Push lineage down to storage layer
28
Tachyon Memory-Centric Architecture
29
Tachyon Memory-Centric Architecture
30
Lineage in Tachyon
31
Outline
•  Open Source
•  Introduction to Tachyon (Before 2015)
•  Deployments and New Features
•  Getting Involved
32
1) Eco-system:
Enable new workloads on any storage;
work with the framework of your choice.
33
2) Tachyon running in
production environments, 
both 
in the cloud and on premises.
34
Use Case: Baidu
•  Framework: SparkSQL
•  Under Storage: Baidu’s File System
•  Storage Media: MEM + HDD
•  100+ nodes dep...
Use Case: an Oil Company
•  Framework: Spark
•  Under Storage: GlusterFS
•  Storage Media: MEM only
•  Analyzing data in t...
Use Case: a SaaS Company
•  Framework: Impala
•  Under Storage: S3
•  Storage Media: MEM + SSD
•  15x Performance Improvem...
Use Case: a Biotechnology Company
•  Framework: Spark & MapReduce
•  Under Storage: GlusterFS
•  Storage Media: MEM and SS...
Use Case: a SaaS Company
•  Framework: Spark
•  Under Storage: S3
•  Storage Media: SSD only
•  Elastic Tachyon deployment...
Use Case: a Retail Company
•  Framework: Spark & MapReduce
•  Under Storage: HDFS
•  Storage Media: MEM
40
Run Everywhere
41
Enable Faster Innovation in Storage Layer
42
What if 

data size exceeds 

memory capacity?
43
3) Tiered Storage:

Tachyon Manages More Than DRAM
MEM
SSD
HDD
Faster
Higher 

Capacity
44
Configurable Storage Tiers
MEM only
MEM + HDD
SSD only
45
4) Pluggable Data Management Policy
Evict stale data to
lower tier
Promote hot data to
upper tier
46
Pin Data in Memory
5) Transparent Naming
47
6) Unified Namespace
48
More Features
•  7) Remote Write Support
•  8) Easy deployment with Mesos and YARN
•  9) Initial Security Support
•  10) O...
12) More Under Storage Support
50
Reported Tachyon Usage
51
•  Team consists of Tachyon creators, top contributors
•  Series A ($7.5 million) from Andreessen Horowitz


•  Committed ...
Outline
•  Open Source
•  Introduction to Tachyon (Before 2015)
•  Deployments and New Features
•  Getting Involved
53
Memory-Centric Distributed Storage
Welcome to try, contact, and collaborate!
54
JIRA New Contributor Tasks
•  Try Tachyon: http://tachyon-project.org


•  Develop Tachyon: https://github.com/amplab/tachyon


•  Meet Friends: http...
Tachyon

by Haoyuan Li

Published in: Data & Analytics
  1. 1. Haoyuan Li, Tachyon Nexus & UC Berkeley
 
 November 19, 2015 @ AMPCamp 6 An Open Source Memory-Centric Distributed Storage System
  2. 2. Outline •  Open Source •  Introduction to Tachyon (Before 2015) •  Deployments and New Features •  Getting Involved 2
  3. 3. Background •  Started at UC Berkeley AMPLab –  From summer 2012 •  Open sourced –  April 2013 (two and a half years ago) –  Apache License 2.0 –  Latest Release: Version 0.8.2 (November 2015) •  Deployed at > 100 companies 3
  4. 4. 4 Tachyon: one of the Fastest Growing
 Big Data Open Source Projects
  5. 5. Contributors Growth 5 (release, date: contributors) v0.1, Dec ’12: 1 •  v0.2, Apr ’13: 3 •  v0.3, Oct ’13: 15 •  v0.4, Feb ’14: 30 •  v0.5, Jul ’14: 46 •  v0.6, Mar ’15: 70 •  v0.7, Jul ’15: 111
  6. 6. Contributors Growth 6 > 170 Contributors (V0.8) (3x increase since the last AMPCamp) > 50 Organizations
  7. 7. Thanks to Contributors and Users! 7 http://tachyon-project.org/community/
  8. 8. One Tachyon Production
 Deployment Example •  Baidu (Dominant Search Engine in China, ~ 50 Billion USD Market Cap) •  Framework: SparkSQL •  Under Storage: Baidu’s File System •  Storage Media: MEM + HDD •  100+ nodes deployment •  1PB+ managed space •  30x Performance Improvement 8
  9. 9. Outline •  Open Source •  Introduction to Tachyon (Before 2015) •  New Features •  Getting Involved 9
  10. 10. Tachyon is an Open Source
 Memory-centric
 Distributed Storage System 10
  11. 11. 11 Why Tachyon?
  12. 12. Performance Trend: 
 Memory is Fast •  RAM throughput 
 increasing exponentially •  Disk throughput increasing slowly 12 Memory-locality key to interactive response times
  13. 13. Price Trend: Memory is Cheaper source: jcmit.com 13
  14. 14. Realized by many… 14
  15. 15. 15 Is the Problem Solved?
  16. 16. 16 Missing a Solution for the Storage Layer
  17. 17. A Use Case Example with - •  Fast, in-memory data processing framework – Keep one in-memory copy inside JVM – Track lineage of operations used to derive data – Upon failure, use lineage to recompute data map filter map join reduce Lineage Tracking 17
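The lineage idea on this slide can be illustrated with a small, self-contained Python sketch. It is a toy model, not Spark's or Tachyon's implementation: the `Dataset` class and its `map`, `filter`, `compute`, and `lose_cache` methods are hypothetical names invented for this example. Each dataset records its parent and the operation that derived it, so a lost in-memory copy can be rebuilt by replaying those operations from the durable input.

```python
class Dataset:
    """Toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, name, parent=None, op=None, source=None):
        self.name = name
        self.parent = parent   # upstream dataset; None for a source dataset
        self.op = op           # function applied to the parent's records
        self.source = source   # durable input records (source datasets only)
        self.cache = None      # in-memory copy; may be lost on a crash

    def map(self, name, fn):
        return Dataset(name, parent=self, op=lambda rows: [fn(r) for r in rows])

    def filter(self, name, pred):
        return Dataset(name, parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        # Serve from the cache if present; otherwise walk the lineage upward
        # and re-apply each recorded operation.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.op(self.parent.compute())
        return self.cache

    def lose_cache(self):
        # Simulate losing the in-memory copy, e.g. when a process crashes.
        self.cache = None


source = Dataset("input", source=[1, 2, 3, 4, 5, 6])
squared = source.map("squared", lambda x: x * x)
evens = squared.filter("evens", lambda x: x % 2 == 0)

first = evens.compute()

# Drop the cached copies of the derived datasets, then recompute from lineage.
squared.lose_cache()
evens.lose_cache()
recovered = evens.compute()
```

In this sketch `recovered` equals `first` because every operation is deterministic; recomputation from lineage relies on that property.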
  18. 18. Issue 1 18 Data Sharing is the bottleneck in analytics pipeline:
 Slow writes to disk Spark Job1 Spark mem block manager block 1 block 3 Spark Job2 Spark mem block manager block 3 block 1 HDFS / Amazon S3 block 1 block 3 block 2 block 4 storage engine & execution engine same process (slow writes)
  19. 19. Issue 1 19 Spark Job Spark mem block manager block 1 block 3 Hadoop MR Job YARN HDFS / Amazon S3 block 1 block 3 block 2 block 4 Data Sharing is the bottleneck in analytics pipeline:
 Slow writes to disk storage engine & execution engine same process (slow writes)
  20. 20. Issue 1 resolved with Tachyon 20 Memory-speed data sharing
 among jobs in different frameworks execution engine & 
 storage engine same process (fast writes) Spark Job Spark mem Hadoop MR Job YARN HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Tachyon! in-memory block 1 block 3 block 4
  21. 21. Issue 2 21 Spark Task Spark memory block manager block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 execution engine & 
 storage engine same process Cache loss when process crashes
  22. 22. Issue 2 22 crash Spark memory block manager block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 execution engine & 
 storage engine same process Cache loss when process crashes
  23. 23. HDFS / Amazon S3 Issue 2 23 block 1 block 3 block 2 block 4 execution engine & 
 storage engine same process crash Cache loss when process crashes
  24. 24. HDFS / Amazon S3 block 1 block 3 block 2 block 4 Tachyon! in-memory block 1 block 3 block 4 Issue 2 resolved with Tachyon 24 Spark Task Spark memory block manager execution engine & 
 storage engine same process Keep in-memory data safe,
 even when a job crashes.
  25. 25. Issue 2 resolved with Tachyon 25 HDFS disk block 1 block 3 block 2 block 4 execution engine & 
 storage engine same process Tachyon! in-memory block 1 block 3 block 4 crash HDFS / Amazon S3 block 1 block 3 block 2 block 4 Keep in-memory data safe,
 even when a job crashes.
  26. 26. HDFS / Amazon S3 Issue 3 26 In-memory Data Duplication & Java Garbage Collection Spark Job1 Spark mem block manager block 1 block 3 Spark Job2 Spark mem block manager block 3 block 1 block 1 block 3 block 2 block 4 execution engine & 
 storage engine same process (duplication & GC)
  27. 27. Issue 3 resolved with Tachyon 27 No in-memory data duplication,
 much less GC Spark Job1 Spark mem Spark Job2 Spark mem HDFS / Amazon S3 block 1 block 3 block 2 block 4 execution engine & 
 storage engine same process (no duplication & GC) HDFS disk block 1 block 3 block 2 block 4 Tachyon! in-memory block 1 block 3 block 4
  28. 28. Previously Mentioned •  A memory-centric storage architecture •  Push lineage down to storage layer 28
  29. 29. Tachyon Memory-Centric Architecture 29
  30. 30. Tachyon Memory-Centric Architecture 30
  31. 31. Lineage in Tachyon 31
  32. 32. Outline •  Open Source •  Introduction to Tachyon (Before 2015) •  Deployments and New Features •  Getting Involved 32
  33. 33. 1) Eco-system: Enable new workloads on any storage; work with the framework of your choice. 33
  34. 34. 2) Tachyon running in production environments, both in the cloud and on premises. 34
  35. 35. Use Case: Baidu •  Framework: SparkSQL •  Under Storage: Baidu’s File System •  Storage Media: MEM + HDD •  100+ nodes deployment •  1PB+ managed space •  30x Performance Improvement 35
  36. 36. Use Case: an Oil Company •  Framework: Spark •  Under Storage: GlusterFS •  Storage Media: MEM only •  Analyzing data in traditional storage 36
  37. 37. Use Case: a SaaS Company •  Framework: Impala •  Under Storage: S3 •  Storage Media: MEM + SSD •  15x Performance Improvement 37
  38. 38. Use Case: a Biotechnology Company •  Framework: Spark & MapReduce •  Under Storage: GlusterFS •  Storage Media: MEM and SSD 38
  39. 39. Use Case: a SaaS Company •  Framework: Spark •  Under Storage: S3 •  Storage Media: SSD only •  Elastic Tachyon deployment 39
  40. 40. Use Case: a Retail Company •  Framework: Spark & MapReduce •  Under Storage: HDFS •  Storage Media: MEM 40
  41. 41. Run Everywhere 41 Enable Faster Innovation in Storage Layer
  42. 42. 42 What if 
 data size exceeds 
 memory capacity?
  43. 43. 43 3) Tiered Storage:
 Tachyon Manages More Than DRAM MEM SSD HDD Faster Higher 
 Capacity
  44. 44. 44 Configurable Storage Tiers MEM only MEM + HDD SSD only
  45. 45. 45 4) Pluggable Data Management Policy Evict stale data to lower tier Promote hot data to upper tier
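The evict-and-promote behaviour named on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Tachyon's code: the `TieredStore` class and its methods are hypothetical, capacities are counted in blocks rather than bytes, and least-recently-used (LRU) is chosen here only as one example of a policy that could be plugged in.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered block store: LRU eviction downward, promotion on read."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.names = [name for name, _ in capacities]
        self.caps = [cap for _, cap in capacities]
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # fell off the lowest tier; simply dropped in this toy model
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)  # mark as most recently used
        while len(tier) > self.caps[level]:
            # Evict the least recently used block to the next lower tier.
            victim, victim_data = tier.popitem(last=False)
            self._insert(level + 1, victim, victim_data)

    def put(self, block_id, data):
        # Remove any older copy first so a block lives in exactly one tier.
        for tier in self.tiers:
            tier.pop(block_id, None)
        self._insert(0, block_id, data)

    def get(self, block_id):
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)  # promote hot data to the top tier
                return data
        return None

    def location(self, block_id):
        for name, tier in zip(self.names, self.tiers):
            if block_id in tier:
                return name
        return None


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 4)])
for block in (1, 2, 3):
    store.put(block, f"data-{block}")

before = store.location(1)   # block 1 was the oldest, so it was evicted from MEM
value = store.get(1)         # reading it promotes it back to MEM
after = store.location(1)
```

Reading block 1 pushes the least recently used block in MEM (block 2) down to SSD, which is the evict-stale, promote-hot cycle the slide describes.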
  46. 46. 46 Pin Data in Memory
  47. 47. 5) Transparent Naming 47
  48. 48. 6) Unified Namespace 48
  49. 49. More Features •  7) Remote Write Support •  8) Easy deployment with Mesos and YARN •  9) Initial Security Support •  10) One Command Cluster Deployment •  11) Metrics Reporting for Clients, Workers, and Master 49
  50. 50. 12) More Under Storage Support 50
  51. 51. Reported Tachyon Usage 51
  52. 52. •  Team consists of Tachyon creators, top contributors •  Series A ($7.5 million) from Andreessen Horowitz
 •  Committed to Tachyon Open Source
 •  http://www.tachyonnexus.com 52
  53. 53. Outline •  Open Source •  Introduction to Tachyon (Before 2015) •  Deployments and New Features •  Getting Involved 53
  54. 54. Memory-Centric Distributed Storage Welcome to try, contact, and collaborate! 54 JIRA New Contributor Tasks
  55. 55. •  Try Tachyon: http://tachyon-project.org
 •  Develop Tachyon: https://github.com/amplab/tachyon
 •  Meet Friends: http://www.meetup.com/Tachyon
 •  Get News: http://goo.gl/mwB2sX •  Tachyon Nexus: http://www.tachyonnexus.com •  Contact us: haoyuan@tachyonnexus.com 55
