Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017

  1. ENABLE FAST BIG DATA ANALYTICS ON CEPH WITH ALLUXIO
     Adit Madan
     March 2017
  2. ABOUT ME
     Adit Madan, Software Engineer @ Alluxio, Inc
     Master’s @ Carnegie Mellon University
     Bachelor’s @ Indian Institute of Technology, Delhi
     Email: adit@alluxio.com
  3. ALLUXIO INTRODUCTION
  4. FASTEST-GROWING BIG DATA PROJECT
     • Fastest-growing open-source project in the big data ecosystem
     • 400+ contributors from 100+ organizations
     • Running the world’s largest production clusters
     • Welcome to join the community!
  5. BIG DATA ECOSYSTEM (diagram)
     Panel titles: BIG DATA ECOSYSTEM YESTERDAY, BIG DATA ECOSYSTEM TODAY, BIG DATA ECOSYSTEM ISSUES, BIG DATA ECOSYSTEM WITH ALLUXIO
     Enabling applications to access data from any storage system at memory speed
     • File system interfaces shown: FUSE Compatible File System, Hadoop Compatible File System, Native Key-Value Interface, Native File System
     • Storage interfaces shown: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface, …
  6. WHY ALLUXIO
     • Co-located with compute, provides memory-speed access to data
     • Virtualized across different storage systems under a unified global namespace
     • Distributed system, scale-out architecture
     • Software only, no change needed to existing applications
  7. ALLUXIO BENEFITS
     • Unification: new workflows across any data in any storage system
     • Performance: orders of magnitude improvement in run time
     • Flexibility: choice in compute and storage; grow each independently, buy only what is needed
  8. USE CASE – ACCELERATE I/O TO/FROM REMOTE STORAGE
     • Compute and Storage Separation
     • Advantages
       • Meet different compute and storage hardware requirements efficiently
       • Scale compute and storage independently
       • Store data in traditional filers/SANs and object stores cost-effectively
       • Compute on data in existing storage via big data computational frameworks
     • Disadvantage
       • Accessing data requires remote I/O
  9. USE CASE WITHOUT ALLUXIO (diagram)
     • Spark: low latency, memory throughput
     • Storage: high latency, network throughput
  10. USE CASE WITH ALLUXIO (diagram: Spark, Alluxio, Storage)
      Keeping data in Alluxio accelerates data access
  11. ACCELERATE I/O TO/FROM REMOTE STORAGE
      Baidu’s PMs and analysts run interactive queries to gain insights into their products and business
      • 200+ nodes deployment
      • 2+ petabytes of storage
      • Mix of memory + HDD
      Diagram labels: ALLUXIO, Baidu File System
      “The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.” - Baidu
      RESULTS
      • Data queries are now 30x faster with Alluxio
      • Alluxio cluster runs stably, providing over 50TB of RAM space
      • By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds
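As a rough check of the figures on slide 11, the short sketch below works out the speedups implied by the two before/after pairs. It assumes the stated bounds ("over 15 minutes", "less than 30 seconds") can be treated as exact endpoints, which the slide does not claim; the helper name `speedup` is made up for illustration.

```python
def speedup(before_s, after_s):
    """Return how many times faster `after_s` is than `before_s`."""
    return before_s / after_s

# Quoted Spark SQL query: 100-150 s without Alluxio, 10-15 s with it.
quoted_low = speedup(100, 10)
quoted_high = speedup(150, 15)

# Batch query: "over 15 minutes" down to "less than 30 seconds",
# treated here as exactly 900 s and 30 s (an assumption).
batch = speedup(15 * 60, 30)

print(quoted_low, quoted_high, batch)
```

Under these assumptions the quoted query works out to roughly 10x, while the 30x figure lines up with the batch-query example.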
  12. ALLUXIO ON CEPH
  13. ALLUXIO ON CEPH (diagram: Spark, Alluxio, Ceph Object Storage)
      ● Connect using RADOS Gateway
        ○ Swift Object Storage API
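Slide 13 says Alluxio reaches Ceph through the RADOS Gateway's Swift API but does not show any configuration. The sketch below renders what such under-storage settings might look like as properties text. Everything in it is an assumption rather than something taken from the deck: the property keys are recalled from the Alluxio 1.x Swift documentation and should be verified against the docs for the version in use, and the container, user, tenant, host, and port values are placeholders.

```python
def alluxio_swift_properties(container, user, tenant, password, auth_url):
    """Render hypothetical under-storage settings as Java-properties text."""
    settings = {
        # Property names are assumptions; check the Alluxio docs for your version.
        "alluxio.underfs.address": f"swift://{container}",
        "fs.swift.user": user,
        "fs.swift.tenant": tenant,
        "fs.swift.password": password,
        "fs.swift.auth.url": auth_url,
        "fs.swift.auth.method": "swiftauth",
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# All values below are placeholders, not taken from the slides.
print(alluxio_swift_properties(
    container="demo-container",
    user="demo-user",
    tenant="demo-tenant",
    password="<secret>",
    auth_url="http://rgw.example.com:7480/auth/1.0",
))
```

The output would typically be merged into the Alluxio site properties file on the master and worker nodes; the slides do not say how the demo cluster was actually configured.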
  14. EC2 CONFIGURATION
      ● 1 Compute Master
        ○ Spark and Alluxio Masters
      ● 3 Compute Workers
        ○ Spark and Alluxio Workers
      ● 1 Storage Manager
        ○ Ceph RadosGW and Monitor
      ● 2 Storage Devices
        ○ Ceph OSDs
      ● Instance type: r3.xlarge
      ● Availability Zone: us-east-1a
  15. SOFTWARE VERSIONS
      ● Ceph Version: 0.94.9
      ● Alluxio Version: 1.4.0
        ○ Custom JOSS library 0.9.13-SNAPSHOT
      ● Spark Version: 1.6.1
  16. DEMO OF THE SOLUTION
      ● Spark, Alluxio and Ceph cluster pre-deployed
      ● Ceph pre-populated with a 60GB dataset
      ● Launch spark shell
        a. First ‘count’
        b. Second ‘count’
        c. <Restart shell>
        d. Third ‘count’
      ● Ad-hoc queries w/ Alluxio
        a. ‘wordcount’ w/ intermediate data
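One plausible reading of the demo sequence on slide 16 is that the cached data outlives the Spark shell, so the third count after a restart is still served without going back to Ceph. The toy model below illustrates that idea with a read-through cache held outside the "session". It is not Alluxio or Spark code, and the class and function names (`ObjectStore`, `CacheLayer`, `count_lines`) are invented for the illustration.

```python
class ObjectStore:
    """Stand-in for remote storage; counts how often it is read."""

    def __init__(self, objects):
        self.objects = objects
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.objects[key]


class CacheLayer:
    """Read-through cache that lives outside the compute session."""

    def __init__(self, store):
        self.store = store
        self.cached = {}

    def read(self, key):
        if key not in self.cached:  # miss: fetch from remote storage once
            self.cached[key] = self.store.read(key)
        return self.cached[key]


def count_lines(cache, key):
    """Plays the role of a 'count' job issued from a shell session."""
    return len(cache.read(key).splitlines())


store = ObjectStore({"dataset": "a\nb\nc\n"})
cache = CacheLayer(store)

first = count_lines(cache, "dataset")   # cold: goes to remote storage
second = count_lines(cache, "dataset")  # warm: served from the cache
# A shell restart discards only session state; the cache layer survives.
third = count_lines(cache, "dataset")

print(first, second, third, store.reads)
```

In this model the remote store is read only once across all three counts, which is the behavior the repeated-access measurement on the following slide appears to rely on.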
  17. SPARK COUNT PERFORMANCE (chart)
      Count on 60 GB dataset
      ● 20x improvement for repeated access
  18. FOR MORE INFORMATION
      Please take a look at our whitepaper!
      ● Blog: https://alluxio.com/blog/accelerating-data-analytics-on-ceph-object-storage-with-alluxio
      ● Whitepaper: https://alluxio.com/resources/accelerating-data-analytics-on-ceph-object-storage-with-alluxio
  19. Thank you!
      Contact: adit@alluxio.com or info@alluxio.com
      Twitter: @Alluxio
      Websites: www.alluxio.com and www.alluxio.org
