Alluxio: Unify Data at Memory Speed
• Yupeng Fu
2
Outline
Why we built Alluxio
Alluxio’s innovations
Alluxio’s Architecture
What’s new in 1.7.0
1
2
3
4
3
Data Ecosystem Develops
• One Compute Framework
• Single Storage System
• Co-located
ETL
ETL
ETL
4
Data Ecosystem Explodes
…
• Many Compute
Frameworks
• Many Storage Systems
• Most not co-located
…
5
Data Ecosystem Issues
• Each app manages
multiple data sources
• Data source changes
require global updates
• Storage optimizations
requires app change
• Poor performance due
to lack of locality
…
…
6
Data Ecosystem Challenges
2 Data Freshness
• Cross-network movement is slow
• Copies create lag
• Data quality suffers with copies
4 Security & Governance
• Data security & governance is
increasingly complex
1 Speed & Complexity
• Integration and interoperability issues
(on prem, hybrid, cloud)
• Many departments & groups
3 Cost
• Many-to-many integrations are
expensive
• Data duplication
6
Heavy integrations create painful organizational drag
7
Data Ecosystem with Alluxio
• Apps only talk to
Alluxio
• Simple Add/Remove
• No App Changes
• Highest performance
in Memory
Java File API HDFS Interface
Amazon S3
Interface
REST Web Service
HDFS Interface
Amazon S3
Interface
Swift Interface NFS Interface
…
…
8
Alluxio Design Principles
2
Data Sharing
• Don’t own the data
• Multiple apps sharing common data
• Data stored in multiple, hybrid systems
4
Enterprise Class
• Distributed architecture
• Commodity hardware
• Service-oriented
• High availability
• Security
1
Big Data & Machine Learning
• Interoperability with leading projects
• Large scale data sets
• High IO
3
High Speed Data Access
• Remote data
• Hot/warm/cold data
• Temporary data
• Read/write support
8
9
Outline
Why we built Alluxio
Alluxio’s innovations
Alluxio’s Architecture
What’s new in 1.7.0
1
2
3
4
• Demo• 5
10
Alluxio Innovation:
Server-side API Translation
Convert from Client-side Interface to Native Storage Interface
HDFS Interface / S3 Interface
HDFS Interface S3A Interface Swift Interface
Google Cloud
Interface
11
Alluxio Innovation:
Server-side API Translation
Convert between different versions of HDFS
HDFS 2.7 Interface
HDP 2.4 InterfaceCDH 5.6 Interface MAPR 5.2 Interface
12
Alluxio Innovation:
Unified Namespace
Enables effective data management across different Under Stores
Uses Mounting with Transparent Naming
13
Alluxio Innovation:
Unified Namespace
Create a catalog of available data sources for Data Scientists
/finance/customer-transactions/
/finance/vendor-transactions/
/operations/device-logs/
/operations/phone-call-recordings/
/operations/check-images/
/research/us-economic-data/
/research/intl-economic-data/
/marketing/advertising-dataset/
/marketing/marketing-funnel-dataset/
alluxio://
14
Alluxio Innovation:
Intelligent Cache
Local performance from remote data using native multi-tier storage
RAM
SSD
HDD
Hot Warm Cold
15
Where to use Alluxio
Finding high-fit Alluxio use-cases
Compute Zone
Standalone or managed with Mesos or Yarn
Storage in Different Availability Zone
Either on-prem or cloud
Alluxio is installed with or near compute to unify data
stores, stage remote data, and improve system
performance.
Spark Tensorflow Presto
HDFS
Guidelines
 Cloud deployment
 Compute separated from storage
 I/O or network latency exists
 Unification of many storage systems
 Applications sharing long lived data
More checks result in higher fit applications
16
100+ known production deployments
AND MORE!
17
Machine Learning Case Study
Challenge –
Slow training of model for
algorithmic trading in $46B data
driven Hedge Fund
Data access was slow, costing them
$$ in compute cost and lower
modeler productivity
SPARK
HDFS
SPARK
HDFS
Solution –
With Alluxio, data access are 10-30X
faster
Impact –
Increased efficiency on training of ML
algorithm, lowered compute cost and
increased modeler productivity,
resulting in 14 day ROI of Alluxio
MESOS
MES
OS
Public Internet
Public Internet
18
Consumer Intelligence Use Case – Top 3 Telco
Challenge –
Desired a central view of consumer
information in near real time for
proactive support.
Many HDFS, different distributions,
many incompatible versions. On-
prem & cloud. Integration through
heavy ETL.
HADOOP
Solution –
Alluxio integrates data into central
catalog for fast access to consumer
interaction records.
Impact –
Reduced integration time
Faster data speed & freshness
ML HADOOP
HDFS HDFS HDFS
ML
ETL
HDP
HDFS
CDH
HDFS
MAPR
HDFS
HDFS
20
Outline
Why we built Alluxio
Alluxio’s innovations
Alluxio’s Architecture
What’s new in 1.7.0
1
2
3
4
• Demo• 5
21
Alluxio Architecture
Alluxio
Master
Zookeeper
Standby
Master
Alluxio
Worker
Alluxio
Worker
Under Store
Under Store
Alluxio
Client
Application
RAM / SSD / HDD
RAM / SSD / HDD
Control Path
Data Path
Alluxio Master
- Master responsible for
managing metadata
- Secondary masters used for
journal checkpoints and fault
tolerance
- Performs distributed storage
metadata operations
222017 Alluxio, Inc. All Rights Reserved
File
System
Metadata
Block
Metadata
Worker
Metadata
RPC
Servic
e
Journal
Storage
Primary Master
File System
Metadata
Block
Metadata
Worker
Metadata
Secondary Master
Alluxio Worker
- Worker responsible for managing block data
- Each worker manages metadata for the
block data it stores
- Workers store block data on various local
storage mediums
- Performs distributed storage data operations
232017 Alluxio, Inc. All Rights Reserved
Block
Metadata
RPC
Servic
e
Data
Transfe
r
Service
RAM
Under
Storage
SSD
HDD
Data Flow In Alluxio
Applications Read/Write data via the Alluxio Client
Ideally, Alluxio deployed on same nodes as compute so
Alluxio Client and Alluxio Workers on same node
Different Read Scenarios
Read data in Alluxio, on same node as client
Read data in Alluxio, not on same node as client
Read data not in Alluxio
Different Write Scenarios
Write data only to Alluxio
Write data to Alluxio and Under Store synchronously
25
Read data in Alluxio, on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
26
Read data not in Alluxio + Caching
26
RAM / SSD / HDD
Network / Disk Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
WorkerUnder Store
27
Read data in Alluxio, not on same node as
client + Caching
RAM / SSD / HDD
Network Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
RAM / SSD / HDD
Alluxio
Worker
28
Write data only to Alluxio on same
node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Write of Data
Application
Alluxio
Client
Alluxio
Master
29
Write data to Alluxio and Under Store
synchronously
RAM / SSD / HDD
Network / Disk Speed Write of Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
31
Outline
Why we built Alluxio
Alluxio’s innovations
Alluxio’s Architecture
What’s new in 1.7.0
1
2
3
4
• Demo• 5
32
New features in 1.7.0
Async caching
Kubernates integration
Tiered locality
Under store synchronization
FUSE improvement
33
Partial Caching
Alluxio Worker
Alluxio Client
Machine A
Application
Under
Storage
RAM
block
34
Async Partial Caching
Alluxio Worker
Alluxio Client
Machine A
Application
Under
Storage
RAM
block
Twitter.com/alluxio
Linkedin.com/alluxio
Website
www.alluxio.com
E-mail
info@alluxio.com
@
Social Media
41
• Thank you!
• Yupeng Fu yupeng@alluxio.com
• Github: yupeng9
• wechat: richbird9
•
Twitter.com/alluxio
Linkedin.com/alluxio
Website
www.alluxio.org
E-mail
info@alluxio.com
@
Social Media

Alluxio: Unify Data at Memory Speed

  • 1.
    Alluxio: Unify Dataat Memory Speed • Yupeng Fu
  • 2.
    2 Outline Why we builtAlluxio Alluxio’s innovations Alluxio’s Architecture What’s new in 1.7.0 1 2 3 4
  • 3.
    3 Data Ecosystem Develops •One Compute Framework • Single Storage System • Co-located ETL ETL ETL
  • 4.
    4 Data Ecosystem Explodes … •Many Compute Frameworks • Many Storage Systems • Most not co-located …
  • 5.
    5 Data Ecosystem Issues •Each app manages multiple data sources • Data source changes require global updates • Storage optimizations requires app change • Poor performance due to lack of locality … …
  • 6.
    6 Data Ecosystem Challenges 2Data Freshness • Cross-network movement is slow • Copies create lag • Data quality suffers with copies 4 Security & Governance • Data security & governance is increasingly complex 1 Speed & Complexity • Integration and interoperability issues (on prem, hybrid, cloud) • Many departments & groups 3 Cost • Many-to-many integrations are expensive • Data duplication 6 Heavy integrations create painful organizational drag
  • 7.
    7 Data Ecosystem withAlluxio • Apps only talk to Alluxio • Simple Add/Remove • No App Changes • Highest performance in Memory Java File API HDFS Interface Amazon S3 Interface REST Web Service HDFS Interface Amazon S3 Interface Swift Interface NFS Interface … …
  • 8.
    8 Alluxio Design Principles 2 DataSharing • Don’t own the data • Multiple apps sharing common data • Data stored in multiple, hybrid systems 4 Enterprise Class • Distributed architecture • Commodity hardware • Service-oriented • High availability • Security 1 Big Data & Machine Learning • Interoperability with leading projects • Large scale data sets • High IO 3 High Speed Data Access • Remote data • Hot/warm/cold data • Temporary data • Read/write support 8
  • 9.
    9 Outline Why we builtAlluxio Alluxio’s innovations Alluxio’s Architecture What’s new in 1.7.0 1 2 3 4 • Demo• 5
  • 10.
    10 Alluxio Innovation: Server-side APITranslation Convert from Client-side Interface to Native Storage Interface HDFS Interface / S3 Interface HDFS Interface S3A Interface Swift Interface Google Cloud Interface
  • 11.
    11 Alluxio Innovation: Server-side APITranslation Convert between different versions of HDFS HDFS 2.7 Interface HDP 2.4 InterfaceCDH 5.6 Interface MAPR 5.2 Interface
  • 12.
    12 Alluxio Innovation: Unified Namespace Enableseffective data management across different Under Stores Uses Mounting with Transparent Naming
  • 13.
    13 Alluxio Innovation: Unified Namespace Createa catalog of available data sources for Data Scientists /finance/customer-transactions/ /finance/vendor-transactions/ /operations/device-logs/ /operations/phone-call-recordings/ /operations/check-images/ /research/us-economic-data/ /research/intl-economic-data/ /marketing/advertising-dataset/ /marketing/marketing-funnel-dataset/ alluxio://
  • 14.
    14 Alluxio Innovation: Intelligent Cache Localperformance from remote data using native multi-tier storage RAM SSD HDD Hot Warm Cold
  • 15.
    15 Where to useAlluxio Finding high-fit Alluxio use-cases Compute Zone Standalone or managed with Mesos or Yarn Storage in Different Availability Zone Either on-prem or cloud Alluxio is installed with or near compute to unify data stores, stage remote data, and improve system performance. Spark Tensorflow Presto HDFS Guidelines  Cloud deployment  Compute separated from storage  I/O or network latency exists  Unification of many storage systems  Applications sharing long lived data More checks result in higher fit applications
  • 16.
    16 100+ known productiondeployments AND MORE!
  • 17.
    17 Machine Learning CaseStudy Challenge – Slow training of model for algorithmic trading in $46B data driven Hedge Fund Data access was slow, costing them $$ in compute cost and lower modeler productivity SPARK HDFS SPARK HDFS Solution – With Alluxio, data access are 10-30X faster Impact – Increased efficiency on training of ML algorithm, lowered compute cost and increased modeler productivity, resulting in 14 day ROI of Alluxio MESOS MES OS Public Internet Public Internet
  • 18.
    18 Consumer Intelligence UseCase – Top 3 Telco Challenge – Desired a central view of consumer information in near real time for proactive support. Many HDFS, different distributions, many incompatible versions. On- prem & cloud. Integration through heavy ETL. HADOOP Solution – Alluxio integrates data into central catalog for fast access to consumer interaction records. Impact – Reduced integration time Faster data speed & freshness ML HADOOP HDFS HDFS HDFS ML ETL HDP HDFS CDH HDFS MAPR HDFS HDFS
  • 19.
    20 Outline Why we builtAlluxio Alluxio’s innovations Alluxio’s Architecture What’s new in 1.7.0 1 2 3 4 • Demo• 5
  • 20.
    21 Alluxio Architecture Alluxio Master Zookeeper Standby Master Alluxio Worker Alluxio Worker Under Store UnderStore Alluxio Client Application RAM / SSD / HDD RAM / SSD / HDD Control Path Data Path
  • 21.
    Alluxio Master - Masterresponsible for managing metadata - Secondary masters used for journal checkpoints and fault tolerance - Performs distributed storage metadata operations 222017 Alluxio, Inc. All Rights Reserved File System Metadata Block Metadata Worker Metadata RPC Servic e Journal Storage Primary Master File System Metadata Block Metadata Worker Metadata Secondary Master
  • 22.
    Alluxio Worker - Workerresponsible for managing block data - Each worker manages metadata for the block data it stores - Workers store block data on various local storage mediums - Performs distributed storage data operations 232017 Alluxio, Inc. All Rights Reserved Block Metadata RPC Servic e Data Transfe r Service RAM Under Storage SSD HDD
  • 23.
    Data Flow InAlluxio Applications Read/Write data via the Alluxio Client Ideally, Alluxio deployed on same nodes as compute so Alluxio Client and Alluxio Workers on same node Different Read Scenarios Read data in Alluxio, on same node as client Read data in Alluxio, not on same node as client Read data not in Alluxio Different Write Scenarios Write data only to Alluxio Write data to Alluxio and Under Store synchronously
  • 24.
    25 Read data inAlluxio, on same node as client Alluxio Worker RAM / SSD / HDD Memory Speed Read of Data Application Alluxio Client Alluxio Master
  • 25.
    26 Read data notin Alluxio + Caching 26 RAM / SSD / HDD Network / Disk Speed Read of Data Application Alluxio Client Alluxio Master Alluxio WorkerUnder Store
  • 26.
    27 Read data inAlluxio, not on same node as client + Caching RAM / SSD / HDD Network Speed Read of Data Application Alluxio Client Alluxio Master Alluxio Worker RAM / SSD / HDD Alluxio Worker
  • 27.
    28 Write data onlyto Alluxio on same node as client Alluxio Worker RAM / SSD / HDD Memory Speed Write of Data Application Alluxio Client Alluxio Master
  • 28.
    29 Write data toAlluxio and Under Store synchronously RAM / SSD / HDD Network / Disk Speed Write of Data Application Alluxio Client Alluxio Master Alluxio Worker Under Store
  • 29.
    31 Outline Why we builtAlluxio Alluxio’s innovations Alluxio’s Architecture What’s new in 1.7.0 1 2 3 4 • Demo• 5
  • 30.
    32 New features in1.7.0 Async caching Kubernates integration Tiered locality Under store synchronization FUSE improvement
  • 31.
    33 Partial Caching Alluxio Worker AlluxioClient Machine A Application Under Storage RAM block
  • 32.
    34 Async Partial Caching AlluxioWorker Alluxio Client Machine A Application Under Storage RAM block
  • 33.
    Twitter.com/alluxio Linkedin.com/alluxio Website www.alluxio.com E-mail info@alluxio.com @ Social Media 41 • Thankyou! • Yupeng Fu yupeng@alluxio.com • Github: yupeng9 • wechat: richbird9 • Twitter.com/alluxio Linkedin.com/alluxio Website www.alluxio.org E-mail info@alluxio.com @ Social Media

Editor's Notes

  • #4 A technology ecosystem developed around so called, big data; Hadoop, being one of the most important. The idea was to allow large clusters of cheap computers mine huge volumes of low value data, to extract valuable insights. To do that, Hadoop was designed to make compute and storage tightly-coupled.
  • #5 As enterprises matured and more big data systems developed, they adopted new technologies across more datacenters. Compute began to be separated from storage to take advantage of lower cost hosting options.
  • #6 Enterprises now need to manage a bird’s nest of integrations, and as a result development and operations are complicated. Performance also suffers as large volumes of data cross slow pipes that connect networks.
  • #7 As folks in the trenches know, Gartner says 60% of data projects will fail in 2017. That’s largely due to the complexity, cost, bottlenecks and governance surrounding these projects.
  • #8 Alluxio is a virtual distributed file system that unifies data access between storage and compute, and offers memory speed performance when working on remote data. How does it do that? Our software connects with dozens of the leading storage platforms and hosting providers like Amazon, Google and Microsoft Alluxio unifies all your storage systems into a single global namespace that Spark, Presto, MapReduce, and other frameworks can access Because most big data processes are run on only a subset of data, teams have the ability to place the most recent data into Alluxio memory And as data flows through the system, Alluxio’s intelligent cache can place other frequently data into memory And Alluxio is read/write, so entire processing pipelines can be managed through the system Benefits Application development and data science is simpler – abstracts the complexity of storage Architecture is more flexible – separate compute from storage, improve interoperability and choose lower-cost storage Improves data access speed across networks – commonly 2-10x
  • #9 Alluxio is deeply informed by our unique design principles: Work with large-scale applications Don’t own the data Don’t hurt performance, and try to address common performance issues like remote data access and write throughput Make sure it can operate within the largest enterprises These principles informed our 3 enabling innovations…
  • #16 While an infrastructure doesn’t need to meet all the following guidelines to be a high-fit application, meeting more of these guidelines will lead to a higher fit
  • #17 We have been very fortunate to have been rapidly adopted by some of the largest and most respected technology organizations in the world.
  • #18 We’re working with one early adopter of technology, called Two Sigma, which is leading tech-focused hedge fund with over $40 billion AUM. This team had a real problem in training their models across their 10,000 node Spark cluster. Their Spark nodes were in AWS and their source data was in their local storage. Because so much data was being transported across the relatively slow public network, they saw massive network bottlenecks. This meant they could only run their process twice a day. When they added Alluxio, they were able to increase that to 8-10x cycles per day, which meant big money for the hedge fund. As they’ve matured in the cloud, they’ve realized they’ve locked themselves into S3, and S3 prices are very unpredictable, they’ve wanted to take advantage of lower cost cloud providers like Google. With Alluxio, they have the ability to lift and shift their nodes, without impacting their application layer, taking advantage of more favorable pricing.
  • #19 One of the largest telcos in the US uses Alluxio as a virtual data lake – providing a single unified access layer across all their data systems. This enables them to integrate new systems more quickly and improves the data freshness and responsiveness of applications built on top of the data lake.
  • #36 We ran this experiment when Spark 2.0 came out with the latest version of Alluxio at the time. Since then we have kept track of improvements to Spark and Alluxio, but have not seen enough of a difference to rerun this performance evaluation. We compare several storage types in Spark with not storing the data in Spark but in Alluxio instead.
  • #37 Let me explain what this graph is presenting. We first tested with a cached RDD, meaning the data was already in Spark or Alluxio, and we varied the size of the RDD and recorded how much time it took to run a scan on the RDD. The disk only line in green is as expected, much worse than all the others which are using memory. A more interesting observation is the performance difference between the Alluxio text file and object file. Text file is almost strictly better, this is because of the object serialization overhead, the file itself is all plaintext so using an object file was not helpful. I would recommend to use text file when possible. Using spark’s memory only is the best for small files, but abruptly performs worse once the data cannot be completely cached in memory. This happens as well for mem_only_ser the purlple line, but much last because of the small size of serialized objects. Alluxio scales linearly throughout the test and actually outperforms Spark caches for a single task after 32 GB or so.
  • #38 The previous comparison was if the data was on SSD, which is still a relatively fast storage. If we instead put the data in S3, we see a much larger speed up. This is more representative of architectures because this allows compute and storage to be decoupled. In this case, we see 16x speed up even in this simple job, which is similar to some of the performance use cases I previously presented.
  • #39 We also did a similar test with a parquet file using the data frame API. Here we did a simple aggregation which would access all the rows. The behavior is the same as before, using Spark’s native caching has an abrupt turning point where the performance degrades. These tests are run with default spark configurations which they suggest not to change. If we optimize for additional storage, we can move the point of bend, but it would still be present.
  • #40 This is average of 7 runs. Range of s3 is 1132.765125 Range of Alluxio is 10.5890684 We also ran the same example against S3, and the variation in the test was fairly large. This is similar to what some users see in their storage either due to the storage itself, or their workload and sometimes both. Alluxio performance is much more consistent and provided on average 10x and up to 17x performance improvement.
  • #41 Easy to use with spark