5. 5
Data Ecosystem Issues
• Each app manages multiple data sources
• Data source changes require global updates
• Storage optimizations require app changes
• Poor performance due to lack of data locality
6. 6
Data Ecosystem Challenges
1 Speed & Complexity
• Integration and interoperability issues (on-prem, hybrid, cloud)
• Many departments & groups
2 Data Freshness
• Cross-network movement is slow
• Copies create lag
• Data quality suffers with copies
3 Cost
• Many-to-many integrations are expensive
• Data duplication
4 Security & Governance
• Data security & governance is increasingly complex
Heavy integrations create painful organizational drag
7. 7
Data Ecosystem with Alluxio
• Apps only talk to Alluxio
• Simple add/remove of data sources
• No app changes
• Highest performance in memory
Client-side interfaces: Java File API, HDFS Interface, Amazon S3 Interface, REST Web Service
Under-store interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, NFS Interface
8. 8
Alluxio Design Principles
1 Big Data & Machine Learning
• Interoperability with leading projects
• Large-scale data sets
• High I/O
2 Data Sharing
• Don’t own the data
• Multiple apps sharing common data
• Data stored in multiple, hybrid systems
3 High Speed Data Access
• Remote data
• Hot/warm/cold data
• Temporary data
• Read/write support
4 Enterprise Class
• Distributed architecture
• Commodity hardware
• Service-oriented
• High availability
• Security
9. 9
Outline
1 Why we built Alluxio
2 Alluxio’s innovations
3 Alluxio’s Architecture
4 What’s new in 1.7.0
5 Demo
10. 10
Alluxio Innovation:
Server-side API Translation
Convert from Client-side Interface to Native Storage Interface
Client-side: HDFS Interface / S3 Interface
Storage-side: HDFS Interface, S3A Interface, Swift Interface, Google Cloud Interface
11. 11
Alluxio Innovation:
Server-side API Translation
Convert between different versions of HDFS
Client-side: HDFS 2.7 Interface
Storage-side: CDH 5.6 Interface, HDP 2.4 Interface, MapR 5.2 Interface
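The server-side translation above amounts to an adapter layer: clients speak one interface, and the server maps each call onto the mounted storage system's native API. A minimal sketch in Python (the class and method names here are illustrative, not Alluxio's actual implementation):

```python
class S3LikeStore:
    """Hypothetical backend exposing an S3-style object API."""
    def __init__(self):
        self.objects = {}

    def put_object(self, key, data):
        self.objects[key] = data

    def get_object(self, key):
        return self.objects[key]


class HdfsFacade:
    """Server-side adapter: clients issue HDFS-style file calls,
    which are translated to the backend's native object API."""
    def __init__(self, backend):
        self.backend = backend

    def create(self, path, data):
        # Translate an HDFS path into an S3-style object key
        self.backend.put_object(path.lstrip("/"), data)

    def open(self, path):
        return self.backend.get_object(path.lstrip("/"))


store = S3LikeStore()
fs = HdfsFacade(store)
fs.create("/finance/report.csv", b"q1,q2")
assert fs.open("/finance/report.csv") == b"q1,q2"
```

The same pattern covers the HDFS-version case on this slide: one facade per client dialect, all translating to whichever distribution's interface the under store actually speaks.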
13. 13
Alluxio Innovation:
Unified Namespace
Create a catalog of available data sources for Data Scientists
/finance/customer-transactions/
/finance/vendor-transactions/
/operations/device-logs/
/operations/phone-call-recordings/
/operations/check-images/
/research/us-economic-data/
/research/intl-economic-data/
/marketing/advertising-dataset/
/marketing/marketing-funnel-dataset/
alluxio://
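A unified namespace like the catalog above is essentially a mount table mapping `alluxio://` prefixes to under-store locations. A minimal sketch, with hypothetical mount points and under-store URIs (not Alluxio's actual code):

```python
# Hypothetical mount table: Alluxio namespace prefix -> under-store URI
MOUNTS = {
    "/finance": "hdfs://nn1:8020/warehouse/finance",
    "/operations": "s3a://ops-bucket",
    "/research": "swift://research-container",
}

def resolve(alluxio_path):
    """Translate an alluxio:// path to its under-store location
    using longest-prefix match over the mount table."""
    best = max((p for p in MOUNTS if alluxio_path.startswith(p)),
               key=len, default=None)
    if best is None:
        raise KeyError("no mount covers " + alluxio_path)
    return MOUNTS[best] + alluxio_path[len(best):]

print(resolve("/finance/customer-transactions/2017.parquet"))
# -> hdfs://nn1:8020/warehouse/finance/customer-transactions/2017.parquet
```

Data scientists only ever see the left-hand paths; adding or swapping an under store changes the table, not the applications.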
15. 15
Where to use Alluxio
Finding high-fit Alluxio use-cases
Compute zone: standalone or managed with Mesos or YARN (e.g., Spark, TensorFlow, Presto)
Storage in a different availability zone: either on-prem or cloud (e.g., HDFS)
Alluxio is installed with or near compute to unify data stores, stage remote data, and improve system performance.
Guidelines
• Cloud deployment
• Compute separated from storage
• I/O or network latency exists
• Unification of many storage systems
• Applications sharing long-lived data
More checks result in higher-fit applications
17. 17
Machine Learning Case Study
Challenge –
Slow training of models for algorithmic trading at a $46B data-driven hedge fund. Data access was slow, costing them compute dollars and modeler productivity.
Solution –
With Alluxio, data access is 10-30x faster.
Impact –
Increased efficiency in training ML algorithms, lowered compute cost, and increased modeler productivity, resulting in a 14-day ROI on Alluxio.
(Diagram: Spark on Mesos reading from HDFS over the public internet, before and after Alluxio.)
18. 18
Consumer Intelligence Use Case – Top 3 Telco
Challenge –
Desired a central view of consumer information in near real time for proactive support. Many HDFS clusters across different distributions (HDP, CDH, MapR) and many incompatible versions, on-prem & cloud, integrated through heavy ETL.
Solution –
Alluxio integrates data into a central catalog for fast access to consumer interaction records.
Impact –
Reduced integration time; faster data speed & freshness.
(Diagram: Hadoop and ML workloads over many HDFS clusters, previously joined by ETL.)
19. 20
Outline
1 Why we built Alluxio
2 Alluxio’s innovations
3 Alluxio’s Architecture
4 What’s new in 1.7.0
5 Demo
21. Alluxio Master
- Master responsible for managing metadata
- Secondary masters used for journal checkpoints and fault tolerance
- Performs distributed storage metadata operations
© 2017 Alluxio, Inc. All Rights Reserved
Primary Master: File System Metadata, Block Metadata, Worker Metadata, RPC Service, Journal Storage
Secondary Master: File System Metadata, Block Metadata, Worker Metadata
22. Alluxio Worker
- Worker responsible for managing block data
- Each worker manages metadata for the block data it stores
- Workers store block data on various local storage media
- Performs distributed storage data operations
Worker components: Block Metadata, RPC Service, Data Transfer Service
Storage tiers: RAM, SSD, HDD, backed by Under Storage
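The tiered local storage above (RAM, SSD, HDD) implies a placement decision when a worker caches a block. A minimal sketch of top-down placement with overflow, using hypothetical capacities (not Alluxio's actual allocator):

```python
# Hypothetical tier capacities, in blocks, fastest tier first
TIERS = [("RAM", 2), ("SSD", 4), ("HDD", 8)]

def place_block(block_id, tier_contents):
    """Place a block in the fastest tier with free space,
    overflowing to slower tiers as faster ones fill up."""
    for name, capacity in TIERS:
        tier = tier_contents.setdefault(name, [])
        if len(tier) < capacity:
            tier.append(block_id)
            return name
    # A real system would evict cold blocks downward instead of failing
    raise RuntimeError("all tiers full; eviction would be needed")

contents = {}
print(place_block("b1", contents))  # -> RAM
```

Real tiered storage also moves blocks between tiers as access patterns change; this sketch shows only the initial placement.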
23. Data Flow in Alluxio
Applications read/write data via the Alluxio client
Ideally, Alluxio is deployed on the same nodes as compute, so the Alluxio client and an Alluxio worker share a node
Different read scenarios
• Read data in Alluxio, on the same node as the client
• Read data in Alluxio, not on the same node as the client
• Read data not in Alluxio
Different write scenarios
• Write data only to Alluxio
• Write data to Alluxio and the under store synchronously
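The three read scenarios above can be sketched as a read-through cache. This is illustrative Python, not Alluxio's client API; the worker and under-store objects are stand-ins:

```python
class Worker:
    """Illustrative Alluxio-style worker caching blocks on one node."""
    def __init__(self, node):
        self.node = node
        self.blocks = {}

class ReadThroughClient:
    def __init__(self, node, workers, under_store):
        self.node, self.workers, self.under = node, workers, under_store

    def read(self, block_id):
        local = self.workers[self.node]
        # 1) Block cached on the local worker: memory-speed read
        if block_id in local.blocks:
            return local.blocks[block_id], "local"
        # 2) Block cached on a remote worker: network-speed read,
        #    then cache locally for future reads
        for w in self.workers.values():
            if block_id in w.blocks:
                local.blocks[block_id] = w.blocks[block_id]
                return local.blocks[block_id], "remote"
        # 3) Block not in Alluxio: disk/network read from the
        #    under store, cached locally on the way through
        data = self.under[block_id]
        local.blocks[block_id] = data
        return data, "under-store"
```

A first read of cold data takes the under-store path; every later read from the same node is a local hit, which is the locality effect the next slides illustrate.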
24. 25
Read data in Alluxio, on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
25. 26
Read data not in Alluxio + Caching
RAM / SSD / HDD
Network / Disk Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
26. 27
Read data in Alluxio, not on same node as client + Caching
RAM / SSD / HDD
Network Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
RAM / SSD / HDD
Alluxio
Worker
27. 28
Write data only to Alluxio on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Write of Data
Application
Alluxio
Client
Alluxio
Master
28. 29
Write data to Alluxio and Under Store synchronously
RAM / SSD / HDD
Network / Disk Speed Write of Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
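The two write scenarios (Alluxio-only vs. synchronous write-through to the under store) can be sketched as a write policy. The policy names below mirror Alluxio's documented write types, but the function itself is a hypothetical sketch, not the client API:

```python
def write(block_id, data, worker_blocks, under_store, policy="MUST_CACHE"):
    """Write a block under two illustrative policies:
    MUST_CACHE    -> write only to the Alluxio worker (memory speed)
    CACHE_THROUGH -> write to the worker and the under store synchronously
    """
    worker_blocks[block_id] = data      # data always lands in Alluxio
    if policy == "CACHE_THROUGH":
        under_store[block_id] = data    # synchronous persistence
    return data
```

MUST_CACHE gives memory-speed writes for temporary data; CACHE_THROUGH trades write latency for durability, since the call does not return until the under store has the data.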
29. 31
Outline
1 Why we built Alluxio
2 Alluxio’s innovations
3 Alluxio’s Architecture
4 What’s new in 1.7.0
5 Demo
30. 32
New features in 1.7.0
Async caching
Kubernetes integration
Tiered locality
Under store synchronization
FUSE improvement
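Of these, async caching changes the cold-read path described earlier: instead of caching a block inline with the read, the client returns data immediately and the cache is populated in the background. A hypothetical sketch (not Alluxio's implementation; the join at the end exists only to make the example deterministic):

```python
import threading

class AsyncCachingReader:
    """Illustrative async caching: return under-store data at once,
    populate the cache on a background thread."""
    def __init__(self, under_store):
        self.under = under_store
        self.cache = {}

    def read(self, block_id):
        if block_id in self.cache:
            return self.cache[block_id]
        data = self.under[block_id]
        # Caching happens off the read's critical path
        t = threading.Thread(target=self.cache.__setitem__,
                             args=(block_id, data))
        t.start()
        t.join()  # a real client would not wait for the cache fill
        return data
```

The benefit is that a cold read pays only the under-store latency, not under-store latency plus cache-write latency.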
A technology ecosystem developed around so-called big data, with Hadoop being one of the most important pieces. The idea was to let large clusters of cheap computers mine huge volumes of low-value data to extract valuable insights. To do that, Hadoop was designed to tightly couple compute and storage.
As enterprises matured and more big data systems developed, they adopted new technologies across more datacenters. Compute began to be separated from storage to take advantage of lower cost hosting options.
Enterprises now need to manage a bird’s nest of integrations, and as a result development and operations are complicated. Performance also suffers as large volumes of data cross slow pipes that connect networks.
As folks in the trenches know, Gartner says 60% of data projects will fail in 2017. That’s largely due to the complexity, cost, bottlenecks and governance surrounding these projects.
Alluxio is a virtual distributed file system that unifies data access between storage and compute, and offers memory speed performance when working on remote data.
How does it do that?
Our software connects with dozens of the leading storage platforms and hosting providers like Amazon, Google and Microsoft
Alluxio unifies all your storage systems into a single global namespace that Spark, Presto, MapReduce, and other frameworks can access
Because most big data processes are run on only a subset of data, teams have the ability to place the most recent data into Alluxio memory
And as data flows through the system, Alluxio’s intelligent cache can place other frequently accessed data into memory
And Alluxio is read/write, so entire processing pipelines can be managed through the system
Benefits
Application development and data science is simpler – abstracts the complexity of storage
Architecture is more flexible – separate compute from storage, improve interoperability and choose lower-cost storage
Improves data access speed across networks – commonly 2-10x
Alluxio is deeply informed by our unique design principles:
Work with large-scale applications
Don’t own the data
Don’t hurt performance, and try to address common performance issues like remote data access and write throughput
Make sure it can operate within the largest enterprises
These principles informed our 3 enabling innovations…
While an infrastructure doesn’t need to meet all of these guidelines to be a high-fit use case, meeting more of them indicates a better fit
We have been very fortunate to have been rapidly adopted by some of the largest and most respected technology organizations in the world.
We’re working with one early adopter of technology, called Two Sigma, a leading tech-focused hedge fund with over $40 billion AUM. This team had a real problem in training their models across their 10,000-node Spark cluster. Their Spark nodes were in AWS and their source data was in their local storage. Because so much data was being transported across the relatively slow public network, they saw massive network bottlenecks. This meant they could only run their process twice a day. When they added Alluxio, they were able to increase that to 8-10 cycles per day, which meant big money for the hedge fund.
As they’ve matured in the cloud, they’ve realized they’re locked into S3, whose prices are very unpredictable, and they’ve wanted to take advantage of lower-cost cloud providers like Google. With Alluxio, they can lift and shift their nodes without impacting their application layer, taking advantage of more favorable pricing.
One of the largest telcos in the US uses Alluxio as a virtual data lake – providing a single unified access layer across all their data systems. This enables them to integrate new systems more quickly and improves the data freshness and responsiveness of applications built on top of the data lake.
We ran this experiment when Spark 2.0 came out, with the latest version of Alluxio at the time. Since then we have kept track of improvements to Spark and Alluxio, but have not seen enough of a difference to rerun this performance evaluation. We compared several Spark storage levels against storing the data in Alluxio instead of in Spark.
Let me explain what this graph is presenting. We first tested with a cached RDD, meaning the data was already in Spark or Alluxio, and we varied the size of the RDD and recorded how much time it took to run a scan on the RDD. The disk-only line in green is, as expected, much worse than all the others, which are using memory. A more interesting observation is the performance difference between the Alluxio text file and object file. The text file is almost strictly better; because the file itself is plaintext, using an object file only added serialization overhead, so it was not helpful. I would recommend using a text file when possible. Spark’s memory-only level is the best for small files, but abruptly performs worse once the data cannot be completely cached in memory. This happens as well for MEMORY_ONLY_SER, the purple line, but much later because serialized objects are smaller. Alluxio scales linearly throughout the test and actually outperforms Spark caching for a single task beyond roughly 32 GB.
The previous comparison had the data on SSD, which is still relatively fast storage. If we instead put the data in S3, we see a much larger speedup. This is more representative of modern architectures, where compute and storage are decoupled. In this case, we see a 16x speedup even in this simple job, which is similar to some of the performance use cases I previously presented.
We also did a similar test with a Parquet file using the DataFrame API. Here we did a simple aggregation that accessed all the rows. The behavior is the same as before: Spark’s native caching has an abrupt turning point where performance degrades. These tests were run with the default Spark configurations, which Spark suggests not changing. Tuning for additional storage can shift that inflection point, but it would still be present.
These results are the average of 7 runs.
The range of the S3 runs was 1132.77; the range of the Alluxio runs was 10.59.
We also ran the same example against S3, and the variation across runs was fairly large. This is similar to what some users see with their storage, whether due to the storage itself, their workload, or sometimes both. Alluxio’s performance was much more consistent and provided on average a 10x, and up to a 17x, performance improvement.