Alluxio Webinar
September 22, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics workloads such as Hive, Spark, Presto, and machine learning experience sluggish response times because data and compute sit in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
3. Data Lakes and Silos Abound
Enterprises have organically created a legacy of data silos through short-term focused projects and mergers & acquisitions.
▪ Data lakes and critical data are often siloed and challenging to access
▪ Consolidation of data lakes and silos is expensive and slow to complete
▪ Compute is everywhere
[Diagram labels: Teradata, POSIX, DB, Intern apps, Public Clouds, S3, Object, HDFS 1, HDFS 2]
4. 4 Big Trends Driving the Need for a New Architecture
▪ Separation of compute & storage
▪ Hybrid and multi-cloud environments
▪ Self-service data across the enterprise
▪ Rise of the object store
5. Market Summary
▪ Data volume, velocity, and variety are avalanching: data doubles every two years*
▪ The business knows data analytics and ML models allow it to compete effectively*
▪ The Hadoop investment is being replaced by object storage (on-prem and cloud)
▪ The enterprise is a multi-cloud world and will remain so for some time
▪ Technical leadership wants the agility to run applications anywhere to sustain operations, offering users a transparent self-service experience
▪ Technical organizations struggle to keep up with data ingest and business demands
▪ Data is still not fully optimized, yet there are many copies costing $$$$
* “The Fourth Industrial Revolution”, by Klaus Schwab
6. Alluxio’s Vision
Accelerate analytics and machine learning to enable companies to grow
and remain relevant regardless of where their data and compute are
located.
What can 2X to 5X analytics acceleration do for:
● Fraud protection
● Research for treatments for diseases like COVID-19
● Uptime for all industrial and digital technologies we depend on
7. What is Data Orchestration?
A platform that brings your data closer to compute across
clusters, regions, clouds, and countries to accelerate results
8. Companies Using Alluxio
Industries: Consumer; Travel & Transportation; Telco & Media; Technology; Financial Services; Retail & Entertainment; Data & Analytics Services
9. Companies use Alluxio to …
• Gain faster results that matter to the business, through advanced caching technology
• Dramatically lower OpEx by eliminating data management and cloud egress costs, through a unified namespace and API translations
• Drop into existing on-prem and cloud environments with zero programming
10. Data Orchestration for the Cloud
Data Accessibility
Translate access to optimal storage APIs over a slow network
Application interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
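The idea of one namespace that translates application paths to different storage systems can be illustrated with a small sketch. This is not Alluxio's implementation or API: the `MountTable` class, its methods, and the example URIs are all hypothetical, and the lookup simply picks the longest mount prefix that covers a path.

```python
from dataclasses import dataclass, field


@dataclass
class MountTable:
    """Hypothetical mount table: namespace prefix -> under-storage URI prefix."""
    mounts: dict = field(default_factory=dict)

    def mount(self, namespace_prefix: str, storage_uri: str) -> None:
        # Normalize so that "/sales" and "/sales/" name the same mount point.
        key = namespace_prefix.rstrip("/") or "/"
        self.mounts[key] = storage_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a namespace path using the longest matching mount prefix."""
        best = None
        for prefix in self.mounts:
            # Match whole path components only, so "/sales" does not
            # capture "/salesforce".
            covers = path == prefix or path.startswith(prefix.rstrip("/") + "/")
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount covers {path}")
        remainder = path[len(best.rstrip("/")):]
        return self.mounts[best] + remainder


# Example URIs are made up for illustration.
table = MountTable()
table.mount("/", "hdfs://namenode:8020/data")
table.mount("/sales", "s3://example-bucket/sales")
print(table.resolve("/sales/2020/q3.parquet"))
print(table.resolve("/logs/app.log"))
```

An application keeps using the same path regardless of where the data lives; only the mount table changes when a directory moves from one storage system to another.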
12. Approaches to Hybrid Cloud
Lift and Shift
▪ Migration may seem easier as no application re-architecture needed
Issues
▪ If workloads are not made cloud-native and elastic, infrastructure cost can skyrocket
▪ If on-prem data copy needs to be maintained, syncing cloud and on-prem data can be hard
Data copy by workload
▪ Simple tools available like DistCp
▪ Works for workloads with easily identifiable datasets
Issues
▪ Datasets for many workloads cannot always be identified easily
▪ Significantly more data transfer than workload requirements
▪ Additional copies are very hard to sync back with master data
▪ Performance can be dramatically impacted due to cloud storage limitations
Compute-driven Data Caching
▪ Data pulled into cloud based on compute requests
▪ Data is cached locally to reduce I/O on remote clusters and is automatically synced
Issues
▪ Less helpful for workloads that don’t read a data set more than once
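The compute-driven caching approach, and its limitation for data that is read only once, can be sketched with a simple read-through cache. This is an illustrative toy, not Alluxio code: `ReadThroughCache`, the `fetch_remote` callable, and the sample paths are assumptions made for the example.

```python
class ReadThroughCache:
    """Toy read-through cache: data is pulled from remote storage on first read."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: key -> bytes (hypothetical)
        self._local = {}                   # locally cached copies
        self.remote_reads = 0
        self.local_reads = 0

    def read(self, key):
        if key in self._local:
            # Repeat access: served locally, no traffic to the remote cluster.
            self.local_reads += 1
            return self._local[key]
        # First access: fetch over the network, then keep a local copy.
        data = self._fetch_remote(key)
        self.remote_reads += 1
        self._local[key] = data
        return data


# Stand-in for on-prem storage reached over a slow link.
remote_store = {"/sales/part-0": b"a", "/sales/part-1": b"b"}

cache = ReadThroughCache(remote_store.__getitem__)
for _ in range(3):
    cache.read("/sales/part-0")   # one remote read, then two local reads
cache.read("/sales/part-1")       # read once: no benefit from caching
print(cache.remote_reads, cache.local_reads)
```

Only the data that compute actually requests crosses the network, and repeated reads stay local. A workload that scans each file exactly once pays the remote cost for every read, which is the issue listed for this approach.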
13. Our Solution: “Zero-Copy Burst”
Problem: HDFS cluster is compute-bound & complex to maintain
[Diagram: Google Cloud Platform (Spark, Presto, Hive, TensorFlow on the Alluxio Data Orchestration and Control Service, with GCS) and the on-premises datacenter (Spark, Presto, Hive, TensorFlow on the Alluxio Data Orchestration and Control Service), linked by connectivity]
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Offload on-prem cluster (both compute & I/O)
• Manage working set, not FULL set of data
• Local performance
• Automatic synchronization with on-prem changes
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of hard switch over, migrate at own pace
• Moves the data per policy, e.g. last 7 days
15. Alluxio at Walmart
Architectural Components
• Alluxio is co-located with Presto: for data locality
• Automatic metadata synchronization: to create Hive tables with Alluxio mount points
• Auto-scaling: to maintain a minimum number of Alluxio workers
• Pin frequently used data: to avoid cache evictions
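How pinning protects frequently used data from eviction can be shown with a small least-recently-used cache. This is a sketch under stated assumptions, not Alluxio's eviction logic: `PinnableLRUCache` and its methods are hypothetical names, and the policy is plain LRU that skips pinned keys.

```python
from collections import OrderedDict


class PinnableLRUCache:
    """Toy LRU cache in which pinned keys are never chosen for eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # key -> value, least recently used first
        self._pinned = set()

    def pin(self, key):
        self._pinned.add(key)

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            # Oldest unpinned entry is the eviction victim.
            victim = next((k for k in self._entries if k not in self._pinned), None)
            if victim is None:
                break  # everything is pinned; this sketch tolerates overflow
            del self._entries[victim]


cache = PinnableLRUCache(capacity=2)
cache.pin("dim_table")          # hypothetical frequently used table
cache.put("dim_table", "hot")
cache.put("scan_a", "cold")
cache.put("scan_b", "cold")     # evicts scan_a, not the older pinned entry
print(cache.get("dim_table"), cache.get("scan_a"))
```

Without the pin, the oldest entry would be evicted first even if queries keep coming back to it after large scans.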
16. Alluxio at Walmart: Takeaways
• 2x performance: for range queries
• High concurrency: with Alluxio
• Cost reduction: half the compute costs, or 2x compute capacity for the same environment
17. Alluxio at Adobe
Cross Data Center Access
PROBLEM
Primary DC with a large Hadoop cluster is out of space, and ad hoc SQL workloads are growing exponentially as analyst headcount has reached 1,800 people.
SOLUTION
Leverage compute resources outside of the primary on-prem DC for multiple analytical frameworks.
REMOTE DATA RESULTS
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability to handle new analysts with no impact, and faster response times
● Self-service for end-users
18. Alluxio at Electronic Arts (EA)
Single Cloud with AWS
• Up to 6x performance: when handling a large number of small files
• Elastic compute: to reduce infrastructure costs
• Reduce S3 costs: by eliminating S3 access operations
20. Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
▪ Tiers: Hot (RAM), Warm (SSD), Cold (HDD)
▪ Read & write buffering, transparent to the app
▪ Policies for pinning, promotion/demotion, TTL
[Diagram labels: On-premises, Public Cloud]
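Promotion and demotion across storage tiers can be sketched as follows. This is a simplified illustration, not Alluxio's tiering implementation: the `TieredStore` class, the per-tier block counts, and the "promote on read, demote least recently used on overflow" rule are assumptions chosen for the example, and stored values are assumed to be non-None.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier store: fastest tier first, LRU demotion on overflow."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: evicted entirely
        _, cap, blocks = self.tiers[level]
        blocks[key] = value
        blocks.move_to_end(key)
        if len(blocks) > cap:
            # Demote the least recently used block to the next tier down.
            old_key, old_value = blocks.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def _remove(self, key):
        for _, _, blocks in self.tiers:
            if key in blocks:
                return blocks.pop(key)
        return None

    def write(self, key, value):
        self._remove(key)            # drop any stale copy first
        self._insert(0, key, value)  # new data lands in the fastest tier

    def read(self, key):
        value = self._remove(key)
        if value is not None:
            self._insert(0, key, value)  # promote on access
        return value

    def tier_of(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None


store = TieredStore([("RAM", 1), ("SSD", 1), ("HDD", 1)])
for block in ("a", "b", "c"):
    store.write(block, block.upper())
print(store.tier_of("a"), store.tier_of("b"), store.tier_of("c"))
store.read("a")  # promotes "a" back to RAM and pushes the others down
print(store.tier_of("a"), store.tier_of("b"), store.tier_of("c"))
```

The application only calls `read` and `write`; which tier holds a block is decided by the store, which is what keeps tiering transparent to the app.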
21. Metadata Locality with “Active Sync”
Detect on-prem changes and synchronize metadata
HDFS iNotify-based metadata synchronization
Policies for pinning, promotion/demotion, TTL
[Diagram labels: Alluxio Master; Mutation; Old file at path /file1; New file at path /file1; On-premises, Public Cloud]
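Event-driven metadata synchronization can be sketched as applying a stream of change notifications to a cached view of the file system. This is an illustration only: the `ChangeEvent` type, the event kinds, and `apply_events` are hypothetical and do not reproduce the HDFS or Alluxio event formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change notification from the on-prem file system."""
    kind: str           # "create", "modify", "delete", or "rename"
    path: str
    version: int = 0    # stand-in for file metadata such as mtime or length
    new_path: str = ""  # used only by "rename"


def apply_events(metadata, events):
    """Apply events in order to a path -> version map and return it."""
    for ev in events:
        if ev.kind in ("create", "modify"):
            metadata[ev.path] = ev.version
        elif ev.kind == "delete":
            metadata.pop(ev.path, None)
        elif ev.kind == "rename":
            if ev.path in metadata:
                metadata[ev.new_path] = metadata.pop(ev.path)
        else:
            raise ValueError(f"unknown event kind: {ev.kind}")
    return metadata


# The cached view starts with the old file at /file1.
cached = {"/file1": 1}
apply_events(cached, [
    ChangeEvent("modify", "/file1", version=2),  # new file at the same path
    ChangeEvent("create", "/file2", version=1),
])
print(cached)
```

Because only the changes are shipped, the cached metadata stays current without rescanning the on-prem namespace.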
22. Policy Driven Data Migration
Migrate data to cloud storage based on access policies
[Diagram: hdfs://host:port/directory/ containing Reports and Sales]
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
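The example policy, migrating data older than 7 days, comes down to an age check against each file's last-modified time. The sketch below is an assumption-laden illustration rather than Alluxio's policy engine: `select_for_migration`, the file listing, and the timestamps are invented for the example.

```python
from datetime import datetime, timedelta


def select_for_migration(files, now, max_age=timedelta(days=7)):
    """Return, sorted, the paths whose last-modified time is older than max_age.

    files: dict mapping path -> last-modified datetime (hypothetical listing).
    """
    cutoff = now - max_age
    return sorted(path for path, mtime in files.items() if mtime < cutoff)


# Made-up listing under the single path from the slide's diagram.
listing = {
    "/directory/Reports/2020-09-01.csv": datetime(2020, 9, 1),
    "/directory/Sales/2020-09-10.csv": datetime(2020, 9, 10),
    "/directory/Sales/2020-09-20.csv": datetime(2020, 9, 20),
}
to_move = select_for_migration(listing, now=datetime(2020, 9, 22))
print(to_move)
```

Files selected this way would move from HDFS to S3 while staying under the same path, so applications would not need to change where they read from.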
24. Alluxio Catalog Service
[Diagram labels: Hive Metastore, Hive Under Database]
Functionality
▪ Manages metadata for structured data
▪ Abstracts other database catalogs as Under Database (UDB)
Benefits
▪ Schema-aware optimizations
▪ Simple deployment
25. Transformation Service
Transform data to be compute-optimized independent of the storage format
▪ Coalesce
▪ Format conversion: CSV to Parquet
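Coalescing means combining many small files into fewer, larger ones. One simple way to plan that is to group files greedily until a target size is reached. This sketch shows only the planning step, under assumptions: `plan_coalesce`, the file names, and the sizes are hypothetical, and it does not describe how the Transformation Service actually chooses groups.

```python
def plan_coalesce(file_sizes, target_bytes):
    """Greedily group files, in input order, into batches of about target_bytes.

    file_sizes: list of (name, size_in_bytes) pairs.
    Returns a list of groups; each group would become one output file.
    """
    groups, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new group when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [("part-0.csv", 40), ("part-1.csv", 40), ("part-2.csv", 40),
               ("part-3.csv", 100), ("part-4.csv", 10)]
print(plan_coalesce(small_files, target_bytes=100))
```

Fewer, larger files mean fewer open and list operations for the query engine, and the format conversion step can then write each group out in a columnar format.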
26. Example Results
▪ Attached existing Hive database into Alluxio Catalog
▪ Alluxio Catalog served table metadata for Presto
▪ Transformed store_sales by coalescing and converting CSV to Parquet
Presto without Alluxio: 20s
Alluxio transformations: 7s
Alluxio transformations with caching: 3s
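The three timings on this slide can be turned into speedup factors with a short calculation. The numbers come from the slide; the `speedup` helper is just illustrative arithmetic.

```python
def speedup(baseline_seconds, improved_seconds):
    """How many times faster the improved run is than the baseline."""
    return baseline_seconds / improved_seconds


presto_only = 20.0          # Presto without Alluxio
transformed = 7.0           # Alluxio transformations
transformed_cached = 3.0    # Alluxio transformations with caching

print(round(speedup(presto_only, transformed), 1))         # 2.9
print(round(speedup(presto_only, transformed_cached), 1))  # 6.7
print(round(speedup(transformed, transformed_cached), 1))  # 2.3
```

Transformations alone give roughly a 2.9x improvement, and adding caching brings the total to roughly 6.7x over the baseline.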
28. How can Alluxio help you?
• Did you learn what Alluxio Data Orchestration is?
• Do you have a use case Alluxio can accelerate?
For follow-up questions and to discuss your situation, please contact Peter at peter@alluxio.com
29. Additional Resources
I. Burst data lake processing to Dataproc using on-prem Hadoop data
https://cloud.google.com/blog/products/data-analytics/burst-data-lake-processing-dataproc-using-prem-hadoop-data
II. Tutorial: Hybrid Cloud Bursting with GCP and Alluxio
https://docs.alluxio.io/ee/user/stable/en/tutorials/GCP-Tutorial.html
III. “Zero-Copy” Hybrid Cloud for Data Analytics
https://www.alluxio.io/resources/whitepapers/zero-copy-hybrid-cloud-for-data-analytics-strategy-architecture-and-benchmark-report/
IV. Getting Started with Dataproc and Alluxio
https://docs.alluxio.io/ee/user/stable/en/cloud/Google-Dataproc.html
V. Using Transparent URI
https://docs.alluxio.io/ee/user/stable/en/operation/Transparent-Uri.html