Benefits of Transferring Real-Time Data to Hadoop at Scale – Hortonworks
Today’s Big Data teams demand solutions designed for Big Data that are optimized, secure, and adaptable to changing workload requirements. Working together, Hortonworks, IBM, and Attunity have designed an integrated solution that transfers large volumes of data to a platform that can handle rapid ingest, processing and analysis of data of all types from all sources, at scale.
https://hortonworks.com/webinar/benefits-transferring-real-time-data-hadoop-scale-ibm-hortonworks-attunity/
Journey to Big Data: Main Issues, Solutions, Benefits – DataWorks Summit
One of the most fruitful aspects of being chosen as a partner bank is that you can have a backend that communicates directly with the client's systems. Through this partnership, Banco Santander has been running a large set of third-party applications on its banking system for many years.
Banking is arguably the most regulated sector, which makes day-to-day operations all the more interesting. Adapting systems to regulation is mandatory, not optional, and for today's banks internal and external audits are a routine part of operations. Furthermore, since SCIB is a global player, this pattern is repeated in every country where the group operates.
The result is a genuinely interesting mix: third-party systems of various kinds are installed in many countries, coexisting with our centralized system, exchanging information with one another, and being adjusted manually, with data aggregated and integrated at the back office. "Spaghetti" comes to mind when you consider that all of this data flows back and forth without pause. Increasingly, regulators and auditors expect to be able to identify precisely where each piece of data originated, which often means intervening manually to fully trace it.
Javier Nieto, who works in Banco Santander's corporate investment banking architecture and innovation department, talks about the integration challenges Santander experienced when building an on-demand data lake on its path to global big data.
The Vortex of Change - Digital Transformation (Presented by Intel) – Cloudera, Inc.
The vortex of change continues all around us – inside the company, with our customers and partners. A new norm is upon us. Business models are being turned upside down – the hunters are now the hunted; with global equalization, size is no longer a guarantee of success. The innovative survive and thrive… the nervous and slow go under… What does all this change mean for you? Find out how Intel's strengths help our customers in this world of change.
This document discusses climbing the AI ladder and preparing data for artificial intelligence. It notes that 81% of organizations do not yet understand the data required for AI. The first step in the ladder is to ensure data is properly structured and accessible. Future steps include applying machine learning everywhere, scaling insights on demand, and building a trusted analytics foundation. The document promotes IBM tools for data science, machine learning, and building a Hadoop data platform to analyze large volumes and varieties of data. It presents a vision of enabling SQL queries of all data, provisioning data as a utility, and using DevOps practices for data science.
This document discusses making banks more predictive and real-time using Hadoop. It describes the challenges of siloed data and batch processing at banks. The author details his bank's journey from setting up a small "play area" Hadoop cluster to experiment, to building a secure predictive analytics lab, to plans for a production system. Key challenges discussed include securing Hadoop, hardware limitations, and rapid tool innovation. Real-time analytics require tools beyond Hadoop like Storm or Spark. Business cases around marketing, spending forecasts and risk are highlighted. The author concludes Hadoop can save costs and accelerate predictive analytics when driven by business cases.
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society of Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables affect quality and yield, and which process variables predict equipment failure? Getting to those answers gives forward-thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
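The multivariate point above can be made concrete: a single-sensor threshold can miss failures that only show up in the combination of variables. Here is a minimal, hedged sketch (sensor names, values, and the z-score rule are all illustrative assumptions, not the speakers' method) of flagging a machine snapshot against its own multivariate history:

```python
import statistics

# Hypothetical multivariate readings from one machine: each row is a
# snapshot of several process variables (names/values illustrative only).
history = [
    {"temp_c": 71.0, "vibration_mm_s": 2.1, "pressure_bar": 5.0},
    {"temp_c": 70.5, "vibration_mm_s": 2.0, "pressure_bar": 5.1},
    {"temp_c": 71.2, "vibration_mm_s": 2.2, "pressure_bar": 4.9},
    {"temp_c": 70.8, "vibration_mm_s": 1.9, "pressure_bar": 5.0},
]

def zscores(snapshot, history):
    """Z-score of each variable in `snapshot` against its own history."""
    out = {}
    for var, value in snapshot.items():
        series = [row[var] for row in history]
        mean = statistics.mean(series)
        stdev = statistics.stdev(series)
        out[var] = (value - mean) / stdev if stdev else 0.0
    return out

def is_anomalous(snapshot, history, threshold=3.0):
    """Flag a snapshot if ANY variable deviates strongly, even when the
    other variables look perfectly normal in isolation."""
    scores = zscores(snapshot, history)
    return any(abs(z) > threshold for z in scores.values()), scores

# Temperature and pressure look normal here, but vibration spikes.
flagged, scores = is_anomalous(
    {"temp_c": 71.1, "vibration_mm_s": 4.5, "pressure_bar": 5.0}, history
)
print(flagged)  # → True
```

In practice this logic would run over streaming sensor data and use far richer models, but the shape of the problem – comparing many variables jointly rather than one sensor at a time – is the same.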
The document discusses Informatica's data integration platform and its capabilities for big data and analytics projects. Some key points:
- Informatica is a leading data integration vendor with over 5,000 customers including over 70% of the Global 500.
- The Informatica platform provides capabilities across the entire data lifecycle from ingestion to delivery including data quality, master data management, integration, and analytics.
- It supports a variety of data sources including structured, unstructured, cloud, and big data and can run on-premises or in the cloud.
- Customers report the Informatica platform improves agility, scalability, and operational confidence for data integration projects compared to
MongoDB IoT City Tour STUTTGART: Hadoop and Future Data Management, by Cloudera – MongoDB
Bernard Doering, Senior Sales Director DACH, Cloudera.
Hadoop and the Future of Data Management. As Hadoop takes the data management market by storm, organisations are evolving the role it plays in the modern data centre. Explore how this disruptive technology is quickly transforming an industry and how you can leverage it today, in combination with MongoDB, to drive meaningful change in your business.
Pivotal: The New Pivotal Big Data Suite – Revolutionary Foundation to Leverage... – EMC
The document discusses Pivotal's big data suite and business data lake offerings. It provides an overview of the components of a business data lake, including storage, ingestion, distillation, processing, unified data management, and action components. It also defines various data processing approaches like streaming, micro-batching, batch, and real-time response. The goal is to help organizations build analytics and transactional applications on big data to drive business insights and revenue.
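The processing approaches listed above differ mainly in how records are grouped before work is done. As one hedged illustration (a generic sketch, not Pivotal's implementation), micro-batching sits between pure streaming and batch by draining a stream into small fixed-size groups:

```python
def micro_batches(stream, batch_size):
    """Group a (possibly unbounded) iterable of events into small batches,
    trading a little latency for per-batch processing efficiency."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Batch processing would instead collect the whole input first; pure
# streaming would hand each event to the processor one at a time.
events = range(7)
print(list(micro_batches(events, 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```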
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address Requirements – DataWorks Summit
This presentation discusses forward-looking statements that are subject to risks and uncertainties. It addresses issues around who owns data, who has access to data, and what type of data analysis can be done. It provides details on government-to-government, bank-to-government, and regional data exchanges. It discusses Rante's divisions and approach to unique experiences. Rante aims to anticipate industry trends and push boundaries through research and technology innovations.
This document discusses strategies for transitioning from a traditional data warehousing architecture to a modern data architecture. It outlines a 4 sprint approach including developing social sensing capabilities, integrating additional data sources, implementing statistical and machine learning methods, and designing an operating model. It emphasizes the importance of a "kill strategy" to decommission legacy systems, a user adoption strategy to transition users to the new system, and implementing a "data concierge" service to streamline data provisioning and maximize value from the new platform. The strategies described aim to rationalize costs, simplify the data landscape, and enable more agile analytics and business transformation.
Partner Keynote: How Logical Data Fabric Knits Together Data Visualization wi... – Denodo
Watch full webinar here: https://bit.ly/3aALFEC
Data Visualization and Data Virtualization are complementary technologies. But how do they come together under a common data fabric? This presentation will discuss how organizations are advancing their data fabric capabilities leveraging innovations in these two technologies in areas of self-service, data catalog, cloud, and AI/ML.
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud – DataStax
Most enterprises understand the value of hybrid cloud. In fact, your enterprise is already working in a multi-cloud or hybrid cloud environment, whether you know it or not. View this SlideShare to gain a greater understanding of the requirements of a geo-distributed cloud database in hybrid and multi-cloud environments.
View recording: https://youtu.be/tHukS-p6lUI
Explore all DataStax webinars: https://www.datastax.com/resources/webinars
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017 – Hortonworks
The document discusses Hortonworks' Data Science Experience (DSX) platform. It describes challenges data scientists face around data access, tool usage, collaboration and model deployment. DSX aims to address these by providing tools for exploring, modeling and deploying data science projects on Hortonworks Data Platform (HDP) clusters at scale. It also announces an extension of IBM and Hortonworks' partnership to integrate DSX and other IBM data science tools with HDP.
This document discusses how data science and AI are fueling new business models driven by data. It summarizes that (1) connected devices, customers, and sensors are generating massive amounts of data across manufacturing, distribution, marketing, sales, and service; (2) technologies like cloud computing, streaming data, IoT, and machine learning are enabling new ways to harness this data; and (3) a modern data architecture is needed to encompass all data sources, enable analytics and machine learning, and power actionable intelligence across edge, cloud, and on-premises environments.
Put Alternative Data to Use in Capital Markets – Cloudera, Inc.
This document discusses alternative data in capital markets. It provides an overview of alternative data sources like social media, satellite imagery, and location data. It also describes how firms are using alternative data to enhance traditional analysis and develop new investment strategies. The document notes that most alternative data users have seen returns from using this data. However, accessing and analyzing large alternative data sets remains a challenge. It promotes the use of data platforms and visual analytics to more effectively ingest, store, and operationalize alternative data.
Cloudera Data Impact Awards 2021 - Finalists – Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
Optimizing Your Hadoop Infrastructure: An Industry Panel Presentation – DataWorks Summit
This document introduces the panel speakers for a discussion on modernizing Hadoop infrastructures. It provides brief biographies of five speakers: Armando Acosta of Dell, who has 15 years of experience in IT solutions and big data; Brandon Draeger of Intel, who manages partnerships between Intel, Cloudera, and their shared ecosystems; TJ Laher of Cloudera, who helps organizations implement Hadoop; Vin Sharma of Intel; and Mark Muncy of Syncsort, who leads technical marketing for big data and has experience in data architecture. The panel will discuss how to evolve Hadoop capabilities to meet emerging customer needs.
Get Started with Cloudera’s Cyber Solution – Cloudera, Inc.
Cloudera empowers cybersecurity innovators to proactively secure the enterprise by accelerating threat detection, investigation, and response through machine learning and complete enterprise visibility. Cloudera’s cybersecurity solution, based on Apache Spot, enables anomaly detection, behavior analytics, and comprehensive access across all enterprise data using an open, scalable platform. But what’s the easiest way to get started?
Join Cloudera, StreamSets, and Arcadia Data as we show you first hand how we have made it easier to get your first use case up and running. During this session you will learn:
Signs you need Cloudera’s cybersecurity solution
How StreamSets can help increase enterprise visibility
Providing your security analyst the right context at the right time with modern visualizations
In order to deal with customers expecting a seamless omnichannel experience, increased regulation, and the speed with which innovative fintechs enter the market, ING has formulated a customer-centric strategy based on data and analytics.
Last year we talked about how ING developed a new architecture, the ING Data Lake, and how, in parallel, the Hadoop-based Big Data paradigm appeared within ING and was mapped onto the Data Lake architecture to make sure Hadoop is leveraged to the maximum.
This year we want to tell you how the international working group helped realize the advanced analytics pattern on the ING private cloud, without prior management approval.
This presentation will discuss the community strategy, how to stay under the radar, how to surface when actual content is strong enough to force change, open issues and the private cloud challenges ING is dealing with. Join us in this ride from community idea through architecture to private cloud implementation with some organizational challenges along the way.
A modern approach to streaming data integration, event processing with a big data (kappa style) data architecture. Key patterns are discussed with pros/cons of newer approaches and open source technologies. Focus on Oracle and GoldenGate technology. OpenWorld 2018 presentation.
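The kappa-style pattern mentioned here treats a single append-only event log (for example, change records captured by a CDC tool such as GoldenGate) as the source of truth; any materialized view can be rebuilt by replaying the log. A minimal sketch under that assumption (the event shape and field names are illustrative, not a specific product's format):

```python
# Append-only change log: in a kappa architecture this would live in a
# durable stream (e.g. Kafka); here it is just an in-memory list.
change_log = [
    {"op": "insert", "id": 1, "balance": 100},
    {"op": "insert", "id": 2, "balance": 50},
    {"op": "update", "id": 1, "balance": 120},
    {"op": "delete", "id": 2},
]

def replay(log):
    """Rebuild the current-state view purely by replaying the log.
    Reprocessing with new logic is just another replay of the same log."""
    state = {}
    for event in log:
        if event["op"] == "delete":
            state.pop(event["id"], None)
        else:  # insert/update are both upserts against the view
            state[event["id"]] = event["balance"]
    return state

print(replay(change_log))  # → {1: 120}
```

The appeal of the kappa style over a lambda-style dual pipeline is exactly this: one code path serves both reprocessing and live processing.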
Without the right data management strategy, investments in Internet of Things (IoT) can yield limited results. Cloudera is pioneering next generation data management solutions, enabling organizations to build an enterprise data hub (EDH) as the backbone to any IoT initiative.
Denodo Design Studio: Modeling and Creation of Data Services – Denodo
Watch full webinar here: https://bit.ly/39T7SON
Change is the only constant and it is very important for enterprises to keep up with the changing times in an agile fashion. To ensure faster time to market, quick business insights and rapid data driven decision making, it is important that the Data Delivery channel is optimized in the best way possible.
With the advent of API Management technologies the demand for data being delivered in the form of a Data Service/APIs is increasing. The ability to make data available in an API format at the click of a button is the need of the hour. Join us to see how easy it is to make enterprise wide data available as Data Services/APIs no matter what format the data is stored in with no prior coding experience. Faster development, zero learning curve and huge value.
Watch on-demand this webinar to learn:
- How to explore datasets available using Denodo Data Catalog
- How to build new data sets using Denodo Design Studio, drag and drop interface
- How to make datasets available in RESTful, OData 4, GeoJSON, GraphQL.
- How to enable different authentication protocols including OAuth 2.0.
- Automatic documentation (Open API) and availability in the Denodo Data Catalog.
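To give a feel for what consuming such a data service looks like from the client side, here is a hedged sketch of building an authenticated GET against a hypothetical RESTful endpoint with an OAuth 2.0 bearer token. The base URL, path layout, parameter names, and token are placeholders, not Denodo's actual API:

```python
import urllib.parse
import urllib.request

def build_data_service_request(base_url, view, filters, token):
    """Build an authenticated GET against a hypothetical REST data service.
    Endpoint layout and parameter names are illustrative placeholders."""
    query = urllib.parse.urlencode(filters)
    url = f"{base_url}/views/{view}?{query}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",  # OAuth 2.0 bearer token
            "Accept": "application/json",
        },
    )

req = build_data_service_request(
    "https://example.com/server/sales", "customer_orders",
    {"region": "EMEA"}, "TOKEN_PLACEHOLDER",
)
print(req.full_url)
print(req.get_header("Authorization"))
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would return the dataset as JSON; the point of the webinar is that the server side of this exchange is generated without coding.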
In this webinar, we will hear from Mark McKinney, Director – Enterprise Data Analytics at Sprint about the business drivers, key success factors, and challenges faced while undertaking Sprint’s data modernization journey. You will hear how Sprint set about establishing a Hadoop data lake, ingested data from multiple environments, and overcame key skill shortages. You will also hear from Diyotta and Hortonworks about best practices for modernizing your data architecture to support transformational business initiatives.
https://hortonworks.com/webinar/sprints-data-modernization-journey/
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Achieving Separation of Compute and Storage in a Cloud World – Alluxio, Inc.
Alluxio Tech Talk
Feb 12, 2019
Speaker:
Dipti Borkar, Alluxio
The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads – one in which compute scales independently from storage. While this enables elastic scaling, it introduces new problems: how do you co-locate data with compute, how do you unify data across multiple remote clouds, and how do you keep storage and I/O service costs down, among others?
Enter Alluxio, a virtual unified file system that sits between compute and storage and allows you to realize the benefits of a hybrid cloud architecture with the same performance and lower costs.
In this webinar, we will discuss:
- Why leading enterprises are adopting hybrid cloud architectures with compute and storage disaggregated
- The new challenges that this new paradigm introduces
- An introduction to Alluxio and the unified data solution it provides for hybrid environments
Hadoop and Spark are big data frameworks whose uses span a variety of scenarios, from ingestion and data prep to data management, processing, analysis, and visualization. Each step requires specialized toolsets to be productive. In this talk I will share solution examples from the Big Data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft’s Azure HDInsight, that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise; taking advantage of all the benefits of HDInsight gives you the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society of Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables effect quality and yield and which process variables predict equipment failure? Getting to those answers is providing forward thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
The document discusses Informatica's data integration platform and its capabilities for big data and analytics projects. Some key points:
- Informatica is a leading data integration vendor with over 5,000 customers including over 70% of the Global 500.
- The Informatica platform provides capabilities across the entire data lifecycle from ingestion to delivery including data quality, master data management, integration, and analytics.
- It supports a variety of data sources including structured, unstructured, cloud, and big data and can run on-premises or in the cloud.
- Customers report the Informatica platform improves agility, scalability, and operational confidence for data integration projects compared to
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB
Bernard Doering, Senior Slaes Director DACH, Cloudera.
Hadoop and the Future of Data Management. As Hadoop takes the data management market by storm, organisations are evolving the role it plays in the modern data centre. Explore how this disruptive technology is quickly transforming an industry and how you can leverage it today, in combination with MongoDB, to drive meaningful change in your business.
Pivotal the new_pivotal_big_data_suite_-_revolutionary_foundation_to_leverage...EMC
The document discusses Pivotal's big data suite and business data lake offerings. It provides an overview of the components of a business data lake, including storage, ingestion, distillation, processing, unified data management, and action components. It also defines various data processing approaches like streaming, micro-batching, batch, and real-time response. The goal is to help organizations build analytics and transactional applications on big data to drive business insights and revenue.
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address RequirementsDataWorks Summit
This presentation discusses forward-looking statements that are subject to risks and uncertainties. It addresses issues around who owns data, who has access to data, and what type of data analysis can be done. It provides details on government-to-government, bank-to-government, and regional data exchanges. It discusses Rante's divisions and approach to unique experiences. Rante aims to anticipate industry trends and push boundaries through research and technology innovations.
This document discusses strategies for transitioning from a traditional data warehousing architecture to a modern data architecture. It outlines a 4 sprint approach including developing social sensing capabilities, integrating additional data sources, implementing statistical and machine learning methods, and designing an operating model. It emphasizes the importance of a "kill strategy" to decommission legacy systems, a user adoption strategy to transition users to the new system, and implementing a "data concierge" service to streamline data provisioning and maximize value from the new platform. The strategies described aim to rationalize costs, simplify the data landscape, and enable more agile analytics and business transformation.
Partner Keynote: How Logical Data Fabric Knits Together Data Visualization wi...Denodo
Watch full webinar here: https://bit.ly/3aALFEC
Data Visualization and Data Virtualization are complementary technologies. But how do they come together under a common data fabric? This presentation will discuss how organizations are advancing their data fabric capabilities leveraging innovations in these two technologies in areas of self-service, data catalog, cloud, and AI/ML.
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudDataStax
Most enterprises understand the value of hybrid cloud. In fact, your enterprise is already working in a multi-cloud or hybrid cloud environment, whether you know it or not. View this SlideShare to gain a greater understanding of the requirements of a geo-distributed cloud database in hybrid and multi-cloud environments.
View recording: https://youtu.be/tHukS-p6lUI
Explore all DataStax webinars: https://www.datastax.com/resources/webinars
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017 Hortonworks
The document discusses Hortonworks' Data Science Experience (DSX) platform. It describes challenges data scientists face around data access, tool usage, collaboration and model deployment. DSX aims to address these by providing tools for exploring, modeling and deploying data science projects on Hortonworks Data Platform (HDP) clusters at scale. It also announces an extension of IBM and Hortonworks' partnership to integrate DSX and other IBM data science tools with HDP.
This document discusses how data science and AI are fueling new business models driven by data. It summarizes that (1) connected devices, customers, and sensors are generating massive amounts of data across manufacturing, distribution, marketing, sales, and service; (2) technologies like cloud computing, streaming data, IoT, and machine learning are enabling new ways to harness this data; and (3) a modern data architecture is needed to encompass all data sources, enable analytics and machine learning, and power actionable intelligence across edge, cloud, and on-premises environments.
Put Alternative Data to Use in Capital Markets Cloudera, Inc.
This document discusses alternative data in capital markets. It provides an overview of alternative data sources like social media, satellite imagery, and location data. It also describes how firms are using alternative data to enhance traditional analysis and develop new investment strategies. The document notes that most alternative data users have seen returns from using this data. However, accessing and analyzing large alternative data sets remains a challenge. It promotes the use of data platforms and visual analytics to more effectively ingest, store, and operationalize alternative data.
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
Optimizing your Hadoop Infastructure: An Industry Panel PresentationDataWorks Summit
This document introduces the panel speakers for a discussion on modernizing Hadoop infrastructures. It provides brief biographies of five speakers: Armando Acosta of Dell, who has 15 years of experience in IT solutions and big data; Brandon Draeger of Intel, who manages partnerships between Intel, Cloudera, and their shared ecosystems; TJ Laher of Cloudera, who helps organizations implement Hadoop; Vin Sharma of Intel; and Mark Muncy of Syncsort, who leads technical marketing for big data and has experience in data architecture. The panel will discuss how to evolve Hadoop capabilities to meet emerging customer needs.
Get Started with Cloudera’s Cyber Solution Cloudera, Inc.
Cloudera empowers cybersecurity innovators to proactively secure the enterprise by accelerating threat detection, investigation, and response through machine learning and complete enterprise visibility. Cloudera’s cybersecurity solution, based on Apache Spot, enables anomaly detection, behavior analytics, and comprehensive access across all enterprise data using an open, scalable platform. But what’s the easiest way to get started?
Join Cloudera, StreamSets, and Arcadia Data as we show you firsthand how we have made it easier to get your first use case up and running. During this session you will learn:
Signs you need Cloudera’s cybersecurity solution
How StreamSets can help increase enterprise visibility
Providing your security analyst the right context at the right time with modern visualizations
To serve customers who expect a seamless omnichannel experience, and to cope with increasing regulation and the speed with which innovative fintechs enter the market, ING has formulated a customer-centric strategy based on data and analytics.
Last year we talked about the new architecture ING developed, the ING Data Lake, and about how, in parallel, the Hadoop-based Big Data paradigm appeared within ING and was mapped onto the Data Lake architecture to make sure Hadoop is leveraged to the maximum.
This year we want to tell you how the international working group helped realize the advanced analytics pattern on the ING private cloud, without prior management approval.
This presentation will discuss the community strategy, how to stay under the radar, how to surface when actual content is strong enough to force change, open issues and the private cloud challenges ING is dealing with. Join us in this ride from community idea through architecture to private cloud implementation with some organizational challenges along the way.
A modern approach to streaming data integration, event processing with a big data (kappa style) data architecture. Key patterns are discussed with pros/cons of newer approaches and open source technologies. Focus on Oracle and GoldenGate technology. OpenWorld 2018 presentation.
Without the right data management strategy, investments in Internet of Things (IoT) can yield limited results. Cloudera is pioneering next generation data management solutions, enabling organizations to build an enterprise data hub (EDH) as the backbone to any IoT initiative.
Denodo Design Studio: Modeling and Creation of Data Services Denodo
Watch full webinar here: https://bit.ly/39T7SON
Change is the only constant and it is very important for enterprises to keep up with the changing times in an agile fashion. To ensure faster time to market, quick business insights and rapid data driven decision making, it is important that the Data Delivery channel is optimized in the best way possible.
With the advent of API Management technologies the demand for data being delivered in the form of a Data Service/APIs is increasing. The ability to make data available in an API format at the click of a button is the need of the hour. Join us to see how easy it is to make enterprise wide data available as Data Services/APIs no matter what format the data is stored in with no prior coding experience. Faster development, zero learning curve and huge value.
Watch on-demand this webinar to learn:
- How to explore datasets available using Denodo Data Catalog
- How to build new datasets using Denodo Design Studio's drag-and-drop interface
- How to make datasets available in RESTful, OData 4, GeoJSON, and GraphQL formats.
- How to enable different authentication protocols including OAuth 2.0.
- Automatic documentation (Open API) and availability in the Denodo Data Catalog.
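As a rough illustration of what consuming such a published data service looks like, the sketch below builds an authenticated request against a hypothetical REST endpoint. The URL scheme, view name, and parameter names here are invented for the example and are not Denodo's actual API; only the general pattern (resource path, query filters, OAuth 2.0 bearer token) is the point.

```python
import urllib.parse
import urllib.request

def build_data_service_request(base_url, view, filters=None, token=None):
    """Build an HTTP request for a published REST data service.

    The endpoint shape and parameter names are illustrative only.
    """
    url = f"{base_url}/{view}"
    if filters:
        url += "?" + urllib.parse.urlencode(filters)
    headers = {"Accept": "application/json"}
    if token:
        # OAuth 2.0 bearer token, one of the auth protocols mentioned above
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)

req = build_data_service_request(
    "https://example.com/server/customer360/views",  # hypothetical server
    "customer",
    filters={"country": "US"},
    token="my-access-token",
)
print(req.full_url)
print(req.get_header("Authorization"))
```

Swapping the serialization (OData, GeoJSON, GraphQL) changes the URL and payload shape, not this basic request-building pattern.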
In this webinar, we will hear from Mark McKinney, Director – Enterprise Data Analytics at Sprint about the business drivers, key success factors, and challenges faced while undertaking Sprint’s data modernization journey. You will hear how Sprint set about establishing a Hadoop data lake, ingested data from multiple environments, and overcame key skill shortages. You will also hear from Diyotta and Hortonworks about best practices for modernizing your data architecture to support transformational business initiatives.
https://hortonworks.com/webinar/sprints-data-modernization-journey/
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Achieving Separation of Compute and Storage in a Cloud World Alluxio, Inc.
Alluxio Tech Talk
Feb 12, 2019
Speaker:
Dipti Borkar, Alluxio
The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads – one in which compute scales independently from storage. While this enables elastic scaling, it introduces new problems: how do you co-locate data with compute, unify data across multiple remote clouds, keep storage and I/O service costs down, and more?
Enter Alluxio, a virtual unified file system, which sits between compute and storage that allows you to realize the benefits of a hybrid cloud architecture with the same performance and lower costs.
In this webinar, we will discuss:
- Why leading enterprises are adopting hybrid cloud architectures with compute and storage disaggregated
- The new challenges that this new paradigm introduces
- An introduction to Alluxio and the unified data solution it provides for hybrid environments
Hadoop and Spark are big data frameworks whose use spans a variety of scenarios: ingestion, data prep, data management, processing, analysis, and visualization. Each step requires specialized toolsets to be productive. In this talk I will share solution examples from the Big Data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft’s Azure HDInsight, that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise, so you can take advantage of all the benefits of HDInsight and get the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of interactive SQL queries with the capacity, scalability, and flexibility of a Hadoop cluster. In this webinar, join Cloudera and MicroStrategy to learn how Impala works, how it is uniquely architected to provide an interactive SQL experience native to Hadoop, and how you can leverage the power of MicroStrategy 9.3.1 to easily tap into more data and make new discoveries.
Pivotal Big Data Suite is a comprehensive platform that allows companies to modernize their data infrastructure, gain insights through advanced analytics, and build analytic applications at scale. It includes components for data processing, storage, analytics, in-memory processing, and application development. The suite is based on open source software, supports multiple deployment options, and provides an agile approach to help companies transform into data-driven enterprises.
This document provides an overview of open source data warehousing and business intelligence (DW/BI). It defines cloud computing and explains how open DW consists of pre-designed data warehouse architectures that are free to use. Open DW reduces costs and risks by shortening design and development time. While the architectures are free, vendors charge for services like customization, support, and maintenance. The document discusses the need for and benefits of open DW/BI, including faster deployment, lower costs, and mitigated risks through rapid development. It also outlines some popular open source databases, tools, and vendors in this space.
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More Alluxio, Inc.
Alluxio - Data Orchestration for Analytics and AI in the Cloud
Oct 8, 2019
Speakers:
Haoyuan Li & Bin Fan, Alluxio
Visit https://www.alluxio.io/events/ for more Alluxio events.
This document provides an overview of Alluxio, a unified data solution that allows applications to access data closer to the computation. It summarizes Alluxio's key innovations including providing a unified namespace, translating between different storage APIs, and using an intelligent caching system. The document also outlines several use cases where Alluxio has helped customers including accelerating machine learning and analytics workloads.
Over the past two decades, the Big Data stack has reshaped itself and evolved quickly, with numerous innovations driven by the rise of many different open source projects and communities. In this meetup, speakers from Uber, Alibaba, and Alluxio will share best practices for addressing the challenges and opportunities in developing data architectures from new and emerging open source building blocks. Topics include data format optimization (ORC and Parquet), storage security (HDFS), and unified data access (Alluxio) layers.
Customer migration to Azure SQL database, December 2019George Walters
This is a real-life story of how a software-as-a-service application moved to the cloud, to Azure, over a period of two years. We discuss the migration, business drivers, technology, and how it got done. We also talk through more modern ways to refactor or change code to get into the cloud today.
Watch full webinar here: https://bit.ly/2Y0vudM
What is Data Virtualization and why do I care? In this webinar we intend to help you understand not only what Data Virtualization is but why it's a critical component of any organization's data fabric and how it fits. We will cover how data virtualization liberates and empowers your business users, from data discovery and data wrangling to the generation of reusable reporting objects and data services. Digital transformation demands that we empower all consumers of data within the organization, and it demands agility too. Data Virtualization gives you meaningful access to information that can be shared by a myriad of consumers.
Register to attend this session to learn:
- What is Data Virtualization?
- Why do I need Data Virtualization in my organization?
- How do I implement Data Virtualization in my enterprise?
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads Alluxio, Inc.
Alluxio provides a data orchestration platform that allows applications to access data closer to compute across different storage systems through a unified namespace. Key features include intelligent multi-tier caching that provides local performance for remote data, API translation that enables popular frameworks to access different storages without changes, and data elasticity through a global namespace. Alluxio powers analytics and AI workloads in hybrid cloud environments.
Watch full webinar here: https://bit.ly/3puUCIc
What is Data Virtualization and why do I care? In this webinar we intend to help you understand not only what Data Virtualization is but why it's a critical component of any organization's data fabric and how it fits. We will cover how data virtualization liberates and empowers your business users, from data discovery and data wrangling to the generation of reusable reporting objects and data services. Digital transformation demands that we empower all consumers of data within the organization, and it demands agility too. Data Virtualization gives you meaningful access to information that can be shared by a myriad of consumers.
Watch on-demand this session to learn:
- What is Data Virtualization?
- Why do I need Data Virtualization in my organization?
- How do I implement Data Virtualization in my enterprise? Where does it fit?
Watch full webinar here: https://bit.ly/2vN59VK
What started as the most agile, real-time enterprise data integration approach is proving to go beyond its initial promise: data virtualization is becoming one of the most important enterprise big data fabrics.
Attend this session to learn:
- What data virtualization really is.
- How it differs from other enterprise data integration technologies.
- Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations.
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo... Denodo
Watch full webinar here: https://bit.ly/3mfFJqb
Presented at Chief Data Officer Live Series 2021, ASEAN (August Edition)
While big data initiatives have become necessary for any business to generate actionable insights, big data fabric has become a necessity for any successful big data initiative. The best-of-breed big data fabrics should deliver actionable insights to the business users with minimal effort, provide end-to-end security to the entire enterprise data platform, and provide real-time data integration while delivering a self-service data platform to business users.
Watch this on-demand session to learn how big data fabric enabled by Data Virtualization:
- Provides lightning fast self-service data access to business users
- Centralizes data security, governance, and data privacy
- Fulfills the promise of data lakes to provide actionable insights
Big data is driving transformative changes in traditional data warehousing. Traditional ETL processes and highly structured data schemas are being replaced with schema flexibility to handle all types of data from diverse sources. This allows for real-time experimentation and analysis beyond just operational reporting. Microsoft is applying lessons from its own big data journey to help customers by providing a comprehensive set of Apache big data tools in Azure along with intelligence and analytics services to gain insights from diverse data sources.
Bridging the Last Mile: Getting Data to the People Who Need It Denodo
Watch full webinar here: https://bit.ly/3cUA0Qi
Many organizations are embarking on strategically important journeys to embrace data and analytics. The goal can be to improve internal efficiencies, improve the customer experience, drive new business models and revenue streams, or – in the public sector – provide better services. All of these goals require empowering employees to act on data and analytics and to make data-driven decisions. However, getting data – the right data at the right time – to these employees is a huge challenge and traditional technologies and data architectures are simply not up to this task. This webinar will look at how organizations are using Data Virtualization to quickly and efficiently get data to the people that need it.
Attend this session to learn:
- The challenges organizations face when trying to get data to the business users in a timely manner
- How Data Virtualization can accelerate time-to-value for an organization’s data assets
- Examples of leading companies that used data virtualization to get the right data to the users at the right time
Achieving compute and storage independence for data-driven workloads Alluxio, Inc.
Alluxio provides a unified interface to access data across multiple storage systems, allowing compute and storage to scale independently for data-driven applications. It uses a virtual unified file system with a global namespace and server-side API translation to abstract data location and access. Alluxio intelligently manages data placement across memory, SSDs and HDDs using multi-tier caching for local performance on remote data. This allows flexible deployment of compute like Spark on any cloud while keeping data fully controlled on-premises. Alluxio is seeing wide adoption with many large production deployments handling thousands of nodes. Upcoming features include POSIX API support and preview of version 2.0.
How to Build Continuous Ingestion for the Internet of Things Cloudera, Inc.
The Internet of Things is moving into the mainstream and this new world of data-driven products is transforming a vast number of industry sectors and technologies.
However, IoT creates a new challenge: how to build and operationalize continual data ingestion from such a wide and ever-changing array of endpoints so that the data arrives consumption-ready and can drive analysis and action within the business.
In this webinar, Sean Anderson from Cloudera and Kirit Busu, Director of Product Management at StreamSets, will discuss Hadoop's ecosystem and IoT capabilities and provide advice about common patterns and best practices. Using specific examples, they will demonstrate how to build and run end-to-end IoT data flows using StreamSets and Cloudera infrastructure.
AI/ML Infra Meetup | ML explainability in Michelangelo Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Eric Wang (Software Engineer, @Uber)
Uber has numerous deep learning models, most of which are highly complex with many layers and a vast number of features. Understanding how these models work is challenging and demands significant resources to experiment with various training algorithms and feature sets. With ML explainability, the ML team aims to bring transparency to these models, helping to clarify their predictions and behavior. This transparency also assists the operations and legal teams in explaining the reasons behind specific prediction outcomes.
In this talk, Eric Wang will discuss the methods Uber used for explaining deep learning models and how we integrated these methods into the Uber AI Michelangelo ecosystem to support offline explaining.
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Junchen Jiang (Assistant Professor of Computer Science, @University of Chicago)
Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill’s impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort towards drastically minimizing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV cache of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can still significantly speed up prefill delay while maintaining the same generation quality.
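As a sketch of the chunk-level reuse idea (a toy stand-in, not the speakers' system), the store below keys each text chunk's KV cache by a content hash and spills older entries from memory to disk, so a reused chunk can be loaded from the cheaper tier instead of re-prefilled. The "KV cache" here is just a placeholder value; a real system holds tensors and overlaps loading with compute.

```python
import hashlib
import os
import pickle
import tempfile

class ChunkKVStore:
    """Toy two-tier store for per-chunk KV caches (illustrative only)."""

    def __init__(self, mem_capacity=2):
        self.mem = {}                       # fast tier: in-memory
        self.disk_dir = tempfile.mkdtemp()  # slow tier: local disk
        self.mem_capacity = mem_capacity

    def _key(self, chunk_text):
        return hashlib.sha256(chunk_text.encode()).hexdigest()

    def put(self, chunk_text, kv):
        key = self._key(chunk_text)
        if len(self.mem) >= self.mem_capacity:
            # Spill the oldest entry to disk instead of discarding it,
            # so it can still be loaded later rather than recomputed.
            old_key, old_kv = next(iter(self.mem.items()))
            del self.mem[old_key]
            with open(os.path.join(self.disk_dir, old_key), "wb") as f:
                pickle.dump(old_kv, f)
        self.mem[key] = kv

    def get(self, chunk_text):
        key = self._key(chunk_text)
        if key in self.mem:
            return self.mem[key]        # memory hit: no prefill needed
        path = os.path.join(self.disk_dir, key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)   # disk hit: load instead of prefill
        return None                     # miss: must run prefill for this chunk

store = ChunkKVStore(mem_capacity=2)
for text in ["chunk A", "chunk B", "chunk C"]:
    store.put(text, f"kv for {text!r}")
print(store.get("chunk A"))  # spilled to disk, loaded rather than recomputed
```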
AI/ML Infra Meetup | Perspective on Deep Learning Framework Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Triston Cao (Senior Deep Learning Software Engineering Manager, @NVIDIA)
From Caffe to MXNet, to PyTorch, and more, Xiande Cao, Senior Deep Learning Software Engineer Manager, will share his perspective on the evolution of deep learning frameworks.
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S... Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Lu Qiu (Data & AI Platform Tech Lead, @Alluxio)
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
Speed and efficiency are two requirements for the underlying infrastructure for machine learning model development. Data access can bottleneck end-to-end machine learning pipelines as training data volume grows and when large model files are more commonly used for serving. For instance, data loading can constitute nearly 80% of the total model training time, resulting in less than 30% GPU utilization. Also, loading large model files for deployment to production can be slow because of slow network or storage read operations. These challenges are prevalent when using popular frameworks like PyTorch, Ray, or HuggingFace, paired with cloud object storage solutions like S3 or GCS, or downloading models from the HuggingFace model hub.
In this presentation, Lu and Siyuan will offer comprehensive insights into improving speed and GPU utilization for model training and serving. You will learn:
- The data loading challenges hindering GPU utilization
- The reference architecture for running PyTorch and Ray jobs while reading data from S3, with benchmark results of training ResNet50 and BERT
- Real-world examples of boosting model performance and GPU utilization through optimized data access
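The mitigation pattern behind most of these solutions is overlapping I/O with compute so the GPU is never waiting on storage. Below is a minimal, framework-free sketch of a prefetching loader; plain Python threads stand in for the Alluxio/PyTorch machinery, and `slow_fetch` simulates a slow object-store read.

```python
import queue
import threading
import time

def prefetching_loader(fetch, items, depth=4):
    """Overlap data fetching with consumption via a background thread.

    A sketch of the general idea, not Alluxio's implementation;
    `depth` bounds how far the fetcher runs ahead of the consumer.
    """
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def worker():
        for item in items:
            q.put(fetch(item))  # fetch runs ahead of the consumer
        q.put(SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is SENTINEL:
            return
        yield batch

def slow_fetch(i):
    time.sleep(0.01)  # simulate a slow storage read
    return i * i

print(list(prefetching_loader(slow_fetch, range(5))))  # [0, 1, 4, 9, 16]
```

In a real pipeline the consumer is the training step, so the next batch is already resident (in cache or host memory) by the time the GPU needs it.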
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud Alluxio, Inc.
Alluxio Monthly Webinar
May. 14, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- ChanChan Mao (Developer Advocate, Alluxio)
- Bin Fan (VP of Technology, Alluxio)
Running AI/ML workloads in different clouds presents unique challenges. The key to a manageable multi-cloud architecture is the ability to seamlessly access data across environments with high performance and low cost.
This webinar is designed for data platform engineers, data infra engineers, data engineers, and ML engineers who work with multiple data sources in hybrid or multi-cloud environments. Chanchan and Bin will guide the audience through using Alluxio to greatly simplify data access and make model training and serving more efficient in these environments.
You will learn:
- How to access data in multi-region, hybrid, and multi-cloud like accessing a local file system
- How to run PyTorch to read datasets and write checkpoints to remote storage with Alluxio as the distributed data access layer
- Real-world examples and insights from tech giants like Uber, AliPay and more
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data Alluxio, Inc.
Alluxio Monthly Webinar
Apr. 23, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- ChanChan Mao (Developer Advocate, Alluxio)
- Shawn Sun (Tech Lead of Cloud Native, Alluxio)
Cloud-native model training jobs require fast data access to achieve shorter training cycles. Accessing data can be challenging when your datasets are distributed across different regions and clouds. Additionally, as GPUs remain scarce and expensive resources, it becomes more common to set up training clusters remote from where the data resides. This multi-region/cloud scenario introduces the challenge of losing data locality, resulting in operational overhead, latency, and expensive cloud costs.
In the third webinar of the multi-cloud webinar series, Chanchan and Shawn dive deep into:
- The data locality challenges in the multi-region/cloud ML pipeline
- Using a cloud-native distributed caching system to overcome these challenges
- The architecture and integration of PyTorch/Ray+Alluxio+S3 using POSIX or RESTful APIs
- Live demo with ResNet and BERT benchmark results showing performance gains and cost savings analysis
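As a tiny illustration of the POSIX side of such an integration: once remote data is exposed through a mount, training and checkpointing code is ordinary file I/O. In the sketch below a temp directory stands in for a hypothetical Alluxio FUSE mount point (e.g. something like `/mnt/alluxio` backed by S3 in a real deployment; the path is an assumption, not a prescribed layout).

```python
import os
import tempfile

# Stand-in for a FUSE mount point; in a real deployment this might be
# /mnt/alluxio with a cloud bucket behind it.
mount = tempfile.mkdtemp()

# The "remote" dataset appears as ordinary files under the mount.
with open(os.path.join(mount, "sample_000.txt"), "w") as f:
    f.write("label=1 features=0.3,0.7")

def load_sample(name):
    # Training code uses plain POSIX reads; it does not need to know
    # whether the bytes come from a local cache or a remote bucket.
    with open(os.path.join(mount, name)) as f:
        return f.read()

def save_checkpoint(name, blob):
    # Checkpoints are written back through the same mount.
    with open(os.path.join(mount, name), "wb") as f:
        f.write(blob)

print(load_sample("sample_000.txt"))
save_checkpoint("ckpt_001.bin", b"\x00\x01")
```

The RESTful alternative mentioned in the agenda replaces these `open()` calls with HTTP requests, but the training loop's structure stays the same.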
Optimizing Data Access for Analytics And AI with Alluxio Alluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Lucy Ge (Staff Software Engineer @ Alluxio)
In this presentation, Lucy Ge will discuss the data access challenges in the data pipeline and how to optimize the speed and costs of analytics and AI workloads.
Speed Up Presto at Uber with Alluxio Caching Alluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Chen Liang (Staff Software Engineer @ Uber)
In this presentation, Chen Liang will share the design and implementation of the Alluxio-Presto local cache to reduce query latency.
Correctly Loading Incremental Data at Scale Alluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Toby Mao (CTO @ Tobiko Data)
Writing efficient and correct incremental pipelines is challenging. Data practitioners who take on this challenge are viewed as performing an "advanced" function, which discourages broader teams from adopting incremental loads. In this lightning talk, CTO of Tobiko Data, Toby Mao, will demystify incremental loading data at scale.
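The core mechanic of an incremental load can be sketched in a few lines: track a high-water mark and process only rows past it. This toy version (not Tobiko's implementation) deliberately ignores the hard parts the talk is about, such as late-arriving data, backfills, and restatements, which is exactly why correct incremental pipelines are considered "advanced".

```python
def incremental_load(source_rows, state, target):
    """Watermark-based incremental load: process only rows newer than
    the last high-water mark. Toy sketch; assumes `updated_at` is a
    monotonically assigned timestamp with no late arrivals.
    """
    watermark = state.get("watermark", 0)
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    target.extend(new_rows)
    if new_rows:
        state["watermark"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)

state, target = {}, []
rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
print(incremental_load(rows, state, target))  # first run loads both rows
rows.append({"id": 3, "updated_at": 30})
print(incremental_load(rows, state, target))  # second run loads only row 3
```

Note that if a row with `updated_at` below the watermark arrives late, this sketch silently drops it; handling that case is where real tooling earns its keep.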
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML Alluxio, Inc.
Big Data Bellevue Meetup
March 21, 2024
For more Alluxio events: https://alluxio.io/events/
Speakers:
Bin Fan (VP of Open Source, Alluxio)
In this presentation, Bin Fan (VP of Open Source @ Alluxio) will address a critical challenge of optimizing data loading for distributed Python applications within AI/ML workloads in the cloud, focusing on popular frameworks like Ray and Hugging Face. Integration of Alluxio’s distributed caching for Python applications is accomplished using the fsspec interface, thus greatly improving data access speeds. This is particularly useful in machine learning workflows, where repeated data reloading across slow, unstable or congested networks can severely affect GPU efficiency and escalate operational costs.
Attendees can look forward to practical, hands-on demonstrations showcasing the tangible benefits of Alluxio’s caching mechanism across various real-world scenarios. These demos will highlight the enhancements in data efficiency and overall performance of data-intensive Python applications. This presentation is tailored for developers and data scientists eager to optimize their AI/ML workloads. Discover strategies to accelerate your data processing tasks, making them not only faster but also more cost-efficient.
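The caching behavior described above can be sketched without Alluxio at all: on first access, copy the file from "remote" storage into a local cache directory, then serve repeat reads locally. (The real integration exposes this through the fsspec interface so Ray and Hugging Face loaders pick it up transparently; the hand-rolled class below is only an illustration of the idea, with a temp directory standing in for slow object storage.)

```python
import os
import shutil
import tempfile

class CachingReader:
    """Sketch of transparent local caching for remote reads."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.remote_reads = 0  # counts slow-path fetches

    def open(self, remote_path):
        local = os.path.join(self.cache_dir, os.path.basename(remote_path))
        if not os.path.exists(local):
            self.remote_reads += 1           # slow path: fetch once
            shutil.copy(remote_path, local)  # stand-in for a network read
        return open(local, "rb")             # fast path: local cache

# Demo: a temp directory stands in for remote object storage.
remote_dir, cache_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
with open(os.path.join(remote_dir, "shard-0001.bin"), "wb") as f:
    f.write(b"training bytes")

reader = CachingReader(cache_dir)
reader.open(os.path.join(remote_dir, "shard-0001.bin")).read()
reader.open(os.path.join(remote_dir, "shard-0001.bin")).read()
print(reader.remote_reads)  # the second read never touched "remote"
```

Over an unstable or congested network, avoiding that second remote read is what keeps GPUs fed between epochs.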
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat... Alluxio, Inc.
Alluxio Monthly Webinar
Feb. 27, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tarik Bennett (Senior Solutions Engineer, Alluxio)
As GenAI and AI continue to transform businesses, scaling these workloads requires optimized underlying infrastructure. A multi-cloud architecture allows organizations to leverage different cloud services to meet diverse workload demands while maximizing efficiency, reducing costs, and avoiding vendor lock-in. However, achieving a multi-cloud vision can be challenging.
In this webinar, Tarik will share how an agnostic data layer, like Alluxio, allows you to embrace the separation of storage from compute and simplify the adoption of multi-cloud for AI.
- Learn why leveraging multiple cloud providers is critical for balancing performance, scalability, and cost of your AI platform
- Discover how an agnostic data layer like Alluxio provides seamless data access in multi-cloud that bridges storage and compute without data replication
- Gain insights into real-world examples and best practices for deploying AI across on-prem, hybrid, and multi-cloud environments
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader... Alluxio, Inc.
Alluxio Monthly Webinar
Jan. 30, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Kevin Petrie (VP of Research, Eckerson Group)
- Omid Razavi (SVP of Customer Success, Alluxio)
2024 is gearing up to be an impactful year for AI and analytics. Join us on January 30, as Kevin Petrie (VP of Research at Eckerson Group) and Omid Razavi (SVP of Customer Success at Alluxio) share key trends that data and AI leaders should know. This event will efficiently guide you with market data and expert insights to drive successful business outcomes.
- Assess current and future trends in data and AI with industry experts
- Discover valuable insights and practical recommendations
- Learn best practices to make your enterprise data more accessible for both analytics and AI applications
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Juncheng Yang (Ph.D. Candidate, @CMU)
As a cache eviction algorithm, FIFO has a lot of attractive properties, such as simplicity, speed, scalability, and flash-friendliness. The most prominent criticism of FIFO is its low efficiency (high miss ratio). In this talk, I will describe a simple, scalable FIFO-based algorithm with three static queues (S3-FIFO). Evaluated on 6594 cache traces from 14 datasets, we show that S3-FIFO has lower miss ratios than state-of-the-art algorithms across traces. Moreover, S3-FIFO's efficiency is robust: it has the lowest mean miss ratio on 10 of the 14 datasets. FIFO queues enable S3-FIFO to achieve good scalability, with 6× higher throughput than optimized LRU at 16 threads. Our insight is that most objects in skewed workloads will only be accessed once in a short window, so it is critical to evict them early (also called quick demotion). The key to S3-FIFO is a small FIFO queue that filters out most objects from entering the main cache, which provides a guaranteed demotion speed and high demotion precision.
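For readers who want the shape of the algorithm, here is a compact Python sketch of S3-FIFO: a small probationary FIFO queue, a main FIFO queue with lazy promotion/reinsertion, and a ghost queue of recently evicted keys. It follows the description above but none of the lock-free engineering that gives the real implementation its throughput; the queue sizing (~10% small) and the frequency cap of 3 are simplifications.

```python
from collections import deque

class S3FIFO:
    """Simplified S3-FIFO cache sketch (small + main + ghost queues)."""

    def __init__(self, capacity):
        self.cap = capacity
        self.small_cap = max(1, capacity // 10)  # ~10% small queue
        self.small, self.main = deque(), deque()
        self.ghost = deque(maxlen=capacity)      # evicted keys only
        self.data, self.freq = {}, {}

    def get(self, key):
        if key in self.data:
            self.freq[key] = min(self.freq[key] + 1, 3)  # capped counter
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.data[key] = value
            return
        while len(self.data) >= self.cap:
            self._evict()
        self.data[key], self.freq[key] = value, 0
        if key in self.ghost:
            self.main.append(key)   # seen recently: admit straight to main
        else:
            self.small.append(key)  # new object: quarantine in small queue

    def _evict(self):
        if len(self.small) >= self.small_cap:
            key = self.small.popleft()
            if self.freq[key] > 0:
                self.main.append(key)  # re-accessed: promote to main
            else:
                self.ghost.append(key)  # one-hit wonder: quick demotion
                del self.data[key]
                del self.freq[key]
                return
        while self.main:
            key = self.main.popleft()
            if self.freq[key] > 0:
                self.freq[key] -= 1     # FIFO-reinsertion, not LRU
                self.main.append(key)
            else:
                del self.data[key]
                del self.freq[key]
                return
```

With only FIFO appends/pops and a counter, no per-access reordering is needed, which is where the scalability over LRU comes from.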
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jingwen Ouyang (Product Manager, @Alluxio)
In this session, Jingwen presents an overview of using Alluxio Edge caching to accelerate Trino or Presto queries. She offers practical best practices for using distributed caching with compute engines. In addition, this session also features insights from real-world examples.
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
- Chunxu Tang (Research Scientist, @Alluxio)
In this session, cloud optimization specialists Chunxu and Siyuan break down the challenges and present a fresh architecture designed to optimize I/O across the data pipeline, ensuring GPUs function at peak performance. The integrated solution of PyTorch/Ray + Alluxio + S3 offers a promising way forward, and the speakers delve deep into its practical applications. Attendees will not only gain theoretical insights but will also be treated to hands-on instructions and demonstrations of deploying this cutting-edge architecture in Kubernetes, specifically tailored for Tensorflow/PyTorch/Ray workloads in the public cloud.
Data Infra Meetup | ByteDance's Native Parquet Reader Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Shengxuan Liu (Software Engineer, @ByteDance)
Shengxuan Liu from ByteDance presents ByteDance’s new native Parquet Reader. The talk covers the architecture and key features of the Reader, and how it improves data processing efficiency.
Data Infra Meetup | Uber's Data Storage Evolution Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jing Zhao (Principal Engineer, @Uber)
Uber builds one of the biggest data lakes in the industry, which stores exabytes of data. In this talk, we will introduce the evolution of our data storage architecture, and delve into multiple key initiatives during the past several years.
Specifically, we will introduce:
- Our on-prem HDFS cluster scalability challenges and how we solved them
- Our efficiency optimizations that significantly reduced the storage overhead and unit cost without compromising reliability and performance
- The challenges we are facing during the ongoing Cloud migration and our solutions
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio, Inc.
Alluxio Monthly Webinar
Nov. 15, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tarik Bennett (Senior Solutions Engineer)
- Beinan Wang (Senior Staff Engineer & Architect)
Many companies are developing architectures for AI platforms but have concerns about efficiency at scale as data volumes increase. They use centralized cloud data lakes, such as S3, to store training data. However, GPU shortages add further complications. Storage and compute can be separate, or even remote, making data loading slow and expensive:
1) Optimizing a developmental setup can include manual copies, which are slow and error-prone
2) Directly transferring data across regions or from cloud to on-premises can incur expensive egress fees
This webinar covers solutions to improve data loading for model training. You will learn:
- The data loading challenges with distributed infrastructure
- Typical solutions, including NFS/NAS on object storage, and why they are not the best options
- Common architectures that can improve data loading and cost efficiency
- Using Alluxio to accelerate model training and reduce costs
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...Alluxio, Inc.
AI Infra Day
Oct. 25, 2023
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Adit Madan (Director of Product Management, @Alluxio)
In this session, Adit Madan, Director of Product Management at Alluxio, presents an overview of using distributed caching to accelerate model training and serving. He explores the requirements of data access patterns in the ML pipeline and offers practical best practices for using distributed caching in the cloud. This session features insights from real-world examples, such as AliPay, Zhihu, and more.
AI Infra Day | The AI Infra in the Generative AI EraAlluxio, Inc.
AI Infra Day
Oct. 25, 2023
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (Chief Architect, VP of Open Source, @Alluxio)
As the AI landscape rapidly evolves, the advancements in generative AI technologies, such as ChatGPT, are driving a need for a robust AI infra stack. This opening keynote will explore the key trends of the AI infra stack in the generative AI era.
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...XfilesPro
Wondering how X-Sign gained popularity in such a short time span? This eSign functionality of XfilesPro DocuPrime offers many advancements for Salesforce users. Explore them now!
Measures in SQL (SIGMOD 2024, Santiago, Chile)Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Łukasz Chruściel
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes in your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
SOCRadar's Aviation Industry Q1 Incident Report is out now!
The aviation industry has always been a prime target for cybercriminals due to its critical infrastructure and high stakes. In the first quarter of 2024, the sector faced an alarming surge in cybersecurity threats, revealing its vulnerabilities and the relentless sophistication of cyber attackers.
SOCRadar’s Aviation Industry Quarterly Incident Report provides an in-depth analysis of these threats, detected and examined through our extensive monitoring of hacker forums, Telegram channels, and dark web platforms.
Using Query Store in Azure PostgreSQL to Understand Query PerformanceGrant Fritchey
Microsoft has added an excellent new extension in PostgreSQL on their Azure Platform. This session, presented at Posette 2024, covers what Query Store is and the types of information you can get out of it.
Mobile App Development Company In Noida | Drona InfotechDrona Infotech
Drona Infotech is a premier mobile app development company in Noida, providing cutting-edge solutions for businesses.
Visit Us For : https://www.dronainfotech.com/mobile-application-development/
Transform Your Communication with Cloud-Based IVR SolutionsTheSMSPoint
Discover the power of Cloud-Based IVR Solutions to streamline communication processes. Embrace scalability and cost-efficiency while enhancing customer experiences with features like automated call routing and voice recognition. Accessible from anywhere, these solutions integrate seamlessly with existing systems, providing real-time analytics for continuous improvement. Revolutionize your communication strategy today with Cloud-Based IVR Solutions. Learn more at: https://thesmspoint.com/channel/cloud-telephony
UI5con 2024 - Keynote: Latest News about UI5 and its EcosystemPeter Muessig
Learn about the latest innovations in and around OpenUI5/SAPUI5: UI5 Tooling, UI5 linter, UI5 Web Components, Web Components Integration, UI5 2.x, UI5 GenAI.
Recording:
https://www.youtube.com/live/MSdGLG2zLy8?si=INxBHTqkwHhxV5Ta&t=0
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfVALiNTRY360
Salesforce Healthcare CRM, implemented by VALiNTRY360, revolutionizes patient management by enhancing patient engagement, streamlining administrative processes, and improving care coordination. Its advanced analytics, robust security, and seamless integration with telehealth services ensure that healthcare providers can deliver personalized, efficient, and secure patient care. By automating routine tasks and providing actionable insights, Salesforce Healthcare CRM enables healthcare providers to focus on delivering high-quality care, leading to better patient outcomes and higher satisfaction. VALiNTRY360's expertise ensures a tailored solution that meets the unique needs of any healthcare practice, from small clinics to large hospital systems.
For more info visit us https://valintry360.com/solutions/health-life-sciences
KuberTENes Birthday Bash Guadalajara - Introduction to Argo CD
Accelerate and Scale Big Data Analytics and Machine Learning Pipelines with Disaggregated Compute and Storage
1. Dipti Borkar | Head of Product, Alluxio
Shailesh Manjrekar | Head of AI/ML Product and Solutions, SwiftStack
NextGen Data Analytics Stack – Alluxio and SwiftStack
Edge to Core to Cloud
2. Unstoppable Data Growth – Edge to core to cloud
Emphasis on capturing value and showing return on investment (ROI) from data
* IDC Worldwide Storage in Big Data Forecast, 2015–2019, October 2015, and IDC Directions
Value capture is key. By 2020: 30B IoT connected devices*, 100–250EB of big data storage capacity*, and 44ZB of data created*
3. Status quo – existing solutions
Business leaders and storage architects struggle to show return on investment
[Diagram: five DAS silos, each pairing compute + storage]
- Poor utilization
- DIY fatigue
- High CapEx
- High OpEx
- Data gravity
INEFFICIENT AND EXPENSIVE
4. Four big trends driving the need for a new data analytics stack
- Separation of compute & storage
- Hybrid and multi-cloud environments
- Self-service data across the enterprise
- Rise of the object store
5. Customer challenges with existing solutions
Lack of enterprise-ready products and the continued pressure of ever-increasing cloud OpEx
CHALLENGE 1: Ever-increasing operating expenditures on (a) poorly utilized existing DAS solutions or (b) cloud storage deployment costs at scale
CHALLENGE 2: Need for a high-throughput stack with API compatibility to support batch, interactive, and advanced analytical workloads
CHALLENGE 3: Lack of enterprise-ready, multi-cloud data lake systems – at-scale deployments with lifecycle management, self-healing, geo-replication, and faster re-builds
6. Data Ecosystem – Beta Data Ecosystem 1.0
[Diagram: compute tiers paired with storage tiers]
7. Big data journey and innovation options for enterprises
1. Co-located compute & HDFS on the same cluster (MR / Hive on HDFS): typically compute-bound clusters over 100% capacity; compute & I/O need to be scaled together even when not needed
2. Disaggregated compute & HDFS on the same cluster (Hive on HDFS): compute & I/O can be scaled independently, but I/O is still needed on HDFS, which is expensive
3. HDFS for hybrid cloud: burst HDFS data in the cloud, public or private
4. Support more frameworks: support Presto, Spark, and other computes without app changes
5. Transition to object store: enable & accelerate big data on object stores
8. The SwiftStack Data Analytics Solution with Alluxio
Multi-cloud storage and data management; accelerated compute, data accessibility, and elasticity
Interfaces: Java File API, HDFS interface, S3 interface, REST API, FUSE interface
Drivers: HDFS driver, Swift driver, S3 driver
9. SwiftStack Data Analytics Solution – business use cases
Use cases: customer, security, and fraud analysis; precision medicine and bio-informatics; customer churn / sentiment analysis; analytics as a service; operational analytics; Internet of Things / Everything
Industry verticals: financial services (FSI); healthcare and life sciences, genomics; cloud service providers; oil and gas, industrial internet and manufacturing; media and entertainment
10. SwiftStack Data Analytics solution – value to be captured by enterprises
Data and analytics as a source of competitive advantage
Source: IDC Directions
“Organizations that analyze all relevant data and deliver actionable information will achieve an extra $430B in productivity gains over less analytically oriented peers by 2020”
IDC: Worldwide Big Data and Analytics 2016 Predictions
Value can be created in the following ways, with some industry relevance:
- Improve operational efficiency
- Reduce cost
- New product development
- Insights into new services
- Better customer experience
11. Alluxio and SwiftStack partnership
2014: Originated as the Tachyon project at UC Berkeley’s AMPLab by then-Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li
2015: Open source project established, and company founded to commercialize Alluxio
Goal: orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI
12. Alluxio – key innovations
- Data elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on demand with compute
- Data accessibility for popular APIs & API translation: run Spark, Hive, Presto, and ML workloads on your data located anywhere
- Data locality with intelligent multi-tiering: accelerate big data workloads with transparent tiered local data
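The unified-namespace idea can be illustrated with a minimal sketch: a mount table maps logical paths to the URIs of the backing stores, so applications address one namespace while data lives anywhere. The mount points and URIs below are hypothetical stand-ins, not Alluxio's actual implementation.

```python
# Minimal sketch of a unified namespace: a mount table maps logical paths
# to backing-store URIs. All mount points and URIs here are hypothetical.

MOUNT_TABLE = {
    "/warehouse": "s3://analytics-bucket/warehouse",   # hypothetical S3 mount
    "/archive":   "swift://backup-container/archive",  # hypothetical Swift mount
    "/raw":       "hdfs://namenode:8020/raw",          # hypothetical HDFS mount
}

def resolve(logical_path: str) -> str:
    """Translate a logical path to the URI of its backing store."""
    # Longest-prefix match so nested mounts resolve to the most specific entry.
    for mount in sorted(MOUNT_TABLE, key=len, reverse=True):
        if logical_path == mount or logical_path.startswith(mount + "/"):
            return MOUNT_TABLE[mount] + logical_path[len(mount):]
    raise KeyError(f"no mount covers {logical_path!r}")

print(resolve("/warehouse/events/2019/part-0.parquet"))
# s3://analytics-bucket/warehouse/events/2019/part-0.parquet
```

Compute frameworks then see a single tree of paths; swapping S3 for Swift or HDFS is a mount-table change rather than an application change.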
14. “Infrastructure challenges are the primary inhibitor for broader adoption of AI/ML workflows. SwiftStack’s multi-cloud data management solution is the first of its kind in the industry and effectively handles the storage I/O challenges faced by edge-to-core-to-cloud, large-scale AI/ML data pipelines”
Amita Potnis, Research Director at IDC’s Infrastructure System Platform and Technologies Group
15. Multi-Cloud Storage and Data Management
Property of SwiftStack Inc.
Storage and multi-cloud data management for data-driven applications and workflows
SwiftStack Storage – on-premises cloud storage:
- Highest throughput performance
- Easy to deploy, operate, and scale
- From tens of terabytes to hundreds of petabytes
- Spans multiple geographic regions
- Proven platform to realize more value from data!
SwiftStack 1space – multi-cloud data management:
- Transparent access to a single storage namespace
- Public and private infrastructure
- Policy-driven data placement
- Metadata search across the namespace
- Leverage unique services across clouds!
16. SwiftStack Object Storage Architecture
Continuous auditing, automatic replication, fault tolerant
- Automated storage system management for standard servers
- Replicas and erasure codes on direct-attached storage
- Masterless, quorum writes; nearest reads
- As-dispersed-as-possible data placement across nodes / zones / regions
- Distributed partitions in a modified consistent hash ring
Services: replication, reconstruction, auditing, device inventory management, storage system metrics collection, hardware fault detection
Runs on standard servers, drives & networking across sites (Site 1, Site 2, Site 3)
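The "consistent hash ring" placement mentioned above can be sketched generically: hash each node many times onto a ring of virtual nodes, then map each object key to the first virtual node clockwise from its hash. This is an illustration of the core idea only; real Swift-style rings use fixed partitions, zones, and replica dispersion rather than this simplified version.

```python
import hashlib
from bisect import bisect

# Illustrative consistent hash ring: keys map to the first virtual node
# clockwise from their hash. A simplified sketch, not SwiftStack's ring.

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets `vnodes` positions for better balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First virtual node at or past the key's hash, wrapping to 0.
        idx = bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("container/object-0001"))  # a stable choice of one node
```

The payoff of this scheme is that adding or removing a node only remaps the keys adjacent to its virtual nodes, rather than reshuffling the whole keyspace.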
18. Data Analytics Hub – Total Cost of Ownership (TCO) Analysis
The 5-year TCO of the hosted private cloud solution is one quarter of that of a public cloud deployment
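A TCO comparison of this shape boils down to simple arithmetic: upfront CapEx plus yearly OpEx over the horizon. The sketch below uses entirely hypothetical dollar figures chosen to reproduce the slide's 4x ratio; they are not SwiftStack or cloud pricing.

```python
# Back-of-envelope TCO comparison illustrating the slide's 4x claim.
# All dollar figures are hypothetical placeholders, not vendor pricing.

YEARS = 5

def tco(capex, opex_per_year, years=YEARS):
    """Total cost of ownership over the horizon: upfront + recurring."""
    return capex + opex_per_year * years

private_cloud = tco(capex=1_000_000, opex_per_year=200_000)    # $2.0M over 5 years
public_cloud  = tco(capex=0,         opex_per_year=1_600_000)  # $8.0M over 5 years

ratio = private_cloud / public_cloud
print(f"private/public 5-year TCO ratio: {ratio:.2f}")  # 0.25
```

The private option trades a large CapEx outlay for much lower recurring spend, which is why the gap widens as the horizon lengthens.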
20. 1. SwiftStack Data Analytics solution – on-premises deployment
For customers starting their on-premises analytics journeys
Benefits:
- Same performance as HDFS
- No more HDFS: operational simplicity
- Compute can be fully virtualized / containerized!
- Durability++ (erasure coding)
- Scale (billions of objects / racks / geo)
[Diagram: Alluxio co-located with Presto / Spark / Hive in the same container or machine, dramatically speeding up big data on object stores on premises]
21. 2. Cloud bursting with SwiftStack Data Analytics solution
Hybrid workflow: customers host data on-premises and leverage the public cloud for economies of scale, with Alluxio providing data locality
Benefits:
- Data, as a strategic asset, stays on-premises
- Leverage cloud economies of scale for compute
[Diagram: Hadoop cluster nodes running Alluxio (“alluxio://”) and compute (Spark, Presto, Hive, ...) in the public cloud, connected over the WAN to the private cloud]
22. 3. HDFS off-load to SwiftStack Data Analytics solution
HDFS off-load: existing HDFS customers on DAS looking to move to S3 need a migration; leverage DistCp (distributed copy) as the data mover, then keep the same workflow using Alluxio
Benefits:
- A known and well-understood process for administrators: existing HDFS workflows plus an rsync-like backup workflow
[Diagram: a co-located environment (Impala, Hive, Spark) in the same data center / region; Presto and Spark on Alluxio enable big data on object stores across single or multiple clouds]
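The "DistCp for the bulk copy, then an rsync-like pass for the remainder" pattern can be sketched as a simple incremental sync: copy only objects that are missing at the destination or whose size differs. Local directories stand in for HDFS and the object store here; a real migration would use `hadoop distcp` and object-store listings instead.

```python
import shutil
from pathlib import Path

# Sketch of the rsync-like incremental pass in an HDFS-to-object-store
# migration. Local directories stand in for the real source and destination.

def incremental_sync(src: Path, dst: Path) -> list[str]:
    """Copy files that are missing at dst or whose size differs; return them."""
    copied = []
    for f in src.rglob("*"):
        if not f.is_file():
            continue
        rel = f.relative_to(src)
        target = dst / rel
        # Size mismatch is a cheap (if imperfect) change detector;
        # a real tool would also compare checksums or timestamps.
        if not target.exists() or target.stat().st_size != f.stat().st_size:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)
            copied.append(str(rel))
    return copied
```

Run after the bulk DistCp completes, this pass converges the destination on whatever changed during the copy window, and a second run copies nothing.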
25. Get started with the Data Analytics Solution
§ Come talk to us about analytics on SwiftStack
§ Data Analytics Solution – Alluxio and SwiftStack deliver a winning combination of performance and capacity: “deliver on the promise of a future-ready data lake”
§ Multiple use cases across industry verticals show the success of a highly scalable, lowest-TCO big data solution