HDFS ARCHITECTURE
[Diagram: a Client and a Name Node / Job Tracker coordinating commodity compute & storage nodes, each running a Task Tracker, a Data Store, and a MapReduce task]
There are megatrends transforming our industry that are predicated on a platform of trust.
According to leading industry analysts, four major trends are shaping IT and the business:
Mobility
Cloud
Big Data and
Social
These trends are forming what is being called the third platform - a platform architected for these trends and built to support billions of users and millions of applications.
Looking back, the first platform was the mainframe, with thousands of applications, millions of users, and proprietary terminals as the end-user device of choice. The second platform was the internet and client/server computing, with the PC as the end-user device. This platform continues to support tens of thousands of applications and hundreds of millions of users. However, current architectures are being pushed to their limits, and scaling this type of environment can be costly and ineffective.
The third platform is architected with web-scale in mind, supporting millions of applications and billions of users and is built on the technology pillars of mobility, cloud services, big data and analytics, and social networking.
When we talk about the third platform in an enterprise setting, we’re really talking about the convergence of these forces and their powerful combination to serve as a foundational architecture for IT organizations. Beyond the individual trends, the seamless “combination” of these trends is becoming critical since it collectively represents an agile new IT fabric for applications, data centers and, most importantly, the user experience. According to IDC, the third platform will serve as the primary growth driver of the IT industry over the next decade, responsible for 75% of new growth as worldwide IT spending moves from $3.7 trillion in 2013 to more than $5 trillion in 2020.
Unstructured data is no longer just files from office productivity applications. The real growth, and the real storage management problem, is coming from:
New media such as videos and podcasts
Machine-generated data from devices such as sensors – telemetry data – in fact a transatlantic flight from NYC to London can generate 20-30 TB of telemetry data!
Communities – social interactions
Mobile Devices – pictures, music, etc.
Imaging Equipment – imaging, imaging studies, health records
The intelligent economy produces a constant stream of data that is being monitored and analyzed. IDC estimates that the digital universe will be 40 ZB by 2020. That’s a 40 followed by 21 zeroes. Social interactions, mobile devices, facilities, equipment, R&D, simulations, and physical infrastructure all contribute to the flow of information. In aggregate, this is what is called Big Data. The Big Data economy is characterized by:
More Sources of data
Communities
Mobile Devices
Sensors
Imaging Equipment
Richer Content
Pictures
Videos
Data Streams
Longer utility
Durable value – information and information about information (metadata) has value for a long time after its creation. All this data can have business value.
Regulatory burdens – always a contributor to the need to retain data for longer and longer periods of time, often indefinitely.
Data has value well-beyond the context of the application that created it. Information-based applications and services will have tremendous financial impact across many market segments. Evolving to the 3rd platform and exploiting information will have quantifiable impact on profit margins, revenues, productivity metrics and operating costs.
The potential is obvious and has been validated by early adopters. Big Web companies, Oil & Gas, Pharmaceutical firms, large retailers and many more have used Big Data analytics for deep business insights that target and retain customers and build competitive advantage. The early/late majority, however, are moving more cautiously. Enterprise customers are not starting with a blank canvas, and while they want all the benefits that the 3rd platform offers, they have invested millions if not billions of dollars into an infrastructure that they must continue to maintain and grow. The cost, risk and value of moving to a 3rd platform is still uncertain. They have questions about how they gain the value of the 3rd platform while leveraging their current IT infrastructure.
Big Data and HDFS are disruptive. According to 451 Research, the market for Hadoop/NoSQL software and services will be $3.5 billion by 2017 (45% CAGR). It’s more than analytics, though that’s a huge part of it. The disruptive change is that data has value beyond its initial application. Information about the information provides insights that are critical to understanding and predicting the business.
Everyone sees the potential but adoption has still been somewhat cautious. Hadoop represents a 3rd platform infrastructure that co-locates compute and storage. But, for most enterprises, a Hadoop cluster only contains a fraction of their enterprise data. Customers need the confidence to move from the lab to production. Can they leverage their existing infrastructure and data? Which Hadoop distribution should they use? There are also concerns about HDFS not being enterprise grade. The namenode still represents a single point of failure, which can be a non-starter for some data and uses.
Customers are still calculating their risk. A dedicated cluster can be very cheap (free) to get started but requires significant investment as it scales. It’s also hard to calculate ROI when it’s unclear which data has value. Other costs that need to be factored in are the bandwidth and network costs of moving data to the cluster and back to primary storage.
Customers see the potential and the necessity of Big Data and 3rd platform applications and services. But their 2nd platform infrastructure is not built for this new model. Yet, existing infrastructure, data and applications are not going away. Organizations need a way to “mind the gap” – leveraging their existing infrastructure and data today while building a platform for the future.
The era of Big Data places new demands on data storage. Storage must contend with varying data types, all of which need to be stored securely for long periods of time and be available for analysis.
Data Unification: There is an increasing focus on data unification, meaning that the storage infrastructure for Big Data has to cater to structured, semi-structured, and unstructured data types.
In-Place Analytics: There is a growing emphasis on in-place analytics, in which compute workloads such as Hadoop MapReduce operations run right where the data lives.
Data Compliance: This market is fraught with challenges stemming from regulatory and compliance requirements. As the platform that hosts data the instant it is created, storage is not immune to these challenges, which also shape how data gets stored in the long term.
ViPR aggregates multi-vendor heterogeneous storage into a unified storage platform that, in turn, can be leveraged as a logical scale-out layer serving as the underlying infrastructure for hosting a range of data services to support collecting, managing, and utilizing unstructured content at massive scale. ViPR Data Services are implemented in software and feature a simple, lightweight, low-touch, scale-out design.
Data services are storage abstractions that reflect the combination of a data type (file, object, or block of data), access protocols (iSCSI, NFS, REST, etc.), and durability, availability, and security characteristics (snapshots, replication, etc.). In ViPR, block, file, object, and HDFS are all data services, though ViPR is not in the data path for file and block (these can be thought of as “control services”).
Object and HDFS are available with more to follow. Data services can be used to provide different semantic views of the same data. You can manipulate a file as a file or as an object without having to move the data to a different platform that features that semantic.
The immediate benefit of ViPR is its ability to automate storage management and provisioning and make storage available as a self-service, consumable resource within a software-defined data center (SDDC). But ViPR also transforms how enterprises deliver data services. With storage arrays and storage services defined in software and managed by policy, ViPR enables organizations to deploy unique Data Services that cloud-enable existing infrastructure and extend the use cases for their data and the value of their storage investments.
ViPR aggregates multi-vendor heterogeneous storage into a unified storage platform that can be leveraged as a logical scale-out layer which can serve as the underlying infrastructure for hosting a range of data services to support collecting, managing and utilizing unstructured content at massive scale
This depicts the architecture for ViPR and highlights the data services functionality. At the bottom are the physical arrays that ViPR can manage.
Above the arrays is the ViPR controller which has features that enable a distributed infrastructure (Cassandra, a distributed DB and Zookeeper to manage status of different nodes in the system) and device drivers to hook into APIs of arrays so the Controller can automate provisioning, management, etc.
On top of that are ViPR data services. The Object Data Service was released at the same time as ViPR Controller in October 2013. HDFS was released in December 2013. HDFS uses the same unstructured storage engine as the Object data service.
The ViPR HDFS data service is the second data service to be released by EMC. It will be available by the end of 2013. The HDFS service gives organizations the ability to run analytics using well-known industry Hadoop distributions on existing data stored across heterogeneous systems such as VNX, Isilon, and NetApp arrays.
Hadoop has become a de-facto standard for companies that are investigating novel strategies for addressing their Big Data challenges. HDFS is the core distributed file system used by Hadoop. Many organizations have an HDFS project in their labs. However, many of these companies have found Hadoop to be difficult to deploy and manage at scale. The ViPR approach to HDFS takes advantage of proven storage hardware to overcome this challenge. Instead of building a discrete analytics silo with dedicated infrastructure, the ViPR HDFS data service leverages the existing ViPR virtualized storage environment and the backend storage platforms it utilizes.
HDFS is becoming increasingly popular as a file system layer for distributed applications, and this goes beyond Hadoop. The ViPR HDFS data service is a Hadoop-compatible file system and supports any Hadoop 2.0 implementation including existing distros such as Cloudera and PivotalHD.
HDFS supports high aggregate-throughput access to data, e.g. for MapReduce. In some cases, it provides low-latency access. However, enterprise concerns include scale, durability, cost, and management.
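MapReduce itself is a simple programming model built around the throughput-oriented access pattern HDFS serves. A minimal word-count sketch in Python (an illustration of the model only, not Hadoop's actual Java API) shows the three phases:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input split."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {word: sum(values) for word, values in groups.items()}

docs = ["big data needs big storage", "data has value"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"])  # 2 2
```

In a real cluster, each map and reduce runs as a task on a Task Tracker slot, and the shuffle moves intermediate data between nodes; the high aggregate throughput comes from running many such tasks against local data in parallel.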
Task trackers are processes on data/slave nodes that accept tasks from a Job Tracker. The tasks are map, reduce, and shuffle operations. Task trackers monitor the tasks running on a node and communicate with the Job Tracker.
Every task tracker has a specified number of slots that correspond to how many tasks it can accept.
During scheduling of a task, the Job tracker looks for an empty task slot on the same node as where the data block resides – thus achieving data locality. Next, it looks for a node with an empty slot on the same rack.
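The locality preference described above amounts to a tiered search over trackers with free slots. A simplified sketch (hypothetical data structures; the real Job Tracker scheduler is considerably more involved) looks like this:

```python
def pick_task_tracker(trackers, replica_nodes, replica_racks):
    """trackers: list of dicts with 'node', 'rack', 'free_slots'.
    replica_nodes / replica_racks: where the data block's copies live."""
    # Pass 1: data-local - a free slot on a node that holds a replica.
    for t in trackers:
        if t["free_slots"] > 0 and t["node"] in replica_nodes:
            return t["node"], "data-local"
    # Pass 2: rack-local - a free slot on a rack that holds a replica.
    for t in trackers:
        if t["free_slots"] > 0 and t["rack"] in replica_racks:
            return t["node"], "rack-local"
    # Fallback: any free slot, even off-rack.
    for t in trackers:
        if t["free_slots"] > 0:
            return t["node"], "off-rack"
    return None, "queued"

trackers = [
    {"node": "n1", "rack": "r1", "free_slots": 0},
    {"node": "n2", "rack": "r1", "free_slots": 2},
    {"node": "n3", "rack": "r2", "free_slots": 1},
]
# Block replicas live on n1 (rack r1) and n4 (rack r2); n1 is busy,
# so the scheduler settles for a rack-local slot on n2.
node, locality = pick_task_tracker(trackers, {"n1", "n4"}, {"r1", "r2"})
print(node, locality)  # n2 rack-local
```

The point of the tiering is to minimize network traffic: a data-local task reads from local disk, a rack-local task stays within a top-of-rack switch, and only the fallback crosses the cluster backbone.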
ViPR HDFS provides an HDFS-compatible file system. In this way, the compute portion of an existing Hadoop cluster communicates with ViPR HDFS. Existing storage arrays managed by ViPR can now be made accessible via HDFS.
The HDFS data service uses the same unstructured storage engine as the ViPR Object data service. ViPR data services create a unified pool (bucket) of data. Similar to the Object data service, users create buckets which can span file shares that can grow and shrink on demand. The data is distributed across the arrays according to how the virtual storage pool is configured. The bucket provides an HDFS interface or, optionally, an Object (S3) and HDFS interface. In this way, the compute portion of an existing Hadoop cluster communicates with ViPR HDFS, which uses existing data (added to the HDFS bucket) as the target for Big Data applications and queries.
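The "same data, two interfaces" idea can be illustrated with a toy bucket that resolves both an S3-style object key and an HDFS-style path to the same stored bytes. This is a conceptual sketch only, not the ViPR implementation; the class and method names are invented for illustration:

```python
class Bucket:
    """Toy unified bucket: one copy of the data, addressable either as
    an object (S3-style key) or as a file (HDFS-style path)."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}          # key -> bytes: the single source of truth

    def put_object(self, key, data):
        self.blobs[key] = data   # write via the object interface

    def get_object(self, key):
        return self.blobs[key]   # read via the object interface

    def open_hdfs(self, path):
        # An HDFS-style path like viprfs://bucket/logs/day1 maps to the
        # object key 'logs/day1' - no copy or migration is needed.
        prefix = "viprfs://" + self.name + "/"
        return self.blobs[path[len(prefix):]]

b = Bucket("analytics")
b.put_object("logs/day1", b"click,click,buy")
print(b.open_hdfs("viprfs://analytics/logs/day1"))  # b'click,click,buy'
```

The design point is that the bucket stores one authoritative copy; each data service is just a different addressing scheme over the same unstructured storage engine, so an object written by an application is immediately visible to a Hadoop job.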
The above diagram illustrates the system architecture of how a ViPR customer can expose their existing data in a ViPR managed array to their Hadoop cluster and run MapReduce jobs on this data.
The object data service and the HDFS data service run on the same set of ViPR Data Service VMs. These VMs can be scaled as the capacity of storage is increased.
ViPR 1.1 will make available a client library (ViPR-HDFS Client) that needs to be installed on all the nodes that run MR jobs on the customer’s Hadoop cluster.
When a task running on a node needs to read a file, the request goes to the ViPR-HDFS client (since the customer points to viprfs:// as their data source), and the ViPR client communicates with the HDFS head on the ViPR data node. The ViPR client passes in an authN token that identifies the user to the HDFS head.
The HDFS head in the ViPR data node receives requests from the ViPR-HDFS client. It verifies the user’s identity by authenticating against the KDC, then talks to the blob engine and the controller process running on the node to fetch the requested data once authN and authZ succeed.
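The read path above, authenticate the token, authorize the request, then fetch from the blob engine, can be sketched as follows. All names here are hypothetical stand-ins; the real flow involves the KDC and controller process, not in-memory tables:

```python
class AuthError(Exception):
    pass

# Stand-ins for the KDC and blob engine (assumed, simplified interfaces).
VALID_TOKENS = {"tok-alice": "alice"}   # authN token -> principal
ACL = {"logs/day1": {"alice"}}          # path -> users allowed to read
BLOBS = {"logs/day1": b"telemetry..."}  # what the blob engine would return

def hdfs_head_read(token, path):
    """Sketch of the HDFS head's read path: authN, then authZ, then fetch."""
    user = VALID_TOKENS.get(token)          # 1. authenticate the token
    if user is None:
        raise AuthError("authentication failed")
    if user not in ACL.get(path, set()):    # 2. authorize the request
        raise AuthError("authorization failed")
    return BLOBS[path]                      # 3. fetch from the blob engine

print(hdfs_head_read("tok-alice", "logs/day1"))  # b'telemetry...'
```

The ordering matters: no data is touched until both authN and authZ succeed, which is what lets the HDFS head sit safely in front of shared enterprise storage.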
In addition to physical segregation, buckets provide logical segregation within the object store. Just like in S3, a user can create buckets which logically segregate applications or sets of data. These buckets can grow and shrink on demand. The actual data objects are distributed and intermingled across the physical devices that comprise the virtual storage array.
Use Case:
Customer sets up ViPR across multiple Isilon and VNX arrays and ingests data into ViPR
ViPR data services create a unified pool (bucket) of data across file shares and provide users with an HDFS interface
Customer installs ViPR HDFS client on an existing PivotalHD cluster
Customer starts writing Hive queries referencing ViPR HDFS as the data source
Use Case:
Customer has an existing PivotalHD cluster with data stored in HDFS within the cluster and has also installed ViPR HDFS client on this PivotalHD cluster
Customer also sets up ViPR across multiple Isilon and VNX arrays and ingests data into ViPR
Customer starts writing MapReduce jobs that reference data in HDFS within the PivotalHD cluster as well as data in ViPR HDFS thereby opening up new analytics scenarios.
The spanning use case is meant to explain that ViPR HDFS and HDFS can coexist. ViPR HDFS will not entirely replace HDFS.
Use Case:
An environment with Cloudera infrastructure installs the ViPR HDFS client
Customer sets up ViPR across multiple Isilon and VNX arrays
Customer starts writing Hive queries referencing ViPR HDFS as the data source and is able to use the existing environment to point at ViPR HDFS
Use Case:
An environment with multiple VNX and Isilon, installs ViPR data services
ViPR data services create a unified pool (bucket) of data across file shares and provide users with access to either an S3 or HDFS interface
Object-based applications as well as analytics workloads are able to use the same set of data without having to move it around.