SlideShare a Scribd company logo
1 of 44
Download to read offline
PRESENTATION TITLE GOES HEREBig Data Storage Options for Hadoop
Sam Fineberg/HP Storage Division
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
SNIA Legal Notice
The material contained in this tutorial is copyrighted by the SNIA unless
otherwise noted.
Member companies and individual members may use this material in
presentations and literature under the following conditions:
Any slide or slides used must be reproduced in their entirety without modification
The SNIA must be acknowledged as the source of any material used in the body of
any document containing material from these presentations.
This presentation is a project of the SNIA Education Committee.
Neither the author nor the presenter is an attorney and nothing in this
presentation is intended to be, or should be construed as legal advice or an
opinion of counsel. If you need legal advice or a legal opinion please
contact your attorney.
The information presented herein represents the author's personal opinion
and current understanding of the relevant issues involved. The author, the
presenter, and the SNIA do not assume any responsibility or liability for
damages arising out of any reliance on or use of this information.
NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.
2
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Abstract
Big Data Storage Options for Hadoop
The Hadoop system was developed to enable the transformation
and analysis of vast amounts of structured and unstructured
information. It does this by implementing an algorithm called
MapReduce across compute clusters that may consist of
hundreds or even thousands of nodes. In this presentation
Hadoop will be looked at from a storage perspective. The
tutorial will describe the key aspects of Hadoop storage, the
built-in Hadoop file system (HDFS), and some other options for
Hadoop storage that exist in the commercial and open source
communities.
3
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Overview
Introduction
What is Hadoop
What is MapReduce
How does Hadoop use storage
Distributed filesystem concepts
Storage options
Native Hadoop – HDFS
On direct attached storage
On networked (SAN) storage
Alternative distributed filesystems
Cloud object storage
Emerging options
4
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Overview
Introduction
What is Hadoop
What is MapReduce
How does Hadoop use storage
Distributed filesystem concepts
Storage options
Native Hadoop – HDFS
On direct attached storage
On networked (SAN) storage
Alternative distributed filesystems
Cloud object storage
Emerging options
5
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
What is Hadoop?
A scalable fault-tolerant distributed system for data storage
and processing
Core Hadoop has two main components
MapReduce: fault-tolerant distributed processing
Programming model for processing sets of data
Mapping inputs to outputs and reducing the output of multiple Mappers to one (or
a few) answer(s)
Hadoop Distributed File System (HDFS): high-bandwidth clustered
storage
Distributed file system optimized for large files
Operates on unstructured and structured data
A large and active ecosystem
Written in Java
Open source under the friendly Apache License
http://hadoop.apache.org
6
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
What is MapReduce?
A method for distributing a task across multiple nodes
Each node processes data stored on that node
Consists of two developer-created phases
1. Map
2. Reduce
In between Map and Reduce is the Shuffle and Sort
7
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
MapReduce
8
© Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
What was the max temperature for the last century?
MapReduce Operation
9
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Key MapReduce Terminology Concepts
A user runs a client program (typically a Java application) on
a client computer
The client program submits a job to Hadoop
The job is sent to the JobTracker process on the Master Node
Each Slave Node runs a process called the TaskTracker
The JobTracker instructs TaskTrackers to run and monitor
tasks
A task attempt is an instance of a task running on a slave
node
There will be at least as many task attempts as there are
tasks which need to be performed
10
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
MapReduce in Hadoop
11
© Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html
Task Tracker
Input (HDFS)
Output (HDFS)
Mapper
Reducer
Worker=Tasks
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
MapReduce: Basic Concepts
Each Mapper processes single input split from HDFS
Hadoop passes one record at a time to the developer’s
Map code
Each record has a key and a value
Intermediate data written by the Mapper to local disk (not
HDFS) on each of the individual cluster nodes
intermediate data is reliable or globally accessible
During shuffle and sort phase, all values associated with
same intermediate key are transferred to same Reducer
Reducer is passed each key and a list of all its values
Output from Reducers is written to HDFS
12
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
What is a Distributed File System?
A distributed file system is a file system that allows
access to files from multiple hosts across a network
A network filesystem (NFS/CIFS) is a type of distributed file
system – more tuned for file sharing than distributed computation
Distributed computing applications, like Hadoop, utilize a tightly
coupled distributed file system
Tightly coupled distributed filesystems
Provide a single global namespace across all nodes
Support multiple initiators, multiple disk nodes, multiple access
to files – file parallelism
Examples include HDFS, GlusterFS, pNFS, as well as many
commercial and research systems
13
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Overview
Introduction
What is Hadoop
What is MapReduce
How does Hadoop use storage
Distributed filesystem concepts
Storage options
Native Hadoop – HDFS
On direct attached storage
On networked (SAN) storage
Alternative distributed filesystems
Cloud object storage
Emerging options
14
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Hadoop Distributed File System - HDFS
Architecture
Java application, not deeply integrated with the server OS
Layered on top of a standard FS (e.g., ext, xfs, etc.)
Must use Hadoop or a special library to access HDFS files
Shared-nothing, all nodes have direct attached disks
Write once filesystem – must copy a file to modify it
HDFS basics
Data is organized into files & directories
Files are divided into 64-128MB blocks, distributed across nodes
Block placement is handled by the “NameNode”
Placement coordinated with job tracker = writes always co-located, reads co-
located with computation whenever possible
Blocks replicated to handle failure, replica blocks can be used by compute tasks
Checksums used to ensure data integrity
Replication: one and only strategy for error handling, recovery and fault
tolerance
Self Healing
Makes multiple copies (typically 3)
15
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
HDFS on local DAS
A Hadoop cluster consisting of many nodes, each of
which has local direct attached storage (DAS)
Disks are running a standard file system (e.g., ext. xfs, etc.)
HDFS blocks are stored as files in a special directory
Disks attached directly, for example, with SAS or SATA
No storage is shared, disks only attach to a single node
The most common use case for Hadoop
Original design point for Hadoop/HDFS
Can work with cheap unreliable hardware
Some very large systems utilize this model
16
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
HDFS on local DAS
17
…
Compute nodes are part of
HDFS, data spread across nodes
HDFS
Protocol
…
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
HDFS File Write Operation
18
Image source: Hadoop,The Definitive Guide Tom White, O’Reilly
3-way
replication
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
HDFS File Read Operation
19
Image source: Hadoop,The Definitive Guide Tom White, O’Reilly
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
HDFS on local DAS - Pros and Cons
Pros
Writes are highly parallel
Large files are broken into many parts, distributed across the cluster
Three copies of any file block, one written local, two remote
Not a simple round-robin scheme, tuned for Hadoop jobs
Job tracker attempts to make reads local
If possible, tasks scheduled in same node as the needed file segment
Duplicate file segments are also readable, can be used for tasks too
Cons
Not a replacement for general purpose storage
Not a kernel-based POSIX filesystem
Incompatible with standard applications and utilities (but future versions of
Hadoop are adding more other application models)
High replication cost compared with RAID/shared disk
The NameNode keeps track of data location
SPOF - location data is critical and must be protected
Scalability bottleneck (everything has to be in memory)
Improvements to NameNode are in the works
20
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Other HDFS storage options
HDFS on Storage Area Network (SAN) attached storage
A lot like DAS,
Disks are logical volumes in storage array(s), accessed across a SAN
HDFS doesn’t know the difference
– Still appears like a locally attached disk
SAN attached arrays aren’t the same as DAS
Array has its own cache, redundancy, replication, etc.
Any node on the SAN can access any array volume
So a new node can be assigned to a failed node’s data
21
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
HDFS with SAN Storage
22
Storage Arrays Compute nodes
Hadoop
Cluster
…
iSCSI or FC SAN
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
HDFS File Write Operation
23
Array can provide
redundancy, no need to
replicate data across
data nodes
Array
Replication
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
HDFS File Read Operation
24
Array redundancy,
means only a single
source for data
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
SAN for Hadoop Storage
Instead of storing data on direct attached local disks, data is in one
or more arrays attached to data nodes through a SAN
Looks like local storage to data nodes
Hadoop still utilizes HDFS
Pros
All the normal advantages of arrays
RAID, centralized caching, thin provisioning, other advanced array features
Centralized management, easy redistribution of storage
Retains advantages of HDFS (as long as array is not over-utilized)
Easy failover when compute node dies, can eliminate or reduce 3-way
replication
Cons
Cost? It depends
Unless if multiple arrays are used, scale is limited
And with multiple arrays, management and cost advantages are reduced
Still have HDFS complexity and manageability issues
25
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Overview
Introduction
What is Hadoop
What is MapReduce
How does Hadoop use storage
Distributed filesystem concepts
Storage options
Native Hadoop – HDFS
On direct attached storage
On networked (SAN) storage
Alternative distributed filesystems
Cloud object storage
Emerging options
26
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Other distributed filesystems
Kernel-based tightly coupled distributed file system
Kernel-based, i.e., no special access libraries, looks like a normal
local file system
These filesystems have existed for years in high performance
computing, scale-out NAS servers, and other scale-out computing
environments
Many commercial and research examples
Not originally designed for Hadoop like HDFS
Location awareness is part of the file system – no NameNode
Works better if functionality is “exposed” to Hadoop
Compute nodes may or may not have local storage
Compute nodes are “part of the storage cluster,” but may be
diskless – i.e., equal access to files and global namespace
Can tie the filesystem’s location awareness into task tracker to reduce remote
storage access
Remote storage is accessed using a filesystem specific inter-node
protocol
Single network hop due to filesystem’s location awareness
27
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Tightly coupled DFS for Hadoop
General purpose shared file system
Implemented in the kernel, single namespace, compatible with
most applications (no special library or language)
Data is distributed across local storage node disks
Architecturally like HDFS
Can utilize same disk options as HDFS
– Including shared nothing DAS
– SAN storage
Some can also support “shared SAN” storage where raw volumes can be
accessed by multiple nodes
– Failover model – where only one node actively uses a volume, other can take
over after failure
– Multiple initiator model – where multiple nodes actively use a volume
Shared nothing option has similar cost/performance to HDFS on
DAS
28
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Distributed FS – local disks
29
Compute
nodes are
part of the
DFS, data
spread
across
nodes
…
Distributed
FS inter-node
Protocol
…
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Distributed FS – remote disks
30
Compute nodes are
distributed FS clients
Scale out nodes are
distributed FS servers
Distributed FS
inter-node
Protocol
…
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Remote DFS Write Operation
31
Note that these diagrams are intended
to be generic, and leave out much of the
detail of any specific DFS
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Local DFS Write Operation
32
Note that these diagrams are
intended to be generic, and leave out
much of the detail of any specific DFS
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Remote DFS Read Operation
33
Note that these diagrams are intended
to be generic, and leave out much of the
detail of any specific DFS
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Local DFS Read Operation
34
Note that these diagrams are
intended to be generic, and leave out
much of the detail of any specific DFS
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Tightly coupled DFS for Hadoop
Pros
Shared data access, any node can access any data like it is local
POSIX compatible, works for non-Hadoop apps just like a local file system
Centralized management and administration
No NameNode, may have a better block mapping mechanism
Compute in-place, same copy can be served via NFS/CIFS
Many of the performance benefits
Cons
HDFS is highly optimized for Hadoop, unlikely to get same optimization for a
general purpose DFS
Large file striping is not regular, based on compute distribution
Copies are simultaneously readable
Strict POSIX compliance leads to unnecessary serialization
Hadoop assumes multiple-access to files, however, accesses are on block boundaries and don’t
overlap
Need to relax POSIX compliance for large files, or just stick with many smaller files
Some DFS’s have scaling limitations that are worse than HDFS, not designed for
“thousands” of nodes
35
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Overview
Introduction
What is Hadoop
What is MapReduce
How does Hadoop use storage
Distributed filesystem concepts
Storage options
Native Hadoop – HDFS
On direct attached storage
On networked (SAN) storage
Alternative distributed filesystems
Cloud object storage
Emerging options
36
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Cloud Object Storage for Hadoop
Uses a REST API like CDMI, S3, or Swift
HTTP based protocol, data is remote
Objects are write once, read many, streaming access
Objects have some stored metadata
Data is stored in cloud object storage
Could be local or across internet
Cheap, high volume
Systems utilize triple redundancy or erasure coding, for reliability
Often uses Hadoop “S3” connector
37
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Hadoop on Object Storage
38
…
Cloud Object Storage
REST/HTTP
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Hadoop Write on Object Storage
39
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Hadoop Read on Object Storage
40
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Object Storage for Hadoop
Pros
Low cost, high volume, reliable storage
Good location for infrequently used “WORM” data
Public cloud options
Scalable storage
Data can easily be shared between Hadoop and other applications
Cons
All data is remote – performance
No data/compute colocation
Limited capabilities, though a good match for Hadoop
High disk cost if triple redundancy is used
Good choice for large infrequently accessed WORM items
that may need to be accessed by non-Hadoop jobs as well
41
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Emerging Options
New options are emerging in the storage research
community
Caching – from enterprise storage
Mirror to enterprise storage, NAS/NFS
SSD
Improvements to HDFS
HA options
Access to non-Hadoop jobs
Bottom line
The limitations of HDFS are known
Work is ongoing to improve Hadoop storage options
42
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Summary
Hadoop provides a scalable fault-tolerant environment for
analyzing unstructured and structured information
The default way to store data for Hadoop is HDFS on local
direct attached disks
Alternatives to this architecture include SAN array storage,
tightly-coupled general purpose DFS, and cloud object
storage
They can provide some significant advantages
However, they aren’t without their downsides, its hard to beat a
filesystem designed specifically for Hadoop
Which one is best for you?
Depends on what is most important – cost, manageability,
compatibility with existing infrastructure, performance, scale, …
43
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
Attribution & Feedback
44
Please send any questions or comments regarding this SNIA
Tutorial to tracktutorials@snia.org
The SNIA Education Committee thanks the following
individuals for their contributions to this Tutorial.
Authorship History
Sam Fineberg,August 2012
Updates:
Sam Fineberg, February 2013
Sam Fineberg, March 2013
Additional Contributors
Rob Peglar
Joseph White
Chris Santilli

More Related Content

What's hot

Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1Giovanna Roda
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?cneudecker
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview EMC
 
The Hadoop Ecosystem Table
The Hadoop Ecosystem TableThe Hadoop Ecosystem Table
The Hadoop Ecosystem TableDr. Volkan OBAN
 
Hadoop tutorial-pdf.pdf
Hadoop tutorial-pdf.pdfHadoop tutorial-pdf.pdf
Hadoop tutorial-pdf.pdfSheetal Jain
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Hw09 Clouderas Distribution For Hadoop
Hw09   Clouderas Distribution For HadoopHw09   Clouderas Distribution For Hadoop
Hw09 Clouderas Distribution For HadoopCloudera, Inc.
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPTAnand Pandey
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product pageJanu Jahnavi
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopEdureka!
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterEdureka!
 

What's hot (20)

Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview
 
The Hadoop Ecosystem Table
The Hadoop Ecosystem TableThe Hadoop Ecosystem Table
The Hadoop Ecosystem Table
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop tutorial-pdf.pdf
Hadoop tutorial-pdf.pdfHadoop tutorial-pdf.pdf
Hadoop tutorial-pdf.pdf
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Hadoop
HadoopHadoop
Hadoop
 
Hw09 Clouderas Distribution For Hadoop
Hw09   Clouderas Distribution For HadoopHw09   Clouderas Distribution For Hadoop
Hw09 Clouderas Distribution For Hadoop
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPT
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product page
 
Cloudera
ClouderaCloudera
Cloudera
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Hadoop presentation
Hadoop presentationHadoop presentation
Hadoop presentation
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop Cluster
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 

Viewers also liked

C12 group build lead impact
C12 group build lead impactC12 group build lead impact
C12 group build lead impactDavid Gullotti
 
Enterprise Storage Solutions for Overcoming Big Data and Analytics Challenges
Enterprise Storage Solutions for Overcoming Big Data and Analytics ChallengesEnterprise Storage Solutions for Overcoming Big Data and Analytics Challenges
Enterprise Storage Solutions for Overcoming Big Data and Analytics ChallengesINFINIDAT
 
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisStorage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisSameer Tiwari
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | EdurekaHadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | EdurekaEdureka!
 
An unsupervised framework for effective indexing of BigData
An unsupervised framework for effective indexing of BigDataAn unsupervised framework for effective indexing of BigData
An unsupervised framework for effective indexing of BigDataRamakrishna Prasad Sakhamuri
 

Viewers also liked (8)

C12 group build lead impact
C12 group build lead impactC12 group build lead impact
C12 group build lead impact
 
Enterprise Storage Solutions for Overcoming Big Data and Analytics Challenges
Enterprise Storage Solutions for Overcoming Big Data and Analytics ChallengesEnterprise Storage Solutions for Overcoming Big Data and Analytics Challenges
Enterprise Storage Solutions for Overcoming Big Data and Analytics Challenges
 
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisStorage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Challenges of Big Data Research
Challenges of Big Data ResearchChallenges of Big Data Research
Challenges of Big Data Research
 
HDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and SupportabilityHDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and Supportability
 
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | EdurekaHadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
 
An unsupervised framework for effective indexing of BigData
An unsupervised framework for effective indexing of BigDataAn unsupervised framework for effective indexing of BigData
An unsupervised framework for effective indexing of BigData
 

Similar to Sam fineberg big_data_hadoop_storage_options_3v9-1

Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
What is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaWhat is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaEdureka!
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Hadoop World 2010: Productionizing Hadoop: Lessons Learned
Hadoop World 2010: Productionizing Hadoop: Lessons LearnedHadoop World 2010: Productionizing Hadoop: Lessons Learned
Hadoop World 2010: Productionizing Hadoop: Lessons LearnedCloudera, Inc.
 
OPERATING SYSTEM .pptx
OPERATING SYSTEM .pptxOPERATING SYSTEM .pptx
OPERATING SYSTEM .pptxAltafKhadim
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahoMartin Ferguson
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabsSiva Sankar
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFSKavyaGo
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 

Similar to Sam fineberg big_data_hadoop_storage_options_3v9-1 (20)

Hadoop-2022.pptx
Hadoop-2022.pptxHadoop-2022.pptx
Hadoop-2022.pptx
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
HDFS
HDFSHDFS
HDFS
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop .pdf
Hadoop .pdfHadoop .pdf
Hadoop .pdf
 
What is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaWhat is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | Edureka
 
Hdfs design
Hdfs designHdfs design
Hdfs design
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Hadoop World 2010: Productionizing Hadoop: Lessons Learned
Hadoop World 2010: Productionizing Hadoop: Lessons LearnedHadoop World 2010: Productionizing Hadoop: Lessons Learned
Hadoop World 2010: Productionizing Hadoop: Lessons Learned
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
OPERATING SYSTEM .pptx
OPERATING SYSTEM .pptxOPERATING SYSTEM .pptx
OPERATING SYSTEM .pptx
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentaho
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabs
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFS
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 

Sam fineberg big_data_hadoop_storage_options_3v9-1

  • 1. PRESENTATION TITLE GOES HEREBig Data Storage Options for Hadoop Sam Fineberg/HP Storage Division
  • 2. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations and literature under the following conditions: Any slide or slides used must be reproduced in their entirety without modification The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations. This presentation is a project of the SNIA Education Committee. Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney. The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK. 2
  • 3. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Abstract Big Data Storage Options for Hadoop The Hadoop system was developed to enable the transformation and analysis of vast amounts of structured and unstructured information. It does this by implementing an algorithm called MapReduce across compute clusters that may consist of hundreds or even thousands of nodes. In this presentation Hadoop will be looked at from a storage perspective. The tutorial will describe the key aspects of Hadoop storage, the built-in Hadoop file system (HDFS), and some other options for Hadoop storage that exist in the commercial and open source communities. 3
  • 4. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Overview Introduction What is Hadoop What is MapReduce How does Hadoop use storage Distributed filesystem concepts Storage options Native Hadoop – HDFS On direct attached storage On networked (SAN) storage Alternative distributed filesystems Cloud object storage Emerging options 4
  • 5. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Overview Introduction What is Hadoop What is MapReduce How does Hadoop use storage Distributed filesystem concepts Storage options Native Hadoop – HDFS On direct attached storage On networked (SAN) storage Alternative distributed filesystems Cloud object storage Emerging options 5
  • 6. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. What is Hadoop? A scalable fault-tolerant distributed system for data storage and processing Core Hadoop has two main components MapReduce: fault-tolerant distributed processing Programming model for processing sets of data Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s) Hadoop Distributed File System (HDFS): high-bandwidth clustered storage Distributed file system optimized for large files Operates on unstructured and structured data A large and active ecosystem Written in Java Open source under the friendly Apache License http://hadoop.apache.org 6
  • 7. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. What is MapReduce? A method for distributing a task across multiple nodes Each node processes data stored on that node Consists of two developer-created phases 1. Map 2. Reduce In between Map and Reduce is the Shuffle and Sort 7
  • 8. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. MapReduce 8 © Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html
  • 9. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. What was the max temperature for the last century? MapReduce Operation 9
  • 10. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Key MapReduce Terminology Concepts A user runs a client program (typically a Java application) on a client computer The client program submits a job to Hadoop The job is sent to the JobTracker process on the Master Node Each Slave Node runs a process called the TaskTracker The JobTracker instructs TaskTrackers to run and monitor tasks A task attempt is an instance of a task running on a slave node There will be at least as many task attempts as there are tasks which need to be performed 10
  • 11. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. MapReduce in Hadoop 11 © Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html Task Tracker Input (HDFS) Output (HDFS) Mapper Reducer Worker=Tasks
  • 12. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. MapReduce: Basic Concepts Each Mapper processes single input split from HDFS Hadoop passes one record at a time to the developer’s Map code Each record has a key and a value Intermediate data written by the Mapper to local disk (not HDFS) on each of the individual cluster nodes intermediate data is reliable or globally accessible During shuffle and sort phase, all values associated with same intermediate key are transferred to same Reducer Reducer is passed each key and a list of all its values Output from Reducers is written to HDFS 12
  • 13. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. What is a Distributed File System? A distributed file system is a file system that allows access to files from multiple hosts across a network A network filesystem (NFS/CIFS) is a type of distributed file system – more tuned for file sharing than distributed computation Distributed computing applications, like Hadoop, utilize a tightly coupled distributed file system Tightly coupled distributed filesystems Provide a single global namespace across all nodes Support multiple initiators, multiple disk nodes, multiple access to files – file parallelism Examples include HDFS, GlusterFS, pNFS, as well as many commercial and research systems 13
  • 14. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Overview Introduction What is Hadoop What is MapReduce How does Hadoop use storage Distributed filesystem concepts Storage options Native Hadoop – HDFS On direct attached storage On networked (SAN) storage Alternative distributed filesystems Cloud object storage Emerging options 14
  • 15. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Hadoop Distributed File System - HDFS Architecture Java application, not deeply integrated with the server OS Layered on top of a standard FS (e.g., ext, xfs, etc.) Must use Hadoop or a special library to access HDFS files Shared-nothing, all nodes have direct attached disks Write once filesystem – must copy a file to modify it HDFS basics Data is organized into files & directories Files are divided into 64-128MB blocks, distributed across nodes Block placement is handled by the “NameNode” Placement coordinated with job tracker = writes always co-located, reads co- located with computation whenever possible Blocks replicated to handle failure, replica blocks can be used by compute tasks Checksums used to ensure data integrity Replication: one and only strategy for error handling, recovery and fault tolerance Self Healing Makes multiple copies (typically 3) 15
  • 16. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS on local DAS A Hadoop cluster consisting of many nodes, each of which has local direct attached storage (DAS) Disks are running a standard file system (e.g., ext. xfs, etc.) HDFS blocks are stored as files in a special directory Disks attached directly, for example, with SAS or SATA No storage is shared, disks only attach to a single node The most common use case for Hadoop Original design point for Hadoop/HDFS Can work with cheap unreliable hardware Some very large systems utilize this model 16
  • 17. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS on local DAS 17 … Compute nodes are part of HDFS, data spread across nodes HDFS Protocol …
  • 18. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS File Write Operation 18 Image source: Hadoop,The Definitive Guide Tom White, O’Reilly 3-way replication
  • 19. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS File Read Operation 19 Image source: Hadoop,The Definitive Guide Tom White, O’Reilly
  • 20. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS on local DAS - Pros and Cons Pros Writes are highly parallel Large files are broken into many parts, distributed across the cluster Three copies of any file block, one written local, two remote Not a simple round-robin scheme, tuned for Hadoop jobs Job tracker attempts to make reads local If possible, tasks scheduled in same node as the needed file segment Duplicate file segments are also readable, can be used for tasks too Cons Not a replacement for general purpose storage Not a kernel-based POSIX filesystem Incompatible with standard applications and utilities (but future versions of Hadoop are adding more other application models) High replication cost compared with RAID/shared disk The NameNode keeps track of data location SPOF - location data is critical and must be protected Scalability bottleneck (everything has to be in memory) Improvements to NameNode are in the works 20
  • 21. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Other HDFS storage options HDFS on Storage Area Network (SAN) attached storage A lot like DAS, Disks are logical volumes in storage array(s), accessed across a SAN HDFS doesn’t know the difference – Still appears like a locally attached disk SAN attached arrays aren’t the same as DAS Array has its own cache, redundancy, replication, etc. Any node on the SAN can access any array volume So a new node can be assigned to a failed node’s data 21
  • 22. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS with SAN Storage 22 Storage Arrays Compute nodes Hadoop Cluster … iSCSI or FC SAN
  • 23. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS File Write Operation 23 Array can provide redundancy, no need to replicate data across data nodes Array Replication
  • 24. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. HDFS File Read Operation 24 Array redundancy, means only a single source for data
  • 25. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. SAN for Hadoop Storage Instead of storing data on direct attached local disks, data is in one or more arrays attached to data nodes through a SAN Looks like local storage to data nodes Hadoop still utilizes HDFS Pros All the normal advantages of arrays RAID, centralized caching, thin provisioning, other advanced array features Centralized management, easy redistribution of storage Retains advantages of HDFS (as long as array is not over-utilized) Easy failover when compute node dies, can eliminate or reduce 3-way replication Cons Cost? It depends Unless if multiple arrays are used, scale is limited And with multiple arrays, management and cost advantages are reduced Still have HDFS complexity and manageability issues 25
  • 26. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Overview Introduction What is Hadoop What is MapReduce How does Hadoop use storage Distributed filesystem concepts Storage options Native Hadoop – HDFS On direct attached storage On networked (SAN) storage Alternative distributed filesystems Cloud object storage Emerging options 26
  • 27. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Other distributed filesystems Kernel-based tightly coupled distributed file system Kernel-based, i.e., no special access libraries, looks like a normal local file system These filesystems have existed for years in high performance computing, scale-out NAS servers, and other scale-out computing environments Many commercial and research examples Not originally designed for Hadoop like HDFS Location awareness is part of the file system – no NameNode Works better if functionality is “exposed” to Hadoop Compute nodes may or may not have local storage Compute nodes are “part of the storage cluster,” but may be diskless – i.e., equal access to files and global namespace Can tie the filesystem’s location awareness into task tracker to reduce remote storage access Remote storage is accessed using a filesystem specific inter-node protocol Single network hop due to filesystem’s location awareness 27
  • 28. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Tightly coupled DFS for Hadoop General purpose shared file system Implemented in the kernel, single namespace, compatible with most applications (no special library or language) Data is distributed across local storage node disks Architecturally like HDFS Can utilize same disk options as HDFS – Including shared nothing DAS – SAN storage Some can also support “shared SAN” storage where raw volumes can be accessed by multiple nodes – Failover model – where only one node actively uses a volume, other can take over after failure – Multiple initiator model – where multiple nodes actively use a volume Shared nothing option has similar cost/performance to HDFS on DAS 28
  • 29. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Distributed FS – local disks 29 Compute nodes are part of the DFS, data spread across nodes … Distributed FS inter-node Protocol …
  • 30. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Distributed FS – remote disks 30 Compute nodes are distributed FS clients Scale out nodes are distributed FS servers Distributed FS inter-node Protocol …
  • 31. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Remote DFS Write Operation 31 Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS
  • 32. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Local DFS Write Operation 32 Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS
  • 33. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Remote DFS Read Operation 33 Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS
  • 34. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Local DFS Read Operation 34 Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS
  • 35. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Tightly coupled DFS for Hadoop Pros Shared data access, any node can access any data like it is local POSIX compatible, works for non-Hadoop apps just like a local file system Centralized management and administration No NameNode, may have a better block mapping mechanism Compute in-place, same copy can be served via NFS/CIFS Many of the performance benefits Cons HDFS is highly optimized for Hadoop, unlikely to get same optimization for a general purpose DFS Large file striping is not regular, based on compute distribution Copies are simultaneously readable Strict POSIX compliance leads to unnecessary serialization Hadoop assumes multiple-access to files, however, accesses are on block boundaries and don’t overlap Need to relax POSIX compliance for large files, or just stick with many smaller files Some DFS’s have scaling limitations that are worse than HDFS, not designed for “thousands” of nodes 35
  • 36. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Overview Introduction What is Hadoop What is MapReduce How does Hadoop use storage Distributed filesystem concepts Storage options Native Hadoop – HDFS On direct attached storage On networked (SAN) storage Alternative distributed filesystems Cloud object storage Emerging options 36
  • 37. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Cloud Object Storage for Hadoop Uses a REST API like CDMI, S3, or Swift HTTP based protocol, data is remote Objects are write once, read many, streaming access Objects have some stored metadata Data is stored in cloud object storage Could be local or across internet Cheap, high volume Systems utilize triple redundancy or erasure coding, for reliability Often uses Hadoop “S3” connector 37
  • 38. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Hadoop on Object Storage 38 … Cloud Object Storage REST/HTTP
  • 39. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Hadoop Write on Object Storage 39
  • 40. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Hadoop Read on Object Storage 40
  • 41. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Object Storage for Hadoop Pros Low cost, high volume, reliable storage Good location for infrequently used “WORM” data Public cloud options Scalable storage Data can easily be shared between Hadoop and other applications Cons All data is remote – performance No data/compute colocation Limited capabilities, though a good match for Hadoop High disk cost if triple redundancy is used Good choice for large infrequently accessed WORM items that may need to be accessed by non-Hadoop jobs as well 41
  • 42. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Emerging Options New options are emerging in the storage research community Caching – from enterprise storage Mirror to enterprise storage, NAS/NFS SSD Improvements to HDFS HA options Access to non-Hadoop jobs Bottom line The limitations of HDFS are known Work is ongoing to improve Hadoop storage options 42
  • 43. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Summary Hadoop provides a scalable fault-tolerant environment for analyzing unstructured and structured information The default way to store data for Hadoop is HDFS on local direct attached disks Alternatives to this architecture include SAN array storage, tightly-coupled general purpose DFS, and cloud object storage They can provide some significant advantages However, they aren’t without their downsides, its hard to beat a filesystem designed specifically for Hadoop Which one is best for you? Depends on what is most important – cost, manageability, compatibility with existing infrastructure, performance, scale, … 43
  • 44. Big Data Storage Options for Hadoop © 2013 Storage Networking Industry Association. All Rights Reserved. Attribution & Feedback 44 Please send any questions or comments regarding this SNIA Tutorial to tracktutorials@snia.org The SNIA Education Committee thanks the following individuals for their contributions to this Tutorial. Authorship History Sam Fineberg,August 2012 Updates: Sam Fineberg, February 2013 Sam Fineberg, March 2013 Additional Contributors Rob Peglar Joseph White Chris Santilli