White Paper:
Big Data SSD Architecture
Digging Deep to Discover Where SSD Performance Pays Off

Location: SSDs in Big Data Architecture
Big data is much more than
just “lots of data”. State-of-the-
art applications gather data
from many different sources in
varying formats with complex
relationships. Analytics yield
insight into events, trends, and
behaviors, and may adapt in
real-time depending on specific
findings. These data sets can
indeed grow extremely large,
with captured sequences kept
for future analysis.
Traditional database tools
usually handle data with pre-
determined structures and
fixed relationships, typically on
a scale-up server and storage
cluster. Often, these tools link
with some transactional process,
providing front-end services for
users while capturing back-end
data. Larger deployments may
use a data warehouse, which
consolidates data from various
applications into one managed,
reliable repository of information
for enterprise decision-making.
New distributed, flexible, and
scalable tools for multi-sourced
and poly-structured big data
typically employ large numbers of less expensive networked processing and storage systems.
Rather than forcing all data into
a massive warehouse before
examination, distributed big data
processing nodes can relieve
architectural stress points with
localized processing. For larger
tasks, local nodes can join in
global processing, increasing
parallelism and supporting
chained jobs.
Which storage technology, hard
disk drives (HDDs) or solid-state
drives (SSDs), excels in big data
architecture? SSDs clearly win
on speed, offering both higher
sequential read/write speeds and
higher input/output operations
per second (IOPS). A naïve
approach might replace all HDDs
with SSDs everywhere in a big
data system. However, deploying
SSDs in hundreds or thousands
of nodes could add up to a
very expensive proposition. A
better approach identifies critical locations where SSDs enable immediate cost-per-performance wins, and integrates HDDs in less stressful roles to save on system costs.
This white paper looks at the basics of big data tools, reviews two performance wins with SSDs in a well-known framework, and presents some examples of emerging opportunities on the leading edge of big data technology.
[Figure: The 5 Vs of Big Data. Volume: terabytes; records/arch; transactions; tables, files. Variety: structured; unstructured; multi-factor; probabilistic. Velocity: batch; real/near-time; processes; streams. Veracity: trustworthiness; authenticity; origin, reputation; availability; accountability. Value: statistical; events; correlations; hypothetical. Adapted from “Defining the Big Data Architecture Framework,” Demchenko, University of Amsterdam, July 2013.]
Basics of Big Data Tools

Finding those precise locations where SSDs speed up big data operations is a hot topic. Researchers are digging deep inside architectures, looking for spots where HDDs are overwhelmed and slowing things down. Before we introduce findings from some of these studies, a quick overview of big data tools and terminology is in order.
Apache Hadoop is one of the best-known tools associated with big data architecture. Its framework provides a straightforward way to distribute and manage data on multiple networked computers – nodes – and to schedule processing resources for applications. Reaching its 1.0 release at the end of 2011, Hadoop has gained massive popularity in search engine, social networking, and cloud computing infrastructure.1
Hadoop leverages parallelism for
both storage and computational
resources. Instead of storing a
very large file in a single location,
Hadoop spreads files across
many nodes. This effectively
multiplies available file system
bandwidth. Nodes also contain
processing elements, allowing
scheduling of jobs and tasks
across multiple heterogeneous
machines simultaneously. The
architecture provides fault
resilience; data is replicated on
multiple Hadoop nodes, and if a
node is down, another replaces
it. Adding nodes enhances
scalability, with large instances
achieving hundreds of petabytes
of storage.
Inside Hadoop are two primary
engines: HDFS (Hadoop
Distributed File System) and
MapReduce. Both run on the
same nodes concurrently.
HDFS manages files across a
cluster containing a NameNode
tracking metadata and a number
of DataNodes containing the
actual data. HDFS maintains
data and rack (physical location)
awareness between jobs and
tasks, minimizing network
transfers. It rebalances data
within a cluster automatically,
and supports snapshots,
upgrades, and rollback for
maintenance. Data can be
accessed via a Java API, web
browsers and a REST API, or
other languages via the Thrift
API. Applications using HDFS
include the Apache HBase non-
relational distributed database
modeled after Google Bigtable,
and the Apache Hive data
warehouse manager. Hadoop
is open to other distributed
file systems with varying
functionality.2
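As an illustration of the Java API mentioned above, here is a minimal sketch (the NameNode address and paths are hypothetical) that writes a file through HDFS, reads it back, and asks which DataNodes hold its blocks:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/bigdata/events/sample.txt");

        // Write a small file; HDFS replicates its blocks across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello big data");
        }

        // Read it back through the same API.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        // Data/rack awareness: list the hosts holding each block of the file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```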
MapReduce is an engine for
parallel, distributed processing.
A job separates a large input
data set into smaller chunks,
with data represented in (key,
value) pairs. Map tasks process
these chunks in parallel, then
an intermediate shuffle function
sorts the results passed to
Reduce tasks for mathematical
operations. MapReduce is
by nature disk-based, batch
oriented, and non-real-time.3
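To make the (key, value) flow concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (class names are illustrative): map tasks emit intermediate pairs, the shuffle groups them by key, and reduce tasks aggregate the values.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for each word in an input split.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate (key, value) pair
                }
            }
        }
    }

    // Reduce: after the shuffle groups values by key, sum the counts.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```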
[Figure: Hadoop Multi-Node Cluster. The master node runs the job tracker and name node (plus a task tracker and data node of its own), while slave nodes run a task tracker and data node, spanning the MapReduce layer and the HDFS layer. Adapted from “Running Hadoop on Ubuntu Linux (Multi-Node Cluster),” Michael Noll.]
Locating Quick Wins with SSDs

Most computing jobs fall into one of two categories: compute-intensive or I/O-intensive. Studying the MapReduce workload, a team from Cloudera4 deployed one SSD versus several HDDs to approximate the same theoretical aggregate sequential read/write bandwidth, isolating performance differences.
Hadoop clusters with all SSDs outperformed HDD clusters by as much as 70% under some workloads in the Cloudera study. SSDs also achieve about twice the actual sequential I/O size, with lower latency supporting more task scheduling. For some workloads, the difference is smaller: chopping up large files means sequential access is not the limiting factor – exactly what Hadoop intends to achieve. However, when intermediate data grows beyond what node memory can hold, an important change occurs:
“In practice, independent
of I/O concurrency, there
is negligible disk I/O for
intermediate data that fits
in memory, while a large
amount of intermediate
data leads to severe load
on the disks.”
—The Truth About MapReduce Performance on SSDs, 2014
MapReduce actually involves three phases: Map, shuffle, and Reduce. The intermediate shuffle operation stores its results in a single file on local disk. Targeting shuffle results to an SSD using the mapred.local.dir parameter increases performance dramatically, and splitting the SSD into multiple data directories further improves sequential read/write task scheduling. Hybrid configurations with HDDs and a properly configured SSD may be the most cost-effective quick win for Hadoop performance.
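A minimal sketch of that change, set per node in mapred-site.xml (the SSD mount points below are hypothetical); listing several directories on the SSD spreads intermediate files across more paths:

```xml
<!-- mapred-site.xml (per node): place intermediate shuffle data on the SSD -->
<property>
  <name>mapred.local.dir</name>
  <!-- Comma-separated list; several directories on one SSD increase parallelism -->
  <value>/mnt/ssd/mapred/local1,/mnt/ssd/mapred/local2,/mnt/ssd/mapred/local3</value>
</property>
```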
Hadoop also presents compute-
intensive opportunities. To quote
Apache, “moving computation is
cheaper than moving data.”
A Microsoft study suggests that
most analytics jobs do not use
huge data sets, citing a data point
from Facebook that 90% of jobs
have input sizes under 100 GB.5
Their suggestion is to “scale-up,”
but what they describe is actually
a converged modular server with
32 cores and paired SSD storage,
bypassing HDFS for local access.
One point Microsoft makes
strongly: “… Without SSDs,
many Hadoop jobs become
disk-bound.”
New PCIe SSDs using the latest
NVMe protocol will be ideal
to support compute-intensive
workloads and keep the network
free. This is vital for use cases
such as streaming content
and incoming IoT sensor data.
Removing storage latency also
allows more Hadoop jobs to be
scheduled. Some applications,
such as Bayesian classification
used in machine learning, chain
Hadoop jobs.
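As a sketch of such chaining with the standard Hadoop Job API (paths and job names are hypothetical, and mapper/reducer classes are omitted for brevity), the output directory of one job simply becomes the input of the next:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("/bigdata/raw");           // hypothetical paths
        Path intermediate = new Path("/bigdata/stage1");
        Path output = new Path("/bigdata/result");

        // Job 1 (e.g., feature extraction); its output feeds the next job.
        Job first = Job.getInstance(conf, "stage-1");
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) System.exit(1);

        // Job 2 (e.g., classification) reads the intermediate results.
        Job second = Job.getInstance(conf, "stage-2");
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```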
By concentrating processing on
bigger multicore servers with
all-SSD storage for some jobs,
Microsoft finds that overall Hadoop
cluster performance increases.
Their suggested architecture is
a mix of bigger all-SSD nodes
and smaller, less expensive
nodes (perhaps with hybrid drive
configurations) combined in a
scale-out approach, handling
a range of Hadoop job sizes
most efficiently.
[Figure: MapReduce Data Flow. Input splits from HDFS feed parallel map tasks; map output is sorted, copied, and merged for the reduce tasks, whose output parts are written back to HDFS with replication.]
More Prime Spots on Faster Routes

Open source projects attract innovation, and big data is no exception. The basics of Hadoop, HDFS, and MapReduce are established, and researchers are turning to other sophisticated techniques. In many cases, SSDs still hold the key to performance improvement.
Virtualization can consolidate
multiple nodes on scale-up
servers, but inserts significant
overhead. Samsung researchers
found that Hadoop instances
running over Xen virtualization
suffer from 50% to 83%
degradation in sort-based
benchmarks using HDDs.
The same configurations with
SSDs fell only 10% to 21%
running on a virtualized system.
Further analysis reveals why
– virtualization decreases
I/O block size and increases
the number of requests, an
environment where SSDs
perform better.6
At Facebook, using HBase
over HDFS for the message
stack simplifies development,
but introduces performance
issues. Messages are small
files; 90% are less than 15 MB.
Compounding the problem, I/O is random, and “hot” (frequently used) data is often too large for RAM. Part of their solution
is a 60 GB SSD configured as
HBase cache that more than
triples performance through
reduced latency.7
Several efforts are after “in-
memory” benefits. Aerospike
has created a hybrid RAM/
SSD architecture for a NoSQL
database. Smaller files are in
RAM, accessed with raw read/
write bypassing the Linux file
system, while larger files are
on low-latency SSDs. Indices
also reside in RAM. Data on
the SSDs is contiguous and
optimized for “one read per
record.”8
Teams at Virginia Tech
redesigned HDFS for tiered
storage clusters.9
Their study
on hatS (heterogeneity-aware
tiered storage) shows the value
of aiming more requests at PCIe
SSDs. All 27 worker nodes have
a SATA HDD (3 racks, 9 nodes
each), nine nodes have a SATA
SSD (3 per rack), and three
nodes have a PCIe SSD (1 in
each rack). Running a synthetic Facebook benchmark, hatS increases the average HDFS I/O rate by 36% and reduces execution time by 26%.
Spark, a new in-memory
architecture that could
supplant MapReduce, allows
programs to load resilient
distributed datasets (RDDs)
into cluster memory.10
Spark
redefines shuffle algorithms,
with each processor core (in
earlier releases, each map
task) creating a number of
shuffle result files matching
the number of reduce tasks.
Contrary to popular belief,
Spark RDDs can spill to disk
for persistence. Benchmarks by Databricks, deploying Spark on 206 Amazon EC2 i2.8xlarge nodes with SSD-based instance storage, recently broke sort records, running three times faster with one-tenth the number of machines of an equivalent Hadoop implementation.
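Spark’s Java API makes that spill behavior explicit. A minimal sketch with a hypothetical input path: persisting an RDD at the MEMORY_AND_DISK level keeps partitions in RAM where possible and spills the remainder to local disk, which is where SSDs keep the penalty small.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class RddSpillExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-spill-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load a (possibly larger-than-memory) dataset as an RDD.
        JavaRDD<String> lines = sc.textFile("hdfs:///bigdata/events/*.log");

        // MEMORY_AND_DISK keeps partitions in RAM when possible and
        // spills the remainder to local disk, ideally an SSD.
        lines.persist(StorageLevel.MEMORY_AND_DISK());

        long total = lines.count();   // first action materializes and caches the RDD
        long errors = lines.filter(s -> s.contains("ERROR")).count();

        System.out.println(total + " lines, " + errors + " errors");
        sc.stop();
    }
}
```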
MIT researchers find that if in-memory servers go to disk as little as 5% of the time, the overhead can nullify gains from RAM-based nodes.11
Their experiment
uses Xilinx FPGAs with ARM
processing cores to pre-process
Samsung SSD access in an
interconnect fabric. A small
cluster with 20 servers achieves
near RAM-based speeds on
image search, PageRank,
and Memcached algorithms.
This proves the value of tightly
coupled nodes combining
processing with an SSD.
Faster Now, Even Faster Soon

These studies all show that choosing the right locations for SSDs in big data architecture produces significant cost-per-performance gains. Existing Hadoop installations speed up instantly with a simple change targeting MapReduce shuffle results to an SSD instead of an HDD. New Hadoop installations can take advantage of more SSDs within asymmetric clusters: a scale-up server handles the most intense jobs, augmenting a network of smaller machines, and faster nodes pair localized pre-processing with SSDs.
The big data community
continues to press the
boundaries, looking for
architecture enhancements.
Hybrid installations with
enhanced file systems are
showing promise. These
approaches are capitalizing
on the ability of SSDs to
service more requests quickly,
which translates to more job
scheduling and improved cluster
performance. A noteworthy misconception is that “in-memory” means no disk. Newer big data architectures leverage RAM performance, but when they do go to disk, SSDs are essential to keeping overall cluster performance high.
One powerful observation is that big data does not necessarily operate on large, sequential
files. When a cluster does its
job, input files have been parsed
into smaller pieces, and disk
operations are more random
and multi-threaded – an ideal
use case where SSDs excel.
As V-NAND flash and PCIe
interface technology mature,
the cost-per-performance
metric shifts in favor of SSDs.
High-performance SSDs will
continue to move from just
critical locations to broader use
throughout big data clusters.
Learn more: samsung.com/enterprisessd | 1-866-SAM4BIZ
Follow us: youtube.com/samsungbizusa | @SamsungBizUSA | insights.samsung.com
©2016 Samsung Electronics America, Inc. All rights reserved. Samsung is a registered trademark of Samsung Electronics Co., Ltd.
All products, logos and brand names are trademarks or registered trademarks of their respective companies. This white paper is for
informational purposes only. Samsung makes no warranties, express or implied, in this white paper.
WHP-SSD-BIGDATA-JAN16J
1. Apache Hadoop project
2. Apache Hadoop documentation, HDFS Architecture Guide
3. Apache Hadoop documentation, MapReduce Tutorial
4. “The Truth About MapReduce Performance on SSDs,” Karthik Kambatla and Yanpei Chen, Cloudera Inc. and Purdue University, presented at LISA14, sponsored by USENIX, November 2014
5. “Scale-up vs Scale-out for Hadoop: Time to rethink?,” Appuswamy et al., Microsoft Research, presented at SoCC ’13, October 2013
6. “Performance Implications of SSDs in Virtualized Hadoop Clusters,” Ahn et al., Samsung Electronics, presented at IEEE BigData Congress, June 2014
7. “Analysis of HDFS under HBase: A Facebook Messages Case Study,” Harter et al., Facebook Inc. and University of Wisconsin, Madison, presented at FAST ’14, sponsored by USENIX, February 2014
8. Aerospike web site
9. “hatS: A Heterogeneity-Aware Tiered Storage for Hadoop,” Krish et al., Virginia Tech, presented at IEEE/ACM CCGrid 2014, May 2014
10. “Spark the fastest open source engine for sorting a petabyte,” Databricks, November 2014
11. “Cutting cost and power consumption for big data,” Larry Hardesty, Massachusetts Institute of Technology, July 10, 2015
Samsung Enterprise SSD Portfolio

PM863 Series Data Center SSDs
• 3-bit MLC NAND
• Designed for read-intensive applications
• Built-in Power Loss Protection
• SATA 6 Gb/s Interface
• Form-factors: 2.5”

SM863 Series Data Center SSDs
• 2-bit MLC NAND
• Designed for write-intensive applications
• Built-in Power Loss Protection
• SATA 6 Gb/s Interface
• Form-factors: 2.5”

Samsung Workstation SSD Portfolio

950 Pro Series Client PC SSDs
• 2-bit MLC NAND
• Designed for high-end PCs
• PCIe Interface
• NVMe protocol
• Form-factors: M.2

850 Pro Series Client PC SSDs
• 2-bit MLC NAND
• SATA 6 Gb/s Interface
• Form-factors: 2.5”
More Related Content

What's hot

Whitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_FinalWhitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_FinalMichele Hunter
 
Make sense of important data faster with AWS EC2 M6i instances
Make sense of important data faster with AWS EC2 M6i instancesMake sense of important data faster with AWS EC2 M6i instances
Make sense of important data faster with AWS EC2 M6i instances
Principled Technologies
 
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Principled Technologies
 
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Principled Technologies
 
Insiders Guide- Managing Storage Performance
Insiders Guide- Managing Storage PerformanceInsiders Guide- Managing Storage Performance
Insiders Guide- Managing Storage PerformanceDataCore Software
 
I_O Switch Thunderstorm Data Sheet
I_O Switch Thunderstorm Data SheetI_O Switch Thunderstorm Data Sheet
I_O Switch Thunderstorm Data SheetDimitar Boyn
 
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Principled Technologies
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperDavid Walker
 
Hitachi overview-brochure-hus-hnas-family
Hitachi overview-brochure-hus-hnas-familyHitachi overview-brochure-hus-hnas-family
Hitachi overview-brochure-hus-hnas-familyHitachi Vantara
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data Storage
EMC
 
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
DataCore APAC
 
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Principled Technologies
 
Hp smart cache technology c03641668
Hp smart cache technology c03641668Hp smart cache technology c03641668
Hp smart cache technology c03641668Paul Cao
 
Data Domain Architecture
Data Domain ArchitectureData Domain Architecture
Data Domain Architecture
koesteruk22
 
Cloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control MethodCloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control Method
IRJET Journal
 
Virtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageVirtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged Storage
DataCore Software
 
Ddn 2017 10_dse_primer
Ddn 2017 10_dse_primerDdn 2017 10_dse_primer
Ddn 2017 10_dse_primer
Daniel M. Farrell
 
Offer faster access to critical data and achieve greater inline data reductio...
Offer faster access to critical data and achieve greater inline data reductio...Offer faster access to critical data and achieve greater inline data reductio...
Offer faster access to critical data and achieve greater inline data reductio...
Principled Technologies
 
FAQ on Dedupe NetApp
FAQ on Dedupe NetAppFAQ on Dedupe NetApp
FAQ on Dedupe NetApp
Ashwin Pawar
 
An Assessment of SSD Performance in the IBM System Storage DS8000
An Assessment of SSD Performance in the IBM System Storage DS8000An Assessment of SSD Performance in the IBM System Storage DS8000
An Assessment of SSD Performance in the IBM System Storage DS8000
IBM India Smarter Computing
 

What's hot (20)

Whitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_FinalWhitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_Final
 
Make sense of important data faster with AWS EC2 M6i instances
Make sense of important data faster with AWS EC2 M6i instancesMake sense of important data faster with AWS EC2 M6i instances
Make sense of important data faster with AWS EC2 M6i instances
 
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
 
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
 
Insiders Guide- Managing Storage Performance
Insiders Guide- Managing Storage PerformanceInsiders Guide- Managing Storage Performance
Insiders Guide- Managing Storage Performance
 
I_O Switch Thunderstorm Data Sheet
I_O Switch Thunderstorm Data SheetI_O Switch Thunderstorm Data Sheet
I_O Switch Thunderstorm Data Sheet
 
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
Hitachi overview-brochure-hus-hnas-family
Hitachi overview-brochure-hus-hnas-familyHitachi overview-brochure-hus-hnas-family
Hitachi overview-brochure-hus-hnas-family
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data Storage
 
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
 
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...
 
Hp smart cache technology c03641668
Hp smart cache technology c03641668Hp smart cache technology c03641668
Hp smart cache technology c03641668
 
Data Domain Architecture
Data Domain ArchitectureData Domain Architecture
Data Domain Architecture
 
Cloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control MethodCloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control Method
 
Virtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageVirtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged Storage
 
Ddn 2017 10_dse_primer
Ddn 2017 10_dse_primerDdn 2017 10_dse_primer
Ddn 2017 10_dse_primer
 
Offer faster access to critical data and achieve greater inline data reductio...
Offer faster access to critical data and achieve greater inline data reductio...Offer faster access to critical data and achieve greater inline data reductio...
Offer faster access to critical data and achieve greater inline data reductio...
 
FAQ on Dedupe NetApp
FAQ on Dedupe NetAppFAQ on Dedupe NetApp
FAQ on Dedupe NetApp
 
An Assessment of SSD Performance in the IBM System Storage DS8000
An Assessment of SSD Performance in the IBM System Storage DS8000An Assessment of SSD Performance in the IBM System Storage DS8000
An Assessment of SSD Performance in the IBM System Storage DS8000
 

Similar to Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pays Off

Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
Sysfore Technologies
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
Nalini Mehta
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
Cognizant
 
G017143640
G017143640G017143640
G017143640
IOSR Journals
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
IOSR Journals
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
inventionjournals
 
Big data
Big dataBig data
Big data
revathireddyb
 
Big data
Big dataBig data
Big data
revathireddyb
 
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
faizrashid1995
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
KamranKhan587
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
cscpconf
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
David Tjahjono,MD,MBA(UK)
 
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFSImplementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
IJEACS
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Cognizant
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
Graisy Biswal
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
datastack
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 

Similar to Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pays Off (20)

Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
G017143640
G017143640G017143640
G017143640
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
Hadoop
HadoopHadoop
Hadoop
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFSImplementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
paper
paperpaper
paper
 

More from Samsung Business USA

13 tricks to get the most out of the S Pen
13 tricks to get the most out of the S Pen13 tricks to get the most out of the S Pen
13 tricks to get the most out of the S Pen
Samsung Business USA
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom
Samsung Business USA
 
10 reasons to choose Galaxy Tab S9 for work on the go
10 reasons to choose Galaxy Tab S9 for work on the go10 reasons to choose Galaxy Tab S9 for work on the go
10 reasons to choose Galaxy Tab S9 for work on the go
Samsung Business USA
 
10 reasons to upgrade to Samsung’s Galaxy S23
10 reasons to upgrade to Samsung’s Galaxy S2310 reasons to upgrade to Samsung’s Galaxy S23
10 reasons to upgrade to Samsung’s Galaxy S23
Samsung Business USA
 
13 things you didn’t know you could do with the S Pen
13 things you didn’t know you could do with the S Pen13 things you didn’t know you could do with the S Pen
13 things you didn’t know you could do with the S Pen
Samsung Business USA
 
10 ways to fuel productivity
10 ways to fuel productivity10 ways to fuel productivity
10 ways to fuel productivity
Samsung Business USA
 
8 ways ViewFinity monitors set the standard for professionals
8 ways ViewFinity monitors set the standard for professionals8 ways ViewFinity monitors set the standard for professionals
8 ways ViewFinity monitors set the standard for professionals
Samsung Business USA
 
10 field-ready features of Galaxy XCover6 Pro
10 field-ready features of Galaxy XCover6 Pro10 field-ready features of Galaxy XCover6 Pro
10 field-ready features of Galaxy XCover6 Pro
Samsung Business USA
 
5 Ways Smart Hospital Rooms Can Improve The Patient Experience
5 Ways Smart Hospital Rooms Can Improve The Patient Experience5 Ways Smart Hospital Rooms Can Improve The Patient Experience
5 Ways Smart Hospital Rooms Can Improve The Patient Experience
Samsung Business USA
 
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdfINFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
Samsung Business USA
 
10 ways to fuel productivity with Galaxy Book
10 ways to fuel productivity with Galaxy Book10 ways to fuel productivity with Galaxy Book
10 ways to fuel productivity with Galaxy Book
Samsung Business USA
 
The best monitors for teachers: 5 features every educator needs
The best monitors for teachers: 5 features every educator needsThe best monitors for teachers: 5 features every educator needs
The best monitors for teachers: 5 features every educator needs
Samsung Business USA
 
To BYOD or not to BYOD?
To BYOD or not to BYOD?To BYOD or not to BYOD?
To BYOD or not to BYOD?
Samsung Business USA
 
6 smart ways Samsung's 85-inch Interactive Display changes the classroom
6 smart ways Samsung's 85-inch Interactive Display changes the classroom6 smart ways Samsung's 85-inch Interactive Display changes the classroom
6 smart ways Samsung's 85-inch Interactive Display changes the classroom
Samsung Business USA
 
13 things you didn't know you could do with the S Pen
13 things you didn't know you could do with the S Pen13 things you didn't know you could do with the S Pen
13 things you didn't know you could do with the S Pen
Samsung Business USA
 
10 reasons to choose Galaxy Tab S8 for work on the go
10 reasons to choose Galaxy Tab S8 for work on the go10 reasons to choose Galaxy Tab S8 for work on the go
10 reasons to choose Galaxy Tab S8 for work on the go
Samsung Business USA
 
10 reasons to upgrade to Galaxy S22 Ultra
10 reasons to upgrade to Galaxy S22 Ultra10 reasons to upgrade to Galaxy S22 Ultra
10 reasons to upgrade to Galaxy S22 Ultra
Samsung Business USA
 
9 benefits of mobilizing patient care with hospital technology
9 benefits of mobilizing patient care with hospital technology9 benefits of mobilizing patient care with hospital technology
9 benefits of mobilizing patient care with hospital technology
Samsung Business USA
 
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
Samsung Business USA
 
7 ways displays create a connected campus
7 ways displays create a connected campus7 ways displays create a connected campus
7 ways displays create a connected campus
Samsung Business USA
 

More from Samsung Business USA (20)

13 tricks to get the most out of the S Pen
13 tricks to get the most out of the S Pen13 tricks to get the most out of the S Pen
13 tricks to get the most out of the S Pen
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom
 
10 reasons to choose Galaxy Tab S9 for work on the go
10 reasons to choose Galaxy Tab S9 for work on the go10 reasons to choose Galaxy Tab S9 for work on the go
10 reasons to choose Galaxy Tab S9 for work on the go
 
10 reasons to upgrade to Samsung’s Galaxy S23
10 reasons to upgrade to Samsung’s Galaxy S2310 reasons to upgrade to Samsung’s Galaxy S23
10 reasons to upgrade to Samsung’s Galaxy S23
 
13 things you didn’t know you could do with the S Pen
13 things you didn’t know you could do with the S Pen13 things you didn’t know you could do with the S Pen
13 things you didn’t know you could do with the S Pen
 
10 ways to fuel productivity
10 ways to fuel productivity10 ways to fuel productivity
10 ways to fuel productivity
 
8 ways ViewFinity monitors set the standard for professionals
8 ways ViewFinity monitors set the standard for professionals8 ways ViewFinity monitors set the standard for professionals
8 ways ViewFinity monitors set the standard for professionals
 
10 field-ready features of Galaxy XCover6 Pro
10 field-ready features of Galaxy XCover6 Pro10 field-ready features of Galaxy XCover6 Pro
10 field-ready features of Galaxy XCover6 Pro
 
5 Ways Smart Hospital Rooms Can Improve The Patient Experience
5 Ways Smart Hospital Rooms Can Improve The Patient Experience5 Ways Smart Hospital Rooms Can Improve The Patient Experience
5 Ways Smart Hospital Rooms Can Improve The Patient Experience
 
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdfINFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
 
10 ways to fuel productivity with Galaxy Book
10 ways to fuel productivity with Galaxy Book10 ways to fuel productivity with Galaxy Book
10 ways to fuel productivity with Galaxy Book
 
The best monitors for teachers: 5 features every educator needs
The best monitors for teachers: 5 features every educator needsThe best monitors for teachers: 5 features every educator needs
The best monitors for teachers: 5 features every educator needs
 
To BYOD or not to BYOD?
To BYOD or not to BYOD?To BYOD or not to BYOD?
To BYOD or not to BYOD?
 
6 smart ways Samsung's 85-inch Interactive Display changes the classroom
6 smart ways Samsung's 85-inch Interactive Display changes the classroom6 smart ways Samsung's 85-inch Interactive Display changes the classroom
6 smart ways Samsung's 85-inch Interactive Display changes the classroom
 
13 things you didn't know you could do with the S Pen
13 things you didn't know you could do with the S Pen13 things you didn't know you could do with the S Pen
13 things you didn't know you could do with the S Pen
 
10 reasons to choose Galaxy Tab S8 for work on the go
10 reasons to choose Galaxy Tab S8 for work on the go10 reasons to choose Galaxy Tab S8 for work on the go
10 reasons to choose Galaxy Tab S8 for work on the go
 
10 reasons to upgrade to Galaxy S22 Ultra
10 reasons to upgrade to Galaxy S22 Ultra10 reasons to upgrade to Galaxy S22 Ultra
10 reasons to upgrade to Galaxy S22 Ultra
 
9 benefits of mobilizing patient care with hospital technology
9 benefits of mobilizing patient care with hospital technology9 benefits of mobilizing patient care with hospital technology
9 benefits of mobilizing patient care with hospital technology
 
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
 
7 ways displays create a connected campus
7 ways displays create a connected campus7 ways displays create a connected campus
7 ways displays create a connected campus
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 

Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pays Off

  • 1. White Paper: Big Data SSD Architecture Digging Deep to Discover Where SSD Performance Pays Off
  • 2. Big data is much more than just “lots of data”. State-of-the- art applications gather data from many different sources in varying formats with complex relationships. Analytics yield insight into events, trends, and behaviors, and may adapt in real-time depending on specific findings. These data sets can indeed grow extremely large, with captured sequences kept for future analysis. Traditional database tools usually handle data with pre- determined structures and fixed relationships, typically on a scale-up server and storage cluster. Often, these tools link with some transactional process, providing front-end services for users while capturing back-end data. Larger deployments may use a data warehouse, which consolidates data from various applications into one managed, reliable repository of information for enterprise decision-making. New distributed, flexible, and scalable tools for multi-sourced and poly-structured big data typically employ numbers of less-expensive network processing and storage systems. Rather than forcing all data into a massive warehouse before examination, distributed big data processing nodes can relieve architectural stress points with localized processing. For larger tasks, local nodes can join in global processing, increasing parallelism and supporting chained jobs. Which storage technology, hard disk drives (HDDs) or solid-state drives (SSDs), excels in big data architecture? SSDs clearly win on speed, offering both higher sequential read/write speeds and higher input/output operations per second (IOPS). A naïve approach might replace all HDDs with SSDs everywhere in a big data system. However, deploying SSDs in hundreds or thousands of nodes could add up to a very expensive proposition. A better approach identifies critical locations where SSDs enable immediate cost-per-performance wins, and integrating HDDs used in less stressful roles to save on system costs. This whitepaper will look at the basics of big data tools, review two performance wins with SSDs in a well-known framework, as well as present some examples of emerging opportunities on the leading edge of big data technology. Location: SSDs in Big Data Architecture Adapted from “Defining the Big Data Architecture Framework,” Demchenko, University of Amsterdam, July 2013 5 Vs of Big Data • Terabytes • Records/Arch • Transactions • Tables, Files • Structured • Unstructured • Multi-factor • Probabilistic • Batch • Real/near-time • Processes • Streams • Trustworthiness • Authenticity • Origin, Reputation • Availability • Accountability • Statistical • Events • Correlations • Hypothetical Veracity Value VelocityVariety Volume
  • 3. Finding those precise locations where SSDs speed up big data operations is a hot topic. Researchers are digging deep inside architectures looking for spots where HDDs are overwhelmed and slowing things down. Before we introduce findings from some of these studies, a quick overview of big data tools and terminology is in order. Apache Hadoop is one of the most well known tools associated with big data architecture. Its framework provides a straightforward way to distribute and manage data on multiple networked computers – nodes – and schedule processing resources for applications. First released at the end of 2011, Hadoop has gained massive popularity in search engine, social networking, and cloud computing infrastructure.1 Hadoop leverages parallelism for both storage and computational resources. Instead of storing a very large file in a single location, Hadoop spreads files across many nodes. This effectively multiplies available file system bandwidth. Nodes also contain processing elements, allowing scheduling of jobs and tasks across multiple heterogeneous machines simultaneously. The architecture provides fault resilience; data is replicated on multiple Hadoop nodes, and if a node is down, another replaces it. Adding nodes enhances scalability, with large instances achieving hundreds of petabytes of storage. Inside Hadoop are two primary engines: HDFS (Hadoop Distributed File System) and MapReduce. Both run on the same nodes concurrently. HDFS manages files across a cluster containing a NameNode tracking metadata and a number of DataNodes containing the actual data. HDFS maintains data and rack (physical location) awareness between jobs and tasks, minimizing network transfers. It rebalances data within a cluster automatically, and supports snapshots, upgrades, and rollback for maintenance. Data can be accessed via a Java API, web browsers and a REST API, or other languages via the Thrift API. Applications using HDFS include the Apache HBase non- relational distributed database modeled after Google Big Table, and the Apache Hive data warehouse manager. Hadoop is open to other distributed file systems with varying functionality.2 MapReduce is an engine for parallel, distributed processing. A job separates a large input data set into smaller chunks, with data represented in (key, value) pairs. Map tasks process these chunks in parallel, then an intermediate shuffle function sorts the results passed to Reduce tasks for mathematical operations. MapReduce is by nature disk-based, batch oriented, and non-real-time.3 Basics of Big Data Tools MapReduce Layer HDFS Layer multi-node cluster master slave task tracker task tracker job tracker name node data node data node Adapted from “Running Hadoop on Ubuntu Linux (Multi-Node Cluster),” Michael Noll Hadoop Multi-Node Cluster
Locating Quick Wins with SSDs

Most computing jobs fall into one of two categories: compute-intensive or I/O-intensive. Studying the MapReduce workload, a team from Cloudera4 deployed one SSD versus several HDDs to approximate the same theoretical aggregate sequential read/write bandwidth, isolating performance differences. Hadoop clusters with all SSDs outperformed HDD clusters by as much as 70% under some workloads in the Cloudera study. SSDs also achieve about twice the actual sequential I/O size, and their lower latency supports more task scheduling.

For some workloads, the difference is less. Chopping up large files means sequential access is not the limiting factor – exactly what Hadoop intends to achieve. However, when file granularity exceeds what node memory can hold, an important change occurs:

"In practice, independent of I/O concurrency, there is negligible disk I/O for intermediate data that fits in memory, while a large amount of intermediate data leads to severe load on the disks."
—The Truth About MapReduce Performance on SSDs, 2014

MapReduce actually involves three phases: Map, shuffle, and Reduce. The intermediate shuffle operation stores its results in a single file on local disk. By targeting shuffle results to an SSD using the mapred.local.dir parameter, performance increases dramatically (a configuration sketch follows below). Splitting the SSD into multiple data directories also improves sequential read/write task scheduling. Hybrid configurations with HDDs and a properly configured SSD may be the most cost-effective quick win for Hadoop performance.

Hadoop also presents compute-intensive opportunities. To quote Apache, "moving computation is cheaper than moving data." A Microsoft study suggests that most analytics jobs do not use huge data sets, citing a data point from Facebook that 90% of jobs have input sizes under 100 GB.5 Their suggestion is to "scale up," but what they describe is actually a converged modular server with 32 cores and paired SSD storage, bypassing HDFS for local access. One point Microsoft makes strongly: "… Without SSDs, many Hadoop jobs become disk-bound." New PCIe SSDs using the latest NVMe protocol will be ideal to support compute-intensive workloads and keep the network free. This is vital for use cases such as streaming content and incoming IoT sensor data. Removing storage latency also allows more Hadoop jobs to be scheduled. Some applications, such as Bayesian classification used in machine learning, chain Hadoop jobs.

By concentrating processing on bigger multicore servers with all-SSD storage for some jobs, Microsoft finds that overall Hadoop cluster performance increases. Their suggested architecture is a mix of bigger all-SSD nodes and smaller, less expensive nodes (perhaps with hybrid drive configurations) combined in a scale-out approach, handling a range of Hadoop job sizes most efficiently.

[Figure: MapReduce Data Flow – input splits from HDFS feed Map tasks; their outputs are sorted, merged, and copied to Reduce tasks, whose output parts are written back to HDFS with replication.]
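As a rough illustration of the shuffle-to-SSD change discussed above, the sketch below sets mapred.local.dir programmatically before building a job. In practice this parameter is usually set cluster-wide in mapred-site.xml (and newer YARN clusters use yarn.nodemanager.local-dirs instead); the /mnt/ssd0 and /mnt/ssd1 mount points are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleOnSsd {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Send intermediate (shuffle) results to SSD-backed directories.
    // Splitting one SSD into several data directories can also help
    // task scheduling, as noted in the Cloudera study.
    conf.set("mapred.local.dir",
             "/mnt/ssd0/mapred/local,/mnt/ssd1/mapred/local");

    // HDFS block data can stay on HDDs (dfs.datanode.data.dir in
    // hdfs-site.xml), giving the hybrid HDD + SSD node described above.
    Job job = Job.getInstance(conf, "shuffle-on-ssd");
    // ... set mapper, reducer, and input/output paths exactly as in the
    // word-count sketch earlier, then call job.waitForCompletion(true).
  }
}

Because the change only redirects where intermediate data lands, existing HDD-based DataNode storage and application code stay untouched, which is what makes this a low-risk first step.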
More Prime Spots on Faster Routes

Open source projects attract innovation, and big data is no exception. The basics of Hadoop, HDFS, and MapReduce are established, and researchers are turning to other sophisticated techniques. In many cases, SSDs still hold the key to performance improvement.

Virtualization can consolidate multiple nodes on scale-up servers, but it inserts significant overhead. Samsung researchers found that Hadoop instances running over Xen virtualization suffer from 50% to 83% degradation in sort-based benchmarks using HDDs. The same configurations with SSDs fell only 10% to 21% when virtualized. Further analysis reveals why – virtualization decreases I/O block size and increases the number of requests, an environment where SSDs perform better.6

At Facebook, using HBase over HDFS for the message stack simplifies development, but introduces performance issues. Messages are small files; 90% are less than 15 MB. Compounding the problem, I/O is random, and "hot" frequently used data is often too large for RAM. Part of their solution is a 60 GB SSD configured as HBase cache that more than triples performance through reduced latency.7

Several efforts are after "in-memory" benefits. Aerospike has created a hybrid RAM/SSD architecture for a NoSQL database. Smaller files are in RAM, accessed with raw read/write operations bypassing the Linux file system, while larger files are on low-latency SSDs. Indices also reside in RAM. Data on the SSDs is contiguous and optimized for "one read per record."8

Teams at Virginia Tech redesigned HDFS for tiered storage clusters.9 Their study on hatS (heterogeneity-aware tiered storage) shows the value of aiming more requests at PCIe SSDs. All 27 worker nodes have a SATA HDD (3 racks, 9 nodes each), nine nodes have a SATA SSD (3 per rack), and three nodes have a PCIe SSD (1 in each rack). Running a synthetic Facebook benchmark with hatS, the average I/O rate of HDFS increases 36%, and execution time falls 26%.

Spark, a new in-memory architecture that could supplant MapReduce, allows programs to load resilient distributed datasets (RDDs) into cluster memory.10 Spark redefines shuffle algorithms, with each processor core (in earlier releases, each map task) creating a number of shuffle result files matching the number of reduce tasks. Contrary to popular belief, Spark RDDs can spill to disk for persistence. Benchmarks by Databricks deploying Spark on 206 Amazon EC2 i2.8xlarge nodes with SSD-based storage recently broke sort records, running three times faster while using one-tenth the number of machines of an equivalent Hadoop implementation.

MIT researchers find that if in-memory servers go to disk as little as 5% of the time, the overhead can nullify the gains from RAM-based nodes.11 Their experiment uses Xilinx FPGAs with ARM processing cores to pre-process Samsung SSD access in an interconnect fabric. A small cluster with 20 servers achieves near RAM-based speeds on image search, PageRank, and Memcached algorithms. This proves the value of tightly coupled nodes combining processing with an SSD.
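The sketch below illustrates, under stated assumptions, how a Spark program keeps an RDD in memory while letting overflow partitions and shuffle files land on an SSD. It uses Spark's Java API; spark.local.dir controls where shuffle and spill files are written, and the /mnt/ssd0/spark mount point and input path are hypothetical.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SparkSpillToSsd {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("rdd-spill-example")
        // Shuffle files and spilled RDD blocks are written here;
        // /mnt/ssd0/spark is a hypothetical SSD mount point.
        .set("spark.local.dir", "/mnt/ssd0/spark")
        // Fall back to a local master if none is supplied by spark-submit.
        .setIfMissing("spark.master", "local[*]");

    JavaSparkContext sc = new JavaSparkContext(conf);

    // Keep the dataset in memory, but allow partitions that do not fit
    // to spill to (SSD-backed) local disk instead of being recomputed.
    JavaRDD<String> lines = sc.textFile(args[0])
        .persist(StorageLevel.MEMORY_AND_DISK());

    long errorCount = lines.filter(l -> l.contains("ERROR")).count();
    System.out.println("error lines: " + errorCount);

    sc.stop();
  }
}

MEMORY_AND_DISK avoids recomputing partitions that overflow RAM; with spark.local.dir on an SSD, the penalty for that occasional spill is far smaller than with an HDD, in line with the MIT observation that even infrequent disk access can erase in-memory gains.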
Faster Now, Even Faster Soon

These studies all show that choosing the right locations for SSDs in big data architecture produces significant cost-per-performance gains. Existing Hadoop installations speed up instantly with a simple change targeting MapReduce shuffle results to an SSD instead of an HDD. New Hadoop installations can take advantage of more SSDs within asymmetric clusters, with a scale-up server handling intense jobs alongside a network of smaller machines, and with faster nodes adding localized pre-processing in front of their SSDs.

The big data community continues to press the boundaries, looking for architecture enhancements. Hybrid installations with enhanced file systems are showing promise. These approaches capitalize on the ability of SSDs to service more requests quickly, which translates to more job scheduling and improved cluster performance. Also noteworthy is the misconception that "in-memory" means no disk: newer big data architectures leverage RAM performance, but when they do go to disk, SSDs are essential to keeping overall cluster performance high.

One powerful observation is that big data is not necessarily operating on large, sequential files. When a cluster does its job, input files have been parsed into smaller pieces, and disk operations are more random and multi-threaded – an ideal use case where SSDs excel. As V-NAND flash and PCIe interface technology mature, the cost-per-performance metric shifts in favor of SSDs. High-performance SSDs will continue to move from just critical locations to broader use throughout big data clusters.

Samsung Enterprise SSD Portfolio

PM863 Series Data Center SSDs
• 3-bit MLC NAND
• Designed for read-intensive applications
• Built-in power loss protection
• SATA 6 Gb/s interface
• Form factor: 2.5"

SM863 Series Data Center SSDs
• 2-bit MLC NAND
• Designed for write-intensive applications
• Built-in power loss protection
• SATA 6 Gb/s interface
• Form factor: 2.5"

Samsung Workstation SSD Portfolio

950 PRO Series Client PC SSDs
• 2-bit MLC NAND
• Designed for high-end PCs
• PCIe interface with NVMe protocol
• Form factor: M.2

850 PRO Series Client PC SSDs
• 2-bit MLC NAND
• SATA 6 Gb/s interface
• Form factor: 2.5"

1 Apache Hadoop project.
2 Apache Hadoop documentation, "HDFS Architecture Guide."
3 Apache Hadoop documentation, "MapReduce Tutorial."
4 "The Truth About MapReduce Performance on SSDs," Karthik Kambatla and Yanpei Chen, Cloudera Inc. and Purdue University, presented at LISA14 (USENIX), November 2014.
5 "Scale-up vs Scale-out for Hadoop: Time to rethink?," Appuswamy et al., Microsoft Research, presented at SoCC '13, October 2013.
6 "Performance Implications of SSDs in Virtualized Hadoop Clusters," Ahn et al., Samsung Electronics, presented at IEEE BigData Congress, June 2014.
7 "Analysis of HDFS under HBase: A Facebook Messages Case Study," Harter et al., Facebook Inc. and University of Wisconsin–Madison, presented at FAST '14 (USENIX), February 2014.
8 Aerospike web site.
9 "hatS: A Heterogeneity-Aware Tiered Storage for Hadoop," Krish et al., Virginia Tech, presented at IEEE/ACM CCGrid 2014, May 2014.
10 "Spark the fastest open source engine for sorting a petabyte," Databricks, November 2014.
11 "Cutting cost and power consumption for big data," Larry Hardesty, Massachusetts Institute of Technology, July 10, 2015.

Learn more: samsung.com/enterprisessd | 1-866-SAM4BIZ
Follow us: youtube.com/samsungbizusa | @SamsungBizUSA | insights.samsung.com

©2016 Samsung Electronics America, Inc. All rights reserved. Samsung is a registered trademark of Samsung Electronics Co., Ltd. All products, logos and brand names are trademarks or registered trademarks of their respective companies. This white paper is for informational purposes only. Samsung makes no warranties, express or implied, in this white paper. WHP-SSD-BIGDATA-JAN16J