White Paper:
Big Data SSD Architecture
Digging Deep to Discover Where SSD Performance Pays Off

Location: SSDs in Big Data Architecture
Big data is much more than
just “lots of data”. State-of-the-
art applications gather data
from many different sources in
varying formats with complex
relationships. Analytics yield
insight into events, trends, and
behaviors, and may adapt in
real-time depending on specific
findings. These data sets can
indeed grow extremely large,
with captured sequences kept
for future analysis.
Traditional database tools
usually handle data with pre-
determined structures and
fixed relationships, typically on
a scale-up server and storage
cluster. Often, these tools link
with some transactional process,
providing front-end services for
users while capturing back-end
data. Larger deployments may
use a data warehouse, which
consolidates data from various
applications into one managed,
reliable repository of information
for enterprise decision-making.
New distributed, flexible, and
scalable tools for multi-sourced
and poly-structured big data
typically employ large numbers of less expensive networked processing and storage systems.
Rather than forcing all data into
a massive warehouse before
examination, distributed big data
processing nodes can relieve
architectural stress points with
localized processing. For larger
tasks, local nodes can join in
global processing, increasing
parallelism and supporting
chained jobs.
Which storage technology, hard
disk drives (HDDs) or solid-state
drives (SSDs), excels in big data
architecture? SSDs clearly win
on speed, offering both higher
sequential read/write speeds and
higher input/output operations
per second (IOPS). A naïve
approach might replace all HDDs
with SSDs everywhere in a big
data system. However, deploying
SSDs in hundreds or thousands
of nodes could add up to a
very expensive proposition. A
better approach identifies critical locations where SSDs enable immediate cost-per-performance wins, and integrates HDDs in less stressful roles to save on system costs.
This white paper looks at the basics of big data tools, reviews two performance wins with SSDs in a well-known framework, and presents some examples of emerging opportunities on the leading edge of big data technology.
[Figure: The 5 Vs of Big Data. Volume: terabytes; records/arch; transactions; tables, files. Variety: structured; unstructured; multi-factor; probabilistic. Velocity: batch; real/near-time; processes; streams. Veracity: trustworthiness; authenticity; origin, reputation; availability; accountability. Value: statistical; events; correlations; hypothetical. Adapted from “Defining the Big Data Architecture Framework,” Demchenko, University of Amsterdam, July 2013.]
Basics of Big Data Tools

Finding those precise locations where SSDs speed up big data operations is a hot topic. Researchers are digging deep inside architectures, looking for spots where HDDs are overwhelmed and slowing things down. Before we introduce findings from some of these studies, a quick overview of big data tools and terminology is in order.
Apache Hadoop is one of the best-known tools associated with big data architecture. Its framework provides a straightforward way to distribute and manage data on multiple networked computers – nodes – and to schedule processing resources for applications. Reaching its 1.0 release at the end of 2011, Hadoop has gained massive popularity in search engine, social networking, and cloud computing infrastructure.1
Hadoop leverages parallelism for
both storage and computational
resources. Instead of storing a
very large file in a single location,
Hadoop spreads files across
many nodes. This effectively
multiplies available file system
bandwidth. Nodes also contain
processing elements, allowing
scheduling of jobs and tasks
across multiple heterogeneous
machines simultaneously. The
architecture provides fault
resilience; data is replicated on
multiple Hadoop nodes, and if a
node is down, another replaces
it. Adding nodes enhances
scalability, with large instances
achieving hundreds of petabytes
of storage.
Inside Hadoop are two primary
engines: HDFS (Hadoop
Distributed File System) and
MapReduce. Both run on the
same nodes concurrently.
HDFS manages files across a
cluster containing a NameNode
tracking metadata and a number
of DataNodes containing the
actual data. HDFS maintains
data and rack (physical location)
awareness between jobs and
tasks, minimizing network
transfers. It rebalances data
within a cluster automatically,
and supports snapshots,
upgrades, and rollback for
maintenance. Data can be
accessed via a Java API, web
browsers and a REST API, or
other languages via the Thrift
API. Applications using HDFS
include the Apache HBase non-
relational distributed database
modeled after Google Bigtable,
and the Apache Hive data
warehouse manager. Hadoop
is open to other distributed
file systems with varying
functionality.2
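As an illustration of the Java API mentioned above, here is a minimal sketch (the NameNode address and paths are hypothetical) that writes a file through HDFS, reads it back, and asks which DataNodes hold its blocks:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/bigdata/events/sample.txt");

        // Write a small file; HDFS replicates its blocks across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello big data");
        }

        // Read it back through the same API.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        // Data/rack awareness: list the hosts holding each block of the file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```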
MapReduce is an engine for
parallel, distributed processing.
A job separates a large input
data set into smaller chunks,
with data represented in (key,
value) pairs. Map tasks process
these chunks in parallel, then
an intermediate shuffle function
sorts the results passed to
Reduce tasks for mathematical
operations. MapReduce is
by nature disk-based, batch
oriented, and non-real-time.3
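To make the (key, value) flow concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (class names are illustrative): map tasks emit intermediate pairs, the shuffle groups them by key, and reduce tasks aggregate the values.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for each word in an input split.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate (key, value) pair
                }
            }
        }
    }

    // Reduce: after the shuffle groups values by key, sum the counts.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```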
[Figure: Hadoop Multi-Node Cluster. The master node runs the job tracker and name node (plus a task tracker and data node of its own), while slave nodes run a task tracker and data node, spanning the MapReduce layer and the HDFS layer. Adapted from “Running Hadoop on Ubuntu Linux (Multi-Node Cluster),” Michael Noll.]
Locating Quick Wins with SSDs

Most computing jobs fall into one of two categories: compute-intensive or I/O-intensive. Studying the MapReduce workload, a team from Cloudera4 deployed one SSD versus several HDDs to approximate the same theoretical aggregate sequential read/write bandwidth, isolating performance differences.
Hadoop clusters with all SSDs outperformed HDD clusters by as much as 70% under some workloads in the Cloudera study. SSDs also achieve about twice the actual sequential I/O size, with lower latency supporting more task scheduling. For some workloads, the difference is smaller: chopping up large files means sequential access is not the limiting factor – exactly what Hadoop intends to achieve. However, when intermediate data grows beyond what node memory can hold, an important change occurs:
“In practice, independent
of I/O concurrency, there
is negligible disk I/O for
intermediate data that fits
in memory, while a large
amount of intermediate
data leads to severe load
on the disks.”
—The Truth About MapReduce Performance on SSDs, 2014
MapReduce actually involves three phases: Map, shuffle, and Reduce. The intermediate shuffle operation stores its results in a single file on local disk. Targeting shuffle results to an SSD using the mapred.local.dir parameter increases performance dramatically, and splitting the SSD into multiple data directories further improves sequential read/write task scheduling. Hybrid configurations with HDDs and a properly configured SSD may be the most cost-effective quick win for Hadoop performance.
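A minimal sketch of that change, set per node in mapred-site.xml (the SSD mount points below are hypothetical); listing several directories on the SSD spreads intermediate files across more paths:

```xml
<!-- mapred-site.xml (per node): place intermediate shuffle data on the SSD -->
<property>
  <name>mapred.local.dir</name>
  <!-- Comma-separated list; several directories on one SSD increase parallelism -->
  <value>/mnt/ssd/mapred/local1,/mnt/ssd/mapred/local2,/mnt/ssd/mapred/local3</value>
</property>
```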
Hadoop also presents compute-
intensive opportunities. To quote
Apache, “moving computation is
cheaper than moving data.”
A Microsoft study suggests that
most analytics jobs do not use
huge data sets, citing a data point
from Facebook that 90% of jobs
have input sizes under 100 GB.5
Their suggestion is to “scale-up,”
but what they describe is actually
a converged modular server with
32 cores and paired SSD storage,
bypassing HDFS for local access.
One point Microsoft makes
strongly: “… Without SSDs,
many Hadoop jobs become
disk-bound.”
New PCIe SSDs using the latest
NVMe protocol will be ideal
to support compute-intensive
workloads and keep the network
free. This is vital for use cases
such as streaming content
and incoming IoT sensor data.
Removing storage latency also
allows more Hadoop jobs to be
scheduled. Some applications,
such as Bayesian classification
used in machine learning, chain
Hadoop jobs.
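As a sketch of such chaining with the standard Hadoop Job API (paths and job names are hypothetical, and mapper/reducer classes are omitted for brevity), the output directory of one job simply becomes the input of the next:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("/bigdata/raw");           // hypothetical paths
        Path intermediate = new Path("/bigdata/stage1");
        Path output = new Path("/bigdata/result");

        // Job 1 (e.g., feature extraction); its output feeds the next job.
        Job first = Job.getInstance(conf, "stage-1");
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) System.exit(1);

        // Job 2 (e.g., classification) reads the intermediate results.
        Job second = Job.getInstance(conf, "stage-2");
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```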
By concentrating processing on
bigger multicore servers with
all-SSD storage for some jobs,
Microsoft finds that overall Hadoop
cluster performance increases.
Their suggested architecture is
a mix of bigger all-SSD nodes
and smaller, less expensive
nodes (perhaps with hybrid drive
configurations) combined in a
scale-out approach, handling
a range of Hadoop job sizes
most efficiently.
[Figure: MapReduce Data Flow. Input splits from HDFS feed parallel map tasks; map output is sorted, copied, and merged for the reduce tasks, whose output parts are written back to HDFS with replication.]
More Prime Spots on Faster Routes

Open source projects attract innovation, and big data is no exception. The basics of Hadoop, HDFS, and MapReduce are established, and researchers are turning to other sophisticated techniques. In many cases, SSDs still hold the key to performance improvement.
Virtualization can consolidate
multiple nodes on scale-up
servers, but inserts significant
overhead. Samsung researchers
found that Hadoop instances
running over Xen virtualization
suffer from 50% to 83%
degradation in sort-based
benchmarks using HDDs.
The same configurations with
SSDs fell only 10% to 21%
running on a virtualized system.
Further analysis reveals why
– virtualization decreases
I/O block size and increases
the number of requests, an
environment where SSDs
perform better.6
At Facebook, using HBase
over HDFS for the message
stack simplifies development,
but introduces performance
issues. Messages are small
files; 90% are less than 15 MB.
Compounding the problem, I/O is random, and “hot” (frequently used) data is often too large for RAM. Part of their solution
is a 60 GB SSD configured as
HBase cache that more than
triples performance through
reduced latency.7
Several efforts are after “in-
memory” benefits. Aerospike
has created a hybrid RAM/
SSD architecture for a NoSQL
database. Smaller files are in
RAM, accessed with raw read/
write bypassing the Linux file
system, while larger files are
on low-latency SSDs. Indices
also reside in RAM. Data on
the SSDs is contiguous and
optimized for “one read per
record.”8
Teams at Virginia Tech
redesigned HDFS for tiered
storage clusters.9
Their study
on hatS (heterogeneity-aware
tiered storage) shows the value
of aiming more requests at PCIe
SSDs. All 27 worker nodes have
a SATA HDD (3 racks, 9 nodes
each), nine nodes have a SATA
SSD (3 per rack), and three
nodes have a PCIe SSD (1 in
each rack). Running a synthetic Facebook benchmark, hatS increases the average HDFS I/O rate by 36% and reduces execution time by 26%.
Spark, a new in-memory
architecture that could
supplant MapReduce, allows
programs to load resilient
distributed datasets (RDDs)
into cluster memory.10
Spark
redefines shuffle algorithms,
with each processor core (in
earlier releases, each map
task) creating a number of
shuffle result files matching
the number of reduce tasks.
Contrary to popular belief,
Spark RDDs can spill to disk
for persistence. Benchmarks by Databricks, deploying Spark on 206 Amazon EC2 i2.8xlarge nodes with SSD-based instance storage, recently broke sort records, running three times faster with one-tenth the number of machines of an equivalent Hadoop implementation.
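Spark’s Java API makes that spill behavior explicit. A minimal sketch with a hypothetical input path: persisting an RDD at the MEMORY_AND_DISK level keeps partitions in RAM where possible and spills the remainder to local disk, which is where SSDs keep the penalty small.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class RddSpillExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-spill-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load a (possibly larger-than-memory) dataset as an RDD.
        JavaRDD<String> lines = sc.textFile("hdfs:///bigdata/events/*.log");

        // MEMORY_AND_DISK keeps partitions in RAM when possible and
        // spills the remainder to local disk, ideally an SSD.
        lines.persist(StorageLevel.MEMORY_AND_DISK());

        long total = lines.count();   // first action materializes and caches the RDD
        long errors = lines.filter(s -> s.contains("ERROR")).count();

        System.out.println(total + " lines, " + errors + " errors");
        sc.stop();
    }
}
```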
MIT researchers find that if in-memory servers go to disk as little as 5% of the time, the overhead can nullify gains from RAM-based nodes.11
Their experiment
uses Xilinx FPGAs with ARM
processing cores to pre-process
Samsung SSD access in an
interconnect fabric. A small
cluster with 20 servers achieves
near RAM-based speeds on
image search, PageRank,
and Memcached algorithms.
This proves the value of tightly
coupled nodes combining
processing with an SSD.
Faster Now, Even Faster Soon

These studies all show that choosing the right locations for SSDs in big data architecture produces significant cost-per-performance gains. Existing Hadoop installations speed up instantly with a simple change targeting MapReduce shuffle results to an SSD instead of an HDD. New Hadoop installations can take advantage of more SSDs within asymmetric clusters: a scale-up server handles the most intense jobs, augmenting a network of smaller machines, and faster nodes pair localized pre-processing with SSDs.
The big data community
continues to press the
boundaries, looking for
architecture enhancements.
Hybrid installations with
enhanced file systems are
showing promise. These
approaches are capitalizing
on the ability of SSDs to
service more requests quickly,
which translates to more job
scheduling and improved cluster
performance. A noteworthy misconception is that “in-memory” means no disk. Newer big data architectures leverage RAM performance, but when they do go to disk, SSDs are essential to keeping overall cluster performance high.
One powerful observation is that big data does not necessarily operate on large, sequential
files. When a cluster does its
job, input files have been parsed
into smaller pieces, and disk
operations are more random
and multi-threaded – an ideal
use case where SSDs excel.
As V-NAND flash and PCIe
interface technology mature,
the cost-per-performance
metric shifts in favor of SSDs.
High-performance SSDs will
continue to move from just
critical locations to broader use
throughout big data clusters.
Learn more: samsung.com/enterprisessd | 1-866-SAM4BIZ
Follow us: youtube.com/samsungbizusa | @SamsungBizUSA | insights.samsung.com
©2016 Samsung Electronics America, Inc. All rights reserved. Samsung is a registered trademark of Samsung Electronics Co., Ltd.
All products, logos and brand names are trademarks or registered trademarks of their respective companies. This white paper is for
informational purposes only. Samsung makes no warranties, express or implied, in this white paper.
WHP-SSD-BIGDATA-JAN16J
1. Apache Hadoop project
2. Apache Hadoop documentation, HDFS Architecture Guide
3. Apache Hadoop documentation, MapReduce Tutorial
4. “The Truth About MapReduce Performance on SSDs,” Karthik Kambatla and Yanpei Chen, Cloudera Inc. and Purdue University, presented at LISA14, sponsored by USENIX, November 2014
5. “Scale-up vs Scale-out for Hadoop: Time to rethink?,” Appuswamy et al., Microsoft Research, presented at SoCC ’13, October 2013
6. “Performance Implications of SSDs in Virtualized Hadoop Clusters,” Ahn et al., Samsung Electronics, presented at IEEE BigData Congress, June 2014
7. “Analysis of HDFS under HBase: A Facebook Messages Case Study,” Harter et al., Facebook Inc. and University of Wisconsin, Madison, presented at FAST ’14, sponsored by USENIX, February 2014
8. Aerospike web site
9. “hatS: A Heterogeneity-Aware Tiered Storage for Hadoop,” Krish et al., Virginia Tech, presented at IEEE/ACM CCGrid 2014, May 2014
10. “Spark the fastest open source engine for sorting a petabyte,” Databricks, November 2014
11. “Cutting cost and power consumption for big data,” Larry Hardesty, Massachusetts Institute of Technology, July 10, 2015
Samsung Enterprise SSD Portfolio

PM863 Series Data Center SSDs
• 3-bit MLC NAND
• Designed for read-intensive applications
• Built-in Power Loss Protection
• SATA 6 Gb/s Interface
• Form-factors: 2.5”

SM863 Series Data Center SSDs
• 2-bit MLC NAND
• Designed for write-intensive applications
• Built-in Power Loss Protection
• SATA 6 Gb/s Interface
• Form-factors: 2.5”

Samsung Workstation SSD Portfolio

950 Pro Series Client PC SSDs
• 2-bit MLC NAND
• Designed for high-end PCs
• PCIe Interface
• NVMe protocol
• Form-factors: M.2

850 Pro Series Client PC SSDs
• 2-bit MLC NAND
• SATA 6 Gb/s Interface
• Form-factors: 2.5”
More Related Content

What's hot

Whitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_FinalWhitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_FinalMichele Hunter
 
Make sense of important data faster with AWS EC2 M6i instances
Make sense of important data faster with AWS EC2 M6i instancesMake sense of important data faster with AWS EC2 M6i instances
Make sense of important data faster with AWS EC2 M6i instances
Principled Technologies
 
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Principled Technologies
 
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Principled Technologies
 
Insiders Guide- Managing Storage Performance
Insiders Guide- Managing Storage PerformanceInsiders Guide- Managing Storage Performance
Insiders Guide- Managing Storage PerformanceDataCore Software
 
I_O Switch Thunderstorm Data Sheet
I_O Switch Thunderstorm Data SheetI_O Switch Thunderstorm Data Sheet
I_O Switch Thunderstorm Data SheetDimitar Boyn
 
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Principled Technologies
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperDavid Walker
 
Hitachi overview-brochure-hus-hnas-family
Hitachi overview-brochure-hus-hnas-familyHitachi overview-brochure-hus-hnas-family
Hitachi overview-brochure-hus-hnas-familyHitachi Vantara
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data Storage
EMC
 
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
DataCore APAC
 
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Principled Technologies
 
Hp smart cache technology c03641668
Hp smart cache technology c03641668Hp smart cache technology c03641668
Hp smart cache technology c03641668Paul Cao
 
Data Domain Architecture
Data Domain ArchitectureData Domain Architecture
Data Domain Architecture
koesteruk22
 
Cloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control MethodCloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control Method
IRJET Journal
 
Virtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageVirtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged Storage
DataCore Software
 
Ddn 2017 10_dse_primer
Ddn 2017 10_dse_primerDdn 2017 10_dse_primer
Ddn 2017 10_dse_primer
Daniel M. Farrell
 
Offer faster access to critical data and achieve greater inline data reductio...
Offer faster access to critical data and achieve greater inline data reductio...Offer faster access to critical data and achieve greater inline data reductio...
Offer faster access to critical data and achieve greater inline data reductio...
Principled Technologies
 
FAQ on Dedupe NetApp
FAQ on Dedupe NetAppFAQ on Dedupe NetApp
FAQ on Dedupe NetApp
Ashwin Pawar
 
An Assessment of SSD Performance in the IBM System Storage DS8000
An Assessment of SSD Performance in the IBM System Storage DS8000An Assessment of SSD Performance in the IBM System Storage DS8000
An Assessment of SSD Performance in the IBM System Storage DS8000
IBM India Smarter Computing
 

What's hot (20)

Whitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_FinalWhitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_Final
 
Make sense of important data faster with AWS EC2 M6i instances
Make sense of important data faster with AWS EC2 M6i instancesMake sense of important data faster with AWS EC2 M6i instances
Make sense of important data faster with AWS EC2 M6i instances
 
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
Scalability: Lenovo ThinkServer RD540 system and Lenovo ThinkServer SA120 sto...
 
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
Back up deduplicated data in less time with the Dell DR6000 Disk Backup Appli...
 
Insiders Guide- Managing Storage Performance
Insiders Guide- Managing Storage PerformanceInsiders Guide- Managing Storage Performance
Insiders Guide- Managing Storage Performance
 
I_O Switch Thunderstorm Data Sheet
I_O Switch Thunderstorm Data SheetI_O Switch Thunderstorm Data Sheet
I_O Switch Thunderstorm Data Sheet
 
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
Dell PowerEdge M820 blades: Balancing performance, density, and high availabi...
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
Hitachi overview-brochure-hus-hnas-family
Hitachi overview-brochure-hus-hnas-familyHitachi overview-brochure-hus-hnas-family
Hitachi overview-brochure-hus-hnas-family
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data Storage
 
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
 
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...
 
Hp smart cache technology c03641668
Hp smart cache technology c03641668Hp smart cache technology c03641668
Hp smart cache technology c03641668
 
Data Domain Architecture
Data Domain ArchitectureData Domain Architecture
Data Domain Architecture
 
Cloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control MethodCloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control Method
 
Virtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageVirtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged Storage
 
Ddn 2017 10_dse_primer
Ddn 2017 10_dse_primerDdn 2017 10_dse_primer
Ddn 2017 10_dse_primer
 
Offer faster access to critical data and achieve greater inline data reductio...
Offer faster access to critical data and achieve greater inline data reductio...Offer faster access to critical data and achieve greater inline data reductio...
Offer faster access to critical data and achieve greater inline data reductio...
 
FAQ on Dedupe NetApp
FAQ on Dedupe NetAppFAQ on Dedupe NetApp
FAQ on Dedupe NetApp
 
An Assessment of SSD Performance in the IBM System Storage DS8000
An Assessment of SSD Performance in the IBM System Storage DS8000An Assessment of SSD Performance in the IBM System Storage DS8000
An Assessment of SSD Performance in the IBM System Storage DS8000
 

Similar to Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pays Off

Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
Sysfore Technologies
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
Nalini Mehta
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
Cognizant
 
G017143640
G017143640G017143640
G017143640
IOSR Journals
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
IOSR Journals
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
inventionjournals
 
Big data
Big dataBig data
Big data
revathireddyb
 
Big data
Big dataBig data
Big data
revathireddyb
 
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
faizrashid1995
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
KamranKhan587
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
cscpconf
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
David Tjahjono,MD,MBA(UK)
 
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFSImplementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
IJEACS
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Cognizant
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
Graisy Biswal
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
datastack
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 

Similar to Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pays Off (20)

Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
G017143640
G017143640G017143640
G017143640
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
Hadoop
HadoopHadoop
Hadoop
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFSImplementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
paper
paperpaper
paper
 

More from Samsung Business USA

13 tricks to get the most out of the S Pen
13 tricks to get the most out of the S Pen13 tricks to get the most out of the S Pen
13 tricks to get the most out of the S Pen
Samsung Business USA
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom
Samsung Business USA
 
10 reasons to choose Galaxy Tab S9 for work on the go
10 reasons to choose Galaxy Tab S9 for work on the go10 reasons to choose Galaxy Tab S9 for work on the go
10 reasons to choose Galaxy Tab S9 for work on the go
Samsung Business USA
 
10 reasons to upgrade to Samsung’s Galaxy S23
10 reasons to upgrade to Samsung’s Galaxy S2310 reasons to upgrade to Samsung’s Galaxy S23
10 reasons to upgrade to Samsung’s Galaxy S23
Samsung Business USA
 
13 things you didn’t know you could do with the S Pen
13 things you didn’t know you could do with the S Pen13 things you didn’t know you could do with the S Pen
13 things you didn’t know you could do with the S Pen
Samsung Business USA
 
10 ways to fuel productivity
10 ways to fuel productivity10 ways to fuel productivity
10 ways to fuel productivity
Samsung Business USA
 
8 ways ViewFinity monitors set the standard for professionals
8 ways ViewFinity monitors set the standard for professionals8 ways ViewFinity monitors set the standard for professionals
8 ways ViewFinity monitors set the standard for professionals
Samsung Business USA
 
10 field-ready features of Galaxy XCover6 Pro
10 field-ready features of Galaxy XCover6 Pro10 field-ready features of Galaxy XCover6 Pro
10 field-ready features of Galaxy XCover6 Pro
Samsung Business USA
 
5 Ways Smart Hospital Rooms Can Improve The Patient Experience
5 Ways Smart Hospital Rooms Can Improve The Patient Experience5 Ways Smart Hospital Rooms Can Improve The Patient Experience
5 Ways Smart Hospital Rooms Can Improve The Patient Experience
Samsung Business USA
 
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdfINFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
Samsung Business USA
 
10 ways to fuel productivity with Galaxy Book
10 ways to fuel productivity with Galaxy Book10 ways to fuel productivity with Galaxy Book
10 ways to fuel productivity with Galaxy Book
Samsung Business USA
 
The best monitors for teachers: 5 features every educator needs
The best monitors for teachers: 5 features every educator needsThe best monitors for teachers: 5 features every educator needs
The best monitors for teachers: 5 features every educator needs
Samsung Business USA
 
To BYOD or not to BYOD?
To BYOD or not to BYOD?To BYOD or not to BYOD?
To BYOD or not to BYOD?
Samsung Business USA
 
6 smart ways Samsung's 85-inch Interactive Display changes the classroom
6 smart ways Samsung's 85-inch Interactive Display changes the classroom6 smart ways Samsung's 85-inch Interactive Display changes the classroom
6 smart ways Samsung's 85-inch Interactive Display changes the classroom
Samsung Business USA
 
13 things you didn't know you could do with the S Pen
13 things you didn't know you could do with the S Pen13 things you didn't know you could do with the S Pen
13 things you didn't know you could do with the S Pen
Samsung Business USA
 
10 reasons to choose Galaxy Tab S8 for work on the go
10 reasons to choose Galaxy Tab S8 for work on the go10 reasons to choose Galaxy Tab S8 for work on the go
10 reasons to choose Galaxy Tab S8 for work on the go
Samsung Business USA
 
10 reasons to upgrade to Galaxy S22 Ultra
10 reasons to upgrade to Galaxy S22 Ultra10 reasons to upgrade to Galaxy S22 Ultra
10 reasons to upgrade to Galaxy S22 Ultra
Samsung Business USA
 
9 benefits of mobilizing patient care with hospital technology
9 benefits of mobilizing patient care with hospital technology9 benefits of mobilizing patient care with hospital technology
9 benefits of mobilizing patient care with hospital technology
Samsung Business USA
 
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
Samsung Business USA
 
7 ways displays create a connected campus
7 ways displays create a connected campus7 ways displays create a connected campus
7 ways displays create a connected campus
Samsung Business USA
 

More from Samsung Business USA (20)

13 tricks to get the most out of the S Pen
13 tricks to get the most out of the S Pen13 tricks to get the most out of the S Pen
13 tricks to get the most out of the S Pen
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom
 
10 reasons to choose Galaxy Tab S9 for work on the go
10 reasons to choose Galaxy Tab S9 for work on the go10 reasons to choose Galaxy Tab S9 for work on the go
10 reasons to choose Galaxy Tab S9 for work on the go
 
10 reasons to upgrade to Samsung’s Galaxy S23
10 reasons to upgrade to Samsung’s Galaxy S2310 reasons to upgrade to Samsung’s Galaxy S23
10 reasons to upgrade to Samsung’s Galaxy S23
 
13 things you didn’t know you could do with the S Pen
13 things you didn’t know you could do with the S Pen13 things you didn’t know you could do with the S Pen
13 things you didn’t know you could do with the S Pen
 
10 ways to fuel productivity
10 ways to fuel productivity10 ways to fuel productivity
10 ways to fuel productivity
 
8 ways ViewFinity monitors set the standard for professionals
8 ways ViewFinity monitors set the standard for professionals8 ways ViewFinity monitors set the standard for professionals
8 ways ViewFinity monitors set the standard for professionals
 
10 field-ready features of Galaxy XCover6 Pro
10 field-ready features of Galaxy XCover6 Pro10 field-ready features of Galaxy XCover6 Pro
10 field-ready features of Galaxy XCover6 Pro
 
5 Ways Smart Hospital Rooms Can Improve The Patient Experience
5 Ways Smart Hospital Rooms Can Improve The Patient Experience5 Ways Smart Hospital Rooms Can Improve The Patient Experience
5 Ways Smart Hospital Rooms Can Improve The Patient Experience
 
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdfINFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
INFOGRAPHIC_10-field-ready-features-of-the-Tab-Active4_FINAL.pdf
 
10 ways to fuel productivity with Galaxy Book
10 ways to fuel productivity with Galaxy Book10 ways to fuel productivity with Galaxy Book
10 ways to fuel productivity with Galaxy Book
 
The best monitors for teachers: 5 features every educator needs
The best monitors for teachers: 5 features every educator needsThe best monitors for teachers: 5 features every educator needs
The best monitors for teachers: 5 features every educator needs
 
To BYOD or not to BYOD?
To BYOD or not to BYOD?To BYOD or not to BYOD?
To BYOD or not to BYOD?
 
6 smart ways Samsung's 85-inch Interactive Display changes the classroom
6 smart ways Samsung's 85-inch Interactive Display changes the classroom6 smart ways Samsung's 85-inch Interactive Display changes the classroom
6 smart ways Samsung's 85-inch Interactive Display changes the classroom
 
13 things you didn't know you could do with the S Pen
13 things you didn't know you could do with the S Pen13 things you didn't know you could do with the S Pen
13 things you didn't know you could do with the S Pen
 
10 reasons to choose Galaxy Tab S8 for work on the go
10 reasons to choose Galaxy Tab S8 for work on the go10 reasons to choose Galaxy Tab S8 for work on the go
10 reasons to choose Galaxy Tab S8 for work on the go
 
10 reasons to upgrade to Galaxy S22 Ultra
10 reasons to upgrade to Galaxy S22 Ultra10 reasons to upgrade to Galaxy S22 Ultra
10 reasons to upgrade to Galaxy S22 Ultra
 
9 benefits of mobilizing patient care with hospital technology
9 benefits of mobilizing patient care with hospital technology9 benefits of mobilizing patient care with hospital technology
9 benefits of mobilizing patient care with hospital technology
 
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
6 ways Samsung Kiosk gives retailers an all-in-one self-service solution
 
7 ways displays create a connected campus
7 ways displays create a connected campus7 ways displays create a connected campus
7 ways displays create a connected campus
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 

Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pays Off

  • 1. White Paper: Big Data SSD Architecture Digging Deep to Discover Where SSD Performance Pays Off
  • 2. Big data is much more than just “lots of data”. State-of-the- art applications gather data from many different sources in varying formats with complex relationships. Analytics yield insight into events, trends, and behaviors, and may adapt in real-time depending on specific findings. These data sets can indeed grow extremely large, with captured sequences kept for future analysis. Traditional database tools usually handle data with pre- determined structures and fixed relationships, typically on a scale-up server and storage cluster. Often, these tools link with some transactional process, providing front-end services for users while capturing back-end data. Larger deployments may use a data warehouse, which consolidates data from various applications into one managed, reliable repository of information for enterprise decision-making. New distributed, flexible, and scalable tools for multi-sourced and poly-structured big data typically employ numbers of less-expensive network processing and storage systems. Rather than forcing all data into a massive warehouse before examination, distributed big data processing nodes can relieve architectural stress points with localized processing. For larger tasks, local nodes can join in global processing, increasing parallelism and supporting chained jobs. Which storage technology, hard disk drives (HDDs) or solid-state drives (SSDs), excels in big data architecture? SSDs clearly win on speed, offering both higher sequential read/write speeds and higher input/output operations per second (IOPS). A naïve approach might replace all HDDs with SSDs everywhere in a big data system. However, deploying SSDs in hundreds or thousands of nodes could add up to a very expensive proposition. A better approach identifies critical locations where SSDs enable immediate cost-per-performance wins, and integrating HDDs used in less stressful roles to save on system costs. This whitepaper will look at the basics of big data tools, review two performance wins with SSDs in a well-known framework, as well as present some examples of emerging opportunities on the leading edge of big data technology. Location: SSDs in Big Data Architecture Adapted from “Defining the Big Data Architecture Framework,” Demchenko, University of Amsterdam, July 2013 5 Vs of Big Data • Terabytes • Records/Arch • Transactions • Tables, Files • Structured • Unstructured • Multi-factor • Probabilistic • Batch • Real/near-time • Processes • Streams • Trustworthiness • Authenticity • Origin, Reputation • Availability • Accountability • Statistical • Events • Correlations • Hypothetical Veracity Value VelocityVariety Volume
  • 3. Finding those precise locations where SSDs speed up big data operations is a hot topic. Researchers are digging deep inside architectures looking for spots where HDDs are overwhelmed and slowing things down. Before we introduce findings from some of these studies, a quick overview of big data tools and terminology is in order. Apache Hadoop is one of the most well known tools associated with big data architecture. Its framework provides a straightforward way to distribute and manage data on multiple networked computers – nodes – and schedule processing resources for applications. First released at the end of 2011, Hadoop has gained massive popularity in search engine, social networking, and cloud computing infrastructure.1 Hadoop leverages parallelism for both storage and computational resources. Instead of storing a very large file in a single location, Hadoop spreads files across many nodes. This effectively multiplies available file system bandwidth. Nodes also contain processing elements, allowing scheduling of jobs and tasks across multiple heterogeneous machines simultaneously. The architecture provides fault resilience; data is replicated on multiple Hadoop nodes, and if a node is down, another replaces it. Adding nodes enhances scalability, with large instances achieving hundreds of petabytes of storage. Inside Hadoop are two primary engines: HDFS (Hadoop Distributed File System) and MapReduce. Both run on the same nodes concurrently. HDFS manages files across a cluster containing a NameNode tracking metadata and a number of DataNodes containing the actual data. HDFS maintains data and rack (physical location) awareness between jobs and tasks, minimizing network transfers. It rebalances data within a cluster automatically, and supports snapshots, upgrades, and rollback for maintenance. Data can be accessed via a Java API, web browsers and a REST API, or other languages via the Thrift API. Applications using HDFS include the Apache HBase non- relational distributed database modeled after Google Big Table, and the Apache Hive data warehouse manager. Hadoop is open to other distributed file systems with varying functionality.2 MapReduce is an engine for parallel, distributed processing. A job separates a large input data set into smaller chunks, with data represented in (key, value) pairs. Map tasks process these chunks in parallel, then an intermediate shuffle function sorts the results passed to Reduce tasks for mathematical operations. MapReduce is by nature disk-based, batch oriented, and non-real-time.3 Basics of Big Data Tools MapReduce Layer HDFS Layer multi-node cluster master slave task tracker task tracker job tracker name node data node data node Adapted from “Running Hadoop on Ubuntu Linux (Multi-Node Cluster),” Michael Noll Hadoop Multi-Node Cluster
Locating Quick Wins with SSDs

Most computing jobs fall into one of two categories: compute-intensive or I/O-intensive. Studying the MapReduce workload, a team from Cloudera4 deployed one SSD versus several HDDs to approximate the same theoretical aggregate sequential read/write bandwidth, isolating performance differences. Hadoop clusters with all SSDs outperformed HDD clusters by as much as 70% under some workloads in the Cloudera study. SSDs also achieve about twice the actual sequential I/O size, and their lower latency supports more task scheduling.

For some workloads, the difference is less. Chopping up large files means sequential access is not the limiting factor – exactly what Hadoop intends to achieve. However, when file granularity exceeds what node memory can hold, an important change occurs:

"In practice, independent of I/O concurrency, there is negligible disk I/O for intermediate data that fits in memory, while a large amount of intermediate data leads to severe load on the disks."
—The Truth About MapReduce Performance on SSDs, 2014

MapReduce actually involves three phases: Map, shuffle, and Reduce. The intermediate shuffle operation stores its results in a single file on local disk. By targeting shuffle results to an SSD using the mapred.local.dir parameter, performance increases dramatically (a configuration sketch follows below). Splitting the SSD into multiple data directories also improves sequential read/write task scheduling. Hybrid configurations with HDDs and a properly configured SSD may be the most cost-effective quick win for Hadoop performance.

Hadoop also presents compute-intensive opportunities. To quote Apache, "moving computation is cheaper than moving data." A Microsoft study suggests that most analytics jobs do not use huge data sets, citing a data point from Facebook that 90% of jobs have input sizes under 100 GB.5 Their suggestion is to "scale up," but what they describe is actually a converged modular server with 32 cores and paired SSD storage, bypassing HDFS for local access. One point Microsoft makes strongly: "… Without SSDs, many Hadoop jobs become disk-bound." New PCIe SSDs using the latest NVMe protocol will be ideal to support compute-intensive workloads and keep the network free. This is vital for use cases such as streaming content and incoming IoT sensor data. Removing storage latency also allows more Hadoop jobs to be scheduled. Some applications, such as Bayesian classification used in machine learning, chain Hadoop jobs.

By concentrating processing on bigger multicore servers with all-SSD storage for some jobs, Microsoft finds that overall Hadoop cluster performance increases. Their suggested architecture is a mix of bigger all-SSD nodes and smaller, less expensive nodes (perhaps with hybrid drive configurations) combined in a scale-out approach, handling a range of Hadoop job sizes most efficiently.

[Figure: MapReduce Data Flow – input splits from HDFS feed Map tasks; their outputs are sorted, merged, and copied to Reduce tasks, whose output parts are written back to HDFS with replication.]
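As a rough illustration of the shuffle-to-SSD change discussed above, the sketch below sets mapred.local.dir programmatically before building a job. In practice this parameter is usually set cluster-wide in mapred-site.xml (and newer YARN clusters use yarn.nodemanager.local-dirs instead); the /mnt/ssd0 and /mnt/ssd1 mount points are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleOnSsd {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Send intermediate (shuffle) results to SSD-backed directories.
    // Splitting one SSD into several data directories can also help
    // task scheduling, as noted in the Cloudera study.
    conf.set("mapred.local.dir",
             "/mnt/ssd0/mapred/local,/mnt/ssd1/mapred/local");

    // HDFS block data can stay on HDDs (dfs.datanode.data.dir in
    // hdfs-site.xml), giving the hybrid HDD + SSD node described above.
    Job job = Job.getInstance(conf, "shuffle-on-ssd");
    // ... set mapper, reducer, and input/output paths exactly as in the
    // word-count sketch earlier, then call job.waitForCompletion(true).
  }
}

Because the change only redirects where intermediate data lands, existing HDD-based DataNode storage and application code stay untouched, which is what makes this a low-risk first step.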
More Prime Spots on Faster Routes

Open source projects attract innovation, and big data is no exception. The basics of Hadoop, HDFS, and MapReduce are established, and researchers are turning to other sophisticated techniques. In many cases, SSDs still hold the key to performance improvement.

Virtualization can consolidate multiple nodes on scale-up servers, but it inserts significant overhead. Samsung researchers found that Hadoop instances running over Xen virtualization suffer from 50% to 83% degradation in sort-based benchmarks using HDDs. The same configurations with SSDs fell only 10% to 21% when virtualized. Further analysis reveals why – virtualization decreases I/O block size and increases the number of requests, an environment where SSDs perform better.6

At Facebook, using HBase over HDFS for the message stack simplifies development, but introduces performance issues. Messages are small files; 90% are less than 15 MB. Compounding the problem, I/O is random, and "hot" frequently used data is often too large for RAM. Part of their solution is a 60 GB SSD configured as HBase cache that more than triples performance through reduced latency.7

Several efforts are after "in-memory" benefits. Aerospike has created a hybrid RAM/SSD architecture for a NoSQL database. Smaller files are in RAM, accessed with raw read/write operations bypassing the Linux file system, while larger files are on low-latency SSDs. Indices also reside in RAM. Data on the SSDs is contiguous and optimized for "one read per record."8

Teams at Virginia Tech redesigned HDFS for tiered storage clusters.9 Their study on hatS (heterogeneity-aware tiered storage) shows the value of aiming more requests at PCIe SSDs. All 27 worker nodes have a SATA HDD (3 racks, 9 nodes each), nine nodes have a SATA SSD (3 per rack), and three nodes have a PCIe SSD (1 in each rack). Running a synthetic Facebook benchmark with hatS, the average I/O rate of HDFS increases 36%, and execution time falls 26%.

Spark, a new in-memory architecture that could supplant MapReduce, allows programs to load resilient distributed datasets (RDDs) into cluster memory.10 Spark redefines shuffle algorithms, with each processor core (in earlier releases, each map task) creating a number of shuffle result files matching the number of reduce tasks. Contrary to popular belief, Spark RDDs can spill to disk for persistence. Benchmarks by Databricks deploying Spark on 206 Amazon EC2 i2.8xlarge nodes with SSD-based storage recently broke sort records, running three times faster while using one-tenth the number of machines of an equivalent Hadoop implementation.

MIT researchers find that if in-memory servers go to disk as little as 5% of the time, the overhead can nullify the gains from RAM-based nodes.11 Their experiment uses Xilinx FPGAs with ARM processing cores to pre-process Samsung SSD access in an interconnect fabric. A small cluster with 20 servers achieves near RAM-based speeds on image search, PageRank, and Memcached algorithms. This proves the value of tightly coupled nodes combining processing with an SSD.
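The sketch below illustrates, under stated assumptions, how a Spark program keeps an RDD in memory while letting overflow partitions and shuffle files land on an SSD. It uses Spark's Java API; spark.local.dir controls where shuffle and spill files are written, and the /mnt/ssd0/spark mount point and input path are hypothetical.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SparkSpillToSsd {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("rdd-spill-example")
        // Shuffle files and spilled RDD blocks are written here;
        // /mnt/ssd0/spark is a hypothetical SSD mount point.
        .set("spark.local.dir", "/mnt/ssd0/spark")
        // Fall back to a local master if none is supplied by spark-submit.
        .setIfMissing("spark.master", "local[*]");

    JavaSparkContext sc = new JavaSparkContext(conf);

    // Keep the dataset in memory, but allow partitions that do not fit
    // to spill to (SSD-backed) local disk instead of being recomputed.
    JavaRDD<String> lines = sc.textFile(args[0])
        .persist(StorageLevel.MEMORY_AND_DISK());

    long errorCount = lines.filter(l -> l.contains("ERROR")).count();
    System.out.println("error lines: " + errorCount);

    sc.stop();
  }
}

MEMORY_AND_DISK avoids recomputing partitions that overflow RAM; with spark.local.dir on an SSD, the penalty for that occasional spill is far smaller than with an HDD, in line with the MIT observation that even infrequent disk access can erase in-memory gains.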
Faster Now, Even Faster Soon

These studies all show that choosing the right locations for SSDs in big data architecture produces significant cost-per-performance gains. Existing Hadoop installations speed up instantly with a simple change targeting MapReduce shuffle results to an SSD instead of an HDD. New Hadoop installations can take advantage of more SSDs within asymmetric clusters, with a scale-up server handling intense jobs alongside a network of smaller machines, and with faster nodes adding localized pre-processing in front of their SSDs.

The big data community continues to press the boundaries, looking for architecture enhancements. Hybrid installations with enhanced file systems are showing promise. These approaches capitalize on the ability of SSDs to service more requests quickly, which translates to more job scheduling and improved cluster performance. Also noteworthy is the misconception that "in-memory" means no disk: newer big data architectures leverage RAM performance, but when they do go to disk, SSDs are essential to keeping overall cluster performance high.

One powerful observation is that big data is not necessarily operating on large, sequential files. When a cluster does its job, input files have been parsed into smaller pieces, and disk operations are more random and multi-threaded – an ideal use case where SSDs excel. As V-NAND flash and PCIe interface technology mature, the cost-per-performance metric shifts in favor of SSDs. High-performance SSDs will continue to move from just critical locations to broader use throughout big data clusters.

Samsung Enterprise SSD Portfolio

PM863 Series Data Center SSDs
• 3-bit MLC NAND
• Designed for read-intensive applications
• Built-in power loss protection
• SATA 6 Gb/s interface
• Form factor: 2.5"

SM863 Series Data Center SSDs
• 2-bit MLC NAND
• Designed for write-intensive applications
• Built-in power loss protection
• SATA 6 Gb/s interface
• Form factor: 2.5"

Samsung Workstation SSD Portfolio

950 PRO Series Client PC SSDs
• 2-bit MLC NAND
• Designed for high-end PCs
• PCIe interface with NVMe protocol
• Form factor: M.2

850 PRO Series Client PC SSDs
• 2-bit MLC NAND
• SATA 6 Gb/s interface
• Form factor: 2.5"

1 Apache Hadoop project.
2 Apache Hadoop documentation, "HDFS Architecture Guide."
3 Apache Hadoop documentation, "MapReduce Tutorial."
4 "The Truth About MapReduce Performance on SSDs," Karthik Kambatla and Yanpei Chen, Cloudera Inc. and Purdue University, presented at LISA14 (USENIX), November 2014.
5 "Scale-up vs Scale-out for Hadoop: Time to rethink?," Appuswamy et al., Microsoft Research, presented at SoCC '13, October 2013.
6 "Performance Implications of SSDs in Virtualized Hadoop Clusters," Ahn et al., Samsung Electronics, presented at IEEE BigData Congress, June 2014.
7 "Analysis of HDFS under HBase: A Facebook Messages Case Study," Harter et al., Facebook Inc. and University of Wisconsin–Madison, presented at FAST '14 (USENIX), February 2014.
8 Aerospike web site.
9 "hatS: A Heterogeneity-Aware Tiered Storage for Hadoop," Krish et al., Virginia Tech, presented at IEEE/ACM CCGrid 2014, May 2014.
10 "Spark the fastest open source engine for sorting a petabyte," Databricks, November 2014.
11 "Cutting cost and power consumption for big data," Larry Hardesty, Massachusetts Institute of Technology, July 10, 2015.

Learn more: samsung.com/enterprisessd | 1-866-SAM4BIZ
Follow us: youtube.com/samsungbizusa | @SamsungBizUSA | insights.samsung.com

©2016 Samsung Electronics America, Inc. All rights reserved. Samsung is a registered trademark of Samsung Electronics Co., Ltd. All products, logos and brand names are trademarks or registered trademarks of their respective companies. This white paper is for informational purposes only. Samsung makes no warranties, express or implied, in this white paper. WHP-SSD-BIGDATA-JAN16J