SCYLLADB WHITE PAPER
Scaling Up versus Scaling Out:
Mythbusting Database
Deployment Options
for Big Data
CONTENTS
OVERVIEW
THE NEW AGE OF NOSQL AND COMMODITY HARDWARE
WHAT IS A LARGE NODE?
ADVANTAGES OF LARGE NODES
NODE SIZE AND PERFORMANCE
NODE SIZE AND STREAMING
NODE SIZE AND COMPACTION
THE DATABASE MAKES A DIFFERENCE
OVERVIEW
In the world of Big Data, scaling out the
database is the norm. As a result, Big Data
deployments are often awash in clusters of
small nodes, which bring with them many
hidden costs. For example, scaling out usually
comes at the cost of resource efficiency,
since it can lead to low resource utilization.
So why is scaling out so common? Scaled
out deployment architectures are based on
several assumptions that we’ll examine in this
white paper. We’ll demonstrate that these
assumptions prove unfounded for database
infrastructure that is able to take advantage of
the rich computing resources available on large
nodes. In the process, we’ll show that Scylla is
uniquely positioned to leverage the multi-core
architecture that modern cloud platforms offer,
making it not only possible, but also preferable,
to use smaller clusters of large nodes for Big
Data deployments.
THE NEW AGE OF NOSQL
AND COMMODITY HARDWARE
The NoSQL revolution in database management
systems kicked off a full decade ago. Since
then, organizations of all sizes have benefitted
from a key feature that the NoSQL architecture
introduced: massive scale using relatively
inexpensive commodity hardware. Thanks to
this innovation, organizations have been able
to deploy architectures that would have been
prohibitively expensive and impossible to scale
using traditional relational database systems.
As businesses across industries embrace digital
transformation, their ability to align scale with
consumer demand has proven to be a key
competitive advantage. As a result, IT groups
constantly strive to balance competing demands
for higher performance and lower costs. Since
NoSQL makes it easy to scale out and back in
rapidly on cheaper hardware, it offers the best
of both worlds: performance along with cost
savings. This need for flexible and cost-effective
scale will only intensify as companies accelerate
digital transformation.
Over the same decade, ‘commodity hardware’
itself has undergone a transformation. However,
most modern software doesn’t take advantage
of modern computing resources. Most Big Data
frameworks that scale out don’t scale up. They
aren’t able to take advantage of the resources
offered by large nodes, such as the added CPU,
memory, and solid-state drives (SSDs), nor
can they store large amounts of data on disk
efficiently. Managed runtimes, like Java, are further constrained by heap size. Multi-threaded code, with its locking overhead and lack of attention to Non-Uniform Memory Access (NUMA), imposes a significant performance penalty on modern hardware architectures.
Software’s inability to keep up with hardware
advancements has led to the widespread belief
that running database infrastructure on many
small nodes is the optimal architecture for
scaling massive workloads. The alternative,
using small clusters of large nodes, is often
treated with skepticism.
These assumptions are largely based on
experience with NoSQL databases like Apache
Cassandra. Cassandra’s architecture prevents
it from efficiently exploiting modern computing
resources—in particular, multi-core CPUs. Even
as cloud platforms offer bigger and bigger
machine instances, with massive amounts of
memory, Cassandra is constrained by many
factors, primarily its use of Java but also its
caching model and its lack of an asynchronous
architecture. As a result, Cassandra is often
deployed on clusters of small instances that
run at low levels of utilization.
ScyllaDB set out to fix this with its close-to-
the-hardware Scylla database. Written in C++
instead of Java, Scylla implements a number
of low-level architectural techniques, such as
thread-per-core and sharding. Those techniques
combined with a shared-nothing architecture
enable Scylla to scale linearly with the available
resources. This means that Scylla can run
massive workloads on smaller clusters of larger
nodes, an approach that vastly reduces all of the
inputs to TCO.
A few common concerns are that large nodes won’t be fully utilized, that they have a hard time streaming data when scaling out, and, finally, that they might have a catastrophic effect on recovery times. Scylla breaks these assumptions, resulting in improved TCO and reduced maintenance.
In light of these concerns, we ran real-world
scenarios against Scylla to demonstrate that the
skepticism towards big nodes is misplaced. In fact,
with Scylla, big nodes are often best for Big Data.
WHAT IS A LARGE NODE?
Before diving in, it’s important to understand
what we mean by ‘node size’ in the context
of today’s cloud hardware. Since Amazon EC2
instances are regularly used to run database
infrastructure, we will use their offerings for
reference. The table below shows the Amazon EC2 i3 family, with increasingly large nodes, or ‘instances.’ As you can see, the number
of virtual CPUs spans from 2 to 64. The
amount of memory of those machines grows
proportionally, reaching almost half a terabyte
with the i3.16xlarge instance. Similarly, the
amount of storage grows from half a terabyte
to more than fifteen terabytes.
On the surface, the small nodes look inexpensive,
but prices are actually consistent across node
sizes. No matter how you organize your
resources—the number of cores, the amount of memory, and the number of disks—the price to use those resources in the cloud will be roughly
the same. Other cloud providers, such as
Microsoft Azure and Google Cloud, have similar
pricing structures, and therefore a comparable
cost benefit to running on bigger nodes.
ADVANTAGES OF LARGE NODES
Now that we’ve established resource price
parity among node sizes, let’s examine some
of the advantages that large nodes provide.
•	 Less Noisy Neighbors: On cloud platforms, multi-tenancy is the norm. A cloud platform
is, by definition, based on shared network
bandwidth, I/O, memory, storage, and so
on. As a result, a deployment of many small
nodes is susceptible to the ‘noisy neighbor’
effect. This effect is experienced when one
application or virtual machine consumes
more than its fair share of available resources.
As nodes increase in size, fewer and fewer
resources are shared among tenants. In fact,
beyond a certain size your applications are
likely to be the only tenant on the physical
machines on which your system is deployed.
This isolates your system from potential
degradation and outages. Large nodes shield
your systems from noisy neighbors.
•	 Fewer Failures: Since individual nodes, large and small, fail at roughly the same rate, a cluster of fewer, larger nodes delivers a higher mean time between failures, or “MTBF,” than a cluster of many small nodes. Failures in the
data layer require operator intervention, and
restoring a large node requires the same
amount of human effort as a small one. In a
cluster of a thousand nodes, you’ll see failures
every day. As a result, big clusters of small
nodes magnify administrative costs.
•	 Datacenter Density: Many organizations
with on-premises datacenters are seeking
to increase density by consolidating servers
into fewer, larger boxes with more computing
resources per server. Small clusters of
large nodes help this process by efficiently
consuming denser resources, in turn
decreasing energy and operating costs.
Instance Name | vCPU Count | Memory    | Instance Storage (NVMe SSD) | Price/Hour
i3.large      | 2          | 15.25 GiB | 0.475 TB                    | $0.15
i3.xlarge     | 4          | 30.5 GiB  | 0.950 TB                    | $0.31
i3.2xlarge    | 8          | 61 GiB    | 1.9 TB                      | $0.62
i3.4xlarge    | 16         | 122 GiB   | 3.8 TB (2 disks)            | $1.25
i3.8xlarge    | 32         | 244 GiB   | 7.6 TB (4 disks)            | $2.50
i3.16xlarge   | 64         | 488 GiB   | 15.2 TB (8 disks)           | $4.99

The Amazon EC2 i3 family
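To make the price-parity point concrete, here is a minimal sketch (ours, not from the original benchmark) that computes cost per vCPU-hour from the list prices in the table above:

```python
# Cost per vCPU-hour for each i3 instance, using the table's list prices.
i3_family = {
    "i3.large":    (2,  0.15),
    "i3.xlarge":   (4,  0.31),
    "i3.2xlarge":  (8,  0.62),
    "i3.4xlarge":  (16, 1.25),
    "i3.8xlarge":  (32, 2.50),
    "i3.16xlarge": (64, 4.99),
}
for name, (vcpus, price_per_hour) in i3_family.items():
    print(f"{name:12s} ${price_per_hour / vcpus:.4f} per vCPU-hour")
# Every size lands between roughly $0.075 and $0.078 per vCPU-hour,
# so a big node costs about the same per unit of compute as a small one.
```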
•	 Operational Simplicity: Big clusters of small
instances demand more attention, and
generate more alerts, than small clusters
of large instances. All of those small nodes
multiply the effort of real-time monitoring
and periodic maintenance, such as rolling
upgrades.
In spite of these significant benefits, systems
awash in a sea of small instances remain
commonplace. In the following sections, we’ll
run scenarios that investigate the assumptions
that lead to the use of small nodes instead of
big nodes, including performance, compaction,
and streaming. We’ll demonstrate that those
assumptions don’t always hold true when you
crunch the numbers.
NODE SIZE AND PERFORMANCE
A key consideration in environment sizing is
the performance characteristics of the software
stack. As we’ve pointed out, Scylla is written
on a framework that scales up as well as out,
enabling it to double performance as resources
double. To prove this, we ran a few scenarios
against Scylla.
This scenario uses a single cluster of 3 machines from the i3 family. (Although we’re using only 3 nodes here, in the real world Scylla runs in clusters of 100s of nodes, both big and small. In fact, one of our users runs a 1PB cluster with only 30 nodes.) The 3 nodes are configured as replicas (a replication factor of 3). Using a single loader machine, the reference workload inserts 1,000,000,000 partitions into the cluster as quickly as possible. We double the resources available to the database at each iteration. We also increase the number of loaders, doubling the dataset at each step.
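The paper doesn’t name its load generator, so as an illustrative sketch only, here is roughly what the reference workload looks like using the open-source Python cassandra-driver (which speaks Scylla’s Cassandra-compatible CQL interface); the keyspace, table, node addresses, and payload size are all assumptions:

```python
from cassandra.cluster import Cluster

# Connect to the 3-node cluster (addresses are placeholders).
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# Replication factor 3: every node holds a replica of the data.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.kv (pk bigint PRIMARY KEY, payload blob)"
)

# Write one row per partition as fast as possible; a real loader would
# issue these concurrently (e.g., via execute_async) from many threads.
insert = session.prepare("INSERT INTO demo.kv (pk, payload) VALUES (?, ?)")
payload = b"x" * 300  # assumed ~300-byte values
for pk in range(1_000_000_000):
    session.execute(insert, (pk, payload))
```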
As you can see in the chart below, ingest time for 300GB on nodes with 4 virtual CPUs is roughly equal to 600GB using 8 virtual CPUs, 1.2TB with 16 virtual CPUs, and 2.4TB with 32 virtual CPUs. Even when the total data size reached almost 5TB (on 64 virtual CPUs), ingest time for the workload remained relatively constant. These tests show that
Scylla’s linear scale-up characteristics make it
advantageous to scale up before scaling out,
since you can achieve the same results on fewer
machines.
NODE SIZE AND STREAMING
Some architects are concerned that putting
more data on fewer nodes increases the risks
associated with outages and data loss. You can
think of this as the ‘Big Basket’ problem. It may
seem intuitive that storing all of your data on a
few large nodes makes them more vulnerable
to outages, like putting all of your eggs in one
basket. Current thinking in distributed systems
tends to agree, arguing that amortizing failures
over many tiny and ostensibly inexpensive
instances is best. But this doesn’t necessarily
hold true. Scylla uses a number of innovative
techniques to ensure availability while also
accelerating recovery from failures, making
big nodes both safer and more economical.
[Chart: Time (in hours) to insert the growing dataset, by instance and dataset size, from i3.xlarge (1B partitions, 300GB) through i3.16xlarge (1B partitions, 4.8TB).]
The time to ingest the workload in Scylla remains constant as the dataset grows from 300GB to 4.8TB.
Restarting or replacing a node requires the
entire dataset to be replicated to the new
node via streaming. On the surface, replacing
a node by transferring 1TB seems preferable
to transferring 16TB. If larger nodes with more
power and faster disks speed up compaction,
can they also speed up replication? The most
obvious potential bottleneck to streaming is
network bandwidth.
But the network isn’t the bottleneck for
streaming. AWS’s i3.16xlarge instances have 25 Gbps network links—enough bandwidth to transfer 5TB of data in thirty minutes. In
practice, that speed proves impossible to
achieve because data streaming depends not
only on bandwidth, but also on CPU and disk.
A node that dedicates significant resources to
streaming will impact the latency of customer-
facing queries.
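The back-of-the-envelope arithmetic behind that bandwidth claim:

```python
# 25 Gbps link: how long would 5TB take at full line rate?
link_bytes_per_sec = 25 / 8 * 1e9     # 25 Gbps ~= 3.1 GB/s
seconds = 5e12 / link_bytes_per_sec   # 5 TB at line rate
print(f"{seconds / 60:.0f} minutes")  # ~27 minutes, about half an hour
```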
Streaming operations benefit from Scylla’s
autonomous and auto-tuning capabilities. Scylla
queues all requests in the database and then
uses an I/O scheduler and a CPU scheduler to
execute them in priority order. The schedulers
use algorithms derived from control theory to
determine the optimal pace at which to run
operations, based on available resources.
These autonomous capabilities result in rapid
replication with minimal impact on latency.
Since this happens dynamically, added
resources automatically factor into scheduling
decisions. Nodes can be brought up in a cluster
very quickly, further lowering the risk of running
smaller clusters with large nodes.
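As a conceptual illustration only (a toy model, not Scylla’s actual schedulers, which are built into the Seastar framework), a shares-based priority scheduler of this general shape can be sketched as follows; the class names and share values are invented:

```python
class ToyScheduler:
    """Toy shares-based scheduler: when several classes of work have
    tasks queued, the class with the least consumed runtime per share
    runs next, so high-share foreground work stays responsive."""

    def __init__(self):
        self.classes = {}  # name -> [shares, consumed_runtime, task_queue]

    def add_class(self, name, shares):
        self.classes[name] = [shares, 0.0, []]

    def submit(self, name, task, cost):
        self.classes[name][2].append((task, cost))

    def run_next(self):
        backlogged = [(consumed / shares, name)
                      for name, (shares, consumed, queue) in self.classes.items()
                      if queue]
        if not backlogged:
            return False
        _, name = min(backlogged)
        task, cost = self.classes[name][2].pop(0)
        self.classes[name][1] += cost  # charge the class for the work done
        task()
        return True

sched = ToyScheduler()
sched.add_class("queries", shares=1000)   # latency-sensitive foreground reads
sched.add_class("streaming", shares=200)  # background replication traffic
```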
To demonstrate this, we ran a streaming scenario against a Scylla cluster. Using the same clusters as in the previous demonstration, we destroyed a node and rebuilt it from the other two nodes in the cluster. Based on these results, we can see that it takes about the same amount of time to rebuild a node from the other two, even for large datasets. It doesn’t matter if the dataset is 300GB or almost 5TB.
Our experiment shows that the Scylla database
busts the myth that big nodes result in big
problems when a node goes down.
Calculating the real cost of failures involves more factors than the time to replicate via
streaming. If the software stack scales properly,
as it does with Scylla, the cost to restore a large
node is the same as for a small one. Since the
mean time between failures tends to remain
constant regardless of node size, fewer nodes
also means fewer failures, less complexity, and
lower maintenance overhead.
Efficient streaming enables nodes to be rebuilt
or scaled out with minimal impact. But a second
problem, cold caches, can also impact restarts.
What happens when a node is restarted, or a
new node is added to the cluster during rolling
upgrades or following hardware outages?
The typical database simply routes requests
randomly across replicas in the cluster. Nodes
with cold caches cannot sustain the same
throughput with the same latency.
[Chart: Time (in hours) to rebuild a node; the dataset grows twice as large at each step.]

Instance    | Dataset               | Rebuild time
i3.xlarge   | 1B partitions, 300GB  | 5:28:59
i3.2xlarge  | 1B partitions, 600GB  | 6:10:55
i3.4xlarge  | 1B partitions, 1.2 TB | 6:05:05
i3.8xlarge  | 1B partitions, 2.4 TB | 5:50:54
i3.16xlarge | 1B partitions, 4.8 TB | 6:03:25

The time to rebuild a Scylla node remains constant with larger datasets.
Scylla takes a different approach. Our patent-
pending technology, heat-weighted load
balancing, intelligently routes requests across
the cluster to ensure that node restarts don’t
impact throughput or latency. Heat-weighted
load balancing implements an algorithm that
optimizes the ratio between a node’s cache
hit rate and the proportion of requests it
receives. Over time, the node serving as request
coordinator sends a smaller proportion of reads
to the nodes that were up, while sending more
reads to the restored node. The feature enables
operators to restore or add nodes to clusters
efficiently, since rebooting individual nodes no
longer affects the cluster’s read latency.
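As a rough sketch of the idea (our illustration, not ScyllaDB’s patent-pending algorithm), a coordinator could weight read routing by each replica’s observed cache hit rate, with a small floor so a cold node still receives enough traffic to warm up:

```python
import random

def pick_replica(hit_rates, floor=0.05):
    """hit_rates: node -> observed cache hit rate in [0, 1].
    Returns a replica chosen with probability proportional to its
    cache warmth; the floor keeps cold nodes warming up gradually."""
    weights = {node: max(rate, floor) for node, rate in hit_rates.items()}
    r = random.uniform(0, sum(weights.values()))
    for node, w in weights.items():
        r -= w
        if r <= 0:
            return node

# A freshly restarted node-c starts with ~5% of reads and earns more
# as its cache hit rate climbs toward its peers'.
print(pick_replica({"node-a": 0.92, "node-b": 0.90, "node-c": 0.02}))
```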
NODE SIZE AND COMPACTION
Readers with real-world Apache Cassandra experience might react to large nodes with a healthy dose of skepticism. In Apache
Cassandra, a node with just 5TB can take a long
time to compact. (Compaction is a background
process similar to garbage collection in the Java
programming language, except that it works
with disk data, rather than objects in memory).
Since compaction time increases along with the
size of the dataset, it’s often cited as a reason
to avoid big nodes for data infrastructure. If
compactions happen too quickly, foreground workloads are destroyed because all of the CPU and disk resources are soaked up. If compaction is too slow, then reads suffer. A database like Cassandra requires operators to find the proper settings through trial and error, and once set, they are static. With those databases, resource consumption remains fixed regardless of shifting workloads.

[Chart: Time (in hours) to compact the growing dataset, by instance and dataset size.]

Instance    | Dataset               | Compaction time
i3.xlarge   | 1B partitions, 300GB  | 1:45:27
i3.2xlarge  | 1B partitions, 600GB  | 1:47:05
i3.4xlarge  | 1B partitions, 1.2 TB | 2:00:41
i3.8xlarge  | 1B partitions, 2.4 TB | 2:02:59
i3.16xlarge | 1B partitions, 4.8 TB | 2:11:34

Compaction times in Scylla remained level as the dataset doubled with each successive iteration.

[Figure: read latency before and after, showing that Scylla prevents latency spikes when nodes are restarted.]
As noted in the context of streaming, Scylla
takes a very different approach, providing
autonomous operations and auto-tuning that
give priority to foreground operations over
background tasks like compaction—even as workloads fluctuate. The end result is rapid
compaction with no measurable effect on end
user queries.
To prove this point, we forced a major compaction on a node using Scylla’s nodetool, which is compatible with Cassandra’s. We called nodetool compact on one of the nodes in the same cluster as in the previous experiment. The tool forces the node to compact all of its SSTables into a single one.
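For reference, a minimal way to reproduce and time such a forced major compaction (a sketch; the keyspace name is a placeholder):

```python
import subprocess
import time

start = time.time()
# 'nodetool compact <keyspace>' forces a major compaction, merging all
# of the keyspace's SSTables on this node into a single SSTable.
subprocess.run(["nodetool", "compact", "my_keyspace"], check=True)
print(f"major compaction finished in {time.time() - start:.0f} seconds")
```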
Remarkably, these results show that compaction
time on Scylla remains constant for a dataset
of 300GB versus 5TB, as long as all instances
in the cluster scale resources proportionately.
This demonstrates that one of the primary
arguments against using large nodes for Big
Data hinges on the false assumption that
compaction of huge datasets must necessarily
be slow. Scylla’s linear scale-up characteristics
show that the opposite is true; compaction
of large datasets can also benefit from the
resources of large nodes.
THE DATABASE MAKES
A DIFFERENCE
As our scenarios demonstrate, running on fewer
nodes is inherently advantageous. The highly
performant Scylla database reduces the number
of nodes required for Big Data applications.
Its scale-up capabilities enable organizations to further reduce node count by moving to larger machines. By increasing the size of nodes and reducing their number, IT organizations can realize benefits at all levels, especially in TCO. Moving from a large cluster of small nodes to a small cluster of large nodes has a significant impact on TCO, while vastly improving the mean time between failures (MTBF) and reducing administrative overhead.
The benefits of scaling up that we have
demonstrated in this paper have been proven
in the field. The diagram below illustrates
the server costs and administration benefits
experienced by a ScyllaDB customer. This
customer migrated a Cassandra installation,
distributed across 120 i3.2xl AWS instances
with 8 virtual CPUs each, to Scylla. Using the same node sizes, Scylla achieved the customer’s performance targets with a much smaller cluster of only 24 nodes. The initial reduction in node sprawl produced a 5X reduction in server costs, from $651,744 to $130,348 annually. It also achieved a 5X reduction in administrative overhead, encompassing failures, upgrades, monitoring, etc.

[Diagram: Cassandra on 120 i3.2xl instances ($651,744/year) → Scylla on 24 i3.2xl instances ($130,348/year), a 5x cost reduction and 5x less administration → Scylla on 3 i3.16xl instances ($130,348/year), a further 8x less administration. Results: 5x TCO reduction and MTBF improved by 40x.]

Scylla reduces server cost by 5X and improves MTBF by 40X.
Given Scylla’s scale-up capabilities, the customer then moved to the biggest nodes available, i3.16xl instances with 64 virtual CPUs each. While the cost remained the same,
those 3 large nodes were capable of maintaining
the customer’s SLAs. Scaling up resulted in
reducing complexity (and administration,
failures, etc.) by a factor of 40 compared
with where the customer began.
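The 40X figure follows directly from node count: if individual nodes fail at roughly the same rate regardless of size, cluster-level MTBF scales inversely with the number of nodes. A quick sketch (the per-node failure rate is an illustrative assumption; only the ratio matters):

```python
# Cluster MTBF ~ per-node MTBF / node count, assuming independent,
# roughly equal per-node failure rates. One failure per node-year is
# an assumption for illustration.
node_mtbf_years = 1.0

for nodes in (120, 24, 3):
    cluster_mtbf_days = node_mtbf_years / nodes * 365
    print(f"{nodes:3d} nodes -> a failure roughly every {cluster_mtbf_days:.0f} days")

print(f"Improvement from 120 nodes to 3: {120 / 3:.0f}x")  # 40x
```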
As we’ve seen, there are many advantages to
scaling up a distributed database environment
before scaling out. If you’ve already scaled out,
scaling up offers an escape from the costs and
overhead of node sprawl.
Scylla gives users both the scale-out benefits of NoSQL and the scale-up characteristics that enable database consolidation and improved TCO.
Ask yourself...
•	 Are you concerned your database isn’t
efficiently utilizing all of its CPUs and disks?
•	 Are you investing too much time in managing
excessively large clusters?
•	 Does your team have to replace dead nodes
at odd hours on the weekends and holidays?
•	 Are you struggling to scale with your business
as you grow 2X, 4X, 10X?
•	 Are you unable to meet service level
agreements during failures and maintenance
operations?
If the answer to even one of these questions is ‘yes,’ we’d suggest evaluating your distributed database options. Open source Scylla is available for download from our website.
Or start by running a cloud-hosted Scylla Test Drive, which lets you spin up a hosted Scylla cluster in minutes.
Copyright © 2018 ScyllaDB Inc. All rights reserved. All trademarks or
registered trademarks used herein are property of their respective owners.
United States Headquarters
1900 Embarcadero Rd
Palo Alto, CA 94303 U.S.A.
Phone: +1 (650) 332-9582
Email: info@scylladb.com
Israel Headquarters
11 Galgalei Haplada
Herzelia, Israel
SCYLLADB.COM
ABOUT US
Founded by the team responsible for the KVM
hypervisor, ScyllaDB is the company behind
the Scylla real-time big data database. Fully
compatible with Apache Cassandra, Scylla
embraces a shared-nothing approach that
increases throughput and storage capacity up
to 10X that of Cassandra. AppNexus, Samsung,
Olacabs, Grab, Investing.com, Snapfish, Zen.ly,
IBM’s Compose and many more leading
companies have adopted Scylla to realize
order-of-magnitude performance improvements
and reduce hardware costs.
For more information, please visit ScyllaDB.com
More Related Content

What's hot

OpenDrives_-_Product_Sheet_v13D (2) (1)
OpenDrives_-_Product_Sheet_v13D (2) (1)OpenDrives_-_Product_Sheet_v13D (2) (1)
OpenDrives_-_Product_Sheet_v13D (2) (1)Scott Eiser
 
SanDisk: Persistent Memory and Cassandra
SanDisk: Persistent Memory and CassandraSanDisk: Persistent Memory and Cassandra
SanDisk: Persistent Memory and CassandraDataStax Academy
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoopdarugar
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weitingWei Ting Chen
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1Donghan Kim
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Aws meetup (sep 2015) exprimir cada centavo
Aws meetup (sep 2015)   exprimir cada centavoAws meetup (sep 2015)   exprimir cada centavo
Aws meetup (sep 2015) exprimir cada centavoSebastian Montini
 
Seagate Implementation of Dense Storage Utilizing HDDs and SSDs
Seagate Implementation of Dense Storage Utilizing HDDs and SSDsSeagate Implementation of Dense Storage Utilizing HDDs and SSDs
Seagate Implementation of Dense Storage Utilizing HDDs and SSDsRed_Hat_Storage
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataRichard McDougall
 
White paper whitewater-datastorageinthecloud
White paper whitewater-datastorageinthecloudWhite paper whitewater-datastorageinthecloud
White paper whitewater-datastorageinthecloudAccenture
 
Virtual graphic workspace
Virtual graphic workspace Virtual graphic workspace
Virtual graphic workspace TTEC
 
Data Storage & I/O Performance: Solving I/O Slowdown: The "Noisy Neighbor" Pr...
Data Storage & I/O Performance: Solving I/O Slowdown: The "Noisy Neighbor" Pr...Data Storage & I/O Performance: Solving I/O Slowdown: The "Noisy Neighbor" Pr...
Data Storage & I/O Performance: Solving I/O Slowdown: The "Noisy Neighbor" Pr...inside-BigData.com
 

What's hot (20)

OpenDrives_-_Product_Sheet_v13D (2) (1)
OpenDrives_-_Product_Sheet_v13D (2) (1)OpenDrives_-_Product_Sheet_v13D (2) (1)
OpenDrives_-_Product_Sheet_v13D (2) (1)
 
SanDisk: Persistent Memory and Cassandra
SanDisk: Persistent Memory and CassandraSanDisk: Persistent Memory and Cassandra
SanDisk: Persistent Memory and Cassandra
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
 
Hadoop Research
Hadoop Research Hadoop Research
Hadoop Research
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Aws meetup (sep 2015) exprimir cada centavo
Aws meetup (sep 2015)   exprimir cada centavoAws meetup (sep 2015)   exprimir cada centavo
Aws meetup (sep 2015) exprimir cada centavo
 
Seagate Implementation of Dense Storage Utilizing HDDs and SSDs
Seagate Implementation of Dense Storage Utilizing HDDs and SSDsSeagate Implementation of Dense Storage Utilizing HDDs and SSDs
Seagate Implementation of Dense Storage Utilizing HDDs and SSDs
 
Wolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat DresdenWolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat Dresden
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data Storage
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big Data
 
White paper whitewater-datastorageinthecloud
White paper whitewater-datastorageinthecloudWhite paper whitewater-datastorageinthecloud
White paper whitewater-datastorageinthecloud
 
Mesos by zigi
Mesos by zigiMesos by zigi
Mesos by zigi
 
Virtual graphic workspace
Virtual graphic workspace Virtual graphic workspace
Virtual graphic workspace
 
VMworld 2009: VMworld Data Center
VMworld 2009: VMworld Data CenterVMworld 2009: VMworld Data Center
VMworld 2009: VMworld Data Center
 
Data Storage & I/O Performance: Solving I/O Slowdown: The "Noisy Neighbor" Pr...
Data Storage & I/O Performance: Solving I/O Slowdown: The "Noisy Neighbor" Pr...Data Storage & I/O Performance: Solving I/O Slowdown: The "Noisy Neighbor" Pr...
Data Storage & I/O Performance: Solving I/O Slowdown: The "Noisy Neighbor" Pr...
 
GlusterFS as a DFS
GlusterFS as a DFSGlusterFS as a DFS
GlusterFS as a DFS
 
Cassandra admin
Cassandra adminCassandra admin
Cassandra admin
 

Similar to Scaling Up vs. Scaling-out

Steps to Modernize Your Data Ecosystem | Mindtree
Steps to Modernize Your Data Ecosystem | Mindtree									Steps to Modernize Your Data Ecosystem | Mindtree
Steps to Modernize Your Data Ecosystem | Mindtree AnikeyRoy
 
Six Steps to Modernize Your Data Ecosystem - Mindtree
Six Steps to Modernize Your Data Ecosystem  - MindtreeSix Steps to Modernize Your Data Ecosystem  - Mindtree
Six Steps to Modernize Your Data Ecosystem - Mindtreesamirandev1
 
6 Steps to Modernize Data Ecosystem with Mindtree
6 Steps to Modernize Data Ecosystem with Mindtree6 Steps to Modernize Data Ecosystem with Mindtree
6 Steps to Modernize Data Ecosystem with Mindtreedevraajsingh
 
Steps to Modernize Your Data Ecosystem with Mindtree Blog
Steps to Modernize Your Data Ecosystem with Mindtree Blog Steps to Modernize Your Data Ecosystem with Mindtree Blog
Steps to Modernize Your Data Ecosystem with Mindtree Blog sameerroshan
 
The Distributed Cloud
The Distributed CloudThe Distributed Cloud
The Distributed CloudWowd
 
Webcast Q&A- Big Data Architectures Beyond Hadoop
Webcast Q&A- Big Data Architectures Beyond HadoopWebcast Q&A- Big Data Architectures Beyond Hadoop
Webcast Q&A- Big Data Architectures Beyond HadoopImpetus Technologies
 
Cluster Computers
Cluster ComputersCluster Computers
Cluster Computersshopnil786
 
MongoDB Sharding
MongoDB ShardingMongoDB Sharding
MongoDB Shardinguzzal basak
 
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaTechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaOpenNebula Project
 
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageI-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageredpel dot com
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperDavid Walker
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESneirew J
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic WebIrina Hutanu
 

Similar to Scaling Up vs. Scaling-out (20)

Steps to Modernize Your Data Ecosystem | Mindtree
Steps to Modernize Your Data Ecosystem | Mindtree									Steps to Modernize Your Data Ecosystem | Mindtree
Steps to Modernize Your Data Ecosystem | Mindtree
 
Six Steps to Modernize Your Data Ecosystem - Mindtree
Six Steps to Modernize Your Data Ecosystem  - MindtreeSix Steps to Modernize Your Data Ecosystem  - Mindtree
Six Steps to Modernize Your Data Ecosystem - Mindtree
 
6 Steps to Modernize Data Ecosystem with Mindtree
6 Steps to Modernize Data Ecosystem with Mindtree6 Steps to Modernize Data Ecosystem with Mindtree
6 Steps to Modernize Data Ecosystem with Mindtree
 
Steps to Modernize Your Data Ecosystem with Mindtree Blog
Steps to Modernize Your Data Ecosystem with Mindtree Blog Steps to Modernize Your Data Ecosystem with Mindtree Blog
Steps to Modernize Your Data Ecosystem with Mindtree Blog
 
The Distributed Cloud
The Distributed CloudThe Distributed Cloud
The Distributed Cloud
 
Report 2.0.docx
Report 2.0.docxReport 2.0.docx
Report 2.0.docx
 
Webcast Q&A- Big Data Architectures Beyond Hadoop
Webcast Q&A- Big Data Architectures Beyond HadoopWebcast Q&A- Big Data Architectures Beyond Hadoop
Webcast Q&A- Big Data Architectures Beyond Hadoop
 
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
 
Cluster Computers
Cluster ComputersCluster Computers
Cluster Computers
 
MongoDB Sharding
MongoDB ShardingMongoDB Sharding
MongoDB Sharding
 
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaTechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
 
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageI-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
Apache ignite v1.3
Apache ignite v1.3Apache ignite v1.3
Apache ignite v1.3
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic Web
 
10g db grid
10g db grid10g db grid
10g db grid
 

Recently uploaded

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Scaling Up vs. Scaling-out

  • 1. SCYLLADB WHITE PAPER Scaling Up versus Scaling Out: Mythbusting Database Deployment Options for Big Data
  • 2. CONTENTS OVERVIEW 3 THE NEW AGE OF NOSQL AND COMMODITY HARDWARE 3 WHAT IS A LARGE NODE? 4 ADVANTAGES OF LARGE NODES 4 NODE SIZE AND PERFORMANCE 5 NODE SIZE AND STREAMING 5 NODE SIZE AND COMPACTION 7 THE DATABASE MAKES A DIFFERENCE 8
  • 3. 3 OVERVIEW In the world of Big Data, scaling out the database is the norm. As a result, Big Data deployments are often awash in clusters of small nodes, which bring with them many hidden costs. For example, scaling out usually comes at the cost of resource efficiency, since it can lead to low resource utilization. So why is scaling out so common? Scaled out deployment architectures are based on several assumptions that we’ll examine in this white paper. We’ll demonstrate that these assumptions prove unfounded for database infrastructure that is able to take advantage of the rich computing resources available on large nodes. In the process, we’ll show that Scylla is uniquely positioned to leverage the multi-core architecture that modern cloud platforms offer, making it not only possible, but also preferable, to use smaller clusters of large nodes for Big Data deployments. THE NEW AGE OF NOSQL AND COMMODITY HARDWARE The NoSQL revolution in database management systems kicked off a full decade ago. Since then, organizations of all sizes have benefitted from a key feature that the NoSQL architecture introduced: massive scale using relatively inexpensive commodity hardware. Thanks to this innovation, organizations have been able to deploy architectures that would have been prohibitively expensive and impossible to scale using traditional relational database systems. As businesses across industries embrace digital transformation, their ability to align scale with consumer demand has proven to be a key competitive advantage. As a result, IT groups constantly strive to balance competing demands for higher performance and lower costs. Since NoSQL makes it easy to scale out and back in rapidly on cheaper hardware, it offers the best of both worlds: performance along with cost savings. This need for flexible and cost-effective scale will only intensify as companies accelerate digital transformation. Over the same decade, ‘commodity hardware’ itself has undergone a transformation. However, most modern software doesn’t take advantage of modern computing resources. Most Big Data frameworks that scale out, don’t scale up. They aren’t able to take advantage of the resources offered by large nodes, such as the added CPU, memory, and solid-state drives (SSDs), nor can they store large amounts of data on disk efficiently. Managed runtimes, like Java, are further constrained by heap size. Multi-threaded code, with its locking overhead and lack of attention for Non-Uniform Memory Architecture (NUMA), imposes a significant performance penalty against modern hardware architectures. Software’s inability to keep up with hardware advancements has led to the widespread belief that running database infrastructure on many small nodes is the optimal architecture for scaling massive workloads. The alternative, using small clusters of large nodes, is often treated with skepticism. These assumptions are largely based on experience with NoSQL databases like Apache Cassandra. Cassandra’s architecture prevents it from efficiently exploiting modern computing resources—in particular, multi-core CPUs. Even as cloud platforms offer bigger and bigger machine instances, with massive amounts of memory, Cassandra is constrained by many factors, primarily its use of Java but also its caching model and its lack of an asynchronous architecture. As a result, Cassandra is often deployed on clusters of small instances that run at low-levels of utilization. ScyllaDB set out to fix this with its close-to- the-hardware Scylla database. 
Written in C++ instead of Java, Scylla implements a number of low-level architectural techniques, such as thread-per-core and sharding. Those techniques combined with a shared-nothing architecture enable Scylla to scale linearly with the available resources. This means that Scylla can run massive workloads on smaller clusters of larger nodes, an approach that vastly reduces all of the inputs to TCO.
  • 4. 4 A few common concerns are that large nodes won’t be fully utilized, that they have a hard time streaming data when scaling out and, finally, they might have a catastrophic effect on recovery times. Scylla breaks these assumptions, resulting in improved TCO and reduced maintenance. In light of these concerns, we ran real-world scenarios against Scylla to demonstrate that the skepticism towards big nodes is misplaced. In fact, with Scylla, big nodes are often best for Big Data. WHAT IS A LARGE NODE? Before diving in, it’s important to understand what we mean by ‘node size’ in the context of today’s cloud hardware. Since Amazon EC2 instances are regularly used to run database infrastructure, we will use their offerings for reference. The image below shows the Amazon EC2 i3 family, with increasingly large nodes, or ‘instances.’ As you can see, the number of virtual CPUs spans from 2 to 64. The amount of memory of those machines grows proportionally, reaching almost half a terabyte with the i3.16xlarge instance. Similarly, the amount of storage grows from half a terabyte to more than fifteen terabytes. On the surface, the small nodes look inexpensive, but prices are actually consistent across node sizes. No matter how you organize your resources—the number of cores, the amount of memory, and the number of disks­—the price to use those resources in the cloud will be roughly the same. Other cloud providers, such as Microsoft Azure and Google Cloud, have similar pricing structures, and therefore a comparable cost benefit to running on bigger nodes. ADVANTAGES OF LARGE NODES Now that we’ve established resource price parity among node sizes, let’s examine some of the advantages that large nodes provide. • Less Noisy Neighbors: On cloud platforms multi-tenancy is the norm. A cloud platform is, by definition, based on shared network bandwidth, I/O, memory, storage, and so on. As a result, a deployment of many small nodes is susceptible to the ‘noisy neighbor’ effect. This effect is experienced when one application or virtual machine consumes more than its fair share of available resources. As nodes increase in size, fewer and fewer resources are shared among tenants. In fact, beyond a certain size your applications are likely to be the only tenant on the physical machines on which your system is deployed. This isolates your system from potential degradation and outages. Large nodes shield your systems from noisy neighbors. • Fewer Failures: Since nodes large and small fail at roughly the same rate, large nodes deliver a higher mean time between failures, or “MTBF” than small nodes. Failures in the data layer require operator intervention, and restoring a large node requires the same amount of human effort as a small one. In a cluster of a thousand nodes, you’ll see failures every day. As a result, big clusters of small nodes magnify administrative costs. • Datacenter Density: Many organizations with on-premises datacenters are seeking to increase density by consolidating servers into fewer, larger boxes with more computing resources per server. Small clusters of large nodes help this process by efficiently consuming denser resources, in turn decreasing energy and operating costs. 
Instance Name vCPU Count Memory Instance Storage (NVMe SSD) Price/ Hour i3.large 2 15.25 GiB 0.475 TB $0.15 i3.xlarge 4 30.5 GiB 0.950 TB $0.31 i3.2xlarge 8 61GiB 1.9 TB $0.62 i3.4xlarge 16 122 GiB 3.8 TB (2 disks) $1.25 i3.8xlarge 32 244 GiB 7.6 TB (4 disks) $2.50 i3.16xlarge 64 488 GiB 15.2 TB (8 disks) $4.99 The Amazon EC2 i3 family
  • 5. 5 • Operational Simplicity: Big clusters of small instances demand more attention, and generate more alerts, than small clusters of large instances. All of those small nodes multiply the effort of real-time monitoring and periodic maintenance, such as rolling upgrades. In spite of these significant benefits, systems awash in a sea of small instances remain commonplace. In the following sections, we’ll run scenarios that investigate the assumptions that lead to the use of small nodes instead of big nodes, including performance, compaction, and streaming. We’ll demonstrate that those assumptions don’t always hold true when you crunch the numbers. NODE SIZE AND PERFORMANCE A key consideration in environment sizing is the performance characteristics of the software stack. As we’ve pointed out, Scylla is written on a framework that scales up as well as out, enabling it to double performance as resources double. To prove this, we ran a few scenarios against Scylla. This scenario uses a single cluster of 3 machines from the i3 family. (Although we’re using only 3 nodes here, out in the real-world Scylla runs in clusters of 100s of nodes, both big and small. In fact, one of our users runs a 1PB cluster with only 30 nodes). 3 nodes are configured as replicas (a replication factor of 3). Using a single loader machine, the reference workload inserts 1,000,000,000 partitions to the cluster as quickly as possible. We double resources available to the database at each iteration. We also increase the number of loaders, doubling the number of partitions and the dataset at each step. As you can see in the chart below, ingest time of 300GB with 4 virtual CPU clusters is roughly equal to 600GB using 8 virtual CPUs, 1.2TB with 16 virtual CPUs, 24TB with 16 virtual CPUs and even when the total data size reached almost 5TB, ingest time for the workload remained relatively constant. These tests show that Scylla’s linear scale-up characteristics make it advantageous to scale up before scaling out, since you can achieve the same results on fewer machines. NODE SIZE AND STREAMING Some architects are concerned that putting more data on fewer nodes increases the risks associated with outages and data loss. You can think of this as the ‘Big Basket’ problem. It may seem intuitive that storing all of your data on a few large nodes makes them more vulnerable to outages, like putting all of your eggs in one basket. Current thinking in distributed systems tends to agree, arguing that amortizing failures over many tiny and ostensibly inexpensive instances is best. But this doesn’t necessarily hold true. Scylla uses a number of innovative techniques to ensure availability while also accelerating recovery from failures, making big nodes both safer and more economical. 12:00:00 16:00:00 Time (in hours) to insert the growing dataset Instance and dataset size Timeinhours 8:00:00 4:00:00 0:00:00 i3.xlarge (1B partitions, 300GB) i3.2xlarge (1B partitions, 600GB) i3.4xlarge (1B partitions, 1.2 TB) i3.8xlarge (1B partitions, 2.4 TB) i3.16xlarge (1B partitions, 4.8 TB) The time to ingest the workload in Scylla remains constant as the dataset grows from 300GB to 4.8TB.
  • 6. 6 Restarting or replacing a node requires the entire dataset to be replicated to the new node via streaming. On the surface, replacing a node by transferring 1TB seems preferable to transferring 16TB. If larger nodes with more power and faster disks speed up compaction, can they also speed up replication? The most obvious potential bottleneck to streaming is network bandwidth. But the network isn’t the bottleneck for streaming. AWS’s i3.16xlarge instances have 25 Gbps network links­­­—enough bandwidth to transfer 5TB of data in thirty minutes. In practice, that speed proves impossible to achieve because data streaming depends not only on bandwidth, but also on CPU and disk. A node that dedicates significant resources to streaming will impact the latency of customer- facing queries. Streaming operations benefit from Scylla’s autonomous and auto-tuning capabilities. Scylla queues all requests in the database and then uses an I/O scheduler and a CPU scheduler to execute them in priority order. The schedulers use algorithms derived from control theory to determine the optimal pace at which to run operations, based on available resources. These autonomous capabilities result in rapid replication with minimal impact on latency. Since this happens dynamically, added resources automatically factor into scheduling decisions. Nodes can be brought up in a cluster very quickly, further lowering the risk of running smaller clusters with large nodes. To demonstrate this, we ran a streaming scenario against a Scylla cluster. Using the same clusters as in the previous demonstration, we destroyed the node and rebuilt it from the other two nodes in the cluster. Based on these results, we can see that it takes about the same amount of time to rebuild a node from the other two, even for large datasets. It doesn’t matter if the dataset is 300GB or almost 5TB. Our experiment shows that the Scylla database busts the myth that big nodes result in big problems when a node goes down. Calculating the real cost of failures involves more factors than than the time to replicate via streaming. If the software stack scales properly, as it does with Scylla, the cost to restore a large node is the same as for a small one. Since the mean time between failures tends to remain constant regardless of node size, fewer nodes also means fewer failures, less complexity, and lower maintenance overhead. Efficient streaming enables nodes to be rebuilt or scaled out with minimal impact. But a second problem, cold caches, can also impact restarts. What happens when a node is restarted, or a new node is added to the cluster during rolling upgrades or following hardware outages? The typical database simply routes requests randomly across replicas in the cluster. Nodes with cold caches cannot sustain the same throughput with the same latency. 6:00:00 8:00:00 Time (in hours) to rebuild a node (dataset grows twice as large each step) Instance and dataset size Timeinhours 4:00:00 2:00:00 0:00:00 i3.xlarge (1B partitions, 300GB) 5:28:59 6:10:55 6:05:05 5:50:54 6:03:25 i3.2xlarge (1B partitions, 600GB) i3.4xlarge (1B partitions, 1.2 TB) i3.8xlarge (1B partitions, 2.4 TB) i3.16xlarge (1B partitions, 4.8 TB) The time to rebuild a Scylla node remains constant with larger datasets.
Efficient streaming enables nodes to be rebuilt or scaled out with minimal impact. But a second problem, cold caches, can also affect restarts. What happens when a node is restarted, or a new node is added to the cluster during rolling upgrades or following hardware outages? The typical database simply routes requests randomly across replicas in the cluster. Nodes with cold caches cannot sustain the same throughput at the same latency.

Scylla takes a different approach. Our patent-pending technology, heat-weighted load balancing, intelligently routes requests across the cluster to ensure that node restarts don't impact throughput or latency. Heat-weighted load balancing implements an algorithm that optimizes the ratio between a node's cache hit rate and the proportion of requests it receives. Over time, the node serving as request coordinator sends a smaller proportion of reads to the nodes that were up, while sending more reads to the restored node. The feature enables operators to restore or add nodes to clusters efficiently, since rebooting individual nodes no longer affects the cluster's read latency.

[Figure: Read latency before and after heat-weighted load balancing. Scylla prevents latency spikes when nodes are restarted.]
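The patent-pending algorithm itself isn't reproduced here, but the intuition — weight each replica by its cache hit rate while guaranteeing the cold node a trickle of reads so its cache can warm — can be sketched as follows. The hit-rate figures and the 5% floor are illustrative assumptions, not Scylla's actual parameters.

```python
# Toy sketch of hit-rate-weighted read routing. A freshly restarted
# replica starts with a near-zero hit rate, so it receives only a small
# share of reads, growing as its cache warms. Values are illustrative.
import random

def replica_weights(hit_rates: dict) -> dict:
    """Weight replicas by cache hit rate, keeping a small floor so a
    cold node still receives enough reads to warm its cache."""
    floor = 0.05
    adjusted = {node: max(rate, floor) for node, rate in hit_rates.items()}
    total = sum(adjusted.values())
    return {node: w / total for node, w in adjusted.items()}

def pick_replica(weights: dict) -> str:
    """Choose a replica for one read, proportionally to its weight."""
    return random.choices(list(weights), weights=list(weights.values()))[0]

# node-c has just been restarted and is nearly cold...
weights = replica_weights({"node-a": 0.92, "node-b": 0.90, "node-c": 0.02})
print(weights)          # node-c gets only a small share of reads at first
print(pick_replica(weights))
```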
NODE SIZE AND COMPACTION

Readers with real-world experience of Apache Cassandra might greet large nodes with a healthy dose of skepticism. In Apache Cassandra, a node with just 5TB of data can take a long time to compact. (Compaction is a background process similar to garbage collection in the Java programming language, except that it works with data on disk rather than objects in memory.) Since compaction time increases along with the size of the dataset, it's often cited as a reason to avoid big nodes for data infrastructure.

If compaction runs too aggressively, foreground workloads suffer because all of the CPU and disk resources are soaked up. If compaction is too slow, then reads suffer. A database like Cassandra requires operators to find the proper settings through trial and error, and once set, they are static: resource consumption remains fixed regardless of shifting workloads. As noted in the context of streaming, Scylla takes a very different approach, providing autonomous operations and auto-tuning that give priority to foreground operations over background tasks like compaction, even as workloads fluctuate. The end result is rapid compaction with no measurable effect on end-user queries.

To prove this point, we forced compaction on a node using Scylla's Cassandra-compatible nodetool utility. We called nodetool compact on one of the nodes in the same cluster as in the previous experiment. The tool forces the node to compact all SSTables into a single one. Remarkably, the results show that compaction time on Scylla remains constant whether the dataset is 300GB or almost 5TB, as long as all instances in the cluster scale resources proportionately.

[Figure: Time (in hours) to compact the growing dataset — i3.xlarge (300GB): 1:45:27; i3.2xlarge (600GB): 1:47:05; i3.4xlarge (1.2TB): 2:00:41; i3.8xlarge (2.4TB): 2:02:59; i3.16xlarge (4.8TB): 2:11:34. Compaction times in Scylla remained level as the dataset doubled with each successive iteration.]

This demonstrates that one of the primary arguments against using large nodes for Big Data hinges on the false assumption that compaction of huge datasets must necessarily be slow. Scylla's linear scale-up characteristics show the opposite: compaction of large datasets benefits from the resources of large nodes.

THE DATABASE MAKES A DIFFERENCE

As our scenarios demonstrate, running on fewer nodes is inherently advantageous. The highly performant Scylla database reduces the number of nodes required for Big Data applications, and its scale-up capabilities enable organizations to further reduce node count by moving to larger machines. By increasing the size of nodes and reducing their number, IT organizations can realize benefits at all levels, especially in TCO. Moving from a large cluster of small nodes to small clusters of large nodes has a significant impact on TCO, while vastly improving the mean time between failures (MTBF) and reducing administrative overhead.

The benefits of scaling up that we have demonstrated in this paper have been proven in the field. The diagram below illustrates the server costs and administration benefits experienced by a ScyllaDB customer who migrated a Cassandra installation, distributed across 120 i3.2xl AWS instances with 8 virtual CPUs each, to Scylla.

[Figure: Cassandra, 120 i3.2xl ($651,744/year) → Scylla, 24 i3.2xl ($130,348/year): 5x cost reduction, 5x reduced admin → Scylla, 3 i3.16xl ($130,348/year): 8x further reduced admin. Results: 5x TCO reduction and MTBF improved by 40x. Scylla reduces server cost by 5x and improves MTBF by 40x.]
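The arithmetic behind the diagram is worth a quick sanity check: an i3.16xl provides eight times the resources of an i3.2xl at roughly eight times the price, so three i3.16xl nodes cost about the same as 24 i3.2xl nodes. A minimal check, assuming the per-node annual cost implied by the diagram's totals rather than quoted AWS rates:

```python
# Back-of-the-envelope check of the cost figures above. The per-node
# annual cost is implied by the diagram's totals, not a quoted AWS rate.
i3_2xl_per_year = 651_744 / 120          # ≈ $5,431 per i3.2xl node
i3_16xl_per_year = i3_2xl_per_year * 8   # an i3.16xl is 8x the machine

print(f"Cassandra, 120 x i3.2xl: ${120 * i3_2xl_per_year:,.2f}/year")
print(f"Scylla,     24 x i3.2xl: ${24 * i3_2xl_per_year:,.2f}/year")
print(f"Scylla,      3 x i3.16xl: ${3 * i3_16xl_per_year:,.2f}/year")
print(f"Node count reduction: {120 / 3:.0f}x fewer nodes")
```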
Using the same node sizes, Scylla achieved the customer's performance targets with a much smaller cluster of only 24 nodes. The initial reduction in node sprawl produced a 5X reduction in server costs, from $651,744 to $130,348 annually. It also achieved a 5X reduction in administrative overhead, encompassing failures, upgrades, monitoring, and so on. Given Scylla's scale-up capabilities, the customer then moved to the biggest nodes available, i3.16xl instances with 64 virtual CPUs each. While the cost remained the same, those 3 large nodes were capable of maintaining the customer's SLAs. Scaling up reduced complexity (administration, failures, etc.) by a factor of 40 compared with where the customer began.

As we've seen, there are many advantages to scaling up a distributed database environment before scaling out. If you've already scaled out, scaling up offers an escape from the costs and overhead of node sprawl. Scylla gives users the scale-out benefits of NoSQL along with scale-up characteristics that enable database consolidation and improved TCO.

Ask yourself...
• Are you concerned your database isn't efficiently utilizing all of its CPUs and disks?
• Are you investing too much time in managing excessively large clusters?
• Does your team have to replace dead nodes at odd hours on weekends and holidays?
• Are you struggling to scale with your business as you grow 2X, 4X, 10X?
• Are you unable to meet service level agreements during failures and maintenance operations?

If the answer to even one of these questions is 'yes,' we'd suggest evaluating your distributed database options. Open source Scylla is available for download from our website. Or start by running a cloud-hosted Scylla Test Drive, which lets you spin up a hosted Scylla cluster in minutes.
Copyright © 2018 ScyllaDB Inc. All rights reserved. All trademarks or registered trademarks used herein are property of their respective owners.

United States Headquarters
1900 Embarcadero Rd
Palo Alto, CA 94303 U.S.A.
Phone: +1 (650) 332-9582
Email: info@scylladb.com

Israel Headquarters
11 Galgalei Haplada
Herzelia, Israel

SCYLLADB.COM

ABOUT US
Founded by the team responsible for the KVM hypervisor, ScyllaDB is the company behind the Scylla real-time big data database. Fully compatible with Apache Cassandra, Scylla embraces a shared-nothing approach that increases throughput and storage capacity up to 10X that of Cassandra. AppNexus, Samsung, Olacabs, Grab, Investing.com, Snapfish, Zen.ly, IBM's Compose and many more leading companies have adopted Scylla to realize order-of-magnitude performance improvements and reduce hardware costs. For more information, please visit ScyllaDB.com.