Why Spark over Hadoop?

www.prwatech.in
Address:
No. 14, 29th Main, 2nd Cross, V.P road BTM-1st Stage, Behind AXA building, Land Mark: Vijaya Bank
ATM Bangalore – 560068, India
Spark over Hadoop
Nowadays Hadoop MapReduce is increasingly being replaced by Spark. The
basic reason is that Spark can run up to 100 times faster than Hadoop
MapReduce, so tasks performed on Spark finish faster and more
efficiently.
To understand the basic difference between these two technologies, we
first need to understand how each of them functions.
Hadoop: Hadoop is an Apache project that provides a software library and
framework for distributed processing of large data sets (big data) across
computer clusters using simple programming models.
Hadoop can scale from single computer systems up to thousands of
commodity systems that offer local storage and compute power. Hadoop,
in essence, is the ubiquitous 800-lb big data gorilla in the Big Data
Analytics space.
Hadoop is composed of modules that work together to create the Hadoop
framework. The primary Hadoop framework modules are:
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Although the above four modules comprise Hadoop’s core, there are
several other related projects. These include Ambari, Avro, Cassandra,
Hive, Pig, Oozie, Flume, and Sqoop, which further enhance and extend
Hadoop’s power and reach into big data applications and large data set
processing.
Many companies that use big data sets and analytics use Hadoop. It has
become the de facto standard in big data applications. Hadoop originally
was designed to handle crawling and searching billions of web pages and
collecting their information into a database. The result of the desire to
crawl and search the web was Hadoop’s HDFS and its distributed
processing engine, MapReduce.

Hadoop is useful to companies when data sets become so large or so
complex that their current solutions cannot effectively process the
information in what the data users consider being a reasonable amount of
time.
MapReduce is an excellent text processing engine and rightly so since
crawling and searching the web (its first job) are both text-based tasks.
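To make the text-processing model concrete, here is a minimal word-count sketch of the map, shuffle, and reduce phases in plain Python. It is only an illustration: the function names are hypothetical, everything runs in one process, and it does not use Hadoop's actual API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    # On a real cluster each document (or input split) would be mapped on a
    # different node; here everything runs locally for illustration.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


pages = ["big data big clusters", "data on clusters"]
counts = word_count(pages)
print(counts)
```

Because each map call depends only on its own document, the map work can be spread across many machines, which is the property the framework relies on.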
Spark Defined: The Apache Spark developers bill it as “a fast and general
engine for large-scale data processing.” By comparison, and sticking with
the analogy, if Hadoop’s Big Data framework is the 800-lb gorilla, then
Spark is the 130-lb big data cheetah.

Although critics of Spark’s in-memory processing admit that Spark is very
fast (up to 100 times faster than Hadoop MapReduce), they might not be
so ready to acknowledge that it runs up to ten times faster on disk. Spark
can also perform batch processing, but it really excels at streaming
workloads, interactive queries, and machine learning.
Spark’s big claim to fame is its real-time data processing capability as
compared to MapReduce’s disk-bound, batch processing engine. Spark is
compatible with Hadoop and its modules. In fact, on Hadoop’s project
page, Spark is listed as a module.
Spark has its own page because, while it can run in Hadoop clusters
through YARN (Yet Another Resource Negotiator), it also has a
standalone mode. The fact that it can run as a Hadoop module and as a
standalone solution makes it tricky to directly compare and contrast.
However, as time goes on, some big data scientists expect Spark to
diverge and perhaps replace Hadoop, especially in instances where faster
access to processed data is critical.
Spark is a cluster-computing framework, which means that it competes
more with MapReduce than with the entire Hadoop Ecosystem. For
example, Spark doesn’t have its own distributed filesystem but can use
HDFS.
Spark processes data in memory and can use the disk when needed, whereas
MapReduce is strictly disk-based. The primary difference between
MapReduce and Spark is that MapReduce uses persistent storage and
Spark uses Resilient Distributed Datasets (RDDs), which are covered in
more detail under the Fault Tolerance section.
Why Choose Spark over Hadoop:
Performance: The reason Spark is faster than Hadoop MapReduce is that
Spark processes everything in memory. It can also use the disk for data
that does not all fit into memory.
Spark’s in-memory processing delivers near real-time analytics for data
from marketing campaigns, machine learning, Internet of Things sensors,
log monitoring, security analytics, and social media sites. MapReduce, by
contrast, uses batch processing and was never really built for blinding
speed. It was originally set up to continuously gather information from
websites, and there was no requirement to have this data in real time or
near real time.
Ease of use: Spark is well known for its performance, but it’s also
somewhat well known for its ease of use in that it comes with user-friendly
APIs for Scala (its native language), Java, Python, and Spark SQL. Spark
SQL is very similar to SQL 92, so there’s almost no learning curve
required in order to use it.
Spark also has an interactive mode so that developers and users alike can
have immediate feedback for queries and other actions. MapReduce has
no interactive mode, but add-ons such as Hive and Pig make working with
MapReduce a little easier for adopters.
Cost: Both Spark and Hadoop are free, open-source software products, so
neither requires a paid license. Also, both products are designed to run on
commodity hardware, such as low-cost systems.
The only difference in cost comes from the different ways they perform
a task.
MapReduce uses standard amounts of memory because its processing is
disk-based, so a company will have to purchase faster disks and a lot of
disk space to run MapReduce. MapReduce also requires more systems to
distribute the disk I/O over multiple systems.
Spark requires a lot of memory but can make do with a standard amount of
disk running at standard speeds. Disk space is a relatively inexpensive
commodity, and since Spark does not rely on disk I/O for processing, its
main hardware cost is memory rather than disk.
Data Processing: MapReduce is a batch-processing engine. MapReduce
operates in sequential steps by reading data from the cluster, performing
its operation on the data, writing the results back to the cluster, reading
updated data from the cluster, performing the next data operation, writing
those results back to the cluster and so on. Spark performs similar
operations, but it does so in a single step and in memory. It reads data
from the cluster, performs its operation on the data, and then writes it back
to the cluster.
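The difference between the two processing styles can be sketched in plain Python. This is a toy model under stated assumptions: the function names are hypothetical, a temporary directory stands in for cluster storage, and neither function uses real MapReduce or Spark APIs.

```python
import json
import os
import tempfile


def run_with_disk_between_steps(records, steps, workdir):
    """MapReduce-style: each step reads its input from disk and writes its output back."""
    path = os.path.join(workdir, "step_0.json")
    with open(path, "w") as f:             # initial data sitting in storage
        json.dump(records, f)
    disk_writes = 1
    for i, step in enumerate(steps, start=1):
        with open(path) as f:              # read the previous step's output
            data = json.load(f)
        data = [step(x) for x in data]     # perform this step's operation
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:         # write results back before the next step
            json.dump(data, f)
        disk_writes += 1
    with open(path) as f:
        return json.load(f), disk_writes


def run_in_memory(records, steps):
    """Spark-style: chain the steps in memory; only the final result would be written out."""
    data = records
    for step in steps:
        data = [step(x) for x in data]
    return data


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
records = [1, 2, 3]

with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_writes = run_with_disk_between_steps(records, steps, workdir)
memory_result = run_in_memory(records, steps)
print(disk_result, memory_result, disk_writes)
```

Both functions produce the same result; the disk-based version simply pays for one extra read and write per step, which is where the speed difference described above comes from.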

Spark also includes its own graph computation library, GraphX. GraphX
allows users to view the same data as graphs and as collections. Users
can also transform and join graphs with Resilient Distributed Datasets
(RDDs), discussed in the Fault Tolerance section.
Fault Tolerance: MapReduce and Spark approach fault tolerance from two
different directions. MapReduce uses TaskTrackers that
provide heartbeats to the JobTracker. If a heartbeat is missed, the
JobTracker reschedules all pending and in-progress operations to another
TaskTracker. This method is effective in providing fault tolerance, but
it can significantly increase the completion times for operations
that have even a single failure.
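The heartbeat mechanism can be illustrated with a small toy model. The class below is hypothetical and greatly simplified: time is passed in explicitly as `now`, the timeout value is arbitrary, and the rule for picking a replacement TaskTracker (the one with the fewest tasks) is an assumption made for the sketch, not Hadoop's actual scheduling policy.

```python
class ToyJobTracker:
    """Toy model of heartbeat-based rescheduling; not Hadoop's implementation."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a tracker counts as lost
        self.last_seen = {}      # tracker name -> time of its last heartbeat
        self.assignments = {}    # tracker name -> list of task ids

    def register(self, tracker, now):
        self.last_seen[tracker] = now
        self.assignments[tracker] = []

    def heartbeat(self, tracker, now):
        self.last_seen[tracker] = now

    def assign(self, task, tracker):
        self.assignments[tracker].append(task)

    def check(self, now):
        """Reschedule the tasks of every tracker whose heartbeat is overdue."""
        lost = [t for t, seen in self.last_seen.items() if now - seen > self.timeout]
        orphaned = []
        for tracker in lost:
            orphaned.extend(self.assignments.pop(tracker))
            del self.last_seen[tracker]
        for task in orphaned:
            if not self.assignments:
                raise RuntimeError("no live TaskTrackers left")
            # Assumed policy: hand the task to the least-loaded live tracker.
            target = min(self.assignments, key=lambda t: len(self.assignments[t]))
            self.assignments[target].append(task)
        return lost


jt = ToyJobTracker(timeout=10)
jt.register("tracker-a", now=0)
jt.register("tracker-b", now=0)
jt.assign("task-1", "tracker-a")
jt.assign("task-2", "tracker-a")
jt.assign("task-3", "tracker-b")
jt.heartbeat("tracker-b", now=12)   # tracker-a stays silent
lost = jt.check(now=15)
print(lost, jt.assignments)
```

The rescheduled tasks start over on the new tracker, which is why even a single failure can noticeably lengthen a job.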
Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant
collections of elements that can be operated on in parallel. RDDs can
reference a dataset in an external storage system, such as a shared
filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat. Spark can create RDDs from any storage source supported
by Hadoop, including local filesystems or one of those listed previously.
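RDDs are commonly described as achieving fault tolerance by remembering the chain of transformations (the lineage) that produced them, so a lost partition can be recomputed from the source data. The sketch below imitates that idea in plain Python. `ToyRDD` and its methods are hypothetical names for illustration only and are not Spark's API.

```python
class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus the lineage to rebuild results."""

    def __init__(self, source_partitions, transforms=()):
        self.source_partitions = source_partitions  # data held in stable external storage
        self.transforms = tuple(transforms)         # lineage: functions applied in order
        self.cache = {}                             # partition index -> computed result

    def map(self, fn):
        # A transformation only records lineage; nothing is computed yet.
        return ToyRDD(self.source_partitions, self.transforms + (fn,))

    def compute_partition(self, index):
        # Recompute from the source whenever the in-memory copy is missing.
        if index not in self.cache:
            data = self.source_partitions[index]
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self.cache[index] = data
        return self.cache[index]

    def collect(self):
        result = []
        for index in range(len(self.source_partitions)):
            result.extend(self.compute_partition(index))
        return result

    def lose_partition(self, index):
        """Simulate a node failure that drops one in-memory partition."""
        self.cache.pop(index, None)


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
before = rdd.collect()
rdd.lose_partition(1)       # partition 1 disappears from memory
after = rdd.collect()       # it is rebuilt from the source and the lineage
print(before, after)
```

Only the lost partition is recomputed; partition 0 is served from memory, which is the contrast with rescheduling a whole task from scratch.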
Scalability: By definition, both MapReduce and Spark are scalable using
HDFS.
Compatibility: Spark can be deployed on a variety of platforms. It runs on
Windows and UNIX-like systems (such as Linux and Mac OS) and can be
deployed in standalone mode on a single node when it has a supported OS.
Spark can also be deployed on a cluster managed by Hadoop YARN or Apache
Mesos.
