Tech Article
Apache Spark in the Big Data landscape:
Is Spark going to replace Hadoop?
Fidus Objects
Frank Schroeter
April 2015
© 2015 - Fidus Objects, Frank Schroeter
Overview
1 Definition - What is Apache Spark?
1.1 Origin
2 Benefits of Spark
2.1 Speed
2.2 Ease of Use
2.3 Combines SQL, streaming, and complex analytics
2.4 Runs Everywhere
2.5 Language Flexibility
2.6 Some negative impressions of MapReduce
2.7 Challenges
2.8 Summary
3 Positioning of Spark
3.1 Is Spark going to replace Hadoop?
3.2 Difference between Apache Spark and Hadoop MapReduce
4 Practical Application of Spark and Trends
1 Definition - What is Apache Spark?
Apache Spark (Spark) is an open-source cluster-computing engine for data analytics. It is part of a
larger ecosystem of open-source tools, including Apache Hadoop, that serves today's analytics
community. Spark can be used with the Hadoop Distributed File System (HDFS), the Hadoop component
that provides distributed, fault-tolerant storage for very large files. Some describe Spark as a
potential substitute for the Apache Hadoop MapReduce component. MapReduce is a processing engine that
helps developers work on large data sets across a cluster. In some situations, Spark can be many times
faster than MapReduce.
Spark runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS); it can
also process structured data in Hive and streaming data from HDFS, Flume, Kafka, and Twitter.
In greater detail, Spark is an open-source data processing engine that provides elegant, attractive
development APIs and allows data workers to rapidly iterate over data. Because user programs can load
data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning
algorithms and other data science techniques that require fast, in-memory data processing.
1.1 Origin
Spark was started at the UC Berkeley AMPLab in 2009 and open sourced in 2010 under a BSD
license (Berkeley Software Distribution license *1). In 2013, the project was donated to the Apache
Software Foundation and switched its license to Apache 2.0. The current release is v1.3.1 (April 17,
2015). Spark is built by a wide set of developers from over 50 companies and had over 500
contributors in 2015, making it the most active project in the Apache Software Foundation and among
Big Data open source projects.
*1 http://www.linfo.org/bsdlicense.html
2. Benefits of Spark
This is a compilation of the features that make Spark stand out in the Big Data world:
2.1 Speed
Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x
faster even when running on disk. Spark achieves this by reducing the number of reads and writes to
disk: it stores intermediate processing data in memory. It uses the concept of a Resilient
Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it
to disk only when needed. This avoids most of the disk reads and writes – the main time-consuming
factors in data processing.
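The caching idea above can be sketched in plain Python (a toy stand-in for illustration, not the Spark API): pay the expensive load/transform cost once, keep the result in memory, and serve repeated queries from the cached copy instead of re-reading from disk.

```python
# Toy illustration of Spark's in-memory caching: compute a derived
# dataset once, then answer repeated queries from memory.
def expensive_load():
    # stand-in for reading and parsing a large file from disk
    return [x * x for x in range(1_000_000)]

class CachedDataset:
    def __init__(self, loader):
        self._loader = loader
        self._data = None          # nothing materialized yet (lazy)

    def get(self):
        if self._data is None:     # first access: compute and cache
            self._data = self._loader()
        return self._data          # later accesses: served from memory

ds = CachedDataset(expensive_load)
total = sum(ds.get())              # triggers the one-time load
count = len(ds.get())              # reuses the cached data
```

The second call to `get()` never touches `expensive_load` again – the same reason iterative algorithms that scan the same data many times benefit so much from Spark.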
2.2 Ease of Use
Spark lets developers quickly write applications in Java, Scala, or Python. Python and Scala fit data
engineering well, as they are more functionally oriented than Java. Developers can build and run
applications, including parallel ones, in languages they already know. Spark comes with a built-in
set of over 80 high-level operators.
Example:
A word count takes roughly 50 lines in Hadoop + Java; in Spark it takes about 5 (Python shown here,
using the SparkContext `sc`):
datafile = sc.textFile("hdfs://...")
counts = (datafile.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda x, y: x + y))
Less code means less maintenance, fewer bugs, and more productivity. Java developers can easily
learn Scala, as Scala classes can exist alongside Java classes (both run on the JVM).
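For readers without a cluster at hand, the same pipeline can be mimicked locally in plain Python – this illustrates the flatMap / map / reduceByKey steps, it is not the Spark API:

```python
from collections import Counter

lines = ["to be or not to be", "to be is to do"]

# flatMap: split every line into individual words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: pair each word with 1, then sum the counts per word
counts = Counter(words)

print(counts["to"])   # 4
print(counts["be"])   # 3
```

The distributed version differs only in that the splitting and summing happen in parallel across partitions, with a shuffle before the per-key sums.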
2.3 Spark Combines SQL, streaming, and complex analytics
In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming
data (Spark Streaming allows real-time data processing; works well with Kafka, ...), and
complex analytics such as machine learning and graph algorithms out of the box. As an added
benefit, users can combine all these capabilities seamlessly in a single workflow.
2.4 Runs Everywhere
Spark leverages the Hadoop ecosystem; its connectivity and integration result from its use of HDFS.
Spark runs on Hadoop, on Mesos, standalone, or in the cloud, and it can access diverse data sources
including HDFS, Cassandra, HBase, and S3.
2.5 Language Flexibility
Spark natively provides support for a variety of popular development languages. Out of the box,
it supports Scala, Java, and Python.
2.6 Some negative impressions of MapReduce
MapReduce (MR) increasingly feels dated when compared to Spark: its API is rudimentary, hard to test,
and easy to overcomplicate. Tools like Cascading, Pig, Hive, etc., make this easier, but that might
just serve as more evidence that the core API is fundamentally flawed:
• MR requires lots of code to perform even the simplest of tasks.
• It can only perform very basic operations out of the box. There is a fair amount of
configuration and far too many processes to run just to get a simple single-node installation
working. When developing in MapReduce, we are often forced to stitch together basic
operations as custom Mapper/Reducer jobs because there are no built-in features to simplify
this process. For that reason, many developers turn to the higher-level APIs offered by
frameworks like Apache Crunch or Cascading to write their MapReduce jobs.
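The verbosity complaint becomes concrete when the same word count is written Hadoop-Streaming style, with a separate mapper and reducer and an explicit sort/group step between them. This is a minimal local simulation of the three phases, not actual Hadoop code:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # emit (word, 1) for every word, as a streaming mapper would on stdin
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # sum all counts for one key, as a single reduce call would
    return (key, sum(values))

lines = ["spark beats mapreduce", "mapreduce is verbose"]

# map phase
pairs = [kv for line in lines for kv in mapper(line)]

# shuffle/sort phase: Hadoop groups pairs by key between the two phases
pairs.sort(key=itemgetter(0))

# reduce phase
counts = dict(reducer(k, (v for _, v in group))
              for k, group in groupby(pairs, key=itemgetter(0)))
```

Even in this toy form, the mapper, reducer, and intermediate grouping must be written and wired up by hand – exactly the boilerplate that Spark's `flatMap`/`reduceByKey` chain collapses into one expression.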
2.7 Challenges
Spark security is still somewhat in its infancy; Hadoop MapReduce has more security features and
projects. Hadoop MapReduce is a more mature platform and it was built for batch processing. It can be
more cost-effective than Spark for Big Data that doesn’t fit in memory and also due to the greater
availability of experienced staff. Furthermore, the Hadoop MapReduce ecosystem is currently bigger
thanks to many supporting projects, tools and cloud services. But even if Spark looks like the big
winner, the chances are that for now it won't be used entirely on its own – there is still HDFS to
store the data. This means there might still be a case for running Hadoop and MapReduce alongside
Spark for a full Big Data package.
3. Positioning of Spark
Difference between Apache Spark and Hadoop MapReduce
Spark keeps data in memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve
fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets
(RDD), and a more sophisticated way of guaranteeing fault tolerance that minimizes network I/O. *2
Spark trades RAM for network and disk I/O. But because it uses a lot of RAM, it also needs dedicated
high-end physical machines to produce effective results. Which approach wins depends on variables
that keep changing over time.
*2 http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
So, here is the big question now: Is Spark going to replace Hadoop?
The short take is that Spark, which runs in Hadoop, is everything that Hadoop's MapReduce engine is
not. Hadoop is a parallel data processing framework that has traditionally been used to run MapReduce
jobs – long-running jobs that take minutes or hours to complete. Spark was designed to run on top of
Hadoop as an alternative to the traditional batch MapReduce model; it can be used for real-time
stream processing and for fast interactive queries that finish within seconds. Hadoop supports both
traditional MapReduce and Spark.
We should look at Hadoop as a general-purpose framework that supports multiple models, and at this
very moment we should look at Spark as an alternative to Hadoop MapReduce rather than a
replacement for the Hadoop framework. However, Hadoop is based on an outdated premise: that
computation and memory are expensive. Data was written to disk to keep memory consumption low and
avoid repeated computation. Today this is no longer necessarily true; storage and memory are
commodities, and spot instances can easily be obtained on AWS. Tibco and Jaspersoft show how
it is done (http://spotfire.tibco.com/products/spotfire-aws). Besides, disk I/O is painfully slow
compared to the cost and time of recomputing. In other words: computation and memory are cheap, and
I/O time is expensive. Spark keeps as much data in memory as possible (ideal for iterative ML
algorithms), and if it gets lost or evicted, recomputes it, which is still cheaper than re-reading
from disk.
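The recompute-instead-of-replicate idea can be sketched as follows: each partition remembers the function that produced it (its lineage), so a lost partition is rebuilt by re-running that function rather than restored from a disk replica. This is a toy model of the concept, not Spark internals:

```python
class Partition:
    """Toy RDD partition: keeps its lineage so it can be rebuilt."""
    def __init__(self, source, transform):
        self.source = source          # immutable input slice
        self.transform = transform    # how the data was derived (lineage)
        self.data = [transform(x) for x in source]

    def evict(self):
        # simulate losing the in-memory copy (node failure, memory pressure)
        self.data = None

    def get(self):
        if self.data is None:
            # recompute from lineage instead of reading a replica from disk
            self.data = [self.transform(x) for x in self.source]
        return self.data

p = Partition(range(5), lambda x: x * 10)
p.evict()             # partition lost
recovered = p.get()   # rebuilt from lineage: [0, 10, 20, 30, 40]
```

Because the input is immutable and the transform deterministic, recomputation always yields the same data – which is why RDDs can skip the replication that Hadoop relies on.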
In a nutshell, one could argue the battle is already over. Spark will replace Hadoop, like NoSQL is
replacing SQL. Existing batch processes can be expected to make this transition within five
years. Spark did exactly the right things:
• kept the awesome part of Hadoop: HDFS
• rewrote the "core" to fit today's cost of resources
• offers a simple API (*3), usable by both data scientists (usually Python) and data engineers
(Scala/Java); extra points for Scala
• ships powerful built-in modules: GraphX, MLlib, Spark SQL, Spark Streaming
• has a good ecosystem (thanks to HDFS): Cassandra, Hive, Parquet, Avro, ...
*3 http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
Summary
• Spark is the emperor of data processing; Hadoop MapReduce is the king of batch processing.
• Spark is fast, but it also gives developers a positive experience they won’t forget. Spark is well
known today for its performance benefits over MapReduce, as well as its versatility. However,
another important benefit – the elegance of the development experience – gets less mainstream
attention.
• Spark includes graph processing and machine-learning capabilities.
• Spark is highly cost-effective thanks to in-memory data processing. It's compatible with all of
Hadoop's data sources and file formats, and thanks to friendly APIs available in several
languages, it also has a gentle learning curve.
• Spark is the shiny new toy on the Big Data playground, but there still exist use cases for
Hadoop MapReduce – even if they are destined to fade away.
4. Practical application of Apache Spark and Trends
Reports on the modern use of Spark show that companies are using it in various ways. One common
use is aggregating data and structuring it in more refined ways. Spark can also help with analytics
such as machine-learning work or data classification. Typically, organizations face the challenge of
refining data in an efficient and somewhat automated way; Spark can be used for these kinds of tasks.
Spark can also open analytics to those who are less knowledgeable about programming, since it
includes APIs for Python and related languages. All the benefits described above reduce costs and
improve efficiency. One example might be a streaming video website: Spark can provide real-time and
batch clickstream analysis to build customer profiles, as well as run the machine learning algorithms
powering the video recommendation engine. Many different types of workload are handled by one
execution engine, and interactions between two workloads residing in the same process (such as
distribution and security) are greatly simplified.
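At its core, the clickstream example is a keyed aggregation over an event stream. The sketch below uses hypothetical event fields and plain Python rather than Spark Streaming, just to show the shape of the computation:

```python
from collections import defaultdict

# hypothetical clickstream events: (user_id, video_id) pairs
events = [
    ("alice", "v1"), ("alice", "v2"), ("bob", "v1"),
    ("alice", "v1"), ("bob", "v3"),
]

# keyed aggregation: per-user view counts form a simple customer profile
profiles = defaultdict(lambda: defaultdict(int))
for user, video in events:
    profiles[user][video] += 1

# a recommendation engine would consume these profiles downstream
top_for_alice = max(profiles["alice"], key=profiles["alice"].get)  # "v1"
```

In Spark, the same aggregation would run as a `reduceByKey` over micro-batches of events, with the batch and streaming paths sharing one execution engine.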
Spark has more than 500 enterprise adopters, and Spark promoter Databricks has more than 50 beta
customers for its Spark-based Databricks Cloud service. Streaming data analysis is just one play for
Spark, which makes it a competitor to IBM InfoSphere Streams. However, proprietary solutions (like
IBM Streams) are becoming hard to justify: open source means reduced maintenance cost and
transparency, and there are more engineers building Spark today than IBM engineers building Streams.
Maybe open source is the real "shiny new thing" that commercial vendors are competing against. And
to further feed that fire, DataStax has recently open-sourced a well-performing Spark driver that is
available on their GitHub repo. It gets rid of all the job-config requirements that early adopters of
Spark on Cassandra previously had to deal with, as there is no longer a need to go through the Spark
Hadoop API. It supports server-side filters (basically WHERE clauses) that allow Cassandra to filter
the data before pulling it in for analysis.
To the point - Getting real value out of big data
How does Spark help organizations get real value out of their data? The challenge, Databricks points
out, is that Hadoop consists of a number of independent, but related, projects. First, an organization
must learn about these projects, what the technology does and how to put them together to solve the
organization's problems. Then they have to learn how to build a Hadoop cluster and how to prepare the
data. Only then can they start the process of exploring the data and gaining some insights.
Databricks wants to reduce this work: an organization signs up for the service, points it at the
organization's data, and then begins cataloging and analyzing that data. Databricks has
done the work of collecting the appropriate tools, configuring them and turning a bunch of independent
projects into a tool that an organization can quickly use. It appears that Databricks has significantly
simplified the process.
The still-valid postulate of customer-centric integration and data analysis means recognizing that
time-to-market is everything. My bet, for the sake of the subject of this article (Spark), is on
those vendors who offer self-service and thereby control pricing and packaging, experiment with
subscription models, enable partners to sell integration and cloud services, and reduce churn of big
data projects – preferably all with one solution.
references:
https://www.youtube.com/watch?v=BFtQrfQ2rn0
http://www.zdnet.com/article/databricks-makes-hadoop-and-apache-spark-easy-to-use/
http://blog.cloudera.com/blog/2014/03/apache-spark-a-delight-for-developers/
http://en.wikipedia.org/wiki/Apache_Spark
http://www.techopedia.com/definition/13800/apache-hadoop
http://www.techopedia.com/2/30166/technology-trends/big-data/how-hadoop-helps-solve-the-big-data-problem
https://spark.apache.org/
https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
http://planetcassandra.org/blog/the-new-analytics-toolbox-with-apache-spark-going-beyond-hadoop/
http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
http://spotfire.tibco.com/products/spotfire-aws