Apache Spark Installation
SHWETA PATNAIK
M.TECH (CSE)
 What is SPARK
 SPARK Architecture
 Why Spark when Hadoop is already there?
 Spark vs. Hadoop MapReduce
 Apache Spark Installation
 Operating or Deploying a Spark Cluster Manually
 Running Spark Application
 Spark Environment
 Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing.
 Apache Spark is an open-source cluster computing framework for real-time
processing.
 Spark is being adopted by major players like Amazon, eBay, and Yahoo!
 It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations.
 Hadoop is just one of the ways to deploy Spark.
 Spark can use Hadoop in two ways – one for storage and one for processing.
 Since Spark has its own cluster management, it can also use Hadoop for storage alone.
 Hadoop is based on batch processing of big data. This means that data is stored over a period of time and is then processed using Hadoop.
 In Spark, by contrast, processing can take place in real time. This real-time processing power helps solve use cases in real-time analytics.
 Spark can also run batch processing up to 100 times faster than Hadoop MapReduce (the processing framework in Apache Hadoop).
1. Introduction
 Apache Spark – An open-source big data framework that provides a faster, more general-purpose data-processing engine. Spark is designed primarily for fast computation.
 Hadoop MapReduce – Also an open-source framework, used for writing applications that process structured and unstructured data stored in HDFS. MapReduce processes data in batch mode.
2. Speed
 Apache Spark – Spark is a lightning-fast cluster computing tool. Apache Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop.
 Hadoop MapReduce – MapReduce reads from and writes to disk, which slows down processing.
3. Real-time analysis
 Apache Spark – It can process real-time data, i.e. data coming from real-time event streams at rates of millions of events per second, e.g. Twitter or Facebook feeds. Spark’s strength is the ability to process live streams efficiently.
 Hadoop MapReduce – MapReduce falls short for real-time data processing, as it was designed to perform batch processing on voluminous amounts of data.
4. Latency
 Apache Spark – Spark provides low-latency computing.
 Hadoop MapReduce – MapReduce is a high latency computing framework.
5. Interactive mode
 Apache Spark – Spark can process data interactively.
 Hadoop MapReduce – MapReduce doesn’t have an interactive mode.
6. Streaming
 Apache Spark – Spark can process real-time data through Spark Streaming; a short sketch follows below.
 Hadoop MapReduce – With MapReduce, you can only process data in batch mode.
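For illustration only, here is a minimal Spark Streaming sketch in Scala, assuming the Spark shell (which already provides the SparkContext sc) and a TCP text source on localhost:9999 that is invented for this example; it is not part of the original slides.

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // 5-second micro-batches on top of the existing SparkContext (sc from spark-shell)
  val ssc = new StreamingContext(sc, Seconds(5))

  // The host and port of the text stream are illustrative assumptions
  val lines = ssc.socketTextStream("localhost", 9999)

  // Count words in each batch and print the per-batch result
  val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  wordCounts.print()

  ssc.start()
  ssc.awaitTermination()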
7. Fault tolerance
 Apache Spark – Spark is fault-tolerant. As a result, there is no need to restart
the application from scratch in case of any failure.
 Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant,
so there is no need to restart the application from scratch in case of any
failure.
8. Cost
 Apache Spark – Spark requires a lot of RAM to run in memory, which increases the cluster size and its cost.
 Hadoop MapReduce – MapReduce is the cheaper option when compared purely in terms of hardware cost.
9. Development language
 Apache Spark – Spark is developed in Scala.
 Hadoop MapReduce – Hadoop MapReduce is developed in Java.
10. OS support
 Apache Spark – Spark is cross-platform.
 Hadoop MapReduce – Hadoop MapReduce is also cross-platform.
11. Programming Language support
 Apache Spark – Scala, Java, Python, R, SQL.
 Hadoop MapReduce – Primarily Java; other languages such as C, C++, Ruby, Groovy, Perl, and Python are also supported via Hadoop Streaming.
12. SQL support
 Apache Spark – It enables users to run SQL queries using Spark SQL; a short sketch follows below.
 Hadoop MapReduce – It enables users to run SQL queries using Apache Hive (HiveQL).
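As a brief, non-authoritative sketch of the Spark SQL point above (Spark 1.6-era API, assuming spark-shell provides sc and sqlContext; the Person case class and the data are invented for this example):

  case class Person(name: String, age: Int)

  // Build a DataFrame from a small in-memory RDD of case-class rows
  val people = sc.parallelize(Seq(Person("Ann", 31), Person("Bob", 25)))
  val df = sqlContext.createDataFrame(people)

  // Register it as a temporary table and query it with plain SQL
  df.registerTempTable("people")
  sqlContext.sql("SELECT name FROM people WHERE age > 30").show()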
13. Scalability
 Apache Spark – Spark is highly scalable, so nodes can keep being added to the cluster. The largest known Spark cluster has around 8,000 nodes.
 Hadoop MapReduce – MapReduce is also highly scalable; nodes can likewise keep being added. The largest known Hadoop cluster has around 14,000 nodes.
14. Lines of code
 Apache Spark – Apache Spark was implemented in merely about 20,000 lines of code.
 Hadoop MapReduce – Hadoop 2.0 has about 120,000 lines of code.
15. Machine Learning
 Apache Spark – Spark ships with its own machine learning library, MLlib; a short sketch follows below.
 Hadoop MapReduce – Hadoop requires an external machine learning tool, for example Apache Mahout.
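A minimal MLlib sketch (RDD-based API), assuming the Spark shell and a handful of made-up 2-D points, just to show that the library ships with Spark:

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Toy data; in practice these vectors would be parsed from a file
  val points = sc.parallelize(Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
    Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
  ))

  // Cluster into 2 groups with 10 iterations and print the centres
  val model = KMeans.train(points, 2, 10)
  model.clusterCenters.foreach(println)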
16. Hardware requirements
 Apache Spark – Spark needs mid- to high-end hardware.
 Hadoop MapReduce – MapReduce runs very well on commodity hardware.
 Download Scala from the link:
http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.msi
 Install Scala under the C drive, into a Scala folder (C:\scala).
 My Computer → Properties → Advanced system settings → Environment Variables → under User variables, click New.
 User variable:
Variable: SCALA_HOME
Value: C:\scala
Click the OK button.
 System variable:
Variable: PATH
Value: C:\scala\bin (append this to the existing PATH entries)
Then click OK → OK → OK.
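As a quick sanity check (an optional step, not in the original slides), open a new command prompt, run scala, and print the version from the REPL:

  // Should report 2.11.8 if the installation and PATH entry above worked
  println(s"Running Scala ${util.Properties.versionString}")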
 Download Spark from the following link:
http://spark.apache.org/downloads.html and extract it into the E drive, for example E:\spark (you can extract it to another drive as well).
 User variable:
Variable: SPARK_HOME
Value: E:\spark\spark-1.6.1-bin-hadoop2.3
 System variable:
Variable: PATH
Value: E:\spark\spark-1.6.1-bin-hadoop2.3\bin (append this to the existing PATH entries)
 Download winutils.exe from the link:
 https://github.com/prabaprakash/Hadoop-2.3/tree/master/bin
and paste it into E:\spark\spark-1.6.1-bin-hadoop2.3\bin.
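If winutils.exe is still not picked up, one common workaround (an assumption added here, not taken from the slides) is to point Hadoop's home directory at the Spark folder from inside the shell before running any job:

  // hadoop.home.dir must contain a bin\winutils.exe; the path mirrors the install location used above
  System.setProperty("hadoop.home.dir", "E:\\spark\\spark-1.6.1-bin-hadoop2.3")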
 Let’s look at the classic Hadoop MapReduce example of Word Count, this time in Apache Spark –
 The input file, input.txt, contains the following text –
 We will submit the word count example in Apache Spark using the Spark shell, rather than running the word count program as a packaged application -
 Let’s start the Spark shell.
 Let’s create a Spark RDD from the input file that we want to run our first Spark program on. You should specify the absolute path of the input file -
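A sketch of this step in the Spark shell (started by running spark-shell from the bin directory added to PATH earlier); the input path below is an assumption matching the drive used in the installation steps:

  // Read the input file into an RDD of lines
  val inputFile = sc.textFile("E:/spark/input.txt")

  // Peek at the first few lines to confirm the file was found
  inputFile.take(5).foreach(println)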
 On executing the above command, the following output is observed -
 Now comes the step of counting the words –
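A sketch of the counting step, continuing from the inputFile RDD created above:

  // Split lines into words, pair each word with 1, then sum the counts per word
  val counts = inputFile.flatMap(line => line.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)

  // Bring the (word, count) pairs back to the driver and print them
  counts.collect().foreach(println)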
 You will get the following output:
 The next step is to store the output in a text file-
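A sketch of the save step; note that Spark writes a directory of part files rather than a single text file (the output path is an assumption):

  // Persist the (word, count) pairs; "output" becomes a directory containing part-0000x files
  counts.saveAsTextFile("E:/spark/output")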
 Go to the output directory (the location where the directory named output was created).
Thank You