2. What is Apache Spark?
Apache Spark is an open-source parallel processing framework that supports in-
memory processing to boost the performance of big-data analytic applications.
• Apache Spark was introduced in 2009 in the UC Berkeley R&D Lab, which later
became the AMPLab. It was open sourced in 2010 under a BSD license. In 2013 Spark
was donated to the Apache Software Foundation, where it became a top-level Apache
project in 2014.
• Apache Spark is a general-purpose, lightning-fast cluster computing system. It
provides high-level APIs in Java, Scala, Python and R. For in-memory workloads Spark
can run up to 100 times faster than Hadoop MapReduce, and up to 10 times faster
when the data is accessed from disk.
• Spark is written in Scala but provides rich APIs in Scala, Java, Python, and R.
• It can be integrated with Hadoop and can process existing Hadoop HDFS data.
3. Need for Spark
In the industry, there was a need for a general-purpose cluster computing
tool, because:
• Hadoop MapReduce can only perform batch processing.
• Apache Storm / S4 can only perform stream processing.
• Apache Impala / Apache Tez can only perform interactive processing.
• Neo4j / Apache Giraph can only perform graph processing.
• Hence there was a big demand in the industry for a powerful engine
that can process data both in real-time (streaming) and in batch
mode, respond in sub-second latencies, and perform in-memory processing.
4. Apache Spark Components
• Spark Core: It is the kernel of Spark and provides the
execution platform for all Spark applications. It is a
generalized platform that supports a wide array of applications.
• Spark SQL: It enables users to run SQL / HQL queries on top
of Spark. Using Spark SQL, we can process structured as
well as semi-structured data. It also provides an engine that
lets unmodified Hive queries run up to 100 times faster on
existing deployments. A short example is given after this list.
• Spark Streaming: Apache Spark Streaming enables powerful
interactive and analytics applications over live streaming
data. The live streams are divided into micro-batches which
are executed on top of Spark Core.
• Spark MLlib: It is a scalable machine learning library that
delivers both efficiency and high-quality algorithms. Apache
Spark MLlib is one of the most popular choices for data
scientists because of its in-memory data processing, which
drastically improves the performance of iterative algorithms.
• Spark GraphX: Apache Spark GraphX is the graph computation
engine built on top of Spark that enables processing of graph data
at scale.
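To make the Spark SQL component concrete, here is a minimal sketch in Scala (Spark's native language). It assumes a Spark 2.x-or-later SparkSession run in local mode; the application name, column names and sample rows are illustrative only.

  import org.apache.spark.sql.SparkSession

  object SparkSqlSketch {
    def main(args: Array[String]): Unit = {
      // SparkSession is the entry point for DataFrame / SQL programs.
      val spark = SparkSession.builder()
        .appName("spark-sql-sketch")
        .master("local[*]")   // local mode; on a cluster this would come from spark-submit
        .getOrCreate()
      import spark.implicits._

      // A small DataFrame: data organized into named columns, like a relational table.
      val employees = Seq(("alice", 34), ("bob", 29), ("carol", 41)).toDF("name", "age")

      // Register a temporary view so the same data can be queried with SQL / HQL-style syntax.
      employees.createOrReplaceTempView("employees")
      spark.sql("SELECT name FROM employees WHERE age > 30").show()

      spark.stop()
    }
  }

The same DataFrame could equally be built from structured or semi-structured sources such as JSON or Parquet files via spark.read.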
5. Data in Apache Spark
• Resilient Distributed Dataset (RDD): the fundamental unit of data in Apache Spark; a distributed
collection of elements across cluster nodes on which parallel operations can be performed. Spark RDDs are immutable,
but a new RDD can be generated by transforming an existing one.
• DataFrame: Unlike an RDD, the data is organized into named columns, like a table in a relational
database. It is an immutable distributed collection of data. A DataFrame allows developers to impose
a structure onto a distributed collection of data, providing a higher-level abstraction.
• Dataset: Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-
oriented programming interface. A Dataset takes advantage of Spark's Catalyst optimizer by exposing
expressions and data fields to the query planner.
• Apache Spark RDDs support two types of operations:
Transformation – creates a new RDD from an existing one. It applies a function to the dataset and
returns a new dataset.
Action – returns the final result to the driver program or writes it to an external datastore.
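Transformations and actions can be illustrated with a short Scala sketch. This is a minimal example, assuming local mode; the numbers and the application name are arbitrary.

  import org.apache.spark.sql.SparkSession

  object RddOpsSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("rdd-ops-sketch")
        .master("local[*]")
        .getOrCreate()
      val sc = spark.sparkContext

      // Distribute a local collection across the cluster as an RDD.
      val numbers = sc.parallelize(1 to 10)

      // Transformations are lazy: each one only describes a new RDD derived from an existing one.
      val squares = numbers.map(n => n * n)
      val evens   = squares.filter(_ % 2 == 0)

      // An action triggers execution and returns the final result to the driver program.
      val total = evens.reduce(_ + _)
      println(s"Sum of even squares: $total")

      spark.stop()
    }
  }

Because RDDs are immutable, numbers is left unchanged; squares and evens are new RDDs derived from it.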
7. Apache Spark Runtime Components
a. Apache Spark Driver
• The main() method of the program runs in the driver. The driver is the process that runs the user code, creates RDDs, performs transformations and actions, and
creates the SparkContext. Launching the Spark shell also creates a driver program. When the driver terminates, the application is finished.
• The driver program splits the Spark application into tasks and schedules them to run on the executors. The task scheduler resides in the driver and distributes tasks among
the workers. The two key roles of the driver are:
Converting the user program into tasks.
Scheduling tasks on the executors.
• At a high level, the structure of a Spark program is: RDDs are created from some input data, new RDDs are derived from existing ones using various transformations, and then an
action is performed to compute a result. In a Spark program, the DAG (directed acyclic graph) of operations is created implicitly, and when the driver runs, it converts that
DAG into a physical execution plan.
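The description above maps onto a small driver program. The following Scala sketch is illustrative only: the input path is a placeholder, and the master URL is assumed to be supplied externally through spark-submit.

  import org.apache.spark.{SparkConf, SparkContext}

  object DriverSketch {
    // main() runs in the driver process; creating the SparkContext
    // registers the application with the cluster manager.
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("driver-sketch")
      val sc   = new SparkContext(conf)

      // Create an RDD from some input data (placeholder path).
      val lines = sc.textFile("hdfs:///data/input.txt")

      // Transformations only build up the implicit DAG of operations; nothing executes yet.
      val counts = lines
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // The action forces the driver to convert the DAG into a physical plan of tasks,
      // which the task scheduler distributes to the executors.
      counts.take(10).foreach(println)

      sc.stop()
    }
  }

Such a program would typically be packaged as a JAR and launched with spark-submit, as described in the section "How does Spark work?" below.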
b. Apache Spark Cluster Manager
• Spark relies on the cluster manager to launch executors and, in some cases, even the driver is launched through it. It is a pluggable component in Spark. On the cluster manager,
jobs and actions within a Spark application are scheduled by the Spark scheduler in a FIFO fashion; alternatively, scheduling can be done in a round-robin fashion. The
resources used by a Spark application can be adjusted dynamically based on the workload, so the application can free unused resources and request them again when there is
demand. This is available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.
c. Apache Spark Executors
• The individual tasks in a given Spark job run in the Spark executors. Executors are launched once at the beginning of a Spark application and then run for the entire lifetime
of the application. Even if a Spark executor fails, the Spark application can continue. The two main roles of the executors are:
Running the tasks that make up the application and returning the results to the driver.
Providing in-memory storage for RDDs that are cached by the user.
9. How does Spark work?
• Using spark-submit, the user submits an application.
• spark-submit launches the driver program and invokes the main() method specified by
the user.
• The driver program asks the cluster manager for the resources required to launch
the executors.
• The cluster manager launches executors on behalf of the driver program.
• The driver process runs the user application. Based on the
actions and transformations on the RDDs, the driver sends work to the executors in
the form of tasks.
• The executors process the tasks and return the results to the
driver.
11. How does Microsoft Azure help?
Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Apache Hadoop was the
original open-source framework for distributed processing and analysis of big data sets on clusters of computers. HDInsight provides a 99.9% SLA for
service availability.
The following components and utilities are included on HDInsight clusters:
• Ambari: Cluster provisioning, management, monitoring, and utilities.
• Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.
• Hive & HCatalog: SQL-like querying, and a table and storage management layer.
• Mahout: For scalable machine learning applications.
• MapReduce: Legacy framework for Hadoop distributed processing and resource management. See YARN.
• Oozie: Workflow management.
• Phoenix: Relational database layer over HBase.
• Pig: Simpler scripting for MapReduce transformations.
• Sqoop: Data import and export.
• Tez: Allows data-intensive processes to run efficiently at scale.
• YARN: Resource management that is part of the Hadoop core library.
• ZooKeeper: Coordination of processes in distributed systems.