Spark Architecture (updated)
Module 1: Big Data and Spark
Chapter 1: Apache Spark
www.algaeservices.co.in
By: Algae Services (Pradeep Kumar)
Email: ops@algaeservices.co.in
Course: http://tutorials.algaeservice.com/
About Us
Author: Algae Services
A decade of experience delivering corporate classroom training across multiple streams of technology.
• Expertise in Big Data, ERP, Business Process Engineering, Database Technologies (SQL Server), and spreadsheet modeling.
• Worked with top service and product companies such as Wipro, Volvo, TVS, TEG Analytics, General Electric, Royal Bank of Scotland, and Verizon.
• Worked with universities such as Jain University and RGTU.
Introduction
Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009 and open-sourced in 2010 under a BSD license.
In 2013 the project was donated to the Apache Software Foundation, which switched its license to Apache 2.0. In February 2014, Spark became an Apache Top-Level Project.
In November 2014, the engineering team at Databricks used Spark to set a new world record in large-scale sorting.
Introduction
Apache Spark is a fast, in-memory data processing engine with elegant, expressive development APIs that let data workers efficiently execute
• Streaming
• Machine learning
• SQL workloads
that require fast, iterative access to datasets.
It is a framework for performing general data analytics on a distributed computing cluster such as Hadoop.
It provides in-memory computation, which increases speed over MapReduce-style data processing.
It runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS).
It can also process structured data in Hive and streaming data from HDFS, Flume, Kafka, and Twitter.
Is Apache Spark going to replace Hadoop?
Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs.
These are long-running jobs that take minutes or hours to complete.
Spark was designed to run on top of Hadoop, and it is an alternative to the traditional batch map/reduce model that can also be used for real-time stream processing and fast interactive queries that finish within seconds.
So Hadoop supports both traditional map/reduce and Spark.
We should look at Hadoop as a general-purpose framework that supports multiple models, and at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop.
Because Spark uses RAM in place of network and disk I/O, it is relatively fast compared to Hadoop. But since it uses a large amount of RAM, it needs dedicated, high-end physical machines to produce effective results.
Difference between Hadoop MapReduce and Apache Spark
Spark stores data in memory, whereas Hadoop stores data on disk.
Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs), with a clever way of guaranteeing fault tolerance that minimizes network I/O.
From the Spark academic paper: "RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition."
This removes the need for replication to achieve fault tolerance.
Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters while retaining the fault tolerance of data-flow models like MapReduce.
RDDs are motivated by two types of applications that current data-flow systems handle inefficiently:
• iterative algorithms, which are common in graph applications and machine learning, and
• interactive data mining tools.
In both cases, keeping data in memory can improve performance by an order of magnitude.
To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.
However, RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized programming models for iterative jobs such as Pregel (Google's platform for graph processing).
Implementations of RDDs can outperform Hadoop by 20x for iterative jobs and can be used interactively to search a 1 TB dataset with latencies of 5-7 seconds.
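The lineage idea can be illustrated with a toy sketch in plain Python (this is not Spark's actual implementation, only a model of the concept): each derived partition records the parent data and the transformation that produced it, so a lost partition is rebuilt from that recipe rather than recovered from a replica.

```python
# Toy model of RDD lineage: a derived dataset keeps (a) its parent
# partitions and (b) the transformation applied to them, so any lost
# partition can be recomputed instead of being replicated.
class ToyRDD:
    def __init__(self, parent_data, transform):
        self.parent_data = parent_data  # lineage: the source partitions
        self.transform = transform      # lineage: how this RDD was derived
        self.partitions = [[transform(x) for x in part] for part in parent_data]

    def lose_partition(self, i):
        self.partitions[i] = None       # simulate an executor failure

    def recover_partition(self, i):
        # Rebuild just the lost partition by replaying its lineage.
        self.partitions[i] = [self.transform(x) for x in self.parent_data[i]]

rdd = ToyRDD([[1, 2], [3, 4]], lambda x: x * 10)
rdd.lose_partition(1)
rdd.recover_partition(1)
print(rdd.partitions)   # [[10, 20], [30, 40]]
```

Note that only the failed partition is recomputed; the surviving partitions are untouched, which is exactly the property the quoted paper describes.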
Apache Spark's features
Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.
Spark makes this possible by reducing the number of reads and writes to disk: it stores intermediate processing data in memory.
It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it to disk only when needed.
This helps to eliminate most of the disk reads and writes, the main time-consuming factors in data processing.
Apache Spark's features
Easy to use: Spark lets you quickly write applications in Java, Scala, or Python.
This helps developers create and run applications in programming languages they already know, and makes it easy to build parallel apps. Spark comes with a built-in set of over 80 high-level operators, and it can also be used interactively to query data from the shell.
Word count in Spark's Python API
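The code behind that caption (lost from this transcript) is the classic PySpark word count: `sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`. Since running it requires a live SparkContext, here is the same flatMap/map/reduceByKey dataflow sketched in plain Python:

```python
# Plain-Python sketch of Spark's word-count dataflow; a real run would
# apply the same three steps to an RDD created with sc.textFile(...).
lines = ["to be or not to be"]

# flatMap: split every line into words, flattening the results into one list
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark, each of these steps is distributed across partitions of the dataset, but the per-record logic is exactly the three lambdas shown in the one-liner above.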
Apache Spark's features
Combines SQL, streaming, and complex analytics.
• In addition to simple “map” and “reduce” operations, Spark supports SQL queries,
streaming data, and complex analytics such as machine learning and graph algorithms
out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a
single workflow.
Spark’s major use cases over Hadoop
• Iterative Algorithms in Machine Learning
• Interactive Data Mining and Data Processing
• Spark is a fully Apache Hive-compatible data warehousing system that can run 100x
faster than Hive.
• Stream processing: Log processing and Fraud detection in live streams for alerts,
aggregates and analysis
• Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful because they are easy and fast to process.
• Note: Spark is still working out bugs as it matures.
GETTING STARTED WITH SPARK
• Stage 1: Explore and Develop in Spark Local Mode
• The first stage starts with Spark's local mode, where Spark runs on a single node.
• The developer uses this setup to learn Spark and starts to build a prototype of their application using the Spark API.
• Using the Spark shells (Scala and PySpark), a developer rapidly prototypes and packages a Spark application with tools such as Maven or the Scala Build Tool (SBT).
• Because the dataset is typically small (so that it fits on a developer machine), the developer can easily debug the application on a single node.
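The local-mode workflow above can be tried from the root of a Spark distribution; the master URL and the script name below are illustrative:

```shell
# Start an interactive PySpark shell on a single node,
# using 2 local worker threads ("local[2]" is the master URL).
./bin/pyspark --master "local[2]"

# Or submit a packaged application to the same local master
# (my_app.py stands in for whatever the developer has prototyped):
./bin/spark-submit --master "local[2]" my_app.py
```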
Spark cluster view:
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext
object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either
Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications.
Once connected, Spark acquires executors on nodes in the cluster, which are processes that run
computations and store data for your application.
Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the
executors. Finally, SparkContext sends tasks for the executors to run.
Things to note about this architecture:
Each application gets its own executor processes, which stay up for the duration of the whole application
and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both
the scheduling side (each driver schedules its own tasks) and executor side (tasks from different
applications run in different JVMs). However, it also means that data cannot be shared across different
Spark applications (instances of SparkContext) without writing it to an external storage system.
Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these
communicate with each other, it is relatively easy to run it even on a cluster manager that also supports
other applications (e.g. Mesos/YARN).
The driver program must listen for and accept incoming connections from its executors throughout its
lifetime. As such, the driver program must be network addressable from the worker nodes.
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably
on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open
an RPC to the driver and have it submit operations from nearby than to run a driver far away from the
worker nodes.
Terminologies:
Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager,
Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the
driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Terminologies:
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to
the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
Misconception about Spark
"Spark is an in-memory technology": none of the Spark developers officially state this! It is a rumor based on a misunderstanding of Spark's computation process.
Spark has no option for in-memory data persistence. It has pluggable connectors for different persistent storage systems like HDFS, Tachyon, HBase, Cassandra and so on, but it does not have native persistence code, either in-memory or on-disk. All it can do is cache the data, which is not "persistence": cached data can easily be dropped and recomputed later from the source persistent store available through a connector. Moreover, the heart of Spark, the "shuffle", writes data to disk. If you have a "group by" statement in your SparkSQL query, or you are just transforming an RDD into a PairRDD and calling some aggregation by key on it, you are forcing Spark to distribute data among the partitions based on the hash value of the key.
So, finally, Spark is not an in-memory technology. It is a technology that lets you efficiently utilize an in-memory LRU cache, with possible on-disk eviction when memory is full. It does not have built-in persistence functionality (neither in-memory nor on-disk), and it puts all the dataset's data on the local filesystems during the "shuffle" process.
ARCHITECTURE IN DETAILS
Any Spark process that runs on your cluster or local machine is a JVM process, so, as with any JVM process, you can configure its heap size with the -Xmx and -Xms flags of the JVM. How does this process use its heap memory, and why does it need it at all? The following slides walk through Spark's memory allocation inside the JVM heap.
ARCHITECTURE IN DETAILS
By default, Spark starts with a 512MB JVM heap. To be on the safe side and avoid OOM errors, Spark allows only 90% of the heap to be utilized, which is controlled by the spark.storage.safetyFraction parameter.
You might have heard of Spark as an in-memory tool, and Spark does allow you to store some data in memory.
But as discussed above, Spark is not really an in-memory tool; it just utilizes memory for its LRU cache. So some amount of memory is reserved for caching the data you are processing, and this part is usually 60% of the safe heap, controlled by the spark.storage.memoryFraction parameter.
So if you want to know how much data you can cache in Spark, take the sum of the heap sizes of all the executors and multiply it by safetyFraction and by storage.memoryFraction: by default that is 0.9 * 0.6 = 0.54, or 54% of the total heap size you allow Spark to use.
Now a bit more about the shuffle memory. It is calculated as "Heap Size" * spark.shuffle.safetyFraction * spark.shuffle.memoryFraction. The default value for spark.shuffle.safetyFraction is 0.8 (80%), and the default value for spark.shuffle.memoryFraction is 0.2 (20%). So you can use up to 0.8 * 0.2 = 0.16, or 16%, of the JVM heap for the shuffle.
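These fractions (from the legacy static memory model described here; later Spark versions replaced it with a unified memory manager) reduce to simple arithmetic. The 4GB executor heap below is an assumption chosen only for illustration:

```python
# Legacy (pre-unified) Spark memory model with the default fractions
# quoted above; the heap size is an illustrative assumption.
heap_mb = 4096

storage_safety   = 0.9   # spark.storage.safetyFraction
storage_fraction = 0.6   # spark.storage.memoryFraction
shuffle_safety   = 0.8   # spark.shuffle.safetyFraction
shuffle_fraction = 0.2   # spark.shuffle.memoryFraction
unroll_fraction  = 0.2   # spark.storage.unrollFraction

storage_mb = heap_mb * storage_safety * storage_fraction  # cacheable data, ~54% of heap
shuffle_mb = heap_mb * shuffle_safety * shuffle_fraction  # shuffle buffers, ~16% of heap
unroll_mb  = heap_mb * storage_safety * unroll_fraction   # unroll space,   ~18% of heap

print(storage_mb, shuffle_mb, unroll_mb)
```

For a 4GB heap this gives roughly 2.2GB for the cache, 655MB for shuffle, and 737MB for unrolling blocks (the unroll fraction is explained on the next slide).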
ARCHITECTURE IN DETAILS
In general, Spark uses this memory for the exact task it is named after: the shuffle. When the shuffle is performed, you sometimes also need to sort the data, and when you sort you usually need a buffer to store the sorted data (remember, you cannot modify the data in the LRU cache in place, as it is there to be reused later).
So Spark needs some amount of RAM to store the sorted chunks of data. What happens if you don't have enough memory to sort the data? There is a wide range of algorithms, usually referred to as "external sorting", that allow you to sort the data chunk by chunk and then merge the final result together.
The last part of RAM is the "unroll" memory. The amount of RAM that the unroll process may utilize is spark.storage.unrollFraction * spark.storage.safetyFraction, which with the default values equals 0.2 * 0.9 = 0.18, or 18% of the heap. This is the memory that can be used when you are unrolling a block of data into memory.
Why do you need to unroll it at all? Spark allows you to store data in both serialized and deserialized form. Data in serialized form cannot be used directly, so you have to unroll it before use; this is the RAM used for unrolling. It is shared with the storage RAM, which means that if you need some memory to unroll data, this might cause some of the partitions stored in the Spark LRU cache to be dropped.
SPARK CLUSTER VIEW WITH YARN
When you have a YARN cluster, it has a YARN Resource Manager daemon that controls the cluster resources (in practice, memory) and a series of YARN Node Managers running on the cluster nodes and controlling node resource utilization.
From the YARN standpoint, each node represents a pool of RAM that you have control over. When you request resources from the YARN Resource Manager, it tells you which Node Managers you can contact to bring up execution containers for you. Each execution container is a JVM with the requested heap size. The JVM locations are chosen by the YARN Resource Manager and you have no control over them: if a node has 64GB of RAM controlled by YARN (the yarn.nodemanager.resource.memory-mb setting in yarn-site.xml) and you request 10 executors with 4GB each, all of them can easily be started on a single YARN node, even if you have a big cluster.
SPARK CLUSTER VIEW WITH YARN
When you start a Spark cluster on top of YARN, you specify:
• the number of executors you need (the --num-executors flag or the spark.executor.instances parameter),
• the amount of memory to be used by each executor (the --executor-memory flag or the spark.executor.memory parameter),
• the number of cores each executor may use (the --executor-cores flag or the spark.executor.cores parameter), and
• the number of cores dedicated to each task's execution (the spark.task.cpus parameter).
You also specify the amount of memory to be used by the driver application (the --driver-memory flag or the spark.driver.memory parameter).
When you execute something on a cluster, the processing of your job is split into stages, and each stage is split into tasks. Each task is scheduled separately. You can consider each JVM working as an executor to be a pool of task-execution slots: each executor gives you spark.executor.cores / spark.task.cpus execution slots for your tasks, with a total of spark.executor.instances executors.
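The slot computation above is simple arithmetic; the executor counts below are made-up illustrative values, with variable names mirroring the Spark parameters:

```python
# Hypothetical spark-submit settings (names mirror the Spark parameters above).
executor_instances = 10   # spark.executor.instances
executor_cores     = 4    # spark.executor.cores
task_cpus          = 1    # spark.task.cpus

# Each executor offers executor_cores / task_cpus task-execution slots.
slots_per_executor = executor_cores // task_cpus
total_slots = executor_instances * slots_per_executor

print(slots_per_executor)  # 4 slots per executor
print(total_slots)         # 40 task slots cluster-wide
```

So with these settings, at most 40 tasks can run concurrently; any further tasks in a stage wait for a slot to free up.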
SPARK HORTONWORKS HDP
Spark is certified as YARN Ready and is a part of HDP. Memory- and CPU-intensive Spark-based applications can coexist with other workloads deployed in a YARN-enabled cluster. This approach avoids the need to create and manage dedicated Spark clusters and allows for more efficient resource use within a single cluster.
SPARK HORTONWORKS HDP
HDP also provides consistent governance, security and management policies for Spark applications, just as
it does for the other data processing engines within HDP.
Hortonworks approached Spark in the same way they approached other data access engines like Storm, Hive, and HBase: outline a strategy, rally the community, and contribute key features within the Apache Software Foundation's process.
SPARK HORTONWORKS HDP
Below is a summary of the various integration points that make Spark enterprise-ready:
Support for the ORCFile format (the Optimized Row Columnar (ORC) file format provides a highly efficient way to store data).
Security.
Operations.
Improved Reliability and Scale of Spark-on-YARN.
YARN Integration.
Support for the ORCFile format.
As part of the Stinger Initiative, the Hive community introduced the Optimized Row
Columnar (ORC) file format.
ORC is a columnar storage format that is tightly integrated with HDFS and provides
optimizations for both read performance and data compression.
It is rapidly becoming the de facto storage format for Hive.
Hortonworks contributed to SPARK-2883, which provides basic support of ORCFile in
Spark.
SECURITY
Many of their customers' initial use cases for Spark run on Hadoop clusters that either do not contain sensitive data or are dedicated to a single application, and so are not subject to broad security requirements.
But users plan to deploy Spark-based applications alongside other applications in a single cluster, so Hortonworks worked to integrate Spark with the security constructs of the broader Hadoop platform.
A common request they hear is that Spark run effectively on a secure Hadoop cluster and leverage the authorization offered by HDFS.
To improve security further, they have worked within the community to ensure that Spark runs on a Kerberos-enabled cluster, meaning that only authenticated users can submit Spark jobs.
OPERATIONS
Hortonworks continues to focus on streamlining operations for Spark through the 100% open
source Apache Ambari.
Customers use Ambari to provision, manage, and monitor their HDP clusters, and many Hortonworks partners, such as Microsoft, Teradata, Pivotal, and HP, have taken advantage of and backed this foundational Hadoop project.
Currently, their partners leverage Ambari Stacks to rapidly define new components and services and add them to a Hadoop cluster.
With Stacks, Spark components and services can be managed by Ambari, so that you can install, start, stop, and configure a Spark deployment to fine-tune it, all via a single interface that is used for all engines in your Hadoop cluster.
The Quick Links feature of Ambari allows the cluster operator to access the native Spark user interface.
To simplify the operational experience, HDP 2.2.4 also allows Spark to be installed and managed by Apache Ambari 2.0.
Ambari lets the cluster administrator manage the configuration of Spark and the life cycles of the Spark daemons.
Improved Reliability and Scale of Spark-on-YARN
The Spark API allows developers to create both iterative and in-memory applications on Apache Hadoop YARN.
With the community interest behind it, Spark is making great strides in efficient cluster resource usage.
With dynamic executor allocation on YARN, Spark only uses executors within a bound.
They continue to believe Spark can use cluster resources more efficiently and are working with the community to promote better resource usage.
YARN ATS Integration
From an operations perspective, Hortonworks has integrated Spark with the YARN
Application Timeline Server (ATS).
ATS provides generic storage and retrieval of applications’ current and historic
information.
This permits a common integration point for certain classes of operational information
and metrics.
With this integration, the cluster operator can take advantage of information already
available from YARN to gain additional visibility into the health and execution status of
the Spark jobs.