Module 1: Big Data and Spark
Chapter 1: Apache Spark
www.algaeservices.co.in
By: Algae Services (Pradeep Kumar)
Email: ops@algaeservices.co.in
Course: http://tutorials.algaeservice.com/
About Us
Author: Algae Services
A decade of experience delivering corporate training and classroom
sessions across different streams of technology.
• Expertise in Big Data, ERP, Business Process Engineering,
database technologies (SQL Server), and spreadsheet modeling.
• Worked with top service and product companies such as
Wipro, Volvo, TVS, TEG Analytics, General Electric,
Royal Bank of Scotland, and Verizon.
• Worked with universities such as Jain University and RGTU.
Introduction
 Spark was started by Matei Zaharia at UC Berkeley's AMPLab in 2009,
and open-sourced in 2010 under a BSD license.
 In 2013, the project was donated to the Apache Software Foundation and
switched its license to Apache 2.0. In February 2014, Spark became an
Apache Top-Level Project.
 In November 2014, the engineering team at Databricks used Spark to set a
new world record in large-scale sorting.
Introduction
 Apache Spark is a fast, in-memory data processing engine with elegant and expressive
development APIs that allow data workers to efficiently execute
 Streaming
 Machine learning
 SQL workloads
that require fast iterative access to datasets.
 It is a framework for performing general data analytics on a distributed computing cluster
such as Hadoop.
 It provides in-memory computation for increased speed over MapReduce-based data
processing.
 It runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS).
 It can also process structured data in Hive and streaming data from HDFS, Flume, Kafka,
and Twitter.
Is Apache Spark going to replace Hadoop?
Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce
jobs. These are long-running jobs that take minutes or hours to complete.
Spark is designed to run on top of Hadoop as an alternative to the traditional batch
map/reduce model; it can be used for real-time stream data processing and fast interactive
queries that finish within seconds.
 So, Hadoop supports both traditional map/reduce and Spark.
We should look at Hadoop as a general-purpose framework that supports multiple models.
We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for
Hadoop.
Because Spark uses RAM instead of network and disk I/O, it is relatively fast compared to Hadoop.
But as it uses a large amount of RAM, it needs dedicated high-end physical machines to produce
effective results.
Difference between Hadoop MapReduce and Apache Spark
Spark stores data in-memory whereas Hadoop stores data on disk.
Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data
storage model, resilient distributed datasets (RDDs), which guarantees
fault tolerance in a clever way that minimizes network I/O.
From the Spark academic paper: "RDDs achieve fault tolerance through a notion of
lineage:
 if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition."
 This removes the need for replication to achieve fault tolerance.
Resilient Distributed Datasets
 Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows
programmers to perform in-memory computations on large clusters while retaining the
fault tolerance of data flow models like MapReduce.
 RDDs are motivated by two types of applications that current data flow systems handle
inefficiently:
 iterative algorithms, which are common in graph applications and machine learning, and
 interactive data mining tools.
 In both cases, keeping data in memory can improve performance by an order of
magnitude.
 To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared
memory: they are read-only datasets that can only be constructed through bulk
operations on other RDDs.
 However, RDDs are expressive enough to capture a wide class of computations,
including MapReduce and specialized programming models for iterative jobs such as
Pregel (Google's platform for graph processing).
 Implementations of RDDs can outperform Hadoop by 20x for iterative jobs and can be
used interactively to search a 1 TB dataset with latencies of 5-7 seconds.
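To make lineage concrete, here is a minimal PySpark sketch (the input path is hypothetical, and a local SparkContext is assumed): each transformation only records how its partitions derive from the parent RDD, so a lost partition can be recomputed rather than restored from a replica.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "lineage-demo")

# Base RDD read from stable storage (hypothetical path).
lines = sc.textFile("hdfs:///data/events.log")

# Each transformation below only records lineage; nothing runs yet.
errors = lines.filter(lambda line: "ERROR" in line)  # child of `lines`
codes = errors.map(lambda line: line.split()[0])     # child of `errors`

# An action triggers computation. If a partition of `codes` is lost,
# Spark replays filter + map on just that partition of `lines`.
print(codes.take(5))
print(codes.toDebugString())  # prints the lineage graph
```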
Apache Spark's features
 Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in
memory, and up to 10x faster even when running on disk.
 Spark makes this possible by reducing the number of reads and writes to disk:
it stores intermediate processing data in-memory.
 It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to
transparently store data in memory and persist it to disk only when it's needed.
 This helps eliminate most of the disk reads and writes, the main time-consuming
factors in data processing.
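A minimal sketch of that behavior, assuming an existing SparkContext sc and a hypothetical input path: MEMORY_AND_DISK keeps partitions in memory and spills them to disk only when they do not fit.

```python
from pyspark import StorageLevel

words = sc.textFile("hdfs:///data/corpus.txt").flatMap(lambda l: l.split())

# Keep partitions in memory, spilling to disk only when they do not
# fit: "persist to disk only when it's needed".
words.persist(StorageLevel.MEMORY_AND_DISK)

print(words.count())             # first action computes and caches
print(words.distinct().count())  # later actions reuse the cached data
```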
 Easy to use: Spark lets you quickly write applications in Java, Scala, or Python.
This lets developers create and run their applications in programming languages
they already know, and makes it easy to build parallel apps. Spark comes with a
built-in set of over 80 high-level operators, and you can also use it
interactively to query data from the shell.
 Word count in Spark's Python API:
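The code image from the original slide is not reproduced here; a standard version of the example, assuming a SparkContext sc and hypothetical input/output paths, would look like this:

```python
# Classic word count with the RDD API.
text = sc.textFile("hdfs:///data/corpus.txt")

counts = (text.flatMap(lambda line: line.split())  # line -> words
              .map(lambda word: (word, 1))         # word -> (word, 1)
              .reduceByKey(lambda a, b: a + b))    # sum counts per word

counts.saveAsTextFile("hdfs:///out/wordcounts")
```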
Apache Spark's features
 Combines SQL, streaming, and complex analytics.
• In addition to simple “map” and “reduce” operations, Spark supports SQL queries,
streaming data, and complex analytics such as machine learning and graph algorithms
out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a
single workflow.
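As a flavor of that combination, here is a sketch mixing a SQL query with RDD-level analytics in one program (this uses the later DataFrame API; the paths and column names are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combined-demo").getOrCreate()

# SQL over structured data (hypothetical JSON input)...
clicks = spark.read.json("hdfs:///data/clicks.json")
clicks.createOrReplaceTempView("clicks")
top_pages = spark.sql(
    "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page")

# ...combined in the same workflow with lower-level RDD analytics.
histogram = (top_pages.rdd
             .map(lambda row: (row.hits // 100, 1))  # bucket by hundreds
             .reduceByKey(lambda a, b: a + b))
print(histogram.collect())
```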
Spark’s major use cases over Hadoop
• Iterative Algorithms in Machine Learning
• Interactive Data Mining and Data Processing
• Spark is a fully Apache Hive-compatible data warehousing system that can run 100x
faster than Hive.
• Stream processing: Log processing and Fraud detection in live streams for alerts,
aggregates and analysis
• Sensor data processing: where data is fetched and joined from multiple sources, in-memory
datasets are really helpful because they are easy and fast to process.
• Note: Spark is still working out bugs as it matures.
GETTING STARTED WITH SPARK
• Stage 1 – Explore and Develop in Spark Local Mode
• The first stage starts with a local mode of Spark where Spark runs on a single node.
• The developer uses this system to learn Spark and starts to build a prototype of their
application leveraging the Spark API.
• Using Spark Shells (Scala & PySpark), a developer rapidly prototypes and packages a
Spark application with tools such as Maven or Scala Build Tool (SBT).
• Even though the dataset is typically small (so that it fits on a developer machine), a
developer can easily debug the application on a single node.
Spark cluster view:
 Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext
object in your main program (called the driver program).
 Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either
Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications.
 Once connected, Spark acquires executors on nodes in the cluster, which are processes that run
computations and store data for your application.
 Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the
executors. Finally, SparkContext sends tasks for the executors to run.
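A minimal sketch of the driver side of this handshake, with a hypothetical standalone master URL (it could equally be "yarn" or a mesos:// address):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cluster-view-demo")
        .setMaster("spark://master-host:7077"))  # hypothetical master URL

# Creating the SparkContext connects to the cluster manager, which
# allocates executors on worker nodes for this application.
sc = SparkContext(conf=conf)

# Application code and tasks are shipped to those executors when an
# action runs.
print(sc.parallelize(range(1000)).sum())
sc.stop()
```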
Things to note about this architecture:
 Each application gets its own executor processes, which stay up for the duration of the whole application
and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both
the scheduling side (each driver schedules its own tasks) and executor side (tasks from different
applications run in different JVMs). However, it also means that data cannot be shared across different
Spark applications (instances of SparkContext) without writing it to an external storage system.
 Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these
communicate with each other, it is relatively easy to run it even on a cluster manager that also supports
other applications (e.g. Mesos/YARN).
 The driver program must listen for and accept incoming connections from its executors throughout its
lifetime. As such, the driver program must be network addressable from the worker nodes.
 Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably
on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open
an RPC to the driver and have it submit operations from nearby than to run a driver far away from the
worker nodes.
Terminologies:
 Application: User program built on Spark. Consists of a driver program and executors on the cluster.
 Application jar: A jar containing the user's Spark application. In some cases users will want to create an
"uber jar" containing their application along with its dependencies. The user's jar should never include
Hadoop or Spark libraries; these will be added at runtime.
 Driver program: The process running the main() function of the application and creating the SparkContext.
 Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager,
Mesos, YARN).
 Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the
driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
 Worker node: Any node that can run application code in the cluster.
 Executor: A process launched for an application on a worker node, that runs tasks and keeps data in
memory or disk storage across them. Each application has its own executors.
Terminologies:
 Task: A unit of work that will be sent to one executor.
 Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action
(e.g. save, collect); you'll see this term used in the driver's logs.
 Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to
the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
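As a sketch of how these terms map onto code (assuming a SparkContext sc and a hypothetical input path), the single action below spawns one job, which the scheduler splits into two stages around the shuffle:

```python
pairs = (sc.textFile("hdfs:///data/corpus.txt")
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1)))

# reduceByKey requires a shuffle, so the job spawned by collect() has
# two stages: the map-side work before the shuffle boundary, and the
# aggregation after it. Each stage runs as many tasks as there are
# partitions, and each task is sent to one executor.
result = pairs.reduceByKey(lambda a, b: a + b).collect()
```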
Misconception about Spark
 Spark is an in-memory technology: none of the Spark developers officially states this! These are
rumors based on a misunderstanding of Spark's computation process.
 Spark has no option for in-memory data persistence. It has pluggable connectors for different persistent
storage systems like HDFS, Tachyon, HBase, Cassandra and so on, but it has no native persistence
code, neither for in-memory nor for on-disk storage. All it can do is cache the data, which is not
"persistence": cached data can easily be dropped and recomputed later from the data available in the
source persistent store reached through a connector. What's more, the heart of Spark, the
"shuffle", writes data to disk. If you have a "group by" statement in your Spark SQL query, or you are
transforming an RDD to a PairRDD and calling some aggregation by key on it, you are forcing Spark to
distribute data among the partitions based on the hash value of the key.
 So finally, Spark is not an in-memory technology. It is a technology that lets you efficiently utilize an
in-memory LRU cache, with possible on-disk eviction when memory is full. It does not have built-in
persistence functionality (neither in-memory nor on-disk), and it puts all of the dataset's data on the local
filesystems during the "shuffle" process.
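A short illustration of the cache-versus-persistence distinction, assuming a SparkContext sc and a hypothetical path: caching is a hint, not a durability guarantee.

```python
logs = sc.textFile("hdfs:///data/events.log")
errors = logs.filter(lambda line: "ERROR" in line)

errors.cache()         # hint: keep these partitions in the LRU cache
print(errors.count())  # materializes and caches the partitions

# If memory pressure later evicts some cached partitions, the next
# action silently recomputes them from `logs` via lineage; nothing
# here is durably persisted by Spark itself.
print(errors.filter(lambda line: "timeout" in line).count())
```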
Apache Spark Architecture
ARCHITECTURE IN DETAILS
 Any Spark process that ever runs on your cluster or local machine is a JVM process. As with any
JVM process, you can configure its heap size with the -Xmx and -Xms flags of the JVM. How does this
process use its heap memory, and why does it need it at all? The diagram on the original slide shows
Spark's memory allocation inside the JVM heap.
ARCHITECTURE IN DETAILS
 By default, Spark starts with a 512 MB JVM heap. To be on the safe side and avoid OOM errors, Spark
allows the use of only 90% of the heap, which is controlled by the spark.storage.safetyFraction parameter.
You might have heard of Spark as an in-memory tool, and Spark does allow you to store some data in
memory.
 You should understand that Spark is not really an in-memory tool; it just utilizes memory for its LRU
cache. Some amount of memory is reserved for caching the data you are processing, and this part
is usually 60% of the safe heap, which is controlled by the spark.storage.memoryFraction parameter.
 So if you want to know how much data you can cache in Spark, take the sum of the heap
sizes of all the executors and multiply it by safetyFraction and by storage.memoryFraction; by default
that is 0.9 * 0.6 = 0.54, or 54% of the total heap size you allow Spark to use.
 Now a bit more about the shuffle memory. It is calculated as "Heap Size"
* spark.shuffle.safetyFraction * spark.shuffle.memoryFraction. The default value for
spark.shuffle.safetyFraction is 0.8 (80%), and the default value for spark.shuffle.memoryFraction is 0.2 (20%).
 So finally you can use up to 0.8 * 0.2 = 0.16, or 16%, of the JVM heap for the shuffle.
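The same arithmetic as a tiny Python sketch, using the legacy defaults quoted above (these fractions belong to Spark's old static memory manager; the heap size is just an example):

```python
heap_mb = 512  # default JVM heap in this legacy model

storage_safety   = 0.9   # spark.storage.safetyFraction
storage_fraction = 0.6   # spark.storage.memoryFraction
shuffle_safety   = 0.8   # spark.shuffle.safetyFraction
shuffle_fraction = 0.2   # spark.shuffle.memoryFraction

cache_mb   = heap_mb * storage_safety * storage_fraction  # 512 * 0.54 ~ 276 MB
shuffle_mb = heap_mb * shuffle_safety * shuffle_fraction  # 512 * 0.16 ~ 82 MB

print(f"LRU cache budget: {cache_mb:.0f} MB")    # 54% of the heap
print(f"Shuffle budget:   {shuffle_mb:.0f} MB")  # 16% of the heap
```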
ARCHITECTURE IN DETAILS
 In general, Spark uses this memory for the exact task it is named after: the shuffle. When the shuffle is
performed, you sometimes also need to sort the data. When you sort, you usually need a
buffer to store the sorted data (remember, you cannot modify the data in the LRU cache in place, as it is
there to be reused later).
 So it needs some amount of RAM to store the sorted chunks of data. What happens if you don't have
enough memory to sort the data? There is a wide range of algorithms, usually referred to as "external
sorting", that allow you to sort the data chunk by chunk and then merge the final result together.
 The last part of RAM is "unroll" memory. The amount of RAM that the unroll process is allowed to use is
spark.storage.unrollFraction * spark.storage.memoryFraction * spark.storage.safetyFraction (the
unrollFraction is defined as a fraction of the storage memory), which with the default values equals
0.2 * 0.6 * 0.9 = 0.108, or about 10.8% of the heap. This is the memory that can be used when you are
unrolling a block of data into memory.
 Why do you need to unroll it at all? Spark allows you to store data in both serialized and
deserialized form. Data in serialized form cannot be used directly, so you have to unroll it before
use; this is the RAM used for unrolling. It is shared with the storage RAM, which means that if
you need some memory to unroll data, this might cause some of the partitions stored in the
Spark LRU cache to be dropped.
SPARK CLUSTER VIEW WITH YARN
SPARK CLUSTER VIEW WITH YARN
 When you have a YARN cluster, it has a YARN Resource Manager daemon that controls the cluster
resources (practically, memory) and a series of YARN Node Managers running on the cluster nodes and
controlling node resource utilization.
 From the YARN standpoint, each node represents a pool of RAM that you have control over. When you
request resources from the YARN Resource Manager, it tells you which Node Managers
you can contact to bring up execution containers for you. Each execution container is a JVM with the
requested heap size. JVM locations are chosen by the YARN Resource Manager and you have no control
over them: if a node has 64 GB of RAM controlled by YARN (the yarn.nodemanager.resource.memory-mb
setting in yarn-site.xml) and you request 10 executors with 4 GB each, all of them can easily be started
on a single YARN node, even if you have a big cluster.
SPARK CLUSTER VIEW WITH YARN
 When you start a Spark cluster on top of YARN, you specify:
 the number of executors you need (--num-executors flag or the
spark.executor.instances parameter),
 the amount of memory to be used by each executor (--executor-memory flag or the
spark.executor.memory parameter),
 the number of cores allowed for each executor (--executor-cores flag or the
spark.executor.cores parameter), and
 the number of cores dedicated to each task's execution (the spark.task.cpus parameter).
 You also specify the amount of memory to be used by the driver application (--driver-memory flag
or the spark.driver.memory parameter).
 When you execute something on a cluster, the processing of your job is split into stages, and each
stage is split into tasks. Each task is scheduled separately. You can consider each JVM working as an
executor to be a pool of task execution slots: each executor gives you
spark.executor.cores / spark.task.cpus execution slots for your tasks, with a total of
spark.executor.instances executors.
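For illustration, the same sizing expressed through SparkConf in Python (the numbers are assumptions; driver memory normally has to be set at launch time, e.g. via spark-submit, rather than inside an already-running driver):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("yarn-sizing-demo")
        .setMaster("yarn")                      # run on the YARN cluster
        .set("spark.executor.instances", "10")  # --num-executors 10
        .set("spark.executor.memory", "4g")     # --executor-memory 4g
        .set("spark.executor.cores", "4")       # --executor-cores 4
        .set("spark.task.cpus", "1"))           # one core per task

sc = SparkContext(conf=conf)

# Each executor offers spark.executor.cores / spark.task.cpus = 4 task
# slots, for a total of 10 * 4 = 40 concurrent tasks in this application.
```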
SPARK HORTONWORKS HDP
 Spark is certified as YARN Ready and is a part of HDP. Memory- and CPU-intensive
Spark-based applications can coexist with other workloads deployed in a YARN-enabled
cluster. This approach avoids the need to create and manage dedicated Spark clusters
and allows for more efficient resource use within a single cluster.
SPARK HORTONWORKS HDP
 HDP also provides consistent governance, security and management policies for Spark applications, just as 
it does for the other data processing engines within HDP.
 Hortonworks approached Spark in the same way they approached other data access engines like Storm, 
Hive, and HBase. They outline a strategy, rally the community, and contribute key features within the 
Apache Software Foundation’s process.
SPARK HORTONWORKS HDP
 Below is a summary of the various integration points that make Spark enterprise-ready:
 Support for the ORCFile format (the Optimized Row Columnar (ORC) file format
provides a highly efficient way to store data).
 Security.
 Operations.
 Improved Reliability and Scale of Spark-on-YARN.
 YARN Integration.
Support for the ORCFile format.
 As part of the Stinger Initiative, the Hive community introduced the Optimized Row
Columnar (ORC) file format.
 ORC is a columnar storage format that is tightly integrated with HDFS and provides
optimizations for both read performance and data compression.
It is rapidly becoming the de facto storage format for Hive.
 Hortonworks contributed to SPARK-2883, which provides basic support of ORCFile in
Spark.
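As a sketch of what ORC support enables, using the later DataFrame reader (the path and column name are assumptions):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("orc-demo")
         .enableHiveSupport()  # read tables/files written by Hive
         .getOrCreate())

# Read an ORC file produced by Hive (hypothetical path and schema).
events = spark.read.orc("hdfs:///warehouse/events_orc")

# The columnar layout lets Spark read only the columns a query needs.
events.groupBy("event_type").count().show()
```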
SECURITY
 Many of their customers' initial use cases for Spark run on Hadoop clusters that either
do not contain sensitive data or are dedicated to a single application, and so they are not
subject to broad security requirements.
 But users plan to deploy Spark-based applications alongside other applications in a single
cluster, so they worked to integrate Spark with the security constructs of the broader
Hadoop platform.
 A common request they hear is that Spark run effectively on a secure Hadoop cluster and
leverage the authorization offered by HDFS.
 Also, to improve security, they have worked within the community to ensure that Spark
runs on a Kerberos-enabled cluster.
 This means that only authenticated users can submit Spark jobs.
OPERATIONS
 Hortonworks continues to focus on streamlining operations for Spark through the 100% open
source Apache Ambari.
 Customers use Ambari to provision, manage and monitor their HDP clusters, and many Hortonworks
partners, such as Microsoft, Teradata, Pivotal and HP, have taken advantage of and backed this
foundational Hadoop project.
 Currently, their partners leverage Ambari Stacks to rapidly define new components/services and add
those within a Hadoop cluster.
 With Stacks, Spark component(s) and services can be managed by Ambari so that you can install, start,
stop and configure to fine-tune a Spark deployment all via a single interface that is used for all engines in
your Hadoop cluster.
 The Quick Links feature of Ambari will allow for the cluster operator to access the native Spark User
Interface.
 To simplify the operational experience, HDP 2.2.4 also allows Spark to be installed and be managed
by Apache Ambari 2.0.
 Ambari allows the cluster administrator to manage the configuration of Spark and Spark daemons life
cycles.
Improved Reliability and Scale of Spark-on-YARN
 The Spark API allows developers to create both iterative and in-memory applications on
Apache Hadoop YARN.
 With the community interest behind it, Spark is making great strides in efficient cluster
resource usage.
 With Dynamic Executor Allocation on YARN, Spark only uses executors within a bound.
 They continue to believe Spark can use the cluster resources more efficiently and are
working with the community to promote a better resource usage.
YARN ATS Integration
 From an operations perspective, Hortonworks has integrated Spark with the YARN
Application Timeline Server (ATS).
 ATS provides generic storage and retrieval of applications’ current and historic
information.
 This permits a common integration point for certain classes of operational information
and metrics.
 With this integration, the cluster operator can take advantage of information already
available from YARN to gain additional visibility into the health and execution status of
the Spark jobs.
About Us
Author: Algae Services
A decade of experience delivering corporate training
and classroom sessions across different streams of
technology.
• The last two years focused on Big Data, ERP, and database
technologies.
• Worked with top service and product companies such as
Wipro, Volvo, TVS, TEG Analytics, General Electric,
Royal Bank of Scotland, and Verizon.
• Worked with universities such as Jain University and RGTU.
Contact Us
– Thanks
– Algae Services
– www.algaeservices.co.in
– http://tutorials.algaeservice.com/
– BTM Bangalore
www.algaeservices.co.in

More Related Content

Recently uploaded

BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 

Recently uploaded (20)

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 

Featured

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellSaba Software
 

Featured (20)

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
 

0001 spark architecture updated

  • 1. http://tutorials.algaeservice.com / Module 1 Big Data and SparkChapter 1: Apache Spark www.algaeservices.co.in By: Algae Services (Pradeep Kumar) Email: ops@algaeservices.co.in Course: http://tutorials.algaeservice.com/
  • 2. http://tutorials.algaeservice.com / About Us Author: Algae Services Have Decade of experience in Corporate Training Classroom sessions in different streams of Technology. • Expertise in Big Data, ERP, Business Process Engineering, Database Technologies(SQL Server), Spread sheet modeling. • Worked with top service and product companies like Wipro , Volvo , TVS , TEG ANALYTICS, General Electrics , Royal Bank of Scotland , Verizon etc. • Worked with Universities like Jain University , RGTU. www.algaeservices.co.in
  • 3. http://tutorials.algaeservice.com / Introduction  Spark was initially started by Matei Zaharia at UC Berkeley AMPLab in 2009, and open sourced in 2010 under a BSD license.  In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became an Apache Top-Level Project.  In November 2014, the engineering team at Databricks used Spark and set a new world record in large scale sorting. www.algaeservices.co.in
  • 4. http://tutorials.algaeservice.com / Introduction  Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute  Streaming  Machine learning  SQL workloads that require fast iterative access to datasets.  It is a framework for performing general data analytics on distributed computing cluster like Hadoop.  It provides in memory computations for increase speed and data process over mapreduce.  It runs on top of existing hadoop cluster and access hadoop data store (HDFS)  It can also process structured data in Hive and Streaming data from HDFS, Flume, Kafka, Twitter www.algaeservices.co.in
  • 5. http://tutorials.algaeservice.com / Is Apache Spark going to replace Hadoop? Hadoop is parallel data processing framework that has traditionally been used to run map/reduce jobs. These are long running jobs that take minutes or hours to complete. Spark has designed to run on top of Hadoop and it is an alternative to the traditional batch map/reduce model that can be used for real-time stream data processing and fast interactive queries that finish within seconds.  So, Hadoop supports both traditional map/reduce and Spark. We should look at Hadoop as a general purpose Framework that supports multiple models We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement to Hadoop. Spark uses more RAM instead of network and disk I/O its relatively fast as compared to hadoop. But as it uses large RAM it needs a dedicated high end physical machine for producing effective results www.algaeservices.co.in
  • 6. http://tutorials.algaeservice.com / Difference between Hadoop MapReduce and Apache Spark Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance whereas Spark uses different data storage model, resilient distributed datasets (RDD), uses a clever way of guaranteeing fault tolerance that minimizes network I/O. From the Spark academic paper: "RDDs achieve fault tolerance through a notion of lineage:  If a partition of an RDD is lost, the RDD has enough information to rebuild just that partition.  This removes the need for replication to achieve fault tolerance. www.algaeservices.co.in
  • 7. http://tutorials.algaeservice.com / Resilient Distributed datasets  Resilient Distributed Datasets (RDDs), a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.  RDDs are motivated by two types of applications that current data flow systems handle inefficiently:  iterative algorithms, which are common in graph applications.  machine learning and interactive data mining tools.  In both cases, keeping data in memory can improve performance by an order of magnitude.  To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.  However, RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized programming models for iterative jobs such as Pregel(google platform for graph processing).  Implementation of RDDs can outperform Hadoop by 20x for iterative jobs and can be used interactively to search a 1 TB dataset with latencies of 5-7 seconds. www.algaeservices.co.in
  • 8. http://tutorials.algaeservice.com / Apache Spark's features  Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in memory.  10x faster even when running on disk.  Spark makes it possible by reducing number of read/write to disc.  It stores this intermediate processing data in-memory.  It uses the concept of an Resilient Distributed Dataset (RDD), which allows it to transparently store data on memory and persist it to disc only it’s needed.  This helps to reduce most of the disc read and write – the main time consuming factors – of data processing. www.algaeservices.co.in
  • 9. http://tutorials.algaeservice.com / Apache Spark's features  Easy to use: Spark lets you quickly write applications in Java, Scala, or Python. This helps developers to create and run their applications on their familiar programming languages and easy to build parallel apps. It comes with a built- in set of over 80 high-level operators. We can use it interactively to query data within the shell too.  Word count in Spark's Python API www.algaeservices.co.in
  • 10. http://tutorials.algaeservice.com / Apache Spark's features  Combines SQL, streaming, and complex analytics. • In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a single workflow. www.algaeservices.co.in
  • 11. http://tutorials.algaeservice.com / Spark’s major use cases over Hadoop • Iterative Algorithms in Machine Learning • Interactive Data Mining and Data Processing • Spark is a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive. • Stream processing: Log processing and Fraud detection in live streams for alerts, aggregates and analysis • Sensor data processing: Where data is fetched and joined from multiple sources, in- memory dataset really helpful as they are easy and fast to process. • Note : Spark is still working out bugs as it matures. www.algaeservices.co.in
  • 12. http://tutorials.algaeservice.com / GETTING START WITH SPARK • Stage 1 – Explore and Develop in Spark Local Mode • The first stage starts with a local mode of Spark where Spark runs on a single node. • The developer uses this system to learn Spark and starts to build a prototype of the their application leveraging the Spark API. • Using Spark Shells (Scala & PySpark), a developer rapidly prototypes and packages a Spark application with tools such as Maven or Scala Build Tool (SBT). • Even though the dataset is typically small (so that it fits on a developer machine), a developer can easily debug the application on a single node. www.algaeservices.co.in
  • 13. http://tutorials.algaeservice.com / Spark cluster view:  Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).  Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications.  Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.  Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run. www.algaeservices.co.in
  • 14. http://tutorials.algaeservice.com / Things to note about this architecture::  Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.  Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).  The driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes.  Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes. www.algaeservices.co.in
  • 15. http://tutorials.algaeservice.com / Terminologies:  Application: User program built on Spark. Consists of a driver program and executors on the cluster.  Application jar: A jar containing the user's Spark application. In some cases users will want to create an "user jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime.  Driver program: The process running the main() function of the application and creating the SparkContext.  Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).  Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.  Worker node: Any node that can run application code in the cluster.  Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. www.algaeservices.co.in
  • 16. http://tutorials.algaeservice.com / Terminologies:  Task: A unit of work that will be sent to one executor.  Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save,collect); you'll see this term used in the driver's logs.  Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs. www.algaeservices.co.in
  • 17. http://tutorials.algaeservice.com / Misconception about Spark  Spark is in-memory technology: none of the Spark developers officially states this! These are the rumors based on the misunderstanding of the Spark computation processes.  Spark has no option for in-memory data persistence, it has pluggable connectors for different persistent storage systems like HDFS, Tachyon, HBase, Cassandra and so on, but it does not have native persistence code, neither for in-memory nor for on-disk storage. Everything it can do is to cache the data, which is not the “persistence”. Cached data can be easily dropped and recomputed later based on the other data available in the source persistent store available through connector. And even more the heart of Spark, “shuffle”, writes data to disks. If you have a “group by” statement in your SparkSQL query or you are just transforming RDD to PairRDD and calling on it some aggregation by key, you are forcing Spark to distribute data among the partitions based on the hash value of the key.  So finally, Spark is not an in-memory technology. It is the technology that allows you to efficiently utilize in-memory LRU cache with possible on-disk eviction on memory full condition. It does not have built-in persistence functionality (neither in-memory, nor on-disk). And it puts all the dataset data on the local filesystems during the “shuffle” process. www.algaeservices.co.in
  • 19. http://tutorials.algaeservice.com / ARCHITECTURE IN DETAILS   Any Spark process that would ever work on your  cluster or local machine is a JVM process. As for any  JVM process, you can configure its heap size with - Xmx and -Xms flags of the JVM. How does this process  use its heap memory and why does it need it at all?  Here’s the diagram of Spark memory allocation inside  of the JVM heap: www.algaeservices.co.in
  • 20. http://tutorials.algaeservice.com / ARCHITECTURE IN DETAILS  By default, Spark starts with 512MB JVM heap. To be on a safe side and avoid OOM error Spark allows to  utilize only 90% of the heap, which is controlled by the spark.storage.safetyFractionparameter of Spark.  Ok, as you might have heard of Spark as an in-memory tool, Spark allows you to store some data in  memory.   you should understand that Spark is not really in-memory tool, it just utilizes the memory for its LRU  cache, So some amount of memory is reserved for the caching of the data you are processing, and this part  is usually 60% of the safe heap, which is controlled by the spark.storage.memoryFraction parameter.   So if you want to know how much data you can cache in Spark, you should take the sum of all the heap  sizes for all the executors, multiply it by safetyFraction and by storage.memoryFraction, and by default it  is 0.9 * 0.6 = 0.54 or 54% of the total heap size you allow Spark to use.  Now a bit more about the shuffle memory. It is calculated as “Heap Size”  *spark.shuffle.safetyFraction * spark.shuffle.memoryFraction. Default value for  spark.shuffle.safetyFraction is 0.8 or 80%, default value for spark.shuffle.memoryFraction is 0.2 or 20%.   So finally you can use up to 0.8*0.2 = 0.16 or 16% of the JVM heap for the shuffle. www.algaeservices.co.in
  • 21. http://tutorials.algaeservice.com / ARCHITECTURE IN DETAILS  In general Spark uses this memory for the exact task it is called after – for Shuffle. When the shuffle is  performed, sometimes you as well need to sort the data. When you sort the data, you usually need a  buffer to store the sorted data (remember, you cannot modify the data in the LRU cache in place as it is  there to be reused later).   So it needs some amount of RAM to store the sorted chunks of data. What happens if you don’t have  enough memory to sort the data? There is a wide range of algorithms usually referenced as “external  sorting” that allows you to sort the data chunk-by-chunk and then merge the final result together.  The last part of RAM is “unroll” memory. The amount of RAM that is allowed to be utilized by unroll  process is spark.storage.unrollFraction * spark.storage.safetyFraction, which with the default values equal  to 0.2 * 0.9 = 0.18 or 18% of the heap. This is the memory that can be used when you are unrolling the  block of data into the memory.   Why do you need to unroll it after all? Spark allows you to store the data both in serialized and  deserialized form. The data in serialized form cannot be used directly, so you have to unroll it before  using, so this is the RAM that is used for unrolling. It is shared with the storage RAM, which means that if  you need some memory to unroll the data, this might cause dropping some of the partitions stored in the  Spark LRU cache.  www.algaeservices.co.in
22. SPARK CLUSTER VIEW WITH YARN
 [Diagram: Spark cluster running on YARN]
23. SPARK CLUSTER VIEW WITH YARN
 A YARN cluster has a YARN ResourceManager daemon that controls the cluster resources (in practice, memory) and a set of YARN NodeManagers running on the cluster nodes and controlling resource utilization on each node.
 From the YARN standpoint, each node represents a pool of RAM that you have control over. When you request resources from the YARN ResourceManager, it tells you which NodeManagers you can contact to bring up execution containers for you. Each execution container is a JVM with the requested heap size.
 JVM locations are chosen by the YARN ResourceManager and you have no control over them: if a node has 64GB of RAM managed by YARN (the yarn.nodemanager.resource.memory-mb setting in yarn-site.xml) and you request 10 executors with 4GB each, all of them can easily be started on a single YARN node, even on a big cluster.
24. SPARK CLUSTER VIEW WITH YARN
 When you start a Spark cluster on top of YARN, you specify:
 the number of executors you need (--num-executors flag or spark.executor.instances parameter),
 the amount of memory for each executor (--executor-memory flag or spark.executor.memory parameter),
 the number of cores allowed for each executor (--executor-cores flag or spark.executor.cores parameter),
 the number of cores dedicated to each task's execution (spark.task.cpus parameter), and
 the amount of memory used by the driver application (--driver-memory flag or spark.driver.memory parameter).
 When you execute something on the cluster, the processing of your job is split into stages, and each stage into tasks, with each task scheduled separately. You can think of each executor JVM as a pool of task execution slots: each executor gives you spark.executor.cores / spark.task.cpus slots for your tasks, across a total of spark.executor.instances executors. An example invocation is shown below.
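A hedged example of these flags in a spark-submit invocation; the application class, jar and sizes are illustrative assumptions:

    # 10 executors x (4 executor-cores / 1 task.cpus) = 40 parallel task slots
    spark-submit \
      --master yarn \
      --num-executors 10 \
      --executor-memory 4g \
      --executor-cores 4 \
      --driver-memory 2g \
      --conf spark.task.cpus=1 \
      --class com.example.MyApp \
      myapp.jar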
25. SPARK ON HORTONWORKS HDP
 Spark is certified as YARN Ready and is part of HDP. Memory- and CPU-intensive Spark-based applications can coexist with other workloads deployed in a YARN-enabled cluster. This approach avoids the need to create and manage dedicated Spark clusters and allows for more efficient resource use within a single cluster.
26. SPARK ON HORTONWORKS HDP
 HDP also provides consistent governance, security and management policies for Spark applications, just as it does for the other data processing engines within HDP.
 Hortonworks approached Spark the same way it approached other data access engines such as Storm, Hive and HBase: outline a strategy, rally the community, and contribute key features within the Apache Software Foundation's process.
27. SPARK ON HORTONWORKS HDP
 A summary of the integration points that make Spark enterprise-ready:
 Support for the ORCFile format (the Optimized Row Columnar (ORC) file format provides a highly efficient way to store data)
 Security
 Operations
 Improved reliability and scale of Spark-on-YARN
 YARN integration
28. Support for the ORCFile Format
 As part of the Stinger Initiative, the Hive community introduced the Optimized Row Columnar (ORC) file format.
 ORC is a columnar storage format that is tightly integrated with HDFS and provides optimizations for both read performance and data compression.
 It is rapidly becoming the de facto storage format for Hive.
 Hortonworks contributed to SPARK-2883, which provides basic support for ORCFile in Spark.
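A minimal sketch of reading and writing ORC through the DataFrame API (Spark 1.4+ built with Hive support; the HDFS paths and app name are illustrative assumptions):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object OrcDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("orc-demo"))
        val hiveContext = new HiveContext(sc)

        // Load ORC into a DataFrame, filter it, and write ORC back out.
        val people = hiveContext.read.format("orc").load("hdfs:///warehouse/people_orc")
        people.filter(people("age") > 21)
              .write.format("orc").save("hdfs:///tmp/adults_orc")

        sc.stop()
      }
    }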
29. SECURITY
 Many customers' initial use cases for Spark run on Hadoop clusters that either contain no sensitive data or are dedicated to a single application, and so are not subject to broad security requirements.
 But users plan to deploy Spark-based applications alongside other applications in a single cluster, so Hortonworks worked to integrate Spark with the security constructs of the broader Hadoop platform.
 A common request is that Spark run effectively on a secure Hadoop cluster and leverage the authorization offered by HDFS.
 To improve security, they have also worked within the community to ensure that Spark runs on a Kerberos-enabled cluster, which means that only authenticated users can submit Spark jobs.
30. OPERATIONS
 Hortonworks continues to streamline operations for Spark through the 100% open source Apache Ambari.
 Customers use Ambari to provision, manage and monitor their HDP clusters, and many Hortonworks partners, such as Microsoft, Teradata, Pivotal and HP, have taken advantage of and backed this foundational Hadoop project.
 Partners use Ambari Stacks to rapidly define new components/services and add them to a Hadoop cluster. With Stacks, Spark components and services can be managed by Ambari, so you can install, start, stop and configure a Spark deployment, fine-tuning it all through the single interface used for every engine in your Hadoop cluster.
 The Quick Links feature of Ambari gives the cluster operator access to the native Spark user interface.
 To simplify the operational experience, HDP 2.2.4 also allows Spark to be installed and managed by Apache Ambari 2.0. Ambari lets the cluster administrator manage Spark's configuration and the life cycle of the Spark daemons.
31. Improved Reliability and Scale of Spark-on-YARN
 The Spark API allows developers to create both iterative and in-memory applications on Apache Hadoop YARN.
 With the community interest behind it, Spark is making great strides toward efficient cluster resource usage.
 With dynamic executor allocation on YARN, Spark uses only as many executors as it needs, within configured bounds.
 Hortonworks continues to believe Spark can use cluster resources more efficiently and is working with the community to promote better resource usage. A configuration sketch is shown below.
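A hedged configuration sketch of dynamic executor allocation on YARN; the app name and bounds are illustrative assumptions, and the external shuffle service must be running on each NodeManager:

    import org.apache.spark.{SparkConf, SparkContext}

    object DynamicAllocationDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("dynamic-allocation-demo")
          .set("spark.dynamicAllocation.enabled", "true")
          // Required on YARN so shuffle files outlive released executors.
          .set("spark.shuffle.service.enabled", "true")
          .set("spark.dynamicAllocation.minExecutors", "2")   // lower bound
          .set("spark.dynamicAllocation.maxExecutors", "20")  // upper bound
        val sc = new SparkContext(conf)
        // ... job code ...
        sc.stop()
      }
    }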
32. YARN ATS Integration
 From an operations perspective, Hortonworks has integrated Spark with the YARN Application Timeline Server (ATS).
 ATS provides generic storage and retrieval of applications' current and historic information, giving a common integration point for certain classes of operational information and metrics.
 With this integration, the cluster operator can use information already available from YARN to gain additional visibility into the health and execution status of Spark jobs.
33. About Us
 Author: Algae Services
 A decade of experience in corporate classroom training across different streams of technology.
 • The last two years focused on Big Data, ERP and database technologies.
 • Worked with top service and product companies such as Wipro, Volvo, TVS, TEG Analytics, General Electric, Royal Bank of Scotland and Verizon.
 • Worked with universities such as Jain University and RGTU.
34. Contact Us
 Thanks, Algae Services
 www.algaeservices.co.in
 http://tutorials.algaeservice.com/
 BTM Bangalore