Big Data Processing with Apache Spark
Jan 16, 2024
© 2024 Wipfli LLP. All rights reserved.
Agenda
 What is Apache Spark?
 Hadoop and Spark
 Features of Spark
 Spark ecosystem
 Spark architecture
 How does Apache Spark integrate with Hadoop?
 How to choose between Hadoop and Spark?
 Limitations of Spark
 Demo 1 – Data ingestion, transformation and visualization using PySpark.
 Demo 2 – Big data ingestion using PySpark.
 Industry implementations
 Resources
 Q&A
What is Apache Spark?
 Apache Spark is a cluster-computing platform that provides an API for distributed programming similar to the MapReduce model, but designed to be fast for interactive queries and iterative algorithms.
 Designed specifically to replace MapReduce, Spark also processes data in batches, with workloads distributed across a cluster of interconnected servers.
 Like its predecessor, the engine supports single- and multi-node deployment and a master-slave architecture. Each Spark cluster has a single master node, or driver, to manage tasks and numerous slaves, or executors, to perform operations. And that is almost where the likeness ends.
 The main difference between Hadoop and Spark lies in how they process data.
 MapReduce stores intermediate results on local disks and reads them back later for further calculations. In contrast, Spark caches data in main memory (RAM, Random Access Memory).
 Even the best possible disk read time lags far behind RAM speeds. It is no surprise that Spark runs workloads up to 100 times faster than MapReduce when all data fits in RAM. When datasets are so large or queries so complex that results must spill to disk, Spark still outperforms the Hadoop engine by roughly ten times.
What is Apache Spark? (continued)
 The Spark driver: The driver is the program or process responsible for coordinating the execution of the Spark application. It runs the main function and creates the SparkContext, which connects to the cluster manager.
 The Spark executors: Executors are worker processes responsible for executing tasks in Spark applications. They are launched on worker nodes and communicate with the driver program and the cluster manager. Executors run tasks concurrently and store data in memory or on disk for caching and intermediate storage.
 The cluster manager: The cluster manager is responsible for allocating resources and managing the cluster on which the Spark application runs. Spark supports several cluster managers, such as Apache Mesos, Hadoop YARN, and its own standalone cluster manager.
 Task: A task is the smallest unit of work in Spark, representing a unit of computation that can be performed on a single partition of data. The driver divides the Spark job into tasks and assigns them to the executors for execution (see the sketch below).
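As a rough illustration of these roles (a minimal sketch, not from the deck; it assumes a local PySpark installation), the snippet below starts a SparkSession, which launches the driver and attaches it to a cluster manager, and shows how the number of partitions determines the number of tasks per stage:

```python
from pyspark.sql import SparkSession

# Starting a SparkSession launches the driver; "local[4]" runs an embedded
# cluster manager with 4 worker threads standing in for executors.
spark = (SparkSession.builder
         .appName("driver-executor-demo")
         .master("local[4]")
         .getOrCreate())

# Each partition of this dataset becomes one task when an action runs.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print("partitions (= tasks per stage):", rdd.getNumPartitions())   # 8
print("sum computed by the executors:", rdd.sum())

spark.stop()
```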
Hadoop vs. Spark
| | Hadoop | Apache Spark |
|---|---|---|
| Data Processing | Batch processing | Batch/stream processing |
| Real-time processing | None | Near real-time |
| Performance | Slower, as the disk is used for storage | Up to 100 times faster due to in-memory operations |
| Fault-tolerance | Replication used for fault tolerance | Checkpointing and RDDs provide fault tolerance |
| Latency | High latency | Low latency |
| Interactive mode | No | Yes |
| Resource Management | YARN | Spark standalone, YARN, Mesos |
| Ease of use | Complex; need to understand low-level APIs | Abstracts most of the distributed-system details |
| Language Support | Java, Python | Scala, Java, Python, R, SQL |
| Cloud support | Yes | Yes |
| Machine Learning | Requires Apache Mahout | Provides MLlib |
| Cost | Low cost, as disk drives are cheaper | Higher cost, since it is a memory-intensive solution |
| Security | Highly secure | Basic security |
MapReduce Architecture
Map Reduce
 MapReduce is a framework for writing applications that process huge amounts of data in parallel, on large clusters of commodity hardware, in a reliable manner.
 Different phases of MapReduce:
 Mapping: The first phase of MapReduce programming. The Mapping phase accepts key-value pairs (k, v) as input, where the key identifies each record and the value is the record content. The output of the Mapping phase is also in key-value format (k', v').
 Shuffling and Sorting: The output of the various mapping tasks (k', v') then goes into the Shuffling and Sorting phase. Pairs are grouped by key, so each key appears once with an array of its values. The output of this phase is again key-value pairs, now as a key and an array of values (k, v[ ]).
 Reducer: The output of the Shuffling and Sorting phase (k, v[ ]) is the input of the Reducer phase. Here the reducer function's logic is executed over the values collected against their corresponding keys. The reducer consolidates the outputs of the various mappers and computes the final output.
 Combining: An optional phase used to optimize MapReduce performance. The combiner pre-aggregates mapper output locally, which reduces the amount of data moved through the Shuffling and Sorting phase (a sketch of the full pipeline follows).
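To make the phases concrete, here is a minimal single-machine sketch in plain Python (an illustration only, not Hadoop code) that walks a word-count example through map, shuffle-and-sort, and reduce:

```python
from collections import defaultdict

def map_phase(records):
    # Emit intermediate (k', v') pairs: one (word, 1) pair per word.
    for _, line in records:          # input records are (key, value) = (offset, text line)
        for word in line.split():
            yield word, 1

def shuffle_and_sort(pairs):
    # Group values by key, producing (k, v[]) as described above.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    # Collapse each key's value list into the final output.
    return {key: sum(values) for key, values in grouped.items()}

records = [(0, "spark hadoop spark"), (1, "hadoop mapreduce")]
print(reduce_phase(shuffle_and_sort(map_phase(records))))
# -> {'hadoop': 2, 'mapreduce': 1, 'spark': 2}
```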
Map Reduce (continued)
 Numeric example: the sample ratings data below is used in the following steps.

| User_Id | Movie_Id | Rating | Timestamp |
|---------|----------|--------|-----------|
| 196 | 242 | 3 | 881250949 |
| 186 | 302 | 3 | 891717742 |
| 196 | 377 | 1 | 878887116 |
| 244 | 51 | 2 | 880606923 |
| 166 | 346 | 1 | 886397596 |
| 186 | 474 | 4 | 884182806 |
| 186 | 265 | 2 | 881171488 |
Map Reduce (continued)
 Step 1 – First, map the values; this happens in the first phase of the MapReduce model.
 Mapping: 196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ; 186:265
 Step 2 – After mapping, shuffle and sort the values, grouping them by key.
 Shuffle & Sort: 166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51
 Step 3 – After steps 1 and 2, reduce each key's list of values to produce the final output (the PySpark sketch below reproduces the grouping).
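For reference, a small PySpark sketch (assuming a running SparkSession; not part of the original example) that reproduces the shuffle-and-sort grouping on the same ratings pairs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ratings-grouping").getOrCreate()

# (User_Id, Movie_Id) pairs from the table above.
ratings = spark.sparkContext.parallelize([
    (196, 242), (186, 302), (196, 377), (244, 51),
    (166, 346), (186, 474), (186, 265),
])

# groupByKey performs the shuffle-and-sort step; the result mirrors step 2.
grouped = ratings.groupByKey().mapValues(list).sortByKey()
print(grouped.collect())
# Expected: [(166, [346]), (186, [302, 474, 265]), (196, [242, 377]), (244, [51])]
# (the order of values within each list may vary)
```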
Features of Spark
 Speed: Spark takes MapReduce to the next level with less expensive shuffles during data processing. It holds intermediate results in memory rather than writing them to disk, which is especially useful when the same dataset must be processed multiple times and can make jobs several times faster than other big-data technologies.
 Fault Tolerance: Apache Spark achieves fault tolerance through an abstraction called the RDD (Resilient Distributed Dataset), which is designed to handle worker-node failure.
 Lazy Evaluation: Spark supports lazy evaluation of big-data queries, which helps optimize the steps in data-processing workflows. It provides a higher-level API to improve developer productivity and a consistent architectural model for big-data solutions (see the sketch below).
 Multiple Language Support: Spark supports several programming languages and can be used interactively from the Scala, Python, R, and SQL shells.
 Real-Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.
 Decoupled Storage and Compute: Spark can connect to virtually any storage system, from HDFS to Cassandra to S3, and import data from a myriad of sources.
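A brief PySpark sketch of the speed and lazy-evaluation points (illustrative only; the dataset is synthetic): transformations just build a plan, nothing executes until an action runs, and cache() keeps the computed result in memory for reuse.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations are lazy: this only builds a plan, nothing runs yet.
df = (spark.range(10_000_000)
      .withColumn("squared", F.col("id") * F.col("id"))
      .filter(F.col("id") % 2 == 0))

df.cache()  # keep the computed result in memory once it is first materialized

print(df.count())                            # first action: runs the optimized plan and fills the cache
print(df.agg(F.sum("squared")).first()[0])   # second action: reuses the cached data
```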
Spark ecosystem
 Spark SQL: Exposes Spark datasets over a JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools. Spark SQL lets users extract data from whatever format it is currently in (such as JSON, Parquet, or a database), transform it, and expose it for ad hoc querying (see the sketch below).
 Spark Streaming: Used for processing real-time streaming data. It is based on a micro-batch style of computing and uses the DStream, essentially a series of RDDs, to process real-time data.
 MLlib: Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
 GraphX: A collection of algorithms and tools for manipulating graphs and performing parallel graph operations and computations. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices in a path.
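As a hedged illustration of the Spark SQL component (the file path and column names are placeholders, not from the deck), a DataFrame can be registered as a view and queried with plain SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# The path and column names below are placeholders.
orders = spark.read.json("/data/orders.json")
orders.createOrReplaceTempView("orders")

# A BI tool speaking JDBC could issue the same query through the Thrift server.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM   orders
    GROUP  BY customer_id
    ORDER  BY total_spent DESC
""").show(5)
```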
Spark architecture
 STEP 1: The client submits the Spark application code. When the application is submitted, the driver implicitly converts the user code, which contains transformations and actions, into a logical directed acyclic graph (DAG). At this stage it also performs optimizations such as pipelining transformations.
 STEP 2: The driver then converts the logical DAG into a physical execution plan with multiple stages. After creating the physical execution plan, it creates physical execution units called tasks under each stage. The tasks are then bundled and sent to the cluster.
 STEP 3: The driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver. The driver then sends tasks to the executors based on data placement. When executors start, they register themselves with the driver, so the driver has a complete view of the executors running its tasks.
 STEP 4: During task execution, the driver monitors the set of executors that are running and schedules future tasks based on data placement (the plans can be inspected as shown below).
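One way to see steps 1 and 2 in practice (a minimal sketch, not from the deck) is to ask Spark to print the plans it builds; the groupBy below introduces a shuffle boundary and therefore a new stage:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# explain(True) prints the logical plan (step 1) and the physical plan with
# its stages and tasks (step 2); groupBy adds a shuffle, hence a new stage.
df.groupBy("bucket").count().explain(True)
```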
Spark Architecture
DAG based processing
How does Apache Spark integrate with Hadoop?
 Unlike Hadoop, which unites storage, processing, and resource management, Spark handles processing only and has no native storage system. Instead, it reads and writes data from and to different sources, including but not limited to HDFS, HBase, and Apache Cassandra. It is also compatible with many data repositories outside the Hadoop ecosystem, such as Amazon S3.
Because it processes data across multiple servers, Spark cannot control resources, mainly CPU and memory, by itself. For this it needs a resource or cluster manager. Currently, the framework supports four options (selected via the master URL, as sketched below):
 Standalone, a simple pre-built cluster manager;
 Hadoop YARN, the most common choice for Spark;
 Apache Mesos, used to control the resources of entire data centers and heavy-duty services; and
 Kubernetes, a container orchestration platform. Running Spark on Kubernetes makes sense if a company plans to move its entire tech stack to cloud-native infrastructure.
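In code, the cluster manager is chosen through the master URL passed to the session builder (the host names below are placeholders; on a real cluster these values usually come from spark-submit or cluster configuration rather than hard-coded strings):

```python
from pyspark.sql import SparkSession

# The master URL selects the cluster manager (host names are placeholders):
#   "local[*]"                    - no cluster manager, all cores on one machine
#   "spark://master-host:7077"    - Spark standalone
#   "yarn"                        - Hadoop YARN (uses HADOOP_CONF_DIR)
#   "mesos://mesos-host:5050"     - Apache Mesos
#   "k8s://https://k8s-api:6443"  - Kubernetes
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("yarn")
         .getOrCreate())
```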
How to choose between Hadoop and Spark?
The choice is not between Spark and Hadoop, but between two processing engines, since Hadoop is more than that.
A clear advantage of MapReduce is that you can perform large, delay-tolerant processing tasks at a relatively low cost.
It works best for archived data that can be analyzed later — say, during night hours. Some real-life use cases are
 Online sentiment analysis to understand how people feel about your products.
 Predictive maintenance to address issues with equipment before they really happen.
 Log file analysis to prevent security breaches.
Spark, in turn, shines when speed is prioritized over price. It’s a natural choice for
 fraud detection and prevention,
 stock market trends prediction,
 near real-time recommendation systems, and
 risk management.
Limitations of Spark
 Pricey hardware. RAM is more expensive than the hard disks MapReduce relies on, which makes Spark operations more costly.
 Near, but not truly, real-time processing. Spark Streaming and in-memory caching let you analyze data very quickly, but it is still not truly real time: the module works with micro-batches, small groups of events collected over a predefined interval. Genuine real-time processing tools handle data streams the moment they are generated (see the sketch below).
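A short Structured Streaming sketch illustrates the micro-batch behaviour (illustrative only; the built-in rate source just generates synthetic rows): even a continuous source is processed one batch per trigger interval.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# The built-in "rate" source generates rows continuously, yet Spark still
# processes them in micro-batches, here one batch every 5 seconds.
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

query = (stream.writeStream
         .format("console")
         .trigger(processingTime="5 seconds")
         .start())

query.awaitTermination(30)   # let it run for ~30 seconds
query.stop()
```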
Demo 1 – Data ingestion, transformation, and visualization using PySpark
 Analyze retail data with PySpark and Databricks.
 Objectives:
 Use modern tools such as Databricks and PySpark to find hidden insights in the data.
 Ingest retail data, available in CSV format, from DBFS.
 Use the PySpark DataFrame API to perform a variety of transformations and actions.
 Use graphical representations to enhance our understanding and analysis of the results (a sketch follows).
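A condensed sketch of what the demo covers (the DBFS path, column names, and the chart step are assumptions, not the actual notebook):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("retail-demo").getOrCreate()

# Ingest: the DBFS path and column names are placeholders, not the demo's exact data.
retail = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("dbfs:/FileStore/retail/online_retail.csv"))

# Transform: total revenue per country, highest first.
revenue = (retail
           .withColumn("revenue", F.col("Quantity") * F.col("UnitPrice"))
           .groupBy("Country")
           .agg(F.round(F.sum("revenue"), 2).alias("total_revenue"))
           .orderBy(F.desc("total_revenue")))

revenue.show(10)
# In a Databricks notebook, display(revenue) renders the result as a chart.
```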
Resources:
Demo 2 – Big data ingestion using PySpark
 Ingest big data files available in PDF format and translate the text to a desired language.
 Objectives:
 Install the required libraries in a Databricks notebook.
 Create functions to extract text and tables, and to convert table data to plain text.
 Ingest and read text from PDF files available in DBFS into a DataFrame.
 Translate the text (a sketch follows).
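A condensed sketch of the ingestion flow (assuming the pdfplumber library is installed; file paths are placeholders, and the translation step is only indicated because the library used in the demo is not specified here):

```python
import pdfplumber
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pdf-ingest-demo").getOrCreate()

def extract_pdf_text(path):
    # Pull plain text page by page; tables are flattened into tab-separated rows.
    chunks = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            chunks.append(page.extract_text() or "")
            for table in page.extract_tables():
                chunks.extend("\t".join(cell or "" for cell in row) for row in table)
    return "\n".join(chunks)

# Placeholder paths; /dbfs/... is how DBFS files appear on the driver's local filesystem.
files = ["/dbfs/FileStore/docs/report1.pdf", "/dbfs/FileStore/docs/report2.pdf"]
df = spark.createDataFrame([(f, extract_pdf_text(f)) for f in files], ["path", "text"])

# The translation step would plug in here, e.g. a UDF wrapping whichever
# translation library or service the notebook installs (omitted from this sketch).
df.show(truncate=80)
```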
Resources:
Industry Implementations
 Walk through the Databricks end-to-end pipeline.
 Run the pipeline and show the DAG created by Spark.
DAG : Query execution plan
End-to-end data pipeline
Resources
 Spark Architecture
 PySpark
 Pandas
 Install Hadoop on Windows – Step by Step
 Install Apache Spark on Windows – Step by Step
 Generate fake data using the Python Faker library
Q&A


Editor's Notes

  1. Distributed computing framework: Spark is built for in-memory parallel processing. Unlike many distributed systems that store intermediate computations on disk, Spark keeps them in memory. The Spark engine supports single- and multi-node deployments, meaning it can be installed on one machine or on many. MapReduce: the Spark project was started by Matei Zaharia, now CTO and co-founder of Databricks, to replace Hadoop's MapReduce. Like Hadoop, Spark supports single- and multi-node deployment (explain both). Spark follows a master-slave architecture; see the next slide.
  2–5. Talking points for the comparison slides: why Hadoop runs only in batch; fault tolerance; how YARN works; machine learning (Mahout vs. MLlib); security features in Hadoop.
  6. Fault tolerance: the Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable, distributed collections of objects of any type; as the name suggests, they are resilient (fault-tolerant) records of data that reside on multiple nodes. Lazy evaluation: lazy evaluation helps optimize processing by evaluating an expression only when it is needed, avoiding unnecessary overhead; Spark remembers the transformations and evaluates them later.
  7. Spark SQL: exposes Spark datasets over the JDBC API and allows running SQL-like queries. Spark Streaming vs. Structured Streaming: Spark Streaming is based on the RDD API (data divided into chunks), whereas Structured Streaming is based on DataFrames and Datasets and uses the Spark SQL optimizer to speed up stream processing. MLlib: the machine learning library in Spark, providing an API for algorithms such as classification, regression, and clustering. GraphX: the Spark API for graphs and graph-parallel computation; it includes a growing collection of graph algorithms and builders to simplify graph analytics. Use cases of graph analysis include disaster detection systems (earthquake, tsunami), financial fraud detection, PageRank (finding a social media influencer in a network), and social media analysis (who follows whom, who liked whose comments).
  8. DAG: a Directed Acyclic Graph is a sequence of events, e.g. wake up → leave the bed → get fresh → take breakfast → get ready → drive to the office. The sequence maps to the stages of "going to the office", but nothing stops you from waking up, going straight to the office, and having breakfast there; the end goal is reaching the office, and the order can be rearranged where needed. Similarly, a DAG is acyclic, meaning there are no cycles or loops in the graph. This property lets Spark optimize and schedule the execution of operations effectively, since it can determine dependencies and execute the stages in the most efficient order. Physical execution plan: