Big Data Processing with Apache Spark
Jan 16, 2024
© 2024 Wipfli LLP. All rights reserved.
Agenda
 What is Apache Spark?
 Hadoop and Spark
 Features of Spark
 Spark ecosystem
 Spark architecture
 How does Apache Spark integrate with Hadoop?
 How to choose between Hadoop and Spark?
 Limitations of Spark
 Demo 1 – Data ingestion, transformation and visualization using PySpark.
 Demo 2 – Big data ingestion using PySpark.
 Industry implementations
 Resources
 Q&A
What is Apache Spark?
 Apache Spark is a cluster-computing platform that provides an API for distributed programming similar to the MapReduce model, but designed to be fast for interactive queries and iterative algorithms.
 Designed specifically to replace MapReduce, Spark also processes data in batches, with workloads distributed across a cluster of interconnected servers.
 Like its predecessor, the engine supports single- and multi-node deployment and a master-slave architecture. Each Spark cluster has a single master node, or driver, to manage tasks and numerous slaves, or executors, to perform operations. And that is almost where the likeness ends.
 The main difference between Hadoop and Spark lies in how they process data.
 MapReduce stores intermediate results on local disks and reads them back later for further calculations. In contrast, Spark caches data in main memory (RAM, Random Access Memory).
 Even the best possible disk read time lags far behind RAM speeds. It is no surprise that Spark runs workloads up to 100 times faster than MapReduce when all data fits in RAM. When datasets are so large or queries so complex that results must spill to disk, Spark still outperforms the Hadoop engine by roughly ten times.
What is Apache Spark? (continued)
 The Spark driver: The driver is the program or process responsible for coordinating the execution of the Spark application. It runs the main function and creates the SparkContext, which connects to the cluster manager.
 The Spark executors: Executors are worker processes responsible for executing tasks in Spark applications. They are launched on worker nodes and communicate with the driver program and the cluster manager. Executors run tasks concurrently and store data in memory or on disk for caching and intermediate storage.
 The cluster manager: The cluster manager is responsible for allocating resources and managing the cluster on which the Spark application runs. Spark supports several cluster managers, such as Apache Mesos, Hadoop YARN, and its own standalone cluster manager.
 Task: A task is the smallest unit of work in Spark, representing a unit of computation that can be performed on a single partition of data. The driver divides the Spark job into tasks and assigns them to the executors for execution (see the sketch below).
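As a rough illustration of these roles (a minimal sketch, not from the deck; it assumes a local PySpark installation), the snippet below starts a SparkSession, which launches the driver and attaches it to a cluster manager, and shows how the number of partitions determines the number of tasks per stage:

```python
from pyspark.sql import SparkSession

# Starting a SparkSession launches the driver; "local[4]" runs an embedded
# cluster manager with 4 worker threads standing in for executors.
spark = (SparkSession.builder
         .appName("driver-executor-demo")
         .master("local[4]")
         .getOrCreate())

# Each partition of this dataset becomes one task when an action runs.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print("partitions (= tasks per stage):", rdd.getNumPartitions())   # 8
print("sum computed by the executors:", rdd.sum())

spark.stop()
```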
Hadoop vs. Spark
| | Hadoop | Apache Spark |
|---|---|---|
| Data Processing | Batch processing | Batch/stream processing |
| Real-time processing | None | Near real-time |
| Performance | Slower, as the disk is used for storage | Up to 100 times faster due to in-memory operations |
| Fault-tolerance | Replication used for fault tolerance | Checkpointing and RDDs provide fault tolerance |
| Latency | High latency | Low latency |
| Interactive mode | No | Yes |
| Resource Management | YARN | Spark standalone, YARN, Mesos |
| Ease of use | Complex; need to understand low-level APIs | Abstracts most of the distributed-system details |
| Language Support | Java, Python | Scala, Java, Python, R, SQL |
| Cloud support | Yes | Yes |
| Machine Learning | Requires Apache Mahout | Provides MLlib |
| Cost | Low cost, as disk drives are cheaper | Higher cost, since it is a memory-intensive solution |
| Security | Highly secure | Basic security |
MapReduce Architecture
Map Reduce
 MapReduce is a framework for writing applications that process huge amounts of data in parallel, on large clusters of commodity hardware, in a reliable manner.
 Different phases of MapReduce:
 Mapping: The first phase of MapReduce programming. The Mapping phase accepts key-value pairs (k, v) as input, where the key identifies each record and the value is the record content. The output of the Mapping phase is also in key-value format (k', v').
 Shuffling and Sorting: The output of the various mapping tasks (k', v') then goes into the Shuffling and Sorting phase. Pairs are grouped by key, so each key appears once with an array of its values. The output of this phase is again key-value pairs, now as a key and an array of values (k, v[ ]).
 Reducer: The output of the Shuffling and Sorting phase (k, v[ ]) is the input of the Reducer phase. Here the reducer function's logic is executed over the values collected against their corresponding keys. The reducer consolidates the outputs of the various mappers and computes the final output.
 Combining: An optional phase used to optimize MapReduce performance. The combiner pre-aggregates mapper output locally, which reduces the amount of data moved through the Shuffling and Sorting phase (a sketch of the full pipeline follows).
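To make the phases concrete, here is a minimal single-machine sketch in plain Python (an illustration only, not Hadoop code) that walks a word-count example through map, shuffle-and-sort, and reduce:

```python
from collections import defaultdict

def map_phase(records):
    # Emit intermediate (k', v') pairs: one (word, 1) pair per word.
    for _, line in records:          # input records are (key, value) = (offset, text line)
        for word in line.split():
            yield word, 1

def shuffle_and_sort(pairs):
    # Group values by key, producing (k, v[]) as described above.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    # Collapse each key's value list into the final output.
    return {key: sum(values) for key, values in grouped.items()}

records = [(0, "spark hadoop spark"), (1, "hadoop mapreduce")]
print(reduce_phase(shuffle_and_sort(map_phase(records))))
# -> {'hadoop': 2, 'mapreduce': 1, 'spark': 2}
```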
Map Reduce (continued)
 Numeric example: the sample ratings data below is used in the following steps.

| User_Id | Movie_Id | Rating | Timestamp |
|---------|----------|--------|-----------|
| 196 | 242 | 3 | 881250949 |
| 186 | 302 | 3 | 891717742 |
| 196 | 377 | 1 | 878887116 |
| 244 | 51 | 2 | 880606923 |
| 166 | 346 | 1 | 886397596 |
| 186 | 474 | 4 | 884182806 |
| 186 | 265 | 2 | 881171488 |
Map Reduce (continued)
 Step 1 – First, map the values; this happens in the first phase of the MapReduce model.
 Mapping: 196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ; 186:265
 Step 2 – After mapping, shuffle and sort the values, grouping them by key.
 Shuffle & Sort: 166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51
 Step 3 – After steps 1 and 2, reduce each key's list of values to produce the final output (the PySpark sketch below reproduces the grouping).
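For reference, a small PySpark sketch (assuming a running SparkSession; not part of the original example) that reproduces the shuffle-and-sort grouping on the same ratings pairs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ratings-grouping").getOrCreate()

# (User_Id, Movie_Id) pairs from the table above.
ratings = spark.sparkContext.parallelize([
    (196, 242), (186, 302), (196, 377), (244, 51),
    (166, 346), (186, 474), (186, 265),
])

# groupByKey performs the shuffle-and-sort step; the result mirrors step 2.
grouped = ratings.groupByKey().mapValues(list).sortByKey()
print(grouped.collect())
# Expected: [(166, [346]), (186, [302, 474, 265]), (196, [242, 377]), (244, [51])]
# (the order of values within each list may vary)
```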
Features of Spark
 Speed: Spark takes MapReduce to the next level with less expensive shuffles during data processing. It holds intermediate results in memory rather than writing them to disk, which is especially useful when the same dataset must be processed multiple times and can make jobs several times faster than other big-data technologies.
 Fault Tolerance: Apache Spark achieves fault tolerance through an abstraction called the RDD (Resilient Distributed Dataset), which is designed to handle worker-node failure.
 Lazy Evaluation: Spark supports lazy evaluation of big-data queries, which helps optimize the steps in data-processing workflows. It provides a higher-level API to improve developer productivity and a consistent architectural model for big-data solutions (see the sketch below).
 Multiple Language Support: Spark supports several programming languages and can be used interactively from the Scala, Python, R, and SQL shells.
 Real-Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.
 Decoupled Storage and Compute: Spark can connect to virtually any storage system, from HDFS to Cassandra to S3, and import data from a myriad of sources.
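A brief PySpark sketch of the speed and lazy-evaluation points (illustrative only; the dataset is synthetic): transformations just build a plan, nothing executes until an action runs, and cache() keeps the computed result in memory for reuse.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations are lazy: this only builds a plan, nothing runs yet.
df = (spark.range(10_000_000)
      .withColumn("squared", F.col("id") * F.col("id"))
      .filter(F.col("id") % 2 == 0))

df.cache()  # keep the computed result in memory once it is first materialized

print(df.count())                            # first action: runs the optimized plan and fills the cache
print(df.agg(F.sum("squared")).first()[0])   # second action: reuses the cached data
```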
Spark ecosystem
 Spark SQL: Exposes Spark datasets over a JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools. Spark SQL lets users extract data from whatever format it is currently in (such as JSON, Parquet, or a database), transform it, and expose it for ad hoc querying (see the sketch below).
 Spark Streaming: Used for processing real-time streaming data. It is based on a micro-batch style of computing and uses the DStream, essentially a series of RDDs, to process real-time data.
 MLlib: Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
 GraphX: A collection of algorithms and tools for manipulating graphs and performing parallel graph operations and computations. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices in a path.
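As a hedged illustration of the Spark SQL component (the file path and column names are placeholders, not from the deck), a DataFrame can be registered as a view and queried with plain SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# The path and column names below are placeholders.
orders = spark.read.json("/data/orders.json")
orders.createOrReplaceTempView("orders")

# A BI tool speaking JDBC could issue the same query through the Thrift server.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM   orders
    GROUP  BY customer_id
    ORDER  BY total_spent DESC
""").show(5)
```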
Spark architecture
 STEP 1: The client submits the Spark application code. When the application is submitted, the driver implicitly converts the user code, which contains transformations and actions, into a logical directed acyclic graph (DAG). At this stage it also performs optimizations such as pipelining transformations.
 STEP 2: The driver then converts the logical DAG into a physical execution plan with multiple stages. After creating the physical execution plan, it creates physical execution units called tasks under each stage. The tasks are then bundled and sent to the cluster.
 STEP 3: The driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver. The driver then sends tasks to the executors based on data placement. When executors start, they register themselves with the driver, so the driver has a complete view of the executors running its tasks.
 STEP 4: During task execution, the driver monitors the set of executors that are running and schedules future tasks based on data placement (the plans can be inspected as shown below).
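One way to see steps 1 and 2 in practice (a minimal sketch, not from the deck) is to ask Spark to print the plans it builds; the groupBy below introduces a shuffle boundary and therefore a new stage:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# explain(True) prints the logical plan (step 1) and the physical plan with
# its stages and tasks (step 2); groupBy adds a shuffle, hence a new stage.
df.groupBy("bucket").count().explain(True)
```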
Spark Architecture
DAG based processing
How does Apache Spark integrate with Hadoop?
 Unlike Hadoop, which unites storage, processing, and resource management, Spark handles processing only and has no native storage system. Instead, it reads and writes data from and to different sources, including but not limited to HDFS, HBase, and Apache Cassandra. It is also compatible with many data repositories outside the Hadoop ecosystem, such as Amazon S3.
Because it processes data across multiple servers, Spark cannot control resources, mainly CPU and memory, by itself. For this it needs a resource or cluster manager. Currently, the framework supports four options (selected via the master URL, as sketched below):
 Standalone, a simple pre-built cluster manager;
 Hadoop YARN, the most common choice for Spark;
 Apache Mesos, used to control the resources of entire data centers and heavy-duty services; and
 Kubernetes, a container orchestration platform. Running Spark on Kubernetes makes sense if a company plans to move its entire tech stack to cloud-native infrastructure.
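In code, the cluster manager is chosen through the master URL passed to the session builder (the host names below are placeholders; on a real cluster these values usually come from spark-submit or cluster configuration rather than hard-coded strings):

```python
from pyspark.sql import SparkSession

# The master URL selects the cluster manager (host names are placeholders):
#   "local[*]"                    - no cluster manager, all cores on one machine
#   "spark://master-host:7077"    - Spark standalone
#   "yarn"                        - Hadoop YARN (uses HADOOP_CONF_DIR)
#   "mesos://mesos-host:5050"     - Apache Mesos
#   "k8s://https://k8s-api:6443"  - Kubernetes
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("yarn")
         .getOrCreate())
```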
How to choose between Hadoop and Spark?
The choice is not between Spark and Hadoop, but between two processing engines, since Hadoop is more than that.
A clear advantage of MapReduce is that you can perform large, delay-tolerant processing tasks at a relatively low cost.
It works best for archived data that can be analyzed later — say, during night hours. Some real-life use cases are
 Online sentiment analysis to understand how people feel about your products.
 Predictive maintenance to address issues with equipment before they really happen.
 Log file analysis to prevent security breaches.
Spark, in turn, shines when speed is prioritized over price. It’s a natural choice for
 fraud detection and prevention,
 stock market trends prediction,
 near real-time recommendation systems, and
 risk management.
Limitations of Spark
 Pricey hardware. RAM is more expensive than the hard disks MapReduce relies on, which makes Spark operations more costly.
 Near, but not truly, real-time processing. Spark Streaming and in-memory caching let you analyze data very quickly, but it is still not truly real time: the module works with micro-batches, small groups of events collected over a predefined interval. Genuine real-time processing tools handle data streams the moment they are generated (see the sketch below).
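A short Structured Streaming sketch illustrates the micro-batch behaviour (illustrative only; the built-in rate source just generates synthetic rows): even a continuous source is processed one batch per trigger interval.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# The built-in "rate" source generates rows continuously, yet Spark still
# processes them in micro-batches, here one batch every 5 seconds.
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

query = (stream.writeStream
         .format("console")
         .trigger(processingTime="5 seconds")
         .start())

query.awaitTermination(30)   # let it run for ~30 seconds
query.stop()
```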
Demo 1 – Data ingestion, transformation, and visualization using PySpark
 Analyze retail data with PySpark and Databricks.
 Objectives:
 Use modern tools such as Databricks and PySpark to find hidden insights in the data.
 Ingest retail data, available in CSV format, from DBFS.
 Use the PySpark DataFrame API to perform a variety of transformations and actions.
 Use graphical representations to enhance our understanding and analysis of the results (a sketch follows).
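A condensed sketch of what the demo covers (the DBFS path, column names, and the chart step are assumptions, not the actual notebook):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("retail-demo").getOrCreate()

# Ingest: the DBFS path and column names are placeholders, not the demo's exact data.
retail = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("dbfs:/FileStore/retail/online_retail.csv"))

# Transform: total revenue per country, highest first.
revenue = (retail
           .withColumn("revenue", F.col("Quantity") * F.col("UnitPrice"))
           .groupBy("Country")
           .agg(F.round(F.sum("revenue"), 2).alias("total_revenue"))
           .orderBy(F.desc("total_revenue")))

revenue.show(10)
# In a Databricks notebook, display(revenue) renders the result as a chart.
```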
Resources:
Demo 2 – Big data ingestion using PySpark
 Ingest big data files available in PDF format and translate the text to a desired language.
 Objectives:
 Install the required libraries in a Databricks notebook.
 Create functions to extract text and tables, and to convert table data to plain text.
 Ingest and read text from PDF files available in DBFS into a DataFrame.
 Translate the text (a sketch follows).
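A condensed sketch of the ingestion flow (assuming the pdfplumber library is installed; file paths are placeholders, and the translation step is only indicated because the library used in the demo is not specified here):

```python
import pdfplumber
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pdf-ingest-demo").getOrCreate()

def extract_pdf_text(path):
    # Pull plain text page by page; tables are flattened into tab-separated rows.
    chunks = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            chunks.append(page.extract_text() or "")
            for table in page.extract_tables():
                chunks.extend("\t".join(cell or "" for cell in row) for row in table)
    return "\n".join(chunks)

# Placeholder paths; /dbfs/... is how DBFS files appear on the driver's local filesystem.
files = ["/dbfs/FileStore/docs/report1.pdf", "/dbfs/FileStore/docs/report2.pdf"]
df = spark.createDataFrame([(f, extract_pdf_text(f)) for f in files], ["path", "text"])

# The translation step would plug in here, e.g. a UDF wrapping whichever
# translation library or service the notebook installs (omitted from this sketch).
df.show(truncate=80)
```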
Resources:
Industry Implementations
 Walk through the Databricks end-to-end pipeline.
 Run the pipeline and show the DAG created by Spark.
DAG : Query execution plan
End-to-end data pipeline
Resources
 Spark Architecture
 PySpark
 Pandas
 Install Hadoop on Windows – Step by Step
 Install Apache Spark on Windows – Step by Step
 Generate fake data using the Python Faker library
Q&A


Editor's Notes

  1. Distributed computing framework: Spark is built for in-memory parallel processing. Unlike many distributed systems that store intermediate computations on disk, Spark keeps them in memory. The Spark engine supports single- and multi-node deployments, meaning it can be installed on one machine or on many. MapReduce: the Spark project was started by Matei Zaharia, now CTO and co-founder of Databricks, to replace Hadoop's MapReduce. Like Hadoop, Spark supports single- and multi-node deployment (explain both). Spark follows a master-slave architecture; see the next slide.
  2–5. Talking points for the comparison slides: why Hadoop runs only in batch; fault tolerance; how YARN works; machine learning (Mahout vs. MLlib); security features in Hadoop.
  6. Fault tolerance: the Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable, distributed collections of objects of any type; as the name suggests, they are resilient (fault-tolerant) records of data that reside on multiple nodes. Lazy evaluation: lazy evaluation helps optimize processing by evaluating an expression only when it is needed, avoiding unnecessary overhead; Spark remembers the transformations and evaluates them later.
  7. Spark SQL: exposes Spark datasets over the JDBC API and allows running SQL-like queries. Spark Streaming vs. Structured Streaming: Spark Streaming is based on the RDD API (data divided into chunks), whereas Structured Streaming is based on DataFrames and Datasets and uses the Spark SQL optimizer to speed up stream processing. MLlib: the machine learning library in Spark, providing an API for algorithms such as classification, regression, and clustering. GraphX: the Spark API for graphs and graph-parallel computation; it includes a growing collection of graph algorithms and builders to simplify graph analytics. Use cases of graph analysis include disaster detection systems (earthquake, tsunami), financial fraud detection, PageRank (finding a social media influencer in a network), and social media analysis (who follows whom, who liked whose comments).
  8. DAG: a Directed Acyclic Graph is a sequence of events, e.g. wake up → leave the bed → get fresh → take breakfast → get ready → drive to the office. The sequence maps to the stages of "going to the office", but nothing stops you from waking up, going straight to the office, and having breakfast there; the end goal is reaching the office, and the order can be rearranged where needed. Similarly, a DAG is acyclic, meaning there are no cycles or loops in the graph. This property lets Spark optimize and schedule the execution of operations effectively, since it can determine dependencies and execute the stages in the most efficient order. Physical execution plan: