OVERVIEW ON SPARK
What is Spark in the context of Big Data?
Spark vs. MapReduce
• Spark was designed for fast, interactive computation that runs in memory, enabling machine learning workloads to run quickly.
• MapReduce requires files to be stored in HDFS; Spark does not.
• Spark can perform some operations up to 100x faster than MapReduce.
• So how does it achieve this speed?
• MapReduce writes most data to disk after each map and reduce operation.
• Spark keeps most of the data in memory after each transformation.
• Spark can spill over to disk if memory fills up.
Spark DataFrames
• Spark DataFrames hold data in a row-and-column (tabular) format.
• Each column represents a feature or variable.
• Each row represents an individual data point.
What is an RDD?
• The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a collection of elements, partitioned across the nodes of the cluster, so that we can execute various parallel operations on it.
• There are two ways to create RDDs (see the sketch below):
• Parallelizing an existing collection in the driver program
• Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
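A minimal PySpark sketch of both approaches; the file name spark_test.txt is reused from the examples later in the deck and is only a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# 1) Parallelize an existing collection in the driver program
nums = sc.parallelize([1, 2, 3, 4, 5])

# 2) Reference a dataset in external storage (path is a placeholder)
lines = sc.textFile("spark_test.txt")

print(nums.count())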
Spark Architecture
• Spark follows a master–slave architecture. Its cluster consists of a single master and multiple slaves (workers).
• The Spark architecture is built on two abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
Driver Program
The driver program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark application, which runs as an independent set of processes on a cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks:
• It acquires executors on nodes in the cluster.
• It sends your application code to the executors. The application code can be JAR or Python files passed to the SparkContext.
• Finally, the SparkContext sends tasks to the executors to run.
Cluster Manager
• The role of the cluster manager is to allocate resources across applications. Spark can run on a wide variety of clusters.
• Several types of cluster managers are supported, such as Hadoop YARN, Apache Mesos, and the Standalone Scheduler.
• The Standalone Scheduler is Spark's built-in cluster manager, which makes it easy to install Spark on an empty set of machines.
Worker Node
• A worker node is a slave node.
• Its role is to run the application code in the cluster.
Executor
• An executor is a process launched for an application on a worker node.
• It runs tasks and keeps data in memory or on disk across them.
• It reads and writes data to external sources.
• Every application has its own executors.
Task
• A unit of work that will be sent to one executor.
Transformations and Actions
Spark Transformations:
• map()
• flatMap()
• filter()
• mapPartitions()
• reduceByKey()
Spark Actions:
• collect()
• count()
• take()
• takeOrdered()
Note – reduceByKey() returns a new RDD and is evaluated lazily, so it belongs with the transformations rather than the actions (see the sketch below).
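A minimal PySpark sketch of the difference, assuming the SparkSession spark from the earlier sketch (the sample values are illustrative):

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

squares = rdd.map(lambda x: x * x)   # transformation: lazy, just builds a new RDD
print(squares.collect())             # action: triggers execution -> [1, 4, 9, 16, 25]
print(squares.count())               # action: returns 5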
Actions
count() returns the number of elements in the RDD.
• For example, if the RDD has the values {1, 2, 2, 3, 4, 5, 5, 6}, then rdd.count() will give the result 8.
count() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())
Note – In the code above, flatMap() splits each line into words, filter() keeps only the words equal to "spark", and count() then returns how many such words there are.
collect()
• The action collect() is the simplest and most common operation that returns the entire content of the RDD to the driver program. A typical application of collect() is unit testing, where the entire RDD is expected to fit in memory; it then becomes easy to compare the result of the RDD with the expected result.
• collect() has the constraint that all of the data must fit on a single machine, since it is copied to the driver.
collect() example:
val data = spark.sparkContext.parallelize(Array(('A', 1), ('b', 2), ('c', 3)))
val data2 = spark.sparkContext.parallelize(Array(('A', 4), ('A', 6), ('b', 7), ('c', 3), ('c', 8)))
val result = data.join(data2)
println(result.collect().mkString(","))
Transformations
• The Spark RDD filter() function returns a new RDD containing only the elements that satisfy a predicate. It is a narrow operation because it does not shuffle data from one partition to many partitions.
• For example, suppose the RDD contains the first five natural numbers (1, 2, 3, 4, and 5) and the predicate checks for even numbers. The resulting RDD after the filter will contain only the even numbers, i.e., 2 and 4.
filter() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())
• Note – In the code above, flatMap() splits each line into words, filter() keeps only the words equal to "spark", and count() then returns how many such words there are.
flatMap()
• With flatMap(), each input element can produce many elements in the output RDD. The simplest use of flatMap() is to split each input string into words.
• map() and flatMap() are similar in that they take a line from the input RDD and apply a function to it. The key difference is that map() returns exactly one element per input, while flatMap() can return a list of elements (zero or more).
flatMap() example:
val data = spark.read.textFile("spark_test.txt").rdd
val flatmapFile = data.flatMap(lines => lines.split(" "))
flatmapFile.foreach(println)
• Note – In the code above, flatMap() splits each line wherever a space occurs.
Starting a Spark session
• SparkSession was introduced in Spark 2.0.
• It is the entry point to underlying Spark functionality, used to programmatically create Spark RDDs, DataFrames, and Datasets.
• The SparkSession object spark is available by default in spark-shell, and it can also be created programmatically using the SparkSession builder pattern.
• Usage:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
PySpark basic syntax
• You can create a DataFrame using createDataFrame().
• You can get the schema of the DataFrame using df.printSchema() (see the sketch below).
data = [('James', '', 'Smith', '1991-04-01', 'M', 3000),
        ('Michael', 'Rose', '', '2000-05-19', 'M', 4000),
        ('Robert', '', 'Williams', '1978-09-05', 'M', 4000),
        ('Maria', 'Anne', 'Jones', '1967-12-01', 'F', 4000),
        ('Jen', 'Mary', 'Brown', '1980-02-17', 'F', -1)]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
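For reference, printing the schema of the DataFrame built above would produce output along these lines (shown as comments):

df.printSchema()
# root
#  |-- firstname: string (nullable = true)
#  |-- middlename: string (nullable = true)
#  |-- lastname: string (nullable = true)
#  |-- dob: string (nullable = true)
#  |-- gender: string (nullable = true)
#  |-- salary: long (nullable = true)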
df.show(): shows the first 20 rows of the DataFrame.
Reading a CSV file with PySpark is shown below.
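A minimal sketch of reading a CSV file with PySpark; the file name adult_data.csv and the header/inferSchema options are illustrative assumptions:

df = spark.read.csv("adult_data.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)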
Select: you can select rows and show them with select(), passing the names of the columns:
df.select('age', 'fnlwgt').show(5)
Count by group: if you want to count the number of occurrences per group, you can chain groupBy() and count(). Below we count the number of rows by education level:
df.groupBy("education").count().sort("count", ascending=True).show()
Describe the data
To get summary statistics of the data, you can use describe(). It computes the:
1. count
2. mean
3. standard deviation
4. min
5. max
df.describe().show()
Drop column
• There are two intuitive APIs to drop columns:
• drop(): drop a column
• dropna(): drop rows containing NA values
Filter data
You can use filter() to compute descriptive statistics on a subset of the data. For instance, you can count the number of people above 40 years old:
df.filter(df.age > 40).count()
Descriptive statistics by group
Finally, you can group the data and compute statistical operations such as the mean (see the sketch below).
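A minimal sketch of a grouped aggregation, reusing the education and age columns referenced above:

# mean age per education level
df.groupBy("education").agg({"age": "mean"}).show()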
Best Practices and Performance Tuning Activities for PySpark
• Technique 1: reduce data shuffle using repartition
• Technique 2: use caching, when necessary
• Technique 3: choose join strategies deliberately – broadcast joins and bucketed joins (see the sketch below)
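A minimal sketch of the three techniques, assuming the df DataFrame from the examples above and a hypothetical small lookup DataFrame small_df; the partition count of 200 is illustrative:

from pyspark.sql.functions import broadcast

# Technique 1: repartition by a key used downstream to reduce shuffling later
df_repart = df.repartition(200, "education")

# Technique 2: cache a DataFrame that is reused several times
df_repart.cache()
df_repart.count()   # the first action materializes the cache

# Technique 3: broadcast join – ship the small table to every executor to avoid a shuffle
result = df_repart.join(broadcast(small_df), on="education", how="left")
result.show(5)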
Spark Streaming
• Spark Streaming is a separate Spark library for processing continuously flowing streaming data.
• PySpark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads.
• It is used to process real-time data from sources such as file system folders, TCP sockets, S3, Kafka, Flume, Twitter, and Amazon Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.
The steps for streaming will be (see the sketch below):
• Create a SparkContext
• Create a StreamingContext
• Create a socket text stream
• Read in the lines as a “DStream”
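A minimal DStream word-count sketch following those four steps; the host localhost and port 9999 are placeholders:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")    # create a SparkContext
ssc = StreamingContext(sc, 1)                        # create a StreamingContext with a 1-second batch interval

lines = ssc.socketTextStream("localhost", 9999)      # create a socket text stream; lines arrive as a DStream
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()             # start the computation
ssc.awaitTermination()  # wait for it to terminate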
• We will load our data into a streaming DataFrame by using readStream. We can also check the status of our stream with the isStreaming property (see the sketch below).
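A minimal Structured Streaming sketch, assuming the SparkSession spark from earlier and a socket source on a placeholder host and port:

# read a socket source as a streaming DataFrame
stream_df = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

print(stream_df.isStreaming)   # True for a streaming DataFrame

# write the stream to the console until stopped
query = (stream_df.writeStream
                  .format("console")
                  .outputMode("append")
                  .start())
query.awaitTermination()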
Streaming Example
• Notebook Attached!
• Using the provided TweetRead.py and Introduction to Spark
Streaming.ipynb will save you a lot of time and frustration!
• Let’s get started!