Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Training | Edureka

Copyright © 2018, edureka and/or its affiliates. All rights reserved.
PySpark Tutorial

www.edureka.co/pyspark-certification-trainingPython Spark Certification Training using PySpark
Objectives of Today’s Training
PySpark1
Advantages of PySpark2
PySpark Installation3
PySpark Fundamentals4
Demo5

PySpark

Spark Ecosystem
Spark SQL
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
Learning)
GraphX
(Graph
Computation)
Apache Spark Core API

Spark SQL
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
Learning)
GraphX
(Graph
Computation)
Apache Spark Core API
Python API for Spark(PySpark)
Python in Spark Ecosystem

PySpark
Spark is an open-source cluster-computing framework which is built around speed, ease of use,
and streaming analytics
Python is general purpose high level programming language. It provides wide range of libraries
and is majorly used for Machine Learning and Data Science
• It is a Python API for Spark majorly used for Data Science and Analysis
• Using PySpark, you can work with Spark RDDs in Python

Advantages Spark with Python

Advantages
EASYTO
LEARN

EASYTO
LEARN
SIMPLE&
COMPREHENSIVE API
Advantages

Advantages
EASYTO
LEARN
BETTERCODE
READABILITY&MAINTENANCE
SIMPLE&
COMPREHENSIVE API

Advantages
EASYTO
LEARN
BETTERCODE
SIMPLE&
COMPREHENSIVE API
AVAILABITLITYOF
VISUALIZATION

Advantages
EASYTO
LEARN
BETTERCODE
SIMPLE&
COMPREHENSIVE API
WIDERANGEOF
LIBRARIES
AVAILABITLITYOF
VISUALIZATION

Advantages
EASYTO
LEARN
BETTERCODE
SIMPLE&
COMPREHENSIVE API
WIDERANGEOF
LIBRARIES
AVAILABITLITYOF
VISUALIZATION
ACTIVE
COMMUNITY

PySpark Installation

1. Go to: https://spark.apache.org/downloads.html
2. Select the Spark version from the drop down list
3. Click on the link to download the file.

Install pip (version 10 or more)
Install jupyter notebook

Add the Spark and PySpark in the bashrc file

PySpark Fundamentals
Spark Context
RDDs
Broadcast &
Accumulator
SparkConf
SparkFiles
DataFrames
StorageLevel
MLlib

Spark Context
RDDs
Broadcast &
Accumulator
SparkConf
SparkFiles
DataFrames
StorageLevel
MLlib

Spark Context
Spark Context
Spark
Context
Py Process
Py4J
Worker (JVM)
Block 1
Worker(JVM)
Block 2
Local FS
Py Process
Py Process
Py Process
Local Cluster
SparkContext is the entry point to any spark functionality
Socket
Socket
Pipe
Pipe
Pipe
Pipe

Spark Context
Master appName sparkHome pyFiles
Environment batchSize Serializer conf
Gateaway JSC Profiler_cls
SparkContext parameters

Spark Context
SparkContext parameters
sparkHome pyFiles
Environment Serializer
Gateaway JSC Profiler_cls
Master appName
batchSize conf

PySpark
Basic life cycle of a PySpark program
01 03
02
Create RDDs Cache RDDs
Lazy
Transformation
Create RDDs from some external
data source or parallelize a
collection in your driver
program.
Lazily transform the base RDDs
into new RDDs using
transformations
Cache some of those RDDs for
future reuse
04 Perform Actions
Perform actions to execute
parallel computation and to
produce results

Resilient Distributed Dataset (RDDs)
RDDs is the building block of every Spark application and is immutable
R
D
D
esilient
istributed
ataset
Fault tolerant and is capable of rebuilding data on failure
Data is distributed among the multiple nodes in a cluster
Collection of partitioned data with primitive values or values of value

Transformations & Actions in RDDs
To work on this immutable data, you need to create a new one via Transformations and Actions
Transformations
❑ map
❑ flatMap
❑ filter
❑ distinct
❑ reduceByKey
❑ mapPartitions
❑ sortBy
Actions
❑ collect
❑ collectAsMap
❑ reduce
❑ countByKey/countByValue
❑ take
❑ first

Broadcast & Accumulator
Parallel processing is achieved in Spark by using shared variables
Shared Variables
Broadcast Accumulator
These variables are used to save
the copy of data across all
nodes
These variables are used to
aggregate the information
through associative and
commutative operations

SparkConf
SparkConf provides the configurations to run a Spark application on a local system or a cluster
SparkConf object is used to set different parameters which takes priority over the system properties
Once SparkConf object is passed to Spark, it becomes immutable

SparkConf
Attributes of SparkConf class
set(key, value)………………………………………
setMaster(value)……………………………………
setAppName(value)…………………………………
get(key, defaultValue=None)………
setSparkHome(value)……………………………
Sets Config property
Sets the master URL
Sets an application’s name
Gets the configuration value of a key
Sets the Spark installation path on worker nodes

SparkFiles
SparkFiles class helps in resolving the paths of files added to the Spark
get(filename)……………………………………………
getrootdirectory()………………………………
It specifies the path of the file that is added through sc.addFile()
It specifies the path to the root directory of the file that is added through sc.addFile()

DataFrames
Dataframe is a distributed collection of rows under named columns
Immutable
Lazy Evaluations
Distributed

Dataframes
Col 1 Col 2 … Col n
Row 1
Row 2
:
Row 3
RDDs
RDBMS
DATA

StorageLevels
Disk Serialize
Memory Replicate
Class StorageLevel decides how RDDs should be stored

MLlib
Machine Learning API in Spark which interoperates with
NumPy in Python is called MLlib
It provides an integrated Data Analysis workflow
Enhances speed and performance

MLlib
Various algorithms supported by MLlib
MLlib Clustering Frequent Pattern Matching Linear Algebra
Linear RegressionClassificationCollaborative Filtering

Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Training | Edureka

Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Training | Edureka

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Training | Edureka

Similar to Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Training | Edureka (20)

More from Edureka!

More from Edureka! (20)

Recently uploaded

Recently uploaded (20)

Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Training | Edureka