** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Programming will give you a complete insight into the various fundamental concepts of PySpark. The fundamental concepts covered are the following:
1. PySpark
2. RDDs
3. DataFrames
4. PySpark SQL
5. PySpark Streaming
6. Machine Learning (MLlib)
Resilient Distributed Dataset (RDD)
An RDD is Spark's core abstraction: data abstracted over a distributed collection
Created using various SparkContext functions
Follows the lazy-evaluation principle: transformations are only computed when an action is called
RDDs are immutable and cacheable in nature
Supports two different types of operations:
Transformations
Actions
DataFrame
Immutable but distributed collection of structured & semi-structured data
Organized into named columns, similar to an RDBMS table
Helps increase the performance of PySpark queries
Supports a wide range of data formats and sources
API support for various languages like Python, R, Scala, Java
PySpark SQL
The PySparkSQL module is a higher-level abstraction over PySpark Core
PySparkSQL is used for processing structured and semi-structured datasets
Through PySparkSQL, SQL and HiveQL code can be used
PySparkSQL provides an optimized API
PySpark Streaming
Library: PySpark Streaming is the live data streaming library of PySpark
APIs: It is a set of APIs that provide a wrapper over the PySpark Core APIs
Fault Tolerant: It can efficiently deal with various fault-tolerance aspects and is highly scalable
Discretized Stream: A Discretized Stream, or DStream, is a high-level abstraction that represents a continuous stream of data
Its newer counterpart, Structured Streaming, is a stream-processing framework that utilizes Spark DataFrames instead of DStreams
Machine Learning (MLlib)
MLlib in PySpark is a machine-learning library
It is a wrapper over PySpark Core for doing data analysis using machine-learning algorithms
It works on distributed systems and is scalable
PySpark also facilitates the development of custom ML algorithms