2. Spark vs. MapReduce
• Spark was designed for fast, interactive, in-memory computation, which lets workloads such as machine learning run quickly.
• MapReduce requires files to be stored in HDFS; Spark does not.
• Spark can perform operations up to 100x faster than MapReduce.
• So how does it achieve this speed?
• MapReduce writes most data to disk after each map and reduce operation.
• Spark keeps most of the data in memory after each transformation.
• Spark can spill over to disk if memory fills up.
3.
• Spark DataFrames hold data in a row-and-column format.
• Each column represents a feature or variable.
• Each row represents an individual data point.
4. What is an RDD?
• The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a collection of elements, partitioned across the nodes of the cluster, so that we can execute various parallel operations on it.
• There are two ways to create RDDs:
• Parallelizing an existing collection in the driver program
• Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat
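A minimal PySpark sketch of both creation paths, assuming a SparkSession named spark and a hypothetical HDFS path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation").getOrCreate()
sc = spark.sparkContext

# 1) Parallelize an existing collection in the driver program
rdd_from_collection = sc.parallelize([1, 2, 3, 4, 5])

# 2) Reference a dataset in external storage (hypothetical path)
rdd_from_file = sc.textFile("hdfs:///data/spark_test.txt")

print(rdd_from_collection.count())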
5. Spark Architecture
• Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves (workers).
• The Spark architecture depends upon two abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
6. Driver Program
The Driver Program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark application, which runs as an independent set of processes on a cluster.
To run on a cluster, the SparkContext can connect to several types of cluster managers and then performs the following tasks:
• It acquires executors on nodes in the cluster.
• It sends your application code to the executors. Here, the application code can be JAR or Python files passed to the SparkContext.
• Finally, the SparkContext sends tasks to the executors to run.
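A minimal sketch of a driver program creating its own SparkContext; the application name and local master URL are illustrative assumptions:

from pyspark import SparkConf, SparkContext

# The driver runs main() and creates the SparkContext
conf = SparkConf().setAppName("driver-example").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.master)   # which master URL / cluster manager we are connected to
sc.stop()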
7. Cluster Manager
• The role of the cluster manager is to allocate resources across applications. Spark can run on a wide range of clusters.
• Spark supports several types of cluster managers, such as Hadoop YARN, Apache Mesos, and the Standalone Scheduler.
• The Standalone Scheduler is Spark's built-in cluster manager, which makes it possible to install Spark on an empty set of machines.
Worker Node
• The worker node is a slave node.
• Its role is to run the application code in the cluster.
Executor
• An executor is a process launched for an application on a worker node.
• It runs tasks and keeps data in memory or on disk across them.
• It reads and writes data to external sources.
• Every application has its own executors.
Task
• A task is a unit of work that is sent to one executor.
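As a sketch, the cluster manager is chosen through the master URL when the session is built; the host name and memory setting below are illustrative assumptions:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-manager-example")
         .master("spark://master-host:7077")      # Standalone Scheduler; use "yarn" for YARN, "local[*]" for local testing
         .config("spark.executor.memory", "2g")   # resources the cluster manager allocates per executor
         .getOrCreate())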
9. Actions
count() returns the number of elements in the RDD.
• For example, if an RDD holds the values {1, 2, 2, 3, 4, 5, 5, 6}, then rdd.count() will return 8.
count() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())
Note – In the above code, flatMap() splits each line into words, filter() keeps only the words equal to "spark", and count() then counts them.
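A rough PySpark equivalent of the same word count, assuming a SparkSession named spark and the same hypothetical spark_test.txt file:

data = spark.sparkContext.textFile("spark_test.txt")
words = data.flatMap(lambda line: line.split(" "))         # split every line into words
spark_words = words.filter(lambda word: word == "spark")   # keep only the word "spark"
print(spark_words.count())                                 # number of matching elements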
10. collect()
• The collect() action is the simplest and most common operation that returns the entire contents of the RDD to the driver program. A typical application of collect() is unit testing, where the entire RDD is expected to fit in memory; this makes it easy to compare the result of the RDD with the expected result.
• collect() has the constraint that all the data must fit on one machine, because it is copied to the driver.
collect() example:
val data = spark.sparkContext.parallelize(Array(('A', 1), ('b', 2), ('c', 3)))
val data2 = spark.sparkContext.parallelize(Array(('A', 4), ('A', 6), ('b', 7), ('c', 3), ('c', 8)))
val result = data.join(data2)
println(result.collect().mkString(","))
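A rough PySpark equivalent of the same join-and-collect, assuming a SparkSession named spark:

data = spark.sparkContext.parallelize([('A', 1), ('b', 2), ('c', 3)])
data2 = spark.sparkContext.parallelize([('A', 4), ('A', 6), ('b', 7), ('c', 3), ('c', 8)])
result = data.join(data2)   # joins on the key of each (key, value) pair
print(result.collect())     # brings the joined pairs back to the driver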
11. Transformations
• Spark's RDD filter() function returns a new RDD containing only the elements that satisfy a predicate. It is a narrow operation because it does not shuffle data from one partition to many partitions.
• For example, suppose an RDD contains the first five natural numbers (1, 2, 3, 4, and 5) and the predicate checks for even numbers. The resulting RDD after the filter will contain only the even numbers, i.e., 2 and 4 (see the sketch at the end of this slide).
filter() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())
• Note – In the above code, flatMap() splits each line into words, filter() keeps only the words equal to "spark", and count() then counts them.
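A minimal PySpark sketch of the even-number example described above, assuming a SparkSession named spark:

numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
evens = numbers.filter(lambda x: x % 2 == 0)   # keep only the elements matching the predicate
print(evens.collect())                         # [2, 4]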
12.
• With flatMap(), each input element can produce many elements in the output RDD. The simplest use of flatMap() is to split each input string into words.
• map() and flatMap() are similar in that they take a line from the input RDD and apply a function to it. The key difference is that map() returns exactly one element per input, while flatMap() can return a list of elements (see the sketch at the end of this slide).
flatMap() example:
val data = spark.read.textFile("spark_test.txt").rdd
val flatmapFile = data.flatMap(lines => lines.split(" "))
flatmapFile.foreach(println)
• Note – In the above code, flatMap() splits each line wherever a space occurs.
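A small PySpark sketch of the map() vs. flatMap() difference, assuming a SparkSession named spark:

lines = spark.sparkContext.parallelize(["hello world", "learn spark"])
mapped = lines.map(lambda line: line.split(" "))     # one list per input line
flat = lines.flatMap(lambda line: line.split(" "))   # one element per word
print(mapped.collect())   # [['hello', 'world'], ['learn', 'spark']]
print(flat.collect())     # ['hello', 'world', 'learn', 'spark']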
13. Starting a Spark session
• SparkSession was introduced in Spark 2.0.
• It is the entry point to underlying Spark functionality for programmatically creating Spark RDDs, DataFrames, and Datasets.
• In spark-shell, a SparkSession object named spark is available by default; in an application it can be created programmatically using the SparkSession builder pattern.
• Usage:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
14. PySpark basic syntax
• You can create a DataFrame using createDataFrame().
• We can get the schema of the DataFrame using df.printSchema().
data = [('James', '', 'Smith', '1991-04-01', 'M', 3000),
        ('Michael', 'Rose', '', '2000-05-19', 'M', 4000),
        ('Robert', '', 'Williams', '1978-09-05', 'M', 4000),
        ('Maria', 'Anne', 'Jones', '1967-12-01', 'F', 4000),
        ('Jen', 'Mary', 'Brown', '1980-02-17', 'F', -1)]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
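As a follow-up, the schema and contents of the df created above can be inspected like this:

df.printSchema()   # column names and inferred types
df.show()          # displays the rows (20 by default)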
15.
df.show(): displays the first 20 rows of the DataFrame.
Below is how to read a CSV file with PySpark:
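A minimal sketch, assuming a hypothetical file adult_data.csv with a header row (the columns used on the next slides, such as age, fnlwgt, and education, come from this dataset):

df = spark.read.csv("adult_data.csv", header=True, inferSchema=True)
df.show(5)         # first 5 rows
df.printSchema()   # inferred column types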
16.
Select: you can select columns by name and show the resulting rows:
df.select('age','fnlwgt').show(5)
Count by group: if you want to count the number of occurrences by group, you can chain:
groupBy()
count()
Below we count the number of rows by education level:
df.groupBy("education").count().sort("count", ascending=True).show()
17. Describe the data
To get summary statistics of the data, you can use describe(). It will compute the:
1. count
2. mean
3. standard deviation
4. min
5. max
df.describe().show()
18.
Drop column
• There are two intuitive APIs to drop columns (see the sketch below):
• drop(): drop a column
• dropna(): drop rows containing NA values
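A minimal sketch of both calls; the column name education_num is an illustrative assumption:

df_no_col = df.drop("education_num")   # drop a single column
df_no_na = df.dropna()                 # drop rows that contain null values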
Filter data
You can use filter() to apply descriptive statistics to a subset of the data. For instance, you can count the number of people above 40 years old:
df.filter(df.age > 40).count()
Descriptive statistics by group
Finally, you can group the data and compute statistical operations such as the mean, as sketched below.
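A minimal sketch of a grouped mean; the column names marital_status and capital_gain are illustrative assumptions:

from pyspark.sql import functions as F

df.groupBy("marital_status").agg(F.mean("capital_gain")).show(5)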
19. Best Practices and Performance Tuning Activities for PySpark
• Technique 1: reduce data shuffle using repartition
• Technique 2: use caching when necessary
• Technique 3: join strategies - broadcast joins and bucketed joins
Each technique is sketched below.
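A minimal sketch of these techniques (bucketed joins are omitted), assuming the df DataFrame from the earlier slides and illustrative column names:

from pyspark.sql import functions as F

# Technique 1: repartition before a wide operation to control data shuffle
df_repart = df.repartition(8, "education")

# Technique 2: cache a DataFrame that is reused several times
df_repart.cache()
df_repart.count()   # the first action materializes the cache

# Technique 3: broadcast the smaller side of a join to avoid shuffling the large side
small_df = spark.createDataFrame([("Bachelors", 13)], ["education", "years"])
joined = df_repart.join(F.broadcast(small_df), on="education")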
20. Spark Streaming
• Spark Streaming is a separate library in Spark for processing continuously flowing streaming data.
• PySpark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads.
• It is used to process real-time data from sources such as file system folders, TCP sockets, S3, Kafka, Flume, Twitter, and Amazon Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.
The steps for streaming will be (sketched below):
Create a SparkContext
Create a StreamingContext
Create a socket text stream
Read in the lines as a "DStream"
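A minimal DStream sketch of these four steps; the host, port, and batch interval are illustrative assumptions:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")      # 1) create a SparkContext
ssc = StreamingContext(sc, 1)                          # 2) StreamingContext with a 1-second batch interval
lines = ssc.socketTextStream("localhost", 9999)        # 3) socket text stream; 4) lines arrive as a DStream
lines.flatMap(lambda line: line.split(" ")).pprint()   # print a few words from each batch
ssc.start()
ssc.awaitTermination()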
21.
• We will load our data into a streaming DataFrame by using readStream. We can also check whether a DataFrame is a streaming one with its isStreaming attribute, as sketched below.
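A minimal Structured Streaming sketch; the socket source, host, and port are illustrative assumptions:

stream_df = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

print(stream_df.isStreaming)   # True for a streaming DataFrame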
22. Streaming Example
• Notebook Attached!
• Using the provided TweetRead.py and Introduction to Spark
Streaming.ipynb will save you a lot of time and frustration!
• Let’s get started!