APACHE SPARK
Spark Programming Basics
Course Outline
Lesson 1.
Big Data and Apache Hadoop
Overview
Lesson 2.
Introduction to Spark
Lesson 3.
Introduction to Python
Lesson 4.
Introduction to PySpark
Lesson 5.
Building Spark Application
Course Progress
Lesson 1
Big Data and
Apache Hadoop
Overview
Lesson 2
Introduction to
Spark
Lesson 3
Introduction to
Python
Lesson 4
Introduction to
PySpark
Lesson 5
Building Spark
Application
Traditional vs Big Data Analytics

Traditional Analytics: $30,000+ per TB (Expensive & Unattainable)
• Hard to scale
• Network is a bottleneck
• Only handles relational data
• Difficult to add new fields & data types
• Expensive, special-purpose, “reliable” servers and expensive licensed software
• Architecture: compute (RDBMS, EDW) connected over the network to separate data storage (SAN, NAS)

Big Data Analytics: $300-$1,000 per TB (Affordable & Attainable)
• Scales out forever
• No bottlenecks
• Easy to ingest any data
• Agile data access
• Commodity “unreliable” servers and hybrid open source software
• Architecture: compute (CPU), memory and storage (disk) co-located on each node
Big Data Scalability
Big Data aims for linear horizontal scalability.
With horizontal scaling, it is often easier to scale dynamically by adding more machines to the existing pool.
Vertical scaling is often limited to the capacity of a single machine; scaling beyond that capacity often involves downtime and comes with an upper limit.
Big Data Scalability
Big Data clusters are built from
industry-standard, widely-available
and relatively inexpensive servers
Big Data clusters can reach
massive scalability by exploiting a
simple distribution architecture
and coordination model
A machine with 1,000 CPU cores would be much more expensive than 1,000 single-core or 250 quad-core machines.
Apache Hadoop is an open-source platform based on a paper published by Google in 2003.
History of Hadoop
Basic Hadoop Architecture
Essential Hadoop Services
Course Progress
Lesson 1
Big Data and
Apache Hadoop
Overview
Lesson 2
Introduction to
Spark
Lesson 3
Introduction to
Python
Lesson 4
Introduction to
PySpark
Lesson 5
Building Spark
Application
Apache Spark
◦ Apache Spark is a fast and general-purpose cluster computing system.
◦ Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management.
◦ Hadoop is just one of the ways to deploy Spark.

◦ Spark can achieve faster speed through controlled partitioning.
◦ It manages data using partitions that help parallelize distributed data processing with minimal network traffic.
Spark Features:
Speed
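As an illustration of controlled partitioning, here is a minimal PySpark sketch (it uses the SparkSession API introduced in Lesson 4; the data is just an illustrative range):

# A minimal sketch: inspecting and controlling the number of partitions of a DataFrame.
# `spark` is a SparkSession as created in Lesson 4.
df = spark.range(0, 1_000_000)

print(df.rdd.getNumPartitions())   # how many partitions Spark chose by default

df16 = df.repartition(16)          # full shuffle into 16 partitions
df4 = df16.coalesce(4)             # reduce the partition count without a full shuffle
print(df4.rdd.getNumPartitions())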
Spark provides high-level APIs in Java, Scala, Python and R.
Spark Features:
Polyglot

Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra, apart from the usual formats such as text files, CSV and RDBMS tables.
Spark Features:
Multiple Formats

◦ Apache Spark delays its evaluation till it is absolutely necessary.
◦ Spark’s computation is real-time and has low latency because of its in-memory computation.
Spark Features:
Lazy & Real Time
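A minimal sketch of lazy evaluation using the PySpark API from Lesson 4 (the CSV path matches the example used later in this course; nothing runs until an action such as show or count is called):

# Transformations only build a plan; no data is read or computed yet.
df = spark.read.option("header", "true").option("delimiter", ";").csv("data/starbucks")
jakarta = df.filter(df.City == "Jakarta")   # still nothing has run

jakarta.show(5)          # action: Spark now reads the file and applies the filter
print(jakarta.count())   # another action triggers the computation again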
Spark Components
◦ Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R.
◦ Spark provides an optimized engine that supports general execution graphs. It also has abundant high-level tools for structured data processing, machine learning, graph processing, and streaming.
Spark SQL lets you query structured data inside Spark programs and provides a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
Spark Components:
Spark SQL

Spark SQL supports the HiveQL syntax as well as Hive SerDes and UDFs, allowing you to access existing Hive warehouses.
Spark Components:
Spark SQL
Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python.
Spark Components:
Spark Streaming

Spark Streaming recovers both lost work and operator state out of the box, without any extra code on your part.
Spark Components:
Spark Streaming

By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
Spark Components:
Spark Streaming
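A minimal streaming sketch using the Structured Streaming API, which follows the same DataFrame style used elsewhere in this course (localhost:9999 is a placeholder socket source, e.g. started with `nc -lk 9999`):

# Word count over lines arriving on a socket.
from pyspark.sql.functions import explode, split

lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the running counts to the console until the query is stopped.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()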
MLlib is Apache Spark's scalable machine learning library, with APIs in Java, Scala, Python, and R. MLlib contains many algorithms and utilities for machine learning models.
Spark Components:
Spark MLlib

◦ MLlib fits into Spark's APIs and interoperates with NumPy in Python and R libraries.
◦ You can use any Hadoop data source (e.g. HDFS, HBase, or local files).
Spark Components:
Spark MLlib

◦ Spark excels at iterative computation, enabling MLlib to run fast.
◦ MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
Spark Components:
Spark MLlib
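A minimal MLlib sketch in PySpark (the tiny training set below is made up purely for illustration):

# Train a logistic regression model on a small in-memory DataFrame.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

training = spark.createDataFrame([
    (Vectors.dense([0.0, 1.1, 0.1]), 1.0),
    (Vectors.dense([2.0, 1.0, -1.0]), 0.0),
    (Vectors.dense([2.0, 1.3, 1.0]), 0.0),
    (Vectors.dense([0.0, 1.2, -0.5]), 1.0),
], ["features", "label"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)

# Apply the model back to the training data and inspect the predictions.
model.transform(training).select("features", "label", "prediction").show()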
GraphX is Apache Spark's API for graphs and graph-parallel computation.
Spark Components:
Spark GraphX
Course Progress
Lesson 1
Big Data and
Apache Hadoop
Overview
Lesson 2
Introduction to
Spark
Lesson 3
Introduction to
Python
Lesson 4
Introduction to
PySpark
Lesson 5
Building Spark
Application
Python is a popular programming language that was designed with an emphasis on code readability; its syntax allows programmers to express concepts in few lines of code. It’s used for:
◦ web development (server-side),
◦ software development,
◦ mathematics,
◦ system scripting.
Introduction to
Python
Python Fundamentals
Python Native Datatypes
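A minimal sketch of Python's native datatypes (the example values are made up for illustration):

an_int = 42                          # int
a_float = 3.14                       # float
a_bool = True                        # bool
a_str = "hello"                      # str
a_list = [1, 2, 3]                   # list: ordered, mutable
a_tuple = (1, 2, 3)                  # tuple: ordered, immutable
a_set = {1, 2, 3}                    # set: unordered, unique elements
a_dict = {"name": "Spark", "lessons": 5}   # dict: key/value mapping

print(type(a_list), len(a_list), a_dict["name"])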
Exercise
1. Create a function to check whether an input is a palindrome or not.
A palindrome is a word, phrase, or sequence that reads the same backward as forward, for example:
• Malam
• Don’t nod
• Harum semar kayak rames murah
• Madam, in eden I’m Adam
2. Using the requests and BeautifulSoup Python libraries, get the full text of an article on detik.com and save the article to a text file. Try to create separate functions for each specific task.
Exercise Answer: Palindrome
import re

def is_palindrome(str_text):
    # Keep only letters and digits, then compare the string with its reverse.
    str_text = re.sub(
        "[^a-z0-9]", "", str_text, 0, re.IGNORECASE).lower()
    if str_text == str_text[::-1]:
        return True
    else:
        return False

str_test = input("Input text for testing: ")
print("'%s' is%s a palindrome!" % (
    str_test,
    "" if is_palindrome(str_test) else "n't"
))
Output:
Input text for testing: Madam, in eden I’m Adam
'Madam, in eden I’m Adam' is a palindrome!
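For the second exercise, one possible approach is sketched below (not the official answer). The URL is a placeholder, and collecting all <p> tags is an assumption about the page layout, so adjust the extraction to the actual article structure.

# A minimal sketch for Exercise 2; URL and tag selection are assumptions.
import requests
from bs4 import BeautifulSoup

def fetch_html(url):
    response = requests.get(url)
    response.raise_for_status()
    return response.text

def extract_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the article body is contained in <p> tags.
    return "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

def save_text(text, path):
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

article_url = "https://news.detik.com/some-article"  # placeholder URL
save_text(extract_text(fetch_html(article_url)), "article.txt")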
Course Progress
Lesson 1
Big Data and
Apache Hadoop
Overview
Lesson 2
Introduction to
Spark
Lesson 3
Introduction to
Python
Lesson 4
Introduction to
PySpark
Lesson 5
Building Spark
Application
PySpark: Spark Session
The first thing a Spark program must do is to create a SparkSession object,
which tells Spark how to access a cluster. First, you need to import some
Spark classes into your program. Add the following line
from pyspark.sql import SparkSession
Then, build a SparkSession that contains information about your application.
spark = SparkSession \
    .builder \
    .appName("Test Spark") \
    .master("local[4]") \
    .getOrCreate()
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
DataFrame
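Besides loading files, a DataFrame can also be created directly from in-memory Python data, which is a quick way to experiment (the rows below are made up for illustration):

# A minimal sketch: creating a DataFrame from a Python list of tuples.
rows = [("Jakarta", 10), ("Bandung", 4), ("Surabaya", 6)]
cities = spark.createDataFrame(rows, ["City", "Stores"])
cities.show()
cities.printSchema()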
Creating DataFrame
We can create a DataFrame by loading data into Spark. For example, to load data from a CSV file:
df = spark.read \
    .option("header", "true") \
    .option("delimiter", ";") \
    .csv("data/starbucks")
df.show(5)
Selecting Columns on DataFrame
To select columns on the DataFrame, we can use the select function
df.select("Store Name").show(5, False)
df.select("Store Name", "Street Address", "Phone Number").show(5, False)
Filtering data on DataFrame
To filter data on the DataFrame, we can use the filter or where function
df.select("Store Name", "Street Address", "Phone Number").filter("City = 'Jakarta'").show(5, False)
df.select("Store Name", "Street Address", "Phone Number").where("City = 'Jakarta'").show(5, False)
df.select("Store Name", "Street Address", "Phone Number").filter(df.City == "Jakarta").show(5, False)
df.select("Store Name", "Street Address", "Phone Number").where(df.City == "Jakarta").show(5, False)
Grouping data on DataFrame
To group data on the DataFrame, we can use the groupBy function. You can add an aggregation function (max, min, count, etc.) and sort after groupBy
df.groupBy("City").count().sort("count", ascending=False).show(10)

from pyspark.sql.functions import col

df.groupBy("City").count() \
    .sort(
        col("count").desc(),
        col("City")
    ).show(10)
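Several aggregations can also be combined in one pass with agg; a minimal sketch reusing the City and Store Name columns from the Starbucks data:

from pyspark.sql.functions import count, countDistinct, col

# Count stores and distinct store names per city in a single aggregation.
df.groupBy("City") \
    .agg(
        count("Store Name").alias("stores"),
        countDistinct("Store Name").alias("distinct_store_names")
    ) \
    .sort(col("stores").desc()) \
    .show(10)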
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
SparkSQL
Spark SQL
The execution plan tells how Spark executes a Spark program or application. The physical execution plan contains tasks, which are bundled and sent to the nodes of the cluster.
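You can inspect the plan Spark builds for a query with explain; a minimal sketch reusing the df loaded earlier:

df.groupBy("City").count().explain()       # physical plan only
df.groupBy("City").count().explain(True)   # parsed, analyzed, optimized and physical plans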
Using SparkSQL on DataFrame
We can use Spark SQL to query a Hive table or a previously created DataFrame. To use Spark SQL on a DataFrame, we must first register it as a temporary view in Spark
df.createOrReplaceTempView("starbucks")
spark.sql("select `Store Name`, `Street Address`, `Phone Number` from starbucks").show(5, False)
Using SparkSQL on DataFrame
To write SQL that spans multiple lines, you can use triple quotes like this
spark.sql("""
select `Store Name`, `Street Address`, `Phone Number`
from starbucks
where City = 'Jakarta'
""").show(5, False)
User Defined Function (UDF)
User-Defined Functions (UDFs) are a feature of Spark SQL for defining new column-based functions that extend the vocabulary of Spark SQL’s DSL for transforming DataFrames
UDF Example
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
prov = {
"AC":"Aceh", "BA":"Bali", "BB":"Bangka Belitung Islands", "BT":"Banten",
"BE":"Bengkulu", "GO":"Gorontalo", "JA":"Jambi", "JB":"West Java",
"JT":"Central Java", "JI":"East Java", "KB":"West Kalimantan",
"KS":"South Kalimantan", "KT":"Central Kalimantan", "KI":"East Kalimantan",
"KU":"North Kalimantan", "KR":"Riau Islands", "LA":"Lampung",
"MA":"Maluku", "MU":"North Maluku", "NB":"West Nusa Tenggara",
"NT":"East Nusa Tenggara", "PA":"Papua", "PB":"West Papua", "RI":"Riau",
"SR":"West Sulawesi", "SN":"South Sulawesi", "ST":"Central Sulawesi",
"SG":"Southeast Sulawesi", "SA":"North Sulawesi", "SB":"West Sumatra",
"SS":"South Sumatra", "SU":"North Sumatra", "JK":"Jakarta",
"YO":"Yogyakarta", "JW":"Jawa",
}
UDF Example
def trans_prov(provName):
    return prov[provName]

spark.udf.register("trans_prov_udf", trans_prov, StringType())

spark.sql("""
select `Store Name`, trans_prov_udf(`State/Province`) Province, City, `Street Address`
from starbucks
""").show(10, False)
Course Progress
Lesson 1
Big Data and
Apache Hadoop
Overview
Lesson 2
Introduction to
Spark
Lesson 3
Introduction to
Python
Lesson 4
Introduction to
PySpark
Lesson 5
Building Spark
Application
To convert an existing ipynb (interactive Python notebook) file, you can right-click on the ipynb file and select “Convert to Python Script”
Convert to Python
Script
To execute the script outside Jupyter or Visual Studio Code, you can run it with the usual python command or submit it to an existing cluster (or run it in local mode) with spark-submit
Spark-submit
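For example (a minimal sketch; the script name is a placeholder, and local[4] and YARN are just two common --master settings):

python my_spark_app.py
spark-submit --master local[4] my_spark_app.py
spark-submit --master yarn --deploy-mode cluster my_spark_app.py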
Challenge
1. Create a function to check whether an input is a palindrome or not.
A palindrome is a word, phrase, or sequence that reads the same backward as forward, for example:
• Malam
• Don’t nod
• Harum semar kayak rames murah
• Madam, in eden I’m Adam
2. Using the requests and BeautifulSoup Python libraries, get the full text of an article on detik.com and save the article to a text file. Try to create separate functions for each specific task.
THANK YOU!
yanuar.singgih@gmail.com | yanuar@xqinformatics.com
081315161641
