| © Copyright 2015 Hitachi Consulting1
Spark with Azure HDInsight
Lightning-fast Big Data Processing
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
| © Copyright 2015 Hitachi Consulting2
Outline
 Spark and Big Data
 Installing Spark
 Spark Core Concepts
 Programming with Spark
 Spark SQL
 Getting Started with Spark on HDInsight
 Spark CLR (Mobius)
 ETL and Automation
 Useful Resources
| © Copyright 2015 Hitachi Consulting3
Introducing Spark
| © Copyright 2015 Hitachi Consulting4
What is Spark?
The Lightning-fast Big Data Processing
 General-purpose Big Data processing engine that integrates with HDFS
 In-memory processing (fast), suited to iterative processing and interactive query
 Libraries for SQL, stream processing, machine learning, and graph processing
 Language bindings: Scala – Python – Java – R – .NET
| © Copyright 2015 Hitachi Consulting5
Spark and Hadoop Ecosystem
Spark and the zoo…
[Diagram: the Hadoop ecosystem — acquisition, batch, in-memory (Spark), stream, SQL (Spark SQL), NoSQL, machine learning, search, orchestration and management applications running on YARN (Yet Another Resource Negotiator) over HDFS (Name Node and DataNodes 1…N)]
| © Copyright 2015 Hitachi Consulting6
Spark Components
Spark and the zoo…
[Diagram: the Spark stack — the Spark Core Engine (RDDs: Resilient Distributed Datasets), with Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning) and GraphX (graph processing) on top, exposed to Scala, Java, Python, R and .NET (Mobius), running on YARN over HDFS (Name Node and DataNodes 1…N)]
| © Copyright 2015 Hitachi Consulting7
Spark Components
Spark Core
 Contains the basic functionality of Spark, including components for task scheduling, memory management, fault
recovery, interacting with storage systems, etc.
 Home to the API that defines resilient distributed datasets (RDDs)
Spark SQL
 Package for working with structured data (DataFrames). It allows querying data via SQL as well as the Apache
Hive variant of SQL (HiveQL), and supports many sources of data, including Hive tables, Parquet, and JSON.
 Allows developers to intermix SQL queries with the programmatic data manipulations supported by
RDDs in a single application, thus combining SQL with complex analytics
Spark Streaming
 Provides an API for manipulating data streams that closely matches the Spark Core RDD API, making it easy for
programmers to learn the project and move between applications that manipulate data stored in memory, on disk,
or arriving in real time.
Spark MLlib
 Provides multiple types of machine learning algorithms, including classification, regression, clustering, and
collaborative filtering, as well as supporting functionality such as model evaluation and data import.
 Provides some lower-level ML primitives, including a generic gradient descent optimization algorithm.
Spark GraphX
 Provides graph manipulation operations and supports graph-parallel computations.
 Allows creating a directed graph with arbitrary properties attached to each vertex and edge.
 Provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common
graph algorithms (e.g., PageRank and triangle counting).
Cluster Managers
 Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple
cluster manager included in Spark itself called the Standalone Scheduler.
| © Copyright 2015 Hitachi Consulting8
What is Spark?
The Lightning-fast Big Data processing
[Diagram: Spark cluster architecture — the Driver Program (with its SparkContext) on the master node submits tasks, through the Cluster Manager (Standalone / YARN / Mesos), to Executor processes running tasks on Worker Nodes 1…N]
Driver Program – Contains the main function, defines distributed datasets and applies operations to them
(e.g. the Spark shell). Submits tasks to the executor processes.
SparkContext – The connection to the computing cluster; used to create distributed datasets.
Initialized automatically (as sc) when using the Spark shell with the default config.
| © Copyright 2015 Hitachi Consulting9
What is Spark?
The Lightning-fast Big Data processing
How Spark works on a cluster:
1. The user submits an application using spark-submit.
2. spark-submit launches the driver program and invokes its main() method.
3. The driver program contacts the cluster manager to ask for resources to launch executors.
4. The cluster manager launches executors on behalf of the driver program.
5. The driver sends RDD transformations and actions to the executors in the form of tasks.
6. Tasks are run on executor processes to compute and save results.
7. When the driver’s main() method exits or calls SparkContext.stop(), the executors are terminated and resources are released from the cluster manager.
| © Copyright 2015 Hitachi Consulting10
Installing Spark
| © Copyright 2015 Hitachi Consulting11
Installing Spark
Windows Standalone Installation (no HDFS)
 Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)
 Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”),
using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"
 Check that the variable has been set: >ECHO %JAVA_HOME%
 Install Spark 1.5.2 or 1.6.*
 Unzip the content to c:\spark
 If there is no Hadoop, you need to install winutils.exe
 Place winutils.exe in c:\hadoop\bin
 Set the HADOOP_HOME environment variable to “c:\hadoop” (so that winutils.exe is found under %HADOOP_HOME%\bin),
using the command prompt: >SETX HADOOP_HOME c:\hadoop
 Check that the variable has been set: >ECHO %HADOOP_HOME%
 Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark
 You can test the following statements
 1+1
 List = sc.parallelize([1,2,5])
 List.count()
 exit()
| © Copyright 2015 Hitachi Consulting17
Submitting a Python script to Spark
Using spark-submit
C:\spark\spark-1.6.1-bin-hadoop2.4\bin\spark-submit <scriptFilePath>
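As a minimal sketch of what such a script might contain (the file name myscript.py is illustrative, not from the deck), a standalone program creates its own SparkContext instead of relying on the shell's sc:

# myscript.py – a hypothetical, minimal PySpark script for spark-submit
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("My Submitted App")    # the master can also be set via --master on spark-submit
sc = SparkContext(conf=conf)
numbers = sc.parallelize([1, 2, 3, 4, 5])            # create an RDD from a local collection
print(numbers.filter(lambda n: n % 2 == 0).count())  # a transformation followed by an action
sc.stop()                                            # release cluster resources

It would then be submitted as, e.g., C:\spark\spark-1.6.1-bin-hadoop2.4\bin\spark-submit myscript.py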
| © Copyright 2015 Hitachi Consulting18
Spark Core Concepts
| © Copyright 2015 Hitachi Consulting19
Spark Core Concepts
Key/Value (Pair) RDDs
Persisting & Removing
RDDs
Per-Partition Operations
Accumulators &
Broadcast Variables
Resilient Distributed Datasets (RDDs)
Transformations Actions
| © Copyright 2015 Hitachi Consulting20
Spark Core Concepts
Resilient Distributed Datasets
 Distributed, Fault-tolerant, Immutable Collection of Memory Objects
 Split into partitions to be processed on different nodes of the cluster.
 Can contain any type of Python, Java, or Scala objects, including
user-defined classes
 Processed through Transformations and Actions
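As a small illustrative sketch (assuming the sc provided by the Spark shell), the partitioning of an RDD can be inspected directly:

rdd = sc.parallelize(range(100), 4)        # explicitly request 4 partitions
print(rdd.getNumPartitions())              # 4 – each partition can be processed on a different node
print(rdd.map(lambda n: n * 2).take(3))    # [0, 2, 4] – transformations run per partition, the action gathers results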
| © Copyright 2015 Hitachi Consulting21
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Parallelizing an existing collection in the driver program
 Loading a dataset from an external data store
| © Copyright 2015 Hitachi Consulting22
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Parallelizing an existing collection in the driver program
collection = ["Khalid", "Magdy", "Nagib", "Salama"]
rdd = sc.parallelize(collection)
| © Copyright 2015 Hitachi Consulting23
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Referencing a dataset in an external storage system, such as a shared filesystem, HDFS,
HBase, etc.
filePath = <"/directory/file.csv" | "/directory" | "/directory/*.csv">
rdd = sc.textFile(filePath)
rdd = sc.wholeTextFiles(directoryPath)
textFile() can read a file, all files in a folder, or files matching a wildcard, and returns an RDD of lines.
wholeTextFiles() returns an RDD of (filename, content) pairs.
| © Copyright 2015 Hitachi Consulting24
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Loading json files
import json
…
input = sc.textFile("jsonfile.json")
data = input.map(lambda x: json.loads(x))
| © Copyright 2015 Hitachi Consulting25
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Load CSV file
import csv
import StringIO
..
def loadRecord(line):
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["store", "date", "value"])
    return reader.next()
..
inputFile = "C:/spark/mywork/data/data.csv"
input = sc.textFile(inputFile).map(loadRecord)
input.collect()[0]
| © Copyright 2015 Hitachi Consulting26
Spark Core Concepts
Processing RDDs
Transformations
 Construct a new RDD based on the current one by manipulating the collection
 Lazy execution: only performed when an action is invoked
 The set of transformations is optimized prior to execution (when an action is invoked) to load and process less data
 Examples: filter(), map(), flatMap(), groupByKey(), cogroup(), reduceByKey(), sortByKey(), distinct(), sample(), union(), intersection(), join(), and more…
Actions
 Compute a result based on an RDD
 Return the results to the Driver Program, or save them to an external storage system
 The RDD is recomputed (i.e., transformations are re-applied) each time an action is invoked
 rdd.cache() or rdd.persist([option]) to reuse the computed RDD
 Examples: reduce(), first(), take(), takeSample(), count(), countByKey(), collect(), saveAsTextFile(), foreach()
| © Copyright 2015 Hitachi Consulting27
Spark Core Concepts
Spark Program
Spark Program in a nutshell (see the sketch below):
1. Create an RDD by loading a dataset from an external file, using textFile()
2. Apply transformations to the RDD, like filter(), map(), join()
3. Call RDD.persist() to mark the computed RDD for reuse across actions
4. Apply actions to the RDD, like count(), reduce(), collect()
5. Save the action results to external data storage using saveAsTextFile()
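A minimal sketch putting the five steps together, assuming the shell's sc and an input file named log.txt (the file name is illustrative):

lines = sc.textFile("log.txt")                        # 1. load a dataset into an RDD
errors = lines.filter(lambda line: "error" in line)   # 2. apply transformations
errors.persist()                                      # 3. mark the computed RDD for reuse
print(errors.count())                                 # 4. apply actions
errors.saveAsTextFile("errors_output")                # 5. save the results to external storage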
| © Copyright 2015 Hitachi Consulting28
Spark Transformations
| © Copyright 2015 Hitachi Consulting29
Programming with Spark
Transformations
Filter – return a subset of the RDD based on some condition(s)
lines = sc.textFile("log.txt")
errors = lines.filter(lambda line: "error" in line or "warn" in line)
errors = errors.filter(lambda line: len(line) > 10)
numbers = sc.parallelize([1,2,3,4,5])
evenNumbers = numbers.filter(lambda n: n % 2 == 0 and n > 3)
def isPrime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5)+1):
        if n % i == 0:
            return False
    return True
primeNumbers = numbers.filter(isPrime)
| © Copyright 2015 Hitachi Consulting30
Programming with Spark
Transformations
Map – return a new collection by applying a function on each element of the RDD
list = sc.parallelize([1,2,3,4,5])
list_sqr = list.map(lambda n: n*n)
[1,2,3,4,5]  ->  [1,4,9,16,25]

lines = sc.textFile("data.txt")
linewords = lines.map(lambda line: line.split(" "))
subset = lines.filter(lambda line: "product" in line)
output = subset.map(lambda line: line.count("bad"))

data.txt:
This product is great
The product I bought yesterday is so bad
I am happy
Very bad product, very bad
lines:
[
This product is great,
The product I bought yesterday is so bad,
I am happy,
Very bad product, very bad
]
linewords:
[
[This, product, is, great],
[The, product, I, bought, yesterday, is, so, bad],
[I, am, happy],
[Very, bad, product, very, bad]
]
subset (lines containing "product"):
[
This product is great,
The product I bought yesterday is so bad,
Very bad product, very bad
]
output (count of "bad" per line in subset):
[0, 1, 2]

lines = sc.textFile("data.txt")
records = lines.map(lambda line: Order.ParseLineToOrder(line))
filtered = records.filter(lambda order: order.SalesValue > 100)

data.txt:
1, 2016/01/01, productA, 456
2, 2016/01/01, productB, 65
3, 2016/01/02, productA, 104
records:
[
Order(Id:1, date:2016-01-01, product:"productA", SalesValue:456),
Order(Id:2, date:2016-01-01, product:"productB", SalesValue:65),
Order(Id:3, date:2016-01-02, product:"productA", SalesValue:104)
]
filtered (SalesValue > 100):
[
Order(Id:1, date:2016-01-01, product:"productA", SalesValue:456),
Order(Id:3, date:2016-01-02, product:"productA", SalesValue:104)
]
| © Copyright 2015 Hitachi Consulting39
Programming with Spark
Transformations
FlatMap – if the map function returns a collection for each item in the RDD,
flatMap returns a "flat" collection, rather than a collection of collections
lines = sc.textFile("data.txt")
words = lines.flatMap(lambda line: line.split(" "))

#word count example
lines = sc.textFile("data.txt")
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda word: (word, 1))
combined = counts.reduceByKey(lambda a, b: a + b)
combined.saveAsTextFile("output.txt")

data.txt:
This product is great
The product I bought yesterday is so bad
I am happy
Very bad product, very bad
words:
[
This, product, is, great, The, product, I,
bought, yesterday, is, so, bad, I, am,
happy, Very, bad, product, very, bad
]
| © Copyright 2015 Hitachi Consulting40
Programming with Spark
Transformations
Union, Intersection, Subtract, Distinct
list1 = [1,2,3,4,5]
list2 = [2,4,6,8,10]
rdd1 = sc.parallelize(list1)
rdd2 = sc.parallelize(list2)
rdd3 = rdd1.union(rdd2)          # [1,2,3,4,5,2,4,6,8,10] – union does not remove duplicates (use distinct())
rdd4 = rdd1.intersection(rdd2)   # [2,4]
rdd5 = rdd1.subtract(rdd2)       # [1,3,5]
| © Copyright 2015 Hitachi Consulting41
Spark Actions
| © Copyright 2015 Hitachi Consulting42
Programming with Spark
Actions
Reduce – operates on two elements in your RDD and returns a new element of the same type.
numbers = sc.parallelize([1,2,3,4,5])
sum = numbers.reduce(lambda a,b: a+b)
max = numbers.reduce(lambda a,b: a if a > b else b)
| © Copyright 2015 Hitachi Consulting43
Programming with Spark
Actions
Reduce – operates on two elements in your RDD and returns a new element of the same type.
numbers = sc.parallelize([1,2,3,4,5])
sum = numbers.reduce(lambda a,b: a+b)
max = numbers.reduce(lambda a,b: a if a > b else b)
words = sc.parallelize(["hello", "my", "name", "is", "Khalid"])
concatenated = words.reduce(lambda a, b: a + " " + b)
[
"hello",
"my",
"name",
"is",
"Khalid"
]
'hello my name is Khalid'
| © Copyright 2015 Hitachi Consulting44
Programming with Spark
Actions
Aggregate – aggregates the elements of your RDD and can return a result of a different type.
words = sc.parallelize(["hello", "my", "name", "is", "Khalid"])
number_of_letters = words.aggregate(0,
    (lambda acc, value: acc + len(value)),
    (lambda acc1, acc2: acc1 + acc2))
Alternatively:
number_of_letters = words.map(lambda word: len(word)).reduce(lambda a, b: a + b)
Return a tuple (sum, count) to calculate an average:
nums = sc.parallelize([1, 2, 3, 4, 5])
sumCount = nums.aggregate((0, 0),
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
avg = sumCount[0] / float(sumCount[1])
aggregate() takes an initial value of the return type (here, the initial value is a tuple)
and 2 functions: 1) how to add an element to the accumulated value, 2) how to
merge two accumulated values.
| © Copyright 2015 Hitachi Consulting45
Programming with Spark
Actions
collect – Returns the RDD as a collection (not an RDD anymore!).
count – Returns the number of elements in the RDD.
takeSample – Takes n random elements from the RDD.
first – Takes the first element in the RDD.
countByValue – Returns a map of each unique value to its count.
foreach – Performs an operation on each element in the computed RDD.
saveAsTextFile – Saves the content of the RDD to a text file.
saveAsSequenceFile – Saves the content of the RDD to a Sequence file.
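A small sketch exercising a few of these actions on a toy RDD (assuming the shell's sc; the values are illustrative):

nums = sc.parallelize([3, 1, 2, 3, 5])
print(nums.count())            # 5
print(nums.first())            # 3
print(nums.take(2))            # [3, 1]
print(nums.countByValue())     # defaultdict: {1: 1, 2: 1, 3: 2, 5: 1}
nums.foreach(lambda n: None)   # runs a function on the executors purely for its side effects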
| © Copyright 2015 Hitachi Consulting46
Programming with Spark
Actions
Saving data as JSON
import csv
import StringIO
import json
..
def loadRecord(line):
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["store", "date", "value"])
    return reader.next()
..
inputFile = "C:/spark/mywork/data/data.csv"
input = sc.textFile(inputFile).map(loadRecord)
#data transformation, e.g., map(), filter(), reduce, etc..
outputFile = "C:/spark/mywork/data/data.json"
input.map(lambda element: json.dumps(element)).saveAsTextFile(outputFile)
| © Copyright 2015 Hitachi Consulting47
Persisting RDDs
| © Copyright 2015 Hitachi Consulting48
Programming with Spark
RDD Persistence
 Spark performs the transformations on an RDD in a lazy manner; only after an action is invoked
 Spark re-computes the RDD each time an action is called on the RDD
 This can be especially expensive for iterative algorithms, which look at the data many times
rdd = numbers.filter(lambda a: a >10)
rdd2 = rdd.map(lambda a: a*a)
rdd2.count()
rdd2.collect()
rdd = numbers.filter(lambda a: a >10)
rdd2 = rdd.map(lambda a: a*a)
rdd2.cache()
rdd2.count()
rdd2.collect()
Each action will cause the RDD to be
recomputed (filter & map)
This will compute and persist the RDD to
perform several actions on it
| © Copyright 2015 Hitachi Consulting49
Programming with Spark
RDD Persistence
rdd.persist(StorageLevel.<level>)
rdd.cache() is the same as rdd.persist() with the default level (StorageLevel.MEMORY_ONLY)
rdd.unpersist() to free up memory from unused RDDs
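A short sketch of persisting an RDD with an explicit storage level (a sketch only; MEMORY_AND_DISK is just one of the available levels):

from pyspark import StorageLevel

squares = sc.parallelize(range(1000)).map(lambda n: n * n)
squares.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill to disk if it does not fit
print(squares.count())    # the first action computes and caches the RDD
print(squares.sum())      # later actions reuse the cached partitions
squares.unpersist()       # free the cached partitions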
| © Copyright 2015 Hitachi Consulting50
Working with Pair RDDs
| © Copyright 2015 Hitachi Consulting51
Programming with Spark
Working with Key/Value Pairs
 Pair RDDs are a useful building block in many programs, as they expose operations that allow
you to act on each key in parallel or regroup data across the network.
 Usually used to perform operations like join, group, sort, reduceByKey, etc.
data = sc.textFile("data.txt")
data = data.map(lambda line: line.split(","))
keyValueDataset = data.map(lambda elements: (elements[0], elements[1]))

def parseLine(line):
    parts = line.split(",")
    record = MyRecord(parts[0], parts[1], parts[2])
    return (record.Product, record)
keyValueDataset = sc.textFile("data.txt").map(parseLine)

data.txt:
A,234,01/01/2015
B,567,01/01/2015
A,157,01/01/2015
C,56,01/01/2015
B,345,01/01/2015
B,678,01/01/2015
keyValueDataset:
[(A,234), (B,567), (A,157), (C,56), (B,345), (B,678)]
| © Copyright 2015 Hitachi Consulting52
Programming with Spark
Working with Key/Value Pairs - Transformations
reduceByKey – aggregate values with the same key using a given function
data = sc.textFile("data.txt")
data = data.map(lambda line: line.split(","))
keyValueDataset = data.map(lambda elements: (elements[0], int(elements[1])))
result = keyValueDataset.reduceByKey(lambda a, b: a + b)

data.txt:
A,2
B,5
A,1
C,5
B,3
B,6
result:
[
(A,3),
(B,14),
(C,5)
]
| © Copyright 2015 Hitachi Consulting53
Programming with Spark
Working with Key/Value Pairs - Transformations
groupByKey - group values with the same key in a collection
data = sc.textFile("data.txt")
data = data.map(lambda line: line.split(","))
keyValueDataset = data.map(lambda elements: (elements[0], elements[1]))
result = keyValueDataset.groupByKey()

data.txt:
A,2
B,5
A,1
C,5
B,3
B,6
result:
[
(A,[2,1]),
(B,[5,3,6]),
(C,[5])
]
| © Copyright 2015 Hitachi Consulting54
Programming with Spark
Working with Key/Value Pairs - Transformations
mapValues - apply a function on each value of the pair without changing the key
data = sc.textFile("data.txt")
data = data.map(lambda line: line.split(","))
keyValueDataset = data.map(lambda elements: (elements[0], int(elements[1])))
result = keyValueDataset.mapValues(lambda a: a*a)

data.txt:
A,2
B,5
A,1
C,5
B,3
B,6
result:
(A,4),
(B,25),
(A,1),
(C,25),
(B,9),
(B,36)
| © Copyright 2015 Hitachi Consulting55
Programming with Spark
Working with Key/Value Pairs - Transformations
join - Perform an inner join between two RDDs
data1 = sc.textFile("data1.txt")
data1 = data1.map(lambda line: line.split(","))
keyValueDataset1 = data1.map(lambda elements: (elements[0], elements[1]))
data2 = sc.textFile("data2.txt")
data2 = data2.map(lambda line: line.split(","))
keyValueDataset2 = data2.map(lambda elements: (elements[0], elements[1]))
result = keyValueDataset1.join(keyValueDataset2)

data1.txt:
A,2
B,5
A,1
C,5
B,3
B,6
data2.txt:
A,22
B,55
A,11
result:
[
(A,(2,22)),
(A,(2,11)),
(A,(1,22)),
(A,(1,11)),
(B,(5,55)),
(B,(3,55)),
(B,(6,55))
]
| © Copyright 2015 Hitachi Consulting56
Programming with Spark
Working with Key/Value Pairs - Transformations
cogroup - Group data from both RDDs sharing the same key
data1 = sc.textFile("data1.txt")
data1 = data1.map(lambda line: line.split(","))
keyValueDataset1 = data1.map(lambda elements: (elements[0], elements[1]))
data2 = sc.textFile("data2.txt")
data2 = data2.map(lambda line: line.split(","))
keyValueDataset2 = data2.map(lambda elements: (elements[0], elements[1]))
result = keyValueDataset1.cogroup(keyValueDataset2)
result = result.mapValues(lambda tuple: list(tuple[0]) + list(tuple[1]))

data1.txt:
A,2
B,5
A,1
C,5
B,3
B,6
data2.txt:
A,22
B,55
A,11
cogroup result:
[
(A,([2,1],[22,11])),
(B,([5,3,6],[55])),
(C,([5],[]))
]
after mapValues:
[
(A,[2,1,22,11]),
(B,[5,3,6,55]),
(C,[5])
]
| © Copyright 2015 Hitachi Consulting57
Programming with Spark
Working with Key/Value Pairs - Transformations
zip - Pair an RDD with another RDD to produce a new key/value RDD,
with respect to the element order of each RDD
rdd1 = sc.parallelize(['A','B','C','D'])
rdd2 = sc.parallelize([101,102,103,104])
pairs = rdd1.zip(rdd2)
pairs.collect()
[
A,
B,
C,
D
]
[
101,
102,
103,
104
]
[
(A,101)
(B,102)
(C,103)
(D,104)
]
| © Copyright 2015 Hitachi Consulting58
Programming with Spark
Working with Key/Value Pairs - Transformations
zipWithIndex - Pair each element in the RDD with its index
list = ['A','B','C','D']
rdd = sc.parallelize(list)
rdd_indexed = rdd.zipWithIndex()
rdd_indexed.collect()
[
A,
B,
C,
D
]
[
(A, 0)
(B, 1)
(C, 2)
(D, 3)
]
| © Copyright 2015 Hitachi Consulting59
Programming with Spark
Working with Key/Value Pairs - Transformations
flatMapValues() – same as flatMap(), but with pair RDDs
keys()
values()
sortByKey()
leftOuterJoin()
rightOuterJoin()
subtractByKey()
combineByKey() – same as aggregate(), but with pair RDDs
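As a hedged sketch of combineByKey() (the classic per-key average, assuming string keys and numeric values as in the slides above):

pairs = sc.parallelize([("A", 2), ("B", 5), ("A", 1), ("C", 5), ("B", 3), ("B", 6)])
sumCount = pairs.combineByKey(
    (lambda v: (v, 1)),                          # createCombiner: the first value seen for a key
    (lambda acc, v: (acc[0] + v, acc[1] + 1)),   # mergeValue: add a value to a per-partition accumulator
    (lambda a, b: (a[0] + b[0], a[1] + b[1])))   # mergeCombiners: merge accumulators across partitions
avgByKey = sumCount.mapValues(lambda acc: acc[0] / float(acc[1]))
print(avgByKey.collect())    # e.g. [('A', 1.5), ('B', 4.66...), ('C', 5.0)]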
| © Copyright 2015 Hitachi Consulting60
Programming with Spark
Working with Key/Value Pairs - Transformations
partitionBy
 Hash Partitioning by key.
 elements with the same key will end up being processed on the same compute node, to reduce data
shuffling
 Useful with operations like cogroup(), groupWith(), join(), groupByKey(), reduceByKey(), combineByKey(),
and lookup().
 Usually used when data is loaded, then the rdd is persisted
data1 = sc.textFile("data1.txt")
data1 = data1.map(lambda line: line.split(","))
keyValueDataset1 = data1.map(lambda elements: (elements[0], elements[1])).partitionBy(10).persist()
data2 = sc.textFile("data2.txt")
keyValueDataset2 = data2.map(lambda line: (line.split(",")[0], line.split(",")[1]))
joined = keyValueDataset1.join(keyValueDataset2)
| © Copyright 2015 Hitachi Consulting61
Programming with Spark
Working with Key/Value Pairs - Actions
countByKey()
collectAsMap()
lookup(key) - Return all values associated with the provided key.
Many pair RDD operations take the number of parallel tasks (reducers) as an optional parameter
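A brief sketch of these pair-RDD actions on the same toy data (values illustrative):

pairs = sc.parallelize([("A", 2), ("B", 5), ("A", 1), ("C", 5)])
print(pairs.countByKey())     # defaultdict: {'A': 2, 'B': 1, 'C': 1}
print(pairs.collectAsMap())   # {'A': 1, 'B': 5, 'C': 5} – for duplicate keys only one value is kept
print(pairs.lookup("A"))      # [2, 1]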
| © Copyright 2015 Hitachi Consulting62
Accumulators and Broadcast
Variables
| © Copyright 2015 Hitachi Consulting63
Programming with Spark
Accumulators
 An RDD function in Spark (such as map() or filter()) can use a variable defined outside the driver program
 However, each task running on the cluster gets a new copy of this variable.
 Updates from these copies are not propagated back to the driver.
 Accumulators provide a simple syntax for aggregating values from worker nodes back to the driver
program.
 A common use of accumulators is to count events that occur during job execution, maybe for
debugging purposes.
 In a worker task, accumulators are write-only variables. Only the driver program can retrieve the value of an
accumulator
| © Copyright 2015 Hitachi Consulting64
Programming with Spark
Accumulators
Accumulator Example
file = sc.textFile(inputFile)
blankLines = sc.accumulator(0)
def extractRecords(line):
    global blankLines
    if (line == ""):
        blankLines += 1
    return line.split(" ")
records = file.flatMap(extractRecords)
records.count()   # an action must run before the accumulator is populated
print "Blank lines: %d" % blankLines.value
Define an accumulator of type INT with 0 as the initial value – referenced by the blankLines variable.
Increment the accumulator through the blankLines variable (inside the task).
Retrieve the accumulator value in the driver program via blankLines.value.
| © Copyright 2015 Hitachi Consulting65
Programming with Spark
Broadcast Variables
 Allow keeping a read-only variable cached on each worker node, rather than shipping a copy of it with tasks
 E.g., to give every node a copy of a large input dataset (reference data) in an efficient manner.
 After the broadcast variable is created, it should be used instead of the original value in any functions run on
the cluster, so that the variable is not shipped to the nodes more than once.
Broadcast Example
ref_data = sc.broadcast(sc.textFile("ref_data.txt").collect())   # broadcast a local collection, not an RDD
def processData(input, ref_data):
…
data = data.map(lambda a: processData(a, ref_data.value))
| © Copyright 2015 Hitachi Consulting66
Per-Partition Operations
| © Copyright 2015 Hitachi Consulting67
Programming with Spark
Per-Partition Operations
 Some operations need to be executed per partition as a whole, rather than per item in the RDD,
which is the normal behaviour of transformations like map() or filter()
 E.g., setting up a database connection, creating a random number generator, preparing a return object for
an aggregation over the RDD, etc.
 For all of the mentioned objects, we only need one per RDD partition, rather than per element.
mapPartitions()
mapPartitionsWithIndex()
foreachPartition()
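A hedged sketch of mapPartitions(): the setup cost (a simple parser here, purely illustrative; in practice e.g. a database connection) is paid once per partition instead of once per element:

def parsePartition(iterator):
    # expensive setup would go here, once per partition
    for line in iterator:
        yield line.split(",")

rdd = sc.parallelize(["A,1", "B,2", "C,3"], 2)
parsed = rdd.mapPartitions(parsePartition)
print(parsed.collect())    # [['A', '1'], ['B', '2'], ['C', '3']]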
| © Copyright 2015 Hitachi Consulting68
Programming with Spark
Per-Partition Operations
Example: foreach() vs. foreachPartition()
list = [1,2,3,4,5,6,7,8,9,10]
rdd = sc.parallelize(list)
counter = sc.accumulator(0)
def operation(element):
    global counter
    counter += 1
rdd.foreach(operation)
print "counter value is: " + repr(counter)
The counter returns 10, one for each item in the RDD

counter2 = sc.accumulator(0)
rdd = rdd.repartition(3)
print "number of partitions = " + repr(rdd.getNumPartitions())
def operation2(partition):
    global counter2
    counter2 += 1
rdd.foreachPartition(operation2)
print "counter2 value is: " + repr(counter2)
The counter returns 3, one for each RDD partition
| © Copyright 2015 Hitachi Consulting69
Spark SQL
| © Copyright 2015 Hitachi Consulting70
Spark SQL
DataFrames
 Distributed collection of data organized into named columns
 Conceptually equivalent to a table in a relational database or a data frame in R/Python, with richer Spark optimizations
 Can be constructed from RDDs, structured data files, Hive tables, or an external RDBMS
| © Copyright 2015 Hitachi Consulting71
Spark SQL
DataFrames
Creating a DataFrame from RDD of Rows
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
from pyspark.sql.types import *
conf = SparkConf().setAppName("My App").setMaster("local")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
row1 = Row(id = 1, Name = 'khalid')
row2 = Row(id = 2, Name = 'Zahra')
row3 = Row(id = 3, Name = 'Adel')
row4 = Row(id = 4, Name = 'Jassem')
rdd = sc.parallelize([row1,row2,row3,row4])
df = sqlContext.createDataFrame(rdd)
df.printSchema()
| © Copyright 2015 Hitachi Consulting72
Spark SQL
DataFrames
Creating a DataFrame from RDD – with a list of column names
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("My App").setMaster("local")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
rdd = sc.parallelize([("productA","01/01/2015",50),("productA","01/02/2015",100),("productB","01/01/2015",70)])
df = sqlContext.createDataFrame(rdd, ["product","date","value"])
df.printSchema()
If only the column names are
supplied, data types will be
inferred
| © Copyright 2015 Hitachi Consulting73
Spark SQL
DataFrames
Creating a DataFrame from RDD – with schema
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
conf = SparkConf().setAppName("My App").setMaster("local")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
rdd = sc.parallelize([("productA","01/01/2015",50),("productA","01/02/2015",100),("productB","01/01/2015",70)])
schema = StructType([StructField("Item",StringType(),True), StructField("Date",StringType(),True), StructField("Stock",LongType(),True)])
df = sqlContext.createDataFrame(rdd,schema)
df.printSchema()
Supplied schema
| © Copyright 2015 Hitachi Consulting74
Spark SQL
DataFrames
Show DataFrame content
df.show()
| © Copyright 2015 Hitachi Consulting75
Spark SQL
DataFrames
Creating a DataFrame – sqlContext.read
 sqlContext.read.json(inputFile)
 sqlContext.read.format('jdbc').options(jdbcConnectionString).load()
 sqlContext.read.parquet(inputFile)
 sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true').load(inputFile)
Saving a DataFrame
 df.write.json(outputFile)
 df.createJDBCTable(jdbcConnectionString, TableName, allowExisting = true)
 df.insertIntoJDBC(jdbcConnectionString, TableName, overwrite = false)
 df.write.parquet(outputFile)
 df.write.format("com.databricks.spark.csv").save("/data/home.csv")
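A small round-trip sketch using these readers and writers (the paths are illustrative, and the spark-csv package is assumed to be available for the CSV line):

df = sqlContext.read.json("data.json")            # infer the schema from JSON records
df.write.parquet("data_parquet")                  # columnar output (a directory of part files)
df2 = sqlContext.read.parquet("data_parquet")     # read it back
df3 = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferschema='true').load("data.csv")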
| © Copyright 2015 Hitachi Consulting76
Spark SQL
DataFrames
Creating a DataFrame from JSON
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("My App").setMaster("local")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.json("data.json")
df.printSchema()
df.show()
| © Copyright 2015 Hitachi Consulting77
Spark SQL
DataFrames
Creating a DataFrame from CSV
 Load csv data to RDD (using csv.DictReader()), then create a DataFrame from RDD
 Use csv loader com.databricks:spark-csv
C:\spark\spark-1.6.1-bin-hadoop2.4\bin>pyspark --packages com.databricks:spark-csv_2.11:1.4.0
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(inputFile)
| © Copyright 2015 Hitachi Consulting78
Spark SQL
Manipulating Data Frames
A DataFrame is a collection of spark.sql.Row objects
rows = df.collect()   # returns a local collection of Row objects
len(rows)             # 3
rows[0]               # Row(product='Product A', date='2015-01-01', value=50)
rows[0][1]            # '2015-01-01'
Reference a column in a DataFrame
 As an attribute of the DataFrame: df.product
 As a key: df["product"]
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
| © Copyright 2015 Hitachi Consulting79
Spark SQL
Manipulating Data Frames
Filtering DataFrame
df = df.filter(df.date == '2015-01-01')
df = df.filter(df["product"] == "Product B")
df = df.filter("value > 50")
OR
df = df.filter("value > 50 AND product = 'Product A'")
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Date Value
Product B 2015-01-01 70
| © Copyright 2015 Hitachi Consulting80
Spark SQL
Manipulating Data Frames
Selecting Columns (projection)
from pyspark.sql.functions import length
df = df.select(df.product, df.value)
df = df.select(df.product, df.value*10)
df = df.select(df.product, df.value*10, length(df.product) + df.value*10)
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Value
Product A 50
Product A 100
Product B 70
Product Value*10 length(product)+Value*10
Product A 500 508
Product A 1000 1008
Product B 700 708
Create new columns based on existing columns in the DataFrame
| © Copyright 2015 Hitachi Consulting81
Spark SQL
Manipulating Data Frames
Selecting Columns (projection)
df = df.select(df.product, df.value)
df = df.select(df.product, df.value*10)
df = df.select(df.product, df.value*10, (length(df.product) + df.value*10).alias("derived"))
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Value
Product A 50
Product A 100
Product B 70
Product Value*10 derived
Product A 500 508
Product A 1000 1008
Product B 700 708
Give alias to the new
column
| © Copyright 2015 Hitachi Consulting82
Spark SQL
Manipulating Data Frames
Order By
from pyspark.sql.functions import asc, desc
df = df.orderBy(df.value)
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Date Value
Product A 2015-01-01 50
Product B 2015-01-01 70
Product A 2015-01-02 100
| © Copyright 2015 Hitachi Consulting83
Spark SQL
Manipulating Data Frames
join
result = df1.join(df2, df1.product == df2.p, "inner").select(df1.product, df2.model, df1.value)
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Model
Product A X
Product B Y
Join condition
Join Type
Product Model Value
Product A X 50
Product A X 100
Product B Y 70
| © Copyright 2015 Hitachi Consulting84
Spark SQL
Manipulating Data Frames
groupBy
result = df.groupBy(df.product)
result.count().show()
result.sum("value").show()
result.max("value").show()
agg
import pyspark.sql.functions as F
…
result = df.groupBy(df.product).agg(df.product, F.sum("value"), F.min("value"))
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Return
spark.sql.group.GroupedData
DataFrames
| © Copyright 2015 Hitachi Consulting85
Spark SQL
Manipulating Data Frames
It’s important to persist() or cache() your DataFrame after processing it via
filter(), join(), groupBy(), etc., so that these expensive operations are not
recomputed each time you perform a subsequent operation such as select()
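A short sketch of caching an aggregated DataFrame before querying it repeatedly (a sketch only, reusing the df from the slides above):

grouped = df.groupBy(df.product).sum("value")
grouped.cache()       # same as persist() with the default storage level
grouped.show()        # the first use computes and caches the result
grouped.count()       # later uses read from the cache
grouped.unpersist()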
| © Copyright 2015 Hitachi Consulting86
Spark SQL
Using Structured Query Language
df.registerTempTable("MyDataTable")
query = "SELECT product, SUM(value) Total FROM MyDataTable GROUP BY product"
result = sqlContext.sql(query)
result.show()   # sqlContext.sql() returns a DataFrame
| © Copyright 2015 Hitachi Consulting87
Spark SQL
Hive Integration
| © Copyright 2015 Hitachi Consulting88
Getting Started with Spark on
Azure HDInsight
| © Copyright 2015 Hitachi Consulting89
Spark on Azure HDInsight
Creating HDInsight Spark Cluster
| © Copyright 2015 Hitachi Consulting90
Spark on Azure HDInsight
Creating HDInsight Spark Cluster
| © Copyright 2015 Hitachi Consulting91
Spark on Azure HDInsight
Creating HDInsight Spark Cluster
| © Copyright 2015 Hitachi Consulting92
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight
| © Copyright 2015 Hitachi Consulting93
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight
| © Copyright 2015 Hitachi Consulting94
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight
| © Copyright 2015 Hitachi Consulting95
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Data Frames
| © Copyright 2015 Hitachi Consulting96
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Spark SQL
| © Copyright 2015 Hitachi Consulting97
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Spark SQL
| © Copyright 2015 Hitachi Consulting98
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Using Custom Library
Python Script
Uploaded to
ksmsdnspark/HdiSamples/HdiSamples/
WebsiteLogSampleData
| © Copyright 2015 Hitachi Consulting99
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Using External Package
| © Copyright 2015 Hitachi Consulting100
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Upload Adventure works data file extracts to the blob container
| © Copyright 2015 Hitachi Consulting101
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Process files with Spark and Save as hive table
| © Copyright 2015 Hitachi Consulting102
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Process files with Spark and Save as hive table
| © Copyright 2015 Hitachi Consulting103
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Processing Output is Saved and Partitioned in Hive
| © Copyright 2015 Hitachi Consulting104
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Query Data in Spark SQL
| © Copyright 2015 Hitachi Consulting105
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
| © Copyright 2015 Hitachi Consulting106
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
| © Copyright 2015 Hitachi Consulting107
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
| © Copyright 2015 Hitachi Consulting108
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
| © Copyright 2015 Hitachi Consulting109
Spark on Azure HDInsight
Hive Integration
[Diagram: Spark SQL accessed via JDBC/ODBC, custom applications, or the Spark SQL shell, reading data from Hive, JSON, Parquet, etc., and serving clients such as Excel and Tableau]
 Spark SQL with Hive support allows us to access Hive tables,
UDFs (user-defined functions), SerDes (serialization and
deserialization formats), and the Hive query language (HiveQL)
 sqlContext (which is a HiveContext) is the entry point to access the
Hive metastore and functionality, where HiveQL is the
recommended query language.
 If Hive is installed, the hive-site.xml file must be copied to Spark’s
configuration directory ($SPARK_HOME/conf)
 If there is no Hive installation, Spark SQL will create its own Hive
metastore (metadata DB) in your program’s work directory,
called metastore_db.
 In addition, if you attempt to create tables using HiveQL’s
CREATE TABLE, they will be placed in the
/user/hive/warehouse directory on your default filesystem
(either your local filesystem, or HDFS).
| © Copyright 2015 Hitachi Consulting110
Spark on Azure HDInsight
Hive Integration
sqlContext.sql("CREATE TABLE MyTable (id int, name string, salary float)")
tables = sqlContext.sql("SHOW TABLES")
tables.show()
description = sqlContext.sql("DESCRIBE MyTable")
description.show()
sqlContext.sql("INSERT INTO MyTable SELECT * FROM (SELECT 1101 as id,
'Khalid Salama' as name, 70000 as salary) query")
result = sqlContext.sql("SELECT * FROM MyTable")
result.show()
| © Copyright 2015 Hitachi Consulting111
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Download and install Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883
 Add an ODBC Data Source
 Select Microsoft Spark ODBC Driver
| © Copyright 2015 Hitachi Consulting112
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Download and install Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883
 Create a Spark ODBC connection
 Configure Spark ODBC connection
| © Copyright 2015 Hitachi Consulting113
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Download and install Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883
 Create a Spark ODBC connection
 Configure Spark ODBC connection
 Test connection
| © Copyright 2015 Hitachi Consulting114
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Open Excel and go to Data, From Other Sources, Microsoft Query
| © Copyright 2015 Hitachi Consulting115
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Browse tables and select attributes to include
| © Copyright 2015 Hitachi Consulting116
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Show data in Pivot Table
| © Copyright 2015 Hitachi Consulting117
Spark CLR (Mobius)
| © Copyright 2015 Hitachi Consulting118
Spark CLR
Installing Spark CLR
 Download Spark Mobius https://github.com/Microsoft/Mobius
 Unzip the content of the zip folder to C:\spark\spark-clr_2.10-1.6.100
 You should find inside this folder a “runtime” folder, which includes the (bin, lib, dependencies, scripts) folders
 Create a new Visual Studio project (Console App). Right-click the project, open NuGet, and install the following
packages one by one (you will then find them in packages.config)
<packages>
<package id="log4net" version="2.0.5" targetFramework="net452" />
<package id="Microsoft.SparkCLR" version="1.6.100" targetFramework="net452" />
<package id="Newtonsoft.Json" version="7.0.1" targetFramework="net452" />
<package id="Razorvine.Pyrolite" version="4.10.0.0" targetFramework="net452" />
<package id="Razorvine.Serpent" version="1.12.0.0" targetFramework="net452" />
</packages>
| © Copyright 2015 Hitachi Consulting119
Spark CLR
Running .NET Apps with Spark CLR
Writing Spark Processor Class
| © Copyright 2015 Hitachi Consulting120
Spark CLR
Running .NET Apps with Spark CLR
The main() function in the SparkAppDemo class calls processor.process()
 Build your project to produce SparkAppDemo.exe
 Go to C:\spark\spark-clr_2.10-1.6.100\runtime\scripts
 Run the following command
>sparkclr-submit --exe SparkAppDemo.exe
C:\spark\mywork\CSharp\SparkAppDemo
| © Copyright 2015 Hitachi Consulting121
Spark CLR
Running .NET Apps with Spark CLR
| © Copyright 2015 Hitachi Consulting122
How to Get Started with Spark
 Read the slides!
 Azure Spark HDInsight Documentation
https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-apache-spark-overview/
 Apache Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html
 Spark CLR (Mobius)
https://github.com/Microsoft/Mobius
 Introduction to Big Data Analytics (week 5) – Coursera Big Data Specialization
https://www.coursera.org/learn/bigdata-analytics/home/week/5
 Data Manipulation at Scale (week 4, lesson 20) – Coursera Data Science at Scale
https://www.coursera.org/learn/data-manipulation/home/week/4
 Data Science and Engineering with Apache Spark – edx 5 course track
https://www.edx.org/xseries/data-science-engineering-apache-spark
 O’Reilly Books – Learning Spark
| © Copyright 2015 Hitachi Consulting123
Appendix A: Spark Configurations
SparkConf()
 spark.app.name
 spark.master spark://host:<port> | mesos://host:<port> | yarn | local | local[<cores>]
 spark.ui.port
 spark.executor.memory
 spark.executor.cores
 spark.serializer
 spark.eventLog.enabled
 spark.eventLog.dir
spark-submit
--master
--deploy-mode client | cluster
--name
--files
--py-files
--executor-memory
--driver-memory
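A hedged sketch combining a few of these settings (the values are illustrative, not recommendations):

from pyspark import SparkContext, SparkConf

conf = (SparkConf()
        .setAppName("Config Demo")
        .setMaster("local[2]")                   # spark.master
        .set("spark.executor.memory", "2g")
        .set("spark.eventLog.enabled", "false"))
sc = SparkContext(conf=conf)
print(sc.getConf().toDebugString())              # inspect the effective configuration
sc.stop()

The equivalent settings can also be passed on the command line, e.g.
>spark-submit --master yarn --deploy-mode cluster --name "Config Demo" --executor-memory 2g myscript.py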
| © Copyright 2015 Hitachi Consulting124
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing , University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science , The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
– classification rules induction,
– decision trees construction,
– Bayesian classification modelling,
– data reduction,
– instance-based learning,
– evolving neural networks, and
– data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation,
Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio,
ECTA, IEEE WCCI and INNS-BigData.
ResearchGate.org
| © Copyright 2015 Hitachi Consulting125
Thank you!

Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Data science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionData science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief Introduction
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
The path to a Modern Data Architecture in Financial Services
The path to a Modern Data Architecture in Financial ServicesThe path to a Modern Data Architecture in Financial Services
The path to a Modern Data Architecture in Financial Services
 
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
 
tce
tcetce
tce
 
Сторінками юридичних періодичних видань.
 Сторінками юридичних періодичних видань. Сторінками юридичних періодичних видань.
Сторінками юридичних періодичних видань.
 
Missions Mobilization Principles @globalcast
Missions Mobilization Principles @globalcastMissions Mobilization Principles @globalcast
Missions Mobilization Principles @globalcast
 
BioSharing - ELIXIR All Hands, March 2017
BioSharing - ELIXIR All Hands, March 2017BioSharing - ELIXIR All Hands, March 2017
BioSharing - ELIXIR All Hands, March 2017
 
NeuString - Roaming Discount Agreements vs Spreadsheets e.1.1
NeuString - Roaming Discount Agreements vs Spreadsheets e.1.1NeuString - Roaming Discount Agreements vs Spreadsheets e.1.1
NeuString - Roaming Discount Agreements vs Spreadsheets e.1.1
 
XOHW17 - tReeSearch Project Presentation
XOHW17 - tReeSearch Project PresentationXOHW17 - tReeSearch Project Presentation
XOHW17 - tReeSearch Project Presentation
 
ICDS2 IARIA presentation M. Hartog
ICDS2 IARIA presentation M. HartogICDS2 IARIA presentation M. Hartog
ICDS2 IARIA presentation M. Hartog
 
Adult Learning Pyramid
Adult Learning PyramidAdult Learning Pyramid
Adult Learning Pyramid
 
Thank you 3.22.2017
Thank you 3.22.2017Thank you 3.22.2017
Thank you 3.22.2017
 
[PL] Code Europe 2016 - Python and Microsoft Azure
[PL] Code Europe 2016 - Python and Microsoft Azure[PL] Code Europe 2016 - Python and Microsoft Azure
[PL] Code Europe 2016 - Python and Microsoft Azure
 
Media Literacy: Evaluating the News We Consume
Media Literacy: Evaluating the News We ConsumeMedia Literacy: Evaluating the News We Consume
Media Literacy: Evaluating the News We Consume
 

Similar to Spark with HDInsight

HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)Durga Gadiraju
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your EyesDemi Ben-Ari
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Jyotasana Bharti
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
Hadoop on OpenStack
Hadoop on OpenStackHadoop on OpenStack
Hadoop on OpenStackSandeep Raju
 
Cloudera hadoop installation
Cloudera hadoop installationCloudera hadoop installation
Cloudera hadoop installationSumitra Pundlik
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Alex Zeltov
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_onSri Ambati
 
Sparkling Water
Sparkling WaterSparkling Water
Sparkling Waterh2oworld
 

Similar to Spark with HDInsight (20)

HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
Hadoop on OpenStack
Hadoop on OpenStackHadoop on OpenStack
Hadoop on OpenStack
 
Cloudera hadoop installation
Cloudera hadoop installationCloudera hadoop installation
Cloudera hadoop installation
 
Spark core
Spark coreSpark core
Spark core
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
 
Sparkling Water
Sparkling WaterSparkling Water
Sparkling Water
 

Recently uploaded

Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?RemarkSemacio
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...ThinkInnovation
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdfkhraisr
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numberssuginr1
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 

Recently uploaded (20)

Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 

Spark with HDInsight

  • 1. | © Copyright 2015 Hitachi Consulting1 Spark with Azure HDInsight Lighting-fast Big Data Processing Khalid M. Salama, Ph.D. Business Insights & Analytics Hitachi Consulting UK We Make it Happen. Better.
  • 2. | © Copyright 2015 Hitachi Consulting2 Outline  Spark and Big Data  Installing Spark  Spark Core Concepts  Programming with Spark  Spark SQL  Getting Started with Spark on HDInsight  Spark CLR (Mobius)  ETL and Automation  Useful Resources
  • 3. | © Copyright 2015 Hitachi Consulting3 Introducing Spark
  • 4. | © Copyright 2015 Hitachi Consulting4 What is Spark? The Lightening-fast Big Data Processing General-purpose Big Data Processing Integrates with HDFS Graph Processing Stream Processing Machine Learning Libraries In-memory (fast) Iterative Processing Interactive Query SQL Scala – Python – Java – R – .NET
  • 5. | © Copyright 2015 Hitachi Consulting5 Spark and Hadoop Ecosystem Spark and the zoo… Hadoop Distributed File System (HDFS) Applications In-Memory Stream SQL  Spark- SQL NoSQL Machine Learning …. Batch Yet Another Resource Negotiator (YARN) Search Orchest. MgmntAcquisition Named Node DataNode 1 DataNode 2 DataNode 3 DataNode N
  • 6. | © Copyright 2015 Hitachi Consulting6 Spark Components Spark and the zoo… Hadoop Distributed File System (HDFS) Spark …. Yet Another Resource Negotiator (YARN)Named Node DataNode 1 DataNode 2 DataNode 3 DataNode N Spark Core Engine (RDDs: Resilient Distributed Datasets) Spark SQL (structured data) Spark Streaming (real-time) Mlib (machine learning) GraphX (graph processing) Scala Java Python R .NET (Mobius)
  • 7. | © Copyright 2015 Hitachi Consulting7 Spark Components Spark Core Spark SQL Spark Streaming Spark MLib Spark GraphX Cluster Managers  Contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, etc.  Home to the API that defines resilient distributed datasets (RDDs)  Package for working with structured data (DataFrames). It allows querying data via SQL as well as the Apache Hive Supports many sources of data, including Hive tables, Parquet, and JSON.  Allows developers to intermix SQL queries with the programmatic data manipulations supported by  RDDs in a single application, thus combining SQL with complex analytics • Provides an API for manipulating data streams that closely matches the Spark Core’s RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real time.  Provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.  Provides some lower-level ML primitives, including a generic gradient descent optimization algorithm.  Provides graph manipulation operations and performing graph-parallel computations.  Allows creating a directed graph with arbitrary properties attached to each vertex and edge.  Provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).  Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple  cluster manager included in Spark itself called the Standalone Scheduler
  • 8. | © Copyright 2015 Hitachi Consulting8 Cluster Manager (Standalone/ YARN/ Mesos) What is Spark? The Lightning-fast Big Data processing Master Node Driver Program SparkContext Worker Node 1 Executor Task Worker Node 2 Executor Worker Node N Executor … Task Task Task Task Driver Program – Contains the main function, defines distributed datasets, and applies operations to them (e.g. the Spark Shell); submits tasks to the executor processes SparkContext – Connection to the computing cluster; creates distributed datasets. Initialized automatically, with default config, when using the Spark Shell
  • 9. | © Copyright 2015 Hitachi Consulting9 What is Spark? The Lightning-fast Big Data processing The user submits an application using spark-submit. spark-submit launches the driver program and invokes its main() method. The driver program contacts the cluster manager to ask for resources to launch executors. The cluster manager launches executors on behalf of the driver program. The driver sends RDD transformations and actions to the executors in the form of tasks. Tasks are run on executor processes to compute and save results. When the driver’s main() method exits or calls SparkContext.stop(), the executors are terminated and resources are released from the cluster manager. How Spark works on a cluster:
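To make the driver/executor roles above concrete, here is a minimal driver-program sketch (the file name and the trivial computation are illustrative assumptions, not part of the deck) that could be launched with spark-submit:

# my_app.py - a minimal, hypothetical driver program
from pyspark import SparkConf, SparkContext

# The driver defines the configuration and creates the SparkContext,
# which contacts the cluster manager and requests executors.
conf = SparkConf().setAppName("MyApp")   # the master is typically supplied by spark-submit
sc = SparkContext(conf=conf)

# Transformations are only recorded here; the count() action is what
# actually ships tasks to the executors.
numbers = sc.parallelize(range(1, 1001))
evens = numbers.filter(lambda n: n % 2 == 0)
print(evens.count())

# Stopping the context terminates the executors and releases cluster resources.
sc.stop()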
  • 10. | © Copyright 2015 Hitachi Consulting10 Installing Spark
  • 11. | © Copyright 2015 Hitachi Consulting11 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark
  • 12. | © Copyright 2015 Hitachi Consulting12 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark
  • 13. | © Copyright 2015 Hitachi Consulting13 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark  If there is no Hadoop, you need to install winutils.exe  Place winutils.exe in c:\hadoop\bin  Set the HADOOP_HOME environment variable to “c:\hadoop\bin” using the command prompt: >SETX HADOOP_HOME c:\hadoop\bin  Check that the variable has been set: >ECHO %HADOOP_HOME%  Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark
  • 14. | © Copyright 2015 Hitachi Consulting14 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark  If there is no Hadoop, you need to install winutils.exe  Place winutils.exe in c:\hadoop\bin  Set the HADOOP_HOME environment variable to “c:\hadoop\bin” using the command prompt: >SETX HADOOP_HOME c:\hadoop\bin  Check that the variable has been set: >ECHO %HADOOP_HOME%  Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark
  • 15. | © Copyright 2015 Hitachi Consulting15 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark  If there is no Hadoop, you need to install winutils.exe  Place winutils.exe in c:\hadoop\bin  Set the HADOOP_HOME environment variable to “c:\hadoop\bin” using the command prompt: >SETX HADOOP_HOME c:\hadoop\bin  Check that the variable has been set: >ECHO %HADOOP_HOME%  Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark  You can test the following statements  1+1  List = sc.parallelize([1,2,5])  List.count()  exit()
  • 16. | © Copyright 2015 Hitachi Consulting16 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark  If there is no Hadoop, you need to install winutils.exe  Place winutils.exe in c:\hadoop\bin  Set the HADOOP_HOME environment variable to “c:\hadoop\bin” using the command prompt: >SETX HADOOP_HOME c:\hadoop\bin  Check that the variable has been set: >ECHO %HADOOP_HOME%  Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark  You can test the following statements  1+1  List = sc.parallelize([1,2,5])  List.count()  exit()
  • 17. | © Copyright 2015 Hitachi Consulting17 Submitting a Python script to Spark Using spark-submit C:\spark\spark-1.6.1-bin-hadoop2.4\bin\spark-submit <scriptFilePath>
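As a quick illustration, a script along the following lines (the file names and paths are hypothetical) could be saved locally and submitted with the command above:

# C:\spark\mywork\count_lines.py - hypothetical script to submit
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("CountLines")
sc = SparkContext(conf=conf)

# Assumed input file; any local text file would do
lines = sc.textFile("C:/spark/mywork/data/log.txt")
print("number of lines: %d" % lines.count())

sc.stop()

# Submit it with:
# C:\spark\spark-1.6.1-bin-hadoop2.4\bin\spark-submit C:\spark\mywork\count_lines.py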
  • 18. | © Copyright 2015 Hitachi Consulting18 Spark Core Concepts
  • 19. | © Copyright 2015 Hitachi Consulting19 Spark Core Concepts Key/Value (Pair) RDDs Persisting & Removing RDDs Per-Partition Operations Accumulators & Broadcast Variables Resilient Distributed Datasets (RDDs) Transformations Actions
  • 20. | © Copyright 2015 Hitachi Consulting20 Spark Core Concepts Resilient Distributed Datasets  Distributed, Fault-tolerant, Immutable Collection of Memory Objects  Split into partitions to be processed on different nodes of the cluster.  Can contain any type of Python, Java, or Scala objects, including user-defined classes  Processed through Transformations and Actions
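A small sketch of these properties in the pyspark shell (sc is already created there; the data is made up): transformations return new RDDs rather than mutating the original, and the collection is split into partitions.

numbers = sc.parallelize(range(10), 4)   # ask for 4 partitions
print(numbers.getNumPartitions())        # 4

doubled = numbers.map(lambda n: n * 2)   # a new RDD; 'numbers' itself is unchanged
print(numbers.collect())                 # [0, 1, 2, ..., 9]
print(doubled.collect())                 # [0, 2, 4, ..., 18]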
  • 21. | © Copyright 2015 Hitachi Consulting21 Spark Core Concepts Resilient Distributed Datasets Creating an RDD Parallelizing an existing collection in the driver program Loading a data set from an external data store
  • 22. | © Copyright 2015 Hitachi Consulting22 Spark Core Concepts Resilient Distributed Datasets Creating an RDD  Parallelizing an existing collection in the driver program collection = [“Khalid”,”Magdy”, “Nagib”, “Salama”] rdd= sc.parallelize(collection)
  • 23. | © Copyright 2015 Hitachi Consulting23 Spark Core Concepts Resilient Distributed Datasets Creating an RDD  Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, etc. filePath = <“/directory/file.csv” | “/directory” | “directory/*.csv”> rdd = sc.textFile(filePath) rdd = sc.wholeTextFiles(directoryPath) Can read a file, all files in a folder, or files matching a wildcard. Returns a collection of lines Returns an RDD of (filename, content) pairs
  • 24. | © Copyright 2015 Hitachi Consulting24 Spark Core Concepts Resilient Distributed Datasets Creating an RDD  Loading json files import json … input = sc.textFile(“jsonfile.json”) data = input.map(lambda x: json.loads(x))
  • 25. | © Copyright 2015 Hitachi Consulting25 Spark Core Concepts Resilient Distributed Datasets Creating an RDD  Load CSV file import csv import StringIO .. def loadRecord(line): input = StringIO.StringIO(line) reader = csv.DictReader(input, fieldnames=["store", "date", "value"]) return reader.next() .. inputFile = "C:/spark/mywork/data/data.csv" input = sc.textFile(inputFile).map(loadRecord) input.collect()[0]
  • 26. | © Copyright 2015 Hitachi Consulting26 Spark Core Concepts Processing RDDs Transformations  Construct a new RDD based on the current one by manipulating the collection  Lazy Execution: only performed when an action is invoked.  The set of transformations is optimized prior to execution (the action) to load and process less data Actions  Compute a result based on an RDD  Return the results to the Driver Program, or save them to an external storage system  The RDD is recomputed (i.e., transformations are re-applied) each time an action is invoked  rdd.cache() or rdd.persist([option]) to reuse the computed rdd filter(), map(), flatMap() groupByKey(), cogroup(), reduceByKey(), sortByKey(), distinct(), sample(), union(), intersection(), join(), and more… reduce(), first(), take(), takeSample() count(), countByKey() collect(), saveAsTextFile(), foreach()
  • 27. | © Copyright 2015 Hitachi Consulting27 Spark Core Concepts Spark Program Spark Program in a nutshell: 1. Create an RDD by loading a dataset from an external file, using textFile() 2. Apply transformations to the RDD, like filter(), map(), join() 3. Call RDD.persist() to persist the computed RDD for reuse 4. Apply actions to the RDD, like count(), reduce(), collect() 5. Save the action results to external data storage using saveAsTextFile()
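A minimal end-to-end sketch of those five steps, assuming a plain text log file at a made-up path and a pyspark shell or driver program with sc available:

# 1. Create an RDD from an external file
lines = sc.textFile("C:/spark/mywork/data/log.txt")          # hypothetical path

# 2. Apply transformations
errors = lines.filter(lambda line: "error" in line.lower())
pairs = errors.map(lambda line: (line.split(" ")[0], 1))     # assumes the first token is a date

# 3. Persist the computed RDD for reuse
pairs.persist()

# 4. Apply actions
total = pairs.count()
per_day = pairs.reduceByKey(lambda a, b: a + b)

# 5. Save the results to external storage
per_day.saveAsTextFile("C:/spark/mywork/output/errors_per_day")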
  • 28. | © Copyright 2015 Hitachi Consulting28 Spark Transformations
  • 29. | © Copyright 2015 Hitachi Consulting29 Programming with Spark Transformations Filter – return a subset of the RDD based on some condition(s) lines = sc.textFile("log.txt") errors = lines.filter(lambda line: "error" in line or "warn" in line) errors = errors.filter(lambda line: len(line) > 10) numbers = sc.parallelize([1,2,3,4,5]) evenNumbers = numbers.filter(lambda n: n % 2==0 and n>3) def isPrime(n): for i in range(2,int(n**0.5)+1): if n % i==0: return False return True primeNumbers = numbers.filter(isPrime)
  • 30. | © Copyright 2015 Hitachi Consulting30 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) [1,2,3,4,5]
  • 31. | © Copyright 2015 Hitachi Consulting31 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) [1,2,3,4,5] [1,4,9,16,25]
  • 32. | © Copyright 2015 Hitachi Consulting32 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) This product is great The product I bought yesterday is so bad I am happy Very bad product, very bad [ This product is great, The product I bought yesterday is so bad, I am happy, Very bad product, very bad ]
  • 33. | © Copyright 2015 Hitachi Consulting33 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) linewords = lines.map(lambda line: line.split(“ “)) [ This product is great, The product I bought yesterday is so bad, I am happy, Very bad product, very bad ] [ [This, product, is, great], [The, product, I, bought, yesterday, is, so, bad], [I, am, happy], [Very, bad, product, very, bad] ]
  • 34. | © Copyright 2015 Hitachi Consulting34 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) linewords = lines.map(lambda line: line.split(“ “)) subset = lines.filter(lambda line: “product” in line) [ [This, product, is, great], [The, product, I, bought, yesterday, is, so, bad], [I, am, happy], [Very, bad, product, very, bad] ] [ [This, product, is, great], [The, product, I, bought, yesterday, is, so, bad], [Very, bad, product, very, bad] ]
  • 35. | © Copyright 2015 Hitachi Consulting35 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) linewords = lines.map(lambda line: line.split(“ “)) subset = lines.filter(lambda line: “product” in line) output = lines.map(lambda line: line.count(“bad”)) [ [This, product, is, great], [The, product, I, bought, yesterday, is, so, bad], [Very, bad, product, very, bad] ] [0,1,2]
  • 36. | © Copyright 2015 Hitachi Consulting36 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) linewords = lines.map(lambda line: line.split(“ “)) subset = lines.filter(lambda line: “product” in line) output = lines.map(lambda line: line.count(“bad”)) lines = sc.textFile(“data.txt”) 1, 2016/01/01, productA,456 2, 2016/01/01, productB,65 3, 2016/01/02, productA,104 [ 1, 2016/01/01, productA,456 2, 2016/01/01, productB,65 3, 2016/01/02, productA,104 ]
  • 37. | © Copyright 2015 Hitachi Consulting37 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile("data.txt") linewords = lines.map(lambda line: line.split(" ")) subset = lines.filter(lambda line: "product" in line) output = lines.map(lambda line: line.count("bad")) lines = sc.textFile("data.txt") records = lines.map(lambda line: Order.ParseLineToOrder(line)) [ 1, 2016/01/01, productA,456 2, 2016/01/01, productB,65 3, 2016/01/02, productA,104 ] [ Order(Id:1, date:2016/01/01, product:"productA", SalesValue:456) Order(Id:2, date:2016/01/01, product:"productB", SalesValue:65) Order(Id:3, date:2016/01/02, product:"productA", SalesValue:104) ]
  • 38. | © Copyright 2015 Hitachi Consulting38 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile("data.txt") linewords = lines.map(lambda line: line.split(" ")) subset = lines.filter(lambda line: "product" in line) output = lines.map(lambda line: line.count("bad")) lines = sc.textFile("data.txt") records = lines.map(lambda line: Order.ParseLineToOrder(line)) filtered = records.filter(lambda order: order.SalesValue > 100) [ Order(Id:1, date:2016-01-01, product:"productA", SalesValue:456) Order(Id:2, date:2016-01-01, product:"productB", SalesValue:65) Order(Id:3, date:2016-01-02, product:"productA", SalesValue:104) ] [ Order(Id:1, date:2016-01-01, product:"productA", SalesValue:456) Order(Id:3, date:2016-01-02, product:"productA", SalesValue:104) ]
  • 39. | © Copyright 2015 Hitachi Consulting39 Programming with Spark Transformations FlatMap – if the map function returns a collection per each item in the RDD, flatMap will return a “flat” collection, rather than a collection of collections lines = sc.textFile("data.txt") words = lines.flatMap(lambda line: line.split(" ")) #word count example lines = sc.textFile("data.txt") words = lines.flatMap(lambda line: line.split(" ")) counts = words.map(lambda word: (word,1)) combined = counts.reduceByKey(lambda a,b: a+b) combined.saveAsTextFile("output.txt") This product is great The product I bought yesterday is so bad I am happy Very bad product, very bad [ This, product, is, great, The, product, I, bought, yesterday, is, so bad, I, am, happy, Very, bad, product, very, bad ]
  • 40. | © Copyright 2015 Hitachi Consulting40 Programming with Spark Transformations Union, intersection, subtract, distinct list1 = [1,2,3,4,5] list2 = [2,4,6,8,10] rdd1 = sc.parallelize(list1) rdd2 = sc.parallelize(list2) rdd3 = rdd1.union(rdd2) rdd4 = rdd1.intersection(rdd2) rdd5 = rdd1.subtract(rdd2) [1,2,3,4,5,2,4,6,8,10] (union keeps duplicates; apply distinct() to remove them) [2,4] [1,3,5]
  • 41. | © Copyright 2015 Hitachi Consulting41 Spark Actions
  • 42. | © Copyright 2015 Hitachi Consulting42 Programming with Spark Actions Reduce – operates on two elements in your RDD and returns a new element of the same type. numbers = sc.parallelize([1,2,3,4,5]) sum = numbers.reduce(lambda a,b: a+b) max = numbers.reduce(lambda a,b: a if a > b else b)
  • 43. | © Copyright 2015 Hitachi Consulting43 Programming with Spark Actions Reduce – operates on two elements in your RDD and returns a new element of the same type. numbers = sc.parallelize([1,2,3,4,5]) sum = numbers.reduce(lambda a,b: a+b) max = numbers.reduce(lambda a,b: a if a > b else b) words = sc.parallelize([“hello”,”my”,”name”,”is”,”Khalid”]) concatenated = words.reduce(lambda a,b: a+” “+b) [ “hello”, ”my”, ”name”, ”is” ,”Khalid” ] ‘Hello my name is Khalid’
  • 44. | © Copyright 2015 Hitachi Consulting44 Programming with Spark Actions Aggregate - operates on two elements in your RDD and returns a new element of any type. words = sc.parallelize(["hello","my","name","is","Khalid"]) number_of_letters = words.aggregate(0, (lambda acc, value: acc+len(value)), (lambda acc1,acc2: acc1+acc2)) alternatively, number_of_letters = words.map(lambda word: len(word)).reduce(lambda a,b: a+b) Return a tuple (sum/count) to calculate average: sumCount = nums.aggregate((0, 0), (lambda acc, value: (acc[0] + value, acc[1] + 1)), (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))) avg = sumCount[0] / float(sumCount[1]) Takes an initial value of the return type Here, the initial value is a tuple! 2 functions: 1) how to add an element to the accumulated value, 2) how to aggregate two accumulated values
  • 45. | © Copyright 2015 Hitachi Consulting45 Programming with Spark Actions collect – Returns the RDD as a collection (not an RDD anymore!). count – Returns the number of elements in the RDD. takeSample – Takes n random elements from the RDD. first – Takes the first element in the RDD. countByValue – Returns a map of each unique value to its count. foreach – Performs an operation on each element of the computed RDD. saveAsTextFile – Saves the content of the RDD to a text file. saveAsSequenceFile – Saves the content of the RDD to a Sequence file.
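A few of these actions in the pyspark shell, as a small illustration on made-up data (sample output will vary):

words = sc.parallelize(["spark", "hdfs", "spark", "yarn", "hive"])

print(words.count())               # 5
print(words.first())               # 'spark'
print(words.countByValue())        # a dict-like map of value -> count, e.g. 'spark' -> 2
print(words.takeSample(False, 2))  # 2 random elements, e.g. ['yarn', 'hive']

words.foreach(lambda w: None)      # runs a (side-effecting) function on every element
words.saveAsTextFile("C:/spark/mywork/output/words")   # hypothetical output folder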
  • 46. | © Copyright 2015 Hitachi Consulting46 Programming with Spark Actions Saving data as JSON import csv import StringIO import json .. def loadRecord(line): input = StringIO.StringIO(line) reader = csv.DictReader(input, fieldnames=["store", "date", "value"]) return reader.next() .. inputFile = "C:/spark/mywork/data/data.csv" input = sc.textFile(inputFile).map(loadRecord) #data transformation, e.g., map(), filter(), reduce, etc.. outputFile = "C:/spark/mywork/data/data.json" input.map(lambda element: json.dumps(element)).saveAsTextFile(outputFile)
  • 47. | © Copyright 2015 Hitachi Consulting47 Persisting RDDs
  • 48. | © Copyright 2015 Hitachi Consulting48 Programming with Spark RDD Persistence  Spark performs the transformations on an RDD in a lazy manner, only after an action is invoked  Spark re-computes the RDD each time an action is called on the RDD  This can be especially expensive for iterative algorithms, which look at the data many times rdd = numbers.filter(lambda a: a >10) rdd2 = rdd.map(lambda a: a*a) rdd2.count() rdd2.collect() rdd = numbers.filter(lambda a: a >10) rdd2 = rdd.map(lambda a: a*a) rdd2.cache() rdd2.count() rdd2.collect() Each action will cause the RDD to be recomputed (filter & map) This will compute and persist the RDD to perform several actions on it
  • 49. | © Copyright 2015 Hitachi Consulting49 Programming with Spark RDD Persistence rdd.persist(StorageLevel.<LEVEL>) rdd.cache() is the same as rdd.persist() with the default level (StorageLevel.MEMORY_ONLY) rdd.unpersist() to free up memory from unused RDDs
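For example (a minimal sketch; the storage level chosen here is just illustrative):

from pyspark import StorageLevel

squares = sc.parallelize(range(1, 1000000)).map(lambda n: n * n)

squares.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if it does not fit in memory
print(squares.count())    # the first action computes and materialises the RDD
print(squares.take(5))    # reuses the persisted data instead of recomputing it

squares.unpersist()       # free the memory/disk once the RDD is no longer needed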
  • 50. | © Copyright 2015 Hitachi Consulting50 Working with Pair RDDs
  • 51. | © Copyright 2015 Hitachi Consulting51 Programming with Spark Working with Key/Value Pairs  Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network.  Usually used to perform operations like join, group, sort, reduceByKey, etc. data = sc.textFile("data.txt") data = data.map(lambda line: line.split(",")) keyValueDataset = data.map(lambda elements: (elements[0],elements[1])) def parseLine(line): parts = line.split(",") record = MyRecord(parts[0],parts[1],parts[2]) return (record.Product,record) keyValueDataset = data.map(parseLine) A,234,01/01/2015 B,567,01/01/2015 A,157,01/01/2015 C,56,01/01/2015 B,345,01/01/2015 B,678,01/01/2015 [(A,234), (B,567), (A,157), (C,56), (B,345), (B,678)]
  • 52. | © Copyright 2015 Hitachi Consulting52 Programming with Spark Working with Key/Value Pairs - Transformations reduceByKey – aggregate values with the same key using a given function data = sc.textFile("data.txt") data = data.map(lambda line: line.split(",")) keyValueDataset = data.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset.reduceByKey(lambda a,b: a+b) A,2, B,5, A,1, C,5, B,3, B,6, [ (A,3), (B,14), (C,5) ]
  • 53. | © Copyright 2015 Hitachi Consulting53 Programming with Spark Working with Key/Value Pairs - Transformations groupByKey - group values with the same key in a collection data = sc.textFile("data.txt") data = data.map(lambda line: line.split(",")) keyValueDataset = data.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset.groupByKey() A,2, B,5, A,1, C,5, B,3, B,6, [ (A,[2,1]), (B,[5,3,6]), (C,[5]) ]
  • 54. | © Copyright 2015 Hitachi Consulting54 Programming with Spark Working with Key/Value Pairs - Transformations mapValues - apply a function on each value of the pair without changing the key data = sc.textFile("data.txt") data = data.map(lambda line: line.split(",")) keyValueDataset = data.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset.mapValues(lambda a: a*a) A,2, B,5, A,1, C,5, B,3, B,6, (A,4), (B,25), (A,1), (C,25), (B,9), (B,36)
  • 55. | © Copyright 2015 Hitachi Consulting55 Programming with Spark Working with Key/Value Pairs - Transformations join - Perform an inner join between two RDDs data1 = sc.textFile("data1.txt") data1 = data1.map(lambda line: line.split(",")) keyValueDataset1 = data1.map(lambda elements: (elements[0],elements[1])) data2 = sc.textFile("data2.txt") data2 = data2.map(lambda line: line.split(",")) keyValueDataset2 = data2.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset1.join(keyValueDataset2) A,2, B,5, A,1, C,5, B,3, B,6, [ (A,(2,22)), (A,(2,11)), (A,(1,22)), (A,(1,11)), (B,(5,55)), (B,(3,55)), (B,(6,55)) ] A,22, B,55, A,11
  • 56. | © Copyright 2015 Hitachi Consulting56 Programming with Spark Working with Key/Value Pairs - Transformations cogroup - Group data from both RDDs sharing the same key data1 = sc.textFile("data1.txt") data1 = data1.map(lambda line: line.split(",")) keyValueDataset1 = data1.map(lambda elements: (elements[0],elements[1])) data2 = sc.textFile("data2.txt") data2 = data2.map(lambda line: line.split(",")) keyValueDataset2 = data2.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset1.cogroup(keyValueDataset2) result = result.mapValues(lambda t: list(t[0]) + list(t[1])) A,2, B,5, A,1, C,5, B,3, B,6, [ (A,([2,1],[22,11])), (B,([5,3,6],[55])), (C,([5],[])) ] A,22, B,55, A,11 [ (A,[2,1,22,11]), (B, [5,3,6,55]), (C,[5]) ]
  • 57. | © Copyright 2015 Hitachi Consulting57 Programming with Spark Working with Key/Value Pairs - Transformations zip - Pair an RDD with another RDD to produce a new Key/Value RDD, based on the element order of each RDD rdd1 = sc.parallelize(['A','B','C','D']) rdd2 = sc.parallelize([101,102,103,104]) pairs = rdd1.zip(rdd2) pairs.collect() [ A, B, C, D ] [ 101, 102, 103, 104 ] [ (A,101) (B,102) (C,103) (D,104) ]
  • 58. | © Copyright 2015 Hitachi Consulting58 Programming with Spark Working with Key/Value Pairs - Transformations zipWithIndex - Pair each element in the RDD with its index list = ['A','B','C','D'] rdd = sc.parallelize(list) rdd_indexed = rdd.zipWithIndex() rdd_indexed.collect() [ A, B, C, D ] [ (A, 0) (B, 1) (C, 2) (D, 3) ]
  • 59. | © Copyright 2015 Hitachi Consulting59 Programming with Spark Working with Key/Value Pairs - Transformations flatMapValues() – same as flatMap(), but with pair RDDs keys() values() sortByKey() leftOuterJoin() rightOuterJoin() subtractByKey() combineByKey() – same as aggregate(), but with pair RDDs
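combineByKey() is the most general of these; as an illustrative sketch on made-up data, a per-key average can be computed with it:

sales = sc.parallelize([("A", 2), ("B", 5), ("A", 1), ("C", 5), ("B", 3)])

sum_count = sales.combineByKey(
    lambda value: (value, 1),                           # createCombiner: first value seen for a key
    lambda acc, value: (acc[0] + value, acc[1] + 1),    # mergeValue: fold another value into the combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1]))            # mergeCombiners: merge combiners across partitions

averages = sum_count.mapValues(lambda p: p[0] / float(p[1]))
print(averages.collect())   # e.g. [('A', 1.5), ('B', 4.0), ('C', 5.0)]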
  • 60. | © Copyright 2015 Hitachi Consulting60 Programming with Spark Working with Key/Value Pairs - Transformations partitionBy  Hash Partitioning by key.  Elements with the same key will end up being processed on the same compute node, to reduce data shuffling  Useful with operations like cogroup(), groupWith(), join(), groupByKey(), reduceByKey(), combineByKey(), and lookup().  Usually used when data is loaded, then the RDD is persisted data1 = sc.textFile("data1.txt") data1 = data1.map(lambda line: line.split(",")) keyValueDataset1 = data1.map(lambda elements: (elements[0],elements[1])).partitionBy(10).persist() data2 = sc.textFile("data2.txt") keyValueDataset2 = data2.map(lambda line: (line.split(",")[0],line.split(",")[1])) joined = keyValueDataset1.join(keyValueDataset2)
  • 61. | © Copyright 2015 Hitachi Consulting61 Programming with Spark Working with Key/Value Pairs - Actions countByKey() collectAsMap() lookup(key) - Returns all values associated with the provided key. Any pair RDD action takes the number of reducers as an optional parameter
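For instance, on made-up data:

pairs = sc.parallelize([("A", 2), ("B", 5), ("A", 1), ("C", 5)])

print(pairs.countByKey())     # a dict-like map of key -> count: A -> 2, B -> 1, C -> 1
print(pairs.collectAsMap())   # {'A': 1, 'B': 5, 'C': 5}  (one value per key is kept)
print(pairs.lookup("A"))      # [2, 1]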
  • 62. | © Copyright 2015 Hitachi Consulting62 Accumulators and Broadcast Variables
  • 63. | © Copyright 2015 Hitachi Consulting63 Programming with Spark Accumulators  An RDD function in Spark (such as the one passed to map() or filter()) can use a variable defined in the driver program  However, each task running on the cluster gets a new copy of this variable.  Updates from these copies are not propagated back to the driver.  Accumulators provide a simple syntax for aggregating values from worker nodes back to the driver program.  A common use of accumulators is to count events that occur during job execution, maybe for debugging purposes.  In a worker task, accumulators are write-only variables. Only the driver program can retrieve the value of an accumulator
  • 64. | © Copyright 2015 Hitachi Consulting64 Programming with Spark Accumulators Accumulator Example file = sc.textFile(inputFile) blankLines = sc.accumulator(0) def extractRecords(line): global blankLines if line == "": blankLines += 1 return line.split(" ") records = file.flatMap(extractRecords) records.count() print "Blank lines: %d" % blankLines.value Define an accumulator of type INT with 0 as initial value – referenced by the blankLines variable Increment the accumulator through the blankLines variable Run an action so the transformation executes, then retrieve the accumulator value in the driver program
  • 65. | © Copyright 2015 Hitachi Consulting65 Programming with Spark Broadcast Variables  Allow keeping a read-only variable cached on each worker node, rather than shipping a copy of it with tasks  E.g., to give every node a copy of a large input dataset (reference data) in an efficient manner.  After the broadcast variable is created, it should be used instead of the original value in any functions run on the cluster, so that the variable is not shipped to the nodes more than once. Broadcast Example rdd = sc.textFile("ref_data.txt") ref_data = sc.broadcast(rdd.collect()) def processData(input, ref): … data = data.map(lambda a: processData(a, ref_data.value))
  • 66. | © Copyright 2015 Hitachi Consulting66 Per-Partition Operations
  • 67. | © Copyright 2015 Hitachi Consulting67 Programming with Spark Per-Partition Operations  Some operations need to be executed per partition as a whole, rather than per each item in the RDD, which is the normal behaviour of transformations like map() or filter()  E.g., setting up a database connection, creating a random number generator, preparing a return object for the aggregation to happen on the RDD, etc.  For all of the mentioned objects, we only need one per RDD partition, rather than per element. mapPartitions() mapPartitionsWithIndex() foreachPartition()
  • 68. | © Copyright 2015 Hitachi Consulting68 Programming with Spark Per-Partition Operations Example: foreach() vs. foreachPartition() list = [1,2,3,4,5,6,7,8,9,10] rdd = sc.parallelize(list) counter = sc.accumulator(0) def operation(element): global counter counter+=1 rdd.foreach(operation) print "counter value is:" + repr(counter.value) counter2 = sc.accumulator(0) rdd=rdd.repartition(3) print "number of partitions = " + repr(rdd.getNumPartitions()) def operation2(partition): global counter2 counter2+=1 rdd.foreachPartition(operation2) print "counter2 value is:" + repr(counter2.value) The counter returns 10, one for each item in the RDD The counter returns 3, one for each RDD partition
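A complementary sketch for mapPartitions(), which receives an iterator over a whole partition and must return an iterable; this is the natural place for per-partition setup (the "connection" below is a stand-in dictionary, not a real database client):

def process_partition(records):
    # Hypothetical expensive setup, done once per partition rather than once per element
    connection = {"open": True}                    # stand-in for e.g. a database connection
    results = [(r, connection["open"]) for r in records]
    connection["open"] = False                     # stand-in for closing the connection
    return iter(results)

rdd = sc.parallelize(range(10), 3)
print(rdd.mapPartitions(process_partition).count())   # 10 elements; the setup ran only 3 times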
  • 69. | © Copyright 2015 Hitachi Consulting69 Spark SQL
  • 70. | © Copyright 2015 Hitachi Consulting70 Spark SQL DataFrames Distributed collection of data organized into named columns Conceptually equivalent to a table in a relational database or a data frame in R/Python, with richer Spark optimizations RDDs Structured Data Files Hive Tables RDBMS
  • 71. | © Copyright 2015 Hitachi Consulting71 Spark SQL DataFrames Creating a DataFrame from RDD of Rows from pyspark import SparkContext, SparkConf from pyspark.sql import * from pyspark.sql.types import * conf = SparkConf().setAppName("My App").setMaster("local") sc = SparkContext(conf = conf) sqlContext = SQLContext(sc) row1 = Row(id = 1, Name = 'khalid') row2 = Row(id = 2, Name = 'Zahra') row3 = Row(id = 3, Name = 'Adel') row4 = Row(id = 4, Name = 'Jassem') rdd = sc.parallelize([row1,row2,row3,row4]) df = sqlContext.createDataFrame(rdd) df.printSchema()
  • 72. | © Copyright 2015 Hitachi Consulting72 Spark SQL DataFrames Creating a DataFrame from RDD – with a list of column names from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext conf = SparkConf().setAppName("My App").setMaster("local") sc = SparkContext(conf = conf) sqlContext = SQLContext(sc) rdd = sc.parallelize([("productA","01/01/2015",50),("productA","01/02/2015",100),("productB","01/01/2015",70)]) df = sqlContext.createDataFrame(rdd, ["product","date","value"]) df.printSchema() If only the column names are supplied, data types will be inferred
  • 73. | © Copyright 2015 Hitachi Consulting73 Spark SQL DataFrames Creating a DataFrame from RDD – with schema from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext from pyspark.sql.types import * conf = SparkConf().setAppName("My App").setMaster("local") sc = SparkContext(conf = conf) sqlContext = SQLContext(sc) rdd = sc.parallelize([("productA","01/01/2015",50),("productA","01/02/2015",100),("productB","01/01/2015",70)]) schema = StructType([StructField("Item",StringType(),True),StructField("Date",StringType(),True),StructField("Stock",LongType(),True)]) df = sqlContext.createDataFrame(rdd,schema) df.printSchema() Supplied schema
  • 74. | © Copyright 2015 Hitachi Consulting74 Spark SQL DataFrames Show DataFrame content df.show()
  • 75. | © Copyright 2015 Hitachi Consulting75 Spark SQL DataFrames Creating a DataFrame – sqlContext.read  sqlContext.read.json(inputFile)  sqlContext.read.format('jdbc').options(jdbcConnectionString).load()  sqlContext.read.parquet(inputFile)  sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(inputFile) Saving a DataFrame  df.write.json(outputFile)  df.createJDBCTable(jdbcConnectionString, TableName, allowExisting = true)  df.insertIntoJDBC(jdbcConnectionString, TableName, overwrite = false)  df.write.parquet(outputFile)  df.write.format("com.databricks.spark.csv").save("/data/home.csv")
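A short round-trip sketch with these readers and writers, assuming an existing sqlContext and that the illustrative paths "data.json" and "data_parquet" are readable/writable in your environment:

  # Read JSON into a DataFrame, write it out as Parquet, then read it back
  df = sqlContext.read.json("data.json")
  df.write.parquet("data_parquet")        # columnar output: a directory of part files

  df2 = sqlContext.read.parquet("data_parquet")
  df2.printSchema()                       # the schema survives the round trip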
  • 76. | © Copyright 2015 Hitachi Consulting76 Spark SQL DataFrames Creating a DataFrame from JSON from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext conf = SparkConf().setAppName("My App").setMaster("local") sc = SparkContext(conf = conf) sqlContext = SQLContext(sc) df = sqlContext.read.json("data.json") df.printSchema() df.show()
  • 77. | © Copyright 2015 Hitachi Consulting77 Spark SQL DataFrames Creating a DataFrame from CSV  Load the csv data into an RDD (e.g. using csv.DictReader()), then create a DataFrame from that RDD  Or use the csv loader com.databricks:spark-csv C:\spark\spark-1.6.1-bin-hadoop2.4\bin>pyspark --packages com.databricks:spark-csv_2.11:1.4.0 df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(inputFile)
  • 78. | © Copyright 2015 Hitachi Consulting78 Spark SQL Manipulating Data Frames A DataFrame is a collection of spark.sql.Row objects rows = df.collect() len(rows) rows[0] rows[0][1] Reference a column in a DataFrame  As an attribute of the DataFrame: df.product  As a key: df["product"] Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 collect() returns a Python list of Row objects Row(product='Product A', date='2015-01-01', value=50) rows[0][1] evaluates to '2015-01-01'
  • 79. | © Copyright 2015 Hitachi Consulting79 Spark SQL Manipulating Data Frames Filtering a DataFrame df = df.filter(df.date == '2015-01-01') df = df.filter(df["product"] == "Product B") df = df.filter("value > 50") OR df = df.filter("value > 50 AND product = 'Product A'") Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Date Value Product B 2015-01-01 70
  • 80. | © Copyright 2015 Hitachi Consulting80 Spark SQL Manipulating Data Frames Selecting Columns (projection) df = df.select(df.product, df.value) df = df.select(df.product, df.value*10) df = df.select(df.product, df.value*10, length(df.product) + df.value*10) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Value Product A 50 Product A 100 Product B 70 Product Value*10 length(product)+value*10 Product A 500 508 Product A 1000 1008 Product B 700 708 Create new columns from expressions over existing columns (length() comes from pyspark.sql.functions)
  • 81. | © Copyright 2015 Hitachi Consulting81 Spark SQL Manipulating Data Frames Selecting Columns (projection) df = df.select(df.product, df.value) df = df.select(df.product, df.value*10) df = df.select(df.product, df.value*10, (length(df.product) + df.value*10).alias("derived")) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Value Product A 50 Product A 100 Product B 70 Product Value*10 derived Product A 500 508 Product A 1000 1008 Product B 700 708 Give an alias to the new column
  • 82. | © Copyright 2015 Hitachi Consulting82 Spark SQL Manipulating Data Frames Order By from pyspark.sql.functions import asc, desc df = df.orderBy(df.value) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Date Value Product A 2015-01-01 50 Product B 2015-01-01 70 Product A 2015-01-02 100
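For descending order, either of the following should work (a small sketch against the same df):

  from pyspark.sql.functions import desc

  df_desc = df.orderBy(desc("value"))     # highest value first
  # equivalently: df.orderBy(df.value.desc())
  df_desc.show()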
  • 83. | © Copyright 2015 Hitachi Consulting83 Spark SQL Manipulating Data Frames join result = df1.join(df2, df1.product == df2.p, "inner").select(df1.product, df2.model, df1.value) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Model Product A X Product B Y Join condition Join Type Product Model Value Product A X 50 Product A X 100 Product B Y 70
  • 84. | © Copyright 2015 Hitachi Consulting84 Spark SQL Manipulating Data Frames groupBy result = df.groupBy(df.product) result.count().show() result.sum("value").show() result.max("value").show() agg import pyspark.sql.functions as F … result = df.groupBy(df.product).agg(F.sum("value"), F.min("value")) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 groupBy() returns a spark.sql.group.GroupedData object; count(), sum(), max() and agg() return DataFrames (the grouping column is included automatically)
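A small sketch of the same aggregation with readable column names via alias(), assuming the df shown above:

  import pyspark.sql.functions as F

  summary = (df.groupBy("product")
               .agg(F.sum("value").alias("total_value"),
                    F.min("value").alias("min_value"),
                    F.count("*").alias("row_count")))
  summary.show()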
  • 85. | © Copyright 2015 Hitachi Consulting85 Spark SQL Manipulating Data Frames It’s important to persist() or cache() a DataFrame after processing it via filter(), join(), groupBy(), etc., so that these expensive operations are not re-computed each time an action (e.g. show(), count(), collect()) is performed on it
  • 86. | © Copyright 2015 Hitachi Consulting86 Spark SQL Using Structured Query Language df.registerTempTable("MyDataTable") query = "SELECT product, SUM(value) Total FROM MyDataTable GROUP BY product" result = sqlContext.sql(query) result.show() sqlContext.sql() returns a DataFrame
  • 87. | © Copyright 2015 Hitachi Consulting87 Spark SQL Hive Integration
  • 88. | © Copyright 2015 Hitachi Consulting88 Getting Started with Spark on Azure HDInsight
  • 89. | © Copyright 2015 Hitachi Consulting89 Spark on Azure HDInsight Creating HDInsight Spark Cluster
  • 90. | © Copyright 2015 Hitachi Consulting90 Spark on Azure HDInsight Creating HDInsight Spark Cluster
  • 91. | © Copyright 2015 Hitachi Consulting91 Spark on Azure HDInsight Creating HDInsight Spark Cluster
  • 92. | © Copyright 2015 Hitachi Consulting92 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight
  • 93. | © Copyright 2015 Hitachi Consulting93 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight
  • 94. | © Copyright 2015 Hitachi Consulting94 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight
  • 95. | © Copyright 2015 Hitachi Consulting95 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Data Frames
  • 96. | © Copyright 2015 Hitachi Consulting96 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Spark SQL
  • 97. | © Copyright 2015 Hitachi Consulting97 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Spark SQL
  • 98. | © Copyright 2015 Hitachi Consulting98 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Using Custom Library Python Script Uploaded to ksmsdnspark/HdiSamples/HdiSamples/WebsiteLogSampleData
  • 99. | © Copyright 2015 Hitachi Consulting99 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Using External Package
  • 100. | © Copyright 2015 Hitachi Consulting100 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Upload Adventure works data file extracts to the blob container
  • 101. | © Copyright 2015 Hitachi Consulting101 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Process files with Spark and Save as hive table
  • 102. | © Copyright 2015 Hitachi Consulting102 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Process files with Spark and Save as hive table
  • 103. | © Copyright 2015 Hitachi Consulting103 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Processing Output is Saved and Partitioned in Hive
  • 104. | © Copyright 2015 Hitachi Consulting104 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Query Data in Spark SQL
  • 105. | © Copyright 2015 Hitachi Consulting105 Spark on Azure HDInsight Microsoft Power BI and Spark SQL
  • 106. | © Copyright 2015 Hitachi Consulting106 Spark on Azure HDInsight Microsoft Power BI and Spark SQL
  • 107. | © Copyright 2015 Hitachi Consulting107 Spark on Azure HDInsight Microsoft Power BI and Spark SQL
  • 108. | © Copyright 2015 Hitachi Consulting108 Spark on Azure HDInsight Microsoft Power BI and Spark SQL
  • 109. | © Copyright 2015 Hitachi Consulting109 Spark on Azure HDInsight Hive Integration Spark SQL JDBC/ODBC Custom App Spark SQL Shell … Hive JSON Parquet … Excel Tableau …  Spark SQL with Hive support allows us to access Hive tables, UDFs (user-defined functions), SerDes (serialization and deserialization formats), and the Hive query language (HiveQL)  sqlContext (which is a HiveContext on HDInsight) is the entry point for accessing the Hive metastore and functionality; HiveQL is the recommended query language in this case  If Hive is installed, the hive-site.xml file must be copied to Spark’s configuration directory ($SPARK_HOME/conf)  If there is no Hive installation, Spark SQL will create its own Hive metastore (metadata DB) in your program’s working directory, called metastore_db  In addition, if you attempt to create tables using HiveQL’s CREATE TABLE, they will be placed in the /user/hive/warehouse directory on your default filesystem (either your local filesystem or HDFS)
  • 110. | © Copyright 2015 Hitachi Consulting110 Spark on Azure HDInsight Hive Integration sqlContext.sql("CREATE TABLE MyTable (id int, name string, salary float)") tables = sqlContext.sql("SHOW TABLES") tables.show() description = sqlContext.sql("DESCRIBE MyTable") description.show() sqlContext.sql("INSERT INTO MyTable SELECT * FROM (SELECT 1101 as id, 'Khalid Salama' as name, 70000 as salary) query") result = sqlContext.sql("SELECT * FROM MyTable") result.show()
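A minimal sketch of saving a DataFrame as a partitioned Hive table, which is the pattern behind the "Save as hive table" slides above; sqlContext is assumed to be a HiveContext, df is assumed to have a product column, and the table name SalesByProduct is made up for illustration:

  # Write the DataFrame into the Hive warehouse, partitioned by product
  (df.write
     .mode("overwrite")
     .partitionBy("product")
     .saveAsTable("SalesByProduct"))

  # The table is then queryable from Spark SQL, and from ODBC clients such as Power BI or Excel
  sqlContext.sql("SELECT product, SUM(value) AS total FROM SalesByProduct GROUP BY product").show()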
  • 111. | © Copyright 2015 Hitachi Consulting111 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Download and install Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883  Add an ODBC Data Source  Select Microsoft Spark ODBC Driver
  • 112. | © Copyright 2015 Hitachi Consulting112 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Download and install the Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883  Create a Spark ODBC connection  Configure the Spark ODBC connection
  • 113. | © Copyright 2015 Hitachi Consulting113 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Download and install the Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883  Create a Spark ODBC connection  Configure the Spark ODBC connection  Test the connection
  • 114. | © Copyright 2015 Hitachi Consulting114 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Open Excel and go to Data, From Other Sources, Microsoft Query
  • 115. | © Copyright 2015 Hitachi Consulting115 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Browse tables and select attributes to include
  • 116. | © Copyright 2015 Hitachi Consulting116 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Show data in Pivot Table
  • 117. | © Copyright 2015 Hitachi Consulting117 Spark CLR (Mobius)
  • 118. | © Copyright 2015 Hitachi Consulting118 Spark CLR Installing Spark CLR  Download Spark Mobius https://github.com/Microsoft/Mobius  Unzip the content of the zip file to C:\spark\spark-clr_2.10-1.6.100  Inside this folder you should find a “runtime” folder, which includes the bin, lib, dependencies and scripts folders  Create a new Visual Studio project (Console App). Right-click the project, choose Manage NuGet Packages, and install the following packages one by one (you will then find them in packages.config) <packages> <package id="log4net" version="2.0.5" targetFramework="net452" /> <package id="Microsoft.SparkCLR" version="1.6.100" targetFramework="net452" /> <package id="Newtonsoft.Json" version="7.0.1" targetFramework="net452" /> <package id="Razorvine.Pyrolite" version="4.10.0.0" targetFramework="net452" /> <package id="Razorvine.Serpent" version="1.12.0.0" targetFramework="net452" /> </packages>
  • 119. | © Copyright 2015 Hitachi Consulting119 Spark CLR Running .NET Apps with Spark CLR Writing Spark Processor Class
  • 120. | © Copyright 2015 Hitachi Consulting120 Spark CLR Running .NET Apps with Spark CLR main function, in the SparkAppDemo class, calls processor.process()  Build your project to produce SparkAppDemo.exe  Go to C:\spark\spark-clr_2.10-1.6.100\runtime\scripts  Run the following command >sparkclr-submit --exe SparkAppDemo.exe C:\spark\mywork\CSharp\SparkAppDemo
  • 121. | © Copyright 2015 Hitachi Consulting121 Spark CLR Running .NET Apps with Spark CLR
  • 122. | © Copyright 2015 Hitachi Consulting122 How to Get Started with Spark  Read the slides!  Azure Spark HDInsight Documentation https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-apache-spark-overview/  Apache Spark Programming Guide http://spark.apache.org/docs/latest/programming-guide.html  Spark CLR (Mobius) https://github.com/Microsoft/Mobius  Introduction to Big Data Analytics (week 5) – Coursera Big Data Specialization https://www.coursera.org/learn/bigdata-analytics/home/week/5  Data Manipulation at Scale (week 4, lesson 20) – Coursera Data Science at Scale https://www.coursera.org/learn/data-manipulation/home/week/4  Data Science and Engineering with Apache Spark – edX 5-course track https://www.edx.org/xseries/data-science-engineering-apache-spark  O’Reilly Books – Learning Spark
  • 123. | © Copyright 2015 Hitachi Consulting123 Appendix A: Spark Configurations SparkConf()  spark.app.name  spark.master spark://host:<port> | mesos://host:<port> | yarn | local | local[<cores>]  spark.ui.port  spark.executor.memory  spark.executor.cores  spark.serializer  spark.eventLog.enabled  spark.eventLog.dir spark-submit --master --deploy-mode client | cluster --name --files --py-files --executor-memory --driver-memory
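A small sketch of how these settings are typically applied, either in code through SparkConf or at submission time through spark-submit; the application name, memory sizes and script name below are illustrative only:

  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .setAppName("My App")
          .setMaster("local[2]")
          .set("spark.executor.memory", "2g")
          .set("spark.eventLog.enabled", "true"))
  sc = SparkContext(conf=conf)

  # Equivalent settings supplied at submit time:
  # spark-submit --master yarn --deploy-mode cluster --name "My App" \
  #   --executor-memory 2g --driver-memory 2g my_script.py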
  • 124. | © Copyright 2015 Hitachi Consulting124 My Background Applying Computational Intelligence in Data Mining • Honorary Research Fellow, School of Computing, University of Kent. • Ph.D. Computer Science, University of Kent, Canterbury, UK. • M.Sc. Computer Science, The American University in Cairo, Egypt. • 25+ published journal and conference papers, focusing on: – classification rules induction, – decision trees construction, – Bayesian classification modelling, – data reduction, – instance-based learning, – evolving neural networks, and – data clustering • Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing. • Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData. ResearchGate.org
  • 125. | © Copyright 2015 Hitachi Consulting125 Thank you!