| © Copyright 2015 Hitachi Consulting1
Spark with Azure HDInsight
Lightning-fast Big Data Processing
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
| © Copyright 2015 Hitachi Consulting2
Outline
 Spark and Big Data
 Installing Spark
 Spark Core Concepts
 Programming with Spark
 Spark SQL
 Getting Started with Spark on HDInsight
 Spark CLR (Mobius)
 ETL and Automation
 Useful Resources
| © Copyright 2015 Hitachi Consulting3
Introducing Spark
| © Copyright 2015 Hitachi Consulting4
What is Spark?
The Lightning-fast Big Data Processing
 General-purpose Big Data processing engine that integrates with HDFS
 In-memory processing (fast), suited to iterative processing and interactive query
 Libraries for SQL, stream processing, machine learning, and graph processing
 Language bindings: Scala – Python – Java – R – .NET
| © Copyright 2015 Hitachi Consulting5
Spark and Hadoop Ecosystem
Spark and the zoo…
[Diagram: the Hadoop ecosystem — acquisition, batch, in-memory (Spark), stream, SQL (Spark SQL), NoSQL, machine learning, search, orchestration and management applications running on YARN (Yet Another Resource Negotiator) over HDFS (Name Node and DataNodes 1…N)]
| © Copyright 2015 Hitachi Consulting6
Spark Components
Spark and the zoo…
[Diagram: the Spark stack — the Spark Core Engine (RDDs: Resilient Distributed Datasets), with Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning) and GraphX (graph processing) on top, exposed to Scala, Java, Python, R and .NET (Mobius), running on YARN over HDFS (Name Node and DataNodes 1…N)]
| © Copyright 2015 Hitachi Consulting7
Spark Components
Spark Core
 Contains the basic functionality of Spark, including components for task scheduling, memory management, fault
recovery, interacting with storage systems, etc.
 Home to the API that defines resilient distributed datasets (RDDs)
Spark SQL
 Package for working with structured data (DataFrames). It allows querying data via SQL as well as the Apache
Hive variant of SQL (HiveQL), and supports many sources of data, including Hive tables, Parquet, and JSON.
 Allows developers to intermix SQL queries with the programmatic data manipulations supported by
RDDs in a single application, thus combining SQL with complex analytics
Spark Streaming
 Provides an API for manipulating data streams that closely matches the Spark Core RDD API, making it easy for
programmers to learn the project and move between applications that manipulate data stored in memory, on disk,
or arriving in real time.
Spark MLlib
 Provides multiple types of machine learning algorithms, including classification, regression, clustering, and
collaborative filtering, as well as supporting functionality such as model evaluation and data import.
 Provides some lower-level ML primitives, including a generic gradient descent optimization algorithm.
Spark GraphX
 Provides graph manipulation operations and supports graph-parallel computations.
 Allows creating a directed graph with arbitrary properties attached to each vertex and edge.
 Provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common
graph algorithms (e.g., PageRank and triangle counting).
Cluster Managers
 Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple
cluster manager included in Spark itself called the Standalone Scheduler.
| © Copyright 2015 Hitachi Consulting8
What is Spark?
The Lightning-fast Big Data processing
[Diagram: Spark cluster architecture — the Driver Program (with its SparkContext) on the master node submits tasks, through the Cluster Manager (Standalone / YARN / Mesos), to Executor processes running tasks on Worker Nodes 1…N]
Driver Program – Contains the main function, defines distributed datasets and applies operations to them
(e.g. the Spark shell). Submits tasks to the executor processes.
SparkContext – The connection to the computing cluster; used to create distributed datasets.
Initialized automatically (as sc) when using the Spark shell with the default config.
| © Copyright 2015 Hitachi Consulting9
What is Spark?
The Lightning-fast Big Data processing
How Spark works on a cluster:
1. The user submits an application using spark-submit.
2. spark-submit launches the driver program and invokes its main() method.
3. The driver program contacts the cluster manager to ask for resources to launch executors.
4. The cluster manager launches executors on behalf of the driver program.
5. The driver sends RDD transformations and actions to the executors in the form of tasks.
6. Tasks are run on executor processes to compute and save results.
7. When the driver’s main() method exits or calls SparkContext.stop(), the executors are terminated and resources are released from the cluster manager.
| © Copyright 2015 Hitachi Consulting10
Installing Spark
| © Copyright 2015 Hitachi Consulting11
Installing Spark
Windows Standalone Installation (no HDFS)
 Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)
 Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”),
using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"
 Check that the variable has been set: >ECHO %JAVA_HOME%
 Install Spark 1.5.2 or 1.6.*
 Unzip the content to c:\spark
 If there is no Hadoop, you need to install winutils.exe
 Place winutils.exe in c:\hadoop\bin
 Set the HADOOP_HOME environment variable to “c:\hadoop” (so that winutils.exe is found under %HADOOP_HOME%\bin),
using the command prompt: >SETX HADOOP_HOME c:\hadoop
 Check that the variable has been set: >ECHO %HADOOP_HOME%
 Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark
 You can test the following statements
 1+1
 List = sc.parallelize([1,2,5])
 List.count()
 exit()
| © Copyright 2015 Hitachi Consulting17
Submitting a Python script to Spark
Using spark-submit
C:\spark\spark-1.6.1-bin-hadoop2.4\bin\spark-submit <scriptFilePath>
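As a minimal sketch of what such a script might contain (the file name myscript.py is illustrative, not from the deck), a standalone program creates its own SparkContext instead of relying on the shell's sc:

# myscript.py – a hypothetical, minimal PySpark script for spark-submit
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("My Submitted App")    # the master can also be set via --master on spark-submit
sc = SparkContext(conf=conf)
numbers = sc.parallelize([1, 2, 3, 4, 5])            # create an RDD from a local collection
print(numbers.filter(lambda n: n % 2 == 0).count())  # a transformation followed by an action
sc.stop()                                            # release cluster resources

It would then be submitted as, e.g., C:\spark\spark-1.6.1-bin-hadoop2.4\bin\spark-submit myscript.py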
| © Copyright 2015 Hitachi Consulting18
Spark Core Concepts
| © Copyright 2015 Hitachi Consulting19
Spark Core Concepts
Key/Value (Pair) RDDs
Persisting & Removing
RDDs
Per-Partition Operations
Accumulators &
Broadcast Variables
Resilient Distributed Datasets (RDDs)
Transformations Actions
| © Copyright 2015 Hitachi Consulting20
Spark Core Concepts
Resilient Distributed Datasets
 Distributed, Fault-tolerant, Immutable Collection of Memory Objects
 Split into partitions to be processed on different nodes of the cluster.
 Can contain any type of Python, Java, or Scala objects, including
user-defined classes
 Processed through Transformations and Actions
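As a small illustrative sketch (assuming the sc provided by the Spark shell), the partitioning of an RDD can be inspected directly:

rdd = sc.parallelize(range(100), 4)        # explicitly request 4 partitions
print(rdd.getNumPartitions())              # 4 – each partition can be processed on a different node
print(rdd.map(lambda n: n * 2).take(3))    # [0, 2, 4] – transformations run per partition, the action gathers results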
| © Copyright 2015 Hitachi Consulting21
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Parallelizing an existing collection in the driver program
 Loading a dataset from an external data store
| © Copyright 2015 Hitachi Consulting22
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Parallelizing an existing collection in the driver program
collection = ["Khalid", "Magdy", "Nagib", "Salama"]
rdd = sc.parallelize(collection)
| © Copyright 2015 Hitachi Consulting23
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Referencing a dataset in an external storage system, such as a shared filesystem, HDFS,
HBase, etc.
filePath = <"/directory/file.csv" | "/directory" | "/directory/*.csv">
rdd = sc.textFile(filePath)
rdd = sc.wholeTextFiles(directoryPath)
textFile() can read a file, all files in a folder, or files matching a wildcard, and returns an RDD of lines.
wholeTextFiles() returns an RDD of (filename, content) pairs.
| © Copyright 2015 Hitachi Consulting24
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Loading json files
import json
…
input = sc.textFile("jsonfile.json")
data = input.map(lambda x: json.loads(x))
| © Copyright 2015 Hitachi Consulting25
Spark Core Concepts
Resilient Distributed Datasets
Creating an RDD
 Load CSV file
import csv
import StringIO
..
def loadRecord(line):
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["store", "date", "value"])
    return reader.next()
..
inputFile = "C:/spark/mywork/data/data.csv"
input = sc.textFile(inputFile).map(loadRecord)
input.collect()[0]
| © Copyright 2015 Hitachi Consulting26
Spark Core Concepts
Processing RDDs
Transformations
 Construct a new RDD based on the current one by manipulating the collection
 Lazy execution: only performed when an action is invoked
 The set of transformations is optimized prior to execution (when an action is invoked) to load and process less data
 Examples: filter(), map(), flatMap(), groupByKey(), cogroup(), reduceByKey(), sortByKey(), distinct(), sample(), union(), intersection(), join(), and more…
Actions
 Compute a result based on an RDD
 Return the results to the Driver Program, or save them to an external storage system
 The RDD is recomputed (i.e., transformations are re-applied) each time an action is invoked
 rdd.cache() or rdd.persist([option]) to reuse the computed RDD
 Examples: reduce(), first(), take(), takeSample(), count(), countByKey(), collect(), saveAsTextFile(), foreach()
| © Copyright 2015 Hitachi Consulting27
Spark Core Concepts
Spark Program
Spark Program in a nutshell (see the sketch below):
1. Create an RDD by loading a dataset from an external file, using textFile()
2. Apply transformations to the RDD, like filter(), map(), join()
3. Call RDD.persist() to mark the computed RDD for reuse across actions
4. Apply actions to the RDD, like count(), reduce(), collect()
5. Save the action results to external data storage using saveAsTextFile()
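A minimal sketch putting the five steps together, assuming the shell's sc and an input file named log.txt (the file name is illustrative):

lines = sc.textFile("log.txt")                        # 1. load a dataset into an RDD
errors = lines.filter(lambda line: "error" in line)   # 2. apply transformations
errors.persist()                                      # 3. mark the computed RDD for reuse
print(errors.count())                                 # 4. apply actions
errors.saveAsTextFile("errors_output")                # 5. save the results to external storage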
| © Copyright 2015 Hitachi Consulting28
Spark Transformations
| © Copyright 2015 Hitachi Consulting29
Programming with Spark
Transformations
Filter – return a subset of the RDD based on some condition(s)
lines = sc.textFile("log.txt")
errors = lines.filter(lambda line: "error" in line or "warn" in line)
errors = errors.filter(lambda line: len(line) > 10)
numbers = sc.parallelize([1,2,3,4,5])
evenNumbers = numbers.filter(lambda n: n % 2 == 0 and n > 3)
def isPrime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5)+1):
        if n % i == 0:
            return False
    return True
primeNumbers = numbers.filter(isPrime)
| © Copyright 2015 Hitachi Consulting30
Programming with Spark
Transformations
Map – return a new collection by applying a function on each element of the RDD
list = sc.parallelize([1,2,3,4,5])
list_sqr = list.map(lambda n: n*n)
[1,2,3,4,5]  ->  [1,4,9,16,25]

lines = sc.textFile("data.txt")
linewords = lines.map(lambda line: line.split(" "))
subset = lines.filter(lambda line: "product" in line)
output = subset.map(lambda line: line.count("bad"))

data.txt:
This product is great
The product I bought yesterday is so bad
I am happy
Very bad product, very bad
lines:
[
This product is great,
The product I bought yesterday is so bad,
I am happy,
Very bad product, very bad
]
linewords:
[
[This, product, is, great],
[The, product, I, bought, yesterday, is, so, bad],
[I, am, happy],
[Very, bad, product, very, bad]
]
subset (lines containing "product"):
[
This product is great,
The product I bought yesterday is so bad,
Very bad product, very bad
]
output (count of "bad" per line in subset):
[0, 1, 2]

lines = sc.textFile("data.txt")
records = lines.map(lambda line: Order.ParseLineToOrder(line))
filtered = records.filter(lambda order: order.SalesValue > 100)

data.txt:
1, 2016/01/01, productA, 456
2, 2016/01/01, productB, 65
3, 2016/01/02, productA, 104
records:
[
Order(Id:1, date:2016-01-01, product:"productA", SalesValue:456),
Order(Id:2, date:2016-01-01, product:"productB", SalesValue:65),
Order(Id:3, date:2016-01-02, product:"productA", SalesValue:104)
]
filtered (SalesValue > 100):
[
Order(Id:1, date:2016-01-01, product:"productA", SalesValue:456),
Order(Id:3, date:2016-01-02, product:"productA", SalesValue:104)
]
| © Copyright 2015 Hitachi Consulting39
Programming with Spark
Transformations
FlatMap – if the map function returns a collection for each item in the RDD,
flatMap returns a "flat" collection, rather than a collection of collections
lines = sc.textFile("data.txt")
words = lines.flatMap(lambda line: line.split(" "))

#word count example
lines = sc.textFile("data.txt")
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda word: (word, 1))
combined = counts.reduceByKey(lambda a, b: a + b)
combined.saveAsTextFile("output.txt")

data.txt:
This product is great
The product I bought yesterday is so bad
I am happy
Very bad product, very bad
words:
[
This, product, is, great, The, product, I,
bought, yesterday, is, so, bad, I, am,
happy, Very, bad, product, very, bad
]
| © Copyright 2015 Hitachi Consulting40
Programming with Spark
Transformations
Union, Intersection, Subtract, Distinct
list1 = [1,2,3,4,5]
list2 = [2,4,6,8,10]
rdd1 = sc.parallelize(list1)
rdd2 = sc.parallelize(list2)
rdd3 = rdd1.union(rdd2)          # [1,2,3,4,5,2,4,6,8,10] – union does not remove duplicates (use distinct())
rdd4 = rdd1.intersection(rdd2)   # [2,4]
rdd5 = rdd1.subtract(rdd2)       # [1,3,5]
| © Copyright 2015 Hitachi Consulting41
Spark Actions
| © Copyright 2015 Hitachi Consulting42
Programming with Spark
Actions
Reduce – operates on two elements in your RDD and returns a new element of the same type.
numbers = sc.parallelize([1,2,3,4,5])
sum = numbers.reduce(lambda a,b: a+b)
max = numbers.reduce(lambda a,b: a if a > b else b)
| © Copyright 2015 Hitachi Consulting43
Programming with Spark
Actions
Reduce – operates on two elements in your RDD and returns a new element of the same type.
numbers = sc.parallelize([1,2,3,4,5])
sum = numbers.reduce(lambda a,b: a+b)
max = numbers.reduce(lambda a,b: a if a > b else b)
words = sc.parallelize(["hello", "my", "name", "is", "Khalid"])
concatenated = words.reduce(lambda a, b: a + " " + b)
[
"hello",
"my",
"name",
"is",
"Khalid"
]
'hello my name is Khalid'
| © Copyright 2015 Hitachi Consulting44
Programming with Spark
Actions
Aggregate – aggregates the elements of your RDD and can return a result of a different type.
words = sc.parallelize(["hello", "my", "name", "is", "Khalid"])
number_of_letters = words.aggregate(0,
    (lambda acc, value: acc + len(value)),
    (lambda acc1, acc2: acc1 + acc2))
Alternatively:
number_of_letters = words.map(lambda word: len(word)).reduce(lambda a, b: a + b)
Return a tuple (sum, count) to calculate an average:
nums = sc.parallelize([1, 2, 3, 4, 5])
sumCount = nums.aggregate((0, 0),
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
avg = sumCount[0] / float(sumCount[1])
aggregate() takes an initial value of the return type (here, the initial value is a tuple)
and 2 functions: 1) how to add an element to the accumulated value, 2) how to
merge two accumulated values.
| © Copyright 2015 Hitachi Consulting45
Programming with Spark
Actions
collect – Returns the RDD as a collection (not an RDD anymore!).
count – Returns the number of elements in the RDD.
takeSample – Takes n random elements from the RDD.
first – Takes the first element in the RDD.
countByValue – Returns a map of each unique value to its count.
foreach – Performs an operation on each element in the computed RDD.
saveAsTextFile – Saves the content of the RDD to a text file.
saveAsSequenceFile – Saves the content of the RDD to a Sequence file.
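A small sketch exercising a few of these actions on a toy RDD (assuming the shell's sc; the values are illustrative):

nums = sc.parallelize([3, 1, 2, 3, 5])
print(nums.count())            # 5
print(nums.first())            # 3
print(nums.take(2))            # [3, 1]
print(nums.countByValue())     # defaultdict: {1: 1, 2: 1, 3: 2, 5: 1}
nums.foreach(lambda n: None)   # runs a function on the executors purely for its side effects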
| © Copyright 2015 Hitachi Consulting46
Programming with Spark
Actions
Saving data as JSON
import csv
import StringIO
import json
..
def loadRecord(line):
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["store", "date", "value"])
    return reader.next()
..
inputFile = "C:/spark/mywork/data/data.csv"
input = sc.textFile(inputFile).map(loadRecord)
#data transformation, e.g., map(), filter(), reduce, etc..
outputFile = "C:/spark/mywork/data/data.json"
input.map(lambda element: json.dumps(element)).saveAsTextFile(outputFile)
| © Copyright 2015 Hitachi Consulting47
Persisting RDDs
| © Copyright 2015 Hitachi Consulting48
Programming with Spark
RDD Persistence
 Spark performs the transformations on an RDD in a lazy manner; only after an action is invoked
 Spark re-computes the RDD each time an action is called on the RDD
 This can be especially expensive for iterative algorithms, which look at the data many times
rdd = numbers.filter(lambda a: a >10)
rdd2 = rdd.map(lambda a: a*a)
rdd2.count()
rdd2.collect()
rdd = numbers.filter(lambda a: a >10)
rdd2 = rdd.map(lambda a: a*a)
rdd2.cache()
rdd2.count()
rdd2.collect()
Each action will cause the RDD to be
recomputed (filter & map)
This will compute and persist the RDD to
perform several actions on it
| © Copyright 2015 Hitachi Consulting49
Programming with Spark
RDD Persistence
rdd.persist(StorageLevel.<level>)
rdd.cache() is the same as rdd.persist() with the default level (StorageLevel.MEMORY_ONLY)
rdd.unpersist() to free up memory from unused RDDs
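A short sketch of persisting an RDD with an explicit storage level (a sketch only; MEMORY_AND_DISK is just one of the available levels):

from pyspark import StorageLevel

squares = sc.parallelize(range(1000)).map(lambda n: n * n)
squares.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill to disk if it does not fit
print(squares.count())    # the first action computes and caches the RDD
print(squares.sum())      # later actions reuse the cached partitions
squares.unpersist()       # free the cached partitions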
| © Copyright 2015 Hitachi Consulting50
Working with Pair RDDs
| © Copyright 2015 Hitachi Consulting51
Programming with Spark
Working with Key/Value Pairs
 Pair RDDs are a useful building block in many programs, as they expose operations that allow
you to act on each key in parallel or regroup data across the network.
 Usually used to perform operations like join, group, sort, reduceByKey, etc.
data = sc.textFile("data.txt")
data = data.map(lambda line: line.split(","))
keyValueDataset = data.map(lambda elements: (elements[0], elements[1]))

def parseLine(line):
    parts = line.split(",")
    record = MyRecord(parts[0], parts[1], parts[2])
    return (record.Product, record)
keyValueDataset = sc.textFile("data.txt").map(parseLine)

data.txt:
A,234,01/01/2015
B,567,01/01/2015
A,157,01/01/2015
C,56,01/01/2015
B,345,01/01/2015
B,678,01/01/2015
keyValueDataset:
[(A,234), (B,567), (A,157), (C,56), (B,345), (B,678)]
| © Copyright 2015 Hitachi Consulting52
Programming with Spark
Working with Key/Value Pairs - Transformations
reduceByKey – aggregate values with the same key using a given function
data = sc.textFile("data.txt")
data = data.map(lambda line: line.split(","))
keyValueDataset = data.map(lambda elements: (elements[0], int(elements[1])))
result = keyValueDataset.reduceByKey(lambda a, b: a + b)

data.txt:
A,2
B,5
A,1
C,5
B,3
B,6
result:
[
(A,3),
(B,14),
(C,5)
]
| © Copyright 2015 Hitachi Consulting53
Programming with Spark
Working with Key/Value Pairs - Transformations
groupByKey - group values with the same key in a collection
data = sc.textFile("data.txt")
data = data.map(lambda line: line.split(","))
keyValueDataset = data.map(lambda elements: (elements[0], elements[1]))
result = keyValueDataset.groupByKey()

data.txt:
A,2
B,5
A,1
C,5
B,3
B,6
result:
[
(A,[2,1]),
(B,[5,3,6]),
(C,[5])
]
| © Copyright 2015 Hitachi Consulting54
Programming with Spark
Working with Key/Value Pairs - Transformations
mapValues - apply a function on each value of the pair without changing the key
data = sc.textFile("data.txt")
data = data.map(lambda line: line.split(","))
keyValueDataset = data.map(lambda elements: (elements[0], int(elements[1])))
result = keyValueDataset.mapValues(lambda a: a*a)

data.txt:
A,2
B,5
A,1
C,5
B,3
B,6
result:
(A,4),
(B,25),
(A,1),
(C,25),
(B,9),
(B,36)
| © Copyright 2015 Hitachi Consulting55
Programming with Spark
Working with Key/Value Pairs - Transformations
join - Perform an inner join between two RDDs
data1 = sc.textFile("data1.txt")
data1 = data1.map(lambda line: line.split(","))
keyValueDataset1 = data1.map(lambda elements: (elements[0], elements[1]))
data2 = sc.textFile("data2.txt")
data2 = data2.map(lambda line: line.split(","))
keyValueDataset2 = data2.map(lambda elements: (elements[0], elements[1]))
result = keyValueDataset1.join(keyValueDataset2)

data1.txt:
A,2
B,5
A,1
C,5
B,3
B,6
data2.txt:
A,22
B,55
A,11
result:
[
(A,(2,22)),
(A,(2,11)),
(A,(1,22)),
(A,(1,11)),
(B,(5,55)),
(B,(3,55)),
(B,(6,55))
]
| © Copyright 2015 Hitachi Consulting56
Programming with Spark
Working with Key/Value Pairs - Transformations
cogroup - Group data from both RDDs sharing the same key
data1 = sc.textFile("data1.txt")
data1 = data1.map(lambda line: line.split(","))
keyValueDataset1 = data1.map(lambda elements: (elements[0], elements[1]))
data2 = sc.textFile("data2.txt")
data2 = data2.map(lambda line: line.split(","))
keyValueDataset2 = data2.map(lambda elements: (elements[0], elements[1]))
result = keyValueDataset1.cogroup(keyValueDataset2)
result = result.mapValues(lambda tuple: list(tuple[0]) + list(tuple[1]))

data1.txt:
A,2
B,5
A,1
C,5
B,3
B,6
data2.txt:
A,22
B,55
A,11
cogroup result:
[
(A,([2,1],[22,11])),
(B,([5,3,6],[55])),
(C,([5],[]))
]
after mapValues:
[
(A,[2,1,22,11]),
(B,[5,3,6,55]),
(C,[5])
]
| © Copyright 2015 Hitachi Consulting57
Programming with Spark
Working with Key/Value Pairs - Transformations
zip - Pair an RDD with another RDD to produce a new key/value RDD,
with respect to the element order of each RDD
rdd1 = sc.parallelize(['A','B','C','D'])
rdd2 = sc.parallelize([101,102,103,104])
pairs = rdd1.zip(rdd2)
pairs.collect()
[
A,
B,
C,
D
]
[
101,
102,
103,
104
]
[
(A,101)
(B,102)
(C,103)
(D,104)
]
| © Copyright 2015 Hitachi Consulting58
Programming with Spark
Working with Key/Value Pairs - Transformations
zipWithIndex - Pair each element in the RDD with its index
list = ['A','B','C','D']
rdd = sc.parallelize(list)
rdd_indexed = rdd.zipWithIndex()
rdd_indexed.collect()
[
A,
B,
C,
D
]
[
(A, 0)
(B, 1)
(C, 2)
(D, 3)
]
| © Copyright 2015 Hitachi Consulting59
Programming with Spark
Working with Key/Value Pairs - Transformations
flatMapValues() – same as flatMap(), but with pair RDDs
keys()
values()
sortByKey()
leftOuterJoin()
rightOuterJoin()
subtractByKey()
combineByKey() – same as aggregate(), but with pair RDDs
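As a hedged sketch of combineByKey() (the classic per-key average, assuming string keys and numeric values as in the slides above):

pairs = sc.parallelize([("A", 2), ("B", 5), ("A", 1), ("C", 5), ("B", 3), ("B", 6)])
sumCount = pairs.combineByKey(
    (lambda v: (v, 1)),                          # createCombiner: the first value seen for a key
    (lambda acc, v: (acc[0] + v, acc[1] + 1)),   # mergeValue: add a value to a per-partition accumulator
    (lambda a, b: (a[0] + b[0], a[1] + b[1])))   # mergeCombiners: merge accumulators across partitions
avgByKey = sumCount.mapValues(lambda acc: acc[0] / float(acc[1]))
print(avgByKey.collect())    # e.g. [('A', 1.5), ('B', 4.66...), ('C', 5.0)]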
| © Copyright 2015 Hitachi Consulting60
Programming with Spark
Working with Key/Value Pairs - Transformations
partitionBy
 Hash Partitioning by key.
 elements with the same key will end up being processed on the same compute node, to reduce data
shuffling
 Useful with operations like cogroup(), groupWith(), join(), groupByKey(), reduceByKey(), combineByKey(),
and lookup().
 Usually used when data is loaded, then the rdd is persisted
data1 = sc.textFile("data1.txt")
data1 = data1.map(lambda line: line.split(","))
keyValueDataset1 = data1.map(lambda elements: (elements[0], elements[1])).partitionBy(10).persist()
data2 = sc.textFile("data2.txt")
keyValueDataset2 = data2.map(lambda line: (line.split(",")[0], line.split(",")[1]))
joined = keyValueDataset1.join(keyValueDataset2)
| © Copyright 2015 Hitachi Consulting61
Programming with Spark
Working with Key/Value Pairs - Actions
countByKey()
collectAsMap()
lookup(key) - Return all values associated with the provided key.
Many pair RDD operations take the number of parallel tasks (reducers) as an optional parameter
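A brief sketch of these pair-RDD actions on the same toy data (values illustrative):

pairs = sc.parallelize([("A", 2), ("B", 5), ("A", 1), ("C", 5)])
print(pairs.countByKey())     # defaultdict: {'A': 2, 'B': 1, 'C': 1}
print(pairs.collectAsMap())   # {'A': 1, 'B': 5, 'C': 5} – for duplicate keys only one value is kept
print(pairs.lookup("A"))      # [2, 1]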
| © Copyright 2015 Hitachi Consulting62
Accumulators and Broadcast
Variables
| © Copyright 2015 Hitachi Consulting63
Programming with Spark
Accumulators
 An RDD function in Spark (such as map() or filter()) can use a variable defined outside the driver program
 However, each task running on the cluster gets a new copy of this variable.
 Updates from these copies are not propagated back to the driver.
 Accumulators provide a simple syntax for aggregating values from worker nodes back to the driver
program.
 A common use of accumulators is to count events that occur during job execution, maybe for
debugging purposes.
 In a worker task, accumulators are write-only variables. Only the driver program can retrieve the value of an
accumulator
| © Copyright 2015 Hitachi Consulting64
Programming with Spark
Accumulators
Accumulator Example
file = sc.textFile(inputFile)
blankLines = sc.accumulator(0)
def extractRecords(line):
    global blankLines
    if (line == ""):
        blankLines += 1
    return line.split(" ")
records = file.flatMap(extractRecords)
records.count()   # an action must run before the accumulator is populated
print "Blank lines: %d" % blankLines.value
Define an accumulator of type INT with 0 as the initial value – referenced by the blankLines variable.
Increment the accumulator through the blankLines variable (inside the task).
Retrieve the accumulator value in the driver program via blankLines.value.
| © Copyright 2015 Hitachi Consulting65
Programming with Spark
Broadcast Variables
 Allow keeping a read-only variable cached on each worker node, rather than shipping a copy of it with tasks
 E.g., to give every node a copy of a large input dataset (reference data) in an efficient manner.
 After the broadcast variable is created, it should be used instead of the original value in any functions run on
the cluster, so that the variable is not shipped to the nodes more than once.
Broadcast Example
ref_data = sc.broadcast(sc.textFile("ref_data.txt").collect())   # broadcast a local collection, not an RDD
def processData(input, ref_data):
…
data = data.map(lambda a: processData(a, ref_data.value))
| © Copyright 2015 Hitachi Consulting66
Per-Partition Operations
| © Copyright 2015 Hitachi Consulting67
Programming with Spark
Per-Partition Operations
 Some operations need to be executed per partition as a whole, rather than per item in the RDD,
which is the normal behaviour of transformations like map() or filter()
 E.g., setting up a database connection, creating a random number generator, preparing a return object for
an aggregation over the RDD, etc.
 For all of the mentioned objects, we only need one per RDD partition, rather than per element.
mapPartitions()
mapPartitionsWithIndex()
foreachPartition()
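A hedged sketch of mapPartitions(): the setup cost (a simple parser here, purely illustrative; in practice e.g. a database connection) is paid once per partition instead of once per element:

def parsePartition(iterator):
    # expensive setup would go here, once per partition
    for line in iterator:
        yield line.split(",")

rdd = sc.parallelize(["A,1", "B,2", "C,3"], 2)
parsed = rdd.mapPartitions(parsePartition)
print(parsed.collect())    # [['A', '1'], ['B', '2'], ['C', '3']]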
| © Copyright 2015 Hitachi Consulting68
Programming with Spark
Per-Partition Operations
Example: foreach() vs. foreachPartition()
list = [1,2,3,4,5,6,7,8,9,10]
rdd = sc.parallelize(list)
counter = sc.accumulator(0)
def operation(element):
    global counter
    counter += 1
rdd.foreach(operation)
print "counter value is: " + repr(counter)
The counter returns 10, one for each item in the RDD

counter2 = sc.accumulator(0)
rdd = rdd.repartition(3)
print "number of partitions = " + repr(rdd.getNumPartitions())
def operation2(partition):
    global counter2
    counter2 += 1
rdd.foreachPartition(operation2)
print "counter2 value is: " + repr(counter2)
The counter returns 3, one for each RDD partition
| © Copyright 2015 Hitachi Consulting69
Spark SQL
| © Copyright 2015 Hitachi Consulting70
Spark SQL
DataFrames
 Distributed collection of data organized into named columns
 Conceptually equivalent to a table in a relational database or a data frame in R/Python, with richer Spark optimizations
 Can be constructed from RDDs, structured data files, Hive tables, or an external RDBMS
| © Copyright 2015 Hitachi Consulting71
Spark SQL
DataFrames
Creating a DataFrame from RDD of Rows
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
from pyspark.sql.types import *
conf = SparkConf().setAppName("My App").setMaster("local")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
row1 = Row(id = 1, Name = 'khalid')
row2 = Row(id = 2, Name = 'Zahra')
row3 = Row(id = 3, Name = 'Adel')
row4 = Row(id = 4, Name = 'Jassem')
rdd = sc.parallelize([row1,row2,row3,row4])
df = sqlContext.createDataFrame(rdd)
df.printSchema()
| © Copyright 2015 Hitachi Consulting72
Spark SQL
DataFrames
Creating a DataFrame from RDD – with a list of column names
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("My App").setMaster("local")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
rdd = sc.parallelize([("productA","01/01/2015",50),("productA","01/02/2015",100),("productB","01/01/2015",70)])
df = sqlContext.createDataFrame(rdd, ["product","date","value"])
df.printSchema()
If only the column names are
supplied, data types will be
inferred
| © Copyright 2015 Hitachi Consulting73
Spark SQL
DataFrames
Creating a DataFrame from RDD – with schema
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
conf = SparkConf().setAppName("My App").setMaster("local")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
rdd = sc.parallelize([("productA","01/01/2015",50),("productA","01/02/2015",100),("productB","01/01/2015",70)])
schema = StructType([StructField("Item",StringType(),True), StructField("Date",StringType(),True), StructField("Stock",LongType(),True)])
df = sqlContext.createDataFrame(rdd,schema)
df.printSchema()
Supplied schema
| © Copyright 2015 Hitachi Consulting74
Spark SQL
DataFrames
Show DataFrame content
df.show()
| © Copyright 2015 Hitachi Consulting75
Spark SQL
DataFrames
Creating a DataFrame – sqlContext.read
 sqlContext.read.json(inputFile)
 sqlContext.read.format('jdbc').options(jdbcConnectionString).load()
 sqlContext.read.parquet(inputFile)
 sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true').load(inputFile)
Saving a DataFrame
 df.write.json(outputFile)
 df.createJDBCTable(jdbcConnectionString, TableName, allowExisting = true)
 df.insertIntoJDBC(jdbcConnectionString, TableName, overwrite = false)
 df.write.parquet(outputFile)
 df.write.format("com.databricks.spark.csv").save("/data/home.csv")
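A small round-trip sketch using these readers and writers (the paths are illustrative, and the spark-csv package is assumed to be available for the CSV line):

df = sqlContext.read.json("data.json")            # infer the schema from JSON records
df.write.parquet("data_parquet")                  # columnar output (a directory of part files)
df2 = sqlContext.read.parquet("data_parquet")     # read it back
df3 = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferschema='true').load("data.csv")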
| © Copyright 2015 Hitachi Consulting76
Spark SQL
DataFrames
Creating a DataFrame from JSON
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("My App").setMaster("local")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.json("data.json")
df.printSchema()
df.show()
| © Copyright 2015 Hitachi Consulting77
Spark SQL
DataFrames
Creating a DataFrame from CSV
 Load csv data to RDD (using csv.DictReader()), then create a DataFrame from RDD
 Use csv loader com.databricks:spark-csv
C:\spark\spark-1.6.1-bin-hadoop2.4\bin>pyspark --packages com.databricks:spark-csv_2.11:1.4.0
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(inputFile)
| © Copyright 2015 Hitachi Consulting78
Spark SQL
Manipulating Data Frames
A DataFrame is a collection of spark.sql.Row objects
rows = df.collect()   # returns a local collection of Row objects
len(rows)             # 3
rows[0]               # Row(product='Product A', date='2015-01-01', value=50)
rows[0][1]            # '2015-01-01'
Reference a column in a DataFrame
 As an attribute of the DataFrame: df.product
 As a key: df["product"]
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
| © Copyright 2015 Hitachi Consulting79
Spark SQL
Manipulating Data Frames
Filtering DataFrame
df = df.filter(df.date == '2015-01-01')
df = df.filter(df["product"] == "Product B")
df = df.filter("value > 50")
OR
df = df.filter("value > 50 AND product = 'Product A'")
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Date Value
Product B 2015-01-01 70
| © Copyright 2015 Hitachi Consulting80
Spark SQL
Manipulating Data Frames
Selecting Columns (projection)
from pyspark.sql.functions import length
df = df.select(df.product, df.value)
df = df.select(df.product, df.value*10)
df = df.select(df.product, df.value*10, length(df.product) + df.value*10)
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Value
Product A 50
Product A 100
Product B 70
Product Value*10 length(product)+Value*10
Product A 500 508
Product A 1000 1008
Product B 700 708
Create new columns based on existing columns in the DataFrame
| © Copyright 2015 Hitachi Consulting81
Spark SQL
Manipulating Data Frames
Selecting Columns (projection)
df = df.select(df.product, df.value)
df = df.select(df.product, df.value*10)
df = df.select(df.product, df.value*10, (length(df.product) + df.value*10).alias("derived"))
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Value
Product A 50
Product A 100
Product B 70
Product Value*10 derived
Product A 500 508
Product A 1000 1008
Product B 700 708
Give alias to the new
column
| © Copyright 2015 Hitachi Consulting82
Spark SQL
Manipulating Data Frames
Order By
from pyspark.sql.functions import asc, desc
df = df.orderBy(df.value)
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Date Value
Product A 2015-01-01 50
Product B 2015-01-01 70
Product A 2015-01-02 100
| © Copyright 2015 Hitachi Consulting83
Spark SQL
Manipulating Data Frames
join
result = df1.join(df2, df1.product == df2.p, "inner").select(df1.product, df2.model, df1.value)
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Product Model
Product A X
Product B Y
Join condition
Join Type
Product Model Value
Product A X 50
Product A X 100
Product B Y 70
| © Copyright 2015 Hitachi Consulting84
Spark SQL
Manipulating Data Frames
groupBy
result = df.groupBy(df.product)
result.count().show()
result.sum("value").show()
result.max("value").show()
agg
import pyspark.sql.functions as F
…
result = df.groupBy(df.product).agg(df.product, F.sum("value"), F.min("value"))
Product Date Value
Product A 2015-01-01 50
Product A 2015-01-02 100
Product B 2015-01-01 70
Return
spark.sql.group.GroupedData
DataFrames
| © Copyright 2015 Hitachi Consulting85
Spark SQL
Manipulating Data Frames
It’s important to persist() or cache() your DataFrame after processing it via
filter(), join(), groupBy(), etc., so that these expensive operations are not
recomputed each time you perform a subsequent operation such as select()
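A short sketch of caching an aggregated DataFrame before querying it repeatedly (a sketch only, reusing the df from the slides above):

grouped = df.groupBy(df.product).sum("value")
grouped.cache()       # same as persist() with the default storage level
grouped.show()        # the first use computes and caches the result
grouped.count()       # later uses read from the cache
grouped.unpersist()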
| © Copyright 2015 Hitachi Consulting86
Spark SQL
Using Structured Query Language
df.registerTempTable("MyDataTable")
query = "SELECT product, SUM(value) Total FROM MyDataTable GROUP BY product"
result = sqlContext.sql(query)
result.show()   # sqlContext.sql() returns a DataFrame
| © Copyright 2015 Hitachi Consulting87
Spark SQL
Hive Integration
| © Copyright 2015 Hitachi Consulting88
Getting Started with Spark on
Azure HDInsight
| © Copyright 2015 Hitachi Consulting89
Spark on Azure HDInsight
Creating HDInsight Spark Cluster
| © Copyright 2015 Hitachi Consulting90
Spark on Azure HDInsight
Creating HDInsight Spark Cluster
| © Copyright 2015 Hitachi Consulting91
Spark on Azure HDInsight
Creating HDInsight Spark Cluster
| © Copyright 2015 Hitachi Consulting92
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight
| © Copyright 2015 Hitachi Consulting93
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight
| © Copyright 2015 Hitachi Consulting94
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight
| © Copyright 2015 Hitachi Consulting95
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Data Frames
| © Copyright 2015 Hitachi Consulting96
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Spark SQL
| © Copyright 2015 Hitachi Consulting97
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Spark SQL
| © Copyright 2015 Hitachi Consulting98
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Using Custom Library
Python Script
Uploaded to
ksmsdnspark/HdiSamples/HdiSamples/
WebsiteLogSampleData
| © Copyright 2015 Hitachi Consulting99
Spark on Azure HDInsight
Running Python Scripts on Spark HDInsight – Using External Package
| © Copyright 2015 Hitachi Consulting100
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Upload Adventure works data file extracts to the blob container
| © Copyright 2015 Hitachi Consulting101
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Process files with Spark and Save as hive table
| © Copyright 2015 Hitachi Consulting102
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Process files with Spark and Save as hive table
| © Copyright 2015 Hitachi Consulting103
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Processing Output is Saved and Partitioned in Hive
| © Copyright 2015 Hitachi Consulting104
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
Query Data in Spark SQL
| © Copyright 2015 Hitachi Consulting105
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
| © Copyright 2015 Hitachi Consulting106
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
| © Copyright 2015 Hitachi Consulting107
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
| © Copyright 2015 Hitachi Consulting108
Spark on Azure HDInsight
Microsoft Power BI and Spark SQL
| © Copyright 2015 Hitachi Consulting109
Spark on Azure HDInsight
Hive Integration
[Diagram: Spark SQL accessed via JDBC/ODBC, custom applications, or the Spark SQL shell, reading data from Hive, JSON, Parquet, etc., and serving clients such as Excel and Tableau]
 Spark SQL with Hive support allows us to access Hive tables,
UDFs (user-defined functions), SerDes (serialization and
deserialization formats), and the Hive query language (HiveQL)
 sqlContext (which is a HiveContext) is the entry point to access the
Hive metastore and functionality, where HiveQL is the
recommended query language.
 If Hive is installed, the hive-site.xml file must be copied to Spark’s
configuration directory ($SPARK_HOME/conf)
 If there is no Hive installation, Spark SQL will create its own Hive
metastore (metadata DB) in your program’s work directory,
called metastore_db.
 In addition, if you attempt to create tables using HiveQL’s
CREATE TABLE, they will be placed in the
/user/hive/warehouse directory on your default filesystem
(either your local filesystem, or HDFS).
| © Copyright 2015 Hitachi Consulting110
Spark on Azure HDInsight
Hive Integration
sqlContext.sql("CREATE TABLE MyTable (id int, name string, salary float)")
tables = sqlContext.sql("SHOW TABLES")
tables.show()
description = sqlContext.sql("DESCRIBE MyTable")
description.show()
sqlContext.sql("INSERT INTO MyTable SELECT * FROM (SELECT 1101 as id,
'Khalid Salama' as name, 70000 as salary) query")
result = sqlContext.sql("SELECT * FROM MyTable")
result.show()
| © Copyright 2015 Hitachi Consulting111
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Download and install Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883
 Add an ODBC Data Source
 Select Microsoft Spark ODBC Driver
| © Copyright 2015 Hitachi Consulting112
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Download and install Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883
 Create a Spark ODBC connection
 Configure Spark ODBC connection
| © Copyright 2015 Hitachi Consulting113
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Download and install Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883
 Create a Spark ODBC connection
 Configure Spark ODBC connection
 Test connection
| © Copyright 2015 Hitachi Consulting114
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Open Excel and go to Data, From Other Sources, Microsoft Query
| © Copyright 2015 Hitachi Consulting115
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Browse tables and select attributes to include
| © Copyright 2015 Hitachi Consulting116
Spark on Azure HDInsight
Connecting with Excel to Spark SQL
 Show data in Pivot Table
| © Copyright 2015 Hitachi Consulting117
Spark CLR (Mobius)
| © Copyright 2015 Hitachi Consulting118
Spark CLR
Installing Spark CLR
 Download Spark Mobius https://github.com/Microsoft/Mobius
 Unzip the content of the zip folder to C:\spark\spark-clr_2.10-1.6.100
 You should find inside this folder a “runtime” folder, which includes the (bin, lib, dependencies, scripts) folders
 Create a new Visual Studio project (Console App). Right-click the project, open NuGet, and install the following
packages one by one (you will then find them in packages.config)
<packages>
<package id="log4net" version="2.0.5" targetFramework="net452" />
<package id="Microsoft.SparkCLR" version="1.6.100" targetFramework="net452" />
<package id="Newtonsoft.Json" version="7.0.1" targetFramework="net452" />
<package id="Razorvine.Pyrolite" version="4.10.0.0" targetFramework="net452" />
<package id="Razorvine.Serpent" version="1.12.0.0" targetFramework="net452" />
</packages>
| © Copyright 2015 Hitachi Consulting119
Spark CLR
Running .NET Apps with Spark CLR
Writing Spark Processor Class
| © Copyright 2015 Hitachi Consulting120
Spark CLR
Running .NET Apps with Spark CLR
The main() function in the SparkAppDemo class calls processor.process()
 Build your project to produce SparkAppDemo.exe
 Go to C:\spark\spark-clr_2.10-1.6.100\runtime\scripts
 Run the following command
>sparkclr-submit --exe SparkAppDemo.exe
C:\spark\mywork\CSharp\SparkAppDemo
| © Copyright 2015 Hitachi Consulting121
Spark CLR
Running .NET Apps with Spark CLR
| © Copyright 2015 Hitachi Consulting122
How to Get Started with Spark
 Read the slides!
 Azure Spark HDInsight Documentation
https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-apache-spark-overview/
 Apache Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html
 Spark CLR (Mobius)
https://github.com/Microsoft/Mobius
 Introduction to Big Data Analytics (week 5) – Coursera Big Data Specialization
https://www.coursera.org/learn/bigdata-analytics/home/week/5
 Data Manipulation at Scale (week 4, lesson 20) – Coursera Data Science at Scale
https://www.coursera.org/learn/data-manipulation/home/week/4
 Data Science and Engineering with Apache Spark – edx 5 course track
https://www.edx.org/xseries/data-science-engineering-apache-spark
 O’Reilly Books – Learning Spark
| © Copyright 2015 Hitachi Consulting123
Appendix A: Spark Configurations
SparkConf()
 spark.app.name
 spark.master spark://host:<port> | mesos://host:<port> | yarn | local | local[<cores>]
 spark.ui.port
 spark.executor.memory
 spark.executor.cores
 spark.serializer
 spark.eventLog.enabled
 spark.eventLog.dir
spark-submit
--master
--deploy-mode client | cluster
--name
--files
--py-files
--executor-memory
--driver-memory
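A hedged sketch combining a few of these settings (the values are illustrative, not recommendations):

from pyspark import SparkContext, SparkConf

conf = (SparkConf()
        .setAppName("Config Demo")
        .setMaster("local[2]")                   # spark.master
        .set("spark.executor.memory", "2g")
        .set("spark.eventLog.enabled", "false"))
sc = SparkContext(conf=conf)
print(sc.getConf().toDebugString())              # inspect the effective configuration
sc.stop()

The equivalent settings can also be passed on the command line, e.g.
>spark-submit --master yarn --deploy-mode cluster --name "Config Demo" --executor-memory 2g myscript.py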
| © Copyright 2015 Hitachi Consulting124
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing , University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science , The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
– classification rules induction,
– decision trees construction,
– Bayesian classification modelling,
– data reduction,
– instance-based learning,
– evolving neural networks, and
– data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation,
Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio,
ECTA, IEEE WCCI and INNS-BigData.
ResearchGate.org
| © Copyright 2015 Hitachi Consulting125
Thank you!

Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Data science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionData science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief Introduction
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
The path to a Modern Data Architecture in Financial Services
The path to a Modern Data Architecture in Financial ServicesThe path to a Modern Data Architecture in Financial Services
The path to a Modern Data Architecture in Financial Services
 
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
 
tce
tcetce
tce
 
Сторінками юридичних періодичних видань.
 Сторінками юридичних періодичних видань. Сторінками юридичних періодичних видань.
Сторінками юридичних періодичних видань.
 
Missions Mobilization Principles @globalcast
Missions Mobilization Principles @globalcastMissions Mobilization Principles @globalcast
Missions Mobilization Principles @globalcast
 
BioSharing - ELIXIR All Hands, March 2017
BioSharing - ELIXIR All Hands, March 2017BioSharing - ELIXIR All Hands, March 2017
BioSharing - ELIXIR All Hands, March 2017
 
NeuString - Roaming Discount Agreements vs Spreadsheets e.1.1
NeuString - Roaming Discount Agreements vs Spreadsheets e.1.1NeuString - Roaming Discount Agreements vs Spreadsheets e.1.1
NeuString - Roaming Discount Agreements vs Spreadsheets e.1.1
 
XOHW17 - tReeSearch Project Presentation
XOHW17 - tReeSearch Project PresentationXOHW17 - tReeSearch Project Presentation
XOHW17 - tReeSearch Project Presentation
 
ICDS2 IARIA presentation M. Hartog
ICDS2 IARIA presentation M. HartogICDS2 IARIA presentation M. Hartog
ICDS2 IARIA presentation M. Hartog
 
Adult Learning Pyramid
Adult Learning PyramidAdult Learning Pyramid
Adult Learning Pyramid
 
Thank you 3.22.2017
Thank you 3.22.2017Thank you 3.22.2017
Thank you 3.22.2017
 
[PL] Code Europe 2016 - Python and Microsoft Azure
[PL] Code Europe 2016 - Python and Microsoft Azure[PL] Code Europe 2016 - Python and Microsoft Azure
[PL] Code Europe 2016 - Python and Microsoft Azure
 
Media Literacy: Evaluating the News We Consume
Media Literacy: Evaluating the News We ConsumeMedia Literacy: Evaluating the News We Consume
Media Literacy: Evaluating the News We Consume
 

Similar to Spark with HDInsight

HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)Durga Gadiraju
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your EyesDemi Ben-Ari
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Jyotasana Bharti
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
Hadoop on OpenStack
Hadoop on OpenStackHadoop on OpenStack
Hadoop on OpenStackSandeep Raju
 
Cloudera hadoop installation
Cloudera hadoop installationCloudera hadoop installation
Cloudera hadoop installationSumitra Pundlik
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Alex Zeltov
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_onSri Ambati
 
Sparkling Water
Sparkling WaterSparkling Water
Sparkling Waterh2oworld
 

Similar to Spark with HDInsight (20)

HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
Hadoop on OpenStack
Hadoop on OpenStackHadoop on OpenStack
Hadoop on OpenStack
 
Cloudera hadoop installation
Cloudera hadoop installationCloudera hadoop installation
Cloudera hadoop installation
 
Spark core
Spark coreSpark core
Spark core
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
 
Sparkling Water
Sparkling WaterSparkling Water
Sparkling Water
 

Recently uploaded

Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?RemarkSemacio
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...ThinkInnovation
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdfkhraisr
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numberssuginr1
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 

Recently uploaded (20)

Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 

Spark with HDInsight

  • 1. | © Copyright 2015 Hitachi Consulting1 Spark with Azure HDInsight Lighting-fast Big Data Processing Khalid M. Salama, Ph.D. Business Insights & Analytics Hitachi Consulting UK We Make it Happen. Better.
  • 2. | © Copyright 2015 Hitachi Consulting2 Outline  Spark and Big Data  Installing Spark  Spark Core Concepts  Programming with Spark  Spark SQL  Getting Started with Spark on HDInsight  Spark CLR (Mobius)  ETL and Automation  Useful Resources
  • 3. | © Copyright 2015 Hitachi Consulting3 Introducing Spark
  • 4. | © Copyright 2015 Hitachi Consulting4 What is Spark? The Lightening-fast Big Data Processing General-purpose Big Data Processing Integrates with HDFS Graph Processing Stream Processing Machine Learning Libraries In-memory (fast) Iterative Processing Interactive Query SQL Scala – Python – Java – R – .NET
  • 5. | © Copyright 2015 Hitachi Consulting5 Spark and Hadoop Ecosystem Spark and the zoo… Hadoop Distributed File System (HDFS) Applications In-Memory Stream SQL  Spark- SQL NoSQL Machine Learning …. Batch Yet Another Resource Negotiator (YARN) Search Orchest. MgmntAcquisition Named Node DataNode 1 DataNode 2 DataNode 3 DataNode N
  • 6. | © Copyright 2015 Hitachi Consulting6 Spark Components Spark and the zoo… Hadoop Distributed File System (HDFS) Spark …. Yet Another Resource Negotiator (YARN)Named Node DataNode 1 DataNode 2 DataNode 3 DataNode N Spark Core Engine (RDDs: Resilient Distributed Datasets) Spark SQL (structured data) Spark Streaming (real-time) Mlib (machine learning) GraphX (graph processing) Scala Java Python R .NET (Mobius)
  • 7. | © Copyright 2015 Hitachi Consulting7 Spark Components Spark Core Spark SQL Spark Streaming Spark MLib Spark GraphX Cluster Managers  Contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, etc.  Home to the API that defines resilient distributed datasets (RDDs)  Package for working with structured data (DataFrames). It allows querying data via SQL as well as the Apache Hive Supports many sources of data, including Hive tables, Parquet, and JSON.  Allows developers to intermix SQL queries with the programmatic data manipulations supported by  RDDs in a single application, thus combining SQL with complex analytics • Provides an API for manipulating data streams that closely matches the Spark Core’s RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real time.  Provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.  Provides some lower-level ML primitives, including a generic gradient descent optimization algorithm.  Provides graph manipulation operations and performing graph-parallel computations.  Allows creating a directed graph with arbitrary properties attached to each vertex and edge.  Provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).  Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple  cluster manager included in Spark itself called the Standalone Scheduler
  • 8. | © Copyright 2015 Hitachi Consulting8 Cluster Manager (Standalone/ YARN/ Mesos) What is Spark? The Lightning-fast Big Data processing Master Node Driver Program SparkContext Worker Node 1 Executor Task Worker Node 2 Executor Worker Node N Executor … Task Task Task Task Driver Program – Contains the main function, defines distributed datasets, and applies operations to them (e.g. the Spark Shell); submits tasks to the executor processes SparkContext – Connection to the computing cluster; creates distributed datasets. Initialized automatically, with default config, when using the Spark Shell
  • 9. | © Copyright 2015 Hitachi Consulting9 What is Spark? The Lightning-fast Big Data processing The user submits an application using spark-submit. spark-submit launches the driver program and invokes its main() method. The driver program contacts the cluster manager to ask for resources to launch executors. The cluster manager launches executors on behalf of the driver program. The driver sends RDD transformations and actions to the executors in the form of tasks. Tasks are run on executor processes to compute and save results. When the driver’s main() method exits or calls SparkContext.stop(), the executors are terminated and resources are released from the cluster manager. How Spark works on a cluster:
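To make the driver/executor roles above concrete, here is a minimal driver-program sketch (the file name and the trivial computation are illustrative assumptions, not part of the deck) that could be launched with spark-submit:

# my_app.py - a minimal, hypothetical driver program
from pyspark import SparkConf, SparkContext

# The driver defines the configuration and creates the SparkContext,
# which contacts the cluster manager and requests executors.
conf = SparkConf().setAppName("MyApp")   # the master is typically supplied by spark-submit
sc = SparkContext(conf=conf)

# Transformations are only recorded here; the count() action is what
# actually ships tasks to the executors.
numbers = sc.parallelize(range(1, 1001))
evens = numbers.filter(lambda n: n % 2 == 0)
print(evens.count())

# Stopping the context terminates the executors and releases cluster resources.
sc.stop()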
  • 10. | © Copyright 2015 Hitachi Consulting10 Installing Spark
  • 11. | © Copyright 2015 Hitachi Consulting11 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark
  • 12. | © Copyright 2015 Hitachi Consulting12 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark
  • 13. | © Copyright 2015 Hitachi Consulting13 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark  If there is no Hadoop, you need to install winutils.exe  Place winutils.exe in c:\hadoop\bin  Set the HADOOP_HOME environment variable to “c:\hadoop\bin” using the command prompt: >SETX HADOOP_HOME c:\hadoop\bin  Check that the variable has been set: >ECHO %HADOOP_HOME%  Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark
  • 14. | © Copyright 2015 Hitachi Consulting14 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark  If there is no Hadoop, you need to install winutils.exe  Place winutils.exe in c:\hadoop\bin  Set the HADOOP_HOME environment variable to “c:\hadoop\bin” using the command prompt: >SETX HADOOP_HOME c:\hadoop\bin  Check that the variable has been set: >ECHO %HADOOP_HOME%  Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark
  • 15. | © Copyright 2015 Hitachi Consulting15 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark  If there is no Hadoop, you need to install winutils.exe  Place winutils.exe in c:\hadoop\bin  Set the HADOOP_HOME environment variable to “c:\hadoop\bin” using the command prompt: >SETX HADOOP_HOME c:\hadoop\bin  Check that the variable has been set: >ECHO %HADOOP_HOME%  Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark  You can test the following statements  1+1  List = sc.parallelize([1,2,5])  List.count()  exit()
  • 16. | © Copyright 2015 Hitachi Consulting16 Installing Spark Windows Standalone Installation (no HDFS)  Install Java Development Kit (JDK) 7u85 or 8u60 (OpenJDK or Oracle JDK)  Set the JAVA_HOME environment variable to the installation path (usually “C:\Program Files\Java\jdk<version>”), using the command prompt: >SETX JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"  Check that the variable has been set: >ECHO %JAVA_HOME%  Install Spark 1.5.2 or 1.6.*  Unzip the content to c:\spark  If there is no Hadoop, you need to install winutils.exe  Place winutils.exe in c:\hadoop\bin  Set the HADOOP_HOME environment variable to “c:\hadoop\bin” using the command prompt: >SETX HADOOP_HOME c:\hadoop\bin  Check that the variable has been set: >ECHO %HADOOP_HOME%  Go to “C:\spark\spark-1.6.1-bin-hadoop2.4\bin” and run pyspark  You can test the following statements  1+1  List = sc.parallelize([1,2,5])  List.count()  exit()
  • 17. | © Copyright 2015 Hitachi Consulting17 Submitting a Python script to Spark Using spark-submit C:\spark\spark-1.6.1-bin-hadoop2.4\bin\spark-submit <scriptFilePath>
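As a quick illustration, a script along the following lines (the file names and paths are hypothetical) could be saved locally and submitted with the command above:

# C:\spark\mywork\count_lines.py - hypothetical script to submit
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("CountLines")
sc = SparkContext(conf=conf)

# Assumed input file; any local text file would do
lines = sc.textFile("C:/spark/mywork/data/log.txt")
print("number of lines: %d" % lines.count())

sc.stop()

# Submit it with:
# C:\spark\spark-1.6.1-bin-hadoop2.4\bin\spark-submit C:\spark\mywork\count_lines.py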
  • 18. | © Copyright 2015 Hitachi Consulting18 Spark Core Concepts
  • 19. | © Copyright 2015 Hitachi Consulting19 Spark Core Concepts Key/Value (Pair) RDDs Persisting & Removing RDDs Per-Partition Operations Accumulators & Broadcast Variables Resilient Distributed Datasets (RDDs) Transformations Actions
  • 20. | © Copyright 2015 Hitachi Consulting20 Spark Core Concepts Resilient Distributed Datasets  Distributed, Fault-tolerant, Immutable Collection of Memory Objects  Split into partitions to be processed on different nodes of the cluster.  Can contain any type of Python, Java, or Scala objects, including user-defined classes  Processed through Transformations and Actions
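A small sketch of these properties in the pyspark shell (sc is already created there; the data is made up): transformations return new RDDs rather than mutating the original, and the collection is split into partitions.

numbers = sc.parallelize(range(10), 4)   # ask for 4 partitions
print(numbers.getNumPartitions())        # 4

doubled = numbers.map(lambda n: n * 2)   # a new RDD; 'numbers' itself is unchanged
print(numbers.collect())                 # [0, 1, 2, ..., 9]
print(doubled.collect())                 # [0, 2, 4, ..., 18]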
  • 21. | © Copyright 2015 Hitachi Consulting21 Spark Core Concepts Resilient Distributed Datasets Creating an RDD Parallelizing an existing collection in the driver program Loading a data set from an external data store
  • 22. | © Copyright 2015 Hitachi Consulting22 Spark Core Concepts Resilient Distributed Datasets Creating an RDD  Parallelizing an existing collection in the driver program collection = [“Khalid”,”Magdy”, “Nagib”, “Salama”] rdd= sc.parallelize(collection)
  • 23. | © Copyright 2015 Hitachi Consulting23 Spark Core Concepts Resilient Distributed Datasets Creating an RDD  Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, etc. filePath = <“/directory/file.csv” | “/directory” | “directory/*.csv”> rdd = sc.textFile(filePath) rdd = sc.wholeTextFiles(directoryPath) Can read a file, all files in a folder, or files matching a wildcard. Returns a collection of lines Returns an RDD of (filename, content) pairs
  • 24. | © Copyright 2015 Hitachi Consulting24 Spark Core Concepts Resilient Distributed Datasets Creating an RDD  Loading json files import json … input = sc.textFile(“jsonfile.json”) data = input.map(lambda x: json.loads(x))
  • 25. | © Copyright 2015 Hitachi Consulting25 Spark Core Concepts Resilient Distributed Datasets Creating an RDD  Load CSV file import csv import StringIO .. def loadRecord(line): input = StringIO.StringIO(line) reader = csv.DictReader(input, fieldnames=["store", "date", "value"]) return reader.next() .. inputFile = "C:/spark/mywork/data/data.csv" input = sc.textFile(inputFile).map(loadRecord) input.collect()[0]
  • 26. | © Copyright 2015 Hitachi Consulting26 Spark Core Concepts Processing RDDs Transformations  Construct a new RDD based on the current one by manipulating the collection  Lazy Execution: only performed when an action is invoked.  The set of transformations is optimized prior to execution (the action) to load and process less data Actions  Compute a result based on an RDD  Return the results to the Driver Program, or save them to an external storage system  The RDD is recomputed (i.e., transformations are re-applied) each time an action is invoked  rdd.cache() or rdd.persist([option]) to reuse the computed rdd filter(), map(), flatMap() groupByKey(), cogroup(), reduceByKey(), sortByKey(), distinct(), sample(), union(), intersection(), join(), and more… reduce(), first(), take(), takeSample() count(), countByKey() collect(), saveAsTextFile(), foreach()
  • 27. | © Copyright 2015 Hitachi Consulting27 Spark Core Concepts Spark Program Spark Program in a nutshell: 1. Create an RDD by loading a dataset from an external file, using textFile() 2. Apply transformations to the RDD, like filter(), map(), join() 3. Call RDD.persist() to persist the computed RDD for reuse 4. Apply actions to the RDD, like count(), reduce(), collect() 5. Save the action results to external data storage using saveAsTextFile()
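A minimal end-to-end sketch of those five steps, assuming a plain text log file at a made-up path and a pyspark shell or driver program with sc available:

# 1. Create an RDD from an external file
lines = sc.textFile("C:/spark/mywork/data/log.txt")          # hypothetical path

# 2. Apply transformations
errors = lines.filter(lambda line: "error" in line.lower())
pairs = errors.map(lambda line: (line.split(" ")[0], 1))     # assumes the first token is a date

# 3. Persist the computed RDD for reuse
pairs.persist()

# 4. Apply actions
total = pairs.count()
per_day = pairs.reduceByKey(lambda a, b: a + b)

# 5. Save the results to external storage
per_day.saveAsTextFile("C:/spark/mywork/output/errors_per_day")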
  • 28. | © Copyright 2015 Hitachi Consulting28 Spark Transformations
  • 29. | © Copyright 2015 Hitachi Consulting29 Programming with Spark Transformations Filter – return a subset of the RDD based on some condition(s) lines = sc.textFile("log.txt") errors = lines.filter(lambda line: "error" in line or "warn" in line) errors = errors.filter(lambda line: len(line) > 10) numbers = sc.parallelize([1,2,3,4,5]) evenNumbers = numbers.filter(lambda n: n % 2==0 and n>3) def isPrime(n): for i in range(2,int(n**0.5)+1): if n % i==0: return False return True primeNumbers = numbers.filter(isPrime)
  • 30. | © Copyright 2015 Hitachi Consulting30 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) [1,2,3,4,5]
  • 31. | © Copyright 2015 Hitachi Consulting31 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) [1,2,3,4,5] [1,4,9,16,25]
  • 32. | © Copyright 2015 Hitachi Consulting32 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) This product is great The product I bought yesterday is so bad I am happy Very bad product, very bad [ This product is great, The product I bought yesterday is so bad, I am happy, Very bad product, very bad ]
  • 33. | © Copyright 2015 Hitachi Consulting33 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) linewords = lines.map(lambda line: line.split(“ “)) [ This product is great, The product I bought yesterday is so bad, I am happy, Very bad product, very bad ] [ [This, product, is, great], [The, product, I, bought, yesterday, is, so, bad], [I, am, happy], [Very, bad, product, very, bad] ]
  • 34. | © Copyright 2015 Hitachi Consulting34 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) linewords = lines.map(lambda line: line.split(“ “)) subset = lines.filter(lambda line: “product” in line) [ [This, product, is, great], [The, product, I, bought, yesterday, is, so, bad], [I, am, happy], [Very, bad, product, very, bad] ] [ [This, product, is, great], [The, product, I, bought, yesterday, is, so, bad], [Very, bad, product, very, bad] ]
  • 35. | © Copyright 2015 Hitachi Consulting35 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) linewords = lines.map(lambda line: line.split(“ “)) subset = lines.filter(lambda line: “product” in line) output = lines.map(lambda line: line.count(“bad”)) [ [This, product, is, great], [The, product, I, bought, yesterday, is, so, bad], [Very, bad, product, very, bad] ] [0,1,2]
  • 36. | © Copyright 2015 Hitachi Consulting36 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile(“data.txt”) linewords = lines.map(lambda line: line.split(“ “)) subset = lines.filter(lambda line: “product” in line) output = lines.map(lambda line: line.count(“bad”)) lines = sc.textFile(“data.txt”) 1, 2016/01/01, productA,456 2, 2016/01/01, productB,65 3, 2016/01/02, productA,104 [ 1, 2016/01/01, productA,456 2, 2016/01/01, productB,65 3, 2016/01/02, productA,104 ]
  • 37. | © Copyright 2015 Hitachi Consulting37 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile("data.txt") linewords = lines.map(lambda line: line.split(" ")) subset = lines.filter(lambda line: "product" in line) output = lines.map(lambda line: line.count("bad")) lines = sc.textFile("data.txt") records = lines.map(lambda line: Order.ParseLineToOrder(line)) [ 1, 2016/01/01, productA,456 2, 2016/01/01, productB,65 3, 2016/01/02, productA,104 ] [ Order(Id:1, date:2016/01/01, product:"productA", SalesValue:456) Order(Id:2, date:2016/01/01, product:"productB", SalesValue:65) Order(Id:3, date:2016/01/02, product:"productA", SalesValue:104) ]
  • 38. | © Copyright 2015 Hitachi Consulting38 Programming with Spark Transformations Map – return a new collection by applying a function on each element of the RDD list = sc.parallelize([1,2,3,4,5]) list_sqr = list.map(lambda n: n*n) lines = sc.textFile("data.txt") linewords = lines.map(lambda line: line.split(" ")) subset = lines.filter(lambda line: "product" in line) output = lines.map(lambda line: line.count("bad")) lines = sc.textFile("data.txt") records = lines.map(lambda line: Order.ParseLineToOrder(line)) filtered = records.filter(lambda order: order.SalesValue > 100) [ Order(Id:1, date:2016-01-01, product:"productA", SalesValue:456) Order(Id:2, date:2016-01-01, product:"productB", SalesValue:65) Order(Id:3, date:2016-01-02, product:"productA", SalesValue:104) ] [ Order(Id:1, date:2016-01-01, product:"productA", SalesValue:456) Order(Id:3, date:2016-01-02, product:"productA", SalesValue:104) ]
  • 39. | © Copyright 2015 Hitachi Consulting39 Programming with Spark Transformations FlatMap – if the map function returns a collection per each item in the RDD, flatMap will return a “flat” collection, rather than a collection of collections lines = sc.textFile("data.txt") words = lines.flatMap(lambda line: line.split(" ")) #word count example lines = sc.textFile("data.txt") words = lines.flatMap(lambda line: line.split(" ")) counts = words.map(lambda word: (word,1)) combined = counts.reduceByKey(lambda a,b: a+b) combined.saveAsTextFile("output.txt") This product is great The product I bought yesterday is so bad I am happy Very bad product, very bad [ This, product, is, great, The, product, I, bought, yesterday, is, so bad, I, am, happy, Very, bad, product, very, bad ]
  • 40. | © Copyright 2015 Hitachi Consulting40 Programming with Spark Transformations Union, intersection, subtract, distinct list1 = [1,2,3,4,5] list2 = [2,4,6,8,10] rdd1 = sc.parallelize(list1) rdd2 = sc.parallelize(list2) rdd3 = rdd1.union(rdd2) rdd4 = rdd1.intersection(rdd2) rdd5 = rdd1.subtract(rdd2) [1,2,3,4,5,2,4,6,8,10] (union keeps duplicates; apply distinct() to remove them) [2,4] [1,3,5]
  • 41. | © Copyright 2015 Hitachi Consulting41 Spark Actions
  • 42. | © Copyright 2015 Hitachi Consulting42 Programming with Spark Actions Reduce – operates on two elements in your RDD and returns a new element of the same type. numbers = sc.parallelize([1,2,3,4,5]) sum = numbers.reduce(lambda a,b: a+b) max = numbers.reduce(lambda a,b: a if a > b else b)
  • 43. | © Copyright 2015 Hitachi Consulting43 Programming with Spark Actions Reduce – operates on two elements in your RDD and returns a new element of the same type. numbers = sc.parallelize([1,2,3,4,5]) sum = numbers.reduce(lambda a,b: a+b) max = numbers.reduce(lambda a,b: a if a > b else b) words = sc.parallelize([“hello”,”my”,”name”,”is”,”Khalid”]) concatenated = words.reduce(lambda a,b: a+” “+b) [ “hello”, ”my”, ”name”, ”is” ,”Khalid” ] ‘Hello my name is Khalid’
  • 44. | © Copyright 2015 Hitachi Consulting44 Programming with Spark Actions Aggregate - operates on two elements in your RDD and returns a new element of any type. words = sc.parallelize(["hello","my","name","is","Khalid"]) number_of_letters = words.aggregate(0, (lambda acc, value: acc+len(value)), (lambda acc1,acc2: acc1+acc2)) alternatively, number_of_letters = words.map(lambda word: len(word)).reduce(lambda a,b: a+b) Return a tuple (sum/count) to calculate average: sumCount = nums.aggregate((0, 0), (lambda acc, value: (acc[0] + value, acc[1] + 1)), (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))) avg = sumCount[0] / float(sumCount[1]) Takes an initial value of the return type Here, the initial value is a tuple! 2 functions: 1) how to add an element to the accumulated value, 2) how to aggregate two accumulated values
  • 45. | © Copyright 2015 Hitachi Consulting45 Programming with Spark Actions collect – Returns the RDD as a collection (not an RDD anymore!). count – Returns the number of elements in the RDD. takeSample – Takes n random elements from the RDD. first – Takes the first element in the RDD. countByValue – Returns a map of each unique value to its count. foreach – Performs an operation on each element of the computed RDD. saveAsTextFile – Saves the content of the RDD to a text file. saveAsSequenceFile – Saves the content of the RDD to a Sequence file.
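A few of these actions in the pyspark shell, as a small illustration on made-up data (sample output will vary):

words = sc.parallelize(["spark", "hdfs", "spark", "yarn", "hive"])

print(words.count())               # 5
print(words.first())               # 'spark'
print(words.countByValue())        # a dict-like map of value -> count, e.g. 'spark' -> 2
print(words.takeSample(False, 2))  # 2 random elements, e.g. ['yarn', 'hive']

words.foreach(lambda w: None)      # runs a (side-effecting) function on every element
words.saveAsTextFile("C:/spark/mywork/output/words")   # hypothetical output folder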
  • 46. | © Copyright 2015 Hitachi Consulting46 Programming with Spark Actions Saving data as JSON import csv import StringIO import json .. def loadRecord(line): input = StringIO.StringIO(line) reader = csv.DictReader(input, fieldnames=["store", "date", "value"]) return reader.next() .. inputFile = "C:/spark/mywork/data/data.csv" input = sc.textFile(inputFile).map(loadRecord) #data transformation, e.g., map(), filter(), reduce, etc.. outputFile = "C:/spark/mywork/data/data.json" input.map(lambda element: json.dumps(element)).saveAsTextFile(outputFile)
  • 47. | © Copyright 2015 Hitachi Consulting47 Persisting RDDs
  • 48. | © Copyright 2015 Hitachi Consulting48 Programming with Spark RDD Persistence  Spark performs the transformations on an RDD in a lazy manner, only after an action is invoked  Spark re-computes the RDD each time an action is called on the RDD  This can be especially expensive for iterative algorithms, which look at the data many times rdd = numbers.filter(lambda a: a >10) rdd2 = rdd.map(lambda a: a*a) rdd2.count() rdd2.collect() rdd = numbers.filter(lambda a: a >10) rdd2 = rdd.map(lambda a: a*a) rdd2.cache() rdd2.count() rdd2.collect() Each action will cause the RDD to be recomputed (filter & map) This will compute and persist the RDD to perform several actions on it
  • 49. | © Copyright 2015 Hitachi Consulting49 Programming with Spark RDD Persistence rdd.persist(StorageLevel.<LEVEL>) rdd.cache() is the same as rdd.persist() with the default level (StorageLevel.MEMORY_ONLY) rdd.unpersist() to free up memory from unused RDDs
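For example (a minimal sketch; the storage level chosen here is just illustrative):

from pyspark import StorageLevel

squares = sc.parallelize(range(1, 1000000)).map(lambda n: n * n)

squares.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if it does not fit in memory
print(squares.count())    # the first action computes and materialises the RDD
print(squares.take(5))    # reuses the persisted data instead of recomputing it

squares.unpersist()       # free the memory/disk once the RDD is no longer needed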
  • 50. | © Copyright 2015 Hitachi Consulting50 Working with Pair RDDs
  • 51. | © Copyright 2015 Hitachi Consulting51 Programming with Spark Working with Key/Value Pairs  Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network.  Usually used to perform operations like join, group, sort, reduceByKey, etc. data = sc.textFile("data.txt") data = data.map(lambda line: line.split(",")) keyValueDataset = data.map(lambda elements: (elements[0],elements[1])) def parseLine(line): parts = line.split(",") record = MyRecord(parts[0],parts[1],parts[2]) return (record.Product,record) keyValueDataset = data.map(parseLine) A,234,01/01/2015 B,567,01/01/2015 A,157,01/01/2015 C,56,01/01/2015 B,345,01/01/2015 B,678,01/01/2015 [(A,234), (B,567), (A,157), (C,56), (B,345), (B,678)]
  • 52. | © Copyright 2015 Hitachi Consulting52 Programming with Spark Working with Key/Value Pairs - Transformations reduceByKey – aggregate values with the same key using a given function data = sc.textFile("data.txt") data = data.map(lambda line: line.split(",")) keyValueDataset = data.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset.reduceByKey(lambda a,b: a+b) A,2, B,5, A,1, C,5, B,3, B,6, [ (A,3), (B,14), (C,5) ]
  • 53. | © Copyright 2015 Hitachi Consulting53 Programming with Spark Working with Key/Value Pairs - Transformations groupByKey - group values with the same key in a collection data = sc.textFile("data.txt") data = data.map(lambda line: line.split(",")) keyValueDataset = data.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset.groupByKey() A,2, B,5, A,1, C,5, B,3, B,6, [ (A,[2,1]), (B,[5,3,6]), (C,[5]) ]
  • 54. | © Copyright 2015 Hitachi Consulting54 Programming with Spark Working with Key/Value Pairs - Transformations mapValues - apply a function on each value of the pair without changing the key data = sc.textFile("data.txt") data = data.map(lambda line: line.split(",")) keyValueDataset = data.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset.mapValues(lambda a: a*a) A,2, B,5, A,1, C,5, B,3, B,6, (A,4), (B,25), (A,1), (C,25), (B,9), (B,36)
  • 55. | © Copyright 2015 Hitachi Consulting55 Programming with Spark Working with Key/Value Pairs - Transformations join - Perform an inner join between two RDDs data1 = sc.textFile("data1.txt") data1 = data1.map(lambda line: line.split(",")) keyValueDataset1 = data1.map(lambda elements: (elements[0],elements[1])) data2 = sc.textFile("data2.txt") data2 = data2.map(lambda line: line.split(",")) keyValueDataset2 = data2.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset1.join(keyValueDataset2) A,2, B,5, A,1, C,5, B,3, B,6, [ (A,(2,22)), (A,(2,11)), (A,(1,22)), (A,(1,11)), (B,(5,55)), (B,(3,55)), (B,(6,55)) ] A,22, B,55, A,11
  • 56. | © Copyright 2015 Hitachi Consulting56 Programming with Spark Working with Key/Value Pairs - Transformations cogroup - Group data from both RDDs sharing the same key data1 = sc.textFile("data1.txt") data1 = data1.map(lambda line: line.split(",")) keyValueDataset1 = data1.map(lambda elements: (elements[0],elements[1])) data2 = sc.textFile("data2.txt") data2 = data2.map(lambda line: line.split(",")) keyValueDataset2 = data2.map(lambda elements: (elements[0],elements[1])) result = keyValueDataset1.cogroup(keyValueDataset2) result = result.mapValues(lambda t: list(t[0]) + list(t[1])) A,2, B,5, A,1, C,5, B,3, B,6, [ (A,([2,1],[22,11])), (B,([5,3,6],[55])), (C,([5],[])) ] A,22, B,55, A,11 [ (A,[2,1,22,11]), (B, [5,3,6,55]), (C,[5]) ]
  • 57. | © Copyright 2015 Hitachi Consulting57 Programming with Spark Working with Key/Value Pairs - Transformations zip - Pair an RDD with another RDD to produce a new Key/Value RDD, based on the element order of each RDD rdd1 = sc.parallelize(['A','B','C','D']) rdd2 = sc.parallelize([101,102,103,104]) pairs = rdd1.zip(rdd2) pairs.collect() [ A, B, C, D ] [ 101, 102, 103, 104 ] [ (A,101) (B,102) (C,103) (D,104) ]
  • 58. | © Copyright 2015 Hitachi Consulting58 Programming with Spark Working with Key/Value Pairs - Transformations zipWithIndex - Pair each element in the RDD with its index list = ['A','B','C','D'] rdd = sc.parallelize(list) rdd_indexed = rdd.zipWithIndex() rdd_indexed.collect() [ A, B, C, D ] [ (A, 0) (B, 1) (C, 2) (D, 3) ]
  • 59. | © Copyright 2015 Hitachi Consulting59 Programming with Spark Working with Key/Value Pairs - Transformations flatMapValues() – same as flatMap(), but with pair RDDs keys() values() sortByKey() leftOuterJoin() rightOuterJoin() subtractByKey() combineByKey() – same as aggregate(), but with pair RDDs
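combineByKey() is the most general of these; as an illustrative sketch on made-up data, a per-key average can be computed with it:

sales = sc.parallelize([("A", 2), ("B", 5), ("A", 1), ("C", 5), ("B", 3)])

sum_count = sales.combineByKey(
    lambda value: (value, 1),                           # createCombiner: first value seen for a key
    lambda acc, value: (acc[0] + value, acc[1] + 1),    # mergeValue: fold another value into the combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1]))            # mergeCombiners: merge combiners across partitions

averages = sum_count.mapValues(lambda p: p[0] / float(p[1]))
print(averages.collect())   # e.g. [('A', 1.5), ('B', 4.0), ('C', 5.0)]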
  • 60. | © Copyright 2015 Hitachi Consulting60 Programming with Spark Working with Key/Value Pairs - Transformations partitionBy  Hash Partitioning by key.  Elements with the same key will end up being processed on the same compute node, to reduce data shuffling  Useful with operations like cogroup(), groupWith(), join(), groupByKey(), reduceByKey(), combineByKey(), and lookup().  Usually used when data is loaded, then the RDD is persisted data1 = sc.textFile("data1.txt") data1 = data1.map(lambda line: line.split(",")) keyValueDataset1 = data1.map(lambda elements: (elements[0],elements[1])).partitionBy(10).persist() data2 = sc.textFile("data2.txt") keyValueDataset2 = data2.map(lambda line: (line.split(",")[0],line.split(",")[1])) joined = keyValueDataset1.join(keyValueDataset2)
  • 61. | © Copyright 2015 Hitachi Consulting61 Programming with Spark Working with Key/Value Pairs - Actions countByKey() collectAsMap() lookup(key) - Returns all values associated with the provided key. Any pair RDD action takes the number of reducers as an optional parameter
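For instance, on made-up data:

pairs = sc.parallelize([("A", 2), ("B", 5), ("A", 1), ("C", 5)])

print(pairs.countByKey())     # a dict-like map of key -> count: A -> 2, B -> 1, C -> 1
print(pairs.collectAsMap())   # {'A': 1, 'B': 5, 'C': 5}  (one value per key is kept)
print(pairs.lookup("A"))      # [2, 1]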
  • 62. | © Copyright 2015 Hitachi Consulting62 Accumulators and Broadcast Variables
  • 63. | © Copyright 2015 Hitachi Consulting63 Programming with Spark Accumulators  An RDD function in Spark (such as the one passed to map() or filter()) can use a variable defined in the driver program  However, each task running on the cluster gets a new copy of this variable.  Updates from these copies are not propagated back to the driver.  Accumulators provide a simple syntax for aggregating values from worker nodes back to the driver program.  A common use of accumulators is to count events that occur during job execution, maybe for debugging purposes.  In a worker task, accumulators are write-only variables. Only the driver program can retrieve the value of an accumulator
  • 64. | © Copyright 2015 Hitachi Consulting64 Programming with Spark Accumulators Accumulator Example file = sc.textFile(inputFile) blankLines = sc.accumulator(0) def extractRecords(line): global blankLines if line == "": blankLines += 1 return line.split(" ") records = file.flatMap(extractRecords) records.count() print "Blank lines: %d" % blankLines.value Define an accumulator of type INT with 0 as initial value – referenced by the blankLines variable Increment the accumulator through the blankLines variable Run an action so the transformation executes, then retrieve the accumulator value in the driver program
  • 65. | © Copyright 2015 Hitachi Consulting65 Programming with Spark Broadcast Variables  Allow keeping a read-only variable cached on each worker node, rather than shipping a copy of it with tasks  E.g., to give every node a copy of a large input dataset (reference data) in an efficient manner.  After the broadcast variable is created, it should be used instead of the original value in any functions run on the cluster, so that the variable is not shipped to the nodes more than once. Broadcast Example rdd = sc.textFile("ref_data.txt") ref_data = sc.broadcast(rdd.collect()) def processData(input, ref): … data = data.map(lambda a: processData(a, ref_data.value))
  • 66. | © Copyright 2015 Hitachi Consulting66 Per-Partition Operations
  • 67. | © Copyright 2015 Hitachi Consulting67 Programming with Spark Per-Partition Operations  Some operations need to be executed per partition as a whole, rather than per each item in the RDD, which is the normal behaviour of transformations like map() or filter()  E.g., setting up a database connection, creating a random number generator, preparing a return object for the aggregation to happen on the RDD, etc.  For all of the mentioned objects, we only need one per RDD partition, rather than per element. mapPartitions() mapPartitionsWithIndex() foreachPartition()
  • 68. | © Copyright 2015 Hitachi Consulting68 Programming with Spark Per-Partition Operations Example: foreach() vs. foreachPartition() list = [1,2,3,4,5,6,7,8,9,10] rdd = sc.parallelize(list) counter = sc.accumulator(0) def operation(element): global counter counter+=1 rdd.foreach(operation) print "counter value is:" + repr(counter.value) counter2 = sc.accumulator(0) rdd=rdd.repartition(3) print "number of partitions = " + repr(rdd.getNumPartitions()) def operation2(partition): global counter2 counter2+=1 rdd.foreachPartition(operation2) print "counter2 value is:" + repr(counter2.value) The counter returns 10, one for each item in the RDD The counter returns 3, one for each RDD partition
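A complementary sketch for mapPartitions(), which receives an iterator over a whole partition and must return an iterable; this is the natural place for per-partition setup (the "connection" below is a stand-in dictionary, not a real database client):

def process_partition(records):
    # Hypothetical expensive setup, done once per partition rather than once per element
    connection = {"open": True}                    # stand-in for e.g. a database connection
    results = [(r, connection["open"]) for r in records]
    connection["open"] = False                     # stand-in for closing the connection
    return iter(results)

rdd = sc.parallelize(range(10), 3)
print(rdd.mapPartitions(process_partition).count())   # 10 elements; the setup ran only 3 times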
  • 69. | © Copyright 2015 Hitachi Consulting69 Spark SQL
  • 70. | © Copyright 2015 Hitachi Consulting70 Spark SQL DataFrames Distributed collection of data organized into named columns Conceptually equivalent to a table in a relational database or a data frame in R/Python, with richer Spark optimizations RDDs Structured Data Files Hive Tables RDBMS
  • 71. | © Copyright 2015 Hitachi Consulting71 Spark SQL DataFrames Creating a DataFrame from RDD of Rows from pyspark import SparkContext, SparkConf from pyspark.sql import * from pyspark.sql.types import * conf = SparkConf().setAppName("My App").setMaster("local") sc = SparkContext(conf = conf) sqlContext = SQLContext(sc) row1 = Row(id = 1, Name = 'khalid') row2 = Row(id = 2, Name = 'Zahra') row3 = Row(id = 3, Name = 'Adel') row4 = Row(id = 4, Name = 'Jassem') rdd = sc.parallelize([row1,row2,row3,row4]) df = sqlContext.createDataFrame(rdd) df.printSchema()
  • 72. | © Copyright 2015 Hitachi Consulting72 Spark SQL DataFrames Creating a DataFrame from RDD – with a list of column names from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext conf = SparkConf().setAppName("My App").setMaster("local") sc = SparkContext(conf = conf) sqlContext = SQLContext(sc) rdd = sc.parallelize([("productA","01/01/2015",50),("productA","01/02/2015",100),("productB","01/01/2015",70)]) df = sqlContext.createDataFrame(rdd, ["product","date","value"]) df.printSchema() If only the column names are supplied, data types will be inferred
  • 73. | © Copyright 2015 Hitachi Consulting73 Spark SQL DataFrames Creating a DataFrame from RDD – with schema from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext from pyspark.sql.types import * conf = SparkConf().setAppName("My App").setMaster("local") sc = SparkContext(conf = conf) sqlContext = SQLContext(sc) rdd = sc.parallelize([("productA","01/01/2015",50),("productA","01/02/2015",100),("productB","01/01/2015",70)]) schema = StructType([StructField("Item",StringType(),True),StructField("Date",StringType(),True),StructField("Stock",LongType(),True)]) df = sqlContext.createDataFrame(rdd,schema) df.printSchema() Supplied schema
  • 74. | © Copyright 2015 Hitachi Consulting74 Spark SQL DataFrames Show DataFrame content df.show()
  • 75. | © Copyright 2015 Hitachi Consulting75 Spark SQL DataFrames Creating a DataFrame – sqlContext.read  sqlContext.read.json(inputFile)  sqlContext.read.format('jdbc').options(jdbcConnectionString).load()  sqlContext.read.parquet(inputFile)  sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(inputFile) Saving a DataFrame  df.write.json(outputFile)  df.createJDBCTable(jdbcConnectionString, TableName, allowExisting = true)  df.insertIntoJDBC(jdbcConnectionString, TableName, overwrite = false)  df.write.parquet(outputFile)  df.write.format("com.databricks.spark.csv").save("/data/home.csv")
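A short round-trip sketch with these readers and writers, assuming an existing sqlContext and that the illustrative paths "data.json" and "data_parquet" are readable/writable in your environment:

  # Read JSON into a DataFrame, write it out as Parquet, then read it back
  df = sqlContext.read.json("data.json")
  df.write.parquet("data_parquet")        # columnar output: a directory of part files

  df2 = sqlContext.read.parquet("data_parquet")
  df2.printSchema()                       # the schema survives the round trip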
  • 76. | © Copyright 2015 Hitachi Consulting76 Spark SQL DataFrames Creating a DataFrame from JSON from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext conf = SparkConf().setAppName("My App").setMaster("local") sc = SparkContext(conf = conf) sqlContext = SQLContext(sc) df = sqlContext.read.json("data.json") df.printSchema() df.show()
  • 77. | © Copyright 2015 Hitachi Consulting77 Spark SQL DataFrames Creating a DataFrame from CSV  Load the csv data into an RDD (e.g. using csv.DictReader()), then create a DataFrame from that RDD  Or use the csv loader com.databricks:spark-csv C:\spark\spark-1.6.1-bin-hadoop2.4\bin>pyspark --packages com.databricks:spark-csv_2.11:1.4.0 df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(inputFile)
  • 78. | © Copyright 2015 Hitachi Consulting78 Spark SQL Manipulating Data Frames A DataFrame is a collection of spark.sql.Row objects rows = df.collect() len(rows) rows[0] rows[0][1] Reference a column in a DataFrame  As an attribute of the DataFrame: df.product  As a key: df["product"] Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 collect() returns a Python list of Row objects Row(product='Product A', date='2015-01-01', value=50) rows[0][1] evaluates to '2015-01-01'
  • 79. | © Copyright 2015 Hitachi Consulting79 Spark SQL Manipulating Data Frames Filtering a DataFrame df = df.filter(df.date == '2015-01-01') df = df.filter(df["product"] == "Product B") df = df.filter("value > 50") OR df = df.filter("value > 50 AND product = 'Product A'") Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Date Value Product B 2015-01-01 70
  • 80. | © Copyright 2015 Hitachi Consulting80 Spark SQL Manipulating Data Frames Selecting Columns (projection) df = df.select(df.product, df.value) df = df.select(df.product, df.value*10) df = df.select(df.product, df.value*10, length(df.product) + df.value*10) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Value Product A 50 Product A 100 Product B 70 Product Value*10 length(product)+value*10 Product A 500 508 Product A 1000 1008 Product B 700 708 Create new columns from expressions over existing columns (length() comes from pyspark.sql.functions)
  • 81. | © Copyright 2015 Hitachi Consulting81 Spark SQL Manipulating Data Frames Selecting Columns (projection) df = df.select(df.product, df.value) df = df.select(df.product, df.value*10) df = df.select(df.product, df.value*10, (length(df.product) + df.value*10).alias("derived")) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Value Product A 50 Product A 100 Product B 70 Product Value*10 derived Product A 500 508 Product A 1000 1008 Product B 700 708 Give an alias to the new column
  • 82. | © Copyright 2015 Hitachi Consulting82 Spark SQL Manipulating Data Frames Order By from pyspark.sql.functions import asc, desc df = df.orderBy(df.value) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Date Value Product A 2015-01-01 50 Product B 2015-01-01 70 Product A 2015-01-02 100
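For descending order, either of the following should work (a small sketch against the same df):

  from pyspark.sql.functions import desc

  df_desc = df.orderBy(desc("value"))     # highest value first
  # equivalently: df.orderBy(df.value.desc())
  df_desc.show()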
  • 83. | © Copyright 2015 Hitachi Consulting83 Spark SQL Manipulating Data Frames join result = df1.join(df2, df1.product == df2.p, "inner").select(df1.product, df2.model, df1.value) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 Product Model Product A X Product B Y Join condition Join Type Product Model Value Product A X 50 Product A X 100 Product B Y 70
  • 84. | © Copyright 2015 Hitachi Consulting84 Spark SQL Manipulating Data Frames groupBy result = df.groupBy(df.product) result.count().show() result.sum("value").show() result.max("value").show() agg import pyspark.sql.functions as F … result = df.groupBy(df.product).agg(F.sum("value"), F.min("value")) Product Date Value Product A 2015-01-01 50 Product A 2015-01-02 100 Product B 2015-01-01 70 groupBy() returns a spark.sql.group.GroupedData object; count(), sum(), max() and agg() return DataFrames (the grouping column is included automatically)
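A small sketch of the same aggregation with readable column names via alias(), assuming the df shown above:

  import pyspark.sql.functions as F

  summary = (df.groupBy("product")
               .agg(F.sum("value").alias("total_value"),
                    F.min("value").alias("min_value"),
                    F.count("*").alias("row_count")))
  summary.show()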
  • 85. | © Copyright 2015 Hitachi Consulting85 Spark SQL Manipulating Data Frames It’s important to persist() or cache() a DataFrame after processing it via filter(), join(), groupBy(), etc., so that these expensive operations are not re-computed each time an action (e.g. show(), count(), collect()) is performed on it
  • 86. | © Copyright 2015 Hitachi Consulting86 Spark SQL Using Structured Query Language df.registerTempTable("MyDataTable") query = "SELECT product, SUM(value) Total FROM MyDataTable GROUP BY product" result = sqlContext.sql(query) result.show() sqlContext.sql() returns a DataFrame
  • 87. | © Copyright 2015 Hitachi Consulting87 Spark SQL Hive Integration
  • 88. | © Copyright 2015 Hitachi Consulting88 Getting Started with Spark on Azure HDInsight
  • 89. | © Copyright 2015 Hitachi Consulting89 Spark on Azure HDInsight Creating HDInsight Spark Cluster
  • 90. | © Copyright 2015 Hitachi Consulting90 Spark on Azure HDInsight Creating HDInsight Spark Cluster
  • 91. | © Copyright 2015 Hitachi Consulting91 Spark on Azure HDInsight Creating HDInsight Spark Cluster
  • 92. | © Copyright 2015 Hitachi Consulting92 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight
  • 93. | © Copyright 2015 Hitachi Consulting93 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight
  • 94. | © Copyright 2015 Hitachi Consulting94 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight
  • 95. | © Copyright 2015 Hitachi Consulting95 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Data Frames
  • 96. | © Copyright 2015 Hitachi Consulting96 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Spark SQL
  • 97. | © Copyright 2015 Hitachi Consulting97 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Spark SQL
  • 98. | © Copyright 2015 Hitachi Consulting98 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Using Custom Library Python Script Uploaded to ksmsdnspark/HdiSamples/HdiSamples/WebsiteLogSampleData
  • 99. | © Copyright 2015 Hitachi Consulting99 Spark on Azure HDInsight Running Python Scripts on Spark HDInsight – Using External Package
  • 100. | © Copyright 2015 Hitachi Consulting100 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Upload Adventure works data file extracts to the blob container
  • 101. | © Copyright 2015 Hitachi Consulting101 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Process files with Spark and Save as hive table
  • 102. | © Copyright 2015 Hitachi Consulting102 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Process files with Spark and Save as hive table
  • 103. | © Copyright 2015 Hitachi Consulting103 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Processing Output is Saved and Partitioned in Hive
  • 104. | © Copyright 2015 Hitachi Consulting104 Spark on Azure HDInsight Microsoft Power BI and Spark SQL Query Data in Spark SQL
  • 105. | © Copyright 2015 Hitachi Consulting105 Spark on Azure HDInsight Microsoft Power BI and Spark SQL
  • 106. | © Copyright 2015 Hitachi Consulting106 Spark on Azure HDInsight Microsoft Power BI and Spark SQL
  • 107. | © Copyright 2015 Hitachi Consulting107 Spark on Azure HDInsight Microsoft Power BI and Spark SQL
  • 108. | © Copyright 2015 Hitachi Consulting108 Spark on Azure HDInsight Microsoft Power BI and Spark SQL
  • 109. | © Copyright 2015 Hitachi Consulting109 Spark on Azure HDInsight Hive Integration Spark SQL JDBC/ODBC Custom App Spark SQL Shell … Hive JSON Parquet … Excel Tableau …  Spark SQL with Hive support allows us to access Hive tables, UDFs (user-defined functions), SerDes (serialization and deserialization formats), and the Hive query language (HiveQL)  sqlContext (which is a HiveContext on HDInsight) is the entry point for accessing the Hive metastore and functionality; HiveQL is the recommended query language in this case  If Hive is installed, the hive-site.xml file must be copied to Spark’s configuration directory ($SPARK_HOME/conf)  If there is no Hive installation, Spark SQL will create its own Hive metastore (metadata DB) in your program’s working directory, called metastore_db  In addition, if you attempt to create tables using HiveQL’s CREATE TABLE, they will be placed in the /user/hive/warehouse directory on your default filesystem (either your local filesystem or HDFS)
  • 110. | © Copyright 2015 Hitachi Consulting110 Spark on Azure HDInsight Hive Integration sqlContext.sql("CREATE TABLE MyTable (id int, name string, salary float)") tables = sqlContext.sql("SHOW TABLES") tables.show() description = sqlContext.sql("DESCRIBE MyTable") description.show() sqlContext.sql("INSERT INTO MyTable SELECT * FROM (SELECT 1101 as id, 'Khalid Salama' as name, 70000 as salary) query") result = sqlContext.sql("SELECT * FROM MyTable") result.show()
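A minimal sketch of saving a DataFrame as a partitioned Hive table, which is the pattern behind the "Save as hive table" slides above; sqlContext is assumed to be a HiveContext, df is assumed to have a product column, and the table name SalesByProduct is made up for illustration:

  # Write the DataFrame into the Hive warehouse, partitioned by product
  (df.write
     .mode("overwrite")
     .partitionBy("product")
     .saveAsTable("SalesByProduct"))

  # The table is then queryable from Spark SQL, and from ODBC clients such as Power BI or Excel
  sqlContext.sql("SELECT product, SUM(value) AS total FROM SalesByProduct GROUP BY product").show()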
  • 111. | © Copyright 2015 Hitachi Consulting111 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Download and install Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883  Add an ODBC Data Source  Select Microsoft Spark ODBC Driver
  • 112. | © Copyright 2015 Hitachi Consulting112 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Download and install the Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883  Create a Spark ODBC connection  Configure the Spark ODBC connection
  • 113. | © Copyright 2015 Hitachi Consulting113 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Download and install the Spark ODBC Driver https://www.microsoft.com/en-us/download/details.aspx?id=49883  Create a Spark ODBC connection  Configure the Spark ODBC connection  Test the connection
  • 114. | © Copyright 2015 Hitachi Consulting114 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Open Excel and go to Data, From Other Sources, Microsoft Query
  • 115. | © Copyright 2015 Hitachi Consulting115 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Browse tables and select attributes to include
  • 116. | © Copyright 2015 Hitachi Consulting116 Spark on Azure HDInsight Connecting with Excel to Spark SQL  Show data in Pivot Table
  • 117. | © Copyright 2015 Hitachi Consulting117 Spark CLR (Mobius)
  • 118. | © Copyright 2015 Hitachi Consulting118 Spark CLR Installing Spark CLR  Download Spark Mobius https://github.com/Microsoft/Mobius  Unzip the content of the zip file to C:\spark\spark-clr_2.10-1.6.100  Inside this folder you should find a “runtime” folder, which includes the bin, lib, dependencies and scripts folders  Create a new Visual Studio project (Console App). Right-click the project, choose Manage NuGet Packages, and install the following packages one by one (you will then find them in packages.config) <packages> <package id="log4net" version="2.0.5" targetFramework="net452" /> <package id="Microsoft.SparkCLR" version="1.6.100" targetFramework="net452" /> <package id="Newtonsoft.Json" version="7.0.1" targetFramework="net452" /> <package id="Razorvine.Pyrolite" version="4.10.0.0" targetFramework="net452" /> <package id="Razorvine.Serpent" version="1.12.0.0" targetFramework="net452" /> </packages>
  • 119. | © Copyright 2015 Hitachi Consulting119 Spark CLR Running .NET Apps with Spark CLR Writing Spark Processor Class
  • 120. | © Copyright 2015 Hitachi Consulting120 Spark CLR Running .NET Apps with Spark CLR main function, in the SparkAppDemo class, calls processor.process()  Build your project to produce SparkAppDemo.exe  Go to C:\spark\spark-clr_2.10-1.6.100\runtime\scripts  Run the following command >sparkclr-submit --exe SparkAppDemo.exe C:\spark\mywork\CSharp\SparkAppDemo
  • 121. | © Copyright 2015 Hitachi Consulting121 Spark CLR Running .NET Apps with Spark CLR
  • 122. | © Copyright 2015 Hitachi Consulting122 How to Get Started with Spark  Read the slides!  Azure Spark HDInsight Documentation https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-apache-spark-overview/  Apache Spark Programming Guide http://spark.apache.org/docs/latest/programming-guide.html  Spark CLR (Mobius) https://github.com/Microsoft/Mobius  Introduction to Big Data Analytics (week 5) – Coursera Big Data Specialization https://www.coursera.org/learn/bigdata-analytics/home/week/5  Data Manipulation at Scale (week 4, lesson 20) – Coursera Data Science at Scale https://www.coursera.org/learn/data-manipulation/home/week/4  Data Science and Engineering with Apache Spark – edX 5-course track https://www.edx.org/xseries/data-science-engineering-apache-spark  O’Reilly Books – Learning Spark
  • 123. | © Copyright 2015 Hitachi Consulting123 Appendix A: Spark Configurations SparkConf()  spark.app.name  spark.master spark://host:<port> | mesos://host:<port> | yarn | local | local[<cores>]  spark.ui.port  spark.executor.memory  spark.executor.cores  spark.serializer  spark.eventLog.enabled  spark.eventLog.dir spark-submit --master --deploy-mode client | cluster --name --files --py-files --executor-memory --driver-memory
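A small sketch of how these settings are typically applied, either in code through SparkConf or at submission time through spark-submit; the application name, memory sizes and script name below are illustrative only:

  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .setAppName("My App")
          .setMaster("local[2]")
          .set("spark.executor.memory", "2g")
          .set("spark.eventLog.enabled", "true"))
  sc = SparkContext(conf=conf)

  # Equivalent settings supplied at submit time:
  # spark-submit --master yarn --deploy-mode cluster --name "My App" \
  #   --executor-memory 2g --driver-memory 2g my_script.py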
  • 124. | © Copyright 2015 Hitachi Consulting124 My Background Applying Computational Intelligence in Data Mining • Honorary Research Fellow, School of Computing, University of Kent. • Ph.D. Computer Science, University of Kent, Canterbury, UK. • M.Sc. Computer Science, The American University in Cairo, Egypt. • 25+ published journal and conference papers, focusing on: – classification rules induction, – decision trees construction, – Bayesian classification modelling, – data reduction, – instance-based learning, – evolving neural networks, and – data clustering • Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing. • Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData. ResearchGate.org
  • 125. | © Copyright 2015 Hitachi Consulting125 Thank you!