Basic Usage Of Spark
Author(s): Nagavarunkumar Kolla
Table of Contents
1. Introduction to Data Analysis with Spark
1.1 What Is Apache Spark?
1.2 Important Notes about Spark
2. Spark Installation Steps
2.1 How to start scala, python & R
3. Technical Details
3.1 Transformations
a. cartesian (otherDataset)
b. partitions
c. cogroup (otherDataset, [numTasks])
d. distinct([numTasks])
e. filter(func) & filterWith(func)
f. sortBy() & sortByKey()
g. flatMap(func)
h. join (otherDataset, [numTasks])
i. sum(), max() & min()
j. pipe(command, [envVars])
3.2 Actions
a. collect() & collectAsMap()
b. count(), countByKey() & countByValue()
c. first()
d. foreach(func)
e. keys()
f. reduce(func)
g. saveAsTextFile(path)
h. stats()
i. take(n) & takeOrdered(n, [ordering])
j. top()
3.3 Miscellaneous
a. Zip & CombineByKey
b. toDebugString()
c. ++ Operator
d. values & variance
4. DataFrames
a. Defining a Function
b. Defining a Function with Aggregate
5. References and Links
1. Introduction to Data Analysis with Spark:
1.1 What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general purpose.
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL,
and rich built-in libraries.
1.2 Important Notes about Spark:
1. Spark provides two ways to create RDDs: loading an external dataset and parallelizing a
collection in your driver program.
2. The simplest way to create RDDs is to take an existing collection in your program and pass it
to SparkContext's parallelize() method.
3. An RDD (Resilient Distributed Dataset) supports two types of operations: transformations
and actions.
Transformations are operations on RDDs that return a new RDD, such as map() and filter().
Actions are operations that return a result to the driver program or write it to storage,
such as count() and first().
2. Spark Installation Steps
1. Install Java, Python, R
sudo apt-get install openjdk-7-jdk
sudo apt-get install python
sudo apt-get install r-base
2. Install Hadoop-2.6.0, Hive-1.2.1
3. Install scala-2.10.4
Download the file from below link
copy the tar file into work dir & extract it
update the ~/.bashrc file with SCALA_HOME path
4. Install spark-1.4.1
Download the file from below link
copy the tar file into work dir & extract it
update the ~/.bashrc file with SPARK_HOME path
5. Reopen the terminal so that the changes take effect
2.1 How to start scala, python & R
1. "scala" is the command to start the Scala shell
"python" is the command to start the Python shell
"R" is the command to start the R shell
Run each command to verify that all three are installed
2. spark-shell, pyspark, sparkR and spark-sql are the main commands for
working with Spark
Execute the "spark-shell" command to get a Scala prompt
Execute the "pyspark" command to get a Python prompt
Execute the "sparkR" command to get an R prompt
3. Technical Details
3.1 Transformations
All transformations in Spark are lazy, in that they do not compute their results right away.
The transformations are only computed when an action requires a result to be returned to the
driver program.
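The laziness described above can be sketched in plain Python, outside Spark, with generators standing in for transformations and list() standing in for an action. The names lazy_map, lazy_filter and log are illustrative only and are not part of any Spark API.

```python
# Plain-Python sketch of lazy transformations (not Spark code).
# Generators play the role of transformations; list() plays the role of an action.

log = []  # records every element that lazy_map actually processes


def lazy_map(func, items):
    """Yield func(x) for each x, only when a consumer asks for it."""
    for x in items:
        log.append(x)
        yield func(x)


def lazy_filter(pred, items):
    """Yield the items that satisfy pred, only when a consumer asks for them."""
    for x in items:
        if pred(x):
            yield x


# Building the pipeline does no work yet, so nothing has been logged.
pipeline = lazy_filter(lambda v: v % 2 == 0, lazy_map(lambda v: v * 3, range(1, 6)))
work_before_action = len(log)

# The "action" forces evaluation of the whole chain.
result = list(pipeline)
work_after_action = len(log)

print(work_before_action, work_after_action, result)
```

Nothing is computed when the pipeline is defined; all five elements are processed only once the result is requested.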
The following lists some of the common transformations supported by Spark
cartesian (otherDataset): When called on datasets of types T and U, returns a dataset of (T,
U) pairs (all pairs of elements).
val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6),
(3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
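As a rough model of what cartesian computes, the same set of pairs can be produced in plain Python with itertools.product. This is a sketch of the semantics only; Spark builds the pairs partition by partition, so the order of its output can differ from the order shown here.

```python
from itertools import product

# Same inputs as the Spark example above, held in ordinary lists.
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]

# Every element of x paired with every element of y.
pairs = list(product(x, y))

print(len(pairs))
print(pairs[:5])
```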
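partitions: An RDD (Resilient Distributed Dataset), the basic abstraction in Spark, represents an
immutable, partitioned collection of elements that can be operated on in parallel. partitions
returns the array of partitions of an RDD, so partitions.length gives the number of partitions.
coalesce(numPartitions, shuffle) returns a new RDD reduced to numPartitions partitions.
val y = sc.parallelize(1 to 10, 10)
y.partitions.length
res: Int = 10
val z = y.coalesce(2, false)
z.partitions.length
res: Int = 2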
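How a local collection ends up spread over partitions can be sketched in plain Python. The sketch assumes that parallelize cuts the collection into contiguous, near-equal slices, which is consistent with the saveAsObjectFile output later in this document (1 to 100 in 3 partitions appears as 1-33, 34-66 and 67-100). slice_collection is a hypothetical helper, not a Spark function.

```python
def slice_collection(items, num_slices):
    """Cut items into num_slices contiguous, near-equal slices (assumed split rule)."""
    items = list(items)
    n = len(items)
    return [items[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]


ten_parts = slice_collection(range(1, 11), 10)
three_parts = slice_collection(range(1, 101), 3)

print(len(ten_parts))
print([(part[0], part[-1]) for part in three_parts])
```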
cogroup (otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W),
returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called
groupWith.
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
val d = a.map((_, "d"))
b.cogroup(c).collect
res: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b,
b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))),
(2,(CompactBuffer(b),CompactBuffer(c))))
b.cogroup(c, d).collect
res: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b,
b),CompactBuffer(c, c),CompactBuffer(d, d))),
(3,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))),
(2,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))))
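The grouping that cogroup performs can be modelled in plain Python with a dictionary keyed on K. The cogroup function below is a hypothetical stand-in written for this sketch, not the Spark method, and it returns lists where Spark returns Iterables.

```python
from collections import defaultdict


def cogroup(*datasets):
    """Group values by key across datasets: {key: (values_ds0, values_ds1, ...)}."""
    grouped = defaultdict(lambda: tuple([] for _ in datasets))
    for position, dataset in enumerate(datasets):
        for key, value in dataset:
            grouped[key][position].append(value)
    return dict(grouped)


a = [1, 2, 1, 3]
b = [(k, "b") for k in a]
c = [(k, "c") for k in a]
d = [(k, "d") for k in a]

two_way = cogroup(b, c)
three_way = cogroup(b, c, d)

print(two_way)
print(three_way)
```

Key 1 occurs twice in the input, so each of its value lists holds two entries, while keys 2 and 3 hold one.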
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source
dataset.
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res: Array[String] = Array(Dog, Gnu, Cat, Rat)
Example 2:
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res: Int = 2
a.distinct(3).partitions.length
res: Int = 3
filter(func) & filterWith(func): filter returns a new dataset formed by selecting those elements of the
source on which func returns true. filterWith additionally passes the predicate a value derived from
the partition index; it is deprecated in Spark 1.x in favor of mapPartitionsWithIndex combined with filter.
Example 1:
val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res: Array[Int] = Array(2, 4, 6, 8, 10)
Example 2:
val a = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
a.filterWith(x=> x)((a, b) => b == 0).collect
res: Array[Int] = Array(1, 2)
a.filterWith(x=> x)((a, b) => a % (b+1) == 0).collect
res: Array[Int] = Array(1, 2, 4, 6, 8, 10)
a.filterWith(x=> x.toString)((a, b) => b == "2").collect
res: Array[Int] = Array(5, 6)
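The filterWith results above depend on which partition each element sits in. A plain-Python sketch makes this visible. It assumes the data is split into equal contiguous chunks (1 to 9 in 3 partitions, 1 to 10 in 5 partitions), and filter_with_index is a hypothetical helper, not a Spark function.

```python
def filter_with_index(partitions, pred):
    """Keep element x of partition number i when pred(x, i) is true."""
    return [x for i, part in enumerate(partitions) for x in part if pred(x, i)]


# Assumed partition layout for sc.parallelize(1 to 9, 3)
three = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Assumed partition layout for sc.parallelize(List(1..10), 5)
five = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]

even_value_or_even_partition = filter_with_index(three, lambda x, i: x % 2 == 0 or i % 2 == 0)
first_partition_only = filter_with_index(five, lambda x, i: i == 0)
divisible_by_index_plus_one = filter_with_index(five, lambda x, i: x % (i + 1) == 0)

print(even_value_or_even_partition)
print(first_partition_only)
print(divisible_by_index_plus_one)
```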
sortBy(func, [ascending], [numTasks]) & sortByKey([ascending], [numTasks]): sortBy returns the
dataset sorted by the key that func computes for each element. sortByKey, when called on a dataset of
(K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending
or descending order, as specified in the boolean ascending argument.
Example 1:
val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
y.sortBy(c => c, true).collect
res: Array[Int] = Array(1, 1, 2, 3, 5, 7)
y.sortBy(c => c, false).collect
res: Array[Int] = Array(7, 5, 3, 2, 1, 1)
val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)))
z.sortBy(c => c._1, true).collect
res: Array[(String, Int)] = Array((A,26), (H,10), (L,5),(Z,1))
z.sortBy(c => c._2, true).collect
res: Array[(String, Int)] = Array((Z,1), (L,5), (H,10),(A,26))
Example 2:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))
c.sortByKey(false).collect
res: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))
val a = sc.parallelize(1 to 100, 5)
val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
d.collect
res: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so
func should return a Seq rather than a single item).
val a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4,
5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10)
sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)
The program below generates a random number of copies (fewer than 10) of
each item in the list.
val x = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect
res: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7,
7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)
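The difference between map and flatMap can be sketched in plain Python: each input produces a sequence, and the sequences are concatenated into one flat result. flat_map is a hypothetical helper written for this sketch.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to every item and concatenate the resulting sequences."""
    return list(chain.from_iterable(func(x) for x in items))


counted = flat_map(lambda n: range(1, n + 1), [1, 2, 3])
tripled = flat_map(lambda x: [x, x, x], [1, 2, 3])
dropped = flat_map(lambda x: [], [1, 2, 3])  # zero outputs per input is allowed

print(counted)
print(tripled)
print(dropped)
```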
join (otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset
of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through
leftOuterJoin, rightOuterJoin, and fullOuterJoin.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
res: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,
(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,
(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,
(rat,gnu)), (3,(rat,bee)))
pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or
bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned
as an RDD of strings.
val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res: Array[String] = Array(1, 4, 7)
sum(), max() & min(): Although listed here with the transformations, these are actions: they return
the sum, the largest element and the smallest element of the RDD to the driver program.
Example 1:
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res: Double = 101.39999999999999
Example 2:
val y = sc.parallelize(10 to 30)
y.max
res: Int = 30
Example 3:
val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9,
"lion"), (18, "cat")))
a.min
res: (Int, String) = (3,tiger)
3.2 Actions
By default, each transformed RDD may be recomputed each time you run an action on
it. However, you may also persist an RDD in memory using the persist (or cache) method, in
which case Spark will keep the elements around on the cluster for much faster access the next
time you query it.
The following lists some of the common actions supported by Spark.
collect() & collectAsMap(): collect returns all the elements of the dataset as an array at the driver
program. This is usually useful after a filter or other operation that returns a sufficiently small
subset of the data. collectAsMap does the same for an RDD of (K, V) pairs but returns a map, so only
one value is kept per key.
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.collect
res: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)
Example 2:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collect
res: Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))
b.collectAsMap
res: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
count(): Return the number of elements in the dataset.
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res: Long = 4
countByKey() & countByValue(): countByKey is only available on RDDs of type (K, V) and returns a
map of (K, Long) pairs with the count of each key. countByValue returns a map with the count of each
distinct value in the RDD.
Example 1:
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
Example 2:
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 ->
2)
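The two counting actions can be checked against a plain-Python model built on collections.Counter. This is only a sketch of the semantics on local lists, not Spark code.

```python
from collections import Counter

# countByKey: count how often each key occurs in a list of (key, value) pairs.
pairs = [(3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")]
count_by_key = Counter(key for key, _ in pairs)

# countByValue: count how often each element occurs.
values = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1]
count_by_value = Counter(values)

print(dict(count_by_key))
print(dict(count_by_value))
```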
first(): Return the first element of the dataset (similar to take(1)).
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.first
res: String = Gnu
foreach(func): Run a function func on each element of the dataset. This is usually done for side
effects such as updating an accumulator variable or interacting with external storage systems.
val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin",
"spider"), 3)
c.foreach(x => println(x + "s are yummy"))
lions are yummy
gnus are yummy
crocodiles are yummy
ants are yummy
whales are yummy
dolphins are yummy
spiders are yummy
keys(): Extracts the keys from all the (K, V) tuples of the RDD and returns them in a new RDD.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat",
"panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
res: Array[Int] = Array(3, 5, 4, 3, 7, 5)
reduce(func): Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one). The function should be commutative and associative so that it can
be computed correctly in parallel.
val a = sc.parallelize(1 to 10, 3)
a.reduce(_ + _)
res: Int = 55
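Why the function passed to reduce should be commutative and associative can be shown with a plain-Python sketch that reduces each partition first and then combines the partial results. distributed_reduce is a hypothetical helper, and the partition layouts are assumptions chosen for illustration.

```python
from functools import reduce


def distributed_reduce(func, partitions):
    """Reduce every non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


def add(a, b):
    return a + b


def subtract(a, b):
    return a - b


one_partition = [list(range(1, 11))]
three_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

sum_one = distributed_reduce(add, one_partition)
sum_three = distributed_reduce(add, three_partitions)
diff_one = distributed_reduce(subtract, one_partition)
diff_three = distributed_reduce(subtract, three_partitions)

print(sum_one, sum_three)    # addition gives the same answer for both layouts
print(diff_one, diff_three)  # subtraction depends on the partitioning
```

Addition yields 55 however the data is partitioned, while subtraction, which is neither commutative nor associative, yields different answers for different layouts.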
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a
given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark
will call toString on each element to convert it to a line of text in the file.
import org.apache.hadoop.io.compress.GzipCodec
val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("mydata_b", classOf[GzipCodec])
val x = sc.textFile("mydata_b")
x.count
res: Long = 10000
val x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)
val sp = sc.textFile("hdfs://localhost:8020/user/hadoop/sp_data")
sp.flatMap(_.split(" ")).saveAsTextFile("hdfs://localhost:8020/user/hadoop/sp_x")
val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)
val x = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Int]("objFile")
y.collect
res: Array[Int] = Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33)
stats(): Returns a StatCounter holding the count, mean, standard deviation, maximum and minimum
of the elements of an RDD of numeric values.
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02,
19.29, 11.09, 21.0), 2)
x.stats
res: org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667, stdev: 8.126859, max:
21.000000, min: 1.000000)
take(n): Return an array with the first n elements of the dataset. Note that this is currently not
executed in parallel. Instead, the driver program computes all the elements.
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.take(2)
res: Array[String] = Array(dog, cat)
Example 2:
val b = sc.parallelize(1 to 10000, 5000)
b.take(100)
res: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100)
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural
order or a custom comparator.
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.takeOrdered(2)
res: Array[String] = Array(ape, cat)
top(num: Int): Returns the top num elements from this RDD as defined by the specified implicit
Ordering[T].
val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
res: Array[Int] = Array(9, 8)
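takeOrdered and top are mirror images: one returns the n smallest elements and the other the n largest under the chosen ordering. A plain-Python sketch with the standard heapq module reproduces the two results above; the key argument plays the role of a custom ordering.

```python
import heapq

words = ["dog", "cat", "ape", "salmon", "gnu"]
numbers = [6, 9, 4, 7, 5, 8]

# takeOrdered(2): the two smallest elements in natural (alphabetical) order.
smallest_two = heapq.nsmallest(2, words)

# top(2): the two largest elements in natural order.
largest_two = heapq.nlargest(2, numbers)

# A custom ordering: the single longest word.
longest = heapq.nlargest(1, words, key=len)

print(smallest_two, largest_two, longest)
```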
3.3 Miscellaneous
The following lists some other commonly used operations supported by Spark.
Zip: Zips this RDD with another one, returning key-value pairs with the first element in each
RDD, second element in each RDD, etc.
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
c.collect
res: Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2,salmon), (2,rabbit), (1,turkey),
(2,wolf), (2,bear), (2,bee))
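CombineByKey: Combines the values for each key using three functions: one that creates a combiner
from the first value seen for a key, one that merges a further value into an existing combiner, and
one that merges two combiners. This simplified version of combineByKey hash-partitions the
resulting RDD using the default parallelism level.
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) =>
x ::: y)
d.collect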
res: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee,
bear, wolf)))
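The roles of the three functions passed to combineByKey can be traced with a plain-Python sketch. It assumes the zipped data sits in three partitions of three pairs each and that partition results are merged in partition order, which Spark does not guarantee; combine_by_key is a hypothetical helper, not the Spark method.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Build one combiner per key inside each partition, then merge across partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        for key, combiner in local.items():
            merged[key] = merge_combiners(merged[key], combiner) if key in merged else combiner
    return merged


# Assumed partition layout of b.zip(a) from the example above.
partitions = [
    [(1, "dog"), (1, "cat"), (2, "gnu")],
    [(2, "salmon"), (2, "rabbit"), (1, "turkey")],
    [(2, "wolf"), (2, "bear"), (2, "bee")],
]

result = combine_by_key(
    partitions,
    lambda v: [v],            # List(_)
    lambda acc, v: [v] + acc, # y :: x, prepend the new value
    lambda x, y: x + y,       # x ::: y, concatenate two lists
)

print(result)
```

Because new values are prepended within a partition, "cat" comes before "dog" in the list for key 1, as in the Spark output.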
toDebugString(): Returns a description of the RDD and its recursive dependencies (its lineage),
which is useful for debugging. The tree is indented at shuffle boundaries, so the output shows
where shuffles occur in the lineage graph.
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res156: String =
(3) MapPartitionsRDD[146] at subtract at <console>:26 []
| SubtractedRDD[145] at subtract at <console>:26 []
+-(3) MapPartitionsRDD[143] at subtract at <console>:26 []
| | ParallelCollectionRDD[141] at parallelize at <console>:22 []
+-(3) MapPartitionsRDD[144] at subtract at <console>:26 []
| ParallelCollectionRDD[142] at parallelize at <console>:22 []
++ Operator: a ++ b returns the union of the two RDDs, the same as a.union(b). Duplicate elements are kept.
Example 1:
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res: Array[Int] = Array(1, 2, 3, 5, 6, 7)
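values & variance: values returns an RDD with the value of each (K, V) tuple. variance returns the
population variance of an RDD of numeric values, and sampleVariance returns the sample variance,
which divides by n - 1 instead of n.
Example 1:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat",
"panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
res: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)
Example 2:
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.variance
res: Double = 10.605333333333332
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.variance
res: Double = 66.04584444444443
x.sampleVariance
res: Double = 74.30157499999999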
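The last two results differ only in the divisor. A plain-Python check on the same nine numbers reproduces both values, along with the mean and standard deviation reported by stats() earlier. This is a sketch of the arithmetic, not Spark code.

```python
import math

x = [1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0]
n = len(x)

mean = sum(x) / n
squared_deviations = sum((v - mean) ** 2 for v in x)

population_variance = squared_deviations / n     # divide by n
sample_variance = squared_deviations / (n - 1)   # divide by n - 1
stdev = math.sqrt(population_variance)           # population standard deviation

print(round(mean, 6), round(stdev, 6))
print(round(population_variance, 6), round(sample_variance, 6))
```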
4. DataFrames
A DataFrame is a distributed collection of data organized into named columns. The
DataFrame API is available in Scala, Java, Python, and R.
Examples For Scala:
command:- spark-shell
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
// sc is the existing SparkContext that spark-shell provides.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-...")
// Show the content of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
// This import is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema with a case class; you can also use custom classes that implement the Product interface.
Step 1:
case class Person(name: String, age: Int)
Step 2:
val people = sc.textFile("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
Step 3:
val results = sqlContext.sql("SELECT * FROM people")
Step 4:
results.map(t => "Name: " + t(0)).collect().foreach(println)
Example For Python:
command:- pyspark
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-...")
# Displays the content of the DataFrame to stdout
df.show()
# Print the schema in a tree format
df.printSchema()
# Select only the "name" column
df.select("name").show()
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# Select people older than 21
df.filter(df['age'] > 21).show()
# Count people by age
df.groupBy("age").count().show()
# Infer the schema from an RDD of Row objects
Step 1:
lines = sc.textFile("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.txt")
people = lines.map(lambda l: l.split(",")).map(lambda p: Row(name=p[0], age=int(p[1])))
schemaPeople = sqlContext.createDataFrame(people)
Step 2:
schemaPeople.registerTempTable("people")
Step 3:
peoples = sqlContext.sql("SELECT * FROM people")
for person in peoples.collect(): print(person)
Step 4:
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
Examples for R:
command:- sparkR
sqlContext <- sparkRSQL.init(sc)
df <- jsonFile(sqlContext, "file path")
# Displays the content of the DataFrame to stdout
showDF(df)
# Print the schema in a tree format
printSchema(df)
# Select only the "name" column
showDF(select(df, "name"))
# Select everybody, but increment the age by 1
showDF(select(df, df$name, df$age + 1))
# Select people older than 21
showDF(where(df, df$age > 21))
# Count people by age
showDF(count(groupBy(df, "age")))
Examples For Hive:
One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or
HiveQL. Spark SQL can also be used to read data from an existing Hive installation.
command:- spark-shell
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE DATABASE IF NOT EXISTS varun")
sqlContext.sql("CREATE TABLE IF NOT EXISTS varun.src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/kv1.txt' INTO TABLE varun.src")
sqlContext.sql("FROM varun.src SELECT key, value").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src where key > 100").collect().foreach(println)
sqlContext.sql("SELECT key, count(key) FROM varun.src group by key").collect().foreach(println)
4.1 Defining a Function: A user-defined function operates on distributed data and works row by
row (unless you're creating a user-defined aggregation function). In the example below, myfunc
labels every element of a pair RDD with the index of its partition, and aggregateByKey then
aggregates the values of each key.
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12),
("mouse", 2)), 2)
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)],
[partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
// Within each partition the values of a key are reduced with max, starting from the zero value;
// the per-partition results of each key are then added.
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
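The two aggregateByKey results can be traced by hand with a plain-Python sketch that uses the partition contents printed above. aggregate_by_key is a hypothetical helper written for this sketch; it applies the zero value once per key per partition, which is what makes the second result jump to 200.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Fold values per key inside each partition (starting at zero), then combine partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq_op(local.get(key, zero), value)
        for key, acc in local.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged


# Partition contents as shown by the myfunc output above.
partitions = [
    [("cat", 2), ("cat", 5), ("mouse", 4)],
    [("cat", 12), ("dog", 12), ("mouse", 2)],
]

with_zero_0 = aggregate_by_key(partitions, 0, max, lambda a, b: a + b)
with_zero_100 = aggregate_by_key(partitions, 100, max, lambda a, b: a + b)

print(with_zero_0)
print(with_zero_100)
```

With a zero value of 100 every per-partition maximum becomes 100, so keys present in both partitions sum to 200 and "dog", present in one, stays at 100.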
4.2 Defining a Function with Aggregate:
val nums = sc.parallelize(List(1,2,3,4,5,6), 2)
val chars = sc.parallelize(List("a","b","c","d","e","f"),2)
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
nums.mapPartitionsWithIndex(myfunc).collect
nums.aggregate(1)(math.max(_, _), _ + _)
// reduce of partition 0 will be max(1, 1, 2, 3) = 3
// reduce of partition 1 will be max(1, 4, 5, 6) = 6
// final reduce across partitions will be 1 + 3 + 6 = 10
nums.aggregate(5)(math.max(_, _), _ + _)
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 5 + 6 = 16
chars.aggregate("")(_ + _, _+_)
res: String = defabc
chars.aggregate("x")(_ + _, _+_)
res: String = xxabcxdef
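The arithmetic in the comments above can be replayed with a plain-Python sketch of aggregate. It assumes two partitions of three elements each and merges the partial results in partition order; Spark may merge them in a different order, which is why the string example can come out as defabc instead of abcdef. aggregate here is a hypothetical helper, not the Spark method.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition starting from zero, then fold the partial results from zero again."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


def add(a, b):
    return a + b


nums = [[1, 2, 3], [4, 5, 6]]
chars = [["a", "b", "c"], ["d", "e", "f"]]

max_then_sum_1 = aggregate(nums, 1, max, add)  # 1 + 3 + 6
max_then_sum_5 = aggregate(nums, 5, max, add)  # 5 + 5 + 6
concat_empty = aggregate(chars, "", add, add)
concat_x = aggregate(chars, "x", add, add)     # zero value is applied three times

print(max_then_sum_1, max_then_sum_5, concat_empty, concat_x)
```

The zero value is used once per partition and once more in the final merge, so a non-neutral zero such as 5 or "x" shows up in the result.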
5. References and Links:

More Related Content

What's hot

Postgres performance for humans
Postgres performance for humansPostgres performance for humans
Postgres performance for humansCraig Kerstiens
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Oracle 10g Performance: chapter 00 sampling
Oracle 10g Performance: chapter 00 samplingOracle 10g Performance: chapter 00 sampling
Oracle 10g Performance: chapter 00 samplingKyle Hailey
Advanced pg_stat_statements: Filtering, Regression Testing & more
Advanced pg_stat_statements: Filtering, Regression Testing & moreAdvanced pg_stat_statements: Filtering, Regression Testing & more
Advanced pg_stat_statements: Filtering, Regression Testing & more
Lukas Fittl
Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016
Comsysto Reply GmbH
How to teach an elephant to rock'n'roll
How to teach an elephant to rock'n'rollHow to teach an elephant to rock'n'roll
How to teach an elephant to rock'n'roll
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
Abhik Seal
An Introduction to RxJava
An Introduction to RxJavaAn Introduction to RxJava
An Introduction to RxJava
K. Matthew Dupree
The elements of a functional mindset
The elements of a functional mindsetThe elements of a functional mindset
The elements of a functional mindset
Eric Normand
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
Full Text Search in PostgreSQL
Full Text Search in PostgreSQLFull Text Search in PostgreSQL
Full Text Search in PostgreSQL
Aleksander Alekseev
Moving 12c database from NON-ASM to ASM
Moving 12c database from NON-ASM to ASMMoving 12c database from NON-ASM to ASM
Moving 12c database from NON-ASM to ASM
Monowar Mukul
Data handling in r
Data handling in rData handling in r
Data handling in r
Abhik Seal
Basic Query Tuning Primer - Pg West 2009
Basic Query Tuning Primer - Pg West 2009Basic Query Tuning Primer - Pg West 2009
Basic Query Tuning Primer - Pg West 2009
Apache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationApache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customization
Bartosz Konieczny
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡
Bartosz Konieczny
PostgreSQL: Advanced indexing
PostgreSQL: Advanced indexingPostgreSQL: Advanced indexing
PostgreSQL: Advanced indexing
Hans-Jürgen Schönig
Using Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasetsUsing Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasets
Bartosz Konieczny
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization

What's hot (19)

Postgres performance for humans
Postgres performance for humansPostgres performance for humans
Postgres performance for humans
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Oracle 10g Performance: chapter 00 sampling
Oracle 10g Performance: chapter 00 samplingOracle 10g Performance: chapter 00 sampling
Oracle 10g Performance: chapter 00 sampling
Advanced pg_stat_statements: Filtering, Regression Testing & more
Advanced pg_stat_statements: Filtering, Regression Testing & moreAdvanced pg_stat_statements: Filtering, Regression Testing & more
Advanced pg_stat_statements: Filtering, Regression Testing & more
Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016
How to teach an elephant to rock'n'roll
How to teach an elephant to rock'n'rollHow to teach an elephant to rock'n'roll
How to teach an elephant to rock'n'roll
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
An Introduction to RxJava
An Introduction to RxJavaAn Introduction to RxJava
An Introduction to RxJava
The elements of a functional mindset
The elements of a functional mindsetThe elements of a functional mindset
The elements of a functional mindset
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
Full Text Search in PostgreSQL
Full Text Search in PostgreSQLFull Text Search in PostgreSQL
Full Text Search in PostgreSQL
Moving 12c database from NON-ASM to ASM
Moving 12c database from NON-ASM to ASMMoving 12c database from NON-ASM to ASM
Moving 12c database from NON-ASM to ASM
Data handling in r
Data handling in rData handling in r
Data handling in r
Basic Query Tuning Primer - Pg West 2009
Basic Query Tuning Primer - Pg West 2009Basic Query Tuning Primer - Pg West 2009
Basic Query Tuning Primer - Pg West 2009
Apache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationApache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customization
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡
PostgreSQL: Advanced indexing
PostgreSQL: Advanced indexingPostgreSQL: Advanced indexing
PostgreSQL: Advanced indexing
Using Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasetsUsing Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasets
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization

Similar to Spark_Documentation_Template1

Spark workshop
Spark workshopSpark workshop
Spark workshop
Wojciech Pituła
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
Futures e abstração - QCon São Paulo 2015
Futures e abstração - QCon São Paulo 2015Futures e abstração - QCon São Paulo 2015
Futures e abstração - QCon São Paulo 2015
Leonardo Borges
Transducers in JavaScript
Transducers in JavaScriptTransducers in JavaScript
Transducers in JavaScript
Pavel Forkert
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
Albert Bifet
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science
Chucheng Hsieh
Vlad Ifrim
Useful javascript
Useful javascriptUseful javascript
Useful javascriptLei Kang
Python 101 language features and functional programming
Python 101 language features and functional programmingPython 101 language features and functional programming
Python 101 language features and functional programming
Lukasz Dynowski
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
1.2 Important Notes about Spark:
1. Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection in your driver program.
2. The simplest way to create an RDD is to take an existing collection in your program and pass it to SparkContext's parallelize() method.
3. An RDD (Resilient Distributed Dataset) supports two types of operations: transformations and actions.
Transformations are operations on RDDs that return a new RDD, such as map() and filter().
Actions are operations that return a result to the driver program or write it to storage, such as count() and first().
2. Spark Installation Steps
1. Install Java, Python and R:
sudo apt-get install openjdk-7-jdk
sudo apt-get install python
sudo apt-get install r-base
2. Install Hadoop-2.6.0 and Hive-1.2.1.
3. Install scala-2.10.4:
Download the tar file, copy it into the work directory and extract it.
Update the ~/.bashrc file with the SCALA_HOME path.
4. Install spark-1.4.1:
Download the tar file, copy it into the work directory and extract it.
Update the ~/.bashrc file with the SPARK_HOME path.
5. Reopen the terminal so that the changes take effect.
2.1 How to start scala, python & R
1. "scala" is the command that starts the Scala shell.
"python" is the command that starts the Python shell.
"R" is the command that starts the R shell.
Use these commands to verify that all three are installed.
2. spark-shell, pyspark, sparkR and spark-sql are the main commands for working with Spark:
Executing "spark-shell" opens the Scala prompt.
Executing "pyspark" opens the Python prompt.
Executing "sparkR" opens the R prompt.
3. Technical Details
3.1 Transformations
All transformations in Spark are lazy, in that they do not compute their results right away. The transformations are only computed when an action requires a result to be returned to the driver program. The following lists some of the common transformations supported by Spark.
cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
Example:
val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
partitions: An RDD is an immutable, partitioned collection of elements that can be operated on in parallel. The partitions member returns the array of partitions of the RDD, so partitions.length gives the number of partitions; coalesce reduces that number.
Example:
val y = sc.parallelize(1 to 10, 10)
y.partitions.length
res: Int = 10
val z = y.coalesce(2, false)
z.partitions.length
res: Int = 2
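The lazy evaluation described at the start of section 3.1 can be illustrated without a cluster. The following is a minimal plain-Scala sketch, not Spark code: it assumes that a standard-library Iterator is an acceptable stand-in for an RDD, with map playing the role of a transformation and toList the role of an action. The names LazySketch and run are hypothetical.

```scala
object LazySketch {
  // Returns (calls before consuming, calls after consuming, computed values).
  def run(): (Int, Int, List[Int]) = {
    var calls = 0
    val data = List(1, 2, 3, 4, 5)
    // Like a transformation: mapping an Iterator only records what to do.
    val doubled = data.iterator.map { x => calls += 1; x * 2 }
    val callsBefore = calls
    // Like an action: materializing the result forces the computation.
    val result = doubled.toList
    (callsBefore, calls, result)
  }

  def main(args: Array[String]): Unit =
    println(run()) // prints (0,5,List(2, 4, 6, 8, 10))
}
```

The mapping function has not run at all before the result is materialized, which mirrors how Spark defers the work of a transformation until an action such as collect or count is called.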
cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
Example:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
val d = a.map((_, "d"))
b.cogroup(c).collect
res: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))), (2,(CompactBuffer(b),CompactBuffer(c))))
b.cogroup(c, d).collect
res: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c),CompactBuffer(d, d))), (3,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))), (2,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))))
distinct([numTasks]): Returns a new dataset that contains the distinct elements of the source dataset.
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res: Array[String] = Array(Dog, Gnu, Cat, Rat)
Example 2:
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res: Int = 2
a.distinct(3).partitions.length
res: Int = 3
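To make the shape of the cogroup result above concrete, here is a minimal plain-Scala sketch that emulates it on local lists. It is an illustration under assumptions, not Spark's implementation: CogroupSketch and its cogroup helper are hypothetical names, and ordinary List and Map stand in for RDDs and CompactBuffer.

```scala
object CogroupSketch {
  // For every key seen in either input, collect the values from the left
  // input and the values from the right input into two separate lists.
  def cogroup[K, V, W](left: List[(K, V)], right: List[(K, W)]): Map[K, (List[V], List[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftValues = left.filter(_._1 == k).map(_._2)
      val rightValues = right.filter(_._1 == k).map(_._2)
      (k, (leftValues, rightValues))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val a = List(1, 2, 1, 3)
    val b = a.map((_, "b"))
    val c = a.map((_, "c"))
    // Key 1 occurs twice in each input, so it gets two values on each side.
    println(cogroup(b, c))
  }
}
```

Each key appears once in the output, paired with all of its values from each input, which is the same grouping that the CompactBuffer pairs in the Spark output show.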
filter(func) & filterWith(func): Return a new dataset formed by selecting those elements of the source on which func returns true. filterWith takes two functions: the first turns the partition index into a value, and the second is the predicate, which receives each element together with that value.
Example 1:
val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res: Array[Int] = Array(2, 4, 6, 8, 10)
Example 2:
val a = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
a.filterWith(x => x)((a, b) => b == 0).collect
res: Array[Int] = Array(1, 2)

a.filterWith(x => x)((a, b) => a % (b+1) == 0).collect
res: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x => x.toString)((a, b) => b == "2").collect
res: Array[Int] = Array(5, 6)
sortBy() & sortByKey([ascending], [numTasks]): sortBy sorts any RDD by the value that the given function computes for each element. sortByKey, when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
Example 1:
val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
y.sortBy(c => c, true).collect
res: Array[Int] = Array(1, 1, 2, 3, 5, 7)

y.sortBy(c => c, false).collect
res: Array[Int] = Array(7, 5, 3, 2, 1, 1)

val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)))
z.sortBy(c => c._1, true).collect
res: Array[(String, Int)] = Array((A,26), (H,10), (L,5), (Z,1))

z.sortBy(c => c._2, true).collect
res: Array[(String, Int)] = Array((Z,1), (L,5), (H,10), (A,26))
Example 2:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))

c.sortByKey(false).collect
res: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))

val a = sc.parallelize(1 to 100, 5)
val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
d.collect
res: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
Example:
val a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

The program below generates a random number of copies (fewer than 10) of each item in the list, so the output differs from run to run.
val x = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect
res: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)
join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
Example:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
res: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
pipe(command, [envVars]): Pipes each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin, and the lines it writes to stdout are returned as an RDD of strings.
Example:
val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res: Array[String] = Array(1, 4, 7)
sum(), max() & min(): These return the sum, the largest and the smallest element of an RDD. Unlike the other entries in this section, they are actions: they return a value to the driver rather than a new RDD. A sum of doubles can show floating-point rounding, as in Example 1.
Example 1:
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res: Double = 101.39999999999999
Example 2:
val y = sc.parallelize(10 to 30)
y.max
res: Int = 30
Example 3:
val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (18, "cat")))
a.min
res: (Int, String) = (3,tiger)
3.2 Actions
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. The following lists some of the common actions supported by Spark.
collect() & collectAsMap(): collect returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. collectAsMap does the same for an RDD of key-value pairs but returns a Map, so each key appears only once.
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.collect
res: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)
Example 2:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collect
res: Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))
b.collectAsMap
res: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
count(): Returns the number of elements in the dataset.
Example:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res: Long = 4
countByKey() & countByValue(): countByKey is only available on RDDs of type (K, V) and returns a map of (K, Long) pairs with the count of each key. countByValue works on any RDD and returns a map from each distinct element to the number of times it occurs.
Example 1:
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
Example 2:
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 2)
first(): Returns the first element of the dataset (similar to take(1)).
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.first
res: String = Gnu
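The difference between countByKey and countByValue in the examples above can be sketched on local collections. This is a plain-Scala illustration under assumptions, not Spark's implementation: CountSketch and its two helpers are hypothetical names, and a Seq stands in for the RDD.

```scala
object CountSketch {
  // Counts how often each key occurs, ignoring the values.
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (k, group) => (k, group.size.toLong) }

  // Counts how often each whole element occurs.
  def countByValue[T](xs: Seq[T]): Map[T, Long] =
    xs.groupBy(identity).map { case (v, group) => (v, group.size.toLong) }

  def main(args: Array[String]): Unit = {
    println(countByKey(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog"))))
    println(countByValue(List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1)))
  }
}
```

Both helpers return their counts as Long, matching the Map[Int,Long] result types shown in the Spark output.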
foreach(func): Runs a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable or interacting with external storage systems.
Example:
val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3)
c.foreach(x => println(x + "s are yummy"))
res:
lions are yummy
gnus are yummy
crocodiles are yummy
ants are yummy
whales are yummy
dolphins are yummy
spiders are yummy
keys(): Available on RDDs of key-value pairs. Returns an RDD that contains the key of each pair.
Example:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
res: Array[Int] = Array(3, 5, 4, 3, 7, 5)
reduce(func): Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
Example:
val a = sc.parallelize(1 to 10, 3)
a.reduce(_ + _)
res: Int = 55
saveAsTextFile(path): Writes the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
Example:
val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("mydata_a")

import org.apache.hadoop.io.compress.GzipCodec
a.saveAsTextFile("mydata_b", classOf[GzipCodec])
val x = sc.textFile("mydata_b")
x.count
res: Long = 10000

val x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)
x.saveAsTextFile("hdfs://localhost:8020/user/hadoop/test")

val sp = sc.textFile("hdfs://localhost:8020/user/hadoop/sp_data")
sp.flatMap(_.split(" ")).saveAsTextFile("hdfs://localhost:8020/user/hadoop/sp_x")

The related methods saveAsSequenceFile and saveAsObjectFile write Hadoop sequence files and serialized objects instead of text:
val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)
v.saveAsSequenceFile("hd_seq_file")

val x = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Int]("objFile")
y.collect
res: Array[Int] = Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33)
stats(): Available on RDDs of doubles. Returns a StatCounter holding the count, mean, standard deviation, maximum and minimum of the elements.
Example:
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.stats
res: org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667, stdev: 8.126859, max: 21.000000, min: 1.000000)
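The figures reported by stats above can be checked by hand. The sketch below recomputes them in plain Scala, without Spark, on the assumption that the reported stdev is the population standard deviation (dividing by n rather than n - 1), which is what the printed value matches. StatsSketch and its member names are hypothetical.

```scala
object StatsSketch {
  val xs = List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0)

  val count = xs.size
  val mean = xs.sum / count
  // Sum of squared deviations from the mean.
  val sumSq = xs.map(x => (x - mean) * (x - mean)).sum
  // Population variance divides by n; sample variance divides by n - 1.
  val popVariance = sumSq / count
  val sampleVariance = sumSq / (count - 1)
  val stdev = math.sqrt(popVariance)

  def main(args: Array[String]): Unit =
    println(f"count: $count, mean: $mean%.6f, stdev: $stdev%.6f, max: ${xs.max}%.6f, min: ${xs.min}%.6f")
}
```

Dividing the same sum of squares by n - 1 instead gives the sample variance, which is larger than the population variance for any dataset with more than one distinct value.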
take(n): Returns an array with the first n elements of the dataset. Note that this is currently not executed in parallel. Instead, the driver program computes all the elements.
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.take(2)
res: Array[String] = Array(dog, cat)
Example 2:
val b = sc.parallelize(1 to 10000, 5000)
b.take(100)
res: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
takeOrdered(n, [ordering]): Returns the first n elements of the RDD using either their natural order or a custom comparator.
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.takeOrdered(2)
res: Array[String] = Array(ape, cat)
top(num: Int): Returns the num largest elements of this RDD, as defined by the specified implicit Ordering[T].
Example:
val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
res: Array[Int] = Array(9, 8)
3.3 Miscellaneous
The following lists some other commonly used operations supported by Spark.
Zip: Zips this RDD with another one, returning key-value pairs made of the first element of each RDD, the second element of each RDD, and so on.
Example:
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
c.collect
res: Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2,salmon), (2,rabbit), (1,turkey), (2,wolf), (2,bear), (2,bee))
CombineByKey: Combines the values of each key using three functions: one that creates a combiner from the first value seen for a key, one that merges a further value into a combiner, and one that merges two combiners. The version used here is the simplified one that hash-partitions the resulting RDD using the default parallelism level.
Example:
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
d.collect
res: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
toDebugString(): Returns a description of an RDD and its lineage, that is, the chain of RDDs it was derived from. The indentation of the tree shows where shuffle boundaries occur in the lineage graph.
Example:
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res156: String =
(3) MapPartitionsRDD[146] at subtract at <console>:26 []
 |  SubtractedRDD[145] at subtract at <console>:26 []
 +-(3) MapPartitionsRDD[143] at subtract at <console>:26 []
 |  |  ParallelCollectionRDD[141] at parallelize at <console>:22 []
 +-(3) MapPartitionsRDD[144] at subtract at <console>:26 []
    |  ParallelCollectionRDD[142] at parallelize at <console>:22 []
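Returning to the combineByKey example above, the roles of its three functions can be sketched in plain Scala, without Spark. This is an emulation under assumptions, not Spark's implementation: CombineByKeySketch is a hypothetical name, nested lists stand in for the three partitions, and partitions are merged in order, whereas Spark does not guarantee the order in which combiners are merged.

```scala
object CombineByKeySketch {
  def combineByKey[K, V, C](partitions: List[List[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: inside each partition, build one combiner per key.
    val perPartition: List[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case Some(c) => acc.updated(k, mergeValue(c, v))
          case None    => acc.updated(k, createCombiner(v))
        }
      }
    }
    // Step 2: merge the per-partition combiners key by key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, partial) =>
      partial.foldLeft(acc) { case (merged, (k, c)) =>
        merged.get(k) match {
          case Some(prev) => merged.updated(k, mergeCombiners(prev, c))
          case None       => merged.updated(k, c)
        }
      }
    }
  }

  val animals = List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee")
  val keys = List(1, 1, 2, 2, 2, 1, 2, 2, 2)
  // Three partitions of three pairs each, as in the zip example.
  val partitions: List[List[(Int, String)]] = keys.zip(animals).grouped(3).toList

  val result: Map[Int, List[String]] = combineByKey(partitions)(
    (v: String) => List(v),
    (x: List[String], y: String) => y :: x,
    (x: List[String], y: List[String]) => x ::: y)

  def main(args: Array[String]): Unit =
    println(result)
}
```

Because mergeValue prepends within a partition while mergeCombiners appends across partitions, the values of one partition come out reversed while the partitions themselves stay in order, which explains the ordering of the lists in the Spark output.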
++ Operator: Returns the union of two RDDs. Duplicate elements are kept.
Example 1:
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res: Array[Int] = Array(1, 2, 3, 5, 6, 7)
values & variance: values returns an RDD with the value of each key-value pair. variance returns the population variance of an RDD of doubles, and sampleVariance returns the sample variance.
Example 1:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
res: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)
Example 2:
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.variance
res: Double = 10.605333333333332

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.variance
res: Double = 66.04584444444443
x.sampleVariance
res: Double = 74.30157499999999
4. DataFrames
A DataFrame is a distributed collection of data organized into named columns. The DataFrame API is available in Scala, Java, Python, and R.
Examples For Scala:
Command: spark-shell
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
// sc is an existing SparkContext (spark-shell provides it).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.json")
// Show the content of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
// This import is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Case classes, or any custom classes that implement the Product interface, can define the schema.
Step 1:
case class Person(name: String, age: Int)
Step 2:
val people = sc.textFile("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
Step 3:
val results = sqlContext.sql("SELECT * FROM people")
Step 4:
results.map(t => "Name: " + t(0)).collect().foreach(println)
Example For Python:
Command: pyspark
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# Print the schema in a tree format
df.printSchema()
# Select only the "name" column
df.select("name").show()
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# Select people older than 21
df.filter(df['age'] > 21).show()
# Count people by age
df.groupBy("age").count().show()
# In the steps below, people is an RDD of rows prepared beforehand; its construction is not shown here.
Step 1:
schemaPeople = sqlContext.createDataFrame(people)
Step 2:
schemaPeople.registerTempTable("people")
Step 3:
peoples = sqlContext.sql("SELECT * FROM people")
for people in peoples.collect():
    print people
Step 4:
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
Examples for R:
Command: sparkR
sqlContext <- SQLContext(sc)
df <- jsonFile(sqlContext, "file path")
# Displays the content of the DataFrame to stdout
showDF(df)
# Print the schema in a tree format
printSchema(df)
# Select only the "name" column
showDF(select(df, "name"))
# Select everybody, but increment the age by 1
showDF(select(df, df$name, df$age + 1))
# Select people older than 21
showDF(where(df, df$age > 21))
# Count people by age
showDF(count(groupBy(df, "age")))
Examples For Hive:
One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. Spark SQL can also be used to read data from an existing Hive installation.
Command: spark-shell
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE DATABASE IF NOT EXISTS varun")
sqlContext.sql("CREATE TABLE IF NOT EXISTS varun.src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/kv1.txt' INTO TABLE varun.src")
sqlContext.sql("FROM varun.src SELECT key, value").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src where key > 100").collect().foreach(println)
sqlContext.sql("SELECT key, count(key) FROM varun.src group by key").collect().foreach(println)
4.1 Defining a Function:
A user-defined function operates on distributed data and works row by row (unless you are creating a user-defined aggregation function). The example below applies such a function to each partition of a pair RDD and then aggregates the values by key.
Example:
val pairRDD = sc.parallelize(List(("cat",2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)), 2)
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
4.2 Defining a Function with Aggregate:
Example:
val nums = sc.parallelize(List(1,2,3,4,5,6), 2)
val chars = sc.parallelize(List("a","b","c","d","e","f"), 2)
// Iterator[Any] lets the same function be applied to both nums and chars.
def myfunc(index: Int, iter: Iterator[Any]) : Iterator[String] = {
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
nums.foreach(println)
chars.foreach(println)
nums.mapPartitionsWithIndex(myfunc).collect
chars.mapPartitionsWithIndex(myfunc).collect

nums.aggregate(1)(math.max(_, _), _ + _)
The reduce of partition 0 will be max(1, 1, 2, 3) = 3.
The reduce of partition 1 will be max(1, 4, 5, 6) = 6.
The final reduce across partitions will be 1 + 3 + 6 = 10.

nums.aggregate(5)(math.max(_, _), _ + _)
The reduce of partition 0 will be max(5, 1, 2, 3) = 5.
The reduce of partition 1 will be max(5, 4, 5, 6) = 6.
The final reduce across partitions will be 5 + 5 + 6 = 16.

chars.aggregate("")(_ + _, _+_)
res: String = defabc
chars.aggregate("x")(_ + _, _+_)
res: String = xxabcxdef
5. References and Links: