Basic Usage Of Spark
Author(s): Nagavarunkumar Kolla
Table of Contents
1. Introduction to Data Analysis with Spark
1.1 What Is Apache Spark?
1.2 Important Notes about Spark
2. Spark Installation Steps
2.1 How to start scala, python & R
3. Technical Details
3.1 Transformations
a. cartesian (otherDataset)
b. partitions()
c. cogroup (otherDataset, [numTasks])
d. distinct([numTasks])
e. filter(func) & filterWith(func)
f. sortBy() & sortByKey()
g. flatMap(func)
h. join (otherDataset, [numTasks])
i. sum(), max() & min()
j. pipe(command, [envVars])
3.2 Actions
a. collect() & collectAsMap()
b. count(), countByKey() & countByValue()
c. first()
d. foreach(func)
e. keys()
f. reduce(func)
g. saveAsTextFile(path)
h. stats()
i. take(n) & takeOrdered(n, [ordering])
j. top()
3.3 Miscellaneous
a. Zip & CombineByKey
b. toDebugString()
c. ++ Operator
d. values & variance
4. DataFrames
a. Defining a Function
b. Defining a Function with Aggregate
5. References and Links
1. Introduction to Data Analysis with Spark:
1.1 What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general purpose. Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, along with rich built-in libraries.
1.2 Important Notes about Spark:
1. Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection in your driver program.
2. The simplest way to create an RDD is to take an existing collection in your program and pass it to SparkContext's parallelize() method.
3. An RDD (Resilient Distributed Dataset) supports two types of operations: transformations and actions.
Transformations are operations on RDDs that return a new RDD, such as map() and filter().
Actions are operations that return a result to the driver program or write it to storage, such as count() and first().
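For example, a minimal sketch in Scala (run in the Spark shell, where sc is the SparkContext; the file name input.txt is hypothetical):
val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))   // create an RDD from a collection
val fromFile = sc.textFile("input.txt")                    // create an RDD from an external file
val evens = fromCollection.filter(_ % 2 == 0)              // transformation: returns a new RDD
evens.count                                                // action: returns 2
evens.first                                                // action: returns 2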
2. Spark Installation Steps
1. Install Java, Python, R
sudo apt-get install openjdk-7-jdk
sudo apt-get install python
sudo apt-get install r-base
2. Install Hadoop-2.6.0, Hive-1.2.1
3. Install scala-2.10.4
Download the file from the link below:
http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
Copy the tar file into your working directory and extract it.
Update the ~/.bashrc file with the SCALA_HOME path.
4. Install spark-1.4.1
Download the file from the link below:
https://spark.apache.org/downloads.html
Copy the tar file into your working directory and extract it.
Update the ~/.bashrc file with the SPARK_HOME path.
5. Reopen the terminal so that the changes take effect.
2.1 How to start scala, python & R
1. "scala" is the command to start the "scala"
"python" is the command to start the "python"
"R" is the command to start the "R"
Verify the all are installed or not
2. spark-shell, pyspark, sparkR, spark-sql these are the main commands to
work with spark
execute "spark-shell" command, it will move to "scala" prompt
execute "pyspark" command, it will move to "python" prompt
execute "sparkR" command, it will move to "R" prompt
3. Technical Details
3.1 Transformations
All transformations in Spark are lazy, in that they do not compute their results right away.
The transformations are only computed when an action requires a result to be returned to the
driver program.
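For instance (a small sketch):
val nums = sc.parallelize(1 to 100000)
val doubled = nums.map(_ * 2)   // nothing is computed yet; Spark only records the lineage
doubled.count                   // the action triggers the actual computation
res: Long = 100000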
The following lists some of the common transformations supported by Spark
cartesian (otherDataset): When called on datasets of types T and U, returns a dataset of (T,
U) pairs (all pairs of elements).
Example:
val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6),
(3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
partitions(): Returns the array of partitions of this RDD. An RDD (Resilient Distributed Dataset) is the basic abstraction in Spark and represents an immutable, partitioned collection of elements that can be operated on in parallel. The coalesce(numPartitions, shuffle) transformation used below reduces the number of partitions.
Example:
val y = sc.parallelize(1 to 10, 10)
y.partitions.length
res: Int = 10
val z = y.coalesce(2, false)
z.partitions.length
res: Int = 2
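The related repartition(numPartitions) transformation reshuffles the data and can both increase and decrease the number of partitions; a small sketch continuing the example above:
val w = y.repartition(4)
w.partitions.length
res: Int = 4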
cogroup (otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W),
returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called
groupWith.
Example:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
val d = a.map((_, "d"))
b.cogroup(c).collect
res: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b,
b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))),
(2,(CompactBuffer(b),CompactBuffer(c))))
b.cogroup(c, d).collect
res: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b,
b),CompactBuffer(c, c),CompactBuffer(d, d))),
(3,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))),
(2,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))))
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source
dataset
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res: Array[String] = Array(Dog, Gnu, Cat, Rat)
Example 2:
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res: Int = 2
a.distinct(3).partitions.length
res: Int = 3
filter(func) & filterWith(func): Return a new dataset formed by selecting those elements of the
source on which func returns true.
Example 1:
val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res: Array[Int] = Array(2, 4, 6, 8, 10)
Example 2:
val a = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
a.filterWith(x=> x)((a, b) => b == 0).collect
res: Array[Int] = Array(1, 2)
a.filterWith(x=> x)((a, b) => a % (b+1) == 0).collect
res: Array[Int] = Array(1, 2, 4, 6, 8, 10)
a.filterWith(x=> x.toString)((a, b) => b == "2").collect
res: Array[Int] = Array(5, 6)
sortBy() & sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs
where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or
descending order, as specified in the boolean ascending argument.
Example 1:
val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
y.sortBy(c => c, true).collect
res: Array[Int] = Array(1, 1, 2, 3, 5, 7)
y.sortBy(c => c, false).collect
res: Array[Int] = Array(7, 5, 3, 2, 1, 1)
val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)))
z.sortBy(c => c._1, true).collect
res: Array[(String, Int)] = Array((A,26), (H,10), (L,5),(Z,1))
z.sortBy(c => c._2, true).collect
res: Array[(String, Int)] = Array((Z,1), (L,5), (H,10),(A,26))
Example 2:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))
c.sortByKey(false).collect
res: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))
val a = sc.parallelize(1 to 100, 5)
val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
d.collect
res: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so
func should return a Seq rather than a single item).
Example:
val a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4,
5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10)
sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)
The program below generates a random number of copies (up to 10) of
the items in the list.
val x = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect
res: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7,
7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)
join (otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset
of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through
leftOuterJoin, rightOuterJoin, and fullOuterJoin.
Example:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
res: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,
(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,
(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,
(rat,gnu)), (3,(rat,bee)))
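The outer-join variants mentioned above behave the same way, except that values from the other side are wrapped in an Option; a small sketch reusing b and d from the example above (output not shown in full):
b.leftOuterJoin(d).collect
// keys present only in b, such as 8 (from "elephant"), appear as (8,(elephant,None));
// matching keys appear as, e.g., (3,(dog,Some(dog))) and (6,(salmon,Some(rabbit)))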
pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or
bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned
as an RDD of strings.
Example:
val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res: Array[String] = Array(1, 4, 7)
sum(), max() & min(): These functions are used to get the results in transformation and provide
the accurate values
Example 1:
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res: Double = 101.39999999999999
Example 2:
val y = sc.parallelize(10 to 30)
y.max
res: Int = 30
Example 3:
val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9,
"lion"), (18, "cat")))
a.min
res: (Int, String) = (3,tiger)
3.2 Actions
By default, each transformed RDD may be recomputed each time you run an action on
it. However, you may also persist an RDD in memory using the persist (or cache) method, in
which case Spark will keep the elements around on the cluster for much faster access the next
time you query it.
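For example, a minimal caching sketch:
import org.apache.spark.storage.StorageLevel
val a = sc.parallelize(1 to 100000, 4)
a.persist(StorageLevel.MEMORY_ONLY)   // equivalent to a.cache()
a.count                               // the first action computes and caches the partitions
a.count                               // later actions reuse the cached data
a.unpersist()                         // release the cached partitions when no longer needed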
The following lists some of the common actions supported by Spark.
collect() & collectAsMap(): Return all the elements of the dataset as an array at the driver
program. This is usually useful after a filter or other operation that returns a sufficiently small
subset of the data.
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.collect
res: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)
Example 2:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collect
res: Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))
b.collectAsMap
res: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
count(): Return the number of elements in the dataset.
Example:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res: Long = 4
countByKey() & countByValue(): Only available on RDDs of type (K, V). Returns a hashmap
of (K, Int) pairs with the count of each key.
Example 1:
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
Example 2:
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 ->
2)
first(): Return the first element of the dataset (similar to take(1)).
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.first
res: String = Gnu
foreach(func): Run a function func on each element of the dataset. This is usually done for side
effects such as updating an accumulator variable or interacting with external storage systems.
Example:
val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin",
"spider"), 3)
c.foreach(x => println(x + "s are yummy"))
res:
lions are yummy
gnus are yummy
crocodiles are yummy
ants are yummy
whales are yummy
dolphins are yummy
spiders are yummy
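As a sketch of the accumulator use case mentioned above (the accumulator name counter is arbitrary):
val counter = sc.accumulator(0)
c.foreach(x => counter += 1)   // side effect: increment the accumulator for each element
counter.value
res: Int = 10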
keys(): Available on key-value RDDs; returns an RDD containing only the keys of each (K, V) pair.
Example:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat",
"panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
res: Array[Int] = Array(3, 5, 4, 3, 7, 5)
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
Example:
val a = sc.parallelize(1 to 10, 3)
a.reduce(_ + _)
res: Int = 55
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a
given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark
will call toString on each element to convert it to a line of text in the file.
Example:
val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("mydata_a")
import org.apache.hadoop.io.compress.GzipCodec
a.saveAsTextFile("mydata_b", classOf[GzipCodec])
val x = sc.textFile("mydata_b")
x.count
res: Long = 10000
val x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)
x.saveAsTextFile("hdfs://localhost:8020/user/hadoop/test");
val sp = sc.textFile("hdfs://localhost:8020/user/hadoop/sp_data")
sp.flatMap(_.split(" ")).saveAsTextFile("hdfs://localhost:8020/user/hadoop/sp_x")
val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)
v.saveAsSequenceFile("hd_seq_file")
val x = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Array[Int]]("objFile")
y.collect
res: Array[Int] = Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33)
stats(): Available on RDDs of Doubles; returns a StatCounter with summary statistics (count, mean, standard deviation, max, and min) computed in a single pass over the data.
Example:
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02,
19.29, 11.09, 21.0), 2)
x.stats
res: org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667, stdev: 8.126859, max:
21.000000, min: 1.000000)
take(n) : Return an array with the first n elements of the dataset. Note that this is currently not
executed in parallel. Instead, the driver program computes all the elements
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.take(2)
res: Array[String] = Array(dog, cat)
Example 2:
val b = sc.parallelize(1 to 10000, 5000)
b.take(100)
res: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100)
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural
order or a custom comparator
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.takeOrdered(2)
res: Array[String] = Array(ape, cat)
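A custom ordering can be supplied explicitly, for example to take the largest elements instead (a small sketch using the same RDD):
b.takeOrdered(2)(Ordering[String].reverse)
res: Array[String] = Array(salmon, gnu)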
top(num: Int): Returns the top num elements from this RDD as defined by the specified implicit Ordering[T].
Example:
val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
res: Array[Int] = Array(9, 8)
3.3 Miscellaneous
The following lists some other commonly used operations supported by Spark.
Zip : Zips this RDD with another one, returning key-value pairs with the first element in each
RDD, second element in each RDD, etc.
Example:
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
c.collect
res: Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2,salmon), (2,rabbit), (1,turkey),
(2,wolf), (2,bear), (2,bee))
CombineByKey: Simplified version of combineByKey that hash-partitions the resulting RDD
using the default parallelism level.
Example:
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) =>
x ::: y)
d.collect
res: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee,
bear, wolf)))
toDebugString(): Returns a human-readable description of this RDD and its lineage (the chain of parent RDDs). The output is indented at shuffle boundaries, which makes it easy to see where shuffles occur in the lineage graph.
Example:
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res156: String =
(3) MapPartitionsRDD[146] at subtract at <console>:26 []
| SubtractedRDD[145] at subtract at <console>:26 []
+-(3) MapPartitionsRDD[143] at subtract at <console>:26 []
| | ParallelCollectionRDD[141] at parallelize at <console>:22 []
+-(3) MapPartitionsRDD[144] at subtract at <console>:26 []
| ParallelCollectionRDD[142] at parallelize at <console>:22 []
++ Operator: Alias for union(); returns an RDD containing the elements of both RDDs.
Example 1:
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res: Array[Int] = Array(1, 2, 3, 5, 6, 7)
values & variance:
Example 1:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat",
"panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
res: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)
Example 2:
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.variance
res: Double = 10.605333333333332
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.variance
res: Double = 66.04584444444443
x.sampleVariance
res: Double = 74.30157499999999
4. DataFrames
A DataFrame is a distributed collection of data organized into named columns. The DataFrame API is available in Scala, Java, Python, and R.
Examples For Scala:
-:Scala:-
command:- spark-shell
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
// In the spark-shell, an existing SparkContext is already available as sc.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.json")
// Show the content of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// you can use custom classes that implement the Product interface.
Step 1:
case class Person(name: String, age: Int)
Step 2:
val people = sc.textFile("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
Step 3:
val results = sqlContext.sql("SELECT * FROM people")
Step 4:
results.map(t => "Name: " + t(0)).collect().foreach(println)
Example For Python:
-:Python:-
command:- pyspark
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# Print the schema in a tree format
df.printSchema()
# Select only the "name" column
df.select("name").show()
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# Select people older than 21
df.filter(df['age'] > 21).show()
# Count people by age
df.groupBy("age").count().show()
# you can use custom classes that implement the Product interface
Step 1:
from pyspark.sql import Row
# Build an RDD of Rows from people.txt (the same file used in the Scala example above)
people = sc.textFile("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.txt").map(lambda l: l.split(",")).map(lambda p: Row(name=p[0], age=int(p[1])))
schemaPeople = sqlContext.createDataFrame(people)
Step 2:
schemaPeople.registerTempTable("people")
Step 3:
peoples = sqlContext.sql("SELECT * FROM people")
for people in peoples.collect():print people
Step 4:
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
Examples for R:
-:R:-
command:- sparkR
sqlContext <- sparkRSQL.init(sc)
df <- jsonFile(sqlContext, "file path")
# Displays the content of the DataFrame to stdout
showDF(df)
# Print the schema in a tree format
printSchema(df)
# Select only the "name" column
showDF(select(df, "name"))
# Select everybody, but increment the age by 1
showDF(select(df, df$name, df$age + 1))
# Select people older than 21
showDF(where(df, df$age > 21))
# Count people by age
showDF(count(groupBy(df, "age")))
Examples For Hive:
-:Hive:-
One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or
HiveQL
Spark SQL can also be used to read data from an existing Hive installation
command:- spark-shell
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE DATABASE IF NOT EXISTS varun")
sqlContext.sql("CREATE TABLE IF NOT EXISTS varun.src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '/home/hadoop/workshop/spark-1.4.1-bin-
hadoop2.6/examples/src/main/resources/kv1.txt' INTO TABL varun.src")
sqlContext.sql("FROM varun.src SELECT key, value").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src where key > 100").collect().foreach(println)
sqlContext.sql("SELECT key, count(key) FROM varun.src group by
key").collect().foreach(println)
4.1 Defining a Function: A user-defined function operates on the distributed data row by row (unless you are creating a user-defined aggregation function). The example below defines a function, applies it per partition with mapPartitionsWithIndex, and then aggregates values per key with aggregateByKey.
Example:
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12),
("mouse", 2)), 2)
def myfunc(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)],
[partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
4.2 Defining a Function with Aggregate:
Example:
val nums = sc.parallelize(List(1,2,3,4,5,6), 2)
val chars = sc.parallelize(List("a","b","c","d","e","f"),2)
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
nums.foreach(println)
chars.foreach(println)
nums.mapPartitionsWithIndex(myfunc).collect
chars.mapPartitionsWithIndex(myfunc).collect
nums.aggregate(1)(math.max(_, _), _ + _)
The reduce of partition 0 will be max(1, 1, 2, 3) = 3
The reduce of partition 1 will be max(1, 4, 5, 6) = 6
The final reduce across partitions will be 1 + 3 + 6 = 10
nums.aggregate(5)(math.max(_, _), _ + _)
The reduce of partition 0 will be max(5, 1, 2, 3) = 5
The reduce of partition 1 will be max(5, 4, 5, 6) = 6
The final reduce across partitions will be 5 + 5 + 6 = 16
chars.aggregate("")(_ + _, _+_)
res: String = defabc
chars.aggregate("x")(_ + _, _+_)
res: String = xxabcxdef
5. References and Links:
https://spark.apache.org/examples.html
https://spark.apache.org/docs/latest/