This document provides an overview of basic usage of the Apache Spark framework for data analysis. It describes what Spark is, how to install it, and how to use it from Scala, Python, and R. It also explains the key concepts of RDDs (Resilient Distributed Datasets), transformations, and actions. Transformations like filter, map, and join return new RDDs, while actions like reduce, collect, count, and first return results to the driver program. The document provides examples of common transformations and actions in Spark.
Basic Usage Of Spark
Author(s): Nagavarunkumar Kolla
Table of Contents
1. Introduction to Data Analysis with Spark
1.1 What Is Apache Spark?
1.2 Important Notes about Spark
2. Spark Installation Steps
2.1 How to start scala, python & R
3. Technical Details
3.1 Transformations
a. cartesian (otherDataset)
b. partitions
c. cogroup (otherDataset, [numTasks])
d. distinct([numTasks])
e. filter(func) & filterWith(func)
f. sortBy() & sortByKey()
g. flatMap(func)
h. join (otherDataset, [numTasks])
i. sum(), max() & min()
j. pipe(command, [envVars])
3.2 Actions
a. collect() & collectAsMap()
b. count(), countByKey() & countByValue()
c. first()
d. foreach(func)
e. keys()
f. reduce(func)
g. saveAsTextFile(path)
h. stats()
i. take(n) & takeOrdered(n, [ordering])
j. top()
3.3 Miscellaneous
a. Zip & CombineByKey
b. toDebugString()
c. ++ Operator
d. values & variance
4. DataFrames
a. Defining a Function
b. Defining a Function with Aggregate
5. References and Links
1. Introduction to Data Analysis with Spark:
1.1 What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general purpose.
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL,
and rich built-in libraries.
1.2 Important Notes about Spark:
1. Spark provides two ways to create RDDs: loading an external dataset and parallelizing a
collection in your driver program.
2. The simplest way to create an RDD is to take an existing collection in your program and pass it
to SparkContext's parallelize() method.
3. An RDD (Resilient Distributed Dataset) supports two types of operations: Transformations
& Actions (see the sketch below).
Transformations are operations on RDDs that return a new RDD, such as map() and
filter().
Actions are operations that return a result to the driver program or write it to storage,
such as count() and first().
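A minimal sketch of these ideas (run in spark-shell, where sc is already provided; the data is illustrative):
val data = sc.parallelize(List(1, 2, 3, 4, 5)) // create an RDD from a local collection
val evens = data.filter(_ % 2 == 0) // transformation: returns a new RDD, nothing is computed yet
evens.count() // action: triggers the computation and returns 2 to the driver
evens.first() // action: returns the first element, 2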
2. Spark Installation Steps
1. Install Java, Python, R
sudo apt-get install openjdk-7-jdk
sudo apt-get install python
sudo apt-get install r-base
2. Install Hadoop-2.6.0, Hive-1.2.1
3. Install scala-2.10.4
Download the file from the link below
http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
Copy the tar file into your work dir & extract it
Update the ~/.bashrc file with the SCALA_HOME path
4. Install spark-1.4.1
Download the file from the link below
https://spark.apache.org/downloads.html
Copy the tar file into your work dir & extract it
Update the ~/.bashrc file with the SPARK_HOME path (sample lines below)
5. Reopen the terminal to reflect the changes
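The ~/.bashrc entries might look like the following (the install directories are assumptions; point them at wherever you extracted the archives):
export SCALA_HOME=/home/hadoop/workshop/scala-2.10.4
export SPARK_HOME=/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6
export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin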
2.1 How to start scala, python & R
1. "scala" is the command to start the "scala"
"python" is the command to start the "python"
"R" is the command to start the "R"
Verify that all of them are installed correctly
2. spark-shell, pyspark, sparkR, spark-sql these are the main commands to
work with spark
execute "spark-shell" command, it will move to "scala" prompt
execute "pyspark" command, it will move to "python" prompt
execute "sparkR" command, it will move to "R" prompt
3. Technical Details
3.1 Transformations
All transformations in Spark are lazy, in that they do not compute their results right away.
The transformations are only computed when an action requires a result to be returned to the
driver program.
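A small sketch of this laziness (the values are illustrative):
val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2) // transformation: only recorded, not computed yet
doubled.collect // action: the map is executed now
res: Array[Int] = Array(2, 4, 6, 8, 10)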
The following lists some of the common transformations supported by Spark
cartesian (otherDataset): When called on datasets of types T and U, returns a dataset of (T,
U) pairs (all pairs of elements).
Example:
val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6),
(3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
partitions: Returns the array of partitions of this RDD. An RDD (Resilient Distributed Dataset), the basic abstraction in Spark,
represents an immutable, partitioned collection of elements that can be operated on in parallel; partitions.length gives the
number of partitions, and coalesce can be used to reduce it.
Example:
val y = sc.parallelize(1 to 10, 10)
y.partitions.length
res: Int = 10
val z = y.coalesce(2, false)
z.partitions.length
res: Int = 2
cogroup (otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W),
returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called
groupWith.
Example:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
val d = a.map((_, "d"))
b.cogroup(c).collect
res: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b,
b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))),
(2,(CompactBuffer(b),CompactBuffer(c))))
b.cogroup(c, d).collect
res: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b,
b),CompactBuffer(c, c),CompactBuffer(d, d))),
(3,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))),
(2,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))))
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source
dataset
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res: Array[String] = Array(Dog, Gnu, Cat, Rat)
Example 2:
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res: Int = 2
a.distinct(3).partitions.length
res: Int = 3
filter(func) & filterWith(func): Return a new dataset formed by selecting those elements of the
source on which func returns true.
Example 1:
val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res: Array[Int] = Array(2, 4, 6, 8, 10)
Example 2:
val a = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
a.filterWith(x=> x)((a, b) => b == 0).collect
res: Array[Int] = Array(1, 2)
a.filterWith(x=> x)((a, b) => a % (b+1) == 0).collect
res: Array[Int] = Array(1, 2, 4, 6, 8, 10)
a.filterWith(x=> x.toString)((a, b) => b == "2").collect
res: Array[Int] = Array(5, 6)
sortBy() & sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs
where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or
descending order, as specified in the boolean ascending argument.
Example 1:
val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
y.sortBy(c => c, true).collect
res: Array[Int] = Array(1, 1, 2, 3, 5, 7)
y.sortBy(c => c, false).collect
res: Array[Int] = Array(7, 5, 3, 2, 1, 1)
val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)))
z.sortBy(c => c._1, true).collect
res: Array[(String, Int)] = Array((A,26), (H,10), (L,5),(Z,1))
z.sortBy(c => c._2, true).collect
res: Array[(String, Int)] = Array((Z,1), (L,5), (H,10),(A,26))
Example 2:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))
c.sortByKey(false).collect
res: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))
val a = sc.parallelize(1 to 100, 5)
val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
d.collect
res: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so
func should return a Seq rather than a single item).
Example:
val a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4,
5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10)
sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)
The program below generates a random number of copies (from 0 to 9) of
the items in the list.
val x = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect
res: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7,
7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)
join (otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset
of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through
leftOuterJoin, rightOuterJoin, and fullOuterJoin.
Example:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
res: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,
(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,
(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,
(rat,gnu)), (3,(rat,bee)))
pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or
bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned
as an RDD of strings.
Example:
val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res: Array[String] = Array(1, 4, 7)
sum(), max() & min(): Return the sum of the elements of a numeric RDD, and the maximum and minimum elements
according to their natural ordering (for pair RDDs, tuples are compared field by field, as in Example 3).
Example 1:
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res: Double = 101.39999999999999
Example 2:
val y = sc.parallelize(10 to 30)
y.max
res: Int = 30
Example 3:
val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9,
"lion"), (18, "cat")))
a.min
res: (Int, String) = (3,tiger)
3.2 Actions
By default, each transformed RDD may be recomputed each time you run an action on
it. However, you may also persist an RDD in memory using the persist (or cache) method, in
which case Spark will keep the elements around on the cluster for much faster access the next
time you query it.
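For example (a sketch; the dataset is illustrative):
val nums = sc.parallelize(1 to 1000000)
nums.cache() // mark the RDD to be persisted in memory
nums.count() // first action: computes the RDD and caches it
nums.count() // later actions reuse the cached data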
The following lists some of the common actions supported by Spark.
collect() & collectAsMap(): Return all the elements of the dataset as an array at the driver
program. This is usually useful after a filter or other operation that returns a sufficiently small
subset of the data.
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.collect
res: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)
Example 2:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collect
res: Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))
b.collectAsMap
res: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
count(): Return the number of elements in the dataset.
Example:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res: Long = 4
countByKey() & countByValue(): countByKey is only available on RDDs of type (K, V) and returns a map
of (K, Long) pairs with the count of each key; countByValue returns a map of each distinct value and its count.
Example 1:
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
Example 2:
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 ->
2)
first(): Return the first element of the dataset (similar to take(1)).
Example 1:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.first
res: String = Gnu
foreach(func): Run a function func on each element of the dataset. This is usually done for side
effects such as updating an accumulator variable or interacting with external storage systems.
Example:
val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin",
"spider"), 3)
c.foreach(x => println(x + "s are yummy"))
res:
lions are yummy
gnus are yummy
crocodiles are yummy
ants are yummy
whales are yummy
dolphins are yummy
spiders are yummy
keys(): Only available on RDDs of key-value pairs; returns an RDD containing the key of each
tuple.
Example:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat",
"panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
res: Array[Int] = Array(3, 5, 4, 3, 7, 5)
reduce(func): Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one). The function should be commutative and associative so that it can
be computed correctly in parallel
Example:
val a = sc.parallelize(1 to 10, 3)
a.reduce(_ + _)
res: Int = 55
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a
given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark
will call toString on each element to convert it to a line of text in the file.
Example:
val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("mydata_a")
import org.apache.hadoop.io.compress.GzipCodec
a.saveAsTextFile("mydata_b", classOf[GzipCodec])
val x = sc.textFile("mydata_b")
x.count
res: Long = 10000
val x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)
x.saveAsTextFile("hdfs://localhost:8020/user/hadoop/test");
val sp = sc.textFile("hdfs://localhost:8020/user/hadoop/sp_data")
sp.flatMap(_.split(" ")).saveAsTextFile("hdfs://localhost:8020/user/hadoop/sp_x")
val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)
v.saveAsSequenceFile("hd_seq_file")
val x = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Array[Int]]("objFile")
y.collect
res: Array[Int] = Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33)
stats(): Returns a StatCounter that gathers the count, mean, standard deviation, maximum, and minimum
of the elements of a numeric RDD in a single pass.
Example:
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02,
19.29, 11.09, 21.0), 2)
x.stats
res: org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667, stdev: 8.126859, max:
21.000000, min: 1.000000)
take(n): Return an array with the first n elements of the dataset. Note that this is currently not
executed in parallel. Instead, the driver program computes all the elements
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.take(2)
res: Array[String] = Array(dog, cat)
Example 2:
val b = sc.parallelize(1 to 10000, 5000)
b.take(100)
res: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100)
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural
order or a custom comparator
Example 1:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon","gnu"), 2)
b.takeOrdered(2)
res: Array[String] = Array(ape, cat)
top(num: Int): Returns the top K elements from this RDD as defined by the specified implicit
Ordering[T].
Example:
val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
res: Array[Int] = Array(9, 8)
3.3 Miscellaneous
The following lists some of the common miscellaneous operations supported by Spark.
Zip : Zips this RDD with another one, returning key-value pairs with the first element in each
RDD, second element in each RDD, etc.
Example:
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
c.collect
res: Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2,salmon), (2,rabbit), (1,turkey),
(2,wolf), (2,bear), (2,bee))
CombineByKey: Combines the elements for each key using custom functions (a createCombiner, a mergeValue,
and a mergeCombiners function); the simplified variant used below hash-partitions the resulting RDD using the default parallelism level.
Example:
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) =>
x ::: y)
d.collect
res: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee,
bear, wolf)))
toDebugString(): Returns a human-readable description of this RDD and its recursive dependencies
(the lineage graph); the indentation of the tree shows where shuffle boundaries occur.
Example:
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res: String =
(3) MapPartitionsRDD[146] at subtract at <console>:26 []
| SubtractedRDD[145] at subtract at <console>:26 []
+-(3) MapPartitionsRDD[143] at subtract at <console>:26 []
| | ParallelCollectionRDD[141] at parallelize at <console>:22 []
+-(3) MapPartitionsRDD[144] at subtract at <console>:26 []
| ParallelCollectionRDD[142] at parallelize at <console>:22 []
++ Operator: An alias for union; returns the union of this RDD and the RDD supplied as its argument.
Example 1:
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res: Array[Int] = Array(1, 2, 3, 5, 6, 7)
values & variance: values returns an RDD with the value of each key-value pair; variance computes the
variance of a numeric RDD (sampleVariance uses the unbiased sample estimator).
Example 1:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat",
"panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
res: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)
Example 2:
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.variance
res: Double = 10.605333333333332
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.variance
res: Double = 66.04584444444443
x.sampleVariance
res: Double = 74.30157499999999
4. DataFrames
A DataFrame is a distributed collection of data organized into named columns. The
DataFrame API is available in Scala, Java, Python, and R.
Examples For Scala:
-:Scala:-
command:- spark-shell
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
// sc is an existing SparkContext, created automatically by spark-shell
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.json")
// Show the content of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// you can use custom classes that implement the Product interface.
Step 1:
case class Person(name: String, age: Int)
Step 2:
val people = sc.textFile("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
Step 3:
val results = sqlContext.sql("SELECT * FROM people")
Step 4:
results.map(t => "Name: " + t(0)).collect().foreach(println)
Example For Python:
-:Python:-
command:- pyspark
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# Print the schema in a tree format
df.printSchema()
# Select only the "name" column
df.select("name").show()
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# Select people older than 21
df.filter(df['age'] > 21).show()
# Count people by age
df.groupBy("age").count().show()
# Create a DataFrame from an existing RDD of records (people; a sketch of building it is shown below)
Step 1:
schemaPeople = sqlContext.createDataFrame(people)
Step 2:
schemaPeople.registerTempTable("people")
Step 3:
peoples = sqlContext.sql("SELECT * FROM people")
for person in peoples.collect(): print(person)
Step 4:
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
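Note: the people RDD used in Step 1 is not defined above. One way to build it from the same people.txt file (a sketch mirroring the Scala example; the field names are assumptions taken from that example) is:
from pyspark.sql import Row
lines = sc.textFile("file:/home/hadoop/workshop/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1].strip())))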
Examples for R:
-:R:-
command:- sparkR
# sqlContext is created automatically by the sparkR shell (otherwise: sqlContext <- sparkRSQL.init(sc))
df <- jsonFile(sqlContext, "file path")
# Displays the content of the DataFrame to stdout
showDF(df)
# Print the schema in a tree format
printSchema(df)
# Select only the "name" column
showDF(select(df, "name"))
# Select everybody, but increment the age by 1
showDF(select(df, df$name, df$age + 1))
# Select people older than 21
showDF(where(df, df$age > 21))
# Count people by age
showDF(count(groupBy(df, "age")))
Examples For Hive:
-:Hive:-
One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or
HiveQL
Spark SQL can also be used to read data from an existing Hive installation
command:- spark-shell
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE DATABASE IF NOT EXISTS varun")
sqlContext.sql("CREATE TABLE IF NOT EXISTS varun.src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '/home/hadoop/workshop/spark-1.4.1-bin-
hadoop2.6/examples/src/main/resources/kv1.txt' INTO TABL varun.src")
sqlContext.sql("FROM varun.src SELECT key, value").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src").collect().foreach(println)
sqlContext.sql("SELECT key, value FROM varun.src where key > 100").collect().foreach(println)
19. sqlContext.sql("SELECT key, count(key) FROM varun.src group by
key").collect().foreach(println)
4.1 Defining a Function: This function operates on distributed data and works row by row
(unless you're creating a user-defined aggregation function)
Example:
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12),
("mouse", 2)), 2)
def myfunc(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)],
[partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
4.2 Defining a Function with Aggregate:
Example:
val nums = sc.parallelize(List(1,2,3,4,5,6), 2)
val chars = sc.parallelize(List("a","b","c","d","e","f"),2)
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
nums.foreach(println)
chars.foreach(println)
nums.mapPartitionsWithIndex(myfunc).collect
chars.mapPartitionsWithIndex(myfunc).collect
nums.aggregate(1)(math.max(_, _), _ + _)
reduce of partition 0 will be max(1, 1, 2, 3) = 3
reduce of partition 1 will be max(1, 4, 5, 6) = 6
final reduce across partitions will be 1 + 3 + 6 = 10
nums.aggregate(5)(math.max(_, _), _ + _)
reduce of partition 0 will be max(5, 1, 2, 3) = 5
reduce of partition 1 will be max(5, 4, 5, 6) = 6
final reduce across partitions will be 5 + 5 + 6 = 16
chars.aggregate("")(_ + _, _+_)
res: String = defabc
chars.aggregate("x")(_ + _, _+_)
res: String = xxabcxdef
5. References and Links:
https://spark.apache.org/examples.html
https://spark.apache.org/docs/latest/