Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs

Learning Apache Spark
Part 2: Transformations and Actions on RDDs

Presenter Introduction
Tim Spann, Senior Solutions Architect, airis.DATA
• ex-Pivotal Senior Field Engineer
• DZONE MVB and Zone Leader
• ex-Startup Senior Engineer / Team Lead
http://www.slideshare.net/bunkertor
http://sparkdeveloper.com/
http://www.twitter.com/PaasDev

airis.DATA
airis.DATA is a next generation system integrator that specializes in rapidly deployable
machine learning and graph solutions.
Our core competencies involve providing modular, scalable Big Data products that can be
tailored to fit use cases across industry verticals.
We offer predictive modeling and machine learning solutions at Petabyte scale utilizing
the most advanced, best-in-class technologies and frameworks including Spark, H2O, and
Flink.
Our data pipelining solutions can be deployed in batch, real-time or near-real-time
settings to fit your specific business use-case.

Agenda
• Hands-On: Quick Install Zeppelin
• RDD Transformations
• RDD Actions
• Hands-On: RDD transformations and actions in Scala on Spark Standalone (local)

Installing Zeppelin and Spark 1.6
• Java JDK 8, Scala 2.10, SBT 0.13, Maven 3.3.9, Spark 1.6.0
• http://www.oracle.com/technetwork/java/javase/downloads/index.html
• http://www.scala-lang.org/download/2.10.6.html
• http://www.scala-lang.org/download/install.html
• http://www.scala-sbt.org/download.html
• http://apache.claz.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.zip
• http://spark.apache.org/downloads.html
• http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
• http://www.apache.org/dyn/closer.cgi/incubator/zeppelin/0.5.6-incubating/zeppelin-0.5.6-incubating-bin-all.tgz
• For Mac (brew install sbt)

Installing Zeppelin and Spark 1.6 no.2
export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
export SCALA_HOME={YOURDIR}/scala-2.10.6
export PATH=$PATH:$SCALA_HOME/bin
On Windows, use SET instead of export, and ; instead of : as the PATH separator.

Running Zeppelin and Spark 1.6
https://zeppelin.incubator.apache.org/docs/0.5.6-incubating/install/install.html
https://github.com/hortonworks-gallery/zeppelin-notebooks
Download the Apache Zeppelin binary (Mac and Linux)
zeppelin-0.5.6-incubating-bin-all
Unzip
Run
cd zeppelin-0.5.6-incubating-bin-all
./bin/zeppelin-daemon.sh start
http://localhost:8080/

Resilient Distributed Datasets (RDDs)
RDDs have ACTIONS, which return values (output):
val textFile = sc.textFile("mydata.txt")
textFile.count()
and TRANSFORMATIONS, which return pointers to new RDDs:
val lines = textFile.filter(line => line.contains("Spark"))
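The difference between the two kinds of operation can be sketched as follows. This is a minimal illustration, not the presenter's notebook: the in-memory lines stand in for mydata.txt, and the names sparkLines, total, and matching are made up for the example. SparkContext.getOrCreate reuses the sc that Zeppelin already provides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses an existing context (for example Zeppelin's sc) or starts a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-basics").setMaster("local[2]"))

// Hypothetical in-memory stand-in for mydata.txt
val textFile = sc.parallelize(Seq("Spark is fast", "Zeppelin notebooks", "Learning Spark"))

// Transformation: returns a new RDD right away, nothing is computed yet
val sparkLines = textFile.filter(line => line.contains("Spark"))

// Actions: trigger the computation and return values to the driver
val total = textFile.count()
val matching = sparkLines.count()
```

Only the two count() calls cause Spark to run a job; the filter is just recorded as a step in the lineage.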

Transformations
Transformation | Meaning
map(func) | Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func) | Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func) | Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed) | Sample a fraction (fraction) of the data, with or without replacement, using a given random number generator seed.
union(otherDataset) | Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset) | Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numTasks]) | Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

Transformations (continued)
Transformation | Meaning
reduceByKey(func, [numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numTasks]) | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

Transformations (continued)
Transformation | Meaning
cartesian(otherDataset) | When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]) | Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions) | Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) | Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner) | Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.

Transformations: MAP, FILTER
MAP
logFile.map(parseLogLine)
Here parseLogLine is a Scala function that takes one line of the Apache log as a String and parses it into a LogRecord case class. map applies it to every line of the file RDD, and the result is a new RDD of LogRecords.
FILTER
filter(!_.clientIp.equals("Empty"))
This drops the records whose clientIp is "Empty" from the resulting RDD. The filter operates on an RDD of LogRecords.
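The deck does not show LogRecord or parseLogLine, so the following is only a sketch under stated assumptions: the two-field LogRecord, the whitespace-based parser, and the sample lines are all hypothetical and exist just to make the map and filter calls runnable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("map-filter").setMaster("local[2]"))

// Hypothetical record type; the real LogRecord likely has more fields
case class LogRecord(clientIp: String, request: String)

// Hypothetical parser: first token is the client IP, blank lines become "Empty"
def parseLogLine(line: String): LogRecord = {
  val parts = line.trim.split(" ", 2)
  if (line.trim.isEmpty) LogRecord("Empty", "")
  else LogRecord(parts(0), if (parts.length > 1) parts(1) else "")
}

// Made-up log lines standing in for the Apache log file
val logFile = sc.parallelize(Seq("10.0.0.1 GET /index.html", "", "10.0.0.2 GET /about.html"))

// map parses every line, filter drops the "Empty" records
val records = logFile.map(parseLogLine).filter(!_.clientIp.equals("Empty"))
val ips = records.map(_.clientIp).collect().sorted
```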

Transformations: FLATMAP, MAPPARTITIONSWITHINDEX
FLATMAP
val flatRDD = originalRDD.flatMap(_.split(" "))
Maps each input item to zero or more output items; the function returns a Scala Seq (sequence).
MAPPARTITIONSWITHINDEX
val mapped = originalRDD.mapPartitionsWithIndex {
  (index, iterator) => {
    println("Index -> " + index)
    val myList = iterator.toList
    myList.map(x => x + " -> " + index).iterator
  }
}
Runs a function over each partition and also passes in the partition index. Otherwise the same as mapPartitions.
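To make the contrast with map concrete, here is a small sketch. The contents of originalRDD are invented for the example (the slides never define it), and the explicit partition count of 2 is an assumption chosen so that mapPartitions has more than one partition to visit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("flatmap").setMaster("local[2]"))

// Hypothetical data, split over 2 partitions
val originalRDD = sc.parallelize(Seq("Spark SQL", "Spark Streaming", "GraphX"), 2)

// map keeps one output per input: here one Array[String] per line
val tokensPerLine = originalRDD.map(_.split(" ")).map(_.length).collect()

// flatMap flattens the per-line arrays into a single RDD of words
val words = originalRDD.flatMap(_.split(" ")).collect()

// mapPartitions sees a whole partition at once: emit one count per partition
val perPartition = originalRDD.mapPartitions(iter => Iterator(iter.size)).collect()
```

tokensPerLine has three entries (one per input line), while words has five (one per token).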

Transformations: UNION, INTERSECTION
val rddSpark = sc.parallelize(List("SQL", "Streaming", "GraphX", "MLLib", "Bagel",
  "SparkR", "Python", "Scala", "Java", "Alluxio", "Tungsten", "Zeppelin"))
val rddHadoop = sc.parallelize(List("HDFS", "YARN", "TEZ", "Hive", "HBase", "Pig", "Atlas", "Storm",
  "Accumulo", "Ranger", "Phoenix", "MapReduce", "Slider", "Flume", "Kafka", "Oozie", "Sqoop",
  "Falcon", "Knox", "Ambari", "Zookeeper", "Cloudbreak", "SQL", "Java", "Scala", "Python"))
UNION
rddHadoop.union(rddSpark).collect()
Returns the elements of the source dataset followed by those of the argument. Duplicates are kept (SQL, Java, Scala, and Python appear twice here); call distinct() afterwards for a true set union.
INTERSECTION
rddHadoop.intersection(rddSpark).collect()
Does a set intersection of the source dataset and the argument. The result contains no duplicates.
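A quick way to see how union treats duplicates is to count the results. The two small RDDs below are made up for the illustration; they are not the rddSpark and rddHadoop lists from the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("union").setMaster("local[2]"))

// Hypothetical datasets sharing two elements
val a = sc.parallelize(Seq("SQL", "Scala", "HDFS"))
val b = sc.parallelize(Seq("SQL", "Scala", "GraphX"))

// union concatenates the two RDDs and keeps duplicates
val unionCount = a.union(b).count()

// distinct() turns the result into a set-style union
val setUnion = a.union(b).distinct().collect().sorted

// intersection already removes duplicates
val common = a.intersection(b).collect().sorted
```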

Transformations: DISTINCT, SAMPLE
DISTINCT
bigDataRDD.distinct().collect()
Gets the distinct elements of the source dataset.
SAMPLE
bigDataRDD.sample(true, 0.25).collect()
res89: Array[String] = Array(HDFS, TEZ, Pig, Knox, Python, Python)
Samples a fraction (0.25) of the data with replacement (true). The fraction is an expected size, not an exact one.
When an exact sample size is required (as with sampleByKeyExact), sampling without replacement requires one additional pass over the RDD to guarantee the sample size, whereas sampling with replacement requires two additional passes, so sampling with replacement is slower.

Transformations: GROUPBYKEY, REDUCEBYKEY
GROUPBYKEY
val groupByRDD = keyValueRDD.groupByKey()
For datasets of (K, V) pairs. Not often used for aggregations; reduceByKey is preferred.
REDUCEBYKEY
val kvRDD = sc.parallelize(Seq((1, "Bacon"), (1, "Hamburger"), (1, "Cheeseburger")))
val reducedByRDD = kvRDD.reduceByKey((a, b) => a.concat(b))
reducedByRDD: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[66] at reduceByKey at <console>:31
reducedByRDD.collect()
res136: Array[(Int, String)] = Array((1,BaconHamburgerCheeseburger))
Reduces the values for each key with the given function (here, concat).
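As an illustration of why reduceByKey is usually preferred for aggregations, the sketch below computes the same per-key sum both ways. The sales data and variable names are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("group-vs-reduce").setMaster("local[2]"))

// Hypothetical (item, quantity) pairs
val sales = sc.parallelize(Seq(("Bacon", 3), ("Tofu", 1), ("Bacon", 4), ("Tofu", 2)))

// groupByKey ships every value across the shuffle, then sums on the reduce side
val viaGroup = sales.groupByKey().mapValues(_.sum).collect().toMap

// reduceByKey combines values within each partition before the shuffle
val viaReduce = sales.reduceByKey(_ + _).collect().toMap
```

Both produce the same totals; the difference is how much data crosses the network on larger inputs.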

Transformations: AGGREGATEBYKEY, SORTBYKEY
AGGREGATEBYKEY
val namesRDD = sc.parallelize(List((1, 25), (1, 27), (3, 25), (3, 27)))
val groupByRDD = namesRDD.aggregateByKey(0)((acc, v) => acc + v, (a, b) => a + b).collect()
groupByRDD: Array[(Int, Int)] = Array((1,52), (3,52))
For datasets of (K, V) pairs, returns pairs where the values for each key are aggregated using a "zero" value, a function that folds each value into the running result within a partition, and a function that merges the results of different partitions.
SORTBYKEY
val sortByRDD = namesRDD.sortByKey(true).collect()
sortByRDD: Array[(Int, Int)] = Array((1,25), (1,27), (3,25), (3,27))
Returns a dataset of pairs sorted by key in ascending (true) or descending (false) order.
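aggregateByKey is most useful when the aggregated type differs from the value type. A minimal sketch, reusing the slide's (key, value) data to compute a per-key average with a (sum, count) accumulator; the names ages, sumAndCount, and averages are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("aggregate").setMaster("local[2]"))

val ages = sc.parallelize(List((1, 25), (1, 27), (3, 25), (3, 27)))

// Zero value is (sum, count); values are Int but the result is (Int, Int)
val sumAndCount = ages.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),     // fold one value into a partition-local accumulator
  (x, y) => (x._1 + y._1, x._2 + y._2))     // merge accumulators from different partitions

val averages = sumAndCount.mapValues { case (sum, n) => sum.toDouble / n }.collect().toMap
```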

JOIN, LEFTOUTERJOIN, RIGHTOUTERJOIN, FULLOUTERJOIN
JOIN
val otherKeyValueRDD = sc.parallelize(Seq(("Bacon", "Amazing"), ("Steak", "Fine"), ("Lettuce", "Sad")))
keyValueRDD.join(otherKeyValueRDD).collect()
res166: Array[(String, (String, String))] = Array((Bacon,(Awesome,Amazing)))
Returns a dataset of (K, (V, W)) pairs for each key that appears in both datasets.
LEFTOUTERJOIN
keyValueRDD.leftOuterJoin(otherKeyValueRDD).collect()
res170: Array[(String, (String, Option[String]))] = Array((PorkRoll,(Great,None)),
(Tofu,(Bogus,None)), (Bacon,(Awesome,Some(Amazing))))
Returns a dataset following SQL-style outer joins: every key of the left dataset is kept, and a missing right-hand value is None.
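The slide only shows join and leftOuterJoin, so here is a sketch of the other two variants. The contents of keyValueRDD are an assumption inferred from the outputs shown in the deck, since its definition does not appear in the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("joins").setMaster("local[2]"))

// Assumed contents, reconstructed from the join outputs in the slides
val keyValueRDD = sc.parallelize(Seq(("Bacon", "Awesome"), ("PorkRoll", "Great"), ("Tofu", "Bogus")))
val otherKeyValueRDD = sc.parallelize(Seq(("Bacon", "Amazing"), ("Steak", "Fine"), ("Lettuce", "Sad")))

// rightOuterJoin keeps every key of the right side: (K, (Option[V], W))
val right = keyValueRDD.rightOuterJoin(otherKeyValueRDD).collect().toMap

// fullOuterJoin keeps every key of both sides: (K, (Option[V], Option[W]))
val full = keyValueRDD.fullOuterJoin(otherKeyValueRDD).collect().toMap
```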

Transformations: COGROUP, CARTESIAN
COGROUP
keyValueRDD.cogroup(otherKeyValueRDD).collect()
res178: Array[(String, (Iterable[String], Iterable[String]))] =
Array((PorkRoll,(CompactBuffer(Great),CompactBuffer())), (Steak,(CompactBuffer(),CompactBuffer(Fine))),
(Tofu,(CompactBuffer(Bogus),CompactBuffer())), (Lettuce,(CompactBuffer(),CompactBuffer(Sad))),
(Bacon,(CompactBuffer(Awesome),CompactBuffer(Amazing))))
Also known as "groupWith".
CARTESIAN
keyValueRDD.cartesian(otherKeyValueRDD).collect()
res182: Array[((String, String), (String, String))] = Array(((Bacon,Awesome),(Bacon,Amazing)),
((Bacon,Awesome),(Steak,Fine)), ((Bacon,Awesome),(Lettuce,Sad)),
((PorkRoll,Great),(Bacon,Amazing)), ((PorkRoll,Great),(Steak,Fine)), ((PorkRoll,Great),(Lettuce,Sad)),
((Tofu,Bogus),(Bacon,Amazing)), ((Tofu,Bogus),(Steak,Fine)), ((Tofu,Bogus),(Lettuce,Sad)))
Returns a dataset of all pairs of elements (the Cartesian product).

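Transformations: PIPE, COALESCE
PIPE
keyValueRDD.pipe("cut -c2-4").collect()
res213: Array[String] = Array(Bac, Por, Tof)
Pipes each partition through a shell command. Each element is written to the command's stdin as a line of text, so (Bacon,Awesome) becomes Bac after cut -c2-4.
COALESCE
keyValueRDD.coalesce(1).collect()
Decreases the number of partitions.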

REPARTITION, REPARTITIONANDSORTWITHINPARTITIONS
REPARTITION
keyValueRDD.repartition(2).collect()
res241: Array[(String, String)] = Array((Bacon,Awesome), (PorkRoll,Great), (Tofu,Bogus))
Reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them. Good after filtering down a large dataset.
REPARTITIONANDSORTWITHINPARTITIONS
keyValueRDD.repartitionAndSortWithinPartitions(YourPartitioner).collect()
Repartitions using a custom partitioner and sorts the records by key within each partition. Useful for secondary sorting.
See: http://codingjunkie.net/spark-secondary-sort/
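For a runnable illustration, the sketch below substitutes Spark's built-in HashPartitioner for the custom partitioner the slide leaves unspecified, and uses made-up integer-keyed pairs. glom() is used only to make the partition boundaries visible.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("repartition-sort").setMaster("local[2]"))

// Hypothetical pairs in no particular order
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (4, "d"), (2, "b")))

// HashPartitioner(2) sends even keys to partition 0 and odd keys to partition 1;
// records are sorted by key inside each partition during the shuffle
val sortedWithin = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// glom() turns each partition into an array so the layout can be inspected
val partitions = sortedWithin.glom().collect().map(_.toList)
```

Note that the result is sorted within each partition only, not across the whole RDD.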

Actions
Action | Meaning
reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() | Return the number of elements in the dataset.
first() | Return the first element of the dataset (similar to take(1)).
take(n) | Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator.

Actions (continued)
Action | Meaning
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala) | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala) | Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey() | Only available on RDDs of type (K, V). Returns a map of (K, Long) pairs with the count of each key.
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
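The note about foreach and Accumulators can be illustrated with a short sketch. It uses the Spark 1.x sc.accumulator API; the accumulator name and the input range are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("foreach-accumulator").setMaster("local[2]"))

// Accumulators are the supported way to collect side effects from foreach
val evens = sc.accumulator(0, "even count")

// Runs on the executors; each task adds to the accumulator
sc.parallelize(1 to 10).foreach(n => if (n % 2 == 0) evens += 1)

// The merged value is read back on the driver
val evenCount = evens.value
```

Incrementing a plain local var inside the closure instead would only update a copy on each executor, which is the undefined behavior the note warns about.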

Actions: COLLECT, COUNT, FIRST, TAKE, TAKESAMPLE, TAKEORDERED
originalRDD.collect()
originalRDD.collect().foreach(println)
originalRDD.count()
originalRDD.first()
originalRDD.take(2)
originalRDD.takeSample(true, 5, 7634184)
originalRDD.takeOrdered(5)
takeSample takes whether to sample with replacement, the number of samples, and an optional random number generator seed.
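A few of these actions on concrete data, as a sketch; the numbers and the seed are arbitrary choices for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("actions").setMaster("local[2]"))

// Hypothetical data
val nums = sc.parallelize(Seq(5, 3, 9, 1, 7))

// Natural ordering: the two smallest elements
val smallest = nums.takeOrdered(2)

// Custom ordering: reverse it to get the two largest
val largest = nums.takeOrdered(2)(Ordering[Int].reverse)

// Without replacement, a sample never repeats an element
val sample = nums.takeSample(false, 3, 42L)

// reduce with a commutative, associative function
val total = nums.reduce(_ + _)
```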

Actions: COUNTBYKEY, FOREACH, SAVEAS…
keyValueRDD.countByKey().foreach(println)
(PorkRoll,1) (Tofu,1) (Bacon,1)
keyValueRDD.saveAsTextFile("here")
keyValueRDD.saveAsSequenceFile("here2")
keyValueRDD.saveAsObjectFile("here3")
In Zeppelin, %sh ls shows the local files, including the directories created for here, here2, and here3. Each save writes a directory of part files; for example, cat here/part-00003 shows the content of one part file in the directory "here".

Actions: REDUCE
bigDataRDD.reduce((a, b) => a.concat(b))
res154: String = AmbariZookeeperCloudbreakSQLJavaScalaPythonFlumeKafkaOozieSqoopFalconKnoxAtlasStormAccumuloRangerPhoenixMapReduceSliderHDFSYARNTEZHiveHBasePigSQLStreamingGraphXMLLibBagelSparkRPythonScalaJavaAlluxioTungstenZeppelin
Aggregates the elements of the dataset using a function. Here we concatenate all the Big Data strings into one long String appropriate for resumes. Concatenation is associative but not commutative, so the order of the pieces in the result can vary between runs.

Running a Spark Job
[Diagram: the driver program (SparkContext) distributes work to two worker nodes; each worker node runs an executor, and each executor runs two tasks.]

Spark Resources
http://www.slideshare.net/airisdata/parquet-and-avro
http://airisdata.com/scala-spark-resources-setup-learning/
https://dzone.com/articles/anatomy-of-a-scala-spark-program
https://dzone.com/articles/proper-sbt-setup-for-scala-210-and-spark-streaming
https://github.com/airisdata/sparkworkshop
https://github.com/airisdata/SparkTransformations
https://github.com/airisdata/avroparquet
http://www.slideshare.net/airisdata/apache-spark-overview-59903397
https://plugins.jetbrains.com/plugin/?id=1347
http://mund-consulting.com/Products/Sparklet.aspx
