From Hadoop to Spark
2/2
Dr. Fabio Fumarola
Outline
• Spark Shell
– Scala
– Python
• Shark Shell
• Data Frames
• Spark Streaming
• Code Examples: Processing and Machine Learning
2
Start the docker container
• Pull the image
– From https://github.com/sequenceiq/docker-spark
– Via command: docker pull sequenceiq/spark:1.3.0
• Run the Docker
– Interactive: docker run -it -P sequenceiq/spark:1.3.0 bash
Or
– Daemon: docker run -d -P sequenceiq/spark:1.3.0
3
Separate Container Master/Worker
Or, alternatively
$ docker pull snufkin/spark-master
$ docker pull snufkin/spark-worker
•These images are based on snufkin/spark-base
$ docker run … master
$ docker run … worker
4
Start the spark shell
• Shell in YARN-client mode: the driver runs in a client process
and the application master is used only to request resources from YARN
– spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
• YARN-cluster mode: the Spark driver runs inside the application master,
which is managed by YARN
– spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 $SPARK_HOME/lib/spark-examples-1.3.0-hadoop2.4.0.jar
5
Programming with RDDs
6
Start the shell
• Scala Spark-shell local
– spark-shell --master local[2] --driver-memory 1g --executor-memory 1g
• Python Spark-shell local
– pyspark --master local[2] --driver-memory 1g --executor-memory 1g
7
RDD Basics
Internally, each RDD is characterized by five main properties:
•A list of partitions
•A function for computing each split
•A list of dependencies on other RDDs
•Optionally, a Partitioner for key-value RDDs (e.g. to say that the
RDD is hash-partitioned)
•Optionally, a list of preferred locations to compute each split on
(e.g. block locations for an HDFS file)
8
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
RDD Basics
• When a shell is started, a SparkContext is created for you
• An RDD in Spark can be obtained by:
– loading an external dataset with sc.textFile(…)
– distributing a collection of objects with sc.parallelize(1 to 1000)
• Spark can read and distribute datasets from HDFS (hdfs://),
Cassandra, HBase, Amazon S3 (s3://), etc.
9
scala> sc
res0: org.apache.spark.SparkContext =org.apache.spark.SparkContext@5d02b84a
Creating an RDD from a file
If you run on YARN
•You need to interact with HDFS to list files
– hadoop fs -ls /
– hdfs dfs -ls /
•Download a file
– wget http://pbdmng.datatoknowledge.it/files/access_log
– curl -O http://pbdmng.datatoknowledge.it/files/error_log
10
Creating an RDD from a file
• Copy to hdfs
– hadoop fs -copyFromLocal access_log ./
• List the files
– hadoop fs -ls ./
11
bash-4.1# hadoop fs -ls ./
Found 3 items
drwxr-xr-x - root supergroup 0 2015-05-28 05:06 .sparkStaging
drwxr-xr-x - root supergroup 0 2015-01-15 04:05 input
-rw-r--r-- 1 root supergroup 5589889 2015-05-28 05:44 access_log
http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
Creating an RDD from a file
• Scala
val lines = sc.textFile("/user/root/access_log")
lines.count
• Python
>>> lines = sc.textFile("/user/root/error_log")
>>> lines.count()
12
Creating an RDD
• Scala
scala> val rdd = sc.parallelize(1 to 1000)
• Python
>>> data = [1,2,3,4,5]
>>> rdd = sc.parallelize(data)
>>> rdd.count()
13
RDD Example
• Create an RDD of numbers from 1 to 1000 and sum
its elements
• Scala
scala> val rdd = sc.parallelize(1 to 1000)
scala> val sum = rdd.reduce((a,b) => a + b)
• Python
>>> rdd = sc.parallelize(range(1,1001))
>>> sum = rdd.reduce(lambda a, b: a + b)
14
RDD and Computation
• RDDs are recomputed by default each time an action
is called
• To reuse the same RDD across multiple actions, mark it for caching:
– rdd.persist()
– rdd.cache()
15
When to Cache and when to Persist?
• With persist() and cache() on an RDD, its partitions are stored in
memory buffers
– by default Spark reserves only a fraction of the JVM heap for this
cache (spark.storage.memoryFraction)
• Since the cache space is limited, it is sometimes better to call
persist() with a disk-backed storage level instead of cache()
• Otherwise, cached partitions that get evicted need to be
recomputed
• while persisted partitions can be restored from disk (see the sketch below)
16
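A minimal sketch of the difference (not from the original slides): cache() is just persist() with the MEMORY_ONLY storage level, while an explicit disk-backed level lets evicted partitions be read back from disk instead of being recomputed.
scala> import org.apache.spark.storage.StorageLevel
scala> val a = sc.parallelize(1 to 1000).cache()                                 // same as persist(StorageLevel.MEMORY_ONLY)
scala> val b = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)   // evicted partitions spill to disk
scala> a.count; b.count                                                          // the first action materializes each cache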
Passing functions to Spark
17
Passing Functions to Spark
• Spark’s API relies on passing functions defined in the driver
program to run on the cluster
• Recommendations for functions
– Anonymous functions
– Methods in singleton objects
– Methods of a class that operate on RDDs
18
Passing Functions to Spark: Scala
• Anonymous function syntax
scala> (x: Int) => x * x
res0: Int => Int = <function1>
• Singleton Object
scala> object MyFunctions {
| def func1(s: String): String = s + s
| }
scala> lines.map(MyFunctions.func1)
19
Passing Functions to Spark: Scala
• Class
scala> class MyClass {
| def func1(s: String): String = ???
| def doStuff(rdd: RDD[String]): RDD[String] = rdd.map(func1)
| }
• Class with a val
scala> class MyClass {
| val field = "hello"
| def doStuff(rdd: RDD[String]): RDD[String] = rdd.map(_ + field)
| }
20
Passing Functions to Spark:
Python
• Function
>>> if __name__ == "__main__":
...     def myFunc(s):
...         words = s.split(" ")
...         return len(words)
• Class
>>> class MyClass(object):
...     def func(self, s):
...         return s
...     def doStuff(self, rdd):
...         return rdd.map(self.func)
21
Functions and Memory Usage
• Spark reserves about 20% of the allocated JVM heap for user
code and the objects it creates
• When we write functions we should try to minimize the code
and state they capture
• Otherwise we can incur memory issues
22
RDD Operations
• Transformations
• Actions
23
Transformations
• Operations on RDDs that return a new RDD
• Transformed RDDs are computed lazily, only when an
action is called (see the sketch below)
• Two types of operations:
– Element-wise
– Partition-wise
24
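A small illustration of laziness (a sketch, not from the original slides): nothing is computed until the action on the last line.
scala> val numbers = sc.parallelize(1 to 100)      // no computation yet
scala> val doubled = numbers.map(_ * 2)            // still nothing: only the lineage is recorded
scala> val evens = doubled.filter(_ % 4 == 0)      // still lazy
scala> evens.count                                 // the action triggers the whole pipeline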
Transformations: map
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.map(x => x * 2)
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.map(lambda a : a * 2)
25
Transformations: flatMap
26
Scala
scala> val list = List("hello world", "hi")
scala> val values = sc.parallelize(list)
scala> val result = values.flatMap(l => l.split(" "))
Python
>>> values = sc.parallelize(["hello world", "hi"])
>>> result = values.flatMap(lambda line: line.split(" "))
Transformations: filter
27
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.filter(x => x % 2 == 0)
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.filter(lambda x : x % 2 == 0)
Transformations: mapPartitions
28
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.mapPartitions(iter => iter.map(_ * 2))
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.mapPartitions(lambda it: (x * 2 for x in it))
Transformations: mapPartitionsWithIndex
29
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.mapPartitionsWithIndex((idx, iter) => iter.map(e => e * 2))
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.mapPartitionsWithIndex(lambda idx, it: (e * 2 for e in it))
Transformations: sample
30
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.sample(false,0.5D)
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.sample(False, 0.5)
Transformations: Union
31
Scala
scala> val list1= sc.parallelize(1 to 100)
scala> val list2= sc.parallelize(101 to 200)
scala> val result = list1.union(list2)
Python
>>> list1 = sc.parallelize(range(1,101))
>>> list2 = sc.parallelize(range(101,201))
>>> result = list1.union(list2)
Transformations: Intersection
32
Scala
scala> val list1= sc.parallelize(1 to 100)
scala> val list2= sc.parallelize(60 to 200)
scala> val result = list1.intersection(list2)
Python
>>> list1 = sc.parallelize(range(1,101))
>>> list2 = sc.parallelize(range(60,201))
>>> result = list1.intersection(list2)
Transformations: Distinct
33
Scala
scala> val list1= sc.parallelize(1 to 100)
scala> val list2= sc.parallelize(1 to 100)
scala> val result = list1.union(list2).distinct
Python
>>> list1 = sc.parallelize(range(1,101))
>>> list2 = sc.parallelize(range(1,101))
>>> result = list1.union(list2).distinct()
Other Transformations
• pipe(command, [envVars]) => Pipe each partition of the RDD
through a shell command, e.g. an R or bash script. RDD
elements are written to the process's stdin and lines output
to its stdout are returned as an RDD of strings.
• coalesce(numPartitions) => Decrease the number of partitions
in the RDD to numPartitions. Useful when an RDD has shrunk
after a filter operation (see the sketch below)
34
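A hedged example of coalesce() after a selective filter (a sketch, not from the original slides):
scala> val numbers = sc.parallelize(1 to 1000000, 100)   // 100 partitions
scala> val few = numbers.filter(_ % 1000 == 0)           // most partitions are now nearly empty
scala> val compacted = few.coalesce(4)                   // shrink to 4 partitions without a full shuffle
scala> compacted.partitions.size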
Other Transformations
• repartition(numPartitions) => Reshuffle the data in the RDD
randomly to create more or fewer partitions and balance it
across them. This always shuffles all data over the network.
• repartitionAndSortWithinPartitions(partitioner) =>
Repartition the RDD according to the given partitioner and,
within each resulting partition, sort records by their keys.
35
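For illustration, a sketch of repartitionAndSortWithinPartitions on a small key/value RDD; it is more efficient than calling repartition and then sorting inside each partition.
scala> import org.apache.spark.HashPartitioner
scala> val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (1, "z")))
scala> val repartitioned = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
scala> repartitioned.glom().collect()   // records inside each partition are sorted by key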
Actions
36
Actions
• Trigger the computation on the cluster
• Actions return a value to the driver program after
running a computation on the dataset
• For example:
– map is a transformation that passes each element through a
function
– reduce is an action that aggregates all the elements using a
function and returns the result to the driver program
37
Actions: reduce
• Aggregates the elements using a function
(which must be commutative and associative)
scala> val lines = sc.parallelize(1 to 1000)
scala> lines.reduce(_ + _)
38
Actions: collect
• Return all the elements of the dataset as an array at the
driver program.
• This is usually useful after a filter or other operation that
returns a sufficiently small subset of the data.
scala> val lines = sc.parallelize(1 to 1000)
scala> lines.collect
39
Actions: count, first, take(n)
scala> val lines = sc.parallelize(1 to 1000)
scala> lines.count
res1: Long = 1000
scala> lines.first
res2: Int = 1
scala> lines.take(5)
res4: Array[Int] = Array(1, 2, 3, 4, 5)
40
Actions: takeSample, takeOrdered
scala> lines.takeSample(false,10)
res8: Array[Int] = Array(170, 26, 984, 688, 519, 282, 227, 812, 456,
460)
scala> lines.takeOrdered(10)
res10: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
41
Action: Save File
• saveAsTextFile(path)
scala> lines.saveAsTextFile("./prova.txt")
• saveAsSequenceFile(path) → removed in 1.3.0
scala> lines.saveAsSequenceFile("./prova.txt")
• saveAsObjectFile(path)
scala> lines.saveAsObjectFile("./prova.txt")
42
Work With Key/Value Pairs
43
Motivation
• Pair RDDs are useful for operations that let you
work on each key in parallel
• Key/value RDDs are commonly used to perform
aggregations
• Often we will do some initial ETL to get our data into
key/value format (see the sketch below)
44
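For example, a hedged sketch of such an ETL step on the access_log loaded earlier, assuming a common log format whose first field is the client address:
scala> val lines = sc.textFile("/user/root/access_log")
scala> val byClient = lines.map(line => (line.split(" ")(0), 1))   // (clientAddress, 1) pairs
scala> val hitsPerClient = byClient.reduceByKey(_ + _)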
Why key/value pairs
• Let us consider an example
scala> val lines = sc.parallelize(1 to 1000)
scala> val fakePairs = lines.map(v => (v.toString, v))
• The type of fakePairs is RDD[(String, Int)], which by itself exposes
only the basic RDD functions
• But Spark provides PairRDDFunctions with extra methods for
key/value pairs
scala> import org.apache.spark.rdd.RDD._   // <- from Spark 1.3.0
scala> val pairs = rddToPairRDDFunctions(lines.map(i => i -> i.toString))
45
Transformations for key/value
• groupByKey([numTasks]) => Called on a dataset of (K, V)
pairs, returns a dataset of (K, Iterable<V>) pairs.
• reduceByKey(func, [numTasks]) => Called on a dataset of (K,
V) pairs, returns a dataset of (K, V) pairs where the values for
each key are aggregated using the given reduce function func,
which must be of type (V,V) => V.
46
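A brief hedged sketch of the two, on made-up data:
scala> val sales = sc.parallelize(Seq(("apple", 2), ("pear", 1), ("apple", 3)))
scala> val grouped = sales.groupByKey()        // (K, Iterable[V]): ("apple", [2, 3]), ("pear", [1])
scala> val summed = sales.reduceByKey(_ + _)   // (K, V): ("apple", 5), ("pear", 1)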
Transformations for key/value
• sortByKey([ascending], [numTasks]) => Called on a dataset of
(K, V) pairs where K implements Ordered, returns a dataset of
(K, V) pairs sorted by keys in ascending or descending order,
as specified in the boolean ascending argument.
• join(otherDataset, [numTasks]) => Called on datasets of type
(K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all
pairs of elements for each key. Outer joins are supported
through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
47
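Again as a sketch with made-up data (output order may vary):
scala> val ages = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
scala> val cities = sc.parallelize(Seq(("alice", "Bari"), ("bob", "Rome")))
scala> ages.sortByKey().collect()      // sorted by key, i.e. by name
scala> ages.join(cities).collect()     // Array((alice,(30,Bari)), (bob,(25,Rome)))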
Transformations for key/value
• cogroup(otherDataset, [numTasks]) => Called on datasets of
type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>,
Iterable<W>)) tuples. This operation is also called groupWith.
• cartesian(otherDataset) => Called on datasets of types T
and U, returns a dataset of (T, U) pairs (all pairs of elements).
48
Aggregations with PairRDD
• If we have key/value pairs, it is common to want to
aggregate statistics across all elements with the same
key
• Examples are:
– Per key average
– Word count
49
Per Key Average
We use reduceByKey() with mapValues() to compute the
per-key average
>>> rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
scala> rdd.mapValues((_, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
50
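The result above holds (sum, count) pairs per key; a final mapValues yields the actual average (a sketch, assuming a numeric pair RDD named rdd as above):
scala> val sumCount = rdd.mapValues(v => (v.toDouble, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
scala> val avgByKey = sumCount.mapValues { case (sum, count) => sum / count }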
Word Count
We can use the reduceByKey() function
>>> result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
scala> val result = words.map((_, 1)).reduceByKey(_ + _)
51
PairRDD Best Practices
• In general these operations involve shuffling data across the cluster
• From Spark 1.0, PairRDD functions such as cogroup(),
join(), leftOuterJoin(), rightOuterJoin(), groupByKey(),
reduceByKey() and lookup() benefit from data
partitioning (see the sketch below)
• For example, in reduceByKey() the reduce function is first
applied locally on each partition and only the partial results are
sent over the network
52
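A hedged sketch of pre-partitioning a pair RDD so that repeated joins against it avoid re-shuffling it every time (made-up data and partition count):
scala> import org.apache.spark.HashPartitioner
scala> val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
scala> val events = sc.parallelize(Seq((1, "login"), (1, "click"), (2, "login")))
scala> val partitionedUsers = users.partitionBy(new HashPartitioner(8)).persist()
scala> partitionedUsers.join(events).collect()   // the users side is not shuffled again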
PairRDD Best Practices
• In general it is better to prefer reduceByKey() over
groupByKey(), since it combines values locally before shuffling
53
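As an illustration (a sketch, assuming the words RDD from the word-count slide), both of these compute the same counts, but the groupByKey version ships every single 1 across the network before summing:
scala> val pairs = words.map((_, 1))
scala> val viaGroup = pairs.groupByKey().mapValues(_.sum)   // shuffles all the values
scala> val viaReduce = pairs.reduceByKey(_ + _)             // combines locally, shuffles only partial sums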
Shared Variables
55
Shared Variables
• Each function passed to a Spark operation is
executed on a remote cluster node
• The variables it references are copied to each machine
• No updates to the variables on the remote
machines are propagated back to the driver program
• To share state, Spark supports two kinds of shared variables:
broadcast variables and accumulators
56
Broadcast Variables
• Allow the programmer to keep a read-only variable
cached on each machine.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
scala> broadcastVar.value
>>> broadcastVar = sc.broadcast([1, 2, 3])
>>> broadcastVar.value
57
Accumulators
• They can be used to implement counters (as in
MapReduce) or sums.
scala> val accum = sc.accumulator(0, "My Accumulator")
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
scala> accum.value
>>> accum = sc.accumulator(0)
>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
>>> accum.value
58
Spark SQL (SHARK)
59
Spark SQL
Provides 3 main capabilities:
1.Load data from different sources (JSON, Hive and
Parquet)
2.Query the data using SQL
3.Integration between SQL and the regular
Python/Java/Scala APIs
This API is changing with the introduction of the DataFrames API
60
Initializing Spark SQL
The entry point is a SQLContext (or HiveContext); all you need to
create one is a SparkContext.
•If we have a link to Hive
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc)
•otherwise
scala> import org.apache.spark.sql.SQLContext
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
61
Basic Query Example
• To query a table we call the sql() method
on the Hive or SQL context
scala> val table = hiveContext.jsonFile("file.json")
scala> table.registerTempTable("tweets")
scala> val topTweets = hiveContext.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")
Download the file from: https://raw.githubusercontent.com/databricks/learning-spark/master/files/testweet.json
62
Schema RDD
• Both loading data and executing queries return a
SchemaRDD.
• A SchemaRDD is an RDD composed of:
– Row objects, plus
– information about the schema and columns
• Row objects are wrappers around arrays of basic
types (integer, string, double, …); see the sketch below
63
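A hedged sketch of accessing fields of the Row objects returned by the tweets query above (assuming text and retweetCount were inferred as string and long):
scala> val texts = topTweets.map(row => row.getString(0))    // first column: text
scala> val counts = topTweets.map(row => row.getLong(1))     // second column: retweetCount
scala> texts.collect().foreach(println)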
Data Types
• All the data types of Spark SQL are available after importing
scala> import org.apache.spark.sql.types._
64
http://spark.apache.org/docs/latest/sql-programming-guide.html#data_types
Loading and Saving Data
• Spark SQL supports different structured data sources
out of the box:
– Hive tables,
– JSON,
– Parquet files,
– JDBC and NoSQL sources, and
– regular RDDs converted to SchemaRDDs
65
Apache Hive
• In this scenario, Spark SQL supports any Hive-
supported storage format:
– text files, RCFiles, Parquet, Avro, Protobuf
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc)
scala> val rows = hiveContext.sql("SELECT key, value FROM mytable")
scala> val keys = rows.map(row => row.getInt(0))
66
JDBC NoSQL
• Spark SQL supports drivers for several external sources:
– JDBC drivers: Postgres, MySQL, ..
– NoSQL: HBase, Cassandra, MongoDB, Elastic.co
67
scala> val jdbcDF = sqlContext.load("jdbc", Map(
| "url" -> "jdbc:postgresql:dbserver",
| "dbtable" -> "schema.tablename"))
scala> val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "127.0.0.1")
scala> val rdd = sc.cassandraTable("test", "kv")
Parquet.io
• Column-Oriented storage format that can store records with
nested fields efficiently.
• Spark SQL supports reading from and writing to this format
scala> val people: RDD[Person] = ...
scala> people.saveAsParquetFile("people.parquet")
scala> val parquetFile = sqlContext.parquetFile("people.parquet")
scala> parquetFile.registerTempTable("parquetFile")
scala> val teenagers = sqlContext.sql("SELECT name FROM parquetFile
WHERE age >= 13 AND age <= 19")
68
JSON
• Spark SQL loads JSON via:
– jsonFile: loads data from a directory of JSON files
– jsonRDD: loads data from an RDD of JSON strings
scala> val path = "examples/src/main/resources/people.json"
scala> val people = sqlContext.jsonFile(path)
scala> people.printSchema()
scala> val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
scala> val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
69
Partition Discovery
• Table partitioning is a common optimization approach used in
systems like Hive
• Data are usually stored in different directories, with
partitioning column values encoded in the path of each
partition directory
70
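For instance, a hypothetical partitioned layout (a sketch; the partition columns gender and country are encoded in the directory names and can be discovered from the paths):
path/to/table/gender=male/country=US/data.parquet
path/to/table/gender=male/country=IT/data.parquet
path/to/table/gender=female/country=US/data.parquet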
DataFrames API
• It is a distributed collection of data organized into named
columns
• equivalent to a table in a relational database or a data frame
in R/Python
scala> val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
scala> df.show()
scala> df.select("name", "age + 1").show()
scala> df.filter(df("age") > 21).show()
• Not stable right now
71
Spark Streaming
72
Overview
• Extension of the core API for processing live data streams
• Data can be ingested from Kafka, Flume, Twitter, ZeroMQ,
Kinesis or TCP sockets
• The data can be processed using complex algorithms expressed
with high-level functions like map, reduce, join and window.
73
How it works internally
• It receives live input data streams and divides the
data into batches
• These batches are processed by the Spark engine to
generate the final stream of results in batches.
74
Example: Word Count
• Create the streaming context
scala> import org.apache.spark._
scala> import org.apache.spark.streaming._
scala> val ssc = new StreamingContext(sc, Seconds(5))
• Create a DStream
scala> val lines = ssc.socketTextStream("localhost", 9999)
scala> val words = lines.flatMap(_.split(" "))
75
Example: Word Count
• Perform the streaming word count
scala> val words = lines.flatMap(_.split(" "))
scala> val pairs = words.map(word => (word, 1))
scala> val wordCounts = pairs.reduceByKey(_ + _)
scala> wordCounts.print()
• Start the streaming processing
scala> ssc.start()
scala> ssc.awaitTermination()
76
Example Word Count
• Start a shell in the container and install netcat
– docker exec -it <container name> bash
– yum install nc.x86_64
• Start a netcat on port 9999
– nc -lk 9999
• Write some words
77
Discretized Streams (DStreams)
• A DStream represents a continuous stream of data
• Internally, a DStream is represented by a continuous series of
RDDs
• Each RDD in a DStream contains data from a certain interval
78
Operations on DStreams
• Any operation applied on a DStream translates to operations
on the underlying RDDs
79
Streaming Sources
• Apart from the socket example above, we can create streams from:
– Basic sources: files (HDFS, S3, NFS), Akka actors, and queues
of RDDs as a stream (for testing); see the sketch below
– Advanced sources: systems like Kafka, Flume,
Twitter, ZeroMQ, Kinesis
• Advanced sources are provided as external libraries
80
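A hedged sketch of two basic sources (hypothetical paths; queueStream is handy for tests):
scala> val fileStream = ssc.textFileStream("hdfs:///user/root/incoming")   // picks up new files in the directory
scala> val rddQueue = scala.collection.mutable.Queue(sc.parallelize(1 to 10))
scala> val testStream = ssc.queueStream(rddQueue)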
Advanced Source: Twitter
• Linking: Add the artifact spark-streaming-twitter_2.10 to the
SBT/Maven project dependencies.
• Programming: create a DStream with
TwitterUtils.createStream
scala> import org.apache.spark.streaming.twitter._
scala> TwitterUtils.createStream(ssc, None)
81
Transformations on DStreams
82
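A few common DStream transformations, shown as a sketch using the lines stream from the word-count example:
scala> val words = lines.flatMap(_.split(" "))                  // element-wise, as on RDDs
scala> val lengths = words.map(_.length)
scala> val maxLen = lengths.reduce((a, b) => math.max(a, b))    // per-batch reduce
scala> val errors = lines.filter(_.contains("ERROR"))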
Output Operations on DStreams
83
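Two common output operations, as a sketch (hypothetical output path prefix):
scala> wordCounts.print()                                           // print the first elements of every batch
scala> wordCounts.saveAsTextFiles("hdfs:///user/root/counts")       // one output directory per batch
scala> wordCounts.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))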
Sliding Window Operations
• Spark Streaming also provides windowed computations
– window length: the duration of the window (e.g. 3 batch intervals)
– sliding interval: the interval at which the window operation is
performed (e.g. 2 batch intervals)
scala> val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
84
Examples
http://ampcamp.berkeley.edu/5/exercises/index.html
85
Download Data
• https://www.dropbox.com/s/nsep3m3dv7yejrm/training-download
• Archive with example data
• Machine learning projects in Scala
86
Interactive Analysis
scala> sc
res: spark.SparkContext = spark.SparkContext@470d1f30
Load the data
scala> val pagecounts = sc.textFile("data/pagecounts")
INFO mapred.FileInputFormat: Total input paths to process : 74
pagecounts: spark.RDD[String] = MappedRDD[1] at textFile at <console>:12
87
Interactive Analysis
• Get the first 10 records
scala> pagecounts.take(10)
• Print the element
scala> pagecounts.take(10).foreach(println)
20090505-000000 aa.b ?71G4Bo1cAdWyg 1 14463
20090505-000000 aa.b Special:Statistics 1 840
20090505-000000 aa.b Special:Whatlinkshere/MediaWiki:Returnto 1 1019
88
Interactive Analysis
scala> pagecounts.count
89
http://localhost:4040
Interactive Analysis
• To avoid reloading the RDD for each operation, we
can cache it
scala> val enPages = pagecounts.filter(_.split(" ")(1) == "en").cache
• The next time we call an operation on enPages it will be served
from the cache
scala> enPages.count
90
Interactive Analysis
• Let us generate a histogram of total page views on English Wikipedia
for the date range in our dataset
scala> val enTuples = enPages.map(line => line.split(" "))
scala> val enKeyValuePairs = enTuples.map(line => (line(0).substring(0, 8), line(3).toInt))
scala> enKeyValuePairs.reduceByKey(_+_, 1).collect
91
Other Exercise series
• Spark SQL: Use the Spark shell to write interactive
SQL queries
• Tachyon: Deploy Tachyon and try simple
functionalities.
• MLlib: Build a movie recommender with Spark
• GraphX: Explore graph-structured data and graph
algorithms
92
http://ampcamp.berkeley.edu/5/
http://ampcamp.berkeley.edu/big-data-mini-course-home/
End
93

Editor's Notes

  • #47 Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
  • #92 http://ampcamp.berkeley.edu/5/exercises/data-exploration-using-spark.html