Introduction to Spark and
its Data Analysis and Use Cases in Big Data
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
KCC 2015, Jeju University, Korea, June 24, 2015
Contents
 Myself
 Introduction to Big Data
 Spark Cores
 RDD
 Task Scheduling
 Spark SQL, Streaming, ML
 Examples
 Use Cases
Myself
Name: 우종욱, Jongwook Woo
Experience:
 2012 - Present
– Certified Cloudera Instructor: R&D, Consulting, Training
 2012 - Present: Big Data Academic Partnerships
– Cloudera, Hortonworks Partner for Hadoop Training
– Amazon AWS, Microsoft Azure, IBM Bluemix
 Since 2002: Professor at California State Univ Los Angeles
 Since 1998: R&D consulting in Hollywood
– Implemented eBusiness applications using J2EE and middleware
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
 Since 2007: Exposed to Big Data
 PhD in 2001: Computer Science and Engineering at USC
Myself
Experience (Cont'd): Bringing Big Data training and R&D to Korea since 2009
 2014: Training on Hadoop and its ecosystems
• Data Analysis / Science
• Hadoop Developer, Admin, HBase
• Spark
 Summer 2013, Igloo Security:
– Collect, search, and analyze security log files, 30GB - 100GB / day
• Hadoop, Solr, Java, Cloudera
 Sept 2013: Samsung Advanced Technology Training Institute
– Training on Hadoop and its ecosystems
 Since 2008
– Introducing Hadoop, Big Data, and related education in universities and research centers
Experience in Big Data
 Grants
 Received IBM Bluemix Education Grant (April 2015 - March 2016)
 Received Microsoft Windows Azure Educator Grant (Oct 2013 - July 2016)
 Received Amazon AWS in Education Research Grant (July 2012 - July 201)
 Received Amazon AWS in Education Coursework Grants (July 2012 - July 2015, Jan 2011 - Dec 2011)
 Partnership
 Academic Education Partnership with Cloudera since June 2012
 Academic Partnership with Hortonworks since May 2015
Experience in Big Data
 Certificates
 Certified Cloudera Hadoop Instructor
 Certified Cloudera Hadoop Developer / Administrator / HBase / Spark
 Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I”, July 8, 2012
 Certificate of 10gen Training Course, “M101: MongoDB Development”, Dec 24, 2012
 Blog and GitHub for Hadoop and its ecosystems
 http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
 https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop
 https://github.com/dalgual
Data Issues
Large-scale data
 Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the Web
– Sensor data (IoT), bioinformatics, social computing, smartphones, online games…
Cannot be handled with the legacy approach
 Too big
 Un-/semi-structured data
 Too expensive
Need new systems
 Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
 How to store Big Data
– GFS
– On inexpensive commodity computers
 How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive computers
• In effect, its own supercomputers
What is Hadoop?
Creator of Hadoop: Doug Cutting
 Chief Architect at Cloudera
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it fast in parallel
 Hadoop
– An inexpensive supercomputer
– You can build and run your applications on it
Hadoop CDH: Logical Diagram
[Diagram: a web browser controls the Cloudera Manager (CM) server over HTTP(S); the CM server manages many cluster nodes, each running a CM Agent and CDH with HDFS; services such as Hive, ZooKeeper, and Impala run on top of the cluster.]
Big Data Tool: Hadoop
Big Data market potential is BIG
Source: BofA Merrill Lynch Global Research, March 2012
 Hardware: $21B
 Services: $42B
 Software: $34B
 Complementary database: $35B
 Hadoop: $14B
Definition: Big Data
Big Data
 Finding hidden value in data sets?
– No!
• That is just one big data application, and one we have long pursued with traditional systems
– With legacy computers, DW, DB
 Big Data is the new approach of using a supercomputer for data-intensive computing, called Hadoop
Alternative to Hadoop MapReduce
Limitations of MapReduce
 Hard to program in Java
 Batch processing
– Not interactive
 Disk storage for intermediate data
– Performance issue
Spark, by the UC Berkeley AMPLab
 In-memory storage for intermediate data
 10-100x faster than network and disk
Spark
In-memory data computing
 Fast, iterative processing
 Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
 HDFS
 HBase, Hive, SequenceFiles
New programming model with faster data sharing
 Good for complex multi-stage applications
– Iterative graph algorithms, machine learning
 Interactive queries
Spark Publications
Spark: papers in 2010 and 2012
 “Spark: Cluster Computing with Working Sets”, Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, USENIX HotCloud (2010)
 “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, NSDI (2012)
Spark
RDDs, Transformations, and Actions
[Diagram: Spark Core underlies Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs), MLlib (machine learning; RDD-based matrices), GraphX (graph), and SparkR.]
Spark Drivers and Workers
Driver
 Client
– With the SparkContext
• Creates RDDs
Workers
 Spark Executors
 Run on cluster nodes
– Production
 Run in local threads
– Development
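A minimal sketch of the two modes, assuming the Spark 1.x Scala API used elsewhere in this deck (master URL and app name are illustrative; a JVM holds only one SparkContext, so the production line is shown commented out, just for comparison):
// Development: workers run as local threads in the driver's JVM
val sc = new SparkContext("local[2]", "MyAppDev")
// Production: executors run on cluster nodes behind a master URL
// val sc = new SparkContext("spark://master:7077", "MyApp")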
Spark Components
Your program (Spark Driver/Client, the app master):
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…)
 .count()
...
[Diagram: the driver holds the RDD graph, scheduler, block tracker, and shuffle tracker; the cluster manager launches Spark workers, each with a block manager and task threads; data comes from HDFS, HBase, Amazon S3…]
RDD
Resilient Distributed Dataset (RDD)
 Distributed collections of objects that can be cached in memory
 RDD, DStream, SchemaRDD, PairRDD
 Immutable
 Lineage
– History of how the objects were derived
– Lets Spark automatically and efficiently recompute lost data
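The lineage above can be inspected directly; a small sketch (toDebugString is the actual RDD API; the path is illustrative):
val file = sc.textFile("hdfs://...")           // HadoopRDD
val errors = file.filter(_.contains("ERROR"))  // FilteredRDD remembers how it was derived
println(errors.toDebugString)                  // prints the lineage Spark would use to recompute lost partitions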
RDD Operations
Transformations
 Define new RDDs from the current one
– Lazy: not computed immediately (see the sketch below)
 map(), filter(), join()
Actions
 Return values
 count(), collect(), take(), save()
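A small sketch of that laziness (standard Spark Scala API; the numbers are illustrative): nothing runs until the action is called.
val nums = sc.parallelize(1 to 1000000)
val evens = nums.filter(_ % 2 == 0)  // transformation: only records the plan
val doubled = evens.map(_ * 2)       // still no computation
println(doubled.count())             // action: triggers the job and prints 500000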
Programming in Spark
Scala
 Functional programming
– The fundamental unit of programming is the function
• Input/output is via functions
 No side effects
– No state
Python
 Legacy code, large set of libraries
Java
Example Job in Scala Spark
val sc = new SparkContext(
 "spark://...", "MyJob", home, jars)
val file = sc.textFile("hdfs://...")           // resilient distributed dataset (RDD)
val errors = file.filter(_.contains("ERROR"))  // transformation
errors.count()                                 // action
Data Locality
First run: data comes from HDFS, so use the HadoopRDD's locality
FilteredRDD: a transformation
 FilteredRDD: no data yet
 Lineage:
– If something falls out of the FilteredRDD, go back to HDFS
count(): an action
 Generates the data
RDD Graph: No values on Transformations
[Diagram: dataset-level view: file = HadoopRDD(path = hdfs://...); partition-level view: RDD 1 split into 4 partitions, with tasks (Task 1, Task 2, …) assigned per partition.]
RDD Graph: No values on Transformations
[Diagram: file = HadoopRDD(path = hdfs://...); errors = FilteredRDD(func = _.contains(…)) on top of it; partition-level view: RDD 1 feeding RDD 2, still with no data computed.]
RDD Graph: values after actions
[Diagram: errors.count() runs tasks (Task 1, Task 2, …) over the four partitions; the HadoopRDD values v1-v4 are filtered into f1-f4 and the partial counts are summed into the value 477.]
Scheduling Process
Example job:
rdd1.join(rdd2)
 .groupBy(…)
 .filter(…)
 RDD Objects: build the operator DAG
 DAGScheduler: splits the graph into stages of tasks and submits each stage as ready (a stage is agnostic to the operators)
 TaskScheduler: launches the TaskSets via the cluster manager and retries failed or straggling tasks (doesn't know about stages)
 Worker: executes tasks; stores and serves blocks (block manager, task threads)
RDD Interface
 Set of partitions (“splits”)
 List of dependencies on parent RDDs
 Function to compute a partition given its parents
 Optional preferred locations
 Optional partitioning info (Partitioner)
Captures all current Spark operations!
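These five items correspond closely to the RDD developer API; a simplified sketch, abbreviated from the real org.apache.spark.rdd.RDD signatures (Partition, Dependency, TaskContext, and Partitioner come from org.apache.spark):
abstract class RDD[T] {
  // Set of partitions ("splits")
  protected def getPartitions: Array[Partition]
  // List of dependencies on parent RDDs
  protected def getDependencies: Seq[Dependency[_]]
  // Function to compute a partition given its parents
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // Optional preferred locations (e.g., HDFS block hosts)
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  // Optional partitioning info
  val partitioner: Option[Partitioner] = None
}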
Example: HadoopRDD
 partitions = one per HDFS block
 dependencies = none
 compute(partition) = read the corresponding block
 preferredLocations(part) = HDFS block location
 partitioner = none
Example: FilteredRDD
 partitions = same as parent RDD
 dependencies = “one-to-one” on the parent
 compute(partition) = compute the parent and filter it
 preferredLocations(part) = none (ask the parent)
 partitioner = none
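As an illustration, the FilteredRDD behavior above can be written against that interface; a sketch (the class name is ours; passing the parent to the RDD constructor wires up the one-to-one dependency):
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class MyFilteredRDD[T: ClassTag](prev: RDD[T], f: T => Boolean)
    extends RDD[T](prev) {
  // partitions = same as the parent RDD
  override protected def getPartitions: Array[Partition] = firstParent[T].partitions
  // compute(partition) = compute the parent and filter it
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context).filter(f)
}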
Example: JoinedRDD
 partitions = one per reduce task
 dependencies = “shuffle” on each parent
 compute(partition) = read and join the shuffled data
 preferredLocations(part) = none
 partitioner = HashPartitioner(numTasks)
DAG Scheduler
Stage:
 Consists of tasks that can be performed on the same node
Interface:
 Receives a “target” RDD,
– a function to run on each partition,
– and a listener for results
Roles:
 Build stages of Task objects (code + preferred locations)
 Submit them to the TaskScheduler as ready
 Resubmit failed stages if outputs are lost
Dependency Types
 “Narrow” deps: a stage pipeline to be run on the same node
– map, filter
– union
– join with inputs co-partitioned
 “Wide” (shuffle) deps: the boundary of stages
– groupByKey
– join with inputs not co-partitioned
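The stage boundary is visible from the shell; a quick sketch (the shuffle in reduceByKey is a wide dependency; everything before it pipelines into one stage):
val words = sc.textFile("hdfs://...").flatMap(_.split(" "))  // narrow: pipelined
val pairs = words.map(w => (w, 1))                           // narrow: same stage
val counts = pairs.reduceByKey(_ + _)                        // wide: shuffle starts a new stage
println(counts.toDebugString)  // the printed lineage shows the ShuffledRDD boundary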
Scheduler Optimizations
 Pipelines operators within a stage (e.g., map, union)
 Chooses join algorithms based on partitioning (minimizes shuffles)
[Diagram: an RDD graph A-G with map, union, groupBy, and join cut into Stage 1, Stage 2, and Stage 3; shaded boxes mark previously computed partitions.]
Scheduler Optimizations
Conceptually, for the graph above:
 Stage 1: 3 tasks
 Stage 2: 4 tasks
 Stage 3: 3 tasks
 Total: 3 stages, 10 tasks
[Diagram: the same RDD graph A-G with the tasks of each stage marked; previously computed partitions need no task.]
TaskScheduler Details
 Can run multiple concurrent TaskSets, but currently does so in FIFO order
 Maintains one TaskSetManager per TaskSet
– Tracks its locality and failure info
 Polls these for tasks in order (FIFO)
Spark SQL
 Turning an RDD into a relation
 Querying using SQL
 Importing data from Hive and exporting to the Parquet file format
 In-memory columnar storage
– Spark SQL can cache tables using an in-memory columnar format, as in the sketch below:
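The slide ends with a colon; presumably the call intended here is the Spark 1.x SQLContext pair below (assuming a SQLContext named sqlContext):
sqlContext.cacheTable("people")    // caches the registered table in the in-memory columnar format
sqlContext.uncacheTable("people")  // releases the cached table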
Turning an RDD into a Relation
// Define the schema using a case class.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people =
 sc.textFile("examples/src/main/resources/people.txt")
 .map(_.split(","))
 .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
Querying using SQL
// SQL statements can be run directly on RDDs.
val teenagers =
 sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()
// Language-integrated queries (a la LINQ); an alternative to the SQL string above:
val teenagers2 =
 people.where('age >= 10).where('age <= 19).select('name)
Import and Export
// Save SchemaRDDs directly to Parquet.
people.saveAsParquetFile("people.parquet")
// Load data stored in Hive.
val hiveContext =
 new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
// Queries can be expressed in HiveQL.
hql("FROM src SELECT key, value")
Spark Streaming
 DStream
– The RDD abstraction for streaming
 Windows
– Select a DStream over a sliding window of the streaming data, as in the sketch below
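A minimal sketch of a windowed DStream, using the Spark 1.x Streaming Scala API (host, port, and durations are illustrative):
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(1))       // DStream = a stream of 1-second RDD batches
val lines = ssc.socketTextStream("localhost", 9999)  // DStream[String]
val counts = lines.flatMap(_.split(" "))
 .map(word => (word, 1))
 .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))  // 30s window, sliding every 10s
counts.print()
ssc.start()
ssc.awaitTermination()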
MLlib
MLlib
 Sparse vector support
 Decision trees
 Linear algebra
– SVD and PCA
 Evaluation support
 3 contributors in the last 6 months
MLlib
K-means clustering:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(
 s => Vectors.dense(s.split(' ').map(_.toDouble)))
val clusters = KMeans.train(parsedData, 4, 100)  // k = 4 clusters, up to 100 iterations
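As a follow-up, the MLlib model can report the clustering cost (Within Set Sum of Squared Errors), a common way to sanity-check the choice of k:
val wssse = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + wssse)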
WordCount.java (Driver)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
 public static void main(String[] args) throws Exception {
 if (args.length != 2) {
 System.out.println("usage: [input] [output]");
 System.exit(-1);
 }
 Job job = Job.getInstance(new Configuration());
 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(IntWritable.class);
 job.setMapperClass(WordMapper.class);
 job.setReducerClass(SumReducer.class);
 job.setInputFormatClass(TextInputFormat.class);
 job.setOutputFormatClass(TextOutputFormat.class);
 FileInputFormat.setInputPaths(job, new Path(args[0]));
 FileOutputFormat.setOutputPath(job, new Path(args[1]));
 job.setJarByClass(WordCount.class);
 job.submit();
 }
}
WordMapper.java (Mapper class)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
 private Text word = new Text();
 private final static IntWritable one = new IntWritable(1);
 @Override
 public void map(Object key, Text value,
 Context context) throws IOException, InterruptedException {
 // Break the line into words for processing.
 StringTokenizer wordList = new StringTokenizer(value.toString());
 while (wordList.hasMoreTokens()) {
 word.set(wordList.nextToken());
 context.write(word, one);
 }
 }
}
SumReducer.java (Reducer class)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable totalWordCount = new IntWritable();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;
Iterator<IntWritable> it=values.iterator();
while (it.hasNext()) {
wordCount += it.next().get();
}
totalWordCount.set(wordCount);
context.write(key, totalWordCount);
}
}
Example Python Code: Word Count
text_file = sc.textFile("hdfs://namenode:port/user/myname/theloneranger")
counts = text_file.flatMap(lambda line: line.split(" ")) \
 .map(lambda word: (word, 1)) \
 .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
val: immutable variable
sc: SparkContext, preconfigured in the Spark shell
textFile(): reads input data from HDFS
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
Sample input: "An idealistic lawyer in Texas he rides with his brother and fellow Texas Rangers"
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
line.split(" "): splits a line into words separated by " "
flatMap(): each word becomes an element
map(): each word is paired with 1
reduceByKey(): adds up the 1's for each word
Word Count: RDD
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
After flatMap():
An
idealistic
lawyer
in
Texas
he
rides
with
his
brother
and
fellow
Texas
Rangers
Word Count: RDD
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
After map():
(An, 1)
(idealistic, 1)
(lawyer, 1)
(in, 1)
(Texas, 1)
(he, 1)
(rides, 1)
(with, 1)
(his, 1)
(brother, 1)
(and, 1)
(fellow, 1)
(Texas, 1)
(Rangers, 1)
Word Count: RDD
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
After reduceByKey():
(An, 1)
(idealistic, 1)
(lawyer, 1)
(in, 1)
(he, 1)
(rides, 1)
(with, 1)
(his, 1)
(brother, 1)
(and, 1)
(fellow, 1)
(Texas, 2)
(Rangers, 1)
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
saveAsTextFile(): saves the counts to HDFS
Word Count Code
Could be used for other applications
 Sentiment analysis
 Market basket analysis
 …
Next example
Example Code: Market Basket Analysis
// n-grams to pair items
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
 s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a+b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result3.2")
Example Code: Market Basket Analysis
// n-grams to pair items
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
 s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
toLowerCase: converts to lowercase letters
split(inSep): splits by the separator inSep
sliding(n): selects n words as a group
_.sorted: sorts the elements in the group
mkString(outSep): joins the elements with outSep
toSet: makes the groups a set with unique elements
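A tiny worked example of ngram under those steps (the input line is illustrative):
ngram("Beer Diaper Milk", " ", "+", 2)
// toLowerCase/split -> Array("beer", "diaper", "milk")
// sliding(2)        -> ("beer", "diaper"), ("diaper", "milk")
// sorted + mkString -> "beer+diaper", "diaper+milk"
// toSet             -> Set("beer+diaper", "diaper+milk")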
Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a+b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
 Extracts and counts the bigrams
Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a+b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
 sortByKey(false): sorts the bigrams in descending order of the count
Spark at Yahoo
Two Spark projects in the works:
 Personalizing news pages for Web visitors
– ML algorithms running on Spark
• To figure out what individual users are interested in, and to categorize news stories as they arise, to figure out what types of users would be interested in reading them
– 120 lines of Scala
• Compared to 15,000 lines of C++
– About 1/100 the size
 Analytics for advertising
Spark at Yahoo (Cont'd)
Two Spark projects in the works:
 Analytics for advertising
– Hive on Spark (Shark's) interactive capability
• Use existing BI tools to view and query their advertising analytics data collected in Hadoop
Spark at Conviva
One of the largest streaming video companies on the Internet
 About 4 billion video feeds per month
– Second only to YouTube
Uses Spark Streaming
 To learn network conditions in real time
 To ensure a high quality of service (QoS)
– By avoiding dreaded screen buffering
Spark at ClearStory
One of Databricks' first customers
 Needs its interactive, real-time product
– For data integration
 Needed a way to help business users merge their internal data sources with external sources
– Such as social media traffic and public data feeds
• Without requiring complex data modeling
Spark Training
 California State University Los Angeles (Prof. Jongwook Woo)
 UC Berkeley edX (MOOC)
 UC Berkeley AMP Camp
 Stanford
 Cloudera, Hortonworks, DataStax training courses
 IBM Big Data University
Training Hadoop and Spark
[Photo: Cloudera's visit to interview Jongwook Woo]
Training Hadoop and Spark
Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles
Conclusion
Big Data is Hadoop
Spark is the way to go for Big Data
Spark training is important
Questions?
References
 Hadoop, http://hadoop.apache.org
 Apache Spark Word Count Example, http://spark.apache.org
 Databricks, http://www.databricks.com
 “Market Basket Analysis using Spark”, Jongwook Woo, Journal of Science and Technology, April 2015, Volume 5, No 4, pp. 207-209, ISSN 2225-7217, ARPN
 https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
References
 Introduction to Big Data with Apache Spark, Databricks
 Stanford Spark Class, http://stanford.edu/~rezab
 Cornell University, CS5304
 DS320: DataStax Enterprise Analytics with Spark
 Cloudera, http://www.cloudera.com
 Hortonworks, http://www.hortonworks.com
 Apache Spark: 3 Real-World Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/