Introduction to Spark and
its Data Analysis and Use Cases in Big Data
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
KCC 2015, Jeju University, Korea, June 24, 2015
Contents
 Myself
 Introduction to Big Data
 Spark Cores
 RDD
 Task Scheduling
 Spark SQL, Streaming, ML
 Examples
 Use Cases
Myself
Name: 우종욱, Jongwook Woo
Experience:
 2012 - Present
– Certified Cloudera Instructor: R&D, Consulting, Training
 2012 - Present: Big Data Academic Partnerships
– Cloudera, Hortonworks Partner for Hadoop Training
– Amazon AWS, Microsoft Azure, IBM Bluemix
 Since 2002: Professor at California State Univ Los Angeles
 Since 1998: R&D consulting in Hollywood
– Implemented eBusiness applications using J2EE and middleware
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
 Since 2007: Exposed to Big Data
 PhD in 2001: Computer Science and Engineering at USC
Myself
Experience (Cont'd): Bringing Big Data training and R&D to Korea since 2009
 2014: Training on Hadoop and its ecosystems
• Data Analysis / Science
• Hadoop Developer, Admin, HBase
• Spark
 Summer 2013, Igloo Security:
– Collect, search, and analyze security log files, 30GB - 100GB / day
• Hadoop, Solr, Java, Cloudera
 Sept 2013: Samsung Advanced Technology Training Institute
– Training on Hadoop and its ecosystems
 Since 2008
– Introducing Hadoop, Big Data, and related education in universities and research centers
Experience in Big Data
 Grants
 Received IBM Bluemix Education Grant (April 2015 - March 2016)
 Received Microsoft Windows Azure Educator Grant (Oct 2013 - July 2016)
 Received Amazon AWS in Education Research Grant (July 2012 - July 201)
 Received Amazon AWS in Education Coursework Grants (July 2012 - July 2015, Jan 2011 - Dec 2011)
 Partnership
 Academic Education Partnership with Cloudera since June 2012
 Academic Partnership with Hortonworks since May 2015
Experience in Big Data
 Certificates
 Certified Cloudera Hadoop Instructor
 Certified Cloudera Hadoop Developer / Administrator / HBase / Spark
 Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I”, July 8, 2012
 Certificate of 10gen Training Course, “M101: MongoDB Development”, Dec 24, 2012
 Blog and GitHub for Hadoop and its ecosystems
 http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
 https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop
 https://github.com/dalgual
Data Issues
Large-scale data
 Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the Web
– Sensor data (IoT), bioinformatics, social computing, smartphones, online games…
Cannot be handled with the legacy approach
 Too big
 Un-/semi-structured data
 Too expensive
Need new systems
 Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
 How to store Big Data
– GFS
– On inexpensive commodity computers
 How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive computers
• In effect, its own supercomputers
What is Hadoop?
Creator of Hadoop: Doug Cutting
 Chief Architect at Cloudera
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it fast in parallel
 Hadoop
– An inexpensive supercomputer
– You can build and run your applications on it
Hadoop CDH: Logical Diagram
[Diagram: a web browser controls the Cloudera Manager (CM) server over HTTP(S); the CM server manages many cluster nodes, each running a CM Agent and CDH with HDFS; services such as Hive, ZooKeeper, and Impala run on top of the cluster.]
Big Data Tool: Hadoop
Big Data market potential is BIG
Source: BofA Merrill Lynch Global Research, March 2012
 Hardware: $21B
 Services: $42B
 Software: $34B
 Complementary database: $35B
 Hadoop: $14B
Definition: Big Data
Big Data
 Finding hidden value in data sets?
– No!
• That is just one big data application, and one we have long pursued with traditional systems
– With legacy computers, DW, DB
 Big Data is the new approach of using a supercomputer for data-intensive computing, called Hadoop
Alternative to Hadoop MapReduce
Limitations of MapReduce
 Hard to program in Java
 Batch processing
– Not interactive
 Disk storage for intermediate data
– Performance issue
Spark, by the UC Berkeley AMPLab
 In-memory storage for intermediate data
 10-100x faster than network and disk
Spark
In-memory data computing
 Fast, iterative processing
 Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
 HDFS
 HBase, Hive, SequenceFiles
New programming model with faster data sharing
 Good for complex multi-stage applications
– Iterative graph algorithms, machine learning
 Interactive queries
Spark Publications
Spark: papers in 2010 and 2012
 “Spark: Cluster Computing with Working Sets”, Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, USENIX HotCloud (2010)
 “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, NSDI (2012)
Spark
RDDs, Transformations, and Actions
[Diagram: Spark Core underlies Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs), MLlib (machine learning; RDD-based matrices), GraphX (graph), and SparkR.]
Spark Drivers and Workers
Driver
 Client
– With the SparkContext
• Creates RDDs
Workers
 Spark Executors
 Run on cluster nodes
– Production
 Run in local threads
– Development
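A minimal sketch of the two modes, assuming the Spark 1.x Scala API used elsewhere in this deck (master URL and app name are illustrative; a JVM holds only one SparkContext, so the production line is shown commented out, just for comparison):
// Development: workers run as local threads in the driver's JVM
val sc = new SparkContext("local[2]", "MyAppDev")
// Production: executors run on cluster nodes behind a master URL
// val sc = new SparkContext("spark://master:7077", "MyApp")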
Spark Components
Your program (Spark Driver/Client, the app master):
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…)
 .count()
...
[Diagram: the driver holds the RDD graph, scheduler, block tracker, and shuffle tracker; the cluster manager launches Spark workers, each with a block manager and task threads; data comes from HDFS, HBase, Amazon S3…]
RDD
Resilient Distributed Dataset (RDD)
 Distributed collections of objects that can be cached in memory
 RDD, DStream, SchemaRDD, PairRDD
 Immutable
 Lineage
– History of how the objects were derived
– Lets Spark automatically and efficiently recompute lost data
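The lineage above can be inspected directly; a small sketch (toDebugString is the actual RDD API; the path is illustrative):
val file = sc.textFile("hdfs://...")           // HadoopRDD
val errors = file.filter(_.contains("ERROR"))  // FilteredRDD remembers how it was derived
println(errors.toDebugString)                  // prints the lineage Spark would use to recompute lost partitions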
RDD Operations
Transformations
 Define new RDDs from the current one
– Lazy: not computed immediately (see the sketch below)
 map(), filter(), join()
Actions
 Return values
 count(), collect(), take(), save()
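A small sketch of that laziness (standard Spark Scala API; the numbers are illustrative): nothing runs until the action is called.
val nums = sc.parallelize(1 to 1000000)
val evens = nums.filter(_ % 2 == 0)  // transformation: only records the plan
val doubled = evens.map(_ * 2)       // still no computation
println(doubled.count())             // action: triggers the job and prints 500000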
Programming in Spark
Scala
 Functional programming
– The fundamental unit of programming is the function
• Input/output is via functions
 No side effects
– No state
Python
 Legacy code, large set of libraries
Java
Example Job in Scala Spark
val sc = new SparkContext(
 "spark://...", "MyJob", home, jars)
val file = sc.textFile("hdfs://...")           // resilient distributed dataset (RDD)
val errors = file.filter(_.contains("ERROR"))  // transformation
errors.count()                                 // action
Data Locality
First run: data comes from HDFS, so use the HadoopRDD's locality
FilteredRDD: a transformation
 FilteredRDD: no data yet
 Lineage:
– If something falls out of the FilteredRDD, go back to HDFS
count(): an action
 Generates the data
RDD Graph: No values on Transformations
[Diagram: dataset-level view: file = HadoopRDD(path = hdfs://...); partition-level view: RDD 1 split into 4 partitions, with tasks (Task 1, Task 2, …) assigned per partition.]
RDD Graph: No values on Transformations
[Diagram: file = HadoopRDD(path = hdfs://...); errors = FilteredRDD(func = _.contains(…)) on top of it; partition-level view: RDD 1 feeding RDD 2, still with no data computed.]
RDD Graph: values after actions
[Diagram: errors.count() runs tasks (Task 1, Task 2, …) over the four partitions; the HadoopRDD values v1-v4 are filtered into f1-f4 and the partial counts are summed into the value 477.]
Scheduling Process
Example job:
rdd1.join(rdd2)
 .groupBy(…)
 .filter(…)
 RDD Objects: build the operator DAG
 DAGScheduler: splits the graph into stages of tasks and submits each stage as ready (a stage is agnostic to the operators)
 TaskScheduler: launches the TaskSets via the cluster manager and retries failed or straggling tasks (doesn't know about stages)
 Worker: executes tasks; stores and serves blocks (block manager, task threads)
RDD Interface
 Set of partitions (“splits”)
 List of dependencies on parent RDDs
 Function to compute a partition given its parents
 Optional preferred locations
 Optional partitioning info (Partitioner)
Captures all current Spark operations!
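These five items correspond closely to the RDD developer API; a simplified sketch, abbreviated from the real org.apache.spark.rdd.RDD signatures (Partition, Dependency, TaskContext, and Partitioner come from org.apache.spark):
abstract class RDD[T] {
  // Set of partitions ("splits")
  protected def getPartitions: Array[Partition]
  // List of dependencies on parent RDDs
  protected def getDependencies: Seq[Dependency[_]]
  // Function to compute a partition given its parents
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // Optional preferred locations (e.g., HDFS block hosts)
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  // Optional partitioning info
  val partitioner: Option[Partitioner] = None
}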
Example: HadoopRDD
 partitions = one per HDFS block
 dependencies = none
 compute(partition) = read the corresponding block
 preferredLocations(part) = HDFS block location
 partitioner = none
Example: FilteredRDD
 partitions = same as parent RDD
 dependencies = “one-to-one” on the parent
 compute(partition) = compute the parent and filter it
 preferredLocations(part) = none (ask the parent)
 partitioner = none
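As an illustration, the FilteredRDD behavior above can be written against that interface; a sketch (the class name is ours; passing the parent to the RDD constructor wires up the one-to-one dependency):
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class MyFilteredRDD[T: ClassTag](prev: RDD[T], f: T => Boolean)
    extends RDD[T](prev) {
  // partitions = same as the parent RDD
  override protected def getPartitions: Array[Partition] = firstParent[T].partitions
  // compute(partition) = compute the parent and filter it
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context).filter(f)
}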
Example: JoinedRDD
 partitions = one per reduce task
 dependencies = “shuffle” on each parent
 compute(partition) = read and join the shuffled data
 preferredLocations(part) = none
 partitioner = HashPartitioner(numTasks)
DAG Scheduler
Stage:
 Consists of tasks that can be performed on the same node
Interface:
 Receives a “target” RDD,
– a function to run on each partition,
– and a listener for results
Roles:
 Build stages of Task objects (code + preferred locations)
 Submit them to the TaskScheduler as ready
 Resubmit failed stages if outputs are lost
Dependency Types
 “Narrow” deps: a stage pipeline to be run on the same node
– map, filter
– union
– join with inputs co-partitioned
 “Wide” (shuffle) deps: the boundary of stages
– groupByKey
– join with inputs not co-partitioned
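The stage boundary is visible from the shell; a quick sketch (the shuffle in reduceByKey is a wide dependency; everything before it pipelines into one stage):
val words = sc.textFile("hdfs://...").flatMap(_.split(" "))  // narrow: pipelined
val pairs = words.map(w => (w, 1))                           // narrow: same stage
val counts = pairs.reduceByKey(_ + _)                        // wide: shuffle starts a new stage
println(counts.toDebugString)  // the printed lineage shows the ShuffledRDD boundary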
Scheduler Optimizations
 Pipelines operators within a stage (e.g., map, union)
 Chooses join algorithms based on partitioning (minimizes shuffles)
[Diagram: an RDD graph A-G with map, union, groupBy, and join cut into Stage 1, Stage 2, and Stage 3; shaded boxes mark previously computed partitions.]
Scheduler Optimizations
Conceptually, for the graph above:
 Stage 1: 3 tasks
 Stage 2: 4 tasks
 Stage 3: 3 tasks
 Total: 3 stages, 10 tasks
[Diagram: the same RDD graph A-G with the tasks of each stage marked; previously computed partitions need no task.]
TaskScheduler Details
 Can run multiple concurrent TaskSets, but currently does so in FIFO order
 Maintains one TaskSetManager per TaskSet
– Tracks its locality and failure info
 Polls these for tasks in order (FIFO)
Spark SQL
 Turning an RDD into a relation
 Querying using SQL
 Importing data from Hive and exporting to the Parquet file format
 In-memory columnar storage
– Spark SQL can cache tables using an in-memory columnar format, as in the sketch below:
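The slide ends with a colon; presumably the call intended here is the Spark 1.x SQLContext pair below (assuming a SQLContext named sqlContext):
sqlContext.cacheTable("people")    // caches the registered table in the in-memory columnar format
sqlContext.uncacheTable("people")  // releases the cached table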
Turning an RDD into a Relation
// Define the schema using a case class.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people =
 sc.textFile("examples/src/main/resources/people.txt")
 .map(_.split(","))
 .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
Querying using SQL
// SQL statements can be run directly on RDDs.
val teenagers =
 sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()
// Language-integrated queries (a la LINQ); an alternative to the SQL string above:
val teenagers2 =
 people.where('age >= 10).where('age <= 19).select('name)
Import and Export
// Save SchemaRDDs directly to Parquet.
people.saveAsParquetFile("people.parquet")
// Load data stored in Hive.
val hiveContext =
 new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
// Queries can be expressed in HiveQL.
hql("FROM src SELECT key, value")
Spark Streaming
 DStream
– The RDD abstraction for streaming
 Windows
– Select a DStream over a sliding window of the streaming data, as in the sketch below
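A minimal sketch of a windowed DStream, using the Spark 1.x Streaming Scala API (host, port, and durations are illustrative):
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(1))       // DStream = a stream of 1-second RDD batches
val lines = ssc.socketTextStream("localhost", 9999)  // DStream[String]
val counts = lines.flatMap(_.split(" "))
 .map(word => (word, 1))
 .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))  // 30s window, sliding every 10s
counts.print()
ssc.start()
ssc.awaitTermination()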
MLlib
MLlib
 Sparse vector support
 Decision trees
 Linear algebra
– SVD and PCA
 Evaluation support
 3 contributors in the last 6 months
MLlib
K-means clustering:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(
 s => Vectors.dense(s.split(' ').map(_.toDouble)))
val clusters = KMeans.train(parsedData, 4, 100)  // k = 4 clusters, up to 100 iterations
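As a follow-up, the MLlib model can report the clustering cost (Within Set Sum of Squared Errors), a common way to sanity-check the choice of k:
val wssse = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + wssse)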
WordCount.java (Driver)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
 public static void main(String[] args) throws Exception {
 if (args.length != 2) {
 System.out.println("usage: [input] [output]");
 System.exit(-1);
 }
 Job job = Job.getInstance(new Configuration());
 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(IntWritable.class);
 job.setMapperClass(WordMapper.class);
 job.setReducerClass(SumReducer.class);
 job.setInputFormatClass(TextInputFormat.class);
 job.setOutputFormatClass(TextOutputFormat.class);
 FileInputFormat.setInputPaths(job, new Path(args[0]));
 FileOutputFormat.setOutputPath(job, new Path(args[1]));
 job.setJarByClass(WordCount.class);
 job.submit();
 }
}
WordMapper.java (Mapper class)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
 private Text word = new Text();
 private final static IntWritable one = new IntWritable(1);
 @Override
 public void map(Object key, Text value,
 Context context) throws IOException, InterruptedException {
 // Break the line into words for processing.
 StringTokenizer wordList = new StringTokenizer(value.toString());
 while (wordList.hasMoreTokens()) {
 word.set(wordList.nextToken());
 context.write(word, one);
 }
 }
}
SumReducer.java (Reducer class)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable totalWordCount = new IntWritable();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;
Iterator<IntWritable> it=values.iterator();
while (it.hasNext()) {
wordCount += it.next().get();
}
totalWordCount.set(wordCount);
context.write(key, totalWordCount);
}
}
Example Python Code: Word Count
text_file = sc.textFile("hdfs://namenode:port/user/myname/theloneranger")
counts = text_file.flatMap(lambda line: line.split(" ")) \
 .map(lambda word: (word, 1)) \
 .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
val: immutable variable
sc: SparkContext, preconfigured in the Spark shell
textFile(): reads input data from HDFS
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
Sample input: "An idealistic lawyer in Texas he rides with his brother and fellow Texas Rangers"
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
line.split(" "): splits a line into words separated by " "
flatMap(): each word becomes an element
map(): each word is paired with 1
reduceByKey(): adds up the 1's for each word
Word Count: RDD
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
After flatMap():
An
idealistic
lawyer
in
Texas
he
rides
with
his
brother
and
fellow
Texas
Rangers
Word Count: RDD
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
After map():
(An, 1)
(idealistic, 1)
(lawyer, 1)
(in, 1)
(Texas, 1)
(he, 1)
(rides, 1)
(with, 1)
(his, 1)
(brother, 1)
(and, 1)
(fellow, 1)
(Texas, 1)
(Rangers, 1)
Word Count: RDD
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
After reduceByKey():
(An, 1)
(idealistic, 1)
(lawyer, 1)
(in, 1)
(he, 1)
(rides, 1)
(with, 1)
(his, 1)
(brother, 1)
(and, 1)
(fellow, 1)
(Texas, 2)
(Rangers, 1)
Example Scala Code: Word Count
val text_file =
 sc.textFile("hdfs://namenode:port/user/me/theloneranger")
var counts = text_file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey( (a, b) => a + b)
counts.saveAsTextFile("hdfs://...")
saveAsTextFile(): saves the counts to HDFS
Word Count Code
Could be used for other applications
 Sentiment analysis
 Market basket analysis
 …
Next example
Example Code: Market Basket Analysis
// n-grams to pair items
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
 s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a+b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result3.2")
Example Code: Market Basket Analysis
// n-grams to pair items
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
 s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
toLowerCase: converts to lowercase letters
split(inSep): splits by the separator inSep
sliding(n): selects n words as a group
_.sorted: sorts the elements in the group
mkString(outSep): joins the elements with outSep
toSet: makes the groups a set with unique elements
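A tiny worked example of ngram under those steps (the input line is illustrative):
ngram("Beer Diaper Milk", " ", "+", 2)
// toLowerCase/split -> Array("beer", "diaper", "milk")
// sliding(2)        -> ("beer", "diaper"), ("diaper", "milk")
// sorted + mkString -> "beer+diaper", "diaper+milk"
// toSet             -> Set("beer+diaper", "diaper+milk")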
Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a+b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
 Extracts and counts the bigrams
Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a+b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
 sortByKey(false): sorts the bigrams in descending order of the count
Spark at Yahoo
Two Spark projects in the works:
 Personalizing news pages for Web visitors
– ML algorithms running on Spark
• To figure out what individual users are interested in, and to categorize news stories as they arise, to figure out what types of users would be interested in reading them
– 120 lines of Scala
• Compared to 15,000 lines of C++
– About 1/100 the size
 Analytics for advertising
Spark at Yahoo (Cont'd)
Two Spark projects in the works:
 Analytics for advertising
– Hive on Spark (Shark's) interactive capability
• Use existing BI tools to view and query their advertising analytics data collected in Hadoop
Spark at Conviva
One of the largest streaming video companies on the Internet
 About 4 billion video feeds per month
– Second only to YouTube
Uses Spark Streaming
 To learn network conditions in real time
 To ensure a high quality of service (QoS)
– By avoiding dreaded screen buffering
Spark at ClearStory
One of Databricks' first customers
 Needs its interactive, real-time product
– For data integration
 Needed a way to help business users merge their internal data sources with external sources
– Such as social media traffic and public data feeds
• Without requiring complex data modeling
Spark Training
 California State University Los Angeles (Prof. Jongwook Woo)
 UC Berkeley edX (MOOC)
 UC Berkeley AMP Camp
 Stanford
 Cloudera, Hortonworks, DataStax training courses
 IBM Big Data University
Training Hadoop and Spark
[Photo: Cloudera's visit to interview Jongwook Woo]
Training Hadoop and Spark
Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles
Conclusion
Big Data is Hadoop
Spark is the way to go for Big Data
Spark training is important
Questions?
References
 Hadoop, http://hadoop.apache.org
 Apache Spark Word Count Example, http://spark.apache.org
 Databricks, http://www.databricks.com
 “Market Basket Analysis using Spark”, Jongwook Woo, Journal of Science and Technology, April 2015, Volume 5, No 4, pp. 207-209, ISSN 2225-7217, ARPN
 https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
References
 Introduction to Big Data with Apache Spark, Databricks
 Stanford Spark Class, http://stanford.edu/~rezab
 Cornell University, CS5304
 DS320: DataStax Enterprise Analytics with Spark
 Cloudera, http://www.cloudera.com
 Hortonworks, http://www.hortonworks.com
 Apache Spark: 3 Real-World Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/