Spark is used to perform market basket analysis on transaction data to identify commonly purchased item pairs. The data is read from files as lines of transactions, and an n-gram function is applied to generate item pairs. The pairs are counted and aggregated to find the most frequently co-occurring ones; the results are sorted and saved to HDFS.
Spark ukc2015v1.1
1. Jongwook Woo
HiPIC
CSULA
UKC 2015
Atlanta, GA
July 30 2015
Nillohit Bhattacharya, nbhatta2@calstatela.edu
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
Big Data Analysis and
Industrial Approach
using Spark
2. High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Academic Cloud Computing
3. Myself
Name: 우종욱, Jongwook Woo
Experience:
2012 - Present
– Certified Cloudera Instructor: R&D, Consulting, Training
2012 - Present: Big Data Academic Partnerships
– Cloudera, Hortonworks Partner for Hadoop Training
– Amazon AWS, Microsoft Azure, IBM Bluemix
Since 2002, Professor at California State Univ Los Angeles
Since 1998: R&D consulting in Hollywood
– Implemented eBusiness applications using J2EE and middleware
– Information Search and Integration with FAST, Lucene/Solr,
Sphinx
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
Since 2007: Exposed to Big Data
PhD in 2001: Computer Science and Engineering at USC
4. Myself
Experience (Cont’d): Bringing Big Data
training and R&D to Korea since 2009
2014: Training Hadoop and the Ecosystems
Summer 2013 Igloo Security:
– Collect, Search, and Analyze Security Log files 30GB –
100GB / day
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training
Institute
Since 2008
– Introduce Hadoop, Big Data, and education in universities and
research centers
5. Experience in Big Data
Grants
Received IBM Bluemix, Microsoft Windows Azure, and Amazon
AWS in Education research grants
Partnership
Received Academic Education Partnership with Cloudera,
Hortonworks and IBM
Certificate
Certified Cloudera Hadoop Instructor
Certified Cloudera Hadoop Developer / Administrator / HBase /
Spark
6. Contents
Myself
Introduction To Big Data
Spark Cores
RDD
Task Scheduling
Spark SQL, Streaming, ML
Examples
Use Cases
7. Data Issues
Large-Scale data
Terabyte (10^12), Petabyte (10^15)
– Because of web
– Sensor Data (IoT), Bioinformatics, Social
Computing, Streaming data, smart phone, online
game…
Cannot handle with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Non-expensive
8. Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On non-expensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with multiple non-expensive
computers
• Own super computers
9. What is Hadoop?
Hadoop Founder:
Doug Cutting
Chief Architect at Cloudera
10. Definition: Big Data
Inexpensive frameworks that can
store a large scale data and
process it faster in parallel
Hadoop
–Non-expensive Super Computer
–You can build and run your applications
11. Hadoop CDH: Logical Diagram
[Diagram: a web browser controls the Cloudera Manager (CM) server over HTTP(S); the CM server manages CDH agents on each cluster node; the nodes run HDFS plus services such as Hive, ZooKeeper, and Impala]
12. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Use Cases
13. Hive
One of the Hadoop ecosystems
Developed at Facebook
Turns Hadoop into a data warehouse
HiveQL
SQL syntax
Convert to MapReduce jobs
– Batch Processing
– Slow
• high startup latency even for a simple SQL statement
• Impala on MPP for interactive querying
14. Airline Data Set
Cluster by Nillohit at HiPIC, CSULA:
Airline Data Set in 2012 – 2014
– US Dept of transportation
Microsoft Azure using Hive
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 7 GB
– Windows Server 2012 R2 Datacenter
Hadoop and Hive
Highest Average Departure Delay
– E Air Lines: 13.25 minutes
– U Air Lines: 12.59 minutes
Least Avg Departure Delay (Least Delayed / On Time):
– AS, "Alaska Airlines Inc.", 1.33 minutes
– HA, "Hawaiian Airlines Inc.", - 0.56 minutes
• (Negative, i.e. most of the time the flight departs before the scheduled time)
18. Hive
Easy for Data Analysis
HiveQL
– SQL syntax
– Convert to MapReduce jobs
Slow
Batch Processing
– high startup latency even for a simple SQL statement
Impala on MPP for interactive querying
19. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Academic Cloud Computing
20. Alternative to Hadoop MapReduce
Limitation in MapReduce
Hard to program in Java
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
10x ~ 100x faster than network and disk
21. Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its
ecosystems
HDFS
HBase, Hive, Sequence files
New Programming with faster data sharing
Good in complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query
22. Spark
[Stack diagram: Spark Core (RDDs, Transformations, and Actions) with libraries on top]
Spark Streaming (real-time): DStreams, i.e. streams of RDDs
Spark SQL: SchemaRDDs / DataFrames
MLlib (machine learning): RDD-based matrices
GraphX (graph): RDD-based matrices
SparkR: RDD-based matrices
23. Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Create RDDs
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–development
24. Spark Components
Your program, running in the Spark driver/client (app master):
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…)
.count()
...
[Diagram: the driver holds the RDD graph, scheduler, block tracker, and shuffle tracker; a cluster manager schedules Spark workers, each with a block manager and task threads, reading data from HDFS, HBase, Amazon S3, …]
25. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Use Cases
26. RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
–that can be cached in memory
RDD, DStream, SchemaRDD, PairRDD
Immutable
Lineage
–History of the objects
–Automatically and efficiently recompute lost
data
27. RDD Operations
Transformation
Define a new RDD from an existing one
–Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), save()
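The lazy/eager split above can be sketched as follows. This is an illustrative sketch, not from the deck, and it assumes a Spark shell where `sc` is an already-created SparkContext:

```scala
// Assumes a Spark shell: `sc` is an existing SparkContext.
val nums = sc.parallelize(1 to 1000)    // create an RDD
val evens = nums.filter(_ % 2 == 0)     // transformation: lazy, nothing computed yet
val doubled = evens.map(_ * 2)          // transformation: still lazy, only lineage recorded
val n = doubled.count()                 // action: triggers the actual computation
val first5 = doubled.take(5)            // action: returns values to the driver
```

Until `count()` or `take()` runs, Spark has only recorded the lineage of transformations; that lineage is also what lets it recompute lost partitions.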
28. Programming in Spark
Scala
Functional Programming
–The fundamental unit of programming is the function
• Inputs and outputs are functions
No side effects
–No state
Python
Legacy code, large libraries
Java
29. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Academic Cloud Computing
30. Spark
SparkSQL
Turning an RDD into a Relation
Querying using SQL
Spark Streaming
DStream
– RDD in streaming
– Windows
• To select DStream from streaming data
MLlib
Sparse vector support, decision trees,
linear/logistic regression,
SVD and PCA
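A minimal sketch of turning an RDD into a relation and querying it with SQL, using the Spark 1.3-era API the deck targets. This assumes a Spark shell with an existing SparkContext `sc`; the `Person` case class and the sample rows are made up for illustration:

```scala
// Spark 1.3-style Spark SQL: RDD -> DataFrame -> SQL query.
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Turn an RDD of case classes into a DataFrame (successor of SchemaRDD)
val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 19))).toDF()
people.registerTempTable("people")

// Query the relation with SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
adults.collect().foreach(println)
```

The case-class schema is inferred by reflection; `registerTempTable` makes the DataFrame visible to SQL for the lifetime of the SQLContext.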
31. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Use Cases
32. Market Basket Analysis (MBA)
Collect the pairs of transaction items
that most frequently occur together at a store
Traditional Business Intelligence Analysis
a much better opportunity for profit
– by controlling the order of products and
marketing
– control the stocks more intelligently
– arrange items on shelves
– promote items together etc.
33. Market Basket Analysis (MBA)
Transactions in Store A: Input data
Transaction 1: cracker, icecream, beer
Transaction 2: chicken, pizza, coke, bread
Transaction 3: baguette, soda, herring, cracker, beer
Transaction 4: bourbon, coke, turkey
Transaction 5: sardines, beer, chicken, coke
Transaction 6: apples, peppers, avocado, steak
Transaction 7: sardines, apples, peppers,
avocado, steak
…
What is a pair of items that people
frequently buy at Store A?
34. Market Basket Analysis (MBA) on Spark
n-gram() Functional Algorithm
1. Take an input transaction text
2. Slide over the items by n elements and sort each window
Each window generates a group (e1, e2, …, en)
3. Remove duplicated item groups
4. Several (e1, e2, …, en) groups are produced per line, depending on n
35. n-gram
Normally “He follows Texas Rangers”
Bi-gram
–(He follows), (follows Texas), (Texas
Rangers)
Tri-gram
–(He follows Texas), (follows Texas
Rangers)
a transaction: (coke, beer, cracker)
the output list of bi-gram in MBA
– (beer, coke), (beer, cracker), (coke, cracker)
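The bi-gram step on the text example above can be sketched in plain Scala, without Spark, using the same `sliding` idea as the deck's ngram function (the object name `NgramDemo` is just for this sketch; the MBA variant additionally lowercases, sorts each window, and deduplicates with a Set):

```scala
// Plain-Scala sketch of the bi-gram step: slide a window of 2 over the words.
object NgramDemo {
  def bigrams(s: String): List[String] =
    s.split(" ").sliding(2).map(_.mkString(" ")).toList

  def main(args: Array[String]): Unit = {
    println(bigrams("He follows Texas Rangers"))
    // List(He follows, follows Texas, Texas Rangers)
  }
}
```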
36. Market Basket Analysis (MBA) on Spark
MBA Functional Algorithm
1. Take input text files
2. For all files, apply n-gram() to each line
Each line generates an n-gram list
– The elements of the n-gram list are flattened
Each flattened element generates an (ngram, 1) pair
3. All pairs are reduced by key
Values with the same key are summed
4. (ngram, sum of values) pairs are generated
and sorted by sum of values
37. Experimental Result
The number of nodes on AWS EC2
2, 4, and 6 nodes
– 1 is a master node
– the input transaction data sets
• 1.6 GB and 3.2 GB file size
Don’t need many nodes in Spark
– 1 node
• Equivalent to 10 – 100 nodes in MapReduce
38. Experimental Result
Using Spark Scala on Amazon AWS
Spark version is 1.3.0
– the data is stored at its HDFS on the cluster.
AWS EC2
– m1.large
• 2 core (2 EC2 vCPU compute unit),
• 7.5GB memory and 2 x 420GB storage on 64 bits
Amazon Linux OS
– m1.xlarge
• 4 core (4 EC2 vCPU compute unit),
• 15GB memory and 4 x 420GB storage on 64 bits
Amazon Linux OS
39. Example Code: Market Basket Analysis
// ngrams to pair items
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
  s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}

val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result3.2")
40. Example Code: Market Basket Analysis
// ngrams to pair items
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
  s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
toLowerCase: convert to lower-case letters
split(inSep): split by the separator inSep
sliding(n): select n words as a group
_.sorted: sort the elements in the group
mkString(outSep): join the elements with outSep
toSet: make the groups a set with unique elements
41. Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
Extract and count bigrams
42. Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
Sort the bigrams in descending order of the count
43. Time for MBA
[Chart: MBA execution time in seconds (0 to 500) vs. number of nodes (1, 3, 5), comparing m1.large and m1.xlarge clusters on the 1.6GB and 3.2GB inputs]
44. Scala Spark vs Java
Code Size:
1/100 of Java code
Performance:
Scala Spark: 10x ~ 100x faster than MapReduce
in Java
Scala Spark vs PySpark
– Scala Spark
• 2x faster
– PySpark
• has more libraries
Meaning
100 nodes in Java MapReduce
1 – 10 nodes in Spark
45. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Academic Cloud Computing
46. Spark Training
California State University Los Angeles
(Prof Jongwook Woo)
UC Berkeley Edx (MOOC)
UC Berkeley AMPLab camp
Stanford
Cloudera, Hortonworks, DataStax Training
courses
IBM Big University
47. Training Hadoop and Spark
Cloudera visits to interview Jongwook Woo
49. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
50. Spark on Cloud Computing
Amazon AWS
Spark example at https://spark.apache.org/
Microsoft Azure
IBM Bluemix
Supports Spark and AMPLab
–Will hire 3,500 Spark engineers
Launched Spark on Bluemix in July 2015
Academic Initiative Programs
– Contact for Spark on Bluemix account
51. Conclusion
Big Data is Hadoop
Spark is the way to go for Big Data
Spark training and academic partnership
53. References
Hadoop, http://hadoop.apache.org
Apache Spark Word Count Example
(http://spark.apache.org)
Databricks (http://www.databricks.com)
“Market Basket Analysis using Spark”, Jongwook
Woo, Journal of Science and Technology, April
2015, Volume 5, No 4, pp. 207-209, ISSN 2225-7217,
ARPN
https://github.com/hipic/spark_mba, HiPIC of
California State University Los Angeles
54. References
Introduction to Big Data with Apache Spark, databricks
Stanford Spark Class (http://stanford.edu/~rezab )
Cornell University, CS5304
DS320: DataStax Enterprise Analytics with Spark
Cloudera, http://www.cloudera.com
Hortonworks, http://www.hortonworks.com
Spark 3 Use Cases,
http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/