Spark is used to perform market basket analysis on transaction data to identify commonly purchased item pairs. The data is read from files as lines of transactions, and an n-gram function is applied to generate item pairs. The pairs are counted and aggregated to find the most frequently co-occurring ones; the results are sorted and saved to HDFS.
Spark ukc2015v1.1
1. Jongwook Woo
HiPIC
CSULA
UKC 2015
Atlanta, GA
July 30 2015
Nillohit Bhattacharya, nbhatta2@calstatela.edu
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
Big Data Analysis and
Industrial Approach
using Spark
2. High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Academic Cloud Computing
3. Myself
Name: 우종욱, Jongwook Woo
Experience:
2012 - Present
– Certified Cloudera Instructor: R&D, Consulting, Training
2012 - Present: Big Data Academic Partnerships
– Cloudera, Hortonworks Partner for Hadoop Training
– Amazon AWS, Microsoft Azure, IBM Bluemix
Since 2002, Professor at California State Univ Los Angeles
Since 1998: R&D consulting in Hollywood
– Implemented eBusiness applications using J2EE and middleware
– Information Search and Integration with FAST, Lucene/Solr,
Sphinx
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
Since 2007: Exposed to Big Data
PhD in 2001: Computer Science and Engineering at USC
4. Myself
Experience (Cont’d): Bringing Big Data
training and R&D to Korea since 2009
2014: Training Hadoop and the Ecosystems
Summer 2013 Igloo Security:
– Collect, Search, and Analyze Security Log files 30GB –
100GB / day
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training
Institute
Since 2008
– Introduce Hadoop, Big Data, and education in universities and
research centers
5. Experience in Big Data
Grants
Received IBM Bluemix, Microsoft Windows Azure, and Amazon
AWS in Education research grants
Partnership
Received Academic Education Partnership with Cloudera,
Hortonworks and IBM
Certificate
Certified Cloudera Hadoop Instructor
Certified Cloudera Hadoop Developer / Administrator / HBase /
Spark
6. Contents
Myself
Introduction To Big Data
Spark Cores
RDD
Task Scheduling
Spark SQL, Streaming, ML
Examples
Use Cases
7. Data Issues
Large-Scale data
Terabyte (10^12), Petabyte (10^15)
– Because of web
– Sensor Data (IoT), Bioinformatics, Social
Computing, Streaming data, smart phone, online
game…
Cannot handle with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Non-expensive
8. Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On non-expensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with multiple non-expensive
computers
• Own super computers
9. What is Hadoop?
Hadoop Founder:
Doug Cutting
Chief Architect at Cloudera
10. Definition: Big Data
Inexpensive frameworks that can
store a large scale data and
process it faster in parallel
Hadoop
–Non-expensive Super Computer
–You can build and run your applications
11. Hadoop CDH: Logical Diagram
[Diagram: a web browser controls the Cloudera Manager (CM) server over HTTP(S); the CM server manages CDH agents on each cluster node; the nodes run HDFS plus services such as Hive, ZooKeeper, and Impala]
12. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Use Cases
13. Hive
One of the Hadoop ecosystems
Developed at Facebook
Turns Hadoop into a data warehouse
HiveQL
SQL syntax
Convert to MapReduce jobs
– Batch Processing
– Slow
• high startup latency even for a simple SQL statement
• Impala on MPP for interactive querying
14. Airline Data Set
Cluster by Nillohit at HiPIC, CSULA:
Airline Data Set in 2012 – 2014
– US Dept of transportation
Microsoft Azure using Hive
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 7 GB
– Windows Server 2012 R2 Datacenter
Hadoop and Hive
Highest Average Departure Delay
– E Air Lines: 13.25 minutes
– U Air Lines: 12.59 minutes
Least Avg Departure Delay (Least Delayed / On Time):
– AS, "Alaska Airlines Inc.", 1.33 minutes
– HA, "Hawaiian Airlines Inc.", - 0.56 minutes
• (Negative, i.e. most of the time the flight departs before the scheduled time)
18. Hive
Easy for Data Analysis
HiveQL
– SQL syntax
– Convert to MapReduce jobs
Slow
Batch Processing
– high startup latency even for a simple SQL statement
Impala on MPP for interactive querying
19. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Academic Cloud Computing
20. Alternative to Hadoop MapReduce
Limitation in MapReduce
Hard to program in Java
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
10x ~ 100x faster than network and disk
21. Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its
ecosystems
HDFS
HBase, Hive, Sequence files
New Programming with faster data sharing
Good in complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query
22. Spark
[Stack diagram: Spark Core (RDDs, Transformations, and Actions) with libraries on top]
Spark Streaming (real-time): DStreams, i.e. streams of RDDs
Spark SQL: SchemaRDDs / DataFrames
MLlib (machine learning): RDD-based matrices
GraphX (graph): RDD-based matrices
SparkR: RDD-based matrices
23. Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Create RDDs
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–development
24. Spark Components
Your program, running in the Spark driver/client (app master):
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…)
.count()
...
[Diagram: the driver holds the RDD graph, scheduler, block tracker, and shuffle tracker; a cluster manager schedules Spark workers, each with a block manager and task threads, reading data from HDFS, HBase, Amazon S3, …]
25. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Use Cases
26. RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
–that can be cached in memory
RDD, DStream, SchemaRDD, PairRDD
Immutable
Lineage
–History of the objects
–Automatically and efficiently recompute lost
data
27. RDD Operations
Transformation
Define a new RDD from an existing one
–Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), save()
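The lazy/eager split above can be sketched as follows. This is an illustrative sketch, not from the deck, and it assumes a Spark shell where `sc` is an already-created SparkContext:

```scala
// Assumes a Spark shell: `sc` is an existing SparkContext.
val nums = sc.parallelize(1 to 1000)    // create an RDD
val evens = nums.filter(_ % 2 == 0)     // transformation: lazy, nothing computed yet
val doubled = evens.map(_ * 2)          // transformation: still lazy, only lineage recorded
val n = doubled.count()                 // action: triggers the actual computation
val first5 = doubled.take(5)            // action: returns values to the driver
```

Until `count()` or `take()` runs, Spark has only recorded the lineage of transformations; that lineage is also what lets it recompute lost partitions.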
28. Programming in Spark
Scala
Functional Programming
–The fundamental unit of programming is the function
• Inputs and outputs are functions
No side effects
–No state
Python
Legacy code, large libraries
Java
29. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Academic Cloud Computing
30. Spark
SparkSQL
Turning an RDD into a Relation
Querying using SQL
Spark Streaming
DStream
– RDD in streaming
– Windows
• To select DStream from streaming data
MLlib
Sparse vector support, decision trees,
linear/logistic regression,
SVD and PCA
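A minimal sketch of turning an RDD into a relation and querying it with SQL, using the Spark 1.3-era API the deck targets. This assumes a Spark shell with an existing SparkContext `sc`; the `Person` case class and the sample rows are made up for illustration:

```scala
// Spark 1.3-style Spark SQL: RDD -> DataFrame -> SQL query.
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Turn an RDD of case classes into a DataFrame (successor of SchemaRDD)
val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 19))).toDF()
people.registerTempTable("people")

// Query the relation with SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
adults.collect().foreach(println)
```

The case-class schema is inferred by reflection; `registerTempTable` makes the DataFrame visible to SQL for the lifetime of the SQLContext.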
31. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Use Cases
32. Market Basket Analysis (MBA)
Collect the pairs of transaction items
that most frequently occur together at a store
Traditional Business Intelligence Analysis
a much better opportunity for profit
– by controlling the order of products and
marketing
– control the stocks more intelligently
– arrange items on shelves
– promote items together etc.
33. Market Basket Analysis (MBA)
Transactions in Store A: Input data
Transaction 1: cracker, icecream, beer
Transaction 2: chicken, pizza, coke, bread
Transaction 3: baguette, soda, herring, cracker, beer
Transaction 4: bourbon, coke, turkey
Transaction 5: sardines, beer, chicken, coke
Transaction 6: apples, peppers, avocado, steak
Transaction 7: sardines, apples, peppers,
avocado, steak
…
What is a pair of items that people
frequently buy at Store A?
34. Market Basket Analysis (MBA) on Spark
n-gram() Functional Algorithm
1. Take an input transaction text
2. Slide over the items by n elements and sort each window
Each window generates a group (e1, e2, …, en)
3. Remove duplicated item groups
4. Several (e1, e2, …, en) groups are produced per line, depending on n
35. n-gram
Normally “He follows Texas Rangers”
Bi-gram
–(He follows), (follows Texas), (Texas
Rangers)
Tri-gram
–(He follows Texas), (follows Texas
Rangers)
a transaction: (coke, beer, cracker)
the output list of bi-gram in MBA
– (beer, coke), (beer, cracker), (coke, cracker)
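The bi-gram step on the text example above can be sketched in plain Scala, without Spark, using the same `sliding` idea as the deck's ngram function (the object name `NgramDemo` is just for this sketch; the MBA variant additionally lowercases, sorts each window, and deduplicates with a Set):

```scala
// Plain-Scala sketch of the bi-gram step: slide a window of 2 over the words.
object NgramDemo {
  def bigrams(s: String): List[String] =
    s.split(" ").sliding(2).map(_.mkString(" ")).toList

  def main(args: Array[String]): Unit = {
    println(bigrams("He follows Texas Rangers"))
    // List(He follows, follows Texas, Texas Rangers)
  }
}
```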
36. Market Basket Analysis (MBA) on Spark
MBA Functional Algorithm
1. Take input text files
2. For all files, apply n-gram() to each line
Each line generates an n-gram list
– The elements of the n-gram list are flattened
Each flattened element generates an (ngram, 1) pair
3. All pairs are reduced by key
Values with the same key are summed
4. (ngram, sum of values) pairs are generated
and sorted by sum of values
37. Experimental Result
The number of nodes on AWS EC2
2, 4, and 6 nodes
– 1 is a master node
– the input transaction data sets
• 1.6 GB and 3.2 GB file size
Don’t need many nodes in Spark
– 1 node
• Equivalent to 10 – 100 nodes in MapReduce
38. Experimental Result
Using Spark Scala on Amazon AWS
Spark version is 1.3.0
– the data is stored at its HDFS on the cluster.
AWS EC2
– m1.large
• 2 core (2 EC2 vCPU compute unit),
• 7.5GB memory and 2 x 420GB storage on 64 bits
Amazon Linux OS
– m1.xlarge
• 4 core (4 EC2 vCPU compute unit),
• 15GB memory and 4 x 420GB storage on 64 bits
Amazon Linux OS
39. Example Code: Market Basket Analysis
// ngrams to pair items
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
  s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}

val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result3.2")
40. Example Code: Market Basket Analysis
// ngrams to pair items
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
  s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
toLowerCase: convert to lower-case letters
split(inSep): split by the separator inSep
sliding(n): select n words as a group
_.sorted: sort the elements in the group
mkString(outSep): join the elements with outSep
toSet: make the groups a set with unique elements
41. Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
Extract and count bigrams
42. Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
Sort the bigrams in descending order of the count
43. Time for MBA
[Chart: MBA execution time in seconds (0 to 500) vs. number of nodes (1, 3, 5), comparing m1.large and m1.xlarge clusters on the 1.6GB and 3.2GB inputs]
44. Scala Spark vs Java
Code Size:
1/100 of Java code
Performance:
Scala Spark: 10x ~ 100x faster than MapReduce
in Java
Scala Spark vs PySpark
– Scala Spark
• 2x faster
– PySpark
• has more libraries
Meaning
100 nodes in Java MapReduce
1 – 10 nodes in Spark
45. Contents
Myself
Introduction To Big Data
Hive Examples
Spark Cores
RDD
Spark SQL, Streaming, ML
MBA Examples
Academic Cloud Computing
46. Spark Training
California State University Los Angeles
(Prof Jongwook Woo)
UC Berkeley Edx (MOOC)
UC Berkeley AMPLab camp
Stanford
Cloudera, Hortonworks, DataStax Training
courses
IBM Big University
47. Training Hadoop and Spark
Cloudera visits to interview Jongwook Woo
49. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
50. Spark on Cloud Computing
Amazon AWS
Spark example at https://spark.apache.org/
Microsoft Azure
IBM Bluemix
Supports Spark and AMPLab
–Will hire 3,500 Spark engineers
Launched Spark on Bluemix in July 2015
Academic Initiative Programs
– Contact for Spark on Bluemix account
51. Conclusion
Big Data is Hadoop
Spark is the way to go for Big Data
Spark training and academic partnership
53. References
Hadoop, http://hadoop.apache.org
Apache Spark Word Count Example
(http://spark.apache.org)
Databricks (http://www.databricks.com)
“Market Basket Analysis using Spark”, Jongwook
Woo, Journal of Science and Technology, April
2015, Volume 5, No 4, pp. 207-209, ISSN 2225-7217,
ARPN
https://github.com/hipic/spark_mba, HiPIC of
California State University Los Angeles
54. References
Introduction to Big Data with Apache Spark, databricks
Stanford Spark Class (http://stanford.edu/~rezab )
Cornell University, CS5304
DS320: DataStax Enterprise Analytics with Spark
Cloudera, http://www.cloudera.com
Hortonworks, http://www.hortonworks.com
Spark 3 Use Cases,
http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/