1. Big Data Trend and Open Data
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
UKC 2016, Dallas, TX
Aug 12 2016
2. High Performance Information Computing Center
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Introduction To Spark
Spark and Hadoop
Open Data and Use Cases
Hadoop Spark Training
3. High Performance Information Computing Center
Myself
Experience:
Since 2002, Professor at California State Univ Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present : Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
4. High Performance Information Computing Center
Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
Collaborating with LA city in 2016
– Collect, Search, and Analyze City Data
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop Big Data and education to universities and research centers
• Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
5. High Performance Information Computing Center
Experience in Big Data
Collaboration
Council Member of IBM Spark Technology Center
City of Los Angeles for OpenHub and Open Data
Startup Companies in Los Angeles
External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
Grants
IBM Bluemix, Microsoft Windows Azure, and Amazon AWS
Research and Education Grants
Partnership
Academic Education Partnership with Databricks, Tableau, Qlik,
Cloudera, Hortonworks, SAS, Teradata
7. High Performance Information Computing Center
Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of web
– Sensor Data (IoT), Bioinformatics, Social Computing,
Streaming data, smart phone, online game…
Cannot be handled with the legacy approach
Too big
Non-/Semi-structured data
Too expensive
Need new systems
Inexpensive
8. High Performance Information Computing Center
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed Systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with inexpensive computers
Own supercomputers
Published papers in 2003 (GFS) and 2004 (MapReduce)
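The MapReduce idea above can be illustrated with a minimal single-machine sketch in plain Python. It is not Hadoop or Google code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for illustration, and a real framework would run the map and reduce steps in parallel across many commodity machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Invented sample input, for illustration only.
counts = word_count(["big data big compute", "data on commodity computers"])
```

Each phase only sees its own slice of the data, which is what would let the work be spread over inexpensive machines.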
9. High Performance Information Computing Center
What is Hadoop?
Hadoop Founder: Doug Cutting
Apache Committer: Lucene, Nutch, …
10. High Performance Information Computing Center
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
–Inexpensive supercomputer
–More accessible to the public than traditional supercomputers
• You can store data and run your applications
– In your university labs, small companies, research centers
13. High Performance Information Computing Center
Alternative to Hadoop MapReduce
Limitation in MapReduce
Hard to program in Java
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
20 ~ 100 times faster than network (N/W) and disk access
– as used by MapReduce
14. High Performance Information Computing Center
Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
HDFS
Amazon S3, HBase, Hive, Sequence files, Cassandra,
ArcGIS, Couchbase…
New programming model with faster data sharing
Good in complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query
15. High Performance Information Computing Center
Spark
RDDs, Transformations, and Actions (Spark Core)
Spark Streaming (real-time): DStreams, streams of RDDs
Spark SQL: SchemaRDDs, DataFrames
MLlib, ML (machine learning): RDD-based matrices
GraphX (graph): RDD-based matrices
SparkR: RDD-based matrices
16. High Performance Information Computing Center
RDD Operations
Transformation
Define new RDDs from the current ones
–Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), save()
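The split between lazy transformations and eager actions can be sketched without Spark. The toy class below is an assumption made for illustration (`LazyDataset` is a hypothetical name, not a Spark API): `map()` and `filter()` only record work, and nothing runs until an action such as `collect()`, `count()`, or `take()` is called.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD: transformations record work, actions run it."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection
        self._steps = steps    # recorded (kind, function) pairs

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _run(self):
        """Apply the recorded steps to each item, one item at a time."""
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []


def traced_square(x):
    """Square x and record the call, to show when work really happens."""
    calls.append(x)
    return x * x


dataset = LazyDataset(range(1, 6)).map(traced_square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)  # still zero: only a plan exists so far
result = dataset.collect()        # the action runs the recorded steps
```

The tracing function has a side effect only so that the deferred execution is visible; real transformation functions are usually kept free of side effects.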
17. High Performance Information Computing Center
Programming in Spark
Scala
Functional Programming
–The fundamental unit of programming is the function
• Inputs and outputs can be functions
No side effects
–No state
Python
Legacy, large Libraries
Java
18. High Performance Information Computing Center
Spark
Spark SQL
DataFrame
– Turning an RDD into a Relation
Querying using SQL
Spark Streaming
DStream
– RDD in streaming
– Windows
• To select DStream from streaming data
MLlib, ML
Sparse vector support, Decision trees, Linear/Logistic Regression,
PCA
Pipeline
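The DStream windows mentioned above select a sliding range of micro-batches from the stream. A minimal sketch in plain Python, assuming each batch is a list of records and that the window length and slide interval are counted in batches (the function name `windowed` and the sample batches are hypothetical, not the Spark Streaming API):

```python
def windowed(batches, window_length, slide_interval):
    """Yield merged windows over a list of micro-batches.

    Each batch is a list of records (one RDD's worth in DStream terms).
    A window covers `window_length` consecutive batches and advances
    by `slide_interval` batches each time.
    """
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        yield [record for batch in window for record in batch]


# Invented sample stream of five micro-batches.
batches = [["a"], ["b", "c"], ["d"], ["e"], ["f", "g"]]
windows = list(windowed(batches, window_length=3, slide_interval=2))
```

With a window of three batches sliding by two, consecutive windows overlap by one batch.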
20. High Performance Information Computing Center
Spark
Spark
File System: Tachyon
Resource Manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into Hadoop cluster
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystems
21. High Performance Information Computing Center
Spark with Hadoop YARN
ResourceManager (RM), per cluster
– Creates the Spark AM and allocates Containers for the Spark AM
NodeManager (NM), per node
– Spark workers
ApplicationMaster (AM), per application
– Containers for Spark Executors
[Figure: Spark Client; Master Node with the Resource Manager; Slave Nodes, each with a Node Manager and Containers holding the Spark AM or Spark Executors.]
22. High Performance Information Computing Center
Big Data Analysis Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Datameer, Qlik, …)
Data Visualization
– Qlik, Datameer, Excel PowerView
25. High Performance Information Computing Center
Open Data
US government
Federal, State, City governments
Expose data to the public
US businesses
Twitter, Yelp, …
Expose data to the public through APIs
– Some restrictions on downloads
City government
New York
– Taxi, Uber, …
Los Angeles
– Open Data, Open Hub with Geo info
26. High Performance Information Computing Center
Open Big Data Analysis in CalStateLA
Social Media Data Analysis
Twitter Sentiment Analysis for AlphaGo
Open Data from Government
Airline Data analysis
Crime Data analysis
Web Service API
Business Data Analysis from Yelp and Google Places API
27. High Performance Information Computing Center
Data from Industry: Twitter Data
Systems
Azure HDInsight Spark
8 Nodes
– 40 cores: 2.4GHz Intel Xeon
– Memory - Each Node: 28 GB
Data Source
Keyword ‘alphago’ from Twitter via Apache NiFi
Data Size
63,193 tweets
Real Time Data Collection period
03/12 – 03/17/2016
– No data collected on 03/13
29. High Performance Information Computing Center
Top 10 Countries
# of Tweets per Country
USA: > 11,000
Japan: > 9,000
Korea: > 1,900
Russia, UK: > 1,600
Thailand, France: > 1,000
Netherlands, Spain, Ukraine: > 600
31. High Performance Information Computing Center
Top 10 Countries
Most Tweeted Countries
All countries show more positive tweets
–Korea, Japan, USA
Country Positive Negative
USA 5070 3567
Japan 8118 217
…
Korea 1053 407
…
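The claim that these countries show more positive than negative tweets can be checked directly from the three rows listed in the table. The sketch below is only arithmetic on those figures; the helper name `positive_share` is an assumption, and the share is taken over positive plus negative tweets only.

```python
# (positive, negative) tweet counts copied from the table rows above.
tweet_counts = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Fraction of positive tweets among positive and negative ones."""
    return positive / (positive + negative)


shares = {
    country: positive_share(positive, negative)
    for country, (positive, negative) in tweet_counts.items()
}
```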
32. High Performance Information Computing Center
Daily Tweets in 03/12 – 03/17/2016
[Figure: daily tweet counts for AlphaGo vs Lee Sedol, 3/12/2016 to 3/17/2016 (y-axis 0 to 50,000). Annotations: Game 3: Mar 12; Game 4: Mar 13, Lee Se-Dol win; Game 5: Mar 15.]
33. High Performance Information Computing Center
N-gram words
3 words in a row right after the Go champion’s name, “sedol” or “se-dol”
3-grams Frequency
Again-to-win 1,187
Is-something-I’ll 369
Is-something-i 199
In-go-tournament 168
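Counting the three words that follow a keyword can be sketched in a few lines of plain Python. This is not the code used for the analysis above: the function names, the whitespace tokenization, and the sample tweets are assumptions made for illustration.

```python
from collections import Counter


def trigrams_after(tokens, keywords):
    """Return the 3 tokens following each keyword occurrence, joined by '-'."""
    grams = []
    for i, token in enumerate(tokens):
        following = tokens[i + 1:i + 4]
        if token in keywords and len(following) == 3:
            grams.append("-".join(following))
    return grams


def count_trigrams(tweets, keywords=("sedol", "se-dol")):
    """Count 3-grams that directly follow any of the keywords."""
    counter = Counter()
    for text in tweets:
        tokens = text.lower().split()  # naive whitespace tokenization
        counter.update(trigrams_after(tokens, keywords))
    return counter


# Invented sample tweets, not taken from the collected data set.
sample = [
    "Lee Sedol again to win tomorrow",
    "lee se-dol again to win",
    "sedol lost",
]
freq = count_trigrams(sample)
```

Occurrences with fewer than three following words, such as the last sample tweet, are skipped.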
35. High Performance Information Computing Center
Sentiment Map of Lee Se-Dol vs Alphago
YouTube video: “alphago sentiment” by Google
The sentiment of the World in Geo and Time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
36. High Performance Information Computing Center
Federal Government: Airline Data Set
Government Open Data
Airline Data Set in 2012 – 2014
– US Department of Transportation
Cluster by Nillohit at HiPIC, CalStateLA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 7 GB
– Windows Server 2012 R2 Datacenter
40. High Performance Information Computing Center
City Government: Crime Data Set
Open Data in City of Los Angeles
Crime Data Set in 2012-2015
File Size – 151MB
Total Number of offenses – 8.94 million
Ram Dharan and Sridhar Reddy at HiPIC,
CalStateLA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 14 GB
– Windows Server 2012 R2 Datacenter
– Extending to the last 10 years of the data set
41. High Performance Information Computing Center
Projection of Raw Data
[Figure: bar chart by year, year2012 to year2015 (y-axis 0 to 90,000).]
42. High Performance Information Computing Center
Total No. of Crimes in 2012-15
[Figure: bar chart by year, year2012 to year2015 (y-axis 0 to 25,000).]
43. High Performance Information Computing Center
Map of Crimes That Occurred within 5 Miles
of CalStateLA, UCLA, and USC in 2015
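One way to select crimes within 5 miles of a campus is a great-circle (haversine) distance filter. The sketch below is an assumption about how such a filter could look, not the Hive or Spark SQL query used here: the campus coordinates are approximate, and the record layout (`lat`, `lon` keys) and sample records are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8

# Approximate campus coordinates (latitude, longitude); assumed values.
CAMPUSES = {
    "csula": (34.0668, -118.1684),
    "ucla": (34.0689, -118.4452),
    "usc": (34.0224, -118.2851),
}


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def crimes_within(crimes, campus, radius_miles=5.0):
    """Keep the crime records located within radius_miles of a campus."""
    lat0, lon0 = CAMPUSES[campus]
    return [
        crime for crime in crimes
        if haversine_miles(lat0, lon0, crime["lat"], crime["lon"]) <= radius_miles
    ]


# Invented sample records: one near CalStateLA, one at the UCLA coordinates.
sample_crimes = [
    {"id": 1, "lat": 34.0700, "lon": -118.1700},
    {"id": 2, "lat": 34.0689, "lon": -118.4452},
]
near_csula = crimes_within(sample_crimes, "csula")
```

The same distance function could also assign each record to a 5-mile band by integer-dividing the distance by 5.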
44. High Performance Information Computing Center
No. of Crimes for every 5 miles from CalStateLA
[Figure: bar chart of crime counts (y-axis 0 to 90,000) by distance band in miles: 0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, >35. Series: csula_2012, csula_2013, csula_2014, csula_2015.]
45. High Performance Information Computing Center
No. of Crimes for every 5 miles from UCLA
[Figure: bar chart of crime counts (y-axis 0 to 120,000) by distance band in miles: 0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40. Series: ucla_2012, ucla_2013, ucla_2014, ucla_2015.]
46. High Performance Information Computing Center
No. of Crimes for every 5 miles from USC
[Figure: bar chart of crime counts (y-axis 0 to 120,000) by distance band in miles: 0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40. Series as labeled on the slide: ucla_2012, ucla_2013, ucla_2014, ucla_2015.]
47. High Performance Information Computing Center
Comparison of Crimes for every 5 miles from CalStateLA, UCLA and USC in 2015
[Figure: bar chart of crime counts (y-axis 0 to 120,000) by distance band in miles: 0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, 40-50, >50. Series: csula_2015, ucla_2015, usc_2015.]
48. High Performance Information Computing Center
No. of crimes per area in LA
[Figure: bar chart of crime counts (y-axis 0 to 18,000) per area: 77thStreet, Mission, Newton, Rampart, Southwest, Topanga, VanNuys, Wilshire, Central, Devonshire, Foothill, Harbor, Hollenbeck, Hollywood, NHollywood, Pacific, WestValley, Northeast, Olympic, Southeast, WestLA. Series: in2012, in2013, in2014, in2015.]
49. High Performance Information Computing Center
Total No. of Crimes for every 2 hours in LA
[Figure: bar chart of crime counts (y-axis 0 to 18,000) with axis labels 77thStreet, Mission, Newton, Rampart, Southwest, Topanga, VanNuys, Wilshire, Central, Devonshire, Foothill, Harbor, Hollenbeck, Hollywood, NHollywood, Pacific, WestValley, Northeast, Olympic, Southeast, WestLA. Series: in2012, in2013, in2014, in2015.]
50. High Performance Information Computing Center
No. of crimes for every 2 hrs within 5 miles from CalStateLA, UCLA and USC in 2015
[Figure: horizontal bar chart of crime counts (0 to 12,000) per 2-hour slot, from 00:00-02:00 to 22:00-24:00. Series: usc, ucla, csula.]
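Grouping crimes into the 2-hour slots used above amounts to rounding the hour down to an even number. A minimal sketch, assuming occurrence times arrive as 24-hour "HH:MM" strings (the function names and sample times are hypothetical):

```python
from collections import Counter


def two_hour_bucket(hhmm):
    """Map a 24-hour 'HH:MM' time to a slot label such as '08:00-10:00'."""
    hour = int(hhmm.split(":")[0])
    start = (hour // 2) * 2  # round down to the nearest even hour
    return f"{start:02d}:00-{start + 2:02d}:00"


def crimes_per_bucket(times):
    """Count occurrences per 2-hour slot."""
    return Counter(two_hour_bucket(t) for t in times)


# Invented sample times, for illustration only.
sample_times = ["00:15", "01:59", "09:30", "23:45"]
histogram = crimes_per_bucket(sample_times)
```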
51. High Performance Information Computing Center
BUSINESS DATA ANALYSIS
DATA SET DETAILS
• Yelp Review Data: 1.9GB
• Business Data: 500MB
• Web Service API from Yelp and Google Places
[Figure: Yelp Challenge Data Set, Google Places, and Yelp Data, combined through Join and then Analysis.]
52. High Performance Information Computing Center
Top 10 businesses within 5 miles from CalStateLA
(with 5 or 4 star ratings)
[Figure: bar chart of business counts by category (y-axis 0 to 40). Bar values: 34, 31, 29, 26, 19, 19, 15, 15, 15. Legend: Hair Salons, Auto Repair, General Dentistry, Insurance, Churches, Skin Care, Chiropractors, Barbers, Elementary Schools.]
• Hair Salons and Insurance are popular business categories among those rated 4 or 5 stars
53. High Performance Information Computing Center
Businesses popular within 5 miles of CalStateLA, USC, and UCLA
54. High Performance Information Computing Center
Number of food businesses in radius 0-25 miles from CalStateLA, USC and UCLA
CalStateLA has more food businesses within 5 miles compared to UCLA and USC
[Figure: bar chart of food business counts (y-axis 0 to 600) by distance band in miles: 0-5, 5-10, 10-15, 15-20, 20-25. Series: CSULA, USC, UCLA.]
55. High Performance Information Computing Center
Hydrogen Gas Power Plant
Prediction Model
The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
56. High Performance Information Computing Center
Hydrogen Gas Power Plant
Prediction Model
The station produces hydrogen for hydrogen vehicles
Cal State L.A. Hydrogen Research and Fueling Facility
– the first station in the nation to sell hydrogen fuel to the public
Hyundai, Toyota
57. High Performance Information Computing Center
Hydrogen Gas Power Plant
Prediction Model
Workflow
58. High Performance Information Computing Center
Hydrogen Gas Power Plant
Prediction Model
Model by Manvi Chandra
59. High Performance Information Computing Center
Hydrogen Gas Power Plant
Prediction Model
Results and observations
60. High Performance Information Computing Center
Hydrogen Gas Power Plant
Prediction Model
Results and observations
Can predict Vehicle Pressure
– Pressure of hydrogen gas within the vehicle Hydrogen Storage System
– using our model in Azure Machine Learning Studio
– Building Spark ML
Decision Forest Regression
– constructs a multitude of decision trees at training time and outputs
• the mode of the classes (classification), or
• the mean prediction (regression) of the individual trees.
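How a decision forest combines its trees can be sketched independently of any ML library. The "trees" below are hand-written stand-ins, not trained models, and the feature names (`temp`, `fill_time`) and output values are hypothetical; only the aggregation step (mean for regression, mode for classification) reflects the description above.

```python
from collections import Counter
from statistics import mean


def forest_regress(trees, features):
    """Regression: average the numeric predictions of the individual trees."""
    return mean(tree(features) for tree in trees)


def forest_classify(trees, features):
    """Classification: return the most common class among the trees' votes."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical hand-written "trees"; a real forest would learn these splits.
regression_trees = [
    lambda x: 300.0 if x["temp"] > 20 else 350.0,
    lambda x: 320.0 if x["fill_time"] > 3 else 340.0,
    lambda x: 310.0,
]
classification_trees = [
    lambda x: "high" if x["temp"] > 20 else "low",
    lambda x: "high" if x["fill_time"] > 3 else "low",
    lambda x: "low",
]

sample = {"temp": 25, "fill_time": 4}  # invented feature values
predicted_pressure = forest_regress(regression_trees, sample)
predicted_class = forest_classify(classification_trees, sample)
```

Averaging many trees tends to smooth out the errors of any single tree, which is the usual motivation for using a forest rather than one tree.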
62. High Performance Information Computing Center
Spark Big Data Training and R&D
HiPIC
California State University Los Angeles
Supported by
– Databricks and its cloud computing services
– Amazon AWS, IBM Bluemix, MS Azure
– Hortonworks, Cloudera
– Datameer
67. High Performance Information Computing Center
References
Hadoop, http://hadoop.apache.org
Apache Spark Word Count Example (http://spark.apache.org)
Databricks (http://www.databricks.com)
“Market Basket Analysis using Spark”, Jongwook Woo, in Journal of Science and Technology, April 2015, Volume 5, No 4, pp207-209, ISSN 2225-7217, ARPN
https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
68. High Performance Information Computing Center
References
Introduction to Big Data with Apache Spark, Databricks
Stanford Spark Class (http://stanford.edu/~rezab)
Cornell University, CS5304
DS320: DataStax Enterprise Analytics with Spark
Cloudera, http://www.cloudera.com
Hortonworks, http://www.hortonworks.com
Spark 3 Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
69. High Performance Information Computing Center
Scheduling Process
RDD Objects: rdd1.join(rdd2).groupBy(…).filter(…)
– Optimizer: build operator DAG
DAGScheduler (receives the DAG)
– split graph into stages of tasks
– submit each stage as ready
– agnostic to operators!
TaskScheduler (receives a TaskSet)
– launch tasks via cluster manager
– retry failed or straggling tasks
– doesn’t know about stages
– reports a failed stage back to the DAGScheduler
Worker (receives Tasks)
– execute tasks (Threads)
– store and serve blocks (Block manager)
70. High Performance Information Computing Center
Spark Components
Your program
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…)
.count()
...
Spark Driver/Client (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker
Cluster manager
Spark worker(s): each with Task threads and a Block manager
Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
71. High Performance Information Computing Center
Dependency Types
“Narrow” deps: a stage pipeline to be run on the same node
– map, filter
– union
– join with inputs co-partitioned
“Wide” (shuffle) deps: boundary of stages
– groupByKey
– join with inputs not co-partitioned
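The rule that wide dependencies mark stage boundaries can be illustrated with a simplified sketch. It handles only a linear chain of operations, which is an assumption made for brevity (Spark's DAGScheduler works on a full graph), and the operation labels `join_copartitioned` and `join_not_copartitioned` are hypothetical names for the two join cases above.

```python
# Operations classified as on the slide; the two join labels are hypothetical.
NARROW = {"map", "filter", "union", "join_copartitioned"}
WIDE = {"groupByKey", "join_not_copartitioned"}


def split_into_stages(operations):
    """Split a linear chain of operations into stages at wide dependencies.

    Narrow operations are pipelined into the current stage; a wide
    (shuffle) operation closes the current stage and starts a new one.
    """
    stages = [[]]
    for op in operations:
        if op not in NARROW and op not in WIDE:
            raise ValueError(f"unknown operation: {op}")
        if op in WIDE and stages[-1]:
            stages.append([])  # shuffle boundary: begin a new stage
        stages[-1].append(op)
    return stages


stages = split_into_stages(
    ["map", "filter", "groupByKey", "map", "join_not_copartitioned", "filter"]
)
```

Each inner list is one stage whose operations could be pipelined on the same node without moving data across the network.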
73. High Performance Information Computing Center
Scheduler Optimizations
Pipelines within a stage
– Stage 2: map, union
Stage 3: join algorithms based on partitioning (minimize shuffles)
[Figure: RDDs A to G connected by groupBy, map, union, and join across Stage 1, Stage 2, and Stage 3. Legend: shaded box = previously computed partition; each box is a Task.]