Big Data Trend and Open Data
UKC 2016, Dallas, TX, Aug 12 2016
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
High Performance Information Computing Center
Jongwook Woo
CalStateLA
Contents
• Myself
• Introduction To Big Data
• Introduction To Spark
• Spark and Hadoop
• Open Data and Use Cases
• Hadoop Spark Training
Myself
Experience:
• Since 2002: Professor at California State University Los Angeles
  – PhD in 2001: Computer Science and Engineering at USC
• Since 1998: R&D consulting in Hollywood
  – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
  – Information search and integration with FAST, Lucene/Solr, Sphinx
  – Implemented eBusiness applications using J2EE and middleware
• Since 2007: Exposed to Big Data at CitySearch.com
• 2012 - Present: Big Data academic partnerships
  – For Big Data research and training
    • Amazon AWS, Microsoft Azure, IBM Bluemix
    • Databricks, Hadoop vendors
Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
• Collaborating with the City of Los Angeles in 2016
  – Collect, search, and analyze city data
    • Hadoop, Solr, Java, Cloudera
• Sept 2013: Samsung Advanced Technology Training Institute
• Since 2008
  – Introducing Hadoop and Big Data education to universities and research centers
    • Korea: Yonsei, Gachon
    • US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
    • Europe: Univ of Luxembourg
Experience in Big Data
• Collaboration
  – Council Member of IBM Spark Technology Center
  – City of Los Angeles for OpenHub and Open Data
  – Startup companies in Los Angeles
  – External collaborator and advisor in Big Data
    • IMSC of USC
    • Pennsylvania State University
• Grants
  – Research and education grants from IBM Bluemix, Microsoft Windows Azure, Amazon AWS
• Partnership
  – Academic education partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
Data Issues
• Large-scale data
  – Terabytes (10^12 bytes), petabytes (10^15 bytes)
    • Because of the web
    • Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
• Cannot be handled with the legacy approach
  – Too big
  – Non-/semi-structured data
  – Too expensive
• Need new systems
  – Inexpensive
Two Cores in Big Data
• How to store Big Data
• How to compute Big Data
• Google
  – How to store Big Data
    • GFS
    • Distributed systems on inexpensive commodity computers
  – How to compute Big Data
    • MapReduce
    • Parallel computing with inexpensive computers
  – Built its own supercomputers
  – Published papers in 2003, 2004
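The MapReduce idea mentioned on this slide can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases on one machine, not Hadoop code; the function names and the two sample lines are assumptions made up for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster, each line (or file block) would be mapped on a
    # different machine; here everything runs in one process.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical sample input.
counts = word_count(["big data big compute", "open data"])
print(counts)
```

The point of the split is that map and reduce calls are independent of each other, which is what lets a framework spread them over many inexpensive machines.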
What is Hadoop?
• Hadoop founder: Doug Cutting
• Apache committer: Lucene, Nutch, …
Definition: Big Data
• Inexpensive frameworks that can store large-scale data and process it faster in parallel
• Hadoop
  – Inexpensive supercomputer
  – More accessible to the public than traditional supercomputers
    • You can store data and run your applications in university labs, small companies, and research centers
Hadoop Cluster: Logical Diagram
[Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); every cluster node runs a monitoring agent and Hadoop on top of HDFS; services shown include Hive, ZooKeeper, and Impala]
Alternative to Hadoop MapReduce
• Limitations of MapReduce
  – Hard to program in Java
  – Batch processing
    • Not interactive
  – Disk storage for intermediate data
    • Performance issue
• Spark by UC Berkeley AMPLab
  – In-memory storage for intermediate data
  – 20 ~ 100 times faster than network- and disk-based MapReduce
Spark
• In-memory data computing
  – Faster than Hadoop MapReduce
• Can integrate with Hadoop and its ecosystem
  – HDFS
  – Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
• New programming model with faster data sharing
  – Good for complex multi-stage applications
    • Iterative graph algorithms, machine learning
  – Interactive queries
Spark
[Diagram: the Spark stack. Spark Core (RDDs, transformations, and actions) sits under Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, DataFrames), MLlib/ML (machine learning), GraphX (graph), and SparkR; the MLlib, GraphX, and SparkR boxes are each labeled "RDD-Based Matrices"]
RDD Operations
• Transformations
  – Define new RDDs from the current one
    • Lazy: not computed immediately
  – map(), filter(), join()
• Actions
  – Return values
  – count(), collect(), take(), save()
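To make the difference between lazy transformations and actions concrete, here is a small plain-Python sketch. `LazyDataset` is a hypothetical stand-in written for this example, not the Spark RDD API: its `map` and `filter` only record what to do, and work happens only when an action such as `collect`, `count`, or `take` is called.

```python
from itertools import islice

class LazyDataset:
    """Hypothetical stand-in for an RDD (illustration only, not Spark)."""

    def __init__(self, source, steps=()):
        self._source = source   # any re-iterable collection of records
        self._steps = steps     # recorded ("map" | "filter", function) pairs

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    # Actions: trigger the recorded steps and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))

calls = []

def square(x):
    calls.append(x)   # record every time the function really runs
    return x * x

odd_squares = LazyDataset(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)   # still 0: nothing has been computed
result = odd_squares.collect()     # the action runs the whole pipeline
calls_after_action = len(calls)
print(calls_before_action, result, calls_after_action)
```

Because the steps are only recorded, an engine is free to plan, pipeline, or skip work before any data is touched, which is the property the slide refers to as "lazy".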
Programming in Spark
• Scala
  – Functional programming
    • Functions are the fundamental building blocks
    • Inputs and outputs can be functions
  – No side effects
    • No state
• Python
  – Legacy, large libraries
• Java
Spark
• Spark SQL
  – DataFrame
    • Turning an RDD into a relation
  – Querying using SQL
• Spark Streaming
  – DStream
    • RDDs in streaming
    • Windows: to select DStreams from streaming data
• MLlib, ML
  – Sparse vector support, decision trees, linear/logistic regression, PCA
  – Pipeline
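The windowing idea for DStreams can be illustrated without Spark. The sketch below treats a stream as a sequence of micro-batches and emits, every `slide_interval` batches, the records of the most recent `window_length` batches. The function name, parameter names, and sample batches are assumptions for this example, and real Spark Streaming expresses both settings as durations rather than batch counts.

```python
from collections import deque

def sliding_windows(batches, window_length, slide_interval):
    """Yield the records of the last `window_length` micro-batches,
    once every `slide_interval` micro-batches."""
    recent = deque(maxlen=window_length)   # old batches fall out automatically
    for index, batch in enumerate(batches, start=1):
        recent.append(batch)
        if index % slide_interval == 0:
            yield [record for b in recent for record in b]

# Hypothetical stream of four micro-batches.
batches = [["a"], ["b", "c"], ["d"], ["e"]]
windows = list(sliding_windows(batches, window_length=3, slide_interval=2))
print(windows)
```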
Spark
• Spark
  – File system: Tachyon
  – Resource manager: Mesos
• But Hadoop has been dominating the market
  – Integrating Spark into Hadoop clusters
    • Cloud computing: Amazon AWS, Azure HDInsight, IBM Bluemix (Object Storage, S3)
    • Hadoop vendors: HDP, CDH
• Databricks: Spark on AWS
  – No Hadoop ecosystem
Spark with Hadoop YARN
• ResourceManager (RM), per cluster
  – Creates the Spark AM
  – Allocates containers for the Spark AM
• NodeManager (NM), per node
  – Spark workers
• ApplicationMaster (AM), per application
  – Containers for Spark executors
[Diagram: a Spark client talks to the ResourceManager on the master node; NodeManagers on the slave nodes host containers that run the Spark AM and the Spark executors]
Big Data Analysis Flow
• Data collection
  – Batch API: Yelp, Google
  – Streaming: Twitter, Apache NiFi, Kafka, Storm
  – Open Data: government
• Data storage
  – HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
• Data filtering
  – Hive, Pig
• Data analysis and science
  – Hive, Pig, Spark, BI tools (Datameer, Qlik, …)
• Data visualization
  – Qlik, Datameer, Excel PowerView
Databricks cluster at CalStateLA
Open Data
• US government
  – Federal, state, and city governments
  – Expose data to the public
• US businesses
  – Twitter, Yelp, …
  – Expose data to the public through APIs
    • Some restrictions on downloads
• City governments
  – New York
    • Taxi, Uber, …
  – Los Angeles
    • Open Data, Open Hub with geo info
Open Big Data Analysis at CalStateLA
• Social media data analysis
  – Twitter sentiment analysis for AlphaGo
• Open Data from government
  – Airline data analysis
  – Crime data analysis
• Web service APIs
  – Business data analysis from the Yelp and Google Places APIs
Data from Industry: Twitter Data
• Systems
  – Azure HDInsight Spark
  – 8 nodes
    • 40 cores: 2.4GHz Intel Xeon
    • Memory per node: 28 GB
• Data source
  – Keyword ‘alphago’ from Twitter via Apache NiFi
• Data size
  – 63,193 tweets
• Real-time data collection period
  – 03/12 – 03/17/2016
    • No data collected on 03/13
Top 10 Countries That Tweeted “AlphaGo”
Top 10 Countries
• Number of tweets per country
  – USA: > 11,000
  – Japan: > 9,000
  – Korea: > 1,900
  – Russia, UK: > 1,600
  – Thailand, France: > 1,000
  – Netherlands, Spain, Ukraine: > 600
Top 10 Countries Sentiment
[Chart: positive and negative tweets for the top 10 countries]
Top 10 Countries
• Most tweeted countries
  – All countries show more positive tweets
    • Korea, Japan, USA
Country   Positive   Negative
USA       5070       3567
Japan     8118       217
…
Korea     1053       407
…
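As a small worked example, the positive share for the three countries listed in the table can be computed directly from the positive and negative counts. The numbers are the ones on the slide; the helper name `positive_share` is made up for this sketch, and neutral tweets (if any) are not part of the calculation.

```python
# (positive, negative) tweet counts as listed in the table above.
counts = {"USA": (5070, 3567), "Japan": (8118, 217), "Korea": (1053, 407)}

def positive_share(positive, negative):
    """Fraction of positive tweets among positive plus negative tweets."""
    return positive / (positive + negative)

shares = {country: round(positive_share(p, n), 3) for country, (p, n) in counts.items()}
for country, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(country, share)
```

Under this measure Japan is about 0.974, Korea about 0.721, and the USA about 0.587, so all three lean positive, but by very different margins.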
Daily Tweets, 03/12 – 03/17/2016
[Chart: daily tweet counts (y-axis 0 to 50,000) for 3/12/2016 through 3/17/2016, AlphaGo vs Lee Sedol; annotations: Game 3 (Mar 12), Game 4 (Mar 13, Lee Se-Dol win), Game 5 (Mar 15)]
N-gram Words
• Three words in a row right after the Go champion’s name, “sedol” or “se-dol”
3-grams             Frequency
Again-to-win        1,187
Is-something-I’ll   369
Is-something-i      199
In-go-tournament    168
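The counting behind this table can be sketched as follows: for each tweet, find the keyword and take the three tokens that directly follow it. This is a minimal plain-Python illustration, not the Spark job used for the slide; whitespace tokenization, lowercasing, and the three sample tweets are assumptions invented for the example.

```python
from collections import Counter

KEYWORDS = {"sedol", "se-dol"}

def trigrams_after(tokens, keywords=KEYWORDS):
    """Yield the three tokens that directly follow each keyword occurrence."""
    for i, token in enumerate(tokens):
        if token in keywords and i + 3 < len(tokens):
            yield "-".join(tokens[i + 1:i + 4])

def count_trigrams(tweets):
    counter = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()   # naive whitespace tokenization
        counter.update(trigrams_after(tokens))
    return counter

# Invented sample tweets, for illustration only.
sample = [
    "Lee Sedol again to win",
    "lee se-dol again to win game 4",
    "sedol lost",
]
top = count_trigrams(sample).most_common(1)
print(top)
```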
Sentiment Map of AlphaGo
[Map: positive and negative tweet sentiment by location]
Sentiment Map of Lee Se-Dol vs AlphaGo
• YouTube video: “alphago sentiment” by Google
• The sentiment of the world in geo and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
Federal Government: Airline Data Set
• Government Open Data
  – Airline data set for 2012 – 2014
    • US Dept of Transportation
• Cluster by Nillohit at HiPIC, CalStateLA
  – Microsoft Azure using Hive and Spark SQL
  – Number of data nodes: 4
    • CPU: 4 cores; memory: 7 GB
    • Windows Server 2012 R2 Datacenter
Airline Data Set
[Charts: three slides of airline data set results]
City Government: Crime Data Set
• Open Data from the City of Los Angeles
  – Crime data set for 2012-2015
  – File size: 151 MB
  – Total number of offenses: 8.94 million
• Ram Dharan and Sridhar Reddy at HiPIC, CalStateLA
  – Microsoft Azure using Hive and Spark SQL
  – Number of data nodes: 4
    • CPU: 4 cores; memory: 14 GB
    • Windows Server 2012 R2 Datacenter
  – Extending to the last 10 years of data
Projection of Raw Data
[Chart: y-axis 0 to 90,000; x-axis year2012, year2013, year2014, year2015]
Total No. of Crimes in 2012-15
[Chart: y-axis 0 to 25,000; x-axis year2012, year2013, year2014, year2015]
Mapping of Crimes That Occurred within 5 Miles of CalStateLA, UCLA, and USC in 2015
No. of Crimes for Every 5 Miles from CalStateLA
[Chart: number of crimes (y-axis 0 to 90,000) by distance band in miles (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, >35); series csula_2012, csula_2013, csula_2014, csula_2015]
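One way to produce the 5-mile distance bands used in these charts is to compute the great-circle distance from a campus to each crime location and bucket it. The sketch below uses the haversine formula in plain Python; it is not the Hive or Spark SQL query from the project. The campus coordinates are approximate and the three crime points are invented, so both should be read as assumptions.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8   # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))

def distance_band(miles, width=5, last_edge=35):
    """Label such as '0-5' or '5-10'; everything from last_edge up is '>35'."""
    if miles >= last_edge:
        return f">{last_edge}"
    low = int(miles // width) * width
    return f"{low}-{low + width}"

def crimes_per_band(center, crime_points):
    return Counter(distance_band(haversine_miles(*center, *point)) for point in crime_points)

campus = (34.066, -118.168)   # approximate CalStateLA location (assumption)
crimes = [(34.066, -118.168), (34.166, -118.168), (35.066, -118.168)]   # invented points
bands = crimes_per_band(campus, crimes)
print(bands)
```

Running the same function once per campus and per year would give one series per chart.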
No. of Crimes for Every 5 Miles from UCLA
[Chart: number of crimes (y-axis 0 to 120,000) by distance band in miles (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40); series ucla_2012, ucla_2013, ucla_2014, ucla_2015]
No. of Crimes for Every 5 Miles from USC
[Chart: number of crimes (y-axis 0 to 120,000) by distance band in miles (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40); series for 2012 to 2015 (the legend labels read ucla_2012 to ucla_2015)]
Comparison of Crimes for Every 5 Miles from CalStateLA, UCLA, and USC in 2015
[Chart: number of crimes (y-axis 0 to 120,000) by distance band in miles (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, 40-50, >50); series csula_2015, ucla_2015, usc_2015]
No. of Crimes per Area in LA
[Chart: number of crimes (y-axis 0 to 18,000) per area: 77th Street, Mission, Newton, Rampart, Southwest, Topanga, Van Nuys, Wilshire, Central, Devonshire, Foothill, Harbor, Hollenbeck, Hollywood, N Hollywood, Pacific, West Valley, Northeast, Olympic, Southeast, West LA; series in2012, in2013, in2014, in2015]
Total No. of Crimes for Every 2 Hours in LA
[Chart: y-axis 0 to 18,000; x-axis lists the LA areas from 77th Street through West LA; series in2012, in2013, in2014, in2015]
No. of Crimes for Every 2 Hours within 5 Miles of CalStateLA, UCLA, and USC in 2015
[Chart: number of crimes (x-axis 0 to 12,000) per two-hour slot, 00:00-02:00 through 22:00-24:00; series usc, ucla, csula]
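The two-hour slots on this chart amount to a simple bucketing of each crime's timestamp. A minimal sketch in plain Python, with invented timestamps and a hypothetical helper name, could look like this:

```python
from collections import Counter
from datetime import datetime

def two_hour_slot(timestamp):
    """Map a timestamp to a label such as '00:00-02:00' or '22:00-24:00'."""
    start = (timestamp.hour // 2) * 2
    return f"{start:02d}:00-{start + 2:02d}:00"

# Invented event times, for illustration only.
events = [
    datetime(2015, 1, 1, 0, 30),
    datetime(2015, 1, 1, 1, 59),
    datetime(2015, 1, 1, 23, 5),
]
slots = Counter(two_hour_slot(t) for t in events)
print(slots)
```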
Business Data Analysis
• Data set details
  – Yelp review data: 1.9 GB
  – Business data: 500 MB
  – Web service APIs from Yelp and Google Places
[Diagram: the Yelp Challenge data set, Google Places data, and Yelp data feed a join step, followed by analysis]
Top 10 Businesses within 5 Miles of CalStateLA (with 5- or 4-star ratings)
[Chart: count of businesses by category (y-axis 0 to 40); bar values 34, 31, 29, 26, 19, 19, 15, 15, 15; categories: Hair Salons, Auto Repair, General Dentistry, Insurance, Churches, Skin Care, Chiropractors, Barbers, Elementary Schools]
• Hair salons and insurance are popular, highly rated business categories
Businesses Popular within 5 Miles of CalStateLA, USC, and UCLA
Number of Food Businesses within 0-25 Miles of CalStateLA, USC, and UCLA
• CalStateLA has more food businesses within 5 miles than UCLA and USC
[Chart: number of food businesses (y-axis 0 to 600) by distance band in miles (0-5, 5-10, 10-15, 15-20, 20-25); series CSULA, USC, UCLA]
Hydrogen Gas Power Plant Prediction Model
• The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
Hydrogen Gas Power Plant Prediction Model
• The station
  – Produces hydrogen for hydrogen vehicles
• Cal State L.A. Hydrogen Research and Fueling Facility
  – The first station in the nation to sell hydrogen fuel to the public
  – Hyundai, Toyota
Hydrogen Gas Power Plant Prediction Model
• Workflow [diagram]
• Model by Manvi Chandra [diagram]
• Results and observations [figure]
Hydrogen Gas Power Plant Prediction Model
Results and observations
• Can predict vehicle pressure
  – Pressure of hydrogen gas within the vehicle’s hydrogen storage system
  – Using our model in Azure Visual Studio ML
  – Building a Spark ML model
• Decision forest regression
  – Constructs a multitude of decision trees at training time
  – Outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees
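The "mean prediction of the individual trees" idea can be shown with a deliberately tiny forest. The sketch below is plain Python, not the Azure or Spark ML model from the project: each "tree" is simplified to a single-split stump trained on a bootstrap sample, and the forest prediction is the average of the stump predictions. The one-feature data set is synthetic and stands in for the real station measurements, which are not given here.

```python
import random
from statistics import mean

def fit_stump(points):
    """Fit a one-split regression tree on (x, y) pairs by minimizing squared error."""
    xs = sorted({x for x, _ in points})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:   # every x identical: fall back to the overall mean
        overall = mean(y for _, y in points)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_forest(points, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_trees)]

def predict(forest, x):
    """Regression output: the mean of the individual tree predictions."""
    return mean(tree(x) for tree in forest)

# Synthetic data: the target jumps from 2.0 to 8.0 at x = 5 (illustration only).
data = [(x, 2.0 if x < 5 else 8.0) for x in range(10)]
forest = fit_forest(data)
low, high = predict(forest, 1), predict(forest, 8)
print(low, high)
```

A production decision forest grows deeper trees over many features, but the averaging step is the same.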
Spark Big Data Training and R&D
• HiPIC, California State University Los Angeles
• Supported by
  – Databricks and its cloud computing services
  – Amazon AWS, IBM Bluemix, MS Azure
  – Hortonworks, Cloudera
  – Datameer
Databricks Partners
Training Hadoop and Spark
Cloudera visits to interview Jongwook Woo
Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles
Questions?
Scheduling Process
rdd1.join(rdd2)
    .groupBy(…)
    .filter(…)
• RDD objects (optimizer)
  – Build the operator DAG
• DAGScheduler
  – Splits the graph into stages of tasks
  – Submits each stage as ready
  – Agnostic to operators
• TaskScheduler
  – Receives TaskSets and launches tasks via the cluster manager
  – Retries failed or straggling tasks
  – Doesn’t know about stages
• Worker
  – Executes tasks in threads
  – Stores and serves blocks (block manager)
[Diagram: RDD objects pass the DAG to the DAGScheduler, which passes TaskSets to the TaskScheduler, which sends tasks to workers; a "stage failed" arrow leads back to the DAGScheduler]
Spark Components
Your program:
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…)
 .count()
...
[Diagram: the Spark driver/client (app master) holds the RDD graph, scheduler, block tracker, and shuffle tracker; it works through the cluster manager with the Spark workers, each of which has task threads and a block manager; data comes from HDFS, HBase, Amazon S3, Couchbase, Cassandra, …]
Dependency Types
• “Narrow” deps: a stage pipeline to be run on the same node
  – map, filter
  – union
  – join with inputs co-partitioned
• “Wide” (shuffle) deps: boundary of stages
  – groupByKey
  – join with inputs not co-partitioned
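The rule that wide dependencies mark stage boundaries while narrow ones are pipelined can be sketched as a small graph computation. This is a simplified illustration in plain Python, not Spark's DAGScheduler: RDDs connected by narrow dependencies are put in the same stage, and every wide dependency starts a new one. The operator sets, the `assign_stages` helper, and the five-step pipeline are assumptions for this example, and `join` is treated as wide on the assumption that its inputs are not co-partitioned.

```python
NARROW = {"map", "filter", "union"}
WIDE = {"groupByKey", "join"}   # join assumed to have non-co-partitioned inputs

def assign_stages(dag):
    """dag maps each RDD name to (operator, [parent names]); sources use (None, []).
    Returns {rdd_name: stage_id}; RDDs linked by narrow deps share a stage."""
    neighbours = {name: set() for name in dag}
    for child, (op, parents) in dag.items():
        if op is not None and op not in NARROW and op not in WIDE:
            raise ValueError(f"unknown operator: {op}")
        if op in NARROW:   # narrow edge: parent and child can be pipelined
            for parent in parents:
                neighbours[child].add(parent)
                neighbours[parent].add(child)
    stage_of, next_stage = {}, 0
    for name in dag:
        if name in stage_of:
            continue
        next_stage += 1
        stack = [name]
        while stack:   # flood-fill one group of narrowly connected RDDs
            node = stack.pop()
            if node not in stage_of:
                stage_of[node] = next_stage
                stack.extend(n for n in neighbours[node] if n not in stage_of)
    return stage_of

# Hypothetical pipeline: two narrow steps, one shuffle, one more narrow step.
dag = {
    "lines": (None, []),
    "words": ("map", ["lines"]),
    "long_words": ("filter", ["words"]),
    "by_letter": ("groupByKey", ["long_words"]),
    "sizes": ("map", ["by_letter"]),
}
stages = assign_stages(dag)
print(stages)
```

The stage ids are arbitrary labels; what matters is that the map and filter before the shuffle share one stage and everything after `groupByKey` lands in another.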
Scheduler Optimizations
• Pipelines within a stage
  – Stage 2: map, union
• Stage 3: join algorithms based on partitioning (minimize shuffles)
[Diagram: RDDs A through G arranged in Stage 1, Stage 2, and Stage 3, connected by groupBy, map, union, and join; shaded boxes mark previously computed partitions; each box is a task]
Scheduler Optimizations
• Conceptually
  – Stage 1: 3 tasks
  – Stage 2: 4 tasks
  – Stage 3: 3 tasks
  – Total: 3 stages, 10 tasks
[Diagram: RDDs A through G arranged in Stage 1, Stage 2, and Stage 3, connected by groupBy, map, union, and join; shaded boxes mark previously computed partitions; each box is a task]

Big Data Trend and Open Data

  • 1.
    Jongwook Woo HiPIC CalStat eLA UKC 2016 Dallas,TX Aug 12 2016 Jongwook Woo, PhD, jwoo5@calstatela.edu High-Performance Information Computing Center (HiPIC) California State University Los Angeles Big Data Trend and Open Data
  • 2.
    High Performance InformationComputing Center Jongwook Woo CalStat Contents  Myself  Introduction To Big Data  Introduction To Spark  Spark and Hadoop  Open Data and Use Cases  Hadoop Spark Training
  • 3.
    High Performance InformationComputing Center Jongwook Woo CalStat Myself Experience:  Since 2002, Professor at California State Univ Los Angeles – PhD in 2001: Computer Science and Engineering at USC  Since 1998: R&D consulting in Hollywood – Warner Bros (Matrix online game), E!, citysearch.com, ARM 등 – Information Search and Integration with FAST, Lucene/Solr, Sphinx – implements eBusiness applications using J2EE and middleware  Since 2007: Exposed to Big Data at CitySearch.com  2012 - Present : Big Data Academic Partnerships – For Big Data research and training • Amazon AWS, MicroSoft Azure, IBM Bluemix • Databricks, Hadoop vendors
  • 4.
    High Performance InformationComputing Center Jongwook Woo CalStat Experience (Cont’d): Bring in Big Data R&D and training to Korea since 2009 Collaborating with LA city in 2016 – Collect, Search, and Analyze City Data • Hadoop, Solr, Java, Cloudera Sept 2013: Samsung Advanced Technology Training Institute Since 2008 – Introduce Hadoop Big Data and education to Univ and Research Centers • Yonsei, Gachon • US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB • Europe: Univ of Luxembourg Myself
  • 5.
    High Performance InformationComputing Center Jongwook Woo CalStat Experience in Big Data  Collaboration  Council Member of IBM Spark Technology Center  City of Los Angeles for OpenHub and Open Data  Startup Companies in Los Angeles  External Collaborator and Advisor in Big Data – IMSC of USC – Pennsylvania State University  Grants  IBM Bluemix , MicroSoft Windows Azure, Amazon AWS in Research and Education Grant  Partnership  Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  • 6.
    High Performance InformationComputing Center Jongwook Woo CalStat Contents  Myself  Introduction To Big Data  Introduction To Spark  Spark and Hadoop  Open Data and Use Cases  Hadoop Spark Training
  • 7.
    High Performance InformationComputing Center Jongwook Woo CalStat Data Issues Large-Scale data Tera-Byte (1012), Peta-byte (1015) – Because of web – Sensor Data (IoT), Bioinformatics, Social Computing, Streaming data, smart phone, online game… Cannot handle with the legacy approach Too big Non-/Semi-structured data Too expensive Need new systems Non-expensive
  • 8.
    High Performance InformationComputing Center Jongwook Woo CalStat Two Cores in Big Data How to store Big Data How to compute Big Data Google How to store Big Data – GFS – Distributed Systems on non-expensive commodity computers How to compute Big Data – MapReduce – Parallel Computing with non-expensive computers Own super computers Published papers in 2003, 2004
  • 9.
    High Performance InformationComputing Center Jongwook Woo CalStat What is Hadoop? 9  Hadoop Founder: o Doug Cutting  Apache Committer: Lucene, Nutch, …
  • 10.
    High Performance InformationComputing Center Jongwook Woo CalStat Definition: Big Data Non-expensive frameworks that can store a large scale data and process it faster in parallel Hadoop –Non-expensive Super Computer –More public than the traditional super computers • You can store and process your applications – In your university labs, small companies, research centers
  • 11.
    High Performance InformationComputing Center Jongwook Woo CalStat Hadoop Cluster: Logical Diagram Web Browser of Cluster nonitor: CM/Ambari HTTP(S) Agent Hadoop Agent Hadoop Agent Hadoop Agent Hadoop Agent Hadoop Agent Hadoop Cluster Monitor . . . . . . . . . Agent Hadoop Agent Hadoop Agent Hadoop HDFS HDFS HDFS HDFS HDFS HDFS HIVE ZooKeeper Impala
  • 12.
    High Performance InformationComputing Center Jongwook Woo CalStat Contents  Myself  Introduction To Big Data  Introduction To Spark  Spark and Hadoop  Open Data and Use Cases  Hadoop Spark Training
  • 13.
    High Performance InformationComputing Center Jongwook Woo CalStat Alternate of Hadoop MapReduce Limitation in MapReduce Hard to program in Java Batch Processing – Not interactive Disk storage for intermediate data – Performance issue Spark by UC Berkley AMP Lab  In-memory storage for intermediate data  20 ~ 100 times faster than N/W and Disk – MapReduce
  • 14.
    High Performance InformationComputing Center Jongwook Woo CalStat Spark In-Memory Data Computing Faster than Hadoop MapReduce Can integrate with Hadoop and its ecosystems HDFS  Amzon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase… New Programming with faster data sharing Good in complex multi-stage applications – Iterative graph algorithms, Machine Learning Interactive query
  • 15.
    High Performance InformationComputing Center Jongwook Woo CalStat Spark RDDs, Transformations, and Actions Spark Streaming real-time Spark SQL MLLib ML machine learning DStream’s: Streams of RDD’s SchemaRDD’s DataFrames RDD-Based Matrices Spark Cores GraphX (graph) RDD-Based Matrices Spark R RDD-Based Matrices
  • 16.
    High Performance InformationComputing Center Jongwook Woo CalStat RDD Operations Transformation Define new RDDs from the current –Lazy: not computed immediately map(), filter(), join() Actions Return values count(), collect(), take(), save()
  • 17.
    High Performance InformationComputing Center Jongwook Woo CalStat Programming in Spark Scala Functional Programming –Fundamental of programming is function • Input/Output is function No side effects –No states Python Legacy, large Libraries Java
  • 18.
    High Performance InformationComputing Center Jongwook Woo CalStat Spark  Spark SQL  DataFrame – Turning an RDD into a Relation  Querying using SQL  Spark Streaming  DStream – RDD in streaming – Windows • To select DStream from streaming data  Mlib, ML  Sparse vector support, Decision trees, Linear/Logistic Regression, PCA  Pipeline
  • 19.
    High Performance InformationComputing Center Jongwook Woo CalStat Contents  Myself  Introduction To Big Data  Introduction To Spark  Spark and Hadoop  Open Data and Use Cases  Hadoop Spark Training
  • 20.
    High Performance InformationComputing Center Jongwook Woo CalStat Spark Spark File Systems: Tachyon Resource Manager: Mesos But, Hadoop has been dominating market Integrating Spark into Hadoop cluster Cloud Computing – Amazon AWS, Azure HDInsight, IBM Bluemix • Object Storage, S3 Hadoop vendors – HDP, CDH Databricks: Spark on AWS – No Hadoop ecosystems
  • 21.
    High Performance InformationComputing Center Jongwook Woo CalStat Spark with Hadoop YARN Spark Client Slave Nodes  ResourceManager (RM) Per Cluster  Create Spark AM and  allocate Containers for Spark AM  NodeManager (NM) Per Node  Spark workers  ApplicationMaster (AM) Per Application  Containers for Spark Executors Master Node Node Manager Node Manager Node Manager Container: Spark Executor Spark AM Resource Manager
  • 22.
    High Performance InformationComputing Center Jongwook Woo CalStat Big Data Analysis Flow Data Collection Batch API: Yelp, Google Streaming: Twitter, Apache NiFi, Kafka, Storm Open Data: Government Data Storage HDFS, S3, Object Storage, NoSQL DB (Couchbase)… Data Filtering Hive, Pig Data Analysis and Science Hive, Pig, Spark, BI Tools (Datameer, Qlik, …) Data Visualization Qlik, Datameer, Excel PowerView
  • 23.
    High Performance InformationComputing Center Jongwook Woo CalStat Databricks cluster at CalStateLA
  • 24.
    High Performance InformationComputing Center Jongwook Woo CalStat Contents  Myself  Introduction To Big Data  Introduction To Spark  Spark and Hadoop  Open Data and Use Cases  Use Cases  Hadoop Spark Training
  • 25.
    High Performance InformationComputing Center Jongwook Woo CalStat Open Data USA government Federal, State, City governments Expose data to public USA Business Twitter, Yelp, … Expose data to public with APIs – Some restriction to download City government New York – Taxi, Uber, … Los Angeles – Open Data, Open Hub with Geo info
  • 26.
    High Performance InformationComputing Center Jongwook Woo CalStat Open Big Data Analysis in CalStateLA Social Media Data Analysis Twitter Sentiment Analysis for Alphago Open Data from Government Airline Data analysis Crime Data analysis Web Service API Business Data Analysis from Yelp and Google Places API
  • 27.
    High Performance InformationComputing Center Jongwook Woo CalStat Data from Industry: Twitter Data  Systems Azure HDInsights Spark 8 Nodes – 40 cores: 2.4GHz Intel Xeon – Memory - Each Node: 28 GB  Data Source Keyword ‘alphago’ from Tweeter via Apache NiFi  Data Size  63,193 tweets  Real Time Data Collection period 03/12 – 03/17/2016 – No data collected on 03/13
  • 28.
    High Performance InformationComputing Center Jongwook Woo CalStat Top 10 Countries that Tweets “Alphago”
  • 29.
    High Performance InformationComputing Center Jongwook Woo CalStat Top 10 Countries  # of Tweets per Country USA: > 11,000 Japan: > 9,000 Korea: > 1,900 Russia, UK: > 1,600 Thai Land, France : > 1,000  Netherland, Spain, Ukraine: > 600
  • 30.
    High Performance InformationComputing Center Jongwook Woo CalStat Top 10 Countries Sentiment Positive Negative
  • 31.
    High Performance InformationComputing Center Jongwook Woo CalStat Top 10 Countries Most Tweeted Countries  All countries show more positive tweets –Korea, Japan, USA Country Positive Negative USA 5070 3567 Japan 8118 217 … Korea 1053 407 …
  • 32.
    High Performance InformationComputing Center Jongwook Woo CalStat Daily Tweets in 03/12 – 03/17/2016 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 3/12/2016 3/13/2016 3/14/2016 3/15/2016 3/16/2016 3/17/2016 Alphago vs Lee Sedol Game 4: Mar 13 Lee Se-Dol win Game 5: Mar 15 Game 3: Mar 12
  • 33.
    High Performance InformationComputing Center Jongwook Woo CalStat Ngram words  3 word in row right after Go-Champion “sedol” and “se-dol” sedol  se-dol 3-grams Frequency Again-to-win 1,187 Is-something-I’ll 369 Is-something-i 199 In-go-tournament 168
  • 34.
    High Performance InformationComputing Center Jongwook Woo CalStat Sentiment Map of Alphago Positive Negative
  • 35.
    High Performance InformationComputing Center Jongwook Woo CalStat Sentiment Map of Lee Se-Dol vs Alphago  YouTube video: “alphago sentiment” by Google  The sentiment of the World in Geo and Time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTb ToiB8wQ2w14a
  • 36.
    High Performance InformationComputing Center Jongwook Woo CalStat Federal Government: Airline Data Set Government Open Data Airline Data Set in 2012 – 2014 – US Dept of transportation Cluster by Nillohit at HiPIC, CalStateLA Microsoft Azure using Hive and Spark SQL  Number of Data Nodes: 4 – CPU: 4 Cores; MEMORY: 7 GB – Windows Server 2012 R2 Datacenter
  • 37.
    Airline Data Set
  • 38.
    Airline Data Set
  • 39.
    Airline Data Set
  • 40.
    City Government: Crime Data Set. Open Data in City of Los Angeles. Crime Data Set for 2012-2015; File Size: 151 MB; Total Number of offenses: 8.94 million. Ram Dharan and Sridhar Reddy at HiPIC, CalStateLA. Microsoft Azure using Hive and Spark SQL. Number of Data Nodes: 4; CPU: 4 Cores; Memory: 14 GB; Windows Server 2012 R2 Datacenter. Extending to the last 10 years of the data set
  • 41.
    Projection of Raw Data (chart: y-axis 0 to 90,000; series: year2012, year2013, year2014, year2015)
  • 42.
    Total No. of Crimes in 2012-15 (chart: y-axis 0 to 25,000; series: year2012, year2013, year2014, year2015)
  • 43.
    Map of Crimes That Occurred within 5 Miles of CalStateLA, UCLA and USC in 2015
  • 44.
    No. of Crimes for Every 5 Miles from CalStateLA (chart: y-axis 0 to 90,000; distance bands in miles: 0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, >35; series: csula_2012, csula_2013, csula_2014, csula_2015)
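Counting crimes per 5-mile band around a campus comes down to a great-circle distance plus a binning step. The sketch below uses the haversine formula. The campus coordinates are approximate and the incident points are hypothetical, and the slides do not say how the distance was actually computed in Hive or Spark SQL.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_band(miles, width=5, last_edge=35):
    """Label a distance with its band, e.g. '0-5', '5-10', ..., '>35'."""
    if miles >= last_edge:
        return f">{last_edge}"
    low = int(miles // width) * width
    return f"{low}-{low + width}"


CAMPUS = (34.0668, -118.1684)  # approximate CalStateLA coordinates (assumption)


def crimes_per_band(points, center=CAMPUS):
    """Count incidents per distance band around the given center."""
    return Counter(
        distance_band(haversine_miles(center[0], center[1], lat, lon))
        for lat, lon in points
    )


# Hypothetical incident coordinates, not taken from the LA data set.
incidents = [(34.0668, -118.1684), (34.0522, -118.2437), (34.2000, -118.6000)]
print(crimes_per_band(incidents))
```

Each band here is half-open, so a crime exactly 5 miles away falls in the 5-10 band.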
  • 45.
    No. of Crimes for Every 5 Miles from UCLA (chart: y-axis 0 to 120,000; distance bands in miles: 0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40; series: ucla_2012, ucla_2013, ucla_2014, ucla_2015)
  • 46.
    No. of Crimes for Every 5 Miles from USC (chart: y-axis 0 to 120,000; distance bands in miles: 0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40; series as labeled in the chart: ucla_2012, ucla_2013, ucla_2014, ucla_2015)
  • 47.
    Comparison of Crimes for Every 5 Miles from CalStateLA, UCLA and USC in 2015 (chart: y-axis 0 to 120,000; distance bands in miles: 0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, 40-50, >50; series: csula_2015, ucla_2015, usc_2015)
  • 48.
    No. of Crimes per Area in LA (chart: y-axis 0 to 18,000; areas: 77th Street, Mission, Newton, Rampart, Southwest, Topanga, Van Nuys, Wilshire, Central, Devonshire, Foothill, Harbor, Hollenbeck, Hollywood, N Hollywood, Pacific, West Valley, Northeast, Olympic, Southeast, West LA; series: in2012, in2013, in2014, in2015)
  • 49.
    Total No. of Crimes for Every 2 Hours in LA (chart: y-axis 0 to 18,000; axis labels as extracted: 77th Street, Mission, Newton, Rampart, Southwest, Topanga, Van Nuys, Wilshire, Central, Devonshire, Foothill, Harbor, Hollenbeck, Hollywood, N Hollywood, Pacific, West Valley, Northeast, Olympic, Southeast, West LA; series: in2012, in2013, in2014, in2015)
  • 50.
    No. of Crimes for Every 2 Hours within 5 Miles of CalStateLA, UCLA and USC in 2015 (chart: y-axis 0 to 12,000; time slots: 00:00-02:00, 02:00-04:00, 04:00-06:00, 06:00-08:00, 08:00-10:00, 10:00-12:00, 12:00-14:00, 14:00-16:00, 16:00-18:00, 18:00-20:00, 20:00-22:00, 22:00-24:00; series: usc, ucla, csula)
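The 2-hour grouping on this slide is a simple binning of each incident's time of day. A minimal sketch with hypothetical timestamps; the slot labels follow the ones shown on the chart, and the function names are illustrative.

```python
from collections import Counter
from datetime import datetime


def two_hour_slot(ts):
    """Map a timestamp to its 2-hour slot label, e.g. '22:00-24:00'."""
    start = (ts.hour // 2) * 2
    return f"{start:02d}:00-{start + 2:02d}:00"


def crimes_per_slot(timestamps):
    """Count incidents per 2-hour slot of the day."""
    return Counter(two_hour_slot(t) for t in timestamps)


# Hypothetical incident times, not taken from the LA data set.
sample_times = [
    datetime(2015, 1, 1, 0, 30),
    datetime(2015, 1, 1, 1, 59),
    datetime(2015, 6, 1, 23, 15),
]
print(crimes_per_slot(sample_times))
```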
  • 51.
    BUSINESS DATA ANALYSIS. Data set details: Yelp Review Data: 1.9 GB; Business Data: 500 MB; Web Service API from Yelp and Google Places. (diagram labels: Analysis, Join, YELP CHALLENGE DATA SET, GOOGLE PLACES, YELP DATA)
  • 52.
    Top 10 Businesses within 5 Miles of CalStateLA (with 5 or 4 star ratings). Hair Salons and Insurance are popular qualified business categories. (chart: count per category, y-axis 0 to 40; bar values: 34, 31, 29, 26, 19, 19, 15, 15, 15; categories: Hair Salons, Auto Repair, General Dentistry, Insurance, Churches, Skin Care, Chiropractors, Barbers, Elementary Schools)
  • 53.
    Businesses Popular within 5 Miles of CalStateLA, USC and UCLA
  • 54.
    Number of Food Businesses within 0-25 Miles of CalStateLA, USC and UCLA. CalStateLA has more food businesses within 5 miles than UCLA and USC. (chart: y-axis 0 to 600; distance bands in miles: 0-5, 5-10, 10-15, 15-20, 20-25; series: CSULA, USC, UCLA)
  • 55.
    Hydrogen Gas Power Plant Prediction Model. The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
  • 56.
    Hydrogen Gas Power Plant Prediction Model. The station produces hydrogen for hydrogen vehicles. The Cal State L.A. Hydrogen Research and Fueling Facility is the first station in the nation to sell hydrogen fuel to the public. Hyundai, Toyota
  • 57.
    Hydrogen Gas Power Plant Prediction Model: Workflow
  • 58.
    Hydrogen Gas Power Plant Prediction Model: Model by Manvi Chandra
  • 59.
    Hydrogen Gas Power Plant Prediction Model: Results and observations
  • 60.
    Hydrogen Gas Power Plant Prediction Model: Results and observations. Can predict Vehicle Pressure, the pressure of hydrogen gas within the vehicle Hydrogen Storage System: using our model in Azure Visual Studio ML; building Spark ML Decision Forest Regression. A decision forest constructs a multitude of decision trees at training time and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
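The last point on this slide, how a decision forest combines its trees, can be shown without any ML library. The sketch below covers only the aggregation step, not tree training. The per-tree outputs are hypothetical numbers, not results from the H2 Station model.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Classification: return the mode (most common class) of the trees' votes."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Regression: return the mean of the individual trees' predictions."""
    return mean(tree_predictions)


# Hypothetical per-tree outputs for one input row.
votes = ["high", "low", "high", "high"]
pressures = [349.0, 351.5, 350.2, 348.8]

print(forest_classify(votes))
print(forest_regress(pressures))
```

Averaging many trees is what makes the regression forest less sensitive to any single tree's errors.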
  • 61.
    Contents: Myself; Introduction To Big Data; Introduction To Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  • 62.
    Spark Big Data Training and R&D. HiPIC, California State University Los Angeles. Supported by: Databricks and its cloud computing services; Amazon AWS, IBM Bluemix, MS Azure; Hortonworks, Cloudera; Datameer
  • 63.
    Databricks Partners
  • 64.
    Training Hadoop and Spark: Cloudera visits to interview Jongwook Woo
  • 65.
    Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  • 66.
    Questions?
  • 67.
    References. Hadoop, http://hadoop.apache.org; Apache Spark Word Count Example, http://spark.apache.org; Databricks, http://www.databricks.com; “Market Basket Analysis using Spark”, Jongwook Woo, in Journal of Science and Technology, April 2015, Volume 5, No 4, pp207-209, ISSN 2225-7217, ARPN; https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
  • 68.
    References (cont'd). Introduction to Big Data with Apache Spark, Databricks; Stanford Spark Class, http://stanford.edu/~rezab; Cornell University, CS5304; DS320: DataStax Enterprise Analytics with Spark; Cloudera, http://www.cloudera.com; Hortonworks, http://www.hortonworks.com; Spark 3 Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
  • 69.
    Scheduling Process. RDD Objects: rdd1.join(rdd2).groupBy(…).filter(…); the optimizer builds the operator DAG. DAGScheduler (receives the DAG): splits the graph into stages of tasks and submits each stage as ready; agnostic to operators. TaskScheduler (receives a TaskSet): launches tasks via the cluster manager and retries failed or straggling tasks; doesn't know about stages; reports a failed stage back to the DAGScheduler. Worker (receives Tasks): executes tasks in threads; its block manager stores and serves blocks.
  • 70.
    Spark Components. Your program: sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); ... Spark Driver/Client (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker. Cluster manager. Spark worker(s): Task threads, Block manager. Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
  • 71.
    Dependency Types. “Wide” (shuffle) deps, the boundary of stages: groupByKey; join with inputs not co-partitioned. “Narrow” deps, a stage pipeline to be run on the same node: map, filter; union; join with inputs co-partitioned
  • 72.
    Dependency Types. “Narrow” deps, a stage pipeline to be run on the same node: map, filter; union; join with inputs co-partitioned. “Wide” (shuffle) deps, the boundary of stages: groupByKey; join with inputs not co-partitioned
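The rule that wide dependencies mark stage boundaries can be illustrated with a toy model in plain Python. It treats a job as a linear chain of operations, which is a simplification, since Spark's DAGScheduler works on a graph. The operation names `join_copartitioned` and `join_not_copartitioned` are labels invented for this sketch, not Spark API calls.

```python
# Dependency kinds as listed on the slide; the join labels are invented names.
NARROW = {"map", "filter", "union", "join_copartitioned"}
WIDE = {"groupByKey", "join_not_copartitioned"}


def split_into_stages(ops):
    """Cut a linear chain of operations into stages at every wide dependency."""
    stages = [[]]
    for op in ops:
        if op in WIDE:
            # Shuffle boundary: the wide operation starts a new stage.
            stages.append([op])
        elif op in NARROW:
            # Narrow operations are pipelined into the current stage.
            stages[-1].append(op)
        else:
            raise ValueError(f"unknown operation: {op}")
    # Drop the empty leading stage if the chain began with a wide operation.
    return [stage for stage in stages if stage]


pipeline = ["map", "filter", "groupByKey", "map", "join_not_copartitioned", "filter"]
for number, stage in enumerate(split_into_stages(pipeline), start=1):
    print(f"Stage {number}: {' -> '.join(stage)}")
```

Everything inside one stage can run on the same node without moving data, and each cut corresponds to a shuffle.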
  • 73.
    Scheduler Optimizations. Pipelines within Stage 2: map, union. Stage 3: join algorithms based on partitioning (minimize shuffles). (diagram: RDDs A to G linked by groupBy, map, union and join across Stage 1, Stage 2 and Stage 3; legend: previously computed partition, Task)
  • 74.
    Scheduler Optimizations, conceptually. Stage 1: 3 tasks; Stage 2: 4 tasks; Stage 3: 3 tasks; Total: 3 stages, 10 tasks. (diagram: RDDs A to G linked by groupBy, map, union and join across Stage 1, Stage 2 and Stage 3; legend: previously computed partition, Task)