Big Data Trend with Open Platform

Jongwook Woo
HiPIC
CalState
LA
SWRC 2017
San Diego, CA
Feb 25 2017
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles

Contents
Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend

Myself
Experience:
Since 2002, Professor at California State Univ Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present: Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors

Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
Collaborating with LA city in 2016
– Collect, Search, and Analyze City Data
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introduced Hadoop Big Data and education to universities and research centers
• Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg

Experience in Big Data
Collaboration
– Council Member of IBM Spark Technology Center
– City of Los Angeles for OpenHub and Open Data
– Startup Companies in Los Angeles
– External Collaborator and Advisor in Big Data
• IMSC of USC
• Pennsylvania State University
Grants
– IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grants
Partnership
– Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata

Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Non-/Semi-structured data
Too expensive
Need new systems
Inexpensive

Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed Systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with inexpensive computers
Own supercomputers
Published papers in 2003, 2004
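
The MapReduce idea above can be sketched on a single machine. This is a minimal illustration in plain Python, not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions made for this sketch. In a real cluster the framework runs the map and reduce calls in parallel on many nodes and performs the shuffle over the network.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases sequentially over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample = ["big data on hadoop", "big data on spark"]  # hypothetical input
    print(word_count(sample))
```

Only the two user-written functions (map and reduce) change from job to job; storage and grouping are the framework's responsibility.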

What is Hadoop?
Hadoop Founder:
– Doug Cutting
Apache Committer:
– Lucene, Nutch, …

Super Computer vs Hadoop
Parallel vs. Distributed file systems by Michael Malak
[Figure: a supercomputer uses a Cluster for Compute plus a separate Cluster for Store; Hadoop uses one Cluster for Compute/Store]

Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
–Inexpensive supercomputer
–More accessible than traditional supercomputers
• You can store your data and run your applications
– In your university labs, small companies, research centers

Hadoop Cluster: Logical Diagram
[Figure: the web browser of the cluster monitor (CM/Ambari) connects over HTTP(S) to the Cluster Monitor, which communicates with an Agent on each Hadoop node. The nodes store data in HDFS. Services shown: Hive, ZooKeeper, Impala.]

Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/

Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Only Map and Reduce
– Limited Parallelization
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
20 to 100 times faster than MapReduce, which relies on network and disk

Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
– HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming model with faster data sharing
Good for complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query

Spark
[Figure: the Spark stack]
Spark Core: RDDs, Transformations, and Actions
Spark Streaming (real-time): DStreams, streams of RDDs
Spark SQL: SchemaRDDs, DataFrames
ML / MLlib (machine learning): RDD-based matrices
GraphX (graph): RDD-based matrices
SparkR: RDD-based matrices

Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Communicate with Spark workers
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–Development and Test

RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
–that can be cached in memory
Immutable
–RDD, DStream, SchemaRDD, PairRDD
Lineage
–History of the objects
–Automatically and efficiently re-compute lost
data
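
Lineage-based recovery can be illustrated with a toy model. The ToyRDD class below is hypothetical and is not Spark's RDD API; it only assumes that each dataset remembers its parent and the function that produced it, so a lost in-memory copy can be rebuilt by replaying that history.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable data plus the recipe that built it."""

    def __init__(self, source=None, parent=None, fn=None, name="source"):
        self.source = tuple(source) if source is not None else None
        self.parent = parent   # previous dataset in the lineage chain
        self.fn = fn           # function applied to each parent element
        self.name = name
        self.cache = None      # in-memory copy, which may be lost

    def map(self, fn, name):
        # A transformation never changes this dataset; it returns a new one.
        return ToyRDD(parent=self, fn=fn, name=name)

    def lineage(self):
        """Names of all datasets from the root down to this one."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.name]

    def compute(self):
        """Return the data, rebuilding it from the lineage if the cache is empty."""
        if self.cache is None:
            if self.parent is None:
                self.cache = self.source
            else:
                self.cache = tuple(self.fn(x) for x in self.parent.compute())
        return self.cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, for example after a node failure."""
        self.cache = None


if __name__ == "__main__":
    base = ToyRDD(source=[1, 2, 3])
    result = base.map(lambda x: x * 2, "doubled").map(lambda x: x + 1, "plus_one")
    print(result.lineage())
    print(result.compute())
    result.lose_cache()
    print(result.compute())  # recomputed from the lineage
```

No replica of the derived data is stored; only the recipe is kept, which is why recomputation is cheap to support.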

RDD and DataFrame Operations
Transformation
Define new RDDs and DataFrames from the current ones
–Lazy: not computed immediately
map(), filter(), join(), select(), groupBy()
Actions
Return values
count(), collect(), take(), save()
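
The split between lazy transformations and actions can be mimicked with Python generators. LazyDataset is a hypothetical class written for this sketch, not the Spark API; the method names only echo the ones listed above. Nothing is computed when map() or filter() is called; work happens when an action such as collect(), count(), or take() runs.

```python
import itertools


class LazyDataset:
    """Toy dataset whose transformations are recorded, not executed."""

    def __init__(self, make_iter):
        self._make_iter = make_iter  # zero-argument callable returning a fresh iterator

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new LazyDataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._make_iter()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._make_iter() if pred(x)))

    # Actions: run the recorded pipeline and return values.
    def collect(self):
        return list(self._make_iter())

    def count(self):
        return sum(1 for _ in self._make_iter())

    def take(self, n):
        return list(itertools.islice(self._make_iter(), n))


if __name__ == "__main__":
    calls = []

    def traced_square(x):
        calls.append(x)  # records when the function really runs
        return x * x

    ds = LazyDataset.from_list([1, 2, 3, 4]).map(traced_square).filter(lambda v: v % 2 == 0)
    print(calls)         # still empty: no action has run
    print(ds.collect())  # the action triggers the computation
```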

Programming in Spark
Scala
Functional Programming
– The fundamental unit of programming is the function
• Functions can be inputs and outputs
No side effects
– No states
Python
Legacy, large Libraries
Java
R
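
The functional style described for Scala can be shown in Python as well. This is a small sketch with made-up helper names (compose, double, increment, scale): functions are passed in and returned, and no shared state is modified.

```python
def compose(f, g):
    """Take two functions as input and return a new function as output."""
    return lambda x: g(f(x))


def double(x):
    return 2 * x


def increment(x):
    return x + 1


def scale(values, factor):
    """Pure function: builds a new list and leaves the input untouched."""
    return [v * factor for v in values]


if __name__ == "__main__":
    double_then_increment = compose(double, increment)
    print(list(map(double_then_increment, [1, 2, 3])))

    original = [1, 2, 3]
    print(scale(original, 10), original)  # the original list is unchanged
```

Because such functions have no side effects, a framework can run them on any node, in any order, and rerun them after a failure.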

Spark
Spark SQL
– Querying using SQL, HiveQL
– DataFrame
ML
– Machine Learning on DataFrame, Pipelining
– MLlib
• On RDD
• Sparse vector support, Decision trees, Linear/Logistic Regression, PCA, SVM
Spark Streaming
– DStream
• RDD in streaming
• Windows: to select a DStream from streaming data
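
Windowing over a stream of micro-batches can be sketched in plain Python. The function windowed_counts and its parameters window_length and slide_interval are names chosen for this illustration (both measured in number of batches); this is not the Spark Streaming API.

```python
from collections import deque


def windowed_counts(batches, window_length, slide_interval):
    """Yield (batch_number, element_count) for each emitted window.

    A window covers the last `window_length` batches and is emitted
    once every `slide_interval` batches.
    """
    window = deque(maxlen=window_length)  # old batches fall out automatically
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:
            merged = [item for b in window for item in b]
            yield i, len(merged)


if __name__ == "__main__":
    stream = [["a"], ["b", "c"], [], ["d"]]  # hypothetical micro-batches
    print(list(windowed_counts(stream, window_length=2, slide_interval=1)))
```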

Scheduling Process
[Figure: Spark scheduling pipeline]
RDD Objects: rdd1.join(rdd2).groupBy(…).filter(…)
– Optimizer: build operator DAG
DAGScheduler (receives the DAG)
– split graph into stages of tasks
– submit each stage as ready
– agnostic to operators!
TaskScheduler (receives a TaskSet)
– launch tasks via cluster manager
– retry failed or straggling tasks
– doesn’t know about stages
– reports back when a stage failed
Worker (Threads running Tasks, Block manager)
– execute tasks
– store and serve blocks
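
How a chain of operators is cut into stages can be sketched with a strong simplification: the operators form a straight line, and every operator that needs a shuffle starts a new stage. WIDE_OPS and split_into_stages are hypothetical names for this sketch. The real DAGScheduler works on a graph (a join has two parents) and is not reproduced here.

```python
# Operators assumed to require a shuffle in this toy model.
WIDE_OPS = {"join", "groupBy", "reduceByKey"}


def split_into_stages(ops):
    """Group a linear list of operator names into stages.

    Narrow operators are pipelined into the current stage; a wide
    operator closes the current stage and opens the next one.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


if __name__ == "__main__":
    # "textFile" stands for reading the input before the first shuffle.
    print(split_into_stages(["textFile", "join", "groupBy", "filter"]))
```

In this model filter lands in the same stage as groupBy, which mirrors how narrow operations are pipelined after a shuffle.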

During Scheduling Process
https://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797

Spark
Spark
File Systems: Tachyon
Resource Manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into a Hadoop cluster
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystems

Spark Components
[Figure: Spark components]
Your program:
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…)
.count()
...
Spark Driver/Client (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker
Cluster manager
Spark worker(s): Task threads, Block manager
Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …

Spark with Hadoop YARN
ResourceManager (RM): per cluster
– Creates the Spark AM and allocates Containers for the Spark AM
NodeManager (NM): per node
– Spark workers
ApplicationMaster (AM): per application
– Containers for Spark Executors
[Figure: a Spark Client submits to the Resource Manager on the Master Node; each Slave Node runs a Node Manager whose Containers host the Spark AM and the Spark Executors.]

Databricks cluster at CalStateLA

Open Platform
Open Source
Open Conference
Open Data
Public Data

Open Source
Hadoop
http://hadoop.apache.org/
Spark
http://spark.apache.org/
NoSQL
http://hbase.apache.org/
Search Engine
http://lucene.apache.org/solr/

Open Conference
Hadoop Summit
Live Streaming
–http://siliconangle.tv/hadoop-summit-2016/
Spark Summit
https://spark-summit.org/east-2017/
Live Streaming
–http://go.spark-summit.org/east-2017/live-stream?_ga=1.62160364.1150099959.1484851457

Open Data
USA government
Federal, State, City governments
Expose data to public
USA Business
Twitter, Yelp, …
Expose data to public with APIs
– Some restrictions on downloading
City government
New York
– Taxi, Uber, …
Los Angeles
– Open Data, Open Hub with Geo info

Databricks Partners

Industrial Collaboration
Cloudera visits to interview Jongwook Woo

Industrial Collaboration: IBM Bluemix
at CalStateLA

Big Data Analysis and Prediction Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Tableau, Qlik, …)
Data Visualization
– Qlik, Datameer, Excel PowerView

LOCAL BUSINESS DATA ANALYSIS
Using Local Business Data
From Yelp and Google Local
Grad Students at CalStateLA
Symposium, Feb 24 2017
Yashaswi Ananth
Ruchi Singh
Mahsa Tayer Farahani

REVIEW COUNT FOR BUSINESS TYPES
• Food
• Services
• Entertainment
• Shopping
• Medical

TOP BUSINESS IN THE SIX CATEGORIES

Review count of popular sub-categories of business

Sentiment Analysis of Services category

Top business
Top 5 most popular local businesses on Yelp between 2006 and 2016 in the selected cities

Popular businesses within 5 miles of CalStateLA, USC, UCLA

Historical Analysis Of
College Scorecard
CalStateLA Symposium
Feb 24 2017
Kunal Pritwani
Atinder Singh
Dharmesh Soni
Mounika Vallabhaneni

Specification of Data Set
Data is collected from:
https://www.kaggle.com/kaggle/college-scorecard
We have historical data of over 100,000 colleges in the US spanning over 14 years.
Data Size – 1.33 GB
File Format – CSV (Comma-Separated Values)

Mean Income
Medical College of Wisconsin: 250K
Upstate Medical University: 152.7K
CalTech: 103K
Washington and Lee University: 100K

Comparing Average Net Price of Two
States (Annual Tuition)
UCLA: $13,817 CalStateLA: $4,370
Fashion Inst of Tech: $11.5K CUNY: $5K

SAT Scores in Different Colleges
Math (Blue), Verbal (Orange), Mean Earning (Purple)
• CalTech: 800, 778.9, $98.7K
• MIT: 800, 764.4, $124.4K
• Harvard: 791, 795.6, $133K
• Princeton: 793, 791, $115.6K
• Yale: 788, 794.4, $97.8K

Comparing Average Undergraduates
Receiving PELL GRANT
Universal Career Community College: 100% PELL grant scholarship

Average Undergraduates Receiving
PELL GRANT in Each College
East Georgia State College: $2,854 Avg.
PELL grant: 97.285%

AlphaGo vs Lee using Twitter Data
Systems
– Azure HDInsight Spark
– 8 Nodes
• 40 cores: 2.4GHz Intel Xeon
• Memory - Each Node: 28 GB
Data Source
– Keyword ‘alphago’ from Twitter via Apache NiFi
Data Size
– 63,193 tweets
Real-Time Data Collection period
– 03/12 – 03/17/2016
• No data collected on 03/13

Top 10 Countries that Tweet “Alphago”

Top 10 Countries
# of Tweets per Country
– USA: > 11,000
– Japan: > 9,000
– Korea: > 1,900
– Russia, UK: > 1,600
– Thailand, France: > 1,000
– Netherlands, Spain, Ukraine: > 600

Top 10 Countries Sentiment
Positive Negative

Top 10 Countries
Most Tweeted Countries
All countries show more positive tweets than negative ones
–Korea, Japan, USA
Country Positive Negative
USA 5070 3567
Japan 8118 217
…
Korea 1053 407
…
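
The share of positive tweets for the three countries in the table can be worked out directly. This short sketch uses only the counts shown above. Tweets that were neither positive nor negative are not listed on the slide, so the share is taken over positive plus negative only, which is an assumption of this example.

```python
# (positive, negative) tweet counts copied from the table above.
counts = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Fraction of positive tweets among positive and negative ones."""
    return positive / (positive + negative)


shares = {country: positive_share(p, n) for country, (p, n) in counts.items()}

if __name__ == "__main__":
    for country, share in shares.items():
        print(f"{country}: {share * 100:.1f}% positive")
```

Under that assumption the shares are about 58.7% for the USA, 97.4% for Japan, and 72.1% for Korea.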

Daily Tweets in 03/12 – 03/17/2016
[Figure: chart "Alphago vs Lee Sedol" of daily tweet counts, y-axis 0 to 50,000, x-axis 3/12/2016 to 3/17/2016. Annotations: Game 3: Mar 12; Game 4: Mar 13, Lee Se-Dol wins; Game 5: Mar 15.]

Ngram words
3 words in a row right after the Go champion’s name, “sedol” or “se-dol”
[Figure panels: sedol, se-dol]
3-grams Frequency
Again-to-win 1,187
Is-something-I’ll 369
Is-something-i 199
In-go-tournament 168
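
Counting the three words that follow a keyword can be sketched as below. The function trigrams_after, the tokenizer regex, and the sample tweets are assumptions for illustration and are not the code used for the results above. Unlike the table, this sketch lowercases every token.

```python
import re
from collections import Counter


def trigrams_after(tweets, keywords=("sedol", "se-dol")):
    """Count the 3-word sequences that directly follow any keyword."""
    counts = Counter()
    for text in tweets:
        # Keep letters, digits, apostrophes and hyphens inside a token.
        tokens = re.findall(r"[a-z0-9'-]+", text.lower())
        for i, token in enumerate(tokens):
            following = tokens[i + 1:i + 4]
            if token in keywords and len(following) == 3:
                counts["-".join(following)] += 1
    return counts


if __name__ == "__main__":
    sample = [  # hypothetical tweets
        "Lee Sedol again to win game four",
        "lee se-dol again to win",
        "sedol is something",
    ]
    print(trigrams_after(sample).most_common())
```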

Sentiment Map of Alphago
Positive
Negative

Sentiment Map of Lee Se-Dol vs Alphago
YouTube video: “alphago sentiment” by Google
The sentiment of the World in Geo and Time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a

Airline Data Set
Government Open Data
Airline Data Set in 2012 – 2014
– US Dept of transportation
Cluster by Nillohit at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 7 GB
– Windows Server 2012 R2 Datacenter

Airline Data Set

City Government: Crime Data Set
Open Data in City of Los Angeles
Crime Data Set in 2014
Ram Dharan and Sridhar Reddy at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 14 GB
– Windows Server 2012 R2 Datacenter
– Extending to last 10 years of data set

Crime Data
Los Angeles 2014
[Pie chart: total occurrences of each crime. Slice labels: 2%, 8%, 9%, 12%, 17%, 19%, 33%. Legend: CRIMINAL, VANDALISM, OTHERS, BURGLARY, ASSAULT, TRAFFIC, THEFT.]

Total No. of Crimes in 2014
[Bar chart: No. of Crimes per Month, months 1 to 12, y-axis 0 to 25,000]
Data labels: 19169, 17384, 19730, 19413, 20645, 20494, 21480, 21280, 21287, 21669, 19844, 21355
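
A quick summary of the twelve monthly values, assuming the data labels are listed in month order (1 = January through 12 = December). That ordering is an assumption of this sketch; the chart itself is not reproduced here.

```python
# Monthly crime counts for 2014, assumed to be in month order 1..12.
monthly = [19169, 17384, 19730, 19413, 20645, 20494,
           21480, 21280, 21287, 21669, 19844, 21355]

total = sum(monthly)                            # yearly total
mean = total / len(monthly)                     # average per month
peak_month = monthly.index(max(monthly)) + 1    # month with the most crimes
low_month = monthly.index(min(monthly)) + 1     # month with the fewest crimes

if __name__ == "__main__":
    print(f"total={total}, mean={mean}, peak month={peak_month}, low month={low_month}")
```

Under that assumption the year totals 243,750 incidents, about 20,312 per month, with the highest count in month 10 and the lowest in month 2.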

Raw Data Projection on Map

Mapping of Crimes That Occurred within 5 miles of CalStateLA
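
A radius filter like "within 5 miles" can be sketched with the haversine great-circle distance. The function names, the Earth radius constant, and the coordinates below are assumptions for illustration (the coordinates are only approximate campus locations); this is not the Hive or Spark SQL query used for the maps.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(points, center, radius_miles=5.0):
    """Keep the (lat, lon) points that lie within radius_miles of center."""
    lat0, lon0 = center
    return [p for p in points if haversine_miles(lat0, lon0, p[0], p[1]) <= radius_miles]


if __name__ == "__main__":
    campus = (34.0668, -118.1684)      # approximate CalStateLA location
    incidents = [
        (34.0700, -118.1700),          # hypothetical point near the campus
        (34.0689, -118.4452),          # hypothetical point near UCLA
    ]
    print(within_radius(incidents, campus))
```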

5 miles from UCLA

5 miles from USC

Mapping of Crimes That Occurred within 5 miles of CalStateLA, UCLA and USC in 2015

No. of crimes within 5 miles of CSULA, UCLA and USC by crime type
[Bar chart: y-axis 0 to 30,000; groups: csula, ucla, usc]

Future Research Trend
Deep Learning
TensorFlow and Spark
– Yahoo, Intel, Google
– Image Recognition, Prediction Analysis
ChatBot
Amazon Alexa API
IBM Watson ChatBot API
Google Home API
More into
In-Memory Processing
– Spark DataFrame, Data Set, ML
Cloud Computing
– IBM Bluemix, MS Azure, Google Cloud, Amazon AWS

Questions?
