Big Data Trend and Open Data
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles (CalStateLA)
UKC 2016, Dallas, TX, Aug 12 2016
Contents
 Myself
 Introduction To Big Data
 Introduction To Spark
 Spark and Hadoop
 Open Data and Use Cases
 Hadoop Spark Training
Myself
Experience:
 Since 2002, Professor at California State Univ Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
 Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– implements eBusiness applications using J2EE and middleware
 Since 2007: Exposed to Big Data at CitySearch.com
 2012 - Present : Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
Collaborating with LA city in 2016
– Collect, Search, and Analyze City Data
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop Big Data and education to universities and research centers
• Korea: Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
Experience in Big Data
 Collaboration
 Council Member of IBM Spark Technology Center
 City of Los Angeles for OpenHub and Open Data
 Startup Companies in Los Angeles
 External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
 Grants
 IBM Bluemix, Microsoft Windows Azure, and Amazon AWS research and education grants
 Partnership
 Academic Education Partnership with Databricks, Tableau, Qlik,
Cloudera, Hortonworks, SAS, Teradata
Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of web
– Sensor Data (IoT), Bioinformatics, Social Computing,
Streaming data, smart phone, online game…
Cannot be handled with the legacy approach
Too big
Non-/Semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Own super computers
Published papers in 2003 (GFS) and 2004 (MapReduce)
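The MapReduce idea on this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Google's or Hadoop's implementation; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values by key, as a framework would do
    # between the map and reduce steps.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data trend", "open data"]))
```

In a real cluster the map and reduce calls would run in parallel on different machines; here they run one after another so the data flow is easy to follow.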
What is Hadoop?
 Hadoop Founder:
o Doug Cutting
 Apache Committer:
Lucene, Nutch, …
Definition: Big Data
Inexpensive frameworks that can
store large-scale data and process
it faster in parallel
Hadoop
–Inexpensive Super Computer
–More accessible to the public than traditional super
computers
• You can store and process your applications
– In your university labs, small companies,
research centers
Hadoop Cluster: Logical Diagram
[Diagram: a web browser reaches the cluster monitor (CM/Ambari) over HTTP(S); the cluster monitor talks to an agent on each Hadoop node, and every node runs HDFS. Hive, ZooKeeper, and Impala also appear as cluster services.]
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
 In-memory storage for intermediate data
 20 to 100 times faster than MapReduce
– MapReduce goes through network and disk
Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
HDFS
 Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New Programming with faster data sharing
Good in complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query
Spark
[Diagram: the Spark stack. Spark Core (RDDs, transformations, and actions) sits underneath Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, DataFrames), MLlib/ML (machine learning; RDD-based matrices), GraphX (graph; RDD-based matrices), and SparkR (RDD-based matrices).]
RDD Operations
Transformation
Define new RDDs from the current
–Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), save()
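The difference between lazy transformations and actions can be illustrated without a Spark installation. The sketch below is a toy stand-in written in plain Python, not Spark's RDD class; `LazyDataset` and its internals are hypothetical names, and only `map`, `filter`, `collect`, `count`, and `take` are modeled.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable, e.g. a list or range
        self._steps = tuple(steps)   # recorded transformations, not yet executed

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Replay the recorded steps over the source, one element at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):   # kind == "filter"
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


if __name__ == "__main__":
    squares = LazyDataset(range(1, 6)).map(lambda x: x * x)
    odd_squares = squares.filter(lambda x: x % 2 == 1)  # still nothing computed
    print(odd_squares.collect())
```

Each transformation returns a new object and leaves the original untouched, which mirrors "define new RDDs from the current" on the slide.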
Programming in Spark
Scala
Functional Programming
–Functions are the fundamental unit of programming
• Functions can be inputs and outputs
No side effects
–No state
Python
Legacy, large Libraries
Java
Spark
 Spark SQL
 DataFrame
– Turning an RDD into a Relation
 Querying using SQL
 Spark Streaming
 DStream
– RDD in streaming
– Windows
• To select DStream from streaming data
 MLlib, ML
 Sparse vector support, Decision trees, Linear/Logistic Regression,
PCA
 Pipeline
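The Spark Streaming bullets above describe a DStream as RDDs arriving over time, with windows selecting part of that stream. A minimal sketch of the idea in plain Python follows, assuming each micro-batch is just a list of words and the window slides by one batch; `windowed_counts` is a hypothetical helper, not a Spark API.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length):
    """Yield word counts over the last `window_length` micro-batches."""
    window = deque(maxlen=window_length)  # the oldest batch drops out automatically
    for batch in batches:
        window.append(batch)
        counts = Counter()
        for recent_batch in window:
            counts.update(recent_batch)
        yield dict(counts)


if __name__ == "__main__":
    stream = [["alphago"], ["alphago", "sedol"], ["sedol"]]
    for counts in windowed_counts(stream, window_length=2):
        print(counts)
```

Recomputing the whole window for every batch keeps the sketch short; a production system would update the counts incrementally.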
Spark
Spark
File Systems: Tachyon
Resource Manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into Hadoop cluster
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystems
Spark with Hadoop YARN
 ResourceManager (RM): one per cluster
 Creates the Spark AM and
 allocates containers for the Spark AM
 NodeManager (NM): one per node
 Hosts Spark workers
 ApplicationMaster (AM): one per application
 Requests containers for Spark executors
[Diagram: a Spark client submits to the ResourceManager on the master node; each slave node runs a NodeManager that hosts containers, one holding the Spark AM and the others holding Spark executors.]
Big Data Analysis Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Datameer, Qlik, …)
Data Visualization
– Qlik, Datameer, Excel PowerView
Databricks cluster at CalStateLA
Open Data
USA government
Federal, State, City governments
Expose data to public
USA Business
Twitter, Yelp, …
Expose data to public with APIs
– Some restrictions on downloads
City government
New York
– Taxi, Uber, …
Los Angeles
– Open Data, Open Hub with Geo info
Open Big Data Analysis in
CalStateLA
Social Media Data Analysis
Twitter Sentiment Analysis for Alphago
Open Data from Government
Airline Data analysis
Crime Data analysis
Web Service API
Business Data Analysis from Yelp and Google Places API
Data from Industry: Twitter Data
 Systems
Azure HDInsight Spark
8 Nodes
– 40 cores: 2.4GHz Intel Xeon
– Memory - Each Node: 28 GB
 Data Source
Keyword ‘alphago’ from Twitter via Apache NiFi
 Data Size
 63,193 tweets
 Real Time Data Collection period
03/12 – 03/17/2016
– No data collected on 03/13
Top 10 Countries Tweeting “Alphago”
Top 10 Countries
 # of Tweets per Country
USA: > 11,000
Japan: > 9,000
Korea: > 1,900
Russia, UK: > 1,600
Thailand, France: > 1,000
Netherlands, Spain, Ukraine: > 600
Top 10 Countries Sentiment
Positive Negative
Top 10 Countries
Most Tweeted Countries
 All countries show more positive tweets
–Korea, Japan, USA
Country Positive Negative
USA 5070 3567
Japan 8118 217
…
Korea 1053 407
…
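The claim that these countries show more positive tweets can be checked directly from the counts in the table. The snippet below is a small worked calculation in Python; the names `SENTIMENT` and `positive_share` are introduced here for illustration only, and the numbers are the three rows shown above.

```python
# Positive and negative tweet counts copied from the table above.
SENTIMENT = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    # Fraction of the classified tweets that are positive.
    return positive / (positive + negative)


if __name__ == "__main__":
    for country, (pos, neg) in SENTIMENT.items():
        print(f"{country}: {positive_share(pos, neg):.1%} positive")
```

With these counts the positive share comes out at roughly 59% for the USA, 97% for Japan, and 72% for Korea, so all three are above one half.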
Daily Tweets in 03/12 – 03/17/2016
[Chart: daily tweet counts from 3/12/2016 through 3/17/2016; y-axis from 0 to 50,000. Annotations for Alphago vs Lee Sedol: Game 3 on Mar 12; Game 4 on Mar 13 (Lee Se-Dol win); Game 5 on Mar 15.]
N-gram words
 The 3 words in a row right after the Go champion's name,
“sedol” or “se-dol”
3-grams Frequency
Again-to-win 1,187
Is-something-I’ll 369
Is-something-i 199
In-go-tournament 168
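Counting the three words that follow a keyword can be sketched as below. This is an assumed reconstruction in plain Python, not the code used for the slide; the tokenization rule (lowercase, keep letters, digits, apostrophes, and hyphens) and the function name `following_trigrams` are choices made for this example.

```python
import re
from collections import Counter

KEYWORDS = {"sedol", "se-dol"}


def following_trigrams(tweets, keywords=KEYWORDS):
    """Count the three words that directly follow any keyword in each tweet."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase, then keep runs of letters, digits, apostrophes, and hyphens.
        words = re.findall(r"[a-z0-9'-]+", tweet.lower())
        for i, word in enumerate(words):
            following = words[i + 1 : i + 4]
            if word in keywords and len(following) == 3:
                counts["-".join(following)] += 1
    return counts


if __name__ == "__main__":
    sample = ["Lee Sedol is going again to win", "Sedol again to win!"]
    print(following_trigrams(sample).most_common())
```

Keywords with fewer than three words after them are skipped, so every counted entry is a full 3-gram.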
Sentiment Map of Alphago
Positive
Negative
Sentiment Map of Lee Se-Dol vs Alphago
 YouTube video: “alphago sentiment” by Google
 The sentiment of the World in Geo and Time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
Federal Government: Airline Data Set
Government Open Data
Airline Data Set in 2012 – 2014
– US Dept of transportation
Cluster by Nillohit at HiPIC, CalStateLA
Microsoft Azure using Hive and Spark SQL
 Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 7 GB
– Windows Server 2012 R2 Datacenter
Airline Data Set
City Government: Crime Data Set
Open Data in City of Los Angeles
 Crime Data Set in 2012-2015
 File Size – 151MB
 Total Number of offenses – 8.94 million
Ram Dharan and Sridhar Reddy at HiPIC,
CalStateLA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 14 GB
– Windows Server 2012 R2 Datacenter
– Extending to last 10 years of data set
Projection of Raw Data
[Bar chart: one bar per year for 2012 through 2015; y-axis from 0 to 90,000.]
Total No. of Crimes in 2012-15
[Bar chart: one bar per year for 2012 through 2015; y-axis from 0 to 25,000.]
Map of Crimes That Occurred within 5 Miles of CalStateLA, UCLA, and USC in 2015
No. of Crimes for Every 5 Miles from CalStateLA
[Bar chart: crime counts by distance from CalStateLA in 5-mile bands, from 0-5 up to >35 miles, with one series per year for 2012 through 2015; y-axis from 0 to 90,000.]
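Grouping crimes into 5-mile bands around a campus needs a distance function and a banding rule. The sketch below shows one way to do it in plain Python with the haversine formula. The campus coordinates are approximate values assumed for illustration, and the names `haversine_miles`, `distance_band`, and `band_counts` are hypothetical; the slides do not show the actual Hive or Spark SQL queries.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

# Approximate campus coordinates (latitude, longitude); assumptions for illustration.
CAMPUSES = {
    "csula": (34.0668, -118.1684),
    "ucla": (34.0689, -118.4452),
    "usc": (34.0224, -118.2851),
}


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_band(miles, width=5, last=35):
    """Label a distance with its band: '0-5', '5-10', ..., then '>35'."""
    if miles >= last:
        return f">{last}"
    low = int(miles // width) * width
    return f"{low}-{low + width}"


def band_counts(crimes, campus):
    """Count (lat, lon) crime locations per distance band from a campus."""
    lat0, lon0 = CAMPUSES[campus]
    counts = {}
    for lat, lon in crimes:
        band = distance_band(haversine_miles(lat0, lon0, lat, lon))
        counts[band] = counts.get(band, 0) + 1
    return counts


if __name__ == "__main__":
    # Two made-up points: one at the CalStateLA campus, one at USC.
    print(band_counts([CAMPUSES["csula"], CAMPUSES["usc"]], "csula"))
```

Each band includes its lower bound and excludes its upper bound, so a crime exactly 5 miles away falls in the 5-10 band.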
No. of Crimes for Every 5 Miles from UCLA
[Bar chart: crime counts by distance from UCLA in 5-mile bands, from 0-5 up to >40 miles, with one series per year for 2012 through 2015; y-axis from 0 to 120,000.]
No. of Crimes for Every 5 Miles from USC
[Bar chart: crime counts by distance from USC in 5-mile bands, from 0-5 up to >40 miles, with one series per year for 2012 through 2015; y-axis from 0 to 120,000.]
Comparison of Crimes for Every 5 Miles from CalStateLA, UCLA, and USC in 2015
[Bar chart: crime counts in 5-mile bands from 0-5 up to 35-40 miles, then 40-50 and >50 miles, with series csula_2015, ucla_2015, and usc_2015; y-axis from 0 to 120,000.]
No. of Crimes per Area in LA
[Bar chart: crime counts per area, with one series per year for 2012 through 2015; y-axis from 0 to 18,000. Areas: 77th Street, Mission, Newton, Rampart, Southwest, Topanga, Van Nuys, Wilshire, Central, Devonshire, Foothill, Harbor, Hollenbeck, Hollywood, N Hollywood, Pacific, West Valley, Northeast, Olympic, Southeast, West LA.]
Total No. of Crimes for Every 2 Hours in LA
[Chart: y-axis from 0 to 18,000; the extracted labels are the same 21 areas as in the per-area chart, with series in2012 through in2015.]
No. of Crimes for Every 2 Hours within 5 Miles of CalStateLA, UCLA, and USC in 2015
[Horizontal bar chart: crime counts per 2-hour slot from 00:00-02:00 to 22:00-24:00, with series usc, ucla, and csula; x-axis from 0 to 12,000.]
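The 2-hour slots on this chart can be derived from the hour at which each crime occurred. A minimal Python sketch follows, assuming the hour is already available as an integer from 0 to 23; `two_hour_slot` and `crimes_per_slot` are hypothetical helper names.

```python
from collections import Counter


def two_hour_slot(hour):
    """Map an hour of day (0-23) to a slot label such as '22:00-24:00'."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    start = (hour // 2) * 2
    return f"{start:02d}:00-{start + 2:02d}:00"


def crimes_per_slot(hours):
    """Count occurrences per 2-hour slot from an iterable of hours."""
    return Counter(two_hour_slot(h) for h in hours)


if __name__ == "__main__":
    print(crimes_per_slot([0, 1, 2, 23]))
```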
BUSINESS DATA ANALYSIS
 DATA SET DETAILS
• Yelp Review Data : 1.9GB
• Business Data: 500MB
• Web Service API from Yelp and Google Places
[Diagram: the Yelp Challenge data set, Yelp data, and Google Places data are joined and then analyzed.]
Top 10 businesses within 5 miles from CalStateLA
(with 5 or 4 star ratings)
[Bar chart of business counts per category; y-axis from 0 to 40. Bar values: 34, 31, 29, 26, 19, 19, 15, 15, 15. Legend: Hair Salons, Auto Repair, General Dentistry, Insurance, Churches, Skin Care, Chiropractors, Barbers, Elementary Schools.]
• Hair Salons and Insurance are popular qualified business categories
Businesses popular within 5 miles of CalStateLA, USC, and UCLA
Number of food businesses within a radius of 0-25 miles from CalStateLA, USC, and UCLA
CalStateLA has more food businesses within 5 miles than UCLA and USC
[Bar chart: counts per 5-mile band (0-5, 5-10, 10-15, 15-20, 20-25) for CSULA, USC, and UCLA; y-axis from 0 to 600.]
Hydrogen Gas Power Plant
Prediction Model
The Cal State L.A. Hydrogen Research and Fueling
Facility (H2 Station)
opened on May 7, 2014.
Hydrogen Gas Power Plant
Prediction Model
The station
produces hydrogen for hydrogen vehicles
Cal State L.A. Hydrogen Research and Fueling
Facility
is the first station in the nation to sell hydrogen fuel to
the public.
Hyundai, Toyota
Hydrogen Gas Power Plant
Prediction Model
Workflow
Hydrogen Gas Power Plant
Prediction Model
Model by Manvi Chandra
Hydrogen Gas Power Plant
Prediction Model
Results and observations
 Can predict Vehicle Pressure
– Pressure of hydrogen gas within the vehicle Hydrogen
Storage System
– using our model in Azure Visual Studio ML
– Building Spark ML
Decision Forest Regression
– Constructs a multitude of decision trees at training
time and outputs
• the mode of the classes (classification), or
• the mean prediction (regression) of the individual trees.
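How a decision forest combines its trees can be shown with two short functions. This is only the aggregation step, written in plain Python with made-up inputs; it is not the Azure or Spark ML model from the slides, and `forest_classify` and `forest_regress` are hypothetical names.

```python
from statistics import mean, mode


def forest_classify(tree_votes):
    # Classification: the forest returns the most common class among the trees.
    return mode(tree_votes)


def forest_regress(tree_predictions):
    # Regression: the forest returns the mean of the individual tree predictions.
    return mean(tree_predictions)


if __name__ == "__main__":
    # Made-up outputs from three trees, for illustration only.
    print(forest_classify(["high", "low", "high"]))
    print(forest_regress([340.0, 350.0, 360.0]))
```

For a numeric target such as vehicle pressure, the regression form applies: each tree predicts a number and the forest averages them.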
Spark Big Data Training and R&D
HiPIC
California State University Los Angeles
Supported by
– Databricks and its cloud computing services
– Amazon AWS, IBM Bluemix, MS Azure
– Hortonworks, Cloudera
– Datameer
Databricks Partners
Training Hadoop and Spark
Cloudera visited to interview Jongwook Woo
Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles
Question?
Scheduling Process
RDD Objects
– rdd1.join(rdd2).groupBy(…).filter(…)
– Optimizer: builds the operator DAG
DAGScheduler (receives the DAG)
– Splits the graph into stages of tasks
– Submits each stage as ready
– Agnostic to operators
TaskScheduler (receives TaskSets)
– Launches tasks via the cluster manager
– Retries failed or straggling tasks
– Doesn’t know about stages; reports a failed stage back to the DAGScheduler
Worker (receives Tasks)
– Executes tasks in threads
– Stores and serves blocks through the block manager
Spark Components
Your program
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…)
.count()
...
Spark Driver/Client (app master)
– RDD graph
– Scheduler
– Block tracker
– Shuffle tracker
Cluster manager
Spark worker(s)
– Block manager
– Task threads
Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
Dependency Types
“Narrow” deps: a stage pipeline to be run on the same node
– map, filter
– union
– join with inputs co-partitioned
“Wide” (shuffle) deps: boundary of stages
– groupByKey
– join with inputs not co-partitioned
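The rule that wide (shuffle) dependencies mark stage boundaries while narrow ones are pipelined can be sketched for a simple linear chain of operations. The code below is a simplified illustration in plain Python, not Spark's DAGScheduler. It assumes a plain `join` has inputs that are not co-partitioned, uses the made-up label `join_copartitioned` for the narrow case, and places each wide operation at the start of the new stage.

```python
# Dependency type per operation, following the grouping on the slide.
# "join" is treated as wide here, i.e. its inputs are assumed not co-partitioned.
WIDE = {"groupByKey", "join"}
NARROW = {"map", "filter", "union", "join_copartitioned"}


def split_into_stages(ops):
    """Split a linear chain of operations into stages at wide dependencies."""
    stages = [[]]
    for op in ops:
        if op not in WIDE and op not in NARROW:
            raise ValueError(f"unknown operation: {op}")
        if op in WIDE:
            stages.append([])  # a shuffle starts a new stage
        stages[-1].append(op)  # narrow operations are pipelined into the current stage
    return [stage for stage in stages if stage]


if __name__ == "__main__":
    chain = ["map", "filter", "groupByKey", "map", "join", "filter"]
    for number, stage in enumerate(split_into_stages(chain), start=1):
        print(f"Stage {number}: {stage}")
```

A real job forms a graph rather than a chain, so this only conveys the boundary rule, not the full scheduling algorithm.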
Scheduler Optimizations
Pipelines within a stage
– Stage 2: map, union
Join algorithms based on partitioning (minimize shuffles)
– Stage 3
[Diagram: a DAG over RDDs A through G split into Stage 1, Stage 2, and Stage 3, with the operators groupBy, map, union, and join; the legend marks tasks and previously computed partitions.]
Scheduler Optimizations
Conceptually
Stage 1: 3 tasks
Stage 2: 4 tasks
Stage 3: 3 tasks
Total: 3 stages, 10 tasks
[Diagram: a DAG over RDDs A through G split into Stage 1, Stage 2, and Stage 3, with the operators groupBy, map, union, and join; the legend marks tasks and previously computed partitions.]

Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Job Data Analysis Reveals Key Skills Required for Data Scientists
Job Data Analysis Reveals Key Skills Required for Data ScientistsJob Data Analysis Reveals Key Skills Required for Data Scientists
Job Data Analysis Reveals Key Skills Required for Data Scientists
 
OOP 2014
OOP 2014OOP 2014
OOP 2014
 
Introduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionIntroduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and Prediction
 
Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real...
Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real...Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real...
Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real...
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Using semantic web technologies for exploratory olap a survey
Using semantic web technologies for exploratory olap a surveyUsing semantic web technologies for exploratory olap a survey
Using semantic web technologies for exploratory olap a survey
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
 
Big Data and Predictive Analysis
Big Data and Predictive AnalysisBig Data and Predictive Analysis
Big Data and Predictive Analysis
 

More from Jongwook Woo

Machine Learning in Quantum Computing
Machine Learning in Quantum ComputingMachine Learning in Quantum Computing
Machine Learning in Quantum ComputingJongwook Woo
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsJongwook Woo
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIJongwook Woo
 
Rating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkRating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkJongwook Woo
 
History and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep LearningHistory and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep LearningJongwook Woo
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraJongwook Woo
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataJongwook Woo
 
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeWhose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeJongwook Woo
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 
Introduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesIntroduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesJongwook Woo
 
2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in SeoulJongwook Woo
 

More from Jongwook Woo (12)

Machine Learning in Quantum Computing
Machine Learning in Quantum ComputingMachine Learning in Quantum Computing
Machine Learning in Quantum Computing
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AI
 
Rating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkRating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and Spark
 
History and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep LearningHistory and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep Learning
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big Data
 
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeWhose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Introduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesIntroduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use Cases
 
2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul
 

Recently uploaded

STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptxFurkanTasci3
 
Air Con Energy Rating Info411 Presentation.pdf
Air Con Energy Rating Info411 Presentation.pdfAir Con Energy Rating Info411 Presentation.pdf
Air Con Energy Rating Info411 Presentation.pdfJasonBoboKyaw
 
The market for cross-border mortgages in Europe
The market for cross-border mortgages in EuropeThe market for cross-border mortgages in Europe
The market for cross-border mortgages in Europe321k
 
Using DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseUsing DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseThinkInnovation
 
Paul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfPaul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfdcphostmaster
 
Data Collection from Social Media Platforms
Data Collection from Social Media PlatformsData Collection from Social Media Platforms
Data Collection from Social Media PlatformsMahmoud Yasser
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTimothy Spann
 
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j
 
Empowering Decisions A Guide to Embedded Analytics
Empowering Decisions A Guide to Embedded AnalyticsEmpowering Decisions A Guide to Embedded Analytics
Empowering Decisions A Guide to Embedded AnalyticsGain Insights
 
Stochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxStochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxjkmrshll88
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Enabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsEnabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsNeo4j
 
Data Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxData Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxEmmanuel Dauda
 
Báo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingBáo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingMarketingTrips
 
Unleashing Datas Potential - Mastering Precision with FCO-IM
Unleashing Datas Potential - Mastering Precision with FCO-IMUnleashing Datas Potential - Mastering Precision with FCO-IM
Unleashing Datas Potential - Mastering Precision with FCO-IMMarco Wobben
 
Microeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfMicroeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfmxlos0
 
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdfNeo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdfNeo4j
 
Brain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptxBrain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptxShammiRai3
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptxFurkanTasci3
 

Recently uploaded (20)

STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
 
Air Con Energy Rating Info411 Presentation.pdf
Air Con Energy Rating Info411 Presentation.pdfAir Con Energy Rating Info411 Presentation.pdf
Air Con Energy Rating Info411 Presentation.pdf
 
The market for cross-border mortgages in Europe
The market for cross-border mortgages in EuropeThe market for cross-border mortgages in Europe
The market for cross-border mortgages in Europe
 
Target_Company_Data_breach_2013_110million
Target_Company_Data_breach_2013_110millionTarget_Company_Data_breach_2013_110million
Target_Company_Data_breach_2013_110million
 
Using DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseUsing DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data Warehouse
 
Paul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfPaul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdf
 
Data Collection from Social Media Platforms
Data Collection from Social Media PlatformsData Collection from Social Media Platforms
Data Collection from Social Media Platforms
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI Pipelines
 
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
 
Empowering Decisions A Guide to Embedded Analytics
Empowering Decisions A Guide to Embedded AnalyticsEmpowering Decisions A Guide to Embedded Analytics
Empowering Decisions A Guide to Embedded Analytics
 
Stochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxStochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptx
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Enabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsEnabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge Graphs
 
Data Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxData Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potx
 
Báo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingBáo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân Marketing
 
Unleashing Datas Potential - Mastering Precision with FCO-IM
Unleashing Datas Potential - Mastering Precision with FCO-IMUnleashing Datas Potential - Mastering Precision with FCO-IM
Unleashing Datas Potential - Mastering Precision with FCO-IM
 
Microeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfMicroeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdf
 
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdfNeo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
 
Brain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptxBrain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptx
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
 

Big Data Trend and Open Data

  • 1. Big Data Trend and Open Data. Jongwook Woo, PhD (jwoo5@calstatela.edu), High-Performance Information Computing Center (HiPIC), California State University Los Angeles. UKC 2016, Dallas, TX, Aug 12 2016.
  • 2. High Performance Information Computing Center, Jongwook Woo, CalStateLA. Contents: Myself; Introduction To Big Data; Introduction To Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  • 3. Myself. Experience:
      – Since 2002: Professor at California State Univ Los Angeles
          • PhD in 2001: Computer Science and Engineering at USC
      – Since 1998: R&D consulting in Hollywood
          • Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
          • Information search and integration with FAST, Lucene/Solr, Sphinx
          • Implemented eBusiness applications using J2EE and middleware
      – Since 2007: Exposed to Big Data at CitySearch.com
      – 2012 - Present: Big Data Academic Partnerships
          • For Big Data research and training: Amazon AWS, Microsoft Azure, IBM Bluemix; Databricks, Hadoop vendors
  • 4. Myself. Experience (Cont'd): Bringing Big Data R&D and training to Korea since 2009
      – Collaborating with LA city in 2016: collect, search, and analyze city data (Hadoop, Solr, Java, Cloudera)
      – Sept 2013: Samsung Advanced Technology Training Institute
      – Since 2008: Introducing Hadoop, Big Data, and education to universities and research centers
          • Yonsei, Gachon
          • US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
          • Europe: Univ of Luxembourg
  • 5. Experience in Big Data
      – Collaboration
          • Council Member of IBM Spark Technology Center
          • City of Los Angeles for OpenHub and Open Data
          • Startup companies in Los Angeles
          • External Collaborator and Advisor in Big Data: IMSC of USC, Pennsylvania State University
      – Grants: IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
      – Partnership: Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  • 6. Contents: Myself; Introduction To Big Data; Introduction To Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  • 7. Data Issues
      – Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes)
          • Because of the web
          • Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
      – Cannot be handled with the legacy approach
          • Too big
          • Non-/semi-structured data
          • Too expensive
      – Need new systems: inexpensive
  • 8. Two Cores in Big Data
      – How to store Big Data; how to compute Big Data
      – Google
          • How to store Big Data: GFS, a distributed system on inexpensive commodity computers
          • How to compute Big Data: MapReduce, parallel computing with inexpensive computers
          • Own supercomputers
          • Published papers in 2003, 2004
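The MapReduce idea named on this slide can be sketched in a few lines. The following is a minimal single-process illustration in plain Python, assuming a word-count task; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this sketch and are not Hadoop's API. In a real cluster the map and reduce calls would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "open data"]))
```

Each map call sees only its own line and each reduce call sees only its own key, which is what lets a framework spread the work across inexpensive machines.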
  • 9. What is Hadoop?
      – Hadoop founder: Doug Cutting
      – Apache committer: Lucene, Nutch, …
  • 10. Definition: Big Data
      – Inexpensive frameworks that can store large-scale data and process it faster in parallel
      – Hadoop
          • Inexpensive supercomputer
          • More accessible to the public than traditional supercomputers
          • You can store and process your applications in your university labs, small companies, research centers
  • 11. Hadoop Cluster: Logical Diagram. [Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); every Hadoop node runs an agent that reports to the monitor; components shown: HDFS, Hive, ZooKeeper, Impala]
  • 12. Contents: Myself; Introduction To Big Data; Introduction To Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  • 13. Alternative to Hadoop MapReduce
      – Limitations of MapReduce
          • Hard to program in Java
          • Batch processing: not interactive
          • Disk storage for intermediate data: performance issue
      – Spark by UC Berkeley AMPLab
          • In-memory storage for intermediate data
          • 20 to 100 times faster than network and disk access (MapReduce)
  • 14. Spark
      – In-memory data computing: faster than Hadoop MapReduce
      – Can integrate with Hadoop and its ecosystem: HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
      – New programming with faster data sharing
          • Good for complex multi-stage applications: iterative graph algorithms, machine learning
          • Interactive queries
  • 15. Spark [stack diagram]
      – Spark Core: RDDs, transformations, and actions
      – Spark Streaming (real-time): DStreams, streams of RDDs
      – Spark SQL: SchemaRDDs, DataFrames
      – MLlib, ML (machine learning): RDD-based matrices
      – GraphX (graph): RDD-based matrices
      – SparkR: RDD-based matrices
  • 16. RDD Operations
      – Transformations: define new RDDs from the current one
          • Lazy: not computed immediately
          • map(), filter(), join()
      – Actions: return values
          • count(), collect(), take(), save()
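The split between lazy transformations and actions can be illustrated without a Spark installation. The sketch below is a toy model in plain Python, assuming only `map`, `filter`, `collect`, `count`, and `take`; the class `LazyDataset` is hypothetical and is not Spark's RDD API. Transformations only record a step, and nothing runs until an action is called.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # re-iterable input, e.g. a list or range
        self._steps = steps    # recorded transformations, not yet executed

    # Transformations: return a new dataset with one more recorded step.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        """Apply every recorded step to each element, one element at a time."""
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return plain values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


if __name__ == "__main__":
    squares = LazyDataset(range(5)).map(lambda x: x * x)  # nothing computed yet
    evens = squares.filter(lambda x: x % 2 == 0)          # still nothing computed
    print(evens.collect())                                # work happens here
```

Deferring the work lets an engine see the whole chain of steps before executing it, which is the property the slide labels "Lazy".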
  • 17. Programming in Spark
      – Scala
          • Functional programming: the function is the fundamental unit of programming; inputs and outputs are functions
          • No side effects: no state
      – Python: legacy, large libraries
      – Java
  • 18. Spark
      – Spark SQL
          • DataFrame: turning an RDD into a relation
          • Querying using SQL
      – Spark Streaming
          • DStream: RDDs in streaming
          • Windows: to select DStreams from streaming data
      – MLlib, ML
          • Sparse vector support, decision trees, linear/logistic regression, PCA
          • Pipeline
  • 19. Contents: Myself; Introduction To Big Data; Introduction To Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  • 20. Spark
      – Spark file system: Tachyon; resource manager: Mesos
      – But Hadoop has been dominating the market
      – Integrating Spark into a Hadoop cluster
          • Cloud computing: Amazon AWS, Azure HDInsight, IBM Bluemix (Object Storage, S3)
          • Hadoop vendors: HDP, CDH
      – Databricks: Spark on AWS, no Hadoop ecosystem
  • 21. Spark with Hadoop YARN
      – ResourceManager (RM), per cluster: creates the Spark AM and allocates containers for the Spark AM
      – NodeManager (NM), per node: Spark workers
      – ApplicationMaster (AM), per application: containers for Spark executors
      – [Diagram: a Spark client talks to the Resource Manager on the master node; Node Managers on the slave nodes host the Spark AM and the Spark executor containers]
  • 22. Big Data Analysis Flow
      – Data collection
          • Batch API: Yelp, Google
          • Streaming: Twitter, Apache NiFi, Kafka, Storm
          • Open data: government
      – Data storage: HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
      – Data filtering: Hive, Pig
      – Data analysis and science: Hive, Pig, Spark, BI tools (Datameer, Qlik, …)
      – Data visualization: Qlik, Datameer, Excel PowerView
  • 23. Databricks cluster at CalStateLA
  • 24. Contents: Myself; Introduction To Big Data; Introduction To Spark; Spark and Hadoop; Open Data and Use Cases; Use Cases; Hadoop Spark Training
  • 25. Open Data
      – USA government: federal, state, and city governments expose data to the public
      – USA business: Twitter, Yelp, …
          • Expose data to the public with APIs, with some restrictions on downloads
      – City government
          • New York: Taxi, Uber, …
          • Los Angeles: Open Data, Open Hub with geo info
  • 26. Open Big Data Analysis in CalStateLA
      – Social media data analysis: Twitter sentiment analysis for AlphaGo
      – Open data from government: airline data analysis, crime data analysis
      – Web service API: business data analysis from Yelp and Google Places API
  • 27. Data from Industry: Twitter Data
      – Systems: Azure HDInsight Spark, 8 nodes
          • 40 cores: 2.4 GHz Intel Xeon
          • Memory per node: 28 GB
      – Data source: keyword 'alphago' from Twitter via Apache NiFi
      – Data size: 63,193 tweets
      – Real-time data collection period: 03/12 – 03/17/2016 (no data collected on 03/13)
  • 28. Top 10 Countries that Tweet "Alphago"
  • 29. Top 10 Countries: number of tweets per country
      – USA: > 11,000
      – Japan: > 9,000
      – Korea: > 1,900
      – Russia, UK: > 1,600
      – Thailand, France: > 1,000
      – Netherlands, Spain, Ukraine: > 600
  • 30. Top 10 Countries Sentiment [Figure; legend: Positive, Negative]
  • 31. Top 10 Countries: Most Tweeted Countries
      – All countries show more positive tweets: Korea, Japan, USA
      – Country | Positive | Negative
        USA     | 5070     | 3567
        Japan   | 8118     | 217
        …
        Korea   | 1053     | 407
        …
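As a worked check of the claim that positive tweets outnumber negative ones, the short Python sketch below computes the positive share for the three countries whose counts appear on the slide. Only those three rows are used; the helper name `positive_share` is introduced here for illustration, and the elided rows are left out rather than guessed.

```python
# (positive, negative) tweet counts copied from the slide's table
COUNTS = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Fraction of sentiment-labelled tweets that are positive."""
    return positive / (positive + negative)


if __name__ == "__main__":
    for country, (pos, neg) in COUNTS.items():
        print(f"{country}: {positive_share(pos, neg):.1%} positive")
```

By this measure roughly 59% of the labelled USA tweets, 97% of the Japan tweets, and 72% of the Korea tweets are positive.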
  • 32. Daily Tweets in 03/12 – 03/17/2016. [Chart: number of tweets per day from 3/12/2016 to 3/17/2016; annotations for AlphaGo vs Lee Sedol: Game 3 on Mar 12; Game 4 on Mar 13, Lee Se-Dol wins; Game 5 on Mar 15]
  • 33. N-gram words: 3 words in a row right after the Go champion's name, "sedol" and "se-dol"
      – 3-grams            | Frequency
        Again-to-win       | 1,187
        Is-something-I'll  | 369
        Is-something-i     | 199
        In-go-tournament   | 168
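One way to count the three words that follow a keyword is sketched below in plain Python. The tokenization rule (lowercase, keep letters, digits, apostrophes, and hyphens) and the function name `trigrams_after` are assumptions made for this illustration; the slides do not say how the original analysis tokenized tweets, and the sample tweets in the usage are invented.

```python
import re
from collections import Counter

KEYWORDS = {"sedol", "se-dol"}


def trigrams_after(tweets, keywords=KEYWORDS):
    """Count the 3 words that directly follow any keyword in each tweet."""
    counts = Counter()
    for text in tweets:
        # Assumed tokenization: lowercase words, keeping ' and - inside a word.
        tokens = re.findall(r"[a-z0-9'-]+", text.lower())
        for i, token in enumerate(tokens):
            following = tokens[i + 1:i + 4]
            if token in keywords and len(following) == 3:
                counts["-".join(following)] += 1
    return counts


if __name__ == "__main__":
    sample = [  # invented sample tweets, not taken from the collected data
        "Lee Sedol again to win!",
        "lee se-dol again to win",
    ]
    print(trigrams_after(sample).most_common(1))
```

Note that this sketch lowercases everything, so it would merge variants that differ only in case, such as the two "Is-something" rows in the table.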
  • 34. Sentiment Map of Alphago [Map; legend: Positive, Negative]
  • 35. Sentiment Map of Lee Se-Dol vs Alphago
      – YouTube video: search "alphago sentiment" on Google
      – The sentiment of the world by geography and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
  • 36. Federal Government: Airline Data Set
      – Government open data: airline data set for 2012 – 2014, US Dept of Transportation
      – Cluster by Nillohit at HiPIC, CalStateLA
          • Microsoft Azure using Hive and Spark SQL
          • Number of data nodes: 4
          • CPU: 4 cores; memory: 7 GB
          • Windows Server 2012 R2 Datacenter
  • 37. Airline Data Set
  • 38. Airline Data Set
  • 39. Airline Data Set
  • 40. City Government: Crime Data Set
      – Open data in City of Los Angeles
          • Crime data set for 2012-2015
          • File size: 151 MB
          • Total number of offenses: 8.94 million
      – Ram Dharan and Sridhar Reddy at HiPIC, CalStateLA
          • Microsoft Azure using Hive and Spark SQL
          • Number of data nodes: 4
          • CPU: 4 cores; memory: 14 GB
          • Windows Server 2012 R2 Datacenter
          • Extending to the last 10 years of data
  • 41. Projection of Raw Data [Chart: one value per year, 2012 to 2015]
  • 42. Total No. of Crimes in 2012-15 [Chart: one value per year, 2012 to 2015]
  • 43. Mapping of Crimes Occurred within 5 miles from CalStateLA, UCLA and USC in 2015
  • 44. No. of Crimes for every 5 miles from CalStateLA [Chart: number of crimes per 5-mile distance band, from 0-5 up to >35 miles; series: csula_2012, csula_2013, csula_2014, csula_2015]
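Counting crimes per 5-mile band around a campus needs a distance for each incident and a rule for assigning it to a band. The sketch below is one possible approach in plain Python, assuming great-circle (haversine) distance and a final open-ended ">35" band as in the chart above. The function names and the coordinates in the usage are hypothetical; the slides do not state how the original Hive and Spark SQL jobs computed distance.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_bin(miles, width=5, cap=35):
    """Label the band a distance falls in, e.g. '5-10', or '>35' beyond the cap."""
    if miles >= cap:
        return f">{cap}"
    low = int(miles // width) * width
    return f"{low}-{low + width}"


def crimes_per_bin(campus, incidents):
    """Count incidents (lat, lon) per distance band around a campus (lat, lon)."""
    lat0, lon0 = campus
    return Counter(
        distance_bin(haversine_miles(lat0, lon0, lat, lon)) for lat, lon in incidents
    )


if __name__ == "__main__":
    campus = (34.0, -118.0)  # hypothetical point, not a real campus location
    incidents = [(34.0, -118.0), (34.1, -118.0), (35.0, -118.0)]
    print(crimes_per_bin(campus, incidents))
```

Running the same function once per campus and per year would yield the kind of series shown in the charts.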
  • 45. No. of Crimes for every 5 miles from UCLA [Chart: number of crimes per 5-mile distance band, from 0-5 up to >40 miles; series: ucla_2012, ucla_2013, ucla_2014, ucla_2015]
  • 46. No. of Crimes for every 5 miles from USC [Chart: number of crimes per 5-mile distance band, from 0-5 up to >40 miles; series as labeled: ucla_2012, ucla_2013, ucla_2014, ucla_2015]
  • 47. Comparison of Crimes for every 5 miles from CalStateLA, UCLA and USC in 2015 [Chart: number of crimes per distance band, from 0-5 up to >50 miles; series: csula_2015, ucla_2015, usc_2015]
  • 48. No. of crimes per area in LA [Chart: number of crimes per area; areas: 77th Street, Mission, Newton, Rampart, Southwest, Topanga, Van Nuys, Wilshire, Central, Devonshire, Foothill, Harbor, Hollenbeck, Hollywood, N Hollywood, Pacific, West Valley, Northeast, Olympic, Southeast, West LA; series: 2012, 2013, 2014, 2015]
  • 49. Total No. of Crimes for every 2 hours in LA [Chart; series: 2012, 2013, 2014, 2015]
  • 50. No. of crimes for every 2 hours within 5 miles from CalStateLA, UCLA and USC in 2015 [chart: twelve 2-hour slots from 00:00-02:00 to 22:00-24:00; series usc, ucla, csula; count axis 0 to 12,000]
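The 2-hour slots used in these charts can be derived from each incident's timestamp. The following is a small sketch, assuming timestamps are available as Python `datetime` values; the function names and the sample timestamps are illustrative and not from the deck.

```python
from collections import Counter
from datetime import datetime


def two_hour_slot(ts):
    """Return the 2-hour slot label for a timestamp, e.g. '22:00-24:00'."""
    start = (ts.hour // 2) * 2
    return f"{start:02d}:00-{start + 2:02d}:00"


def crimes_per_slot(timestamps):
    """Count incidents per 2-hour slot of the day."""
    return Counter(two_hour_slot(ts) for ts in timestamps)


# Hypothetical timestamps for illustration only.
sample = [
    datetime(2015, 1, 1, 0, 5),
    datetime(2015, 1, 1, 1, 59),
    datetime(2015, 1, 1, 23, 30),
]
slot_counts = crimes_per_slot(sample)
```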
  • 51. Business Data Analysis. Data set details: Yelp review data (1.9 GB); business data (500 MB); web service APIs from Yelp and Google Places. [Diagram: the Yelp Challenge data set, Google Places data and Yelp data are joined for analysis]
  • 52. Top 10 businesses within 5 miles from CalStateLA (with 5 or 4 star ratings) [bar chart of counts per category; values shown: 34, 31, 29, 26, 19, 19, 15, 15, 15; categories listed: Hair Salons, Auto Repair, General Dentistry, Insurance, Churches, Skin Care, Chiropractors, Barbers, Elementary Schools]. Hair Salons and Insurance are popular qualified business categories.
  • 53. Businesses popular within 5 miles of CalStateLA, USC and UCLA
  • 54. Number of food businesses within 0-25 miles of CalStateLA, USC and UCLA. CalStateLA has more food businesses within 5 miles than UCLA and USC. [bar chart: distance bands 0-5, 5-10, 10-15, 15-20, 20-25 miles; series CSULA, USC, UCLA; count axis 0 to 600]
  • 55. Hydrogen Gas Power Plant Prediction Model: The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
  • 56. Hydrogen Gas Power Plant Prediction Model: The station produces hydrogen for hydrogen vehicles. The Cal State L.A. Hydrogen Research and Fueling Facility is the first station in the nation to sell hydrogen fuel to the public. Hyundai, Toyota
  • 57. Hydrogen Gas Power Plant Prediction Model: Workflow
  • 58. Hydrogen Gas Power Plant Prediction Model: Model by Manvi Chandra
  • 59. Hydrogen Gas Power Plant Prediction Model: Results and observations
  • 60. Hydrogen Gas Power Plant Prediction Model: Results and observations. The model can predict Vehicle Pressure, the pressure of hydrogen gas within the vehicle's Hydrogen Storage System, using our model in Azure Visual Studio ML and by building a Spark ML Decision Forest Regression model. A decision forest constructs a multitude of decision trees at training time and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
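The aggregation step described on this slide (mode of the classes for classification, mean prediction for regression) can be sketched in a few lines. This is not the Spark ML or Azure API and not the deck's model; the "trees" below are hypothetical stand-in functions rather than trained decision trees, and the numbers are made up for illustration.

```python
from statistics import mean, mode


def forest_predict_regression(trees, x):
    """Regression forest: average the individual trees' predictions."""
    return mean([tree(x) for tree in trees])


def forest_predict_class(trees, x):
    """Classification forest: return the most common class among the trees."""
    return mode([tree(x) for tree in trees])


# Hypothetical stand-ins for trained regression trees (not real pressure data).
regression_trees = [
    lambda x: 340.0 + x,
    lambda x: 350.0 + x,
    lambda x: 360.0 + x,
]
predicted = forest_predict_regression(regression_trees, 0.0)

# Hypothetical stand-ins for trained classification trees.
class_trees = [lambda x: "high", lambda x: "low", lambda x: "high"]
label = forest_predict_class(class_trees, 0.0)
```

Averaging many trees tends to reduce the variance of any single tree, which is the usual motivation for a forest over one decision tree.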
  • 61. Contents: Myself; Introduction To Big Data; Introduction To Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  • 62. Spark Big Data Training and R&D: HiPIC, California State University Los Angeles. Supported by: Databricks and its cloud computing services; Amazon AWS, IBM Bluemix, MS Azure; Hortonworks, Cloudera; Datameer
  • 63. Databricks Partners
  • 64. Training Hadoop and Spark: Cloudera visits to interview Jongwook Woo
  • 65. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  • 66. Question?
  • 67. References: Hadoop, http://hadoop.apache.org; Apache Spark Word Count Example, http://spark.apache.org; Databricks, http://www.databricks.com; "Market Basket Analysis using Spark", Jongwook Woo, in Journal of Science and Technology, April 2015, Volume 5, No 4, pp. 207-209, ISSN 2225-7217, ARPN; https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
  • 68. References: Introduction to Big Data with Apache Spark, Databricks; Stanford Spark Class, http://stanford.edu/~rezab; Cornell University, CS5304; DS320: DataStax Enterprise Analytics with Spark; Cloudera, http://www.cloudera.com; Hortonworks, http://www.hortonworks.com; Spark 3 Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
  • 69. Scheduling Process [pipeline diagram]: RDD Objects / Optimizer, e.g. rdd1.join(rdd2).groupBy(…).filter(…): build the operator DAG. DAG is passed to the DAGScheduler: split graph into stages of tasks; submit each stage as ready; agnostic to operators. TaskSet is passed to the TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks; doesn't know about stages; a failed stage is signalled back to the DAGScheduler. Task is passed to the Worker (threads, block manager): execute tasks; store and serve blocks.
  • 70. Spark Components [architecture diagram]: Your program (sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); ...) runs in the Spark Driver/Client (app master), which holds the RDD graph, scheduler, block tracker and shuffle tracker. A cluster manager connects the driver to the Spark worker(s), each with task threads and a block manager. Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
  • 71. Dependency Types. "Wide" (shuffle) deps, the boundary of stages: groupByKey; join with inputs not co-partitioned. "Narrow" deps, a stage pipeline to be run on the same node: map, filter; union; join with inputs co-partitioned.
  • 72. Dependency Types. "Narrow" deps, a stage pipeline to be run on the same node: map, filter; union; join with inputs co-partitioned. "Wide" (shuffle) deps, the boundary of stages: groupByKey; join with inputs not co-partitioned.
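To illustrate how wide dependencies mark stage boundaries, here is a simplified sketch that splits a linear chain of operations into stages. It is a toy model and not Spark's DAGScheduler: it ignores branching lineage and caching, and the operation labels `join_copartitioned` and `join_not_copartitioned` are made-up names for the two join cases on the slide.

```python
# Operations classified as on the slide; the two join labels are hypothetical.
NARROW = {"map", "filter", "union", "join_copartitioned"}
WIDE = {"groupByKey", "join_not_copartitioned"}


def split_into_stages(ops):
    """Split a linear chain of operations into stages at wide (shuffle) deps.

    Narrow operations are pipelined into the current stage; a wide
    operation starts a new stage because its input must be shuffled.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE:
            stages.append([op])      # shuffle boundary: begin a new stage
        elif op in NARROW:
            stages[-1].append(op)    # pipeline within the current stage
        else:
            raise ValueError(f"unknown operation: {op}")
    # Drop the empty leading stage when the chain starts with a wide op.
    return [stage for stage in stages if stage]


pipeline = ["map", "filter", "groupByKey", "map"]
stages = split_into_stages(pipeline)
```

Here `map` and `filter` share one stage, and `groupByKey` opens a second stage that the trailing `map` is pipelined into.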
  • 73. Scheduler Optimizations: Pipelines within a stage (Stage 2: map, union). Stage 3: join algorithms based on partitioning (minimize shuffles). [Diagram: RDDs A to G connected by groupBy, map, union and join, grouped into Stage 1, Stage 2 and Stage 3; legend: shaded box = previously computed partition; Task]
  • 74. Scheduler Optimizations, conceptually: Stage 1: 3 tasks; Stage 2: 4 tasks; Stage 3: 3 tasks. Total: 3 stages, 10 tasks. [Diagram: RDDs A to G connected by groupBy, map, union and join, grouped into Stage 1, Stage 2 and Stage 3; legend: shaded box = previously computed partition; Task]
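The totals on this slide are plain sums over the stages. As a tiny sketch, assuming the per-stage task counts given above (in Spark a stage typically runs one task per partition), the arithmetic is as follows; the dictionary and function name are illustrative.

```python
# Task counts per stage as listed on the slide.
tasks_per_stage = {"Stage 1": 3, "Stage 2": 4, "Stage 3": 3}


def total_tasks(stage_tasks):
    """Sum the number of tasks over all stages of a job."""
    return sum(stage_tasks.values())


num_stages = len(tasks_per_stage)
num_tasks = total_tasks(tasks_per_stage)
```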