Jongwook Woo
HiPIC
CalState
LA
SWRC 2017
San Diego, CA
Feb 25 2017
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Big Data Trend with
Open Platform
High Performance Information Computing Center
Jongwook Woo
CalState
LA
Contents
• Myself
• Big Data
• Spark
• Spark and Hadoop
• Open Platform
• Use Cases
• Future Trend
Myself
Experience:
Since 2002: Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present: Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
Experience (Cont’d): Bring in Big Data R&D and training to Korea since 2009
Collaborating with LA city in 2016
– Collect, Search, and Analyze City Data
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introduce Hadoop Big Data and education to Univ and Research Centers
• Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
Myself
Experience in Big Data
Collaboration
– Council Member of IBM Spark Technology Center
– City of Los Angeles for OpenHub and Open Data
– Startup Companies in Los Angeles
– External Collaborator and Advisor in Big Data
• IMSC of USC
• Pennsylvania State University
Grants
– IBM Bluemix, Microsoft Windows Azure, Amazon AWS: Research and Education Grants
Partnership
– Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Sensor Data (IoT), Bioinformatics, Social Computing, Streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Non-/Semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Own supercomputers
Published papers in 2003, 2004
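As a rough illustration of the MapReduce model named on this slide, the single-process Python sketch below counts words in three phases (map, shuffle, reduce). It is not Google's or Hadoop's implementation: the function names and the sample documents are made up for illustration, and a real cluster would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input
documents = ["big data big cluster", "data on a cluster"]
counts = word_count(documents)
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, both phases can be spread across inexpensive machines, which is the point of the model.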
What is Hadoop?
Hadoop Founder: Doug Cutting
Apache Committer: Lucene, Nutch, …
Super Computer vs Hadoop
Parallel vs. Distributed file systems by Michael Malak
[Figure: a supercomputer uses one cluster for compute and a separate cluster for storage; Hadoop uses a single cluster for both compute and storage]
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive supercomputer
– More accessible to the public than traditional supercomputers
• You can store your data and run your applications
– In your university labs, small companies, research centers
Hadoop Cluster: Logical Diagram
[Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); the cluster monitor communicates with an agent on each Hadoop node; the nodes store data in HDFS and run services such as Hive, ZooKeeper, and Impala]
Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Only Map and Reduce
– Limited Parallelization
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
20 to 100 times faster than MapReduce, which relies on network and disk
Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming model with faster data sharing
Good for complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query
Spark
[Diagram: the Spark stack. Spark Core (RDDs, Transformations, and Actions) sits at the bottom, with these libraries on top:
Spark Streaming (real-time): DStreams, streams of RDDs
Spark SQL: SchemaRDDs, DataFrames
ML / MLlib (machine learning): RDD-based matrices
GraphX (graph): RDD-based matrices
SparkR: RDD-based matrices]
Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Communicate with Spark workers
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–Development and Test
RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
–that can be cached in memory
Immutable
–RDD, DStream, SchemaRDD, PairRDD
Lineage
–History of the objects
–Automatically and efficiently re-compute lost
data
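The idea of lineage can be sketched with a toy, single-machine class. This is only an analogy and not Spark's implementation: `ToyRDD`, `lose_cache`, and the sample data are hypothetical names introduced for illustration. Each dataset is immutable, remembers its parent and the function that derives it, and can therefore rebuild a lost in-memory copy.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not Spark's implementation)."""

    def __init__(self, source=None, parent=None, func=None, description="source"):
        self._source = source        # only set on the root dataset
        self._parent = parent        # lineage: which dataset this one came from
        self._func = func            # lineage: how to derive it from the parent
        self.description = description
        self._cache = None           # in-memory copy, which may be lost

    def map(self, func):
        # Immutable: returns a new ToyRDD instead of changing this one
        return ToyRDD(parent=self, description="map",
                      func=lambda rows: [func(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, description="filter",
                      func=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        """Return the data, rebuilding it from the lineage if the cache is gone."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._func(self._parent.compute())
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. when a node fails."""
        self._cache = None

    def lineage(self):
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self.description]


numbers = ToyRDD(source=range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
first = even_squares.compute()
even_squares.lose_cache()            # the cached result disappears
second = even_squares.compute()      # recomputed from the recorded lineage
print(even_squares.lineage(), first, second)
```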
RDD and DataFrame Operations
Transformations
Define new RDDs and DataFrames from the current ones
– Lazy: not computed immediately
map(), filter(), join(), select(), groupBy()
Actions
Return values
count(), collect(), take(), save()
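The difference between lazy transformations and actions can be sketched with Python's built-in lazy iterators. This is an analogy only, not Spark code: `traced_square` and the `evaluated` log are hypothetical helpers that record when elements are actually processed.

```python
evaluated = []  # records which elements were actually processed


def traced_square(x):
    evaluated.append(x)
    return x * x


data = [1, 2, 3, 4, 5]

# "Transformations": map() and filter() return lazy iterators in Python 3,
# so nothing is computed at this point.
squares = map(traced_square, data)
large = filter(lambda v: v > 5, squares)
computed_before_action = len(evaluated)

# "Action": materializing the result forces the whole pipeline to run.
result = list(large)
computed_after_action = len(evaluated)

print(computed_before_action, result, computed_after_action)
```

Deferring the work until an action is requested is what lets an engine see the whole chain of operations before deciding how to run it.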
Programming in Spark
Scala
Functional Programming
– The fundamental unit of programming is the function
• Inputs and outputs can be functions
No side effects
– No state
Python
Legacy, large libraries
Java
R
Spark
Spark SQL
– Querying using SQL, HiveQL
– DataFrame
ML
– Machine Learning on DataFrames, Pipelining
– MLlib
• On RDDs
• Sparse vector support, Decision trees, Linear/Logistic Regression, PCA, SVM
Spark Streaming
– DStream
• RDDs in streaming
• Windows: to select DStreams from streaming data
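The windowing idea can be sketched in plain Python, treating a stream as a sequence of micro-batches. This is not the Spark Streaming API: `windowed_counts` and the sample batches are hypothetical, and the window here slides forward by one batch at a time.

```python
from collections import deque


def windowed_counts(batches, window_length):
    """For each incoming micro-batch, yield the number of records
    in the most recent `window_length` batches."""
    window = deque(maxlen=window_length)   # old batches drop off automatically
    for batch in batches:
        window.append(batch)
        yield sum(len(b) for b in window)


# Hypothetical micro-batches of tweets, one list per time interval
batches = [["t1", "t2"], ["t3"], [], ["t4", "t5", "t6"]]
counts = list(windowed_counts(batches, window_length=2))
print(counts)
```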
Scheduling Process
[Diagram: how a job such as rdd1.join(rdd2).groupBy(…).filter(…) is scheduled.
RDD Objects / Optimizer: build the operator DAG
DAGScheduler: split the graph into stages of tasks and submit each stage as ready; agnostic to operators
TaskScheduler: receive each TaskSet, launch tasks via the cluster manager, and retry failed or straggling tasks; doesn’t know about stages
Worker: execute tasks in threads; store and serve blocks through the block manager
Failure path: stage failed]
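The stage-splitting step can be sketched as follows. This is a deliberately simplified model and not Spark's DAGScheduler: it assumes a linear chain of operations (a real join has two parents), and the set of shuffle-requiring operations and the function name are assumptions made for illustration.

```python
# Assumption for this sketch: these operations need a shuffle ("wide");
# everything else can be pipelined inside one stage ("narrow").
WIDE_OPERATIONS = {"join", "groupBy", "reduceByKey"}


def split_into_stages(operations):
    """Split a linear chain of operations into stages, where each wide
    operation starts a new stage."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPERATIONS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


pipeline = ["textFile", "map", "join", "groupBy", "filter"]
stages = split_into_stages(pipeline)
print(stages)
```

Narrow operations such as `map` and `filter` stay in the same stage as the operation before them, which is why they can run back to back on the same worker without moving data.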
During Scheduling Process
https://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797
Spark
Spark
File Systems: Tachyon
Resource Manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into Hadoop clusters
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystems
Spark Components
[Diagram:
Your program: sc = new SparkContext; f = sc.textFile(“…”); f.filter(…).count(); ...
Spark Driver/Client (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker
Cluster manager
Spark worker(s): each runs task threads and a block manager
Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …]
Spark with Hadoop YARN
ResourceManager (RM), per cluster
– Create Spark AM and allocate Containers for Spark AM
NodeManager (NM), per node
– Spark workers
ApplicationMaster (AM), per application
– Containers for Spark Executors
[Diagram: the Spark client talks to the ResourceManager on the master node; NodeManagers on the slave nodes host containers that run the Spark AM and the Spark Executors]
Databricks cluster at CalStateLA
Open Platform
Open Source
Open Conference
Open Data
Public Data
Open Source
Hadoop
http://hadoop.apache.org/
Spark
http://spark.apache.org/
NoSQL
http://hbase.apache.org/
Search Engine
http://lucene.apache.org/solr/
Open Conference
Hadoop Summit
Live Streaming
–http://siliconangle.tv/hadoop-summit-2016/
Spark Summit
https://spark-summit.org/east-2017/
Live Streaming
–http://go.spark-summit.org/east-2017/live-stream?_ga=1.62160364.1150099959.1484851457
Open Data
USA government
Federal, State, City governments
Expose data to public
USA Business
Twitter, Yelp, …
Expose data to public with APIs
– Some restriction to download
City government
New York
– Taxi, Uber, …
Los Angeles
– Open Data, Open Hub with Geo info
Databricks Partners
Industrial Collaboration
Cloudera visits to interview Jongwook Woo
Industrial Collaboration: IBM Bluemix
at CalStateLA
Big Data Analysis and Prediction Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Tableau, Qlik, …)
Data Visualization
– Qlik, Datameer, Excel PowerView
LOCAL BUSINESS DATA ANALYSIS
Using Local Business Data
From Yelp and Google Local
Grad Students at CalStateLA
Symposium, Feb 24 2017
Yashaswi Ananth
Ruchi Singh
Mahsa Tayer Farahani
REVIEW COUNT FOR BUSINESS TYPES
• Food
• Services
• Entertainment
• Shopping
• Medical
TOP BUSINESS IN THE SIX CATEGORIES
Review count of popular sub-categories of
business
Sentiment Analysis of Services category
Top business
Top 5 most popular local businesses on Yelp between 2006 and 2016 in the selected cities
Businesses popular within 5 miles of CalStateLA, USC, and UCLA
Historical Analysis Of
College Scorecard
CalStateLA Symposium
Feb 24 2017
Kunal Pritwani
Atinder Singh
Dharmesh Soni
Mounika Vallabhaneni
Data was collected from the following site:
https://www.kaggle.com/kaggle/college-scorecard
We have historical data on over 100,000 colleges in the US, spanning over 14 years.
Data Size – 1.33 GB
File Format – CSV (Comma-Separated Values)
Specification of Data Set
Mean Income
Medical College of Wisconsin: 250K
Upstate Medical University: 152.7K
CalTech: 103K
Washington and Lee University: 100K
Comparing Average Net Price of Two
States (Annual Tuition)
UCLA: $13,817 CalStateLA: $4,370
Fashion Inst of Tech: $11.5K CUNY: $5K
SAT Scores in Different Colleges
Math (Blue), Verbal (Orange), Mean Earning (Purple)
• CalTech: 800, 778.9, $98.7K
• MIT: 800, 764.4, $124.4K
• Harvard: 791, 795.6, $133K
• Princeton: 793, 791, $115.6K
• Yale: 788, 794.4, $97.8K
Comparing Average Undergraduates
Receiving PELL GRANT
Universal Career Community College: 100% PELL grant scholarship
Average Undergraduates Receiving
PELL GRANT in Each College
East Georgia State College: $2,854 Avg.
PELL grant: 97.285%
AlphaGo vs Lee using Twitter Data
Systems
– Azure HDInsight Spark
– 8 Nodes
• 40 cores: 2.4GHz Intel Xeon
• Memory - Each Node: 28 GB
Data Source
– Keyword ‘alphago’ from Twitter via Apache NiFi
Data Size
– 63,193 tweets
Real Time Data Collection period
– 03/12 – 03/17/2016
• No data collected on 03/13
Top 10 Countries that Tweet “Alphago”
Top 10 Countries
# of Tweets per Country
USA: > 11,000
Japan: > 9,000
Korea: > 1,900
Russia, UK: > 1,600
Thailand, France: > 1,000
Netherlands, Spain, Ukraine: > 600
Top 10 Countries Sentiment
Positive Negative
Top 10 Countries
Most Tweeted Countries
All countries show more positive tweets than negative ones
–Korea, Japan, USA
Country Positive Negative
USA 5070 3567
Japan 8118 217
…
Korea 1053 407
…
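To make the comparison concrete, the share of positive tweets can be worked out from the three rows shown in the table. The sketch below only does that arithmetic; the function name is made up, and it ignores any tweets that were neither positive nor negative.

```python
# Positive and negative tweet counts copied from the table on this slide
sentiment = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Fraction of polarized (positive or negative) tweets that are positive."""
    return positive / (positive + negative)


shares = {country: positive_share(p, n) for country, (p, n) in sentiment.items()}
for country, share in shares.items():
    print(f"{country}: {share:.1%} positive")
```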
Daily Tweets in 03/12 – 03/17/2016
[Chart: daily tweet counts (y-axis 0 to 50,000) for 3/12/2016 through 3/17/2016, annotated with the AlphaGo vs Lee Sedol games: Game 3 on Mar 12; Game 4 on Mar 13, won by Lee Se-Dol; Game 5 on Mar 15]
Ngram words
3 words in a row right after the Go champion’s name, “sedol” or “se-dol”
sedol
se-dol
3-grams Frequency
Again-to-win 1,187
Is-something-I’ll 369
Is-something-i 199
In-go-tournament 168
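A minimal sketch of this kind of count, assuming whitespace tokenization and lower-casing (the slides do not say how the tweets were tokenized). The function name and the sample tweets are hypothetical and are not taken from the collected data.

```python
from collections import Counter


def ngrams_after(tokens, keywords, n=3):
    """Yield the n tokens that directly follow each occurrence of a keyword."""
    for i, token in enumerate(tokens):
        following = tokens[i + 1:i + 1 + n]
        if token in keywords and len(following) == n:
            yield "-".join(following)


# Hypothetical tweets, lower-cased and split on whitespace
tweets = [
    "lee sedol again to win against alphago",
    "can se-dol again to win game five",
    "sedol in go tournament today",
]
counts = Counter()
for tweet in tweets:
    counts.update(ngrams_after(tweet.lower().split(), {"sedol", "se-dol"}))
print(counts.most_common(2))
```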
Sentiment Map of Alphago
Positive
Negative
Sentiment Map of Lee Se-Dol vs Alphago
YouTube video: “alphago sentiment” by Google
The sentiment of the World in Geo and Time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
Airline Data Set
Government Open Data
Airline Data Set in 2012 – 2014
– US Dept of Transportation
Cluster by Nillohit at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 7 GB
– Windows Server 2012 R2 Datacenter
City Government: Crime Data Set
Open Data in City of Los Angeles
Crime Data Set in 2014
Ram Dharan and Sridhar Reddy at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 14 GB
– Windows Server 2012 R2 Datacenter
– Extending to last 10 years of data set
Crime Data
Los Angeles 2014
[Pie chart: total occurrences of each crime type. Slice labels: 2%, 8%, 9%, 12%, 17%, 19%, 33%. Legend: Criminal, Vandalism, Others, Burglary, Assault, Traffic, Theft]
Total No. of Crimes in 2014
[Bar chart: No. of Crimes per Month, months 1 to 12: 19169, 17384, 19730, 19413, 20645, 20494, 21480, 21280, 21287, 21669, 19844, 21355]
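A few summary figures can be derived from the monthly totals on this slide. The sketch below assumes the twelve values are listed in calendar order (months 1 to 12), which the extracted chart suggests but does not state explicitly.

```python
# Monthly totals as listed on the slide; pairing them with months 1..12
# in the order shown is an assumption.
monthly_crimes = [19169, 17384, 19730, 19413, 20645, 20494,
                  21480, 21280, 21287, 21669, 19844, 21355]

total = sum(monthly_crimes)
average = total / len(monthly_crimes)
peak_month = monthly_crimes.index(max(monthly_crimes)) + 1
lowest_month = monthly_crimes.index(min(monthly_crimes)) + 1

print(total, round(average), peak_month, lowest_month)
```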
Raw Data Projection on Map
Mapping of Crimes That Occurred within 5 Miles of CalStateLA
Mapping of Crimes That Occurred within 5 Miles of UCLA
Mapping of Crimes That Occurred within 5 Miles of USC
Mapping of Crimes That Occurred within 5 Miles of CalStateLA, UCLA, and USC in 2015
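One way to select records within 5 miles of a campus is a great-circle (haversine) distance filter. The slides do not say which method was used, so this is only a sketch under that assumption; the campus coordinates are approximate and the crime records are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def within_radius(records, center, radius_miles=5.0):
    """Keep the records whose (lat, lon) lies within radius_miles of center."""
    return [r for r in records
            if haversine_miles(center[0], center[1], r["lat"], r["lon"]) <= radius_miles]


# Approximate campus location and made-up crime records, for illustration only
CALSTATELA = (34.0668, -118.1684)
records = [
    {"id": 1, "lat": 34.0700, "lon": -118.1700},   # a few blocks away
    {"id": 2, "lat": 34.0522, "lon": -118.2437},   # downtown Los Angeles
    {"id": 3, "lat": 33.7701, "lon": -118.1937},   # Long Beach
]
nearby = within_radius(records, CALSTATELA)
print([r["id"] for r in nearby])
```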
No. of crimes within 5 miles of CSULA, UCLA, and USC by crime type
[Bar chart: crime counts (y-axis 0 to 30,000) for csula, ucla, and usc]
Future Research Trend
Deep Learning
TensorFlow and Spark
– Yahoo, Intel, Google
– Image Recognition, Prediction Analysis
ChatBot
Amazon Alexa API
IBM Watson ChatBot API
Google Home API
More into
In-Memory Processing
– Spark DataFrame, Data Set, ML
Cloud Computing
– IBM Bluemix, MS Azure, Google Cloud, Amazon AWS
Question?

More Related Content

What's hot

Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the EcosystemsIntroduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Jongwook Woo
 
Introduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using HadoopIntroduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using Hadoop
Jongwook Woo
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
Jongwook Woo
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
Tomy Rhymond
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 
Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...
Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...
Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...
Jongwook Woo
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
KCC Software Ltd. & Easylearning.guru
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Shweta Sahu
 
Big data with java
Big data with javaBig data with java
Big data with java
Stefan Angelov
 
Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data science
Deepak Singh
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Mahantesh Angadi
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
nandhiniarumugam619
 
Revenue Earned From Students in USA
Revenue Earned From Students in USARevenue Earned From Students in USA
Revenue Earned From Students in USA
ApekshitBhingardive
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
vinoth kumar
 
Bigdata
BigdataBigdata
Bigdata
Shankar R
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
Bhushan Kulkarni
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on Summit
Ganesan Narayanasamy
 

What's hot (19)

Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the EcosystemsIntroduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
 
Introduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using HadoopIntroduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using Hadoop
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...
Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...
Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data with java
Big data with javaBig data with java
Big data with java
 
Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data science
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Hadoop
HadoopHadoop
Hadoop
 
Revenue Earned From Students in USA
Revenue Earned From Students in USARevenue Earned From Students in USA
Revenue Earned From Students in USA
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Bigdata
BigdataBigdata
Bigdata
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on Summit
 

Viewers also liked

The Top 8 Trends for Big Data in 2016
The Top 8 Trends for Big Data in 2016The Top 8 Trends for Big Data in 2016
The Top 8 Trends for Big Data in 2016
Tableau Software
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Spark Summit
 
Trend Teknologi Pembelajaran
Trend Teknologi PembelajaranTrend Teknologi Pembelajaran
Trend Teknologi Pembelajaran
jeeroloo
 
Clase n° 5 passo
Clase n° 5   passoClase n° 5   passo
Clase n° 5 passo
Mario Salazar Orihuela
 
GreenRoad Fleet Managment solution spotlight
GreenRoad Fleet Managment solution spotlightGreenRoad Fleet Managment solution spotlight
GreenRoad Fleet Managment solution spotlight
Miles Driven
 
Skill Development Program
Skill Development ProgramSkill Development Program
Skill Development Program
Dev Textile Services, Ludhiana
 
Flujo circular economico
Flujo circular economicoFlujo circular economico
Flujo circular economico
Elí Cortés
 
Juliocesarcamachodiaz1 l
Juliocesarcamachodiaz1 lJuliocesarcamachodiaz1 l
Juliocesarcamachodiaz1 lCesar Diiaz
 
Latest Update Bigdata in indonesia
Latest Update Bigdata in indonesiaLatest Update Bigdata in indonesia
Latest Update Bigdata in indonesia
Heru Sutadi
 
NABIL_WAGDY_ELBAZ_HSE_2015_doc
NABIL_WAGDY_ELBAZ_HSE_2015_docNABIL_WAGDY_ELBAZ_HSE_2015_doc
NABIL_WAGDY_ELBAZ_HSE_2015_docNabil Elbaz
 
What is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use CasesWhat is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use Cases
Tony Pearson
 
243
243243
Julius caesar by william shakespeare
Julius caesar by william shakespeareJulius caesar by william shakespeare
Julius caesar by william shakespeare
jocsan jimenez
 
Prezentacija Sokobanja
Prezentacija SokobanjaPrezentacija Sokobanja
Prezentacija Sokobanja
Luka Stosic
 
"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016
"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016
"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016
René Pfitzner
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
Girish Khanzode
 
Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Edureka!
 
Prezentacija Zemun
Prezentacija ZemunPrezentacija Zemun
Prezentacija Zemun
Luka Stosic
 
Komunitas tumbuhan
Komunitas tumbuhanKomunitas tumbuhan
Komunitas tumbuhan
Jessy Damayanti
 

Viewers also liked (20)

The Top 8 Trends for Big Data in 2016
The Top 8 Trends for Big Data in 2016The Top 8 Trends for Big Data in 2016
The Top 8 Trends for Big Data in 2016
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
 
Trend Teknologi Pembelajaran
Trend Teknologi PembelajaranTrend Teknologi Pembelajaran
Trend Teknologi Pembelajaran
 
Clase n° 5 passo
Clase n° 5   passoClase n° 5   passo
Clase n° 5 passo
 
GreenRoad Fleet Managment solution spotlight
GreenRoad Fleet Managment solution spotlightGreenRoad Fleet Managment solution spotlight
GreenRoad Fleet Managment solution spotlight
 
Skill Development Program
Skill Development ProgramSkill Development Program
Skill Development Program
 
Flujo circular economico
Flujo circular economicoFlujo circular economico
Flujo circular economico
 
Juliocesarcamachodiaz1 l
Juliocesarcamachodiaz1 lJuliocesarcamachodiaz1 l
Juliocesarcamachodiaz1 l
 
Latest Update Bigdata in indonesia
Latest Update Bigdata in indonesiaLatest Update Bigdata in indonesia
Latest Update Bigdata in indonesia
 
NABIL_WAGDY_ELBAZ_HSE_2015_doc
NABIL_WAGDY_ELBAZ_HSE_2015_docNABIL_WAGDY_ELBAZ_HSE_2015_doc
NABIL_WAGDY_ELBAZ_HSE_2015_doc
 
What is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use CasesWhat is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use Cases
 
243
243243
243
 
Julius caesar by william shakespeare
Julius caesar by william shakespeareJulius caesar by william shakespeare
Julius caesar by william shakespeare
 
Prezentacija Sokobanja
Prezentacija SokobanjaPrezentacija Sokobanja
Prezentacija Sokobanja
 
"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016
"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016
"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?
 
Prezentacija Zemun
Prezentacija ZemunPrezentacija Zemun
Prezentacija Zemun
 
Komunitas tumbuhan
Komunitas tumbuhanKomunitas tumbuhan
Komunitas tumbuhan
 

Big Data Trend with Open Platform

  • 1. Big Data Trend with Open Platform
    SWRC 2017, San Diego, CA, Feb 25 2017
    Jongwook Woo, PhD, jwoo5@calstatela.edu
    High-Performance Information Computing Center (HiPIC), California State University Los Angeles
  • 2. Contents
    ▪ Myself
    ▪ Big Data
    ▪ Spark
    ▪ Spark and Hadoop
    ▪ Open Platform
    ▪ Use Cases
    ▪ Future Trend
  • 3. Myself
    Experience:
    ▪ Since 2002: Professor at California State University, Los Angeles
      – PhD in 2001: Computer Science and Engineering at USC
    ▪ Since 1998: R&D consulting in Hollywood
      – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
      – Information search and integration with FAST, Lucene/Solr, Sphinx
      – Implemented eBusiness applications using J2EE and middleware
    ▪ Since 2007: Exposed to Big Data at CitySearch.com
    ▪ 2012 to present: Big Data academic partnerships
      – For Big Data research and training
        • Amazon AWS, Microsoft Azure, IBM Bluemix
        • Databricks, Hadoop vendors
  • 4. Myself
    Experience (cont'd): bringing Big Data R&D and training to Korea since 2009
    ▪ Collaborating with the City of Los Angeles in 2016
      – Collect, search, and analyze city data
        • Hadoop, Solr, Java, Cloudera
    ▪ Sept 2013: Samsung Advanced Technology Training Institute
    ▪ Since 2008
      – Introducing Hadoop, Big Data, and education to universities and research centers
        • Yonsei, Gachon
        • US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
        • Europe: Univ of Luxembourg
  • 5. Experience in Big Data
    ▪ Collaboration
      – Council member of the IBM Spark Technology Center
      – City of Los Angeles for OpenHub and Open Data
      – Startup companies in Los Angeles
      – External collaborator and advisor in Big Data
        • IMSC of USC
        • Pennsylvania State University
    ▪ Grants
      – Research and education grants from IBM Bluemix, Microsoft Windows Azure, Amazon AWS
    ▪ Partnership
      – Academic education partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  • 6. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  • 7. Data Issues
    ▪ Large-scale data
      – Terabytes (10^12 bytes), petabytes (10^15 bytes)
        • Because of the web
        • Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
    ▪ Cannot be handled with the legacy approach
      – Too big
      – Non-/semi-structured data
      – Too expensive
    ▪ Need new systems
      – Inexpensive
  • 8. Two Cores in Big Data
    ▪ How to store Big Data
    ▪ How to compute Big Data
    ▪ Google
      – How to store Big Data
        • GFS
        • Distributed systems on inexpensive commodity computers
      – How to compute Big Data
        • MapReduce
        • Parallel computing with inexpensive computers
      – Its own supercomputers
      – Published papers in 2003 and 2004
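The MapReduce idea named on slide 8 can be illustrated with a minimal single-process sketch in plain Python. It is not Hadoop code and not taken from the slides; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different machine; here it is sequential.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big compute"])
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across inexpensive machines, which is the point the slide makes.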
  • 9. What is Hadoop?
    ▪ Hadoop founder:
      – Doug Cutting
    ▪ Apache committer: Lucene, Nutch, …
  • 10. Super Computer vs Hadoop
    [Figure: "Parallel vs. Distributed file systems" by Michael Malak. Labels: Cluster for Compute, Cluster for Store, Cluster for Compute/Store]
  • 11. Definition: Big Data
    ▪ Inexpensive frameworks that can store large-scale data and process it faster in parallel
    ▪ Hadoop
      – Inexpensive supercomputer
      – More accessible to the public than traditional supercomputers
        • You can store and process your applications
        • In your university labs, small companies, research centers
  • 12. Hadoop Cluster: Logical Diagram
    [Diagram: a web browser reaches the cluster monitor (CM/Ambari) over HTTP(S); each cluster node runs an agent and Hadoop on top of HDFS; services shown: Hive, ZooKeeper, Impala]
  • 13. Hadoop Ecosystems
    http://dawn.dbsdataprojects.com/tag/hadoop/
  • 14. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  • 15. Alternative to Hadoop MapReduce
    ▪ Limitations of MapReduce
      – Hard to program in Java
      – Only Map and Reduce
        • Limited parallelization
      – Batch processing
        • Not interactive
      – Disk storage for intermediate data
        • Performance issue
    ▪ Spark, by UC Berkeley AMPLab
      – In-memory storage for intermediate data
      – 20 to 100 times faster than network and disk (MapReduce)
  • 16. Spark
    ▪ In-memory data computing
      – Faster than Hadoop MapReduce
    ▪ Can integrate with Hadoop and its ecosystems
      – HDFS
      – Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
    ▪ New programming model with faster data sharing
      – Good for complex multi-stage applications
        • Iterative graph algorithms, machine learning
      – Interactive queries
  • 17. Spark
    [Diagram: Spark stack]
    ▪ Spark Core: RDDs, transformations, and actions
    ▪ Spark Streaming (real-time): DStreams, streams of RDDs
    ▪ Spark SQL: SchemaRDDs, DataFrames
    ▪ ML / MLlib (machine learning): RDD-based matrices
    ▪ GraphX (graph): RDD-based matrices
    ▪ SparkR: RDD-based matrices
  • 18. Spark Drivers and Workers
    ▪ Drivers
      – Client
        • With SparkContext
        • Communicates with Spark workers
    ▪ Workers
      – Spark executor
      – Run on cluster nodes
        • Production
      – Run in local threads
        • Development and test
  • 19. RDD
    ▪ Resilient Distributed Dataset (RDD)
      – Distributed collections of objects
        • That can be cached in memory
      – Immutable
        • RDD, DStream, SchemaRDD, PairRDD
      – Lineage
        • History of the objects
        • Automatically and efficiently recompute lost data
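The lineage idea on slide 19 can be sketched with a toy class in plain Python. `ToyDataset` is a hypothetical stand-in, not the Spark RDD API: it only shows how an immutable dataset that remembers its parent and the function applied to it can rebuild a lost in-memory copy.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: immutable, remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None):
        # `source` models data in stable storage; `parent` and `fn` are the lineage.
        self._source = tuple(source) if source is not None else None
        self._parent = parent
        self._fn = fn
        self._cache = None  # in-memory copy, which may be lost

    def map(self, fn):
        # Returns a new dataset; the current one is never modified.
        return ToyDataset(parent=self, fn=fn)

    def compute(self):
        # Rebuild from lineage only when no cached copy is available.
        if self._cache is None:
            if self._parent is None:
                self._cache = self._source
            else:
                self._cache = tuple(self._fn(x) for x in self._parent.compute())
        return self._cache

    def drop_cache(self):
        # Simulates losing the in-memory data, for example after a node failure.
        self._cache = None


base = ToyDataset(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
first = doubled.compute()
doubled.drop_cache()
recovered = doubled.compute()  # recomputed from the parent and the function
```

In this sketch nothing is replicated: the recorded history is enough to recompute the lost values, which is the property the slide attributes to lineage.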
  • 20. RDD and Data Frame Operations
    ▪ Transformations
      – Define new RDDs and Data Frames from the current ones
        • Lazy: not computed immediately
      – map(), filter(), join(), select(), groupBy()
    ▪ Actions
      – Return values
      – count(), collect(), take(), save()
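The difference between lazy transformations and actions on slide 20 can be imitated with Python's built-in lazy `map` and `filter`. This is an analogy in plain Python, not Spark code; `traced_double` and the `calls` list are hypothetical helpers used only to observe when work actually happens.

```python
calls = []


def traced_double(x):
    """Double a value and record that the function really ran."""
    calls.append(x)
    return x * 2


numbers = [1, 2, 3, 4]

# "Transformations": these only describe the computation; nothing runs yet.
doubled = map(traced_double, numbers)
large = filter(lambda x: x > 4, doubled)
calls_before_action = len(calls)

# "Action": materializing the result forces the whole pipeline to execute.
result = list(large)
calls_after_action = len(calls)
```

Deferring the work until a result is requested is what lets an engine see the whole chain of operations before it decides how to run them.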
  • 21. Programming in Spark
    ▪ Scala
      – Functional programming
        • The fundamental unit of programming is the function
        • Input/output is a function
      – No side effects
        • No state
    ▪ Python
      – Legacy, large libraries
    ▪ Java
    ▪ R
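The functional style described on slide 21 (functions as input and output, no side effects) can be shown in a few lines. The slide names Scala; the sketch below uses Python to stay consistent with the other examples here, and `compose`, `increment`, and `double` are hypothetical names.

```python
def compose(f, g):
    """Take two functions as input and return a new function as output."""
    return lambda x: f(g(x))


def increment(x):
    # Pure: the result depends only on the argument, and nothing else is changed.
    return x + 1


def double(x):
    return x * 2


double_then_increment = compose(increment, double)

values = (1, 2, 3)  # an immutable tuple, so the input cannot be modified
result = tuple(map(double_then_increment, values))
```

Because the functions keep no state, applying them to different elements in any order, or on different machines, gives the same result.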
  • 22.
  • 23. Spark
    ▪ Spark SQL
      – Querying using SQL, HiveQL
      – Data Frame
    ▪ ML
      – Machine learning on Data Frames, pipelining
      – MLlib
        • On RDDs
        • Sparse vector support, decision trees, linear/logistic regression, PCA, SVM
    ▪ Spark Streaming
      – DStream
        • RDDs in streaming
        • Windows: to select a DStream from streaming data
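The window idea on slide 23 can be sketched as a sliding selection over the most recent micro-batches. This is plain Python under stated assumptions, not the Spark Streaming API: each inner list stands for one batch of a stream, and `windowed_counts` and `window_length` are hypothetical names.

```python
from collections import deque


def windowed_counts(batches, window_length):
    """Yield the number of records in the last `window_length` batches."""
    window = deque(maxlen=window_length)  # the oldest batch drops out automatically
    for batch in batches:
        window.append(batch)
        yield sum(len(b) for b in window)


# Each inner list models the records that arrived in one batch interval.
batches = [["a", "b"], ["c"], [], ["d", "e", "f"]]
totals = list(windowed_counts(batches, window_length=2))
```

Here every output value summarizes the two most recent batches, so the window slides forward by one batch at a time.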
  • 24. Scheduling Process
    [Diagram: RDD objects → Optimizer → DAGScheduler → TaskScheduler → Worker]
    ▪ RDD objects: rdd1.join(rdd2).groupBy(…).filter(…)
    ▪ Optimizer: build operator DAG
    ▪ DAGScheduler (receives the DAG)
      – Split graph into stages of tasks
      – Submit each stage as ready
      – Agnostic to operators
    ▪ TaskScheduler (receives a TaskSet)
      – Launch tasks via cluster manager
      – Retry failed or straggling tasks
      – Doesn't know about stages; reports "stage failed" back
    ▪ Worker (receives a Task)
      – Execute tasks
      – Store and serve blocks
      – Threads, block manager
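The "split graph into stages" step on slide 24 can be approximated with a deliberately simplified sketch. It assumes a linear chain of operator names and assumes that `join` and `groupBy` require a shuffle, so a new stage starts at each of them; it ignores that a join has two parents and that real stage boundaries also depend on partitioning. `WIDE_OPS` and `split_into_stages` are hypothetical names, not Spark internals.

```python
# Assumption: these operators need a shuffle, so they begin a new stage.
WIDE_OPS = {"join", "groupBy"}


def split_into_stages(ops):
    """Group a linear chain of operators into stages at shuffle boundaries."""
    stages = [[]]
    for op in ops:
        # A wide operator starts a new stage unless the current stage is still empty.
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return [stage for stage in stages if stage]


# The chain shown on the slide: rdd1.join(rdd2).groupBy(…).filter(…)
slide_stages = split_into_stages(["join", "groupBy", "filter"])
longer_stages = split_into_stages(["map", "join", "groupBy", "filter"])
```

Operators that do not need a shuffle, such as `filter`, stay in the same stage as the operator before them, so they can run together in one task.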
  • 25. During Scheduling Process
    https://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797
  • 26. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  • 27. Spark
    ▪ Spark
      – File system: Tachyon
      – Resource manager: Mesos
    ▪ But Hadoop has been dominating the market
    ▪ Integrating Spark into Hadoop clusters
      – Cloud computing
        • Amazon AWS, Azure HDInsight, IBM Bluemix
        • Object Storage, S3
      – Hadoop vendors
        • HDP, CDH
    ▪ Databricks: Spark on AWS
      – No Hadoop ecosystems
  • 28. Spark Components
    [Diagram: your program → Spark driver/client (app master) → cluster manager → Spark worker(s) → storage]
    ▪ Your program
        sc = new SparkContext
        f = sc.textFile("…")
        f.filter(…)
         .count()
        ...
    ▪ Spark driver/client (app master): RDD graph, scheduler, block tracker, shuffle tracker
    ▪ Cluster manager
    ▪ Spark worker(s): block manager, task threads
    ▪ Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
  • 29. Spark with Hadoop YARN
    ▪ ResourceManager (RM), per cluster
      – Creates the Spark AM and allocates containers for the Spark AM
    ▪ NodeManager (NM), per node
      – Spark workers
    ▪ ApplicationMaster (AM), per application
      – Containers for Spark executors
    [Diagram: Spark client; master node with the Resource Manager; slave nodes with Node Managers hosting the Spark AM and Spark executor containers]
  • 30. Databricks cluster at CalStateLA
  • 31. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  • 32. Open Platform
    ▪ Open Source
    ▪ Open Conference
    ▪ Open Data
      – Public data
  • 33. Open Source
    ▪ Hadoop: http://hadoop.apache.org/
    ▪ Spark: http://spark.apache.org/
    ▪ NoSQL: http://hbase.apache.org/
    ▪ Search engine: http://lucene.apache.org/solr/
  • 34. Open Conference
    ▪ Hadoop Summit
      – Live streaming: http://siliconangle.tv/hadoop-summit-2016/
    ▪ Spark Summit: https://spark-summit.org/east-2017/
      – Live streaming: http://go.spark-summit.org/east-2017/live-stream?_ga=1.62160364.1150099959.1484851457
  • 35. Open Data
    ▪ USA government
      – Federal, state, and city governments
      – Expose data to the public
    ▪ USA business
      – Twitter, Yelp, …
      – Expose data to the public with APIs
        • Some restrictions on downloading
    ▪ City government
      – New York
        • Taxi, Uber, …
      – Los Angeles
        • Open Data, Open Hub with geo info
  • 36. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  • 37. Databricks Partners
  • 38. Industrial Collaboration
    Cloudera visits to interview Jongwook Woo
  • 39. Industrial Collaboration: IBM Bluemix at CalStateLA
  • 40. Big Data Analysis and Prediction Flow
    ▪ Data collection
      – Batch API: Yelp, Google
      – Streaming: Twitter, Apache NiFi, Kafka, Storm
      – Open Data: government
    ▪ Data storage
      – HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
    ▪ Data filtering
      – Hive, Pig
    ▪ Data analysis and science
      – Hive, Pig, Spark, BI tools (Tableau, Qlik, …)
    ▪ Data visualization
      – Qlik, Datameer, Excel PowerView
  • 41. Databricks cluster at CalStateLA
  • 42. LOCAL BUSINESS DATA ANALYSIS: Yashaswi Ananth, Ruchi Singh, Mahsa Tayer Farahani
  • 43. LOCAL BUSINESS DATA ANALYSIS Using Local Business Data from Yelp and Google Local. Grad students at CalStateLA Symposium, Feb 24 2017: Yashaswi Ananth, Ruchi Singh, Mahsa Tayer Farahani
  • 44. REVIEW COUNT FOR BUSINESS TYPES: Food, Services, Entertainment, Shopping, Medical
  • 45. TOP BUSINESS IN THE SIX CATEGORIES
  • 46. Review count of popular sub-categories of business
  • 47. Sentiment Analysis of Services category
  • 48. Top Business: top 5 most popular local businesses on Yelp between 2006 and 2016 in the selected cities
  • 49. Popular businesses within 5 miles of CalStateLA, USC, and UCLA
  • 50. Historical Analysis of College Scorecard. CalStateLA Symposium, Feb 24 2017: Kunal Pritwani, Atinder Singh, Dharmesh Soni, Mounika Vallabhaneni
  • 51. Specification of Data Set: Data is collected from https://www.kaggle.com/kaggle/college-scorecard. We have historical data of over 100,000 colleges in the US spanning over 14 years. Data size: 1.33 GB. File format: CSV (comma-separated values).
  • 52. Mean Income: Medical College of Wisconsin: 250K; Upstate Medical University: 152.7K; CalTech: 103K; Washington and Lee University: 100K
  • 53. Comparing Average Net Price of Two States (Annual Tuition): UCLA: $13,817; CalStateLA: $4,370; Fashion Inst of Tech: $11.5K; CUNY: $5K
  • 54. SAT Scores in Different Colleges, listed as Math (Blue), Verbal (Orange), Mean Earning (Purple): CalTech: 800, 778.9, $98.7K; MIT: 800, 764.4, $124.4K; Harvard: 791, 795.6, $133K; Princeton: 793, 791, $115.6K; Yale: 788, 794.4, $97.8K
  • 55. Comparing Average Undergraduates Receiving PELL GRANT: Universal Career Community College: 100% PELL grant scholarship
  • 56. Average Undergraduates Receiving PELL GRANT in Each College: East Georgia State College: $2,854; Avg. PELL grant: 97.285%
  • 57. AlphaGo vs Lee using Twitter Data. Systems: Azure HDInsight Spark, 8 nodes (40 cores: 2.4 GHz Intel Xeon; memory per node: 28 GB). Data source: keyword ‘alphago’ from Twitter via Apache NiFi. Data size: 63,193 tweets. Real-time data collection period: 03/12 - 03/17/2016 (no data collected on 03/13).
  • 58. Top 10 Countries that Tweet “Alphago”
  • 59. Top 10 Countries, # of Tweets per Country: USA: > 11,000; Japan: > 9,000; Korea: > 1,900; Russia, UK: > 1,600; Thailand, France: > 1,000; Netherlands, Spain, Ukraine: > 600
  • 60. Top 10 Countries Sentiment (Positive, Negative)
  • 61. Top 10 Countries, Most Tweeted Countries: all countries show more positive tweets (Korea, Japan, USA). Country (Positive, Negative): USA (5070, 3567); Japan (8118, 217); …; Korea (1053, 407); …
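The positive and negative counts on this slide can be turned into a positive share per country, which makes the "more positive tweets" observation easier to compare across countries. The following is a minimal sketch in plain Python, not the Spark job used in the study; it uses only the three rows the slide shows, and the names `sentiment_counts` and `positive_share` are illustrative assumptions.

```python
# Positive/negative tweet counts copied from the slide (only the rows shown there).
sentiment_counts = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Return the fraction of sentiment-tagged tweets that are positive."""
    return positive / (positive + negative)


# Share of positive tweets per country (hypothetical helper names, not from the deck).
shares = {country: positive_share(pos, neg)
          for country, (pos, neg) in sentiment_counts.items()}

# Countries ordered from most to least positive.
ranked = sorted(shares, key=shares.get, reverse=True)

for country in ranked:
    print(f"{country}: {shares[country]:.1%} positive")
```

A share above 0.5 means positive tweets outnumber negative ones, which is the condition the slide states for all countries.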
  • 62. Daily Tweets in 03/12 - 03/17/2016 [chart: tweets per day, y-axis 0 to 50,000, x-axis 3/12/2016 to 3/17/2016]. AlphaGo vs Lee Sedol: Game 3: Mar 12; Game 4: Mar 13, Lee Se-Dol win; Game 5: Mar 15
  • 63. N-gram words: 3 words in a row right after the Go champion's name, “sedol” and “se-dol”. 3-grams (frequency): Again-to-win (1,187); Is-something-I’ll (369); Is-something-i (199); In-go-tournament (168)
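Counting the three words that follow a keyword can be sketched as below. This is a plain-Python illustration under stated assumptions, not the Spark code behind the slide: the tokenizer regex, the function name `trigrams_after`, and the sample tweets are all hypothetical.

```python
import re
from collections import Counter


def trigrams_after(tweets, keywords=("sedol", "se-dol")):
    """Count the 3-word sequences that directly follow any keyword."""
    counts = Counter()
    for text in tweets:
        # Lowercase and keep letters, digits, apostrophes and hyphens,
        # so "se-dol" and "i'll" each stay a single token.
        tokens = re.findall(r"[a-z0-9'’-]+", text.lower())
        for i, token in enumerate(tokens):
            # Only count when three full words follow the keyword.
            if token in keywords and i + 3 < len(tokens):
                counts["-".join(tokens[i + 1:i + 4])] += 1
    return counts


# Made-up sample tweets for illustration only.
sample_tweets = [
    "Lee Sedol again to win game four",
    "lee se-dol again to win!",
    "Sedol is something I'll remember",
]

trigram_counts = trigrams_after(sample_tweets)
for trigram, frequency in trigram_counts.most_common():
    print(trigram, frequency)
```

Both spellings of the name feed the same counter, which matches the slide's treatment of “sedol” and “se-dol” as one keyword.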
  • 64. Sentiment Map of AlphaGo (Positive, Negative)
  • 65. Sentiment Map of Lee Se-Dol vs AlphaGo. YouTube video: “alphago sentiment” by Google. The sentiment of the world in geo and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
  • 66. Airline Data Set: Government Open Data. Airline data set for 2012 - 2014 from the US Dept of Transportation. Cluster by Nillohit at HiPIC, CSULA: Microsoft Azure using Hive and Spark SQL. Number of data nodes: 4 (CPU: 4 cores; memory: 7 GB; Windows Server 2012 R2 Datacenter).
  • 67. Airline Data Set
  • 68. Airline Data Set
  • 69. Airline Data Set
  • 70. City Government: Crime Data Set. Open Data in City of Los Angeles: crime data set for 2014. Ram Dharan and Sridhar Reddy at HiPIC, CSULA: Microsoft Azure using Hive and Spark SQL. Number of data nodes: 4 (CPU: 4 cores; memory: 14 GB; Windows Server 2012 R2 Datacenter). Extending to the last 10 years of the data set.
  • 71. Crime Data Los Angeles 2014, total occurrences of each crime [pie chart]. Shares shown: 2%, 8%, 9%, 12%, 17%, 19%, 33%. Categories: CRIMINAL, VANDALISM, OTHERS, BURGLARY, ASSAULT, TRAFFIC, THEFT
  • 72. Total No. of Crimes in 2014, No. of Crimes per Month [bar chart, y-axis 0 to 25,000]. Month: count: 1: 19,169; 2: 17,384; 3: 19,730; 4: 19,413; 5: 20,645; 6: 20,494; 7: 21,480; 8: 21,280; 9: 21,287; 10: 21,669; 11: 19,844; 12: 21,355
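The twelve monthly counts on this slide can be summarized with a few lines of arithmetic. The sketch below assumes the twelve values map to months 1 through 12 in the order they appear on the slide; the variable names are illustrative and not taken from the deck.

```python
# Monthly crime counts for 2014, in slide order (assumed to be months 1..12).
monthly_crimes = [19169, 17384, 19730, 19413, 20645, 20494,
                  21480, 21280, 21287, 21669, 19844, 21355]

# Yearly total and average number of crimes per month.
total_crimes = sum(monthly_crimes)
mean_per_month = total_crimes / len(monthly_crimes)

# Months (1-based) with the highest and lowest counts.
peak_month = max(range(1, 13), key=lambda m: monthly_crimes[m - 1])
low_month = min(range(1, 13), key=lambda m: monthly_crimes[m - 1])

print(f"total={total_crimes}, mean per month={mean_per_month:.1f}")
print(f"peak month={peak_month}, lowest month={low_month}")
```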
  • 73. Raw Data Projection on Map
  • 74. Mapping of Crimes That Occurred within 5 Miles of CalStateLA
  • 75. Mapping of Crimes That Occurred within 5 Miles of UCLA
  • 76. Mapping of Crimes That Occurred within 5 Miles of USC
  • 77. Mapping of Crimes That Occurred within 5 Miles of CalStateLA, UCLA and USC in 2015
  • 78. No. of crimes within 5 miles of CSULA, UCLA and USC by crime type [bar chart: y-axis 0 to 30,000; series: csula, ucla, usc]
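One way to select crimes within 5 miles of a campus is a great-circle (haversine) distance filter. The slides do not say how the radius was computed, so this is only a sketch under assumptions: the campus coordinates are approximate values supplied here, the record layout (`type`, `lat`, `lon`) is hypothetical, and the sample records are made up.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

# Approximate campus coordinates (assumed for illustration, not from the deck).
CAMPUSES = {
    "CalStateLA": (34.0664, -118.1684),
    "UCLA": (34.0689, -118.4452),
    "USC": (34.0224, -118.2851),
}


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def crimes_near(crimes, campus, radius_miles=5.0):
    """Return the crime records located within radius_miles of a campus."""
    lat0, lon0 = CAMPUSES[campus]
    return [c for c in crimes
            if haversine_miles(lat0, lon0, c["lat"], c["lon"]) <= radius_miles]


# Made-up records in a hypothetical layout.
sample_crimes = [
    {"type": "THEFT", "lat": 34.0700, "lon": -118.1700},
    {"type": "ASSAULT", "lat": 34.0600, "lon": -118.4400},
]

# Count crimes by type within 5 miles of each campus.
for campus in CAMPUSES:
    by_type = Counter(c["type"] for c in crimes_near(sample_crimes, campus))
    print(campus, dict(by_type))
```

On a cluster the same distance test could be applied as a filter in Hive or Spark SQL; the plain-Python version only shows the geometry.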
  • 79. Contents: Myself; Big Data; Spark; Spark and Hadoop; Open Platform; Use Cases; Future Trend
  • 80. Future Research Trend. Deep Learning: TensorFlow and Spark (Yahoo, Intel, Google); image recognition, prediction analysis. ChatBot: Amazon Alexa API, IBM Watson ChatBot API, Google Home API. More into in-memory processing: Spark DataFrame, Dataset, ML. Cloud computing: IBM Bluemix, MS Azure, Google Cloud, Amazon AWS.
  • 81. Question?