Jongwook Woo
HiPIC
CSULA
UKC 2015
Atlanta, GA
July 30 2015
Nillohit Bhattacharya, nbhatta2@calstatela.edu
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
Big Data Analysis and
Industrial Approach
using Spark
High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
▪ Myself
▪ Introduction To Big Data
▪ Hive Examples
▪ Spark Cores
▪ RDD
▪ Spark SQL, Streaming, ML
▪ MBA Examples
▪ Academic Cloud Computing
Myself
Name: 우종욱, Jongwook Woo
Experience:
▪ 2012 - Present
– Certified Cloudera Instructor: R&D, Consulting, Training
▪ 2012 - Present: Big Data Academic Partnerships
– Cloudera, Hortonworks Partner for Hadoop Training
– Amazon AWS, Microsoft Azure, IBM Bluemix
▪ Since 2002: Professor at California State University Los Angeles
▪ Since 1998: R&D consulting in Hollywood
– Implemented eBusiness applications using J2EE and middleware
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
▪ Since 2007: Exposed to Big Data
▪ PhD in 2001: Computer Science and Engineering at USC
Myself
Experience (Cont’d): Bringing Big Data training and R&D to Korea since 2009
2014: Training on Hadoop and its ecosystem
Summer 2013: Igloo Security
– Collect, search, and analyze security log files, 30GB – 100GB / day
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop, Big Data, and education at universities and research centers
Experience in Big Data
▪ Grants
▪ Received IBM Bluemix, Microsoft Windows Azure, and Amazon AWS in Education research grants
▪ Partnership
▪ Received Academic Education Partnership with Cloudera, Hortonworks, and IBM
▪ Certificate
▪ Certified Cloudera Hadoop Instructor
▪ Certified Cloudera Hadoop Developer / Administrator / HBase / Spark
Contents
▪ Myself
▪ Introduction To Big Data
▪ Spark Cores
▪ RDD
▪ Task Scheduling
▪ Spark SQL, Streaming, ML
▪ Examples
▪ Use Cases
Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive computers
• Own supercomputers
What is Hadoop?
Hadoop Founder:
Doug Cutting
Chief Architect at Cloudera
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive supercomputer
– You can build and run your applications on it
Hadoop CDH: Logical Diagram
[Figure: a web browser controls the Cloudera Manager (CM) server over HTTP(S); CM manages an agent on each CDH node; the CDH nodes run HDFS, with Hive, ZooKeeper, and Impala on top.]
Contents
▪ Myself
▪ Introduction To Big Data
▪ Hive Examples
▪ Spark Cores
▪ RDD
▪ Spark SQL, Streaming, ML
▪ MBA Examples
▪ Use Cases
Hive
Part of the Hadoop ecosystem
Developed at Facebook
Turns Hadoop into a data warehouse
HiveQL
SQL syntax
Converted to MapReduce jobs
– Batch processing
– Slow
• High startup latency, even for a simple SQL statement
• Impala on MPP for interactive querying
Airline Data Set
▪ Cluster by Nillohit at HiPIC, CSULA:
▪ Airline data set for 2012 – 2014
– US Department of Transportation
▪ Microsoft Azure using Hive
▪ Number of data nodes: 4
– CPU: 4 cores; memory: 7 GB
– Windows Server 2012 R2 Datacenter
▪ Hadoop and Hive
▪ Highest average departure delay
– E Air Lines: 13.25 minutes
– U Air Lines: 12.59 minutes
▪ Lowest average departure delay (least delayed / on time):
– AS, "Alaska Airlines Inc.", 1.33 minutes
– HA, "Hawaiian Airlines Inc.", -0.56 minutes
• (Negative, i.e. on average the flight departs before the scheduled time)
Hive
Easy for data analysis
HiveQL
– SQL syntax
– Converted to MapReduce jobs
Slow
Batch processing
– High startup latency, even for a simple SQL statement
Impala on MPP for interactive querying
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
10 to 100x faster than network and disk
Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its
ecosystems
HDFS
HBase, Hive, Sequence files
New Programming with faster data sharing
Good in complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query
Spark
[Figure: the Spark stack. Spark Core (RDDs, transformations, and actions) underlies Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, DataFrames), MLlib (machine learning; RDD-based matrices), GraphX (graph; RDD-based matrices), and SparkR (RDD-based matrices).]
Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Create RDDs
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–development
Spark Components
[Figure: your program (sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); ...) runs in the Spark driver/client (app master), which holds the RDD graph, scheduler, block tracker, and shuffle tracker. Through the cluster manager it drives the Spark worker(s), each with task threads and a block manager, reading from HDFS, HBase, Amazon S3…]
RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
–that can be cached in memory
RDD, DStream, SchemaRDD, PairRDD
Immutable
Lineage
–History of the objects
–Automatically and efficiently recompute lost
data
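The lineage idea can be sketched in plain Scala, without Spark. This is only a minimal illustration under stated assumptions: `LineageDemo`, `Dataset`, `Source`, and `Mapped` are hypothetical names invented here, not Spark classes. Each derived dataset stores its parent and the function applied to it, so lost results can be rebuilt by replaying that history.

```scala
object LineageDemo {
  // An immutable dataset that records how it was derived (its lineage)
  // instead of storing the derived data itself.
  sealed trait Dataset[A] {
    def compute(): Seq[A]
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  }

  // Root of the lineage: reads from stable storage (here, a function).
  final case class Source[A](load: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = load()
  }

  // Derived dataset: keeps only its parent and the function applied to it.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  def main(args: Array[String]): Unit = {
    val doubled = Source(() => Seq(1, 2, 3)).map(_ * 2)
    // If a computed copy is lost, calling compute() again replays the lineage.
    println(doubled.compute())
    println(doubled.compute())
  }
}
```

Because the datasets are immutable, replaying the same lineage always produces the same result.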
RDD Operations
Transformations
Define new RDDs from the current one
– Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), saveAsTextFile()
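The split between lazy transformations and actions can be illustrated with plain Scala collections. The sketch below is an analogy, not Spark code: a collection `view` stands in for an RDD, and the names `LazyDemo`, `pipeline`, and `evaluated` are hypothetical.

```scala
object LazyDemo {
  // Counts how many elements the mapping function has actually touched.
  var evaluated = 0

  // "Transformations": building the pipeline on a lazy view computes nothing yet.
  def pipeline(data: Seq[Int]) =
    data.view
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 4 == 0)

  def main(args: Array[String]): Unit = {
    val p = pipeline(Seq(1, 2, 3, 4))
    println(s"after transformations: evaluated = $evaluated")
    // "Action": forcing the view runs the whole pipeline.
    val out = p.toList
    println(s"after action: result = $out, evaluated = $evaluated")
  }
}
```

Forcing the view a second time runs the mapping function again, which loosely mirrors how an RDD that is not cached is recomputed for each action.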
Programming in Spark
Scala
Functional programming
– The function is the fundamental unit of programming
• Inputs and outputs can be functions
No side effects
– No state
Python
Legacy, large libraries
Java
Spark
Spark SQL
Turning an RDD into a relation
Querying using SQL
Spark Streaming
DStream
– RDDs in streaming
– Windows
• To select a range of a DStream from streaming data
MLlib
Sparse vector support, decision trees,
linear/logistic regression, SVD and PCA
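The windowing idea under Spark Streaming can be sketched with plain Scala collections. This is an analogy under stated assumptions, not the Spark Streaming API: each inner sequence stands in for the RDD of one micro-batch, and `WindowDemo`, `Batch`, and `windows` are hypothetical names.

```scala
object WindowDemo {
  // Each inner Seq stands in for the RDD of one micro-batch of a DStream.
  type Batch = Seq[String]

  // Group the stream into windows of `length` batches, advancing by `slide`
  // batches, and merge the batches of each window into one collection.
  def windows(stream: Seq[Batch], length: Int, slide: Int): Seq[Seq[String]] =
    stream.sliding(length, slide).map(_.flatten).toList

  def main(args: Array[String]): Unit = {
    val stream = Seq(Seq("a"), Seq("b", "c"), Seq("d"), Seq("e"))
    windows(stream, length = 2, slide = 1).foreach(println)
  }
}
```

In Spark Streaming itself, window length and slide interval are given as durations rather than batch counts.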
Market Basket Analysis (MBA)
Collect the pairs of transaction items that most frequently occur together at a store
Traditional Business Intelligence analysis
→ Much better opportunity for profit
– by controlling the order of products and marketing
– control stock more intelligently
– arrange items on shelves
– promote items together, etc.
Market Basket Analysis (MBA)
Transactions in Store A: input data
Transaction 1: cracker, icecream, beer
Transaction 2: chicken, pizza, coke, bread
Transaction 3: baguette, soda, herring, cracker, beer
Transaction 4: bourbon, coke, turkey
Transaction 5: sardines, beer, chicken, coke
Transaction 6: apples, peppers, avocado, steak
Transaction 7: sardines, apples, peppers, avocado, steak
…
Which pair of items do people frequently buy together at Store A?
Market Basket Analysis (MBA) on Spark
n-gram() Functional Algorithm
1. Take an input transaction text
2. Slide a window of n elements over the items and sort each window
Each window is generated as (e1, e2, …, en)
3. Duplicate item tuples are removed
4. Several (e1, e2, …, en) tuples are produced, depending on n
n-gram
In text, e.g. “He follows Texas Rangers”
Bi-gram
– (He follows), (follows Texas), (Texas Rangers)
Tri-gram
– (He follows Texas), (follows Texas Rangers)
A transaction: (coke, beer, cracker)
→ the output list of bi-grams in MBA
– (beer, coke), (beer, cracker), (coke, cracker)
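Both uses of n-grams on this slide can be reproduced in a few lines of plain Scala. This is a sketch, with `NGramDemo`, `textNgrams`, and `basketNgrams` as hypothetical names. One assumption needs flagging: to obtain all three basket pairs listed above, the sketch takes every 2-item combination of the sorted items; a sliding window over (coke, beer, cracker) would yield only the adjacent pairs (beer, coke) and (beer, cracker).

```scala
object NGramDemo {
  // Text n-grams: slide a window of n words over the sentence, keeping word order.
  def textNgrams(sentence: String, n: Int): List[String] =
    sentence.split(" ").toList.sliding(n).map(_.mkString(" ")).toList

  // Basket n-grams: sort the items, then take every n-item combination.
  // Using combinations is an assumption made to reproduce the three pairs
  // listed on the slide; a sliding window yields only adjacent items.
  def basketNgrams(items: Seq[String], n: Int): List[String] =
    items.sorted.combinations(n).map(_.mkString("+")).toList

  def main(args: Array[String]): Unit = {
    println(textNgrams("He follows Texas Rangers", 2))
    println(basketNgrams(Seq("coke", "beer", "cracker"), 2))
  }
}
```

Sorting before pairing makes (beer, coke) and (coke, beer) count as the same pair, which is what the later counting step relies on.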
Market Basket Analysis (MBA) on Spark
MBA Functional Algorithm
1. Take input text files
2. For all files, apply n-gram() to each line
Each line generates an n-gram list
– The elements of the n-gram lists are flattened
Each flattened element generates an (ngram, 1) pair
3. All pairs are reduced by key
The values for each key are summed
4. (ngram, sum of values) pairs are generated and sorted by sum of values
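The steps above can be tried without a cluster by running them on an in-memory Scala collection instead of an RDD. In this sketch, `groupBy` followed by a sum stands in for Spark's `reduceByKey`; `MbaDemo`, `countPairs`, and the toy transactions are hypothetical and only for illustration.

```scala
object MbaDemo {
  // Same helper as in the deck's example code: sliding windows of n items, each sorted.
  def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] =
    s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet

  // Steps 2 to 4 on an in-memory collection instead of an RDD.
  def countPairs(lines: Seq[String], n: Int): List[(Int, String)] =
    lines
      .flatMap(line => ngram(line, " ", "+", n)) // flatten the n-gram lists
      .map(g => (g, 1))                          // (ngram, 1) pairs
      .groupBy(_._1)                             // stand-in for reduceByKey
      .toList
      .map { case (g, ones) => (ones.map(_._2).sum, g) } // sum the values per key
      .sortBy { case (count, g) => (-count, g) }         // descending by count

  def main(args: Array[String]): Unit = {
    // Hypothetical toy transactions, space-separated as in the deck's code.
    val lines = Seq("beer coke", "coke beer cracker", "beer coke")
    countPairs(lines, 2).foreach(println)
  }
}
```

The result is a list of (count, ngram) tuples, the same shape that the deck's code produces after swapping each pair before sorting.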
Experimental Result
Number of nodes on AWS EC2
2, 4, and 6 nodes
– 1 is a master node
Input transaction data sets
– 1.6 GB and 3.2 GB file sizes
Spark does not need many nodes
– 1 node
• Equivalent to 10 – 100 nodes in MapReduce
Experimental Result
Using Spark with Scala on Amazon AWS
Spark version is 1.3.0
– The data is stored in HDFS on the cluster.
AWS EC2
– m1.large
• 2 cores (2 EC2 vCPU compute units),
• 7.5GB memory and 2 x 420GB storage on 64-bit Amazon Linux
– m1.xlarge
• 4 cores (4 EC2 vCPU compute units),
• 15GB memory and 4 x 420GB storage on 64-bit Amazon Linux
Example Code: Market Basket Analysis
// n-grams pair up the items of one transaction line
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
  s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
// extract the bigrams of every line and count each one
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a + b)
// swap to (count, bigram) and sort by count in descending order
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result3.2")
Example Code: Market Basket Analysis
// ngrams to pair items
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
  s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
toLowerCase: convert to lowercase letters
split(inSep): split by the separator inSep
sliding(n): select n consecutive words as a group
_.sorted: sort the elements in the group
mkString(outSep): join the elements with outSep
toSet: collect the groups into a set of unique elements
Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a + b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
Highlighted step (val result): extract and count the bigrams
Example Code: Market Basket Analysis
val fPath = "jwoo/files3.2G.dat"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a + b)
val sortedResult = result.map(pair => pair.swap).sortByKey(false)
// save result to HDFS
sortedResult.saveAsTextFile("jwoo/result32G")
Highlighted step (val sortedResult): sort the bigrams in descending order of count
Time for MBA
[Figure: bar chart of execution time in seconds (axis 0 to 500) for categories 1, 3, and 5; series: m1.large (1.6GB), m1.xlarge (1.6GB), m1.large (3.2GB), m1.xlarge (3.2GB).]
Scala Spark vs Java
Code size:
1/100 of the Java code
Performance:
Scala Spark: 10x ~ 100x faster than MapReduce in Java
Scala Spark vs PySpark
– Scala Spark
• 2x faster
– PySpark
• has more libraries
Meaning
100 nodes in Java MapReduce
1 – 10 nodes in Spark
Spark Training
California State University Los Angeles (Prof Jongwook Woo)
UC Berkeley edX (MOOC)
UC Berkeley AMP Camp
Stanford
Cloudera, Hortonworks, DataStax training courses
IBM Big Data University
Training Hadoop and Spark
Cloudera visits to interview Jongwook Woo
Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles
Spark on Cloud Computing
Amazon AWS
Spark example at https://spark.apache.org/
Microsoft Azure
IBM Bluemix
Supports Spark and the AMPLab
– Will hire 3,500 Spark engineers
Launched Spark on Bluemix in July 2015
Academic Initiative Programs
– Contact for a Spark on Bluemix account
Conclusion
Big Data is Hadoop
Spark is the way to go for Big Data
Spark training and academic partnership
Question?
References
Hadoop, http://hadoop.apache.org
Apache Spark, Word Count Example (http://spark.apache.org)
Databricks (http://www.databricks.com)
▪ “Market Basket Analysis using Spark”, Jongwook Woo, in Journal of Science and Technology, April 2015, Volume 5, No 4, pp. 207-209, ISSN 2225-7217, ARPN
https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
References
▪ Introduction to Big Data with Apache Spark, Databricks
▪ Stanford Spark Class (http://stanford.edu/~rezab)
▪ Cornell University, CS5304
▪ DS320: DataStax Enterprise Analytics with Spark
▪ Cloudera, http://www.cloudera.com
▪ Hortonworks, http://www.hortonworks.com
▪ Spark 3 Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ...
 

Recently uploaded

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computationsit20ad004
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 

Recently uploaded (20)

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 

Spark ukc2015v1.1

  • 1. Jongwook Woo HiPIC CSULA UKC 2015 Atlanta, GA July 30 2015 Nillohit Bhattacharya, nbhatta2@calstatela.edu Jongwook Woo, PhD, jwoo5@calstatela.edu High-Performance Information Computing Center (HiPIC) Cloudera Academic Partner and Grants Awardee of Amazon AWS California State University Los Angeles Big Data Analysis and Industrial Approach using Spark
  • 2. High Performance Information Computing Center Jongwook Woo CSULA Contents: Myself; Introduction To Big Data; Hive Examples; Spark Cores; RDD; Spark SQL, Streaming, ML; MBA Examples; Academic Cloud Computing
  • 3. Myself. Name: 우종욱, Jongwook Woo. Experience: 2012 - Present: Certified Cloudera Instructor (R&D, Consulting, Training); 2012 - Present: Big Data Academic Partnerships (Cloudera, Hortonworks Partner for Hadoop Training; Amazon AWS, Microsoft Azure, IBM Bluemix); Since 2002: Professor at California State University Los Angeles; Since 1998: R&D consulting in Hollywood (implementing eBusiness applications using J2EE and middleware; Information Search and Integration with FAST, Lucene/Solr, Sphinx; Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.); Since 2007: Exposed to Big Data; PhD in 2001: Computer Science and Engineering at USC
  • 4. Myself. Experience (Cont'd): Bringing Big Data training and R&D to Korea since 2009. 2014: Training on Hadoop and its ecosystem. Summer 2013, Igloo Security: collect, search, and analyze security log files of 30GB - 100GB / day (Hadoop, Solr, Java, Cloudera). Sept 2013: Samsung Advanced Technology Training Institute. Since 2008: Introducing Hadoop Big Data and education in universities and research centers
  • 5. Experience in Big Data. Grants: Received IBM Bluemix, Microsoft Windows Azure, and Amazon AWS in Education Research Grants. Partnership: Received Academic Education Partnership with Cloudera, Hortonworks and IBM. Certificate: Certified Cloudera Hadoop Instructor; Certified Cloudera Hadoop Developer / Administrator / HBase / Spark
  • 6. Contents: Myself; Introduction To Big Data; Spark Cores; RDD; Task Scheduling; Spark SQL, Streaming, ML; Examples; Use Cases
  • 7. Data Issues. Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes), because of the web, sensor data (IoT), bioinformatics, social computing, streaming data, smart phones, online games… It cannot be handled with the legacy approach: too big, un-/semi-structured, too expensive. New, inexpensive systems are needed
  • 8. Two Cores in Big Data: how to store Big Data and how to compute Big Data. Google's answers: to store Big Data, GFS on inexpensive commodity computers; to compute Big Data, MapReduce, i.e. parallel computing with multiple inexpensive computers (your own supercomputer)
  • 9. What is Hadoop? Hadoop founder: Doug Cutting, Chief Architect at Cloudera
  • 10. Definition: Big Data. Inexpensive frameworks that can store large-scale data and process it faster in parallel. Hadoop: an inexpensive supercomputer on which you can build and run your applications
  • 11. Hadoop CDH: Logical Diagram. [Figure: a web browser controls the Cloudera Manager (CM) Server over HTTP(S); each cluster node runs an Agent and CDH; services shown: HDFS, Hive, ZooKeeper, Impala]
  • 12. Contents: Myself; Introduction To Big Data; Hive Examples; Spark Cores; RDD; Spark SQL, Streaming, ML; MBA Examples; Use Cases
  • 13. Hive. One of the Hadoop ecosystem projects; developed at Facebook; turns Hadoop into a data warehouse. HiveQL: SQL syntax, converted to MapReduce jobs. Batch processing: slow in the beginning, even for a simple SQL statement. Impala on MPP for interactive querying
  • 14. Airline Data Set. Cluster by Nillohit at HiPIC, CSULA. Airline data set for 2012 - 2014 from the US Dept of Transportation. Microsoft Azure using Hive: 4 data nodes (CPU: 4 cores; memory: 7 GB; Windows Server 2012 R2 Datacenter); Hadoop and Hive. Highest average departure delay: E Air Lines, 13.25 minutes; U Air Lines, 12.59 minutes. Least average departure delay (least delayed / on time): AS, "Alaska Airlines Inc.", 1.33 minutes; HA, "Hawaiian Airlines Inc.", -0.56 minutes (negative, i.e. on average the flight departs before the scheduled time)
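The per-carrier averages on the airline slide are the kind of result a HiveQL `GROUP BY` with `AVG` would produce. As a minimal sketch of that aggregation in plain Scala (no Hive or Hadoop involved), assuming a hypothetical `Flight` record with a carrier code and a departure delay in minutes, and made-up sample values rather than the talk's data:

```scala
object AvgDelaySketch {
  // Hypothetical record layout: carrier code and departure delay in minutes.
  final case class Flight(carrier: String, depDelayMin: Double)

  // Average departure delay per carrier, highest average first.
  def avgDelayByCarrier(flights: Seq[Flight]): Seq[(String, Double)] =
    flights
      .groupBy(_.carrier)
      .map { case (carrier, fs) => carrier -> fs.map(_.depDelayMin).sum / fs.size }
      .toSeq
      .sortBy { case (_, avg) => -avg }

  def main(args: Array[String]): Unit = {
    // Made-up sample values, for illustration only.
    val sample = Seq(Flight("AS", 3.0), Flight("AS", -1.0), Flight("HA", -2.0), Flight("HA", 1.0))
    avgDelayByCarrier(sample).foreach { case (carrier, avg) => println(s"$carrier: $avg") }
  }
}
```

A negative average, as for HA in this sample, means departures were on average earlier than scheduled.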
  • 15. Airline Data Set
  • 16. Airline Data Set
  • 17. Airline Data Set
  • 18. Hive. Easy for data analysis. HiveQL: SQL syntax, converted to MapReduce jobs. Slow batch processing: slow in the beginning, even for a simple SQL statement. Impala on MPP for interactive querying
  • 19. Contents: Myself; Introduction To Big Data; Hive Examples; Spark Cores; RDD; Spark SQL, Streaming, ML; MBA Examples; Academic Cloud Computing
  • 20. Alternative to Hadoop MapReduce. Limitations of MapReduce: hard to program in Java; batch processing, not interactive; disk storage for intermediate data, which is a performance issue. Spark by UC Berkeley AMPLab: in-memory storage for intermediate data; 10 ~ 100x faster than N/W and disk
  • 21. Spark. In-memory data computing; faster than Hadoop MapReduce. Can integrate with Hadoop and its ecosystem: HDFS, HBase, Hive, Sequence files. New programming model with faster data sharing: good for complex multi-stage applications (iterative graph algorithms, machine learning) and interactive queries
  • 22. Spark. [Figure: Spark stack. Spark Core: RDDs, transformations, and actions. Spark Streaming (real-time): DStreams, streams of RDDs. Spark SQL: SchemaRDDs, DataFrames. MLlib (machine learning): RDD-based matrices. GraphX (graph): RDD-based matrices. SparkR: RDD-based matrices]
  • 23. Spark Drivers and Workers. Drivers: the client with a SparkContext, which creates RDDs. Workers: Spark executors, which run on cluster nodes in production and in local threads for development
  • 24. Spark Components. [Figure: your program (sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); ...) runs in the Spark driver/client (app master), which holds the RDD graph, scheduler, block tracker, and shuffle tracker and talks to the cluster manager; each Spark worker has task threads and a block manager; storage: HDFS, HBase, Amazon S3…]
  • 25. Contents: Myself; Introduction To Big Data; Hive Examples; Spark Cores; RDD; Spark SQL, Streaming, ML; MBA Examples; Use Cases
  • 26. RDD. Resilient Distributed Dataset (RDD): distributed collections of objects that can be cached in memory; variants: RDD, DStream, SchemaRDD, PairRDD. Immutable. Lineage: the history of the objects, used to recompute lost data automatically and efficiently
  • 27. RDD Operations. Transformations: define new RDDs from the current one; lazy, i.e. not computed immediately; map(), filter(), join(). Actions: return values; count(), collect(), take(), save()
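The split between lazy transformations and eager actions can be imitated with a plain Scala collection view. This is an analogy only, assuming nothing about Spark's internals: `view.map` and `filter` stand in for transformations, `toList` stands in for an action, and the counter is a hypothetical device to show when the mapping function actually runs.

```scala
object LazySketch {
  // Returns (evaluations before the "action", result, evaluations after it).
  def demo(): (Int, List[Int], Int) = {
    var evaluations = 0
    // "Transformations": only describe the computation, nothing runs yet.
    val doubled = Seq(1, 2, 3, 4).view
      .map { x => evaluations += 1; x * 2 }
      .filter(_ > 4)
    val before = evaluations
    // "Action": forces the pipeline and materializes the result.
    val result = doubled.toList
    (before, result, evaluations)
  }

  def main(args: Array[String]): Unit =
    println(demo())
}
```

Until `toList` is called the counter stays at zero, which mirrors why a chain of `map()` and `filter()` calls on an RDD costs nothing until `count()` or `collect()` is invoked.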
  • 28. Programming in Spark. Scala: functional programming, where the function is the fundamental unit of programming (inputs and outputs are functions) and there are no side effects and no state. Python: established language with large libraries. Java
  • 29. Contents: Myself; Introduction To Big Data; Hive Examples; Spark Cores; RDD; Spark SQL, Streaming, ML; MBA Examples; Academic Cloud Computing
  • 30. Spark. Spark SQL: turning an RDD into a relation; querying using SQL. Spark Streaming: DStream (RDDs in streaming); windows, to select DStreams from streaming data. MLlib: sparse vector support, decision trees, linear/logistic regression, SVD and PCA
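The idea of a window over a stream of batches can be sketched without Spark Streaming. The following is an illustration with plain Scala collections, not the DStream API: each inner `Seq` is a hypothetical micro-batch, and `windowLen` and `slide` are assumed to be counted in batches.

```scala
object WindowSketch {
  // Word counts over each window of `windowLen` consecutive batches,
  // advancing by `slide` batches at a time.
  def windowedCounts(batches: Seq[Seq[String]], windowLen: Int, slide: Int): List[Map[String, Int]] =
    batches
      .sliding(windowLen, slide)
      .map(window => window.flatten.groupBy(identity).map { case (w, ws) => w -> ws.size })
      .toList

  def main(args: Array[String]): Unit = {
    // Made-up batches, for illustration only.
    val batches = Seq(Seq("a", "b"), Seq("a"), Seq("b", "b"))
    windowedCounts(batches, windowLen = 2, slide = 1).foreach(println)
  }
}
```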
  • 31. Contents: Myself; Introduction To Big Data; Hive Examples; Spark Cores; RDD; Spark SQL, Streaming, ML; MBA Examples; Use Cases
  • 32. Market Basket Analysis (MBA). Collect the list of pairs of transaction items that most frequently occur together at a store or stores. A traditional Business Intelligence analysis that gives a much better opportunity for profit by controlling the order of products and marketing: control the stocks more intelligently, arrange items on shelves, promote items together, etc.
  • 33. Market Basket Analysis (MBA). Transactions in Store A (input data): Transaction 1: cracker, icecream, beer; Transaction 2: chicken, pizza, coke, bread; Transaction 3: baguette, soda, herring, cracker, beer; Transaction 4: bourbon, coke, turkey; Transaction 5: sardines, beer, chicken, coke; Transaction 6: apples, peppers, avocado, steak; Transaction 7: sardines, apples, peppers, avocado, steak; … Which pairs of items do people frequently buy together at Store A?
  • 34. Market Basket Analysis (MBA) on Spark. n-gram() functional algorithm: 1. Take an input transaction text. 2. Slide over the items n elements at a time and sort each group; each group is generated as (e1, e2, …, en). 3. Remove duplicate groups. 4. Several (e1, e2, …, en) groups are produced for the given n
  • 35. n-gram. Normally, for "He follows Texas Rangers": bi-grams: (He follows), (follows Texas), (Texas Rangers); tri-grams: (He follows Texas), (follows Texas Rangers). In MBA, for a transaction (coke, beer, cracker), the output list of bi-grams: (beer, coke), (beer, cracker), (coke, cracker)
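The n-gram step can be tried on the slide's own transaction with plain Scala collections. A sketch under stated assumptions: items are separated by a single space and joined with "+", and both function names are hypothetical. `slidingNgrams` follows the sliding-window description; `allCombos` is a swapped-in variant based on `combinations`, included because a sliding window only pairs adjacent items, whereas the bi-gram list given for (coke, beer, cracker) contains all three pairs.

```scala
object NgramSketch {
  // Sliding-window n-grams over the items in their written order;
  // each window is sorted before being joined.
  def slidingNgrams(line: String, n: Int): Set[String] =
    line.toLowerCase.split(" ").toList.filter(_.nonEmpty)
      .sliding(n).map(_.sorted.mkString("+")).toSet

  // Variant: every n-item combination of the distinct, sorted items.
  def allCombos(line: String, n: Int): Set[String] =
    line.toLowerCase.split(" ").toList.filter(_.nonEmpty).distinct.sorted
      .combinations(n).map(_.mkString("+")).toSet

  def main(args: Array[String]): Unit = {
    println(slidingNgrams("coke beer cracker", 2))
    println(allCombos("coke beer cracker", 2))
  }
}
```

Which variant is appropriate depends on whether item order within a transaction carries meaning; for basket data it usually does not, which favors the combination form at the cost of more pairs per transaction.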
  • 36. Market Basket Analysis (MBA) on Spark. MBA functional algorithm: 1. Take input text files. 2. For all files, apply n-gram to each line: each line generates an n-gram list; the elements of the n-gram lists are flattened; each flattened element generates an (ngram, 1) pair. 3. All pairs are reduced by key: the values for each key are summed. 4. (ngram, sum of values) pairs are generated and sorted by the sum of values
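The four steps of the MBA algorithm can be walked through on an in-memory list of transactions. This is a sketch with plain Scala collections rather than RDDs: `groupBy` plus a sum stands in for Spark's `reduceByKey`, the space separator and "+" joiner are assumptions, and the sample transactions are made up.

```scala
object MbaSketch {
  // Sliding-window n-grams, each window sorted and joined with "+".
  def ngram(line: String, n: Int): Set[String] =
    line.toLowerCase.split(" ").toList.filter(_.nonEmpty)
      .sliding(n).map(_.sorted.mkString("+")).toSet

  // Steps 2 to 4: flatten the n-grams, pair each with 1, sum per key,
  // then sort by descending count (ties broken alphabetically).
  def countPairs(lines: Seq[String], n: Int): Seq[(String, Int)] =
    lines
      .flatMap(line => ngram(line, n))
      .map(g => (g, 1))
      .groupBy(_._1)
      .map { case (g, ones) => g -> ones.map(_._2).sum }
      .toSeq
      .sortBy { case (g, count) => (-count, g) }

  def main(args: Array[String]): Unit = {
    // Made-up transactions, one per line, items separated by spaces.
    val transactions = Seq("beer cracker coke", "cracker beer")
    countPairs(transactions, 2).foreach(println)
  }
}
```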
  • 37. Experimental Result. Number of nodes on AWS EC2: 2, 4, and 6 nodes, of which 1 is a master node. Input transaction data sets: 1.6 GB and 3.2 GB file size. Spark does not need many nodes: 1 node is equivalent to 10 - 100 nodes in MapReduce
  • 38. Experimental Result. Using Spark Scala on Amazon AWS; Spark version is 1.3.0; the data is stored in HDFS on the cluster. AWS EC2 instance types: m1.large: 2 cores (2 EC2 vCPU compute units), 7.5GB memory and 2 x 420GB storage on 64-bit Amazon Linux OS; m1.xlarge: 4 cores (4 EC2 vCPU compute units), 15GB memory and 4 x 420GB storage on 64-bit Amazon Linux OS
  • 39. Example Code: Market Basket Analysis
        // ngrams to pair items
        def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
          s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
        }
        val fPath = "jwoo/files3.2G.dat"
        val lines = sc.textFile(fPath) // lines: RDD[String]
        val ngramNo = 2
        val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo))
          .map(word => (word, 1))
          .reduceByKey((a, b) => a + b)
        val sortedResult = result.map(pair => pair.swap).sortByKey(false)
        // save result to HDFS
        sortedResult.saveAsTextFile("jwoo/result3.2")
  • 40. Example Code: Market Basket Analysis
        // ngrams to pair items
        def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] = {
          s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
        }
        toLowerCase: convert to lowercase letters; split(inSep): split by the separator inSep; sliding(n): select n consecutive words as a group; _.sorted: sort the elements in the group; mkString(outSep): join the elements with outSep; toSet: collect the groups into a set of unique elements
  • 41. Example Code: Market Basket Analysis. Extract and count bigrams (the flatMap, map, reduceByKey chain):
        val fPath = "jwoo/files3.2G.dat"
        val lines = sc.textFile(fPath) // lines: RDD[String]
        val ngramNo = 2
        val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo))
          .map(word => (word, 1))
          .reduceByKey((a, b) => a + b)
        val sortedResult = result.map(pair => pair.swap).sortByKey(false)
        // save result to HDFS
        sortedResult.saveAsTextFile("jwoo/result32G")
  • 42. Example Code: Market Basket Analysis. Sort the bigrams in descending order of the count (the swap and sortByKey(false) line):
        val fPath = "jwoo/files3.2G.dat"
        val lines = sc.textFile(fPath) // lines: RDD[String]
        val ngramNo = 2
        val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo))
          .map(word => (word, 1))
          .reduceByKey((a, b) => a + b)
        val sortedResult = result.map(pair => pair.swap).sortByKey(false)
        // save result to HDFS
        sortedResult.saveAsTextFile("jwoo/result32G")
  • 43. Time for MBA. [Chart: time in sec (axis 0 to 500) at x-axis ticks 1, 3, 5; series: m1.large (1.6GB), m1.xlarge (1.6GB), m1.large (3.2GB), m1.xlarge (3.2GB)]
  • 44. Scala Spark vs Java. Code size: 1/100 of the Java code. Performance: Scala Spark is 10x ~ 100x faster than MapReduce in Java. Scala Spark vs PySpark: Scala Spark is 2x faster; PySpark has more libraries. Meaning: 100 nodes in Java MapReduce correspond to 1 - 10 nodes in Spark
  • 45. Contents: Myself; Introduction To Big Data; Hive Examples; Spark Cores; RDD; Spark SQL, Streaming, ML; MBA Examples; Academic Cloud Computing
  • 46. Spark Training. California State University Los Angeles (Prof Jongwook Woo); UC Berkeley edX (MOOC); UC Berkeley AMP Camp; Stanford; Cloudera, Hortonworks, DataStax training courses; IBM Big Data University
  • 47. Training Hadoop and Spark. Cloudera visits to interview Jongwook Woo
  • 48. Training Hadoop and Spark
  • 49. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  • 50. Spark on Cloud Computing. Amazon AWS: Spark example at https://spark.apache.org/. Microsoft Azure. IBM Bluemix: supports Spark and AMPLab; will hire 3,500 Spark engineers; launch of Spark on Bluemix in July 2015; Academic Initiative Programs (contact for a Spark on Bluemix account)
  • 51. Conclusion. Big Data is Hadoop. Spark is the way to go for Big Data. Spark training and academic partnership
  • 52. Question?
  • 53. References. Hadoop, http://hadoop.apache.org; Apache Spark Word Count Example (http://spark.apache.org); Databricks (http://www.databricks.com); "Market Basket Analysis using Spark", Jongwook Woo, in Journal of Science and Technology, April 2015, Volume 5, No 4, pp207-209, ISSN 2225-7217, ARPN; https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
  • 54. References. Introduction to Big Data with Apache Spark, Databricks; Stanford Spark Class (http://stanford.edu/~rezab); Cornell University, CS5304; DS320: DataStax Enterprise Analytics with Spark; Cloudera, http://www.cloudera.com; Hortonworks, http://www.hortonworks.com; Spark 3 Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/