High Performance Information Computing Center
Jongwook Woo
CSULA
Hydrogen Gas Power Plant Data Analysis and Prediction Using Spark
Manvi Chandra, mchandr2@calstatela.edu
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Contents
Myself
Introduction To Big Data
Machine Learning
Spark Cores
RDD
Spark SQL, Streaming, ML
Hydrogen Gas Power Plant Prediction Model
Myself
Name: Manvi Chandra
Experience:
2012-2014
– Programmer Analyst at Cognizant Technology Solutions
2015-2016 - Present: Master’s in Information Systems
Exposed to Big Data analytics
Pursuing research in Big Data analytics and machine learning
2007-2011: Bachelor of Technology in Electronics and Communication Engineering
Introduction To Big Data
Data Issues
Large-scale data
Terabytes (10^12 bytes), petabytes (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Un-/semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive computers
• Own supercomputers
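As a rough illustration of the MapReduce idea on this slide, here is a minimal single-machine sketch in plain Python. It is not Hadoop or GFS code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made up for illustration, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input, standing in for blocks of a large distributed file.
sample_lines = ["big data needs big storage", "big compute"]
counts = word_count(sample_lines)
```

Each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, which is what lets a framework spread the work over inexpensive machines.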
What is Hadoop?
Hadoop Founder:
Doug Cutting
Chief Architect at Cloudera
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive supercomputer
– You can build and run your applications
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
Memory is 10 to 100x faster than network and disk
Machine Learning
Subfield of computer science that evolved from
the study of pattern recognition and
computational learning theory in artificial
intelligence.
Explores pattern recognition during data analysis
through computer science and statistics.
Machine learning is a method of data analysis
that automates analytical model building. Using
algorithms that iteratively learn from data,
machine learning allows computers to find
hidden insights without being explicitly
programmed where to look.
Machine Learning Studio
Microsoft Azure Machine Learning Studio is a
collaborative, drag-and-drop tool you can use
to build, test, and deploy predictive analytics
solutions on your data.
Spark
In-memory data computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystem
HDFS
HBase, Hive, sequence files
New programming model with faster data sharing
Good for complex multi-stage applications
– Iterative graph algorithms, machine learning
Interactive queries
Spark
Spark Cores: RDDs, transformations, and actions
Spark Streaming (real-time): DStreams, streams of RDDs
Spark SQL: SchemaRDDs, DataFrames
MLlib (machine learning): RDD-based matrices
GraphX (graph)
SparkR
Spark Drivers and Workers
Drivers
Client
– With SparkContext
• Creates RDDs
Workers
Spark executors
Run on cluster nodes
– Production
Run in local threads
– Development
RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
– That can be cached in memory
RDD, DStream, SchemaRDD, PairRDD
Immutable
Lineage
– History of the objects
– Automatically and efficiently recomputes lost data
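The immutability and lineage points above can be sketched with a toy class in plain Python. This is an analogy under stated assumptions, not the Spark API: `LineageDataset`, `compute`, and `lose_cache` are hypothetical names, and the "lost data" is simulated by clearing an in-memory cache rather than by a real worker failure.

```python
class LineageDataset:
    """Toy stand-in for an RDD: immutable, and remembers how it was derived."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # original input, never modified
        self._steps = tuple(steps)    # lineage: functions applied so far, in order
        self._cache = None            # optional in-memory copy of the result

    def map(self, fn):
        # Returns a new dataset; the current one is left untouched.
        return LineageDataset(self._source, self._steps + (fn,))

    def compute(self):
        # Build the result from the source by replaying the lineage, then cache it.
        if self._cache is None:
            data = list(self._source)
            for fn in self._steps:
                data = [fn(x) for x in data]
            self._cache = data
        return list(self._cache)

    def lose_cache(self):
        # Simulates losing the in-memory data, for example when a worker fails.
        self._cache = None


base = LineageDataset([1, 2, 3])
scaled = base.map(lambda x: x * 10).map(lambda x: x + 1)
first = scaled.compute()
scaled.lose_cache()
recomputed = scaled.compute()  # rebuilt by replaying the recorded steps
```

Because the dataset keeps the recipe rather than only the result, nothing has to be copied to disk in advance to survive the loss.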
RDD Operations
Transformations
Define new RDDs from the current RDD
– Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), saveAsTextFile()
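The difference between lazy transformations and actions can be shown with a small plain-Python model. `LazyDataset` is a hypothetical class written for this sketch and only borrows the method names from the slide; it is not Spark code. The `calls` list exists only to show when the mapped function actually runs.

```python
from itertools import islice


class LazyDataset:
    """Toy model of lazy evaluation; this is not the Spark API."""

    def __init__(self, make_iter):
        self._make_iter = make_iter  # zero-argument function returning an iterator

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: describe a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._make_iter()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._make_iter() if pred(x)))

    # Actions: run the whole chain and return a value.
    def collect(self):
        return list(self._make_iter())

    def count(self):
        return sum(1 for _ in self._make_iter())

    def take(self, n):
        return list(islice(self._make_iter(), n))


calls = []  # records every element the mapped function is applied to


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset.from_list(range(1, 6))
odd_squares = numbers.map(square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)  # still 0: only a plan exists so far
result = odd_squares.collect()    # the action triggers the computation
total = odd_squares.count()       # a second action replays the chain
```

In this toy model every action recomputes the chain from the start, which is the cost that caching an intermediate dataset in memory is meant to avoid.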
Programming in Spark
Scala
Functional programming
– Functions are the fundamental unit of programming
• Inputs and outputs can be functions
No side effects
– No state
Python
Legacy code, large libraries
Java
Spark
Spark SQL
Turning an RDD into a relation
Querying using SQL
Spark Streaming
DStream
– RDDs in streaming
– Windows
• To select a range of a DStream from streaming data
MLlib
Sparse vector support, decision trees, linear/logistic regression, SVD and PCA
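The windowing idea under Spark Streaming can be sketched in plain Python as a sliding window over micro-batches. This is a simplified illustration, not Spark Streaming code: `windowed_sums` and its parameters are hypothetical, window length and slide interval are counted in batches rather than seconds, and results are only emitted once a full window is available.

```python
from collections import deque


def windowed_sums(batches, window_length, slide_interval):
    """Sum the values of each sliding window of micro-batches.

    window_length and slide_interval are counted in batches in this sketch.
    """
    window = deque(maxlen=window_length)  # keeps only the newest batches
    results = []
    for index, batch in enumerate(batches, start=1):
        window.append(batch)
        # Emit every slide_interval batches, once the window is full.
        if len(window) == window_length and index % slide_interval == 0:
            results.append(sum(sum(b) for b in window))
    return results


# Hypothetical stream: each inner list is one micro-batch of readings.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]
every_batch = windowed_sums(batches, window_length=3, slide_interval=1)
every_second = windowed_sums(batches, window_length=3, slide_interval=2)
```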
Spark
Hydrogen gas power plant Spark model
o Separating the label column.
o Creating the RDD.
o Splitting the data into training and test sets.
o Training the model using the decision forest regression algorithm.
o Evaluating the result.
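The steps listed above can be sketched end to end in plain Python. This is not the authors' Spark code, which is not shown in the slides. Everything here is an assumption for illustration: the rows are synthetic, the column meanings are invented, a plain list stands in for the RDD, and the "forest" is a deliberately small one made of depth-1 regression trees trained on bootstrap samples.

```python
import random
from statistics import mean


def split_label(rows, label_index):
    """Separate the label column from the feature columns."""
    features = [row[:label_index] + row[label_index + 1:] for row in rows]
    labels = [row[label_index] for row in rows]
    return features, labels


def train_test_split(features, labels, test_fraction, rng):
    """Shuffle the row indices and split into training and test sets."""
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    cut = int(len(indices) * (1 - test_fraction))
    train = [(features[i], labels[i]) for i in indices[:cut]]
    test = [(features[i], labels[i]) for i in indices[cut:]]
    return train, test


def fit_stump(sample):
    """Fit a depth-1 regression tree: the single split with least squared error."""
    best = None
    for f in range(len(sample[0][0])):
        for threshold in sorted({x[f] for x, _ in sample}):
            left = [y for x, y in sample if x[f] <= threshold]
            right = [y for x, y in sample if x[f] > threshold]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            error = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or error < best[0]:
                best = (error, f, threshold, lm, rm)
    if best is None:  # no usable split: always predict the sample mean
        overall = mean(y for _, y in sample)
        return lambda x: overall
    _, f, threshold, lm, rm = best
    return lambda x: lm if x[f] <= threshold else rm


def fit_forest(train, n_trees, rng):
    """Train each tree on a bootstrap sample of the training set."""
    return [fit_stump([rng.choice(train) for _ in train]) for _ in range(n_trees)]


def predict(forest, x):
    """Regression forest output: the mean of the individual tree predictions."""
    return mean(tree(x) for tree in forest)


def rmse(forest, test):
    """Root mean squared error on the held-out test set."""
    return (sum((predict(forest, x) - y) ** 2 for x, y in test) / len(test)) ** 0.5


# Synthetic rows (two made-up features, then the label); not the facility's data.
rows = [(float(t), 20.0 + (t % 5), 5.0 + 0.6 * t) for t in range(60)]
rng = random.Random(0)
features, labels = split_label(rows, label_index=2)
train, test = train_test_split(features, labels, test_fraction=0.25, rng=rng)
forest = fit_forest(train, n_trees=25, rng=rng)
error = rmse(forest, test)
```

A real run on Spark would hand the training and evaluation steps to a library implementation with deeper trees; the sketch only shows how the five steps fit together.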
Hydrogen Gas Power Plant Prediction Model
The Cal State L.A. Hydrogen Research
and Fueling Facility (H2 Station) was
formally opened on May 7, 2014.
The station is capable of producing
hydrogen onsite from renewable energy
sources, using the process known as
electrolysis.
Cal State L.A. Hydrogen Research and
Fueling Facility became the first station
in the nation to sell hydrogen fuel by the
kilogram to the public.
Workflow
Model
Results and observations
o According to our research, we are able to predict vehicle pressure (the pressure of hydrogen gas within the vehicle hydrogen storage system) using our model.
o The algorithm used is decision forest regression.
o Decision forests are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
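The two aggregation rules in the last bullet can be written in a few lines of plain Python. The function names and the sample votes and predictions are made up for illustration.

```python
from statistics import mean, mode


def forest_classify(tree_votes):
    """Classification: return the most common class among the trees' votes."""
    return mode(tree_votes)


def forest_regress(tree_predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of three trees for one input.
votes = ["high", "low", "high"]
predictions = [34.0, 36.0, 35.0]
predicted_class = forest_classify(votes)
predicted_value = forest_regress(predictions)
```

For a regression target such as vehicle pressure, only the mean rule applies.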
State of charge (SOC):
Ratio of the hydrogen density within the vehicle storage system to the full-fill density. SOC is expressed as a percentage and is computed from the gas density:
SOC (%) = 100 × (hydrogen density in the vehicle storage system) / (full-fill density)
Our model predicts vehicle pressure, which in turn could be used to determine the state of charge.
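The SOC definition above translates directly into a short function. This is a minimal sketch: the function name and the numbers are illustrative assumptions, and both densities must be given in the same units. Turning a predicted vehicle pressure into a density also needs the gas temperature and an equation of state for hydrogen, which the slides do not give and which is not attempted here.

```python
def state_of_charge(gas_density, full_fill_density):
    """Return SOC in percent: storage-system density over full-fill density."""
    if full_fill_density <= 0:
        raise ValueError("full_fill_density must be positive")
    return 100.0 * (gas_density / full_fill_density)


# Illustrative numbers only, not measurements from the station.
soc = state_of_charge(gas_density=20.0, full_fill_density=40.0)
```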
Questions?
