This document discusses the labs for an AI course. It provides:
1. An overview of the labs covered, including DNN, CNN, and reinforcement learning labs.
2. Guidelines for lab reports including getting reasonable results, tuning hyperparameters, and changing datasets.
3. Instructions for running Lab 1 (Titanic survival prediction) and Lab 3 (curve fitting), including code snippets and directories to navigate.
AI optimizing HPC simulations (presentation from the 6th EULAG Workshop) by byteLAKE
See our presentation from the 6th International EULAG Users Workshop. We talked about taking HPC to "Industry 4.0" by implementing smart techniques to optimize codes in terms of performance and energy consumption. It explains how Machine Learning can dynamically optimize HPC simulations and presents byteLAKE's software autotuning solution.
Find out more about byteLAKE at: www.byteLAKE.com
Clustering has been one of the most widely studied topics in data mining, and k-means has been one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes over the entire dataset.
In this paper, we present a new algorithm, called Fast and Exact K-means Clustering (FEKM), which typically requires only one or a small number of passes over the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centers, and then takes one or more passes over the entire dataset to adjust these cluster centers. We provide theoretical analysis to show that the cluster centers thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show a speedup between a factor of 2 and 4.5 compared to k-means.
This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analyzing data that is distributed across loosely coupled machines. Unlike previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data: downloading all the data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
Artificial Neural Networks for Storm Surge Prediction in North Carolina by Anton Bezuglov
A feedforward artificial neural network (FF ANN) for storm surge prediction in North Carolina. Presentation at the Coastal Resilience Center by Anton Bezuglov, Ph.D. Uses TensorFlow and Python, with links to the code on GitHub.
Short-term forecasting is usually classified into two groups: the first approach uses a physical model to compute the downscaling, whereas the second relies on statistical learning. We propose a new strategy based on both approaches: a micro-scale CFD model coupled with an artificial neural network correction. The optimal neural network is selected through a genetic algorithm. This solution is tested on a real case, which leads to a relative RMSE improvement of 17%.
Scaling Deep Learning Algorithms on Extreme Scale Architectures by inside-BigData.com
In this video from the MVAPICH User Group, Abhinav Vishnu from PNNL presents: Scaling Deep Learning Algorithms on Extreme Scale Architectures.
"Deep Learning (DL) is ubiquitous. Yet leveraging distributed memory systems for DL algorithms is incredibly hard. In this talk, we will present approaches to bridge this critical gap. We will start by scaling DL algorithms on large scale systems such as leadership class facilities (LCFs). Specifically, we will: 1) present our TensorFlow and Keras runtime extensions which require negligible changes in user-code for scaling DL implementations, 2) present communication-reducing/avoiding techniques for scaling DL implementations, 3) present approaches on fault tolerant DL implementations, and 4) present research on semi-automatic pruning of DNN topologies. Our results will include validation on several US supercomputer sites such as Berkeley's NERSC, Oak Ridge Leadership Class Facility, and PNNL Institutional Computing. We will provide pointers and discussion on the general availability of our research under the umbrella of Machine Learning Toolkit on Extreme Scale (MaTEx) available at http://github.com/matex-org/matex."
Watch the video: https://wp.me/p3RLHQ-hnZ
First project in DNN
The steps for the first project are as follows:
Load Data.
Define Keras Model.
Compile Keras Model.
Fit Keras Model.
Evaluate Keras Model.
Make Predictions.
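As a minimal sketch, the six steps above map onto Keras calls as follows. The data here is synthetic (a placeholder for a real dataset such as the lab's CSV files), and the layer sizes are illustrative only:

```python
import numpy as np
from tensorflow import keras

# 1. Load data (synthetic placeholder: 200 samples, 8 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)).astype("float32")
y = (X[:, 0] > 0).astype("float32")          # toy binary labels

# 2. Define the Keras model
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# 3. Compile the Keras model
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# 4. Fit the Keras model
model.fit(X, y, epochs=5, batch_size=10, verbose=0)

# 5. Evaluate the Keras model
loss, acc = model.evaluate(X, y, verbose=0)

# 6. Make predictions (sigmoid outputs are probabilities in [0, 1])
probs = model.predict(X[:5], verbose=0)
```

Swapping in a real dataset only changes step 1 and the input shape in step 2; the remaining calls stay the same.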
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Development of Software for scalable anomaly detection modeling of time-series data using Apache Spark.
We have been researching and developing methods and software that analyze time-series data from sensors monitoring various kinds of equipment in order to detect anomalies. The software introduced here uses batch processing to learn and model high-dimensional time-series data obtained from multiple sensors via linear LASSO regression, and identifies anomalous periods. However, growing training time and memory usage became a problem, so we parallelized and distributed the computation using Spark. Spark ships with a general-purpose machine learning library, MLlib, but given the peculiarities of our algorithm we developed a new implementation based on the existing one. This talk reports on the design choices and performance measurements from this development.
A Cooperative Coevolutionary Approach to Maximise Surveillance Coverage of UA... by Daniel H. Stolfi
This paper presents the parameterisation and optimisation of the CACOC (Chaotic Ant Colony Optimisation for Coverage) mobility model used by an Unmanned Aerial Vehicle (UAV) swarm to perform surveillance tasks. CACOC uses chaotic solutions of a dynamical system and pheromones for optimising area coverage. Consequently, several parameters of CACOC are to be optimised with the aim of improving its coverage performance. We propose a Genetic Algorithm (GA) and two Cooperative Coevolutionary Genetic Algorithms (CCGA) to tackle this problem. After testing our proposals on four case studies, we performed a comparative analysis and concluded that the cooperative approaches allow a better exploration of the search space by optimising each UAV's parameters independently.
https://doi.org/10.1109/CCNC46108.2020.9045643
Story of static code analyzer development by Andrey Karpov
Greetings from the past: simple tools and bad standards. Regular expressions don’t work. What is inside modern static code analyzers on the PVS-Studio example. About machine learning. Learning vs Data Flow analysis.
Medical Image Segmentation Using Hidden Markov Random Field: A Distributed Ap... by EL-Hachemi Guerrout
Medical imaging applications produce large sets of similar images. The huge amount of data makes manual analysis and interpretation a fastidious task. Medical image segmentation is thus an important process in image processing, used to partition the images into different regions (e.g. gray matter, white matter, and cerebrospinal fluid). The Hidden Markov Random Field (HMRF) model and Gibbs distributions provide powerful tools for image modeling. In this paper, we use an HMRF model to perform segmentation of volumetric medical images. We have a problem with incomplete data. We seek the segmented images according to the MAP (Maximum A Posteriori) criterion. MAP estimation leads to the minimization of an energy function. This problem is computationally intractable, so optimization techniques are used to compute a solution. We evaluate the segmentation on two major factors: computation time and segmentation quality. Processing time is reduced by distributing the computation of the segmentation on a powerful and inexpensive architecture consisting of a cluster of personal computers. Parallel programming was done using the standard MPI (Message Passing Interface).
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG... by IAEME Publication
This paper presents an approach based on applying an aggregated predictor, formed by multiple versions of a multilayer neural network with a back-propagation optimization algorithm, to help the engineer obtain a list of the most appropriate well-test interpretation models for a given set of pressure/production data. The proposed method consists of three stages: (1) data decorrelation through principal component analysis, to reduce the covariance between the variables and the dimension of the input layer of the artificial neural network, (2) bootstrap replicates of the learning set, where the data is repeatedly sampled with a random split into training sets and these are used as new learning sets, and (3) automatic reservoir model identification through an aggregated predictor formed by a plurality vote when predicting a new class. This method is described in detail to ensure successful replication of results. The required training and test datasets were generated using analytical solution models. In our case, 600 samples were used: 300 for training, 100 for cross-validation, and 200 for testing. Different network structures were tested during this study to arrive at an optimum network design. We notice that the single-net methodology always brings about confusion in selecting the correct model, even though the training results for the constructed networks are close to 1. We also notice that principal component analysis is an effective strategy for reducing the number of input features, simplifying the network structure, and lowering the training time of the ANN. The results obtained show that the proposed model provides better performance when predicting new data, with a coefficient of correlation of approximately 95% compared to 80% for a previous approach; the combination of PCA and ANN is more stable and determines more accurate results with less computational complexity than was feasible previously.
Clearly, the aggregated predictor is more stable and shows fewer bad classes compared to the previous approach.
Adjusting primitives for graph: SHORT REPORT / NOTES by Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT by Subhajit Sahu
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2. 2022/9/6
Three levels for labs 1-12 reports
• (Minimum) Run the program and get a reasonable result, e.g., accuracy around 90%
• Tune the hyperparameters to get a better result
• Change the dataset and get acceptable results
(tensorflow) C:> cd Lab01_titanic_survival_prediction
(tensorflow) C:> python titanic_survival_predictation.py
(You can modify the input data yourself and check whether the results are reasonable.)
(tensorflow) C:> cd..
(tensorflow) C:> cd Lab03_plot_result
(tensorflow) C:> python plot_result.py
(You can modify the input curve yourself and see whether it can be trained successfully.)
(tensorflow) C:> cd..
Run Lab01 and Lab03
3. 2022/9/6
Lab01 Titanic Survival Prediction → PM2.5 Exceeded Prediction (Change the dataset!)
Electronics Master's Program (part-time), Year 1:
110368505 劉蘋慧
110368526 蕭銘宏
110368529 林佑軒
110368540 李品濬
Abstract
● Titanic survival prediction
In this case, we learned how to use TFLearn and TensorFlow to model the survival chances of Titanic passengers using their personal information (such as gender, age, and so on). To tackle this classic machine learning task, we build a DNN classifier.
4. 2022/9/6
Abstract
● PM2.5 exceeded prediction
Based on the Titanic survival predictor, we preprocessed and used the dataset from the Fengyuan automatic meteorological observation station to predict whether the PM2.5 concentration will exceed the standard.
Table of contents
01 Titanic Survival Prediction
a. Dataset parameters
b. Source code introduction
c. Source code modification
d. Conclusion
02 PM2.5 Exceeded Prediction
a. Introduction
b. Dataset preprocessing
c. Source code modification
d. Conclusion
5. 2022/9/6
01 Titanic Survival Prediction: Dataset parameters
Dataset with Titanic passengers' personal information:
survived  (0 = No; 1 = Yes)
pclass    Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name      Name
sex       Sex
age       Age
sibsp     Number of Siblings/Spouses Aboard
parch     Number of Parents/Children Aboard
ticket    Ticket Number
fare      Passenger Fare
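For illustration, a couple of rows in this layout can be read with pandas; the values below are invented sample entries in the dataset's format, not taken from the lab's actual file:

```python
import io
import pandas as pd

# Hypothetical sample rows in the column layout listed above
csv = io.StringIO(
    "survived,pclass,name,sex,age,sibsp,parch,ticket,fare\n"
    "1,1,Doe Mrs. Jane,female,24,0,0,PC 0000,69.3\n"
    "0,2,Doe Mr. John,male,42,0,0,111111,13.0\n"
)
df = pd.read_csv(csv)

# 'survived' is the label; the remaining columns are candidate features
labels = df["survived"]
features = df.drop(columns=["survived"])
```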
7. 2022/9/6
By default, 2 layers of neural network are used.
The softmax activation maps inputs to real numbers between 0 and 1, and guarantees that the probabilities of all categories sum to 1.
Batch size: number of samples used for one iteration of gradient descent.
Epoch: number of times the learning algorithm works through all training samples.
Changing the values of these two parameters will influence the accuracy of the prediction and the predicted survival rates of DiCaprio and Winslet.
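The softmax behaviour described above can be checked numerically (a generic sketch, not code from the lab):

```python
import numpy as np

def softmax(z):
    # subtracting max(z) avoids overflow and leaves the result unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# every entry lies strictly between 0 and 1, and the entries sum to 1
```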
8. 2022/9/6
Prepare the test data and predict the survival rates.
01 Titanic Survival Prediction: Source code modification
9. 2022/9/6
Add this line at the beginning of the code to remove nodes from the graph (or reset the entire default graph) and prevent the error below.
Add a new layer to the neural network, and change the node counts of the layers to 128, 64, and 32, respectively.
Keep the values of epoch and batch size unchanged.
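The modified architecture (hidden layers of 128, 64, and 32 nodes) can be sketched as follows. The lab itself uses TFLearn, where each hidden layer would be a tflearn.fully_connected call and the graph is cleared first with tf.reset_default_graph(); here the same shape is shown in Keras for illustration, and the 6-feature input width is an assumption based on the TFLearn quickstart's preprocessed feature set:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(6,)),                      # assumed preprocessed feature count
    keras.layers.Dense(128, activation="relu"),   # modified: 128 nodes
    keras.layers.Dense(64, activation="relu"),    # modified: 64 nodes
    keras.layers.Dense(32, activation="relu"),    # added third hidden layer: 32 nodes
    keras.layers.Dense(2, activation="softmax"),  # survived / did not survive
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```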
12. 2022/9/6
We prepared real data from the Fengyuan automatic weather station (AWS), obtained from the Central Weather Bureau (CWB), as the dataset for predicting whether the PM2.5 concentration will exceed the standard.
02 PM2.5 Exceeded Prediction: Dataset preprocessing
13. 2022/9/6
We need to preprocess the dataset so that it fits the input format of the neural network in TFLearn.
14. 2022/9/6
List all data into a one-dimensional array with the 18 parameters sorted.
Reshape the data into a 320x18 array (groups of 18).
Insert id numbers into the first column.
Add the parameter header row and export the dataset as a CSV file.
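These four steps might look like the following with NumPy and pandas; the array contents, column names, and file name are placeholders (the real readings come from the Fengyuan AWS export):

```python
import numpy as np
import pandas as pd

n_rows, n_params = 320, 18
flat = np.arange(n_rows * n_params, dtype=float)  # stand-in for the flattened readings

# Reshape the one-dimensional list into 320 rows of 18 parameters each
table = flat.reshape(n_rows, n_params)

# Insert id numbers in the first column, add the header row, export CSV
df = pd.DataFrame(table, columns=[f"param{i}" for i in range(1, n_params + 1)])
df.insert(0, "id", range(1, n_rows + 1))
df.to_csv("fengyuan_aws.csv", index=False)
```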
15. 2022/9/6
Classify and mask the values of PM2.5 as 0 or 1.
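The masking step can be a simple threshold comparison; the threshold value below is an illustrative placeholder, not the official standard used in the lab:

```python
import numpy as np

THRESHOLD = 35.0                                # placeholder exceedance standard
pm25 = np.array([12.0, 40.5, 35.0, 80.2, 7.3])  # illustrative readings

# 1 where the concentration exceeds the standard, 0 otherwise
exceeded = (pm25 > THRESHOLD).astype(int)       # → [0, 1, 0, 1, 0]
```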
02 PM2.5 Exceeded Prediction: Source code modification
16. 2022/9/6
Designate PM2.5 as the target prediction parameter.
Based on the Titanic survival predictor, we changed the input data shape to 17 and kept the number of layers and their node counts.
17 = 18 parameters - 1 (PM2.5, the prediction target)
17. 2022/9/6
Modify the value of epoch and observe the accuracy and the loss. We found that an epoch value of 15 gives the best accuracy; beyond 15, overfitting occurs. Meanwhile, we changed the batch size to 20.
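With epoch = 15 and batch size = 20, the training call can be sketched as follows on synthetic placeholder data (in the lab's TFLearn code the equivalent call would be model.fit(data, labels, n_epoch=15, batch_size=20)):

```python
import numpy as np
from tensorflow import keras

# Placeholder data: 17 features = 18 parameters minus PM2.5 (the target)
rng = np.random.default_rng(1)
X = rng.normal(size=(320, 17)).astype("float32")
y = rng.integers(0, 2, size=320).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(17,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# epoch = 15 gave the best accuracy in the slides; larger values overfit
history = model.fit(X, y, epochs=15, batch_size=20, verbose=0)
```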
Prepare the test data and predict the exceedance rates.