Project report: catch data prediction
By Mrinal Yadav (IITKGP)
Under the guidance of
Prof. Ekaterina Kim (NTNU)
In collaboration with
Bjornar Brende Smestad (NTNU)
Project - Catch data prediction
In this project we analyze the past 20 years of catch data in order to
predict catch, so that a new vessel with different parameters can be
told when, where and how to fish to get the maximum catch possible.
First we preprocess the data, since it contains many discrepancies:
NaNs, missing values, wrong formats and abnormal data (values that
are physically impossible).
Once the data file for each individual year has been handled, we combine
the data for all 20 years for further regression modelling, and the
analysis is done on this combined data.
Preprocessing the individual files (e.g. 2001)
• Read the data (2001) - initial shape: (1213767, 133)
• The data has to be read so that certain code formats are preserved: for example the species code
(01220), when read as a number, gets converted to 1220 and the leading zero is lost. So these columns
were read as strings (data-2001). Only certain important columns were kept for analysis; the image
shows these columns and their datatypes.
As we can see, four columns still need their formats
modified: length, catch date, latitude/longitude and
product weight. We convert length, latitude/longitude
and product weight to float and catch date to
date-time. A minimal pandas sketch of this step is
shown below.
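The sketch below assumes hypothetical English column names (length, catch_date, latitude, longitude, product_weight) and a placeholder file name; the real files use different headers.

```python
import pandas as pd

# Read everything as strings first, so leading zeros in code columns
# (e.g. species code "01220") are not silently turned into integers.
df = pd.read_csv("catch_2001.csv", dtype=str)

# Convert only the columns that should be numeric or datetime.
for col in ["length", "latitude", "longitude", "product_weight"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df["catch_date"] = pd.to_datetime(df["catch_date"], errors="coerce")
```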
The NaN values:
• Many data points were NaN, meaning no information was recorded for that field. For the
important columns, the distribution of NaNs can be seen in the following image.
Certain columns had no NaN values, but the ones that
did showed a pattern. Four columns have the same
number of NaNs, and when we removed them we observed
that they belonged to the same row indexes, so removing
those entries handled all four columns at once (remove
entries where vessel id = NaN).
For latitude and longitude we remove the entries where
longitude = NaN, and the missing latitudes are handled
automatically because they occur in the same rows.
For the tonnage column we have to use an imputation
strategy instead, for two reasons:
1. There is no such pattern visible.
2. We cannot remove such a large amount of data; it
would result in information loss.
A sketch of this NaN handling follows below.
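This sketch again uses placeholder column names (vessel_id, longitude).

```python
# The four columns with the same NaN count are missing on the same rows,
# so dropping rows without a vessel id clears all four at once.
df = df.dropna(subset=["vessel_id"])

# Latitude and longitude are missing together, so dropping on longitude
# removes the missing latitudes as well.
df = df.dropna(subset=["longitude"])

# Tonnage NaNs are deliberately kept here; they are imputed later from length.
```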
Dealing with zero values: (problem in data)
Dealing with zero values in the tonnage column: (problem in data, but solvable)
• We run some checks on the tonnage column, such as whether any value is zero or negative (which is
physically impossible). No value is negative, but there are certain entries where the tonnage is zero.
• There are two tonnage columns in each year's data – tonnage 1969 and tonnage other. Tonnage 1969
has about 90 percent of its entries as NaN, so there was no point keeping it, whereas tonnage other
has only a few NaNs in comparison, so we keep it. All our analysis is based on this column –
tonnage other.
• For the entries in this column that are zero (practically impossible), we replace the value with the
entry at the corresponding index in the tonnage 1969 column.
Dealing with zero entries in the power column: (problem in data – unsolvable)
• There are also entries where the engine power column is zero. Since there is nothing to replace them
with and the number of such entries is very small (0-10), we remove them.
Dealing with zero entries in the product weight column: (problem in data – unsolvable)
• There are also entries with product weight equal to zero. Since we cannot do much about them, we
remove these entries. A sketch of this zero-value handling follows below.
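A sketch of this step, assuming the tonnage and power columns have already been converted to numeric and using placeholder names (tonnage_other, tonnage_1969, engine_power, product_weight):

```python
# A tonnage of zero is not physically possible; for those rows fall back to
# the value stored in the (mostly empty) tonnage_1969 column.
zero_tonnage = df["tonnage_other"] == 0
df.loc[zero_tonnage, "tonnage_other"] = df.loc[zero_tonnage, "tonnage_1969"]

# Zero engine power or zero product weight cannot be repaired and the
# affected rows are very few, so they are simply dropped.
df = df[(df["engine_power"] > 0) & (df["product_weight"] > 0)]
```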
Relationship of technical parameters:
• Based on the correlation coefficients we see that length, power and tonnage (the technical
parameters) are highly correlated with each other.
Since the correlation between all three is high and roughly the same, we could use either length or power
to impute the tonnage values.
Based on the article "Relationship between Gross Tonnage and Overall Length for vessels on the ICCAT Record:
Implications for Unique Vessel Identifiers" (researchgate.net),
we decide to use length to impute the tonnage values.
General overview of length, power and tonnage relationship for year - 2001
Solving abnormalities
• Going through the years we see the same general trend being followed, but we also see some abnormalities.
The length vs tonnage relationship is logarithmic.
All years generally follow the same trend, but 2014, 2015 and 2016 deviate somewhat
because of a few vessels with abnormally large tonnage values (greater than 8000, and that for small
vessels in the 10-20 m range).
The general trend is shown below:
Cause of abnormalities:
For the 2014 data this is caused by 21 entries belonging to the same vessel:
Vessel name: CHRISTINA, length given: 9.65 m, tonnage given: 9300.0
For the 2015 data this is caused by 70 entries belonging to the same vessel:
Vessel name: CHRISTINA, length given: 9.65 m, tonnage given: 9300.0
For the 2016 data this is caused by 35 entries belonging to the same vessel:
Vessel name: GLESEN, length given: 9.65 m, tonnage given: 9300.0
Because of these ships the plots of power vs tonnage and tonnage_length_ratio vs power were also affected for
the years 2014, 2015 and 2016.
• General trend of power vs tonnage:
Further analysis:
In the 2014 data there are a total of 282 entries for vessels named 'CHRISTINA'.
But the length is not the same across all 282 entries: there are 4 distinct length values,
of which 3 have a tonnage given and the remaining 1 is NaN.
• For 7.2 m - tonnage = 3.0
• For 8.26 m - tonnage = 11.0
• For 9.65 m - tonnage = 9300.0 (abnormal, 21 entries)
• For 10.65 m - tonnage = NaN
In the 2015 data there are a total of 297 entries for vessels named 'CHRISTINA'.
Again the length is not the same across all 297 entries: there are 4 distinct length values,
of which 3 have a tonnage given and the remaining 1 is NaN.
• For 7.2 m - tonnage = 3.0
• For 8.26 m - tonnage = 11.0
• For 9.65 m - tonnage = 9300.0 (abnormal, 70 entries)
• For 10.65 m - tonnage = NaN
In the 2016 data there are a total of 35 entries for the vessel named 'GLESEN'.
All 35 entries have the same length (9.65 m) and tonnage = 9300.0.
Solution: replace 9300.0 by 9.3.
After replacement: 2015 and 2016 also follow the same general trend.
One more abnormality in the 2011 data:
As we can see, there are certain points that
are affecting the trend.
This is because of 4 entries belonging to the
same vessel:
Vessel name: VIOLETA
Length given: 6.1 m
Tonnage given: 800.0
In the complete 2011 data we have only these 4
entries for the vessel named VIOLETA, so we
don't have much information about it.
The best option for these entries is therefore
simply to remove them.
Imputation of tonnage:
At this point only NaNs remain in the tonnage column, and we impute them with the
help of length.
The general relationship between length and tonnage is logarithmic, as is evident
from the plots for all the years. So we assume a log-linear relationship
between them and try to find the best-fitting curve.
The relation that we use is:
ln(tonnage) = a*ln(length) + b
For each year we find the coefficients (a, b) that best fit that year's data, and we
measure the goodness of fit with the R-squared score.
R-squared is a statistical measure of how close the data are to the fitted
regression line. It is also known as the coefficient of determination.
Curve fit results: ln(tonnage) = a*ln(length) + b
We get different coefficients for each year, but they stay within narrow ranges.
From the graphs we can read off these ranges directly and also see that the
R-squared scores are high, indicating good fits.
So we use these coefficients in the relation above to impute the tonnage
values wherever they are NaN; a sketch of the fit and imputation follows below.
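A sketch of the per-year curve fit and imputation using numpy; the fitting routine actually used in the project may differ, and column names are placeholders.

```python
import numpy as np

def fit_log_log(year_df):
    """Fit ln(tonnage) = a * ln(length) + b on rows where tonnage is known."""
    known = year_df.dropna(subset=["tonnage_other", "length"])
    x = np.log(known["length"].to_numpy())
    y = np.log(known["tonnage_other"].to_numpy())
    a, b = np.polyfit(x, y, deg=1)      # least-squares line in log space
    r2 = np.corrcoef(x, y)[0, 1] ** 2   # R-squared of the fit
    return a, b, r2

a, b, r2 = fit_log_log(df)

# Impute missing tonnage from length with the fitted coefficients.
missing = df["tonnage_other"].isna()
df.loc[missing, "tonnage_other"] = np.exp(a * np.log(df.loc[missing, "length"]) + b)
```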
General plots after imputation: All years had the same trend as follows
Combining data of all years:
• Every processing and cleaning step described so far
has been applied to each year separately, and the
resulting files have been combined to form the
data we work with from here on.
• We don't model all of this data at once; we
first build our model for one segment, i.e. a
particular length group and a particular
species (Torsk, i.e. cod).
• Dividing vessels into length groups is important
because length is the one factor the vessel owner
cannot change once the vessel is built; it can go
to different places at different times with different
gears, but the length cannot be changed.
• So we divide the vessels into 5 groups, as listed
below (and binned in the sketch that follows):
• l < 9.9m - very_small
• 10m-14.9m - small
• 15m-20.9m - medium
• 21m-27.9m - large
• l > 28m - very_large
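A sketch of this binning with pandas; the exact bin edges are assumptions that reconcile the listed ranges (e.g. a 10.0 m vessel falls into the small group).

```python
import numpy as np
import pandas as pd

bins = [0, 10, 15, 21, 28, np.inf]
labels = ["very_small", "small", "medium", "large", "very_large"]

# right=False gives half-open bins [0, 10), [10, 15), ..., so a 10.0 m
# vessel falls into "small", matching the ranges listed above.
df["length_group"] = pd.cut(df["length"], bins=bins, labels=labels, right=False)
```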
Analysis before the regression model:
• Before creating the regression model we perform some analysis to get insights into the data.
• At this point we are working with the complete data of all length groups, but for a single species - Torsk.
• First we find the total catch within each length group, so we can see which length group catches
the most.
As is evident from the plot, the groups catching the
largest amounts of Torsk are very large and small.
The remaining 3 groups catch considerably less.
Monthly catch:
• How much you can catch also depends on when you fish: a high-catch season can
yield a good profit, whereas a low-catch season can result in a loss.
• So we compute the monthly catch of the species Torsk over the last 20 years (a sketch follows below).
As we can see, the catch was very
high during the first 4 months and
much lower during the remaining 8 months.
The first 4 months can be called the high
catch season and the next 8 months
the low catch season.
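A sketch of the monthly aggregation, assuming the dataframe has already been filtered to Torsk and catch_date is a datetime column.

```python
# Total Torsk catch per calendar month, summed over all 20 years.
monthly_catch = (
    df.groupby(df["catch_date"].dt.month)["product_weight"]
      .sum()
      .sort_index()
)
monthly_catch.plot(kind="bar", xlabel="Month", ylabel="Total catch")
```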
Tools/gears:
• It is also important to use the right gear, as some gears yield a higher catch than others.
• So we find which gears caught the largest amounts.
As we can see, certain gears account for most of the catch.
These were gear codes 22, 51 and 61:
22 – Settegarn – set gillnet
51 – Bunntrål – bottom trawl
61 – Snurrevad – Danish seine
Group data by length and gear codes:
• We group the data by length group and gear code to see, for a particular length group, which
gears have proven most effective (a grouping sketch follows below).
Small vessels – 22
Large vessels – 61 (best), 22
Very small vessels – 33 (best), 22
Very large vessels – 51 (best), 35, 61
Medium vessels – 61, 22
These are the length groups and
their highest-catch gear codes.
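A sketch of this two-level grouping, assuming a gear_code column and the length_group column created earlier.

```python
# Total catch per (length group, gear code) pair, then the top 3 gears
# within each length group.
by_group_gear = (
    df.groupby(["length_group", "gear_code"], observed=True)["product_weight"]
      .sum()
)
top_gears = by_group_gear.groupby(level="length_group", observed=True).nlargest(3)
```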
Handling noise:
• Filter on gear code frequency (keep codes occurring >1000 times).
• Do the same for product condition code frequency (>1000).
• Filter based on catch range.
• Create an extra haversine-distance feature so that latitude/longitude information is captured in a single column (a sketch follows below).
We also divide the data (and model) into north and south parts
based on latitude.
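A sketch of the haversine feature and the north/south split. The reference point and the latitude threshold are placeholders; the report does not state the exact values used.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# REF_LAT / REF_LON are placeholders; the report does not say which
# reference point the distance was measured from.
REF_LAT, REF_LON = 60.0, 5.0
df["haversine_dist"] = haversine_km(df["latitude"], df["longitude"], REF_LAT, REF_LON)

# North/south split on latitude (the threshold here is an assumption).
df["region"] = np.where(df["latitude"] >= 65.0, "north", "south")
```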
Model: very small length group (l < 10m) + Torsk species
• Overview of the dataframe:
The length column only has values less than 10m, and the species code column only contains the codes related to
the Torsk species, which are 1022, 102201, 102202 and 102204.
Fishing locations:
• We visualize the fishing locations for our data entries.
The kyst code indicates whether the location is within 20 nautical miles of the coast:
Kyst code = 8: within 20 nautical miles of the coast (blue dots)
Kyst code = 0: outside 20 nautical miles of the coast (red dots)
Analyzing the target column: product weight (catch)
• The target column is very dispersed, which will create problems in the
learning process of the model.
• We can see this in the plot of length vs catch and in the
histogram of the catch.
For just a small length range of 2-10m,
the catch varies between 0 and 3500,
which is far from normalized.
So we try to normalize the catch
by applying a logarithmic
transformation and dividing by
length, so that the length
factor appears in both the
independent variable (X) and the
dependent variable (y):
Catch: log(catch)/length
After transformation of the target column:
• We applied the logarithmic transformation to the target and divided
by vessel length.
• Catch → log(catch)/length
• After the transformation, as we can see, the
normalized data has an
interpretable range.
Some ML algorithms are
sensitive to the scale of the data,
in particular the ones that use
gradient descent as their
optimization technique.
LightGBM is one such
algorithm, so we need to
feature-scale the data (a sketch of the transform follows below).
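A one-line sketch of this target transformation (zero catches were removed earlier, so the logarithm is always defined):

```python
import numpy as np

# Transformed target: log of the catch divided by vessel length.
df["target"] = np.log(df["product_weight"]) / df["length"]
```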
Train, validation and test split:
• Now the data is ready to be divided into train, validation and test sets.
We split it 3:1:1, i.e. 60/20/20 percent into train, validation and test respectively.
• While splitting we stratify on the species code column, because we want a
proportionate distribution of species in each set (one-hot encoding of the species code is required).
A sketch of the split follows below.
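A sketch of the 60/20/20 stratified split with scikit-learn, assuming a species_code column.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target", "product_weight"])
y = df["target"]

# 20% held out for the test set, stratified on species code ...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=X["species_code"], random_state=42
)
# ... then the remaining 80% is split 60/20 into train and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25,
    stratify=X_trainval["species_code"], random_state=42
)
```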
LightGBM regression model:
• LightGBM is a gradient boosting algorithm that uses gradient descent as its optimization
technique. The default objective is regression, optimized on squared error.
• For different business needs we can create our own custom objective
and evaluation functions. The objective function takes two arguments, the
targets and the predictions, and returns the gradient (first derivative) and hessian (second
derivative).
• The evaluation function takes the same two arguments but returns the loss.
• The objective function defines the training loss, whereas the evaluation function defines the
validation loss.
• For our purpose of predicting catch, underpredicting is acceptable but overpredicting is
not, because it can cause a loss for the fishing organization (if we predict more than is
actually caught, the organization can suffer a loss). So we need to assign an asymmetric
penalty, and for this purpose we use custom objective and evaluation functions.
Custom objective and evaluation functions:
• The penalty factor we used is p = 1.05; it appears to be the point of convergence, with no further
significant change in MAE beyond this value.
• The penalty is included in the gradient and the hessian (custom objective) as well as in the
evaluation function, which returns the mean squared loss.
This is the plot of the gradient of our custom objective
function for regression (penalty included) – asymmetric
loss – versus the default objective function for regression –
symmetric loss.
Gradient – first-order derivative of the squared error.
This is the plot of the hessian of our custom objective
function for regression (penalty included) – asymmetric
loss – versus the default objective function for regression –
symmetric loss.
Hessian – second-order derivative of the squared error.
More about the custom functions:
By default the mean squared loss is the function that is
optimized.
Default objective function – regression (training)
Default evaluation function/metric – l2 loss
(validation)
Custom objective function –
custom_asymmetric_objective (training)
Custom evaluation function/metric –
custom_asymmetric_eval (validation)
The penalty is applied to overestimated predictions (p = 1.05).
As said earlier, the objective function defines the training loss
whereas the evaluation function defines the validation loss.
A sketch of both functions follows below.
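A sketch of how such functions are commonly written for LightGBM's scikit-learn API, using the function names from the report; the exact formulas in the project may differ, but the idea is that the penalty P = 1.05 scales the gradient, hessian and evaluation loss for overpredictions.

```python
import numpy as np

P = 1.05  # extra penalty applied to overpredictions

def custom_asymmetric_objective(y_true, y_pred):
    """Gradient and hessian of the squared error, scaled by P when overpredicting."""
    residual = (y_true - y_pred).astype("float")
    grad = np.where(residual < 0, -2.0 * P * residual, -2.0 * residual)
    hess = np.where(residual < 0, 2.0 * P, 2.0)
    return grad, hess

def custom_asymmetric_eval(y_true, y_pred):
    """Mean squared error with the same asymmetric penalty (validation metric)."""
    residual = (y_true - y_pred).astype("float")
    loss = np.where(residual < 0, P * residual ** 2, residual ** 2)
    return "custom_asymmetric_eval", float(np.mean(loss)), False
```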
GBM instances (models):
• We created 6 instances of the LightGBM model for comparison.
Early stopping to avoid overfitting:
Early stopping is an approach to training complex machine learning models that avoids overfitting.
It works by monitoring the performance of the model on a separate validation dataset and
stopping the training procedure once the performance on the validation dataset has not improved for a fixed
number of training iterations (here 10).
It avoids overfitting by attempting to automatically select the point where performance on the validation
dataset starts to decrease while performance on the training dataset continues to improve, i.e. where the
model starts to overfit.
The performance measure is the loss function being optimized to train the model (the default or custom objective).
A common rule of thumb is to set it to 10% of num_iterations, which is 100 by default, so the number of early
stopping rounds = 10.
With early stopping activated, the booster only runs up to the iteration at which the early-stopping
criterion is met. A training sketch follows below.
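A sketch of training the instance the report calls gbm6 (custom objective + custom evaluation + early stopping) with LightGBM's scikit-learn API, assuming the feature matrix is fully numeric (species code one-hot encoded).

```python
import lightgbm as lgb

# gbm6 in the report: custom objective + custom evaluation + early stopping.
gbm6 = lgb.LGBMRegressor(objective=custom_asymmetric_objective, n_estimators=100)
gbm6.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric=custom_asymmetric_eval,
    callbacks=[lgb.early_stopping(stopping_rounds=10)],  # 10% of 100 iterations
)
preds = gbm6.predict(X_test)
```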
Results: very small vessels + Torsk (south)
The model giving the lowest MSE on the test set is gbm6: custom objective + custom evaluation + early stopping.
MAE on the test set for gbm6: 0.0736 (the lowest of all 6 instances).
So gbm6 is our model of choice for further analysis.
Plots of predictions vs targets:
From the plots we can say that there is not much difference
between the last three models – gbm4, gbm5 and gbm6 –
which was also visible in the numerical results.
But since the score was best for gbm6, it is our model of
choice.
Further analysis:
The MSE and MAE above are computed on a target that was
transformed, so in order to get actual errors with respect to
product weight we apply the inverse transformation to the
predicted results.
Before modelling, the target was transformed as
target = log(product_weight)/length.
We did this to normalize the target (feature scaling required for
LightGBM for good predictions).
Now we just apply the inverse transform to the predictions:
prediction = e^(prediction * length)
After this we calculate the MAE, which equals 48.37 kg (a sketch follows below).
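A sketch of the inverse transform and the MAE in kilograms, reusing the test predictions from the training sketch above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Undo target = log(weight) / length  =>  weight = exp(target * length)
pred_weight = np.exp(preds * X_test["length"])
true_weight = np.exp(y_test * X_test["length"])

mae_kg = mean_absolute_error(true_weight, pred_weight)  # about 48 kg in the report
```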
Boxplot for each month:
Box plot for each gear:
Location - error analysis:
There are a total of 216 different locations in the test set (identified by the
fangstfelt code).
1. Grouped by fangstfelt code and summed the absolute
error to get the total absolute error for each
location.
2. Calculated the frequency of each location in the
test set.
3. Calculated the average absolute error for each
location.
4. Set an error threshold of 100 kg (a grouping sketch follows below).
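A sketch of this per-location error analysis, assuming a fangstfelt_code column and the inverse-transformed errors from the previous sketch.

```python
import numpy as np

test_df = X_test.copy()
test_df["abs_error"] = np.abs(true_weight - pred_weight)

per_location = test_df.groupby("fangstfelt_code")["abs_error"].agg(
    total_abs_error="sum",   # 1. summed absolute error per location
    frequency="size",        # 2. how often the location appears in the test set
    mean_abs_error="mean",   # 3. average absolute error per location
)

# 4. flag locations whose average error exceeds the 100 kg threshold
problem_locations = per_location[per_location["mean_abs_error"] > 100]
```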
There are no overlaps; each dot represents an individual location.