Automated Drought Analysis with Python and Machine Learning
THESIS SUBMITTED TO
Symbiosis Institute of Geoinformatics
FOR PARTIAL FULFILLMENT OF THE M. Sc. DEGREE
By
Gurminder Bharani
(Batch 2014 - 16)
Symbiosis Institute of Geoinformatics
Symbiosis International University
5th Floor, Atur Centre, Gokhale Cross Road,
Model Colony, Pune – 411016.
CERTIFICATE
2. Page | 2
Certified that this thesis titled ‘Automated Drought Analysis with Python and Machine
Learning’ is a bonafide work done by Mr. Gurminder Bharani, at International Water
Management Institute (IWMI), Sri Lanka and Symbiosis Institute of Geoinformatics, under our
supervision.
Supervisor External
Dr. Giriraj Amarnath
IWMI
Supervisor Internal
Dr. T. P. Singh
Director,
Symbiosis Institute of
Geoinformatics
Index
I. Acknowledgement 4
II. List of Figures 5
III. List of Tables 6
IV. Abbreviation List 7
1. Preface 8
2. Introduction 10
3. Literature Review 12
4. Study Area 24
5. Methodology 25
6. Result 37
7. Discussion 56
8. Conclusion 57
9. References 58
10. Annexure 59
Acknowledgement
The last six months working on my project have been a very productive journey.
Getting an opportunity to have a glimpse of what the research world looks and feels like
could not have been possible had it not been for Dr Giriraj Amarnath, who hired me as an intern at
IWMI. I would like to extend my heartfelt gratitude to Mr Peejush Pani, who particularly helped
in developing the tools by lending me his remarkable and constant guidance through his remote
sensing modelling expertise in the field.
The experience in this esteemed organisation has left an indelible mark on my learning.
It has been great exposure and has served as a reality check through which I plan to better
myself and polish my learning skills in the days to come.
Further, I would like to thank the faculty of Symbiosis Institute of Geoinformatics, Pune, namely
Dr T. P. Singh, Dr Navendu Chowdhury and Col B. K. Pradhan, without whom my knowledge
of GIS and its applications in the various domains would not have been clear.
I would also like to thank my computer science teacher, Mr Charudatta Ekbote; without his
teachings none of this would have been possible.
List of Figures
Figure 1: Simple implementation of Decision Tree
Figure 2: Simple implementation of Random Forest
Figure 3: Area of Interest
Figure 4: UI of Monthly Sum for SPI
Figure 5: UI of SPI Calculation from 1 month to 12 months
Figure 6: UI of SPI Calculation from 13 months to 60 months
Figure 7: UI of unpacking all the calculated SPI to daily raster images
Figure 8: Comparison of Mean Rainfall with Mean SPI
Figure 9: 2007 TRMM data compared with all types of Bias Correction methods
Figure 10: 1998 TRMM data compared with all types of Bias Correction methods
Figure 11: 2007 PERSIANN data compared with all types of Bias Correction methods
Figure 12: 1998 PERSIANN data compared with all types of Bias Correction methods
Figure 13: 1998 PERSIANN data compared with all types of Bias Correction methods
Figure 14: 2007 PERSIANN data compared with all types of Bias Correction methods
Figure 15: IWMI tools in ArcCatalog 10.3
List of Tables
Table 1: SPI Values
Table 2: Comparison between SPI calculated by WMO and the IWMI Python tool
Table 3: Comparison between Random forest with 10, 80, 25, 50 estimators and decision tree
with IDSI and SPI
Abbreviation List
MODIS Moderate Resolution Imaging Spectroradiometer
VCI Vegetation Condition Index
TCI Temperature Condition Index
SPI Standardized Precipitation Index
IMD India Meteorological Department
NDVI Normalized Difference Vegetation Index
OLI Operational Land Imager
DEM Digital Elevation Model
WMO World Meteorological Organization
IDSI Integrated Drought Severity Index
Preface
This project blends multiple indices from multiple satellites. MODIS is known for its high
temporal resolution, so two indices are derived from MODIS: VCI (Vegetation Condition Index)
and TCI (Temperature Condition Index).
The SPI (Standardized Precipitation Index) is widely accepted as the prime indicator of
meteorological drought and is derived from IMD (India Meteorological Department)
precipitation data.
Landsat data has been used specifically to obtain the fine 30-metre resolution of the final
classified product. NDVI (Normalized Difference Vegetation Index) is the indicator used to
identify pixels eligible for drought classification.
Benefits of blending the datasets
There are several benefits to blending the datasets; a few are discussed below:
Temporal resolution:
The high temporal resolution of the MODIS dataset helps in understanding the long-term
behaviour of the data. Also, by comparing values over the long term, we can determine the
severity of an event relative to past events.
Spatial resolution:
Fine spatial resolution gives a detailed picture of how the data is distributed spatially. However,
freely available fine-resolution data on average do not come with a long historical record, which
makes processes like identifying a drought pixel relative to the past a difficult task. By blending
these datasets we get the benefits of both temporal and spatial resolution to indicate drought
stress.
Determining short-term and long-term drought with 1-month to 60-month SPI
SPI based on 1 month of precipitation data over 30 years will indicate short-term meteorological
drought, because the one-month precipitation accumulation of each year is compared with the
past accumulated rainfall of the respective month.
Here, SPI based on 12 months of accumulated precipitation data will indicate locations which are
under meteorological drought over one year. Similarly, 24-month SPI will indicate locations
suffering drought over two years.
When we classify drought with machine learning using the 1-, 12-, 24- and 60-month SPI, we can
extract drought-stress pixels of varying intensity.
Introduction
Objective
1. Need for Automation
The main objective of automation is to produce rapid results. When a project is based on
high-temporal datasets, processing those datasets becomes repetitive. Once the process is
defined it can be automated, which reduces human interference and therefore yields a less
error-prone product; and because it is automated, results are generated rapidly.
ArcMap, a GIS software package, also has limitations when it comes to project-specific
customization. For example, exporting weekly drought maps by default has to be done
manually; with a high-temporal dataset this becomes a time-consuming task, and automation
can reduce the time needed to generate the result.
When analysing the data, many intermediate datasets are created, depending on the
methodology. For example, to plot the monthly mean rainfall sum of the precipitation data
provided by IMD from 1901 to 2015, a total of 115 years of data, an ArcMap user would
traditionally create a batch for every month and perform zonal statistics on the files in the
batch, repeating this for every month of every year from 1901 to 2015.
The intermediate data generated here, the monthly sum files, take unnecessary space on the
computer: each sum file is around 136 kilobytes, so over 115 years roughly 184 megabytes
are wasted (136 KB × 12 months × 115 years ≈ 184 MB).
The IMD dataset is 0.25 degrees in spatial resolution, so each individual file is only kilobytes
in size. Compared with Landsat 8 OLI images, where each image is over a gigabyte, the
amount of space wasted would be far greater; with the help of automation we can therefore
use the hardware resources of the computer to the optimal level.
2. Need for machine learning classification
Machine learning enables us to create applications which replicate human cognitive functions
to classify objects. Machine learning has many sub-streams, each having its own
advantages and disadvantages.
Once the algorithm is trained, it can be used multiple times to classify drought and generate
weekly or daily products, depending on the input parameters.
Literature Review
Python success story
ForecastWatch.com
Introduction
ForecastWatch.com, a service of Intellovations, is in the business of rating the accuracy of weather
reports from companies such as Accuweather, MyForecast.com, and The Weather Channel. Over
36,000 weather forecasts are collected every day for over 800 U.S. cities, and later compared with
actual climatological data. These comparisons are used by meteorologists to improve their weather
forecasts, and to compare their forecasts with others. They are also used by consumers to better
understand the probable accuracy of a forecast.
The Architecture
ForecastWatch.com is built from four major architectural components: An input process for
acquiring forecasts, an input process for acquiring measured climatological data, the data
aggregation engine, and the web application framework.
There are two main input processes in the system: The forecast parser, and the actuals parser. The
forecast parser is responsible for requesting forecasts from the web for each of the forecast
providers ForecastWatch.com tracks. It parses the forecast from the page and inserts the forecast
data into a database until it can be compared to the actual data. The actuals parser takes actual data
from the National Climatic Data Center of the National Weather Service, which provides high,
low, precipitation, and significant weather events for over 800 United States cities and inserts the
data in to the database. This process also scores the forecasts with the actual weather data, and
places that information in the database.
Once the data has been collected and scored, it is processed by the aggregation engine, which
combines the scores into yearly and monthly blocks, sliced by provider, location, and the number
of days into the future for which the forecasts were predicting. In its first year, 2003, the system
only gathered forecasts for 20 U.S. cities, or about 250,000 individual forecasts, so most of the
data output was based on the raw scoring data. The aggregation engine was added once the system
was scaled up to 800 cities, increasing the data stream by almost 4000%. In the first half of 2004,
the system has already scored over 4 million forecasts, all collected, parsed, and displayed on the
web.
Implemented with Python
ForecastWatch.com is a 100% pure Python solution. Python is used in all its components, from
the back-end to the front-end, including also the more performance-critical portions of the system.
Python was chosen initially because it comes with many standard libraries useful in collecting,
parsing, and storing data from the web. Among those particularly useful in this application were
the regular expression library, the thread library, the object serialization library, and gzip data
compression library. Other libraries, such as an HTTP client capable of accepting cookies
(ClientCookie), and an HTML table parser (ClientTable) were available as third party modules.
These proved invaluable and were easy to use.
The threading library turned out to be very important in scaling ForecastWatch.com's coverage to
over 800 cities. Grabbing web pages is a very I/O bound process, and requesting a single page at
a time for roughly 5000 web pages a day would have been prohibitively time-consuming. Using
Python's threading library, the web page retrieval loop simply calls thread.start_new() for each
request, passing in the necessary class instance method that retrieves and processes the web page,
along with the parameters necessary to describe the city for the desired forecast. The request
classes use a Python built-in Event class instance to communicate with the main controlling thread
when processing is complete. Python made this application of threading incredibly easy.
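The retrieval pattern described above can be sketched as follows. This is a hypothetical illustration using the modern threading.Thread API rather than the old Python 2 thread.start_new call mentioned in the article; fetch_forecast and the city list are stand-ins, not ForecastWatch code.

```python
import threading

results = {}
done = threading.Event()  # signals the controlling thread, as in the article

def fetch_forecast(city):
    # In the real system this would issue an HTTP request and parse the page;
    # here a placeholder string stands in for the parsed forecast.
    results[city] = f"forecast for {city}"

def fetch_all(cities):
    threads = [threading.Thread(target=fetch_forecast, args=(c,)) for c in cities]
    for t in threads:
        t.start()             # overlap the I/O-bound requests
    for t in threads:
        t.join()              # wait for every request to finish
    done.set()                # tell the main thread that processing is complete

fetch_all(["Pune", "Mumbai", "Nagpur"])
```

Because page retrieval is I/O bound, the threads spend most of their time waiting on the network, which is exactly where this pattern pays off.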
Python is also used in the aggregation engine, which runs as a separate process to combine forecast
accuracy scores into monthly and yearly slices. The aggregation process uses queries
via MySQLdb to the MySQL database where the input modules have placed the forecast and
climatological data they have harvested. Colorized maps, showing forecast accuracy by
geographical area, are then generated for use on the web site and in printed reports.
Python Made It Possible
Python played a significant role in the success of ForecastWatch.com. The product currently
contains over 5,000 lines of Python, most of which are concerned with implementing the high-
level functionality of the application, while most of the details are taken care of by Python's
powerful standard libraries and the third party modules described above. Many more lines of code
would have been needed working in, for example, Java or PHP. The integration capabilities of
those languages are not as strong, and their threading support is harder to use.
About Python
Python is impressive as an object-oriented rapid application development language. One of
Python's key strengths lies in its ability to produce results quickly without sacrificing
maintainability of the resulting code. In ForecastWatch.com, Python was used for prototyping as
well, and those prototypes were able to evolve cleanly into the production code without requiring
a complete rewrite or switching toolsets. This saved substantial effort and made the development
process more flexible and effective.
Because of the clean design of the language, refactoring the Python code was also much easier
than in other languages; moving code around simply requires less effort.
Python's interpreted nature was also a benefit: Code ideas can easily be tested in the Python
interactive shell, and lack of a compilation phase makes for a shorter edit/test cycle.
All of these factors combine to make Python a terrific alternative to C++ and Java as a general
purpose programming language. ForecastWatch.com was made possible because of the ease of
programming complex tasks in Python, and the rapid development that Python allows.
Python Modules
1. Pandas:
Pandas is used for data analytics. It enables the programmer to traverse the data
and extract the desired result. Pandas behaves similarly to Microsoft Excel, the main
difference being that pandas has no graphical user interface.
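As a minimal sketch of this spreadsheet-like behaviour (the column names and rainfall values are illustrative, not project data):

```python
import pandas as pd

# Toy monthly rainfall table, the kind of tabular data pandas handles.
rain = pd.DataFrame({
    "month": ["Jun", "Jul", "Aug"],
    "rainfall_mm": [120.5, 310.2, 250.0],
})

wettest = rain.loc[rain["rainfall_mm"].idxmax(), "month"]  # row lookup, like Excel MATCH
total = rain["rainfall_mm"].sum()                          # column aggregate, like SUM()
```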
2. Matplotlib:
As discussed in the pandas segment above, matplotlib enables the user to visualize the
behaviour of the data. Matplotlib is a vast library capable of plotting virtually any type of
graph. In this project matplotlib is used for analyzing the spatial correlation between two
datasets.
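A minimal sketch of such a correlation plot, with made-up station and satellite values; the Agg backend is used so no display is needed:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no GUI required
import matplotlib.pyplot as plt

# Hypothetical station vs satellite rainfall values for one location.
station = [10.0, 22.0, 35.0, 48.0]
satellite = [8.5, 20.1, 33.9, 45.2]

fig, ax = plt.subplots()
ax.scatter(station, satellite)            # one point per paired observation
ax.set_xlabel("Station rainfall (mm)")
ax.set_ylabel("Satellite rainfall (mm)")
fig.savefig("correlation.png")            # saved alongside the script
```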
3. Arcpy:
Arcpy is a Python module made only for ArcGIS applications; it cannot be used outside
the ArcGIS environment. The basic purpose of this module is to create customized
applications in ArcGIS Desktop. Every tool in ArcMap has a Python implementation,
which you can see in the tool description. By understanding the behaviour of each tool
we can merge multiple ArcMap tools and get the desired output.
This helps in reducing manual work, as merging the tools automates the process of
generating products.
4. Numpy:
Numpy helps in performing operations on 2D or 3D arrays. With the help of numpy, any
raster-based model can be generated. Numpy provides methods which enable the user
to transform raster datasets stored as 2D or 3D arrays. Numpy also makes tasks like
taking the temporal mean of a long time series (for example, 100 years) while excluding
a particular value very easy: simply masking the numpy raster achieves this.
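The masked temporal mean described above can be sketched as follows; the tiny 2×2 rasters and the -999 no-data value are illustrative assumptions:

```python
import numpy as np

# A stack of yearly rasters (years x rows x cols) where -999 marks
# no-data cells that must be excluded from the long-term mean.
stack = np.array([
    [[10.0, -999.0], [30.0, 40.0]],
    [[20.0, 60.0], [-999.0, 80.0]],
])

masked = np.ma.masked_equal(stack, -999.0)   # hide the no-data value
temporal_mean = masked.mean(axis=0)          # mean over the time axis only
```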
5. Openpyxl:
Openpyxl is the bridge between ArcMap and Microsoft Excel. Results generated from
ArcMap can be taken as a pandas data frame and then stored in Excel. After storing the
dataset we can use openpyxl's methods to visualize the data. All common graph and
chart types can be generated with openpyxl, such as line charts, scatter charts, pie charts
and area charts.
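A minimal sketch of storing data in a worksheet and attaching a line chart with openpyxl; the sheet layout and rainfall values are illustrative only:

```python
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference

wb = Workbook()
ws = wb.active
ws.append(["month", "rainfall_mm"])           # header row
for row in [("Jun", 120.5), ("Jul", 310.2), ("Aug", 250.0)]:
    ws.append(row)

chart = LineChart()
data = Reference(ws, min_col=2, min_row=1, max_row=4)  # include header as title
chart.add_data(data, titles_from_data=True)
ws.add_chart(chart, "D2")                     # anchor the chart at cell D2
wb.save("rainfall.xlsx")
```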
6. Scipy:
Scipy includes complex statistical tools for data analysis. One of these, the gamma
cumulative distribution function, is used for the calculation of SPI. Statistical
computations such as linear regression, with results including the standard error and other
important information, can be performed with scipy. The interpolation module in
scipy helps the programmer interpolate point datasets into surfaces. There are dedicated
scipy modules for Fourier transforms, linear algebra, eigenvalues, multidimensional
image processing, etc.
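The gamma-based SPI idea can be sketched with scipy.stats: fit a gamma distribution to a long-term record, take each value's cumulative probability, then map it onto a standard normal. The synthetic record below is an assumption for illustration, not IMD data:

```python
import numpy as np
from scipy import stats

# Synthetic 30-year monthly record (360 values) drawn from a gamma law.
rng = np.random.default_rng(42)
rainfall = rng.gamma(shape=2.0, scale=50.0, size=360)

# Fit a gamma distribution (location pinned at zero, as usual for rainfall),
# convert each value to its cumulative probability, then to a z-score: the SPI.
shape, loc, scale = stats.gamma.fit(rainfall, floc=0)
spi = stats.norm.ppf(stats.gamma.cdf(rainfall, shape, loc=loc, scale=scale))
```

For a well-fitted record the SPI values centre near zero and mostly fall within about ±3, matching the interpretation table later in this chapter.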
7. GRASS (GRASS, n.d.):
GRASS is a freely available plugin for the Quantum GIS software which can be used with
raster as well as vector data for analysis. GRASS is a plugin, not a module; it contains many
Python modules for analysis of spatial data, such as:
i. db, the database module.
ii. r, the raster module.
iii. v, the vector module.
8. Sklearn:
a. Decision Tree (scikit-learn, 1.10. Decision Trees, n.d.): Decision Trees (DTs) are a
non-parametric supervised learning method used for classification and regression.
The goal is to create a model that predicts the value of a target variable by learning
simple decision rules inferred from the data features.
For instance, in the example below, decision trees learn from data to approximate
a sine curve with a set of if-then-else decision rules. The deeper the tree, the more
complex the decision rules and the fitter the model.
Some advantages of decision trees are:
- Simple to understand and to interpret. Trees can be visualised.
- Requires little data preparation. Other techniques often require data normalisation,
dummy variables to be created and blank values to be removed. Note however that this
module does not support missing values.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data
points used to train the tree.
- Able to handle both numerical and categorical data. Other techniques are usually
specialised in analysing datasets that have only one type of variable.
- Able to handle multi-output problems.
- Uses a white box model. If a given situation is observable in a model, the explanation
for the condition is easily explained by boolean logic. By contrast, in a black box model
(e.g., in an artificial neural network), results may be more difficult to interpret.
- Possible to validate a model using statistical tests. That makes it possible to account
for the reliability of the model.
- Performs well even if its assumptions are somewhat violated by the true model from
which the data were generated.
The disadvantages of decision trees include:
- Decision-tree learners can create over-complex trees that do not generalise the data
well. This is called overfitting. Mechanisms such as pruning (not currently supported),
setting the minimum number of samples required at a leaf node or setting the maximum
depth of the tree are necessary to avoid this problem.
- Decision trees can be unstable because small variations in the data might result in a
completely different tree being generated. This problem is mitigated by using decision
trees within an ensemble.
- The problem of learning an optimal decision tree is known to be NP-complete under
several aspects of optimality and even for simple concepts. Consequently, practical
decision-tree learning algorithms are based on heuristic algorithms such as the greedy
algorithm where locally optimal decisions are made at each node. Such algorithms
cannot guarantee to return the globally optimal decision tree. This can be mitigated by
training multiple trees in an ensemble learner, where the features and samples are
randomly sampled with replacement.
- There are concepts that are hard to learn because decision trees do not express them
easily, such as XOR, parity or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the dataset prior to fitting with the decision tree.
Figure 1: Simple implementation of Decision Tree
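A minimal sketch in the spirit of the figure referenced above, using toy (rainfall anomaly, vegetation index) pairs labelled drought (1) or normal (0); the features and values are illustrative assumptions, not the thesis dataset:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: negative rainfall anomaly + low vegetation index -> drought.
X = [[-2.0, 0.1], [-1.5, 0.2], [0.5, 0.6], [1.0, 0.8], [-1.8, 0.15], [0.8, 0.7]]
y = [1, 1, 0, 0, 1, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
pred = clf.predict([[-1.7, 0.12], [0.9, 0.75]])  # classify two unseen pixels
```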
b. Random Forest (scikit-learn, 3.2.4.3.1. sklearn.ensemble.RandomForestClassifier,
n.d.):
A random forest is a meta-estimator that fits a number of decision tree classifiers
on various sub-samples of the dataset and uses averaging to improve the predictive
accuracy and control over-fitting. The sub-sample size is always the same as the
original input sample size, but the samples are drawn with replacement
if bootstrap=True (the default).
An example random forest implementation is given below.
Figure 2: Simple implementation of Random Forest
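A minimal random forest sketch on toy (rainfall anomaly, vegetation index) pairs labelled drought (1) or normal (0); the data are illustrative assumptions only:

```python
from sklearn.ensemble import RandomForestClassifier

# Same kind of toy drought data: each tree sees a bootstrap sample of it.
X = [[-2.0, 0.1], [-1.5, 0.2], [0.5, 0.6], [1.0, 0.8], [-1.8, 0.15], [0.8, 0.7]]
y = [1, 1, 0, 0, 1, 0]

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
pred = forest.predict([[-1.7, 0.12], [0.9, 0.75]])  # majority vote over 25 trees
```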
9. Decision Tree Implementation:
(Hwahwan & Cha, 2008) implemented land classification with the machine learning
technique named decision tree.
The parameters taken for classification were:
i. DEM
ii. Aspect
iii. Slope
iv. ISO cluster
v. Population Density
vi. Distance to water
vii. Distance to Road
To train the decision tree, classified land data was taken from the government of South
Korea.
The classes were Forest, Urban, Water, Agriculture, Rangeland, Barren Land and
Wetland. After training and classification of the dataset, 96% accuracy was achieved.
10. WMO SPI (Standardized Precipitation Index):
(WMO, 2009) The Inter-Regional Workshop on Indices and Early Warning Systems for
Drought declared that the SPI (Standardized Precipitation Index) should be used to
characterize meteorological drought.
SPI answers questions like: is rainfall in a particular month in deficit or surplus compared
to past years of data?
SPI accumulation periods range from 1 month to 60 months, and can go beyond 60 months
if long-term rainfall data are available.
One-month SPI helps in identifying short-term drought events, since one month of rainfall
data is compared with past records.
As we increase the SPI accumulation period we can identify long-term drought; for
example, with 48-month SPI we can identify locations which have been affected by drought
for the past four years. SPI roughly ranges from -3 to +3, and each range of values has a
meaning.
Table 1: SPI Values
2.0 and above Extremely wet
1.5 to 1.99 Very wet
1.0 to 1.49 Moderately wet
-0.99 to 0.99 Near normal
-1.0 to -1.49 Moderately dry
-1.5 to -1.99 Severely dry
-2.0 and below Extremely dry
(M. Svoboda, 2012) defines the meaning of the SPI values above in the SPI user guide.
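The classification in the table above can be expressed as a small helper function (a hypothetical convenience, not part of the WMO guide):

```python
# Map an SPI value to its category, following the table's boundaries.
def classify_spi(spi):
    if spi >= 2.0:
        return "Extremely wet"
    if spi >= 1.5:
        return "Very wet"
    if spi >= 1.0:
        return "Moderately wet"
    if spi > -1.0:
        return "Near normal"
    if spi > -1.5:
        return "Moderately dry"
    if spi > -2.0:
        return "Severely dry"
    return "Extremely dry"
```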
11. Bias Correction CMCC:
P*(d) = P(d) · μm(Pobs(d)) / μm(Prem(d))
Where:
- μm(Pobs(d)) is the monthly mean of the station data
- μm(Prem(d)) is the monthly mean of the remote sensing data
- P(d) is the daily remote sensing data
- P*(d) is the bias-corrected remote sensing data
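A one-line sketch of the CMCC correction above; the numbers are illustrative:

```python
# Scale each daily satellite value by the ratio of the station monthly
# mean to the satellite monthly mean, as in the CMCC equation.
def cmcc_correct(p_daily, mean_obs_month, mean_rem_month):
    return p_daily * (mean_obs_month / mean_rem_month)

corrected = cmcc_correct(4.0, mean_obs_month=100.0, mean_rem_month=80.0)
```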
12. Bias Correction MRC
SREe = (SREo − μSRE) · τf + (μSRE · μf)
μf = μOBS / μSRE
τf = τOBS / τSRE
Where:
- SREo is the daily satellite rainfall estimate (SRE) data
- μSRE is the monthly mean of the remote sensing data
- μOBS is the monthly mean of the station data
- τOBS is the monthly standard deviation of the station data
- τSRE is the monthly standard deviation of the remote sensing data
- SREe is the bias-corrected data
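A sketch of the MRC correction above; note that μSRE · μf reduces to the station monthly mean, so the formula rescales anomalies and re-centres them on the station mean. Values are illustrative:

```python
# Rescale the satellite anomaly by the ratio of standard deviations,
# then shift it onto the station monthly mean.
def mrc_correct(sre_daily, mu_sre, mu_obs, tau_sre, tau_obs):
    mu_f = mu_obs / mu_sre       # mean ratio
    tau_f = tau_obs / tau_sre    # standard deviation ratio
    return (sre_daily - mu_sre) * tau_f + mu_sre * mu_f

corrected = mrc_correct(6.0, mu_sre=4.0, mu_obs=5.0, tau_sre=2.0, tau_obs=3.0)
```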
13. Bias Correction Rule Based (Modified CMCC):
This method was developed at IWMI and its behaviour is related to the CMCC
methodology. The rules were created after testing the rainfall dataset in Excel and
studying the behaviour of the equation depending on the mean of the data.
- Rule 1: If the station data records precipitation and the satellite does not, copy
the IMD value into the bias-corrected satellite data.
- Rule 2: If the difference between station and satellite data is within two
millimetres, take the mean of these pixels.
- Rule 3: If the station mean is greater than the satellite mean, then where the
satellite daily data is greater than the station daily data the bias correction
equation is:
P*(d) = P(d) · μm(Prem(d)) / μm(Pobs(d))
and where the satellite daily data is less than the station daily data:
P*(d) = P(d) · μm(Pobs(d)) / μm(Prem(d))
- Rule 4: If the station mean is less than the satellite mean, then where the
satellite daily data is less than the station daily data the bias correction
equation is:
P*(d) = P(d) · μm(Prem(d)) / μm(Pobs(d))
and where the satellite daily data is greater than the station daily data:
P*(d) = P(d) · μm(Pobs(d)) / μm(Prem(d))
Where:
- μm(Pobs(d)) is the monthly mean of the station data
- μm(Prem(d)) is the monthly mean of the remote sensing data
- P(d) is the daily remote sensing data
- P*(d) is the bias-corrected remote sensing data
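A hedged sketch of the four rules above as a single function. The 2 mm band in Rule 2 and the branch ordering are my reading of the text, not IWMI's actual code; all values are illustrative:

```python
# p_sat / p_stn: daily satellite and station values for one pixel;
# mu_obs / mu_rem: monthly means of station and satellite data.
def rule_based_correct(p_sat, p_stn, mu_obs, mu_rem):
    if p_stn > 0 and p_sat == 0:            # Rule 1: satellite missed the rain
        return p_stn
    if abs(p_stn - p_sat) <= 2.0:           # Rule 2: values within 2 mm
        return (p_stn + p_sat) / 2.0
    if mu_obs > mu_rem:                     # Rule 3: station mean is larger
        ratio = mu_rem / mu_obs if p_sat > p_stn else mu_obs / mu_rem
    else:                                   # Rule 4: satellite mean is larger
        ratio = mu_rem / mu_obs if p_sat < p_stn else mu_obs / mu_rem
    return p_sat * ratio

corrected = rule_based_correct(p_sat=0.0, p_stn=6.0, mu_obs=100.0, mu_rem=80.0)
```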
14. Bias Correction IIT Gandhinagar:
(Shah & Mishra, 2014) bias-corrected TRMM data with respect to IMD data. In
the paper they mention that TRMM always underestimates the amount of
precipitation compared to the station data provided by IMD. This difference was
most noticeable in the monsoon season, when the values were extreme.
They created two scale factors to bias-correct the data: one for the extreme events
and the other for the non-monsoon months.
First, they took precipitation values over the ninetieth percentile of the IMD data
and created the first scale factor for extreme events.
The scale factor is simply the ratio of the IMD and TRMM means over the pixels
above the ninetieth percentile of IMD.
This scale factor is then multiplied onto the raw TRMM data for the respective months.
Second, for pixels below the ninetieth percentile of IMD, the ratio is taken of the IMD
and TRMM means over the pixels under the ninetieth percentile of IMD. This
second factor is applied to the raw TRMM data. The percentile is taken only for the
monsoon months, because other months receive comparatively little rainfall.
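The two scale factors described above can be sketched with numpy; the ten-value arrays are illustrative assumptions, not real TRMM/IMD data:

```python
import numpy as np

# Paired station (IMD) and satellite (TRMM) values for one month.
imd = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])
trmm = np.array([0.8, 1.5, 2.5, 3.0, 4.0, 5.0, 6.0, 6.5, 7.0, 25.0])

p90 = np.percentile(imd, 90)                 # ninetieth percentile of IMD
extreme = imd >= p90                         # extreme-event pixels

# One scale factor per regime: ratio of IMD mean to TRMM mean.
scale_extreme = imd[extreme].mean() / trmm[extreme].mean()
scale_normal = imd[~extreme].mean() / trmm[~extreme].mean()

corrected = np.where(extreme, trmm * scale_extreme, trmm * scale_normal)
```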
15. IDSI (Integrated Drought Severity Index)
a. IDSI is an index developed at IWMI for monitoring drought in South Asia,
covering India, Pakistan, Sri Lanka, Nepal, Afghanistan and Bangladesh. It uses
VCI, TCI and the rainfall anomaly from the GPM dataset to classify a pixel's
drought severity.
A map of IDSI looks like the following.
Figure 18: IDSI 20-27 Jul 2002
Study Area
Introduction
Since this methodology is experimental and the dataset is of huge volume and variety, the
machine learning processing time depends entirely on how big the data is.
To get faster results the study area has been kept small. The area was chosen on the basis of the
large spatial variability in Maharashtra, so that the interaction of the diverse indices from
multiple satellites could be studied.
Area of concern
The biggest area of concern is farmer suicides in Maharashtra; a major reason behind this issue
is lack of management from the government. Remote sensing is one platform that can be adopted
by the government to release funds to the needy and stop this crisis.
The lack of management comes from not taking the right decisions at the right time, which
happens because conventional methodologies are still used to determine drought. Remote
sensing is a method that gives fast and mostly accurate results because we do not need to wait
for a verdict from a village surveyor to declare a village drought-affected; from satellite imagery
we can immediately process the data and provide crucial information to the decision makers.
With the help of Python automation we can automate the entire processing chain without human
interference and get results in minutes.
Methodology
Conversion tools:
1. IMD GRD data to ASCII: IMD by default distributes its in-situ precipitation data in GRD
format, and a separate C program is distributed to convert the GRD file into ASCII. To
simplify processing for the user, a Python script was built that performs the same task as
the C program.
2. IMD ASCII to Raster: The ASCII file is then converted to gridded raster data so that it
can be compared with satellite rainfall estimates.
3. Excel to Raster: Apart from IMD, the Bangladesh precipitation data I found was in the
form of an Excel sheet containing in-situ recorded precipitation data. This tool enables the
user to generate a gridded raster map from the point station data with the IDW
interpolation technique.
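The IDW technique used by the Excel-to-raster tool can be sketched as follows; the station coordinates, values, and the power parameter are illustrative assumptions:

```python
import math

# Inverse distance weighting: each grid cell gets a distance-weighted
# average of the station values; closer stations weigh more.
def idw(x, y, stations, power=2):
    num = den = 0.0
    for sx, sy, value in stations:
        d = math.hypot(x - sx, y - sy)
        if d == 0:
            return value          # the cell sits exactly on a station
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Midpoint between two equal-distance stations gets their average.
est = idw(0.5, 0.0, [(0.0, 0.0, 10.0), (1.0, 0.0, 20.0)])
```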
Shift in the IMD and Remote sensing data:
1. After converting the IMD raw data into raster maps, we observed a shift of 0.125 degrees
between the IMD raster and the remote sensing rasters.
2. The shift was clearly visible when compared with PERSIANN and TRMM.
3. Since the final product is going to be at 30-metre spatial resolution, a shift of 0.125
degrees would cause a major problem.
4. Hence a tool was created to remove this shift.
Methodology
1. Resample IMD data to 0.125 degree from 0.25 degrees
2. Then we take the zonal mean of these 0.125 degrees with a fishnet which is created based
on the remote sensing data, where the grid size of the fishnet is 0.25 matching the extent
of remote sensing data
3. The extra reaming 0.125 degree cells left are then clipped
4. Again resample is performed to convert the 0.125 degree IMD data tot 0.25 remote sensing
products like PERSIANN and TRMM
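The four steps above can be sketched in plain numpy. This is illustrative only: the direction of the 0.125-degree shift and which edges get clipped are assumptions made for the sketch, and the real tool works on georeferenced rasters rather than bare arrays.

```python
import numpy as np

def shift_to_rs_grid(imd_025):
    """Re-align an IMD 0.25-degree grid to a remote-sensing 0.25-degree grid
    offset by 0.125 degrees (simplified sketch of the four-step method).
    """
    # Step 1: resample 0.25 -> 0.125 degrees by splitting each cell into 2x2
    fine = np.kron(imd_025, np.ones((2, 2)))
    # Steps 2-3: shift the fishnet by one 0.125-degree cell and clip the
    # leftover half-cells on the opposite edges (shift direction assumed)
    shifted = fine[1:-1, 1:-1]
    # Step 4: zonal mean over 2x2 blocks back to 0.25 degrees
    h, w = shifted.shape
    return shifted.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```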
Bias Correction
A toolbox dedicated to bias correction was created with all the methods listed below:
1. Bias Correction CMCC interval
2. Bias Correction CMCC monthly
3. Bias Correction Rule Based
4. Bias Correction MRC
5. Bias Correction IIT Gandhinagar
The best corrected results were taken for SPI calculation.
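As an illustration of what a bias-correction step looks like, here is a generic multiplicative linear-scaling correction. This is not one of the five named methods above (their formulas are not given here); it is only a minimal sketch of the sat-versus-station adjustment they all perform.

```python
import numpy as np

def linear_scaling(sat, obs):
    """Multiplicative linear-scaling bias correction (illustrative only).

    Scales satellite rainfall estimates so their mean matches the
    in-situ observed mean over the calibration period.
    """
    sat = np.asarray(sat, dtype=float)
    obs = np.asarray(obs, dtype=float)
    factor = obs.mean() / sat.mean() if sat.mean() > 0 else 1.0
    return sat * factor
```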
1. SPI calculation
a. Automated data sorting
Figure 20: UI of Monthly Sum for SPI
Description of this tool:
Monthly Sum For SPI
Title: Monthly Sum For SPI
Summary:
This tool calculates the monthly sum from the daily rainfall data and saves the long-term files in their respective month folders.
Calculating SPI with the SPI tool in this toolbox requires some data preparation; this tool automates that step, so you will not need to sort the data manually before sending it to the SPI tool.
Usage:
There is no usage note for this tool.
Syntax:
MonthlySumForSPI (Daily_gridded_data_Folder, Extension, Output_Folder)
Parameters:

Daily_gridded_data_Folder (Folder): This folder should contain subfolders named by year, each containing 365 daily rainfall files. Example: folder name = TRMM_daily; sub-folder names = 2001, 2002, 2003 ... 2015. There is no Python reference for this parameter.

Extension (String): Select the format of the precipitation dataset that you are giving as input to this tool. There is no Python reference for this parameter.

Output_Folder (Folder): Folder where all the monthly folders containing the output files will be created. There is no Python reference for this parameter.
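The per-year summing that "Monthly Sum For SPI" performs can be sketched as a pure-array operation (the real tool walks the year folders and writes one raster per month; file handling is omitted here for clarity):

```python
import calendar
import numpy as np

def monthly_sums(daily_stack, year):
    """Sum a (365/366, rows, cols) stack of daily rainfall grids into
    twelve monthly grids, one year at a time.
    """
    days_per_month = [calendar.monthrange(year, m)[1] for m in range(1, 13)]
    assert sum(days_per_month) == daily_stack.shape[0]
    out, start = {}, 0
    for month, ndays in enumerate(days_per_month, start=1):
        out[month] = daily_stack[start:start + ndays].sum(axis=0)
        start += ndays
    return out
```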
b. 1 to 12 tool UI
Figure 21:UI of SPI Calculation from 1 month to 12 months
Description of this tool:
SPI 1 to 12
Title: SPI 1 to 12
Summary:
This tool calculates the SPI from monthly rainfall data.
It is designed to calculate the Standardized Precipitation Index (SPI) with minimal human interaction.
It computes α, β, Γ(α) and the cumulative probability density function within the tool and gives the final output as SPI for all aggregation periods from 1 to 12 months.
NOTE: The SPI output from the tool has been validated against the World Meteorological Organization (WMO) software for SPI; a correlation of 0.99 was achieved between this tool and the WMO software.
Usage:
There is no usage note for this tool.
Syntax:
SPI1to12 (Input_Folder, Extension, Daily_gridded_data_Folder, Output_Folder)
Parameters:

Input_Folder (Folder): Folder containing all the monthly sub-folders with every year's monthly rainfall files, as computed by the "Monthly Sum For SPI" tool. There is no Python reference for this parameter.

Extension (String): Select the format of the precipitation dataset that you are giving as input to this tool. There is no Python reference for this parameter.

Daily_gridded_data_Folder (Folder): This folder should contain subfolders named by year, each containing 365 daily rainfall files. Example: folder name = TRMM_daily; sub-folder names = 2001, 2002, 2003 ... 2015. There is no Python reference for this parameter.

Output_Folder (Folder): Folder where all the monthly folders containing the output files will be created. There is no Python reference for this parameter.
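The gamma-fit-and-transform at the core of the SPI tools (the α, β, Γ(α) and cumulative probability computation mentioned above) can be sketched with scipy. This is a hedged reconstruction of the standard SPI formulation, not the tool's actual internals; the mixed-distribution handling of zero-rainfall months is an assumption about how the tool treats dry months.

```python
import numpy as np
from scipy import stats

def spi_values(monthly_precip):
    """Standardized Precipitation Index for one calendar month's series.

    Fits a gamma distribution (alpha, beta) to the non-zero totals by MLE,
    builds the cumulative probability with a point mass q at zero rainfall,
    and maps it through the inverse standard normal CDF.
    """
    x = np.asarray(monthly_precip, dtype=float)
    zeros = x == 0
    q = zeros.mean()                                    # probability of a dry month
    a, _, scale = stats.gamma.fit(x[~zeros], floc=0)    # alpha, beta
    cdf = np.where(zeros, q / 2.0,
                   q + (1.0 - q) * stats.gamma.cdf(x, a, loc=0, scale=scale))
    return stats.norm.ppf(cdf)                          # SPI: standard normal quantiles
```

Applied per pixel and per aggregation window, this yields the 1-to-12-month (and 13-to-60-month) SPI stacks the tools produce.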
c. 12 to 60 tool UI
Figure 22: UI of SPI Calculation from 13 months to 60 months
Description of this tool:
SPI 13 to 60
Title: SPI 13 to 60
Summary:
This tool calculates the SPI from monthly rainfall data.
It is designed to calculate the Standardized Precipitation Index (SPI) with minimal human interaction.
It computes α, β, Γ(α) and the cumulative probability density function within the tool and gives the final output as SPI for all aggregation periods from 13 to 60 months.
NOTE: The SPI output from the tool has been validated against the World Meteorological Organization (WMO) software for SPI; a correlation of 0.99 was achieved between this tool and the WMO software.
Usage:
There is no usage note for this tool.
Syntax:
SPI13to60 (Input_Folder, Extension, Daily_gridded_data_Folder, Output_Folder)
Parameters:

Input_Folder (Folder): Folder containing all the monthly sub-folders with every year's monthly rainfall files, as computed by the "Monthly Sum For SPI" tool. There is no Python reference for this parameter.

Extension (String): Select the format of the precipitation dataset that you are giving as input to this tool. There is no Python reference for this parameter.

Daily_gridded_data_Folder (Folder): This folder should contain subfolders named by year, each containing 365 daily rainfall files. Example: folder name = TRMM_daily; sub-folder names = 2001, 2002, 2003 ... 2015. There is no Python reference for this parameter.

Output_Folder (Folder): Folder where all the monthly folders containing the output files will be created. There is no Python reference for this parameter.
d. Unpack
Figure 23: UI of unpacking all the calculated SPI to daily raster images
Description of this tool:
This tool helps the user unpack all the stacked SPI rasters; the output from this tool contains single layers, each holding one month's SPI.
e. Validation with WMO software
i. WMO has developed a command-line program to calculate the SPI. Output from the WMO program and the Python SPI tool developed at IWMI had a correlation of 0.99.
Table 2: Comparison between SPI calculated by the WMO software and the IWMI-made Python tool

Python | WMO
-99 | -99
1.28 | 1.25
-1.4 | -1.392
-0.34 | -0.342
-0.26 | -0.265
Correlation: 0.99999
f. Validation with monthly mean of IMD rainfall data
i. IMD data from 1901 to 2013 was taken to calculate the one-month SPI. The left Y axis shows the mean monthly rainfall of Maharashtra; the right Y axis shows the corresponding SPI. A correlation of over 0.94 was achieved between the two datasets, and the patterns also match.
Figure 24: Comparison of Mean Rainfall with Mean SPI
2. VCI
(Kogan & J. Sullivan, 1993) defined a vegetation index that takes the maximum and minimum NDVI values in the time series and then calculates the index:

VCI = (NDVI − NDVI_min) × 100 / (NDVI_max − NDVI_min)

Where:
NDVI, NDVI_max and NDVI_min are the smoothed weekly NDVI and the multiple-year NDVI maximum and minimum, respectively.
3. TCI
(Liu, W.T., & F.N. Kogan, 1996): similar to VCI, the maximum and minimum are taken over the long time period:
TCI = 100 × (BT_max − BT) / (BT_max − BT_min)

Where:
BT, BT_max and BT_min are the smoothed weekly and multiple-year maximum and minimum thermal brightness temperatures, respectively.
4. NDVI
NDVI was calculated using the dataset from Landsat 8 OLI:

NDVI = (Band 5 − Band 4) / (Band 5 + Band 4)

Where:
Band 5 is near infrared, with wavelengths of 0.85–0.88 micrometers, and Band 4 is red, with wavelengths of 0.64–0.67 micrometers.
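The NDVI computation over Landsat 8 band arrays is a one-liner with a guard for zero-sum pixels (water or no-data):

```python
import numpy as np

def ndvi(band5_nir, band4_red):
    """NDVI from Landsat 8 OLI Band 5 (NIR) and Band 4 (red) arrays.

    np.errstate suppresses warnings where NIR + red is zero; those
    pixels come back as NaN rather than raising.
    """
    nir = np.asarray(band5_nir, dtype=float)
    red = np.asarray(band4_red, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return (nir - red) / (nir + red)
```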
5. SPI, VCI, TCI and NDVI: these indices will be used as parameters in both of the machine learning approaches.
Results
1. Bias correction comparison
a. Two satellite precipitation estimators were taken, TRMM and PERSIANN, both with a resolution of 0.25 degrees. All the types of bias correction discussed in the literature review were implemented, and correlation graphs were produced between the original station data and the satellite estimates before and after bias correction. Taking the annual average of the correlation between bias-corrected and original data, to see which bias correction consistently gives good results on most days, TRMM and PERSIANN had an average correlation of 0.6.
b. Since PERSIANN data is available from 1983 to 2015, thirty-three years of daily rainfall estimates, it is the better choice for SPI calculation. After comparing all the bias correction results, Rule Based bias correction proved to be the best, with an average correlation of 0.7 to 0.8; so the final bias correction method used is Rule Based, with PERSIANN as the satellite rainfall estimate and IMD as the in-situ observed data. The results are shown in the following graphs.
c. The data compared for the months shown in the graphs below, i.e. April, May, July and August, are dated the 15th of each month.
Figure 25: 2007 TRMM data compared with all types of Bias Correction methods
Figure 26: 1998 TRMM data compared with all types of Bias Correction methods
Figure 27: 2007 PERSIANN data compared with all types of Bias Correction methods
Figure 28: 1998 PERSIANN data compared with all types of Bias Correction methods
Figure 29: 1998 PERSIANN data compared with all types of Bias Correction methods
Figure 30: 2007 PERSIANN data compared with all types of Bias Correction methods
d. DT
i. The input parameters for classification of the dataset were VCI, TCI, SPI and NDVI.
e. Random Forest with different estimators and their results
i. Estimator 10
f. IDSI vs Random Forest and Decision Tree, and SPI vs Random Forest and Decision Tree
Table 3: Comparison between random forest with 10, 25, 50 and 80 estimators and decision tree, against IDSI and SPI

        | RF_10 | RF_80 | RF_25 | RF_50 | DT
VS IDSI | 0.49  | 0.42  | 0.48  | 0.48  | 0.43
VS SPI  | 0.18  | 0.78  | 0.31  | 0.14  | 0.99

Where RF_10, RF_25, RF_50, RF_80 and DT are random forest with 10, 25, 50 and 80 estimators and decision tree, respectively.
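The estimator comparison above can be set up with scikit-learn roughly as follows. This sketch uses synthetic data as a stand-in for the four drought parameters (VCI, TCI, SPI, NDVI); the thesis's actual raster sampling and labeling pipeline is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four per-pixel features (VCI, TCI, SPI, NDVI)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for n in (10, 25, 50, 80):                       # the estimator counts compared above
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    scores[f"RF_{n}"] = rf.score(X_te, y_te)
scores["DT"] = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
```

Sweeping `n_estimators` this way is the standard scikit-learn idiom for the RF_10/RF_25/RF_50/RF_80 comparison in Table 3.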
a. Remaining tools description

Tool | Short description
Export Maps | Exports maps in all formats supported by ArcMap
Frequency computer | Computes the frequency of drought pixels to find the number of drought occurrences
Clip Batch | Clips all the files in the given folder
Define Projection Batch | Defines projection for all the files in the given folder
Interval Mean | Finds the day-interval mean of the given annual dataset in a folder
Months Statistics | Calculates monthly statistics of daily gridded data
Raster Scatter diagram | Automated generation of a scatter plot of two given raster datasets
Resample Batch | Resamples all the files in the given folder
Set no data to Value File | Sets a value for no-data elements in a raster file
Set no data to Value Folder | Sets a value for no-data elements in all rasters in a folder
Stack Sum | Generates the sum of all the layers in a stacked raster dataset
Zonal Batch Raster | Computes zonal raster on all the files in the given folder
Zonal Batch Table | Computes zonal table on all the files in the given folder
Discussion
Using Python for automation has substantially reduced manual human intervention and cut dataset processing time by up to 30 percent. The papers reviewed on bias correction all rest on several assumptions. Because in-situ data is recorded daily while satellite estimates are registered at a different time of day, daily bias correction is nearly impossible; moreover, the in-situ data is itself biased by human error, which adds noise that ought to be corrected but cannot be. The only way to bias-correct daily rainfall data in real time is the Internet of Things (IoT): multiple in-situ sensors stationed at regular intervals could communicate with the satellite at the very moment it senses the data, so that the satellite and ground observations are bias-corrected on the spot.
The long processing hours for a dataset the size of one district are the biggest limitation to implementing this machine learning approach in practice; on the other hand, the use of the Landsat 8 dataset gives a reasonably fine resolution of water-stressed areas. Machine learning benefits from a deeper understanding of how operating systems work: managing threads for multiprocessing can make results generate ten times faster.
Conclusion
Random forest proved to be better than decision tree at classifying the pixels; random forest could readily exploit the high resolution of the Landsat 8 dataset to obtain a finer resolution of water-stressed pixels. The four parameters are just the beginning; multiple parameters will be generated for future classification of potential drought pixels. Rain-fed agriculture was not taken into consideration due to lack of data availability; if such data is made available, the quality of classification can be increased substantially. More parameters with high correlation with each other produce better results in random forest.
To deal with the slow processing of huge amounts of data in the future, the dask module of Python will be used to generate faster results with little RAM (Random Access Memory). Dask enables the user to use blocked algorithms, which use less memory and more of the computer's processing power. Instead of using a single thread to process the entire raster, dask breaks the big raster dataset down into user-defined chunks and then processes these divided rasters in parallel.
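The blocked pattern described above, which dask automates, can be sketched in plain numpy with a thread pool. This is illustrative only; dask adds lazy evaluation and scheduling on top of this idea.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def blocked_apply(raster, func, block=256):
    """Apply func to a raster tile by tile and reassemble the result --
    the blocked-algorithm pattern that dask.array automates.
    """
    out = np.empty_like(raster, dtype=float)
    tiles = [(r, c) for r in range(0, raster.shape[0], block)
                    for c in range(0, raster.shape[1], block)]

    def work(rc):
        r, c = rc
        # Each tile is independent, so tiles can be processed in parallel;
        # threads share the output array and write to disjoint slices.
        out[r:r + block, c:c + block] = func(raster[r:r + block, c:c + block])

    with ThreadPoolExecutor() as pool:
        list(pool.map(work, tiles))
    return out
```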
Building on this experience with automated map generation, the next aim is to build an end-to-end application in which the algorithm talks directly to the satellites and generates the maps and statistical data without any human interference.
SPI was calculated up to 60 months to find locations that have been under drought for over five years; due to time constraints it could not be fed into the random forest or the decision tree.
For further development I will collaborate with one of my classmates, Miss Marcia Chen, who has explored neural networks, to analyze the difference and efficiency between random forest and neural networks.
In my six months of internship I created nearly 70 tools, of which the 30 mentioned in this project have been approved by IWMI; in a few weeks all of the tools will be published under IWMI branding.
References
1. GRASS. (n.d.). Documentation. Retrieved from https://grass.osgeo.org/documentation
2. H. K., & C. Y. (2008). A Machine Learning Approach for Knowledge Base Construction. Journal of the Korean Geographical Society, 761-774.
3. Kogan, F., & Sullivan, J. (1993). Development of global drought-watch system using NOAA/AVHRR data. Advances in Space Research, 219-222.
4. Liu, W. T., & Kogan, F. N. (1996). Monitoring regional drought using the Vegetation Condition Index. International Journal of Remote Sensing, 2761-2782.
5. M. Svoboda, M. H. (2012). Standardized Precipitation Index User Guide. WMO.
6. scikit-learn. (n.d.). 1.10. Decision Trees. Retrieved from http://scikit-learn.org/stable/modules/tree.html
7. scikit-learn. (n.d.). sklearn.ensemble.RandomForestClassifier. Retrieved from http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
8. Shah, R. D., & Mishra, V. (2014). Development of an Experimental Near-Real-Time Drought Monitor for India. Journal of Hydrometeorology, 327-345.
9. WMO. (2009, December 15). Press Release No. 872. Retrieved from https://www.wmo.int/pages/mediacentre/press_releases/pr_872_en.html