Functional Data Analysis of Hydrological Data in the
Potomac River Valley
Aaron Zelmanow
American University
October 31, 2012
Acknowledgement to:
Dr. Inga Maslova - American University
Dr. Andres Ticlavilca - Utah State University
1 Abstract
Flooding and droughts are extreme hydrological events that affect the United States
economically and socially. The severity and unpredictability of flooding causes billions of
dollars in damage and catastrophic loss of life. In this context, there is an urgent need
to build a firm scientific basis for adaptation by developing and applying new modeling
techniques for accurate streamflow characterization and reliable hydrological forecasting.
The goal is to use numerical streamflow characteristics in order to explore the ability to
classify, model, and estimate the likelihood of extreme events in the eastern United States,
mainly the Potomac River. Functional data analysis techniques are used to study yearly
streamflow patterns, with the extreme streamflow events characterized via functional
principal component analysis. These methods are merged with more classical techniques
such as cluster analysis, classification analysis, and time series modeling. This exploratory
analysis features models and forecasting methods that, while not comprehensive, are the
foundation for future study.
2 Introduction
Functional data analysis is a relatively novel approach to statistical analysis. The
conversion of discrete data points to functional objects for the purpose of analysis began
in the 1990s and has been used for only a little over a decade in the field of hydrology.
To establish a foundation for the potential of functional data analysis within the field of
hydrology, the work of Suhaila et al. (2011) was used. Their research applied functional
data analysis to compare rainfall between regions within Peninsular Malaysia and provided
the groundwork for the functional data analysis and smoothing techniques that were
utilized in the study of the Potomac River hydrological data. Additionally, the merging
of clustering techniques and functional data analysis was identified and discussed by
Huzurbazar and Humphrey (2008) in their study of subglacial water flow.
To merge the techniques of functional data analysis and time series analysis, the
mortality and fertility rate analyses of Chiou and Müller (2009) and Hyndman and Ullah
(2006) were referenced. Their papers provided insight into the process of converting data
into functional data, the benefits of smoothing techniques, and the resulting fit of
functional data into ARIMA time series models. This analysis was able to expand these
techniques by reviewing various aspects of functional data within the time series analysis.
3 Data Description
The data set used to model the streamflow is from the Potomac River and is measured
in cubic feet per second. The United States Geological Survey (USGS) has positioned
sensors at intervals along the river and has taken monthly readings from 1931 to 2012. A
graphic of the time series for the first five years of data is found in Figure 1. The survey
site used for this analysis is the Little Falls Pump Station on the Potomac River, site
1646500 (http://ida.water.usgs.gov).
Little Falls Pumping Station is a dam and pumping station in Maryland used to provide
a water supply for Washington DC. Water pumped from the station is extracted
upriver of the Chesapeake and Ohio Canal and travels downhill to purification plants
for the nation's capital. The sensors at the pumping station are affected by a multitude
of man-made and natural factors that affect the predictability of streamflow. The
river provides water for households and irrigation and supports industrial and economic
development (www.nature.org/ourinitiatives/). Additionally, natural factors that increase
the variability in streamflow include melting snow and ice, direct precipitation, and the
stream gradient of both the identified
river and other connecting water sources. The physical proximity of the Potomac River
to both the eastern coast of the United States and the Shenandoah Mountains creates
highly variable amounts of precipitation.
As a gaining river, the Potomac River increases in streamflow with movement away
from its origin. As the southernmost of the sensors on the Potomac River, the Little
Falls Pumping Station is furthest from the origin and subject to the maximum amount of
increase in streamflow. Combining this with the volatility associated with the man-made
and natural factors discussed, the variability in streamflow through time is maximized at
the pumping station and there is greater potential for building a streamflow model that
can be utilized across a wide range of rivers and locations.
To fit the data using functional data analysis, the available data had to be reduced by
a total of seven months. The first three months of 1931 and the first four months of 2012
were removed to satisfy the R-Statistical Software requirement of all functional curves
being the same length. The length chosen was twelve months, analyzed as January
1st to December 31st, ensuring that the times of year containing the greatest amount of
activity, spring and fall, are centered within each function.
4 Exploratory Analysis
The graphic of the streamflow data in Figure 2 displays 81 years of streamflow data
as a matrix plot, overlaid on a 12-month axis. As determined by NOAA
(www.erh.noaa.gov), flooding at the Little Falls Pumping Station on the Potomac River
(site 1646500) is categorized as major, moderate, minor, and none; the respective color
coding representing flooding magnitude is red, purple, navy, and sky blue. During the
exploratory phase, two periods of increased streamflow were identified, once in the
spring and again in the fall, with floods occurring multiple times within some years. A
component of the analysis is to determine if a year with a flood event can be identified
prior to its occurrence. To this end, flood severity coding is applied according to the
greatest flood event occurring within each year.
Flooding along the Potomac River is determined by the crest of the water, measured in
feet. A major challenge in modeling the data has been to reconcile the flood events,
measured in feet, with the data used to create the models, which is streamflow measured
in cubic feet per second. A visual inspection of the data in Figure 2 indicates that there
is a connection between the two measurements, warranting further analysis and model
development.
Figure 3 displays the streamflow averages and maximums over the 81-year record.
This graphic indicates that the average yearly streamflow has increased in variability
over time, and that the yearly streamflow maximums have increased in both magnitude
and variability. This initial assessment is used cautiously during the exploratory phase.
Natural factors such as weather patterns, sediment levels, and erosion of the river banks
change over time, meaning that an amount of streamflow that leads to flooding in one
year may not lead to the same flood severity under different conditions in another year.
Additionally, the definitions of flooding and the scientific methods used to measure
streamflow have changed over time, affecting how flood events are determined and
categorized.
5 Functional Data Analysis
Functional data analysis is intended to allow observed data to be treated as a continuum of
information rather than a sequence of individual observations. In the case of streamflow,
as is often the case with functional data, the discrete data is referenced as an observation
at a point in time. The conversion of discrete data points to functional data allows the
data to be modeled in a manner that integrates its structure as a function, whereby
it can be analyzed and modeled using mathematical tools such as derivatives. In
order to successfully use mathematical properties on the functional data, a smoothing
parameter is added that assumes that observations y_j and y_{j+1} are linked and unlikely to
differ substantially. In the context of streamflow, a smoothing parameter would assume
that streamflow on January 1st would likely not differ much from streamflow on January
2nd, or, in a more granular sense, that streamflow at 10:00am is unlikely to differ greatly from
11:00am. When smoothing parameters are added in the functional analysis, the error, or
residual, between the model and raw data is increased in favor of a more easily interpreted
model. A negligible smoothing parameter, virtually zero, was used in this analysis to
maximize the use of available information, which is of particular importance considering
that there is only one observation available per month, a relatively small amount of data
in the realm of functional data analysis. The notation used to express observational error,
or residual, ε, is:

(5.1)  y_j = x(t_j) + ε_j,   j = 1, 2, ..., 12

where y_j represents the value of observation j, composed of the value of the underlying
function at the observation time plus additional error or noise.
The fundamental component of functional data is the basis function, a mathematical
description of the data within a confined space. In essence, basis functions act as
building blocks with which functions are built and parameters are easily estimated using a
matrix of coefficients. The data was transformed into functional objects with the intent
of minimizing the residual difference between the basis objects and the actual data. A set
of functional building blocks φ_k, k = 1, ..., K, or basis functions, are combined linearly such
that the function x(t) may be expressed in mathematical notation as the basis function
expansion:

(5.2)  x(t) = Σ_{k=1}^{K} c_k φ_k(t) = c^T φ(t)

The parameters c_1, c_2, ..., c_K are the coefficients of the expansion, c^T is the vector
of the K coefficients, and φ(t) = (φ_1(t), ..., φ_K(t)) is a vector of length K containing the
basis functions (Ramsay and Silverman (2005), page 44).
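Numerically, equation (5.2) is just a dot product between the coefficient vector and the evaluated basis. As an illustrative sketch only, using a toy monomial basis and hypothetical coefficients rather than the Fourier or bspline bases produced by the fda package:

```python
# Sketch of the basis expansion x(t) = sum_k c_k * phi_k(t) from (5.2).
# The monomial basis and coefficients are illustrative stand-ins; the
# paper itself uses Fourier and bspline bases via R's fda package.

def eval_basis(t, K):
    """Evaluate K monomial basis functions phi_k(t) = t**k at time t."""
    return [t ** k for k in range(K)]

def expand(coefs, t):
    """Compute x(t) = c^T phi(t) as a dot product."""
    phi = eval_basis(t, len(coefs))
    return sum(c * p for c, p in zip(coefs, phi))

coefs = [2.0, -1.0, 0.5]       # c_1, ..., c_K (hypothetical)
x_at_2 = expand(coefs, 2.0)    # 2 - 2 + 2 = 2.0
```

Any basis system (Fourier, bspline) fits this same pattern; only eval_basis changes.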
While there are numerous basis systems that can be used, the Fourier and bspline
basis systems were deemed most appropriate for the analysis. The Fourier system is
the usual choice for periodic functions, or functions displaying sinusoidal tendencies,
while the bspline system is the preferred system for non-periodic functions. Other basis
systems, such as the constant or monomial systems, were disregarded, as visual assessment
of the data indicates streamflow is neither monomial nor constant in nature.
Figure 4 shows the data modeled using the Fourier basis system. The Fourier basis
system relies on the data being separated by breaks, within which the data is modeled
using sinusoidal functions. Visual assessment of the data presented several issues which
led to a rejection of the model. The first issue identified with the Fourier system is that
numerous yearly observations are forced to take on negative stream flow values. Secondly,
the data is coerced into taking the same value at both ends of the function, meaning that
January 1st and December 31st are assigned the same streamflow value, increasing the
residual error between the function and actual data at the beginning and end of each year.
This non-uniform increase in residual error led to questions regarding the independence
of observations, and ultimately the decision to not use the Fourier basis system.
The data was also modeled using the bspline basis system, where the relationship
between the basis functions, the interior knots, and the order of the data becomes paramount
in creating a model that sufficiently represents the data and allows for computational
analysis. The relationship between these parameters is described in the equation (Ramsay
et al. (2009), page 35):
(5.3) # of basis functions = order + # of interior knots
Where the number of basis functions is determined by the order of the polynomial seg-
ments and the number of interior knots or breakpoints. In order for the model to be
optimized in its computational power, the order must be equal to three or greater. With
an order of three, analysis of the velocity and acceleration can be performed using the
first and second derivative of the data. The streamflow data is best represented yearly
due to annual weather patterns. With one streamflow observation taken per month, 11
interior knots were selected, allowing for one observation for each basis function. Thus,
the total number of basis functions selected is 14. Other parameter arrangements were
tested but ultimately led to either a loss in power, with the order being reduced, or the
disregard for data by coercing multiple observations into a single break, a result of de-
creasing the number of interior knots. Figure 5 displays the basis functions and model
using the selected bspline basis system.
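The parameter relation in (5.3) can be checked directly. The sketch below evaluates B-spline basis functions via the Cox-de Boor recursion in plain Python; the knot placement is a simplifying assumption (one knot per month boundary, with repeated boundary knots), not necessarily the placement chosen by the fda package:

```python
# Sketch of the bspline parameter relation in (5.3):
#   number of basis functions = order + number of interior knots.
# Cox-de Boor recursion in pure Python; knot values are illustrative.

def bspline_basis(i, k, t, knots):
    """Evaluate the i-th B-spline basis of order k at t (Cox-de Boor)."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + k - 1] != knots[i]:
        left = (t - knots[i]) / (knots[i + k - 1] - knots[i]) * \
               bspline_basis(i, k - 1, t, knots)
    if knots[i + k] != knots[i + 1]:
        right = (knots[i + k] - t) / (knots[i + k] - knots[i + 1]) * \
                bspline_basis(i + 1, k - 1, t, knots)
    return left + right

order = 3                        # order used in the paper
interior = list(range(1, 12))    # 11 interior knots, one per month boundary
n_basis = order + len(interior)  # = 14, matching equation (5.3)

# Full knot vector with boundary knots repeated `order` times (clamped).
knots = [0.0] * order + [float(k) for k in interior] + [12.0] * order

# B-spline bases form a partition of unity: they sum to 1 at interior t.
total = sum(bspline_basis(i, order, 6.5, knots) for i in range(n_basis))
```

The partition-of-unity check is a convenient sanity test that the basis count and knot vector are mutually consistent.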
When comparing the Fourier and bspline models, two methods were used to determine
the best fit: visual assessment and calculation of the root mean square residuals. Figure 6
offers a one-year comparison of the basis systems against the raw data, Fourier and
bspline respectively. The root mean square residual of the Fourier basis system is 851.54,
while that of the bspline basis system is 554.53. Both the visual assessment and the
computation of residuals agree that the bspline basis system is the better-fitting model.
With the bspline model in place, Figure 7 displays the raw data and functional data
of the Potomac River from 1931-1935. The mean functional streamflow, shown in
Figure 8, displays the general shape and form of the streamflow through the year,
created by averaging across the 81 years of observations. We see the pattern of average
streamflow peaking in early spring and bottoming out in the summer, which is intuitive
when snowmelt and typical rain patterns in the region are considered.
Additionally, with the model in place, the variance-covariance perspective plot in
Figure 9 and the variance-covariance contour plot in Figure 10 were generated to provide a
representation of the variance in monthly stream flows from year to year. The greatest
variance occurs in March, with November being a month with a localized increase in
variance. This indicates that while the stream flow follows a basic pattern every year, the
timing of pattern changes and the amount of increase in streamflow have a much greater
variance when the annual streamflows are at their highest levels.
Lastly, the functional data enabled the computation of derivatives, used to further the
analysis and modeling of the streamflow data. Figure 11 is a four-panel graphic
showing the functional data, residuals, and first and second derivatives for the year 1931.
With this information, modeling of both the velocity and acceleration of streamflow can
be incorporated into the analysis.
6 Cluster Analysis
In order to classify flood events using streamflow data both the number of evaluated points
in time and the registration, or occurrence, of significant events had to be considered.
When using the functional data each functional observation, or year, has the ability to be
evaluated at designated intervals. In most cases, the functions are evaluated monthly to
correspond to the monthly raw data readings, but in scenarios when more granularity is
beneficial the functions are evaluated 100 times over the course of an observational year.
Additionally, the functional data underwent a continuous registration process in which
the functional observations were aligned with a target curve using continuous criteria
such as maxima, minima, velocity, and acceleration. To align the functions, a warping
function, or smooth increasing transformation, is applied to each observational function.
Figure 12 displays the registered curves and warping functions.
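A warping function is simply a smooth, strictly increasing map of the year onto itself. The following is a minimal sketch using a piecewise-linear warp and hypothetical landmark times, rather than the smooth warps estimated by continuous registration:

```python
# Sketch of a warping function: a strictly increasing map h(t) of [0, 12]
# onto itself that moves a curve's peak onto a target peak, endpoints fixed.
# Piecewise-linear for simplicity; landmark times are hypothetical.

def warp(t, peak_from=4.0, peak_to=3.0, end=12.0):
    """Map time t so that peak_from lands on peak_to, endpoints fixed."""
    if t <= peak_from:
        return t * peak_to / peak_from
    return peak_to + (t - peak_from) * (end - peak_to) / (end - peak_from)

# Endpoints stay fixed and the peak moves from month 4 to month 3.
assert warp(0.0) == 0.0 and warp(12.0) == 12.0 and warp(4.0) == 3.0
```

Evaluating a curve at warped times x(h(t)) aligns its peak with the target curve's peak, which is the essence of registration.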
The characteristics chosen to undergo cluster analysis are the first and second derivatives
of the functional data, to determine whether the velocity or acceleration of streamflow could
be used to cluster functional observations into groups corresponding to flood severity. The
registered curves and warping functions were also selected to undergo cluster analyses as
shared flooding traits may become more apparent in the registration process. The regis-
tered curves yielded largely the same classification results as the unregistered functional
data and therefore only one is presented.
The initial clustering method used the Euclidean distance of the functional data and
registered functional data. In order to calculate the Euclidean distance, each yearly
function was evaluated at 100 points in time. The distance between the corresponding
evaluated points in each year was then plotted with colors coded to indicate the greatest
flood event within the observed year. While the method was unable to produce clusters
matching the flood severity categorization provided by NOAA, a dendrogram of the
Euclidean distances captured almost all of the non-flood years within a single cluster. The results
were ultimately disregarded as the three types of flood events were categorized into seven
different clusters.
The k-means method of clustering was also used to attempt a classification of the data.
The k-means algorithm partitions the data into k clusters, iteratively assigning each
observation to the cluster with the nearest mean. The k-means method was used to test the
registered functional data, warping functions, and streamflow velocity and acceleration.
Figure 13 is a graphic representation of the k-means process, with the color coding
representing the flood events and the icons representing the clusters derived in the
k-means process. Of the attributes tested, the warping function clustering was the most
successful, as it correctly clustered the majority of non-flood years, but it was ultimately
rejected as it failed to create clusters that matched the three flood events.
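The assignment-and-update loop underlying the k-means method can be sketched in a few lines of plain Python. The two-dimensional "curves" below are hypothetical evaluations, not the registered Potomac data:

```python
# Minimal sketch of Lloyd's k-means algorithm on evaluated curves.
# Each "observation" is one year's function evaluated at a few points;
# the data below are hypothetical, not the Potomac streamflow values.
import random

def kmeans(points, k, iters=50, seed=0):
    """Partition points into k clusters by nearest Euclidean mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        # Update step: each center moves to the mean of its group.
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

# Two well-separated groups of 2-point "curves" (hypothetical evaluations).
pts = [(1.0, 2.0), (1.2, 1.8), (9.0, 8.0), (8.8, 8.4)]
centers, groups = kmeans(pts, k=2)
```

In the analysis itself the points would be each year's curve evaluated at many times, with cluster membership then compared against the NOAA flood categories.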
A PAM analysis was also conducted using partitioning around medoids, or clustering
based on dissimilarity, resulting in a more robust version of the k-means method.
Figure 14 shows the results of the analysis, which largely failed to accurately classify the data
in flood severity categories. In this analysis, the registered data captured the majority
of non-flood events in a cluster, but failed to capture the rest of the data in accurate
clusters. The warping functions, velocity and acceleration of the registered data failed to
cluster the data in a meaningful manner.
Finally, the registered data, warping functions, first derivative and second derivative
underwent a classic cluster analysis whereby the data was plotted using the same color and
icon coding as above. Additionally, dendrograms were created in which the classification
trees were coerced into creating four clusters, each representing the severity of flood events
occurring within the year. Regardless of the clustering method (single, complete, Ward, or
average), the functional characteristics all failed to cluster the data accurately. Virtually all
clustering techniques yielded only 24-28 correct classifications, not enough to be considered
a significant finding.
7 Functional Principal Component Analysis
A principal component analysis is a procedure that transforms a number of potentially
correlated variables into a smaller number of uncorrelated variables called principal
components. The first principal component accounts for the most variability in the data,
and each successive component accounts for a decreasing amount of variability. The
principal components are composed of eigenvalues, η, and eigenfunctions, or harmonics, ξ.
The analysis searches for the harmonic in the equation below that has the largest possible
variance (Ramsay and Silverman (2005), page 100):
(7.1)  ρ_ξ(x_i) = ∫ ξ(t) x_i(t) dt

The eigenvalue variance associated with a harmonic is the value of:

(7.2)  η = max_ξ [ Σ_i ρ_ξ(x_i)² ],   subject to ∫ ξ(t)² dt = 1
When considering how an observation is deconstructed using the principal component
scores, or eigenvalues, and harmonics, the following equation provides an applied formulaic
method:

(7.3)  x_t = μ_t + Σ_{j=1}^{∞} γ_j ξ_j(t)

where
x_t = observation x at time t
μ_t = mean of the data
γ_j = principal component score vector
ξ_j(t) = j-th eigenfunction or harmonic
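Truncating the sum in (7.3) to a few components, an observation is rebuilt as the mean plus score-weighted harmonics. A minimal sketch with hypothetical mean, scores, and harmonics evaluated at four discrete time points:

```python
# Sketch of equation (7.3): x_t = mu_t + sum_j gamma_j * xi_j(t),
# truncated to J components. Mean, scores, and harmonics below are
# hypothetical stand-ins evaluated at 4 discrete time points.

def reconstruct(mu, scores, harmonics):
    """Rebuild an observation from mean + score-weighted harmonics."""
    x = list(mu)
    for gamma, xi in zip(scores, harmonics):
        for t in range(len(x)):
            x[t] += gamma * xi[t]
    return x

mu        = [10.0, 12.0, 11.0, 9.0]    # mean curve mu_t
harmonics = [[1.0, 0.0, -1.0, 0.0],    # xi_1(t)
             [0.0, 1.0, 0.0, -1.0]]    # xi_2(t)
scores    = [2.0, -1.0]                # gamma_1, gamma_2

x = reconstruct(mu, scores, harmonics)   # [12.0, 11.0, 9.0, 10.0]
```

The forecasting models of Section 8 all follow this pattern, differing only in how the future scores are estimated.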
Using a Scree test, four principal components were chosen as the most efficient number
to represent the data. The four principal components displayed in Figure 15 account for
76% of the variation in the data. The harmonics used to create the principal components
are shown in Figure 16. The first principal component captures the seasonal variation of the
streamflow, largely rising in March and minimizing in August. The greatest amount of
variation captured is in the amount of streamflow that occurs at the maximum. The
second principal component accounts for the variation that occurs from the streamflow
variability during late spring, summer and winter months. The third principal component
captures the shift in the timing of the early spring peak flows, or changes from winter to
spring. The fourth principal component highlights the increase in streamflow in the early
spring months.
A rotated principal component analysis was considered using the varimax function.
Ultimately, the procedure was not used because the analysis incorporates a penalization
for harmonic acceleration. The varimax procedure, capturing the variation in streamflow,
is estimated by penalizing the squared deviations in harmonic acceleration as measured
by the differential operator. If the streamflow variation were purely sinusoidal, the yearly
curves would be exactly zero. The reasoning for not using the procedure is that the penalty
increases with the higher order harmonics, distorting the interpretation of the latter
principal components.
With a functional principal component analysis in place using four principal components,
the plot in Figure 17 was created to determine if there is any correlation between
the variation seen in the first four principal components and the years in which floods
occurred. In the scatterplot on the axes of the first two principal components, outliers
were identified as the most extreme observations based on their distance from the mean.
Observations with a first principal component score greater than 10,000 and a second
principal component score greater than 20,000 were determined to be outliers. Of the
15 years identified as outliers, three had major floods, seven had moderate floods, four
had minor floods, and only one had no flooding at all. The range of flood types found
among the outliers, and the fact that many flood years were not identified as outliers,
indicates that the relationship between the principal components of streamflow is not
strong enough to make determinations or predictions regarding flood events without
additional information or analysis techniques.
8 Forecasting Models
Median Testing Model

(8.1)  X̂_{t+1} = μ̂ + Σ_{j=1}^{4} γ̂_{t+1} ξ̂_j(t)

where
X̂_{t+1} = forecasted observation at time t+1
μ̂ = mean of the data
γ̂_{t+1} = nested median of the principal component scores
ξ̂_j(t) = j-th eigenfunction or harmonic
Mean Testing Model

(8.2)  X̂_{t+1} = μ̂ + Σ_{j=1}^{4} β̂_{t+1} ξ̂_j(t)

where
X̂_{t+1} = forecasted observation at time t+1
μ̂ = mean of the data
β̂_{t+1} = nested mean of the principal component scores
ξ̂_j(t) = j-th eigenfunction or harmonic
AR(1) Testing Model

(8.3)  X̂_{t+1} = μ̂ + Σ_{j=1}^{4} α̂_{t+1} ξ̂_j(t)

where
X̂_{t+1} = forecasted observation at time t+1
μ̂ = mean of the data
α̂_{t+1} = nested AR(1) forecast of the principal component scores
ξ̂_j(t) = j-th eigenfunction or harmonic
RMSE (Root Mean Square Error) Formula for Forecasting

(8.4)  RMSE = (1/m) Σ_{j=1}^{m} √( Σ_{i=1}^{y} (f_{ij} − d_{ij})² )

where
m = number of months in a forecasting year, typically twelve
y = number of years in the forecast
f = a matrix of forecast values with y columns and m rows
d = a matrix of true values with y columns and m rows
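Formula (8.4) translates directly into code: for each month, take the square root of the squared errors summed across the forecast years, then average over the months. The small matrices below are hypothetical:

```python
# Sketch of formula (8.4): RMSE = (1/m) * sum_j sqrt( sum_i (f_ij - d_ij)^2 ),
# where f and d have one row per month and one column per forecast year.
# The tiny matrices below are hypothetical, not the paper's forecasts.
import math

def forecast_rmse(f, d):
    """Average over months of the root summed squared error across years."""
    m = len(f)
    total = 0.0
    for fj, dj in zip(f, d):   # one row per month
        total += math.sqrt(sum((fi - di) ** 2 for fi, di in zip(fj, dj)))
    return total / m

f = [[10.0, 11.0], [12.0, 14.0]]   # forecasts: 2 months x 2 years
d = [[10.0, 14.0], [12.0, 10.0]]   # true values
rmse = forecast_rmse(f, d)         # (sqrt(9) + sqrt(16)) / 2 = 3.5
```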
9 Forecasting as a Time Series
Using ACF (autocorrelation function) and PACF (partial autocorrelation function) anal-
yses, the registered functional data, warping functions, principal components, and first
and second derivatives were tested to determine if there was a SARIMA (seasonal au-
toregressive integrated moving average) model that explains the variation in the data. Of
the characteristics tested, no strong pattern was identified, but the principal component
scores did exhibit a slight seasonal component that warranted further analysis.
Without a specific SARIMA model with which to analyze and forecast the principal
component scores, the AR(1), or first-order autoregressive process, was used, as it is the
archetypal time series process. To determine the ability of the principal components
to be used in the forecasting process, a 71-year training set and a 10-year testing set were
used. The forecasts were nested, so that each year in the forecast was based not only on
the 71 years of the training set, but also on the data for all previously forecasted years.
The 10-year testing set was used to test the accuracy of the forecasts in predicting
streamflow. Using formula 8.4, a baseline root mean square error (RMSE) for the testing
set was calculated at 9,437.7, against which the success of the forecast models can be
measured.
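The nested AR(1) step can be sketched as: fit a lag-1 coefficient to the score series, forecast one step, append the forecast, and refit. The score series below is hypothetical, and the fit is a simple Yule-Walker-style estimate rather than the exact routine used in the analysis:

```python
# Sketch of a nested AR(1) forecast for a principal component score series.
# The score series is hypothetical; the coefficient is a simple
# Yule-Walker-style lag-1 estimate, not the exact fit used in the paper.

def ar1_coef(x):
    """Estimate (mu, phi) for x_t - mu ~ phi * (x_{t-1} - mu)."""
    mu = sum(x) / len(x)
    c = [v - mu for v in x]
    num = sum(a * b for a, b in zip(c[:-1], c[1:]))
    den = sum(a * a for a in c)
    return mu, num / den

def nested_forecast(x, steps):
    """Forecast `steps` ahead, refitting after each appended forecast."""
    series = list(x)
    out = []
    for _ in range(steps):
        mu, phi = ar1_coef(series)
        nxt = mu + phi * (series[-1] - mu)   # one-step AR(1) forecast
        out.append(nxt)
        series.append(nxt)                   # nest: forecast feeds next fit
    return out

scores = [5.0, 3.0, 4.0, 2.0, 3.5, 1.5, 2.5]   # hypothetical PC scores
fc = nested_forecast(scores, 3)
```

Each forecasted score is then multiplied by its harmonic, as in equation (8.3), to produce the streamflow forecast.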
Formula 8.1 was used in the median forecasting method. The median of the functional
principal component scores for the nested training set was calculated and then multiplied
by the functional principal component harmonics to derive forecasted streamflow values.
This model was compared to the testing set using the RMSE and graphics containing the
historical data, forecasted data, and a 95% confidence band. The principal component
analysis was performed uncentered, with the mean included in the analysis, and centered,
with the mean subtracted from the analysis and later added back to the data prior to
forecasting. The uncentered forecast had an RMSE of 8,495.5 and the centered forecast
had an RMSE of 8,510.3, both performing better than the testing set baseline when
compared to the raw data.
Formula 8.2 is the formula used in the mean forecasting method. In this scenario, the
mean of the functional principal component scores for the nested training set was
calculated and then multiplied by the functional principal component harmonics to derive
forecasted streamflow values. As with the median forecasting method, formula 8.1, this
model was compared to the testing set using the RMSE and graphics containing the
historical data, forecasted data, and a 95% confidence band. The principal component
analysis was also performed centered and uncentered, enabling a comprehensive
comparison. Both of the methods performed better than the testing set baseline, with
the centered functional principal component analysis leading to a forecast with an RMSE
of 8,526.2 and the uncentered functional principal component analysis providing an
RMSE of 8,393. While the RMSE of the forecast performed well, the model was
ultimately not used, as the uncentered principal component analysis yielded principal
component scores so close to zero that the mean was the only information captured for
the forecast.
Equation 8.3 is the final, and most successful, forecasting method, using a nested AR(1)
model to forecast the functional principal component scores. The principal component
scores were then multiplied by their corresponding harmonics to derive a streamflow
forecast. As with the modeling using the mean and median, the fit of the 10-year
forecasted model was tested using the RMSE, and graphics were created containing the
historical data, forecasted data, and a 95% confidence band to determine if the true data
lay within the boundaries of the forecasted model. Also, as with the mean and median
modeling, the forecast was tested twice, once using a centered principal component
analysis with the mean excluded and once using an uncentered analysis with the mean
included. In this scenario, the uncentered forecast had an RMSE of 8,604.4, while the
centered forecast, with an RMSE of 8,493, outperformed all of the other retained models
as well as the baseline.
Figure 18 shows the AR(1) forecast model: the black lines indicate the forecasted yearly
streamflow, with the red line representing the mean forecast over all 10 years in the
testing set. The green line represents the mean of the true data in the 10 testing-set
years, and the outer black lines represent the 95% confidence bands. In the graphic it is
easy to see that the mean of the true data lies within the confidence band and does not
differ significantly from the forecasted data. What is more difficult to interpret is why
the confidence bands switch order in the month of April. Each confidence band
represents the most extreme hydrological years: the upper band (97.5th percentile)
reaches its streamflow maximum in March, while the lower band (2.5th percentile)
reaches its maximum in April. This shift in the timing of extreme events, followed by a
uniform steep decline in streamflow, causes the flip in the confidence bands.
10 Conclusion
The necessity to model and forecast extreme hydrological events is indisputable, and as
flood events continue to occur, the need for early-warning detection methods becomes
ever more urgent. The exploratory analysis, models, and forecasting methodologies used
in the development of this paper reflect what is possible through the use of statistical
analyses of extreme hydrological events. The results found here are a testament that
stand-alone traditional methodology does not have enough power to model the
complicated and multifaceted science of streamflow analysis. The most successful aspects
of this analysis were the layered and concurrent use of many statistical tools, as in
the forecasting method, whereby principal component analysis and time series forecasting
methods were merged.
Moving forward, there are numerous attributes that should be introduced. These
models may become more adept when human influences are considered. This can be
achieved by reading different sensors along the stream, using adjusted data, or adding
categorical variables to the methodology. Additionally, global and local recurring weather
patterns can be introduced to further enhance the accuracy of the model. Finally, the
velocity and acceleration of the streamflow can be analyzed more thoroughly to determine
if there is an identifiable point prior to flood events at which the severity can be
determined and modeled.
References

Chiou, J.-M. and Müller, H.-G. (2009). Modeling hazard rates as functional data for
the analysis of cohort lifetables and mortality forecasting. Journal of the American
Statistical Association, 104, number 486, 572-585.

Huzurbazar, S. and Humphrey, Neil F. (2008). Functional clustering of time series: An
insight into length scales in subglacial flow. Water Resources Research, 44, W11420, 1-9.

Hyndman, Rob J. and Ullah, Md. Shahid (2006). Robust forecasting of mortality and
fertility rates: A functional data approach. Computational Statistics & Data Analysis,
51, 4942-4956.

Ramsay, J. O., Hooker, Giles and Graves, Spencer (2009). Functional Data Analysis with
R and MATLAB. Springer Verlag.

Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer Verlag.

Suhaila, Jamaludin, Jemain, Abdul Aziz, Hamdan, Muhammad Fauzee and Zin, Wan Zawiah
Wan (2011). Comparing rainfall patterns between regions in Peninsular Malaysia
via a functional data analysis technique. Journal of Hydrology, 411, 197-206.
[Figure 1 plot: "Five Years of Raw Data"; x-axis: month (1931-1935), y-axis: streamflow (0-35,000).]
Figure 1: Raw data of the Potomac River at the Little Falls Pumping Station site 1646500,
1931-1935. Each year is separated by a horizontal line.
19
Figure 2: Matrix plot of the Potomac River from 1931-2012 overlaid onto a 12-month
axis. Each year is plotted with a distinct symbol and color-coded by flood severity (None,
Minor, Moderate, Major).
Figure 3: The yearly averages and maximums of streamflow for the Potomac River, 1931-
2012.
Figure 4: Streamflow data converted to functional data using the Fourier basis system.
Each year is represented as its own function, determined by sinusoidal functions at
monthly breaks. Curves are color-coded by flood severity (None, Minor, Moderate, Major).
Figure 5: Streamflow data converted to functional data using the bspline basis system.
Each year is represented as its own function, determined by polynomial functions at
monthly breaks. Curves are color-coded by flood severity (None, Minor, Moderate, Major).
Figure 6: A one-year comparison of the Fourier and bspline fits against the raw data.
The increased residual error in January and December under the Fourier system was a
key factor in its rejection as a model.
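The Fourier versus bspline comparison can be sketched numerically. The streamflow values below are hypothetical stand-ins (the paper's USGS readings are not reproduced), and both fits are plain least squares with seven coefficients each: a three-harmonic Fourier basis, which is forced to be periodic, against a truncated-power cubic spline basis with three interior knots, which is free at the endpoints.

```python
import numpy as np

# Hypothetical monthly streamflow (cfs) for one year; January and
# December differ, which a periodic Fourier basis cannot honor.
months = np.arange(1, 13, dtype=float)
flow = np.array([22000., 30000., 45000., 38000., 20000., 12000.,
                 8000., 6000., 7000., 15000., 18000., 9000.])

def fourier_design(t, n_harmonics, period=12.0):
    """Columns [1, sin(kwt), cos(kwt), ...]: a periodic basis."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        w = 2.0 * np.pi * k / period
        cols += [np.sin(w * t), np.cos(w * t)]
    return np.column_stack(cols)

def spline_design(t, knots):
    """Truncated-power cubic spline basis: non-periodic."""
    cols = [np.ones_like(t), t, t**2, t**3]
    cols += [np.clip(t - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def ls_fit(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

fourier_fit = ls_fit(fourier_design(months, 3), flow)               # 7 coefficients
spline_fit = ls_fit(spline_design(months, [4.0, 7.0, 10.0]), flow)  # 7 coefficients

rmse = lambda f: float(np.sqrt(np.mean((flow - f) ** 2)))
fourier_rmse, spline_rmse = rmse(fourier_fit), rmse(spline_fit)
```

With equal degrees of freedom the spline fit is unconstrained at the year boundaries, which is exactly the endpoint behavior Figure 6 credits to the bspline system.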
Figure 7: A plot of the first five years of raw data and the same five years converted to
functional data using the bspline system.
Figure 8: The mean functional curve of the streamflow data, representing the average
streamflow across the 1931-2012 period. Each functional year is evaluated 12 times, and
the mean is calculated for each month.
Figure 9: The variance-covariance perspective plot represents the variance that occurs
in each month across the 81 functional years. The plot indicates that the variance is
maximized in the early months of the year, corresponding to when streamflow is at its
greatest.
Figure 10: The variance-covariance contour plot is an additional representation of the
variance that occurs in each month across the 81 functional years. The plot clearly shows
that the variance is maximized in March, with a local maximum in October, corresponding
to when streamflow is at its greatest and locally maximized.
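The mean curve and the variance-covariance surface both come directly from the matrix of functional years evaluated on the monthly grid. The 81-by-12 matrix below is simulated, with inflated March variability purely for illustration; the real values would come from the smoothed Potomac curves.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years = 81
t = np.arange(1, 13, dtype=float)

# Simulated stand-in for the smoothed Potomac years: a spring-peaked
# seasonal mean plus noise, with extra year-to-year spread in March.
seasonal = 8000.0 + 15000.0 * np.exp(-0.5 * ((t - 3.0) / 2.0) ** 2)
Y = seasonal + rng.normal(0.0, 3000.0, (n_years, 12))
Y[:, 2] += rng.normal(0.0, 9000.0, n_years)    # March is column index 2

mean_curve = Y.mean(axis=0)                     # Figure 8 analogue
centered = Y - mean_curve
varcov = centered.T @ centered / (n_years - 1)  # 12 x 12 surface (Figs. 9-10)

peak_var_month = int(np.argmax(np.diag(varcov)) + 1)
```

The diagonal of `varcov` is the monthly variance traced by the perspective plot; its maximum falls in the month where the simulated spread was widened.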
Figure 11: A four-panel plot of the year 1931, used to further assess the modeling potential
of the functional data. (a) The raw data as points with the functional curve as the red
line. (b) The residual difference between the raw data and the functional data. (c) The
first derivative of the functional data, representing the velocity of the streamflow. (d) The
second derivative of the functional data, representing the acceleration of the streamflow.
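The velocity and acceleration panels of Figure 11 are the first and second derivatives of the smoothed curve. On a monthly grid they can be approximated with finite differences; the curve below is a hypothetical sinusoidal year, not the 1931 data.

```python
import numpy as np

t = np.arange(1, 13, dtype=float)
# Hypothetical smoothed streamflow curve peaking in month 5
flow = 15000.0 + 12000.0 * np.sin(2.0 * np.pi * (t - 2.0) / 12.0)

velocity = np.gradient(flow, t)          # first derivative: rate of rise/fall
acceleration = np.gradient(velocity, t)  # second derivative

peak_month = int(t[np.argmax(flow)])     # streamflow maximum
peak_velocity_month = int(t[np.argmax(velocity)])
```

Velocity peaks well before the streamflow maximum here, which is the kind of lead signal the conclusion proposes examining ahead of flood events.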
Figure 12: The registered functional objects and associated warping functions. The top
panel shows the registered functional curves. In the continuous registration process the
functional observations are aligned with a target curve using continuous criteria such
as maxima, minima, velocity, and acceleration. To align the registered curves, a warping
function, shown in the bottom panel, is applied to each observational year.
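Continuous registration as used in Figure 12 is typically carried out with the fda software; a minimal landmark-style sketch over simulated years conveys the idea. Each simulated curve's peak is aligned to the peak month of the cross-sectional mean through a piecewise-linear warping function h(t).

```python
import numpy as np

t = np.arange(1, 13, dtype=float)
rng = np.random.default_rng(1)

def year_curve(peak):
    """A hypothetical spring-peaked year with its maximum near `peak`."""
    return 20000.0 * np.exp(-0.5 * ((t - peak) / 1.5) ** 2)

curves = np.array([year_curve(p) for p in rng.uniform(2.5, 5.5, 5)])
target = t[np.argmax(curves.mean(axis=0))]   # landmark of the mean curve

registered = []
for c in curves:
    peak = t[np.argmax(c)]
    # Warping function: maps [1, target, 12] -> [1, peak, 12], so the
    # registered curve c(h(t)) attains its peak at the target month.
    h = np.interp(t, [1.0, float(target), 12.0], [1.0, float(peak), 12.0])
    registered.append(np.interp(h, t, c))
registered = np.array(registered)
```

The warping functions h are the analogue of the bottom panel of Figure 12: each one records how a year's internal clock must be stretched or compressed to match the target.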
Figure 13: K-means cluster analysis of the functional streamflow data. The color coding
represents the flood severity while the icons represent the clustering identified in the
k-means procedure. Each plot represents a characteristic of the functional data tested
for its ability to cluster the data in the same manner as the flood severity: (a) warping
functions, (b) registered data, providing results similar to the functional data, (c) the first
derivative, or velocity, of the functional streamflow, and (d) the second derivative, or
acceleration, of the streamflow data.
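The clustering in Figures 13 and 14 operates on principal component scores. A compact k-means sketch over simulated scores (an ordinary-year cloud plus a handful of widely separated extreme years; the real scores come from the FPCA step) shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated 2-D principal component scores: 70 ordinary years near the
# origin and 11 well-separated extreme (flood) years.
ordinary = rng.normal(0.0, 3000.0, (70, 2))
extreme = rng.normal([40000.0, 25000.0], 3000.0, (11, 2))
scores = np.vstack([ordinary, extreme])

def kmeans(X, init, iters=25):
    """Plain Lloyd's algorithm with explicit initial centers."""
    centers = np.array(init, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Initialized from one observation in each region to keep the sketch stable.
labels, centers = kmeans(scores, [scores[0], scores[-1]])
```

On real scores the interesting question is the one the captions raise: whether the recovered clusters line up with the flood severity labels, rather than merely with overall flow magnitude.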
Figure 14: PAM analysis of the functional streamflow data. The color coding represents
the flood severity while the icons represent the clustering identified in the PAM procedure.
Each plot represents a characteristic of the functional data tested for its ability to cluster
the data in the same manner as the flood severity: (a) warping functions, (b) registered
data, providing results similar to the functional data, (c) the first derivative, or velocity,
of the functional streamflow, and (d) the second derivative, or acceleration, of the
streamflow data.
Figure 15: The first four functional principal components, capturing 32.7%, 19.6%, 12.8%,
and 11.2% of the variability, respectively. The solid line represents the mean streamflow
for all 81 years of data; the plus and minus icons represent the variability captured in
each principal component.
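On a common monthly grid, functional PCA reduces to ordinary PCA of the 81-by-12 matrix of evaluated curves: an SVD of the centered matrix yields the harmonics (weight functions), the yearly scores, and the percentage of variability per component. The matrix below is simulated with two dominant modes; the paper's percentages come from the actual streamflow curves.

```python
import numpy as np

rng = np.random.default_rng(3)
n_years = 81
t = np.arange(1, 13, dtype=float)

mean_curve = 15000.0 + 8000.0 * np.sin(2.0 * np.pi * (t - 1.0) / 12.0)
mode1 = np.sin(2.0 * np.pi * t / 12.0)   # dominant seasonal mode
mode2 = np.cos(4.0 * np.pi * t / 12.0)   # secondary shape mode
Y = (mean_curve
     + rng.normal(0.0, 9000.0, (n_years, 1)) * mode1
     + rng.normal(0.0, 5000.0, (n_years, 1)) * mode2
     + rng.normal(0.0, 1500.0, (n_years, 12)))

centered = Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
pct_var = 100.0 * s**2 / np.sum(s**2)    # per-component variability (Fig. 15)
harmonics = Vt                           # weight functions (Fig. 16)
pc_scores = centered @ Vt.T              # yearly scores used for clustering
```

The rows of `harmonics` are the curves of Figure 16, and `pc_scores` supplies the coordinates plotted in Figures 13, 14, and 17.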
Figure 16: The first four principal component harmonics. The functional data are
converted to functional principal component scores by way of the harmonics, which
capture successively decreasing variation.
Figure 17: Scatterplots of four principal components: (a) 1st and 2nd, (b) 2nd and 3rd,
(c) 3rd and 4th, and (d) 4th and 1st. This procedure is intended to detect outliers in
the data, which did not align with the flood severity groupings. However, as with the
clustering methodology, the analysis captures non-flood events in a general cluster.
Figure 18: A 10-year forecasting model using the AR(1) process on the nested principal
component scores. The black lines represent the yearly forecasts, the red line is the mean
of the forecasted data, the green line is the mean of the true data, and the blue lines
represent the 95% confidence band of the forecast.
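The Figure 18 forecast fits an AR(1) model to each principal component score series and rebuilds yearly curves from the forecast scores. The score-level step can be sketched on a simulated series: fit the AR(1) coefficients by conditional least squares, then iterate one-step-ahead forecasts.

```python
import numpy as np

rng = np.random.default_rng(4)
n_years, horizon = 81, 10

# Simulated first-PC score series with AR(1) dynamics (phi = 0.6).
z = np.zeros(n_years)
for i in range(1, n_years):
    z[i] = 0.6 * z[i - 1] + rng.normal(0.0, 2000.0)

# Conditional least squares: z_t = c + phi * z_{t-1} + e_t
X = np.column_stack([np.ones(n_years - 1), z[:-1]])
c_hat, phi_hat = np.linalg.lstsq(X, z[1:], rcond=None)[0]

forecasts = np.empty(horizon)
last = z[-1]
for i in range(horizon):
    last = c_hat + phi_hat * last    # iterated one-step-ahead forecast
    forecasts[i] = last

long_run_mean = c_hat / (1.0 - phi_hat)
```

Each forecast score would then be multiplied by its harmonic and added to the mean curve to produce the yearly forecast curves of Figure 18; with a stationary fit the forecasts decay geometrically toward the long-run mean, which is why multi-year forecasts revert toward the average streamflow pattern.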
More Related Content

What's hot

Application of water evaluation and planning (WEAP)
Application of water evaluation and planning (WEAP)Application of water evaluation and planning (WEAP)
Application of water evaluation and planning (WEAP)oloofrank
 
McGee_Final_Poster
McGee_Final_PosterMcGee_Final_Poster
McGee_Final_PosterCasey McGee
 
EXTENDED ABSTRACT REVISION 3
EXTENDED ABSTRACT REVISION 3EXTENDED ABSTRACT REVISION 3
EXTENDED ABSTRACT REVISION 3Prakash Peechmani
 
Gis Based Analysis of Supply and Forecasting Piped Water Demand in Nairobi
Gis Based Analysis of Supply and Forecasting Piped Water Demand in NairobiGis Based Analysis of Supply and Forecasting Piped Water Demand in Nairobi
Gis Based Analysis of Supply and Forecasting Piped Water Demand in Nairobiinventionjournals
 
Streamflow simulation using radar-based precipitation applied to the Illinois...
Streamflow simulation using radar-based precipitation applied to the Illinois...Streamflow simulation using radar-based precipitation applied to the Illinois...
Streamflow simulation using radar-based precipitation applied to the Illinois...Alireza Safari
 
Groundwater modelling (an Introduction)
Groundwater modelling (an Introduction)Groundwater modelling (an Introduction)
Groundwater modelling (an Introduction)Putika Ashfar Khoiri
 
Runoff modelling using hec hms for rural watershed
Runoff modelling using hec hms for rural watershedRunoff modelling using hec hms for rural watershed
Runoff modelling using hec hms for rural watershedEditorIJAERD
 
HWRS2014_Diermanse-Paper148
HWRS2014_Diermanse-Paper148HWRS2014_Diermanse-Paper148
HWRS2014_Diermanse-Paper148Rob Ayre
 
Coupling of Surface water and Groundwater Models
Coupling of Surface water and Groundwater Models Coupling of Surface water and Groundwater Models
Coupling of Surface water and Groundwater Models TIPU SULTAN BADAGAN
 
Somers et al 2016
Somers et al 2016Somers et al 2016
Somers et al 2016Joe Quijano
 
Water Resource Sustainability of the Palouse Region: A Systems Approach
Water Resource Sustainability of the Palouse Region: A Systems ApproachWater Resource Sustainability of the Palouse Region: A Systems Approach
Water Resource Sustainability of the Palouse Region: A Systems ApproachRamesh Dhungel
 
The Development of a Catchment Management Modelling System for the Googong Re...
The Development of a Catchment Management Modelling System for the Googong Re...The Development of a Catchment Management Modelling System for the Googong Re...
The Development of a Catchment Management Modelling System for the Googong Re...GavanThomas
 
SRBC Report on Water Quality in the Marcellus Region - June 2015
SRBC Report on Water Quality in the Marcellus Region - June 2015SRBC Report on Water Quality in the Marcellus Region - June 2015
SRBC Report on Water Quality in the Marcellus Region - June 2015Marcellus Drilling News
 
Varduhi surmalyan phd thesis summary
Varduhi surmalyan phd thesis summaryVarduhi surmalyan phd thesis summary
Varduhi surmalyan phd thesis summaryVarduhi Surmalyan
 
Assessment of climate change impact on water availability of bilate watershed...
Assessment of climate change impact on water availability of bilate watershed...Assessment of climate change impact on water availability of bilate watershed...
Assessment of climate change impact on water availability of bilate watershed...Alexander Decker
 
SULI_SUMMER_2015_FINAL Research Paper_Cronin_John
SULI_SUMMER_2015_FINAL Research Paper_Cronin_JohnSULI_SUMMER_2015_FINAL Research Paper_Cronin_John
SULI_SUMMER_2015_FINAL Research Paper_Cronin_JohnJack Cronin
 

What's hot (19)

Application of water evaluation and planning (WEAP)
Application of water evaluation and planning (WEAP)Application of water evaluation and planning (WEAP)
Application of water evaluation and planning (WEAP)
 
McGee_Final_Poster
McGee_Final_PosterMcGee_Final_Poster
McGee_Final_Poster
 
EXTENDED ABSTRACT REVISION 3
EXTENDED ABSTRACT REVISION 3EXTENDED ABSTRACT REVISION 3
EXTENDED ABSTRACT REVISION 3
 
Gis Based Analysis of Supply and Forecasting Piped Water Demand in Nairobi
Gis Based Analysis of Supply and Forecasting Piped Water Demand in NairobiGis Based Analysis of Supply and Forecasting Piped Water Demand in Nairobi
Gis Based Analysis of Supply and Forecasting Piped Water Demand in Nairobi
 
Streamflow simulation using radar-based precipitation applied to the Illinois...
Streamflow simulation using radar-based precipitation applied to the Illinois...Streamflow simulation using radar-based precipitation applied to the Illinois...
Streamflow simulation using radar-based precipitation applied to the Illinois...
 
Groundwater modelling (an Introduction)
Groundwater modelling (an Introduction)Groundwater modelling (an Introduction)
Groundwater modelling (an Introduction)
 
Hydrological modelling i5
Hydrological modelling i5Hydrological modelling i5
Hydrological modelling i5
 
Runoff modelling using hec hms for rural watershed
Runoff modelling using hec hms for rural watershedRunoff modelling using hec hms for rural watershed
Runoff modelling using hec hms for rural watershed
 
HWRS2014_Diermanse-Paper148
HWRS2014_Diermanse-Paper148HWRS2014_Diermanse-Paper148
HWRS2014_Diermanse-Paper148
 
Coupling of Surface water and Groundwater Models
Coupling of Surface water and Groundwater Models Coupling of Surface water and Groundwater Models
Coupling of Surface water and Groundwater Models
 
Weap final
Weap finalWeap final
Weap final
 
Somers et al 2016
Somers et al 2016Somers et al 2016
Somers et al 2016
 
Water Resource Sustainability of the Palouse Region: A Systems Approach
Water Resource Sustainability of the Palouse Region: A Systems ApproachWater Resource Sustainability of the Palouse Region: A Systems Approach
Water Resource Sustainability of the Palouse Region: A Systems Approach
 
The Development of a Catchment Management Modelling System for the Googong Re...
The Development of a Catchment Management Modelling System for the Googong Re...The Development of a Catchment Management Modelling System for the Googong Re...
The Development of a Catchment Management Modelling System for the Googong Re...
 
SRBC Report on Water Quality in the Marcellus Region - June 2015
SRBC Report on Water Quality in the Marcellus Region - June 2015SRBC Report on Water Quality in the Marcellus Region - June 2015
SRBC Report on Water Quality in the Marcellus Region - June 2015
 
Varduhi surmalyan phd thesis summary
Varduhi surmalyan phd thesis summaryVarduhi surmalyan phd thesis summary
Varduhi surmalyan phd thesis summary
 
Assessment of climate change impact on water availability of bilate watershed...
Assessment of climate change impact on water availability of bilate watershed...Assessment of climate change impact on water availability of bilate watershed...
Assessment of climate change impact on water availability of bilate watershed...
 
Groundwater Modeling and GIS
Groundwater Modeling and GISGroundwater Modeling and GIS
Groundwater Modeling and GIS
 
SULI_SUMMER_2015_FINAL Research Paper_Cronin_John
SULI_SUMMER_2015_FINAL Research Paper_Cronin_JohnSULI_SUMMER_2015_FINAL Research Paper_Cronin_John
SULI_SUMMER_2015_FINAL Research Paper_Cronin_John
 

Viewers also liked

TEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of WorkTEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of WorkVolker Hirsch
 
Isabelle Guyon, President, ChaLearn at MLconf SF - 11/13/15
Isabelle Guyon, President, ChaLearn at MLconf SF - 11/13/15Isabelle Guyon, President, ChaLearn at MLconf SF - 11/13/15
Isabelle Guyon, President, ChaLearn at MLconf SF - 11/13/15MLconf
 
Introduction to Functional Data Analysis
Introduction to Functional Data AnalysisIntroduction to Functional Data Analysis
Introduction to Functional Data AnalysisRené Franck Essomba
 
Application of Fisher Linear Discriminant Analysis to Speech/Music Classifica...
Application of Fisher Linear Discriminant Analysis to Speech/Music Classifica...Application of Fisher Linear Discriminant Analysis to Speech/Music Classifica...
Application of Fisher Linear Discriminant Analysis to Speech/Music Classifica...Lushanthan Sivaneasharajah
 
Discriminant analysis
Discriminant analysisDiscriminant analysis
Discriminant analysisBhasker Rajan
 
A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F...
A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F...A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F...
A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F...CSCJournals
 
Climate Change Impact Assessment on Hydrological Regime of Kali Gandaki Basin
Climate Change Impact Assessment on Hydrological Regime of Kali Gandaki BasinClimate Change Impact Assessment on Hydrological Regime of Kali Gandaki Basin
Climate Change Impact Assessment on Hydrological Regime of Kali Gandaki BasinHI-AWARE
 
2013 PLSC Track, Hydrographic Surveying and GIS by Carlos Caceres
2013 PLSC Track, Hydrographic Surveying and GIS by Carlos Caceres2013 PLSC Track, Hydrographic Surveying and GIS by Carlos Caceres
2013 PLSC Track, Hydrographic Surveying and GIS by Carlos CaceresGIS in the Rockies
 
ERP MRP Comparison
ERP MRP ComparisonERP MRP Comparison
ERP MRP ComparisonerpSOFTapp
 
Arquitectura y escultura del Quattrocento
Arquitectura y escultura del QuattrocentoArquitectura y escultura del Quattrocento
Arquitectura y escultura del Quattrocentofrancisco gonzalez
 
Hydrographic survey for std not print
Hydrographic survey for std not printHydrographic survey for std not print
Hydrographic survey for std not printGokul Saud
 
Presentación. Pintura del Quattrocento
Presentación.   Pintura del Quattrocento Presentación.   Pintura del Quattrocento
Presentación. Pintura del Quattrocento francisco gonzalez
 
Seminar on Hydrological modelling
Seminar on Hydrological modellingSeminar on Hydrological modelling
Seminar on Hydrological modellingvishvam Pancholi
 

Viewers also liked (17)

TEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of WorkTEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of Work
 
seminar slides
 seminar slides  seminar slides
seminar slides
 
Isabelle Guyon, President, ChaLearn at MLconf SF - 11/13/15
Isabelle Guyon, President, ChaLearn at MLconf SF - 11/13/15Isabelle Guyon, President, ChaLearn at MLconf SF - 11/13/15
Isabelle Guyon, President, ChaLearn at MLconf SF - 11/13/15
 
Introduction to Functional Data Analysis
Introduction to Functional Data AnalysisIntroduction to Functional Data Analysis
Introduction to Functional Data Analysis
 
Intro to docker
Intro to dockerIntro to docker
Intro to docker
 
Application of Fisher Linear Discriminant Analysis to Speech/Music Classifica...
Application of Fisher Linear Discriminant Analysis to Speech/Music Classifica...Application of Fisher Linear Discriminant Analysis to Speech/Music Classifica...
Application of Fisher Linear Discriminant Analysis to Speech/Music Classifica...
 
Discriminant analysis
Discriminant analysisDiscriminant analysis
Discriminant analysis
 
Technical
TechnicalTechnical
Technical
 
A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F...
A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F...A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F...
A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F...
 
Climate Change Impact Assessment on Hydrological Regime of Kali Gandaki Basin
Climate Change Impact Assessment on Hydrological Regime of Kali Gandaki BasinClimate Change Impact Assessment on Hydrological Regime of Kali Gandaki Basin
Climate Change Impact Assessment on Hydrological Regime of Kali Gandaki Basin
 
2013 PLSC Track, Hydrographic Surveying and GIS by Carlos Caceres
2013 PLSC Track, Hydrographic Surveying and GIS by Carlos Caceres2013 PLSC Track, Hydrographic Surveying and GIS by Carlos Caceres
2013 PLSC Track, Hydrographic Surveying and GIS by Carlos Caceres
 
ERP MRP Comparison
ERP MRP ComparisonERP MRP Comparison
ERP MRP Comparison
 
Arquitectura y escultura del Quattrocento
Arquitectura y escultura del QuattrocentoArquitectura y escultura del Quattrocento
Arquitectura y escultura del Quattrocento
 
Hydrographic survey for std not print
Hydrographic survey for std not printHydrographic survey for std not print
Hydrographic survey for std not print
 
Presentación. Pintura del Quattrocento
Presentación.   Pintura del Quattrocento Presentación.   Pintura del Quattrocento
Presentación. Pintura del Quattrocento
 
Seminar on Hydrological modelling
Seminar on Hydrological modellingSeminar on Hydrological modelling
Seminar on Hydrological modelling
 
Recursos en la red
Recursos en la red Recursos en la red
Recursos en la red
 

Similar to Functional Data Analysis of Hydrological Data in the Potomac River Valley

Assessment of Water Quality in Imo River Estuary Using Multivariate Statistic...
Assessment of Water Quality in Imo River Estuary Using Multivariate Statistic...Assessment of Water Quality in Imo River Estuary Using Multivariate Statistic...
Assessment of Water Quality in Imo River Estuary Using Multivariate Statistic...IOSR Journals
 
Applying the “abcd” monthly water balance model for some regions in the unite...
Applying the “abcd” monthly water balance model for some regions in the unite...Applying the “abcd” monthly water balance model for some regions in the unite...
Applying the “abcd” monthly water balance model for some regions in the unite...Alexander Decker
 
HYDROLOGICAL STUDY OF MAN (CHANDRABHAGA) RIVER
HYDROLOGICAL STUDY OF MAN (CHANDRABHAGA) RIVER HYDROLOGICAL STUDY OF MAN (CHANDRABHAGA) RIVER
HYDROLOGICAL STUDY OF MAN (CHANDRABHAGA) RIVER Anil Shirgire
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Improving statistical models for flood risk assessment
Improving statistical models for flood risk assessmentImproving statistical models for flood risk assessment
Improving statistical models for flood risk assessmentRoss Towe
 
Floodplain Modelling Materials and Methodology
Floodplain Modelling Materials and MethodologyFloodplain Modelling Materials and Methodology
Floodplain Modelling Materials and MethodologyIDES Editor
 
An Integrative Decision Support System for Managing Water Resources under Inc...
An Integrative Decision Support System for Managing Water Resources under Inc...An Integrative Decision Support System for Managing Water Resources under Inc...
An Integrative Decision Support System for Managing Water Resources under Inc...National Institute of Food and Agriculture
 
Best Fit and Selection of Probability Distribution Models for Frequency Analy...
Best Fit and Selection of Probability Distribution Models for Frequency Analy...Best Fit and Selection of Probability Distribution Models for Frequency Analy...
Best Fit and Selection of Probability Distribution Models for Frequency Analy...IJERD Editor
 
The National Water Census
The National Water CensusThe National Water Census
The National Water CensusKim Beidler
 
Application of water evaluation and planning (weap
Application of water evaluation and planning (weapApplication of water evaluation and planning (weap
Application of water evaluation and planning (weapjigme thinley
 
StreamFlow Variability of 21 Watersheds, Oregon
StreamFlow Variability of 21 Watersheds, OregonStreamFlow Variability of 21 Watersheds, Oregon
StreamFlow Variability of 21 Watersheds, OregonDonnych Diaz
 
DoD Poster Howell 2010
DoD Poster Howell 2010DoD Poster Howell 2010
DoD Poster Howell 2010kehowell
 
A New Multi-Objective Optimization Model of Water Resources Considering Fairn...
A New Multi-Objective Optimization Model of Water Resources Considering Fairn...A New Multi-Objective Optimization Model of Water Resources Considering Fairn...
A New Multi-Objective Optimization Model of Water Resources Considering Fairn...Carlos Gamarra
 

Similar to Functional Data Analysis of Hydrological Data in the Potomac River Valley (20)

MUSE
MUSEMUSE
MUSE
 
Assessment of Water Quality in Imo River Estuary Using Multivariate Statistic...
Assessment of Water Quality in Imo River Estuary Using Multivariate Statistic...Assessment of Water Quality in Imo River Estuary Using Multivariate Statistic...
Assessment of Water Quality in Imo River Estuary Using Multivariate Statistic...
 
Poster
PosterPoster
Poster
 
Q4103103110
Q4103103110Q4103103110
Q4103103110
 
Cochran_5.2.1
Cochran_5.2.1Cochran_5.2.1
Cochran_5.2.1
 
Applying the “abcd” monthly water balance model for some regions in the unite...
Applying the “abcd” monthly water balance model for some regions in the unite...Applying the “abcd” monthly water balance model for some regions in the unite...
Applying the “abcd” monthly water balance model for some regions in the unite...
 
Water pipes
Water pipesWater pipes
Water pipes
 
HYDROLOGICAL STUDY OF MAN (CHANDRABHAGA) RIVER
HYDROLOGICAL STUDY OF MAN (CHANDRABHAGA) RIVER HYDROLOGICAL STUDY OF MAN (CHANDRABHAGA) RIVER
HYDROLOGICAL STUDY OF MAN (CHANDRABHAGA) RIVER
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Improving statistical models for flood risk assessment
Improving statistical models for flood risk assessmentImproving statistical models for flood risk assessment
Improving statistical models for flood risk assessment
 
Floodplain Modelling Materials and Methodology
Floodplain Modelling Materials and MethodologyFloodplain Modelling Materials and Methodology
Floodplain Modelling Materials and Methodology
 
An Integrative Decision Support System for Managing Water Resources under Inc...
An Integrative Decision Support System for Managing Water Resources under Inc...An Integrative Decision Support System for Managing Water Resources under Inc...
An Integrative Decision Support System for Managing Water Resources under Inc...
 
Best Fit and Selection of Probability Distribution Models for Frequency Analy...
Best Fit and Selection of Probability Distribution Models for Frequency Analy...Best Fit and Selection of Probability Distribution Models for Frequency Analy...
Best Fit and Selection of Probability Distribution Models for Frequency Analy...
 
The National Water Census
The National Water CensusThe National Water Census
The National Water Census
 
Application of water evaluation and planning (weap
Application of water evaluation and planning (weapApplication of water evaluation and planning (weap
Application of water evaluation and planning (weap
 
StreamFlow Variability of 21 Watersheds, Oregon
StreamFlow Variability of 21 Watersheds, OregonStreamFlow Variability of 21 Watersheds, Oregon
StreamFlow Variability of 21 Watersheds, Oregon
 
DoD Poster Howell 2010
DoD Poster Howell 2010DoD Poster Howell 2010
DoD Poster Howell 2010
 
A New Multi-Objective Optimization Model of Water Resources Considering Fairn...
A New Multi-Objective Optimization Model of Water Resources Considering Fairn...A New Multi-Objective Optimization Model of Water Resources Considering Fairn...
A New Multi-Objective Optimization Model of Water Resources Considering Fairn...
 
Runoff estimation and water management for Holetta River, Awash subbasin, Eth...
Runoff estimation and water management for Holetta River, Awash subbasin, Eth...Runoff estimation and water management for Holetta River, Awash subbasin, Eth...
Runoff estimation and water management for Holetta River, Awash subbasin, Eth...
 
G1304025056
G1304025056G1304025056
G1304025056
 

Functional Data Analysis of Hydrological Data in the Potomac River Valley

  • 1. Functional Data Analysis of Hydrological Data in the Potomac River Valley Aaron Zelmanow American University October 31, 2012 Acknowledgement to: Dr. Inga Maslova - American University Dr. Andres Ticlavilca - Utah State University 1 Abstract Flooding and droughts are extreme hydrological events that affect the United States economically and socially. The severity and unpredictability of flooding causes billions of dollars in damage and catastrophic loss of life. In this context, there is an urgent need to build a firm scientific basis for adaptation by developing and applying new modeling techniques for accurate streamflow characterization and reliable hydrological forecasting. The goal is to use numerical streamflow characteristics in order to explore the ability to classify, model, and estimate the likelihood of extreme events in the eastern United States, mainly the Potomac River. Functional data analysis techniques are used to study yearly streamflow patterns, with the extreme streamflow events characterized via functional 1
utilized in the study of the Potomac River hydrological data. Additionally, the merging of clustering techniques and functional data analysis was identified and discussed by Huzurbazar and Humphrey (2008) in their study of subglacial water flow. To merge the techniques of functional data analysis and time series analysis, the mortality and fertility rate analyses of Chiou and Müller (2009) and Hyndman and Ullah (2006) were referenced. Their papers provided insight into the process of converting data into functional data, the benefits of smoothing techniques, and the resulting fit of functional data into time series ARIMA models. This analysis expands those techniques by reviewing various aspects of functional data within the time series analysis.
3 Data Description

The data set used to model the streamflow is from the Potomac River and is measured in cubic feet per second. The United States Geological Survey (USGS) has positioned sensors at intervals along the river and has taken monthly readings from 1931 to 2012. A graphic of the time series for the first five years of data is found in Figure 1. The survey site used for this analysis is the Little Falls Pump Station on the Potomac River, site 1646500 (http://ida.water.usgs.gov).

Little Falls Pumping Station is a dam and pumping station in Maryland used to provide a water supply for Washington, DC. Water pumped from the station is extracted upriver of the Chesapeake and Ohio Canal and travels downhill to purification plants for the nation's capital. The sensors at the pumping station are affected by a multitude of man-made and natural factors that affect the predictability of streamflow. The river provides water for households and irrigation and supports industrial and economic development (www.nature.org/ourinitiatives/). Natural factors that increase the variability in streamflow include melting snow and ice, direct precipitation, and the stream gradient of both the identified river and other connecting water sources. The physical proximity of the Potomac River to both the eastern coast of the United States and the Shenandoah Mountains creates highly variable amounts of precipitation.

As a gaining river, the Potomac increases in streamflow with distance from its origin. As the most southerly of the sensors on the Potomac River, the Little Falls Pumping Station is furthest from the origin and subject to the maximum increase in streamflow. Combining this with the volatility associated with the man-made and natural factors discussed above, the variability in streamflow through time is maximized at the pumping station, and there is greater potential for building a streamflow model that
can be utilized across a wide range of rivers and locations.

To fit the data using functional data analysis, the available record had to be reduced by a total of seven months. The first three months of 1931 and the first four months of 2012 were removed to satisfy the R statistical software requirement that all functional curves be the same length. The length chosen was twelve months, analyzed as January 1st to December 31st, ensuring that the times of year containing the greatest amount of activity, spring and fall, are centered within each function.

4 Exploratory Analysis

The graphic of the streamflow data in Figure 2 displays 81 years of streamflow data as a matrix plot, overlaid on a 12-month axis. As determined by NOAA (www.erh.noaa.gov), flooding at the Little Falls Pumping Station on the Potomac River (site 1646500) is categorized as major, moderate, minor, and none; the respective color coding for flooding magnitude is red, purple, navy, and sky blue. During the exploratory phase, two periods of increased streamflow were identified, once in the spring and again in the fall, with floods occurring multiple times within some years.

A component of the analysis is to determine whether a year with a flood event can be identified prior to its occurrence. To this end, flood severity coding is applied according to the greatest flood event occurring within each year. Flooding along the Potomac River is determined by the crest of the water, measured in feet. A major challenge in modeling the data has been to reconcile the flood events, measured in feet, with the data used to create the models, streamflow measured in cubic feet per second. A visual inspection of the data in Figure 2 indicates that there is a connection between the two measurements, warranting further analysis and model
development.

Figure 3 displays the streamflow averages and maximums over the 81-year record. This graphic indicates that the average yearly streamflow has increased in variability over time. There are also visual indicators that the yearly streamflow maximums have increased in both magnitude and variability. This initial assessment is used cautiously during the exploratory phase. Natural factors such as weather patterns, sediment levels, and erosion of the river banks change over time, meaning that an amount of streamflow that leads to flooding in one year may not lead to the same flood severity under different conditions in another year. Additionally, the definitions of flooding and scientific advances in how streamflow is measured have changed over time, affecting how flood events are determined and categorized.

5 Functional Data Analysis

Functional data analysis allows observed data to be treated as a continuum of information rather than a sequence of individual observations. In the case of streamflow, as is often the case with functional data, the discrete data are referenced as observations at points in time. The conversion of discrete data points to functional data allows them to be modeled in a manner that integrates the structure of the data as a function, so that they can be analyzed and modeled using mathematical tools such as derivatives.

In order to successfully use mathematical properties on the functional data, a smoothing parameter is added that assumes observations y_j and y_{j+1} are linked and unlikely to differ substantially. In the context of streamflow, a smoothing parameter would assume that streamflow on January 1st is unlikely to differ much from streamflow on January 2nd, or in a more granular sense, that streamflow at 10:00am is unlikely to differ greatly from
11:00am. When smoothing parameters are added in the functional analysis, the error, or residual, between the model and the raw data is increased in favor of a more easily interpreted model. A negligible smoothing parameter, virtually zero, was used in this analysis to maximize the use of available information, which is of particular importance considering that only one observation is available per month, a relatively small amount of data in the realm of functional data analysis. The notation used to express the observational error, or residual, $\epsilon$ is:

(5.1)  $y_j = x(t_j) + \epsilon_j, \quad j = 1, 2, \ldots, 12$

where $y_j$ represents the value of observation $j$, composed of the value of the function at the observation time plus additional error or noise.

The fundamental component of functional data is the basis function, a mathematical description of the data within a confined space. At their essence, basis functions act as building blocks with which functions are built and parameters are easily estimated using a matrix of coefficients. The data were transformed into functional objects with the intent of minimizing the residual difference between the basis objects and the actual data. A set of functional building blocks $\phi_k$, $k = 1, \ldots, K$, or basis functions, is combined linearly such that the function $x(t)$ may be expressed as the basis function expansion:

(5.2)  $x(t) = \sum_{k=1}^{K} c_k \phi_k(t) = \mathbf{c}^T \boldsymbol{\phi}(t)$

The parameters $c_1, c_2, \ldots, c_K$ are the coefficients of the expansion, $\mathbf{c}^T$ is the vector of $K$ coefficients, and $\boldsymbol{\phi}(t) = (\phi_1(t), \ldots, \phi_K(t))$ is a vector of length $K$ containing the basis functions (Ramsay and Silverman (2005), page 44).
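The basis expansion in equation (5.2) can be sketched numerically. The following is a minimal illustration, not the paper's fitted model: the basis is a small Fourier-style system and the coefficient values are hypothetical.

```python
import numpy as np

def fourier_basis(t, K, period=12.0):
    """Evaluate K Fourier basis functions at times t.

    Basis: 1, sin(wt), cos(wt), sin(2wt), cos(2wt), ...
    Returns an array of shape (len(t), K).
    """
    t = np.asarray(t, dtype=float)
    w = 2.0 * np.pi / period
    cols = [np.ones_like(t)]          # constant basis function
    harmonic = 1
    while len(cols) < K:
        cols.append(np.sin(harmonic * w * t))
        if len(cols) < K:
            cols.append(np.cos(harmonic * w * t))
        harmonic += 1
    return np.column_stack(cols)

# x(t) = sum_k c_k phi_k(t) = c^T phi(t): a functional observation is just
# a coefficient vector paired with the shared basis.
t = np.arange(1, 13)                                     # monthly evaluation points
c = np.array([10000.0, 3000.0, -2000.0, 500.0, 250.0])   # hypothetical coefficients
Phi = fourier_basis(t, K=len(c))                         # shape (12, 5)
x = Phi @ c                                              # evaluated curve, shape (12,)
```

Evaluating a curve, its derivatives, or a linear functional of it then reduces to matrix operations on the coefficient vector, which is the computational appeal of the basis representation.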
While there are numerous basis systems that can be used, the Fourier and bspline basis systems were deemed most appropriate for the analysis. The Fourier system is the usual choice for periodic functions, or functions displaying sinusoidal tendencies, while the bspline system is the preferred system for non-periodic functions. Other basis systems, such as the constant or monomial systems, were disregarded because visual assessment of the data indicates streamflow is neither monomial nor constant in nature.

Figure 4 shows the data modeled using the Fourier basis system. The Fourier basis system relies on the data being separated by breaks, within which the data is modeled using sinusoidal functions. Visual assessment of the fit presented several issues which led to a rejection of the model. The first issue identified with the Fourier system is that numerous yearly observations are forced to take on negative streamflow values. Secondly, the data is coerced into taking the same value at both ends of the function, meaning that January 1st and December 31st are assigned the same streamflow value, increasing the residual error between the function and the actual data at the beginning and end of each year. This non-uniform increase in residual error led to questions regarding the independence of observations, and ultimately to the decision not to use the Fourier basis system.

The data was also modeled using the bspline basis system, where the relationship between the basis functions, the interior knots, and the order of the polynomial segments becomes paramount in creating a model that sufficiently represents the data and allows for computational analysis. The relationship between the parameters is described by the equation (Ramsay et al. (2009), page 35):

(5.3)  $\#\text{ of basis functions} = \text{order} + \#\text{ of interior knots}$

where the number of basis functions is determined by the order of the polynomial segments and the number of interior knots, or breakpoints. In order for the model to be
optimized in its computational power, the order must be three or greater. With an order of three, analysis of the velocity and acceleration can be performed using the first and second derivatives of the data. The streamflow data is best represented yearly due to annual weather patterns. With one streamflow observation taken per month, 11 interior knots were selected, allowing one observation for each basis function. Thus, the total number of basis functions selected is 14. Other parameter arrangements were tested but ultimately led either to a loss in power, with the order being reduced, or to the disregard of data by coercing multiple observations into a single break, a result of decreasing the number of interior knots. Figure 5 displays the basis functions and the model fit using the selected bspline basis system.

When comparing the Fourier and bspline models, two methods determined the best fit: visual assessment and calculation of the root mean square residuals. Figure 6 offers a one-year comparison of the basis systems against the raw data, Fourier and bspline respectively. The root mean square residual of the Fourier basis system is 851.54, while that of the bspline basis system is 554.53. Both the visual assessment and the computation of residuals agree that the bspline basis system is the best-fitting model.

With the bspline model in place, the five-year graphic in Figure 7 displays the raw data and functional data of the Potomac River from 1931-1935. The mean functional streamflow was also obtained, shown in Figure 8, displaying the general shape of the streamflow through the year, created by averaging across the 81 years of observations. The average streamflow peaks in early spring and reaches its minimum in the summer, which is intuitive when snow melt and typical rain patterns in the region are considered.
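The counting rule of equation (5.3) and the root-mean-square comparison of fits can be sketched with scipy's least-squares spline fitting. This is a hedged sketch, not the paper's R code: the monthly values and the knot placement are hypothetical, and scipy parameterizes splines by degree k, where order = k + 1 in the fda sense.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

# One year of hypothetical monthly streamflow readings (cfs).
months = np.arange(1.0, 13.0)
flow = np.array([9000, 12000, 21000, 15000, 8000, 5000,
                 3500, 3000, 4000, 6000, 11000, 9500], dtype=float)

degree = 3                               # scipy's k; fda "order" is k + 1
interior = np.array([4.0, 7.0, 10.0])    # interior knots (hypothetical placement)

# Clamped knot vector: boundary knots repeated degree + 1 times.
knots = np.concatenate([[months[0]] * (degree + 1),
                        interior,
                        [months[-1]] * (degree + 1)])

spline = make_lsq_spline(months, flow, knots, k=degree)

# of basis functions = order + # of interior knots  (equation 5.3)
n_basis = len(spline.c)
assert n_basis == (degree + 1) + len(interior)

# Root mean square residual between the smooth and the raw readings,
# the same criterion used above to compare the Fourier and bspline fits.
rms = np.sqrt(np.mean((spline(months) - flow) ** 2))
```

Shrinking the number of interior knots makes `n_basis` smaller and `rms` generally larger, which mirrors the trade-off described above between model power and fidelity to the observations.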
Additionally, with the model in place, the variance-covariance perspective plot in Figure 9 and the variance-covariance contour plot in Figure 10 were generated to provide a
representation of the variance in monthly streamflows from year to year. The greatest variance occurs in March, with November showing a localized increase in variance. This indicates that while the streamflow follows a basic pattern every year, the timing of pattern changes and the amount of increase in streamflow have a much greater variance when the annual streamflows are at their highest levels. Lastly, the functional data enabled the computation of derivatives, used to further the analysis and modeling of the streamflow data. Figure 11 displays a four-piece graphic of the functional data, residuals, and first and second derivatives for the year 1931. With this information, modeling of both the velocity and acceleration of streamflow can be incorporated into the analysis.

6 Cluster Analysis

In order to classify flood events using streamflow data, both the number of evaluated points in time and the registration, or occurrence, of significant events had to be considered. When using the functional data, each functional observation, or year, can be evaluated at designated intervals. In most cases, the functions are evaluated monthly to correspond to the monthly raw data readings, but in scenarios where more granularity is beneficial the functions are evaluated 100 times over the course of an observational year. Additionally, the functional data underwent a continuous registration process in which the functional observations were aligned with a target curve using continuous criteria such as maxima, minima, velocity, and acceleration. To align the functions, a warping function, or smooth increasing transformation, is applied to each observational function. Figure 12 displays the registered curves and warping functions.

The characteristics chosen to undergo cluster analysis are the first and second derivatives of the functional data, to determine whether the velocity or acceleration of streamflow could
be used to cluster functional observations into groups corresponding to flood severity. The registered curves and warping functions were also selected for cluster analysis, as shared flooding traits may become more apparent in the registration process. The registered curves yielded largely the same classification results as the unregistered functional data, and therefore only one is presented.

The initial clustering method used the Euclidean distance of the functional data and the registered functional data. To calculate the Euclidean distance, each yearly function was evaluated at 100 points in time. The distance between the corresponding evaluated points in each year was then plotted, with colors coded to indicate the greatest flood event within the observed year. While unable to produce clusters that matched the flood severity categorization provided by NOAA, a dendrogram of the Euclidean distance method captured almost all of the non-flood years within a single cluster. The results were ultimately disregarded because the three types of flood events were scattered across seven different clusters.

The k-means method of clustering was also used to attempt a classification of the data. The k-means method uses algorithmic computations to partition data into k clusters, with each observation assigned to a cluster according to its distance from the cluster means. The k-means method was used to test the registered functional data, the warping functions, and the streamflow velocity and acceleration. Figure 13 is a graphic representation of the k-means process, with the color coding representing the flood events and the icons representing the clusters derived in the k-means process. Of the attributes tested, the warping function clustering was the most successful, as it correctly clustered the majority of non-flood years, but it was ultimately rejected because it failed to create clusters that matched the three flood-event categories.
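The clustering step above can be sketched as k-means over yearly curves evaluated at 100 points, so that each year becomes one row of a matrix and the Euclidean distance between years is the distance between rows. This is an illustrative sketch with synthetic curves and a plain Lloyd's-iteration k-means, not the paper's R workflow.

```python
import numpy as np

def kmeans(X, k, n_iter=50, rng=None):
    """Plain Lloyd's iteration: assign rows to nearest centroid, recompute."""
    rng = np.random.default_rng(rng)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Pairwise distances, shape (n_rows, k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # skip empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)

# 81 synthetic yearly curves: a single spring peak with hypothetical
# variation in timing and magnitude, standing in for the registered data.
years = []
for _ in range(81):
    peak = 0.2 + 0.05 * rng.standard_normal()
    scale = 15000.0 + 5000.0 * rng.standard_normal()
    years.append(scale * np.exp(-((t - peak) ** 2) / 0.01))
X = np.asarray(years)                         # shape (81, 100)

# k = 4, one cluster per flood-severity category (major/moderate/minor/none).
labels, centroids = kmeans(X, 4, rng=1)
# In the paper, labels[i] would be compared to year i's NOAA severity coding.
```

As the text reports, nothing forces such clusters to line up with flood severity; the comparison of `labels` against the severity coding is exactly where the method succeeded only for non-flood years.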
A PAM analysis was also conducted using partitioning around medoids, or clustering based on dissimilarity, resulting in a more robust version of the k-means method. Figure 14 shows the results of the analysis, which largely failed to classify the data accurately into flood severity categories. In this analysis, the registered data captured the majority of non-flood events in one cluster, but failed to capture the rest of the data in accurate clusters. The warping functions and the velocity and acceleration of the registered data failed to cluster the data in a meaningful manner.

Finally, the registered data, warping functions, first derivative, and second derivative underwent a classic cluster analysis in which the data was plotted using the same color and icon coding as above. Additionally, dendrograms were created in which the classification trees were coerced into four clusters, each representing a severity of flood event occurring within the year. Regardless of the clustering method, whether single, complete, Ward, or average linkage, the functional characteristics all failed to cluster the data accurately. Virtually all clustering techniques yielded only 24-28 correct classifications, not enough to be considered significant findings.

7 Functional Principal Component Analysis

A principal component analysis is a procedure that transforms a number of potentially correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for the most variability in the data, and each successive component accounts for a decreasing amount of variability. The principal components are composed of eigenvalues, $\eta$, and eigenfunctions or harmonics, $\xi$. The analysis searches for the harmonic in the equation below that has the largest possible variance (Ramsay and Silverman (2005), page 100):

(7.1)  $\rho_\xi(x_i) = \int \xi(t)\, x_i(t)\, dt$
The eigenvalue variance associated with a harmonic is the value of:

(7.2)  $\eta = \max_{\xi} \sum_i \rho_\xi^2(x_i), \quad \text{subject to } \int \xi^2(t)\, dt = 1$

When considering how an observation is deconstructed using the principal component scores, or eigenvalues, and harmonics, the following equation provides the applied form:

(7.3)  $x_t = \mu_t + \sum_{j=1}^{\infty} \gamma_j \xi_j(t)$

where $x_t$ is observation $x$ at time $t$, $\mu_t$ is the mean of the data, $\gamma_j$ is the principal component score, and $\xi_j(t)$ is the $j$th eigenfunction or harmonic.

Using a Scree test, four principal components were chosen as the most efficient number to represent the data. The four principal components displayed in Figure 15 account for 76% of the variation in the data. The harmonics used to create the principal components are shown in Figure 16. The first principal component captures the seasonal variation of the streamflow, largely rising in March and reaching its minimum in August; the greatest amount of variation captured is in the amount of streamflow occurring at the maximum. The second principal component accounts for the streamflow variability during the late spring, summer, and winter months. The third principal component captures the shift in the timing of the early spring peak flows, or the change from winter to spring. The fourth principal component highlights the increase in streamflow in the early spring months.

A rotated principal component analysis was considered using the varimax function. Ultimately, the procedure was not used because the analysis incorporates a penalization for harmonic acceleration. The varimax procedure, capturing the variation in streamflow, is estimated by
penalizing the squared deviations in harmonic acceleration as measured by the differential operator. If the streamflow variation were purely sinusoidal, the penalized quantity for the yearly curves would be exactly zero. The reason for not using the procedure is that the penalty increases with the higher-order harmonics, distorting the interpretation of the later principal components.

With a functional principal component analysis in place using four principal components, the plot in Figure 17 was created to determine whether there is any correlation between the variation seen in the first four principal components and the years in which floods occurred. In the scatterplot on the axes of the first two principal components, outliers were identified as the most extreme observations according to their distance from the mean. Observations with a first principal component score greater than 10,000 and observations with a second principal component score greater than 20,000 were determined to be outliers. Of the 15 years identified as outliers, there were three years with major floods, seven years with moderate floods, four years with minor floods, and only one year with no flooding at all. The range of flood types found among the outliers, and the fact that many flood years were not flagged as outliers, indicate that the relationship between the principal components of streamflow is not strong enough to make determinations or predictions regarding flood events without additional information or analysis techniques.

8 Forecasting Models

Median Testing Model

(8.1)  $\hat{X}_{t+1} = \hat{\mu} + \sum_{j=1}^{4} \hat{\gamma}_{t+1} \hat{\xi}_j(t)$

where $X_t$ is observation $x$ at time $t$, $\mu$ is the mean of the data, $\gamma_{t+1}$ is the nested median of the principal component scores, and $\xi_j(t)$ is the $j$th eigenfunction or harmonic.
Mean Testing Model

(8.2)  $\hat{X}_{t+1} = \hat{\mu} + \sum_{j=1}^{4} \hat{\beta}_{t+1} \hat{\xi}_j(t)$

where $X_t$ is observation $x$ at time $t$, $\mu$ is the mean of the data, $\beta_{t+1}$ is the nested mean of the principal component scores, and $\xi_j(t)$ is the $j$th eigenfunction or harmonic.

AR(1) Testing Model

(8.3)  $\hat{X}_{t+1} = \hat{\mu} + \sum_{j=1}^{4} \hat{\alpha}_{t+1} \hat{\xi}_j(t)$

where $X_t$ is observation $x$ at time $t$, $\mu$ is the mean of the data, $\alpha_{t+1}$ is the nested AR(1) forecast of the principal component scores, and $\xi_j(t)$ is the $j$th eigenfunction or harmonic.

RMSE (Root Mean Square Error) Formula for Forecasting

(8.4)  $\mathrm{RMSE} = \frac{1}{m} \sum_{j=1}^{m} \sqrt{\sum_{i=1}^{y} (f_i - d_i)^2}$

where $m$ is the number of months in a forecasting year, typically twelve; $y$ is the number of years in the forecast; $f_i$ is a matrix of forecast values with $y$ columns and $m$ rows; and $d_i$ is a matrix of true values with $y$ columns and $m$ rows.

9 Forecasting as a Time Series

Using ACF (autocorrelation function) and PACF (partial autocorrelation function) analyses, the registered functional data, warping functions, principal components, and first and second derivatives were tested to determine whether a SARIMA (seasonal autoregressive integrated moving average) model explains the variation in the data. Of the characteristics tested, no strong pattern was identified, but the principal component scores did exhibit a slight seasonal component that warranted further analysis.

Without a specific SARIMA model with which to analyze and forecast the principal component scores, the AR(1), or first-order autoregressive, process was used as it is the
archetypal time series process. To determine the ability of the principal components to be used in the forecasting process, a 71-year training set and a 10-year testing set were used. The forecasts were nested, so that each year in the forecast drew not only on the 71 years of the training set but also on the data for all previously forecasted years. The 10-year testing set was used to test the accuracy of the forecasts in predicting streamflow. Using formula 8.4, a baseline root mean square error (RMSE) of 9,437.7 was calculated for the testing set against the raw data, providing a benchmark against which the success of the forecast models can be measured.

Formula 8.1 was used in the median forecasting method. The median of the functional principal component scores for the nested training set was calculated and then multiplied by the functional principal component harmonics to derive forecasted streamflow values. This model was compared to the testing set using the RMSE and graphics containing the historical data, the forecasted data, and a 95% confidence band. The principal component analysis was performed uncentered, with the mean included in the analysis, and centered, with the mean subtracted from the analysis and later added back to the data prior to forecasting. The uncentered forecast had an RMSE of 8,495.5 and the centered forecast had an RMSE of 8,510.3, both performing better than the baseline when compared to the raw data.

Formula 8.2 is the formula used in the mean forecasting method. In this scenario, the mean of the functional principal component scores for the nested training set was calculated and then multiplied by the functional principal component harmonics to derive forecasted streamflow values. As with the median forecasting method, this model was compared to the testing set using the RMSE and graphics containing the historical data, the forecasted data, and a 95% confidence band.
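The comparison criterion of equation (8.4) can be written directly as a function over the forecast and true-value matrices. The matrices below are hypothetical stand-ins for the 12-month by 10-year testing comparison, and the formula follows the reconstruction above: average over months of the root summed squared error across the forecast years.

```python
import numpy as np

def forecast_rmse(f, d):
    """Forecast RMSE per equation (8.4).

    f, d : arrays of shape (m, y), forecast and true values,
    with one row per month and one column per forecast year.
    Returns (1/m) * sum_j sqrt( sum_i (f_ij - d_ij)^2 ).
    """
    f = np.asarray(f, dtype=float)
    d = np.asarray(d, dtype=float)
    m = f.shape[0]
    # For each month (row), root of the squared error summed across years.
    monthly = np.sqrt(np.sum((f - d) ** 2, axis=1))
    return np.sum(monthly) / m

# Hypothetical 12-month x 10-year matrices of true and forecasted streamflow.
rng = np.random.default_rng(2)
truth = 10000.0 + 4000.0 * rng.standard_normal((12, 10))
forecast = truth + 1500.0 * rng.standard_normal((12, 10))
score = forecast_rmse(forecast, truth)
```

Because the same function scores the baseline and every candidate model, the RMSE values reported for the median, mean, and AR(1) methods are directly comparable.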
The principal component analysis was also performed both centered and uncentered, enabling a comprehensive comparison. Both methods performed better than the baseline, with the centered functional principal component analysis leading to a forecast with an RMSE of 8,526.2 and the uncentered functional principal component analysis providing an RMSE of 8,393. While the RMSE of the forecast performed well, the model was ultimately not used because the uncentered principal component analysis yielded principal component scores so close to zero that the mean was the only information captured for the forecast.

Equation 8.3 is the final, and most successful, forecasting method, using a nested AR(1) model to forecast the functional principal component scores. The forecasted scores were then multiplied by their corresponding harmonics to derive a streamflow forecast. As with the mean and median modeling, the fit of the 10-year forecasted model was tested using the RMSE, and graphics were created containing the historical data, the forecasted data, and a 95% confidence band to determine whether the true data lay within the boundaries of the forecasted model. Also as with the mean and median modeling, the forecast was tested twice, once using a centered principal component analysis, with the mean subtracted and later added back, and once using an uncentered analysis, with the mean included. In this scenario, the uncentered forecast RMSE is 8,604.4, while the centered forecast outperformed all of the other models, including the baseline, with an RMSE of 8,493.

Figure 18 shows the AR(1) forecast model: the black lines indicate the forecasted yearly streamflow and the 95% confidence bands, the red line represents the mean of all 10 forecasted years, and the green line represents the mean of the true data in the 10 testing-set years. In the graphic it is easy to see that the mean of the true data lies within the confidence band and does not differ significantly from the forecasted data. What is more difficult to interpret is why the confidence bands exhibit a series of hierarchical switches in the month of April.
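The AR(1)-on-scores pipeline of equation (8.3) can be sketched end to end: fit an AR(1) to each principal component score series, forecast the next score, and rebuild the streamflow curve as the mean function plus the score-weighted harmonics. Everything below is synthetic and hypothetical (the mean curve, harmonics, and 71 years of scores), and the AR(1) coefficient is estimated by lag-1 least squares on the demeaned series, an assumption since the paper does not specify the estimator.

```python
import numpy as np

def ar1_fit_forecast(series):
    """Fit a demeaned AR(1), x_t - mu = a (x_{t-1} - mu) + e_t,
    by lag-1 least squares and return the one-step-ahead forecast."""
    x = np.asarray(series, dtype=float)
    mu = x.mean()
    z = x - mu
    a = (z[:-1] @ z[1:]) / (z[:-1] @ z[:-1])
    return mu + a * z[-1]

rng = np.random.default_rng(3)
n_years, n_pc, n_months = 71, 4, 12

# Hypothetical ingredients of equation (8.3): a mean curve, four
# harmonics, and 71 years of principal component scores.
mean_curve = 10000.0 + 5000.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, n_months))
harmonics = rng.standard_normal((n_pc, n_months))       # xi_j(t), one row each
scores = 1000.0 * rng.standard_normal((n_years, n_pc))  # gamma_{t,j}

# Forecast next year's score for each component, then reconstruct:
# X_{t+1}(s) = mu(s) + sum_j alpha_{t+1,j} * xi_j(s)
next_scores = np.array([ar1_fit_forecast(scores[:, j]) for j in range(n_pc)])
forecast_curve = mean_curve + next_scores @ harmonics   # shape (12,)
```

Nesting, as described above, would amount to appending each forecasted score to its series before forecasting the following year.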
When referencing the confidence bands, each band represents the most extreme hydrological years: the upper band (97.5th percentile) of functional years reaches its streamflow maximum in March, while the lower band (2.5th percentile) reaches its maximum in April. The shift in the timing of extreme events, followed by a
uniform steep decline in streamflow, is the cause of the flip in the confidence bands.

10 Conclusion

The necessity of modeling and forecasting extreme hydrological events is indisputable, and as flood events continue to occur, the need to develop methods of early-warning detection becomes ever more urgent. The exploratory analysis, models, and forecasting methodologies used in the development of this paper reflect what is possible through the use of statistical analyses of extreme hydrological events. The results found here attest that stand-alone traditional methodology does not have enough power to model the complicated and multifaceted science of streamflow analysis. The most successful aspects of this analysis were the layered and concurrent use of many statistical tools, as in the forecasting method in which principal component analysis and time series forecasting methods were merged.

Moving forward, there are numerous attributes that should be introduced. These models may become more adept when the effects of human activity are considered; this can be achieved by reading different sensors along the stream, using adjusted data, or adding categorical variables to the methodology. Additionally, global and local recurring weather patterns can be introduced to further enhance the accuracy of the model. Finally, the velocity and acceleration of the streamflow can be analyzed more thoroughly to determine whether there is an evaluable point prior to flood events at which the severity can be determined and modeled.
References

Chiou, J.-M. and Müller, H.-G. (2009). Modeling hazard rates as functional data for the analysis of cohort lifetables and mortality forecasting. Journal of the American Statistical Association, 104(486), 572-585.

Huzurbazar, S. and Humphrey, N. F. (2008). Functional clustering of time series: An insight into length scales in subglacial flow. Water Resources Research, 44, W11420, 1-9.

Hyndman, R. J. and Ullah, M. S. (2006). Robust forecasting of mortality and fertility rates: A functional data approach. Computational Statistics & Data Analysis, 51, 4942-4956.

Ramsay, J. O., Hooker, G. and Graves, S. (2009). Functional Data Analysis with R and MATLAB. Springer.

Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer.

Suhaila, J., Jemain, A. A., Hamdan, M. F. and Zin, W. Z. W. (2011). Comparing rainfall patterns between regions in Peninsular Malaysia via a functional data analysis technique. Journal of Hydrology, 411(1), 197-206.
Figure 1: Raw data of the Potomac River at the Little Falls Pumping Station, site 1646500, 1931-1935. Each year is separated by a horizontal line.
Figure 2: Matrix plot of the Potomac River from 1931-2012, overlaid onto a 12-month axis, with flood severity coded as none, minor, moderate, and major.
  • 21. [Plots: Yearly Stream Flow Averages and Yearly Stream Flow Maximums vs. year] Figure 3: The yearly averages and maximums of streamflow for the Potomac River, 1931-2012. 21
  • 22. [Plot: Data modeled with Fourier basis; time vs. value, curves colored by flood severity: None, Minor, Moderate, Major] Figure 4: Streamflow data converted to functional data using the Fourier basis system. Each year is represented as its own function, determined by sinusoidal functions at monthly breaks. 22
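The Fourier-basis conversion behind Figure 4 amounts to a least-squares fit of a constant plus sine/cosine pairs to each year's 12 monthly values. A minimal sketch in Python (the thesis itself presumably used the fda package in R; the monthly values below are invented, not the actual Potomac record):

```python
import numpy as np

def fourier_basis(t, period=12.0, nbasis=5):
    """Evaluate a Fourier basis (constant plus sin/cos pairs) at times t."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for k in range(1, (nbasis - 1) // 2 + 1):
        w = 2.0 * np.pi * k / period
        cols.append(np.sin(w * t))
        cols.append(np.cos(w * t))
    return np.column_stack(cols)

# Hypothetical monthly streamflow for one year (cfs) -- illustrative only.
months = np.arange(1, 13)
flow = np.array([18000, 22000, 30000, 25000, 15000, 9000,
                 6000, 5000, 7000, 12000, 14000, 17000], dtype=float)

# Least-squares coefficients turn the 12 discrete points into a smooth function
# that can be evaluated anywhere on the yearly time axis.
B = fourier_basis(months)
coef, *_ = np.linalg.lstsq(B, flow, rcond=None)
smooth = fourier_basis(np.linspace(1, 12, 100)) @ coef
```

Because the Fourier basis is periodic, the fitted curve is forced to match at the year boundaries, which is the source of the January/December residual error noted in Figure 6.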
  • 23. [Plots: the bspline basis functions; model fit using spline basis, curves colored by flood severity] Figure 5: Streamflow data converted to functional data using the bspline basis system. Each year is represented as its own function, determined by piecewise polynomial functions at monthly breaks. 23
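The bspline conversion of Figure 5 can be sketched the same way with SciPy's interpolating cubic B-spline: one smooth piecewise polynomial per year, joined at the monthly breaks. Again the data are invented stand-ins, and this is a plain interpolant rather than the penalized smooth an fda-style analysis might use:

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Hypothetical monthly streamflow for one year -- not the real series.
months = np.arange(1, 13)
flow = np.array([18000, 22000, 30000, 25000, 15000, 9000,
                 6000, 5000, 7000, 12000, 14000, 17000], dtype=float)

# Cubic B-spline through the monthly breaks: a piecewise cubic polynomial
# that passes through every observation and is evaluable between months.
spline = make_interp_spline(months, flow, k=3)
smooth = spline(np.linspace(1, 12, 100))
```

Unlike the Fourier fit, the spline is not constrained to be periodic, which is why it tracks January and December more closely in Figure 6.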
  • 24. [Plots: one year of raw data overlaid with the Fourier fit and with the bspline fit] Figure 6: A one-year graphic of the Fourier and bspline systems compared to the raw data. The increased residual error in January and December of the Fourier system was a key factor in its rejection as a model. 24
  • 25. [Plots: Five Years of Raw Data (month vs. streamflow, 1931-1935); Five Years of Functional Data, colored by flood severity] Figure 7: A plot of the first five years of raw data and the same five years converted to functional data using the bspline system. 25
  • 26. [Plot: Mean Functional Stream Flow by Month; time vs. mean value] Figure 8: The mean functional curve of the streamflow data, representing the average streamflow across the 1931-2012 period. Each functional year is evaluated at 12 monthly points, and the mean is calculated across each month. 26
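The mean functional curve of Figure 8 is a pointwise average: evaluate every year's functional object at the same monthly grid and average month by month. A sketch with simulated stand-in data (81 invented years, not the 1931-2012 record):

```python
import numpy as np
from scipy.interpolate import make_interp_spline

rng = np.random.default_rng(0)
months = np.arange(1, 13)

# Hypothetical 81 years of monthly streamflow: a seasonal shape peaking in
# early spring, plus noise. Purely illustrative.
seasonal = 15000 + 12000 * np.cos(2 * np.pi * (months - 3) / 12)
years = seasonal + rng.normal(0, 2000, size=(81, 12))

# Smooth each year, evaluate the functional object at the 12 monthly points,
# then average pointwise across the 81 years.
curves = np.array([make_interp_spline(months, y, k=3)(months) for y in years])
mean_curve = curves.mean(axis=0)
```

With an interpolating spline the evaluated curves equal the raw points, so the result matches the raw monthly means; with a penalized smooth the two would differ slightly.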
  • 27. [Perspective plot: Variance-Covariance Surface; month vs. month vs. var-cov] Figure 9: The variance-covariance perspective plot represents the variance that occurs in each month across the 81 functional years. The plot indicates that the variance is maximized in the early months of the year, corresponding to when streamflow is at its greatest. 27
  • 28. [Contour plot of the bivariate variance-covariance surface; contour levels 0 to 8e+07] Figure 10: The variance-covariance contour plot is an additional representation of the variance that occurs in each month across the 81 functional years. The plot clearly shows that the variance is maximized in March, with a local maximum in October, corresponding to when streamflow is at its greatest and locally maximized. 28
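The surface shown in Figures 9 and 10 is the covariance between every pair of months, computed across the 81 yearly curves: entry (i, j) is the covariance of month i with month j, and the diagonal is the monthly variance. A sketch with simulated stand-in data whose noise is larger in high-flow months:

```python
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(1, 13)

# Hypothetical 81 functional years evaluated monthly, with variability that
# grows with the seasonal flow level -- illustrative only.
seasonal = 15000 + 12000 * np.cos(2 * np.pi * (months - 3) / 12)
curves = seasonal + rng.normal(0, 3000, size=(81, 12)) * (1 + seasonal / 30000)

# vcov[i, j] is the covariance between months i+1 and j+1 across the years;
# the diagonal ridge corresponds to the monthly-variance surface of Figure 9.
vcov = np.cov(curves, rowvar=False)
```

Plotting `vcov` as a surface or contour map reproduces the kind of display in Figures 9 and 10.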
  • 29. [Four-panel plot for the year 1931] Figure 11: A four-piece plot of the year 1931; the information obtained is used to further analyze the modeling potential of the functional data. (a) The raw data as points and the functional curve as a red line. (b) The residual difference between the raw data and the functional data. (c) The first derivative of the functional data, representing the velocity of the streamflow. (d) The second derivative of the functional data, representing the acceleration of the streamflow. 29
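One practical benefit of the functional representation in Figure 11 is that velocity and acceleration come directly from differentiating the fitted spline. A sketch, again on invented one-year data rather than the 1931 series:

```python
import numpy as np
from scipy.interpolate import make_interp_spline

months = np.arange(1, 13)
# Hypothetical one-year streamflow curve -- illustrative only.
flow = np.array([18000, 22000, 30000, 25000, 15000, 9000,
                 6000, 5000, 7000, 12000, 14000, 17000], dtype=float)

spline = make_interp_spline(months, flow, k=3)
fine = np.linspace(1, 12, 100)

velocity = spline.derivative(1)(fine)      # first derivative: rate of change
acceleration = spline.derivative(2)(fine)  # second derivative: curvature
residuals = flow - spline(months)          # ~0 for an interpolating spline
```

For an interpolating spline the panel-(b) residuals are essentially zero; a penalized smooth trades some residual error for smoother derivatives.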
  • 30. [Plots: registered functional curves and the associated warping functions, colored by flood severity] Figure 12: The registered functional objects and associated warping functions. The top graphic shows the registered functional curves. In the continuous registration process the functional observations are aligned with a target curve using continuous criteria such as maxima, minima, velocity, and acceleration. To align the registered curves, a warping function, visible in the bottom graphic, is applied to each observational year. 30
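Continuous registration as used for Figure 12 estimates a smooth monotone time-warping per curve; that machinery does not fit in a few lines, but the underlying idea can be illustrated with a crude integer-shift registration that slides each curve to best match a target (all names and data here are illustrative stand-ins, not the actual procedure):

```python
import numpy as np

def shift_register(curve, target, max_shift=3):
    """Crude registration: circularly shift `curve` by the whole-month offset
    minimizing squared distance to `target` (a stand-in for the smooth
    warping functions of continuous registration)."""
    shifts = list(range(-max_shift, max_shift + 1))
    errors = [np.sum((np.roll(curve, s) - target) ** 2) for s in shifts]
    best = shifts[int(np.argmin(errors))]
    return np.roll(curve, best), best

months = np.arange(12)
target = np.exp(-0.5 * ((months - 3) / 1.5) ** 2)  # target peaks in month 3
late = np.roll(target, 2)                           # a year whose peak runs late
registered, shift = shift_register(late, target)    # shift == -2 undoes the lag
```

In true continuous registration the integer shift is replaced by a smooth warping function h(t), so that different parts of the year can be stretched or compressed by different amounts.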
  • 31. [Four scatterplots: 1st vs. 2nd principal component scores] Figure 13: K-means cluster analysis of the functional streamflow data. The color coding represents the flood severity, while the icons represent the clustering identified in the k-means procedure. Each plot represents a characteristic of the functional data tested for its ability to cluster the data in the same manner as the flood severity: (a) warping functions; (b) registered data, providing similar results as the functional data; (c) the first derivative, or velocity, of the functional streamflow; and (d) the second derivative, or acceleration, of the streamflow data. 31
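The clustering in Figure 13 runs k-means on the principal component scores of each functional characteristic. A sketch of the step on simulated two-dimensional scores (the group locations, k value, and data are all invented for illustration; the actual analysis compares clusters against four flood-severity labels):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)

# Hypothetical first-two principal component scores for 81 functional years:
# a large "non-flood" bulk and a smaller group of extreme years.
scores = np.vstack([
    rng.normal([-10000, -5000], 4000, size=(60, 2)),
    rng.normal([15000, 12000], 4000, size=(21, 2)),
])

# K-means on the score plane; labels can then be cross-tabulated against the
# known flood-severity classes to judge agreement.
centroids, labels = kmeans2(scores, 2, minit='++', seed=0)
```

PAM (Figure 14) follows the same recipe but replaces the centroid means with medoids, which makes it less sensitive to outlying years.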
  • 32. [Four scatterplots: 1st vs. 2nd principal component scores] Figure 14: PAM analysis of the functional streamflow data. The color coding represents the flood severity, while the icons represent the clustering identified in the PAM analysis. Each plot represents a characteristic of the functional data tested for its ability to cluster the data in the same manner as the flood severity: (a) warping functions; (b) registered data, providing similar results as the functional data; (c) the first derivative, or velocity, of the functional streamflow; and (d) the second derivative, or acceleration, of the streamflow data. 32
  • 33. [Plots: PCA functions 1-4, capturing 32.7%, 19.6%, 12.8%, and 11.2% of the variability, respectively] Figure 15: The first four functional principal components. The solid line represents the mean streamflow for all 81 years of data; the plus and minus icons represent the variability captured in each principal component. 33
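When the curves are evaluated on a common monthly grid, the functional PCA behind Figure 15 reduces to an ordinary PCA of the centered curve matrix: each right-singular vector is a principal component function (harmonic), and the squared singular values give the percentage-of-variability figures in the panel titles. A sketch on simulated stand-in curves:

```python
import numpy as np

rng = np.random.default_rng(2)
months = np.arange(1, 13)

# Hypothetical 81 functional years evaluated at monthly points.
seasonal = 15000 + 12000 * np.cos(2 * np.pi * (months - 3) / 12)
curves = seasonal + rng.normal(0, 3000, size=(81, 12))

# PCA of the centered curves via SVD: harmonics are PC functions, and
# projecting each year onto them yields its PC scores.
centered = curves - curves.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
harmonics = Vt[:4]                      # first four PC functions
scores = centered @ harmonics.T         # per-year scores on those functions
pct_var = 100 * s**2 / np.sum(s**2)     # percent variability per component
```

The "plus/minus" displays of Figure 15 are then the mean curve perturbed by a multiple of each harmonic, and the scores are what feed the clustering and forecasting steps.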
  • 34. [Plot: Principal Component Harmonics, 1st through 4th] Figure 16: The first four principal component harmonics. The functional data are converted to functional principal component scores by way of the harmonics, which capture successively decreasing portions of the variation. 34
  • 35. [Four scatterplots of principal component pairs] Figure 17: Scatterplots of four principal components. (a) 1st and 2nd principal components. (b) 2nd and 3rd principal components. (c) 3rd and 4th principal components. (d) 4th and 1st principal components. This procedure is intended to identify outliers in the data, which did not align with the flood severity groupings. However, as with the clustering methodology, the analysis captures non-flood events in a general cluster. 35
  • 36. [Plot: AR(1) Forecasting of Principal Components; month vs. streamflow; legend: Yearly Forecast, Forecast Mean, True Streamflow, Confidence Band] Figure 18: A 10-year forecasting model using the AR(1) process on the nested principal component scores. The black lines represent the yearly forecasts, the red line is the mean of the forecasted data, the green line is the mean of the true data, and the blue lines represent the 95% confidence band of the forecast. 36
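The forecasting step of Figure 18 treats each principal component's yearly scores as a time series, fits an AR(1) model, and iterates the fitted recursion forward. A sketch for a single component, using a simulated score series with a known autoregressive structure (the `phi` values and data are invented; the real analysis fits each nested score series separately):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical yearly scores on one principal component, generated from an
# AR(1) process with coefficient 0.5 -- illustrative stand-in data.
n = 81
phi_true, scores = 0.5, np.zeros(n)
for t in range(1, n):
    scores[t] = phi_true * scores[t - 1] + rng.normal(0, 1000)

# Fit the AR(1) coefficient by least squares on lagged pairs, then iterate
# the recursion forward for a 10-year forecast of the score series.
x, y = scores[:-1], scores[1:]
phi_hat = np.dot(x - x.mean(), y - y.mean()) / np.dot(x - x.mean(), x - x.mean())
path = [scores[-1]]
for _ in range(10):
    path.append(phi_hat * path[-1])
forecast = np.array(path[1:])
```

Forecast score paths for each component are then mapped back to streamflow curves by multiplying through the harmonics and adding the mean function, which yields the yearly forecast curves and their confidence band in Figure 18.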