The document discusses using machine learning techniques like artificial neural networks (ANN) to optimally model large, complex hydrologic systems using big data. It presents four case studies:
1) Modeling groundwater levels in the Floridan Aquifer using over 200 wells and 40 years of daily data. Signal decomposition and time series clustering were used to develop sub-models for different behavioral classes.
2) Predicting hourly stream temperatures across western Oregon using data from 148 monitoring sites, climate data from 25 sites, and stream attribute data.
3) Modeling half-hourly stream temperatures statewide in Wisconsin using data from 254 monitoring sites over 13 years and climate data from 353 stations.
4) Predicting water
Ewri2009 big data_jbc
1. Using “Big Data” to Optimally Model
Hydrology and Water Quality across
Expansive Regions
Edwin A. Roehl
John B. Cook, PE
Advanced Data Mining Int’l, Greenville, SC
Paul A. Conrads
US Geological Survey, Columbia, SC
Presented to EWRI/ASCE
May 21, 2009
Kansas City
2. Objectives
• Use “Big Data” to optimally model hydrologic
systems; with case studies
• Systems are highly dynamic, spatially
expansive, and behaviorally heterogeneous
• Approach: divide and conquer – big problem
transformed into series of small problems
• Use a sequence of numerically optimized
algorithms
– Goal to minimize subjectivity
– Use both categorical and time-series “big data”
sets to predict temporal and spatial variability
3. Some applications for expansive
regions
• Models for groundwater systems such as
Floridan Aquifer in Suwannee River Valley
• Models for water depths such as Everglades
• Models for water quality in large number of
streams such as for temperature, DO
• Side-benefit: finding redundancy in
monitoring sites
4. Imagine that one single raindrop is one bit of data…
What do we have in natural resource studies?
6. But data is NOT information
• Information is what we are after
• So information must be “extracted” from
data, where it is “hidden”
• Data must be correlated and
decorrelated to optimize information
and remove redundancies
7. Tools used to divide and conquer, 1
• Artificial Neural Networks (ANN) models
– Form of Machine Learning
– Non-linear, multivariate curve fitting
• Sub- and Super- Models
– System model = “super-model” composed of multiple “sub-models”
[Figure: process data plotted with ANN response surface]
[Figure: ANN non-linear classifier]
8. Tools used to divide and conquer, 2
• Time-Series Clustering
– Groups populations of signals into behavioral
classes
– Each class can then be optimally “sub-modeled”
• Signal Decomposition
– Break signals down into periodic and chaotic
components
9. Tools used to divide and conquer, 3
• Spatially Interpolating ANN Models
– Spatial coordinate input parameters are
combined with time series data to create
“stacked” data sets
• New Site Classification
– Where no data has been collected, categorical
parameters can be used to determine which
“sub-models” to run
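The "stacked" data set described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the site names, coordinates, and water levels are invented, and the row layout (static inputs, a time index, then the target) is a hypothetical arrangement, not the one used in the study.

```python
# Hypothetical sites: static attributes (UTM x, UTM y, elevation z) and a short
# water-level series per site. All values are invented for illustration.
static_attrs = {
    "well_A": (310000.0, 3320000.0, 18.5),
    "well_B": (325000.0, 3335000.0, 24.1),
}
series = {
    "well_A": [10.2, 10.4, 10.1],
    "well_B": [15.0, 15.3, 15.1],
}


def stack(static_attrs, series):
    """Return one row per (site, time step): static inputs, time index, target."""
    rows = []
    for site, values in series.items():
        x, y, z = static_attrs[site]
        for t, level in enumerate(values):
            # The static coordinates are repeated on every row of the site's series.
            rows.append((x, y, z, t, level))
    return rows


stacked = stack(static_attrs, series)
for row in stacked:
    print(row)
```

Repeating the static coordinates on every time step is what would let a single model learn from all sites at once and then be queried at coordinates where no data were collected.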
11. Upper Floridan Aquifer
Suwannee River Valley, Florida
• Area:
– 140 km × 140 km
– 40 years of data
– Over 200 wells
• Need:
– Interpreting the hydrologic data
– Reducing size of monitoring
network, if possible
– Generate spatially continuous
water level predictions
• Data:
– Daily water level (WL) (dynamic)
– UTM x and y (static)
– Surface elevation z (static)
[Map: well locations across the 140 km × 140 km study area]
12. Signal decomposition
• Decompose hydrologic time series for each
well into static and chaotic components
• Static component of daily water level =
historical mean of daily water levels
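A minimal sketch of this decomposition, assuming the chaotic component is simply what remains after the historical mean is subtracted (the slide defines only the static component). The sample water levels are invented.

```python
def decompose(levels):
    """Split a water-level series into its historical mean and the residual."""
    static = sum(levels) / len(levels)          # static component: historical mean
    chaotic = [v - static for v in levels]      # assumed: remainder after the mean
    return static, chaotic


# Invented daily water levels for one hypothetical well.
levels = [10.0, 12.0, 11.0, 9.0, 13.0]
static, chaotic = decompose(levels)
print(static)
print(chaotic)
```

Adding the static value back to each residual recovers the original series, so nothing is lost by modeling the two parts separately.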
13. Spatial discontinuity in the process
physics
• Sub-set of the
wells
• Annual periodic
component
• Variability due
to chaotic
forcing
• Well behaviors
spatially
discontinuous
14. Time series clustering
12 classes, selected from the sensitivity
of RMSE to k in k-means; the small number
of classes indicates well redundancy
15. ANN modeling
• Cascading sub-models
– Sub-models 1i (i = 1 to k, k = 12) predict
historical mean water levels (the static component)
– Sub-models 2i predict the dynamic or chaotic
components
16. Super-model
• Super-model = 12 cascaded sub-model
pairs, one pair for each class (total of 24
ANNs)
[Diagram: Static Model and Chaotic Model blocks, with inputs labeled Time Series Data and Site Static Variables]
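One way to picture the super-model is as a lookup from behavioral class to a cascaded pair of sub-models. The sketch below is hypothetical throughout: simple functions stand in for the trained ANNs, the coefficients are invented, and passing the static prediction into the chaotic model is an assumed reading of "cascaded".

```python
def make_pair(base, gain):
    """Build one hypothetical sub-model pair; plain functions stand in for ANNs."""

    def static_model(x, y, z):
        # Stand-in for the ANN that predicts the historical mean water level.
        return base + 0.1 * z

    def chaotic_model(mean_level, forcing):
        # Stand-in for the ANN that predicts the deviation from the mean.
        # mean_level is accepted to show the cascade; this toy version ignores it.
        return gain * forcing

    return static_model, chaotic_model


K = 12  # number of behavioral classes found by clustering
super_model = {c: make_pair(base=float(c), gain=0.5) for c in range(1, K + 1)}


def predict(cls, x, y, z, forcing):
    """Run the cascaded pair for one class and sum the two components."""
    static_model, chaotic_model = super_model[cls]
    mean_level = static_model(x, y, z)
    deviation = chaotic_model(mean_level, forcing)
    return mean_level + deviation


n_sub_models = 2 * len(super_model)
example = predict(3, x=0.0, y=0.0, z=10.0, forcing=2.0)
print(n_sub_models, example)
```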
17. Accuracy of sub-models for 4 of
the 12 classes
[Figure: actual vs. predicted normalized water level above sea level for classes C1, C3, C6, and C10; history from Apr 1982 to Oct 1998]
18. Super model prediction
[Figure: run-time application display of the super-model prediction; labels: Gulf of Mexico, max elevation above sea level ~ 60 meters]
19. Floridan Aquifer model summary
• Signal Decomposition - decompose time series into static
and dynamic components
• Time Series Clustering – numerically optimal
segmentation of time series into behavioral classes
• Stacked Database – configures static and time series
variables for training ANNs to spatially interpolate
• ANN Modeling – multivariate non-linear curve fitting of
static and dynamic variables
• Super-model - combined all sub-models (12 classes x 2
sub-models/class = 24 sub-models)
20. Western Oregon stream temperature
• Area: Western third of
Oregon
• Need: To estimate water
temperatures in “pristine” or
unimpaired 1st-, 2nd-, and 3rd-order
streams
• Data:
– Stream temperature –
hourly time series from 148
“pristine” sites from June to
September 1999
– Climate – 65 hourly time
series from 25 locations (air
temp., dew point, solar
radiation, barometric
pressure, snowpack,
precipitation)
– Stream habitat and basin
attributes – 34 static
variables (e.g. gradient,
canopy cover, depth, bed
substrate, …)
[Map: western Oregon showing the Klamath Mountains Ecoregion and Willamette Valley Ecoregion, with Portland, Corvallis, Eugene, and Ashland]
21. Western Oregon sites used for training
[Map legend: Stream Temp Sites; Climatic sites]
• Circles represent stream
temperature sites
– Different colored circles
represent 3 classes of
streams included in the
study
• Triangles mark climatic
and snowpack monitoring
sites
• 6 sites set aside as
validation sites
22. Differences from Floridan Aquifer study
• Predict hourly vs. daily stream temperatures
• Large list of possible static and dynamic
inputs
– Many variables highly correlated
• New site classification could not be based
solely on spatial coordinates due to
influences of habitat and basin attributes
23. Predicting hourly vs. daily temperatures
• Resulted in 3 cascaded sub-models for each
behavioral class predicting in succession
– historical mean
– daily stream temperature
– hourly stream temperature
24. Addressing correlated variables
• Climatic time series from multiple weather
stations were highly correlated
– Decorrelated climatic variables of the same type
by setting 1 station to be a “standard”
– Calculated differences from the standard at the
other stations
– Future studies address ways to non-linearly
decorrelate variables of different types
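The differencing step can be sketched as follows. The station names and air temperatures are invented, and the naming of the output series is a hypothetical convention.

```python
def decorrelate(stations, standard):
    """Keep the standard station as-is; replace the others by their difference from it."""
    ref = stations[standard]
    out = {standard: list(ref)}
    for name, values in stations.items():
        if name != standard:
            out[name + "_minus_" + standard] = [v - r for v, r in zip(values, ref)]
    return out


# Invented hourly air temperatures at three hypothetical weather stations.
air_temp = {
    "station_1": [15.0, 18.0, 21.0, 17.0],
    "station_2": [14.0, 17.5, 19.0, 16.5],
    "station_3": [16.0, 19.0, 22.5, 18.0],
}
decorrelated = decorrelate(air_temp, standard="station_1")
for name, values in decorrelated.items():
    print(name, values)
```

The standard station carries the shared regional signal, while each difference series carries only what is locally distinct at the other stations.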
25. Addressing the large number of habitat
and basin attributes data
• “Best” predictor variables selected by
systematically adding and removing
candidates and tracking statistical measures
of prediction accuracy
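A rough sketch of adding and removing candidates while tracking an error measure is shown below. The `toy_error` function is a hypothetical stand-in for training and validating a model on the chosen variables; its gains and per-variable penalty are invented, and only the variable names come from the examples listed earlier.

```python
def stepwise_select(candidates, error, max_rounds=20):
    """Alternate forward (add) and backward (remove) steps while the error improves."""
    selected = []
    best = error(selected)
    for _ in range(max_rounds):
        improved = False
        # Forward step: try adding each unused candidate, keep the best if it helps.
        trials = [(error(selected + [c]), c) for c in candidates if c not in selected]
        if trials:
            err, c = min(trials)
            if err < best:
                selected, best, improved = selected + [c], err, True
        # Backward step: try dropping each selected variable, keep the best if it helps.
        trials = [(error([s for s in selected if s != c]), c) for c in selected]
        if trials:
            err, c = min(trials)
            if err < best:
                selected, best, improved = [s for s in selected if s != c], err, True
        if not improved:
            break
    return selected, best


# Hypothetical error model: each variable lowers the error by its gain, and
# every included variable adds a fixed complexity penalty.
GAIN = {"gradient": 3.0, "canopy_cover": 2.0, "depth": 0.5, "bed_substrate": 0.0}


def toy_error(subset):
    return 10.0 - sum(GAIN[v] for v in subset) + 0.75 * len(subset)


chosen, final_error = stepwise_select(list(GAIN), toy_error)
print(chosen, final_error)
```

In practice the error function would be whatever statistical measure of prediction accuracy is being tracked, evaluated on data held out from training.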
26. Western Oregon – 1 of 6 validation
sites not used for training
[Figure: time series at the validation site, late June through mid-September; vertical axis from 11 to 21]
27. Western Oregon – another validation site
• Good dynamics, but predictions are offset
• Offset error largely in the predicted static stream temp.; possible causes:
– Habitat and basin attribute assignment, OR
– Validation sites were randomly selected; a validation site whose
attributes are unique and unlearned will be poorly represented
[Figure: time series at the validation site, late June through mid-September; vertical axis from 5 to 14]
28. Western Oregon model results
• A reliable method of estimating water temperatures
for “unimpaired” streams across entire region
• Data collected from 148 sites on 1st, 2nd, and 3rd
order streams having minimally-disturbed
conditions
• Hourly climatic time-series data from 25 sites used
• Cluster analyses used to divide 142 sites into 3
classes, with separate model for each class.
• R² between 0.88 and 0.98
29. Wisconsin streams temperature model
• Area: Entire state of
Wisconsin
• Need: To predict stream
temperature for stream
segments throughout state for
fisheries management
• Data:
– Stream temperature – half-
hourly signals from June 1
to August 31 in 254
streams from 1990 - 2002
– Climate – 353 signals
across state; 7 air pressure,
156 air temp., 13 dew point,
164 precipitation, 13 solar
radiation
– 42 categorical parameters
to describe stream and
basin attributes
30. New issues
• Large number of climatic signals
• Stream temperature time series were
temporally scattered over 13 years
– Few sites overlapped year-to-year
– Most sites measured only 1 year and none
measured more than 2 years out of 13 yrs.
31. Asynchronous site monitoring
• Modified time series clustering method
• Steps
a) Compile populations having overlapping signals
• Sites monitored from 1998 to 2002 made up 241 of the 254 sites
b) Estimate the number of classes per population, then choose
the same k for all populations; k = 3 for the Wisconsin
model
c) Apply the standard time series clustering
algorithm to each population using k = 3
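Step (a) can be pictured as grouping sites by the period in which they were monitored. The sketch below assumes one population per year, which is only one possible reading of "overlapping signals"; the site names and years are invented.

```python
from collections import defaultdict

# Invented monitoring records: which summers each hypothetical site was measured.
site_years = {
    "site_01": [1998],
    "site_02": [1998, 1999],
    "site_03": [2001],
    "site_04": [1999],
}


def populations_by_year(site_years):
    """Group sites into populations whose signals overlap in time (here: same year)."""
    pops = defaultdict(list)
    for site, years in site_years.items():
        for year in years:
            pops[year].append(site)
    return dict(pops)


populations = populations_by_year(site_years)
for year in sorted(populations):
    print(year, populations[year])
```

Each population would then be clustered separately with the same k, as in steps (b) and (c).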
32. Best static variables
Top variables
Variable description                                 Top 6   Top 10   Top 14
Land cover–agriculture (W)                             *       *        *
Area–drainage area (W)                                 *       *        *
Land cover–forest (W)                                  *       *        *
Bedrock depth–depth to bedrock (0–50 feet) (W)         *       *        *
Surficial deposit texture–medium (W)                   *       *        *
Stream network–downstream link (S)                     *       *        *
Stream network–gradient (S)                                    *        *
Land cover–wetland (W)                                         *        *
Darcy value–darcy (W)                                          *        *
Bedrock depth–depth to bedrock (51–100 feet) (W)               *        *
Land cover–urban (W)                                                    *
Surficial deposit texture–fine (W)                                      *
Bedrock type–sandstone (W)                                              *
Bedrock depth–depth to bedrock (101–200 feet) (W)                       *
33. Measured & predicted stream temps, 14 test sites
• 14 “test” sites not used to train ANNs
– concatenated
– June – August
• R² = 0.66
• Dynamically very good
• Offsets (high or low) from static variables
[Figure legend: measured; predicted]
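The combination of good dynamics with a modest R² can be illustrated with a toy case. The sketch assumes R² means the coefficient of determination, 1 - SSE/SST, which the slides do not specify; the temperatures are invented.

```python
import math


def r_squared(measured, predicted):
    """Coefficient of determination, 1 - SSE/SST (an assumed definition of R²)."""
    mean = sum(measured) / len(measured)
    sse = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    sst = sum((m - mean) ** 2 for m in measured)
    return 1.0 - sse / sst


def pearson(a, b):
    """Pearson correlation, which is insensitive to a constant offset."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Invented stream temperatures; the prediction tracks every change but sits 1 degree high.
measured = [14.0, 16.0, 18.0, 16.0, 14.0, 12.0]
predicted = [m + 1.0 for m in measured]

r2 = r_squared(measured, predicted)
corr = pearson(measured, predicted)
print(round(r2, 3), round(corr, 3))
```

Under this definition, a constant offset alone lowers R² even though the predicted series follows the measured one step for step.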
34. Wisconsin model results
• Time series clustering used to condense the
large number of climate signals into a much smaller
set, with an additional algorithm to account for
asynchronous times
• 3 classes emerged from clustering
• Successful in predicting climatically-forced
dynamic behaviors of stream temps
• 14 stream test sites yielded R² of 0.66 and
captured dynamics
• R² for training sites from 0.60 – 0.75
35. Water depths in Florida Everglades
• Area: Water Conservation
Area 3A in Florida Everglades
• Need: Predict water levels at
ungaged locations
• Data:
– Water level (WL) from 3
sites
– Water depth (WD) from 16
sites
– Categorical Data
• EDEN grid UTM North
• EDEN grid UTM South
• % prairie
• % sawgrass
• % slough
• % upland
36. New issues and techniques
• Validation sites were selected using a zone-
averaging filter to identify those to be the
least unique according to categorical
parameters
• Water levels are reported to set datum.
Study required setting all stations to a
common datum
– Site 64 set as reference datum station
– Time series used difference between the
measured data from other stations and the
reference site
37. Approach
• Two step ANN model
– First step: estimate mean water‐depths using static model
– “spatially interpolating” ANN scheme
– Second step: estimate water‐depths
variability using dynamic variables
38. Static sub-model results alone
• Each “step” represents a different site
• Model able to generalize water level difference but not
variability
39. Final model results: static + dynamic
sub-models for site W8
Results: R² = 0.995 for site W8
R² for 5 validation sites: 0.980–0.995
40. Conclusions, 1
A. Numerical methods
1. Signal processing, e.g., spectral
filtering
2. Clustering, e.g., k-means
3. ANN non-linear, dynamic sub-
models of behavioral components
assembled into super-model
4. Classification, e.g., ANN non-linear
classifier
41. Conclusions, 2
B. Approach uses all available
static and time series data
C. Divide and conquer makes big
problems tractable
D. Near optimal results – limited
by data quality only
E. Compact, finished model
42. Thanks for your attention!
Questions: John B. Cook, PE, M.ASCE
Advanced Data Mining Int’l
John.Cook@advdmi.com
843-513-2130