Improving access to geospatial Big Data in the hydrology domain
Claudia Vitolo (1, 2) and Wouter Buytaert (1)
(1) Imperial College London
(2) Brunel University London
Big Data and Spatial Analytics - Business and Industrial Section
Royal Statistical Society, London, UK - 18.11.2015
Outline
1. Background
2. Open Data and access approaches
3. Demo
4. Conclusions
1.
Background
What is Hydrology?
Hydrology is the scientific study of the movement,
distribution, and quality of water on Earth.
Source: Hydrology. In Wikipedia, The Free Encyclopedia.
What do (river) hydrologists do?
▣ Collect data on climate,
soil, geology,
topography, etc.
▣ Set up model
▣ Calibrate model with
observed water levels
and stream flows
□ locations
□ time intervals
▣ Use models to analyse
scenarios and make
predictions
Big Data in Hydrology
Information:
▣ Topography & bathymetry
▣ Geology
▣ Soil & Moisture
▣ Land cover
▣ Weather & Climate
▣ Hydrometry
▣ Quality samples
▣ Groundwater
▣ Infrastructures
Format:
▣ Plain text
▣ Raster
▣ Vector
▣ Binary
▣ Markup Languages
▣ Graphs & networks
▣ CAD drawings
Big Data challenges:
▣ Get large volume of heterogeneous data
▣ Mash-up information and use it to make
decisions
2.
Open Data
and data access
approaches
Open Data
“Open data and content can be freely used, modified, and
shared by anyone for any purpose”
Source: http://opendefinition.org/
The National River Flow Archive (NRFA)
River flow data from gauging station networks across the UK
including networks operated by:
● Environment Agency (England),
● Natural Resources Wales,
● Scottish Environment Protection Agency,
● Rivers Agency (Northern Ireland).
http://nrfa.ceh.ac.uk/
Point & click (GUI) vs programmatic (API) data retrieval
GUI
PROS: simple and intuitive
CONS: not scalable, not flexible
API
PROS: scalable, fast and flexible
CONS: requires programming skills
Application Programming Interface
[Diagram: the API sits between the USER/CLIENT and the SERVER]
The NRFA’s API
▣ metadata catalogue,
▣ catalogue filters,
▣ time series of gauged daily data,
▣ time series of catchment monthly rainfall.
How does an API work?
server/format/service?X=1&Y=2&Z=3
QUESTION A:
How do I get information on station “18019” from the NRFA catalogue?
ANSWER:
nrfaapps.ceh.ac.uk/nrfa/json/stationSummary?db=nrfa_public&stn=18019
QUESTION B:
How do I get the time series of gauged daily data for station “18019”?
ANSWER:
nrfaapps.ceh.ac.uk/nrfa/xml/waterml2?db=nrfa_public&stn=18019&dt=gdf
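The server/format/service pattern can also be assembled programmatically. The following is a minimal Python sketch that rebuilds the two answers from the server, format, service and parameter names shown on the slides. The helper name `build_api_url` is hypothetical, the slides give no URL scheme (one would need to be prepended for a real request), and the endpoint may have changed since the talk, so the sketch only builds the strings and sends nothing.

```python
from urllib.parse import urlencode

# Server path exactly as shown on the slides (no scheme is given there).
NRFA_SERVER = "nrfaapps.ceh.ac.uk/nrfa"


def build_api_url(server, fmt, service, **params):
    """Assemble server/format/service?X=1&Y=2&Z=3 (hypothetical helper)."""
    query = urlencode(params)  # keyword order is preserved in the query string
    return f"{server}/{fmt}/{service}?{query}"


# Question A: information on station 18019 from the catalogue, as JSON.
url_a = build_api_url(NRFA_SERVER, "json", "stationSummary",
                      db="nrfa_public", stn="18019")

# Question B: gauged daily data (dt=gdf) for the same station, as XML.
url_b = build_api_url(NRFA_SERVER, "xml", "waterml2",
                      db="nrfa_public", stn="18019", dt="gdf")

print(url_a)
print(url_b)
```

Keeping the parameters as keyword arguments means the same helper serves any station or service: only the values change between requests.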
From machine-readable to human-readable formats
JSON
XML
Plain text
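As a rough illustration of this conversion, the Python sketch below turns a JSON record and an XML series into plain text. Both payloads are invented for the example: the field and tag names are assumptions and do not reproduce the real NRFA or WaterML 2.0 schemas.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified payloads: field and tag names are illustrative
# assumptions, not the real NRFA or WaterML 2.0 schemas.
SAMPLE_JSON = '{"id": "18019", "name": "Example station", "river": "Example river"}'
SAMPLE_XML = """
<series station="18019">
  <point date="2015-01-01" value="1.25"/>
  <point date="2015-01-02" value="1.10"/>
</series>
"""


def json_to_text(payload):
    """Render a flat JSON object as aligned 'key : value' lines."""
    record = json.loads(payload)
    width = max(len(key) for key in record)
    return "\n".join(f"{key.ljust(width)} : {value}"
                     for key, value in record.items())


def xml_to_rows(payload):
    """Extract (date, value) pairs from the simplified XML series."""
    root = ET.fromstring(payload.strip())
    return [(p.get("date"), float(p.get("value"))) for p in root.iter("point")]


print(json_to_text(SAMPLE_JSON))
for date, value in xml_to_rows(SAMPLE_XML):
    print(f"{date}\t{value:.2f}")
```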
R libraries to interface APIs
▣ raincpc: download and process the Climate Prediction Center's
(CPC) daily rainfall data
▣ rnoaa: an interface to NOAA Climate data API
▣ soilDB: read data from USDA-NCSS soil databases.
▣ waterData: retrieve, analyse, and calculate anomalies of daily
hydrologic time series data.
▣ rnrfa: an interface to the UK National River Flow Archive data API.
3.
Demo
The R package rnrfa
API interface:
▣ make request
▣ parse response
▣ retrieve and filter metadata catalogue
▣ get time series of gauged daily data and catchment monthly
rainfall
API interface + external libraries:
▣ make maps
▣ create interactive tables and plots
▣ simplify and speed up reporting!
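The request, parse and filter steps listed above can be sketched in a few lines. This is a Python stand-in for illustration only, not the rnrfa package's actual R functions: the function names, the sample records and their field names ("id", "operator") are all hypothetical, and the HTTP call is replaced by a local stub so the sketch runs offline.

```python
import json


def get_catalogue(fetch, url):
    """Make the request via `fetch` (any callable: url -> text), parse JSON."""
    return json.loads(fetch(url))


def filter_catalogue(stations, **criteria):
    """Keep stations whose metadata match every key=value criterion."""
    return [s for s in stations
            if all(s.get(key) == value for key, value in criteria.items())]


def fake_fetch(url):
    """Offline stand-in for an HTTP call; the records are hypothetical."""
    return json.dumps([
        {"id": "A1", "operator": "Natural Resources Wales"},
        {"id": "B2", "operator": "Environment Agency"},
        {"id": "C3", "operator": "Natural Resources Wales"},
    ])


# The URL is the generic pattern from the slides, used as a placeholder.
catalogue = get_catalogue(fake_fetch, "server/format/service?X=1&Y=2&Z=3")
welsh = filter_catalogue(catalogue, operator="Natural Resources Wales")
print([s["id"] for s in welsh])
```

Passing the fetch function in as an argument keeps the parsing and filtering logic testable without a network connection.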
Example of dynamic report
▣ Find all the stations operated by Natural Resources Wales
▣ Retrieve time series of daily flows
▣ Run a basic analysis
▣ Create interactive plot, table and map
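The slide does not say what the basic analysis consists of. As one possible illustration, the Python sketch below averages daily flows by month. The data values and the unit (m3/s) are assumptions made for the example, and the interactive plot, table and map steps are left out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily flows (date -> flow, assumed m3/s) for one station; in
# the real workflow these would come from the gauged daily data service.
daily_flows = {
    "2015-01-01": 2.0, "2015-01-02": 4.0,
    "2015-02-01": 1.0, "2015-02-02": 2.0, "2015-02-03": 3.0,
}


def monthly_means(flows):
    """Group daily values by 'YYYY-MM' and average each group."""
    groups = defaultdict(list)
    for date, value in flows.items():
        groups[date[:7]].append(value)
    return {month: mean(values) for month, values in sorted(groups.items())}


summary = monthly_means(daily_flows)
for month, value in summary.items():
    print(f"{month}  mean flow = {value:.2f}")
```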
4.
Conclusions
Summary
Big Data
Large volumes of heterogeneous spatio-temporal data are becoming
increasingly open in the hydrology domain.
GUIs vs APIs
GUIs may be the easiest way
to browse data but not the
most efficient. APIs are fast
and scalable.
Hardware/software
The hardware & software burden is on the data provider side.
No need to update your datasets: you always access the latest version.
R as interface
R is an easy-to-learn
language, widely used by
statisticians and scientists. It
provides a number of libraries
to obtain and parse data from
the web.
Reproducible workflows
Query databases, filter
information, convert
coordinates, generate plots
and maps for reproducible
reporting.
Scalability & Interoperability
An approach to gather
information for single as well
as multiple sites. At larger
scale, computing can be
made more efficient by using
cloud facilities.
Thanks!
Any questions?
Claudia Vitolo
Twitter: @clavitolo
Email: claudia.vitolo@gmail.com
Blog: http://claudiavitolo.com/
