SlideShare a Scribd company logo
1 of 68
Software for Improving Scientific Data
Access Infrastructure
Russ Rew
Unidata Program Center
Overview
• Problems in scientific data management
• Some efforts toward finding solutions
– Distributing near-real time data
– A data model for the earth sciences
– Advancing a metadata standard for the earth
system science community
– Serving metadata and data
– Visualizing and analyzing geoscience data
• Thoughts on the value of infrastructure
Thanks to:
• GFD Dennou Club and
members who visited
Unidata in 2004
• Research Institute for
Sustainable
Humanosphere
• The Japanese
Meteorological Agency
• National Science
Foundation and UCAR
• Unidata Program Center
staff and associated
community
Masato Shiotani, Yasuhiro Morikawa, Masaki
Ishiwata, Russ Rew, Takeshi Horinouchi,
Ethan Davis, and Yoshi-Yuki Hayashi
Unidata
• Funded primarily by the U.S. National Science
Foundation
• Mission: To provide data, tools, and community
leadership for improving Earth-system
education and research
• At the Unidata Program Center, we
– Provide access to data (via push and pull systems)
– Develop open source tools and infrastructure for
data access, analysis, visualization, and data
management
– Support users of our technologies: faculty, students,
and researchers
– Help to build, represent, and advocate for a
community
Background
• Science increasingly advances through
collaborations and synthesis of data across
disciplines
• Tools are not keeping up with need to analyze
and combine data from different disciplines
• We need to continue improving tools for data
distribution, data modeling, metadata, remote
access, and visualization
– Within geosciences
– Across scientific disciplines
– For collaborations, including international scale
Problems in the Current Scientific
Data Landscape
• Data volumes are approximately doubling every year
• Data tools are not keeping pace with data volumes
• Very large datasets demand new techniques for data management
• Integrating data across organizations and disciplines is very difficult
• Input/Output bandwidth is not keeping up with storage capacity
• No single data model has achieved critical mass in the scientific
community, each discipline has their own
• There are many metadata models for each discipline and a lack of
common conventions for metadata
Summarized from
Scientific Data Management in the Next Decade,
by Jim Gray, David T. Liu, Maria Nieto-Santisteban, Alex Szalay, …
CT Watch Quarterly, February 2005
www.ctwatch.org
Some Earth Science Data
Characteristics
• Large multidimensional arrays from forecast models
• Coordinate systems to access data by region and time
• Need for near-real time data
• Read-only or read-mostly access to existing data
• Security often a minor issue: data is usually freely
available and shared, but data systems must
– Protect data from accidental or malicious changes
– Provide restricted access for a few data collections
– Protect from denial of service caused by users asking for too
much
Five Areas of Unidata Involvement
1. Distributing near-real time data
2. A data model for the earth sciences
3. Advancing a metadata standard for the
earth system science community
4. Serving metadata and data
5. Visualizing and analyzing geoscience data
Distributing Near-Real Time Data
• Unidata’s Internet Data Distribution system
– Delivers near-real time data: model outputs,
surface, radar, upper-air, satellite observations,
lightning, aircraft, observations from aircraft,
observations from mesoscale networks, …
– A collaboration of universities and other institutions
– No data center, data products are injected from
multiple sources
– Unidata’s part includes development of client-server
software, providing support, training workshops,
coordination, negotiating data agreements
Why Not Just Use FTP?
• Designed to send whole files, not many small
products
• If client pulls data from server, delays result from
repeatedly asking server “is new data available yet?”
• If server pushes data to clients, server must maintain
connection information and state for each client
• FTP can be slow for sending many small data
products
• In spite of these shortcomings, FTP is still used
successfully in many data distribution systems, when:
– Delay not an issue
– Small number of clients per server
– Only large products in whole files are distributed
Unidata’s LDM (Local Data Manager)
• An alternative to FTP for data distribution
• Protocols and client-server software for capturing, distributing, and organizing
data in near-real time using reliable, event-driven data distribution
• Supports subscriptions to subsets of data feeds
• Suitable for pushing many small products, as well as large products
• Highly configurable: can inject, distribute, capture, filter, and process arbitrary
data products
• Requires Unix system
• Heart of the Internet Data Distribution system
Source
Source
Source
LDM
Internet
LDM
LDM
LDM
LDM
LDM
LDM
LDM
LDM
Pushes data from multiple sources using cooperating LDMs
Over 170 institutions on 5 continents and growing
IDD (Internet Data Distribution)
In the Beginning...
“a dizzying volume of information –
on the order of 100 MBytes/day”
(AMS paper on LDM-2, Davis and
Rew, 1990)
Real-Time Data Flows
Now
• 30 data feeds provide radar, satellite, text
bulletins, lightning, model forecasts, surface
and upper air observations, …
• LDM-6 commonly handles 5 GB/hour input,
with as many as 140,000 products/hour
• LDM-6 was recently selected for data
collection for the THORPEX Interactive
Grand Global Ensemble (TIGGE)
• A cluster LDM configuration can handle 400
downstream connections
• Currently over 300 machines at 170 sites run
LDM-6 continuously
• Redundant feeds support reliability in case
of failure of “upstream” machines or network
• The National Weather Service uses LDM-6
to collect and relay NEXRAD level 2 radar
data operationally for over 150 radars
Participants
United States
Canada
Puerto Rico
Costa Rica
Barbados
Venezuela
Chile
Brazil
Argentina
England
Portugal
Spain
Austria
Russia
Vietnam
China (Hong
Kong)
South Korea
Antarctica
(incipient)
Unidata IDD
North American data
delivery and sharing
network
IDD-Brasil
South American peer of
North American IDD
IDD-Caribe (planning)
Central American peer of
North American IDD
Antarctic-IDD
Support of US Antarctic
research community
IDD 2007
Extract from 2006 American
Meteorological Society presentation
Data Products from
CPTEC available on
the IDD-Brasil
W.G. Almeida, A.A. Lima, A.S. Pessoa, A. L. T. Ferreira, A. Bonfin, M. V.
Mendes, Ferreira, S. H. - CPTEC/INPE
G. O. Chagas, D. G. Coelho, M.G. Justi - UFRJ
T. Yoksas - Unidata/UCAR
 Began as a collaboration under Meteoforum project.
 Participants: Unidata Program Center/UCAR, CPTEC/INPE,
UFRJ, UFPA and USP
 Inaugurated in January of 2004 with 4 nodes
The IDD-Brasil
IDD-Brasil Participation
 The working paradigm for IDD-Brasil is:
 You get free access to global data, tools and support
 You give free access to your data-sets, provide infrastructure
and support
 Free data access and cooperation is a major topic of discussion in
every Brazilian meteorological meeting
CPTEC’s Data ingesting
in IDD-Brasil
 ETA regional Model, 40 Km resolution (operational)
 Automatic data-collecting network (operational)
 GOES satellite imagery, full-resolution for South America
(under testing)
 Global T213 model (under testing)
 Ensemble T126 Global model – 15 members (under
testing)
INPE´s automatic network:
Data Collecting Platforms
 More than 524 automated stations from INPE and
cooperating Institutions
 50 Stations reporting Atmospheric Pressure
 These were the first new data shared through the IDD-
Brasil, soon they will be also on GTS (in BUFR format)
 Location of all
INPE data
collecting
platforms
(PCD)
Conclusions I:
 The IDD extension to Brazil (IDD-Brasil) is changing
Brazilian Meteorology through:
 Easier access to Global Data
 Free availability of good analysis tools
 Spreading ideas and practices
 Free Data Sharing
 Cooperation and mutual support
Conclusions II:
 The IDD-Brasil shows a sharply growing rate
 Today Brazil is the largest international IDD-user
community
 Numerical models of Brazilians institutions distributed by
the IDD/IDD-Brasil are easily available to national and
international users.
 The data from several Brazilians mesonets may be
distributed by the IDD/IDD-Brasil
 These data are not available on GTS
 They are very important because the data network
in South America is sparse.
 As a result of IDD expansion to Brazil, more Brazilian
data are becoming available for International
community.
Conclusions III:
Contact Information
Waldenio Gambi de Almeida gambi@cptec.inpe.br
CPTEC/INPE cptec.inpe.br
Maria G.A. Justi da Silva justi@igeo.ufrj.br
UFRJ www.meteorologia.ufrj.br
Tom Yoksas yoksas@unidata.ucar.edu
Unidata/UCAR www.unidata.ucar.edu
1. Distributing near-real time data
2. A data model for the earth sciences
3. Advancing a metadata standard for the
earth system science community
4. Serving metadata and data
5. Visualizing and analyzing geoscience data
How Adequate is the Relational
Database Model for Scientific Data?
• Designed and optimized for
– Data in tables
– Online transaction processing systems
– Other business and enterprise problems
• Successful in Geospatial Information Systems integration, such as
ESRI ArcGIS
• Also very successful in aother disciplines, such as astronomy
(Sloan Digital Sky Survey, U.S. National Virtual Observatory)
• Not adequate for earth sciences data
– N-dimensional arrays
– Event-oriented systems, such as sensor webs or high-speed data
streams
– Indexing unstructured data, for example metadata in XML form
– Supporting scientific analysis and visualization tools
Open Questions in Modeling
Scientific Data
• For how wide a realm is the relational database model
adequate?
• Is any data model that unifies data collections from
many disciplines too complex to be useful?
• Can one scientific data model be useful across many
scientific disciplines?
• Is one scientific data model even practical for earth
sciences?
Network Common Data Form
• A simple data model for scientific datasets
• A format for portable, self-describing data
• A programming library that uses efficient direct
access and efficient subsetting of
multidimensional arrays
• Several programming interfaces: C, Fortran,
C++, Java, Python, Perl, Ruby, ...
• Support for appending, sharing, and archiving
data
The NetCDF-3 Data Model
Limitations of the NetCDF-3 Data Model
• Too simple to represent some data structures and
relationships
• No real data structures, just scalars and
multidimensional arrays
• No “ragged arrays” or nested structures
• Only one shared unlimited dimension, along which data
can be appended
• A flat name space for dimensions and variables
• No strings, just arrays of characters
• A limited set of numeric types
• Only ASCII characters in names
The NetCDF-4 Data Model
NetCDF’s Future
• NetCDF-4 integrates netCDF with HDF5, another major
standard format and data model
• Parallel netCDF has proved suitable for high-
performance computing
• NetCDF-4 data model (CDM) improves interoperability
with other scientific data representations
• NetCDF-Java has advanced features, including access
to remote data
NetCDF-4 Features
• User-defined compound types
(portable structs)
• User-defined variable-length
types
• Groups for nested scopes
• Multiple unlimited dimensions
• String type
• Additional numeric types
• Unicode names
• Efficient dynamic
schema changes
• Multidimensional tiling
(chunking)
• Per variable
compression
• Parallel I/O
• Reader-Makes-Right
conversion
Address limitations of netCDF-3
The Unidata Common Data Model
• For a common subset of abstractions in
OPeNDAP, HDF5, and netCDF-4
• User-defined compound types (portable structs)
• User-defined variable-length types
• Groups for nested scopes
• String type
• Additional numeric types
• Prototype implemented in netCDF-Java
• Attempts a balance between simplicity and
power of representation
NetCDF-Java
• 100% Java library has advances compared to
C-based interfaces
• Prototype implementation of Common Data
Model for access to netCDF-4, OPeNDAP,
HDF5
– Provides netCDF interfaces to other formats: Grids
(GRIB1, GRIB2), Radar (NEXRAD, NIDS,
DORADE), Satellite (DMSP, GINI), Point
Observations (BUFR)
– Provides uniform coordinate systems layer
• Includes access to THREDDS inventory
catalogs
Common Data Model
Coordinate Systems
Common Data Access Model
Scientific Datatypes
Grid
Point
Radial
Trajectory
Swath
Station
Application
Application
Applications
THREDDS HDF5 netCDF
GRIB
OPeNDAP ...
1. Distributing near-real time data
2. A data model for the earth sciences
3. Advancing a metadata standard for the
earth system science community
4. Serving metadata and data
5. Visualizing and analyzing geoscience data
Some Metadata Issues
• Every scientific discipline has their own metadata
standard. Is any convergence likely?
• How do you choose among the multiple candidates for
metadata standards?
• How can metadata be improved for existing data
without rewriting the data?
Climate and Forecast (CF)
Conventions
• A widely used metadata standard for atmospheric,
ocean, and climate data, based on netCDF
• Specifies coordinate systems used in models, data cell
properties and methods, packing, standard names for
quantities, and grid mappings
• CF-aware software can automatically determine space-
time location of data variables
• Originally intended for climate model output
conventions, but use has broadened to weather and
ocean models and observational data
• Community governance structure now in place for
maintaining and advancing the CF conventions, WMO
Working Group on Coupled Modeling (WGCM)
Libcf
• Purpose: to ease the creation and use of
datasets conforming to the CF
Conventions
• In early stages of development and
testing
• C and Fortran interfaces available from
Unidata in alpha release
Udunits (Unidata Units)
• Library for manipulating units of physical
qualities.
– Conversion of unit specifications between formatted
and binary forms
– Arithmetic manipulation of unit specifications
– Conversion of values between compatible scales of
measurement
• C, Fortran, and Java interfaces
• Required by CF conventions
• May soon be available as part of netCDF
release
NcML (NetCDF Markup Language)
• An XML representation of netCDF metadata, similar to
CDL
• A schema language for Earth science data
– To get NcML from netCDF data, use ncdump -x or Java ToolsUI
program
– To create netCDF from NcML, use ToolsUI or (eventually)
ncgen
• Provides a way to add to or change metadata without
rewriting, by referencing and overriding metadata in a
file
• Also supports aggregation of multiple files
1. Distributing near-real time data
2. A data model for the earth sciences
3. Advancing a metadata standard for the
earth system science community
4. Serving metadata and data
5. Visualizing and analyzing geoscience data
• Open-source Project for a Network Data Access
Protocol, see opendap.org
• A discipline-neutral protocol to get remote scientific data
and metadata (not files)
• Allows requests for subsets and aggregations
• Software reference implementations for many kinds of
data: netCDF, SQL (databases), HDF, FITS, JGOFS,
• In use in earth sciences, astronomy, medicine, …
• Serves IPCC model output
Client Server
Data
Data
Data
FTP (File Transfer) versus OPeNDAP
for Access to Remote Data
• FTP accesses only whole files
• OPeNDAP includes services for
– Selected subset of data from a file
– Aggregation of data in multiple files
– Selected subset of aggregated data
• Protocol uses URLs and HTTP
• Unidata provides OPeNDAP support
• Several OPeNDAP servers available: pyDAP, FDS,
GDS, DAPPER, THREDDS Data Server
• OPeNDAP clients include: Ferret, GrADS, Matlab, IDL,
ArcGIS, netCDF-Java, IDV
• OPeNDAP version 2 now a NASA standard
• Version 4 under development with a test version
available: adds XML, new types, new functions,
THREDDS catalogs, SOAP, outputs in HTML and ASCII
• Will add authentication, more server-side processing
Thematic Real-time Environmental
Distributed Data Services (THREDDS)
• For data providers, implements data catalogs to present
to users and applications
• Catalogs are XML documents (metadata) describing
and pointing to datasets accessible via client/server
protocols (OPeNDAP, ADDE, WCS, HTTP)
• Datasets may be found by discovery centers (master
directories, digital libraries, data portals) via catalogs
• Catalog hierarchy provides places to hang common
metadata
• Unidata coordinates THREDDS activities, community
implements servers
• Many partners as data providers, tool builders,
interoperability experts from academia, government,
industry
Motherlode Portal Catalog of Catalogs
NCDC Server
NCEP NAM Individual Run
THREDDS Data Server (TDS)
• Serves data, THREDDS catalogs, and metadata
• Reads and serves several kinds of data through a
uniform CDM interface: netCDF, OPeNDAP, HDF5,
GRIB, NEXRAD, …
• Adds Earth-location coordinate systems to data
• Provides OPeNDAP access and subsetting of any data
readable with NetCDF-Java library
• An integrated server provides data access through the
OpenGIS Consortium Web Coverage Service
(OGC/WCS)
• Easy to install, 100% Java, freely available
• Supports dynamic generation of catalogs
HTTP Tomcat Server
THREDDS Data Server
Datasets
Catalog.xml
hostname.edu
THREDDS Server
Application
NetCDF-Java
library
IDD Data
•OPeNDAP
•HTTPServer
•WCS
1. Distributing near-real time data
2. A data model for the earth sciences
3. Advancing a metadata standard for the
earth system science community
4. Serving metadata and data
5. Visualizing and analyzing geoscience
data
Integrated Data Viewer (IDV)
• Unidata’s newest
scientific analysis and
visualization tool
• Freely available 100%
Java framework and
reference application
• Provides 2- and 3-D
displays of geoscience
data
• Stand-alone or
networked application
• Integrates data from
different sources
• Provides End-to-end
test for technologies
Some IDV Features
• Client-server data access
from remote systems
• Suite of data probes for
interactive exploration (slice
and dice)
• Animations (temporal and
spatial)
• HTML interface for pedagogic
materials
• XML configuration and
bundling allows collaboration
with other educators
• Java-based framework
supports Extensions built via
plug-ins: for example,
geosciences network (GEON)
solid earth community
Catalog of catalogs in IDV
(Catalog from within a Client)
Summary and Tentative Conclusions
• Database “One Size Fits All” databases are not a
comprehensive solution for scientific data or metadata.
• Similarly, the old file-FTP approach is running out of
steam for distributed data systems.
• There is a limited time for establishing interdisciplinary
data tools and services, before each discipline
crystallizes on its own solutions.
• Islands of non-interoperability that result would be
unfortunate.
• Flexibility to be ready adapt to better solutions is
required, maybe even before the best solutions are
evident.
A Few Last Thoughts on
Infrastructure …
What is Infrastructure?
• The basic facilities, services, and
installations needed for the functioning of
a community
– Utilities: water and power lines
– Transportation and communications systems
• Good infrastructure is reliable, sturdy,
useful, long lasting, standardized, widely
used, and invisible
Infrastructure: Stones in a Wall
• Higher layers are
built on lower layers
• Stones may be
replaced with other
stones of similar size
and shape
• From the top, lower
layers are invisible
Cyberinfrastructure: the Middle Layers
Base Technology: computation, storage, communication
Networking, Operating Systems, Middleware
High
performance
computation
services
Data,
information,
Knowledge
management
services
Observation,
measurement
fabrication
services
Interfaces,
visualization
services
Collaboration
services
Community-Specific Knowledge Environments for Research
and Education (collaboratories, grid community,
e-science community, virtual community)
Customization for discipline-and project-specific applications
From the “Atkins report” on Cyberinfrastructure
Is Developing Infrastructure Rewarding?
• It’s abstract, so hard to explain at a party
• You can’t take a picture or movie about it
• If it works well, it is invisible
• End users are often not aware of it
• It doesn’t get referenced in scientific papers
• It can be expensive to evolve and support
• If not maintained, it eventually crumbles
• You can’t sell it, so you have to give it away
Earth Science Infrastructure: Bricks
in a Wall of Acronyms
XML C Fortran Java
Unix HTTP
CDL
NetCDF Udunits GRIB
NcML
Unidata
decoders
CF
OGC
WCS
HDF5
OPeNDAP
LDM
NetCDF
Java
Libcf NetCDF-4
CDM TDS
IDD
CONDUIT
project
LEAD
project
IDV GEMPAK McIDAS
GML
CSML
BUFR
GrADS
ArcGIS Ferret
HDF4
Developed
by Unidata
THREDDS
Involvement
by Unidata
GALEON
project
NCO
Other
technologies
SQL
Python,
Ruby, …
VisAD
ADDE
Visible and Invisible Infrastructure
CDL
NetCDF Udunits GRIB
NcML
Unidata
decoders
CF
OGC
WCS
HDF5
OPeNDAP
LDM
netCDF
Java
Libcf NetCDF-4
CDM TDS
IDD
CONDUIT
project
LEAD
project
IDV GEMPAK McIDAS
GML
CSML
BUFR
GrADS
ArcGIS Ferret
HDF4
THREDDS
GALEON
project
NCO
Visible to End Users:
“Cloak of invisibility”
XML C Fortran Java
Unix HTTP SQL
Python,
Ruby, …
VisAD
ADDE
What Is Good Infrastructure?
• Provides a useful service
• Makes abstractions at the right level
• Cloaks invisible details with a simple interface
• Binds loosely to other infrastructure
• Behaves reliably
• Adapts easily to changes
An Example of Great Infrastructure:
Popular Programming Languages
• Base of huge collection of higher layers of infrastructure
• People continue to build on top of this infrastructure
• The opportunity to create a long-lasting and popular programming
language is rare
• Jim Backus (Fortran), John McCarthy (Lisp), Dennis Ritchie (C),
Bjarne Stroustrup (C++), James Gosling (Java), Yukihiro “Matz”
Matsumoto (Ruby)
• Other great infrastructures: Unix, TCP/IP, HTTP, …
Rewards of Developing
Infrastructure?
• It “raises the level” for other developers
• Beautiful and useful new layers and
applications are built on top of it
• You can feel a part of everything it supports
• If it’s long lasting and widely used, you have
made a difference for future generations
• So, it’s one way to get closer to immortality
• Infrastructure is abstract, but rewards can also
be real
• … like this trip to Japan!
For More Information
http://www.unidata.ucar.edu/
support@unidata.ucar.edu
russ@ucar.edu

More Related Content

Similar to jamstec-rew.ppt

Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu | Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu | EUDAT
 
Service and Support for Science IT -Peter Kunzst, University of Zurich
Service and Support for Science IT-Peter Kunzst, University of ZurichService and Support for Science IT-Peter Kunzst, University of Zurich
Service and Support for Science IT -Peter Kunzst, University of ZurichMind the Byte
 
Ariadne: Data Management Planning
Ariadne: Data Management PlanningAriadne: Data Management Planning
Ariadne: Data Management Planningariadnenetwork
 
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECAProject
 
Introduction to Data Management Planning at Alien Challenge COST workshop
Introduction to Data Management Planning at Alien Challenge COST workshopIntroduction to Data Management Planning at Alien Challenge COST workshop
Introduction to Data Management Planning at Alien Challenge COST workshopAaike De Wever
 
Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Dan Taylor
 
big_data_casestudies_2.ppt
big_data_casestudies_2.pptbig_data_casestudies_2.ppt
big_data_casestudies_2.pptvishal choudhary
 
Creating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant ApplicationCreating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant ApplicationHistoric Environment Scotland
 
Creating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant ApplicationCreating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant ApplicationEDINA, University of Edinburgh
 
Open Access to Research Data: Challenges and Solutions
Open Access to Research Data: Challenges and SolutionsOpen Access to Research Data: Challenges and Solutions
Open Access to Research Data: Challenges and SolutionsMartin Donnelly
 
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...datacite
 
Meeting the NSF DMP Requirement June 13, 2012
Meeting the NSF DMP Requirement June 13, 2012Meeting the NSF DMP Requirement June 13, 2012
Meeting the NSF DMP Requirement June 13, 2012IUPUI
 
Open Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon HodsonOpen Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon HodsonAfrican Open Science Platform
 
Big Data Brown Bag
Big Data Brown BagBig Data Brown Bag
Big Data Brown Bagusmanqureshi
 

Similar to jamstec-rew.ppt (20)

Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu | Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
 
Service and Support for Science IT -Peter Kunzst, University of Zurich
Service and Support for Science IT-Peter Kunzst, University of ZurichService and Support for Science IT-Peter Kunzst, University of Zurich
Service and Support for Science IT -Peter Kunzst, University of Zurich
 
Ariadne: Data Management Planning
Ariadne: Data Management PlanningAriadne: Data Management Planning
Ariadne: Data Management Planning
 
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
 
Introduction to Data Management Planning at Alien Challenge COST workshop
Introduction to Data Management Planning at Alien Challenge COST workshopIntroduction to Data Management Planning at Alien Challenge COST workshop
Introduction to Data Management Planning at Alien Challenge COST workshop
 
Baker - Evolution of Data Products and Designated Audiences
Baker - Evolution of Data Products and Designated AudiencesBaker - Evolution of Data Products and Designated Audiences
Baker - Evolution of Data Products and Designated Audiences
 
RDA Governance
RDA GovernanceRDA Governance
RDA Governance
 
Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2
 
Data 101: A Gentle Introduction
Data 101: A Gentle IntroductionData 101: A Gentle Introduction
Data 101: A Gentle Introduction
 
big_data_casestudies_2.ppt
big_data_casestudies_2.pptbig_data_casestudies_2.ppt
big_data_casestudies_2.ppt
 
Creating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant ApplicationCreating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant Application
 
Creating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant ApplicationCreating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant Application
 
Open Access to Research Data: Challenges and Solutions
Open Access to Research Data: Challenges and SolutionsOpen Access to Research Data: Challenges and Solutions
Open Access to Research Data: Challenges and Solutions
 
RDM@Edinburgh
RDM@EdinburghRDM@Edinburgh
RDM@Edinburgh
 
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
 
DBMS
DBMSDBMS
DBMS
 
Meeting the NSF DMP Requirement June 13, 2012
Meeting the NSF DMP Requirement June 13, 2012Meeting the NSF DMP Requirement June 13, 2012
Meeting the NSF DMP Requirement June 13, 2012
 
Open Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon HodsonOpen Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon Hodson
 
Big Data Brown Bag
Big Data Brown BagBig Data Brown Bag
Big Data Brown Bag
 
RDM@Edinburgh
RDM@EdinburghRDM@Edinburgh
RDM@Edinburgh
 

Recently uploaded

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 

Recently uploaded (20)

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 

jamstec-rew.ppt

  • 1. Software for Improving Scientific Data Access Infrastructure Russ Rew Unidata Program Center
  • 2. Overview • Problems in scientific data management • Some efforts toward finding solutions – Distributing near-real time data – A data model for the earth sciences – Advancing a metadata standard for the earth system science community – Serving metadata and data – Visualizing and analyzing geoscience data • Thoughts on the value of infrastructure
  • 3. Thanks to: • GFD Dennou Club and members who visited Unidata in 2004 • Research Institute for Sustainable Humanosphere • The Japanese Meteorological Agency • National Science Foundation and UCAR • Unidata Program Center staff and associated community Masato Shiotani, Yasuhiro Morikawa, Masaki Ishiwata, Russ Rew, Takeshi Horinouchi, Ethan Davis, and Yoshi-Yuki Hayashi
  • 4. Unidata • Funded primarily by the U.S. National Science Foundation • Mission: To provide data, tools, and community leadership for improving Earth-system education and research • At the Unidata Program Center, we – Provide access to data (via push and pull systems) – Develop open source tools and infrastructure for data access, analysis, visualization, and data management – Support users of our technologies: faculty, students, and researchers – Help to build, represent, and advocate for a community
  • 5. Background • Science increasingly advances through collaborations and synthesis of data across disciplines • Tools are not keeping up with need to analyze and combine data from different disciplines • We need to continue improving tools for data distribution, data modeling, metadata, remote access, and visualization – Within geosciences – Across scientific disciplines – For collaborations, including international scale
  • 6. Problems in the Current Scientific Data Landscape • Data volumes are approximately doubling every year • Data tools are not keeping pace with data volumes • Very large datasets demand new techniques for data management • Integrating data across organizations and disciplines is very difficult • Input/Output bandwidth is not keeping up with storage capacity • No single data model has achieved critical mass in the scientific community, each discipline has their own • There are many metadata models for each discipline and a lack of common conventions for metadata Summarized from Scientific Data Management in the Next Decade, by Jim Gray, David T. Liu, Maria Nieto-Santisteban, Alex Szalay, … CT Watch Quarterly, February 2005 www.ctwatch.org
  • 7. Some Earth Science Data Characteristics • Large multidimensional arrays from forecast models • Coordinate systems to access data by region and time • Need for near-real time data • Read-only or read-mostly access to existing data • Security often a minor issue: data is usually freely available and shared, but data systems must – Protect data from accidental or malicious changes – Provide restricted access for a few data collections – Protect from denial of service caused by users asking for too much
  • 8. Five Areas of Unidata Involvement 1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
  • 9. Distributing Near-Real Time Data • Unidata’s Internet Data Distribution system – Delivers near-real time data: model outputs, surface, radar, upper-air, satellite observations, lightning, aircraft, observations from aircraft, observations from mesoscale networks, … – A collaboration of universities and other institutions – No data center, data products are injected from multiple sources – Unidata’s part includes development of client-server software, providing support, training workshops, coordination, negotiating data agreements
  • 10. Why Not Just Use FTP? • Designed to send whole files, not many small products • If client pulls data from server, delays result from repeatedly asking server “is new data available yet?” • If server pushes data to clients, server must maintain connection information and state for each client • FTP can be slow for sending many small data products • In spite of these shortcomings, FTP is still used successfully in many data distribution systems, when: – Delay not an issue – Small number of clients per server – Only large products in whole files are distributed
  • 11. Unidata’s LDM (Local Data Manager) • An alternative to FTP for data distribution • Protocols and client-server software for capturing, distributing, and organizing data in near-real time using reliable, event-driven data distribution • Supports subscriptions to subsets of data feeds • Suitable for pushing many small products, as well as large products • Highly configurable: can inject, distribute, capture, filter, and process arbitrary data products • Requires Unix system • Heart of the Internet Data Distribution system
  • 12. Source Source Source LDM Internet LDM LDM LDM LDM LDM LDM LDM LDM Pushes data from multiple sources using cooperating LDMs Over 170 institutions on 5 continents and growing IDD (Internet Data Distribution)
  • 13. In the Beginning... “a dizzying volume of information – on the order of 100 MBytes/day” (AMS paper on LDM-2, Davis and Rew, 1990) Real-Time Data Flows Now • 30 data feeds provide radar, satellite, text bulletins, lightning, model forecasts, surface and upper air observations, … • LDM-6 commonly handles 5 GB/hour input, with as many as 140,000 products/hour • LDM-6 was recently selected for data collection for the THORPEX Interactive Grand Global Ensemble (TIGGE) • A cluster LDM configuration can handle 400 downstream connections • Currently over 300 machines at 170 sites run LDM-6 continuously • Redundant feeds support reliability in case of failure of “upstream” machines or network • The National Weather Service uses LDM-6 to collect and relay NEXRAD level 2 radar data operationally for over 150 radars
  • 14. Participants United States Canada Puerto Rico Costa Rica Barbados Venezuela Chile Brazil Argentina England Portugal Spain Austria Russia Vietnam China (Hong Kong) South Korea Antarctica (incipient) Unidata IDD North American data delivery and sharing network IDD-Brasil South American peer of North American IDD IDD-Caribe (planning) Central American peer of North American IDD Antarctic-IDD Support of US Antarctic research community IDD 2007
  • 15. Extract from 2006 American Meteorological Society presentation Data Products from CPTEC available on the IDD-Brasil W.G. Almeida, A.A. Lima, A.S. Pessoa, A. L. T. Ferreira, A. Bonfin, M. V. Mendes, Ferreira, S. H. - CPTEC/INPE G. O. Chagas, D. G. Coelho, M.G. Justi - UFRJ T. Yoksas - Unidata/UCAR
  • 16.  Began as a collaboration under Meteoforum project.  Participants: Unidata Program Center/UCAR, CPTEC/INPE, UFRJ, UFPA and USP  Inaugurated in January of 2004 with 4 nodes The IDD-Brasil
  • 17. IDD-Brasil Participation  The working paradigm for IDD-Brasil is:  You get free access to global data, tools and support  You give free access to your data-sets, provide infrastructure and support  Free data access and cooperation is a major topic of discussion in every Brazilian meteorological meeting
  • 18. CPTEC’s Data ingesting in IDD-Brasil  ETA regional Model, 40 Km resolution (operational)  Automatic data-collecting network (operational)  GOES satellite imagery, full-resolution for South America (under testing)  Global T213 model (under testing)  Ensemble T126 Global model – 15 members (under testing)
  • 19. INPE´s automatic network: Data Collecting Platforms  More than 524 automated stations from INPE and cooperating Institutions  50 Stations reporting Atmospheric Pressure  These were the first new data shared through the IDD- Brasil, soon they will be also on GTS (in BUFR format)
  • 20.  Location of all INPE data collecting platforms (PCD)
  • 21. Conclusions I:  The IDD extension to Brazil (IDD-Brasil) is changing Brazilian Meteorology through:  Easier access to Global Data  Free availability of good analysis tools  Spreading ideas and practices  Free Data Sharing  Cooperation and mutual support
  • 22. Conclusions II:  The IDD-Brasil shows a sharply growing rate  Today Brazil is the largest international IDD-user community  Numerical models of Brazilians institutions distributed by the IDD/IDD-Brasil are easily available to national and international users.
  • 23.  The data from several Brazilians mesonets may be distributed by the IDD/IDD-Brasil  These data are not available on GTS  They are very important because the data network in South America is sparse.  As a result of IDD expansion to Brazil, more Brazilian data are becoming available for International community. Conclusions III:
  • 24. Contact Information Waldenio Gambi de Almeida gambi@cptec.inpe.br CPTEC/INPE cptec.inpe.br Maria G.A. Justi da Silva justi@igeo.ufrj.br UFRJ www.meteorologia.ufrj.br Tom Yoksas yoksas@unidata.ucar.edu Unidata/UCAR www.unidata.ucar.edu
  • 25. 1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
  • 26. How Adequate is the Relational Database Model for Scientific Data? • Designed and optimized for – Data in tables – Online transaction processing systems – Other business and enterprise problems • Successful in Geospatial Information Systems integration, such as ESRI ArcGIS • Also very successful in aother disciplines, such as astronomy (Sloan Digital Sky Survey, U.S. National Virtual Observatory) • Not adequate for earth sciences data – N-dimensional arrays – Event-oriented systems, such as sensor webs or high-speed data streams – Indexing unstructured data, for example metadata in XML form – Supporting scientific analysis and visualization tools
  • 27. Open Questions in Modeling Scientific Data • For how wide a realm is the relational database model adequate? • Is any data model that unifies data collections from many disciplines too complex to be useful? • Can one scientific data model be useful across many scientific disciplines? • Is one scientific data model even practical for earth sciences?
  • 28. Network Common Data Form • A simple data model for scientific datasets • A format for portable, self-describing data • A programming library that uses efficient direct access and efficient subsetting of multidimensional arrays • Several programming interfaces: C, Fortran, C++, Java, Python, Perl, Ruby, ... • Support for appending, sharing, and archiving data
  • 30. Limitations of the NetCDF-3 Data Model • Too simple to represent some data structures and relationships • No real data structures, just scalars and multidimensional arrays • No “ragged arrays” or nested structures • Only one shared unlimited dimension, along which data can be appended • A flat name space for dimensions and variables • No strings, just arrays of characters • A limited set of numeric types • Only ASCII characters in names
  • 32. NetCDF’s Future • NetCDF-4 integrates netCDF with HDF5, another major standard format and data model • Parallel netCDF has proved suitable for high- performance computing • NetCDF-4 data model (CDM) improves interoperability with other scientific data representations • NetCDF-Java has advanced features, including access to remote data
  • 33. NetCDF-4 Features • User-defined compound types (portable structs) • User-defined variable-length types • Groups for nested scopes • Multiple unlimited dimensions • String type • Additional numeric types • Unicode names • Efficient dynamic schema changes • Multidimensional tiling (chunking) • Per variable compression • Parallel I/O • Reader-Makes-Right conversion Address limitations of netCDF-3
  • 34. The Unidata Common Data Model • For a common subset of abstractions in OPeNDAP, HDF5, and netCDF-4 • User-defined compound types (portable structs) • User-defined variable-length types • Groups for nested scopes • String type • Additional numeric types • Prototype implemented in netCDF-Java • Attempts a balance between simplicity and power of representation
  • 35. NetCDF-Java • 100% Java library has advances compared to C-based interfaces • Prototype implementation of Common Data Model for access to netCDF-4, OPeNDAP, HDF5 – Provides netCDF interfaces to other formats: Grids (GRIB1, GRIB2), Radar (NEXRAD, NIDS, DORADE), Satellite (DMSP, GINI), Point Observations (BUFR) – Provides uniform coordinate systems layer • Includes access to THREDDS inventory catalogs
  • 36. Common Data Model Coordinate Systems Common Data Access Model Scientific Datatypes Grid Point Radial Trajectory Swath Station Application Application Applications THREDDS HDF5 netCDF GRIB OPeNDAP ...
  • 37. 1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
  • 38. Some Metadata Issues • Every scientific discipline has their own metadata standard. Is any convergence likely? • How do you choose among the multiple candidates for metadata standards? • How can metadata be improved for existing data without rewriting the data?
  • 39. Climate and Forecast (CF) Conventions • A widely used metadata standard for atmospheric, ocean, and climate data, based on netCDF • Specifies coordinate systems used in models, data cell properties and methods, packing, standard names for quantities, and grid mappings • CF-aware software can automatically determine space- time location of data variables • Originally intended for climate model output conventions, but use has broadened to weather and ocean models and observational data • Community governance structure now in place for maintaining and advancing the CF conventions, WMO Working Group on Coupled Modeling (WGCM)
  • 40. Libcf • Purpose: to ease the creation and use of datasets conforming to the CF Conventions • In early stages of development and testing • C and Fortran interfaces available from Unidata in alpha release
  • 41. Udunits (Unidata Units) • Library for manipulating units of physical qualities. – Conversion of unit specifications between formatted and binary forms – Arithmetic manipulation of unit specifications – Conversion of values between compatible scales of measurement • C, Fortran, and Java interfaces • Required by CF conventions • May soon be available as part of netCDF release
  • 42. NcML (NetCDF Markup Language) • An XML representation of netCDF metadata, similar to CDL • A schema language for Earth science data – To get NcML from netCDF data, use ncdump -x or Java ToolsUI program – To create netCDF from NcML, use ToolsUI or (eventually) ncgen • Provides a way to add to or change metadata without rewriting, by referencing and overriding metadata in a file • Also supports aggregation of multiple files
  • 43. 1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
  • 44. • Open-source Project for a Network Data Access Protocol, see opendap.org • A discipline-neutral protocol to get remote scientific data and metadata (not files) • Allows requests for subsets and aggregations • Software reference implementations for many kinds of data: netCDF, SQL (databases), HDF, FITS, JGOFS, • In use in earth sciences, astronomy, medicine, … • Serves IPCC model output Client Server Data Data Data
  • 45. FTP (File Transfer) versus OPeNDAP for Access to Remote Data • FTP accesses only whole files • OPeNDAP includes services for – Selected subset of data from a file – Aggregation of data in multiple files – Selected subset of aggregated data
  • 46. • Protocol uses URLs and HTTP • Unidata provides OPeNDAP support • Several OPeNDAP servers available: pyDAP, FDS, GDS, DAPPER, THREDDS Data Server • OPeNDAP clients include: Ferret, GrADS, Matlab, IDL, ArcGIS, netCDF-Java, IDV • OPeNDAP version 2 now a NASA standard • Version 4 under development with a test version available: adds XML, new types, new functions, THREDDS catalogs, SOAP, outputs in HTML and ASCII • Will add authentication, more server-side processing
  • 47. Thematic Real-time Environmental Distributed Data Services (THREDDS) • For data providers, implements data catalogs to present to users and applications • Catalogs are XML documents (metadata) describing and pointing to datasets accessible via client/server protocols (OPeNDAP, ADDE, WCS, HTTP) • Datasets may be found by discovery centers (master directories, digital libraries, data portals) via catalogs • Catalog hierarchy provides places to hang common metadata • Unidata coordinates THREDDS activities, community implements servers • Many partners as data providers, tool builders, interoperability experts from academia, government, industry
  • 51. THREDDS Data Server (TDS) • Serves data, THREDDS catalogs, and metadata • Reads and serves several kinds of data through a uniform CDM interface: netCDF, OPeNDAP, HDF5, GRIB, NEXRAD, … • Adds Earth-location coordinate systems to data • Provides OPeNDAP access and subsetting of any data readable with NetCDF-Java library • An integrated server provides data access through the OpenGIS Consortium Web Coverage Service (OGC/WCS) • Easy to install, 100% Java, freely available • Supports dynamic generation of catalogs
  • 52. HTTP Tomcat Server THREDDS Data Server Datasets Catalog.xml hostname.edu THREDDS Server Application NetCDF-Java library IDD Data •OPeNDAP •HTTPServer •WCS
  • 53. 1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
  • 54. Integrated Data Viewer (IDV) • Unidata’s newest scientific analysis and visualization tool • Freely available 100% Java framework and reference application • Provides 2- and 3-D displays of geoscience data • Stand-alone or networked application • Integrates data from different sources • Provides End-to-end test for technologies
  • 55. Some IDV Features • Client-server data access from remote systems • Suite of data probes for interactive exploration (slice and dice) • Animations (temporal and spatial) • HTML interface for pedagogic materials • XML configuration and bundling allows collaboration with other educators • Java-based framework supports Extensions built via plug-ins: for example, geosciences network (GEON) solid earth community
  • 56. Catalog of catalogs in IDV (Catalog from within a Client)
  • 57. Summary and Tentative Conclusions • Database “One Size Fits All” databases are not a comprehensive solution for scientific data or metadata. • Similarly, the old file-FTP approach is running out of steam for distributed data systems. • There is a limited time for establishing interdisciplinary data tools and services, before each discipline crystallizes on its own solutions. • Islands of non-interoperability that result would be unfortunate. • Flexibility to be ready adapt to better solutions is required, maybe even before the best solutions are evident.
  • 58. A Few Last Thoughts on Infrastructure …
  • 59. What is Infrastructure? • The basic facilities, services, and installations needed for the functioning of a community – Utilities: water and power lines – Transportation and communications systems • Good infrastructure is reliable, sturdy, useful, long lasting, standardized, widely used, and invisible
  • 60. Infrastructure: Stones in a Wall • Higher layers are built on lower layers • Stones may be replaced with other stones of similar size and shape • From the top, lower layers are invisible
  • 61. Cyberinfrastructure: the Middle Layers Base Technology: computation, storage, communication Networking, Operating Systems, Middleware High performance computation services Data, information, Knowledge management services Observation, measurement fabrication services Interfaces, visualization services Collaboration services Community-Specific Knowledge Environments for Research and Education (collaboratories, grid community, e-science community, virtual community) Customization for discipline-and project-specific applications From the “Atkins report” on Cyberinfrastructure
  • 62. Is Developing Infrastructure Rewarding? • It’s abstract, so hard to explain at a party • You can’t take a picture or movie about it • If it works well, it is invisible • End users are often not aware of it • It doesn’t get referenced in scientific papers • It can be expensive to evolve and support • If not maintained, it eventually crumbles • You can’t sell it, so you have to give it away
  • 63. Earth Science Infrastructure: Bricks in a Wall of Acronyms XML C Fortran Java Unix HTTP CDL NetCDF Udunits GRIB NcML Unidata decoders CF OGC WCS HDF5 OPeNDAP LDM NetCDF Java Libcf NetCDF-4 CDM TDS IDD CONDUIT project LEAD project IDV GEMPAK McIDAS GML CSML BUFR GrADS ArcGIS Ferret HDF4 Developed by Unidata THREDDS Involvement by Unidata GALEON project NCO Other technologies SQL Python, Ruby, … VisAD ADDE
  • 64. Visible and Invisible Infrastructure CDL NetCDF Udunits GRIB NcML Unidata decoders CF OGC WCS HDF5 OPeNDAP LDM netCDF Java Libcf NetCDF-4 CDM TDS IDD CONDUIT project LEAD project IDV GEMPAK McIDAS GML CSML BUFR GrADS ArcGIS Ferret HDF4 THREDDS GALEON project NCO Visible to End Users: “Cloak of invisibility” XML C Fortran Java Unix HTTP SQL Python, Ruby, … VisAD ADDE
  • 65. What Is Good Infrastructure? • Provides a useful service • Makes abstractions at the right level • Cloaks invisible details with a simple interface • Binds loosely to other infrastructure • Behaves reliably • Adapts easily to changes
  • 66. An Example of Great Infrastructure: Popular Programming Languages • Base of huge collection of higher layers of infrastructure • People continue to build on top of this infrastructure • The opportunity to create a long-lasting and popular programming language is rare • Jim Backus (Fortran), John McCarthy (Lisp), Dennis Ritchie (C), Bjarne Stroustrup (C++), James Gosling (Java), Yukihiro “Matz” Matsumoto (Ruby) • Other great infrastructures: Unix, TCP/IP, HTTP, …
  • 67. Rewards of Developing Infrastructure? • It “raises the level” for other developers • Beautiful and useful new layers and applications are built on top of it • You can feel a part of everything it supports • If it’s long lasting and widely used, you have made a difference for future generations • So, it’s one way to get closer to immortality • Infrastructure is abstract, but rewards can also be real • … like this trip to Japan!