This document discusses challenges and opportunities in managing large volumes of scientific data from various sources like experiments, simulations, literature, and archives. It advocates making all scientific data available online to increase scientific information velocity and productivity. Key aspects of scientific data management discussed include data ingest, common schemas, organization, sharing, querying, modeling, documentation, curation and long-term preservation. The cloud is presented as a way to democratize access to scale and analytics for scientific data.
4. Data Acquisition & modelling
Collaboration and visualisation
Analysis & data mining
Dissemination & sharing
Archiving and preserving
fourthparadigm.org
Data-intensive Research
5. X-Info
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it
• How to reorganize it
• How to share with others
• Query and Vis tools
• Building and executing models
• Integrating data and Literature
• Documenting experiments
• Curation and long-term preservation
The Generic Problems
Experiments & Instruments
Simulations
Literature
Other Archives
facts
facts
facts
facts
Questions
Answers
6. All Scientific Data Online
• Many disciplines overlap and use data from other sciences.
• The Internet can unify all literature and data.
• Go from literature to computation to data and back to literature.
• Information at your fingertips, for everyone, everywhere
• Increase Scientific Information Velocity
• Huge increase in Science Productivity
(From Jim Gray’s last talk)
Literature
Derived and recombined data
Raw data
10. Monitoring
Collation
Quality assurance
Aggregation
Analysis
Reporting
Forecasting
Distribution
Done poorly, but a few notable counter-examples
Done poorly to moderately, not easy to find
Sometimes done well, generally discoverable and available, but could be improved
Integration
(I. Zaslavsky & CSIRO, BOM, WMO)
14. Water depth map of London (~130 km²). Storm event of 60 minutes and 100-year return period.
http://www.ncl.ac.uk/ceser/researchprogramme/informatics/citycaturbanfloodmodel/
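A 100-year return period means a 1-in-100 chance of exceedance in any given year, not one event per century. The arithmetic can be sketched as follows, assuming independent years; the function name is illustrative and is not taken from the flood model referenced above.

```python
def prob_at_least_one(return_period_years: float, horizon_years: int) -> float:
    """Chance of at least one exceedance within the horizon.

    Assumes each year is independent with annual probability 1 / return period.
    """
    annual_prob = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_prob) ** horizon_years


if __name__ == "__main__":
    # Print the chance of seeing a 100-year event over several planning horizons.
    for horizon in (1, 30, 100):
        print(horizon, round(prob_at_least_one(100, horizon), 3))
```

Over a 100-year horizon the chance of at least one such storm comes out at roughly 63%, which is why "100-year event" is easy to misread.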
22. Parker MacCready: Univ. of Washington
Rob Fatland, Wenming Ye, Nels Oscar: Microsoft Research
24. Numerical model of 3-D ocean currents and water properties
• Salinity
• Temperature
• Biogeochemistry
Relies on external data sources:
• Bathymetry
• Wind and heating
• Open-ocean boundary conditions (BCs)
• Tides
• Rivers
25. Model Validation
Comparisons are made against an extensive suite of in-situ observations:
• Sea surface height: 12 NOAA tide gauges
• Salinity and temperature: over 2,000 CTD casts from ECOHAB, RISE, DOE, NANOOS, Hood Canal, IOS, King County, and NOAA
• Velocity and moored S, T: 7 coastal ADCP / CTD moorings from the ECOHAB and RISE projects, 2 moorings from IOS
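Comparisons like these are usually summarised with simple error statistics. The slides do not say which metrics were used, so the sketch below shows two common ones, mean bias and root-mean-square error, as an assumption. The sample values are made-up placeholders, not observations from any of the listed programmes.

```python
import math


def bias(model, obs):
    """Mean of (model - observation); positive means the model runs high."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)


def rmse(model, obs):
    """Root-mean-square error between paired model and observed values."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


if __name__ == "__main__":
    # Placeholder temperatures (deg C), for illustration only.
    observed = [10.0, 11.0, 12.0]
    modelled = [10.5, 11.5, 12.5]
    print("bias:", bias(modelled, observed))
    print("rmse:", rmse(modelled, observed))
```

Reporting bias alongside RMSE separates a systematic offset from scatter: a model can have a small bias and still a large RMSE.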
26. Interactive 3-D Model Visualization using WorldWideTelescope, Narwhal and Layerscape
www.layerscape.
27. EH4 32 m
Figure from SA Siedlecki, UW/JISAO; Observations from Connolly et al., 2010
Validation: Dissolved Oxygen & Temperature
28. LiveOcean: System Architecture
HPC
Linux, 150 cores
Forecast
NetCDF files
LiveOcean
Server
• Post-processing
• Pre-make .png "views"
• Archive NetCDF files
• API for web sites
• Admin.js
• Client.js
Blob Storage:
Forecast Copy
Science User
python
Azure Table:
Log Info
Admin Website
Client Website
http://mappable.azurewebsites.net/liveocean/
Rivers
USGS
Atmosphere
UW WRF
Ocean
HYCOM
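The architecture slide is a diagram, so its data flow is only implied by the labels. The sketch below records the components as plain data and walks the links between them. The component names and roles come from the slide; the `feeds` links and the `downstream` helper are assumptions about how the boxes connect, not a description of the actual LiveOcean code.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    role: str
    feeds: list = field(default_factory=list)  # downstream names (assumed links)


# Forcing inputs and their providers, as labelled on the slide.
FORCING_SOURCES = {"Rivers": "USGS", "Atmosphere": "UW WRF", "Ocean": "HYCOM"}

COMPONENTS = [
    Component("HPC", "Linux, 150 cores; produces forecast NetCDF files",
              ["LiveOcean Server"]),
    Component("LiveOcean Server",
              "post-processing, pre-made .png views, NetCDF archive, web API",
              ["Blob Storage", "Azure Table", "Admin Website", "Client Website"]),
    Component("Blob Storage", "forecast copy", ["Science User"]),
    Component("Azure Table", "log info"),
    Component("Admin Website", "Admin.js"),
    Component("Client Website", "Client.js"),
    Component("Science User", "reads forecasts with python"),
]


def downstream(name, components=COMPONENTS):
    """Return every component reachable from `name` by following `feeds`."""
    by_name = {c.name: c for c in components}
    seen, queue = [], list(by_name[name].feeds)
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(by_name[current].feeds)
    return seen


if __name__ == "__main__":
    print(downstream("HPC"))
```

Writing the diagram down as data makes the assumed links explicit, so they can be corrected against the real system without redrawing anything.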