December 20, 2018
Modern Scientific Data Management Practices: The Atmospheric Radiation Measurement (ARM) Facility Data Center Architecture
GIRI PRAKASH, RANJEET DEVARAKONDA, ROB RECORDS, KYLE DUMAS
ARM Data Center, Oak Ridge National Laboratory
AGU 100, December 12, 2018
ARM’s Vision
To provide a detailed and accurate description of the Earth’s atmosphere in diverse climate regimes to resolve the uncertainties in climate and Earth system models toward the development of sustainable solutions for the Nation’s energy and environmental challenges.
Field Campaigns
Pushing the limits to help scientists study our atmosphere
Visit the ARM Exhibit @ 1230
ARM Data Flow – The Big Picture
Data Growth: 1.5 PB
Data Discovery Tool
Data Discovery for LASSO
§ Based on a big data analysis platform (NoSQL)
§ ARM HPC clusters for data processing
§ Provides an interactive web interface for users to find simulations of interest by examining LES performance relative to select ARM observations
§ Allows users to visualize LASSO data bundle diagnostics and skill scores on the fly using plots and tables
Diagram labels: Cassandra, Spark, D3 & NodeJS
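The idea of finding simulations of interest by their skill scores can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Simulation` record, the `skill` mapping, the variable name `lwp`, and the 0-to-1 score range are hypothetical and are not taken from the LASSO tool, which the slide describes as built on Cassandra, Spark, D3, and NodeJS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Simulation:
    """Hypothetical record for one LES simulation in a data bundle."""
    sim_id: str
    case_date: str
    skill: dict  # variable name -> skill score, assumed to lie in [0, 1]


def find_simulations(sims, variable, min_score):
    """Return simulations whose score for `variable` meets the threshold,
    best score first. Simulations without a score for it are skipped."""
    hits = [s for s in sims if s.skill.get(variable, 0.0) >= min_score]
    return sorted(hits, key=lambda s: s.skill[variable], reverse=True)


if __name__ == "__main__":
    # Made-up scores, for demonstration only.
    catalog = [
        Simulation("sim-a", "2016-06-11", {"lwp": 0.4}),
        Simulation("sim-b", "2016-06-11", {"lwp": 0.9}),
        Simulation("sim-c", "2016-06-18", {"lwp": 0.7}),
    ]
    for sim in find_simulations(catalog, "lwp", 0.5):
        print(sim.sim_id, sim.skill["lwp"])
```

In a production system the same filter would more plausibly run as a query against the NoSQL store rather than over an in-memory list.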
Data Retrieval, Packaging, and Delivery
§ Merging
§ DQR filtering
§ Conversion
Diagram labels: Retrieval; Future capability; Datastreams; HPSS; Online copy; Link to data access; Data quality; Access to plots; DOI based citation guidance; Publication request; Discovery UI & Web services; NetCDF data extractions; Data staging order; HPC ML; Live Data WS
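The three packaging steps named on this slide (merging, DQR filtering, conversion) can be sketched as a small pipeline. This is an illustration under stated assumptions, not the ARM implementation: DQR is read here as a Data Quality Report covering a time range, the `Sample` and `QualityReport` types are hypothetical, and CSV stands in for the conversion target so that the example needs only the standard library.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Sample:
    time: datetime
    value: float


@dataclass(frozen=True)
class QualityReport:
    """Hypothetical DQR: a time range during which data are suspect."""
    start: datetime
    end: datetime


def merge(*chunks):
    """Merging: combine samples from several files into one time-ordered list."""
    return sorted((s for chunk in chunks for s in chunk), key=lambda s: s.time)


def dqr_filter(samples, reports):
    """DQR filtering: drop samples that fall inside any reported range."""
    return [
        s for s in samples
        if not any(r.start <= s.time <= r.end for r in reports)
    ]


def to_csv(samples):
    """Conversion: render the samples as CSV text (a stand-in output format)."""
    lines = ["time,value"]
    lines += [f"{s.time.isoformat()},{s.value}" for s in samples]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up samples, for demonstration only.
    day1 = [Sample(datetime(2018, 1, 1, 0), 1.0), Sample(datetime(2018, 1, 1, 12), 2.0)]
    day2 = [Sample(datetime(2018, 1, 2, 0), 3.0)]
    bad = [QualityReport(datetime(2018, 1, 1, 6), datetime(2018, 1, 1, 18))]
    print(to_csv(dqr_filter(merge(day2, day1), bad)))
```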
Globus
Online
Data and Computing Infrastructure
Next-Gen ARM Computing Facility
Cumulus cluster
§ LASSO model operations and large-scale data analysis/visualizations
– 112 nodes (4,032 cores)
– 2 PB GPFS storage
Stratus cluster
§ Routine radar processing
§ Large-scale reprocessing
§ Complex VAP development
§ NoSQL-based advanced visualizations
§ Big data extractions for science users
§ Long-term data quality analysis
– 30 nodes (1,080 cores)
– 256 GB memory/node
– Lustre and 2 TB SSD per node
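A quick arithmetic check on the node and core counts above, written as a short sketch. The assignment of the first set of figures to Cumulus and the second to Stratus follows the order in which they appear on the slide and is an assumption; the dictionary layout and function names are hypothetical.

```python
# Figures copied from the slide; the cluster-to-figure mapping is assumed
# from the order in which the two clusters and their specs are listed.
CLUSTERS = {
    "Cumulus": {"nodes": 112, "cores": 4032},
    "Stratus": {"nodes": 30, "cores": 1080, "mem_gb_per_node": 256},
}


def cores_per_node(spec):
    """Cores divided evenly across nodes; raises if the split is uneven."""
    per_node, remainder = divmod(spec["cores"], spec["nodes"])
    if remainder:
        raise ValueError("cores do not divide evenly across nodes")
    return per_node


def total_memory_gb(spec):
    """Aggregate memory across all nodes, in GB."""
    return spec["nodes"] * spec["mem_gb_per_node"]


if __name__ == "__main__":
    for name, spec in CLUSTERS.items():
        print(name, cores_per_node(spec), "cores per node")
    print("Stratus total memory (GB):", total_memory_gb(CLUSTERS["Stratus"]))
```

Both sets of figures work out to 36 cores per node, and 30 nodes at 256 GB each give 7,680 GB of aggregate memory.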
Data Pipeline and Software Architecture
Data Pipeline
Data Processing, Storage & Data Model, Querying, Analytics, Scientific Users
Software Architecture
Interface, Visualization, Analytics, Output
Frontend, Analytic Server, Backend
Spark, ARM HPC Computing Clusters, JupyterLab
Relational Database, NoSQL Database
• Supports fast analysis of voluminous data
• Hides architectural complexities
• Stage data in HPC
• Metadata
• Order History
• Data from multiple instruments
Dr. Bhargavi Krishna, Yuping Lu, and Dr. Jitu Kumar
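One possible reading of the backend on this slide is that the relational database holds metadata and order history while the NoSQL database holds data from multiple instruments. The sketch below illustrates that split under that assumption only. The table names, column names, and the dictionary standing in for a NoSQL store are all hypothetical, and SQLite is used simply because it ships with Python.

```python
import sqlite3


def build_catalog():
    """Relational side (assumed): datastream metadata and order history."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datastream (name TEXT PRIMARY KEY, description TEXT)"
    )
    conn.execute(
        "CREATE TABLE order_history ("
        "id INTEGER PRIMARY KEY, username TEXT, datastream TEXT)"
    )
    return conn


def add_observation(store, datastream, day, time, value):
    """NoSQL side (assumed): rows partitioned by (datastream, day),
    imitated here with a plain dictionary."""
    store.setdefault((datastream, day), []).append((time, value))


def fulfil_order(conn, store, username, datastream, day):
    """Record the order relationally, then read the matching partition."""
    conn.execute(
        "INSERT INTO order_history (username, datastream) VALUES (?, ?)",
        (username, datastream),
    )
    return sorted(store.get((datastream, day), []))


if __name__ == "__main__":
    conn, store = build_catalog(), {}
    conn.execute(
        "INSERT INTO datastream VALUES (?, ?)",
        ("example.b1", "hypothetical datastream"),
    )
    add_observation(store, "example.b1", "2018-01-01", "00:00", 1.5)
    print(fulfil_order(conn, store, "alice", "example.b1", "2018-01-01"))
```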
Data Citation and DOI Capabilities
Benefits
§ Allow users to cite the exact ARM data used in their research/publication
§ Allow ARM to provide proper data citation credits to the PIs and collaborators
§ Allow future data users and the project to easily track the data used in various articles
Challenge
§ Millions of data files from over 10,000 data products
§ Typically continuous datastreams, but some are from field campaigns
Strategy
§ DOIs are assigned at the data collection level
§ Recommended citation structure
§ Citation generator and resolver to help users
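A citation generator built on collection-level DOIs can be sketched in a few lines. The field names, the citation layout, and the example DOI below are hypothetical placeholders and do not reproduce ARM's recommended citation structure; the only external fact relied on is that a DOI resolves through `https://doi.org/`.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Collection:
    """Hypothetical data collection that carries a single DOI."""
    title: str
    doi: str
    publisher: str


def doi_url(doi):
    """Resolver link for a DOI, using the public doi.org proxy."""
    return f"https://doi.org/{doi}"


def cite(collection, start, end, accessed):
    """Build a citation that pins down the exact date range used, so the
    collection-level DOI still identifies a specific subset of files."""
    return (
        f"{collection.publisher}. {collection.title}. "
        f"{start.isoformat()} to {end.isoformat()}. "
        f"Accessed {accessed.isoformat()}. {doi_url(collection.doi)}"
    )


if __name__ == "__main__":
    # Placeholder values, not a real ARM collection or DOI.
    example = Collection("Example Datastream", "10.0000/example", "Example Data Center")
    print(cite(example, date(2017, 1, 1), date(2017, 1, 31), date(2018, 12, 12)))
```

Including the date range in the citation is one way to reconcile a single DOI per collection with the need to cite the exact data used.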
Data Sharing with External Portals
ARM Data Center
Science Metadata: ISO 19115, CF, FGDC, Schema.org, OAI, JSON-LD
Data Access: THREDDS, OPeNDAP, Extraction, Visualization
External portals: Google, IASOA, Data.gov, DataCite, NGEE-Arctic, other data networks
Shared through: Metadata harvesting, Data download service, DOI
Google Data Search (Beta)
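Dataset search engines of this kind typically discover data by harvesting Schema.org `Dataset` descriptions embedded in landing pages as JSON-LD. The sketch below shows what such a record could look like. The function names, the choice of properties, and all example values are assumptions for illustration and are not ARM's published metadata.

```python
import json


def dataset_jsonld(name, description, doi, landing_url):
    """Build a minimal Schema.org Dataset record as a Python dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": f"https://doi.org/{doi}",
        "url": landing_url,
    }


def as_script_tag(record):
    """Wrap the record in the script element used to embed JSON-LD in HTML."""
    body = json.dumps(record, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Placeholder values, not a real dataset.
    record = dataset_jsonld(
        name="Example Datastream",
        description="Hypothetical atmospheric measurements.",
        doi="10.0000/example",
        landing_url="https://example.org/data/example-datastream",
    )
    print(as_script_tag(record))
```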
