Using Multiple Big Datasets and Machine Learning
to Produce a New Global Particulate Dataset
A Technology Challenge Case Study
David Lary
Hanson Center for Space Science
University of Texas at Dallas
What?
Why?
Table 1. PM and health outcomes (modified from Rückerl et al. (2006)).
x, few studies; xx, many studies; xxx, large number of studies.

Health Outcomes                            | Long-term Studies  | Short-term Studies
                                           | PM10   PM2.5  UFP  | PM10   PM2.5  UFP
Mortality                                  |                    |
  All causes                               | xx     xx     x    | xxx    xxx    x
  Cardiovascular                           | xx     xx     x    | xxx    xxx    x
  Pulmonary                                | xx     xx     x    | xxx    xxx    x
Pulmonary effects                          |                    |
  Lung function, e.g., PEF                 | xxx    xxx         | xxx    xxx    xx
  Lung function growth                     | xxx    xxx         |
Asthma and COPD exacerbation               |                    |
  Acute respiratory symptoms               | xxx    xxx         |        xx     x
  Medication use                           |                    |               x
  Hospital admission                       |                    | xx     xxx    x
Lung cancer                                |                    |
  Cohort                                   | xx     xx     x    |
  Hospital admission                       | xx     xx     x    |
Cardiovascular effects                     |                    |
  Hospital admission                       | x      x           | xxx    xxx
ECG-related endpoints                      |                    |
  Autonomic nervous system                 |                    | xxx    xxx    xx
  Myocardial substrate and vulnerability   |                    |        xx     x
Vascular function                          |                    |
  Blood pressure                           |                    | xx     xxx    x
  Endothelial function                     |                    | x      xx     x
Blood markers                              |                    |
  Pro-inflammatory mediators               |                    | xx     xx     xx
  Coagulation blood markers                |                    | xx     xx     xx
  Diabetes                                 |                    | x      xx     x
  Endothelial function                     |                    | x      x      xx
Reproduction                               |                    |
  Premature birth                          |                    | x      x
  Birth weight                             |                    | xx     x
  IUR/SGA                                  |                    | x      x
Fetal growth                               |                    |
  Birth defects                            |                    | x
  Infant mortality                         |                    | xx     x
  Sperm quality                            |                    | x      x
Neurotoxic effects                         |                    |
  Central nervous system                   |                    |        x      xx

[Figure: particle size scale from 0.0001 μm to 1,000 μm, showing PM10, PM2.5, and PM0.1 (ultrafine) particles and the PM10-2.5 coarse fraction. Types of particulates: gas molecules, oil smoke, smog, tobacco smoke, soot, fly ash, cement dust, heavy dust, settling dust, suspended atmospheric dust. Types of biological material: viruses, bacteria, mold spores, pollen, cat allergens, house dust mite allergens, cells, hair. Health outcomes by particle size: decreased lung function (< 10 μm), skin & eye disease (< 2.5 μm), tumors (< 1 μm), cardiovascular disease (< 0.1 μm).]
Why?
How?
Used around 40 different BigData sets (from satellites, meteorology, demographics, scraped web sites, and social media) to estimate PM2.5. The plot below shows the average over 5,935 days, from August 1, 1997 to the present.
Which Platform?
Requirements:
1. Large persistent storage for multiple BigData sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g., Twitter) and scrape web sites for data
4. A high-level language with a wide range of optimized toolboxes (MATLAB)
5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs (see the sketch after this list)
7. Ability to schedule tasks at precise times and intervals to automate workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day)
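To make requirements 4 and 6 concrete, here is a minimal MATLAB sketch (not the production code; all sizes and the per-day computation are toy stand-ins) of fanning work across CPU workers with parfor and moving the same arithmetic onto a GPU by changing the array type:

```matlab
% Minimal sketch of requirements 4 and 6: one high-level language,
% many CPU workers, and optional GPU offload. Sizes are toy stand-ins.
pool = parpool('local');            % start a pool of CPU workers

nDays   = 365;
results = zeros(nDays, 1);
parfor d = 1:nDays                  % each day handled by its own worker
    X = rand(1000);                 % stand-in for one day's predictors
    results(d) = mean(X(:));        % stand-in for the real per-day work
end

% The same arithmetic moves to a GPU by changing the container type:
Xg = gpuArray(rand(4096));          % matrix held in GPU memory
Cg = Xg * Xg';                      % product executed on the GPU
C  = gather(Cg);                    % copy the result back to the host

delete(pool);
```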
How?

Data → Machine Learning → Insight

• Existing: Social Media; Socioeconomic, Census; News Feeds; Environmental; Weather; Satellite; Sensors; Health; Economic
• New: UAVs; Smart Dust; Autonomous Cars; Sensors
• Simulation: Global Weather Models; Economic Models; Earthquake Models

The same approach is highly relevant for the validation and optimal exploitation of the next generation of satellites, e.g., the upcoming NASA Decadal Survey missions.
How?

California Children Example
Terra Deep Blue: ranked inputs used to estimate the PM2.5 target

Rank | Source                  | Variable                                | Type
1    | Demographics            | Population Density                      | Input
2    | Satellite Product       | Tropospheric NO2 Column                 | Input
3    | Meteorological Analyses | Surface Specific Humidity               | Input
4    | Satellite Product       | Solar Azimuth                           | Input
5    | Meteorological Analyses | Surface Wind Speed                      | Input
6    | Satellite Product       | White-sky Albedo at 2,130 nm            | Input
7    | Satellite Product       | White-sky Albedo at 555 nm              | Input
8    | Meteorological Analyses | Surface Air Temperature                 | Input
9    | Meteorological Analyses | Surface Layer Height                    | Input
10   | Meteorological Analyses | Surface Ventilation Velocity            | Input
11   | Meteorological Analyses | Total Precipitation                     | Input
12   | Satellite Product       | Solar Zenith                            | Input
13   | Meteorological Analyses | Air Density at Surface                  | Input
14   | Satellite Product       | Cloud Mask Qa                           | Input
15   | Satellite Product       | Deep Blue Aerosol Optical Depth 470 nm  | Input
16   | Satellite Product       | Sensor Zenith                           | Input
17   | Satellite Product       | White-sky Albedo at 858 nm              | Input
18   | Meteorological Analyses | Surface Velocity Scale                  | Input
19   | Satellite Product       | White-sky Albedo at 470 nm              | Input
20   | Satellite Product       | Deep Blue Angstrom Exponent Land        | Input
21   | Satellite Product       | White-sky Albedo at 1,240 nm            | Input
22   | Satellite Product       | Scattering Angle                        | Input
23   | Satellite Product       | Sensor Azimuth                          | Input
24   | Satellite Product       | Deep Blue Surface Reflectance 412 nm    | Input
25   | Satellite Product       | White-sky Albedo at 1,640 nm            | Input
26   | Satellite Product       | Deep Blue Aerosol Optical Depth 660 nm  | Input
27   | Satellite Product       | White-sky Albedo at 648 nm              | Input
28   | Satellite Product       | Deep Blue Surface Reflectance 660 nm    | Input
29   | Satellite Product       | Cloud Fraction Land                     | Input
30   | Satellite Product       | Deep Blue Surface Reflectance 470 nm    | Input
31   | Satellite Product       | Deep Blue Aerosol Optical Depth 550 nm  | Input
32   | Satellite Product       | Deep Blue Aerosol Optical Depth 412 nm  | Input
     | In-situ Observation     | PM2.5                                   | Target
Hourly measurements from 53 countries from 1997-present

A lot of measurements,
but notice the large gaps!
Gaps are inevitable because of the
infrastructure and cost associated with
making the measurements.
Challenge 1: Obtaining the in-situ PM2.5 data
Real-time data from:
1. EPA AirNow data for USA and Canada
2. EEA data for Europe
3. Tasmania and Australia
4. Israel
5. Russia
6. Asia and Latin America by scraping http://aqicn.org/map/
7. Harvesting social media (Twitter feeds from US embassies)

Relatively low bandwidth from multiple sites, every 5 minutes
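As an illustration of one harvest cycle, a minimal MATLAB sketch; the endpoint URL and the JSON field names (station, lat, lon, pm25) are hypothetical placeholders, since each real feed (AirNow, EEA, aqicn.org, Twitter) has its own format:

```matlab
% Minimal sketch of one 5-minute harvest cycle over HTTP. The URL and the
% JSON field names (station, lat, lon, pm25) are hypothetical placeholders.
opts = weboptions('Timeout', 30, 'ContentType', 'json');
url  = 'https://example.org/api/pm25/latest';     % placeholder endpoint

fid = fopen('pm25_harvest.csv', 'a');             % growing local archive
try
    reports = webread(url, opts);                 % parsed JSON
    stamp   = datestr(datetime('now', 'TimeZone', 'UTC'));
    for k = 1:numel(reports)
        fprintf(fid, '%s,%s,%.4f,%.4f,%.1f\n', stamp, ...
                reports(k).station, reports(k).lat, ...
                reports(k).lon, reports(k).pm25);
    end
catch err
    warning('Harvest failed: %s', err.message);   % feeds are often flaky
end
fclose(fid);
```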
Challenge 2: (Easier)
Obtaining the Satellite & Meteorological Data
Real-time data from:
1. Multiple satellites: MODIS Terra, MODIS Aqua, SeaWiFS, VIIRS on Suomi NPP, etc.
2. Global Meteorological Analyses

High bandwidth from a few sites, every 1 to 3 hours
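This side is a few large, regular files rather than many small feeds. A minimal MATLAB sketch of ingesting one granule and collocating it onto a fixed grid; the file names, dataset names, and 0.25-degree grid are assumptions, not the production configuration:

```matlab
% Minimal sketch of ingesting gridded data and collocating a satellite
% swath onto a fixed lat/lon grid. File and dataset names are placeholders.
lat = hdfread('MOD04_granule.hdf', 'Latitude');
lon = hdfread('MOD04_granule.hdf', 'Longitude');
aod = hdfread('MOD04_granule.hdf', 'Deep_Blue_Aerosol_Optical_Depth_550_Land');
u10 = ncread('met_analysis.nc4', 'U10M');     % met analyses are NetCDF

% Interpolate the irregular swath onto a common grid so that all the
% datasets share one sampling before the machine-learning step.
F = scatteredInterpolant(double(lon(:)), double(lat(:)), double(aod(:)), ...
                         'linear', 'none');
[glon, glat] = meshgrid(-180:0.25:180, -90:0.25:90);
aodGrid = F(glon, glat);
```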
Challenge 3:
Combining multiple BigData sets with Machine Learning
A large-member machine-learning ensemble, run on massively parallel computing, produces the PM2.5 data product.
Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables).
Development time was drastically reduced by using a high-level language (MATLAB) that can easily exploit parallel execution on both multiple CPUs and GPUs. A sketch of the idea follows.

Massively parallel, every 3 hours
High-level language that can readily use CPUs and GPUs
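A minimal sketch of the ensemble idea, assuming the collocated predictors are already assembled into a design matrix X (one row per collocated sample) with in-situ PM2.5 as the target y. TreeBagger stands in here for the production system's much larger multi-algorithm ensemble, and the data below are random stand-ins:

```matlab
% Minimal sketch of a non-linear, non-parametric ensemble fit with
% uncertainty from member spread. X and y are toy stand-ins.
X = rand(5000, 200);                       % stand-in design matrix
y = X * rand(200, 1) + 0.1*randn(5000, 1); % stand-in PM2.5 target

nMembers = 20;
members  = cell(nMembers, 1);
parfor m = 1:nMembers                      % train members in parallel
    members{m} = TreeBagger(100, X, y, 'Method', 'regression', ...
                            'MinLeafSize', 5);
end

% Predict with every member; the mean is the estimate and the spread
% between members doubles as an empirical uncertainty.
preds = zeros(size(X, 1), nMembers);
for m = 1:nMembers
    preds(:, m) = predict(members{m}, X);
end
pm25Hat = mean(preds, 2);
pm25Sig = std(preds, 0, 2);
```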
Challenge 4:
Continual Performance Improvement
Currently on around the 400th version of the system.
Continuous improvements have been made in:
1. Coverage of the in-situ training data set
2. Inclusion of new satellite sensors
3. Additional BigData sets that help improve the fidelity of the non-linear, non-parametric, non-Gaussian multivariate machine-learning fits
4. Use of many alternative machine-learning strategies
5. Estimation of uncertainties
Each improvement requires reprocessing the entire multi-year record from 1997 to the present; a sketch of that streaming reprocessing follows.

Persistent massive data storage, much more than the usual scratch space at HPC centers
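Reprocessing after each improvement means repeatedly streaming years of archived files that cannot all fit in memory at once. A minimal MATLAB sketch using a datastore over a hypothetical archive layout:

```matlab
% Minimal sketch of streaming the multi-year archive for reprocessing.
% The folder layout and file contents are hypothetical.
ds = fileDatastore(fullfile('archive', '*.mat'), 'ReadFcn', @load);
while hasdata(ds)
    day = read(ds);    % one day's collocated predictors and PM2.5
    % ... re-run the current version of the fit/prediction for this day ...
end
```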
Fully Automated Workflow

Requires the ability to schedule automated tasks.
Requires the ability to disseminate results in multiple formats, including FTP and as web and map services.
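Both needs map onto standard MATLAB facilities: a timer object provides cron-like scheduling, and the built-in FTP client covers one dissemination channel. A minimal sketch with placeholder host, credentials, and file names:

```matlab
% Minimal sketch of the automation: a timer fires the harvest step every
% 5 minutes, and results are pushed to a placeholder FTP server.
harvest = @(~, ~) disp('harvesting in-situ PM2.5 ...');  % stand-in task

t = timer('ExecutionMode', 'fixedRate', ...
          'Period',        300, ...        % 5 minutes, in seconds
          'TimerFcn',      harvest);
start(t);                                  % runs until stop(t)/delete(t)

% Dissemination step (e.g., driven by a separate 3-hourly timer):
srv = ftp('ftp.example.org', 'user', 'password');   % placeholder server
mput(srv, 'pm25_latest.nc');                        % upload newest product
close(srv);
```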
Key System Requirements:
Not always available on current HPC systems
Requirements:
1. Large persistent storage for multiple BigData sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g., Twitter) and scrape web sites for data
4. A high-level language with a wide range of optimized toolboxes (MATLAB)
5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs
7. Ability to schedule tasks at precise times and intervals to automate workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day)

Thank you!
