Using Multiple Big Datasets and Machine Learning
to Produce a New Global Particulate Dataset
A Technology Challenge Case S...
What?
Why?
Table 1. PM and health outcomes (modified from Ruckerl et al. (2006)).
Decreased Lung Function < 10 μm

x, few studi...
Why?
How?
Used around 40 different BigData sets from satellites, meteorology,
demographics, scraped web-sites and social media t...
Which Platform?
Which Platform?
Requirements:
1. Large persistent storage for multiple BigData sets, 100TB+ (otherwise
before have had tim...
How?

Existing

New

Simulation

• Social Media
• Socioeconomic, Census
• News feeds
• Environmental
• Weather
• Sat...
How?

California Children Example
Terra DeepBlue
Rank 1 to 32

Source

Variab...
Hourly measurements from 53 countries from 1997-present

A lot of measurements,
but notice the large gaps!
Gaps are inevitable because of the
infrastructure and cost associated with
making the measurements.
Challenge 1: Obtaining the in-situ PM2.5 data
Real time data from:
1. EPA AirNow data for USA and Canada
2. EEA data for E...
Challenge 2: (Easier)
Obtaining the Satellite & Meteorological Data
Real time data from:
1. Multiple satellites MODIS Terr...
Challenge 3:
Combine multiple BigData Sets with Machine Learning
Large member machine learning ensemble using massively pa...
Challenge 4:
Continual Performance Improvement
Currently on around 400th version of system.
Have been making continuous im...
Fully Automated Workflow

Requires ability to schedule automated tasks
Requires ability to disseminate results in multiple formats including
ftp and as web and map services
Key System Requirements:
Not always available on current HPC systems
Requirements:
1. Large persistent storage for multipl...
Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

The next generation of cloud computing systems will need to provide:
1. Storage for multiple massive datasets (large storage)
2. Massive memory per node
3. Automation and scheduling of repetitive tasks
4. High-level technical languages (e.g. Matlab)

Published in: Technology, Business

  1. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study. David Lary, Hanson Center for Space Science, University of Texas at Dallas
  2. What?
  3. Why?

     Particle size scale (figure): sizes run from 0.0001 μm to 1000 μm (1 mm).
     Size classes: PM10 particles; PM10-2.5 coarse fraction; PM2.5 particles; PM0.1 ultra fine particles (UFP).
     Health effects by size: Decreased Lung Function < 10 μm; Skin & Eye Disease < 2.5 μm; Tumors < 1 μm; Cardiovascular Disease < 0.1 μm.
     Types of biological material: Mold Spores, Cell, Pollen, House Dust Mite Allergens, Cat Allergens, Bacteria, Hair, Viruses.
     Types of dust: Heavy Dust, Settling Dust, Suspended Atmospheric Dust, Cement Dust, Fly Ash.
     Types of particulates: Oil Smoke, Pin, Smog, Tobacco Smoke, Soot, Gas Molecules.

     Table 1. PM and health outcomes (modified from Ruckerl et al. (2006)).
     x, few studies; xx, many studies; xxx, large number of studies; -, no entry.

                                                 Short-term studies     Long-term studies
     Health outcome                              PM10   PM2.5  UFP      PM10   PM2.5  UFP
     Mortality
       All causes                                xxx    xxx    x        xx     xx     x
       Cardiovascular                            xxx    xxx    x        xx     xx     x
       Pulmonary                                 xxx    xxx    x        xx     xx     x
     Pulmonary effects
       Lung function, e.g., PEF                  xxx    xxx    xx       xxx    xxx    -
       Lung function growth                      -      -      -        xxx    xxx    -
     Asthma and COPD exacerbation
       Acute respiratory symptoms                -      xx     x        xxx    xxx    -
       Medication use                            -      -      x        -      -      -
       Hospital admission                        xx     xxx    x        -      -      -
     Lung cancer
       Cohort                                    -      -      -        xx     xx     x
       Hospital admission                        -      -      -        xx     xx     x
     Cardiovascular effects
       Hospital admission                        xxx    xxx    -        x      x      -
     ECG-related endpoints
       Autonomic nervous system                  xxx    xxx    xx       -      -      -
       Myocardial substrate and vulnerability    -      xx     x        -      -      -
     Vascular function
       Blood pressure                            xx     xxx    x        -      -      -
       Endothelial function                      x      xx     x        -      -      -
     Blood markers
       Pro inflammatory mediators                xx     xx     xx       -      -      -
       Coagulation blood markers                 xx     xx     xx       -      -      -
       Diabetes                                  x      xx     x        -      -      -
       Endothelial function                      x      x      xx       -      -      -
     Reproduction
       Premature birth                           x      x      -        -      -      -
       Birth weight                              xx     x      -        -      -      -
       IUR/SGA                                   x      x      -        -      -      -
     Fetal growth
       Birth defects                             x      -      -        -      -      -
       Infant mortality                          xx     x      -        -      -      -
       Sperm quality                             x      x      -        -      -      -
     Neurotoxic effects
       Central nervous system                    -      x      xx       -      -      -
  4. Why?
  5. How? Around 40 different big datasets from satellites, meteorology, demographics, scraped websites, and social media were used to estimate PM2.5. The plot shows the average over 5,935 days, from August 1, 1997 to the present.
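As a quick check on the averaging window quoted on this slide, the day count can be converted to a calendar date. This is a minimal sketch, assuming the 5,935 days are counted forward from August 1, 1997; the variable names are illustrative only.

```python
from datetime import date, timedelta

start = date(1997, 8, 1)   # first day of the record, as stated on the slide
n_days = 5935              # number of days averaged, as stated on the slide

# Assumption: the count runs forward from the start date.
end = start + timedelta(days=n_days)
years = n_days / 365.25    # approximate length of the record in years

print(end.isoformat(), round(years, 2))
```

Under that assumption the window closes on 2013-10-31, a little over 16 years after it opens.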
  6. Which Platform?
  13. Which Platform? Requirements:
      1. Large persistent storage for multiple big datasets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets can be processed)
      2. High-bandwidth connections
      3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
      4. A high-level language with a wide range of optimized toolboxes (Matlab)
      5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
      6. Easy use of multiple GPUs and CPUs
      7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case, tasks run at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day)
  14. How?
      Diagram labels: Data; Machine Learning; Insight.
      Existing: Social Media; Socioeconomic, Census; News feeds; Environmental; Weather; Satellite; Sensors; Health; Economic
      New: UAVs; Smart Dust; Autonomous Cars; Sensors
      Simulation: Global Weather Models; Economic Models; Earthquake Models
      The same approach is highly relevant for the validation and optimal exploitation of the next generation of satellites, e.g. the upcoming NASA Decadal Survey Missions.
  15. How? California Children Example
  16. Terra DeepBlue

      Rank  Source                   Variable                                Type
      1     -                        Population Density                      Input
      2     Satellite Product        Tropospheric NO2 Column                 Input
      3     Meteorological Analyses  Surface Specific Humidity               Input
      4     Satellite Product        Solar Azimuth                           Input
      5     Meteorological Analyses  Surface Wind Speed                      Input
      6     Satellite Product        White-sky Albedo at 2,130 nm            Input
      7     Satellite Product        White-sky Albedo at 555 nm              Input
      8     Meteorological Analyses  Surface Air Temperature                 Input
      9     Meteorological Analyses  Surface Layer Height                    Input
      10    Meteorological Analyses  Surface Ventilation Velocity            Input
      11    Meteorological Analyses  Total Precipitation                     Input
      12    Satellite Product        Solar Zenith                            Input
      13    Meteorological Analyses  Air Density at Surface                  Input
      14    Satellite Product        Cloud Mask Qa                           Input
      15    Satellite Product        Deep Blue Aerosol Optical Depth 470 nm  Input
      16    Satellite Product        Sensor Zenith                           Input
      17    Satellite Product        White-sky Albedo at 858 nm              Input
      18    Meteorological Analyses  Surface Velocity Scale                  Input
      19    Satellite Product        White-sky Albedo at 470 nm              Input
      20    Satellite Product        Deep Blue Angstrom Exponent Land        Input
      21    Satellite Product        White-sky Albedo at 1,240 nm            Input
      22    Satellite Product        Scattering Angle                        Input
      23    Satellite Product        Sensor Azimuth                          Input
      24    Satellite Product        Deep Blue Surface Reflectance 412 nm    Input
      25    Satellite Product        White-sky Albedo at 1,640 nm            Input
      26    Satellite Product        Deep Blue Aerosol Optical Depth 660 nm  Input
      27    Satellite Product        White-sky Albedo at 648 nm              Input
      28    Satellite Product        Deep Blue Surface Reflectance 660 nm    Input
      29    Satellite Product        Cloud Fraction Land                     Input
      30    Satellite Product        Deep Blue Surface Reflectance 470 nm    Input
      31    Satellite Product        Deep Blue Aerosol Optical Depth 550 nm  Input
      32    Satellite Product        Deep Blue Aerosol Optical Depth 412 nm  Input
      -     In-situ Observation      PM2.5                                   Target
  17. Hourly measurements from 53 countries from 1997 to the present. A lot of measurements, but notice the large gaps!
  18. Gaps are inevitable because of the infrastructure and cost associated with making the measurements.
  19. Challenge 1: Obtaining the in-situ PM2.5 data
      Real-time data from:
      1. EPA AirNow data for USA and Canada
      2. EEA data for Europe
      3. Tasmania and Australia
      4. Israel
      5. Russia
      6. Asia and Latin America, by scraping http://aqicn.org/map/
      7. Harvesting social media (Twitter feeds from US Embassies)
      Relatively low bandwidth from multiple sites every 5 minutes
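Pulling readings from many independent feeds every 5 minutes means the same station and hour can be reported more than once, sometimes with placeholder values. The sketch below shows one possible way to merge such records. It is not the author's pipeline: the record layout, the function name `merge_observations`, the feed names, the station IDs, and the readings are all hypothetical.

```python
from datetime import datetime, timezone

def merge_observations(feeds):
    """Merge PM2.5 records from several feeds into one table.

    feeds: dict mapping a feed name to a list of
    (station_id, utc_time, pm25) tuples.
    Returns a dict keyed by (station_id, hour) that holds the most
    recently timestamped valid reading for that station and hour.
    """
    merged = {}
    for feed_name, records in feeds.items():
        for station_id, utc_time, pm25 in records:
            if pm25 is None or pm25 < 0:   # drop missing or placeholder readings
                continue
            hour = utc_time.replace(minute=0, second=0, microsecond=0)
            key = (station_id, hour)
            previous = merged.get(key)
            if previous is None or utc_time > previous["time"]:
                merged[key] = {"time": utc_time, "pm25": pm25, "feed": feed_name}
    return merged

# Made-up example records; nothing here is real measurement data.
def t(hour, minute):
    return datetime(2013, 10, 1, hour, minute, tzinfo=timezone.utc)

feeds = {
    "airnow": [("US-001", t(12, 0), 8.5), ("US-001", t(12, 5), 9.0)],
    "embassy_feed": [("CN-001", t(12, 0), 155.0), ("CN-001", t(12, 5), -999.0)],
}
table = merge_observations(feeds)
for (station, hour), row in sorted(table.items()):
    print(station, hour.isoformat(), row["pm25"], row["feed"])
```

Keying on station and hour matches the hourly resolution mentioned earlier in the slides; a real system would also need per-feed unit and time-zone handling, which is omitted here.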
  20. Challenge 2 (easier): Obtaining the satellite and meteorological data
      Real-time data from:
      1. Multiple satellites: MODIS Terra, MODIS Aqua, SeaWiFS, VIIRS NPP, etc.
      2. Global meteorological analyses
      High bandwidth from a few sites every 1 to 3 hours
  21. Challenge 3: Combine multiple big datasets with machine learning
      A large-member machine learning ensemble uses massively parallel computing to produce the PM2.5 data product.
      Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables).
      Development time was drastically reduced by using a high-level language (Matlab) that can easily exploit parallel execution on both multiple CPUs and GPUs.
      Massively parallel, every 3 hours. High-level language that can readily use CPUs and GPUs.
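The slides do not say which learning algorithm the ensemble uses, so the following is only an illustration of the general idea of a many-member, non-parametric ensemble. It swaps in bagged k-nearest-neighbour regression on synthetic data, written in Python rather than the Matlab named in the slides; the function names, parameters, and toy target are all assumptions.

```python
import random
import statistics

def knn_predict(train, x, k=3):
    """Non-parametric estimate: mean target of the k nearest training points."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    return statistics.fmean([y for _, y in nearest])

def ensemble_predict(train, x, n_members=25, k=3, seed=0):
    """Bagging: each member is fitted to a bootstrap resample of the data.

    Returns the ensemble mean and the spread (standard deviation)
    of the member estimates.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_members):
        sample = [rng.choice(train) for _ in train]   # resample with replacement
        estimates.append(knn_predict(sample, x, k))
    return statistics.fmean(estimates), statistics.pstdev(estimates)

# Synthetic, non-linear toy target (not PM2.5 data): y = x0**2 + x1 on a grid.
train = [((a / 10, b / 10), (a / 10) ** 2 + b / 10)
         for a in range(11) for b in range(11)]

mean, spread = ensemble_predict(train, (0.5, 0.5))
print(round(mean, 3), round(spread, 3))
```

Each member is independent of the others, which is what makes this kind of ensemble easy to spread across many CPUs or GPUs. The spread of the member estimates gives a rough indication of uncertainty.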
  22. Challenge 4: Continual performance improvement
      Currently on around the 400th version of the system. Continuous improvements have been made in:
      1. Coverage of the in-situ training dataset
      2. Inclusion of new satellite sensors
      3. Additional big datasets that help improve the fidelity of the non-linear, non-parametric, non-Gaussian multivariate machine learning fits
      4. Use of many alternative machine learning strategies
      5. Estimation of uncertainties
      This requires frequent reprocessing of the entire multi-year record from 1997 to the present.
      Persistent massive data storage is needed, much more than the usual scratch space at HPC centers.
  23. Fully Automated Workflow: requires the ability to schedule automated tasks
  24. Requires the ability to disseminate results in multiple formats, including FTP and as web and map services
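The scheduling requirement of the automated workflow can be illustrated with a small sketch that decides which tasks are due at a given clock time. The intervals are the ones quoted in the slides (5 minutes, 15 minutes, 1 hour, 3 hours, 1 day). The task names, the assignment of tasks to intervals, and the alignment of intervals to midnight are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Intervals come from the slides; the task names are hypothetical placeholders.
SCHEDULE = {
    "poll_in_situ_feeds": timedelta(minutes=5),
    "task_every_15_min": timedelta(minutes=15),
    "fetch_satellite_and_met": timedelta(hours=1),
    "run_ml_ensemble": timedelta(hours=3),
    "task_daily": timedelta(days=1),
}

def due_tasks(now, schedule=SCHEDULE):
    """Return the tasks whose interval boundary falls exactly on `now`.

    Boundaries are aligned to midnight, so a 3-hour task fires at
    00:00, 03:00, 06:00, and so on.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = now - midnight
    return [name for name, every in schedule.items()
            if elapsed % every == timedelta(0)]

print(due_tasks(datetime(2013, 10, 1, 12, 5)))
print(due_tasks(datetime(2013, 10, 1, 3, 0)))
```

A driver loop or a system scheduler such as cron would call `due_tasks` once a minute and launch whatever it returns; that part is left out of the sketch.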
  25. Key System Requirements: not always available on current HPC systems
      Requirements:
      1. Large persistent storage for multiple big datasets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets can be processed)
      2. High-bandwidth connections
      3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
      4. A high-level language with a wide range of optimized toolboxes (Matlab)
      5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
      6. Easy use of multiple GPUs and CPUs
      7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case, tasks run at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day)
      Thank you!
