Using Multiple Big Datasets and Machine Learning
to Produce a New Global Particulate Dataset
A Technology Challenge Case Study
David Lary
Hanson Center for Space Science
University of Texas at Dallas
What?
Why?
Table 1. PM and health outcomes (modified from Ruckerl et al. (2006)).
x, few studies; xx, many studies; xxx, large number of studies; -, no entry.

                                            Short-term Studies     Long-term Studies
Health Outcomes                             PM10   PM2.5  UFP      PM10   PM2.5  UFP
Mortality
    All causes                              xxx    xxx    x        xx     xx     x
    Cardiovascular                          xxx    xxx    x        xx     xx     x
    Pulmonary                               xxx    xxx    x        xx     xx     x
Pulmonary effects
    Lung function, e.g., PEF                xxx    xxx    xx       xxx    xxx    -
    Lung function growth                    -      -      -        xxx    xxx    -
Asthma and COPD exacerbation
    Acute respiratory symptoms              -      xx     x        xxx    xxx    -
    Medication use                          -      -      x        -      -      -
    Hospital admission                      xx     xxx    x        -      -      -
Lung cancer
    Cohort                                  -      -      -        xx     xx     x
    Hospital admission                      -      -      -        xx     xx     x
Cardiovascular effects
    Hospital admission                      xxx    xxx    -        x      x      -
ECG-related endpoints
    Autonomic nervous system                xxx    xxx    xx       -      -      -
    Myocardial substrate and vulnerability  -      xx     x        -      -      -
Vascular function
    Blood pressure                          xx     xxx    x        -      -      -
    Endothelial function                    x      xx     x        -      -      -
Blood markers
    Pro-inflammatory mediators              xx     xx     xx       -      -      -
    Coagulation blood markers               xx     xx     xx       -      -      -
    Diabetes                                x      xx     x        -      -      -
    Endothelial function                    x      x      xx       -      -      -
Reproduction
    Premature birth                         x      x      -        -      -      -
    Birth weight                            xx     x      -        -      -      -
    IUR/SGA                                 x      x      -        -      -      -
Fetal growth
    Birth defects                           x      -      -        -      -      -
    Infant mortality                        xx     x      -        -      -      -
    Sperm quality                           x      x      -        -      -      -
Neurotoxic effects
    Central nervous system                  -      x      xx       -      -      -

Figure: particle size scale from 0.0001 μm to 1000 μm (1 mm).
Size classes: PM10 particles; PM2.5 particles; PM0.1 ultra fine particles; PM10-2.5 coarse fraction.
Health effects by particle size: Decreased Lung Function < 10 μm; Skin & Eye Disease < 2.5 μm; Tumors < 1 μm; Cardiovascular Disease < 0.1 μm.
Types of biological material: Cell, Pollen, Mold Spores, House Dust Mite Allergens, Cat Allergens, Bacteria, Hair, Viruses.
Types of dust: Heavy Dust, Settling Dust, Suspended Atmospheric Dust, Cement Dust, Fly Ash.
Types of particulates: Oil Smoke, Smog, Tobacco Smoke, Soot.
Other labels on the scale: Pin, Gas Molecules.
Why?
How?
Used around 40 different Big Data sets from satellites, meteorology,
demographics, scraped websites and social media to estimate PM2.5. The plot
shows the average over 5,935 days from August 1, 1997 to the present.
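
The multi-year average can be illustrated with a minimal Python sketch (the slides use Matlab; Python is a stand-in here). It averages one grid cell's daily values while skipping missing days, and works out where a 5,935-day record starting on August 1, 1997 ends. The function names and the three-day series are hypothetical, not taken from the original system.

```python
from datetime import date, timedelta


def mean_ignoring_gaps(daily_values):
    """Average the valid entries of one grid cell's daily series; None marks a missing day."""
    valid = [v for v in daily_values if v is not None]
    return sum(valid) / len(valid) if valid else None


def last_day_of_record(start, n_days):
    """Date of the n-th day of the record when `start` counts as day 1."""
    return start + timedelta(days=n_days - 1)


# Hypothetical three-day series for one grid cell (ug/m3), with one missing day.
cell_series = [12.0, None, 8.0]
cell_mean = mean_ignoring_gaps(cell_series)

# 5,935 days counted from August 1, 1997 (inclusive).
record_end = last_day_of_record(date(1997, 8, 1), 5935)
```

With August 1, 1997 counted as day 1, the 5,935th day falls on October 30, 2013.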
Which Platform?
Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the
scratch space time limit expires before the massive datasets have been
processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
4. High-level language with a wide range of optimized toolboxes (Matlab)
5. Algorithms capable of dealing with massive non-linear, non-parametric,
non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs
7. Ability to schedule tasks at precise times and time intervals to automate
workflows (in this case tasks executed at intervals of 5 minutes, 15 minutes,
1 hour, 3 hours, 1 day)
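
The scheduling requirement can be sketched in a few lines of Python (a stand-in for the Matlab workflow). The five intervals are the ones listed above; the task names and the choice of reference epoch are assumptions made only for illustration.

```python
from datetime import datetime, timedelta

# Intervals from the requirements list; task names are hypothetical.
INTERVALS = {
    "poll_in_situ_feeds": timedelta(minutes=5),
    "refresh_intermediate_products": timedelta(minutes=15),
    "fetch_satellite_data": timedelta(hours=1),
    "run_ml_ensemble": timedelta(hours=3),
    "publish_daily_products": timedelta(days=1),
}


def due_tasks(now, epoch=datetime(1997, 8, 1)):
    """Return the tasks whose interval evenly divides the time elapsed since `epoch`.

    Seconds are truncated, so the function is meant to be called once a minute.
    """
    elapsed = now.replace(second=0, microsecond=0) - epoch
    return [name for name, every in INTERVALS.items()
            if elapsed % every == timedelta(0)]
```

A driver such as cron or a simple once-a-minute loop would call `due_tasks(datetime.now())` and dispatch whatever it returns; at 03:00, for example, every task except the daily one is due.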
How?

Existing
• Social Media
• Socioeconomic, Census
• News feeds
• Environmental
• Weather
• Satellite
• Sensors
• Health
• Economic

New
• UAVs
• Smart Dust
• Autonomous Cars
• Sensors

Simulation
• Global Weather Models
• Economic Models
• Earthquake Models

Data → Machine Learning → Insight

Same approach highly relevant for
the validation and optimal
exploitation of the next generation
of satellites, e.g. the upcoming
NASA Decadal Survey Missions.
How?

California Children Example
Terra DeepBlue
Rank  Source                   Variable                                  Type
1                              Population Density                        Input
2     Satellite Product        Tropospheric NO2 Column                   Input
3     Meteorological Analyses  Surface Specific Humidity                 Input
4     Satellite Product        Solar Azimuth                             Input
5     Meteorological Analyses  Surface Wind Speed                        Input
6     Satellite Product        White-sky Albedo at 2,130 nm              Input
7     Satellite Product        White-sky Albedo at 555 nm                Input
8     Meteorological Analyses  Surface Air Temperature                   Input
9     Meteorological Analyses  Surface Layer Height                      Input
10    Meteorological Analyses  Surface Ventilation Velocity              Input
11    Meteorological Analyses  Total Precipitation                       Input
12    Satellite Product        Solar Zenith                              Input
13    Meteorological Analyses  Air Density at Surface                    Input
14    Satellite Product        Cloud Mask Qa                             Input
15    Satellite Product        Deep Blue Aerosol Optical Depth 470 nm    Input
16    Satellite Product        Sensor Zenith                             Input
17    Satellite Product        White-sky Albedo at 858 nm                Input
18    Meteorological Analyses  Surface Velocity Scale                    Input
19    Satellite Product        White-sky Albedo at 470 nm                Input
20    Satellite Product        Deep Blue Angstrom Exponent Land          Input
21    Satellite Product        White-sky Albedo at 1,240 nm              Input
22    Satellite Product        Scattering Angle                          Input
23    Satellite Product        Sensor Azimuth                            Input
24    Satellite Product        Deep Blue Surface Reflectance 412 nm      Input
25    Satellite Product        White-sky Albedo at 1,640 nm              Input
26    Satellite Product        Deep Blue Aerosol Optical Depth 660 nm    Input
27    Satellite Product        White-sky Albedo at 648 nm                Input
28    Satellite Product        Deep Blue Surface Reflectance 660 nm      Input
29    Satellite Product        Cloud Fraction Land                       Input
30    Satellite Product        Deep Blue Surface Reflectance 470 nm      Input
31    Satellite Product        Deep Blue Aerosol Optical Depth 550 nm    Input
32    Satellite Product        Deep Blue Aerosol Optical Depth 412 nm    Input
      In-situ Observation      PM2.5                                     Target
Hourly measurements from 53 countries from 1997-present

A lot of measurements,
but notice the large gaps!
Gaps are inevitable because of the
infrastructure and cost associated with
making the measurements.
Challenge 1: Obtaining the in-situ PM2.5 data
Real time data from:
1. EPA AirNow data for USA and Canada
2. EEA data for Europe
3. Tasmania and Australia
4. Israel
5. Russia
6. Asia and Latin America by scraping http://aqicn.org/map/
7. Harvesting social media (Twitter feeds from US Embassies)

Relatively low bandwidth from multiple sites every 5 minutes
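
Because the in-situ data arrive from many feeds, some step has to reconcile them into one hourly record per station. A minimal Python sketch of such a merge follows; the record layout (`station`, `time`, `pm25`), the rule that later feeds win, and the rejection of negative values are all assumptions for illustration, not the original system's logic.

```python
from datetime import datetime


def merge_readings(*feeds):
    """Merge PM2.5 readings from several feeds, keeping one value per (station, hour).

    Each reading is a dict with 'station', 'time' (datetime) and 'pm25' keys.
    Later feeds overwrite earlier ones for the same station and hour;
    missing or negative values are dropped as invalid.
    """
    merged = {}
    for feed in feeds:
        for reading in feed:
            value = reading.get("pm25")
            if value is None or value < 0:
                continue
            hour = reading["time"].replace(minute=0, second=0, microsecond=0)
            merged[(reading["station"], hour)] = float(value)
    return merged


# Hypothetical readings from two feeds covering the same station and hour.
feed_a = [
    {"station": "A", "time": datetime(2013, 10, 30, 12, 5), "pm25": 10},
    {"station": "B", "time": datetime(2013, 10, 30, 12, 5), "pm25": -999},
]
feed_b = [{"station": "A", "time": datetime(2013, 10, 30, 12, 35), "pm25": 12.5}]
merged = merge_readings(feed_a, feed_b)
```

Keying on station and truncated hour keeps the merge idempotent, so the same feed can be polled every 5 minutes without creating duplicates.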
Challenge 2: (Easier)
Obtaining the Satellite & Meteorological Data
Real time data from:
1. Multiple satellites: MODIS Terra, MODIS Aqua, SeaWiFS, VIIRS NPP, etc.
2. Global Meteorological Analyses

High bandwidth from a few sites every 1 to 3 hours
Challenge 3:
Combine multiple Big Data sets with Machine Learning
Large-member machine learning ensemble using massively parallel computing
to produce the PM2.5 data product
Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
Development time drastically reduced by using a high-level language (Matlab)
that can easily exploit parallel execution on both multiple CPUs and GPUs.

Massively parallel every 3 hours
High-level language which can readily use CPUs and GPUs
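
The idea of a many-member ensemble evaluated in parallel can be sketched as follows. The slides do not name the learning algorithm, so this example swaps in a bagged k-nearest-neighbour regressor purely as a non-parametric stand-in, written in Python rather than Matlab, with a thread pool standing in for multi-CPU/GPU execution. The two input variables and the synthetic training data are hypothetical.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def knn_predict(train, x, k=3):
    """Predict y at x as the mean target of the k nearest training rows."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)


def bootstrap_member(data, seed):
    """One ensemble member: a bootstrap resample of the training rows."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]


def ensemble_predict(data, x, n_members=20, k=3):
    """Return (mean, standard deviation) of the member predictions at x."""
    with ThreadPoolExecutor() as pool:
        members = list(pool.map(lambda s: bootstrap_member(data, s), range(n_members)))
        preds = list(pool.map(lambda m: knn_predict(m, x, k), members))
    return statistics.mean(preds), statistics.pstdev(preds)


# Synthetic rows: ((aerosol optical depth, humidity), pm25), invented for this sketch.
data = [
    ((a / 10, h / 10), 5 + 30 * (a / 10) ** 2 + 2 * (h / 10))
    for a in range(11)
    for h in range(11)
]
mean, spread = ensemble_predict(data, (0.5, 0.5))
```

The spread across members gives a rough per-prediction uncertainty, which is one common reason to prefer an ensemble over a single fit.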
Challenge 4:
Continual Performance Improvement
Currently on around the 400th version of the system.
Have been making continuous improvements in:
1. Coverage of the in-situ training data set
2. Inclusion of new satellite sensors
3. Additional Big Data sets that help improve the fidelity of the non-linear, non-parametric, non-Gaussian multivariate machine learning fits
4. Use of many alternative machine learning strategies
5. Estimation of uncertainties
This requires frequent reprocessing of the entire multi-year record from
1997 to the present.

Persistent massive data storage, much more
than the usual scratch space at HPC centers
Fully Automated Workflow

Requires the ability to schedule automated tasks
Requires the ability to disseminate results in multiple formats, including
via FTP and as web and map services
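
Disseminating the same result in several formats might look like the Python sketch below, which serializes a list of grid cells both as CSV (suited to file or FTP distribution) and as GeoJSON (a format commonly consumed by web map clients). The slides do not say which formats were used, so both choices, the field names and the sample value are assumptions.

```python
import csv
import io
import json


def to_csv(cells):
    """Serialize grid cells as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["lat", "lon", "pm25"])
    for cell in cells:
        writer.writerow([cell["lat"], cell["lon"], cell["pm25"]])
    return buf.getvalue()


def to_geojson(cells):
    """Serialize grid cells as a GeoJSON FeatureCollection of points.

    GeoJSON orders coordinates as [longitude, latitude].
    """
    features = [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [cell["lon"], cell["lat"]]},
            "properties": {"pm25": cell["pm25"]},
        }
        for cell in cells
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})


# One hypothetical grid cell; the PM2.5 value is made up for illustration.
cells = [{"lat": 32.99, "lon": -96.75, "pm25": 9.4}]
csv_text = to_csv(cells)
geojson_text = to_geojson(cells)
```

Generating every output from one in-memory representation keeps the formats consistent with each other when the record is reprocessed.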
Key System Requirements:
Not always available on current HPC systems
Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the
scratch space time limit expires before the massive datasets have been
processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
4. High-level language with a wide range of optimized toolboxes (Matlab)
5. Algorithms capable of dealing with massive non-linear, non-parametric,
non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs
7. Ability to schedule tasks at precise times and time intervals to automate
workflows (in this case tasks executed at intervals of 5 minutes, 15 minutes,
1 hour, 3 hours, 1 day)

Thank you!

Alpen-Adria-Universität
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 

Recently uploaded (20)

dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 

Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

  • 1. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study. David Lary, Hanson Center for Space Science, University of Texas at Dallas
  • 3. Why? Table 1. PM and health outcomes (modified from Ruckerl et al. (2006)). x, few studies; xx, many studies; xxx, large number of studies; - marks a blank cell.
      Health outcome                              Short-term studies      Long-term studies
                                                  PM10   PM2.5  UFP       PM10   PM2.5  UFP
      Mortality
        All causes                                xxx    xxx    x         xx     xx     x
        Cardiovascular                            xxx    xxx    x         xx     xx     x
        Pulmonary                                 xxx    xxx    x         xx     xx     x
      Pulmonary effects
        Lung function, e.g., PEF                  xxx    xxx    xx        xxx    xxx    -
        Lung function growth                      -      -      -         xxx    xxx    -
      Asthma and COPD exacerbation
        Acute respiratory symptoms                -      xx     x         xxx    xxx    -
        Medication use                            -      -      x         -      -      -
        Hospital admission                        xx     xxx    x         -      -      -
      Lung cancer
        Cohort                                    -      -      -         xx     xx     x
        Hospital admission                        -      -      -         xx     xx     x
      Cardiovascular effects
        Hospital admission                        xxx    xxx    -         x      x      -
      ECG-related endpoints
        Autonomic nervous system                  xxx    xxx    xx        -      -      -
        Myocardial substrate and vulnerability    -      xx     x         -      -      -
      Vascular function
        Blood pressure                            xx     xxx    x         -      -      -
        Endothelial function                      x      xx     x         -      -      -
      Blood markers
        Pro inflammatory mediators                xx     xx     xx        -      -      -
        Coagulation blood markers                 xx     xx     xx        -      -      -
        Diabetes                                  x      xx     x         -      -      -
        Endothelial function                      x      x      xx        -      -      -
      Reproduction
        Premature birth                           x      x      -         -      -      -
        Birth weight                              xx     x      -         -      -      -
        IUR/SGA                                   x      x      -         -      -      -
      Fetal growth
        Birth defects                             x      -      -         -      -      -
        Infant mortality                          xx     x      -         -      -      -
        Sperm quality                             x      x      -         -      -      -
      Neurotoxic effects
        Central nervous system                    -      x      xx        -      -      -
      [Figure: particle size scale from 0.0001 μm to 1000 μm (1 mm). Size classes: PM10 particles, PM10-2.5 coarse fraction, PM2.5 particles, PM0.1 ultra fine particles. Health outcomes: Decreased Lung Function < 10 μm; Skin & Eye Disease < 2.5 μm; Tumors < 1 μm; Cardiovascular Disease < 0.1 μm. Types of biological material: Mold Spores, Cell, Pollen, House Dust Mite Allergens, Cat Allergens, Bacteria, Hair, Viruses. Types of dust: Heavy Dust, Settling Dust, Suspended Atmospheric Dust, Cement Dust, Fly Ash. Types of particulates: Oil Smoke, Smog, Tobacco Smoke, Soot. Other labels: Pin, Gas Molecules.]
  • 5. How? Around 40 different big datasets from satellites, meteorology, demographics, scraped websites and social media were used to estimate PM2.5. The plot below shows the average of 5,935 days, from August 1, 1997 to the present.
  • 13. Which Platform? Requirements:
      1. Large persistent storage for multiple big datasets, 100 TB+ (otherwise the scratch-space time limit expires before there has been time to process the massive datasets)
      2. High-bandwidth connections
      3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
      4. High-level language with a wide range of optimized toolboxes (Matlab)
      5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
      6. Ease of use of multiple GPUs and CPUs
      7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours and 1 day)
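The scheduling requirement can be illustrated with a small sketch. It is written in Python purely for illustration (the slides name Matlab as the language actually used), and the function name `next_run`, the list name `SLIDE_INTERVALS` and the choice to align runs to midnight UTC are assumptions rather than details from the talk. For each interval listed on the slide, it computes the next clock-aligned time at which a task would fire.

```python
from datetime import datetime, timedelta, timezone

# Intervals quoted on the slide: 5 minutes, 15 minutes, 1 hour, 3 hours, 1 day.
SLIDE_INTERVALS = [
    timedelta(minutes=5),
    timedelta(minutes=15),
    timedelta(hours=1),
    timedelta(hours=3),
    timedelta(days=1),
]


def next_run(now: datetime, interval: timedelta) -> datetime:
    """Return the first interval boundary strictly after `now`.

    Boundaries are whole multiples of `interval` counted from midnight
    of the current day (an assumption made for this sketch).
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    completed = (now - midnight) // interval  # whole intervals already elapsed
    return midnight + (completed + 1) * interval


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for interval in SLIDE_INTERVALS:
        print(interval, "->", next_run(now, interval).isoformat())
```

A real deployment would more likely hand these intervals to an existing scheduler (cron, a batch system, or a Matlab timer); the point of the sketch is only that precise, clock-aligned firing times are easy to compute once the platform allows scheduled tasks at all.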
  • 16. Terra DeepBlue
      Rank  Source                   Variable                                   Type
      1                              Population Density                         Input
      2     Satellite Product        Tropospheric NO2 Column                    Input
      3     Meteorological Analyses  Surface Specific Humidity                  Input
      4     Satellite Product        Solar Azimuth                              Input
      5     Meteorological Analyses  Surface Wind Speed                         Input
      6     Satellite Product        White-sky Albedo at 2,130 nm               Input
      7     Satellite Product        White-sky Albedo at 555 nm                 Input
      8     Meteorological Analyses  Surface Air Temperature                    Input
      9     Meteorological Analyses  Surface Layer Height                       Input
      10    Meteorological Analyses  Surface Ventilation Velocity               Input
      11    Meteorological Analyses  Total Precipitation                        Input
      12    Satellite Product        Solar Zenith                               Input
      13    Meteorological Analyses  Air Density at Surface                     Input
      14    Satellite Product        Cloud Mask Qa                              Input
      15    Satellite Product        Deep Blue Aerosol Optical Depth 470 nm     Input
      16    Satellite Product        Sensor Zenith                              Input
      17    Satellite Product        White-sky Albedo at 858 nm                 Input
      18    Meteorological Analyses  Surface Velocity Scale                     Input
      19    Satellite Product        White-sky Albedo at 470 nm                 Input
      20    Satellite Product        Deep Blue Angstrom Exponent Land           Input
      21    Satellite Product        White-sky Albedo at 1,240 nm               Input
      22    Satellite Product        Scattering Angle                           Input
      23    Satellite Product        Sensor Azimuth                             Input
      24    Satellite Product        Deep Blue Surface Reflectance 412 nm       Input
      25    Satellite Product        White-sky Albedo at 1,640 nm               Input
      26    Satellite Product        Deep Blue Aerosol Optical Depth 660 nm     Input
      27    Satellite Product        White-sky Albedo at 648 nm                 Input
      28    Satellite Product        Deep Blue Surface Reflectance 660 nm       Input
      29    Satellite Product        Cloud Fraction Land                        Input
      30    Satellite Product        Deep Blue Surface Reflectance 470 nm       Input
      31    Satellite Product        Deep Blue Aerosol Optical Depth 550 nm     Input
      32    Satellite Product        Deep Blue Aerosol Optical Depth 412 nm     Input
            In-situ Observation      PM2.5                                      Target
  • 18. Hourly measurements from 53 countries, from 1997 to the present. A lot of measurements, but notice the large gaps!
  • 19. Gaps are inevitable because of the infrastructure and cost associated with making the measurements.
  • 20. Challenge 1: Obtaining the in-situ PM2.5 data. Real-time data from:
      1. EPA AirNow data for USA and Canada
      2. EEA data for Europe
      3. Tasmania and Australia
      4. Israel
      5. Russia
      6. Asia and Latin America by scraping http://aqicn.org/map/
      7. Harvesting social media (Twitter feeds from US Embassies)
      Relatively low bandwidth from multiple sites every 5 minutes.
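Harvesting from many independent feeds means the same station and hour can arrive more than once. The sketch below, in Python for illustration only, shows one way such batches might be merged into a single hourly record set. The `Reading` fields, the feed names and the "latest batch wins" rule are all hypothetical choices made for this example; the slides do not describe how the actual system reconciles its sources.

```python
from datetime import datetime, timezone
from typing import Iterable, List, NamedTuple


class Reading(NamedTuple):
    """One PM2.5 observation (hypothetical record layout)."""
    station: str
    time: datetime   # must be timezone-aware
    pm25: float      # concentration in micrograms per cubic metre
    source: str      # label of the feed the reading came from


def merge_readings(batches: Iterable[Iterable[Reading]]) -> List[Reading]:
    """Merge batches from several feeds into one reading per station and hour.

    Times are converted to UTC and truncated to the hour. When two readings
    share a station and hour, the one from the later batch replaces the
    earlier one (an arbitrary rule chosen for this sketch).
    """
    merged = {}
    for batch in batches:
        for reading in batch:
            hour = reading.time.astimezone(timezone.utc).replace(
                minute=0, second=0, microsecond=0
            )
            merged[(reading.station, hour)] = reading._replace(time=hour)
    return sorted(merged.values(), key=lambda r: (r.station, r.time))


if __name__ == "__main__":
    t = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
    feed_a = [Reading("S1", t, 10.0, "feed_a")]
    feed_b = [
        Reading("S1", t.replace(minute=40), 12.0, "feed_b"),
        Reading("S2", t, 7.0, "feed_b"),
    ]
    for row in merge_readings([feed_a, feed_b]):
        print(row)
```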
  • 21. Challenge 2: (Easier) Obtaining the satellite and meteorological data. Real-time data from:
      1. Multiple satellites: MODIS Terra, MODIS Aqua, SeaWiFS, VIIRS NPP, etc.
      2. Global meteorological analyses
      High bandwidth from a few sites every 1 to 3 hours.
  • 22. Challenge 3: Combine multiple big datasets with machine learning. A large-member machine learning ensemble, run with massively parallel computing every 3 hours, produces the PM2.5 data product. The algorithms must be capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables). Development time was drastically reduced by using a high-level language (Matlab) that can easily exploit parallel execution on both multiple CPUs and GPUs.
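The idea of a many-member ensemble trained in parallel can be sketched as follows. This is not the system described in the talk: it is written in Python rather than Matlab, each member is a deliberately simple straight-line fit standing in for the non-linear, non-parametric learners the slides refer to, and bootstrap resampling is an assumed way of making the members differ. All function names are hypothetical. The sketch shows the general pattern: train members independently on resampled data, then report the ensemble mean as the estimate and the spread across members as a rough uncertainty.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: every x identical
        return my, 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def fit_member(job):
    """Train one ensemble member on a bootstrap resample of the data."""
    xs, ys, seed = job
    rng = random.Random(seed)  # one seed per member keeps runs reproducible
    idx = [rng.randrange(len(xs)) for _ in xs]
    return fit_line([xs[i] for i in idx], [ys[i] for i in idx])


def train_ensemble(xs, ys, n_members=50, workers=4):
    """Train all members concurrently.

    Threads keep the sketch simple; CPU-bound members would normally be
    spread over processes, cluster nodes or GPUs instead.
    """
    jobs = [(xs, ys, seed) for seed in range(n_members)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_member, jobs))


def predict(ensemble, x):
    """Return (ensemble mean, spread across members) at input x."""
    preds = [a + b * x for a, b in ensemble]
    return statistics.fmean(preds), statistics.stdev(preds)


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [i / 10 for i in range(100)]
    ys = [2.0 + 3.0 * x + rng.gauss(0, 0.5) for x in xs]  # synthetic data
    mean, spread = predict(train_ensemble(xs, ys), 5.0)
    print(f"estimate {mean:.2f} +/- {spread:.2f}")
```

Because each member depends only on its own resample, the work splits cleanly across however many CPUs or GPUs are available, which is the property the slide's platform requirements are asking for.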
  • 23. Challenge 4: Continual performance improvement. The system is currently on around its 400th version. Continuous improvements have been made in:
      1. Coverage of the in-situ training dataset
      2. Inclusion of new satellite sensors
      3. Additional big datasets that help improve the fidelity of the non-linear, non-parametric, non-Gaussian multivariate machine learning fits
      4. Use of many alternative machine learning strategies
      5. Estimation of uncertainties
      This requires frequent reprocessing of the entire multi-year record from 1997 to the present, and therefore persistent massive data storage, much more than the usual scratch space at HPC centers.
  • 24. Fully Automated Workflow. Requires the ability to schedule automated tasks.
  • 25. Requires the ability to disseminate results in multiple formats, including FTP and web and map services.
  • 28. Key System Requirements: Not always available on current HPC systems. Requirements:
      1. Large persistent storage for multiple big datasets, 100 TB+ (otherwise the scratch-space time limit expires before there has been time to process the massive datasets)
      2. High-bandwidth connections
      3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
      4. High-level language with a wide range of optimized toolboxes (Matlab)
      5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
      6. Ease of use of multiple GPUs and CPUs
      7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours and 1 day)
      Thank you!