2. Why does AWS care about Open Data?
§ Many of our commercial sector customers rely on quality open data as much as they
rely on our cloud infrastructure services.
§ Many of our public sector customers use AWS to make their data available to a global
community of researchers, entrepreneurs, students, and fellow government agencies.
Sharing data on AWS makes it accessible to a large and growing community
of researchers, entrepreneurs, and enterprises who use the AWS cloud.
3. Big Data: Collaboration in the Cloud
The Big Data Challenge
Traditionally, it has been time consuming and expensive to acquire, store, and analyze
large data sets.
Our Solution – Shared Open Data on AWS
When data is shared on AWS, users can work with any amount of data without needing
to download it or store their own copies.
When heavy data like Earth imagery is available in AWS's cloud, the time required to
copy the data to a virtual server for analysis is virtually eliminated.
4. Traditional Data Acquisition
“…data must be organized, well-documented, consistently formatted, and error
free. Cleaning the data is often the most taxing part of data science, and is
frequently 80% of the work.”
- Data Driven by DJ Patil and Hilary Mason
We ask: How can we get rid of that 80%?
Bandwidth and infrastructure constraints slow down research and development.
New and better sensors are exacerbating this problem.
[Diagram labels: Tape, Data Center, Disk, Server, Client]
5. Data Acquisition in the Cloud
The cloud allows researchers from anywhere to take their algorithms
to data rather than downloading data to their computing resources.
When data is shared in the cloud, anyone can analyze it without needing to download it or
store it themselves.
“Ordinarily, hitting ‘copy’ on a 4 gigabyte file is an opportunity to stand up
and get a fresh cup of coffee, browse the sports section for a little while, but
moving data between servers in an Amazon data center barely affords
time to touch your toes a couple times.” — Paul Ramsey
Source: http://s3.cleverelephant.ca.s3.amazonaws.com/2015-ccog.pdf
6. Open data as a platform
[Diagram: stages run from Data Creation through Data Enrichment to Sensemaking.
Layers: Data at Rest (object storage), Basic APIs, Complex APIs.
Uses: consumer applications, algorithmic policy, data-driven journalism,
data catalogs, focused data dashboards, predictive modeling, visualizations.
Axis label: Lower cost of knowledge (Efficiency)]
8. Public Data Sets on AWS
Several high-value datasets are available for anyone to access for free on AWS.
Examples include:
Landsat on AWS, 3K Rice Genome, NEXRAD on AWS
9. Landsat on AWS
We have committed to make up to 1
petabyte of Landsat imagery readily
available as objects on Amazon S3.
All Landsat 8 scenes from 2015 and
2016 are available, along with a
selection of cloud-free scenes from
2013 and 2014.
All new Landsat 8 scenes are made
available each day (~700 per day),
often within hours of production.
11. Landsat – Not Just Pretty Pictures
Landsat scenes are made up of
multiple files, each of which
includes data about different kinds
of light reflected off of Earth.
The Landsat program has been in
operation since 1972.
12. Landsat – Not Just Pretty Pictures
Each pixel of each Landsat 8 file
represents a 12-bit measurement of
light reflected off a 30 m × 30 m part of our
planet.
Each Landsat 8 scene contains
about 840 million pixels and takes
up about 800 MB.
We’re currently hosting over
128,000 Landsat 8 scenes and
make about 700 new scenes
available on S3 every day.
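To put the figures on this slide in perspective, here is a back-of-the-envelope sketch of the storage they imply. It uses only the numbers quoted above (800 MB per scene, 128,000 scenes hosted, 700 new scenes per day) and assumes decimal units; the function names are illustrative.

```python
# Figures quoted on the slide (all approximate).
SCENE_SIZE_MB = 800        # size of one Landsat 8 scene
SCENES_HOSTED = 128_000    # scenes hosted at the time of the talk
NEW_SCENES_PER_DAY = 700   # scenes added each day


def archive_size_tb(scenes: int, scene_mb: float = SCENE_SIZE_MB) -> float:
    """Total size in terabytes, assuming 1 TB = 1,000,000 MB."""
    return scenes * scene_mb / 1_000_000


def daily_growth_gb(scenes_per_day: int, scene_mb: float = SCENE_SIZE_MB) -> float:
    """Daily growth in gigabytes, assuming 1 GB = 1,000 MB."""
    return scenes_per_day * scene_mb / 1_000


print(f"Hosted archive: ~{archive_size_tb(SCENES_HOSTED):.1f} TB")
print(f"Daily growth:   ~{daily_growth_gb(NEW_SCENES_PER_DAY):.0f} GB")
```

Under these assumptions the hosted scenes come to roughly 100 TB, growing by a little over half a terabyte per day.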
13. Landsat on AWS
Landsat on AWS makes each band of
each scene readily available as objects
on Amazon S3. Data can be accessed
programmatically via HTTP and quickly
deployed to any of our products for
analysis and processing.
Users do not need to worry about local
storage and have access to virtually
unlimited computing power on demand.
[Diagram labels: USGS (.tar), s3://landsat-pds (.tiff), Amazon EC2]
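Since each band is an individual S3 object, a band can be addressed with a plain HTTP URL. The sketch below shows one way to build such a URL. Only the bucket name `landsat-pds` comes from the slide; the key layout (`L8/<path>/<row>/<scene>/<scene>_B<band>.TIF`), the `.TIF` extension, and the scene ID are assumptions for illustration and should be checked against the bucket's documentation.

```python
BUCKET = "landsat-pds"  # bucket name shown on the slide


def landsat_band_url(scene_id: str, path: int, row: int, band: int) -> str:
    """Build an HTTP URL for one band of one scene.

    The key layout here is an assumption, not something stated in the talk.
    """
    key = f"L8/{path:03d}/{row:03d}/{scene_id}/{scene_id}_B{band}.TIF"
    return f"https://{BUCKET}.s3.amazonaws.com/{key}"


# Illustrative scene ID; path 139 / row 45, band 4.
print(landsat_band_url("LC81390452014295LGN00", path=139, row=45, band=4))
```

Because the result is an ordinary URL, any HTTP client or GIS library that reads remote files could consume it without a local copy of the scene.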
14. Landsat on AWS: Usage
In the first year (19 Mar 2015 – 19 Mar 2016)
§ Over 400,000 scenes available
§ Over 1 billion hits globally
Image shows frequency of scene requests by
path/row.
White: ~100 requests
Orange: >300k requests
Visualization by Drew Bollinger
Development Seed
15. MATLAB – Landsat 8 Data Explorer
MathWorks created a freely
downloadable MATLAB-based tool
for accessing, processing, and
visualizing Landsat 8 data.
The tool allows MATLAB users to
find Landsat 8 scenes, analyze
them, and combine them with other
sources of GIS data for new
visualizations.
http://blogs.mathworks.com/steve/2015/03/19/matlab-landsat-8-aws/
16. Esri – Unlock Earth’s Secrets
Esri has created a tool to show how
ArcGIS Online can quickly visualize
Landsat data for live visualization and
analysis within the browser.
“These are not pre-generated cache
services limited to just visualization—
they are dynamic, high-performance
image services that perform on-the-
fly processing and dynamic
mosaicking of Landsat’s multi-
spectral and multi-temporal imagery.”
http://www.esri.com/landsatonaws
18. Landsat-live
Mapbox created Landsat-live, a map
that is constantly refreshed with the
latest satellite imagery from NASA’s
Landsat 8 satellite.
Creating a live Earth imagery
pipeline is possible because Landsat
imagery is available on Amazon S3
within hours of creation.
https://www.mapbox.com/blog/landsat-live-live/
19. Snapsat
A team of five novice programmers
used Landsat on AWS to build a web
app called Snapsat that creates
Landsat data visualizations in
seconds.
Snapsat was built during the team’s
8-week training program at Code
Fellows. They launched it just
months after learning to write code.
http://snapsat.org
21. The NOAA Big Data Project
Amazon Web Services has entered into a research
agreement with the US National Oceanic and
Atmospheric Administration (NOAA) to explore
sustainable models to increase the output of open
NOAA data.
Under this new agreement, AWS and its
collaborators are exploring ways to make NOAA
data easier to access and use.
22. NEXRAD on AWS
The Next Generation Weather Radar (NEXRAD) is
a network of 160 high-resolution Doppler radar
sites that detects precipitation and atmospheric
movement and disseminates data at 5-minute
intervals from each site.
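As a rough illustration of the data rate this implies, the sketch below counts volume scans per day from the two figures above (160 sites, one scan every 5 minutes). It assumes every site reports on a fixed 5-minute schedule around the clock, which is a simplification.

```python
SITES = 160            # radar sites, as quoted on the slide
INTERVAL_MINUTES = 5   # reporting interval, as quoted on the slide


def scans_per_day(sites: int = SITES, interval_minutes: int = INTERVAL_MINUTES) -> int:
    """Number of volume scans per day across the whole network."""
    scans_per_site = (24 * 60) // interval_minutes
    return sites * scans_per_site


print(scans_per_day())
```

Under these assumptions the network produces tens of thousands of volume scans per day.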
It has traditionally been time consuming and
expensive to acquire, store, and analyze NEXRAD
data. Accessing the full historical archive has been
impossible.
23. NEXRAD on AWS – RESTful Interface
NEXRAD on AWS makes 270TB of individual volume scan files and real-time
chunks freely available on Amazon S3.
§ Data can be accessed programmatically via a RESTful interface and quickly
deployed to any of our products for analysis and processing.
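A minimal sketch of what access through the REST interface could look like, using only the Python standard library and the S3 ListObjectsV2 query parameters. The bucket name `noaa-nexrad-level2` and the `YYYY/MM/DD/SITE/` key prefix are assumptions that do not appear in the talk, and the sketch presumes the bucket allows anonymous listing.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket: str, prefix: str) -> str:
    """URL that asks S3 to list the objects under a key prefix."""
    query = urlencode({"list-type": "2", "prefix": prefix})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_text) -> list:
    """Extract object keys from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]


def list_keys(bucket: str, prefix: str) -> list:
    """Fetch one page of the listing over HTTP (needs network access)."""
    with urlopen(listing_url(bucket, prefix)) as resp:
        return parse_keys(resp.read())


# Hypothetical bucket and prefix: one day of scans from one radar site.
print(listing_url("noaa-nexrad-level2", "2016/04/05/KTLX/"))
```

A single response holds at most 1,000 keys, so a complete listing would also need to follow the continuation token, which this sketch leaves out.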
24. NEXRAD on AWS – Event-driven Analysis
NEXRAD on AWS makes 270TB of individual volume scan files and real-time
chunks freely available on Amazon S3.
§ Amazon Simple Notification Service (SNS) allows subscription to
notifications of new data.
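To illustrate event-driven analysis, here is a sketch of a handler that pulls new object keys out of an SNS notification. It assumes the notification body is a standard S3 event record wrapped in the usual SNS JSON envelope; the actual message format of the NEXRAD topic is not given in the talk and may differ. The sample payload is invented.

```python
import json
from typing import List


def new_object_keys(sns_envelope: str) -> List[str]:
    """Return the S3 keys announced in one SNS notification.

    Assumes the SNS "Message" field holds an S3 event notification
    (a JSON document with a "Records" list).
    """
    envelope = json.loads(sns_envelope)
    if envelope.get("Type") != "Notification":
        return []  # e.g. a subscription confirmation
    event = json.loads(envelope["Message"])
    return [rec["s3"]["object"]["key"] for rec in event.get("Records", [])]


# Invented sample payload for illustration only.
sample = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "Records": [{"s3": {"bucket": {"name": "example-bucket"},
                            "object": {"key": "2016/04/05/KTLX/scan-0001"}}}]
    }),
})
print(new_object_keys(sample))
```

A subscriber could pass each returned key to its analysis pipeline as soon as the notification arrives, instead of polling the bucket for new files.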
25. NEXRAD on AWS: Early Use Cases
NEXRAD on AWS has already reduced the cost of research and product
development.
§ Climate Corporation cut two weeks
out of an analysis pipeline and is
incorporating the real-time feed into
its production workflows.
§ A weather data company stopped
storing its own NEXRAD archive,
freeing up revenue to build new
products.
§ Several weather companies are
developing new products based on
the archival data and real-time feed.
§ Unidata has made the archive data
available via an AWS-hosted THREDDS
Data Server to users with .edu domains.
§ Several researchers are planning
longitudinal analyses using the full
archive.
§ An open sensor data standards group is
evaluating the NEXRAD on AWS model.
33. HTTP Range Gets and Headers are Cool
https://sgillies.net/2016/04/05/rasterio-0-34.html
This call fetches only 16,384 bytes of a 50 MB TIFF to get the metadata
printed above. That’s an efficiency of about 3000:1.
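The idea behind a range GET can be sketched with the standard library alone: send a `Range` header and read only the bytes the server returns. This is not the rasterio call from the linked post, only a generic illustration; the function names are hypothetical, and the ratio assumes 50 MB means 50,000,000 bytes.

```python
from urllib.request import Request, urlopen


def range_header(start: int, length: int) -> dict:
    """HTTP header asking for `length` bytes beginning at offset `start`."""
    return {"Range": f"bytes={start}-{start + length - 1}"}


def fetch_bytes(url: str, start: int = 0, length: int = 16_384) -> bytes:
    """Download part of a remote file (needs network access)."""
    req = Request(url, headers=range_header(start, length))
    with urlopen(req) as resp:
        # 206 Partial Content means the server honoured the Range header.
        if resp.status != 206:
            raise RuntimeError(f"range request not honoured: HTTP {resp.status}")
        return resp.read()


def savings_ratio(file_bytes: int, fetched_bytes: int) -> float:
    """How many times smaller the partial download is than the full file."""
    return file_bytes / fetched_bytes


print(range_header(0, 16_384))
print(f"{savings_ratio(50_000_000, 16_384):.0f}:1")
```

Reading only the header works for formats such as TIFF, where the metadata can be located from the first bytes of the file.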
34. Requester Pays
With Requester Pays buckets, the requester, instead of the bucket owner,
pays the cost of the request and the data download from the bucket.
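In practice, the requester has to acknowledge the charge explicitly on each request. A minimal sketch with boto3 (a third-party library, imported lazily here) passes `RequestPayer="requester"` to `get_object`. The bucket, key, and destination names are placeholders, and the call assumes AWS credentials are already configured.

```python
def requester_pays_args(bucket: str, key: str) -> dict:
    """Arguments for an S3 GetObject call against a Requester Pays bucket."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download(bucket: str, key: str, dest: str) -> None:
    """Download one object, accepting the request and transfer charges."""
    import boto3  # third-party; imported here so the sketch loads without it

    s3 = boto3.client("s3")
    resp = s3.get_object(**requester_pays_args(bucket, key))
    with open(dest, "wb") as f:
        f.write(resp["Body"].read())


# Placeholder names for illustration only.
print(requester_pays_args("example-bucket", "path/to/object.tif"))
```

Requests that omit this acknowledgement, or that are anonymous, are rejected by S3 for Requester Pays buckets.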
35. Driving innovation with open data
To drive innovation with your data, make sure:
1. It’s accurate
2. It’s documented
3. It will still be available to developers tomorrow
These are the hard parts. Storage and delivery shouldn’t be.