Project Matsu

Project Matsu: Large Scale On-Demand
Image Processing for Disaster Relief
Collin Bennett, Robert Grossman,
Yunhong Gu, and Andrew Levine
Open Cloud Consortium
June 21, 2010
www.opencloudconsortium.org

Project Matsu Goals
• Provide persistent data resources and elastic
computing to assist in disasters:
– Make imagery available for disaster relief workers
– Elastic computing for large scale image processing
– Change detection for temporally different and
geospatially identical image sets
• Provide a resource for testing standards and for
interoperability studies of large data clouds

• 501(c)(3) not-for-profit corporation
• Supports the development of standards,
interoperability frameworks, and reference
implementations.
• Manages testbeds: Open Cloud Testbed and
Intercloud Testbed.
• Manages cloud computing infrastructure to support
scientific research: Open Science Data Cloud.
• Develops benchmarks.

OCC Members
• Companies: Aerospace, Booz Allen Hamilton,
Cisco, InfoBlox, Open Data Group, Raytheon,
Yahoo
• Universities: CalIT2, Johns Hopkins,
Northwestern Univ., University of Illinois at
Chicago, University of Chicago
• Government agencies: NASA
• Open Source Projects: Sector Project

Operates Clouds
• 500 nodes
• 3000 cores
• 1.5+ PB
• Four data centers
• 10 Gbps
• Target to refresh 1/3
each year.
• Open Cloud Testbed
• Open Science Data Cloud
• Intercloud Testbed
• Project Matsu: Cloud-
based Disaster Relief
Services

Open Science Data Cloud
Astronomical data
Biological data
(Bionimbus)
Networking data
Image processing for disaster relief

Focus of OCC Large Data Cloud Working Group
[Figure: service stack]
• Cloud Storage Services
• Cloud Compute Services (MapReduce, UDF, & other programming frameworks)
• Table-based Data Services
• Relational-like Data Services
• Applications (App) built on these services
• Developing APIs for this framework.

Tools and Standards
• Apache Hadoop/MapReduce
• Sector/Sphere large data cloud
• Open Geospatial Consortium
– Web Map Service (WMS)
• OCC tools are open source (matsu-project)
– http://code.google.com/p/matsu-project/

Part 2: Technical Approach
• Hadoop – Lead Andrew Levine
• Hadoop with Python Streams – Lead Collin
Bennett
• Sector/Sphere – Lead Yunhong Gu

Implementation 1:
Hadoop & MapReduce
Andrew Levine

Image Processing in the Cloud - Mapper
Mapper Input Key: Bounding Box
Mapper Input Value: [image]
Mapper Output Key: Bounding Box
Mapper Output Value: [image]
Example Bounding Box: (minx = -135.0 miny = 45.0 maxx = -112.5 maxy = 67.5)
The Mapper resizes and/or cuts up the original image into pieces to output Bounding Boxes.
Step 1: Input to Mapper
Step 2: Processing in Mapper
Step 3: Mapper Output
[Figure: image tiles, each labeled "+ Timestamp"]
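
The cutting step described on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's Hadoop mapper: the function name `split_into_tiles`, the nested-list image representation, and the fixed rows-by-columns grid are all hypothetical, and the image is assumed to be stored with its northern edge as row 0.

```python
from typing import Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)
Image = List[List[int]]                   # rows of pixel values (assumed layout)


def split_into_tiles(
    bbox: BBox, image: Image, timestamp: str, rows: int, cols: int
) -> Iterator[Tuple[BBox, Tuple[str, Image]]]:
    """Cut one image into rows x cols pieces, keyed by each piece's bounding box."""
    minx, miny, maxx, maxy = bbox
    height, width = len(image), len(image[0])
    tile_h, tile_w = height // rows, width // cols
    dx, dy = (maxx - minx) / cols, (maxy - miny) / rows
    for r in range(rows):
        for c in range(cols):
            pixels = [
                row[c * tile_w:(c + 1) * tile_w]
                for row in image[r * tile_h:(r + 1) * tile_h]
            ]
            # Row 0 is assumed to be the northern edge, so y shrinks as r grows.
            tile_bbox = (
                minx + c * dx,
                maxy - (r + 1) * dy,
                minx + (c + 1) * dx,
                maxy - r * dy,
            )
            # Output key: bounding box. Output value: timestamp plus pixels.
            yield tile_bbox, (timestamp, pixels)
```

Because every piece carries its own bounding box as the key, pieces of the same area taken at different times would end up grouped together in the shuffle.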

Image Processing in the Cloud - Reducer
Reducer Key Input: Bounding Box
(minx = -45.0 miny = -2.8125 maxx = -43.59375 maxy = -2.109375)
Reducer Value Input: [images]
Step 1: Input to Reducer
Step 2: Process difference in Reducer
Assemble images based on timestamps and compare. The result is a delta of the two images.
Step 3: Reducer Output
All images go to different map layers for display in WMS:
• Timestamp 1 Set
• Timestamp 2 Set
• Delta Set
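
A minimal sketch of the comparison step, assuming exactly two images per bounding box, timestamps that sort correctly as strings, and an absolute per-pixel difference as the delta. The slides do not say how the delta is computed, so the difference measure, the function name `reduce_bounding_box`, and the output dictionary keys are hypothetical.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # rows of pixel values (assumed layout)


def reduce_bounding_box(values: List[Tuple[str, Image]]) -> Dict[str, Image]:
    """Order two images of one bounding box by timestamp and compute their delta."""
    if len(values) != 2:
        raise ValueError("this sketch expects exactly two timestamps per key")
    # Sort by timestamp so the earlier image always comes first.
    (_, first), (_, second) = sorted(values, key=lambda v: v[0])
    # Assumed delta: absolute difference of corresponding pixels.
    delta = [
        [abs(b - a) for a, b in zip(row1, row2)]
        for row1, row2 in zip(first, second)
    ]
    # Three outputs, one per map layer.
    return {"timestamp_1": first, "timestamp_2": second, "delta": delta}
```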

Implementation 2:
Hadoop & Python Streams
Collin Bennett

Preprocessing Step
• All images (in a batch to be processed) are
combined into a single file.
• Each line contains the image’s byte array
transformed to pixels (raw bytes don’t seem
to work well with the one-line-at-a-time
Hadoop streaming paradigm).
Record format (one line per image):
geolocation \t timestamp | tuple size ; image width ; image height ; comma-separated list of pixels
The fields shown in red on the slide are metadata needed to process the image in the
reducer.
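
One way to produce such a line is sketched below. It assumes the separator between geolocation and timestamp is a tab, which is the default key/value separator in Hadoop streaming, and it reads "tuple size" as the number of values per pixel (for example 3 for RGB). That reading, along with the function name `encode_record`, is an assumption rather than something the slides state.

```python
from typing import Sequence


def encode_record(
    geolocation: str,
    timestamp: str,
    tuple_size: int,
    width: int,
    height: int,
    pixels: Sequence[int],
) -> str:
    """Serialize one image as a single text line for Hadoop streaming."""
    # Assumed meaning of tuple size: values per pixel.
    if len(pixels) != tuple_size * width * height:
        raise ValueError("pixel count does not match tuple size and dimensions")
    metadata = ";".join(str(v) for v in (tuple_size, width, height))
    pixel_list = ",".join(str(p) for p in pixels)
    # Key (geolocation), tab, then the value: timestamp | metadata ; pixels.
    return f"{geolocation}\t{timestamp}|{metadata};{pixel_list}"
```

Writing one such line per image into a single file gives the combined batch file described above.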

Map and Shuffle
• We can use the identity mapper
• All of the work for mapping was
done in the pre-process step
• Map / Shuffle key is the geolocation
• In the reducer, the timestamp is the
first field of each record's value when
splitting on ‘|’
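
The reducer side of this scheme might look like the sketch below. It assumes the tab-separated key and the `|` and `;` delimiters of the record format from the preprocessing step, and it relies on Hadoop streaming delivering lines to the reducer sorted by key. The function names `parse_line` and `group_by_geolocation` are hypothetical.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, int, int, int, List[int]]


def parse_line(line: str) -> Record:
    """Split one streaming line into key, timestamp, metadata, and pixels."""
    key, value = line.rstrip("\n").split("\t", 1)
    # The timestamp is the first field of the value when splitting on '|'.
    timestamp, payload = value.split("|", 1)
    tuple_size, width, height, pixel_list = payload.split(";", 3)
    pixels = [int(p) for p in pixel_list.split(",")]
    return key, timestamp, int(tuple_size), int(width), int(height), pixels


def group_by_geolocation(
    lines: Iterable[str],
) -> Iterator[Tuple[str, List[Tuple[str, List[int]]]]]:
    """Yield (geolocation, [(timestamp, pixels), ...]) with timestamps in order."""
    records = (parse_line(line) for line in lines if line.strip())
    # groupby only merges consecutive records, so input must be sorted by key.
    for key, group in groupby(records, key=lambda r: r[0]):
        yield key, sorted((r[1], r[5]) for r in group)
```

In a streaming job the `lines` argument would be `sys.stdin`. With the identity mapper, no parsing is needed before the shuffle.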

Implementation 3:
Sector/Sphere
Yunhong Gu

Sector Distributed File System
• Sector aggregates hard disk storage across
commodity computers
– With single namespace, file system level reliability
(using replication), high availability
• Sector does not split files
– A single image will not be split, therefore when it
is being processed, the application does not need
to read the data from other nodes via network
– A directory can be kept together on a single node
as well, as an option

Sphere UDF
• Sphere allows a User Defined Function to be
applied to each file (whether it holds a single image
or multiple images)
• Existing applications can be wrapped up in a
Sphere UDF
• In many situations, the Sphere streaming utility
accepts a data directory and an application
binary as inputs
• ./stream -i haiti -c ossim_foo -o results

For More Information
info@opencloudconsortium.org
