Project Matsu: Elastic Clouds for Disaster Relief


This is a talk I gave at OGF 29 in Chicago on June 21, 2010.



  1. Project Matsu: Large Scale On-Demand Image Processing for Disaster Relief
     Collin Bennett, Robert Grossman, Yunhong Gu, and Andrew Levine
     Open Cloud Consortium
     June 21, 2010
  2. Project Matsu Goals
     Provide persistent data resources and elastic computing to assist in disasters:
     • Make imagery available for disaster relief workers
     • Elastic computing for large-scale image processing
     • Change detection for temporally different, geospatially identical image sets
     • Provide a resource for standards and interoperability studies of large data clouds
  3. Part 1: Open Cloud Consortium
  4. 501(c)(3) not-for-profit corporation
     • Supports the development of standards, interoperability frameworks, and reference implementations
     • Manages testbeds: the Open Cloud Testbed and the Intercloud Testbed
     • Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud
     • Develops benchmarks
  5. OCC Members
     • Companies: Aerospace, Booz Allen Hamilton, Cisco, InfoBlox, Open Data Group, Raytheon, Yahoo
     • Universities: CalIT2, Johns Hopkins, Northwestern Univ., University of Illinois at Chicago, University of Chicago
     • Government agencies: NASA
     • Open source projects: Sector Project
  6. Operates Clouds
     • 500 nodes, 3,000 cores, 1.5+ PB
     • Four data centers, 10 Gbps
     • Target: refresh 1/3 of the hardware each year
     Clouds operated:
     • Open Cloud Testbed
     • Open Science Data Cloud
     • Intercloud Testbed
     • Project Matsu: Cloud-based Disaster Relief Services
  7. Open Science Data Cloud
     • Astronomical data
     • Biological data (Bionimbus)
     • Networking data
     • Image processing for disaster relief
  10. Focus of OCC Large Data Cloud Working Group
      [Stack diagram: applications plug in at every layer; table-based and relational-like data services sit on cloud compute services (MapReduce, UDFs, and other programming frameworks), which sit on cloud storage services.]
      Developing APIs for this framework.
  11. Tools and Standards
      • Apache Hadoop/MapReduce
      • Sector/Sphere large data cloud
      • Open Geospatial Consortium Web Map Service (WMS)
      • OCC tools are open source (matsu-project)
  12. Part 2: Technical Approach
      • Hadoop (lead: Andrew Levine)
      • Hadoop with Python streams (lead: Collin Bennett)
      • Sector/Sphere (lead: Yunhong Gu)
  13. Implementation 1: Hadoop & MapReduce (Andrew Levine)
  14. Image Processing in the Cloud: Mapper
      Step 1 (input): key = bounding box (e.g. minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5); value = the source image.
      Step 2 (processing): the mapper resizes and/or cuts the original image into pieces.
      Step 3 (output): one record per piece, with key = the piece's bounding box and value = the image piece plus a timestamp.
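The mapper's tiling step can be sketched in Python (a hypothetical simplification: the actual implementation is Hadoop/Java, and the 2x2 grid, key encoding, and the `crop` helper are my assumptions, not from the slides):

```python
def split_bbox(minx, miny, maxx, maxy, n=2):
    """Split a bounding box into an n x n grid of sub-boxes."""
    dx = (maxx - minx) / n
    dy = (maxy - miny) / n
    return [(minx + i * dx, miny + j * dy, minx + (i + 1) * dx, miny + (j + 1) * dy)
            for i in range(n) for j in range(n)]

def map_image(bbox, image, timestamp, crop):
    """Emit one (sub-bbox, (tile, timestamp)) record per image piece.
    `crop(image, index)` is a hypothetical helper that cuts out the
    region matching the index-th sub-box."""
    for idx, sub in enumerate(split_bbox(*bbox)):
        yield sub, (crop(image, idx), timestamp)
```

Keying each tile by its sub-bounding-box is what lets the shuffle phase later bring together tiles of the same region taken at different times.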
  15. Image Processing in the Cloud: Reducer
      Step 1 (input): key = bounding box (e.g. minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375); values = the image pieces for that box, each with a timestamp.
      Step 2 (processing): assemble the images by timestamp and compare; the result is a delta of the two images.
      Step 3 (output): the timestamp 1 set, the timestamp 2 set, and the delta set, each sent to a different map layer for display in WMS.
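The per-bounding-box delta can be sketched as follows (hypothetical: the pixel representation and the absolute-difference comparison are my stand-ins for whatever change-detection metric Matsu actually uses):

```python
def reduce_tiles(bbox, values):
    """values: iterable of (timestamp, pixels) records for one bounding box.
    Group tiles by timestamp, then diff the two epochs pixel by pixel."""
    epochs = {}
    for ts, pixels in values:
        epochs.setdefault(ts, []).extend(pixels)
    # Take the two earliest timestamps for this region.
    (t1, a), (t2, b) = sorted(epochs.items())[:2]
    delta = [abs(x - y) for x, y in zip(a, b)]
    return {t1: a, t2: b, "delta": delta}
```

The three entries of the returned dict correspond to the three output sets on the slide: the two timestamped image sets and the delta set.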
  16. Implementation 2: Hadoop & Python Streams (Collin Bennett)
  17. Preprocessing Step
      • All images (in a batch to be processed) are combined into a single file.
      • Each line contains the image's byte array transformed to pixels (raw bytes don't seem to work well with the one-line-at-a-time Hadoop streaming paradigm).
      Record format:
      geolocation timestamp | tuple size ; image width ; image height ; comma-separated list of pixels
      The tuple size, image width, and image height fields are metadata needed to process the image in the reducer.
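The record format above can be illustrated with a small encoder and matching parser (a sketch under my own assumptions: the function names, the RGB 3-tuples, and the exact whitespace around delimiters are not from the slides; only the field order and separators are):

```python
def encode_record(geolocation, timestamp, width, height, pixels):
    """Flatten one image into a single line:
    geolocation timestamp | tuple size ; width ; height ; pixel values."""
    tuple_size = len(pixels[0])  # e.g. 3 for an RGB image
    flat = ",".join(str(c) for px in pixels for c in px)
    return f"{geolocation} {timestamp} | {tuple_size} ; {width} ; {height} ; {flat}"

def decode_record(line):
    """Reducer-side parse: split on '|'; the timestamp is the first
    field after the geolocation in the header."""
    header, payload = line.split("|", 1)
    geolocation, timestamp = header.split()
    tuple_size, width, height, flat = (p.strip() for p in payload.split(";", 3))
    values = [int(v) for v in flat.split(",")]
    return geolocation, timestamp, int(tuple_size), int(width), int(height), values
```

Note the geolocation key must contain no whitespace (e.g. `lat,lon`) for the header split to work, which also makes it usable directly as the Hadoop streaming key.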
  19. Map and Shuffle
      • We can use the identity mapper; all of the work for mapping was done in the preprocessing step.
      • The map/shuffle key is the geolocation.
      • In the reducer, the timestamp will be the 1st field of each record when splitting on '|'.
  22. Implementation 3: Sector/Sphere (Yunhong Gu)
  23. Sector Distributed File System
      • Sector aggregates hard disk storage across commodity computers
      • Single namespace, file-system-level reliability (using replication), high availability
      • Sector does not split files: a single image is never split, so while it is being processed the application does not need to read data from other nodes over the network
      • As an option, a directory can also be kept together on a single node
  24. Sphere UDF
      • Sphere allows a user-defined function to be applied to each file (whether it holds a single image or multiple images)
      • Existing applications can be wrapped up in a Sphere UDF
      • In many situations, the Sphere streaming utility accepts a data directory and an application binary as inputs:
      ./stream -i haiti -c ossim_foo -o results
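The "wrap an existing application" idea can be illustrated generically (plain Python rather than the actual Sphere C++ UDF API; the point is the one-invocation-per-file pattern, with the binary being whatever `-c` names, such as the slide's `ossim_foo`):

```python
import pathlib
import subprocess

def apply_per_file(input_dir, binary, output_dir):
    """Mimic Sphere's model: run an existing binary once per input file,
    writing each result under the output directory. Since Sector keeps
    whole files (and optionally directories) on one node, each such run
    can be purely local."""
    out = pathlib.Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for f in sorted(pathlib.Path(input_dir).iterdir()):
        result = subprocess.run([binary, str(f)],
                                capture_output=True, text=True, check=True)
        (out / f.name).write_text(result.stdout)
```
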
  25. For More Information