Your SlideShare is downloading. ×
  • Like
Experiences on Processing Spatial Data with MapReduce ssdbm09
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.


Now you can save presentations on your phone or tablet

Available for both IPhone and Android

Text the download link to your phone

Standard text messaging rates apply

Experiences on Processing Spatial Data with MapReduce ssdbm09


Experiences on Processing Spatial Data with MapReduce

Experiences on Processing Spatial Data with MapReduce

Published in Technology , Education
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads


Total Views
On SlideShare
From Embeds
Number of Embeds



Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

    No notes for slide


  • 1. Experiences on Processing Spatial Data with MapReduce Ariel Cary, Zhengguo Sun, Vagelis Hristidis, Naphtali Rishe Florida International University School of Computing and Information Sciences 11200 SW 8th St, Miami, FL 33199 {acary001,sunz,vagelis,rishen} Sponsored by: NSF Cluster Exploratory (CluE)
  • 2. Agenda 1. Introduction 2. Solving Spatial Problems in MapReduce • R-Tree Index Construction • Aerial Image Processing 3. Experimental Results 4. Related Work 5. Conclusion2 Florida International University
  • 3. Introduction  Spatial databases mainly store:  Raster data (satellite/aerial digital images), and  Vector data (points, lines, polygons).  Traditional sequential computing models may take excessive time to process large and complex spatial repositories.  Emerging parallel computing models, such as MapReduce, provide a potential for scaling data processing in spatial applications.3 Florida International University
  • 4. Introduction (cont.)  MapReduce is an emerging massively parallel computing model (Google) composed of two functions:  Map: takes a key/value pair, executes some computation, and emits a set of intermediate key/value pairs as output.  Reduce: merges its intermediate values, executes some computation on them, and emits the final output.  In this work, we present our experiences in applying the MapReduce model to:  Bulk-construct R-Trees (vector) and4 Florida International University  Compute aerial image quality (raster)
  • 5. Introduction (cont.)  Apaches Hadoop Proxy  Linux operating Cloud system  XEN hypervisor  Hadoop Distributed 1 Internet File System (HDFS) bin/hadoop dfs put  SOCKS proxy server 2 bin/hadoop jar  480 computers 3 bin/hadoop dfs get (nodes), each half terabytes storage5 Florida International University
  • 6. 2. Solving Spatial Problems in MapReduce • R-Tree Index Construction • Aerial Image Processing
  • 7. MapReduce (MR) R-Tree Construction  R-Tree Bulk-Construction  Every object o in database D has two attributes:  - the object’s unique identifier.  o.P - the object’s location in some spatial domain.  The goal is to build an R-Tree index on D.  MapReduce Algorithm 1. Database partitioning function computation (MR). 2. A small R-Tree is created for each partition (MR). 3. The small R-Trees are merged into the final R- Tree.7 Florida International University
  • 8. Phase 1 – Partitioning Function – Goal: compute f to assign objects of D into one of RMap and Reduce inputs/outputss.t.: possible partitions in computing partitioning function f. –Function Input: (Key, Value) Output: (Key,are R (ideally) equally-sized partitions Value) generated (minimal variance(C, U(o.P)) Map (, o.P) is acceptable). Reduce (C, list(ui, i=1, .., L)) S´ – Objects close in the spatial domain are placed Where: within the same partition. • o is an solution: – Proposed spatial object in the database. • C which is a constant that helps in sending Mappers – Use Z-order space-filling curve to map spatial outputs to a single Reducer. • coordinate samples (3%) into an value. U is a space-filling curve, e.g. Z-order sorted • sequence. containing R-1 splitting points. S is an array8 – Collect splitting points that partition the Florida International University sequence in R ranges.
  • 9. Phase 2 - R-Tree Construction in MR  Mappers compute f() values for objects.  Reducers compute an R-Tree for each group of objects with identical f() value MapReduce functions in constructing R-Trees. Function Input: (Key, Value) Output: (Key, Value) Map (, o.P) (f(o.P), o) Reduce (f(o.P), list(oi, i=1, .., A)) tree.root Where: • o is an spatial object in the database. • f is the partitioning function computed in phase 1. • Tree.root is the R-Tree root node.9 Florida International University
  • 10. Phase 3 - R-Tree Consolidation  sequential process10 Florida International University
  • 11. Image Processing in MapReduce  Aerial Image Quality Computation  Let d be a orthorectified aerial photography (DOQQ) file and t be a tile inside d, is d’s file name and t.q is the quality information of tile t.  The goal is to compute a quality bitmap for d.  MapReduce Algorithm  A customized InputFormatter partitions each DOQQ file d into several splits containing multiple tiles.  The Mappers compute the quality bitmap for each tile inside a split.  The Reducers merge all the bitmaps that belongs to a file d and write them to an output file.11 Florida International University
  • 12. Image Processing in MapReduce  MapReduce Algorithm Input and output of map and reduce functions Function Input: (Key, Value) Output: (Key, Value) Map (, t) (, (,t.q)) Reduce (, list(,t.q)) Quality-bitmap of d Where: • d is a DOQQ file. • t is a tile in d. • t.q is the quality bitmap of t.12 Florida International University
  • 13. 3. Experimental Results
  • 14. Experimental Results: Setting  Data Set Table 4. Spatial data sets used in experiments*. Problem Data Data size Objects Description set (GB) R-Tree FLD 11.4 M 5 Points of properties in the state of Florida. Yellow pages directory of points of businesses mostly YPD 37 M 5.3 in the United States but also in other countries. Image Miami- Aerial imagery of Miami-Dade county, FL (3-inch 482 files 52 Quality Dade resolution) * Data sets supplied by the High Performance Database Research Center at Florida International University  Environment  The cluster was provided by the Google and IBM Academic Cluster Computing Initiative.  The cluster contains around 480 computers running Hadoop - open source MapReduce.14 Florida International University
  • 15. Experimental Results: R-Tree  R-Tree Construction Performance Metrics 30.00 60.00 25.00 50.00 MR2 MR2 20.00 40.00 Time (min) Time (min) MR1 MR1 15.00 30.00 10.00 20.00 5.00 10.00 0.00 0.00 2 4 8 16 32 64 4 8 16 32 64 Reducers Reducers (a) FLD data set (b) YPD data set MapReduce job completion times for various number of reducers in phase-2 (MR2).15 Florida International University
  • 16. Experimental Results: R-Tree  MapReduce R-Trees vs. Single Process (SP) Objects per Reducer Consolidated R-Tree Data set R Average Stdev Nodes Height FLD 2 5,690,419 12,183 172,776 4 4 2,845,210 6,347 172,624 4 8 1,422,605 2,235 173,141 4 16 711,379 2,533 162,518 4 32 355,651 2,379 173,273 3 64 177,826 1,816 173,445 3 SP 11,382,185 0 172,681 4 YPD 4 9,257,188 22,137 568,854 4 8 4,628,594 9,413 568,716 4 16 2,314,297 7,634 568,232 4 32 1,157,149 6,043 567,550 4 64 578,574 2,982 566,199 4 SP 37,034,126 0 587,353 516 Florida International University
  • 17. Experimental Results: Imagery  Tile Quality Computation 25 4 Reduce 3.5 Reduce 20 Map Map 3 Time (min) Time (min) 15 2.5 2 10 1.5 5 1 0.5 0 0 4 8 16 32 64 128 256 512 2 4 8 16 Reducers Size of data (GB) (a) Fixed data size, variable Reducers (b) Variable data size, fixed Reducers Fig. 9. MapReduce job completion time for tile quality computation17 Florida International University
  • 18. 4. Related Work
  • 19. Related Work  Previous works on R-Tree parallel construction faced intrinsic distributed computing problems: load balancing, process scheduling, fault tolerance, etc.  Schnitzer and Leutenegger [16] proposed a Master- Client R-Tree, where the data set is first partitioned using Hilbert packing sort algorithm, then the partitions are declustered into a number of processors, where individual trees are built. At the end, a master process combines the individual trees into the final R-Tree.  Papadopoulos and Manolopoulos [17] proposed a methodology for sampling-based space partitionining, load balancing, and partition assignment into a set of Florida International University19 processors in parallely building R-Trees.
  • 20. 5. Conclusion
  • 21. Conclusion  We used the MapReduce model to solve two spatial problems on a Google&IBM cluster:  (a) Bulk-construction of R-Trees and  (b) Aerial image quality computation  MapReduce can dramatically improve task completion times. Our experiments show close to linear scalability.  Our experience in this work shows MapReduce has the potential to be applicable to more complex spatial problems.21 Florida International University
  • 22. References [1] Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD 1984:47-57. [2] NSF Cluster Exploratory Program: [3] Google&IBM Academic Cluster Computing Initiative: [4] Apache Hadoop project: [6] Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation, USENIX Association, Volume 6, pp. 10-10, December 2004. [12] High Performance Database Research Center (HPDRC), Research Division of the Florida International University, School of Computing and Information Sciences, University Park, Telephone: (305) 348-1706, FIU ECS-243, Miami, FL 33199. [16] Schnitzer B., Leutenegger S.T.: Master-client R-trees: a new parallel R- tree architecture, In Proceedings of the 11th International Conference on Scientific and Statistical Database Management, pp. 68-77, August 1999. [17] Apostolos Papadopoulos, Yannis Manolopoulos: Parallel bulk-loading of spatial data, Parallel Computing, Volume 29, Issue 10, pp. 1419 - 1444, October 2003.22 Florida International University