Monsanto built a geospatial platform on Hadoop and HBase capable of managing over 120 billion polygons. As a result of the extreme data volumes and compute complexities we were forced to migrate our data processing from a more traditional RDBMS to a scale out Hadoop implementation. Data processing that took over 30 days on 8% of the data now runs in under 12 hours on the entire data set. Very little concrete material exist for how you process spatial data via MapReduce or model it in HBase. We will provide concrete and novel examples for processing and storing spatial data on Hadoop and HBase. As part of the data processing pipeline we integrated the popular open source geospatial processing library GDAL with MapReduce to convert all geospatial datasets to a common format and projection. We developed a method for splitting and processing images via MapReduce in which the boundaries of splits needed to be shared by multiple tasks due to the nature of the computation being performed on the data. Bulk writes to HBase were performed by writing HFiles directly. Finally we developed a novel method for storing geospatial data in HBase that met the needs of our access pattern.
Clipping is a handy way to collect important slides you want to go back to later.