Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to help farmers increase their yield

Monsanto built a geospatial platform on Hadoop and HBase capable of managing over 120 billion polygons. Because of the extreme data volumes and compute complexity, we were forced to migrate our data processing from a more traditional RDBMS to a scale-out Hadoop implementation. Data processing that took over 30 days on 8% of the data now runs in under 12 hours on the entire data set. Very little concrete material exists on how to process spatial data via MapReduce or model it in HBase. We will provide concrete and novel examples for processing and storing spatial data on Hadoop and HBase. As part of the data processing pipeline, we integrated the popular open source geospatial processing library GDAL with MapReduce to convert all geospatial datasets to a common format and projection. We developed a method for splitting and processing images via MapReduce in which the boundaries of splits needed to be shared by multiple tasks because of the nature of the computation being performed on the data. Bulk writes to HBase were performed by writing HFiles directly. Finally, we developed a novel method for storing geospatial data in HBase that met the needs of our access pattern.

  • http://psipunk.com/page/18/ With big agricultural farms getting smaller due to a fast-growing population, we need compact and efficient farming tools that balance structured agriculture with nature to ensure a healthy ecosystem around us. Offering a solution, the “Agria” by Julia Kaisinger, Katharina Unger and Stefan Riegbauer is an autonomous farm robot for sowing and plant protection on small farms. Featuring infrared and UV light to control bugs, fungi and pests, the modular machine examines the soil and plants regularly to allow specific treatment. Placing seeds and fertilizer in the right place and proportion, the Agria works with an intelligent network of fields and machines, supplied by a local station, which can be controlled through a computer or smartphone, so you can store and share data with experts for better analysis.
  • Agriculture is going through a transition via the adoption of breakthrough technologies in seed genetics, farm equipment hardware and software, and farm practices, akin to the advances in computer technology that ushered in the modern information technology era. Growers are increasingly swamped by information, much of it needing further thoughtful analysis before actionable information can be extracted and integrated. Monsanto is gearing up to do that. Anyone interested in developing improved agronomic practices or information apps that contribute to increasing yield or improving life on the farm should get in touch with us (leave contact information at the Monsanto booth).
  • General data flow
  • Split and task sizes were a challenge because of the number of files to be processed and the metadata needed to process each task. Data was generated for the United States only, so only 15% of all super cells covering the world were used. The table was pre-split so that the generated HFiles were evenly sized.
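The pre-splitting mentioned in the last note can be illustrated with a minimal sketch. It assumes row keys are encoded as 8-byte big-endian longs and that keys are spread roughly uniformly over the loaded ID range; the slides only say the key is a long, so the encoding, the `split_keys` helper and the example ID range are all hypothetical.

```python
import struct

def split_keys(min_id: int, max_id: int, num_regions: int) -> list[bytes]:
    """Return num_regions - 1 split points cutting [min_id, max_id] into equal ranges.

    Assumption: non-negative IDs packed big-endian, so byte order matches numeric order.
    """
    if num_regions < 2:
        return []
    span = max_id - min_id + 1
    points = [min_id + (span * i) // num_regions for i in range(1, num_regions)]
    return [struct.pack(">q", p) for p in points]

# Example: 1,000 hypothetical super cell IDs spread over 4 regions.
keys = split_keys(0, 999, 4)
print([struct.unpack(">q", k)[0] for k in keys])  # [250, 500, 750]
```

With evenly spaced split points, each region (and so each HFile written by the bulk load) receives a similar share of rows, provided the IDs actually present are evenly distributed.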

Presentation Transcript

  • 1. Monsanto Company Confidential - Attorney Client Privilege
    Geospatial Processing @ Monsanto
    Hadoop Summit 2013
    Robert Grailer, Big Data Engineer
    Erich Hochmuth, Data & Analytics Architecture Lead
  • 2. Our Vision: Sustainable Agriculture
    A Strong Vision That Guides All We Do
    • Producing More: We are committed to increasing yields to meet the growing demand for food, fiber & fuel
    • Conserving More: We are committed to reducing the amount of land, water and energy needed to grow our crops
    • Improving Lives: We are committed to improving lives around the world
  • 3. Doubling Yields by 2030 - Farming in the Future Will Be Increasingly Information-Driven
    • ADVANCED EQUIPMENT
    • AVERAGE CORN YIELD: 300 BU/AC
    • AUTOMATED WEATHER STATIONS
    • FIELD SENSORS PROVIDING INFORMATION
    • ADVANCED IMAGERY TECHNOLOGY
  • 4. Integrated Farming Systems – FieldScripts℠ for 2014
    • FieldScripts℠ will deliver, by field, a corn hybrid recommendation utilizing variable rate seeding by FieldScripts management zones to increase yield potential and reduce risk
    • The science of FieldScripts is based on proprietary algorithms that combine data from the FieldScripts Testing Network and Monsanto-generated hybrid response to plant population research
    [Figure: Planting Prescription 2012 (DKC63-84 Brand), labeled "Precision Planting". Target Rate (Count) in ksds/ac, with acreage per rate: 38.00 (24.75 ac), 37.00 (22.63 ac), 35.00 (16.60 ac), 34.00 (8.23 ac), 33.00 (6.00 ac), 32.00 (2.82 ac)]
  • 5. 2012 Field Trials Indicate 5-10 bu/a Average Yield Gain
    • IL Irrigated, Back 80 (Treatment: Yield in bu/ac)
      – Static|34000: 196
      – FieldScripts (35000): 233
    • Central IL Dry Land, 47-50 (Treatment: Yield in bu/ac)
      – Static|34000: 139
      – FieldScripts (33000): 145
    • MS Irrigated, 21 (Treatment: Yield in bu/ac)
      – Static|34000: 166
      – FieldScripts (34700): 181
    In the United States Alone:
    • Corn acres planted in 2013: 96M
    • Price of Corn per bushel: $6.93*
    • Advantage of 5–10 Bu/Ac
    *Price reflects CBOT price of corn 1/9/2013
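The slide lists acres, price and yield gain without multiplying them out. The sketch below does that arithmetic under a strong assumption that is not stated on the slide: that the gain applies to every planted acre (`adoption = 1.0`). The `added_value` helper is purely illustrative.

```python
ACRES_PLANTED = 96_000_000   # corn acres planted in 2013 (from the slide)
PRICE_PER_BUSHEL = 6.93      # USD, CBOT price on 1/9/2013 (from the slide)

def added_value(gain_bu_per_acre: float, adoption: float = 1.0) -> float:
    """Extra crop value in USD if `adoption` share of acres gains `gain_bu_per_acre`."""
    return ACRES_PLANTED * adoption * gain_bu_per_acre * PRICE_PER_BUSHEL

for gain in (5, 10):
    print(f"{gain} bu/ac -> ${added_value(gain) / 1e9:.2f}B")
# 5 bu/ac -> $3.33B
# 10 bu/ac -> $6.65B
```

Lowering `adoption` scales the result linearly, so a 10% adoption rate would give one tenth of these figures.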
  • 6. Integrated Farming Systems℠ Combine Advanced Seed Genetics, On-farm Agronomic Practices, Software and Hardware Innovations to Drive Yield
    • DATABASE BACKBONE: Expansive product by environment testing makes on-farm prescriptions possible
    • VARIABLE-RATE FERTILITY: Variable rate N, P & K “Apps” aligned with yield management zones
    • PRECISION SEEDING: Planter hardware systems enabling variable rate seeding & row spacing of multiple hybrids in a field by yield management zone
    • FERTILITY & DISEASE MANAGEMENT: “Apps” for in-season custom application of supplemental late nitrogen and fungicides
    • YIELD MONITOR: Advances in Yield Monitoring to deliver higher resolution data
    • BREEDING: Significant increases in data points collected per year to increase annual rate of genetic gain
  • 7. Use Case
    • Load thousands of files containing spatial data
    • Support diverse range of data types: tabular, vector, raster
    • Join & link data spatially
    • Generate dense grid covering entire US: 120 billion polygons
    • Generate a set of derived attributes (think moving average)
    • Make data available for other data products such as FieldScripts
    [Diagram: High Level Data Flow. Public Data, Monsanto Data, Grower Data -> Standardize & Link -> Algorithms]
  • 8. Version 1 Architecture
    • In RDBMS spatial
    • PL/SQL
    • Multiple patches to DB Engine
    • Just 8% of the data!!
      – 35+ days to process
    • TBs in indexes
    • Tradeoffs
      – Compressed vs. Uncompressed
      – Performance vs. Storage
      – Read vs. Write performance
    • Options/recommendations
      – Limit use of in DB spatial functionality
      – Buy more RDBMS
    [Chart: Data Processing Time (days, axis 0 to 30): Soil, Elevation, Spatial Index Processing]
    [Chart: Data Volumes (TBs, axis 0 to 100): Raw Data, Uncompressed, Compressed, Spatial Index]
  • 9. Version 2 Architecture
    • Combination of MapReduce & HBase
    • Leverage existing Hadoop cluster
    • MapReduce
      – Parallelize everything!
      – Bulk HBase loads
    • HBase
      – Spatial data model
      – Custom spatial engine
  • 10. Data Ingestion
    • Bulk load 1,000s of files into HDFS
    • Standardize data
      – Common usable format
    • Storage vs. Compute
    • Raster format is easily splittable
    • Hadoop Streaming integrated with GDAL
    • Streaming API Lessons Learned
      – Lack of documentation
      – Counters to track task progress
      – Jobs run as mapred user
      – HDFS Access outside of MR
    [Diagram: NFS (Raster Images, Vector Shape Files, Zip Files, Text Data) -> Hadoop Streaming (Unzip, Convert to Raster, Re-project) -> HDFS (Raster Files)]
    [Chart: Results. Data Ingestion Time (hours, axis 0 to 60): RDBMS vs. Hadoop]
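The combination of Hadoop Streaming, GDAL and counters can be sketched as a streaming mapper. This is not the presenters' code: it assumes each input line is a path the task can read locally (the slide notes that HDFS access outside of MapReduce was itself a lesson learned), that `gdalwarp` is installed on the task nodes, and that EPSG:4326 is the target projection. The counter group and names are made up. The counter line format written to stderr is the one Hadoop Streaming recognizes.

```python
import subprocess
import sys

def counter_line(group: str, name: str, amount: int = 1) -> str:
    """Format a Hadoop Streaming counter update; it must be written to stderr."""
    return f"reporter:counter:{group},{name},{amount}"

def gdalwarp_command(src: str, dst: str, target_srs: str = "EPSG:4326") -> list[str]:
    """Build a gdalwarp call that re-projects src into a GeoTIFF at dst."""
    return ["gdalwarp", "-t_srs", target_srs, "-of", "GTiff", src, dst]

def run_mapper(lines, out=sys.stdout, err=sys.stderr, runner=subprocess.run):
    """Re-project one raster per input line and emit 'src<TAB>dst<TAB>status'."""
    for line in lines:
        src = line.strip()
        if not src:
            continue
        dst = src + ".warped.tif"  # hypothetical output naming
        result = runner(gdalwarp_command(src, dst))
        status = "converted" if result.returncode == 0 else "failed"
        print(counter_line("ingest", status), file=err)
        print(f"{src}\t{dst}\t{status}", file=out)

# In a real mapper script the entry point would be: run_mapper(sys.stdin)
```

Passing the `runner` as a parameter keeps the GDAL call replaceable, which makes the mapper testable without GDAL installed.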
  • 11. Data Processing
    • Process raster data
      – Dense matrix
    • Generic InputFormat & RecordReader for raster data
    • HFiles easily transportable between clusters
    • Challenges tuning Jobs
      – IO Sort Factor
      – Split/Task Size
    [Diagram: HDFS (Raster Files) -> Generate Derived Attributes -> Generate HFiles -> HBase (Pre-split table)]
    [Chart: Results. Data Processing Time (days, axis 0 to 30): RDBMS vs. Hadoop]
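The abstract says split boundaries had to be shared by multiple tasks, and the use case names a moving average as a typical derived attribute. The slides do not say how boundaries were shared, so the following is only one common approach, sketched in one dimension: each split reads extra "halo" rows on both sides but emits results only for the rows it owns. All function names are hypothetical.

```python
def overlapping_row_splits(n_rows, rows_per_split, halo):
    """Yield (read_start, read_end, own_start, own_end), all end-exclusive."""
    for own_start in range(0, n_rows, rows_per_split):
        own_end = min(own_start + rows_per_split, n_rows)
        yield (max(own_start - halo, 0), min(own_end + halo, n_rows), own_start, own_end)

def moving_average(values, radius):
    """Centered moving average; windows are truncated at the ends of `values`."""
    out = []
    for i in range(len(values)):
        window = values[max(i - radius, 0): i + radius + 1]
        out.append(sum(window) / len(window))
    return out

def process_split(values, split, radius):
    """Compute the moving average for the rows one split owns, using its halo rows."""
    read_start, read_end, own_start, own_end = split
    local = moving_average(values[read_start:read_end], radius)
    return local[own_start - read_start: own_end - read_start]

values = [float(v) for v in range(10)]
merged = []
for split in overlapping_row_splits(len(values), rows_per_split=4, halo=1):
    merged.extend(process_split(values, split, radius=1))
print(merged == moving_average(values, 1))  # True
```

As long as the halo is at least as large as the window radius, the per-split results match a single pass over the whole raster, at the cost of reading the boundary rows twice.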
  • 12. HBASE SCHEMA DESIGN
  • 13. Geospatial in HBase
    • Need
      – Dense data set
      – Complex computations
      – Scalable & cost efficient
      – Bulk analytics & random reads
    • HBase
      – GeoHash most notable example
        • Best suited for sparse data
      – Precision of reads
      – Alphanumeric key
    • HBase Considerations
      – Key overhead
      – Scan vs. Get performance
      – Reduce reading unnecessary data
    [Figures: Example Field; Complex Data Interactions]
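For readers unfamiliar with the GeoHash keys this slide contrasts against, here is a minimal sketch of the standard public geohash encoding. It is background only and is not part of the schema described in the talk. It shows why the key is alphanumeric and why read precision is tied to key length: each extra character narrows the bounding box by five more bits.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a point by alternately bisecting longitude and latitude ranges."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, n_bits, use_lon = [], 0, 0, True
    while len(chars) < precision:
        rng, value = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid   # keep the lower half
        use_lon = not use_lon
        n_bits += 1
        if n_bits == 5:    # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, n_bits = 0, 0
    return "".join(chars)

print(geohash(57.64911, 10.40744, precision=2))  # u4
```

Nearby points share a key prefix, which suits prefix scans over sparse points but stores a multi-character string key for every cell of a dense grid.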
  • 14. Global Coordinate System
    [Figure: world map with axes Longitude (-180 to 180) and Latitude (-90 to 90)]
  • 15. Reference System
    [Figure: grid overlaid on the world map; axes Longitude (-180 to 180) and Latitude (-90 to 90)]
  • 16. Reference System Continued
    [Figure: grid cells numbered sequentially by row: 1, 2, 3, 4, ... 19, 20 in the first row, 21, 22, 23, ... in the second, through 381, 382, ... 399, 400 in the last; cell 190 is also labeled. Axes Longitude (-180 to 180) and Latitude (-90 to 90)]
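The numbering in the reference system figure can be sketched as a row-major cell ID computed from longitude and latitude. The 20 x 20 grid matches the labels in the figure (1 through 400, 20 per row), but the figure does not show where numbering starts, so the top-left origin at longitude -180, latitude 90 is an assumption, as is the `cell_id` helper itself.

```python
def cell_id(lon: float, lat: float, n_cols: int = 20, n_rows: int = 20) -> int:
    """Return the 1-based, row-major number of the grid cell containing (lon, lat).

    Assumption: cell 1 is the top-left cell (lon=-180, lat=90).
    """
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("coordinate outside the global extent")
    cell_w = 360 / n_cols
    cell_h = 180 / n_rows
    # Clamp so points on the far right or bottom edge fall in the last cell.
    col = min(int((lon + 180) / cell_w), n_cols - 1)
    row = min(int((90 - lat) / cell_h), n_rows - 1)
    return row * n_cols + col + 1

print(cell_id(-180, 90), cell_id(180, 90), cell_id(-180, -90), cell_id(180, -90))
# 1 20 381 400
```

The production grid is far finer than 20 x 20 (120 billion cells for the U.S. alone), but the same arithmetic applies with larger `n_cols` and `n_rows`.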
  • 17. HBase Schema Take 1
    Spatial Table
    • Key: cell_id long
    • Column Family: A
      – Column: Data Holder
        • elevation
        • slope: float
        • aspect: float
    • Each spatial dataset is a separate table
    • All attributes for a layer that are read together are stored together
      – Attributes packed into a single column as an Avro object
    • 1 row per record
    • 120 billion rows total!
    • 1,000s of Get requests per field
    • TBs of key overhead: roughly 56% of the data
  • 18. Reference System Storage Format
    • Data grouped into 100 x 100 super cells
    • A super cell of 100 x 100 cells is a single row in HBase
    • At most 4 disk reads are required to read all data for one layer for a 150 acre field
    • Given a bounding box, the super cells and attributed grid cells containing the desired data can easily be computed
    • A generic geospatial data service, when given a set of layers, will read each layer in parallel
    • Overhead of key data reduced from 56% to below 0.1%
    [Figure: Super Grid Cells; Attributed Grid Cells]
    Spatial Table
    • Key: super_cell_id long
    • Column Family: A
      – Column: Data Holder
        • elevation: array float [ values ]
        • slope: array float [ values ]
        • aspect: array float [ values ]
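The bounding-box lookup described on this slide can be sketched as follows. The 100 x 100 super cell size is from the slide. Everything else is an assumption for illustration: 0-based cell row and column indices, row-major super cell IDs, and row-major offsets into each attribute array.

```python
SUPER = 100  # cells per super cell side (from the slide)

def super_cells_for_box(row_min, row_max, col_min, col_max, grid_cols):
    """Map an inclusive cell-index bounding box to {super_cell_id: [array offsets]}."""
    supers_per_row = -(-grid_cols // SUPER)  # ceiling division
    result = {}
    for row in range(row_min, row_max + 1):
        for col in range(col_min, col_max + 1):
            super_id = (row // SUPER) * supers_per_row + (col // SUPER)
            offset = (row % SUPER) * SUPER + (col % SUPER)
            result.setdefault(super_id, []).append(offset)
    return result

# A small box straddling a super cell corner touches exactly four rows in HBase.
hits = super_cells_for_box(98, 101, 198, 201, grid_cols=1000)
print(sorted(hits))  # [1, 2, 11, 12]
print(hits[12])      # [0, 1, 100, 101]
```

Any box no wider or taller than one super cell can straddle at most one vertical and one horizontal super cell boundary, which is where the bound of four reads per layer comes from. Whether a 150 acre field fits that condition depends on the cell size, which the slides do not give.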
  • 19. Results
    • Significant cost savings in required hardware
    • 120 billion unique polygons in total
    • 1.5 trillion data points
    • Dense grid of the entire U.S.
    • Foundational architecture for other spatial data sets
    • Fully unit tested implementation
    RDBMS
    • 4 states only
    • 30+ days to load
    • 8 months of dev.
    Hadoop
    • Entire U.S.
    • 18 hour load time
    • 3 months of dev.
    • 100% scalable
    • Cloud ready
    [Chart: Total Run Time. Total Data Processing Time (days, axis 0 to 30): RDBMS (8% of the data) vs. Hadoop (full data set)]
  • 20. Thank You