MapReduce Best Practices and Lessons Learned Applied to Enterprise Datasets - StampedeCon 2012


At StampedeCon 2012 in St. Louis, Erich Hochmuth of Monsanto presents: Hadoop is quickly becoming the preferred platform for analyzing large datasets. This talk explores opportunities for using MapReduce to process genomic data in an enterprise system, and discusses lessons learned introducing Hadoop into an existing enterprise, covering topics such as security, network architecture, and backups.


  1. Harvesting Big Data in Agriculture: Experiences with Hadoop. Erich Hochmuth, R&D IT Big Data & Analytics Lead
  2. Monsanto Serves Farmers Around the World: Working With Growers Large and Small, Row Crops and Vegetables
  3. Our Approach to Driving Yield: A System of Agriculture Working Together to Boost Productivity
     • Breeding: the art and science of combining genetic material to produce a new seed
     • Biotechnology: the science of improving plants by inserting genes into their DNA
     • Agronomics: the farm management practices involved in growing plants
  4. Increasing Yield through Big Data: At the Cornerstone of Yield Increases is Information & Analytics
     • Variety: raw sequence data, unstructured sensor data, relational yield data, poly-structured genomic data, spatial data, satellite imagery
     • Volume: PBs of NGS data, 10s of TBs of genomic data, TBs of yield data, billions of genotyping data points
     • Velocity: 10s of millions of yield data points/day, 100s of millions of genotyping data points/day, TBs of NGS data/week
  5. Why Hadoop?
     • Focus on solving the business problem, not building IT solutions
     • Commodity solution for the easy (data-parallel) stuff
     • Remove the hand-off between developers & strategic scientists
     • Cost to generate & store data continues to decrease
     • Eliminate the constant churn of scaling existing solutions
     • Cost-effective incremental platform expansion
  6. Hadoop as an ETL Platform: Scientific Instrumentation → Data Processing → Summarized Results
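The ETL pattern on slide 6 is often implemented with Hadoop Streaming, where the map and reduce steps are plain scripts reading records from stdin. A minimal Python sketch of that shape; the instrument-record layout (`plot_id,date,yield`) is hypothetical, not Monsanto's actual format:

```python
from itertools import groupby

def map_line(line):
    """Map one raw instrument record to a (key, value) pair.
    Hypothetical layout: plot_id,date,yield_bushels"""
    plot_id, _date, yield_val = line.strip().split(",")
    return plot_id, float(yield_val)

def reduce_group(key, values):
    """Summarize all values for one key: here, mean yield per plot."""
    vals = list(values)
    return key, sum(vals) / len(vals)

def run_reducer(pairs):
    """Group sorted (key, value) pairs and reduce each group, mimicking
    the shuffle/sort Hadoop performs between map and reduce."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield reduce_group(key, (v for _, v in group))

if __name__ == "__main__":
    # In a real Streaming job the mapper and reducer run as separate
    # processes; here we chain them locally for illustration.
    sample = ["A12,2012-05-01,160.0", "A12,2012-05-02,164.0", "B07,2012-05-01,151.0"]
    for key, mean in run_reducer(map_line(l) for l in sample):
        print(f"{key}\t{mean:.2f}")
```

On a cluster the same two functions would be wired up via `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`, with HDFS paths for input and summarized output.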
  7. Hadoop as a Queryable Archive: Historic Data → Long-term Storage & Discovery
  8. HBase: Real-time Access & OLAP
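A recurring design decision when using HBase for real-time access is row-key layout. A minimal sketch of a common salted, time-ordered key; the bucket count and fields are illustrative assumptions, not the schema from the talk:

```python
import hashlib

NUM_SALT_BUCKETS = 16  # spreads sequential writes across region servers

def row_key(sample_id: str, timestamp_ms: int) -> bytes:
    """Build an HBase row key: salt | sample id | reversed timestamp.
    The salt prefix avoids hot-spotting on monotonically increasing
    keys; the reversed timestamp makes the newest cell sort first,
    so a short scan returns the latest readings."""
    salt = int(hashlib.md5(sample_id.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    reversed_ts = 2**63 - 1 - timestamp_ms
    return b"%02d|%s|%019d" % (salt, sample_id.encode(), reversed_ts)
```

With this layout, real-time reads for one sample are a prefix scan on `salt|sample_id`, while OLAP-style jobs can still scan whole buckets in parallel.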
  9. Lessons Learned
  10. Technical Landscape
     • 3 clusters (Dev/Test, QA, & Prod)
     • 2 backup clusters
     • Combined HBase & MapReduce
     • Access via edge services
     • Resources partitioned by workflows: data & compute
  11. Hadoop Ecosystem @ Monsanto
     • Web portal (Hue) • Workflow (Oozie) • Scheduling (Fair Scheduler) • Data integration (Sqoop) • Real-time access (HBase) • Languages/compilers (Pig) • Serialization (Avro) • Coordination (ZooKeeper)
     • In use: Hadoop MR, HBase, Oozie, ZooKeeper, Sqoop, Quest Connector, Hue, Stargate/HBase REST, Fair Scheduler, Pig
     • Planned: Hive, RHadoop, YARN
     • Very interested in: HCatalog, Flume
  12. Hadoop Implementation/Deployment
     • It takes a team
     • Practice makes perfect
     • Fit into existing processes or standards when possible; deviate when necessary
     • Know your use case!
     • Capacity planning
     • Start small & build on success
  13. Hadoop Security
     • Research data is IP
     • Hadoop is the system of record for some data
     • Spent 6 weeks configuring Hadoop security: sought outside help; successful installation not consistently reproducible; support inconsistent across the ecosystem
     • Adopted a more traditional Hadoop security approach
     • HTTP edge services augmented with corporate single sign-on
     • Integrated into corporate LDAP
     • Revisit when Hadoop security becomes stable
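The edge-service approach on slide 13 amounts to an authorization check in front of the cluster: the SSO layer identifies the user, and access is granted per LDAP/AD group. A minimal sketch of that check; the group names and path-to-group mapping are hypothetical:

```python
# Hypothetical mapping of HDFS subject areas to owning AD groups.
PATH_GROUPS = {
    "/data/genomics": {"rd-genomics"},
    "/data/yield":    {"rd-yield", "rd-analytics"},
}

def authorized(user_groups: set[str], path: str) -> bool:
    """Allow the request only if the user's LDAP groups intersect the
    AD groups owning the longest matching subject-area prefix. The
    real edge service would obtain user_groups from the corporate
    SSO/LDAP integration rather than take them as a parameter."""
    for prefix in sorted(PATH_GROUPS, key=len, reverse=True):
        if path.startswith(prefix):
            return bool(user_groups & PATH_GROUPS[prefix])
    return False  # default deny: unmapped paths are not reachable
```

Keeping this logic at the HTTP edge lets the cluster itself run without Kerberos while still enforcing per-group access, which matches the "more traditional" approach the slide describes.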
  14. Backup & Restore
     • Doesn't Hadoop have built-in replication?
     • Requirements: back up HBase & HDFS; weekly full backups; daily incrementals; offsite data, retained for 60 days
     • Rolled our own: dedicated backup cluster; DistCp data to the backup cluster; copy data via Fuse-DFS to tape; manual restore & merge
     • Considering replicating to an offsite DR cluster: no more tape backups!
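The roll-your-own scheme on slide 14 is essentially scheduled DistCp runs plus a retention window. A minimal sketch of the driver logic; the NameNode addresses and backup paths are hypothetical:

```python
from datetime import date, timedelta

PROD_NN = "hdfs://prod-nn:8020"      # hypothetical production NameNode
BACKUP_NN = "hdfs://backup-nn:8020"  # hypothetical backup-cluster NameNode

def distcp_command(src_path: str, full: bool, run_date: date) -> list:
    """Assemble the DistCp invocation that copies a path to the
    dedicated backup cluster. '-update' makes the copy incremental:
    only files that differ from the target are transferred."""
    dest = f"{BACKUP_NN}/backups/{run_date:%Y%m%d}{src_path}"
    cmd = ["hadoop", "distcp"]
    if not full:
        cmd.append("-update")
    cmd += [f"{PROD_NN}{src_path}", dest]
    return cmd

def expired(backup_date: date, today: date, retention_days: int = 60) -> bool:
    """Retention check for offsite copies: anything older than the
    60-day window from the slide's requirements can be purged."""
    return (today - backup_date) > timedelta(days=retention_days)
```

From the backup cluster, the tape step would then read the copied files through the Fuse-DFS mount the slide mentions.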
  15. Data Management... or lack thereof!
     • Current approach: data grouped into subject areas; HDFS quotas; access controlled through AD groups; supplemented with governance & process
     • Needs: publish & share known schemas; common schema across the tool set; fine-grained authorization; monitoring/alerting of data access; data lineage tracking
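The HDFS quota control on slide 15 is applied with `dfsadmin -setSpaceQuota` per subject-area directory. A minimal sketch that emits those commands; the directory names and quota sizes are illustrative, not Monsanto's actual figures:

```python
# Hypothetical subject-area layout with illustrative space quotas.
SUBJECT_AREAS = {
    "/data/genomics": "50t",
    "/data/yield":    "10t",
    "/data/imagery":  "20t",
}

def quota_commands(areas: dict) -> list:
    """Emit one 'hdfs dfsadmin -setSpaceQuota' call per subject area,
    capping how much raw HDFS space each area may consume."""
    return [f"hdfs dfsadmin -setSpaceQuota {size} {path}"
            for path, size in sorted(areas.items())]
```

Quotas cap growth but say nothing about who reads what, which is why the slide pairs them with AD-group access control and still lists fine-grained authorization and lineage as open needs.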
  16. Conclusion
     • Enterprise ready?
     • Support? Open source community
     • Documentation: Missouri is "The Show Me State"
     • Evolving third-party support
     • Hadoop resources in the Midwest?
     • Know your use case!
  17. Thank you! We are hiring!