MapReduce Best Practices and Lessons Learned Applied to Enterprise Datasets - StampedeCon 2012

At StampedeCon 2012 in St. Louis, Erich Hochmuth of Monsanto presents: Hadoop is quickly becoming the preferred platform for performing analysis over large datasets. We will explore opportunities for using MapReduce to process genomic data in an enterprise system, discuss lessons learned introducing Hadoop into an existing enterprise, and cover topics such as security, network architecture, and backups.

1. Harvesting Big Data in Agriculture: Experiences with Hadoop
   Erich Hochmuth, R&D IT Big Data & Analytics Lead, erich.hochmuth@monsanto.com

2. Monsanto Serves Farmers Around the World
   Working with growers large and small, row crops and vegetables

3. Our Approach to Driving Yield: A System of Agriculture Working Together to Boost Productivity
   • Breeding: the art and science of combining genetic material to produce a new seed
   • Biotechnology: the science of improving plants by inserting genes into their DNA
   • Agronomics: the farm management practices involved in growing plants

4. Increasing Yield through Big Data: At the Cornerstone of Yield Increases Is Information & Analytics
   • Variety: raw sequence data, unstructured sensor data, relational yield data, poly-structured genomic data, spatial data, satellite imagery
   • Volume: PBs of NGS data, tens of TBs of genomic data, TBs of yield data, billions of genotyping data points
   • Velocity: tens of millions of yield data points/day, hundreds of millions of genotyping data points/day, TBs of NGS data/week

5. Why Hadoop?
   • Focus on solving the business problem, not building IT solutions
   • Commodity solution for the easy (data-parallel) stuff
   • Remove the hand-off between developers and strategic scientists
   • Cost to generate and store data continues to decrease
   • Eliminate the constant churn to scale existing solutions
   • Cost-effective incremental platform expansion

6. Hadoop as an ETL Platform
   Scientific instrumentation → data processing → summarized results

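The pattern on this slide is raw instrument output landing in HDFS, a MapReduce job doing the parsing and aggregation, and only summarized results moving downstream. A minimal sketch of a job with that shape, against the Hadoop 1.x (2012-era) MapReduce API; the tab-delimited record format, field positions, and class names are illustrative, not the actual Monsanto pipeline:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InstrumentSummaryJob {

  // Map: parse one raw instrument record and emit (sampleId, 1)
  public static class ParseMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text sampleId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 2) {
        return; // skip malformed records rather than failing the job
      }
      sampleId.set(fields[0]);
      context.write(sampleId, ONE);
    }
  }

  // Reduce: sum record counts per sample to produce the summarized result
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable v : values) {
        total += v.get();
      }
      context.write(key, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "instrument-summary");
    job.setJarByClass(InstrumentSummaryJob.class);
    job.setMapperClass(ParseMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Once the raw files are in HDFS, the job would be launched with `hadoop jar`, passing the landing and output directories as the two arguments.
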
7. Hadoop as a Queryable Archive
   • Long-term storage
   • Historic data
   • Discovery

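One common way to make such an archive queryable is to define Hive tables over the archived HDFS directories and query them over JDBC; Hive appears on the ecosystem slide below. A rough sketch using the HiveServer1-era JDBC driver; the host, table, and column names are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ArchiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer1-era driver; HiveServer2 later uses org.apache.hive.jdbc.HiveDriver and jdbc:hive2://
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection("jdbc:hive://hive-edge:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // Placeholder table defined over the archived HDFS data, partitioned by year
    ResultSet rs = stmt.executeQuery(
        "SELECT sample_id, COUNT(*) FROM yield_archive WHERE year = 2011 GROUP BY sample_id");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }

    rs.close();
    stmt.close();
    conn.close();
  }
}
```
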
8. HBase
   • Real-time access
   • OLAP

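Point lookups are what set this layer apart from the batch path: a client reads or writes individual rows in milliseconds instead of scanning files with MapReduce. A minimal sketch with the 2012-era HBase Java client; the table name, column family, and row keys are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GenotypeLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "genotypes");     // hypothetical table name

    // Write one cell: row = sample id, column family "g", qualifier = marker id
    Put put = new Put(Bytes.toBytes("sample-00042"));
    put.add(Bytes.toBytes("g"), Bytes.toBytes("marker-17"), Bytes.toBytes("AA"));
    table.put(put);

    // Real-time point read of the same cell
    Get get = new Get(Bytes.toBytes("sample-00042"));
    Result result = table.get(get);
    byte[] call = result.getValue(Bytes.toBytes("g"), Bytes.toBytes("marker-17"));
    System.out.println("genotype call: " + Bytes.toString(call));

    table.close();
  }
}
```
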
9. Lessons Learned

10. Technical Landscape
   • 3 clusters (dev/test, QA, and prod)
   • 2 backup clusters
   • Combined HBase & MapReduce
   • Access via edge services
   • Resources partitioned by workflows (data & compute)

11. Hadoop Ecosystem @ Monsanto
   • Web portal (Hue), workflow (Oozie), scheduling (Fair Scheduler), data integration (Sqoop), real-time access (HBase), languages/compilers (Pig), serialization (Avro), coordination (ZooKeeper)
   • In use, planned, or of strong interest: Hadoop MapReduce, Hue, Hive, HCatalog, HBase, Stargate/HBase REST, RHadoop, Flume, Oozie, Fair Scheduler, YARN, ZooKeeper, Pig, Sqoop, Quest Connector

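Of the pieces listed here, Avro is the one that shows up directly in application code, and it also speaks to the "publish & share known schemas" need on slide 15, since the schema travels with the data file. A small sketch writing one record to an Avro container file; the schema and field names are made up:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema for a yield observation record
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"YieldObservation\",\"fields\":["
        + "{\"name\":\"plot_id\",\"type\":\"string\"},"
        + "{\"name\":\"bushels_per_acre\",\"type\":\"double\"}]}");

    GenericRecord record = new GenericData.Record(schema);
    record.put("plot_id", "plot-123");
    record.put("bushels_per_acre", 182.4);

    // Avro container files embed the schema and are splittable, so they work well as MapReduce input
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("yield.avro"));
    writer.append(record);
    writer.close();
  }
}
```
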
12. Hadoop Implementation/Deployment
   • It takes a team
   • Practice makes perfect
   • Fit into existing processes or standards when possible; deviate when necessary
   • Know your use case!
   • Capacity planning
   • Start small & build on success

13. Hadoop Security
   • Research data is IP
   • Hadoop is the system of record for some data
   • Spent 6 weeks configuring Hadoop security: sought outside help; a successful installation was not consistently reproducible; support is inconsistent across the ecosystem
   • Adopted the more traditional Hadoop security approach
   • HTTP edge services augmented with corporate single sign-on
   • Integrated into corporate LDAP
   • Revisit when Hadoop security becomes stable

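For context on what those six weeks involve: once hadoop.security.authentication is switched to kerberos, every client must authenticate (typically from a keytab) before touching the cluster, and most of the effort sits in the cluster-side KDC, principal, and keytab configuration. A minimal sketch of just the client-side step; the principal name and keytab path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Matches the cluster-side setting in core-site.xml
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Authenticate as a service principal from its keytab (both names are placeholders)
    UserGroupInformation.loginUserFromKeytab(
        "svc-etl@CORP.EXAMPLE.COM", "/etc/security/keytabs/svc-etl.keytab");

    // HDFS access now carries Kerberos credentials
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/data/genomics")));
  }
}
```
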
14. Backup & Restore
   • Doesn't Hadoop have built-in replication?
   • Requirements: back up HBase & HDFS; weekly full backups; daily incrementals; data offsite, retained for 60 days
   • Rolled our own: dedicated backup cluster; DistCp data to the backup cluster; copy data via Fuse-DFS to tape; manual restore & merge
   • Considering replicating to an offsite DR cluster: no more tape backups!

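DistCp is the natural tool for the cluster-to-cluster copy because it runs as a MapReduce job and copies files in parallel map tasks. The sketch below is only a single-process stand-in, included to make the cross-cluster addressing concrete: both clusters are reached through their NameNode URIs, just as in a `hadoop distcp hdfs://prod-nn:8020/src hdfs://backup-nn:8020/dst` invocation; the host names and paths are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class BackupCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Production and backup clusters, addressed by their NameNodes (placeholder hosts)
    FileSystem prod = FileSystem.get(URI.create("hdfs://prod-nn:8020"), conf);
    FileSystem backup = FileSystem.get(URI.create("hdfs://backup-nn:8020"), conf);

    // Copy one subject-area directory; DistCp does the same thing in parallel across map tasks
    FileUtil.copy(prod,
                  new Path("/data/genomics/2012-week-30"),
                  backup,
                  new Path("/backup/genomics/2012-week-30"),
                  false,   // do not delete the source
                  conf);
  }
}
```
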
15. Data Management... or Lack Thereof!
   • Current approach: data grouped into subject areas; HDFS quotas; access controlled through AD groups; supplemented with governance & process
   • Needs: publish & share known schemas; a common schema across the tool set; fine-grained authorization; monitoring/alerting of data access; data lineage tracking

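The current approach amounts to one directory per subject area with a space quota, an AD-backed group, and restrictive permissions. This is normally done with `hadoop fs -chown`/`-chmod` and `hadoop dfsadmin -setSpaceQuota`, but the equivalent Java calls make the pieces explicit; a rough sketch in which the path, user, group, and quota sizes are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SubjectAreaSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path area = new Path("/data/genomics");             // hypothetical subject-area directory
    fs.mkdirs(area);

    // Owner plus an AD-backed group controls who can read or write the area
    fs.setOwner(area, "svc-genomics", "grp-genomics");  // placeholder user and group
    fs.setPermission(area, new FsPermission((short) 0750));

    // Cap the area: up to 1M names and 50 TB of space (the space quota counts all replicas)
    if (fs instanceof DistributedFileSystem) {
      ((DistributedFileSystem) fs).setQuota(area, 1000000L, 50L * 1024 * 1024 * 1024 * 1024);
    }
  }
}
```
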
16. Conclusion
   • Enterprise ready?
   • Support? Open-source community
   • Documentation: Missouri is "The Show Me State"
   • Evolving third-party support
   • Hadoop resources in the Midwest?
   • Know your use case!

17. Thank You! We Are Hiring!
   erich.hochmuth@monsanto.com
