Cloud computing and Hadoop introduction


Presentation given by Roman Valls at the PRBB Computational Genomics Technical Seminars in Barcelona

  1. 1. BioCloud: random large-scale tools that you can use
  2. 2. Disclaimer: I work in computer security research... there is no biology background anywhere in my field, not even on computer viruses ;) While working, I stumbled across Hadoop for scalable web-spidering purposes. I'm not a bioinformatician (yet)... but I saw a powerful tool that could be useful in your research field(s): "biodatacrunching"?
  3. 3. Glossary
     • Cluster (Beowulf)
     • Grid
     • Cloud
  4. 4. Biology and computer science
     • Increasingly resource-hungry applications
       o Nowadays, they can be approached by "brute force"
       o More data means more "iron" to crunch it
     • Neither the local IT team nor the budget keeps up with this pace
       o €€€ spent on new hardware
       o €€€ spent on IT personnel
       o Isn't it wiser to scale one machine at a time?
     • Developers get angry or frustrated by
       o Delays in software installation and configuration
       o Unscheduled downtime
       o Delays caused by insufficient computing power
  5. 5. What is cloud computing? In plain English:
  6. 6. Infrastructure layer
  7. 7. Cloud niche
  8. 8. Infrastructure
     • Amazon
       o EC2
       o S3
       o AMI
         ▪ Recently added bioinformatics appliances
         ▪ Public data sets
     • Eucalyptus
       o Open source server-side implementation of EC2 + AMI
       o We run it for our internal projects
     • Enomalism
     • RightScale & Service Cloud
       o Tools/consultants for the upcoming cloud issues
  9. 9. Application layer
     • Technologies for parallelizing applications
  10. 10. Application layer
      • Hadoop
        o Open source MapReduce implementation
        o Java based, but any language can be used
      • Cloudburst-bio
        o Fine-tuned MapReduce implementation for bio (XXX)
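
The "any language can be used" point usually refers to Hadoop Streaming, where the mapper and the reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output. The following is a minimal sketch of that style, not taken from the talk: the function names `mapper` and `reducer`, the toy reads, and the local `sorted()` call standing in for Hadoop's shuffle phase are all assumptions made for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "base<TAB>1" record per nucleotide in each line."""
    for line in lines:
        for base in line.strip().upper():
            yield f"{base}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share a key.

    Assumes the records arrive sorted by key, which is what the
    framework's shuffle phase would provide in a real job.
    """
    pairs = (record.split("\t") for record in sorted_records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(int(count) for _, count in group)


if __name__ == "__main__":
    # Local stand-in for the cluster: map, sort (the "shuffle"), reduce.
    reads = ["ATGTTAG", "ttagc"]  # hypothetical toy input
    for base, total in reducer(sorted(mapper(reads))):
        print(f"{base}\t{total}")
```

In a real streaming job the two functions would live in separate scripts wired to `sys.stdin` and `sys.stdout`; keeping them as generators here makes the logic easy to test on a laptop first.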
  11. 11. Easy MapReduce
  12. 12. What is Hadoop? Quotation from the official web page: "Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data." "vast amounts of data (ATGTTAG...)" + "easily" = sounds good, doesn't it? Or is it vaporware?
  13. 13. What is it used for?
      • Attacking problems that involve several GB, TB or even PB of data
      • The programmer does not need to care about job management
        o The focus is on data transformation and piping (useful work)
      • Not intended for real-time processing
      • Suitable for offloading long batch jobs from databases
  14. 14. What is MapReduce? Joel on Software explanation. Useful to crunch *tons* of data; parallelized by design.
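
The idea behind "parallelized by design" can be sketched on a single machine with Python's built-in `map` and `functools.reduce`. This is only an illustration under stated assumptions: the helper names `count_bases` and `merge_counts` and the toy reads are hypothetical, and nothing here runs in parallel. The point is that each map call touches one record only, so a framework is free to spread those calls over many machines.

```python
from collections import Counter
from functools import reduce


def count_bases(read):
    """Map step: turn one read into a partial count, independent of other reads."""
    return Counter(read)


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


reads = ["ATGTTAG", "TTAGC", "GGA"]  # hypothetical toy input

# No call to count_bases depends on another, so the map phase could be distributed.
partials = map(count_bases, reads)

# The reduce phase folds the partial results into a single answer.
totals = reduce(merge_counts, partials, Counter())
print(dict(totals))
```

Because `merge_counts` is associative, partial results can be merged in any grouping, which is what lets a framework combine them as they arrive.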
  15. 15. HDFS: Hadoop Distributed FileSystem
  16. 16. What about job control?
  17. 17. Who is using it?
      • Google
        o Lots of internal projects (proprietary MapReduce)
          ▪ Gmail spam machine learning
          ▪ Google Maps
          ▪ ...
      • Yahoo
        o Internal web graph (powers the search engine)
        o Pig (SQL-like abstraction)
        o Sorted 1 terabyte of data in 209 seconds
      • Facebook
        o Large user graph, used for data mining (Hive)
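
To put the Yahoo sort figure in perspective, it can be turned into an aggregate throughput. The short calculation below is a sketch that assumes a decimal terabyte (10^12 bytes); the slide does not say how many machines were involved, so only the cluster-wide rate is computed.

```python
TERABYTE = 10**12   # bytes, assuming a decimal terabyte
SECONDS = 209       # sort time quoted on the slide

throughput = TERABYTE / SECONDS  # bytes per second across the whole cluster
print(f"{throughput / 10**9:.2f} GB/s")
```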
  18. 18. Hadoop has (lots of) new friends
      • Nutch
      • Mahout
      • HBase
      • Hama
      • Pig
      • ZooKeeper
      • SmartFrog
      • ...
  19. 19. Next steps?
      Identify resource-hungry applications (batch vs. interactive)
      Migrate apps to the cloud
        1) Allocate a certain fixed amount of money
        2) Give Amazon EC2 a try
        3) Optional: build a (local) Rocks cluster with a Eucalyptus cloud
      Test, deploy, automate, automate and automate ... Puppet?
