Cloud computing and Hadoop introduction
Presentation done by Roman Valls at PRBB Computational Genomics Technical Seminars in Barcelona

Published in: Technology, Education
  • 1. BioCloud Random large-scale tools that you can use
  • 2. Disclaimer I work in computer security research... there is no biology background anywhere in my field, not even in computer viruses ;) While working, I stumbled across Hadoop for scalable web spidering. I'm not a bioinformatician (yet)... but I saw a powerful tool that could be useful in your research field(s): "biodatacrunching"?
  • 3. Glossary • Cluster (Beowulf) • Grid • Cloud
  • 4. Biology and computer science • Increasingly resource-hungry applications o Nowadays, they can be approached by "brute force" o More data means more "iron" to crunch it • Neither the local IT team nor the budget can keep up with this pace o €€€ spent on new hardware o €€€ spent on IT personnel o Isn't it wiser to scale one machine at a time? • Developers get angry or frustrated over o Delays in software installation and configuration o Unscheduled downtime o Delays caused by insufficient computing power
  • 5. What is cloud computing? In plain English:
  • 6. Infrastructure layer
  • 7. Cloud niche
  • 8. Infrastructure • Amazon o EC2 o S3 o AMI  Recently added bioinformatics appliances  Public data sets • Eucalyptus o Open source server-side implementation of EC2 + AMI o We run it for our internal projects • Enomalism • RightScale & Service Cloud o Tools/consultants for the upcoming cloud issues
  • 9. Application layer • Technologies for parallelizing applications
  • 10. Application layer • Hadoop o Open source MapReduce implementation o Java-based, but any language can be used • Cloudburst-bio o Fine-tuned MapReduce implementation for bio (XXX)
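The "any language can be used" point most likely refers to Hadoop Streaming, where the mapper and reducer are ordinary programs that exchange tab-separated key/value lines over standard input and output. The following is a minimal sketch of that contract in Python, not the presenter's code: the function names `mapper` and `reducer` and the base-counting task are assumptions chosen for illustration, and the local `sorted()` call only stands in for the sort Hadoop performs between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "base<TAB>1" record for every nucleotide in each input line."""
    for line in lines:
        for base in line.strip().upper():
            if base in "ACGT":
                yield f"{base}\t1"


def reducer(sorted_records):
    """Sum the counts per key; the records must already be sorted by key."""
    pairs = (record.rstrip("\n").split("\t") for record in sorted_records)
    for base, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{base}\t{sum(int(count) for _, count in group)}"


# Local simulation of the pipeline: map, sort by key, reduce.
reads = ["ATGTTAG", "ccgA"]
for record in reducer(sorted(mapper(reads))):
    print(record)
```

In a real streaming job each function would live in its own script that reads `sys.stdin` and prints its records, and Hadoop would take care of distributing the input and sorting the mapper output.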
  • 11. Easy MapReduce
  • 12. What is Hadoop? Quotation from the official web page: "Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data." "vast amounts of data (ATGTTAG...)" + "easily" = sounds good, doesn't it? Or is it vaporware?
  • 13. What is it used for? • Attacking problems that involve several GB, TB or even PB of data • The programmer does not have to care about job management o The focus is on data transformation and piping (useful work) • Not intended for real-time processing • Suitable for offloading long batch jobs from databases
  • 14. What is MapReduce? See the Joel on Software explanation. Useful to crunch *tons* of data, parallelized by design
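To make the "parallelized by design" idea concrete, here is a small single-process sketch of the three MapReduce phases (map, shuffle, reduce), counting k-mers in DNA reads. It is an illustration under stated assumptions rather than Hadoop's API: the names `map_phase`, `shuffle_phase`, `reduce_phase` and `map_reduce`, and the k-mer task itself, are hypothetical. The point is that each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets a framework spread them across many machines.

```python
from collections import defaultdict


def map_phase(read, k=3):
    """Map: turn one read into a list of (k-mer, 1) pairs."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle_phase(pairs):
    """Shuffle: gather all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def map_reduce(reads, k=3):
    """Run the three phases in sequence on a list of reads."""
    mapped = [pair for read in reads for pair in map_phase(read, k)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["ATGTTAG", "ATGA"])
print(counts["ATG"])  # "ATG" occurs once in each read
```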
  • 15. HDFS: Hadoop Distributed FileSystem
  • 16. What about job control?
  • 17. Who is using it? • Google o Lots of internal projects (proprietary MapReduce)  Gmail spam machine learning  Google Maps  ... • Yahoo o Internal web graph (powers the search engine) o Pig (SQL-like abstraction) o Sorted 1 terabyte of data in 209 seconds • Facebook o Large user graph, used for data mining (Hive)
  • 18. Hadoop has (lots of) new friends • Nutch • Mahout • HBase • Hama • Pig • ZooKeeper • SmartFrog • ...
  • 19. Next steps? Identify resource-hungry applications (batch vs. interactive) Migrate apps to the cloud 1) Allocate a fixed amount of money 2) Give Amazon EC2 a try 3) Optional: build a (local) Rocks cluster with a Eucalyptus cloud Test, deploy, automate, automate and automate... Puppet?
  • 20. (a few) References