Cloud computing and Hadoop introduction

Presentation done by Roman Valls at PRBB Computational Genomics Technical Seminars in Barcelona

Transcript

  • 1. BioCloud: random large-scale tools that you can use
  • 2. Disclaimer: I work in computer security research, with no biology background anywhere in my field, not even computer viruses ;) While working, I stumbled across Hadoop for scalable web spidering. I'm not a bioinformatician (yet), but I saw a powerful tool that could be useful in your research field(s): "biodatacrunching"?
  • 3. Glossary
      • Cluster (Beowulf)
      • Grid
      • Cloud
  • 4. Biology and computer science
      • Increasingly resource-hungry applications
          o Nowadays, they can be approached by "brute force"
          o More data means more "iron" to crunch it
      • Neither the local IT team nor the budget can keep up with this pace
          o €€€ spent on new hardware
          o €€€ spent on IT personnel
          o Isn't it wiser to scale one machine at a time?
      • Developers get angry or frustrated over
          o Delays in software installation and configuration
          o Unscheduled downtime
          o Delays caused by insufficient computing power
  • 5. What is cloud computing? In plain English: http://www.youtube.com/watch?v=XdBd14rjcs0
  • 6. Infrastructure layer
  • 7. Cloud niche
  • 8. Infrastructure
      • Amazon
          o EC2
          o S3
          o AMI
              - Recently added bioinformatics appliances
              - Public data sets
      • Eucalyptus
          o Open source server-side implementation of EC2 + AMI
          o We run it for our internal projects
      • Enomalism
      • RightScale & Service Cloud
          o Tools/consultants for the upcoming cloud issues
  • 9. Application layer
      • Technologies for parallelizing applications
  • 10. Application layer
      • Hadoop
          o Open source MapReduce implementation
          o Java based, but any language can be used
      • Cloudburst-bio
          o MapReduce implementation fine-tuned for bioinformatics (XXX)
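
The "any language can be used" point on slide 10 usually refers to Hadoop Streaming, where the map and reduce steps are ordinary programs that read lines and write tab-separated key/value lines. The following is a minimal sketch of that idea in Python, not code from the talk: the function names `mapper`, `reducer` and `run_locally`, the k-mer counting task and the default `k=3` are all assumptions made for illustration. The reducer relies on its input being sorted by key, which is what happens between the map and reduce steps.

```python
def mapper(lines, k=3):
    """Emit one 'kmer<TAB>1' line per k-length window of each input read."""
    for line in lines:
        read = line.strip().upper()
        for i in range(len(read) - k + 1):
            yield f"{read[i:i + k]}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key.

    Assumes the input is already sorted by key.
    """
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"


def run_locally(lines, k=3):
    """Simulate map -> sort -> reduce in a single process (testing only)."""
    return list(reducer(sorted(mapper(lines, k))))


if __name__ == "__main__":
    for out in run_locally(["ATGTTAG\n", "ATGATG\n"]):
        print(out)
```

In a real streaming job each function would live in its own script that reads `sys.stdin` and prints its output; how the job is submitted depends on the local Hadoop installation and is not shown here.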
  • 11. Easy MapReduce
  • 12. What is Hadoop? Quotation from the official web page: "Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data." "vast amounts of data (ATGTTAG...)" + "easily" = sounds good, doesn't it? Or is it vaporware?
  • 13. What is it used for?
      • Attacking problems that involve several GB, TB or even PB of data
      • The programmer does not have to care about job management
          o The focus is on data transformation and piping (useful work)
      • Not intended for real-time processing
      • Suitable for offloading long batch jobs from databases
  • 14. What is MapReduce? The "Joel on Software" explanation. Useful to crunch *tons* of data; parallelized by design
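
The "parallelized by design" claim on slide 14 can be sketched as follows. A map step applies the same function to every item independently, so a serial map can be swapped for a parallel one without changing the result, and a reduce step then folds the per-item results into one value. The GC-counting task and the names `gc_count`, `total_gc_serial` and `total_gc_parallel` are assumptions chosen for illustration. A thread pool stands in for a cluster here; it only demonstrates the interchangeability, not a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def gc_count(seq):
    """Per-item work: count the G and C bases in one sequence."""
    return sum(1 for base in seq if base in "GC")


def total_gc_serial(seqs):
    """Map each sequence to its GC count, then reduce by addition."""
    return reduce(add, map(gc_count, seqs), 0)


def total_gc_parallel(seqs, workers=4):
    """Same computation, with the map step handed to a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce(add, pool.map(gc_count, seqs), 0)


if __name__ == "__main__":
    reads = ["ATGTTAG", "GGCC", "ATAT"]
    print(total_gc_serial(reads), total_gc_parallel(reads))
```

Because `gc_count` never looks at any other item, nothing in the calling code has to change when the map step moves from one worker to many.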
  • 15. HDFS: Hadoop Distributed FileSystem
  • 16. What about job control?
  • 17. Who is using it?
      • Google
          o Lots of internal projects (proprietary MapReduce)
              - Gmail spam machine learning
              - Google Maps
              - ...
      • Yahoo
          o Internal web graph (powers the search engine)
          o Pig (SQL-like abstraction)
          o Sorted 1 terabyte of data in 209 seconds
      • Facebook
          o Large user graph, used for data mining (Hive)
  • 18. Hadoop has (lots of) new friends
      • Nutch
      • Mahout
      • HBase
      • Hama
      • Pig
      • ZooKeeper
      • SmartFrog
      • ...
  • 19. Next steps?
      Identify resource-hungry applications (batch vs. interactive)
      Migrate apps to the cloud
          1) Allocate a certain fixed amount of money
          2) Give Amazon EC2 a try
          3) Optional: build a (local) Rocks cluster with a Eucalyptus cloud
      Test, deploy, automate, automate and automate ... Puppet?
  • 20. (A few) references
      http://www.cloudera.com/hadoop-training-thinking-at-scale
      http://www.slideshare.net/tag/hadoop
      http://sourceforge.net/projects/cloudburst-bio/
      http://hadoop.apache.org/core/
      http://people.apache.org/~rdonkin/hadoop-talk/hadoop.html