Random large-scale tools that you could find useful
I'm working in computer security research... no biology
background anywhere in my field, not even computer viruses ;)
While working, I stumbled across Hadoop for scalable web processing
I'm not a bioinformatician (yet)... but I saw a powerful tool that
could be useful in your research field(s):
Biology and computer science
• Increasingly resource-hungry applications
o Nowadays, they can be approached by "brute force"
o More data means more "iron" to crunch it
• Neither the local IT team nor the budget can keep up with this pace
o €€€ spent on new hardware
o €€€ spent on IT personnel
o Wouldn't it be wiser to scale one machine at a time?
• Developers get angry or frustrated about:
o Delays in software installation and configuration
o Unscheduled downtimes
o Delays due to insufficient computing power
What is cloud computing?
In plain English: renting computing power and storage on demand over the Internet, paying only for what you actually use.
• Amazon Web Services
o Recently added bioinformatics appliances
o Public data sets
• Eucalyptus
o Open-source, server-side implementation of EC2 + AMI
o We run it for our internal projects
• RightScale & Service Cloud
o Tools/consultants for the upcoming cloud issues
• Technologies for parallelization
o Open-source MapReduce implementation
o Java-based, but any language can be used (see the sketch after this list)
o A MapReduce implementation fine-tuned for bio (XXX)
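To make that concrete, here is a minimal sketch of a Hadoop job in Java, an illustration rather than production code: it counts 4-letter subsequences (k-mers) in lines of sequence data. The Mapper/Reducer/Job classes are Hadoop's standard MapReduce API; the KmerCount class name, the choice of k = 4, and the k-mer-counting task itself are mine for illustration.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative Hadoop job: count every 4-letter k-mer in lines of sequence data.
    public class KmerCount {

        public static class KmerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final int K = 4;                     // k-mer length (arbitrary choice)
            private static final IntWritable ONE = new IntWritable(1);
            private final Text kmer = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String seq = line.toString().trim();
                // Emit (kmer, 1) for every window of length K in this line;
                // Hadoop runs many mappers in parallel, one input split each.
                for (int i = 0; i + K <= seq.length(); i++) {
                    kmer.set(seq.substring(i, i + K));
                    context.write(kmer, ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text kmer, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                // All counts for the same k-mer arrive at the same reducer; just add them up.
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(kmer, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "kmer count"); // new Job(conf, ...) on old Hadoop
            job.setJarByClass(KmerCount.class);
            job.setMapperClass(KmerMapper.class);
            job.setCombinerClass(SumReducer.class);  // safe as combiner: addition is associative
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, this runs unchanged on one laptop or on a whole cluster, e.g. hadoop jar kmercount.jar KmerCount reads/ counts/ (the jar name and paths are made up here).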
What is Hadoop?
Quotation from official web page:
"Hadoop is a software platform that lets one easily write and
run applications that process vast amounts of data."
"vast amounts of data (ATGTTAG...)" + "easily" = sounds good
isn't it ? or is it vaporware ?
What is it used for?
• Tackle problems that involve several GB, TB, or even PB of data
• The programmer does not have to care about job management
o The focus is on data transformation and piping (the useful work)
• Not intended for real-time processing
• Suitable for offloading long batch jobs from databases
What is MapReduce?
The Joel on Software explanation: map and reduce are plain functional-programming building blocks.
Useful for crunching *tons* of data, parallelized by design (see the sketch below).
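As a toy illustration of that idea, here is the same map/reduce split in plain Java, no Hadoop involved (the class name and data are made up for illustration):

    import java.util.Arrays;
    import java.util.List;

    // The two halves of MapReduce as ordinary functional code:
    // "map" transforms each element independently (so it parallelizes trivially),
    // "reduce" folds the partial results into one answer.
    public class MapReduceIdea {
        public static void main(String[] args) {
            List<String> reads = Arrays.asList("ATGTTAG", "GGATC", "TTAGA");
            int totalBases = reads.stream()
                    .mapToInt(String::length) // map step: one number per read
                    .sum();                   // reduce step: combine the partials
            System.out.println(totalBases);   // prints 17
        }
    }

Because the map step shares no state between elements, a framework can scatter it across thousands of machines; only the reduce step has to gather results back together.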
Who is using it?
• Google
o Lots of internal projects (proprietary MapReduce)
o Gmail spam machine learning
• Yahoo!
o Internal web graph (powers the search engine)
o Pig (SQL-ish abstraction)
o Sorted 1 terabyte of data in 209 seconds
• Facebook
o Big graph of users, used for data mining (Hive)
Next steps?
Identify resource-hungry applications (batch vs. interactive)
Migrate apps to the cloud:
1) Allocate a fixed amount of money
2) Give Amazon EC2 a try
3) Optional: build a (local) Rocks cluster with the Eucalyptus cloud
Test, deploy, automate, automate and automate... Puppet?
(a few) References