Personalized medicine holds much promise to improve the quality of human life.
However, personalizing medicine depends on genome analysis software that does not scale well. Given the potential impact on society, genomics takes first place among fields of science that can benefit from Hadoop.
A single human genome contains about 3 billion base pairs. This is less than 1 gigabyte of data but the intermediate data produced by a DNA sequencer, required to produce a sequenced human genome, is many hundreds of times larger. Beyond the huge storage requirement, deep genomic analysis across large populations of humans requires enormous computational capacity as well.
Interestingly enough, while genome scientists have adopted the concept of MapReduce for parallelizing I/O, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally.