At StampedeCon 2014, Rob Long (Monsanto) presented "Managing Genomes At Scale: What We Learned."
Monsanto generates large amounts of genomic sequence data every year. Agronomists and other scientists use this data as input for predictive analytics to aid breeding and the discovery of new traits such as disease or drought resistance. To enable the broadest possible use of this valuable data, scientists would like to query genomic data by species, chromosome, position, and myriad other categories. We present our solutions to these problems, as realized on top of HBase here at Monsanto. We will discuss what we learned about flat/wide vs. tall/narrow HBase schema design, preprocessing and caching windows of data for use in web-based visualizations, approaches to complex multi-join queries across deep data sets, and distributed indexing via SolrCloud.
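
To make the flat/wide vs. tall/narrow distinction concrete, here is a minimal sketch in plain Python; it is not the schema presented in the talk. It relies only on the fact that HBase keeps rows sorted by row key in byte order. The `|` delimiter, the 12-digit zero padding, the species and chromosome names, and all function names are hypothetical choices made for illustration.

```python
import bisect

def tall_row_key(species: str, chromosome: str, position: int) -> bytes:
    """Tall/narrow layout: one row per genomic position (hypothetical key format).

    The position is zero-padded so that byte-wise ordering of the keys
    matches numeric ordering of the positions.
    """
    return f"{species}|{chromosome}|{position:012d}".encode("ascii")

def wide_cell(species: str, chromosome: str, position: int) -> tuple[bytes, bytes]:
    """Flat/wide layout: one row per chromosome, one column per position.

    Returns a (row key, column qualifier) pair.
    """
    row = f"{species}|{chromosome}".encode("ascii")
    qualifier = f"{position:012d}".encode("ascii")
    return row, qualifier

def scan_window(sorted_keys: list[bytes], species: str, chromosome: str,
                start: int, stop: int) -> list[bytes]:
    """Emulate a range scan over sorted tall keys for positions in [start, stop)."""
    lo = bisect.bisect_left(sorted_keys, tall_row_key(species, chromosome, start))
    hi = bisect.bisect_left(sorted_keys, tall_row_key(species, chromosome, stop))
    return sorted_keys[lo:hi]

# Made-up positions on a made-up chromosome, kept sorted as a key-ordered store would.
keys = sorted(tall_row_key("maize", "chr1", p) for p in (5, 100, 2500, 100000))
window = scan_window(keys, "maize", "chr1", 100, 3000)
```

In the tall layout, a window of positions is a contiguous run of rows, so a query by species, chromosome, and position range becomes a single range scan. In the wide layout, the same window lives inside one row and is selected by column qualifier instead. Which trade-off is better depends on row size and access patterns.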