EPA 2013 Air Sensors Meeting Big Data Talk

2,717 views

https://sites.google.com/site/airsensors2013/final-materials

Published in: Education

  1. BIG DATA (IN BIOLOGY): INTEGRATING LARGE, FAST MOVING, HETEROGENEOUS DATASETS. Adina Howe, Argonne National Laboratory, Michigan State University. EPA Air Sensors 2013: Data Quality and Applications, March 19, 2013
  2. Introduction – My perspective [Diagram labels: Experiment Design, Data Generation, Data analysis, Workflow / Tools, Applied Solutions; Engineering, Microbial Ecology, Bioinformatics]
  3. THE DATA DELUGE: An exponential landscape
  4. Next-generation sequencing growth outpacing computational resources. Log scale! (Stein, Genome Biology, 2010)
  5. Next-generation sequencing growth outpacing computational resources (Stein, Genome Biology, 2010)
  6. Effects of low cost sequencing…
      • First free-living bacterium sequenced for billions of dollars and years of analysis
      • A personal genome can be mapped in a few days for hundreds to a few thousand dollars
  7. Effects of low cost sequencing on research (Sboner et al., Genome Biology, 2011)
  8. Effects of low cost sequencing on research (Sboner et al., Genome Biology, 2011)
  9. Effects of low cost sequencing on research (Sboner et al., Genome Biology, 2011)
  10. RETHINKING: What it takes to deliver [Diagram labels: Technology, Core competency, Value added]
  11. Technical obstacles in the big data deluge
      • Access to the data and its value
      • Access to the resources
      Democratization of both data and resource access
      “80% of awards and 50% of $$ are for grants < $350,000”
      Root causes:
      • Data volume and velocity “clog”
      • Data is very heterogeneous
      • Previous efforts are difficult to integrate
      • Innovation is necessary but hard
      [Diagram labels: Experiment Design, Data Generation, Data analysis, Workflow / Tools, Applied Solutions]
  12. Social obstacles are the most difficult.
      • A shift of costs does not mean a shift of expectations: “Give me the answer so I can get back to work.”
      • A culture of sharing (data, time, and tools)
      • Evolution of necessary training
      • Creating teams that can communicate across domains
      • Incentives are not strong enough
      • Patterns for success (useful data sharing and collaboration) are not apparent or well understood.
  13. POSSIBLE SOLUTIONS
  14. Common solutions: been there, done that. http://xkcd.com/927/
  15. What would an ideal solution look like?
      • Flexible access to data, tools, and resources
      • Cost effective, consistent, reusable (scalable)
      • Rapid exploration
      • Incentives to participate, share, communicate
      • Community sandbox (vs lab-specific)
      • Platform which supports an “ecology” of databases, interfaces, and analysis software.
      [Callout label: Painless]
  16. The success of organization: Amazon
      • > 50 million users, > 1 million product partners, billions of reviews, dozens of compute services.
      • Continually changing/updating data sets.
      • Explicitly adopted a service-oriented architecture that enables both internal and external use of this data.
      • For example, the Amazon.com website is itself built from over 150 independent services…
      • Amazon routinely deploys new services and functionality.
      http://highscalability.com/amazon-architecture
      https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
  17. Amazon development guideline: Colloquially said, “You should eat your own dogfood.” Design and implement the database and database functionality to meet your own needs; only use the functionality you’ve explicitly made available to everyone. To adapt to research: database functionality should be designed in tight integration with researchers who are using it, both at a user interface level and programmatically.
  18. If the “customers” aren’t integrated into the development loop: http://blog.thingsdesigner.com/uploads/id/tree_swing_development_requirements.jpg
  19. DOE Knowledgebase (KBase)
      • Emerging software and data environment to enable researchers
      • Service oriented architecture where biological data is integrated into a single data model, with KBase services loosely coupled to achieve various functions
      • Open development environments for community contribution (public data, services, software)
      • Provides robust and scalable infrastructure (with some level of support)
      https://kbase.us
  20. KBase uses service oriented architecture [Diagram label: Higher level functions] http://kbase.us/files/6913/4990/5274/Infrastructure.pptx.pdf
  21. DOE KBase Investment: “…may also apply for additional supplemental funding of up to $300,000 per year for development of systems biology and –omics data driven applications in collaboration with the DOE Systems Biology Knowledgebase.” Free tutorials / workshops are provided for the community.
  22. Advice for the next round… Big data is a community problem and solution
      Data generator:
      • Managing expectations and value
      Developer:
      • “Eat your own dogfood”
      Data analyzer:
      • Analyze with reproducibility in mind
      [Diagram labels: Platform / Access, Teams, Training, Communication]
  23. Resources
      • Amazon interviews: http://highscalability.com/amazon-architecture
      • Titus Brown’s blog post on heterogeneous data integration: http://ivory.idyll.org/blog/software-architecture-for-heterogeneous-data-integration.html
      • KBase website: http://www.kbase.us
      • Software carpentry, “helping scientists build better software”: http://software-carpentry.org
  24. Thanks! Please feel free to contact me: http://adina.github.com, adina@anl.gov. http://cheezburger.com/6983817216
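
The "eat your own dogfood" guideline on slide 17 (only use the functionality you have explicitly made available to everyone) can be illustrated with a small sketch. Nothing here comes from Amazon or KBase: `Sample`, `SampleService`, `total_reads`, and the example records are hypothetical names and values chosen for illustration. The point is only that the in-house analysis goes through the same public methods an outside user would call, and never reaches into the private store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One record in a shared data model (fields are hypothetical)."""
    sample_id: str
    organism: str
    read_count: int


class SampleService:
    """Hypothetical data service; its public methods are the only way in."""

    def __init__(self) -> None:
        # Private storage: neither internal nor external callers touch this.
        self._store: dict[str, Sample] = {}

    def add(self, sample: Sample) -> None:
        self._store[sample.sample_id] = sample

    def get(self, sample_id: str) -> Sample:
        return self._store[sample_id]

    def find_by_organism(self, organism: str) -> list[Sample]:
        return [s for s in self._store.values() if s.organism == organism]


def total_reads(service: SampleService, organism: str) -> int:
    """In-house analysis that uses the public interface only."""
    return sum(s.read_count for s in service.find_by_organism(organism))


# Made-up example records, for illustration only.
service = SampleService()
service.add(Sample("s1", "E. coli", 1200))
service.add(Sample("s2", "E. coli", 800))
service.add(Sample("s3", "B. subtilis", 500))
print(total_reads(service, "E. coli"))
```

If `total_reads` needed something the public methods could not provide, the dogfooding rule would say to extend the service for everyone instead of adding a private shortcut.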

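Slide 22's advice to "analyze with reproducibility in mind" is not spelled out in the talk. One possible reading, sketched below under stated assumptions, is to store each result next to a record of what produced it: a checksum of the input, the parameters, and the interpreter version. The function names, the `min_value` parameter, and the toy readings are all invented for this illustration.

```python
import hashlib
import json
import platform


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw input bytes."""
    return hashlib.sha256(data).hexdigest()


def run_with_provenance(raw: bytes, params: dict) -> dict:
    """Run a toy analysis and return the result plus what produced it."""
    values = [float(x) for x in raw.decode().split()]
    kept = [v for v in values if v >= params["min_value"]]
    result = {
        "n_kept": len(kept),
        "mean": sum(kept) / len(kept) if kept else None,
    }
    return {
        "result": result,
        "provenance": {
            "input_sha256": sha256_of(raw),
            "params": params,
            "python": platform.python_version(),
        },
    }


# Made-up readings, for illustration only.
raw_readings = b"12.0 7.5 30.0 18.5"
record = run_with_provenance(raw_readings, {"min_value": 10.0})
print(json.dumps(record, indent=2, sort_keys=True))
```

Anyone rerunning the analysis can compare the stored checksum and parameters against their own before trusting a difference in the result.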