2. Just Enough Developed Structure
The second presentation from Eagle Genomics' second symposium, held on 29th March 2012. Presented by Dan MacLean from TSL. Unfortunately we are not allowed to share the cool Lego pictures from Dan's presentation.
"The Sainsbury Laboratory is a small independent research lab of approximately 80 researchers. We concentrate on plant/pathogen interactions, with a focus on the genetics and genomics of a wide and frequently shifting line-up of organisms under study. A large proportion of our projects are focussed on a relatively modest goal, for example identifying Resistance genes in a novel model, rather than producing a gold-standard annotated genome. Recently we have made the decision to abandon our own compute cluster in favour of merging with larger users, i.e. TGAC, in an 'access is more important than ownership' model. Similarly, we have found that many existing tools for genome sequence and feature annotation management, while great in their own way, can be too large and more difficult to manage than is useful for our projects and personnel profile. I'll describe the sequencing pipelines and hardware that feed into our data store as background for the core of our management structure, our lightweight sequence and feature versioning tool Gee Fu. Gee Fu interacts with our assembly and annotation pipelines from the command line and from our Galaxy installation. It acts simultaneously as a store and annotation editor that has appropriate interfaces for users with a wide range of computing skills, and can be used to share data externally via its RESTful interface and a web-based front end."

Published in: Health & Medicine, Technology

  • It’s an industrial revolution in data, and anyone who cannot deal with data at these scales is going to be left sorely behind. The skills required are still very rare, and we are currently in the ‘tooling up’ phase. So this talk is about how we go about transitioning to a flexible, scalable approach to dealing with these… The catch-all name for these sorts of approaches in biology is bioinformatics.
  • Transcript

    • 1. Just Enough Developed Infrastructure: An Agile J.E.D.I approach. Dan MacLean, The Sainsbury Laboratory
    • 2. The Sainsbury Lab
    • 3. TSL Funding
    • 4. TSL Research: “The Sainsbury Laboratory is dedicated to making fundamental discoveries about plants and how they interact with microbes and viruses and favours daring, long-term research over work that could be equally well carried out elsewhere”. Basic and translational research into Plant/Pathogen interactions
    • 5. Research areas as listed on the slide:
      • genomics HTGS effector finding R gene cloning assembly de novo resequencing SNP detection annotation pipelines
      • transcriptomics RNA seq ChIP seq arrays
      • proteomics high throughput image analysis pipeline development image segmentation statistical methods for object detection spectrum analysis algorithm development
    • 6. Our Projects, Our Goals: Projects and Researchers
    • 7. Bottom-Up Bioinformatics: ‘Bioinformatics is a sub-discipline of molecular biology’. MacLean and Kamoun, 2012 Nat Biotech 30, 33–34. Burrell and MacLean: Open Source Software in Life Science Research (eds Harland and Forster)
    • 8. Our Genome Content
      • Raw reads
      • Assembled reference sequence
      • Annotations
      • Variants
      • Expression data
    • 9. Where will our data come from? Access is more important than ownership. (Chart: owned vs outsourced; 2006, 2012, 2022)
    • 10. Where will we compute? Access is more important than ownership. (Chart: 2006, 2012, 2022) As time goes on, the infrastructure becomes much more of a commodity
    • 11. Is genome content management a storm in a tea cup for small groups?
      • No!
      • Always need to share results, for us it’s more about derived data than the raw data
      • Problem is recording and sharing effectively
    • 12. The Bioinformatics Ecosystem
      • Monolithic Infrastructures Prevail
      • So many standards, so little time
      • Ensembl or Chado / GMOD too complex for stretched small team
    • 13. What is a support bioinformatician to do in a small team?
      • Need to get biologists to prepare their own metadata
      • Sociological resistance on biologists’ part
    • 14. Just Enough Developed Infrastructure: get tooled up
    • 15. Microlithic Genomic Infrastructure: very small tools for very specific jobs. Microliths
      • Bioinformatics infrastructure should be as simple as possible, but no simpler (all bioinformatics infrastructure is wrong, some bioinformatics infrastructure is useful)
      • Metadata, Content History
    • 16. Gee Fu: Sequence feature version database and web-services. Simply serve the needs now. Keep an eye on the future needs
      • Rails App
      • Simple Database Schema
      • Multi-User, Multi-Focus Paradigm
      • Web and Console interface
      • Visualisation via AnnoJ / any web-service browser
      • API
      • bio-samtools
      • Ramirez-Gonzalez et al (2011) Bioinformatics 27:2754
    • 17. Gee Fu: Simple Schema. GFF3 + sequence + metadata; one instance = one organism; many genomes, reference sequences, features, experiments; predecessors
    • 18. Gee Fu, Multi-User Backgrounds
    • 19. Gee Fu, Versioning Features: Easy editing and instant update of feature and history allow all workers to be using up-to-date annotations
    • 20. Gee Fu, NGS data from BAM (diagram labels: HTTP GET, Response, e.g. AnnoJ, RDBMS, BAM)
    • 21. bio-samtools: Ramirez-Gonzalez, Bonnal et al (2012) Open Research Computation, 1:1
    • 22. Gee Fu, RESTfulness: Rails RESTful URLs give full read access to all models and data. Some local root, e.g. http://my_genome_resource.meh, then:
      • /references.xml or /references.json
      • /features/coverage/2?start=100&end=200&format=json
    • 23. Pay Off
      • Store of meta-data, kept with experimental data throughout life, history of changes kept
      • Everyone on same db, everyone knows what everyone else knows
      • Everyone up-to-date
      • Sharing easy internally and externally
      • Easy to collect data and metadata for submission to big repositories at publication time
    • 24. Future of microlithic genomics: International Arabidopsis Informatics Consortium, Arabidopsis Informatics Portal
      • Central Community body/resource: sets standards, coordinates information exchange by external modules
      • External modules: do specialised functions, e.g. host expression data, provide visualisation servers, allow annotation uploads
      • E.g. FlyBase, WormBase
      • As number of model organisms increases, an increasingly prevalent model
      • IAIC (2010) The Plant Cell,
    • 25. Summary: We’re little. But there are lots of us. We can all do a bit, but we’re too busy to do lots (your big overpowered solution is likely to annoy us). Therefore small in-house solutions are going to proliferate. We will be very happy when we just know how to play nicely and talk to each other usefully.
    • 26. Acknowledgements
      • Ricardo Ramirez-Gonzalez
      • Raoul Bonnal
      • Mario Caccamo
      • The Gatsby Charitable Foundation
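
The URL patterns on slide 22 can be illustrated with a small client-side sketch. This is only an assumption-laden illustration, not Gee Fu's own code: the helper names `references_url` and `coverage_url` are hypothetical, the root `http://my_genome_resource.meh` is the slide's placeholder rather than a real server, and the slide does not say what the `2` in the coverage path identifies, so it is treated here as an opaque record id.

```python
from urllib.parse import urlencode

# Placeholder root copied from slide 22; it is not a real server.
BASE_URL = "http://my_genome_resource.meh"


def references_url(base=BASE_URL, fmt="json"):
    """Build the URL that lists reference sequences as XML or JSON."""
    if fmt not in ("xml", "json"):
        raise ValueError("the slide only shows .xml and .json")
    return f"{base.rstrip('/')}/references.{fmt}"


def coverage_url(record_id, start, end, base=BASE_URL, fmt="json"):
    """Build a coverage URL for a region, following the slide's pattern.

    record_id is whatever the path segment after /features/coverage/
    identifies; the slide does not say, so this is an assumption.
    """
    query = urlencode({"start": start, "end": end, "format": fmt})
    return f"{base.rstrip('/')}/features/coverage/{record_id}?{query}"


if __name__ == "__main__":
    print(references_url(fmt="xml"))
    print(coverage_url(2, 100, 200))
```

Because the slide says these URLs give read access, any HTTP client could then issue a plain GET against the resulting address of a real installation.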
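
Slide 17 describes the schema as "GFF3 + sequence + metadata". As a rough sketch of what one GFF3 record carries, the snippet below parses a single line into its nine standard tab-separated columns. The `Feature` class and `parse_gff3_line` function are hypothetical names for illustration; they follow the GFF3 column layout and are not taken from Gee Fu's database schema.

```python
from dataclasses import dataclass, field
from urllib.parse import unquote


@dataclass
class Feature:
    """One GFF3 record; field names follow the nine GFF3 columns."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: dict = field(default_factory=dict)


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a Feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    attributes = {}
    if cols[8] != ".":
        # Column 9 holds semicolon-separated, URL-escaped key=value pairs.
        for pair in cols[8].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attributes)


if __name__ == "__main__":
    # Made-up example record, not data from the talk.
    example = "chr1\texample\tgene\t100\t200\t.\t+\t.\tID=gene1;Name=R%20gene"
    print(parse_gff3_line(example))
```

Comment lines, directives and multi-valued attributes are left out to keep the sketch short.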