Bottoms bosc2010 bio_snp_inherit
Published in: Technology, Business
  • The data file had to be read into the database, and then the information from the database was used to determine inheritance codes.
  • We had 5000 samples of data associated with each “SNP ID”, and we had over 1000 SNP IDs, making our data file over 5 million lines long. It was actually much messier looking than this, and I ended up processing each line and storing the results in a database. After I talked with my boss about this, he provided me the same data in a different format.
  • This format really condensed the data file: from 800MB to less than 15MB, in fact. However, each “data point” is no longer “tagged”, so some additional preprocessing was needed.
  • The sample IDs I showed you earlier each represented a different individual corn plant. Knowing the relationships among the different plants was required for processing the data. Here, since I’m a human familiar with the genetic system, I know that IBM stands for an Intermated B73 x Mo17 population. This is a simplified example of a manifest file. Z1, M100, and “Bob” are just made-up names, and any similarity to known names is purely coincidental. When you start looking at these, you see that the relationships were defined in multiple ways. There isn’t anything here that directly tells the computer that IBM, Mo17, and B73 are related. To take advantage of this information, I wrote a long series of rules. The breakthrough came with the realization that I couldn’t keep this up forever. Instead of telling the computer how to understand these relationships, I decided to just tell the computer what the relationships are (next slide).
  • This is organized in a way that is simple for both humans and computer programs to understand.
  • Configuration files are great for some tasks that are easy for humans but more difficult to program. They are also great for things that are variable. Setting up the configuration file only takes minutes. If we don’t know what these relationships are to start with, then we’re in trouble anyway. Simple for humans ≠ simple for computers. Something else I didn’t put up here is that reducing your dependencies sure makes it easier to install.
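The idea of telling the computer what the relationships are, rather than how to infer them, can be sketched in a few lines. This is only an illustration in Python, not the code from Bio::SNP::Inherit: the tab-separated layout, the column names (`parent_a`, `parent_b`), and the helper functions are assumptions made for the example. The sample rows reuse B73 (ID 1), Mo17 (ID 3), and M100 (ID 4, group IBM) from the slides.

```python
import csv
import io

# Hypothetical manifest in the spirit of the "human parsed" IDs file.
# The tab-separated layout and the column names are assumptions made
# for this sketch; an empty cell means "no parent declared".
MANIFEST = (
    "id\tname\tgroup\tparent_a\tparent_b\n"
    "1\tB73\tB73\t\t\n"
    "3\tMo17\tControl\t\t\n"
    "4\tM100\tIBM\t1\t3\n"
)


def load_manifest(text):
    """Return {sample_id: row} with IDs and parent IDs as int (or None)."""
    samples = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["id"] = int(row["id"])
        for key in ("parent_a", "parent_b"):
            row[key] = int(row[key]) if row[key] else None
        samples[row["id"]] = row
    return samples


def parent_names(samples, sample_id):
    """Return the names of a sample's declared parents (None if undeclared)."""
    row = samples[sample_id]
    return tuple(
        samples[pid]["name"] if pid is not None else None
        for pid in (row["parent_a"], row["parent_b"])
    )


samples = load_manifest(MANIFEST)
print(parent_names(samples, 4))  # ('B73', 'Mo17')
```

With the parents written down explicitly, the program needs no rules about what “IBM” means; it only follows the IDs.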
  • Transcript

    • 1. SNP Allele Designations (Bio::SNP::Inherit) Christopher Bottoms BOSC 2010
    • 2. 5 million data “items”; one CPU: 2+ days; eight CPUs: 1-2 days
    • 3. SNP ID  Sample ID  Base1  Base2
         1       1          A      A
         1       2          A      A
         1       3          A      G
         …       …          …      …
         1       5000       A      A
         2       1          C      C
         …       …          …      …
         …       …          …      …
         1106    5000       GG     GG
    • 4. SNP ID  Sample ID  Base1  Base2
         1       1          A      A
         1       2          A      A
         1       3          A      G
         …       …          …      …
         1       5000       A      A
         2       1          C      C
         …       …          …      …
         …       …          …      …
         1106    5000       GG     GG
    • 5. “Matrix” data file format
         SNP ID  1   2   3   …  5000
         SNP1    AA  AA  AG  …  AA
         SNP2    CC  GG  GG  …  CG
    • 6. “Matrix” data file format
         SNP ID  1   2   3   …  5000
         SNP1    AA  AA  AG  …  AA
         SNP2    CC  GG  GG  …  CG
    • 7. Using new data format
      • 12 million data items
      • one CPU: ~30 min
    • 8. IDs file
         ID  Name    Group
         1   B73     B73
         2   B73xZ1  NAMF1
         3   Mo17    Control
         4   M100    IBM
         5   Bob     B73xZ1
    • 9. IDs file
         ID  Name    Group
         1   B73     B73
         2   B73xZ1  NAMF1
         3   Mo17    Control
         4   M100    IBM
         5   Bob     B73xZ1
    • 10. “Human Parsed” IDs file
          Columns: ID, Name, Group, A (ID), B (ID), AxB (ID)
          1  B73     B73
          2  B73xZ1  NAMF1
          3  Mo17    Control
          4  M100    IBM     1 3
          5  Bob     B73xZ1  1 2
    • 11. Lessons learned
      • Explore other solutions before deciding on parallel processing
      • File format changes can simplify work
      • When appropriate, divide work
        • Humans: Complicated but “once-only” task
        • Computers: Repetitive boring work
    • 12. Acknowledgements
      • Advisors
        • Mike McMullen
        • Sherry Flint-Garcia
      • Hardware support
        • Arturo Garcia
      • Funding
        • National Science Foundation Plant Genome Program Grant DBI-0820619
        • USDA-ARS
    • 13. Acknowledgements
      • Programming support
        • You (CPAN)
        • You (stackoverflow.com)
        • You (perlmonks.org)
    • 14. End
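
The “matrix” format from slides 5 and 6 drops the per-item tags, so the reader has to re-attach the SNP ID and sample ID to each genotype while parsing. A minimal Python sketch of that preprocessing step follows. It is not the talk's Perl code, and it assumes a whitespace-separated file whose header label is a single token (written here as `SNP_ID`); the function name `parse_matrix` is made up for the example.

```python
def parse_matrix(lines):
    """Yield (snp_id, sample_id, genotype) from a whitespace-separated matrix.

    Assumed layout: the first non-blank line is a header (one label token,
    then the sample IDs); every later line is a SNP ID followed by one
    genotype per sample.
    """
    rows = (line.split() for line in lines if line.strip())
    header = next(rows, None)
    if header is None:  # empty input: nothing to yield
        return
    sample_ids = header[1:]
    for fields in rows:
        snp_id, genotypes = fields[0], fields[1:]
        if len(genotypes) != len(sample_ids):
            raise ValueError(
                f"{snp_id}: expected {len(sample_ids)} genotypes, "
                f"got {len(genotypes)}"
            )
        # Re-attach the "tags" that the condensed format leaves implicit.
        for sample_id, genotype in zip(sample_ids, genotypes):
            yield snp_id, sample_id, genotype


# Small excerpt using the genotypes shown on the slide.
MATRIX = """\
SNP_ID 1 2 3
SNP1 AA AA AG
SNP2 CC GG GG
"""

records = list(parse_matrix(MATRIX.splitlines()))
print(records[2])  # ('SNP1', '3', 'AG')
```

Because the function is a generator, it yields one data item at a time and never holds the whole matrix in memory.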
