  • Kusarinoko: developing the public next generation sequencing data search interface that works.

    1. Kusarinoko: developing the public next generation sequencing data search interface that works. Tazro Ohta, Database Center for Life Science, Research Organization of Information and Systems
    2. Today's topics: (1) Problems for NGS data archive: managing large-scale data. (2) Kusarinoko project, for better way to search and browse: metadata, fix and add. (3) Inside of Sequence Read Archive: statistics of SRA reveals how it is.
    3. Problems for NGS data archive
    4. Cost of storing large-scale sequence data: Storing large-scale NGS data causes many problems (data transfer, storage, backup...). Metadata management is one big problem for a public NGS database (metadata: description of sequencing data, such as sample, sequencer platform, application, etc.). Fixing metadata is a lifeline for a public NGS database.
    5. Public NGS database, Sequence Read Archive: [Diagram] A lab / research institute submits sequence data with metadata (example record: Data ID: 000001, organism: mouse, cell: nervous cell, sequencer: 454, date: 2011 12 08) to the Sequence Read Archive. SRA, DRA and ENA exchange and share data through INSDC (int'l nucleotide seq DB collaboration).
    6. It cannot be easy to find the data you want: Over 55,000 submissions and over 350,000 sequence runs, and the amount and size of the data are still increasing. Metadata is provided separately (submission / study / experiment / sample / run) and is not described perfectly. Fixing metadata and adding extra information is NEEDED.
    7. Kusarinoko project, for better way to search and browse
    8. Aim of Kusarinoko project: Cutting the cost of using public SRA data (search, browse, download, check). Giving more resources to support using data (is the data really sound?).
    9. Integrate metadata, add extra information: [Diagram] Metadata (Submission.xml, Study.xml, Experiment.xml, Sample.xml, Run.xml) and sequence data are the inputs. PubMed IDs are obtained from sra.dbcls.jp, sequence quality is calculated by FastQC (FastQC result), and Kusarinoko integrates them.
    10. Limitation and features: Covers only data that has at least one published article; if a paper is not published yet, Kusarinoko cannot find it (publication info: sra.dbcls.jp). Quality checking is still a beta version: still validating and trying to offer better information, which will take more time.
    11. http://g86.dbcls.jp/kusarinoko or google "kusarinoko"
    12. Inside of Sequence Read Archive
    13. Statistics for stepping into SRA: Statistics of SRA by publication and sequence quality. ONLY PUBLIC NGS DATA IN SRA WHICH HAS PUBLICATION. Detailed stats will be available online at the project website soon.
    14. Platform trend statistics: [Chart] Number of submissions, 2007~2011. Legend: Blue: Roche, Yellow: Illumina, Green: AB, Pink: Helicos, Red: PacBio.
    15. Journal statistics: [Chart] Number of PubMed IDs, colored by library type. Legend: Blue: genomic, Red: transcriptomic, Brown: metagenomic, Yellow: synthetic, Purple: Viral RNA, Green: non genomic, (unidentified). Total 97 journals; total # of pmid: 587.
    16. Minimum read length vs average quality value: [Chart] Quick quality calc; total average qual (phred). Legend: Blue: Roche, Yellow: Illumina, Green: AB, Pink: Helicos, Red: PacBio. Same as max read length. Total # of items (run): 16,006 (continuing).
    17. Total number of reads vs N content: [Chart] Total N content rate; no correlation with number of reads or library prep methods. Total # of items (run): 16,006 (continuing).
    18. Total number of reads vs duplication rate: [Chart] Total sequence duplication; same as previous stat, the amount of reads seems not to affect duplication. Total # of items (run): 16,006 (continuing).
    19. Conclusion
    20. Conclusion: for making use of public resources. Developed a service to help searching and browsing SRA data: publication information and the results of quality checks support the metadata. Statistics revealed the inside of SRA and gave some insights: they showed NGS trends, and some items do not have enough quality even if they have a published article. Detailed results and more at poster presentation: 2P-0132 (today).
    21. Thank You
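
The statistics slides report three per-run measures: average quality (phred), N content rate, and sequence duplication. The talk computes them with FastQC; the block below is not FastQC's algorithm, only a minimal Python sketch of what such measures might look like on a small FASTQ file. The function names (`read_fastq`, `run_metrics`), the Phred+33 quality encoding, and the definition of duplication as "reads whose exact sequence was already seen" are all assumptions made for illustration.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (sequence, quality_string) pairs from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:            # end of file
            return
        if not header.strip():    # skip stray blank lines
            continue
        seq = handle.readline().strip()
        handle.readline()         # '+' separator line, ignored
        qual = handle.readline().strip()
        yield seq, qual


def run_metrics(records, phred_offset=33):
    """Summarise one run: mean Phred quality, N content rate, duplication rate.

    Assumption: qualities are Phred+33 encoded. Duplication is counted on
    exact sequence matches, which is a simplification.
    """
    total_bases = 0
    total_qual = 0
    n_bases = 0
    seq_counts = Counter()
    for seq, qual in records:
        total_bases += len(seq)
        total_qual += sum(ord(c) - phred_offset for c in qual)
        n_bases += seq.upper().count("N")
        seq_counts[seq] += 1
    reads = sum(seq_counts.values())
    if reads == 0 or total_bases == 0:
        raise ValueError("no reads found")
    duplicates = reads - len(seq_counts)  # reads beyond the first copy of each sequence
    return {
        "reads": reads,
        "mean_quality": total_qual / total_bases,
        "n_content_rate": n_bases / total_bases,
        "duplication_rate": duplicates / reads,
    }


if __name__ == "__main__":
    # Tiny made-up example: three reads, two of them identical, one containing an N.
    example = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nNCGT\n+\n!III\n"
    print(run_metrics(read_fastq(StringIO(example))))
```

Plotting `reads` against `n_content_rate` or `duplication_rate` for many runs would give scatter plots of the kind shown on slides 17 and 18.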
