Kusarinoko: developing the public next generation sequencing data search interface that works.

Tazro Ohta
Database Center for Life Science
Research Organization of Information and Systems
Problems for NGS data archives
managing large-scale data


Kusarinoko project: a better way to search and browse
fixing and adding metadata


Inside the Sequence Read Archive
what SRA statistics reveal about its contents




Today’s topics
Problems for NGS data archives

Storing large-scale NGS data causes many problems
data transfer, storage, backup...

Metadata management is a major problem for public NGS databases
metadata: the description of the sequencing data (sample, sequencer platform, application, etc.)


Fixing metadata is a lifeline for public NGS databases




Cost of storing large-scale sequence data
[Diagram: a lab or research institute submits sequence data with metadata (Data ID: 000001, organism: mouse, cell: nervous cell, sequencer: 454, date: 2011 12 08) to the Sequence Read Archive. SRA, DRA, and ENA exchange and share the data through INSDC (int'l nucleotide seq DB collaboration).]

Public NGS database, Sequence Read Archive
Over 55,000 submissions, over 350,000 sequence runs
and the amount and size of the data are still increasing

Metadata is provided separately, and is not described perfectly
submission / study / experiment / sample / run


Fixing metadata and adding extra information are NEEDED




It is not easy to find the data you want
Kusarinoko project: a better way to search and browse

Cutting the cost of using public SRA data
search, browse, download, check


Giving more resources to support the use of the data
is the data really sound?




Aim of Kusarinoko project
[Diagram: Kusarinoko integrates three inputs.
  metadata: Submission.xml, Study.xml, Experiment.xml, Sample.xml, Run.xml
  pubmed ID: get from sra.dbcls.jp
  FastQC result: calculate seq quality from the sequence data by FastQC]

Integrate metadata, add extra information
Covering only data that has at least one published article
if a paper is not published yet, Kusarinoko cannot find the data. publication info: sra.dbcls.jp


Quality checking is still in beta
still validating and trying to offer better information; this will take more time




Limitations and features
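
The integration step and the publication filter described above can be sketched in Python. This is a minimal sketch under stated assumptions, not Kusarinoko's actual implementation: the XML snippets are simplified stand-ins for Experiment.xml and Run.xml, the accessions and the PubMed ID are made up, and `pubmed_by_study` and `fastqc_by_run` are hypothetical placeholders for the data obtained from sra.dbcls.jp and from FastQC.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up stand-ins for Experiment.xml and Run.xml.
EXPERIMENT_XML = """
<EXPERIMENT_SET>
  <EXPERIMENT accession="SRX000001">
    <STUDY_REF accession="SRP000001"/>
    <PLATFORM><LS454/></PLATFORM>
  </EXPERIMENT>
</EXPERIMENT_SET>
"""

RUN_XML = """
<RUN_SET>
  <RUN accession="SRR000001">
    <EXPERIMENT_REF accession="SRX000001"/>
  </RUN>
  <RUN accession="SRR000002">
    <EXPERIMENT_REF accession="SRX000099"/>
  </RUN>
</RUN_SET>
"""

# Hypothetical extra information, keyed by accession.
pubmed_by_study = {"SRP000001": ["00000000"]}
fastqc_by_run = {"SRR000001": {"mean_quality": 30.5}}


def integrate(experiment_xml, run_xml, pubmed, fastqc):
    """Join run -> experiment -> study, then attach PubMed IDs and QC results."""
    experiments = {}
    for exp in ET.fromstring(experiment_xml).iter("EXPERIMENT"):
        platform = exp.find("PLATFORM")
        has_platform = platform is not None and len(platform) > 0
        experiments[exp.get("accession")] = {
            "study": exp.find("STUDY_REF").get("accession"),
            # The platform is encoded as the name of the first child element.
            "platform": platform[0].tag if has_platform else None,
        }

    records = []
    for run in ET.fromstring(run_xml).iter("RUN"):
        run_id = run.get("accession")
        exp_id = run.find("EXPERIMENT_REF").get("accession")
        exp = experiments.get(exp_id)
        if exp is None:
            continue  # incomplete metadata: no matching experiment record
        pmids = pubmed.get(exp["study"], [])
        if not pmids:
            continue  # keep only data with at least one published article
        records.append({
            "run": run_id,
            "experiment": exp_id,
            "study": exp["study"],
            "platform": exp["platform"],
            "pubmed_ids": pmids,
            "fastqc": fastqc.get(run_id),  # None if quality is not yet computed
        })
    return records


records = integrate(EXPERIMENT_XML, RUN_XML, pubmed_by_study, fastqc_by_run)
```

Keying every lookup by accession keeps the five metadata files loosely coupled, so a run whose experiment record is missing is skipped rather than breaking the whole join.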
http://g86.dbcls.jp/kusarinoko or google “kusarinoko”
Inside the Sequence Read Archive

Statistics of SRA by publication and sequence quality

  ONLY PUBLIC NGS DATA IN SRA THAT HAS A PUBLICATION

  Detailed statistics will be available online at the project website soon




Statistics for stepping into SRA
[Chart: number of submissions, 2007~2011, colored by platform.
Blue: Roche, Yellow: Illumina, Green: AB, Pink: Helicos, Red: PacBio]

platform trend statistics
[Chart: number of PubMed IDs per journal, colored by library type.
Blue: genomic, Red: transcriptomic, Brown: metagenomic, Yellow: synthetic, Purple: Viral RNA, Green: non genomic (unidentified)]

total 97 journals
total # of pmid: 587

Journal statistics
[Chart: quick quality calc; total average qual (phred), colored by platform.
Blue: Roche, Yellow: Illumina, Green: AB, Pink: Helicos, Red: PacBio]

same as for max read length

total # of items (run): 16,006 (continuing)

minimum read length vs average quality value
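
The two axes of this plot can be computed per run roughly as follows. This is a minimal sketch, not the FastQC computation used in the project: `mean_phred` and `read_length_range` are hypothetical helper names, the reads are made up, and the quality strings are assumed to use Phred+33 encoding.

```python
def mean_phred(quality_strings, offset=33):
    """Average Phred score over every base of every read.

    Assumes Phred+33 FASTQ encoding (score = ASCII code minus 33).
    """
    total = 0
    bases = 0
    for qual in quality_strings:
        total += sum(ord(ch) - offset for ch in qual)
        bases += len(qual)
    return total / bases if bases else 0.0


def read_length_range(sequences):
    """Return (minimum, maximum) read length, or (0, 0) when there are no reads."""
    lengths = [len(seq) for seq in sequences]
    return (min(lengths), max(lengths)) if lengths else (0, 0)


# Made-up reads for illustration: 'I' encodes Phred 40, '5' encodes Phred 20.
quals = ["IIII", "55"]
seqs = ["ACGT", "AC"]
average_quality = mean_phred(quals)          # (4*40 + 2*20) / 6 bases
shortest, longest = read_length_range(seqs)
```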
[Chart: total N content rate]

no correlation with number of reads or library prep methods

total # of items (run): 16,006 (continuing)

total number of reads vs N content
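
For reference, an N content rate for one run can be computed as below. This is a minimal sketch assuming the rate is defined as the fraction of all base calls that are N; the slides do not spell out the exact definition, and `n_content_rate` is a hypothetical helper name.

```python
def n_content_rate(sequences):
    """Fraction of base calls that are N (uncalled), over all reads in a run."""
    total_bases = sum(len(seq) for seq in sequences)
    if total_bases == 0:
        return 0.0
    n_bases = sum(seq.upper().count("N") for seq in sequences)
    return n_bases / total_bases


# Made-up reads: 3 N calls out of 15 bases.
reads = ["ACGTN", "NNGTA", "ACGTA"]
rate = n_content_rate(reads)
```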
[Chart: total sequence duplication]

same as the previous stat: the amount of reads does not seem to affect duplication

total # of items (run): 16,006 (continuing)

total number of reads vs duplication rate
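
A simple duplication rate can be sketched as follows. This assumes the simplest definition, the fraction of reads that exactly repeat an already-seen sequence; FastQC reports its own estimate, so the numbers will not necessarily match. `duplication_rate` is a hypothetical helper name and the reads are made up.

```python
from collections import Counter


def duplication_rate(sequences):
    """Fraction of reads that are exact repeats of an already-seen sequence."""
    if not sequences:
        return 0.0
    distinct = len(Counter(sequences))
    return 1 - distinct / len(sequences)


# Made-up reads: 3 distinct sequences among 5 reads.
reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGCC"]
rate = duplication_rate(reads)
```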
Conclusion
Developed a service to help search and browse SRA data
publication information and quality-check results support the metadata.


Statistics revealed the inside of SRA and gave some insights
they showed NGS trends, and some items do not have sufficient quality even though they have a published article.


Detailed results and more at poster presentation: 2P-0132 (today)



Conclusion: for making use of public resources
Thank You
