SlideShare a Scribd company logo
1 of 53
Sequence Read Archive
Validation, Archival, and Distribution
of Raw Sequencing Data
By Eugene Yaschenko – Head of Molecular Software Section, NCBI
11 years of software development
• Introduction
• Mass production revolution in Sequencing
• Trace Archive
• Next Generation of Massively Parallel
Sequencing
• Sequence Read Archive
• SRA Toolkit
In 30 minutes
About Sequence Read Archive
The Sequence Read Archive (SRA) was created and engineered at the National Center for
Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes
of Health (NIH), Department of Health and Human Services.
The SRA is part of a cluster of sequencing data repositories called the "Trace Archives“, and
is located under the "Primary Data Archives" at NCBI, which includes GenBank.
The SRA is part of the International Nucleotide Sequence Database Collaboration (INSDC).
The data model, data transfer protocols, and accession space are shared with the INSDC
collaborators: European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan
(DDBJ).
NCBI was created by Congress in 1988 to develop information systems, such as
GenBank, to support the biomedical research community. NCBI was also mandated
to conduct basic and applied research and, as part of the NIH Intramural Program,
NCBI scientists work in areas of gene and genome analysis, computational structural
biology and mathematical methods for sequence analysis.
As a part of National Library of Medicine, NCBI also creates archives of Medical and
Biological literature, consumer health information sites.
NCBI is most widely known for
Bioinformatics resources at NCBI
Primary Data Archives
Primary Data Archives are submitter-driven.
Data Archive is different from file archive. It stores data not original files.
Data Archive is not necessarily lossless. Some controlled loss of information or precision
should be allowed.
Internal storage format should be sufficiently uniform to enable validation, searching, sub-
setting, etc...
Extra effort is needed to support conversion from input formats as well as produce output
formats. Large variety of formats significantly stresses archive’s resources.
Additional benefit of conversion is that all data are validated before archival.
Primary Data Archives
GenBank
• Internal storage: ASN.1
• Inputs: GenBank, EMBL, FASTA, GFF, ASN.1, etc…
• Bulk Outputs: GenBank, FASTA, ASN.1, BlastDB
TraceArchive
• Internal storage: Relational Database
• Inputs: SFF, ZTR, SCF, ABI
• Bulk Outputs: FASTA, Quality
SRA
• Internal storage: SRA format
• Inputs: SFF, SRF, FASTQ, SOLiD native, Illumina native, etc…
• Bulk Output: SRA format
RAW
Data
Archives
Biologist’s perception of Raw Data Archive
Bioinformatician’s perception of Raw Data Archive
But similar perspective
Mass production in Sequencing
1998 - Applied Biosystems and Hitachi develops revolutionary ABI
PRISM 3700 DNA Analyzer based on Sanger sequencing
Fully automated sequencing, lab technicians were needed to
prepare and load biological samples
A few years later updated 3730 DNA Analyzer significantly
improved automation and data quality
The era of mass production of sequences begins. Since there were
very little manual supervision of produced data – we labeled it as
raw
Trace Archive – first steps
Created by request of Mouse Sequencing Consortium in 2000
• Requested to contain 55 million records
• Input file format 200Kbyte per record
• Total input data size 11Tb (Total Usenet was 1.5 Tb)
Data Compression
• Put a lot of effort into compressing signal data
• Managed to reduce footprint of a single record to < 20Kb
Data Modeling
• Relational Approach – model all data and metadata as a rectangular table
• Normalize low-cardinality metadata
• Put bulky columns in separate tables
• Use data-warehousing databases for indexing
Trace Archive Layout
Metadata Data
1
2
3
4
5
6
7
Design
Metadata
1
2
3
4
5
6
7
RDBMS Data
1
2
3
4
5
6
7
Data
1
2
3
4
5
6
7
Data
1
2
3
4
5
6
7
Metadata
1
2
3
4
5
6
7
DataWarehouse
Trace Archive User Experience
Index
User Web Server
Data Storage
query
data
list of ids
list of ids
update
Trace Archive – growth
0
50
100
150
200
250
300
Jun-00 Jan-01 Jul-01 Feb-02 Sep-02 Mar-03 Oct-03
Millions
Number of records
Design requirement
Trace Archive went on beyond mouse
The design requirements were exceeded in a year
Prompted design corrections
•Vertical split of tables with large data
•Moving split tables into separate databases
•Finally, replacing non-active databases with a file-based storage
Next Generation Sequencing
2005 – Trace Archive was approached by “454”
company about archiving new type of data
NCBI and 454 develop a strategy of storing this data in
Trace Archive
We were surprised to learn that technology is capable
of producing 400,000 reads in 7 hours, but we did not
suspect that this is just a beginning of new wave in the
Sequencing Revolution
Next Generation Sequencing
2006 – We start to hear reports about new
technology from Solexa capable of producing
tens of millions sequences in just a few days
We come to realization that Trace Archive
design is obsolete and will not be able to
handle new wave
We started to look for new approaches in
designing Sequence Read Archive
Trace Archive limits
Global integer identifier (ti) for all sequences
Data and metadata are mixed as columns in a simple rectangular table
Number of traces exceeded 2 billions
No convenient identifier for common project, sample, and methodology
Every search results in a large list of integer ids to be retrieved one-by-one
Trace Archive most important lessons
Model Data and Metadata separately
Metadata
•Model metadata not only for completeness but by ability to easily group the data
•Use of commercial relational databases is acceptable
•Heavy indexing is the requirement
•Assign accessions (simple identifiers) to metadata objects
Data
•Model data in uniform and compact format
•Try to make this format usable outside of the archive
•Use regular file system instead of relational databases
Software Toolkit
•Support internal format through software toolkit
New!
Trace Archive –
dangerously high growth
0
500
1,000
1,500
Jun-00 Nov-01 Mar-03 Aug-04 Dec-05 Apr-07
Millions
Number of records
Design requirement
454 submission started to increase
Trace Archive - crisis averted
0
500
1,000
1,500
2,000
Jun-00 Nov-01 Mar-03 Aug-04 Dec-05 Apr-07 Sep-08 Jan-10 Jun-11
Millions
Number of records
454 submission diverted to SRA
Centerfold – Nature, April 2010
0.1
1
10
100
1000
2000-12-01
2001-05-01
2001-10-01
2002-03-01
2002-08-01
2003-01-01
2003-06-01
2003-11-01
2004-04-01
2004-09-01
2005-02-01
2005-07-01
2005-12-01
2006-05-01
2006-10-01
2007-03-01
2007-08-01
2008-01-01
2008-06-01
2008-11-01
2009-04-01
2009-09-01
2010-02-01
2010-07-01
2010-12-01
2011-05-01
Trillions
(log
scale)
Cumulative Trace/SRA Holdings
Cumulative Trace (bp)
Cumulative SRA (bp)
Total
SRA design challenges
Multiple sequencing platforms
•Roche 454
•Illumina Genome Analyzer
•Applied Biosystems SOLiD System
•Helicos Heliscope
•Complete Genomics
•Pacific Biosciences SMRT
•Ion Torrent
Multiple input formats
•454 Standard Flowgram Format (SFF)
•Sequence Read Format (SRF)
•ASCII record-based formats (fasta, fastq, etc..)
•ASCII tabular formats (qseq, etc…)
•HDF5
Multiple sequencing application
•Whole Genome Sequencing
•Whole Exome Sequencing
•Medical resequencing
•Metagenome sequencing
•Functional Genome
Storage size per base pair is important
SRA Model
Metadata is lifted to the level of physical
organization of the platform
Metadata is modeled with XML schema
Metadata is stored in XML-aware RDBMS
Data is order-independent array of internal elements
of the platform
Data is divided into parallel series, compressed and
stored in binary files.
SRA User Experience
Indexed Metadata
User Web Server
query
file ids
File Servers – http/ftp/fasp/etc..
Comparative Table Layout
Metadata Data
1
2
3
4
5
6
7
Metadata
1
2
3
4
5
6
7
Millions
of
rows
Billions
of
rows
Data
Data
Data
Data
Data
Millions
of
rows
Trace Archive Sequence Read Archive
SRA Data Processing Pipeline
Sequencing Machine generates data in machine format
Data is converted into common storage format
Data is stored and indexed for retrieval
Data is converted into user preferred format
Data is analyzed
Toolkit
Toolkit
Past Present Future
Raw Data Thousands
Bytes/Base
SRA Data – common denominator
Raw Data
Signal Data 4 - 50
Thousands
Bytes/Base
SRA Data – common denominator
Raw Data
Signal Data 4 - 50
Thousands
Basecalls and Quality 0.6 - 0.9
Bytes/Base
SRA Data – common denominator
Signal Data 4 - 50
Thousands
Basecalls and Quality 0.6 - 0.9
Bytes/Base
original SRA
Raw Data (not stored)
1 - 50
SRA Data – common denominator
Signal Data (not stored*) 4 - 50
Thousands
Basecalls and Quality 0.6 - 0.9
Bytes/Base
current SRA
Raw Data (not stored)
* All signal data was removed from
the archive, except for the 454 platform
0.6 - 0.9
SRA Data – common denominator
SRA SDK
Divide data into series, store by column
Eliminate repeated cells
Apply type-aware compression
Tune for retrieval speed vs. size tradeoff
Fully indexed long-term online/nearline archive storage
Evolution-aware backward compatibility
SRA SDK VDB
“Vertical” database
“Virtual” database
Fagernes Airport, Leirin (ENFG/VDB) | Victor David Brenner
(designer of a U.S. one-cent coin) | Paul Vanden Boeynants,
Belgian prime minister | Frank Vandenbroucke, cyclist |
Vrijzinnig Democratische Bond, a Dutch political party | Video
Data Bank | Virtual Database | Visual Database | Voluntary
Denied Boarding | Verband der Bahnindustrie in Deutschland
e.V., the German railway industry association | /var/db/pkg, the
installed package database in FreeBSD and Gentoo Linux
SRA SDK VDB
Columns group like-data for better compression
Good for accessing a reduced set of columns
Schema-driven
Virtualization layer – data from store or created from available
information
Highly distributed - takes advantage of host file-system
Flexible - allows removal of individual columns in archive
Write Side of SRA Toolkit
Input Data Format
Format
Parser
Input
Columns
A
B
D
C
Writing
Schema
XF1
XF2
Physical
Columns
B
X
Y
Z
XF3
Encoding
Schema
EF1
EF1
EF2
EF2
Serialization
B
X
Y
Z
Directory
Structure
SRA File Format
kdb
vdb
Read Side of SRA Toolkit
Deserialization
B
X
Y
Z
SRA Data Directory
or SRA File Format
Physical
Columns
B
X
Y
Z
Decoding
Schema
DF1
DF1
DF2
DF2
Read
Schema
XF5
XF6
XF1
Output
Columns
A
B
D
C
Format
Generator
Output File Format
API to Read
Columns
Schema Examples
• simple assignment
ascii S = B;
• assignment using plugin function
U32 A= < U32> clip < 15 , 10000 > ( B);
• conditional assignment
ascii S = S1 | S2;
• Defining encoding rule using plugin
functions
physical < type T > T izip_encoding #1.0
{
decode { return ( T ) iunzip ( @ ); }
encode { return izip ( @ ); }
};
• Using encoding rules
extern column < U32> izip_encoding A;
extern column < I64> izip_encoding C;
Schema: using existing functions
AATCGGTAACCG
pack<8,2> (@)
{0,0,3,1,2,2,3,0,0,1,1,2}
<ascii,U8>map<'ACGT', [ 0, 1, 2, 3 ] > ( @)
00001101,10101100,00010110
Privacy Issues
Homo sapiens
61%
human
metagenome
6%
Mus musculus
5%
other metagenome
4%
Drosophila melanogaster
2%
Plasmodium falciparum
2%
Mustela putorius furo
1%
Danio rerio
1%
Latimeria chalumnae
1%
Otolemur garnettii
0%
OTHER
17%
SRA Content by organism
Amount of human sequencing will only increase
In a series of studies, NIH is planning to sequence
tens of thousands people
Genetic data is highly personal
Most of human sequencing data will require careful
protection from unauthorized distribution
Encrypted SRA format
Single file SRA Format
32Kb 32Kb 32Kb 32Kb 32Kb 15Kb
32Kb 32Kb 32Kb 32Kb 32Kb
256 bit chained encryption
• Original file information
• Length, checksums
• Extra parameters used for
randomizing encryption
• Encrypted block checksum and length
• Needed for checking transfers without
decrypting the file
Compression by Reference
Currently, massively parallel sequencing created enough
fragments to cover subject’s genome several times (4x-100x)
This is done to improve signal/noise ratio
In many studies the very next step after sequencing is
mapping to the Reference Genome
After the mapping, large fraction of fragments perfectly
match the Reference
cSRA
References
Alignments
Sequences
Basic components of aligned sequences
Redundancy in aligned data
Many sequences repeat
Only need to store differences
cSRA
References
Unaligned
Sequences
Alignments
Aligned
Sequences
Unaligned sequences can be
separated from aligned
cSRA
References
Unaligned
Sequences
Alignments
Difference
From Reference
All matching bases are dropped.
Only the mismatched bases are stored
Qualities are retained.
Unaligned sequences are treated as
traditional SRA data
cSRA
Reference Difference
Original
sequence
+ =
The original sequence may be restored by applying a
function to the reference and the stored differences.
There is no loss of information.
cSRA
References
Unaligned
Sequences
Alignments
Difference
From
Reference
Remote
References
Local
References
•References to GenBank and RefSeq accessions are typically stored as Remote
•References to de-novo assemblies or modified sequences will be stored as Local
•Local Reference may be usefull for storing Transcriptome and Metagenome
assemblies, which will make those projects completely contained within SRA
cSRA
Reference
Primary Alignment
Secondary Alignment
Sequence
Needs: Produces:
Access to
Remote Ref.
Sequences
From Ref.
Primary
Alignment
Primary Alignment
For mapped reads
Restored
Sequences
Restored
Sequences
All Basecalls
And Qualities
Local Sequences, Statistics
Placements, Mismatches, Indels
Alternate Placements, Indels
Qualities, unplaced reads, groups
VDB-2 table structure for “Archive” version
Stores:
New Data Dimension
Metadata
1
2
3
4
5
6
7
Reference
Position
Data
Data
Data
Data
Data
Millions
of
rows
Sequence Read Archive
Data1 Data2 Data3 Data4
Millions
of
rows
SRA Multisample View
Query
Range
Credits
SRA Toolkit Development sra-tools@ncbi.nlm.nih.gov
Kenneth Durbrow
Anton Golikov
William Killian
Andrei Klymenko
Wolfgang Raetz
Kurt Rodarmer
Eugene Yaschenko yaschenk@ncbi.nlm.nih.gov
SRA Pipeline Development
Michael Kimelman
Yuri Ostapchuk
Project manager:
Reza Safarnejad
SRA Curators
Zinaida Belaya
Christopher O'Sullivan
Robert Sanders
Martin Shumway
Yuriy Skripchenko
Adam Stine
SRA Website
Sergiy Ponomarev

More Related Content

Similar to SRA-System (7).ppsx

Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
Rdap12 wrap up reagan moore
Rdap12 wrap up reagan mooreRdap12 wrap up reagan moore
Rdap12 wrap up reagan mooreASIS&T
 
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...
Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...OSTHUS
 
Data base and data entry presentation by mj n somya
Data base and data entry presentation by mj n somyaData base and data entry presentation by mj n somya
Data base and data entry presentation by mj n somyaMukesh Jaiswal
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis Ryft
 
Data Indexing Presentation-My.pptppt.ppt
Data Indexing Presentation-My.pptppt.pptData Indexing Presentation-My.pptppt.ppt
Data Indexing Presentation-My.pptppt.pptsdsm2
 
Research data catalogues and data interoperability in life sciences
Research data catalogues and data interoperability in life sciencesResearch data catalogues and data interoperability in life sciences
Research data catalogues and data interoperability in life sciencesBlue BRIDGE
 
Csci12 report aug18
Csci12 report aug18Csci12 report aug18
Csci12 report aug18karenostil
 
Data Archiving and Sharing
Data Archiving and SharingData Archiving and Sharing
Data Archiving and SharingC. Tobin Magle
 
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, JapanISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, JapanPhilippe Rocca-Serra
 
Primary, secondary, tertiary biological database
Primary, secondary, tertiary biological databasePrimary, secondary, tertiary biological database
Primary, secondary, tertiary biological databaseKAUSHAL SAHU
 
Big data analysing genomics and the bdg project
Big data   analysing genomics and the bdg projectBig data   analysing genomics and the bdg project
Big data analysing genomics and the bdg projectsree navya
 
MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)Nikos Palavitsinis, PhD
 
Tripal within the Arabidopsis Information Portal - PAG XXIII
Tripal within the Arabidopsis Information Portal - PAG XXIIITripal within the Arabidopsis Information Portal - PAG XXIII
Tripal within the Arabidopsis Information Portal - PAG XXIIIVivek Krishnakumar
 

Similar to SRA-System (7).ppsx (20)

Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Rdap12 wrap up reagan moore
Rdap12 wrap up reagan mooreRdap12 wrap up reagan moore
Rdap12 wrap up reagan moore
 
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...
Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...
 
Data base and data entry presentation by mj n somya
Data base and data entry presentation by mj n somyaData base and data entry presentation by mj n somya
Data base and data entry presentation by mj n somya
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis
 
Data Indexing Presentation-My.pptppt.ppt
Data Indexing Presentation-My.pptppt.pptData Indexing Presentation-My.pptppt.ppt
Data Indexing Presentation-My.pptppt.ppt
 
Research data catalogues and data interoperability in life sciences
Research data catalogues and data interoperability in life sciencesResearch data catalogues and data interoperability in life sciences
Research data catalogues and data interoperability in life sciences
 
Csci12 report aug18
Csci12 report aug18Csci12 report aug18
Csci12 report aug18
 
Data Archiving and Sharing
Data Archiving and SharingData Archiving and Sharing
Data Archiving and Sharing
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, JapanISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
 
Primary, secondary, tertiary biological database
Primary, secondary, tertiary biological databasePrimary, secondary, tertiary biological database
Primary, secondary, tertiary biological database
 
Big data analysing genomics and the bdg project
Big data   analysing genomics and the bdg projectBig data   analysing genomics and the bdg project
Big data analysing genomics and the bdg project
 
MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)
 
Tripal within the Arabidopsis Information Portal - PAG XXIII
Tripal within the Arabidopsis Information Portal - PAG XXIIITripal within the Arabidopsis Information Portal - PAG XXIII
Tripal within the Arabidopsis Information Portal - PAG XXIII
 

Recently uploaded

Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 

Recently uploaded (20)

Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 

SRA-System (7).ppsx

  • 1. Sequence Read Archive Validation, Archival, and Distribution of Raw Sequencing Data By Eugene Yaschenko – Head of Molecular Software Section, NCBI
  • 2. 11 years of software development • Introduction • Mass production revolution in Sequencing • Trace Archive • Next Generation of Massively Parallel Sequencing • Sequence Read Archive • SRA Toolkit In 30 minutes
  • 3. About Sequence Read Archive The Sequence Read Archive (SRA) was created and engineered at the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Department of Health and Human Services. The SRA is part of a cluster of sequencing data repositories called the "Trace Archives“, and is located under the "Primary Data Archives" at NCBI, which includes GenBank. The SRA is part of the International Nucleotide Sequence Database Collaboration (INSDC). The data model, data transfer protocols, and accession space are shared with the INSDC collaborators: European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ).
  • 4. NCBI was created by Congress in 1988 to develop information systems, such as GenBank, to support the biomedical research community. NCBI was also mandated to conduct basic and applied research and, as part of the NIH Intramural Program, NCBI scientists work in areas of gene and genome analysis, computational structural biology and mathematical methods for sequence analysis. As a part of National Library of Medicine, NCBI also creates archives of Medical and Biological literature, consumer health information sites.
  • 5. NCBI is most widely known for
  • 7. Primary Data Archives Primary Data Archives are submitter-driven. Data Archive is different from file archive. It stores data not original files. Data Archive is not necessarily lossless. Some controlled loss of information or precision should be allowed. Internal storage format should be sufficiently uniform to enable validation, searching, sub- setting, etc... Extra effort is needed to support conversion from input formats as well as produce output formats. Large variety of formats significantly stresses archive’s resources. Additional benefit of conversion is that all data are validated before archival.
  • 8. Primary Data Archives GenBank • Internal storage: ASN.1 • Inputs: GenBank, EMBL, FASTA, GFF, ASN.1, etc… • Bulk Outputs: GenBank, FASTA, ASN.1, BlastDB TraceArchive • Internal storage: Relational Database • Inputs: SFF, ZTR, SCF, ABI • Bulk Outputs: FASTA, Quality SRA • Internal storage: SRA format • Inputs: SFF, SRF, FASTQ, SOLiD native, Illumina native, etc… • Bulk Output: SRA format RAW Data Archives
  • 9. Biologist’s perception of Raw Data Archive Bioinformatician’s perception of Raw Data Archive
  • 11. Mass production in Sequencing 1998 - Applied Biosystems and Hitachi develops revolutionary ABI PRISM 3700 DNA Analyzer based on Sanger sequencing Fully automated sequencing, lab technicians were needed to prepare and load biological samples A few years later updated 3730 DNA Analyzer significantly improved automation and data quality The era of mass production of sequences begins. Since there were very little manual supervision of produced data – we labeled it as raw
  • 12. Trace Archive – first steps Created by request of Mouse Sequencing Consortium in 2000 • Requested to contain 55 million records • Input file format 200Kbyte per record • Total input data size 11Tb (Total Usenet was 1.5 Tb) Data Compression • Put a lot of effort into compressing signal data • Managed to reduce footprint of a single record to < 20Kb Data Modeling • Relational Approach – model all data and metadata as a rectangular table • Normalize low-cardinality metadata • Put bulky columns in separate tables • Use data-warehousing databases for indexing
  • 13. Trace Archive Layout Metadata Data 1 2 3 4 5 6 7 Design Metadata 1 2 3 4 5 6 7 RDBMS Data 1 2 3 4 5 6 7 Data 1 2 3 4 5 6 7 Data 1 2 3 4 5 6 7 Metadata 1 2 3 4 5 6 7 DataWarehouse
  • 14. Trace Archive User Experience Index User Web Server Data Storage query data list of ids list of ids update
  • 15. Trace Archive – growth 0 50 100 150 200 250 300 Jun-00 Jan-01 Jul-01 Feb-02 Sep-02 Mar-03 Oct-03 Millions Number of records Design requirement Trace Archive went on beyond mouse The design requirements were exceeded in a year Prompted design corrections •Vertical split of tables with large data •Moving split tables into separate databases •Finally, replacing non-active databases with a file-based storage
  • 16. Next Generation Sequencing 2005 – Trace Archive was approached by “454” company about archiving new type of data NCBI and 454 develop a strategy of storing this data in Trace Archive We were surprised to learn that technology is capable of producing 400,000 reads in 7 hours, but we did not suspect that this is just a beginning of new wave in the Sequencing Revolution
  • 17. Next Generation Sequencing 2006 – We start to hear reports about new technology from Solexa capable of producing tens of millions sequences in just a few days We come to realization that Trace Archive design is obsolete and will not be able to handle new wave We started to look for new approaches in designing Sequence Read Archive
  • 18. Trace Archive limits Global integer identifier (ti) for all sequences Data and metadata are mixed as columns in a simple rectangular table Number of traces exceeded 2 billions No convenient identifier for common project, sample, and methodology Every search results in a large list of integer ids to be retrieved one-by-one
  • 19. Trace Archive most important lessons Model Data and Metadata separately Metadata •Model metadata not only for completeness but by ability to easily group the data •Use of commercial relational databases is acceptable •Heavy indexing is the requirement •Assign accessions (simple identifiers) to metadata objects Data •Model data in uniform and compact format •Try to make this format usable outside of the archive •Use regular file system instead of relational databases Software Toolkit •Support internal format through software toolkit New!
  • 20. Trace Archive – dangerously high growth 0 500 1,000 1,500 Jun-00 Nov-01 Mar-03 Aug-04 Dec-05 Apr-07 Millions Number of records Design requirement 454 submission started to increase
  • 21. Trace Archive - crisis averted 0 500 1,000 1,500 2,000 Jun-00 Nov-01 Mar-03 Aug-04 Dec-05 Apr-07 Sep-08 Jan-10 Jun-11 Millions Number of records 454 submission diverted to SRA
  • 24. SRA design challenges Multiple sequencing platforms •Roche 454 •Illumina Genome Analyzer •Applied Biosystems SOLiD System •Helicos Heliscope •Complete Genomics •Pacific Biosciences SMRT •Ion Torrent Multiple input formats •454 Standard Flowgram Format (SFF) •Sequence Read Format (SRF) •ASCII record-based formats (fasta, fastq, etc..) •ASCII tabular formats (qseq, etc…) •HDF5 Multiple sequencing application •Whole Genome Sequencing •Whole Exome Sequencing •Medical resequencing •Metagenome sequencing •Functional Genome
  • 25. Storage size per base pair is important
  • 26. SRA Model Metadata is lifted to the level of physical organization of the platform Metadata is modeled with XML schema Metadata is stored in XML-aware RDBMS Data is order-independent array of internal elements of the platform Data is divided into parallel series, compressed and stored in binary files.
  • 27. SRA User Experience Indexed Metadata User Web Server query file ids File Servers – http/ftp/fasp/etc..
  • 28. Comparative Table Layout Metadata Data 1 2 3 4 5 6 7 Metadata 1 2 3 4 5 6 7 Millions of rows Billions of rows Data Data Data Data Data Millions of rows Trace Archive Sequence Read Archive
  • 29. SRA Data Processing Pipeline Sequencing Machine generates data in machine format Data is converted into common storage format Data is stored and indexed for retrieval Data is converted into user preferred format Data is analyzed Toolkit Toolkit Past Present Future
  • 30. Raw Data Thousands Bytes/Base SRA Data – common denominator
  • 31. Raw Data Signal Data 4 - 50 Thousands Bytes/Base SRA Data – common denominator
  • 32. Raw Data Signal Data 4 - 50 Thousands Basecalls and Quality 0.6 - 0.9 Bytes/Base SRA Data – common denominator
  • 33. Signal Data 4 - 50 Thousands Basecalls and Quality 0.6 - 0.9 Bytes/Base original SRA Raw Data (not stored) 1 - 50 SRA Data – common denominator
  • 34. Signal Data (not stored*) 4 - 50 Thousands Basecalls and Quality 0.6 - 0.9 Bytes/Base current SRA Raw Data (not stored) * All signal data was removed from the archive, except for the 454 platform 0.6 - 0.9 SRA Data – common denominator
  • 35. SRA SDK Divide data into series, store by column Eliminate repeated cells Apply type-aware compression Tune for retrieval speed vs. size tradeoff Fully indexed long-term online/nearline archive storage Evolution-aware backward compatibility
  • 36. SRA SDK VDB “Vertical” database “Virtual” database Fagernes Airport, Leirin (ENFG/VDB) | Victor David Brenner (designer of a U.S. one-cent coin) | Paul Vanden Boeynants, Belgian prime minister | Frank Vandenbroucke, cyclist | Vrijzinnig Democratische Bond, a Dutch political party | Video Data Bank | Virtual Database | Visual Database | Voluntary Denied Boarding | Verband der Bahnindustrie in Deutschland e.V., the German railway industry association | /var/db/pkg, the installed package database in FreeBSD and Gentoo Linux
  • 37. SRA SDK VDB Columns group like-data for better compression Good for accessing a reduced set of columns Schema-driven Virtualization layer – data from store or created from available information Highly distributed - takes advantage of host file-system Flexible - allows removal of individual columns in archive
  • 38. Write Side of SRA Toolkit Input Data Format Format Parser Input Columns A B D C Writing Schema XF1 XF2 Physical Columns B X Y Z XF3 Encoding Schema EF1 EF1 EF2 EF2 Serialization B X Y Z Directory Structure SRA File Format kdb vdb
  • 39. Read Side of SRA Toolkit Deserialization B X Y Z SRA Data Directory or SRA File Format Physical Columns B X Y Z Decoding Schema DF1 DF1 DF2 DF2 Read Schema XF5 XF6 XF1 Output Columns A B D C Format Generator Output File Format API to Read Columns
  • 40. Schema Examples • simple assignment ascii S = B; • assignment using plugin function U32 A= < U32> clip < 15 , 10000 > ( B); • conditional assignment ascii S = S1 | S2; • Defining encoding rule using plugin functions physical < type T > T izip_encoding #1.0 { decode { return ( T ) iunzip ( @ ); } encode { return izip ( @ ); } }; • Using encoding rules extern column < U32> izip_encoding A; extern column < I64> izip_encoding C;
  • 41. Schema: using existing functions AATCGGTAACCG pack<8,2> (@) {0,0,3,1,2,2,3,0,0,1,1,2} <ascii,U8>map<'ACGT', [ 0, 1, 2, 3 ] > ( @) 00001101,10101100,00010110
  • 42. Privacy Issues Homo sapiens 61% human metagenome 6% Mus musculus 5% other metagenome 4% Drosophila melanogaster 2% Plasmodium falciparum 2% Mustela putorius furo 1% Danio rerio 1% Latimeria chalumnae 1% Otolemur garnettii 0% OTHER 17% SRA Content by organism Amount of human sequencing will only increase In a series of studies, NIH is planning to sequence tens of thousands people Genetic data is highly personal Most of human sequencing data will require careful protection from unauthorized distribution
  • 43. Encrypted SRA format Single file SRA Format 32Kb 32Kb 32Kb 32Kb 32Kb 15Kb 32Kb 32Kb 32Kb 32Kb 32Kb 256 bit chained encryption • Original file information • Length, checksums • Extra parameters used for randomizing encryption • Encrypted block checksum and length • Needed for checking transfers without decrypting the file
  • 44. Compression by Reference Currently, massively parallel sequencing created enough fragments to cover subject’s genome several times (4x-100x) This is done to improve signal/noise ratio In many studies the very next step after sequencing is mapping to the Reference Genome After the mapping, large fraction of fragments perfectly match the Reference
  • 46. Redundancy in aligned data Many sequences repeat Only need to store differences
  • 48. cSRA References Unaligned Sequences Alignments Difference From Reference All matching bases are dropped. Only the mismatched bases are stored Qualities are retained. Unaligned sequences are treated as traditional SRA data
  • 49. cSRA Reference Difference Original sequence + = The original sequence may be restored by applying a function to the reference and the stored differences. There is no loss of information.
  • 50. cSRA References Unaligned Sequences Alignments Difference From Reference Remote References Local References •References to GenBank and RefSeq accessions are typically stored as Remote •References to de-novo assemblies or modified sequences will be stored as Local •Local Reference may be usefull for storing Transcriptome and Metagenome assemblies, which will make those projects completely contained within SRA
  • 51. cSRA Reference Primary Alignment Secondary Alignment Sequence Needs: Produces: Access to Remote Ref. Sequences From Ref. Primary Alignment Primary Alignment For mapped reads Restored Sequences Restored Sequences All Basecalls And Qualities Local Sequences, Statistics Placements, Mismatches, Indels Alternate Placements, Indels Qualities, unplaced reads, groups VDB-2 table structure for “Archive” version Stores:
  • 52. New Data Dimension Metadata 1 2 3 4 5 6 7 Reference Position Data Data Data Data Data Millions of rows Sequence Read Archive Data1 Data2 Data3 Data4 Millions of rows SRA Multisample View Query Range
  • 53. Credits SRA Toolkit Development sra-tools@ncbi.nlm.nih.gov Kenneth Durbrow Anton Golikov William Killian Andrei Klymenko Wolfgang Raetz Kurt Rodarmer Eugene Yaschenko yaschenk@ncbi.nlm.nih.gov SRA Pipeline Development Michael Kimelman Yuri Ostapchuk Project manager: Reza Safarnejad SRA Curators Zinaida Belaya Christopher O'Sullivan Robert Sanders Martin Shumway Yuriy Skripchenko Adam Stine SRA Website Sergiy Ponomarev