BITS: Basics of sequence databases


Module 1: Sequence databases.

Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training

Published in: Education, Technology

Transcript

  • 1. Basic bioinformatics concepts, databases and tools. Introduction to the training and Sequence databases. Joachim Jacob, http://www.bits.vib.be Updated 22 February 2012. http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod1-intro_H1_2012_SeqDBs.pdf
  • 2. Scope: Introductory training in bioinformatics. Exploring and understanding databases and software for everyday bioinformatics use. If any term is unclear, please stop me and ask!
  • 3. Bioinformatics ... Bio: all data is derived from living samples. Informatics: that data is stored and analyzed in and with computers to obtain understanding. This is an extremely broad description, from which we will nevertheless extract common principles during the course.
  • 4-12. Bioinformatics is present in every aspect of life sciences research (slide 6 adds: sequences)
  • 13. Bioinformatics ... Bio: different types of living samples. Informatics: storing and categorizing the information and making it easily accessible; interpreting that information reliably.
  • 14. Bioinformatics … and its companion. Bio: different types of living samples. Informatics: storing and categorizing the information and making it easily accessible; interpreting that information reliably. Statistics: large numbers, observational data.
  • 15. The siblings of Bioinformatics. Based on the biological component extracted from life, the measured properties and the ultimate goal of the analysis, different sub-disciplines of bioinformatics exist.
      - DNA: Genomics
      - RNA: Transcriptomics
      - proteins: Proteomics
      - metabolites: Metabolomics
      - Also: Epigenomics, Structural bioinformatics, Systems biology, Microbiomics, Interactomics, Metagenomics, Functional genomics, Comparative genomics
  • 16. Mere data is worth nothing
      - Data = symbols. Example: CGCTACGCATATCGCT
      - Information = data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions. Also called metadata. Example: Dasypus novemcinctus; found in my garden; part of genome; sequenced in June 2010
      - Knowledge: application of data and information; answers "how" questions. Example: this species seems to be related to my neighbor's pet, because it also has this sequence
      - Understanding: appreciation of "why". Example: has the same mother
      - Wisdom
      http://www.systems-thinking.org/dikw/dikw.htm
  • 17. From data (?) to knowledge (!)
      - Life sciences research, as major end user of the bioinformatics tools and conclusions (tool user)
      - Bioinformatics research, as a specific branch on the boundary of life science, mathematics and computer science (tool manufacturer): tools and approaches built on biology, computer science and statistics
  • 18. This course is organised in several modules
      - Module 1: Sequence databases: what, where, how
      - Module 2: Sequence comparisons: searching, aligning
      - Module 3: Sequence analysis – domains in protein sequences and predicting functionality, standardisation and useful links
      - Module 4: Beyond sequences - additional important data sources
      - Module 5: Genome Browsers - integrating biological data and performing reproducible bioinformatics research in the Galaxy
  • 19. Overview of the crash course
  • 20. One tip for the future: be prepared for change... Information is fluid, and so are bioinfo tools. Learn how to accommodate change. Major resources are more stable. Important concepts do not change often.
  • 21. Module 1 Sequence databases
  • 22. Module 1: Sequence databases. Sequence databases store DNA and RNA sequences. In bioinformatics, they are (still) by far the largest collections of biological data, and they are used by many subdisciplines of bioinformatics. http://www.ebi.ac.uk/embl/Services/DBStats/
  • 23. ... and growing http://www.ebi.ac.uk/embl/Services/DBStats/
  • 24. Three major nucleotide databanks host primary sequence data
      - European Nucleotide Archive (ENA) at EBI - http://www.ebi.ac.uk/ Divisions: EMBL-Bank (European Molecular Biology Laboratory; single submissions), Trace Archive, SRA Archive
      - GenBank at NCBI - http://www.ncbi.nlm.nih.gov/ maintained at NCBI (National Center for Biotechnology Information, USA)
      - DDBJ (DNA Data Bank of Japan) - http://www.ddbj.nig.ac.jp/ maintained at NIG/CIB (National Institute of Genetics, Center for Information Biology, Mishima, Japan)
  • 25. These databases are filled with NA sequence information by scientists and consortia
      - Submitters: large-scale sequencing projects, individual scientists, patent offices
      - Primary sequence data (e.g. ACTGCTGCTA GCTAGCTGAT CTATGCTAGC TGTAGCTGAG): each primary sequence = one experiment
      - Primary sequence database: basically, all source nucleotide material
      Jennifer McDowall - http://www.biotnet.org/training-materials/nucleotide-sequence-databases-ena
  • 26. Primary NA sequence can be produced by Sanger-based technologies or NGS technologies
      - Sanger: low output in number of seqs, high quality, 400-850 bp. Read profiles in .abi format. Stored in the Trace Archive.
      - NGS: different technologies. Extremely high output rate, low quality, 30 bp – 600 bp. Reads in .fastq format, stored in the SRA.
      - These techniques can only read DNA strands, so RNA from the sample first needs to be converted to cDNA with reverse transcriptases (RT) before loading onto the machines.
      Sanger overview: http://www.bio.davidson.edu/Courses/Molbio/MolStudents/spring2003/Obenrader
      NGS overview: http://seqanswers.com/forums/showthread.php?t=3561
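The .fastq reads mentioned on slide 26 store one quality character per base. The following is a minimal sketch of how such a record can be read, assuming the common Phred+33 encoding (the slides do not say which encoding is meant; some older files use an offset of 64). The function name and the sample record are made up for illustration.

```python
def parse_fastq_record(lines, offset=33):
    """Parse one 4-line FASTQ record into (read_id, sequence, quality_scores).

    offset=33 is an assumption (Phred+33); pass offset=64 for older encodings.
    """
    header, seq, plus, qual = (line.strip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Each quality character encodes one Phred score as an ASCII offset.
    scores = [ord(ch) - offset for ch in qual]
    return header[1:], seq, scores


# Hypothetical record, for illustration only.
record = ["@read1", "ACGT", "+", "II5!"]
read_id, seq, scores = parse_fastq_record(record)
print(read_id, seq, scores)
```

A higher score means a more reliable base call, which is how the "low quality" of NGS reads on this slide becomes visible per base.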
  • 27. Overview of major DNA reading technologies (Dennis Wall, NGS Data Analysis and Computation I course, Wall Lab)
  • 28. In the primary sequence dbs a distinction can be made between two major categories
      - High quality single submissions (Sanger):
        gene sequence (genomic – STD data class);
        mRNA sequence (via cDNA – STD);
        BAC/YAC/cosmid sequences;
        genome sequencing projects (contigs, assemblies, WGS);
        genome markers, STS (sequence tagged sites, unique short sequences from a genome)
      - Low quality batch submissions:
        Expressed Sequence Tags (EST);
        Genome Survey Sequences (GSS);
        high-throughput sequence data (e.g. NGS)
      http://www.ebi.ac.uk/ena/about/formats
  • 29. The batch submissions originate mostly from sequencing centers. Steps in large-scale sequencing projects: chromosome, fragment, sequencing library, sequence reads (e.g. whole genome shotgun), assemble sequence, annotation (e.g. cyp30, cyp309, insv, cg343). A submission can follow at several of these steps.
  • 30. Each primary database stores its sequences and batch submissions in its own way...
      - NCBI: ESTs are stored in dbEST (separate database)
      - ENA: ESTs are part of EMBL-Bank in the EST data class
      - Similar for GSS (see dbGSS at NCBI)
      - ESTs: expressed sequence tags, often partial sequences derived in batch from the RNA of a sample. See example:
        >est1
        ATCGACTAGCATCA
        >est2
        TCGACTAGCGACTA
        >est3
        CAGCATCATCGAC
  • 31. Batch submissions are marked and/or stored differently than single submissions. ENA structure (tier / class / type):
      - ENA-Annotation (feature annotation): 1) EMBL-Bank. Batch submissions are marked by data class; ESTs are also batch submissions.
      - ENA-Assembly (assembly information)
      - ENA-Reads (sequencing and sampling information), batch submissions: 2) Trace Archive - raw data (capillary sequencing); 3) Sequence Read Archive - raw data (Next Gen sequencing)
      http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt
  • 32. The normal submissions are a minority in primary sequence databases http://www.ebi.ac.uk/ena/about/statistics#embl_bases_per_dataclass
  • 33. Primary sequence dbs are synchronised and every sequence receives a unique identifier
      - All database maintainers assign and share a unique accession number (AC) for each sequence, besides their own ID number (info at NCBI).
      - Sequences can get updated, and the accession number is then extended with a version number, e.g. .1 (see SVA). Example of acc number: BC010109.2
      - International Nucleotide Sequence Database Collaboration (http://www.insdc.org/): GenBank, DDBJ and ENA (+ SRA) are synchronized daily and collaborate on features, taxonomy, ...
      - All use the same accession IDs, project IDs and feature tables (see later)
      http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)
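As a small illustration of the accession.version convention on slide 33, here is a sketch that splits a versioned accession such as BC010109.2 into its two parts. The function name is hypothetical, and the sketch only handles the "accession, dot, integer version" shape shown on the slide.

```python
def split_accession(versioned):
    """Split 'BC010109.2' into ('BC010109', 2); version is None if absent."""
    accession, sep, version = versioned.partition(".")
    # Without a dot there is no version suffix to report.
    return accession, (int(version) if sep else None)


print(split_accession("BC010109.2"))
print(split_accession("BC010109"))
```

Keeping the accession and the version apart makes it easy to recognise that two records are different versions of the same sequence.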
  • 34. One sequence entry contains three categories of information
      1. Info about sequence, submitters and literature (metadata)
      2. Annotations of the sequence (metadata related to the seq)
      3. Stretch of ATGC / AUGC sequence (the data, at the bottom)
      - A sequence record is called annotated when biological information is added and linked to a position in the sequence
      - Annotations, also called features, are abbreviated as codes, which can be found in the Feature Tables http://www.ebi.ac.uk/embl/Documentation/FT_d
  • 35. This sequence information can be written in different formats. (Plain) text format, e.g. GenBank. 1. General info
      - Official shared accession
      - GenBank-specific identifier (simply counts up with each new one)
      - A lot of different identifiers, roughly as many as there are databases! → conversion tools are needed to translate identifiers (see exercises)
      - In humans: the HUGO Nomenclature Committee determines the right gene name
      http://mobyle.pasteur.fr/cgi-bin/portal.py#tutorials::seqfmt
  • 36. 2. Annotation. Each annotation has a feature name and qualifier names. db_xref = cross references = links to records of other databases which are related to this record (see later). The format is dbname:identifier
  • 37. 3. Sequence. Each protein sequence also receives an accession number
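The dbname:identifier format of db_xref on slide 36 can be taken apart with a split on the first colon. This is a minimal sketch; the function name is hypothetical and the example value is only an illustration of the format, not taken from the record shown on the slide.

```python
def parse_db_xref(value):
    """Split a db_xref value 'dbname:identifier' into (dbname, identifier)."""
    cleaned = value.strip().strip('"')
    # Split on the first colon only: identifiers may contain colons themselves.
    dbname, sep, identifier = cleaned.partition(":")
    if not sep or not dbname or not identifier:
        raise ValueError(f"not in dbname:identifier format: {value!r}")
    return dbname, identifier


# Illustrative value in the dbname:identifier format.
print(parse_db_xref('"taxon:9606"'))
```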
  • 38. Other sequence formats
      - Fasta (minimal metadata, basically only sequence):
        >genename And a description
        ATCGATGCAGCTATATCCTCGCGATCAGC
        CGGACAGCTCTCGAGCGCATCGACGACGAC
      - ASN.1: Abstract Syntax Notation
      - EMBL: all info as in gb, online referred to as plain text
      - XML
      - Fastq: sequence info and base call quality
      Important: format has nothing to do with the program you use to save your file! You don't have a choice: it needs to be plain text (.txt, not a file type that is opened with MS Word, such as .doc or .rtf). Wordpad is a good choice for this, provided you save as plain text. Format in bioinfo is all about how the information is structured and written down in the plain text file. http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
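Because Fasta is only a convention for structuring plain text, a few lines of code are enough to read it. The sketch below parses the example from slide 38 into (name, description, sequence) tuples; the function name is hypothetical and the parser assumes well-formed input (a ">" header line followed by sequence lines).

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples from Fasta text."""
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


example = """>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGC
CGGACAGCTCTCGAGCGCATCGACGACGAC
"""
for name, description, sequence in parse_fasta(example):
    print(name, "|", description, "|", len(sequence), "bases")
```

The sequence lines are simply joined, which is why line wrapping inside a Fasta file does not change the sequence.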
  • 39. Degree of annotation differs between entries. Batch submitted sequences are annotated poorly, single submissions are annotated better. ENA structure (tier / class / type):
      - ENA-Annotation (feature annotation): 1) EMBL-Bank, with good seq annotations
      - ENA-Assembly (assembly information)
      - ENA-Reads (sequencing and sampling information): 2) Trace Archive - raw data (capillary sequencing); 3) Sequence Read Archive - raw data (Next Gen sequencing). Experiment information is of most importance in batch submissions (e.g. which species, which technique, ...)
      http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt
  • 40. SRA contains batch submitted records for which experiment information is of most importance. Since the sequences are barely (or not) annotated, the experiment description is important: which machine, which organism, which tissue, which developmental stage, disease, treatment, …
  • 41. How to get sequences into the db, and back out
      - Submit: always submit your sequence data (mostly obliged by journals) and include your ACC number in articles (not any other number). Tools: Sequin (GenBank, stand-alone), BankIt (GenBank, web tool), Webin (EMBL, online submission)
      - Retrieve one or a few sequences → use one of the numerous web-based tools. GenBank: Entrez. EMBL: EB-eye. MRS: developed for easy retrieval
      - Retrieve many sequences (batch retrieval) → use ftp (file transfer protocol) → use perl (flexible programming language) → BioMart http://www.biomart.org/
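For batch retrieval, slide 41 suggests ftp, perl or BioMart. As one possible alternative that the slides do not mention, NCBI also offers the E-utilities web interface; the sketch below only builds an efetch request URL for a list of accession numbers and does not contact the server. The function name is hypothetical, and the choice of the nuccore database and Fasta output are assumptions made for this example.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Build an efetch URL that requests several records in one call."""
    params = {
        "db": db,                      # assumed target database
        "id": ",".join(accessions),    # comma-separated accession numbers
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


print(efetch_url(["BC010109.2"]))
```

The resulting URL can be opened in a browser or passed to any download tool; for large batches, check NCBI's usage guidelines first.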
  • 42. Example of a primary NA sequence record (ENA) http://www.ebi.ac.uk/ena/about/formats
  • 43. Example of a primary NA sequence record (ENA). Text format: each line starts with a code usable for searching, followed by the data linked to that code. http://www.ebi.ac.uk/ena/about/formats
  • 44. Primary sequence data contains a lot of redundancy! A chromosome sequence, several gene sequences from different labs, EST sequences from transcripts and a cDNA sequence all match the same gene. Often you end up in your database search with all these sequences... A lot of redundancy!
  • 45. The primary sequences are the basis for analyses that generate derived sequence data. Scientists/consortia → primary databases – source for further analyses. Which?
      - Create protein sequences
      - Curate the sequence database
      - Assemble genomes
      - Search for similarities
      - Aggregate information about one gene
      - …
      Results are stored in derived databases
  • 46. Protein databases come in two kinds
  • 47. The most important protein db is UniProt and contains automatic and manual entries. UniProt Knowledge Base - the best annotated protein database in the world http://www.uniprot.org/
  • 48. The most important protein db is UniProt and contains automatic and manual entries
  • 49. RefSeq - the NCBI way to reduce redundancy in primary sequence data. RefSeq is NCBI Reference Sequences (prot and nuc). Redundancy from primary sequence data is reduced both automatically and by manual annotation of NA and protein sequences. One natural biological molecule = one entry. Links back to the original primary sequences. Hugely popular and a basis for a lot of analyses. Click to apply the RefSeq filter in an Entrez search. http://www.ncbi.nlm.nih.gov/RefSeq/
  • 50. RefSeq has its own identifiers, not to be mixed up with accession numbers. RefSeq entry codes look similar to ACC numbers (but are not ACC numbers: note the underscore!), and RefSeq records are also in GenBank format. Note: in the Features section one can find the raw sequences from which the entry was derived. (Typical mistake: searching with a RefSeq code in UniProt.)
      - NC_* (curated) complete genomic element (chromosome, plasmid, ...)
      - NT_* (automated) intermediate assembly from BAC
      - NZ_* (automated) incomplete genomic sequence from WGS
      - NW_* (automated) intermediate assembly from WGS
      - NG_* (curated) incomplete genomic element corresponding to gene
      - NM_* (curated) mRNA
      - NR_* (curated) non-coding RNA or predicted transcript of pseudogene
      - NP_* (curated) protein
      - ZP_* (automated) protein predicted from WGS sequence (NZ_*)
      - YP_* (curated) other predicted protein sequences from NCBI Genome Annotation Pipeline
      - XM_* (automated) mRNA
      - XR_* (automated) non-coding RNA or predicted transcript of pseudogene
      - XP_* (automated) protein
      http://www.ncbi.nlm.nih.gov/RefSeq/key.html http://www.ncbi.nlm.nih.gov/RefSeq/
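The prefix table on slide 50 lends itself to a simple lookup. The sketch below copies the prefixes and their meanings from the slide and uses the underscore to tell RefSeq identifiers from plain accession numbers; the function name and the RefSeq-style identifier in the usage line are made up for illustration.

```python
# Prefix -> (curation status, molecule type), as listed on the slide.
REFSEQ_PREFIXES = {
    "NC": ("curated", "complete genomic element"),
    "NT": ("automated", "intermediate assembly from BAC"),
    "NZ": ("automated", "incomplete genomic sequence from WGS"),
    "NW": ("automated", "intermediate assembly from WGS"),
    "NG": ("curated", "incomplete genomic element corresponding to gene"),
    "NM": ("curated", "mRNA"),
    "NR": ("curated", "non-coding RNA or predicted transcript of pseudogene"),
    "NP": ("curated", "protein"),
    "ZP": ("automated", "protein predicted from WGS sequence"),
    "YP": ("curated", "other predicted protein sequence"),
    "XM": ("automated", "mRNA"),
    "XR": ("automated", "non-coding RNA or predicted transcript of pseudogene"),
    "XP": ("automated", "protein"),
}


def classify_refseq(identifier):
    """Return (status, molecule) for a RefSeq identifier, or None otherwise."""
    prefix, sep, _rest = identifier.partition("_")
    if not sep:
        return None  # no underscore: not a RefSeq-style identifier
    return REFSEQ_PREFIXES.get(prefix.upper())


print(classify_refseq("NM_001234.1"))  # hypothetical RefSeq-style identifier
print(classify_refseq("BC010109.2"))   # accession number from slide 33
```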
  • 51. UniRef – UniProt's redundancy-reducing system for protein sequences. Non-redundant protein sequences from UniProt (comparable to RefSeq). Redundant sequences are hidden by clustering them:
      - UniRef100 = completely identical sequences
      - UniRef90 = 90% identical sequences
      - UniRef50 = 50% identical sequences
      See http://www.uniprot.org/help/uniref
  • 52. NCBI's Gene – summarizes gene information, including sequence information from primary dbs. Example of the gene NPR1 from A. thaliana
  • 53. UniGene – summarizes transcriptomic information around genes
  • 54. And a lot more derived databases with sequence information exist
      - Repbase: repeats (Alu, …), maintained by Jerzy Jurka at the Genetic Information Research Institute (Mountain View CA, USA). The CENSOR server allows you to "clean" sequences. http://www.girinst.org/repbase
      - miRBase: published miRNA sequences http://www.mirbase.org/
      - Eukaryotic Promoter Database http://www.epd.isb-sib.ch/
      - UniVec: GenBank subset + some sequences from commercial sources - ftp://ftp.ncbi.nih.gov/pub/UniVec/
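To make the idea of hiding redundancy by clustering concrete, here is a deliberately simplified sketch that groups entries with completely identical sequences, in the spirit of UniRef100. It is not the procedure UniProt actually uses, and the 90% and 50% levels need a real sequence-identity clustering tool; the function name and the toy sequences are hypothetical.

```python
from collections import defaultdict


def cluster_identical(sequences):
    """Group entry names whose sequences are identical (case-insensitive).

    `sequences` maps entry name -> sequence; returns a list of name clusters.
    """
    clusters = defaultdict(list)
    for name, seq in sequences.items():
        clusters[seq.upper()].append(name)
    return [sorted(names) for names in clusters.values()]


# Toy protein sequences, made up for illustration.
entries = {"entryA": "MKVLA", "entryB": "mkvla", "entryC": "MKVLT"}
print(cluster_identical(entries))
```

Each cluster would then be shown as a single representative entry that links back to its members.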
  • 55. Overview of the most important sequence databases
      - Primary seq data: GB (GenBank), ENA, DDBJ
      - Derived: GenPept, trEMBL (UniProt)
      - Curated: RefSeq, SwissProt (UniProt)
      - Integrated search portals: Entrez, ENA search, EB-eye, UniProt
  • 56. Common gene annotations on sequences
      - Genome sequence (e.g. Chr6): enhancers/promoters, gene sequence, terminator
      - Gene sequence: exons and introns
      - mRNA: 5'UTR, CDS, 3'UTR, poly(A) tail (AAAAAAAAAAAAA)
      - Protein: obtained from the CDS using genetic code tables
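The step from CDS to protein on slide 56 relies on a genetic code table. The sketch below builds the standard genetic code (other tables exist, for example for mitochondria, and are not covered here) and translates a CDS up to the first stop codon. The function name and the short example sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a CDS (DNA or RNA letters) until the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```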
  • 57. Searching the database for your gene of interest. First you have to determine for yourself which information you want:
      - NA sequences vs. protein sequences
      - If NA: genomic sequences, or RNA derived
      - All possible sequences that exist, or curated ones
      - Protein sequences of which quality
      - ...
  • 58. Entrez is a starting point for searches at NCBI http://www.ncbi.nlm.nih.gov/sites/gquery
  • 59. Visualising the db_xrefs in records at NCBI
  • 60. ENA has its text-search portal http://www.ebi.ac.uk/ena/
  • 61. Results from an ENA search are organised following the ENA database structure
  • 62. UniProt has a simple search box leading to a sophisticated search results page
  • 63. Complex searches can be achieved by using the index codes in the database (the codes usable for searching), e.g. “oc=Primates and de=complete and de=cds and de=MHC”. This could answer: give me all complete coding sequences of MHC available in primates.
  • 64. Meta-search tools can search different sequence databases at once. MRS: open source, developed by Maarten Hekkelman at Radboud U. (Nijmegen, the Netherlands). Allows searching in different databases at once, and also provides statistics on the databases. Alternatives: ACNUC, SRS
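A query like the one on slide 63 is just a list of code=term conditions joined by a logical operator. The sketch below assembles such a string; it copies the syntax of the slide's example and makes no claim about which codes a given search portal accepts. The function name is hypothetical.

```python
def build_query(conditions, operator="and"):
    """Join (code, term) pairs into a query such as 'oc=Primates and de=cds'."""
    return f" {operator} ".join(f"{code}={term}" for code, term in conditions)


query = build_query([
    ("oc", "Primates"),   # code and term as used in the slide's example
    ("de", "complete"),
    ("de", "cds"),
    ("de", "MHC"),
])
print(query)
```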
  • 65. Logical operators. Searching involves making combinations of conditions. The difference between a logical AND, OR and NOT is explained here with Venn diagrams: Q1 AND Q2 (&), Q1 NOT Q2 (!), Q1 OR Q2 (|)
  • 66. Hands-on! Every module ends with an exercise session. We will now explore how data is stored in different sequence databases. You get …. minutes for this exercise. Afterwards, we will summarize some of the difficulties some of you might have experienced.
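The Venn diagrams of slide 65 correspond directly to set operations. In this sketch the result of each query is modelled as a set of identifiers; the identifiers are made up, and note that Python writes NOT as set difference (-) rather than with the "!" symbol from the slide.

```python
def combine(q1, q2, operator):
    """Combine two result sets with AND, OR or NOT (q1 but not q2)."""
    if operator == "AND":
        return q1 & q2   # in both result sets
    if operator == "OR":
        return q1 | q2   # in at least one result set
    if operator == "NOT":
        return q1 - q2   # in q1 only
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical search results for two conditions.
q1 = {"seq1", "seq2", "seq3"}
q2 = {"seq3", "seq4"}
for op in ("AND", "OR", "NOT"):
    print(op, sorted(combine(q1, q2, op)))
```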
  • 67. Summary
      - This course is organised in several modules
      - Module 1: Sequence databases
      - Three major nucleotide databanks host primary sequence data
      - These databases are filled with NA sequence information by scientists and consortia
      - The batch submissions originate mostly from sequencing centers
      - Each primary database stores its sequences and batch submissions in its own way...
      - Batch submissions are marked and/or stored differently than single submissions
      - The normal submissions are a minority in primary sequence databases
      - Primary sequence dbs are synchronised and every sequence receives a unique identifier
      - One sequence entry contains three categories of information
      - This sequence information can be written in different formats
      - Degree of annotation differs between entries
      - SRA contains batch submitted records for which experiment information is of most importance
      - How to get sequences into the db, and back out
      - Primary sequence data contains a lot of redundancy!
      - The primary sequences are the basis for analyses that generate derived sequence data
      - Protein databases come in two kinds
      - The most important protein db is UniProt and contains automatic and manual entries
      - RefSeq - the NCBI way to reduce redundancy in primary sequence data
      - RefSeq has its own identifiers, not to be mixed up with accession numbers
      - UniRef – UniProt's redundancy-reducing system for protein sequences
      - NCBI's Gene – summarizes gene information, including sequence information from primary dbs
      - UniGene – summarizes transcriptomic information around genes
      - And a lot more derived databases with sequence information exist
      - Searching the database for your gene of interest
      - Entrez is a starting point for searches at NCBI
      - Visualising the db_xrefs in records at NCBI
      - ENA has its text-search portal
      - Results from an ENA search are organised following the ENA database structure
      - UniProt has a simple search box leading to a sophisticated search results page
      - Complex searches can be achieved by using the index codes in the database
      - Meta-search tools can search different sequence databases at once
      - Hands-on!