Data Manipulation: Molecular Online
and Server Tools & BioExtract Server
Theme: FXN Gene and Pancreatic Cancer.
Etienne Z. Gnimpieba
BRIN WS 2013
Mount Marty College – June 24th 2013
Etienne.gnimpieba@usd.edu
Review: Databases
Metabolic:
• SABIO-RK (check with Brent)
• KEGG (check with Brent)
• HMDB (hmdb.ca, contact for API)
• SMPDB (http://www.smpdb.ca)
• BioModels
• drugDB
• Brenda (check with Brent)
• [Mathi's project]
Protein:
• ExPASy DB collection (UniProt)
• PDB
• SBKB
• STRING
Genomic:
• GEO
• GenBank
• GO
• EBI ArrayExpress & Gene Atlas
Phenomic:
• PhenomicDB
• Phenoscape
Active Network Extraction & Analysis
1. Start from the Reactome Functional Interaction network
2. Extract mutated, overexpressed, underexpressed, expanded/deleted genes
3. Add linker genes to obtain the disease subnetwork
4. Apply community clustering algorithms to obtain disease “modules”
5. Use the modules for:
• Disease gene prediction
• Sample classification
• Hypothesis generation
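The extraction steps can be sketched in Python on a toy network. This is only an illustration under stated assumptions: the gene names (G1, L1, X1, ...) are placeholders, a "linker" is assumed here to be a non-disease gene that interacts with at least two disease genes, and connected components stand in for the community clustering step, which in practice uses a dedicated clustering algorithm.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Turn an undirected edge list into a dict: gene -> set of neighbours."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def disease_subnetwork(edges, disease_genes):
    """Keep disease genes plus linker genes, and the edges among them.

    Assumption: a linker is a non-disease gene touching >= 2 disease genes.
    """
    adj = build_adjacency(edges)
    disease = set(disease_genes)
    linkers = {gene for gene, nbrs in adj.items()
               if gene not in disease and len(nbrs & disease) >= 2}
    keep = disease | linkers
    sub_edges = [(a, b) for a, b in edges if a in keep and b in keep]
    return linkers, sub_edges

def connected_components(nodes, edges):
    """Crude stand-in for community clustering: one module per component."""
    adj = build_adjacency(edges)
    seen, modules = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, module = [start], set()
        while stack:
            gene = stack.pop()
            if gene in seen:
                continue
            seen.add(gene)
            module.add(gene)
            stack.extend(adj[gene] - seen)
        modules.append(module)
    return modules

# Toy interaction network (placeholder names, not real interactions).
edges = [("G1", "L1"), ("L1", "G2"), ("G2", "G3"),
         ("G4", "L2"), ("L2", "G5"), ("G5", "X1"), ("X1", "X2")]
disease_genes = {"G1", "G2", "G3", "G4", "G5"}

linkers, sub_edges = disease_subnetwork(edges, disease_genes)
modules = connected_components(disease_genes | linkers, sub_edges)
print("linker genes:", sorted(linkers))
print("modules:", [sorted(m) for m in modules])
```

In this toy case X1 and X2 are dropped because neither connects two disease genes, while L1 and L2 are kept as linkers.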
Pancreatic Cancer Module Map (43 Cases)
[Figure: network modules labeled as follows]
• p53, SMAD, TGFβ, TNF signaling
• KRAS, MAPK signaling
• Heterotrimeric G-protein signaling
• Rho GTPase signaling
• Transcription & translation
• Cell cycle
• Wnt & Cadherin signaling
• Hedgehog signaling
• Transcription
• Zinc fingers
• Ca2+ signaling
Non-silent mutations
• blue – in primary tumour only
• green – in xenograft only
• red – in primary & xenograft
Christina Yung / Bioinformatics.ca
Molecular Biology Databases
Bibliographic: MEDLINE, PubMed, EMBASE, BIOSIS, CAB International, AGRICOLA
Taxonomic: NEWT, The Tree of Life, Species 2000, IOPI, ITIS
Metabolic pathway: KEGG, EcoCyc, BRENDA, ENZYME, BIOMODEL, REACTOME
Nucleotide: INSDC, EMBL, DDBJ, NCBI, GenBank, SPGP, AceDB, HIV-SD
Genomic: Ensembl, WormBase, FlyBase, MGD, SGD, EBI (Genome server, Karyn’s genome), RGD, SPGP
Protein: GOA, ENZYME, InterPro, PDB, Integr8, MEROPS, LIGAN, EMP, DCHGR, PROSITE, PRINT, Pfam, BLOCKS, SBASE, UniProt/Swiss-Prot, PIR
Sequence type : Accession number format : Example
DNA sequence from GenBank, EMBL or DDBJ : 1 letter + 5 digits : U43752
DNA sequence from GenBank, EMBL or DDBJ : 2 letters + 6 digits : AF462052
GenPept sequence from GenBank, EMBL or DDBJ : 3 letters + 5 digits : AAF46449
Protein sequence from Swiss-Prot : 1 letter + 5 alphanumeric characters : Q16595
Protein sequence from the Protein Research Foundation : 6/7 digits + 1 letter : 2808353A
RefSeq sequence (mRNA) : 2 letters + _ + 6 or more digits : NM_******
RefSeq sequence (protein) : 2 letters + _ + 6 or more digits : NP_******
Protein sequence from the Protein Data Bank (PDB) : 1 digit + 3 alphanumeric characters : 2EFF
Protein sequence from the Molecular Modeling DataBase (MMDB) : MMDB ID + >4 digits : MMDB ID 767744
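The formats in this table can be turned into a small classifier. The sketch below is a simplification, not a complete validator: the label strings and the order of the checks are my own assumptions, the Swiss-Prot pattern follows the accession format published by UniProt (which is what separates Q16595 from a one-letter nucleotide accession such as U43752), and MMDB IDs are left out because they are plain numbers.

```python
import re

# (label, pattern) pairs, tried in order; the first full match wins.
ACCESSION_PATTERNS = [
    ("Swiss-Prot/UniProt protein",
     r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"),
    ("RefSeq mRNA", r"NM_\d{6,}"),
    ("RefSeq protein", r"NP_\d{6,}"),
    ("GenBank/EMBL/DDBJ nucleotide", r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    ("GenPept protein", r"[A-Z]{3}\d{5}"),
    ("Protein Research Foundation", r"\d{6,7}[A-Z]"),
    ("PDB structure", r"\d[A-Z0-9]{3}"),
]

def classify_accession(accession):
    """Return the label of the first pattern matching the whole accession."""
    cleaned = accession.strip().upper()
    for label, pattern in ACCESSION_PATTERNS:
        if re.fullmatch(pattern, cleaned):
            return label
    return "unknown"

# The examples from the table above, plus two RefSeq-shaped placeholders.
for acc in ["U43752", "AF462052", "AAF46449", "Q16595",
            "2808353A", "2EFF", "NM_000000", "NP_000000"]:
    print(acc, "->", classify_accession(acc))
```

Real databases accept more variants than these patterns (versions such as `.2`, newer longer prefixes), so a production check should rely on each database's own documentation.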
Review: data format
FASTA header lines:
>gi|XXXX|XXX
>sp|XXXX|XXX
Fields labeled on the slide: GenInfo (gi) number, accession number, species reference
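A pipe-delimited header of this kind can be split into named fields with a few lines of Python. This is a minimal sketch: the field names in the returned dictionary are my own, the `gi` line uses placeholder values, and only the `gi|...` and `sp|...` layouts shown above are handled.

```python
def parse_fasta_header(line):
    """Split a '>'-prefixed, pipe-delimited FASTA header into named fields."""
    if not line.startswith(">"):
        raise ValueError("FASTA header lines start with '>'")
    # The identifier block ends at the first space; the rest is free text.
    identifier, _, description = line[1:].strip().partition(" ")
    parts = identifier.split("|")
    fields = {"database": parts[0], "description": description}
    if parts[0] == "gi" and len(parts) >= 2:
        # Layout: gi|<GI number>|<source db>|<accession>|
        fields["gi_number"] = parts[1]
        if len(parts) >= 4:
            fields["source"] = parts[2]
            fields["accession"] = parts[3]
    elif parts[0] in ("sp", "tr") and len(parts) >= 2:
        # Layout: sp|<accession>|<entry name>
        fields["accession"] = parts[1]
        if len(parts) >= 3:
            fields["entry_name"] = parts[2]
    return fields

# Placeholder GI number and accession, for illustration only.
print(parse_fasta_header(">gi|0000000|ref|NP_000000.0| placeholder protein [Homo sapiens]"))
# Swiss-Prot style header using the accession from the table above.
print(parse_fasta_header(">sp|Q16595|FRDA_HUMAN Frataxin, mitochondrial"))
```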
Biological sequences and data can be analyzed in many ways with bioinformatics tools.
They can be read, assembled, compared, mapped, predicted, designed, modeled…
1. Nucleotide and protein sequence searching (blastall; FASTA suite: SSEARCH for local, GLSEARCH for global)
2. Multiple sequence alignment (ClustalW2, MView, …)
3. Pairwise sequence alignment (Needle for global, LALIGN for local)
4. Protein functional analysis (SMART, Phobius, InterProScan)
5. Functional genomic tools (R-tools, SAIL, EFOtools)
6. Molecular structure analysis (PDBeFold, QuaternaryStructure, …)
7. Scientific literature text mining (EBIMed, Whatizit)
8. Sequence translation (Transeq, Readseq, Backtranseq, …)
9. Data retrieval and ID mapping (dbfetch, ENA/SRA, SRS, PICR)
10. Protein structure prediction tools
11. …
Review: Online Programs & Algorithms
AND = term1 AND term2 must both exist in the searched documents
OR = at least one of term1 or term2 must exist
NOT = term1 must not be present in any of the displayed documents
ALL = term1 must be present in all of the displayed documents
+term1 = document must contain term1
-term1 = document must not contain term1
XXX* = any characters are accepted after XXX
XX?X = any single character is accepted in place of the ?
• FXN [AND] gene [NOT] Frataxin → all data related to the FXN gene except those concerning the frataxin protein
• ataxia + apraxia + gene → all genes related to ataxia and apraxia
• Ada* [AUTH] → all authors whose names begin with Ada
Boolean operators and symbols
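How such operators filter documents can be sketched in Python. This is a toy illustration, not how PubMed or any real search engine evaluates queries: the function names, the word-boundary matching, and the three sample documents are all assumptions made for the example.

```python
import re

def wildcard_to_regex(term):
    """Compile a search term: '*' = any run of word characters, '?' = exactly one."""
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    # Match whole words only, ignoring case.
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

def matches(document, required=(), excluded=(), any_of=()):
    """required ~ AND / +term, excluded ~ NOT / -term, any_of ~ OR."""
    def has(term):
        return wildcard_to_regex(term).search(document) is not None
    return (all(has(t) for t in required)
            and not any(has(t) for t in excluded)
            and (not any_of or any(has(t) for t in any_of)))

# Made-up documents for illustration.
docs = [
    "FXN gene expression in heart tissue",
    "Frataxin protein encoded by the FXN gene",
    "Ataxia and apraxia gene panel",
]

# FXN [AND] gene [NOT] Frataxin
print([d for d in docs if matches(d, required=("FXN", "gene"), excluded=("Frataxin",))])
# ataxia + apraxia + gene
print([d for d in docs if matches(d, required=("ataxia", "apraxia", "gene"))])
# Ada* applied to author names
print([a for a in ["Adams J", "Bada K", "Ada L"] if matches(a, required=("Ada*",))])
```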
Similarity search tools
BLAST (Basic Local Alignment Search Tool): compares a protein or DNA sequence to other sequences
FASTA (FAST-ALL): fast protein or nucleotide comparison
Global match: align all residues of one sequence with all residues of the other sequence
Local match: find a region in one sequence that matches a region of the other
Motif match: find matches of a short sequence in one or more regions internal to another, longer sequence. It could be a perfect match or a match with deletions, insertions, or mismatches
Multiple alignment: a mutual alignment of many sequences
Review: Sequence Analysis
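The difference between a global and a local match can be seen in a small dynamic-programming sketch. It computes only the alignment score, in the style of Needleman-Wunsch (global) and Smith-Waterman (local). The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for readability; tools such as Needle and LALIGN use substitution matrices and more elaborate gap penalties.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b: global by default, local if requested."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a negative-scoring prefix is simply dropped.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    # Global score sits in the last cell; local score is the best cell anywhere.
    return best if local else score[-1][-1]

seq1, seq2 = "TTTACGTTT", "GGACGGG"
print("global:", alignment_score(seq1, seq2))
print("local: ", alignment_score(seq1, seq2, local=True))
```

With these two toy sequences the local score rewards the shared ACG region alone, while the global score is pulled down by the unmatched flanks.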
Sequence alignment: assignment of residue-residue correspondence
• Determine phylogenetic relationships by analyzing similarity and homology
- Similarity: observation or measurement of resemblance and difference
- Homology: the sequences and the organisms in which they occur are descended from a common ancestor → homology must be inferred from the observation of similarity
• Determine if a protein (or a gene) is related to a larger group of proteins
• Verify if a mutated residue is conserved within species
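Both ideas, measuring similarity and checking whether a residue is conserved, can be sketched on an already-aligned set of sequences. The four short sequences below are made up for illustration (they are not frataxin sequences), and the convention of ignoring gapped columns when computing identity is only one of several in use.

```python
from collections import Counter

def percent_identity(seq_a, seq_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    identical = sum(x == y for x, y in pairs)
    return 100.0 * identical / len(pairs)

def column_conservation(aligned_seqs, position):
    """Most common residue in one alignment column (0-based) and its frequency."""
    column = [seq[position] for seq in aligned_seqs]
    residue, count = Counter(column).most_common(1)[0]
    return residue, count / len(column)

# Made-up aligned sequences of equal length ('-' marks a gap).
alignment = ["MKT-LLV", "MKTALLV", "MRT-LLI", "MKTALLV"]

print(percent_identity(alignment[0], alignment[2]))
print(column_conservation(alignment, 0))  # fully conserved column
print(column_conservation(alignment, 1))  # column with one substitution
```

A mutation falling in a column like position 0, where every sequence carries the same residue, is more likely to matter than one in a variable column, though similarity alone remains an observation from which homology is inferred.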
Context
0. Specification & Aims
Statement of problem / Case study: The FXN gene provides instructions for making a protein called frataxin. This protein is found in cells throughout the body, with the highest levels in the heart, spinal cord, liver, pancreas, and muscles used for voluntary movement (skeletal muscles). Within cells, frataxin is found in energy-producing structures called mitochondria. Although its function is not fully understood, frataxin appears to help assemble clusters of iron and sulfur molecules that are critical for the function of many proteins, including those needed for energy production. Mutations in the FXN gene cause Friedreich ataxia, a genetic condition that affects the nervous system and causes movement problems. Most people with Friedreich ataxia begin to experience the signs and symptoms of the disorder around puberty.
Molecular Online Tools and Server
Keywords:
Bio: FXN, Frataxin, pancreatic cancer, CDKN4
Math: HMM
Informatics: programming, bioinformatics tools, getting and exporting data
Reduced expression of frataxin is the cause of Friedreich's ataxia (FRDA), a lethal neurodegenerative disease. How about liver cancer?
Aim: The purpose of this lab is to introduce online tools for exploring large-scale human biological data (metabolic, proteomic, genomic, …). We simulate their application to the FXN gene and pancreatic cancer. The lab shows how a researcher can identify and cross-reference the biological knowledge available in data banks.
Acquired skills
Online and server tools:
- Query biological DBs (FASTA, HTML, txt, figure formats)
- Sequence tools (protein and gene): alignment (showalign, ClustalW2), similarity, …
- Manage data results (select, keep, map, export)
- Build and reuse workflows
Biological Hypothesis
[Figure panels: FXN on chromosome 9; frataxin molecule structure (PyMOL); pancreas anatomy; pancreatic cancer; the link between them is marked "?"]
Biological DB
Tools
Resolution Process
T1. Metabolomics
Objective: Use metabolic data repositories to understand the frataxin protein mechanism.
T1.1. Finding the enzyme and pathway related to frataxin using KEGG
T1.2. Finding the reaction involving frataxin using Reactome
T1.3. Using BRENDA for enzyme data on frataxin
T1.4. Using the collected data for analysis
T1.5. Redo the process with pancreatic cancer
T2. Genome exploration
Objective: Use Ensembl to locate FXN on the human genome and identify the genes implicated in pancreatic cancer.
T2.1. Locate a given gene on the human genome
T2.2. Get a genomic sequence from NCBI
T2.3. Get the protein data and sequence from EBI
T2.4. Save the exported sequence data in the data folder
T3. Sequence manipulation
Objective: Find similar sequences using BLAST tools and align the given sequences.
T3.1. Find similar sequences using the BLAST tool
T3.2. Align the generated sequences with the ClustalW tool
T3.3. Visualize the result as a phylogenetic tree in Jalview
T4. Protein Data and Structural Biology Knowledge
Objective: Study frataxin at the protein level and its connection with pancreatic cancer (functional and structural data).
T4.1. Structural knowledge on frataxin using SBKB
T4.2. Using UniProt for the frataxin protein study
T4.3. Protein-protein interaction using STRING
T4.4. Using the same method for pancreatic cancer and comparing
T5. BioExtract server
Objective: Use a server tool to optimize the data manipulation process, applied on the BioExtract server.
T5.1. Server initialization
T5.2. Pancreatic cancer & frataxin (FXN)
T5.3. Mapping, alignment
T5.4. Workflow save & reuse
Results

 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Session i overview bioinfo dm and app mmc

  • 1. Data Manipulation: Molecular Online and Server Tools & BioExtract Server Theme: FXN Gene and Pancreatic Cancer. Etienne Z. Gnimpieba BRIN WS 2013 Mount Marty College – June 24th 2013 Etienne.gnimpieba@usd.edu
  • 2. Review: Databases
       Metabolic: SABIO-RK (check with Brent); KEGG (check with Brent); HMDB (hmdb.ca, contact for API); SMPDB (http://www.smpdb.ca); BioModels; drugDB; BRENDA (check with Brent); [Mathi's project]
       Protein: ExPASy DB collection (UniProt, ...); PDB; SBKB; STRING
       Genomic: GEO; GenBank; GO; EBI ArrayExpress & Gene Atlas
       Phenomic: PhenomicDB; Phenoscape
  • 3. Review: Databases. Active Network Extraction & Analysis
       1. Start from the Reactome Functional Interaction network.
       2. Extract mutated, overexpressed, underexpressed, and expanded/deleted genes; add linker genes.
       3. This gives the disease subnetwork.
       4. Apply community clustering algorithms to obtain disease "modules".
       5. Use the modules for disease gene prediction, sample classification, and hypothesis generation.
  • 4. Review: Databases. Pancreatic Cancer Module Map (43 Cases). Credit: Christina Yung / Bioinformatics.ca
       Modules shown: p53, SMAD, TGFβ, TNF signaling; KRAS, MAPK signaling; heterotrimeric G-protein signaling; Rho GTPase signaling; transcription & translation; cell cycle; Wnt & Cadherin signaling; Hedgehog signaling; transcription; zinc fingers; Ca2+ signaling
       Non-silent mutations (colour key): blue = in primary tumour only; green = in xenograft only; red = in primary & xenograft
  • 5. Review: Databases. Molecular Biology Databases
       Bibliographic: MEDLINE; PubMed; EMBASE; BIOSIS; CAB International; AGRICOLA
       Taxonomic: NEWT; The Tree of Life; Species 2000; IOPI; ITIS
       Metabolic pathway: KEGG; EcoCyc; BRENDA; ENZYME; BIOMODEL; REACTOME
       Nucleotide: INSDC; EMBL; DDBJ; NCBI GenBank
       Genomic: SPGP; AceDB; HIV-SD; Ensembl; WormBase; FlyBase; MGD; SGD; EBI (Genomes Server, Karyn's Genomes); RGD
       Protein: GOA; ENZYME; InterPro; PDB; Integr8; MEROPS; LIGAND; EMP; DCHGR; PROSITE; PRINTS; Pfam; BLOCKS; SBASE; UniProt/Swiss-Prot; PIR
  • 6. Review: data format
       Sequence type                                            Accession number format          Example
       DNA sequence from GenBank, EMBL or DDBJ                  1 letter + 5 digits              U43752
                                                                2 letters + 6 digits             AF462052
       GenPept sequence (GenBank, EMBL or DDBJ)                 3 letters + 5 digits             AAF46449
       Protein sequence from SwissProt                          1 letter + 5 digits              Q16595
       Protein sequence from the Protein Research Foundation    6/7 digits + 1 letter            2808353A
       RefSeq sequence                                          2 letters + _ + 6 or more digits mRNA: NM_******; protein: NP_******
       Protein sequence from the Protein Data Bank (PDB)        1 digit + 3 letters              2EFF
       Protein sequence from Molecular Modeling DataBase (MMDB) "MMDB ID" + >4 digits            MMDB ID 767744
       FASTA header layouts: >gi|XXXX|XXX and >sp|XXXX|XXX (fields labelled on the slide: GenInfo (GI) number, species reference, accession number)
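The accession formats in slide 6 can be checked mechanically. The following is a minimal sketch in Python, assuming the formats exactly as the slide states them (real accession schemes have more variants than these). The names `ACCESSION_PATTERNS` and `classify_accession` are hypothetical and used only for illustration. Because "1 letter + 5 digits" is shared by nucleotide and SwissProt entries, the function returns every candidate type rather than a single answer.

```python
import re

# Patterns transcribed from the slide's table (an assumption: real-world
# accession schemes include more variants than the slide lists).
ACCESSION_PATTERNS = [
    ("RefSeq mRNA", re.compile(r"NM_\d{6,}")),
    ("RefSeq protein", re.compile(r"NP_\d{6,}")),
    ("DNA (GenBank/EMBL/DDBJ)", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("GenPept (GenBank/EMBL/DDBJ)", re.compile(r"[A-Z]{3}\d{5}")),
    ("SwissProt protein", re.compile(r"[A-Z]\d{5}")),
    ("Protein Research Foundation", re.compile(r"\d{6,7}[A-Z]")),
    ("PDB", re.compile(r"\d[A-Z]{3}")),
]


def classify_accession(accession):
    """Return every sequence type whose format matches the accession."""
    accession = accession.strip().upper()
    return [name for name, pattern in ACCESSION_PATTERNS
            if pattern.fullmatch(accession)]


if __name__ == "__main__":
    for acc in ["U43752", "AF462052", "AAF46449", "Q16595", "2808353A", "2EFF"]:
        print(acc, "->", classify_accession(acc))
```

Format alone cannot separate U43752 from Q16595, so an ambiguous result still has to be resolved by looking the entry up in the database itself.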
  • 7. Review: Online Programs & Algorithms
       Biological sequences and data can be analyzed in many ways with bioinformatics tools. They can be read, assembled, compared, mapped, predicted, designed, modeled, …
       1. Nucleotide and protein sequence searching (blastall, SSEARCH for local FASTA searches, GLSEARCH for global)
       2. Multiple sequence alignment (ClustalW2, MView, …)
       3. Pairwise sequence alignment (Needle for global, LALIGN for local)
       4. Protein functional analysis (SMART, Phobius, InterProScan)
       5. Functional genomics tools (R-tools, SAIL, EFOtools, …)
       6. Molecular structure analysis (PDBeFold, QuaternaryStructure, …)
       7. Scientific literature text mining (EBIMed, Whatizit)
       8. Sequence translation (Transeq, Readseq, Backtranseq, …)
       9. Data retrieval and ID mapping (dbfetch, ENA/SRA, SRS, PICR)
       10. Protein structure prediction tools
       11. …
  • 8. Review: Databases. Boolean operators and symbols
       AND: term1 AND term2 must both exist in the searched documents
       OR: term1 OR term2 must exist
       NOT: term1 must not be present in any of the displayed documents
       ALL: term1 must be present in all of the displayed documents
       + term1: the document must contain term1
       - term1: the document must not contain term1
       XXX*: any characters are accepted after XXX
       XX?YX: any character is accepted in place of Y
       Examples:
       FXN [AND] gene [NOT] Frataxin → all data related to the FXN gene except those concerning the frataxin protein
       ataxia + apraxia + gene → all genes related to ataxia and apraxia
       Ada* [AUTH] → all authors whose names begin with Ada
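The operators in slide 8 can be tried out on a handful of in-memory documents. The following is a minimal sketch in Python, not the query engine of any real database. `has_term`, `search`, and the three sample documents are hypothetical. Wildcards (`*`, `?`) are handled with the standard-library `fnmatch` module and matching is case-insensitive, which is an assumption about how such search forms usually behave.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical sample documents, for illustration only.
DOCS = [
    "FXN gene expression in pancreatic tissue",
    "Frataxin protein encoded by the FXN gene",
    "Adams reports on ataxia and apraxia gene variants",
]


def has_term(doc, term):
    """True if any word of doc matches term ('*' and '?' act as wildcards)."""
    words = re.findall(r"\w+", doc.lower())
    return any(fnmatchcase(word, term.lower()) for word in words)


def search(docs, all_of=(), any_of=(), none_of=()):
    """Keep documents satisfying AND (all_of), OR (any_of) and NOT (none_of)."""
    hits = []
    for doc in docs:
        if not all(has_term(doc, t) for t in all_of):
            continue  # AND / '+': every listed term must be present
        if any_of and not any(has_term(doc, t) for t in any_of):
            continue  # OR: at least one listed term must be present
        if any(has_term(doc, t) for t in none_of):
            continue  # NOT / '-': no listed term may be present
        hits.append(doc)
    return hits


if __name__ == "__main__":
    # FXN [AND] gene [NOT] Frataxin
    print(search(DOCS, all_of=["FXN", "gene"], none_of=["Frataxin"]))
    # Ada*
    print(search(DOCS, all_of=["Ada*"]))
```

Field tags such as [AUTH] are left out of the sketch because they depend on how each database indexes its records.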
  • 9. Review: Databases. Similarity search tools
       BLAST (Basic Local Alignment Search Tool): compares a protein or DNA sequence to other sequences
       FASTA (FAST-All): fast protein or nucleotide comparison
  • 10. Review: Sequence Analysis
        Global match: align all residues of one sequence with all residues of the other
        Local match: find a region in one sequence that matches a region of the other
        Motif match: find matches of a short sequence in one or more regions internal to another, longer sequence; a match can be perfect or can contain mismatches, insertions, or deletions
        Multiple alignment: a mutual alignment of many sequences
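The difference between a global and a local match in slide 10 shows up clearly in the alignment score. The following is a minimal sketch in Python of the two classic dynamic-programming schemes (Needleman-Wunsch for global, Smith-Waterman for local), which are standard textbook techniques and not necessarily what the tools named in the slides run internally. The function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b (global by default, local if asked)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both sequences end to end.
    # Local: best-scoring pair of subsequences.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    long_seq, motif = "AAAGATTACATTT", "GATTACA"
    print("global:", alignment_score(long_seq, motif))
    print("local: ", alignment_score(long_seq, motif, local=True))
```

With these scores the local alignment recovers the embedded GATTACA region at full score, while the global alignment is pulled down by the six gaps needed to cover the flanking residues.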
  • 11. Review: Sequence Analysis
        Sequence alignment: assignment of residue-residue correspondence. It is used to:
        - Determine phylogenetic relationships by analyzing similarity and homology
          Similarity: observation or measurement of resemblance and difference
          Homology: the sequences, and the organisms in which they occur, are descended from a common ancestor
          → Homology must be an inference from the observation of similarity
        - Determine whether a protein (or a gene) is related to a larger group of proteins
        - Verify whether a mutated residue is conserved within species
  • 12. Context
        0. Specification & Aims
        Statement of problem / Case study: The FXN gene provides instructions for making a protein called frataxin. This protein is found in cells throughout the body, with the highest levels in the heart, spinal cord, liver, pancreas, and muscles. The protein is used for voluntary movement (skeletal muscles). Within cells, frataxin is found in energy-producing structures called mitochondria. Although its function is not fully understood, frataxin appears to help assemble clusters of iron and sulfur molecules that are critical for the function of many proteins, including those needed for energy production. Mutations in the FXN gene cause Friedreich ataxia, a genetic condition that affects the nervous system and causes movement problems. Most people with Friedreich ataxia begin to experience the signs and symptoms of the disorder around puberty.
        Reduced expression of frataxin is the cause of Friedreich's ataxia (FRDA), a lethal neurodegenerative disease. How about liver cancer?
        Keywords: Bio: FXN, frataxin, pancreatic cancer, CDKN4. Math: HMM. Informatics: programming, bioinformatics tools, getting and exporting data.
        Aim: The purpose of this lab is to introduce online tools for exploring large-scale human data (metabolic, protein, genomic, …). We apply them to the FXN gene and pancreatic cancer, to show how a researcher can identify and cross-reference the biological knowledge available in data banks.
        Acquired skills (online and server tools):
        - Query biological DBs (FASTA, HTML, txt, figure formats)
        - Sequence tools (protein and gene): alignment (showalign, ClustalW2), similarity, …
        - Manage result data (select, keep, map, export)
        - Build and reuse workflows
        [Figure: biological hypothesis. FXN on chromosome 9; frataxin molecule structure (PyMOL); pancreatic cancer and pancreas anatomy; biological DBs and tools]
        Resolution process
        T1. Metabolomics. Objective: use metabolic data repositories to understand the frataxin protein mechanism.
            T1.1. Finding the enzyme and pathway related to frataxin using KEGG
            T1.2. Finding the reaction involving frataxin using Reactome
            T1.3. Using BRENDA for enzyme data on frataxin
            T1.4. Using the collected data for analysis
            T1.5. Redo the process with pancreatic cancer results
        T2. Genome exploration. Objective: use Ensembl to locate FXN on the human genome and identify the genes implicated in pancreatic cancer.
            T2.1. Locate a given gene on the human genome
            T2.2. Get a genomic sequence from NCBI
            T2.3. Get the protein data and sequence from EBI
            T2.4. Save the exported sequence data in the data folder
        T3. Sequence manipulation. Objective: find similar sequences using BLAST tools and align the given sequences.
            T3.1. Find similar sequences using the BLAST tool
            T3.2. Align the generated sequences with the ClustalW tool
            T3.3. Visualize the result as a phylogenetic tree in Jalview
        T4. Protein Data and Structural Biology Knowledge. Objective: study frataxin at the protein level and its connection with pancreatic cancer (functional and structural data).
            T4.1. Structural knowledge on frataxin using SBKB
            T4.2. Using UniProt for the frataxin protein study
            T4.3. Protein-protein interaction using STRING
            T4.4. Using the same method for pancreatic cancer and comparing
        T5. BioExtract Server. Objective: use a server tool to optimize the data manipulation process, applied on the BioExtract Server.
            T5.1. Server initialization
            T5.2. Pancreatic cancer & frataxin (FXN)
            T5.3. Mapping, alignment
            T5.4. Workflow save & reuse

Editor's Notes

  1. Welcome to this bioinformatics lab on data manipulation using online and server tools. As the theme, we have chosen to study the interaction between frataxin and pancreatic cancer.
  2. Molecular online tools rest on biological databases and consist of many software programs (each program implements a data manipulation algorithm or process). Databases contain bibliographic (since 1960), taxonomic, nucleotide, genomic, protein, microarray, metabolic pathway, sequence, RNA, and organism data, … Software generally takes the name of the algorithm it implements (next slide).

     Bibliographic DB
     MEDLINE is accessible through EBI's SRS. PubMed is accessible through NCBI's Entrez. EMBASE is a commercial product for medical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field. The bibliographic databases, with the exception of MEDLINE/PubMed, are only available through commercial database vendors.

     Taxonomy DB
     NEWT; the Tree of Life project; Species 2000; IOPI (International Organization for Plant Information); ITIS (Integrated Taxonomic Information System).

     Nucleotide DB
     In Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge, UK, an outstation of the European Molecular Biology Laboratory (EMBL), whose headquarters are in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available.
     The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet. DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.

     Genomic DB
     Genomic databases vary greatly in form and content. For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and in how this data is stored.
     Genomes Server: gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses.
     Proteome Analysis: set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
     Ensembl: a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.
     Karyn's Genomes: contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation of why it is important to obtain these organisms' genomic sequences.
     WormBase: a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes).
     FlyBase: the database for Drosophila melanogaster, one of the best-curated genetic databases.
     MGD: the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
     RGD: the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
     The MIPS yeast database is an important resource for information on the yeast genome and its products.
     SGD: the 'Saccharomyces Genome Database' is another major yeast database.
     SPGP: the 'S. pombe Genome Project', based at the Sanger Institute, is the database for genetic data on the fungus Schizosaccharomyces pombe.
     AceDB: the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, resulting in some confusion relative to the C. elegans database. The entire database can be downloaded from the Sanger Institute.
     HIV-SD: the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.

     Protein DB
     The protein databases are the most comprehensive source of information on proteins.
     It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of:
     - primary protein sequence databases such as UniProtKB/Swiss-Prot
     - specialised protein sequence databases such as GOA (Gene Ontology Annotation)
     - specialised protein databases such as ENZYME
     - secondary protein databases such as InterPro
     - structure databases such as PDB
     Integr8: the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
     ENZYME: this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
     Two-dimensional gel electrophoresis data: a database is available from ExPASy and the Danish Centre for Human Genome Research (DCHGR).
     Mass spectrometry protein data: a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.

     Examples of secondary protein databases include:
     PROSITE: the special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that, with appropriate computational tools, it can rapidly and reliably identify to which family of proteins a new sequence belongs.
     The profile structure used in PROSITE is similar to, but slightly more general than, the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam.
     PRINTS: a different approach to pattern recognition, termed "fingerprinting", is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.
     Pfam: another important secondary protein database. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) on the protein families.
     BLOCKS: blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
     The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
     SBASE: a protein domain library sequence database that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.
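The note above contrasts regular-expression style patterns (as in PROSITE) with profiles and HMMs. The following is a minimal sketch in Python of how a PROSITE-style pattern can be turned into a regular expression and scanned against a sequence. It assumes only the basic pattern syntax (elements separated by `-`, `x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` as terminus anchors) and ignores rarer constructs. `prosite_to_regex` and `find_sites` are hypothetical helper names, not part of any PROSITE tool.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            regex = "."                  # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"  # any residue except those listed
        else:
            regex = core                 # a literal residue or an [ABC] set
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Start positions (0-based) of non-overlapping matches in sequence."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern, written in PROSITE syntax.
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("N-{P}-[ST]-{P}", "AANGSAA"))
```

A pattern of this kind only answers yes or no for a short site, which is the limitation the note attributes to regular expressions when compared with profiles and HMMs.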
  4. Molecular online tools are built on biological databases and consist of many software programs (each program implements a data manipulation algorithm or process).
Databases hold bibliographic (since 1960), taxonomic, nucleotide, genomic, protein, microarray and metabolic pathway data concerning sequences, RNA, organisms, …
Software generally takes the name of the coded algorithm (next slide).

Bibliographic DB
MEDLINE is accessible through EBI's SRS. PubMed is accessible through NCBI's Entrez. EMBASE is a commercial product for medical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes the zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field. The bibliographic databases, with the exception of MEDLINE/PubMed, are only available through commercial database vendors.

Taxonomy DB
• NEWT
• The Tree of Life project
• Species 2000
• IOPI: International Organization for Plant Information
• ITIS: Integrated Taxonomic Information System

Nucleotide DB
In Europe, the vast majority of the nucleotide sequence data produced is collected, organized and distributed by the EMBL Nucleotide Sequence Database, located at the EBI in Cambridge, UK. The EBI is an outstation of the European Molecular Biology Laboratory (EMBL), which has its headquarters in Heidelberg, Germany. The nucleotide sequence databases are data repositories: they accept nucleic acid sequence data from the scientific community and make it freely available. The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. They are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.
DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. As a result, they contain exactly the same information, except for sequences added in the last 24 hours.

Genomic DB
Genomic databases vary greatly in form and content. For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and in how the data is stored.
Genomes Server - gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids and viruses.
Proteome Analysis - set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
Ensembl - a joint project between the EBI and the Wellcome Trust Sanger Institute that aims to develop a system maintaining automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans and C. briggsae.
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to explain concisely why it is important to obtain these organisms' genomic sequences.
WormBase - a repository of mapping, sequencing and phenotypic information for C. elegans (and some other nematodes).
FlyBase - the database for Drosophila melanogaster, one of the best-curated genetic databases.
MGD - the 'Mouse Genome Database', one of the most comprehensively curated genetic databases.
RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
MIPS - the MIPS yeast database is an important resource for information on the yeast genome and its products.
SGD - the 'Saccharomyces Genome Database' is another major yeast database.
SPGP - the 'S. pombe Genome Project', based at the Sanger Institute, is the database for genetic data on the fungus Schizosaccharomyces pombe.
AceDB - the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now also the name of this database management system, which causes some confusion with the C. elegans database itself. The entire database can be downloaded from the Sanger Institute.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.

Protein DB
The protein databases are the most comprehensive source of information on proteins. It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of the:
• Primary protein sequence databases such as UniProtKB/Swiss-Prot
• Specialised protein sequence databases such as GOA (Gene Ontology Annotation)
• Specialised protein databases such as ENZYME
• Secondary protein databases such as InterPro
• Structure databases such as PDB
Integr8 - the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
ENZYME - an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
Two-dimensional gel electrophoresis data - a database is available from ExPASy and the Danish Centre for Human Genome Research (DCHGR).
Mass spectrometry protein data - a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.

Examples of secondary protein databases include:
PROSITE - the special value of this database is its extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns, formulated so that appropriate computational tools can rapidly and reliably identify the protein family to which a new sequence belongs. The profile structure used in PROSITE is similar to, but slightly more general than, the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam.
PRINTS - this database uses a different approach to pattern recognition, termed "fingerprinting". Within a sequence alignment, it is usual to find not one but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, makes fingerprinting a powerful diagnostic technique.
Pfam - another important secondary protein database. Pfam builds its protein family or domain signatures with Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory. They allow a direct statistical approach to identifying and scoring matches, and to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) on the protein families.
BLOCKS - blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
SBASE - a library of protein domain sequences that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.
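The PROSITE entry above describes patterns that computational tools match against a new sequence. As an illustration only, here is a minimal Python sketch that translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The function names `prosite_to_regex` and `find_sites` and the toy sequence are assumptions made for this example, not part of any PROSITE tool; the sketch covers only the basic syntax (`x`, `[..]`, `{..}`, repetition counts, `<` and `>` anchors) and reports non-overlapping matches.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; handles only the basic syntax.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate an optional repetition such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        # "[ST]" and single letters are already valid regex fragments.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# N-glycosylation site pattern, scanned over a made-up toy sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("N-{P}-[ST]-{P}", "MKNGSAQNPTL"))
```

Regular expressions of this kind match only the motif itself; as noted above, profiles and HMMs are the tools to use when the full extent of a domain has to be identified.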
  6. To be stored, data need a standard representation. For sequences we have: … For building alignment tools, we have local matching …
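The slide does not say which sequence representation it has in mind. One widely used plain-text representation is FASTA, where a header line starting with `>` is followed by one or more lines of sequence. The sketch below assumes FASTA and uses made-up record names (`seq1`, `seq2`); the function name `parse_fasta` is also an assumption for this example.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA-formatted lines into a {record_id: sequence} dictionary.

    The record id is taken as the first word after '>' (an assumption of
    this sketch); sequence lines before any header are ignored.
    """
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in chunks.items()}


# Toy records, invented for illustration.
example = """>seq1 first toy record
ACGTAC
GTTA
>seq2 second toy record
MKNGSA
"""
records = parse_fasta(example.splitlines())
print(records)
```

The same function works on an open file handle, since a file iterates line by line.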
  7. Software generally takes the name of the implemented algorithm. Hundreds of algorithms are available. For details on the algorithm behind each tool, links are usually given on the website's list of tools. If that fails, typing the tool's name into a search engine (e.g. Google) will lead to the algorithm it refers to. Biological sequences and data can be analyzed in many ways with bioinformatics tools: they can be read, assembled, compared, mapped, predicted, designed and modeled.
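As one example of comparing sequences with a named algorithm, the sketch below computes a Smith-Waterman local alignment score. It is an illustration only: the scoring values (match +2, mismatch -1, linear gap penalty -2) and the function name are assumptions chosen for this example, and it returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman).

    Uses a linear gap penalty and keeps only two rows of the scoring
    matrix, since the traceback is not needed for the score alone.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Production tools use substitution matrices and affine gap penalties rather than these fixed values, so the numbers here are not comparable with their scores.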
  9. Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Software generally takes the name of the coded algorithm (next slide)Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Bibliographic DBMEDLINE is accessible through EBI's SRS. PUBMED is accessible through NCBI's ENTREZ.EMBASE is a commercial product formedical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes and zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field . The bibliographical databases, with the exception of MEDLINE/PUBMED, are only available through commercial database vendors. Taxonomy DB NEWT The Tree of Life project Species 2000 IOPI: International Organization for Plant Information ITIS: Integrated Taxonomic Information SystemNucleotide DBIn Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge UK. An Outstation of the European Molecular Biology Laboratory (EMBL) is located in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. 
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet. DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.
Genomic DB: Genomic databases vary greatly in form and content. For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and how this data is stored.
Genomes Server - this server gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses.
Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation of why it is important to obtain these organisms' genomic sequences.
WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes).
FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.
MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
MIPS - the MIPS yeast database is an important resource for information on the yeast genome and its products.
SGD - the 'Saccharomyces Genome Database' is another major yeast database.
SPGP - the 'S. pombe Genome Project' based at the Sanger Institute is the database for genetic data on the fungus Schizosaccharomyces pombe.
AceDB - this is the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, which causes some confusion with the C. elegans database. The entire database can be downloaded from the Sanger Institute.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.
Protein DB: The protein databases are the most comprehensive source of information on proteins.
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of:
Primary protein sequence databases such as UniProtKB/Swiss-Prot
Specialised protein sequence databases such as GOA (Gene Ontology Annotation)
Specialised protein databases such as ENZYME
Secondary protein databases such as InterPro
Structure databases such as PDB
Integr8 - the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
2-dimensional gel electrophoresis data - a database is available from ExPASy and the Danish Centre for Human Genome Research (DCHGR).
Mass spectrometry protein data - a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.
Examples of secondary protein databases include:
PROSITE - The special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that, with appropriate computational tools, one can rapidly and reliably identify to which family of proteins a new sequence belongs.
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam.
PRINTS - A different approach to pattern recognition, termed "fingerprinting", is used by this database. Within a sequence alignment, it is usual to find not one but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.
Pfam - Another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory. This allows a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) on the protein families.
BLOCKS - Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
SBASE - This is a database of protein domain library sequences that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.
Software generally takes the name of the coded algorithm (next slide).
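The notes on this slide describe PROSITE patterns and contrast regular expressions with profiles and HMMs. As a rough illustration of the regular-expression end of that spectrum, the sketch below translates a pattern written in PROSITE syntax into a Python regular expression and scans a sequence with it. The function name prosite_to_regex, the example pattern and the toy sequence are assumptions made for illustration only; they are not taken from the slides, and the sketch covers only the basic syntax elements (x, [..], {..}, repeat counts, and the < and > anchors).

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden},
# optionally followed by a repeat count such as (3) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    pieces = []
    for raw in pattern.rstrip(".").split("-"):
        at_start = raw.startswith("<")   # '<' anchors to the N-terminus
        at_end = raw.endswith(">")       # '>' anchors to the C-terminus
        match = ELEMENT.fullmatch(raw.strip("<>"))
        if match is None:
            raise ValueError(f"unsupported pattern element: {raw!r}")
        core, low, high = match.groups()
        if core == "x":
            piece = "."                          # any residue
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"      # any residue except these
        else:
            piece = core                         # literal or [allowed set]
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(pieces)


# Illustrative pattern and made-up sequence (both are assumptions).
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
sequence = "MKCPECGKSFSQKSNLQKHQRTHTG"

regex = prosite_to_regex(pattern)
hit = re.search(regex, sequence)
print(regex)
if hit:
    print(hit.span(), hit.group())
```

A regular expression like this either matches or it does not, which is the limitation the notes attribute to patterns: profiles and HMMs score matches statistically instead.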
  10. To be stored, data need a standard representation. For sequences we have: …… And for building alignment tools, we have local match ………
  11. To be stored, data need a standard representation. For sequences we have: …… And for building alignment tools, we have local match ………
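The notes mention "local match" for alignment tools without naming an algorithm. One common way to compute a local alignment is the Smith-Waterman dynamic programming method; the sketch below assumes that method, and the function name and the scoring values (match=2, mismatch=-1, gap=-2) are illustrative choices, not values given in the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which is what
    # makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences (made up for illustration).
result = smith_waterman("AAAGATTACAGGG", "CCGATTACATT")
print(result)
```

With these toy inputs the shared segment GATTACA is recovered as the local match; real tools use substitution matrices and affine gap penalties rather than the flat scores assumed here.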
  12. This is the lab template. The context is a biological one, based on a real biological problem and a given hypothesis. I avoid saying "computer science", which is a strong word. When you read this template, your view differs from a computer scientist's. You want to understand how the tools used were built: the architecture of the system, the algorithm implementation, the quality of the resulting data, and so on.