Session i overview bioinfo dm and app mmc

  • Welcome to this bioinformatics lab on data manipulation using online and server tools. As the theme, we have chosen to study the interaction between Frataxin and pancreatic cancer.
  • Molecular online tools rest on biological databases and consist of many software programs (each program implements a data-manipulation algorithm or process). Databases cover bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway, sequence, RNA, organism, …
    Bibliographic DB
    MEDLINE is accessible through EBI's SRS. PUBMED is accessible through NCBI's ENTREZ. EMBASE is a commercial product for medical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field. The bibliographical databases, with the exception of MEDLINE/PUBMED, are only available through commercial database vendors.
    Taxonomy DB
     - NEWT
     - The Tree of Life project
     - Species 2000
     - IOPI: International Organization for Plant Information
     - ITIS: Integrated Taxonomic Information System
    Nucleotide DB
    In Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge, UK, an outstation of the European Molecular Biology Laboratory (EMBL), which is headquartered in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available.
    The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet. DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.
    Genomic DB
    Genomic databases vary greatly in form and content. For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and in how this data is stored.
     - Genomes Server - this server gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses.
     - Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
     - Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.
     - Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation of why it is important to obtain these organisms' genomic sequences.
     - WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes).
     - FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.
     - MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
     - RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
     - MIPS - the MIPS yeast database is an important resource for information on the yeast genome and its products.
     - SGD - the 'Saccharomyces Genome Database' is another major yeast database.
     - SPGP - the 'S. pombe Genome Project', based at the Sanger Institute, is the database for genetic data on the fungus Schizosaccharomyces pombe.
     - AceDB - this is the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, which causes some confusion with the C. elegans database itself. The entire database can be downloaded from the Sanger Institute.
     - HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.
    Protein DB
    The protein databases are the most comprehensive source of information on proteins.
    It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of:
     - Primary protein sequence databases such as UniProtKB/Swiss-Prot
     - Specialised protein sequence databases such as GOA (Gene Ontology Annotation)
     - Specialised protein databases such as ENZYME
     - Secondary protein databases such as InterPro
     - Structure databases such as PDB
    Integr8 - the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
    ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
    Two-dimensional gel electrophoresis data - a database is available from Expasy and the Danish Centre for Human Genome Research (DCHGR).
    Mass spectrometry protein data - a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.
    Examples of secondary protein databases include:
     - PROSITE - the special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that, with appropriate computational tools, one can rapidly and reliably identify the family of proteins to which a new sequence belongs.
       The profile structure used in PROSITE is similar to, but slightly more general than, the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam.
     - PRINTS - a different approach to pattern recognition, termed "fingerprinting", is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.
     - Pfam - another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory. They allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) on the protein families.
     - BLOCKS - blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
       The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
     - SBASE - this is a protein domain sequence library that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.
    Software generally takes the name of the coded algorithm (next slide).
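The PROSITE entry in these notes mentions patterns that computational tools can match against a new sequence. As a rough illustration only, the Python sketch below translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The function names prosite_to_regex and find_sites and the test sequence are made up for this example, and only a subset of the pattern syntax is handled. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern. This is not how PROSITE's own scanning tools are implemented.

```python
import re

# One pattern element: optional "<" anchor, a core (x, a letter, [ABC] or {ABC}),
# an optional repetition such as (2) or (2,4), and an optional ">" anchor.
_ELEMENT = re.compile(r"(<?)([^()<>]+)(?:\((\d+)(?:,(\d+))?\))?(>?)")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Assumption: only the common syntax is supported (x, [..], {..},
    repetition counts, and < / > anchors outside brackets).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."  # x stands for any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {ABC} means any residue except A, B, C
        # [ABC] and single letters are already valid regex fragments
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (1-based position, matched residues) for every hit, overlaps included."""
    # A lookahead makes each match zero-width, so overlapping sites are reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Hypothetical sequence, invented for this example.
print(find_sites("N-{P}-[ST]-{P}", "MKNGSAANPTLNVTQ"))
# [(3, 'NGSA'), (12, 'NVTQ')]
```

The asparagine at position 8 is skipped because it is followed by a proline, which the {P} element excludes. A regular expression of this kind reports only the short matching site. That is the contrast the notes draw with profiles and HMMs, which can delimit a whole domain.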
  • Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Software generally takes the name of the coded algorithm (next slide)Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Bibliographic DBMEDLINE is accessible through EBI's SRS. PUBMED is accessible through NCBI's ENTREZ.EMBASE is a commercial product formedical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes and zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field . The bibliographical databases, with the exception of MEDLINE/PUBMED, are only available through commercial database vendors. Taxonomy DB NEWT The Tree of Life project Species 2000 IOPI: International Organization for Plant Information ITIS: Integrated Taxonomic Information SystemNucleotide DBIn Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge UK. An Outstation of the European Molecular Biology Laboratory (EMBL) is located in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. 
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogenous, they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours. - Genomic DBGenomic databases vary greatly in form and contentFor organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in an electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and how this data is stored. Genomes Server - this server gives access to a hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses. Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae. 
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation as to why it is important to obtain these organisms genomic sequences. WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes). FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.The MIPS yeast database is an important resource for information on the yeast genome and its products. SGD - the 'Saccharomyces Genome Database' is another major yeast database.SPGP - the 'S. Pombe Genome Project' based at the Sanger Institute is the database for genetic data on the fungus Schizosaccharomycespombe.AceDB - this is the database for genetic and molecular data concerning Caenorhabditiselegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, resulting in some confusion relative to the C. elegans database. The entire database can be downloaded from the Sanger Institute.HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.- Protein DBThe protein databases are the most comprehensive source of information on proteins. 
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data; and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of the: Primary protein sequence databases such as UniProtKB/Swiss-ProtSpecialised protein sequence databases such as GOA Gene Ontology AnnotationSpecialised protein databases such as ENZYMESecondary protein databases such as InterProStructure databases such as PDBIntegr8 - The Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes. ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, Ligand Chemical Database for Enzyme Reactions such as LIGAND, and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.2 – dimensional gel electrophoresis data - a database is available from Expasy and the Danish Centre for Human Genome Research (DCHGR). Mass spectrometry protein data - a useful resource which includes protein cleavage products, is maintained at Rockefeller University. - Examples of secondary protein databases include: PROSITE - The special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which family of proteins the new sequence belongs. 
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al.,1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam. PRINTS - A different approach to pattern recognition, termed "fingerprinting" is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs, and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.Pfam - Another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the formers allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) of the protein families.BLOCKS - Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins. 
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.SBASE - This is a protein domain library sequences database that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.Software generally takes the name of the coded algorithm (next slide)
  • Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Software generally takes the name of the coded algorithm (next slide)Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Bibliographic DBMEDLINE is accessible through EBI's SRS. PUBMED is accessible through NCBI's ENTREZ.EMBASE is a commercial product formedical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes and zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field . The bibliographical databases, with the exception of MEDLINE/PUBMED, are only available through commercial database vendors. Taxonomy DB NEWT The Tree of Life project Species 2000 IOPI: International Organization for Plant Information ITIS: Integrated Taxonomic Information SystemNucleotide DBIn Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge UK. An Outstation of the European Molecular Biology Laboratory (EMBL) is located in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. 
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogenous, they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours. - Genomic DBGenomic databases vary greatly in form and contentFor organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in an electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and how this data is stored. Genomes Server - this server gives access to a hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses. Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae. 
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation as to why it is important to obtain these organisms genomic sequences. WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes). FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.The MIPS yeast database is an important resource for information on the yeast genome and its products. SGD - the 'Saccharomyces Genome Database' is another major yeast database.SPGP - the 'S. Pombe Genome Project' based at the Sanger Institute is the database for genetic data on the fungus Schizosaccharomycespombe.AceDB - this is the database for genetic and molecular data concerning Caenorhabditiselegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, resulting in some confusion relative to the C. elegans database. The entire database can be downloaded from the Sanger Institute.HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.- Protein DBThe protein databases are the most comprehensive source of information on proteins. 
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data; and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of the: Primary protein sequence databases such as UniProtKB/Swiss-ProtSpecialised protein sequence databases such as GOA Gene Ontology AnnotationSpecialised protein databases such as ENZYMESecondary protein databases such as InterProStructure databases such as PDBIntegr8 - The Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes. ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, Ligand Chemical Database for Enzyme Reactions such as LIGAND, and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.2 – dimensional gel electrophoresis data - a database is available from Expasy and the Danish Centre for Human Genome Research (DCHGR). Mass spectrometry protein data - a useful resource which includes protein cleavage products, is maintained at Rockefeller University. - Examples of secondary protein databases include: PROSITE - The special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which family of proteins the new sequence belongs. 
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al.,1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam. PRINTS - A different approach to pattern recognition, termed "fingerprinting" is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs, and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.Pfam - Another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the formers allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) of the protein families.BLOCKS - Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins. 
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.SBASE - This is a protein domain library sequences database that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.Software generally takes the name of the coded algorithm (next slide)
  • To be stored, data need a standard representation. For the sequence we have: …………………… And for building alignment tools, we have local match ………..
  • Software generally takes the name of the algorithm it implements. Hundreds of algorithms are available. For details on the algorithm behind each tool, links are usually provided in the website's list of tools. If that fails, typing the name of the tool into a search engine (Google) is enough to find the algorithm it refers to. For example, biological sequences and data can be analyzed in many ways with bioinformatics tools: they can be read, assembled, compared, mapped, predicted, designed, modeled.
  • This is the lab template. The context is biological, based on a real biological problem and a given hypothesis. I don’t say "computer science"; it is a strong word. When you read this template, you have a different view than a computer scientist. You want to understand the process used to build the tools: the architecture of the system, the algorithm implementation, the quality of the resulting data, and so on.

Transcript

  • 1. Data Manipulation: Molecular Online and Server Tools & BioExtract Server. Theme: FXN Gene and Pancreatic Cancer. Etienne Z. Gnimpieba, BRIN WS 2013, Mount Marty College – June 24th 2013. Etienne.gnimpieba@usd.edu
  • 2. Data Manipulation Molecular Online Tools: BioExtract Server. Review: Databases
      Metabolic:
        • Sabio-RK (check with Brent)
        • KEGG (check with Brent)
        • HMDB (hmdb.ca, contact for API)
        • SMPDB (http://www.smpdb.ca)
        • BioModels
        • drugDB
        • BRENDA (check with Brent)
        • [Mathis project]
      Protein:
        • ExPASy DB collection (UniProt, )
        • PDB
        • SBKB
        • STRING
      Genomic:
        • G.E.O.
        • GenBank
        • GO
        • EBI Array Express & Gene Atlas
      Phenomic:
        • PhenomicDB
        • Phenoscape
  • 3. Review: Databases. Active Network Extraction & Analysis
      Diagram labels, in slide order:
        • Reactome Functional Interaction network
        • Disease subnetwork
        • Extract mutated, overexpressed, underexpressed, expanded/deleted genes
        • Add linker genes
        • Disease "modules"
        • Disease gene prediction
        • Sample classification
        • Hypothesis generation
        • Apply community clustering algorithms
  • 4. Review: Databases. Pancreatic Cancer Module Map (43 Cases)
      Modules:
        • p53, SMAD, TGFβ, TNF signaling
        • KRAS, MAPK signaling
        • Heterotrimeric G-protein signaling
        • Rho GTPase signaling
        • Transcription & translation
        • Cell cycle
        • Wnt & Cadherin signaling
        • Hedgehog signaling
        • Transcription
        • Zinc fingers
        • Ca2+ signaling
      Non-silent mutations:
        • blue – in primary tumour only
        • green – in xenograft only
        • red – in primary & xenograft
      Christina Yung / Bioinformatics.ca
  • 5. Review: Databases. Molecular Biology Databases
      Bibliographic: MEDLINE, PubMed, EMBASE, BIOSIS, CAB International, AGRICOLA
      Taxonomic: NEWT, The Tree of Life, Species 2000, IOPI, ITIS
      Metabolic pathway: KEGG, EcoCyc, BRENDA, ENZYME, BioModels, Reactome
      Nucleotide: INSDC, EMBL, DDBJ, NCBI GenBank
      Genomic: SPGP, AceDB, HIV-SD, Ensembl, WormBase, FlyBase, MGD, SGD, EBI (Genome server, Karyn's Genomes), RGD
      Protein: GOA, ENZYME, InterPro, PDB, Integr8, MEROPS, LIGAND, EMP, DCHGR, PROSITE, PRINTS, Pfam, BLOCKS, SBASE, UniProt/Swiss-Prot, PIR
  • 6. Review: data format. Sequence type and accession number
      DNA sequence from GenBank, EMBL or DDBJ: 1 letter + 5 digits (U43752) or 2 letters + 6 digits (AF462052)
      GenPept sequence from GenBank, EMBL or DDBJ: 3 letters + 5 digits (AAF46449)
      Protein sequence from SwissProt: 1 letter + 5 digits (Q16595)
      Protein sequence from the Protein Research Foundation: 6 or 7 digits + 1 letter (2808353A)
      RefSeq sequence: 2 letters + _ + 6 or more digits (mRNA: NM_******; protein: NP_******)
      Protein sequence from the Protein Data Bank (PDB): 1 digit + 3 letters (2EFF)
      Protein sequence from the Molecular Modeling DataBase (MMDB): ID + more than 4 digits (MMDB ID 767744)
      Header lines: >gi|XXXX|XXX and >sp|XXXX|XXX (field labels on the slide: Gene Info number, Accession number, Species reference)
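The accession formats on this slide can be read as simple patterns. The Python sketch below is only an illustration of that idea: the names `ACCESSION_PATTERNS` and `classify_accession` are hypothetical, the patterns are transcribed from the slide (RefSeq is assumed to mean six or more digits, and the MMDB format is left out), and real accession rules are broader than this. Because "1 letter + 5 digits" appears twice in the table, the function returns every candidate type rather than a single answer.

```python
import re

# Patterns transcribed from the slide's table (an approximation, not the
# databases' official rules). Order follows the slide.
ACCESSION_PATTERNS = [
    ("GenBank/EMBL/DDBJ nucleotide", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("GenPept protein", re.compile(r"[A-Z]{3}\d{5}")),
    ("Swiss-Prot protein", re.compile(r"[A-Z]\d{5}")),
    ("PRF protein", re.compile(r"\d{6,7}[A-Z]")),
    ("RefSeq mRNA", re.compile(r"NM_\d{6,}")),      # assumption: 6 or more digits
    ("RefSeq protein", re.compile(r"NP_\d{6,}")),   # assumption: 6 or more digits
    ("PDB structure", re.compile(r"\d[A-Z]{3}")),
]


def classify_accession(accession):
    """Return the names of all slide formats that the accession matches."""
    text = accession.strip().upper()
    return [name for name, pattern in ACCESSION_PATTERNS
            if pattern.fullmatch(text)]


if __name__ == "__main__":
    for example in ["U43752", "AF462052", "AAF46449", "Q16595", "2808353A", "2EFF"]:
        print(example, classify_accession(example))
```

Note that U43752 and Q16595 both come back with two candidate types, which is why the source database should be recorded alongside the accession number.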
  • 7. Review: Online Programs & Algorithms
    Biological sequences and data can be analyzed in many ways with bioinformatics tools. They can be read, assembled, compared, mapped, predicted, designed, modeled…
    1. Nucleotide and protein sequence searching (blastall, SSEARCH for FASTA local, GLSEARCH for global)
    2. Multiple sequence alignment (ClustalW2, MView, …)
    3. Pairwise sequence alignment (Needle for global, LALIGN for local)
    4. Protein functional analysis (SMART, Phobius, InterProScan)
    5. Functional genomic tools (R-tools, SAIL, EFOtools)
    6. Molecular structure analysis (PDBeFold, QuaternaryStructure, …)
    7. Scientific literature text mining (EBIMed, Whatizit)
    8. Sequence translation (Transeq, readseq, Backtranseq, …)
    9. Data retrieval and ID mapping (dbfetch, ENA/SRA, SRS, PICR)
    10. Protein structure prediction tools
    11. …
  • 8. Boolean operators and symbols
    - AND = term1 AND term2 must both exist in the searched documents
    - OR = term1 OR term2 must exist
    - NOT = term1 must not be present in any of the displayed documents
    - ALL = term1 must be present in all of the displayed documents
    - + term1 = document must contain term1
    - - term1 = document must not contain term1
    - XXX* = any characters are accepted after XXX
    - XX?X = any character is accepted at the position of the ? (for example, Y in XXYX)
    Examples:
    - FXN [AND] gene [NOT] Frataxin → all data related to the FXN gene except those concerning the Frataxin protein
    - ataxia + apraxia + gene → all genes related to ataxia and apraxia
    - Ada* [AUTH] → all authors whose names begin with Ada
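To make the operator semantics concrete, the following Python sketch applies AND, NOT, and wildcard terms to a few in-memory strings. It is a local simulation under stated assumptions, not how PubMed or any other database parses queries: the helper names term_in and matches, and the sample documents, are hypothetical and exist only for this example.

```python
import re
from fnmatch import fnmatchcase


def term_in(term, text):
    """True if any word of text matches term; * and ? act as wildcards."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return any(fnmatchcase(word, term.lower()) for word in words)


def matches(text, must=(), must_not=()):
    """AND over the `must` terms, NOT over the `must_not` terms."""
    if not all(term_in(term, text) for term in must):
        return False
    return not any(term_in(term, text) for term in must_not)


# Hypothetical documents, used only to exercise the operators.
DOCS = [
    "FXN gene maps to chromosome 9",
    "FXN gene encodes the frataxin protein",
    "Adams and Adair reviewed ataxia",
]

if __name__ == "__main__":
    # FXN [AND] gene [NOT] Frataxin
    print([d for d in DOCS if matches(d, must=["FXN", "gene"], must_not=["frataxin"])])
    # Ada* as a prefix wildcard
    print([d for d in DOCS if term_in("Ada*", d)])
```

Field tags such as [AUTH] are left out on purpose, since they depend on the fields each database indexes.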
  • 9. Similarity search tools
    - BLAST (Basic Local Alignment Search Tool): comparing a protein or a DNA sequence to other sequences
    - FASTA (FAST-ALL): fast protein or nucleotide comparison
  • 10. Review: Sequence Analysis
    - Global match: align all residues of one sequence with all residues of the other sequence
    - Local match: find a region in one sequence that matches a region of the other
    - Motif match: find matches of a short sequence in one or more regions internal to another, longer sequence; a match can be perfect or can contain mismatches, insertions, or deletions
    - Multiple alignment: a mutual alignment of many sequences
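The difference between a global and a local match can be shown with one small dynamic-programming routine. This is a minimal sketch, assuming a simple scoring scheme (match +1, mismatch -1, linear gap -1) chosen for illustration; the function name alignment_score is hypothetical. With local=False it computes a Needleman-Wunsch style global score, and with local=True a Smith-Waterman style local score. It returns only the score, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best global score, or best local score when local=True."""
    cols = len(b) + 1
    # First row: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "ACT"))                        # global
    print(alignment_score("TTTACGTTT", "GGACGGG", local=True))   # local
```

The local score ignores the unrelated flanks around the shared ACG region, whereas a global alignment of the same pair would be penalized for them.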
  • 11. Review: Sequence Analysis
    Sequence alignment: assignment of residue-residue correspondence. It is used to:
    - Determine phylogenetic relationships by analyzing similarity and homology
      Similarity: observation or measurement of resemblance and difference
      Homology: the sequences and the organisms in which they occur are descended from a common ancestor → homology must be an inference from observation of similarity
    - Determine if a protein (or a gene) is related to a larger group of proteins
    - Verify if a mutated residue is conserved within species
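Similarity is something that can be measured directly on an alignment, while homology has to be inferred from it. The Python sketch below shows two such measurements under stated assumptions: percent identity is computed over columns where neither sequence has a gap (other definitions use a different denominator), and a column counts as conserved only if every sequence carries the same residue. The function names and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    identical = sum(x == y for x, y in pairs)
    return 100.0 * identical / len(pairs)


def is_conserved(alignment, column, gap="-"):
    """True if every sequence in the alignment has the same residue at this column."""
    residues = {seq[column] for seq in alignment}
    return len(residues) == 1 and gap not in residues


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
    print(is_conserved(["MKV", "MRV", "MKV"], 0))
```

A high identity value supports, but does not prove, common ancestry; that step remains an inference.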
  • 12. Molecular Online Tools and Server
    Context
    0. Specification & Aims
    Statement of problem / Case study: The FXN gene provides instructions for making a protein called frataxin. This protein is found in cells throughout the body, with the highest levels in the heart, spinal cord, liver, pancreas, and muscles used for voluntary movement (skeletal muscles). Within cells, frataxin is found in energy-producing structures called mitochondria. Although its function is not fully understood, frataxin appears to help assemble clusters of iron and sulfur molecules that are critical for the function of many proteins, including those needed for energy production. Mutations in the FXN gene cause Friedreich ataxia, a genetic condition that affects the nervous system and causes movement problems. Most people with Friedreich ataxia begin to experience the signs and symptoms of the disorder around puberty.
    Keywords:
    - Bio: FXN, Frataxin, pancreatic cancer, CDKN4
    - Math: HMM
    - Informatics: programming, bioinformatics tools, getting and exporting data
    Reduced expression of frataxin is the cause of Friedreich's ataxia (FRDA), a lethal neurodegenerative disease; how about liver cancer?
    Aim: The purpose of this lab is to introduce online tools for exploring large-scale human data (metabolic, proteomic, genomic, …). We simulate their application to the FXN gene and pancreatic cancer, to show how a researcher can identify and cross-reference the biological knowledge available in data banks.
    Acquired skills (online and server tools):
    - Query biological DB (FASTA, HTML, txt, figure formats)
    - Sequence tools (protein and gene): alignment (showalign, clustalw2), similarity, …
    - Manage data results (select, keep, map, export)
    - Build and reuse workflows
    Diagram labels: Biological hypothesis (FXN on chromosome 9; frataxin molecule structure (PyMOL); pancreatic cancer; pancreas anatomy), Biological DB, Tools, Resolution process, Results
    Resolution process:
    T1. Metabolomics
    Objective: Use metabolic data repositories to understand the frataxin protein mechanism.
    - T1.1. Finding the enzyme and pathway related to Frataxin using KEGG
    - T1.2. Finding the reaction involving Frataxin using Reactome
    - T1.3. Using BRENDA for enzyme data on Frataxin
    - T1.4. Using the collected data for analysis
    - T1.5. Redo the process with pancreatic cancer
    T2. Genome exploration
    Objective: Use Ensembl to localize FXN on the human genome and identify the genes implicated in pancreatic cancer.
    - T2.1. Locate a given gene on the human genome
    - T2.2. Get a genomic sequence from NCBI
    - T2.3. Get the protein data and sequence from EBI
    - T2.4. Save the exported sequence data in the data folder
    T3. Sequence manipulation
    Objective: Find similar sequences using BLAST tools and align the given sequences.
    - T3.1. Find similar sequences using the BLAST tool
    - T3.2. Align the generated sequences with the ClustalW tool
    - T3.3. Visualize the result as a phylogenetic tree in Jalview
    T4. Protein Data and Structural Biology Knowledge
    Objective: Study frataxin at the protein level and its connection with pancreatic cancer (functional and structural data).
    - T4.1. Structural knowledge on Frataxin using SBKB
    - T4.2. Using UniProt for Frataxin protein study
    - T4.3. Protein-protein interaction using STRING
    - T4.4. Using the same method for pancreatic cancer and comparing
    T5. BioExtract Server
    Objective: Use a server tool to optimize the data manipulation process, applied on the BioExtract Server.
    - T5.1. Server initialization
    - T5.2. Pancreatic cancer & Frataxin (FXN)
    - T5.3. Mapping, alignment
    - T5.4. Workflow save & reuse