There are messages hidden within our genome, regulating when and how long a gene is switched on. The presentation describes a method, STREAM, targeted at deciphering this regulatory code.
Traditionally the gene expression pathway was regarded as being composed of independent steps, from RNA transcription to protein translation. To date, there is increasing evidence for coupling between the different processes of the pathway, specifically between transcription and splicing. Given the extensive cross-talk between these processes, we derived a transcription-splicing integrated network. The nodes of the network included experimentally verified human proteins belonging to three groups of regulators: transcription factors (TFs), splicing factors (SFs) and kinases. The nodes were wired by instances of predicted transcriptional and alternative splicing regulation. Analysis of the network indicated pervasive cross-regulation among the nodes. Specifically, SFs were significantly more often regulated by alternative splicing relative to the two other subgroups, while TFs were more extensively controlled by transcriptional regulation. In particular, we found a significant preference of specific TF-TF and SF-SF pairs to regulate their target genes, with SFs being the most regulated group via independent and combinatorial binding of SFs. Consistent with the extensive cross-regulation among the splicing and transcription factors, the subgroup of kinases within the network had the highest density of predicted phosphorylation sites. The prevalent regulation of the regulatory proteins was further supported by computational analysis of the protein sequences, demonstrating the propensity of these proteins to be highly disordered relative to other proteins in the human proteome. Overall, our systematic study reveals that an organizing principle in the logic of integrated networks favors the regulation of regulatory proteins by the specific regulation they conduct. Based on these results we propose a new regulatory paradigm, postulating that fine-tuned gene expression regulation of the master regulators in the cell is commonly achieved by cross-regulation.
1) Randomly select positions to project sequences onto lower-dimensional "buckets" based on letters at those positions.
2) Recover motifs from buckets containing multiple sequences by building frequency matrices and refining with EM.
3) The best motif is the one with the highest score, where score is based on the likelihood ratio of sequences matching the motif model versus background.
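The three steps above can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: the function names (`projection_buckets`, `frequency_matrix`, `log_likelihood_ratio`), the uniform background of 0.25 per base, and the pseudocount are all assumptions, and the EM refinement from step 2 is deliberately left out.

```python
import math
import random
from collections import defaultdict

def projection_buckets(sequences, l, k, rng):
    """Step 1: hash every l-mer by the letters at k randomly chosen positions."""
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((seq_index, start))
    return positions, buckets

def frequency_matrix(sequences, members, l, pseudocount=1.0):
    """Step 2 (without EM): per-position base frequencies from a bucket's l-mers."""
    matrix = [{base: pseudocount for base in "ACGT"} for _ in range(l)]
    for seq_index, start in members:
        for offset, letter in enumerate(sequences[seq_index][start:start + l]):
            matrix[offset][letter] += 1
    for column in matrix:
        total = sum(column.values())
        for base in column:
            column[base] /= total
    return matrix

def log_likelihood_ratio(lmer, matrix, background=0.25):
    """Step 3: log of P(l-mer | motif model) / P(l-mer | uniform background)."""
    return sum(math.log(matrix[i][base] / background) for i, base in enumerate(lmer))
```

In use, one would repeat `projection_buckets` over many random position sets, build a matrix for each bucket that holds l-mers from several sequences, and keep the matrix whose member l-mers score highest. Identical planted instances always share a bucket, whichever positions are drawn.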
This document discusses DNA replication in E. coli. It notes that E. coli has a single circular chromosome that is replicated from a single origin through semi-conservative replication. There are 14 termination sites that bind to the Tus protein to block replication forks. Certain nucleotides in the termination sites are important for Tus binding affinity and formation of a "TT-lock" that further stabilizes protein-DNA interactions to arrest replication forks. While not all sites form a TT-lock, fork arrest is achieved through both Tus affinity and TT-lock formation.
The document discusses genetic motifs and promoters. It explains that transcription factor binding sites (TFBS) are short DNA sequences that transcription factors bind to in order to regulate gene expression. It then describes how the assembly of promoter protein complexes occurs through multiple stages involving different transcription factors binding to TFBS. The document also introduces information theory concepts like entropy and mutual information that can be used to detect motifs through measuring correlations between sequences. It provides an example algorithm that uses a sliding window approach to calculate mutual information between a probe TFBS sequence and candidate sequences in order to identify new potential TFBSs.
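A sliding-window mutual information scan of the kind described above might look like the following sketch. The summary does not give the exact estimator used in the slides, so this assumes the simplest one: the probe and each candidate window are treated as aligned symbol pairs, and probabilities are estimated from counts within that single window. The names `mutual_information` and `scan` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information in bits between two equal-length, position-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    n = len(a)
    joint = Counter(zip(a, b))
    count_a = Counter(a)
    count_b = Counter(b)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((count_a[x] / n) * (count_b[y] / n)))
    return mi

def scan(probe, target):
    """Slide a probe-sized window along target and score each offset."""
    width = len(probe)
    return [(i, mutual_information(probe, target[i:i + width]))
            for i in range(len(target) - width + 1)]
```

One caveat with this estimator: mutual information is unchanged by a consistent relabelling of bases, so a window that is a letter-for-letter substitution of the probe scores as high as an exact copy. A practical scan would combine it with a direct similarity check.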
Semantic Web Approaches to Candidate Gene Identification - Simon Twigger
The document describes using semantic web approaches and ontologies to integrate and annotate genomic data from rat expression studies. Key points discussed include using the NCBO Annotator to annotate datasets, curating the results, linking annotations to genes and pathways in a triple store, and integrating strain and tissue level expression data into the Rat Genome Database. The goal is to enable researchers to better search and explore genomic data to identify candidate genes for phenotypes.
Presentation of our data and results from the first antibody staining of the new allele, evePJ4 (created by Dr. Amy Bejsovec's lab) and the wgNE2 allele.
The document discusses the structure of hemoglobin and how its structure allows it to effectively transport oxygen throughout the body. It details that hemoglobin is a tetrameric protein containing heme groups that bind oxygen. The intersections of the protein's alpha helices form binding sites for oxygen molecules. There are three main types of hemoglobin that have similar structures but can be modified through bonding with other molecules or under certain conditions, altering their oxygen affinity. The structure of hemoglobin plays a crucial role in its function as an oxygen carrier in the blood.
1. The document is Stephen Corvini's take-home test on molecular cell biology. It contains his answers to 3 questions about red blood cell membrane proteins, protein isolation techniques, and lysosomal protein transport.
2. Corvini describes how red blood cell membrane proteins like spectrin, band 3, and band 4.1 interact to regulate transport and maintain cell structure. He also explains techniques for isolating integral membrane proteins using detergents.
3. In his answers, Corvini discusses N-linked and O-linked oligosaccharides, mannose-6-phosphate receptor mediated lysosomal transport, and molecular defects in I-cell disease.
The cell membrane contains cholesterol and proteins that are vital to many cell processes. Cholesterol levels increase before cell proliferation and help maintain membrane structure. Integrin and NCAM proteins in the membrane allow communication with other cells and the environment to promote cell survival. The membrane also regulates transport of molecules in and out of the cell through passive diffusion, pumps, and carrier proteins. Phagocytosis relies on phospholipases and other molecules to engulf cellular debris. Overall, the complex but foundational cell membrane enables critical functions through its composition and structure.
The engrailed gene is a segment polarity gene in Drosophila melanogaster that plays several important roles during development. It defines the posterior region of each embryonic parasegment, establishing anterior-posterior polarity. The engrailed gene also helps pattern the brain by defining borders between regions and guiding neuronal axon growth. Comparisons of engrailed DNA and protein sequences across species show it is conserved and related genes can be found in vertebrates as well.
Developmental cascade of morphogens Define Drosophila Body Plan - Douglas Easton
The expression of genes in specific regions of the early Drosophila embryo determines the anterior-posterior and dorso-ventral axes of the organism. Expression of these genes is both spatially and temporally coordinated.
The document provides an overview of computational chemistry methods for structure-activity relationship analysis, pharmacophore modeling, and protein-ligand docking. It discusses topics like SAR, QSAR, molecular alignment, conformational analysis, homology modeling of protein targets, and docking programs. Examples are given of applying these methods to study benzodiazepine ligands and GABA receptor subtypes.
Gene expression in eukaryotes is regulated through multiple mechanisms at the transcriptional and post-transcriptional levels. These mechanisms allow for adaptation, tissue specificity, and development. Regulation occurs through chromatin remodeling, enhancers/repressors, locus control regions, gene amplification, rearrangement, and alternative RNA processing. Key differences between prokaryotic and eukaryotic gene expression include larger eukaryotic genomes, different cell types, lack of operons, chromatin structure, and uncoupled transcription/translation.
Artificial intelligence (AI) is everywhere, promising self-driving cars, medical breakthroughs, and new ways of working. But how do you separate hype from reality? How can your company apply AI to solve real business problems?
Here’s what AI learnings your business should keep in mind for 2017.
Khaled El Masry is an assistant lecturer of Human Anatomy & Embryology at Mansoura University, Egypt. Many thanks to Prof. Dr. Salwa Gawish, professor of Cytology & Histology at Mansoura University, for her great effort in explaining the Genetics course.
The document discusses several key aspects of gene prediction including:
1. Gene prediction algorithms use signals like start/stop codons, splice sites, and open reading frames to identify genes computationally with near 100% accuracy.
2. There are ab initio, homology-based, and probabilistic models like Hidden Markov Models that can predict prokaryotic and eukaryotic genes.
3. Eukaryotic gene prediction is more challenging due to larger genomes, fewer genes, and intron-exon structures. Programs must consider splicing, polyadenylation, and other post-transcriptional modifications.
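The simplest signal-based step mentioned in the list above, finding open reading frames from start and stop codons, can be sketched as follows. This is a toy illustration that assumes an intron-free forward strand. It is not a eukaryotic gene predictor, and `find_orfs` with its `min_codons` threshold is a hypothetical helper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop stretches on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG")` reports a single ORF spanning positions 2 to 14 in frame 2. Real predictors layer splice-site, codon-usage and homology evidence on top of this kind of scan.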
Data Driven Process Optimization Using Real-Coded Genetic Algorithms ~ lecture slides by Professor 陳奇中 - Chyi-Tsong Chen
The document describes the development of data-driven techniques for process optimization using real-coded genetic algorithms. It introduces genetic algorithms and how they are inspired by biological evolution. It then describes real-coded genetic algorithms and the key steps of the algorithm, including reproduction, crossover, and mutation. Finally, it discusses applications of real-coded genetic algorithms for single-objective and multi-objective optimization problems, and their use in optimizing metalorganic chemical vapor deposition processes.
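The reproduction, crossover and mutation steps named above can be illustrated with a compact real-coded genetic algorithm. The specific operators here (binary tournament selection, arithmetic crossover, Gaussian mutation, single-elite survival) are assumptions chosen for brevity. The summary does not say which operators the slides use, and `real_coded_ga` is a hypothetical name.

```python
import random

def real_coded_ga(objective, bounds, pop_size=30, generations=60,
                  mutation_rate=0.1, mutation_scale=0.1, seed=0):
    """Minimise `objective` over box `bounds`; chromosomes are lists of floats."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    history = []  # best objective value seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=objective)
        history.append(objective(pop[0]))
        next_pop = [pop[0][:]]  # elitism: carry the best individual over unchanged
        while len(next_pop) < pop_size:
            # Reproduction: two binary tournaments pick the parents.
            p1 = min(rng.sample(pop, 2), key=objective)
            p2 = min(rng.sample(pop, 2), key=objective)
            # Crossover: a random convex combination of the two parents.
            alpha = rng.random()
            child = [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]
            # Mutation: Gaussian perturbation, clipped back into the bounds.
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    step = rng.gauss(0.0, mutation_scale * (hi - lo))
                    child[i] = max(lo, min(hi, child[i] + step))
            next_pop.append(child)
        pop = next_pop
    best = min(pop, key=objective)
    return best, objective(best), history
```

Because genes are real numbers rather than bit strings, crossover and mutation act directly on the decision variables, which is the main practical difference from a binary-coded GA. The elite copy guarantees that the best value never gets worse from one generation to the next.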
These slides were used for a tutorial I gave at GECCO 2010. These are similar, yet not identical, to the other tutorials. The keynote file is too large for slideshare but if anybody needs the original I would be happy to provide a url from where to download it.
1. DNA contains the genetic instructions used in the development and functioning of all living organisms. It is made up of four chemical bases (A, T, C, G) that form base pairs between strands.
2. DNA replicates through a semi-conservative process where the double helix unwinds and each strand acts as a template for new partner strands. This ensures genetic information is preserved as cells divide.
3. Genes encoded in DNA are expressed via transcription of DNA to mRNA and translation of mRNA to proteins. Transfer RNA (tRNA) molecules match mRNA codons to amino acids during protein synthesis.
AGRF in conjunction with EMBL Australia recently organised a workshop at Monash University Clayton. This workshop was targeted at beginners and biologists who are new to analysing Next-Gen Sequencing data. The workshop also aimed to provide users with a snapshot of bioinformatics and data analysis tips on how to begin to analyse project data. An introduction to RNA-seq data analysis was presented by AGRF Senior Bioinformatician Dr. Sonika Tyagi.
Presented: 1st August 2012
2018-05-24 Research update on Armadillo Repeat Proteins: Evolution and Design... - Spencer Bliven
This document discusses armadillo repeat proteins and their potential use in protein-protein binding applications. It provides background on armadillo repeats and their biological roles. The document then discusses using armadillo repeats as an alternative to antibodies for applications like therapeutics and assays by rationally designing armadillo repeat proteins to bind specific peptide targets. It outlines the author's approach to modeling armadillo repeat evolution and using machine learning to predict binding abilities from sequence.
Transposable elements, or transposons, are DNA sequences that can move within genomes. There are two main classes of transposons: those that encode proteins to directly move the DNA element, and retrotransposons that move via an RNA intermediate using reverse transcriptase. Barbara McClintock discovered transposons in the 1940s and 1950s through her studies of maize, where she observed "jumping genes" that caused mosaic color patterns in kernels. Transposons are found in both prokaryotes and eukaryotes and can insert into new locations in genomes, sometimes causing mutations. They have played an important role in genome evolution and can continue to induce genetic variation.
This document summarizes a presentation given by Dr. Jo Vandesompele on state-of-the-art normalization of RT-qPCR data. It discusses the importance of normalization to remove experimental variation and introduces the geNorm algorithm for determining the optimal number and combination of reference genes for normalization. GeNorm has become the standard method for reference gene validation and normalization and has improved qPCR data analysis. The document also proposes a novel global mean normalization strategy for large-scale gene expression studies.
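As a rough illustration of how reference-gene stability can be scored, the sketch below computes a geNorm-style M value: for each candidate gene, the average over all other candidates of the standard deviation of their log2 expression ratios across samples. This follows the measure as it is commonly described, not the geNorm software itself. The use of the sample standard deviation, the input layout, and the name `genorm_m_values` are assumptions.

```python
import math
from statistics import mean, stdev

def genorm_m_values(expression):
    """Map gene -> stability value M (lower means more stable).

    `expression` maps each gene name to its relative quantities (linear
    scale, all positive), one value per sample, in the same sample order.
    Needs at least two genes and two samples.
    """
    genes = list(expression)
    m_values = {}
    for j in genes:
        pairwise_variation = []
        for k in genes:
            if k == j:
                continue
            log_ratios = [math.log2(a / b)
                          for a, b in zip(expression[j], expression[k])]
            pairwise_variation.append(stdev(log_ratios))
        m_values[j] = mean(pairwise_variation)
    return m_values
```

Two genes whose ratio is constant across samples contribute zero pairwise variation to each other, so co-varying candidates end up with low M while a gene that drifts independently ends up with the highest M.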
The document describes research on modeling synthetic genetic AND gates using computational modeling. It discusses two approaches to modeling - assisting with analysis and design, or describing existing system behavior. It then details the development of a model for a synthetic genetic AND gate using elementary chemical reactions and stochastic simulation. The model was able to capture experimental results but required refinements to account for promoter leakiness. Overall, modeling helped understand the genetic components and could be applied to larger designs.
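To make the stochastic-simulation idea concrete, here is a minimal Gillespie-style sketch of an AND gate with optional promoter leakiness. It is a heavily reduced stand-in for the elementary-reaction model the summary mentions: output is produced at rate `k_on` only when both inputs are present, plus a constant leak `k_leak`, and degrades at rate `gamma` per molecule. All rate names and default values are hypothetical.

```python
import random

def simulate_and_gate(a, b, k_on=5.0, k_leak=0.0, gamma=0.1, t_end=50.0, seed=0):
    """Return (final_count, peak_count) of the output protein.

    `a` and `b` are 0 or 1 and stand for the presence of each input signal.
    """
    rng = random.Random(seed)
    t, protein, peak = 0.0, 0, 0
    while True:
        production = k_on * a * b + k_leak   # AND logic plus leaky expression
        degradation = gamma * protein
        total = production + degradation
        if total == 0.0:
            break                            # no reaction can fire any more
        t += rng.expovariate(total)          # waiting time to the next event
        if t > t_end:
            break
        if rng.random() * total < production:
            protein += 1
        else:
            protein -= 1
        peak = max(peak, protein)
    return protein, peak
```

With `k_leak=0.0` the gate stays silent unless both inputs are on. Setting a small positive `k_leak` reproduces the kind of background expression that the summary says the model had to be refined to capture.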
This document discusses progress towards generating a complete telomere-to-telomere assembly of the human genome using long-read sequencing technologies. It summarizes:
1. An assembly of the X chromosome from telomere-to-telomere including the 2.8 Mb centromeric repeat, demonstrating high accuracy and continuity across repetitive regions.
2. A novel polishing strategy that improves base quality in repeat-rich regions, demonstrated on the X chromosome assembly.
3. Ongoing efforts as part of the Telomere-to-Telomere Consortium to generate complete assemblies of all human chromosomes to finish the human reference genome.
This document proposes a vertical cavity surface emitting terahertz laser based on polariton lasers. A two-photon absorption process is used to excite a 2p exciton state, which then populates the 1s polariton state via THz transition. The population of the 1s state then stimulates additional THz transitions. Simulations show that the THz emission rate increases at the polariton lasing threshold, and that the 1s population increases exponentially above threshold. A hybrid GaN/GaAs design is proposed that could operate at room temperature without doping and emit both blue and red light to indicate THz emission.
The lecture describes the basic concepts of C-value, Cot curve and Rot curve analysis, along with multiple-choice questions on these topics. Queries are always welcome. Dr. Nitin Wahi (wahink@gmail.com).
An introduction to promoter prediction and analysis - Sarbesh D. Dangol
This document provides an introduction to promoter prediction and analysis in plants. It discusses what promoters are, including their cis-acting elements and core promoter regions. It describes different types of promoters such as constitutive, spatiotemporal, and inducible promoters. It also discusses models for finding binding sites in promoters and experimental approaches for identifying regulatory elements like chromatin immunoprecipitation. Finally, it mentions some bioinformatics tools and databases that can be used for promoter analysis.
The document discusses methods for assembling and annotating repetitive DNA elements from genome survey sequencing (GSS) data in grasses and other plant groups. In grasses like rice, sorghum, and maize, the authors were able to recover full-length transposable elements and estimate abundance, due to existing reference libraries. However, in the order Asparagales, which contains very large genomes, reference libraries are highly diverged, annotation is more difficult, and abundances may not be accurately estimated. The authors find variation in repetitive DNA content even among closely related lineages in Asparagales.
The document discusses next generation sequencing methods and RNA sequencing. It covers topics like sequencing formats, data analysis workflows including mapping, clustering, assembly programs, finding new genes and correcting existing ones. It discusses input file types, calculating sequencing depth, available tools for alignment, output file formats, assembly programs, splice junction prediction, and applications of RNA sequencing like gene expression analysis and annotation.
This document discusses how Hadoop can be used for bioinformatics applications. It provides examples of how Hadoop has been used to efficiently process large genomic datasets, such as read mapping and genome assembly, in a distributed, parallel manner. Hadoop allows bioinformatics workflows and algorithms to be rethought and scaled to handle the growing size of genomic data. Key applications discussed include read mapping, variant discovery, and de novo assembly.
Genome editing technologies allow genetic material to be added, removed or altered at specific locations in an organism's genome. Several approaches exist, including zinc finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs), CRISPR/Cas9, and base editors. These tools create precise breaks in DNA that can be repaired through non-homologous end joining or homology-directed repair. They enable trait discovery and crop improvement by generating plants with high yield, stress resistance, or other desired properties. While powerful, challenges remain in fully editing complex genomes and reducing off-target mutations.
Estimation of region of attraction for polynomial nonlinear systems a numeric...ISA Interchange
This document introduces a numerical method to estimate the region of attraction (ROA) for polynomial nonlinear systems using sum-of-squares programming. The method computes a local Lyapunov function and an invariant set around a locally asymptotically stable equilibrium point. This invariant set provides an estimation of the ROA for the equilibrium point. The paper then proposes an algorithm to select a "shape factor" based on the linearized dynamic model of the system, which is used to enlarge the estimation of the ROA by solving a sum-of-squares optimization problem in each iteration. Numerical examples are provided to demonstrate the efficiency of the proposed method.
Similar to Deciphering the regulatory code in the genome (20)
Cloud-native machine learning - Transforming bioinformatics research Denis C. Bauer
Cloud computing and artificial intelligence transforms bioinformatics research
Denis Bauer, Transformational Bioinformatics Team
Genomic data is outpacing traditional Big Data disciplines, producing more information than Astronomy, twitter, and YouTube combined. As such, Genomic research has leapfrogged to the forefront of Big Data and Cloud solutions. We developed software platforms using the latest in cloud architecture, artificial intelligence and machine learning to support every aspect genome medicine; from disease gene detection through to validation and personalized medicine.
This talk outlines how we find disease genes for complex genetic diseases, such as ALS, using VariantSpark, which is a custom machine learning implementation capable of dealing with Whole Genome Sequencing data of 80 million common and rare variants. To support disease gene validation, we created GT-Scan, which is an innovative web application, which we think of it as the “search engine for the genome”. It enables researchers to identify the optimal editing spot to create animal models efficiently. The talk concludes by demonstrating how cloud-based software distribution channels (digital Marketplaces) can be harnessed to share bioinformatics tools internationally and make research more reproducible.
Translating genomics into clinical practice - 2018 AWS summit keynoteDenis C. Bauer
CSIRO's part of the co-presented Keynote at the AWS Public Sector Summit in Canberra on genomics health care. Three key messages: 1) We need a shift from treatment towards prevention 2) Once you go serverless you never go back 3) DevOps 2.0: Hypothesis-driven architecture evolution
Going Server-less for Web-Services that need to Crunch Large Volumes of DataDenis C. Bauer
AgileIndia Breakout session on serverless applications. This talk covers how AWS serverless infrastructure can be used for a wide range of applications, such as compute intensive tasks (GT-Scan), tasks requiring continuous learning (CryptoBreeder), data intensive tasks (PhenGen Database).
How novel compute technology transforms life science researchDenis C. Bauer
AgileIndia 2018 Keynote. This talk covers how ‘Datafication’ will make data ‘wider’ (more features describing a data point), which represents a paradigm shift for Machine Learning applications. It also covers serverless architecture, which can cater for even compute-intensive tasks. It concludes by stating that business and life-science research are not that different: so let’s build a community together!
How novel compute technology transforms life science researchDenis C. Bauer
Unprecedented data volumes and pressure on turnaround time driven by commercial applications require bioinformatics solutions to evolve to meed these new demands. New compute paradigms and cloud-based IT solutions enable this transition. Here I present two solution capable of meeting these demands for genomic variant analysis, VariantSpark, as well as genome engineering applications, GT-Scan2.
VariantSpark classifies 3000 individuals with 80 Million genomic variants each in under 30 minutes. This Hadoop/Spark solution for machine learning application on genomic data is hence capable to scale up to population size cohorts.
GT-Scan2, identifies CRISPR target sites by minimizing off-target effects and maximizing on-target efficiency. This optimization is powered by AWS Lambda functions, which offer an “always-on” web service that can instantaneously recruit enough compute resources keep runtime stable even for queries with several thousand of potential target sites.
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
Genomic information is increasingly used in medical practice giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. Here we introduce VariantSpark, which utilizes Hadoop/Spark along with its machine learning library, MLlib, providing the means of parallelisation for population-scale bioinformatics tasks. VariantSpark is the interface to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results.
To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 Million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than the Spark-based genome clustering approach, ADAM, the comparable implementation using Hadoop/Mahout, as well as Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. These benefits of speed, resource consumption and scalability enables VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data.
The package is written in Scala and available at https://github.com/BauerLab/VariantSpark.
Population-scale high-throughput sequencing data analysisDenis C. Bauer
This document provides an overview of a presentation on population-scale high-throughput sequencing data analysis. It discusses:
1) The background and goals of the CSIRO/Omics Project which aims to investigate colorectal cancer susceptibility using sequencing data from 500 individuals.
2) Methods for processing large-scale NGS data on high-performance computing clusters and cloud infrastructure using the NGSANE framework, which allows processing modules to be run in parallel.
3) Preliminary research outcomes identifying cancer-associated and microbiome changes from analysis of colorectal cancer and control samples.
The primary goal of my trip to Seattle was to establish a collaboration with a world-leading group on data integration. But by having chosen Seattle, a hub for technology companies, I also learned about synergies between business and research: Ilya Shmulevich from the Institute for Systems Biology makes use of Amazon's ''Random Forest" implementation and Google's 600.000 CPU cluster for cancer genomic association discovery. I also met with experts from University of Washington and Microsoft research to learn about technological advancements to tackle BigData and commoditizing parallelization. Finally, I observed a government funded research agency invest in solutions geared towards their enterprise structure rather than adopt solutions designed for research institutes without active computational community. In conclusion: CSIRO has unique properties and skill-sets that many collaborators would be interested in benefiting from, in return such collaborations would propel CSIRO instantly to the forefront of technology, which in particular for the analysis of big, unstructured datasets could be very rewarding.
Allelic Imbalance for Pre-capture Whole Exome SequencingDenis C. Bauer
Exome sequencing has emerged as an economical way of focusing DNA sequencing efforts on the most functionally understood regions of the genome. Pre-capture pooling, where one bait library is used to pull down the exonic regions of several pooled samples simultaneously is a further financial improvement.
However, rare alleles in the pool might not be able to attract baits at the same rate as reference conform sequences can, and may hence be underrepresented. We investigated this potential issue by sequencing a hapmap family (4 individuals) using the pre-capture protocol from Illumina and Nimblegen. We did not observe clear evidence that heterozygote variants are missed but noted a trend for indels to be imbalanced.
Our findings do not provide clear evidence to rule out allelic imbalance or bias having an impact on research findings, this may be especially critical for low cellular cancer tissue where rare alleles are more ubiquitous.
The first steps of analysing sequencing data (2GS,NGS) has entered a transitional period where on one hand most analysis steps can be automated and standardized (pipeline), while on the other constantly evolving protocols and software updates makes maintaining these analysis pipelines labour intensive.
I propose a centralized system within CSIRO that is flexible to cater for different analyses while also being generic to efficiently disseminate labour intensive maintenance and extension amongst the user community.
Qbi Centre for Brain genomics (Informatics side)Denis C. Bauer
An overview of QBI’s production informatics framework with an emphasis on what service will be provided and how the resulting data is made available: from interactive quality control to integration with external data on the genome browser.
This session will follow up from transcript quantification of RNAseq data and discusses statistical means of identifying differentially regulated transcripts, and isoforms and contrasts these against microarray analysis approaches.
Abstract: The focus in this session will be put on the differences between standard DNA mapping and RNAseq-specific transcript mapping: identifying splice variants and isoforms. The issue of transcript quantification and genomic variants that can be identified from RNAseq data will be discussed.
The document discusses challenges in identifying causal variants for complex diseases from sequencing data. It notes that while ideal situations may involve finding a variant common in all affected individuals and absent in unaffected, reality involves sifting through around 3.5 million SNPs. Methods like genome-wide association studies and focusing on exonic variants can help prioritize, but functional variants may also reside outside of protein coding regions. Considering combinations of variants through statistical genetics approaches may be needed to explain disease heritability. Quality control, annotation, and filtering are important but finding causal variants remains difficult.
Variant (SNPs/Indels) calling in DNA sequences, Part 2Denis C. Bauer
Abstract: This session will focus on the steps involved in identifying genomic variants after an initial mapping was achieved: improvement the mapping, SNP and indel calling and variant filtering/recalibration will be introduced.
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Denis C. Bauer
This document discusses various topics related to mapping short sequencing reads to a reference genome, including:
- File formats like FASTQ that store sequencing reads and BAM/SAM formats for aligned reads.
- Alignment algorithms like hash table-based (MAQ, BWA) and suffix tree-based (BWA, Bowtie) mappers.
- Visualizing alignments using the Integrative Genomics Viewer (IGV).
- Performing quality control on BAM files by checking the percentage of mapped reads and coverage uniformity.
- The next session will focus on identifying genomic variants from mapped reads through SNP/indel calling and filtering.
Introduction to second generation sequencingDenis C. Bauer
An introduction to second generation sequencing will be given with focus on the basic production informatics: The approach of raw data conversion and quality control will be discussed.
Bioinformatics is an interdisciplinary field that merges biology, computer science, and information technology. It is applied in areas like genomics, proteomics, and systems biology. While some basic analysis can be done through user-friendly tools, truly customized work requires programming skills and an understanding of underlying algorithms. Bioinformatics is not just a service field but rather involves scientific experimentation throughout the entire analysis process from experimental design to evaluation. It is a dedicated field of research in its own right, not a quick or interchangeable task.
Critical Run files can be missing/corrupt after the Run folder was transferred from the HiSeq storage to the cluster storage. This presentation discusses the issue and suggests four workarounds.
This was our presentation for our imaginary product for the commercialization workshop. Note, all "research results" and illustrations are totally made up and and therefore not necessarily reflecting reality (== biological processes). This presentation was created as part of the learning experience of how to pitch biological research to venture capitalists.
This presentation was provided by Racquel Jemison, Ph.D., Christina MacLaughlin, Ph.D., and Paulomi Majumder. Ph.D., all of the American Chemical Society, for the second session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session Two: 'Expanding Pathways to Publishing Careers,' was held June 13, 2024.
This document provides an overview of wound healing, its functions, stages, mechanisms, factors affecting it, and complications.
A wound is a break in the integrity of the skin or tissues, which may be associated with disruption of the structure and function.
Healing is the body’s response to injury in an attempt to restore normal structure and functions.
Healing can occur in two ways: Regeneration and Repair
There are 4 phases of wound healing: hemostasis, inflammation, proliferation, and remodeling. This document also describes the mechanism of wound healing. Factors that affect healing include infection, uncontrolled diabetes, poor nutrition, age, anemia, the presence of foreign bodies, etc.
Complications of wound healing like infection, hyperpigmentation of scar, contractures, and keloid formation.
This presentation was provided by Rebecca Benner, Ph.D., of the American Society of Anesthesiologists, for the second session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session Two: 'Expanding Pathways to Publishing Careers,' was held June 13, 2024.
Leveraging Generative AI to Drive Nonprofit InnovationTechSoup
In this webinar, participants learned how to utilize Generative AI to streamline operations and elevate member engagement. Amazon Web Service experts provided a customer specific use cases and dived into low/no-code tools that are quick and easy to deploy through Amazon Web Service (AWS.)
Leveraging Generative AI to Drive Nonprofit Innovation
Deciphering the regulatory code in the genome
1. Deciphering the regulatory code in the genome
PhD completion seminar
Denis C. Bauer
Institute for Molecular Bioscience, The University of Queensland, Australia
By yankodesign by linh.ngân
2. Research Aim
Develop a method that translates the regulatory message in the DNA, which specifies when and how strongly a gene is expressed.
[Figure: a DNA sequence (AAGAAGGTTTTAGTTTAGCC CACCGTAGGTACCTGAAGAA GAAGGTTTTAGTTTAGCCCA CCGTAGGTACCTGAAG) is passed through a thermodynamic model and read as the message "Express gene with 70% capacity when it is hot, Thanks!"]
3. Why is understanding transcriptional regulation important?
• Insight into the biology of gene pathways.
• Search for regulatory regions with specific function.
• “Re-programming” of genes has therapeutic potential.
[Figure: DNA with promoter, gene, and transcription; a broken regulatory element is addressed by designing and inserting a new regulatory element.]
5. Background: Enhancer
• Genes can have independent “switches” (enhancers) beyond the core promoter, which can start the transcription of the target gene under different conditions.
[Figure: enhancer regions, promoter, gene, transcription]
6. Background: Enhancer
• Transcription is regulated by the binding of activator
and repressor TFs to an enhancer region.
[Figure: TF concentration and the enhancer's binding site map (8 activators, 2 repressors) lead to active transcription.]
7. Background: Repression
• Transcriptional regulation is also dependent on the
interplay between activators and repressors, i.e.
where they bind relative to each other.
[Figure: enhancer binding site map with the repressor range marked.]
9. Background: Even-skipped gene (eve)
[Figure panels: Drosophila melanogaster [1]; embryo stained for eve [2]; function representation [3]]
[1] http://insects.eugenes.org/
[2] Small et al.
[3] http://bioinform.genetika.ru
10. Background: Regulation of eve
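[Figure: map of the eve locus showing, in order, late1, MSE 3+7, MSE 2, the promoter (P) with eve, late2, MSE 4+6, MSE 1, and MSE 5; a lacZ label also appears.]
Janssens, H. et al. Quantitative and predictive model of transcriptional control of the Drosophila melanogaster even skipped gene. Nat Genet, 2006, 38, 1159-1165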
11. Hypothesis
TF concentrations + binding site map + (genome architecture, RNA, methylation, …)
predicts gene activation
12. Research Goals
• Optimize thermodynamic models efficiently.
• Analyze robustness of these models.
• Explore the regulation of a particular gene.
• Examine how the regulatory program evolves.
• Extend current thermodynamic model.
Cooperphoto/CORBIS
13. Model definition
Site occupancy (Hill function)
\[ p(s, t) = \frac{K_t \cdot K(s, t) \cdot [t]}{1 + K_t \cdot K(s, t) \cdot [t]} \]
Total activation
\[ W(S, T) = \sum_{s \in S^A} E_{t_s} \, p(s, t_s) \prod_{s' \in S^R} \left( 1 - E_{t_{s'}} \cdot p(s', t_{s'}) \cdot d(s, s') \right) \]
Here E_{t_s} p(s, t_s) is the activator contribution and the product is the quenching of the activator.
Transcription rate (Arrhenius function)
\[ R(S, T) = \begin{cases} R_0 \exp\left( W(S, T) - G_0 \right) & \text{iff } W < G_0 \\ R_0 & \text{otherwise} \end{cases} \]
Free parameters
TF PARAMS
K     Binding affinity
E     Effectiveness
GENERAL PARAMS
R_0   Max. transcription rate
G_0   Energy barrier
Buena Vista Pictures
Janssens, H. et al. Quantitative and predictive model of transcriptional control of the Drosophila melanogaster even skipped gene. Nat Genet, 2006, 38, 1159-1165
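The three equations of this slide can be sketched in a few lines of Python. This is a minimal illustration, not the STREAM implementation: the function names, the tuple layout for binding sites, and in particular the distance function `linear_range` (full quenching up to 100 bp, none beyond 150 bp) are assumptions made for the example.

```python
import math


def occupancy(k_tf, k_site, conc):
    """Site occupancy p(s, t) = K_t*K(s,t)*[t] / (1 + K_t*K(s,t)*[t])."""
    x = k_tf * k_site * conc
    return x / (1.0 + x)


def linear_range(pos_a, pos_r, full=100.0, zero=150.0):
    """Hypothetical d(s, s'): 1 up to `full` bp apart, falling linearly to 0 at `zero` bp."""
    gap = abs(pos_a - pos_r)
    if gap <= full:
        return 1.0
    if gap >= zero:
        return 0.0
    return (zero - gap) / (zero - full)


def total_activation(activators, repressors, d=linear_range):
    """W(S, T): sum of activator contributions, each quenched by the repressors.

    activators, repressors: lists of (position, effectiveness E, occupancy p).
    """
    w = 0.0
    for pos_a, e_a, p_a in activators:
        quench = 1.0
        for pos_r, e_r, p_r in repressors:
            quench *= 1.0 - e_r * p_r * d(pos_a, pos_r)
        w += e_a * p_a * quench
    return w


def transcription_rate(w, r0, g0):
    """R(S, T) = R_0 * exp(W - G_0) if W < G_0, else R_0."""
    if w < g0:
        return r0 * math.exp(w - g0)
    return r0


# One activator site, one repressor too far away to quench it.
p_act = occupancy(1.0, 1.0, 1.0)
w = total_activation([(0, 2.0, p_act)], [(500, 1.0, 0.9)])
rate = transcription_rate(w, r0=100.0, g0=2.0)
```

The rate saturates at R_0 once the total activation reaches the energy barrier G_0, which is why the piecewise form is needed.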
14. Training the model
[Figure: training loop. The TF binding site map and the TF concentration vector <[TF1], [TF2], [TF3], [TF4]> are fed into the thermodynamic model; the predicted expression is compared to the target, and the model parameters are adjusted to improve the fit.]
15. Optimization methods
• Two optimization paradigms
– Simulated Annealing
• LAM schedule (Reinitz et al. 2003)
• Geometric cooling
– Gradient descent
• Three GD variants that approximate the objective function, which is not continuously differentiable.
• Judged on the accuracy achieved in the given time
– Drosophila MSE2 data with 400 data points and 7 TFs (16 free parameters).
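As an illustration of the first paradigm, the following is a minimal sketch of simulated annealing with geometric cooling. It is not the optimizer used in the talk: the LAM schedule is not shown, and the step size, temperatures, cooling factor, and the toy objective are all assumptions chosen for the example.

```python
import math
import random


def simulated_annealing(objective, start, step=0.1, t_start=1.0, cooling=0.95,
                        steps_per_temp=50, t_end=1e-3, seed=0):
    """Minimise `objective` over a list of floats using geometric cooling."""
    rng = random.Random(seed)
    current = list(start)
    current_err = objective(current)
    best, best_err = list(current), current_err
    temp = t_start
    while temp > t_end:
        for _ in range(steps_per_temp):
            # Perturb one randomly chosen parameter.
            candidate = list(current)
            i = rng.randrange(len(candidate))
            candidate[i] += rng.uniform(-step, step)
            err = objective(candidate)
            delta = err - current_err
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp) so local minima can be escaped.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, current_err = candidate, err
                if err < best_err:
                    best, best_err = list(candidate), err
        temp *= cooling  # geometric cooling: T <- cooling * T
    return best, best_err


# Toy objective standing in for the model's RMS error.
def toy_error(params):
    return (params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2


best_params, best_error = simulated_annealing(toy_error, [4.0, 1.0])
```

Accepting some uphill moves is what distinguishes this from gradient descent, which can only move downhill and therefore stops at the first local minimum it reaches.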
16. Optimization
Simulated Annealing Gradient Descent
[Figure: two panels, "Simulated Annealing" and "Gradient Descent", each plotting RMS error and CC against time [minutes]; curves shown for SA LAM, SA geom, GD_softmax, GD_nomax, and GD_max.]
Suggests: many local minima.
Bauer, D. C. & Bailey, T. L. Optimizing static thermodynamic models of transcriptional regulation. Bioinformatics, 2009, 25, 1640-1646
17. If gradient descent gets stuck in local minima all the time, what does the optimization landscape look like?
18. Landscape analysis
• Synthetic data based on real MSE2 data
– global minimum and solution (parameter values) are
known.
– Measuring distance of the optimization solution to the
starting position and the known solution.
– Measuring error reduction at the solution compared to the starting position.
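The two measurements can be sketched as follows. The slide does not state which distance metric was used, so the Euclidean distance here is an assumption, as are the function names and the example values.

```python
import math


def distance(params_a, params_b):
    """Euclidean distance between two parameter vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(params_a, params_b)))


def error_reduction(error_at_start, error_at_solution):
    """Fraction of the starting error removed by the optimization."""
    return (error_at_start - error_at_solution) / error_at_start


# Hypothetical vectors: known solution, starting point, optimizer result.
known = [1.0, 2.0, 3.0]
start = [1.5, 2.5, 3.5]
found = [1.4, 2.6, 3.4]

initial_distance = distance(start, known)
final_distance = distance(found, known)
```

A large error reduction combined with a final distance that is no smaller than the initial one indicates that the optimizer converged to a different minimum than the known solution.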
19. Landscape analysis
Experiment     Initial distance to    Final distance to     Error Red.
               solution (mean)        solution (mean)       (mean)
1% perturbed   3.4·10−4               2.8·10−4              88%
random         0.1                    0.11                  97%
Conclusion: many local minima.
Bauer, D. C. & Bailey, T. L. Optimizing static thermodynamic models of transcriptional regulation. Bioinformatics, 2009, 25, 1640-1646
20. Does the model over-fit ?
• Cross-validation (5-fold)
Experiment   Mean RMS error (SE)   Mean CC (SE)
training     13.39 (0.004)         0.92 (4.8 · 10−5)
testing      14.04 (0.005)         0.91 (5.7 · 10−5)
• Redundancy reduction
– Not enough data to begin with
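A 5-fold cross-validation of this kind can be sketched as below, together with the two reported measures (RMS error and correlation coefficient). The fold construction, the seed, and the function names are assumptions for illustration; the talk does not describe how the folds were drawn.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def rms_error(predicted, target):
    """Root-mean-square error between two equally long sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)


def pearson_cc(predicted, target):
    """Pearson correlation coefficient (CC)."""
    n = len(predicted)
    mp, mt = sum(predicted) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(predicted, target))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_t = sum((t - mt) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# 400 data points, as in the MSE2 data set: 320 for training, 80 for testing.
splits = list(k_fold_indices(400, k=5))
```

Comparing the training and testing values of both measures, as in the table above, shows how much of the fit carries over to unseen data.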
21. Summary: Optimization & Analysis
• The objective function is ill-posed.
– It has a plethora of local minima.
– It might have many global minima.
• Hence SA is the method of choice.
• There might be a tendency to over-fit the data.
http://www2.cmp.uea.ac.uk/~aih/code/SVM/KernelTrickDemo.html
http://images.nciku.com/
22. Research Goals
• Optimize thermodynamic models efficiently
• Analyze robustness of these models
• Explore the regulation of a particular gene
• Examine how the regulatory program evolves
• Extend current thermodynamic model
Cooperphoto/CORBIS
23. Regulation and Evolution of eve
• Mechanism for regulating eve is conserved:
– Stripe 2 elements from other Drosophila species activate eve in D. mel. correctly.
– Despite the substantial difference in the regulatory DNA sequence.
http://www.bio.ilstu.edu/Edwards/
Hare, E. E. et al. Sepsid even-skipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet, 2008, 4, e1000106
24. Evaluate Evolution of MSE2
• Test if the model can identify the MSE2 in these other species.
• Test if the model correctly predicts the transcriptional output of the homologous MSE2s.
25. Searching for MSE2
• Apply a model trained on D. mel. MSE2 to the TFBS-map from sequential windows to find the MSE2 in other species
[Figure: a window slides along the region around the MSE2, promoter, and eve in another species; for each window the predicted expression is compared to the target, giving one RMS error per window, e.g. < 23 27 43 … 13 … >.]
Bauer, D. C. & Bailey, T. L. Studying the functional conservation of cis-regulatory modules and their transcriptional output. BMC Bioinformatics, 2008, 9, 220
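The window scan can be sketched as follows. The `predict` callable is hypothetical: it stands in for applying the trained model to the TFBS map of one window and returning a predicted expression profile. Window size, step, and the toy predictor are likewise assumptions for the example.

```python
import math


def rms_error(predicted, target):
    """Root-mean-square error between two equally long sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)


def scan_windows(seq_length, window, step, predict, target):
    """Score every window; returns a list of (window start, RMS error)."""
    results = []
    for start in range(0, seq_length - window + 1, step):
        predicted = predict(start, start + window)
        results.append((start, rms_error(predicted, target)))
    return results


def best_window(results):
    """The window with the lowest RMS error is the candidate module."""
    return min(results, key=lambda r: r[1])


# Toy stand-in for the trained model: the prediction matches the target
# exactly at position 300 and degrades with distance from it.
target_profile = [0.0, 1.0, 2.0]


def toy_predict(start, end):
    offset = abs(start - 300) / 100.0
    return [t + offset for t in target_profile]


scores = scan_windows(1000, 200, 100, toy_predict, target_profile)
hit = best_window(scores)
```

In this toy setup the minimum lies at the window starting at position 300, mirroring how a dip in the RMS error profile marks the candidate MSE2 location.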
26. Searching for MSE2: Result
• Correctly identified the MSE2 in 6/8 species
[Figure: RMS error along the genomic location, one panel per species; recoverable panel labels: D. melanogaster, D. pseudoobscura, and a truncated label "rimshawi".]
Bauer, D. C. & Bailey, T. L. Studying the functional conservation of cis-regulatory modules and their transcriptional output. BMC Bioinformatics, 2008, 9, 220
27. Predicting the output in other species
• Apply a model trained on D. mel. MSE2 to the MSE2s in other species
[Figure: log odds score (bits) against rel. genomic position for the binding sites of bicoid, kruppel, giant, hunchback, knirps, caudal, and tailless in D. melanogaster and D. mojavensis; relative RNA concentration against A-P position (%) for the target and for D. melanogaster, D. pseudoobscura, D. ananassae, and D. mojavensis.]
Bauer, D. C. & Bailey, T. L. Studying the functional conservation of cis-regulatory modules and their transcriptional output. BMC Bioinformatics, 2008, 9, 220
28. Summary Application
• Model fits the data qualitatively.
• Predictions are biologically meaningful.
• However, there is room for improvement.
29. Research Goals
• Optimize thermodynamic models efficiently
• Analyze robustness of these models
• Explore the regulation of a particular gene
• Examine how the regulatory program evolves
• Extend current thermodynamic model
Cooperphoto/CORBIS
30. One role fits them all?
• Dual function is proposed for some of the regulatory TFs.
– E.g. TF Hunchback (Hb) might be an activator when regulating stripe 2 and a repressor for stripe 3.
[Figure: map of the eve enhancers: late1, 3+7, 2, P, late2, 4+6, 1, 5]
Papatsenko, D. & Levine, M. S. Dual regulation by the Hunchback gradient in the Drosophila embryo. Proc Natl Acad Sci U S A, 2008, 105, 2901-2906
Schroeder, M. D. et al. Transcriptional control in the segmentation gene network of Drosophila. PLoS Biol, 2004, 2, E271
31. Determine the regulatory role of TFs
• Different data set: 44 CRMs important for D. mel. development, but the same set of TFs.
• Determine the best role for each TF in each of the CRMs
– Brute force: train a model for all TF role combinations on each of the 44 CRMs.
– Record the correlation achieved.
– Identify TFs that have dual function.
Segal, E. et al. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature, 2008, 451, 535-540
Bauer, D. C.; Buske, F. A. & Bailey, T. L. Dual functioning transcription factors regulated by SUMOylation in the developmental gene network of Drosophila melanogaster. Submitted for publication, 2009
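The brute-force search over role combinations can be sketched as below. The `score` callable is hypothetical: it stands in for training a model with the given role assignment on one CRM and returning the correlation achieved. With n TFs there are 2^n assignments per CRM, which is what makes exhaustive enumeration feasible for a small TF set.

```python
from itertools import product


def best_role_assignment(tfs, score):
    """Try every activator/repressor assignment; keep the best-scoring one."""
    best_roles, best_cc = None, float("-inf")
    for combo in product(("activator", "repressor"), repeat=len(tfs)):
        roles = dict(zip(tfs, combo))
        cc = score(roles)  # hypothetical: train model, return correlation
        if cc > best_cc:
            best_roles, best_cc = roles, cc
    return best_roles, best_cc


def dual_function_tfs(best_roles_per_crm):
    """TFs whose best role differs between CRMs."""
    seen = {}
    for roles in best_roles_per_crm.values():
        for tf, role in roles.items():
            seen.setdefault(tf, set()).add(role)
    return sorted(tf for tf, found in seen.items() if len(found) == 2)


# Toy scoring function: pretends the data favour Hb as repressor, Kr as activator.
def toy_score(roles):
    return 1.0 if (roles["Hb"] == "repressor" and roles["Kr"] == "activator") else 0.0


roles, cc = best_role_assignment(["Hb", "Kr"], toy_score)

# Made-up per-CRM results, only to show the dual-function check.
per_crm = {
    "crm_a": {"Hb": "activator", "Kr": "repressor"},
    "crm_b": {"Hb": "repressor", "Kr": "repressor"},
}
dual = dual_function_tfs(per_crm)
```

A TF is flagged as dual-functioning only when its best-scoring role is activator in at least one CRM and repressor in at least one other.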
32. TFs with dual role
                         Bcd  Cad  Hb  Tll  Gt   Kr  Kni  TorRE
Det. roles               s    +    s   -    s    s   -    s
Literature (consensus)   +    +    s   -    (s)  s   -    NA
“s”: dual-functioning, “+”: activator, “-”: repressor.
• E.g. Hb
– Activator for 17 CRMs
– Repressor for 27 CRMs
Perkins, T. J. et al. Reverse engineering the gap gene network of Drosophila melanogaster. PLoS Comput Biol, 2006, 2, e51
Schroeder, M. D. et al. Transcriptional control in the segmentation gene network of Drosophila. PLoS Biol, 2004, 2, E271
33. Improvement with dual function
[Figure: mRNA against AP position for the CRMs kr_CD1_ru, hb_anterior_actv, run_stripe5, and eve_37ext_ru; curves: target, previous roles, HbDual, KrDual, HbKrDual, best.]
Experiment       Number of free parameters   Mean CC (SE)
Previous roles   18                          0.27 (0.008)
HbDual           19                          0.35 (0.009)
KrDual           19                          0.37 (0.007)
HbKrDual         20                          0.38 (0.007)
Bauer, D. C.; Buske, F. A. & Bailey, T. L. Dual functioning transcription factors regulated by SUMOylation in the developmental gene network of Drosophila melanogaster. Submitted for publication, 2009
34. Marker motifs for dual function
• Running MEME on the protein sequences of dual-functioning TFs to find short motifs (<6 aa) present in all of them.
[Figure: two MEME sequence logos (no SSC), four positions each, with information content in bits; one is labelled as the SUMOylation motif.]
35. SUMOylation
• Small Ubiquitin-related Modifier: a small protein covalently attached to target proteins.
• Involved in many pathways/mechanisms
– Compartmentalisation
– Transcriptional regulation
• Can reverse the function of a TF, e.g. Ikaros (the human homologue of Kr)
• SUMO (Smt3) is present in D. mel. during development
[Figure: SUMO pathway with SUMO (SU), SUMO protease, ATP, E1 activating enzyme, E2 conjugating enzyme, E3 ligase, and target protein.]
Bauer, D. C.; Buske, F. A.; Bailey, T. L. & Bodén, M. Predicting SUMOylation sites in developmental transcription factors of Drosophila melanogaster. Neurocomputing, 2009, in submission
del Arco, P. G. et al. Ikaros SUMOylation: switching out of repression. Mol Cell Biol 2005, 25, 2688-2697
36. Conclusion
• Thermodynamic models are best optimized using SA, but over-fitting is an issue to keep in mind.
Bauer, D. C. & Bailey, T. L. Optimizing static thermodynamic models of transcriptional regulation. Bioinformatics, 2009, 25, 1640-1646
• Nonetheless, they are applicable for
– examining the mechanisms of transcriptional regulation,
– exploring the evolution of a particular regulatory mechanism.
Bauer, D. C. & Bailey, T. L. Studying the functional conservation of cis-regulatory modules and their transcriptional output. BMC Bioinformatics, 2008, 9, 220
• Model prediction improves when dual function is allowed.
Bauer, D. C.; Buske, F. A. & Bailey, T. L. Dual functioning transcription factors regulated by SUMOylation in the developmental gene network of Drosophila melanogaster. Submitted for publication, 2009
– SUMOylation seems to be a good candidate for the biological mechanism of role change.
Bauer, D. C.; Buske, F. A.; Bailey, T. L. & Bodén, M. Predicting SUMOylation sites in developmental transcription factors of Drosophila melanogaster. Neurocomputing, 2009, in submission
37. Acknowledgments
• IMB
– Timothy Bailey (supervisor)
– Mikael Bodén (supervisor)
– Sean Grimmond (thesis committee)
– Nick Hamilton (thesis committee)
– Fabian Buske
– Stefan Maetschke
• Stony Brook University
– John Reinitz
• Funding
– Institute for Molecular Bioscience, The University of Queensland
– Australian Research Council Centre of Excellence in Bioinformatics
– National Institutes of Health
– UQ International Research Tuition Award
Framework for modeling, visualizing, and predicting the regulation of the transcription rate of a target gene
www.bioinformatics.org.au/stream
38. www.bioinformatics.org.au/stream
• Framework for modeling, visualizing, and predicting the regulation of the transcription rate of a target gene.
• Publicly available
• Modular: New functions can be plugged in
[Figure labels: Many functions; Command line]
Bauer, D.C. and Bailey, T.L, STREAM - Static Thermodynamic REgulAtory Model for transcriptional. Bioinformatics, 2008, 24, 2544-2545.