The document describes models to analyze glial cell defense mechanisms in response to ischemic hypoxia in the brain. It introduces a stochastic cellular automaton (CA) model on a 100x100 lattice, a mean field approximation (MFA) model using a system of ODEs, and a pair approximation (OPA) model. The MFA model is reduced to a 2D system and its equilibria and stability are analyzed. Parameter regions for stability of the healthy cell-free equilibrium are identified.
http://bioinformaticsinstitute.ru/lectures
Guest lecture at the Bioinformatics Institute, October 9, 2014. Lecturer: Maria Shutova (Institute of General Genetics, Russian Academy of Sciences).
Over the past ten years, pluripotent cells have become the heroes of two Nobel Prizes and many thousands of scientific and popular-science articles. Their unique ability to turn into any cell of the adult organism still provides food for thought both for developmental biologists and for scientists searching for ways to treat genetic diseases. The lecture covers two types of pluripotent cells: "natural" (embryonic stem cells) and "artificial" (induced pluripotent stem cells). We will also look at how knowledge of how transcription factors work helped to reprogram cells, and how these "artificial" pluripotent cells can be used in medicine.
This document summarizes a talk about assembling large metagenomic datasets. The speaker discusses the challenges of assembling large amounts of metagenomic sequence data, a task that scales poorly with standard assembly techniques. They present a solution that uses k-mer graphs and probabilistic data structures to store and traverse very large graphs efficiently. This lets them reduce the data size exactly, through techniques like filtering out unconnected reads and partitioning reads into disconnected subgraphs. They demonstrate the approach by assembling over 200 GB of sequence data from an Iowa corn field soil sample.
Part 4 of RNA-seq for DE analysis: Extracting count table and QC (Joachim Jacob)
Fourth part of the training session 'RNA-seq for Differential expression analysis'. We explain how we get a count table from a mapping result. We show how to do quality control on the count table. Interested in following this session? Please contact http://www.jakonix.be/contact.html
SyMAP is a synteny mapping and analysis program that compares genomes of different species to understand gene function and evolutionary history. It finds synteny blocks between a physical map (e.g. FPC map) and a sequenced genome using anchors like markers and BAC end sequences. The algorithm uses a directed acyclic graph and dynamic programming to order anchors into synteny chains while allowing for errors and rearrangements. SyMAP displays synteny results through interactive views and aids in tasks like correcting BAC clone end assignments. It has been applied to several plant genome projects.
This document summarizes an analysis of assembly algorithms and an implementation of a de Bruijn graph approach to genome assembly. It discusses how de Bruijn graphs have become a common approach for assembly, representing k-mers as nodes and connecting nodes whose k-mers overlap. The document outlines challenges in assembly, including repeats and errors. It also summarizes two efficient data structures for representing de Bruijn graphs and describes implementing them to assemble microbial genomes and compare against the ABySS assembler.
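The core graph construction summarized above can be sketched in a few lines. This is an illustrative toy (the read and k-mer size are made up), not the data structures or implementation the document describes: nodes are (k-1)-mers and each k-mer contributes one edge.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds an edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix edge
    return graph

g = de_bruijn_graph(["ACGTACGA"], 3)
```

Repeats show up as nodes with multiple outgoing edges (here "CG" leads to both "GT" and "GA"), which is exactly the ambiguity that makes assembly hard.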
de Bruijn Graph Construction from Combination of Short and Long Reads (Sikder Tahsin Al-Amin)
This document describes the de Bruijn graph approach for genome assembly using a combination of short and long reads. It discusses key terminology, the motivation for this approach, and how an A-Bruijn graph is constructed and used to find the genomic path. It also covers error correction in draft genomes assembled using this method and potential areas for further development, such as calculating the likelihood ratio of different solid string sets and applying bridging and merging techniques.
The document discusses pairwise sequence alignment methods. It defines key concepts like homology and orthology. It explains that dynamic programming is used to find optimal alignments through building a score matrix and backtracking. Global alignment finds the best match over full sequences while local alignment identifies regions of local similarity. Scoring systems like PAM matrices assign values based on substitutions and penalties for gaps.
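The dynamic-programming idea above can be sketched as a minimal global-alignment scorer in the Needleman-Wunsch style. The scoring values here are arbitrary examples, not the PAM matrices the document mentions:

```python
def global_align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: fill the DP score matrix row by row, return final score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # aligning a prefix of a against nothing
        score[i][0] = i * gap
    for j in range(1, cols):          # aligning a prefix of b against nothing
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag,               # substitute / match
                              score[i-1][j] + gap,  # gap in b
                              score[i][j-1] + gap)  # gap in a
    return score[-1][-1]
```

Recovering the alignment itself would require the backtracking step the document describes (keeping pointers to which of the three cases won in each cell); local alignment differs mainly in clamping each cell at zero and taking the matrix maximum.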
The document discusses Adaptable Constrained Genetic Programming (ACGP), which aims to automate the discovery of heuristics to guide the genetic programming search. It describes how ACGP develops first-order and second-order heuristics based on patterns observed in high-performing individuals, and uses these heuristics to bias mutation, crossover and regrowth. Experimental results on a target equation with explicit second-order structure show that ACGP with second-order heuristics outperforms both standard GP and ACGP with only first-order heuristics. The document concludes that ACGP is effective at discovering and exploiting problem structure through its adaptive heuristic approach.
2012 talk to CSE department at U. Arizona (c.titus.brown)
This document summarizes an approach for streaming lossy compression of biological sequence data using probabilistic data structures. It discusses using CountMin Sketch and Bloom filters to represent sequence data and counting in a memory efficient way. It also describes an online streaming approach to lossy compression by downsampling reads based on de Bruijn graph structure to preferentially retain reads that contain "true edges". Additionally, it proposes a compressible de Bruijn graph representation by storing the implicit graph in a Bloom filter to achieve striking memory efficiency while retaining the global graph structure. This approach aims to address the computational challenges of assembly as sequencing data scales.
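A toy illustration of Bloom-filter membership for k-mers, the kind of structure the summary refers to. This is a generic sketch with arbitrary sizes and a hash scheme chosen for simplicity, not the talk's actual implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: set membership with no false negatives,
    a tunable false-positive rate, and no storage of the items themselves."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))
```

Storing a de Bruijn graph implicitly in such a filter, as the summary describes, amounts to inserting every k-mer and then asking, for any node, which of its four possible successors are present.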
The document discusses updates to the human reference genome assembly GRCh38. It provides background on reference assemblies and describes how the Genome Reference Consortium manages and models genome assemblies. Key points include that GRCh38 contains refined centromere regions based on new data, novel sequence detections, and 261 alternate loci representing structural variants. The assembly is now incorporated into public sequence databases to improve access and use of the reference genome data.
This document discusses probabilistic models and string transducers for pairwise sequence alignment and phylogenetic tree construction. It introduces hidden Markov models (HMMs) and the Jukes-Cantor model for nucleotide substitution. UPGMA and neighbor-joining methods are described for building rooted and unrooted phylogenetic trees from distance matrices. Maximum parsimony is also summarized as a method for phylogenetic tree inference based on identifying the smallest number of character state changes.
20080110 Genome exploration in A-T G-C space: an introduction to DNA walking (Jonathan Blakes)
This document discusses using "DNA walking" to summarize and explore genomic sequences. DNA walking maps DNA sequences onto 2D walks in "A-T G-C space" to visualize patterns. The author tests using DNA walks to detect duplications in genomes and construct phylogenies without alignment. While walks can uncover relationships, accuracy is limited. Future work could involve 3D walks in tetrahedral mappings to better represent genomic structure.
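The mapping into "A-T G-C space" can be sketched as follows. The axis assignment (A/T along x, G/C along y) is an assumption for illustration; the author's exact convention may differ:

```python
def dna_walk(seq):
    """Map a DNA sequence to a 2D walk: A/T step along x, G/C step along y."""
    steps = {"A": (1, 0), "T": (-1, 0), "G": (0, 1), "C": (0, -1)}
    x = y = 0
    path = [(0, 0)]
    for base in seq:
        dx, dy = steps.get(base, (0, 0))  # ignore ambiguous bases like N
        x, y = x + dx, y + dy
        path.append((x, y))
    return path
```

Plotting the resulting path visualizes base composition biases; for example, an AT-rich region drifts horizontally while a GC-rich region drifts vertically.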
Paired-end alignments in sequence graphs (Chirag Jain)
Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is a commonly used sequencing platform in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs and use sparse matrix-matrix multiplication (SpGEMM) to build an index which can be queried efficiently by a mapping algorithm for validating the distance constraints. Effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph and a pan-genome de Bruijn graph built using genomes of 20 B. anthracis strains. While the one-time indexing can take from a few minutes to a few hours with our algorithm, answering a million distance queries takes less than a second.
Presentation 2009 Journal Club Azhar Ali Shah (guest5de83e)
The document discusses algorithms for hierarchical clustering of large datasets. It introduces UPGMA clustering and its limitations when dealing with huge datasets. It then proposes two new algorithms called Sparse-UPGMA and Multi-Round MC-UPGMA to overcome these limitations. Multi-Round MC-UPGMA clusters the data in multiple rounds to deal with sparse inputs while requiring less memory. The algorithms are tested on clustering over 1.8 million protein sequences from UniRef90.
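For reference, plain UPGMA (the baseline those new algorithms improve on) can be sketched naively. This is an illustrative toy, assuming a complete distance matrix keyed by alphabetically sorted name pairs; it keeps the full matrix in memory, which is exactly the limitation Sparse-UPGMA and MC-UPGMA address:

```python
def upgma_merges(dist):
    """Naive UPGMA: repeatedly merge the closest pair of clusters,
    averaging distances weighted by cluster sizes.
    dist: {(a, b): d} with each key a sorted tuple of leaf names."""
    clusters = {name: 1 for pair in dist for name in pair}  # cluster -> size
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                  # closest pair
        sa, sb = clusters.pop(a), clusters.pop(b)
        merges.append((a, b, d.pop((a, b))))
        new = f"({a},{b})"
        for c in list(clusters):                  # size-weighted average distance
            da = d.pop(tuple(sorted((a, c))))
            db = d.pop(tuple(sorted((b, c))))
            d[tuple(sorted((new, c)))] = (sa * da + sb * db) / (sa + sb)
        clusters[new] = sa + sb
    return merges
```

Because every round scans the whole distance dictionary, this is quadratic in memory and worse in time, which is why clustering 1.8 million protein sequences requires the sparse, multi-round strategies the document describes.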
Bioinformatics emerged from the marriage of computer science and molecular biology to analyze massive amounts of biological data, like that produced by the Human Genome Project. It uses algorithms and techniques from computer science to solve problems in molecular biology, like comparing genomic sequences to understand evolution. As genomic data exploded publicly, bioinformatics was needed to efficiently store, analyze, and make sense of this information, which has applications in molecular medicine, drug development, agriculture, and more.
This document summarizes evolutionary computation techniques including genetic algorithms and genetic programming. It provides an overview of biological evolution and how evolutionary computation mimics this process to solve problems. Genetic algorithms use chromosomes to represent candidate solutions which are evolved over generations using selection, crossover and mutation operators. Genetic programming uses tree representations to evolve computer programs. The document describes how genetic programming can be used to evolve a program for a wall-following robot. It concludes by discussing applications and advantages/disadvantages of evolutionary computation.
The document discusses several algorithms for finding the shortest path in a graph: Dijkstra's algorithm, Floyd-Warshall algorithm, Bellman-Ford algorithm, and genetic algorithms. It provides details on how Dijkstra's and Floyd-Warshall algorithms work, including pseudocode. Examples are given for both algorithms. Applications of the different algorithms are also outlined.
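Dijkstra's algorithm, mentioned above, can be given as a short runnable sketch using a binary heap; the example graph is made up for illustration:

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source via Dijkstra's algorithm.
    graph: {node: [(neighbor, weight), ...]} with non-negative weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already found a shorter path
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
```

Here `dijkstra(g, "a")` finds that the path a→b→c (cost 3) beats the direct edge a→c (cost 4). Floyd-Warshall differs in computing all-pairs distances via a triple loop, and Bellman-Ford in tolerating negative edge weights.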
The document discusses the human reference genome assembly. It provides information on what a reference assembly is, how it is constructed, and how it has evolved over time. Key points include:
- The reference assembly is a model of the human genome built from many sequencing reads and is continually improved.
- Early assemblies had gaps and errors that have been improved on in newer releases. The current primary assembly is GRCh38.
- Alternate loci are now included to represent structural and haplotype variations not in the primary assembly.
- The reference assembly is important for mapping variants and interpreting genomic data.
This paper proposes an evolutionary algorithm (FG-EA) to generate predictive features from biological sequence data for classification problems. FG-EA uses genetic programming to evolve tree-based representations of features from DNA sequences. It evaluates these features with a fitness function based on information gain before selecting high-scoring features. When applied to human and worm DNA splice-site prediction, FG-EA features improved classification performance over state-of-the-art methods, demonstrating the ability of evolutionary search to discover predictive sequence features.
High-throughput sequencing differs from Sanger sequencing in key ways, such as fragments being sequenced in parallel rather than via cloning. Several platforms are discussed, including their read lengths, throughput, error rates, and costs. Paired-end and targeted sequencing are also covered. Challenges in bioinformatics include assembly, alignment amid repeats and errors, and downstream analysis tasks. Popular aligners like BWA and Bowtie that use the Burrows-Wheeler transform are fast and accurate. De novo assembly requires specialized tools to handle short reads. RNA-seq adds further complexity to assembly.
The document discusses genome assembly and finishing processes. It begins by outlining typical project goals of completely restoring the genome and producing a high-quality consensus sequence. It then describes the evolution of sequencing technologies from Sanger to newer platforms and their impact on draft assemblies. Key steps in the assembly and finishing process include library preparation, assembly, identifying gaps, and improving consensus quality.
Inria Tech Talk: clustering complex data with MASSICCC (Stéphanie Roger)
MASSICCC: a SaaS platform for clustering complex, heterogeneous, and incomplete data.
In this Tech Talk, come discover, test, and learn to master MASSICCC (Massive clustering in cloud computing), a user-oriented SaaS platform, and its three families of clustering algorithms, the fruit of the latest advances from Inria's Modal & Celeste research teams, for analyzing and learning from your "Big Data" (e.g. real estate, predictive maintenance, healthcare, open data, etc.).
MASSICCC also offers:
- Free access for testing and research at https://massiccc.lille.inria.fr
- A "one for all" of clustering
- Highly interpretable results (with built-in charts)
- A SaaS mode that lets you track experiments (running or completed)
- And open-source algorithms that can be reused independently.
Integration of single molecule, genome mapping data in a web-based genome bro... (William Chow)
Sequence, Finishing and the Future Conference (SFAF 2015) Poster submission. Santa Fe, New Mexico.
Poster describes the gEVAL browser, and the integration of genome/optical map data for use in evaluating/curation of genome assemblies. Human, Mouse, Zebrafish, Pig, Helminth, Chicken.
The document describes string comparison techniques using matrix algebra and seaweed matrices. It introduces the concept of semi-local string comparison, which involves comparing a whole string to substrings of another string. The key idea is representing string comparison matrices implicitly using seaweed matrices, which represent unit-Monge matrices. This allows developing algebraic techniques for efficiently multiplying such matrices using the algebra of braids and the seaweed monoid. These multiplication techniques can then be applied to problems like dynamic programming string comparison and comparing compressed strings.
Similar to Comparative Genomics and de Bruijn graphs
The document provides an overview of the KNIME analytics platform and its capabilities. It discusses:
- KNIME's origins, offices, codebase, and application areas including pharma, healthcare, finance, retail, and more.
- The key components of the KNIME platform including data access, transformation, analysis, visualization, and deployment capabilities.
- Integrations with tools like R, Weka, databases, and file formats.
- Community contributions expanding KNIME's functionality in areas like bioinformatics, chemistry, image processing, and more.
The nuclear age has passed, and it is becoming increasingly clear that the focus of 21st-century science will be living systems, medicine, and the human being in all its manifestations. This is where the largest financial investments are being made, and it is on this field that humanity places its greatest hopes. More and more often we hear substantive discussions of topics that until recently seemed like science fiction: will humanity be able to defeat aging, cancer, and other fatal diseases? Will we be able to change our genome at will? Will we be masters of our bodies to the same extent that we are masters of the Earth?
For many decades, biology and medicine developed as descriptive sciences. However, as any science matures and accumulates information, it sooner or later switches to a more precise language: the language of mathematics. The Human Genome Project provided a technological breakthrough that will feed the life sciences for many years to come, but it has also posed many new global questions for today's scientists.
Cancer immunotherapy: a systems biology perspective. Maxim ...BioinformaticsInstitute
This document summarizes recent advances in cancer immunotherapy from the perspective of systems biology. It discusses how checkpoint blockade immunotherapy works by addressing the second co-inhibitory checkpoint signal needed for T cell activation. Computational methods are now able to identify tumor-specific neoantigens that can be targeted by immunotherapy. Mouse model studies showed that certain tumors are naturally rejected due to expression of a mutant antigen recognized by T cells, and that antigen-specific T cells are present before immunotherapy treatment. The high mutational load in melanoma makes it particularly responsive to checkpoint blockade. Early work in the 19th century by William Coley observed tumor regression following bacterial infection, which led to development of a toxin mixture that resembled modern vaccine formulations. Members of
http://bioinformaticsinstitute.ru/guests
On Friday, October 10, at 7:00 pm, Maria Shutova (Institute of General Genetics, Russian Academy of Sciences) gave an open lecture at the Bioinformatics Institute devoted to the study of cancer.
Cancer is one of the most common causes of death worldwide. The lecture examines how knowledge of evolution, genome function, and reprogramming, together with bioinformatics methods, has helped us better understand how a cancerous tumor develops and propose new treatments for various types of cancer. Mouse models of cancer development and the interesting results obtained with them are also discussed.
Sequencing as a tool for studying complex human phenotypes: from gen...BioinformaticsInstitute
This document summarizes genetic analyses of complex human phenotypes. It describes whole genome sequencing of individuals from bipolar disorder families and finding an association between genetic variation in a chromosome 6 region and amygdala volume. It also discusses rare variant sequencing of metabolic syndrome-related genes in Finnish cohorts, identifying new signals beyond existing GWAS hits. Additionally, it outlines exome and targeted sequencing of Tourette syndrome pedigrees, with a genome-wide significant result in a long non-coding RNA gene linked to the trait.
In his lecture, Andrey Afanasyev spoke about startups in biotech and bioinformatics and about his bioinformatics project iBinom, examined several biotech projects through the eyes of innovators and investors, touched on the question of finding investment, and shared his personal experience of working with venture funds and development institutions.
This document provides an overview of the ENCODE project and how its data can be accessed through the UCSC Genome Browser. It discusses the different types of ENCODE data available, including mapping data, gene annotations, expression data, regulatory information, and genetic variation. It also explains how to find, view, and download ENCODE tracks from the Genome Browser and where to get more information about ENCODE. The overall goal of the ENCODE project is to identify all functional elements in the human genome.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk at the Journées Nationales du GDR GPL 2024.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With an increasing population, people need to rely on packaged foodstuffs. Packaging of food materials requires the preservation of the food. There are various methods of treating food to preserve it, and irradiation treatment is one of them. It is the most common and most harmless method of food preservation, as it does not alter the essential micronutrients of the food. Although irradiated food does not harm human health, quality assessment of the food is still required to provide consumers with the necessary information about it. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during its processing. The ESR spin-trapping technique is useful for the detection of highly unstable radicals in food. The antioxidant capability of liquid food and beverages is mainly assessed by the spin-trapping technique.
Thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
Immersive Learning That Works: Research Grounding and Paths ForwardLeonel Morgado
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...Sérgio Sacani
Context. With a mass exceeding several 10⁴ M⊙ and a rich and dense population of massive stars, supermassive young star clusters represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a photon flux threshold of approximately 2 × 10⁻⁸ photons cm⁻² s⁻¹. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
1. Comparative Genomics and the de Bruijn graphs
Ilia Minkin
Pennsylvania State University
16th September 2016
3. What is comparative genomics?
The collection of all research activities that derive biological insights by comparing genomic features.¹
Why do it?
Learn evolution
Learn function
¹ Comparative Genomics, Xuhua Xia
5. Learn Function
A genomic sequence itself does not show its functions.
How to find function?
Compare with sequences of known function.
Conserved sequences are likely to be important.
How to compare genomes?
6. What is an Alignment?
Organisms inherit genomes, but with errors:
(figure: The Ancestor at the root, with Genome A and Genome B as its descendants)
Which characters did A and B inherit from their ancestor?
7. What is an Alignment?
Alignments are written down as a table:
ACTG-TGA
ACTACTGA
Blue letters are matches; yellow are mismatches; dashes are indels.
This is a global alignment.
8. The Global Alignment
ACTG-TGA
ACTACTGA
For two strings A and B:
Place them under each other
Insert dashes into A and B so that |A| = |B|
Penalize dashes and mismatches
Which alignment gives the least penalty?
Complexity: O(|A||B|)
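The dynamic program behind this slide can be sketched as follows. This is not the speaker's code, just a minimal Needleman-Wunsch style cost computation with unit penalties, filling an O(|A||B|) table:

```python
# Minimal global-alignment sketch: dp[i][j] = least penalty for aligning
# a[:i] with b[:j], with unit penalties for mismatches and dashes (indels).
def global_alignment_cost(a, b, mismatch=1, gap=1):
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap            # align a[:i] against nothing: i dashes
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i-1][j-1] + (0 if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = min(sub, dp[i-1][j] + gap, dp[i][j-1] + gap)
    return dp[n][m]

# The slide's example ACTG-TGA / ACTACTGA has one mismatch and one dash:
print(global_alignment_cost("ACTGTGA", "ACTACTGA"))  # 2
```

Traceback pointers (not shown) would recover the alignment table itself rather than just its cost.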
10. The Local Alignment
For large sequences the global alignment does not work:
GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA
Apart from indels and mismatches, there can be rearrangements.
Rearrangements change the order of whole blocks.
Similar subsequences can be interleaved with something else.
13. An Example
We can generalize to many genomes:
GAACTGTGATTATGCTCA
ATTTGGGACTACTGAGTA
ATCTTGAGATAGCTGAAA
Alignments:
ACTG-TGA
ACTACTGA
A-TGCTCA
15. Multiple Local Alignment
Issues:
Some subsequences can be present in some genomes and absent in others
Genomes can have duplications
Multiple sequence alignment is NP-hard
→ We need some heuristics
16. Another Approach
Another way to find common subsequences is to build a graph from the genomes.
In such a graph, homologous subsequences collapse into non-branching paths, while unique ones form disjoint paths.
25. The de Bruijn Graph
k = 2
Input strings: TGACGTC and TGACTTC
(figures: the de Bruijn graphs of the two strings are built and then merged; identical k-mers such as AC, GA, TG, and TC collapse into shared vertices, while vertices like CG, GT, CT, and TT remain on disjoint paths)
26. The de Bruijn graph
In the de Bruijn graph, identical substrings of length at least k + 1 are collapsed into non-branching paths.
We can use this to find homologous blocks.
We developed a tool, Sibelia, that finds such blocks in many bacterial genomes and handles repeats.
But we can do more.
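The construction underlying these slides can be sketched in a few lines. This is an illustrative toy, not Sibelia: each (k+1)-mer of each input string yields one edge between its two overlapping k-mers, so identical substrings of length ≥ k + 1 automatically share edges:

```python
# Toy de Bruijn graph construction: edges are (k+1)-mers, vertices are k-mers.
from collections import defaultdict

def de_bruijn_edges(strings, k):
    edges = defaultdict(int)  # (u, v) -> multiplicity across all inputs
    for s in strings:
        for i in range(len(s) - k):
            u, v = s[i:i+k], s[i+1:i+k+1]  # the two k-mers of one (k+1)-mer
            edges[(u, v)] += 1
    return edges

edges = de_bruijn_edges(["TGACGTC", "TGACTTC"], k=2)
# The shared prefix TGAC contributes identical edges for both strings:
print(edges[("TG", "GA")])  # 2: this edge is shared by both inputs
print(edges[("AC", "CG")])  # 1: unique to the first string
```

Non-branching runs of such shared edges are exactly the collapsed homologous blocks the slide describes.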
27. Alignment to a Graph
It is common to have an unassembled genome.
Reads are then aligned to a very similar reference genome.
29. Alignment to a Graph
Issues:
More than one reference?
Repeats within genomes?
Solution: align reads to a graph!
33. Alignment to a Graph
In the future, genome graphs will encode information about a population.
Aligning reads to a graph has many advantages:
Efficient alignment to many genomes
Reusing information about variants
Handling of repeats
The de Bruijn graph is a feasible model for a graph reference.
Issue: the graph can be too large.
38. The Challenge
Construct the compacted graph from many large genomes, bypassing the ordinary graph traversal.
Earlier work based on suffix arrays/trees (Sibelia, SplitMEM) handled 60 E. coli genomes.
A recent advance: 7 humans in 15 hours using 100 GB of RAM, with a BWT-based algorithm by Baier et al., 2015, and Beller et al., 2014.
41. Junctions
A vertex v is a junction if:
v has ≥ 2 distinct outgoing or incoming edges, or
v is the first or the last k-mer of an input string
Facts:
Junctions = vertices of the compacted graph
Compaction = finding the positions of junctions
45. The Observation
The observation only works when we have complete genomes.
Once we know the junctions, construction of the edges is simple: we can simply traverse the input strings and record junctions in the order they appear.
How to identify junctions?
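The edge-construction step described above can be sketched directly (illustrative only, with an assumed junction set): scan each string's k-mers in order, and each consecutive pair of junctions yields one compacted edge.

```python
# Sketch: given the set of junction k-mers, scan each string and join
# consecutive junction occurrences; each pair is one compacted-graph edge
# (the non-branching path between them is implicit).
def compacted_edges(strings, k, junctions):
    edges = []
    for s in strings:
        prev = None
        for i in range(len(s) - k + 1):
            v = s[i:i+k]
            if v in junctions:
                if prev is not None:
                    edges.append((prev, v))
                prev = v
    return edges

# Assumed junction set for TGACGTC / TGACTTC with k = 2:
print(compacted_edges(["TGACGTC", "TGACTTC"], 2, {"TG", "AC", "TC"}))
# [('TG', 'AC'), ('AC', 'TC'), ('TG', 'AC'), ('AC', 'TC')]
```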
47. The Naive Algorithm
A naive way:
Store all (k + 1)-mers (edges) in a hash table
Consider each vertex one by one
Query all possible edges from the table
If more than one edge is found, mark the vertex as a junction
Problem: the hash table can be too large.
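The naive pass can be sketched like this (an illustrative reconstruction, not the talk's code): a hash set of (k+1)-mer edges, then for each k-mer vertex all four possible outgoing and incoming edges are queried.

```python
# Naive junction finding over a DNA alphabet: a vertex is a junction if it
# has >= 2 distinct outgoing or incoming edges, or is a string's first/last
# k-mer (per the slide's definition).
def find_junctions(strings, k):
    edges, vertices, ends = set(), set(), set()
    for s in strings:
        ends.update({s[:k], s[len(s)-k:]})
        for i in range(len(s) - k + 1):
            vertices.add(s[i:i+k])
        for i in range(len(s) - k):
            edges.add((s[i:i+k], s[i+1:i+k+1]))  # hash table of (k+1)-mers
    junctions = set(ends)
    for v in vertices:
        out_deg = sum((v, v[1:] + c) in edges for c in "ACGT")
        in_deg = sum((c + v[:-1], v) in edges for c in "ACGT")
        if out_deg >= 2 or in_deg >= 2:
            junctions.add(v)
    return junctions

print(sorted(find_junctions(["TGACGTC", "TGACTTC"], 2)))  # ['AC', 'TC', 'TG']
```

On the two example strings, AC branches (AC→CG vs AC→CT), TC has two incoming edges, and TG/TC are terminal k-mers.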
51. What is the Bloom Filter?
A probabilistic data structure representing a set.
Properties:
Occupies fixed space
May generate false positives on queries
The false positive rate is low
Example: Bloom Filter = { GA → AC }
Is GA → AC in the set? Yes.
Is GA → AT in the set? Maybe (in fact, no).
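A Bloom filter can be written in a few lines; this is a toy version for intuition, not the paper's implementation. Elements are hashed to several positions in a fixed-size bit array; a query answers "maybe" only if every position is set, so there are no false negatives:

```python
# Toy Bloom filter: fixed-size bit array, num_hashes derived hash functions.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=64, num_hashes=3):
        self.size, self.k, self.bits = size_bits, num_hashes, 0

    def _positions(self, item):
        # Derive several hash positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # True means "maybe present" (possible false positive);
        # False means "definitely absent".
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("GA->AC")
print("GA->AC" in bf)  # True: an added element is never reported absent
```

Querying an element that was never added, such as "GA->AT", usually returns False, but may return True with low probability; that is exactly the false-positive behavior the next slides deal with.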
52. An Example
Bloom Filter = { GA → AC, GA → AT }
(figure: vertices AA, AG, AC, AT, and GA; the purple edge is a false positive)
54. The Two-Pass Algorithm
How to eliminate false positives?
Two-pass algorithm:
1. Use the Bloom filter to identify junction candidates
2. Use the hash table, but store only edges that touch candidates
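The two passes can be sketched as follows. This is an illustrative reconstruction with assumed details: a plain set stands in for the Bloom filter (its false positives are simulated by hand), edges are represented as (k+1)-mer strings, and the first/last-k-mer rule is omitted for brevity.

```python
# Pass 1: approximate degrees via the (possibly lying) Bloom filter give a
# superset of junctions (candidates). Pass 2: an exact hash table, restricted
# to edges touching a candidate, re-checks degrees and discards false hits.
def two_pass_junctions(strings, k, bloom_edges):
    vertices = {s[i:i+k] for s in strings for i in range(len(s) - k + 1)}
    candidates = set()
    for v in vertices:
        out_deg = sum((v + c) in bloom_edges for c in "ACGT")
        in_deg = sum((c + v) in bloom_edges for c in "ACGT")
        if out_deg >= 2 or in_deg >= 2:
            candidates.add(v)
    exact = set()
    for s in strings:
        for i in range(len(s) - k):
            e = s[i:i+k+1]
            if e[:k] in candidates or e[1:] in candidates:
                exact.add(e)          # only candidate-touching edges stored
    junctions = set()
    for v in candidates:
        out_deg = sum((v + c) in exact for c in "ACGT")
        in_deg = sum((c + v) in exact for c in "ACGT")
        if out_deg >= 2 or in_deg >= 2:
            junctions.add(v)
    return junctions

true_edges = {"TGA", "GAC", "ACG", "CGT", "GTC", "ACT", "CTT", "TTC"}
bloom = true_edges | {"GAT"}  # simulate one false-positive edge GA -> AT
print(sorted(two_pass_junctions(["TGACGTC", "TGACTTC"], 2, bloom)))
# ['AC', 'TC']: GA was a candidate only because of the false positive,
# and the exact second pass rejects it.
```

The memory saving comes from the second pass: the exact table holds only edges incident to candidates, a small fraction of all edges.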
55. An Example: the First Step
Here the edges are stored in the Bloom filter; the purple ones are false positives:
(figure: vertices AC, GT, CC, TT, CG, AT, GA, TG, TC, CT)
Junction candidates: GA, AC
56. An Example: the Second Step
Edges are stored in the hash table; we kept only the edges touching the junction candidates.
Junction: AC
57. Results
Datasets:
7 humans: 5 versions of the reference + 2 haplotypes of NA12878 from the 1000 Genomes Project
93 simulated humans (FIGG)
8 primates available in the UCSC Genome Browser
60. Conclusion and Future Work
Advantages of the algorithm:
Fast
Small memory footprint
Can handle large inputs
Drawbacks:
Less applicable for large k
Take-home message: it is easy to construct the compacted de Bruijn graph for complete genomes.
61. Conclusion and Future Work
Can potentially facilitate:
Visualization
Synteny mining (Sibelia)
Structural variation analysis
...
66. Splitting
Table 1: The minimal number of rounds it takes to compress the graph without exceeding a given memory threshold.
Memory threshold | Used memory | Bloom filter size | Running time | Rounds
10               | 8.62        | 8.59              | 259          | 1
8                | 6.73        | 4.29              | 434          | 3
6                | 5.98        | 4.29              | 539          | 4
4                | 3.51        | 2.14              | 665          | 6