The document describes the R'MES software for identifying exceptional motifs in DNA sequences. R'MES uses statistical methods to determine whether the number of occurrences of a motif is significantly higher than expected by chance, or significantly skewed between the two strands. It can compare the exceptionality of a motif between sequences and identify motifs that are over-represented or orientation-dependent in a genome. As examples, it discusses how R'MES was used to identify the Chi motif in Staphylococcus aureus and investigate the organization of DNA in Escherichia coli.
Martin Roth: A spatial peaks-over-threshold model in a nonstationary climate (Jiří Šmída)
1. The document proposes a spatial peaks-over-threshold model for estimating quantiles and trends in daily precipitation in a nonstationary climate.
2. It uses a generalized Pareto distribution fitted to precipitation extremes above a threshold to model peaks over threshold, with the threshold and distribution parameters allowed to vary over time in a nonstationary manner.
3. Spatial dependence is incorporated through an index flood approach where distribution parameters are constant across sites after scaling by a site-specific index flood value.
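The peaks-over-threshold step described above can be sketched with SciPy's generalized Pareto fit. The synthetic precipitation series, the 95th-percentile threshold, and the quantile level below are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
daily = rng.gamma(shape=2.0, scale=5.0, size=10_000)  # synthetic "daily precipitation"

u = np.quantile(daily, 0.95)       # threshold: 95th percentile
excess = daily[daily > u] - u      # peaks over threshold

# Fit a generalized Pareto distribution to the excesses (location fixed at 0)
shape, loc, scale = stats.genpareto.fit(excess, floc=0)

# A high quantile of the original series via the POT representation
p = 0.999
q = u + stats.genpareto.ppf((p - 0.95) / 0.05, shape, loc=0, scale=scale)
print(round(q, 2))
```

In the nonstationary, spatial setting of the paper, the threshold and GPD parameters would additionally vary over time and be scaled by a site-specific index flood value.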
Poster for Bayesian Statistics in the Big Data Era conference (Christian Robert)
The document proposes a new version of Hamiltonian Monte Carlo (HMC) sampling that is essentially calibration-free. It achieves this by learning the optimal leapfrog scale from the distribution of integration times using the No-U-Turn Sampler algorithm. Compared to the original NUTS algorithm on benchmark models, this new enhanced HMC (eHMC) exhibits significantly improved efficiency with no hand-tuning of parameters required. The document tests eHMC on a Susceptible-Infected-Recovered model of disease transmission.
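The leapfrog integrator whose step size and path length eHMC/NUTS tune can be sketched as follows; the standard-normal target and the specific `eps` and `L` values are illustrative assumptions:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """L leapfrog steps for Hamiltonian dynamics with potential U."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)          # initial half step for momentum
    for _ in range(L - 1):
        q += eps * p                    # full step for position
        p -= eps * grad_U(q)            # full step for momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)          # final half step for momentum
    return q, p

# Standard normal target: U(q) = q^2 / 2, so grad U(q) = q
grad_U = lambda q: q
q0, p0 = np.array([1.0]), np.array([0.5])
q1, p1 = leapfrog(q0, p0, grad_U, eps=0.1, L=20)

H = lambda q, p: 0.5 * q @ q + 0.5 * p @ p   # Hamiltonian (energy)
print(abs(H(q1, p1) - H(q0, p0)))            # small discretization error
```

The near-conservation of the Hamiltonian is what makes HMC proposals acceptable with high probability; eHMC's contribution is choosing `eps` and the path length without hand-tuning.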
1. The document discusses maximum likelihood estimation and Bayesian parameter estimation for machine learning problems involving parametric densities like the Gaussian.
2. Maximum likelihood estimation finds the parameter values that maximize the probability of obtaining the observed training data. For Gaussian distributions with unknown mean and variance, MLE returns the sample mean and variance.
3. Bayesian parameter estimation treats the parameters as random variables and uses prior distributions and observed data to obtain posterior distributions over the parameters. This allows incorporation of prior knowledge with the training data.
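For a univariate Gaussian, the MLE claim in point 2 can be verified directly; the data values below are made up for illustration:

```python
import numpy as np

x = np.array([2.1, 1.9, 2.4, 2.0, 2.6])

# MLE for a Gaussian: the sample mean and the (biased, 1/n) sample variance
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()   # equivalently np.var(x, ddof=0)

print(mu_hat, var_hat)
```

Note that the MLE variance divides by n, not n-1; the Bayesian treatment would instead place a prior on (mu, sigma^2) and report a posterior.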
This document discusses key concepts in probability theory, including:
1) Markov's inequality and Chebyshev's inequality, which relate the probability that a random variable exceeds a value to its expected value and variance.
2) The weak law of large numbers and central limit theorem, which describe how the means of independent random variables converge to the expected value and follow a normal distribution as the number of variables increases.
3) Stochastic processes, which are collections of random variables indexed by time or another parameter and can model evolving systems. Examples of stochastic processes and their properties are provided.
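Chebyshev's inequality from point 1 can be checked empirically; the exponential distribution (mean 1, variance 1) and the sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)   # mean 1, variance 1

k = 3.0
# Chebyshev: P(|X - mu| >= k * sigma) <= 1 / k^2
empirical = np.mean(np.abs(x - 1.0) >= k * 1.0)
bound = 1.0 / k**2
print(empirical, bound)
```

The empirical tail probability (about e^-4 here) sits well below the distribution-free bound of 1/9, as the inequality guarantees.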
There are three possible ROCs for a transform with poles at a, b, and c on the real axis:
1. Outside the circle through the outermost pole
2. An annular region between two of the pole circles
3. Inside the circle through the innermost pole
(Pole-zero plot: poles a, b, c on the Re axis.)
The z-Transform
Important z-Transform Pairs
1. Unit impulse: δ(n) = 1 if n = 0, and 0 otherwise; its z-transform is X(z) = 1, with ROC the entire z-plane.
2.
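The unit-impulse pair can be verified numerically: summing δ(n) z^(-n) over a finite support gives 1 for any nonzero z. A minimal sketch, with arbitrary sample points for z:

```python
import numpy as np

n = np.arange(-10, 11)
delta = (n == 0).astype(float)   # unit impulse: 1 at n = 0, else 0

def z_transform(x, n, z):
    """Finite-support z-transform: X(z) = sum_n x[n] z^(-n)."""
    return np.sum(x * np.power(complex(z), -n))

for z in [0.5, 2.0, 1j, 0.3 - 0.7j]:
    print(z_transform(delta, n, z))   # 1 for every z != 0
```

Since the sum converges for every nonzero z, the ROC of the unit impulse is the whole z-plane, unlike the pole-dependent regions discussed above.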
This document describes stochastic definite clause grammars (SDCG), which extend definite clause grammars (DCG) with probabilities. SDCG transforms a DCG into a stochastic logic program using PRISM, allowing probabilistic inferences and parameter learning. The probabilistic model assigns a random variable to each rule expansion. SDCG introduces syntax extensions like regular expressions and macros to make grammars more concise. Conditioned rules allow modeling higher-order hidden Markov models by selecting rules based on variable unification. SDCG provides tools for parsing sentences and learning rule probabilities from data.
The document discusses histograms and histogram equalization for digital image processing. It describes a histogram as an estimate of the probability distribution of gray values in an image, providing insight into the image's contrast. Histogram equalization is introduced as a technique that transforms an image's gray values such that the transformed values are approximately uniformly distributed, improving contrast by spreading out the most frequent intensities. The key steps of histogram equalization are outlined.
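The equalization steps outlined above amount to mapping each gray value through the scaled empirical CDF; a minimal sketch on a synthetic low-contrast image:

```python
import numpy as np

def equalize(img, levels=256):
    """Histogram equalization: map gray values through the scaled CDF."""
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = hist.cumsum() / img.size                     # empirical CDF in [0, 1]
    lut = np.round(cdf * (levels - 1)).astype(np.uint8)
    return lut[img]

rng = np.random.default_rng(0)
img = rng.integers(100, 140, size=(64, 64)).astype(np.uint8)  # low-contrast image
out = equalize(img)
print(img.min(), img.max(), out.min(), out.max())
```

The output occupies nearly the full [0, 255] range even though the input was squeezed into [100, 139], which is the contrast-stretching effect the document describes.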
This document describes polarization coherence tomography (PCT) techniques for tomographic imaging of natural volumetric media using polarimetric synthetic aperture radar (SAR). It discusses using Born's approximation and decomposing the scattering function over Legendre polynomials to relate the coherence to polynomial integrals. The basic steps of the technique are to 1) set the volume limits, 2) compute polynomial integrals, 3) select a polarization channel, and 4) invert the linear relation to reconstruct the scattering function and create tomographic images. It also proposes a least squares tomography approach using orthogonal polynomials to provide a compact solution for arbitrary polynomial order PCT.
Optimal constant-time approximation algorithms and inapproximability for every CSP in the bounded-degree model (Yuichi Yoshida)
1. The document discusses the maximum constraint satisfaction problem (Max CSP) and how to approximate its optimal value. It presents a basic linear programming (LP) relaxation called BasicLP that provides an (αΛ-ε, ε)-approximation for any CSP Λ, where αΛ is the integrality gap.
2. For some CSPs like Max Cut, BasicLP can be implemented as a packing LP and solved to give an (αΛ+ε, δ)-approximation in O(√n) time, improving on the Ω(n) time needed for general CSPs.
3. The document outlines how to derive the (αΛ+
The document summarizes the Wang-Landau algorithm and some of its improvements. The Wang-Landau algorithm is an adaptive Markov chain Monte Carlo method that iteratively estimates the density of states of a system. It partitions the state space into bins and iteratively adjusts estimates of the density within each bin so that the generated samples spend an equal amount of time in each bin. The algorithm has been improved through automatic binning methods, adaptive proposal distributions, and using parallel interacting chains. An example application to variable selection is also discussed.
EM algorithm and its application in probabilistic latent semantic analysis (zukun)
The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
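The EM iteration for pLSA can be sketched on a toy term-document matrix (the counts and the choice of K = 2 topics are made up for illustration). The log-likelihood is non-decreasing across iterations, which is the lower-bound argument in action:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-document count matrix n(w, d): 4 words x 4 documents (made-up counts)
N = np.array([[5., 2., 0., 0.],
              [4., 3., 0., 1.],
              [0., 1., 6., 4.],
              [0., 0., 5., 3.]])
W, D, K = 4, 4, 2   # words, documents, latent topics

p_w_z = rng.random((W, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z)
p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d)

def loglik():
    return np.sum(N * np.log(p_w_z @ p_z_d + 1e-12))     # sum n(w,d) log p(w|d)

lls = []
for _ in range(50):
    lls.append(loglik())
    joint = p_w_z[:, :, None] * p_z_d[None, :, :]        # shape (W, K, D)
    resp = joint / joint.sum(axis=1, keepdims=True)      # E-step: p(z|w,d)
    Nz = N[:, None, :] * resp                            # expected topic counts
    p_w_z = Nz.sum(axis=2); p_w_z /= p_w_z.sum(axis=0)   # M-step updates
    p_z_d = Nz.sum(axis=0); p_z_d /= p_z_d.sum(axis=0)

print(round(lls[0], 3), round(lls[-1], 3))
```

Each E-step computes topic responsibilities and each M-step re-normalizes expected counts, exactly the alternating lower-bound maximization the document describes.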
Spectral Learning Methods for Finite State Machines with Applications to Na... (LARCA UPC)
The document summarizes a spectral learning method for probabilistic finite-state machines (FSMs). It introduces observable operator models that represent probabilistic transducers using conditional probabilities between inputs, outputs, and hidden states. A key contribution is a spectral algorithm that learns the parameters of these models from data in linear time, with theoretical PAC-style guarantees. Experimental results on synthetic data show the method outperforms baselines like HMMs and k-HMMs on learning tasks.
This document summarizes and compares different distances that can be used in generative adversarial networks (GANs). It introduces the Wasserstein distance, also known as the Earth Mover (EM) distance or Wasserstein-1 distance. The document shows that the Wasserstein distance is more meaningful than other distances like total variation, Kullback-Leibler divergence, and Jensen-Shannon divergence when the real and generated distributions start to differ but their supports still overlap. It also demonstrates that training GANs with the Wasserstein distance provides improved stability during training compared to other distances. Several theorems and examples are provided to illustrate properties of the Wasserstein distance such as Lipschitz continuity.
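In one dimension with equal-size samples, the Wasserstein-1 distance reduces to the mean absolute difference of sorted samples, which makes the "more meaningful" claim easy to see on a toy example (the values below are illustrative):

```python
import numpy as np

def w1(a, b):
    """Wasserstein-1 between two equal-size 1-D empirical samples:
    with equal weights it is the mean absolute difference of sorted samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

real = np.array([0.0, 0.0, 0.0, 0.0])
for theta in [0.0, 0.5, 1.0, 2.0]:
    fake = np.full(4, theta)
    print(theta, w1(real, fake))   # grows smoothly with theta
```

Here W1 equals |theta| and shrinks continuously as the generated mass approaches the real mass, whereas total variation or JS divergence would jump to their maximum for any theta != 0, giving no useful gradient.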
Digital signal processing (DSP) is concerned with the digital representation and processing of signals. DSP has advantages over analog processing like guaranteed accuracy, reproducibility, and flexibility. DSP has applications in areas like image processing, speech recognition, telecommunications, and biomedical signal analysis. A discrete-time signal is represented as a sequence of numbers. A discrete-time system maps an input signal to an output signal. Linear, time-invariant systems can be characterized by their impulse response. The discrete Fourier transform provides a means to obtain Fourier components of a signal using digital computation at discrete frequencies.
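The discrete Fourier transform mentioned at the end can be written out directly from its definition and checked against NumPy's FFT; the input sequence is arbitrary:

```python
import numpy as np

def dft(x):
    """Naive O(N^2) DFT: X[k] = sum_n x[n] exp(-2*pi*i*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

x = np.array([1.0, 2.0, 0.0, -1.0])
print(np.allclose(dft(x), np.fft.fft(x)))   # True
```

The naive sum makes the "Fourier components at discrete frequencies" idea concrete; in practice the FFT computes the same result in O(N log N).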
14th Athens Colloquium on Algorithms and Complexity (ACAC19) (Apostolos Chalkis)
This document presents a new method for estimating the volume of convex polytopes, practical volume estimation with a new annealing schedule. It uses a multiphase Monte Carlo approach with a sequence of concentric convex bodies to approximate the volume. A new simulated annealing method constructs a sparser sequence of bodies. Billiard walk sampling is used for V-represented polytopes and zonotopes. The method scales to dimension 100 within an hour for random V-polytopes and zonotopes, outperforming previous methods with theoretical complexity of O*(d^3).
This document summarizes a method for performing kernel-based similarity search in massive graph databases using wavelet trees. It introduces the need for efficient graph similarity search as graph databases grow large. It describes representing graphs as bags-of-words and using a semi-conjunctive query to relax cosine similarity searches. The method replaces inverted indexes with a wavelet tree to enable fast top-down search while using less memory than traditional inverted indexes. Experiments on a dataset of 25 million chemical compounds demonstrate the method's ability to perform similarity search efficiently in large graph databases.
- The thesis studies numerical methods for stochastic partial differential equations (SPDEs) subject to generalized Levy noise.
- It develops both deterministic methods using the Fokker-Planck equation and probabilistic methods like polynomial chaos.
- Key contributions include developing adaptive multi-element polynomial chaos for discrete measures, comparing approaches to construct orthogonal polynomials over discrete measures, and improving efficiency and accuracy through adaptive integration meshes and sparse grids.
Multimodal pattern matching algorithms and applications (Xavier Anguera)
In this presentation I focus on 3 projects I have been working on in the last year. The first one is a novel pattern matching algorithm, based on the well-known Dynamic Time Warping. The presented algorithm can be used to find real-valued subsequences within a longer sequence, without prior knowledge of their start and end points. I have applied the algorithm to the task of acoustic matching, for which I will show some preliminary results. Then I will continue to explain a second DTW-based algorithm, this one able to perform an online alignment of two musical pieces. One of the music pieces can be input live or retrieved from an audio file, while the second one is extracted from an online music video. The online alignment allows the music video to be played in total synchrony with the corresponding ambient/recorded audio. Finally, I will talk about video copy detection, which is the task of finding duplicate video segments within a big database. I will explain our multimodal approach, based on audio-visual change-based features.
(1) The document describes a method for efficient similarity search in massive graph databases using wavelet trees. (2) It converts graphs into bags-of-words representations using the Weisfeiler-Lehman procedure and indexes the words with a wavelet tree to enable fast semi-conjunctive queries. (3) Experiments on 25 million chemical compounds showed the method was significantly faster than alternative approaches while using less memory.
High-dimensional polytopes defined by oracles: algorithms, computations and a... (Vissarion Fisikopoulos)
The document discusses algorithms for computing volumes of polytopes. It notes that exactly computing volumes is hard, but randomized polynomial-time algorithms can approximate volumes with high probability. It describes two algorithms: Random Directions Hit-and-Run (RDHR), which generates random points within a polytope via random walks; and Multiphase Monte Carlo, which approximates a polytope's volume by sampling points within a sequence of enclosing balls. RDHR mixes in O(d^3) steps and these algorithms can compute volumes of high-dimensional polytopes that exact algorithms cannot handle.
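As a baseline for why such randomized methods are needed, here is the naive rejection-sampling volume estimator. It works in low dimension but its acceptance rate degrades exponentially as d grows, which is exactly what hit-and-run and multiphase Monte Carlo avoid. The 3-dimensional simplex is an illustrative choice with a known volume of 1/3! = 1/6:

```python
import numpy as np

rng = np.random.default_rng(0)

def volume_estimate(inside, d, cube_vol, n=200_000):
    """Crude Monte Carlo volume: fraction of uniform points from the
    bounding cube [0,1]^d landing inside the body, times the cube volume."""
    pts = rng.random((n, d))
    return cube_vol * np.mean(inside(pts))

# d-dimensional standard simplex {x >= 0, sum x <= 1}; true volume = 1/d!
d = 3
inside = lambda pts: pts.sum(axis=1) <= 1.0
est = volume_estimate(inside, d, cube_vol=1.0)
print(est)   # close to 1/6
```

For d = 100 the simplex occupies about 1/100! of the cube, so essentially no sample would ever land inside; the sequence-of-enclosing-bodies trick replaces this single vanishing ratio with a product of moderate ratios.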
1) Pairwise sequence alignment is a method to compare two biological sequences like DNA, RNA, or proteins. It involves arranging the sequences in columns to highlight their similarities and differences.
2) There are many possible alignments between two sequences, but most imply too many mutations. The best alignment minimizes the number of mutations needed to explain the differences between the sequences.
3) For short protein sequences like "QKGSYPVRSTC" and "QKGSGPVRSTC", the optimal alignment implies one single mutation occurred since the sequences diverged from a common ancestor.
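For these two equal-length sequences the gap-free alignment cost is just the number of mismatched columns, which confirms the single-mutation claim:

```python
s1 = "QKGSYPVRSTC"
s2 = "QKGSGPVRSTC"

# For two same-length sequences, the gap-free alignment cost is the
# Hamming distance: count the columns where the residues differ.
mismatches = [(i, a, b) for i, (a, b) in enumerate(zip(s1, s2)) if a != b]
print(mismatches)   # [(4, 'Y', 'G')]: one substitution since divergence
```

General pairwise alignment additionally allows gaps (insertions and deletions) and scores them with dynamic programming, but for this example the single substitution at position 4 is already the minimal-mutation explanation.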
This document presents a method for estimating the eigenvalues of a covariance matrix when there are few samples. It involves shifting the sampled eigenvalues toward the population values based on theoretical distributions, and balancing the energy across eigenvalues. This simple 3-matrix approach improves estimation and detection performance compared to using the sampled eigenvalues alone. Simulations and hyperspectral data experiments demonstrate the effectiveness of the method.
Computing the volume of a convex body is a fundamental problem in computational geometry and optimization. In this talk we discuss the computational complexity of this problem from a theoretical as well as practical point of view. We show examples of how volume computation appears in applications ranging from combinatorics to algebraic geometry.
Next, we design the first practical algorithm for polytope volume approximation in high dimensions (few hundreds).
The algorithm utilizes uniform sampling from a convex region and efficient boundary polytope oracles.
Interestingly, our software provides a framework for exploring theoretical advances since it is believed, and our experiments provide evidence for this belief, that the current asymptotic bounds are unrealistically high.
DESeq models read counts with a negative binomial distribution to account for biological variability between samples, which a Poisson distribution underestimates. It estimates variance for each gene based on a local regression of variance against mean expression of other genes. This allows it to better control false positives compared to EdgeR or a Poisson model. DESeq also estimates sequencing depth differently than EdgeR to improve differential expression testing across the dynamic range of expression levels.
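The mean-variance relationship that distinguishes the negative binomial from the Poisson can be written out directly; the dispersion value alpha below is an assumed illustration, not a DESeq estimate:

```python
import numpy as np

mu = np.array([5.0, 50.0, 500.0])   # mean read counts across expression levels
alpha = 0.1                          # assumed gene-wise dispersion

poisson_var = mu                     # Poisson: variance equals the mean
nb_var = mu + alpha * mu**2          # negative binomial: extra biological variability

print(nb_var / poisson_var)          # overdispersion factor 1 + alpha*mu
```

The overdispersion factor grows with expression level, which is why a Poisson model increasingly understates variance (and inflates false positives) for highly expressed genes.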
This document summarizes a thesis on numerical methods for stochastic systems subject to generalized Levy noise. It includes:
1) Motivation for studying such systems from both mathematical and applicational perspectives, such as in mathematical finance and chaotic flows.
2) An introduction to Levy processes and the probability collocation method (PCM) for uncertainty quantification (UQ).
3) Details on improving PCM through a multi-element approach and constructing orthogonal polynomials for discrete measures.
TMPA-2015: Implementing the MetaVCG Approach in the C-light System (Iosif Itkin)
Alexei Promsky, Dmitry Kondratyev, A.P. Ershov Institute of Informatics Systems, Novosibirsk
12 - 14 November 2015
Tools and Methods of Program Analysis in St. Petersburg
I am Vincent S., an Algorithm Assignment Expert at programminghomeworkhelp.com. I hold a Ph.D. in Programming from the University of Minnesota, USA. I have been helping students with their homework for the past 9 years, solving assignments related to algorithms.
Visit programminghomeworkhelp.com or email support@programminghomeworkhelp.com. You can also call on +1 678 648 4277 for any assistance with Algorithm assignments.
次数制限モデルにおける全てのCSPに対する最適な定数時間近似アルゴリズムと近似困難性Yuichi Yoshida
1. The document discusses the maximum constraint satisfaction problem (Max CSP) and how to approximate its optimal value. It presents a basic linear programming (LP) relaxation called BasicLP that provides an (αΛ-ε, ε)-approximation for any CSP Λ, where αΛ is the integrality gap.
2. For some CSPs like Max Cut, BasicLP can be implemented as a packing LP and solved in polynomial time to give an (αΛ+ε, δ)-approximation in √n time, improving on the Ω(n) time needed for general CSPs.
3. The document outlines how to derive the (αΛ+
The document summarizes the Wang-Landau algorithm and some of its improvements. The Wang-Landau algorithm is an adaptive Markov chain Monte Carlo method that iteratively estimates the density of states of a system. It partitions the state space into bins and iteratively adjusts estimates of the density within each bin so that the generated samples spend an equal amount of time in each bin. The algorithm has been improved through automatic binning methods, adaptive proposal distributions, and using parallel interacting chains. An example application to variable selection is also discussed.
EM algorithm and its application in probabilistic latent semantic analysiszukun
The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
Spectral Learning Methods for Finite State Machines with Applications to Na...LARCA UPC
The document summarizes a spectral learning method for probabilistic finite-state machines (FSMs). It introduces observable operator models that represent probabilistic transducers using conditional probabilities between inputs, outputs, and hidden states. A key contribution is a spectral algorithm that learns the parameters of these models from data in linear time, with theoretical PAC-style guarantees. Experimental results on synthetic data show the method outperforms baselines like HMMs and k-HMMs on learning tasks.
This document summarizes and compares different distances that can be used in generative adversarial networks (GANs). It introduces the Wasserstein distance, also known as the Earth Mover (EM) distance or Wasserstein-1 distance. The document shows that the Wasserstein distance is more meaningful than other distances like total variation, Kullback-Leibler divergence, and Jensen-Shannon divergence when the real and generated distributions start to differ but their support still overlap. It also demonstrates that training GANs with the Wasserstein distance provides improved stability during training compared to other distances. Several theorems and examples are provided to illustrate properties of the Wasserstein distance such as Lipschitz continuity.
Digital signal processing (DSP) is concerned with the digital representation and processing of signals. DSP has advantages over analog processing like guaranteed accuracy, reproducibility, and flexibility. DSP has applications in areas like image processing, speech recognition, telecommunications, and biomedical signal analysis. A discrete-time signal is represented as a sequence of numbers. A discrete-time system maps an input signal to an output signal. Linear, time-invariant systems can be characterized by their impulse response. The discrete Fourier transform provides a means to obtain Fourier components of a signal using digital computation at discrete frequencies.
14th Athens Colloquium on Algorithms and Complexity (ACAC19)Apostolos Chalkis
This document presents a new method for estimating the volume of convex polytopes called practical volume estimation by a new annealing schedule. It uses a multiphase Monte Carlo approach with a sequence of concentric convex bodies to approximate the volume. A new simulated annealing method constructs a sparser sequence of bodies. Billiard walk sampling is used for volume-represented and zonotope polytopes. The method scales to dimensions of 100 in an hour for random V-polytopes and zonotopes, outperforming previous methods with theoretical complexity of O*(d^3).
This document summarizes a method for performing kernel-based similarity search in massive graph databases using wavelet trees. It introduces the need for efficient graph similarity search as graph databases grow large. It describes representing graphs as bags-of-words and using a semi-conjunctive query to relax cosine similarity searches. The method replaces inverted indexes with a wavelet tree to enable fast top-down search while using less memory than traditional inverted indexes. Experiments on a dataset of 25 million chemical compounds demonstrate the method's ability to perform similarity search efficiently in large graph databases.
- The thesis studies numerical methods for stochastic partial differential equations (SPDEs) subject to generalized Levy noise.
- It develops both deterministic methods using the Fokker-Planck equation and probabilistic methods like polynomial chaos.
- Key contributions include developing adaptive multi-element polynomial chaos for discrete measures, comparing approaches to construct orthogonal polynomials over discrete measures, and improving efficiency and accuracy through adaptive integration meshes and sparse grids.
Multimodal pattern matching algorithms and applicationsXavier Anguera
In this presentation I focus on 3 projects I have been working in the last year. The first one is a novel pattern matching algorithm, based on the well known Dynamic Time Warping. The presented algorithm can be used to find real-valued subsequences within a longer sequence, without prior knowledge of their start-end points. I have applied the algorithm for the task of acoustic matching, for which I will show some preliminary results. Then I will continue to explain a second DTW-based algorithm, this one being able do an online of two musical pieces. One of the music pieces can be input life or be retrieved from an audio file, while the second one is extracted from an online music video. The online alignment allows for the music video to be played in total synchrony with the corresponding ambient/recorded audio. Finally, I will talk about video copy detection, which is the task of finding video duplicate segments within a big database. I will explain our multimodal approach, based on audio-visual change-based features.
(1) The document describes a method for efficient similarity search in massive graph databases using wavelet trees. (2) It converts graphs into bags-of-words representations using the Weisfeiler-Lehman procedure and indexes the words with a wavelet tree to enable fast semi-conjunctive queries. (3) Experiments on 25 million chemical compounds showed the method was significantly faster than alternative approaches while using less memory.
High-dimensional polytopes defined by oracles: algorithms, computations and a...Vissarion Fisikopoulos
The document discusses algorithms for computing volumes of polytopes. It notes that exactly computing volumes is hard, but randomized polynomial-time algorithms can approximate volumes with high probability. It describes two algorithms: Random Directions Hit-and-Run (RDHR), which generates random points within a polytope via random walks; and Multiphase Monte Carlo, which approximates a polytope's volume by sampling points within a sequence of enclosing balls. RDHR mixes in O(d^3) steps and these algorithms can compute volumes of high-dimensional polytopes that exact algorithms cannot handle.
1) Pairwise sequence alignment is a method to compare two biological sequences like DNA, RNA, or proteins. It involves arranging the sequences in columns to highlight their similarities and differences.
2) There are many possible alignments between two sequences, but most imply too many mutations. The best alignment minimizes the number of mutations needed to explain the differences between the sequences.
3) For short protein sequences like "QKGSYPVRSTC" and "QKGSGPVRSTC", the optimal alignment implies one single mutation occurred since the sequences diverged from a common ancestor.
This document presents a method for estimating the eigenvalues of a covariance matrix when there are few samples. It involves shifting the sampled eigenvalues toward the population values based on theoretical distributions, and balancing the energy across eigenvalues. This simple 3-matrix approach improves estimation and detection performance compared to using the sampled eigenvalues alone. Simulations and hyperspectral data experiments demonstrate the effectiveness of the method.
Computing the volume of a convex body is a fundamental problem in computational geometry and optimization. In this talk we discuss the computational complexity of this problem from a theoretical as well as practical point of view. We show examples of how volume computation appear in applications ranging from combinatorics to algebraic geometry.
Next, we design the first practical algorithm for polytope volume approximation in high dimensions (few hundreds).
The algorithm utilizes uniform sampling from a convex region and efficient boundary polytope oracles.
Interestingly, our software provides a framework for exploring theoretical advances since it is believed, and our experiments provide evidence for this belief, that the current asymptotic bounds are unrealistically high.
DESeq models read counts with a negative binomial distribution to account for biological variability between samples, which a Poisson distribution underestimates. It estimates variance for each gene based on a local regression of variance against mean expression of other genes. This allows it to better control false positives compared to EdgeR or a Poisson model. DESeq also estimates sequencing depth differently than EdgeR to improve differential expression testing across the dynamic range of expression levels.
This document summarizes a thesis on numerical methods for stochastic systems subject to generalized Levy noise. It includes:
1) Motivation for studying such systems from both mathematical and applicational perspectives, such as in mathematical finance and chaotic flows.
2) An introduction to Levy processes and the probability collocation method (PCM) for uncertainty quantification (UQ).
3) Details on improving PCM through a multi-element approach and constructing orthogonal polynomials for discrete measures.
TMPA-2015: Implementing the MetaVCG Approach in the C-light System (Iosif Itkin)
Alexei Promsky, Dmitry Kondratyev, A.P. Ershov Institute of Informatics Systems, Novosibirsk
12 - 14 November 2015
Tools and Methods of Program Analysis in St. Petersburg
Markov Chain Monitoring - Application to demand prediction in bike sharing sy... (Harshal Chaudhari)
The presentation accompanying the paper at SDM 2018 - https://epubs.siam.org/doi/abs/10.1137/1.9781611975321.50
Github: https://github.com/chdhr-harshal/mc-monitor
In networking applications, one often wishes to obtain estimates about the number of objects at different parts of the network (e.g., the number of cars at an intersection of a road network or the number of packets expected to reach a node in a computer network) by monitoring the traffic in a small number of network nodes or edges. We formalize this task by defining the Markov Chain Monitoring problem. Given an initial distribution of items over the nodes of a Markov chain, we wish to estimate the distribution of items at subsequent times. We do this by asking a limited number of queries that retrieve, for example, how many items transitioned to a specific node or over a specific edge at a particular time. We consider different types of queries, each defining a different variant of the Markov Chain Monitoring. For each variant, we design efficient algorithms for choosing the queries that make our estimates as accurate as possible. In our experiments with synthetic and real datasets we demonstrate the efficiency and the efficacy of our algorithms in a variety of settings.
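The estimation setting described above can be sketched in a few lines. This is a minimal illustration of pushing an item distribution one step through a Markov chain, with a toy transition matrix and counts; it does not implement the paper's query-selection algorithms:

```python
# Toy setup: items distributed over the nodes of a 3-node Markov chain,
# pushed forward one step through the row-stochastic transition matrix.
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
]
x0 = [100, 50, 50]  # initial item counts per node

def step(x, P):
    """Expected item counts after one transition: x_{t+1}[j] = sum_i x_t[i] * P[i][j]."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

x1 = step(x0, P)
print(x1)
```

Queries (e.g. observing how many items actually transitioned over an edge) would then be used to correct this expected distribution; choosing which queries to ask is the optimization problem the paper addresses.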
This document summarizes a presentation about variational autoencoders (VAEs) presented at the ICLR 2016 conference. The document discusses 5 VAE-related papers presented at ICLR 2016, including Importance Weighted Autoencoders, The Variational Fair Autoencoder, Generating Images from Captions with Attention, Variational Gaussian Process, and Variationally Auto-Encoded Deep Gaussian Processes. It also provides background on variational inference and VAEs, explaining how VAEs use neural networks to model probability distributions and maximize a lower bound on the log likelihood.
We examine the effectiveness of randomized quasi Monte Carlo (RQMC) to improve the convergence rate of the mean integrated square error, compared with crude Monte Carlo (MC), when estimating the density of a random variable X defined as a function over the s-dimensional unit cube (0,1)^s. We consider histograms and kernel density estimators. We show both theoretically and empirically that RQMC estimators can achieve faster convergence rates in some situations.
This is joint work with Amal Ben Abdellah, Art B. Owen, and Florian Puchhammer.
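As a toy contrast between the two kinds of point sets (not the paper's estimators), the following compares crude MC points with a randomly shifted rank-1 lattice, one simple RQMC construction; the generating vector and sample size are illustrative choices:

```python
import random

# Estimate the density of X = U1 + U2 over (0,1)^2 with a histogram,
# using crude MC points vs a randomly shifted rank-1 lattice (RQMC).

def hist_density(points, bins=8):
    h = [0] * bins
    for u1, u2 in points:
        x = u1 + u2                       # X lies in (0, 2)
        h[min(int(x / 2 * bins), bins - 1)] += 1
    n = len(points)
    return [c / n * bins / 2 for c in h]  # normalize counts to a density

n = 1024
mc = [(random.random(), random.random()) for _ in range(n)]

z = (1, 433)                               # toy generating vector, gcd(433, n) = 1
s1, s2 = random.random(), random.random()  # random shift makes the rule unbiased
lattice = [(((i * z[0] / n) + s1) % 1, ((i * z[1] / n) + s2) % 1)
           for i in range(n)]

print(hist_density(mc))
print(hist_density(lattice))
```

Repeating this over many random shifts/seeds and averaging the integrated square error is how the MC-vs-RQMC convergence rates discussed above would be compared empirically.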
The document describes MOLGENIS, a database generator that can automatically generate useful data applications from simple data models. It is being used to create xGAP (extensible Genotype And Phenotype database), which aims to harmonize and enable collaboration and analysis of diverse genotype and phenotype data. The document outlines challenges in data integration in biology and how MOLGENIS addresses these challenges through its platform and software generators.
The 10th Annual Bioinformatics Open Source Conference (BOSC 2009) was held June 27-28, 2009 and organized by Kam Dahlquist, Lonnie Welch, and others. The conference schedule and information were available online and included calls for lightning talks and Birds of a Feather sessions. Lunch was provided each day and the conference featured keynote speakers, sessions, and a student travel award.
The document discusses a panel at the 2009 Bioinformatics Open Source Community conference about applying software patterns to bioinformatics open source development. The panel explores how patterns can help create better bioinformatics software, patterns they have already identified, what is done with patterns, whether there is a pattern repository, and who maintains such a repository.
The document discusses two software patterns used in developing Chipster, a bioinformatics application: graceful GUI blocking, which places an opaque layer over the GUI to indicate loading and prevent user interaction; and self-service distributed state management, which distributes application state management to clients to avoid single points of failure in a distributed system. The patterns were found useful for Chipster, which provides bioinformatics analysis tools through a graphical interface and supports distributed computing.
This document discusses the discovery that DNA previously thought to have no value ("junk DNA") may actually play important roles in gene expression regulation. Scientists investigated junk DNA in the model plant Arabidopsis thaliana and found short, linked patterns of DNA called pyknons. This suggests a universal genetic mechanism is at play across biology that is not yet fully understood. The discovery illustrates the connection between coding and non-coding DNA and that the term "junk DNA" may need reevaluation.
EMBOSS is an open source software suite for sequence analysis. It contains over 200 applications and supports over 100 file formats. It is funded by the UK BBSRC and developed by researchers at the EBI, Sanger Institute, and other institutions. EMBOSS faced an uncertain future in 2004 when its original developers were relocated, but continued funding has allowed ongoing development and support for a worldwide user base conducting research on all continents.
BioJava is an open source Java framework for processing biological data. It provides tools for analyzing and manipulating sequences, structures, and other biological data. The latest version, BioJava 1.7, includes improved support for 3D structures and modularization into separate modules. The project aims to facilitate rapid bioinformatics application development and is supported by an active developer community.
Soaplab is a generator of web services for accessing command-line programs and other tools. It wraps hundreds of EMBOSS programs and other plugins as SOAP web services. A new release, Soaplab 2.2.0, adds support for "typed services" which define inputs and outputs using WSDL and XSD for better integration with third party tools. Developers can add new command line tools or plugins and Soaplab will generate the corresponding web services.
This document provides an update on the Biopython project. It discusses recent releases including support for new file formats like FASTQ and new modules. It outlines current and future projects including work on parsing new file types and switching from CVS to git version control. Development involves an international team through an open source model and is supported by various organizations.
The document discusses software patterns for reusable design, outlining what a software pattern is, how patterns are used within communities, and how to apply patterns to documentation, design, and development. It provides an overview of pattern concepts including what constitutes a pattern, pattern languages, and pattern communities while cautioning that patterns should not be viewed as a "turn the crank" approach to software development.
PSODA is an open-source phylogenetic search and DNA analysis package that is compatible with PAUP* and adds a scripting language to PAUP blocks to allow for advanced meta-searches. It began development in 2005 as an alternative to PAUP* that could be used for phylogenetic search, multiple alignment, and detecting natural selection. PSODA's scripting language, PSODAscript, adds functionality like decision statements, loops, and functions to PAUP blocks and allows for easy scripting of meta-searches.
This document summarizes an approach called VAMSAS that enables sharing of data like sequences, alignments, and annotations between different bioinformatics tools. It describes how VAMSAS uses a shared XML document and client library to allow tools to access and update shared data, view events in other tools, and better integrate workflows. Examples of tools like TOPALI and Jalview that could benefit from this approach are discussed.
This document describes a method for discovering composite motifs in DNA sequences. The method searches for overrepresented patterns representing transcription factor binding sites. It improves on previous methods by modeling motifs as modules that occur together, rather than as isolated patterns. The algorithm ranks predicted modules based on support, specificity and significance. It was shown to outperform other tools, particularly at realistic noise levels, due to its use of real DNA backgrounds and support-based scoring. Future work includes exploring the full Pareto front of optimal solutions and parameter interactions to improve predictions.
3) The algorithm works by first discovering short motif seeds, then extending these seeds into full length position weight matrices, and iteratively refining the matrices to discover overrepresented motifs.
Debian-Med is a Debian Linux distribution community focused on adapting and disseminating open source bioinformatics software. It maintains over 160 bioinformatics packages through a collaborative development process, providing quality assurance and compatibility across multiple architectures. The community aims to improve access to packages for high-performance and cloud computing as well as ease of data management and distribution of bioinformatics libraries and software.
The document summarizes the BioLib project, which aims to create C/C++ libraries for common biological functionality that can be accessed from multiple bioinformatics programming languages to avoid duplication of efforts. It has created bindings for several existing libraries, including Affyio, Staden IO, GSL, Rlib, and others. The project uses Git for version control, CMake for building, and SWIG for generating language bindings in an effort to maximize code reuse across languages.
This document introduces BNFinder, a Python software for reconstructing Bayesian networks and dynamic Bayesian networks from data. It uses a fast, exact algorithm to find the optimal network topology, unlike traditional Markov chain Monte Carlo methods. The software supports discrete and continuous data, different scoring functions, and datasets with perturbations. It is open source and runs efficiently on large real-world genomic and neural network examples. Future plans include parallelization and improvements to continuous variable and classification models.
The document discusses the BioHDF project which aims to develop scalable data infrastructure for bioinformatics using HDF5. It notes that next generation DNA sequencing is producing vast amounts of complex data that is challenging to analyze and compare across samples due to lack of consistent data models and structured storage. The BioHDF project seeks to address this by developing HDF5 domain extensions and tools to organize, index, annotate and access sequencing data in a way that enables more efficient analysis, visualization and exploration of results within and between samples.
The document describes the biomanycores.org project, which aims to create a repository of open-source GPU-accelerated bioinformatics algorithms. It provides interfaces to popular bioinformatics tools like BioJava, BioPerl, and Biopython to easily integrate the GPU implementations. The project currently includes tools like Smith-Waterman alignment and PWM scanning. The challenges include differing APIs, object representations, real-world pipelines, and licensing. The goals are to share more OpenCL code, integrate and benchmark new algorithms, and improve usability for bioinformaticians.
This document discusses Q-normalization, a method for normalizing gene expression data. It presents parallel implementations of Q-normalization using shared memory, message passing, and GPU architectures. Benchmarking shows the GPU implementation provides a 5.5x speedup over the sequential CPU version for processing large gene expression datasets. The shared memory implementation provides a 2.9x total speedup, while the message passing version is suitable for distributed memory clusters.
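For reference, the sequential algorithm being parallelized can be sketched as plain quantile normalization (a minimal version; the benchmarked shared-memory, message-passing, and GPU implementations are not shown in the document):

```python
# Quantile normalization: each sample's k-th smallest value is replaced
# by the mean of the k-th smallest values across all samples, so every
# sample ends up with the same empirical distribution.

def quantile_normalize(columns):
    n = len(columns[0])
    sorted_cols = [sorted(c) for c in columns]
    rank_means = [sum(col[k] for col in sorted_cols) / len(columns)
                  for k in range(n)]
    out = []
    for c in columns:
        order = sorted(range(n), key=c.__getitem__)  # indices by increasing value
        norm = [0.0] * n
        for rank, idx in enumerate(order):
            norm[idx] = rank_means[rank]             # assign rank mean back in place
        out.append(norm)
    return out

samples = [[5, 2, 3], [4, 1, 6]]  # two samples, three genes
print(quantile_normalize(samples))
```

The per-sample sort and rank assignment are independent across samples, which is what makes the method amenable to the parallel architectures benchmarked above.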
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx (SitimaJohn)
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Programming Foundation Models with DSPy - Meetup Slides (Zilliz)
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence (IndexBug)
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
Digital Marketing Trends in 2024 | Guide for Staying Ahead (Wask)
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers (akankshawande)
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Schbath Rmes Bosc2009
1. R’MES
Finding Exceptional Motifs in Sequences
S. Schbath
INRA, Jouy-en-Josas, France
http://genome.jouy.inra.fr/ssb/rmes/
BOSC, Stockholm, June 27-28, 2009 – p.1
3. DNA and motifs
• DNA: long molecule, a sequence of nucleotides.
• Nucleotides: A(denine), C(ytosine), G(uanine), T(hymine).
• Motif (= oligonucleotide): short sequence of nucleotides, e.g. CAGTAG.
• Functional motif: recognized by proteins or enzymes to initiate a biological process.
TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA...
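The count statistic used throughout the talk, N^obs(w), is simply the number of (possibly overlapping) occurrences of the motif w; a minimal sketch using the example sequence from this slide:

```python
# Count (possibly overlapping) occurrences of a motif w in a sequence,
# i.e. the observed count N^obs(w) from the slides.

def count_occurrences(seq, motif):
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

seq = "TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA"
print(count_occurrences(seq, "CAGTAG"))  # -> 3
```

Overlaps matter: a self-overlapping word like AA in AAAA counts three times, which is why the statistics later in the deck need more than a simple binomial model.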
5. Some functional motifs
• Restriction sites: recognized by specific bacterial restriction enzymes ⇒ double-strand DNA break.
E.g. GAATTC recognized by EcoRI.
Very rare along bacterial genomes.
• Chi motif: recognized by an enzyme which processes along the DNA sequence and degrades it ⇒ the enzyme's degradation activity is stopped and DNA repair is stimulated by recombination.
E.g. GCTGGTGG recognized by RecBCD (E. coli).
Very frequent along the E. coli genome.
• parS: recognized by the Spo0J protein ⇒ organization of the B. subtilis genome into macro-domains.
E.g. TGTTAACACGTGAAACA (consensus; a few positions tolerate t/c variants).
Very frequent in the ORI domain, rare elsewhere.
• promoter: structured motif recognized by the RNA polymerase to initiate gene transcription.
E.g. TTGAC −(16–18 bp)− TATAAT (E. coli).
Particularly located in front of genes.
6. Prediction of functional motifs
Most of the functional motifs are unknown in the different species. For instance:
• what would be the Chi motif of S. aureus? [Halpern et al. (08)]
• is there an equivalent of parS in E. coli? [Mercier et al. (08)]
Statistical approach: identify candidate motifs based on their statistical properties.

The most over-represented 8-letter words under M1, E. coli (ℓ = 4.6 × 10^6):

word       obs   exp    score
gctggtgg   762    84.9  73.5
ggcgctgg   828   125.9  62.6
cgctggcg   870   150.8  58.6
gctggcgg   723   125.9  53.3
cgctggtg   619   101.7  51.3

The most over-represented families anbcdefg under M1, H. influenzae (ℓ = 1.8 × 10^6):

motif      obs   exp    score
gntggtgg   223    55.3  22.33
anttcatc   469   180.3  21.59
anatcgcc   288    87.8  21.38
tnatcgcc   279    84.5  21.18
gnagaaga   270    83.6  20.10
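The "exp" column comes from the M1 model. A hedged sketch of the standard order-1 plug-in estimator, E(N(w)) = Π_i N(w_i w_{i+1}) / Π_inner N(w_i), applied to a short made-up sequence rather than the E. coli genome:

```python
# Expected count of a word w under an order-1 Markov (M1) model,
# estimated from observed dinucleotide and nucleotide counts:
#   E(N(w)) = prod over i of N(w_i w_{i+1}) / prod over inner letters of N(w_i)

def count(seq, word):
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq[i:i + len(word)] == word)

def expected_count_m1(seq, w):
    num = 1.0
    for i in range(len(w) - 1):          # all dinucleotides of w
        num *= count(seq, w[i:i + 2])
    den = 1.0
    for i in range(1, len(w) - 1):       # inner single letters of w
        den *= count(seq, w[i])
    return num / den if den else 0.0

seq = "GCTGGTGGCGCTGGTGGAAGCTGGTGGT"  # toy sequence, not a real genome
w = "GCTGG"
print(count(seq, w), expected_count_m1(seq, w))
```

Comparing the observed count with this expectation (here 3 observed vs. roughly 1 expected) is what the scores in the table quantify, after normalizing by the model-based standard deviation.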
8. Statistical questions addressed by R’MES
Questions related to the significance of the number of occurrences of a motif w in sequences:
• Is N^obs(w) significantly high?
• Is N^obs(w) significantly higher than N^obs(w′)?
→ If w′ = w̄, the reverse complement of w: is w significantly skewed (strand bias)?
• Is N1^obs(w) significantly more unexpected than N2^obs(w)?
Several types of motifs w:
• fixed words (e.g. gctggtgg),
• degenerated patterns (e.g. gntggtgg),
• sets of words (e.g. {w, w̄}).
11. Is N_obs(w) significantly high?
• One needs to calculate the p-value P(N(w) ≥ N_obs(w)), where N(w) is the
count (a random variable) of w in random sequences (→ model).
• R’MES considers Markov chain models of order m (Mm), which fit the
sequence composition in oligos of lengths 1 up to (m + 1).
The phase in coding sequences can also be taken into account (Mm_3).
• R’MES approximates the p-value by using
• either a Gaussian approximation of N(w) (when E(N(w)) is large)
[Prum et al. (95)], [Schbath et al. (95)]
• or a compound Poisson approximation of N(w) (when E(N(w)) is small)
[Schbath (95)], [Roquain and Schbath (07)]
(see DNA, Words and Models, Robin, Rodolphe, Schbath, CUP 2005)
• R’MES produces scores of exceptionality (probit transformation):
high positive (resp. negative) scores correspond to exceptionally frequent
(resp. rare) motifs.
rmes -gauss -s seqfile -m m -l wordlength -o outputfile
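As a rough illustration of the idea (not R’MES itself), E(N(w)) under M1 can be estimated by plugging in the observed dinucleotide and inner-letter counts, and a crude standardized score then compares observed and expected counts. The exact Gaussian variance used by R’MES [Prum et al. (95)] is more involved; the sketch below assumes a plain (obs − exp)/√exp standardization.

```python
def count_overlapping(seq, w):
    """Count occurrences of word w in seq, overlaps allowed."""
    return sum(1 for i in range(len(seq) - len(w) + 1) if seq[i:i + len(w)] == w)

def m1_expected_count(seq, w):
    """Plug-in estimate of E(N(w)) under an order-1 Markov model (M1):
    product of dinucleotide counts over product of inner-letter counts."""
    num = 1.0
    for i in range(len(w) - 1):
        num *= count_overlapping(seq, w[i:i + 2])
    den = 1.0
    for i in range(1, len(w) - 1):
        den *= count_overlapping(seq, w[i])
    return num / den if den else 0.0

def naive_score(seq, w):
    """Crude standardized score (NOT the exact R'MES Gaussian score)."""
    obs = count_overlapping(seq, w)
    exp = m1_expected_count(seq, w)
    return (obs - exp) / exp ** 0.5 if exp > 0 else float("nan")
```

On a sequence whose dinucleotide composition fully explains w, the score is near 0; large positive values flag words the M1 composition cannot account for.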
14. Is N_obs(w) significantly higher than N_obs(w̄)?
• One needs to calculate the p-value
P( N(w)/N(w̄) ≥ N_obs(w)/N_obs(w̄) ),
where N(·) is the count (a random variable) in random sequences (→ model).
• R’MES considers Markov chain models of order m (Mm), which fit the
sequence composition in oligos of lengths 1 up to (m + 1).
The phase in coding sequences can also be taken into account (Mm_3).
• R’MES approximates the p-value by using
• the 2-dimensional Gaussian approximation of (N(w), N(w̄)) (when
E(N(w)) and E(N(w̄)) are large)
[Prum et al. (95)], [Schbath et al. (95)]
• R’MES produces scores of exceptional skew (probit transformation):
high positive (resp. negative) scores correspond to motifs significantly more
frequent (resp. rare) along the sequence than along the complementary one.
rmes -skew -seq seqfile -m m -l wordlength -o outputfile
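The raw skew itself is simple to compute directly; a minimal sketch, building the reverse complement by hand (the p-value R’MES attaches via the 2-d Gaussian approximation is not reproduced here):

```python
def count_overlapping(seq, w):
    """Count occurrences of w in seq, overlaps allowed."""
    return sum(1 for i in range(len(seq) - len(w) + 1) if seq[i:i + len(w)] == w)

def revcomp(w):
    """Reverse complement w̄ of a DNA word."""
    comp = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(comp[b] for b in reversed(w.lower()))

def skew(seq, w):
    """Strand skew N_obs(w) / N_obs(w̄) on a single strand."""
    fwd = count_overlapping(seq, w)
    rev = count_overlapping(seq, revcomp(w))
    return fwd / rev if rev else float("inf")
```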
17. Is N1_obs(w) significantly more exceptional than N2_obs(w)?
• One wants to compare the exceptionality of a motif w in two different
sequences (two observed counts, N1_obs(w) and N2_obs(w)).
• R’MES computes a test statistic and its associated p-value to test
H0: {w is equally exceptional in both sequences}
against
H1: {w is more exceptional in the first sequence}
[Robin et al. (08)]
• The test is performed by modeling the occurrence processes as Poisson
processes whose intensities take the sequence compositions in oligos of
lengths 1 up to (m + 1) into account.
• Option -seq2 soon available in R’MES.
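The exact statistic follows [Robin et al. (08)]; as a simplified stand-in under the Poisson picture above, one can condition on the total count: if N1 and N2 are independent Poisson with means e1 and e2, then given N1 + N2 = n, N1 is Binomial(n, e1/(e1+e2)). The sketch below uses that conditional binomial test; it is an illustration, not the R’MES implementation.

```python
from math import comb

def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def compare_counts(n1_obs, n2_obs, e1, e2):
    """Conditional test: a small p-value suggests w is more exceptional in
    sequence 1 than the expected counts e1, e2 would allow."""
    n = n1_obs + n2_obs
    return binom_sf(n1_obs, n, e1 / (e1 + e2))
```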
20. Chi motifs in bacterial genomes
• Motif involved in the repair of double-strand DNA breaks;
Chi needs to be frequent along bacterial genomes.
• Chi motifs have been identified for only a few bacterial species, and they are
not conserved across species.
• Known Chi motifs are 5 to 8 nucleotides long and can be degenerate.
• Moreover, Chi activity is strongly orientation-dependent (direction of DNA
replication): Chi is present preferentially on the leading strands (high skew).
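Put together, the two properties above (high frequency, high skew) suggest a naive screen: count all k-mers in one pass and rank by skew among sufficiently frequent words. This is a composition-blind sketch; R’MES adds the model-based exceptionality scores that make such a ranking meaningful.

```python
from collections import Counter

def revcomp(w):
    """Reverse complement of a DNA word."""
    comp = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(comp[b] for b in reversed(w))

def chi_candidates(seq, k, min_count=1):
    """Rank k-mers by strand skew N(w)/N(w̄), keeping words seen at least
    min_count times; returns (skew, count, word) tuples, best first."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    out = []
    for w, n in counts.items():
        nbar = counts.get(revcomp(w), 0)
        if n >= min_count and nbar > 0:
            out.append((n / nbar, n, w))
    return sorted(out, reverse=True)
```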
21. E. coli as a learning case
• 8-letter word GCTGGTGG:
• 762 occurrences on the leading strands (genome length = 4.6×10^6),
• among the most over-represented 8-letter words (whatever the model Mm)
⇒ its frequency cannot be explained by the genome composition.
• Its rank improves if one analyzes only the backbone genome (the part of the
genome conserved in several strains of the species).
• Its skew equals 3.20 (p-value of 3.3×10^−11).
The skew of a motif w is defined by N_obs(w)/N_obs(w̄), where w̄ is the reverse
complement of w.
22. Identification of Chi motif in S. aureus
Halpern et al. (07)
• Analysis of the S. aureus backbone (length = 2.44×10^6).
• 8-letter words: none of the most over-represented and skewed motifs were
frequent enough.
• 7-letter word candidates (with observed counts):
A = gaaaatg (1067), B = ggattag (266), C = gaagcgg (272), D = gaattag (614)
24. Organization of the Ter macrodomain in E. coli
The chromosome of E. coli is organized into 4 macrodomains [Valens et al. (04)].
How is such structure ensured?
Bacillus subtilis as a learning case:
• In B. subtilis, the parS motif is responsible for structuring the
chromosomal domain surrounding the origin of replication [Lin and
Grossman (98)].
• The parS motif is 16 nt long; its sequence is partially degenerate and rather
palindromic:
TGTTAACACGTGAAACA (several positions admit an alternative base).
• It is recognized by Spo0J in both directions.
• One of its 11-mers is the most exceptional 11-mer (w, w̄) in the origin domain.
26. Identification of matS in E. coli
GACACTGTCAC
TGACACTGTCA
GACAGTGTCAC
GACGTTGTCAC
GACAACGTCAC
TGACAACGTCA
Consensus: GTGACRNYGTCAC
matS is the 13-nt motif GTGACRNYGTCAC: it is recognized by the MatP protein,
which structures the Ter domain [Mercier et al. (08)].
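To locate occurrences of a degenerate consensus such as GTGACRNYGTCAC, the IUPAC ambiguity codes (R = a/g, Y = c/t, N = any base) can be expanded into a regular expression. A minimal sketch (the helper names are mine, not R’MES functions):

```python
import re

# Partial IUPAC table -- only the codes used here (hypothetical helper,
# not part of R'MES)
IUPAC = {"a": "a", "c": "c", "g": "g", "t": "t",
         "r": "[ag]", "y": "[ct]", "n": "[acgt]"}

def iupac_to_regex(motif):
    """Turn a degenerate DNA motif into a plain regex string."""
    return "".join(IUPAC[b] for b in motif.lower())

def find_sites(seq, motif):
    """Start positions of all (non-overlapping) matches of the motif."""
    return [m.start() for m in re.finditer(iupac_to_regex(motif), seq.lower())]
```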
27. Acknowledgments
Françoise Gélis (R’MES 1.0)
Annie Bouvier (R’MES 2.0)
Mark Hoebeke (R’MES 3.0)
http://genome.jouy.inra.fr/ssb/rmes/