Presented at Evolution 2013, June 24; describes an approach to teaching population genetics at the upper-undergraduate/beginning-graduate level, using simulations based in R and incorporating available large genomic data sets.
Bayesian modelling and computation for Raman spectroscopy (Matt Moores)
Raman spectroscopy can be used to identify molecules by the characteristic scattering of light from a laser. Each Raman-active dye label has a unique spectral signature, comprising the locations and amplitudes of its peaks. The Raman spectrum is discretised into a multivariate observation that is highly collinear, so it lends itself to a reduced-rank representation. We introduce a sequential Monte Carlo (SMC) algorithm to separate this signal into a series of peaks plus a smoothly varying baseline, corrupted by additive white noise. By incorporating this representation into a Bayesian functional regression, we can quantify the relationship between dye concentration and peak intensity. We also estimate the model evidence using SMC to investigate long-range dependence between peaks. These methods have been implemented as an R package using RcppEigen and OpenMP.
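As a rough illustration of the signal model described in the abstract (peaks plus smooth baseline plus white noise), the sketch below simulates a spectrum from Lorentzian peaks; the wavenumber grid, peak locations, amplitudes, and widths are invented for the example and are not from the talk.

```python
import numpy as np

def lorentzian(x, loc, amp, width):
    """Single Lorentzian peak, a common line shape for Raman bands."""
    return amp * width**2 / ((x - loc)**2 + width**2)

# Hypothetical wavenumber grid and peak parameters (illustration only)
wavenumbers = np.linspace(200, 1800, 1000)
peaks = [(520, 1.0, 8.0), (1001, 2.5, 6.0), (1600, 1.2, 10.0)]  # (location, amplitude, width)

signal = sum(lorentzian(wavenumbers, *p) for p in peaks)
baseline = 0.5 + 1e-4 * (wavenumbers - 200)          # smoothly varying baseline
rng = np.random.default_rng(0)
observed = signal + baseline + rng.normal(0, 0.05, wavenumbers.size)  # additive white noise
```

The inference task in the abstract is the inverse of this: recover the peak locations and amplitudes, and the baseline, from `observed`.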
Precomputation for SMC-ABC with undirected graphical models (Matt Moores)
This document presents a method for improving the scalability of approximate Bayesian computation (ABC) for latent graphical models like the hidden Potts model used in image analysis. It does this by pre-computing an auxiliary model that approximates the relationship between model parameters and summary statistics, avoiding the need to simulate pseudo-data during ABC model fitting. Experimental results on both simulated and satellite image data show the method reduces ABC runtime from weeks to hours while maintaining accuracy of parameter estimates.
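The offline/online split described above can be sketched in miniature. Here the expensive simulator is replaced by a cheap stand-in (noisy `tanh`), which is not the Potts model; the point is only the structure: fit a surrogate on a parameter grid offline, then run ABC against the surrogate with no pseudo-data simulation.

```python
import numpy as np

rng = np.random.default_rng(1)

def expensive_summary(theta, n=1000):
    """Stand-in simulator: summary statistic of pseudo-data at parameter theta."""
    return rng.normal(np.tanh(theta), 0.1, size=n).mean()

# Offline phase: evaluate the simulator on a parameter grid, fit a surrogate
grid = np.linspace(0.0, 3.0, 31)
summaries = np.array([expensive_summary(t) for t in grid])
surrogate = lambda theta: np.interp(theta, grid, summaries)

# Online ABC phase: compare against the surrogate, with no simulation per draw
observed_summary = np.tanh(1.2)
proposals = rng.uniform(0.0, 3.0, 50000)
accepted = proposals[np.abs(surrogate(proposals) - observed_summary) < 0.02]
```

All 50,000 online comparisons cost only an interpolation each; the simulator runs only 31 times, which is the source of the weeks-to-hours speed-up claimed.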
- The document summarizes Matthew Moores' PhD research on developing Bayesian computational methods for spatial analysis of medical and satellite images.
- The objectives are to develop a generative image model incorporating prior information, implement it computationally efficiently, and apply it to radiotherapy and remote sensing data.
- Challenges include intractable likelihoods, which are addressed through approximate Bayesian computation and sequential Monte Carlo with pre-computation.
- The research aims to classify pixels in medical and satellite images according to tissue type or land use by incorporating informative priors.
This document provides an overview of ABC methodology and applications. It begins with examples from population genetics and econometrics that are well-suited for ABC. It then describes the basic ABC algorithm for Bayesian inference using simulation: specifying prior distributions, simulating data under different parameter values, and accepting simulations that best match the observed data. Indirect inference is also discussed as a method for choosing informative summary statistics for ABC. The document traces the origins of ABC to population genetics models from the late 1990s and highlights ongoing contributions from that field to ABC methodology.
Bayesian Non-parametric Models for Data Science using PyMC (MLReview)
This document provides an overview of Bayesian non-parametric models using Gaussian processes in PyMC3. It discusses the motivation for using GPs, their properties, and how they can be built and fitted in PyMC3. Examples are provided on modeling salmon recruitment data, coal mining disasters, measles outbreak data, and a multidimensional spatial dataset. Scaling issues with GPs are also addressed through sparse approximations. The PyMC3 team that develops these methods is acknowledged.
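The GP regression machinery behind such models can be written in a few lines of linear algebra. This is a generic numpy sketch of the posterior mean and variance under a squared-exponential kernel, not the PyMC3 API, and the data here are synthetic, not the salmon or coal-mining examples.

```python
import numpy as np

def rbf(xa, xb, ell=1.0, eta=1.0):
    """Squared-exponential covariance, a common default for GP examples."""
    d = xa[:, None] - xb[None, :]
    return eta**2 * np.exp(-0.5 * (d / ell) ** 2)

# Tiny synthetic data set (illustrative only)
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(X)
Xs = np.array([1.5])     # test location
sigma = 0.1              # observation noise

K = rbf(X, X) + sigma**2 * np.eye(len(X))
Ks = rbf(Xs, X)
alpha = np.linalg.solve(K, y)
post_mean = Ks @ alpha                                    # GP posterior mean at Xs
post_var = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)    # GP posterior variance
```

The O(n^3) solve against `K` is exactly the scaling bottleneck that the sparse approximations mentioned above are designed to avoid.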
The document discusses Approximate Bayesian Computation (ABC), a simulation-based method for conducting Bayesian inference when the likelihood function is intractable or unavailable. ABC works by simulating data from the model, accepting simulations that are close to the observed data based on a distance measure and tolerance level. This provides samples from an approximation of the posterior distribution. The document provides examples that motivate ABC and outlines the basic ABC algorithm. It also discusses extensions and improvements to the standard ABC method.
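The basic rejection algorithm outlined above fits in a few lines. A minimal sketch for inferring a normal mean, with a uniform prior, the sample mean as summary statistic, and an arbitrary tolerance chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(2.0, 1.0, size=100)   # "observed" data with unknown mean
s_obs = data.mean()                     # summary statistic

# ABC rejection: draw from the prior, simulate pseudo-data, and keep draws
# whose pseudo-data summary falls within a tolerance of the observed summary
theta = rng.uniform(-5, 5, 100000)                               # prior draws
pseudo = rng.normal(theta[:, None], 1.0, (100000, 100)).mean(axis=1)
eps = 0.05                                                       # tolerance
posterior_sample = theta[np.abs(pseudo - s_obs) < eps]
```

Shrinking `eps` tightens the approximation to the true posterior at the cost of fewer accepted draws, which is the trade-off the extensions mentioned above aim to improve.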
This document discusses approximate Bayesian computation (ABC) techniques for performing Bayesian inference when the likelihood function is not available in closed form. It covers the basic ABC algorithm and discusses challenges with high-dimensional data. It also summarizes recent advances in ABC that incorporate nonparametric regression, reproducing kernel Hilbert spaces, and neural networks to help address these challenges.
Multiplicative Decompositions of Stochastic Distributions and Their Applicat... (Toshiyuki Shimono)
Toshiyuki Shimono presents theorems and propositions about multiplicative decompositions of stochastic distributions and their applications. Some key points:
- A theorem shows that for any positive v1 and v2, the odds that v1·x1 exceeds v2·x2 are v1 : v2, relating to the Bradley-Terry model of preferences.
- It is shown that the Student's t distribution with 2 degrees of freedom can be represented as a product of independent distributions in various ways using the F and t distributions.
- A theorem gives a multiplicative decomposition of the uniform distribution on [0,1] involving the uniform distribution on [δ,1] and a
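The Bradley-Terry identity in the first bullet can be checked by Monte Carlo. The summary does not state the distribution of x1 and x2; standard exponential variates are assumed here, for which P(v1·x1 > v2·x2) = v1/(v1+v2) holds exactly.

```python
import numpy as np

# Monte Carlo check of P(v1*x1 > v2*x2) = v1/(v1+v2), assuming x1, x2 ~ Exp(1)
rng = np.random.default_rng(7)
v1, v2 = 2.0, 1.0
x1 = rng.exponential(1.0, 500000)
x2 = rng.exponential(1.0, 500000)
estimate = np.mean(v1 * x1 > v2 * x2)   # should be near v1/(v1+v2) = 2/3
```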
The document summarizes Approximate Bayesian Computation (ABC). It discusses how ABC provides a way to approximate Bayesian inference when the likelihood function is intractable or too computationally expensive to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data according to a distance measure and tolerance level. Key points discussed include:
- ABC provides an approximation to the posterior distribution by sampling from simulations that fall within a tolerance of the observed data.
- Summary statistics are often used to reduce the dimension of the data and improve the signal-to-noise ratio when applying the tolerance criterion.
- Random forests can help select informative summary statistics and provide semi-automated ABC
Theory to consider an inaccurate testing and how to determine the prior proba... (Toshiyuki Shimono)
I presented a mathematical theory of an inaccurate medical testing method. The theory can account for both the case where testing resources are limited and the case where they are not. One implication is that a "negative proof" may not function well; another is that excessively high specificity and accuracy are required for a meaningful diagnosis unless the test is deployed carefully.
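The specificity requirement can be made concrete with Bayes' theorem. The numbers below are hypothetical, chosen to show that for a rare condition even a 99%-accurate test yields mostly false positives:

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_pos

# At 0.1% prevalence, 99% specificity leaves the posterior near 9%,
# while 99.99% specificity pushes it above 90%
low = posterior_positive(0.001, 0.99, 0.99)
high = posterior_positive(0.001, 0.99, 0.9999)
```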
The document describes a new method called component-wise approximate Bayesian computation (ABC) that combines ABC with Gibbs sampling. It aims to improve ABC's ability to efficiently explore parameter spaces when the number of parameters is large. The method works by alternating sampling from each parameter's ABC posterior conditional distribution given current values of other parameters and the observed data. The method is proven to converge to a stationary distribution under certain assumptions, especially for hierarchical models where conditional distributions are often simplified. Numerical experiments on toy examples demonstrate the method can provide a better approximation of the true posterior than vanilla ABC.
The document describes Approximate Bayesian Computation (ABC), a technique for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC works by simulating data under different parameter values, and accepting simulations that are close to the observed data according to a distance measure and tolerance level. ABC provides an approximation to the posterior distribution that improves as the tolerance level decreases and more informative summary statistics are used. The document discusses the ABC algorithm, properties of the exact ABC posterior distribution, and challenges in selecting appropriate summary statistics.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
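Point 2 above can be sketched with scikit-learn (assumed available): fit a random forest mapping candidate summaries to the parameter, then read off variable importances. The toy data make only the first summary informative; the actual ABC-RF methodology is more involved than this.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy ranking of candidate summary statistics: only the first one
# actually depends on the parameter
rng = np.random.default_rng(0)
theta = rng.uniform(0, 1, 2000)
informative = theta + rng.normal(0, 0.1, 2000)
noise1 = rng.normal(0, 1, 2000)
noise2 = rng.normal(0, 1, 2000)
summaries = np.column_stack([informative, noise1, noise2])

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(summaries, theta)
ranking = rf.feature_importances_   # the informative statistic should dominate
```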
This document discusses several perspectives and solutions to Bayesian hypothesis testing. It outlines issues with Bayesian testing such as the dependence on prior distributions and difficulties interpreting Bayesian measures like posterior probabilities and Bayes factors. It discusses how Bayesian testing compares models rather than identifying a single true model. Several solutions to challenges are discussed, like using Bayes factors which eliminate the dependence on prior model probabilities but introduce other issues. The document also discusses testing under specific models like comparing a point null hypothesis to alternatives. Overall it presents both Bayesian and frequentist views on hypothesis testing and some of the open controversies in the field.
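The prior dependence of Bayes factors mentioned above has a classic closed-form illustration: testing a point null on a normal mean. The prior scale `tau` on the alternative is a free choice, and inflating it drives the Bayes factor toward the null (the Jeffreys-Lindley effect); the specific numbers below are illustrative.

```python
import numpy as np

def bf01(xbar, n, sigma=1.0, tau=1.0):
    """Bayes factor for H0: mu = 0 vs H1: mu ~ N(0, tau^2), normal data."""
    def norm_pdf(x, var):
        return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)
    # Marginal of xbar is N(0, sigma^2/n) under H0, N(0, sigma^2/n + tau^2) under H1
    return norm_pdf(xbar, sigma**2 / n) / norm_pdf(xbar, sigma**2 / n + tau**2)

b_narrow = bf01(0.1, 100, tau=1.0)    # moderate support for H0
b_wide = bf01(0.1, 100, tau=10.0)     # same data, much stronger support for H0
```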
This document discusses various methods for estimating normalizing constants that arise when evaluating integrals numerically. It begins by noting there are many computational methods for approximating normalizing constants across different communities. It then lists the topics that will be covered in the upcoming workshop, including discussions on estimating constants using Monte Carlo methods and Bayesian versus frequentist approaches. The document provides examples of estimating normalizing constants using Monte Carlo integration, reverse logistic regression, and Xiao-Li Meng's maximum likelihood estimation approach. It concludes by discussing some of the challenges in bringing a statistical framework to constant estimation problems.
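A minimal instance of the Monte Carlo approach discussed above: estimate the normalizing constant of an unnormalized density by importance sampling. The target exp(-x^4) and the standard normal proposal are chosen for illustration; the true constant is Gamma(1/4)/2 ≈ 1.813.

```python
import numpy as np

# Importance-sampling estimate of Z = integral of exp(-x^4) dx
rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1000000)                       # draws from the proposal q
unnorm = np.exp(-x**4)                              # unnormalized target density
proposal = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi) # proposal density q(x)
Z_hat = np.mean(unnorm / proposal)                  # E_q[p_unnorm / q] = Z
```

Reverse logistic regression and Meng's bridge-type estimators, mentioned above, can be seen as statistically principled ways of combining such ratio estimates.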
ABC stands for approximate Bayesian computation. It is a method for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC produces samples from an approximate posterior distribution by simulating parameter and summary statistic values that match the observed summary statistics within a tolerance level. The choice of summary statistics is important but difficult, as there is typically no sufficient statistic. Several strategies have been developed for selecting good summary statistics, including using random forests or the Lasso to evaluate and select from a large set of potential summaries.
Multiple estimators for Monte Carlo approximations (Christian Robert)
This document discusses multiple estimators that can be used to approximate integrals using Monte Carlo simulations. It begins by introducing concepts like multiple importance sampling, Rao-Blackwellisation, and delayed acceptance that allow combining multiple estimators to improve accuracy. It then discusses approaches like mixtures as proposals, global adaptation, and nonparametric maximum likelihood estimation (NPMLE) that frame Monte Carlo estimation as a statistical estimation problem. The document notes various advantages of the statistical formulation, like the ability to directly estimate simulation error from the Fisher information. Overall, the document presents an overview of different techniques for combining Monte Carlo simulations to obtain more accurate integral approximations.
Cubic convolution interpolation is a new technique for resampling discrete data that has several desirable features for image processing. It can be performed efficiently on a digital computer. The cubic convolution interpolation function converges uniformly to the function being interpolated as the sampling increment approaches zero, achieving third-order accuracy. The paper derives the one-dimensional cubic convolution interpolation function and shows how it can be extended separably to two dimensions for interpolating image data.
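The one-dimensional kernel derived in the paper can be sketched directly. This uses the standard piecewise-cubic form with a = -1/2, the parameter value that yields the third-order accuracy stated above; the boundary handling here (sample replication) is a simplification.

```python
import numpy as np

def keys_kernel(s, a=-0.5):
    """Cubic convolution kernel; a = -1/2 gives the third-order-accurate variant."""
    s = np.abs(s)
    out = np.zeros_like(s)
    m1 = s <= 1
    m2 = (s > 1) & (s < 2)
    out[m1] = (a + 2) * s[m1]**3 - (a + 3) * s[m1]**2 + 1
    out[m2] = a * (s[m2]**3 - 5 * s[m2]**2 + 8 * s[m2] - 4)
    return out

def cubic_interp(samples, x):
    """Interpolate equally spaced samples (positions 0..n-1, spacing 1) at x."""
    n = len(samples)
    k = int(np.floor(x))
    offsets = np.arange(k - 1, k + 3)                # 4 nearest sample positions
    idx = np.clip(offsets, 0, n - 1)                 # replicate samples at the edges
    return float(np.sum(samples[idx] * keys_kernel(x - offsets)))
```

With a = -1/2 the interpolant reproduces quadratics exactly, which is one way of seeing the third-order convergence claim.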
This document describes a new method called component-wise approximate Bayesian computation (ABCG or ABC-Gibbs) that combines approximate Bayesian computation (ABC) with Gibbs sampling. ABCG aims to more efficiently explore parameter spaces when the number of parameters is large. It works by alternately sampling each parameter from its ABC-approximated conditional distribution given current values of other parameters. The document provides theoretical analysis showing ABCG converges to a stationary distribution under certain conditions. It also presents examples demonstrating ABCG can better separate estimates from the prior compared to simple ABC, especially for hierarchical models.
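The alternating structure described above can be sketched on a toy two-parameter normal model. Each sweep updates one parameter by rejection ABC with the other held fixed; the priors, tolerances, and summaries here are invented for the sketch and this is only the spirit of ABC-Gibbs, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(1.0, 2.0, 200)          # observed data: mean 1, sd 2
s_mean, s_sd = data.mean(), data.std()

def abc_update(simulate, s_obs, prior_draws, eps):
    """One component update: first prior draw whose summary is within eps."""
    for theta in prior_draws:
        if abs(simulate(theta) - s_obs) < eps:
            return theta
    return prior_draws[-1]   # fall back if nothing is accepted (very unlikely)

mu, sigma = 0.0, 1.0
chain = []
for _ in range(200):
    # Update mu given the current sigma, then sigma given the new mu
    mu = abc_update(lambda m: rng.normal(m, sigma, 200).mean(), s_mean,
                    rng.uniform(-5, 5, 500), eps=0.3)
    sigma = abc_update(lambda s: rng.normal(mu, s, 200).std(), s_sd,
                       rng.uniform(0.1, 5.0, 500), eps=0.3)
    chain.append((mu, sigma))
chain = np.array(chain)
```

Because each conditional update involves a one-dimensional comparison, the tolerance can be kept small even though the joint parameter space is larger, which is the efficiency argument made above.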
The document discusses using random forests for approximate Bayesian computation (ABC) model choice. It proposes:
1. Using random forests to infer a model from summary statistics, as random forests can handle a large number of statistics and find efficient combinations.
2. Replacing estimates of posterior model probabilities, which are poorly approximated, with posterior predictive expected losses to evaluate models.
3. An example comparing MA(1) and MA(2) time series models using two autocorrelations as summaries, noting that the models are nested and that random forests perform similarly to other methods on small problems.
ESL 4.4.3-4.5: Logistic Regression (contd.) and Separating Hyperplane (Shinichi Tamura)
Presentation material for a reading club on The Elements of Statistical Learning by Hastie et al.
The sections cover:
- Properties of logistic regression compared with least squares fitting
- Differences between logistic regression and linear discriminant analysis
- Rosenblatt's perceptron algorithm
- Derivation of the optimal separating hyperplane, which forms the basis for SVM
-------------------------------------------------------------------------
Presentation slides (entirely in English) for our lab's reading group on The Elements of Statistical Learning by Hastie et al.
My assigned sections were:
- Properties of logistic regression viewed by analogy with least squares
- Comparison of logistic regression and linear discriminant analysis
- Rosenblatt's perceptron algorithm
- Derivation of the optimal separating hyperplane, the basis for SVM
"reflections on the probability space induced by moment conditions with impli... (Christian Robert)
This document discusses using moment conditions to perform Bayesian inference when the likelihood function is intractable or unknown. It outlines some approaches that have been proposed, including approximating the likelihood using empirical likelihood or pseudo-likelihoods. However, these approaches do not guarantee the same consistency as a true likelihood. Alternative approximate Bayesian methods are also discussed, such as Approximate Bayesian Computation, Integrated Nested Laplace Approximation, and variational Bayes. The empirical likelihood method constructs a likelihood from generalized moment conditions, but its use in Bayesian inference requires further analysis of consistency in each application.
This document discusses approximate Bayesian computation (ABC), a technique for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that match the observed data closely according to some distance measure and tolerance level. The document outlines the basic ABC algorithm and discusses some advances in ABC, including modifying the proposal distribution to increase efficiency, viewing it as a conditional density estimation problem to allow for larger tolerances, and including the tolerance level in the inferential framework. It also provides examples of applying ABC to problems like inferring the number of socks in a drawer from an observation and simulating the outcome of a historical naval battle.
This document provides an introduction to Approximate Bayesian Computation (ABC), a likelihood-free method for approximating posterior distributions when the likelihood function is unavailable or computationally intractable. It describes the ABC rejection sampling algorithm and key concepts like tolerance levels, distance functions, summary statistics, and improvements like ABC-MCMC and ABC-SMC. ABC is presented as an alternative to traditional Bayesian inference methods for models where direct likelihood evaluation is impossible or too expensive.
This document discusses approximate Bayesian computation (ABC) for model choice between multiple models. It introduces the ABC algorithm for model choice, which approximates the posterior probabilities of models given the data by simulating parameters from the prior and accepting simulations based on the distance between simulated and observed sufficient statistics. Issues with choosing sufficient statistics that apply to all models are discussed. The document also examines the limiting behavior of the ABC approximation to the Bayes factor as the tolerance approaches 0 and infinity. It notes that discrepancies can arise if sufficient statistics are not cross-model sufficient. An example comparing Poisson and geometric models demonstrates this.
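The Poisson-versus-geometric comparison mentioned above can be sketched with the mean as the only summary. Priors and tolerance here are invented for the sketch; note that both models can match the sample mean, which is exactly the insufficiency problem the document highlights.

```python
import numpy as np

# ABC model choice between Poisson(lam) and Geometric(p), mean as summary
rng = np.random.default_rng(11)
obs = rng.poisson(3.0, 100)
s_obs = obs.mean()

eps, accepted = 0.1, []
for _ in range(20000):
    m = rng.integers(2)                        # uniform prior over the two models
    if m == 0:
        lam = rng.exponential(5.0)             # illustrative prior on lam
        s = rng.poisson(lam, 100).mean()
    else:
        p = rng.uniform(0.01, 1.0)             # illustrative prior on p
        s = rng.geometric(p, 100).mean() - 1   # numpy geometric has support 1, 2, ...
    if abs(s - s_obs) < eps:
        accepted.append(m)
post_prob_poisson = 1 - np.mean(accepted)      # fraction of accepted draws with m = 0
```

Even with Poisson-generated data, the approximated posterior does not decisively favor the Poisson model, because the mean is not cross-model sufficient.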
Delayed acceptance for Metropolis-Hastings algorithms (Christian Robert)
The document proposes a delayed acceptance method for accelerating Metropolis-Hastings algorithms. It begins with a motivating example of non-informative inference for mixture models where computing the prior density is costly. It then introduces the delayed acceptance approach which splits the acceptance probability into pieces that are evaluated sequentially, avoiding computing the full acceptance ratio each time. It validates that the delayed acceptance chain is reversible and provides bounds on its spectral gap and asymptotic variance compared to the original chain. Finally, it discusses optimizing the delayed acceptance approach by considering the expected square jump distance and cost per iteration to maximize efficiency.
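The sequential splitting of the acceptance probability can be sketched as follows. The two log-density factors here are toy stand-ins (the "expensive" one is cheap in reality); the structure shows how the costly factor is only evaluated on proposals that survive the first stage.

```python
import numpy as np

rng = np.random.default_rng(13)

def cheap_logpost(x):        # e.g. the likelihood part, fast to compute
    return -0.5 * x**2

def expensive_logfactor(x):  # stands in for a costly factor such as the prior density
    return -0.1 * x**4

x, chain, expensive_calls = 0.0, [], 0
for _ in range(5000):
    y = x + rng.normal(0, 1)
    # Stage 1: accept/reject on the cheap factor only
    if np.log(rng.uniform()) < cheap_logpost(y) - cheap_logpost(x):
        # Stage 2: correct with the expensive factor, evaluated only on survivors
        expensive_calls += 1
        if np.log(rng.uniform()) < expensive_logfactor(y) - expensive_logfactor(x):
            x = y
    chain.append(x)
```

The product of the two stage-wise acceptance probabilities defines a valid (reversible) chain targeting the full posterior, which is the reversibility result validated in the document.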
Exploring temporal graph data with Python: a study on tensor decomposition o... (André Panisson)
Tensor decompositions have gained a steadily increasing popularity in data mining applications. Data sources from sensor networks and Internet-of-Things applications promise a wealth of interaction data that can be naturally represented as multidimensional structures such as tensors. For example, time-varying social networks collected from wearable proximity sensors can be represented as 3-way tensors. By representing this data as tensors, we can use tensor decomposition to extract community structures with their structural and temporal signatures.
The current standard framework for working with tensors, however, is Matlab. We will show how tensor decompositions can be carried out using Python, how to obtain latent components and how they can be interpreted, and what are some applications of this technique in the academy and industry. We will see a use case where a Python implementation of tensor decomposition is applied to a dataset that describes social interactions of people, collected using the SocioPatterns platform. This platform was deployed in different settings such as conferences, schools and hospitals, in order to support mathematical modelling and simulation of airborne infectious diseases. Tensor decomposition has been used in these scenarios to solve different types of problems: it can be used for data cleaning, where time-varying graph anomalies can be identified and removed from data; it can also be used to assess the impact of latent components in the spreading of a disease, and to devise intervention strategies that are able to reduce the number of infection cases in a school or hospital. These are just a few examples that show the potential of this technique in data mining and machine learning applications.
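A minimal pure-numpy version of the decomposition discussed above: rank-R CP/PARAFAC by alternating least squares for a 3-way tensor. This is a generic sketch, not the SocioPatterns implementation, and it omits the non-negativity constraints often used for interpretability.

```python
import numpy as np

def khatri_rao(X, Y):
    """Column-wise Khatri-Rao product (rows of X vary slowly)."""
    return (X[:, None, :] * Y[None, :, :]).reshape(-1, X.shape[1])

def unfold(T, mode):
    """Mode-n matricization of a 3-way tensor (C-order columns)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_als(T, rank, n_iter=300, seed=0):
    """Rank-R CP/PARAFAC decomposition by alternating least squares."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, rank)) for s in T.shape)
    for _ in range(n_iter):
        # Each factor is the least-squares solution with the other two fixed
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C
```

For a time-varying contact network stored as a (node, node, time) tensor, the columns of the three factors would correspond to the community structures and their temporal activity profiles described above.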
This document discusses using moment conditions to perform Bayesian inference when the likelihood function is intractable or unknown. It outlines some approaches that have been proposed, including approximating the likelihood using empirical likelihood or pseudo-likelihoods. However, these approaches do not guarantee the same consistency as a true likelihood. Alternative approximative Bayesian methods are also discussed, such as Approximate Bayesian Computation, Integrated Nested Laplace Approximation, and variational Bayes. The empirical likelihood method constructs a likelihood from generalized moment conditions, but its use in Bayesian inference requires further analysis of consistency in each application.
This document discusses approximate Bayesian computation (ABC), a technique for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that match the observed data closely according to some distance measure and tolerance level. The document outlines the basic ABC algorithm and discusses some advances in ABC, including modifying the proposal distribution to increase efficiency, viewing it as a conditional density estimation problem to allow for larger tolerances, and including the tolerance level in the inferential framework. It also provides examples of applying ABC to problems like inferring the number of socks in a drawer from an observation and simulating the outcome of a historical naval battle.
This document provides an introduction to Approximate Bayesian Computation (ABC), a likelihood-free method for approximating posterior distributions when the likelihood function is unavailable or computationally intractable. It describes the ABC rejection sampling algorithm and key concepts like tolerance levels, distance functions, summary statistics, and improvements like ABC-MCMC and ABC-SMC. ABC is presented as an alternative to traditional Bayesian inference methods for models where direct likelihood evaluation is impossible or too expensive.
This document discusses approximate Bayesian computation (ABC) for model choice between multiple models. It introduces the ABC algorithm for model choice, which approximates the posterior probabilities of models given the data by simulating parameters from the prior and accepting simulations based on the distance between simulated and observed sufficient statistics. Issues with choosing sufficient statistics that apply to all models are discussed. The document also examines the limiting behavior of the ABC approximation to the Bayes factor as the tolerance approaches 0 and infinity. It notes that discrepancies can arise if sufficient statistics are not cross-model sufficient. An example comparing Poisson and geometric models demonstrates this.
Delayed acceptance for Metropolis-Hastings algorithmsChristian Robert
The document proposes a delayed acceptance method for accelerating Metropolis-Hastings algorithms. It begins with a motivating example of non-informative inference for mixture models where computing the prior density is costly. It then introduces the delayed acceptance approach which splits the acceptance probability into pieces that are evaluated sequentially, avoiding computing the full acceptance ratio each time. It validates that the delayed acceptance chain is reversible and provides bounds on its spectral gap and asymptotic variance compared to the original chain. Finally, it discusses optimizing the delayed acceptance approach by considering the expected square jump distance and cost per iteration to maximize efficiency.
Exploring temporal graph data with Python: a study on tensor decomposition o...André Panisson
Tensor decompositions have gained a steadily increasing popularity in data mining applications. Data sources from sensor networks and Internet-of-Things applications promise a wealth of interaction data that can be naturally represented as multidimensional structures such as tensors. For example, time-varying social networks collected from wearable proximity sensors can be represented as 3-way tensors. By representing this data as tensors, we can use tensor decomposition to extract community structures with their structural and temporal signatures.
The current standard framework for working with tensors, however, is Matlab. We will show how tensor decompositions can be carried out using Python, how to obtain latent components and how they can be interpreted, and what are some applications of this technique in the academy and industry. We will see a use case where a Python implementation of tensor decomposition is applied to a dataset that describes social interactions of people, collected using the SocioPatterns platform. This platform was deployed in different settings such as conferences, schools and hospitals, in order to support mathematical modelling and simulation of airborne infectious diseases. Tensor decomposition has been used in these scenarios to solve different types of problems: it can be used for data cleaning, where time-varying graph anomalies can be identified and removed from data; it can also be used to assess the impact of latent components in the spreading of a disease, and to devise intervention strategies that are able to reduce the number of infection cases in a school or hospital. These are just a few examples that show the potential of this technique in data mining and machine learning applications.
The document introduces two approaches to chemical prediction: quantum simulation based on density functional theory and machine learning based on data. It then discusses using graph-structured neural networks for chemical prediction on datasets like QM9. It presents Neural Fingerprint (NFP) and Gated Graph Neural Network (GGNN) models for predicting molecular properties from graph-structured data. Chainer Chemistry is introduced as a library for chemical and biological machine learning that implements these graph convolutional networks.
This document summarizes Frances Kuo's work applying Quasi-Monte Carlo (QMC) methods to solve partial differential equations (PDEs) with random coefficients. It introduces a motivating example of modeling groundwater flow with uncertainty in porous medium properties. It then provides an overview of QMC methods, including advantages over Monte Carlo, construction techniques like lattice rules, and three theoretical settings for applying QMC to PDEs with random coefficients. The document outlines Kuo's collaborations applying QMC to problems with uniform and lognormal random coefficients under these different settings.
In the presence of relevant physical observations, one can usually calibrate a computer model, and even estimate systematic discrepancies of the model from reality. Estimating and quantifying the uncertainty in this model discrepancy can lead to reliable predictions - so long as the prediction "is similar to" the available physical observations. Exactly how to define "similar" has proven difficult in many applications. Clearly it depends on how well the computational model captures the relevant physics in the system, as well as how portable the model discrepancy is in going from the available physical data to the prediction. This talk will discuss these concepts using computational models ranging from simple to very complex.
Relaxation methods for the matrix exponential on large networksDavid Gleich
My talk from the Stanford ICME seminar series on doing network analysis and link prediction using the a fast algorithm for the matrix exponential on graph problems.
The document discusses higher dimensional reconstruction techniques for medical imaging. It begins with a review of 3D reconstruction and the mathematical modeling using the Radon transform. It then describes 4D, 5D and 6D reconstruction which add additional dimensions of time, cardiac phase and respiration to the 3D spatial reconstruction. Higher dimensional reconstruction allows for dynamic imaging but introduces additional challenges from increased noise, bias and variance. The document explores techniques like iterative reconstruction to address these issues and improve accuracy. Real patient data is used to demonstrate the reconstruction methods.
Revisiting the fundamental concepts and assumptions of statistics ppsD Dutta Roy
Dr. Debdulal Dutta Roy gave a pre-conference workshop on revisiting fundamental statistics concepts and R-codes. He discussed key statistics assumptions like data being free-floating and associated, and having manifest and latent content. Dr. Roy also covered various R functions for data cleaning, classification, and prediction including histograms, density plots, boxplots, and ANOVA. The workshop aimed to refresh understanding of core statistics principles and their application using R coding.
This document discusses tensor decomposition with Python. It begins by explaining what tensor decomposition and factorization are, and how they can be used to represent multi-dimensional datasets and perform dimensionality reduction. It then discusses matrix and tensor factorization methods like NMF, topic modeling, and CP/PARAFAC decomposition. The remainder of the document provides examples of tensor decomposition using Python tools and libraries, and discusses applications to analyzing temporal network and sensor data.
This document proposes an improved particle swarm optimization (PSO) algorithm for data clustering that incorporates Gauss chaotic map. PSO is often prone to premature convergence, so the proposed method uses Gauss chaotic map to generate random sequences that substitute the random parameters in PSO, providing more exploration of the search space. The algorithm is tested on six real-world datasets and shown to outperform K-means, standard PSO, and other hybrid clustering algorithms. The key aspects of the proposed GaussPSO method and experimental results demonstrating its effectiveness are described.
This document discusses dynamics of structures with uncertainties. It begins with an introduction to stochastic single degree of freedom systems and how natural frequency variability can be modeled using probability distributions. It then discusses how to extend this approach to stochastic multi degree of freedom systems using stochastic finite element formulations and modal projections. Key challenges with statistical overlap of eigenvalues are noted. The document provides mathematical models of equivalent damping in stochastic systems and examples of stochastic frequency response functions.
We consider the problem of model estimation in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are interested in estimating the latent state decoding function (the mapping from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP.
We apply our results to the problem of learning near-optimal policies in the reward-free setting. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible asymptotic rate. Our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of contexts.
Efficient Simulations for Contamination of Groundwater Aquifers under Uncerta...Alexander Litvinenko
1. Solved time-dependent density driven flow problem with uncertain porosity and permeability in 2D and 3D
2. Computed propagation of uncertainties in porosity into the mass fraction.
3. Computed the mean, variance, exceedance probabilities, quantiles, risks.
4. Such QoIs as the number of fingers, their size, shape, propagation time can be unstable
5. For moderate perturbations, our gPCE surrogate results are similar to qMC results.
6. Used highly scalable solver on up to 800 computing nodes,
Complex models in ecology: challenges and solutionsPeter Solymos
This document discusses complex models in ecology and solutions for Bayesian analysis of complex hierarchical models. It introduces data cloning as a method that allows using Bayesian Markov chain Monte Carlo tools for frequentist inference on complex models. Data cloning replicates the data to increase the effective sample size, improving mixing and reducing the need for long runs. The document also discusses using high-performance computing to parallelize MCMC for faster inference on complex models through techniques like distributing chains across nodes.
This document discusses algorithms for predictive modeling, including logistic regression. It presents a medical dataset containing measurements of heart patients and whether they survived. Logistic regression is applied to predict survival using maximum likelihood estimation. Numerical optimization techniques like BFGS and Fisher's algorithm are discussed for maximum likelihood estimation of logistic regression. Iteratively reweighted least squares is also presented as an alternative approach.
The partitioning of an ordered prognostic factor is important in order to obtain several groups having heterogeneous survivals in medical research. For this purpose, a binary split has often been used once or recursively. We propose the use of a multi-way split in order to afford an optimal set of cut-off points. In practice, the number of groups ($K$) may not be specified in advance. Thus, we also suggest finding an optimal $K$ by a resampling technique. The algorithm was implemented into an \proglang{R} package that we called \pkg{kaps}, which can be used conveniently and freely. It was illustrated with a toy dataset, and was also applied to a real data set of colorectal cancer cases from the Surveillance Epidemiology and End Results.
The document discusses triangular norm (t-norm) based kernel functions and their application to kernel k-means clustering. It introduces common kernel functions and describes how t-norms can be used to create new kernel functions. Several parameterized and non-parameterized t-norm based kernel functions are presented. The document then details experiments applying various kernel functions including t-norm kernels to four datasets, evaluating the results using adjusted rand index scores. The best performing kernels for each dataset are identified, with some t-norm kernels performing comparably or better than traditional kernels.
This document provides an introduction to statistics and probability. It discusses descriptive statistics such as measures of central tendency and dispersion. It also discusses inferential statistics and concepts of probability such as random variables and probability distributions including binomial, Poisson, normal and exponential distributions. Examples are provided to illustrate calculating probabilities using these distributions for traffic-related scenarios such as route choice probabilities. Graphical representations of data like histograms and scatter plots are also demonstrated.
Compressed learning for time series classification學翰 施
This document proposes a compressed learning framework for time series classification using sparse envelope representations. It introduces compressed sensing concepts and describes creating a sparse envelope for time series by thresholding around the mean and standard deviation. A classification framework is developed using linear SVMs in the compressed domain. Experimental results on benchmark datasets demonstrate effectiveness of the envelope representations compared to state-of-the-art methods, as well as efficiency gains from compression. Real-world case studies on smart home applications show promising identification performance from envelope-based classifiers on sensor time series data.
Similar to Teaching Population Genetics with R (20)
How to Fix the Import Error in the Odoo 17Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...PECB
Denis is a dynamic and results-driven Chief Information Officer (CIO) with a distinguished career spanning information systems analysis and technical project management. With a proven track record of spearheading the design and delivery of cutting-edge Information Management solutions, he has consistently elevated business operations, streamlined reporting functions, and maximized process efficiency.
Certified as an ISO/IEC 27001: Information Security Management Systems (ISMS) Lead Implementer, Data Protection Officer, and Cyber Risks Analyst, Denis brings a heightened focus on data security, privacy, and cyber resilience to every endeavor.
His expertise extends across a diverse spectrum of reporting, database, and web development applications, underpinned by an exceptional grasp of data storage and virtualization technologies. His proficiency in application testing, database administration, and data cleansing ensures seamless execution of complex projects.
What sets Denis apart is his comprehensive understanding of Business and Systems Analysis technologies, honed through involvement in all phases of the Software Development Lifecycle (SDLC). From meticulous requirements gathering to precise analysis, innovative design, rigorous development, thorough testing, and successful implementation, he has consistently delivered exceptional results.
Throughout his career, he has taken on multifaceted roles, from leading technical project management teams to owning solutions that drive operational excellence. His conscientious and proactive approach is unwavering, whether he is working independently or collaboratively within a team. His ability to connect with colleagues on a personal level underscores his commitment to fostering a harmonious and productive workplace environment.
Date: May 29, 2024
Tags: Information Security, ISO/IEC 27001, ISO/IEC 42001, Artificial Intelligence, GDPR
-------------------------------------------------------------------------------
Find out more about ISO training and certification services
Training: ISO/IEC 27001 Information Security Management System - EN | PECB
ISO/IEC 42001 Artificial Intelligence Management System - EN | PECB
General Data Protection Regulation (GDPR) - Training Courses - EN | PECB
Webinars: https://pecb.com/webinars
Article: https://pecb.com/article
-------------------------------------------------------------------------------
For more information about PECB:
Website: https://pecb.com/
LinkedIn: https://www.linkedin.com/company/pecb/
Facebook: https://www.facebook.com/PECBInternational/
Slideshare: http://www.slideshare.net/PECBCERTIFICATION
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
বাংলাদেশের অর্থনৈতিক সমীক্ষা ২০২৪ [Bangladesh Economic Review 2024 Bangla.pdf] কম্পিউটার , ট্যাব ও স্মার্ট ফোন ভার্সন সহ সম্পূর্ণ বাংলা ই-বুক বা pdf বই " সুচিপত্র ...বুকমার্ক মেনু 🔖 ও হাইপার লিংক মেনু 📝👆 যুক্ত ..
আমাদের সবার জন্য খুব খুব গুরুত্বপূর্ণ একটি বই ..বিসিএস, ব্যাংক, ইউনিভার্সিটি ভর্তি ও যে কোন প্রতিযোগিতা মূলক পরীক্ষার জন্য এর খুব ইম্পরট্যান্ট একটি বিষয় ...তাছাড়া বাংলাদেশের সাম্প্রতিক যে কোন ডাটা বা তথ্য এই বইতে পাবেন ...
তাই একজন নাগরিক হিসাবে এই তথ্য গুলো আপনার জানা প্রয়োজন ...।
বিসিএস ও ব্যাংক এর লিখিত পরীক্ষা ...+এছাড়া মাধ্যমিক ও উচ্চমাধ্যমিকের স্টুডেন্টদের জন্য অনেক কাজে আসবে ...
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Teaching Population Genetics with R
1. A Simulation-Based Approach to
Teaching Population Genetics:
R as a Teaching Platform
Bruce J. Cochrane
Department of Zoology/Biology
Miami University
Oxford OH
2. Two Time Points
• 1974
o Lots of Theory
o Not much Data
o Allozymes Rule
• 2013
o Even More Theory
o Lots of Data
o Sequences, -omics, ???
3. The Problem
• The basic approach hasn’t changed, e.g.
o Hardy Weinberg
o Mutation
o Selection
o Drift
o Etc.
• Much of it is deterministic
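As one concrete instance of what "deterministic" means here (an illustrative sketch with made-up fitness values, not code from the course), the classic one-locus viability-selection recursion can be iterated in a few lines of R:

```r
# One generation of allele-frequency change under viability selection.
# Standard deterministic recursion; the fitnesses below are illustrative.
delta_p <- function(p, w11, w12, w22) {
  q <- 1 - p
  wbar <- p^2 * w11 + 2 * p * q * w12 + q^2 * w22   # mean fitness
  p_next <- (p^2 * w11 + p * q * w12) / wbar        # p after selection
  p_next - p
}

# Iterate 100 generations from p = 0.01 with selection favouring allele A
p <- 0.01
for (gen in 1:100) p <- p + delta_p(p, w11 = 1.0, w12 = 0.9, w22 = 0.8)
round(p, 3)
```

Given the starting frequency and fitnesses, the trajectory is completely determined — which is exactly the property the numerical approach below replaces with random variables.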
4. And
• There is little initial connection with real data
o The world seems to revolve around A and a
• At least in my hands, it doesn’t work
5. The Alternative
• Take a numerical (as opposed to analytical) approach
• Focus on understanding random variables and distributions
• Incorporate “big data”
• Introduce current approaches – coalescence, Bayesian
Analysis, etc. – in this context
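As an illustration of that numerical focus (my sketch, not taken from the slides): genetic drift is nothing more than repeated binomial sampling of gametes, which makes it a natural first simulation for students.

```r
# Genetic drift as repeated binomial sampling: 50 replicate populations
# of size N, followed for 100 generations from p = 0.5
set.seed(1)
N <- 50; gens <- 100; reps <- 50
p <- matrix(0.5, nrow = gens, ncol = reps)
for (t in 2:gens) {
  p[t, ] <- rbinom(reps, 2 * N, p[t - 1, ]) / (2 * N)  # sample 2N gametes
}
matplot(p, type = "l", lty = 1, col = rgb(0, 0, 0, 0.3),
        xlab = "Generation", ylab = "Allele frequency p")
```

Each replicate wanders (and some fix or are lost), so the class sees the distribution of outcomes rather than a single deterministic curve.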
6. Why R?
• Open Source
• Platform-independent (Windows, Mac, Linux)
• Object oriented
• Facile Graphics
• Web-oriented
• Packages available for specialized functions
7. Where We are Going
• The Basics – Distributions, chi-square and the Hardy Weinberg
Equilibrium
• Simulating the Ewens-Watterson Distribution
• Coalescence and summary statistics
• What works and what doesn’t
13. Calculating chi-squared
The function
chixw <- function(obs, exp, df = 1) {
  chi <- sum((obs - exp)^2 / exp)
  pr <- 1 - pchisq(chi, df)
  c(chi, pr)
}
A sample function call
obs <-c(315,108,101,32)
z <-sum(obs)/16
exp <-c(9*z,3*z,3*z,z)
chixw(obs,exp,3)
The output
chi-square = 0.47
probability(<.05) = 0.93
deg. freedom = 3
14. Basic Hardy Weinberg Calculations
The Biallelic Case
Sample input
obs <-c(13,35,70)
hw(obs)
Output
[1] "p= 0.2585 q= 0.7415"
obs exp
[1,] 13 8
[2,] 35 45
[3,] 70 65
[1] "chi squared = 5.732 p = 0.017 with 1 d. f."
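The `hw()` call above is a course-supplied function not shown on the slide. A minimal version might look like the sketch below; note that it gives a slightly larger chi-squared (≈6.04) than the slide's 5.732, because the slide evidently rounds the expected counts to integers before computing the statistic.

```r
# A minimal sketch of a Hardy-Weinberg test function like hw() above.
# obs = genotype counts c(AA, Aa, aa)
hw_sketch <- function(obs) {
  n <- sum(obs)
  p <- (2 * obs[1] + obs[2]) / (2 * n)   # frequency of allele A
  q <- 1 - p
  exp <- n * c(p^2, 2 * p * q, q^2)      # expected genotype counts
  chi <- sum((obs - exp)^2 / exp)
  pr <- 1 - pchisq(chi, df = 1)          # 3 classes - 1 - 1 estimated allele freq
  list(p = p, q = q, expected = round(exp), chi.sq = chi, p.value = pr)
}
hw_sketch(c(13, 35, 70))
```

Run on the slide's data it reproduces p = 0.2585 and the expected counts 8, 45, 65.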
15. Illustrating With Ternary Plots
library(HardyWeinberg)
dat <- HWData(100, 100)
gdist <- dat$Xt  # create a variable with the working data
HWTernaryPlot(gdist, hwcurve=TRUE,addmarkers=FALSE,region=0,vbounds=FALSE,axis=2,
vertexlab=c("0","","1"),main="Theoretical Relationship",cex.main=1.5)
16. Access to Data
• Direct access of data
o HapMap
o Dryad
o Others
• Manipulation and visualization within R
• Preparation for export (e.g. Genalex)
19. And Determining the Number of Outliers
nsnps <- length(hwdist)
quant <-quantile(hwdist,c(.025,.975))
low <-length(hwdist[hwdist<quant[1]])
high <-length(hwdist[hwdist>quant[2]])
accept <-nsnps-low-high
low; accept; high
[1] 982
[1] 37330
[1] 976
20. Sampling and Plotting Deviation from Hardy Weinberg
chr21.poly <-na.omit(chr21.sum) #remove all NA's (fixed SNPs)
chr21.samp <-sample(nrow(chr21.poly),1000, replace=FALSE)
plot(chr21.poly$z.HWE[chr21.samp])
21. Plotting F for Randomly Sampled Markers
chr21.sub <-chr21.poly[chr21.samp,]
Hexp <- 2*chr21.sub$MAF*(1-chr21.sub$MAF)
Fi <- 1-(chr21.sub$P.AB/Hexp)
plot(Fi,xlab="Locus",ylab="F")
23. The Ewens- Watterson Test
• Based on Ewens (1977) derivation of the theoretical
equilibrium distribution of allele frequencies under the
infinite allele model.
• Uses expected homozygosity (Σp²) as test statistic
• Compares observed homozygosity in sample to expected
distribution in n random simulations
• Observed data are
o N=number of samples
o k= number of alleles
o Allele Frequency Distribution
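The simulation step of this test can be sketched directly in base R. The following is an illustrative sketch of mine, not the course's `Ewens()` function: it draws neutral allele configurations from the Ewens sampling formula via the Hoppe urn, rejects runs that do not yield exactly k alleles, and tunes θ so that the expected allele count equals k.

```r
# Sketch of the Ewens-Watterson simulation: draw neutral allele-count
# configurations with the Hoppe urn, keep those with exactly k alleles,
# and record homozygosity F = sum(p_i^2).
hoppe_sample <- function(n, theta) {
  alleles <- integer(0)
  for (i in 1:n) {
    if (runif(1) < theta / (theta + i - 1)) {
      alleles <- c(alleles, length(unique(alleles)) + 1L)      # novel allele
    } else {
      alleles <- c(alleles, alleles[sample.int(length(alleles), 1)])  # copy
    }
  }
  tabulate(alleles)  # counts per allele
}

# Choose theta so that the expected number of alleles E[K] equals k
tune_theta <- function(n, k)
  uniroot(function(th) sum(th / (th + 0:(n - 1))) - k, c(1e-3, 100))$root

ewens_F_dist <- function(n, k, theta, nsim = 200) {
  Fs <- numeric(0)
  while (length(Fs) < nsim) {
    cnt <- hoppe_sample(n, theta)
    if (length(cnt) == k) Fs <- c(Fs, sum((cnt / n)^2))  # condition on k
  }
  Fs
}

# Toy data: 20 gene copies, 4 alleles
set.seed(42)
obs <- c(12, 5, 2, 1)
Fobs <- sum((obs / sum(obs))^2)
Fnull <- ewens_F_dist(sum(obs), length(obs), tune_theta(sum(obs), length(obs)))
mean(Fnull >= Fobs)  # fraction of neutral samples at least as homozygous
</imports>
```

The rejection step is the simplest (if not the most efficient) way to condition on the observed number of alleles.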
24. Classic Data (Keith et al., 1985)
• Xdh in D. pseudoobscura, analyzed by sequential
electrophoresis
• 89 samples, 15 distinct alleles
25. Testing the Data
1. Input the Data
Xdh <- c(52,9,8,4,4,2,2,1,1,1,1,1,1,1,1) # vector of allele numbers
k <- length(Xdh) # number of alleles = k
n <- sum(Xdh)    # number of samples = n
2. Calculate Expected Homozygosity
Fx <-fhat(Xdh)
3. Run the Analysis
Ewens(n,k,Fx)
27. With Newer (and more complete) Data
Lactase Haplotypes in European and African Populations
1. Download data for Lactase gene from HapMap (CEU, YRI)
o 25 SNPs
o 48,000 KB
2. Determine numbers of haplotypes and frequencies for each
3. Apply the Ewens-Watterson test to each.
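Step 2 can be sketched in a few lines, assuming the phased genotypes have already been read into a character matrix (here a hypothetical `haps`, one row per chromosome and one column per SNP; the slides do not show the download and parsing code):

```r
# Hypothetical phased data: 40 chromosomes x 25 SNPs (random placeholder
# standing in for the HapMap download)
set.seed(1)
haps <- matrix(sample(c("A", "G"), 40 * 25, replace = TRUE), nrow = 40)

# Collapse each row into a haplotype string and tabulate frequencies
hap_str <- apply(haps, 1, paste, collapse = "")
counts <- sort(table(hap_str), decreasing = TRUE)  # haplotype spectrum
k <- length(counts)  # number of distinct haplotypes
n <- sum(counts)     # number of chromosomes sampled
```

The resulting `counts`, `k` and `n` feed into the same Ewens-Watterson machinery as the Xdh allele counts above.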
29. Some Basic Statistics from Sequence Data
library(seqinr)
library(pegas)
dat <-read.fasta(file="./Data/FGB.fas")
#additional code needed to rearrange data
sites <-seg.sites(dat.dna)
nd <-nuc.div(dat.dna)
taj <-tajima.test(dat.dna)
length(sites); nd;taj$D
[1] 23
[1] 0.007561061
[1] -0.7759744
Intron sequences, 433 nucleotides each
from Peters JL, Roberts TE, Winker K, McCracken KG (2012)
PLoS ONE 7(2): e31972. doi:10.1371/journal.pone.0031972
30. Coalescence I – A Bunch of Trees
trees <-read.tree("http://dl.dropbox.com/u/9752688/ZOO%20422P/R/msfiles/tree.1.txt")
plot(trees[1:9],layout=9)
32. Coalescence III – Summary Statistics
system("./ms 50 1000 -s 10 -L | ./sample_stats >samp.ss")
# 1000 simulations of 50 samples, with number of sites set to 10
ss.out <-read_ss("samp.ss")
head(ss.out)
        pi  S         D   thetaH         H
1 1.825306 10 -0.521575 2.419592 -0.594286
2 2.746939 10  0.658832 2.518367  0.228571
3 3.837551 10  2.055665 3.631837  0.205714
4 2.985306 10  0.964128 2.280000  0.705306
5 1.577959 10 -0.838371 5.728163 -4.150204
6 2.991020 10  0.971447 3.539592 -0.548571
33. Coalescence IV – Distribution of Summary Statistics
hist(ss.out$D,main="Distribution of Tajima's D (N=1000)",xlab="D")
abline(v=mean(ss.out$D),col="blue")
abline(v=quantile(ss.out$D,c(.025,.975)),col="red")
34. Other Uses
• Data Manipulation
o Conversion of HapMap Data for use elsewhere (e.g. Genalex)
o Other data sources via APIs (e.g. package rdryad)
• Other Analyses
o Hierarchical F statistics (hierfstat)
o Haplotype networking (pegas)
o Phylogenetics (ape, phyclust, others)
o Approximate Bayesian Computation (abc)
• Access for students
o Scripts available via LMS
o Course specific functions can be accessed (source("http://db.tt/A6tReYEC"))
o Notes with embedded code in HTML (Rstudio, knitr)
36. Challenges
• Some coding required
• Data Structures are a challenge
• Packages are heterogeneous
• Students resist coding
37. Nevertheless
• Fundamental concepts can be easily visualized graphically
• Real data can be incorporated from the outset
• It takes students from fundamental concepts to real-world
applications and analyses
For Further information:
cochrabj@miamioh.edu
Functions
http://db.tt/A6tReYEC