Jennifer Shelton KSU
Multi-k-mer de novo transcriptome assembly and assembly of assemblies using 454 and Illumina data.
http://bioinformaticsk-state.blogspot.com/
http://bioinformatics.k-state.edu/index.html
How novel compute technology transforms life science research (Denis C. Bauer)
Unprecedented data volumes and pressure on turnaround time driven by commercial applications require bioinformatics solutions to evolve to meet these new demands. New compute paradigms and cloud-based IT solutions enable this transition. Here I present two solutions capable of meeting these demands: VariantSpark for genomic variant analysis and GT-Scan2 for genome engineering applications.
VariantSpark classifies 3,000 individuals with 80 million genomic variants each in under 30 minutes. This Hadoop/Spark solution for machine learning on genomic data can hence scale up to population-size cohorts.
GT-Scan2 identifies CRISPR target sites by minimizing off-target effects and maximizing on-target efficiency. This optimization is powered by AWS Lambda functions, which offer an “always-on” web service that can instantaneously recruit enough compute resources to keep runtime stable even for queries with several thousand potential target sites.
Summary slides by Prabhakar Chalise of the Oberg et al. 2012 article "Technic... (Jennifer Shelton)
Summary slides by Prabhakar Chalise of the Oberg et al. 2012 article "Technical and biological variance structure in mRNA-Seq data: life in the real world"
Using BioNano Maps to Improve an Insect Genome Assembly (Jennifer Shelton)
Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules. Video of Webinar available at BioNano Genomics website http://www.bionanogenomics.com/bionano-community/webinars/ as "Using BioNano Maps to Improve an Insect Genome Assembly".
The Transformation of Systems Biology Into A Large Data Science (Robert Grossman)
This is a talk I gave at the Institute for Genomics & System Biology (IGSB) on December 7, 2009. The talk looks at the role of cloud computing platforms, including private clouds, for managing the large data produced by next generation sequencing platforms.
Towards Ultra-Large-Scale System: Design of Scalable Software and Next-Gen H... (Arghya Kusum Das)
Recent advances in large-scale experimental facilities ushered in an era of data-driven science. These large-scale data increase the opportunity to answer many fundamental questions in basic science. However, these data pose new challenges to the scientific community in terms of their optimal processing. Consequently, scientists are in dire need of robust high-performance computing (HPC) solutions that can scale with terabytes of data.
In this talk, I will address the challenges of two major aspects of scientific big data processing: 1) developing scalable software and algorithms for data- and compute-intensive scientific applications, and 2) proposing new cluster architectures that these applications and software tools need for good performance. I will mainly address the challenges involved in large-scale genome analysis applications, such as genomic error correction and genome assembly, which recently made their way to the forefront of big data challenges as sequencing machines outperformed Moore's law by several orders of magnitude.
In the first part, I will address the challenges involved in developing scalable algorithms to process huge amounts of genomic data using recent analytic tools such as Hadoop, Giraph, and distributed NoSQL. The algorithms are carefully tailored to scale over terabytes of data across hundreds of computing nodes. At a broader level, these algorithms take advantage of locality-based computing for their scalability. In this context, I will briefly talk about my general-purpose analytic framework for the easy and rapid design of embarrassingly parallel algorithms for massive-scale scientific data.
In the second part, I will address the challenges in designing the hardware environment that these data- and compute-intensive applications require for good performance. I will pinpoint the limitations of a traditional HPC cluster (supercomputer) in processing this huge amount of genomic data with these applications, and propose a solution that balances storage bandwidth (both I/O and memory) with the computational speed of high-performance CPUs. I will briefly discuss my theoretical model, which can help HPC system designers who are striving for system balance.
Many of these observations and developments are used by hardware vendors such as Samsung and IBM to develop or improve the configuration of their next-gen HPC clusters (e.g., Samsung’s hyper-scale computing cluster, IBM’s Power8-based supercomputer) with high-speed storage and processing power.
The swings and roundabouts of a decade of fun and games with Research Objects (Carole Goble)
Research Objects and their instantiation as RO-Crate: motivation, explanation, examples, history and lessons, and opportunities for scholarly communications, delivered virtually to the 17th Italian Research Conference on Digital Libraries.
How to sequence a large eukaryotic genome - and how we sequenced the cod genome. A seminar I gave for the Computational Life Science (Univ. of Oslo) seminar series, September 28, 2011
Complementing Computation with Visualization in Genomics (Francis Rowland)
A look at Genome Assembly Visualization with ABySS-Explorer, as well as complementing genome browsing (using clustering and interactive data exploration).
While phosphorus (31P) MRS(I) has been promising in experimental and clinical settings since the early 1970s, it has been beset by prohibitively low sensitivity, limited spectral-spatial resolution, and prolonged acquisition. This manuscript and the proceedings of the annual scientific meeting of ISMRM in 2022 (REF1) and 2023 (REF2) demonstrate that our novel acquisition strategy, the Rosette Trajectory for fast and flexible MR(S)I contrast (Shen et al. 2023 (REF3); later renamed PETALUTE after its translation to preclinical scanners at 7T and 9.4T), enables an operator-independent (1) rapid acquisition (~7 minutes), (2) reconstruction, and (3) processing pipeline, resulting in phosphorus metabolite ratio maps (10 x 10 x 10 mm³) of the whole brain.
In response to the “Repeat it with Me” challenge organized by the Reproducible Research study group of ISMRM, we demonstrated the power of this technique in 5 healthy volunteers at three different institutions with different experimental setups (2nd Place: UTE 31P 3D Rosette MRSI Reproducibility Team, REF4). Since the proposed acquisition/reconstruction/processing pipeline was operator/scanner/coil-independent, the Reproducer sub-teams successfully replicated the findings of the original proceeding in 2022 (REF1). As part of this challenge, we provided some MATLAB scripts and k-space data to reproduce some of the results described in this manuscript. The software and data can be downloaded from https://purr.purdue.edu/projects/ismrm31pmrsi.
These results will likely be of broad interest across clinical settings since the proposed acquisition strategy is not specific to any region, nucleus, or magnetic field and is operator-independent. This study's resolution and signal-to-noise ratios permit metabolite mapping in an experimentally and clinically feasible timeframe at 3T and 7T.
REF1 Bozymski B, Shen X, Ozen AC, Ibey S, Chiew M, Thomas A, Dydak U, Emir UE. Ultra-Short Echo Time 31P 3D MRSI at 3T with Novel Rosette k-space Trajectory. Proceedings 30th Scientific Meeting, International Society for Magnetic Resonance in Medicine, 2022.
REF2 Farley N, Bozymski B, Dydak U, Emir UE*. Fast 3D 31P MRSI Using Novel Rosette Petal Trajectory at 3T with x4 Accelerated Compressed Sensing. Proceedings 31st Scientific Meeting, International Society for Magnetic Resonance in Medicine, 2023.
REF3 Shen X, Özen AC, Sunjar A, Ilbey S, Sawiak S, Shi R, Chiew M, Emir UE. Ultrashort T2 components imaging of the whole brain using 3D dual-echo UTE MRI with rosette k-space pattern. Magnetic Resonance in Medicine. 2023;89(2):508–521.
REF4 https://challenge.ismrm.org/2023-24-reproducibility-challenge/results-22-23/
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics (Christopher Mason)
Challenges and biases in preparing, characterizing, and sequencing DNA and RNA can have significant impacts on research in genomics across all kingdoms of life, including experiments in single cells, RNA profiling, and metagenomics. Technical artifacts and contaminations can arise at each point of sample manipulation, extraction, sequencing, and analysis. Thus, the measurement and benchmarking of these potential sources of error are of paramount importance as next-generation sequencing (NGS) projects become more global and ubiquitous.
Fortunately, a variety of methods, standards, and technologies have recently emerged that improve measurements in genomics and sequencing, from the initial input material to the computational pipelines that process and annotate the data.
This webinar will review work to develop standards and their applications in genomics, including the ABRF-NGS Phase II NGS Study on DNA Sequencing; the FDA’s Sequencing Quality Control Consortium (SEQC2); metagenomics standards efforts (ABRF, ATCC, Zymo, Metaquins); and the Epigenomics QC group of the SEQC2. The webinar will also review the computational methods for detection, validation, and implementation of these genomic measures.
Artificial Neural Networks (ANNS) For Prediction of California Bearing Ratio ... (IJMER)
The behaviour of soil at the project location and the interactions of the earth materials during and after construction have a major influence on the success, economy and safety of the work. Another complexity associated with some geotechnical engineering materials, such as sand and gravel, is the difficulty of obtaining undisturbed samples, which is time consuming and requires skilled technicians. Knowledge of the California Bearing Ratio (C.B.R) is essential in determining road thickness. To cope with these difficulties, an attempt has been made to model C.B.R in terms of Fine Fraction, Liquid Limit, Plasticity Index, Maximum Dry Density, and Optimum Moisture Content. A multi-layer perceptron network with feed-forward back propagation is used, varying the number of hidden layers. For this purpose, test data for 50 soils were collected from laboratory test results. Data for 30 soils are used for training and the remaining 20 for testing, a 60-40 distribution. The architectures developed are 5-4-1, 5-5-1, and 5-6-1. The model with the 5-6-1 architecture is found to be quite satisfactory in predicting the C.B.R of soils. A graph of predicted against observed output values is plotted for the training and testing processes; all the points lie close to the equality line, indicating that predicted values are close to observed values.
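The split and network sizes quoted in this abstract can be checked with a little arithmetic. The following is a minimal sketch in Python, not the authors' code: the function names are hypothetical, and it assumes fully connected layers with one bias term per neuron, which the abstract does not state.

```python
def split_counts(n_samples, train_fraction):
    """Return (train, test) sizes for a simple holdout split."""
    n_train = round(n_samples * train_fraction)
    return n_train, n_samples - n_train


def mlp_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected feed-forward network.

    Assumption: every neuron outside the input layer has one bias term.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights plus biases for this layer
    return total


# 50 soils with a 60-40 distribution.
train, test = split_counts(50, 0.60)
print(train, test)  # 30 20

# The three architectures named in the abstract (inputs-hidden-output).
for architecture in [(5, 4, 1), (5, 5, 1), (5, 6, 1)]:
    print(architecture, mlp_parameter_count(architecture))
```

Under these assumptions the 5-6-1 network has 43 adjustable parameters against 30 training samples, which is one reason the held-out 20 soils matter when judging the fit.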
HPC-MAQ: A PARALLEL SHORT-READ REFERENCE ASSEMBLER (cscpconf)
Bioinformatics and computational biology are rooted in the life sciences as well as in computer and information sciences and technologies. Bioinformatics applies principles of information sciences and technologies to make vast, diverse, and complex life sciences data more understandable and useful. Computational biology uses mathematical and computational approaches to address theoretical and experimental questions in biology. Short-read sequence assembly is one of the most important steps in the analysis of biological data. Many open source software packages are available for short-read sequence assembly, and MAQ is one that is popular in the research community.
In general, the biological data sets generated by next generation sequencers are very large and require a tremendous amount of computational resources. The algorithm used for short-read sequence assembly is NP-hard, which makes it computationally expensive and time consuming. In addition, MAQ is single-threaded software that does not use the power of multi-core and distributed computing, and it does not scale. In this paper we report HPC-MAQ, which addresses the NP-hard related challenges of genome reference assembly and makes MAQ parallel and scalable through Hadoop, a software framework for distributed computing.
MODELING THE CHLOROPHYLL-A FROM SEA SURFACE REFLECTANCE IN WEST AFRICA BY DEE... (gerogepatton)
Deep learning has been applied successfully in many fields, and machine learning has recently been adopted for ocean remote sensing. In this study, we use and compare about eight deep learning estimators for retrieving a main phytoplankton pigment. Depending on the water case and on the multiple instruments simultaneously observing the Earth from a variety of platforms, several algorithms are used to estimate chlorophyll-a from marine reflectance. Using a long-term multi-sensor time series of satellite ocean-colour data (MODIS, SeaWiFS, VIIRS, MERIS, etc.), we build a single deep network model able to establish a relationship between sea surface reflectance and chlorophyll-a from any satellite sensor over West Africa. This data fusion takes into account the bias between water cases and instruments. We construct several deep-learning-based models for predicting chlorophyll-a concentration, compare them, and use the best one for our study. Training and test accuracy are quite good. The mean absolute error is very low, varying between 0.07 and 0.13 mg/m
MODELING THE CHLOROPHYLL-A FROM SEA SURFACE REFLECTANCE IN WEST AFRICA BY DEE... (ijaia)
Deep learning has been applied successfully in many fields, and machine learning has recently been adopted for ocean remote sensing. In this study, we use and compare about eight deep learning estimators for retrieving a main phytoplankton pigment. Depending on the water case and on the multiple instruments simultaneously observing the Earth from a variety of platforms, several algorithms are used to estimate chlorophyll-a from marine reflectance. Using a long-term multi-sensor time series of satellite ocean-colour data (MODIS, SeaWiFS, VIIRS, MERIS, etc.), we build a single deep network model able to establish a relationship between sea surface reflectance and chlorophyll-a from any satellite sensor over West Africa. This data fusion takes into account the bias between water cases and instruments. We construct several deep-learning-based models for predicting chlorophyll-a concentration, compare them, and use the best one for our study. Training and test accuracy are quite good. The mean absolute error is very low, varying between 0.07 and 0.13 mg/m³.
Neural Networks for High Performance Time-Delay Estimation and Acoustic Sourc... (cscpconf)
Time-delay estimation is an essential building block of many signal processing applications. This paper follows up on earlier work on acoustic source localization and time-delay estimation using pattern recognition techniques in adverse environments such as reverberant rooms or underwater. It presents high-performance results obtained with supervised training of neural networks, which challenge the state of the art, and compares their performance to that of well-known methods such as the Generalized Cross-Correlation or Adaptive Eigenvalue Decomposition.
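For readers unfamiliar with the baseline this abstract mentions, time-delay estimation by cross-correlation can be sketched in a few lines. This is an illustrative Python sketch only, not the paper's method: it uses plain, unweighted cross-correlation (the simplest special case of the Generalized Cross-Correlation family, with no frequency weighting), and the function name and toy signals are hypothetical.

```python
def estimate_delay(reference, delayed, max_lag):
    """Estimate the lag (in samples) at which `delayed` best matches `reference`.

    Searches lags in [-max_lag, max_lag] using plain, unweighted
    cross-correlation. A positive result means `delayed` lags `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(reference)):
            j = i + lag
            if 0 <= j < len(delayed):
                score += reference[i] * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy signals (assumed for illustration): a short pulse and a copy
# shifted right by 3 samples.
pulse = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0] + pulse[:-3]
print(estimate_delay(pulse, shifted, max_lag=5))  # 3
```

On clean signals like these the correlation peak is unambiguous; reverberation and noise blur that peak, which is the difficulty the abstract's adverse environments refer to.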
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC... (csandit)
Time-delay estimation is an essential building block of many signal processing applications. This paper follows up on earlier work on acoustic source localization and time-delay estimation using pattern recognition techniques in adverse environments such as reverberant rooms or underwater. It presents high-performance results obtained with supervised training of neural networks, which challenge the state of the art, and compares their performance to that of well-known methods such as the Generalized Cross-Correlation or Adaptive Eigenvalue Decomposition.
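Presenter: Jaesung Bae (master's student, KAIST)
Presentation date: October 2018
Deep learning methods have recently achieved remarkable results on a variety of speech recognition tasks. Approaches based on Convolutional Neural Networks (CNNs) in particular capture local features effectively, so they are widely used for tasks with relatively short temporal dependencies, such as spoken keyword recognition and phoneme-level recognition. However, CNNs have the limitation that they do not consider the spatial relationships among low-level features. To overcome this, we introduced a capsule network architecture to take into account the spatial relationships among features extracted from speech spectrograms. We compared its performance with that of a CNN on the Google speech commands dataset and obtained notable performance improvements in both clean and noisy environments.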
Slides for a discussion of a brief Nature comment on Bioinformatics Cores and an older PLOS ONE perspective that covers suggested best practices for Bioinformatics Cores.
Journal club slides for "Detection of structural DNA variation from next generation sequencing data: a review of informatic approaches" and a description of the software pipeline digit
Cracking the Workplace Discipline Code Main.pptxWorkforce Group
Cultivating and maintaining discipline within teams is a critical differentiator for successful organisations.
Forward-thinking leaders and business managers understand the impact that discipline has on organisational success. A disciplined workforce operates with clarity, focus, and a shared understanding of expectations, ultimately driving better results, optimising productivity, and facilitating seamless collaboration.
Although discipline is not a one-size-fits-all approach, it can help create a work environment that encourages personal growth and accountability rather than solely relying on punitive measures.
In this deck, you will learn the significance of workplace discipline for organisational success. You’ll also learn
• Four (4) workplace discipline methods you should consider
• The best and most practical approach to implementing workplace discipline.
• Three (3) key tips to maintain a disciplined workplace.
What is the TDS Return Filing Due Date for FY 2024-25.pdfseoforlegalpillers
It is crucial for the taxpayers to understand about the TDS Return Filing Due Date, so that they can fulfill your TDS obligations efficiently. Taxpayers can avoid penalties by sticking to the deadlines and by accurate filing of TDS. Timely filing of TDS will make sure about the availability of tax credits. You can also seek the professional guidance of experts like Legal Pillers for timely filing of the TDS Return.
Kseniya Leshchenko: Shared development support service model as the way to ma...Lviv Startup Club
Kseniya Leshchenko: Shared development support service model as the way to make small projects with small budgets profitable for the company (UA)
Kyiv PMDay 2024 Summer
Website – www.pmday.org
Youtube – https://www.youtube.com/startuplviv
FB – https://www.facebook.com/pmdayconference
Implicitly or explicitly all competing businesses employ a strategy to select a mix
of marketing resources. Formulating such competitive strategies fundamentally
involves recognizing relationships between elements of the marketing mix (e.g.,
price and product quality), as well as assessing competitive and market conditions
(i.e., industry structure in the language of economics).
Buy Verified PayPal Account | Buy Google 5 Star Reviewsusawebmarket
Buy Verified PayPal Account
Looking to buy verified PayPal accounts? Discover 7 expert tips for safely purchasing a verified PayPal account in 2024. Ensure security and reliability for your transactions.
PayPal Services Features-
🟢 Email Access
🟢 Bank Added
🟢 Card Verified
🟢 Full SSN Provided
🟢 Phone Number Access
🟢 Driving License Copy
🟢 Fasted Delivery
Client Satisfaction is Our First priority. Our services is very appropriate to buy. We assume that the first-rate way to purchase our offerings is to order on the website. If you have any worry in our cooperation usually You can order us on Skype or Telegram.
24/7 Hours Reply/Please Contact
usawebmarketEmail: support@usawebmarket.com
Skype: usawebmarket
Telegram: @usawebmarket
WhatsApp: +1(218) 203-5951
USA WEB MARKET is the Best Verified PayPal, Payoneer, Cash App, Skrill, Neteller, Stripe Account and SEO, SMM Service provider.100%Satisfection granted.100% replacement Granted.
Digital Transformation and IT Strategy Toolkit and TemplatesAurelien Domont, MBA
This Digital Transformation and IT Strategy Toolkit was created by ex-McKinsey, Deloitte and BCG Management Consultants, after more than 5,000 hours of work. It is considered the world's best & most comprehensive Digital Transformation and IT Strategy Toolkit. It includes all the Frameworks, Best Practices & Templates required to successfully undertake the Digital Transformation of your organization and define a robust IT Strategy.
Editable Toolkit to help you reuse our content: 700 Powerpoint slides | 35 Excel sheets | 84 minutes of Video training
This PowerPoint presentation is only a small preview of our Toolkits. For more details, visit www.domontconsulting.com
RMD24 | Debunking the non-endemic revenue myth Marvin Vacquier Droop | First ...BBPMedia1
Marvin neemt je in deze presentatie mee in de voordelen van non-endemic advertising op retail media netwerken. Hij brengt ook de uitdagingen in beeld die de markt op dit moment heeft op het gebied van retail media voor niet-leveranciers.
Retail media wordt gezien als het nieuwe advertising-medium en ook mediabureaus richten massaal retail media-afdelingen op. Merken die niet in de betreffende winkel liggen staan ook nog niet in de rij om op de retail media netwerken te adverteren. Marvin belicht de uitdagingen die er zijn om echt aansluiting te vinden op die markt van non-endemic advertising.
Improving profitability for small businessBen Wann
In this comprehensive presentation, we will explore strategies and practical tips for enhancing profitability in small businesses. Tailored to meet the unique challenges faced by small enterprises, this session covers various aspects that directly impact the bottom line. Attendees will learn how to optimize operational efficiency, manage expenses, and increase revenue through innovative marketing and customer engagement techniques.
Personal Brand Statement:
As an Army veteran dedicated to lifelong learning, I bring a disciplined, strategic mindset to my pursuits. I am constantly expanding my knowledge to innovate and lead effectively. My journey is driven by a commitment to excellence, and to make a meaningful impact in the world.
Enterprise Excellence is Inclusive Excellence.pdfKaiNexus
Enterprise excellence and inclusive excellence are closely linked, and real-world challenges have shown that both are essential to the success of any organization. To achieve enterprise excellence, organizations must focus on improving their operations and processes while creating an inclusive environment that engages everyone. In this interactive session, the facilitator will highlight commonly established business practices and how they limit our ability to engage everyone every day. More importantly, though, participants will likely gain increased awareness of what we can do differently to maximize enterprise excellence through deliberate inclusion.
What is Enterprise Excellence?
Enterprise Excellence is a holistic approach that's aimed at achieving world-class performance across all aspects of the organization.
What might I learn?
A way to engage all in creating Inclusive Excellence. Lessons from the US military and their parallels to the story of Harry Potter. How belt systems and CI teams can destroy inclusive practices. How leadership language invites people to the party. There are three things leaders can do to engage everyone every day: maximizing psychological safety to create environments where folks learn, contribute, and challenge the status quo.
Who might benefit? Anyone and everyone leading folks from the shop floor to top floor.
Dr. William Harvey is a seasoned Operations Leader with extensive experience in chemical processing, manufacturing, and operations management. At Michelman, he currently oversees multiple sites, leading teams in strategic planning and coaching/practicing continuous improvement. William is set to start his eighth year of teaching at the University of Cincinnati where he teaches marketing, finance, and management. William holds various certifications in change management, quality, leadership, operational excellence, team building, and DiSC, among others.
Multi-k-mer de novo transcriptome assembly and assembly of assemblies using 454 and illumina data.
1. 1/12/13
K-INBRE Bioinformatics Core Training
and Education Resource
1
Leveraging multiple sequencing technologies,
assembly algorithms, and assembly parameters
to create de novo transcript libraries for four
non-model species.
Jennifer Shelton
Bioinformatics Core Outreach Coordinator
Kansas State University
2. Outline
I. Goals
II. Metrics
III. Background (multi-k assembly)
IV. Workflow
V. Results
VI. Conclusions
3. Goals
Create high-quality reference transcriptomes
of non-model plants in order to:
- annotate lipid synthesis pathways
- compare expression profiles
5. Quality metrics
1) Cumulative lengths of contigs
2) Number of contigs
3) N25, N50, N75: order contigs from longest to shortest and report the length of the contig at which the running total reaches 25, 50 or 75% of the cumulative contig length
4) Ortholog Hit Ratio: length of the putative coding region (High Scoring Pairs (HSP)) divided by the length of the protein
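As a rough illustration of the N25, N50 and N75 metrics, the sketch below uses the common convention of sorting contigs from longest to shortest and reporting the contig length at which the running total first reaches the requested share of the cumulative length. The function name `nx` and the contig lengths are made up for the example; they are not taken from the assemblies in this deck.

```python
def nx(contig_lengths, percent):
    """Return the Nx contig length for a list of contig lengths.

    Contigs are sorted from longest to shortest; the value returned is the
    length of the contig at which the running total first reaches
    `percent` % of the cumulative contig length.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= total * percent:
            return length


# Hypothetical contig lengths in bp (illustration only).
lengths = [100, 200, 300, 400, 500]
for p in (25, 50, 75):
    print(f"N{p} = {nx(lengths, p)} bp")
```

With this convention N25 is always at least as large as N50, which in turn is at least as large as N75.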
6. ‘Ideal’ quality metrics
1) Cumulative lengths of contigs: small
2) Number of contigs: 20-60 k
3) N25, N50, N75 (length of the contig at which 25, 50 or 75% of the cumulative contig length is reached, counting from the longest contig): large
4) Ortholog Hit Ratio (length of the putative coding region (High Scoring Pairs (HSP)) divided by the length of the protein): 1, 'full length'
7. Recently reported N50
5/22/13
Schliesky, Simon, et al. "RNA-seq assembly–are we there yet?." Frontiers in plant science 3 (2012).

Reference               Year of publication   N50 (bp)
Bräutigam et al.        2011                  596 and 521
Lu et al.               2012                  884
Meyer et al.            2012                  1308
Garg et al.             2011                  1671
Mutasa-Göttgens         2012                  1185 (1573 for loci above 0.5 kb)
Xia et al.              2011                  485
Chibalina and Filatov   2011                  1321
Wong                    2011                  948 and 938
Shi et al.              2011                  506
Hyun et al.             2012                  450
Hao et al.              2012                  408
Huang et al.            2012                  887
Zhang et al.            2012                  823 (616-664)
8. Ortholog hit ratio in recent literature
62% ≥0.5 and 35% ≥0.8 for Daphnia pulex
64% ≥0.5 and 35% ≥0.9 for salt marsh beetle
64% ≥0.5 and 40% ≥0.8 for Gryllus bimaculatus
58% ≥0.5 and 41% ≥0.8 for Oncopeltus fasciatus
Zeng V, et al. BMC Genomics (2011) 12:581; Van Belleghem, Steven M., et al. PLoS ONE 7.8 (2012): e42605; Zeng, et al. PLoS ONE (2013).
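Summaries of the form "62% ≥0.5 and 35% ≥0.8" can be produced from per-contig ortholog hit ratios. The sketch below assumes the ratio is the hit length in nucleotides divided by 3, then divided by the ortholog protein length in amino acids, which matches how the deck describes the metric. The function names and the hit lengths are hypothetical and only illustrate the arithmetic.

```python
def ortholog_hit_ratio(hit_length_nt, ortholog_length_aa):
    """OHR = (hit length in nucleotides / 3) / ortholog length in amino acids."""
    if ortholog_length_aa <= 0:
        raise ValueError("ortholog_length_aa must be positive")
    return (hit_length_nt / 3) / ortholog_length_aa


def fraction_at_least(ratios, threshold):
    """Fraction of contigs whose OHR is greater than or equal to `threshold`."""
    if not ratios:
        raise ValueError("ratios must not be empty")
    return sum(r >= threshold for r in ratios) / len(ratios)


# Hypothetical (hit length in nt, ortholog length in aa) pairs, illustration only.
hits = [(900, 300), (450, 300), (240, 400), (1200, 500)]
ratios = [ortholog_hit_ratio(nt, aa) for nt, aa in hits]

for cutoff in (0.5, 0.8):
    share = fraction_at_least(ratios, cutoff)
    print(f"{share:.0%} of contigs have OHR >= {cutoff}")
```

A ratio near 1 suggests the contig covers the full length of its best ortholog; values well below 1 point to fragmented contigs.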
10. k-mers
http://homolog.us/Tutorials/index.php?p=3.4&s=1
A k-mer is a substring within the larger string (the read)
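To make the definition concrete, here is a minimal sketch that lists every k-mer of a read by sliding a window of width k along it. The read is a made-up sequence and `kmers` is an illustrative helper, not part of any assembler.

```python
def kmers(read, k):
    """Return all substrings of length k in `read`, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read of length L contains L - k + 1 k-mers (none if k > L).
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ATGGCGT"  # hypothetical 7 bp read
print(kmers(read, 3))  # ['ATG', 'TGG', 'GGC', 'GCG', 'CGT']
```

Larger values of k give fewer, more specific k-mers per read, which is one reason the choice of k interacts with expression level and coverage.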
11. Assembly details: exploring the parameter space
Low expression: assembles best with low values of k
High expression: assembles best with high values of k
All levels of expression: assemble to full-length with a merged assembly of a range of values of k
[Figure from Schulz et al. (2012), Fig. 2: Comparison of single k-mer Oases assemblies (k=19, 21, 27, 31, 35) and the merged assembly from kMIN=19 to kMAX=35 by Oases-M, on the human dataset. The total number of Ensembl transcripts assembled to 80% of their length is provided by RPKM gene expression quantiles of 1464 genes each. x-axis: Expression Quantiles; y-axis: Reconstructed to at least 80%.]
Schulz M, et al. Bioinformatics (2012) 28:8 1086-1092.
12. Assembly details: exploring the parameter space
[Figure from Gruenheit et al. (2012), Figure 1: Number of complete transcripts identified in different assemblies of P. fastigiatum reads. 380 different assemblies were made using ABySS and a combination of (i) coverage cutoffs between 2 and 20 and (ii) k-mer sizes between 25 and 63. Transcripts covering the complete coding sequence of the homologue from A. lyrata or A. thaliana, respectively, were identified and counted. The maximum number (741) of complete transcripts was identified for coverage cutoff seven and k-mer size 41 while the lowest (70) number of complete transcripts was identified for coverage cutoff 19 and k-mer size 63. Axes: k-mer; coverage cutoff; number of complete coding sequences.]
Gruenheit N, et al. BMC Genomics. (2012) 13:92. http://www.biomedcentral.com/1471-2164/13/92
Number of contigs assembled to full length was found to peak with k-mer values ~ 41 (~82% of genes were only assembled to over 80% with one k-mer)
13. Assembly details: exploring the parameter space
Multi-k-mer assembly improves assembly to full length and assembly of a broad range of expression quantiles
15. Assembly overview: workflow

Status: sand bluestem assembly is complete; currently preparing reads and mapping them back to the assembly, and running OHR for the merged assembly.

Current workflow:

454 reads (A. gerardii ssp. hallii and A. gerardii ssp. gerardii):
- Tagcleaner to remove PrimeSmart sequences
- Prinseq to remove low quality reads and tails, reads <100bp, low entropy reads, poly A/T/N tails, and identical reads
- MIRA assembly of 454 reads

Illumina paired end reads (A. gerardii ssp. hallii and A. gerardii ssp. gerardii):
- Sickle to remove reads with N, low quality reads, and reads <50bp
- Prinseq to remove low quality reads and tails, poly A/T/N tails, and identical reads
- Velvet assemblies using multiple values of k (k=23 - k=61)
- Oases assemblies using multiple values of k (k=23 - k=61)
- Oases-M

Merge: MIRA assembly

Comparison of assemblies:
- "Blind" metrics: highest N25, N50, N75; cumulative length of contigs; number of contigs
- Blastx against Phytozome v9.0 S. bicolor protein database
- Ortholog hit ratio ((length of hit / 3) / length of ortholog)
- Number of unique blast hits; number of putative paralog/homeolog groups

1) Stringently clean
2) Assemble Illumina reads with a De Bruijn Graph assembler, and 454 reads with an Overlap Layout Consensus assembler
3) Merge with MIRA or CD-HIT
4) Compare assemblies with metrics based on contiguity and putative homology to closest relative
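The multi-k and merge steps can be pictured with a toy sketch. It lists the k values for a sweep from 23 to 61 (the step of 2 is an assumption based on the odd k values on the chart axes) and then removes contigs that are exact duplicates of, or fully contained in, a longer contig on either strand. This exact-containment filter is only a simplified stand-in: CD-HIT and MIRA cluster or reassemble on sequence similarity, which this code does not attempt. All function names and sequences are hypothetical.

```python
def k_values(k_min=23, k_max=61, step=2):
    """k values for a multi-k assembly sweep (step of 2 is an assumption)."""
    return list(range(k_min, k_max + 1, step))


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def remove_contained(contigs):
    """Keep only contigs not contained in a longer kept contig (either strand).

    Toy redundancy filter with quadratic cost; exact matches only.
    """
    kept = []
    # Longest first, so shorter contigs are compared against longer ones.
    for contig in sorted(set(contigs), key=lambda s: (-len(s), s)):
        rc = reverse_complement(contig)
        if any(contig in longer or rc in longer for longer in kept):
            continue
        kept.append(contig)
    return kept


# Hypothetical contigs pooled from assemblies made with different k values.
pooled = ["ATGGCGTACG", "GGCGTA", "CGTACGCCAT", "TTTTGGGG"]
print(len(k_values()), "k values:", k_values()[0], "to", k_values()[-1])
print(remove_contained(pooled))  # ['ATGGCGTACG', 'TTTTGGGG']
```

In the example, "GGCGTA" is a substring of the first contig and "CGTACGCCAT" is its reverse complement, so both are dropped, mirroring how the number of sequences falls after the clustering step.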
16. Length and number of contigs with a range of k-mers
[Figure: Sand bluestem assembly length and number of contigs. x-axis: Assembly k-mer value or name (k = 23 to 61, MIRA (454), MIRA cluster); y-axes: Cumulative length of sequences (Mb) and Number of sequences.]
26,000 contigs after clustering
17. Length and number of contigs with a range of k-mers
[Figure: Bittersweet assembly length and number of contigs. x-axis: Assembly k-mer value or name (27, 37, 47, 57, merge, CDH cluster, MIRA cluster); y-axes: Cumulative length of sequences (Mb) and Number of sequences x 10^5.]
[Figure: Bittersweet N values. x-axis: Assembly k-mer value or name; y-axis: Contig length (kb); series: N75, N50, N25.]

Bittersweet chart data:
Assembly       N75 (kb)   N50 (kb)   N25 (kb)   Cumulative length (Mb)   Number of sequences (x 10^5)
47             1.168      1.948      2.932      129.331497               1.07545
57             1.218      1.974      2.95       111.672465               0.90385
merge          1.404      2.23       3.299      418.762352               2.77833
CDH cluster    1.399      2.274      3.339      96.411479                0.70852
MIRA cluster   1.825      2.676      3.856      123.666263               0.59598

60-70,000 contigs after clustering
18. Length and number of contigs with a range of k-mers
[Figure: Balsam assembly length and number of contigs. x-axis: Assembly k-mer value or name (27, 37, 47, 57, merge, CDH cluster, MIRA cluster); y-axes: Cumulative length of sequences (Mb) and Number of sequences x 10^5.]
[Figure: Balsam N values. x-axis: Assembly k-mer value or name; y-axis: Contig length (kb); series: N75, N50, N25.]

Balsam chart data:
Assembly       N75 (kb)   N50 (kb)   N25 (kb)   Cumulative length (Mb)   Number of sequences (x 10^5)
37             1.206      2.008      3.087      128.100083               1.1091
47             1.195      1.977      3.051      113.176134               0.93839
57             1.271      2.035      3.096      102.507455               0.82755
merge          1.41       2.211      3.331      345.752982               2.31102
CDH cluster    1.44       2.27       3.422      84.202533                0.59174
MIRA cluster   1.804      2.69       3.941      105.920843               0.50279

50-60,000 contigs after clustering
19. N25, N50, N75 across a range of k-mer values
[Figure: Sand bluestem N values. x-axis: Assembly k-mer value or name (k = 23 to 61, MIRA (454), MIRA cluster); y-axis: Contig length (kb); series: N75, N50, N25.]
Sand bluestem's N50 is 3.2 kb after clustering
Other published N50 values:
wheat 1.4 kb
Panicum hallii 1.3 kb
Ma Bamboo 1.1 kb
Miscanthus 0.7 kb
Duan, Jialei, et al. BMC genomics 13.1 (2012): 392. Meyer E, et al. The Plant Journal. (2012) 70: 879-890. Liu, Mingying, et al. PloS one (2012) 7.10. Chouvarine, Philippe, et al. PloS one 7.1 (2012): e29850.
20. N25, N50, N75 across a range of k-mer values
[Figure: Bittersweet N values. x-axis: Assembly k-mer value or name (27, 37, 47, 57, merge, CDH cluster, MIRA cluster); y-axis: Contig length (kb); series: N75, N50, N25.]
Bittersweet's N50 is 2.3-2.7 kb after clustering
Other published N50 values:
wheat 1.4 kb
Panicum hallii 1.3 kb
Ma Bamboo 1.1 kb
Miscanthus 0.7 kb
Duan, Jialei, et al. BMC genomics 13.1 (2012): 392. Meyer E, et al. The Plant Journal. (2012) 70: 879-890. Liu, Mingying, et al. PloS one (2012) 7.10. Chouvarine, Philippe, et al. PloS one 7.1 (2012): e29850.
21. N25, N50, N75 across a range of k-mer values
[Figure: Balsam N values. x-axis: Assembly k-mer value or name (27, 37, 47, 57, merge, CDH cluster, MIRA cluster); y-axis: Contig length (kb); series: N75, N50, N25.]
Balsam's N50 is 2.3-2.7 kb after clustering
Other published N50 values:
wheat 1.4 kb
Panicum hallii 1.3 kb
Ma Bamboo 1.1 kb
Miscanthus 0.7 kb
Duan, Jialei, et al. BMC genomics 13.1 (2012): 392. Meyer E, et al. The Plant Journal. (2012) 70: 879-890. Liu, Mingying, et al. PloS one (2012) 7.10. Chouvarine, Philippe, et al. PloS one 7.1 (2012): e29850.
22. Ortholog hit ratio with a range of k-mers
[Figure: histogram of ortholog hit ratio. x-axis: Ortholog hit ratio (bins from 0 to 1.5); y-axis: Number of contigs x 10^3; series: MIRA and k = 23 to 61.]
** Note: This algorithm varies slightly from the final OHR algorithm I used in slide 27
23. Ortholog hit ratio in recent literature
[Figure from Gruenheit et al. (2012), Figure 1: Number of complete transcripts identified in different assemblies of P. fastigiatum reads, made using ABySS with coverage cutoffs between 2 and 20 and k-mer sizes between 25 and 63. The maximum number (741) of complete transcripts was identified for coverage cutoff seven and k-mer size 41. Axes: k-mer; coverage cutoff; number of complete coding sequences.]
Gruenheit N, et al. BMC Genomics. (2012) 13:92. http://www.biomedcentral.com/1471-2164/13/92
Number of contigs assembled to full length was found to peak with k-mer values ~ 41 (similar to our peak at k = 47)
24. Ortholog hit ratio for bluestem
[Figure: OHR histogram. x-axis: OHR bins (0 to 1.5); y-axis: Number of contigs; series: k = 23 to 61 and MIRA cluster.]

Assembly       Fraction with OHR ≥0.5   Fraction with OHR ≥0.8
59             0.4768073818             0.2972410611
61             0.4619314293             0.2767591755
MIRA cluster   0.7127479439             0.5199564586

The number of contigs is lower but the OHR histogram of the clustered assembly has proportionally fewer fragmented contigs
26. Ortholog hit ratio in recent literature
51% ≥0.5 and 33% ≥0.8 for bluestem k(39-45)
71% ≥0.5 and 52% ≥0.8 for bluestem (MIRA cluster)
62% ≥0.5 and 35% ≥0.8 for Daphnia pulex
64% ≥0.5 and 35% ≥0.9 for salt marsh beetle
64% ≥0.5 and 40% ≥0.8 for Gryllus bimaculatus
58% ≥0.5 and 41% ≥0.8 for Oncopeltus fasciatus
Zeng V, et al. BMC Genomics (2011) 12:581; Van Belleghem, Steven M., et al. PLoS ONE 7.8 (2012): e42605; Zeng, et al. PLoS ONE (2013).
27. Conclusions
1) Metrics based on contiguity suggest that many of the Illumina assemblies are highly contiguous compared to recent de novo plant transcriptomes
2) Metrics based on OHR suggest the assembly is accurate and that the multi-k-mer method and clustering steps are improving the quality of the assembly
3) 454 data appear to have been less cost efficient than the Illumina data (in terms of all metrics except cumulative length of assembly)