A full picture of -omics cellular networks of regulation brings researchers closer to a realistic and reliable understanding of complex conditions. For more information, please visit: http://tbioinfopb.pine-biotech.com/
ANALYSIS OF PROTEIN MICROARRAY DATA USING DATA MINING (ijbbjournal)
Recent progress in biology, medical science, bioinformatics, and biotechnology has produced tremendous
amounts of biological data that demand in-depth analysis. At the same time, progress in data
mining research has led to the development of numerous efficient and scalable methods for mining
interesting patterns in large databases. This paper bridges the two fields, data mining and bioinformatics,
for successful mining of biological data. Microarrays constitute a new platform that allows the discovery
and characterization of proteins.
Introduction
What is cheminformatics?
Why do we have to use informatics methods in chemistry?
Is it cheminformatics or chemoinformatics?
Emergence of cheminformatics
Three major aspects of cheminformatics
Basics of cheminformatics
Topological representations
Tools used for cheminformatics
Application of cheminformatics
Role of cheminformatics in modern drug discovery
Conclusion
Bibliography
Mining frequent patterns is an NP-hard problem and has become a hot topic in recent research. Moreover,
protein datasets contain distinct patterns that can be used in many areas such as drug discovery and disease
prediction. In earlier decades, pattern discovery and protein fold recognition were carried out through
biophysics and biochemistry approaches, with X-ray crystallography and NMR used for protein structure
determination; these methods are very expensive and time consuming, whereas a mathematical approach can reduce the
cost of such laboratory experiments. Many computational methods have been applied to protein fold
detection, including graph-based algorithms and data mining techniques such as classification and clustering, and
all have their advantages and drawbacks. Pattern matching in sequential protein datasets for fold
recognition plays a meaningful role in bioinformatics because it enables prediction of unknown
protein function. Among the many pattern recognition algorithms, this work uses PrefixSpan; the
reasons for selecting this algorithm are discussed in section 2. To evaluate the experimental
results we used the SCOP dataset, a classified protein dataset, and ASTRAL, a sequential
dataset derived from SCOP.
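The abstract names PrefixSpan without further detail. As an illustration only, here is a minimal pure-Python sketch of PrefixSpan-style prefix projection over single-item sequences; the toy sequences and the `min_support` threshold are invented, and real protein-fold mining would operate on far larger sequence sets.

```python
def prefixspan(sequences, min_support):
    """Mine frequent sequential patterns by recursive prefix projection
    (PrefixSpan-style, single items per sequence element)."""
    results = {}

    def mine(prefix, projected):
        # Support of an item = number of projected postfixes containing it.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, support in counts.items():
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project each postfix past the first occurrence of the item.
            new_projected = []
            for seq in projected:
                if item in seq:
                    postfix = seq[seq.index(item) + 1:]
                    if postfix:
                        new_projected.append(postfix)
            mine(pattern, new_projected)

    mine(tuple(), [list(s) for s in sequences])
    return results

# Toy "protein" sequences; min_support=2 keeps patterns seen in >= 2 sequences.
patterns = prefixspan(["ABCB", "ABBCA", "CABC"], min_support=2)
```

Prefix projection is what distinguishes PrefixSpan from Apriori-style miners: instead of generating and testing candidate patterns, it grows each frequent prefix directly on a shrinking projected database.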
AACIMP 2010 Summer School lecture by Igor Tetko. "Physics, Chemistry and Living Systems" stream. "Chemoinformatics" course.
More info at http://summerschool.ssa.org.ua
Computer-Assisted Structure Elucidation (CASE) using a combination of 1D and 2D NMR data has been available for a number of years. These algorithms can be considered as “logic machines” capable of deriving all plausible structures from a set of structural constraints or “axioms”, defined by the spectral data and associated chemical information or prior knowledge. CASE programs allow the spectroscopist not only to determine structures from the spectral data but also to study the dependence of the proposed structure on changes to the set of axioms. In this article we describe the application of the ACD/Structure Elucidator expert system to resolve the conflict between two different hypothetical hexacyclinol structures derived by different researchers from the NMR spectra of this complex natural product. It has been shown that the combination of algorithms for both structure elucidation and structure validation delivered by the expert system enables the identification of the most probable structure as well as the associated chemical shift assignments.
Decision Support System for Bat Identification using Random Forest and C5.0 (TELKOMNIKA JOURNAL)
Morphometric and morphological bat identification is a conventional identification method that
requires precision, significant experience, and encyclopedic knowledge. The morphological features of one
species may sometimes be similar to those of another species, which causes several problems for
beginners working with bat taxonomy. The purpose of the study was to apply random
forest and C5.0 algorithm analysis in order to determine characteristic features and identify bat
species. It also aims at developing a decision support system, based on the resulting model, to determine the
characteristics and identity of bat species. The study showed that the C5.0 algorithm prevailed and
was selected, with a mean accuracy of 98.98%, while the mean accuracy for the
random forest was 97.26%. As many as 50 rules were implemented in the DSS to identify common and rare
bat species from morphometric and morphological attributes.
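C5.0 produces human-readable if-then rules, which is what makes it attractive for a decision support system. The sketch below shows the general shape of applying such a rule list to a specimen's measurements; the attribute names, thresholds, and species are invented placeholders, not the study's actual 50 rules.

```python
# Each rule pairs a predicate over a dict of measurements with a species.
# Attribute names, thresholds, and species are invented placeholders,
# not the study's actual 50 rules.
RULES = [
    (lambda m: m["forearm_mm"] > 60 and m["ear_mm"] > 25, "Megaderma spasma"),
    (lambda m: m["forearm_mm"] > 60, "Pteropus vampyrus"),
    (lambda m: m["nose_leaf"] == "present", "Hipposideros larvatus"),
]

def identify(measurements, rules=RULES, default="unknown"):
    """Return the species of the first rule whose predicate matches."""
    for predicate, species in rules:
        if predicate(measurements):
            return species
    return default

bat = {"forearm_mm": 65, "ear_mm": 20, "nose_leaf": "absent"}
print(identify(bat))  # -> Pteropus vampyrus (second rule is the first match)
```

First-match semantics means rule order matters, which mirrors how decision-list output from C5.0 is normally consumed.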
Applied Bioinformatics & Chemoinformatics: Techniques, Tools, and Opportunities (Hezekiah Fatoki)
Computational methods for in silico drug discovery are broadly categorized into two fields: bioinformatics and chemoinformatics. In bioinformatics, the major emphasis is on identification and validation of drug targets, mainly based on functional and structural annotation of genomes. In chemoinformatics (or pharmacoinformatics), the major emphasis is on designing drug molecules or ligands and modeling their interaction with drug targets.
A Review of Various Methods Used in the Analysis of Functional Gene Expressio... (ijitcs)
Sequencing projects arising from high-throughput technologies, including DNA microarray sequencing, have made it possible to measure the expression levels of millions of genes in a biological sample simultaneously, as well as to annotate those genes and identify their role (function). Consequently, to better manage and organize this significant amount of information, bioinformatics approaches have been developed. These approaches provide a representation and a more relevant integration of the data in order to test and validate researchers' hypotheses. In this context, this article describes and discusses some techniques used for the functional analysis of gene expression data.
This talk was part of the 2020 Disease Map Modeling Community meeting, covering the steps towards publishing reproducible simulation studies (based on a reused model). Links to different COMBINE guidelines, tutorials and efforts. Grants: European Commission: EOSCsecretariat.eu (831644)
Cheminformatics, concept by kk sahu sir (KAUSHAL SAHU)
INTRODUCTION
THE NEED FOR CHEMOINFORMATICS
CHEMOINFORMATICS AND DRUG DISCOVERY
HISTORICAL EVOLUTION
BASIC CONCEPTS
Chemistry Space
Molecular Descriptors
High-Throughput Screening
The Similar-Structure, Similar-Property Principle
Graph theory and Chemoinformatics
CHEMOINFORMATICS TASKS
MOLECULAR REPRESENTATIONS
Topological Representations
Geometrical Representations
TYPES OF MOLECULAR DESCRIPTORS
IN SILICO DE NOVO MOLECULAR DESIGN
FREE CHEMISTRY DATABASE
FUTURE
CONCLUSION
REFERENCE
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely
promoted as a basis for a shared research data infrastructure. Nevertheless, researchers involved in next-generation
sequencing (NGS) still lack adequate RDM solutions. NGS metadata is generally not stored together with the raw
NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover,
the (meta)data does often not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository
is highly desirable to support NGS-related research. We have selected iRODS (integrated Rule-Oriented Data
System) [3] as a basis for implementing a sequencing data repository because it allows storing both data and metadata
together. iRODS serves as scalable middleware to access different storage facilities in a centralized and virtualized
way, and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that
cover complementary phases of the research data life cycle in our organization (Academic Medical Center of the
University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS to enable the management
of a triplestore for linked data. The metadata in the iCat (iRODS’ metadata catalogue) and the ontology in Virtuoso
are kept synchronized by enforcement of strict data manipulation policies. We have implemented a prototype to
preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes:
Davrods [4] for data and metadata ingestion, data retrieval; Metalnx-web [7] for administration, data curation, and
repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles are defined (principal
investigator, data curator, repository administrator), with different access rights. New data is ingested by copying raw
sequence files and the corresponding metadata file (a sample sheet) to the landing collection on iRODS. An iRODS
rule is triggered by the sample sheet file, which extracts the metadata and registers it to the iCAT as AVU (Attribute,
Value and Unit). Ontology files are registered into Virtuoso. The sequence files are copied to the persistent collection
and are made uniquely identifiable based on metadata. All the steps are recorded into a report file that enables
monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype,
and discuss the first assessment results. Initial results indicate that the proposed solution is acceptable and fits the
researchers' workflow well.
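The ingestion step described above (extract metadata from a sample sheet and register it as Attribute, Value, Unit triples) can be sketched in Python. Real iRODS rules are written in the iRODS rule language, and the column names and AVU naming scheme here are invented for illustration.

```python
import csv, io

def sample_sheet_to_avus(sheet_text, unit="samplesheet"):
    """Parse a CSV sample sheet and emit (attribute, value, unit) triples
    in the style of iRODS AVU metadata, one group per sample row."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    avus = []
    for row in reader:
        sample = row.get("sample_id", "")
        for attr, value in row.items():
            if attr == "sample_id" or not value:
                continue  # the sample id becomes part of the attribute name
            avus.append((f"{sample}.{attr}", value, unit))
    return avus

sheet = "sample_id,organism,run\nS1,human,R42\nS2,mouse,R42\n"
avus = sample_sheet_to_avus(sheet)
```

Emitting flat AVU triples rather than nested records matches the shape of the iCAT catalogue, which stores metadata as attribute/value/unit rows attached to a data object.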
TWO LEVEL SELF-SUPERVISED RELATION EXTRACTION FROM MEDLINE USING UMLS (IJDKP)
The biomedical research literature is one of many domains that hide precious knowledge, and
the biomedical community has made extensive use of this scientific literature to discover facts about
biomedical entities such as diseases and drugs. MEDLINE is a huge database of biomedical research
papers that remains a significantly underutilized source of biological information. Discovering useful
knowledge from such a huge corpus raises various problems related to the type of information, such as the
concepts related to the domain of the texts and the semantic relationships among them. In this paper,
we propose a two-level model for self-supervised relation extraction from MEDLINE using the Unified
Medical Language System (UMLS) knowledge base. The model uses a self-supervised approach for
relation extraction (RE), constructing enhanced training examples using information from UMLS. The
model shows better results in comparison with current state-of-the-art and naive approaches.
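The core self-supervised idea, labeling sentences by projecting known entity-pair relations from a knowledge base onto text that mentions both entities, can be sketched as follows. The toy facts below stand in for UMLS and are invented for illustration.

```python
# Toy knowledge base of (entity1, entity2) -> relation facts standing in
# for UMLS; entities and relation names are invented for illustration.
KB = {
    ("aspirin", "headache"): "treats",
    ("ibuprofen", "fever"): "treats",
    ("smoking", "lung cancer"): "causes",
}

def label_sentences(sentences, kb=KB):
    """Self-supervised labeling: a sentence mentioning a known entity
    pair inherits that pair's relation as its training label."""
    examples = []
    for sent in sentences:
        low = sent.lower()
        for (e1, e2), rel in kb.items():
            if e1 in low and e2 in low:
                examples.append((sent, e1, e2, rel))
    return examples

sents = ["Aspirin is widely used against headache.",
         "Smoking is a major risk factor for lung cancer."]
train = label_sentences(sents)
```

These automatically labeled examples can then train a conventional relation classifier, which is the essence of the self-supervised (distant supervision) setup, albeit noisier than hand-labeled data.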
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS: RIDGE REGRESS... (ijaia)
This paper presents intensive empirical experiments on 13 datasets to understand the regularization effectiveness of ridge regression, the lasso estimate, and elastic net regularization methods. Given the diversity of the datasets used, the study offers a deep understanding of how datasets affect the prediction accuracy of each regularization method for a given problem. The results show that datasets play a crucial role in the performance of a regularization method and that
prediction accuracy depends heavily on the nature of the sampled datasets.
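For a single centered predictor, the three regularizers compared above have closed forms that make their shrinkage behavior easy to see. The sketch below is illustrative only; the data are invented and this is not the paper's experimental setup.

```python
def shrink_single_feature(x, y, l1=0.0, l2=0.0):
    """Closed-form coefficient for one centered feature, no intercept:
    l1 = l2 = 0 -> OLS; l2 > 0 -> ridge; l1 > 0 -> lasso; both -> elastic net."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    z = max(abs(sxy) - l1, 0.0)        # soft-threshold from the L1 penalty
    z = z if sxy >= 0 else -z
    return z / (sxx + l2)              # denominator shrinkage from the L2 penalty

x = [-2.0, -1.0, 0.0, 1.0, 2.0]        # centered predictor
y = [-4.1, -2.0, 0.1, 2.0, 4.0]        # roughly y = 2x
b_ols = shrink_single_feature(x, y)
b_ridge = shrink_single_feature(x, y, l2=5.0)
b_lasso = shrink_single_feature(x, y, l1=5.0)
```

The two penalties shrink differently: the L2 term scales the coefficient toward zero, while the L1 soft-threshold can set it exactly to zero, which is why the lasso doubles as a feature selector.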
Supervised machine learning based liver disease prediction approach with LASS... (journalBEEI)
In this contemporary era, the use of machine learning techniques is increasing rapidly in medical science for detecting various diseases such as liver disease (LD). Around the globe, a large number of people die from this deadly disease; diagnosing it at an early stage allows treatment that can help cure the patient. In this research paper, a method is proposed to diagnose LD using supervised machine learning classification algorithms, namely logistic regression, decision tree, random forest, AdaBoost, KNN, linear discriminant analysis, gradient boosting, and support vector machine (SVM). We also applied the least absolute shrinkage and selection operator (LASSO) feature selection technique to the dataset to identify the attributes most highly correlated with LD. The predictions, made with 10-fold cross-validation (CV), are evaluated in terms of accuracy, sensitivity, precision, and F1-score. The decision tree algorithm had the best performance, with accuracy, precision, sensitivity, and F1-score of 94.295%, 92%, 99%, and 96% respectively with the inclusion of LASSO. Furthermore, a comparison with recent studies demonstrates the significance of the proposed system.
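The 10-fold cross-validation used above can be sketched as index bookkeeping: split the samples into 10 folds and let each fold serve once as the test set. The striped split below is a simplification; real evaluations usually shuffle or stratify the folds.

```python
def kfold_indices(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold CV."""
    folds = [list(range(i, n, k)) for i in range(k)]   # simple striped split
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Every sample serves as test data exactly once across the 10 folds.
n = 25
tested = sorted(idx for _, test in kfold_indices(n, 10) for idx in test)
```

Because each sample is tested exactly once, averaging the per-fold accuracy, precision, sensitivity, and F1-score gives the kind of summary figures the paper reports.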
As part of our efforts to develop a public platform to provide access to predictive models, we have attempted to disentangle the influence of the quality versus quantity of data available to develop and validate QSAR models. Using a thorough manual review of the data underlying the well-known EPI Suite software, we developed automated processes for validating the data using a KNIME workflow. These include approaches to validate different chemical structure representations (e.g. molfile and SMILES) and identifiers (chemical names and registry numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Our efforts to quantify and segregate data into various quality categories have allowed us to thoroughly investigate the resulting models developed from these data slices, as well as to examine whether efforts to develop large high-quality datasets have the expected pay-off in terms of prediction performance. Machine-learning approaches have been applied to create a series of models that have been used to generate predicted physicochemical and environmental parameters for over 700,000 chemicals. These data are available online via the EPA's iCSS Chemistry Dashboard. This abstract does not reflect U.S. EPA policy.
In this research, a hybrid wrapper model is proposed to identify the featured gene subset from gene expression data. To balance exploration
and exploitation, a hybrid model combining a popular meta-heuristic algorithm, the
spider monkey optimizer (SMO), with simulated annealing (SA) is applied. In the proposed model, ReliefF is used as a filter to obtain a relevant gene subset
from the dataset by removing noise and outliers before feeding the data to the
SMO wrapper. To enhance the quality of the solution, simulated annealing is
deployed as a local search with the SMO in the second phase, guiding the detection of the most optimal feature subset. To evaluate the performance of the proposed model, a support vector machine (SVM) is used as the fitness function to recognize the most informative biomarker genes from cancer datasets along with University of California, Irvine (UCI) datasets. To further evaluate the model, four different classifiers (SVM, naive Bayes (NB), decision tree (DT), and k-nearest neighbors (KNN)) are used. The experimental results and analysis show that ReliefF-SMO-SA-SVM performs relatively better than its state-of-the-art counterparts. For cancer datasets, the model performs better in terms of accuracy, reaching a maximum of 99.45%.
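The second-phase local search can be sketched as simulated annealing over binary feature masks: flip one feature in or out per step, and occasionally accept worse subsets while the temperature is high. The toy fitness function below stands in for the SVM accuracy used as the fitness function in the paper; all names and parameters are illustrative assumptions.

```python
import math, random

def sa_feature_search(score, n_features, iters=300, t0=1.0, cooling=0.95, seed=0):
    """Simulated-annealing local search over binary feature masks:
    flip one feature per step, accept worse candidates with
    probability exp(delta / temperature)."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    cur_score = score(current)
    best, best_score = current[:], cur_score
    t = t0
    for _ in range(iters):
        cand = current[:]
        cand[rng.randrange(n_features)] ^= True       # flip one feature bit
        s = score(cand)
        if s > cur_score or rng.random() < math.exp((s - cur_score) / t):
            current, cur_score = cand, s
            if s > best_score:
                best, best_score = cand[:], s
        t *= cooling                                  # cool the temperature
    return best, best_score

# Toy fitness standing in for wrapper SVM accuracy: reward features 0-2,
# penalize every other selected feature.
target = lambda mask: sum(mask[:3]) - 0.5 * sum(mask[3:])
mask, fit = sa_feature_search(target, n_features=8)
```

Early on the acceptance probability is close to 1 even for worse subsets (exploration); as the temperature cools, the search settles into hill climbing around the best mask found (exploitation), which is exactly the exploration/exploitation balance the abstract describes.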
Free online access to experimental and predicted chemical properties through ... (Kamel Mansouri)
The increasing number and size of public databases is facilitating the collection of chemical structures and associated experimental data for QSAR modeling. However, the performance of QSAR models depends highly not only on the modeling methodology but also on the quality of the data used. In this study we developed robust QSAR models for endpoints of environmental interest with the aim of helping the regulatory process. We used the publicly available PHYSPROP database, which includes a set of thirteen common physicochemical and environmental fate properties, including logP, melting point, Henry's coefficient, and biodegradability, among others. Curation and standardization workflows were applied to use the highest quality data and generate QSAR-ready structures. The developed models are in agreement with the five OECD principles, which require QSARs to be simple and reliable. These models were applied to a set of ~700k chemicals to produce predictions for display on the EPA CompTox Chemistry Dashboard. In addition to the predictions, this free web and mobile application provides access to the experimental data used for training as well as detailed reports including general model performance, specific applicability domain and prediction accuracy, and the nearest neighboring structures used for prediction. The dashboard also provides access to model QMRFs (QSAR Model Reporting Format), downloadable PDFs containing additional details about the modeling approaches, the data, and molecular descriptor interpretation.
Deep learning methods applied to physicochemical and toxicological endpoints (Valery Tkachenko)
Chemical and pharmaceutical companies, and government agencies regulating both chemical and biological compounds, all strive to develop new methods to provide efficient prioritization, evaluation and safety assessments for the hundreds of new chemicals that enter the market annually. While there is a lot of historical data available within the various agencies, organizations and companies, significant gaps remain in both the quantity and quality of data available coupled with optimal predictive methods. Traditional QSAR methods are based on sets of features (fingerprints) which representing the functional characteristics of chemicals. Unfortunately, due to both data gaps and limitations in the development of QSAR models, read-across approaches have become a popular area of research. Successes in the application of Artificial Neural Networks, and specifically in Deep Learning Neural Networks, has delivered a new optimism that the lack of data and limited feature sets can be overcome by using Deep Learning methods. In this poster we will present a comparison of various machine learning methods applied to several toxicological and physicochemical parameter endpoints. This abstract does not reflect U.S. EPA policy.
Single parent mating in genetic algorithm for real robotic system identificationIAESIJAI
System identification (SI) is a method of determining a mathematical model
for a system given a set of input-output data. A representation is made using
a mathematical model based on certain specified assumptions. In SI, model
structure selection is a step where a model structure perceived as an adequate
system representation is selected. A typical rule is that the final model must
have a good balance between parsimony and accuracy. As a popular search
method, genetic algorithm (GA) is used for selecting a model structure.
However, the optimality of the final model depends much on the
effectiveness of GA operators. This paper presents a mating technique
named single parent mating (SPM) in GA for use in a real robotic SI. This
technique is based on the chromosome structure of the parents such that a
single parent is sufficient in achieving mating that eases the search for the
optimal model. The results show that using three different objective
functions (Akaike information criterion, Bayesian information criterion and
parameter magnitude–based information criterion 2) respectively, GA with
the mating technique is able to find more optimal models than without the
mating technique. Validations show that the selected models using the
mating technique are acceptable.
Semantics for integrated laboratory analytical processes - The Allotrope Pers...OSTHUS
The software environment currently found in the analytical community consists of a patchwork of incompatible software, proprietary and non-standardized file formats,
which is further complicated by incomplete, inconsistent and potentially inaccurate metadata. To overcome these issues, the Allotrope Foundation develops a
comprehensive and innovative Framework consisting of metadata dictionaries, data standards, and class libraries for managing analytical data throughout its lifecycle. The
talk describes how laboratory data and their semantic metadata descriptions are brought together to ease the management of vast amount of data that underpin almost
every aspect of drug discovery and development.
T-Bioinfo is a comprehensive bioinformatics platform that allows the user to navigate NGS, Mass-Spec and Structural Biology data analysis pipelines using consistent interface. Analysis and integration of such data allows for better and faster discovery and optimization of personalized and precision treatment of complex diseases and understanding of medical conditions. For more information, go to pine-biotech.com
Developing tools for high resolution mass spectrometry-based screening via th...Andrew McEachran
Non-targeted and suspect screening studies using high resolution mass spectrometry (HRMS) have revolutionized how chemicals are detected in complex matrices. However, data processing remains challenging due to the vast number of chemicals detected in samples, software and computational requirements of data processing, and inherent uncertainty in confidently identifying chemicals from candidate lists. The US EPA has developed functionality within the CompTox Chemistry Dashboard (https://comptox.epa.gov) to address challenges related to data processing and analysis in HRMS. These tools include the generation of “MS-Ready” structures to optimize database searching, retention time prediction for candidate reduction, consensus ranking using chemical metadata, and in silico MS/MS fragmentation prediction for spectral matching. Combining these tools into a comprehensive workflow improves certainty in candidate identification. This presentation will introduce the tools and combined workflow including visualization and access via the Chemistry Dashboard. These tools, data, and visualization approaches within an open chemistry resource provide a freely available software tool to support structure identification and non-targeted analysis. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
It is widely agreed that complex diseases are typically caused by joint effects of multiple genetic variations, rather than a single genetic variation. Multi-SNP interactions, also known as epistatic interactions, have the potential to provide information about causes of complex diseases, and build on GWAS studies that look at associations between single SNPs and phenotypes. However, epistatic analysis methods are both computationally expensive, and have limited accessibility for biologists wanting to analyse GWAS datasets due to being command line based. Here we present APPistatic, a prototype desktop version of a pipeline for epistatic analysis of GWAS datasets. his application combines ease-of-use, via a GUI, with accelerated implementation of BOOST and FaST-LMM epistatic analysis methods.
Honest Reviews of Tim Han LMA Course Program.pptxtimhan337
Personal development courses are widely available today, with each one promising life-changing outcomes. Tim Han’s Life Mastery Achievers (LMA) Course has drawn a lot of interest. In addition to offering my frank assessment of Success Insider’s LMA Course, this piece examines the course’s effects via a variety of Tim Han LMA course reviews and Success Insider comments.
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic
harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or
unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied
students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Introduction to AI for Nonprofits with Tapp NetworkTechSoup
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
2. F. M. F. Lobato et al.
DOI: 10.4236/ajmb.2018.81003 27 American Journal of Molecular Biology
1. Introduction
New sequencing technologies such as Illumina/Solexa, SOLiD/ABI, and 454/Roche have revolutionized biological research [1] [2]. The large amount of data generated by these platforms, combined with the reduction in sequencing costs, further increased investigations in the fields of genomics, proteomics, and transcriptomics. New applications, such as personalized medicine, have emerged as a result of drug discovery and development [3]. Moreover, works related to personalized cancer diagnosis and treatment have also gained attention [4]. Due to this broad reach, researchers started focusing on next-generation sequencers. These represent a challenge in computational tractability, regarding memory and performance, as a result of the large volume of data generated, usually above 100 million short reads (32 - 400 base pairs in length) per run [5].
Initially, most studies in the area of bioinformatics tools focused only on software development, attempting to balance memory consumption and processing in order to reduce the execution time of biological analyses, especially the task of alignment to a reference [6]. Currently, several tools are consolidated for Next-Generation Sequencing (NGS) platforms. Efforts have therefore been directed at optimizing the analysis process as a whole, considering the intrinsic characteristics of each platform.
The SOLiD platform has a particular sequencing mode called Multiplex Run, which enables the sequencing of several samples in a single execution. This feature is important to biomedical areas as it reduces the cost and time of sequencing several samples, because the samples are prepared and submitted to the sequencer at the same time, minimizing the use of reagents and researchers' working hours. Moreover, a multiplex run is especially useful for the analysis of related samples. For example, it is possible to sequence a particular tumor tissue at different stages to compare gene expression.
This research was motivated by gaps found in quality-measurement tools for this type of sequencing. Moreover, the area has only one proprietary tool that manipulates Multiplex Run data, the Corona Lite Pipeline from Applied Biosystems, and this tool is unable to perform investigations of data quality and consistency. Aiming to tackle these gaps, we present in this paper a probabilistic approach that increases the reliability of data analysis from multiplex sequencing carried out on SOLiD by assigning a degree of confidence to the data generated. The proposed model can be used to guide a filtering process that respects the characteristics of each sequencing, without arbitrarily discarding useful sequences.
The remainder of this paper is organized as follows: a brief theoretical background is given in Section 2; related works are discussed in Section 3; Section 4 describes the proposal; results and final remarks are presented in Sections 5 and 6, respectively.
2. Multiplex Sequencing on SOLiD System
In this section, the SOLiD system and multiplex sequencing are presented, emphasizing the sample markup technology.
2.1. SOLiD System
SOLiD (Sequencing by Oligonucleotide Ligation and Detection) is an NGS platform developed by Thermo Fisher Scientific (formerly Life Technologies) and launched in 2007. It uses a technology based on the ligation of fluorescently labeled hybridization probes to determine the sequence of a template DNA strand [2]. The developer claims that checking two bases at a time increases sequencing reliability [7].
Sample preparation is done by fragmentation, adaptor ligation, hybridization to beads, and emulsion PCR, similar to the 454 system [2]. The beads are then immobilized on a glass slide. The preparation proceeds by adding a universal primer and fluorescently labeled oligonucleotide probes. The ligation reaction is based on probe recognition. As mentioned before, the probes used in the SOLiD system interrogate two bases per reaction [8]. Four colored dyes are used, representing the four possible two-base combinations. The first base in the sequence is always from the universal primer; the rest of the sequence can then be inferred from the color data obtained. Despite the benefits of this strategy, [9] states that this sequencing method has limited read lengths. Moreover, the computational infrastructure required is expensive.
2.2. Multiplex Sequencing
The SOLiD platform supports multiplex sequencing of up to 256 samples in a single run as of its second version. This is achieved by combining individual slides with a proprietary barcode markup system from Applied Biosystems [7] [8]. Although the use of slides equipped with sites makes the data-retrieval process more reliable, the sites restrict the space for the beads containing the samples to be sequenced, thus reducing the sequencing coverage.
To increase the density of the beads, and thereby improve coverage, SOLiD uses markers attached to the samples to allow data differentiation. Since physical separators are not employed, this process maximizes the sequencing space [7] [10]. The markers presented in [8] consist of six nucleotides, starting with guanine. The system uses only 16 out of the 1024 possible sequences, chosen for their identical melting temperatures, low error rates, and unique orthogonality in colorspace [7]. Table 1 presents the available barcodes, represented both in colorspace and base space.
3. Related Works
As pointed out previously, the literature has not explored multiplex sequencing reliability. Most investigations involving data-quality assessment on SOLiD are based on heuristics and analyze only the sequences of interest [11] [12] [13]. However, analyzing the markup system used in this type of sequencing is even more advantageous than analyzing the sequences of interest. It occurs due to the
Table 1. The sixteen barcodes of standard library.
Barcode Colorspace Basespace
1 “0032” GGGCCT
2 “0111” GGTGTG
3 “0200” AAGGGG
4 “0323” CCGATG
5 “1013” CAACGA
6 “1130” GTGCCC
7 “1221” GTCTGG
8 “1302” ACGGAG
9 “2020” GAAGGG
10 “2103” GACCGC
11 “2212” CTCAGG
12 “2331” AGCGTT
13 “3001” CGGGTC
14 “3122” CGTCTG
15 “3233” TAGCGT
16 “3310” GCGTTA
following reasons: 1) the markers are shorter, and processing cost is directly proportional to sequence length; 2) the marker sequences are in the second stage of sequencing, undergoing a unique degeneration process, different from the sequences of interest. Consequently, the markup system reflects the sequencing quality as a whole.
Among the works that examine the reliability of SOLiD output, [14] deserves attention. The authors analyzed the quality decay pattern, concluding that there is a high error probability beyond 20 bp and that the average Quality Value (QV) is less than 15. To improve data reliability, [14] developed a heuristic-based framework that filters readings with quality lower than 10.
Heuristics are useful for analyzing large sequences, since algorithms based on them usually have lower complexity and thus require less processing time than analytical algorithms. In contrast, analytical solutions improve reliability, since they generate more accurate results [15] [16]. Another point that should be emphasized is that heuristic-based approaches are not indicated for the analysis of ancestral or transcribed samples, which are more susceptible to time-related degradation of DNA molecules [17] and have a greater presence of contaminants [10] [18].
Cloud computing has also provided means to reduce time consumption in computational activities, despite the exponentially growing datasets generated by NGS technologies [19] [20] [21]. Companies such as PerkinElmer, Illumina, and DNAnexus have proposed cloud-based applications to support NGS data analysis using proprietary applications and open APIs, so that bioinformaticians
can develop their own applications and analysis workflows [20]. There are also
open source solutions for NGS data analysis that can be implemented in cloud
services such as Amazon Elastic Compute Cloud (EC2) [22].
Therefore, the development of a scalable approach to allow data-quality evaluation in Multiplex Runs addresses a gap in both the state of the art and the state of practice. Such an approach should also identify the exogenous agents responsible for degrading the process, and the stages in which they act. Aiming to fill these gaps, we propose in this paper a probabilistic model that considers the intrinsic characteristics of each sequencing to characterize multiplex runs and filter low-quality data, increasing the reliability of data analysis for multiplex sequencing performed on the SOLiD platform.
4. Char Barcoding
4.1. Case Study
The biological material used to characterize the SOLiD markup system was derived from five cancer patients, with two samples extracted from each. From these ten samples, two sequencing activities were performed. Due to a systemic failure, the second run was divided in two. On average, 300 gigabytes (GB) of text files were produced in these sequencing processes.
The multiplex sequencing output consists of four files: two containing the markup and interest sequences, and two containing their qualities. The output files should be equivalent in terms of read counts, in proportion to the samples labeled with barcodes. However, some failures were observed, more specifically a discrepancy in the sample proportions and the presence of barcodes not used in the experiment [23]. Aiming to tackle these issues, we propose a probabilistic model that can increase the reliability of analyses involving multiplex sequencing. Figure 1 presents the basic workflow of the proposed approach, showing the multiplex sequencing characterization and the filtering method developed for further analysis.
Figure 1. The basic workflow of the proposed approach.
4.2. Probabilistic Model
The proposed model is based on the Quality Values (QV) associated with di-base transitions. The QV ranges from −1 to 35, and for proprietary reasons Applied Biosystems does not provide details on how this value is obtained, nor the mapping function between the QV and the associated probability. Through an extensive literature review and exhaustive tests, we found that Equation (1) can be used to map quality to probability and adapt the range of QVs reported by the SOLiD platform:
P(Q) = 1 − 10^(−(Q + 1)/10)    (1)
As the range of the QVs includes a negative number, it was necessary to adjust it by adding a unit (Q + 1) before normalizing between zero and one. The result obtained by Equation (1) represents the probability associated with the di-base detection; in other words, the probability that the sequence was not an event of chance.
The readings were modeled as a Markov chain to evaluate the degree of confidence of a sequence. The first nucleotide of the sequence and the subsequent transitions represent states with transition probability P(Q). For a better comprehension of the approach, an example is given for the colorspace sequence G00 with Quality Values 10 and 24: by the proposed quality-probability mapping, the transition probabilities are 0.92057 and 0.99684, as shown in Figure 2.
To calculate the degree of confidence of each sequence (θ), the probabilities of all existing transitions are multiplied, as shown in Equation (2). For the example in Figure 2, the degree of confidence of G00 is 0.91766.
P(θ) = ∏ P(Q)    (2)
This probabilistic model allows: 1) discovering summary measures able to characterize the sequencing from the markup system; 2) guiding a flexible filtering process that respects the intrinsic sequencing characteristics; and 3) evaluating and improving the sequencing protocols used.
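As a minimal sketch of Equations (1) and (2), the quality-probability mapping and the degree of confidence can be implemented as below. Function names are illustrative assumptions, not taken from the authors' scripts; the worked example reproduces the G00 sequence from the text.

```python
def qv_to_prob(q):
    """Equation (1): map a SOLiD Quality Value to a probability.

    A unit is added (Q + 1) because the QV range starts at -1,
    keeping the result normalized between zero and one."""
    return 1.0 - 10.0 ** (-(q + 1) / 10.0)

def degree_of_confidence(qvs):
    """Equation (2): multiply the transition probabilities of the
    Markov chain to obtain the degree of confidence (theta)."""
    theta = 1.0
    for q in qvs:
        theta *= qv_to_prob(q)
    return theta

# Worked example from the text: colorspace sequence G00, QVs 10 and 24.
# qv_to_prob(10) ≈ 0.92057, qv_to_prob(24) ≈ 0.99684,
# degree_of_confidence([10, 24]) ≈ 0.91766
```

Because the confidence is a product of per-transition probabilities, a single low-quality transition lowers the whole sequence's score, which is exactly the behavior the filtering step exploits.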
4.3. Sequencing Characterization
The search for summary measures of a given number of samples is a non-trivial task.
Figure 2. Example of a Markov chain modeled for colorspace sequences.
After obtaining the degree of confidence, it was possible to calculate several statistical measures, including the mean and the variance, representing concentration and dispersion, to characterize the sequencing. However, this information is not sufficient to describe a multiplex sequencing on the SOLiD platform. Thus, the rate of obtained sequences that match the standard library of barcodes, and the rate of unpaired sequences between files, were also used to evaluate the sequencing quality.
As shown previously, the output of multiplex sequencing performed on SOLiD is divided as follows: two files with information about the markup system, and two files with information about the sequences of interest. The platform pre-filters these data in real time during the sequencing process. Due to the large volume of data and processing-time constraints, there are some flaws in the file pairing, which must be treated in a posterior analysis.
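The summary measures above can be sketched as a small helper; the data layout and names below are illustrative assumptions, not the authors' actual ad hoc scripts.

```python
from statistics import mean, pvariance

def summarize_run(confidences, matched_reads, paired_reads, total_reads):
    """Summary measures used to characterize a multiplex run:
    concentration (mean) and dispersion (variance) of the per-read
    degrees of confidence, the rate of reads matching the barcodes
    used in the experiment, and the file-pairing rate."""
    return {
        "mean": mean(confidences),
        "variance": pvariance(confidences),
        "match_rate": matched_reads / total_reads,
        "paired_rate": paired_reads / total_reads,
    }

# Toy example: four reads, three of which match an experiment barcode.
stats = summarize_run([0.9, 0.8, 0.7, 0.6],
                      matched_reads=3, paired_reads=4, total_reads=4)
```

Correlating these measures, rather than reading them in isolation, is what allows the diagnoses discussed next (contamination versus protocol failure).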
One example of an insight the proposed model can give relates to sample contaminants or the reagents used to attach the markers, obtained by analyzing the rate of marked sequences that match the standard library of barcodes. Another example is when the model shows a significant number of unpaired readings; this can be interpreted as a sequencing failure and may involve the bead cloning steps or the deposition on the slides. For a better visualization of the sequencing behavior, probability density function histograms were also generated in addition to the abovementioned measures. The histograms for the case studies are presented in the Results and Discussion section.
4.4. Filtering
Unlike filtering schemes that use a threshold based directly on the Quality Value, the proposed model accepts as input a minimum value of the degree of confidence. This enables a better fit, since changes in the degree of confidence are more sensitive than changes in filtering methods that use the QV directly.
The use of the degree of confidence as an input parameter is also important for multiplex sequencing. For example, for a sequencing that presents a good average but a low matching rate, it is necessary to consider a higher threshold, whereas in a sequencing with optimal rates it is possible to relax the threshold to maximize the recovery of readings. To evaluate the proposed approach, several filtering runs were performed using the QV and the degree of confidence as input parameters. As assessment criteria, we used the rate of filtered sequences, which represents the number of sequences above the threshold, and the match with the barcodes used in the experiment.
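The filtering step itself reduces to a threshold test on the degree of confidence; a sketch follows, with the data shape and names as illustrative assumptions.

```python
def filter_by_confidence(reads, threshold):
    """Keep (sequence, confidence) pairs whose degree of confidence
    is at or above the threshold, and report the filtered rate,
    i.e. the fraction of sequences that passed the filter."""
    kept = [(seq, conf) for seq, conf in reads if conf >= threshold]
    rate = len(kept) / len(reads) if reads else 0.0
    return kept, rate

# Example: tighten or relax the threshold depending on run quality.
reads = [("G00", 0.92), ("G01", 0.45), ("G32", 0.88)]
kept, rate = filter_by_confidence(reads, threshold=0.60)
```

Because the confidence is a product of many per-transition probabilities, small threshold steps translate into gradual changes in the kept fraction, the "softer cutoff" behavior reported in the results.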
5. Results and Discussion
To analyze the data presented in the Case Study section, a series of ad hoc scripts was implemented. The following subsections present the summary measures and the filtering results.
5.1. Multiplex Sequencing Evaluation
The summary measures calculated for the case-study data are shown in Table 2. The Rate of Paired Sequence aims to evaluate issues in the pre-filtering step performed by the SOLiD platform during the sequencing; a low pairing rate characterizes low-quality readings. In this scenario, the readings are the sequences of interest and the markup used in the multiplex sequencing. As shown in Table 2, runs C2 and C3 presented this kind of issue. Another relevant piece of information for evaluating the sequencing is the rate of matches with the barcodes used in the experiments.
Even though it is possible to analyze the measures separately, it is more advantageous to correlate them. For instance, if there is a good match with the barcodes of the kit but a low match with the barcodes used in the experiments, there were probably failures in the protocols and possible contamination. Another pertinent scenario is the low match in the last two sequencings, from which it can be concluded that problems occurred in the bead amplification stages or that there was contamination during barcode preparation.
To improve the sequencing characterization, Probability Density Function (PDF) histograms were generated using Sturges' rule for the bin intervals. The graphs for runs C1, C2 and C3 are shown in Figures 3(a)-(c), respectively.
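A density histogram with Sturges' rule (k = ceil(log2 n) + 1 bins) can be sketched as below, assuming degrees of confidence in [0, 1]; numpy's `numpy.histogram(values, bins="sturges", density=True)` provides equivalent behavior.

```python
import math

def density_histogram(values, lo=0.0, hi=1.0):
    """Histogram of degrees of confidence, normalized so that the
    bin areas sum to 1 (a probability density function estimate).
    The number of bins follows Sturges' rule."""
    k = math.ceil(math.log2(len(values))) + 1
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # clamp the top edge (v == hi) into the last bin
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    density = [c / (len(values) * width) for c in counts]
    return density, width
```

A run like C1, with most mass near confidence 1, would show a tall rightmost bin, while a degraded run like C2 spreads its mass toward lower bins.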
Table 2. Information on the analyzed sequencings.

Data                     First Run (C1)   Second Run (C2)   Third Run (C3)
Rate of Paired Sequence  84.00%           69.00%            67.00%
Match miRNA              81.87%           1.73%             49.57%
Mean                     0.812            0.471             0.592
Variance                 0.055            0.069             0.081
Figure 3. Analyzed sequencing Probability Density Functions. (a) C1 Probability Density Function; (b) C2 Probability Density Function; (c) C3 Probability Density Function.
The graphs present the expected behavior given the statistical information in Table 2. Figure 3(a) shows C1 with a large number of readings whose degree of confidence is close to 1 and low variability, while Figure 3(b) and Figure 3(c) show the behavior of the C2 and C3 sequencings. It should be noted that C2 and C3 are part of the same sequencing: the same samples were used, but due to a systemic power failure during the process the run was divided in two, with C2 the most affected by the power surges.
Analyzing the C3 behavior, an increase in high-quality readings is perceived. However, it is important to note that the longer the time between sample preparation and sequencing, the greater the natural decay of the magnetic metal beads to which the sequences are attached. This decreases the light intensity detected from the fluorochromes and, therefore, the quality of the readings. These facts explain the larger presence of low-quality readings in C3 compared to C1 and justify the need to filter out low-quality data.
5.2. Filtering Results
The first filtering was conducted using the QV as input. It was observed that small variations in the input parameter imply discrepancies in the number of sequences that did not pass the filter threshold. However, when the degree of confidence was used, the changes were less abrupt, especially for lower confidence values. These changes are shown in Table 3.
Analyzing the data obtained by applying the model, it was possible to observe that the filtered-sequence rates of C2 and C3 are lower, in line with the characteristics identified previously. As one of the requirements of the approach is scalability, filtering based on the degree of confidence is more useful, given that for long sequences (e.g., the sequences of interest) the fine-tuning provides better filter control. As the criterion for assessing the filtering, the match with the barcodes used in the experiment was selected. Table 4 presents the results obtained.
Table 3. Rate of filtered sequences using Quality Value and Degree of Confidence.

Threshold            C1       C2       C3
Quality      5       84.17%   43.69%   61.80%
             7       72.10%   20.49%   39.00%
             10      63.00%   9.12%    27.51%
             15      46.28%   1.64%    13.54%
Probability  60.00%  80.52%   34.56%   52.08%
             70.00%  74.38%   22.36%   41.76%
             75.00%  70.93%   16.90%   36.63%
             80.00%  67.43%   12.30%   31.93%
             95.00%  47.49%   1.58%    13.98%
Table 4. Information on match with barcodes used in the experiments after filtering.

Threshold            C1       C2      C3
Quality      5       90.56%   2.06%   66.84%
             7       96.36%   1.98%   82.46%
             10      98.22%   1.86%   89.99%
             15      99.29%   1.43%   95.75%
Probability  60.00%  93.25%   2.10%   74.77%
             70.00%  95.91%   2.06%   81.87%
             75.00%  96.99%   1.99%   85.32%
             95.00%  99.24%   1.35%   95.71%
The results are consistent with expectations: there were no large differences
between the two filtering schemes. In both cases the anomaly of C2 was
confirmed, while C1 and C3 showed the expected behavior: as the filtering
threshold increases, the match rate also increases.
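The match criterion reported in Table 4 can be stated compactly. A sketch under the assumption that each retained read carries a decoded barcode string; the function name and data layout are illustrative, not from the paper:

```python
def barcode_match_rate(retained_barcodes, experiment_barcodes):
    """Fraction of retained reads whose decoded barcode is one of
    the barcodes actually used in the experiment."""
    if not retained_barcodes:
        return 0.0
    known = set(experiment_barcodes)
    matches = sum(1 for b in retained_barcodes if b in known)
    return matches / len(retained_barcodes)
```

A rising match rate under stricter thresholds, as seen for C1 and C3, indicates that the filter preferentially removes reads whose barcodes were miscalled.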
6. Summary and Conclusion
Multiplex sequencing allows many samples to be analyzed at the same time,
reducing costs and making the analysis of related samples more practical.
Despite these advantages, it requires careful evaluation of the markers to
ensure that readings are not interchanged among the multiplexed samples. Given
the importance of high reliability in this kind of analysis, we presented in
this paper a probabilistic model suitable for evaluating the quality of
multiplex sequencing runs performed on the SOLiD platform. In addition, we
proposed a filtering strategy using the degree of confidence obtained from the
model.
The experimental results showed that the proposed model is suitable for
assessing multiplex sequencing. Moreover, our filtering strategy proved more
useful because it provides smoother cutoff points. In summary, the main
contributions of this work rest on what the model allows: 1) identification of
faults in the sequencing process; 2) adaptation and development of new
protocols for sample preparation; 3) assignment of a degree of confidence to
the generated data; and 4) the potential to guide a filtering process that
respects the characteristics of each sequencing run, without discarding useful
sequences arbitrarily.
Acknowledgements
The authors would like to thank CNPq and CAPES for supporting this research.
The funders had no role in study design, data collection and analysis, the
decision to publish, or preparation of the manuscript. This study did not
receive any support from, nor have any involvement with, Life Technologies or
Thermo Fisher Scientific.