Carl Jung spoke of the 'collective unconscious', the idea that we are all connected to one another in some way; DNA is one such common thread. A DNA sequence is made up of four bases: a (adenine), c (cytosine), g (guanine) and t (thymine). Since 2^2 = 4, each base can be represented by two bits, e.g. a = 00, c = 01, g = 11 and t = 10, although this assignment is arbitrary. Redundancy within a sequence is therefore likely to exist, which is why this paper explores different types of repeats to compress DNA: direct repeats; palindromes (reverse direct repeats); inverted exact repeats (complementary palindromes, or exact reverse complements); inverted approximate repeats (approximate complementary palindromes, or approximate reverse complements); interspersed (dispersed) repeats; and flanking (terminal) repeats. Better compression improves network speed and saves storage space.
Develop and design hybrid genetic algorithms with multiple objectives in data...khalil IBRAHIM
Data compression plays a significant role in minimizing storage size and accelerating data transmission over a communication channel. Dictionary-based text compression quality is the focus of a new approach: hybrid dictionary compression (HDC) uses a single compression system in place of separate systems for characters, syllables, or words, and has proven itself especially on text files. Compression effectiveness depends on the quality of the hybrid dictionary of characters, syllables, and words, which is created by a forward analysis of the text file. In this research, genetic algorithms (GAs), stochastic search algorithms modeled on natural selection and the mechanisms of genetics, serve as the search engine for building this dictionary and performing the compression. The research aims to combine multiple Huffman trees and GAs in one compression system. Huffman coding is a compression method that assigns the code lengths of a variable-length prefix code, and the GA is used to search for the best Huffman tree [1], i.e. the one that gives the best compression ratio: different codings of the text's characters lead to different trees, and the GA creates a population of candidate trees, guides the search toward better solutions, and lets the fittest candidates survive so that their genetic information is reused.
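The GA search loop the abstract relies on can be illustrated with a toy example. The fitness function below is the classic OneMax problem (maximize the number of 1-bits), not the paper's Huffman-tree objective; the selection, crossover, and mutation operators are standard textbook choices, assumed only for illustration.

```python
import random

random.seed(0)  # deterministic toy run

def fitness(bits):
    """OneMax: count of 1-bits; higher is better."""
    return sum(bits)

def evolve(pop_size=30, length=20, generations=60):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection (elitist)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)   # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(length)        # single point mutation
            child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

In the paper's setting, an individual would encode a Huffman tree (or dictionary) and fitness would be the achieved compression ratio; the loop structure is the same.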
Arabic named entity recognition using deep learning approachIJECEIAES
Most Arabic Named Entity Recognition (NER) systems depend massively on external resources and handmade feature engineering to achieve state-of-the-art results. To overcome such limitations, we propose in this paper a deep learning approach to the Arabic NER task. We introduce a neural network architecture based on bidirectional Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF) and experiment with various commonly used hyperparameters to assess their effect on the overall performance of our system. Our model receives two sources of information about words as input, pre-trained word embeddings and character-based representations, and eliminates the need for any task-specific knowledge or feature engineering. We obtain a state-of-the-art result on the standard ANERcorp corpus with an F1 score of 90.6%.
A MULTI-LAYER HYBRID TEXT STEGANOGRAPHY FOR SECRET COMMUNICATION USING WORD T...IJNSA Journal
This paper introduces a multi-layer hybrid text steganography approach that utilizes word tagging and recoloring. Existing approaches aim for either imperceptibility, or high hiding capacity, or robustness. The proposed approach does not use the ordinary sequential insertion process; it overcomes the issues of current approaches by pursuing imperceptibility, high hiding capacity, and robustness together through its hybrid use of a linguistic technique and a format-based technique. The linguistic technique divides the cover text into embedding layers, where each layer consists of the sequence of words sharing a single part of speech detected by a POS tagger, while the format-based technique recolors the letters of the cover text with near-RGB color coding to embed 12 bits of the secret message in each letter, which yields high hiding capacity and conceals the embedding. Robustness is achieved through the multi-layer embedding process, and the generated stego key strengthens the security of the embedded messages and their size. Experimental comparison shows that the proposed approach outperforms currently developed approaches in providing a balance among the imperceptibility, hiding-capacity, and robustness criteria.
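One plausible way to hide 12 bits per letter in a near-original color is to substitute the low nibble of each RGB channel. The paper does not specify its exact color coding, so the scheme below (low-nibble substitution around a base color) is an assumption for illustration only.

```python
# Hypothetical recoloring sketch: 12 secret bits per letter, 4 bits hidden in
# the low nibble of each RGB channel, keeping the color near the original.
BASE = (0, 0, 0)  # nominal text color (black)

def embed(bits12: int) -> tuple:
    """Spread a 12-bit value across the low nibbles of (R, G, B)."""
    r = (BASE[0] & 0xF0) | ((bits12 >> 8) & 0xF)
    g = (BASE[1] & 0xF0) | ((bits12 >> 4) & 0xF)
    b = (BASE[2] & 0xF0) | (bits12 & 0xF)
    return (r, g, b)

def extract(color: tuple) -> int:
    """Recover the 12 hidden bits from a letter's color."""
    r, g, b = color
    return ((r & 0xF) << 8) | ((g & 0xF) << 4) | (b & 0xF)

print(hex(extract(embed(0xABC))))  # round-trips the 12 bits
```

Each channel moves by at most 15 out of 255, which is why such recoloring stays visually close to the cover text's original color.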
Non autoregressive neural text-to-speech reviewJune-Woo Kim
Non autoregressive neural text-to-speech, Peng, Kainan, et al. "Non-autoregressive neural text-to-speech." International Conference on Machine Learning. PMLR, 2020. review by June-Woo Kim
Multilabel Classification by BCH Code and Random ForestsIDES Editor
Multilabel classification deals with problems in which an instance can belong to multiple classes. This paper uses error-correcting codes for multilabel classification: a BCH code and the Random Forests learner are combined to form the proposed method. Thus, the error-correcting properties of BCH are merged with the good performance of the Random Forests learner to enhance multilabel classification results. Three experiments are conducted on three common benchmark datasets, and the results are compared against those of several existing approaches. The proposed method does well against its counterparts across the three datasets of varying characteristics.
Enhanced Level of Security using DNA Computing Technique with Hyperelliptic C...IDES Editor
Hyperelliptic Curve Cryptography (HECC) is a public-key cryptographic technique required for secure transmission. HECC improves on existing techniques such as RSA, DSA, AES and ECC in terms of smaller key size. DNA cryptography is a next-generation security mechanism, capable of storing almost a million gigabytes of data inside DNA strands. The existing DNA-based Elliptic Curve Cryptographic technique requires a larger key size to encrypt and decrypt messages, resulting in increased processing time and greater computational and memory overhead. To overcome these limitations, DNA strands are used to encode the data as a first level of security, and the HECC encryption algorithm provides a second level of security. Hence, the proposed integration of DNA computing with HECC provides a higher level of security with less computational and memory overhead.
Sentence Validation by Statistical Language Modeling and Semantic RelationsEditor IJCATR
This paper deals with sentence validation, a sub-field of Natural Language Processing. It finds applications in many areas, since it involves understanding and manipulating natural language (English in most cases); the effort is to understand and extract the important information delivered to the computer, making efficient human-computer interaction possible. Sentence validation is approached in two ways: a statistical approach and a semantic approach. In both, the database is trained on sample sentences from the Brown corpus in NLTK. The statistical approach uses a trigram technique based on an N-gram Markov model, with modified Kneser-Ney smoothing to handle zero probabilities. As a second statistical test, tagging and chunking of sentences containing named entities is carried out using pre-defined grammar rules and semantic tree parsing; the chunked sentences are fed into another database, upon which testing is carried out. Finally, semantic analysis is performed by extracting entity-relation pairs, which are then tested. After the results of all three approaches are compiled, graphs are plotted and the variations studied. A comparison of the three models is thus calculated and formulated; graphs of the probabilities from the three approaches clearly demarcate them and throw light on the findings of the project.
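The trigram step above can be sketched in a few lines. The paper uses modified Kneser-Ney smoothing over the Brown corpus; the sketch below shows only the unsmoothed maximum-likelihood estimate on a toy training set, so unseen trigrams simply get probability zero.

```python
from collections import Counter

def train(sentences):
    """Count trigrams and their bigram histories, with sentence padding."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(len(toks) - 2):
            tri[tuple(toks[i:i + 3])] += 1
            bi[tuple(toks[i:i + 2])] += 1
    return tri, bi

def prob(tri, bi, w1, w2, w3):
    """P(w3 | w1 w2) by maximum likelihood; 0.0 for unseen histories."""
    return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

tri, bi = train(["the cat sat", "the cat ran"])
print(prob(tri, bi, "the", "cat", "sat"))  # 0.5
```

Kneser-Ney would replace the raw ratio with a discounted count plus a continuation-probability backoff, precisely to avoid the zero probabilities this MLE version produces.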
In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching
(PPM) compression algorithm. This achieves significantly better compression for different natural
language texts compared to other well-known compression methods. Our method first generates a grammar
based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in
the text being compressed and then substitutes these sequences using the respective non-terminal symbols
defined by the grammar in a pre-processing phase prior to the compression. This leads to significantly
improved results in compression for various natural languages (a 5% improvement for American English,
10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We
describe further improvements using a two pass scheme where the grammar-based pre-processing is
applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary
Corpus and also achieve significantly improved results in compression, between 11% and 20%, when
compared with other compression algorithms, including a grammar-based approach, the Sequitur
algorithm.
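The grammar-based pre-processing pass can be sketched as follows: find the most frequent two-character sequences (bigraphs) and replace each with an unused single code point acting as the grammar's non-terminal; decompression reverses the table. The rule count and the choice of Unicode private-use code points are illustrative assumptions, not the paper's exact scheme.

```python
from collections import Counter

def substitute(text, n_rules=3, first_code=0xE000):  # private-use area symbols
    """Replace the n_rules most common bigraphs with single non-terminal chars."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    rules = {}
    for code, (bigraph, _) in enumerate(counts.most_common(n_rules)):
        sym = chr(first_code + code)
        rules[sym] = bigraph
        text = text.replace(bigraph, sym)
    return text, rules

def restore(text, rules):
    """Expand every non-terminal back to its bigraph."""
    for sym, bigraph in rules.items():
        text = text.replace(sym, bigraph)
    return text

shrunk, rules = substitute("the theme of the thesis")
print(len(shrunk), "<", len("the theme of the thesis"))
```

After this substitution the text has fewer, more skewed symbols, which is what lets a context model such as PPM compress it further in the main pass.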
SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHMijcsa
Many specific types of data need to be compressed for easy storage and to reduce overall retrieval times. Moreover, compressed sequences can be used to understand similarities between biological sequences. DNA data compression has become a major challenge for many researchers over the last few years as a result of the exponential increase of sequences in gene databases. In this research paper we develop an algorithm based on self-referencing bases, namely Single Base Variable Repeat Length DNA Compression (SBVRLDNAComp). A number of reference-based compression methods exist, but they are not satisfactory for forthcoming new species. SBVRLDNAComp selects the optimal among results obtained from small to long, uniform, identical and non-identical strings of nucleotides checked in four different ways. Both exact repetitive and non-repetitive bases are compressed by SBVRLDNAComp. Notably, without any reference database, SBVRLDNAComp achieves a compression ratio α of 1.70 to 1.73 when tested on ten benchmark DNA sequences. The compressed file can be further compressed with standard tools (such as WinZip or WinRAR), but even without this SBVRLDNAComp outperforms many standard DNA compression algorithms.
A Biological Sequence Compression Based on cross chromosomal similarities usi...CSCJournals
While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of biological sequences is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more important in light of the recent increase of very large data sets, such as metagenomes. Present biological sequence compression algorithms work by finding similar repeated regions within the sequence and then encoding these repeated regions together for compression. Previous research on chromosome sequence similarity reveals that the length of similar repeated regions within one chromosome is about 4.5% of the total sequence length; the compression gain is often not high because of these short lengths of repeated regions. It is well recognized that similarities exist among different regions of chromosome sequences, which implies that similar repeated sequences are found among different regions of chromosome sequences. Here, we apply cross-chromosomal similarity to biological sequence compression. The length and location of similar repeated regions among the different biological sequences are studied. It is found that the average percentage of similar subsequences found between two chromosome sequences is about 10%, of which 8% comes from cross-chromosomal prediction and 2% from self-chromosomal prediction. The percentage of similar subsequences is about 18%, of which only 1.2% comes from self-chromosomal prediction while the rest comes from cross-chromosomal prediction among the different biological sequences studied. This suggests the significance of cross-chromosomal similarities, in addition to self-chromosomal similarities, in biological sequence compression. A further 23% of storage space can be saved on average using self-chromosomal and cross-chromosomal predictions in compressing the different biological sequences.
Analysis of Genomic and Proteomic Sequence Using Fir FilterIJMER
Bioinformatics is a field of science that applies techniques from mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems, usually at the molecular level. Digital Signal Processing (DSP) applications in genomic sequence analysis have received great attention in recent years. DSP principles are used to analyse genomic and proteomic sequences: the DNA sequence is mapped into digital signals in the form of binary indicator sequences, and signal processing techniques such as digital filtering are applied to genomic sequences to identify protein-coding regions. The frequency response of genomic sequences is used to solve many optimization problems in science, medicine and other applications. The aim of this paper is to describe a method of generating the Finite Impulse Response (FIR) of a genomic sequence. The same DNA sequence is converted into a proteomic sequence using transcription and translation, and the same digital filtering technique (an FIR filter) is applied to find the frequency response. The frequency response is the same for both the gene and the proteomic sequence.
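The binary-indicator mapping used in genomic DSP is straightforward to sketch: each base gets its own 0/1 track, and an FIR filter is run over a track. The length-3 moving average below is only a minimal illustration of FIR filtering; protein-coding detection typically looks for period-3 spectral components instead.

```python
def indicator(seq, base):
    """Binary indicator sequence: 1 where seq has the given base, else 0."""
    return [1 if b == base else 0 for b in seq]

def fir(x, taps):
    """Direct-form FIR filter: y[n] = sum_k taps[k] * x[n-k]."""
    y = []
    for n in range(len(x)):
        y.append(sum(t * x[n - k] for k, t in enumerate(taps) if n - k >= 0))
    return y

xa = indicator("acgtaacg", "a")          # [1, 0, 0, 0, 1, 1, 0, 0]
print(fir(xa, [1 / 3, 1 / 3, 1 / 3]))    # smoothed adenine track
```

A full coding-region detector would form all four indicator tracks and examine the combined spectrum around frequency 2π/3, where coding regions show a characteristic peak.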
RNA sequence data analysis (transcriptome sequencing): sequencing the steady-state RNA in a sample is known as RNA-Seq. It is free of limitations such as requiring prior knowledge about the organism. RNA-Seq is useful for unravelling otherwise inaccessible complexities of the transcriptome, such as finding novel transcripts and isoforms. However, the data sets produced are large and complex, and interpretation is not straightforward.
SIMILARITY ANALYSIS OF DNA SEQUENCES BASED ON THE CHEMICAL PROPERTIES OF NUCL...cscpconf
Approaches to DNA sequence similarity analysis have been based on the representation and the frequency of sequence components; however, position within the sequence is also important information, and insufficient information in sequence representations is an important cause of poor similarity results. Based on three classifications of the DNA bases according to their chemical properties, the frequencies and average positions of group mutations are gathered into two twelve-component vectors, and the Euclidean distances among the resulting vectors are applied to compare the coding sequences of the first exon of the beta-globin gene of 11 species.
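The descriptor idea can be sketched concretely: classify bases by the three standard chemical properties (purine/pyrimidine, amino/keto, strong/weak hydrogen bonding), collect per-group frequency and average position, and compare sequences by Euclidean distance. The exact twelve-component layout in the paper may differ; this normalization is an assumption for illustration.

```python
import math

# Three chemical classifications of the four DNA bases.
GROUPS = {
    "purine": "ag", "pyrimidine": "ct",
    "amino": "ac", "keto": "gt",
    "strong": "cg", "weak": "at",
}

def descriptor(seq):
    """12-component vector: (frequency, normalized mean position) per group."""
    vec = []
    for members in GROUPS.values():
        pos = [i + 1 for i, b in enumerate(seq) if b in members]
        vec.append(len(pos) / len(seq))
        vec.append(sum(pos) / len(pos) / len(seq) if pos else 0.0)
    return vec

def distance(s1, s2):
    """Euclidean distance between the two sequences' descriptors."""
    return math.dist(descriptor(s1), descriptor(s2))

print(round(distance("acgtacgt", "aacgtacg"), 4))
```

Because positions enter the descriptor, two sequences with identical base counts but different base orderings get a nonzero distance, which is exactly the information plain frequency representations discard.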
Apollo: A workshop for the Manakin Research Coordination NetworkMonica Munoz-Torres
Apollo is a web-based, collaborative genomic annotation editing platform. We need annotation editing tools to modify and refine precise location and structure of the genome elements that predictive algorithms cannot yet resolve automatically.
This presentation is an introduction to how the manual annotation process takes place using Apollo. It is addressed to the members of the Manakin Genomics research community.
Apollo is a web-based, collaborative genomic annotation editing platform. We need annotation editing tools to modify and refine precise location and structure of the genome elements that predictive algorithms cannot yet resolve automatically.
This presentation is an introduction to how the manual annotation process takes place using Apollo. It is addressed to the members of the American Chestnut & Chinese Chestnut Genomics research community.
Loss less DNA Solidity Using Huffman and Arithmetic CodingIJERA Editor
The DNA sequences making up any bacterium comprise the blueprint of that bacterium, so understanding and analyzing the genes within those sequences has become an exceptionally significant task. Biologists produce huge volumes of DNA sequences every day, making genome sequence databases grow exponentially; databases such as GenBank hold millions of DNA sequences occupying many thousands of gigabytes of storage. Compression of genomic sequences can reduce storage requirements and increase transmission speed. In this paper we compare two lossless compression algorithms, Huffman coding and arithmetic coding. In Huffman coding, individual bases are coded and assigned specific binary code words, whereas in arithmetic coding the entire DNA sequence is coded into a single fractional number to which a binary word is assigned. The compression ratio is compared for both methods, and we conclude that arithmetic coding is the better of the two.
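The Huffman half of the comparison is easy to sketch for the four DNA symbols. The base frequencies below are invented for illustration; with skewed frequencies the variable-length codes beat the flat 2-bit encoding, while uniform frequencies also yield 2 bits per base.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman tree over {symbol: frequency} and return its codes."""
    tiebreak = count()  # keeps heap comparisons well-defined on ties
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least-frequent nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: record the code word
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": 45, "c": 30, "g": 15, "t": 10}))
```

Arithmetic coding goes further by representing the whole sequence as one number in a nested subinterval of [0, 1), so it is not limited to a whole number of bits per symbol; that fractional-bit advantage is what the paper's conclusion reflects.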
This lab has two parts – please answer all parts.Lab 7 Biotechn.docxglennf2
This lab has two parts – please answer all parts.
Lab 7: Biotechnology
A. Gene Finder Activity
Lab Materials
Materials found in your lab kit:
• none
Additional materials needed:
• access to the Internet
Activity
1. Connect to the Internet.
2. Go to the National Center for Biotechnology Information (NCBI) at http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&BLAST_PROGRAMS=megaBlast&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&LINK_LOC=blasthome
3. You will see the following screen:
4. Copy the exact nucleotide sequence given below and then paste it into the Enter Query Sequence box on the Nucleotide-nucleotide BLAST search page. Accuracy is extremely important. The sequence should be one long string of letters with no spaces. Record the sequence number you are given (you may have a different sequence than your classmates).
DNA sequences (When copying and pasting, please don't include the dashes!)
For the men:
----------------------------------------------------------------
cccgaattcgacaatgcaatcatatgcttctgctatgttaagcgtattcaacagcgatgattacagtccagctgtgcaagagaatattcccgctctccggagaagctcttccttcctttgcactgaaagctgtaactctaagtatcagtgtgaaacgggagaaaacagtaaaggcaacgtccaggatggagtgaagcgacccatgaacgcattcatcgtgtggtctcgcgatcagaggcgcaagatggctctagagaatcccagaatgcgaaactcagagatcagcaagcagctgggataccagtggaaaatgcttactgaagccgaaaaatggccattcttccaggaggcacagaaattacaggccatgcacagagagaaatacccgaattataagtatcgacctcgtcggaaggcgaagatgctgccgaagaattgcagtttgcttcccgcagatcccgcttcggtactctgcagcgaagtgcaactggacaacaggttgtacagggatgactgtacgaaagccacacactcaagaatggagcaccagctaggccacttaccgcccatcaacgcagccagctcaccgcagcaacgggaccgctacag
----------------------------------------------------------------
For the women:
----------------------------------------------------------------
cagtggaattctagagtcacacttcctaaaatatgcatttttgttttcacttttagatatgatacggaaattgatagaagcagaagatcggctataaaaaagataatggaaagggatgacacagctgcaaaaacacttgttctctgtgtttctgacataatttcattgagcgcaaatatatctgaaacttctagcagtaaaactagtagtgcagatacccaaaaagtggc
---------------------------------------------------------------
5. Scroll down and Press BLAST:
Once you have pasted the sequence in the Search box, click on BLAST! A screen should appear that says to wait a period of time (usually 30 seconds or less) for the search to be completed. Be patient while formatting takes place.
After the search has ended, scroll down the screen until you find a list introduced by the words DESCRIPTIONS – look below it and find Sequences producing significant alignments. Listed in order are the closest matches with the pasted DNA sequence.
6. Find the first listing (the closest match) that has both a blue square containing a 'G' and the term Human or Homo sapiens in the description. Click on the blue G. Clicking on the G will take you to a new screen that gives the gene name, symbol, and identifier. Write down or electronically copy the name of the gene. (Copy and paste the information in a separate MSWord file for y.
ENHANCING ENGLISH WRITING SKILLS THROUGH INTERNET-PLUS TOOLS IN THE PERSPECTI...ijfcstjournal
This investigation delves into incorporating a hybridized memetic strategy within the framework of English
composition pedagogy, leveraging Internet Plus resources. The study aims to provide an in-depth analysis
of how this method influences students’ writing competence, their perceptions of writing, and their
enthusiasm for English acquisition. Employing an explanatory research design that combines qualitative
and quantitative methods, the study collects data through surveys, interviews, and observations of students’
writing performance before and after the intervention. Findings demonstrate a beneficial impact of
integrating the memetic approach alongside Internet Plus tools on the writing aptitude of English as a
Foreign Language (EFL) learners. Students reported increased engagement with writing, attributing it to
the use of Internet Plus tools. They also expressed that the memetic approach facilitated a deeper
understanding of cultural and social contexts in writing. Furthermore, the findings highlight a significant
improvement in students’ writing skills following the intervention. This study provides significant insights
into the practical implementation of the memetic approach within English writing education, highlighting
the beneficial contribution of Internet Plus tools in enriching students' learning journeys.
A SURVEY TO REAL-TIME MESSAGE-ROUTING NETWORK SYSTEM WITH KLA MODELLINGijfcstjournal
Message routing over a network is one of the most fundamental concepts in communication, requiring
simultaneous transmission of messages from a source to a destination. In terms of Real-Time Routing, it
refers to the addition of a timing constraint in which messages should be received within a specified time
delay. This study involves Scheduling, Algorithm Design and Graph Theory which are essential parts of
the Computer Science (CS) discipline. Our goal is to investigate an innovative and efficient way to present
these concepts in the context of CS Education. In this paper, we will explore the fundamental modelling of
routing real-time messages on networks. We study whether it is possible to have an optimal on-line
algorithm for the Arbitrary Directed Graph network topology. In addition, we will examine the message
routing’s algorithmic complexity by breaking down the complex mathematical proofs into concrete, visual
examples. Next, we explore the Unidirectional Ring topology in finding the transmission’s
“makespan”.Lastly, we propose the same network modelling through the technique of Kinesthetic Learning
Activity (KLA). We will analyse the data collected and present the results in a case study to evaluate the
effectiveness of the KLA approach compared to the traditional teaching method.
A COMPARATIVE ANALYSIS ON SOFTWARE ARCHITECTURE STYLESijfcstjournal
Software architecture is the structural solution that achieves the overall technical and operational
requirements for software developments. Software engineers applied software architectures for their
software system developments; however, they worry the basic benchmarks in order to select software
architecture styles, possible components, integration methods (connectors) and the exact application of
each style.
The objective of this research work was a comparative analysis of software architecture styles by its
weakness and benefits in order to select by the programmer during their design time. Finally, in this study,
the researcher has been identified architectural styles, weakness, and Strength and application areas with
its component, connector and Interface for the selected architectural styles.
SYSTEM ANALYSIS AND DESIGN FOR A BUSINESS DEVELOPMENT MANAGEMENT SYSTEM BASED...ijfcstjournal
A design of a sales system for professional services requires a comprehensive understanding of the
dynamics of sale cycles and how key knowledge for completing sales is managed. This research describes
a design model of a business development (sales) system for professional service firms based on the Saudi
Arabian commercial market, which takes into account the new advances in technology while preserving
unique or cultural practices that are an important part of the Saudi Arabian commercial market. The
design model has combined a number of key technologies, such as cloud computing and mobility, as an
integral part of the proposed system. An adaptive development process has also been used in implementing
the proposed design model.
AN ALGORITHM FOR SOLVING LINEAR OPTIMIZATION PROBLEMS SUBJECTED TO THE INTERS...ijfcstjournal
Frank t-norms are parametric family of continuous Archimedean t-norms whose members are also strict
functions. Very often, this family of t-norms is also called the family of fundamental t-norms because of the
role it plays in several applications. In this paper, optimization of a linear objective function with fuzzy
relational inequality constraints is investigated. The feasible region is formed as the intersection of two
inequality fuzzy systems defined by frank family of t-norms is considered as fuzzy composition. First, the
resolution of the feasible solutions set is studied where the two fuzzy inequality systems are defined with
max-Frank composition. Second, some related basic and theoretical properties are derived. Then, a
necessary and sufficient condition and three other necessary conditions are presented to conceptualize the
feasibility of the problem. Subsequently, it is shown that a lower bound is always attainable for the optimal
objective value. Also, it is proved that the optimal solution of the problem is always resulted from the
unique maximum solution and a minimal solution of the feasible region. Finally, an algorithm is presented
to solve the problem and an example is described to illustrate the algorithm. Additionally, a method is
proposed to generate random feasible max-Frank fuzzy relational inequalities. By this method, we can
easily generate a feasible test problem and employ our algorithm to it.
LBRP: A RESILIENT ENERGY HARVESTING NOISE AWARE ROUTING PROTOCOL FOR UNDER WA...ijfcstjournal
Underwater detector network is one amongst the foremost difficult and fascinating analysis arenas that
open the door of pleasing plenty of researchers during this field of study. In several under water based
sensor applications, nodes are square measured and through this the energy is affected. Thus, the mobility
of each sensor nodes are measured through the water atmosphere from the water flow for sensor based
protocol formations. Researchers have developed many routing protocols. However, those lost their charm
with the time. This can be the demand of the age to supply associate degree upon energy-efficient and
ascendable strong routing protocol for under water actuator networks. During this work, the authors tend
to propose a customary routing protocol named level primarily based routing protocol (LBRP), reaching to
offer strong, ascendable and energy economical routing. LBRP conjointly guarantees the most effective use
of total energy consumption and ensures packet transmission which redirects as an additional reliability in
compare to different routing protocols. In this work, the authors have used the level of forwarding node,
residual energy and distance from the forwarding node to the causing node as a proof in multicasting
technique comparisons. Throughout this work, the authors have got a recognition result concerning about
86.35% on the average in node multicasting performances. Simulation has been experienced each in a
wheezy and quiet atmosphere which represents the endorsement of higher performance for the planned
protocol.
STRUCTURAL DYNAMICS AND EVOLUTION OF CAPSULE ENDOSCOPY (PILL CAMERA) TECHNOLO...ijfcstjournal
This research paper examined and re-evaluates the technological innovation, theory, structural dynamics
and evolution of Pill Camera(Capsule Endoscopy) technology in redirecting the response manner of small
bowel (intestine) examination in human. The Pill Camera (Endoscopy Capsule) is made up of sealed
biocompatible material to withstand acid, enzymes and other antibody chemicals in the stomach is a
technology that helps the medical practitioners especially the general physicians and the
gastroenterologists to examine and re-examine the intestine for possible bleeding or infection. Before the
advent of the Pill camera (Endoscopy Capsule) the colonoscopy was the local method used but research
showed that some parts (bowel) of the intestine can’t be reach by mere traditional method hence the need
for Pill Camera. Countless number of deaths from stomach disease such as polyps, inflammatory bowel
(Crohn”s diseases), Cancers, Ulcer, anaemia and tumours of small intestines which ordinary would have
been detected by sophisticated technology like Pill Camera has become norm in the developing nations.
Nevertheless, not only will this paper examine and re-evaluate the Pill Camera Innovation, theory,
Structural dynamics and evolution it unravelled and aimed to create awareness for both medical
practitioners and the public.
AN OPTIMIZED HYBRID APPROACH FOR PATH FINDINGijfcstjournal
Path finding algorithm addresses problem of finding shortest path from source to destination avoiding
obstacles. There exist various search algorithms namely A*, Dijkstra's and ant colony optimization. Unlike
most path finding algorithms which require destination co-ordinates to compute path, the proposed
algorithm comprises of a new method which finds path using backtracking without requiring destination
co-ordinates. Moreover, in existing path finding algorithm, the number of iterations required to find path is
large. Hence, to overcome this, an algorithm is proposed which reduces number of iterations required to
traverse the path. The proposed algorithm is hybrid of backtracking and a new technique(modified 8-
neighbor approach). The proposed algorithm can become essential part in location based, network, gaming
applications. grid traversal, navigation, gaming applications, mobile robot and Artificial Intelligence.
EAGRO CROP MARKETING FOR FARMING COMMUNITYijfcstjournal
The Major Occupation in India is the Agriculture; the people involved in the Agriculture belong to the poor
class and category. The people of the farming community are unaware of the new techniques and Agromachines, which would direct the world to greater heights in the field of agriculture. Though the farmers
work hard, they are cheated by agents in today’s market. This serves as a opportunity to solve
all the problems that farmers face in the current world. The eAgro crop marketing will serve as a better
way for the farmers to sell their products within the country with some mediocre knowledge about using
the website. This would provide information to the farmers about current market rate of agro-products,
their sale history and profits earned in a sale. This site will also help the farmers to know about the market
information and to view agricultural schemes of the Government provided to farmers.
EDGE-TENACITY IN CYCLES AND COMPLETE GRAPHSijfcstjournal
It is well known that the tenacity is a proper measure for studying vulnerability and reliability in graphs.
Here, a modified edge-tenacity of a graph is introduced based on the classical definition of tenacity.
Properties and bounds for this measure are introduced; meanwhile edge-tenacity is calculated for cycle
graphs and also for complete graphs.
COMPARATIVE STUDY OF DIFFERENT ALGORITHMS TO SOLVE N QUEENS PROBLEMijfcstjournal
This Paper provides a brief description of the Genetic Algorithm (GA), the Simulated Annealing (SA)
Algorithm, the Backtracking (BT) Algorithm and the Brute Force (BF) Search Algorithm and attempts to
explain the way as how the Proposed Genetic Algorithm (GA), the Proposed Simulated Annealing (SA)
Algorithm using GA, the Backtracking (BT) Algorithm and the Brute Force (BF) Search Algorithm can be
employed in finding the best solution of N Queens Problem and also, makes a comparison between these
four algorithms. It is entirely a review based work. The four algorithms were written as well as
implemented. From the Results, it was found that, the Proposed Genetic Algorithm (GA) performed better
than the Proposed Simulated Annealing (SA) Algorithm using GA, the Backtracking (BT) Algorithm and
the Brute Force (BF) Search Algorithm and it also provided better fitness value (solution) than the
Proposed Simulated Annealing Algorithm (SA) using GA, the Backtracking (BT) Algorithm and the Brute
Force (BF) Search Algorithm, for different N values. Also, it was noticed that, the Proposed GA took more
time to provide result than the Proposed SA using GA.
PSTECEQL: A NOVEL EVENT QUERY LANGUAGE FOR VANET’S UNCERTAIN EVENT STREAMSijfcstjournal
In recent years, the complex event processing technology has been used to process the VANET’s temporal
and spatial event streams. However, we usually cannot get the accurate data because the device sensing
accuracy limitations of the system. We only can get the uncertain data from the complex and limited
environment of the VANET. Because the VANET’s event streams are consist of the uncertain data, so they
are also uncertain. How effective to express and process these uncertain event streams has become the core
issue for the VANET system. To solve this problem, we propose a novel complex event query language
PSTeCEQL (probabilistic spatio-temporal constraint event query language). Firstly, we give the definition
of the possible world model of VANET’s uncertain event streams. Secondly, we propose an event query
language PSTeCEQL and give the syntax and the operational semantics of the language. Finally, we
illustrate the validity of the PSTeCEQL by an example.
CLUSTBIGFIM-FREQUENT ITEMSET MINING OF BIG DATA USING PRE-PROCESSING BASED ON...ijfcstjournal
Now a day enormous amount of data is getting explored through Internet of Things (IoT) as technologies
are advancing and people uses these technologies in day to day activities, this data is termed as Big Data
having its characteristics and challenges. Frequent Itemset Mining algorithms are aimed to disclose
frequent itemsets from transactional database but as the dataset size increases, it cannot be handled by
traditional frequent itemset mining. MapReduce programming model solves the problem of large datasets
but it has large communication cost which reduces execution efficiency. This proposed new pre-processed
k-means technique applied on BigFIM algorithm. ClustBigFIM uses hybrid approach, clustering using kmeans algorithm to generate Clusters from huge datasets and Apriori and Eclat to mine frequent itemsets
from generated clusters using MapReduce programming model. Results shown that execution efficiency of
ClustBigFIM algorithm is increased by applying k-means clustering algorithm before BigFIM algorithm as
one of the pre-processing technique.
A MUTATION TESTING ANALYSIS AND REGRESSION TESTINGijfcstjournal
Software testing is a testing which conducted a test to provide information to client about the quality of the
product under test. Software testing can also provide an objective, independent view of the software to
allow the business to appreciate and understand the risks of software implementation. In this paper we
focused on two main software testing –mutation testing and mutation testing. Mutation testing is a
procedural testing method, i.e. we use the structure of the code to guide the test program, A mutation is a
little change in a program. Such changes are applied to model low level defects that obtain in the process
of coding systems. Ideally mutations should model low-level defect creation. Mutation testing is a process
of testing in which code is modified then mutated code is tested against test suites. The mutations used in
source code are planned to include in common programming errors. A good unit test typically detects the
program mutations and fails automatically. Mutation testing is used on many different platforms, including
Java, C++, C# and Ruby. Regression testing is a type of software testing that seeks to uncover
new software bugs, or regressions, in existing functional and non-functional areas of a system after
changes such as enhancements, patches or configuration changes, have been made to them. When defects
are found during testing, the defect got fixed and that part of the software started working as needed. But
there may be a case that the defects that fixed have introduced or uncovered a different defect in the
software. The way to detect these unexpected bugs and to fix them used regression testing. The main focus
of regression testing is to verify that changes in the software or program have not made any adverse side
effects and that the software still meets its need. Regression tests are done when there are any changes
made on software, because of modified functions.
GREEN WSN- OPTIMIZATION OF ENERGY USE THROUGH REDUCTION IN COMMUNICATION WORK...ijfcstjournal
Advances in micro fabrication and communication techniques have led to unimaginable proliferation of
WSN applications. Research is focussed on reduction of setup operational energy costs. Bulk of operational
energy costs are linked to communication activities of WSN. Any progress towards energy efficiency has a
potential of huge savings globally. Therefore, every energy efficient step is an endeavour to cut costs and
‘Go Green’. In this paper, we have proposed a framework to reduce communication workload through: Innetwork compression and multiple query synthesis at the base-station and modification of query syntax
through introduction of Static Variables. These approaches are general approaches which can be used in
any WSN irrespective of application.
A NEW MODEL FOR SOFTWARE COSTESTIMATION USING HARMONY SEARCHijfcstjournal
Accurate and realistic estimation is always considered to be a great challenge in software industry.
Software Cost Estimation (SCE) is the standard application used to manage software projects. Determining
the amount of estimation in the initial stages of the project depends on planning other activities of the
project. In fact, the estimation is confronted with a number of uncertainties and barriers’, yet assessing the
previous projects is essential to solve this problem. Several models have been developed for the analysis of
software projects. But the classical reference method is the COCOMO model, there are other methods
which are also applied such as Function Point (FP), Line of Code(LOC); meanwhile, the expert`s opinions
matter in this regard. In recent years, the growth and the combination of meta-heuristic algorithms with
high accuracy have brought about a great achievement in software engineering. Meta-heuristic algorithms
which can analyze data from multiple dimensions and identify the optimum solution between them are
analytical tools for the analysis of data. In this paper, we have used the Harmony Search (HS)algorithm for
SCE. The proposed model which is a collection of 60 standard projects from Dataset NASA60 has been
assessed.The experimental results show that HS algorithm is a good way for determining the weight
similarity measures factors of software effort, and reducing the error of MRE.
AGENT ENABLED MINING OF DISTRIBUTED PROTEIN DATA BANKSijfcstjournal
Mining biological data is an emergent area at the intersection between bioinformatics and data mining
(DM). The intelligent agent based model is a popular approach in constructing Distributed Data Mining
(DDM) systems to address scalable mining over large scale distributed data. The nature of associations
between different amino acids in proteins has also been a subject of great anxiety. There is a strong need to
develop new models and exploit and analyze the available distributed biological data sources. In this study,
we have designed and implemented a multi-agent system (MAS) called Agent enriched Quantitative
Association Rules Mining for Amino Acids in distributed Protein Data Banks (AeQARM-AAPDB). Such
globally strong association rules enhance understanding of protein composition and are desirable for
synthesis of artificial proteins. A real protein data bank is used to validate the system.
International Journal on Foundations of Computer Science & Technology (IJFCST)ijfcstjournal
International Journal on Foundations of Computer Science & Technology (IJFCST) is a Bi-monthly peer-reviewed and refereed open access journal that publishes articles which contribute new results in all areas of the Foundations of Computer Science & Technology. Over the last decade, there has been an explosion in the field of computer science to solve various problems from mathematics to engineering. This journal aims to provide a platform for exchanging ideas in new emerging trends that needs more focus and exposure and will attempt to publish proposals that strengthen our goals. Topics of interest include, but are not limited to the following:
Because the technology is used largely in the last decades; cybercrimes have become a significant
international issue as a result of the huge damage that it causes to the business and even to the ordinary
users of technology. The main aims of this paper is to shed light on digital crimes and gives overview about
what a person who is related to computer science has to know about this new type of crimes. The paper has
three sections: Introduction to Digital Crime which gives fundamental information about digital crimes,
Digital Crime Investigation which presents different investigation models and the third section is about
Cybercrime Law.
DISTRIBUTION OF MAXIMAL CLIQUE SIZE UNDER THE WATTS-STROGATZ MODEL OF EVOLUTI...ijfcstjournal
In this paper, we analyze the evolution of a small-world network and its subsequent transformation to a
random network using the idea of link rewiring under the well-known Watts-Strogatz model for complex
networks. Every link u-v in the regular network is considered for rewiring with a certain probability and if
chosen for rewiring, the link u-v is removed from the network and the node u is connected to a randomly
chosen node w (other than nodes u and v). Our objective in this paper is to analyze the distribution of the
maximal clique size per node by varying the probability of link rewiring and the degree per node (number
of links incident on a node) in the initial regular network. For a given probability of rewiring and initial
number of links per node, we observe the distribution of the maximal clique per node to follow a Poisson
distribution. We also observe the maximal clique size per node in the small-world network to be very close
to that of the average value and close to that of the maximal clique size in a regular network. There is no
appreciable decrease in the maximal clique size per node when the network transforms from a regular
network to a small-world network. On the other hand, when the network transforms from a small-world
network to a random network, the average maximal clique size value decreases significantly
Forklift Classes Overview by Intella PartsIntella Parts
Discover the different forklift classes and their specific applications. Learn how to choose the right forklift for your needs to ensure safety, efficiency, and compliance in your operations.
For more technical information, visit our website https://intellaparts.com
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesChristina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Water billing management system project report.pdfKamal Acharya
Our project entitled “Water Billing Management System” aims is to generate Water bill with all the charges and penalty. Manual system that is employed is extremely laborious and quite inadequate. It only makes the process more difficult and hard.
The aim of our project is to develop a system that is meant to partially computerize the work performed in the Water Board like generating monthly Water bill, record of consuming unit of water, store record of the customer and previous unpaid record.
We used HTML/PHP as front end and MYSQL as back end for developing our project. HTML is primarily a visual design environment. We can create a android application by designing the form and that make up the user interface. Adding android application code to the form and the objects such as buttons and text boxes on them and adding any required support code in additional modular.
MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software. It is a stable ,reliable and the powerful solution with the advanced features and advantages which are as follows: Data Security.MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software.
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSveerababupersonal22
It consists of cw radar and fmcw radar ,range measurement,if amplifier and fmcw altimeterThe CW radar operates using continuous wave transmission, while the FMCW radar employs frequency-modulated continuous wave technology. Range measurement is a crucial aspect of radar systems, providing information about the distance to a target. The IF amplifier plays a key role in signal processing, amplifying intermediate frequency signals for further analysis. The FMCW altimeter utilizes frequency-modulated continuous wave technology to accurately measure altitude above a reference point.
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
Dna data compression algorithms based on redundancy
International Journal in Foundations of Computer Science & Technology (IJFCST), Vol.4, No.6, November 2014
DNA DATA COMPRESSION ALGORITHMS BASED ON
REDUNDANCY
Subhankar Roy1 and Sunirmal Khatua2
1Department of Computer Science and Engineering, Academy of Technology, G. T.
Road, Aedconagar, Hooghly-712121, W.B., India
2Department of Computer Science and Engineering, University of Calcutta, 92, A.P.C. Road,
Kolkata-700009, India
ABSTRACT
Carl Jung said, 'Collective unconscious', i.e. we are all connected to each other in some way or the other
via our DNA. In most cases there are four bases in a DNA sequence. They are a (Adenine), c (Cytosine), g
(Guanine) and t (Thymine). Each of these bases can be represented by two bits, since 2^2 = 4, e.g. a – 00,
c – 01, g – 11 and t – 10 respectively, although this assignment is arbitrary. So redundancy within a
sequence is likely to exist. That is why in this paper we have explored different types of repeats to compress
DNA: direct repeats; palindromes or reverse direct repeats; inverted exact repeats, also called
complementary palindromes or exact reverse complements; inverted approximate repeats, also called
approximate complementary palindromes or approximate reverse complements; interspersed or dispersed
repeats; and flanking or terminal repeats. Better compression gives better network speed and saves storage space.
KEYWORDS
Direct repeats, Inverted repeats, Complementary palindrome, Interspersed repeats and Flanking repeats.
1. INTRODUCTION
Today, more and more DNA sequences are becoming available [1]. The information about DNA
sequences is stored in molecular biology databases. The size and importance of these databases
grow with time; therefore this information must be stored and communicated efficiently [2].
Furthermore, sequence compression can be used to identify similarities between biological
sequences. The only practical way to handle this situation is to encode these sequences, and the
achievable compression ratio depends upon the DNA sequence itself [3].
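The two-bit baseline encoding described in the abstract can be sketched in a few lines of Python (the a – 00, c – 01, g – 11, t – 10 mapping is the arbitrary assignment chosen above; the helper name is ours):

```python
# Pack a DNA string into bytes using the arbitrary 2-bit mapping
# from the abstract: a-00, c-01, g-11, t-10.
BASE2BITS = {'a': 0b00, 'c': 0b01, 'g': 0b11, 't': 0b10}

def pack_dna(seq):
    """Pack four bases per byte; returns (packed bytes, base count)."""
    out = bytearray()
    acc, nbits = 0, 0
    for base in seq:
        acc = (acc << 2) | BASE2BITS[base]
        nbits += 2
        if nbits == 8:          # a full byte of four bases is ready
            out.append(acc)
            acc, nbits = 0, 0
    if nbits:                   # pad the final partial byte with zeros
        out.append(acc << (8 - nbits))
    return bytes(out), len(seq)

packed, n = pack_dna("actggtcg")
print(len(packed))  # 2 bytes for 8 bases, versus 8 bytes as ASCII text
```

Even this naive packing achieves a quarter of the text size; the repeat-based methods explored in this paper aim to go below two bits per base.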
However, if one applies standard text compression software such as GZIP [4] to DNA sequences, it
cannot compress them and only expands the file to more than two bits per symbol. There are
several reasons for this. Such software is designed mainly for English text compression, while
the regularities in DNA sequences are much subtler, and these tools do not make use of the special
characteristics of a DNA sequence. For example, it is well known that DNA sequences
have very long-term correlation [5], in that subsequences in different regions of a DNA
sequence of the same genome, or of different genomes, are quite similar to each other. State-of-the-art
DNA encoding schemes rely on exploiting this long-term correlation. In particular, similarity
within the DNA sequence is searched for, so that similar subsequences in different fragments can be
encoded with reference to earlier sections.
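The reference-based idea can be illustrated with a deliberately naive sketch (not the scheme of any particular tool; a hypothetical helper that scans earlier text for the longest match, the way LZ-style coders emit back-references):

```python
def longest_earlier_match(seq, i, min_len=4):
    """Find the longest earlier occurrence (possibly overlapping, as in
    LZ77) of the text starting at position i. Returns (offset, length)
    of the best match of at least min_len bases, or None."""
    best = None
    for start in range(i):
        length = 0
        while i + length < len(seq) and seq[start + length] == seq[i + length]:
            length += 1
        if length >= min_len and (best is None or length > best[1]):
            best = (start, length)
    return best

seq = "acgtacgtacgt"
print(longest_earlier_match(seq, 4))  # (0, 8): an 8-base direct repeat
```

A real DNA compressor would also try the reverse, complemented and approximate variants of such matches, corresponding to the repeat families listed in the abstract.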
Thus DNA sequences are important as a new challenge for the study of compression algorithms.
Eclipse uses Cp1252 as the default for its encoding, but UTF-8 can represent every character in
Unicode and has a much bigger repertoire of recognized symbols (2^16 = 65536 code points in the
Basic Multilingual Plane alone), while the Eclipse console uses the default encoding of the OS,
i.e. inherited from the container (Cp1252). The size of each character in Java's Unicode (UTF-16)
representation is two bytes, whereas an ASCII character is one byte. When compression is done on
a DNA sequence or on plain text, each line break in a file, a line feed character ('\n') together
with a carriage return ('\r'), as a whole takes four bytes. So if a text file contains many lines,
a compressed file written with two-byte Unicode characters gives a bad compression factor. But in
ASCII code (2^8 = 256) the size of each character is one byte, so a line break, the combination of
a line feed ('\n') and a carriage return ('\r'), as a whole takes two bytes.
DOI: 10.5121/ijfcst.2014.4605
If the number of lines is 1000, the line breaks alone take 1000*4 = 4000 bytes in the two-byte
encoding and 1000*2 = 2000 bytes in ASCII code, i.e. half the size.
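These byte counts can be verified directly in Python (a small sketch, with UTF-16 standing in for Java's two-byte characters):

```python
# A line break is a carriage return ('\r') plus a line feed ('\n').
line_break = "\r\n"

two_byte = len(line_break.encode("utf-16-le"))  # 2 bytes per character
one_byte = len(line_break.encode("ascii"))      # 1 byte per character

print(two_byte, one_byte)                # 4 2
print(1000 * two_byte, 1000 * one_byte)  # 4000 2000: half the size in ASCII
```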
That is why most efficient compression implementations are coded in C rather than in Java: Java's
character set is the Unicode character set, while C follows the ASCII character set.
2. LITERATURE REVIEW
It is true that the compression of DNA sequences is a difficult task for general-purpose compression
algorithms, but at the same time, from the viewpoint of compression theory, it is an interesting
subject for understanding the properties of various compression algorithms.
Generally the windows of the dictionary-based methods [6-7] have a fixed, small width. Small
windows are efficient on plain text, whose redundancy is sparse and local. In DNA sequences,
however, redundancies may occur at very long distances and the repeated factors can be very long.
Shannon-Fano coding [8-9] and Huffman coding [8-9] do not give a good compression factor for
DNA data. Huffman's code fails badly on DNA sequences in both the static and the adaptive
model [8-9], because there are only four kinds of symbols in DNA sequences and their probabilities
of occurrence are close. Both follow statistical modeling by creating variable-length codes for the
bases in a DNA sequence: a base with a higher probability of occurrence gets a shorter binary code,
with code lengths in integral bits. Code assignment is done by building a binary tree called the
Huffman tree. The four frequent bases in a DNA sequence are a, c, g and t. Let us consider a part of
a sequence, "actggtcgatgtacacacataggaaccaacccccaaaaa … accat"; Table 1 shows the base
frequencies in that sequence.
Table 1. Bases and their frequencies in a sequence
Nucleotides Frequencies
a 100
c 100
g 50
t 50
So if we build the Huffman tree using these symbols then the binary code for these four symbols
is shown in Table 2.
Table 2. The Huffman Code table of nucleotides
Nucleotides Codes
a 0 / 00
c 10 / 01
g 110 / 10
t 111 / 11
So the total number of bits required is either (100×1 + 100×2 + 50×3 + 50×3) = (100 + 200 + 150 +
150) = 600 bits with the first code assignment, or (100×2 + 100×2 + 50×2 + 50×2) = (200 + 200 +
100 + 100) = 600 bits with the second.
Using the simplest two-bit encoding technique requires the same number of bits. As Huffman
coding does not consider any properties within a sequence, such as repeats or similarity, it cannot
give a better result. If we apply the Shannon-Fano technique, it again needs 600 bits to encode the
above sequence.
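The 600-bit figure can be reproduced with a short Huffman construction. This is an illustrative Python sketch (the helper names are our own, not part of any cited tool) using the frequencies of Table 1:

```python
import heapq

# Frequencies from Table 1.
freqs = {"a": 100, "c": 100, "g": 50, "t": 50}

# Build the Huffman tree with a min-heap of (frequency, tie-breaker, node).
heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, counter, (left, right)))
    counter += 1

# Walk the tree to assign prefix codes.
codes = {}
def assign(node, prefix=""):
    if isinstance(node, str):
        codes[node] = prefix or "0"
    else:
        assign(node[0], prefix + "0")
        assign(node[1], prefix + "1")
assign(heap[0][2])

total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes)
print(total_bits)  # 600 bits, no better than plain 2-bit coding
```

The exact code lengths depend on tie-breaking (either {1, 2, 3, 3} or {2, 2, 2, 2}), but with these frequencies the total is 600 bits either way.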
Concerning compression ratio, Prediction by Partial Matching (PPM) [10], developed by Cleary
and Witten, is one of the best compression algorithms in practice and is capable of very high
compression rates, encoding English text in as little as 2.2 bits/character. It predicts the next
symbol according to the most recent symbols seen. However, it cannot compress DNA sequences
below two bits per symbol either.
The Context Tree Weighting (CTW) method [11], introduced by Willems, Shtarkov, and Tjalkens,
can compress DNA sequences to less than two bits per symbol by weighting the predictions of
many context models. However, this algorithm does not use the special structures of biological
sequences.
DNA sequences are redundant in nature because, leaving aside the eleven rare bases, only the four
frequent bases a, c, g and t occur. So the similarity between subsequences within a particular
sequence is high, and if, instead of a single sequence, a number of sequences of a particular
genome are considered, the similarity between those sequences is even higher. It is therefore
advantageous to compress different chromosomes together, taking into account both self-chromosomal
and cross-chromosomal similarities [12]. A statistical study reveals that the similarity within a
single sequence is 4.5% of the total sequence length, whereas the similarity between two
chromosomes is 10%, of which 8% comes from cross-chromosomal similarity and 2% from
self-chromosomal similarity. Among 16 chromosomes it is 18%, of which cross-similarity
accounts for 16.8% and the rest is self-similarity.
A software tool called PatternHunter is used to search for both self- and cross-chromosomal
similarity; it can find exact repeats, approximate repeats and complementary palindromes. The
compression algorithm using self- and cross-chromosomal similarity is local in nature because
only the chromosomes of a particular genome are considered. It resembles a dictionary-based
encoding algorithm: each chromosome is compared with the others using PatternHunter, so there
are reference chromosomes against which target chromosomes are compared, and only the
differences are compressed. Pointers record the position and length of each common region in the
reference chromosomes. At decompression time the compressed differences and the reference
sequences are used to recover the original sequence.
Instead of compressing an individual genome sequence of an organism, the entire genome of the
organism can be compressed [13-14]. This gives a better saving percentage due to the redundant
nature of DNA sequences. Each genome sequence of the organism is compared with another
genome sequence of that organism and only the differences are compressed; the common regions
need not be compressed again. This, too, is like a dictionary-based encoding algorithm, because
each genome sequence is compared with a specific reference genome sequence. The compressed
data are therefore composed of the reference sequence, the difference sequences, and the locations
of the differences, instead of each DNA sequence being stored individually. How well this
approach performs depends on the similarity between the sequences; it is shown that a
mitochondria dataset gives a 195-fold compression rate.
Original sequence = Reference sequence + Differences from the reference
The GenBit compression algorithm [15] is designed for both repetitive and non-repetitive DNA
sequences. In this scheme all possible four-base combinations are enumerated: the input sequence
is divided into blocks of four bases each, so 2^8 = 256 combinations can be represented, and every
DNA segment containing four bases is replaced by an 8-bit binary number, e.g. "01010101". If
two consecutive fragments are the same, a specific bit "1" is appended as a 9th bit; if they differ, a
"0" is appended as the 9th bit to the 8-bit number. GenBit Compress is a simple algorithm without
a dynamic-programming approach. It takes a DNA sequence of length n as input and divides it into
n/4 fragments. Any left-over bases (fragment length < 4) keep their unique 2-bit codes (a="00",
g="01", c="10", t="11").
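The fragment scheme just described can be sketched in a few lines. This is an illustrative Python reconstruction of the GenBit idea as summarized above, not the authors' released tool; `genbit_encode` is a hypothetical name:

```python
# Each 4-base fragment becomes an 8-bit code (a=00, g=01, c=10, t=11),
# plus a 9th flag bit that is "1" when the next fragment repeats it.
BITS = {"a": "00", "g": "01", "c": "10", "t": "11"}

def genbit_encode(seq):
    frags = [seq[i:i + 4] for i in range(0, len(seq) - len(seq) % 4, 4)]
    out = []
    for i, frag in enumerate(frags):
        code = "".join(BITS[b] for b in frag)
        flag = "1" if i + 1 < len(frags) and frags[i + 1] == frag else "0"
        out.append(code + flag)
    # Left-over bases (fragment length < 4) keep their plain 2-bit codes.
    rest = seq[len(frags) * 4:]
    out.extend(BITS[b] for b in rest)
    return "".join(out)

# "atga" twice (flag 1 on the first), then three left-over bases.
print(genbit_encode("atgaatgaccc"))
```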
3. PROPOSED METHODOLOGY
The general encoding methodology is as follows: before encoding the next symbol of a DNA
sequence with the mapping technique, the algorithms search for direct repeats, reverse direct
repeats and complementary palindromes, which are explained in detail below. If a repeat of the
desired length is found, the algorithm counts its number of occurrences and represents it by the
mapped symbol followed by the repeat count. These algorithms differ from existing ones in that
they search for the best result among several compression techniques and emit the final output
with the least bpb.
3.1 Direct exact or approximate repeats in DNA sequence
Algorithm FS4DR uses this property.
A repeat in a DNA sequence means an exact or an approximate match. Approximate matches are
obtained by operations such as substitution (replacement), insertion and deletion.
An exact repeat means two substrings of a sequence consist of identical bases along the chosen
subsequence.
For example, in the subsequence "attcgtgtattcgtgt", the first eight bases "attcgtgt" and the last
eight bases "attcgtgt" match: the second subsequence is an exact copy of the first.
An adjacent exact repeat is also called a tandem repeat: a pattern of two or more nucleotides is
repeated and the repetitions are adjacent to each other, either directly or inverted.
An example is:
atcgatcgatcgatcg, in which the sequence “atcg” is repeated four times.
When the number of repetitions is not known, variable, or irrelevant, it is sometimes called a
variable number tandem repeat (VNTR); the term minisatellite has been used interchangeably
with VNTR.
When a unit of 10 to 60 nucleotides is repeated, it is called a minisatellite; these occur at more
than 1,000 locations in the human genome. Repeats with shorter units are known as
microsatellites or short tandem repeats (STRs); STR units are usually 2–10 nucleotides long.
When exactly two nucleotides are repeated, it is called a dinucleotide repeat (for example:
tctctctc…). The microsatellite instability in hereditary nonpolyposis colon cancer most commonly
affects such regions.
When three nucleotides are repeated, it is called a trinucleotide repeat (for example:
tagtagtagtag…), and abnormalities in such regions can give rise to trinucleotide repeat disorders.
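Counting adjacent repeats of a known motif, as in the tandem-repeat examples above, can be sketched in a few lines of Python (illustrative only; `count_tandem` is our own helper name, not part of the proposed algorithms):

```python
# Count how many times `motif` repeats back-to-back in `seq` from `start`.
def count_tandem(seq, motif, start=0):
    count = 0
    i = start
    while seq.startswith(motif, i):
        count += 1
        i += len(motif)
    return count

print(count_tandem("atcgatcgatcgatcg", "atcg"))  # 4
print(count_tandem("tctctctc", "tc"))            # 4 (a dinucleotide repeat)
```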
An approximate repeat means two substrings of a sequence become identical after one of the
following operations along the chosen subsequence.
For example, in the subsequence "atttgtgtattcgtgt", the first eight bases are "atttgtgt" and the last
eight are "attcgtgt". The second subsequence can be obtained from the first if the 4th base "t" of
the first subsequence is substituted (replaced) by "c".
If a DNA sequence is "attcgttattcgtgt", the second subsequence "attcgtgt" can be obtained from
the first one, "attcgtt", by inserting "g" between the 6th and 7th characters. This is called an
approximate match with insertion.
For an approximate match with deletion, consider the DNA sequence "attcgtgtattcgtt": the second
subsequence "attcgtt" can be obtained from the first one, "attcgtgt", by deleting the 7th base "g".
In some cases a substitution can be expressed as a combination of an insertion and a deletion. For
example, in the subsequence "atttgtgtattcgtgt" the second subsequence "attcgtgt" of length eight
can be obtained from the first one, "atttgtgt", by inserting "c" between the 3rd and 4th bases and
then deleting the "t" at the 5th position; it can equally be obtained by the simple replacement
described above.
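The three edit operations above are exactly those counted by the classic Levenshtein dynamic program. A minimal sketch (our own helper, not part of the proposed algorithms) confirms that each worked example needs a single operation:

```python
# Minimum number of substitutions, insertions and deletions turning s into t.
def edit_distance(s, t):
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (a != b)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("atttgtgt", "attcgtgt"))  # 1 (one substitution)
print(edit_distance("attcgtt", "attcgtgt"))   # 1 (one insertion)
print(edit_distance("attcgtgt", "attcgtt"))   # 1 (one deletion)
```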
1 atgaatgaatgaatgaatgaatga 24
1 atga 4
5 atga 8
9 atga 12
13 atga 16
17 atga 20
21 atga 24
Figure 1. An example of direct repeat of four nucleotides. The sequence is a part of DNA sequence.
3.2 Palindrome or reverse direct repeat, Inverted exact repeat or complementary
palindrome and inverted approximate repeat or approximate complementary
palindrome or approximate reverse complement repeats
Algorithms FS4RDR and FS4CP use these properties.
If a palindrome sequence is "acgttgca", the fragment "tgca" can be obtained from the fragment
"acgt" by reversing it.
An inverted repeat, also called a reversed repeat, complemented inverted repeat or reverse
complement repeat, is a sequence of bases that is the reverse complement of another sequence
further downstream.
The original sequence of nucleotides is written in the 5' to 3' direction.
For e.g. 5'-atgc-3'
The complementary sequence is written by matching base 'a' with 't' and 'g' with 'c'. The
resulting sequence is in the 3' to 5' orientation:
3'-tacg-5'
If the new sequence "3'-tacg-5'" is reversed, the result is the complementary sequence in the
correct orientation, i.e. 5'-gcat-3'.
If a sequence is "gcatatgc", the second subsequence "atgc" of length four is the reverse
complement of the first subsequence "gcat" of length four.
If no nucleotides or bases intervene between the subsequence and its downstream complement, it
is called a palindrome. So in the above case the second subsequence is a complementary
palindrome, or inverted exact repeat, of the first subsequence. Inverted repeats also indicate
regions capable of self-complementary base pairing (regions within a single sequence that can
base-pair with each other).
If a sequence is "gcttatgc", the second subsequence "atgc" is a reverse complement of the first
subsequence "gctt", but only an approximate one. The complement of the second subsequence is
"tacg" in the 3' to 5' direction, i.e. "gcat" in the original 5' to 3' direction; so if the 3rd base "t" of
the reference subsequence is replaced by "a", it matches the compared subsequence.
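The reverse-complement relation used throughout this subsection can be sketched as follows (illustrative helper names, not part of the proposed algorithms):

```python
# Complement (a<->t, g<->c), then reverse, keeping the 5'->3' convention.
COMP = {"a": "t", "t": "a", "g": "c", "c": "g"}

def reverse_complement(seq):
    return "".join(COMP[b] for b in reversed(seq))

print(reverse_complement("atgc"))  # gcat

# Detecting a complementary palindrome: the second half of "gcatatgc"
# is the reverse complement of the first half.
seq = "gcatatgc"
half = len(seq) // 2
print(seq[half:] == reverse_complement(seq[:half]))  # True
```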
1 ctaggatcgatcgatcgatcgatc 24
ctagctagctagctagctag -> reverse
Figure 2. An example of palindrome of four nucleotides. The sequence is a part of DNA sequence.
1 ctgatcagtcagtcagtcagtcagtcag 24
agtcagtcagtcagtcagtcagtc -> Complement
ctgactgactgactgactgactga -> Reverse
Figure 3. An example of complementary palindrome of four nucleotides. The sequence is a part of DNA
sequence.
3.3 Interspersed or Dispersed Repeats
These are copies of repetitive sequences interspersed throughout the genome: a genome sequence
contains two or more repeats of a specific subsequence, with two identical (or nearly identical)
nucleotide sequences sometimes separated by a stretch of non-repeated DNA.
For example, in the sequence "3' tagtacgttagt 5'", the last four bases "tagt" are an exact repeat of
the first four bases "tagt", while the intermediate subsequence "acgt" lies in a non-repeated
region. Such a repeat can be present in multiple copies in the genome and can be exact or
approximate. Interspersed repeats are classified as Short Interspersed Repeats (SIRs) and Long
Interspersed Repeats (LIRs).
3.4 Flanking Repeats or Terminal Repeat
This type of repeat consists of sequences that are repeated at both ends of a sequence. Direct
terminal repeats run in the same direction; inverted terminal repeats run in opposite directions.
3.5 Complementary Nature of Nucleotide Sequences
The DNA double strand is complementary in nature: if one strand is known, the other can be
formed from it simply by replacing 'a' with 't' and 'g' with 'c', and vice versa. The polymer
forms in a unidirectional manner from the 5' carbon, which is bound to a phosphate group, to the
3' carbon, which is bound to a hydroxyl group. DNA sequences are often written in a shorthand
notation using the one-letter name of each nucleotide, listing the nucleotides in the 5' to 3'
direction.
3.6 Compression technique
Original file (e.g. humdystrop.txt) -> Compression technique FS4 -> Compressed file (e.g. humdystrop_cmp.txt)
Figure 4. The compression technique.
Figure 5. An example of sequence humdystrop.txt
3.7 Decompression Technique
Compressed file (e.g. humdystrop_cmp.txt) -> Decompression technique RFS4 -> Reconstructed file (e.g. humdystrop_rec.txt)
Figure 6. The decompression technique.
Figure 7. An example of compressed sequence humdystrop_cmp.txt
4. SIMULATION RESULTS
Table 3 shows information about the ten standard benchmark genome sequences [17]. Table 4
shows the results in bytes for fragment size four (FS4), fragment size four with direct repeat
(FS4DR), fragment size four with reverse direct repeat (FS4RDR) and fragment size four with
complementary palindrome (FS4CP); RFS4 means reverse fragment size four. Table 5 shows the
resultant bits per base (bpb) of our proposed compression algorithm when compressing each
genome sequence separately.
Table 3. Information about the ten standard benchmark DNA sequences.
Sequences Name Length(Bytes) Source File Size(KB)
chmpxx 121,024 Chloroplasts 118.1875
chntxx 155,844 152.1914
humdystrop 38,770 Human 37.8613
humghcsa 66,495 64.9365
humhdabcd 58,864 57.4844
humhprtb 56,737 55.4072
mpomtcg 186,608 Mitochondria 182.2344
mtpacg 100,314 97.9629
hehcmvcg 229,354 Viruses 223.9785
vaccg 191,737 187.2431
Table 4. The compressed sequence lengths in bytes under the different compression techniques.
Sequences Name FS4 FS4DR FS4RDR FS4CP
chmpxx 31,945 30,045 30,403 30,203
chntxx 40,960 38,960 38,900 38,700
humdystrop 10,006 10,083 10,011 10,020
humghcsa 18,160 16,576 16,776 16,976
humhdabcd 15,295 14,576 14,376 14,376
humhprtb 14,984 14,153 14,053 14,053
mpomtcg 51,499 46,388 46,188 46,088
mtpacg 26,983 25,008 24,908 24,808
hehcmvcg 60,806 56,806 57,306 57,206
vaccg 52,169 47,852 47,352 47,152
Table 5. The bpb achieved by the different compression techniques.
Sequences Name FS4 FS4DR FS4RDR FS4CP
chmpxx 2.111647 1.986052 2.009717 1.996497
chntxx 2.102615 1.999949 1.996869 1.986602
humdystrop 2.064689 2.080578 2.065721 2.067578
humghcsa 2.184826 1.994255 2.018317 2.042379
humhdabcd 2.07869 1.980973 1.953792 1.953792
humhprtb 2.112766 1.995594 1.981494 1.981494
mpomtcg 2.207794 1.988682 1.980108 1.975821
mtpacg 2.151883 1.994378 1.986403 1.978428
hehcmvcg 2.120948 1.981426 1.998866 1.995378
vaccg 2.17669 1.900741 1.975706 1.967362
Average bpb 2.131255 1.990263 1.996699 1.994533
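For reference, the bpb figures in Table 5 follow from Tables 3 and 4 as bpb = (compressed size in bytes × 8) / (sequence length in bases); a quick illustrative sketch (`bpb` is our own helper name):

```python
# bits per base = compressed bits / number of bases in the sequence.
def bpb(compressed_bytes, sequence_length):
    return compressed_bytes * 8 / sequence_length

# chmpxx under FS4DR: 30,045 bytes for 121,024 bases (Tables 3 and 4).
print(round(bpb(30045, 121024), 6))  # 1.986052, matching Table 5
```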
5. CONCLUSIONS
Our proposed compression algorithm was tested on ten standard benchmark DNA sequences.
The experimental results showed that the bpb is less than two bits per base when we compress
these sequences using any of the three simple properties. After applying all the techniques, our
algorithm searches for the optimal solution and stores that result as the final output. When none of
the above properties is used, the compression factor is approximately equal to 2, because each
nucleotide base is represented by two bits.
ACKNOWLEDGEMENTS
This research work is supported in part by the Computer Science and Engineering Department of
the University of Calcutta, the Bioinformatics Centre of Bose Institute, and the Academy of
Technology under WBUT. We would like to thank Dr. Zumur Ghosh and Asst. Professor Utpal
Nandi for their valuable suggestions.
REFERENCES
[1] Subhankar Roy, Sunirmal Khatua, Sudipta Roy and Prof. Samir K. Bandyopadhyay, “An Efficient
Biological Sequence Compression Technique Using LUT and Repeat in the Sequence”, IOSRJCE,
Vol. 6, Issue 1, pp. 42-50, Sep-Oct. 2012.
[2] Subhankar Roy, Sunirmal Khatua, “Compression Algorithm for all Specified Bases in Nucleic Acid
Sequences”, IJCA, Vol. 75, No. 4, pp. 29-34, August. 2013.
[3] Subhankar Roy, Sunirmal Khatua, “Biological Data Compression Methods Based Upon
Microsatellite”, IJARCSSE, pp. 1321-1326, Vol. 3, Issue 7, July 2013.
[4] http://www.gzip.org.
[5] Wu Choi Ping Paula, “An Approach to Multiple DNA Sequences Compression”, The Hong Kong
Polytechnic University, xiii, 115 p, PolyU Library Call No.: [THS] LG51 .H577M EIE 2009 Wu ,
February 2009.
[6] Shanika Kuruppu, Simon J. Puglisi and Justin Zobel,” Relative Lempel-Ziv Compression of Genomes
for Large-Scale Storage and Retrieval”, Springer-Verlag Berlin Heidelberg, LNCS 6393, pp. 201–
206, 2010.
[7] Ziv, J. and Lempel, A., "A Universal Algorithm for Sequential Data Compression", IEEE Trans.
Inform. Theory, 23(3):337-343, 1977.
[8] Khalid Sayood, “Introduction to Data Compression, 3rd ed.”, University of Nebraska, 2006.
[9] Mark Nelson and Jean-Loup Gailly, “The Data Compression Book, 2nd ed.”, 2012.
[10] Alistair Moffat, “Implementing the PPM Data Compression Scheme”, IEEE Transactions on
Communications, Vol. 38, No. 11, November 1990.
[11] Zijun Wu, “A Tutorial of The Context Tree Weighting Method: Basic Properties”, November 9, 2005.
[12] Choi-Ping Paula Wu1, Ngai-Fong Law and Wan-Chi Siu, “Cross chromosomal similarity for DNA
sequence compression”, Biomedical Informatics Publishing Group, pp. 412-416, vol. 2, No. 9, July
14, 2008.
[13] Hyoung Do Kim and Ju-Han Kim, “DNA Data Compression Based on the Whole Genome
Sequence”, Journal of Convergence Information Technology, Vol. 4, No. 3, September 2009.
[14] Heba Afify, Muhammad Islam and Manal Abdel Wahed, “DNA Lossless Differential Compression
Algorithm Based on Similarity of Genomic Sequence Database”, International Journal of Computer
Science & Information Technology, pp. 145 – 154, Vol. 3, No. 4, August 2011.
[15] P.Raja Rajeswari, Dr. Allam AppaRao, “Genbit Compress Tool (GBC): A Java-Based Tool to
Compress DNA Sequences and Compute Compression Ratio (bits/base) of Genomes”, International
Journal of Computer Science and Information Technology, pp. 181-191, Vol. 2, No. 3, June 2010.
[16] Nour S. Bakr and Amr A. Sharawi, "DNA Lossless Compression Algorithms: Review", American
Journal of Bioinformatics Research, pp. 72-81, 2013.
[17] The GeNML homepage: http://www.cs.tut.fi/~tabus/genml/results.html.
Authors
Subhankar Roy is currently working as an Assistant Professor in the Department of
Computer Science & Engineering, Academy of Technology, India. He received his
B.Tech and M.Tech degrees in Computer Science and Engineering in 2012, both
from the University of Calcutta. He is doing research work at the University of
Calcutta under the guidance of Mr. Sunirmal Khatua. His research interests include
bioinformatics and compression techniques. He is a member of IEEE.
Sunirmal Khatua is currently an Assistant Professor in the Department of Computer
Science and Engineering, University of Calcutta, India. He received the M.E. degree
in Computer Science and Engineering from Jadavpur University, India in 2006, and
is pursuing his PhD in Cloud Computing at Jadavpur University. His research
interests are in the areas of Distributed Computing, Cloud Computing and Sensor
Networks. He is a member of IEEE.