Analysis of SHA3-256 as a Random Number Generator
Andrew Pollock, Chad Maybin and Elliott Whitling
Abstract—In this paper, we present an empirical evaluation of the statistical randomness of the SHA3-256 hash algorithm's output. A hash algorithm must produce random, independent output in order to be considered cryptographically secure, and this output must remain random across data sets of varying size. Testing for randomness has historically been conducted using the industry-standard NIST STS. However, that tool is designed for statistical analysis of small datasets and does not scale to larger sample sizes. To assess SHA3-256, we adapted several tests from STS to run on massive data sets. Because the assessment of randomness changes with scale, theoretical computational statistics were given significant consideration. Between 996 million and 101 billion samples were tested using five of the NIST STS tests. SHA3-256 hashes showed no evidence against randomness in four of the five tests. However, the longest runs test did show evidence against randomness, and that experiment should be replicated. Overall, we present a first pass at Big Data statistical cryptanalysis, with many opportunities to improve on both STS and our additions.
Index Terms—SHA3, Statistics, HPC, NIST, STS
I. INTRODUCTION
The Keccak sponge function was accepted as the SHA3 standard in FIPS 202. The National Institute of Standards and Technology (NIST) conducted extensive cryptographic testing under the broad battery known as the Cryptographic Algorithm Validation Program ("CAVP") to validate candidate functions. NIST uses the Statistical Testing Suite ("STS") to ensure that the output of an encryption algorithm is suitably random. If an algorithm were found to be predictable, the strength of the encryption would suffer: as an algorithm weakens, it becomes much easier for an attacker to compromise communications encrypted with it. The father of modern computing, John von Neumann, warned against the mathematical generation of random numbers. To paraphrase, semi-numeric algorithmic approaches to random number generation cannot achieve perfect randomness because, given identical input and sufficient compute time, the value can be reliably reproduced. Schneier [1] articulates two traits for cryptographically secure pseudo-random sequences:
1) It looks random. This means that it passes all the statistical tests of randomness that we can find.
2) It is unpredictable. It must be computationally infeasible to predict what the next random bit will be, given complete knowledge of the algorithm or hardware generating the sequence and all of the previous bits in the stream.
The authors are with the Masters of Data Science program, Southern Methodist University, Dallas, TX 75205 USA (please direct communication to ewhitling@smu.edu).
Manuscript received ——; revised ——–.
In the seminal book The Art of Computer Programming, Volume 2, Knuth [2] describes fundamental statistical tools for testing pseudo-random sequences. Of the discussed tests, the spectral test is considered one of the most valuable, as it can identify prominent pseudo-random number generators (PRNGs) such as the linear feedback shift register. These tools represent some of the first attempts to assess deterministic functions as random number generators.
With the Big Data revolution came the ability to compute previously unheard-of sample sizes. The NIST STS is not designed to accommodate massive data. We propose and deliver the next step in random number generator testing: a test suite designed to scale, in principle without bound, toward the total output space of a random number generator. In this paper, we attempt to empirically determine the randomness, and therefore the security, of SHA3-256 by applying tests based on the NIST Statistical Testing Suite to large sequences of SHA3-generated values.
SHA3 is an ideal first candidate for testing near-population data samples (NPDS). The Keccak sponge function was recently selected as the winner of the SHA3 standard competition, making it the next hash algorithm to be universally adopted [3]. NIST used STS to assess Keccak but was limited to a sample size contained within the memory of a single computer [4].
Despite the importance of randomness in hash sequences, very little effort has been put forth by the security community to assess significant samples from the total output population. The emergence of Big Data techniques from the Data Science community warrants their adaptation to cryptography problems. In this paper, we present the results of NPDS testing that is several orders of magnitude larger than the NIST test.
The history of detecting non-randomness has focused on the theoretical crafting of pseudo-random number generators (PRNGs) against known statistical techniques. However, little emphasis has been placed on applying these tools at suitable scale. All implementations known to the authors have only conducted tests on data sets that can be stored in memory. In the source code of STS, the authors explicitly declare such a limitation.
Six STS tests were adapted to scale up: the monobit, random block, longest runs, independent runs, binary matrix and spectral tests. Five of the six passed a validation process designed to ensure consistent results with the existing NIST STS toolset. The spectral FFT test failed validation, as it always reported that any input had evidence against randomness, and was therefore not included in the large-scale tests. The other five tests ran on between 996 million and 101 billion samples. Only one test, longest runs, suggested evidence against randomness for SHA3-256.
Our results suggest that the Keccak sponge function as SHA3-256 is probably a suitable replacement for SHA2. Despite one test providing evidence against randomness, the overwhelming conclusion is that SHA3-256 appears random. The longest runs test should be re-validated and replicated to ensure a consistent result. Further, the spectral test should be assessed: it is widely considered one of the superior statistical tests of randomness, and its failure at scale is unsettling. Finally, we recommend that the community as a whole rethink the scale of statistical testing; the current toolset is not ready to handle computation beyond the limits of a single CPU and its RAM.
The remainder of this paper is organized as follows. Section II provides a brief explanation of the mathematical background of the NIST STS tests. In Section III, we present our approach to validating the tests. Section IV briefly describes our SHA3 generation process. Section V contains our results from one-hundred billion SHA3-256 hashes analyzed with the tool; the main focus is the statistical security of SHA3-256, but we also touch briefly on validation using known SHA2 datasets. The final section, VI, concludes with a summary of major results, a discussion of the limitations of the new toolset, and suggested steps for further research.
II. NIST TESTS
In this section we cover the basics of the NIST STS tests and the core functions used to compute P-values. At the end, we summarize the constraints of STS, with a focus on scaling.
A. Monobit
The monobit test is perhaps the simplest of all the STS tests. Given any hash input as binary, a truly random population should have roughly equal quantities of 0's and 1's. A test statistic, S_n, is generated by transforming the bit sequence (\epsilon) into a sum in which each bit is doubled and reduced by one:

S_n = \sum_{i=1}^{n} (2\epsilon_i - 1)    (1)

The P-value is then calculated as:

P\text{-value} = \operatorname{erfc}\left( \frac{|S_n|}{\sqrt{2n}} \right)    (2)

with erfc being the complementary error function.
A computed P-value smaller than 0.01 indicates that the sequence is likely non-random. The sign of S_n signifies the relative proportion of 0's and 1's in the sequence: a large positive S_n means that more 1's than 0's exist in the input. NIST recommends that test sequences be composed of at least 100 bits.
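To make this concrete, the following minimal Python sketch (our illustration, not the STS source) computes the monobit P-value in a streaming fashion over chunks of bits; accumulating S_n incrementally is what allows the statistic to scale past memory-resident inputs.

import math

def monobit_p_value(bit_chunks):
    """Streaming monobit test over an iterable of 0/1 strings."""
    s_n = 0  # running sum of (2*bit - 1), per Equation 1
    n = 0    # total number of bits seen
    for chunk in bit_chunks:
        s_n += sum(2 * int(b) - 1 for b in chunk)
        n += len(chunk)
    # P-value = erfc(|S_n| / sqrt(2n)), per Equation 2
    return math.erfc(abs(s_n) / math.sqrt(2 * n))

# A perfectly balanced sequence gives S_n = 0 and a P-value of 1.0.
print(monobit_p_value(["01" * 512]))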
B. Random Block
The Random Block (also known as binary frequency) test is an extension of monobit. Again, the quantity of 0's and 1's should be approximately equal, but in this version the test is applied within M-bit blocks rather than over the whole input sequence. This test is identical to monobit when M = 1. A series of blocks is created using N = \lfloor n/M \rfloor; when M = 4 for a 256-bit SHA3 sequence, N = 64.

The proportion of ones within each block is calculated as:

\pi_i = \frac{1}{M} \sum_{j=1}^{M} \epsilon_{(i-1)M + j}    (3)
The test statistic is calculated as:

\chi^2(\text{obs}) = 4M \sum_{i=1}^{N} \left( \pi_i - \frac{1}{2} \right)^2    (4)

The P-value is then calculated from the test statistic as:

P\text{-value} = \operatorname{igamc}\left( \frac{N}{2}, \frac{\chi^2(\text{obs})}{2} \right)    (5)

As the P-value approaches zero, the ratio of 0's and 1's becomes increasingly uneven.
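A matching sketch of Equations 3 through 5, assuming SciPy's gammaincc as the regularized upper incomplete gamma function (the role igamc plays in STS):

from scipy.special import gammaincc  # regularized upper incomplete gamma (igamc)

def block_frequency_p_value(bits, M):
    """Random block (binary frequency) test over a 0/1 string, block size M."""
    N = len(bits) // M  # N = floor(n / M) complete blocks
    chi_sq = 4 * M * sum(
        (bits[i * M:(i + 1) * M].count("1") / M - 0.5) ** 2 for i in range(N)
    )
    return gammaincc(N / 2, chi_sq / 2)  # igamc(N/2, chi^2(obs)/2), Equation 5

# A 256-bit alternating sequence with M = 4 yields N = 64 balanced blocks.
print(block_frequency_p_value("01" * 128, M=4))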
C. Independent Runs
The independent runs test examines contiguous series of identical bits (runs of 0's or 1's). The goal is to ensure that the lengths of these runs are consistent with randomness.

The proportion of ones is calculated as:

\pi = \frac{\sum_{j} \epsilon_j}{n}    (6)

The test statistic is calculated as:

V_n(\text{obs}) = \sum_{k=1}^{n-1} r(k) + 1    (7)

where r(k) = 0 when \epsilon_k = \epsilon_{k+1}, and r(k) = 1 otherwise.

The P-value is calculated as:

P\text{-value} = \operatorname{erfc}\left( \frac{\left| V_n(\text{obs}) - 2n\pi(1-\pi) \right|}{2\sqrt{2n}\,\pi(1-\pi)} \right)    (8)

with erfc being the complementary error function.
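Equations 6 through 8 translate directly into a few lines of Python; again, this is an illustrative sketch rather than STS code:

import math

def runs_p_value(bits):
    """Independent runs test over a 0/1 string."""
    n = len(bits)
    pi = bits.count("1") / n  # proportion of ones, Equation 6
    # V_n(obs): each unequal neighbour pair starts a new run, Equation 7
    v_obs = 1 + sum(bits[k] != bits[k + 1] for k in range(n - 1))
    numerator = abs(v_obs - 2 * n * pi * (1 - pi))
    denominator = 2 * math.sqrt(2 * n) * pi * (1 - pi)
    return math.erfc(numerator / denominator)  # Equation 8

print(runs_p_value("1001101011" * 10))  # a short demonstration input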
D. Longest Runs
The longest runs test assesses the largest series of 1's in an input sequence.

The test statistic is calculated as:

\chi^2(\text{obs}) = \sum_{i=0}^{K} \frac{(v_i - N\pi_i)^2}{N\pi_i}    (9)

where K and N depend on the block size M: when M = 8, K = 3 and N = 16; when M = 128, K = 5 and N = 49; and when M = 10^4, K = 6 and N = 75.

The P-value is then calculated from the test statistic as:

P\text{-value} = \operatorname{igamc}\left( \frac{K}{2}, \frac{\chi^2(\text{obs})}{2} \right)    (10)

A sequence is considered non-random if the P-value falls below the 0.01 threshold.
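For the smallest parameter set (M = 8, K = 3, N = 16, i.e. a 128-bit input), Equations 9 and 10 can be sketched as follows; the category probabilities \pi_i are the tabulated values from SP 800-22 [5], and SciPy's gammaincc again stands in for igamc:

from scipy.special import gammaincc

def longest_runs_p_value(bits):
    """Longest-run-of-ones test for M = 8, K = 3, N = 16 (128-bit input)."""
    M, K, N = 8, 3, 16
    pi = [0.2148, 0.3672, 0.2305, 0.1875]  # probabilities for runs <=1, 2, 3, >=4
    v = [0, 0, 0, 0]                       # observed counts per category
    for i in range(N):
        block = bits[i * M:(i + 1) * M]
        longest = max(len(run) for run in block.split("0"))
        v[min(max(longest, 1), 4) - 1] += 1
    chi_sq = sum((v[i] - N * pi[i]) ** 2 / (N * pi[i]) for i in range(K + 1))
    return gammaincc(K / 2, chi_sq / 2)    # igamc(K/2, chi^2(obs)/2), Equation 10

print(longest_runs_p_value("0110" * 32))   # a 128-bit demonstration input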
E. Binary Matrix
The Binary Matrix Rank test (section 2.5 of [5]) is intended to detect linear dependence among fixed-length subsets of the bit sequence. The test iterates over blocks, where n is the length in bits, M is the number of rows and Q is the number of columns of each matrix. Both M and Q are hard-coded to 32 in the NIST STS implementation. The test starts by creating blocks using:

N = \left\lfloor \frac{n}{MQ} \right\rfloor    (11)

Each block is filled row by row sequentially, and any remaining bits are carried over to the next block. Each matrix is then ranked over GF(2), and a chi-square statistic is computed:

\chi^2(\text{obs}) = \frac{(F_M - 0.2888N)^2}{0.2888N} + \frac{(F_{M-1} - 0.5776N)^2}{0.5776N} + \frac{(N - F_M - F_{M-1} - 0.1336N)^2}{0.1336N}    (12)

where F_M is the count of matrices with full rank, F_{M-1} is the count of matrices with rank one less than full, and N - F_M - F_{M-1} is the remainder.

The P-value is then calculated from the test statistic as:

P\text{-value} = \operatorname{igamc}\left( 1, \frac{\chi^2(\text{obs})}{2} \right) = e^{-\chi^2(\text{obs})/2}    (13)

Large \chi^2(\text{obs}) values indicate deviation of the rank distribution from that of a random sequence.
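The rank computation over GF(2) is the heart of the test. Below is a compact sketch of ours using integer bit masks for matrix rows, together with the chi-square and P-value of Equations 12 and 13; note that gammaincc(1, x) reduces to e^{-x}, matching the closed form above:

from scipy.special import gammaincc

def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are given as integers."""
    rank = 0
    rows = list(rows)
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            lsb = pivot & -pivot  # lowest set bit of the pivot row
            # XOR the pivot into every remaining row that shares that bit
            rows = [r ^ pivot if r & lsb else r for r in rows]
    return rank

def binary_matrix_rank_p_value(bits, M=32, Q=32):
    """Binary matrix rank test over a 0/1 string (Equations 11-13)."""
    N = len(bits) // (M * Q)  # Equation 11
    f_m = f_m1 = 0
    for i in range(N):
        block = bits[i * M * Q:(i + 1) * M * Q]
        rows = [int(block[r * Q:(r + 1) * Q], 2) for r in range(M)]
        rank = gf2_rank(rows)
        f_m += rank == M        # full-rank matrices
        f_m1 += rank == M - 1   # rank M - 1 matrices
    rest = N - f_m - f_m1
    chi_sq = ((f_m - 0.2888 * N) ** 2 / (0.2888 * N)
              + (f_m1 - 0.5776 * N) ** 2 / (0.5776 * N)
              + (rest - 0.1336 * N) ** 2 / (0.1336 * N))
    return gammaincc(1, chi_sq / 2)  # = exp(-chi^2(obs)/2), Equation 13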
F. Spectral
The spectral test (section 2.6 of [5]) is based on the linear-algebraic approach of mapping Euclidean distances into Hamming, Banach or Hilbert space. In this case, the algorithm maps each bit of the Hamming-space data to either a negative or positive unit. Once mapped to a sequence X, a Discrete Fourier Transform (DFT) is applied, returning the Fourier coefficients of the sequence. A modulus function applied to this new sequence produces a series of peaks from the original data. Under randomness, 95 percent of these peaks should fall below a threshold T. The theoretical count of peaks below T, N_0 = 0.95n/2, is compared to the actual count, N_1. The test statistic d is calculated as:

d = \frac{N_1 - N_0}{\sqrt{n(0.95)(0.05)/4}}    (14)

And the P-value is calculated as:

P\text{-value} = \operatorname{erfc}\left( \frac{|d|}{\sqrt{2}} \right)    (15)

with erfc being the complementary error function.

A small |d| indicates that the number of peaks below T is close to the expected 95 percent and yields a P-value at or above the significance level, consistent with randomness; a large |d| indicates that the peak count deviates from expectation and drives the P-value toward zero.
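The in-memory computation is straightforward; the NumPy sketch below assumes the revised SP 800-22 threshold T = \sqrt{n \ln(1/0.05)} and illustrates why the test wants the entire sequence at once, since the FFT is taken over the full sequence:

import math
import numpy as np

def spectral_p_value(bits):
    """Spectral (DFT) test over a 0/1 string (Equations 14 and 15)."""
    n = len(bits)
    x = np.fromiter((2 * int(b) - 1 for b in bits), dtype=float, count=n)
    peaks = np.abs(np.fft.fft(x))[: n // 2]  # moduli of the first n/2 coefficients
    t = math.sqrt(n * math.log(1 / 0.05))    # 95 percent peak-height threshold T
    n0 = 0.95 * n / 2                        # expected number of peaks below T
    n1 = int(np.count_nonzero(peaks < t))    # observed number of peaks below T
    d = (n1 - n0) / math.sqrt(n * 0.95 * 0.05 / 4)
    return math.erfc(abs(d) / math.sqrt(2))  # Equation 15

print(spectral_p_value("1001010011" * 100))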
G. Core functions
There are two functions commonly used in the NIST tests, igamc and erfc. They serve different purposes, but between them the two are included in every test. The function erfc, or complementary error function, is an ANSI-defined formula included in the C math.h by default; however, NIST STS utilizes the CEPHES version. The complementary error function is used to calculate the likelihood that a value falls outside of a defined range, often defined by a Gaussian distribution [5].

igamc is the complemented incomplete gamma function and returns a value normalized between 0 and 1. Two parameters are passed to the function, a and x, and the CEPHES implementation switches its evaluation strategy around the point a = x. Both parameters must be non-negative, though x may equal zero (a > 0 and x >= 0) [5]. igamc is the complement of the regularized lower incomplete gamma function, defined as:

P(a, x) = \frac{1}{\Gamma(a)} \int_0^x e^{-t} t^{a-1} \, dt    (16)

where P(a, 0) = 0 and P(a, \infty) = 1.
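In practice, igamc corresponds to the regularized upper incomplete gamma function Q(a, x) = 1 - P(a, x). One route to replicating the STS P-value machinery outside the C code (our suggestion, not part of STS) is SciPy, which exposes both halves directly:

from scipy.special import gammainc, gammaincc

a = 5.0
print(gammainc(a, 0.0))    # P(a, 0) = 0, per Equation 16
print(gammaincc(a, 0.0))   # Q(a, 0) = igamc(a, 0) = 1
print(gammainc(a, 1e6))    # P(a, x) -> 1 as x grows large
print(gammainc(a, 4.0) + gammaincc(a, 4.0))  # P + Q = 1 for any valid a, x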
H. Constraints and Improvements
Despite the well-constructed source code of STS, the suite has a critical issue: scaling to massive datasets. Initial runs of SHA3-256 used a compiled version of the official C source, and the key error was that large data sizes produced igamc overflow errors. igamc is used to create a custom distribution for hypothesis testing in six of the fifteen tests.

igamc should, in principle, scale to an infinite population size, but the NIST implementation has a built-in constraint: it can only scale to the size of memory on a single server node. This severely limits the scale of the tests; our test server could not surpass a few million hashes before producing an overflow error. Memory issues are at the core of most of NIST's limitations. Many constants, like the maximum column and row sizes in the Binary Matrix test, appear to be set to protect against overflow errors. Despite these measures, the bitstream size is a critical limitation preventing large-scale execution.
There has been much research in the area of cryptographic hash functions and novel ways of generating pseudorandom and random sequences. Bertoni et al. [6] covered the submission of the Keccak sponge function to NIST for consideration as the SHA3 algorithm; the functions are tested algebraically in that work, but their output is not tested empirically. Gholipour and Mirzakuchaki [7] applied the NIST test suite to output generated from an algorithm based on the Keccak hash function (which was later adopted as SHA3). The sample of data input into that test was much smaller, however: roughly 150 megabytes of output. We were unable to find any research proposing to test output from any algorithm at a scale several orders of magnitude greater.
III. VALIDATION OF HPC TESTS
The purpose of the testing proofs was to make sure that each test was both understood and scalable in the context of this project. Accordingly, the source code and mathematical background of each test were taken from the NIST documentation, and each test was run against an increasing variety of known scenarios to ensure that results were as expected before expanding the testing to ever larger binary strings.

Fig. 1: Five of the six tests' P-values by series (panels: Monobit, Longest Runs, Binary Frequency, Independent Runs, Binary Matrix). The scatter of P-values indicates the tests and validations are working. Note the difference in y-axis scaling for Independent Runs. FFT is excluded, as all testing resulted in P-values indistinguishable from zero.
All six tests were conducted on a Lenovo ThinkStation S20 with 24 GB of RAM and 8 cores at 2 GHz. The tests were validated one at a time, with each test the sole task of the computer.
The testing proof methodology was fundamentally the same for each NIST test used (see appendix for flow chart):
1) SHA2-256 is used to create 10,000 outputs from random sentences. The test should produce no evidence against randomness.
2) A repeating series of 0s and 1s that is known to be non-random is tested. The test should demonstrate evidence against randomness.
3) Step 1 is repeated, but at a larger scale of 1,000,000 sentences.
4) A random binary string produced by Random.Org should produce no evidence against randomness. Note that Random.Org vets its service using NIST STS to ensure its samples show no evidence against randomness.
5) Step 1 is repeated with SHA3-256, using a different 10,000 sentences.
6) SHA3-256 hashes around another 33,000 sentences taken from the New King James Bible. This is meant to emulate a large body of text with cohesive themes, unlike our previous random generation. There should be no evidence against randomness.
7) Large bitstreams are created and passed through SHA3-256 in multiple series, producing sample sizes between approximately one thousand and one million. This final test should show no evidence against randomness and no significant difference in P-values between iterations.
Step 7 is the most important part of the validation process. The process had to ensure that spurious results were not being produced as a by-product of the test itself. For example, the length in bits should produce a P-value indistinguishable from other lengths. Bit lengths were tested at intervals between 1,024 and 896,000, doubling each previous iteration, for a total of 12 lengths tested. Additionally, there should be no impact from the particular sequence used: each bit-length test was conducted in twenty iterations so that no contiguous seeds or PRNGs affected the result. A Tukey-Kramer differences analysis was conducted between the series and bit lengths to ensure that the P-value outputs did not differ at a statistically significant level.
Inconsistent P-values are a strong indication that the validation process worked: random generation of hashes should not produce identical P-values. The noise among P-values in Figures 1 and 2 supports the validation process. Note that the independent runs test has a larger range and wider confidence intervals than the other tests, which visually reduces the apparent noise.
The FFT test fails all inputs as non-random regardless of bit length or series. Our diagnostic efforts could not reveal a flaw in the implementation in time to include it in our large-scale run on the supercomputer. Additionally, FFT may need an entire code redesign to work at scale. One of the problems we ran into with FFT is that the test must be run against a full data sequence: the FFT values are computed across the entire sequence, so breaking the sequence into multiple pieces does not simply extend the original calculation. The definition of peaks is likewise contingent on the full sequence. Had our FFT test returned meaningful values, we planned to examine a distribution of P-values over smaller tests, as creating a distributed FFT model was out of scope for our project. Consequently, the FFT test was not included in the full run.
IV. SHA3 HASH GENERATION
A Python script automated the generation of 256-bit SHA3 hashes using the Keccak team's Python implementation. We utilized their Keccak function with a message pair consisting of the input hex string's length and the string itself. The input string was generated by seeding the process with a 77-character string and then adding one character per hash generated. The bitrate was set to 1088 bits with a capacity of 512 bits, the SHA3-256 parameters. Per the Keccak implementation, the SHA3 domain-separation suffix 0x06 was selected, and the output bit length was set to 256 [6].
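The exact seed string is not given here, so the sketch below substitutes a hypothetical 77-character placeholder; Python's standard hashlib.sha3_256, which implements Keccak with rate 1088, capacity 512 and the 0x06 domain suffix, stands in for the reference Keccak implementation described above:

import hashlib

def generate_hashes(count, seed="s" * 77):
    """Yield SHA3-256 digests of a seed string grown by one character per hash."""
    message = seed  # hypothetical stand-in for the paper's 77-character seed
    for _ in range(count):
        yield hashlib.sha3_256(message.encode()).digest()
        message += "a"  # extend the input by one character, as described above

# Example: hex digests of the first three generated hashes.
for digest in generate_hashes(3):
    print(digest.hex())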
V. RESULTS
The large-scale testing was completed on Southern Methodist University's ManeFrame supercomputer. ManeFrame uses a queue-based system to manage tasks, so all jobs were interlaced with other users' computations. ManeFrame runs Scientific Linux 6 (64-bit) using the SLURM scheduler and offers several queue types depending on duration and node requirements. Our testing was conducted on the serial queue, with a maximum duration of 24 hours per job allocation. The serial queue has access to 384 worker nodes, each with 24 GB of RAM and an 8-core Intel(R) Xeon(R) X5560 CPU at 2.80 GHz.
Both SHA2-256 and SHA3-256 were assessed, using two variants of the suite. A full test suite used all test methods; a quick test suite, composed of monobit and binary frequency, was also launched with the aim of producing as many samples as possible. In general, SHA2 hash generation was significantly faster than SHA3 hash generation, and the SHA2 test suites both have an order of magnitude greater sample size than their SHA3 counterparts. A post-mortem of the Python implementation revealed a construct that throttled SHA3 generation. This defect was fixed and did not result in aberrant hash outputs; everything was functioning correctly, only slower than optimal.
A total of 15.267 billion hashes were produced for the SHA2 full suite. The quick suite produced and tested 101.976 billion samples. SHA3 produced 0.996 and 23.180 billion samples for the full and quick suites, respectively. All samples generated within a suite were unique and sequential, but the actual values were not stored due to hard disk limitations.
A. Monobit
The monobit test provided no evidence against randomness for any test suite; all P-values returned were larger than the 0.01 significance level. Both SHA2 samples produced P-values above 1.0. This appears to be caused by the test statistic S_n reflecting such an approximately even distribution of 0's and 1's that, when the P-value was calculated using Equation 2, the absolute value of S_n fell outside the practical limits of the erfc implementation.

It should be noted that SHA3-256 P-values decreased by more than half as the sample size increased. This may be a by-product of randomness in the calculation of P-values with large sample sizes, or it may hint at a real trend; our current results cannot distinguish between the two possibilities. Replication of this experiment, along with storage of confidence intervals or standard errors, would enable the methods necessary to determine whether SHA3 monobit P-values do trend toward the 0.01 significance level as the sample size increases.
Fig. 2: Five of the six tests' P-values by bit length (panels: Monobit, Longest Runs, Binary Frequency, Independent Runs, Binary Matrix). The scatter of P-values indicates the tests and validations are working. Note the difference in y-axis scaling for Independent Runs. FFT is excluded, as all testing resulted in P-values indistinguishable from zero.
B. Random Block
The random block test produced no evidence against randomness for any test suite. The SHA2-256 suites had P-values of 0.5169 and 0.9343 for the full and quick suites, respectively. These are well above the 0.01 significance level, indicating an even distribution of 0s and 1s within M-bit blocks. The SHA3-256 suites also had P-values larger than 0.01, but at 0.1338 and 0.2762 for full and quick respectively, the output showed a somewhat more uneven distribution of 0's and 1's. As NIST notes, the chosen significance level marks an arbitrary point beyond which output should be considered non-random, and as P-values approach that point, the methodology indicates the distribution is becoming more uneven. However, both values are an order of magnitude above the 0.01 threshold and consequently do not show any evidence against randomness.
C. Longest Runs
The SHA2 full test suite calculated a P-value of 2.0, indicating no evidence against randomness. This is an atypical P-value, but it is likely caused by the methodology: the value is calculated with a chi-square that must have resulted in a ratio of 0's to 1's so evenly distributed that the largest series of 1's is near 1. We conjecture that the incomplete gamma function did not contain an upper limit large enough to encompass the value produced by the chi-square.

TABLE I: The resultant P-values of each test and the related sample size

Hash algorithm | Sample size     | Monobit             | Longest Runs | Independent Runs | Binary Frequency    | Binary Matrix
SHA2-256       | 15,267,000,000  | 1.2596801123956967  | 2.0          | 1.0              | 0.51695668919649718 | 1.0
SHA2-256       | 101,976,000,000 | 1.1912280400600554  | -            | -                | 0.93439071101230919 | -
SHA3-256       | 996,700,000     | 0.5662553593860834  | 0.0          | 1.0              | 0.13385067814625731 | 0.19970306365206464
SHA3-256       | 23,180,000,000  | 0.19447400097104728 | -            | -                | 0.27626007166449562 | -
By contrast, the SHA3 test produced a P-value of 0.0. Assuming an accurate value, this provides evidence against randomness. This result should be replicated, and the methods vetted again, before concluding that SHA3-256 produces non-random values. The result should be viewed with additional skepticism because the independent runs test showed no evidence against randomness (see the next subsection), and the two tests are very similar: the independent runs test checks the number of contiguous runs of 0's and 1's, ensuring that run lengths are consistent with randomness, while the longest runs test assesses the largest series of 1's in a sequence. Given the two extremes of results, it is highly probable that the methodology is either incorrectly implemented or that the test cannot scale to large values. A replication of this test is especially called for, in addition to increased analysis.
D. Independent Runs
Only the full suites of SHA2 and SHA3 ran the independent runs test. In both algorithms' trials, the P-value returned a perfect 1.0, which indicates no evidence against randomness. More precisely, this means that the contiguous series of 0's and 1's are approximately even in size.
E. Binary Matrix
The binary matrix suites showed no evidence against randomness. The SHA2 full suite produced a P-value of 1.0; the SHA3 test measured a P-value of 0.1997. Both suggest that subsets of the bit sequences, broken into 32x32 matrices, have no linear correlation between elements within the matrix.
VI. CONCLUSION
In this paper we present an initial approach to assessing hash output as a random number generator using large sample sizes. The tests were chosen from the industry-standard NIST STS and were adapted so as not to be limited by memory utilization. Additionally, we constructed an automation process to schedule analysis jobs at massive scale.

Five of the six adapted tests passed validation. The validation process assessed known random and non-random values, and we compared the P-value output by series and by bit length. The monobit, longest runs, independent runs, binary frequency and binary matrix tests all passed verification. However, the FFT (spectral) test always reported that all inputs were non-random, even on SHA2-256 tests of known random sequences; consequently, it was removed from the large-scale run. One of the first efforts after this experiment should be a post-mortem of the FFT code to diagnose the failure. Spectral FFT is widely considered one of the most powerful tools for detecting subtle deviations from randomness, and its exclusion from our results shifts this effort from a reliable assessment of SHA3 as a random number generator to an initial pass at large-sample-size testing of hash outputs.
Using the updated toolkit, we tested SHA2 and SHA3 output with sample sizes between 0.996 billion and 101.976 billion. All tests demonstrated no evidence against randomness for SHA2-256, and four of the five tests showed no evidence against randomness for SHA3 hash output. The longest runs test, however, produced a P-value smaller than the 0.01 significance level. The method passed the verification process, making an implementation error improbable, so two other possibilities are likely. First, the result may be caused by the test itself being incapable of scaling: the incomplete gamma function used to evaluate the sequence might not produce a distribution compatible with samples of this size. This possibility is supported by the atypical P-value of 2.0 for the SHA2 longest runs test. The other possibility is that SHA3-256 really does show evidence against randomness when assessed at scale. While such a result would be a significant blow to the adoption of SHA3, it is also the least likely possibility given the overall lack of evidence against randomness. This test needs to be replicated and analyzed before any conclusion can be made about SHA3's ability to act as a random number generator.
In addition to validating the longest runs result and analyzing the FFT test, several further steps could strengthen the ability to assess randomness in hash algorithms. Adapting and validating the remaining tests so they are no longer constrained by memory would give full NIST STS parity on large samples. Additionally, the incomplete gamma function and error function should be assessed from theoretical and experimental perspectives to ensure they can scale to large samples.
APPENDIX
Test Validation Flow Chart
ACKNOWLEDGMENT
The authors would like to thank Dr. Engels and Dr. Mcgee
at Southern Methodist University.
REFERENCES
[1] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C. Wiley, 1996.
[2] D. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, 1997.
[3] NIST, "SHA-3 standard: Permutation-based hash and extendable-output functions," FIPS 202, Tech. Rep., 2015.
[4] C. Boutin, "NIST selects winner of Secure Hash Algorithm (SHA-3) competition," NIST Tech Beat [Online], 2012.
[5] NIST, A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, SP 800-22 Rev. 1a, April 2010.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, "Keccak sponge function family main document," Submission to NIST (Round 2), Tech. Rep., 2009.
[7] A. Gholipour and S. Mirzakuchaki, "A pseudorandom number generator with Keccak hash function," International Journal of Computer and Electrical Engineering, vol. 3, no. 6, 2011.