The document discusses different methods researchers have used to code qualitative and quantitative data for analysis. It describes several coding schemes researchers developed to analyze patterns in language learner data, such as question formation stages, feedback on errors, and classroom interaction. The document emphasizes that reliable coding requires carefully designing a scheme, training multiple coders, and calculating interrater reliability statistics on a sample of the data.
GROUP FUZZY TOPSIS METHODOLOGY IN COMPUTER SECURITY SOFTWARE SELECTION ijfls
In today's interconnected world, the risk of malware is a major concern for users. Antivirus software is a tool to prevent, detect, and eliminate malware such as computer worms, trojan horses, computer viruses, spyware, and adware. In the competitive IT environment, the availability of many antivirus products with diverse features makes evaluating them a contentious and complicated issue for users, one that has a significant impact on the effectiveness of computer defense systems. The antivirus selection problem can be formulated as a multiple-criteria decision-making problem. This paper proposes an antivirus evaluation model for computer users based on group fuzzy TOPSIS. We study a real-world case of antivirus software and define criteria for the antivirus selection problem. Seven alternatives were selected from among the most popular antivirus products on the market, and seven criteria were determined by the experts. The study concludes with sensitivity analyses of the results, which also give valuable insights into the needs and solutions of different users in different conditions.
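The fuzzy variant in the paper works on triangular fuzzy numbers and group-aggregated expert weights; the sketch below shows only the crisp TOPSIS core on invented toy scores (two criteria: a benefit criterion such as detection rate, and a cost criterion such as price), not the paper's actual criteria.

```python
import math

def topsis_rank(matrix, weights, benefit):
    """Rank alternatives with crisp TOPSIS.
    matrix: rows = alternatives, columns = criterion scores.
    weights: criterion weights summing to 1.
    benefit: True if higher is better for that criterion."""
    n = len(matrix[0])
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    # Ideal and anti-ideal points, per criterion direction.
    best = [max(col) if benefit[j] else min(col) for j, col in enumerate(zip(*v))]
    worst = [min(col) if benefit[j] else max(col) for j, col in enumerate(zip(*v))]
    # Closeness coefficient: d_worst / (d_best + d_worst); higher is better.
    scores = []
    for row in v:
        d_best = math.sqrt(sum((x - b) ** 2 for x, b in zip(row, best)))
        d_worst = math.sqrt(sum((x - w) ** 2 for x, w in zip(row, worst)))
        scores.append(d_worst / (d_best + d_worst))
    return scores
```

An alternative that dominates on every criterion gets closeness 1, one dominated on every criterion gets 0; everything else falls in between.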
An Empirical Comparison and Feature Reduction Performance Analysis of Intrusi... ijctcm
This paper reports an empirical evaluation of five machine learning algorithms (J48, BayesNet, OneR, NB, and ZeroR) using ten performance criteria: accuracy, precision, recall, F-measure, incorrectly classified instances, kappa statistic, mean absolute error, root mean squared error, relative absolute error, and root relative squared error. The aim is to determine which classifier performs best for an intrusion detection system (IDS), machine learning being one of the methods used in IDSs. Based on this study, the J48 decision tree is the most suitable of the five algorithms. We also compared the performance of IDS classifiers using seven feature reduction techniques.
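Several of the criteria above derive from a single binary confusion matrix; a minimal sketch of that derivation (toy counts, not the paper's results):

```python
def classifier_metrics(tp, fp, fn, tn):
    """Core evaluation metrics from a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)          # of predicted positives, fraction correct
    recall = tp / (tp + fn)             # of actual positives, fraction found
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_e = p_yes + p_no                  # expected chance agreement
    kappa = (accuracy - p_e) / (1 - p_e)
    return accuracy, precision, recall, f_measure, kappa
```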
Comprehensive Testing Tool for Automatic Test Suite Generation, Prioritizatio... CSCJournals
Testing is an essential part of the software development life cycle. Automatic test case and test data generation has attracted many researchers in the recent past. Test suite generation is given importance because it considers multiple objectives and ensures code coverage. The generated test cases can have dependencies, both open and closed. When dependencies exist, the order in which test cases execute affects the percentage of flaws detected in the software under test; test case prioritization is therefore another important research area that complements automatic test suite generation in object-oriented systems. Prior research on test case prioritization focused on dependency structures; in this paper, we automate the extraction of those structures. We propose a methodology that handles both automatic test suite generation and test case prioritization for effective testing of object-oriented software, and we built a tool as a proof of concept. An empirical study with 20 case studies revealed that the proposed tool and underlying methods can have a significant impact on the software industry and its clientele.
SENSITIVITY ANALYSIS OF INFORMATION RETRIEVAL METRICS ijcsit
Average Precision, Recall, and Precision are the main metrics of Information Retrieval (IR) system performance. Using mathematical and empirical analysis, this paper shows the properties of those metrics. Mathematically, it is demonstrated that all of these parameters are very sensitive to relevance judgments, which are not usually very reliable. We show that shifting a relevant document downwards within the ranked list decreases Average Precision. The variation of the Average Precision value is strong in positions 1 to 10, while from the 10th position on this variation is negligible. In addition, we estimate the regularity of Average Precision changes when an arbitrary number of relevance judgments within the existing ranked list is switched from non-relevant to relevant. Empirically, it is shown that 6 relevant documents at the end of a 20-document list have approximately the same Average Precision value as a single relevant document at the beginning of the list, while Recall and Precision values increase linearly regardless of document position. We also show that for a Serbian-to-English human-translated query followed by English-to-Serbian machine translation, relevance judgments change significantly, and therefore all parameters for measuring IR system performance are also subject to change.
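The position sensitivity discussed above is easy to reproduce: under the standard Average Precision definition (mean of the precisions at the relevant ranks), moving a lone relevant document from rank 1 to rank 2 halves AP, while a shift deep in the list barely changes it.

```python
def average_precision(rels):
    """AP over a ranked list; rels[i] is 1 if the i-th result is relevant."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / hits if hits else 0.0
```

For example, a single relevant document at rank 1 gives AP = 1.0, at rank 2 gives 0.5, and at rank 10 only 0.1, matching the claim that variation is concentrated in the top positions.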
Determining Basis Test Paths Using Genetic Algorithm and J48 IJECEIAES
Basis test paths is a method that represents code as a graph whose nodes stand for statements and whose edges give the sequence of execution steps. Basis test paths can be generated with a Genetic Algorithm, but the drawback is that the number of iterations affects whether the appropriate basis paths appear. With too few iterations, some paths may not appear at all; with too many, all paths have already appeared by the middle of the run. This research aims to optimize Genetic Algorithm performance for basis test path generation by determining the number of iterations appropriate to the characteristics of the code. The code metrics Node, Edge, VG, NBD, and LOC were used as features, and a J48 classifier was employed to predict the number of iterations. Seventeen methods were selected as training data and 16 methods as test data. The system correctly predicted 84.5% of 58 basis paths, and efficiency tests show it finds basis paths 35% faster than the old system.
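The VG metric above is McCabe's cyclomatic complexity, which for a connected control-flow graph equals the number of basis paths. A minimal sketch on an invented if/else graph (the path enumeration is a naive DFS, suitable only for small acyclic graphs):

```python
def cyclomatic_complexity(edges, nodes):
    """V(G) = E - N + 2 for a single connected control-flow graph;
    this equals the number of linearly independent (basis) paths."""
    return len(edges) - len(nodes) + 2

def all_paths(adj, src, dst, path=None):
    """Enumerate every path from src to dst in a small acyclic graph."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    return [p for nxt in adj.get(src, []) for p in all_paths(adj, nxt, dst, path)]
```

For a simple if/else (one decision node with two branches rejoining at the exit), V(G) = 2, and exactly two entry-to-exit paths exist.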
ANALYSIS OF MACHINE LEARNING ALGORITHMS WITH FEATURE SELECTION FOR INTRUSION ... IJNSA Journal
In recent times, various machine learning classifiers have been used to improve network intrusion detection, and many solutions have been proposed in the literature. Classifiers trained on older datasets have limited detection accuracy, so they need to be trained on the latest data. In this paper, UNSW-NB15, the latest dataset, is used to train machine learning classifiers. K-Nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), Random Forest (RF), Logistic Regression (LR), and Naïve Bayes (NB) classifiers are selected from the taxonomy of lazy and eager learners. Chi-Square, a filter-based feature selection technique, is applied to the UNSW-NB15 dataset to remove irrelevant and redundant features. Classifier performance is measured in terms of Accuracy, Mean Squared Error (MSE), Precision, Recall, F1-Score, True Positive Rate (TPR), and False Positive Rate (FPR), with and without feature selection, and a comparative analysis of the classifiers is carried out.
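The Chi-Square filter mentioned above scores each feature by how far its observed co-occurrence with the class deviates from independence; a minimal sketch on an invented 2x2 contingency table:

```python
def chi_square(table):
    """Chi-square statistic for a feature/class contingency table.
    Higher values suggest the feature is more informative about the class;
    a table of identical rows (feature independent of class) scores 0."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = rows[i] * cols[j] / total  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat
```

A filter method ranks all features by this score and keeps the top-k before any classifier is trained.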
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS ijcsit
Diabetes is among the most common diseases in India. It affects patients' health and also leads to other chronic diseases. Predicting diabetes plays a significant role in saving lives and cost, yet it is challenging because it depends on several factors. Few studies have reported the performance of classification algorithms in terms of accuracy; their results are difficult for medical practitioners to understand and lack visual aids, being presented in pure text format. This survey uses ROC and PRC graphical measures to improve the understanding of results, and presents a detailed parameter-wise comparison that other surveys lack. Execution time, accuracy, TP rate, FP rate, precision, recall, and F-measure are used for comparative analysis, and a confusion matrix is prepared for a quick review of each algorithm. Ten-fold cross-validation is used to estimate the prediction models. Different sets of classification algorithms are analyzed on a diabetes dataset acquired from the UCI repository.
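The ten-fold cross-validation used above can be sketched as an index-splitting helper (a simplified, unshuffled variant; real evaluations typically shuffle or stratify first):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal, disjoint test folds.
    Each fold serves once as the test set while the rest train the model."""
    folds, start = [], 0
    for i in range(k):
        # Spread the remainder n % k over the first folds.
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```

The k accuracy scores (one per held-out fold) are then averaged to estimate the model's generalization performance.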
This paper presents a review and comparative evaluation of several well-known machine learning algorithms in terms of their suitability and performance on a given dataset of any size. We describe our Machine Learning ToolBox, built in Python. The toolbox implements supervised classification algorithms: Naïve Bayes, decision trees, SVM, k-nearest neighbors, and a neural network (backpropagation). The algorithms are tested on the iris and diabetes datasets and compared on the basis of their accuracy under different conditions; using the tool, any of the implemented algorithms can be applied to any dataset of any size. The main goal of the toolbox is to give users a platform to test their datasets against different machine learning algorithms and use the accuracy results to determine which algorithm fits the data best. The user can choose a dataset, structured or unstructured, and then choose the features to use for training. We give concluding remarks on the performance of the implemented algorithms based on experimental analysis.
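One of the algorithms listed above, k-nearest neighbors, fits in a few lines; this is a generic sketch on invented toy points, not the toolbox's actual implementation:

```python
import math
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Predict the label of x by majority vote of its k nearest
    training points under Euclidean distance."""
    ranked = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because it stores the whole training set and defers all work to prediction time, k-NN is the textbook example of the "lazy learner" category several of these papers mention.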
Investigation of Attitudes Towards Computer Programming in Terms of Various V... ijpla
This study aims to determine the attitudes of individuals towards computer programming in terms of various variables. The study group consists of students of Kastamonu University's Department of Computer Education and Instructional Technologies Teaching (CEIT), Department of Computer Engineering, and Department of Computer Programming. Data were collected via the Attitude towards Computer Programming Scale (AtCPS). The results show that students have neutral attitudes towards computer programming in general. Male computer programming students have significantly more positive attitudes towards programming than female computer programming students. In addition, attitude towards computer programming varies significantly by grade: the higher the grade, the lower the attitude. The more time CEIT and computer programming students spend on a computer daily for programming purposes, the more positive their attitudes towards programming. Attitude varies significantly by high school of graduation only among CEIT students.
A novel ensemble modeling for intrusion detection system IJECEIAES
The vast increase in data flowing through internet services has made computer systems more vulnerable and harder to protect from malicious attacks, so intrusion detection systems (IDSs) must become more potent at monitoring intrusions. We therefore build an effective IDS architecture that employs a simple classification model and achieves low false alarm rates and high accuracy. Notably, IDSs handle enormous amounts of traffic containing redundant and irrelevant features, which degrade performance; good feature selection reduces unrelated and redundant features and yields better classification accuracy. This paper proposes a novel ensemble model for IDS based on two algorithms: Fuzzy Ensemble Feature Selection (FEFS) and Fusion of Multiple Classifiers (FMC). FEFS unifies five feature scores obtained from feature-class distance functions, aggregated with the fuzzy union operation. FMC fuses three classifiers and works on an ensemble decision function. Experiments on the KDD Cup 99 dataset show that the proposed system outperforms well-known methods such as Support Vector Machines (SVMs), K-Nearest Neighbors (KNN), and Artificial Neural Networks (ANNs). Our experiments clearly confirm the value of ensemble methodology for modeling IDSs; the system is robust and efficient.
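The fuzzy-union aggregation step above can be sketched with the standard fuzzy union (element-wise max); the five feature-class distance scores and the selection threshold here are invented placeholders, not the paper's definitions:

```python
def fuzzy_union_scores(score_lists):
    """Aggregate several per-feature relevance scores (each in [0, 1])
    with the standard fuzzy union: element-wise maximum."""
    return [max(vals) for vals in zip(*score_lists)]

def select_features(score_lists, threshold):
    """Keep the indices of features whose fused score meets the threshold."""
    fused = fuzzy_union_scores(score_lists)
    return [i for i, s in enumerate(fused) if s >= threshold]
```

The union is deliberately optimistic: a feature survives if any one scoring function rates it relevant, which is what makes the ensemble of five scores less likely to drop a genuinely useful feature.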
A NOVEL APPROACH FOR GENERATING FACE TEMPLATE USING BDA csandit
In identity management systems, commonly used biometric recognition systems need attention to the issue of biometric template protection if a more reliable solution is to be found. A biometric template protection algorithm should satisfy security, discriminability, and cancelability. As no single template protection method satisfies all of these basic requirements, a novel technique for face template generation and protection is proposed, designed to provide security and accuracy in both new-user enrollment and authentication. The technique takes advantage of a hybrid approach together with the binary discriminant analysis algorithm, and is built on random projection, binary discriminant analysis, and a fuzzy commitment scheme. Three publicly available benchmark face databases are used for evaluation. The proposed technique enhances discriminability and recognition accuracy by 80% in terms of matching score of the face images, and provides high security.
Predicting Contradiction Intensity: Low, Strong or Very Strong? Ismail BADACHE
Reviews of web resources (e.g. courses, movies) are increasingly exploited in text analysis tasks (e.g. opinion detection, controversy detection). This paper investigates contradiction intensity in reviews, exploiting features such as the variation of ratings and the variation of polarities around specific entities (e.g. aspects, topics). First, aspects are identified from the distributions of emotional terms in the vicinity of the most frequent nouns in the review collection. Second, the polarity of each review segment containing an aspect is estimated, and only resources containing these aspects with opposite polarities are considered. Finally, features are evaluated with feature selection algorithms to determine their impact on the effectiveness of contradiction intensity detection, and the selected features are used to train several state-of-the-art learning approaches. Experiments are conducted on a Massive Open Online Courses dataset containing 2244 courses and their 73,873 reviews, collected from coursera.org. Results show that the variation of ratings, the variation of polarities, and the number of reviews are the best predictors of contradiction intensity, and that J48 was the most effective learning approach for this type of classification.
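A rating-variation feature of the kind described above can be sketched as a simple dispersion measure; this population-variance sketch is only illustrative, as the paper's exact feature definitions are not reproduced here:

```python
def variation(values):
    """Population variance of a list of ratings or polarity scores;
    higher dispersion signals stronger disagreement (contradiction)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

Unanimous ratings score 0, while a split between 1-star and 5-star reviews of the same aspect yields a large value, which a learner such as J48 can then threshold into intensity classes.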
Regularized Weighted Ensemble of Deep Classifiers ijcsa
An ensemble of classifiers increases classification performance, since the decisions of many experts are fused into the resultant prediction. Deep learning is a classification approach in which fine tuning follows the basic learning step to improve precision; ensembles of deep classifiers are a promising research direction, and feature subset selection is another way of creating the individual classifiers to be fused. All these ensemble techniques face the ill-posed problem of overfitting. A regularized weighted ensemble of deep support vector machines performs prediction analysis on three UCI repository problems (the Iris, Ionosphere, and Seeds datasets), increasing the generalization of the boundary between the classes of each dataset. Singular value decomposition-reduced norm-2 regularization with the two-level deep classifier ensemble gives the best result in our experiments.
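Norm-2 (ridge) regularization computed through the SVD, which the abstract's approach builds on, can be sketched for the linear least-squares case; this is a generic sketch, not the paper's two-level ensemble:

```python
import numpy as np

def ridge_svd(A, y, lam):
    """Solve min ||Aw - y||^2 + lam * ||w||^2 via the SVD of A:
    w = V diag(s / (s^2 + lam)) U^T y.
    Small singular values are damped, which is what curbs overfitting."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    d = s / (s ** 2 + lam)
    return Vt.T @ (d * (U.T @ y))
```

The SVD form makes the regularization explicit: each singular direction is shrunk by the factor s/(s^2 + lam), and it matches the direct normal-equations solution (A^T A + lam I)^(-1) A^T y.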
An Empirical Study on the Adequacy of Testing in Open Source Projects Pavneet Singh Kochhar
In this study, we investigate the state of the practice of testing by measuring code coverage in open-source software projects. We examine over 300 large open-source projects written in Java to measure the code coverage of their associated test cases.
Comparative Performance Analysis of Machine Learning Techniques for Software ...csandit
Machine learning techniques can be used to analyse data from different perspectives and enable
developers to retrieve useful information. Machine learning techniques are proven to be useful
in terms of software bug prediction. In this paper, a comparative performance analysis of
different machine learning techniques is explored for software bug prediction on public
available data sets. Results showed most of the machine learning methods performed well on
software bug datasets.
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisOlga Scrivner
In the format of hands-on session, this workshop will introduce participants to the Language Variation Suite (LVS), a user-friendly interactive web application built in R. LVS provides access to advanced statistical methods and visualization techniques, such as mixed-effects modeling, conditional and random tree analyses, cluster analysis. These advanced methods enable researchers to handle imbalanced data, measure individual and group variation, estimate significance, and rank variables according to their significance.
J48 and JRIP Rules for E-Governance DataCSCJournals
Data are any facts, numbers, or text that can be processed by a computer. Data Mining is an analytic process which designed to explore data usually large amounts of data. Data Mining is often considered to be \"a blend of statistics. In this paper we have used two data mining techniques for discovering classification rules and generating a decision tree. These techniques are J48 and JRIP. Data mining tools WEKA is used in this paper.
Advanced Computational Intelligence: An International Journal (ACII)aciijournal
Today, enormous amount of data is collected in medical databases. These databases may contain valuable
information encapsulated in nontrivial relationships among symptoms and diagnoses. Extracting such
dependencies from historical data is much easier to done by using medical systems. Such knowledge can be
used in future medical decision making. In this paper, a new algorithm based on C4.5 to mind data for
medince applications proposed and then it is evaluated against two datasets and C4.5 algorithm in terms of
accuracy.
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINEaciijournal
Today, enormous amount of data is collected in medical databases. These databases may contain valuable
information encapsulated in nontrivial relationships among symptoms and diagnoses. Extracting such
dependencies from historical data is much easier to done by using medical systems. Such knowledge can be
used in future medical decision making. In this paper, a new algorithm based on C4.5 to mind data for
medince applications proposed and then it is evaluated against two datasets and C4.5 algorithm in terms of
accuracy.
Ch 6 only 1. Distinguish between a purpose statement, research pMaximaSheffield592
Ch 6 only
1. Distinguish between a purpose statement, research problem, and research questions.
2. What are major ideas that should be included in a qualitative purpose statement?
3. What are the major components of a quantitative purpose statement?
4. What are the major components of a mixed methods purpose statement?
Requirements Engineering (20 points)
In Chapter 4 of Software Engineering. Sommerville, Pearson, 2016 (10th edition), Sommerville discusses ethnography as a method for eliciting requirements.
1. Discuss two advantages and two disadvantages of an ethnographic approach. (5 points)
2. Suggest two contexts where ethnography might be a challenging method of requirements engineering. For each context, how would you recommend that your team elicit requirements? (15 points)
Design (20 points)Design patterns (5 points)
Which of the following statements is (are) true? Explain.
1. StudentsDatabase is the model, StudentsManager is the controller, and WebApplication is the view.
2. StudentsDatabase is the model, StudentsManager is the view, and WebApplication is the controller.
3. StudentsManager is the model, StudentsDatabase is the view, and StudentsManager is the controller.
4. This is not MVC, because StudentsManager must use a listener to be notified when the database changes.
(Credit: EPFL)Design task (15 points)
Suppose you are asked to design a time management and notetaking system to support (1) scheduling meetings; and (2) tracking the documents associated with those meetings (e.g. agendas, presentations, meeting minutes).[footnoteRef:1] The system should accommodate [1: Such a feature seems like an inevitable development in any messaging platform…]
Make reasonable assumptions as needed.
1. Create a use case for “Schedule meeting”. You might follow the style in Sommerville Figure 7.3. (5 points)
2. Identify the objects in your system. Represent them using a structural diagram showing the associations between objects (“Class diagram” – cf. Sommerville Figure 5.9). (5 points)
3. Draw a sequence diagram showing the interactions between objects when a group of people are arranging a meeting (cf. Sommerville Figure 5.15). (5 points)
1. Implementation (20 points)
Consider the software package is-positive.[footnoteRef:2] Examine its source code (see index.js) and its test suite (see test.js), then complete these questions. [2: https://www.npmjs.com/package/is-positive]
1. Describe the API surface of this package. (2 points)
2. Describe how you would test this package. Describe how and why your approach would change if you maintained a similar package in a different programming language of your choice. (2 points)
3. According to npmjs.com, this package receives over 16,000 downloads each month.
a. Why might an engineer choose to use this package? (4 points)
b. Why might an engineer choose not to use this package? (You may find insights from the chapter ab ...
2. 8.3.2 CUSTOM-MADE CODING SYSTEM
8.3.2.1 QUESTION FORMATION
The researchers needed a coding scheme that would allow them to identify how the learners' question formation changed over time.
To code the data, Mackey & Philp assigned the questions produced by their child learners to one of six stages based on the Pienemann-Johnston hierarchy. The modified version appears in Table 8.6.
3. [Table 8.6: modified question formation stages]
4. After the stages were designated, the next step was to determine the highest-level stage.
This step of the coding involved assigning an overall stage to each learner, based on the two highest-level question forms asked in two different tests.
It was then possible to examine whether the learners had improved over time.
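The stage-assignment step can be sketched as follows. This is a minimal illustration, not Mackey & Philp's actual procedure: it assumes, as one reading of the criterion above, that the overall stage is the highest stage the learner reached in at least two different tasks, and the function name and fallback rule are ours.

```python
from collections import Counter

def overall_stage(task_stages):
    """Assign an overall question stage for one test session.

    Assumed rule: the overall stage is the highest stage attested in at
    least two different tasks; if no stage repeats, fall back to the
    highest stage produced at all.
    """
    counts = Counter(task_stages)
    eligible = [stage for stage, n in counts.items() if n >= 2]
    return max(eligible) if eligible else max(task_stages)

# Learner AB's pretest from Table 8.7: stages 3, 3, 2 across three tasks
print(overall_stage([3, 3, 2]))  # → 3
```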
5. Table 8.7 Coding for Question Stage (T1-T3 = Tasks 1-3)

         Pretest             Immediate Posttest    Delayed Posttest
ID   T1  T2  T3  Final     T1  T2  T3  Final     T1  T2  T3  Final
AB    3   3   2    3        3   3   3    3        3   3   2    3
AA    3   3   3    3        5   5   4    5        5   5   4    5
AC    3   4   3    3        2   2   3    2        3   3   3    3
AD    3   3   4    4        3   5   5    5        5   3   3    3
Learner AB remained at Stage 3 throughout the study.
Learner AA began the study at Stage 3 and remained at Stage 5 on both posttests.
Once this sort of coding has been carried out, the researcher can make decisions about the analysis.
6. 8.3.2.2 NEGATIVE FEEDBACK
Oliver developed a hierarchical coding system that first divided all teacher-student and NS-NNS (native speaker-nonnative speaker) conversations into three parts:
(1) the NNS's initial turn
(2) the response given by the teacher or NS partner
(3) the NNS's reaction
Each part was then subjected to further coding.
7. Figure 8.1 Three-turn coding scheme

Initial Turn → rated as: Correct / Non-target / Incomplete
NS Response → Ignore / Negative Feedback / Continue
NNS Reaction → Respond / Ignore / No Chance

As with many schemes, this one is top-down (also known as hierarchical), and the categories are mutually exclusive, meaning that each piece of data can be coded in only one way.
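The mutual exclusivity of the scheme in Figure 8.1 can be made concrete as three category sets, one per turn; a minimal sketch (the category labels follow the figure, but the function and its validation logic are illustrative, not part of Oliver's study):

```python
# Each turn receives exactly one category from its own set.
INITIAL_TURN = {"correct", "non-target", "incomplete"}
NS_RESPONSE = {"ignore", "negative feedback", "continue"}
NNS_REACTION = {"respond", "ignore", "no chance"}

def code_exchange(initial, response, reaction):
    """Validate a three-turn coding; reject any code outside the scheme."""
    if initial not in INITIAL_TURN:
        raise ValueError(f"unknown initial-turn code: {initial}")
    if response not in NS_RESPONSE:
        raise ValueError(f"unknown NS-response code: {response}")
    if reaction not in NNS_REACTION:
        raise ValueError(f"unknown NNS-reaction code: {reaction}")
    return (initial, response, reaction)

print(code_exchange("non-target", "negative feedback", "respond"))
```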
8. 8.3.2.3 CLASSROOM INTERACTION
The next turn was examined to determine:
(1) whether the error was corrected
(2) whether it was ignored
If the error was corrected, the following turn was examined and coded according to:
(1) whether the learner produced uptake
(2) whether the topic was continued
Finally, the talk following uptake was examined with regard to:
(1) whether the uptake was reinforced
(2) whether the topic was continued
9. 8.3.2.4 SECOND LANGUAGE WRITING INSTRUCTION
Two studies used coding categories:
(1) Adams (2003): investigated the effects of written error correction on learners' subsequent second language writing
(2) Sachs & Polio (2004): compared three feedback conditions
10. The researchers used different coding schemes to fit their questions.
To compare the feedback conditions with each other, Sachs & Polio coded T-units containing the original error(s) as:
(1) at least partially changed (+)
(2) completely corrected (0)
(3) completely unchanged (-)
(4) not applicable (n/a)
Adams coded individual forms as:
(1) more targetlike
(2) not more targetlike
(3) not attempted (avoided)
Sachs & Polio considered T-unit codings of "at least partially changed" (+) to be possible evidence of noticing, even when the forms were not completely more targetlike.
11. 8.3.2.5. TASK PLANNING
The effects of planning on task performance (fluency, accuracy, and complexity).
Yuan and Ellis (2003) operationalized these as:
(1) Fluency: (a) number of syllables per minute, and (b) number of meaningful syllables per minute, where repeated or reformulated syllables were not counted.
(2) Complexity: syntactic complexity, the ratio of clauses to T-units; syntactic variety, the total number of different grammatical verb forms used; and mean segmental type-token ratio.
(3) Accuracy: the percentage of error-free clauses, and correct verb forms (the percentage of accurately used verb forms).
The benefit of such a coding system is that it is similar enough to those used in previous studies that results are comparable, while also being finely grained enough to capture new information.
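Because Yuan and Ellis's measures are simple ratios, they are straightforward to compute once the counts are in hand. A sketch with hypothetical counts (the function names and example numbers are ours, for illustration only):

```python
def fluency(syllables, minutes):
    """Syllables per minute (Yuan & Ellis's first fluency measure)."""
    return syllables / minutes

def syntactic_complexity(clauses, t_units):
    """Ratio of clauses to T-units."""
    return clauses / t_units

def accuracy(error_free_clauses, total_clauses):
    """Percentage of error-free clauses."""
    return 100.0 * error_free_clauses / total_clauses

# A hypothetical 2-minute narrative: 260 syllables, 30 clauses spread over
# 18 T-units, with 24 of the clauses error-free.
print(fluency(260, 2))                         # → 130.0
print(round(syntactic_complexity(30, 18), 2))  # → 1.67
print(accuracy(24, 30))                        # → 80.0
```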
12. 8.3.3 CODING QUALITATIVE DATA (1)
Schemes for qualitative coding generally emerge from the data (open coding).
The range of variation within individual categories can assist in the procedure of adapting and finalizing the coding system, with the goal of closely reflecting and representing the data.
Examine the data for emergent patterns and themes, looking for anything pertinent to the research question or problem.
New insights and observations that are not derived from the research question or literature review may be important.
13. 8.3.3 CODING QUALITATIVE DATA (2)
Themes and topics should emerge from the first round of insights into the data, when the researcher begins to consider which chunks of data fit together and which, if any, are independent categories.
Problem: with highly specific coding schemes, it can be difficult to compare qualitative coding and results across studies and contexts.
Watson-Gegeo (1988): "Although it may not be possible to compare coding between settings on a surface level, it may still be possible to do so on an abstract level."
14. 8.4. INTERRATER RELIABILITY (1)
The reliability of a test or measurement is based on the degree of similarity of results obtained by different researchers using the same equipment and method. If interrater reliability is high, the results will be very similar.
With only one coder and no intracoder reliability measures, the reader's confidence in the conclusions of the study may be undermined.
To increase confidence:
(1) Have more than one rater code the data wherever possible.
(2) Carefully select and train the raters.
Keep coders selectively blind about which part of the data or which group they are coding, in order to reduce the possibility of inadvertent coder biases.
15. 8.4. INTERRATER RELIABILITY (2)
To increase rater reliability, schedule coding in rounds or trials to reduce boredom or drift.
How much data should be coded: as much as is feasible given the time and resources available for the study.
Consider the nature of the coding scheme in determining how much data should be coded by a second rater.
With highly objective, low-inference coding schemes, it is possible to establish confidence in rater reliability with as little as 10% of the data.
16. 8.4.1.1. SIMPLE PERCENTAGE AGREEMENT
This is the ratio of all coding agreements over the total number of coding decisions made by the coders (appropriate for continuous data).
The drawback: it ignores the possibility that some of the agreement may have occurred by chance.
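Simple percentage agreement is just agreements divided by total decisions; a minimal sketch (the example codes "T"/"N" for targetlike/nontargetlike are hypothetical):

```python
def percent_agreement(coder_a, coder_b):
    """Simple percentage agreement: coding agreements / total decisions."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must rate the same items")
    agreements = sum(a == b for a, b in zip(coder_a, coder_b))
    return agreements / len(coder_a)

a = ["T", "T", "N", "T", "N", "N", "T", "T"]
b = ["T", "N", "N", "T", "N", "T", "T", "T"]
print(percent_agreement(a, b))  # → 0.75  (6 agreements out of 8 decisions)
```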
17. 8.4.1.2. COHEN'S KAPPA
This statistic represents the average rate of agreement for an entire set of scores, accounting for the frequency of both agreements and disagreements by category.
In a dichotomous coding scheme (e.g. targetlike vs. nontargetlike), kappa takes into account:
(1) how often the first coder chose each category (targetlike, nontargetlike)
(2) how often the second coder chose each category (targetlike, nontargetlike)
(3) how often the first and second coders agreed (e.g. both chose targetlike)
It also accounts for chance.
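Kappa can be computed from the observed agreement and the chance agreement implied by each coder's category frequencies; a self-contained sketch (the example codings are hypothetical):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    # Chance agreement: probability that both coders pick the same category
    # at random, given each coder's marginal category frequencies.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["targetlike", "targetlike", "nontargetlike", "targetlike"]
b = ["targetlike", "nontargetlike", "nontargetlike", "targetlike"]
print(round(cohens_kappa(a, b), 2))  # → 0.5
```

Here the coders agree on 3 of 4 decisions (75%), but half of that agreement is expected by chance, so kappa reports only 0.5.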
18. 8.4.1.3. ADDITIONAL MEASURES OF RELIABILITY
Pearson's Product-Moment and Spearman Rank Correlation Coefficients are based on measures of correlation and reflect the degree of association between the ratings provided by two raters.
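For interval-scale ratings, Pearson's product-moment correlation can be computed directly from the two raters' scores; a sketch with hypothetical ratings on a 1-5 scale:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two raters' scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rater1 = [3, 4, 2, 5, 4]
rater2 = [3, 5, 2, 4, 4]
print(round(pearson_r(rater1, rater2), 2))  # → 0.81
```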
19. 8.4.1.4. GOOD PRACTICE GUIDELINES FOR INTERRATER RELIABILITY
"There is no well-developed framework for choosing appropriate reliability measures." (Rust & Cooil 1994)
General good practice guidelines suggest that researchers should state:
(1) which measure was used to calculate interrater reliability
(2) what the score was
(3) briefly, why that particular measure was chosen
20. 8.4.1.5 HOW DATA ARE SELECTED FOR INTERRATER RELIABILITY TESTS
Semi-randomly select a portion of the data (say 25%) to be coded by a second rater.
To create a comprehensive sample, randomly select the 25% from different parts of the main dataset: if a pretest and three posttests are used, data from each of them should be included in the 25%.
Intrarater reliability refers to whether a rater will assign the same score after a set time period.
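Selecting the reliability sample so that every test session is represented amounts to stratified random sampling; a sketch, assuming the data are organized by session (the session names, sizes, and function name are hypothetical):

```python
import random

def reliability_sample(dataset, fraction=0.25, seed=0):
    """Draw a stratified sample for second-rater coding.

    Takes `fraction` of the items from each test session, so every
    session is represented in the reliability check. `dataset` maps a
    session name (e.g. 'pretest') to a list of item IDs.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for session, items in dataset.items():
        k = max(1, round(len(items) * fraction))
        sample[session] = rng.sample(items, k)
    return sample

data = {
    "pretest": list(range(40)),
    "immediate posttest": list(range(40)),
    "delayed posttest": list(range(40)),
}
sample = reliability_sample(data)
print({s: len(v) for s, v in sample.items()})
# → {'pretest': 10, 'immediate posttest': 10, 'delayed posttest': 10}
```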
21. 8.4.1.6. WHEN TO CARRY OUT CODING RELIABILITY CHECKS
Researchers should use a sample dataset to train themselves and their other coders, and test out their coding scheme early in the coding process.
Reporting on coding should include:
(1) what measure was used
(2) the amount of data coded
(3) the number of raters employed
(4) the rationale for choosing the measure used
(5) the interrater reliability statistics
(6) what happened to data about which there was disagreement
Complete reporting will help the researcher provide a solid foundation for the claims made in the study, and will also facilitate the process of replicating studies.
22. 8.5. THE MECHANICS OF CODING
(1) Using highlighting pens, working directly on transcripts.
(2) Listening to tapes or watching videotapes without transcribing everything: researchers may simply mark coding sheets when the phenomena they are interested in occur.
(3) Using computer programs (e.g. CALL programs).
23. 8.5.1. HOW MUCH TO CODE
(1) Consider and justify why not all the data are being coded.
(2) Determine how much of the data to code (data sampling or data segmentation).
(3) The data must be representative of the dataset as a whole and should also be appropriate for comparisons if these are being made.
(4) The research questions should ultimately drive the decisions made, and researchers should specify principled reasons for selecting data to code.
24. 8.5.2 WHEN TO MAKE CODING DECISIONS
Decide how to code and how much to code prior to the data collection process.
Carry out an adequate pilot study: this allows for piloting not only of materials and methods, but also of coding and analysis.
The most effective way to avoid potential problems is to design coding sheets ahead of data collection and then test them out in a pilot study.
25. 8.6. CONCLUSION
Many of the processes involved in data coding can be thought through ahead of time and then pilot tested.
These include the preparation of raw data for coding, transcription, the modification or creation of appropriate coding systems, and the plan for determining reliability.
Editor's Notes
Stage 1: One astronaut outside the spaceship?
Stage 2: The boys throw the shoe?
Stage 3: How many planets are in this picture?
Do you have a shoes on your picture?
Stage 4: where is the sun?
The ball is it in the grass or in the sky?
Stage 5: How many astronauts do you have?
What's the boy doing?
Stage 6: You live here, don’t you?
Doesn't your wife speak English?
Can you tell me where the station is?