This document discusses techniques for clustering log data in order to categorize log events in real time. It compares two traditional partitional document clustering approaches (Spherical K-Means being one of them) on a log corpus, using metrics such as Entropy, Purity and the Silhouette Index to evaluate them. It then uses the better model, together with computationally efficient vectorization and a combination of distance measures and category assignment, to dynamically cluster incoming real time log events and determine whether they belong to new or existing categories. The goal is to automatically categorize log events and correlate them with past responses, to help with tasks such as monitoring, prioritization and dynamic solution recommendation.
Silhouette Threshold Based Text Clustering for Log Analysis
Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications
Volume: 06 Issue: 01 June 2017, Page No. 17-25
ISSN: 2278-2419
Jayadeep J
Data Scientist, Tata Consultancy Services Ltd, Chennai, India
Email: jayadeepj@gmail.com
Abstract - Automated log analysis has been a dominant subject area of interest to both industry and academia alike. The heterogeneous nature of system logs, the disparate sources of logs (Infrastructure, Networks, Databases and Applications) and their underlying structures and formats make the challenge harder. In this paper I present the less frequently used document clustering techniques to dynamically organize real time log events (e.g. errors, warnings) into specific categories that are pre-built from a corpus of log archives. This kind of syntactic log categorization can be exploited for automatic log monitoring, priority flagging and dynamic solution recommendation systems. I propose practical strategies to cluster and correlate high volume log archives and high velocity real time log events, both in terms of solution quality and computational efficiency. First, I compare two traditional partitional document clustering approaches for categorizing a high dimensional log corpus. In order to select a suitable model for the problem, Entropy, Purity and the Silhouette Index are used to evaluate these different learning approaches. I then propose computationally efficient approaches to generate the vector space model for the real time log events. To dynamically relate them to the categories from the corpus, I suggest the use of a combination of a critical distance measure and a least-distance approach. In addition, I introduce and evaluate three different critical distance measures to ascertain whether a real time event belongs to a totally new category that is unobserved in the corpus.
Keywords - Clustering methods, Text mining, System Log
Analysis, Spherical K-Means, Silhouette Index
I. INTRODUCTION
With the growing complexity of IT systems, it has become arduous to continue with traditional approaches to system monitoring and management. Like the infrastructure itself, the logs have grown in complexity both in scale and heterogeneity. The log files generated by these systems contain a wealth of information about the states of the system, component interactions, faults and failures, and trends in their operation. The traditional approach to solving problems in IT systems has been to rely on the expertise of domain specialists to sift through these log files, correlate the log events with similar past occurrences (if any), or investigate them from the root. However, it is often infeasible to do this kind of manual log analysis. Hence it is quite common to see terabytes of unused logs generated from databases, infrastructure monitors, networks, and of course user applications, even in mid-sized enterprises. This is where machine learning and text analytics techniques come in handy. However, automated log file analysis differs considerably from the scenarios where document clustering techniques are typically used. First, system log data is high dimensional and sparse. Second, log events are often semantically unintelligible and considerably disparate in nature: a log event could be a process failure from the operating system, a standard warning message from a user application, a network failure message from an infrastructure component, or even a simple disk usage error. Third, log event statements are usually short, though varied in format. Finally, traditional document clustering approaches rarely consider the high velocity, streaming nature of real time logs. In this paper, I experimentally evaluate, and where necessary adapt, classical document clustering techniques for log event analysis. However, I do not perform any semantic understanding or determination of grammatical relationships between words of the log events in order to infer the meaning of the log data.
The remainder of this paper is organized as follows. Section 2 describes the problem of real time log categorization in detail and its associated challenges. Section 3 presents the vector space model. Section 4 gives an overview of each of the two clustering techniques used, the similarity measures and the cluster evaluation techniques, along with their underlying mathematical descriptions. Section 5 describes the datasets used in the experiments. Section 6 introduces the experimental approaches and the implementation used to categorize the log archive corpus. Similarly, Section 7 introduces the experimental approaches and the implementation used to dynamically cluster real time logs through computationally efficient means; in Section 7 I also introduce the concept of critical distance measures. Sections 6 and 7 also go over the results of each technique, followed by a comparison of the results. A brief conclusion is presented in Section 8.
II. OVERVIEW OF LOG CATEGORIZATION
A. Real Time Logs Vs Log Archive (Corpus)
Before describing the problem of log categorization, it is essential to distinguish real time log events from an archive of log events. As mentioned above, a log event in general is any report of an abnormality detected in the system. Typical log events could be a process failure from the operating system, a standard warning message from a user application, a network failure message from an infrastructure component or a simple disk overflow error message. Modern IT systems spawn and feed log events continuously to monitoring systems. These real time log events are monitored by IT monitoring staff to identify anomalies that require alerts and investigation. By and large, there are three modes of action a support staff member takes on an incoming real time log event:
Table I: Comparison of Real Time Logs vs. Log Corpus

Attribute          | Real Time Log    | Log Corpus
Volume             | Low              | High
Typical Size       | 2 or 3 Sentences | Terabytes
Incoming Velocity  | High             | Low
Analysis Process   | Dynamic          | Batch
Performance Need   | High             | Low
1) No Response - Skip the log event if he knows from experience that the event is relatively insignificant.
2) Immediate Response - If he knows from experience that the event is significant, he may correlate it with occurrences in the past, correlate it with prior similar instances (or support tickets) and respond appropriately.
3) Investigative Response - If the event is of a new type that was unobserved earlier, the abnormality requires investigation by a specialist before a response. Usually a new support ticket is raised.
In all the above cases, the log files containing the log events get archived continuously. I refer to this unfiltered historical log archive as the corpus. Typically the corpus runs into terabytes. Analysis performed on the log archives (if any) is done through a batch process, usually involving big data tools. However, the analysis of and action on the real time log events have to be dynamic and immediate.
B. Real Time Log Categorization Problem
Real time log categorization aims to automate the mode of response to a given real time log event. The decision whether a real time log event has a similar occurrence in the past or whether it constitutes a new category has to be made automatically. And if a similar event has been observed in the past, its closest category in the corpus needs to be identified so that it can be correlated with a past response/ticket.
Image 1: Industrial Framework for Real Time Log Categorization
This problem involves two distinct sets of tasks.
Batch Task: Group the log corpus into clusters. I use unsupervised learning to split the entire log corpus into dynamic categories. Traditional document clustering techniques or their adaptations are useful here.
Real Time Task: Associate each streaming real time event with one of the categories formed in the previous task, or identify it as a new category. This step has to be dynamic and hence should be computationally efficient with superior performance.
C. Challenges to Log Categorization
1) Disparate formats
The log event messages have a wide variety of formats, both within and across categories. The error message from Microsoft SQL Server will obviously differ from a web-service error generated by the Oracle SOAP API libraries. However, even within a specific category, the error messages differ considerably (in server names, timestamps, unique ids, etc.). As described in [10], consider the sample log events in Table 2: though there are 6 log events, there are only 2 types of events. Importantly, the events within a category do not match in syntax. Within Category 1, server ids, connection protocols, ports and class names from the libraries all differ. Within Category 2, alphanumeric user IDs, statement ids and account ids differ. However, for experienced IT support staff it is easy to ascertain that the 6 events belong to just 2 categories, i.e. Category 1, which deals with an unexpected disconnection in the network, most likely due to server restarts, and Category 2, which deals with wrong input of the mentioned organization codes and account identifiers. It is also very likely that the events within the same category require a similar response. Generally, logs are characterized by very few types and a high volume of events within each type.
2) Explicit Programming is not feasible
The categories/templates of log events are rarely known in advance. Using explicit programming to identify these patterns is not feasible. A large-scale system is usually composed of various parts, and those parts may be developed in different programming languages. Despite the availability of the BSD Syslog Protocol and other standards, the contents of different components' logs can differ greatly. Even within a single component and category, explicit programming (e.g. using regex pattern matching) cannot be applied, simply because there may be thousands of categories and different patterns of log events in the system.
Table 2: Sample Log events
Category Log Event
1 [com.tvcs.OvcsSubscriber.bcilAD] Unexpected
disconnect of connection
bcilConnection1[tcp://sydbcil4test1:34435]State:
STREAMING -> CLOSED
1 [com. tvcs.OvcsSubscriber.options] Unexpected
disconnect of connection
optConnection1[tcp://sydbcil5dev1:35322]
State: STREAMING -> CLOSED
1 [com.cle.CleanerEditer.cleanerEdit] Unexpected
disconnect of connection
clsEditer0[tcp://sydbcilclean1:15634]State:
STREAMING -> CLOSED
2 [com.refdata.a3.impl.AccountCreationBuilder]
Invalid organization code '9432' for account
'4590XDCD6' with statement id
'US#BbCX0llQ9-E1MVPhhwA'
2 [com.refdata.a3.impl.AccountCreationBuilder]
Invalid organization code '6321' for account
'93DCC0485' with statement id
'US#CRXUTsRpmXq-5xW1pHA'
2 [com.refdata.a3.impl.AccountCreationBuilder]
Invalid organization code '5846' for account
'92BASS0028' with statement id
'US#OMXOgAVSywnVatcLLA'
3) High Volume Log Corpus
In the Big Data world, systems are designed to generate massive amounts of logs. The log corpus is usually a high-volume, high-dimensional but sparse dataset. This renders many traditional techniques, such as agglomerative clustering (e.g. hierarchical clustering) or even divisive clustering with an exhaustive search, too slow for industrial use.
4) High Velocity Real Time Log Events
High velocity is a quintessential feature of streaming real time
events, which makes associating large computations with individual events extremely arduous. For example, re-clustering the real time events along with the log corpus to figure out the new categories is technically feasible but impractical.
D. Value in Log Categorization
The advantages of real time log categorization are:
Faster querying on organized logs
Automatic log monitoring
Automatic priority flagging of faults
Ignoring insignificant alerts
Dynamic solution recommendation
Self healing of systems
Reduced support staff costs
III. VECTOR SPACE MODEL (VSM)
E. Log Pre-processing
Though the format of logs is not fixed, they do have some common characteristics; for example, they may contain timestamps and are composed of upper and lower case letters, digits and special characters. As suggested by Natural Language Processing (NLP) practice [4], it is helpful to apply some text pre-processing to the raw logs before building a TF-IDF matrix. A minimal sketch of such a routine follows the list below.
Remove timestamps – The timestamp information is removed.
Remove digits – The digits in the log events are removed (or replaced with a special character).
Lowercase conversion – All log events are converted to lowercase to provide uniformity.
Special character removal – All special characters, including punctuation, are removed and only letters are retained.
Stop word removal – All stop words, such as 'what', 'a', 'the', 'any', 'I', etc., are removed since they are frequent and carry no information. These commonly used words are not very useful for improving clustering quality.
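The sketch below illustrates one way these pre-processing steps could be implemented in Python; the regular expressions and the small stop-word list are illustrative assumptions rather than the exact rules used in the experiments.
---------------------------------------------------------------------------
Sketch (Python): Log event pre-processing
---------------------------------------------------------------------------
import re

# A small illustrative stop-word list; a fuller list (e.g. from NLTK) would be used in practice.
STOP_WORDS = {"a", "an", "the", "i", "what", "any", "is", "of", "for", "and"}

def preprocess_log_event(raw_event):
    """Clean a raw log event into a list of tokens, following the steps listed above."""
    text = raw_event.lower()                                   # lowercase conversion
    text = re.sub(r"\d{4}-\d{2}-\d{2}[ t]\d{2}:\d{2}:\d{2}\S*", " ", text)  # drop ISO-style timestamps
    text = re.sub(r"\d+", " ", text)                           # remove remaining digits
    text = re.sub(r"[^a-z\s]", " ", text)                      # strip special characters, keep letters
    return [w for w in text.split() if w not in STOP_WORDS]    # stop-word removal

if __name__ == "__main__":
    sample = ("[com.tvcs.OvcsSubscriber.bcilAD] Unexpected disconnect of connection "
              "bcilConnection1[tcp://sydbcil4test1:34435] State: STREAMING -> CLOSED")
    print(preprocess_log_event(sample))
---------------------------------------------------------------------------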
F. TF-IDF Representation
A log event can be represented as binary data, where the presence or absence of a word (or string) in the event creates a binary vector. A more refined and commonly used representation weights the frequencies of the individual words in the log event against the frequencies of those words in the entire log corpus; this is known as the TF-IDF (Term Frequency times Inverse Document Frequency) representation [7]. The TF-IDF matrix for the log events is computed as below. The term frequency $tf_{ij}$ of the $i^{th}$ word in the $j^{th}$ log event is defined as

$$tf_{ij} = \begin{cases} 0 & \text{if } m_j = 0 \\ \dfrac{f_{ij}}{m_j} & \text{if } m_j > 0 \end{cases}$$

where $f_{ij}$, the frequency of the $i^{th}$ word in the $j^{th}$ log event, is normalized by dividing it by $m_j$, the number of words in the $j^{th}$ log event;

$j = 1,2,3 \dots n$, where $n$ is the count of log events in the set of log events $N$;
$i = 1,2,3 \dots m$, where $m$ is the count of the selected vocabulary (word) set $M$ in the collection of log events $N$;
i.e. $|M| = m$ and $|N| = n$.

The inverse document frequency $idf_i$ of the $i^{th}$ word is defined as

$$idf_i = \begin{cases} 0 & \text{if } n_i = 0 \\ \log\dfrac{n}{n_i} & \text{if } n_i > 0 \end{cases}$$

where $n_i$ is the number of log events in which the $i^{th}$ term appears across the whole set $N$ of log events. Now, I calculate a TF-IDF vector for every log event. The $j^{th}$ log event $l_j$ can be represented as

$$(tf_{1j} \times idf_1,\; tf_{2j} \times idf_2,\; tf_{3j} \times idf_3,\; \dots,\; tf_{mj} \times idf_m)$$

So finally, the entire collection of log events can be represented as an $n \times m$ matrix.
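To make the construction concrete, a direct NumPy implementation of exactly these definitions ($tf_{ij} = f_{ij}/m_j$, $idf_i = \log(n/n_i)$) could look like the sketch below; the tokenized input and the vocabulary handling are simplifying assumptions, and library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing.
---------------------------------------------------------------------------
Sketch (Python): Building the corpus TF-IDF matrix
---------------------------------------------------------------------------
import numpy as np

def build_tf_idf(tokenized_events):
    """Build the n x m TF-IDF matrix with tf = f_ij / m_j and idf = log(n / n_i)."""
    vocab = sorted({w for event in tokenized_events for w in event})
    index = {w: i for i, w in enumerate(vocab)}
    n, m = len(tokenized_events), len(vocab)

    tf = np.zeros((n, m))
    for j, event in enumerate(tokenized_events):
        if not event:                          # m_j = 0  ->  tf row stays 0
            continue
        for w in event:
            tf[j, index[w]] += 1.0
        tf[j] /= len(event)                    # normalize by the event length m_j

    df = np.count_nonzero(tf > 0, axis=0)      # n_i: number of events containing word i
    idf = np.where(df > 0, np.log(n / np.maximum(df, 1)), 0.0)
    return tf * idf, vocab                     # n x m matrix plus the vocabulary order
---------------------------------------------------------------------------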
IV. DOCUMENT CLUSTERING OVERVIEW
G. Similarity Measures
Here I attempt to cluster and evaluate the log events with two measures of similarity: the Euclidean distance and the Cosine similarity. As mentioned in [5], the choice of a similarity measure can be crucial to the performance of a clustering procedure. While the Euclidean or Mahalanobis distance has its advantages in some circumstances, the Cosine similarity captures the 'directional' characteristics that are intrinsic to the log event TF-IDF vectors. The Euclidean distance between two log event vectors $[p_1, p_2, p_3, \dots, p_M]$ and $[q_1, q_2, q_3, \dots, q_M]$ is defined as

$$e.d(p, q) = \sqrt{\sum_{i=1}^{M}(p_i - q_i)^2}$$

The Cosine distance between two log event vectors $[p_1, p_2, p_3, \dots, p_M]$ and $[q_1, q_2, q_3, \dots, q_M]$ is defined as

$$c.d(p, q) = 1 - \cos\theta = 1 - \frac{\sum_{i=1}^{M} p_i q_i}{\sqrt{\sum_{i=1}^{M} p_i^2}\;\sqrt{\sum_{i=1}^{M} q_i^2}}$$

The resulting cosine distance is a number from 0 to 1. When the result is 1, the two messages are completely different, and when the result is 0, the two messages are identical.
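A minimal sketch of the two dissimilarity functions, assuming dense NumPy vectors of equal length, is shown below.
---------------------------------------------------------------------------
Sketch (Python): Euclidean and Cosine dissimilarity
---------------------------------------------------------------------------
import numpy as np

def euclidean_distance(p, q):
    """e.d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

def cosine_distance(p, q):
    """c.d(p, q) = 1 - cos(theta); 0 for identical directions, 1 for orthogonal TF-IDF vectors."""
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    if denom == 0.0:                 # guard against all-zero vectors
        return 1.0
    return float(1.0 - np.dot(p, q) / denom)
---------------------------------------------------------------------------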
H. Clustering Techniques
Clustering techniques can be loosely divided into two classes: hierarchical and partitional clustering. While hierarchical clustering algorithms repeat a cycle of either merging smaller clusters into larger ones or dividing larger clusters into smaller ones, partitional clustering algorithms generate various partitions and then evaluate them by some distance/similarity criterion. Hierarchical clustering is traditionally slower on large datasets. In addition, partitional methods, where clusters are represented by centroids, have computational advantages when predicting cluster membership for new data, i.e. real time log events. Hence in this paper I consider only partitional methods.
1) Standard K-Means
In K-Means based clustering, a centroid, a central point ascertained by some similarity measure from a set of points, represents the set itself. The K-means algorithm to obtain K clusters is as follows:
1. Choose K random points as the initial centroids.
2. Assign every point to the closest centroid.
3. Re-compute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids do not change.
Standard K-means clustering iteratively attempts to find clusters in a data set such that a cost function of dissimilarity (or distance) is minimized. The dissimilarity measure is chosen as the Euclidean distance. A log event set $N$ of size $n$ is to be partitioned into $k$ categories, into a set of clusters $C$:

$l_j \in N;\; j = 1,2,3 \dots n$
$C_i \in C;\; i = 1,2,3 \dots k$

The cost function, based on the Euclidean distance, can be defined as

$$R = \sum_{i=1}^{k} R_i = \sum_{i=1}^{k} \sum_{l_j \in C_i} \| l_j - \mu_i \|^2$$

where $\mu_i$ is the mean (centroid) of the log events in category $C_i$.
2) Spherical K-Means
Spherical K-Means [3] is essentially K-Means attempted with Cosine similarity. Just like Standard K-means clustering, it iteratively attempts to find clusters in a data set such that a cost function based on the Cosine dissimilarity (or distance) measure is minimized. For a log event set $N$ of size $n$, $l_j \in N;\; j = 1,2,3 \dots n$, that is to be partitioned into $k$ categories, $C_i \in C;\; i = 1,2,3 \dots k$, the cost function is defined as

$$R = \sum_{i=1}^{k} R_i = \sum_{i=1}^{k} \sum_{l_j \in C_i} \left(1 - \cos(l_j, \mu_i)\right)$$

where $\mu_i$ is the mean (centroid) of the log events in cluster $C_i$. For high-dimensional data such as text documents and market baskets, Spherical K-Means has been shown to be superior to Standard K-Means.
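A minimal sketch of Spherical K-Means on unit-normalized TF-IDF rows is given below; the random initialization and the empty-cluster handling are simplifying assumptions rather than the exact procedure of [3].
---------------------------------------------------------------------------
Sketch (Python): Spherical K-Means
---------------------------------------------------------------------------
import numpy as np

def spherical_kmeans(X, k, iters=100, seed=0):
    """Cluster the rows of X by cosine similarity, keeping points and centroids on the unit sphere."""
    rng = np.random.default_rng(seed)
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)   # unit-normalize rows
    centroids = X[rng.choice(len(X), size=k, replace=False)]              # random initial centroids

    for _ in range(iters):
        labels = np.argmax(X @ centroids.T, axis=1)    # nearest centroid = highest cosine similarity
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members):                           # keep the old centroid if a cluster empties
                mean = members.mean(axis=0)
                new_centroids[i] = mean / max(np.linalg.norm(mean), 1e-12)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
---------------------------------------------------------------------------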
I. Evaluation Measures
For clustering, two types of quality measures are generally used. The first type allows comparison of multiple cluster sets without any reference to the actual classes, and is hence referred to as an internal quality measure. The second type compares the categories created by the clustering technique to the original classes, and is hence referred to as an external quality measure. I use two external measures, Entropy and Purity, and one internal measure, the Silhouette Index, to evaluate the clustering.
1) Entropy
Entropy [9] is an external quality measure. Initially the class distribution is calculated for every cluster, i.e., I compute $P_{ij}$, the probability that a member of cluster $j$ belongs to class $i$. The entropy of every cluster, $E_j$, is then calculated using the formula below, where the sum runs over the classes:

$$E_j = -\sum_{i} P_{ij} \log(P_{ij})$$

The total entropy is then the sum of the individual cluster entropies weighted by cluster size:

$$E = \sum_{j=1}^{k} \frac{E_j \times n_j}{n}$$

where $n_j$ is the size of cluster $j$, $k$ is the number of clusters, and $n$ is the number of log events.
2) Purity
Purity [9] is another external quality measure. Purity is calculated by assigning each cluster to the class label that is most frequent within it. The number of correctly assigned log events is then counted and divided by $n$, the total number of log events:

$$P = \frac{1}{n} \sum_{i=1}^{k} \max_{j} |c_i \cap z_j|$$

where $C = \{c_1, c_2, c_3, \dots, c_k\}$ is the set of generated clusters and $Z = \{z_1, z_2, z_3, \dots, z_k\}$ is the set of original classes.
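A small sketch of both external measures, assuming integer-encoded cluster labels and true class labels, might look as follows.
---------------------------------------------------------------------------
Sketch (Python): Entropy and Purity
---------------------------------------------------------------------------
import numpy as np

def entropy_and_purity(labels, classes):
    """Size-weighted entropy (lower is better) and purity (higher is better) of a clustering."""
    n = len(labels)
    total_entropy, correct = 0.0, 0
    for c in np.unique(labels):
        members = classes[labels == c]
        counts = np.bincount(members)
        probs = counts[counts > 0] / len(members)       # P_ij over the true classes in cluster c
        total_entropy += (len(members) / n) * -np.sum(probs * np.log(probs))
        correct += counts.max()                         # majority-class count, used for purity
    return total_entropy, correct / n
---------------------------------------------------------------------------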
3) Silhouette Index
The Silhouette Index is a combined measure of how well each data point lies within its own cluster and how dissimilar each data point is to its closest neighbouring cluster. It is computed as below [6]. A log event set $N$ of size $n$ is partitioned into $k$ categories, into a set of clusters $C$. Each log event $l_j$ is such that

$l_j \in N;\; j = 1,2,3 \dots n$
$C_i \in C;\; i = 1,2,3 \dots k$

$n_i$ is the size of cluster $C_i$, i.e. $n_i = |C_i|$.

The average distance $a_j^i$ between the $j^{th}$ log event $l_j$ in the cluster $C_i$ and all the other log events in the same cluster is calculated as

$$a_j^i = \frac{1}{n_i - 1} \sum_{z=1, z \neq j}^{n_i} dist(l_j^i, l_z^i), \qquad j = 1,2,3 \dots n_i;\; l_j^i \in C_i;\; l_z^i \in C_i$$

The average distance to the neighbouring cluster, $b_j^i$, is calculated next. It is the minimum average distance between the $j^{th}$ log event $l_j$ in the cluster $C_i$ and all the log events clustered in the other clusters except $C_i$:

$$b_j^i = \min_{\substack{r = 1,2 \dots k \\ r \neq i}} \left\{ \frac{1}{n_r} \sum_{z=1}^{n_r} dist(l_j^i, l_z^r) \right\}, \qquad j = 1,2,3 \dots n_i;\; l_j^i \in C_i;\; l_z^r \in C_r$$

Then, as described in [10], the silhouette width of the cluster $C_i$ is defined in the following way:

$$SW_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{b_j^i - a_j^i}{\max\{b_j^i, a_j^i\}}$$

And the Silhouette Index of the entire clustering is taken as the average:

$$S.I = \frac{1}{k} \sum_{i=1}^{k} SW_i$$
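A direct, if slow (quadratic in the number of events), sketch of this cluster-averaged Silhouette Index is shown below; the default Euclidean distance and the handling of singleton clusters are assumptions, and it presumes at least two clusters.
---------------------------------------------------------------------------
Sketch (Python): Silhouette Index
---------------------------------------------------------------------------
import numpy as np

def silhouette_index(X, labels, dist=lambda p, q: float(np.linalg.norm(p - q))):
    """Average of the per-cluster silhouette widths SW_i; `dist` may be any pairwise dissimilarity."""
    cluster_ids = np.unique(labels)               # assumes at least two clusters
    widths = []
    for ci in cluster_ids:
        members = np.where(labels == ci)[0]
        s_values = []
        for j in members:
            same = [dist(X[j], X[z]) for z in members if z != j]
            a_j = np.mean(same) if same else 0.0  # singleton cluster: a_j defaults to 0
            b_j = min(np.mean([dist(X[j], X[z]) for z in np.where(labels == cr)[0]])
                      for cr in cluster_ids if cr != ci)
            s_values.append((b_j - a_j) / max(b_j, a_j) if max(b_j, a_j) > 0 else 0.0)
        widths.append(np.mean(s_values))
    return float(np.mean(widths))
---------------------------------------------------------------------------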
V. EXPERIMENTAL DATASETS
I used three separate log data set pairs (for the corpus and the real time data). Each dataset was obtained from the IT logs of a single real banking service provider. The log events typically included overflow alerts from messaging queues, database adapter connection failures, application server cluster failover warnings and SOAP web-service failure messages. Each of the real time data sets contains 10 new categories and 10 old categories that exist in the corpus. I used three corpus - real time pairs with log event counts of 13200-20, 17600-20 and 18700-20. The number of words in the pairs
were 189640-327, 323950-323 and 365530-370. All three corpus sets had 20 categories, and each real time data set contained 20 old and 20 new categories. The log data was fetched automatically from the systems/servers. Being from disparate sources, there was no uniformity in the set of original log attributes. However, in general the log data contained (but was not restricted to) the following original log attributes:
Time stamp of the log event
Source of the log event, such as the class, function, or filename
Log category information, for example the severity - INFO, WARN, ERROR & DEBUG
Script/process names
Brief summary of the log event
Detailed description of the log event/error description/event message
Out of these log attributes, only the last attribute, i.e. the detailed description of the log event/error description/event message, was used for this paper.
VI. LOG CORPUS CATEGORIZATION
J. Methodology – Batch Task
The methodology to cluster the log corpus in batch mode is described below. The steps in the batch process may be repeated at frequent intervals as an offline process.
Step 1: Pre-process the corpus log data as described in Section 3.1 to clean and filter the data.
Step 2: Build the TF-IDF matrix from the corpus dataset, i.e. a $T_{n \times m}$ matrix, where $m$ is the total count of words in the collection excluding the stop words and $n$ is the number of log events in the log corpus.
Step 3: Apply the document clustering algorithm (Standard K-Means / Spherical K-Means) to the generated TF-IDF matrix. Generate a centroid set, a $C_{k \times m}$ matrix of dimension $k \times m$, where $k$ is the number of clusters and $m$ is the total count of words in the collection excluding the stop words.
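As a hedged illustration, the batch task could be approximated end to end with scikit-learn as sketched below; note that TfidfVectorizer's weighting differs slightly from the formulas in Section 3, and running KMeans on unit-normalized vectors only approximates Spherical K-Means. The file name and the value of k are placeholders.
---------------------------------------------------------------------------
Sketch (Python): Batch clustering of the log corpus
---------------------------------------------------------------------------
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Hypothetical input: one pre-processed log event per line (Step 1 already applied).
with open("corpus_events.txt") as fh:                  # placeholder file name
    events = [line.strip() for line in fh if line.strip()]

vectorizer = TfidfVectorizer()                         # Step 2: build the n x m TF-IDF matrix
T = vectorizer.fit_transform(events)

X = normalize(T)                                       # unit rows, so Euclidean KMeans ~ spherical k-means
k = 20                                                 # assumed number of clusters
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

centroids = km.cluster_centers_                        # Step 3: k x m centroid set, reused by the real time task
print("cluster sizes:", np.bincount(km.labels_))
---------------------------------------------------------------------------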
For the purpose of the experiment, each of the two algorithms, Standard K-Means and Spherical K-Means, is run on each of the 3 corpus datasets (P1-Corpus, P2-Corpus, P3-Corpus) for 10 iterations each. The clustered data was evaluated for Purity, Entropy and Silhouette Index scores.
Figure 1: Entropy Comparison
K. Results and Evaluation
Because of the high number of dimensions in the dataset, it is difficult to present a visual representation of the clusters. The performance measures are shown in Figures 1, 2 and 3. The Entropy results for each of the 3 datasets (P1-Corpus, P2-Corpus, and P3-Corpus) are shown in Figure 1 - smaller is better. The Purity comparison is shown in Figure 2 - higher is better. The Silhouette Index comparison is shown in Figure 3 - higher is better. Notice that Spherical K-Means outperforms Standard K-Means in all three measures for each of the 3 data sets. Hence I can reasonably conclude that Spherical K-Means is better suited for corpus log categorization.
Figure 2: Purity Comparison
Another interesting aspect that can be observed from the plots is that the performance of Standard K-Means degrades as the number of dimensions in the log event data set increases. Note that the difference in performance between Standard K-Means and Spherical K-Means is highest for the P3 Corpus. With all other factors (iterations, number of clusters) remaining the same, it is reasonable to attribute this difference to the dimensionality. The performance of Spherical K-Means remains fairly consistent.
Figure 3: Silhouette Index Comparison
VII. REAL TIME LOG CATEGORIZATION
Here the real time log event has to be associated with one distinct category from the clustered log corpus, or identified as a new category unobserved in the log corpus. This is an online process.
L. Methodology – Real Time Task
Similar to the description in [10], the methodology to dynamically categorize real time log events is described below.
Step 1: For each incoming real time log event, apply the pre-processing steps described in Section 3.1 to clean and filter the data.
Step 2: Generate a TF-IDF vector for the real time log event based on the pre-built TF-IDF matrix of the log corpus. If the corpus TF-IDF is an $n \times m$ matrix, then the real time log event TF-IDF would be a $1 \times (m + \Delta m)$ matrix, where $m$ is the count of words in the corpus and $\Delta m$ is the count of unseen words, disregarding the oft-repeated stop words, in the streaming log event. An explanation is provided in Section 7.2.
Step 3: Adapt the centroid set generated while clustering the log corpus to suit the TF-IDF vector of the real time log event. If the log corpus centroid set is a $k \times m$ matrix, then the adapted centroid set would be a $k \times (m + \Delta m)$ matrix, where $k$ is the number of clusters from the corpus, $m$ is the word count in the corpus and $\Delta m$ is the unseen word count, disregarding the oft-repeated stop words, in the streaming log event. An explanation is provided in Section 7.3.
Step 4: The closest centroid is then selected (based on the distance measures described above) from the adapted k-centroid collection. However, if the distance from the nearest centroid to the real time event is larger than the appropriate Critical Distance Measure (RMWSS, Cluster Radius, Silhouette Threshold), then classify the real time event with an unseen label.
Step 5: If not, the cluster with the nearest centroid is assigned as the label.
For the purpose of the experiment, the above steps are run on each of the 3 real time datasets (P1-Real Time, P2-Real Time, P3-Real Time) against the clusters built from their corresponding corpus datasets (P1-Corpus, P2-Corpus, P3-Corpus) for 10 iterations each. The steps are repeated with the Euclidean distance measure on the Standard K-Means generated centroids and with the Cosine distance on the Spherical K-Means generated centroids. To evaluate the different critical distance measures, the accuracy of real time events classified as old or new (similar events observed or unobserved in the original corpus) is calculated. Similarly, accuracy scores are calculated for real time events classified correctly into their true class from the original corpus.
M. Real Time TF-IDF Generation
For each real time log event, the TF-IDF vector has to be generated dynamically. The following factors should be considered while determining the real time log event TF-IDF vector. To measure the distance to the centroids, the TF-IDF vector of the real time log event should be comparable to the corresponding matrix built from the corpus. The real time event might contain a subset of the current vocabulary from the corpus, a collection of words unseen in the corpus (in any log event), or a combination of both (the most likely case). The process to generate the TF-IDF vector of the real time log event should be computationally inexpensive. Although adding the new real time event to the whole corpus and regenerating the TF-IDF matrix would theoretically yield the same result, it is not computationally realistic for a dynamic real time process.
$N$ is the set of all events from the corpus:
$N = \{l_1, l_2, l_3, l_4, \dots, l_n\}$, where $l$ represents an individual log event.
$M$ is the set of the selected vocabulary (words) in the set of log events $N$ in the log corpus:
$M = \{w_1, w_2, w_3, w_4, \dots, w_m\}$, where $w$ represents an individual word; $m$ and $n$ are the vocabulary size and the number of log events in the corpus respectively, i.e. $|M| = m$ and $|N| = n$.
$T_{n \times m}$ is the TF-IDF matrix built from the corpus dataset, and $T_{pq}$ is the matrix entry for the $p^{th}$ row and the $q^{th}$ column, i.e. the $q^{th}$ word in the $p^{th}$ log event.
$M^R$ is the set of the selected vocabulary from the real time event:
$M^R = \{w_1^R, w_2^R, w_3^R, w_4^R, \dots, w_r^R\}$, where $w_i^R$ represents the $i^{th}$ term from the real time event and $r$ is the word count of the real time event.
$T^R_{1 \times (m + \Delta m)}$ is the TF-IDF vector of the real time log event that needs to be generated, and $\Delta m$ is the count of unseen words, disregarding the oft-repeated stop words, in the streaming log event. The algorithm to generate the TF-IDF vector of the streaming event is described in Algorithm 1.
---------------------------------------------------------------------------
Algorithm 1: TF-IDF for Real Time Events
---------------------------------------------------------------------------
Initialize the $T^R_{1 \times m}$ matrix {with the same columns as $T_{n \times m}$}
for each word $w_i^R \in M^R$; $i = 1,2,3 \dots r$
    if $w_i^R \notin M$ then
        $M \leftarrow \{M, w_i^R\}$ {add the unseen term to the collection}
        $m \leftarrow m + 1$ {increase the collection size by 1}
        $T^R_{1m} \leftarrow \dfrac{count(w_i^R)}{r} \times \log\dfrac{n+1}{1}$
    else
        Find $j$ where $w_j = w_i^R$; $w_i^R \in M^R$, $w_j \in M$
        $c(w_i^R) = rowcount(T_{n \times m}) \mid T_{pj} \neq 0$; $p = 1,2,3 \dots n$
        $T^R_{1j} \leftarrow \dfrac{count(w_i^R)}{r} \times \log\dfrac{n+1}{c(w_i^R)+1}$
    end if
end for
---------------------------------------------------------------------------
First, the new TF-IDF vector $T^R_{1 \times m}$ of the real time log event is initialized. Both this vector and the original matrix have exactly the same number of columns, with the vector containing a single row. In the next step, I loop over every term in the real time event. If the word is not present in the log corpus vocabulary, it needs to be added to $T^R_{1 \times m}$; this is accomplished in steps 4 and 5. The TF-IDF is then computed in the next step; the document count used to ascertain the IDF is now $(n + 1)$, i.e. the corpus size plus one for the real time event, and the denominator is 1 because this word is not found in any other log event. For words already in the vocabulary, I take the valid row count for the word in the original TF-IDF matrix from the corpus in step 9; this count is used to calculate the IDF in step 10.
Computational Advantage
In this algorithm I iterate only over the $r$ dimensions of the real time log event. Since $r \ll m$, this algorithm is far more computationally efficient than regenerating the whole TF-IDF matrix.
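As a hedged illustration, Algorithm 1 could be implemented along the following lines; the token list, corpus matrix and vocabulary structures are assumptions carried over from the earlier batch-task sketches.
---------------------------------------------------------------------------
Sketch (Python): TF-IDF vector for a streaming event (Algorithm 1)
---------------------------------------------------------------------------
import numpy as np

def realtime_tf_idf(tokens, corpus_T, vocab):
    """Build the 1 x (m + delta_m) TF-IDF vector of a streaming event from the corpus matrix."""
    n = corpus_T.shape[0]                           # number of corpus log events
    r = len(tokens)                                 # word count of the real time event
    vocab = list(vocab)                             # copy, so the corpus vocabulary is not mutated
    index = {w: j for j, w in enumerate(vocab)}
    doc_freq = np.count_nonzero(corpus_T > 0, axis=0)   # c(w): corpus events containing each word

    t_r = np.zeros(len(vocab))
    for w in set(tokens):
        tf = tokens.count(w) / r
        if w not in index:                          # unseen word: extend the vocabulary and the vector
            index[w] = len(vocab)
            vocab.append(w)
            t_r = np.append(t_r, tf * np.log((n + 1) / 1.0))
        else:                                       # seen word: IDF uses the corpus count plus this event
            j = index[w]
            t_r[j] = tf * np.log((n + 1) / (doc_freq[j] + 1.0))
    return t_r, vocab
---------------------------------------------------------------------------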
N. K-Centroid dimension adaptation
While trying to compare and find the distance between the real time event and each centroid from the set of corpus log clusters, it is essential that the columns of the matrices match. Hence there is a need to adapt, at run time, the centroid matrix generated while clustering the log corpus to suit the TF-IDF vector of the real time log event.
Original dimension of the centroid matrix: $k \times m$
Dimension of the TF-IDF vector of the real time event: $T^R_{1 \times (m + \Delta m)} = 1 \times (m + \Delta m)$, where $\Delta m$ is the count of unseen words in the streaming event that were absent in the log corpus.
Dimension of the adapted centroid matrix: $k \times (m + \Delta m)$
The adaptation is done by simply appending $\Delta m$ new columns, representing the new words, to the original centroid matrix. Each entry of the $k$ rows in the $\Delta m$ new columns is initialized to zero.
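A one-function sketch of this zero-padding adaptation, assuming a dense NumPy centroid matrix, is:
---------------------------------------------------------------------------
Sketch (Python): Centroid dimension adaptation
---------------------------------------------------------------------------
import numpy as np

def adapt_centroids(centroids, delta_m):
    """Pad the k x m centroid matrix with delta_m zero columns to match the 1 x (m + delta_m) event vector."""
    k = centroids.shape[0]
    return np.hstack([centroids, np.zeros((k, delta_m))])
---------------------------------------------------------------------------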
O. Critical Distance Measure
Once the TF-IDF of the streaming event is generated and the k centroids are adapted to suit it, I take the least-dissimilarity approach to find the closest centroid. However, along with finding the nearest category in the log corpus for the streaming event, it also has to be identified whether the streaming event is a new category, unobserved in the log corpus.
Figure 1: Real Time Log Event Categorization
To do this, I propose using a Critical Distance measure. If the distance of the real time log event to the closest centroid is greater than the Critical Distance, then it is considered a new category of log event. As shown in Figure 1, real time log events r1 and r2 are assigned to Cluster 3 and Cluster 1 respectively. However, the distance of r3 to its nearest cluster, Cluster 1, is greater than the Critical Distance, and hence I associate it with a new category.
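Putting the decision rule together, a minimal sketch (assuming per-cluster Critical Distances computed from one of the measures defined below, and one of the dissimilarity functions from Section 4) could be:
---------------------------------------------------------------------------
Sketch (Python): Assigning a streaming event or flagging a new category
---------------------------------------------------------------------------
import numpy as np

def categorize_event(t_r, adapted_centroids, critical, dist):
    """Return the index of the closest cluster, or -1 if the event is a new, unseen category."""
    distances = np.array([dist(t_r, c) for c in adapted_centroids])
    nearest = int(np.argmin(distances))
    if distances[nearest] > critical[nearest]:     # farther than that cluster's Critical Distance
        return -1                                  # new category (like r3 in Figure 1)
    return nearest                                 # existing category (like r1 and r2)
---------------------------------------------------------------------------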
Table 4 : Clustered Corpus Log Events
Cluster Corpus Log Event
1 [RequestHandler.D.42] Book Request is
rejected because of unexpected exception:
Invalid customer account: LBM2XT12.
1 [RequestHandler.D.42] Book Request is
rejected because of unexpected exception:
Invalid customer account: AVBGV711.
1 [RequestHandler.D.42] Book Request is
rejected because of unexpected exception:
Invalid customer account: ATUIF169.
1 [RequestHandler.D.42] Book Request is
rejected because of unexpected exception:
Invalid customer account: CRJAY123.
2 [bcil.TradeClient.java] ProtocolException
Unexpected format Soap@17fcvve9 <?xml
msg = Ordinal Request for Price Lead Cancel
Stream bound/>
2 [bcil.TradeClient.java] ProtocolException
Unexpected format Soap@25515779 <?xml
msg = Trade and balance loading work flow
run in progress/>
2 [bcil.TradeClient.java] ProtocolException
Unexpected format Soap@gcfd88a1 <?xml
msg = The view selected contain risk from the
Fraud System/>
The chosen Critical Distance should be dynamic
For higher-accuracy categorization, it is essential that the chosen Critical Distance is dynamic, i.e. it should vary according to the distribution of log events in the cluster. The reason is explained in the example below.
Table 5: Real time Log Events
Cluster Real Time Log Event
NEW [RequestHandler.D.42] Cancel Request is
rejected because of unexpected exception:
Invalid organization account: CRTPBLA.
2 [bcil.TradeClient.java] Protocol Exception
Unexpected format Soap@gcfd88a1 <?xml
msg = Micro trades sent to target successfully
by overflow/>
Consider the real examples in Table 4, which contains clustered log events from the corpus, and the examples in Table 5, which contains the incoming real time events. The first real time event in Table 5 looks deceptively similar to the category of log events in Cluster 1. On a word-to-word comparison with Category 1, only 3 words differ out of an average log event length of 13, i.e. an approximate similarity of 10/13 = 76%. However, on closer observation, it is obvious that this real time event is a new category. Note that in this example the variation within Cluster 1 is very small; typically only 1 word varies between the events, i.e. an approximate similarity of 12/13 = 92%. This is a very compact cluster. The second real time event in Table 5 varies in approximately 9 words out of an average log event length of 16, i.e. an approximate similarity of 7/14 = 50%. Even with this poor similarity, this event most likely belongs to Category 2. Note that Cluster 2 has a large radius, due to the higher variation within its members. I propose three different Critical Distance Measures, beyond which the event belongs to a new category, and evaluate each of them against the test datasets in Table 3.
P. RMWSS
The Root Mean Within Sum of Squares (RMWSS) is a conservative measure of the Critical Distance for each cluster. For a log event set $N$ of size $n$ that is clustered into $k$ categories, into a set of clusters $C$,

$l_j \in N;\; j = 1,2,3 \dots n$
$C_i \in C;\; i = 1,2,3 \dots k$

the RMWSS of cluster $C_i$, $RMWSS_i$, can be defined as

$$RMWSS_i = \sqrt{\frac{1}{n_i} \sum_{l_j \in C_i} \left[dist(l_j, \mu_i)\right]^2}$$

where $\mu_i$ is the mean (centroid) of the log events in category $C_i$ and $n_i$ is the number of log events in cluster $C_i$. $dist(l_j, \mu_i)$ is the Euclidean or Cosine distance between $l_j$ and $\mu_i$.
Q. Cluster Radius
The cluster radius is defined as the distance from the cluster centroid to the farthest point within the cluster, i.e. the maximum of the distances between each member of the cluster and its centroid. For a log event set $N$ of size $n$ that is clustered into $k$ categories, into a set of clusters $C$,

$l_j \in N;\; j = 1,2,3 \dots n$
$C_i \in C;\; i = 1,2,3 \dots k$
the Cluster Radius of cluster $C_i$, $CR_i$, can be defined as

$$CR_i = \max_{l_j \in C_i}\{dist(l_j, \mu_i)\}$$

where $\mu_i$ is the mean (centroid) of the log events in category $C_i$ and $n_i$ is the number of log events in cluster $C_i$. $dist(l_j, \mu_i)$ is the Euclidean or Cosine distance between $l_j$ and $\mu_i$.
R. Silhouette Threshold
The Silhouette Threshold of a cluster is a combined measure of the silhouette width of the cluster and the radius of that cluster. As mentioned in [10], the Silhouette Width of an individual cluster is an indication of how well each entity resides within the cluster and how distant each entity is from its nearest neighbouring cluster. The silhouette width of the cluster $C_i$ is defined as below:

$$SW_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{b_j^i - a_j^i}{\max\{b_j^i, a_j^i\}}$$

The silhouette width varies from -1 to 1. The cluster radius of the cluster $C_i$ is defined as below:

$$CR_i = \max_{l_j \in C_i}\{dist(l_j, \mu_i)\}$$

Then I define the Silhouette Threshold as:

$$ST_i = CR_i\,[1 + SW_i]$$
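A compact sketch computing all three per-cluster measures, assuming the cluster assignments, centroids and per-cluster silhouette widths are already available from the batch task, might be:
---------------------------------------------------------------------------
Sketch (Python): Per-cluster Critical Distance measures
---------------------------------------------------------------------------
import numpy as np

def critical_distances(X, labels, centroids, silhouette_widths, dist):
    """Per-cluster RMWSS, Cluster Radius and Silhouette Threshold ST_i = CR_i * (1 + SW_i)."""
    k = centroids.shape[0]                          # assumes every cluster is non-empty
    rmwss, radius = np.zeros(k), np.zeros(k)
    for i in range(k):
        members = X[labels == i]
        d = np.array([dist(x, centroids[i]) for x in members])
        rmwss[i] = np.sqrt(np.mean(d ** 2))         # root mean within sum of squares
        radius[i] = d.max()                         # distance to the farthest member
    return {"rmwss": rmwss,
            "radius": radius,
            "silhouette_threshold": radius * (1.0 + silhouette_widths)}
---------------------------------------------------------------------------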
S. Results & Evaluation
First I analyze the accuracy of categorizing the real time events as an old or a new category (observed or unobserved in the original corpus) and compare the results for each of the Critical Distance Measures (RMWSS, Cluster Radius and Silhouette Threshold). The experiments are carried out for each of the 3 datasets (P1-Real Time, P2-Real Time, and P3-Real Time) with both the Euclidean and Cosine dissimilarity measures. The accuracy is calculated as

$$Accuracy = \frac{\text{No. of real time events correctly categorized as old or new}}{\text{Total no. of real time events}}$$

The accuracy measures are shown in Figures 4 and 5.
T. Critical Threshold Measures Differentiation
The Cluster Radius and RMWSS do not perform well as Critical Distance measures, mainly because these measures have no positive/negative limiting factor. For example, for a cluster with high variation within its members, the probability of wrongly categorizing a streaming event as belonging to that cluster is high. If one uses the Radius or RMWSS as the threshold measure, there is nothing in these measures to contain such wrong categorization arising from poor clustering performance.
Figure 4: Accuracy vs. Iterations – Euclidean Distance
It is evident that the Silhouette Threshold as a Critical Distance Measure outperforms RMWSS and Cluster Radius for each of the datasets, for both Cosine and Euclidean dissimilarity. The accuracy levels of the Cluster Radius, though relatively better than those of RMWSS, are not as good as those of the Silhouette Threshold for most iterations.
Figure 5: Accuracy vs. Iterations – Cosine Distance
Next, I analyze the accuracy of correctly categorizing a real time event into its true category, given that similar events were previously observed in the original corpus. The experiments are carried out for each of the 3 datasets (P1-Real Time, P2-Real Time, and P3-Real Time) with both the Euclidean and Cosine dissimilarity measures. This comparison is presented in Figure 6. Among the real time events that were present in the original corpus, the accuracy is calculated as

$$Accuracy = \frac{\text{No. of real time events correctly categorized into their original class}}{\text{Total no. of real time events}}$$
It can be seen that, for prediction of the class, there is no significant distinction between the Euclidean and Cosine distance measures. Hence, once the original corpus is clustered with high accuracy, the distance measure chosen to find the closest centroid/category in the corpus is relatively insignificant.
Figure 6: Real Time Events – Closest Category Prediction
U. Why Silhouette Threshold Performs Better
Figure 7: Sample Clustering with High vs. Low Silhouette
Width
As mentioned before, the Silhouette Width is defined as

$$SW_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{b_j^i - a_j^i}{\max\{b_j^i, a_j^i\}}$$
Equivalently,

$$SW_i = \begin{cases} 1 - \dfrac{a_i}{b_i} & \text{if } a_i < b_i \\ 0 & \text{if } a_i = b_i \\ \dfrac{b_i}{a_i} - 1 & \text{if } a_i > b_i \end{cases}$$

Hence $-1 \leq SW_i \leq 1$.
A cluster with a high average distance to its neighbouring clusters ($b$) and a low within-cluster distance ($a$) will have a silhouette width greater than 0. Conversely, a cluster with a low average distance to its neighbouring clusters and a high within-cluster distance will have a silhouette width less than 0. This is shown in Figure 7. The Silhouette Threshold is chosen as:

$$ST_i = CR_i\,[1 + SW_i]$$

For a cluster with high variation within its members, the probability of wrongly categorizing a streaming event as belonging to that cluster is high; hence the Critical Distance should be less than the Cluster Radius. So, hypothetically, for a really bad cluster ($a_i \gg b_i$), the Silhouette Width would be -1 and the Silhouette Threshold would be zero, indicating that it is better not to associate any real time event with it. Conversely, for a cluster with high inter-cluster dissimilarity, there is more leeway to assign a new real time log event to that cluster; hence the Critical Distance may be more than the Cluster Radius. The Silhouette Width acts as a balancing measure in choosing the Silhouette Threshold as the Critical Distance Measure.
VIII. CONCLUSION
In this paper, I have introduced various techniques for real time log categorization. First, I experimentally evaluated traditional document clustering techniques for categorization of the corpus. Then I proposed computationally efficient strategies for categorizing real time events. However, the major contribution of this work lies in the introduction of Critical Distance Measures to identify whether similar real time events are unobserved in the corpus. I have found that, among the centroid methods evaluated, Spherical K-Means performs well with log events. I have also found that the new measure introduced here, the Silhouette Threshold, performs well as a Critical Distance Measure for identifying new log event categories. Finally, there is plenty of scope for future work. It would be interesting to combine the proposed mechanism with the ordering of terms in the log events; an ordering-weighted spherical k-means is likely to produce much better results in log categorization. Similarly, a framework that uses the proposed mechanism along with trouble tickets would be of considerable industrial interest.
Appendix
Attached is a summary workflow for industrial applications
Acknowledgment
I sincerely thank Mr. Sudheer Koppineni, Digital Enterprise,
TCS and Danteswara Rao Pilla, TCS for their guidance and
contributions
References
[1] K. Hammouda and M. Kamel, "Collaborative Document Clustering," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM'06), pp. 453-463, Apr. 2006.
[2] Michael Steinbach, G. Karypis and Vipin Kumar, "A comparison of document clustering techniques," KDD Workshop on Text Mining, 2000.
[3] Kurt Hornik, Ingo Feinerer, Martin Kober and Christian Buchta, "Spherical k-Means Clustering," Journal of Statistical Software, Vol. 50, Issue 10, Sep 2012.
[4] P. Spyns, "Natural language processing," Methods of Information in Medicine, vol. 35, no. 4, pp. 285-301, 1996.
[5] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh and Suvrit Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, 6:1345-1382, 2005.
[6] N. Bolshakova and F. Azuaje, "Cluster Validation Techniques for Genome Expression Data," Signal Processing, 83, 2003, pp. 825-833.
[7] J. Ramos, "Using TF-IDF to Determine Word Relevance in Document Queries," (2012) September 19, http://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf
[8] Li Weixi, "Automatic Log Analysis using Machine Learning: Awesome Automatic Log Analysis version 2.0," (2013) November 19, http://uu.diva-portal.org/smash/get/diva2:667650/FULLTEXT01.pdf
[9] Stanford University, Evaluation of Clustering, http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
[10] Jayadeep Jacob, "Real-time categorization of log events," (2015) March 17, https://www.google.co.uk/patents/US20160196174