Final proj 2 (1)

1. INTRODUCTION
Data mining is the discovery of unknown patterns from both heterogeneous and homogeneous databases. Secure data mining helps to discover the association rules shared by homogeneous databases (same schema, but the data resides with different entities). The algorithm not only finds the union and intersection of association rules, with the support and confidence that hold over the total database, but also ensures that the data held by each party is authenticated.

It is estimated that the volume of data in the digital world increased from 161 exabytes in 2007 to 998 exabytes in 2011, about 18 times the amount of information present in all the books ever written, and it continues to grow exponentially. This large amount of data has a direct impact on Computer Data Inspection, which can be broadly defined as the discipline that combines several elements of data and computer science to collect and analyze data from computer systems in a way that is admissible as evidence. A single inspection can involve examining hundreds of thousands of files per computer. This activity exceeds the expert's capacity for analysis and interpretation of data. Therefore, methods for automated data analysis, like those widely used in machine learning and data mining, are of paramount importance. In particular, algorithms for recognizing patterns in the information present in text documents are promising.

Clustering algorithms are typically used for exploratory data analysis, where there is little or no prior knowledge about the data. This is precisely the case in several applications of Computer Data Inspection, including the one addressed in our work. From a more technical viewpoint, our datasets consist of unlabeled objects: the classes or categories of documents that can be found are a priori unknown. Moreover, even assuming that labeled datasets were available from previous analyses, there is almost no hope that the same classes (possibly learned earlier by a classifier in a supervised learning setting) would still be valid for upcoming data obtained from other computers and associated with different investigations. More precisely, the new data sample is likely to come from a different population. In this context, clustering algorithms, which are capable of finding latent patterns in the text documents found on seized computers, can enhance the analysis performed by the expert examiner. Clustering algorithms have been studied for decades, and the literature on the subject is huge. Therefore, we
decided to choose a set of representative algorithms in order to show the potential of the proposed approach, namely: the partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link algorithms, the cluster ensemble algorithm known as CSPA, and the cosine similarity function. These algorithms were run with different combinations of their parameters, resulting in several different algorithmic instantiations. Thus, as a contribution of our work, we compare their relative performance on the studied application domain, using different sample text data sets containing information on topics such as sports, food habits, culture, and animals.

1.1 Background and Motivation

The main scope of this project is computer data analysis, in which hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. It is well known that the number of clusters is a critical parameter of many algorithms and that it is usually unknown a priori. As far as we know, however, the automatic estimation of the number of clusters has not been investigated in the Computer Data Analysis literature. Indeed, we could not locate even one work that is reasonably close in its application domain and that reports the use of algorithms capable of estimating the number of clusters. Perhaps even more surprising is the lack of studies on hierarchical clustering algorithms, which date back to the sixties.

1.2 Problem Statement

The problem addressed is how to identify documents stored in scattered locations inside a computer during a computer inspection. Inspections are carried out regularly in organizations in order to locate particular data, and it is very difficult to identify that data with the existing algorithms. We therefore propose a new system that identifies the documents easily and clusters them according to the matching attributes present in the system.
2. LITERATURE SURVEY
A literature survey is an important step in the software development process. Before developing the tool it is necessary to assess the time factor, the economy, and the company's strength. Once these are satisfied, the next step is to determine which operating system and language can be used for developing the tool. Once the programmers start building the tool they need a lot of external support, which can be obtained from senior programmers, from books, or from websites. The considerations above were taken into account before building the proposed system.

2.1 Cluster ensembles: A knowledge reuse framework for combining multiple partitions

This work introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined those partitionings. We first identify several application scenarios for the resulting 'knowledge reuse' framework, which we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information. In addition to a direct maximization approach, we propose three effective and efficient techniques for obtaining high-quality combiners (consensus functions). The first combiner induces a similarity measure from the partitionings and then reclusters the objects. The second combiner is based on hypergraph partitioning. The third collapses groups of clusters into meta-clusters, which then compete for each object to determine the combined clustering. Because of the low computational cost of these techniques, it is quite feasible to use a supra-consensus function that evaluates all three approaches against the objective function and picks the best solution for a given situation. We evaluate the effectiveness of cluster ensembles in three qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms worked on non-identical sets of objects, and (iii) where a common data set is used and the main purpose of combining multiple clusterings is to improve the quality and robustness of the solution. Promising results are obtained in all three situations for synthetic as well as real data sets.
2.2 Evolving clusters in gene-expression data

Clustering is a useful exploratory tool for gene-expression data. Although successful applications of clustering techniques have been reported in the literature, there is no method of choice in the gene-expression analysis community. Moreover, only a few works deal with the problem of automatically estimating the number of clusters in bioinformatics datasets. Most clustering methods require the number k of clusters to be either specified in advance or selected a posteriori from a set of clustering solutions over a range of k. In both cases, the user has to select the number of clusters. This work proposes improvements to a clustering genetic algorithm that is capable of automatically discovering an optimal number of clusters and the corresponding optimal partition based on numeric criteria. The proposed improvements are mainly designed to enhance the efficiency of the original clustering genetic algorithm, resulting in two new clustering genetic algorithms and an evolutionary algorithm for clustering (EAC). The original clustering genetic algorithm and its modified versions are evaluated in several runs on six gene-expression datasets in which the right clusters are known a priori. The results show that all the proposed algorithms perform well on gene-expression data, although statistical comparisons of the computational efficiency of each algorithm indicate that EAC outperforms the others. Statistical evidence also shows that EAC is able to outperform a traditional method based on multiple runs of k-means over a range of k.

2.3 Exploring data with self-organizing maps

This work discusses the application of a self-organizing map (SOM), an unsupervised neural network learning model, to support decision making by computer investigators and assist them in conducting data analysis more efficiently. A SOM is used to search for patterns in data sets and to produce visual displays of the similarities in the data. The work explores how a SOM can be used as a basis for further analysis and demonstrates how SOM visualization can give investigators greater ability to interpret and explore the data generated by computer tools.
2.4 Digital text string searching: Improving information retrieval effectiveness by thematically clustering search results

Current digital text string search tools use match and/or indexing algorithms to search digital evidence at the physical level and locate specific text strings. They are designed to achieve 100% query recall (i.e., to find all instances of the text strings). Given the nature of the data set, this leads to an extremely high incidence of hits that are not relevant to the investigative objectives. Although Internet search engines suffer similarly, they employ ranking algorithms to present the search results in a more effective and efficient manner from the user's perspective. Current digital forensic text string search tools fail to group and/or order search hits in a manner that appreciably improves the investigator's ability to get to the relevant hits first, or at least more quickly. This work proposes and empirically tests the feasibility and utility of post-retrieval clustering of digital text string search results. It is presented as a work in progress: a working tool has been developed and experimentation has begun. Findings regarding the feasibility and utility of the proposed approach will be presented, as well as suggestions for follow-on research.

2.5 Towards an integrated e-mail analysis framework

Due to its simple and inherently vulnerable nature, e-mail communication is abused for numerous illegitimate purposes. E-mail spamming, phishing, drug trafficking, cyber bullying, racial vilification, child pornography, and sexual harassment are some common e-mail mediated cyber crimes. At present, there is no adequate proactive mechanism for securing e-mail systems. In this context, forensic analysis plays a major role by examining suspected e-mail accounts to gather evidence that can be used to prosecute criminals in a court of law. To accomplish this task, a forensic investigator needs efficient automated tools and techniques to perform a multi-staged analysis of e-mail ensembles with a high degree of accuracy and in a timely fashion. In this article, we present our e-mail forensic analysis software tool, developed by integrating existing state-of-the-art statistical and machine-learning techniques complemented with social networking techniques. In this framework we incorporate our two proposed authorship attribution approaches.
3. SYSTEM REQUIREMENTS
3.1 Requirement Analysis Document

Requirement analysis is the first phase in the software development process. The main objective of this phase is to identify the problem and the system to be developed. The later phases are strictly dependent on this phase, so the system analyst must be clear and precise about it; any inconsistency here will lead to many problems in the phases that follow. Hence there are several reviews before the final copy of the analysis of the system to be developed is produced. After the analysis is completed, the system analyst submits the details of the system to be developed in a document called the requirement specification.

The requirement analysis task is a process of discovery, refinement, modeling, and specification. The software scope, initially established by the system engineer and refined during software project planning, is refined in detail. Models of the required data, information and control flow, and operational behavior are created. Alternative solutions are analyzed and allocated to various software elements.

Both the developer and the customer take an active role in requirement analysis and specification. The customer attempts to reformulate a sometimes nebulous concept of software function and performance into concrete detail. The developer acts as interrogator, consultant, and problem solver. The communication content is very high, the chances of misinterpretation and misinformation abound, and ambiguity is probable.

Requirement analysis is a software engineering task that bridges the gap between system-level software allocation and software design. It enables the system engineer to specify software function and performance, indicate the software's interfaces with other system elements, and establish the constraints that the software must meet. It allows the software engineer, often called the analyst in this role, to refine the software allocation and to build models of the data, functional, and behavioral domains that will be treated by the software. Requirement analysis provides the software designer with models that can be translated into data, architectural, interface, and procedural designs. Finally, the requirement specification provides the developer and the customer with the means to assess quality once the software is built.
3.1.1 Functional Requirements

A functional requirement defines a function of the software system or of one of its components. A function is described as a set of inputs, the behavior of the system, and outputs. The functional requirements comprise three parts: 1) Input, 2) Output, 3) Data Storage.

1) Input

The following inputs should be supported by the application:

a. The user selects a text file as the input data set.
b. The user clicks the Stop Words button to remove unwanted words (i.e., words other than nouns, verbs, and adverbs).
c. The user clicks the Stemming button to remove duplicate attributes.
d. The user clicks the Calculation button to obtain the result.
e. The user clicks K-means to generate clusters with ids.
f. The user clicks the Distance Calculation button to calculate the distance between attributes.
g. The user clicks the Incremental button to generate clusters.
h. The user clicks the Purity button to get the purity values of K-means and incremental clustering.

2) Output

The following outputs are produced as the user steps through the application:

a. The user gets the message "Data Selected".
b. The user gets the message "Stopwords removal completed".
c. The user gets the message "File Selected Successfully" after choosing a valid input file.
d. The user gets the failure message "not a valid type" after choosing an invalid input file type.
e. The user gets the filtered words with no duplication after clicking the Stemming button.
f. The user gets the cluster ids and cluster values after clicking K-means.
g. The user gets the distance matrix values after clicking Generate Distance Matrix.
h. The user gets the processed cosine similarity values after choosing incremental clustering.
i. The user gets the graph comparing the purity values of K-means and incremental clustering.

3) Data Storage

We use a MySQL database as the data storage component, in order to store all the registration details. MySQL is used as the back end because it has the following advantages: it offers GUI tools, it is cross-platform (it can run and reside on any operating system), it has an auto-commit feature, and it takes very little space to install on a system (hardly 30 MB).

3.1.2 Non-Functional Requirements

The non-functional requirements are as follows:

1) Reusability: As the application is developed in Java, it can be reused by anyone without any restrictions on its usage. Hence it is reusable.
2) Portability: As the application is written in Java, which runs on any operating system, the application is portable across operating systems.
3) Extensibility: The application can be extended at any level if the user wishes to extend it in the future, because Java is open source and has no time limits for expiry or renewal.
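As a rough illustration of the MySQL data storage described in section 3.1.1, the sketch below shows how the desktop client might write registration details over JDBC. The connection URL, the credentials, and the "registration" table with its name and email columns are illustrative assumptions, not part of the original design; the MySQL Connector/J driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Minimal sketch of persisting registration details to MySQL over JDBC.
// URL, credentials, and the "registration" table are illustrative assumptions.
public class RegistrationStore {

    private static final String URL = "jdbc:mysql://localhost:3306/clusterapp";

    public static void saveRegistration(String name, String email) throws SQLException {
        String sql = "INSERT INTO registration (name, email) VALUES (?, ?)";
        try (Connection con = DriverManager.getConnection(URL, "appuser", "secret");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setString(2, email);
            ps.executeUpdate();   // auto-commit is on by default, matching the feature noted above
        }
    }

    public static void main(String[] args) throws SQLException {
        saveRegistration("Test User", "test@example.com");
    }
}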
Requirements

Table 3.1: Hardware Requirements

System: Pentium IV, 2.4 GHz
Hard disk: 40 MB
Floppy drive: 1.44 MB
Monitor: 15" VGA colour
Mouse: Logitech
RAM: 512 MB

Table 3.2: Software Requirements

Operating system: Windows XP
Coding language: Java (Swing)
Database: MySQL
4. DESIGNING
4.1 Design Considerations

Design is a process of problem solving and planning for a software solution. After the purpose and specifications of the software are determined, software developers design, or employ designers to develop, a plan for a solution. It includes low-level component and algorithm implementation issues as well as the architectural view.

4.1.1 Assumptions and Dependencies

This subsection describes assumptions and dependencies regarding the software and its use. It is assumed that the system will be deployed on a Windows 2007 or later operating system. A working Visual Studio 2010 or above is necessary.

4.1.2 General Constraints

This project is a desktop application, developed with Java technology. A major constraint is to provide security for the information. In our project we use a symmetric cryptography algorithm, and the cipher text and the key follow different paths.

4.1.3 Development Methods

A system development methodology is the framework that is used to structure, plan, and control the process of developing an information system. The following diagram explains the stages.
Figure 4.1: Waterfall Model

Requirement Analysis and Definition: All possible requirements of the system to be developed are captured in this phase. Requirements are the set of functions and constraints that the end user (who will be using the system) expects from the system. The requirements are gathered from the end user at the start of the software development phase. These requirements are analyzed for their validity, and the possibility of incorporating them in the system to be developed is also studied. Finally, a requirement specification document is created, which serves as a guideline for the next phase of the model.

System and Software Design: Before starting the actual coding phase, it is very important to understand the requirements of the end user and to have an idea of what the end product should look like. The requirement specifications from the first phase are studied in this phase and a system design is prepared. System design helps in specifying the hardware and system requirements and in defining the overall system architecture. The system design specifications serve as the input for the next phase of the model.

Implementation and Unit Testing: On receiving the system design documents, the work is divided into modules/units and actual coding starts. The system is first developed as small programs called units, which are integrated in the next phase. Each unit is developed and tested for its functionality; this is
referred to as unit testing. Unit testing mainly verifies that the modules/units meet their specifications.

Integration and System Testing: As noted above, the system is first divided into units which are developed and tested for their functions. These units are integrated into a complete system during the integration phase and tested to check whether all modules/units coordinate with each other and the system as a whole behaves according to the specifications. After the software is tested successfully, it is delivered to the customer.

4.2 System Design

The data flow diagram (DFD) is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing steps carried out on that data, and the output data generated by the system.

Figure 4.2: Data Flow Diagram (preprocessing, documents, term frequency, similarity calculation, cluster formation, query results)
1. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.
2. The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
3. A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

4.2.1 Proposed Architecture

Figure 4.3: Proposed Architecture

The architecture contains four modules, listed below:

1. Preprocessing module
2. Calculating the number of clusters
3. Clustering techniques
4. Removing outliers
4.2.1.1 Preprocessing Module

Before running clustering algorithms on the text datasets, we performed some preprocessing steps. In particular, stop words (prepositions, pronouns, articles, and irrelevant document metadata) were removed. Also, the Snowball stemming algorithm for Portuguese words was used. Then we adopted a traditional statistical approach to text mining, in which documents are represented in a vector space model. In this model, each document is represented by a vector containing the frequencies of occurrence of its words, where words are defined as delimited alphabetic strings of between 4 and 25 characters. We also used a dimensionality reduction technique known as Term Variance (TV), which can increase both the effectiveness and the efficiency of clustering algorithms. TV selects the attributes (in our case 100 words) that have the greatest variance over the documents. In order to compute distances between documents, two measures were used: a cosine-based distance and a Levenshtein-based distance. The latter was used only to calculate distances between file (document) names.

4.2.1.2 Calculating the Number of Clusters

In order to estimate the number of clusters, a widely used approach consists of obtaining a set of data partitions with different numbers of clusters and then selecting the particular partition that provides the best result according to a specific quality criterion (e.g., a relative validity index). Such a set of partitions may result directly from a hierarchical clustering dendrogram or, alternatively, from multiple runs of a partitional algorithm (e.g., K-means) starting from different numbers and initial positions of the cluster prototypes.

4.2.1.3 Clustering Techniques

The clustering algorithms adopted in our study, the partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link, and the cluster ensemble algorithm known as CSPA, are popular in the machine learning and data mining fields, and for that reason they were used in our study. Nevertheless, some of our choices regarding their use deserve further comment. For instance, K-medoids is similar to K-means; however, instead of computing centroids it uses medoids, which are the representative objects of the clusters. This property makes it particularly interesting for applications in which (i) centroids cannot be computed, and
(ii) distances between pairs of objects are available, as when computing dissimilarities between document names with the Levenshtein distance.

4.2.1.4 Removing Outliers

We assess a simple approach to removing outliers that makes recursive use of the silhouette. Essentially, if the best partition chosen by the silhouette contains singletons (clusters formed by a single object), these are removed. The clustering process is then repeated over and over again until a partition without singletons is found. At the end of the process, all singletons are incorporated into the resulting data partition (for evaluation purposes) as single clusters.

Input Design

The input design is the link between the information system and the user. It comprises the specifications and procedures for data preparation, and the steps necessary to put transaction data into a usable form for processing, whether by inspecting the computer to read data from a written or printed document or by having people key the data directly into the system. The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input is designed so that it provides security and ease of use while retaining privacy. Input design considers the following:

• What data should be given as input?
• How should the data be arranged or coded?
• The dialog to guide the operating personnel in providing input.
• Methods for preparing input validations, and the steps to follow when errors occur.

Objectives:

1. Input design is the process of converting a user-oriented description of the input into a computer-based system. This design is important to avoid errors in the data input process and to show the correct direction to the management for getting correct information from the computerized system.
2. It is achieved by creating user-friendly screens for data entry that can handle large volumes of data. The goal of designing input is to make data entry easier and error free. The data entry screens are designed so that all data manipulations can be performed, and record viewing facilities are also provided.
3. When data is entered it is checked for validity. Data can be entered with the help of screens. Appropriate messages are provided when needed, so that the user is never left confused. Thus the objective of input design is to create an input layout that is easy to follow.

Output Design

A quality output is one which meets the requirements of the end user and presents the information clearly. In any system, the results of processing are communicated to the users and to other systems through outputs. In output design it is determined how the information is to be displayed for immediate need, as well as the hard-copy output. Output is the most important and direct source of information for the user. Efficient and intelligent output design improves the system's relationship with the user and supports decision making.

1. Designing computer output should proceed in an organized, well-thought-out manner; the right output must be developed while ensuring that each output element is designed so that people will find the system easy to use and effective. When analysts design computer output, they should identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create the document, report, or other format that contains the information produced by the system.

The output of an information system should accomplish one or more of the following objectives:

• Convey information about past activities, current status, or projections of the future.
• Signal important events, opportunities, problems, or warnings.
• Trigger an action.
• Confirm an action.

4.3 Unified Modeling Language

• UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling language in the field of object-oriented software engineering. The standard is managed, and was created by, the Object Management Group.
• The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.
• The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of a software system, as well as for business modeling and other non-software systems.
• The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems.
• The UML is a very important part of developing object-oriented software and the software development process.
• The UML uses mostly graphical notations to express the design of software projects.

4.3.1 Scenarios

A scenario is "a narrative description of what people do and experience as they try to make use of computer systems and applications". A scenario is a concrete, focused, informal description of a single feature of the system from the viewpoint of a single actor. Scenarios cannot replace use cases, as they focus on specific instances and concrete events. However, scenarios enhance requirements elicitation by providing a tool that is understandable to users and clients.
Scenario 1:

Use case name: User Selects a Text Document
Participating actors: User
Flow of events:
1) The user browses for a text file as the input data set.
2) The user clicks the Browse button to select the data set.
Entry condition: The user has to browse for an input data set.
Exit condition: The selected data set is saved to the output panel, and the user clicks the EXIT button to close the application.

Table 4.1: Scenario 1 table

Figure 4.4: User selects a text document
Scenario 2:

Use case name: Preprocessing
Participating actors: User
Flow of events:
1) The user browses for a text file as the input data set.
2) The user clicks Stop Words to remove the unwanted words and phrases.
3) The user then clicks the Stemming button in order to remove the duplicates.
Entry condition: The user has to browse for an input data set.
Exit condition: The preprocessed data is saved to the output panel, and the user clicks the EXIT button to close the application.

Table 4.2: Scenario 2 table

Figure 4.5: Preprocessing
Scenario 3:

Use case name: Term Frequency Calculation
Participating actors: User
Flow of events:
1) After preprocessing the input text file, the user clicks the Calculation button for clusters.
2) The term frequency is calculated between all attributes and documents.
3) The user gets the frequency values for all the documents against the attributes.
Entry condition: The user has to browse for an input data set.
Exit condition: The term frequency data is saved to the output panel, and the user clicks the EXIT button to close the application.

Table 4.3: Scenario 3 table

Figure 4.6: Term Frequency Calculation
Scenario 4:

Use case name: Similarity Calculation
Participating actors: User
Flow of events:
1) After the term frequency calculation, the user clicks the Next button.
2) The user clicks the Similarity button to calculate the cosine similarity values.
3) The sum of all document similarity values gives the purity values.
Entry condition: The user has to browse for an input data set.
Exit condition: The similarity calculation between all documents is saved to the output panel, and the user clicks the EXIT button to close the application.

Table 4.4: Scenario 4 table

Figure 4.7: Similarity Calculation
Scenario 5:

Use case name: Cluster Formation and Query Results
Participating actors: User
Flow of events:
1) After processing the similarity values, the user clicks the Next button.
2) The cluster values of the matched documents are produced, each with a cluster id.
3) The user gets the query result at the end, listing the values that were not matched.
Entry condition: The user has to browse for an input data set.
Exit condition: The cluster values for all documents are saved to the output panel, and the user clicks the EXIT button to close the application.

Table 4.5: Scenario 5 table

Figure 4.8: Cluster Formation and Query Results
4.3.2 Use Case Diagram

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by, and created from, a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those use cases. The main purpose of a use case diagram is to show which system functions are performed for which actor. The roles of the actors in the system can be depicted.

System: A system/system boundary groups use cases together to accomplish a purpose. Each use case diagram can only have one system.

Actor: An actor represents a coherent set of roles that users of the system play when interacting with the use cases of the system. An actor participates in use cases to accomplish an overall purpose. An actor can represent the role of a human, a device, or any other system.

Use case: A use case describes a sequence of actions that provide something of measurable value to an actor and is drawn as a horizontal ellipse.

Use case relationships: Four relationships among use cases are often used in practice.

Include: In one form of interaction, a given use case may include another. Include is a directed relationship between two use cases, implying that the behavior of the included use case is inserted into the behavior of the including use case. The first use case often depends on the outcome of the included use case. This is useful for extracting truly common behaviors from multiple use cases into a single description. The notation is a dashed arrow from the including to the included use case, labeled "«include»".

Extend: This relationship indicates that the behavior of the extension use case may be inserted into the extended use case under some conditions. The notation is a dashed arrow from the extension to the extended use case, labeled "«extend»". Notes or constraints may be associated with this relationship to illustrate the conditions under which the behavior is executed. Modelers use the «extend» relationship to indicate use cases that are "optional" to the base use case.
Generalization: A given use case may have common behaviors, requirements, constraints, and assumptions with a more general use case. In this case, describe them once, and deal with it in the same way, describing any differences in the specialized cases. The notation is a solid line ending in a hollow triangle drawn from the specialized to the more general use case.

Association: Associations between actors and use cases are indicated with solid lines. An association exists whenever an actor is involved with an interaction described by a use case. Associations are modeled as lines connecting use cases and actors to one another, with an optional arrowhead on one end of the line. The arrowhead is often used to indicate the direction of the initial invocation of the relationship or to indicate the primary actor within the use case.

Figure 4.9: Use Case Diagram (choose an input dataset, preprocessing, term frequency, similarity calculation, cluster formation, evaluating query results; actor: user/computer examiner)
4.3.3 Class Diagram

A class diagram describes the static structure of the system. It is a graphical presentation of the static view that shows a collection of declarative (static) model elements, such as classes and types, together with their contents and relationships. Classes are abstractions that specify the common structure and behavior of a set of objects. Objects are the instances of classes that are created, modified, and destroyed during the execution of the system. A class diagram describes the system in terms of objects, classes, attributes, operations, and their associations.

Class: A rectangle is the icon that represents a class. It is divided into three areas: the uppermost contains the name, the middle area holds the attributes, and the bottom area holds the operations.

Package: A package is a mechanism for organizing elements into groups. It is used in use case, class, and component diagrams. Packages may be nested within other packages. A package may contain both subordinate packages and ordinary model elements. The entire system description can be thought of as a single high-level subsystem package with everything else in it.

Subsystem: A subsystem groups diagram elements together.

Generalization: Generalization is a relationship between a general element and a more specific kind of that element. It means that the more specific element can be used whenever the general element appears.

Usage: Usage is a dependency in which one element (the client) requires the presence of another element (the supplier) for its correct functioning or implementation.

Realization: Realization is the relationship between a specification and its implementation. It indicates inheritance of behavior without inheritance of structure: one classifier specifies a contract that another classifier guarantees to carry out. Realization is used in two places: between interfaces and the classes that realize them, and between use cases and the collaborations that realize them.
Association: An association is represented by drawing a line between classes and can be named to facilitate model understanding. If two classes are associated, you can navigate from an object of one class to an object of the other class.

Aggregation: Aggregation is a special kind of association in which one class, the whole, consists of smaller classes, the parts. It has the meaning of a "has-a" relationship.

Composition: Composition is a strong form of aggregation, with strong ownership and coincident lifetimes of the parts and the whole. A part may belong to only one composite. Parts with non-fixed multiplicity may be created after the composite itself, but once created they live and die with it (that is, they share lifetimes). Such parts can also be explicitly removed before the death of the composite.

N-ary Association: N-ary associations are associations that connect more than two classes.

Dependency: A dependency is a semantic relationship between two elements. It indicates that whenever a change occurs in one element, a change may be necessary in the other element.
Figure 4.10: Class Diagram
4.3.4 Sequence Diagram

A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction diagram that shows how processes operate with one another and in what order. It is a construct of a message sequence chart. Sequence diagrams are sometimes called event diagrams, event scenarios, or timing diagrams.

Object: An object can be viewed as an entity at a particular point in time with a specific value, and as a holder of identity that has different values over time. Associations among objects are not shown. When you place an object tag in the design area, a lifeline is automatically drawn and attached to that object tag.

Actor: An actor represents a coherent set of roles that users of a system play when interacting with the use cases of the system. An actor participates in use cases to accomplish an overall purpose. An actor can represent the role of a human, a device, or any other system.

Message: A message is the sending of a signal from a sender object to one or more receiver objects. It can also be the call of an operation on a receiver object by a caller object. The arrow can be labeled with the name of the message (operation or signal) and its argument values. A sequence number showing the position of the message in the overall interaction, as well as a guard condition, can also be attached to the arrow.

Lifetime: The lifetime is the duration that indicates the completion of an action or a message and causes a transition from one state to another. The lifetime of an object is represented with a dotted line.

Self Message: A message that indicates an action that is performed at a particular state and stays there.

Create Message: A message that indicates an action performed between two states.
Figure 4.11: Sequence Diagram
4.3.5 Collaboration Diagram

A communication diagram was called a collaboration diagram in UML 1. It is similar to a sequence diagram, but the focus is on the messages passed between objects. The same information can be represented using a sequence diagram and different objects.

Class roles: Class roles describe how objects behave. Use the UML object symbol to illustrate class roles, but don't list object attributes.

Association roles: Association roles describe how an association will behave in a particular situation. You can draw association roles using simple lines labeled with stereotypes.

Messages: Unlike sequence diagrams, collaboration diagrams do not have an explicit way to denote time; instead they number messages in order of execution. Sequence numbering can become nested using the Dewey decimal system. The condition for a message is usually placed in square brackets immediately following the sequence number. A * after the sequence number indicates a loop.
Figure 4.12: Collaboration Diagram
4.3.6 Activity Diagram

Activity diagrams are graphical representations of workflows of stepwise activities and actions, with support for choice, iteration, and concurrency. In the Unified Modeling Language, activity diagrams can be used to describe the business and operational step-by-step workflows of components. An activity diagram consists of the following behavioral elements:

Action State: Describes the execution of an atomic action.

Sub-Activity: An activity that is performed within another activity.

Initial State: A pseudo-state that establishes the start of the flow.

Final State: Signifies where a transition ends.

Horizontal Synchronization: A horizontal synchronization splits a single transition into parallel transitions, or merges concurrent transitions into a single target.

Vertical Synchronization: A vertical synchronization splits a single transition into parallel transitions, or merges concurrent transitions into a single target.

Decision Point: A decision point is used to model the conditional flow of control. It labels each outgoing transition of a decision with a different guard condition.

Swim Lane: A swim lane is a partition on an interaction diagram for organizing responsibilities for activities. Each lane represents the responsibilities of a particular class. To use swim lanes, the activity diagram is arranged into vertical zones.
Figure 4.13: Activity Diagram (document, preprocessing, term frequency, similarity computation, cluster formation, query results; decision: valid yes/no, unconsidered)
5. IMPLEMENTATION
5.1 Preparing the Data Sets

The input to the document clustering algorithm can be any set of documents which have to be divided into clusters based on their similarity. The individual terms of each document have to be extracted in order to identify similar items. The data set therefore undergoes three preprocessing steps: tokenization, stop word removal, and stemming.

5.1.1 Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis.

5.1.2 Stop Word Removal

In computing, stop words are the words which are filtered out before or after processing of natural language data (text). There is no single definitive list of stop words which all tools use, and such a filter is not always used; some tools specifically avoid removing stop words in order to support phrase search. Some of the common stop words are: a, be, been, and, as, out, ever, own, he, she, shall, etc.

5.1.3 Stemming

Stemming is the process of reducing derived words to their stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms, as a kind of query expansion, a process called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers.
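The sketch below ties the three preprocessing steps together on a single document string and produces the term-frequency map that the later stages consume. It is only an illustration: the tiny stop-word list and the crude suffix-stripping "stemmer" are stand-ins for the project's full stop-word list and the Snowball stemmer mentioned earlier.

import java.util.*;

// Minimal sketch of the preprocessing pipeline: tokenization,
// stop-word removal, stemming, and a term-frequency map.
public class Preprocessor {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "be", "been", "and", "as", "out",
                                        "ever", "own", "he", "she", "shall", "the"));

    // Tokenization: split the text into lowercase alphabetic tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stop-word removal: drop words that carry little content.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) kept.add(t);
        }
        return kept;
    }

    // Stemming (very rough): strip a few common suffixes so related
    // word forms map to the same stem.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ers", "er", "ed", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Term frequency: count how often each stem occurs in the document.
    static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : removeStopWords(tokenize(document))) {
            String stemmed = stem(token);
            tf.merge(stemmed, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("The players played and she plays as a player"));
    }
}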
5.2 Cluster Analysis

Clustering is the process of grouping a set of objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Data clustering is a common technique for statistical data analysis and is used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. The computational task of classifying a data set into k clusters is often referred to as k-clustering. Clustering is also called data segmentation in some applications, because clustering partitions large datasets into groups according to their similarity. Clustering can also be used for outlier detection.

Cluster analysis aims to organize a collection of patterns into clusters based on similarity. Clustering has its roots in many fields, such as mathematics, computer science, statistics, biology, and economics. In different application domains, a variety of clustering techniques have been developed, depending on the method used to represent data, the measure of similarity between data objects, and the technique for grouping data objects into clusters. In data mining, hierarchical clustering is a method of cluster analysis which creates a hierarchical decomposition of the given set of data objects. Depending on the decomposition approach, hierarchical algorithms are classified as agglomerative (merging) or divisive (splitting). In this project we focus on document clustering using hierarchical clustering.

Types of clustering

There are different clustering methodologies. Data clustering algorithms can be hierarchical: hierarchical algorithms find successive clusters using previously established clusters, and can be agglomerative ("bottom-up") or divisive ("top-down"). Partitioning algorithms typically determine all clusters at once. There are also other clustering methods, such as density-based, grid-based, model-based, and constraint-based clustering.

Figure: Clustering
The clustering algorithms used are:

• K-means algorithm
• K-medoids algorithm
• Hierarchical algorithm

These algorithms result in minimal latency for all clients.

Partitioning Clustering: Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n; that is, it classifies the data into k groups. Given k, the number of partitions to construct, the method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

K-means Algorithm (demonstration of the standard algorithm):

1) k initial "means" (in this case k = 3) are randomly generated within the data domain.
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.

Figure 5.1: Clustering through K-means
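A minimal sketch of the four K-means steps above, assuming dense numeric vectors, Euclidean distance, and toy data; it is an illustration of the standard algorithm, not the project's actual implementation.

import java.util.*;

// Standard K-means loop: (1) pick k initial means, (2) assign each point
// to the nearest mean, (3) recompute each mean as the centroid of its
// cluster, (4) repeat until the assignments stop changing.
public class KMeansSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static int[] kmeans(double[][] points, int k, int maxIter) {
        Random rnd = new Random(42);
        double[][] means = new double[k][];
        for (int j = 0; j < k; j++) means[j] = points[rnd.nextInt(points.length)].clone();

        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Step 2: assign every observation to the nearest mean.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist(points[i], means[j]) < dist(points[i], means[best])) best = j;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break;                    // Step 4: converged.
            // Step 3: recompute each mean as the centroid of its cluster.
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assign[i]]++;
                for (int d = 0; d < points[i].length; d++) sums[assign[i]][d] += points[i][d];
            }
            for (int j = 0; j < k; j++)
                if (counts[j] > 0)
                    for (int d = 0; d < sums[j].length; d++) means[j][d] = sums[j][d] / counts[j];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1.2, 0.8}, {5, 5}, {5.1, 4.9}, {9, 1}, {8.8, 1.3}};
        System.out.println(Arrays.toString(kmeans(data, 3, 100)));
    }
}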
K-medoids Algorithm:

1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance, or Minkowski distance).
3. For each medoid m:
4. For each non-medoid data point o:
5. Swap m and o and compute the total cost of the configuration.
6. Select the configuration with the lowest cost.
7. Repeat steps 2 to 6 until there is no change in the medoids.

Figure: Clustering through K-medoids
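A minimal sketch of the swap-based K-medoids procedure listed above. It operates on a precomputed pairwise distance matrix, which is why the same approach also works on Levenshtein distances between document names, as noted in the design chapter. The distance matrix, the initial medoid choice, and k are toy values, not the project's implementation.

import java.util.*;

// Swap-based K-medoids on a precomputed distance matrix d.
public class KMedoidsSketch {

    // Total cost of a configuration: sum of distances from every point
    // to its closest medoid.
    static double cost(double[][] d, int[] medoids) {
        double total = 0;
        for (int i = 0; i < d.length; i++) {
            double best = Double.MAX_VALUE;
            for (int m : medoids) best = Math.min(best, d[i][m]);
            total += best;
        }
        return total;
    }

    static boolean contains(int[] a, int v) {
        for (int x : a) if (x == v) return true;
        return false;
    }

    static int[] kMedoids(double[][] d, int k) {
        int n = d.length;
        // Step 1: take the first k points as the initial medoids
        // (a random selection without replacement would also do).
        int[] medoids = new int[k];
        for (int j = 0; j < k; j++) medoids[j] = j;

        boolean improved = true;
        while (improved) {                        // Step 7: repeat until no change
            improved = false;
            for (int j = 0; j < k; j++) {         // Step 3: for each medoid m
                for (int o = 0; o < n; o++) {     // Step 4: for each non-medoid o
                    if (contains(medoids, o)) continue;
                    double currentCost = cost(d, medoids);
                    int old = medoids[j];
                    medoids[j] = o;               // Step 5: swap m and o, compute cost
                    if (cost(d, medoids) < currentCost) {
                        improved = true;          // Step 6: keep the cheaper configuration
                    } else {
                        medoids[j] = old;         // revert the swap
                    }
                }
            }
        }
        return medoids;
    }

    public static void main(String[] args) {
        double[][] d = {
            {0, 1, 9, 8}, {1, 0, 8, 9}, {9, 8, 0, 1}, {8, 9, 1, 0}
        };
        System.out.println(Arrays.toString(kMedoids(d, 2)));
    }
}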
Hierarchical clustering:

• Hierarchical clustering works by grouping data objects into a tree of clusters.
• There are two types of hierarchical clustering:
• Agglomerative hierarchical clustering: a bottom-up strategy that starts by placing each object in its own cluster and then merges clusters into larger and larger clusters until all the objects are in a single cluster.
• Divisive hierarchical clustering: a top-down strategy in which the clusters are subdivided into smaller pieces until each object forms a cluster on its own, or until certain termination conditions are satisfied.

Agglomerative vs. divisive approach:

Agglomerative approach: starts with all sample units in n clusters of size 1; at each step of the algorithm, the pair of clusters with the shortest distance is combined into a single cluster; the algorithm stops when all sample units are combined into a single cluster of size n.

Divisive approach: starts with all sample units in a single cluster of size n; at each step of the algorithm, a cluster is partitioned into a pair of daughter clusters, selected to maximize the distance between the daughters; the algorithm stops when the sample units are partitioned into n clusters of size 1.

Table: Comparison of the agglomerative and divisive approaches
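The sketch below illustrates the agglomerative (bottom-up) strategy just described, assuming complete linkage and a precomputed distance matrix; merging continues until the requested number of clusters remains. It is a simplified illustration, not the Single/Complete/Average Link implementation used in the project.

import java.util.*;

// Bottom-up (agglomerative) clustering with complete linkage: start with
// every object in its own cluster and keep merging the two closest
// clusters until the requested number remains.
public class AgglomerativeSketch {

    static List<List<Integer>> cluster(double[][] d, int targetClusters) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < d.length; i++)
            clusters.add(new ArrayList<>(Collections.singletonList(i)));

        while (clusters.size() > targetClusters) {
            int bestA = 0, bestB = 1;
            double bestDist = Double.MAX_VALUE;
            // Find the pair of clusters with the smallest complete-linkage
            // distance (the maximum pairwise distance between their members).
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double linkage = 0;
                    for (int x : clusters.get(a))
                        for (int y : clusters.get(b))
                            linkage = Math.max(linkage, d[x][y]);
                    if (linkage < bestDist) { bestDist = linkage; bestA = a; bestB = b; }
                }
            }
            clusters.get(bestA).addAll(clusters.remove(bestB));   // merge the closest pair
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] d = {
            {0, 1, 9, 8}, {1, 0, 8, 9}, {9, 8, 0, 2}, {8, 9, 2, 0}
        };
        System.out.println(cluster(d, 2));   // e.g. [[0, 1], [2, 3]]
    }
}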
Hierarchical Algorithm:

The linkage criterion determines the distance between two clusters A and B as a function of the pairwise distances d(x, y) between their elements. Common choices are:

• The maximum distance between elements of each cluster (complete-linkage clustering): max{ d(x, y) : x ∈ A, y ∈ B }
• The minimum distance between elements of each cluster (single-linkage clustering): min{ d(x, y) : x ∈ A, y ∈ B }
• The mean distance between elements of each cluster (average-linkage clustering, used e.g. in UPGMA): (1 / (|A| · |B|)) Σ_{x ∈ A} Σ_{y ∈ B} d(x, y)
• The sum of all intra-cluster variance.
• The increase in variance for the cluster being merged (Ward's method).
• The probability that candidate clusters spawn from the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).

Hierarchical vs. partitioning algorithms: Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining (splitting) two clusters from the next lower (next higher) level. Partitional techniques create a one-level (unnested) partitioning of the data points. If k is the desired number of clusters, partitional approaches typically find all k clusters at once. Contrast this with traditional hierarchical schemes, which bisect a cluster to get two clusters or merge two clusters to get one.

Distance Measure

An important step in any clustering is the selection of a distance measure, which determines how the similarity of two elements is calculated. This influences the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. The various distance measures are:
Euclidean Distance: This is probably the most commonly chosen type of distance. It simply gives the geometric distance in the multidimensional space. It is computed as:

d(x, y) = sqrt( Σ_i (x_i − y_i)² )

The Euclidean (and squared Euclidean) distances are usually computed on raw data, not on standardized data.

City Block Distance (Manhattan Distance): This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. The city-block distance is computed as:

d(x, y) = Σ_i |x_i − y_i|

Cosine Similarity: Cosine similarity is one of the most popular similarity measures applied to text documents, for example in various information retrieval applications and in clustering. An important property of the cosine similarity is its independence of document length. For two documents A and B, the similarity between them can be calculated as:

cos(A, B) = (A · B) / (‖A‖ ‖B‖) = Σ_i A_i B_i / ( sqrt(Σ_i A_i²) · sqrt(Σ_i B_i²) )
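A small sketch computing the three measures above on two term-frequency vectors of equal length; the vectors are toy values chosen only for illustration.

// Euclidean distance, city-block distance, and cosine similarity
// for two term-frequency vectors of equal length.
public class DistanceMeasures {

    static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }

    static double cityBlock(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += Math.abs(x[i] - y[i]);
        return s;
    }

    // Cosine similarity: dot product divided by the product of the vector
    // lengths, so the result does not depend on document length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] docA = {3, 0, 1, 2};
        double[] docB = {1, 1, 0, 2};
        System.out.printf("euclidean=%.3f cityBlock=%.1f cosine=%.3f%n",
                euclidean(docA, docB), cityBlock(docA, docB), cosine(docA, docB));
    }
}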
5.3 Software Environment and Technologies

Java Technology

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords: simple, architecture neutral, object oriented, portable, distributed, high performance, interpreted, multithreaded, robust, dynamic, and secure.

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, you first translate a program into an intermediate language called Java byte codes, the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it's a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make "write once, run anywhere" possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That
means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or an iMac.

The Java Platform

A platform is the hardware or software environment in which a program runs. We've already mentioned some of the most popular platforms, like Windows 2000, Linux, Solaris, and Mac OS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it's a software-only platform that runs on top of other, hardware-based platforms.

The Java platform has two components:
• The Java Virtual Machine (JVM)
• The Java Application Programming Interface (Java API)

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages.

Native code is code that, after you compile it, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

What Can Java Technology Do?
  48. 48. 48 The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser. However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs. An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet. A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server. Java makes our programs better and requires less effort than other languages. Java technology will help you do the following: • Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++. • Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++. • Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs. • Develop programs more quickly:
  49. 49. 49 Development time may be cut roughly in half compared with writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++. • Avoid platform dependencies with 100% Pure Java: We can keep our program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online. • Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform. • Distribute software more easily: We can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program. 5.2.1 ODBC Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change. Through the ODBC Administrator in Control Panel, we can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead us to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN. The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0.
  50. 50. 50 The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. Many detractors have charged that ODBC is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In any case, the criticism about performance is somewhat analogous to that once aimed at compilers, which were said never to match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year. 5.2.2 JDBC In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on. To gain wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As discussed earlier, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC allowed vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution. JDBC was announced in March of 1996. It was released for a 90-day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after. 5.2.2.1 JDBC Goals Few software packages are designed without goals in mind. JDBC is no exception; its goals drove the development of the API. The goals that were set for JDBC are important, and they give some insight as to why certain classes and functionalities behave the way they do. The seven design goals for JDBC are as follows: 1. SQL Level API The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level
  51. 51. 51 tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user. 2. SQL Conformance SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users. 3. JDBC must be implementable on top of common database interfaces The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa. 4. Provide a Java interface that is consistent with the rest of the Java system Because of Java’s acceptance in the user community thus far, the designers felt that they should not stray from the current design of the core Java system. 5. Keep it simple This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API. 6. Use strong, static typing wherever possible Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime. 7. Keep the common cases simple Because more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible. In this project, an MS Access database is used for dynamically updating the cache table. A minimal JDBC usage sketch is given below.
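To make the “common cases simple” goal concrete, the following is a minimal sketch of a simple SELECT issued through the standard java.sql API; the connection URL, credentials, table and column names are placeholders for illustration and are not taken from the project.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcSelectSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL, credentials and table -- adjust to the actual database in use.
        String url = "jdbc:mysql://localhost:3306/clusterdb";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT term, frequency FROM terms WHERE frequency > ?")) {
            ps.setInt(1, 5); // bind the query parameter
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("term") + " : " + rs.getInt("frequency"));
                }
            }
        }
    }
}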
  52. 52. 52 5.2.3 JFreeChart JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart's extensive feature set includes: a consistent and well-documented API, supporting a wide range of chart types; a flexible design that is easy to extend, targeting both server-side and client-side applications; and support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG); a brief export sketch is given at the end of this subsection. JFreeChart is "open
  53. 53. 53 source" or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits use in proprietary applications. 1. Map Visualizations Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: sourcing freely redistributable vector outlines for the countries of the world and states/provinces in particular countries (the USA in particular, but also other areas); creating an appropriate dataset interface (plus default implementation) and a renderer, and integrating these with the existing XYPlot class in JFreeChart. 2. Time Series Chart Interactivity Implement a new (to JFreeChart) feature for interactive time series charts --- to display a separate control that shows a small version of ALL the time series data, with a sliding "view" rectangle that allows you to select the subset of the time series data to display in the main chart. 3. Dashboards There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet. 4. Property Editors The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.
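As an illustration of the output types listed above, the following is a minimal sketch assuming the JFreeChart 1.0 API (ChartUtilities was renamed in later releases); the dataset values and the output file name are illustrative only and mirror the bar chart built in Graph.java in the appendix.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.category.DefaultCategoryDataset;

public class ChartToPng {
    public static void main(String[] args) throws Exception {
        // Illustrative values only; in the project these would come from the purity results.
        DefaultCategoryDataset dataset = new DefaultCategoryDataset();
        dataset.setValue(78.5, "Accuracy", "K-MEANS");
        dataset.setValue(86.0, "Accuracy", "Incremental Mining");

        JFreeChart chart = ChartFactory.createBarChart(
                "Clustering Accuracy", "Text Mining", "Accuracy",
                dataset, PlotOrientation.VERTICAL, false, true, false);

        // Write the chart to a PNG file (one of the output types listed above).
        ChartUtilities.saveChartAsPNG(new File("clustering-accuracy.png"), chart, 500, 400);
    }
}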
  54. 54. 54 6.TESTING
  55. 55. 55 6. TESTING The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests. Each test type addresses a specific testing requirement. TYPES OF TESTS Unit testing Unit testing involves the design of test cases that validate that the internal program logic is functioning properly, and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application; it is done after the completion of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results. Integration testing Integration tests are designed to test integrated software components to determine if they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components. Functional test Functional tests provide systematic demonstrations that functions tested are available as specified by the business and technical requirements, system documentation, and user manuals. Functional testing is centered on the following items: Valid Input : identified classes of valid input must be accepted.
  56. 56. 56 Invalid Input : identified classes of invalid input must be rejected. Functions : identified functions must be exercised. Output : identified classes of application outputs must be exercised. Systems/Procedures: interfacing systems or procedures must be invoked. Organization and preparation of functional tests is focused on requirements, key functions, or special test cases. In addition, systematic coverage pertaining to identified business process flows, data fields, predefined processes, and successive processes must be considered for testing. Before functional testing is complete, additional tests are identified and the effective value of current tests is determined. System Test System testing ensures that the entire integrated software system meets requirements. It tests a configuration to ensure known and predictable results. An example of system testing is the configuration-oriented system integration test. System testing is based on process descriptions and flows, emphasizing pre-driven process links and integration points. White Box Testing White Box Testing is testing in which the software tester has knowledge of the inner workings, structure and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black-box level. Black Box Testing Black Box Testing is testing the software without any knowledge of the inner workings, structure or language of the module being tested. Black box tests, as most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is testing in which the software under test is treated as a black box: you cannot “see” into it. The test provides inputs and responds to outputs without considering how the software works. 6.1 Unit Testing:
  57. 57. 57 Unit testing is usually conducted as part of a combined code and unit test phase of the software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct phases. 6.1.1 Test strategy and approach Field testing will be performed manually and functional tests will be written in detail. Test objectives • All field entries must work properly. • Pages must be activated from the identified link. • The entry screen, messages and responses must not be delayed. Features to be tested • Verify that the entries are of the correct format • No duplicate entries should be allowed • All links should take the user to the correct page. 6.1.2 Test Cases A sketch of how one of these checks could be automated is given after the tables. S No. of test case 1 Name of test User browses for a file (Success) Sample Input User selects a file to be clustered. Expected output Displays message as “File updated successfully” Actual output Same as expected Remarks This component clearly tells that the file is uploaded successfully. Table 6.1: Unit Test Case 1
  58. 58. 58 S No. of test case 2 Name of test User browses for a file (Fails) Sample Input User selects a file to be clustered of a different file format. Expected output Displays message as “File not updated” Actual output Same as expected Remarks This component clearly tells that the file is not updated. Table 6.2: Unit Test Case 2 S No. of test case 3 Name of test User clicks on Remove button Sample Input User, after uploading a file, clicks on the Remove button Expected output Displays message as “Stop words removed successfully” Actual output Same as expected Remarks This component tells that the stop words were removed successfully, so that the text can be forwarded for stemming Table 6.3: Unit Test Case 3
  59. 59. 59 S No. of test case 4 Name of test User clicks on Stemming button without performing the Remove action Sample Input User forgot to click on Remove and went directly to the Stemming button. Expected output Displays message as “Please enter the remove and then click on stemming” Actual output Same as expected Remarks This component clearly shows that an invalid-action message appears if the user does not click the Remove button before going to the Stemming button. Table 6.4: Unit Test Case 4 S No. of test case 5 Name of test User clicks on Stemming button Sample Input User, after clicking on the Remove button, goes for the stemming action Expected output Displays message as “Stemming is successful” and displays the distinct stemmed words Actual output Same as expected Remarks This component tells that stemming was performed successfully on the text already filtered of stop words. Table 6.5: Unit Test Case 5
  60. 60. 60 S No. of test case 6 Name of test User clicks on Calculation button without performing the stemming action Sample Input User forgot to click on Stemming and went directly to the Calculation button. Expected output Displays message as “Please enter the stemming and then click on calculation” Actual output Same as expected Remarks This component clearly shows that an invalid-action message appears if the user does not click the Stemming button before going to the Calculation button. Table 6.6: Unit Test Case 6 S No. of test case 7 Name of test User clicks on Calculation button Sample Input User, after clicking on the Stemming button, goes for the calculation action in order to find the clusters. Expected output Displays message as “Clustered the input data set successfully” Actual output Same as expected Remarks This component tells that stemming has been performed and the clustering algorithms were applied to the input data set successfully. Table 6.7: Unit Test Case 7
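The cases above were exercised by hand through the user interface. As a minimal sketch of how a check such as test case 3 could be automated with a JUnit-style test, the following uses a hypothetical removeStopWords helper; the helper and its stop-word list are illustrative and are not part of the project code.

import static org.junit.Assert.assertEquals;
import org.junit.Test;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordRemovalTest {

    // Hypothetical helper under test: removes the given stop words from a sentence.
    static String removeStopWords(String text, Set<String> stopWords) {
        StringBuilder sb = new StringBuilder();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!stopWords.contains(token)) {
                sb.append(token).append(' ');
            }
        }
        return sb.toString().trim();
    }

    @Test
    public void removesConfiguredStopWords() {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is", "a"));
        assertEquals("cat on mat", removeStopWords("The cat is on a mat", stopWords));
    }
}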
  61. 61. 61 6.2 Integration Testing Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects. The task of the integration test is to check that components or software applications, e.g. components in a software system or – one step up – software applications at the company level, interact without error. Test Results: All the test cases mentioned above passed successfully. No defects were encountered. 6.3 Acceptance Testing User Acceptance Testing is a critical phase of any project and requires significant participation by the end user. It also ensures that the system meets the functional requirements. Test Results: All the test cases mentioned above passed successfully. No defects were encountered.
  62. 62. 62 7.CONCLUSION
  63. 63. 63 7. CONCLUSION We presented an approach that applies document clustering methods to the analysis of documents during computer inspection. We also reported and discussed several practical results that can be very useful for researchers and practitioners of document computing. More specifically, in our experiments the hierarchical algorithms known as Average Link and Complete Link presented the best results. Despite their usually high computational costs, we have shown that they are particularly suitable for the studied application domain, because the dendrograms that they provide offer summarized views of the documents being inspected, thus being helpful tools for examiners that analyze textual documents from seized computers. As already observed in other application domains, dendrograms provide very informative descriptions and visualization capabilities of data clustering structures. The partitional K-means and K-medoids algorithms also achieved good results when properly initialized. Considering the approaches for estimating the number of clusters, the relative validity criterion known as the silhouette has shown better results than its simplified version. In addition, some of our results suggest that using the file names along with the document content information may be useful for cluster ensemble algorithms. Most importantly, we observed that clustering algorithms indeed tend to induce clusters formed by either relevant or irrelevant documents, thus contributing to enhancing the expert examiner’s job. Furthermore, our evaluation of the proposed approach in five real-world applications shows that it has the potential to speed up the computer inspection process. Aimed at further leveraging the use of data clustering algorithms in similar applications, a promising avenue for future work involves investigating automatic approaches for cluster labeling. The assignment of labels to clusters may enable the expert examiner to identify the semantic content of each cluster more quickly, eventually even before examining their contents. Finally, the study of algorithms that induce overlapping partitions (e.g., Fuzzy C-Means and Expectation-Maximization for Gaussian Mixture Models) is worthy of investigation.
  64. 64. 64 REFERENCES
  65. 65. 65 REFERENCES 1. Luís Filipe da Cruz Nassif and Eduardo Raul Hruschka, “Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection.” 2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. 3. Bruegge and Dutoit, Object-Oriented Software Engineering. 4. Herbert Schildt, Java: The Complete Reference.
  66. 66. 66 APPENDIX
  67. 67. 67 APPENDIX: A - Input/Output Screens: a. Selecting Dataset: Figure 9.1: Selecting Dataset Description: This page shows the dataset selection step, in which the user browses the computer and selects the folder of documents to be clustered.
  68. 68. 68 b. Removing Stop Words: Figure 9.2: Removing Stop Words Description: This page shows the stop-word removal step, in which unnecessary words (stop words) are removed from the dataset.
  69. 69. 69 d. Stemming: Figure 9.3: Stemming Page Description: This page shows the stemming step, in which each remaining word is reduced to its root form so that different variants of the same word are treated as a single term.
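As a rough illustration of what stemming does (the project uses a Porter stemmer; the handful of rules below is a deliberately simplified sketch, not the actual algorithm):

import java.util.Arrays;
import java.util.List;

public class SimpleStemmer {

    // Deliberately simplified suffix stripping for illustration only;
    // the real Porter/Snowball algorithms apply many more rules.
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && w.length() > 4) return w.substring(0, w.length() - 3) + "y";
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("ed") && w.length() > 4)  return w.substring(0, w.length() - 2);
        if (w.endsWith("s") && w.length() > 3)   return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("clusters", "clustering", "clustered", "studies");
        for (String word : words) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}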
  70. 70. 70 e. Clustering Process: Figure 9.4: Clustering Process Description: This page shows the clustering process, in which the preprocessed documents are grouped into the chosen number of clusters.
  71. 71. 71 f. K-means; g. Computing Term Frequency: Figure 9.5: K-means Page Description: This page shows the term-frequency computation: it counts how many times each word is repeated in the documents, which is the representation used by the K-means step.
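A minimal sketch of the term-frequency count described above; the word list here is illustrative only.

import java.util.HashMap;
import java.util.Map;

public class TermFrequency {
    public static void main(String[] args) {
        // Illustrative preprocessed text (stop words removed, words stemmed).
        String[] words = {"cluster", "document", "cluster", "distance", "document", "cluster"};

        // Count how many times each word is repeated.
        Map<String, Integer> tf = new HashMap<>();
        for (String w : words) {
            tf.merge(w, 1, Integer::sum);
        }
        tf.forEach((word, count) -> System.out.println(word + " : " + count));
    }
}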
  72. 72. 72 h. Cluster Preprocessing: Figure 9.6: Cluster Preprocessing Description: This page shows the cluster preprocessing step, for which the Snowball stemming technique is used.
  73. 73. 73 i. Distance Calculation: Figure 9.7: Distance Calculation Description: This page shows the distance calculation step, for which the Euclidean distance method is used.
  74. 74. 74 j. Incremental or Hierarchical Clustering: Figure 9.8: Incremental or Hierarchical Clustering Description: This page shows the incremental or hierarchical clustering step. Here the similarity between the data points is computed first, and the most similar points are then merged step by step.
  75. 75. 75 k. Similarity Measurement: Figure 9.9: Similarity Calculation Description: This page shows the similarity calculation. At this stage the similarity is calculated using the cosine similarity measure, and the maximum dissimilarity value is also reported.
  76. 76. 76 l. Purity Checking: Figure 9.10: Purity Checking Description: This page shows the purity check, in which the purity levels of K-means and of the incremental clustering are computed and compared.
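Purity is not defined elsewhere in this document; a common definition is the fraction of documents that fall in the majority class of their cluster. The following is a minimal sketch under that assumption, with illustrative cluster assignments and labels.

import java.util.HashMap;
import java.util.Map;

public class PurityExample {

    // Purity = (1/N) * sum over clusters of the size of the majority class in that cluster.
    static double purity(int[] clusters, String[] labels) {
        Map<Integer, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < clusters.length; i++) {
            counts.computeIfAbsent(clusters[i], k -> new HashMap<>())
                  .merge(labels[i], 1, Integer::sum);
        }
        int majoritySum = 0;
        for (Map<String, Integer> classCounts : counts.values()) {
            majoritySum += classCounts.values().stream().max(Integer::compare).orElse(0);
        }
        return (double) majoritySum / clusters.length;
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 0, 1, 1, 1};                        // illustrative assignments
        String[] labels = {"sports", "sports", "food", "food", "food", "sports"};
        System.out.println("Purity = " + purity(clusters, labels)); // 4/6, about 0.67
    }
}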
  77. 77. 77 m. Clustering Accuracy: Figure 9.11: Clustering Accuracy Description: This page shows the clustering accuracy, with the accuracy of the two clustering techniques represented in the form of a graph.
  78. 78. 78 B-Source Code preprocess.java : package ncluster; import com.mysql.jdbc.Connection; import java.io.*; import java.sql.*; import java.util.*; import javax.swing.JFileChooser; import ptstemmer.implementations.PorterStemmer; public class preprocess extends javax.swing.JFrame { String cont="", line="", path="", filename="", word="", str="", count="", nooffile=""; public static int numofdoc,count1,coun,i, noofterm; File folder, files[]; PorterStemmer stemmer = new PorterStemmer(); float[] tf=new float[1500]; double[] idf=new double[1500]; double[] result=new double[1500]; int i1=0,j1=0,k1=0; public preprocess() { initComponents(); } @SuppressWarnings("unchecked") // <editor-fold defaultstate="collapsed" desc="Generated Code"> private void initComponents() {
  79. 79. 79 selfiles = new javax.swing.JLabel(); select = new javax.swing.JButton(); jScrollPane1 = new javax.swing.JScrollPane(); text = new javax.swing.JTextArea(); textbox1 = new javax.swing.JTextField(); removestopword = new javax.swing.JButton(); stemming = new javax.swing.JButton(); title = new javax.swing.JLabel(); pathoffile = new javax.swing.JLabel(); calc = new javax.swing.JButton(); jPanel1 = new javax.swing.JPanel(); DocClust = new javax.swing.JLabel(); jLabel1 = new javax.swing.JLabel(); jLabel2 = new javax.swing.JLabel(); setDefaultCloseOperation(javax.swing.WindowConstants.EXIT_ON_CLOSE); setTitle("Selecting_Documents"); setMinimumSize(new java.awt.Dimension(599, 601)); getContentPane().setLayout(null); selfiles.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N selfiles.setForeground(new java.awt.Color(51, 51, 51)); selfiles.setText("Select Files "); getContentPane().add(selfiles);
  80. 80. 80 selfiles.setBounds(10, 110, 100, 30); select.setBackground(java.awt.SystemColor.inactiveCaption); select.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N select.setForeground(new java.awt.Color(0, 0, 102)); select.setText("SELECT"); select.addActionListener(new java.awt.event.ActionListener() { public void actionPerformed(java.awt.event.ActionEvent evt) { selectActionPerformed(evt); } }); getContentPane().add(select); select.setBounds(120, 110, 100, 30); text.setBackground(java.awt.SystemColor.inactiveCaption); text.setColumns(20); text.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N text.setForeground(new java.awt.Color(51, 51, 51)); text.setRows(5); jScrollPane1.setViewportView(text); getContentPane().add(jScrollPane1); jScrollPane1.setBounds(70, 240, 440, 320); textbox1.setFont(new java.awt.Font("Tahoma", 0, 12)); // NOI18N
  81. 81. 81 textbox1.setForeground(new java.awt.Color(0, 0, 102)); getContentPane().add(textbox1); textbox1.setBounds(170, 170, 360, 30); removestopword.setBackground(java.awt.SystemColor.inactiveCaption); removestopword.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N removestopword.setForeground(new java.awt.Color(0, 0, 102)); removestopword.setText("REMOVE"); removestopword.addActionListener(new java.awt.event.ActionListener() { public void actionPerformed(java.awt.event.ActionEvent evt) { removestopwordActionPerformed(evt); } }); getContentPane().add(removestopword); removestopword.setBounds(240, 110, 100, 30); stemming.setBackground(java.awt.SystemColor.inactiveCaption); stemming.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N stemming.setForeground(new java.awt.Color(0, 0, 102)); stemming.setText("STEMMING"); stemming.addActionListener(new java.awt.event.ActionListener() { public void actionPerformed(java.awt.event.ActionEvent evt) { stemmingActionPerformed(evt); } });
  82. 82. 82 getContentPane().add(stemming); stemming.setBounds(350, 110, 100, 30); title.setBackground(new java.awt.Color(255, 0, 0)); title.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N getContentPane().add(title); title.setBounds(80, 210, 368, 21); pathoffile.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N pathoffile.setText("Path of the File"); getContentPane().add(pathoffile); pathoffile.setBounds(40, 170, 110, 18); calc.setBackground(java.awt.SystemColor.inactiveCaption); calc.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N calc.setForeground(new java.awt.Color(0, 0, 102)); calc.setText("CALCULATION"); calc.addActionListener(new java.awt.event.ActionListener() { public void actionPerformed(java.awt.event.ActionEvent evt) { calcActionPerformed(evt); } }); getContentPane().add(calc); calc.setBounds(460, 110, 130, 30);
  83. 83. 83 jPanel1.setBackground(new java.awt.Color(204, 204, 204)); jPanel1.setLayout(null); DocClust.setFont(new java.awt.Font("Times New Roman", 1, 20)); // NOI18N DocClust.setIcon(new javax.swing.ImageIcon(getClass().getResource("/image/cooltext1297916834.png"))); // NOI18N jPanel1.add(DocClust); DocClust.setBounds(30, 60, 563, 40); jLabel1.setIcon(new javax.swing.ImageIcon(getClass().getResource("/image/cooltext1297931724.png"))); // NOI18N jPanel1.add(jLabel1); jLabel1.setBounds(30, 10, 570, 40); jLabel2.setIcon(new javax.swing.ImageIcon(getClass().getResource("/image/deep-blue-sky- background.jpg"))); // NOI18N jPanel1.add(jLabel2); jLabel2.setBounds(-50, -20, 680, 660); getContentPane().add(jPanel1); jPanel1.setBounds(-10, 0, 610, 630); java.awt.Dimension screenSize = java.awt.Toolkit.getDefaultToolkit().getScreenSize(); setBounds((screenSize.width-607)/2, (screenSize.height-659)/2, 607, 659); }// </editor-fold> private void selectActionPerformed(java.awt.event.ActionEvent evt) {
  84. 84. 84 try{ JFileChooser chooser=new JFileChooser(); int returnVal = chooser.showOpenDialog(this); if(returnVal == JFileChooser.APPROVE_OPTION) { folder = chooser.getCurrentDirectory(); path = folder.getPath(); textbox1.setText(path); files = folder.listFiles(); } title.setText("Content of the File"); if(files.length>1){ for(i = 0;i<files.length; i++){ if (files[i].isFile()) { int index = files[i].getName().lastIndexOf('.'); if (index > 0 && index <= files[i].getName().length() - 2) { filename = files[i].getName().substring(0, index); String fname = filename.toUpperCase(); text.append("\n"+fname+"\n\n"); } } FileReader fr = new FileReader(files[i]); BufferedReader br = new BufferedReader(fr); while((line = br.readLine())!=null){ text.append(line+" ");
  85. 85. 85 } text.append("\n"); } } } catch (Exception ex) { System.out.println(ex.getMessage()); } } private void removestopwordActionPerformed(java.awt.event.ActionEvent evt) { while(true){ ch = Character.toLowerCase((char) ch); w[j] = (char) ch; if (j < 500) j++; ch = in.read(); if (!Character.isLetter((char) ch)){ for (int c = 0; c < j; c++) s.add(w[c]); s.stem(); { String u; u = s.toString(); f.createNewFile(); FileWriter writer = new FileWriter(newfname,true); writer.write(u+" "); writer.close();
  86. 86. 86 text.append(u+"\n"); } break; } } } if (ch < 0) break; } text.append("\n"); } catch (Exception ex){ System.out.println(ex.getMessage()); } } catch (Exception ex){ System.out.println(ex.getMessage()); } } } } catch(Exception ex){ System.out.println(ex.getMessage()); } }
  87. 87. 87 private void calcActionPerformed(java.awt.event.ActionEvent evt) { frame1 form =new frame1(); form.setVisible(true); } public static void main(String args[]) { java.awt.EventQueue.invokeLater(new Runnable() { public void run() { new preprocess().setVisible(true); }});} // Variables declaration - do not modify private javax.swing.JLabel DocClust; private javax.swing.JButton calc; private javax.swing.JLabel jLabel1; private javax.swing.JLabel jLabel2; private javax.swing.JPanel jPanel1; private javax.swing.JScrollPane jScrollPane1; private javax.swing.JLabel pathoffile; private javax.swing.JButton removestopword; private javax.swing.JButton select; private javax.swing.JLabel selfiles; private javax.swing.JButton stemming; private javax.swing.JTextArea text; private javax.swing.JTextField textbox1; private javax.swing.JLabel title; // End of variables declaration }
  88. 88. 88 Graph.java : package ncluster; import java.awt.*; import org.jfree.chart.*; import org.jfree.chart.axis.*; import org.jfree.chart.plot.*; import org.jfree.chart.renderer.category.BarRenderer; import org.jfree.data.category.DefaultCategoryDataset; public class Graph { public static double kmeans = Purity.res1; public static double hsk = Purity.res; public static void main(String arg[]) { DefaultCategoryDataset dataset = new DefaultCategoryDataset(); dataset.setValue(kmeans, "Accuracy", "K-MEANS"); dataset.setValue(hsk, "Accuracy", "Incremental Mining"); JFreeChart chart = ChartFactory.createBarChart("", "Text Mining", "Accuracy", dataset, PlotOrientation.VERTICAL, false, true, false); chart.setBackgroundPaint(Color.white); final CategoryPlot plot = chart.getCategoryPlot(); plot.setBackgroundPaint(Color.lightGray); plot.setDomainGridlinePaint(Color.white); plot.setRangeGridlinePaint(Color.white); final NumberAxis rangeAxis = (NumberAxis) plot.getRangeAxis(); rangeAxis.setStandardTickUnits(NumberAxis.createIntegerTickUnits()); final BarRenderer renderer = (BarRenderer) plot.getRenderer();
  89. 89. 89 renderer.setDrawBarOutline(false); final GradientPaint gp0 = new GradientPaint( 0.0f, 0.0f, Color.blue, 0.0f, 0.0f, Color.lightGray); final GradientPaint gp1 = new GradientPaint( 0.0f, 0.0f, Color.green, 0.0f, 0.0f, Color.lightGray); final GradientPaint gp2 = new GradientPaint( 0.0f, 0.0f, Color.red, 0.0f, 0.0f, Color.lightGray); renderer.setSeriesPaint(0, gp0); renderer.setSeriesPaint(1, gp1); renderer.setSeriesPaint(2, gp2); final CategoryAxis domainAxis = plot.getDomainAxis(); domainAxis.setCategoryLabelPositions( CategoryLabelPositions.createUpRotationLabelPositions(Math.PI / 6.0)); ChartFrame frame1 = new ChartFrame("Clustering Accuracy", chart); frame1.setVisible(true); frame1.setSize(500, 500); } }
