A Management and Visualisation Tool for
Text Mining Applications
Student: Peishan Mao
MSc Computing Science Project Report
School of Computing Science and Information Systems
Birkbeck College, University of London, 2005
Status: Draft
Last saved: 26 Apr. 10
1 TABLE OF CONTENTS
1 TABLE OF CONTENTS
2 ACKNOWLEDGEMENT
3 ABSTRACT
4 INTRODUCTION
5 BACKGROUND
  5.1 Written Text
  5.2 Natural Language Text Classification
    5.2.1 Text Classification
    5.2.2 The Classifier
  5.3 Text Classifier Experimentations
6 HIGH-LEVEL APPLICATION DESCRIPTION
  6.1 Description and Rationale
    6.1.1 Build a Classifier
    6.1.2 Evaluate and Refine the Classifier
  6.2 Development and Technologies
7 DESIGN
  7.1 Functional Requirements
  7.2 Non-Functional Requirements
    7.2.1 Usability
    7.2.2 Hardware and Software Constraints
    7.2.3 Documentation
  7.3 System Framework
  7.4 Components in Detail
    7.4.1 The Client - User Interface
    7.4.2 Display Manager
    7.4.3 The Classifier
    7.4.4 Data Manipulation and Cleansing
    7.4.5 Experimentation
    7.4.6 Results Manager
    7.4.7 Error Handling
  7.5 Class Diagram
8 DATABASE
  8.1 Entities
    8.1.1 Score Table
    8.1.2 Source Table
    8.1.3 Configuration Table
    8.1.4 Score Functions Table
    8.1.5 Match Normalisation Functions Table
    8.1.6 Tree Normalisation Functions Table
    8.1.7 Classification Condition Table
    8.1.8 Class Weights Table
    8.1.9 Temporary Max and Min Score Table
  8.2 Views
    8.2.1 Weighted Scores
    8.2.2 Maximum and Minimum Scores
    8.2.3 Misclassified Documents
  8.3 Relation Design for the Main Tables
9 IMPLEMENTATION
  9.1 Main User Interface
  9.2 Display Manager
  9.3 Classifier Classes
  9.4 Results Output Classes
  9.5 Other Controller Classes
  9.6 TreeView Controller Class
  9.7 Error Interface
10 IMPLEMENTATION SPECIFICS
  10.1 Generic Selection Form Class
  10.2 Visualisation of the Suffix Tree
  10.3 Dynamic Sub-String Matching
  10.4 User Interaction Warnings
11 USER GUIDE
  11.1 Getting Started
    11.1.1 Input Data
  11.2 Loading a Resource Corpus
  11.3 Selecting a Sampling Set
  11.4 Performing Pre-processing
  11.5 Running N-Fold Cross-Validation
    11.5.1 Set Up Cross-Validation Set
    11.5.2 Perform Experiments on the Data
      11.5.2.1 Create the Suffix Tree
      11.5.2.2 Display Suffix Tree
      11.5.2.3 Delete Suffix Tree
      11.5.2.4 N-Gram Matching
      11.5.2.5 Score Documents
      11.5.2.6 Classify Documents
      11.5.2.7 Add New Document to Classify
  11.6 Creating a Classifier
12 TESTING
13 CONCLUSION
  13.1 Evaluation
  13.2 Future Work
14 BIBLIOGRAPHY
15 APPENDIX A DATABASE
16 APPENDIX B CLASS DEFINITIONS
17 APPENDIX C SOURCE CODE
2 ACKNOWLEDGEMENT
I would like to thank the following people for their help over the course of this project:
Rajesh Pampapathi: for his spectrum of help on the project, ranging from his patient
advice on the whole area of text classification and pointing me in the right direction for
information on the topic, to being interviewed as a potential user of the proposed system
as part of the requirements collection.
Timothy Yip: for laboriously proofreading the draft of the report despite not having
much interest in information technology.
3 ABSTRACT
This report describes the design and implementation of a management and visualisation
tool for text classification applications. The system is built as a wrapper for a machine
learning classification tool. It aims to provide a flexible framework that accommodates
future changes to the system. The system is implemented in C# .NET with a Windows
Forms front end and an Access database as an example, but should be flexible enough
for different underlying components to be added.
4 INTRODUCTION
This report describes the project carried out to implement a management and
visualisation tool for text classification. It covers background information about the
project, the design, the implementation and the conclusion. The report is organised as
follows:
Section 4 is this section. It describes the organisation of the report.
Section 5 looks at the background of the project. This section covers natural language
classification and the suffix tree data structure used in Pampapathi et al's study.
Section 6 gives a high-level description and rationale of the system.
Section 7 describes the design of the system. It lays out the system requirements and
the system framework, and describes the system components and classes.
Section 8 explains the database design and describes the database entities and
table relations.
Section 9 discusses how the system was implemented and goes into the class definitions.
Section 10 focuses on implementation specifics: the generic selection form class,
visualisation of the suffix tree, dynamic sub-string matching on documents, and user
interaction warnings.
Section 11 is the user guide to the system.
Section 12 describes the testing of the system.
Section 13 concludes the project. This section discusses whether the system built has
met the requirements laid out at the beginning of the project. It also looks at future work.
Appendix A Database
Appendix B Class Definitions
Appendix C Source Code
5 BACKGROUND
5.1 Written Text
Writing has long been an important means of exchanging information, ideas and
concepts from one individual to another, or to a group. Indeed, it is even thought to be
the single most advantageous evolutionary adaptation for species preservation [2]. The
written text available contains a vast amount of information. The advent of the internet
and on-line documents has contributed to the proliferation of digital textual data readily
available for our perusal. Consequently, it is increasingly important to have a systematic
method of organising this corpus of information.
Tools for textual data mining are proving increasingly important to our growing mass of
text-based data. The discipline of computer science has made significant contributions
to this area by automating the data mining process. Encoding unstructured text data into
a more structured form is not a straightforward task: natural language is rich and
ambiguous, and working with free text is one of the most challenging areas in computer
science.
This project aims to investigate how computer science can help to evaluate some of the
vast amounts of textual information available to us, and how to provide a convenient way
to access this type of unstructured data. In particular, the focus will be on the data
classification aspect of data mining. The next section will explore this topic in more
depth.
5.2 Natural Language Text Classification
5.2.1 Text Classification
F Sebastiani [3] described automated text categorisation as
“The task of automatically sorting a set of documents into categories (or classes, or
topics) from a predefined set. The task, that falls at the crossroads of information
retrieval, machine learning, and (statistical) natural language processing, has witnessed
a booming interest in the last ten years from researchers and developers alike.”
Classification maps data into predefined groups or classes. Examples of classification
applications include image and pattern recognition, medical diagnosis, loan approval,
detecting faults in industrial applications, and classifying financial trends. Until the late
1980s, knowledge engineering was the dominant paradigm in automated text
categorisation. Knowledge engineering consists of domain experts manually defining a
set of rules which form part of a classifier. Although this approach has produced results
with accuracies as high as 90% [3], it is labour intensive and domain specific. A new
paradigm based on machine learning, which addresses many of the limitations of
knowledge engineering, has since superseded its predecessor.
Machine learning encompasses a variety of methods that represent the convergence of
statistics, biological modelling, adaptive control theory, psychology, and artificial
intelligence (AI) [11]. Data classification by machine learning is a two-phase process
(Figure 1). The first phase involves a general inductive process that automatically builds
a model, using a classification algorithm, describing a predetermined set of
non-overlapping data classes. This step is referred to as supervised learning because the
classes are determined before examining the data; the set of data used is known as the
training set. Data in text classification comes in the form of files, and each file is
usually referred to as a document. Classification algorithms require that the classes are
defined based purely on the content of the documents. They describe these classes by
looking at the characteristics of the documents in the training set already known to
belong to each class. The learned model constitutes the classifier and can be used to
categorise future corpus samples. In the second phase, the classifier constructed in
phase one is used for classification.
The machine learning approach to text classification is less labour intensive and is
domain independent. Since the attribution of documents to categories is based purely on
the content of the documents, effort is concentrated on constructing an automatic
builder of classifiers (also known as the learner), and not on the classifier itself [3]. The
automatic builder is a tool that extracts the characteristics from the training set, which
are represented by a classification model. This means that once a learner is built, new
classifiers can be automatically constructed from sets of manually classified documents.
[Diagram: a) the training set is fed to the classification algorithm, which produces the
classification model; b) the classification model is applied to the test set and to new
documents]
Figure 1. a) Step One in Text Classification b) Step Two in Text Classification
5.2.2 The Classifier
In general a text classifier comprises a number of basic components. As noted in the
previous section, the text classifier begins with an inductive stage, and a classifier
requires some form of text representation of documents. In order to build an internal
model, the inductive step uses a set of examples for training the classifier. This set of
examples is known as the training set, and each document in the training set is assigned
to a class from the set C = {c1, c2, …, cn}. All the documents used in the training phase
are transformed into internal representations.
Currently, a dominant learning method in text classification is based on the vector space
model [5]. The Naïve Bayesian classifier is one example and is often used as a
benchmark in text classification experiments. Bayesian classifiers are statistical
classifiers: classification is based on the probability that a given document belongs to a
particular class. The approach is 'naïve' because it assumes that the contributions of all
attributes to a given class are independent and that each contributes equally to the
classification problem. By analysing the contribution of each 'independent' attribute, a
conditional probability is determined. Attributes in this approach are the words that
appear in the documents of the training set.
Documents are represented by a vector with dimensions equal to the number of distinct
words within the documents of the training set. The value of each individual entry within
the vector is set to the frequency of the corresponding word. According to this approach,
training data are used to estimate the parameters of a probability distribution, and
Bayes' theorem is used to estimate the probability of each class. A new document is
assigned to the class that yields the highest probability. It is important to perform
pre-processing to remove frequent words such as stop words before a training set is used
in the inductive phase.
The Naïve Bayesian approach has several advantages. Firstly, it is easy to use; secondly,
only one scan of the training data is required. It can also easily handle missing values,
by simply omitting the corresponding probability when calculating the likelihood of
membership in each class. Although the Naïve Bayesian-based classifier is popular,
documents are represented as a 'bag of words' in which the words of a document have
no relationship with each other, whereas the words that appear in a document are
usually not independent. Furthermore, the smallest unit of representation is a word.
Research continues to investigate how the design of text classifiers can be further
improved, and Pampapathi et al [1] at Birkbeck College, London recently proposed an
innovative new approach to the internal modelling of text classifiers. They used a well
known data structure called a suffix tree [11], which allows the characteristics of
documents to be indexed at a more granular level, with documents represented by
substrings. The suffix tree is a compact trie containing all the suffixes of the strings it
represents. A trie is a tree structure where each node represents one character and the
root represents the null string. Each path from the root represents a string, described by
the characters labelling the nodes traversed. All strings sharing a common prefix branch
off from a common node. When the strings are words over a to z, a node has at most 26
children, one for each letter (or 27, including a terminator). Suffix trees have
traditionally been used for complex string matching problems on string sequences (e.g.
data compression, DNA sequencing). Pampapathi et al's research is the first to apply
suffix trees to natural language text classification.
Pampapathi et al's method of constructing the suffix tree varies slightly from the
standard way. Firstly, the tree nodes are labelled instead of the edges, in order to
associate the frequencies directly with the characters and substrings. Secondly, a special
terminal character is not used, as the focus is on the substrings and not the suffixes.
Each suffix tree has a depth, described by the maximum number of levels in the tree; a
level is defined by the number of nodes away from the root node. For example, the
suffix tree illustrated in Figure 2 has a depth of 4. Pampapathi et al set a limit on the
tree depth, and each node of the suffix tree stores a frequency and a character.
For example, constructing a suffix tree for the string S1 = "COOL" produces the tree in
Figure 2. The substrings inserted are COOL, OOL, OL, and L.
[Diagram: a suffix tree containing the substrings of "COOL" (COOL, OOL, OL, and L),
each node labelled with its character and frequency]
Figure 2. Suffix Tree for String 'COOL'
If a second string S2 = "FOOL" is inserted into the suffix tree, it will look like the
diagram illustrated in Figure 3. The substrings for S2 are FOOL, OOL, OL, and L. Notice
that the last three substrings of S2 duplicate substrings already seen in S1; no new nodes
are created for these repeated substrings, and the frequencies on the existing nodes are
incremented instead.
[Diagram: the same suffix tree after inserting "FOOL"; a new branch is added for FOOL,
while the nodes shared with OOL, OL, and L now carry a frequency of 2]
Figure 3. Suffix Tree with String 'FOOL' Added
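The construction just illustrated can be sketched in a few lines of C#. This is a minimal
sketch only: the type and member names are illustrative rather than the Birkbeck
library's API, and for simplicity the frequency counter is incremented on every substring
occurrence, which may differ from the exact counting convention behind the figures.

    using System.Collections.Generic;

    class StNode
    {
        public readonly Dictionary<char, StNode> Children = new Dictionary<char, StNode>();
        public int Frequency;                      // count for the substring ending at this node
    }

    class SuffixTree
    {
        private readonly StNode root = new StNode();
        private readonly int maxDepth;

        public SuffixTree(int maxDepth) { this.maxDepth = maxDepth; }

        // Insert every substring of s (every prefix of every suffix), truncated
        // to maxDepth characters; shared prefixes reuse existing nodes.
        public void Insert(string s)
        {
            for (int i = 0; i < s.Length; i++)
            {
                StNode current = root;
                for (int j = i; j < s.Length && j - i < maxDepth; j++)
                {
                    StNode child;
                    if (!current.Children.TryGetValue(s[j], out child))
                    {
                        child = new StNode();
                        current.Children.Add(s[j], child);
                    }
                    child.Frequency++;
                    current = child;
                }
            }
        }
    }

Inserting "COOL" and then "FOOL" into a SuffixTree with a maximum depth of 4
reproduces the shape of Figure 3: a new branch is created for FOOL, while the nodes
shared with OOL, OL, and L are reused.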
Similar to the Naïve Bayesian method, a classifier using the suffix tree as its internal
model undergoes supervised learning from a training set containing documents that
have been pre-classified into classes. Unlike the Naïve Bayesian approach, the suffix
tree, by capturing the characteristics of documents at the character level, does not
require pre-processing of the training set. A suffix tree is built for each class, and a new
document is classified by scoring it against each of the trees; the class of the
highest-scoring tree is assigned to the document. Pampapathi et al's study was based on
email classification, and the results of the experiment showed that a classifier employing
a suffix tree outperformed the Naïve Bayesian method.
In order to solve a classification problem, the classifier is one of the central
components, but as seen with the Naïve Bayesian method it is also important to perform
pre-processing on the data used for training. The next section looks at the processes
involved in text classification beyond the classifier component itself.
5.3 Text Classifier Experimentation
As described in previous sections, classification is a two-step process:
1. Create a specific model by evaluating the training data. This step has as input
the training data (including the category/class labels) and as output a definition of
the model developed. The model created, which is the classifier, should classify the
training data as accurately as possible.
2. Apply the model developed by classifying new sets of documents.
In the research community, or for those interested in evaluating the performance of a
classifier, the second step can be more involved. First, the predictive accuracy of the
classifier is estimated. A simple yet popular technique is the holdout method, which
uses a test set of class-labelled samples. These samples are usually randomly selected,
and it is important that they are independent of the training samples; otherwise the
estimate could be optimistic, since the learned model is based on that data and
therefore tends to overfit it. The accuracy of a classifier on a given test set is the
percentage of test set samples that are correctly classified by the classifier. For each
test sample, the known class label is compared with the classifier's class prediction for
that sample.
If the accuracy of the classifier model is considered acceptable, the model can be used
to classify new documents.
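A minimal sketch of this accuracy computation (the names are illustrative):

    // Accuracy = fraction of test samples whose predicted class matches the known label.
    static double Accuracy(string[] knownLabels, string[] predictedLabels)
    {
        int correct = 0;
        for (int i = 0; i < knownLabels.Length; i++)
            if (knownLabels[i] == predictedLabels[i])
                correct++;
        return (double)correct / knownLabels.Length;
    }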
[Diagram: the corpus data is split into a training set, used to derive the classifier, and a
test set, used to estimate its accuracy]
Figure 4. Estimating Classifier Accuracy with the Holdout Method
The estimate using the holdout method is pessimistic since only a portion of the initial
data is used to derive the classifier. Another technique, called N-fold cross-validation, is
often used in research. Cross-validation is a statistical technique which can mitigate
bias caused by a particular partition of training and test set. It is also useful when the
amount of data is limited. The method can be used to evaluate and estimate the
performance of a classifier, and the aim is to obtain as honest an estimate as possible
of the classification accuracy of the system. N-fold cross-validation involves
partitioning the dataset (initial corpus) randomly into N equally sized non-overlapping
blocks/folds. The training-testing process is then run N times, each time with a different
test set. For example, when N = 3, we have the following training and test sets.
Run 1: Train on blocks 1, 2; Test on block 3
Run 2: Train on blocks 1, 3; Test on block 2
Run 3: Train on blocks 2, 3; Test on block 1
Figure 5. 3-Fold Cross-Validation
For each cross-validation run the user will be able to use a training set to build the
classifier.
Stratified N-fold cross-validation is a recommended method for estimating classifier
accuracy due to its low bias and variance [13]. In stratified cross-validation, the folds are
stratified so that the class distribution of the samples in each fold is approximately the
same as that of the initial training set.
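A minimal sketch of stratified fold assignment, assuming documents are identified by
file path and grouped by class name: each class's documents are shuffled and dealt
round-robin into the N folds, so every fold approximates the overall class distribution.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class Stratifier
    {
        public static List<string>[] StratifiedFolds(
            Dictionary<string, List<string>> docsByClass, int n)
        {
            var folds = new List<string>[n];
            for (int i = 0; i < n; i++) folds[i] = new List<string>();

            var rng = new Random();
            foreach (KeyValuePair<string, List<string>> entry in docsByClass)
            {
                // Shuffle within the class, then deal the documents into folds in turn.
                List<string> docs = entry.Value.OrderBy(_ => rng.Next()).ToList();
                for (int i = 0; i < docs.Count; i++)
                    folds[i % n].Add(docs[i]);
            }
            return folds;
        }
    }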
Preparing the training set data for classification using pre-processing can help improve
the accuracy, efficiency, and scalability of the evaluation of the classification. Methods
include stop word removal, punctuation removal, and stemming.
The use of the above techniques to prepare the data and estimate classifier accuracy
increases the overall computation time, but is valuable for evaluating a classifier and for
selecting among several classifiers.
The current project aims to build a system that is a wrapper around a text classifier and
that incorporates, as an example, the suffix tree used in the research by Pampapathi et
al. The sections that follow describe the project in detail.
6 HIGH-LEVEL APPLICATION DESCRIPTION
6.1 Description and Rationale
The aim of this project is to build a management and visualisation tool that will allow
researchers to perform data manipulation in support of underlying text classification
algorithms. The tool will provide a software infrastructure for a data mining system
based on machine learning. The goal is to build a flexible framework that allows
changes to the underlying components with relative ease. Functions may be added to
the system in the future, and adding new functionality should have minimal effect on
the current system.
The system will be built as a wrapper for the two-step process involved in classification.
First, a component will be built that automatically constructs a classifier from training
data. Second, the system will provide the capability to perform classification and to
evaluate the performance of a classifier. Additionally, the tool will provide the
functionality to run data sampling and various kinds of pre-processing on the data.
It is incumbent on the researcher to clearly define the training set (referred to as the
'resource corpus' in this report) used for training the classifier. When the resource
corpus is small, the user can choose to use the entire corpus in the study. If the
resource corpus is large, the tool gives the option of selecting sampling sets to represent
it. A number of sampling methodologies are implemented, allowing the user to select a
sample that reflects the characteristics of the resource corpus from which it is drawn.
Note that a resource corpus is grouped into classes, and this structure needed to be
taken into consideration when the sampling mechanism was developed. Three popular
sampling methods will be developed, although other sampling methods can be added,
such as convenience sampling, judgement sampling, quota sampling, and snowball
sampling.
Note that the user can choose to evaluate the data used to construct the classifier
before actually building it. The tool is designed to be generic enough to analyse a
corpus of any categorisation type, e.g. automated indexing of scientific articles, email
routing, spam filtering, criminal profiling, and expertise profiling.
6.1.1 Build a Classifier
The tool allows the user to build a classifier. The current framework only implements the
suffix tree-based classifier developed at Birkbeck College, but is flexible enough to
incorporate other classification models in the future. The research applying suffix trees
to classification is new, and no comparable application currently exists.
The learning process of the classifier follows the machine learning approach to
automated text classification, whereby the system automatically builds a classifier for the
categories of interest. From the graphical user interface (GUI), the user can select a
corpus to use as training data. The application provides links to .dll files developed by
Birkbeck College which allow the user to build a suffix tree from the selected corpus. The
internal data representation is constructed by generalising from a training set of pre-
classified documents. Once the classifier is built the user can load new documents into
the system to be classified.
6.1.2 Evaluate and Refine the Classifier
In research, once a classifier has been built it is desirable to evaluate its effectiveness.
Even before the construction of the classifier, the tool provides a platform for users to
perform a number of experiments and refinements on the source (training) data. Hence,
the second focus of the project is to provide a user-friendly front end and a base
application for testing classification algorithms.
The user can load in a text-based corpus and perform standard pre-processing functions
to remove noise and prepare the data for experimentation. There is also a choice of
sampling methods to use in order to reduce the size of the initial corpus, making it more
manageable.
Sebastiani [2] notes that any classifier is prone to classification error, whether the
classifier is human or machine. This is due to a notion central to text classification: the
membership of a document in a class, based on the characteristics of the document and
the class, is inherently subjective, since the characteristics of both the documents and
the class cannot be formally specified. As a result, automatic text classifiers are
evaluated using a set of pre-classified documents: the classifier's classification decisions
are compared with the original categories the documents were assigned to. For
experimentation and evaluation purposes, this set of pre-classified documents is split
into two sets, a training set and a test set, not necessarily of equal sizes.
The tool implements an extra level of experimentation using n-fold cross-validation.
When employing cross-validation in classification, the grouping of the data into classes
must be taken into account; this project therefore implements stratified cross-validation.
Once a classifier has been constructed, it is possible to perform data classification
experiments as well as other tasks such as single-document analysis. For example, with
the suffix tree-based classifier the user will be able to view the structure of the suffix
tree, view the documents in the test sets, or load a new document and obtain a full
matrix of output data about it. The output data is persisted in an information system
which is subsequently used to perform analysis and visualisation tasks.
6.2 Development and Technologies
Development was done in C#, using the .NET framework. The architecture of the system
was designed as an extensible platform, enabling users and developers to leverage the
existing framework for future system upgrades. The tool was built from several
components and aims to be modular, with a number of controller components providing
the tool's functionality. A set of libraries provides the functionality for the suffix tree;
these libraries were supplied by Birkbeck College, whose researchers worked closely with
the author on the interface.
The suffix tree data structure is built in memory and can become very large. One
solution to better utilise resources is to have the data structure physically stored as one
tree, although it is logically represented as individual trees for each class. Further
discussion can be found in subsequent sections.
A Windows application was built as the client. This forms the interface that the user
interacts with to gain access to the functionalities of the tool. The output data is cached
in a database.
The main target users for the tool are researchers in natural language text classification,
and other users who want to mine textual data.
7 DESIGN
7.1 Functional Requirements
Requirements for the application were collected from research on natural language text
classification and discussions with targeted users in the research community.
Requirements are the capabilities and conditions to which the application must conform.
The functional requirements of the system are captured using 'use cases'. Use cases
are a useful tool for describing how a user interacts with a system: they are written
stories, easy to understand, that describe the interaction between the user and the
system. Requirements can often change over the course of development, and for this
reason there was no attempt to define and freeze all requirements from the outset of
the project. The following use cases were produced; note that some use cases were
added during the development of the system.
Use Case Name: Load Directory as Source Corpus
Primary Actor: User
Pre-conditions: The application is running
Post-conditions: A source corpus is loaded into the application
Main Success Scenarios:
Actor Action (or Intention):
1. The user selects a valid directory, to which they have at least read access, and loads
it as a corpus into the system.
System Responsibility:
2. The system checks the directory path for validity and access.
3. Builds a tree structure of classes based on the sub-folders in the directory and
displays the classes in the GUI.
Use Case Name: View a Document in Corpus
Primary Actor: User
Pre-conditions: A corpus is successfully loaded
Post-conditions:
Main Success Scenarios:
Actor Action (or Intention):
1. Select the document to view.
System Responsibility:
2. Display the content of the document in the GUI.
Use Case Name: Create Sampling Set
Primary Actor: User
Preconditions: A source corpus is successfully loaded
Postconditions: A sampling set based on the source corpus is created. New
file directory created for the corpus.
Main Success Scenarios:
Actor Action (or Intention):
1. User selects how they want to select the sampling set.
2. User specifies the location to store the documents/files created for the sampling set.
System Responsibility:
3. Creates a sampling set based on parameters given by the user.
4. Creates the directory structure and documents/files in the location specified by the
user.
5. Displays the new corpus created in the GUI.
Use Case Name: Run Pre-Processing
Primary Actor: User
Pre-conditions: A training set exists in the system
Post-conditions: A new pre-processed sampling set is created. A new file directory is
created for the corpus.
Main Success Scenarios:
Actor Action (or Intention):
1. Select the type of pre-processing to perform.
2. User specifies the location to store the documents/files created for the
pre-processing set.
3. Run pre-processing.
System Responsibility:
4. Performs pre-processing.
5. Creates a new pre-processed set.
6. Stores the directory structure and documents/files at the location specified by the
user.
7. Displays the corpus as a directory structure in the GUI.
Use Case Name: Run N-Fold Cross-Validation
Primary Actor: User
Preconditions: A sampling set is successfully created
Postconditions: N-fold cross-validation set is created virtually
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the sampling set to process and the number of folds.
System Responsibility:
2. Builds the n-fold cross-validation set based on parameters given by the user; this
includes the n runs, each run containing a training set and a test set.
3. Displays the new cross-validation set created in the GUI.
Use Case Name: Create Classifier (Suffix Tree)
Primary Actor: User
Preconditions: A cross-validation set or classification set exists
Postconditions: Classifier created in memory
Main Success Scenarios:
Actor Action (or Intention):
1. User activates an event to build a classifier for a cross-validation set or classification
set.
2. User chooses any additional conditions to apply.
System Responsibility:
3. Builds the classifier in memory, based on the corpus set selected.
4. Indicates in the GUI that the classifier for the corpus has been created.
Use Case Name: Score Documents
Primary Actor: User
Preconditions: An n-fold cross-validation set is created. The classifier for the corpus set
is created
Postconditions: Documents in the cross-validation set are scored and the data is stored
in the database
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the cross-validation run to score.
System Responsibility:
2. Scores all documents under the selected corpus set.
3. Inserts the score data into the database.
Use Case Name: Classify Documents
Primary Actor: User
Preconditions: An n-fold cross-validation set is created. The classifier for the set is
created and the documents have been scored
Postconditions: Misclassified documents in the cross-validation set are flagged
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the cross-validation run to classify.
System Responsibility:
2. Classifies all documents under the selected cross-validation set.
3. Flags all misclassified documents in the GUI.
Use Case Name: Create Classification Set
Primary Actor: User
Preconditions: A source corpus is successfully loaded
Postconditions: A classification set is created virtually
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the corpus set they want to use to create a classifier.
System Responsibility:
2. Displays the new corpus created in the GUI as a classification corpus set.
Use Case Name: Load New Document to Classify
Primary Actor: User
Preconditions: A cross-validation set or classification set exists
Postconditions: Substring matches and related output data are stored in the database
Main Success Scenarios:
Actor Action (or Intention):
1. User decides which suffix tree to use for classification and loads a valid textual
document as an item to be classified and analysed.
System Responsibility:
2. The document name and relevant information are displayed in the GUI, ready to be
analysed.
3. Scores and classifies the document.
4. Stores the output data in the database.
Use Case Name: View a Document
Primary Actor: User
Pre-conditions: Document loaded into the system
Post-conditions:
Main Success Scenarios:
Actor Action (or Intention):
1. Select the document to view.
System Responsibility:
2. Display the content of the document in the GUI.
Use Case Name: View n-Gram Matches in Document
Primary Actor: User
Preconditions: The document concerned is successfully loaded and the suffix tree
classifier has been created
Postconditions:
Main Success Scenarios:
Actor Action (or Intention):
1. User selects a string/substring in a document to match.
System Responsibility:
2. Queries the classifier to retrieve the n-length substring matches.
3. Displays to the user the frequency for the string/substring selected.
Use Case Name: View Statistics on Matches
Primary Actor: User
Preconditions: Document successfully loaded and scored, and output exists in the
database
Postconditions: Information displayed in the GUI
Main Success Scenarios:
Actor Action (or Intention):
1. User selects to view output.
System Responsibility:
2. System queries and retrieves the relevant data from the database.
3. Displays the output in table form in the GUI.
Use Case Name: Visualise Representation of Classifier (View Suffix Tree)
Primary Actor: User
Preconditions: Classifier was successfully built
Postconditions: Classifier visual representation displayed in the GUI
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the option to display the suffix tree.
System Responsibility:
2. Builds a visual representation of the classifier and displays it in the GUI.
Use Case Name: Delete Classifier
Primary Actor: User
Preconditions: Classifier was successfully built
Postconditions: Classifier is deleted
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the classifier to delete.
System Responsibility:
2. Removes the classifier and clears the displayed tree in the GUI.
7.2 Non-Functional Requirements
The non-functional requirements for the use cases are as follows.
7.2.1 Usability
The user should have a single main user interface for interacting with the system. The
user interface should be user friendly, and the complexity of computations, e.g. building
an n-fold cross-validation set or scoring documents against a classification model,
should be hidden from the user.
An experimental run of the suffix tree classifier could involve as many as 126 scoring
configurations, which could together take considerable time to calculate. It therefore
makes sense to keep a store of all calculated scores, rather than calculate them on the
fly whenever they are requested. The results will be cached in a data store, implemented
as a database in this project, optimising system responsiveness.
Some system requests can only be activated once a pre-condition has been satisfied,
e.g. the user can only score documents when the suffix tree has been created. The
system should give informative warning messages if the user attempts to perform a task
without its pre-conditions being satisfied. Where appropriate, the system may
automatically satisfy the pre-conditions before performing the requested task.
7.2.2 Hardware and Software Constraints
The application should be easily extensible and scalable. Developers should be able to
add both extra functionality and expand the workload the application can handle with
relative ease.
The design should consider future enhancement of the system, which should be
reasonably easy to maintain and upgrade. Code should also be well documented.
The system should use an RDBMS to manage its data layer, but be independent of the
particular RDBMS used.
7.2.3 Documentation
Help menus and tool tips will be available to help users interact with the system. The
application will also come with a user manual, including screen shots. The application
will be available along with written documentation for its installation and configuration.
7.3 System Framework
It was decided to build the system with a number of components. Each component has
a specialised function in the system. Figure 6 illustrates the main components and the
system boundary. The next section will describe the functions of each component in
more detail and section 7.5 contains the class diagram. By isolating system
responsibilities the following main components were identified.
User interface
Display Manager
Classifier (Central Manager, STClassifier Manager, STClassifier)
Sampling Set Generator
Pre-processor
Cross-validation
Results Manager (Database Manager, OLEDB, Database)
Figure 7 shows how the system is divided into a client/server architecture. The
advantage of this set-up is ease of maintenance, as the server implementation can be
abstracted away from the client. All the functionalities of the system are accessed
through the graphical user interface (GUI). The implementation resides in the server,
isolating users from system complexities not relevant to them.
One of the main aims of the design of the system was to create a flexible framework.
The green boxes seen in Figure 8 represent new or alternative components that can be
added to the system in the future with relative ease.
[Diagram: the system boundary encloses the Graphical User Interface, Display Manager,
Central Manager, Sampling Set Generator, Pre-processor, Cross-Validation, STClassifier
Manager, STClassifier, Stemmer, Utility and Random helpers, and the Results Manager
with its Database Manager, OLEDB component, and Database; Input Data enters from
outside the boundary]
Figure 6. System Components and Boundary
[Diagram: the same components split into a client and a server: the Graphical User
Interface forms the client, while the Display Manager, Central Manager, Sampling Set
Generator, Pre-processor, Cross-Validation, STClassifier Manager, STClassifier, and
Results Manager with its database components form the server]
Figure 7. Client Server Division
[Diagram: the component layout of Figure 7 with green 'Others...' boxes attached to the
client, the Display Manager, the Results Manager, the classifier managers, the
pre-processing methods, and the database, indicating where new or alternative
components can be plugged in]
Figure 8. Additional or Alternative Components
7.4 Components in Detail
7.4.1 The Client - User Interface
The user interacts with the system via a single graphical user interface which is also the
client. In this project the client is implemented as a set of Windows forms and controls in
.NET. There is one main form where users can access all the functionalities of the
system. There are a number of other dialog boxes and forms to help with the navigation
and interaction with the system. For example there is a Select Scoring Method form,
used to request from the user the scoring methodology to use when scoring a new
document. Other more generic forms such as the Select Dialog form are employed for a
number of uses and do not display specific types of information (see section 10
Implementation Specifics for further discussion).
The client is simply an event handler for each of the GUI controls; it calls the Central
Manager via the Display Manager for the actual data processing. The GUI contains no
implementation, but delegates to the Display Manager, thus decoupling the interface
from the implementation. There is two-way communication between the client and the
Display Manager: a user invokes an event and related messages are passed to the
Display Manager, which passes them on to the Central Manager. The Central Manager
subsequently either delegates the task to other more specialised controllers or resolves
the request itself.
The design of the screens was done in consultation with potential users. The user should
be able to perform all the tasks described by the use cases in the Functional
Requirements section (the functions will not be reiterated here).
For this project, Windows forms were chosen for the implementation because most users
are familiar with the Windows forms interface; this creates a familiar experience on
initial interaction and facilitates use of the system. In particular, the .NET framework
provides a wealth of controls and functionality, which helps in building a user-friendly
interface and hides the complexity of the underlying workings from the user. The
different components are built as separate classes, so the user interface (the client)
could be implemented using a different methodology from Windows forms, such as a
command line, as illustrated in Figure 9.
[Diagram: the Graphical User Interface (with its Select Dialog and Select Scoring Method
forms) and an alternative Command Line client both passing input data to the Display
Manager]
Figure 9. Client Interface and Its Collaborating Components
7.4.2 Display Manager
The Display Manager is a layer between the User Interface on one side and the Central
Manager and the rest of the system on the other; it essentially passes messages between
these two sides. The Display Manager is responsible for the information displayed back
to the user, and it also manages the input data.
7.4.3 The Classifier
It was mentioned in the previous section that the Central Manager is part of the
classifier. Figure 10 illustrates the classifier (enclosed by the red box) and its connecting
components. The classifier comprises the Central Manager, a controller that manages
the underlying model of the classifier, and the underlying model itself. The Central
Manager handles the communication between all the main components in the system
that communicate with the classifier. The Central Manager should provide the following
functionalities:
Select Sampling Set for a corpus
Pre-process all documents in a corpus
Run cross-validation on a corpus
Create a classifier for a given corpus
Score all documents in a corpus
Classify all documents in a corpus
Obtain classification results for a corpus
Further controller classes are called by the Central Manager to provide more specialised
functionality; these are the Output Manager, Suffix Tree Manager, Sampling Set
Generator, Pre-processor, and Cross-validation.
When a user loads a corpus into the system, it is managed by the Central Manager. If
there is a request to create a sampling set, for example, the Central Manager knows
where the corpus is located and delegates to the Sampling Set Generator the task of
creating a sampling set based on parameters set by the user. Similarly, a request from
the user to perform pre-processing on the corpus is delegated by the Central Manager to
the Pre-processor.
The various components are designed to have specialised tasks; they do not need to
know where the data is located, as this information is passed to them when the Central
Manager invokes a request. The Sampling Set Generator does not need to know how the
Pre-processor carries out its task, nor does it need to know about the Cross-validation
component. Each of the three components receives data and requests from the Central
Manager, performs its task, and returns any information back to the Central Manager.
The classifier has to be connected to an internal model. In this project the suffix tree
data structure is employed to model the representation of document characteristics. As
seen in Figure 10, the classifier can be implemented with different types of models,
such as Naïve Bayesian or neural networks. There is two-way communication between
the Central Manager and the STClassifier via the STClassifier Manager. The STClassifier
is a DLL library built by the Birkbeck researchers. It provides public interfaces to:
Build the representation of documents using the suffix tree data structure
Train the classifier
Score a document
Return classification results
The STClassifier Manager controls the flow of messages between the Central Manager
and the STClassifier. Its responsibilities involve converting data to the format accepted
by the STClassifier, and converting output from the STClassifier before it is passed back
to the Central Manager. It is essentially a wrapper class for the STClassifier.
The suffix tree is built using the contents of the documents in a training set. Once a
suffix tree is built, it is cached in an ArrayList managed by the STClassifier Manager (an
ArrayList is a C# collection class implemented in .NET). The suffix tree remains in
memory until the user activates an event to delete it. As a result, the system does not
need to recreate a suffix tree for every subsequent action that references it; only
methods in the STClassifier Manager need be called, and it is not necessary to call
methods in the STClassifier.
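A minimal sketch of this caching behaviour; the real manager stores EMSTreeClassifier
instances, and the key and member names here are illustrative:

    using System.Collections;

    class SuffixTreeCache
    {
        private readonly SortedList cache = new SortedList();   // key -> built suffix tree

        public bool Contains(string key) { return cache.ContainsKey(key); }

        public void Add(string key, object tree)
        {
            if (!cache.ContainsKey(key)) cache.Add(key, tree);  // build once, reuse thereafter
        }

        public object Get(string key) { return cache[key]; }

        public void Remove(string key) { cache.Remove(key); }   // frees the large in-memory tree
    }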
The classifier generates output data when a request is invoked to classify and score
documents. These two actions can be time-consuming activities. The Central Manager
decides what type of output data needs to be saved and passes the data from the
classifier to the Results Manager to handle. Section 7.4.6 describes the design of the
Results Manager.
[Diagram: the Central Manager, outlined in red together with the model managers and
models, communicates with the Display Manager (serving the GUI or a command line
client), the Results Manager, the Sampling Set Generator, the Pre-processor, and
Cross-Validation; interchangeable model stacks are shown for NBClassifier, NNClassifier,
and STClassifier, each behind its own manager]
Figure 10. The Classifier and Its Collaborating Components
7.4.4 Data Manipulation and Cleansing
When a corpus is loaded into the system as input data, the user can create sampling
sets from the initial corpus and also prepare the data for experimentation by performing
various types of pre-processing on it. The input data is given to the classifier, which
sends it to the Sampling Set Generator to handle the generation of sampling sets.
Various sampling methodologies can be plugged into the Sampling Set Generator; for
this project the system implements random sampling and systematic sampling. The
Pre-processor provides the functionality for pre-processing data passed to it. Similarly,
various methods of pre-processing can be plugged into the system with relative ease.
Currently, the system provides stemming, stop word removal, and punctuation removal.
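As an illustration, systematic sampling within a single class of the corpus can be
sketched as follows; the method and parameter names are assumptions rather than the
system's actual API:

    using System;
    using System.Collections.Generic;

    static class Sampling
    {
        // Systematic sampling: after a random start, take every k-th document,
        // where k is the sampling interval (population size / sample size).
        public static List<string> SystematicSample(IList<string> files, int sampleSize)
        {
            var sample = new List<string>();
            int k = Math.Max(1, files.Count / sampleSize);
            int start = new Random().Next(k);          // random offset in [0, k)
            for (int i = start; i < files.Count && sample.Count < sampleSize; i += k)
                sample.Add(files[i]);
            return sample;
        }
    }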
In order for a method to plug into the system, its class must implement an IMethod
interface, which guarantees the following (a sketch follows the list):
A method class must have a Name property that returns the name of the method. This
is necessary so that, if new methods are added to the system, each can be identified by
its name.
A method class must have a Run method. This method is where all the work is done.
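A minimal sketch of this contract, with punctuation removal as an illustrative
implementation. In the actual system the Run signatures vary by component, so the
single-argument form here is a simplification:

    interface IMethod
    {
        string Name { get; }            // identifies the method to the framework
        string Run(string content);     // where all the work is done
    }

    class PunctuationRemoval : IMethod
    {
        public string Name { get { return "Punctuation Removal"; } }

        public string Run(string content)
        {
            var sb = new System.Text.StringBuilder(content.Length);
            foreach (char c in content)
                if (!char.IsPunctuation(c)) sb.Append(c);   // drop punctuation only
            return sb.ToString();
        }
    }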
A set of utility classes provides helper functionality such as random number generation,
common divisors, and file system access.
[Diagram: the Central Manager communicates with the Sampling Set Generator, into
which Systematic, Random, and Snowball methods can be plugged, and with the
Pre-processor, into which Stop Word Removal, Punctuation Removal, the Stemmer, and
other methods can be plugged; both draw on the Utility classes]
Figure 11. Data Manipulation and Cleansing Components and Their Collaborating Components
7.4.5 Experimentation
Setting up data for experimentation is the main responsibility of the Cross-validation
class. The Central Manager passes a corpus to the Cross-validation component, which
uses the data to build N-fold cross-validation sets. It divides the given corpus into N
blocks and builds a training set and a test set for each of the N runs. The data is stored
as an array that is passed back to the Central Manager.
The methods the Cross-Validation class is expected to provide are (a sketch follows the
list):
Set the number of N-folds
Run N-fold cross-validation on given source data
Return the cross-validation sets in an array data structure
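A compact sketch of a class providing these methods, assuming documents are
identified by file path; round-robin partitioning is used for simplicity, and the
stratification discussed in section 5.3 is omitted:

    using System.Collections.Generic;

    class CvRun
    {
        public List<string> TrainingSet = new List<string>();
        public List<string> TestSet = new List<string>();
    }

    class CrossValidation
    {
        private int noOfFolds = 10;

        public void SetFolds(int n) { noOfFolds = n; }

        // Run i uses fold i as the test set and the remaining folds for training.
        public CvRun[] Run(List<string> documents)
        {
            var folds = new List<string>[noOfFolds];
            for (int i = 0; i < noOfFolds; i++) folds[i] = new List<string>();
            for (int i = 0; i < documents.Count; i++) folds[i % noOfFolds].Add(documents[i]);

            var runs = new CvRun[noOfFolds];
            for (int i = 0; i < noOfFolds; i++)
            {
                runs[i] = new CvRun { TestSet = folds[i] };
                for (int j = 0; j < noOfFolds; j++)
                    if (j != i) runs[i].TrainingSet.AddRange(folds[j]);
            }
            return runs;
        }
    }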
[Diagram: the Cross-Validation component communicates with the Central Manager]
Figure 12. Cross-Validation and Its Collaborating Components
7.4.6 Results Manager
The Results Manager handles the output of the classifier and the repository for that
output. The underlying RDBMS in this project is an Access database, which is used to
cache the data generated by the classifier. The OLEDB component is responsible for
direct communication with the database; this class needs to provide the basic database
functionalities such as read, write, and delete in a generic fashion. All communication
with the OLEDB library, and the data flow to and from the Results Manager, occurs
through the Database Manager object, which manages the OLEDB component.
The green boxes illustrate that the information system does not necessarily have to be
an Access database. The system is designed to be able to store the data by different
means with relative ease, e.g. XML files, SQL Server, etc.
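A minimal sketch of the read path through OLEDB, using the standard .NET
System.Data.OleDb classes and the Jet provider for an Access (.mdb) file; the query and
column are examples based on the Scores table described in section 8:

    using System;
    using System.Data.OleDb;

    static class ScoreReader
    {
        public static void PrintScores(string databasePath)
        {
            string connStr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + databasePath;
            using (var conn = new OleDbConnection(connStr))
            using (var cmd = new OleDbCommand("SELECT Score FROM Scores", conn))
            {
                conn.Open();
                using (OleDbDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())          // iterate over the cached score rows
                        Console.WriteLine(reader["Score"]);
                }
            }
        }
    }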
[Diagram: the Results Manager sits between the Central Manager and interchangeable
storage back ends: the Database Manager with its OLEDB component and Access
database, or alternatively an XML File Manager with XML file(s)]
Figure 13. Results Manager and Its Collaborating Components
7.4.7 Error Handling
Adequate error handling is essential for an end-user application. The display of warnings
and errors should be handled at the higher levels of the system, namely by the Display
Manager, and presented to the user in a reasonable fashion. Errors that occur in the
other classes should be propagated to the Display Manager. All classes apart from the
User Interface and the Display Manager are expected to implement an IErrorRecord
interface. A class that implements this interface guarantees that it has a property which
returns the error message.
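A minimal sketch of this contract; the property name is an assumption based on the
description above:

    using System;

    interface IErrorRecord
    {
        string ErrorMessage { get; }    // last error recorded by the component
    }

    class SampleComponent : IErrorRecord
    {
        private string error = "";

        public string ErrorMessage { get { return error; } }

        public void DoWork()
        {
            try
            {
                // ... component-specific work ...
            }
            catch (Exception e)
            {
                error = e.Message;      // recorded here, surfaced later by the Display Manager
            }
        }
    }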
7.5 Class Diagram
Figure 14 shows a class diagram of the main components of the system discussed above.
[Class diagram omitted: it shows MainForm and the main classes
Controllers::DisplayManager, Controllers::TreeViewNodeManager,
Classifier::CentralManager, Classifier::SuffixTreeManager,
Controllers::SampleSetGenerator, Controllers::Preprocessor,
Controllers::CrossValidation, DataMining::StopWord, Output::DatabaseManager,
Output::OLEDB, and the EMSTreeClassifier library class, together with their members
and associations]
Figure 14. Class Diagram
8 DATABASE
8.1 Entities
All the data in the system is stored in an Access database. The following describes the
organisation of the data that the system will store.
8.1.1 Score Table
When a user requests scoring of a new document or a set of documents, each document
is scored against 126 configurations for each class. The resulting data is cached in the
Scores table.
8.1.2 Source Table
The source table stores the location properties of documents. This includes the physical
pathname of the document and where it is logically located in the display tree.
8.1.3 Configuration Table
This configuration table stores the 126 combinations of scoring methods used in
Pampapathi et al's study. Each configuration consists of a scoring function, a match
normalisation function, and a tree normalisation function.
8.1.4 Score Functions Table
This table contains the names and descriptions of the score functions.
8.1.5 Match Normalisation Functions Table
This table contains the names and descriptions of the match normalisation functions.
8.1.6 Tree Normalisation Functions Table
This table contains the names and descriptions of the tree normalisation functions.
8.1.7 Classification Condition Table
This table stores any classification conditions to be considered when classifying a
document from a particular corpus.
8.1.8 Class Weights Table
This table stores the class weights when classifying documents.
8.1.9 Temporary Max and Min Score Table
This is a temporary table used to cache the maximum and minimum scores for a class,
grouped by document and configuration.
8.2 Views
The following are some of the main views to assist in querying the main tables for data
displayed in the user interface.
8.2.1 Weighted Scores
This view obtains the weighted scores by document and scoring configuration.
8.2.2 Maximum and Minimum Scores
This view obtains the maximum and minimum score by document and scoring
configuration.
8.2.3 Misclassified Documents
This view obtains the misclassified documents and related data.
8.3 Relation Design for the Main Tables
The main table of the database is the Scores table. This table contains the scores for
each document, scored by the different configuration combinations (see the
Implementation section for a description of the scoring configurations). Figure 15 shows
the relationships between the main tables.
[Entity-relationship diagram: the lookup tables tScoreFunction, tMatchNormalisation,
and tTreeNormalisation each relate one-to-many to the Config table (ConfigId, SF, MN,
TN); the Source table (SourceId, node parent path, node path, file path) and the Config
table each relate one-to-many to the Scores table (ScoreId, SourceId, ConfigId, Score
Class, True Class, Score) and to the tempMaxMinWScores table (SourceId, ConfigId,
True Class, MaxOfWScore, MinOfWScore)]
Figure 15. Table Relations
9 IMPLEMENTATION
Due to the large size of the program, this report will not cover all the implementation
details; instead the discussion will focus on the main classes and highlight some specific
implementations. See Appendix B, Class Definitions.
9.1 Main User Interface
The main form of the user interface is divided into four resizable panes, each displaying
a different type of information to the user (see Figure 16):
tvExplorer
rtxtView/sTreeView
lblSTreeDetail/listView
rtxtInfo
The tvExplorer is a Windows Forms TreeView control, which displays the different
corpora available in the system. The information is presented as a hierarchy of nodes,
in the way files and folders are displayed in the left pane of Windows Explorer.
The rtxtView is implemented as a Windows Forms RichTextBox control. When the user
selects a child node in tvExplorer that represents a document, rtxtView displays the
content of the document. The rtxtView also allows users to perform dynamic n-gram
(sub-string) matching on a document (see section 10.3 Dynamic Sub-String Matching).
The sTreeView is implemented as a TreeView control. It shares the same pane as the
rtxtView control and is only made visible on the main form (with the rtxtView becoming
invisible) when the user requests the display of a suffix tree that has been created. At
the same time, the lblSTreeDetail control, implemented as a Windows Forms Label
control, displays a description of the suffix tree currently shown in the sTreeView
control. The listView is a Windows Forms ListView control which provides information
related to the current content of the rtxtView control.
The rtxtInfo is a RichTextBox control and displays a classification summary for a
document.
[Screenshot of the main form showing the four panes: tvExplorer, rtxtView/sTreeView,
lblSTreeDetail/listView, and rtxtInfo]
Figure 16. Main User Interface
The main form is implemented as a .NET class called MainForm. Figure 17 shows the
class members and class interface.
Note that there are other Windows Forms control classes which were implemented to
control the flow of user-system interaction. Section 10, Implementation Specifics,
describes one of them in detail; see Appendix x for all the user interface classes.