This document describes a project to classify web documents using machine learning techniques. It involves three phases:
1. Collecting sets of web documents grouped by topic from DMOZ. The goal is to collect 100 documents across 5 topics with at least 20 documents per topic.
2. Performing feature extraction on the documents by selecting keywords and creating feature vectors representing whether each keyword is present in each document.
3. Applying machine learning algorithms to create models of the training data. These models serve two purposes: evaluating the accuracy of the initial topic structure and automatically classifying new web documents into the existing topics.
Web Document Classification
Ingrid Russell¹, Zdravko Markov, Todd Neller
June 3, 2005
1. Introduction
Along with search engines, topic directories are the most popular sites on the
Web. Topic directories organize web pages in a hierarchical structure (taxonomy,
ontology) according to their content. The purpose of this structuring is twofold: firstly, it
helps web searches focus on the relevant collection of Web documents. The ultimate goal
here is to organize the entire web into a directory, where each web page has its place in
the hierarchy and thus can be easily identified and accessed. The Open Directory Project
(dmoz.org) and About.com are some of the best-known projects in this area. Secondly,
the topic directories can be used to classify web pages or associate them with known
topics. This process is called tagging and can be used to extend the directories
themselves. In fact, some well-known search portals such as Google return with their
responses the topic path of the response, if the response URL has been associated with
some topic found in the Open Directory. Also, when one performs a search, in addition
to a list of URLs one gets a ‘similar pages’ link. These similar pages could be generated
from a system such as dmoz using the pages that the search returned and the dmoz
structure. Dmoz could return, as similar pages to the documents returned, those in a
subdirectory under a certain topic or subtopic.
Some of these topic directories are not very well developed yet. As the Open
Directory is created manually, it cannot capture all URLs; therefore, just a fraction of all
responses are tagged. It would be good to be able to automatically classify pages and
have the system identify where in the dmoz directory structure a page belongs or to be
able to expand and create a new subdirectory/topic tree.
¹ Corresponding author: irussell@hartford.edu, Department of Computer Science, University of Hartford, West Hartford CT 06117.
2. Project overview
The aim of the project is to investigate the process of tagging web pages using the
topic directory structures and apply Machine Learning techniques for automatic tagging.
This would help in filtering out the responses of a search engine or ranking them
according to their relevance to a topic specified by the user.
For example, a keyword search for “Machine Learning” in Google may return²
along with the list of pages found (about 7,760,000) a topic directory path:
Category: Computers > Artificial Intelligence > Machine Learning
The first page found (David W. Aha: Machine Learning Page) belongs to this
category. A search into the Google directories (the Open Directory) shows 210 pages
belonging to the category Machine Learning. David Aha’s page is among them.
Another search, however, “Tom Mitchell textbook”, returns the web page of the
most popular ML book by Tom Mitchell, but does not return the topics path simply
because this page is not listed in the Open Directory under Machine Learning.
Assuming that we know the general topic of the web page in question, say
Artificial Intelligence, and this is a topic in the Open Directory, we can try to find the
closest subtopic to the web page found. This is where Machine Learning comes into play.
Using some text document classification techniques we can classify the new web page to
one of the existing topics. By using the collection of pages available under each topic as
examples, we can create category descriptions (e.g. classification rules, or conditional
probabilities). Then using these descriptions we can classify new web pages. Another
approach would be the nearest neighbor approach, where using some metric over text
documents we find the closest document and assign its category to the new web page.
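To make these two techniques concrete, here is a minimal sketch in Python using scikit-learn; the topic names and document snippets are invented placeholders, not project data. Naive Bayes plays the role of a category description built from conditional probabilities, while a 1-nearest-neighbor classifier under a cosine metric implements the closest-document approach.

```python
# A minimal sketch of the two classification approaches described above.
# All training texts and topic labels here are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

train_texts = [
    "neural networks learn weights from labeled training examples",
    "decision trees split the data on attribute values",
    "quicksort partitions the array around a pivot element",
    "binary search halves the search interval at each step",
]
train_topics = ["Machine Learning", "Machine Learning",
                "Sorting and Searching", "Sorting and Searching"]

# Represent each document as a vector of term weights.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_texts)

# Approach 1: category descriptions as conditional probabilities (Naive Bayes).
nb = MultinomialNB().fit(X, train_topics)

# Approach 2: nearest neighbor under a cosine metric over the same vectors.
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine").fit(X, train_topics)

new_page = ["a tutorial on training classifiers from example documents"]
X_new = vectorizer.transform(new_page)
print(nb.predict(X_new)[0], knn.predict(X_new)[0])
```

Either model, once trained on the collected documents, can then be applied to a new web page to suggest the closest existing topic.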
In this project, we will be working on a very small section of dmoz. Specifically,
we plan to select 5 topics from dmoz and create a system that will automate adding web
pages to that branch of dmoz. The plan is to use machine learning to implement a system
that automates taking a web page and identifying which subtree of dmoz it belongs to.
We will select about 100 web documents whose correct places in dmoz are already known,
train/teach the system to recognize how to classify these web documents, and then use it
to categorize new web documents, i.e., identify which subdirectory of dmoz the new
document should be added to.

² Note that this may not be what you see when you try this query. Web content is
constantly changing, as are the approaches Google and other search engines use to
search the web, so the same query may return different results at different times.
3. Project description
The project is split into three major parts: data collection, feature extraction, and
machine learning. These parts are also phases in the overall process of knowledge
extraction from the web and classification of web documents (tagging). As this process is
interactive and iterative in nature, the phases may be included in a loop structure that
would allow each stage to be revisited so that some feedback from later stages can be
used. The parts are well defined and can be developed separately (e.g. by different teams)
and then put together as components in a semi-automated system or executed manually.
Hereafter we describe the project phases in detail along with the deliverables that the
students need to submit on completion of each stage.
Phase 1 consists of collecting a set of 100 web documents grouped by topic.
These documents will serve as our training set. Phase 2 involves feature extraction and
data preparation. During this phase the web documents will be represented by feature
vectors, which in turn are used to form a training data set for the Machine Learning stage.
Phase 3 is the machine learning phase. Machine learning algorithms are used to create
models of the data sets. These models are used for two purposes. The accuracy of the
initial topic structure is evaluated and secondly, new web documents are classified into
existing topics.
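As a concrete illustration of this representation, the sketch below builds the kind of binary feature vector described here, recording which selected keywords occur in a document. The keyword list is a hypothetical placeholder; the real keywords are chosen during the feature extraction phase itself.

```python
# A minimal sketch of the Phase 2 representation: each document becomes a
# binary vector indicating which of the selected keywords it contains.
import re

keywords = ["learning", "algorithm", "classification", "mpeg", "history"]

def feature_vector(document_text: str) -> list[int]:
    words = set(re.findall(r"[a-z]+", document_text.lower()))
    return [1 if kw in words else 0 for kw in keywords]

print(feature_vector("A short history of classification algorithms"))
# -> [0, 0, 1, 0, 1]  (no stemming, so "algorithms" does not match "algorithm")
```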
3.1 Phase 1 -- Collecting sets of web documents grouped by topic
The purpose of this stage is to collect sets of web documents belonging to
different topics (subject areas). The basic idea is to use a topic directory structure. Such
structures are available from dmoz.org (the Open Directory project), the yahoo directory
(dir.yahoo.com), about.com and many other web sites that provide access to web pages
grouped by topic or subject. These topic structures have to be examined in order to find
several topics (e.g. 5), each of which is well represented by a set of documents (at least
10). For this project you are asked to use dmoz and to identify 20 web documents for
each of the 5 topics. This will result in approximately 100 web documents. These
documents will be used as our training set. It is not necessary that your topics be
computer science related; you may want to select topics that are of interest to you and
that you are familiar with, such as mountain biking. As you decide on topics, note that
dmoz is mature and rich in some areas but still weak in others. You may want to avoid
topics in which dmoz is weak.
Alternative approaches could be extracting web documents manually from the list
of hits returned by a search engine using a general keyword search or collecting web
pages from the web page structure of a large organization (e.g. university).
The outcome of this stage is a collection of several sets of web documents (actual
files stored locally, not just URLs) representing different topics or subjects, where the
following restrictions apply:
a) As these topics will be used for learning and classification experiments at later stages,
they have to form a specific structure (part of the topic hierarchy). It is good to have
topics at different levels of the topic hierarchy and with different distances between them
(a distance between two topics can be defined as the number of predecessors to the first
common parent in the hierarchy). An example of such structure is:
topic1 > topic2 > topic3
topic1 > topic2 > topic4
topic1 > topic5 > topic6
topic1 > topic7 > topic8
topic1 > topic9
The set of topics here is {topic3, topic4, topic6, topic8, topic9}. Also, it would be
interesting to find topics that are subtopics of two different parent topics. An example of this
is:
Top > … > topic2 > topic4
Top > … > topic5 > topic4
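To make the distance definition above concrete, the small sketch below computes it for
two topics given their paths from the root. This is our own illustration; summing the hops
from both topics up to the first common parent is one possible reading of the definition.

def topic_distance(path_a: list, path_b: list) -> int:
    # Paths run from the root down, e.g. ['topic1', 'topic2', 'topic3'].
    # The first common parent is the last element of the common prefix.
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    # Predecessors traversed from each topic up to the first common parent.
    return (len(path_a) - common) + (len(path_b) - common)

# topic_distance(['topic1','topic2','topic3'], ['topic1','topic2','topic4']) == 2
# topic_distance(['topic1','topic2','topic3'], ['topic1','topic5','topic6']) == 4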
As you select your topics, it is important that this structure be used. An example
of a topic structure from dmoz would be:
topic1 > topic2 > topic3
Computers: Artificial Intelligence: Machine Learning
topic1 > topic2 > topic4
Computers: Artificial Intelligence: Agents
topic1 > topic5 > topic6
Computers: Algorithms: Sorting and Searching
topic1 > topic7 > topic8
Computers: Multimedia: MPEG
topic1 > topic9
Computers: History
Use the format of the sample list of documents for the above topics available at:
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/Documents.doc
Note how each link has a reference number in the left column.
b) There must be at least 5 different topics with at least 20 documents in each.
c) Each document should contain a certain minimum amount of text. This may be
measured by the number of words (excluding articles and punctuation marks); for
example, this minimum could be 200 words.
d) Each document should be in HTML format and contain HTML tags such as title,
headings, or font modifiers.
e) Avoid documents that are mostly links to other documents.
f) Each page must have enough keywords that represent the topic.
In a later phase, we will extract from each web document keywords that identify
its topic. These keywords will need to represent the web document well. The
selection of these documents is very important as we will be training our system with
them. If we do not use ‘good’ training data, our system will not work as well. It is thus
important that the documents you select now represent the topic very well. As you
compile these pages, you need to keep in mind the following:
• The documents you select should have at least 5 keywords that identify the topic.
• As mentioned above, each document should have around 200 words. However, it
is acceptable to select a document with fewer words if it has a high concentration
of keywords.
• You have to stop at the first-level web document, i.e., you cannot navigate any
deeper.
• You will run into web documents that have a significant amount of graphics and
not enough text to classify them. You need to avoid such documents. However,
you will see that some of these documents have an ‘introduction’ type link that
fully describes the topic. In this case, you may select the web document pointed
to by such a link to represent that topic.
• Every two pages under the same topic must share 5-10 keywords.
The topics at the non-leaf nodes of the structure (1, 2, 5 and 7) stand for topics
represented by all pages falling into their successor nodes. There is no need to collect
pages for them because all actual pages fall into the leaves. The higher level nodes are
composite topics that conceptually include lower level ones. In fact these composite
topics appear at the learning stage when the pages from their subtopics are supplied
as examples.
It is important to note, however, that to make learning such composite topics
possible, there must be some similarity between their constituent topics. This must be
taken into account when collecting pages. For example, topic3 and topic4 must be
similar (i.e., share keywords), but both must be different from other topics such as topic6
and topic8.
3.1.1 Phase 1 Deliverable
A document listing all web documents by topic in the same format as the sample
example at:
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/documents.doc
3.2 Phase 2: Feature Extraction and Data Preparation
You should now have around 20 web documents for each of the 5 topics for a
total of approximately 100 documents. During this phase the web documents will be
represented by feature vectors, which in turn are used to form a training data set for the
Machine Learning stage. The basic steps to achieve this follow. An alternative to this
approach, a Prolog based approach to feature extraction and data preparation, is included
in Appendix A.
3.2.1 Step 1: Keyword Selection
Select a number of terms (words) whose presence or absence in each document
can be used to describe the document topic. This can be done manually, using some
domain expertise for each topic, or automatically, using a statistical text processing
system; the latter is the approach we will take.
Use a text corpus analysis package that filters and extracts keywords with their
frequency counts. An example of such a system is TextSTAT, freeware software
available from http://www.niederlandistik.fu-berlin.de/textstat/software-en.html. Other
such systems are also available as freeware from http://www.textanalysis.info/.
The process involves entering each document into TextSTAT and sorting in ascending
order all words appearing in each document by their frequency. The goal is to collect 100
keywords that represent the documents.
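For readers who prefer code to a GUI tool, the sketch below produces the same kind of
word frequency CSV that TextSTAT exports. It is a sketch only; the file names and the
tokenization rule are our own assumptions.

import csv
import re
from collections import Counter

def frequency_list(txt_path: str, csv_path: str) -> None:
    # Count word occurrences in one document and write (word, frequency)
    # rows sorted by decreasing frequency, mirroring a TextSTAT export.
    with open(txt_path, encoding='utf-8') as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for word, freq in Counter(words).most_common():
            writer.writerow([word, freq])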
While TextSTAT can take URLs as input, it cannot differentiate between HTML
markup and other text, and for some pages it returns unusable site-related text such as
navigation bar text. As a result, each web document should be copied and pasted into a
txt file, with each file named after the reference number of the corresponding web
document. Files corresponding to a topic should be placed in a folder whose name is
associated with that topic. You should have 5 folders with 20 documents each. For each
of the 20 documents, use TextSTAT to generate a word frequency list; you will thus have
20 files for each topic, for a total of 100 files. All word frequency files should be
exported from TextSTAT as CSV files. Again, use the same file and directory naming
conventions as above. We will refer to this as Data Set I.
The next step involves generating the keyword list and the ARFF file. While you
may do this manually or write your own program to automate it, the steps below describe
the process using a program that has already been created for you. Import the 20 word
frequency CSV files from each of the 5 folders into Excel as a single workbook, with one
worksheet tab per topic. A template, ByPageTemplate.xls, with integrated VBA
applications is provided to automate the process at
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/ByPageTemplate.xls.
You may simply copy and paste from each of the 20 CSV files the words and
their frequencies and put them in the corresponding columns of ByPageTemplate.xls
under the appropriate web document number.
A toolbar provides several menu options that will be useful as you work through the
next steps of generating the keywords, including a script to gray out all instances of
commonly used words in the frequency lists. This filtering eliminates noise, i.e., words
that do not represent the topics. A demo of this process is available at
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/Demo.xls
An option to generate the ARFF file is also included in the toolbar. This will be needed
in a later step.
A list of the 1000 most common words is already imported into
ByPageTemplate.xls. Use the options on the toolbar menu to help you generate the list of
keywords. Your next step is to look through ByPageTemplate.xls and select a total of
approximately 100 keywords that represent all 5 topics. You will need to select
approximately 20 keywords per topic. The keywords need to represent the documents
well. It is important that every two documents under the same topic have 5-10 common
keywords. This process will be done manually, using various strategies including
your knowledge about the domain. Once you finalize your list of keywords, include the
list in the designated column in ByPageTemplate.xls. The list of keywords will help you
in the next step as you generate the ARFF file which will serve as input in the machine
learning phase.
3.2.1.1 Step 1 Deliverable
1. A document listing all 20 keywords for each of the 5 topics.
2. A combined list of all 100 keywords.
3. A description of the strategies used for the selection of these keywords.
3.2.2 Step 2: Feature Extraction
During this phase, you will be creating the vector space model, a 100x100 matrix,
which serves as a representation of the 100 web documents. These documents will serve
as our training and learning set in the machine learning phase.
Using the selected 100 keywords as features (attributes), you will be creating a
feature vector (tuple) for each document with Boolean values corresponding to each
attribute (1 if the keyword occurs in the document, 0 if it does not). We have 100
keywords, which are the attributes used to represent the documents. You will end up with 100 feature
vectors, with 100 elements each. This is the vector space model of the documents. It is a
representation of the 100 documents. There are other ways to get better representation of
these documents, but for now we will use this Boolean representation.
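A minimal sketch of this construction follows (the function names are our own; it treats
a document as a plain string and tests simple word membership):

def boolean_vector(document_text: str, keywords: list) -> list:
    # One element per keyword: 1 if the keyword occurs in the document, else 0.
    words = set(document_text.lower().split())
    return [1 if keyword in words else 0 for keyword in keywords]

# With 100 keywords and 100 documents, the vector space model is then:
#   matrix = [boolean_vector(doc, keywords) for doc in documents]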
A more sophisticated approach to determining the attribute values can also be used.
It is based on term frequencies, scaled in some way to normalize for document length.
Further, the HTML tags may be used to modify the attribute values of terms appearing
within the scope of certain tags (for example, increase the values for titles, headings, and
emphasized terms).
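A sketch of this numeric variant is shown below. The normalization by document length
follows the suggestion above; the idea of counting tag-scoped terms twice is our own
invented illustration, not a prescribed weighting.

def tf_vector(document_words: list, keywords: list) -> list:
    # Term frequency scaled by document length, so long and short
    # documents become comparable.
    n = len(document_words) or 1
    return [document_words.count(keyword) / n for keyword in keywords]

# To exploit HTML structure, words found inside <title> or heading tags
# could simply be appended to document_words a second time before calling
# tf_vector, doubling their weight (an illustrative choice, not a rule).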
Ideally, one would want to prepare several files by using different approaches to
feature extraction. For example, one with Boolean attributes, one with numeric based on
text only, and one with numeric using the html information. Versions of the data sets with
a different number of attributes can also be prepared. A rule of thumb here is that the
number of attributes should be less than the number of examples. The idea of preparing
all those data sets is twofold. By experimenting with different data sets and different
machine learning algorithms the best classification model can be found. By evaluating
all those models, students will understand the importance of various parameters of the
input data for the quality of learning and classification.
3.2.2.1 Step 2 Deliverable
1. A paragraph describing what a feature vector is and another paragraph describing
the vector space model.
2. For each topic, select one web document and create the corresponding feature
vector.
3. A copy of all 100 keywords, the selected web documents, and the resulting 5
feature vectors.
3.2.3 Step 3: Data Preparation
Next, you will need to create a data set in the Attribute-Relation File Format
(ARFF) used by the Weka 3 Data Mining System, which we will use in the next phase.
An ARFF file is a text
file, which defines the attribute types and lists all document feature vectors along with
their class value (the document topic).
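For orientation, a tiny ARFF file of this kind might look as follows (the keyword and
topic names are placeholders, not taken from the project data):

@relation webdocs

@attribute bike {0,1}
@attribute trail {0,1}
@attribute sonata {0,1}
@attribute topic {biking,music}

@data
1,1,0,biking
0,0,1,music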
In the next phase, once we load the ARFF files into Weka, we will use several
learning algorithms implemented in Weka to create models of our data, test these
models, and decide which model is best to use.
The Weka 3 Data Mining System is a free Machine Learning software package
written in Java, available from http://www.cs.waikato.ac.nz/~ml/weka/index.html. Install the Weka
package using the information provided in the Weka software page and familiarize
yourself with its functionality. A readme file for installing and using Weka 3 is available
at
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/WekaReadme.txt
Weka 3 tips and tricks are available at:
http://www.cs.waikato.ac.nz/~ml/weka/tips_and_tricks.html
This is one of the most popular ML systems used for educational purposes. It is
the companion software package of the book Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations [Witten and Frank, 2000]. Chapter 8 of
that book describes the command-line-based version of Weka and is available at
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/Chapter8.pdf.
Read Section 1 of chapter 8. For the GUI version, read Weka’s user guide at
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/ExplorerGuide.pdf
An introduction to Weka is also available at: http://www.oefai.at/~alexsee/WEKA/
Once you have installed Weka and read section 8.1, run some experiments using the data
sets provided with the package (e.g. the weather data).
The links below provide additional information on the ARFF format:
http://www.cs.waikato.ac.nz/~ml/weka/arff.html and
http://www.cs.waikato.ac.nz/~ml/old/workbench/arff.html
Steps (1) and (2) above are part of the so-called vector space model, which is well
known in the area of Information Retrieval (IR). For more details, see [Chakrabarti,
2002], Chapter 3 or any text on IR.
The next step is to generate the ARFF file for all 100 documents. You may write
your own program to generate the ARFF file or create it manually. Alternatively, a
program in ByPageTemplate automates the process: select the corresponding option on
the toolbar to generate the ARFF file.
This ARFF file will serve as input to Weka in the machine learning phase. A demo of
this process is available at
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/Demo.xls
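If you do write your own generator, the following sketch shows the idea (the argument
names and file layout are hypothetical; adapt them to your data):

def write_arff(path: str, keywords: list, topics: list,
               vectors: list, labels: list) -> None:
    # vectors holds one 0/1 list per document; labels holds each
    # document's topic, in the same order as vectors.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('@relation webdocs\n\n')
        for keyword in keywords:
            f.write('@attribute ' + keyword + ' {0,1}\n')
        f.write('@attribute topic {' + ','.join(topics) + '}\n\n@data\n')
        for vector, label in zip(vectors, labels):
            f.write(','.join(str(v) for v in vector) + ',' + label + '\n')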
3.2.3.1 Step 3 Deliverable
1. The ARFF data file containing the feature vectors for all web documents collected
during Phase I.
2. A description of the ARFF data file including:
• An explanation of the correspondence between the 100 keywords and the
attribute declaration part of the ARFF file (the lines beginning with
@attribute).
• An explanation of the data rows (the portion after @data). For example,
pick a tuple and explain what the 0’s and 1’s mean for the document that
this tuple represents.
3.3 Phase 3: Machine Learning Phase
At this stage, Machine Learning algorithms are used to create models of the data sets.
These models are then used for two purposes: first, to evaluate the accuracy of the initial
topic structure, and second, to classify new web documents into existing topics. For
both purposes we use the Weka 3 Data Mining System. The steps involved are:
1. Preprocessing of the web document data: Load the ARFF files created at project
stage 2, verify their consistency and get some statistics by using the preprocess
panel. Screenshots from Weka are available at
http://www.cs.waikato.ac.nz/~ml/weka/gui_explorer.html.
A sample Weka output with descriptions of various terms is available at
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/J48Output.doc
2. Using Weka’s decision tree algorithm (J48), examine the decision tree
generated from the data set. Which are the most important terms (the terms
appearing at the top of the tree)? Check also the classification accuracy and the
confusion matrix obtained with 10-fold cross validation and find out which topic
is best represented by the decision tree.
3. Repeat the above steps using the Naïve Bayes algorithm and compare its
classification accuracy and confusion matrices obtained with 10-fold cross
validation with the ones produced by the decision tree. Which ones are better?
Why?
4. New web document classification: Collect web documents from the same subject
areas (topics), but not belonging to the original set of documents prepared in
project stage 1. Also collect documents from different topics.
Use a web crawler to collect these new documents. An example of a web crawler
is available at WebSPHINX: A Personal, Customizable Web Crawler. Download
it and try it (simply click on the jar file link
http://www-2.cs.cmu.edu/~rcm/websphinx/websphinx.jar or see the explanations in the web
page). Experiment with varying the following parameters: crawl the subtree, the
server, or the Web; depth-first or breadth-first; and different limits (number of
threads, page size, timeout). See how the dynamics of crawling change by
inspecting the web page graph. Once you have collected the new documents,
apply feature extraction and create an ARFF test file with one data row for each
document. Then, using Weka’s test set option, classify the new documents.
Compare each document’s original topic with the one predicted by Weka. For the
classification experiments, use the
guidelines provided in
http://uhaweb.hartford.edu/compsci/ccli/FinalVersion/DocClassification/Resources/DMEx.doc
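For reference, a command-line run of the kind described in Chapter 8 looks roughly like
this, where -t names the training file from Phase 2 and -T the test file built from the newly
collected documents (the file names are illustrative, and the package path of J48 differs
between Weka versions, so check your installation):

java weka.classifiers.trees.J48 -t webdocs.arff -T newdocs.arff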
3.3.1 Phase 3 Deliverable
1. Explain the web crawler algorithm in terms of search by answering the following
questions:
o The Web is (1) a tree, (2) a directed acyclic graph, (3) a directed graph, or
(4) a graph, where:
the nodes are represented by ...
the edges are represented by ...
o Which search algorithms are used by Web Crawlers and why?
o Can a crawler go in a loop?
o How does the choice of the part of the web to be crawled (subtree/the
server/the Web) affect the search algorithm?
o How is multi-threading used to improve the efficiency of crawling
algorithms?
o What happens when page size or timeout limits are reached?
2. Explain the decision tree learning algorithm (Weka’s J48) in terms of state space
search by answering the following questions:
• What is the initial state (decision tree)?
• How are the state transitions implemented?
• What is the final state?
• Which search algorithm (uninformed or informed, depth/breadth/best-first
etc.) is used?
• What is the evaluation function?
• What does tree pruning mean with respect to the search?
3. This stage of the project requires writing a report on the experiments performed.
The report should include a detailed description of the experiments (input data,
Weka outputs) and answers to the questions above. Note that Weka does not
output a labeled web document; instead, it prints the classification accuracy for
the test set (a new web document), which is simply a number (a percentage). This
number must be used to explain how the new document is classified. The report should also
include such interpretation and analysis of the results with respect to the original
problem stated in the project.
4. Looking back at the process, describe what changes you think could improve the
classification.
4. Extra Credit Possibilities
1. Repeat Step 3 of Phase 3 using the Nearest Neighbor (IBk) algorithm. Compare
its classification accuracy and confusion matrix obtained with 10-fold cross
validation with those produced by the decision tree. Which of the three models
explored here is best? Why?
2. Write your own web crawler to fetch a web document to be classified by the
system (a minimal sketch follows this list). An algorithm for this is available in the
“Mining the Web” book listed below. You may restrict your collection to URLs
only, plus page titles, so that some page content analysis is still possible. You
should introduce parameters to control the search: for example, depth-first or
breadth-first traversal, with parameters to bound the search such as depth or
breadth limits, the number of pages to retrieve, a timeout for each page or for the
whole run, size limits for the pages, etc.
3. Customize or add new and significant features to WebSPHINX. You should
discuss with me the new features before you start working on this.
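As a starting point for extra credit item 2 (the sketch promised above), here is a minimal
breadth-first crawler using only the Python standard library. All names are our own, and
it is deliberately far simpler than WebSPHINX.

import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    # Collects the href targets of all anchor tags on a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(seed: str, max_pages: int = 20, timeout: float = 5.0) -> dict:
    # Breadth-first crawl bounded by a page count and per-page timeout;
    # returns a {url: html} map of the fetched pages.
    seen, pages, queue = {seed}, {}, deque([seed])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                html = resp.read().decode('utf-8', errors='replace')
        except Exception:
            continue  # skip pages that time out or fail to load
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith('http') and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

Switching the deque’s popleft to pop would turn this into a depth-first crawl; the depth,
size, and time limits mentioned above can be added as further parameters.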
References and Readings
[Chakrabarti, 2002] Soumen Chakrabarti, Mining the Web - Discovering Knowledge
from Hypertext Data, Morgan Kaufmann Publishers, 2002.
[Mitchell, 1997] Tom M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[Witten and Frank, 2000] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Web->KB project: http://www-2.cs.cmu.edu/~webkb/
Appendix A
Prolog Based Approach to Feature Extraction and Data Preparation
(Section 3.2)
3.2 Phase 2: Feature Extraction and Data Preparation
During this phase the web documents will be represented by feature vectors, which in
turn are used to form a training data set for the Machine Learning stage. We provide a
Prolog program that can do all the steps in this process and generate a data file to be used
by the Weka ML system. The following components are needed for performing this:
• SWI-Prolog. Use the stable versions and the self-installing executable for Windows
95/98/ME/NT/2000/XP. Available at http://www.swi-prolog.org/
• Quick Introduction to Prolog available at
http://www.cs.ccsu.edu/~markov/ccsu_courses/prolog.txt
• Other Prolog Tutorials (optional)
o A Prolog Tutorial by J.R. Fisher
(http://www.csupomona.edu/~jrfisher/www/prolog_tutorial/contents.html)
o More tutorials: http://www.swi-prolog.org/www.html
• Prolog program textmine.pl available at
http://www.cs.ccsu.edu/~markov/ccsu_courses/mlprograms/textmine.pl
• A data set webdata.zip used to illustrate the use of textmine.pl available at
http://www.cs.ccsu.edu/~markov/ccsu_courses/mlprograms/webdata.zip
The basic steps to achieve this follow. We use the data sample provided in the
webdata.zip file as an illustrative example of this process.
3.2.1 Step 1: Keyword Selection
The webdata.zip archive contains 20 text files generated from 20 web pages collected
from the web site of the CCSU school of Art and Sciences. For convenience we put the
file names in a list, and the list in a file called files.pl (also available from the archive).
The contents of files.pl are the following:
files([ 'Anthropology.txt',
'Art.txt',
'Biology.txt',
'Chemistry.txt',
'Communication.txt',
'Computer.txt',
'Justice.txt',
'Economics.txt',
'English.txt',
'Geography.txt',
'History.txt',
'Math.txt',
'Languages.txt',
'Music.txt',
'Philosophy.txt',
'Physics.txt',
'Political.txt',
'Psychology.txt',
'Sociology.txt',
'Theatre.txt' ]).
label( [
art - [ 'Art.txt',
'Justice.txt',
'English.txt',
'History.txt',
'Languages.txt',
'Music.txt',
'Philosophy.txt',
'Political.txt',
'Theatre.txt' ],
sci - ['Anthropology.txt',
'Biology.txt',
'Chemistry.txt',
'Communication.txt',
'Computer.txt',
'Math.txt',
'Physics.txt',
'Geography.txt',
'Economics.txt',
'Psychology.txt',
'Sociology.txt' ]
]).
The first list (files) is a catalog of all file names and the second one (label) groups the
files (documents) in two classes (two sublists) – art and sci.
After installing and running SWI-Prolog, we have to load textmine.pl and files.pl into the
Prolog database with the following queries:
?- [files].
?- [textmine].
Then the following query generates a list of the 20 most frequent terms that appear in the
corpus of all 20 documents. Note that the actual text files (listed in files) should be stored
in the same folder where textmine.pl and files.pl are located.
?- files(F),tf(F,20,T),write(T).
[department, study, students, ba, website, location, programs, 832, phone, chair, program,
science, hall, faculty, offers, music, courses, research, studies, sociology]
F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt',
'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
Note that we use write(T) to print the whole list, because Prolog prints just the first 9
elements in its standard answer.
Then we may extend the query to generate the inverse document frequency (IDF) list.
First we have to generate a list of terms and then we pass them to the procedure that
generates the IDF list. For example:
?- files(F),tf(F,50,T),idf(F,T,20,IDF),write(IDF).
[3.04452-music, 3.04452-sociology, 3.04452-anthropology, 3.04452-theatre, 3.04452-
criminal, 3.04452-justice, 3.04452-communication, 3.04452-chemistry, 2.35138-physics,
2.35138-political, 1.94591-history, 1.94591-sciences, 1.65823-american, 1.65823-social,
1.65823-international, 1.65823-public, 1.43508-computer, 1.43508-offered, 1.25276-ma,
1.25276-work]
F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt',
'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
IDF = [3.04452-music, 3.04452-sociology, 3.04452-anthropology, 3.04452-theatre,
3.04452-criminal, 3.04452-justice, 3.04452-communication, 3.04452-chemistry, ... -...|...]
Note that the IDF list is ordered by decreasing IDF values (shown before each term).
As the IDF value is usually large for rare terms, the IDF list contains the 20 least
frequent of the 50 terms generated by tf(F,50,T).
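As a cross-check on these numbers (our own inference; the formula is not stated in the
program’s documentation): the printed values are consistent with
IDF(t) = ln((N + 1) / df(t)), where N = 20 is the number of documents and df(t) is the
number of documents containing t. For example, ln(21/1) ≈ 3.04452 for terms occurring
in exactly one document, and ln(21/3) ≈ 1.94591 for terms occurring in three.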
3.2.2 Step 2: Feature Extraction
At this step we add the document labels and generate document vectors with the
following query:
?- files(F),tf(F,50,T),idf(F,T,20,IDF),label(L),class(F,L,FL),vectors(FL,IDF,V),ppl(V).
Here, ppl(V) prints the vectors with numeric values (the output is skipped for brevity).
We may also generate binary vectors by replacing vectors with binvectors with the
following query:
?-
files(F),tf(F,50,T),idf(F,T,20,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,V),ppl(V).
The above two queries just show the vectors and can be used to visually inspect the
results of feature extraction. The idea is that the two parameters, the sizes of the TF and
IDF lists (50 and 20 here), have to be adjusted so that the vectors do not have columns or
rows with all the same value or all 0’s.
3.2.3 Step 3: Data Preparation
After we get a good set of vectors from the previous step we may generate the ARFF data
files for Weka just by adding the arff procedure at the end of the query (or replacing ppl
with it, if we don’t want to see the output):
?- files(F),tf(F,50,T),idf(F,T,20,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,V),
arff(IDF,V,'wekadata.arff').
This query generates binary vectors. By using vectors instead of binvectors we get
numeric vectors (using the IDF values).
The file 'wekadata.arff' is in the proper format to be loaded in Weka and used for
classification.
More information about using the textmine.pl program for feature extraction,
classification and clustering is available in the following documents:
• http://www.cs.ccsu.edu/~markov/ccsu_courses/mlprograms/WebMiningLab1.txt
• http://www.cs.ccsu.edu/~markov/ccsu_courses/mlprograms/WebMiningLab2.txt
• http://www.cs.ccsu.edu/~markov/ccsu_courses/mlprograms/WebMiningLab3.txt
• http://www.cs.ccsu.edu/~markov/ccsu_courses/mlprograms/WebMiningLab4.txt