This document describes a study on text classification in the deep web. It discusses using classification methods such as the Classifier Naive Bayesian (CNB) and the Classifier K-Nearest Neighbor (CK-NN) to classify web documents. The document outlines preprocessing steps such as removing stop words and weighting terms. It also provides details on implementing the CNB and CK-NN classifiers to classify Arabic documents into categories such as economic, cultural, and political, and compares the results of the two classifiers to select the more accurate one.
This document discusses the challenges of designing a web crawler that can scale to billions of pages. It presents algorithms developed by the authors to address issues related to URL uniqueness checking, politeness enforcement, and spam avoidance. The algorithms were tested in a 41-day crawl that successfully downloaded over 6 billion pages from over 117 million hosts at an average rate of 319 mb/s using a single server.
These slides accompany the LDOW2010 paper "An HTTP-Based Versioning Mechanism for Linked Data". The paper is available at http://arxiv.org/abs/1003.3661. It describes how the combination of the Memento (Time Travel for the Web) framework and a resource versioning approach that is aligned both with the Cool URI notion and with Tim Berners-Lee's concept of Time-Generic and Time-Specific resources yields the ability to collect current and prior versions of a resource merely by using "follow your nose" HTTP navigation. The proposed combination further extends the value of a URI and allows the emergence of a novel realm of temporal Web applications.
The document discusses the differences between the deep web and surface web. The deep web refers to content that is not indexed by typical search engines, as it is stored in dynamic databases rather than static web pages. It contains over 500 times more information than the surface web. Some key differences are that deep web content is accessed through direct database queries rather than URLs, and search results are generated dynamically rather than having fixed URLs. Specialized search engines are needed to access the deep web.
An Introduction to NOSQL, Graph Databases and Neo4j (Debanjan Mahata)
Neo4j is a graph database that stores data in nodes and relationships. It allows for efficient querying of connected data through graph traversals. Key aspects include nodes that can contain properties, relationships that connect nodes and also contain properties, and the ability to navigate the graph through traversals. Neo4j provides APIs for common graph operations like creating and removing nodes/relationships, running traversals, and managing transactions. It is well suited for domains that involve connected, semi-structured data like social networks.
This document outlines an approach for using web content mining techniques for Arabic text classification. It begins with introductions to web mining and its subfields like web content mining. It discusses related work on text classification in different languages, including a few prior studies on Arabic text classification. The document then describes building an Arabic text corpus from online newspapers and preprocessing steps. It proposes using machine learning algorithms like Naive Bayes and K-Nearest Neighbor for Arabic text classification and evaluating accuracy through cross-validation. The full document provides details on the proposed method and evaluation plan to classify Arabic texts using web content mining techniques.
This document discusses making semantic technologies more accessible to non-experts by combining data semantics, mathematical theories of declarative knowledge, and application semantics expressed in English. It proposes a browser-based system for writing and running applications using business rules in open vocabulary English. Examples demonstrate resolving semantic differences between retailer and manufacturer ontology data and answering questions about oil industry SQL data, both through English explanations. The goal is to bridge the gap between people and machines through natural language.
Technologies For Appraising and Managing Electronic Records (pbajcsy)
This document summarizes technologies for appraising and managing electronic records, including discovering relationships among digital file collections and comparing document versions. It presents three technologies: file2learn to discover relationships between files based on metadata extraction and analysis; doc2learn for comprehensive document comparisons; and Polyglot for automated file format conversion and quality assessment.
The document introduces MongoDB as an open source, high performance database that is a popular NoSQL option. It discusses how MongoDB stores data as JSON-like documents, supports dynamic schemas, and scales horizontally across commodity servers. MongoDB is seen as a good alternative to SQL databases for applications dealing with large volumes of diverse data that need to scale.
The document discusses methods for improving web navigation efficiency through reconciling website structure based on user browsing patterns. It involves mining the website structure and user logs to determine browsing behaviors. Efficiency is calculated as the shortest path from the start page to the target page divided by the operating cost, defined as the number of pages visited. The approach was tested on a website and was able to reorganize the structure based on user navigation analysis from logs to increase browsing efficiency.
Web mining is the application of data mining techniques to extract knowledge from web data, including web content, structure, and usage data. Web content mining analyzes text, images, and other unstructured data on web pages using natural language processing and information retrieval. Web structure mining examines the hyperlinks between pages to discover relationships. Web usage mining applies data mining methods to server logs and other web data to discover patterns of user behavior on websites. Text mining aims to extract useful information from unstructured text documents using techniques like summarization, information extraction, categorization, and sentiment analysis.
This document provides information about a database management systems (DBMS) course offered by the Department of Computer Science & Engineering at Cambridge University. The course objectives are to provide a strong foundation in database concepts, practice SQL programming, demonstrate transactions and concurrency, and design database applications. Course outcomes include identifying and defining database objects, using SQL, designing simple databases, and developing applications. The course modules cover topics such as conceptual modeling, the relational model, SQL, normalization, transactions, and recovery protocols. Required textbooks are also listed.
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Chris Freeland
The document summarizes a project by the Missouri Botanical Garden Archives to digitize and provide public access to historical images through a searchable online database. Key goals were to test best practices for digitization, build infrastructure for ongoing digitization, and increase public awareness of the Garden's role through improved access to the digital images. Dublin Core metadata was used to describe over 1,000 images scanned from glass plates, slides and photographs. The database was designed to be flexible, easy to use and publicly available online.
This document summarizes a student's research report on improving data privacy for cloud computing documents. The student proposes a method using a red-black tree and partial document encryption to update encrypted documents efficiently. An experiment tests the encryption and processing times of modifying documents of different sizes using 3DES and AES algorithms. The method shows improvements over fully re-encrypting documents for small modifications. Future work aims to further optimize efficiency and develop a collaboration service.
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more detail or to submit your article, please visit www.ijera.com
The document discusses web mining techniques for web personalization. It defines web mining as extracting useful information from web data, including web usage mining, web content mining, and web structure mining. Web usage mining involves data gathering, preparation, pattern discovery, analysis, visualization and application. Web content mining extracts information from web document contents. The document then discusses how these web mining techniques can be applied to web personalization by learning about user interactions and interests to customize web page content and presentations.
1) Ontologies play a key role in semantic digital libraries by supporting bibliographic descriptions, extensible resource structures, and community-aware features.
2) Semantic digital libraries integrate information from various metadata sources and provide interoperability between systems using semantics.
3) Key ontologies for digital libraries include bibliographic ontologies, structure description ontologies, and community-aware ontologies that model folksonomies and social semantic collaborative filtering.
This document proposes using feedback and k-means clustering to refine web data. The k-means clustering algorithm is used to initially cluster web usage data. Then, a genetic algorithm is applied to the clusters to improve their quality based on feedback from users on the usefulness of different web pages. This combined approach of initial k-means clustering followed by genetic algorithm refinement aims to better organize web data according to user preferences and eliminate unwanted websites.
The document discusses various data and technology concepts relevant for libraries, including blockchain, data lakes, FRBR linked data models, and using a "Brainz" model. Blockchain could be used for digital first sale programs, patron data privacy, and other applications. Data lakes store raw data for archival, statistics and searches. Data lakes can provide data as a service through subscriptions and reports. FRBR models are reexamined to allow for non-canonical work identifiers. The document suggests libraries could implement a "Koha brainz" model inspired by Music Brainz and Book Brainz for organizing metadata.
The document summarizes a technical seminar on web-based information retrieval systems. It discusses information retrieval architecture and approaches, including syntactical, statistical, and semantic methods. It also covers web search analysis techniques like web structure analysis, content analysis, and usage analysis. The document outlines the process of web crawling and types of crawlers. It discusses challenges of web structure, crawling and indexing, and searching. Finally, it concludes that as unstructured online information grows, information retrieval techniques must continue to improve to leverage this data.
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler (IRJET Journal)
This document proposes a focused semantic web crawler to efficiently access valuable and relevant deep web content in two stages. The first stage fetches relevant websites, while the second performs a deep search within sites using cosine similarity to rank pages. Deep web content, estimated at over 500 times the size of the surface web, is difficult for search engines to index as it is dynamic. The proposed crawler aims to address this using adaptive learning and storing patterns to become more efficient at locating deep web information.
Produce and consume_linked_data_with_drupal (STIinnsbruck)
This document discusses a set of Drupal modules that integrate Drupal sites into the web of linked data by:
1. Automatically generating a site vocabulary in RDFS/OWL from Drupal content types and fields.
2. Mapping the generated site vocabulary to existing public vocabularies.
3. Providing SPARQL querying of the RDF data through an endpoint.
4. Lazily loading external RDF data through SPARQL queries.
This document discusses building a software tool to archive websites using web crawling and blockchain technology. It proposes a system that crawls websites, stores web page content and metadata in WARC files, and records this information in a blockchain database with two layers - a domain blockchain to store domain information and a web content blockchain to store WARC files. This approach aims to provide a consistent and secure system for archiving websites while allowing users to monitor and analyze archived web content. The document reviews related work on web archiving and outlines the proposed system architecture and implementation requirements.
Babouk: Focused Web Crawling for Corpus Compilation and Automatic Terminology... (Christophe Tricot)
The use of the World Wide Web as a free source for large linguistic resources is a well-established idea. Such resources are keystones to domains such as lexicon-based categorization, information retrieval, machine translation and information extraction. In this paper, we present an industrial focused web crawler for the automatic compilation of specialized corpora from the web. This application, created within the framework of the TTC project, is used daily by several linguists to bootstrap large thematic corpora which are then used to automatically generate bilingual terminologies.
Similarity based Dynamic Web Data Extraction and Integration System from Sear... (IDES Editor)
There is an explosive growth of information in the World Wide Web, thus posing a challenge to Web users to extract essential knowledge from the Web. Search engines help us to narrow down the search in the form of Search Engine Result Pages (SERP). Web Content Mining is one of the techniques that help users to extract useful information from these SERPs. In this paper, we propose two similarity based mechanisms: WDES, to extract desired SERPs and store them in the local depository for offline browsing, and WDICS, to integrate the requested contents and enable the user to perform the intended analysis and extract the desired information. Our experimental results show that WDES and WDICS outperform DEPTA [1] in terms of Precision and Recall.
Embedding Services: Linking from Google Scholar to Discover (CISTI ICIST)
The document discusses a project to link Google Scholar to the Canada Institute for Scientific and Technical Information (CISTI) collection and services. The goals of the project were to make CISTI's collection more accessible worldwide, meet researchers where they work by collaborating with Google Scholar, and provide access to both licensed and open access content. Using agile project management methods, the project built components like a crawler, web services, and XML files to create links between Google Scholar search results and relevant CISTI resources. Over 15.7 million article-level XML records were produced.
Digital content management involves the administration of digital content throughout its lifecycle from creation to permanent storage or deletion. A key part of digital content management is digital rights management (DRM) which uses technologies like fingerprinting, watermarking, and digital certificates to restrict the use and sharing of digital content and protect the intellectual property rights of content creators. The Digital Object Identifier (DOI) system is also important for digital content as it provides a persistent way to uniquely identify digital objects online.
Applying web mining application for user behavior understanding (Zakaria Zubi)
This document discusses applying web mining techniques to understand user behavior by analyzing web server log files. It describes the phases of web usage mining as including data preprocessing, pattern discovery, and pattern analysis. Data preprocessing involves cleaning the log files, identifying page views, users, and sessions. Pattern discovery applies techniques like association rule mining and classification to find patterns in user behavior. The results section shows applying association rule mining to a transactional database of user sessions to find rules of user behavior. The conclusion emphasizes that web logs contain valuable information about user behavior and different data mining methods can be used to analyze the data.
Knowledge Discovery Query Language (KDQL) (Zakaria Zubi)
The document discusses Knowledge Discovery Query Language (KDQL), a proposed query language for interacting with i-extended databases in the knowledge discovery process. KDQL is designed to handle data mining rules and retrieve association rules from i-extended databases. The key points are:
1) KDQL is based on SQL and is intended to support tasks like association rule mining within the ODBC_KDD(2) model for knowledge discovery.
2) It can be used to query i-extended databases, which contain both data and discovered patterns.
3) The KDQL RULES operator allows users to specify data mining tasks like finding association rules that satisfy certain frequency and confidence thresholds.
Knowledge Discovery in Remote Access Databases Zakaria Zubi
This document provides an overview of the thesis which investigates knowledge discovery in remote access databases. The thesis contains three parts:
Part 1 introduces knowledge discovery in databases (KDD) and data mining (DM), and defines the goal of the thesis work.
Part 2 discusses remote access KDD models, the logical foundation of data mining, mining discovered association rules, and data mining query languages.
Part 3 proposes the Knowledge Discovery Query Language (KDQL) for mining association rules from databases and visualizing results. It also discusses I-extended databases and the implementation of KDQL.
The thesis aims to develop methods for remote knowledge discovery using query languages and by extending databases to include generalized patterns discovered through data mining.
(1) The document discusses I-Extended Databases, which contain generalizations about data in addition to the data. (2) I-Extended Databases can be used in the knowledge discovery process by modeling steps as queries. (3) Patterns discovered in data mining can be represented in I-Extended Databases based on their frequency, confidence, and support values.
Using Data Mining Techniques to Analyze Crime Pattern (Zakaria Zubi)
Our proposed model will be able to extract crime patterns by using association rule mining and clustering to classify crime records on the basis of the values of crime attributes.
COMPARISON OF ROUTING PROTOCOLS FOR AD HOC WIRELESS NETWORK WITH MEDICAL DATA Zakaria Zubi
An ad hoc wireless network operates without any central controlling authority; it is a collection of mobile nodes that are dynamically and arbitrarily located in such a manner that the interconnections between nodes are capable of changing on a continual basis, so nodes cooperate to route a packet.
The purpose of the routing protocols is to cope with rapid changes of the topology in such a way that intermediate nodes can act as routers to forward packets on behalf of the communicating pair.
This document compares routing protocols for ad hoc wireless networks used for transmitting medical data. It evaluates AODV, OLSR, and TORA protocols in OPNET based on throughput and delay. The simulation models a 1000m x 1000m indoor hospital environment with
A Comparative Study of Data Mining Methods to Analyzing Libyan National Crime... (Zakaria Zubi)
Our proposed model will be able to extract crime patterns by using association rule mining and clustering to classify crime records on the basis of the values of crime attributes.
Applying web mining application for user behavior understanding (Zakaria Zubi)
This document discusses applying web mining techniques to understand user behavior by analyzing server log files. It describes how web usage mining involves three phases: data preprocessing, pattern discovery, and pattern analysis. In data preprocessing, log files are cleaned and parsed to identify users, sessions, and page views. Pattern discovery applies techniques like association rule mining and classification to find relationships between frequently accessed page types and predict future page views. Pattern analysis validates and interprets the discovered patterns to model user behavior and create visualizations. The document provides an example of using association rule mining on a transactional database of user sessions to find patterns in user behavior.
This document discusses text mining of documents in electronic data interchange (EDI) environments. It describes how EDI formats can be transformed and stored in databases, then text mining techniques like clustering can be applied. Specifically, it uses k-means clustering with Euclidean distance measures on a dataset of 2000 EDI documents categorized into 7 groups related to banking transactions. The goal is to discover patterns in the text that can help applications like identifying customer demographics and buying behaviors.
This document discusses using data mining techniques like neural networks and association rule mining to classify chest x-rays as normal or abnormal to aid in early diagnosis of lung cancer. The objectives are to classify 300 chest x-ray images and help physicians make important diagnostic decisions. The data mining tasks of data preprocessing, feature extraction, and rule generation are described. Classification methods like neural networks and association rule mining will be used to analyze the medical image data.
Ibtc dwt hybrid coding of digital images (Zakaria Zubi)
This document proposes a hybrid IBTC-DWT encoding scheme that combines the simple computation and edge preservation of interpolative block truncation coding (IBTC) with the high compression ratio of discrete wavelet transform (DWT). Simulation results showed that the proposed algorithm achieved better performance than IBTC-DCT in terms of compression ratio, bit rate, and reconstruction quality at low bit rates. The hybrid approach reduces computational complexity by applying DWT to the smaller sub-images produced by IBTC.
Information communication technology in libya for educational purposes (Zakaria Zubi)
This document discusses the use of information and communication technology (ICT) in education in Libya. It outlines several challenges, including high costs that limit access, a lack of necessary training skills, and limited ICT infrastructure. It describes UNESCO and Libya signing an agreement to improve ICT capacity building through establishing networks and digital libraries. Current challenges facing Libya's education system are also summarized, including improving quality, strengthening research, and increasing ICT use. The conclusion emphasizes the need to adopt ICT to improve teaching and learning, provide appropriate training, and ensure development of awareness and motivation regarding ICT in education.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/how-axelera-ai-uses-digital-compute-in-memory-to-deliver-fast-and-energy-efficient-computer-vision-a-presentation-from-axelera-ai/
Bram Verhoef, Head of Machine Learning at Axelera AI, presents the “How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-efficient Computer Vision” tutorial at the May 2024 Embedded Vision Summit.
As artificial intelligence inference transitions from cloud environments to edge locations, computer vision applications achieve heightened responsiveness, reliability and privacy. This migration, however, introduces the challenge of operating within the stringent confines of resource constraints typical at the edge, including small form factors, low energy budgets and diminished memory and computational capacities. Axelera AI addresses these challenges through an innovative approach of performing digital computations within memory itself. This technique facilitates the realization of high-performance, energy-efficient and cost-effective computer vision capabilities at the thin and thick edge, extending the frontier of what is achievable with current technologies.
In this presentation, Verhoef unveils his company’s pioneering chip technology and demonstrates its capacity to deliver exceptional frames-per-second performance across a range of standard computer vision networks typical of applications in security, surveillance and the industrial sector. This shows that advanced computer vision can be accessible and efficient, even at the very edge of our technological ecosystem.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready and for which client coverage is growing and scaling and performance aspects are life and death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, firstly, we will analyze scaling approaches and then select the proper ones for our system.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Deep Web mining
1. Text Classification in Deep Web Mining
Presented by:
Zakaria Suliman Zubi
Associate Professor
Computer Science Department
Faculty of Science
Sirte University
Sirte, Libya
1
2. Contents
• Abstract.
• Introduction.
• Deep Content Mining.
• Modeling the Web Documents.
• Web Documents Classification Methods.
• Preprocessing phases.
• Classifier Naive Bayesian (CNB).
• Classifier K-Nearest Neighbor (CK-NN).
• Implementation.
• Results and Discussion.
• Conclusion.
• References.
2
3. Abstract
• The World Wide Web is a rich source of knowledge that can be useful to many applications.
– Source?
• Billions of web pages and billions of visitors and contributors.
– What knowledge?
• e.g., the hyperlink structure and a variety of languages.
– Purpose?
• To improve users' effectiveness in searching for information on the web.
• Decision-making support or business management.
3
4. Continue
• Web's Characteristics:
– Large size
– Unstructured
– Different data types: text, images, hyperlinks and user usage information
– Dynamic content
– Time dimension
– Multilingual (i.e. Latin and non-Latin languages)
• Data Mining (DM) is a significant subfield of this area.
• Classification methods such as the Classifier K-Nearest Neighbor (CK-NN) and the Classifier Naive Bayesian (CNB) are used.
• The various activities and efforts in this area are referred to as Web Mining.
4
5. Contents
• Abstract.
• Introduction.
• Deep Content Mining.
• Modeling the Web Documents.
• Web Documents Classification Methods.
• Preprocessing phases.
• Classifier Naive Bayesian (CNB).
• Classifier K-Nearest Neighbor (CK-NN).
• Implementation.
• Results and Discussion.
• Conclusion.
• References.
5
6. Introduction
The Internet is probably the world's biggest database, where data is available through easily accessible techniques.
Data is held in various forms: text, multimedia, databases.
Web pages follow the HTML standard (or another markup-language family member), which gives them a degree of structure, but not enough to use them easily in data mining.
Web mining – the application of data mining techniques to extract knowledge from Web content, structure, and usage.
The deep web, also called the hidden web, invisible web or invisible Internet, refers to the lower layers of the global network.
The easiest way is to treat deep web mining as a part of data mining in which web resources are explored. Web mining is commonly divided into three areas:
6
7. Introduction
Web mining – the application of data mining techniques to extract knowledge from Web content, structure, and usage. It is divided into Web Content Mining, Web Structure Mining and Web Usage Mining:
• Web Content Mining (text, image, audio, video, structured records) is the closest to "classic" data mining; WCM mostly operates on text, since treating information on the Internet as text is the generally common approach.
• Web Structure Mining (hyperlinks, document structure) aims to use the nature of the Internet's connection structure, as the Web is a bunch of documents connected with links.
• Web Usage Mining (web server logs, application level logs, application server logs) looks for useful patterns in logs and documents containing the history of the user's activity.
7
9. Deep Web Content Mining
• "Deep Web Content Mining is the process of extracting
useful information from the contents of Web
documents. It may consist of text, images, audio,
video, or structured records such as lists and tables."
• "Deep Web Content Mining refers to the overall
process of discovering potentially useful and
previously unknown information or knowledge from
the Web data."
9
11. Modeling the Web Documents
• We represent the Web data in a binary format where all
of the keywords are derived from the schema.
• If a keyword is in a frequent schema, a 1 is stored in the related
cell; otherwise a 0 is stored in it.
• The attributes of the frequent schemas are stated as follows:
– QI1: Data Mining Extract Hidden Data from Database = {Data,
Mining, Hidden, Database}, stop word {from};
– QI2: Web Mining discovers Hidden information on the Web = {Web,
Mining, Hidden}, stop words {on, the};
– QI3: Web content Mining is a branch in Web Mining = {Web,
Mining}, stop words {is, a, in};
– QI4: Knowledge discovery in Database = {Database}, stop word {in}.
11
12. Cont…
       Data  Mining  Extract  Database  Hidden  Web  Other keywords  Stop words
QI1     1      1       1        1         1      0        0              1
QI2     0      1       0        0         1      2        2              2
QI3     0      2       0        0         0      2        2              3
QI4     0      0       0        1         0      0        3              1
Tab 1. Representation of the web data in a binary scale.
12
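The keyword-count representation in Table 1 can be approximated with a short script. This is a minimal sketch only: the stop-word list, the whitespace tokenization, and the counting convention are assumptions, so the output may differ slightly from Table 1 (which records some repeated keywords only once).

```python
# Sketch: build per-query keyword counts after stop-word removal (assumptions noted above).
from collections import Counter

STOP_WORDS = {"from", "on", "the", "is", "a", "in"}           # assumed stop-word list
KEYWORD_COLUMNS = ["data", "mining", "extract", "database", "hidden", "web"]

queries = {
    "QI1": "Data Mining Extract Hidden Data from Database",
    "QI2": "Web Mining discovers Hidden information on the Web",
    "QI3": "Web content Mining is a branch in Web Mining",
    "QI4": "Knowledge discovery in Database",
}

for name, text in queries.items():
    tokens = text.lower().split()
    stop_count = sum(1 for t in tokens if t in STOP_WORDS)    # how many stop words were removed
    keywords = [t for t in tokens if t not in STOP_WORDS]     # stop-word removal
    counts = Counter(keywords)
    row = [counts.get(col, 0) for col in KEYWORD_COLUMNS]     # counts for the table's keyword columns
    other = sum(c for t, c in counts.items() if t not in KEYWORD_COLUMNS)
    print(name, row, "other:", other, "stop:", stop_count)
```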
14. Web Documents Classification Methods
• Web documents consist of text, images, videos, and audio.
• Text is by far the most prevalent type of data in web documents.
• Automatic text classification is the process of assigning a
text document to one or more predefined categories based on
its content.
• Automatic web text document classification requires three main
consecutive phases in constructing a classification system, listed as
follows:
1. Collect the text documents in corpora and tag them.
2. Select a set of features to represent the defined classes.
3. Train and test the classification algorithms using the
corpora collected in the first stage.
14
15. Cont…
The text classification problem is composed of several sub-problems, such as:
Document indexing: Document indexing is concerned with the way the document's
keywords are extracted. There are two main approaches to document
indexing: the first considers the index terms as bags of words,
and the second regards the index terms as phrases.
Weighting assignment: Weight-assignment techniques associate a real number
ranging from 0 to 1 with each of a document's terms; these weights are required to
classify newly arrived documents.
Learning-based text classification algorithms: A commonly used text classification
algorithm is an inductive learning algorithm based on probabilistic theory, and
different models have been emphasized, such as the Naive Bayesian model (which
consistently shows good results and is widely used in text classification). Other text
classification methods have emerged to categorize documents, such as K-Nearest
Neighbor (KNN), which computes the distances between the document's index
terms and the known terms of each category. The accuracy will be tested by the
k-fold cross-validation method.
15
17. Preprocessing phases
The data used in this work are collected from many news web sites. The data set
consists of 1562 Arabic documents of different lengths that belong to 6 categories
(Economic, Cultural, Political, Social, Sports, General). Table 2 represents the
number of documents for each category.
First phase: the preprocessing step, where documents are prepared
to make them adequate for further use; stop-word removal and rearranging
the document contents are some of the steps in this phase.
Table 2. Number of Documents per Category
17
18. Continue
Second phase is the weighting-assignment phase. It is defined as
the assignment of a real number between 0 and 1 to each
keyword; this number indicates the importance of the
keyword inside the document.
Many methods have been developed, and the most widely used model is the tf-idf
weighting factor. The weight of each keyword is computed by multiplying the
term frequency (tf) by the inverse document frequency (idf), where:
f_ik = number of occurrences of term t_k in document D_i;
tf_ik = f_ik / max_l(f_il), the normalized term frequency within the document;
df_k = number of documents that contain t_k;
idf_k = log(d / df_k), where d is the total number of documents and df_k is the number
of documents that contain term t_k;
w_ik = tf_ik * idf_k is the term weight; the computed w_ik is a real number in [0, 1].
18
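A minimal sketch of this tf-idf weighting, assuming whitespace-tokenized, already-normalized documents; the base-10 logarithm is an assumption, since the slide does not specify the log base.

```python
# Sketch: tf-idf weights following the definitions on the slide above.
import math

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    d = len(docs)
    # df_k: number of documents containing term t_k
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    weights = []
    for tokens in docs:
        # f_ik: raw occurrences of t_k in document D_i
        f = {}
        for term in tokens:
            f[term] = f.get(term, 0) + 1
        max_f = max(f.values()) if f else 1
        w = {}
        for term, fik in f.items():
            tf = fik / max_f                      # tf_ik = f_ik / max_l(f_il)
            idf = math.log10(d / df[term])        # idf_k = log(d / df_k), base assumed
            w[term] = tf * idf                    # w_ik = tf_ik * idf_k
        weights.append(w)
    return weights

docs = [["web", "mining", "hidden"], ["data", "mining", "database"]]
print(tfidf_weights(docs))
```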
20. Classifier Naive Bayesian (CNB)
Bayesian learning is a probability-driven algorithm based on Bayes' probability theorem;
it is highly recommended in text classification. The Naive Bayesian classifier can often
outperform more sophisticated classification methods. The classifier's task is to
categorize incoming objects into their appropriate class.
P(class | document): the probability of a class given a document, i.e., the probability
that a given document D belongs to a given class C; this is our target.
P(document): the probability of a document; since p(document) is a constant divider in
every calculation, we can ignore it.
P(document | class): the probability of a document in a given class. A document can be
modeled as a set of words, so P(document | class) can be written in terms of:
P(class): the probability of a class (or category); we can compute it from the number of
documents in the category divided by the number of documents in all categories.
p(word_i | C): the probability that the i-th word of a given document occurs in a
document from class C; this can be calculated from the word counts in the training
documents of class C (a sketch is given below).
20
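A minimal sketch of a Naive Bayesian text classifier built from the quantities above. The add-one (Laplace) smoothing and the toy training data are illustrative assumptions, not part of the original system.

```python
# Sketch: multinomial Naive Bayes using log P(class) + sum of log P(word_i | class);
# P(document) is ignored as a constant divider, as noted on the slide.
import math
from collections import Counter, defaultdict

def train_cnb(documents):
    """documents: list of (token_list, label). Returns the model parameters."""
    class_doc_counts = Counter(label for _, label in documents)   # for P(class)
    word_counts = defaultdict(Counter)                            # word_counts[c][w]
    vocab = set()
    for tokens, label in documents:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, word_counts, vocab, len(documents)

def classify_cnb(tokens, model):
    class_doc_counts, word_counts, vocab, n_docs = model
    best, best_score = None, float("-inf")
    for c, doc_count in class_doc_counts.items():
        score = math.log(doc_count / n_docs)                      # log P(class)
        total = sum(word_counts[c].values())
        for w in tokens:                                          # log P(word_i | class), smoothed
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

model = train_cnb([(["economy", "market"], "Economic"),
                   (["match", "goal"], "Sports")])
print(classify_cnb(["market", "goal", "goal"], model))
```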
22. Classifier K-Nearest Neighbor (CK-NN)
• K-Nearest Neighbor is a widely used text classifier, especially in text mining,
because of its simplicity and efficiency.
• It is a supervised learning algorithm where a new query instance is classified
based on the categories of its K nearest neighbors.
• Its training phase consists of nothing more than storing all training examples in the
classifier.
• It works on the minimum distance from the query instance to the training
samples to determine the K nearest neighbors.
• After collecting the K nearest neighbors, we take a simple majority vote of these
K nearest neighbors as the prediction for the query instance.
• The CK-NN algorithm deals with several multivariate attributes X_i that are
used to classify the object Y. We will deal only with quantitative X_i and a
binary (nominal) Y.
22
23. Continue
• Example: Suppose that the K factor is set equal to 8 (there are 8 nearest
neighbors) as a parameter of this algorithm. The distance between the
query instance and all the training samples is then computed; since all the X_i
are quantitative, this is straightforward.
• A training sample is included among the nearest neighbors if its distance to the
query is less than or equal to the Kth smallest distance; in this case the distances
of all training samples to the query are sorted and the Kth one is determined as
the cut-off distance.
• The unknown sample is assigned the most common class among its K nearest
neighbors, found from the distances between the query and all training
samples.
• The K training samples with the smallest distances are the K nearest neighbors of
the unknown sample. Closeness is defined in terms of Euclidean distance, where the
Euclidean distance between two points X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is:
d(X, Y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
23
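A minimal CK-NN sketch using the Euclidean distance and a simple majority vote over the K closest training samples; the feature vectors and the value of K are illustrative assumptions.

```python
# Sketch: Euclidean distance to every training sample, then majority vote over the K closest.
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def classify_cknn(query, training, k=8):
    """training: list of (vector, label); query: vector. Returns the majority class."""
    neighbours = sorted(training, key=lambda sample: euclidean(query, sample[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((0.9, 0.1), "Sports"), ((0.8, 0.2), "Sports"),
            ((0.1, 0.9), "Economic"), ((0.2, 0.8), "Economic")]
print(classify_cknn((0.85, 0.15), training, k=3))
```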
25. Implementation
The implementation of the proposed Deep Web Text Classifier
(DWTC) demonstrates the importance of classifying Latin text in
web documents for information retrieval. It illustrates both the
keyword extraction and the text classifiers used in the algorithm
implementation:
Keyword extraction: Text web documents are scanned to find the
keywords, and each one is normalized. The normalization process consists of
removing stop words, punctuation marks, and non-letters from the
Latin text, as shown in Table 3.
Some stop words are listed in Table 3.
Tab 3: Examples of stop words and non-letters.
25
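A minimal sketch of the keyword-extraction and normalization step described above, assuming Latin text; the stop-word list here is an illustrative placeholder, not the contents of Table 3.

```python
# Sketch: lower-case, strip punctuation/non-letters, and remove stop words (assumed list).
import re

STOP_WORDS = {"the", "a", "is", "in", "on", "of", "and", "to", "from"}

def extract_keywords(text):
    text = re.sub(r"[^A-Za-z\s]", " ", text)      # drop punctuation and non-letters
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(extract_keywords("Web content mining is a branch of Web mining!"))
```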
26. Continue
Term weighting: There are two criteria:
First criterion: the more often a term occurs in documents belonging to
some category, the more relevant it is to that category.
Second criterion: the more a term appears across documents representing different
categories, the less useful the term is for discriminating between documents as
belonging to different categories.
In this implementation we used the commonly applied normalized
tf×idf approach to overcome the problem of varying document lengths.
Algorithm implementation: this implementation was mainly developed for testing the
effectiveness of the CK-NN and CNB algorithms when applied to Latin text.
A set of labeled text documents is supplied to the system; the labels are used
to indicate the class or classes that each text document belongs to. All documents
in the data set should be labeled in order to train the system and then test it.
The system sees the labels of the training documents but not those of the test
set.
The system compares the two classifiers and reports the one with the higher
accuracy for the current labeled text documents, as sketched below.
The system compares these results, selects the best average accuracy rate for
each classifier, and uses the greater average accuracy rate; the higher rate is
then used to start the retrieving process.
26
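A minimal sketch of the selection step: both classifiers are evaluated on held-out labeled documents and the one with the higher accuracy is kept. The classifier functions passed in are assumed to exist (for example, the CNB and CK-NN sketches shown earlier); the toy data is illustrative only.

```python
# Sketch: evaluate each classifier on a labeled test split and keep the more accurate one.
def accuracy(classify, test_set):
    """test_set: list of (features, label); classify maps features -> label."""
    correct = sum(1 for features, label in test_set if classify(features) == label)
    return correct / len(test_set)

def select_classifier(classifiers, test_set):
    """classifiers: dict of name -> classify function. Returns the name of the best one."""
    scores = {name: accuracy(fn, test_set) for name, fn in classifiers.items()}
    best = max(scores, key=scores.get)
    print("accuracies:", scores, "-> selected:", best)
    return best

# Toy usage with dummy stand-ins for the CNB and CK-NN classifiers:
test = [((1, 0), "Economic"), ((0, 1), "Sports")]
select_classifier({"CNB": lambda f: "Economic" if f == (1, 0) else "Sports",
                   "CK-NN": lambda f: "Sports"}, test)
```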
28. Results and Discussion
• The data used in this work are collected from many web sites. The
data set consists of 3533 Latin and non-Latin text documents of
different lengths that belong to 6 categories: Economic, Cultural,
Political, Social, Sports, and General (Table 2).
• To test the system, the documents in the data set were preprocessed
to find the main categories.
• Various splitting percentages were used to see how the number of
training documents impacts classification effectiveness.
• Different k values, from 1 up to 20, were used to find the best
results for CK-NN. Effectiveness started to decline at k > 15.
• A comparison between the two algorithms was made on the labeled
sample data, and the better classifier was indicated in the system.
28
29. Continue
• The k-fold cross-validation method is used to test the accuracy of the
system (a sketch follows below).
• Our results are roughly close to those reported by other developers.
• The results of the conducted experiments are shown in the last columns of
Table 4 and Table 5.
• It can be seen from these results that the Classifier K-Nearest Neighbor
(CK-NN), with an average of 93.08%, did better than the Classifier Naïve
Bayesian, which had 90.03%, on Latin text.
• This means that the DWTC system will in this case use CK-NN for Latin
text classification and extraction instead of CNB.
• In the case of non-Latin text, the DWTC system will use CNB, which has an
average of 91.05% on non-Latin classification and extraction, instead of
CK-NN, with an average of 88.06%, as indicated in Table 5.
29
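A minimal sketch of k-fold cross-validation as used to measure accuracy: the data set is split into k folds, each fold is held out once for testing while the rest is used for training, and the k accuracies are averaged. The train_fn and test_fn arguments are assumed helpers standing in for the classifiers described in these slides.

```python
# Sketch: k-fold cross-validation returning the average accuracy over the k folds.
def k_fold_accuracy(data, k, train_fn, test_fn):
    """data: list of labeled examples; train_fn(train) -> model;
    test_fn(model, test) -> accuracy in [0, 1]."""
    folds = [data[i::k] for i in range(k)]        # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(test_fn(train_fn(train), test))
    return sum(scores) / k                        # average accuracy over the folds

# Usage (with assumed, hypothetical helpers):
# average_acc = k_fold_accuracy(corpus, 10, train_cnb, cnb_accuracy)
```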
31. Conclusion
• An evaluation of the use of the Classifier K-Nearest Neighbor (CK-NN) and the
Classifier Naïve Bayes (CNB) for the classification of Arabic text was presented.
• A special corpus consisting of 3533 documents that belong to 6 categories
was developed.
• A feature set of extracted keywords and term weights was used in order to
improve the performance.
• As a result, we applied the two algorithms to classify the text documents, with
a satisfactory number of patterns for each category.
• The accuracy was measured by using the k-fold cross-validation method to test
the system.
• We proposed an empirical Latin and non-Latin text classifier system called the
Deep Web Text Classifier (DWTC).
• The system compares the results of both classifiers used (CK-NN and CNB)
and selects the best average accuracy rate for Latin or non-Latin text.
31