Most internet applications are built using web technologies such as HTML. Web pages are designed either to display data records drawn from underlying databases or to present text in an unstructured form that follows a fixed template. Summarizing data that is dispersed across different web pages is tedious and consumes considerable time and manual effort. A supervised learning technique called wrapper induction can be used to learn data extraction rules from web pages; applying these learnt rules to other web pages makes information extraction far easier. This paper focuses on developing a tool for information extraction from such unstructured data. The use of semantic web technologies greatly simplifies the process. The tool enables the data scattered over multiple web pages to be queried in versatile ways. This is accomplished by the following steps: extracting the data from multiple web pages, storing it in the form of RDF triples, integrating multiple RDF files using an ontology, generating a SPARQL query from the user query, and generating a report in the form of tables or charts from the SPARQL results. The relationships between related web pages are identified using the ontology and exploited to formulate better queries, thus enhancing search efficacy.
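To make the storage and querying steps concrete, the following is a minimal sketch in Python using the rdflib library; the namespace, field names and record values are hypothetical illustrations, not taken from the paper's own tool.

```python
# Minimal sketch (Python + rdflib): store extracted fields as RDF triples
# and answer a user query via SPARQL. Names and field values are illustrative.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/schema#")

g = Graph()
g.bind("ex", EX)

# Suppose the wrapper extracted one record from a web page.
record = {"title": "Semantic Web Primer", "author": "J. Smith", "year": "2014"}
subject = EX["doc1"]
g.add((subject, RDF.type, EX.Document))
for field, value in record.items():
    g.add((subject, EX[field], Literal(value)))

# A SPARQL query generated from a user query such as "documents by J. Smith".
q = """
PREFIX ex: <http://example.org/schema#>
SELECT ?title WHERE {
    ?d a ex:Document ;
       ex:author "J. Smith" ;
       ex:title ?title .
}
"""
for row in g.query(q):
    print(row.title)

# The graph can be persisted in Turtle for later integration with other pages.
g.serialize(destination="page1.ttl", format="turtle")
```

A report generator could then render the rows returned by the query as a table or chart, as described above.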
Information on the web has been increasing tremendously in recent years, and at an ever faster rate. This massive, voluminous data has created intricate problems for information retrieval and knowledge management. As the data resides on the web in several forms, knowledge management on the web is a challenging task. Here the 'Semantic Web' concept can be used so that machines understand web content and offer intelligent services efficiently, with a meaningful knowledge representation. Data retrieval from traditional web sources is focused on 'page ranking' techniques, whereas in the semantic web the retrieval process is based on 'concept-based learning'. The proposed work is aimed at developing a new framework for the automatic generation of an ontology and RDF for real-time web data extracted from multiple repositories by tracing their URIs and text documents. An improved inverted indexing technique is applied for ontology generation, and Turtle notation is used for the RDF. A program is written to validate the data extracted from multiple repositories by removing unwanted data and considering only the document section of each web page.
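To illustrate the two building blocks mentioned here, the sketch below builds a simple inverted index over document text and serializes the result in Turtle with rdflib. The URIs, vocabulary and sample documents are illustrative assumptions, not the framework's actual output.

```python
# Minimal sketch of an inverted index over document text, later used to link
# terms (candidate ontology concepts) to the documents that mention them.
from collections import defaultdict
from rdflib import Graph, Namespace, URIRef

docs = {
    "http://example.org/page1": "semantic web ontology rdf",
    "http://example.org/page2": "ontology generation from web data",
}

# Build the inverted index: term -> set of document URIs containing it.
index = defaultdict(set)
for uri, text in docs.items():
    for term in text.split():
        index[term].add(uri)

# Emit the index as RDF and write it in Turtle notation.
EX = Namespace("http://example.org/vocab#")
g = Graph()
g.bind("ex", EX)
for term, uris in index.items():
    concept = EX[term]
    for uri in uris:
        g.add((concept, EX.occursIn, URIRef(uri)))

print(g.serialize(format="turtle"))
```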
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A... (IJwest)
Constructing ontologies from relational databases is an active research topic in the Semantic Web domain. While conceptual mapping rules/principles between relational databases and ontology structures are being proposed, several software modules or plug-ins are being developed to enable the automatic conversion of relational databases into ontologies. However, the correlation between the ontologies built automatically with such plug-ins and the database-to-ontology mapping principles has received little attention. This study reviews and applies two Protégé plug-ins, namely DataMaster and OntoBase, to automatically construct ontologies from a relational database. The resulting ontologies are further analysed to match their structures against the database-to-ontology mapping principles. A comparative analysis of the matching results reveals that OntoBase outperforms DataMaster in applying the database-to-ontology mapping principles for automatically converting relational databases into ontologies.
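The commonly proposed mapping principles (table to class, column to datatype property, foreign key to object property) can be illustrated with a small sketch; the schema and URIs below are hypothetical and are not tied to DataMaster or OntoBase.

```python
# Illustrative sketch of typical database-to-ontology mapping principles:
# each table becomes a class, each column a datatype property, and each
# foreign key an object property. The schema below is hypothetical.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL

schema = {
    "Author": {"columns": ["name", "country"], "foreign_keys": {}},
    "Paper":  {"columns": ["title", "year"],   "foreign_keys": {"author_id": "Author"}},
}

EX = Namespace("http://example.org/db#")
g = Graph()
g.bind("ex", EX)

for table, meta in schema.items():
    g.add((EX[table], RDF.type, OWL.Class))
    for col in meta["columns"]:
        prop = EX[f"{table}_{col}"]
        g.add((prop, RDF.type, OWL.DatatypeProperty))
        g.add((prop, RDFS.domain, EX[table]))
    for fk, target in meta["foreign_keys"].items():
        prop = EX[f"{table}_{fk}"]
        g.add((prop, RDF.type, OWL.ObjectProperty))
        g.add((prop, RDFS.domain, EX[table]))
        g.add((prop, RDFS.range, EX[target]))

print(g.serialize(format="turtle"))
```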
Abstract—Since the demand for information retrieval is increasing quickly, indexing structures have become an important means of supporting fast information retrieval. In this paper, a new data structure for information retrieval called the Dynamic Ordered Multi-field Index (DOMI) is introduced. It is based on radix trees organized in segments, plus a hash table pointing to the root of each segment, where each segment is dedicated to storing the values of a single field. The hash table is used to access the needed segments directly without traversing the upper segments, so DOMI improves look-up performance for queries addressing a single field. For queries addressing multiple fields, each segment of the radix tree is traversed sequentially without visiting unrelated branches. The segmentation used in the proposed DOMI also provides flexibility for minimizing communication overhead in a distributed system: every field is represented by one segment, and each segment can be stored as one block.
In addition, the proposed DOMI consumes less space than indexes built using B or B+ trees, making it better suited to data-intensive settings such as Big Data.
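The segment-per-field idea can be sketched with a simplified structure: a hash table keyed by field name, each entry pointing to a small trie. This is only a toy approximation under that assumption; the actual DOMI uses radix trees and ordering that are not reproduced here.

```python
# Minimal sketch of the segmented-index idea: one trie ("segment") per field,
# reached directly through a hash table keyed by field name.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.record_ids = []

class SegmentedIndex:
    def __init__(self):
        self.segments = {}          # hash table: field name -> trie root

    def insert(self, field, value, record_id):
        node = self.segments.setdefault(field, TrieNode())
        for ch in value:
            node = node.children.setdefault(ch, TrieNode())
        node.record_ids.append(record_id)

    def lookup(self, field, value):
        node = self.segments.get(field)  # jump straight to the field's segment
        if node is None:
            return []
        for ch in value:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.record_ids

idx = SegmentedIndex()
idx.insert("author", "smith", 1)
idx.insert("title", "semantic web", 2)
print(idx.lookup("author", "smith"))   # -> [1]
```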
This presentation mainly intends to explain how to gather the author name, title, submission year, keywords, subjects, department, university, language, and the number of pages of every thesis document in Theseus and then reuse the gathered data for building a Web portal.
The Web is a universal medium for information, data and knowledge exchange. The Semantic Web is an extension of the World Wide Web, "in which information is given well-defined meaning, better enabling computers and people to work in cooperation" [semweb:lee]. RDF, together with SPARQL, provides a powerful mechanism for describing and interchanging metadata on the web. This paper briefly presents the two concepts - RDF and SPARQL - and three of the most popular frameworks (written in Java) that offer support for RDF: Jena, Sesame and JRDF.
Lecture notes by Mustafa Jarrar at Birzeit University, Palestine.
See the course webpage at: http://jarrar-courses.blogspot.com/2014/01/web-data-management.html
and http://www.jarrar.info
You may also watch this lecture at: http://www.youtube.com/watch?v=rH9mksypcNw
The lecture covers data integration and fusion.
WebSpa is a tool that allows quick, intuitive (and even fun) querying of arbitrary SPARQL endpoints. WebSpa runs in the web browser and does not require the installation of any additional software. The tool manages a large variety of pre-defined SPARQL endpoints and allows new ones to be added. A user account makes it possible to save both the queries and their results on the local computer, as well as to edit the queries later. The application is written in both Java and Flex. It uses the Jena and ARQ application programming interfaces to perform the queries, and the results are processed and displayed using Flex.
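The kind of endpoint interrogation WebSpa offers in the browser can also be done programmatically; below is a minimal sketch assuming the Python SPARQLWrapper library and the public DBpedia endpoint, not WebSpa's own API.

```python
# Minimal sketch of querying a public SPARQL endpoint programmatically.
# Assumes the SPARQLWrapper library and the public DBpedia endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Semantic_Web> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])
```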
Conference object
VI International Symposium on Digital Libraries (Brazil)
Digital repositories acting as resource aggregators typically face different challenges, roughly classified into three main categories: extraction, improvement and storage. The first category comprises issues related to dealing with different resource collection protocols (OAI-PMH, web crawling, web services, etc.) and their representations (XML, HTML, database tuples, unstructured documents, etc.). The second category comprises information improvements based on controlled vocabularies, specific date formats, correction of malformed data, and so on. Finally, the third category deals with the destination of the downloaded resources: unification into a common database, sorting by certain criteria, etc.
This paper proposes an ETL architecture for designing a software application that provides a comprehensive solution to the challenges posed by a digital repository acting as a resource aggregator.
Design and implementation aspects considered during the development of this tool are described, focusing especially on architectural highlights.
See the full record at: http://sedici.unlp.edu.ar/handle/10915/5529
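As a rough illustration of the three aggregator stages described above (extraction via OAI-PMH, improvement, storage), here is a minimal Python sketch; the endpoint URL, normalization rule and in-memory "database" are illustrative assumptions, not the proposed architecture itself.

```python
# Minimal sketch of extract (OAI-PMH harvesting), improve (date normalization)
# and store (unification into one collection). URLs and rules are illustrative.
from datetime import datetime
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def extract(base_url):
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for record in tree.iter(OAI + "record"):
        title = record.find(".//" + DC + "title")
        date = record.find(".//" + DC + "date")
        yield {"title": title.text if title is not None else "",
               "date": date.text if date is not None else ""}

def improve(record):
    # Normalize dates to ISO format where possible; drop malformed values.
    try:
        record["date"] = datetime.strptime(record["date"], "%Y-%m-%d").date().isoformat()
    except ValueError:
        record["date"] = None
    return record

def store(records, db):
    db.extend(records)          # stand-in for a real database insert

db = []
store((improve(r) for r in extract("http://example.org/oai")), db)
```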
Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014 (Robert Meusel)
The Linked Data for Information Extraction challenge aims at extracting structured data from web pages. It is based on a subset of the Web Data Commons Microformats dataset.
For the challenge, the original annotated pages are provided, as well as the triples extracted from them. Based on that information, participants have to design an information extraction system capable of extracting the same kind of information from other web pages. In this year's challenge, the focus is on hCard data, i.e., information about persons. A use case of such a system could be the assembly of a large database of person data.
The systems are evaluated on a test set of annotated web pages from which all annotations have been removed. Participants have to extract triples from those pages and submit their resulting triple files. The submitted files are evaluated against the gold standard of the original triples, and the solutions are ranked by F-measure.
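The evaluation step amounts to comparing two sets of triples and computing precision, recall and F-measure; a small sketch is shown below, with plain tuples standing in for full RDF triples and purely illustrative values.

```python
# Minimal sketch of ranking by F-measure: compare submitted triples against
# a gold standard. Triples here are plain tuples with illustrative values.
def f_measure(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("ex:p1", "vcard:fn", "Alice"), ("ex:p1", "vcard:org", "Acme")}
submitted = {("ex:p1", "vcard:fn", "Alice"), ("ex:p1", "vcard:tel", "555-1234")}
print(f_measure(submitted, gold))   # 0.5 (precision 0.5, recall 0.5)
```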
The Semantic Web is a vision of information that is understandable by computers. Although there is great exploitable potential, we are still in "Generation Zero" of the Semantic Web, since there are few compelling real-world applications. Heterogeneity, the volume of data and the lack of standards are problems that could be addressed through nature-inspired methods. The paper presents the most important aspects of the Semantic Web, as well as its biggest issues; it then describes some methods inspired by nature - genetic algorithms, artificial neural networks, swarm intelligence - and the ways these techniques can be used to deal with Semantic Web problems.
Information residing in relational databases and delimited file systems is inadequate for reuse and sharing over the web, since these systems do not adhere to commonly agreed principles for maintaining data harmony. For these reasons, resources throughout the web suffer from a lack of uniformity, as well as from heterogeneity and redundancy. Ontologies have been widely used to address such problems, as they help in extracting knowledge from any information system. In this article, we focus on extracting concepts and their relations from a set of CSV files. These files serve as individual concepts and are grouped into a particular domain, called the domain ontology. Furthermore, this domain ontology is used to capture the CSV data and represent it in RDF format while retaining the links among files or concepts. Datatype and object properties are automatically detected from the header fields, which reduces the user's involvement in generating mapping files. A detailed analysis has been performed on baseball tabular data, and the result shows a rich set of semantic information.
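A minimal sketch of the CSV-to-RDF idea is shown below, assuming rdflib and a hypothetical players.csv file; the datatype guess from header values is a simplification of the property detection described in the article.

```python
# Minimal sketch: turn a CSV file into RDF, inferring datatype properties
# from field values. File name, headers and URIs are illustrative.
import csv
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/baseball#")

def guess_datatype(value):
    try:
        int(value)
        return XSD.integer
    except ValueError:
        return XSD.string

g = Graph()
g.bind("ex", EX)
with open("players.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        subject = EX[f"row{i}"]
        g.add((subject, RDF.type, EX.Player))
        for header, value in row.items():
            g.add((subject, EX[header], Literal(value, datatype=guess_datatype(value))))

print(g.serialize(format="turtle"))
```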
SURVEY ON MONGODB: AN OPEN-SOURCE DOCUMENT DATABASE (IAEME Publication)
MongoDB is an open-source, cross-platform database. It is classified as NoSQL (Not Only SQL), is written in C++ and follows a document-oriented data model with a dynamic schema. It offers features such as Map/Reduce, auto-sharding and MongoDump: Map/Reduce supports efficient data aggregation, auto-sharding distributes data across different machines, and there are backup facilities and more, which together give MongoDB high performance. It uses collections in place of tables, and each collection can store different kinds of data. Data is stored in a JSON-like structure, and unlike RDBMS databases it can also store unstructured data, processing and handling large amounts of data more efficiently than an RDBMS. It provides atomic operations at the document level, similar in spirit to the ACID guarantees of RDBMS databases. MongoDB is mainly used in applications that produce and consume vast amounts of data, such as blogs or sites that store unstructured data; it can also be used in applications that store structured and semi-structured data.
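The document model and dynamic schema can be seen in a few lines; the sketch below assumes the pymongo driver and a local MongoDB instance, with illustrative database, collection and field names.

```python
# Minimal sketch of MongoDB's document model via the pymongo driver,
# assuming a MongoDB server on localhost. Names below are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["blog"]["posts"]          # database "blog", collection "posts"

# Documents in the same collection may have different shapes (dynamic schema).
posts.insert_one({"title": "Hello", "tags": ["intro"], "views": 10})
posts.insert_one({"title": "Semantic Web", "author": {"name": "Alice"}})

for doc in posts.find({"views": {"$gt": 5}}):
    print(doc["title"])
```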
Ontology languages are used to model the semantics of concepts within a particular domain and the relationships between those concepts. The Semantic Web standards provide a number of modelling languages that differ in their level of expressivity and are organized in a Semantic Web Stack in such a way that each language level builds on the expressivity of the one below it. Several problems arise when one attempts to use independently developed ontologies, and adapting existing ontologies for new purposes requires that certain operations be performed on them; these operations are currently performed in a semi-automated manner. This paper seeks to model the syntax and semantics of RDF ontologies categorically, as a step towards the formalization of ontological operations using category theory.
The logic-based, machine-understandable framework of the Semantic Web often challenges naive users when they try to query ontology-based knowledge bases. Existing research efforts have approached this problem by introducing Natural Language (NL) interfaces to ontologies, which can construct SPARQL queries from NL user queries. However, most efforts have been restricted to queries expressed in English, and they often benefited from the maturity of English NLP tools; little research has been done to support querying Arabic content on the Semantic Web with NL queries. This paper presents a domain-independent approach for translating Arabic NL queries to SPARQL by leveraging linguistic analysis. With special consideration of Noun Phrases (NPs), our approach uses a language parser to extract NPs and their relations from Arabic parse trees and match them to the underlying ontology. It then utilizes knowledge in the ontology to group NPs into triple-based representations. A SPARQL query is finally generated by extracting targets and modifiers and interpreting them into SPARQL. The interpretation of advanced semantic features, including negation and conjunctive and disjunctive modifiers, is also supported. The approach was evaluated using two datasets consisting of OWL test data and queries, and the obtained results confirm its feasibility for translating Arabic NL queries to SPARQL.
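The final interpretation step, turning a target and its modifiers into a SPARQL query, can be sketched as a simple string assembly; the ontology URIs and example phrases below are hypothetical, and the Arabic parsing and ontology-matching stages are assumed to have already produced them.

```python
# Toy sketch of the last step described above: once noun phrases have been
# matched to ontology terms, assemble the target and modifiers into SPARQL.
def build_sparql(target_class, modifiers):
    """target_class: ontology class of the answer; modifiers: (property, value) pairs."""
    patterns = [f"?x a <{target_class}> ."]
    for prop, value in modifiers:
        patterns.append(f'?x <{prop}> "{value}" .')
    body = "\n    ".join(patterns)
    return f"SELECT ?x WHERE {{\n    {body}\n}}"

# E.g. a query asking for books written by a given author.
query = build_sparql(
    "http://example.org/onto#Book",
    [("http://example.org/onto#hasAuthor", "Naguib Mahfouz")],
)
print(query)
```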
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP... (cscpconf)
Data from the internet is dispersed across multiple documents or web pages, most of which are not properly structured and organized. It becomes necessary to organize this content in order to improve search results by increasing relevancy. Semantic web technologies and ontologies play a vital role in information extraction and in discovering new knowledge from web documents. This paper suggests a model for storing web content in an organized and structured manner in RDF format. The information extraction techniques and the ontologies developed for the domain together discover new knowledge. The paper also shows that the time taken for inferring the new knowledge is minimal compared with manual effort when semantic web technologies are used in developing the applications.
Vision Based Deep Web Data Extraction on Nested Query Result Records (IJMER)
A Novel Data Extraction and Alignment Method for Web Databases (IJMER)
No other medium has taken such a meaningful place in our lives in such a short time as the world's largest data network, the World Wide Web. However, when searching for information in this network, the user is constantly exposed to an ever-growing flood of information, which is both a blessing and a curse. The explosive growth and popularity of the World Wide Web have resulted in a huge number of information sources on the Internet. As web sites become more complicated, the construction of web information extraction systems becomes more difficult and time consuming, so scalable automatic Web Information Extraction (WIE) is in high demand. There are four levels of information extraction from the World Wide Web: free-text level, record level, page level and site level. In this paper, the target extraction task is record-level extraction. Nwe Nwe Hlaing, Thi Thi Soe Nyunt, Myat Thet Nyo, "The Data Records Extraction from Web Pages", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume 3, Issue 5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd28010.pdf Paper URL: https://www.ijtsrd.com/computer-science/world-wide-web/28010/the-data-records-extraction-from-web-pages/nwe-nwe-hlaing
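Record-level extraction targets repeated structures within a single listing page; a minimal sketch is shown below, assuming the BeautifulSoup library and illustrative markup and class names rather than the paper's own method.

```python
# Minimal sketch of record-level extraction: pulling repeated data records
# out of an HTML listing page. Assumes BeautifulSoup; markup is illustrative.
from bs4 import BeautifulSoup

html = """
<div class="result"><span class="title">Book A</span><span class="price">10</span></div>
<div class="result"><span class="title">Book B</span><span class="price">15</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for div in soup.find_all("div", class_="result"):
    records.append({
        "title": div.find("span", class_="title").get_text(strip=True),
        "price": div.find("span", class_="price").get_text(strip=True),
    })

print(records)   # [{'title': 'Book A', 'price': '10'}, {'title': 'Book B', 'price': '15'}]
```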
Semantic Knowledge Acquisition of Information for Syntactic Web (dannyijwest)
Information retrieval is one of the most commonly used web services. Information is knowledge: in earlier days one had to find a resource person or a resource library to acquire knowledge, but today, just by typing a keyword into a search engine, all kinds of resources are available to us. Owing to this advancement, an enormous amount of information is available on the net. In this era we therefore need a search engine that searches with us by understanding the semantics of the query given by the user. Such a design is possible only if we provide semantics for our ordinary HTML web pages. In this paper we explain the concept of converting an HTML page into an RDFS/OWL page. This technique is combined with natural language technology, as we have to provide the hyponyms and meronyms of the given HTML pages. Through this automatic conversion, the concept of intelligent information retrieval is framed.
An Efficient Annotation of Search Results Based on Feature Ranking Approach f... (Computer Science Journals)
With the increasing number of web databases, a major part of the deep web is backed by databases. In several search engines, the encoded data in the returned result pages often comes from structured databases, which are referred to as Web databases (WDB).
Annotation for Query Result Records Based on Domain Specific Ontology (ijnlc)
The World Wide Web is enriched with a large collection of data, scattered across deep-web databases and web pages in unstructured or semi-structured formats. Recently evolving, customer-friendly web applications need special data extraction mechanisms to draw the required data out of the deep web according to the end user's query and populate the output page dynamically at the fastest possible rate. In existing research, web data extraction methods are based on supervised learning (wrapper induction). In the past few years researchers have focused on automatic web data extraction methods based on similarity measures. Among automatic data extraction methods, our existing Combining Tag and Value Similarity method fails to identify some attributes in the query result table. A novel approach for data extraction and label assignment, called Annotation for Query Result Records, is therefore proposed based on a domain-specific ontology. First, a domain ontology is constructed using information from the query interface and the query result pages obtained from the web. Next, using this domain ontology, a meaningful label is assigned automatically to each column of the extracted query result records.
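The column-labelling step can be pictured as matching each extracted column's values against the value space of ontology properties; the toy sketch below uses a hypothetical ontology and table, and is only one possible reading of the matching idea.

```python
# Toy sketch of assigning a label to each extracted result column by matching
# its values against known values of ontology properties. Data is illustrative.
domain_ontology = {
    "hasAuthor": {"J. Smith", "A. Jones", "L. Chen"},
    "hasYear": {"2012", "2013", "2014"},
}

extracted_columns = [
    ["J. Smith", "L. Chen"],       # column 0
    ["2013", "2014"],              # column 1
]

def label_column(values, ontology):
    best, best_score = None, 0.0
    for prop, known_values in ontology.items():
        score = len(set(values) & known_values) / len(values)
        if score > best_score:
            best, best_score = prop, score
    return best

for i, col in enumerate(extracted_columns):
    print(i, label_column(col, domain_ontology))
# 0 hasAuthor
# 1 hasYear
```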
Web Content Mining Based on DOM Intersection and Visual Features Concept (ijceronline)
Structured data extraction from deep-web pages is a challenging task because of the complex underlying structure of such pages; moreover, website developers generally follow different page design techniques. Data extraction from web pages is highly useful for building our own database for many applications. A large number of techniques have been proposed to address this problem, but all of them have inherent limitations and constraints when extracting data from such pages. This paper presents two different approaches to structured data extraction. The first approach is a non-generic solution based on template detection using the intersection of the Document Object Model (DOM) trees of various pages from the same website. This approach gives better results in terms of efficiency and of accurately locating the main data on a particular page. The second approach is based on a partial tree alignment mechanism that uses important visual features such as the length, size and position of web tables on the pages. This approach is a generic solution, as it does not depend on one particular website and its page template, and it accurately locates the multiple data regions, data records and data items within a given web page. We have compared our results with an existing mechanism and found them to be considerably better for a number of web pages.
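The DOM-intersection idea, treating structure shared by two pages of the same site as template and the rest as data, can be sketched very roughly; the sketch below uses BeautifulSoup with simplified tag-path comparison and illustrative HTML, and is not the paper's actual algorithm.

```python
# Toy sketch of template detection by DOM intersection: leaf-element paths
# whose text is identical on two pages of the same site are treated as
# template; the remaining paths hold page-specific data.
from bs4 import BeautifulSoup

def leaf_paths(html):
    """Map each leaf element's tag path to its text content."""
    soup = BeautifulSoup(html, "html.parser")
    paths = {}
    for node in soup.find_all(True):
        if node.find(True) is None:                      # leaf elements only
            path = "/".join(p.name for p in reversed(list(node.parents)) if p.name)
            paths[path + "/" + node.name] = node.get_text(strip=True)
    return paths

page1 = "<html><body><h1>My Site</h1><div id='c'>Record one</div></body></html>"
page2 = "<html><body><h1>My Site</h1><div id='c'>Record two</div></body></html>"

p1, p2 = leaf_paths(page1), leaf_paths(page2)
template = {path for path in p1 if p2.get(path) == p1[path]}
data = {path: text for path, text in p1.items() if path not in template}
print(data)   # the div holding "Record one" is identified as page-specific data
```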
activities online posing a challenge to the world and in particular Ethiopia. With this everincreasing
volume of social media data, hate speech identification becomes a challenge in
aggravating conflict between citizens of nations. The high rate of production, has become
difficult to collect, store and analyze such big data using traditional detection methods. This
paper proposed the application of apache spark in hate speech detection to reduce the
challenges. Authors developed an apache spark based model to classify Amharic Facebook
posts and comments into hate and not hate. Authors employed Random forest and Naïve Bayes
for learning and Word2Vec and TF-IDF for feature selection. Tested by 10-fold crossvalidation,
the model based on word2vec embedding performed best with 79.83%accuracy. The
proposed method achieve a promising result with unique feature of spark for big data.
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
This article presents Part of Speech tagging for Nepali text using General Regression Neural
Network (GRNN). The corpus is divided into two parts viz. training and testing. The network is
trained and validated on both training and testing data. It is observed that 96.13% words are
correctly being tagged on training set whereas 74.38% words are tagged correctly on testing
data set using GRNN. The result is compared with the traditional Viterbi algorithm based on
Hidden Markov Model. Viterbi algorithm yields 97.2% and 40% classification accuracies on
training and testing data sets respectively. GRNN based POS Tagger is more consistent than the
traditional Viterbi decoding technique.
Semantic web technologies [16] overcome these difficulties to a great extent. A core data representation format for the semantic web is the Resource Description Framework (RDF). RDF is a data model for the web; it is a framework for representing information about resources in graph form. It was primarily intended for representing metadata about WWW resources, such as the title, author, and modification date of a Web page, but it can be used for storing any other data. It is based on subject-predicate-object triples that form a graph of data [18].
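For illustration only (the URIs and property names below are hypothetical, not taken from the paper), a single fact such as "faculty member John Doe works in the CSE department" is expressed as triples in Turtle notation:

    @prefix ex: <http://example.org/journal#> .
    <http://example.org/faculty/JohnDoe> ex:name    "John Doe" ;
                                         ex:worksIn <http://example.org/dept/CSE> .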
All data in the semantic web use RDF as the primary representation language [9]. RDF Schema (RDFS) can be used to describe taxonomies of classes and properties and to build lightweight ontologies from them. Ontologies describe the conceptualization and the structure of the domain, including the domain model with possible restrictions [11]. More detailed ontologies can be created with the Web Ontology Language (OWL). OWL is syntactically embedded into RDF, so, like RDFS, it provides additional standardized vocabulary. For querying RDF data as well as RDFS and OWL ontologies with knowledge bases, the SPARQL Protocol and RDF Query Language (SPARQL) is available. SPARQL is an SQL-like language, but it uses RDF triples and resources both for matching parts of the query and for returning results [10].
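A SPARQL query over such triples reads much like SQL; a minimal sketch against the hypothetical vocabulary above would be:

    PREFIX ex: <http://example.org/journal#>
    SELECT ?person ?dept
    WHERE { ?person ex:worksIn ?dept . }

The variables ?person and ?dept are bound to matching resources, and the bindings are returned as a result table.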
Ujjal Marjit et al. [1] presented a linked data approach to discover resume information by providing a standard data model, data sharing and data reusing. Arup Sarkar et al. [2] provided an approach to publish the data from legacy databases as linked data on the web, thus enabling machine readability. David Camacho et al. [3] extracted information from HTML documents and added semantics through XML to generate rules using Web Mantic with the help of user interaction. Uriv Shah et al. [4] provided a framework for retrieval of documents containing both free text and semantically enriched markup using the DAML+OIL semantic web language. Olaf Hartig et al. [5] developed an approach to execute SPARQL queries over the web of linked data to discover relevant data for query answering during query execution itself, using RDF links between data sources. Peter Haase et al. [6] assumed a globally shared domain ontology, which can not only be used for subsequently classifying the bibliographic metadata, but also for supporting an improved query refinement process. Abirami et al. [7] proposed a semantic based approach for generating reports from HTML pages by converting those HTML pages into pre-processed and formatted CSV files, generating OWL files for the domain, separating contents from the CSV files based on the OWL files and rules to generate RDF files, and querying the RDF files using SPARQL.
This paper proposes a model for storing the content of HTML pages in RDF format using the rules generated by the wrapper induction technique. For example, a university web site may contain the details of all the staff members of all the departments, mostly in HTML pages. The data is collected and consolidated, mostly manually, every year. Though the content is readily available in the HTML pages, manual effort is wasted while consolidating the report. The proposed model generates the report from different HTML documents using RDF and SPARQL. Only the relevant information is extracted and stored in RDF format, which makes further querying easier. This paper is organized into the following sections: section 2 describes the methodology, section 3 presents the experiments conducted and section 4 gives the conclusion.
2. METHODOLOGY
The proposed model includes five different processes, namely (i) extracting structured data from multiple HTML pages, (ii) generating RDF files from the structured data extracted, (iii) creating an ontology for the specific domain relationships, (iv) generating the corresponding SPARQL queries for the user query and (v) generating reports according to the user's needs. Figure 1 shows the various processes in the implementation, which are dealt with in the subsequent sections.
Figure. 1. Model for Information Retrieval
2.1 Pre-Processing of HTML pages
It is important to start from clean HTML pages, because a typical Web page contains a large amount of noise, which can adversely affect data extraction accuracy. The pre-processing phase is carried out before the actual extraction starts, as depicted in Figure 2. The raw HTML is fed into HTML Tidy [13]. This tool checks for the presence of a close tag for every open tag. The resultant well-formed HTML is fed into the noise removal phase, whose output is a clean HTML page.
2.2 Structured Data Extraction & RDF Generation
Once the pre-processing phase is completed, the clean HTML pages are fed into the structured data extraction phase. This is accomplished using the Wrapper Induction Technique [17], a semi-automatic, supervised learning approach. In this approach, a set of extraction rules is learned from a collection of manually labeled pages. The rules are then employed to extract target data items from other similarly formatted pages. The process is pictorially represented in Figure 3.
Most HTML tags work in pairs. Each pair consists of an open tag and a close tag indicated by < >
and </> respectively. Within each corresponding tag-pair, there can be other pairs of tags,
resulting in nested structures. Thus, HTML tags can naturally encode nested data. Notable points are: (a) there are no designated tags for each type, as HTML was not designed as a data encoding language, and any HTML tag can be used for any type; (b) for a tuple type, values (data items) of different attributes are usually encoded differently to distinguish them and to highlight important items.
Figure. 2. Pre-processing of web pages
Figure. 3. Structured data extraction
The extraction is done using a tree structure called the EC tree (embedded catalog tree) [17], which models the data embedding in an HTML page. The EC tree used in this paper is given in Figure 4. The wrapper [18] uses a tree structure based on this to facilitate extraction rule learning and data extraction. Given the EC tree and the rules, any node can be extracted by following the tree path P from the root to the node, extracting each node in P from its parent.
The extraction rules are based on the idea of landmarks. Each landmark is a sequence of consecutive tokens and is used to locate the beginning or the end of a target item.
Some of the rules identified by the tool are:
- An HTML page containing the word "journal" indicates that the faculty member has published journals.
- The very first list ("<ul> </ul>") is the journal list.
- Within a list item (<li> </li>) in the journal list, the content before the bold tag (<b>) gives the author names.
- The content within the bold tag is the title of the published paper.
- The content after the bold tag is the journal name.
- Within the journal name, four consecutive digits in the range 19** to 20** give the year of journal publication.
- Some author names appear within a bold tag and were wrongly included in the journal name during extraction. After careful assessment, an "<auth> </auth>" tag is inserted between individual author names; during extraction, the presence of the "auth" tag within the bold tag is examined to extract the author names and the title precisely.
Figure. 4. Embedded catalog tree
These rules were used to extract the structured data from HTML pages of the same format. The output is checked against the human oracle. In case of conflict, the conflicting items are identified and corrected by replacing or adding rules. These two processes go on iteratively. Once the resultant structured data agrees with the human oracle, the extracted structured data is sent to the RDF generation phase. With the default format of the RDF file and the structured data extracted, the RDF file is generated for its corresponding HTML file.
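To make the landmark-style rules concrete, the following minimal Java sketch (an illustration under assumed inputs, not the authors' wrapper; the HTML fragment, class name and regular expressions are hypothetical) applies the rules above to one cleaned list item:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class JournalItemExtractor {
        // <b>...</b> marks the title; the text before it is the author list, the text after it the journal name.
        private static final Pattern ITEM = Pattern.compile("<li>(.*?)<b>(.*?)</b>(.*?)</li>", Pattern.DOTALL);
        // Four consecutive digits in the range 19xx to 20xx are taken as the publication year.
        private static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");

        public static void main(String[] args) {
            String li = "<li><auth>A. Author</auth>, <auth>B. Author</auth> "
                      + "<b>A Sample Paper Title</b> Journal of Examples, 2012.</li>";
            Matcher m = ITEM.matcher(li);
            if (m.find()) {
                String authors = m.group(1).replaceAll("</?auth>", "").trim(); // strip the helper <auth> tags
                String title   = m.group(2).trim();
                String journal = m.group(3).trim();
                Matcher y = YEAR.matcher(journal);
                String year = y.find() ? y.group() : "";
                System.out.println(authors + " | " + title + " | " + journal + " | " + year);
            }
        }
    }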
2.3 OWL Creation
Domain-specific vocabularies and relationships are established in the ontology creation step using the Protégé tool [8]. The relationship between the generated RDF files is established by an .owl file [12]. The ontology developed in this step is very useful during report generation for correctly identifying the RDF files to search. The SPARQL query [9] is analyzed and the report is generated in the required format using the Jena APIs [11]. This is further explained in the subsequent section.
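As a rough sketch of this step (not the paper's code; it uses the current Apache Jena packages, whose names differ from the older Jena release cited, and the file names, namespace and query are hypothetical), the generated RDF files can be loaded into one model and queried with SPARQL:

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class ReportQuery {
        public static void main(String[] args) {
            // Load the RDF files produced for the relevant department pages into a single model.
            Model model = ModelFactory.createDefaultModel();
            model.read("cse_faculty.rdf");
            model.read("ece_faculty.rdf");

            // A query of the kind the toolkit would generate from the user's request.
            String sparql =
                "PREFIX ex: <http://example.org/journal#> " +
                "SELECT ?author ?title ?year " +
                "WHERE { ?pub ex:author ?author ; ex:title ?title ; ex:year ?year }";

            try (QueryExecution qexec = QueryExecutionFactory.create(QueryFactory.create(sparql), model)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.get("author") + " | " + row.get("title") + " | " + row.get("year"));
                }
            }
        }
    }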
3. EXPERIMENTAL RESULTS
The proposed model is evaluated on the college web site (www.tce.edu) to produce a report on the journal publications of the staff members of the various departments for a particular time period. HTML pages containing journal details are pre-processed. Some faculty members may not have updated their details in their HTML pages; such pages are rejected before the pre-processing phase.
Table 1. Experimental data considered
Dept    # HTML Pages    # Pre-processed Pages
CSE     28              11
ECE     32              26
EEE     30              14
MECH    34              18
CIVIL   23              14
IT      20              11
3.1 HTML to RDF conversion
Each cleaned HTML page is converted into RDF [9], as shown in Figure 5.
Figure. 5. RDF for HTML page
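The RDF shown in Figure 5 is not reproduced in this text. A hypothetical entry of the kind the toolkit generates for one journal publication, written in RDF/XML with an illustrative namespace, might look like:

    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:ex="http://example.org/journal#">
      <rdf:Description rdf:about="http://example.org/faculty/JohnDoe/pub1">
        <ex:author>John Doe</ex:author>
        <ex:title>A Sample Paper Title</ex:title>
        <ex:journal>Journal of Examples</ex:journal>
        <ex:year>2012</ex:year>
      </rdf:Description>
    </rdf:RDF>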
3.2 Evaluation Metrics
We considered precision as a measure [17] before and after the rule refinement. Precision is calculated as the ratio of correctly extracted information to the total information, measured as the number of items correctly extracted under the right RDF tag across the 94 pre-processed HTML pages. Table 2 gives the precision values for the various tags.
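Interpreting the counts in Table 2 (an inference from the reported numbers, e.g. 48 of the 94 pre-processed pages for Author before refinement), the percentages correspond to:

    Precision(tag) = (number of pages whose value is correctly extracted under the right RDF tag) / 94
    e.g. Author, before refinement: 48 / 94 ≈ 51%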
Table 2. Precision for RDF conversion

Extraction of   Before Rule Refinement   After Rule Refinement
Author          51% (48)                 89% (84)
Title           75% (71)                 94% (89)
Journal         100% (94)                100% (94)
Year            68% (64)                 88% (83)
The lower precision in extracting author names is due to improper alignment of the author names, which cannot be corrected in the pre-processing phase. Precision is increased by rule refinement, which deals with the following problems: many consecutive commas, misplaced commas and the absence of commas. The following corrective actions are also taken: (i) within the journal name, four consecutive digits in the range 19** to 20** are considered as the year; (ii) some of the author names within the bold tag are wrongly placed, so an <auth> tag is added between the author names.
3.3 Query Results
For instance, a general query [15] for retrieving all details is given below.
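The query itself did not survive the text extraction; a general query of the kind described, written against the same illustrative namespace used above, might look like:

    PREFIX ex: <http://example.org/journal#>
    SELECT ?author ?title ?journal ?year
    WHERE {
      ?pub ex:author  ?author ;
           ex:title   ?title ;
           ex:journal ?journal ;
           ex:year    ?year .
    }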
3.4 Report Generation
One of the ways of presenting the output is through various charts. The tool uses JFreeChart [14] and provides an easy query interface for various factors; the resulting charts are depicted in the series of Figures 6-9.
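As a sketch of how such a chart can be produced (an illustration assuming the JFreeChart 1.0.x API; the counts and labels are placeholders, and in the toolkit they would come from the SPARQL result set):

    import java.io.File;
    import org.jfree.chart.ChartFactory;
    import org.jfree.chart.ChartUtilities;
    import org.jfree.chart.JFreeChart;
    import org.jfree.chart.plot.PlotOrientation;
    import org.jfree.data.category.DefaultCategoryDataset;

    public class JournalChart {
        public static void main(String[] args) throws Exception {
            // Journal counts per department; placeholders standing in for SPARQL query results.
            DefaultCategoryDataset dataset = new DefaultCategoryDataset();
            dataset.addValue(11, "Journals", "CSE");
            dataset.addValue(26, "Journals", "ECE");
            dataset.addValue(14, "Journals", "EEE");

            JFreeChart chart = ChartFactory.createBarChart(
                    "Journals published per department", "Department", "Count",
                    dataset, PlotOrientation.VERTICAL, false, false, false);
            ChartUtilities.saveChartAsPNG(new File("journals.png"), chart, 640, 480);
        }
    }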
Figure. 6. Number of journals published by each department in the given range of years
Figure. 7. Number of journals published by a given author in a given range of years
Figure. 8. Number of journals published by each department in a given year
Figure. 9. Number of journals published by a given department in the given range of years
The time taken by the manual effort includes the time to traverse the links to access the required HTML document, scan the page for the required details, identify and enter the details in a different location (e.g. a spreadsheet) and generate the report. These steps take time on the order of minutes and, in the worst case, hours, depending on the person doing the job. The toolkit, in contrast, enables the repository to be updated on a regular basis. Once the details are updated, the required information can be obtained and reports can be generated in seconds. The time taken by the tool includes generation of the SPARQL query, identifying the required RDF files from the ontology, extracting details from the identified RDF files and generating the report. The time taken to generate the reports is shown in Table 3.
Table 3. Report generation time
Report No.   Manual Effort (in min)   Toolkit's time (in sec)
R1           1416                     4
R2           184                      3
R3           670                      4
R4           14                       2
4. CONCLUSION
We developed a toolkit to retrieve the web of data using RDF and ontology, with well-refined rules to extract the data. Quick and useful inferences can be made through the easy query building and report generation process. As a future enhancement, inter-department relationships can be modelled using multiple ontologies to bring out inter-college relationships.
REFERENCES
[1] Ujjal Marjit, Kumar Sharma and Utpal Biswas, "Discovering Resume Information using Linked Data", International Journal of Web & Semantic Technology (IJWesT), Vol.3, No.2, April 2012.
[2] Arup Sarkar, Ujjal Marjit and Utpal Biswas, "Linked Data Generation for the University Data from Legacy Database", International Journal of Web & Semantic Technology (IJWesT), Vol.2, No.3, July 2011.
[3] David Camacho and Maria D. R-Moreno, "Web Data Extraction using Semantic Generators", VSP International Science Publishers, 2006.
[4] Uriv Shah, Tim Finin, Anupam Joshi, R. Scott Cost and James Mayfield, "Information Retrieval on the Semantic Web".
[5] Olaf Hartig, Christian Bizer and Johann-Christoph Freytag, "Executing SPARQL Queries over the Web of Linked Data".
[6] Peter Haase, Nenad Stojanovic, York Sure and Johanna Volker, "Personalized Information Retrieval in Bibster, a Semantics-Based Bibliographic Peer-to-Peer System".
[7] A. M. Abirami and Dr. A. Askarunisa, "A Proposal for the Semantic based Report Generation of Related HTML Documents", International Journal of Software Engineering and Technology, Sept 2011.
[8] http://protege.stanford.edu/
[9] http://www.w3schools.com/rdf/rdf_example.asp
[10] http://www.w3.org/TR/rdf-sparql-query/
[11] http://jena.sourceforge.net/tutorial/RDF_API/
[12] http://www.obitko.com/tutorials/ontologies-semanticweb/ontologies.htm
[13] http://tidy.sourceforge.net/
[14] http://www.jfree.org/jfreechart/
[15] http://code.google.com/
[16] http://www.ai.sri.com/~muslea/PS/jaamas-2k.pdf
[17] Bing Liu,"Web data mining - Exploring hyperlinks,contents,and usage data".
[18] Grigoris Antoniou and Frank van Harmelen, "A semantic Web Primer".