Conference object
VI Simposio Internacional de Bibliotecas Digitales (Brasil)
Digital repositories acting as resource aggregators typically face challenges that can be roughly classified into three main categories: extraction, improvement and storage. The first category comprises issues related to dealing with different resource collection protocols (OAI-PMH, web crawling, web services, etc.) and representations (XML, HTML, database tuples, unstructured documents, etc.). The second category comprises information improvements based on controlled vocabularies, specific date formats, correction of malformed data, and so on. Finally, the third category deals with the destination of the downloaded resources: unification into a common database, sorting by certain criteria, etc.
This paper proposes an ETL architecture for designing a software application that provides a comprehensive solution to the challenges a digital repository faces as a resource aggregator.
Design and implementation aspects considered during the development of this tool are described, focusing especially on architecture highlights.
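As a rough illustration of the three categories above (extraction, improvement, storage), the following Python sketch harvests Dublin Core records via OAI-PMH, normalizes dates, and stores the cleaned records in a local SQLite table. The endpoint URL, field choices and date handling are assumptions for illustration only, not the architecture actually proposed in the paper.

# Minimal ETL sketch: extract (OAI-PMH), improve (date normalization), store (SQLite).
# The endpoint and metadata fields are hypothetical.
import sqlite3
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime

OAI_URL = "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"  # hypothetical endpoint
DC = "{http://purl.org/dc/elements/1.1/}"

def normalize_date(raw):
    """Improvement step: try a few common formats and emit ISO 8601, or None."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def harvest(url):
    """Extraction step: pull Dublin Core records from an OAI-PMH ListRecords response."""
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for record in tree.iter("{http://www.openarchives.org/OAI/2.0/}record"):
        title = record.findtext(".//" + DC + "title", default="")
        date = record.findtext(".//" + DC + "date", default="")
        yield title, date

def run():
    db = sqlite3.connect("aggregated.db")          # storage step: unified local database
    db.execute("CREATE TABLE IF NOT EXISTS resources (title TEXT, date TEXT)")
    for title, raw_date in harvest(OAI_URL):
        db.execute("INSERT INTO resources VALUES (?, ?)",
                   (title.strip(), normalize_date(raw_date)))
    db.commit()

if __name__ == "__main__":
    run()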
Full record at: http://sedici.unlp.edu.ar/handle/10915/5529
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A... (IJwest)
Constructing ontologies from relational databases is an active research topic in the Semantic Web domain. While conceptual mapping rules/principles of relational databases and ontology structures are being proposed, several software modules or plug-ins are being developed to enable the automatic conversion of relational databases into ontologies. However, the correlation between the resulting ontologies built automatically with plug-ins from relational databases and the database-to-ontology mapping principles has been given little attention. This study reviews and applies two Protégé plug-ins, namely DataMaster and OntoBase, to automatically construct ontologies from a relational database. The resulting ontologies are further analysed to match their structures against the database-to-ontology mapping principles. A comparative analysis of the matching results reveals that OntoBase outperforms DataMaster in applying the database-to-ontology mapping principles for automatically converting relational databases into ontologies.
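To make the kind of database-to-ontology mapping principles discussed here concrete, the sketch below applies two of the classic rules (table becomes an OWL class, column becomes a datatype property) using rdflib. The tiny schema dictionary and the namespace are invented for illustration and do not correspond to either DataMaster or OntoBase.

# Sketch of two classic database-to-ontology mapping rules:
#   table  -> owl:Class
#   column -> owl:DatatypeProperty with rdfs:domain set to the table's class
# The schema below is a made-up example, not the output of DataMaster or OntoBase.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/ontology#")   # hypothetical namespace
schema = {"author": ["name", "affiliation"],     # table name -> column names
          "paper": ["title", "year"]}

g = Graph()
g.bind("ex", EX)
for table, columns in schema.items():
    cls = EX[table.capitalize()]
    g.add((cls, RDF.type, OWL.Class))                  # rule 1: table becomes a class
    for col in columns:
        prop = EX[col]
        g.add((prop, RDF.type, OWL.DatatypeProperty))  # rule 2: column becomes a property
        g.add((prop, RDFS.domain, cls))

print(g.serialize(format="turtle"))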
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING... (cscpconf)
Most internet applications are built using web technologies like HTML. Web pages are designed so that they display data records from the underlying databases, or simply display text in an unstructured format following some fixed template. Summarizing data dispersed across different web pages is tedious and consumes considerable time and manual effort. A supervised learning technique called wrapper induction can be used across web pages to learn data extraction rules; applying these learnt rules to web pages makes information extraction an easier process. This paper focuses on developing a tool for information extraction from such unstructured data, and the use of semantic web technologies greatly simplifies the process. The tool lets us query data scattered over multiple web pages in distinct ways. This is accomplished through the following steps: extracting the data from multiple web pages, storing it in the form of RDF triples, integrating multiple RDF files using an ontology, generating a SPARQL query based on the user query, and generating a report in the form of tables or charts from the results of the SPARQL query. The relationships between related web pages are identified using the ontology and used to query in better ways, thus enhancing search efficacy.
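A rough sketch of the tail end of that pipeline is given below: a few records already extracted from template-based pages are stored as RDF triples and then queried with SPARQL using rdflib. The wrapper-induction step itself is omitted, and the example namespace and data are invented for illustration.

# Sketch: store extracted records as RDF triples and query them with SPARQL.
# The records would normally come from a wrapper-induction extractor; here they are hard-coded.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/data#")       # hypothetical namespace
extracted = [{"id": "p1", "title": "Semantic search", "year": 2014},
             {"id": "p2", "title": "Data integration", "year": 2012}]

g = Graph()
for rec in extracted:
    node = EX[rec["id"]]
    g.add((node, RDF.type, EX.Paper))
    g.add((node, EX.title, Literal(rec["title"])))
    g.add((node, EX.year, Literal(rec["year"])))

# SPARQL query over the integrated triples, e.g. "papers published after 2013"
q = """
PREFIX ex: <http://example.org/data#>
SELECT ?title WHERE { ?p a ex:Paper ; ex:title ?title ; ex:year ?y . FILTER(?y > 2013) }
"""
for row in g.query(q):
    print(row.title)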
This presentation discusses the following topics:
Introduction
Objectives
What is Data Mining?
Data Mining Applications
Data Warehousing
Advantages and disadvantages
Trends and Current Issues
Future Research Possibilities
In this presentation to BAADD (SF Bay Area), BI Consultant Mark Gschwind shows one of the leading analytic platforms in the real estate industry, AMB Property Corporation's data warehouse. Mark gives the attendees a tour of the infrastructure, explaining the challenges faced and the ways he solved them. He discusses how he achieved near-real-time data latency, which helped drive user adoption. He demos the cubes and an innovative custom application called MyData that helped ensure data quality. The presentation is a good example of how one organization achieved success using BI.
Information on the web has been increasing tremendously in recent years at an ever faster rate. This massive, voluminous data has created intricate problems for information retrieval and knowledge management. As the data resides on the web in several forms, knowledge management on the web is a challenging task. Here the Semantic Web concept may be used so that machines understand web content and offer intelligent services efficiently, with a meaningful knowledge representation. Data retrieval over traditional web sources focuses on page-ranking techniques, whereas in the Semantic Web the data retrieval processes are based on concept-based learning. The proposed work aims at the development of a new framework for automatic generation of an ontology and RDF from real-time web data extracted from multiple repositories by tracing their URIs and text documents. An improved inverted indexing technique is applied for ontology generation, and Turtle notation is used for the RDF serialization. A program is written to validate the data extracted from multiple repositories by removing unwanted data and considering only the document section of the web page.
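The paper's improved inverted indexing step is not detailed here, but the small sketch below shows the baseline idea of an inverted index over cleaned document text, the structure such an approach would build on. The tokenizer and the sample documents are assumptions for illustration.

# Baseline inverted index: term -> set of document ids that contain it.
# The documents stand in for the cleaned "document section" of harvested web pages.
import re
from collections import defaultdict

docs = {"d1": "Semantic web enables machine understandable content",
        "d2": "Ontology and RDF describe web content with meaning"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in re.findall(r"[a-z]+", text.lower()):   # naive tokenizer
        index[term].add(doc_id)

# Posting lists can then back concept lookup during ontology generation.
print(sorted(index["web"]))       # -> ['d1', 'd2']
print(sorted(index["ontology"]))  # -> ['d2']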
Data Integration in Multi-sources Information Systems (ijceronline)
This presentation discusses the following topics:
Concepts
Component Architecture for a DDBMS
Distributed Processing
Parallel DBMS
Advantages of DDBMSs
Disadvantages of DDBMSs
Homogeneous DDBMS
Heterogeneous DDBMS
Open Database Access and Interoperability
Multidatabase System
Functions of a DDBMS
Reference Architecture for DDBMS
Components of a DDBMS
Fragmentation
Transparencies in a DDBMS
Date’s 12 Rules for a DDBMS
Query Optimization Techniques in Graph Databases (ijdms)
Graph databases (GDB) have recently arisen to overcome the limits of traditional databases for storing and managing data with graph-like structure. Today, they are a requirement for many applications that manage graph-like data, such as social networks. Most of the techniques applied to optimize queries in graph databases have been used in traditional databases or distributed systems, or are inspired by graph theory. However, their reuse in graph databases should take into account the main characteristics of graph databases, such as their dynamic structure, highly interconnected data, and ability to efficiently access data relationships. In this paper, we survey the query optimization techniques in graph databases. In particular, we focus on the features they have in
Abstract: Every program, whether in C, Java or any other language, consists of a set of commands based on the logic behind the program as well as the syntax of the language, and performs the task of either fetching or storing data on the computer; this is where the term "data structure" comes in. In computer science, a data structure is a particular way of organizing data in a computer so that it can be used efficiently. Data structures provide a means to manage large amounts of data efficiently, such as large databases and internet indexing services. Usually, efficient data structures are key to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design. Storing and retrieving can be carried out on data stored in both main memory and secondary memory. As different data structures have different uses and benefits, selecting among them is an important task. Therefore the paper covers the basic terms and information regarding data structures in detail, followed by the practical usage of different data structures, which will help the programmer select a data structure that makes the program easier and more flexible.
Keywords: Data structures, Arrays, Lists, Trees.
Article
The Online Journal of New Horizons in Education; vol. 3, no. 1
This work presents an open source web environment for learning the GPSS language in Modeling and Simulation courses. With this environment, students build their models by selecting entities and configuring them instead of writing GPSS code from scratch. Teachers can also create models so that students can apply them, analyze them and interpret the results. To this end, it includes a simulation engine that stores snapshots of models as they are executed and allows students to navigate through these snapshots. The environment may be combined with existing learning management systems.
Full record at: http://sedici.unlp.edu.ar/handle/10915/25669
Conference object
International PKP Scholarly Publishing Conference 2009 (Harbour Centre - Simon Fraser University, Canada)
La Plata National University makes a great effort to support and encourage scientific, technological and artistic research and innovation, maintaining its quality, as well as the development and knowledge transfer that benefit society and generate new scientific, technological and artistic knowledge.
To achieve this purpose, the University fosters the education of high-quality human resources, the promotion of scientific, technological and artistic activity, and its dissemination around the world.
Full record at: http://sedici.unlp.edu.ar/handle/10915/5551
Conference object
World Engineering Education Forum (WEEF 2012) "Educación en Ingeniería para el Desarrollo Sostenible y la inclusión social"
The aim of this paper is to present a historical compendium of ISTEC's activities in the region, highlighting its core ideas and principles and how these have been successfully applied for the benefit of many higher education institutions. These aim at dynamically improving the quality and quantity of coverage of, and access to, education in Latin America. They reflect ISTEC's multidisciplinary approach, based on entrepreneurial activities, not only to educate engineers but to produce the next generation of leaders the region needs. Latin America must be placed on the world map of education, innovation, generation of wealth and intellectual property, with a strong sense of social responsibility. Due to the nature of ISTEC's members and strategic partners, the consortium can leverage and balance the influence of academia, industrial partners and government bodies to make the "Triple Helix" work for the benefit of our peoples in general, across geographical, cultural and social borders.
Full record at: http://sedici.unlp.edu.ar/handle/10915/22944
Conference object
International PKP Scholarly Publishing Conference 2009 (Harbour Centre - Simon Fraser University)
During 2008, La Plata National University (UNLP) stressed in its Strategic Plan that all intellectual creation by its professors, students and researchers must be visible and accessible beyond the scope of the institution: the University has a considerable scientific and academic production, and the world could see it if it were easily available somewhere. In this direction, projects such as SeDiCI UNLP have been strongly strengthened, and new projects have been created: the Portal of Journals and the Portal of Congresses.
Full record at: http://sedici.unlp.edu.ar/handle/10915/5560
Article
Journal of Computer Science & Technology; vol. 6, no. 2
The image recognition process requires identifying every logical object that composes each image, which implies first recognizing it as an object (segmentation) and then identifying which object it is, or at least which is the most likely one from the universe of objects that can be recognized (recognition). During the segmentation process, the aim is to identify as many of the objects that compose the images as possible. This process must be adapted to the universe of objects being looked for, which can vary from printed or handwritten characters to fruits, animals or even fingerprints. Once all objects have been obtained, the system must carry on to the next step, which is the identification of the objects based on that universe. In other words, if the system is looking for fruits, it must univocally distinguish apples from oranges; if they are characters, it must distinguish the character "a" from the rest of the alphabet, and so on. In this document, the character recognition step has been studied; more specifically, which methods for obtaining features exist (advantages and disadvantages, implementations, costs). There is also an overview of the feature vector, in which all features are stored and analyzed in order to perform the character recognition itself.
Full record at: http://sedici.unlp.edu.ar/handle/10915/5520
Conference object
VI International Conference on Education and Information Systems, Technologies and Applications (EISTA) (Florida, United States)
Simulation is the process of executing a model, that is, a representation of a system with enough detail to describe it but without excess. This model has a set of entities, an internal state, a set of input variables that can be controlled and others that cannot, a list of processes that bind these input variables to the entities, and one or more output values that result from the execution of events.
Running a model is of little use if it cannot be analyzed, which means studying all interactions among input variables and model entities and their weight in the values of the output variables. In this work we consider Discrete Event Simulation, which means that the state of the system variables being simulated changes at a countable (finite or countably infinite) set of instants.
Existing GPSS implementations and IDEs provide a wide range of tools for analyzing results, for generating and executing experiments, and for performing complex analyses (such as analysis of variance, screening and so on). This is usually enough for the most common analyses, but more detailed information and much more specific analyses are often required.
In addition, teaching this kind of simulation language is always a challenge, since the way it executes models and the abstraction level its entities reach are totally different from general-purpose programming languages, well known by most students in the area, so it is usually hard for students to understand how everything works under the hood.
We have developed an open source simulation framework that implements a subset of the entities of the GPSS language, which students can use to improve their understanding of these entities. The tool is also able to store all the entities of a simulation at every single simulation time, which is very useful for debugging simulations, but also for keeping a detailed history of all entities (permanent and temporary), knowing exactly how they behaved at every simulation time.
In this paper we provide an overview of this development, placing special emphasis on the simulation model design and on the persistence of simulation entities, which are the basis that students and researchers in the area need in order to extend the model, adapt it to their needs or add analysis tools to the framework.
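As a rough illustration of what a "countable set of instants" means in practice, the sketch below runs a minimal discrete event simulation with an event list ordered by time and records a snapshot of the state after every event, loosely mirroring how the framework described here stores its entities at each simulation time. The queue model and numbers are invented, and this is plain Python, not GPSS.

# Minimal discrete event simulation: arrivals to a single queue, one event at a time.
# After each event the full state is snapshotted, echoing the framework's entity history.
import heapq

events = []                       # future event list: (time, kind)
for t in (1.0, 2.5, 4.0):
    heapq.heappush(events, (t, "arrival"))

state = {"clock": 0.0, "queue_length": 0}
history = []                      # snapshots, one per simulation instant

while events:
    time, kind = heapq.heappop(events)
    state["clock"] = time                          # jump to the next event instant
    if kind == "arrival":
        state["queue_length"] += 1
        heapq.heappush(events, (time + 2.0, "departure"))  # fixed service time of 2.0
    elif kind == "departure":
        state["queue_length"] -= 1
    history.append(dict(state))                    # store a copy for later inspection

for snap in history:
    print(snap)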
Full record at: http://sedici.unlp.edu.ar/handle/10915/5557
Conference object
International Conference on Engineering Education ICEE-2011 (Ireland)
The Ibero-American Science and Technology Education Consortium (ISTEC) is a non-profit organization comprised of educational, research, industrial, and multilateral organizations throughout the Americas and the Iberian Peninsula. The Consortium was established in 1990 to foster scientific, engineering, and technology education, joint international research and development efforts among its members, and to provide a cost-effective vehicle for the application and transfer of technology. After twenty years, ISTEC has established a presence in the region, but it has also experienced problems interacting with different cultures and interests. During 2010 it underwent important changes in its organization, and major efforts were made to accomplish new goals and to share worldwide expertise, facilitate distributed problem solving, and create the local critical mass needed for the development of regional projects in areas such as continuing education, libraries and repositories, globalization of the culture of quality and accreditation standards, R&D, intellectual property development, capital acquisition, and social responsibility, among others. ISTEC continues to be dedicated to the improvement of science, engineering, technology and math education, R&D, and entrepreneurship. The Consortium will foster technology transfer and the development of social and business entrepreneurs through the implementation of a global network that aims to reach other countries in the world, creating clusters of businesses and institutions that share common interests, assisting in the establishment of strategic alliances/joint ventures, and promoting collaborative partnerships in general.
Full record at: http://sedici.unlp.edu.ar/handle/10915/27159
Growth of relational model: Interdependence and complementary to big data (IJECEIAES)
A database management system is a constantly evolving application of science that provides a platform for the creation, movement, and use of voluminous data. The area has witnessed a series of developments and technological advancements, from the conventional structured database to the recent buzzword, big data. This paper aims to provide a complete model of the relational database, which is still widely used because of its well-known ACID properties, namely atomicity, consistency, isolation and durability. Specifically, the objective of this paper is to highlight the adoption of relational model approaches by big data techniques. Towards addressing the reason for this incorporation, this paper qualitatively studies the advancements made over time in the relational data model. First, the variations in the data storage layout are illustrated based on the needs of the application. Second, quick data retrieval techniques like indexing, query processing and concurrency control methods are described. The paper provides vital insights for appraising the efficiency of the structured database in the unstructured environment, particularly when both consistency and scalability become an issue in the working of hybrid transactional and analytical database management systems.
A Framework for Automated Association Mining Over Multiple Databases (Gurdal Ertek)
Literature on association mining, the data mining methodology that investigates associations between items, has primarily focused on efficiently mining larger databases. The motivation for association mining is to use the rules obtained from historical data to influence future transactions. However, associations in transactional processes change significantly over time, implying that rules extracted for a given time interval may not be applicable for a later time interval. Hence, an analysis framework is necessary to identify how associations change over time. This paper presents such a framework, reports the implementation of the framework as a tool, and demonstrates the applicability of and the necessity for the framework through a case study in the domain of finance.
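A toy version of the underlying idea, comparing how the support and confidence of a single association change between two time intervals, is sketched below. The transactions and the rule are invented; the framework in the paper is of course far more general.

# Toy illustration: the same rule {bread} -> {butter} measured in two time windows.
# Real association mining would enumerate many rules; here one rule is checked by counting.

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of antecedent -> consequent over a list of item sets."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

window_2019 = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "jam"}, {"milk"}]
window_2020 = [{"bread"}, {"bread", "milk"}, {"milk", "butter"}, {"bread", "milk"}]

for label, window in (("2019", window_2019), ("2020", window_2020)):
    s, c = rule_stats(window, {"bread"}, {"butter"})
    print(label, "support=%.2f confidence=%.2f" % (s, c))
# A drop between windows signals that the historical rule may no longer apply.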
http://research.sabanciuniv.edu.
Comparison of data recovery techniques on master file table between Aho-Coras... (TELKOMNIKA JOURNAL)
Data recovery is one of the tools used to obtain digital forensic evidence from various storage media, and it relies entirely on the file system used to organize files on those media. In this paper, two recent techniques for file recovery from the NTFS (New Technology File System) file system, logical data recovery and Aho-Corasick data recovery, were studied and examined, and a practical comparison was made between them according to speed and accuracy using three global datasets. It was noted that previous studies in this field have largely ignored the time criterion despite its importance, and that the developed algorithms were not compared against other algorithms. The proposed comparison aims to detect the weaknesses and strengths of both algorithms in order to develop a new algorithm that is more accurate and faster than both. The paper concludes that the logical algorithm was superior to the Aho-Corasick algorithm according to the speed criterion, whereas the algorithms gave the same results according to the accuracy criterion. The paper closes with a set of suggestions for future research aimed at achieving a highly efficient and high-speed data recovery algorithm, such as a file-carving algorithm.
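The NTFS-oriented algorithms compared in the paper are not reproduced here, but the simplified sketch below shows the kind of multi-signature scan such recovery tools perform over a raw image: known file-type headers are searched for and their offsets reported. The signatures and the in-memory "image" are illustrative assumptions; an Aho-Corasick automaton would scan all patterns in a single pass instead of one find loop per signature.

# Simplified signature scan over a raw byte image (file-carving flavour).
# Real recovery tools scan disk images and the MFT; this just searches an in-memory buffer.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png":  b"\x89PNG\r\n\x1a\n",
}

def scan(image):
    """Yield (file_type, offset) for every known header found in the image."""
    for ftype, magic in SIGNATURES.items():
        pos = image.find(magic)
        while pos != -1:
            yield ftype, pos
            pos = image.find(magic, pos + 1)

# Fake "disk image": padding with a PNG header planted at offset 8 and a JPEG header at the end.
image = b"\x00" * 8 + SIGNATURES["png"] + b"\x00" * 16 + SIGNATURES["jpeg"]
for ftype, offset in scan(image):
    print(ftype, "header at offset", offset)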
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN (cscpconf)
Data warehouse systems provide a great way to centralize and converge all of an organization's data in order to facilitate access to huge amounts of information, analysis and decision making. Currently, the conceptual data models of data warehouses do not take into account the semantic dimension of information. However, the semantics of the data model is an important aid that helps users find their way in any application that uses the data warehouse. In this study, we tackle this problem by using ontologies and semantic web techniques to integrate and model information. The contributions of this paper are an ontology for the field of grain trade and a semantic data warehouse which uses the ontology as its conceptual data model.
Brief Introduction to Digital Preservation (Michael Day)
Presentation slides from a lecture given at the University of the West of England (UWE) as part of the MSc in Library and Library Management, University of the West of England, Frenchay Campus, Bristol, March 10, 2010
Organizing Data in a Traditional File Environment
File Organization Terms and Concepts
A computer system organizes data in a hierarchy (see the sketch after this list):
Bit: Smallest unit of data; binary digit (0,1)
Byte: Group of bits that represents a single character
Field: Group of characters as word(s) or number
Record: Group of related fields
File: Group of records of same type
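A small sketch of this hierarchy in code follows: character bytes form fields, fields form a record, and records of the same type are appended to a file. The two-field student record layout is an invented example.

# The storage hierarchy in miniature: bytes -> fields -> records -> file.
# A record here is two fixed-width fields; the layout is an invented example.
import struct

RECORD_FMT = "10s I"          # field 1: 10-byte name, field 2: unsigned int id
records = [("Alice", 1), ("Bob", 2)]

with open("students.dat", "wb") as f:            # file: group of records of the same type
    for name, student_id in records:             # record: group of related fields
        f.write(struct.pack(RECORD_FMT, name.encode("ascii"), student_id))

with open("students.dat", "rb") as f:
    size = struct.calcsize(RECORD_FMT)
    while chunk := f.read(size):                 # each chunk is one record's bytes
        name, student_id = struct.unpack(RECORD_FMT, chunk)
        print(name.rstrip(b"\x00").decode(), student_id)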
Data Warehouse Process and Technology: Warehousing Strategy, Warehouse management and Support Processes.
Warehouse Planning and Implementation.
H/w and O.S. for Data Warehousing, C/Server Computing Model & Data Warehousing, Parallel Processors & Cluster Systems, Distributed DBMS implementations.
Warehousing Software, Warehouse Schema Design.
Data Extraction, Cleanup & Transformation Tools, Warehouse Metadata
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH (ijcsit)
The huge volume of text documents available on the internet has made it difficult for specific users to find valuable information. In fact, the need for efficient applications to extract knowledge of interest from textual documents is vitally important. This paper addresses the problem of responding to user queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a cluster-based information retrieval framework is proposed in order to design and develop a system for analysing and extracting useful patterns from text documents. In this approach, a preprocessing step is first performed to find frequent and high-utility patterns in the data set. Then a Vector Space Model (VSM) is used to represent the dataset. The system was implemented in two main phases. In phase 1, the clustering analysis process is designed and implemented to group documents into several clusters, while in phase 2, an information retrieval process is implemented to rank clusters according to the user queries in order to retrieve the relevant documents from the specific clusters deemed relevant to the query. The results are then evaluated according to the recall and precision (P@5, P@10) of the retrieved results: P@5 was 0.660 and P@10 was 0.655.
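A compact sketch of that pipeline's core pieces, TF-IDF vectors, k-means clustering and precision@k against a relevance list, is shown below using scikit-learn. The corpus, query and relevance judgements are invented, and the paper's frequent/high-utility pattern mining step is omitted.

# Sketch: VSM representation, document clustering, and precision@k for one query.
# Corpus, query and relevance judgements are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = ["stock market prices fall", "football match results today",
        "market analysis of stock trends", "new football transfer news"]
relevant = {0, 2}                               # ground truth for the query below

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                     # VSM: one TF-IDF vector per document
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

query = vec.transform(["stock market report"])
scores = cosine_similarity(query, X).ravel()
ranking = scores.argsort()[::-1]                # documents ranked by similarity to the query

k = 2
p_at_k = len(set(ranking[:k]) & relevant) / k
print("clusters:", labels.tolist(), "P@%d = %.2f" % (k, p_at_k))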
Conference object
Lecture series at the Universidad Nacional Experimental del Táchira (UNET) (Venezuela, 2012)
Objective:
• To share SEDICI's experience in all the areas involved in running the repository: editing, cataloguing, communication and dissemination, supporting software and interoperability, associated services and legal issues, among others.
• To raise awareness of open access in all its forms.
Full record at: http://sedici.unlp.edu.ar/handle/10915/37866
Doctoral thesis
De Giusti, Marisa Raquel; Gordillo, Silvia; Castro, Silvia; Suppi, Remo; Baldasarri, Sandra
Doctorate in Computer Science; Facultad de Informática
Teaching discrete event simulation requires integrating a variety of theoretical concepts and putting them into practice by building and running abstract simulation models, with the goal of gathering information that can be carried over to the real systems. To build models, run them and analyze the results of each run, increasingly sophisticated software tools are used; they allow the elements of a model to be expressed in terms of abstract entities and relationships, and they gather large amounts of data and statistics about each of these model entities. GPSS is one such tool, composed of a block-based programming language and a simulation engine that translates those blocks into the different entities of the model. Although its first version dates from 1961, GPSS is still widely used by professionals and companies, and it is one of the tools most commonly used for teaching discrete event simulation at academic institutions around the world.
The growth in computing power has made it possible to add more and more tools and functions to the different GPSS implementations. While this is an advantage for users, it also demands an ever greater effort from teachers to show their students how to exploit its full potential. Many teachers and researchers have sought to improve the teaching of discrete event simulation from multiple angles: course organization and teaching methodology, the creation of learning elements that help apply the different theoretical concepts, the development of tools for building GPSS models, and the construction of tools for understanding the simulation engine from the inside.
This thesis introduces a software tool for building GPSS models interactively, designed to integrate the theoretical elements of the course with GPSS objects and entities. The tool can also run these models and analyze, in great detail, their evolution through simulation time, which lets students understand how the simulation engine works and how the different entities interact with one another. A teaching proposal based on strong student participation is also included, which, by means of this new tool, helps students absorb the concepts more easily. This teaching proposal was put to the test with students in the systems area, who took a course covering the
Full record at: http://sedici.unlp.edu.ar/handle/10915/29753
Conference object
The Benefits of Model-Driven Development in Institutional Repositories
II Conferencia Internacional Acceso Abierto, Comunicación Científica y Preservación Digital (BIREDIAL) (Colombia, 2012)
Institutional Repositories (IRs) have become consolidated within institutions in scientific and academic areas, as shown by the existing directories of open access repositories and by the daily deposits of articles or documents made through different channels, such as self-archiving by registered users and cataloguing by librarians. IR systems are based on various conceptual models, so this work carries out a bibliographic survey of Model-Driven Development (MDD) in IR systems and applications, with the purpose of exposing the benefits of applying MDD to IRs. MDD is a software construction paradigm that assigns a central, active role to models, under which models are derived, ranging from the most abstract to the most concrete, through successive transformations. This paradigm provides a framework that allows stakeholders to share their points of view and directly manipulate representations of the entities of this domain. The benefits are therefore presented grouped by the actors involved, namely developers, business owners and domain experts. In conclusion, these benefits help the whole IR domain environment concentrate on more formal software implementations, consolidating such systems, whose main beneficiaries will be the end users through the many services that are and will be offered by these systems.
Full record at: http://sedici.unlp.edu.ar/handle/10915/26044
Conference object
I Conferencia sobre Bibliotecas y Repositorios Digitales (BIREDIAL) (Colombia, 2011)
The Intellectual Creation Dissemination Service (SeDiCI) is the institutional digital repository project created within UNLP, intended to act as the central point of dissemination for all the academic production generated within the institution. Given the trend among the world's leading academic institutions to make their scientific output public through open access digital repositories, SeDiCI has become a strategic tool for raising the University's profile.
Since its creation in 2003, SeDiCI has faced various difficulties that have directly or indirectly influenced its development and growth. Despite these problems, SeDiCI currently holds a document database of more than 14,000 of UNLP's own academic resources exposed under Open Access policies, which makes it one of the main repositories of its kind at both the national and regional (Latin American) levels. This paper presents experiences and challenges SeDiCI has faced, describing in each case the problem, its context and the courses of action taken to overcome it. The main topics are: the need for institutional support, cataloguing rules and methodologies, improvements to the services provided to users, and resource importing, among others.
It also describes some of the current challenges and the different research and development lines aimed at solving them and at expanding and improving the services provided to the user community, including harvesting mechanisms and tools, management of large volumes of information, ontologies and semantic repositories, Open Access legislation, Selective Dissemination of Information, and self-archiving.
The main goal of this document is to share the experience gained from this project, in the hope that it proves useful to institutions that are in the process of creating their own institutional repositories or that face problems similar to those described here.
Full record at: http://sedici.unlp.edu.ar/handle/10915/5528
Preprint
DYNA; Edition 184
Institutional Repositories (IRs) have become consolidated in academia, as shown by the growth in the number of records in existing directories, added through different channels: self-archiving by authors and the incorporation of material by librarians, among others. This paper presents a bibliographic survey of the use of the Model-Driven Development (MDD) approach in IR systems, with the aim of establishing a relationship between them. MDD is a software construction paradigm that assigns a central role to models and derives models ranging from the most abstract to the most concrete. This paradigm also provides a framework that allows stakeholders to share their points of view and manipulate representations of the domain entities. In conclusion, following the different lines of research surveyed, together with what is presented here, encourages software implementations for IRs.
Full record at: http://sedici.unlp.edu.ar/handle/10915/35601
Article
e-colabora; vol. 1, no. 2
Since its creation in 2003, the Intellectual Creation Dissemination Service has faced many difficulties that have directly or indirectly affected its development and growth. This work presents some of these experiences, describing in each case the problem, its context and the approaches taken to solve it. It also describes some of the current research and development lines, oriented towards expanding and improving the services provided to the user community. The main purpose of this document is therefore to share the experience gained with this project, in the hope that it may be useful to other institutions working on their own digital repositories.
Full record at: http://sedici.unlp.edu.ar/handle/10915/5527
Technical report
America Learning & Media; Issue 027
Celsius is the software used by the ISTEC members participating in the LibLink initiative to manage their users' requests for bibliographic material, handle provision requests from other participating institutions, facilitate document exchange, and generate statistics that make the exchange transparent and allow the quality of the participants' work to be assessed. The software is developed by UNLP, and its first version dates from 2001. Celsius is offered free of charge to all participating institutions, together with up-to-date documentation and personalized assistance for installation and maintenance, for installing updates, and for training the team of people who will use the tool at each institution. The main feature of the Celsius3 project is the centralized management of all the Celsius instances belonging to LibLink members. This implies, on the one hand, creating instances as new members join, and on the other, centralizing all existing Celsius 1.x and 2.x installations, which in turn involves migrating and normalizing their respective databases. It should be noted that, although the goal is a single, centralized installation of the platform, institutions will keep their own Celsius instance: the users, requests, administrators, communications and statistics of each instance will be managed by each institution independently of the rest. The platform is also designed so that institutions can keep their current domains for accessing Celsius.
Full record at: http://sedici.unlp.edu.ar/handle/10915/34504
Conference object
PKP International Scholarly Publishing Conferences (Mexico, 2013)
One of Universidad Nacional de La Plata's top priorities is the dissemination of all the knowledge generated at the institution, in order to give back to society part of the effort invested in the public university. To achieve this goal, several initiatives have been launched from its administration, new services have been created, and new research and development lines have been implemented to strengthen those services. The Intellectual Creation Dissemination Service (SEDICI) is UNLP's central institutional repository and currently holds all kinds of scientific and academic materials produced at the University, including journal articles, conference papers, graduate theses, undergraduate dissertations, regulations and ordinances, books and e-books, audio documents, educational materials, and photographs of museum pieces, among others. This institutional repository, the largest in Argentina and one of the main ones in Latin America, plays a central role in disseminating UNLP's intellectual production and in coordinating with other services and developments, most notably the UNLP Journal Portal, supported by the Open Journal Systems software, and the UNLP Conference Portal, supported by the Open Conference Systems software. This paper details the different mechanisms implemented at Universidad Nacional de La Plata to facilitate the interaction between its institutional repository SEDICI (built on the DSpace software) and these portals, avoiding duplicated effort and maximizing the open dissemination of knowledge. These mechanisms include technological developments such as the use of several protocols (RSS/Atom, SWORD, OAI-PMH) to establish a unified communication path between the systems, and the development of plugins that export information from the academic production management tools in a format understandable by other systems. All of this also involves establishing workflows among the teams that make up the different services, with the aim of obtaining a unified mechanism covering the methods and tools for exporting the data and incorporating it into the repository. Finally, this paper includes other efforts carried out both by SEDICI and by the University's presidency to ensure the preservation of all intellectual production and to provide new mechanisms to maximize the visibility of these scientific and academic documents. This involves other actors and offices of the institution, such as the University Press (EDULP), the Distance Education Directorate (EAD) and the
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/27406
Objeto de conferencia
XX Asamblea General de ISTEC (Puebla, México, 2014)
Summary of the presentation:
Part 1 - Basic concepts. Repository, interoperability, preservation, guidelines, projects
Part 2 - Preservation metadata
Part 3 - Preservation guidelines: PREMIS, the PREMIS data model, METS, other metadata schemas and further preservation possibilities
Part 4 - OAIS
Part 5 - DSpace: data model, OAIS in DSpace
LibLink panel.
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/34889
Objeto de conferencia
XX Asamblea General de ISTEC (Puebla, México, 2014)
Summary:
- Technological innovation
- Human resources training
- Service quality
- Integration with other initiatives
- Administration and management
LibLink closing session.
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/34941
Objeto de conferencia
III Conferencia Internacional de Biblioteca Digital y Educación a Distancia
This paper presents a harvesting platform designed to relate and unify information available at different places on the Web, which follow different conventions, in order to build a thematic repository that users can browse. The platform will be used at the Servicio de Difusión de la Creación Intelectual (SeDiCI) and combines ontologies and thesauri to provide better-classified information.
Information is currently scattered across web resources, and traditional search engines return ranked lists to the user without providing any semantic relation between documents. Users spend a great deal of time linking documents to one another and figuring out which ones cover the whole problem domain; only after locating the similarities and differences between information fragments can these be transferred to their work and used to create new knowledge.
The proposed platform separates the operating modules from the different domains of interest (topics) so it can be used in different knowledge areas. The development includes two agents that traverse the URLs stored in a database (one responsible for populating an ontology and the other for obtaining related URLs), plus a module capable of recognizing marked-up pages, interpreting the tags, and providing the rules to extract the information and store it in an RDF file; after this stage the information is homogenized and then classified according to a domain ontology.
The platform makes the automatic extraction and search of information more efficient across heterogeneous sources that represent the same concepts following different conventions.
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/5555
Objeto de conferencia
III Simposio Internacional de Bibliotecas Digitales (San Pablo, Brasil)
Handwritten text recognition is part of the initiatives aimed at preserving the cultural heritage held in libraries and archives, which keep a wealth of documents and even handwritten cards accompanying incunabula. This work is the starting point of a research and development project oriented to the digitization and recognition of handwritten material, and the paper presented here discusses different algorithms used in a first stage devoted to "cleaning" noise from the image in order to improve it before character recognition begins. Since PrEBi and SeDiCI belong to library networks that exchange documents digitized by scanning, this development has had an additional use related to improving the images of exchanged documents showing common digitization problems: borders, impurities, skew, etc.; although this is not the aim of this research, it is no minor benefit within library consortium exchanges. For the digitization and recognition of handwritten texts to be efficient, it must be preceded by a "preprocessing" stage of the image, which includes thresholding, noise removal, thinning, baseline straightening and image segmentation, among others. Each of these steps reduces the variability that hinders recognition of handwritten text (noise, random grey levels, character slant, areas with more or less ink), thus increasing the probability of recognizing the text correctly. This work considers two image thinning methods, implements them and finally evaluates them, drawing conclusions about efficiency, speed and requirements, as well as ideas for future implementations. The first part of the document presents some definitions related to the methods used; then the results obtained on a common set of images applying the proposed theories are shown; finally, some ideas to optimize the chosen algorithms are presented.
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/5534
Objeto de conferencia
III Conferencia de Bibliotecas y Repositorios Digitales de América Latina (BIREDIAL) y VIII Simposio Internacional de Bibliotecas Digitales (SIBD) (Costa Rica, 2013)
This work describes two ways of extending the DSpace curation module. First, it describes a set of curation tasks aimed at analyzing and reporting different aspects of data quality, and at providing additional support for preservation tasks in the repository through integrity checks and the generation of new metadata. Second, it proposes modifying the curation task execution strategy provided by DSpace, in order to minimize its impact on application performance and to make the criteria for selecting the resources to be processed more flexible.
The curation tasks considered in this work are listed below (an illustrative sketch of one such task is given after this summary):
● configurable checking of web links to documents hosted on servers external to the repository;
● checking of metadata connected to authorities (or controlled vocabularies) inside or outside the repository, to verify data integrity;
● checking of the files uploaded to the repository, ensuring that every resource has an associated file complying with the repository's rules and policies;
● control of mandatory metadata, according to document type;
● control of metadata domains, according to primitive types such as date, number, text, etc.;
● generation of preservation metadata from the files associated with the resources (e.g. the software used to produce the file, with its version, the format version, the compression level used, etc.);
● testing and subsequent reporting of resources based on logical conditions over metadata and files, using a simple expression language.
Currently, curation tasks run over all the items of a collection, a community or even the whole repository, without interruption and sequentially. This selection and execution strategy places a heavy load on the server hosting the repository for as long as the curation processes run, degrading its performance. Moreover, when more than one task is included in the same execution order, the tasks run sequentially, i.e. a task cannot start until the previous one has completely finished. Hence, this work proposes a new strategy for selecting the resources to process and two new curation task execution strategies:
● a resource selection strategy based on a configurable logical expression (e.g. selecting resources according to the value of their dc.type metadata field);
● strateg
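As a rough illustration of what one of these tasks could look like, the following is a minimal sketch of a mandatory-metadata check written against the DSpace curation API (AbstractCurationTask and the Curator status codes). The exact method names and the way item metadata is read vary between DSpace versions, and the dc.title check is only a hypothetical stand-in for the per-document-type rules described above.

// Hedged sketch of a "mandatory metadata" curation task. API details
// (e.g. how metadata is read from an Item) differ across DSpace versions.
import java.io.IOException;

import org.dspace.content.DSpaceObject;
import org.dspace.content.Item;
import org.dspace.core.Constants;
import org.dspace.curate.AbstractCurationTask;
import org.dspace.curate.Curator;

public class RequiredMetadataTask extends AbstractCurationTask {

    @Override
    public int perform(DSpaceObject dso) throws IOException {
        if (dso.getType() != Constants.ITEM) {
            return Curator.CURATE_SKIP;          // only items are checked
        }
        Item item = (Item) dso;
        // dc.title stands in for the repository's own list of mandatory
        // fields per document type.
        boolean hasTitle = item.getMetadataByMetadataString("dc.title").length > 0;
        if (!hasTitle) {
            String msg = "Item " + item.getHandle() + " is missing dc.title";
            report(msg);                          // written to the curation report
            setResult(msg);
            return Curator.CURATE_FAIL;
        }
        setResult("All mandatory metadata present");
        return Curator.CURATE_SUCCESS;
    }
}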
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/30524
Objeto de conferencia
III Conferencia de Bibliotecas y Repositorios Digitales de América Latina (BIREDIAL) y VIII Simposio Internacional de Bibliotecas Digitales (SIBD) (Costa Rica, 2013)
During 2012, the UNLP Conference Portal team developed a plugin for the OCS software that automates the generation of conference proceedings and, at the same time, considerably simplifies the integration of those proceedings with SEDICI. Broadly speaking, this plugin can export the information of a given conference in different formats, as required in each case. Several export formats have been implemented, such as CSV (comma-separated values), HTML and even text documents (Word).
These three exports take the data of all the papers (title, authors, abstract, institutions, etc.) and combine them into a single file that each conference manager can download. Exporting all the articles of a conference as a single .zip file was also implemented, and the names of the files inside the .zip were changed to the title of the paper they correspond to. In this way, managers can generate books of abstracts very easily, and the institutional repository can load all the papers of each conference much faster.
Although only some export formats have been implemented so far, the plugin was designed from the start to be extended with new formats, through an object-oriented design that implements the design pattern known as Template Method (an illustrative sketch follows below).
This tool was made available to users in several conferences of the UNLP Conference Portal as a test, and the results were satisfactory. The tool is already being used to incorporate conference output into the SEDICI institutional repository.
Even though this development simplified the generation of books of abstracts, the incorporation of papers into the institutional repository can be optimized further in order to avoid loading them manually. For this reason, the Conference Portal team is evaluating different system interoperability technologies. Among them, version 2 of the SWORD protocol stands out as a very promising and relatively simple option. This protocol allows conference managers to deposit papers automatically or semi-automatically from the Conference Portal into the institutional repository, which runs on DSpace. The deposit can be made into a private collection, allowing the SEDICI team to review the papers before confirming their final incorporation into the repository.
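As a language-agnostic illustration of the Template Method design mentioned above, the sketch below (in Java, purely for illustration; the actual plugin is implemented inside OCS) fixes the export steps in an abstract class while each format overrides only the rendering step. All class and method names are hypothetical.

// Generic illustration of the Template Method pattern behind the export plugin.
import java.util.List;

abstract class ProceedingsExporter {

    /** Template method: the fixed sequence of export steps. */
    public final String export(List<Paper> papers) {
        StringBuilder out = new StringBuilder(header());
        for (Paper p : papers) {
            out.append(renderPaper(p));   // format-specific step
        }
        return out.append(footer()).toString();
    }

    protected String header() { return ""; }
    protected String footer() { return ""; }

    /** Each export format overrides only this step. */
    protected abstract String renderPaper(Paper p);
}

class CsvExporter extends ProceedingsExporter {
    @Override
    protected String header() { return "title,authors,abstract\n"; }

    @Override
    protected String renderPaper(Paper p) {
        return String.join(",", p.title(), p.authors(), p.abstractText()) + "\n";
    }
}

record Paper(String title, String authors, String abstractText) {}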
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/30522
Objeto de conferencia
BIREDIAL - Conferencia Internacional Acceso Abierto, Comunicación Científica y Preservación Digital
Digital preservation is defined as the set of political and strategic practices and concrete actions deployed to ensure long-term access to digital objects. The development of institutional repositories, the growth of their content, and the recognition that institutional activity is increasingly channeled through digital media require repositories to accompany their development with preservation activities. This work presents ISO standard 14721 (OAIS), PREMIS metadata and the preservation guidelines, along with the METS schema, to finally explore the metadata in schemas widely used in the everyday work of a repository (MODS, DC) and point out those useful for preservation purposes, proposing their reuse. A second, practical objective is to show the preservation tools offered by the DSpace platform that supports the SeDiCI - UNLP repository.
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/26045
Objeto de conferencia
XX Asamblea General de ISTEC (México, 2014)
Paper presented at the XX ISTEC General Assembly (Puebla, Mexico), describing the different interoperability mechanisms implemented between the institutional repository (SEDICI) and various online services of the Universidad Nacional de La Plata.
Venue: INAOE (Puebla, Mexico).
Speakers via videoconference: Gonzalo Luján Villarreal and Franco Agustín Terruzzi.
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/34200
Articulo
Jornada Virtual de Acceso Abierto
The Servicio de Difusión de la Creación Intelectual (SeDiCI) is the institutional repository of the Universidad Nacional de La Plata (UNLP), created in 2003 with the aim of giving visibility to the academic production of the university, considering that open access makes a greater number of citations, and therefore a greater impact, possible, in keeping with the fundamental role of a public institution in socializing knowledge. SeDiCI currently ranks among the top ten digital repositories in Latin America according to Webometrics, and holds the first position in Argentina as an institutional repository. This work presents some of the main features and services offered by the portal, from its foundation to the present.
Published on CD-ROM in: Jornada Virtual de Acceso Abierto (Argentina 2010).
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/27160
Objeto de conferencia
III International Conference on New Horizons in Education (INTE) (Praga, República Checa)
This work presents an open source web environment to learn GPSS language in Modeling and Simulation courses. With this environment, students build their models by selecting entities and configuring them instead of programming GPSS codes from scratch. Teachers can also create models so that students can apply, analyze and interpret results. Thus, it includes a simulation engine that stores snapshots of models as they are executed, and allows students to navigate through these snapshots. The environment may be combined with existing learning management systems.
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/25674
Objeto de conferencia
IV Simposio Internacional de Bibliotecas Digitales (Málaga, España)
The aim of this paper is to present a progress study together with its results, as well as the impact of implementing SRU/W on the SeDiCI local document repository. The degree of complexity of implementing an SRU/W gateway providing access to the referential information obtained through OAI will also be studied on the basis of the existing literature, along with the expected impact on the OPACs of the Universidad Nacional de La Plata.
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/5562
Objeto de conferencia
III Conferencia de Bibliotecas y Repositorios Digitales de América Latina (BIREDIAL) y VIII Simposio Internacional de Bibliotecas Digitales (SIBD) (Costa Rica, 2013)
Created in December 1990, ISTEC is an international non-profit organization comprising more than 100 educational and research institutions, government agencies, industry partners and multilateral organizations across the Americas and the Iberian Peninsula.
Ver registro completo en: http://sedici.unlp.edu.ar/handle/10915/30511
Extract, transform and load architecture for metadata collection
Extract, Transform and Load architecture for metadata collection
Marisa R. De Giusti
Comisión de Investigaciones Científicas de la Provincia de
Buenos Aires (CICPBA)
Proyecto de Enlace de Bibliotecas (PrEBi)
Servicio de Difusión de la Creación Intelectual (SeDiCI)
Universidad Nacional de La Plata
La Plata, Argentina
marisa.degiusti@sedici.unlp.edu.ar
Nestor F. Oviedo, Ariel J. Lira
Proyecto de Enlace de Bibliotecas (PrEBi)
Servicio de Difusión de la Creación Intelectual (SeDiCI)
Universidad Nacional de La Plata
La Plata, Argentina
{nestor,alira}@sedici.unlp.edu.ar
Abstract—Digital repositories acting as resource aggregators
typically face different challenges, roughly classified in three
main categories: extraction, improvement and storage. The first
category comprises issues related to dealing with different
resource collection protocols: OAI-PMH, web-crawling, web-
services, etc and their representation: XML, HTML, database
tuples, unstructured documents, etc. The second category
comprises information improvements based on controlled
vocabularies, specific date formats, correction of malformed
data, etc. Finally, the third category deals with the destination of
downloaded resources: unification into a common database,
sorting by certain criteria, etc.
This paper proposes an ETL architecture for designing a
software application that provides a comprehensive solution to
challenges posed by a digital repository as resource aggregator.
Design and implementation aspects considered during the
development of this tool are described, focusing especially on
architecture highlights.
Keywords: digital repositories; data integration; data warehousing; harvesting; aggregation
I. INTRODUCTION
Resource aggregation is one of the activities performed in the context of digital repositories. Its goal is usually to increase the number of resources exposed by the repository; there are even digital repositories that act purely as resource aggregators, i.e. they do not expose material of their own.
Aggregation starts with the collection of relevant resources from external data sources. Currently, there are several communication and transfer protocols, as well as techniques for collecting available material from data sources that were not originally intended for this purpose. Some of these protocols and techniques include:
OAI-PMH [1]: a simple and easy-to-deploy protocol for metadata exchange. It does not impose constraints on resource representation, allowing the service repository to select the metadata format (a minimal harvesting sketch follows these descriptions).
Web-Crawling [2]: a robot scans web pages and tries to detect and extract metadata. This method is useful for capturing the large volume of information distributed throughout the Web, but the documents lack a homogeneous structure.
Web-Services: using SOAP or XML-RPC as communication protocols in general; the resource structure depends on the server implementation.
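To make the first of these concrete, the following is a minimal sketch of incremental OAI-PMH harvesting (a ListRecords request followed by resumptionToken paging). The base URL is a placeholder, and a production harvester would parse the XML properly instead of using a regular expression.

// Minimal OAI-PMH ListRecords harvesting sketch (Java 11+).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OaiHarvestSketch {
    private static final Pattern TOKEN =
        Pattern.compile("<resumptionToken[^>]*>([^<]+)</resumptionToken>");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "https://example.org/oai";          // placeholder endpoint
        String url = base + "?verb=ListRecords&metadataPrefix=oai_dc";
        while (url != null) {
            HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            String xml = resp.body();
            // ... hand the <record> elements to the transformation phase ...
            Matcher m = TOKEN.matcher(xml);
            url = m.find()
                ? base + "?verb=ListRecords&resumptionToken=" + m.group(1).trim()
                : null;                                    // no token: harvest finished
        }
    }
}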
As briefly seen above, the methods and means for resource collection vary significantly depending on the situation, context and specific needs of each institutional repository, both at the level of the communication protocol and of the data. Therefore, independent data collection processes and different analysis methodologies are critical to standardizing aspects such as controlled vocabularies, standard code use, etc.
Likewise, when it comes to determining the use of the information collected, there are also different situations that depend on specific repository needs. The most common scenario is the unification of collected resources into a central database. Another usual approach is to sort information logically (i.e. by topic, source country or language), inserting the resources into different data stores. Furthermore, it may be necessary to generate additional information by applying analysis and special processes to resources (concept extraction, semantic detection and ontology population, quotation extraction, etc.) and to use different databases to store this new information.
In general, information from different repositories is typically diverse in structure, character encoding, transfer protocols, etc., requiring different extraction and transformation approaches. Analogously, specific capabilities are required to interact with each data store receiving the transformed resources. Addressing this calls for an organized set of tools, and there are a number of potential complications: initially, it is necessary to find (or develop) a series of tools, each tailored to solve one specific problem, and then install, set up, test and launch each of them. Subsequently, there is the problem of tool coupling and the need to ensure reliable interaction, since it is highly probable that these tools act upon the same dataset. It is also important to consider the type of synchronization mechanism used to determine task sequences: the order of the tasks that each tool will carry out, which tasks can run simultaneously and which ones sequentially, etc.
From a highly abstracted viewpoint, it is possible to identify three main concerns: Extract, Transform and Load (ETL).
II. DEVELOPMENT OF A UNIFIED SOLUTION
ETL [3] is a software architectural pattern in the area of Data Integration, usually related to data warehousing. This process involves data extraction from different sources, subsequent transformations by rules and validations, and final loading into a Data Warehouse [4] or Data Mart [5]. This architecture is used mainly in enterprise software to unify the information used for Business Intelligence [6] processes that lead to decision-making.
Given the challenges presented by resource aggregation via institutional repositories throughout the different phases of heterogeneous information management, a comprehensive ETL solution is highly practicable.
This paper presents the development of a tool intended as a unified solution for these various issues.
The design is based on the following premises:
(a) Allow the use of different data sources and data stores, encapsulating their particular logic in connectable components.
(b) Allow tool extension with new data source and data store components developed by third parties.
(c) Allow selection and configuration of the analysis and transformation filters supplied by the tool, encapsulating their particular logic in connectable components.
(d) Allow tool extension by adding new analysis and transformation filter components developed by third parties.
(e) Present an abstract resource representation for uniform resource transformation.
(f) Provide a simple and intuitive user interface for tool management.
(g) Provide an interface for collection and storage management.
(h) Achieve fault tolerance and resume interrupted processes after external problems.
(i) Provide statistical information about process status and collected information.
The software tool was developed along these premises, keeping components as decoupled as possible. Fig. 1 shows an architecture diagram for the tool; an illustrative sketch of the connector and filter abstractions is given below.
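As a rough illustration of premises (a) through (e), the following hypothetical Java interfaces show how source connectors, transformation filters and target connectors could be modelled as pluggable components that exchange an abstract resource representation. None of these names are taken from the actual tool.

// Hypothetical plugin interfaces for the architecture described above.
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Abstract resource representation: a bag of metadata fields and values. */
interface AbstractResource {
    Map<String, List<String>> fields();
}

/** Source connector: encapsulates one collection protocol (OAI-PMH, crawling, ...). */
interface SourceConnector {
    void configure(Map<String, String> params);
    Iterator<AbstractResource> harvest();       // streams resources from the repository
}

/** Transformation filter: analyses or modifies one resource at a time. */
interface Filter {
    AbstractResource apply(AbstractResource resource);
}

/** Target connector: encapsulates one data store (Solr core, file system, ...). */
interface TargetConnector {
    void configure(Map<String, String> params);
    void store(AbstractResource resource);
}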
III. DATA MODEL OVERVIEW
The data model is primarily based on three elements, from which the whole model is developed: Repositories, Harvest Definitions and Collections.
“Repositories” represent external digital repositories holding relevant resources, and are thus the object of data collection. A repository is an abstract entity that does not determine how resources are obtained; it only registers general information such as the name of the source institution, contact e-mail, web site, etc. In order to harvest resources from a specific repository, connection drivers (components with the logic required to establish connections) must first be associated with it, along with the relevant parameters.
A “Harvest Definition” element comprises all the specifications required to carry out a harvest instance (or particular harvest). That is, harvesting processes are performed according to the harvest definitions that are in the adequate status, i.e. those that still have jobs to carry out. A harvest definition is created from a connector associated with a repository, thus specifying the protocol or harvest method used. This allows the creation of multiple harvests on a single repository, using different communication approaches.
Figure 1. Architecture diagram
“Collections” are the third important element. They represent the various end targets for the information generated after applying transformation and analysis processes to harvested resources. As with repositories, collections are an abstract element within the system: each collection has an associated connector that determines the storage method and its corresponding parameters. The main goal is to allow the use of different storage options, based not only on the storage type but also on the type of information to be stored. For example, consider a collection that specifies storage in the file system as a backup, another collection that specifies insertion into an Apache Solr [7] core for resources identified as theses, and a third collection that specifies insertion into another Apache Solr core for resources written in Spanish.
The data model is completed from the three main elements described above by adding elements associated with connectors and with harvest definitions, supplementary information about repositories, and additional elements for controlling and tracking harvesting methods.
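To make the relationships among these three elements concrete, the following small sketch represents them as plain Java records; the fields shown are only the ones mentioned in this section, and the names are illustrative rather than taken from the actual implementation.

// Illustrative (not actual) domain classes for the core data model elements.
import java.util.List;
import java.util.Map;

record Repository(String name, String contactEmail, String website) {}

record HarvestDefinition(
        Repository repository,
        String connectorId,                 // e.g. "oai-pmh", "web-crawler"
        Map<String, String> connectorParams,
        List<Collection> targetCollections) {}

record Collection(
        String name,
        String targetConnectorId,           // e.g. "solr-theses", "filesystem-backup"
        Map<String, String> storeParams,
        List<String> filterChain) {}        // filter names, applied in order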
IV. EXTRACT
Extract is the first phase of resource harvesting and is carried out in different stages. The first stage determines which harvest definitions must be loaded and run; for this purpose, each definition has scheduling information that specifies the date and time of its next execution. Since harvest definitions are the actual extraction jobs, they hold a reference to the connector used to establish the connection and download the information. Likewise, harvest definitions are specialized with supplementary information (usually parameters) for the associated connector protocol: the connector is the component that carries the logic required to establish the connection, while the harvest definition specialization contains the particular harvesting parameters.
In some cases, harvesting jobs must be carried out in stages for a number of reasons: the data volume is too large and must be partitioned, organizational constraints, etc. A concrete case is the OAI-PMH protocol, which allows incremental harvesting by date range. This is reflected in the data model, where harvest definitions are decomposed into Actual Harvests. For example, an OAI harvest can be created specifying a date range of one year, and this job can then be split into separate one-month harvests, generating twelve harvests that must all be completed to satisfy the initial definition. This also reduces losses due to system crashes, since only the job associated with one part of the harvest would be lost, and not the whole harvest itself.
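As an example of this decomposition, the sketch below shows how a one-year definition might be split into monthly actual harvests; the class and method names are hypothetical, not taken from the tool.

// Sketch: splitting a date-bounded harvest into consecutive one-month slices.
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class HarvestSplitter {

    /** One unit of work: a date-bounded slice of the full harvest. */
    public record ActualHarvest(LocalDate from, LocalDate until) {}

    /** Splits [from, until] into consecutive one-month slices. */
    public static List<ActualHarvest> splitByMonth(LocalDate from, LocalDate until) {
        List<ActualHarvest> slices = new ArrayList<>();
        LocalDate start = from;
        while (start.isBefore(until)) {
            LocalDate end = start.plusMonths(1).isBefore(until)
                ? start.plusMonths(1) : until;
            slices.add(new ActualHarvest(start, end));
            start = end;
        }
        return slices;
    }

    public static void main(String[] args) {
        // A one-year definition becomes twelve independent monthly harvests.
        splitByMonth(LocalDate.of(2011, 1, 1), LocalDate.of(2012, 1, 1))
            .forEach(System.out::println);
    }
}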
This fault tolerance is achieved using “Harvest Attempts”: for each connection a new attempt is registered, which remains valid until the harvest is completed or an interruption occurs, be it a manual interruption by the administrator, a lack of response from the target server, errors in the target server responses, a system crash, etc. A configurable limit determines the number of attempts to try before disabling the harvest.
Figure 2. Extraction phase data model abstraction
Downloaded information is handled by a general handler, common to all connectors, which stores harvested data locally and retrieves it when needed. For example, a particular handler can store data as files on disk.
V. TRANSFORM
This phase initially transforms the harvested resources into a simple abstract representation that allows all resources to be processed uniformly. This transformation is done by the connectors, since they have information about the original representation and the rules that must be applied to raise it to an abstract level. Each resource, already in its abstract representation, then goes through a filter chain that analyzes its particularities and modifies its data if necessary. The system comprises a predetermined set of independent filters: simple, reusable components that act according to the parameters specified in the filter configuration.
As seen above, each harvest definition refers to a set of target collections. Each collection specifies the set of filters that must be applied before a resource is inserted into that collection, and the order in which they were selected determines the order in which they are applied.
Filter execution may modify, add or erase specific resource data (metadata values), depending on each filter's function and configuration.
Available filters in this application include the following (a sketch of the filter mechanism follows this list):
• CopyField: copies content from one field to another. If the target field does not exist, the filter creates it.
• DefaultValue: determines whether a field is missing or has no value. If so, it creates one with a predetermined value.
• FieldRemover: takes a list of fields and removes them from the resource.
• Tokenizer: takes field values and splits them on a specific character sequence, generating multiple additional values.
• Stack: aggregates filters; defines a filter list (with configuration and order) to ensure the order of application.
• ISOLanguage: applied to a field that specifies the resource language; searches for the field value in a language list and replaces the original value with the ISO 639 language code found.
• YearExtractor: applied to a field that contains a date; extracts the year and saves it in a new field.
• Vocabulary: takes field values and contrasts them against a dictionary, unifying word variations and synonyms into a single word.
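By way of illustration, a standalone sketch of the CopyField behaviour described above follows. It operates on a plain map as a stand-in for the abstract resource representation, and the field names and configuration mechanism are hypothetical.

// Standalone sketch of a CopyField-style filter.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CopyFieldFilter {
    private final String sourceField;
    private final String targetField;

    public CopyFieldFilter(String sourceField, String targetField) {
        this.sourceField = sourceField;
        this.targetField = targetField;
    }

    /** Copies every value of sourceField into targetField, creating it if absent. */
    public Map<String, List<String>> apply(Map<String, List<String>> resource) {
        List<String> values = resource.get(sourceField);
        if (values != null && !values.isEmpty()) {
            resource.computeIfAbsent(targetField, k -> new ArrayList<>())
                    .addAll(values);
        }
        return resource;
    }

    public static void main(String[] args) {
        Map<String, List<String>> resource = new HashMap<>();
        resource.put("dc.date.issued", new ArrayList<>(List.of("2011-05-01")));
        new CopyFieldFilter("dc.date.issued", "dc.date").apply(resource);
        System.out.println(resource);   // both fields now carry the date
    }
}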
VI. LOAD
This is the third phase, in which transformed resources are sent to the data stores, completing the scope of this tool. For this purpose, each collection refers to a target connector that contains the logic required to interact with the corresponding data store.
After going through the transformation stage, resources in their abstract representation are sent to the connector associated with the target collection, where they undergo further transformations to produce a representation that matches the data store's requirements.
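For instance, a target connector loading resources into one of the Apache Solr cores mentioned earlier might look roughly like the following SolrJ-based sketch. The core URL, field mapping and class names are assumptions (SolrJ class names also vary between Solr versions), not the tool's actual code.

// Hedged sketch of a Solr load step using the SolrJ client.
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrLoadSketch {

    public static void load(Map<String, List<String>> resource) throws Exception {
        try (SolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/theses").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            // Map each abstract metadata field onto a Solr field of the same name.
            resource.forEach((field, values) ->
                values.forEach(value -> doc.addField(field, value)));
            solr.add(doc);
            solr.commit();
        }
    }
}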
VII. MANAGEMENT
The loading of repositories, collections, harvest definitions, filter selections and so forth is managed through a web application. This web application is included in the software and provides management capabilities for all aspects that make up the tool. More precisely, it allows the management of collections, repositories, harvest definitions, connectors (source and target), languages, publication types, users, roles, system parameters, collection assignments within a harvest definition, and filter selection for a collection, among others. In addition, it has a special section to control the execution of harvesting and storage, from which these processes can be independently started and interrupted, producing a real-time report of the jobs being run.
Finally, simple reports associated with repositories show the status of completed harvests (number of failed harvests, number of harvests that returned no records, etc.), average daily resource downloads, total volume of document downloads, and more. Analogously, the distribution of resources by source is shown for each collection, specifying the amount and the proportion that each source represents within the whole collection.
VIII. FUTURE RESEARCH
This tool has a number of features that allow for further improvements or extensions. Key aspects include:
Transformations: the most important extension point, since it allows interesting processes to be applied to the collected information.
Semantic extraction: detect relations among resources based on the information they contain.
Full-text download: identify fields holding the URL of the full text and attempt to download it, so that further filters can be applied to its content.
Author standardization: analyze author names to generate standardized metadata.
Duplicate detection: provide techniques to avoid inserting two resources (possibly coming from different sources) when they actually represent the same resource.
IX. CONCLUSION
This paper discussed a recurring challenge faced by digital repositories, arising from the collection of resources from diverse sources, and showed how an architecture used mainly in the business domain can provide a solution to these issues. The three main phases of the ETL architecture cover each of the activities performed during resource collection in the context of digital repositories, making it an adequate approach in this area while allowing for further improvements and extensions, as discussed above.
REFERENCES
[1] The Open Archives Initiative Protocol for Metadata Harvesting, http://www.openarchives.org/OAI/openarchivesprotocol.html
[2] A. Maedche, M. Ehrig, S. Handschuh, R. Volz, L. Stojanovic, “Ontology-Focused Crawling of Documents and Relational Metadata”, Proceedings of the Eleventh International World Wide Web Conference (WWW2002), 2002.
[3] P. Minton, D. Steffen, “The Conformed ETL Architecture”, DM Review, 2004, http://amberleaf.net/content/pdfs/ConformedETL.pdf
[4] Data Warehouse, http://en.wikipedia.org/wiki/Data_warehouse
[5] Data Mart, http://en.wikipedia.org/wiki/Data_mart
[6] Business Intelligence, http://en.wikipedia.org/wiki/Business_intelligence
[7] Apache Solr, http://lucene.apache.org/solr
[8] M. L. Nelson, J. A. Smith, I. Garcia del Campo, H. Van de Sompel, X. Liu, “Efficient, Automatic Web Resource Harvesting”, WIDM '06: Proceedings of the 8th Annual ACM International Workshop on Web Information and Data Management, 2006.
[9] C. B. Baruque, R. N. Melo, “Using data warehousing and data mining techniques for the development of digital libraries for LMS”, IADIS International Conference WWW/Internet, 2004.