A data warehouse integrates data from various and heterogeneous data sources and creates a
consolidated view of the data that is optimized for reporting and analysis. Today, business and
technology are constantly evolving, which directly affects the data sources. New data sources can
emerge while some can become unavailable. The DW or the data mart that is based on these data
sources needs to reflect these changes. Various solutions to adapt a data warehouse after the changes
in the data sources and the business requirements have been proposed in the literature [1].
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
The document discusses techniques for detecting similarity and deduplication in document analysis using vector analysis. It proposes analyzing documents by extracting abstract content, separating words and combining them in a word cloud to determine frequency. This approach aims to identify whether documents are duplicates by analyzing word vectors at the word, sentence and paragraph level while also applying techniques like stemming, stopping words and semantic similarity.
Role of Data Cleaning in Data WarehouseRamakant Soni
Data cleaning is an essential part of building a data warehouse as it improves data quality by detecting and removing errors and inconsistencies. Data warehouses integrate large amounts of data from various sources, so the probability of dirty data is high. Clean data is vital for decision making based on the data warehouse. The data cleaning process involves data analysis, defining transformation rules, verification of cleaning, applying transformations, and incorporating cleaned data. Tools can help support the different phases of data cleaning from data profiling to specialized cleaning of particular domains.
Lecture 03 - The Data Warehouse and Design phanleson
The chapter discusses the design process for building a data warehouse, including designing the interface from operational systems and designing the data warehouse itself. It covers topics like beginning with operational data, data and process models, data warehouse data models at different levels, normalization and denormalization, and managing complexity in transformations and integrations between operational and warehouse systems. The goal is to extract, transform, and load relevant data from operational sources into the data warehouse in a way that supports analysis and decision-making.
The document discusses key concepts related to data warehousing including:
1) What data warehousing is, its main components, and differences from OLTP systems.
2) The typical architecture of a data warehouse including operational data sources, storage, and end-user access tools.
3) Important considerations like data flows, integration, management of metadata, and tools/technologies used.
4) Additional topics such as benefits, challenges, administration, and data marts.
Data Mining, KDD Process, Data mining functionalities, Characterization,
Discrimination ,
Association,
Classification,
Prediction,
Clustering,
Outlier analysis, Data Cleaning as a Process
Types of database processing,OLTP VS Data Warehouses(OLAP), Subject-oriented
Integrated
Time-variant
Non-volatile,
Functionalities of Data Warehouse,Roll-Up(Consolidation),
Drill-down,
Slicing,
Dicing,
Pivot,
KDD Process,Application of Data Mining
Data mining involves discovering hidden patterns in data, while data warehousing involves integrating data from multiple sources and storing it in a centralized location to support analysis. Some key differences are:
- Data mining uses techniques like classification, clustering, and association to discover insights from data, while data warehousing focuses on data integration and OLAP tools.
- Data mining looks for unknown relationships and makes predictions, while data warehousing provides a way to extract and analyze historical data.
- Data warehousing involves extracting, cleaning, and transforming data during an ETL process before loading it into a separate database optimized for analysis. Data mining builds on the outputs of data warehousing.
Data mining and data warehouse lab manual updatedYugal Kumar
This document describes experiments conducted for a Data Mining and Data Warehousing Lab course. Experiment 1 involves studying data pre-processing steps using a dataset. Experiment 2 involves implementing a decision tree classification algorithm in Java. Experiment 3 uses the WEKA tool to implement the ID3 decision tree algorithm on a bank dataset, generating and visualizing the decision tree model. The experiments aim to help students understand key concepts in data mining such as pre-processing, classification algorithms, and using tools like WEKA.
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
The document discusses techniques for detecting similarity and deduplication in document analysis using vector analysis. It proposes analyzing documents by extracting abstract content, separating words and combining them in a word cloud to determine frequency. This approach aims to identify whether documents are duplicates by analyzing word vectors at the word, sentence and paragraph level while also applying techniques like stemming, stopping words and semantic similarity.
Role of Data Cleaning in Data WarehouseRamakant Soni
Data cleaning is an essential part of building a data warehouse as it improves data quality by detecting and removing errors and inconsistencies. Data warehouses integrate large amounts of data from various sources, so the probability of dirty data is high. Clean data is vital for decision making based on the data warehouse. The data cleaning process involves data analysis, defining transformation rules, verification of cleaning, applying transformations, and incorporating cleaned data. Tools can help support the different phases of data cleaning from data profiling to specialized cleaning of particular domains.
Lecture 03 - The Data Warehouse and Design phanleson
The chapter discusses the design process for building a data warehouse, including designing the interface from operational systems and designing the data warehouse itself. It covers topics like beginning with operational data, data and process models, data warehouse data models at different levels, normalization and denormalization, and managing complexity in transformations and integrations between operational and warehouse systems. The goal is to extract, transform, and load relevant data from operational sources into the data warehouse in a way that supports analysis and decision-making.
The document discusses key concepts related to data warehousing including:
1) What data warehousing is, its main components, and differences from OLTP systems.
2) The typical architecture of a data warehouse including operational data sources, storage, and end-user access tools.
3) Important considerations like data flows, integration, management of metadata, and tools/technologies used.
4) Additional topics such as benefits, challenges, administration, and data marts.
Data Mining, KDD Process, Data mining functionalities, Characterization,
Discrimination ,
Association,
Classification,
Prediction,
Clustering,
Outlier analysis, Data Cleaning as a Process
Types of database processing,OLTP VS Data Warehouses(OLAP), Subject-oriented
Integrated
Time-variant
Non-volatile,
Functionalities of Data Warehouse,Roll-Up(Consolidation),
Drill-down,
Slicing,
Dicing,
Pivot,
KDD Process,Application of Data Mining
Data mining involves discovering hidden patterns in data, while data warehousing involves integrating data from multiple sources and storing it in a centralized location to support analysis. Some key differences are:
- Data mining uses techniques like classification, clustering, and association to discover insights from data, while data warehousing focuses on data integration and OLAP tools.
- Data mining looks for unknown relationships and makes predictions, while data warehousing provides a way to extract and analyze historical data.
- Data warehousing involves extracting, cleaning, and transforming data during an ETL process before loading it into a separate database optimized for analysis. Data mining builds on the outputs of data warehousing.
Data mining and data warehouse lab manual updatedYugal Kumar
This document describes experiments conducted for a Data Mining and Data Warehousing Lab course. Experiment 1 involves studying data pre-processing steps using a dataset. Experiment 2 involves implementing a decision tree classification algorithm in Java. Experiment 3 uses the WEKA tool to implement the ID3 decision tree algorithm on a bank dataset, generating and visualizing the decision tree model. The experiments aim to help students understand key concepts in data mining such as pre-processing, classification algorithms, and using tools like WEKA.
The METL process extracts meta data from a source system, transforms it to describe a different database structure, and loads it into a target system. This allows data to be accessed across systems with different structures without changing the source. Specifically, it extracts meta data from a legacy banking system, transforms it to work with a business intelligence tool, and loads it so the tool can query the legacy data through an SQL interface without knowing the source's structure. The METL process facilitates data sharing across different systems in a non-invasive way to prolong legacy systems' lives and provide access to production data.
Lecture 02 - The Data Warehouse Environment phanleson
The document summarizes key concepts from Chapter 2 of the book "Building Data WareHouse" by Inmon. It discusses characteristics of a data warehouse including being subject-oriented, integrated from multiple sources, and containing time-variant data. It also covers topics like granularity, exploration and data mining capabilities, partitioning, structuring data, auditing, homogeneity vs heterogeneity, purging data, reporting, operational windows, and handling incorrect data.
The document provides an overview of the key components and considerations for building a data warehouse. It discusses 7 main components: 1) the data warehouse database, 2) sourcing, acquisition, cleanup and transformation tools, 3) metadata, 4) access (query) tools, 5) data marts, 6) data warehouse administration and management, and 7) information delivery systems. It also outlines important design considerations, technical considerations, and implementation considerations that must be addressed when building a data warehouse environment.
This document summarizes statistical disclosure control techniques for protecting private data, specifically microaggregation. Microaggregation involves clustering individual records into small groups to anonymize the data before release. It aims to minimize information loss while preventing re-identification of individuals. The document discusses challenges with multivariate microaggregation and reviews different heuristic approaches. It also covers related topics like k-anonymity algorithms, various clustering techniques for microaggregation like k-means, and using genetic algorithms to handle large datasets.
Description of four techniques for Data Cleaning:
1.DWCLEANER Framework
2.Data Mining Techniques include Association Rule and Functional Dependecies
,...
Enhancement techniques for data warehouse staging areaIJDKP
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
This document discusses scheduling algorithms for processing big data using Hadoop. It provides background on big data and Hadoop, including that big data is characterized by volume, velocity, and variety. Hadoop uses MapReduce and HDFS to process and store large datasets across clusters. The default scheduling algorithm in Hadoop is FIFO, but performance can be improved using alternative scheduling algorithms. The objective is to study and analyze various scheduling algorithms that could increase performance for big data processing in Hadoop.
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...Gihan Wikramanayake
M G N A S Fernando, G N Wikramanayake (2004) "Application of Data Warehousing and Data Mining to Exploitation for Supporting the Planning of Higher Education System in Sri Lanka", In:23rd National Information Technology Conference, pp. 114-120. Computer Society of Sri Lanka Colombo, Sri Lanka: CSSL Jul 8-9, ISBN: 955-9155-12-1
Data warehousing involves extracting and consolidating data from multiple sources into a centralized database to support decision-making across an organization. It provides a comprehensive, integrated, and historical view of data organized by subject area rather than transactionally. Data warehouses use dimensional modeling and star schemas to store aggregated, multidimensional data in a way that facilitates analysis. Online analytical processing (OLAP) tools enable interactive exploration and analysis of multidimensional data through techniques like drill-down, roll-up, and slicing and dicing.
this is the ppt this contains definition of data ware house , data , ware house, data modeling , data warehouse architecture and its type , data warehouse types, single tire, two tire, three tire .
This document introduces an online course on data warehousing from Edureka. It provides an overview of key topics that will be covered in the course, including what a data warehouse is, its architecture, the ETL process, and modeling dimensions and facts. It also shows examples of using PostgreSQL to create tables and Talend to populate them as part of a hands-on project in the course. The course modules will cover data warehousing introduction, dimensions and facts, normalization, modeling, ETL concepts, and a project building a data warehouse using Talend.
Data Linkage is an important step that can provide valuable insights for evidence-based decision making, especially for crucial events. Performing sensible queries across heterogeneous databases containing millions of records is a complex task that requires a complete understanding of each contributing database’s schema to define the structure of its information. The key aim is to approximate the structure
and content of the induced data into a concise synopsis in order to extract and link meaningful data-driven facts. We identify such problems as four major research issues in Data Linkage: associated costs in pairwise matching, record matching overheads, semantic flow of information restrictions, and single order classification limitations. In this paper, we give a literature review of research in Data Linkage. The
purpose for this review is to establish a basic understanding of Data Linkage, and to discuss the
background in the Data Linkage research domain. Particularly, we focus on the literature related to the recent advancements in Approximate Matching algorithms at Attribute Level and Structure Level. Their efficiency, functionality and limitations are critically analysed and open-ended problems have been
exposed.
Data mining , Knowledge Discovery Process, ClassificationDr. Abdul Ahad Abro
The document provides an overview of data mining techniques and processes. It discusses data mining as the process of extracting knowledge from large amounts of data. It describes common data mining tasks like classification, regression, clustering, and association rule learning. It also outlines popular data mining processes like CRISP-DM and SEMMA that involve steps of business understanding, data preparation, modeling, evaluation and deployment. Decision trees are presented as a popular classification technique that uses a tree structure to split data into nodes and leaves to classify examples.
The document discusses data warehousing, data mining, and business intelligence applications. It explains that data warehousing organizes and structures data for analysis, and that data mining involves preprocessing, characterization, comparison, classification, and forecasting of data to discover knowledge. The final stage is presenting discovered knowledge to end users through visualization and business intelligence applications.
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCEcscpconf
Data warehouse use has increased significantly in recent years and now plays a fundamental role in many organizations’ decision-support processes. An effective business intelligence
infrastructure that leverages the power of a data warehouse can deliver value by helping companies enhance their customer experience. In this paper is to generate reports with various
drilldowns and slier conditions with suitable parameters which provide a complete business solution which is helpful for monitor the company inflow and outflow. The goal of the work is
for potential users of the data warehouse in their decision making process in the Business process system to get a complete visual effort of those reports by creating the chart and grid interface from warehouse. The example in this paper relate directly to the Adventure Work Data Warehouse Project implementation which helps to know the internet sales amount according to different date
This document defines a data warehouse as a collection of corporate information derived from operational systems and external sources to support business decisions rather than operations. It discusses the purpose of data warehousing to realize the value of data and make better decisions. Key components like staging areas, data marts, and operational data stores are described. The document also outlines evolution of data warehouse architectures and best practices.
An ontological approach to handle multidimensional schema evolution for data ...ijdms
In recent years, the number of digital information storage and retrieval systems has increased immensely.
Data warehousing has been found to be an extremely useful technology for integrating such heterogeneous
and autonomous information sources. Data within the data warehouse is modelled in the form of a star or
snowflake schema which facilitates business analysis in a multidimensional perspective. As user
requirements are interesting measures of business processes, the data warehouse schema is derived from
the information sources and business requirements. Due to the changing business scenario, the information
sources not only change their data, but also change their schema structure. In addition to the source
changes the business requirements for data warehouse may also change. Both these changes results in data
warehouse schema evolution. These changes can be handled either by just updating it in the DW model, or
can be developed as a new version of the DW structure. Existing approaches either deal with source
changes or requirements changes in a manual way and changes to the data warehouse schema is carried
out at the physical level. This may induce high maintenance costs and complex OLAP server
administration. As ontology seems to be a promising solution for the data warehouse research, in this
paper an ontological approach to automate the evolution of a data warehouse schema is proposed. This
method assists the data warehouse designer in handling evolution at the ontological level based on which
decision can be made to carry out the changes at the physical level. We evaluate the proposed ontological
approach with the existing method of manual adaptation of data warehouse schema.
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
The document discusses techniques for detecting similarity and deduplication in document analysis using vector analysis. It proposes analyzing documents by extracting abstract content, separating words and combining them in a word cloud to determine frequency. This approach aims to identify whether documents are duplicates by analyzing word vectors at the word, sentence and paragraph level while also applying techniques like stemming, stopping words and semantic similarity.
Big Data triggered furthered an influx of research and prospective on concepts and processes pertaining previously to the Data Warehouse field. Some conclude that Data Warehouse as such will disappear;others present Big Data as the natural Data Warehouse evolution (perhaps without identifying a clear division between the two); and finally, some others pose a future of convergence, partially exploring the
possible integration of both. In this paper, we revise the underlying technological features of Big Data and Data Warehouse, highlighting their differences and areas of convergence. Even when some differences exist, both technologies could (and should) be integrated because they both aim at the same purpose: data exploration and decision making support. We explore some convergence strategies, based on the common elements in both technologies. We present a revision of the state-of-the-art in integration proposals from the point of view of the purpose, methodology, architecture and underlying technology, highlighting the
common elements that support both technologies that may serve as a starting point for full integration and we propose a proposal of integration between the two technologies.
Big Data triggered furthered an influx of research and prospective on concepts and processes pertaining
previously to the Data Warehouse field. Some conclude that Data Warehouse as such will disappear;
others present Big Data as the natural Data Warehouse evolution (perhaps without identifying a clear
division between the two); and finally, some others pose a future of convergence, partially exploring the
possible integration of both. In this paper, we revise the underlying technological features of Big Data and
Data Warehouse, highlighting their differences and areas of convergence.
The METL process extracts meta data from a source system, transforms it to describe a different database structure, and loads it into a target system. This allows data to be accessed across systems with different structures without changing the source. Specifically, it extracts meta data from a legacy banking system, transforms it to work with a business intelligence tool, and loads it so the tool can query the legacy data through an SQL interface without knowing the source's structure. The METL process facilitates data sharing across different systems in a non-invasive way to prolong legacy systems' lives and provide access to production data.
Lecture 02 - The Data Warehouse Environment phanleson
The document summarizes key concepts from Chapter 2 of the book "Building Data WareHouse" by Inmon. It discusses characteristics of a data warehouse including being subject-oriented, integrated from multiple sources, and containing time-variant data. It also covers topics like granularity, exploration and data mining capabilities, partitioning, structuring data, auditing, homogeneity vs heterogeneity, purging data, reporting, operational windows, and handling incorrect data.
The document provides an overview of the key components and considerations for building a data warehouse. It discusses 7 main components: 1) the data warehouse database, 2) sourcing, acquisition, cleanup and transformation tools, 3) metadata, 4) access (query) tools, 5) data marts, 6) data warehouse administration and management, and 7) information delivery systems. It also outlines important design considerations, technical considerations, and implementation considerations that must be addressed when building a data warehouse environment.
This document summarizes statistical disclosure control techniques for protecting private data, specifically microaggregation. Microaggregation involves clustering individual records into small groups to anonymize the data before release. It aims to minimize information loss while preventing re-identification of individuals. The document discusses challenges with multivariate microaggregation and reviews different heuristic approaches. It also covers related topics like k-anonymity algorithms, various clustering techniques for microaggregation like k-means, and using genetic algorithms to handle large datasets.
Description of four techniques for Data Cleaning:
1.DWCLEANER Framework
2.Data Mining Techniques include Association Rule and Functional Dependecies
,...
Enhancement techniques for data warehouse staging areaIJDKP
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
This document discusses scheduling algorithms for processing big data using Hadoop. It provides background on big data and Hadoop, including that big data is characterized by volume, velocity, and variety. Hadoop uses MapReduce and HDFS to process and store large datasets across clusters. The default scheduling algorithm in Hadoop is FIFO, but performance can be improved using alternative scheduling algorithms. The objective is to study and analyze various scheduling algorithms that could increase performance for big data processing in Hadoop.
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...Gihan Wikramanayake
M G N A S Fernando, G N Wikramanayake (2004) "Application of Data Warehousing and Data Mining to Exploitation for Supporting the Planning of Higher Education System in Sri Lanka", In:23rd National Information Technology Conference, pp. 114-120. Computer Society of Sri Lanka Colombo, Sri Lanka: CSSL Jul 8-9, ISBN: 955-9155-12-1
Data warehousing involves extracting and consolidating data from multiple sources into a centralized database to support decision-making across an organization. It provides a comprehensive, integrated, and historical view of data organized by subject area rather than transactionally. Data warehouses use dimensional modeling and star schemas to store aggregated, multidimensional data in a way that facilitates analysis. Online analytical processing (OLAP) tools enable interactive exploration and analysis of multidimensional data through techniques like drill-down, roll-up, and slicing and dicing.
this is the ppt this contains definition of data ware house , data , ware house, data modeling , data warehouse architecture and its type , data warehouse types, single tire, two tire, three tire .
This document introduces an online course on data warehousing from Edureka. It provides an overview of key topics that will be covered in the course, including what a data warehouse is, its architecture, the ETL process, and modeling dimensions and facts. It also shows examples of using PostgreSQL to create tables and Talend to populate them as part of a hands-on project in the course. The course modules will cover data warehousing introduction, dimensions and facts, normalization, modeling, ETL concepts, and a project building a data warehouse using Talend.
Data Linkage is an important step that can provide valuable insights for evidence-based decision making, especially for crucial events. Performing sensible queries across heterogeneous databases containing millions of records is a complex task that requires a complete understanding of each contributing database’s schema to define the structure of its information. The key aim is to approximate the structure
and content of the induced data into a concise synopsis in order to extract and link meaningful data-driven facts. We identify such problems as four major research issues in Data Linkage: associated costs in pairwise matching, record matching overheads, semantic flow of information restrictions, and single order classification limitations. In this paper, we give a literature review of research in Data Linkage. The
purpose for this review is to establish a basic understanding of Data Linkage, and to discuss the
background in the Data Linkage research domain. Particularly, we focus on the literature related to the recent advancements in Approximate Matching algorithms at Attribute Level and Structure Level. Their efficiency, functionality and limitations are critically analysed and open-ended problems have been
exposed.
Data mining , Knowledge Discovery Process, ClassificationDr. Abdul Ahad Abro
The document provides an overview of data mining techniques and processes. It discusses data mining as the process of extracting knowledge from large amounts of data. It describes common data mining tasks like classification, regression, clustering, and association rule learning. It also outlines popular data mining processes like CRISP-DM and SEMMA that involve steps of business understanding, data preparation, modeling, evaluation and deployment. Decision trees are presented as a popular classification technique that uses a tree structure to split data into nodes and leaves to classify examples.
The document discusses data warehousing, data mining, and business intelligence applications. It explains that data warehousing organizes and structures data for analysis, and that data mining involves preprocessing, characterization, comparison, classification, and forecasting of data to discover knowledge. The final stage is presenting discovered knowledge to end users through visualization and business intelligence applications.
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCEcscpconf
Data warehouse use has increased significantly in recent years and now plays a fundamental role in many organizations’ decision-support processes. An effective business intelligence
infrastructure that leverages the power of a data warehouse can deliver value by helping companies enhance their customer experience. In this paper is to generate reports with various
drilldowns and slier conditions with suitable parameters which provide a complete business solution which is helpful for monitor the company inflow and outflow. The goal of the work is
for potential users of the data warehouse in their decision making process in the Business process system to get a complete visual effort of those reports by creating the chart and grid interface from warehouse. The example in this paper relate directly to the Adventure Work Data Warehouse Project implementation which helps to know the internet sales amount according to different date
This document defines a data warehouse as a collection of corporate information derived from operational systems and external sources to support business decisions rather than operations. It discusses the purpose of data warehousing to realize the value of data and make better decisions. Key components like staging areas, data marts, and operational data stores are described. The document also outlines evolution of data warehouse architectures and best practices.
An ontological approach to handle multidimensional schema evolution for data ...ijdms
In recent years, the number of digital information storage and retrieval systems has increased immensely.
Data warehousing has been found to be an extremely useful technology for integrating such heterogeneous
and autonomous information sources. Data within the data warehouse is modelled in the form of a star or
snowflake schema which facilitates business analysis in a multidimensional perspective. As user
requirements are interesting measures of business processes, the data warehouse schema is derived from
the information sources and business requirements. Due to the changing business scenario, the information
sources not only change their data, but also change their schema structure. In addition to the source
changes the business requirements for data warehouse may also change. Both these changes results in data
warehouse schema evolution. These changes can be handled either by just updating it in the DW model, or
can be developed as a new version of the DW structure. Existing approaches either deal with source
changes or requirements changes in a manual way and changes to the data warehouse schema is carried
out at the physical level. This may induce high maintenance costs and complex OLAP server
administration. As ontology seems to be a promising solution for the data warehouse research, in this
paper an ontological approach to automate the evolution of a data warehouse schema is proposed. This
method assists the data warehouse designer in handling evolution at the ontological level based on which
decision can be made to carry out the changes at the physical level. We evaluate the proposed ontological
approach with the existing method of manual adaptation of data warehouse schema.
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
The document discusses techniques for detecting similarity and deduplication in document analysis using vector analysis. It proposes analyzing documents by extracting abstract content, separating words and combining them in a word cloud to determine frequency. This approach aims to identify whether documents are duplicates by analyzing word vectors at the word, sentence and paragraph level while also applying techniques like stemming, stopping words and semantic similarity.
Big Data triggered furthered an influx of research and prospective on concepts and processes pertaining previously to the Data Warehouse field. Some conclude that Data Warehouse as such will disappear;others present Big Data as the natural Data Warehouse evolution (perhaps without identifying a clear division between the two); and finally, some others pose a future of convergence, partially exploring the
possible integration of both. In this paper, we revise the underlying technological features of Big Data and Data Warehouse, highlighting their differences and areas of convergence. Even when some differences exist, both technologies could (and should) be integrated because they both aim at the same purpose: data exploration and decision making support. We explore some convergence strategies, based on the common elements in both technologies. We present a revision of the state-of-the-art in integration proposals from the point of view of the purpose, methodology, architecture and underlying technology, highlighting the
common elements that support both technologies that may serve as a starting point for full integration and we propose a proposal of integration between the two technologies.
Big Data triggered furthered an influx of research and prospective on concepts and processes pertaining
previously to the Data Warehouse field. Some conclude that Data Warehouse as such will disappear;
others present Big Data as the natural Data Warehouse evolution (perhaps without identifying a clear
division between the two); and finally, some others pose a future of convergence, partially exploring the
possible integration of both. In this paper, we revise the underlying technological features of Big Data and
Data Warehouse, highlighting their differences and areas of convergence.
Big Data triggered furthered an influx of research and prospective on concepts and processes pertaining previously to the Data Warehouse field. Some conclude that Data Warehouse as such will disappear; others present Big Data as the natural Data Warehouse evolution (perhaps without identifying a clear division between the two); and finally, some others pose a future of convergence, partially exploring the possible integration of both. In this paper, we revise the underlying technological features of Big Data and Data Warehouse, highlighting their differences and areas of convergence. Even when some differences exist, both technologies could (and should) be integrated because they both aim at the same purpose: data exploration and decision making support. We explore some convergence strategies, based on the common elements in both technologies. We present a revision of the state-of-the-art in integration proposals from the point of view of the purpose, methodology, architecture and underlying technology, highlighting the common elements that support both technologies that may serve as a starting point for full integration and we propose a proposal of integration between the two technologies.
This document provides an overview of big data storage technologies and their role in the big data value chain. It identifies key insights about data storage, including that scalable storage technologies have enabled virtually unbounded data storage and advanced analytics across sectors. However, lack of standards and challenges in distributing graph-based data limit interoperability and scalability. The document also notes the social and economic impacts of big data storage in enabling a data-driven society and transforming sectors like health and media through consolidated data analysis.
Big Data is used to store huge volume of both structured and unstructured data which is so large and is
hard to process using current / traditional database tools and software technologies. The goal of Big Data
Storage Management is to ensure a high level of data quality and availability for business intellect and big
data analytics applications. Graph database which is not most popular NoSQL database compare to
relational database yet but it is a most powerful NoSQL database which can handle large volume of data in
very efficient way. It is very difficult to manage large volume of data using traditional technology. Data
retrieval time may be more as per database size gets increase. As solution of that NoSQL databases are
available. This paper describe what is big data storage management, dimensions of big data, types of data,
what is structured and unstructured data, what is NoSQL database, types of NoSQL database, basic
structure of graph database, advantages, disadvantages and application area and comparison of various
graph database.
Big Data is used to store huge volume of both structured and unstructured data which is so large and is
hard to process using current / traditional database tools and software technologies. The goal of Big Data
Storage Management is to ensure a high level of data quality and availability for business intellect and big
data analytics applications. Graph database which is not most popular NoSQL database compare to
relational database yet but it is a most powerful NoSQL database which can handle large volume of data in
very efficient way. It is very difficult to manage large volume of data using traditional technology. Data
retrieval time may be more as per database size gets increase. As solution of that NoSQL databases are
available. This paper describe what is big data storage management, dimensions of big data, types of data,
what is structured and unstructured data, what is NoSQL database, types of NoSQL database, basic
structure of graph database, advantages, disadvantages and application area and comparison of various
graph database.
A Study on Graph Storage Database of NOSQLIJSCAI Journal
Big Data is used to store huge volume of both structured and unstructured data which is so large and is
hard to process using current / traditional database tools and software technologies. The goal of Big Data
Storage Management is to ensure a high level of data quality and availability for business intellect and big
data analytics applications. Graph database which is not most popular NoSQL database compare to
relational database yet but it is a most powerful NoSQL database which can handle large volume of data in
very efficient way. It is very difficult to manage large volume of data using traditional technology. Data
retrieval time may be more as per database size gets increase. As solution of that NoSQL databases are
available. This paper describe what is big data storage management, dimensions of big data, types of data,
what is structured and unstructured data, what is NoSQL database, types of NoSQL database, basic
structure of graph database, advantages, disadvantages and application area and comparison of various
graph database.
A Study on Graph Storage Database of NOSQLIJSCAI Journal
This document summarizes a research paper on graph storage databases in NoSQL. It discusses big data and the need for alternative databases to handle large, diverse datasets. It defines the key aspects of big data including volume, velocity, variety and complexity. It also describes different types of NoSQL databases, focusing on the basic structure of graph databases. Graph databases use nodes and relationships to model connected data. The document compares several graph database systems and discusses advantages like performance and flexibility as well as disadvantages like complexity. It outlines several applications of graph databases in areas like social networks and logistics.
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
This document discusses data mining of big data using Hadoop and MongoDB. It provides an overview of Hadoop and MongoDB and their uses in big data analysis. Specifically, it proposes using Hadoop for distributed processing and MongoDB for data storage and input. The document reviews several related works that discuss big data analysis using these tools, as well as their capabilities for scalable data storage and mining. It aims to improve computational time and fault tolerance for big data analysis by mining data stored in Hadoop using MongoDB and MapReduce.
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
This document discusses data mining of big data using Hadoop and MongoDB. It provides an overview of Hadoop and MongoDB and their uses in big data analysis. Specifically, it proposes using Hadoop for distributed processing and MongoDB for data storage and input. The document reviews several related works that discuss big data analysis using these tools, as well as their capabilities for scalable data storage and mining. It aims to improve computational time and fault tolerance for big data analysis by mining data stored in Hadoop using MongoDB and MapReduce.
This document describes a proposed tool called Warehouse Creator that can automatically generate data warehouses from heterogeneous data sources within an enterprise. The tool extracts data from various data sources like databases and files, integrates the data by generating dimension and fact tables, and provides a web interface for users to search and retrieve information from the warehouse without needing direct access to the underlying data sources. The tool aims to address issues like the need for users to have detailed knowledge of different data sources and query languages by providing a centralized warehouse that integrates data from multiple sources.
This document discusses data migration in schemaless NoSQL databases. It begins by defining NoSQL databases and comparing them to traditional relational databases. It then covers aggregate data models and the concepts of schemalessness and implicit schemas in NoSQL databases. The main focus is on data migration when an implicit schema changes, including principles, strategies, and test options for ensuring data matches the new implicit schema in applications.
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONSijdms
ABSTRACT
The amount of data stored in IoT databases increases as the IoT applications extend throughout smart city appliances, industry and agriculture. Contemporary database systems must process huge amounts of sensory and actuator data in real-time or interactively. Facing this first wave of IoT revolution, database vendors struggle day-by-day in order to gain more market share, develop new capabilities and attempt to overcome the disadvantages of previous releases, while providing features for the IoT.
There are two popular database types: The Relational Database Management Systems and NoSQL databases, with NoSQL gaining ground on IoT data storage. In the context of this paper these two types are examined. Focusing on open source databases, the authors experiment on IoT data sets and pose an answer to the question which one performs better than the other. It is a comparative study on the performance of the commonly market used open source databases, presenting results for the NoSQL MongoDB database and SQL databases of MySQL and PostgreSQL
The aim of this paper is to evaluate, through indexing techniques, the performance of Neo4j and
OrientDB, both graph databases technologies and to come up with strength and weaknesses os each
technology as a candidate for a storage mechanism of a graph structure. An index is a data structure that
makes the searching faster for a specific node in concern of graph databases. The referred data structure
is habitually a B-tree, however, can be a hash table or some other logic structure as well. The pivotal
point of having an index is to speed up search queries, primarily by reducing the number of nodes in a
graph or table to be examined. Graphs and graph databases are more commonly associated with social
networking or “graph search” style recommendations. Thus, these technologies remarkably are a core
technology platform for some Internet giants like Hi5, Facebook, Google, Badoo, Twitter and LinkedIn.
The key to understanding graph database systems, in the social networking context, is they give equal
prominence to storing both the data (users, favorites) and the relationships between them (who liked
what, who ‘follows’ whom, which post was liked the most, what is the shortest path to ‘reach’ who). By a
suitable application case study, in case a Twitter social networking of almost 5,000 nodes imported in
local servers (Neo4j and Orient-DB), one queried to retrieval the node with the searched data, first
without index (full scan), and second with index, aiming at comparing the response time (statement query
time) of the aforementioned graph databases and find out which of them has a better performance (the
speed of data or information retrieval) and in which case. Thereof, the main results are presented in the
section 6.
MEGA is a freely available software that allows users to conduct statistical analysis of molecular evolution and construct phylogenetic trees. It is an integrated tool that can perform sequence alignment, infer phylogenetic trees, estimate divergence times and molecular evolution rates, infer ancestral sequences, and test evolutionary hypotheses. MEGA is widely used by biologists to reconstruct the evolutionary histories of species.
Analysis and evaluation of riak kv cluster environment using basho benchStevenChike
This document analyzes and evaluates the performance of the Riak KV NoSQL database cluster using the Basho-bench benchmark tool. Experiments were conducted on a 5-node Riak KV cluster to test throughput and latency under different workloads, data sizes, and operations (read, write, update). The results found that Riak KV can handle large volumes of data and various workloads effectively with good throughput, though latency increased with larger data sizes. Overall, Riak KV is suitable for distributed big data environments where high availability, scalability and fault tolerance are important.
The document discusses key concepts in data warehousing including:
1) The distinction between data and information, with data becoming valuable when organized and presented as information for decision making.
2) Characteristics of a data warehouse including being subject-oriented, integrated, non-volatile, time-variant, and accessible to end-users.
3) Differences between operational data and data warehouse data including the data warehouse being subject-oriented, summarized over time, and serving managerial communities rather than transactional needs.
Similar to A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CASE STUDY USING THE NCBI DATABASE FOR GENETIC VARIATION (20)
Home security is of paramount importance in today's world, where we rely more on technology, home
security is crucial. Using technology to make homes safer and easier to control from anywhere is
important. Home security is important for the occupant’s safety. In this paper, we came up with a low cost,
AI based model home security system. The system has a user-friendly interface, allowing users to start
model training and face detection with simple keyboard commands. Our goal is to introduce an innovative
home security system using facial recognition technology. Unlike traditional systems, this system trains
and saves images of friends and family members. The system scans this folder to recognize familiar faces
and provides real-time monitoring. If an unfamiliar face is detected, it promptly sends an email alert,
ensuring a proactive response to potential security threats.
In the era of data-driven warfare, the integration of big data and machine learning (ML) techniques has
become paramount for enhancing defence capabilities. This research report delves into the applications of
big data and ML in the defence sector, exploring their potential to revolutionize intelligence gathering,
strategic decision-making, and operational efficiency. By leveraging vast amounts of data and advanced
algorithms, these technologies offer unprecedented opportunities for threat detection, predictive analysis,
and optimized resource allocation. However, their adoption also raises critical concerns regarding data
privacy, ethical implications, and the potential for misuse. This report aims to provide a comprehensive
understanding of the current state of big data and ML in defence, while examining the challenges and
ethical considerations that must be addressed to ensure responsible and effective implementation.
Cloud Computing, being one of the most recent innovative developments of the IT world, has been
instrumental not just to the success of SMEs but, through their productivity and innovative contribution to
the economy, has even made a remarkable contribution to the economic growth of the United States. To
this end, the study focuses on how cloud computing technology has impacted economic growth through
SMEs in the United States. Relevant literature connected to the variables of interest in this study was
reviewed, and secondary data was generated and utilized in the analysis section of this paper. The findings
of this paper revealed that there have been meaningful contributions that the usage of virtualization has
made in the commercial dealings of small firms in the United States, and this has also been reflected in the
economic growth of the country. This paper further revealed that as important as cloud-based software is,
some SMEs are still skeptical about how it can help improve their business and increase their bottom line
and hence have failed to adopt it. Apart from the SMEs, some notable large firms in different industries,
including information and educational services, have adopted cloud computing technology and hence
contributed to the economic growth of the United States. Lastly, findings from our inferential statistics
revealed that no discernible change has occurred in innovation between small and big businesses in the
adoption of cloud computing. Both categories of businesses adopt cloud computing in the same way, and
their contribution to the American economy has no significant difference in the usage of virtualization.
Energy-constrained Wireless Sensor Networks (WSNs) have garnered significant research interest in
recent years. Multiple-Input Multiple-Output (MIMO), or Cooperative MIMO, represents a specialized
application of MIMO technology within WSNs. This approach operates effectively, especially in
challenging and resource-constrained environments. By facilitating collaboration among sensor nodes,
Cooperative MIMO enhances reliability, coverage, and energy efficiency in WSN deployments.
Consequently, MIMO finds application in diverse WSN scenarios, spanning environmental monitoring,
industrial automation, and healthcare applications.
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication. IJCSIT publishes original research papers and review papers, as well as auxiliary material such as: research papers, case studies, technical reports etc.
With growing, Car parking increases with the number of car users. With the increased use of smartphones
and their applications, users prefer mobile phone-based solutions. This paper proposes the Smart Parking
Management System (SPMS) that depends on Arduino parts, Android applications, and based on IoT. This
gave the client the ability to check available parking spaces and reserve a parking spot. IR sensors are
utilized to know if a car park space is allowed. Its area data are transmitted using the WI-FI module to the
server and are recovered by the mobile application which offers many options attractively and with no cost
to users and lets the user check reservation details. With IoT technology, the smart parking system can be
connected wirelessly to easily track available locations.
Welcome to AIRCC's International Journal of Computer Science and Information Technology (IJCSIT), your gateway to the latest advancements in the dynamic fields of Computer Science and Information Systems.
Computer-Assisted Language Learning (CALL) are computer-based tutoring systems that deal with
linguistic skills. Adding intelligence in such systems is mainly based on using Natural Language
Processing (NLP) tools to diagnose student errors, especially in language grammar. However, most such
systems do not consider the modeling of student competence in linguistic skills, especially for the Arabic
language. In this paper, we will deal with basic grammar concepts of the Arabic language taught for the
fourth grade of the elementary school in Egypt. This is through Arabic Grammar Trainer (AGTrainer)
which is an Intelligent CALL. The implemented system (AGTrainer) trains the students through different
questions that deal with the different concepts and have different difficulty levels. Constraint-based student
modeling (CBSM) technique is used as a short-term student model. CBSM is used to define in small grain
level the different grammar skills through the defined skill structures. The main contribution of this paper
is the hierarchal representation of the system's basic grammar skills as domain knowledge. That
representation is used as a mechanism for efficiently checking constraints to model the student knowledge
and diagnose the student errors and identify their cause. In addition, satisfying constraints and the number
of trails the student takes for answering each question and fuzzy logic decision system are used to
determine the student learning level for each lesson as a long-term model. The results of the evaluation
showed the system's effectiveness in learning in addition to the satisfaction of students and teachers with its
features and abilities.
In the realm of computer security, the importance of efficient and reliable user authentication methods has
become increasingly critical. This paper examines the potential of mouse movement dynamics as a
consistent metric for continuous authentication. By analysing user mouse movement patterns in two
contrasting gaming scenarios, "Team Fortress" and "Poly Bridge," we investigate the distinctive
behavioral patterns inherent in high-intensity and low-intensity UI interactions. The study extends beyond
conventional methodologies by employing a range of machine learning models. These models are carefully
selected to assess their effectiveness in capturing and interpreting the subtleties of user behavior as
reflected in their mouse movements. This multifaceted approach allows for a more nuanced and
comprehensive understanding of user interaction patterns. Our findings reveal that mouse movement
dynamics can serve as a reliable indicator for continuous user authentication. The diverse machine
learning models employed in this study demonstrate competent performance in user verification, marking
an improvement over previous methods used in this field. This research contributes to the ongoing efforts to
enhance computer security and highlights the potential of leveraging user behavior, specifically mouse
dynamics, in developing robust authentication systems.
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication.
Image segmentation and classification tasks in computer vision have proven to be highly effective using neural networks, specifically Convolutional Neural Networks (CNNs). These tasks have numerous
practical applications, such as in medical imaging, autonomous driving, and surveillance. CNNs are capable
of learning complex features directly from images and achieving outstanding performance across several
datasets. In this work, we have utilized three different datasets to investigate the efficacy of various preprocessing and classification techniques in accurssedately segmenting and classifying different structures
within the MRI and natural images. We have utilized both sample gradient and Canny Edge Detection
methods for pre-processing, and K-means clustering have been applied to segment the images. Image
augmentation improves the size and diversity of datasets for training the models for image classification
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication.
This research aims to further understanding in the field of continuous authentication using behavioural
biometrics. We are contributing a novel dataset that encompasses the gesture data of 15 users playing
Minecraft with a Samsung Tablet, each for a duration of 15 minutes. Utilizing this dataset, we employed
machine learning (ML) binary classifiers, being Random Forest (RF), K-Nearest Neighbors (KNN), and
Support Vector Classifier (SVC), to determine the authenticity of specific user actions. Our most robust
model was SVC, which achieved an average accuracy of approximately 90%, demonstrating that touch
dynamics can effectively distinguish users. However, further studies are needed to make it viable option
for authentication systems. You can access our dataset at the following
link:https://github.com/AuthenTech2023/authentech-repo
This paper discusses the capabilities and limitations of GPT-3 (0), a state-of-the-art language model, in the
context of text understanding. We begin by describing the architecture and training process of GPT-3, and
provide an overview of its impressive performance across a wide range of natural language processing
tasks, such as language translation, question-answering, and text completion. Throughout this research
project, a summarizing tool was also created to help us retrieve content from any types of document,
specifically IELTS (0) Reading Test data in this project. We also aimed to improve the accuracy of the
summarizing, as well as question-answering capabilities of GPT-3 (0) via long text
In the realm of computer security, the importance of efficient and reliable user authentication methods has
become increasingly critical. This paper examines the potential of mouse movement dynamics as a
consistent metric for continuous authentication. By analysing user mouse movement patterns in two
contrasting gaming scenarios, "Team Fortress" and "Poly Bridge," we investigate the distinctive
behavioral patterns inherent in high-intensity and low-intensity UI interactions. The study extends beyond
conventional methodologies by employing a range of machine learning models. These models are carefully
selected to assess their effectiveness in capturing and interpreting the subtleties of user behavior as
reflected in their mouse movements. This multifaceted approach allows for a more nuanced and
comprehensive understanding of user interaction patterns. Our findings reveal that mouse movement
dynamics can serve as a reliable indicator for continuous user authentication. The diverse machine
learning models employed in this study demonstrate competent performance in user verification, marking
an improvement over previous methods used in this field. This research contributes to the ongoing efforts to
enhance computer security and highlights the potential of leveraging user behavior, specifically mouse
dynamics, in developing robust authentication systems.
Image segmentation and classification tasks in computer vision have proven to be highly effective using neural networks, specifically Convolutional Neural Networks (CNNs). These tasks have numerous
practical applications, such as in medical imaging, autonomous driving, and surveillance. CNNs are capable
of learning complex features directly from images and achieving outstanding performance across several
datasets. In this work, we have utilized three different datasets to investigate the efficacy of various preprocessing and classification techniques in accurssedately segmenting and classifying different structures
within the MRI and natural images. We have utilized both sample gradient and Canny Edge Detection
methods for pre-processing, and K-means clustering have been applied to segment the images. Image
augmentation improves the size and diversity of datasets for training the models for image classification.
This work highlights transfer learning’s effectiveness in image classification using CNNs and VGG 16 that
provides insights into the selection of pre-trained models and hyper parameters for optimal performance.
We have proposed a comprehensive approach for image segmentation and classification, incorporating preprocessing techniques, the K-means algorithm for segmentation, and employing deep learning models such
as CNN and VGG 16 for classification.
- The document presents 6 different models for defining foot size in Tunisia: 2 statistical models, 2 neural network models using unsupervised learning, and 2 models combining neural networks and fuzzy logic.
- The statistical models (SM and SHM) are based on applying statistical equations to morphological foot data.
- The neural network models (MSK and MHSK) use self-organizing Kohonen maps to cluster foot data and model full and half sizes.
- The fuzzy neural network models (MSFK and MHSFK) incorporate fuzzy logic into the neural network learning process to better account for uncertainty in foot sizes.
The security of Electric Vehicle (EV) charging has gained momentum after the increase in the EV adoption
in the past few years. Mobile applications have been integrated into EV charging systems that mainly use a
cloud-based platform to host their services and data. Like many complex systems, cloud systems are
susceptible to cyberattacks if proper measures are not taken by the organization to secure them. In this
paper, we explore the security of key components in the EV charging infrastructure, including the mobile
application and its cloud service. We conducted an experiment that initiated a Man in the Middle attack
between an EV app and its cloud services. Our results showed that it is possible to launch attacks against
the connected infrastructure by taking advantage of vulnerabilities that may have substantial economic and
operational ramifications on the EV charging ecosystem. We conclude by providing mitigation suggestions
and future research directions.
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication.
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication.
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesChristina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
NATURAL DEEP EUTECTIC SOLVENTS AS ANTI-FREEZING AGENT
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CASE STUDY USING THE NCBI DATABASE FOR GENETIC VARIATION
1. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
DOI:10.5121/ijcsit.2017.9307 79
A META DATA VAULT APPROACH FOR
EVOLUTIONARY INTEGRATION OF BIG DATA SETS:
CASE STUDY USING THE NCBI DATABASE FOR
GENETIC VARIATION
Zaineb Naamane and Vladan Jovanovic
Department of Computer Science, Georgia Southern University, Statesboro, GA
ABSTRACT
A data warehouse integrates data from various and heterogeneous data sources and creates a
consolidated view of the data that is optimized for reporting and analysis. Today, business and
technology are constantly evolving, which directly affects the data sources. New data sources can
emerge while some can become unavailable. The DW or the data mart that is based on these data
sources needs to reflect these changes. Various solutions to adapt a data warehouse after the changes
in the data sources and the business requirements have been proposed in the literature [1]. However,
research in the problem of DW evolution has focused mainly on managing changes in the dimensional
model while other aspects related to the ETL, and maintaining the history of changes has not been
addressed. The paper presents a Meta Data vault model that includes a data vault based data
warehouse and a master data management. A major area of focus in this research is to keep both
history of changes and a “single version of the truth,” through an MDM, integrated with the DW. The
paper also outlines the load patterns used to load data into the data warehouse and materialized views
to deliver data to end-users. To test the proposed model, we have used big data sets from the biomedical
field and for each modification of the data source schema, we outline the changes that need to be made
to the EDW, the data marts and the ETL.
KEYWORDS
Data Warehouse (DW), Enterprise Data Warehouse (EDW), Business Intelligence, Data Vault (DV),
Business Data Vault, Master Data Vault, Master Data Management (MDM), Data Mart, Materialized
View, Schema Evolution, Data Warehouse Evolution, ETL, Metadata Repository, Relational Database
Management System (RDMS), NoSQL.
1. INTRODUCTION
The common assumption about a data warehouse is that once it is created, it remains static. This
assumption is incorrect because the business environment needs to change and increase with
time, and business processes are subject to frequent change. This change means that a new type
of information requirement becomes necessary. Research in the data warehouse environment has
not adequately addressed the management of schema changes to source databases. Such changes
have significant implications on the metadata used for managing the warehouse and the
applications that interact with source databases. In short, the data warehouse may not be
consistent with the data and the structure of the source databases. For this reason, the data
2. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
80
warehouse should be designed in away that allows flexibility, evolution, and scalability. In this
thesis, we examine the implications for managing changes to the schema of the source databases
and propose a prototype that will be able to perform the following: 1)- adapt to a changing
business environment; 2)- support very large data; 3)- simplify design complexities; 4) - allow
addition and removal of data sources with no impact on the existing design; 5)- keep all past
data easily integrated with any future data using only additions; and 6)- allow tracking of
history, evolution, and usage of data.
Designing such a flexible system motivated the focus of this thesis. The Data Vault
methodology (DV 2.0), along with big data technologies, will be used to develop a Meta Data
Vault approach for evolutionary integration of big data sets. The prototype developed will solve
primary business requirements and address the evolving nature of the data model as well as
issues related to the history and the traceability of the data.
2. RESEARCH METHODOLOGY
2.1. PROBLEM
Building a data warehouse management system for scientific applications poses challenges.
Complex relationships and data types, evolving schema, and large data volumes (in terabyte) are
some commonly cited challenges with scientific data. Additionally, since a data model is an
abstraction of a real world concept, the model has to change with changes in domain knowledge.
For example, in biomedical long term research, as knowledge is gained from behavioral
experiments, new tasks or behaviors might be defined or new signals might be introduced. This
means that the data model and mappings must be redone again and again as the data models shift,
therefore keeping all past data easily integrated with any future data using only additions is very
desirable.
2.2. RESEARCH REQUIREMENTS
A critical factor in the advancement of biomedical research is the ease in which data can be
integrated, redistributed, tracked, and analyzed both within and across functional domains. My
research methodology will be driven by meta-data requirements from this domain such as:
• the need to incorporate big data and unstructured data (research data includes free-text
notes, images, and other complex data),
• keeping all past data easily integrated with any future data, using only additions,
• accessibility to and tracking of data or process elements in large volumes of scientific data,
• tracking where and when the data originated about a medical analysis (history, evolution,
and usage of data is often as valuable as the results of the analysis itself), and
• allowing seamless integration of data when the model changes.
2.3. VALIDATION APPROACH
The research of this paper centers upon the following question: Can the proposed Meta Data
Vault prototype serve as a standard solution able to effectively track and manage changes for
large-scale data sets while ensuring performance and scalability?
The solution developed will benefit from both relational and non-relational data design
(RDBMS and NoSQL) by including meta-data approach (combining MDM concepts with DV
concepts) for tracking and managing schema evolution, combined with NoSQL technology for
3. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
81
managing big data sets.
To validate the proposed approach, a system prototype will be developed as a full proof of
concept implementation / architecture with a performance base experiment on a real example
from the biomedical field, which is the Single Nucleotide Polymorphism Database of Nucleotide
Sequence Variation.
3. THE PROPOSED SOLUTION
3.1. ENVIRONMENT
Our solution is designed to handle the exponentially growing volume of data, the variety of semi-
structured and unstructured data types, and the velocity of real-time data processing. To reach this
goal, we will use SQL server 2016 because it represents the perfect choice for our database
environment, especially with the new NoSQL feature introduced. This feature allows us to query
across relational and non-relational sources like Hadoop. SQL Server Integration services (SSIS)
will be used for data integration.
3.2. ARCHITECTURE SOLUTION
To reach our research goal and develop a flexible system which will be able to track and manage
changes in both data and metadata, as well as their schemas, we will use a data vault based
architecture (Figure 1).
Figure 1. Solution Architecture
The proposed architecture consists of four parts:
• Data sources: A data warehouse is designed to periodically get all sorts of data from various
data sources. One of the main objectives of the data warehouse is to bring together data from
many different data sources and databases in order to support reporting needs and
management decisions. Data sources include operational data stores and external sources.
• EDW: The enterprise data warehouse layer is modeled after the Data Vault methodology. It
consists of a DV model [2] and a master data. The DV model is composed of two data
models: one oriented towards the data source side (Operational Vault) and represents the
system of record, and the other oriented towards the reporting side (Business Vault). The
4. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
82
master data is the metadata repository, which serves to integrate the operational vault and
the business vault in the EDW.
• Reporting DW: Contains one or more information mart that depends on the enterprise data
warehouse layer.
• Information Delivery: Allows users to perform their own data analysis tasks without
involvement of IT.
3.2.1. DATA SOURCE
3.2.1.1. RELATIONAL MODEL
To test our prototype, we will use a business case from the genetic field which deals with a DW
for the Nucleotide Sequence Variation.
Sequence variations exist at defined positions within genomes and are responsible for individual
phenotypic characteristics, including a person's propensity toward complex disorders such as
heart disease and cancer. As tools for understanding human variation and molecular genetics,
sequence variations can be used for gene mapping, definition of population structure, and
performance of functional studies [3]. For our data source, we will use the Single Nucleotide
Polymorphism database (dbSNP), which is a public-domain archive for a broad collection of
simple genetic polymorphisms. It has been designed to support submissions and research into a
broad range of biological problems [3].The dbSNP Schema is very complex with well over 100
tables and many relationships among tables. One single ER diagram with all dbSNP tables will be
too large to test our solution. The dbSNP schema has been modified and restricted to tables
related to the particular implementation of the dbSNP SNP variation data. The build used for this
study is the combination of different builds.
All data model examples are shown as diagrams using appropriate style-related variations based
on the standard IDEF1X data -modeling notation. The reason for selecting the Idef1X is its very
convenient closeness to the relational model, i.e. a direct visual representation of tables and
foreign keys and minimal set of visual data modeling elements.
The dbSNP data source model in the figure below deals with three subject areas:
• Submitted Population: Contains information on submitted variants and information on the
biological samples used in the experiments, such as the number of samples and the ethnicity
of the populations used.
• SNP Variation: Contains information on the mapping of variants to different reference
genomes.
Allele Frequency for Submitted SNP: Stores information on alleles and frequencies observed in
various populations.
5. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
83
Figure 2. Data models for source databases
The model in Figure 2 has two major classes of data. The first class is submitted data, namely
original observations of sequence variation, accessioned using a submitted SNP identifier (ss).
The second class is generated during the dbSNP build cycle by aggregating data from multiple
submissions, as well as data from other sources, and is identified with a “reference SNP” (rs)
number. The rs identifier represents aggregation by type of sequence change and location on the
genome if an assembled genome is available, or aggregation by common sequence if a genome is
not available (Kitts et al, 2013).
Each row in the SNP table corresponds to a unique reference SNP cluster identified by snp_id
(rs), as opposed to the table SubSNP that contains a separate row for each submission identified
by subsnp_id. SNPsubSNPLNK is the central table that keeps the relationship between submitted
SNP (ss) and reference SNP (rs).
Allele stores all unique alleles in the dbSNP. A submitted SNP variation could have two or more
alleles. For example, a variation of A/G has two alleles: A and G. The alleles A and G are stored
in the Allele table.
Submitted variation patterns are stored in ObsVariation table.
Alleles (Allele) typically exist at different frequencies (AlleleFreqBySsPop) in different
populations (Population). GtyFreqBySsPop contains genotype frequency per submitted SNP and
population. SubSNPGenotypeSum contains summary data for each SubSNP within a specific
population.
3.2.1.2. NOSQL
Genomics produces huge volumes of data [4]; each human genome has about 25,000 genes
comprised of 3 million base pairs. This amounts to 100 gigabytes of data. Consequently,
sequencing multiple human genomes would quickly add up to hundreds of petabytes of data, and
the data created by analysis of gene interactions multiplies those further [5].
6. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
84
NoSQL databases can store and process genomics data in real time, something that SQL is not
capable of doing it when the volume of data is huge and the number of joins required to perform
a query is significant.
Our NoSQL solution is not to replace the Relational Model as a whole, but only in some cases
where there is a need for scalability and big data.
NoSQL databases allow greater scalability, cost-efficiency, flexibility, and performance when
dealing with large data. The two tables that store information about a submitted SNP are subject
to scale out with each submission and build, and they are the highest in terms of number of rows
and size.
Using a document-based database (JSON) will increase performance as all the related data are in
a single document which eliminates the join operation and speeds up the querying process.
Moreover, the scalability cost is reduced because a document-based database is horizontally
scalable, i.e. we can reduce the workload by increasing the number of servers instead of adding
more resources when scaling in a relational database environment. The tables affected by
denormalization are SubSNP, SNPSubLink.
3.2.2. ENTERPRISE DATA WAREHOUSE
Our enterprise data warehouse includes the raw data vault (RDV), which is data source oriented
and contains unchanged copies of the originals from the data sources, and the business data vault
(BDV), which is report oriented and aims to support business integration. In this project, we will
implement the RDV and BDV as one single database, where the two models are connected using
links. For visibility of the model, we have decoupled the entire model to RDV and BDV models,
where hubs are the major redundant entities.
3.2.2.1. RAW DATA VAULT (RDV)
Data Vault design is focused on the functional areas of a business with the hubs representing
business keys, links representing transactions between business keys, and the Satellites providing
the context of the business keys. In addition to that, reference tables have been included in the
data vault to prevent redundant storage of simple reference data that is referenced a lot. The data
vault entities are designed to allow maximum flexibility and scalability while preserving most of
the traditional skill sets of data modeling expertise [6].
Figure 3 below shows the logical DV model for our solution. Every object (hub, link, satellite,
and reference tables) in the model stores two auditing attributes: The first load date (loadDTS),
which is the first date when the specific line appeared, and the first load source (recSource),
which is the first source where the specific line appeared. These attributes provide audit
information at a very detailed level. For simplicity of the models, business keys are used instead
of surrogate keys, as recommended by the DV methodology [7]. Hubs are represented in blue,
links in red, satellites in yellow, and reference tables in gray. The reference tables are managed in
the BDV model but are represented in the RDV model as redundant entities after separating the
BDV from the RDV. Every hub has at least one satellite that contains descriptive attributes of the
business key. Links represent the relationship between entities. Satellites that hang off the Link
tables represent the changing information within that context.
7. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
85
One of the biggest advantages of the RDV model is maintaining history. For example, data
related to the reference genome sequence, upon which SNPs are mapped, can change during time.
We need to store valid values when a record was created. RefSNPs are operationally defined as a
variation at a location on an ideal reference chromosome. These reference chromosomes are the
goal of genome assemblies. However, since many of the genome projects are still in work, and
since even the ‘finished’ genomes are updated to correct past annotation errors, a refSNP is
defined as a variation in the interim reference sequence. Every time there is a genomic assembly
update, the interim reference sequence may change, so the refSNPs must be updated [3]. Using a
data vault based data warehouse, we will be able to recreate past data correctly and maintain the
history of these changes at any point of the time.
The Raw Data Vault (RDV) model is an important piece in our prototype because it will keep the
source system context intact. Data coming from different sources are integrated in the DV model
below without undergoing transformations. Data are rapidly loaded in their raw format, including
the date and the modification source. It is, therefore, possible to rebuild a source’s image at any
point of time.
Figure 3. The Raw Data Vault (RDV) Model
3.2.2.3. BUSINESS DATA VAULT (BDV)
The business master data vault (BDV) is a component of our EDW that is oriented towards the
reporting side. It is designed from the RDV by adding standardized master data and business rules
in order to ensure business integration. Hubs in the BDV (Figure 4) are shown in light blue while
hubs from the RDV are shown in dark blue. Every master hub in the BDV model is connected
through links to the appropriate Hub from the RDV model. Every master Hub has at least one
satellite and one same-as link. Same-as link SAL_SNP_MDM maps a record to other records
within a H_SNP_Master hub.
8. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
86
Figure 4.The Business Vault(BDV)Mode
The Business Data Vault (BDV) is also a mandatory piece in the EDW. It is the part where the
business rules are executed, where integration is achieved, and where the result is stored. The data
transformation in the BDV is typically a value-transformation and not so much a structure
transformation used when creating Data Marts.
3.2.2.3. META DATA VAULT (MDV)
The Meta Data Vault (MDV) represents the metadata repository of our solution; the model is
designed to preserve metadata, changes in metadata, and the history of metadata. The MDV is a
data vault based model, which is used to integrate the Data warehouse metadata with the MDM
metadata in one single model.
Figure 5 shows the MDV model developed for our EDW solution. For readability of the model,
the DV meta-level attributes (recSource, LoadDTS) are not shown in the figure but are included
in all the entities of the model. Hubs, links, satellites, attributes, domains, roles, rules, sources,
resources, materialized views, and reference tables are represented as Hubs in the model (blue
color). All hubs have at least two satellites (pink color). One contains descriptive attributes of the
business key (name, description, type ...) and the other stores the business key. Every hub has a
same -as link. This type of link is used to connect source and target business keys. The model also
resolves the transformation rules from the data source to the RDV via ETL rules, transformation
from the RDV into the BDV via the business, and master rules, as well as the transformation rules
from the BDV to the materialized views. These rules are managed in the H_Rule, H_MV,
H_Attribute, H_Source entities and their links. The security aspect was also treated in this model.
H_Security is a hub linked to others hubs in the model to manage security and user access rights
to entities, attributes, sources, resources, etc. This proposed model permanently stores and
histories all the application metadata, because it is based on the Data Vault model. Hence, it will
not only solve the problem of schema evolution but also the MDM evolution
9. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
87
Figure 5. The Master Data Vault (MDV) Model
3.2.3. REPORTING DW
3.2.3.1. DATA MART
MART MODEL
Dimensional modeling is still the best practice for analysis and reporting. However, it does not
meet the performance expectation for an EDW system [8], [9]. Data Vault, on the other hand, is
more suitable for the large Enterprise Data Warehouse, but not suitable for direct analysis and
reporting. For this reason, we still need dimensional modeling to create data marts [10] in order to
respond to business users’ requirements. In order to test our solution, we considered the following
requirements:
- Allele frequency and count by submitted SNP, allele and population.
- Genotype frequency and count by submitted SNP, genotype and population.
The Work Information Package (WIP) is constructed for two business metrics within the SNP
Data Mart. The two fact tables represent each of the two measures above. Table 1 represents the
user requirements glossary.
10. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
88
Table 1. User requirements glossary
Based on the requirement analysis, we derived the following data mart model.
Figure 6. Data Mart Model
3.2.3.2. MATERIALIZED VIEWS
In data warehouses, materialized views can be used to pre-compute and store aggregated data
such as Avg. sum etc. Materialized views in a data warehouse system are typically referred to as
summaries since they store summarized data. They can also be used to pre-compute joins with or
without aggregations. Thus, a materialized view is used to eliminate overhead associated with
expensive joins or aggregations for a large or important class of queries [11].
For an EDW running analytical queries directly against the huge raw data volume of a data
warehouse, results in unacceptable query performance. The solution to this problem is storing
materialized views in the warehouse, which pre-aggregate the data and thus avoid raw data access
and speed up queries [12].
In our solution, materialized views are derived from the BDV model and used to perform analysis
and reporting on data.
All dimensions are built by joining of a Hub and a Satellite (figure 7)
11. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
89
Figure 7. Derived Allele Dimension Table
Hubs, links and satellites are used to derive fact tables (figure 8):
Figure 8. Derived Allele Frequency Fact Table
3.2.3.3. ETL
In this section, we will present the load patterns used to load the data from the data source to the
EDW. The Raw Data Vault (RDV) will be built from the data sources, the Business Data Vault
(BDV) will be built from the RDV, and finally the DM will be built from the BDV.
The ETL jobs to load a data vault based data warehouse (The RDV and BDV both based on the
data vault methodology) typically run in two steps: In the first step, all hubs are loaded in parallel.
In the second step, all links and satellites are loaded in parallel.
12. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
90
3.2.3.3.1. ETL FOR THE RAW DATA VAULT (RDV)
The individual ETL operations for each type of the RDV objects are:
• Hubs: The business key must appear only once in the hub; insert with a lookup on the
business key
• Links: The association must appear only one time in the link; insert with a lookup on the
association of surrogate key
• Satellites: The loading of the object is the most complex; it depends on the historization
mode. Either the changed attributes are updated, or a new version must be inserted.
3.2.3.3.2. ETL FOR THE BUSINESS DATA VAULT (BDV)
Data captured in the RDV remains in its raw form (unchanged). Once the data enters the RDV, it
is no longer deleted or changed. The BDV, on the other hand, is report oriented. It is where the
business rules are implemented and only the “golden copy” of the records is stored [1]. The issue
with the DBSNP database is that data can be replicated between submitters, which make the
notion of “golden key” often lost. The BDV is designed to consolidate all the duplicated records
coming from the RDV into one master business key. The RDV only contains the highest quality
information on which the reporting applications can depend.
The individual ETL operations for each type of the BDV objects are:
• Master Hubs: Hubs in the business Data Vault will receive new surrogate keys. This is better
than reusing the keys that were distributed in the BDV because these keys now have different
meanings. The surrogate key must only appear once in the Master Hub; insert with a lookup
on the surrogate key.
• MDM Links: The relationship between the original records in the RDV and the cleansed
records in BDV has to be maintained for audit purposes. The MDM Link shows the
relationship between the original copy and the golden copy of data. The association must
appear one time only in the link; insert with a lookup on the association of surrogate key.
• Master Satellites: Master satellites in the BDV hold the descriptive attributes of the master
business key. The loading of the object is the most complex; it depends on the historization
mode. Either the changed attributes are just updated, or a new version must be inserted.
4. EXPERIMENTS
4.1. MOTIVATION AND OVERALL PERSPECTIVE
In this experiment, we will evaluate the effects of schema changes to the data source on the
modeled data. The motivation here is to experiment how our solution is affected when there is a
need of:
4.1.1. DATA SOURCES SCHEMA CHANGES
Business needs to change and increase with time, and business processes are subject to frequent
change. Due to this change, a new type of information requirement that requires different data
becomes necessary. These changes have important implications; first on the metadata (MDM)
used for managing the warehouse, and second, on the applications that interact with the source
databases. In short, the data warehouse may not be consistent with the data and the structure of
13. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
91
the source databases. We expect our solution to allow seamless integration, flexibility, evolution,
and scalability.
4.1.2. TRACEABILITY
The data warehouse team faces difficulties when an attribute change. They need to decide
whether the data warehouse keeps track of both the old and new value of the attribute or not, what
to use for the key, and where to put the two values of the changed attribute?
We expect our solution to ensure the preservation of history and provenance, which enables
tracking the lineage of data back to the source. Data coming from different sources is not
transformed before it is inserted into the MDV, thus enabling a permanent system of records.
4.1.3. AUDITABILITY
One other typical requirement for a data warehouse is the ability to provide information about
source and extraction time of data stored in the system. One reason for that is to track down
potential errors and try to understand the flow of the data into the system. Since Data Vault keeps
a comprehensive history, including the ability to record where the data came from, we expect that
our solution of data vault to be the perfect choice for any system where keeping an audit trail is
important.
4.2. EXPECTATIONS
In this thesis, we examine the implications of changes to the schema of the source databases on
the Enterprise data warehouse, as well as the system ability in terms of traceability and
auditability. We expect that proposed prototype will be able to:
1. Adapt to a changing business environment;
2. Support very large data, simplify design complexities;
3. Allow addition and removal of data sources without impacting the existing design;
4. Keep all past data easily integrated with any future data using additions only; and
5. Allow tracking of history, evolution and usage of data.
To summarize, the prototype developed will solve primary business requirements, addressing the
evolving nature of the data model as well as issues related to the history, and the traceability of
the data.
4.3. EXPERIMENT
In this experiment, we will measure the effects of data source changes on the raw data vault
(RDV) model, the business data vault (BDV) model, the data marts (materialized views), and the
ETL package that loads data from the data source to the RDV and from the RDV to the BDV and
from the BDV to the DM. We will also evaluate how the system responds when there is a need
for traceability and auditability.
In order to determine the changes that should be applied to the EDW, we consider the possible
changes in the data source schema. We analyze how these data source schema changes (Table 2)
are propagated to the BDV, RDV, ETL, and DM.
14. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
92
Table 2. Set of changes from the data source schema
4.4. EXPERIMENT RESULTS
From the results of the experiment, we were able to demonstrate that the proposed solution
resolves issues of tracking the data back to the sources, the history of changes, and the loss of
information. In this experiment, we have validated that:
• Any changes to the data source are implemented through additions only. No information or
data is ever deleted, which means that the loss of information is avoided and any question
about data traceability can be answered.
• The solution provides a solid basis for the audit process due to its ability to record when the
data was inserted and where is it coming from. Each row in the RDV and BDV is
accompanied by record source and load date information. Using these two attributes, we can
answer any questions related to auditability at any given point of time.
• Data Vault dependency minimizations makes it simpler to load tables in parallel, which
accelerates the load process while still maintaining the referential integrity. In the data vault
methodology, the dependency does not go deeper than three levels, with hubs being on top of
the hierarchy, dependent links and hub satellites dependent on hubs and the last link satellites
dependent on links.
• The use of the materialized view for data mart has significantly affected the performance of
the system, especially when querying a huge amount of data.
4.4.1. EFFECT OF CHANGES ON THE EDW
Table 3 shows a summary of the effects of changes to the data sources on the EDW.
15. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
93
Table 3. Effect of changes in the existing data source schemas on the RDV, BDV, MV, and ETL
16. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
94
4.4.2. EVALUATION OF TRACEABILITY
Data vault is considered as a system of record for the EDW. It ensures the preservation of history
17. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
95
and provenance, which enables tracking the lineage of data back to the source [1]. Data coming
from different sources are inserted in the RDV model without any transformations to ensure data
traceability; the RDV enables permanent system of records, which means that copies of the
originals are kept permanently in RDV. This way, the loss of information is avoided and any
question about data traceability is answered. In this experiment, we have validated that any
changes to the data source are implemented through additions-only (Table 4):
• Historical changes are captured by inserting new links and satellites.
• Tracking change is tolerably complex due to a highly normalized structure.
• Provides the most detailed and auditable capture of changes.
In addition, separating the RDV that is a data source oriented from the BDV that is report
oriented means separating reversible and irreversible transformations. Reversible transformations
represent ETL patterns used to load data from a data source into the RDV, and it is possible to
trace back their effects, in order to obtain the system-of records [1].
4.4.3. EVALUATION OF AUDITABILITY
Data vault methodology provides a solid basis for the audit process due to its ability to record
when the data was inserted and where is it coming from. Each row in the RDV and BDV is
accompanied by record source and load date information. Using these two attributes, we are able
to answer any questions related to auditability at any given point of time. Thus, we are able to
know where a specific information is extracted from, when was it extracted, and what was the
process used to extract it.
4.4.4. ETL PERFORMANCE
In the data vault methodology, the dependency does not go deeper than three levels, with hubs
being on top of the hierarchy and then links and hub satellites depended on hubs and the last link
satellites dependent on links. Data Vault dependency minimizations makes it easier to load tables
in parallel, hence speed up the load process while still maintaining the referential integrity.
Moreover, one of the big advantages of DV 2.0 was the replacement of the standard integer
surrogate keys with hash-based primary keys. Hashing allows a child “key” to be computed in
parallel to the parent “key” and be loaded independently of each other, which highly increases
load performance of an Enterprise Data Warehouse especially when working with large volumes
of data. To conclude, Data Vault enables greatly simplified ETL transformations in a way not
possible with traditional data warehouse designs.
5. CONCLUSIONS
Nowadays, The DW environment is constantly changing. However, adapting the DW to these
changes is a time-consuming process and still needs more in-depth research. In this thesis, we
examine the implications of managing changes to the schema of the operational databases in the
context of data warehouses. We propose a Meta Data Vault approach for evolutionary integration
of big data sets from the biomedical field. The approach proposed uses an integrated DW and
MDM data repository. The DW includes the Raw Data Vault (RDV), that contains the actual
copies of the originals from the data sources; and the Business Data Vault (BDV) that stores the
data resulted from the execution of business rules.
The BDV is a component of our EDW that is oriented towards the reporting side. It is expanded
from the RDV by adding standardized master data and business rules in order to ensure business
18. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 3, June 2017
96
integration. The main purpose of expanding the RDV with the BDV is to separate the storing and
data warehousing portion from the interpretation of the data. This is accomplished by pushing
"business rules" downstream after RDV, towards BDV and reporting DW. Furthermore this
arrangement enabled evolution of business rules localized to the BDV alone having as a
consequence more flexible maintenance also easier to implement using independent parallel
tasks.
The RDV and BDV are integrated via the Metadata Repository Data Vault (MDV). A data vault
based data warehouse needs an additional layer to deliver analysis reports to the users. Therefore,
data marts were derived from the RDV using materialized views. From the results of the
experiment, we were able to demonstrate that the proposed solution resolves issues of tracking the
data back to the sources, the history of changes, and loss of information. The DV approach also
allows fast parallel loading and preservation of the DW integrity. The Meta data vault repository
is a complex subject that still needs additional work and research in order to be formalized and
standardized even though, the MDV was out of scope in the experiment. The suggested models
(RDV, BDV), along with the load patterns used, have proved its flexibility and easy adaptation to
the data source schema changes, while preserving the history of these changes using additions
only. The model has also proved its efficiency in terms of traceability, auditability, and
performance on the overall data warehouse system.
REFERENCES
[1] D.Subotic, V. Jovanovic, and P. Poscic, Data Warehouse and Master Data management Evolution
– a Meta-Data-Vault Approach. Issues in Information Systems, 15(Ii), 14–23, 2014.
[2] D.Linstedt, SuperCharge Your Data Warehouse: Invaluable Data Modeling Rules to Implement your
Data Vault. Create Space Independent Publishing Platform, USA, 2011.
[3] A.Kitts, L. Phan, W. Minghong, and J. B. Holmes, The database of short genetic variation (dbSNP),
The NCBI Handbook, (2nd), 2013. Available at http://www.ncbi.nlm.nih.gov/books/NBK174586/.
[4] H.Afify, M. Islam, and M. Wahed. DNA Lossless Differential Compression Algorithm Based on
Similarity of Genomic Sequence Database, International Journal of Computer Science & Information
Technology (IJCSIT), 2011.
[5] B.Feldman, E. Martin, and T. Skotnes. Genomics and the role of big data in personalizing the
healthcare experience, 2013. Available at https://www.oreilly.com/ideas/genomics-and-the-role-of-
big-data-in-personalizing-the-healthcare-experience.
[6] V.Jovanovic, and I. Bojicic. Conceptual Data Vault Model, Proceedings of the Southern Association
for Information Systems Conference, Atlanta, GA, USA, 2012.
[7] D. Linstedt, and M. Olschimke, Building a scalable data warehouse with Data Vault 2.0, 2016.
[8] Z. Naamane, and V. Jovanovic, Effectiveness of Data Vault compared to Dimensional Data Marts on
Overall Performance of a Data Warehouse System. IJCSI International Journal of Computer Science
Issues, Volume 13, Issue 4, 2016
[9] N.Rahman, An empirical study of data warehouse implementation effectiveness. International Journal
of Management Science and Engineering Management, 2017
[10] R.Kimball, and M. Ross, The data warehouse toolkit, 3rd edition, John Wiley, 2013.
[11] B.Shah, K. Ramachandran, and V. Raghavan, A Hybrid Approach for Data Warehouse View
Selection.International Journal of Data Warehousing and Mining (IJDWM), 2006.
[12] M.Teschke, and A. Ulbrichl, Using materialized views to speed up data warehousing. University
Erlangen-Nuremberg (IMMD VI), 1–16 , 1997.
Available at http://www.ceushb.de/forschung/downloads/TeUl97.pdf