This document discusses the design and maintenance of data warehouses. It begins by contrasting online transaction processing (OLTP) with online analytical processing (OLAP) and decision support systems. It then defines a data warehouse as a subject-oriented database used for decision making that is separate from operational databases. The document outlines reasons for building data warehouses, including improved performance, availability, and data quality. It also describes common data warehouse architectures and processes involved in operational maintenance like extract, transform, and load of data from source systems.
Design and Maintenance of Data Warehouses

Timos Sellis
National Technical University of Athens
KDBS Laboratory
http://www.dbnet.ece.ntua.gr/
Many thanks to P. Vassiliadis and A. Tsois

EDBT Summer School - Cargese 2002 / ABIS 2002
Outline
- What's and Why's for DW's
- DW architecture
- DW schema
- Back end of the DW
- Front end of the DW
- DW servers
- Metadata repository
- Conclusions
OLTP
- On-line transaction processing (OLTP) is the traditional way of using a database.
- Legacy systems: relational, hierarchical, network databases / COBOL applications / …
- Short transactions (read/update a few records) with ACID properties.
- Normally, only the last version of the data is stored in the database.
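To make the notion of a short ACID transaction concrete, here is a minimal sketch (not from the slides) using Python's sqlite3 module and an invented account table; the two updates either both commit or both roll back.

```python
import sqlite3

# Hypothetical operational table: one row per account.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account(id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500.0), (2, 120.0)])
con.commit()

# A typical short OLTP transaction: read/update a few records, atomically.
try:
    con.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
    con.execute("UPDATE account SET balance = balance + 50 WHERE id = 2")
    con.commit()          # both updates become visible together
except sqlite3.Error:
    con.rollback()        # or neither does

print(con.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
# [(1, 450.0), (2, 170.0)]
```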
DSS & OLAP
- Decision support systems (DSS) help the executive, manager, or analyst make faster and better decisions.
  - What were the sales volumes by region and product category for the last year?
  - Will a 10% discount increase sales volumes sufficiently?
- On-line analytical processing (OLAP) is an element of decision support systems (DSS).
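A minimal sketch (not from the slides) of the first question as an aggregate query, using sqlite3 and an invented, flattened sales table; in a real warehouse, region and category would come from dimension tables joined to a fact table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales(region TEXT, category TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("North", "Books", 2001, 120.0), ("North", "Books", 2001, 80.0),
    ("North", "Music", 2001, 60.0),  ("South", "Books", 2001, 200.0),
    ("South", "Music", 2000, 90.0),  # earlier year, excluded below
])

# "What were the sales volumes by region and product category for the last year?"
rows = con.execute("""
    SELECT region, category, SUM(amount) AS total
    FROM sales
    WHERE year = 2001
    GROUP BY region, category
    ORDER BY region, category
""").fetchall()
print(rows)  # [('North', 'Books', 200.0), ('North', 'Music', 60.0), ('South', 'Books', 200.0)]
```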
OLTP vs. OLAP

             OLTP                                    OLAP
User         Clerk                                   Manager
Function     Day-to-day operations                   Decision support
Access       Read/write                              Mostly read
Data         Detailed, up-to-date, flat relational   Summarised, historical, multidimensional
DB size      100 MB - 1 GB                           100 GB - 1 TB

(Chaudhuri & Dayal @ VLDB'96)
Data Warehouse
- A decision support database that is maintained separately from the organization's operational database.
  (S. Chaudhuri, U. Dayal, VLDB'96 tutorial)
- A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
  (W.H. Inmon, Building the Data Warehouse, 1992)
Reasons for Building Data Warehouses
Semantic Reconciliation
- Dispersed data sources within the same organization
- Different encodings of the same entities
- The DW encompasses the full volume of these data under a single, reconciled schema
- It keeps the history of these data, too
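A hedged sketch of what "different encodings of the same entities" looks like in practice and how a reconciled schema absorbs it; the attribute names and code tables below are invented for illustration.

```python
# Two sources encode gender and country differently; the warehouse loads both
# under one reconciled encoding and keeps the source in the key for lineage.
GENDER_MAP = {"M": "male", "F": "female", "0": "male", "1": "female"}
COUNTRY_MAP = {"GR": "Greece", "GRC": "Greece", "Hellas": "Greece"}

def reconcile(record: dict, source: str) -> dict:
    """Map one source record onto the single, reconciled warehouse schema."""
    return {
        "customer_id": f"{source}-{record['id']}",
        "gender": GENDER_MAP[str(record["gender"])],
        "country": COUNTRY_MAP[record["country"]],
    }

print(reconcile({"id": 42, "gender": "F", "country": "GR"}, source="sales_db"))
print(reconcile({"id": 7, "gender": 1, "country": "Hellas"}, source="crm_db"))
```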
Reasons for Building Data Warehouses
Performance
- OLAP applications need a different organization of the data
- Complex OLAP queries would degrade OLTP performance
Availability
- Separation increases availability
- Possibly the only way to query the dispersed data sources
Reasons for Building Data Warehouses
Data Quality
- The validity of source data is not guaranteed (data can be missing, inconsistent, out of date, violating business and database rules, …)
- Errors in data reach a minimum of 10% in most data stores
- This can lead to wasting 25-40% of resources
- The DW acts as a data cleaning buffer
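The "cleaning buffer" role can be illustrated with a few row-level checks applied before loading. This is a sketch only, with invented field names and rules; rejected rows are kept for inspection rather than silently dropped.

```python
from datetime import date

def validate(row: dict):
    """Return the list of rule violations for one staged row (empty = clean)."""
    errors = []
    if row.get("part_key") is None:
        errors.append("missing part_key")                    # missing data
    if row.get("qty") is None or row["qty"] <= 0:
        errors.append("qty must be positive")                # business rule
    if row.get("ship_date") and row["ship_date"] > date.today():
        errors.append("ship_date lies in the future")        # inconsistent data
    return errors

rows = [
    {"part_key": 1, "qty": 10, "ship_date": date(2002, 5, 1)},
    {"part_key": None, "qty": -3, "ship_date": date(2999, 1, 1)},
]
clean = [r for r in rows if not validate(r)]
rejected = [(r, validate(r)) for r in rows if validate(r)]
print(len(clean), "clean row(s);", rejected)
```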
… and the market is there!
The Market
Estimated sales in millions of dollars [ShTy98] (* estimates are from [Pend00]).

                                  1998    1999    2000    2001    2002   CAGR (%)
RDBMS sales for DW               900.0  1110.0  1390.0  1750.0  2200.0     25.0
Data Marts                        92.4   125.0   172.0   243.0   355.0     40.0
ETL tools                        101.0   125.0   150.0   180.0   210.0     20.1
Data Quality                      48.0    55.0    64.5    76.0    90.0     17.0
Metadata Management               35.0    40.0    46.0    53.0    60.0     14.4
OLAP (incl. implementation
services)*                        2000    2500    3000    3600    4000     18.9
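The CAGR column can be reproduced from the 1998 and 2002 endpoints; a quick arithmetic check, assuming a four-year compounding window:

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end / start) ** (1 / years) - 1

print(round(cagr(900.0, 2200.0, 4) * 100, 1))   # 25.0  -> RDBMS sales for DW
print(round(cagr(92.4, 355.0, 4) * 100, 1))     # 40.0  -> data marts
```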
Data Warehouse Architecture
A Simple View
[Figure: several operational sources feed an integration layer that loads the warehouse; clients reach it through a query & analysis layer; metadata describes both steps.]
Data Warehouse Architecture
[Figure: sources flow through a data staging area (DSA) into the DW and on into data marts, which end users reach through reporting / OLAP tools; source administrators, the DSA administrator, and the DW designer manage the respective layers; a metadata repository spans the architecture, and quality issues arise at every stage.]
Two / Three Tier Architecture
- Warehouse database server: almost always relational (RDBMS)
- Data marts / OLAP server:
  - Relational OLAP (ROLAP)
  - Multidimensional OLAP (MOLAP)
- Clients:
  - Query and reporting tools
  - Analysis tools / data mining tools
Data Warehouse Architecture
- Enterprise warehouse: collects all information about subjects
  - requires extensive business modeling
  - may take years to design and build
- Data marts: departmental subsets that focus on selected subjects
- Virtual warehouse: views over operational DBs
How to build the DW
Top-down
- Single integrated enterprise model
- Reduce all sources (and clients, if necessary) to the central model
− Time consuming; labor intensive; slow to produce results
− Enhances the risk of the DW project due to late delivery of results
+ Provides a consistent, global view of the enterprise data
How to build the DW
Bottom-up
- Build smaller data marts first
- Progressively combine them pairwise
− Fails to provide a global view of the enterprise data
− Possibly enhances the risk, since a complete integration might prove impossible late in the project
+ Early delivery of results
+ Less time consuming, less labor intensive
EDBT Summer School - Cargese 2002 17
Data Warehouse Back-End
[Same architecture figure as above, here with the back-end highlighted: the sources, the data staging area (DSA) and the flow into the central DW]
EDBT Summer School - Cargese 2002 18
Design: Global-As-View Integration
Preintegration. What schemata to integrate and in
which order
Schema Comparison. To determine the correlations
among concepts of different schemata and to detect
possible naming, semantic, structural, … conflicts
Schema Conforming. Conflict resolution for
heterogeneous schemata
Schema Merging and Restructuring. Production of a
single conformed schema
EDBT Summer School - Cargese 2002 19
Design: Local-As-View Integration
Works the other way around.
Main deliverable is a central conceptual model,
produced by interactively examining user needs
and existing schemata
All source and client schemata are expressed in
terms of the central data warehouse schema and
not the other way around.
EDBT Summer School - Cargese 2002 20
DW = Materialized Views?
[Figure: the simple "materialized views" picture — source tables S1_PARTSUPP and S2_PARTSUPP are unioned into the warehouse table DW.PARTSUPP, over which two aggregate views are defined: Aggregate1 (V1: PKEY, DAY, MIN(COST)) and Aggregate2 (V2: PKEY, MONTH, AVG(COST))]
[Figure: the realistic picture — DW ≠ Materialized Views! Between the sources and the DW sits a data staging area (DSA): files arrive by FTP, DIFF operators compare new and old snapshots (DS.PS_NEW vs. DS.PS_OLD), surrogate keys are added via lookups (Add_SPK, LOOKUP_PS.SKEY), dates are added or checked (AddDate, NotNULL), quantities are checked (QTY > 0), currencies are converted ($ to €), and rejected rows are logged, before the data reach DW.PARTSUPP and its aggregate views]
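To make the "materialized views" reading of the simple picture concrete, a minimal sketch; the table and column names (dw_partsupp, pkey, cost, sale_date) are hypothetical stand-ins for the figure's DW.PARTSUPP, and the syntax is PostgreSQL/Oracle-style CREATE MATERIALIZED VIEW:

-- Aggregate1 (V1): minimum cost per part and day
CREATE MATERIALIZED VIEW aggregate1 AS
SELECT pkey, sale_date AS sale_day, MIN(cost) AS min_cost
FROM dw_partsupp
GROUP BY pkey, sale_date;

-- Aggregate2 (V2): average cost per part and month
CREATE MATERIALIZED VIEW aggregate2 AS
SELECT pkey,
       EXTRACT(YEAR FROM sale_date)  AS yr,
       EXTRACT(MONTH FROM sale_date) AS mon,
       AVG(cost) AS avg_cost
FROM dw_partsupp
GROUP BY pkey, EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date);

As the realistic picture above argues, the actual warehouse flow involves much more than such view definitions.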
EDBT Summer School - Cargese 2002 22
Operational Processes
Data extraction, transform & load
Originally treated as the ‘refreshment’ problem
Requires transforming, cleaning and integrating data
from different sources
Build/refresh derived data and views
Service queries
Monitor the warehouse
EDBT Summer School - Cargese 2002 23
The Refreshment Problem
Propagate updates on source data to the
warehouse
Issues:
when to refresh
on every update
periodically
refresh policy set by administrator
how to refresh
EDBT Summer School - Cargese 2002 24
Refreshment Techniques
Full extract from base tables
Incremental techniques
detect changes on base tables
snapshots
transaction shipping
active rules
logical correctness
transactional correctness
Currently, in practice we use ETL tools/scripts (see
next)…
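In practice the refresh step is often written as a merge of the latest extract into the warehouse table. A minimal sketch using the SQL:2003 MERGE statement; the table names (dw_partsupp, ds_ps_new) and keys are hypothetical:

-- Apply the newest extract: update existing part/supplier rows, insert new ones
MERGE INTO dw_partsupp dw
USING ds_ps_new src
   ON (dw.pkey = src.pkey AND dw.suppkey = src.suppkey)
WHEN MATCHED THEN
  UPDATE SET dw.cost = src.cost, dw.sale_date = src.sale_date
WHEN NOT MATCHED THEN
  INSERT (pkey, suppkey, cost, sale_date)
  VALUES (src.pkey, src.suppkey, src.cost, src.sale_date);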
EDBT Summer School - Cargese 2002 25
Data Extraction
Can take snapshot or differentials
(new/deleted/updated) of source data
Transfer, encryption, compression are also
involved
Time window and source system overhead
involved
In general, faced with the requirement of minimal
changes to existing configuration of sources
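A differential (as opposed to a full snapshot) can be computed on the staging area by comparing the new and the previous extract. A small sketch with hypothetical staging tables, using EXCEPT (MINUS in Oracle):

-- Rows in the new extract but not in the old one (inserts/updates)
SELECT pkey, suppkey, cost FROM ds_ps_new
EXCEPT
SELECT pkey, suppkey, cost FROM ds_ps_old;

-- Rows that disappeared since the last extract (deletions)
SELECT pkey, suppkey, cost FROM ds_ps_old
EXCEPT
SELECT pkey, suppkey, cost FROM ds_ps_new;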
EDBT Summer School - Cargese 2002 26
Data Transformation
Schema Reconciliation: conflicts at the schema
level (different attributes for the same
information)
Value Identification & Reconciliation: different
(same) id’s for same (different) objects (use
surrogate keys)
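Value reconciliation with surrogate keys typically goes through a lookup table that maps each source-local key to a global warehouse key. A minimal sketch; the names (ds_ps1, lookup_ps, skey) are hypothetical, loosely following the earlier ETL figure:

-- Replace the source-local part key of source 1 by the warehouse surrogate key
SELECT l.skey AS part_skey, s.cost, s.sale_date
FROM ds_ps1 s
JOIN lookup_ps l
  ON  l.source_id   = 1        -- which source system the key comes from
  AND l.source_pkey = s.pkey;  -- source key mapped to the global surrogate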
EDBT Summer School - Cargese 2002 27
Data Cleaning
Offending Data: duplicates, integrity/business
rules/format violations …
Incompleteness: missing data
Standardization and reformatting: especially addresses
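Many of these checks can be expressed directly on the staging tables. Two typical ones, sketched over a hypothetical staging table ds_ps1(pkey, suppkey, cost, qty):

-- Duplicates on the business key
SELECT pkey, suppkey, COUNT(*) AS copies
FROM ds_ps1
GROUP BY pkey, suppkey
HAVING COUNT(*) > 1;

-- Rows violating simple business/format rules (missing cost, non-positive quantity)
SELECT * FROM ds_ps1 WHERE cost IS NULL OR qty <= 0;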
EDBT Summer School - Cargese 2002 28
Data Loading
This final stage may still require additional
preprocessing:
sorting, summarizing, performing computations
Issues:
huge volumes of data to be loaded
small time window
when to build indexes and summary tables
restart after failure with no loss of data integrity
EDBT Summer School - Cargese 2002 29
Loading Techniques
A record-at-a-time SQL interface cannot be used to update
or append the data
too slow, since it uses random disk I/O
can make the rollback segment or log file burst
Use batch load utility
sort input records on a clustering key
sequential I/O 100 times faster than random I/O
build index at the same time
use parallelism to accelerate load operations
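As a concrete (hedged) illustration of the batch-load approach: pre-sort the input file on the clustering key, bulk-load it, and build indexes afterwards. The sketch uses PostgreSQL-style COPY; SQL*Loader, LOAD DATA and similar utilities play the same role in other systems, and all names are hypothetical:

-- Bulk load of a pre-sorted file (sequential I/O instead of row-at-a-time inserts)
COPY dw_partsupp (pkey, suppkey, cost, sale_date)
FROM '/staging/partsupp_sorted.csv'
WITH (FORMAT csv);

-- Build the index after the load, not during it
CREATE INDEX dw_partsupp_idx ON dw_partsupp (pkey, sale_date);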
EDBT Summer School - Cargese 2002 30
Incremental Loading
Use incremental loads during refresh to reduce data
volume (e.g. Redbrick)
insert only updated tuples
incremental load conflicts with queries
break into sequence of shorter transactions
coordinate this sequence of transactions: must
ensure consistency between base and derived
tables and indices.
EDBT Summer School - Cargese 2002 31
Data Warehouse Front-End
[Same architecture figure, here with the front-end highlighted: end users accessing the data marts through reporting / OLAP tools]
EDBT Summer School - Cargese 2002 32
Front End Tools
Ad hoc query and reporting
Example: MS Excel, ProReports
OLAP: ‘Multidimensional spreadsheet’
pivot tables, drill down, roll up, slice, dice
Data Mining
EDBT Summer School - Cargese 2002 33
Basic ideas for OLAP
Several numeric measures that are analyzed
sales, budget, revenue, inventory
Dimensions
contexts in which a measure appears
Example: store, product, date information associated
with a sale.
each context is a dimension and the measure is a
point in a multi-dimensional world
EDBT Summer School - Cargese 2002 34
Basic ideas for OLAP
Nature of Analysis
aggregation (total sales, percent-to-total)
comparison (budget vs. expense)
ranking (top 10)
access to detailed and aggregate data
complex criteria specification
visualization
EDBT Summer School - Cargese 2002 35
Basic ideas for OLAP
Attributes
information associated with a dimension
example: owner of store, county in which the store is
located
Attribute Hierarchies
Attributes of a dimension are often related in a
hierarchical way
example: street → city → country
EDBT Summer School - Cargese 2002 36
Multidimensional Data
Dimensions: Product, Region, Date
Hierarchical summarization paths (measure: sales volume):
Product:  Industry – Category – Product
Region:   Country – Region – City – Office
Date:     Year – Quarter – Month / Week – Day
EDBT Summer School - Cargese 2002 37
Operations
Roll up: summarize data
Drill down: go from higher level summary to
lower level summary or detailed data
Slice and dice: select and project
Pivot: re-orient cube
EDBT Summer School - Cargese 2002 38
Roll up
Sales volume by product, store and quarter:

             Store1          Store2
             Q1     Q2       Q1     Q2
Electronics  $5,2   $8,9     $5,6   $7,2
Toys         $1,9   $0,75    $1,4   $0,4
Clothing     $2,3   $4,6     $2,6   $4,6
Cosmetics    $1,1   $1,5     $1,1   $0,5

Rolled up from quarter to year (1996):

             Store1   Store2
Electronics  $14,1    $12,8
Toys         $2,65    $1,8
Clothing     $6,9     $7,2
Cosmetics    $2,6     $1,6
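In SQL terms, this roll-up is simply a re-aggregation at the coarser level. A minimal sketch over a hypothetical sales(product, store, quarter, sales_year, amount) table:

-- Roll up from quarter to year
SELECT product, store, SUM(amount) AS yearly_amount
FROM sales
WHERE sales_year = 1996
GROUP BY product, store;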
EDBT Summer School - Cargese 2002 41
Slice and Dice
Sales volume by product, store and quarter (the cube above):

             Store1          Store2
             Q1     Q2       Q1     Q2
Electronics  $5,2   $8,9     $5,6   $7,2
Toys         $1,9   $0,75    $1,4   $0,4
Clothing     $2,3   $4,6     $2,6   $4,6
Cosmetics    $1,1   $1,5     $1,1   $0,5

Sliced and diced down to Store1 and to the products Electronics and Toys:

             Q1     Q2
Electronics  $5,2   $8,9
Toys         $1,9   $0,75
EDBT Summer School - Cargese 2002 42
Data Warehouse Server
[Same architecture figure, here with the central DW server highlighted]
EDBT Summer School - Cargese 2002 43
Data Warehouse Servers - Outline
Server Technology: ROLAP & MOLAP
Indexing Techniques
Query Processing and Optimization
EDBT Summer School - Cargese 2002 44
Database Servers
Relational and Specialized Relational DBMS
Relational OLAP (ROLAP) DBMS
Multidimensional OLAP (MOLAP) DBMS
EDBT Summer School - Cargese 2002 45
Relational DBMS
Features that support DSS
Specialized Indexing techniques
Specialized Join and Scan Methods
Data Partitioning and use of Parallelism
Complex Query Processing
Intelligent Processing of Aggregates
Extensions to SQL and their processing
EDBT Summer School - Cargese 2002 46
ROLAP Servers
Exploits services of a relational engine effectively
Key functionality
needs aggregation navigation logic
ability to generate multi-statement SQL
optimized for each individual database back-end
Additional services
cost-based query governor
design tool for DSS schema
performance analysis tool
EDBT Summer School - Cargese 2002 47
Database Schemata for DW & ROLAP
Star Schema
Snowflake Schema
Fact Constellation
Aggregated data
EDBT Summer School - Cargese 2002 48
Star Schema
A star schema consists of one central fact table and
several denormalized dimension tables.
The measures of interest for OLAP are stored in the
fact table (e.g. Dollar Amount, Units in the table
SALES).
For each dimension of the multidimensional model
there exists a dimension table (e.g. Geography,
Product, Time, Account) with all the levels of
aggregation and the extra properties of these levels.
EDBT Summer School - Cargese 2002 49
Star Schema
Fact table SALES (Geography Code, Time Code, Account Code, Product Code, Dollar Amount, Units)
Dimension tables:
Geography (Geography Code, Region Code, Region Manager, State Code, City Code, …)
Product (Product Code, Product Name, Brand Code, Brand Name, Prod. Line Code, Prod. Line Name)
Time (Time Code, Quarter Code, Quarter Name, Month Code, Month Name, Date)
Account (Account Code, KeyAccount Code, KeyAccount Name, Account Name, Account Type, Account Market)
(Example from Stanford Technology Group, Inc., 1996)
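A minimal DDL sketch of this star schema; the column types and table names are hypothetical, following the example above:

-- Denormalized dimension tables
CREATE TABLE geography (geography_code INT PRIMARY KEY, region_code INT,
                        region_manager VARCHAR(50), state_code INT, city_code INT);
CREATE TABLE product   (product_code INT PRIMARY KEY, product_name VARCHAR(50),
                        brand_code INT, brand_name VARCHAR(50),
                        prod_line_code INT, prod_line_name VARCHAR(50));
CREATE TABLE time_dim  (time_code INT PRIMARY KEY, quarter_code INT, quarter_name VARCHAR(10),
                        month_code INT, month_name VARCHAR(10), day_date DATE);
CREATE TABLE account   (account_code INT PRIMARY KEY, key_account_code INT,
                        key_account_name VARCHAR(50), account_name VARCHAR(50),
                        account_type VARCHAR(20), account_market VARCHAR(20));

-- Central fact table: one foreign key per dimension plus the measures
CREATE TABLE sales (
  geography_code INT REFERENCES geography,
  time_code      INT REFERENCES time_dim,
  account_code   INT REFERENCES account,
  product_code   INT REFERENCES product,
  dollar_amount  DECIMAL(12,2),
  units          INT
);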
EDBT Summer School - Cargese 2002 50
Snowflake Schema
The normalized version of the star schema
Explicit treatment of dimension hierarchies (each
level has its own table)
Easier to maintain, slower in query answering
EDBT Summer School - Cargese 2002 51
Snowflake Schema
Fact table SALES (Postal Code, Time Code, Account Code, Product Code, Dollar Amount, Units)
One table per hierarchy level:
Geography: Geography (Postal Code, Region Code, State Code, City Code) – Region (Region Code, Region Mgr) – State (State Code, State Name) – City (City Code, City Name)
Time: Time (Time Code, Quarter Code, Month Code) – Quarter (Quarter Code, Quarter Name) – Month (Month Code, Month Name)
Product: Product (Product Code, Prod Line Code, Brand Code, Product Name) – Brand (Brand Code, Brand Name) – ProdLine (ProdLine Code, ProdLine Name)
Account: Account (Account Code, KeyAccount Code, Account Name) – KeyAccount (KeyAcc Code, KeyAcc Name)
(Example from Stanford Technology Group, Inc., 1996)
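For contrast with the star version, a minimal sketch of the normalized Product branch only (types and names hypothetical; product_sf is named differently just to keep it distinct from the star-schema sketch above):

-- Each level of the Product hierarchy becomes its own table
CREATE TABLE prod_line (prod_line_code INT PRIMARY KEY, prod_line_name VARCHAR(50));
CREATE TABLE brand     (brand_code INT PRIMARY KEY, brand_name VARCHAR(50));
CREATE TABLE product_sf (
  product_code   INT PRIMARY KEY,
  product_name   VARCHAR(50),
  brand_code     INT REFERENCES brand,      -- instead of storing brand_name here
  prod_line_code INT REFERENCES prod_line   -- instead of storing prod_line_name here
);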
EDBT Summer School - Cargese 2002 52
Fact Constellation
Multiple fact tables that share many dimension
tables
Example: the projected-expense and actual-expense fact
tables may share the same dimension tables
EDBT Summer School - Cargese 2002 53
Aggregated Tables
In addition to base fact and dimension tables,
data warehouse keeps aggregated (summary)
data for efficiency.
Two approaches
store as separate summary fact and dimension
tables
add to the existing base tables
EDBT Summer School - Cargese 2002 54
Aggregated Tables
Sales table:
RID  City    Amount
1    Athens  $100
2    N.Y.    $300
3    Rome    $120
4    Athens  $250
5    Rome    $180
6    Rome    $65
7    N.Y.    $450

Separate City-dimension sum table:
City    Amount
Athens  $350
N.Y.    $750
Rome    $365

Extended Sales table (base table extended with summary rows and a Level column):
RID  City    Amount  Level
1    Athens  $100    NULL
2    N.Y.    $300    NULL
3    Rome    $120    NULL
4    Athens  $250    NULL
5    Rome    $180    NULL
6    Rome    $65     NULL
7    N.Y.    $450    NULL
8    Athens  $350    City
9    N.Y.    $750    City
10   Rome    $365    City
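A minimal sketch of the first approach (hypothetical names): the summary table is just a pre-computed GROUP BY that is refreshed together with the base table.

-- Separate summary table, pre-aggregating sales per city
CREATE TABLE city_sales_sum AS
SELECT city, SUM(amount) AS amount
FROM sales
GROUP BY city;
-- (the second approach instead appends such per-city rows to the base table,
--  marking them with a Level column as in the example above)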
EDBT Summer School - Cargese 2002 55
MOLAP Servers
The storage model is an n-dimensional array
Very fast in computations and OLAP operations
Normally they require pre-computation of the
available cubes
Compression of data to save storage space
Currently: 98% of the market for client tools
SISYPHUS: A Chunk-Based Storage
Manager for OLAP Cubes
PhD work of Nikos Karayannidis
National Technical University of Athens
(NTUA)
EDBT Summer School - Cargese 2002 57
ERATOSTHENES project
ERATOSTHENES is a specialized database
management system for OLAP cubes which is
under development.
In the context of ERATOSTHENES, a prototype
storage manager for OLAP cubes, called
SISYPHUS, has been developed.
[Figure: layered architecture — a Presentation Engine on top of a Processing Engine on top of the Storage Engine (SISYPHUS)]
EDBT Summer School - Cargese 2002 58
Why does OLAP pose new requirements for
storage management?
Small response time: good physical clustering +
efficient access paths
Multidimensionality: md-storage structures,
address by location
Hierarchies: access paths, clustering
Sparseness: not random but according to
hierarchies.
EDBT Summer School - Cargese 2002 59
Architecture: levels of abstraction in
SISYPHUS
[Figure: SISYPHUS layers — an Access Manager providing chunk-oriented file management (chunk- and cell-oriented access for the cube access methods and OLAP processing), built on the SSM record-oriented storage manager, a bucket-oriented File Manager with logging/recovery, and a Buffer Manager]
EDBT Summer School - Cargese 2002 60
Dimension data encoding
[Figure: the LOCATION dimension hierarchy Country – Region – City; each member gets an order-code under its parent (CountryA = 0; RegionA = 0, RegionB = 1; CityA…CityD = 0…3), and a member's member-code is the path of order-codes from the root, e.g. 0.1.2]
EDBT Summer School - Cargese 2002 61
A chunk-oriented file system: the
hierarchically chunked cube
Use the bucket file
system.
Chunking Method:
partition the data space
by forming a hierarchy of
chunks that is based on
the dimension
hierarchies.
[Figure: example cube over LOCATION (continent – country – region – city, with order-code ranges down to [0..18] at the city level) and PRODUCT (category – type – item plus a pseudo level, down to [0..5] at the item level); the following panels show the hierarchical chunking of this cube at depths D = 0, 1, 2 and 3 (the maximum depth), where each chunk at one depth is subdivided according to the next level of both dimension hierarchies]
EDBT Summer School - Cargese 2002 66
Chunk Identifiers (chunk-ids)
Chunk addressing.
Unique identifier of chunk within cube + depicts
hierarchy path of chunk.
Interleave the member-codes of the pivot-level
members that define a chunk (at any depth).
e.g., at D = 2, LOCATION member-code 2.3 and PRODUCT member-code 1.2
are interleaved into the chunk-id 2|1 . 3|2, i.e. 21.32
EDBT Summer School - Cargese 2002 67
Accessing the chunks of a cube
Need some chunk directory.
Idea: use intermediate depth chunks as directory
chunks that will guide us to the data chunks
(Dmax + 1)
Create a chunk-tree.
EDBT Summer School - Cargese 2002 68
[Figure: the resulting chunk-tree — the root chunk points to directory chunks at intermediate depths (chunk-ids such as 00.00, 00.01, 00.10, 00.11), which in turn point to the data chunks at the grain level D = 3 (max depth), e.g. 00.00.0P, 00.10.2P]
EDBT Summer School - Cargese 2002 69
Bucket Organization
3 parts: bucket header, directory chunk vector,
data chunk vector.
Main idea: try to store in the same bucket
whole families (i.e. sub-trees of chunks)!
A) A single sub-tree
B) Many sub-trees that form a bucket region
C) A single tree of directory chunks (root bucket)
D) A single data chunk
EDBT Summer School - Cargese 2002 70
Chunk organization
Implementation data structure: multidimensional arrays:
Offer addressing by location, which is native to cubes.
Enable chunk-id exploitation:
we don't have to store the chunk-ids.
They are FAST!
Compression schemes:
Data chunks: allocate only non-empty cells, maintain bitmap.
Directory chunks: full cell allocation but no allocation for
empty sub-trees.
EDBT Summer School - Cargese 2002 71
Summary
Storage management in OLAP
SISYPHUS storage manager for OLAP
Chunk-oriented file system:
Natively multidimensional and supports hierarchies.
Clusters data hierarchically.
It is space conservative.
Adopts a location-based rather than a content-based data
addressing scheme.
Also: data-access interface can be used for defining
access paths and OLAP operations.
EDBT Summer School - Cargese 2002 72
Future Work
Experimental tests.
Design/Implementation of algorithms for typical
OLAP operations.
Other research issues:
Finding optimal bucket regions
Updating interface for common OLAP updating
operations.
Efficient file organization for dimension data
EDBT Summer School - Cargese 2002 73
Data Warehouse Servers - Outline
Server Technology: ROLAP & MOLAP
Indexing Techniques
Query Processing and Optimization
EDBT Summer School - Cargese 2002 74
Why specialized indexing
Join-intensive queries
Almost all queries demand joins of the fact table with some
dimensions
Very large tables
traditional indexes become too large to be efficient
Complex queries
selections based on complex criteria
Read-intensive workload
EDBT Summer School - Cargese 2002 75
BitMap Indexes
An alternative representation of RID-list
Advantageous for low-cardinality domains
Represent each row of a table by a bit and the
table as a bit vector
There is a distinct bit vector Bv for each value v
for the domain.
The j-th bit in the vector Bv is set if the j-th row of
the table has the value v for the column
EDBT Summer School - Cargese 2002 76
BitMap Indexes
Example: The attribute sex has values M and F.
A table of 100 million people needs 2 lists of 100
million bits
Comparison, join and aggregation operations are
reduced to bit arithmetic with dramatic
improvement in processing time
Significant reduction in space and I/O (30:1)
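A hedged illustration in Oracle-style syntax (other systems expose bitmap indexes differently or build them implicitly); the customer table and its columns are hypothetical but mirror the example on the next slide:

-- One bit vector per distinct value of each low-cardinality column
CREATE BITMAP INDEX cust_region_bix ON customer (region);
CREATE BITMAP INDEX cust_rating_bix ON customer (rating);

-- A selection like this can then be answered by ANDing the two bit vectors
SELECT COUNT(*) FROM customer WHERE region = 'W' AND rating = 'L';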
EDBT Summer School - Cargese 2002 77
BitMap Indexes
Cust Region Rating
C1 N H
C2 S M
C3 W L
C4 W H
C5 S L
C6 W L
C7 N H
RID N S E W
1 1 0 0 0
2 0 1 0 0
3 0 0 0 1
4 0 0 0 1
5 0 1 0 0
6 0 0 0 1
7 1 0 0 0
RID H M L
1 1 0 0
2 0 1 0
3 0 0 1
4 1 0 0
5 0 0 1
6 0 0 1
7 1 0 0
Base Table Region Index Rating Index
EDBT Summer School - Cargese 2002 78
BitMap Indexes
Works poorly for high-cardinality domains, since
the number of bit vectors increases
However, compression often gives good performance,
since sparsity also increases
Products that support bitmaps: Model 204,
TargetIndex (Redbrick), IQ (Sybase), Oracle
EDBT Summer School - Cargese 2002 79
Join Indexes
Traditional indexes map the value in a column to a list
of rows with that value
Join indexes maintain relationships between the primary
key and the foreign keys
Thus, join indexes relate the values of the dimensions
of a star schema to rows in the fact table.
Join indexes may span multiple dimensions
EDBT Summer School - Cargese 2002 80
Join Indexes
Join index for a single dimension:
Consider a schema with a Sales fact table and two
dimensions city and product
If there is a join index on city, then for each distinct city, the
index maintains a list of RIDs of the tuples recording sale in
that city
Example: The node Athens in the index points to the list of
RIDs in the fact table corresponding to transactions (sale) in
Athens.
Join indexes can span multiple dimensions
the node (Athens, oranges) points to the transactions that took
place in Athens and correspond to purchases of
oranges
EDBT Summer School - Cargese 2002 81
Join Indexes
RID City Amount
1 Athens $100
2 N.Y. $300
3 Rome $120
4 Athens $250
5 Rome $180
6 Rome $65
7 N.Y. $450
City Country Population
Athens Greece 3.507.000
Rome Italy 3.033.000
N.Y. USA 17.953.000
Sales table City table
City RIDs
Athens 1, 4
Rome 3, 5, 6
N.Y. 2, 7
Index on City-Sales
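One way such a structure is exposed in practice is Oracle's bitmap join index; a hedged sketch with hypothetical tables (a sales fact table referencing a city dimension by city_id):

-- Index fact rows by the city name stored in the dimension table
CREATE BITMAP INDEX sales_city_bjix
ON sales (city.city_name)
FROM sales, city
WHERE sales.city_id = city.city_id;
-- Queries restricting city.city_name can then reach the matching fact rows
-- without performing the join at query time.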
EDBT Summer School - Cargese 2002 82
Data Warehouse Servers - Outline
Server Technology: ROLAP & MOLAP
Indexing Techniques
Query Processing and Optimization
EDBT Summer School - Cargese 2002 83
Specialized Join Methods
Traditional systems limit themselves to binary
joins
results in many intermediate tables
For a query over many dimensions, the
optimization time can be substantial
EDBT Summer School - Cargese 2002 84
Specialized Join Methods
StarJoin Algorithm (Redbrick)
use join indexes to identify regions of cartesian
product that are of interest
Intelligent Scan (Redbrick)
take advantage of the “read-only” environment
Parallel Join Methods
EDBT Summer School - Cargese 2002 85
Complex Query Processing
Extensible optimization frameworks (e.g.
Starburst [IBM Almaden])
Estimation of Statistics (histograms, sampling)
Some of the ideas useful for DSS:
interleaving GroupBy and Join
Merging Views
Propagating selection through views
Optimizing nested subqueries
EDBT Summer School - Cargese 2002 86
Example of Optimizing Nested
Subqueries
Find all employees younger than 35 who earn more
than the average of their department
Alternatives:
Iterate over each employee: (1) find the department of the employee (2)
compute average salary in the department (3) check if the employee’s
salary is above the average
Compute the average salary of each department. For each employee,
check if his/her salary is above the corresponding average salary
Find the set of all departments where at least one employee is
younger than 35. Compute the average salary of only those departments. Repeat the
previous step.
EDBT Summer School - Cargese 2002 87
Rollup and Cube operators
[Gray et al., 1996] Rollup operator for nested
aggregations
rollup product, store, city produces the nested groupings
group by product, store, city
group by store, city
group by city
Cube operator for all possible combinations
cube product, store, city produces a group by for each subset of
{product, store, city}, independently of the order of columns in the statement
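In standard SQL (SQL:1999 onwards) these become GROUP BY ROLLUP / CUBE; note that SQL's ROLLUP drops grouping columns from the right, so the nested groupings listed above correspond to writing the coarsest column first. A sketch over a hypothetical sales table:

-- Groupings (city, store, product), (city, store), (city), ()
SELECT city, store, product, SUM(amount) AS total
FROM sales
GROUP BY ROLLUP (city, store, product);

-- One grouping per subset of {product, store, city} (8 groupings in total)
SELECT product, store, city, SUM(amount) AS total
FROM sales
GROUP BY CUBE (product, store, city);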
EDBT Summer School - Cargese 2002 88
The CUBE operator
Jim Gray, Adam Bosworth, Andrew Layman (Microsoft), Hamid Pirahesh (IBM)
[Figure: the data cube and its sub-space aggregates on a car-sales example (makes CHEVY and FORD, years 1990–1993, colors RED, WHITE, BLUE): the full cube plus aggregates by Make & Color, by Make & Year, by Color & Year, by Make, by Color, by Year, and the overall Sum; a cross-tab (by Make and Color) and a group by with total are shown as special cases]
Processing Star Queries on
Hierarchically-Clustered Fact Tables
Nikos Karayannidis (1), Aris Tsois (1), Timos Sellis (1), Roland Pieringer (2),
Volker Markl (4), Frank Ramsak (3), Robert Fenk (3), Klaus Elhardt (2), Rudolf Bayer (5)
(1) I.C.C.S. – N.T.U. Athens, (2) TransAction Software GmbH, (3) FORWISS,
(4) IBM Almaden Research Center, (5) T.U. München
EDBT Summer School - Cargese 2002 90
Key Points
Star queries are ubiquitous in DW and OLAP
New trend: Hierarchically clustered star-
schemata
New processing framework
New optimization challenges
Implemented in TransBase HyperCube
Tested with a real-world application (up to 40×
speed-up)
EDBT Summer School - Cargese 2002 91
EDITH
EDITH - the European Development on Indexing
Techniques for Databases with Multidimensional
Hierarchies
Information Society Technologies Programme
(IST) - grant No. IST-1999-20722.
http://edith.in.tum.de
EDBT Summer School - Cargese 2002 92
Motivation – Problem statement
Not just reports! What about ad hoc queries?
OLAP requires efficient processing of ad hoc
star queries
Major bottleneck: processing of the star join
Cartesian product, bitmap indexes, …
NOT enough:
efficiency also requires good physical clustering
of the data
EDBT Summer School - Cargese 2002 93
Hierarchical Clustering
A new trend:
hierarchical clustering of fact table data through
path-based surrogate keys
Exploitation of multidimensional indexes
Star join transforms to multidimensional range query
The overall processing framework of star queries
changes radically
EDBT Summer School - Cargese 2002 94
Contributions
Present a novel processing framework for star
queries over hierarchically clustered data
Discuss optimizations
Realization of our technology in a real system
Evaluation on a real-world application has
shown significant speed-ups.
EDBT Summer School - Cargese 2002 95
Hierarchical Surrogate Keys
Apply hierarchical encoding on each dimension
table
System-assigned h-surrogate key:
e.g., oc1(“Greece”)/oc2(“Athens”)/oc3(“Store5”)
Implementation based on underlying physical
data structure
EDBT Summer School - Cargese 2002 96
Database Schema
[Figure: fact table FT with measures m1, m2 and dimension columns d1 … dN; each dimension table Di holds its hierarchy levels h1, h2, … and feature attributes f1, f2, …, together with a hierarchical surrogate key hski; the fact table's dimension columns store these h-surrogates hsk1 … hskN]
EDBT Summer School - Cargese 2002 97
Star Queries
SELECT   {Di.hj}, {Di.fj}, {aggr(…) AS AMj}
FROM     FT, D1, …, DN
WHERE    FT.d1 = D1.h1 AND …        -- star-join conditions
         AND LOCPRED({Di}) AND …    -- dimension restrictions
         AND MPRED({FT.mi})         -- measure restrictions
GROUP BY {Di.hj}, {Di.fj}, {FT.mj}
HAVING   <having clause>
ORDER BY <ordering fields>
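A concrete instance of this template, written against the hypothetical star schema sketched earlier (sales with geography, time_dim and product dimensions):

-- Total dollar amount per product line and quarter for one region
SELECT p.prod_line_name, t.quarter_name, SUM(s.dollar_amount) AS amount
FROM sales s
JOIN product   p ON s.product_code   = p.product_code
JOIN time_dim  t ON s.time_code      = t.time_code
JOIN geography g ON s.geography_code = g.geography_code
WHERE g.region_code = 10
GROUP BY p.prod_line_name, t.quarter_name
ORDER BY t.quarter_name;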
EDBT Summer School - Cargese 2002 98
The Abstract Processing Plan
[Figure: abstract processing plan — an h-surrogate processing phase first evaluates the restricted dimensions (Create_Range operators over D1 … Dj produce h-surrogate ranges); the main execution phase then performs the multidimensional range access on FT, residual joins with the remaining dimensions Di … Dn, a group-select, and finally the order-by]
EDBT Summer School - Cargese 2002 99
Optimization Issues
Optimizing h-surrogate processing
Single tuple retrieval for hierarchical prefix path
restrictions
Exploit composite index on (hm, hm-1,…, h1, hski)
Pregrouping transformation
Reduces tuples for residual join and speeds up
grouping
Heuristic algorithm based on query syntax
EDBT Summer School - Cargese 2002 100
Pre-grouping Transformation
[Figure: pre-grouping transformation — in the original plan the group-select by month and store comes after the residual joins with the Date and Location dimensions; in the transformed plan an extra group-select on the corresponding h-surrogates (hsk1, hsk2) is pushed below the residual joins, directly after the MD range access on F, so fewer tuples reach the joins and the final grouping]
EDBT Summer School - Cargese 2002 101
Performance Evaluation
Greek electronic retailer data:
3 dims (1.4M, 27K, 2.5K) tuples
Fact table: 15.5M tuples (1.5GB)
220 ad hoc star queries from real application
Compared 3 plans: STAR, AEP and OPT
FT selectivity range: 0.0% to 5.0% of FT
Results:
AEP vs. STAR: 20× average speed-up
OPT vs. STAR: 40× average speed-up
EDBT Summer School - Cargese 2002 102
Summary
Efficient star query processing is a must in DW and OLAP
New trend: Hierarchically clustered star-schemata
Presented a novel processing framework for star
queries over hierarchically clustered data
Discussed optimization issues
Fully implemented our technology in TransBase
Evaluation with a real-world application has shown
significant speed-ups
EDBT Summer School - Cargese 2002 103
Future Work
Extensive experimental evaluation
Investigate applicability of our processing
framework to other areas
Further optimization issues: reducing the number
of produced h-surrogate ranges
EDBT Summer School - Cargese 2002 104
Metadata Repository
[Same architecture figure, here with the metadata repository highlighted]
EDBT Summer School - Cargese 2002 105
The Lack of Conceptual Support
[Figure: the lack of conceptual support — data flows from an OLTP information source through a wrapper/loader into the data warehouse and, via aggregation/customization, into a multidimensional data mart used by the OLAP analyst; the underlying conceptual models (operational department model, enterprise model) and the quality of each store (source quality, DW quality, mart quality) remain implicit]
EDBT Summer School - Cargese 2002 106
Conceptual-Logical-Physical
[Figure: the same chain seen from three perspectives — a conceptual perspective (operational department model, enterprise model, client model), a logical perspective (source schema, DW schema, client schema) and a physical perspective (source, DW and client data stores connected by wrappers and transportation agents)]
EDBT Summer School - Cargese 2002 107
The DWQ Approach
[Figure: the DWQ approach — a grid of the source, DW and client levels against the conceptual, logical and physical perspectives; the models/metadata level instantiates a meta model level and describes the real world; process models (instances of a process meta model) describe the processes that use these objects, and a quality metamodel, quality model and quality measurements are attached to all of them]
EDBT Summer School - Cargese 2002 108
DWQ Repository
The DWQ approach for managing data warehouse
quality is organized around an extended, semantically
rich metadata repository (prototypically implemented
using ConceptBase), which controls all relevant
metadata
We have developed meta models for DW architecture,
quality, processes and evolution
Metadata can be provided and queried by external
tools; via active rules, external tools can even be
activated
[Jarke et al., CAiSE98]
EDBT Summer School - Cargese 2002 109
DWQ Metadata Framework
[Figure: DWQ metadata framework — a MetaModel at the meta level; at the conceptual level, client models, the enterprise model and source models connected by conceptual links; at the logical level, client schemata, the DW interface and schema store, and source schemata with conceptual/logical mappings; at the physical level, clients, the DW, mediators, wrappers and the source data stores with physical/logical mappings and data flows]
EDBT Summer School - Cargese 2002 110
Quality Model:
An Adapted GQM Approach
[Figure: the adapted GQM approach — DW designers, decision makers and DW administrators establish quality goals; quality goals are evaluated by quality queries, which are defined on quality factors; quality factors provide evidence for the goals and are obtained by measurement processes over the metadata for DW architecture, quality and processes, i.e. over the DW objects, processes and data]
[Jarke et al., IS99]
EDBT Summer School - Cargese 2002 111
Quality Factors by Perspective
Conceptual perspective (of concepts and models):
completeness, redundancy, consistency, correctness, traceability
Logical perspective:
usefulness of schemas, correctness of mappings, interpretability of schemas
Physical perspective:
efficiency, interpretability of schemas, timeliness of stored data,
maintainability / usability of software components
EDBT Summer School - Cargese 2002 112
Towards Quality-Oriented DW
Design
[Figure: the quality-goal lifecycle — 1. Design: define object types, object instances and properties, quality factor types, and metrics and agents; 2. Evaluation: compute and acquire values for the quality factors (current status); 3. Analysis & Improvement: produce a scenario for a goal, derive "functions" empirically or analytically, feed the values into the quality scenario and play, discover/refine functions, take actions, decompose complex objects and iterate; 4. Re-evaluation & evolution: produce expected/acceptable values and negotiate]
[Vassiliadis et al., IS00]
EDBT Summer School - Cargese 2002 113
DWQ Methodology : Summary
[Figure: DWQ methodology overview — 1. build the conceptual enterprise model; 2. build conceptual source models (sources S1 … Sn with relations R1–R3, described by conjunctive queries); 3. conceptual client modeling (clients C1 … Cm as conjunctive queries over the enterprise model); 4. translate aggregates into OLAP operations, with rewriting of aggregate queries; 5. design optimization of the materialized views; 6. data reconciliation and refreshment against OLTP updates and user queries; everything is recorded in the metadata repository]
EDBT Summer School - Cargese 2002 114
Key Formal Results on Quality
Impacts
conceptual: description logic theory and tools for
complete reasoning about the relationships between
source, enterprise, and client models
conceptual/logical: containment, satisfiability, and
rewriting of queries over views with & without
aggregates
logical/physical: incremental cost-based optimization of
view materializations
physical: detailed impact analysis of replication and
refreshment policies
EDBT Summer School - Cargese 2002 115
ConceptBase User Interface
EDBT Summer School - Cargese 2002 116
DW Quality Example
EDBT Summer School - Cargese 2002 117
Metadata Standards
Metadata Coalition
MetaData Interchange Specification (MDIS)
Open Information Model (OIM)
OMG (latest development)
Common Warehouse Model (CWM)
Microsoft Repository
EDBT Summer School - Cargese 2002 118
Summary
OLAP - Multidimensional data
Drill down, Roll Up, Pivot, Slice and Dice
Data warehouse architecture
Warehouse operational process
Loading - Cleaning - Serving (ROLAP/MOLAP)
Refreshing
Warehouse server requirements
Star / Snowflake schemas
Specialized indexes: BitMap - Join Indexes
EDBT Summer School - Cargese 2002 119
Research issues
Data cleaning
focus on schema inconsistencies
Data warehouse design
summary tables, indexing
Query Processing
use summary data, statistics management, dynamic optimization
Warehouse Management
resource management, runaway queries
incremental refresh techniques
EDBT Summer School - Cargese 2002 120
References
W. H. Inmon: Building the Data Warehouse (2nd Edition),
John Wiley, 1996.
R. Kimball: The Data Warehouse Toolkit, John Wiley,
1996.
H. Garcia-Molina, Data Warehousing Overview, class
notes, Stanford University.
S. Chaudhuri & U. Dayal: Data Warehousing and OLAP
for Decision Support - VLDB’96 tutorial
Oracle, IBM, Redbrick, Sybase, Informix, Tandem,
Teradata, HP, … web sites.
The DWQ project: http://www.dbnet.ece.ntua.gr/~dwq/