An Efficient Cache Handling Technique in Database Systems

Abhishek Shah (rakeshsh@usc.edu), George Sam (gsam@usc.edu), Nikhil Lakade (lakade@usc.edu)
University of Southern California
1. ABSTRACT
In various commercial database systems, queries with complex structures often take a long time to execute. The efficiency of query processing can be greatly improved if the results of previous queries are stored in the form of caches.

These caches can then be used to answer later queries. Furthermore, because of the size of commercial databases, the cost of processing large and complex queries is high, and hence we need a way to optimize processing by automatically caching the intermediate results. Creating such an automatic caching system would save considerable processing time.
Existing cache systems do manage to store intermediate results, but they suffer from not knowing how to use the cache memory efficiently when storing them. Frequent updates to the database are also a problem, since cached results then become obsolete. It is necessary to decide when to discard a cached entry and how often to check for updates in the database.
2. INTRODUCTION
2.1 The Problem
Over the past few years there has been an increasing need for large Data Warehouses and OLAP[2] (On-Line Analytical Processing) systems to manage large-scale data. These techniques provide an efficient and economical way of storing large volumes of data and extracting useful information from it. Such systems have proven useful in many applications, most commonly in the data warehouses of super-marts such as Walmart.
The query processing time for OLAP and decision-support queries usually ranges from minutes to hours, depending mostly on the extent of the database, the type of query, and the processing capability of the servers. Large-scale servers usually require minimal time to process decision-support queries; however, processing time grows considerably if multiple queries are executed simultaneously or if the structure of the query is complex. Complex queries are built up of many sub-queries, and the result set of the final query depends on the result sets obtained by executing the sub-queries.
Traditional databases treat every query independently, which increases processing time. Moreover, if a particular query is used frequently, it has to be issued and fully re-executed every time, which introduces redundant processing.
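Purely as an illustration (not from the original paper), the following minimal Python sketch shows the idea behind reusing previous results: a frequently issued query is answered from a result cache instead of being re-executed. The `run_query` callable and the choice to key the cache by the exact query text are assumptions made for this example; a real system would normalize queries and track their dependencies.

```python
# Illustrative sketch: reuse results of previously executed queries
# instead of re-executing them every time they are issued.

class NaiveResultCache:
    def __init__(self, run_query):
        self.run_query = run_query   # callable that actually executes the SQL
        self.results = {}            # query text -> cached result rows

    def execute(self, sql):
        if sql in self.results:      # cache hit: skip redundant execution
            return self.results[sql]
        rows = self.run_query(sql)   # cache miss: execute and remember
        self.results[sql] = rows
        return rows
```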
2.2 Challenges
a. One main concern is deciding which cache entries should be deleted, and whether to delete the entire cache or only part of it. This becomes crucial if part of a cached result is needed by further queries.
b. Many commercial database systems are updated frequently. The challenge here is to keep the results of intermediate queries up to date as and when CRUD[1] operations take place on the database (see the sketch after this list).
3. MOTIVATION
Seeing how commercial databases and OLAP systems struggle to fetch query results efficiently, an automated system closely coupled with the query optimizer to cache results and manage them would provide an efficient way to retrieve data. For example, consider a website which has to fetch data from a large database every time a page is accessed by a user. The load time of the page increases significantly because of the large data volume and the relatively slow query processing time. It is also useful if the system can automatically decide when to discard or keep a particular cache entry, and handle frequent database updates.
4. RELATED WORK
4.1 Exchequer
Exchequer[1] is an intermediate query caching system developed for the purpose of storing relevant sub-query results. The authors of the paper on this system differentiate it from normal operating system caching techniques based on the following aspects:
a. In traditional cache systems, the size and cost of computation of a data object are not considered; the recency of use is considered sufficient.
b. In a query cache system, results from previously cached queries can be reused for later queries.
c. In traditional cache systems, the pages are independent of each other, and so can be easily deleted without affecting other pages.
The Exchequer system takes into account the dynamic nature of the cache, so a traditional materialized view or index solution will not work efficiently here. In the materialized view scenario, there are techniques to decide which entities to materialize, and previously materialized views are taken into consideration; this corresponds to a static cache and does not adapt as the workload changes. Exchequer instead exploits the dynamic nature of the cache. Another related system uses multi-query caching and takes into account the cost of materializing the selected views, but makes a static decision on what to materialize. The Exchequer system also uses an AND-OR DAG representation of the queries and the cached results. The use of a DAG makes it extensible to new operations and efficiently encodes alternative ways of evaluating queries. The Exchequer DAG representation also takes into account sort orders and the presence of indexes.
In the Exchequer architecture, there is a tight coupling between the cache manager and the optimizer. A query execution plan may refer to cached relations, which are obtained from the execution engine, and new intermediate results produced by the query are sometimes cached. An incremental greedy algorithm decides which results should be cached. The algorithm first checks whether any of the nodes of the chosen query plan should be cached.
The incremental decision is made by updating the representative set with the new query, and a selection algorithm is applied to the nodes selected when the previous query was considered together with the nodes of the chosen query plan. The output of this algorithm is the set of nodes marked for caching and the best plan for the current query. When the query gets executed, the nodes in its best plan that are marked are added to the cache, replacing unmarked nodes. The unmarked nodes are chosen for replacement using LCS/LRU, i.e. the largest results are evicted first, and among the remaining results the least recently used is evicted. Exchequer optimizes the query before fresh caching decisions are made, so the chosen plan for each query is optimal for that query, given the current cache contents.
4.2 Composite Information Server Platform & Query Processing Engine
The Composite Information Server Platform[2] is an intermediate server platform that works over the REST, HTTP and SOAP protocols in a client-server web application environment. It is provided by Composite Software. It receives client requests, authenticates them either through LDAP or Composite Domain Authentication, and then passes them to the Query Processing Engine. The query processing engine executes the request against one or more data sources and retrieves the results. It then combines this data into a single SQL or XML result set and returns it to the client. The Query Processing Engine provides various optimization methods so that the SQL query is executed efficiently. It translates all requests into a distribution plan, then analyzes queries and creates an optimized execution plan that determines the intermediate steps and data source requests. The Query Processing Engine also employs a caching technique for the queries. The resulting sequence of queries is then executed against the relevant data.
The engine minimizes overheads and creates efficient join methods that avoid suboptimal query execution. The techniques that the query engine provides include:
a. SQL Pushdown: The Query Processing Engine offloads most of the query processing by pushing select operations such as string searches, comparisons and sorting down into the underlying relational data sources.
b. Parallel Processing: The Query Engine processes requests in a parallel and asynchronous way on separate threads, thus reducing wait time and data source response latency.
c. Caching: The Composite Information Server can be configured to cache the results of queries, web service calls and procedures. It does this on a per view/query basis, storing the intermediate results either in a relational database or in a file-based cache. The engine always checks whether the result of a query is already present in the cache and, if so, uses the cached data. This is most useful for data which is frequently queried and rarely changes. In scenarios where the data is constantly changing, the query engine does not perform well and cannot handle frequent changes to the data.
4.3 Multidimensional Query Cache using Chunks
To improve query response time in OLAP, caching of query results has been proposed; it consists of two main approaches, table level caching and query level caching. Previous work uses chunks[3] to reuse the results of previous queries to answer future queries. To achieve good performance, the chunked cache is combined with a chunked file organization in the backend.
Chunked file organization redefines the physical organization of the relation tables. This new organization reduces the cost of a chunk cache miss. A concern with this methodology is how to select the required chunks. Smaller chunks allow more precise reuse, but efficiency degrades as the total number of chunks in the system increases; this raises the further question of how to define replacement policies for chunked caches.
4.4 Usability Based Caching
Another similar work on this topic is usability-based caching[4] of query results in OLAP systems. This method proposes a new cache management scheme for OLAP systems based on the usability of query results in the rewriting and processing of related future queries. It not only takes into consideration the queries that are currently being executed but also predicts future queries, based on the present and past queries executed on the system, using a probability model.
5. SOLUTION SYSTEM
5.1 Architecture
The architecture of our proposed optimizer is shown in Figure 1.
Fig 1. Architecture of System
The optimizer and cache manager work in close coupling with our intermediate query cache system. The optimizer uses the chunked cache to efficiently answer incoming database queries. The query execution plan and the cache management plan are created inside the Optimizer and Cache Manager, which is the block responsible for changing the current cache contents. The query execution plan is created using the cached chunks, and the chunked cache is obtained from the Execution Engine as and when required.
5.2 Use of Chunks
Systems such as OLAP are becoming increasingly dynamic and important for business data analysis. The data sets in such systems are usually multidimensional in nature. Traditional relational systems are not designed to provide the required performance for these data sets, so such systems are built using a three-tier architecture. The first tier gives an easy-to-use graphical tool that allows users to build queries. The second tier provides a multidimensional view of the data stored in the final tier, which can be an RDBMS. Queries that occur in systems like OLAP are very interactive and demand quick response times even if they are complex in nature.
At times OLAP queries are repetitive in nature and follow a predictable pattern. An OLAP session can be characterized using different kinds of locality:
1. Temporal: The same data might be accessed repeatedly by the same user or a different user.
2. Hierarchical: This kind of locality is specific to the OLAP domain and is a consequence of the presence of hierarchies on the dimensions. Data members which are related by parent/child or sibling relationships tend to be accessed over and over again. For example, if a user is looking at data for the United States, the next query is likely to be about Canada or Mexico.
We use a dynamic caching scheme in which the cache contents vary dynamically, since new items may be inserted and old items may be removed from the cache. A dynamic approach is significantly beneficial at the middle tier, since it adapts to the query profile. We use chunks for dynamic caching and demonstrate their feasibility under realistic query workloads without much overhead. We use multidimensional arrays to represent the data. Instead of storing a large array in simple rows or columns, we break it down into chunks and store it in a chunked format. The values of each dimension are divided into ranges, and chunks are created based on this division. Figure 2 shows how a multidimensional space can be broken up into chunks.
Fig 2 Chunks
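As an illustration, the sketch below (a minimal Python example; the dimension names and range boundaries are hypothetical, not taken from this paper) shows how a point in multidimensional space can be mapped to a chunk number once each dimension has been divided into ranges.

from bisect import bisect_right

# Hypothetical dimensions with range boundaries; a value v falls into the
# range index i such that boundaries[i] <= v < boundaries[i + 1].
DIMENSIONS = {
    "product_id": [0, 100, 200, 300, 400],   # 4 ranges per dimension
    "store_id":   [0, 50, 100, 150, 200],
    "day":        [0, 90, 180, 270, 365],
}

def range_index(boundaries, value):
    """Return the index of the range that contains `value`."""
    return bisect_right(boundaries, value) - 1

def chunk_number(point):
    """Map a point (dict of dimension -> value) to a single chunk number
    by combining the per-dimension range indices in row-major order."""
    number = 0
    for dim, boundaries in DIMENSIONS.items():
        ranges_per_dim = len(boundaries) - 1
        number = number * ranges_per_dim + range_index(boundaries, point[dim])
    return number

# Example: this cell falls into ranges (1, 2, 0) -> chunk 1*4*4 + 2*4 + 0 = 24
print(chunk_number({"product_id": 150, "store_id": 120, "day": 30}))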
5.3 Caching with the Chunks
In this type of caching, the query results to be stored in the cache are broken up into chunks, and the chunks are cached. When a user issues a new query, the list of chunks required to answer that query is computed. Depending on the contents of the cache, this list is divided into two parts. One part is answered from the cache; the other, consisting of the missing chunks, has to be computed from the backend. The cost is reduced because only the missing chunks are computed from the backend.
Caching chunks improves the granularity of caching. This leads to better utilization of the cache in two ways:
1. Frequently accessed chunks of a query get cached, while chunks which are not frequently accessed are eventually replaced.
2. Previous queries can be reused much more effectively. For example, Figure 3 shows a chunk-based cache in which each query represents a portion of multidimensional space. Say we have three queries Q1, Q2 and Q3, issued in that order. Q3 is not contained in Q1, in Q2, or in their union. Thus, methods based on query containment will not be able to use Q1 and/or Q2 to answer Q3. With chunk-based caching, Q3 can use the chunks it has in common with Q1 and Q2; only the remaining chunks, shown in Figure 3, have to be computed. The chunked file organization in the relational backend enables these remaining chunks to be computed in time proportional to their size rather than in time proportional to the size of Q3.
Fig 3 Reuse of Cached Chunks
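The sketch below (a minimal Python illustration; the function and variable names are our own, not from the paper) shows the split of a query's chunk list into the part answered from the cache and the part that must be computed at the backend.

def answer_query(required_chunks, cache, compute_from_backend):
    """Answer a query given the list of chunk numbers it needs.

    `cache` is a dict mapping chunk number -> chunk data, and
    `compute_from_backend` is a callable that fetches a list of missing
    chunks from the relational backend in one round trip.
    """
    hits = {c: cache[c] for c in required_chunks if c in cache}
    missing = [c for c in required_chunks if c not in cache]

    # Only the missing chunks are computed at the backend; with a chunked
    # file organization this costs time proportional to their size.
    fetched = compute_from_backend(missing) if missing else {}
    cache.update(fetched)          # newly computed chunks become cache candidates

    return {**hits, **fetched}, len(hits), len(missing)

# Example usage with a toy backend that returns placeholder data.
cache = {24: "chunk-24-data", 25: "chunk-25-data"}
result, n_hits, n_misses = answer_query(
    [24, 25, 26], cache, lambda missing: {c: f"chunk-{c}-data" for c in missing})
print(n_hits, n_misses)  # 2 hits, 1 miss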
5.4 Replacement Scheme using Chunks
The replacement scheme is a very important part of this system, since the performance on future queries depends heavily on it. Old chunks have to be removed and new chunks added to the cache for caching to remain effective. Standard replacement strategies such as LRU exist but are not efficient enough here. Schemes that use a profit metric based on the size and execution cost of a query are considered in [6]. We do something similar for our replacement scheme: we combine the TIME scheme with the notion of benefit. Let Benefit(C) denote the benefit of a chunk C. We also associate a quantity called Weight(C) with each chunk C in the cache. The replacement algorithm is as follows:
Algorithm: TimeBenefit
Input: chunk N to be inserted in the cache
while [space not available for N]:
    Let C be the chunk at the current time position
    If [Weight(C) ≤ 0]:
        Evict C from the cache
    Else:
        Weight(C) = Weight(C) - Benefit(N)
    EndIf
    Advance the time position
EndWhile
Insert N into the cache
Weight(N) = Benefit(N)
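A runnable sketch of this scheme is shown below (a minimal Python version under our own assumptions: the cache is a fixed-capacity list of chunks scanned circularly, and Benefit is supplied as a precomputed value on each chunk).

from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: int
    benefit: float          # Benefit(C), e.g. derived from size and backend cost
    weight: float = 0.0     # Weight(C), decremented as the time position passes

class TimeBenefitCache:
    """Replacement scheme combining a time position with chunk benefit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []          # cached chunks, scanned circularly
        self.hand = 0            # current time position

    def insert(self, new_chunk: Chunk):
        # Evict or age chunks until space is available for the new chunk.
        while len(self.slots) >= self.capacity:
            candidate = self.slots[self.hand]
            if candidate.weight <= 0:
                self.slots.pop(self.hand)                      # evict C
                if self.slots:
                    self.hand %= len(self.slots)
            else:
                candidate.weight -= new_chunk.benefit          # Weight(C) -= Benefit(N)
                self.hand = (self.hand + 1) % len(self.slots)  # advance time position
        new_chunk.weight = new_chunk.benefit                   # Weight(N) = Benefit(N)
        self.slots.append(new_chunk)

# Example: a 2-slot cache forces an eviction on the third insert.
cache = TimeBenefitCache(capacity=2)
for cid, benefit in [(24, 1.0), (25, 3.0), (26, 2.0)]:
    cache.insert(Chunk(cid, benefit))
print([c.chunk_id for c in cache.slots])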
5.5 DAG Representation
A DAG is a Directed Acyclic Graph. In our implementation of a cached query optimizer system, it is important to find an efficient way to represent a query, so that an optimized query plan can be found for execution; query execution is only as efficient as the query plan allows. To make structural use of query evaluation, we use the concept of directed acyclic graphs. The DAG is a compact way to represent a set of queries and operations; from it an efficient query plan can be generated, and the query DAG can further be used to drive the query caching algorithm.
An efficient algorithm using the DAG structure for queries is the Volcano algorithm [5]. This algorithm represents a query, or a set of queries, as a DAG with two kinds of nodes: AND nodes and OR nodes. AND nodes represent operations performed on result sets and queries, such as select, join and other operators. OR nodes represent queries and result sets; in our implementation the OR nodes represent the sub-queries which get cached. The OR nodes are called equivalence nodes in the Volcano algorithm. Equivalence nodes have no operational meaning and simply describe the data in the system.
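The data structure can be sketched as below (a minimal Python illustration; the class and field names are ours, not Volcano's), with equivalence (OR) nodes holding alternative operation (AND) nodes, and each operation node pointing back to the equivalence nodes it consumes.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EquivalenceNode:
    """OR node: a logical result set (relation, sub-query or intermediate result)."""
    name: str
    children: List["OperationNode"] = field(default_factory=list)  # alternative ways to compute it

@dataclass
class OperationNode:
    """AND node: an operation (select, join, ...) over result sets."""
    op: str
    inputs: List[EquivalenceNode] = field(default_factory=list)

def make_join(left: EquivalenceNode, right: EquivalenceNode, result_name: str) -> EquivalenceNode:
    """Create an equivalence node for `left ⋈ right` with one operational child."""
    result = EquivalenceNode(result_name)
    result.children.append(OperationNode("join", [left, right]))
    return result

# Base relations are equivalence nodes with no children.
A, B, C = EquivalenceNode("A"), EquivalenceNode("B"), EquivalenceNode("C")

# One expansion of A ⋈ B ⋈ C: join A and B first, then join with C.
AB = make_join(A, B, "AB")
ABC = make_join(AB, C, "ABC")

# A second alternative for the same result shares the ABC equivalence node.
BC = make_join(B, C, "BC")
ABC.children.append(OperationNode("join", [A, BC]))
print(len(ABC.children))  # 2 alternative plans encoded under one OR node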
How single queries are handled:
Single queries are directly representable in a DAG. For a single query, a query tree is first created using the relations in the query and the operations on them. Once the query tree is created, it is expanded step by step to generate further equivalence nodes over the operational nodes, as shown in the following diagrams, where squares denote equivalence nodes and circles denote operational nodes.
Let us assume the query is of the form A ⋈ B ⋈ C. Its DAG representation is built up in Fig 4.a, Fig 4.b and Fig 4.c. The relations A, B, C and the intermediate relations are represented as equivalence nodes, and the join operators as circle nodes. Fig 4.a shows the initial query tree for the query. In Fig 4.b an additional equivalence node is created for the intermediate result AB. The DAG is then expanded to accommodate all possible combinations of the join operators: in Fig 4.c we take all possible initial joins, A ⋈ B, A ⋈ C and B ⋈ C. These are stored as intermediate result sets and later used by the join query to create the final result set.
Fig 4.a - Initial Query Tree
Fig 4.b - Intermediate DAG
Fig 4.c - DAG of Single Query
How query sets are handled:
Query sets are handled a little differently in the Volcano algorithm. In this version, the intermediate equivalence nodes represent result sets, and each query set is a set of multiple queries. Deletion of queries in this case is done through a reference counting mechanism. Queries are added into the DAG one at a time, and each time a query is inserted, new equivalence and operational nodes are created as needed.
Sometimes an expression may match an existing subexpression in the DAG, and query subexpressions may be equivalent to each other. The Volcano optimizer handles these subexpression anomalies. An example is the duplication that arises from the associativity of join operators in multi-relation queries: the Volcano algorithm applies associativity and then unifies the duplicate nodes by replacing them with a single equivalence node.
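As a rough illustration of this unification step (our own minimal Python sketch, not the Volcano implementation), duplicate subexpressions can be detected by keying equivalence nodes on a canonical form of the expression they compute:

# Registry of equivalence nodes keyed by a canonical form of the expression.
# For commutative joins, sorting the operand names gives one key for A⋈B and B⋈A.
equivalence_nodes = {}

def canonical_join_key(left_name, right_name):
    return ("join",) + tuple(sorted((left_name, right_name)))

def get_or_create_equivalence(left_name, right_name):
    """Return the existing equivalence node for this join, or register a new one.
    Reusing the same node is what 'unifies' duplicate subexpressions in the DAG."""
    key = canonical_join_key(left_name, right_name)
    if key not in equivalence_nodes:
        equivalence_nodes[key] = {"name": "".join(sorted((left_name, right_name))),
                                  "children": []}
    return equivalence_nodes[key]

n1 = get_or_create_equivalence("A", "B")
n2 = get_or_create_equivalence("B", "A")   # same logical result as n1
print(n1 is n2)  # True: both queries share one equivalence node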
5.6 Query Optimization over the DAG
Now that the DAG is created, the algorithm evaluates a cost value for each node based on its type; equivalence node costs and operational node costs are evaluated separately. The optimizer also takes into account the cost of reading the input when pipelines are not used. The cost at each node is a function of its children and the subtrees below it.
For an operational node o the cost function is defined as
    cost(o) = cost of executing o + Σ_{ei ∈ children(o)} cost(ei)
The children of o are all equivalence nodes. The cost of each equivalence node e is
    cost(e) = min { cost(oi) | oi ∈ children(e) }
            = 0 if e has no children
We now have to take into account the case where some subset M of nodes is materialized and these materialized nodes may be reused. We introduce a new function reusecost(ei), which gives the cost of reusing an equivalence node ei from the materialized set M. The modified cost function is then
    cost(o) = cost of executing o + Σ_{ei ∈ children(o)} CC(ei)
where
    CC(ei) = cost(ei)                        if ei ∉ M
           = min(cost(ei), reusecost(ei))    if ei ∈ M
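These recurrences translate directly into a memoized traversal, sketched below in Python (our own minimal illustration; the node structure and example numbers are assumptions, not from the paper).

from functools import lru_cache

# Hypothetical DAG: operation nodes map to (execution_cost, child equivalence nodes),
# equivalence nodes map to the list of alternative operation nodes that produce them.
OP_NODES = {"joinAB": (10.0, ["A", "B"]), "joinAB_C": (5.0, ["AB", "C"])}
EQ_NODES = {"A": [], "B": [], "C": [], "AB": ["joinAB"], "ABC": ["joinAB_C"]}

MATERIALIZED = {"AB"}                  # the set M of cached/materialized results
REUSE_COST = {"AB": 2.0}               # reusecost(e) for nodes in M

def cost_op(o):
    exec_cost, children = OP_NODES[o]
    return exec_cost + sum(cc(e) for e in children)

@lru_cache(maxsize=None)
def cost_eq(e):
    alternatives = EQ_NODES[e]
    return min((cost_op(o) for o in alternatives), default=0.0)  # 0 if no children

def cc(e):
    # CC(e) = cost(e) if e not in M, else min(cost(e), reusecost(e))
    if e in MATERIALIZED:
        return min(cost_eq(e), REUSE_COST[e])
    return cost_eq(e)

# With AB materialized at reuse cost 2, ABC costs 5 + min(10, 2) + 0 = 7.
print(cc("ABC"))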
5.7 Algorithm to Handle Cache Deletion and Insertion
The proposed algorithm determines whether any of the nodes in the DAG and chunks in the chunk system are worth caching. We need to estimate the benefit of adding or deleting these nodes. We therefore define benefit functions for DAG nodes, similar to the Benefit(N) function used in the chunk system; the benefit function also takes into account the number of times the previous query was used. The proposed optimizer needs to know the nodes selected when the previous query was considered, as well as all the nodes of the current query plan. Now, if S is a set of nodes selected to be cached from a representative set R of queries, then
    cost(R, S) = Σ_{q ∈ R} cost(q, S) * weight(q)
We now define the benefit function:
    benefit(R, x, S) = cost(R, S) - (cost(R, {x} ∪ S) + cost(x, S))
This is the benefit obtained by adding node x to the set of cached nodes. In the case where x has already been computed, we take cost(x, S) to be 0.
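A direct transcription of these two formulas is shown below (a minimal Python sketch; cost_of_query is assumed to be the DAG cost function of the previous section evaluated with the given set of cached nodes).

def cost_R(queries, cached, cost_of_query, weight):
    """cost(R, S) = sum over q in R of cost(q, S) * weight(q)."""
    return sum(cost_of_query(q, cached) * weight(q) for q in queries)

def benefit(queries, x, cached, cost_of_query, weight, x_already_computed=False):
    """benefit(R, x, S) = cost(R, S) - (cost(R, {x} ∪ S) + cost(x, S))."""
    cost_x = 0.0 if x_already_computed else cost_of_query(x, cached)
    return (cost_R(queries, cached, cost_of_query, weight)
            - (cost_R(queries, cached | {x}, cost_of_query, weight) + cost_x))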
We can now create a modified algorithm that will
handle cache deletions and insertions.
Algorithm: TimeBenefit
Input: chunk N to be inserted in the cache,
       set X of cached nodes of the expanded DAG,
       node x with benefit(R, x, X)
while [space not available for N]:
    Let C be the chunk at the current time position
    If [Weight(C) ≤ 0]:
        Evict C from the cache
        Delete x and its equivalence nodes from X
    Else:
        Weight(C) = Weight(C) - Benefit(N)
    EndIf
    Advance the time position
EndWhile
Insert N into the cache
Weight(N) = Benefit(N)
Algorithm: Modify_Cache
Input: expanded DAG for R, the representative set of queries,
       the set of candidate equivalence nodes for caching,
       and chunk N to be inserted
Output: set of nodes to be cached
X = φ
Y = set of candidate equivalence nodes for caching
while (Y ≠ φ):
    Among the nodes y ∈ Y such that size({y} ∪ X) < CacheSize,
    pick the node x with the highest benefit(R, x, X) / size(x)
        /* i.e., highest benefit per unit space */
    if (benefit(R, x, X) < 0):
        break    /* no further benefits to be had, stop */
    TimeBenefit(N, X, x)
    Y = Y - {x}; X = X ∪ {x}
return X
The Modify_Cache algorithm thus handles the deletion of nodes and chunks from the optimizer and maintains an effective cache for the query workload.
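A compact executable sketch of this greedy selection is given below (a minimal Python version under our own assumptions: benefit and size are supplied as callables, and the interaction with TimeBenefit is reduced to collecting the selected nodes).

def modify_cache(candidates, cache_size, size, benefit_fn):
    """Greedy selection of equivalence nodes to cache.

    `candidates` is the set of candidate equivalence nodes, `size(x)` the space
    a node needs, and `benefit_fn(x, selected)` the benefit of adding x given the
    nodes already selected. Returns the set of nodes marked for caching.
    """
    selected, used = set(), 0
    remaining = set(candidates)
    while remaining:
        # Consider only nodes that still fit in the cache.
        fitting = [y for y in remaining if used + size(y) <= cache_size]
        if not fitting:
            break
        # Pick the node with the highest benefit per unit space.
        x = max(fitting, key=lambda y: benefit_fn(y, selected) / size(y))
        if benefit_fn(x, selected) < 0:
            break                      # no further benefit to be had
        selected.add(x)
        used += size(x)
        remaining.discard(x)
    return selected

# Toy usage: node "AB" is large but valuable, "BC" is cheap, "AC" has no benefit.
sizes = {"AB": 6, "BC": 2, "AC": 3}
benefits = {"AB": 12.0, "BC": 3.0, "AC": -1.0}
print(modify_cache(sizes, cache_size=8, size=sizes.get,
                   benefit_fn=lambda x, sel: benefits[x]))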
5.8 Handling Frequent Updates
Our solution system also handles frequent updates made to the database and keeps the cache coherent with it. As new data keeps getting added to the database, the cache needs to be modified, and the affected chunks in the chunk system need to be discarded or updated.
To do this we create a sub-module which acts as a proxy between the database and the client. As and when a client proposes changes to the database, the data first enters the proxy system. The proxy then determines which relations, and which attributes in each relation, need to be modified in the database. The proxy module is also connected to the Optimizer and Cache Manager, which ensures that chunks are mapped onto the respective relation attributes in the proxy. Whenever a new update enters the proxy module, it creates a map pointer to the affected chunks in the chunk system.
The proxy module then determines which chunks in the chunk system need to be modified. Once this mapping is created, it sends the update on to the database to be materialized. Further analysis of the proxy system is left for future work.
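The proxy's bookkeeping can be sketched as below (a minimal Python illustration with hypothetical relation and chunk identifiers; the paper does not specify this interface), where each update is mapped to the chunks that overlap the touched relation attributes before being forwarded to the database.

class UpdateProxy:
    """Sits between the client and the database, invalidating affected chunks."""

    def __init__(self, chunk_map, cache, database):
        # chunk_map: (relation, attribute) -> set of chunk numbers built over it
        self.chunk_map = chunk_map
        self.cache = cache          # chunk number -> cached chunk data
        self.database = database    # callable that applies the update for real

    def apply_update(self, relation, attributes, update):
        # 1. Determine which cached chunks the update touches.
        affected = set()
        for attr in attributes:
            affected |= self.chunk_map.get((relation, attr), set())
        # 2. Discard (or mark for recomputation) the affected chunks.
        for chunk_no in affected:
            self.cache.pop(chunk_no, None)
        # 3. Forward the update to the database to be materialized.
        self.database(relation, update)
        return affected

# Toy usage: an update to ORDERS.quantity invalidates chunks 24 and 26.
proxy = UpdateProxy(
    chunk_map={("ORDERS", "quantity"): {24, 26}},
    cache={24: "...", 25: "...", 26: "..."},
    database=lambda rel, upd: None)
print(proxy.apply_update("ORDERS", ["quantity"], {"order_id": 7, "quantity": 3}))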
6. EVALUATION OF RESULTS
6.1 Evaluating the use of DAG
The DAG based approach is also followed by the Exchequer system. In this method each query is turned into a set of nodes to be evaluated as a DAG. This DAG is then expanded to analyze the nodes, and the operations on the nodes are performed one by one. We then obtain the nodes of the DAG to be cached and the nodes which need to be deleted from the DAG.
The query structure taken for evaluation is of the following type:
SELECT SUM(QUANTITY)
FROM ORDERS, SUPPLIER, PART, CUSTOMER, TIME
WHERE join-list AND select-list
GROUP BY groupby-list;
It has a central ORDERS fact table and dimension tables PART, SUPPLIER, CUSTOMER and TIME. The sizes of these tables are the same as those used in [1]. The join-list is used to enforce equality between attributes of the ORDERS fact table and the dimension primary keys. The select-list is generated by selecting 0 to 2 attributes from the join-list. The groupby-list is generated by picking a random subset of all the keys. These queries are chosen so that a fair comparison can be made.
The metric used to measure the goodness of the algorithm is the total response time of the set of queries. The report is generated for a sequence of 100 queries, after an initial 50 queries have been allowed to warm up the cache. This total response time is expressed as the estimated cost, calculated using the cost functions described in sections 5.6 and 5.7.
For the evaluation, the representative set size is initially set to 10, and we check different cache sizes of around 5%, 10% and 20% of the database.
We compare our algorithm with the Exchequer system and with a baseline system that does not use our caching scheme and instead relies on the LRU method for cache management. The LRU policy is found widely in DBMS systems; it picks the least recently used chunk and DAG node to be replaced. With LRU, the system is unaware of the workload.
Analyzing the Estimated Response Time:
In this part, the cache size is initially kept at a minimum and the algorithms are run on the query set. Both Exchequer's algorithm and our Modify_Cache perform better than the NoCache system. When the size of the cache is increased to accommodate 5% of the database, the algorithms show an improvement in performance. This is because the larger cache makes it easier for the algorithms to find chunks which can answer queries on their own. With small caches, it becomes costlier and slower to find chunks that can answer a query and to apply the replacement methods for the DAG and the chunks.
For low cache sizes the Modify_Cache algorithm performs better than the Exchequer algorithm, with a higher rate of improvement. However, as the cache size is increased to 30%, the estimated cost increases for the Exchequer system: beyond a certain cache size, the extra results cached in the additional space no longer contribute to the benefit factor. With the cache at 30% of the database size, the Modify_Cache algorithm still returns a better estimated time than Exchequer's algorithm.
6.2 Evaluating for Chunks
We considered various performance measures to evaluate the effectiveness of the schemes we have employed.
1. Average execution time: using our system we executed 100 queries and calculated the average execution time.
2. Cost Saving Ratio (CSR): this performance measure is the percentage of the total cost of the queries that is saved due to hits in the cache. The cost of executing a query at the backend is used to compute the savings due to a cache hit. Consider a query stream consisting of a mix of n queries q1, q2, ..., qn, and let
    ci = the cost of executing query qi at the backend,
    hi = the number of references to qi satisfied by the cache,
    ri = the total number of references to query qi.
The cost saving ratio is then
    CSR = Σi (ci * hi) / Σi (ci * ri)
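For concreteness, the ratio can be computed as in the short sketch below (a minimal Python illustration; the example numbers are made up, not the paper's measurements).

def cost_saving_ratio(costs, cache_hits, references):
    """CSR = sum(ci * hi) / sum(ci * ri) over all queries qi."""
    saved = sum(c * h for c, h in zip(costs, cache_hits))
    total = sum(c * r for c, r in zip(costs, references))
    return saved / total

# Hypothetical stream of three queries: backend costs, cache hits and references.
print(cost_saving_ratio(costs=[10.0, 4.0, 6.0],
                        cache_hits=[3, 1, 0],
                        references=[4, 2, 1]))  # ≈ 0.63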
Comparing the CSR of a query-based cache and a chunk-based cache, we found that the query-based system gives a value of 0.42 because of redundant storage in the cache, while for the chunk-based system the CSR was 0.98, showing that the cache storage was not redundant.
7. FUTURE WORK
This concept of caching query results can be optimized further. Our Modify_Cache algorithm currently uses only the Benefit and Weight functions to evaluate the utility of cached chunks. It can be improved by accommodating other properties of the chunks, and by taking into account the different operations that can be performed on OLAP databases. Further work can be done on improving the running time of our algorithm, on making better use of the DAGs, and on improving the running time of the DAG expansions. The time and additional space required for the DAGs to store query cache information play a crucial role and need to be taken into account in further work. More work is also needed to determine how frequent updates of the databases can be handled. Our algorithm, at this point in time, cannot efficiently handle high-frequency cache updates; work is needed to implement advanced methods to identify high-frequency cache updates and thereby maintain the efficiency and consistency of the cache.
8. CONCLUSION
It can be seen that the use of chunks and a DAG in implementing a cache management system proves useful. The performance of the query engine improves with the use of the Modify_Cache algorithm. The estimated time to run the queries also decreases substantially with the use of query caching and the use of chunks to store cached results. The use of chunks in caching helps make better use of large data stores and reduces the time taken to run OLAP queries on them.
9. REFERENCES
[1] Prasan Roy, Krithi Ramamritham, S. Seshadri, Pradeep Shenoy, S. Sudarshan. Don't Trash your Intermediate Results, Cache 'em. IIT Bombay. http://arxiv.org/abs/cs.DB/0003005
[2] Composite Data Virtualization, White Paper, Composite Software. http://cdn.information-management.com/media/pdfs/CompositeDVPerformance.pdf
[3] Prasad M. Deshpande, Karthikeyan Ramasamy, Amit Shukla, Jeffrey F. Naughton. Caching Multidimensional Queries Using Chunks. University of Wisconsin, Madison.
[4] Chang-Sup Park, Myoung Ho Kim, Yoon-Joon Lee. Usability-based caching of query results in OLAP systems. http://www.cin.ufpe.br/~bmcr/public_html/Usability-based_caching_of_query.pdf
[5] Goetz Graefe and William J. McKenna. Extensibility and Search Efficiency in the Volcano Optimizer Generator. Technical Report CU-CS-91-563, University of Colorado at Boulder, December 1991.
[6] P. Scheuermann, J. Shim and R. Vingralek. WATCHMAN: A Data Warehouse Intelligent Cache Manager. VLDB Conf. 1996.