Mc0077 – advanced database systems


Master of Computer Application (MCA) – Semester 4
MC0077 – Advanced Database Systems

Published in: Technology
Advanced Database Systems

Question 1. List and explain the various Normal Forms. How does BCNF differ from the Third and Fourth Normal Forms?

Normal Forms
Relations are classified by the types of anomalies to which they are vulnerable. A database that is only in first normal form is vulnerable to all types of anomalies, while a database in domain/key normal form has no modification anomalies. Normal forms are hierarchical: the lowest level is the first normal form, and a database cannot meet the requirements of a higher normal form without first meeting all the requirements of the lower ones.

First Normal Form
Any table that qualifies as a relation is said to be in first normal form. To be considered relational, the cells of the table must contain only single values; repeating groups or arrays are not allowed as values. All entries in a column must be of the same kind, and each column must have a unique name. Each row in the table must be unique. Databases that are only in first normal form are the weakest and suffer from all modification anomalies.

Second Normal Form
A relation is in second normal form if it is in first normal form and every non-key attribute depends on the whole key rather than on part of it. This normal form eliminates partial dependencies, so it only matters for relations with composite keys.

Third Normal Form
A database is in third normal form if it meets the criteria for second normal form and has no transitive dependencies, that is, no non-key attribute depends on the key only through another non-key attribute.

Boyce-Codd Normal Form
A database that meets the third normal form criteria and in which every determinant is a candidate key is said to be in Boyce-Codd Normal Form (BCNF). BCNF is stricter than 3NF: 3NF tolerates a functional dependency whose determinant is not a candidate key as long as the dependent attribute is itself part of a candidate key, whereas BCNF allows no such exception. It therefore removes the remaining anomalies caused by functional dependencies.

Fourth Normal Form
Fourth Normal Form (4NF) is an extension of BCNF that covers multi-valued dependencies as well as functional dependencies.
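The BCNF test described above comes down to asking whether the determinant of each dependency is a superkey, which can be checked by computing an attribute closure. The following is a minimal sketch under stated assumptions: the relation Enrolment(student, course, instructor) and its two functional dependencies are hypothetical, and the function names are illustrative only. The check uses the usual superkey formulation of the BCNF condition.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its whole left-hand side is already covered.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(schema, fds):
    """List non-trivial FDs whose left-hand side is not a superkey."""
    violations = []
    for lhs, rhs in fds:
        trivial = set(rhs) <= set(lhs)
        if not trivial and not closure(lhs, fds) >= set(schema):
            violations.append((lhs, rhs))
    return violations


# Hypothetical relation: Enrolment(student, course, instructor).
schema = {"student", "course", "instructor"}
fds = [
    (("student", "course"), ("instructor",)),  # the key determines the instructor
    (("instructor",), ("course",)),            # each instructor teaches one course
]

violations = bcnf_violations(schema, fds)
print(violations)
```

In this made-up schema the dependency instructor -> course is reported: its determinant is not a superkey, so the relation is not in BCNF, even though course belongs to the candidate key (student, course) and the relation therefore still satisfies 3NF.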
A schema is in 4NF if the left-hand side of every non-trivial functional or multi-valued dependency is a super-key.

Domain/Key Normal Form
The domain/key normal form is the Holy Grail of relational database design. It is achieved when every constraint on the relation is a logical consequence of the definition of keys and domains, so that enforcing the key and domain constraints causes all constraints to be met. Thus, it avoids all non-temporal anomalies. It is much easier to build a database in domain/key normal form than it is to convert lesser databases, which may contain numerous anomalies. However, successfully building a domain/key normal form database remains a difficult task, even for experienced database programmers. Thus, while the domain/key normal form eliminates the problems found in most databases, it tends to be the most costly normal form to achieve. Failing to achieve it, however, may carry long-term, hidden costs, because anomalies appear over time in databases that adhere only to lower normal forms.

Question 2. Describe the concepts of Structural Semantic Data Model (SSM).

A data model in software engineering is an abstract model that describes how data are represented and accessed. Data models formally define data elements and the relationships among data elements for a domain of interest. According to Hoberman (2009), "A data model is a wayfinding tool for both
business and IT professionals, which uses a set of symbols and text to precisely explain a subset of real information to improve communication within the organization and thereby lead to a more flexible and stable application environment."

A data model explicitly determines the structure of data. Typical applications of data models include database models, the design of information systems, and enabling the exchange of data. Data models are usually specified in a data modeling language. Communication and precision are the two key benefits that make a data model important to applications that use and exchange data. A data model is the medium through which project team members from different backgrounds and with different levels of experience can communicate with one another. Precision means that the terms and rules in a data model can be interpreted in only one way and are not ambiguous. A data model is sometimes referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.

A semantic data model in software engineering is a technique for defining the meaning of data within the context of its interrelationships with other data. It is an abstraction which defines how the stored symbols relate to the real world, and it is sometimes called a conceptual data model. The logical data structure of a database management system (DBMS), whether hierarchical, network, or relational, cannot totally satisfy the requirements for a conceptual definition of data, because it is limited in scope and biased toward the implementation strategy employed by the DBMS. The need to define data from a conceptual view has therefore led to the development of semantic data modeling techniques.
The real world, in terms of resources, ideas, events, etc., is symbolically defined within physical data stores. A semantic data model is an abstraction which defines how the stored symbols relate to the real world. Thus, the model must be a true representation of the real world.

Data modeling in software engineering is the process of creating a data model by applying formal data model descriptions using data modeling techniques. Data modeling is a technique for defining the business requirements for a database. It is sometimes called database modeling because a data model is eventually implemented in a database. The figure (not reproduced here) illustrates the way data models are developed and used today. A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an activity model. The data model will normally consist of entity types, attributes, relationships, integrity rules, and the definitions of those objects. This is then used as the starting point for interface or database design.

Data architecture is the design of data for use in defining the target state and the subsequent planning needed to reach that target state. It is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.

Question 3. Describe the following with respect to Object Oriented Databases:

a. Query Processing in Object-Oriented Database Systems
One of the criticisms of first-generation object-oriented database management systems (OODBMSs) was their lack of declarative query capabilities. This led some researchers to brand first generation (network and hierarchical) DBMSs as object-oriented [Ullman 1988]. It was commonly believed that the application domains that OODBMS technology targets do not need querying capabilities.
This belief no longer holds, and declarative query capability is accepted as one of the fundamental features of OODBMSs [Atkinson et al. 1989; Stonebraker et al. 1990]. Indeed, most of the current prototype systems experiment with powerful query languages and investigate their optimization. Commercial products have started to include such languages as well (e.g., O2 [Deux et al. 1991], ObjectStore [Lamb et al. 1991]).

In this chapter we discuss the issues related to the optimization and execution of OODBMS query languages (which we collectively call query processing). Query optimization techniques are dependent upon the query model and language. For example, a functional query language lends itself to functional optimization which is quite different from the algebraic, cost-based optimization techniques employed
in relational as well as a number of object-oriented systems. The query model, in turn, is based on the data (or object) model since the latter defines the access primitives which are used by the query model. These primitives, at least partially, determine the power of the query model. Despite this close relationship, in this chapter we do not consider issues related to the design of object models, query models, or query languages in any detail. Language design issues are discussed elsewhere in this book. The interrelationship between object and query models is discussed in [Blakeley 1991; Ozsu and Straube 1991; Ozsu et al. 1993; Yu and Osborn 1991].

Almost all object query processors proposed to date use optimization techniques developed for relational systems. However, there are a number of issues that make query processing more difficult in OODBMSs. The following are some of the more important issues:

1. Type system. Relational query languages operate on a simple type system consisting of a single aggregate type: relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems. The results of object algebra operators are usually sets of objects (or collections) whose members may be of different types. If the object languages are closed under the algebra operators, these heterogeneous sets of objects can be operands to other operators. This requires the development of elaborate type inferencing schemes to determine which methods can be applied to all the objects in such a set. Furthermore, object algebras often operate on semantically different collection types (e.g., set, bag, list), which imposes additional requirements on the type inferencing schemes to determine the type of the results of operations on collections of different types.

2. Encapsulation. Relational query optimization depends on knowledge of the physical storage of data (access paths), which is readily available to the query optimizer. The encapsulation of methods with the data that they operate on in OODBMSs raises (at least) two issues. First, estimating the cost of executing methods is considerably more difficult than estimating the cost of accessing an attribute according to an access path. In fact, optimizers have to worry about optimizing method execution, which is not an easy problem because methods may be written using a general-purpose programming language. Second, encapsulation raises issues related to the accessibility of storage information by the query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special application that can break encapsulation and access information directly [Cluet and Delobel 1992]. Others propose a mechanism whereby objects "reveal" their costs as part of their interface [Graefe and Maier 1988].

b. Query Processing Architecture
In this section we focus on two architectural issues: the query processing methodology and the query optimizer architecture.

1 Query Processing Methodology
A query processing methodology similar to that of relational DBMSs, but modified to deal with the difficulties discussed in the previous section, can be followed in OODBMSs. The figure (not reproduced here) depicts such a methodology, proposed in [Straube and Ozsu 1990a]. The steps of the methodology are as follows. Queries are expressed in a declarative language which requires no user knowledge of object implementations, access paths or processing strategies.
The calculus expression is first normalized by calculus optimization, and the remaining steps follow in sequence. The steps and the intermediate results they produce are:

1. Calculus optimization: declarative query to normalized calculus expression.
2. Calculus-algebra transformation: normalized calculus expression to object algebra expression.
3. Type check: object algebra expression to type-consistent expression.
4. Algebra optimization: type-consistent expression to optimized algebra expression.
5. Execution plan generation: optimized algebra expression to execution plan.

2 Optimizer Architecture
Query optimization can be modeled as an optimization problem whose solution is the choice of the "optimum" state in a state space (also called search space). In query optimization, each state corresponds to an algebraic query indicating an execution schedule and
represented as a processing tree. The state space is a family of equivalent (in the sense of generating the same result) algebraic queries. Query optimizers generate and search a state space using a search strategy, applying a cost function to each state and finding one with minimal cost. In this chapter we are mostly concerned with cost-based optimization, which is arguably the more interesting case. Thus, to characterize a query optimizer three things need to be specified:

1. The search space and the transformation rules that generate the alternative query expressions which constitute the search space;
2. A search algorithm that allows one to move from one state to another in the search space; and
3. The cost function that is applied to each state.

Many existing OODBMS optimizers are either implemented as part of the object manager on top of a storage system, or they are implemented as client modules in a client-server architecture. In most cases, these aspects are "hardwired" into the query optimizer. Given that extensibility is a major goal of OODBMSs, one would hope to develop an extensible optimizer that accommodates different search strategies, different algebra specifications with their different transformation rules, and different cost functions. Rule-based query optimizers provide a limited amount of extensibility by allowing the definition of new transformation rules. However, they do not allow extensibility in other dimensions. In this section we discuss some new promising proposals for extensibility in OODBMSs.

The Open OODB project [Wells et al. 1992] at Texas Instruments concentrates on the definition of an open architectural framework for OODBMSs and on the description of the design space for these systems. Query processing in Open OODB is described in [Blakeley et al. 1993]. The query module is an example of intra-module extensibility in Open OODB.
The query optimizer, built using the Volcano optimizer generator, is extensible with respect to algebraic operators, logical transformation rules, execution algorithms, implementation rules (i.e., logical operator to execution algorithm mappings), cost estimation functions, and physical property enforcement functions (e.g., presence of objects in memory). The clean separation between the user query language parsing structures and the operator graph on which the optimizer operates allows the replacement of the user language or optimizer. The separation between algebraic operators and execution algorithms allows exploration with alternative methods for implementing algebraic operators. Code generation is also a well-defined subcomponent of the query module, which facilitates porting the query module to work on top of other OODBMSs. The Open OODB query processor includes a query execution engine containing efficient implementations of scan, indexed scan, hybrid-hash join [Shapiro 1986], and complex object assembly [Keller et al. 1991].

The EPOQ project [Mitchell et al. 1993] is another approach to query optimization extensibility, where the search space is divided into regions. Each region corresponds to an equivalent family of query expressions that are reachable from each other. The regions are not necessarily mutually exclusive and differ in the queries that they manipulate, the control (search) strategy that they use, the query transformation rules that they incorporate, and the optimization objectives they achieve. For example, one region may cover transformation rules that deal with simple select queries, while another region may deal with transformations for nested queries. Similarly, one region may have the objective of minimizing a cost function, while another region may attempt to transform queries into some desirable form. Each region may be nested to a number of levels, allowing hierarchical search within a region.
Since the regions do not represent equivalence classes, there is a need for a global control strategy to determine how the query optimizer moves from one region to another. The feasibility and effectiveness of this approach remain to be verified.

The TIGUKAT project [Peters et al. 1992] uses an object-oriented approach to query processing extensibility.
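The three ingredients of an optimizer named earlier in this section (a search space, a search strategy and a cost function) can be made concrete with a toy example. The sketch below is not taken from any of the systems cited; it assumes a hypothetical three-relation join, invented cardinalities and selectivities, a search space limited to left-deep join orders, and an exhaustive search strategy.

```python
from itertools import permutations

# Hypothetical statistics: relation cardinalities and pairwise join selectivities.
CARD = {"Employee": 10_000, "Dept": 100, "Project": 1_000}
SEL = {
    frozenset({"Employee", "Dept"}): 0.01,
    frozenset({"Employee", "Project"}): 0.001,
    frozenset({"Dept", "Project"}): 1.0,  # no join predicate: a cross product
}


def search_space(relations):
    """Search space: every left-deep join order (all yield the same result)."""
    return permutations(relations)


def cost(order):
    """Cost function: sum of the estimated intermediate result sizes."""
    joined = [order[0]]
    size = CARD[order[0]]
    total = 0.0
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            size *= SEL[frozenset({prev, rel})]
        joined.append(rel)
        total += size
    return total


def optimize(relations):
    """Search strategy: exhaustively cost every state and keep the cheapest."""
    return min(search_space(relations), key=cost)


best = optimize(["Employee", "Dept", "Project"])
print(best, cost(best))
```

Because each of the three functions is separate, any one of them could be swapped (for example, a randomized search instead of exhaustive enumeration) without touching the others, which is the kind of extensibility the proposals above aim for.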
Question 4. Describe the Differences between Distributed & Centralized Databases.

A distributed database is a database that is under the control of a central database management system (DBMS) in which the storage devices are not all attached to a common CPU. It may be stored in multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers. Collections of data (e.g. in a database) can be distributed across multiple physical locations. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. The replication and distribution of databases improves database performance at end-user worksites.

Two processes keep distributed databases up to date: replication and duplication. Replication uses specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. Depending on the size and number of the distributed databases, replication can be very complex and can require a lot of time and computer resources. Duplication, on the other hand, is not as complicated. It identifies one database as a master and then duplicates that database, normally at a set time after hours, so that each distributed location has the same data. In the duplication process, changes are allowed to the master database only, which ensures that local data will not be overwritten. Both processes can keep the data current in all distributed locations.

Besides distributed database replication and fragmentation, there are many other distributed database design technologies, for example local autonomy and synchronous and asynchronous distributed database technologies.
The implementation of these technologies depends on the needs of the business and the sensitivity/confidentiality of the data to be stored in the database, and hence on the price the business is willing to pay to ensure data security, consistency and integrity.

Basic architecture
A database user accesses the distributed database through:
Local applications: applications which do not require data from other sites.
Global applications: applications which do require data from other sites.
A distributed database does not share main memory or disks.

A centralized database, by contrast, keeps all its data in one place, whereas a distributed database keeps data in different places. Because all the data reside in one place, a centralized database can become a bottleneck, and data availability is lower than in a distributed database. The advantages of distributed databases listed below make the difference between the two clearer.

Advantages of Data Distribution
The primary advantage of distributed database systems is the ability to share and access data in a reliable and efficient manner.

Data sharing and Distributed Control: If a number of different sites are connected to each other, then a user at one site may be able to access data that is available at another site. For example, in a distributed banking system, it is possible for a user in one branch to access data in another branch. Without this capability, a user wishing to transfer funds from one branch to another would have to resort to some external mechanism for such a transfer. This external mechanism would, in effect, be a single centralized database. The primary advantage of accomplishing data sharing by means of data distribution is that each site is able to retain a degree of control over data stored locally. In a centralized system, the database
administrator of the central site controls the database. In a distributed system, there is a global database administrator responsible for the entire system. A part of these responsibilities is delegated to the local database administrator for each site. Depending upon the design of the distributed database system, each local administrator may have a different degree of autonomy, which is often a major advantage of distributed databases.

Question 5. Explain the following:

a. Query Optimization
Generally, the query optimizer cannot be accessed directly by users: once queries are submitted to the database server and parsed by the parser, they are passed to the query optimizer, where optimization occurs. However, some database engines allow guiding the query optimizer with hints.

A query is a request for information from a database. It can be as simple as "finding the address of a person with SS# 123-45-6789," or more complex like "finding the average salary of all the employed married men in California between the ages 30 to 39, that earn less than their wives." Query results are generated by accessing the relevant database data and manipulating it in a way that yields the requested information. Since database structures are complex, in most cases, and especially for all but the simplest queries, the data needed for a query can be collected from a database by accessing it in different ways, through different data structures, and in different orders. Each way typically requires a different processing time. Processing times of the same query may vary widely, from a fraction of a second to hours, depending on the way selected. The purpose of query optimization, which is an automated process, is to find the way to process a given query in minimum time.
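The point that one query can be answered through different access paths can be observed directly in a small engine. The sketch below uses Python's standard sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement; the person table, its columns and the index name are hypothetical, and the exact wording of the plan text varies between SQLite versions, so no particular output is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (ssn TEXT, name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(f"{i:09d}", f"name{i}", f"{i} Main St") for i in range(1000)],
)

QUERY = "SELECT address FROM person WHERE ssn = ?"


def plan(connection):
    """Return the optimizer's plan description for QUERY as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("000000042",))
    # The fourth column of each row holds the human-readable plan step.
    return "; ".join(row[3] for row in rows)


plan_without_index = plan(conn)  # only a full table scan is available
conn.execute("CREATE INDEX idx_person_ssn ON person (ssn)")
plan_with_index = plan(conn)     # the optimizer can now pick the index

print(plan_without_index)
print(plan_with_index)
```

The query text is identical in both calls; only the set of available access paths changes, and the optimizer chooses between them without any involvement from the user.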
The large possible variance in time justifies performing query optimization, though finding the exact optimal way to execute a query, among all possibilities, is typically very complex, time-consuming in itself, possibly too costly, and often practically impossible. Thus query optimization typically tries to approximate the optimum by comparing several common-sense alternatives, in order to provide in a reasonable time a "good enough" plan which typically does not deviate much from the best possible result.

b. Text Retrieval Using SQL3/Text Retrieval
SQL3 supports storage of multimedia data, such as text documents, in an O-R database using the blob/clob data types. However, the standard SQL3 specification does not include support for processing the media content, such as indexing or querying. Thus it is not possible to use standard SQL3 to locate documents based on an analysis of their content. Therefore, most of the larger OR-DBMS vendors (IBM, Oracle, Ingres, Postgres ...) have used the SQL3 UDT/UDF functionality to extend their OR-DBMS with management systems for media data. The approach used has been to add their own or purchased specialized media management systems onto the basic OR-DBMS. Basically, the functionality that is new to SQL3 includes:

Indexing routines for the various types of media data, as discussed in Ch. 6, for example using:
o Content terms for text data and
o Color, shape, and texture features for image data.
Selection operators for the SQL3 WHERE clause for specification of selection criteria for media retrieval.
Text processing sub-systems for similarity evaluation and result ranking.

Unfortunately, the result of this 'independent' activity is non-standard OR-DBMS/MM (multimedia) systems that differ in the functionality included and limit data retrieval across multiple OR-DBMS types. For example, unified access to data stored in Oracle and DB2 systems is difficult, both in query formulation and in result presentation.
Since actual SQL3/TextRetrieval syntax varies between OR-DBMS/MM implementations, the examples used in the following are given in generic SQL3/TextRetrieval statements.
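Independently of any vendor syntax, the indexing and ranking functionality listed above can be illustrated outside the database. The following is a minimal sketch in Python, not SQL3/TextRetrieval: it builds an inverted index over content terms and ranks documents by how often they contain the query terms. The three documents, the tokenizer and the scoring rule are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical documents standing in for clob values stored in a table.
DOCS = {
    1: "Election report: the candidate won the northern districts.",
    2: "Weather report for the northern coast.",
    3: "The candidate presented a new election programme.",
}


def terms(text):
    """Lower-case content terms; a real system would also drop stop words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Inverted index: term -> {document id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in terms(text):
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Rank documents by the summed frequency of the query terms."""
    scores = Counter()
    for term in terms(query):
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    return [doc_id for doc_id, _ in scores.most_common()]


index = build_index(DOCS)
results = search(index, "election candidate")
print(results)
```

An OR-DBMS text extension would typically expose the equivalent of search as a selection operator usable in the WHERE clause, with the index maintained by the system rather than by application code.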
8.1 Text Document Retrieval
Multimedia documents can be complex, but are basically unstructured. They can consist of the raw text only, or have a few fixed attributes with one or more semi-structured or unstructured components. For example, a news report for an election could include the following components:

1. Identifier, date, and author(s) of the report,
2. n* text blocks (titles, abstract, content text),
3. m* images (example: image_of_candidate),
4. k* charts, and
5. x* maps,

where n, m, k, and x are the number of occurrences of each component type.

Note that the document elements listed in pt. 1 above function as context metadata for the report, while the text itself can function as semantic metadata for the image materials (Rønnevik, 2005). The figure (not reproduced here) illustrates elements of a semi-structured document. The original Grieg site also contains a list of references/links which gives access to other multimedia documents about the composer, including some of his music. Since an OR-DB can contain text documents such as web pages, SQL3 should be extended with processing operators that support access to each of the element types listed above.

Question 6. Describe the following:

a. Data Mining Functions
Data mining functions can be divided into two categories: supervised (directed) and unsupervised (undirected).

Supervised functions are used to predict a value; they require the specification of a target (known outcome). Targets are either binary attributes indicating yes/no decisions (buy/don't buy, churn or don't churn, etc.) or multi-class targets indicating a preferred alternative (color of sweater, likely salary range, etc.). Naive Bayes for classification is a supervised mining algorithm.

Unsupervised functions are used to find the intrinsic structure, relations, or affinities in data. Unsupervised mining does not use a target. Clustering algorithms can be used to find naturally occurring groups in data.

Data mining can also be classified as predictive or descriptive.
Predictive data mining constructs one or more models; these models are used to predict outcomes for new data sets. Predictive data mining functions are classification and regression. Naive Bayes is one algorithm used for predictive data mining.

Descriptive data mining describes a data set in a concise way and presents interesting characteristics of the data. Descriptive data mining functions are clustering, association models, and feature extraction. k-Means clustering is an algorithm used for descriptive data mining.

Different algorithms serve different purposes; each algorithm has advantages and disadvantages. A given algorithm can be used to solve different kinds of problems. For example, k-Means clustering is unsupervised data mining; however, if you use k-Means clustering to assign new records to a cluster, it performs predictive data mining. Similarly, decision tree classification is supervised data mining; however, the decision tree rules can be used for descriptive purposes.

Oracle Data Mining supports the following data mining functions:

Supervised data mining:
o Classification: Grouping items into discrete classes and predicting which class an item belongs to
o Regression: Approximating and forecasting continuous values
o Attribute Importance: Identifying the attributes that are most important in predicting results
o Anomaly Detection: Identifying items that do not satisfy the characteristics of "normal" data (outliers)

Unsupervised data mining:
o Clustering: Finding natural groupings in the data
o Association models: Analyzing "market baskets"
o Feature extraction: Creating new attributes (features) as a combination of the original attributes

Oracle Data Mining permits mining of one or more columns of text data. Oracle Data Mining also supports specialized sequence search and alignment algorithms (BLAST) used to detect similarities between nucleotide and amino acid sequences.

b. Data Mining Techniques
Several core techniques that are used in data mining describe the type of mining and data recovery operation. Unfortunately, the different companies and solutions do not always share terms, which can add to the confusion and apparent complexity. Let's look at some key techniques and examples of how to use different tools to build the data mining.

Association
Association (or relation) is probably the best known, most familiar and most straightforward data mining technique. Here, you make a simple correlation between two or more items, often of the same type, to identify patterns. For example, when tracking people's buying habits, you might identify that a customer always buys cream when they buy strawberries, and therefore suggest that the next time they buy strawberries they might also want to buy cream. Association or relation-based data mining can be built with a variety of tools. For example, within InfoSphere Warehouse a wizard provides configurations of an information flow that is used in association by examining your database input source, decision basis, and output.

Classification
You can use classification to build up an idea of the type of customer, item, or object by describing multiple attributes to identify a particular class. For example, you can easily classify cars into different types (sedan, 4x4, convertible) by identifying different attributes (number of seats, car shape, driven wheels). Given a new car, you might assign it to a particular class by comparing its attributes with the known definition.
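The car example can be sketched in a few lines. This is a deliberately simple illustration under stated assumptions, not a method taken from any particular tool: the class definitions and attribute values are invented, and a new car is assigned to the class whose definition shares the most attribute values with it.

```python
# Hypothetical class definitions built from cars whose type is already known.
CLASS_DEFINITIONS = {
    "sedan": {"seats": 5, "shape": "three-box", "driven_wheels": 2},
    "4x4": {"seats": 5, "shape": "two-box", "driven_wheels": 4},
    "convertible": {"seats": 2, "shape": "open-top", "driven_wheels": 2},
}


def classify(car):
    """Assign the class whose definition shares the most attribute values."""
    def matches(definition):
        return sum(car.get(attr) == value for attr, value in definition.items())

    return max(CLASS_DEFINITIONS, key=lambda name: matches(CLASS_DEFINITIONS[name]))


new_car = {"seats": 4, "shape": "open-top", "driven_wheels": 2}
print(classify(new_car))
```

A decision tree or Naive Bayes model plays the same role as classify here, except that the class definitions are learned from labelled examples instead of being written by hand.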
You can apply the same principles to customers, for example by classifying them by age and social group. Additionally, you can use classification as a feeder to, or the result of, other techniques. For example, you can use decision trees to determine a classification. Clustering allows you to use common attributes in different classifications to identify clusters.

Clustering
By examining one or more attributes or classes, you can group individual pieces of data together to form a structured view. At a simple level, clustering uses one or more attributes as the basis for identifying a cluster of correlating results. Clustering is useful for identifying different information because it correlates with other examples, so you can see where the similarities and ranges agree. Clustering can work both ways: you can assume that there is a cluster at a certain point and then use your identification criteria to see if you are correct. In this example, a sample of sales data compares the age of the customer to the size of the sale. It is not unreasonable to expect that people in their twenties (before marriage and kids), fifties, and sixties (when the children have left home) have more disposable income.
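The age-versus-sale-size example can be tried with a plain k-means loop. The sketch below is illustrative only: the (age, sale size) pairs are invented to mirror the expectation stated above, the starting centroids play the role of the assumed cluster positions, and math.dist requires Python 3.8 or later.

```python
from math import dist

# Hypothetical (customer age, sale size) pairs.
SALES = [(22, 310), (25, 290), (27, 330), (38, 120), (41, 100),
         (44, 130), (55, 300), (58, 340), (63, 320)]


def kmeans(points, centroids, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then
    move every centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Starting guesses: assume a cluster at each of these points, then check.
centroids, clusters = kmeans(SALES, [(25, 300), (40, 100), (60, 300)])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

With this made-up data the guesses are confirmed: the customers in their twenties and those in their fifties and sixties form two high-spending clusters, with a lower-spending group around age forty in between.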