Master of Computer Application (MCA) – Semester 4 MC0077
MC0077 – Advanced Database Systems

Question 1 - List and explain various Normal Forms. How does BCNF differ from the Third Normal Form and the Fourth Normal Form?

First Normal Form - First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal form is the minimal requirement. First normal form deals with the "shape" of a record type: all occurrences of a record type must contain the same number of fields. First normal form excludes variable repeating fields and groups.

Second Normal Form - Second normal form (2NF) is a normal form used in database normalization. A table that is in first normal form (1NF) must meet additional criteria to qualify for second normal form. Specifically, a table is in 2NF if and only if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table. A non-prime attribute of a table is an attribute that is not part of any candidate key of the table. Put simply, a table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent either on the whole of a candidate key or on another non-prime attribute. When a 1NF table has no composite candidate keys (candidate keys consisting of more than one attribute), the table is automatically in 2NF. Second and third normal forms deal with the relationship between non-key and key fields.

Third Normal Form - Third normal form (3NF) is a normal form used in database normalization.
A table is in 3NF if and only if both of the following conditions hold: the relation R (table) is in second normal form (2NF), and every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every candidate key of R.

Fourth Normal Form - Under fourth normal form, a table cannot have more than one independent multivalued column. A multivalued column is one where a single entity can have more than one value for that column.

Fifth Normal Form - Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others. Fifth normal form is reached by removing any columns that can be reconstructed from smaller pieces of data that can be maintained with less redundancy.

Difference between BCNF and Third Normal Form
Both 3NF and BCNF are normal forms used in relational databases to minimize redundancy in tables. In a table that is in BCNF, for every non-trivial
functional dependency of the form A → B, A is a superkey, whereas a table that complies with 3NF must be in 2NF, and every non-prime attribute must depend directly on every candidate key of that table. BCNF is a stronger normal form than 3NF; it was developed to capture some anomalies that 3NF does not. Obtaining a table that complies with BCNF may require decomposing a table that is in 3NF. This decomposition results in additional join operations when executing queries, which increases computation time. On the other hand, tables that comply with BCNF have fewer redundancies than tables that comply only with 3NF.

Difference between BCNF and Fourth Normal Form
● A database must already be in 3NF before it can be taken to BCNF, and it must be in both 3NF and BCNF before it can reach 4NF.
● A table in fourth normal form has no multi-valued dependencies, whereas a table in BCNF may still contain multi-valued dependencies.

Question 2 - What are the differences between Centralized and Distributed Database Systems? List the relative advantages of data distribution.

A distributed database is a database that is under the control of a central database management system (DBMS) in which storage devices are not all attached to a common CPU. It may be stored in multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers. Collections of data (e.g. in a database) can be distributed across multiple physical locations. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. The replication and distribution of databases improves database performance at end-user worksites. Two processes keep distributed databases up to date: replication and duplication.
Replication involves using specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. Depending on the size and number of the distributed databases, replication can be very complex and can require a lot of time and computer resources. Duplication, on the other hand, is not as complicated. It identifies one database as a master and then duplicates that database. Duplication is normally done at a set time after hours, to ensure that each distributed location has the same data. In the duplication process, changes are allowed to the master database only, so that local data will not be overwritten. Both processes can keep the data current in all distributed locations.

Besides distributed database replication and fragmentation, there are many other distributed database design technologies, for example local autonomy and synchronous and asynchronous distributed database technologies. How these technologies are implemented depends on the needs of the business and the sensitivity/confidentiality of the data to be stored in the database, and hence on the price the business is willing to pay to ensure data security, consistency and integrity.

A database user accesses the distributed database through:
Local applications: applications which do not require data from other sites.
Global applications: applications which do require data from other sites.

A distributed database does not share main memory or disks. A centralized database keeps all its data in one place, whereas a distributed database keeps its data in different places. Because all the data in a centralized database resides in one place, bottlenecks can occur, and data availability is lower than in a distributed database.

Advantages of Data Distribution
The primary advantage of distributed database systems is the ability to share and access data in a reliable and efficient manner.

1. Data Sharing and Distributed Control: If a number of different sites are connected to each other, then a user at one site may be able to access data that is available at another site. For example, in a distributed banking system, a user in one branch can access data in another branch. Without this capability, a user wishing to transfer funds from one branch to another would have to resort to some external mechanism for such a transfer. This external mechanism would, in effect, be a single centralized database. The primary advantage of accomplishing data sharing by means of data distribution is that each site is able to retain a degree of control over data stored locally. In a centralized system, the database administrator of the central site controls the database. In a distributed system, there is a global database administrator responsible for the entire system. A part of these responsibilities is delegated to the local database administrator for each site. Depending upon the design of the distributed database system, each local administrator may have a different degree of autonomy, which is often a major advantage of distributed databases.

2. Reliability and Availability: If one site fails in a distributed system, the remaining sites may be able to continue operating.
In particular, if data are replicated at several sites, a transaction needing a particular data item may find it at any of those sites. Thus, the failure of a site does not necessarily imply the shutdown of the system. The failure of one site must be detected by the system, and appropriate action may be needed to recover from the failure. The system must no longer use the services of the failed site. Finally, when the failed site recovers or is repaired, mechanisms must be available to integrate it smoothly back into the system. Although recovery from failure is more complex in distributed systems than in a centralized system, the ability of most of the system to continue operating despite the failure of one site results in increased availability. Availability is crucial for database systems used for real-time applications.

3. Speedup of Query Processing: If a query involves data at several sites, it may be possible to split the query into subqueries that can be executed in parallel by several sites. Such parallel computation allows for faster processing of a user's query. In those cases in which data is replicated, queries may be directed by the system to the least heavily loaded sites.

Question 3 - Describe the concepts of Structural Semantic Data Model (SSM).

A data model in software engineering is an abstract model that describes how data are represented and accessed. Data models formally define data elements and relationships among data elements for a domain of interest. A data model explicitly determines the structure of data or structured data. Typical applications of data models include database models, design of information systems, and enabling exchange of data. Usually data models are specified in a data modeling language. Communication and precision are the two key benefits that make a data model important to applications that use and exchange data. A
data model is the medium through which project team members from different backgrounds and with different levels of experience can communicate with one another. Precision means that the terms and rules in a data model can be interpreted in only one way and are not ambiguous. A data model is sometimes referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.

A semantic data model in software engineering is a technique to define the meaning of data within the context of its interrelationships with other data. A semantic data model is sometimes called a conceptual data model. The logical data structure of a database management system (DBMS), whether hierarchical, network, or relational, cannot totally satisfy the requirements for a conceptual definition of data because it is limited in scope and biased toward the implementation strategy employed by the DBMS. The need to define data from a conceptual view has therefore led to the development of semantic data modeling techniques. As illustrated in the figure, the real world, in terms of resources, ideas, events, etc., is symbolically defined within physical data stores. A semantic data model is an abstraction which defines how the stored symbols relate to the real world. Thus, the model must be a true representation of the real world.

Data modeling in software engineering is the process of creating a data model by applying formal data model descriptions using data modeling techniques. Data modeling is a technique for defining business requirements for a database. It is sometimes called database modeling because a data model is eventually implemented in a database.
Data architecture is the design of data for use in defining the target state and the subsequent planning needed to reach that target state. It is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture. Data architecture describes the data structures used by a business and/or its applications. There are descriptions of data in storage and data in motion; descriptions of data stores, data groups and data items; and mappings of those data artifacts to data qualities, applications, locations, etc. Essential to realizing the target state, data architecture describes how data is processed, stored, and utilized in a given system. It provides criteria for data processing operations that make it possible to design data flows and also to control the flow of data in the system.

Question 4 - Describe the following with respect to Object Oriented Databases: a) Query Processing in Object-Oriented Database Systems b) Query Processing Architecture

a. Query Processing in Object-Oriented Database Systems
One of the criticisms of first-generation object-oriented database management systems (OODBMSs) was their lack of declarative query capabilities. This led some researchers to brand first generation (network and hierarchical) DBMSs as object-oriented. It was commonly believed that the application domains that OODBMS technology targets do not need querying capabilities. This belief no longer holds, and declarative query capability is accepted as one of the fundamental features of an OODBMS. Indeed, most of the current prototype systems experiment with powerful query languages and investigate their
optimization. Commercial products, e.g. O2 and ObjectStore, have started to include such languages as well.

Query optimization techniques are dependent upon the query model and language. For example, a functional query language lends itself to functional optimization, which is quite different from the algebraic, cost-based optimization techniques employed in relational as well as a number of object-oriented systems. The query model, in turn, is based on the data (or object) model, since the latter defines the access primitives which are used by the query model. These primitives, at least partially, determine the power of the query model. Despite this close relationship, in this unit we do not consider issues related to the design of object models, query models, or query languages in any detail.

Almost all object query processors proposed to date use optimization techniques developed for relational systems. However, there are a number of issues that make query processing more difficult in OODBMSs. The following are some of the more important issues:

Type System - Relational query languages operate on a simple type system consisting of a single aggregate type: relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems. The results of object algebra operators are usually sets of objects (or collections) whose members may be of different types. If the object languages are closed under the algebra operators, these heterogeneous sets of objects can be operands to other operators.

Encapsulation - Relational query optimization depends on knowledge of the physical storage of data (access paths), which is readily available to the query optimizer. The encapsulation of methods with the data that they operate on in OODBMSs raises (at least) two issues.
First, estimating the cost of executing methods is considerably more difficult than estimating the cost of accessing an attribute according to an access path. In fact, optimizers have to worry about optimizing method execution, which is not an easy problem because methods may be written in a general-purpose programming language. Second, encapsulation raises issues related to the accessibility of storage information by the query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special application that can break encapsulation and access information directly.

Complex Objects and Inheritance - Objects usually have complex structures where the state of an object references other objects. Accessing such complex objects involves path expressions. The optimization of path expressions is a difficult and central issue in object query languages.

Object Models - OODBMSs lack a universally accepted object model definition. Even though there is some consensus on the basic features that need to be supported by any object model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems. As a result, the numerous projects that experiment with object query processing follow quite different paths and are, to a certain degree, incompatible, making it difficult to benefit from the experiences of others.
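
The path expressions mentioned under Complex Objects and Inheritance can be illustrated with a small Python sketch. The classes Employee, Department and Manager and the helper eval_path are hypothetical names chosen for illustration only; they do not come from any particular OODBMS. The sketch only shows what evaluating a path such as department.manager.name involves (one object dereference per step), which hints at why optimizing such traversals is costly.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Manager:
    name: str


@dataclass
class Department:
    title: str
    manager: Optional[Manager] = None


@dataclass
class Employee:
    name: str
    department: Optional[Department] = None


def eval_path(obj: Any, path: str) -> Any:
    """Follow a dotted path one reference at a time.

    Returns None as soon as a reference along the path is missing.
    """
    for attr in path.split("."):
        if obj is None:
            return None
        obj = getattr(obj, attr)
    return obj


# Hypothetical sample objects: some references are deliberately missing.
employees = [
    Employee("Asha", Department("Research", Manager("Ravi"))),
    Employee("Ben", Department("Sales")),
    Employee("Chen"),
]

# Evaluate the same path expression against every object in the collection.
managers = [eval_path(e, "department.manager.name") for e in employees]
```

Each step of the path is a separate object access, so a naive evaluator performs one traversal per object per step; a real optimizer would look for indexes or join-like strategies instead.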
b. Query Processing Architecture
A query processing methodology similar to that of relational DBMSs, but modified to deal with the difficulties discussed above, can be followed. The steps of the methodology are as follows.
1. Queries are expressed in a declarative language
2. It requires no user knowledge of object implementations, access paths or processing strategies
3. The calculus expression is first
4. Calculus Optimization
5. Calculus to Algebra Transformation
6. Type check
7. Algebra Optimization
8. Execution Plan Generation
9. Execution

Question 5 - Describe the Differences between Distributed & Centralized Databases.

1 Centralized Control vs. Decentralized Control - In centralized control, one "database administrator" ensures the safety of data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", who has central responsibility for the whole of the data, along with "local database administrators", who are responsible for the local databases.

2 Data Independence - In centralized databases, data independence means that the actual organization of data is transparent to the application programmer. Programs are written with a "conceptual" view of the data (called the "conceptual schema"), and they are unaffected by the physical organization of data. In distributed databases, another aspect, "distribution transparency", is added to the notion of data independence as used in centralized databases. Distribution transparency means that programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another; however, their speed of execution is affected.

3 Reduction of Redundancy - In centralized databases, redundancy is reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, (b) storage space is saved. Reduction of redundancy is obtained by data sharing.
In distributed databases, data redundancy is desirable because (a) locality of applications can be increased if data is replicated at all sites where applications need it, and (b) the availability of the system can be increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

4 Complex Physical Structures and Efficient Access - In centralized databases, complex access structures such as secondary indexes and interfile chains are used. All these features provide efficient access to data. In distributed databases efficient access requires accessing
data from different sites. For this, an efficient distributed data access plan is required, which can either be written by the programmer or produced automatically by an optimizer. Problems faced in the design of an optimizer can be classified into two categories: a) global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites; b) local optimization consists of deciding how to perform the local database accesses at each site.

5 Integrity, Recovery and Concurrency Control - A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two threats to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution requires synchronization amongst the transactions, which is much harder in distributed systems.

6 Privacy and Security - In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed. In distributed databases, local administrators face the same problem as well as two new aspects of it: (a) security (protection) problems are intrinsic to distributed database systems because they rely on communication networks; (b) in certain databases with a high degree of "site autonomy", the owners of local data may feel more protected because they can enforce their own protections instead of depending on a central database administrator.

7 Distributed Query Processing - The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query.
In theory a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disk contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communication costs for each link in the network are relevant, as are variable processing capabilities and loadings for different nodes, and (where data fragments are replicated) trade-offs between cost and currency.

8 Distributed Directory (Catalog) Management - Catalogs for distributed databases contain information such as fragmentation descriptions, allocation descriptions, mappings to local names, access method descriptions, statistics on the database, and protection and integrity constraints (consistency information), which are more detailed than those of centralized databases.

Question 6 - Describe the following: a) Data Mining Functions b) Data Mining Techniques

a) Data Mining Functions
Data mining refers to the broadly defined set of techniques for finding meaningful patterns, or information, in large amounts of raw data. At a very high level, data mining is performed in the following stages (note that the terminology and the steps taken in the data mining process vary by practitioner):
1. Data collection: gathering the input data you intend to analyze
2. Data scrubbing: removing incomplete records and filling in missing values where appropriate
3. Pre-testing: determining which variables might be important for inclusion during the analysis stage
4. Analysis/Training: analyzing the input data to look for patterns
5. Model building: drawing conclusions from the analysis phase and determining a mathematical model to be applied to future sets of input data
6. Application: applying the model to new data sets to find meaningful patterns

Data mining can be used to classify or cluster data into groups, or to predict likely future outcomes based upon a set of input variables/data.

b) Data Mining Techniques
Several major data mining techniques have been developed and used in data mining projects.

Association - Association is one of the best-known data mining techniques. In association, a pattern is discovered based on the relationship of a particular item to other items in the same transaction. For example, the association technique is used in market basket analysis to identify which products customers frequently purchase together.

Classification - Classification is a classic data mining technique based on machine learning. Classification assigns each item in a set of data to one of a predefined set of classes or groups.

Clustering - Clustering is a data mining technique that automatically forms meaningful or useful clusters of objects that have similar characteristics. Unlike classification, in which objects are assigned to predefined classes, clustering also defines the classes and then puts objects in them.

Prediction - Prediction, as its name implies, is a data mining technique that discovers relationships among independent variables and relationships between dependent and independent variables.

Sequential Patterns - Sequential pattern analysis is a data mining technique that seeks to discover similar patterns in transaction data over a business period.
The uncovered patterns are used for further business analysis to recognize relationships among data.

Artificial neural networks - These are non-linear, predictive models that learn through training. Although they are powerful predictive modeling techniques, some of the power comes at the expense of ease of use and deployment.

Decision trees - These are tree-shaped structures that represent sets of decisions. The decisions generate rules, which are then used to classify data. Decision trees are the favored technique for building understandable models.

The nearest-neighbor method - This method classifies dataset records based on similar data in a historical dataset.
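
The nearest-neighbor method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the function name knn_classify, the choice of k = 3, Euclidean distance, and the tiny "historical" dataset are all invented for the example.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_classify(history, query, k=3):
    """Label `query` by majority vote among its k closest historical records.

    `history` is a list of (feature_tuple, label) pairs.
    """
    # Sort historical records by distance to the query and keep the k nearest.
    nearest = sorted(history, key=lambda rec: dist(rec[0], query))[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: two numeric features and a class label.
history = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"),
    ((7.0, 6.0), "high"),
    ((6.5, 7.5), "high"),
]

label_a = knn_classify(history, (1.2, 1.4))
label_b = knn_classify(history, (6.8, 6.9))
```

Here the new record (1.2, 1.4) sits among the "low" records and (6.8, 6.9) among the "high" ones, so each takes the label of its neighbors. Unlike a decision tree, no explicit model is built: all the work happens at classification time by comparing against the historical data.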