2. UNIT V
ADVANCED TOPICS
Distributed Databases: Architecture, Data Storage,
Transaction Processing – Object-based Databases:
Object Database Concepts, Object-Relational features,
ODMG Object Model, ODL, OQL - XML Databases:
XML Hierarchical Model, DTD, XML Schema,
XQuery – Information Retrieval: IR Concepts,
Retrieval Models, Queries in IR systems.
3. Distributed Databases
A distributed database is a set of
interconnected databases distributed
over a computer network or the internet.
A distributed database management system (DDBMS)
manages the distributed database and provides
mechanisms that make the distribution
transparent to users.
4. Distributed Databases
Features
Databases in the collection are logically interrelated with
each other. Often they represent a single logical database.
Data is physically stored across multiple sites.
The processors in the sites are connected via a network.
A distributed database is not a loosely connected file
system.
5. Distributed Databases
Advantages:
Fast data processing
Reliability and availability
Reduced operating cost
Easier to expand
Improved sharing ability and local autonomy.
6. Distributed Databases
Disadvantages:
Complex to manage and control.
Security issues must be carefully managed.
The system requires deadlock handling during
transaction processing.
Standardization is needed.
7. Distributed Databases
Homogeneous Distributed Database:
In this, all sites have identical database
management system software.
In such a system, local sites surrender a portion of
their autonomy in terms of their right to change
schemas or database management system software.
8. Distributed Databases
Homogeneous Distributed Database:
This software must also cooperate with other sites
in exchanging information about transactions, to
make transaction processing possible across
multiple sites.
It appears to the user as a single system.
9. Distributed Databases
Heterogeneous Distributed Database:
In this, different sites may use different schemas, and
different database management system software.
The sites may not be aware of one another, and they
may provide only limited facilities for cooperation in
transaction processing.
10. Distributed Databases
Data Storage:
Replication: System maintains multiple copies of
data, stored in different sites, for faster retrieval
and fault tolerance
Fragmentation: Relation is partitioned into several
fragments stored in distinct sites
11. Distributed Databases
Data Replication:
The process of storing separate copies of the database
at two or more sites.
Full Replication: Entire relation is stored at all the
sites.
Partial Replication: Only some fragments of relation
are replicated on the sites.
13. Distributed Databases
Data Replication – Disadvantages:
Increased Storage Requirements
Increased Cost and Complexity of Data Updating
14. Distributed Databases
Data Fragmentation:
A division of relation r into fragments r1, r2,
r3…rn which contain sufficient information to
reconstruct relation r.
15. Distributed Databases
Data Fragmentation – Vertical Fragmentation:
The fields or columns of a table are grouped into
fragments.
In order to maintain reconstructiveness, each
fragment should contain the primary key field(s) of
the table.
16. Distributed Databases
Data Fragmentation – Vertical Fragmentation:
Example: Student(RollNo, Marks, City)
select RollNo, Marks from Student
select RollNo, City from Student
17. Distributed Databases
Data Fragmentation – Horizontal Fragmentation:
In this approach, each tuple of r is assigned to one or
more fragments.
If relation R is fragmented into fragments r1 and r2,
then R is reconstructed by taking the union of these
fragments.
18. Distributed Databases
Data Fragmentation – Horizontal
Fragmentation:
Example:
select * from Student where Marks > 50 and
City = 'chennai'
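The two fragmentation styles and their reconstruction rules can be sketched in Python. This is only an illustration: the sample rows are invented, and RollNo is assumed to be the primary key of Student.

```python
# Hypothetical Student rows: (RollNo, Marks, City); RollNo is assumed to be the key.
student = [
    (1, 72, "chennai"),
    (2, 45, "madurai"),
    (3, 88, "chennai"),
]

# Vertical fragments: each one keeps the primary key RollNo.
v1 = [(roll, marks) for roll, marks, _ in student]
v2 = [(roll, city) for roll, _, city in student]

# Reconstruct the relation by joining the vertical fragments on RollNo.
city_by_roll = dict(v2)
rebuilt_vertical = [(roll, marks, city_by_roll[roll]) for roll, marks in v1]


# Horizontal fragments: each tuple is placed according to a predicate.
def in_h1(row):
    return row[1] > 50 and row[2] == "chennai"


h1 = [row for row in student if in_h1(row)]
h2 = [row for row in student if not in_h1(row)]

# Reconstruct the relation by taking the union of the horizontal fragments.
rebuilt_horizontal = sorted(set(h1) | set(h2))
```

Vertical fragments are recombined with a join on the key, while horizontal fragments are recombined with a union.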
20. Distributed Databases
Transaction Processing – Transaction
Manager:
Maintaining a log for recovery purposes
Participating in coordinating the concurrent
execution of the transactions executing at that site
21. Distributed Databases
Transaction Processing – Transaction
Coordinator:
Starting the execution of transactions that
originate at the site.
Distributing subtransactions at appropriate sites for
execution
23. Distributed Databases - Transaction
Processing
Two Phase Commit Protocol :
Atomicity is an important property of
transaction processing.
Either the transaction executes completely or it
does not execute at all.
24. Distributed Databases - Transaction
Processing
Two Phase Commit Protocol:
A transaction which executes at multiple sites
must either be committed at all the sites, or aborted
at all the sites.
Not acceptable to have a transaction committed at
one site and aborted at another.
27. Distributed Databases - Transaction
Processing
Two Phase Commit Protocol:
Phase 1: Obtaining Decision or Voting Phase:
Step 1: Coordinator site Ci asks all participants to
prepare to commit T.
Ci adds the record <prepare T> to the log and forces the log
to stable storage.
It then sends prepare T messages to all participating sites.
28. Distributed Databases - Transaction
Processing
Two Phase Commit Protocol:
Phase 1: Obtaining Decision or Voting Phase:
[Figure: the coordinating site Ci writes <Prepare, T> to its log and sends <Prepare, T> to participating sites S2, S3 and S4.]
29. Distributed Databases - Transaction
Processing
Two Phase Commit Protocol:
Phase 1: Obtaining Decision or Voting Phase:
Step 2: Upon receiving message, transaction
manager at participating site determines if it can
commit the transaction.
30. Distributed Databases - Transaction
Processing
Two Phase Commit Protocol:
Phase 1: Obtaining Decision or Voting Phase:
[Figure: participating sites S2, S3 and S4 log their votes and reply to the coordinating site Ci; two sites log <Ready, T> and send <Ready, T>, one site logs <No, T> and sends <abort, T>.]
31. Distributed Databases - Transaction
Processing
Two Phase Commit Protocol:
Phase 1: Obtaining Decision or Voting Phase:
If T cannot be committed, the site adds a record <no T> to the
log and sends an abort T message to Ci.
If T can be committed, the site:
adds the record <ready T> to the log
forces all records for T to stable storage
sends a ready T message to Ci.
32. Distributed Databases - Transaction
Processing
Two Phase Commit Protocol:
Phase 2: Recording Decision Phase:
Ci adds the decision record <commit T> or <abort
T>, to the log and forces record onto stable
storage.
34. Distributed Databases - Transaction
Processing
Two Phase Commit Protocol:
Phase 2: Recording Decision Phase:
Ci sends a message to each participant informing it
of the decision.
Participants take appropriate action locally.
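The two phases can be sketched as a small in-memory simulation. This is a simplified illustration, not a real protocol implementation: messages are plain function calls, logs are Python lists, and the function name two_phase_commit and the vote callbacks are invented for the example.

```python
def two_phase_commit(coordinator_log, participants, txn="T"):
    """Simulate 2PC; participants maps a site name to a callable
    that returns True if that site is able to commit."""
    # Phase 1: the coordinator logs <prepare T>, then each site votes.
    coordinator_log.append(f"<prepare {txn}>")
    site_logs = {}
    votes = {}
    for site, can_commit in participants.items():
        ok = can_commit()
        # Each site logs its vote before replying to the coordinator.
        site_logs[site] = [f"<ready {txn}>" if ok else f"<no {txn}>"]
        votes[site] = ok

    # Phase 2: commit only if every participant voted ready.
    decision = "commit" if all(votes.values()) else "abort"
    coordinator_log.append(f"<{decision} {txn}>")
    for site in participants:
        # Participants record the decision and act on it locally.
        site_logs[site].append(f"<{decision} {txn}>")
    return decision, site_logs
```

A single "no" vote is enough to abort the transaction at every site, which is what preserves atomicity across sites.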
36. Distributed Databases - Transaction
Processing
Failure of Site – Failure of Participating Sites:
When a failed participating site Si recovers, it
examines its own log to decide the fate of each
transaction that was executing at the time of the
failure.
37. Distributed Databases - Transaction
Processing
Failure of Site – Failure of Participating Sites:
Log contains <commit T> record: site executes redo (T)
Log contains <abort T> record: site executes undo (T)
Log contains <ready T> record: site must consult Ci to
determine the fate of T.
If T committed, redo (T)
If T aborted, undo (T)
38. Distributed Databases - Transaction
Processing
Failure of Site – Failure of Participating Sites:
Log contains no control records concerning T: this
implies that Sk failed before responding to the prepare
T message from Ci.
Since the failure of Sk precludes the sending of such a
response, Ci must abort T.
Sk must execute undo (T)
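The recovery cases for a participating site can be written as one decision function. This is a sketch under simple assumptions: the log is a list of record strings, and ask_coordinator is a hypothetical callback that returns "commit" or "abort".

```python
def recover_participant(log, ask_coordinator, txn="T"):
    """Return the action a recovering site takes for transaction txn."""
    if f"<commit {txn}>" in log:
        return "redo"
    if f"<abort {txn}>" in log:
        return "undo"
    if f"<ready {txn}>" in log:
        # The site voted ready, so only the coordinator knows the outcome.
        return "redo" if ask_coordinator(txn) == "commit" else "undo"
    # No control records: the site never replied, so T was aborted.
    return "undo"
```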
39. Distributed Databases - Transaction
Processing
Failure of Site – Failure of Coordinator Sites:
If an active site contains a <commit T> record in
its log, then T must be committed.
If an active site contains an <abort T> record in its
log, then T must be aborted.
40. Distributed Databases - Transaction
Processing
Failure of Site – Failure of Coordinator Sites:
If some active participating site does not contain a <ready T>
record in its log, then the failed coordinator Ci cannot have
decided to commit T. Can therefore abort T.
If none of the above cases holds, then all active sites must have a
<ready T> record in their logs, but no additional control records
(such as <abort T> or <commit T>). In this case active sites must
wait for Ci to recover to learn the decision.
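The rules the active sites follow when the coordinator fails can be sketched in the same way. The function name and the dictionary of per-site logs are assumptions made for illustration.

```python
def decide_without_coordinator(active_logs, txn="T"):
    """active_logs maps each active site name to its list of log records."""
    logs = list(active_logs.values())
    if any(f"<commit {txn}>" in log for log in logs):
        return "commit"
    if any(f"<abort {txn}>" in log for log in logs):
        return "abort"
    if any(f"<ready {txn}>" not in log for log in logs):
        # Some site never voted ready, so the coordinator cannot have committed.
        return "abort"
    # Every active site is ready but none knows the decision: blocking case.
    return "wait"
```

The final "wait" result is the blocking problem of two phase commit: the active sites cannot decide safely until the coordinator recovers.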
41. Distributed Databases - Transaction
Processing
Three Phase Commit Protocol:
Assumptions:
No network partitioning
At any point, at least one site must be up
At most k sites can fail.
43. Distributed Databases - Transaction
Processing
Three Phase Commit Protocol – Phase I:
Coordinator asks all participants to prepare to
commit transaction Ti. The coordinator then makes
the decision about commit or abort based on the
response from all the participating sites.
44. Distributed Databases - Transaction
Processing
Three Phase Commit Protocol – Phase II:
Coordinator makes a decision as in Two Phase
Commit, called the pre-commit decision
<Pre-commit, T>, and records it at multiple
participating sites.
45. Distributed Databases - Transaction
Processing
Three Phase Commit Protocol – Phase III:
Coordinator sends commit/ abort message to all
participating sites.
46. Distributed Databases - Transaction
Processing
Three Phase Commit Protocol:
If the coordinating site fails, one of the
participating sites becomes the new coordinating site and
asks the other participating sites which pre-commit
message they possess.
The new coordinating site then uses this pre-commit
message to take the commit/abort decision.
47. Object based Database
Object-based databases provide a way
to model real-world objects and their
behavior.
They are an alternative to the relational database
model.
48. Object based Database
Complex Data Types:
Address can be viewed as a single string or separate
attributes for each part or composite attributes.
Applications:
Computer Aided Design
Hypertext database
Multimedia and image databases.
49. Object based Database
Object Classes:
class employee {
/* Variables */
string name; string address; date start-date; int salary;
/* Messages */
int annual-salary(); string get-name(); string get-address();
int set-address(string new-address);
int employment-length();
};
50. Object based Database
Inheritance:
An object-oriented database schema typically requires a
large number of classes.
For example, bank employees are similar to customers.
Need to place classes in a specialization hierarchy
53. Object based Database
Inheritance:
The keyword isa is used to indicate that a class is a
specialization of another class.
The specializations of a class are called subclasses.
E.g., employee is a subclass of person; teller is a subclass
of employee. Conversely, employee is a superclass of teller.
54. Object based Database
Inheritance:
Code Reusability
Substitutability: Any method of a class A can equally
well be invoked with an object belonging to any
subclass B of A.
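Code reuse and substitutability can be illustrated with the person / employee / teller hierarchy in Python. This is only a sketch: the attribute names and the assumption that salary is a monthly figure are made up for the example.

```python
class Person:
    def __init__(self, name, address):
        self.name = name
        self.address = address

    def get_name(self):
        return self.name


class Employee(Person):  # employee isa person
    def __init__(self, name, address, salary):
        super().__init__(name, address)  # reuses the Person initializer
        self.salary = salary  # assumed to be a monthly salary

    def annual_salary(self):
        return self.salary * 12


class Teller(Employee):  # teller isa employee
    pass


def greet(person):
    # Written for Person, but works for any subclass (substitutability).
    return "Hello, " + person.get_name()
```

Teller defines nothing of its own, yet it can be passed to greet and answers annual_salary because both methods are inherited.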
55. Object based Database
Multiple Inheritance:
In most cases, tree-structured organization of classes is
adequate to describe applications.
Multiple inheritance: the ability of class to inherit variables
and methods from multiple superclasses.
The class/subclass relationship is represented by a rooted
directed acyclic graph (DAG) in which a class may have
more than one superclass.
57. Object based Database
Multiple Inheritance:
Handling name conflicts: When multiple inheritance is
used, there is potential ambiguity if the same variable or
method can be inherited from more than one superclass.
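A name conflict of this kind can be reproduced in Python, which resolves it by the order in which the superclasses are listed. That rule is specific to Python; an object database system may instead report an error or require the inherited names to be renamed. The class and method names below are hypothetical.

```python
class Employee:
    def pay(self):
        return "salary"


class Student:
    def pay(self):
        return "stipend"


class TeachingAssistant(Employee, Student):
    # Inherits pay() from both superclasses; Python picks Employee.pay
    # because Employee is listed first.
    pass


ta = TeachingAssistant()
default_choice = ta.pay()
# The other version is still reachable by naming the superclass explicitly.
explicit_choice = Student.pay(ta)
```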
58. Object based Database
ODMG Object Model:
ODMG – Object Database Management Group
It produced the specification for object-oriented
databases.
ODL – Object Definition Language
OQL – Object Query Language
OML – Object Manipulation Language
59. Object based Database
ODL:
Declaring Classes:
keyword interface
The name of the class
The list of attributes of the class declared using keyword
attribute.
61. Object based Database
ODL:
Declaring Relationships:
SQL uses the foreign key concept to establish
relationships between two tables.
ODL uses the keyword relationship to declare a relationship
between two classes.
65. Object based Database
OQL:
A query language standard for object oriented databases modeled
after SQL.
Rules:
All complete statements must be terminated by a semi-colon
A list of entries in OQL is usually separated by commas but not
terminated by a comma(,).
Strings of text are enclosed by matching quotation marks.
66. Object based Database
OQL:
Basic form of OQL: Select, From and Where
Syntax: SELECT <list of values>
FROM <list of collections and variable assignments>
WHERE < condition>
SELECT Sname:p.name FROM p in People WHERE
p.age>30
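For readers more familiar with programming languages than OQL, the query above behaves roughly like a comprehension over a collection of objects. The Python sketch below is an analogy only; the Person class and the sample data are invented.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


# Plays the role of the People collection named in the FROM clause.
people = [Person("Asha", 42), Person("Ravi", 25), Person("Meena", 31)]

# SELECT Sname: p.name FROM p in People WHERE p.age > 30
result = [{"Sname": p.name} for p in people if p.age > 30]
```

Each result entry carries the label Sname, mirroring the way the OQL query names the selected value.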
67. Object based Database
OQL:
Dot notations and Path expressions:
ta.salary -> real
t.students -> set of tuples of type tuple(name, fee:real)
representing students
t.salary -> real
68. XML Databases
XML - Extensible Markup Language
XML tags identify the data and are used to store and
organize the data.
Characteristics:
XML is extensible
XML carries the data, does not present it
XML is a public standard
69. XML Databases
Syntax Rules for XML Declaration
The XML declaration is case sensitive and must begin with
"<?xml", where "xml" is written in lower case.
If document contains XML declaration, then it strictly
needs to be the first statement of the XML document.
70. XML Databases
Element:
XML elements can be defined as building blocks of an
XML.
Elements can behave as containers to hold text, elements,
attributes, media objects or all of these.
74. XML Databases
Attributes:
Attribute gives more information about XML elements.
Attributes define properties of elements.
An XML attribute is always a name-value pair.
<element-name attribute1="value1" attribute2="value2">
....content..
</element-name>
76. XML Databases
Types of XML Documents:
Data Centric XML documents: Many small data items that
follow specific structure. These documents follow predefined
schema that defines tag names.
Document Centric XML documents: Large amounts of text, such
as articles or books. There are very few or no structured data
elements in these documents.
Hybrid Documents: Some parts contain structured data and other
parts are unstructured text; they may not have a
predefined schema.
77. XML Databases
DTD
DTD – Document Type Definition
To define the basic building block of any xml document
Using DTD, specify various elements type, attributes and
their relationships with one another.
To specify the set of rules for structuring data in any XML
file
78. XML Databases
DTD – Elements:
The basic entity
The elements are used for defining the tags.
The elements typically consist of opening and closing tag.
Ex: <body>some text</body>
79. XML Databases
DTD – Attributes:
Attributes always come in name/value pairs.
To specify the values of the element.
These are specified within the double quotes.
Ex: <img src="computer.gif" />
80. XML Databases
DTD – Entities:
Entities are expanded when a document is parsed by an
XML parser.
Entity Reference | Character
&lt; | <
&gt; | >
&amp; | &
&quot; | "
81. XML Databases
DTD – PCDATA:
Parsed Character Data.
PCDATA is text that WILL be parsed by a parser. The text
will be examined by the parser for entities and markup.
Tags inside the text will be treated as markup and entities
will be expanded.
PCDATA should not contain the characters &, < or >; they are
written as the entities &amp;, &lt; and &gt;.
82. XML Databases
DTD – CDATA:
Character Data.
CDATA is text that will NOT be parsed by a parser. Tags
inside the text will NOT be treated as markup and entities
will not be expanded.
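The difference between parsed and unparsed text can be observed with Python's standard xml.etree.ElementTree parser. This is a small sketch that uses a CDATA section inside element content; the element names parsed and raw are made up for the illustration.

```python
import xml.etree.ElementTree as ET

doc = """<page>
  <parsed>Fish &amp; chips &lt;b&gt;</parsed>
  <raw><![CDATA[Fish & chips <b>]]></raw>
</page>"""

root = ET.fromstring(doc)

# Parsed text: the parser expands &amp;, &lt; and &gt; into characters.
parsed_text = root.find("parsed").text

# CDATA section: the characters are taken literally, so <b> is not a tag.
raw_text = root.find("raw").text
```

Both elements end up holding the same string, but only the CDATA version could be written without escaping.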
83. XML Databases
DTD – Example:
<?xml version="1.0"?>
<page>
<title>Hello friend</title>
<content>Here is some content :)</content>
<comment>samples</comment>
</page>
85. XML Databases
DTD – Merits:
To define the structural components of XML document
Simple and Compact
86. XML Databases
DTD – Demerits:
It is not expressive enough for complex documents.
The language that DTD uses is not XML.
A DTD cannot define the type of data contained within
the XML document.
87. XML Databases
XML Schema:
Structure of an XML document.
The elements and attributes that can appear in a document
The number of (and order of) child elements
Data types for elements and attributes
Default and fixed values for elements and attributes
XML Schema is an XML-based (and more powerful) alternative
to DTD
89. XML Databases
XML Schema – Advantages:
The schema provides support for data types.
XML Schema is written in XML itself and has a large number
of built-in and derived types.
Disadvantages:
Complex to design and hard to learn
Maintaining the schema for large and complex operations
sometimes slows down the processing of XML documents.
90. XML Databases
XQuery:
To query the XML database, to get information out of XML
databases.
XQuery FLWOR Expressions
For - selects a sequence of nodes
Let - binds a sequence to a variable
Where - filters the nodes
Order by - sorts the nodes
Return - what to return (gets evaluated once for every node)
91. XML Databases
XQuery – Example:
courses.xml
display the title elements of the courses whose fees are
greater than 5000
for $x in doc("courses.xml")/courses/course
where $x/fees>5000
return $x/title
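The same selection can be mimicked in Python with xml.etree.ElementTree, which may help readers who cannot run an XQuery processor. The contents of courses.xml are not given in these notes, so the sample document below is invented; the sketch also returns the title text rather than the title elements that the XQuery expression returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for courses.xml.
courses_xml = """<courses>
  <course><title>DBMS</title><fees>6000</fees></course>
  <course><title>Networks</title><fees>4500</fees></course>
  <course><title>Compilers</title><fees>7500</fees></course>
</courses>"""

root = ET.fromstring(courses_xml)

titles = [
    course.findtext("title")                 # return $x/title
    for course in root.findall("course")     # for $x in .../courses/course
    if int(course.findtext("fees")) > 5000   # where $x/fees > 5000
]
```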
92. XML Databases
XQuery – Advantages:
Both hierarchical and tabular data can be retrieved.
Can query tree and graph structures.
Can be used to build web pages.
Can be used to transform XML documents.
93. Information Retrieval
Information Retrieval:
“The process of retrieving documents from a
collection in response to a query submitted by a user”
94. Information Retrieval
Information Retrieval:
Structured Data:
Data in which the information is organized in a
fixed, well-defined form.
Ex: Student table
95. Information Retrieval
Information Retrieval:
Unstructured Data:
Like human language.
It does not fit nicely into relational databases.
Ex: Emails, Text Documents, Social media, Videos and
Images.
96. Information Retrieval
Information Retrieval – Concept of Query
The user can make a free-form search request, called a query.
It is also called keyword search.
97. Information Retrieval
Characteristics of IR Systems:
Types of Users:
Expert User: User who is searching for specific
information that is clear in mind.
Ex: User who wants to get the information about particular
book.
Layperson: A user with generic information need.
99. Information Retrieval
Characteristics of IR Systems:
Types of Information Need:
Navigational Search: To find a particular piece of
information that user needs quickly.
Ex: Finding site of “Anna University”
100. Information Retrieval
Characteristics of IR Systems:
Types of Information Need:
Informational Search: To find current information about
some topic.
Example: Information about current News.
101. Information Retrieval
Characteristics of IR Systems:
Types of Information Need:
Transactional Search: To reach a site where further
interaction happens.
Ex: Online Reservation.
102. Information Retrieval
Database System | IR System
Use of structured data | Use of unstructured data
Relational data model is used | Free-form query model is used
Query returns data | Search request returns a list of, or pointers to, documents that may contain the desired information
Results are based on exact matching | Results are based on approximate matching
103. Information Retrieval
Modes of Interactions:
Retrieval: Extraction of relevant information from a
repository of documents through an IR query.
Browsing: The activity of a user visiting or
navigating through similar or related documents based
on the user’s assessment of relevance.
104. Information Retrieval
Modes of Interactions:
Hyperlinks: To interconnect web pages and are mainly
used for browsing.
Anchor texts: Text phrases within documents used to
label hyperlinks and are very relevant to browsing.
Web Search: combines both activities(retrieval and
browsing)
105. Information Retrieval
Modes of Interactions:
Web Search Engine: Maintains an indexed repository
of web pages. The most relevant web pages are
returned to the user if possible in descending order of
their relevance.
106. Information Retrieval
IR Processing:
Statistical Approach:
The documents are first analyzed and broken down into chunks
of text.
Each word is counted for its relevance.
These words are then compared against the query to test the
significant degree of match.
Based on this matching, the ranked list of documents containing
these words is presented to the user.
107. Information Retrieval
IR Processing:
Semantic Approach:
Knowledge-based techniques of information retrieval are used.
The syntactic, lexical, sentential, discourse-based and
pragmatic levels of knowledge understanding are used to
build the knowledge base.
109. Information Retrieval
Retrieval Models:
Boolean Model:
Documents represented as a set of terms
Form queries using standard Boolean logic set-theoretic operators -
AND, OR and NOT.
Based on “Exact match” with query.
Lacks sophisticated ranking algorithms.
Make it easy to associate meta data information and write queries that
match the contents of the documents
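Because each document is a set of terms, Boolean queries map directly onto set operations. The Python sketch below uses a tiny invented collection; the document identifiers and terms are assumptions.

```python
# Each document is represented as a set of terms.
docs = {
    "d1": {"distributed", "database", "commit"},
    "d2": {"xml", "database", "schema"},
    "d3": {"information", "retrieval", "query"},
}


def matching(term):
    """Return the ids of the documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


and_result = matching("database") & matching("xml")    # database AND xml
or_result = matching("xml") | matching("retrieval")    # xml OR retrieval
not_result = matching("database") - matching("xml")    # database AND NOT xml
```

Every document either satisfies the query or does not, which is why this model has no natural way to rank the results.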
110. Information Retrieval
Retrieval Models:
Vector Space Model:
An algebraic model for representing text documents.
It provides a framework in which weighting, ranking of
retrieved documents and relevance feedback are possible.
Similarity functions can be used; the cosine of the angle
between the query and document vectors is commonly used.
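A short Python sketch of ranking by cosine similarity follows. It makes simplifying assumptions: the weights are raw term counts (real systems usually apply a weighting scheme such as TF-IDF), and the three documents are invented.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = {
    "d1": "distributed database commit protocol",
    "d2": "xml database schema",
    "d3": "information retrieval query",
}
query = vectorize("distributed database")

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda d: cosine(query, vectorize(docs[d])), reverse=True)
```

Unlike the Boolean model, a document that matches only part of the query still receives a score and a place in the ranking.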
111. Information Retrieval
Retrieval Models:
Probabilistic Model:
A more concrete and definitive approach is taken.
The IR system has to decide whether a document belongs to
the relevant set or the non-relevant set for a query.
It calculates the probability that the document belongs to the
relevant set and compares that with the probability that the
document belongs to the non-relevant set.
112. Information Retrieval
Retrieval Models:
Semantic Model:
The process of matching documents to a given query is based on
concept level and semantic matching instead of index term
matching.
This allows retrieval of relevant documents that share
meaningful associations with other documents in the query result.
113. Information Retrieval
Retrieval Models:
Semantic Model – Level of Analysis:
Morphological Analysis: Nouns, verbs and adjectives are analyzed.
Syntactical Analysis: Complete phrases in the document
are parsed and then analyzed.
Semantic Analysis: Synonyms are used to resolve
ambiguities in the words.
114. Information Retrieval
Types of Queries in IR Systems:
Keywords:
Consist of words, phrases, and other characterizations of
documents
Queries compared to set of index keywords
Allow use of Boolean and other operators to build a
complex query
115. Information Retrieval
Types of Queries in IR Systems:
Keywords:
Keywords implicitly connected by a logical AND operator
Remove stopwords - Most commonly occurring words: a,
the, of
IR systems do not pay attention to the ordering of these
words in the query
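These three behaviours (implicit AND, stopword removal, no attention to word order) can be sketched in a few lines of Python. The stopword list is just the three words named above, and the documents are invented.

```python
STOPWORDS = {"a", "the", "of"}


def keyword_search(query, docs):
    """Return ids of documents that contain every non-stopword query term."""
    terms = {w for w in query.lower().split() if w not in STOPWORDS}
    # Subset test = implicit AND; sets ignore the order of the words.
    return [
        doc_id
        for doc_id, text in docs.items()
        if terms <= set(text.lower().split())
    ]


docs = {
    "d1": "architecture of a distributed database",
    "d2": "database schema",
    "d3": "the database architecture",
}
hits = keyword_search("the architecture of database", docs)
```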
116. Information Retrieval
Types of Queries in IR Systems:
Boolean Queries:
AND: both terms must be found
OR: either term found
NOT: record containing keyword omitted
( ): used for nesting
+: equivalent to AND
–: equivalent to AND NOT
Document retrieved if query logically true as exact match in document
117. Information Retrieval
Types of Queries in IR Systems:
Phrase queries:
Phrase generally enclosed within double quotes
More restricted and specific version of proximity searching
118. Information Retrieval
Types of Queries in IR Systems:
Proximity queries:
Accounts for how close within a record multiple terms
should be to each other
Common option requires terms to be in the exact order
Various operator names: NEAR, ADJ(adjacent), or AFTER
119. Information Retrieval
Types of Queries in IR Systems:
Wildcard queries:
Support regular expressions and pattern matching-based
searching – ‘Data*’ would retrieve data, database, datapoint,
dataset
Involves preprocessing overhead
Retrieval models do not directly provide support for this query
type
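A wildcard query can be approximated by matching the pattern against the index vocabulary, as in this Python sketch using the standard fnmatch module. The vocabulary is invented, and terms are assumed to be lower-cased at indexing time, so the pattern is lower-cased as well.

```python
from fnmatch import fnmatchcase

# Hypothetical index vocabulary, already lower-cased.
vocabulary = ["data", "database", "datapoint", "dataset", "date", "metadata"]

pattern = "Data*".lower()

# Keep only the terms that match the wildcard pattern.
matches = [term for term in vocabulary if fnmatchcase(term, pattern)]
```

Scanning the whole vocabulary for every query is part of the overhead mentioned above; real systems typically build extra index structures to avoid it.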
120. Information Retrieval
Types of Queries in IR Systems:
Natural Language queries:
Few natural language search engines
Active area of research
Easier to answer questions