Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

1,934 views

Published on

In this talk, we briefly introduce two major research problems involving databases and functional dependencies. First, we introduce an information-theoretic measure that evaluates a database design based on the worst possible redundancy carried in the instances. Then we propose new design guidelines to reduce the amount of redundancy that databases carry due to the presence of functional dependencies.

We also introduce the problem of repairing an inconsistent database that violates a set of functional dependencies by making the smallest possible value modifications. We show that finding an optimum solution is NP-hard. Then we explore the possibility of producing an approximate solution that can be used in data cleaning systems.

No Downloads

Total views

1,934

On SlideShare

0

From Embeds

0

Number of Embeds

99

Shares

0

Downloads

29

Comments

0

Likes

1

No embeds

No notes for slide

- 1. Functional Dependencies: Redundancy Analysis and Correcting Violations Solmaz Kolahi solmaz@cs.ubc.ca Postdoctoral Research Fellow Department of Computer Science University of British Columbia Joint work with Leonid Libkin and Laks Lakshmanan Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 1/20
- 2. Motivation Both relational and XML databases may store redundant data: title director actor year The Departed Scorsese DiCaprio 2006 The Departed Scorsese Nicholson 2006 Functional Dependency: Shrek the Third Miller Myers 2007 Shrek the Third Miller Murphy 2007 title → year Shrek the Third Hui Myers 2007 Shrek the Third Hui Murphy 2007 Functional Dependency: @AreaCode → @City AreaCode AreaCode AreaCode AreaCode 416 416 416 416 City City City City Toronto Toronto Toronto Toronto Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 2/20
- 3. Motivation Normalization techniques try to remove redundancies: BCNF eliminates all redundancies. only key dependencies are allowed. cannot always be achieved without losing dependencies. AB → C C→B R(A, B, C) 3NF eliminates some redundancies. allows redundancy on prime attributes. preserves dependencies. XNF eliminates all redundancies w.r.t. XML functional dependencies. only XML keys are allowed: if X → p.@l, then X → p. introduced by Arenas & Libkin in 2002. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 3/20
- 4. Motivation Traditional normalization theory characterizes a database as redundant or non-redundant. does not measure redundancy. cannot provide guidelines to reduce redundancy. The more redundant the data, the more prone to update anomalies. Our goal is to show that there is a spectrum of redundancy using an information-theoretic tool. to choose database designs with low redundancy. to handle databases with dependency violations. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 4/20
- 5. Outline Motivation. Reducing redundancy in relational and XML data: Measure of redundancy. Redundancy analysis of normal forms and schemas. Correcting functional dependency violations. Conclusions. Future work. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 5/20
- 6. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
- 7. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} A B C D RICI (P |Σ) 1 2 3 4 0.875 1 2 3 5 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
- 8. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} A B C D RICI (P |Σ) 1 2 3 4 0.875 1 2 3 5 0.781 1 2 3 6 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
- 9. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} A B C D RICI (P |Σ) 1 2 3 4 0.875 1 2 3 5 0.781 1 2 3 6 0.711 1 2 3 7 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
- 10. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} A B C D RICI (P |Σ) 1 2 3 4 0.875 1 2 3 5 0.781 1 2 3 6 0.711 1 2 3 7 0.658 1 2 3 8 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
- 11. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} Σ = {A → C, B → C} A B C D RICI (P |Σ) RICI (P |Σ) 1 2 3 4 0.875 0.781 1 2 3 5 0.781 0.629 1 2 3 6 0.711 0.522 1 2 3 7 0.658 0.446 1 2 3 8 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
- 12. Measure of Information Content A B C 1 2 3 R(A, B, C) Σ = {A → B} 1 2 4 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
- 13. Measure of Information Content A B C 2 3 R(A, B, C) Σ = {A → B} 1 2 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
- 14. Measure of Information Content A B C 1 2 3 R(A, B, C) Σ = {A → B} 1 2 1 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
- 15. Measure of Information Content A B C 4 2 3 R(A, B, C) Σ = {A → B} 1 2 7 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
- 16. Measure of Information Content A B C 4 2 3 R(A, B, C) Σ = {A → B} 1 2 7 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = 48/ Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
- 17. Measure of Information Content A B C a 3 R(A, B, C) Σ = {A → B} 1 2 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = 48/(48 + 6 × 42) = 0.16 P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a = 2 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
- 18. Measure of Information Content A B C 1 2 3 R(A, B, C) Σ = {A → B} 1 2 4 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = 48/(48 + 6 × 42) = 0.16 P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a = 2 Conditional entropy : 2.8057 Average over all possible X: RICk = 2.4558 I Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
- 19. Measure of Information Content A B C 1 2 3 R(A, B, C) Σ = {A → B} 1 2 4 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = 48/(48 + 6 × 42) = 0.16 P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a = 2 Conditional entropy : 2.8057 Average over all possible X: RICk = 2.4558 I RICk (p| Σ) I RICI (p|Σ) = lim = 0.875 log k k→∞ Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
- 20. Measure and Database Design A schema S with constraints Σ is well-designed if for every instance I of (S, Σ) and every position p in I RICI (p|Σ) = 1. Known results (Arenas & Libkin, 2003): relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF. XML documents with FDs: (S, Σ) is well-designed iff it is in XNF. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 8/20
- 21. Measure and Database Design A schema S with constraints Σ is well-designed if for every instance I of (S, Σ) and every position p in I RICI (p|Σ) = 1. Known results (Arenas & Libkin, 2003): relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF. XML documents with FDs: (S, Σ) is well-designed iff it is in XNF. Well-designed databases cannot always be achieved: Performance issues. Dependency preservation. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 8/20
- 22. Measure and Database Design A schema S with constraints Σ is well-designed if for every instance I of (S, Σ) and every position p in I RICI (p|Σ) = 1. Known results (Arenas & Libkin, 2003): relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF. XML documents with FDs: (S, Σ) is well-designed iff it is in XNF. Well-designed databases cannot always be achieved: Performance issues. Dependency preservation. General design goal: maximizing information content to the possible extent by enforcing some design conditions. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 8/20
- 23. Guaranteed Information Content Given a condition C, guaranteed information content (GIC) is the smallest information content found in instances of schemas satisfying C. instances of (S, Σ) Schema satisfying condition C (S, Σ) =⇒ ≥ GIC RICI (p|Σ) More formally, we look at the set of all possible values for information content POSS C (m) = {RICI (p | Σ) | I is an instance of (R, Σ), R has m attributes, (R, Σ) satisﬁes C}, then GICC (m), is the inﬁmum of POSS C (m). Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 9/20
- 24. Price of Dependency Preservation Design goal: minimizing redundancy while preserving FDs. For a normal form NF, PRICE(NF ) is the minimum information content that NF loses to guarantee dependency preservation. if c ∈ [0, 1] is the largest information content guaranteed for decompositions into NF, NF-decomposition (R1 , Σ1 ) ≥c RICI (p|Σ) (R, Σ) (R2 , Σ2 ) (R3 , Σ3 ) then price of dependency preservation, PRICE(NF ), is 1 − c. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 10/20
- 25. Price of Dependency Preservation Theorem = 1/2. PRICE(3NF) ≥ 1/2 for any dependency-preserving normal form NF. PRICE(NF ) To pay the smallest price for achieving dependency preservation, we should do a 3NF normalization. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 11/20
- 26. Price of Dependency Preservation Theorem = 1/2. PRICE(3NF) ≥ 1/2 for any dependency-preserving normal form NF. PRICE(NF ) To pay the smallest price for achieving dependency preservation, we should do a 3NF normalization. Not all 3NF normalizations are equal: Special subclasses of 3NF exist (old research). Only one subclass (3NF+ ) achieves the smallest price. We compare normal forms based on guaranteed information content or highest redundancy they allow. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 11/20
- 27. Comparing Normal Forms For every m > 2: Theorem 21−m GICAll (m) = 22−m GIC3NF (m) = GIC3NF+ (m) = 1/2 3NF is twice as good as doing nothing. 3NF+ is exponentially better. similar results obtained if we compare normal forms based on guaranteed average information content. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 12/20
- 28. Redundancy of an Arbitrary Schema Normalizing into smaller relations is not always desirable. losing constraints. slowing down query answering. Normalization decision can be made based on how much redundancy the schema allows; or where in the spectrum of redundancy the schema lies; or the lowest information content found in all instances of the schema. Design goal: decomposing the schema with the highest potential for redundancy. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 13/20
- 29. Redundancy of an Arbitrary Schema Given an arbitrary schema R with FDs Σ, let Theorem ΣA = {X | X → A, X is minimal and non-key}; #HS = the number of hitting sets of ΣA ; l=| X|. X∈ΣA Then the smallest information content found in column A of instances is GICR (A) = #HS · 2−l . Σ Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 14/20
- 30. Redundancy of an Arbitrary Schema Given an arbitrary schema R with FDs Σ, let Theorem ΣA = {X | X → A, X is minimal and non-key}; #HS = the number of hitting sets of ΣA ; l=| X|. X∈ΣA Then the smallest information content found in column A of instances is GICR (A) = #HS · 2−l . Σ R1 (A, B, C, D, E) R2 (A, B, C, D, E) Σ1 = { AB → E, Σ2 = { BC → E, D → E} AC → E, BD → E} Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 14/20
- 31. Redundancy of an Arbitrary Schema Given an arbitrary schema R with FDs Σ, let Theorem ΣA = {X | X → A, X is minimal and non-key}; #HS = the number of hitting sets of ΣA ; l=| X|. X∈ΣA Then the smallest information content found in column A of instances is GICR (A) = #HS · 2−l . Σ R1 (A, B, C, D, E) R2 (A, B, C, D, E) Σ1 = { AB → E, Σ2 = { BC → E, D → E} AC → E, BD → E} GICR1 (E) = GICR2 (E) = 3 1 1 2 Σ Σ 8 2 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 14/20
- 32. Outline Motivation. Reducing redundancy in relational and XML data: Measure of redundancy. Redundancy analysis of normal forms and schemas. Correcting functional dependency violations. Conclusions. Future work. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 15/20
- 33. Functional Dependency Violations Large databases often tend to violate a set of FDs. Σ = {cnt, arCode → reg, cnt, reg → prov } An inconsistent database name cnt prov reg arCode phone t1 Smith CAN BC Van 604 123 4567 t2 Adams CAN BC Van 604 765 4321 t3 Simpson CAN BC Man 604 345 6789 t4 Rice CAN AB Vic 604 987 6543 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 16/20
- 34. Functional Dependency Violations Large databases often tend to violate a set of FDs. Σ = {cnt, arCode → reg, cnt, reg → prov } An inconsistent database name cnt prov reg arCode phone t1 Smith CAN BC Van 604 123 4567 t2 Adams CAN BC Van 604 765 4321 t3 Simpson CAN BC Man 604 345 6789 t4 Rice CAN AB Vic 604 987 6543 A minimal repair name cnt prov reg arCode phone t1 Smith CAN BC Van 604 123 4567 t2 Adams CAN BC Van 604 765 4321 t3 Simpson CAN BC Van 604 345 6789 t4 v1 Rice AB Vic 604 987 6543 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 16/20
- 35. Handling Inconsistent Databases Integrity constraints Σ (FDs, keys, etc.). Inconsistent database D: does not satisfy Σ. We can produce a repair R by inserting/deleting tuples or modifying values in D. ∆(D, R) = number of modiﬁcations Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 17/20
- 36. Handling Inconsistent Databases Integrity constraints Σ (FDs, keys, etc.). Inconsistent database D: does not satisfy Σ. We can produce a repair R by inserting/deleting tuples or modifying values in D. ∆(D, R) = number of modiﬁcations Handling inconsistency: Consistent query answering: {Q(R) | R is a minimal repair for D} certain answer for queryQ = Producing an optimum repair Ropt with minimum ∆. Both approaches are intractable in general. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 17/20
- 37. Handling Inconsistent Databases Integrity constraints Σ (FDs, keys, etc.). Inconsistent database D: does not satisfy Σ. We can produce a repair R by inserting/deleting tuples or modifying values in D. ∆(D, R) = number of modiﬁcations Handling inconsistency: Consistent query answering: {Q(R) | R is a minimal repair for D} certain answer for queryQ = Producing an optimum repair Ropt with minimum ∆. Both approaches are intractable in general. Our approach: producing an approximate solution Rapp for optimum repair. ∆(D, Rapp ) ≤ α · ∆(D, Ropt ) Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 17/20
- 38. Approximating Optimum Repair Theorem. Finding an optimum solution for FD violations is NP-hard. Theorem. Finding a constant-factor approximation for all FD violations is NP-hard. Theorem. For every ﬁxed set of FDs, there is a polynomial-time algorithm that approximates optimum repair within a factor of α, where α depends on FDs. A B C D E t1 a1 b1 c1 d1 e1 t2 a2 b1 c2 d2 e2 t3 a1 b3 c3 d3 e3 t4 a4 b4 c4 d4 e4 t5 a5 b4 c5 d5 e5 t6 a6 b6 c4 d5 e6 Σ = {A → C, B → C, CD → E} Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 18/20
- 39. Conclusions We analyze schemas and normal forms based on worst cases of redundancy. There is a spectrum of information content (redundancy) for schemas. 0 1 1 2 poorly-designed well-designed Producing optimum repair for FD violations is hard. We introduced an approximation framework. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 19/20
- 40. Future Work Comparing quality of schemas with low / high information content in practice. Deﬁning normalization concepts for XML such as: dependency preserving decomposition. Finding an equivalent of 3NF for XML as a normal form that guarantees an information content of 1 . 2 to which every XML document is decomposable. Extending the repair algorithm for other integrity constraints. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 20/20

No public clipboards found for this slide

×
### Save the most important slides with Clipping

Clipping is a handy way to collect and organize the most important slides from a presentation. You can keep your great finds in clipboards organized around topics.

Be the first to comment