Watermarking relational databases using optimization based techniques

  • 1. 116 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 1, JANUARY 2008

    Watermarking Relational Databases Using Optimization-Based Techniques

    Mohamed Shehab, Member, IEEE, Elisa Bertino, Fellow, IEEE, and Arif Ghafoor, Fellow, IEEE

    Abstract—Proving ownership rights on outsourced relational databases is a crucial issue in today's internet-based application environments and in many content distribution applications. In this paper, we present a mechanism for proof of ownership based on the secure embedding of a robust imperceptible watermark in relational data. We formulate the watermarking of relational databases as a constrained optimization problem and discuss efficient techniques to solve the optimization problem and to handle the constraints. Our watermarking technique is resilient to watermark synchronization errors because it uses a partitioning approach that does not require marker tuples. Our approach overcomes a major weakness in previously proposed watermarking techniques. Watermark decoding is based on a threshold-based technique characterized by an optimal threshold that minimizes the probability of decoding errors. We built a proof of concept implementation of our watermarking technique and showed by experimental results that our technique is resilient to tuple deletion, alteration, and insertion attacks.

    Index Terms—Watermarking, digital rights, optimization.

    1 INTRODUCTION

    The rapid growth of the Internet and related technologies has offered an unprecedented ability to access and redistribute digital contents. In such a context, enforcing data ownership is an important requirement, which requires articulated solutions, encompassing technical, organizational, and legal aspects [25]. Although we are still far from such comprehensive solutions, in recent years, watermarking techniques have emerged as an important building block that plays a crucial role in addressing the ownership problem. Such techniques allow the owner of the data to embed an imperceptible watermark into the data. A watermark describes information that can be used to prove the ownership of data such as the owner, origin, or recipient of the content. Secure embedding requires that the embedded watermark must not be easily tampered with, forged, or removed from the watermarked data [26]. Imperceptible embedding means that the presence of the watermark is unnoticeable in the data. Furthermore, the watermark detection is blinded, that is, it requires neither the knowledge of the original data nor the watermark. Watermarking techniques have been developed for video, images, audio, and text data [24], [12], [15], [2], and also for software and natural language text [7], [3].

    By contrast, the problem of watermarking relational data has not been given appropriate attention. There are, however, many application contexts for which data represent an important asset, the ownership of which must thus be carefully enforced. This is the case, for example, of weather data, stock market data, power consumption, consumer behavior data, and medical and scientific data. Watermark embedding for relational data is made possible by the fact that real data can very often tolerate a small amount of error without any significant degradation with respect to their usability. For example, when dealing with weather data, changing some daily temperatures by 1 or 2 degrees is a modification that leaves the data still usable. To date, only a few approaches to the problem of watermarking relational data have been proposed [1], [23]. These techniques, however, are not very resilient to watermark attacks. In this paper, we present a watermarking technique for relational data that is highly resilient compared to these techniques. In particular, our proposed technique is resilient to tuple deletion, alteration, and insertion attacks. The main contributions of the paper are summarized as follows:
      • We formulate the watermarking of relational databases as a constrained optimization problem and discuss efficient techniques to handle the constraints. We present two techniques to solve the formulated optimization problem based on genetic algorithms (GAs) and pattern search (PS) techniques.
      • We present a data partitioning technique that does not depend on marker tuples to locate the partitions and, thus, it is resilient to watermark synchronization errors.
      • We develop an efficient technique for watermark detection that is based on an optimal threshold. The optimal threshold is selected by minimizing the probability of decoding error.

    M. Shehab is with the Department of Software and Information Systems, University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223. E-mail: mshehab@uncc.edu.
    E. Bertino is with the Department of Computer Science, Purdue University, 250 N. University Street, West Lafayette, IN 47906. E-mail: bertino@cs.purdue.edu.
    A. Ghafoor is with the School of Electrical and Computer Engineering, Purdue University, Electrical Engineering Building, 465 Northwestern Ave., West Lafayette, IN 47907. E-mail: ghafoor@ecn.purdue.edu.
    Manuscript received 8 Aug. 2005; revised 20 June 2007; accepted 18 July 2007; published online 4 Sept. 2007.
    For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-0303-0805.
    Digital Object Identifier no. 10.1109/TKDE.2007.190668.
    1041-4347/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.
  • 2. SHEHAB ET AL.: WATERMARKING RELATIONAL DATABASES USING OPTIMIZATION-BASED TECHNIQUES 117

    Fig. 1. Stages of watermark encoding and decoding.

      • With a proof of concept implementation of our watermarking technique, we have conducted experiments using both synthetic and real-world data. We have compared our watermarking technique with previous approaches [1], [23] and shown the superiority of our technique with respect to all types of attacks.

    The paper is organized as follows: Section 2 discusses the related work that includes the available relational database watermarking techniques and highlights the shortcomings of these techniques. An overview of our watermarking technique is described in Section 3, where an overview of the watermark encoding and decoding stages is presented. Section 4 discusses the data partitioning algorithm. The watermark embedding algorithm is described in Section 5. Sections 6 and 7 discuss the decoding threshold evaluation and the watermark detection scheme. Section 8 presents the attacker model. The experimental results are presented in Section 9. Finally, conclusions are given in Section 10.

    2 RELATED WORK

    Agrawal and Kiernan [1] proposed a watermarking algorithm that embeds the watermark bits in the least significant bits (LSB) of selected attributes of a selected subset of tuples. This technique does not provide a mechanism for multibit watermarks; instead, only a secret key is used. For each tuple, a secure message authentication code (MAC) is computed using the secret key and the tuple's primary key. The computed MAC is used to select candidate tuples, attributes, and the LSB position in the selected attributes. Hiding bits in LSB is efficient. However, the watermark can be easily compromised by very trivial attacks. For example, a simple manipulation of the data by shifting the LSBs one position easily leads to watermark loss without much damage to the data. Therefore, the LSB-based data hiding technique is not resilient [21], [8]. Moreover, it assumes that the LSB bits in any tuple can be altered without checking data constraints. Simple unconstrained LSB manipulations can easily generate undesirable results such as changing the age from 20 to 21. Li et al. [18] have presented a technique for fingerprinting relational data by extending Agrawal et al.'s watermarking scheme.

    Sion et al. [23] proposed a watermarking technique that embeds watermark bits in the data statistics. The data partitioning technique used is based on the use of special marker tuples, which makes it vulnerable to watermark synchronization errors resulting from tuple deletion and tuple insertion; thus, such a technique is not resilient to deletion and insertion attacks. Furthermore, Sion et al. recommend storing the marker tuples to enable the decoder to accurately reconstruct the underlying partitions; however, this violates the blinded watermark detection property. A detailed discussion of such attacks is presented in Section 8. The data manipulation technique used to change the data statistics does not systematically investigate the feasible region; instead, a naive unstructured technique is used, which does not make use of the feasible alterations that could be performed on the data without affecting its usability. Furthermore, Sion et al. proposed a threshold technique for bit decoding that is based on two thresholds. However, the thresholds are arbitrarily chosen without any optimality criteria. Thus, the decoding algorithm exhibits errors resulting from the nonoptimal threshold selection, even in the absence of an attacker.

    Gross-Amblard [11] proposed a watermarking technique for XML documents and theoretically investigated links between query result preservation and acceptable watermarking alterations. Another interesting related research effort is to be found in [17], where the authors have proposed a fragile watermark technique to detect and localize alterations made to a database relation with categorical attributes.

    3 APPROACH OVERVIEW

    Fig. 1 shows a block diagram summarizing the main components of the watermarking system model used. A data set $D$ is transformed into a watermarked version $D_W$ by applying a watermark encoding function that also takes as inputs a secret key $K_s$ only known to the copyright owner and a watermark $W$. Watermarking modifies the data. However, these modifications are controlled by providing usability constraints referred to by the set $G$. These constraints limit the amount of alteration that can be performed on the data; such constraints will be discussed in detail in the following sections. The watermark encoding can be summarized by the following steps:

    Step E1. Data set partitioning: By using the secret key $K_s$, the data set $D$ is partitioned into $m$ nonoverlapping partitions $\{S_0, \ldots, S_{m-1}\}$.

    Step E2. Watermark embedding: A watermark bit is embedded in each partition by altering the partition statistics while still verifying the usability constraints in
  • 3. Fig. 2. Data partitioning algorithm.

    $G$. This alteration is performed by solving a constrained optimization problem.

    Step E3. Optimal threshold evaluation: The bit embedding statistics are used to compute the optimal threshold $T^*$ that minimizes the probability of decoding error.

    The watermarked version $D_W$ is delivered to the intended recipient. Then, it can suffer from unintentional distortions or attacks aimed at destroying the watermark information. Note that even intentional attacks are performed without any knowledge of $K_s$ or $D$, since these are not publicly available.

    Watermark decoding is the process of extracting the embedded watermark using the watermarked data set $D_W$, the secret key $K_s$, and the optimal threshold $T^*$. The decoding algorithm is blind as the original data set $D$ is not required for the successful decoding of the embedded watermark. The watermark decoding is divided into three main steps:

    Step D1. Data set partitioning: By using the data partitioning algorithm used in E1, the data partitions are generated.

    Step D2. Threshold-based decoding: The statistics of each partition are evaluated, and the embedded bit is decoded using a threshold-based scheme based on the optimal threshold $T^*$.

    Step D3. Majority voting: The watermark bits are decoded using a majority voting technique.

    In the following sections, we discuss each of the encoding and decoding steps in detail.

    4 DATA PARTITIONING

    In this section, we present the data partitioning algorithm that partitions the data set based on a secret key $K_s$. The data set $D$ is a database relation with scheme $D(P, A_0, \ldots, A_{\nu-1})$, where $P$ is the primary key attribute, $A_0, \ldots, A_{\nu-1}$ are the $\nu$ attributes that are candidates for watermarking, and $|D|$ is the number of tuples in $D$. The data set $D$ is to be partitioned into $m$ nonoverlapping partitions, namely, $\{S_0, \ldots, S_{m-1}\}$, such that each partition $S_i$ contains on average $|D|/m$ tuples from the data set $D$. Partitions do not overlap, that is, for any two partitions $S_i$ and $S_j$ such that $i \neq j$, we have $S_i \cap S_j = \{\}$. For each tuple $r \in D$, the data partitioning algorithm computes a MAC, which is considered to be secure [22] and is given by $H(K_s \,\|\, H(r.P \,\|\, K_s))$, where $r.P$ is the primary key of the tuple $r$, $H()$ is a secure hash function, and $\|$ is the concatenation operator. Using the computed MAC, tuples are assigned to partitions. For a tuple $r$, its partition assignment is given by

    $$\mathrm{partition}(r) = H(K_s \,\|\, H(r.P \,\|\, K_s)) \bmod m.$$

    Using the property that secure hash functions generate uniformly distributed message digests, this partitioning technique, on average, places $|D|/m$ tuples in each partition. Furthermore, an attacker cannot predict the tuples-to-partition assignment without the knowledge of the secret key $K_s$ and the number of partitions $m$, which are kept secret. Keeping $m$ secret is not a requirement. However, keeping it secret makes it harder for the attacker to regenerate the partitions. The partitioning algorithm is described in Fig. 2.

    Although the presence of a primary key in the relation being watermarked is a common practice in relational data, our technique can be easily extended to handle cases when the relation has no primary key. Assuming a single attribute relation, the most significant bits (MSB) of the data could be used as a substitute for the primary key. The use of the MSB assumes that the watermark embedding data alterations are unlikely to alter the MSB bits. However, if too many tuples share the same MSB bits, this would enable the attacker to infer information about the partition distribution. The solution would be to select the number of MSBs that minimizes the duplicates. Another technique, in case of a relation with multiple attributes, is to use identifying attributes instead of the primary key; for example, in medical data, we could use the patient full name, patient address, and patient date of birth.

    Our data partitioning algorithm does not rely on special marker tuples for the selection of data partitions, which makes it resilient to watermark synchronization attacks caused by tuple deletion and tuple insertion. By contrast, Sion et al. [23] use special marker tuples, having the property that $H(K_s \,\|\, H(r.P \,\|\, K_s)) \bmod m = 0$, to partition the data set. In Sion's approach, a partition is defined as the set of tuples between two markers. Marker-based techniques not only use markers to define partitions but also to define boundaries between the embedded watermark bits. Such a technique is very fragile to tuple deletion and insertion due to the errors caused by the addition and deletion of marker tuples. This attack is discussed in more detail in Section 8.

    5 WATERMARK EMBEDDING

    In this section, we describe the watermark embedding algorithm by formalizing the bit encoding as a constrained optimization problem. Then, we propose a GA and a PS technique that can be used to efficiently solve such an optimization problem. The selection of which optimization algorithm to use is decided according to the application time and processing requirements, as will be discussed further. At the end of this section, we give the overall watermark embedding algorithm. Our watermarking technique is able to handle tuples with multiple attributes, as we will discuss in Section 7. However, to simplify the following discussion, we assume the tuples in a partition $S_i$ contain a single numeric attribute. In such a case, each partition $S_i$ can be represented as a numeric data vector
  • 4. Fig. 3. Notation.

    Fig. 4. Bit encoding algorithm.

    $S_i = [s_{i1}, \ldots, s_{in}] \in \mathbb{R}^n$. Some of the notations used are cited in Fig. 3.

    5.1 Single Bit Encoding

    Given a watermark bit $b_i$ and a numeric data vector $S_i = [s_{i1}, \ldots, s_{in}] \in \mathbb{R}^n$, the bit encoding algorithm maps the data vector $S_i$ to a new data vector $S_i^W = S_i + \Delta_i$, where $\Delta_i = [\Delta_{i1}, \ldots, \Delta_{in}] \in \mathbb{R}^n$ is referred to as the manipulation vector. The performed manipulations are bounded by the data usability constraints referred to by the set $G_i = \{g_{i1}, \ldots, g_{ip}\}$. The encoding is based on optimizing an encoding function referred to as the hiding function, which is defined as follows:

    Definition 1. A hiding function $\Theta_\Upsilon : \mathbb{R}^n \rightarrow \mathbb{R}$, where $\Upsilon$ is the set of secret parameters decided by the data owner.

    The set $\Upsilon$ can be regarded as part of the secret key. Note that when the hiding function is applied to $S_i + \Delta_i$, the only variable is the manipulation vector $\Delta_i$, whereas $S_i$ and $\Upsilon$ are constants. To encode bit $b_i$ into set $S_i$, the bit encoding algorithm optimizes the hiding function $\Theta_\Upsilon(S_i + \Delta_i)$. Whether the hiding function is maximized or minimized depends on the bit $b_i$: if the bit $b_i$ is equal to 1, then the bit encoding algorithm solves the following maximization problem:

    $$\max_{\Delta_i} \; \Theta_\Upsilon(S_i + \Delta_i) \quad \text{subject to } G_i.$$

    However, if the bit $b_i$ is equal to 0, then the problem is simply changed into a minimization problem. The solution to the optimization problem generates the manipulation vector $\Delta_i^*$ at which $\Theta_\Upsilon(S_i + \Delta_i^*)$ is optimal. The new data set $S_i^W$ is computed as $S_i + \Delta_i^*$. Using contradicting objectives, namely, maximization for $b_i = 1$ and minimization for $b_i = 0$, ensures that the values of $\Theta_\Upsilon(S_i + \Delta_i^*)$ generated in both cases are located at maximal distance and, thus, makes the inserted bit more resilient to attacks, in particular, to alteration attacks.

    Fig. 4 depicts the bit encoding algorithm. The bit encoding algorithm embeds bit $b_i$ in the partition $S_i$ if $|S_i|$ is greater than $\xi$. The value of $\xi$ represents the minimum partition size. The maximize and minimize in the bit encoding algorithm optimize the hiding function $\Theta_\Upsilon(S_i + \Delta_i^*)$ subject to the constraints in $G_i$. The maximization and minimization solution statistics are recorded for each encoding step in $X_{max}$ and $X_{min}$, respectively, as indicated in lines 4 and 7 of the encoding algorithm. These statistics are used to compute optimal decoding parameters, as will be discussed in Section 6.

    The set of usability constraints $G_i$ represents the bounds on the tolerated change that can be performed on the elements of $S_i$. These constraints describe the feasible space for the manipulation vector $\Delta_i$ for each bit encoding step. These constraints are application and data dependent. The usability constraints are similar to the constraints enforced on watermarking algorithms for audio, images, and video, which mainly require that the watermark is not detectable by the human auditory and visual system [24], [12], [15], [8]. For example, interval constraints could be used to control the magnitude of the alteration for $\Delta_{ij}$, that is,

    $$\Delta_{ij}^{min} \le \Delta_{ij} \le \Delta_{ij}^{max}.$$

    Another example of usability constraints are classification-preserving constraints, which constrain the encoding alterations to generate data that belong to the same classification as the original data. For example, when watermarking age data, the results after the alteration should fall in the same age group, for example, "preschool" (0-6 years), "child" (7-13), "teenager" (14-18), "young male" (19-21), "adult" (22+). These constraints can be easily described using interval constraints as they are similar to defining bounds on $\Delta_i$. Another interesting type of constraint may require that the watermarked data set maintain certain statistics. For example, if the mean of the generated data set must be equal to the mean of the original data set, the constructed constraint is of the form

    $$\sum_{j=1}^{n} \Delta_{ij} = 0.$$

    Several other usability constraints could be devised depending on the application requirements. These constraints are handled by the bit encoding algorithm by using constrained optimization techniques when optimizing the hiding function, as will be discussed in the subsequent sections.

    For the sake of comparison in this paper, we use the statistics-based hiding function used by Sion et al. [23]. The mean and variance estimates of the new set $S_i^W = S_i + \Delta_i$ are referred to as $\mu_{(S_i + \Delta_i)}$ and $\sigma^2_{(S_i + \Delta_i)}$, respectively; in short, we will use $\mu$ and $\sigma^2$. We define the reference point as
  • 5. Fig. 5. The distribution of the set $S_i + \Delta_i$ on the number line and the tail entries circled.

    Fig. 6. Binary chromosome representing $\Delta_i$.

    $\mathit{ref} = \mu + c \times \sigma$, where $c \in (0, 1)$ is a secret real number that is a part of the set $\Upsilon$. The data points in $S_i + \Delta_i$ that are above $\mathit{ref}$ are referred to as the "tail" entries, as illustrated in Fig. 5. The hiding function $\Theta_c$ is defined as the number of tail entries normalized by the cardinality of $S_i$, also referred to as the normalized tail count. It is computed as follows:

    $$\Theta_c(S_i + \Delta_i) = \frac{1}{n} \sum_{j=1}^{n} 1_{\{s_{ij} + \Delta_{ij} \ge \mathit{ref}\}},$$

    where $n$ is the cardinality of $S_i$, and $1_{\{\}}$ is the indicator function defined as follows:

    $$1_{\{condition\}} = \begin{cases} 1 & \text{if } condition = \text{TRUE}, \\ 0 & \text{otherwise.} \end{cases}$$

    Note that the reference $\mathit{ref}$ is dependent on both $\mu$ and $\sigma$, which means that it is not fixed and dynamically varies with the statistics of $S_i + \Delta_i$. Also note that the normalized tail count $\Theta_c(S_i + \Delta_i)$ depends on the distribution of $S_i + \Delta_i$ and the dynamic $\mathit{ref}$.

    The objective function $\Theta_c(S_i + \Delta_i)$ is nonlinear and nondifferentiable, which makes the optimization problem at hand a nonlinear constrained optimization problem. In such problems, traditional gradient-based approaches turn out to be inapplicable. To solve this optimization problem, we propose two techniques based on GA and PS, respectively. The choice of the technique to use depends on the application processing requirements. Solving the optimization problem does not necessarily require finding a global solution because finding such a solution may require a large number of computations. Our main goal is to find a near optimal solution that ensures that the solutions of the minimization of $\Theta_c(S_i + \Delta_i)$ and the maximization of $\Theta_c(S_i + \Delta_i)$ are separated as far as possible from each other. As we will discuss further, GA could be used in order to determine global optimal solutions by trading processing time, whereas PS could be used to provide a local optimal solution without trading processing time. Note that these optimization techniques will also work for simpler hiding functions. However, having a simple linear hiding function makes it easier to attack. For example, if the average is used as the hiding function, the optimization problem will merely be adding or subtracting a constant term to the data vector to maximize or minimize the average.

    5.2 Genetic Algorithm Technique

    A GA is a search technique that is based on the principles of natural selection or survival of the fittest. Pioneering work in this field was conducted by Holland in the 1960s [13], [6]. Many researchers have refined his initial approach. Instead of using gradient information, the GA uses the objective function directly in the search. The GA searches the solution space by maintaining a population of potential solutions. Then, by using evolving operations such as crossover, mutation, and selection, the GA creates successive generations of solutions that evolve and inherit the positive characteristics of their parents and thus gradually approach optimal or near-optimal solutions. By using the objective function directly in the search, GAs can be effectively applied in nonconvex, highly nonlinear, complex problems [10], [5]. GAs have been frequently used to solve combinatorial optimization problems and nonlinear problems with complicated constraints or nondifferentiable objective functions. A GA is not guaranteed to find the global optimum; however, it is less likely to get trapped at a local optimum than traditional gradient-based search methods when the objective function is not smooth and generally well behaved. A GA usually analyzes a larger portion of the solution space than conventional methods and is therefore more likely to find feasible solutions in heavily constrained problems.

    The feasible set $\Omega_i$ is the set of values of $\Delta_i$ that satisfy all constraints in $G_i$. GAs do not work directly with points in the set $\Omega_i$ but rather with a mapping of the points in $\Omega_i$ into a string of symbols called chromosomes. A simple binary representation scheme uses symbols from $\{0, 1\}$; each chromosome is $L$ symbols long. As an example, the binary chromosome representing the vector $\Delta_i = [\Delta_{i1}, \ldots, \Delta_{in}]$ is indicated in Fig. 6. Note that each component of $\Delta_i$ uses $L/n$ bits, where $n = |S_i|$. This chromosome representation automatically handles interval constraints on $\Delta_i$. For example, if $\Delta_{ij}$ can only take values in the interval $[l_{ij}, h_{ij}]$, then mapping the integers in the interval $[0, 2^{L/n} - 1]$ to values in the interval $[l_{ij}, h_{ij}]$ via simple translation and scaling ensures that, whatever operations are performed on the chromosome, the entries are guaranteed to stay within the feasible interval.

    Each chromosome has a corresponding value of the objective function, referred to as the fitness of the chromosome. To handle other types of constraints, we penalize the infeasible chromosomes by reducing their fitness value according to a penalty function $\Phi(\Delta_i)$, which represents the degree of infeasibility. Without loss of generality, if we are solving the maximization problem in Section 5.1 with constraints $G_i = \{g_{i1}, \ldots, g_{ip}\}$, then the fitness function used is $\Theta_c(S_i + \Delta_i) + \lambda \Phi(\Delta_i)$, where $\lambda \in \mathbb{R}^-$ is the penalty multiplier and is chosen large enough in magnitude to penalize the objective function in case of infeasible $\Delta_i$. The penalty function $\Phi(\Delta_i)$ is given by

    $$\Phi(\Delta_i) = \sum_{j=1}^{p} g_{ij}^{+}(\Delta_i),$$

    where

    $$g_{ij}^{+}(\Delta_i) = \begin{cases} 0 & \text{if } \Delta_i \text{ is feasible w.r.t. } g_{ij}, \\ \rho(g_{ij}, \Delta_i) & \text{otherwise,} \end{cases}$$

    where $\rho(g_{ij}, \Delta_i) \in \mathbb{R}^+$ represents the amount of infeasibility with respect to the constraint $g_{ij}$. For example, if the
  • 6. constraint $g_{ij}$ is $\sum_{j=1}^{n} \Delta_{ij} = 0$, then the penalty term for $(g_{ij}, \Delta_i)$ is $\lVert \sum_{j=1}^{n} \Delta_{ij} \rVert$. For a detailed discussion of penalty-based techniques, the interested reader is referred to [20], [5].

A GA is less likely to get stuck in local optima. However, a GA requires a large number of functional evaluations to converge to a global optimum. Thus, we recommend the use of GAs only when the processing time is not a strict requirement and watermarking is performed offline. For faster performance, we recommend the use of PS techniques discussed in Section 5.3.

Fig. 7. Shows $\mathrm{Sigmoid}_{(\alpha,\tau)}$, where $\tau = 0$ and $\alpha = \{1, 2, 8\}$.

5.3 Pattern Search Technique

PS methods are a class of direct search methods for nonlinear optimization. PS methods [4], [14] have been widely used because of their simplicity and the fact that they work well in practice on a variety of problems. More recently, they have been shown to be provably convergent [16], [9]. PS starts at an initial point and samples the objective function at a predetermined pattern of points centered about that point with the goal of producing a new better iterate. Such moves are referred to as exploratory moves; Fig. 8 shows an example pattern in $\mathbb{R}^2$. If such sampling is successful (that is, it produces a new better iterate), the process is repeated with the pattern centered about the new best point. If not, the size of the pattern is reduced, and the objective function is again sampled about the current point. For a detailed discussion on PS refer to [16] and [9].

Fig. 8. Example pattern for coordinate search in $\mathbb{R}^2$, as part of a larger grid.

To improve the performance of PS, the objective function $\Theta_c(S_i + \Delta_i)$ is approximated by smooth sigmoid functions. The objective function is approximated as follows:

$$\hat{\Theta}_c(S_i + \Delta_i) = \frac{1}{n} \sum_{j=1}^{n} \mathrm{Sigmoid}_{(\alpha,\mathit{ref})}(s_{ij} + \Delta_{ij}),$$

where $\mathrm{Sigmoid}_{(\alpha,\tau)}(x)$ is a sigmoid function with parameters $(\alpha, \tau)$, shown in Fig. 7, and is defined as $1 - (1 + e^{\alpha(x - \tau)})^{-1}$.

Constraints could be handled using the techniques discussed earlier. However, PS can easily handle the constraints by limiting the exploratory moves to only the directions that end up in the feasible space, thus ensuring that the generated solution is feasible. For a more detailed discussion refer to [16]. The systematic behavior of PS and the adaptable pattern size lead to fast convergence to optimal feasible solutions. However, PS is not guaranteed to find a global optimum. This problem can be overcome by starting the algorithm from different initial feasible points.

For the sake of comparison, we conducted an experiment using normally distributed data, where the tail count $\Theta_c(S_i + \Delta_i)$ was maximized and minimized using both PS and GA with interval constraints. Both algorithms were restricted to use an equal number of objective function evaluations. Fig. 9 reports the results of this experiment, which show that PS generates better optimized tail counts and, thus, better separation between the maximization and minimization results. However, if GA is given more functional evaluations, it converges to global optimum solutions.

Fig. 9. Comparison between GA and PS for the same number of functional evaluations.

5.4 Watermark Embedding Algorithm

A watermark is a set of $l$ bits $W = b_{l-1}, \ldots, b_0$ that are to be embedded in the data partitions $\{S_0, \ldots, S_{m-1}\}$. To enable multiple embeddings of the watermark in the data set, the watermark length $l$ is selected such that $l \ll m$. The watermark embedding algorithm embeds a bit $b_i$ in partition $S_k$ such that $k \bmod l = i$. This technique ensures that each watermark bit is embedded $\lfloor m/l \rfloor$ times in the data set $D$. The watermark embedding algorithm is reported in Fig. 10. The watermark embedding algorithm generates the partitions by calling get_partitions; then, for each partition $S_k$, a watermark bit $b_i$ is encoded by using the single bit encoding algorithm (encode_single_bit) that was discussed in the previous sections. The generated altered partition $S_k^W$ is inserted into the watermarked data set $D_W$. Statistics $(X_{max}, X_{min})$ are collected after each bit embedding and are used by the get_optimal_threshold algorithm to compute the optimal decoding threshold; these details will be discussed further in the following sections.
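The embedding loop of Section 5.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Fig. 10: the hiding function is taken to be the normalized tail count with reference `mean + c * std` (an assumed form of the reference from Section 5.1), `toy_encode_single_bit` is a hypothetical stand-in for the PS or GA optimizer that only produces a feasible change within the ±0.5 percent bound, and skipping partitions smaller than `min_size` is likewise an assumption.

```python
from statistics import mean, pstdev

def tail_count(values, c=0.75):
    """Normalized tail count: fraction of values above ref = mean + c * std (assumed form)."""
    ref = mean(values) + c * pstdev(values)
    return sum(v > ref for v in values) / len(values)

def toy_encode_single_bit(values, bit, bound=0.005):
    """Hypothetical stand-in for the optimizer: scale above-mean values up (bit 1) or down (bit 0)."""
    mu = mean(values)
    factor = 1 + bound if bit == 1 else 1 - bound
    return [v * factor if v > mu else v for v in values]

def embed_watermark(partitions, watermark_bits, encode=toy_encode_single_bit, min_size=10):
    """Embed bit b_i into every partition S_k with k mod l == i; collect X_max and X_min."""
    l = len(watermark_bits)          # watermark_bits[i] holds b_i
    marked, x_max, x_min = [], [], []
    for k, part in enumerate(partitions):
        if len(part) < min_size:     # assumed: undersized partitions are left untouched
            marked.append(list(part))
            continue
        bit = watermark_bits[k % l]
        new_part = encode(part, bit)
        # Hiding-function values are kept for the later threshold computation.
        (x_max if bit == 1 else x_min).append(tail_count(new_part))
        marked.append(new_part)
    return marked, x_max, x_min
```

With six partitions and a three-bit watermark, each bit is embedded twice, which is the repetition the later majority vote relies on.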
  • 7. Fig. 10. Watermark embedding algorithm.

6 DECODING THRESHOLD EVALUATION

In the previous sections, we discussed the bit encoding technique that embeds a watermark bit $b_i$ in a partition $S_i$ to generate a watermarked partition $S_i^W$. In this section, we discuss the bit decoding technique that is used to extract the embedded watermark bit $b_i$ from the partition $S_i^W$. The bit decoding technique is based on an optimal threshold $T^*$ that minimizes the probability of decoding error. The evaluation of such an optimal threshold is discussed in this section.

Presented with the data partition $S_i^W$, the bit decoding technique computes the hiding function $\Theta_\Xi(S_i^W)$ and compares it to the optimal decoding threshold $T^*$ to decode the embedded bit $b_i$. If $\Theta_\Xi(S_i^W)$ is greater than $T^*$, then the decoded bit is 1; otherwise, the decoded bit is 0. For example, using the hiding function described in Section 5.1, the decoding technique computes the normalized tail count of $S_i^W$ by computing the reference $\mathit{ref}$ and by counting the number of entries in $S_i^W$ that are greater than $\mathit{ref}$. Then, the computed normalized tail count is compared to $T^*$, see Fig. 11. The decoding technique is simple; however, the value of the threshold $T^*$ should be carefully calculated so as to minimize the probability of bit decoding error, as will be discussed in this section.

Fig. 11. Threshold-based decoding scheme.

The probability of bit decoding error is defined as the probability that an embedded bit is decoded incorrectly. The decoding threshold $T^*$ is selected such that it minimizes the probability of decoding error. The bit embedding stage discussed in Section 5.1 is based on the maximization or minimization of the tail count; these optimized hiding function values computed during the encoding stage are used to compute the optimum threshold $T^*$. The maximized hiding function values corresponding to $b_i$'s equal to 1 are stored in the set $X_{max}$. Similarly, the minimized hiding function values are stored in $X_{min}$ (see algorithm in Fig. 4).

Let $P_{err}$, $P_0$, and $P_1$ represent the probability of decoding error, the probability of encoding a bit $= 0$, and the probability of encoding a bit $= 1$, respectively. Furthermore, let $b_e$, $b_d$, and $f(x)$ represent the encoded bit, the decoded bit, and a probability density function, respectively. $P_{err}$ is calculated as follows:

$$
\begin{aligned}
P_{err} &= P(b_d = 0, b_e = 1) + P(b_d = 1, b_e = 0) \\
&= P(b_d = 0 \mid b_e = 1) P_1 + P(b_d = 1 \mid b_e = 0) P_0 \\
&= P(x < T \mid b_e = 1) P_1 + P(x > T \mid b_e = 0) P_0 \\
&= P_1 \int_{-\infty}^{T} f(x \mid b_e = 1)\,dx + P_0 \int_{T}^{\infty} f(x \mid b_e = 0)\,dx.
\end{aligned}
$$

To minimize the probability of decoding error ($P_{err}$) with respect to the threshold $T$, we take the first order derivative of $P_{err}$ with respect to $T$ to locate the optimal threshold $T^*$, as follows:

$$
\begin{aligned}
\frac{\partial P_{err}}{\partial T} &= \frac{\partial}{\partial T} P_1 \int_{-\infty}^{T} f(x \mid b_e = 1)\,dx + \frac{\partial}{\partial T} P_0 \int_{T}^{\infty} f(x \mid b_e = 0)\,dx \\
&= P_1 f(T \mid b_e = 1) - P_0 f(T \mid b_e = 0).
\end{aligned}
$$

The distributions $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$ are estimated from the statistics of the sets $X_{min}$ and $X_{max}$, respectively. From our experimental observations of $X_{min}$ and $X_{max}$, the distributions $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$ pass the chi-square test of normality and thus can be estimated as Gaussian distributions $N(\mu_0, \sigma_0)$ and $N(\mu_1, \sigma_1)$, respectively. However, the following analysis can still be performed with other types of distributions. $P_0$ could be estimated by $\frac{|X_{min}|}{|X_{max}| + |X_{min}|}$ and $P_1 = 1 - P_0$. Substituting the Gaussian expressions for $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$, the first order derivative of $P_{err}$ is as follows:

$$
\frac{\partial P_{err}}{\partial T} = \frac{P_1}{\sigma_1 \sqrt{2\pi}} \exp\left(-\frac{(T - \mu_1)^2}{2\sigma_1^2}\right) - \frac{P_0}{\sigma_0 \sqrt{2\pi}} \exp\left(-\frac{(T - \mu_0)^2}{2\sigma_0^2}\right).
$$

By equating the first order derivative of $P_{err}$ to zero, we get the following quadratic equation, the roots of which include the optimal threshold $T^*$ that minimizes $P_{err}$. The second order derivative of $P_{err}$ is evaluated at $T^*$ to ensure that the second order necessary condition $\left(\frac{\partial^2 P_{err}(T^*)}{\partial T^2} > 0\right)$ is met.

$$
\frac{\sigma_0^2 - \sigma_1^2}{2\sigma_0^2\sigma_1^2}\, T^{*2} + \frac{\mu_0\sigma_1^2 - \mu_1\sigma_0^2}{\sigma_0^2\sigma_1^2}\, T^{*} + \ln\left(\frac{P_0 \sigma_1}{P_1 \sigma_0}\right) + \frac{\mu_1^2\sigma_0^2 - \mu_0^2\sigma_1^2}{2\sigma_0^2\sigma_1^2} = 0.
$$

From the above analysis, the selection of the optimal $T^*$ is based on the collected output statistics of the watermark embedding algorithm. The optimal threshold $T^*$ minimizes the probability of decoding error and thus enhances the strength of the embedded watermark by increasing the chances of successful decoding. To show the high dependency of the probability of decoding error on the choice of decoding threshold $T^*$, we conducted an experiment using real-life¹ data with usability constraints of ±0.5 percent of the original data value.
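The threshold selection derived in this section can be sketched numerically. The code below is an illustrative sketch, not the authors' get_optimal_threshold routine: the function names are hypothetical, the Gaussian parameters are assumed to have been estimated from $X_{min}$ and $X_{max}$ beforehand, and the root is chosen by directly comparing $P_{err}$ at the candidate roots rather than by evaluating the second derivative.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def p_err(t, mu0, s0, mu1, s1, p0):
    """P_err = P1 * P(x < T | b_e = 1) + P0 * P(x > T | b_e = 0)."""
    p1 = 1 - p0
    return p1 * gaussian_cdf(t, mu1, s1) + p0 * (1 - gaussian_cdf(t, mu0, s0))

def optimal_threshold(mu0, s0, mu1, s1, p0):
    """Solve a*T^2 + b*T + c = 0 from dP_err/dT = 0 and keep the root with the lowest P_err."""
    p1 = 1 - p0
    denom = s0 ** 2 * s1 ** 2
    a = (s0 ** 2 - s1 ** 2) / (2 * denom)
    b = (mu0 * s1 ** 2 - mu1 * s0 ** 2) / denom
    c = math.log((p0 * s1) / (p1 * s0)) + (mu1 ** 2 * s0 ** 2 - mu0 ** 2 * s1 ** 2) / (2 * denom)
    if abs(a) < 1e-12:               # equal variances: the equation becomes linear
        if b == 0:
            raise ValueError("identical distributions: no threshold separates them")
        roots = [-c / b]
    else:
        disc = b * b - 4 * a * c
        if disc < 0:
            raise ValueError("no real stationary point")
        r = math.sqrt(disc)
        roots = [(-b + r) / (2 * a), (-b - r) / (2 * a)]
    return min(roots, key=lambda t: p_err(t, mu0, s0, mu1, s1, p0))
```

For equal variances and equal priors the result reduces to the midpoint of the two means, which is a quick way to sanity-check the coefficients.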
  • 8. Fig. 12. (a) Shows $f(x \mid b_e = 0)$, $f(x \mid b_e = 1)$, and the computed $T^* = 0.24142$. (b) Gaussian approximation and experimental values of $P_{err}$ for different decoding threshold ($T$) values.

Fig. 13. (a) Shows $f(x \mid b_e = 0)$, $f(x \mid b_e = 1)$, and $T^*$ for different usability constraints. (b) The minimum $P_{err}$ at $T^*$ for different usability constraints.

The histograms and the Gaussian estimates of $X_{max}$ and $X_{min}$ obtained from the experiment are reported in Fig. 12a. The optimal computed threshold $T^*$ is indicated by the dotted vertical line. As we can see in Fig. 12a, the two distributions are far apart, which is a direct result of using the competing objectives for $b_i$ equal to 1 and 0. Fig. 12b shows the probability of decoding error for different values of the decoding threshold, which shows the presence of an optimal threshold that minimizes the probability of decoding error. Furthermore, Fig. 12b shows both the Gaussian approximation and the experimental values of the probability of decoding error, which shows that the Gaussian approximation matches the experimental results.

The probability of decoding error is also dependent on the usability constraints. If the usability constraints are tight, the amount of alterations to the data set $D$ may not be enough for the watermark insertion. Fig. 13a shows the effect of varying the usability constraints on the separation between $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$. Note that as the usability constraints are increased, this allows more encoding data manipulation and thus makes $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$ more separated. Fig. 13b shows the minimum probability of decoding error computed using the optimal threshold $T^*$ for data subject to different usability constraints. The overall watermark probability of decoding error is reduced by embedding the watermark multiple times in the data set, which is basically a repetition error correcting code.

1. Description of such data is discussed in Section 9.

7 WATERMARK DETECTION

In this section, we discuss the watermark detection algorithm that extracts the embedded watermark using the secret parameters including $K_s$, $m$, $\xi$, $c$, and $T^*$. The algorithm starts by generating the data partitions $\{S_0, \ldots, S_{m-1}\}$ using the watermarked data set $D_W$, the secret key $K_s$, and the number of partitions $m$ as input to the data partitioning algorithm discussed in Section 4. Each partition encodes a single watermark bit; to extract the embedded bit, we use the threshold decoding scheme based on the optimal threshold $T^*$ that minimizes the probability of decoding error, as discussed in Section 6. If the partition size is smaller than $\xi$, the bit is decoded as an erasure; otherwise, it is decoded using the threshold scheme.
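The per-partition decoding rule just described can be sketched as follows. This is a minimal sketch under assumptions: the hiding function is taken to be the normalized tail count with reference `mean + c * std` (an assumed form of the reference from Section 5.1), `None` is a hypothetical representation of an erasure, and the default `min_size` plays the role of the minimum partition size.

```python
from statistics import mean, pstdev

ERASURE = None  # hypothetical marker for a bit that cannot be decoded

def decode_single_bit(partition, threshold, c=0.75, min_size=10):
    """Return 1, 0, or ERASURE for one watermarked partition."""
    if len(partition) < min_size:
        return ERASURE                      # too few tuples survive: report an erasure
    ref = mean(partition) + c * pstdev(partition)
    tail = sum(v > ref for v in partition) / len(partition)
    return 1 if tail > threshold else 0     # compare the tail count with T*
```

As a usage example, `decode_single_bit([0] * 6 + [10] * 6, 0.24142)` has a tail count of 0.5 and decodes to 1, while a partition with a single large value out of twelve decodes to 0; the threshold value is the one quoted in Fig. 12a and is used here only for illustration.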
  • 9. Fig. 14. An example illustrates the majority bit matching decoding algorithm for a watermark $W = 011010$, with "×" representing the erasures.

As the watermark $W = b_{l-1}, \ldots, b_0$ is embedded several times in the data set, each watermark bit is extracted several times, where a bit $b_i$ is extracted from each partition $S_k$ with $k \bmod l = i$. The extracted bits are decoded using the majority voting technique, which is used in the decoding of repetition error correcting codes. Each bit $b_i$ is extracted $\lfloor m/l \rfloor$ times, so it represents a $\lfloor m/l \rfloor$-fold repetition code [19]. The majority voting technique is illustrated by the example in Fig. 14. The detailed algorithm used for watermark detection is reported in Fig. 15.

Fig. 15. Watermark detection algorithm.

In case of a relation with multiple attributes, the watermark resilience can be increased by embedding the watermark in multiple attributes. This is a simple extension to the presented encoding and decoding techniques in which the watermark is embedded in each attribute separately. For a relation with several attributes, the watermark bit is embedded in each of the columns separately using the bit embedding technique discussed in Section 5.1. The use of multiple attributes embeds each watermark bit once per attribute in each partition; such embedding can be considered as an inner repetition code [19] whose repetition factor is the number of attributes. For decoding purposes, the statistics $X_{max}$ and $X_{min}$ are collected for each attribute separately. The optimal threshold is computed for each attribute using the collected statistics to minimize the probability of decoding error as discussed in Section 6. In the decoding phase, the watermark is extracted separately from each of the attributes using the discussed watermark detection algorithm; then majority voting is used to detect the final watermark.

8 ATTACKER MODEL

In this section, we discuss the attacker model and the possible malicious attacks that can be performed. Assume that Alice is the owner of the data set $D$ and has marked $D$ by using a watermark $W$ to generate a watermarked data set $D_W$. The attacker Mallory can perform several types of attacks in the hope of corrupting or even deleting the embedded watermark. A robust watermarking technique must be able to survive all such attacks.

We assume that Mallory has no access to the original data set $D$ and does not know any of the secret information used in the embedding of the watermark, including the secret key $K_s$, the secret number of partitions $m$, the secret constant $c$, the optimization parameters, and the optimal decoding threshold $T^*$. Given these assumptions, Mallory cannot generate the data partitions $\{S_0, \ldots, S_{m-1}\}$ because this requires the knowledge of both the secret key $K_s$ and the number of partitions $m$; thus, Mallory cannot intentionally attack certain watermark bits. Moreover, any data manipulations executed by Mallory cannot be checked against the usability constraints because the original data set $D$ is unknown. Under these assumptions, Mallory is faced with the dilemma of trying to destroy the watermark and at the same time of not destroying the data. We classify the attacks performed by Mallory into three types, namely, deletion, alteration, and insertion attacks.

8.1 Deletion Attack

Mallory deletes $\alpha$ tuples from the marked data set. If the tuples are randomly deleted, then, on average, each partition loses $\alpha/m$ tuples. The watermarking techniques available in the literature rely on special tuples, referred to as marker tuples. Agrawal and Kiernan [1] use marker tuples to locate the embedded bit, and Sion et al. [23] use marker tuples to locate the start and end of data partitions. The embedded watermark is a stream of bits where the marker tuples identify the boundaries between these bits in the stream. The successful deletion of marker tuples deletes these boundaries between the bits of the watermark stream, which makes such marker-based watermarking techniques [1], [23] susceptible to watermark synchronization error. For example, using the watermarking technique presented by Sion et al. [23], Fig. 16a shows an example partitioned data set and the corresponding majority voting map used to decode the embedded watermark. The marker tuples are represented by the shaded cells; these markers are used to identify the start and end of each partition. The embedded bit is noted in each partition; the embedded watermark is "101010." Now, if Mallory successfully deletes the marker tuple controlling the first bit ($b_0$), this results in the deletion of the first bit, see Fig. 16b. The decoder, unaware of the deleted bit, will generate "×10101" instead, which is the result of decoding a shifted version of the embedded bits.
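The difference between losing a bit silently and reporting it as an erasure can be illustrated with a small sketch of majority voting over a repeated bit stream. This is an illustration only and not the detection algorithm of Fig. 15: the stream layout, the `ERASURE` marker, and the rule of leaving ties undecided are assumptions made for the example.

```python
from collections import Counter

ERASURE = None  # hypothetical marker for a bit lost to an undersized partition

def majority_vote(extracted, l):
    """Position k of `extracted` votes for watermark bit k mod l; erasures do not vote."""
    votes = [Counter() for _ in range(l)]
    for k, bit in enumerate(extracted):
        if bit is not ERASURE:
            votes[k % l][bit] += 1
    decoded = []
    for counter in votes:
        if counter[0] == counter[1]:
            decoded.append(ERASURE)          # no votes or a tie: leave undecided
        else:
            decoded.append(1 if counter[1] > counter[0] else 0)
    return decoded

watermark = [0, 1, 0, 1, 1, 0]               # b_0 ... b_5
stream = watermark * 5                       # five embeddings of the watermark
erased = [ERASURE] + stream[1:]              # first bit reported as an erasure
shifted = stream[1:]                         # first bit silently dropped
```

Here `majority_vote(erased, 6)` still recovers the watermark because every remaining bit keeps its position, whereas `majority_vote(shifted, 6)` returns a rotated bit pattern: every vote lands on the wrong watermark position, which is the synchronization error described above.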
  • 10. Fig. 16. Watermarked data set subject to the deletion and insertion attacks, respectively, and their corresponding majority voting maps. Gray-shaded cells represent the original marker tuples, and the black cells represent the added marker tuples. (a) Watermarked data set. (b) After deletion attack. (c) After insertion attack.

This results in a watermark synchronization error at the decoder, see Fig. 16b. Moreover, the resynchronization of the watermark bit stream becomes more complicated in the presence of flipped bits due to other decoding errors. Thus, the successful deletion of a single marker could result in a large number of errors in the decoding phase. To avoid watermark synchronization errors in marker-based techniques, the $m$ marker tuples should be stored, as indicated by Sion et al. in [23]. Note that this violates the requirement that the watermark decoding is blinded.

On the other hand, our partitioning technique is resilient to such synchronization errors as it does not rely on marker tuples to locate the partition limits; instead, our partitioning technique assigns tuples to partitions based on a different approach, as discussed in Section 4. We also use erasures to indicate the loss of a bit due to insufficient partition size and, thus, to maintain synchronization and ensure that our technique is resilient to the watermark synchronization error.

8.2 Alteration Attack

In this attack, Mallory alters the data value of tuples. Here, Mallory is faced with the challenge that altering the data may disturb the watermark; however, at the same time, Mallory does not have access to the original data set $D$ and, thus, may easily violate the usability constraints and render the data useless. The alteration attack basically perturbs the data in the hope of introducing errors in the embedded watermark bits. The attacker is trying to move the hiding function values from the left of the optimal threshold to the right and vice versa. However, using the conflicting objectives in encoding the watermark bits, that is, maximizing the tail count for $b_i = 1$ and minimizing the tail count for $b_i = 0$, maximizes the distance between the hiding function values in both cases; thus, it makes it more difficult for the attacker to alter the embedded bit. In addition, by the repeated embedding of the watermark and the use of the majority voting technique discussed in Section 7, this attack can easily be mitigated.

8.3 Insertion Attack

Mallory decides to insert tuples to the data set $D_W$ hoping to perturb the embedded watermark. The insertion of new tuples acts as additive noise to the embedded watermark. However, the watermark embedding is not based on a single tuple and is based on a cumulative hiding function that operates on all the tuples in the partition. Thus, the effect of adding tuples is a minor perturbation to the value of the hiding function and thus to the embedded watermark bit. Marker-based watermarking techniques may suffer badly from this attack because the addition of tuples may introduce new markers in the data set and thus lead to the addition of new bits in the embedded watermark sequence. Consequently, this results in a watermark synchronization error. Using the example mentioned earlier, Fig. 16a shows a partitioned data set and its corresponding majority voting map using the Sion et al. technique [23], where the embedded watermark is "101010."

Fig. 17. Resilience to deletion attack.
  • 11. Fig. 18. Resilience to $\alpha$-selected insertion and $(\alpha, \beta)$-insertion attacks. (a) $\alpha$-selected insertion attack. (b) $(\alpha, \beta)$-insertion attack. (c) $(\alpha, \beta)$-insertion attack.

Now, if Mallory successfully adds a marker tuple after the third marker tuple, this results in the addition of a new bit between ($b_2$) and ($b_3$), see Fig. 16c, where the black cell represents the added marker tuple. The decoder, unaware of the added bit, will generate "010111" instead, which is the result of decoding a shifted version of the embedded bits. This problem is further complicated in the presence of bit errors in the watermark stream. To ensure synchronization at the decoder, the marker-based watermarking techniques require the storage of the $m$ marker tuples to ensure successful partitioning of the data set in the presence of the insertion attack [23]. On the other hand, our partitioning algorithm is not dependent on special marker tuples, which makes it resilient to such an attack, and watermark synchronization is guaranteed during decoding.

The experimental results presented in Section 9 support the claims made about the resilience of our watermarking technique to all the above attacks.

9 EXPERIMENTAL RESULTS

In this section, we report the results of an extensive experimental study that analyzes the resilience of the proposed watermarking scheme to the attacks described in Section 8. All the experiments were performed on 3.2 GHz Intel Pentium IV CPUs with 512 Mbytes of RAM. We use real-life data from a relatively small database that contains the daily power consumption rates of some customers over a period of one year. Such data sets are made available through the CIMEG² project. The database size is approximately 5 Mbytes; for testing purposes only, a subset of the original data is used with 150,000 tuples. We used $c = 75$ percent, a 16-bit watermark, a minimum partition size $\xi = 10$, a number of partitions $m = 2{,}048$, and the data change was allowed within ±0.5 percent. The PS algorithm was used for the optimization. The optimal threshold was computed using the technique described in Section 6 to minimize the probability of decoding error. The watermarked data set was subject to different types of attacks including deletion, alteration, and addition attacks. The results were averaged over multiple runs. Similar results were obtained for both uniform and normally distributed synthetic data. We show that it is difficult for Mallory to remove or alter the watermark without destroying the data.

2. Consortium for the Intelligent Management of the Electric Power Grid (CIMEG). http://helios.ecn.purdue.edu/~cimeg.

We assessed computation times and observed a polynomial behavior with respect to the input data size. Given the setup described above, with a local database, we obtained an average of around 300 tuples/sec for watermark embedding, whereas detection turned out to be at least approximately five times as fast. This occurs in the nonoptimized interpreted Java proof of concept implementation. We expect major orders of magnitude speedups in a real-life deployment version. For comparison purposes, we have implemented the Sion et al. [23] approach with no stored markers, where the marker tuples are generated on the fly during both the encoding and the decoding phases.
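As a quick check on what the experimental setup above implies, the stated parameters can be plugged into the quantities defined in Sections 5.4 and 8.1. This is an illustrative calculation rather than a result reported by the authors, and it assumes that tuples are spread evenly over the partitions.

```latex
\begin{align*}
\text{embeddings per watermark bit} &= \left\lfloor \frac{m}{l} \right\rfloor
  = \left\lfloor \frac{2048}{16} \right\rfloor = 128, \\
\text{average partition size} &= \frac{150{,}000}{2048} \approx 73.2 \text{ tuples}, \\
\text{average loss per partition after deleting } \alpha \text{ tuples} &= \frac{\alpha}{m}
  = \frac{\alpha}{2048}.
\end{align*}
```

Under this even-spread assumption, an average partition starts well above the minimum partition size of 10, and each bit has 128 votes available to the majority decoder.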
  • 15. Fig. 19. (a), (b), and (c) Resilience to fixed-$(\alpha, \beta)$ alter attacks. (d), (e), and (f) Resilience to random-$(\alpha, \beta)$ alter attacks.

9.1 Deletion Attack

In this attack, Mallory randomly drops $\alpha$ tuples from the watermarked data set; the watermark is then decoded and the watermark loss is measured for different $\alpha$ values. Furthermore, in this test, we compare our implementation with the Sion et al. (No Stored Markers) [23] approach. Fig. 17 shows the experimental results; they clearly show that our watermarking technique is resilient to the random deletion attack. Using our technique, the watermark was successfully extracted with 100 percent accuracy even when more than 80 percent of the tuples were deleted. On the other hand, the technique by Sion et al. badly deteriorates when only 10 percent of the tuples were deleted. We believe that the high resilience of our watermarking technique is due to the marker-free data partitioning algorithm that is resilient to the watermark synchronization errors caused by the tuple deletion.

Because our technique is highly resilient to tuple deletion attacks, the watermark can be retrieved from a small sample of the data. This important property combined with the high efficiency of our watermark detection algorithm makes it possible to develop tools able to effectively and efficiently search the Web to detect illegal copies of data.
  • 18. TABLE 1. Comparison between our Technique and the Technique by Sion et al. (No Stored Markers) [23]

We could think of an agent-based tool where the agent visits sites and selectively tests parts of the stored data sets to check for ownership rights. Such a technique would only need to inspect 20 percent of the data for successful watermark detection.

9.2 Insertion Attack

In this experiment, Mallory attempts to add a number of tuples hoping to weaken the embedded watermark. However, by adding tuples to the current data, Mallory is adding meaningless data to the current data. Mallory could simply generate the new tuples by replicating values in randomly selected existing tuples; we refer to such an attack as the $\alpha$-selected insertion attack. Mallory could randomly generate the tuple values by generating random data from the range $(\mu_{D_W} - \beta\sigma_{D_W}, \mu_{D_W} + \beta\sigma_{D_W})$, where $\mu_{D_W}$ and $\sigma_{D_W}$ are the mean and standard deviation of the data set $D_W$, respectively. We refer to such an attack as the $(\alpha, \beta)$-insertion attack. Fig. 18a shows a comparison between our approach and the Sion et al. (No Stored Markers) [23] approach. The comparison shows that our technique is resilient to the $\alpha$-selected attack even when $\alpha$ is up to 100 percent of the data set size. On the other hand, the Sion et al. marker-based technique deteriorates just after adding 10 percent of the data set size. Figs. 18b and 18c show the resilience of our watermarking technique to the $(\alpha, \beta)$-insertion attack, where the watermark was recovered with 100 percent accuracy even when up to 80 percent of the data set size tuples were inserted.

9.3 Alteration Attack

We tested our watermarking technique against two types of alteration attacks, namely, the fixed and the random $(\alpha, \beta)$ alter attacks. In the fixed-$(\alpha, \beta)$ alter attack, Mallory randomly selects $\alpha$ tuples and alters them by multiplying $\alpha/2$ tuples by exactly $(1 + \beta)$ and the other $\alpha/2$ tuples by $(1 - \beta)$. In this attack, the value of $\beta$ is fixed. In the random-$(\alpha, \beta)$ alter attack, $\alpha$ tuples are randomly selected; $\alpha/2$ tuples are then multiplied by $(1 + x)$ and the other $\alpha/2$ tuples by $(1 - x)$, where $x$ is a uniform random variable in the range $[0, \beta]$.

Figs. 19a, 19b, and 19c show the behavior of our watermarking technique subject to the fixed-$(\alpha, \beta)$ alter attack. As we can see in Fig. 19a, the watermark is decoded with 100 percent accuracy even when 100 percent of the tuples are altered with $\beta$ up to 1.0 percent. This shows the strong resilience of our watermarking technique to fixed alteration attacks. Furthermore, Fig. 19b shows the number of corrupted tuples as the attack proceeds. Tuples that exceed the usability constraints are referred to as corrupted tuples. Fig. 19b shows that once $\beta$ passes 0.9 percent, there is a sudden increase in the number of corrupted tuples; such an increase is due to the usability constraints used in this experiment, which are set to ±0.5 percent. Fig. 19c is a clear description of the dilemma that the attacker is facing. The dotted lines show the number of corrupted tuples, whereas the solid lines represent the detected watermark accuracy. By increasing $\beta$, the attacker is able to corrupt the watermark to 80 percent accuracy; however, at the same time, 75 percent of the tuples are corrupted. Similar results were observed for the random-$(\alpha, \beta)$ attack, which are shown in Figs. 19d, 19e, and 19f.

Experiments performed at lower usability constraints still showed similar resilience trends of the watermark encoding and decoding when subject to the above attacks. Table 1 shows a comparison between our technique and the Sion et al. technique based on the different watermark attacks and the main characteristics of each technique.
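The two alteration attacks of Section 9.3 can be sketched as follows, together with a count of corrupted tuples. This is a minimal sketch under assumptions rather than the authors' test harness: `alpha` is treated as a fraction of the data set, each tuple is reduced to a single numeric value, and the function names are hypothetical.

```python
import random

def alter_attack(values, alpha, beta, randomized=False, seed=None):
    """Alter a fraction `alpha` of the values: half scaled by (1 + x), half by (1 - x).

    x equals beta for the fixed attack and is uniform in [0, beta] for the random attack.
    """
    rng = random.Random(seed)
    n = int(alpha * len(values))
    chosen = rng.sample(range(len(values)), n)
    attacked = list(values)
    for pos, i in enumerate(chosen):
        x = rng.uniform(0, beta) if randomized else beta
        sign = 1 if pos < n // 2 else -1
        attacked[i] = values[i] * (1 + sign * x)
    return attacked

def corrupted_fraction(original, attacked, bound=0.005):
    """Fraction of values moved beyond the usability constraint (relative change > bound)."""
    bad = sum(abs(a - o) > bound * abs(o) for o, a in zip(original, attacked))
    return bad / len(original)
```

With a ±0.5 percent bound, a fixed attack with `beta = 0.01` on every value corrupts all of them, while `beta = 0.004` corrupts none; this is the attacker's trade-off between damaging the watermark and keeping the data usable.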
  • 35. 10 CONCLUSION

In this paper, we have presented a resilient watermarking technique for relational data that embeds watermark bits in the data statistics. The watermarking problem was formulated as a constrained optimization problem that maximizes or minimizes a hiding function based on the bit to be embedded. GA and PS techniques were employed to solve the proposed optimization problem and to handle the constraints. Furthermore, we presented a data partitioning technique that does not depend on special marker tuples to locate the partitions and proved its resilience to watermark synchronization errors. We developed an efficient threshold-based technique for watermark detection that is based on an optimal threshold that minimizes the probability of decoding error. The watermark resilience was improved by the repeated embedding of the watermark and by using the majority voting technique in the watermark decoding phase. Moreover, the watermark resilience was improved by using multiple attributes.

A proof of concept implementation of our watermarking technique was used to conduct experiments using both synthetic and real-world data. A comparison of our watermarking technique with previously proposed techniques shows the superiority of our technique under deletion, alteration, and insertion attacks.

REFERENCES

[1] R. Agrawal and J. Kiernan, "Watermarking Relational Databases," Proc. 28th Int'l Conf. Very Large Data Bases, 2002.
[2] M. Atallah and S. Lonardi, "Authentication of LZ-77 Compressed Data," Proc. ACM Symp. Applied Computing, 2003.
[3] M. Atallah, V. Raskin, C. Hempelman, M. Karahan, R. Sion, K. Triezenberg, and U. Topkara, "Natural Language Watermarking and Tamperproofing," Proc. Fifth Int'l Information Hiding Workshop, 2002.
[4] G. Box, "Evolutionary Operation: A Method for Increasing Industrial Productivity," Applied Statistics, vol. 6, no. 2, pp. 81-101, 1957.
[5] E. Chong and S. Żak, An Introduction to Optimization. John Wiley & Sons, 2001.
[6] D. Coley, "Introduction to Genetic Algorithms for Scientists and Engineers," World Scientific, 1999.
[7] C. Collberg and C. Thomborson, "Software Watermarking: Models and Dynamic Embeddings," Proc. 26th ACM SIGPLAN-SIGACT Symp. Principles of Programming Languages, Jan. 1999.
[8] I. Cox, J. Bloom, and M. Miller, Digital Watermarking. Morgan Kaufmann, 2001.
[9] E. Dolan, R. Lewis, and V. Torczon, "On the Local Convergence of Pattern Search," SIAM J. Optimization, vol. 14, no. 2, pp. 567-583, 2003.
[10] D. Goldberg, Genetic Algorithm in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
[11] D. Gross-Amblard, "Query-Preserving Watermarking of Relational Databases and XML Documents," Proc. 22nd ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS '03), pp. 191-201, 2003.
[12] F. Hartung and M. Kutter, "Multimedia Watermarking Techniques," Proc. IEEE, vol. 87, no. 7, pp. 1079-1107, July 1999.
[13] J. Holland, Adaptation in Natural and Artificial Systems. The MIT Press, 1992.
[14] R. Hooke and T. Jeeves, "Direct Search Solution of Numerical and Statistical Problems," J. Assoc. for Computing Machinery, vol. 8, no. 2, pp. 212-229, 1961.
[15] G. Langelaar, I. Setyawan, and R. Lagendijk, "Watermarking Digital Image and Video Data: A State-of-the-Art Overview," IEEE Signal Processing Magazine, vol. 17, no. 5, pp. 20-46, Sept. 2000.
[16] R. Lewis and V. Torczon, "Pattern Search Methods for Linearly Constrained Minimization," SIAM J. Optimization, vol. 10, no. 3, pp. 917-941, 2000.
[17] Y. Li, H. Guo, and S. Jajodia, "Tamper Detection and Localization for Categorical Data Using Fragile Watermarks," Proc. Fourth ACM Workshop Digital Rights Management (DRM '04), pp. 73-82, 2004.
[18] Y. Li, V. Swarup, and S. Jajodia, "Fingerprinting Relational Databases: Schemes and Specialties," IEEE Trans. Dependable and Secure Computing, vol. 2, no. 1, pp. 34-45, Jan.-Mar. 2005.
[19] R. Morelos-Zaragoza, The Art of Error Correcting Coding. John Wiley & Sons, 2002.
[20] J. Nocedal and S. Wright, Numerical Optimization. Prentice Hall, 1999.
[21] F. Petitcolas, R. Anderson, and M. Kuhn, "Attacks on Copyright Marking Systems," LNCS, vol. 1525, pp. 218-238, Apr. 1998.
[22] B. Schneier, Applied Cryptography. John Wiley & Sons, 1996.
[23] R. Sion, M. Atallah, and S. Prabhakar, "Rights Protection for Relational Data," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 6, June 2004.
[24] M. Swanson, M. Kobayashi, and A. Tewfik, "Multimedia Data-Embedding and Watermarking Technologies," Proc. IEEE, vol. 86, pp. 1064-1087, June 1998.
[25] L. Vaas, "Putting a Stop to Database Piracy," eWEEK, Enterprise News and Revs., Sept. 2003.
[26] R. Wolfgang, C. Podilchuk, and E. Delp, "Perceptual Watermarks for Digital Images and Video," Proc. IEEE, vol. 87, pp. 1108-1126, July 1999.

Mohamed Shehab received the PhD degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana. He is currently working as an assistant professor in the Department of Software and Information Systems, University of North Carolina, Charlotte. His research interests include information security, distributed access control, distributed workflow management systems, and watermarking of relational databases. He is a member of the IEEE.

Elisa Bertino is a professor of computer science and electrical and computer engineering at Purdue University and serves as a research director of the Center for Education and Research in Information Assurance and Security (CERIAS). Her main research interests include security and database systems. In those areas, she has published more than 300 papers. She is the coordinating editor in chief of the Very Large Database Systems Journal and serves on the editorial boards of several journals. She is a fellow of the IEEE and the ACM. She received the 2002 IEEE Computer Society Technical Achievement Award for outstanding contributions to database systems and database security and advanced data management systems.

Arif Ghafoor is currently a professor in the School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, and is the director of the Distributed Multimedia Systems Laboratory and the Information Infrastructure Security Research Laboratory. His research interests include database security, parallel and distributed computing, and multimedia information systems, and he has published extensively in these areas. He has served on the editorial boards of various journals. He is a fellow of the IEEE. He has received the IEEE Computer Society 2000 Technical Achievement Award for his research contributions on multimedia systems.
