A new model for secure dissemination of xml content



292 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART C: APPLICATIONS AND REVIEWS, VOL. 38, NO. 3, MAY 2008

A New Model for Secure Dissemination of XML Content

Ashish Kundu, Student Member, IEEE, and Elisa Bertino, Fellow, IEEE

Abstract: The paper proposes an approach to content dissemination that exploits the structural properties of an Extensible Markup Language (XML) document object model in order to provide efficient dissemination while assuring content integrity and confidentiality. Our approach is based on the notion of encrypted postorder numbers that support the integrity and confidentiality requirements of XML content as well as facilitate efficient identification, extraction, and distribution of selected content portions. By using such notion, we develop a structure-based routing scheme that prevents information leaks in the XML data dissemination, and assures that content is delivered to users according to the access control policies, that is, policies specifying which users can receive which portions of the contents. Our proposed dissemination approach further enhances such structure-based, policy-based routing by combining it with multicast in order to achieve high efficiency in terms of bandwidth usage and speed of data delivery, thereby enhancing scalability. Our dissemination approach thus represents an efficient and secure mechanism for use in applications such as publish–subscribe systems for XML documents. The publish–subscribe model restricts the consumer and document source information to the routers with which they register. Our framework facilitates dissemination of contents with varying degrees of confidentiality and integrity requirements in a mix of trusted and untrusted networks, which is prevalent in current settings across enterprise networks and the web. Also, it does not require the routers to be aware of any security policy, in the sense that the routers do not need to implement any policy related to access control.

Index Terms: Encryption, Extensible Markup Language (XML), postorder traversal, preorder traversal, publish–subscribe, security, structure-based routing, trees.

Manuscript received on December 12, 2006, revised on April 20, 2007. This work was supported in part by the National Science Foundation under Grant 0430274 and in part by the sponsors of CERIAS. This paper was recommended by Guest Editors P. Hung, M. Alesky, and Z. Milosevic.
The authors are with the Center for Education and Research in Information Assurance and Security (CERIAS) and Department of Computer Science, Purdue University, West Lafayette, IN 47907 USA (e-mail: ashishk@cs.purdue.edu; bertino@cs.purdue.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSMCC.2008.919213
1094-6977/$25.00 © 2008 IEEE
Authorized licensed use limited to: GOVERNMENT COLLEGE OF TECHNOLOGY. Downloaded on January 22, 2009 at 02:31 from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

THE PROBLEM of content dissemination in an enterprise setting as well as a web setting has been widely investigated, and various dissemination techniques have been proposed [1], [2]. Recently, however, the transformation and growth of enterprise networks into more dynamic frameworks than just a passive repository of content, as well as the increase in the ubiquity of services, have contributed to the significance and complexity of this problem. The evolution of the Extensible Markup Language (XML) [3] and its influence on the data models toward XML-ization has made the document object model (DOM) [4] a de facto standard for content representation. Enterprise computing paradigms have been using the XML DOM as the primary standard for data representation. Web services in intraenterprise and interenterprise networks are being adopted as the components for distributed computing; these web services are primarily XML-based services. Recent developments in content-network appliances provide technology that can be used at the network level to efficiently filter and distribute contents to interested parties in a possibly very large distributed system.

Efficiency and scalability must, however, be provided while assuring the security of contents and the privacy of the parties acquiring and disseminating contents. It is useless to provide high-bandwidth content distribution systems if the integrity of the disseminated contents is not assured or the ownership of the contents is not protected. Such problems are further complicated when dealing with contents encoded in XML, in that, because of the hierarchical organization of the content, different confidentiality and integrity requirements may exist for different portions of the same content.

Data that a consumer is not authorized to access, but that belongs to the complete data set, is called extraneous data. Flow of extraneous data to a consumer may leak information, even when this data is encrypted. In particular, extraneous data is prone to off-line dictionary attacks even by a legitimate consumer that can exploit contextual knowledge from the data elements it has access to. Therefore, it is important that extraneous data, even if encrypted with keys that the consumer does not have, be removed from the content before its delivery.

We thus need a dissemination approach, specifically tailored to XML, that addresses the issues of security, privacy, and scalability in a holistic manner. Relevant requirements for such a dissemination approach include the following.
1) Access control: A consumer must be provided with only that data set that it is permitted to access.
2) Data integrity: Not only must the integrity of the received data be verifiable by the consumer, but also any compromise to the data must be precisely determined.

An XML-based data instance, when represented using the DOM, has an underlying tree [5] structure. Each node of such a tree refers to an XML entity. A tree is a nonlinear structure with a set of nodes and edges such that it is acyclic; there is a special node called root with no incoming edges, and every other node has exactly one incoming edge. Fig. 1(a) shows a tree, the abstract representation of an XML document. Section II-A elaborates on the XML data model.

In this paper, we develop an approach that addresses the problem of content dissemination in enterprise and cross-enterprise networks; our solution satisfies the outlined requirements in a holistic manner. The proposed dissemination model exploits various structural properties of the XML DOM in order to
support access control, integrity, and privacy requirements. The use of structural properties is favorable to efficiency and scalability of the dissemination framework.

Fig. 1. (a) Tree. Abstract representation of an XML document. (b) Postorder numbers associated with each node. (c) Encrypted postorder numbers associated with each node.

Fig. 2. Security requirements in XML dissemination.

Our solution is based on the simple notion of postorder numbering [5] and its properties. By using such notion, we develop a novel content routing scheme called structure-based routing of XML data. Such routing scheme prevents information leaks, and at the same time improves efficiency and scalability of the structure-based dissemination model. A key feature of our approach is that it directly takes into account access control policies, that is, policies specifying which entity can access which portion of the contents, so that contents are disseminated according to these policies. The resulting dissemination model is a multicast model for the XML dissemination (Fig. 2) that, based on the content structure and access control policies, builds an overlay topology. Moreover, we exploit the properties of postorder numbers (PONs) for integrity assurance. Our technique allows consumers to verify the integrity of data they receive and, in the case in which data have been tampered with, allows the consumers to determine the affected portions of the data. In what follows, we refer to the XML documents as documents, trees, data, or content trees. Subtrees refer to a subdocument in terms of DOM. An element in DOM refers to a vertex or a node in the corresponding tree representation. The terms user and consumer are used synonymously.

A. Main Contributions

The main contributions of this paper can be summarized as follows:
1) extensions of the notion of postorder numbering to derive a family of secure structural identifiers called "encrypted PONs" that overcome the vulnerabilities of postorder numbering while preserving all its desirable properties;
2) applications of encrypted postorder numbering to the verification of content integrity and to the prevention of information leaks in order to assure data confidentiality;
3) a novel method for structure-based routing that can be used in nontrusted domains for dissemination of sensitive content.

B. Outline of the Paper

Section II presents some simple observations on XML data models and PONs. Section III introduces the notion of encrypted PONs. Document encoding and encryption techniques using encrypted PONs are defined in Section IV. Based on the structural encoding of the documents, structure-based routing and a dissemination model for XML documents are proposed in Section V. Section VI analyzes the proposed dissemination model with respect to the requirements described in Section I. Section VII discusses the related work, and Section VIII concludes the paper.

II. SOME SIMPLE OBSERVATIONS

In this section, we discuss the properties of XML data and postorder numbers.

A. XML Data Model

DOM is the commonly used model for representing XML-based languages [4]. DOM organizes data as a rooted tree. In what follows, document refers to such a tree and documentroot refers to its root. Moreover, element refers to an intermediate node in the tree. Content of a node includes attr, documenttype, and DOMimplementation [6]. Each node that is one of the following (text, CDATAsection, processinginstruction, comment) is a leaf node in the tree. An entity refers to a nonroot node in the tree.

Let D be an XML data instance organized according to the DOM representation. Let T(V, E) be a tree representing the document D; V and E denote the set of nodes (vertices) and of edges of D, respectively. Let x be a node in V. Let D_x denote the subtree of D rooted at x. Some or all the nodes in an XML data instance contain content. Content of a node x is referred to as content_x. content_x contains only the content specific to x and not of other nodes. The relation between parent–child nodes is represented as directed edges, with edges directed from parents to children. In what follows, ancestor(x) denotes the set of ancestors of x.

The dissemination of a document exploits the following structural properties in order to meet the requirements of secure and scalable dissemination of XML data.
1) XML data is order-preserving, that is, nodes x and y have an order among them in D.
2) The unit of data access is the subtree representation of a subdocument. The smallest unit is a node.
3) Any element and its corresponding subdocument are accessible by themselves or through a subtree rooted at any of their ancestors.
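The notation of Section II-A can be illustrated with a minimal Python sketch. It models a document as a rooted, ordered tree and derives the subdocument D_x and the set ancestor(x). The `Node` class, the function names, and the small example tree are hypothetical, introduced only for illustration; they are not part of the DOM API or of the paper.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """One node of the content tree; `content` holds only this node's own content."""
    name: str
    content: str = ""
    children: List["Node"] = field(default_factory=list, repr=False)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        # Edges are directed from parent to child; children keep document order.
        child.parent = self
        self.children.append(child)
        return child


def subtree(x: Node) -> Iterator[Node]:
    """Yield the nodes of the subdocument D_x (x itself and all its descendants)."""
    yield x
    for child in x.children:
        yield from subtree(child)


def ancestors(x: Node) -> List[Node]:
    """Return ancestor(x), nearest ancestor first."""
    result = []
    while x.parent is not None:
        x = x.parent
        result.append(x)
    return result


# Hypothetical example document (not the tree of Fig. 1).
root = Node("doc")
a = root.add(Node("a"))
b = a.add(Node("b", "some text"))
c = root.add(Node("c"))

print([n.name for n in subtree(a)])    # nodes of D_a
print([n.name for n in ancestors(b)])  # ancestor(b)
```

Under this model, property 3 above corresponds to the fact that `b` is reachable either directly or by walking `subtree(a)` or `subtree(root)`.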
These properties are crucial in ensuring that the structure-based routing extracts the correct subdocument and routes it to the correct consumer; they are reflected by postorder numbers, discussed in the next section.

B. Postorder Numbers

PONs are the numbers assigned to the nodes of a tree according to the postorder traversal of the tree [5]. Let p_x denote the PON of node x ∈ V, where V is the set of vertices of an XML document D. A number is assigned to a node only when each of its children has been assigned a PON. The children of a node are assigned PONs in left-to-right order. The highest PON is |V| and the lowest is 1. If z is the parent of x and x is the last child of z to be visited in postorder traversal before z is traversed, then p_z = p_x + 1. Fig. 1(b) shows the PONs assigned to each node of the tree in Fig. 1(a).

We now present some important properties of PONs.
1) (PON-I): p_x uniquely refers to x and the subdocument D_x in D.
2) (PON-II): Let z be the parent of x. Then, p_x < p_z.
3) (PON-III): Let p_x^lowest be the lowest PON of any element in the subdocument D_x; let u be a descendant of x. Then, p_x^lowest ≤ p_u ≤ p_x.
4) (PON-IV): Let x and y be left and right children of z. Then, p_x < p_y.

The first property assures that we can identify and extract a specific subdocument in a document. The second property is the basis for reasoning about the relation between the parent and a child. The third property imposes a lower and upper bound on the possible PON of any element in a subtree. It is useful in identifying if a new node has been added to the document and which one it is. The fourth property is used to determine if there is any swapping among siblings in the received subdocument.

However, the general notion of the PON has some drawbacks with respect to security. The PON of a node x indicates the number of nodes in the subdocument D_x. This is not desirable, especially when the consumer is permitted to access only a subset of the subdocument D_x. The consumer may be able to infer additional and possibly sensitive information regarding the size of the document. This is against the confidentiality and privacy requirements of the dissemination model, which is our goal.

Moreover, PONs are predictable, given their values and the distance between consecutive numbers. Thus, they can be generated, because they always lie in a range from 1 to the total number of elements in the document. This makes the data easily vulnerable to tampering. Therefore, we need a notion, equivalent to the notion of PON, that overcomes the aforementioned drawbacks and still exhibits all the good properties of PONs. The next section describes such a notion and proposes a mechanism to encode the XML data such that information leaks are prevented.

III. ENCRYPTED POSTORDER NUMBERS

The notion of encrypted PON (EPON) is derived from the general notion of PON and overcomes the security-related flaws of solutions based on the use of PON.

A. Computation

Let {p_1, p_2, ..., p_n} be a set of PONs for an XML data instance. Each p_i, i = 1, ..., n, is combined with a unique random number r_i. The combined values are then encrypted by using an order-preserving encryption function [7]. The resulting set of numbers is the set of EPONs, and each of them is an EPON specific to a data node in the XML document. The EPON for a node x with PON p_x is denoted by e_x. By combination, we mean a process of addition or concatenation or some other possible combination of operations. The random values are chosen so that the combined values preserve the order of PONs. The encryption process encrypts these numbers in such a way that the ordering among the entities is preserved. The random value associated with a PON follows a strictly increasing order, with the lowest random value being associated with the lowest PON.

Let x, y, and z be nodes such that x and y are children of z; let p_x, p_y, and p_z be their PONs, respectively. The random values would be r_x, r_y, and r_z, respectively. By definition of PONs (Section II), p_z > p_x and p_z > p_y. Correspondingly, r_z > r_x and r_z > r_y, that is, the order of the PONs is preserved by the random numbers. r_x and r_y should be chosen so that no relation can possibly be established between them.

B. Properties of EPON

EPONs preserve the order of PONs; therefore, the properties characterizing EPONs are identical to those of PONs. We refer to the properties of EPONs as EPON-properties.
1) (EPON-I): e_x uniquely determines the location of x in D and e_x uniquely determines the subdocument D_x.
2) (EPON-II): Let z be the parent of x. Then, e_x < e_z.
3) (EPON-III): Let e_x^lowest be the lowest EPON of any element in the subdocument D_x; let u be a descendant of x. Then, e_x^lowest ≤ e_u ≤ e_x.
4) (EPON-IV): Let x and y be left and right children of z. Then, e_x < e_y.

EPONs can be computed as follows. While traversing the content tree in postorder and assigning PON p_x to node x, a random number r_x is generated such that if node y is the node visited just prior to x, then r_x ≥ r_y. Once all nodes have been visited, all the combined values (of p_x and r_x) are encrypted with the order among them being preserved. The algorithm is presented as follows.

  Traverse the content tree in postorder.
    Let the current node be x.
    Let p_x be the PON for x.
    Generate a random number r_x such that
      if node y is the last node visited prior to x, then r_x ≥ r_y.
  If all the nodes have been visited, then
    ∀x, encrypt the combination of p_x and r_x with their order preserved.

IV. DOCUMENT ENCODING AND ENCRYPTION

In this section, we introduce the notion of structural identifiers for nodes in an XML document.
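The PON assignment of Section II-B and the EPON algorithm of Section III can be sketched in Python as follows. This is only an illustration under stated assumptions: the tree `TREE`, the function names, and the key format are hypothetical; combination is done by addition; and the order-preserving encryption function of [7] is replaced by a toy strictly increasing affine map, which preserves order but offers no real security. Python's `random` module is used for reproducibility and is not a cryptographic source of randomness.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical document tree: node name -> ordered list of children (left to right).
TREE: Dict[str, List[str]] = {"root": ["a", "d"], "a": ["b", "c"], "d": ["e"]}


def postorder(tree: Dict[str, List[str]], root: str) -> List[str]:
    """Return node names in postorder: children left to right, then the node."""
    order: List[str] = []

    def visit(node: str) -> None:
        for child in tree.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order


def assign_pons(tree: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """PONs run from 1 (first node visited) to |V| (the root)."""
    return {node: p for p, node in enumerate(postorder(tree, root), start=1)}


def assign_epons(tree: Dict[str, List[str]], root: str,
                 key: Tuple[int, int], rng: random.Random) -> Dict[str, int]:
    """Combine each PON with a non-decreasing random value, then apply an
    order-preserving map. ASSUMPTION: the affine map scale*c + shift is only a
    stand-in for a real order-preserving encryption scheme."""
    scale, shift = key
    if scale <= 0:
        raise ValueError("scale must be positive to preserve order")
    epons: Dict[str, int] = {}
    r = 0
    for p, node in enumerate(postorder(tree, root), start=1):
        r += rng.randint(0, 1000)   # r_x >= r_y for the node y visited just before x
        combined = p + r            # combination by addition; strictly increasing in p
        epons[node] = scale * combined + shift
    return epons


pons = assign_pons(TREE, "root")
epons = assign_epons(TREE, "root", key=(7, 11), rng=random.Random(2008))
print(pons)
print(epons)
```

Because the combined values are strictly increasing along the postorder traversal and the final map is strictly increasing, the resulting numbers satisfy EPON-II and EPON-IV on this tree while no longer lying in the predictable range 1 to |V|.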
TABLE I. ENCODING OF XML TREE IN FIG. 1(a)

Fig. 3. Three subtrees of the content tree are shared with three consumers: consumer 1, 2, and 3.

A. Structural Identifier

Let z be a node; its structural identifier, referred to as S_z, is defined as a pair (e_z, e_z^lowest), where e_z is the EPON associated with z, and e_z^lowest is the lowest EPON for any node in the set of descendant nodes of z.

The structural identifier is unique for each node in a document. S_z not only uniquely identifies the document element referred to as z, but also identifies the subtree of z. The second factor of S_z facilitates the unique identification of those elements that belong to the subtree.

B. Integrity Identifier

The content of a node includes attributes of an XML element, but does not include any of its descendants. In the definition of an integrity identifier, the content of a node is bound to it using its structural identifier. A hash of the concatenation of the content and the structural identifier is generated. The resulting hash value, referred to as local hash (LH), is the integrity identifier of the node x, denoted by I_x. Thus, I_x = H(S_x, content_x), where S_x is the structural identifier of x, content_x is the content of x, and H is a one-way collision-resistant hash function.

C. Document Encoding

The well-defined structural entity in a hierarchically organized content such as a document is intuitively a subtree. Each node x in a document has an encoding information C_x defined as a tuple: C_x = (S_x, I_x), where S_x is the structural identifier and I_x is the integrity identifier of x.

1) Properties of Encoding Tuples: Each node x with parent z in a content tree is encoded with the tuple ⟨C_x, S_z⟩. If x is the root, then its encoding is C_x. Such encoding tuple facilitates the verification of the structural integrity of the content. It also facilitates data manipulation operations based on the structural identifier: content identification, extraction, and composition.

The encoding tuple of a node also contains the structural identifier of its parent. The parent identifiers are used to verify that data have not been compromised and to detect swapping of data elements. Table I shows the encoding of each node in the XML tree in Fig. 1(a). For simplicity, the integrity identifier is not enumerated as it involves a hash value.

Lemma 4.1: Let x and z be nodes in an XML document such that z is an ancestor of x; let e_x and e_z be their respective EPONs. Let e_x^lowest and e_z^lowest be the lowest EPONs in the subtrees rooted at x and z, respectively. Then e_x^lowest ≥ e_z^lowest.

Proof: z is an ancestor of x in the XML document. Therefore, D_x ⊂ D_z. The lemma follows from the properties of EPONs.

Lemma 4.1 defines the basis for use of the structural identifier of a node in the encoding tuple of each of its child nodes.

D. Document Encryption

Document encoding is followed by document encryption. Since our dissemination technique delivers only the content that is accessible to a user, we do not need a hierarchical encryption scheme as proposed in [8]. Each encoded node is encrypted using a key that is shared between the producer and the consumer. If the content routers are trusted, the key may be shared between the routers and the consumers. After encryption, each document node x is represented as ⟨S_x, S_z, E_s^x⟩, where S_z is the structural identifier of z, the parent of x, and E_s^x is the value resulting from the encryption of x.

V. STRUCTURE-BASED ROUTING

We propose a multicast-based approach to disseminate XML-based data among the consumers. Fig. 3 shows multiple consumer requests to access an XML tree. Consumer 1 has access to subtree T_1, consumer 2 has access to T_2, and consumer 3 has access to T_3. For dissemination of the subtrees among various consumers, a multicast topology based on the structure of the tree is proposed. The multicast topology is built dynamically and asynchronously using a publish–subscribe methodology. The publish–subscribe based multicast network uses the structure-based routing.

The structure-based routing involves the following entities: the document source is the document producer or a trusted owner of the document, has full access to the original document, and is the root of the multicast overlay network; the publisher publishes the data to a set of subscribers; the subscriber subscribes to the data and sends its request to a router-based publisher; the router routes the specific portion of the data to consumers and other routers. A router is both a publisher and a subscriber. The document source is a publisher. A consumer is a subscriber. A consumer is said to be associated with a router for a specific document if it has subscribed to that document through that router. For simplicity of discussion, we assume one document source; however, the proposed solution can handle multiple document sources. A parent router of another router is one from which the latter receives some content. A child router is defined conversely.
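The encoding of Section IV can be sketched in Python as below. The sketch computes S_x = (e_x, e_x^lowest), I_x = H(S_x, content_x), and the encoding tuple ⟨C_x, S_z⟩ for every node. Several details are assumptions made for illustration: the tree, the EPON values, and the node contents are invented; SHA-256 from `hashlib` stands in for the hash function H; the byte serialization of S_x before hashing is arbitrary; and e_x^lowest is taken over the subdocument D_x including x itself (as in EPON-III), so that a leaf has e_x^lowest = e_x.

```python
import hashlib
from typing import Dict, List, NamedTuple, Optional, Tuple


class StructId(NamedTuple):
    """Structural identifier S_x = (e_x, e_x^lowest)."""
    epon: int
    lowest: int


# Hypothetical tree, EPONs and contents (not the values of Fig. 1 or Table I).
TREE: Dict[str, List[str]] = {"root": ["a", "d"], "a": ["b", "c"], "d": ["e"]}
EPONS = {"b": 10, "c": 24, "a": 37, "e": 51, "d": 68, "root": 90}
CONTENTS = {"b": "text of b", "c": "text of c", "e": "text of e"}


def structural_ids(tree, root: str, epons: Dict[str, int]) -> Dict[str, StructId]:
    """Compute S_x for every node with one depth-first pass."""
    ids: Dict[str, StructId] = {}

    def visit(node: str) -> int:
        lowest = epons[node]
        for child in tree.get(node, []):
            lowest = min(lowest, visit(child))
        ids[node] = StructId(epons[node], lowest)
        return lowest

    visit(root)
    return ids


def integrity_id(s: StructId, content: str) -> str:
    """Local hash I_x = H(S_x, content_x). ASSUMPTION: SHA-256 over an ad hoc
    'epon|lowest|content' serialization; the paper does not fix the encoding."""
    data = f"{s.epon}|{s.lowest}|".encode("utf-8") + content.encode("utf-8")
    return hashlib.sha256(data).hexdigest()


Encoding = Tuple[Tuple[StructId, str], Optional[StructId]]


def encode(tree, root: str, epons, contents) -> Dict[str, Encoding]:
    """Return, per node x, the pair (C_x, S_z) with C_x = (S_x, I_x); S_z is None for the root."""
    ids = structural_ids(tree, root, epons)
    parent = {child: p for p, kids in tree.items() for child in kids}
    encoded: Dict[str, Encoding] = {}
    for node, s in ids.items():
        c_x = (s, integrity_id(s, contents.get(node, "")))
        z = parent.get(node)
        encoded[node] = (c_x, ids[z] if z is not None else None)
    return encoded


ids = structural_ids(TREE, "root", EPONS)
encoded = encode(TREE, "root", EPONS, CONTENTS)
print(ids["a"], encoded["b"][1])
```

A consumer that recomputes `integrity_id` over a received node and obtains a different value from the delivered I_x can conclude that either the content or the structural position of that node was altered.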
We assume that documents are identified by a valid uniform resource identifier (URI) [9] or any other naming scheme suitable for the enterprise. The owner of the document can itself carry out publishing or can delegate the publishing functionality to one or more other entities. The publishing routers propagate this information to their neighbor routers.

Let D_z and D_q be subdocuments such that q is a descendant of z. Thus, D_q ⊂ D_z. Let R refer to any router that is reachable from another router R_z. D_z is the maximal structural block at a router R_z if and only if R_z or any router R reachable from R_z has only those consumers that have access to only D_z or D_q, for any q that is a descendant of z in D. Each router is aware of the maximal structural block that it is responsible for routing collectively to all the subscribers (consumers and routers).

a) Example: Let D be represented by the tree T [T is shown in Fig. 1(a)]. Let R_1 be a router. Consumers u_1 and u_2 have subscribed to R_1 for document D. u_1 and u_2 have access to the subdocument represented by T_z and T_x, respectively. Let R_2 be a router reachable from R_1. It has a subscriber u_3. u_3 has access to the subdocument T_x. Therefore, the maximal structural block of R_1 is T_z and that of R_2 is T_x. If a new consumer subscribes to R_1 with an access to T itself, then the maximal structural block of R_1 becomes T.

The multicast topology with root at router R_z disseminates content to a set of consumers collectively such that none of them has access or has subscribed to a subtree D_m of D where D_z ⊂ D_m. D_z can be identified through the EPONs. Let e_z and e_m be the EPONs of D_z and D_m, respectively; then, e_z < e_m, by definition of EPON. Router R_z routes (or publishes) only D_z and its subtrees. R_z identifies its maximal structural block through its EPON e_z, which we call the publishing EPON (PEPON).

A PEPON is the EPON of a maximal structural block being published from the corresponding router. For router R_z, e_z is a PEPON. Routing is carried out on a multicast topology, which is either a tree or a directed acyclic graph (DAG). In Fig. 5, the PEPONs for the two routers (between the consumer and the producer) are 43 and 96.

Access permissions on the content for a consumer are expressed on a node in the document. Access permissions for a consumer u on a document d, denoted by L_u, are represented as an allowedset defined as {S_x | consumer has access to node x and S_x is the structural signature of x}.

Fig. 4. Routing of three subtrees to consumers using EPONs.

A. Content Routers

By content routers, we refer to content distributors or brokers. Such a router is an application-level router that routes documents. In what follows, the notation {x+} or {⟨x⟩+} denotes a nonempty set of elements of type x. Every router R is aware of the following information:
1) {⟨c-id, c-credentials, document URI, permissions, callback-address⟩+}, where c-id is the id of the consumer subscribed to the document accessible at document URI, c-credentials includes parameters needed for authentication of c-id, and permissions is the set of the nodes that a consumer has access to in the document (allowedset). The callback-address method provides a mechanism to deliver content to the consumer in case of asynchronous subscription.
2) {⟨parent router, {S_x+}⟩+}, where x is the root of the maximal structural block the parent router receives and S_x is its structural identifier; e_x in S_x is the PEPON of the router.
3) {⟨child router, {S_y+}⟩+}, where a child router is a router that has subscribed to a subdocument with PEPON S_y from R. The list of child routers with their specific PEPONs is stored at R.

b) Example: Consider Fig. 4. The router that routes the tree with PEPON 43 has two consumers, consumer 1 and consumer 2. It has neither a parent router nor a child router. The information stored at this router is shown in Table II.

B. Dissemination Network

A link in the document dissemination network is between two content routers and might involve intermediate network routers. In this section, we discuss the development of the dissemination network that uses the structural identifier.

1) Subscription: The subscription process is initiated by a consumer. Upon success, the process returns to the consumer a set of structural signatures for the nodes in the document that the consumer has access to. The set is the allowedset for the consumer. A consumer determines which router to join for a specific document. A router R, upon receiving a request for subscription to a document, performs the consumer subscription if the consumer is authorized.

2) Link Setup: If a router R does not already have a known path to the document publisher to satisfy the request, it sends subscription requests to some or all other routers it is aware of (neighbor routers). Among many possible protocols, we propose a three-way handshake protocol to establish a subscription link between two routers. Suppose that the router R receives positive responses from R_1 and R_2. Based on the document properties and the path length from the document source to each of them, R then determines which one to choose and notifies the router(s) accordingly. Several criteria can be used for such selection.
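One way to read the definition of the maximal structural block is that a router serves the smallest subtree that contains every subtree subscribed to at that router or at routers reachable from it. Under that reading, which is an interpretation and not a procedure stated in the paper, the block can be found purely from structural identifiers, as in the Python sketch below. The `StructId` values, the helper names `covers` and `maximal_block`, and the assumption that the router knows the structural identifiers of the candidate nodes are all hypothetical.

```python
from typing import Iterable, List, NamedTuple


class StructId(NamedTuple):
    """Structural identifier S_x = (e_x, e_x^lowest)."""
    epon: int
    lowest: int


def covers(z: StructId, x: StructId) -> bool:
    """True if the subtree identified by z contains the subtree identified by x
    (including z == x), using only the EPON interval [lowest, epon]."""
    return z.lowest <= x.lowest and x.epon <= z.epon


def maximal_block(candidates: Iterable[StructId], subscribed: List[StructId]) -> StructId:
    """Pick the candidate with the smallest EPON whose subtree covers every
    subscribed subtree; that EPON would be the router's PEPON."""
    covering = [z for z in candidates if all(covers(z, x) for x in subscribed)]
    if not covering:
        raise ValueError("no candidate subtree covers all subscriptions")
    return min(covering, key=lambda z: z.epon)


# Hypothetical structural identifiers for a six-node document (not Fig. 1).
IDS = {
    "root": StructId(90, 10), "a": StructId(37, 10), "b": StructId(10, 10),
    "c": StructId(24, 24), "d": StructId(68, 51), "e": StructId(51, 51),
}

# Mirrors the example above with z = a and x = b: u_1 -> T_a, u_2 -> T_b at R_1,
# and u_3 -> T_b at R_2, which is reachable from R_1.
r1_block = maximal_block(IDS.values(), [IDS["a"], IDS["b"], IDS["b"]])
r2_block = maximal_block(IDS.values(), [IDS["b"]])
# A new consumer with access to the whole tree T subscribes to R_1.
r1_block_after = maximal_block(IDS.values(), [IDS["a"], IDS["b"], IDS["root"]])
print(r1_block, r2_block, r1_block_after)
```

The covering candidates always lie on a single root-to-node path, so choosing the smallest EPON among them selects the deepest common ancestor of the subscribed roots.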
  6. 6. TABLE II. INFORMATION AT THE ROUTER FOR PEPON 43 IN FIG. 5

Outline of Link Setup Protocol
1) The consumer sends the subscription request for a document, including its consumer id, credentials, and callback method, to a router R.
2) R authenticates the consumer and determines the list of signatures of the content nodes that the consumer has access to.
3) The router determines the set of subdocuments (subtrees) from the set of signatures as follows: it sorts the signatures based on the EPON in the signature; if $S_x$ is the signature for $x$, then the sorting parameter is $e_x$. Let the sorted set be $Q$. Let the set of subtrees be denoted by $\Gamma$, initialized as empty. Let the signature with the highest EPON in $Q$ be $S_z$. Remove from $Q$ each signature $S_x$, including $S_z$, such that $e_x \le e_z$ and $e_x \ge e^z_{lowest}$; assign this set of signatures to $\gamma_x$; add $\gamma_x$ to $\Gamma$ and repeat this process until $Q$ becomes empty.
4) If the list of accessible subtrees $\Gamma$ includes a subtree with root having EPON $e_z$ ($z$ being the document element) that is subsumed by the content tree served by this router, then the request is processed successfully. This is determined by checking whether $e^h_{lowest} \le e^z_{lowest} \le e_z \le e_h$ (Lemma 4.1), with $h$ denoting the root of that content tree.
5) Otherwise, the router R sends a subscription request for the subtree rooted at $e_z$ to some or all of its neighboring routers. If the access permissions include multiple subtrees, the subscription request includes each of these subtrees.
6) Upon receiving a request from R, a router $R_i$ checks if there exists a PEPON $e_x$. If so, it returns $e_x$ with success as a response to R; otherwise it recursively repeats the link setup procedure from $R_i$ for $e_z$ to all its neighbors.
7) Upon receiving the responses, the router R selects a parent router; the router registers the consumer and sends the response back to the client.
8) Each router determines if the new node(s) and the existing node(s) can be combined to form a complete subtree of the document. If so, it replaces all the nodes stored in the database by the lowest common ancestor of these subtrees.

C. Content Publishing
The document publication process varies based on the recipient: router or consumer. Content is published as follows. Router R receives a set of document nodes N from its ancestor (if the topology is a tree) or ancestors (if the topology is a DAG). If R has a nonempty set of consumers for the document, it forwards the document to the consumers based on their permissions. If there is a nonempty set of routers that are subscribers for some nodes in this document, then R forwards the document to these routers based on their requirements.

1) Content Delivery To Consumers: For each subscribed consumer, the router determines its access permissions for the associated document. The router identifies the received content subtrees to which the allowed nodes (included in allowedset) belong. This is carried out by matching the EPON $e_x$ of each $S_x \in$ allowedset with the EPON of each of the roots of the received subtrees. The subtrees specific to the consumer in $\Gamma$ are then extracted from the identified content. The router then forwards each subtree to the consumer after encrypting it using the encryption technique in place, if any. In our running example, Fig. 4 shows how EPONs are used for routing of subtrees.

2) Content Delivery To Routers: The process of forwarding the document to a router is as follows. For each router in its subscriber set, a router determines the node(s) that subscriber is registered for. It identifies and extracts these document nodes from the respective subtrees. They are then encrypted and sent to the subscribing router. The next section discusses the technique used for identification and extraction.

3) Content Identification and Extraction: Content identification and extraction is carried out at each router that has at least one subscriber. Each router has a list of the content subtrees it receives for a given document. The list is essentially a list of signatures of the roots of these subtrees (maximal structural blocks) that contain the EPONs (PEPONs) of these roots. The router also keeps track of the list of signatures of the roots of the subtrees that each of its subscribers (consumers or routers) has access to (Section V-A). The identification step determines the belongs-to relation between each of the content roots accessible to each consumer and the content subtrees the router receives. An important property of EPONs is restated here from Section III-B. Simple EPON property: any node $y$ that belongs to the subtree rooted at node $x$ is such that $e_y < e_x$, where $e_y$ and $e_x$ are the EPONs of $y$ and $x$, respectively. The EPON of $y$, $e_y$, is also greater than or equal to $e^x_{lowest}$, as defined in the structural identifier $S_x$ of $x$. The identification technique uses the simple EPON property while verifying the belongs-to relation between the received content and the subscribed content. The worst case complexity of the identification step is $O(mn)$, where $m$ is the number of received content subtrees and $n$ is the number of subscribers at a given router.
During the extraction step, a depth-first traversal [5] is carried out to locate the subscribed content root. The EPON of the subscribed content root is compared with the EPON of the visited node. If these EPONs match, the corresponding subtree is extracted. The worst case complexity of the extraction procedure is the same as that of depth-first search, $O(v + e)$, where $v$ is the number of nodes in the received subtree and $e$ is the number of edges in the received subtree.

D. Document Verification
The security requirements for secure dissemination of XML content are two-fold (Section I): maintaining confidentiality by not sending extraneous data to a consumer (preventing information leaks) and facilitating precise verification of integrity. In this section, we focus on the integrity verification for the received content at the consumer side.
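The identification and extraction steps of Section V-C can be illustrated with a minimal Python sketch. It is not the authors' implementation: plain integers stand in for EPONs (only their order matters here), and the names `Node`, `belongs_to`, `identify`, and `extract` are hypothetical. The belongs-to test also accepts equality, so that a subscribed root that is itself a received root matches.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed encoding of a structural identifier S_x: (e_x, e_x^lowest).
Signature = Tuple[int, int]


@dataclass
class Node:
    epon: int                      # e_x (integer stand-in for an EPON)
    lowest: int                    # e_x^lowest, smallest EPON in the subtree
    children: List["Node"] = field(default_factory=list)


def belongs_to(sub: Signature, root: Signature) -> bool:
    """Simple EPON property: e_root^lowest <= e_sub <= e_root."""
    e_y, _ = sub
    e_x, lowest_x = root
    return lowest_x <= e_y <= e_x


def identify(received_roots: List[Node],
             subscriptions: List[Signature]) -> Dict[Signature, Node]:
    """O(m*n) identification: pair each subscribed root with a received subtree."""
    matches: Dict[Signature, Node] = {}
    for sub in subscriptions:              # n subscribed roots
        for root in received_roots:        # m received subtrees
            if belongs_to(sub, (root.epon, root.lowest)):
                matches[sub] = root
                break
    return matches


def extract(root: Node, target_epon: int) -> Optional[Node]:
    """Depth-first traversal; returns the subtree whose root EPON matches."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.epon == target_epon:
            return node
        stack.extend(node.children)
    return None


# Usage: a five-node tree numbered in postorder (10, 20, 30, 40, 50).
leaf_a, leaf_b = Node(10, 10), Node(20, 20)
inner = Node(30, 10, [leaf_a, leaf_b])
root = Node(50, 10, [inner, Node(40, 40)])
matches = identify([root], [(30, 10)])
subtree = extract(matches[(30, 10)], 30)
```

The traversal here visits every node in the worst case, which matches the $O(v + e)$ bound stated in the text; pruning children whose EPON range cannot contain the target would be a possible refinement, but it is not described in the paper.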
  7. 7. In order to precisely detect any integrity violations, the following verification steps must be executed at the consumer side:
1) (N-I) if nodes have been dropped;
2) (N-II) if the order of the nodes has been changed;
3) (N-III) if the content of a node has been compromised;
4) (N-IV) if some nodes have been added in an unauthorized manner;
5) (N-V) if the content of one node has been replaced with the content of another node.

Let $u$ be a consumer receiving a set of nodes $R^u$ from the router. Let $R^u = \{S_x \mid S_x$ is the structural identifier of a node $x$ received by $u\}$. Consumer $u$ receives a list, denoted by $L^u$, of signatures of each permitted node during its subscription phase. The consumer validates all the nodes received against the nodes it expects to receive by matching their EPONs. Let $r$ be a received node, $r \in R^u$. Let the structural signature of $r$ be $(e_r, e^r_{lowest})$. Let $s$ be a signature in $L^u$ such that $s = (e_x, e^x_{lowest})$. The matching of signatures is carried out as follows: $\forall s \in L^u\ \exists r \in R^u : (e_r, e^r_{lowest}) = (e_x, e^x_{lowest})$; i.e., if for each node in $L^u$ there is a node in $R^u$ with an identical structural signature, then all the permitted nodes have been received. If there is some node $s$ that $u$ has access to but that does not match any $r$ in $R^u$, then $s$ has been dropped (N-I verified partially).

Then the consumer carries out a postorder traversal on every subtree representation with root at $r \in R^u$, after the nodes are decrypted. Let $x$ be the currently visited node. The local hash of $x$, denoted by $H_x$, is computed as $H(S_x, content_x)$. After the decryption, $x$ has the following encoding: $(S_x, S_z, C_x, S_z)$. $H_x$ is compared with $I_x$ in $C_x$. If there is a mismatch, then the content integrity has been violated (N-III verified). If the contents of two nodes have been swapped, this will also be detected because the integrity identifier of the content of a node is bound to its structural identifier (Section IV-B) (N-V verified).

Otherwise the process continues as follows. The outer $S_x$ must match the $S_x$ in $C_x$. If not, then this node is discarded as compromised and an integrity violation is noted. If the outer $S_z$ is not the same as the inner $S_z$, then a violation is detected, but the node is not yet discarded. The inner $S_z$ is compared with the inner $S_w$ of the received parent node $w$ of $x$. If they do not match, then $x$ is discarded (N-I is verified completely). $e^x_{lowest}$ is compared with $e^w_{lowest}$. If $e^x_{lowest} < e^w_{lowest}$, then the integrity of the structure of the subtree has been violated. If $e_x > e_w$, then this is a case of reordering. If $e_x$ is found not to be within the bounds $[e^w_{lowest}, e_w]$, a new node has been added (N-IV verified).

The verification algorithm also checks if $e_x$ is less than the EPON of any node visited earlier during this traversal. This is done by comparing the two factors of the structural identifier $S_x$. The absence of any such occurrence ensures that there is no change in the original order among nodes (N-II verified).

The verification process is efficient and simple. It uses the basic techniques of post- and preorder traversal and hash computation. Therefore, neither is the computation expensive nor is the implementation of such a technique complex. The complexity of verification is linear in the size of the content received, because the postorder traversal combined with the preorder processing on each subtree verifies the integrity of the content.

E. Update Management
This section discusses updates to documents, both content and structure, in the context of structure-based routing.
In case of changes that are structurally invariant, only the data inside a document node changes. Thus, only the local hash of the node changes. Only the updates of the changed nodes along with their signatures are forwarded to the routers.
Structural changes have to be reflected in the mapping from user credentials to accessible nodes and their signatures. Therefore, the services that implement the mapping function from user credentials to structural identifiers need to be notified accordingly with the new EPONs. If the mapping is held in a distributed hash table, then the document source updates the hash table. The routers are also notified of the modifications. Removal of a subtree is notified to the routers and consumers having the document. In case of addition of a new subtree, the original structure of the document is not affected. Therefore, the update is propagated to all the routers that have consumers with access permission to the new subtree. In case of interchanges, the changes need to be propagated to the routers and consumers that are registered for any updated node or an ancestor of that updated node.

VI. DISCUSSION
In this section, we discuss the requirements mentioned in Section I for document dissemination and show that our proposed dissemination model addresses all the security requirements of a dissemination model.

A. Requirements Satisfaction
Integrity: We introduced the notion of EPONs in order to support all the integrity requirements. In Section V-D, we developed techniques based on this notion for content verification and validation.
Access Control and Confidentiality: The structure-based routing scheme ensures that a consumer is delivered only the portion of data that it has access to. The notion of maximal structural blocks at routers ensures that a router has access only to as much data as its consumers collectively have access to. Our postorder-numbering-based integrity check technique is comparable to the Merkle hash algorithm [10]. The Merkle hash technique, however, requires the hash values of the subtrees that are not accessible to the consumer to be forwarded as well, so that the consumer can verify the document integrity by computing and matching the final hash value of the complete original tree. Our technique exploits the properties of postorder numbering for the same goal and thus avoids sending the hash values of the subtrees that are not accessible to the consumer, thereby preventing the leakage of data. This is an indirect information leak that is prevented by our framework.

1) Efficiency of Structure-Based Routing: The simple notion of PONs in the context of XML data provides powerful and sound principles for content identification and efficient extraction with linear time complexity. The framework uses an efficient content routing mechanism based on the content structure. The cost of routing is, in the worst case, linear in the document size.
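The consumer-side checks of Section V-D can be sketched in Python as follows. This is an illustrative simplification under stated assumptions, not the paper's algorithm in full: integers stand in for EPONs, SHA-256 and the byte layout in `local_hash` are assumptions (the paper only specifies $H(S_x, content_x)$), the nested outer/inner signature comparison is omitted, and all names (`RNode`, `make_node`, `missing_roots`, `verify_subtree`) are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class RNode:
    epon: int                      # e_x (integer stand-in for an EPON)
    lowest: int                    # e_x^lowest
    content: bytes
    digest: str                    # I_x, integrity identifier sent with the node
    children: List["RNode"] = field(default_factory=list)


def local_hash(epon: int, lowest: int, content: bytes) -> str:
    """H(S_x, content_x); SHA-256 and this serialization are assumptions."""
    return hashlib.sha256(f"{epon}:{lowest}:".encode() + content).hexdigest()


def make_node(epon: int, lowest: int, content: bytes,
              children: Optional[List[RNode]] = None) -> RNode:
    """Build a node the way a trusted source would, with a correct digest."""
    return RNode(epon, lowest, content,
                 local_hash(epon, lowest, content), children or [])


def missing_roots(expected: Set[Tuple[int, int]],
                  received: List[RNode]) -> Set[Tuple[int, int]]:
    """N-I (partial): permitted signatures with no matching received root."""
    return expected - {(r.epon, r.lowest) for r in received}


def verify_subtree(root: RNode) -> List[str]:
    """Postorder traversal collecting N-II, N-III/N-V and N-IV violations."""
    violations: List[str] = []
    last_epon: Optional[int] = None

    def visit(x: RNode, parent: Optional[RNode]) -> None:
        nonlocal last_epon
        for child in x.children:
            visit(child, x)
        # Content is bound to the structural identifier through the hash.
        if local_hash(x.epon, x.lowest, x.content) != x.digest:
            violations.append(f"N-III/N-V: content of node {x.epon}")
        # A child must lie inside the EPON range of its parent.
        if parent is not None and not (
                parent.lowest <= x.lowest and x.epon < parent.epon):
            violations.append(f"N-IV: node {x.epon} outside parent {parent.epon}")
        # In postorder, EPONs must be strictly increasing.
        if last_epon is not None and x.epon <= last_epon:
            violations.append(f"N-II: node {x.epon} out of order")
        last_epon = x.epon

    visit(root, None)
    return violations
```

For example, building `root = make_node(30, 10, b"r", [make_node(10, 10, b"a"), make_node(20, 20, b"b")])` and calling `verify_subtree(root)` yields an empty list, while reversing `root.children` or altering a node's `content` afterwards produces an N-II or N-III/N-V entry respectively.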
  8. 8. Fig. 5. Summary of the Merkle hash technique, the selective XML dissemination technique, and the technique proposed in this paper.

The multicast topology based on the structure-based routing for the dissemination model is acyclic. The multicast reduces the network usage, while the cycleless feature ensures that the number of router hops is finite and proportional to the height of the document tree. By the pigeonhole principle applied to the number of content nodes and the number of consumers in a large dissemination network, there would be overlaps in content accessibility and subscription between consumers.
Given that each path from the document source to a consumer contains a monotonically decreasing sequence of EPONs of the document as PEPONs of the underlying routers, and that a parent router never has a smaller EPON than a given router's PEPON, a cycle cannot occur. This makes the multicast topology more efficient in terms of bandwidth usage and dissemination speed. The path from the source of a document to a consumer contains a list of routers. Let the sequence of routers be $R_1 \to R_2 \to \cdots \to R_i \to R_{i+1} \to \cdots \to R_n$. Let the publishing EPON for the consumer at each $R_i$ be $\beta_i$. The following observations are crucial for the topological efficiency.
1) For each $i < j$, $1 \le i, j \le n$, $\beta_i \ge \beta_j$; that is, the PEPONs have a monotonically decreasing order among them in such a path. This is because a router creates a link to another router during the subscription process if and only if that router has access to the required subtree or a larger subtree from the specific document.
2) Due to the monotonicity property, the sizes of the subtrees being transmitted along the path $R_1 \to R_2 \to \cdots \to R_n$ also decrease monotonically. In the worst case, all subscribers along the path have access to the complete document; however, in reality, most subscribers have access to a subset of the document. Therefore, the cost of transmission of the content from the source to a consumer is less than the cost incurred in a common star/broadcast topology or, in the worst case, equal to it.
Therefore, such a model is efficient in terms of network resource usage and speed of dissemination, and is thus more scalable.

VII. RELATED WORK
The main research efforts related to our paper are in the area of secure dissemination of XML data and in the area of secure publish–subscribe systems. Fig. 5 presents a summary of the Merkle hash technique of integrity verification of trees [10], the selective dissemination of XML [8], and the solution proposed in this paper.
In the first area, the only approach supporting access control in both pull- and push-based distribution of data has been proposed by Bertino and Ferrari [8]. Such an approach relies on encrypting different portions of the data with different keys and then distributing the keys to data consumers according to the access control policies. Bertino et al. [11] have also investigated the problem of integrity of XML data by using the notion of Merkle hash. Those approaches have, however, some major drawbacks in that they are not scalable and do not remove extraneous data from contents. These drawbacks are fully addressed by the approach proposed in this paper.
In the second area, several approaches have been proposed to address efficiency issues concerning publish–subscribe systems [12]–[19]. Most approaches (e.g., [13], [16], and [17]) use a spanning tree structure for event routing. In order to reduce the matching that has to be performed by brokers from the root to the leaves, several optimization techniques have been proposed. Virtual groups are used to reduce the matching performed by brokers [20]. However, security issues in content-based publish–subscribe systems have largely not been investigated. The only exceptions are the approach by Srivatsa and Liu [26], which, however, focuses only on resiliency, and the approach by Opyrchal and Prakash [12], which, however, is very inefficient and is not flexible with respect to access control policies. In contrast, our approach [29] addresses a larger spectrum of security requirements while being at the same time efficient and scalable.

VIII. CONCLUSION AND FUTURE WORK
The paper shows how the structural properties of the XML DOM can be exploited in order to address issues in data security and dissemination. We have applied the simple notion of PONs to solve some of the important challenges in data security, especially data integrity and secure data dissemination. By using the structural properties of XML content in conjunction with the properties of the PONs, we have proposed: 1) a technique to verify the integrity of the distributed content; 2) a technique that facilitates maintaining data confidentiality; and 3) a novel structure-based routing of XML content.
We introduced the notion of EPONs in order to support the integrity and confidentiality requirements of XML content as well as to facilitate efficient identification, extraction, and distribution of subsets of the content. The structure-based routing scheme uses the notion of EPONs to prevent information leaks in XML data dissemination.
We proposed a dissemination model for XML content that combines multicast and structure-based routing in order to improve efficiency in terms of bandwidth usage and speed of data delivery, thereby favoring scalability. The dissemination model combined with techniques for data integrity verification and confidentiality provides a secure publish–subscribe paradigm for XML documents. The publish–subscribe model restricts the consumer and document source information to the routers with which they register. Such an approach to XML content dissemination satisfies the requirements of integrity, confidentiality, and privacy preservation in a holistic manner.
Structure-based routing provides a modular and flexible model for security enforcement in data distribution. Flexibility
  9. 9. in security enforcement is known to be an important requirement in secure system design and implementation. Depending on the degree of trust in the network, integrity checks may or may not be enforced. Moreover, the framework facilitates dissemination of contents with varying degrees of confidentiality and integrity in a mix of trusted and untrusted networks, which is prevalent in current settings across enterprise networks and the web. It facilitates the control and enforcement of access control policies at a single point, the document source, which is very difficult to achieve in content-based routing and other publish–subscribe systems. On the other hand, content-based routing can be easily emulated using the structure-based routing presented in this paper.
We plan to further investigate PONs and other structural properties of hierarchical data models, and to apply them to address various security, management, and engineering related issues. Explorations concerning the implementation of various access control policies and integrity models on such a data dissemination model would also be interesting.

REFERENCES
[1] M. Altinel and M. J. Franklin, “Efficient filtering of XML documents for selective dissemination of information,” in Proc. VLDB Conf., 2000, pp. 53–64.
[2] A. Crespo, O. Buyukkokten, and H. Garcia-Molina, “Query merging: Improving query subscription processing in a multicast environment,” IEEE Trans. Knowl. Data Eng., vol. 15, no. 1, pp. 174–191, 2003.
[3] Extensible Markup Language (XML) [Online]. Available: http://www.w3.org/XML/
[4] W3C Document Object Model (DOM) [Online]. Available: http://www.w3.org/DOM/
[5] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA: MIT Press, 2001.
[6] W3C Document Object Model Core [Online]. Available: http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-1590626202
[7] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order preserving encryption for numeric data,” in Proc. 2004 ACM SIGMOD Int. Conf. Manag. Data, pp. 563–574.
[8] E. Bertino and E. Ferrari, “Secure and selective dissemination of XML documents,” ACM Trans. Inf. Syst. Secur., vol. 5, no. 3, pp. 290–331, 2002.
[9] “Naming and addressing: URIs, URLs, . . .” http://www.w3.org/Addressing/
[10] R. Merkle, Secrecy, Authentication, and Public Key Systems. Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., CA, 1979.
[11] E. Bertino, B. Carminati, E. Ferrari, B. M. Thuraisingham, and A. Gupta, “Selective and authentic third-party distribution of XML documents,” IEEE Trans. Knowl. Data Eng., vol. 16, no. 10, pp. 1263–1278, Oct. 2004.
[12] L. Opyrchal and A. Prakash, “Secure distribution of events in content-based publish subscribe systems,” presented at the 10th USENIX Security Symp., Washington, DC, 2001.
[13] A. K. Datta, M. Gradinariu, M. Raynal, and G. Simon, “Anonymous publish–subscribe in p2p networks,” presented at the Int. Parallel Distrib. Process. Symp., Nice, France, 2003.
[14] G. Banavar, T. Chandra, B. Mukherjee, and J. Nagarajarao, “An efficient multi-cast protocol for content-based publish subscribe systems,” in Proc. 19th IEEE Int. Conf. Distrib. Comput. Syst., 1999, pp. 262–272.
[15] M. Aguilera, R. Strom, D. Sturman, M. Astley, and T. Chandra, “Matching events in a content-based subscription system,” presented at the 18th ACM Symp. Principles Distrib. Comput., Atlanta, GA, 1999.
[16] A. Riabov, Z. Liu, J. L. Wolf, P. S. Yu, and L. Zhang, “Clustering algorithms for content-based publication-subscription systems,” in Proc. 22nd IEEE Int. Conf. Distrib. Comput. Syst., 2002, pp. 133–142.
[17] F. Cao and J. Singh, “Efficient event routing in content-based publish–subscribe service networks,” in Proc. IEEE INFOCOM, 2004, pp. 929–940.
[18] A. Carzaniga and A. L. Wolf, “Forwarding in a content-based network,” in Proc. ACM SIGCOMM, Karlsruhe, Germany, Aug. 2003, pp. 163–174.
[19] C. Wang, A. Carzaniga, D. Evans, and A. L. Wolf, “Security issues and requirements for internet-scale publish–subscribe systems,” in Proc. Hawaii Int. Conf. Syst. Sci., 2002, p. 303.
[20] P. Costa and G. P. Picco, “Semi-probabilistic content-based publish–subscribe,” in Proc. 25th IEEE Int. Conf. Distrib. Comput. Syst., 2005, pp. 575–585.
[21] G. Cugola, D. Frey, A. L. Murphy, and G. P. Picco, “Minimizing the reconfiguration overhead in content-based publish–subscribe,” in Proc. 19th ACM Symp. Appl. Comput., 2004, pp. 1134–1140.
[22] A. Carzaniga, M. J. Rutherford, and A. L. Wolf, “A routing scheme for content-based networking,” in Proc. IEEE INFOCOM, 2004, pp. 918–928.
[23] G. P. Picco, G. Cugola, and A. L. Murphy, “Efficient content-based event dispatching in presence of topological reconfigurations,” in Proc. 23rd Int. Conf. Distrib. Comput. Syst., 2003, pp. 234–243.
[24] H. Zhou and S. Singh, “Content based multicast (CBM) in ad hoc networks,” in Proc. MobiHoc, 2000, pp. 51–60.
[25] P. Costa, M. Migliavacca, G. P. Picco, and G. Cugola, “Epidemic algorithms for reliable content-based publish–subscribe: An evaluation,” in Proc. 24th IEEE Int. Conf. Distrib. Comput. Syst., 2004, pp. 552–561.
[26] M. Srivatsa and L. Liu, “Securing publish–subscribe overlay services with eventguard,” in Proc. 12th ACM Conf. Comput. Commun. Security, 2005, pp. 289–298.
[27] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf, “Design and evaluation of a wide-area event notification service,” ACM Trans. Comput. Syst., vol. 19, no. 3, pp. 332–383, 2001.
[28] R. Zhang and Y. C. Hu, “Hyper: A hybrid approach to efficient content-based publish–subscribe,” presented at the Int. Conf. Distrib. Comput. Syst., Las Vegas, NV, 2005.
[29] A. Kundu and E. Bertino, “Secure dissemination of XML content using structure-based routing,” in Proc. 10th IEEE Int. Enterprise Distrib. Object Comput. Conf. (EDOC’06), 2006, pp. 153–164.

Ashish Kundu (S’06) is currently working toward the Ph.D. degree in the Department of Computer Science, Purdue University, West Lafayette, IN. His primary research interests lie in the security and privacy issues in data distribution, and language-based security issues in a distributed context. He has previously been a Research Staff Member at IBM India Research Laboratory, Delhi.
Mr. Ashish is a Student Member of the ACM. He is also an Honorary Life Member of Upsilon Pi Epsilon. He is the coauthor of a paper that was awarded best student paper at IEEE EDOC ’06. He has also received the IBM Bravo award for his technical contributions. He has served on the program committee of IEEE EDOC ’07 and ’08.

Elisa Bertino (SM’03–F’02) is currently a Professor of computer sciences at Purdue University, West Lafayette, IN, and serves as Research Director of the Center for Education and Research in Information Assurance and Security (CERIAS). Previously she was a faculty member at the Department of Computer Science and Communication, University of Milan, where she directed the DB&SEC Laboratory. She has been a Visiting Researcher at the IBM Research Laboratory (now Almaden), San Jose, at the Microelectronics and Computer Technology Corporation, at Rutgers University, and at Telcordia Technologies. Her main research interests include security, privacy, digital identity management systems, database systems, distributed systems, and multimedia systems. In those areas, she has published more than 250 papers in all major refereed journals, and in proceedings of international conferences and symposia. She is a coauthor of the books Object-Oriented Database Systems: Concepts and Architectures (Addison-Wesley International, 1993), Indexing Techniques for Advanced Database Systems (Kluwer Academic, 1997), Intelligent Database Systems (Addison-Wesley International, 2001), and Security for Web Services and Service Oriented Architectures (Springer, 2007). She is a Co-Editor-in-Chief of the Very Large Database Systems (VLDB) Journal. She also serves on the editorial boards of several scientific journals, including the ACM Transactions on Information and System Security, ACM Transactions on Web, Acta Informatica, the Parallel and Distributed Database Journal, the Journal of Computer Security, Data & Knowledge Engineering, and Science
  10. 10. of Computer Programming. She has been a Consultant to several companies on data management systems and applications and has given several courses to industries. Her research has been sponsored by several organizations and companies, including the USA National Science Foundation, the US Air Force Office for Sponsored Research, the I3P Consortium, the European Union (under the 5th and 6th IST research programmes), IBM, Microsoft, and the Italian Telecom.
Dr. Bertino is a Fellow of the ACM and has been named a Golden Core Member for her service to the IEEE Computer Society. She received the 2002 IEEE Computer Society Technical Achievement Award “For outstanding contributions to database systems and database security and advanced data management systems” and the 2005 IEEE Computer Society Tsutomu Kanai Award “For pioneering and innovative research contributions to secure distributed systems.” She has served as a Program Committee member of several international conferences, such as ACM SIGMOD, VLDB, and ACM OOPSLA, as Program Co-Chair of the 1998 IEEE International Conference on Data Engineering (ICDE), and as program chair of the 2000 European Conference on Object-Oriented Programming (ECOOP 2000), of the 7th ACM Symposium of Access Control Models and Technologies (SACMAT 2002), of the EDBT 2004 Conference, and of the IEEE Policy 2007 Workshop. She is an Associate Editor of IEEE INTERNET COMPUTING and IEEE SECURITY AND PRIVACY.