Iw qo s09 (1)
Ensuring Data Storage Security in Cloud Computing

Cong Wang, Qian Wang, and Kui Ren
Department of ECE, Illinois Institute of Technology
Email: {cwang, qwang, kren}@ece.iit.edu

Wenjing Lou
Department of ECE, Worcester Polytechnic Institute
Email: wjlou@ece.wpi.edu

Abstract: Cloud Computing has been envisioned as the next-generation architecture of IT Enterprise. In contrast to traditional solutions, where the IT services are under proper physical, logical and personnel controls, Cloud Computing moves the application software and databases to large data centers, where the management of the data and services may not be fully trustworthy. This unique attribute, however, poses many new security challenges which have not been well understood. In this article, we focus on cloud data storage security, which has always been an important aspect of quality of service. To ensure the correctness of users' data in the cloud, we propose an effective and flexible distributed scheme with two salient features that distinguish it from its predecessors. By utilizing the homomorphic token with distributed verification of erasure-coded data, our scheme integrates storage correctness assurance and data error localization, i.e., the identification of misbehaving server(s). Unlike most prior works, the new scheme further supports secure and efficient dynamic operations on data blocks, including data update, delete and append. Extensive security and performance analysis shows that the proposed scheme is highly efficient and resilient against Byzantine failure, malicious data modification attack, and even server colluding attacks.

I. INTRODUCTION

Several trends are opening up the era of Cloud Computing, which is an Internet-based development and use of computer technology. Ever cheaper and more powerful processors, together with the software as a service (SaaS) computing architecture, are transforming data centers into pools of computing service on a huge scale. Increasing network bandwidth and reliable yet flexible network connections make it possible for users to subscribe to high quality services from data and software that reside solely on remote data centers.

Moving data into the cloud offers great convenience to users since they don't have to care about the complexities of direct hardware management. The offerings of pioneering Cloud Computing vendors, Amazon Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2) [1], are both well known examples. While these Internet-based online services do provide huge amounts of storage space and customizable computing resources, this computing platform shift is at the same time eliminating the responsibility of local machines for data maintenance. As a result, users are at the mercy of their cloud service providers for the availability and integrity of their data. The recent downtime of Amazon's S3 is such an example [2].

From the perspective of data security, which has always been an important aspect of quality of service, Cloud Computing inevitably poses new challenging security threats for a number of reasons. Firstly, traditional cryptographic primitives for the purpose of data security protection cannot be directly adopted because users lose control of their data under Cloud Computing. Therefore, verification of correct data storage in the cloud must be conducted without explicit knowledge of the whole data. Considering the various kinds of data each user stores in the cloud and the demand for long term continuous assurance of their data safety, the problem of verifying correctness of data storage in the cloud becomes even more challenging. Secondly, Cloud Computing is not just a third party data warehouse. The data stored in the cloud may be frequently updated by the users, including insertion, deletion, modification, appending, reordering, etc. Ensuring storage correctness under dynamic data update is hence of paramount importance. However, this dynamic feature also makes traditional integrity assurance techniques futile and entails new solutions. Last but not least, the deployment of Cloud Computing is powered by data centers running in a simultaneous, cooperated and distributed manner. An individual user's data is redundantly stored in multiple physical locations to further reduce the data integrity threats. Therefore, distributed protocols for storage correctness assurance will be of the utmost importance in achieving a robust and secure cloud data storage system in the real world. However, this important area remains to be fully explored in the literature.

Recently, the importance of ensuring remote data integrity has been highlighted by the research works [3]–[7]. These techniques, while useful for ensuring storage correctness without requiring users to possess the data, cannot address all the security threats in cloud data storage, since they all focus on the single server scenario and most of them do not consider dynamic data operations. As a complementary approach, researchers have also proposed distributed protocols [8]–[10] for ensuring storage correctness across multiple servers or peers. Again, none of these distributed schemes is aware of dynamic data operations. As a result, their applicability in cloud data storage can be drastically limited.

In this paper, we propose an effective and flexible distributed scheme with explicit dynamic data support to ensure the correctness of users' data in the cloud. We rely on erasure-correcting code in the file distribution preparation to provide redundancies and guarantee the data dependability. This construction drastically reduces the communication and storage overhead as compared to the traditional replication-based file
distribution techniques. By utilizing the homomorphic token with distributed verification of erasure-coded data, our scheme achieves storage correctness assurance as well as data error localization: whenever data corruption has been detected during the storage correctness verification, our scheme can almost guarantee the simultaneous localization of data errors, i.e., the identification of the misbehaving server(s).

Our work is among the first few in this field to consider distributed data storage in Cloud Computing. Our contribution can be summarized in the following three aspects:

1) Compared to many of its predecessors, which only provide binary results about the storage state across the distributed servers, the challenge-response protocol in our work further provides the localization of data error.
2) Unlike most prior works for ensuring remote data integrity, the new scheme supports secure and efficient dynamic operations on data blocks, including update, delete and append.
3) Extensive security and performance analysis shows that the proposed scheme is highly efficient and resilient against Byzantine failure, malicious data modification attack, and even server colluding attacks.

The rest of the paper is organized as follows. Section II introduces the system model, adversary model, our design goals and notation. Then we provide the detailed description of our scheme in Sections III and IV. Section V gives the security analysis and performance evaluations, followed by Section VI which overviews the related work. Finally, Section VII gives the concluding remarks of the whole paper.

II. PROBLEM STATEMENT

A. System Model

A representative network architecture for cloud data storage is illustrated in Figure 1. Three different network entities can be identified as follows:

  • User: users, who have data to be stored in the cloud and rely on the cloud for data computation, consist of both individual consumers and organizations.
  • Cloud Service Provider (CSP): a CSP, who has significant resources and expertise in building and managing distributed cloud storage servers, owns and operates live Cloud Computing systems.
  • Third Party Auditor (TPA): an optional TPA, who has expertise and capabilities that users may not have, is trusted to assess and expose risk of cloud storage services on behalf of the users upon request.

[Fig. 1: Cloud data storage architecture. Labels: Users, Optional Third Party Auditor, Cloud Storage Servers, Cloud Service Provider, Data Flow, Security Message Flow.]

In cloud data storage, a user stores his data through a CSP into a set of cloud servers, which are running in a simultaneous, cooperated and distributed manner. Data redundancy can be employed with the technique of erasure-correcting code to further tolerate faults or server crashes as the user's data grows in size and importance. Thereafter, for application purposes, the user interacts with the cloud servers via the CSP to access or retrieve his data. In some cases, the user may need to perform block level operations on his data. The most general forms of these operations we are considering are block update, delete, insert and append.

As users no longer possess their data locally, it is of critical importance to assure users that their data are being correctly stored and maintained. That is, users should be equipped with security means so that they can obtain continuous correctness assurance of their stored data even without the existence of local copies. In case users do not have the time, feasibility or resources to monitor their data, they can delegate the tasks to an optional trusted TPA of their respective choices. In our model, we assume that the point-to-point communication channels between each cloud server and the user are authenticated and reliable, which can be achieved in practice with little overhead. Note that we don't address the issue of data privacy in this paper, as in Cloud Computing, data privacy is orthogonal to the problem we study here.

B. Adversary Model

Security threats faced by cloud data storage can come from two different sources. On the one hand, a CSP can be self-interested, untrusted and possibly malicious. Not only may it desire to move data that has not been or is rarely accessed to a lower tier of storage than agreed for monetary reasons, but it may also attempt to hide a data loss incident due to management errors, Byzantine failures and so on. On the other hand, there may also exist an economically motivated adversary, who has the capability to compromise a number of cloud data storage servers in different time intervals and subsequently is able to modify or delete users' data while remaining undetected by CSPs for a certain period. Specifically, we consider two types of adversary with different levels of capability in this paper:

Weak Adversary: The adversary is interested in corrupting the user's data files stored on individual servers. Once a server is compromised, an adversary can pollute the original data files by modifying them or introducing its own fraudulent data to prevent the original data from being retrieved by the user.

Strong Adversary: This is the worst case scenario, in which we assume that the adversary can compromise all the storage servers so that he can intentionally modify the data files as long as they are internally consistent. In fact, this is equivalent to the case where all servers are colluding together to hide a data loss or corruption incident.
C. Design Goals

To ensure the security and dependability of cloud data storage under the aforementioned adversary model, we aim to design efficient mechanisms for dynamic data verification and operation and achieve the following goals: (1) Storage correctness: to ensure users that their data are indeed stored appropriately and kept intact all the time in the cloud. (2) Fast localization of data error: to effectively locate the malfunctioning server when data corruption has been detected. (3) Dynamic data support: to maintain the same level of storage correctness assurance even if users modify, delete or append their data files in the cloud. (4) Dependability: to enhance data availability against Byzantine failures, malicious data modification and server colluding attacks, i.e., minimizing the effect brought by data errors or server failures. (5) Lightweight: to enable users to perform storage correctness checks with minimum overhead.

D. Notation and Preliminaries

  • $F$: the data file to be stored. We assume that $F$ can be denoted as a matrix of $m$ equal-sized data vectors, each consisting of $l$ blocks. Data blocks are all well represented as elements in the Galois Field $GF(2^p)$ for $p = 8$ or $16$.
  • $A$: the dispersal matrix used for Reed-Solomon coding.
  • $G$: the encoded file matrix, which includes a set of $n = m + k$ vectors, each consisting of $l$ blocks.
  • $f_{key}(\cdot)$: pseudorandom function (PRF), which is defined as $f : \{0,1\}^* \times key \rightarrow GF(2^p)$.
  • $\phi_{key}(\cdot)$: pseudorandom permutation (PRP), which is defined as $\phi : \{0,1\}^{\log_2(l)} \times key \rightarrow \{0,1\}^{\log_2(l)}$.
  • $ver$: a version number bound with the index for individual blocks, which records the number of times the block has been modified. Initially we assume $ver$ is 0 for all data blocks.
  • $s_{ij}^{ver}$: the seed for the PRF, which depends on the file name, block index $i$, the server position $j$ as well as the optional block version number $ver$.

III. ENSURING CLOUD DATA STORAGE

In a cloud data storage system, users store their data in the cloud and no longer possess the data locally. Thus, the correctness and availability of the data files being stored on the distributed cloud servers must be guaranteed. One of the key issues is to effectively detect any unauthorized data modification and corruption, possibly due to server compromise and/or random Byzantine failures. Besides, in the distributed case, when such inconsistencies are successfully detected, finding which server the data error lies in is also of great significance, since it can be the first step toward fast recovery from the storage errors.

To address these problems, our main scheme for ensuring cloud data storage is presented in this section. The first part of the section is devoted to a review of basic tools from coding theory that are needed in our scheme for file distribution across cloud servers. Then, the homomorphic token is introduced. The token computation function we are considering belongs to a family of universal hash functions [11], chosen to preserve the homomorphic properties, which can be perfectly integrated with the verification of erasure-coded data [8] [12]. Subsequently, it is also shown how to derive a challenge-response protocol for verifying the storage correctness as well as identifying misbehaving servers. Finally, the procedure for file retrieval and error recovery based on erasure-correcting code is outlined.

A. File Distribution Preparation

It is well known that erasure-correcting code may be used to tolerate multiple failures in distributed storage systems. In cloud data storage, we rely on this technique to disperse the data file $F$ redundantly across a set of $n = m + k$ distributed servers. A $(m + k, k)$ Reed-Solomon erasure-correcting code is used to create $k$ redundancy parity vectors from $m$ data vectors in such a way that the original $m$ data vectors can be reconstructed from any $m$ out of the $m + k$ data and parity vectors. By placing each of the $m + k$ vectors on a different server, the original data file can survive the failure of any $k$ of the $m + k$ servers without any data loss, with a space overhead of $k/m$. To support efficient sequential I/O to the original file, our file layout is systematic, i.e., the unmodified $m$ data file vectors together with $k$ parity vectors are distributed across $m + k$ different servers.

Let $F = (F_1, F_2, \ldots, F_m)$ and $F_i = (f_{1i}, f_{2i}, \ldots, f_{li})^T$ $(i \in \{1, \ldots, m\})$, where $l \le 2^p - 1$. Note that all these blocks are elements of $GF(2^p)$. The systematic layout with parity vectors is achieved with the information dispersal matrix $A$, derived from an $m \times (m + k)$ Vandermonde matrix [13]:

$$\begin{pmatrix} 1 & 1 & \cdots & 1 & 1 & \cdots & 1 \\ \beta_1 & \beta_2 & \cdots & \beta_m & \beta_{m+1} & \cdots & \beta_n \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \beta_1^{m-1} & \beta_2^{m-1} & \cdots & \beta_m^{m-1} & \beta_{m+1}^{m-1} & \cdots & \beta_n^{m-1} \end{pmatrix},$$

where $\beta_j$ $(j \in \{1, \ldots, n\})$ are distinct elements randomly picked from $GF(2^p)$.

After a sequence of elementary row transformations, the desired matrix $A$ can be written as

$$A = (I \mid P) = \begin{pmatrix} 1 & 0 & \cdots & 0 & p_{11} & p_{12} & \cdots & p_{1k} \\ 0 & 1 & \cdots & 0 & p_{21} & p_{22} & \cdots & p_{2k} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & p_{m1} & p_{m2} & \cdots & p_{mk} \end{pmatrix},$$

where $I$ is an $m \times m$ identity matrix and $P$ is the secret parity generation matrix of size $m \times k$. Note that $A$ is derived from a Vandermonde matrix, thus it has the property that any $m$ out of the $m + k$ columns form an invertible matrix.

By multiplying $F$ by $A$, the user obtains the encoded file:

$$G = F \cdot A = (G^{(1)}, G^{(2)}, \ldots, G^{(m)}, G^{(m+1)}, \ldots, G^{(n)}) = (F_1, F_2, \ldots, F_m, G^{(m+1)}, \ldots, G^{(n)}),$$

where $G^{(j)} = (g_1^{(j)}, g_2^{(j)}, \ldots, g_l^{(j)})^T$ $(j \in \{1, \ldots, n\})$. As noted, the multiplication reproduces the original data file vectors of $F$, and the remaining part $(G^{(m+1)}, \ldots, G^{(n)})$ consists of $k$ parity vectors generated based on $F$.
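The systematic encoding described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the reduction polynomial `POLY`, the fixed choice of $\beta_j = j$ (the paper picks them at random), the toy sizes, and all function names are hypothetical choices made for the example.

```python
from functools import reduce
from operator import xor

POLY = 0x11D  # assumed reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 for GF(2^8)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements (shift-and-add, reduced by POLY)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result

def gf_pow(a, e):
    """Raise a to the e-th power by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Inverse of a nonzero element: a^254, because a^255 = 1."""
    return gf_pow(a, 254)

def dispersal_matrix(m, k):
    """Row-reduce an m x (m+k) Vandermonde matrix to A = (I | P)."""
    betas = range(1, m + k + 1)  # distinct elements; fixed here, random in the paper
    A = [[gf_pow(b, i) for b in betas] for i in range(m)]
    for col in range(m):
        # Find a nonzero pivot, normalise its row, then clear the column elsewhere.
        pivot = next(r for r in range(col, m) if A[r][col])
        A[col], A[pivot] = A[pivot], A[col]
        inv = gf_inv(A[col][col])
        A[col] = [gf_mul(inv, x) for x in A[col]]
        for r in range(m):
            if r != col and A[r][col]:
                factor = A[r][col]
                A[r] = [x ^ gf_mul(factor, y) for x, y in zip(A[r], A[col])]
    return A

def encode(F, A):
    """Compute G = F . A; F has l rows (block index) and m columns (data vectors)."""
    m, n = len(A), len(A[0])
    return [[reduce(xor, (gf_mul(row[i], A[i][j]) for i in range(m)), 0)
             for j in range(n)] for row in F]

m, k = 3, 2
A = dispersal_matrix(m, k)
F = [[10, 20, 30], [40, 50, 60]]  # l = 2 rows of m = 3 blocks
G = encode(F, A)                  # first m columns equal F, last k are parity
```

Because addition in $GF(2^p)$ is XOR and the encoding is linear, `encode` applied to the XOR of two files equals the XOR of their encodings, which is the property the update operation in Section IV relies on.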
B. Challenge Token Precomputation

In order to achieve assurance of data storage correctness and data error localization simultaneously, our scheme relies entirely on pre-computed verification tokens. The main idea is as follows: before file distribution the user pre-computes a certain number of short verification tokens on each individual vector $G^{(j)}$ $(j \in \{1, \ldots, n\})$, each token covering a random subset of data blocks. Later, when the user wants to check the storage correctness of the data in the cloud, he challenges the cloud servers with a set of randomly generated block indices. Upon receiving the challenge, each cloud server computes a short "signature" over the specified blocks and returns it to the user. The values of these signatures should match the corresponding tokens pre-computed by the user. Meanwhile, as all servers operate over the same subset of the indices, the requested response values for the integrity check must also be a valid codeword determined by the secret matrix $P$.

Suppose the user wants to challenge the cloud servers $t$ times to ensure the correctness of data storage. Then, he must pre-compute $t$ verification tokens for each $G^{(j)}$ $(j \in \{1, \ldots, n\})$, using a PRF $f(\cdot)$, a PRP $\phi(\cdot)$, a challenge key $k_{chal}$ and a master permutation key $K_{PRP}$. To generate the $i$-th token for server $j$, the user acts as follows:

1) Derive a random challenge value $\alpha_i$ of $GF(2^p)$ by $\alpha_i = f_{k_{chal}}(i)$ and a permutation key $k_{prp}^{(i)}$ based on $K_{PRP}$.
2) Compute the set of $r$ randomly chosen indices: $\{I_q \in [1, \ldots, l] \mid 1 \le q \le r\}$, where $I_q = \phi_{k_{prp}^{(i)}}(q)$.
3) Calculate the token as:

$$v_i^{(j)} = \sum_{q=1}^{r} \alpha_i^q * G^{(j)}[I_q], \quad \text{where } G^{(j)}[I_q] = g_{I_q}^{(j)}.$$

Note that $v_i^{(j)}$, which is an element of $GF(2^p)$ with small size, is the response the user expects to receive from server $j$ when he challenges it on the specified data blocks.

After token generation, the user has the choice of either keeping the pre-computed tokens locally or storing them in encrypted form on the cloud servers. In our case here, the user stores them locally to obviate the need for encryption and lower the bandwidth overhead during dynamic data operations, which will be discussed shortly. The details of token generation are shown in Algorithm 1.

Algorithm 1 Token Pre-computation
    procedure
        Choose parameters $l$, $n$ and functions $f$, $\phi$;
        Choose the number $t$ of tokens;
        Choose the number $r$ of indices per verification;
        Generate master key $K_{PRP}$ and challenge key $k_{chal}$;
        for vector $G^{(j)}$, $j \leftarrow 1, n$ do
            for round $i \leftarrow 1, t$ do
                Derive $\alpha_i = f_{k_{chal}}(i)$ and $k_{prp}^{(i)}$ from $K_{PRP}$.
                Compute $v_i^{(j)} = \sum_{q=1}^{r} \alpha_i^q * G^{(j)}[\phi_{k_{prp}^{(i)}}(q)]$
            end for
        end for
        Store all the $v_i^{(j)}$ locally.
    end procedure

Once all tokens are computed, the final step before file distribution is to blind each parity block $g_i^{(j)}$ in $(G^{(m+1)}, \ldots, G^{(n)})$ by

$$g_i^{(j)} \leftarrow g_i^{(j)} + f_{k_j}(s_{ij}), \quad i \in \{1, \ldots, l\},$$

where $k_j$ is the secret key for parity vector $G^{(j)}$ $(j \in \{m + 1, \ldots, n\})$. This is for protection of the secret matrix $P$. We will discuss the necessity of using blinded parities in detail in Section V. After blinding the parity information, the user disperses all the $n$ encoded vectors $G^{(j)}$ $(j \in \{1, \ldots, n\})$ across the cloud servers $S_1, S_2, \ldots, S_n$.

C. Correctness Verification and Error Localization

Error localization is a key prerequisite for eliminating errors in storage systems. However, many previous schemes do not explicitly consider the problem of data error localization, and thus only provide binary results for the storage verification. Our scheme outperforms those by integrating the correctness verification and error localization in our challenge-response protocol: the response values from servers for each challenge not only determine the correctness of the distributed storage, but also contain information to locate potential data error(s).

Algorithm 2 Correctness Verification and Error Localization
    procedure CHALLENGE($i$)
        Recompute $\alpha_i = f_{k_{chal}}(i)$ and $k_{prp}^{(i)}$ from $K_{PRP}$;
        Send $\{\alpha_i, k_{prp}^{(i)}\}$ to all the cloud servers;
        Receive from servers: $\{R_i^{(j)} = \sum_{q=1}^{r} \alpha_i^q * G^{(j)}[\phi_{k_{prp}^{(i)}}(q)] \mid 1 \le j \le n\}$
        for ($j \leftarrow m + 1, n$) do
            $R_i^{(j)} \leftarrow R_i^{(j)} - \sum_{q=1}^{r} f_{k_j}(s_{I_q,j}) \cdot \alpha_i^q$, $I_q = \phi_{k_{prp}^{(i)}}(q)$
        end for
        if ($(R_i^{(1)}, \ldots, R_i^{(m)}) \cdot P == (R_i^{(m+1)}, \ldots, R_i^{(n)})$) then
            Accept and ready for the next challenge.
        else
            for ($j \leftarrow 1, n$) do
                if ($R_i^{(j)} \ne v_i^{(j)}$) then
                    return server $j$ is misbehaving.
                end if
            end for
        end if
    end procedure

Specifically, the procedure of the $i$-th challenge-response for a cross-check over the $n$ servers is described as follows:

1) The user reveals $\alpha_i$ as well as the $i$-th permutation key $k_{prp}^{(i)}$ to each server.
2) The server storing vector $G^{(j)}$ aggregates those $r$ rows specified by index $k_{prp}^{(i)}$ into a linear combination

$$R_i^{(j)} = \sum_{q=1}^{r} \alpha_i^q * G^{(j)}[\phi_{k_{prp}^{(i)}}(q)].$$
3) Upon receiving the $R_i^{(j)}$ from all the servers, the user takes away the blind values in $R_i^{(j)}$ $(j \in \{m + 1, \ldots, n\})$ by

$$R_i^{(j)} \leftarrow R_i^{(j)} - \sum_{q=1}^{r} f_{k_j}(s_{I_q,j}) \cdot \alpha_i^q, \quad \text{where } I_q = \phi_{k_{prp}^{(i)}}(q).$$

4) Then the user verifies whether the received values remain a valid codeword determined by the secret matrix $P$:

$$(R_i^{(1)}, \ldots, R_i^{(m)}) \cdot P \stackrel{?}{=} (R_i^{(m+1)}, \ldots, R_i^{(n)}).$$

Because all the servers operate over the same subset of indices, the linear aggregation of these $r$ specified rows $(R_i^{(1)}, \ldots, R_i^{(n)})$ has to be a codeword in the encoded file matrix. If the above equation holds, the challenge is passed. Otherwise, it indicates that among those specified rows, there exist file block corruptions.

Once the inconsistency among the storage has been successfully detected, we can rely on the pre-computed verification tokens to further determine where the potential data error(s) lie. Note that each response $R_i^{(j)}$ is computed in exactly the same way as token $v_i^{(j)}$, thus the user can simply find which server is misbehaving by verifying the following $n$ equations:

$$R_i^{(j)} \stackrel{?}{=} v_i^{(j)}, \quad j \in \{1, \ldots, n\}.$$

Algorithm 2 gives the details of correctness verification and error localization.

D. File Retrieval and Error Recovery

Since our layout of the file matrix is systematic, the user can reconstruct the original file by downloading the data vectors from the first $m$ servers, assuming that they return the correct response values. Notice that our verification scheme is based on random spot-checking, so the storage correctness assurance is a probabilistic one. However, by choosing system parameters (e.g., $r$, $l$, $t$) appropriately and conducting verification enough times, we can guarantee successful file retrieval with high probability. On the other hand, whenever data corruption is detected, the comparison of pre-computed tokens and received response values can guarantee the identification of misbehaving server(s), again with high probability, which will be discussed shortly. Therefore, the user can always ask servers to send back blocks of the $r$ rows specified in the challenge and regenerate the correct blocks by erasure correction, shown in Algorithm 3, as long as at most $k$ misbehaving servers are identified. The newly recovered blocks can then be redistributed to the misbehaving servers to maintain the correctness of storage.

Algorithm 3 Error Recovery
    procedure
        % Assume the block corruptions have been detected among the specified $r$ rows;
        % Assume $s \le k$ servers have been identified as misbehaving
        Download $r$ rows of blocks from servers;
        Treat the $s$ servers as erasures and recover the blocks.
        Resend the recovered blocks to the corresponding servers.
    end procedure

IV. PROVIDING DYNAMIC DATA OPERATION SUPPORT

So far, we assumed that $F$ represents static or archived data. This model may fit some application scenarios, such as libraries and scientific datasets. However, in cloud data storage, there are many potential scenarios where data stored in the cloud is dynamic, like electronic documents, photos, or log files. Therefore, it is crucial to consider the dynamic case, where a user may wish to perform various block-level operations of update, delete and append to modify the data file while maintaining the storage correctness assurance.

The straightforward and trivial way to support these operations is for the user to download all the data from the cloud servers and re-compute all the parity blocks as well as the verification tokens. This would clearly be highly inefficient. In this section, we will show how our scheme can explicitly and efficiently handle dynamic data operations for cloud data storage.

A. Update Operation

In cloud data storage, sometimes the user may need to modify some data block(s) stored in the cloud, from the current value $f_{ij}$ to a new one, $f_{ij} + \Delta f_{ij}$. We refer to this operation as data update. Due to the linear property of the Reed-Solomon code, a user can perform the update operation and generate the updated parity blocks by using $\Delta f_{ij}$ only, without involving any other unchanged blocks. Specifically, the user can construct a general update matrix $\Delta F$ as

$$\Delta F = \begin{pmatrix} \Delta f_{11} & \Delta f_{12} & \cdots & \Delta f_{1m} \\ \Delta f_{21} & \Delta f_{22} & \cdots & \Delta f_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \Delta f_{l1} & \Delta f_{l2} & \cdots & \Delta f_{lm} \end{pmatrix} = (\Delta F_1, \Delta F_2, \ldots, \Delta F_m).$$

Note that we use zero elements in $\Delta F$ to denote the unchanged blocks. To maintain the corresponding parity vectors as well as be consistent with the original file layout, the user can multiply $\Delta F$ by $A$ and thus generate the update information for both the data vectors and parity vectors as follows:

$$\Delta F \cdot A = (\Delta G^{(1)}, \ldots, \Delta G^{(m)}, \Delta G^{(m+1)}, \ldots, \Delta G^{(n)}) = (\Delta F_1, \ldots, \Delta F_m, \Delta G^{(m+1)}, \ldots, \Delta G^{(n)}),$$

where $\Delta G^{(j)}$ $(j \in \{m + 1, \ldots, n\})$ denotes the update information for the parity vector $G^{(j)}$.

Because the data update operation inevitably affects some or all of the remaining verification tokens, after preparation of the update information, the user has to amend those unused tokens for each vector $G^{(j)}$ to maintain the same storage correctness assurance. In other words, for all the unused tokens, the user needs to exclude every occurrence of the old data block and replace it with the new one. Thanks to the homomorphic construction of our verification token, the user can perform the token update efficiently. To give more details, suppose a
block $G^{(j)}[I_s]$, which is covered by the specific token $v_i^{(j)}$, has been changed to $G^{(j)}[I_s] + \Delta G^{(j)}[I_s]$, where $I_s = \phi_{k_{prp}^{(i)}}(s)$. To maintain the usability of token $v_i^{(j)}$, it is not hard to verify that the user can simply update it by

$$v_i^{(j)} \leftarrow v_i^{(j)} + \alpha_i^s * \Delta G^{(j)}[I_s],$$

without retrieving any of the other $r - 1$ blocks required in the pre-computation of $v_i^{(j)}$.

After the amendment to the affected tokens (in practice, it is possible that only a fraction of tokens need amendment, since the updated blocks may not be covered by all the tokens), the user needs to blind the update information $\Delta g_i^{(j)}$ for each parity block in $(\Delta G^{(m+1)}, \ldots, \Delta G^{(n)})$ to hide the secret matrix $P$ by

$$\Delta g_i^{(j)} \leftarrow \Delta g_i^{(j)} + f_{k_j}(s_{ij}^{ver}), \quad i \in \{1, \ldots, l\}.$$

Here we use a new seed $s_{ij}^{ver}$ for the PRF. The version number $ver$ functions like a counter which helps the user to keep track of the blind information on the specific parity blocks. After blinding, the user sends the update information to the cloud servers, which perform the update operation as

$$G^{(j)} \leftarrow G^{(j)} + \Delta G^{(j)}, \quad (j \in \{1, \ldots, n\}).$$

B. Delete Operation

Sometimes, after being stored in the cloud, certain data blocks may need to be deleted. The delete operation we are considering is a general one, in which the user replaces the data block with zero or some special reserved data symbol. From this point of view, the delete operation is actually a special case of the data update operation, where the original data blocks can be replaced with zeros or some predetermined special blocks. Therefore, we can rely on the update procedure to support the delete operation, i.e., by setting $\Delta f_{ij}$ in $\Delta F$ to be $-f_{ij}$, so that the updated block $f_{ij} + \Delta f_{ij}$ becomes zero. Also, all the affected tokens have to be modified and the updated parity information has to be blinded using the same method specified for the update operation.

C. Append Operation

In some cases, the user may want to increase the size of his stored data by adding blocks at the end of the data file, which we refer to as data append. We anticipate that the most frequent append operation in cloud data storage is bulk append, in which the user needs to upload a large number of blocks (not a single block) at one time.

Given the file matrix $F$ illustrated in file distribution preparation, appending blocks towards the end of a data file is equivalent to concatenating corresponding rows at the bottom of the matrix layout for file $F$. In the beginning, there are only $l$ rows in the file matrix. To simplify the presentation, we suppose the user wants to append $m$ blocks at the end of file $F$, denoted as $(f_{l+1,1}, f_{l+1,2}, \ldots, f_{l+1,m})$ (we can always use zero-padding to make a row of $m$ elements). With the secret matrix $P$, the user can directly calculate the append blocks for each parity server as

$$(f_{l+1,1}, \ldots, f_{l+1,m}) \cdot P = (g_{l+1}^{(m+1)}, \ldots, g_{l+1}^{(n)}).$$

To support the block append operation, we need a slight modification to our token pre-computation. Specifically, we require the user to anticipate the maximum size in blocks, denoted as $l_{max}$, for each of his data vectors. The idea of supporting block append, which is similar to that adopted in [7], relies on the initial budget for the maximum anticipated data size $l_{max}$ in each encoded data vector as well as the system parameter $r_{max} = \lceil r * (l_{max}/l) \rceil$ for each pre-computed challenge-response token. The pre-computation of the $i$-th token on server $j$ is modified as follows:

$$v_i^{(j)} = \sum_{q=1}^{r_{max}} \alpha_i^q * G^{(j)}[I_q],$$

where

$$G^{(j)}[I_q] = \begin{cases} G^{(j)}[\phi_{k_{prp}^{(i)}}(q)], & [\phi_{k_{prp}^{(i)}}(q)] \le l \\ 0, & [\phi_{k_{prp}^{(i)}}(q)] > l \end{cases}$$

This formula guarantees that on average, there will be $r$ indices falling into the range of the existing $l$ blocks. Because the cloud servers and the user agree on the number of existing blocks in each vector $G^{(j)}$, servers will follow exactly the above procedure when re-computing the token values upon receiving the user's challenge request.

Now when the user is ready to append new blocks, i.e., both the file blocks and the corresponding parity blocks are generated, the total length of each vector $G^{(j)}$ will be increased and fall into the range $[l, l_{max}]$. Therefore, the user will update those affected tokens by adding $\alpha_i^s * G^{(j)}[I_s]$ to the old $v_i^{(j)}$ whenever $G^{(j)}[I_s] \ne 0$ for $I_s > l$, where $I_s = \phi_{k_{prp}^{(i)}}(s)$. The parity blinding is similar to that introduced for the update operation, and thus is omitted here.

D. Insert Operation

An insert operation on the data file refers to an append operation at the desired index position while maintaining the same data block structure for the whole data file, i.e., inserting a block $F[j]$ corresponds to shifting all blocks starting with index $j + 1$ by one slot. An insert operation may affect many rows in the logical data file matrix $F$, and a substantial number of computations are required to renumber all the subsequent blocks as well as re-compute the challenge-response tokens. Therefore, an efficient insert operation is difficult to support and thus we leave it for our future work.

V. SECURITY ANALYSIS AND PERFORMANCE EVALUATION

In this section, we analyze our proposed scheme in terms of security and efficiency. Our security analysis focuses on the adversary model defined in Section II. We also evaluate the efficiency of our scheme via implementation of both file distribution preparation and verification token precomputation.

A. Security Strength Against Weak Adversary

1) Detection Probability against data modification: In our scheme, servers are required to operate on specified rows in each correctness verification for the calculation of requested
token. We will show that this “sampling” strategy on selected rows instead of all can greatly reduce the computational overhead on the server, while maintaining the detection of the data corruption with high probability.

Suppose $n_c$ servers are misbehaving due to possible compromise or Byzantine failure. In the following analysis, we do not limit the value of $n_c$, i.e., $n_c \le n$. Assume the adversary modifies the data blocks in $z$ rows out of the $l$ rows in the encoded file matrix. Let $r$ be the number of different rows for which the user asks for a check in a challenge. Let $X$ be a discrete random variable defined as the number of rows chosen by the user that match the rows modified by the adversary. We first analyze the matching probability that at least one of the rows picked by the user matches one of the rows modified by the adversary:

$$P_m^r = 1 - P\{X = 0\} = 1 - \prod_{i=0}^{r-1}\left(1 - \min\left\{\frac{z}{l-i}, 1\right\}\right) \ge 1 - \left(\frac{l-z}{l}\right)^r.$$

Note that if none of the specified $r$ rows in the $i$-th verification process are deleted or modified, the adversary avoids the detection.

Next, we study the probability of a false negative result, namely that the data blocks in those specified $r$ rows have been modified, but the checking equation still holds. Consider the responses $R_i^{(1)}, \ldots, R_i^{(n)}$ returned from the data storage servers for the $i$-th challenge. Each response value $R_i^{(j)}$, calculated within $GF(2^p)$, is based on $r$ blocks on server $j$. The number of responses $R^{(m+1)}, \ldots, R^{(n)}$ from parity servers is $k = n - m$. Thus, according to proposition 2 of our previous work in [14], the false negative probability is

$$P_f^r = Pr_1 + Pr_2,$$

where $Pr_1 = \frac{(1 + 2^{-p})^{n_c} - 1}{2^{n_c} - 1}$ and $Pr_2 = (1 - Pr_1)(2^{-p})^k$.

Based on the above discussion, it follows that the probability of data modification detection across all storage servers is

$$P_d = P_m^r \cdot (1 - P_f^r).$$

Figure 2 plots $P_d$ for different values of $l$, $r$, $z$ while we set $p = 8$, $n_c = 10$ and $k = 5$ (footnote 2). From the figure we can see that if more than a fraction of the data file is corrupted, then it suffices to challenge for a small constant number of rows in order to achieve detection with high probability. For example, if $z$ = 1% of $l$, every token only needs to cover 460 indices in order to achieve a detection probability of at least 99%.

[Fig. 2: The detection probability $P_d$ against data modification. We show $P_d$ as a function of $l$ (the number of blocks on each cloud storage server) and $r$ (the number of rows queried by the user, shown as a percentage of $l$) for three values of $z$ (the number of rows modified by the adversary). left) $z$ = 1% of $l$; middle) $z$ = 5% of $l$; right) $z$ = 10% of $l$. Note that all graphs are plotted under $p = 8$, $n_c = 10$ and $k = 5$ and each graph has a different scale.]

2) Identification Probability for Misbehaving Servers: We have shown that, if the adversary modifies the data blocks among any of the data storage servers, our sampling checking scheme can successfully detect the attack with high probability. As long as the data modification is caught, the user will further determine which server is malfunctioning. This can be achieved by comparing the response values $R_i^{(j)}$ with the pre-stored tokens $v_i^{(j)}$, where $j \in \{1, \ldots, n\}$. The probability for error localization or identifying misbehaving server(s) can be computed in a similar way. It is the product of the matching probability for the sampling check and the probability of the complementary event for the false negative result. Obviously, the matching probability is

$$\hat{P}_m^r = 1 - \prod_{i=0}^{r-1}\left(1 - \min\left\{\frac{\hat{z}}{l-i}, 1\right\}\right),$$

where $\hat{z} \le z$.

Next, we consider the false negative probability that $R_i^{(j)} = v_i^{(j)}$ when at least one of the $\hat{z}$ blocks is modified. According to proposition 1 of [14], tokens calculated in $GF(2^p)$ for two different data vectors collide with probability $\hat{P}_f^r = 2^{-p}$. Thus, the identification probability for misbehaving server(s) is

$$\hat{P}_d = \hat{P}_m^r \cdot (1 - \hat{P}_f^r).$$

Along with the analysis in detection probability, if $z$ = 1% of $l$ and each token covers 460 indices, the identification probability for misbehaving servers is at least 99%.

Footnote 2: Note that $n_c$ and $k$ only affect the false negative probability $P_f^r$. However, in our scheme, since $p = 8$ almost dominates the negligibility of $P_f^r$, the values of $n_c$ and $k$ have little effect in the plot of $P_d$.
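As a rough numerical check of the detection analysis in this subsection, the formulas for the matching probability, the false negative probability, and the detection probability can be evaluated directly. The Python sketch below is only an illustration: the function names are hypothetical (they do not come from the paper), and the value l = 5000 is an assumed example taken from the range shown on the axes of Fig. 2.

```python
import math


def matching_probability(l: int, z: int, r: int) -> float:
    """P_m^r = 1 - prod_{i=0}^{r-1} (1 - min(z / (l - i), 1))."""
    if not 0 <= r <= l:
        raise ValueError("r must satisfy 0 <= r <= l")
    miss = 1.0  # probability that none of the r sampled rows is a modified row
    for i in range(r):
        miss *= 1.0 - min(z / (l - i), 1.0)
    return 1.0 - miss


def false_negative_probability(p: int, n_c: int, k: int) -> float:
    """P_f^r = Pr_1 + Pr_2 with Pr_1 and Pr_2 as defined in the text."""
    pr1 = ((1.0 + 2.0 ** -p) ** n_c - 1.0) / (2.0 ** n_c - 1.0)
    pr2 = (1.0 - pr1) * (2.0 ** -p) ** k
    return pr1 + pr2


def detection_probability(l: int, z: int, r: int,
                          p: int = 8, n_c: int = 10, k: int = 5) -> float:
    """P_d = P_m^r * (1 - P_f^r)."""
    return matching_probability(l, z, r) * (1.0 - false_negative_probability(p, n_c, k))


def identification_probability(l: int, z_hat: int, r: int, p: int = 8) -> float:
    """Identification probability, using the collision probability 2^{-p}."""
    return matching_probability(l, z_hat, r) * (1.0 - 2.0 ** -p)


def rows_for_target(z_fraction: float, target: float) -> int:
    """Smallest r with 1 - (1 - z/l)^r >= target (the lower bound on P_m^r)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - z_fraction))


if __name__ == "__main__":
    l = 5000            # assumed example value, within the range plotted in Fig. 2
    z = l // 100        # 1% of the rows modified
    print(detection_probability(l, z, 460))
    print(rows_for_target(0.01, 0.99))
```

Under these assumptions, the sketch gives a detection probability above 99% for r = 460 when z is 1% of l, and the lower bound alone already reaches 99% with no more than 460 rows, which agrees with the figure quoted in the text.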
B. Security Strength Against Strong Adversary

In this section, we analyze the security strength of our schemes against server colluding attack and explain why blinding the parity blocks can help improve the security strength of our proposed scheme.

Recall that in the file distribution preparation, the redundancy parity vectors are calculated via multiplying the file matrix F by P, where P is the secret parity generation matrix we later rely on for storage correctness assurance. If we disperse all the generated vectors directly after token precomputation, i.e., without blinding, malicious servers that collaborate can reconstruct the secret P matrix easily: they can pick blocks from the same rows among the data and parity vectors to establish a set of $m \cdot k$ linear equations and solve for the $m \cdot k$ entries of the parity generation matrix P. Once they have the knowledge of P, those malicious servers can consequently modify any part of the data blocks and calculate the corresponding parity blocks, and vice versa, making their codeword relationship always consistent. Therefore, our storage correctness challenge scheme would be undermined: even if those modified blocks are covered by the specified rows, the storage correctness check equation would always hold.

To prevent colluding servers from recovering P and making up consistently-related data and parity blocks, we utilize the technique of adding random perturbations to the encoded file matrix and hence hide the secret matrix P. We make use of a keyed pseudorandom function $f_{k_j}(\cdot)$ with key $k_j$ and seed $s_{ij}^{ver}$, both of which have been introduced previously. In order to maintain the systematic layout of the data file, we only blind the parity blocks with random perturbations. Our purpose is to add “noise” to the set of linear equations and make it computationally infeasible to solve for the correct secret matrix P. By blinding each parity block with a random perturbation, the malicious servers no longer have all the necessary information to build up the correct linear equation groups and therefore cannot derive the secret matrix P.

C. Performance Evaluation

1) File Distribution Preparation: We implemented the generation of parity vectors for our scheme under field $GF(2^8)$. Our experiment is conducted using C on a system with an Intel Core 2 processor running at 1.86 GHz, 2048 MB of RAM, and a 7200 RPM Western Digital 250 GB Serial ATA drive with an 8 MB buffer. We consider two sets of different parameters for the $(m+k, m)$ Reed-Solomon encoding. Table I shows the average encoding cost over 10 trials for an 8 GB file. In the table on the top, we set the number of parity vectors constant at 2. In the one at the bottom, we keep the number of the data vectors fixed at 8, and increase the number of parity vectors. Note that as $m$ increases, the length $l$ of data vectors on each server will decrease, which results in fewer calls to the Reed-Solomon encoder. Thus the cost in the top table decreases when more data vectors are involved. From Table I, it can be shown that the performance of our scheme is comparable to that of [10], even if our scheme supports dynamic data operation while [10] is for static data only.

TABLE I: The cost of parity generation in seconds for an 8GB data file. For set I, the number of parity servers k is fixed; for set II, the number of data servers m is constant.

set I     m = 4      m = 6      m = 8      m = 10
k = 2     567.45s    484.55s    437.22s    414.22s

set II    k = 1      k = 2      k = 3      k = 4
m = 8     358.90s    437.22s    584.55s    733.34s

2) Challenge Token Pre-computation: Although in our scheme the number of verification tokens $t$ is fixed a priori, determined before file distribution, we can overcome this issue by choosing a sufficiently large $t$ in practice. For example, when $t$ is selected to be 1825 and 3650, the data file can be verified every day for the next 5 years and 10 years, respectively. Following the security analysis, we select a practical parameter $r = 460$ for our token pre-computation (see previous subsections), i.e., each token covers 460 different indices. Other parameters are along with the file distribution preparation. According to our implementation, the average token pre-computation cost for $t = 1825$ is 51.97s per data vector, and for $t = 3650$ is 103.94s per data vector. This is faster than the hash function based token pre-computation scheme proposed in [7]. For a typical number of 8 servers, the total cost for token pre-computation is no more than 15 minutes. Note that since each token is only an element of the field $GF(2^8)$, the extra storage for those pre-computed tokens is less than 1MB, and thus can be neglected.

VI. RELATED WORK

Juels et al. [3] described a formal “proof of retrievability” (POR) model for ensuring the remote data integrity. Their scheme combines spot-checking and error-correcting code to ensure both possession and retrievability of files on archive service systems. Shacham et al. [4] built on this model and constructed a random linear function based homomorphic authenticator which enables an unlimited number of queries and requires less communication overhead. Bowers et al. [5] proposed an improved framework for POR protocols that generalizes both Juels and Shacham's work. Later in their subsequent work, Bowers et al. [10] extended the POR model to distributed systems. However, all these schemes focus on static data. The effectiveness of their schemes rests primarily on the preprocessing steps that the user conducts before outsourcing the data file F. Any change to the contents of F, even of a few bits, must propagate through the error-correcting code, thus introducing significant computation and communication complexity.

Ateniese et al. [6] defined the “provable data possession” (PDP) model for ensuring possession of files on untrusted storages. Their scheme utilized public key based homomorphic tags for auditing the data file, thus providing public verifiability. However, their scheme requires considerable computation overhead that can be expensive for an entire file. In their subsequent work, Ateniese et al. [7] described a PDP scheme
that uses only symmetric key cryptography. This method has lower overhead than their previous scheme and allows for block updates, deletions and appends to the stored file, which has also been supported in our work. However, their scheme focuses on the single server scenario and does not address small data corruptions, leaving both the distributed scenario and the data error recovery issue unexplored. Curtmola et al. [15] aimed to ensure data possession of multiple replicas across the distributed storage system. They extended the PDP scheme to cover multiple replicas without encoding each replica separately, providing a guarantee that multiple copies of data are actually maintained.

In other related work, Lillibridge et al. [9] presented a P2P backup scheme in which blocks of a data file are dispersed across $m+k$ peers using an $(m+k, m)$-erasure code. Peers can request random blocks from their backup peers and verify the integrity using separate keyed cryptographic hashes attached on each block. Their scheme can detect data loss from free-riding peers, but does not ensure all data is unchanged. Filho et al. [16] proposed to verify data integrity using an RSA-based hash to demonstrate uncheatable data possession in peer-to-peer file sharing networks. However, their proposal requires exponentiation over the entire data file, which is clearly impractical for the server whenever the file is large. Shah et al. [17] proposed allowing a TPA to keep online storage honest by first encrypting the data then sending a number of pre-computed symmetric-keyed hashes over the encrypted data to the auditor. However, their scheme only works for encrypted files, and auditors must maintain long-term state. Schwarz et al. [8] proposed to ensure file integrity across multiple distributed servers, using erasure-coding and block-level file integrity checks. However, their scheme only considers static data files and does not explicitly study the problem of data error localization, which we are considering in this work.

VII. CONCLUSION

In this paper, we investigated the problem of data security in cloud data storage, which is essentially a distributed storage system. To ensure the correctness of users' data in cloud data storage, we proposed an effective and flexible distributed scheme with explicit dynamic data support, including block update, delete, and append. We rely on erasure-correcting code in the file distribution preparation to provide redundancy parity vectors and guarantee the data dependability. By utilizing the homomorphic token with distributed verification of erasure-coded data, our scheme achieves the integration of storage correctness insurance and data error localization, i.e., whenever data corruption has been detected during the storage correctness verification across the distributed servers, we can almost guarantee the simultaneous identification of the misbehaving server(s). Through detailed security and performance analysis, we show that our scheme is highly efficient and resilient to Byzantine failure, malicious data modification attack, and even server colluding attacks.

We believe that data storage security in Cloud Computing, an area full of challenges and of paramount importance, is still in its infancy now, and many research problems are yet to be identified. We envision several possible directions for future research on this area. The most promising one, we believe, is a model in which public verifiability is enforced. Public verifiability, supported in [6] [4] [17], allows a TPA to audit the cloud data storage without demanding users' time, feasibility or resources. An interesting question in this model is whether we can construct a scheme to achieve both public verifiability and storage correctness assurance of dynamic data. Besides, along with our research on dynamic cloud data storage, we also plan to investigate the problem of fine-grained data error localization.

ACKNOWLEDGEMENT

This work was supported in part by the US National Science Foundation under grants CNS-0831963, CNS-0626601, CNS-0716306, and CNS-0831628.

REFERENCES

[1] Amazon.com, “Amazon Web Services (AWS),” Online at http://aws.amazon.com, 2008.
[2] N. Gohring, “Amazon’s S3 down for several hours,” Online at http://www.pcworld.com/businesscenter/article/142549/amazons_s3_down_for_several_hours.html, 2008.
[3] A. Juels and B. S. Kaliski Jr., “PORs: Proofs of Retrievability for Large Files,” Proc. of CCS ’07, pp. 584–597, 2007.
[4] H. Shacham and B. Waters, “Compact Proofs of Retrievability,” Proc. of Asiacrypt ’08, Dec. 2008.
[5] K. D. Bowers, A. Juels, and A. Oprea, “Proofs of Retrievability: Theory and Implementation,” Cryptology ePrint Archive, Report 2008/175, 2008, http://eprint.iacr.org/.
[6] G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song, “Provable Data Possession at Untrusted Stores,” Proc. of CCS ’07, pp. 598–609, 2007.
[7] G. Ateniese, R. D. Pietro, L. V. Mancini, and G. Tsudik, “Scalable and Efficient Provable Data Possession,” Proc. of SecureComm ’08, pp. 1–10, 2008.
[8] T. S. J. Schwarz and E. L. Miller, “Store, Forget, and Check: Using Algebraic Signatures to Check Remotely Administered Storage,” Proc. of ICDCS ’06, pp. 12–12, 2006.
[9] M. Lillibridge, S. Elnikety, A. Birrell, M. Burrows, and M. Isard, “A Cooperative Internet Backup Scheme,” Proc. of the 2003 USENIX Annual Technical Conference (General Track), pp. 29–41, 2003.
[10] K. D. Bowers, A. Juels, and A. Oprea, “HAIL: A High-Availability and Integrity Layer for Cloud Storage,” Cryptology ePrint Archive, Report 2008/489, 2008, http://eprint.iacr.org/.
[11] L. Carter and M. Wegman, “Universal Hash Functions,” Journal of Computer and System Sciences, vol. 18, no. 2, pp. 143–154, 1979.
[12] J. Hendricks, G. Ganger, and M. Reiter, “Verifying Distributed Erasure-coded Data,” Proc. 26th ACM Symposium on Principles of Distributed Computing, pp. 139–146, 2007.
[13] J. S. Plank and Y. Ding, “Note: Correction to the 1997 Tutorial on Reed-Solomon Coding,” University of Tennessee, Tech. Rep. CS-03-504, 2003.
[14] Q. Wang, K. Ren, W. Lou, and Y. Zhang, “Dependable and Secure Sensor Data Storage with Dynamic Integrity Assurance,” Proc. of IEEE INFOCOM, 2009.
[15] R. Curtmola, O. Khan, R. Burns, and G. Ateniese, “MR-PDP: Multiple-Replica Provable Data Possession,” Proc. of ICDCS ’08, pp. 411–420, 2008.
[16] D. L. G. Filho and P. S. L. M. Barreto, “Demonstrating Data Possession and Uncheatable Data Transfer,” Cryptology ePrint Archive, Report 2006/150, 2006, http://eprint.iacr.org/.
[17] M. A. Shah, M. Baker, J. C. Mogul, and R. Swaminathan, “Auditing to Keep Online Storage Services Honest,” Proc. 11th USENIX Workshop on Hot Topics in Operating Systems (HOTOS ’07), pp. 1–6, 2007.