Fake Picassos, Tampered History, and Digital Forgery:Protecting the Genealogy of Bits withSecure Provenance<br />Ragib Has...
Let’s play a game<br />Real, worth $101.8 million<br />Fake, listed at eBay, worth nothing<br />Can you spot the fake Pica...
So, how do art buyers authenticate art?<br />Among other things, they look at Provenance records<br />Provenance: from Lat...
Provenance of a book<br />Owner names recorded in sequential order on left side<br />4<br />
Let’s consider the Digital World<br />Am I getting back untampered data?<br />Was this data created and processed by perso...
Data History is especially important for mobile data<br />Celebrity death hoax<br />Resulting tweet/retweet cycle<br />Hoa...
The origin of data is important to measure its quality<br />Dr. Evil<br />Dr. Jones<br />Dataset is used by many unsuspect...
Bottom line: History matters …<br />The origins of a data object can affect its quality and usefulness<br />The transforma...
What exactly is Data Provenance?<br />Definition*<br />Description of the origins of data and the process by which it arri...
Example provenance systems<br />Simmhan et al., 2005<br />
What was the common theme of all those systems?<br />They were all scientific computing systems<br />And scientists trust ...
“History is not history unless it is the truth”. <br />Abraham Lincoln, 1856<br />
Problem Statement<br />Can we design an efficient system for securing the provenance of documents, and allow auditing with...
Previous work on integrity assurances<br />(Logically) centralized repository  (CVS, Subversion, GIT)<br />Changes to file...
Secure provenance means preventing “Undetectable history rewriting”<br />Preventundetectable modification of history, such...
Usage and threat model<br />Bob<br />Alice<br />Charlie<br />Marvin<br />PAlice<br />PBob<br />PCharlie<br />PAlice<br />P...
Only certain auditors can read each entry</li></ul>Adversaries: insiders or outsiders who<br /><ul><li>Add or remove histo...
Collude with others to add/remove entries
Claim a chain belongs to another document
Repudiate an entry </li></ul>Ragib Hasan, Radu Sion, and Marianne Winslett, “Introducing Secure Provenance: Problems and C...
Scenario 1: Securing a single document<br />
What realistic guarantees can we provide against a strong adversary?<br />We can’t expect or force the adversary to play b...
Plausible history is something that could have happened<br />Plausible history: The version history that will result if a ...
Original scheme: Ensure Verification of Plausible History<br />Given a document and a provenance chain, we want to verify ...
Analogy: The Forger and the Prada<br />Fake Prada, <br />street price $50<br />Forger’s goal is to create a plausible hist...
Our solution: Overview<br />P1<br />P2<br />P3<br />P4<br />Pn-1<br />Pn<br />Provenance Chain<br />…<br />U3<br />W3<br /...
Our solution: Confidentiality<br />A single auditor<br />Modification log<br />Encrypted Modification log<br />Issues <br ...
Only the auditor(s) trusted by the user can see the user’s actions on the document</li></ul>Multiple auditors<br />Encrypt...
Sidenote: Broadcast Encryption<br />Organize the keys as a tree, with auditors at leaves<br />Each principal knows all the...
Our solution: Confidentiality<br />P1<br />P2<br />P3<br />P4<br />Pn-1<br />Pn<br />…<br />U3<br />W3<br />K3<br />C3<br ...
Fine grained control over confidentiality<br />Redacted (unclassified)  Document<br />Classified Document<br />Provenance ...
We can summarizeprovenance chains to save space, make audits fast<br />1:1 chain<br />n:1 chain<br />Each entry has 1 chec...
Scenario 2: Multiple Documents<br />How to verify exactly what happened?<br />
Co-Provenance provides stronger guarantees<br />Problem: We can only tell if a given provenance chain shows what could hav...
How to (efficiently) add secure provenance<br />to <br />real file systems?<br />
Implementation Options<br />Secure provenance management can be implemented in<br />Kernel layer<br />Transparent to user ...
Implementation: SPROV library<br /><ul><li>Automated capture of writes to files
Record all write data, and related context information in the provenance entry</li></ul>SPROV is an application layer libr...
Implementation: How Sprov works<br />Modified fopen / fwrite / fclose calls<br />File I/O calls trigger provenance chain h...
Experiment goals<br />Measure run time overhead of secure provenance<br />Run the benchmarks with and without secure prove...
Experimental settings<br />Crypto settings<br />1024 bit DSA signatures <br />128 bit AES encryption<br />SHA-1 for hashes...
Experimental settings<br />Chain storage:<br />Chains stored as XML formatted meta-files<br />Modes<br /><ul><li>Config-Di...
Config-RD: Provenance chains stored on RAM Disk buffer, and periodically saved to disk</li></li></ul><li>Postmark small fi...
Hybrid workloads: Simulating real file systems<br />    File system distribution: <br />File size distribution in real fil...
Hybrid workloads: Using Data from real systems<br /> Workload<br /><ul><li>INS: Instructional lab (1.1% writes) [Roselli 00]
RES: A Research lab (2.9% writes) [Roselli 00]
Upcoming SlideShare
Loading in...5
×

Fake Picassos, Tampered History, and Digital Forgery: Protecting the Genealogy of Bits with Secure Provenance

1,662

Published on

Ragib Hasan's talk on Secure Data Provenance.

Published in: Technology, Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,662
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Fake Picassos, Tampered History, and Digital Forgery: Protecting the Genealogy of Bits with Secure Provenance

  1. 1. Fake Picassos, Tampered History, and Digital Forgery:Protecting the Genealogy of Bits withSecure Provenance<br />Ragib Hasan<br />NSF Computing Innovation Fellow<br />Dept. of Computer Science, Johns Hopkins University<br />[Joint work with Marianne Winslett (UIUC) and Radu Sion(Stony Brook)]<br />Computer Science Seminar, Johns Hopkins University<br />February 2, 2010<br />
  2. 2. Let’s play a game<br />Real, worth $101.8 million<br />Fake, listed at eBay, worth nothing<br />Can you spot the fake Picasso?<br />
  3. 3. So, how do art buyers authenticate art?<br />Among other things, they look at Provenance records<br />Provenance: from Latin provenire ‘come from’, defined as <br />“(i) the fact of coming from some particular source or quarter; origin, derivation. <br /> (ii) the history or pedigree of a work of art, manuscript, rare book, etc.; a record of the ultimate derivation and passage of an item through its various owners” (Oxford English Dictionary)<br />In other words,Who owned it, what was done to it, how was it transferred …<br />Widely used in arts, archives, and archeology, called the Fundamental Principle of Archival<br />http://moma.org/collection/provenance/items/644.67.html<br />L&apos;artiste et son modèle (1928), at Museum of Modern Art<br />
  4. 4. Provenance of a book<br />Owner names recorded in sequential order on left side<br />4<br />
  5. 5. Let’s consider the Digital World<br />Am I getting back untampered data?<br />Was this data created and processed by persons I trust?<br />Unlike data processing in the past, digital data can be rapidly copied, modified, and erased<br />To trust data we receive from others or retrieve from storage, we need to look into the integrity of both the present state and the past historyof data<br />Data is generated, processed, and transmitted between different systems and principals, and stored in database/storage<br />Our life today has become increasingly dependent on digital data<br />Our Most Valuable Asset is Data<br />5<br />
  6. 6. Data History is especially important for mobile data<br />Celebrity death hoax<br />Resulting tweet/retweet cycle<br />Hoaxes propagates fast, so before retweeting, do check the original source<br />
  7. 7. The origin of data is important to measure its quality<br />Dr. Evil<br />Dr. Jones<br />Dataset is used by many unsuspecting scientists to produce new data<br />Unreliable experiments create faulty dataset<br />Tainted data only produces more tainted results<br />Reliable Lab<br />Dr. Jones got the data from a lab he trusts, but the original data was tainted, so any data derived from it will also be tainted<br />
  8. 8. Bottom line: History matters …<br />The origins of a data object can affect its quality and usefulness<br />The transformation path that a data object took affects its trustworthiness<br />In other words, data provenance is extremely important to decide data quality and trustworthiness<br />
  9. 9. What exactly is Data Provenance?<br />Definition*<br />Description of the origins of data and the process by which it arrived at the database. [Buneman et al.]<br />Information describing materials and transformations applied to derive the data. [Lanter]<br />Metadata recording the process of experiment workflows, annotations, and notes about experiments. [Greenwood]<br />Information that helps determine the derivation history of a data product, starting from its original sources. [Simmhan et al.]<br />9<br />*Simmhan et al. A Survey of Provenance in E-Science. SIGMOD Record, 2005.<br />
  10. 10. Example provenance systems<br />Simmhan et al., 2005<br />
  11. 11. What was the common theme of all those systems?<br />They were all scientific computing systems<br />And scientists trust people (more or less)<br />Previous research covers provenance collection, annotation, querying, and workflow, but security issues are not handled<br />For provenance in untrusted environments, we need integrity, confidentiality and privacy guarantees<br />So, we need provenance of provenance, i.e. a model for Secure Provenance<br />Data<br />
  12. 12. “History is not history unless it is the truth”. <br />Abraham Lincoln, 1856<br />
  13. 13. Problem Statement<br />Can we design an efficient system for securing the provenance of documents, and allow auditing with fine-grained preservation of confidentiality?<br />
  14. 14. Previous work on integrity assurances<br />(Logically) centralized repository (CVS, Subversion, GIT)<br />Changes to files recorded<br />Not applicable to mobile documents<br />File systems with integrity assurances (SUNDR, PASIS, TCFS)<br />Provide local integrity checking<br />Do not apply to data that traverses systems<br />System state entanglement (Baker 02)<br />Entangle one system’s state with another, so others can serve as witness to a system’s state<br />Not applicable to mobile data<br />Secure audit logs / trails (Schneier and Kelsey 99), LogCrypt (Holt 2004), (Peterson et al. 2006)<br />Trusted notary certifies logs, or trusted third party given hash chain seed<br />
  15. 15. Secure provenance means preventing “Undetectable history rewriting”<br />Preventundetectable modification of history, such that, <br />Adversaries cannot insert fake events, remove genuine events from a document’s provenance<br />No one can deny history of own actions<br />Allow fine grained preservation of privacy and confidentiality of actions<br />Users can choose which auditors can see details of their work<br />Attributes can be selectively disclosed or hidden without harming integrity check<br />15<br />
  16. 16. Usage and threat model<br />Bob<br />Alice<br />Charlie<br />Marvin<br />PAlice<br />PBob<br />PCharlie<br />PAlice<br />PBob<br />PCharlie<br />PMarvin<br />Audrey<br />Users: Edit documents on their machines<br />Documents: Are edited, transmitted to other users<br />Provenance entry = record of a user’s modifications and related context <br />Provenance chain = chronologically sorted list of entries; accompanies the document <br />Auditors: semi-trusted principals<br /><ul><li>All auditors can verify chain integrity
  17. 17. Only certain auditors can read each entry</li></ul>Adversaries: insiders or outsiders who<br /><ul><li>Add or remove history entries
  18. 18. Collude with others to add/remove entries
  19. 19. Claim a chain belongs to another document
  20. 20. Repudiate an entry </li></ul>Ragib Hasan, Radu Sion, and Marianne Winslett, “Introducing Secure Provenance: Problems and Challenges”, ACM StorageSS 2007<br />
  21. 21. Scenario 1: Securing a single document<br />
  22. 22. What realistic guarantees can we provide against a strong adversary?<br />We can’t expect or force the adversary to play by the books<br />Access control is not feasible in a loosely coupled distributed environment<br />So, we can only verify plausibility of a given history<br />
  23. 23. Plausible history is something that could have happened<br />Plausible history: The version history that will result if a document were created and subsequently edited and transferred from user to user, with provenance information correctly and indelibly recorded all along the way.<br />A document can have multiple plausible histories<br />
  24. 24. Original scheme: Ensure Verification of Plausible History<br />Given a document and a provenance chain, we want to verify if this could have happened<br />Forger’s goal: To create a plausible history involving honest users <br />
  25. 25. Analogy: The Forger and the Prada<br />Fake Prada, <br />street price $50<br />Forger’s goal is to create a plausible history for a fake Prada bag (i.e. make it appear to come from real Prada distributors via real supply chain)<br />Forger’s goal is NOT to replace a real Prada bag’s plausible history, to show it was made elsewhere.<br />Real Prada, price $3000<br />
  26. 26. Our solution: Overview<br />P1<br />P2<br />P3<br />P4<br />Pn-1<br />Pn<br />Provenance Chain<br />…<br />U3<br />W3<br />K3<br />C3<br />Pub3<br />Provenance Entry<br />Uid3<br />Pid3<br />Host3<br />IP3<br />time3<br />Ui= identity of the principal (lineage)<br />Ci= integrity checksum(s)<br />Ki = confidentiality locks for Wi<br />Wi= Encrypted modification log<br />
  27. 27. Our solution: Confidentiality<br />A single auditor<br />Modification log<br />Encrypted Modification log<br />Issues <br /><ul><li>Each user trusts a subset of the auditors
  28. 28. Only the auditor(s) trusted by the user can see the user’s actions on the document</li></ul>Multiple auditors<br />Encrypted Modification log<br />Modification log<br />Encrypted Modification log<br />Modification log<br />Encrypted Modification log<br />Optimization: Use broadcast encryption tree to reduce number of required keys<br />
  29. 29. Sidenote: Broadcast Encryption<br />Organize the keys as a tree, with auditors at leaves<br />Each principal knows all the keys from leaf to root<br />A<br />D<br />B<br />C<br />To trust everyone, use the root key<br />To trust only D<br />To trust B and C<br />To trust A and C and D<br />
  30. 30. Our solution: Confidentiality<br />P1<br />P2<br />P3<br />P4<br />Pn-1<br />Pn<br />…<br />U3<br />W3<br />K3<br />C3<br />Pub3<br />Wi= Eki (wi)|hash(D)<br />Ki = {Eka (ki) }<br />kiis a secret key that authorized auditors can retrieve from the field Ki<br />wi is either the diff or the set of actions taken on the file<br /><ul><li>ka is the key of a trusted auditor</li></li></ul><li>Our solution: Integrity<br />Old Provenance Entry<br />Old Checksum<br />New Provenance Entry<br />New Checksum<br />Hash<br />Sign<br />Ci = Sprivate_i(hash(Ui,Wi,Ki)|Ci−1)<br />
  31. 31. Fine grained control over confidentiality<br />Redacted (unclassified) Document<br />Classified Document<br />Provenance chain has sensitive info<br />Declassify / release<br />Deleting sensitive information will break integrity checks<br />P1<br />P2<br />P3<br />P4<br />P1<br />P2<br />P3<br />P4<br />Nonsensitive Information<br />Sensitive Information<br />Original attributes<br />commitment<br />Disclosable provenance entry<br />Nonsensitive Info<br />Sensitive info<br />Commit(sensitive info)<br />Checksum calculation<br />Blinded entry disclosed to third party<br />Nonsensitive Info<br />Commit(sensitive info)<br />
  32. 32. We can summarizeprovenance chains to save space, make audits fast<br />1:1 chain<br />n:1 chain<br />Each entry has 1 checksum, calculated from 1 previous checksum<br />Each entry has n checksums, each of them calculated from 1 previous checksum<br />We can systematically remove entries from the chain while still being able to prove integrity of chain<br />
  33. 33. Scenario 2: Multiple Documents<br />How to verify exactly what happened?<br />
  34. 34. Co-Provenance provides stronger guarantees<br />Problem: We can only tell if a given provenance chain shows what could have happened, but not exactly what really happened<br />Solution: Entangle chains of related documents<br />Each entry now contains n checksums, each computed using the checksum of the last entry of the provenance chain of a related document<br />
  35. 35. How to (efficiently) add secure provenance<br />to <br />real file systems?<br />
  36. 36. Implementation Options<br />Secure provenance management can be implemented in<br />Kernel layer<br />Transparent to user apps<br />But less portable<br />File System layer<br />Also transparent to user apps<br />Less portable<br />Application layer<br />Needs (slight) modification of apps<br />Highly portable<br />
  37. 37. Implementation: SPROV library<br /><ul><li>Automated capture of writes to files
  38. 38. Record all write data, and related context information in the provenance entry</li></ul>SPROV is an application layer library in C<br />Can be added to existing applications with almost no change to code<br />Provides the file system APIs from stdio.h<br />To add secure provenance, simply relinkapplications with sprov library instead of stdio.h<br />33<br />
  39. 39. Implementation: How Sprov works<br />Modified fopen / fwrite / fclose calls<br />File I/O calls trigger provenance chain handling mechanism<br />Calls to fwrite log the modifications<br />Calls to fclose writes a new provenance entry, computes new checksum<br />
  40. 40. Experiment goals<br />Measure run time overhead of secure provenance<br />Run the benchmarks with and without secure provenance<br />Calculate the % overhead<br />Observe the effects of different types of workloads<br />
  41. 41. Experimental settings<br />Crypto settings<br />1024 bit DSA signatures <br />128 bit AES encryption<br />SHA-1 for hashes<br />Experiments<br />Postmark benchmark : performance of writes on small files<br />Hybrid workload benchmark: performance with real life file system distribution and workloads<br />36<br />
  42. 42. Experimental settings<br />Chain storage:<br />Chains stored as XML formatted meta-files<br />Modes<br /><ul><li>Config-Disk : Provenance chains stored on Disk
  43. 43. Config-RD: Provenance chains stored on RAM Disk buffer, and periodically saved to disk</li></li></ul><li>Postmark small file benchmark: Overhead &lt; 5% for realistic workloads<br />20,000 small files (8KB-64KB) subjected to 100% to 0% write load with the Postmark benchmark<br />At 100% write load, execution time overhead of using secure provenance over the no-provenance case is approx. 27% (12% with RD)<br />At 50% write load, overheads go down to 16% (3% with RD)<br />Overheads are less than 5% with 20% or less write load<br />Config-Disk<br />Config-RD<br />100% writes, 0% reads<br />0% writes, 100% reads<br />
  44. 44. Hybrid workloads: Simulating real file systems<br /> File system distribution: <br />File size distribution in real file systems follows the log normal distribution [Bolosky and Douceur 99]<br />We created a file system with 20,000 files, using the lognormal parameters mu = 8.46, sigma = 2.4<br />Median file size = 4KB , mean file size = 80KB<br />In addition, we included a few large (1GB+) files<br />39<br />
  45. 45. Hybrid workloads: Using Data from real systems<br /> Workload<br /><ul><li>INS: Instructional lab (1.1% writes) [Roselli 00]
  46. 46. RES: A Research lab (2.9% writes) [Roselli 00]
  47. 47. CIFS-Corp: (15% writes) [Leung 08]
  48. 48. CIFS-Eng: (17% writes) [Leung 08]
  49. 49. EECS: (82% writes) [Ellard 03]</li></ul>40<br />
  50. 50. Typical real life workloads: 1 - 13% overhead<br />EECS<br />EECS<br />CIFS-corp/eng<br />CIFS-corp/eng<br />INS, RES<br />INS, RES<br />Config-Disk<br />Config-RD<br />INS and RES are read-intensive (80%+ reads), so overheads are very low in both cases.<br />CIFS-corp and CIFS-eng have 2:1 ratio of reads and writes, overheads are still low (range from 12% to 2.5%)<br />EECS has very high write load (82%+), so the overhead is higher, but still less than 35% for Config-Disk, and less than 7% for Config-RD<br />
  51. 51. Recording both read and writes costs 12-16% for real-life workloads<br />
  52. 52. Application-specific history requires different techniques<br />History integrity in databases<br />Developed a solution that leverages WORM storage to provide history integrity for database tuples<br />Incurs only 1%-3% overhead for TPC-C run-time<br />Reduces window of vulnerability 5-fold<br />Reduced audit times 100-fold over the previous state-of-the-art solution<br />History integrity in cloud computing<br />
  53. 53. Beyond Provenance<br />Provenance, versions, and snapshots are general cases of the property – remembrance<br />By enabling different objects to recall their past, we can introduce richer functionality, for example:<br />1.5th factor authentication: Use “What you remember” to strengthen any other authentication factor<br />Flexible security policies: Use knowledge of past of attributes/variables when evaluating security policies<br />Ragib Hasan, Radu Sion, and Marianne Winslett, “Remembrance: The Unbearable Sentience of Being Digital”, CIDR 2009<br />44<br />
  54. 54. Papers<br />Ragib Hasan, Radu Sion, and Marianne Winslett, “Introducing Secure Provenance: Problems and Challenges”, ACM StorageSS 2007<br />Ragib Hasan, Radu Sion, and Marianne Winslett, “The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance”, USENIX FAST 2009, ACM TOSDec 2009 issue<br />Ragib Hasan, Radu Sion, and Marianne Winslett, “Remembrance: The Unbearable Sentience of Being Digital”, CIDR 2009<br />45<br />
  55. 55. Summary: Secure provenance possible at low cost<br />Yes, We CAN achieve secure provenance with integrity and confidentiality assurances with reasonable overheads<br />For most real-life workloads, overheads are between 1% and 15% only<br />More info at http://tinyurl.com/secprov<br />
  56. 56. Usage of Provenance<br />Data Quality – depends on input data and actions taken on data<br />Replication – a recipe for recreating data<br />Attribution – proper attribution of copyright<br />Informational – learn how data was derived<br />Audit Trails –proof of due diligence; investigation of problem sources, system intrusions, proof of prior work in patents<br />47<br />

×