
Sharing Sensitive Data Securely


These are the slides from my recent talk at FAR Con in Minneapolis. The topics are the implications of buried treasure hoards for data security, horror stories, and new, simpler and provably secure methods for public data disclosure.

Published in: Software

  1. © 2014 MapR Technologies
  2. Agenda
     • Two kinds of security failure
       – Buried treasure
     • But what could go wrong?
       – Horror stories
     • Sharing into controlled environments
       – Views, masking and fine-grained control
     • Sharing without sharing
       – When masking is not sufficient
     • Summary
  3. Locked Up Tight – The Cheapside Hoard
     • Between 1640 and 1666 somebody hid a cache of jewels under the floor of 30-32 Cheapside, London
     • They never came back for them …
     • The hoard was found by workmen in 1912
     • Did the owners forget where they were?
     • Why didn’t their heirs or partners recover them?
  4. The Other Kind of Security Failure
     • Security can fail when there is a leak
       – Enigma decryption
       – Retail data compromise
       – Klaus Fuchs
     • Security also fails when data is not shared
       – AKA siloing
       – The many threads of 9/11
       – The Cheapside hoard
       – Invisible technological opportunity cost
  5. Netflix
     • Shared anonymized data
     • Huge boost in state of the art for some kinds of recommendations
     • Anonymization shown to be weak barrier
     • Lawsuit, security clamp-down everywhere
  6. Reference Data Attack
     • Netflix: opaque id, [{date,movie,rating}...]
     • IMDB: opaque id, [{date,movie}...]
     • Combined database
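The reference data attack on slide 6 can be illustrated with a toy sketch. It assumes exact matches on (date, movie) pairs and a hypothetical `min_overlap` threshold; the record names and values are made up, and a real attack would need fuzzier matching than this.

```python
def link_records(anonymized, public, min_overlap=2):
    """Map each opaque id to the public identity sharing the most (date, movie) pairs."""
    matches = {}
    for opaque_id, ratings in anonymized.items():
        # Drop the rating: the public reference data only has date and movie.
        events = {(date, movie) for date, movie, _rating in ratings}
        best_name, best_overlap = None, 0
        for name, reviews in public.items():
            overlap = len(events & set(reviews))
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        # Only accept a link when enough events coincide (assumed threshold).
        if best_overlap >= min_overlap:
            matches[opaque_id] = best_name
    return matches


# Made-up "anonymized" release and made-up public reference data.
anonymized = {
    "u1": [("2005-03-01", "Movie A", 5), ("2005-03-04", "Movie B", 2),
           ("2005-04-10", "Movie C", 4)],
    "u2": [("2005-03-02", "Movie A", 3), ("2005-05-01", "Movie D", 1)],
}
public = {
    "alice": [("2005-03-01", "Movie A"), ("2005-03-04", "Movie B")],
    "bob": [("2005-03-02", "Movie A")],
}

matches = link_records(anonymized, public)
print(matches)
```

Even this crude overlap count ties the opaque id `u1` to a named reviewer, which is the point of the slide: the opaque id protects nothing once a second dataset shares some of the same events.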
  7. The Moral
     • If there is something to correlate, anonymization may fail
     • When I say “may”, you should read “will”
  8. NY Cab
     • Hack license and medallion number hashed using MD5
     • No correlation data to work with
     • But cab (medallion) numbers have only a few forms
     • So we can generate hashes for all 20 million (or so) possible medallion numbers
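A minimal sketch of why hashing a small identifier space gives no protection. The identifier format used here (digit, letter, digit, digit) is an assumption chosen for illustration, not the actual set of medallion forms; the approach is the same for any space small enough to enumerate.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Return the MD5 hex digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_reverse_table():
    """Hash every candidate in the assumed format and index the candidates by hash."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"  # hypothetical format, e.g. "7B42"
        table[md5_hex(candidate)] = candidate
    return table


reverse_table = build_reverse_table()

# A "masked" value as it might appear in a released dataset.
masked = md5_hex("7B42")
recovered = reverse_table[masked]
print(len(reverse_table), recovered)
```

The table for this format has only 26,000 entries and builds almost instantly. A space of 20 million candidates is larger but still trivially enumerable, so an unsalted one-way hash over it is effectively reversible.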
  9. So What?
     • What correlations are there?
     • NYC medallions are public information anyway
     • Taxis operate in the public realm
  10. So What?
  11. Paparazzo + Timestamp + Taxi = Who and Where
     • See http://gawker.com/the-public-nyc-taxicab-database-that-accidentally-track-1646724546
     • http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/
  12. Extended Moral
     • Correlations are more common than we thought
     • Masking PII is not sufficient for public datasets
     • Theoretically, no solution is possible
     • Pragmatically, never bet against cleverness
     • Must change the game
  13. Alternative Strategies
     • Public disclosure + Simple masking
  14. Key Elements of Masking
     • Opaque or format preserving?
     • Random or reversible or one-way?
     • Simple omission?
     • Right to be forgotten?
  15. Releasing Public Data
     • Why?
       – Required
       – For research
       – For support
     • How?
       – New technology based on KPI-preserving random data
     • Three use cases
  16. Secure Development is Hard
     (Diagram labels: System knowledge, Observed data, Training algorithm, Model, New measurements, Anomaly scores, Model deployment)
  17. Secure Development is Hard
     (Same diagram as slide 16, with added notes)
     • Outside collaborators are outside the security perimeter
     • They can’t see the data and they can’t tune new algorithms to fit reality
  18. How To Make Realistic Data
     (Diagram labels: Live data, System under test, Failure signatures; Fake data, Failure signatures)
  19. Parametric Simulation
     (Diagram labels: Live data, System under test, Failure signatures; Fake data, System under test, Failure signatures; "Match here" marks the two sets of failure signatures)
     • Parametric matching of failure signatures allows emulation of complex data properties
     • Matching on KPIs and failure modes guarantees practical fidelity
  20. The Method
     • Pick realistic and important KPIs and failure measures
       – False positive rate
       – Scale invariant score distribution
       – Internal performance metrics (# of candidates searched, similar)
     • Build emulation roughly based on real system
     • Tune data spec to match KPIs using real models
     • Export data spec to alternative models
     • Re-tune data spec to match on alternative models
  21. Example #1 – Query failure
     • Performance index is query failure with particular stack signature
     • Tuning knobs include
       – Table sizes
       – Data distributions
       – (potentially) field value realism
       – (potentially) field cross correlations
  22-26. The Original Conversation (built up over five slides)
     Them: Hive broke, fix it.
     Us: Sure! Can I see the data?
     Them: No.
     Us: OK. Can I see the stack trace?
     Them: No.
     Us: Can I log in to the system?
     Them: No.
     Us: What do you want me to do?
     Them: Fix it.
  27. The Broken Query
  28. A Simpler Example Schema
     • sales: sales_id (PK), customer_id (FK), time_id (FK), store_id (FK), item_id (FK), quantity, unit_price, discount
     • customer: customer_id (PK), name, street1, city, state, zip
     • time: time_id (PK), year, month, time, day, quarter
     • store: store_id (PK), name, street, city, state, zip, region
     • item: item_id (PK), SKU, description
  29. A Simpler Example
     (Same schema as slide 28, shown with two data specs)
     Spec with the customer fields:
       [
         {"name":"customer_id", "class":"id"},
         {"name":"name", "class":"name", "type":"first_last"},
         {"name":"street", "class":"address"},
         {"class":"flatten", "value": {"class":"zip", "fields":"city,state,zip"}}
       ]
     Spec with the sales fields:
       [
         {"name":"sales_id", "class":"id"},
         {"name":"customer_id", "class":"foreign-key", "size":"$customers"},
         {"name":"time_id", "class":"foreign-key", "size":"$times"},
         {"name":"store_id", "class":"foreign-key", "size":"$stores"},
         {"name":"item_id", "class":"foreign-key", "size":"$items"},
         {"name":"quantity", "class":"int", "skew":0.5},
         {"name":"unit_price", "class":"gamma", "dof":1, "scale":10},
         {"name":"discount", "class":"uniform", "min":0, "max":20},
         {"name":"exact_time", "class":"event", "start": "2014-01-01", "format":"yyyy-MM-dd HH:mm:ss", "rate": "10/d"}
       ]
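To make the idea of a data spec concrete, here is a toy interpreter for a few of the field classes shown on slide 29. It is not the tool used in the talk: the semantics given to `id`, `foreign-key`, `gamma` and `uniform` are assumptions, the `$customers` placeholder is replaced by a literal size, and the function names `make_row` and `generate` are hypothetical.

```python
import random


def make_row(spec, row_number, rng):
    """Build one record by drawing a value for each field in the spec."""
    row = {}
    for field in spec:
        kind = field["class"]
        if kind == "id":
            value = row_number  # assumed: sequential key
        elif kind == "foreign-key":
            value = rng.randrange(field["size"])  # assumed: uniform over the referenced table
        elif kind == "uniform":
            value = rng.uniform(field["min"], field["max"])
        elif kind == "gamma":
            # assumed: "dof" is the shape parameter and "scale" the scale parameter
            value = rng.gammavariate(field["dof"], field["scale"])
        else:
            raise ValueError(f"unsupported class: {kind}")
        row[field["name"]] = value
    return row


def generate(spec, count, seed=0):
    """Generate `count` records from a spec with a seeded generator."""
    rng = random.Random(seed)
    return [make_row(spec, i, rng) for i in range(count)]


# A cut-down version of the sales spec, with a literal table size.
sales_spec = [
    {"name": "sales_id", "class": "id"},
    {"name": "customer_id", "class": "foreign-key", "size": 100},
    {"name": "unit_price", "class": "gamma", "dof": 1, "scale": 10},
    {"name": "discount", "class": "uniform", "min": 0, "max": 20},
]

rows = generate(sales_spec, 5)
for row in rows:
    print(row)
```

Because the spec is data rather than code, the tuning knobs from slide 21 (table sizes, distributions) become parameters that can be adjusted until the fake data reproduces the failure.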
  30. Data Flow
     (Diagram labels: Python: generate.py; templates; synth: items, times, sales, stores, customers; csv: items, times, sales, stores, customers)
  31. Sample Data
       customer_id,name,street,zip,city,state
       0,"Mark Long","8578 Pied River Flats","02630","BARNSTABLE","MA"
       1,"Chris Lanier","90018 Lost Treasure Corner","06083","ENFIELD","CT"
       2,"Bryant Brandon","30712 Bright Shadow Stroll","93922","CARMEL","CA"
       3,"Norman Horn","66871 Dewy Bird Shoal","59727","DIVIDE","MT"
       4,"Carmen Nowell","6053 Velvet Barn Glen","29329","CONVERSE","SC"
  32. Results
     • We had to match size, number of records, rough levels of skew
     • Bug was in query planner
       – For particular values of relative table size, planner messed up
     • Once we had the fault, we could slim down the tables
       – Final example had 3 tables, 1000 records in largest
  33. Common Point of Compromise
     • Scenario:
       – Merchant 0 is compromised, leaks account data during compromise
       – Fraud committed elsewhere during exploit
       – High background level of fraud
       – Limited detection rate for exploits
     • Goal:
       – Find merchant 0
     • Meta-goal:
       – Screen algorithms for this task without leaking sensitive data
  34. Simulation Setup
     (Plot: count versus day for days 0 to 100, showing compromises and frauds, with the compromise period and exploit period marked)
  35. Simulation Strategy
     • For each consumer
       – Pick consumer parameters such as transaction rate, preferences
       – Generate transactions until end of sim-time
         • If merchant 0 during compromise time, possibly mark as compromised
         • For all transactions, possibly mark as fraud, probability depends on history
     • Merchants are selected using hierarchical Pitman-Yor
     • Restate data
       – Flatten transaction streams
       – Sort by time
     • Tunables
       – Compromise probability, transaction rates, background fraud, detection probability
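The simulation strategy on slide 35 can be sketched as follows. This is a simplified stand-in, not the simulator from the talk: merchant choice uses a plain 1/(rank+1) weighting instead of a hierarchical Pitman-Yor process, the detection probability is left out, and every parameter name and default value is an assumption.

```python
import random


def simulate(n_consumers=200, n_merchants=50, days=100,
             compromise_window=(20, 30), exploit_window=(40, 60),
             p_compromise=0.5, p_background_fraud=0.005,
             p_exploit_fraud=0.2, seed=1):
    """Return (time, consumer, merchant, is_fraud) tuples sorted by time."""
    rng = random.Random(seed)
    merchants = list(range(n_merchants))
    # Skewed merchant popularity; a stand-in for hierarchical Pitman-Yor.
    weights = [1.0 / (m + 1) for m in merchants]
    transactions = []
    for consumer in range(n_consumers):
        rate = rng.uniform(0.1, 1.0)  # per-consumer transactions per day
        compromised = False
        t = rng.expovariate(rate)
        while t < days:
            merchant = rng.choices(merchants, weights)[0]
            in_compromise = compromise_window[0] <= t < compromise_window[1]
            if merchant == 0 and in_compromise and rng.random() < p_compromise:
                compromised = True
            # Fraud probability depends on the consumer's history.
            p_fraud = p_background_fraud
            if compromised and exploit_window[0] <= t < exploit_window[1]:
                p_fraud = p_exploit_fraud
            transactions.append((t, consumer, merchant, rng.random() < p_fraud))
            t += rng.expovariate(rate)
    # Restate: the per-consumer streams are already flattened; sort by time.
    transactions.sort()
    return transactions


transactions = simulate()
frauds = sum(1 for _, _, _, is_fraud in transactions if is_fraud)
print(len(transactions), "transactions,", frauds, "marked as fraud")
```

The keyword arguments play the role of the tunables on the slide: they would be adjusted until the synthetic stream matches the performance indicators measured on the real data.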
  36. Performance Indicators to Match
     • User and merchant population
     • Transaction count/consumer
     • Merchant propensity skew
     • Level of detected fraud
     • Spectrum of meta-model scores
  37. (no text on slide)
  38. Real bad guys
  39. Results
     • We matched general mechanism, rough transaction rates
     • Model was tuned on synthetic data, tested on live data
     • We found real bad guys on the first try
  40. Summary
     • Security can fail through too much and too little access
     • Sharing widely can have significant benefits and substantial risks
     • New levels of control available for masking and filtering of big data via Drill views
     • Synthetic data with KPI matching provides sharing of realistic data without risk
  41. Questions
  42. Thank You
     • Ted Dunning, Chief Application Architect, MapR Technologies
     • tdunning@mapr.com, tdunning@apache.org
     • @mapr, maprtech, mapr-technologies
