• Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
352
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
7
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Large Scale Threats to Data Anonymity Arvind Narayanan Joint work with Vitaly Shmatikov Kamalika Chaudhuri
  • 2. Anonymity is not cryptography
    • Small “keyspace” – random guessing succeeds with probability 1/N
      • Natural upper bound on N – the race is over !
      • Guess-and-verify paradigm
    • Even quadratic algorithms sometimes feasible!
    • Conventional wisdom relied on computational infeasibility
  • 3. The curse of dimensionality
    • Too much entropy per record
    • How high is high?
      • Try 35,540!
    • k-anonymity breaks down
    • Nearest neigbhor too far
      • Cinematch beats baseline by 1%!
    • Projection to low dimensions loses most of the info
  • 4. Auxiliary information
    • Auxiliary information about people very easy to obtain
    • Unlinkability of user traces – unaffordable luxury
    • Yet linking across databases often disastrous
    • Future privacy – linkage of “profile” to identify makes virtual identities impossible
  • 5. Two fallacies
    • Identifying vs. non-identifying attributes
      • All attributes are quasi-identifiers!
      • Simply removing record labels is not sufficient
    • Perturbation makes attacker’s task harder
      • Note superficial similarity with LPN
      • But non-cryptographic!
      • Reality: re-identification algorithms easily made noise resilient
  • 6. Interactive protocols
    • Severe computational limits
      • Query-execute-analyze cycle
    • Utility required may be non-statistical
      • Database may even be non-relational
    • Privacy for queries
    • Data aggregator not trusted
      • Algorithms in distributed setting not well developed yet
  • 7. Sad realization #2
    • Privacy usually an afterthought
      • (not important until it affects you)
      • Video privacy act example
    • Privacy vs. utility:
      • Collect/release the data, ask questions later
  • 8. Sweeney – linking (exact match) (Anonymous) (Non-anonymous) Hardly secret Probably not secret
  • 9. Collaborative filtering: profiles
    • Each of N users has a preference vector, or a preference profile
    • One attribute for each item
    • Goal: mine this database to predict preferences for new items
    • Can we release an anonymized database of preference vectors?
  • 10. Movielens – fuzzy match
    • Hypothetical investigation
    • Frankowski, Cosley, Sen, Terveen, Riedl.
    • Anonymized database of movie ratings
    • Attacker knows small number of approximate preferences
    • Nearest neighbor stats confirmed
  • 11. Netflix – fuzzy match with noise
    • Nearest neighbor graph
    • Real attack, Narayanan & Shmatikov
    • ~ 4 movies -> unique re-identification
      • know either ratings or dates approximately
      • one of the data points can be completely wrong
    • Found a couple of our friends
    • Found a couple of users from IMDb
  • 12. Distance to nearest neighbor
  • 13. Netflix’s take on privacy Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation . Of course, since you know all your own ratings that really isn’t a privacy problem is it? -- Netflix Prize FAQ
  • 14. Example of deanonymization
  • 15. Netflix – contributions
    • Scoring tolerates large amount of noise
      • ∑ i Є M ∩ M’ [ e - α |r i - r i ‘| + c e - β |d i - d i ‘| + Γ ] / log #i
    • Verifying deanonymization in absence of oracle
      • [score(max) – score(max2)] / std.dev(score)
    • Extract user relationships
  • 16. Netflix customers with distance < 0.15
    • Could edges reflect real-life relationships?
    • Ratings and dates were ignored
  • 17. Recommenders: stronger attacks
    • Do recommendation systems inherently leak profile?
      • No data release!
    • Theoretical attacks known
      • Textbook systems
      • Deployed, complex systems
  • 18. Social networks
    • Graph of interactions between people
      • Think of phone call graphs
    • Different type of profile
      • Non – relational data
  • 19. Backstrom, Dwork, Kleinberg
    • Active and passive attacks
    • Re-identify nodes touched by malicious edges
    • Easy to find graph-structured patterns in large database
  • 20. Narayanan, Chaudhuri
    • Tolerates noise
    • Several attacks where a user can re-identify own node
    • Subgraph isomorphism with several hundred nodes
      • Heuristics involving node labels
    • User knows own degree exactly
      • Modern phones store all calls
      • Who deletes email anymore?
  • 21. Finding yourself
    • N instances of graph isomorphism
    • Use isomorphism-invariant signatures
  • 22.  
  • 23. Propagation of node re-identification
    • Surprisingly small number of seeds (6-12)
    • Large fraction of nodes compromised
    • Works even when large fraction (say 80%) of nodes are honest
  • 24.  
  • 25.  
  • 26. Propagation – implementation
    • Social phishing
    • Buddy zoo
    • Skype worm
    • Online addressbook service
    • Competing social network
  • 27. Author identification
    • Basically, a solved problem
      • However, most studies use a small set of authors
      • Not clear how well sample size required scales
    • Combine with typing pattern profiling
      • Possibly deanonymize among millions/billions of users
    • Example: oppressive country
  • 28. Genome anonymity
    • Rich social network
    • ~10^8 bits entropy per record
    • Labeled sample compromises privacy of blood relatives
    • Crossover happens in precise, elegant way
      • Work on admixing populations
    • Story of deanonymization of sperm donor
    • Ease of obtaining auxiliary data or anonymous samples
  • 29. Genome and DNA databases
    • Hapmap – entire genome
    • “ Family tree” services
    • 1/800 births from “anonymous” sperm donor
  • 30. Hapmap’s take on privacy The samples are anonymous with regard to individual identity. Samples cannot be connected to individuals, and no personal information is linked to any sample. As an additional safeguard, more samples were collected from each population than were used, so no one knows whether any particular person's DNA is included in the study.
  • 31.
    • Trait                   Genes        Chromosome location
    • Hair/iris color             ASIP                20 q11.2
    • Hair/iris color             DCT                13 q32
    • Green/blue iris            EYCL1            19 p13.1-q13.11
    • Brown/blue iris           EYCL3            15 q11-q15 *
    • Height                          GH1                17 q22-q24
    • Height (Laron)             GHR                  5 p13-p12
    • Brown/blond hair        HCL1              19 p13.1-q13.11
    • Brown/blond hair        HCL3              15 q11-q15   *
    • Brown/red hair            HCL2                4 q28-q31
    • Hair/iris color             HPS1               10 q23.1-23.3
    • Hair/iris color             HPS2               10 q24.32
    • Skin&hair color          MC1R             16 q24.3
    • Height (Marfan)          MFS                15 q21.1
    • Hair/iris color             MITF                 3 p12.3-14.1
    • Hair/iris color             MYO5A          15 q21
    • Ocular albinism           OA1                X p22.3  
    • Ocular albinism           OA2                X p11.4-p11.23
    • OcculoCut.Albinism    OCA2              15 q11.2-q12  
    • Hair/iris color             PMOC               2 p23.3
    • Hair/iris color             RAB27A         15 q15-21.1
    • Hair/iris color             SILV                12 q13-q14
    • Skin color                    SLC24A5        15 q21.1 A111T dark to light skin
    • Short Stature                SS                    X&Y p
    • Hair/iris color             TYR                11 q14-q21
    • Hair/iris color             TYRP1              9 p23
  • 32. Genotype – phenotype mappings
    • The medical community finds genotype -> phenotype mappings
      • Mappings being generated “at an explosive rate”
    • But also:
      • [Sweeney02]: Inferring genotype from clinical phenotype through a knowledge based algorithm
        • focuses on pathological phenotypes
  • 33. Underlying social network
  • 34. Big picture
    • Attacks against a wide spectrum of rich, high-dimensional datasets
    • Can we win the battle?
      • Using technology alone?
      • What if we don’t?
      • Is part of it already lost?
  • 35. Thanks for coming.
  • 36. Current work
    • Sweeney – exact match
    • Movielens – fuzzy match
    • Netflix – fuzzy match with noise
    • AOL
    • BDK07 – match on non-relational data
    • NC07 – non-relational data with noise
    • Amazon – fuzzy match with noise on utility oracle
    • Genome – match based on multiple databases
    • Genome – phenotype/genotype mapping
  • 37. Future work
    • Author identification
      • Combine with typing pattern profiling
      • Oppressive country example
    • Genome reidentification based on observables
    • Underlying social network
    • SAT solver – generic matching