Big Data and Attacks on
Privacy: How to Properly
Anonymize Social
Networks and Databases
(and Keep Them That Way)
AC 298r Final Presentation
Ryan Lee and Jeffrey Wang
Obligatory Social Network Stats
http://www.mediabistro.com/alltwitter/files/2013/11/growth-of-social-media-2013.jpg
Uses of Social Data: Research
Bollen et al. (2011).
CS109 Harvard Univ.
Fall 2013
Christakis & Fowler (2010). Christakis & Fowler (2007).
Uses of Social Data: Marketing
Facebook.com
Bio-Rad
Chang, R., Lee, A., Ghoniem, M., Kosara, R., Ribarsky, W., Yang, J., ... & Sudjianto, A. (2008). Scalable and interactive
visual analysis of financial wire transactions for fraud detection. Information visualization, 7(1), 63-76.
Uses of Social Data: Government
Challenge: Privacy
Naive Approach: Anonymization
Name              Favorite Pizza   Favorite Course
Ryan Lee          Supreme          AC298r
Jeffrey Wang      Pepperoni        AC298r
Daniel Weinstock  Anchovies        AC298r
Priority: Security
Concern: Digital Footprint
NSA Data Warehouse
Deanonymization is Possible
Sweeney, Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
Netflix Prize 2
Netflix De-anon: How they did it
● The ~500,000-subscriber dataset was super-sparse: each subscriber rated only a tiny fraction of the movies
Netflix “Anonymized” Data
Public Data (IMDb, twitter, blogs, etc.)
Match if:
difference in rating date < threshold
difference in movie rating < threshold
Names
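The matching rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in the actual Netflix study: the record layout `(movie, rating, date)`, the function names, the thresholds, and the toy data are all hypothetical.

```python
from datetime import date

def matches(anon_record, public_record, max_days=14, max_rating_gap=2):
    """True if two (movie, rating, date) tuples plausibly describe the same rating."""
    movie_a, rating_a, date_a = anon_record
    movie_p, rating_p, date_p = public_record
    return (movie_a == movie_p
            and abs((date_a - date_p).days) < max_days
            and abs(rating_a - rating_p) < max_rating_gap)

def best_candidate(anon_users, public_profile):
    """Score each anonymized user by how many public ratings they match."""
    scores = {
        user_id: sum(any(matches(r, p) for r in records) for p in public_profile)
        for user_id, records in anon_users.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical "anonymized" release: user ids are opaque, ratings are real.
anon_users = {
    "u1": [("Brazil", 5, date(2005, 3, 1)), ("Alien", 4, date(2005, 3, 9))],
    "u2": [("Brazil", 2, date(2004, 7, 4)), ("Heat", 3, date(2005, 1, 2))],
}
# Hypothetical public reviews posted under a real name.
public_profile = [("Brazil", 5, date(2005, 3, 3)), ("Alien", 5, date(2005, 3, 10))]

best, scores = best_candidate(anon_users, public_profile)
print(best, scores)
```

Because the data is sparse, even a couple of approximate matches tends to single out one candidate, which is why the thresholds can afford to be loose.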
Surnames in Genomic Sequences
TACATA is a real last name...
“Anonymized” Cell Phone Data
de Montjoye, Y. A., Hidalgo, C. A., Verleysen, M., & Blondel, V. D. (2013). Unique in the Crowd: The privacy bounds of human mobility. Scientific reports, 3.
Defenses (lol JK)
K-Anonymity
Sweeney, Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
A Tough Problem
DOB, gender, and ZIP code are enough to
uniquely identify 87% of the US population
Sweeney, Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
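To make the uniqueness claim concrete, here is a minimal sketch of how one might measure the share of records that are unique on a set of quasi-identifiers. The field names and the four toy records are hypothetical and bear no relation to the census data behind the 87% figure.

```python
from collections import Counter

def fraction_unique(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Hypothetical records with no names attached.
records = [
    {"dob": "1980-01-02", "gender": "F", "zip": "02138"},
    {"dob": "1980-01-02", "gender": "F", "zip": "02138"},
    {"dob": "1975-06-30", "gender": "M", "zip": "02138"},
    {"dob": "1991-11-11", "gender": "F", "zip": "02139"},
]

share = fraction_unique(records, ["dob", "gender", "zip"])
print(share)
```

Any record counted here could be re-identified by joining on the same three fields in a named dataset such as a voter roll.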
Solution?
First     Last     Age  Race
Harry     Stone    34   African American
John      Reyser   36   Caucasian
Beatrice  Stone    34   African American
John      Delgado  22   Hispanic
Sweeney, Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
Solution: Suppression and
Generalization
First     Last     Age  Race
Harry     Stone    34   African American
John      Reyser   36   Caucasian
Beatrice  Stone    34   African American
John      Delgado  22   Hispanic
k=2: Polynomial Solution! (Simplex Matching)
k>=3: NP-Hard (Graph Decomposition)
Sweeney, Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
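A rough sketch of suppression and generalization on the four-row table, with a k-anonymity check. The choice of which cells to suppress is our own assumption for illustration (it is one way to reach k=2, not necessarily the one on the slide), and the helper names are hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, k):
    """Every released row must be identical to at least k-1 other rows."""
    counts = Counter(tuple(row) for row in rows)
    return all(c >= k for c in counts.values())

def suppress(row, columns):
    """Suppression: replace the cells at the given column indexes with '*'."""
    return tuple("*" if i in columns else v for i, v in enumerate(row))

def generalize_age(age):
    """Generalization: replace an exact age with its ten-year band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

original = [
    ("Harry", "Stone", 34, "African American"),
    ("John", "Reyser", 36, "Caucasian"),
    ("Beatrice", "Stone", 34, "African American"),
    ("John", "Delgado", 22, "Hispanic"),
]

# Assumed release: hide first names of the Stones, keep only "John" for the rest.
released = [
    suppress(original[0], {0}),
    suppress(original[1], {1, 2, 3}),
    suppress(original[2], {0}),
    suppress(original[3], {1, 2, 3}),
]

print(is_k_anonymous(original, 2), is_k_anonymous(released, 2))
print(generalize_age(34), generalize_age(36))
```

Checking a candidate release is cheap; the hard part named on the slide is finding the release that suppresses the fewest cells.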
Differential Privacy
● Any query result is at most e^ε times more likely when a user's record
is in the database than when it is not, so choosing to participate
barely changes the user's risk of being identified
Dwork, ICALP, 2006
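One standard way to meet this guarantee for counting queries is the Laplace mechanism: add Laplace noise with scale 1/ε to the true count, since adding or removing one record changes a count by at most 1. The sketch below is a minimal illustration using only the standard library; the function names and the pizza data are hypothetical, and a production system would need more care (for example, with floating-point side channels).

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Noisy count of matching records; a count has sensitivity 1."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical survey answers.
pizzas = ["Supreme", "Pepperoni", "Anchovies", "Pepperoni"]
noisy = private_count(pizzas, lambda p: p == "Pepperoni", epsilon=0.5)
print(noisy)
```

Smaller ε means more noise and stronger privacy; the answer is randomized, so repeated runs print different values.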
Anonymity in Social Networks
Peter S. Bearman, James Moody, and Katherine Stovel, Chains of
affection: The structure of adolescent romantic and sexual networks,
American Journal of Sociology 110, 44-91 (2004).
http://www-personal.umich.edu/~mejn/networks/addhealth.gif
High School Dating Network
Information-rich Network Structure
Backstrom, L., & Kleinberg, J. (2013). Romantic Partnerships and the Dispersion of Social Ties: A Network
Analysis of Relationship Status on Facebook. arXiv preprint arXiv:1310.6753.
Attacks on Social Networks
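● Passive: users find themselves in the released graph, then identify their neighbors
● Active: structural steganography (plant a recognizable subgraph before the release)
http://www.cse.psu.edu/~asmith/courses/privacy598d/www/lec-notes/Attacking%20Social%20Network%20FINAL.pdf
The planted subgraph must have no other isomorphic copy in the network
The planted subgraph must have no non-trivial automorphism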
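The automorphism condition can be checked by brute force for a small planted subgraph. This sketch covers only that one condition (it does not search the full network for other isomorphic copies), and the example graphs are our own, not taken from the linked notes.

```python
from itertools import permutations

def automorphisms(n, edges):
    """All vertex relabelings of an n-vertex undirected graph that preserve its edge set."""
    edge_set = {frozenset(e) for e in edges}
    found = []
    for perm in permutations(range(n)):
        mapped = {frozenset((perm[u], perm[v])) for u, v in edges}
        if mapped == edge_set:
            found.append(perm)
    return found

# A 3-vertex path can be flipped end to end, so its endpoints are indistinguishable.
path = [(0, 1), (1, 2)]

# A center with legs of length 1, 2, and 3 admits only the identity relabeling.
spider = [(0, 1), (0, 2), (2, 3), (0, 4), (4, 5), (5, 6)]

print(len(automorphisms(3, path)), len(automorphisms(7, spider)))
```

If the count is 1, every planted account has a unique structural position, so the attacker can tell the accounts apart after the names are stripped. The search is factorial in the number of vertices, which is acceptable only because the planted subgraph is small.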
Obfuscating Social Networks
Zhou and Pei, KAIS, 2011
Step 1: Construct Min-DFS Tree for
Each Vertex's Neighborhood
Zhou and Pei, KAIS, 2011
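Before any encoding can be built, each vertex's neighborhood has to be extracted. The sketch below does only that step, assuming "neighborhood" means the subgraph induced by a vertex's immediate neighbors; it does not compute the minimum DFS encoding itself. The adjacency-dict representation and the function name are hypothetical.

```python
def neighborhood(adj, u):
    """Subgraph induced by the immediate neighbors of u (u itself excluded)."""
    nodes = set(adj[u])
    return {v: adj[v] & nodes for v in nodes}

# Hypothetical friendship graph as vertex -> set of neighbors.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
}

print(neighborhood(adj, 0))
print(neighborhood(adj, 2))
```

Two vertices whose neighborhoods are isomorphic cannot be told apart by an attacker who knows only who a target's friends are and which of them are friends with each other.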
2 Useful Properties
1. Vertex degrees in social networks follow a power-law distribution
2. Social networks typically have a small diameter ("six degrees of separation")
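Both properties are easy to measure on a given graph. The sketch below computes a degree histogram and the diameter by breadth-first search; it assumes a connected, undirected graph stored as an adjacency dict, and the star-shaped toy graph is hypothetical.

```python
from collections import Counter, deque

def degree_histogram(adj):
    """Map each degree to the number of vertices that have it."""
    return Counter(len(neighbors) for neighbors in adj.values())

def eccentricity(adj, source):
    """Longest shortest-path distance from source, found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    """Largest eccentricity over all vertices (graph assumed connected)."""
    return max(eccentricity(adj, v) for v in adj)

# One hub with five friends who do not know each other.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}

print(degree_histogram(star), diameter(star))
```

A few high-degree hubs and many low-degree vertices mean most neighborhoods are small, and a small diameter keeps the number of hops to consider low; both help keep neighborhood-based anonymization tractable.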
Step 2: Anonymize Similar Vertices
Zhou and Pei, KAIS, 2011
Step 3: ??? => Step 4: Profit!
Zhou and Pei, KAIS, 2011
thanks
bye
