In the world of Big Data, there has been a lot of research into creating efficient algorithms that help us gain statistical insight from the large databases that record much of our lives. However, as our digital footprint grows, many databases that were originally considered anonymous can now be re-identified. How do we make sure that doesn't happen?
Data Privacy and Anonymization
1. Big Data and Attacks on Privacy: How to Properly Anonymize Social Networks and Databases (and Keep Them That Way)
AC 298r Final Presentation
Ryan Lee and Jeffrey Wang
2. Obligatory Social Network Stats
http://www.mediabistro.com/alltwitter/files/2013/11/growth-of-social-media-2013.jpg
3. Uses of Social Data: Research
Bollen et al. (2011).
CS109 Harvard Univ.
Fall 2013
Christakis & Fowler (2010). Christakis & Fowler (2007).
5. Chang, R., Lee, A., Ghoniem, M., Kosara, R., Ribarsky, W., Yang, J., ... & Sudjianto, A. (2008). Scalable and interactive visual analysis of financial wire transactions for fraud detection. Information Visualization, 7(1), 63-76.
Uses of Social Data: Government
7. Naive Approach: Anonymization
Name | Favorite Pizza | Favorite Course
Ryan Lee | Supreme | AC298r
Jeffrey Wang | Pepperoni | AC298r
Daniel Weinstock | Anchovies | AC298r
8. Naive Approach: Anonymization
Name | Favorite Pizza | Favorite Course
Ryan Lee | Supreme | AC298r
Jeffrey Wang | Pepperoni | AC298r
Daniel Weinstock | Anchovies | AC298r
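As a rough illustration of the naive approach, the sketch below drops the name column and substitutes a sequential ID. The function name `strip_names` and the `ID` field are hypothetical, introduced only for this example; the remaining attributes are released unchanged, which is exactly what makes the approach fragile.

```python
def strip_names(rows, name_field="Name"):
    """Drop the name column and label each row with a sequential ID instead."""
    anonymized = []
    for index, row in enumerate(rows, start=1):
        # Keep every attribute except the name itself.
        released = {key: value for key, value in row.items() if key != name_field}
        released["ID"] = index
        anonymized.append(released)
    return anonymized


rows = [
    {"Name": "Ryan Lee", "Favorite Pizza": "Supreme", "Favorite Course": "AC298r"},
    {"Name": "Jeffrey Wang", "Favorite Pizza": "Pepperoni", "Favorite Course": "AC298r"},
    {"Name": "Daniel Weinstock", "Favorite Pizza": "Anchovies", "Favorite Course": "AC298r"},
]
released_rows = strip_names(rows)
```

Anyone who already knows a person's favorite pizza can still pick out that person's row, even though the name is gone.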
13. Netflix De-anon: How they did it
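● 500,000-record dataset was super-sparse
● Link Netflix “anonymized” data to public data (IMDb, Twitter, blogs, etc.)
● Match two ratings of the same movie if:
time difference < threshold
movie rating difference < threshold
● Matched public records supply the names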
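The matching rule on this slide can be sketched roughly as follows. This is a simplified illustration, not the scoring used in the actual attack: the record fields (`movie`, `date`, `rating`), the thresholds `max_days` and `max_rating_gap`, and the `min_matches` cutoff are all assumptions made for the example.

```python
from datetime import date


def matches(anon_record, public_record, max_days=14, max_rating_gap=2):
    """True when two ratings of the same movie are close in date and in score."""
    same_movie = anon_record["movie"] == public_record["movie"]
    days_apart = abs((anon_record["date"] - public_record["date"]).days)
    rating_gap = abs(anon_record["rating"] - public_record["rating"])
    return same_movie and days_apart < max_days and rating_gap < max_rating_gap


def link_users(anon_ratings, public_ratings, min_matches=2):
    """Map each anonymous ID to the public name with the most matching ratings."""
    links = {}
    for anon_id, anon_records in anon_ratings.items():
        best_name, best_score = None, 0
        for name, public_records in public_ratings.items():
            # Count anonymous ratings that have at least one public counterpart.
            score = sum(
                1 for a in anon_records
                if any(matches(a, p) for p in public_records)
            )
            if score > best_score:
                best_name, best_score = name, score
        if best_score >= min_matches:
            links[anon_id] = best_name
    return links


# Toy data, invented for illustration only.
anon_ratings = {
    "user_17": [
        {"movie": "Brazil", "date": date(2005, 3, 1), "rating": 5},
        {"movie": "Gattaca", "date": date(2005, 4, 10), "rating": 4},
    ],
}
public_ratings = {
    "alice": [
        {"movie": "Brazil", "date": date(2005, 3, 3), "rating": 5},
        {"movie": "Gattaca", "date": date(2005, 4, 12), "rating": 4},
    ],
    "bob": [
        {"movie": "Brazil", "date": date(2005, 9, 1), "rating": 2},
    ],
}
links = link_users(anon_ratings, public_ratings)
```

Because the dataset is so sparse, a handful of approximate matches is usually enough to single out one candidate, which is why the thresholds can be loose.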
15. “Anonymized” Cell Phone Data
de Montjoye, Y. A., Hidalgo, C. A., Verleysen, M., & Blondel, V. D. (2013). Unique in the Crowd: The privacy bounds of human mobility. Scientific reports, 3.
18. A Tough Problem
DOB, Gender, and ZIP Code are enough to uniquely identify 87% of US Citizens
Sweeney, Fuzziness and Knowledge-based Systems, 2002
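One way to measure this kind of exposure on a given table is to count how many rows are unique on their quasi-identifiers. The sketch below assumes records stored as dictionaries; the function name `unique_fraction` and the field names `dob`, `gender`, and `zip` are hypothetical, and the toy rows are invented.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears only once."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Toy data, invented for illustration only.
people = [
    {"dob": "1980-01-02", "gender": "F", "zip": "02138"},
    {"dob": "1980-01-02", "gender": "F", "zip": "02138"},
    {"dob": "1975-06-30", "gender": "M", "zip": "02138"},
    {"dob": "1991-11-11", "gender": "F", "zip": "02139"},
]
share = unique_fraction(people, ("dob", "gender", "zip"))
```

In this toy table two of the four rows share a combination, so half of the rows are uniquely identifiable from the three fields alone.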
19. Solution?
First | Last | Age | Race
Harry | Stone | 34 | African American
John | Reyser | 36 | Caucasian
Beatrice | Stone | 34 | African American
John | Delgado | 22 | Hispanic
Sweeney, Fuzziness and Knowledge-based Systems, 2002
20. Solution: Suppression and Generalization
First | Last | Age | Race
Harry | Stone | 34 | African American
John | Reyser | 36 | Caucasian
Beatrice | Stone | 34 | African American
John | Delgado | 22 | Hispanic
k=2: Polynomial Solution! (Simplex Matching)
k>=3: NP-Hard (Graph Decomposition)
Sweeney, Fuzziness and Knowledge-based Systems, 2002
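A minimal sketch of suppression on the table above, assuming the rows have already been paired by hand. Choosing the grouping that suppresses the least information is the hard part the slide refers to, and this example does not attempt it. The function names `suppress_within_group` and `is_k_anonymous` are hypothetical.

```python
from collections import Counter


def suppress_within_group(group):
    """Replace every attribute on which the group's rows disagree with '*'."""
    released = []
    for row in group:
        released.append(tuple(
            value if all(other[i] == value for other in group) else "*"
            for i, value in enumerate(row)
        ))
    return released


def is_k_anonymous(rows, k):
    """True when every released row is identical to at least k - 1 others."""
    counts = Counter(rows)
    return all(count >= k for count in counts.values())


# Columns: first, last, age, race (the table from the slide).
table = [
    ("Harry", "Stone", 34, "African American"),
    ("John", "Reyser", 36, "Caucasian"),
    ("Beatrice", "Stone", 34, "African American"),
    ("John", "Delgado", 22, "Hispanic"),
]

# Assumed pairing, chosen by hand for this example.
groups = [[table[0], table[2]], [table[1], table[3]]]
released = [row for group in groups for row in suppress_within_group(group)]
```

With this pairing, the two Stone rows keep last name, age, and race but lose the first name, while the two John rows keep only the first name. The released table satisfies k=2, and the original table does not.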
21. Differential Privacy
● A user is at most e^ε times more likely to be identified by participating in the database than by choosing not to participate
Dwork, ICALP, 2006
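One standard way to achieve differential privacy for a counting query is the Laplace mechanism, sketched below. The slide does not say which mechanism the authors had in mind, so this is an assumption. Adding or removing one person changes a count by at most 1, so noise with scale 1/ε suffices. The function names `laplace_noise` and `private_count` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng=random):
    """Noisy count of records satisfying predicate (Laplace noise, scale 1/epsilon)."""
    true_count = sum(1 for record in records if predicate(record))
    # A count has sensitivity 1: one person changes it by at most 1.
    return true_count + laplace_noise(1 / epsilon, rng)


# Toy data, invented for illustration only.
ages = [34, 36, 34, 22]
rng = random.Random(0)  # Seeded only so the example is repeatable.
noisy = private_count(ages, lambda age: age > 30, epsilon=0.5, rng=rng)
```

A smaller ε means more noise and stronger privacy; a larger ε gives answers closer to the true count of 3 but protects each individual less.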
22. Anonymity in Social Networks
Peter S. Bearman, James Moody, and Katherine Stovel, Chains of affection: The structure of adolescent romantic and sexual networks, American Journal of Sociology 110, 44-91 (2004).
http://www-personal.umich.edu/~mejn/networks/addhealth.gif
High School Dating Network
23. Information-rich Network Structure
Backstrom, L., & Kleinberg, J. (2013). Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook. arXiv preprint arXiv:1310.6753.
24. Attacks on Social Networks
● Passive: Find yourselves
● Active: structural steganography
http://www.cse.psu.edu/~asmith/courses/privacy598d/www/lec-notes/Attacking%20Social%20Network%20FINAL.pdf
Planted subgraph requirements:
● No other subgraph isomorphic to it
● No nontrivial automorphism
26. Part 1: Construct Min-DFS Tree for Neighborhood
Zhou and Pei, KAIS, 2011
27. 2 Useful Properties
1. Social networks follow a power-law degree distribution
2. Social networks typically have a small diameter (6 degrees of separation)
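Both properties can be checked on a graph directly. The sketch below computes a degree histogram and the diameter by breadth-first search; it assumes a connected, undirected graph stored as an adjacency dictionary, and the toy graph is invented for illustration.

```python
from collections import Counter, deque


def degree_histogram(adjacency):
    """Count how many nodes have each degree."""
    return Counter(len(neighbors) for neighbors in adjacency.values())


def eccentricity(adjacency, source):
    """Longest shortest-path distance from source, found by breadth-first search."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return max(distance.values())


def diameter(adjacency):
    """Largest eccentricity over all nodes (assumes the graph is connected)."""
    return max(eccentricity(adjacency, node) for node in adjacency)


# Toy graph: one hub with several low-degree neighbors.
graph = {
    "a": {"b", "c", "d", "e"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a"},
    "e": {"a", "f"},
    "f": {"e"},
}
histogram = degree_histogram(graph)
graph_diameter = diameter(graph)
```

Even in this small example most nodes have degree 1 while one hub has degree 4, and no pair of nodes is more than 3 hops apart. Running one search per node is fine for a toy graph but would need sampling or approximation on a real social network.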