On the Anonymity of Home/Work
Location Pairs



Philippe Golle
Kurt Partridge

Presented by: Bo Begole
Location traces are useful for
   personalized location-based services
    [Figure: a location history table (time, location, cuisine), with entries such as 11:57-12:45 at 37°26’39”, -122°9’38” and 1:22-1:31 at 37°23’11”, -122°9’02”, feeds an inferred restaurant-preference table (time slots such as 12:00 to 1:00 by weekday, price levels $ and $$, cuisines such as Chinese and Italian), which in turn drives personalized recommendations.]

                [Magitti, CHI 2008]
Location traces can be sensitive
    They can reveal business connections, political affiliation, medical conditions, risky behaviors, etc.

    They can damage reputations, unfairly raise insurance premiums, increase divorce rates, etc.

    Some ways to mitigate the risk

    – “Only my trusted network provider knows my location”
        » Location data is often shared with third-party LBS providers
    – “Privacy policies and legislation protect my location traces”
        » The data collector may fall victim to attacks, or be dishonest itself
    – “I’ll report my location only infrequently when I need service”
        » Many services require frequent location updates (e.g. friend finder)
    – “I’ll report my location pseudonymously”
        » Watch out for inferences!
Sometimes possible to infer identity by
combining data sources
    [Figure: two linked tables. Voter Registration: Name, Street address, …, Gender, Postal code, Date of Birth. Patient Records: Gender, Postal code, Date of Birth, Cancer Type. The shared fields are Gender, Postal code and Date of Birth.]

    Inference: what can be learned from combining existing knowledge with newly acquired information

    E.g.: learn the medical record of William Weld [Swe00]
    – Knowing the birth date and ZIP code of the governor of Massachusetts
    – Can retrieve his health records from a supposedly anonymous database of state employee health-insurance claims

    87% of the US population have a unique combination of date of birth, gender and postal code.
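
The linkage described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every record, name, field name and diagnosis label below is made up, and the `link` helper is hypothetical rather than anything from the cited work. It joins a named list with a "de-identified" list on the shared fields and reports the records whose combination matches exactly one named person.

```python
# Hypothetical records: all names and values are invented for illustration.
voters = [
    {"name": "Alice Example", "gender": "F", "postal_code": "00001", "dob": "1950-01-01"},
    {"name": "Bob Example", "gender": "M", "postal_code": "00001", "dob": "1945-07-31"},
    {"name": "Carol Example", "gender": "F", "postal_code": "00002", "dob": "1950-01-01"},
]
patients = [
    {"gender": "M", "postal_code": "00001", "dob": "1945-07-31", "diagnosis": "condition-1"},
    {"gender": "F", "postal_code": "00003", "dob": "1960-05-05", "diagnosis": "condition-2"},
]

# Fields present in both tables (the quasi-identifiers).
QUASI_IDS = ("gender", "postal_code", "dob")


def link(voters, patients):
    """Return (name, diagnosis) pairs for patient records matching exactly one voter."""
    names_by_key = {}
    for voter in voters:
        key = tuple(voter[q] for q in QUASI_IDS)
        names_by_key.setdefault(key, []).append(voter["name"])

    matches = []
    for patient in patients:
        key = tuple(patient[q] for q in QUASI_IDS)
        names = names_by_key.get(key, [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], patient["diagnosis"]))
    return matches


reidentified = link(voters, patients)
```

Only the first patient record is linked here, because its three shared fields match a single voter; the second matches nobody and stays anonymous.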
Inferences from location data
    With two weeks’ worth of GPS data collected from a subject’s car, Krumm [Pervasive 2007] showed we can infer:
    – Home address (median error < 60 m)

    More than 5% of subjects are identifiable by combining this with a
    – Reverse geo-coder
    – Web-based white pages directory
Preventing inferences [Krumm 07]

Data suppression   Noise    Rounding
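
As a rough illustration of the three countermeasures named on this slide, the sketch below applies each one to a short list of (latitude, longitude) points. It is not Krumm's implementation: the function names, the parameters and the simplification of treating degrees as planar coordinates are all assumptions made for this example.

```python
import math
import random


def suppress(points, center, radius_deg):
    """Data suppression: drop every point within radius_deg of a sensitive place."""
    return [p for p in points if math.dist(p, center) > radius_deg]


def add_noise(points, sigma_deg, rng):
    """Noise: perturb each coordinate with zero-mean Gaussian noise."""
    return [(lat + rng.gauss(0, sigma_deg), lon + rng.gauss(0, sigma_deg))
            for lat, lon in points]


def round_to_grid(points, cell_deg):
    """Rounding: snap each coordinate to the nearest multiple of cell_deg."""
    return [(round(lat / cell_deg) * cell_deg, round(lon / cell_deg) * cell_deg)
            for lat, lon in points]


# Hypothetical trace: two points near "home" and one elsewhere.
trace = [(37.4442, -122.1606), (37.4443, -122.1605), (37.3864, -122.1506)]
home = (37.4442, -122.1606)

kept = suppress(trace, home, 0.01)
noisy = add_noise(trace, 0.001, random.Random(0))
rounded = round_to_grid(trace, 0.01)
```

With these values, suppression removes the two points near home, and rounding maps them to the same grid cell, so they can no longer be told apart.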
The K-anonymity Guarantee

    Problem

    – If a location trace is unique…
    – It may be combined with other public data
    – To generate undesirable inferences

    Solution: k-anonymity [Sweeney 2000]

     – Principle: “Data is safe to release if at least k
       people share the same attributes”
     – Anonymity set: set of people with same attributes
     – K-anonymity: size of the anonymity set
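
A minimal sketch of these definitions, assuming records are plain dictionaries: the anonymity set of a record is the group of records sharing its values on the chosen attributes, and a release is k-anonymous when the smallest such group has at least k members. The helper names and the toy records are hypothetical.

```python
from collections import Counter


def anonymity_set_sizes(records, attributes):
    """For each record, count the records sharing its values on `attributes`."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def satisfies_k_anonymity(records, attributes, k):
    """True when every record's anonymity set has at least k members."""
    return min(anonymity_set_sizes(records, attributes)) >= k


# Hypothetical toy population.
people = [
    {"city": "C1", "gender": "M", "birth_year": 1945},
    {"city": "C1", "gender": "M", "birth_year": 1945},
    {"city": "C1", "gender": "F", "birth_year": 1945},
    {"city": "C1", "gender": "F", "birth_year": 1950},
]

by_city = anonymity_set_sizes(people, ["city"])
by_all = anonymity_set_sizes(people, ["city", "gender", "birth_year"])
```

Each added attribute can only shrink the sets: everyone shares the city, but the two women become unique once birth year is included.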
K-anonymity Example
     –   Cancer DB: (Post code: 34012, DOB 7/31/1945, Male)
     –   Goal: ensure 5-anonymity
     –   (Cambridge MA)                       54,805 people
     –   (Cambridge MA, male)                26,854 people
     –   (Cambridge MA, 55 years old)          2,096 people
     –   (Cambridge MA, DOB: 7/31/1945)            6 people
     –   (Cambridge MA, DOB: 7/31/1945, Male)      3 people
     –   (Post code: 34012, DOB: 7/31/1945, Male) 1 person

    Combine with voter registration: William Weld

    The same can be done for 69 percent of the 54,805 people on the voting list of Cambridge, MA.
Inferences from location data
    What level of accuracy of location information could result in a small k-anonymity for some portion of the population?

    Assume a location trace reveals approximately …

    – the region where someone lives (county or postal code)
    – and also the region where someone works (county or postal code)

    County and postal code are coarse-grained

    – A trace that reveals less is probably not very useful
    – Useful traces in fact reveal considerably more!
    – Therefore, our results are an upper bound on the size of
      the anonymity set (things could be worse!)
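
The effect of revealing both regions can be sketched with a toy population. The region labels and the `set_sizes` helper below are hypothetical and stand in for real county or tract identifiers. The sketch compares each worker's anonymity set when only the home region, only the work region, or the pair is known.

```python
from collections import Counter

# Hypothetical workers as (home_region, work_region) pairs.
workers = [
    ("A", "X"), ("A", "X"), ("A", "Y"), ("B", "X"),
    ("B", "Y"), ("B", "Y"), ("A", "X"), ("C", "Y"),
]


def set_sizes(keys):
    """For each key, count how many entries in the list share it."""
    counts = Counter(keys)
    return [counts[k] for k in keys]


home_only = set_sizes([home for home, _ in workers])
work_only = set_sizes([work for _, work in workers])
both = set_sizes(workers)
```

The pair can never give a larger anonymity set than either region alone, which is why combining home and work is the case worth measuring.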
Data source
    Longitudinal Employer-Household Dynamics (LEHD)
    – Census Bureau program to compile information about
      where people work and where they live
       » County
       » Census Tract ~= US Postal code
    – Also records age, earnings and distribution across
      industries.
    – 103,289,243 workers from 42 states

    LEHD is a synthetic dataset

    – “The key statistical property to preserve in the
      synthetic data is the joint distribution of workers
      across home and work areas”
    – 3 implicates (synthetic datasets) are available for
      researchers to check robustness of results
Size of live/work anonymity set

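    [Figure: two plots of the size of the anonymity set (vertical axis, logarithmic) against the fraction of the population (horizontal axis), with curves for home only, work only, and both home & work. Left plot, census tract granularity (axis from 1 to 100,000): with both home & work, 7% have k-anonymity of 1 and 50% have k-anonymity of 21. Right plot, county granularity (axis from 1 to 10,000,000): 5% have k-anonymity of 92 and 50% have k-anonymity of 35,000.]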
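
Statistics such as "50% have k-anonymity of 21" are percentiles of the per-person anonymity set sizes. The sketch below shows one way to read them off, assuming a list with one set size per person. The function names and the toy sizes are hypothetical, and the percentile convention used (smallest value covering the requested fraction) is an assumption, not necessarily the one used by the authors.

```python
import math


def size_at_fraction(sizes, fraction):
    """Smallest set size s such that at least `fraction` of people have size <= s."""
    ordered = sorted(sizes)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]


def fraction_at_most(sizes, limit):
    """Fraction of people whose anonymity set has at most `limit` members."""
    return sum(1 for s in sizes if s <= limit) / len(sizes)


# Hypothetical per-person anonymity set sizes.
sizes = [3, 3, 1, 1, 2, 2, 3, 1]

median_size = size_at_fraction(sizes, 0.5)
unique_share = fraction_at_most(sizes, 1)
```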
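Influence of living & working in the
same region versus different regions
    [Figure: two plots of the size of the anonymity set on a logarithmic vertical axis (1 to 5,000 on the left at census tract granularity, 1 to 1,000,000 on the right at county granularity), with curves for workers who live and work in the same location, workers who live and work in different locations, and all workers.]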
Conclusion: Even coarse grained
location traces can identify individuals
    Approximate home and work locations narrow a person to a very small anonymity set

                Geographic Region Size         Median Anonymity Set Size
                County                         34,980
                Census Tract (~postal code)    21
                Census Block                   1

    Being unique is not quite the same as being identifiable

     – But the identity of a specific individual may be found by combining other
       sources (white pages, patient records, police records, private
       knowledge, employer records, Facebook pages, …)

    We can estimate the k-anonymity of location-aware and other context-aware technologies prior to widespread adoption using public datasets
     –   Population Census
     –   Longitudinal Employer-Household Dynamics (LEHD)
     –   American Time Use Survey (ATUS)
     –   Japan Statistics Bureau Survey on Time Use and Leisure Activities
Techniques to Estimate Privacy Effects of
Context-Aware Technologies
    K-anonymity (size of the anonymity set) is a metric to characterize the level of privacy

    Can estimate the k-anonymity of context-aware technologies with public datasets
    –   Population Census
    –   Longitudinal Employer-Household Dynamics (LEHD)
    –   American Time Use Survey (ATUS)
    –   Japan Statistics Bureau Survey on Time Use and Leisure Activities

    K-anonymity is not a perfect measure

    – May still leak information
         » L-diversity [MGKV 2006]
         » ε-differential privacy [Dwork et al]
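
One way k-anonymous data can still leak information: if everyone in an anonymity set shares the same sensitive value, an observer learns that value without singling anyone out. The sketch below reports both k and the simplest form of l (the number of distinct sensitive values in the least diverse group). The field names, the records and the `k_and_l` helper are hypothetical, and distinct counting is only one of several l-diversity variants.

```python
from collections import defaultdict


def k_and_l(records, quasi_ids, sensitive):
    """Return (k, l): the smallest group size and the fewest distinct sensitive values per group."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_ids)
        groups[key].append(record[sensitive])
    k = min(len(values) for values in groups.values())
    l = min(len(set(values)) for values in groups.values())
    return k, l


# Hypothetical records: home and work regions plus a sensitive visited venue.
records = [
    {"home": "A", "work": "X", "venue": "clinic"},
    {"home": "A", "work": "X", "venue": "clinic"},
    {"home": "A", "work": "X", "venue": "clinic"},
    {"home": "B", "work": "Y", "venue": "clinic"},
    {"home": "B", "work": "Y", "venue": "gym"},
    {"home": "B", "work": "Y", "venue": "school"},
]

k, l = k_and_l(records, ["home", "work"], "venue")
```

Here the data is 3-anonymous, yet anyone known to live in A and work in X is revealed to have visited the clinic, because that group has only one distinct sensitive value.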


Editor's Notes

  1. Thank you. I'll be presenting the work of my colleagues Philippe Golle, who is possibly the nicest man on the planet, and Kurt Partridge, who is probably the most healthy-living person I know. The research is an exploration of privacy risk from location data.
  2. To explore this question, we turn to a publicly available data set called the Longitudinal Employer-Household Dynamics, which contains data on more than 100 million workers in the US, including the county and census tract (which is roughly similar to a postal code) where they live and work. I need to point out that the publicly available LEHD is actually synthesized from the real data set to preserve privacy. However, the key characteristic of the synthesis is that the distribution of where people live and where they work is preserved in the synthetic data. The Census Bureau makes three synthetic sets available, and Philippe and Kurt have conducted their analysis on all three sets with identical results.
  3. Here are the results. The left diagram is for location at the granularity of census tract; the right is at the granularity of county. The vertical axis is the size of the anonymity set on a logarithmic scale. The horizontal axis is the fraction of the population. The curves plot the fraction of the population who have an anonymity set of a specific size. The red curve is derived from looking only at where a person works and the green is for where they live. Notice that for those data alone, 50% of the population has an anonymity set size of 1,000 or more. But if you combine where a person lives with where they work, the anonymity set for 7% of the population is just 1, and for 50% of the population it is only as high as 21. On the county scale, the anonymity set size is substantially higher, with the 5th percentile at 92 and the median at about 35,000.
     Median size of anonymity set: census tract, 21 people; county, 35,000 people.
     5th percentile size of anonymity set: census tract, 1 person; county, 92 people.
     – County. There are 2,784 counties and county equivalents (boroughs, parishes) in the 42 states included in the LEHD dataset.
     – Census Tract. Census tracts are county subdivisions defined by the U.S. Census Bureau. In terms of size and number of residents, they are roughly comparable to ZIP codes. The LEHD dataset contains 64,881 tracts where workers live and 52,852 tracts where they work. The mean number of workers living in a tract is 1,592 and the median is 1,479. The mean number of people working in a tract is 1,954 and the median is 951 (it ranges from zero in completely residential tracts to 166,680 in a tract of Los Angeles County).
  4. Workers who live and work in different locations (brown stars) have a much smaller anonymity set (i.e. are less anonymous) than those who live and work in the same location (blue diamonds). This holds true both for locations revealed at census tract granularity (left graph) and at county granularity (right graph). The traces of workers who cross location boundaries to go to work are particularly vulnerable to re-identification attacks. Across the states included in the LEHD dataset, 94.1% of workers live and work in different census tracts. These numbers show that the extra anonymity afforded by living and working in the same census tract is uncommon. The fraction of workers who live and work in different counties is 43.5%. The percentages range from 63.8% in Virginia to 10.7% in Hawaii. Living and working in the same county is more frequent, and therefore has a bigger effect on the national average.
  5. The point is that even coarse-grained location traces can render individuals uniquely identifiable. At US county granularity the anonymity set is fairly large, with a k of about 35 thousand for 50% of the population. For census tract and census block, which is described in the paper, k is quite small for 50% of the US population. Such small anonymity sets can leave some portion of the population vulnerable to being uniquely identified when the traces are combined with data from other sources. One final takeaway is this: it is possible to understand the privacy risk of context awareness before we attain widespread adoption of the technologies. There are publicly available data sets that can serve as surrogates for context technologies, such as the LEHD, which was described in this paper. In other work we have used census data itself, and there is a data set of a large number of individuals' self-described daily activities at the granularity of 15-minute time periods. The Japan statistics bureau provides a similar data set, and there are probably comparable data from other governments. Write questions on a 1000 yen note and I will take it to the authors. Do you have any questions that I can take back to the authors?
  6. We and others in this research field have been collecting location traces to personalize location-based services. From location traces, we can infer a person's individual preferences. For example, in the Magitti project that we reported at CHI 2008, we used location traces to identify a person's activity patterns and store and restaurant preferences, which were used to improve the quality of recommendations for local venues.
  7. We know that some users are uncomfortable with the existence of records of their movements, for a variety of reasons. In addition to restaurant and store preferences, such traces can reveal business connections, political affiliation, medical conditions and other aspects of a person's life that they might not want to expose, so as to avoid undesired consequences. There are a number of ways to mitigate the risk of such exposure. You may trust that a service provider will not misuse the data, but sometimes they share such data with business affiliates. Legislation may be enacted to prevent misuse, but malicious parties may not abide by the laws. You can also report your data only infrequently, as needed, but some services require frequent updates in order to be useful. Finally, you can use a false name, such as a user ID number, with the data. The risk here is that someone could potentially re-identify you from some unique characteristic of the data.
  ----
  Another potential objection: do all the processing locally. The problem is that it consumes bandwidth for services that integrate with other data, and doesn't work well for social applications.
  "Pseudonymously": for example, obtain service under a user ID that cannot be tied to the user's real identity.
  8. So, for example, here we have two separate publicly available databases. One is the voter registration records, which record a person's name, date of birth, gender and postal code. The second is a record of state employee health claims for cancer, which carefully does not include the names of the people who made claims. However, in 2000, Sweeney used these databases to discover that William Weld, the governor of Massachusetts at the time, had cancer. Sweeney's analysis of US census data showed that 87% of the U.S. population is uniquely identifiable given their full date of birth (year, month and day), sex, and the ZIP code where they live.
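  The linking step can be sketched as a join on the shared attributes. This is a minimal illustration, assuming both tables carry date of birth, sex and ZIP code; the field names, the `link` helper and all records are made up and do not come from Sweeney's data.

```python
from collections import defaultdict

# Attributes assumed to appear in both databases (quasi-identifiers).
QUASI_IDENTIFIERS = ("birth_date", "sex", "zip")

def link(named_rows, anonymous_rows, keys=QUASI_IDENTIFIERS):
    """Attach names to anonymous rows whose quasi-identifiers match exactly one named row."""
    names_by_key = defaultdict(list)
    for row in named_rows:
        names_by_key[tuple(row[k] for k in keys)].append(row["name"])

    matches = []
    for row in anonymous_rows:
        names = names_by_key.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # unique match: the row is re-identified
            matches.append((names[0], row["diagnosis"]))
    return matches

# Hypothetical voter list (named) and health claims (no names).
voters = [
    {"name": "Alice Example", "birth_date": "1960-01-01", "sex": "F", "zip": "02138"},
    {"name": "Bob Example", "birth_date": "1955-05-05", "sex": "M", "zip": "02139"},
    {"name": "Carl Example", "birth_date": "1955-05-05", "sex": "M", "zip": "02139"},
]
claims = [
    {"birth_date": "1960-01-01", "sex": "F", "zip": "02138", "diagnosis": "condition X"},
    {"birth_date": "1955-05-05", "sex": "M", "zip": "02139", "diagnosis": "condition Y"},
]

reidentified = link(voters, claims)
```

  Only the first claim is re-identified here. The second matches two voters, so its anonymity set has size 2 and the name stays ambiguous.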
  9. Returning now to location traces, John Krumm showed at Pervasive 2007 that you can infer a person's home address with a median error of 60 meters from 2 weeks of GPS location data. Combining that with a reverse geocoder and online white pages, he was able to accurately identify more than 5% of the people.
  10. Krumm went on to show a number of techniques that obscure an individual's data traces, making it much more difficult to identify a particular individual. He showed that you can hide location points near the person's home, or add noise to all of the data points, or round the location samples to points on a grid. One of the conclusions of Krumm's work was that location obfuscation defeats re-identification, but at the cost of a significant loss in location precision, which could render some location-based services unusable. Yet there is no guarantee that even this degree of location obfuscation protects against all other re-identification attacks.
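  The three obfuscation ideas can be sketched in a few lines. This is an illustrative sketch only, not Krumm's implementation: it assumes points are (x, y) positions in meters in a local planar frame, uses Gaussian noise as one possible noise model, and the function names and parameter values are hypothetical.

```python
import math
import random

def hide_near_home(points, home, radius_m):
    """Drop every point within radius_m of the home location."""
    return [p for p in points if math.dist(p, home) > radius_m]

def add_noise(points, sigma_m, rng=random):
    """Perturb each coordinate with zero-mean Gaussian noise (assumed noise model)."""
    return [(x + rng.gauss(0, sigma_m), y + rng.gauss(0, sigma_m)) for x, y in points]

def snap_to_grid(points, cell_m):
    """Round each point to the nearest node of a square grid with spacing cell_m."""
    return [(round(x / cell_m) * cell_m, round(y / cell_m) * cell_m) for x, y in points]

# Hypothetical trace in meters, with the home at the origin.
home = (0.0, 0.0)
trace = [(10.0, 20.0), (130.0, 260.0), (900.0, 40.0)]

hidden = hide_near_home(trace, home, radius_m=100)
noisy = add_noise(trace, sigma_m=50, rng=random.Random(0))
gridded = snap_to_grid(trace, cell_m=250)
```

  Larger values of `radius_m`, `sigma_m` or `cell_m` give more protection but less precise locations, which is the trade-off described above.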
  11. So, to summarize, if a location trace is unique, it may be combined with other public data to generate undesirable inferences. But how big is the risk? What can we use to estimate the degree of risk? In 2000, Sweeney proposed a privacy metric called k-anonymity, which states that data are safe to release if at least "k" people share the same attributes. Those k people are called the anonymity set, and k describes the size of that set.
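  The definition translates directly into a count. The sketch below assumes a dataset is a list of records and that k is the size of the smallest group of records sharing the same values for the released attributes; the records and attribute names are invented for illustration.

```python
from collections import Counter

def anonymity_set_sizes(records, attributes):
    """For each record, count how many records share its values on `attributes`."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]

def k_anonymity(records, attributes):
    """The dataset is k-anonymous for the largest k that every record satisfies."""
    return min(anonymity_set_sizes(records, attributes), default=0)

# Hypothetical records.
records = [
    {"zip": "02138", "sex": "F", "birth_year": 1960},
    {"zip": "02138", "sex": "F", "birth_year": 1960},
    {"zip": "02138", "sex": "M", "birth_year": 1945},
    {"zip": "02139", "sex": "M", "birth_year": 1945},
]

k_sex = k_anonymity(records, ["sex"])
k_zip_sex = k_anonymity(records, ["zip", "sex"])
```

  Releasing only `sex` gives k = 2 for this toy data, while releasing `zip` and `sex` together drops k to 1, meaning at least one person is unique.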
  12. For example, recall the cancer database. Suppose we had a goal of ensuring k-anonymity of at least 5. If the dataset revealed only the city in which the patients lived, the anonymity set size would be 54 thousand. Adding gender reduces it to 26 thousand, which is still quite good. We can even add the age of the patient, reducing the anonymity set to approximately 2 thousand. But when we disclose date of birth, plus gender, plus postal code, the anonymity set goes down to 6, 3 and finally to just a single individual. The same can be done for 69 percent of the 54,805 people on the voting list of Cambridge, MA.
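  The shrinking anonymity set can be traced for one person as each attribute is disclosed in turn. This is a toy sketch: the six records, the attribute names and the helper functions are hypothetical and are not the Cambridge data.

```python
def matching_count(records, target, attributes):
    """Number of records that agree with `target` on every listed attribute."""
    return sum(all(r[a] == target[a] for a in attributes) for r in records)

def anonymity_by_disclosure(records, target, attribute_order):
    """Anonymity set size of `target` after disclosing the first 1, 2, ... attributes."""
    return [
        matching_count(records, target, attribute_order[:i])
        for i in range(1, len(attribute_order) + 1)
    ]

# Hypothetical population; the first record is the person of interest.
people = [
    {"city": "Cambridge", "sex": "M", "birth_date": "1960-01-01", "zip": "02138"},
    {"city": "Cambridge", "sex": "M", "birth_date": "1960-01-01", "zip": "02139"},
    {"city": "Cambridge", "sex": "M", "birth_date": "1972-06-15", "zip": "02138"},
    {"city": "Cambridge", "sex": "F", "birth_date": "1960-01-01", "zip": "02138"},
    {"city": "Cambridge", "sex": "F", "birth_date": "1981-03-09", "zip": "02139"},
    {"city": "Boston", "sex": "M", "birth_date": "1960-01-01", "zip": "02108"},
]

sizes = anonymity_by_disclosure(people, people[0], ["city", "sex", "birth_date", "zip"])
```

  Each disclosed attribute can only keep the set the same size or shrink it, so the sequence never increases; here it falls from 5 to 1.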
  13. How would we apply this principle to location trace data? What granularity of location information could result in a small k for some portion of the population? Let's begin with some very coarse granularities: county and zip code. From a GPS location trace we consider not just where people live, which Krumm showed can be inferred to within about 60 m, but also where they work, which is similarly inferable from a location trace. We start at this level because it is very coarse grained; anything less precise would probably not be useful to any location-based service, and useful traces would generally be even more precise. So the results are an upper bound on the anonymity set.