Managing Confidential Information – Trends and Approaches

Personal information is ubiquitous and it is becoming increasingly easy to link information to individuals. Laws, regulations and policies governing information privacy are complex, but most intervene through either access or anonymization at the time of data publication.

Trends in information collection and management -- cloud storage, "big" data, and debates about the right to limit access to published but personal information -- complicate data management and make traditional approaches to managing confidential data decreasingly effective.

This session, presented as part of the Program on Information Science seminar series, examines trends in information privacy. It also discusses emerging approaches and research around managing confidential research information throughout its lifecycle.

  • This work, by Micah Altman (http://micahaltman.com), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
  • Other image source: Wikimedia Commons

Transcript

  • 1. Prepared for MIT Libraries Program on Information Research Brown Bag Talk, September 2013. Managing Confidential Information – Trends and Approaches. Dr. Micah Altman <escience@mit.edu>, Director of Research, MIT Libraries
  • 2. Standard Disclaimer: These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators. Secondary disclaimer: “It’s tough to make predictions, especially about the future!” -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc. Information Privacy Across the Research Lifecycle
  • 3. Collaborators & Co-Conspirators • Privacy Tools for Sharing Research Data Team (Salil Vadhan, P.I.) http://privacytools.seas.harvard.edu/people • Research Support: supported in part by NSF grant CNS-1237235
  • 4. Related Work Main Project: • Privacy Tools for Sharing Research Data http://privacytools.seas.harvard.edu/ Related publications: • Novak, K., Altman, M., Broch, E., Carroll, J. M., Clemins, P. J., Fournier, D., Laevart, C., et al. (2011). Communicating Science and Engineering Data in the Information Age. Computer Science and Telecommunications. National Academies Press. • Vadhan, S., et al. 2010. “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research Protections”. Available from: http://dataprivacylab.org/projects/irb/Vadhan.pdf • Altman, M. (2012). “Mitigating Threats To Data Quality Throughout the Curation Lifecycle.” In G. Marciano, C. Lee, & H. Bowden (Eds.), Curating For Quality. datacuration.web.unc.edu These slides & most reprints available from: informatics.mit.edu
  • 5. Level Setting
  • 6. Identifying Information Is Common • Includes information from a variety of sources, such as… – Research data, even if you aren’t the original collector – Student “records” such as e-mail, grades – Logs from web-servers, other systems • Lots of things are potentially identifying: – Under some federal laws: IP addresses, dates, zipcodes, … – Birth date + zipcode + gender uniquely identify ~87% of people in the U.S. [Sweeney 2002] Try it: http://aboutmyinfo.org/index.html – With date and place of birth, can guess first five digits of social security number (SSN) > 60% of the time. (Can guess the whole thing in under 10 tries, for a significant minority of people.) [Acquisti & Gross 2009] – Analysis of writing style or eclectic tastes has been used to identify individuals • Tables, graphs and maps can also reveal identifiable information [Brownstein, et al., 2006, NEJM 355(16)]
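The claim that a few ordinary attributes can single people out is easy to check on any dataset by counting how many records are unique on that attribute combination. A minimal sketch, assuming a hypothetical toy list of (birth date, ZIP code, gender) tuples; the records and the function name `unique_fraction` are illustrative and not from the talk:

```python
from collections import Counter

# Hypothetical toy records: (birth date MMDDYYYY, ZIP code, gender).
records = [
    ("01011961", "02145", "M"),
    ("02021961", "02138", "M"),
    ("11111972", "94043", "M"),
    ("03251972", "94041", "F"),
    ("03251972", "94041", "F"),
    ("04041974", "64800", "F"),
]

def unique_fraction(rows):
    """Return the share of rows whose attribute combination occurs exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are unique on birth date + ZIP + gender")
```

Run against a real file, the same count gives a rough first estimate of how many subjects could be re-identified by anyone who already knows those three attributes.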
  • 7. Some Sources of Confidentiality Restrictions for University Held Research and Education Information • Overlapping laws • Different laws apply to different cases • Additional data usage agreements and license terms apply
  • 8. Different Requirements and Definitions
                              | FERPA | HIPAA | Common Rule | MA 201 CMR 17
      Coverage                | Students in Educational Institutions | Medical Information in “Covered Entities” | Living persons in research by funded institutions | Mass. Residents
      Identification Criteria | Direct; Indirect; Linked; Bad intent (!) | Direct; Indirect; Linked | Direct; Indirect; Linked | Direct
      Sensitivity Criteria    | Any non-directory information | Any medical information | Private information – based on harm | Financial, State, Federal Identifiers
      Management Requirements | Directory opt-out; [Implied] good practice | Consent; Specific technical safeguards; Breach notification | Consent; [Implied] risk minimization | Specific technical safeguards; Breach notification
  • 9. [Figure not captured in transcript; surviving label: 2010]
  • 10. Recognized Benefits of Data Sharing • Pioneering NRC report [Fienberg, et al. 1985] on data sharing recommended: – Sharing data should be a regular practice. – Investigators should share their data by the time of publication of initial major results of analyses of the data except in compelling circumstances. – Data relevant to public policy should be shared as quickly and widely as possible. – Plans for data sharing should be an integral part of a research plan whenever data sharing is feasible. • Numerous subsequent reports recommend data sharing.
  • 11. Private Information & Information Services • Recommendations • Annotations & Tagging • Class discussion forum • Social Highlighting
  • 12. Access Control Model [diagram labels: Client; Credentials; Authentication; Authorization; Request/Response; Access Control; Resource; Resource Control Model; Log; Auditing; External Auditor]
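The access control model on slide 12 chains authentication, authorization, and logging around each request for a resource. A minimal sketch of that flow, assuming a hypothetical in-memory credential store and permission table (`CREDENTIALS`, `PERMISSIONS`, and `AccessController` are illustrative names; a real system would store hashed credentials and write to a tamper-resistant log):

```python
from dataclasses import dataclass, field

# Hypothetical credential store and permission table, for illustration only.
# Plain-text passwords are used here only to keep the sketch short.
CREDENTIALS = {"alice": "s3cret", "bob": "hunter2"}
PERMISSIONS = {"alice": {"survey.csv"}}

@dataclass
class AccessController:
    # Every decision is appended here so an auditor can review it later.
    log: list = field(default_factory=list)

    def authenticate(self, user, password):
        """Check that the presented credentials match the stored ones."""
        return user in CREDENTIALS and CREDENTIALS[user] == password

    def authorize(self, user, resource):
        """Check that an authenticated user may access the resource."""
        return resource in PERMISSIONS.get(user, set())

    def request(self, user, password, resource):
        """Authenticate, then authorize, and log the outcome either way."""
        if not self.authenticate(user, password):
            outcome = "denied: authentication failed"
        elif not self.authorize(user, resource):
            outcome = "denied: not authorized"
        else:
            outcome = "granted"
        self.log.append((user, resource, outcome))
        return outcome

controller = AccessController()
print(controller.request("alice", "s3cret", "survey.csv"))
print(controller.request("bob", "hunter2", "survey.csv"))
print(controller.log)
```

Logging denials as well as grants is what makes the external audit step in the diagram meaningful.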
  • 13. Disclosure Limitation: Data Input/Output Model [diagram labels: DATA; Published Outputs; Contingency table; Summary statistics; “The correlation between X and Y was large and statistically significant”; Information Visualization; Public use sample microdata (e.g. * Jones * * 1961 021*; * Jones * * 1972 9404*)]
  • 14. Example
  • 15. Exemplar: Social Media Analysis
      Attribute Type | Examples
      Data: Structure | network
      Data: Attribute Types | Continuous/Discrete/Scale: ratio/interval/ordinal/nominal
      Data: Performance Characteristics | 10M-1B observations; Sample from stream of continuously updated corpus; Dozens of dimensions/measures
      Measurement: Unit of Observation | Individuals; Interactions
      Measurement: Measurement type | Observational
      Measurement: Performance characteristic | High volume; Complex network structure; Sparsity; Systematic and sparse metadata
      Management Constraints | License; Replication
      Analysis methods | Bespoke algorithms (clustering); nonlinear optimization; Bayesian methods
      Desired Outputs | Summary scalars (model coefficients); Summary table; Static/interactive visualization
      More Information
      • Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.
      • King, Gary, Jennifer Pan, and Molly Roberts. "How censorship in China allows government criticism but silences collective expression." APSA 2012 Annual Meeting Paper. 2012.
      • Lazer, David, et al. "Life in the network: the coming age of computational social science." Science (New York, NY) 323.5915 (2009): 721.
  • 16. What’s wrong with this picture?
      Name     | SSN   | Birthdate | Zipcode | Gender | Favorite Ice Cream | # of crimes committed
      A. Jones | 12341 | 01011961  | 02145   | M      | Raspberry          | 0
      B. Jones | 12342 | 02021961  | 02138   | M      | Pistachio          | 0
      C. Jones | 12343 | 11111972  | 94043   | M      | Chocolate          | 0
      D. Jones | 12344 | 12121972  | 94043   | M      | Hazelnut           | 0
      E. Jones | 12345 | 03251972  | 94041   | F      | Lemon              | 0
      F. Jones | 12346 | 03251972  | 02127   | F      | Lemon              | 1
      G. Jones | 12347 | 08081989  | 02138   | F      | Peach              | 1
      H. Smith | 12348 | 01011973  | 63200   | F      | Lime               | 2
      I. Smith | 12349 | 02021973  | 63300   | M      | Mango              | 4
      J. Smith | 12350 | 02021973  | 63400   | M      | Coconut            | 16
      K. Smith | 12351 | 03031974  | 64500   | M      | Frog               | 32
      L. Smith | 12352 | 04041974  | 64600   | M      | Vanilla            | 64
      M. Smith | 12353 | 04041974  | 64700   | F      | Pumpkin            | 128
      N. Smith | 12354 | 04041974  | 64800   | F      | Allergic           | 256
  • 17. What’s wrong with this picture? [Same 14-row table as slide 16, with annotations.] Column annotations: HIPAA & MA Identifier; Identifier & Sensitive; HIPAA Identifier; HIPAA Identifier; Indirect Identifier; Sensitive. Row annotations: Mass resident; Californian; Twins, separated at birth?; FERPA too?; Unexpected Response? [Slide footer: v. 23 (7/18/2013) Managing Confidential Data]
  • 18. Help, help, I’m being suppressed… [Techniques labeled on the slide: Synthetic Var; Global Recode; Local Suppression; Aggregation + Perturbation]
      Name      | SSN   | Birthdate | Zipcode | Gender | Favorite Ice Cream | # of crimes committed
      [Name 1]  | 12341 | *1961     | 021*    | M      | Raspberry          | .1
      [Name 2]  | 12342 | *1961     | 021*    | M      | Pistachio          | -.1
      [Name 3]  | 12343 | *1972     | 940*    | M      | Chocolate          | 0
      [Name 4]  | 12344 | *1972     | 940*    | M      | Hazelnut           | 0
      [Name 5]  | 12345 | *1972     | 940*    | F      | Lemon              | .6
      [Name 6]  | 12346 | *1972     | 021*    | F      | Lemon              | .6
      [Name 7]  | 12347 | *1989     | 021*    | *      | Peach              | 64.6
      [Name 8]  | 12348 | *1973     | 632*    | F      | Lime               | 3
      [Name 9]  | 12349 | *1973     | 633*    | M      | Mango              | 3
      [Name 10] | 12350 | *1973     | 634*    | M      | Coconut            | 37.2
      [Name 11] | 12351 | *1974     | 645*    | M      | *                  | 37.2
      [Name 12] | 12352 | *1974     | 646*    | M      | Vanilla            | 37.2
      [Name 13] | 12353 | *1974     | 647*    | F      | *                  | 64.4
      [Name 14] | 12354 | *1974     | 648*    | F      | Allergic           | 256
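Two of the techniques named on slide 18, global recoding and local suppression, can be sketched in a few lines. The example below is a rough illustration, not the procedure used to build the slide: the function names, the choice to keep three ZIP digits, and the set of "rare" flavors are all assumptions, and the aggregation and perturbation of the crime counts is left out.

```python
def recode_birthdate(mmddyyyy):
    """Global recode: keep only the year of an MMDDYYYY date string."""
    return "*" + mmddyyyy[-4:]

def recode_zipcode(zipcode, keep=3):
    """Global recode: keep only the leading digits of a ZIP code."""
    return zipcode[:keep] + "*"

def suppress_if_rare(value, rare_values):
    """Local suppression: blank out values rare enough to single someone out."""
    return "*" if value in rare_values else value

def mask_record(record, row_number, rare_flavors):
    """Apply a placeholder name, both recodes, and suppression to one record."""
    return {
        "name": f"[Name {row_number}]",
        "birthdate": recode_birthdate(record["birthdate"]),
        "zipcode": recode_zipcode(record["zipcode"]),
        "gender": record["gender"],
        "ice_cream": suppress_if_rare(record["ice_cream"], rare_flavors),
    }

record = {"name": "K. Smith", "birthdate": "03031974", "zipcode": "64500",
          "gender": "M", "ice_cream": "Frog"}
masked = mask_record(record, 11, rare_flavors={"Frog", "Pumpkin"})
print(masked)
```

Global recoding coarsens a whole column the same way for every row, while local suppression removes only the individual cells judged risky, which is why the two are usually combined.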
  • 19. k-anonymous – but not protected [Annotations: Additional background; Sort Order/Structure; Homogeneity] [Sidebar: Law, policy, ethics; Research design …; Information security; Disclosure limitation]
      Name    | SSN | Birthdate | Zipcode | Gender | Favorite Ice Cream | # of crimes committed
      * Jones | *   | * 1961    | 021*    | M      | Raspberry          | 0
      * Jones | *   | * 1961    | 021*    | M      | Pistachio          | 0
      * Jones | *   | * 1972    | 9404*   | *      | Chocolate          | 0
      * Jones | *   | * 1972    | 9404*   | *      | Hazelnut           | 0
      * Jones | *   | * 1972    | 9404*   | *      | Lemon              | 0
      * Jones | *   | *         | 021*    | F      | Lemon              | 1
      * Jones | *   | *         | 021*    | F      | Peach              | 1
      * Smith | *   | * 1973    | 63*     | *      | Lime               | 2
      * Smith | *   | * 1973    | 63*     | *      | Mango              | 4
      * Smith | *   | * 1973    | 63*     | *      | Coconut            | 16
      * Smith | *   | * 1974    | 64*     | M      | Frog               | 32
      * Smith | *   | * 1974    | 64*     | M      | Vanilla            | 64
      * Smith | *   | 04041974  | 64*     | F      | Pumpkin            | 128
      * Smith | *   | 04041974  | 64*     | F      | Allergic           | 256
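The "homogeneity" problem on slide 19 can be made concrete: a table is k-anonymous when every combination of quasi-identifiers appears at least k times, yet a group whose members all share the same sensitive value still discloses that value. A minimal sketch, using rows modeled on the slide's masked table (treat them as illustrative) and assuming birthdate, ZIP code, and gender are the quasi-identifiers:

```python
from collections import defaultdict

# Rows modeled on the slide's masked example table:
# (birthdate, zipcode, gender, number of crimes). Illustrative only.
ROWS = [
    ("*1961", "021*", "M", 0),
    ("*1961", "021*", "M", 0),
    ("*1972", "9404*", "*", 0),
    ("*1972", "9404*", "*", 0),
    ("*1972", "9404*", "*", 0),
    ("*", "021*", "F", 1),
    ("*", "021*", "F", 1),
    ("*1973", "63*", "*", 2),
    ("*1973", "63*", "*", 4),
    ("*1973", "63*", "*", 16),
    ("*1974", "64*", "M", 32),
    ("*1974", "64*", "M", 64),
    ("04041974", "64*", "F", 128),
    ("04041974", "64*", "F", 256),
]

def equivalence_classes(rows):
    """Group sensitive values by quasi-identifier tuple (all fields but the last)."""
    classes = defaultdict(list)
    for *quasi_identifiers, sensitive in rows:
        classes[tuple(quasi_identifiers)].append(sensitive)
    return classes

def k_anonymity(rows):
    """Return k: the size of the smallest equivalence class."""
    return min(len(values) for values in equivalence_classes(rows).values())

def homogeneous_classes(rows):
    """Return classes whose members all share a single sensitive value."""
    return [key for key, values in equivalence_classes(rows).items()
            if len(set(values)) == 1]

print("k =", k_anonymity(ROWS))
for key in homogeneous_classes(ROWS):
    print("homogeneous class:", key)
```

Here the table is 2-anonymous, but three of the six groups are homogeneous: anyone known to be in one of them has their crime count revealed without being singled out.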
  • 20. Climate
  • 21. Commercial Data Breaches • Data from 100 million individuals exposed this year… • Only a portion of breaches are reported • Difficult to trace impacts… but estimated 8.3M identity thefts in 2005 Source: http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
  • 22. Cloud computing risks • Cloud computing decouples physical and computing infrastructure • Increasingly used for core-IT, research computing, data collection, storage, and analysis • Confidentiality issues – Auditing and compliance – Access and commingling of data – Location of data and services and legal jurisdiction – Vulnerabilities of network communication using single well-known key – Vulnerability of key storage
  • 23. Legal & Cultural Challenges • EU right to be forgotten; French “le droit à l'oubli”; California social media privacy act • Consumer privacy bill of rights; Do not track; Privacy Icons • Evolving case law on locational privacy • Public records, mug shots, and revenge porn • State-level action on privacy regulation • Attitudes towards sharing; surveillance
  • 24. New Data – New Challenges • How to limit disclosure without completely destroying utility? – The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008] – The “GIS Problem”: fine geo-spatial-temporal data is impossible to mask when correlated with external data [Zimmerman 2008] – The “Facebook Problem”: it is possible to identify masked network data if only a few nodes are controlled [Backstrom, et al. 2007] – The “Blog Problem”: pseudonymous communication can be linked through textual analysis [Tomkins et al. 2004] [For more examples see Vadhan, et al 2010] Source: [Calabrese 2008; Real Time Rome Project 2007]
  • 25. Weather
  • 26. Possible Legal/Regulatory Changes for 2013-15 [Sidebar: Law, policy, ethics; Research design …; Information security; Disclosure limitation] • Likely – New information privacy laws in selected states – Increased open data requirements from federal funders – Adoption of data availability requirements by increasing numbers of journals
  • 27. Information Privacy Across the Research Lifecycle
  • 28. Research
  • 29. Traditional approaches are failing • Modal traditional approach: – removing subjects’ names – storing descriptive information in a locked filing cabinet – publishing summary tables – (sometimes) releasing a public use version that suppresses and recodes descriptive information • Problems – law is changing – requirements are becoming more complex – research computing is moving towards the cloud, other distributed storage – researchers are using new forms of data that create new privacy issues – advances in the formal analysis of disclosure risk imply the impracticality of “de-identification” as required by law
  • 30. Privacy Tools for Sharing Research Data: A National Science Foundation Secure and Trustworthy Cyberspace Project, supported by award #1237235 • Differentially Private Algorithms Shield Individuals in Databases • The Dataverse Network will Distribute and Manage Confidential Databases • Policy tools Guide Information Management Across the Research Lifecycle
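Slide 30 mentions differentially private algorithms without showing one. As a rough illustration of the idea (not the project's implementation), the sketch below applies the standard Laplace mechanism to a counting query: adding or removing one person changes a count by at most 1, so Laplace noise with scale 1/epsilon is added to the true answer. The function names, the epsilon value, and the seed are assumptions; the data reuses the crime counts from the talk's toy table.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon, rng):
    """Noisy count: a counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1 / epsilon, rng)

# Crime counts from the talk's toy table; the query and epsilon are illustrative.
crimes = [0, 0, 0, 0, 0, 1, 1, 2, 4, 16, 32, 64, 128, 256]
rng = random.Random(2013)
noisy = private_count(crimes, lambda c: c >= 1, epsilon=0.5, rng=rng)
print(f"noisy count of people with at least one crime: {noisy:.1f}")
```

Smaller epsilon values mean more noise and stronger protection; the released number is useful in aggregate but does not depend strongly on any single row.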
  • 31. Approaches • Policy – Legal Reforms – Information Accountability – Economic rights – Information transparency – Privacy Nudges – Privacy Icons • Cryptography – Multiparty computation – Zero knowledge protocols – Functional encryption – Homomorphic encryption • Statistics – Synthetic data – Reidentification risk – K-anonymity; homogeneity – Differential privacy • Information Lifecycle & Infrastructure – Open consent – Metadata frameworks – Information accountability – Policy aware filesystems – Data Vaults – Secure data enclave – Standardized Data Use Agreements • Examples named on the slide: Aboutmydata.com; IRODs; Project VRM
  • 32. Recent Work – Economics & Public Policy Research/Outreach • March 2013 – Dwork & Vadhan led roundtable in Differential Privacy and Law and Policy (conference), Cardozo Law School • March 2013 – Altman provided oral comments (recorded) on Public Workshop on Revisions to the Common Rule, National Academies, on limits of HIPAA approach to privacy. • May 2013 – Altman & Crosas submitted written testimony to Public Access to Federally-Supported Research and Development Data, National Academies; including approaches to management of privacy for data sharing. • June 2013 – Dwork, Sweeney, & Vadhan invited & participated in Privacy Law Scholars Conference, George Washington Law School/Berkeley Law School • June 2013 – Yiling Chen, Stephen Chong, Ian Kash, Tal Moran, and Salil Vadhan. “Truthful Mechanisms for Agents that Value Privacy”, Proceedings of the 14th ACM Conference on Electronic Commerce (EC), June 2013. • September – Integrating Approaches to Privacy across the Research Lifecycle Workshop • In Progress – Rewrite and expansion of Vadhan, S., et al. 2010. “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research Protections”, proposing framework for integrating modern privacy concepts into Human Subjects protections.
  • 33. Information Life Cycle Model [diagram labels] Stages: Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use (• Scientometric • Education • Scientific • Policy); Long-term access. Layers: Research methods; Statistical / Computational Frameworks; Data Management Systems; Legal / Policy Frameworks
  • 34. Example: Stakeholder Concerns Across Lifecycle
      Stakeholder | Concerns
      Research Consumers (Readers; Secondary researcher) | Replicate and extend; Secondary analysis; Link research
      Research Publishers (Print publishers; Research archives) | Replicable research; Promote use of their publications; Protect publisher IP; Avoid third party IP/Privacy Issues
      Project Personnel (Investigators; Research Staff) | Replicable Research; Publish; Promote use of Publications; Track use
      Research sponsors (Home institution; Funding sources) | Replicable Research; Policy Relevance; Accessibility of Research; Protect IP; Avoid third party IP/Privacy Issues
      Research sources (Research Subjects; Owners of subject material; Owners of supplementary data) | Confidentiality; Intellectual Property; Information Transfer
      Legal Issues (row alignment not recoverable from the transcript): Fair Use; Licensing; Freedom of Information; Copyright; DMCA; Informed Consent; Privacy; Trade secrets
  • 35. Modeling Features
      Features | Characteristics
      Data | Structure; Source; Unit of observation; Attribute types; Dimensionality; Number of observations; homogeneity; frequency of updates; quality characteristics
      Analytic Results | Form of output; analysis methodology; analysis/inferential goal; utility/loss/quality
      Disclosure scenario | Source of threat; areas of vulnerability; attacker objectives, background knowledge, capability; Breach criteria/disclosure concept
      Stakeholders | Stakeholder types; capacities; trust relationships; budgets
      Lifecycle characteristics | Lifecycle stages controlled/in scope; policies used; stakeholders involved at each stage
      Current privacy management approach | Regulation/policy; legal controls; statistical/computational disclosure methods; information security controls
  • 36. Legal/Policy Frameworks [diagram labels, in transcript order: Intellectual Property; Contract; Trade Secret; Contract; Intellectual Attribution; Moral Rights; Patent; Click-Wrap TOU; License; Database Rights; Journal Replication Requirements; FOIA; State FOI Laws; Funder Open Access; Fair Use; DMCA; Trademark; Common Rule 45 CFR 46; Copyright; Rights of Publicity; HIPAA; EU Privacy Directive; FERPA; Privacy Torts (Invasion, Defamation); CIPSEA; State Privacy Laws; Classified; Access Rights; Sensitive but Unclassified; Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …); Export Restrictions; EAR; ITAR; Confidentiality]
  • 37. Risk Assessment [Sidebar: Law, policy, ethics; Research design …; Information security; Disclosure limitation] • [NIST 800-100, simplification of NIST 800-30] [diagram labels: Threat Modeling; Analysis (likelihood, impact, mitigating controls); System Analysis; Vulnerability Identification; Institute Selected Controls; Testing and Auditing] Caption: Information Security Control Selection Process
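The analysis step on slide 37 weighs each threat by likelihood and impact. One common way to turn that into a ranking is to score each threat as likelihood times impact on an ordinal scale. The sketch below is only that: the three-level scale, the multiplication, and the example threats are assumptions for illustration and are not taken from the NIST documents or the talk.

```python
# Hypothetical ordinal scale; the multiplication below is a common
# simplification for ranking, not a formula prescribed by NIST.
LEVELS = {"low": 1, "moderate": 2, "high": 3}

def risk_score(likelihood, impact):
    """Combine ordinal likelihood and impact levels into one ranking score."""
    return LEVELS[likelihood] * LEVELS[impact]

# Hypothetical threats, for illustration only.
threats = [
    {"name": "lost unencrypted laptop", "likelihood": "moderate", "impact": "high"},
    {"name": "misdirected email with data extract", "likelihood": "high", "impact": "moderate"},
    {"name": "server room break-in", "likelihood": "low", "impact": "high"},
]

# Highest score first; ties keep their original order because sorted() is stable.
ranked = sorted(threats,
                key=lambda t: risk_score(t["likelihood"], t["impact"]),
                reverse=True)
for threat in ranked:
    print(risk_score(threat["likelihood"], threat["impact"]), threat["name"])
```

A ranking like this only orders the candidates for control selection; mitigating controls already in place would lower the likelihood or impact level before scoring.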
  • 38. Systems Policy Research questions deriving from Information Lifecycle Analysis • Infrastructure requirements analysis – Data acquisition, storage, dissemination – Identification, authorization, authentication – Metadata, protocols • System design: potential implementation cost of interactive privacy: – Information security: hardening – Information security: certification & auditing – Model server development, provisioning, maintenance, reliability, availability • System design: information security tradeoffs of interactive privacy mechanisms: – Availability risks: denial of service attack – Availability/integrity risks: privacy budget exhaustion attacks – Integrity risks: modification of delivered results (e.g. man-in-the-middle attacks) – Secrecy/privacy: breach of authentication/authorization layer • System design: optimizing privacy & utility across lifecycle – When does limiting disclosive data collection dominate methods at the data analysis stage? – When do restricted virtual data enclaves + public synthetic data dominate interactive mechanisms? • System design: Information use/reuse – Support of scientific analysis use cases (model diagnostics, exploratory data analysis, integration of external data) within interactive privacy systems. – Align informational assumptions across stages & incorporating informative priors? – Requirements for scientific replication/verification of results produced by model servers?
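The "privacy budget exhaustion" risk on slide 38 follows from how interactive differential privacy is usually accounted for: under basic sequential composition the epsilon values of successive queries add up, and the server must refuse queries once a fixed total is spent. A minimal sketch of such an accountant, with the class name `PrivacyBudget` and the numbers chosen purely for illustration:

```python
class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition (epsilons add up)."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Reserve epsilon for one query; return False once the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon:
            return False
        self.spent += epsilon
        return True

budget = PrivacyBudget(total_epsilon=1.0)
answers = [budget.charge(0.25) for _ in range(5)]
print(answers)  # [True, True, True, True, False]
```

Because the budget is shared, a single client issuing many queries can use it up and deny service to every later analyst, which is why the slide lists exhaustion as an availability and integrity risk.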
  • 39. Legal Policy Research questions deriving from Information Lifecycle Analysis • Legal requirements across lifecycle stages • Legal instruments -- capturing scientific privacy concepts in legal instruments consistently across lifecycle – service level agreements – consent terms – deposit agreement – data usage agreements – Regulatory language
  • 40. Public Policy Research Questions • Where does the market fail for sharing confidential research data? – What market conditions are theoretically violated? – What is the empirical evidence of the degree of violation? – How does the degree of violation vary by policy context & use case? • Policy equilibria – What are contribution and privacy equilibria for data sharing under different privacy concepts? • Interventions – How do proposed interventions (e.g. advise & consent; “privacy icons”, uniform regulations, breach notification, information accountability, anonymization) correspond to sources of market failures?
  • 41. Beyond Legal Research -- Market Theory • Conditions on markets [See, e.g., Posner 1978] – No political/legal distortions – Common knowledge – No barriers to entry • Conditions on exchange [See, e.g., Benisch, Kelley, Sadeh, & Cranor 2011; McDonald & Cranor 2010] – No transaction costs – No information asymmetries • Conditions on agents [See, e.g., Acquisti 2010; Tsai, Egelman, Cranor & Acquisti 2010] – Perfect rationality – Self-interested – Infinitely many agents – Stable preferences • Conditions on equilibrium valuation – Pareto optimality vs. economic surplus – Ignorability of distributional concern • Conditions on goods – Consumptive goods – Excludable goods – Decreasing returns to scale – Transferability – No externalities
  • 42. Bibliography (Selected) • L. Willenborg and T. D. Waal. Elements of Statistical Disclosure Control, volume 155 of Lecture Notes in Statistics. Springer Verlag, New York, NY, 2001. • Higgins, Sarah. "The DCC curation lifecycle model." International Journal of Digital Curation 3.1 (2008): 134-140. www.dcc.ac.uk/resources/curationlifecycle-model • ESSNET, Handbook on Statistical Disclosure Control. 2011. neon.vb.cbs.nl/casc/SDC_Handbook.pdf • Fung, Benjamin, et al. "Privacy-preserving data publishing: A survey of recent developments." ACM Computing Surveys (CSUR) 42.4 (2010): 14. • Altman, M. (2012). “Mitigating Threats To Data Quality Throughout the Curation Lifecycle.” In G. Marciano, C. Lee, & H. Bowden (Eds.), Curating For Quality. datacuration.web.unc.edu