Exploring New Methods for Protecting and Distributing Confidential Research Data



I gave this presentation at the Fall 2009 CNI Membership meeting.







Presentation Transcript

  • Exploring New Methods for Protecting and Distributing Confidential Research Data Bryan Beecher Felicia LeClere ICPSR/University of Michigan
  • Today’s Talk
    • What’s ICPSR?
    • How do organizations distribute confidential research data today?
    • What are the problems?
    • What can we improve?
  • What’s ICPSR?
    • Inter-university Consortium for Political and Social Research
      • JSTOR for social science data
    • Serving billions since 1962
  • Who does ICPSR serve?
    • Research universities
      • Discover and download data
    • Teaching universities and colleges
      • On-line analysis
    • Federal agencies
      • Data management, preservation, and dissemination
  • Distributing data
    • Most of our content is public-use
      • Anonymized public opinion
      • Aggregate government data
    • Little risk of disclosure
    • But what about the good stuff?
  • Distributing sensitive data
    • Higher risk of breach of confidentiality
      • Variables that give geographic information that might be combined with other data sources to identify a respondent
    • Requires special handling
  • Distributing sensitive data
    • Researcher agrees to protect the data and identities
    • Delivered securely
    • Harsh penalty deterrent
  • National Longitudinal Study of Adolescent Health
    • Add Health
      • Highly used and cited study
    • Very frank questions
      • Kids in 7th through 12th grade
    • Carolina Population Center
    • Gold standard in data protection
  • Traditional Approach (images: http://www.flickr.com/photos/videolux/2389320345/ and http://www.flickr.com/photos/curiousexpeditions/3767246490/)
  • Traditional Approach
    • Confidential research data
    • Apply for access
    • Write security plan
    • Repeat
  • Can we improve upon it?
    • Paperwork
      • How do we speed the application process?
    • Security
      • How do we ensure the data are going to a good home?
  • Paperwork
    • Web portal
      • Research plan
      • IRB approval
      • CVs
      • Confidentiality agreements
  • Paperwork
    • Web portal
      • Behavioral questionnaire
      • Electronic copy of contract (HTML, PDF)
      • Database back-end to drive workflow systems
  • Restricted Data Contracting System
    • Integrated with ICPSR’s existing Web download mechanism
    • Collects information that would ordinarily be provided through paper
    • “Tickler” system to send reminders, nag about missing items
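
The portal slides list the documents an application needs, and the tickler's job is to nag about whatever is still missing. A minimal sketch of that check follows. The item list is taken from the slides, but the names `REQUIRED_ITEMS`, `missing_items`, and `reminder_text` are hypothetical, and nothing here reflects how ICPSR's contracting system is actually built.

```python
# Hypothetical sketch of a "tickler" check; names and item keys are
# illustrative assumptions, not ICPSR's actual contracting system.

# Items the slides say the Web portal collects.
REQUIRED_ITEMS = [
    "research plan",
    "IRB approval",
    "CVs",
    "confidentiality agreements",
    "behavioral questionnaire",
]


def missing_items(submitted):
    """Return required items not yet submitted, in checklist order."""
    received = set(submitted)
    return [item for item in REQUIRED_ITEMS if item not in received]


def reminder_text(applicant, submitted):
    """Build a reminder message, or return None when nothing is missing."""
    missing = missing_items(submitted)
    if not missing:
        return None
    return f"{applicant}: still waiting on {', '.join(missing)}"


if __name__ == "__main__":
    print(reminder_text("Dr. Example", ["research plan", "CVs"]))
```

A database back-end would presumably run a check like this on a schedule and send the message by e-mail; that part is left out here.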
  • Security
    • Current system relies on…
      • The data provider to maintain security templates
      • The researcher to write an IT security plan
      • The data provider to read and understand the plan
      • The researcher to execute the plan
  • Current access model (diagram: Researcher Workstation, ICPSR)
  • A new access model? (diagram: Secure Area, Researcher Workstation, ICPSR)
  • Secure area = the cloud?
    • Cloud-based access
      • Convenient
      • Scalable
      • Economical
      • Perfect?
  • What could go wrong?
  • Almost everything
    • Is the cloud reliable?
    • Will the data be safe?
    • We are building an analytic environment for a researcher; how will we know what to provide?
    • Will this perform well for the researcher?
    • This is the main story…
  • Cloud reliability
    • Using the cloud for disaster recovery (DR) since January 2009
    • The Merit Network Operations Center monitors all of our stuff
    • Ping and HTTP GET every minute, 24x7
    • Results?
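
Once-a-minute probes like these reduce naturally to an availability figure. The sketch below shows one way such results could be summarized. The function names are hypothetical and the sample day is made-up data for illustration; it is not the Merit NOC's tooling and does not reproduce the talk's measurements.

```python
# Hypothetical sketch: summarizing once-a-minute probe results.
# The sample data below is made up for illustration only.

def availability(probe_results):
    """Fraction of probes that succeeded (True = probe answered)."""
    results = list(probe_results)
    if not results:
        raise ValueError("no probe results")
    return sum(results) / len(results)


def downtime_minutes(probe_results, interval_minutes=1):
    """Approximate downtime: each failed probe counts as one interval."""
    return sum(1 for ok in probe_results if not ok) * interval_minutes


if __name__ == "__main__":
    # One illustrative day: 1440 one-minute probes, 3 of them failed.
    day = [True] * 1437 + [False] * 3
    print(f"availability: {availability(day):.4%}")
    print(f"downtime: {downtime_minutes(day)} minutes")
```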
  • Local v. cloud – CY 2009
  • Conclusion
    • Cloud has been more reliable than local environment
    • If local power were more reliable, the cloud would still come out ahead, but only slightly
    • Certainly seems to be good enough
  • Cloud security
    • Absolute security?
      • Who cares?
    • More secure than the typical WinTel desktop of a social science researcher?
      • That’s the goal
  • Current practice
    • Data archive maintains per-platform guidelines on IT security
    • Researcher downloads a template and writes his/her own IT security plan
    • Data provider reviews plan; approves or iterates until approved or rejected
  • Sample items
      • I secured the computer on which the Add Health data resides in a locked room, or secured the computer to a table with a lock and cable (locking the case so the battery cannot be removed).
      • I turned off all unneeded services and disabled unneeded network protocols.
  • Brutal facts
    • Data providers are not IT experts
    • Researchers are not experts in IT security
    • Even if the system is secure on Day One, what assurance is there that it continues to be secure?
  • Our approach to security
    • Leverage tools from the cloud provider (AWS access control lists)
    • Leverage tools from UMich (regular Retina and Nessus scans)
    • Engage a white hat hacker to probe and evaluate the system
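
The idea behind an access control list is that only listed networks may reach the analysis host. The sketch below illustrates that rule with Python's standard `ipaddress` module. It is not the AWS API, and the networks shown are reserved documentation ranges standing in for a hypothetical campus network and workstation.

```python
# Illustrative allow-list check using only the standard library.
# This is NOT the AWS API; the networks are reserved documentation ranges.
import ipaddress

ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # hypothetical campus range
    ipaddress.ip_network("198.51.100.17/32"),  # hypothetical single workstation
]


def is_allowed(source_ip, allowed=ALLOWED_NETWORKS):
    """Return True if source_ip falls inside any allowed network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)


if __name__ == "__main__":
    for ip in ("192.0.2.44", "198.51.100.17", "203.0.113.9"):
        print(ip, "allowed" if is_allowed(ip) else "denied")
```

Anything not explicitly listed is denied, which is the default-deny posture a provider-side access list is meant to give.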
  • Conclusion
    • Expecting researchers to build and maintain secure IT environments is not reasonable
    • We think we can build something at least as secure in the cloud
    • We’ll evaluate our environment using outside evaluators
  • What to deploy?
    • Model means we need to distribute a working analytic environment, not just the data
    • Also gives the researcher the opportunity to limit access to only a subset of contractees
  • May I Take Your Order?
    • Operating system?
    • Analysis software?
    • Who’s allowed to use the system?
    • Anything else?
  • The ACI Chooser
    • Analytic Cloud Instance
      • Cumulus
    • The ACI Chooser
    • Takes your order
    • Brings your ACI to your table (in the cloud)
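
The "order" the chooser takes maps onto the questions from the previous slide: operating system, analysis software, authorized users, and anything else. A minimal sketch of such an order record follows. The class name `AciOrder`, its fields, and the placeholder values are assumptions for illustration, not the real ACI Chooser.

```python
# Hypothetical sketch of an ACI "order"; field names and the sample
# values are assumptions for illustration, not the real ACI Chooser.
from dataclasses import dataclass, field


@dataclass
class AciOrder:
    operating_system: str
    analysis_software: list = field(default_factory=list)
    authorized_users: list = field(default_factory=list)
    notes: str = ""  # "Anything else?"

    def problems(self):
        """List reasons the order cannot be filled yet."""
        issues = []
        if not self.operating_system:
            issues.append("no operating system chosen")
        if not self.analysis_software:
            issues.append("no analysis software chosen")
        if not self.authorized_users:
            issues.append("no authorized users listed")
        return issues


if __name__ == "__main__":
    order = AciOrder("example-os", ["example-stats-package"], [])
    print(order.problems())
```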
  • Conclusion
    • We’re building this now
    • Issues to resolve
      • How do we get passwords to people?
      • Remote access mechanism?
        • Citrix? Terminal Services?
      • Should we encrypt the data?
  • Performance
    • Will a cloud-based analysis system meet the expectations of a researcher?
    • Will one size fit all?
  • Amazon EC2
    • Regular
      • S (1 CPU, 2GB, $0.12)
      • L (4 CPU, 7GB, $0.48)
      • XL (8 CPU, 15GB, $0.96)
    • High memory
      • XXL (13 CPU, 34GB, $1.44)
      • XXXXL (26 CPU, 68GB, $2.88)
    • High CPU
      • M (5 CPU, 2GB, $0.29)
      • XL (20 CPU, 7GB, $1.16)
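
The table above lends itself to a simple "cheapest instance that fits" lookup. The sketch below encodes the figures from the slide and assumes the dollar amounts are per-hour prices, which the slide does not state. The function names are hypothetical, and the family prefix on each label is added only to keep the two "XL" entries distinct.

```python
# Sketch of choosing an instance size from the table on the slide.
# Assumption: the dollar figures are per-hour prices (the slide gives no unit).
# Labels are prefixed here only to keep the two "XL" entries distinct.
INSTANCES = [
    # (label, CPU figure from the slide, memory GB, price)
    ("Regular S", 1, 2, 0.12),
    ("Regular L", 4, 7, 0.48),
    ("Regular XL", 8, 15, 0.96),
    ("High memory XXL", 13, 34, 1.44),
    ("High memory XXXXL", 26, 68, 2.88),
    ("High CPU M", 5, 2, 0.29),
    ("High CPU XL", 20, 7, 1.16),
]


def cheapest_fit(min_cpu, min_memory_gb):
    """Return the lowest-priced instance meeting both minimums, or None."""
    fits = [i for i in INSTANCES if i[1] >= min_cpu and i[2] >= min_memory_gb]
    return min(fits, key=lambda i: i[3]) if fits else None


def cost(label, hours):
    """Price for running the named instance for the given number of hours."""
    price = {name: p for name, _, _, p in INSTANCES}[label]
    return round(price * hours, 2)


if __name__ == "__main__":
    print(cheapest_fit(1, 2))
    print(cheapest_fit(4, 8))
    print(cost("Regular S", 40))
```

Starting with the smallest fit and re-running the lookup with larger minimums mirrors the "start small, but give opportunity to grow" strategy.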
  • Strategy
    • Balance cost and performance
    • Start small, but give opportunity to grow
      • Easy to move an image from one instance size to another
    • Measure performance via researcher’s experience
  • Conclusion
    • Partners
      • Panel Study of Income Dynamics (PSID)
      • Los Angeles Family and Neighborhood Study (LA FANS)
    • Start small; re-launch larger
    • Ask how well it works
  • Thanks and Final Thoughts
    • Could preserve machine image + data + software + “program” for replication purposes
    • enclavecloud.blogspot.com charts our adventures
    • Cloud-related work sponsored by a recent NIH Challenge Grant