
(SEC313) Security & Compliance at the Petabyte Scale

Delivering petabyte-scale computational resources to a large community of users while meeting stringent security and compliance requirements presents a host of technical challenges. Seven Bridges Genomics met and overcame them when building the Cancer Genomics Cloud Pilot (CGC) for the National Cancer Institute. The CGC helps users solve massive computational problems involving multidimensional data: running diverse analyses in a reproducible manner, collaborating with other researchers, and keeping personal data secure in compliance with NIH rules for controlled data sets. Seven Bridges will highlight the lessons learned along the way, as well as best practices for constructing secure and compliant platform services using Amazon S3, Amazon Glacier, AWS Identity and Access Management (IAM), Amazon VPC, and Amazon Route 53.



  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Security and Compliance at the Petabyte Scale: Lessons from the National Cancer Institute’s Cancer Genomics Cloud Pilot. Igor Bogicevic, CTO, Seven Bridges Genomics; Angel Pizarro, AWS Scientific Computing. October 2015
  2. 2. What to expect from this session • Background: Unique challenges for securing genomics information • Case study: Democratizing access to The Cancer Genome Atlas (TCGA) through the Seven Bridges Cancer Genomics Cloud • Deep dives: How we’ve leveraged AWS to support secure and compliant genomics research
  3. 3. Why is securing genomics information hard?
  4. 4. i) Genomics data is big…and getting bigger. Between 2014 and 2018, production of new NGS (next-generation sequencing) data is expected to exceed 2 exabytes. [Chart: number of NGS sequencers and genomic data (TB) over time. NGS sequencers include machines from Illumina, Life Technologies, and Pacific Biosciences; human genome data is based on estimates of whole human genomes sequenced. Sources: financial reports of Illumina, Life Technologies, and Pacific Biosciences; revenue guidance; JP Morgan; The Economist; Seven Bridges analysis.]
  5. 5. ii) Genomes are inherently sensitive Very personal (including your relatives…) Can’t fully anonymize information Can’t take it back once it’s out there
  6. 6. iii) Research is highly collaborative and diverse It occurs in large teams... ...with numerous analytical tools
  7. 7. The Challenge Enable thousands of researchers using hundreds of (custom) tools to analyze petabytes of highly sensitive data in a secure and compliant environment
  8. 8. Case study: Bringing the Cancer Genome Atlas (TCGA) to the Cloud This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN261201400008C.
  9. 9. TCGA is one of the richest and most complete genomics data sets in the world 34 tumor types from thousands of patients… …analyzed across multiple dimensions… …by researchers across the US… …at a cost of $375 million. 1.5+ petabytes, growing to 3.5 petabytes in the next year
  10. 10. But learning from this data is challenging
  11. 11. The Cancer Genomics Cloud Pilots seek to directly address these difficulties • Initiated by Dr. Harold Varmus in 2013 • BAA issued in January 2014 • 3 pilots awarded September 2014 o Broad Institute o Institute for Systems Biology o Seven Bridges Genomics Early access: November 2015 Open release: January 2016 www.CancerGenomicsCloud.org
  12. 12. Our approach to democratizing access to TCGA data
  13. 13. The components of democratized access – Data ● Immediately and securely access petabytes of open-access and controlled-access cancer genomics data. ● Analyze data from your private cohorts alongside public data. ● Data access governed by the NIH Genomic Data Sharing Policy. ● As an NIH trusted partner, Seven Bridges is able to authorize approved researchers. ● First controlled access genomic dataset on AWS. ● Coming soon: http://aws.amazon.com/public-data-sets/tcga/.
  14. 14. The components of democratized access – Reproducibility ● Execute workflows from primary analysis through visualization. ● Each result is always associated with a complete snapshot of the tool versions, parameters, and input files.
  15. 15. The components of democratized access – Open standards ● Native execution of Docker-based Common Workflow Language (CWL) pipelines allows portability and sharing of custom tools. ● APIs support workflow automation and enhance interoperability.
  16. 16. ...implemented through our genomics platform
  17. 17. How we’ve leveraged AWS to support secure and compliant genomics research
  18. 18. Security and compliance―connected, but separate.
  19. 19. Security • Network and data security overview • Parallel file access at scale • Enabling secure computation using researcher- contributed tools • Enabling secure user access and collaboration
  20. 20. Simplified system architecture [Diagram: users reach the Seven Bridges website through a secure access point; a remote workforce connects through an OpenVPN gateway and the Seven Bridges offices connect to AWS over an IPSEC VPN; separate development and production virtual private clouds each contain an infrastructure server and dynamic worker instances; data flows between the workers and encrypted Amazon S3 buckets.]
  21. 21. Securing the network • Extensive use of virtual private clouds (VPCs) • Separate dev and production environments • Built-in IPSEC allows easy network integration • OpenVPN to secure remote user access • Each instance and VPC is individually firewalled
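
The slide stays at the level of bullets; below is a minimal sketch of the kind of VPC-and-firewall setup it describes, using the AWS CLI with illustrative CIDR blocks and names (not the Seven Bridges configuration):

# Create an isolated VPC and a private subnet for worker instances.
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
    --query 'Vpc.VpcId' --output text)
SUBNET_ID=$(aws ec2 create-subnet --vpc-id "$VPC_ID" \
    --cidr-block 10.0.1.0/24 --query 'Subnet.SubnetId' --output text)

# Per-instance firewalling: a security group that only allows SSH from the
# office VPN range; all other inbound traffic is denied by default.
SG_ID=$(aws ec2 create-security-group --vpc-id "$VPC_ID" \
    --group-name worker-sg --description "worker instances" \
    --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
    --protocol tcp --port 22 --cidr 198.51.100.0/24
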
  22. 22. Securing data • At rest: Amazon S3 SSE and SSE-KMS, Amazon EBS encryption, encrypted ephemeral storage • In transit: TLS exclusively, including all traffic to and from Amazon S3 • From other users: AWS IAM governs access to other users’ buckets
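
A hedged illustration of the at-rest controls above, with placeholder bucket and object names:

# Upload an object with server-side encryption requested explicitly
# (SSE-KMS here; --sse AES256 would request S3-managed keys instead).
aws s3 cp sample.bam s3://cgc-project-data/sample.bam --sse aws:kms

# Create an encrypted EBS volume for scratch space; encryption covers data
# at rest, snapshots, and traffic between the volume and the instance.
aws ec2 create-volume --availability-zone us-east-1a \
    --size 500 --volume-type gp2 --encrypted
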
  23. 23. Controls to support secure data • Atomic data access • Data locality • Dedicated tenancy on computation instances • Using only encrypted storage (Amazon S3, Amazon EBS, dm-crypt on Amazon EC2) • Strict data purging. Example bucket policy denying any S3 PutObject request that does not specify server-side encryption:
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "112",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::examplebucket/*",
            "Condition": {
              "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
            }
          }
        ]
      }
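
The policy only takes effect once attached to the bucket; a minimal sketch with the AWS CLI, assuming the JSON above is saved locally as deny-unencrypted-put.json and that examplebucket stands in for a real bucket name:

# Attach the deny-unencrypted-PutObject policy shown above to the bucket.
aws s3api put-bucket-policy --bucket examplebucket \
    --policy file://deny-unencrypted-put.json

# Verify that the policy is in place.
aws s3api get-bucket-policy --bucket examplebucket
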
  24. 24. Security • Network and data security overview • Parallel file access at scale • Enabling secure computation using researcher- contributed tools • Enabling secure user access and collaboration
  25. 25. Parallel file access at scale The Challenge: Many bioinformatics tasks require sharing of intermediate results between multiple instances.
  26. 26. Parallel file access at scale – NFS Observed network saturation at ~8 NFS clients.
  27. 27. Hypothesis • Amazon S3 would remove single NFS server bandwidth bottleneck. • Presenting user’s S3 objects as a local filesystem could provide an elegant abstraction that any application could use. • Cumulative S3 read/write speed should scale mostly linearly with number of workers. • Total read/write speed on shared S3 objects should significantly exceed NFS server solution speed on >10 workers.
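
The benchmark setup is not in the transcript; one simple way to gather per-worker numbers for a hypothesis like this is to time large S3 transfers on each worker and sum the rates across workers (bucket name and object size are illustrative assumptions):

# Rough per-worker S3 throughput probe; run on every worker and aggregate
# the reported rates to estimate cumulative read/write throughput.
SIZE_MB=4096
dd if=/dev/zero of=/tmp/probe.bin bs=1M count="$SIZE_MB"

start=$(date +%s)
aws s3 cp /tmp/probe.bin "s3://throughput-test-bucket/$(hostname)/probe.bin"
end=$(date +%s)
elapsed=$(( end - start )); [ "$elapsed" -lt 1 ] && elapsed=1
echo "write: $(( SIZE_MB / elapsed )) MB/s"

start=$(date +%s)
aws s3 cp "s3://throughput-test-bucket/$(hostname)/probe.bin" /tmp/probe-read.bin
end=$(date +%s)
elapsed=$(( end - start )); [ "$elapsed" -lt 1 ] && elapsed=1
echo "read: $(( SIZE_MB / elapsed )) MB/s"
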
  28. 28. Parallel access at scale – SBG-FS/Amazon S3 [Architecture diagram: SBG-FS over Amazon S3]
  29. 29. SBG-FS single worker performance [Chart: single-worker throughput in MB/s for 1st read (SBG-FS prefetch), write (SBG-FS upload), and 2nd read (SBG-FS cache)]
  30. 30. SBG-FS cumulative worker performance [Chart: cumulative throughput in GB/s vs. number of compute instances for 1st read (SBG-FS prefetch), write (SBG-FS upload), and 2nd read (SBG-FS cache)]
  31. 31. SBG-FS auditing capabilities [Diagram: auditing of SBG-FS access to Amazon S3]
  32. 32. Security • Network and data security overview • Parallel file access at scale • Enabling secure computation using researcher- contributed tools • Enabling secure user access and collaboration
  33. 33. Enabling secure computation using researcher-contributed tools. The Challenge: more than 10,000 bioinformatics tools exist, and 50+ tools are used in a single TCGA marker paper. Our Approach: wrap each tool in the Common Workflow Language (CWL) and run it on the Seven Bridges Platform.
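
The deck does not show a CWL example; the wrapper idea can be illustrated with a minimal, hypothetical tool description executed by the reference runner cwltool (file name, Docker image, and input file are illustrative; in practice the image would package a real bioinformatics tool):

# Write a minimal CWL CommandLineTool that runs inside a Docker container.
cat > count-lines.cwl <<'EOF'
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
requirements:
  DockerRequirement:
    dockerPull: ubuntu:16.04
inputs:
  infile:
    type: File
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
stdout: line-count.txt
EOF

# Execute the wrapped tool; cwltool pulls the image and mounts only the
# declared input file into the container.
cwltool count-lines.cwl --infile reads.fastq
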
  34. 34. Benefits of using Docker to deploy user-contributed tools • Enables solid resource isolation at the container level • Simplifies deploying and managing tools at scale
  35. 35. Security risks posed by use of Docker • The Docker daemon runs with root privileges • Users can intentionally or unintentionally add malicious applications • If resource management is not configured properly, an application can do damage outside its container
  36. 36. Enabling secure use of Docker containers • Know your private vs. public resources • Isolate network resources for each container (firewalling) • Be careful with linking containers • Aggregate logs (forensics)
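
The slide stays at the level of principles; a hedged sketch of what container-level restriction can look like with standard docker run flags (image name, limits, and paths are illustrative, not the Seven Bridges configuration):

# Run a user-contributed tool with no network access, a read-only root
# filesystem, dropped Linux capabilities, and hard resource limits; the only
# writable location is the /tmp tmpfs, and inputs are mounted read-only.
docker run --rm \
    --network none \
    --read-only --tmpfs /tmp \
    --cap-drop ALL \
    --security-opt no-new-privileges \
    --memory 8g --cpus 4 --pids-limit 256 \
    -v /scratch/task-inputs:/data:ro \
    researcher-tool:latest /data/run.sh
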
  37. 37. Security • Network and data security overview • Parallel file access at scale • Enabling secure computation using researcher- contributed tools • Enabling secure user access and collaboration
  38. 38. Enabling secure access • Organizations have diverse models of internal structure and responsibilities • Roles and authentication models are very diverse • Federated authentication and SSO
  39. 39. Supporting federated login for controlled data access [Diagram: SAML assertions are verified at login; the approved-researchers list is refreshed every 24 hours by a cron job via a metadata service; failures return an error message; events are logged to an ELK stack]
  40. 40. Enabling collaboration • The Seven Bridges Platform isolates resources at the project level • Users can share projects and control access through roles • The basic role provides read-only access; write/copy privileges are separate from execution privileges • Each project has one billing group, multiple users, and project-specific user roles, so funding and payment responsibility stay clear
  41. 41. Overall system security is enabled by monitoring and testing • Penetration testing • Patch management • Software and infrastructure vulnerability assessments • Monitoring of platform performance and availability • Pandora FMS/OSSEC/Sysdig • Auditing and logs at a project and platform level • Logs aggregated and available for inspection with ELK stack
  42. 42. Putting it all together 1. User logs on to the platform 2. Platform creates a unique signed URL for the user 3. Using the signed URL, data is uploaded to an encrypted Amazon S3 bucket 4. After the user starts a computation, the Seven Bridges Platform calculates the optimal execution plan and starts dedicated task worker instances 5. Worker instances securely pull data from Amazon S3 6. Worker instances securely share intermediate data with each other 7. Final results are uploaded to Amazon S3 [Diagram: the user, the Seven Bridges Platform, and a computation environment of encrypted Amazon S3 buckets and Amazon EC2 instances, annotated with the step numbers above]
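
The transcript does not show how the signed URLs in steps 2, 3, and 5 are produced; as one hedged illustration, a time-limited download URL can be generated with the AWS CLI (bucket, key, and expiry are placeholders; upload URLs would be produced analogously through an AWS SDK):

# Generate a presigned GET URL valid for one hour; the holder can fetch this
# one object without holding any AWS credentials of their own.
URL=$(aws s3 presign s3://cgc-project-data/results/variants.vcf --expires-in 3600)

# Anyone with the URL, and nothing else, can download the object over TLS.
curl -o variants.vcf "$URL"
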
  43. 43. Lessons learned from petabyte-scale security • Isolate resources as much as possible • Encrypt everything―it will make your life easier • Understand the scale of the data • Measure everything • Leverage the infrastructure
  44. 44. Compliance
  45. 45. When we talk about compliance, we talk about building trust and a shared language.
  46. 46. Compliance frameworks • dbGaP: protects against risk associated with the release of genomes of individuals who consented to participate in research studies. • HIPAA: protects against risk associated with the release of Protected Health Information (PHI). • ISO 27001: provides a framework for general security management of assets across the organization and is a general specification for an information security management system (ISMS).
  47. 47. Shared responsibility == compliance coordination [Diagram: a stacked responsibility model. AWS: facilities, infrastructure, virtualization, API and service endpoints. Seven Bridges Genomics: data security, data provenance, application monitoring, OS, network, etc. Researcher: users, groups, projects, applications. An auditor reviews the whole stack.]
  48. 48. Shared responsibility across frameworks [Matrix: responsibilities of the Researcher, AWS, and Seven Bridges across dbGaP, HIPAA, and ISO 27001]
  49. 49. Shared responsibility across frameworks (continued)
  50. 50. Shared responsibility across frameworks (continued)
  51. 51. Securely integrating with platforms
  52. 52. Security and compliance in practice [Diagram: the stacked responsibility model from slide 47 (data security, data provenance, application monitoring, OS/network; users, groups, projects, applications; facilities, infrastructure, virtualization, API and service endpoints), read as horizontal responsibility shared by the Researcher, Seven Bridges Genomics, and Amazon Web Services]
  53. 53. Use case: Analyze Personal Genome Project data (http://personalgenomes.org) [Diagram: a dedicated instance in a VPC subnet, with access to the 1000 Genomes data set]
  54. 54. Strategies to follow • Rely on the platform as much as possible • Follow security best practices outlined in the AWS documentation • Have a checklist!
  55. 55. Compliance checklist ☐ AWS security ☐ VPC, security groups, encrypted storage ☐ Protect AWS credentials ☐ Protect platform credentials ☐ SOPs for OS and application updates ☐ Audit and logging of activities outside the platform ☐ Data provenance and lifecycle
  56. 56. AWS architecture [Diagram: an instance with an IAM instance role inside a security group and VPC subnet, within a virtual private cloud] • Access platforms via the Internet or VPC peering • DevOps for instance and application management • Protect credentials with AWS IAM and AWS KMS
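
A hedged sketch of launching an analysis instance along the lines of the diagram, using the AWS CLI (the AMI, subnet, security group, instance profile, and key pair names are illustrative placeholders):

# Launch a dedicated-tenancy instance into the VPC subnet, attach the
# security group and an IAM instance role, and request an encrypted data
# volume, so no long-lived AWS credentials need to be copied onto the box.
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type m4.2xlarge \
    --subnet-id subnet-0abc1234 \
    --security-group-ids sg-0abc1234 \
    --iam-instance-profile Name=analysis-worker-profile \
    --placement Tenancy=dedicated \
    --block-device-mappings '[{"DeviceName":"/dev/xvdb","Ebs":{"VolumeSize":500,"Encrypted":true,"DeleteOnTermination":true}}]' \
    --key-name analysis-keypair
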
  57. 57. Secure bootstrapping with instance UserData
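
The script from this slide is not in the transcript; a minimal sketch of the kind of UserData bootstrap it refers to, assuming Amazon Linux and an illustrative internal S3 bucket for configuration:

#!/bin/bash
# Runs once at first boot via EC2 UserData (cloud-init).

# Patch the OS before anything else runs.
yum update -y

# Install and start Docker for containerized tools.
yum install -y docker
service docker start

# Pull instance-specific setup using the IAM instance role -- no AWS keys
# are baked into the AMI or into the UserData itself.
aws s3 cp s3://bootstrap-config-bucket/worker-setup.sh /usr/local/bin/worker-setup.sh
chmod +x /usr/local/bin/worker-setup.sh
/usr/local/bin/worker-setup.sh
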
  58. 58. AWS Command Line Interface
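
The CLI commands shown on this slide are likewise not in the transcript; a few hedged examples of checks that line up with the compliance checklist, using placeholder bucket, key, and resource names:

# Confirm an object was stored with server-side encryption.
aws s3api head-object --bucket cgc-project-data --key sample.bam \
    --query ServerSideEncryption

# Find any unencrypted EBS volumes in the account.
aws ec2 describe-volumes --filters Name=encrypted,Values=false \
    --query 'Volumes[].VolumeId'

# Review the security groups attached to a worker instance.
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
    --query 'Reservations[].Instances[].SecurityGroups'
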
  59. 59. Secure and format local storage
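
The storage commands are also missing from the transcript; a minimal sketch of encrypting and formatting a scratch volume with dm-crypt, assuming the device name /dev/xvdb (for ephemeral scratch data a throwaway random key is enough, because the volume never needs to be reopened):

# Encrypt the raw device with dm-crypt using a one-time random key; once the
# mapping is closed or the instance terminates, the data is unrecoverable.
cryptsetup open --type plain --key-file /dev/urandom /dev/xvdb scratch

# Format and mount the encrypted mapping as local scratch space.
mkfs.ext4 -q /dev/mapper/scratch
mkdir -p /scratch
mount /dev/mapper/scratch /scratch
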
  60. 60. Compliance checklist ✓ AWS security ✓ VPC, security groups, encrypted storage ✓ Protect AWS credentials ✓ Protect platform credentials ✓ SOPs for OS and application updates ☐ Audit and logging of activities outside the platform ☐ Data provenance and lifecycle
  61. 61. Thank you!
  62. 62. Remember to complete your evaluations!
