Lessons Learned on How to Secure Petabytes of Data


Published in: Technology, Business

Lessons Learned on How to Secure Petabytes of Data

  1. Lessons Learned: Securing Data at Scale
     Drew Farris, Peter Guerra
     Hadoop Summit 2014
     © Copyright 2014 Booz Allen Hamilton
  3. Photo: CC BY 2.0: https://www.flickr.com/photos/atoach/5015711744
  4. Photo: CC BY 2.0: https://www.flickr.com/photos/dutchamsterdam/
  5. Who we are
     + Founded and run DC Hadoop Users Group Meetup: http://www.meetup.com/Hadoop-DC
     + Technical talks at multiple conferences: Strata, Data Science Summit, IDGA Gov Cloud Conference, Cloudera Hadoop Summit, Yahoo! Hadoop Summit, IEEE Cloud Conference, CSA Congress, Black Hat
     + Multiple client engagements over the last 7 years:
       - Defense
       - Civil and Commercial Health
       - Civil and Commercial Financial Services
       - Commercial and International
     + Booz Allen Big Data and Data Science Points-of-View
       - http://www.boozallen.com/cloud
       - http://www.boozallen.com/datascience
     + Advancing the Art of Analytics & Big Data
       - http://www.boozallen.com/insights/expertvoices/big-data
       - http://www.federalnewsradio.com/?nid=154&sid=2080808
     + Tackling Large Scale Data in Government
       - http://www.cloudera.com/blog/2010/11/tackling-large-scale-data-in-government/
     + IT Architectures for Complex Search and Information Retrieval
       - http://www.slideshare.net/cloudera/fuzzy-table-final
       - http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010
  6. Agenda
     + Securing Data in Hadoop
     + Architectural Case Study
     + What we did
     + How we did it
     + What tools we used
     + Smart Data
     + Emerging Security Capabilities
  7. Securing Data in Hadoop
  8. What are the security challenges with these architectures?
     + Data is growing exponentially, and our ability to securely store and process it is falling behind
     + Security policies haven’t kept up with the technology
     + Most security policies and tools were not written for Big Data systems, so mapping can be difficult
     + Clients are often not prepared for the security challenges when integrating multiple data sources
  9. Our approach to data security has made adoption more difficult
     + For the last 20 years we have built systems in silos: isolated data containers (databases, applications, and so forth)
     + Most organizations secure each silo individually and protect access by database
     + Most certification and accreditation programs (FISMA), PCI, HIPAA, and the SANS top 20 controls define security controls around each data silo
     + Most implemented security controls protect server, user, or network access to data
  10. Example: SANS 20, Control 15: Controlled Access on Need to Know
      Deploy data protection such as IDS, firewalls, anti-virus, HIPS, DLP, GRC…
      Wrap those around a number of Big Data technologies, most of which are based on Apache Hadoop or integrate with it:
      + Hortonworks / Cloudera Stack
      + NoSQL: MongoDB / CouchDB / Cassandra
      + BigTable (Apache Accumulo / Apache HBase)
      Distributed systems by nature have different security challenges because of their architecture
      SANS Control 15: "… the data classification system and permission baseline is the blueprint for how authentication and access of data is controlled…"
      + Step 1: An appropriate data classification system and permissions baseline applied to production data systems
      + Step 2: Access appropriately logged to a log management system
      + Step 3: Proper access control applied to portable media/USB drives
      + Step 4: Active scanner validates, checks access, and checks data classification
      + Step 5: Host-based encryption and data-loss prevention validates and checks all access requests
  11. Overview of Security Architecture
      Components
      + Infrastructure & Network
      + Encryption (at Rest & in Transit)
      + Authentication (User Principal and Device)
      + Authorization (Privileged Access Management)
      + Access Controls (Data Visibility)
      + Auditing & Monitoring of Data Access
      + Policy & Compliance
      Driving Principles
      + Start with People, Process and Culture
      + Understand the Data and the Threat
      + Start small and build
      + Never finished
  12. Apache Hadoop Security Challenges
      Scale
      + The large number of tasks presents problems with direct authentication
      HDFS / File System
      + NameNodes have ACLs, while DataNodes don’t
      Job Execution
      + Propagation of credentials to executing nodes
      Job Data
      + Task Parameters / Intermediate output accessible via HTTP
      Multi Tenancy
      + Access to Intermediate Output & Local Block Storage
      Trust of Auxiliary Services (Oozie, Hadoop clients, Hadoop Pipes/Streaming)
  13. First Hadoop release with Kerberos in 2008
      A better solution was available, not always implemented:
      + Tokens: Delegation Token, Block Access Token, Job Token
      + Symmetric Encryption == Shared Keys
      + Large Cluster = Thousands of Copies of Shared Keys
      + Performance Goals (less than 3% impact) lead to weak SASL QoP
      + Pluggable Authentication left to end-user
      + HDFS proxies for bulk transfer expose data
      Kerberos is often not implemented in favor of putting Hadoop into an enclave, but an enclave still doesn’t fully regulate access to data
      Alternatives?
      + Tahoe-LAFS. Cool, but significant performance impact
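
The shared-key token idea on this slide can be sketched roughly as follows. This is a minimal illustration and not Hadoop's actual token format: the helper names `sign_token` and `verify_token`, the token fields, the key value, and the choice of HMAC-SHA256 are all assumptions made for the example. It shows why every node that must verify a token needs a copy of the same secret.

```python
import hashlib
import hmac
import json


def sign_token(shared_key: bytes, token_id: dict) -> str:
    """Return a hex MAC over a canonical JSON encoding of the token identifier."""
    payload = json.dumps(token_id, sort_keys=True).encode("utf-8")
    return hmac.new(shared_key, payload, hashlib.sha256).hexdigest()


def verify_token(shared_key: bytes, token_id: dict, mac: str) -> bool:
    """Recompute the MAC with the same shared key and compare in constant time."""
    expected = sign_token(shared_key, token_id)
    return hmac.compare_digest(expected, mac)


# Hypothetical values, for illustration only.
shared_key = b"example-shared-secret"
token_id = {"user": "alice", "block": "blk_1001", "modes": ["READ"], "expires": 1700000000}

# The issuer signs; any verifier holding the same key can check the token.
mac = sign_token(shared_key, token_id)
```

Because verification needs the same key that signing does, a large cluster ends up holding many copies of it, which is the concern the slide raises.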
  14. Apache Hadoop 2.x Security
      Hadoop RPC
      + Clients, MapReduce Jobs, Hadoop Daemons
      + SASL with varying levels of protection (QoP):
        - Authentication, Integrity Protection and Confidentiality
      Direct TCP/IP
      + HDFS Data Transfer between Clients, DN
      + Tunnel existing protocol over SASL (HDFS-3637)
      HTTP
      + Web-UI, FSImage Operations between NN / SNN
      + HTTPS, Reloadable Java Keystore, Others
      + MAPREDUCE-4417, HADOOP-8581
  15. Architectural Case Study: Commercial Client
  16. Challenges
      + Client is a multi-national Fortune 500 company with over 100,000 employees
      + Client had multiple data sources for each business unit: R&D, Manufacturing, Sales and Marketing, Corporate
      + Client wanted to combine data, but faced many sensitive issues around new product development and access to data by third-party contractors and others within its network boundaries
      + Previous efforts to integrate data had failed because of political and technical issues
      + Could not get CISO to sign off on combining data!
  17. Securing the Enterprise Ecosystem
      Design Goals
      + Build a fully realized “Data Lake” combining information from many different sources
      + Protect from unauthorized release or modification of information
      + Focus primarily on full-text retrieval but enable a variety of analytic functions
      + Enable the use of a variety of components from the Hadoop Ecosystem
      + Implement in a series of phases based on client requirements
  18. Our Common Reference Architecture for Big Data
      [Diagram: layered reference architecture; text recovered from the slide]
      + Human Insights and Actions: enabled by customizable interfaces and visualizations of the data
      + Analytics and Services: your tools for analysis, modeling, testing, and simulations
      + Data Management: the single, secure repository for all of your valuable data
      + Infrastructure: the technology platform for storing and managing your data
      Component labels: Visualization, Reporting, Dashboards, and Query Interface; Services (SOA); Analytics and Discovery; Views and Indexes; Data Lake; Metadata Tagging; Data Sources; Infrastructure/Management; Streaming Analytics; Streaming indexes
      Other labels: Machine Learning, Free-Computation, Alerting, Geographic, Language Translation, Entity Relationship, Event, Grab, Dense/Sparse, Structured, Unstructured, Streaming, Provisioning, Deployment, Monitoring, Workflow
  19. Data Lake Platform Components & Search App. Architecture
      [Architecture diagram; labels recovered from the slide, in extraction order]
      + Layers: Extract, Distributed Storage, Distributed Analytics & Indexing, Applications & Services Layer, Presentation Layer
      + Regions: AWS Virtual Private Cloud (EC2), On-Premise Network, Enterprise Security, Monitoring, and Governance Controls
      + Sources and ingest: Non-Relational Stores, Static Relational Databases, Static Data, Streaming Data, User Uploaded Data Sets, Relational Database Triggers, Custom Ingest Logic, Sqoop, Kafka (periodic updates, low-latency updates)
      + Platform: Hadoop HDFS, Hadoop Map/Reduce, Hive, Accumulo, ZooKeeper, Storm+Lucene Processing Layer, Index Files, Index Persistence & Meta-data Management, Information Model / Hive meta-store
      + Applications: Jetty App Server, Search & BI Logic, View / UI Model, Browser App, Spotfire & Other BI Tools (interactive search, batch reporting, depending on use case)
      + Users: Front-end Client (On-Network Users), Analytic App & BI Users (On-Network), Privileged Users / Data Scientists (Direct Access)
      + Security and operations: Kerberos SSO Connector, Directory Services, On-Premise Firewall, Security Groups (FW), Network ACLs, Standard AWS Machine Images, Encrypted Data Volumes, Antivirus & System Monitoring, Knox Gateway & Audit Logging, AWS Direct Connect, Remote Access Certificate (2-way SSL), Data Governance & Stewardship, DNS, DHCP, NTP, SMTP, Proxy (package updates) Services
      Legend: Open Source Components (Green)
  20. tl;dr
      + Data Loading via Sqoop / Custom Transport
      + Ingest / Index via MapReduce
      + Distributed Query via Storm+Lucene
      + Batch / Reporting via MR / Hive
      + Authentication via Kerberos
      + Access via Web Application & Knox
      + Currently 100TB / 50% used, 150TB by EOY
  21. Infrastructure and Network Security
      + Amazon Web Services Provided
      + Virtual Private Cloud / Security Groups
      + Time to Deployment in Early Phases
      + Physical access to data centers, network isolation, etc.
      + Future Transition to On-Premise Infrastructure
      + Concerned with procurement time
      + Other clients we’ve worked with have seen a 3-6 month turnaround for infrastructure prep
      + Instance Level Malware Detection tuned to co-exist with cluster workloads
  22. Encryption
      At Rest:
      + LUKS (Linux Unified Key Setup) for Ephemeral Storage Volumes
      + “Lock it up and throw away the key”
      In Transit:
      + SSL to Web App Endpoints and Knox Gateway
      + Internal Network Isolation: VPC Controls prevent traffic interception & MITM attacks
  23. Authentication and Authorization
      + Authentication via Kerberos
      + Authorization via LDAP
      + Future transition to enterprise authentication services: Oracle IAM
      + Multi-factor Authentication for both Users and Devices via PKI
      + Authorization performed at both the User and Device Level
  24. Privileged Access Management
      + Operating System user accounts and groups for users, projects and teams are reflected in HDFS permissions
      + Privileged access is via a Knox Gateway extension that provides access via SSH, plus auditing, monitoring and control of administrative connections into the cluster (KNOX-250)
      [Diagram labels: Identity Provider; Knox Gateway; Hadoop Cluster (Master) (Oozie) (Hive2 Server); External Sources; REST/SSL; SSH; HTTP SPNEGO]
  25. Putting it All Together
      + Search UI is a web application accessed via SSL
      + Knox is the primary access mechanism for users who need access to the cluster. Knox provides access to the following services:
      + WebHDFS, WebHCat, Hive, Oozie
      + Knox for administrative access, via custom SSH plugin
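
As an illustration of cluster access through the gateway, the sketch below builds a WebHDFS URL routed through Knox. It assumes the URL layout described in the Apache Knox documentation (`https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs/v1/...`). The host name, port, gateway path, topology name, and HDFS path are hypothetical, and the helper `knox_webhdfs_url` is not part of any library. No network call is made.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(host, port, gateway_path, topology, hdfs_path, op):
    """Build a WebHDFS REST URL that goes through a Knox gateway (no request is sent)."""
    encoded_path = quote(hdfs_path, safe="/")  # keep "/" separators, escape the rest
    query = urlencode({"op": op})
    return (
        f"https://{host}:{port}/{gateway_path}/{topology}"
        f"/webhdfs/v1{encoded_path}?{query}"
    )


# Hypothetical deployment values, for illustration only.
url = knox_webhdfs_url(
    host="knox.example.com",
    port=8443,
    gateway_path="gateway",
    topology="default",
    hdfs_path="/data/lake/raw",
    op="LISTSTATUS",
)
```

In this arrangement a client only ever addresses the gateway endpoint over SSL, so cluster host names stay behind the gateway and requests can be authenticated and audited in one place.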
  26. Future Directions
      + Role-Based Access Control is an emerging client need. This will require:
      + Integration with enterprise role management
      + Passing roles through Web App & Knox to backend
      + Role-based access in Accumulo, Lucene Indexes
      + Smart Data Tagging Strategy …
  27. Smart Data
  28. Smart Data
      + How many organizations have data security requirements?
      + A structured, verifiable representation of security tags bound to the data is required in order for the enterprise to become inherently "smarter" about the information flowing in and around it: Smart Data
      + Overview of design principles:
      + PKI
      + Implement ABAC controls in IdAM
      + Define trusted data format based on data security
      + Tag all your data
      + Deploy Hadoop platform that leverages tags to track access
      + Log, monitor, and audit everything
  29. Overview of Smart Data
      [Diagram labels: Data Element; Visibility Tags (red | blue | green); Authorization; Authentication; Attributes (red, orange, blue); IDAM; User; Apache Accumulo]
      [Background labels from the reference architecture: Machine Learning, Free-Computation, Alerting, Geographic, Language Translation, Entity Relationship, Event, Grab, Dense/Sparse, Structured, Unstructured, Streaming, Provisioning, Deployment, Monitoring, Workflow, Streaming Analytics, Streaming indexes]
  30. Smart Data Security Controls
      + Trusted Client: Authorization and Authentication using PKI
      + Trusted Data Format: Data visibility is controlled using Boolean expressions
      + Ex. “((red|blue|green) & (white|yellow))”
      + Clients present Authorizations (red, blue, green, yellow) to Apache Accumulo
      + Corresponding tags are bound to data stored in Apache Accumulo
      + Trusted Log: All data interactions are logged and audited
      Identity and Access Management
      + Attribute Based Access Control: Users are all assigned a series of attributes
      + Attributes and Authorization Bound by XACML, SAML
      + Policy Decision Point (PDP)
      + Policy Enforcement Point (PEP)
      + Policy Retrieval Point (PRP)
      + Policy Information Point (PIP)
      + Policy Administration Point (PAP)
      Example policy:
        Allow access to resource MedicalJournal with attribute patientID=x
          if Subject match DesignatedDoctorOfPatient and action is read
          with obligation
            on Permit: doLog_Inform(patientID, Subject, time)
            on Deny: doLog_UnauthorizedLogin(patientID, Subject, time)
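
To make the visibility expression on this slide concrete, here is a small evaluator for expressions built from labels, `&`, `|`, and parentheses. It is a simplified sketch and not Accumulo's implementation: the function names `tokenize` and `is_visible` are hypothetical, labels are limited to letters, digits, and underscores, an empty expression is treated as visible to everyone, and mixing `&` and `|` without parentheses is rejected as a simplifying assumption.

```python
import re

# A token is a label (letters, digits, underscore) or one of ( ) & |
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[()&|])")


def tokenize(expr):
    """Split a visibility expression into labels and operator tokens."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expr, authorizations):
    """Return True if the caller's authorizations satisfy the expression."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # assumption: untagged data is visible to everyone
    auths = set(authorizations)
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in auths  # a label is satisfied if the caller holds it

    def parse_expr():
        nonlocal pos
        value = parse_term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = parse_term()
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


slide_expr = "((red|blue|green) & (white|yellow))"
full_access = is_visible(slide_expr, ["red", "blue", "green", "yellow"])  # True
red_only = is_visible(slide_expr, ["red"])  # False: no white or yellow
```

The second call shows the point of the conjunction: holding one of red, blue, or green is not enough unless the client also presents white or yellow.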
  31. Tagging Smart Data
      Formulate the tags used to control data from multiple perspectives
      + Data Origin
      + Level of Access Required
      + Information Governance Policy
      + Data Owners
      + Intended Recipients
      Use fine-grained tags, assign users many roles
      + Tag at the field level so that existence can be verified without revealing the full data record
      In Accumulo:
      + Capitalize on the richness of Boolean expressions in visibility tags
      + Differential compression eliminates the impact of repetition of data
      + Visibility tags are bound to the data, so changing visibilities is not trivial: it means a delete and a re-add
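
The last two points on this slide can be sketched with a toy in-memory store. In Accumulo the column visibility is part of the key, which is why re-tagging is a delete followed by a re-add. Everything else here is an assumption for illustration: the `store` dictionary, the helper names, the sample record, and the restriction to single-label tags instead of full Boolean expressions.

```python
# Hypothetical stand-in for a tagged key-value store.
# The visibility tag is part of the key: (row, field, visibility) -> value
store = {}


def put(row, field, visibility, value):
    """Write one field of a record under a single visibility label."""
    store[(row, field, visibility)] = value


def change_visibility(row, field, old_visibility, new_visibility):
    """Re-tagging a field is a delete of the old entry followed by a re-add."""
    value = store.pop((row, field, old_visibility))  # delete
    store[(row, field, new_visibility)] = value      # re-add


def scan(row, authorizations):
    """Return only the fields of a row whose tag the caller is authorized for."""
    return {
        field: value
        for (r, field, visibility), value in store.items()
        if r == row and visibility in authorizations
    }


# Field-level tagging: the record's existence is visible to RESEARCH,
# while the sensitive fields carry their own tags.
put("patient-17", "exists", "RESEARCH", True)
put("patient-17", "name", "PII", "Jane Doe")
put("patient-17", "diagnosis", "HIPAA", "example diagnosis")

research_view = scan("patient-17", {"RESEARCH"})
```

Here a caller holding only RESEARCH can confirm that the record exists without seeing the name or diagnosis fields.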
  32. Representational versus Referential Tags
      Representational tags encode the specific visibilities they represent, including all alternate controls for a specific document
      + User has roles of ACCOUNTING, RESEARCH and PII
      + If data has tag PII&RESEARCH, user can access data
      + If data has tag HIPAA&ACCOUNTING, user can’t access data
      Referential tags are a code that relies on external translation between assigned access controls and visibility markings
      + Data has marking of 03DECAF00D
      + User has roles of ACCOUNTING, RESEARCH and PII
      + At lookup, user roles are translated into possible referential tags
      Choice depends on security posture: what are the consequences of getting it wrong versus the ease of shifting policy or data?
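
The difference between the two tagging styles can be sketched as follows. This is an illustrative model only: the helper names are hypothetical, tags are restricted to simple conjunctions of roles, and the translation table entry that maps 03DECAF00D to PII and RESEARCH is invented for the example (the slide does not say what the code stands for).

```python
def satisfies_all(required_roles, user_roles):
    """A conjunction such as PII&RESEARCH is met when the user holds every role."""
    return set(required_roles) <= set(user_roles)


user_roles = {"ACCOUNTING", "RESEARCH", "PII"}

# Representational: the tag stored with the data spells out the policy itself.
can_read_pii_research = satisfies_all(["PII", "RESEARCH"], user_roles)        # True
can_read_hipaa_accounting = satisfies_all(["HIPAA", "ACCOUNTING"], user_roles)  # False

# Referential: the data carries an opaque code; policy lives in an external table.
# The mapping below is a hypothetical example.
policy_table = {"03DECAF00D": ["PII", "RESEARCH"]}


def allowed_codes(roles, table):
    """At lookup time, translate the user's roles into the codes they may read."""
    return {code for code, required in table.items() if satisfies_all(required, roles)}


codes_before = allowed_codes(user_roles, policy_table)

# Shifting policy only touches the table; the stored data keeps its marking.
policy_table["03DECAF00D"] = ["HIPAA"]
codes_after = allowed_codes(user_roles, policy_table)
```

With referential tags a policy change is a table update, while with representational tags the same change would mean rewriting the tag on every affected data element. The cost is that the data is only as safe as the translation table.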
  33. Emerging Security Capabilities
  34. Ecosystem of security capabilities for Hadoop is growing rapidly
      Cloudera (with Intel Rhino)
      + Sentry (ACLs for Hive / Impala)
      + Gazzang (Filesystem Encryption)
      + Intel Rhino
        - Encryption Codec Support (HADOOP-9331)
        - Key Distribution & Management (MAPREDUCE-5025)
        - Token Based Authentication (HADOOP-9392)
        - Unified Authorization Framework (HADOOP-9466)
        - Transparent Encryption for HBase / ZooKeeper
        - Others, see https://github.com/intel-hadoop/project-rhino/
      Hortonworks
      + Production Ready Apache Knox
      + XA Secure
        - Central Administration
        - Authorization for HDFS / Hive / HBase
        - Compliance Controls
      Lots of talks at this Hadoop Summit on data security:
      + The Future of Hadoop Security (Joey Echeverria)
      + Hadoop REST API Security with the Apache Knox Gateway (Kevin Minder, Larry McCay)
      + Securing Big Data: Lock it Down, or Liberate? (Jeff Graham, Mark Tomallo)
      + Improvements in Hadoop Security (Sanjay Radia, Chris Nauroth)
  35. Summary
      + Security for Hadoop has come a long way and is changing rapidly, but is still maturing
      + Securing the data in Hadoop means thinking differently about the architecture when combining multiple data sources
      + Your Hadoop Architecture should provide consistent security mechanisms across all of the data
      + A more complete way to secure data is to implement Smart Data (ABAC and Fine Grained Access Controls), but this hasn’t been embraced consistently across the Hadoop ecosystem yet
      + The next 6 months will be interesting …
  36. Just Released! The Field Guide to Data Science
      + 120-page e-book of data science geekery
      + Download for free: http://www.boozallen.com/datascience
      Thanks! Drew (@drewfarris), Peter (@petrguerra)