Hadoop World 2011: Security Considerations for Hadoop Deployments - Jeremy Glesner & Richard Clayton - Berico Technologies


Security in a distributed environment is a growing concern for most industries. Few face security challenges like the Defense Community, who must balance complex security constraints with timeliness and accuracy. We propose to briefly discuss the security paradigms defined in DCID 6/3 by NSA for secure storage and access of data (the “Protection Level” system). In addition, we will describe the implications of each level on the Hadoop architecture and various patterns organizations can implement to meet these requirements within the Hadoop ecosystem. We conclude with our “wish list” of features essential to meet the federal security requirements.

Speaker notes
  • Emphasis here is on protecting against the outsider threat. The real problem with this architecture is the insider threat; there is no strong authorization or delegation of credentials.
  • Another way of looking at the same analytic architecture. Isolating Hadoop in a layered VLAN approach. Emphasis is on securing against the outsider threat. While this architecture protects against an outsider threat, it doesn't work so well to protect against the insider threat, or inadvertent disclosure of information that users are not privileged to see. This model works well enough when users all have the same privileges to see the data, but it starts to break down when mixing data of different sensitivities. The benefit of cloud computing is to ingest and analyze massive data sets, and ultimately, the Federal Sector wanted to realize this benefit. But that meant mixing data of varying sensitivities.
  • Meanwhile, and as a result of the increased trend toward big data solutions, the Federal Sector is trending toward NIST standards/guidelines. When data was isolated in silos, the security aspects focused on user privileges as they relate to data sensitivity. However, with data aggregation comes increased emphasis on data confidentiality and integrity. In clouds, you know you're going to have users of mixed privilege, so the trend must shift away from users onto the data. DCID 6/3 was about people and privileges; NIST is a shift towards systems and data.
  • Shift away from focus on user privilege to an emphasis on system/data integrity. Total security approach that deals as much with insider as outsider threat scenarios. Moderate to High security emphasizes a complete lockdown of the OS, container, ports/protocols, data integrity (rest/motion), data confidentiality (rest/motion), user roles/responsibilities, and user privilege. Agencies are encouraged to use this as a foundation and extend the requirements to suit their needs. Single-instance (RDBMS) data silos were much easier to secure and test; in that model, there were only a handful of machines. Clouds introduce an enormous risk because massive amounts of data (at mixed sensitivities) are now co-located across tens, hundreds or thousands of machines in a single cluster. The insider/outsider risk just exploded: greater risk associated with the sensitivity of data aggregation, and more machines to secure. While it's entirely possible to continue down the path of using ACLs and Kerberos to secure the system and protect the data, it's more complicated and less elegant at scale. As data volumes increase, there's more risk. The only way to ensure integrity and confidentiality is through encryption.
  • Analysis = big data begets even BIGGER data. When you put all the data in one place, you want to analyze it. You want to run text analytics such as NLP, SUMO, WordNet synsets, co-occurrence, near frequency, collective intelligence algorithms, recommender systems, etc.
  • Under FIPS 200, an information system is categorized as low-impact, moderate-impact or high-impact based on an assessment of its most sensitive characteristic (Confidentiality, Integrity, Availability). The trend is toward more security not less!
  • Why can’t we just encrypt whole files and stick them into Hadoop?
  • Two approaches: (1) create a new file format and try to reengineer Hadoop to use that format when reading and writing data, or (2) create a CompressionCodec for use in a SequenceFile. Since implementing our own file format would be a lot of work (for not a whole lot of gain), we will only talk about extending the SequenceFile and implementing CompressionCodecs.
  • Note: you will need to actually implement the Compressor and Decompressor as well. This is actually not a whole lot of work; Spotify has already done this with their hadoop-openpgp-codec project on GitHub.
  • The file extension is important because Hadoop will use it to determine which codec to use to perform the decompression. The actual extension comes from the CompressionCodec method getDefaultExtension(). We also need to make sure we tell the JobConf to use the compression codec as the output of every MapReduce job.
  • Key management. Property files: please consider encrypting the value of the key in the property file; the Apache Camel project with its Jasypt plugin can help with this. Environment property: one thing to consider is creating initialization scripts in which a privileged user (say, a security manager) sets an environment variable that contains the key; when the process terminates, unset the environment variable. JobConf: covered on the next slide.
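    For the environment-variable and property-file options above, the key lookup inside the codec can be as small as the following sketch (the variable and property names here are illustrative assumptions, not taken from the talk):

      // Resolve the data-encryption key: prefer an environment variable set by a
      // privileged init script, fall back to a (preferably encrypted) property.
      public final class KeyResolver {

        private KeyResolver() {}

        public static String resolveKey() {
          String key = System.getenv("HADOOP_DATA_KEY");   // hypothetical variable name
          if (key == null || key.isEmpty()) {
            key = System.getProperty("secret.key");        // hypothetical fallback property
          }
          if (key == null) {
            throw new IllegalStateException("No encryption key available to the codec");
          }
          return key;
        }
      }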
  • Hadoop offers another alternative via the Configurable interface. If the CompressionCodec implements the interface, it will receive the configuration shortly after instantiation, from which it can pull job-specific configuration properties.
  • Extending SequenceFile will not be trivial, and it will also require the class to
  • Since there are multiple datasets in HDFS, and they need to be separately controlled, no data can enter Hadoop unprotected. For this reason, the data needs to enter Hadoop encrypted by corpus.
  • This is dangerous because anyone who has access to the key would be able to see data of all classifications in transit. Per-corpus certs would require the SSL TrustManager to maintain a ridiculous number of certificates; alternatively, the certs could potentially be passed with the job. Generating a certificate for each job would provide the greatest flexibility and security, but would require a lot of work to implement.
  • We are going to want an "envelope" to place the encrypted data in; this is essentially for when we perform decryption, since we need to know what key to use to decrypt.
  • Option one is a less complex solution; however, it will be a lot less performant as the record becomes larger or the number of keys increases. The second option simplifies this by creating a random symmetric key for the purpose of encryption and decryption and appending a header containing that symmetric key encrypted with the key of each contributing classification.
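    A minimal sketch of that second option using the JCE (class and method names here are illustrative; the algorithm choices, IV handling, and the corpus key lookup are assumptions, not from the talk):

      import java.security.PublicKey;
      import java.util.HashMap;
      import java.util.Map;
      import javax.crypto.Cipher;
      import javax.crypto.KeyGenerator;
      import javax.crypto.SecretKey;

      public class EnvelopeEncryptor {

        /** Hypothetical envelope: wrapped copies of the record key in the header, ciphertext in the body. */
        public static class EncryptedEnvelope {
          public final Map<String, byte[]> wrappedKeys;
          public final byte[] body;
          public EncryptedEnvelope(Map<String, byte[]> wrappedKeys, byte[] body) {
            this.wrappedKeys = wrappedKeys;
            this.body = body;
          }
        }

        /** Encrypt the record once, then wrap the record key for every contributing corpus. */
        public static EncryptedEnvelope seal(byte[] record, Map<String, PublicKey> corpusKeys)
            throws Exception {
          // One random symmetric key per record (or per batch of records).
          KeyGenerator keyGen = KeyGenerator.getInstance("AES");
          keyGen.init(128);
          SecretKey recordKey = keyGen.generateKey();

          // Encrypt the record body a single time with the symmetric key
          // (a production implementation would use an authenticated mode and carry an IV).
          Cipher aes = Cipher.getInstance("AES");
          aes.init(Cipher.ENCRYPT_MODE, recordKey);
          byte[] body = aes.doFinal(record);

          // Wrap the symmetric key once per classification key and place the copies in the
          // header; a holder of any one of those keys can unwrap it and decrypt the body.
          Map<String, byte[]> wrappedKeys = new HashMap<String, byte[]>();
          for (Map.Entry<String, PublicKey> e : corpusKeys.entrySet()) {
            Cipher wrapper = Cipher.getInstance("RSA");
            wrapper.init(Cipher.WRAP_MODE, e.getValue());
            wrappedKeys.put(e.getKey(), wrapper.wrap(recordKey));
          }
          return new EncryptedEnvelope(wrappedKeys, body);
        }
      }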
  • There are other options, but they are hardly viable (like encrypting key/value pairs during the Map and Reduce phases before they are collected by the OutputCollector).

Transcript

  • 1. SECURITY CONSIDERATIONS FOR FEDERAL HADOOP DEPLOYMENTS. HADOOP WORLD 2011. PRESENTER(S): JEREMY GLESNER, CHIEF TECHNOLOGY OFFICER; RICHARD CLAYTON, CHIEF ENGINEER
  • 2. OUR AGENDA. Security in a distributed environment is a growing concern for most industries. Few face security challenges like the Federal Sector, who must balance complex security constraints with timeliness and accuracy. This presentation is the culmination of a research project to implement more secure Hadoop implementations on our projects. Today, we will discuss:
    About Us. Who we are and what we do.
    Federal IT Trends. We will first discuss the environment and current trends for data consolidation within the Federal Sector.
    Security Constraints. We will briefly discuss the constraints within the Federal environment for secure storage and access of data. We'll touch on traditional security approaches.
    Use Cases. We'll present several use cases for using Hadoop within secure architectures by stepping through three use cases and touch on feature requirements.
    Closing Remarks. We conclude with our next steps and some takeaways on our road to using Hadoop more securely within the federal sector.
  • 3. WHO ARE WE?
    Jeremy Glesner, Chief Technology Officer, Berico Technologies. Jeremy is CTO at Berico, a passionate high-end software engineering, consulting and analytic services provider to the federal and commercial sectors. He is responsible for driving Berico's strategic position and efficacy in key technical areas for maximum customer benefit. Jeremy holds a B.A. in Government and a B.S. in Computer and Information Science from the University of Maryland, and is a candidate for the M.S. in Software Engineering at Drexel University.
    Richard Clayton, Chief Engineer, Berico Technologies. Richard is Chief Engineer at Berico, a passionate high-end software engineering, consulting and analytic services provider to the federal and commercial sectors. He is responsible for the efficient and effective implementation of technology across all customer projects and internal research and development efforts. Richard holds a B.S. in Computer Science from Park University and is a Cloudera Certified Hadoop Developer.
  • 4. ABOUT BERICO. BERICO TECHNOLOGIES is an operationally focused solutions and consulting provider to the public and private sectors. A veteran-owned small business, founded in 2006 by Guy Filippelli & Nick Hallam. Berico's mission is to be a first-class, game-changing technology and consulting firm that solves the hardest problems and delivers the most relevant solutions to our customers. Berico works to implement the government's future vision, and solve the information, operations, and technology challenges of today, by leveraging core competencies that include:
    • Enterprise-scale, secure application engineering and development
    • Cloud-scale distributed data processing
    • Information fusion
    • Information visualization
    • Advanced analytics
    • Full-spectrum intelligence training
  • 5. FEDERAL IT TRENDS.
    Data Deluge. Data/tool volume, variety and velocity all increasing.
    • Massive volumes of structured, semi-structured and unstructured data
    • Sparse datasets and no unified object model
    • Diverse repositories (RDBMSs, local repositories, sundry electronic file types)
    • Ever increasing speed with which the data arrives
    • Silos of capability driven by content classification, security concerns and "not built here" effects across government
    IT Consolidation. CIOs/CTOs/IT managers expect bigger returns on existing investments. Do more with less. Better metrics.
    • With the end of the Iraq and Afghan wars, the government has reduced contingency operation (GWOT) funding by 23.3% in 2012.
    • Nearly a dozen major cloud efforts underway within the Federal Sector to consolidate resources and reduce data duplication.
    • Growing emphasis on data and tool de-duplication and better knowledge management (pedigree, provenance)
    • Innovation and buzz driving IT decisions toward Platform-as-a-Service solutions (virtualization, distributed/parallel computing), with an emphasis on open-source technologies.
    = Immediate need for enterprise-ready, scalable and secure architectures to consolidate enormous volumes of diverse, sparse data sets.
  • 6. FEDERAL IT TRENDS [chart slide; no extractable text]
  • 7. TRADITIONAL SECURITY. Current approach to Hadoop security in a large-scale analytics architectural layout: isolate the Hadoop cluster. [Diagram: Services Tier, Security Gateway, Ingestion Services, Data Access Layer, Index Services, Ingest Pipeline/Buffer, Hadoop Cluster.]
  • 8. TRADITIONAL SECURITY. Boundary protection through tiered VLAN architectures: enclave strategies are geared to the outsider threat. [Diagram: firewalls separating the Universe, Extranet, Services Tier/Client Zone, Service Zone (Ingest Pipeline, indexes), and Data Zone (Hadoop Cluster).]
  • 9. SECURITY CONSTRAINTS. Trending toward NIST Standards/Guidelines.
    FISMA. The Federal Information Security Management Act requires each federal agency to develop, document, and implement an agency-wide program to provide information security for the information and information systems that support the operations and assets of the agency, including those provided or managed by another agency, contractor, or other sources. Executed through the National Institute of Standards and Technology (NIST).
    Executive Order(s). Barack Obama recently released new guidance regarding Structural Reforms to Improve the Security of Classified Networks and the Responsible Sharing and Safeguarding of Classified Information in response to WikiLeaks.
    DoD 8510.01 / DIACAP. Pertains to the Defense Department. Establishes a certification and accreditation process to manage the implementation of information assurance capabilities and services and provide visibility of accreditation decisions regarding the operation of DoD information systems. Adopting aspects of NIST.
    ICD 503. Pertains to the Intelligence Community. Establishes a policy for information technology systems security risk management, certification and accreditation. Meant to replace DCID 6/3. Depends on NIST and/or the Committee on National Security Systems (CNSS) approval.
    … NIST Standards (FIPS 200 and SP-800-53), CNSSI 1253, National Security Act of 1947, Executive Order (EO) 12333, as amended; EO 13231; EO 12958, as amended; and many more!
  • 10. SECURITY CONSTRAINTS. NIST Standards for Information System Categorization (table extracted from FIPS 199):
    Confidentiality: preserving authorized restrictions on information access and disclosure, including means for protecting personal privacy and proprietary information [44 U.S.C., Sec. 3542]. LOW: the unauthorized disclosure of information could be expected to have a limited adverse effect on organizational operations, organizational assets, or individuals. MODERATE: a serious adverse effect. HIGH: a severe or catastrophic adverse effect.
    Integrity: guarding against improper information modification or destruction, and includes ensuring information non-repudiation and authenticity [44 U.S.C., Sec. 3542]. LOW: the unauthorized modification or destruction of information could be expected to have a limited adverse effect on organizational operations, organizational assets, or individuals. MODERATE: a serious adverse effect. HIGH: a severe or catastrophic adverse effect.
    Availability: ensuring timely and reliable access to and use of information [44 U.S.C., Sec. 3542]. LOW: the disruption of access to or use of information or an information system could be expected to have a limited adverse effect on organizational operations, organizational assets, or individuals. MODERATE: a serious adverse effect. HIGH: a severe or catastrophic adverse effect.
  • 11. FORCING FUNCTIONS. Security, Analysis, Consolidation and Data Volume create divergent interests. The need to secure and scale analytic architectures is in direct opposition to the analytic capabilities that can be delivered by that architecture.
    Security: data confidentiality, data integrity, system availability.
    Analysis: natural language processing; WordNet, SUMO, co-occurrence / near frequency, clustering; collective intelligence, recommender systems.
    Consolidation: localize all data into organizational enterprise-scale clusters; mixed sensitivities; more machines to protect.
    Data Volume: data volume, variety and velocity increasing; enrichment and analysis drives content explosion.
  • 12. USE CASES & SECURITY. What are the security implications for Hadoop?
    1. Protecting Data in Hadoop from Theft or Accidental Disclosure. All users have the same privilege to the information stored within, and the data is of MODERATE confidentiality, MODERATE integrity and HIGH availability. Mission Critical Information System = {(Confidentiality, MODERATE), (Integrity, MODERATE), (Availability, HIGH)}. NIST System Classification is HIGH.
    2. Storing and Segregating Data of Mixed Sensitivities. All users have mixed privileges to the information stored within, and the data is of MODERATE confidentiality, MODERATE integrity and HIGH availability. Mission Critical Information System = {(Confidentiality, MODERATE), (Integrity, MODERATE), (Availability, HIGH)}. NIST System Classification is HIGH.
  • 13. USE CASES & SECURITY.
    3. Securing the Results of Analytics that Cross Datasets. Users have different levels of privilege to the information stored within; we want to be able to analyze data with HIGH confidentiality, integrity and availability. Mission Critical Information System = {(Confidentiality, HIGH), (Integrity, HIGH), (Availability, HIGH)}. NIST System Classification is HIGH.
  • 14. USE CASES & SECURITY. Bottom Line: long-term adoption of any technology within the Federal IT community will depend on its ability to cryptographically protect data at rest and in motion.
  • 15. USE CASE 1: PROTECTING DATA FROM THEFT OR ACCIDENTAL DISCLOSURE
  • 16. USE CASE 1: ESSENTIAL SECURITY.
    Security Objectives
    • Users cannot read data at rest or in motion unless authorized.
    Assumptions
    • All data, regardless of sensitivity, is encrypted with the same key.
    Solution
    • Single-key block encryption within Hadoop.
    • Transport encryption for HDFS and the M/R runtime.
  • 17. USE CASE 1: ESSENTIAL SECURITY. SINGLE-KEY BLOCK ENCRYPTION
  • 18. USE CASE 1: ESSENTIAL SECURITY. Limitations with Encrypting Whole Files in HDFS:
    • Unable to determine record boundaries (especially if records are split across blocks).
    • Files may be too large (to transfer to a single data node to encrypt or decrypt).
    • Performance issues (load not distributed evenly across nodes).
    [Diagram: a file larger than the block size is encrypted whole, producing an encrypted file larger than the block size that is then split into blocks and stored across Data Nodes.]
  • 19. USE CASE 1: ESSENTIAL SECURITY. Block-Level Encryption:
    • Solves the "whole file" encryption problem.
    • Each Data Node can decrypt blocks stored locally.
    • Once decrypted, records split across blocks can be resolved in the normal fashion.
    [Diagram: a file larger than the block size is split, encrypted block by block, and stored as separately encrypted blocks across Data Nodes.]
  • 20. USE CASE 1: ESSENTIAL SECURITY. How to Implement Block-Level Encryption:
    • Create a custom file format, or
    • Use a custom CompressionCodec for use in a SequenceFile:
      o Implement CompressionCodec (and the corresponding Compressor and Decompressor interfaces if needed)
      o Add the CompressionCodec to Hadoop's classpath
      o Configure Hadoop to use the CompressionCodec
  • 21. USE CASE 1: ESSENTIAL SECURITY. Implement a Custom CompressionCodec:
    public interface CompressionCodec {
      CompressionOutputStream createOutputStream(OutputStream out) throws IOException;
      CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor) throws IOException;
      Class<? extends Compressor> getCompressorType();
      Compressor createCompressor();
      CompressionInputStream createInputStream(InputStream in) throws IOException;
      CompressionInputStream createInputStream(InputStream in, Decompressor decompressor) throws IOException;
      Class<? extends Decompressor> getDecompressorType();
      Decompressor createDecompressor();
      String getDefaultExtension();
    }
    Spotify's Hadoop OpenPGP CompressionCodec using BouncyCastle: https://github.com/spotify/hadoop-openpgp-codec
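    To make the codec approach concrete, the heart of such an implementation is a CompressionOutputStream that pipes data through a javax.crypto cipher. The following is a minimal sketch under assumed names and key/IV handling (not the presenters' implementation); a full codec must still supply the matching input stream, the Compressor/Decompressor types, and getDefaultExtension():

      package com.berico.hadoop;  // hypothetical package, following the slides' examples

      import java.io.IOException;
      import java.io.OutputStream;
      import java.security.GeneralSecurityException;
      import javax.crypto.Cipher;
      import javax.crypto.CipherOutputStream;
      import javax.crypto.spec.IvParameterSpec;
      import javax.crypto.spec.SecretKeySpec;
      import org.apache.hadoop.io.compress.CompressionOutputStream;

      /** Encrypting stream a block-encryption codec could return from createOutputStream(). */
      public class AesEncryptingStream extends CompressionOutputStream {

        private final CipherOutputStream cipherOut;

        public AesEncryptingStream(OutputStream out, byte[] key, byte[] iv) throws IOException {
          super(out);
          try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            // A real codec would also write the IV into the stream header so it can decrypt later.
            this.cipherOut = new CipherOutputStream(out, cipher);
          } catch (GeneralSecurityException e) {
            throw new IOException("Unable to initialize AES cipher", e);
          }
        }

        @Override
        public void write(int b) throws IOException {
          cipherOut.write(b);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
          cipherOut.write(b, off, len);
        }

        @Override
        public void finish() throws IOException {
          // CipherOutputStream only emits the final padded block on close(), so this
          // sketch closes it here; note that this also closes the wrapped stream.
          cipherOut.close();
        }

        @Override
        public void resetState() throws IOException {
          // A full implementation would re-initialize the cipher here (e.g. with a fresh IV).
        }
      }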
  • 22. USE CASE 1: ESSENTIAL SECURITY. Configure Hadoop to Use the Custom CompressionCodec:
    • Register the file extension (e.g.: "*.berico") by adding the custom CompressionCodec class to Hadoop's configuration property:
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.berico.hadoop.AES256Codec</value>
      <description>A list of the compression codec classes that can be used for compression/decompression.</description>
    </property>
    • Use the JobConf to set the output codec:
    FileOutputFormat.setOutputCompressorClass(jobConf, com.berico.hadoop.AES256Codec.class);
  • 23. USE CASE 1: ESSENTIAL SECURITY. Encryption Gotchas:
    Temporary Data
    • Temporary data (created in the Mapper, Combiner, and Reducer) must also be encrypted.
      o Use "setMapOutputCompressorClass" on the JobConf:
        jobConf.setMapOutputCompressorClass(com.berico.hadoop.AES256Codec.class);
    Key Management
    • How does my CompressionCodec know what key to use?
      o Property file
      o Environment property
      o Parameter of the JobConf
  • 24. USE CASE 1: ESSENTIAL SECURITY. Passing Key(s) via JobConf:
    • On the CompressionCodec implementation:
      o Implement org.apache.hadoop.conf.Configurable:
        public interface Configurable {
          void setConf(Configuration conf);
          Configuration getConf();
        }
      o Extract the key from the Configuration object:
        String secret = configuration.get("secret.key");
    • Set the encryption/decryption key on the JobConf:
      jobConf.set("secret.key", "fruit loops");
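    Put together, the receiving side of that key hand-off might look roughly like the sketch below (deliberately abstract; the CompressionCodec stream methods and the key-to-cipher plumbing are omitted, and the class body is an assumption rather than the presenters' code):

      package com.berico.hadoop;

      import org.apache.hadoop.conf.Configurable;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.compress.CompressionCodec;

      // Because the codec implements Configurable, Hadoop injects the job
      // Configuration shortly after instantiation; the codec then pulls its key
      // from the "secret.key" property that the driver set on the JobConf.
      public abstract class AES256Codec implements CompressionCodec, Configurable {

        private Configuration conf;
        private String secretKey;

        @Override
        public void setConf(Configuration conf) {
          this.conf = conf;
          this.secretKey = conf.get("secret.key");
        }

        @Override
        public Configuration getConf() {
          return conf;
        }

        protected String getSecretKey() {
          return secretKey;
        }
      }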
  • 25. USE CASE 1: ESSENTIAL SECURITY. TRANSPORT ENCRYPTION
  • 26. USE CASE 1: ESSENTIAL SECURITY. Implementing SSL in Hadoop:
    • HDFS and MapReduce communication in Hadoop is done through the Hadoop IPC framework: org.apache.hadoop.ipc.*.
      o The framework uses the standard javax.net.* Socket API to exchange data between clients and nodes.
      o The SSL Socket API (javax.net.ssl.*) directly extends the Socket API (e.g.: SSLSocket extends Socket).
    • It stands to reason that SSL sockets could be used to potentially replace the default socket implementation of Hadoop.
  • 27. USE CASE 1: ESSENTIAL SECURITY. Implementing SSL in Hadoop, Cont.:
    • Hadoop currently provides two SocketFactory implementations, StandardSocketFactory and SocksSocketFactory (both in org.apache.hadoop.net).
    • You can create your own by:
      o Extending javax.net.SocketFactory
      o Configuring the SocketFactory by implementing the Configurable interface (optional).
    • Configure Hadoop to use your factory:
    <property>
      <name>hadoop.rpc.socket.factory.class.default</name>
      <value>com.berico.hadoop.CustomSSLSocketFactory</value>
      <description>Here is our custom SSL Socket Factory (configurable)</description>
    </property>
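    A rough sketch of what such a factory could look like, simply delegating to the JVM's default SSLSocketFactory (the class name matches the configuration example above; keystore/truststore setup and server-side wiring are out of scope for this sketch):

      package com.berico.hadoop;

      import java.io.IOException;
      import java.net.InetAddress;
      import java.net.Socket;
      import javax.net.SocketFactory;
      import javax.net.ssl.SSLSocketFactory;

      public class CustomSSLSocketFactory extends SocketFactory {

        // Delegate to the JVM's default SSL socket factory (keystore/truststore
        // configured via the standard javax.net.ssl.* system properties).
        private final SSLSocketFactory delegate =
            (SSLSocketFactory) SSLSocketFactory.getDefault();

        @Override
        public Socket createSocket() throws IOException {
          // Hadoop's IPC client generally asks for an unconnected socket and connects it itself.
          return delegate.createSocket();
        }

        @Override
        public Socket createSocket(String host, int port) throws IOException {
          return delegate.createSocket(host, port);
        }

        @Override
        public Socket createSocket(String host, int port, InetAddress localHost, int localPort)
            throws IOException {
          return delegate.createSocket(host, port, localHost, localPort);
        }

        @Override
        public Socket createSocket(InetAddress host, int port) throws IOException {
          return delegate.createSocket(host, port);
        }

        @Override
        public Socket createSocket(InetAddress address, int port, InetAddress localAddress, int localPort)
            throws IOException {
          return delegate.createSocket(address, port, localAddress, localPort);
        }
      }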
  • 28. USE CASE 2: STORING AND SEGREGATING DATA OF MIXED SENSITIVITIES
  • 29. USE CASE 2: MIXED SENSITIVITIES.
    Security Objectives
    • Restrict data based on its classification level.
      o Must be segregated from other datasets at rest and in motion.
      o Users are authorized only to specific datasets.
    • Hadoop administrators are restricted by classification like other users.
    Solution
    • Multi-key (per-dataset) block encryption in Hadoop.
    • Multi-classification transport encryption for HDFS and the M/R runtime.
  • 30. USE CASE 2: MIXED SENSITIVITIES. MULTI-KEY BLOCK ENCRYPTION
  • 31. USE CASE 2: MIXED SENSITIVITIES. Multi-Key Encryption Caveats:
    • Same principles as single-key block encryption except:
      o The custom CompressionCodec needs to have context when it encrypts/decrypts (e.g.: what keys do I use?).
      o Data cannot enter Hadoop mixed (multiple datasets/classifications in a single file, unprotected).
  • 32. USE CASE 2: MIXED SENSITIVITIES. Passing Context to the CompressionCodec for Decryption:
    • Two strategies:
      o Pass the correct key via the JobConf.
        • Your responsibility to know the correct key for the block.
      o Extend SequenceFile to carry encryption context in its header data and extend the CompressionCodec to accept SequenceFile "Metadata".
        • Utilize the "header" to carry a dataset identifier.
        • Choose from the user's keys passed in from the JobConf (or obtained by some other mechanism) to decrypt.
        • Pass metadata into the CompressionCodec from the SequenceFile.
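    For the first strategy, one simple convention (illustrative, not from the talk) is to pass each dataset's key as its own JobConf property and have the codec pick the right one based on the dataset identifier it is handed:

      import org.apache.hadoop.conf.Configuration;

      // Hypothetical helper: keys are passed as "secret.key.<datasetId>" properties on the
      // JobConf; the codec looks up the key for the dataset whose block it is decrypting
      // (the dataset id would come from the SequenceFile header/metadata).
      public final class DatasetKeys {

        public static String keyFor(Configuration conf, String datasetId) {
          String key = conf.get("secret.key." + datasetId);
          if (key == null) {
            throw new IllegalStateException("No key supplied for dataset " + datasetId
                + "; the caller is not authorized to read it");
          }
          return key;
        }

        private DatasetKeys() {}
      }

      // Driver side: only hand a job the keys for the datasets its user may read, e.g.
      //   jobConf.set("secret.key.corpusA", keyA);
      //   jobConf.set("secret.key.corpusB", keyB);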
  • 33. USE CASE 2: MIXED SENSITIVITIES. Controlling Data Entry:
    • In multi-classification instances of Hadoop, data must enter HDFS encrypted.
      o The Segregate-Buffer-Encrypt-Write pattern can be used to perform this task.
    [Diagram: a data source is segregated into separate corpora (A, B, C), each buffered, encrypted, and written to Hadoop (HDFS).]
  • 34. USE CASE 2: MIXED SENSITIVITIES. TRANSPORT ENCRYPTION
  • 35. USE CASE 2: MIXED SENSITIVITIES. Strategies for Multi-Classification Transport:
    • In multi-classification systems, data in transit must also be segregated (no shared keys).
    • Three strategies for transport:
      o A "system-high" SSL certificate for Hadoop.
      o A per-corpus SSL certificate.
      o Generate a temporary certificate per job.
  • 36. USE CASE 3: SECURING THE RESULTS OF ANALYTICS THAT CROSS DATASETS
  • 37. USE CASE 3: SECURE ANALYTICS.
    Security Objectives
    • Allow MapReduce operations to occur across classifications.
    • Allow the creation of new classifications by combining two or more datasets of mixed classifications.
    • Preserve the integrity of the data.
      o All derived data has the combined sensitivity of the datasets that produced it (A + B = AB).
      o Data created from the combining of two or more datasets must be segregated.
    Solution
    • Separately encrypted "records" within Hadoop.
    • A derived-data "segregation strategy".
    • Multi-classification transport encryption for HDFS and MapReduce (including mixed datasets).
  • 38. USE CASE 3: SECURE ANALYTICS. RECORD-LEVEL ENCRYPTION
  • 39. USE CASE 3: SECURE ANALYTICS. Strategies for Record-Level Encryption:
    • Two strategies:
      o Decrypt/encrypt directly in the Mapper, Reducer and Combiner.
      o Transparently decrypt/encrypt records using a custom CompressionCodec.
  • 40. USE CASE 3: SECURE ANALYTICS. Extending the CompressionCodec to Allow Record-Level Decryption/Encryption:
    • A variation of block-level encryption, except that the context of the encryption (the determinant for the key to use) is within the value of the record.
    [Diagram: output of the Mapper, Reducer, or Combiner contains records from Datasets A, B, and C; the CompressionCodec determines the correct key to use to encrypt each record, then writes the encrypted record to the block.]
  • 41. USE CASE 3: SECURE ANALYTICS. Extending the CompressionCodec to Allow Record-Level Decryption/Encryption, Cont. Steps:
    • Standardize key-value outputs to carry the information necessary for encryption (e.g.: classification).
    • Construct a "record" model (Thrift example):
      struct EncryptedRecord {
        1: i32 uid,
        2: string classification,
        3: binary encryptedBody
      }
    • Implement a CompressionCodec capable of:
      o Interpreting the classification from the data collected by the OutputCollector.
      o Reading and writing the "record" model as the output.
    • Configure the JobConf to use record compression:
      SequenceFileOutputFormat.setOutputCompressionType(jobConf, SequenceFile.CompressionType.RECORD);
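    On the MapReduce side, the same record model could also be carried as a Writable value; the following sketch mirrors the Thrift struct above (the class name and its field encoding are illustrative assumptions):

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.Writable;

      // Writable mirroring the EncryptedRecord Thrift struct: uid, classification
      // (used by the codec to pick the right key), and the encrypted body.
      public class EncryptedRecordWritable implements Writable {

        private int uid;
        private String classification;
        private byte[] encryptedBody;

        public EncryptedRecordWritable() {}

        public EncryptedRecordWritable(int uid, String classification, byte[] encryptedBody) {
          this.uid = uid;
          this.classification = classification;
          this.encryptedBody = encryptedBody;
        }

        @Override
        public void write(DataOutput out) throws IOException {
          out.writeInt(uid);
          out.writeUTF(classification);
          out.writeInt(encryptedBody.length);
          out.write(encryptedBody);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
          uid = in.readInt();
          classification = in.readUTF();
          encryptedBody = new byte[in.readInt()];
          in.readFully(encryptedBody);
        }

        public int getUid() { return uid; }
        public String getClassification() { return classification; }
        public byte[] getEncryptedBody() { return encryptedBody; }
      }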
  • 42. USE CASE 3: SECURE ANALYTICS. Controlling Data Entry with Mixed Sensitivities:
    • Record-level encryption simplifies the complexity of insertion: Encrypt-Buffer-Write.
    [Diagram: multiple data sources feed an Encrypt step, then a Buffer, then a Write into Hadoop (HDFS).]
  • 43. USE CASE 3: SECURE ANALYTICS. DERIVED-DATA SEGREGATION STRATEGIES
  • 44. USE CASE 3: SECURE ANALYTICS. Handling "Derived Data":
    • "Derived data" occurs when a process combines two or more datasets of differing classifications/sensitivities.
    • Three strategies:
      o Create a new corpus with the combined classifications, with its own set of keys.
      o Encrypt the record multiple times, once for each compartment/sensitivity (using its corresponding key).
      o Create a unique symmetric key: encrypt the data with the symmetric key, then encrypt the symmetric key with the keys of each classification/sensitivity.
  • 45. USE CASE 3: SECURE ANALYTICS. TRANSPORT ENCRYPTION
  • 46. USE CASE 3: SECURE ANALYTICS. Strategies for Mixed-Classification Transport:
    • In mixed-classification systems, data in transit must also be segregated; but unlike multi-classification systems without "mixing", the data has to be segregated within the same job.
    • Two strategies for transport:
      o A "system-high" SSL certificate for Hadoop.
      o Generate a temporary certificate per job.
  • 47. CLOSING REMARKS. Next Steps:
    Implement. Berico Technologies is actively working on building encryption capabilities into Hadoop using the methods discussed in this presentation. You can follow our progress at: http://www.bericotechnologies.com/blogs
    Search. We've only discussed our security issues within HDFS and MapReduce. The Federal IT Sector needs similar capabilities within distributed databases and indices (e.g.: encrypted HBase).
    Mature. Performance is going to be terrible with encryption, but we've got ideas to improve the implementation (CUDA, anyone?).
    Watchlist: Accumulo. A new Apache Incubator project sponsored by Doug Cutting. A real-time key/value store that incorporates cell labels to control what data is returned to the user. http://incubator.apache.org/accumulo
  • 48. THE END. QUESTIONS?
  • 49. THE END. Jeremy Glesner, jeremy@bericotechnologies.com; Richard Clayton, richard@bericotechnologies.com
  • 50. THE END. BACKUP SLIDES
  • 51. SECURITY CONSTRAINTS. What is "high security" by Federal Sector Information Assurance standards?
    Requirement | Description | Moderate | High
    AC-3 | Access Enforcement (data access enforcement mechanisms) | Basic | Basic
    AC-4 | Information Flow Management (data in motion) | Basic | Basic
    AC-6 | Least Privilege (authorized, need-to-know access for users and system processes) | Basic; System Accounts and Privileges; Access Privilege Limitations | Basic; System Accounts and Privileges; Access Privilege Limitations
    AC-16 | Security Attributes (security attributes to information in storage, in process, and in transmission) | - | -
    AC-20 | Use of External Information Systems (access by/from external info systems) | Basic; External Access by Authorized Users; Storage Media Limitations on External Systems | Basic; External Access by Authorized Users; Storage Media Limitations on External Systems
    AC-21 | User-Based Collaboration and Information Sharing (sharing information with same/higher privilege level users) | - | -
    SC-2 | Application Partitioning (isolation of administrative interfaces) | Basic | Basic
    SC-4 | Information in Shared Resources (protection of data from users of different privilege levels) | Basic | Basic
  • 52. SECURITY CONSTRAINTS.
    Requirement | Description | Moderate | High
    SC-7 | Boundary Protection (boundary, transmission outside the boundary) | Basic; Public Network Separation; Public Network Access Denial; Access Point Limitations; Manages Traffic Flow; Enclave Protection; Least Privilege; Deny Remote Access | Basic; Public Network Separation; Public Network Access Denial; Access Point Limitations; Manages Traffic Flow; Enclave Protection; Least Privilege; Deny Remote Access; Intelligent Use of Proxy Servers to Mask Network Internals
    SC-8 | Transmission Integrity (integrity of information in motion) | Basic; Encryption in Motion | Basic; Encryption in Motion
    SC-9 | Transmission Confidentiality (confidentiality of information in motion) | Basic; Encryption in Motion | Basic; Encryption in Motion
    SC-12 | Crypto Key Establishment and Management (key policy) | Basic | Basic; Availability Despite Key Loss
  • 53. SECURITY CONSTRAINTS.
    Requirement | Description | Moderate | High
    SC-13 | Use of Cryptography (encryption at rest/motion) | Basic | Basic
    SC-16 | Transmission of Security Attributes (exchanging attributes between information systems) | - | -
    SC-17 | PKI Certificates (certificate policy) | Basic | Basic
    SC-28 | Protection of Information at Rest (encryption at rest) | Basic | Basic
    SI-7 | Software and Information Integrity (monitors software/data integrity using CRCs, parity checks, cryptographic hashes) | Basic; Integrity Scans | Basic; Integrity Scans; Auditing; Alerting
    = Information Assurance guidelines make it very tough to consolidate information without strict guarantees that information is secure.