Hw09 Map Reduce Over Tahoe A Least Authority Encrypted Distributed Filesystem


Published in: Technology
  • Correction for slide 10: should be "Expansion factor of data is N/K (default is 10/3, or ~3.3)"

  1. MapReduce over Tahoe
     Aaron Cordova, Associate, Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, MD 20701
     Hadoop World NYC 2009, New York, Oct 1, 2009
  2. MapReduce over Tahoe
     • Impact of data security requirements on large scale analysis
     • Introduction to Tahoe
     • Integrating Tahoe with Hadoop's MapReduce
     • Deployment scenarios, considerations
     • Test results
  3. Features of Large Scale Analysis
     • As data grows, it becomes harder and more expensive to move ("massive" data)
     • The more data sets are located together, the more valuable each is (network effect)
     • Bring computation to the data
  4. Data Security and Large Scale Analysis
     • Each department within an organization has its own data
     • Some data need to be shared
     • Others are protected
     [Diagram: department data sets labeled CRM, Product, Sales, Testing]
  5. Data Security
     • Because of security constraints, departments tend to set up their own data storage and processing systems independently
     • This includes support staff
     • Highly inefficient
     • Analysis across datasets is impossible
     [Diagram: four separate stacks, each with its own Support, Storage, Processing, and Apps layers]
  6. "Stovepipe Effect"
  7. Tahoe - A Least Authority File System
     • Release 1.5
     • Included in Ubuntu Karmic Koala
     • Open Source
  8. Tahoe Architecture
     • Data originates at the client, which is trusted
     • Client encrypts, segments, and erasure-codes data
     • Segments are distributed to storage nodes over encrypted links
     • Storage nodes see only encrypted data (received over SSL links) and are not trusted
     [Diagram: Client connected to Storage Servers]
  9. Tahoe Architecture Features
     • AES Encryption
     • Segmentation
     • Erasure-coding
     • Distributed
     • Flexible Access Control
  10. Erasure Coding Overview
      • Only k of n segments are needed to recover the file
      • Up to n-k machines can fail, be compromised, or act maliciously without data loss
      • n and k are configurable, and can be chosen to achieve desired availability
      • Expansion factor of data is n/k (default is 10/3, or ~3.3)
      [Diagram: file split into K segments, encoded into N]
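
The "k of n" property can be illustrated with the simplest possible erasure code: two data shares plus one XOR parity share (k=2, n=3). This is a toy sketch only, not the coding scheme Tahoe uses; the function names `encode` and `decode` are made up for this example.

```python
from typing import Dict, List

# Toy 2-of-3 erasure code using XOR parity. Illustrative only: the
# function names are hypothetical and this is not Tahoe's actual scheme.

def encode(data: bytes) -> List[bytes]:
    """Split data into k=2 halves and add one XOR parity share (n=3)."""
    if len(data) % 2:
        data += b"\x00"  # pad to even length; the caller keeps the true length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def decode(shares: Dict[int, bytes], length: int) -> bytes:
    """Rebuild the data from any 2 of the 3 shares, keyed by share index."""
    if len(shares) < 2:
        raise ValueError("need at least k=2 shares")
    if 0 in shares and 1 in shares:
        a, b = shares[0], shares[1]
    elif 0 in shares:
        # Second half is missing: recover it from first half XOR parity.
        a = shares[0]
        b = bytes(x ^ y for x, y in zip(a, shares[2]))
    else:
        # First half is missing: recover it from second half XOR parity.
        b = shares[1]
        a = bytes(x ^ y for x, y in zip(b, shares[2]))
    return (a + b)[:length]

message = b"hello tahoe"
shares = encode(message)

# Lose any single share (n - k = 1) and the message is still recoverable.
for lost in range(3):
    surviving = {i: s for i, s in enumerate(shares) if i != lost}
    assert decode(surviving, len(message)) == message
```

Here three shares of half the (padded) message size are stored, so the expansion factor is n/k = 3/2. Larger n and k need a more general code than XOR parity, but the storage and recovery trade-off follows the same pattern.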
  11. Flexible Access Control
      • Each file has a Read Capability and a Write Capability
      • These are decryption keys
      • Directories have capabilities too
      [Diagram: a File with its ReadCap and WriteCap; a Dir with its ReadCap and WriteCap]
  12. Flexible Access Control
      • Access to a subset of files can be done by:
        - creating a directory
        - attaching files
        - sharing read or write capabilities of the dir
      • Any files or directories attached are accessible
      • Any outside the directory are not
      [Diagram: a Dir with several Files attached, shared via its ReadCap]
  13. Access Control Example
      • Each department can access their own files
      [Diagram: files grouped under directories /Sales and /Testing]
  14. Access Control Example
      • Each department can access their own files
      [Diagram: files grouped under directories /Sales and /Testing]
  15. Access Control Example
      • Files that need to be shared can be linked to a new directory, whose read capability is given to both departments
      [Diagram: files grouped under directories /Sales, /New Products, and /Testing]
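
The directory-sharing pattern in the access control slides can be sketched with a toy in-memory model in which a capability is just an unguessable token. Everything here is hypothetical: the `Grid` class and its methods are invented for illustration, and the model ignores encryption and the read/write distinction that real Tahoe capabilities carry.

```python
import secrets

class Grid:
    """Toy stand-in for a capability-based store (hypothetical, not Tahoe's API)."""

    def __init__(self):
        # Maps a capability token to file bytes or to a directory dict.
        self._objects = {}

    def put_file(self, content: bytes) -> str:
        cap = "file-" + secrets.token_hex(8)
        self._objects[cap] = content
        return cap

    def make_dir(self) -> str:
        cap = "dir-" + secrets.token_hex(8)
        self._objects[cap] = {}
        return cap

    def attach(self, dir_cap: str, name: str, child_cap: str) -> None:
        """Link an existing file or directory into a directory under a name."""
        self._objects[dir_cap][name] = child_cap

    def reachable_files(self, cap: str) -> set:
        """Contents of every file reachable from cap (assumes no directory cycles)."""
        obj = self._objects[cap]
        if isinstance(obj, bytes):
            return {obj}
        found = set()
        for child_cap in obj.values():
            found |= self.reachable_files(child_cap)
        return found

grid = Grid()
sales, testing, shared = grid.make_dir(), grid.make_dir(), grid.make_dir()

forecast = grid.put_file(b"sales forecast")
leads = grid.put_file(b"sales leads")
results = grid.put_file(b"test results")

grid.attach(sales, "forecast", forecast)
grid.attach(sales, "leads", leads)
grid.attach(testing, "results", results)

# Link one file from each department into the shared directory.
grid.attach(shared, "forecast", forecast)
grid.attach(shared, "results", results)
```

Handing someone the `shared` token exposes only the two linked files; `leads` stays reachable solely through the `sales` directory.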
  16. Hadoop Can Use The Following File Systems
      • HDFS
      • CloudStore (KFS)
      • Amazon S3
      • FTP
      • Read-only HTTP
      • Now, Tahoe!
  17. Hadoop File System Integration HowTo
      • Step 1. Locate your favorite file system's API
      • Step 2. Subclass FileSystem (found in /src/core/org/apache/hadoop/fs/)
      • Step 3. Add lines to core-site.xml: <name>fs.lafs.impl</name> <value>your.class</value>
      • Step 4. Test using your favorite Infrastructure Service Provider
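
Step 3 works because Hadoop maps a URI scheme to an implementation class through a configuration key of the form fs.<scheme>.impl. The lookup can be sketched roughly as follows; the Python helper names are invented for illustration, the real logic lives in Hadoop's Java FileSystem class, and "your.class" is the placeholder from the slide.

```python
from urllib.parse import urlparse

def impl_key(uri: str) -> str:
    """Build the configuration key consulted for a URI's scheme."""
    scheme = urlparse(uri).scheme
    if not scheme:
        raise ValueError(f"no scheme in URI: {uri!r}")
    return f"fs.{scheme}.impl"

def resolve_filesystem_class(uri: str, conf: dict) -> str:
    """Return the class name configured for the URI's scheme (hypothetical helper)."""
    key = impl_key(uri)
    if key not in conf:
        raise LookupError(f"no file system configured for {key}")
    return conf[key]

# Stand-in for the properties loaded from core-site.xml.
conf = {"fs.lafs.impl": "your.class"}

class_name = resolve_filesystem_class("lafs://localhost/data/input", conf)
```

With the property in place, any path beginning with lafs:// is routed to the configured FileSystem subclass.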
  18. Hadoop Integration: MapReduce
      • One Tahoe client is run on each machine that serves as a MapReduce Worker
      • On average, clients communicate with k storage servers
      • Jobs are limited by aggregate network bandwidth
      • MapReduce workers are trusted, storage nodes are not
      [Diagram: Hadoop MapReduce Workers connected to Storage Servers]
  19. Hadoop-Tahoe Configuration
      • Step 1. Start Tahoe
      • Step 2. Create a new directory in Tahoe, note the WriteCap
      • Step 3. Configure core-site.xml thus:
        - fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS
        - lafs.rootcap: $WRITE_CAP
        - lafs://localhost
      • Step 4. Start MapReduce, but not HDFS
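
As a rough sketch of Step 3, the two named properties can be rendered into Hadoop's property-file layout with a few lines of Python. The helper `build_core_site` is hypothetical, and `$WRITE_CAP` is kept as the slide's placeholder for the capability noted in Step 2. The slide does not say which property holds lafs://localhost, so that entry is left out here rather than guessed.

```python
from xml.sax.saxutils import escape

def build_core_site(properties: dict) -> str:
    """Render name/value pairs as a Hadoop-style configuration document."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"

write_cap = "$WRITE_CAP"  # placeholder: substitute the WriteCap from Step 2

core_site_xml = build_core_site({
    "fs.lafs.impl": "org.apache.hadoop.fs.lafs.LAFS",
    "lafs.rootcap": write_cap,
})
print(core_site_xml)
```

Because the root capability grants write access to the whole directory tree, the resulting file should be protected like any other secret.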
  20. Deployment Scenario - Large Organization
      • Within a datacenter, departments can run MapReduce jobs on discrete groups of compute nodes
      • Each MapReduce job accesses a directory containing a subset of files
      • Results are written back to the storage servers, encrypted
      [Diagram: Sales and Audit groups of MapReduce Workers / Tahoe Clients, sharing one set of Storage Servers]
  21. Deployment Scenario - Community
      • If a community uses a shared data center, different organizations can run discrete MapReduce jobs
      • Perhaps most importantly, when results are deemed appropriate to share, access can be granted simply by sending a read or write capability
      • Since the data are all co-located already, no data needs to be moved
      [Diagram: FBI and Homeland Sec groups of MapReduce Workers / Tahoe Clients, sharing one set of Storage Servers]
  22. Deployment Scenario - Public Cloud Services
      • Since storage nodes require no trust, they can be located at a remote location, e.g. within a cloud service provider's datacenter
      • MapReduce jobs can be done this way if bandwidth to the datacenter is adequate
      [Diagram: Storage Servers inside the Cloud Service Provider; MapReduce Workers / Tahoe Clients outside it]
  23. Deployment Scenario - Public Cloud Services
      • For some users, everything could be run remotely in a service provider's data center
      • There are a few caveats and additional precautions in this scenario:
      [Diagram: Storage Servers and MapReduce Workers / Tahoe Clients both inside the Cloud Service Provider]
  24. Public Cloud Deployment Considerations
      • Store configuration files in memory
      • Encrypt / disable swap
      • Encrypt spillover
      • Must trust memory / hypervisor
      • Trust service provider disks
  25. HDFS and Linux Disk Encryption Drawbacks
      • At most one key per node: no support for flexible access control
      • Decryption done at the storage node rather than at the client: still have to trust storage nodes
  26. Tahoe and HDFS - Comparison

      Feature            HDFS               Tahoe
      Confidentiality    File permissions   AES encryption
      Integrity          Checksum           Merkle hash tree
      Availability       Replication        Erasure coding
      Expansion factor   3x                 3.3x (n/k)
      Self-healing       Automatic          Automatic
      Load-balancing     Automatic          Planned
      Mutable files      No                 Yes
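
The availability and expansion-factor rows can be compared with a little arithmetic by treating 3x replication as a 1-of-3 code and Tahoe's default as a 3-of-10 code. The sketch below assumes one share per server and independent server failures, which real deployments only approximate; the 10% failure probability is an arbitrary illustrative value, not a measurement.

```python
from math import comb

def expansion_factor(n: int, k: int) -> float:
    """Stored bytes per byte of original data."""
    return n / k

def tolerated_losses(n: int, k: int) -> int:
    """How many of the n shares can be lost while k still remain."""
    return n - k

def survival_probability(n: int, k: int, p_fail: float) -> float:
    """Probability that at least k of n servers survive, assuming independent failures."""
    p_ok = 1.0 - p_fail
    return sum(comb(n, i) * p_ok**i * p_fail**(n - i) for i in range(k, n + 1))

P_FAIL = 0.10  # assumed per-server failure probability, for illustration only

for label, n, k in [("3x replication", 3, 1), ("3-of-10 erasure coding", 10, 3)]:
    print(
        f"{label}: expansion {expansion_factor(n, k):.2f}x, "
        f"tolerates {tolerated_losses(n, k)} losses, "
        f"loss probability {1 - survival_probability(n, k, P_FAIL):.2e}"
    )
```

Under these assumptions the erasure-coded layout tolerates seven lost servers instead of two for about 10% more storage.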
  27. Performance
      • Tests run on ten nodes
      • RandomWrite writes 1 GB per node
      • WordCount done over randomly generated text
      • Tahoe write speed is 10x slower
      • Read-intensive jobs are about the same
      • Not so bad since the most common data use case is write-once, read-many
      [Bar chart: HDFS vs. Tahoe for Random Write and Word Count; y-axis 0 to 200, units not given]
  28. Code
      • Tahoe available from [link not preserved]; licensed under GPL 2 or TGPPL
      • Integration code available at [link not preserved]; licensed under Apache 2