Hw09   Map Reduce Over Tahoe   A Least Authority Encrypted Distributed Filesystem
 

  • Comment: Correction for slide 10: should be “Expansion factor of data is N/K (default is 10/3, or ~3.3)”
Presentation Transcript

  • Slide 1: MapReduce over Tahoe
    Aaron Cordova, Associate
    Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, MD 20701
    cordova_aaron@bah.com
    Hadoop World NYC 2009, New York, Oct 1, 2009
  • Slide 2: MapReduce over Tahoe
    ◦ Impact of data security requirements on large scale analysis
    ◦ Introduction to Tahoe
    ◦ Integrating Tahoe with Hadoop’s MapReduce
    ◦ Deployment scenarios, considerations
    ◦ Test results
  • Slide 3: Features of Large Scale Analysis
    ◦ As data grows, it becomes harder, more expensive to move
      – “Massive” data
    ◦ The more data sets are located together, the more valuable each is
      – Network Effect
    ◦ Bring computation to the data
  • Slide 4: Data Security and Large Scale Analysis
    ◦ Each department within an organization has its own data
    ◦ Some data need to be shared
    ◦ Others are protected
    [Diagram labels: CRM Product Sales Testing]
  • Slide 5: Data Security
    ◦ Because of security constraints, departments tend to set up their own data storage and processing systems independently
    ◦ This includes support staff
    ◦ Highly inefficient
    ◦ Analysis across datasets is impossible
    [Diagram: four separate stacks, each labeled Support, Storage, Processing, Apps]
  • Slide 6: “Stovepipe Effect”
  • Slide 7: Tahoe - A Least Authority File System
    ◦ Release 1.5
    ◦ AllMyData.com
    ◦ Included in Ubuntu Karmic Koala
    ◦ Open Source
  • Slide 8: Tahoe Architecture
    ◦ Data originates at the client, which is trusted
    ◦ Client encrypts, segments, and erasure-codes data
    ◦ Segments are distributed to storage nodes over encrypted links
    ◦ Storage nodes only see encrypted data, and are not trusted
    [Diagram labels: Storage Servers, SSL, Client]
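The segmenting and erasure-coding steps can be illustrated with a toy example. This is only a sketch under loud assumptions: it swaps in a single XOR parity share (any k of k+1 shares recover the data) in place of a general k-of-n code, it omits the encryption step entirely, and the function names `encode` and `decode` are hypothetical, not part of Tahoe.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal segments and append one XOR parity segment."""
    seg_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(seg_len * k, b"\0")
    segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(k)]
    return segments + [reduce(xor_bytes, segments)]


def decode(shares: List[Optional[bytes]], k: int, length: int) -> bytes:
    """Rebuild the data from k + 1 shares, at most one of which is None."""
    missing = [i for i, s in enumerate(shares) if s is None]
    if len(missing) > 1:
        raise ValueError("a single-parity code tolerates only one lost share")
    shares = list(shares)
    if missing:
        # XOR of all surviving shares equals the lost one.
        present = [s for s in shares if s is not None]
        shares[missing[0]] = reduce(xor_bytes, present)
    return b"".join(shares[:k])[:length]


message = b"data originates at the client"
shares = encode(message, k=3)   # four shares, one per storage node
shares[1] = None                # one storage node is lost
assert decode(shares, k=3, length=len(message)) == message
```

In a real deployment the client would encrypt before encoding, so each storage node would hold only ciphertext shares.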
  • Slide 9: Tahoe Architecture Features
    ◦ AES Encryption
    ◦ Segmentation
    ◦ Erasure-coding
    ◦ Distributed
    ◦ Flexible Access Control
  • Slide 10: Erasure Coding Overview
    ◦ Only k of n segments are needed to recover the file
    ◦ Up to n-k machines can fail, be compromised, or be malicious without data loss
    ◦ n and k are configurable, and can be chosen to achieve desired availability
    ◦ Expansion factor of data is n/k (default is 10/3, or about 3.3)
    [Diagram labels: N, K]
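The arithmetic behind choosing n and k can be sketched in a few lines. The expansion factor and failure tolerance follow directly from the slide; the availability formula additionally assumes that each storage server is up independently with the same probability p, which is a simplification introduced here and not a claim from the talk.

```python
from math import comb


def expansion_factor(n: int, k: int) -> float:
    """Stored bytes per byte of original data."""
    return n / k


def tolerated_failures(n: int, k: int) -> int:
    """Number of servers that can be lost without losing the file."""
    return n - k


def availability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n servers are up (independence assumed)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


print(round(expansion_factor(10, 3), 2))  # 3.33
print(tolerated_failures(10, 3))          # 7
```

Raising n relative to k increases both the storage overhead and the number of servers that may fail, so the two values are tuned together.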
  • Slide 11: Flexible Access Control
    ◦ Each file has a Read Capability and a Write Capability
    ◦ These are decryption keys
    ◦ Directories have capabilities too
    [Diagram labels: File with ReadCap and WriteCap; Dir with ReadCap and WriteCap]
  • Slide 12: Flexible Access Control
    ◦ Access to a subset of files can be done by:
      – creating a directory
      – attaching files
      – sharing read or write capabilities of the dir
    ◦ Any files or directories attached are accessible
    ◦ Any outside the directory are not
    [Diagram labels: Dir, ReadCap, File, File, File, Dir]
  • Slides 13 and 14: Access Control Example
    Each department can access their own files
    [Diagram labels: Files, Directories, /Sales, /Testing]
  • Slide 15: Access Control Example
    Files that need to be shared can be linked to a new directory, whose read capability is given to both departments
    [Diagram labels: Files, Directories, /Sales, /New Products, /Testing]
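The sharing pattern in this example can be modelled with a toy capability store. This is a loose sketch, not Tahoe's mechanism: in Tahoe the capabilities are cryptographic keys, whereas here they are random tokens looked up in an in-memory table, and the class `Grid` and all of its methods are hypothetical names made up for illustration.

```python
import secrets


class Grid:
    """Toy model: objects are reachable only through unguessable capabilities."""

    def __init__(self):
        self._nodes = {}  # node id -> bytes (file) or dict of name -> read cap (dir)
        self._caps = {}   # capability string -> (node id, may_write)

    def _new_caps(self, value):
        node_id = len(self._nodes)
        self._nodes[node_id] = value
        write_cap, read_cap = secrets.token_hex(16), secrets.token_hex(16)
        self._caps[write_cap] = (node_id, True)
        self._caps[read_cap] = (node_id, False)
        return write_cap, read_cap

    def put_file(self, content):
        return self._new_caps(content)

    def make_dir(self):
        return self._new_caps({})

    def link(self, dir_write_cap, name, child_read_cap):
        node_id, may_write = self._caps[dir_write_cap]
        if not may_write:
            raise PermissionError("a read cap cannot modify a directory")
        self._nodes[node_id][name] = child_read_cap

    def read(self, cap, path=""):
        node = self._nodes[self._caps[cap][0]]
        for part in filter(None, path.split("/")):
            node = self._nodes[self._caps[node[part]][0]]
        return node


grid = Grid()
_, sales_report = grid.put_file(b"Q3 sales")
_, test_plan = grid.put_file(b"test plan")
_, sales_only = grid.put_file(b"not shared")   # never linked, so unreachable
shared_w, shared_r = grid.make_dir()           # the new shared directory
grid.link(shared_w, "sales.txt", sales_report)
grid.link(shared_w, "tests.txt", test_plan)
# Both departments are handed shared_r and nothing else.
print(grid.read(shared_r, "sales.txt"))  # b'Q3 sales'
```

Holding `shared_r` gives access to exactly what was attached to the directory; the unlinked file stays out of reach because no capability for it was ever shared.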
  • Slide 16: Hadoop Can Use The Following File Systems
    ◦ HDFS
    ◦ Cloud Store (KFS)
    ◦ Amazon S3
    ◦ FTP
    ◦ Read only HTTP
    ◦ Now, Tahoe!
  • Slide 17: Hadoop File System Integration HowTo
    ◦ Step 1.
      – Locate your favorite file system’s API
    ◦ Step 2.
      – Subclass FileSystem
      – Found in /src/core/org/apache/hadoop/fs/FileSystem.java
    ◦ Step 3.
      – Add lines to core-site.xml: <name>fs.lafs.impl</name> <value>your.class</value>
    ◦ Step 4.
      – Test using your favorite Infrastructure Service Provider
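For Step 3, Hadoop configuration files wrap each name/value pair in a `<property>` element inside the top-level `<configuration>` element. A minimal fragment, keeping the slide's placeholder `your.class` for whatever FileSystem subclass was written in Step 2, might look like this:

```xml
<!-- core-site.xml fragment; your.class is the slide's placeholder, not a real class name -->
<property>
  <name>fs.lafs.impl</name>
  <value>your.class</value>
</property>
```

The `fs.<scheme>.impl` key is how Hadoop maps a URI scheme (here `lafs`) to an implementing class.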
  • Slide 18: Hadoop Integration: MapReduce
    ◦ One Tahoe client is run on each machine that serves as a MapReduce Worker
    ◦ On average, clients communicate with k storage servers
    ◦ Jobs are limited by aggregate network bandwidth
    ◦ MapReduce workers are trusted, storage nodes are not
    [Diagram labels: Storage Servers; Hadoop MapReduce Workers]
  • Slide 19: Hadoop-Tahoe Configuration
    ◦ Step 1. Start Tahoe
    ◦ Step 2. Create a new directory in Tahoe, note the WriteCap
    ◦ Step 3. Configure core-site.xml thus:
      – fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS
      – lafs.rootcap: $WRITE_CAP
      – fs.default.name: lafs://localhost
    ◦ Step 4. Start MapReduce, but not HDFS
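Written out in full, the three settings from Step 3 could be placed in core-site.xml roughly as follows. The keys and values are taken from the slide; `$WRITE_CAP` is a placeholder for the capability noted in Step 2, and the surrounding layout is the standard Hadoop configuration file structure rather than something shown in the talk.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.lafs.impl</name>
    <value>org.apache.hadoop.fs.lafs.LAFS</value>
  </property>
  <!-- Placeholder: replace $WRITE_CAP with the WriteCap noted in Step 2 -->
  <property>
    <name>lafs.rootcap</name>
    <value>$WRITE_CAP</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>lafs://localhost</value>
  </property>
</configuration>
```

Because the root capability grants write access to the whole directory, this file should be protected like any other secret.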
  • Slide 20: Deployment Scenario - Large Organization
    ◦ Within a datacenter, departments can run MapReduce jobs on discrete groups of compute nodes
    ◦ Each MapReduce job accesses a directory containing a subset of files
    ◦ Results are written back to the storage servers, encrypted
    [Diagram labels: Storage Servers; Sales, Audit; MapReduce Workers / Tahoe Clients]
  • Slide 21: Deployment Scenario - Community
    ◦ If a community uses a shared data center, different organizations can run discrete MapReduce jobs
    ◦ Perhaps most importantly, when results are deemed appropriate to share, access can be granted simply by sending a read or write capability
    ◦ Since the data are all co-located already, no data needs to be moved
    [Diagram labels: Storage Servers; FBI, Homeland Sec; MapReduce Workers / Tahoe Clients]
  • Slide 22: Deployment Scenario - Public Cloud Services
    ◦ Since storage nodes require no trust, they can be located at a remote location, e.g. within a cloud service provider’s datacenter
    ◦ MapReduce jobs can be done this way if bandwidth to the datacenter is adequate
    [Diagram labels: Storage Servers; Cloud Service Provider; MapReduce Workers / Tahoe Clients]
  • Slide 23: Deployment Scenario - Public Cloud Services
    ◦ For some users, everything could be run remotely in a service provider’s data center
    ◦ There are a few caveats and additional precautions in this scenario:
    [Diagram labels: Storage Servers; Cloud Service Provider; MapReduce Workers / Tahoe Clients]
  • Slide 24: Public Cloud Deployment Considerations
    ◦ Store configuration files in memory
    ◦ Encrypt / disable swap
    ◦ Encrypt spillover
    ◦ Must trust memory / hypervisor
    ◦ Trust service provider disks
    [Diagram label: Cloud Service Provider]
  • Slide 25: HDFS and Linux Disk Encryption Drawbacks
    ◦ At most one key per node - no support for flexible access control
    ◦ Decryption done at the storage node rather than at the client - still have to trust storage nodes
  • Slide 26: Tahoe and HDFS - Comparison

    Feature            HDFS               Tahoe
    Confidentiality    File Permissions   AES Encryption
    Integrity          Checksum           Merkle Hash Tree
    Availability       Replication        Erasure Coding
    Expansion Factor   3x                 3.3x (n/k)
    Self-Healing       Automatic          Automatic
    Load-balancing     Automatic          Planned
    Mutable Files      No                 Yes
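To make the Integrity row more concrete, here is a generic Merkle hash tree sketch. It is an illustration of the general technique only: the choice of SHA-256, the rule of duplicating the last hash on odd-sized levels, and the function name `merkle_root` are assumptions made for this example and do not describe Tahoe's actual tree layout.

```python
import hashlib
from typing import List


def merkle_root(blocks: List[bytes]) -> bytes:
    """Hash each block, then hash pairs upward until a single root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [hashlib.sha256(b).digest() for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed padding rule for odd-sized levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


blocks = [b"segment-0", b"segment-1", b"segment-2"]
root = merkle_root(blocks)
tampered = merkle_root([b"segment-0", b"segment-X", b"segment-2"])
assert root != tampered  # changing any block changes the root
```

Unlike a single whole-file checksum, a tree of hashes lets a reader verify an individual block against the root without fetching every other block.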
  • Performance HDFS Tahoe  Tests run on ten nodes  RandomWrite writes 1 GB per 200 node  WordCount done over randomly 150 generated text  Tahoe write speed is 10x slower 100  Read-intensive jobs are about the same 50  Not so bad since the most common data use case is write- once, read-many 0 Random Write Word Count Hadoop World NYC 2009 27
  • Slide 28: Code
    ◦ Tahoe available from http://allmydata.org
      – Licensed under GPL 2 or TGPPL
    ◦ Integration code available at http://hadoop-lafs.googlecode.com
      – Licensed under Apache 2