Big Data Security

Joey Echeverria | Principal Solutions Architect
joey@cloudera.com | @fwiffo

©2013 Cloudera, Inc.

EARLY DAYS

Hadoop File Permissions

    • Added in HADOOP-1298
        • Hadoop 0.16
        • Early 2008
    • Authorization without authentication
    • POSIX-like RWX bits

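The POSIX-like RWX model can be illustrated with a minimal Python sketch. This is not Hadoop code: `FileStatus`, `can_access`, and the sample users are hypothetical names chosen for illustration. It also shows what "authorization without authentication" implies: the check trusts whatever user name the caller supplies.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1  # classic rwx bit values


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a file's owner, group, and mode bits."""
    owner: str
    group: str
    mode: int  # e.g. 0o640


def can_access(status, user, groups, wanted):
    """Return True if `user` may perform `wanted` (READ, WRITE, or EXECUTE).

    Exactly one permission class applies: owner, then group, then other.
    """
    if user == status.owner:
        bits = (status.mode >> 6) & 7
    elif status.group in groups:
        bits = (status.mode >> 3) & 7
    else:
        bits = status.mode & 7
    return (bits & wanted) == wanted


report = FileStatus(owner="alice", group="analysts", mode=0o640)
owner_can_write = can_access(report, "alice", set(), WRITE)
group_can_read = can_access(report, "bob", {"analysts"}, READ)
other_can_read = can_access(report, "mallory", set(), READ)
```

Nothing in this check proves that the caller really is "alice"; without authentication, anyone who claims that name gets the owner's rights.
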
MapReduce ACLs

    • Added in HADOOP-3698
        • Hadoop 0.19
        • Late 2008
    • ACLs per job queue
    • Set a list of allowed users or groups per operation
        • Job submission
        • Job administration
    • No authentication

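A per-queue, per-operation ACL check might look like the sketch below. It assumes the ACL string convention of comma-separated users, a space, then comma-separated groups, with `*` meaning everyone. The queue name, operation names, and helper functions are hypothetical.

```python
def parse_acl(acl):
    """Split 'user1,user2 group1,group2' into (users, groups) sets.

    Returns None for the wildcard '*', meaning everyone is allowed.
    """
    if acl.strip() == "*":
        return None
    users_part, _, groups_part = acl.partition(" ")
    users = {u for u in users_part.split(",") if u}
    groups = {g for g in groups_part.split(",") if g}
    return users, groups


def is_allowed(acl, user, user_groups):
    """Return True if `user` (a member of `user_groups`) passes the ACL."""
    parsed = parse_acl(acl)
    if parsed is None:
        return True
    users, groups = parsed
    return user in users or bool(groups & set(user_groups))


# Hypothetical ACLs: one entry per queue, one ACL string per operation.
QUEUE_ACLS = {
    "research": {
        "submit-job": "alice,bob analysts",
        "administer-jobs": "alice",
    },
}

submit_acl = QUEUE_ACLS["research"]["submit-job"]
admin_acl = QUEUE_ACLS["research"]["administer-jobs"]
carol_can_submit = is_allowed(submit_acl, "carol", ["analysts"])
bob_can_administer = is_allowed(admin_acl, "bob", ["analysts"])
```

As with file permissions, the user name is taken on trust here; the ACL only becomes a real control once the identity is authenticated.
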
Securing a Cluster Through a Gateway

    • Hadoop cluster runs on a private network
    • Gateway server is dual-homed (Hadoop network and public network)
    • Users SSH onto the gateway
        • Optionally, create an SSH proxy so jobs can be submitted from the client machine
    • Provides a minimum level of protection

WHY SECURITY MATTERS

Prevent Accidental Access

    • Don't let users shoot themselves in the foot
    • Main driver for early features
    • Not security per se, but a critical first step
    • Doesn't require strong authentication

Stop Malicious Users

    • Early features were necessary, but not sufficient
    • Security has to get real
    • Hadoop runs arbitrary code
    • Implicit trust doesn't prevent the insider threat

Co-mingle All Your Data

    • Often overlooked
    • Big data means getting rid of stovepipes
        • Scalability and flexibility are only 50% of the problem
        • Trust your data in a multi-tenant environment
    • Most critical driver

AN EVOLVING STORY

Authorization

    • Files
    • MapReduce/YARN job queues
    • Service-level authorization
        • Whitelists and blacklists of hosts and users

Authentication

    • HADOOP-4487
        • Hadoop 0.22 and 0.20.205
        • Late 2010
    • Based on Kerberos and internal delegation tokens
        • Provides strong user authentication
        • Also used for service-to-service authentication

[Figure 1: HDFS High-level Dataflow. Nodes: Application, Name Node, Data Node, MapReduce Task. Edge labels: kerb(joe), delg(joe), kerb(hdfs), block token.]

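The idea behind delegation tokens can be sketched in a few lines of Python: after a user authenticates once (with Kerberos, not shown here), a service hands out a token signed with a secret only the service knows, and tasks later present that token instead of Kerberos credentials. This is a simplified model, not Hadoop's actual token format. The class name, the field layout, and the choice of HMAC-SHA256 are all assumptions made for illustration.

```python
import hashlib
import hmac
import os
import time


class TokenIssuer:
    """Hypothetical service that issues and verifies delegation tokens."""

    def __init__(self):
        # Secret known only to the issuing service (illustrative; no rotation).
        self._master_key = os.urandom(32)

    def _sign(self, identifier):
        return hmac.new(self._master_key, identifier, hashlib.sha256).digest()

    def issue(self, owner, renewer, lifetime_s=3600):
        """Return (identifier, password) for an already authenticated owner."""
        expiry = int(time.time()) + lifetime_s
        identifier = f"{owner}|{renewer}|{expiry}".encode()
        return identifier, self._sign(identifier)

    def verify(self, identifier, password):
        """Return the token owner, or None if the token is forged or expired."""
        if not hmac.compare_digest(self._sign(identifier), password):
            return None
        owner, _renewer, expiry = identifier.decode().split("|")
        if int(expiry) < time.time():
            return None
        return owner


issuer = TokenIssuer()
identifier, password = issuer.issue("joe", "jobtracker")
verified_owner = issuer.verify(identifier, password)
forged_owner = issuer.verify(identifier.replace(b"joe", b"root"), password)
```

Because the signature covers the whole identifier, changing the owner or the expiry invalidates the token without the service having to store every token it issued.
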
Encryption

    • Over-the-wire encryption for some socket connections
    • RPC encryption added soon after Kerberos
    • Shuffle encryption (HTTPS) added in Hadoop 2.0.2-alpha, backported to CDH4 MR1
    • HDFS block streamer encryption added in Hadoop 2.0.2-alpha
    • Volume-level encryption for data at rest

SECURITY FOR KEY VALUE STORES

Apache Accumulo

    • Robust, scalable, high-performance data storage and retrieval system
    • Built by NSA, now an Apache project
    • Based on Google's BigTable
    • Built on top of HDFS, ZooKeeper and Thrift
    • Iterators for server-side extensions
    • Cell labels for flexible security models

Data Model

    • Multi-dimensional, persistent, sorted map
    • Key/Value store with a twist
    • A single primary key (Row ID)
    • Secondary key (Column) internal to a row
        • Family
        • Qualifier
    • Per-cell timestamp

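The "sorted map" model can be pictured with plain Python tuples. In this sketch each key is (row ID, column family, column qualifier, timestamp), and entries sort by row, then family, then qualifier, with newer timestamps first. The sample data and helper names are hypothetical, and the security label that is also part of a real Accumulo key is left out for brevity.

```python
def sort_key(entry):
    """Order by row, family, qualifier, then newest timestamp first."""
    (row, family, qualifier, timestamp), _value = entry
    return (row, family, qualifier, -timestamp)


entries = [
    (("user2", "info", "email", 5), "b@example.com"),
    (("user1", "info", "name", 3), "Alice"),
    (("user1", "info", "email", 7), "new@example.com"),
    (("user1", "info", "email", 2), "old@example.com"),
]

# The table is simply the entries kept in key order.
table = sorted(entries, key=sort_key)


def scan_row(sorted_table, row):
    """Return all cells for one row, in key order."""
    return [(key, value) for key, value in sorted_table if key[0] == row]


user1_cells = scan_row(table, "user1")
latest_email = user1_cells[0][1]
```

Keeping the data sorted by row ID is what makes single-row lookups and range scans cheap; the column family and qualifier act as a secondary key inside each row.
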
Cell-Level Security

    • Labels stored per cell
    • Labels consist of Boolean expressions (AND, OR, nesting)
    • Labels associated with each user
    • Cell labels checked against user's labels with a built-in iterator

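A label check of this kind can be sketched as a small expression evaluator. The sketch below writes AND as `&` and OR as `|`, supports parentheses for nesting, and treats an empty label as visible to everyone. It is an illustration, not Accumulo's implementation: the function names are hypothetical, and giving `&` higher precedence than `|` is an assumption of this sketch.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    """Split a label expression into label names and operator tokens."""
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"bad label expression at offset {pos}: {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expr, authorizations):
    """True if a user holding `authorizations` may see a cell labeled `expr`."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # unlabeled cell
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # always parse, so the position advances
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            result = result and rhs
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return result


ops_admin_sees = is_visible("admin&(audit|ops)", {"admin", "ops"})
ops_only_sees = is_visible("admin&(audit|ops)", {"ops"})
```

Running the check inside a server-side iterator means cells the user may not see are filtered out before any data leaves the server.
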
Pluggable Authentication

    • Currently supports username/password authentication backed by ZooKeeper
    • ACCUMULO-259
        • Targeted for Accumulo 1.5.0
    • Authentication info replaced with generic tokens
    • Supports multiple implementations (e.g. Kerberos)

Application Level

    • Accumulo often paired with application-level authentication/authorization
    • Accumulo users created per application
    • Each application granted access level of most permitted user
    • Application authenticates users, grabs user authorizations, passes user labels with requests

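The application-level pattern can be sketched as follows: the application's own Accumulo user holds the broadest set of labels, and for each request the application authenticates the end user and passes along only that user's labels. Everything here is hypothetical (the user directory, the label names, the function name), and the plaintext passwords are for illustration only.

```python
import hmac

# Labels granted to the application's own Accumulo user (illustrative).
APP_AUTHORIZATIONS = frozenset({"public", "finance", "hr"})

# Hypothetical application-side user store. A real application would use a
# directory service and hashed credentials, never plaintext passwords.
USER_DIRECTORY = {
    "alice": {"password": "s3cret", "labels": {"public", "finance"}},
    "bob": {"password": "hunter2", "labels": {"public", "legal"}},
}


def scan_authorizations(username, password):
    """Authenticate the end user and return the labels to send with a scan.

    The result never exceeds what the application itself was granted.
    """
    user = USER_DIRECTORY.get(username)
    if user is None or not hmac.compare_digest(
        user["password"].encode(), password.encode()
    ):
        raise PermissionError("authentication failed")
    return user["labels"] & APP_AUTHORIZATIONS


alice_labels = scan_authorizations("alice", "s3cret")
bob_labels = scan_authorizations("bob", "hunter2")
```

The trade-off is that the data store trusts the application completely: a bug or compromise in the application exposes everything the application's user is allowed to read.
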
Apache HBase

    • Also based on Google's BigTable
    • Started as a Hadoop contrib project
    • Supports column-level ACLs
    • Kerberos for authentication
    • Discussion and early prototypes of cell-level security ongoing

FUTURE

Encryption for Data at Rest

    • Need multiple levels of granularity
    • Encryption keys tied to authorization labels (like Accumulo labels or HBase ACLs)
    • APIs for file-level, block-level, or record-level encryption

Hive Security

    • Column-level ACLs
    • Kerberos authentication
    • AccessServer