SlideShare a Scribd company logo
1 of 24
Big Data Security
    Joey Echeverria | Principal Solutions Architect
    joey@cloudera.com | @fwiffo




1                               ©2013 Cloudera, Inc.
Big Data Security




     EARLY DAYS




2
Hadoop File Permissions

    •   Added in HADOOP-1298
        •   Hadoop 0.16
        •   Early 2008
    • Authorization without authentication
    • POSIX-like RWX bits




3
MapReduce ACLs

    •   Added in HADOOP-3698
        •   Hadoop 0.19
        •   Late 2008
    • ACLs per job queue
    • Set a list of allowed users or groups per operation
        •   Job submission
        •   Job administration
    •   No authentication



4
Securing a Cluster Through a Gateway

    • Hadoop cluster runs on a private network
    • Gateway server dual-homed (Hadoop network and
      public network)
    • Users SSH onto gateway
        •   Optionally can create an SSH proxy for jobs to be
            submitted from the client machine
    •   Provides minimum level of protection




5
Big Data Security




     WHY SECURITY MATTERS




6
Prevent Accidental Access

    • Don’t let users shoot themselves in the foot
    • Main driver for early features
    • Not security per-se, but a critical first step
    • Doesn’t require strong authentication




7
Stop Malicious Users

    • Early features were necessary, but not sufficient
    • Security has to get real
    • Hadoop runs arbitrary code
    • Implicit trust doesn’t prevent the insider threat




8
Co-mingle All Your Data

    • Often overlooked
    • Big data means getting rid of stovepipes
        •   Scalability and flexibility are only 50% of the problem
        •   Trust your data in a multi-tenant environment
    •   Most critical driver




9
Big Data Security




      AN EVOLVING STORY




10
Authorization

     • Files
     • MapReduce/YARN job queues
     • Service-level authorization
         •   Whitelists and blacklists of hosts and users




11
Authentication
                 2.2 High Level Use Cases                                                  2 USE CASES
     •   HADOOP-4487
         •   Hadoop 0.22evel U0.20.205
                2.2 H igh L
                                   and se Cases
                  1. A ppl icat i ons accessing fi les on H D F S cl ust er s Non-MapReduce ap-
         •   Late 2010ions, including hadoop fs, access files st ored on one or more HDFS
                     plicat
                      clust ers. T he applicat ion should only be able t o access files and services
     •   Based on Kerberos and internal delegation tokens
                      t hey are aut horized t o access. See figure 1. Variat ions:

                       (a) Access HDFS direct ly using HDFS prot ocol.
         •   Provides strong user authentication servers via t he HFT P
                    (b) Access HDFS indirect ly t hough HDFS proxy
                        FileSyst em or HT T P get .
         •   Also used for service-to-service authentication
                                                    Name
                                                               delg(jo
                                         (joe)      Node               e
                                    kerb                                   )
                                                                                    MapReduce
                     Application
                                                       kerb(hdfs)                      Task
                                   bloc                                     e   n
                                          k to
                                              ken                       tok
                                                                   ck
                                                     Data      blo
                                                     Node



                                          Figure 1: HDFS High-level Dat aflow
12
Encryption

     •   Over the wire encryption for some socket
         connections
     •   RPC encryption added soon after Kerberos
     •   Shuffle encryption (HTTPS) added in Hadoop 2.0.2-
         alpha, back ported to CDH4 MR1
     •   HDFS block streamer encryption added in Hadoop
         2.0.2-alpha
     •   Volume-level encryption for data at rest



13
Big Data Security




      SECURITY FOR KEY VALUE STORES




14
Apache Accumulo

     •   Robust, scalable, high performance data storage and
         retrieval system
     •   Built by NSA, now an Apache project
     •   Based on Google’s BigTable
     •   Built on top of HDFS, ZooKeeper and Thrift
     •   Iterators for server-side extensions
     •   Cell labels for flexible security models




15
Data Model

     • Multi-dimensional, persistent, sorted map
     • Key/Value store with a twist
     • A single primary key (Row ID)
     • Secondary key (Column) internal to a row
         •   Family
         •   Qualifier
     •   Per-cell timestamp




16
Cell-Level Security

     • Labels stored per cell
     • Labels consist of Boolean expressions
       (AND, OR, nesting)
     • Labels associated with each user
     • Cell labels checked against user’s labels with a built-
       in iterator




17
Pluggable Authentication

     • Currently supports username/password
       authentication backed by ZooKeeper
     • ACCUMULO-259
         •   Targeted for Accumulo 1.5.0
     • Authentication info replaced with generic tokens
     • Supports multiple implementations (e.g. Kerberos)




18
Application Level

     • Accumulo often paired with application level
       authentication/authorization
     • Accumulo users created per application
     • Each application granted access level of most
       permitted user
     • Application authenticates users, grabs user
       authorizations, passes user labels with requests




19
Apache HBase

     •   Also based on Google’s BigTable
     •   Started as a Hadoop contrib project
     •   Supports column-level ACLs
     •   Kerberos for authentication
     •   Discussion and early prototypes of cell-level security
         ongoing




20
Big Data Security




      FUTURE




21
Encryption for Data at Rest

     • Need multiple levels of granularity
     • Encryption keys tied to authorization labels (like
       Accumulo labels or HBase ACLs)
     • APIs for file-level, block-level, or record-level
       encryption




22
Hive Security

     • Column-level ACLs
     • Kerberos authentication
     • AccessServer




23
24   ©2013 Cloudera, Inc.

More Related Content

What's hot

Demystifying the Distributed Database Landscape (DevOps) (1).pdf
Demystifying the Distributed Database Landscape (DevOps) (1).pdfDemystifying the Distributed Database Landscape (DevOps) (1).pdf
Demystifying the Distributed Database Landscape (DevOps) (1).pdfScyllaDB
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Cloudera, Inc.
 
An Intro to NoSQL Databases
An Intro to NoSQL DatabasesAn Intro to NoSQL Databases
An Intro to NoSQL DatabasesRajith Pemabandu
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Jay Patel
 
Real-Time Market Data Analytics Using Kafka Streams
Real-Time Market Data Analytics Using Kafka StreamsReal-Time Market Data Analytics Using Kafka Streams
Real-Time Market Data Analytics Using Kafka Streamsconfluent
 
Cobrix – a COBOL Data Source for Spark
Cobrix – a COBOL Data Source for SparkCobrix – a COBOL Data Source for Spark
Cobrix – a COBOL Data Source for SparkDataWorks Summit
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Aes (advance encryption standard)
Aes (advance encryption standard) Aes (advance encryption standard)
Aes (advance encryption standard) Sina Manavi
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Simplilearn
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud ArchitectureAdrian Cockcroft
 
Building an Authorization Solution for Microservices Using Neo4j and OPA
Building an Authorization Solution for Microservices Using Neo4j and OPABuilding an Authorization Solution for Microservices Using Neo4j and OPA
Building an Authorization Solution for Microservices Using Neo4j and OPANeo4j
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
 
No sql distilled-distilled
No sql distilled-distilledNo sql distilled-distilled
No sql distilled-distilledrICh morrow
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...DataStax
 

What's hot (20)

Demystifying the Distributed Database Landscape (DevOps) (1).pdf
Demystifying the Distributed Database Landscape (DevOps) (1).pdfDemystifying the Distributed Database Landscape (DevOps) (1).pdf
Demystifying the Distributed Database Landscape (DevOps) (1).pdf
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
An Intro to NoSQL Databases
An Intro to NoSQL DatabasesAn Intro to NoSQL Databases
An Intro to NoSQL Databases
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
 
Real-Time Market Data Analytics Using Kafka Streams
Real-Time Market Data Analytics Using Kafka StreamsReal-Time Market Data Analytics Using Kafka Streams
Real-Time Market Data Analytics Using Kafka Streams
 
Caching
CachingCaching
Caching
 
Cobrix – a COBOL Data Source for Spark
Cobrix – a COBOL Data Source for SparkCobrix – a COBOL Data Source for Spark
Cobrix – a COBOL Data Source for Spark
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
DES
DESDES
DES
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Aes (advance encryption standard)
Aes (advance encryption standard) Aes (advance encryption standard)
Aes (advance encryption standard)
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
 
Building an Authorization Solution for Microservices Using Neo4j and OPA
Building an Authorization Solution for Microservices Using Neo4j and OPABuilding an Authorization Solution for Microservices Using Neo4j and OPA
Building an Authorization Solution for Microservices Using Neo4j and OPA
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
No sql distilled-distilled
No sql distilled-distilledNo sql distilled-distilled
No sql distilled-distilled
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 

Viewers also liked

Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Peter Wood
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data miningharithavijay94
 
Big Data, Security Intelligence, (And Why I Hate This Title)
Big Data, Security Intelligence, (And Why I Hate This Title) Big Data, Security Intelligence, (And Why I Hate This Title)
Big Data, Security Intelligence, (And Why I Hate This Title) Coastal Pet Products, Inc.
 
Information Security in Big Data : Privacy and Data Mining
Information Security in Big Data : Privacy and Data MiningInformation Security in Big Data : Privacy and Data Mining
Information Security in Big Data : Privacy and Data Miningwanani181
 
Big Data Security with Hadoop
Big Data Security with HadoopBig Data Security with Hadoop
Big Data Security with HadoopCloudera, Inc.
 
Big data security the perfect storm
Big data security   the perfect stormBig data security   the perfect storm
Big data security the perfect stormUlf Mattsson
 
Demystify big data data science
Demystify big data  data scienceDemystify big data  data science
Demystify big data data scienceMahesh Kumar CV
 
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...CA API Management
 
"Big Data" in the Energy Industry
"Big Data" in the Energy Industry"Big Data" in the Energy Industry
"Big Data" in the Energy IndustryPaige Bailey
 
BigDataEurope - Big Data & Energy
BigDataEurope - Big Data & EnergyBigDataEurope - Big Data & Energy
BigDataEurope - Big Data & EnergyBigData_Europe
 
Kerberos, Token and Hadoop
Kerberos, Token and HadoopKerberos, Token and Hadoop
Kerberos, Token and HadoopKai Zheng
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview Hortonworks
 

Viewers also liked (19)

Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data mining
 
Big Data, Security Intelligence, (And Why I Hate This Title)
Big Data, Security Intelligence, (And Why I Hate This Title) Big Data, Security Intelligence, (And Why I Hate This Title)
Big Data, Security Intelligence, (And Why I Hate This Title)
 
Information Security in Big Data : Privacy and Data Mining
Information Security in Big Data : Privacy and Data MiningInformation Security in Big Data : Privacy and Data Mining
Information Security in Big Data : Privacy and Data Mining
 
Big Data Security with Hadoop
Big Data Security with HadoopBig Data Security with Hadoop
Big Data Security with Hadoop
 
Big data security the perfect storm
Big data security   the perfect stormBig data security   the perfect storm
Big data security the perfect storm
 
Big data Overview
Big data OverviewBig data Overview
Big data Overview
 
Demystify big data data science
Demystify big data  data scienceDemystify big data  data science
Demystify big data data science
 
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
 
Hadoop security
Hadoop securityHadoop security
Hadoop security
 
Big Data Security and Governance
Big Data Security and GovernanceBig Data Security and Governance
Big Data Security and Governance
 
"Big Data" in the Energy Industry
"Big Data" in the Energy Industry"Big Data" in the Energy Industry
"Big Data" in the Energy Industry
 
BigDataEurope - Big Data & Energy
BigDataEurope - Big Data & EnergyBigDataEurope - Big Data & Energy
BigDataEurope - Big Data & Energy
 
Add
AddAdd
Add
 
Kerberos, Token and Hadoop
Kerberos, Token and HadoopKerberos, Token and Hadoop
Kerberos, Token and Hadoop
 
Open-BDA Hadoop Summt 2014 - Post Summit Report
Open-BDA Hadoop Summt 2014 - Post Summit ReportOpen-BDA Hadoop Summt 2014 - Post Summit Report
Open-BDA Hadoop Summt 2014 - Post Summit Report
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview
 

Similar to Big data security

Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop EcosystemDataWorks Summit
 
Hw09 Security And Api Compatibility
Hw09   Security And Api CompatibilityHw09   Security And Api Compatibility
Hw09 Security And Api CompatibilityCloudera, Inc.
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopOwen O'Malley
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access SecurityCloudera, Inc.
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityDataWorks Summit
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityChris Nauroth
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big DataRommel Garcia
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big DataGreat Wide Open
 
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend MicroHBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend MicroCloudera, Inc.
 
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by ClouderaBig Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by ClouderaCaserta
 
Hops - Distributed metadata for Hadoop
Hops - Distributed metadata for HadoopHops - Distributed metadata for Hadoop
Hops - Distributed metadata for HadoopJim Dowling
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSpark Summit
 
Securing Your Apache Spark Applications
Securing Your Apache Spark ApplicationsSecuring Your Apache Spark Applications
Securing Your Apache Spark ApplicationsCloudera, Inc.
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
 
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding EdgeCIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding EdgeCloudIDSummit
 

Similar to Big data security (20)

Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop Ecosystem
 
Hadoop, Taming Elephants
Hadoop, Taming ElephantsHadoop, Taming Elephants
Hadoop, Taming Elephants
 
Hw09 Security And Api Compatibility
Hw09   Security And Api CompatibilityHw09   Security And Api Compatibility
Hw09 Security And Api Compatibility
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in Hadoop
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access Security
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend MicroHBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
 
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by ClouderaBig Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
 
Hops - Distributed metadata for Hadoop
Hops - Distributed metadata for HadoopHops - Distributed metadata for Hadoop
Hops - Distributed metadata for Hadoop
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Securing Your Apache Spark Applications
Securing Your Apache Spark ApplicationsSecuring Your Apache Spark Applications
Securing Your Apache Spark Applications
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding EdgeCIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
 

More from Joey Echeverria

Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applicationsJoey Echeverria
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityJoey Echeverria
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and ClouderaJoey Echeverria
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoopJoey Echeverria
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itchJoey Echeverria
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real worldJoey Echeverria
 

More from Joey Echeverria (12)

Debugging Apache Spark
Debugging Apache SparkDebugging Apache Spark
Debugging Apache Spark
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
Streaming ETL for All
Streaming ETL for AllStreaming ETL for All
Streaming ETL for All
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop Security
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and Cloudera
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoop
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itch
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real world
 

Big data security

  • 1. Big Data Security Joey Echeverria | Principal Solutions Architect joey@cloudera.com | @fwiffo 1 ©2013 Cloudera, Inc.
  • 2. Big Data Security EARLY DAYS 2
  • 3. Hadoop File Permissions • Added in HADOOP-1298 • Hadoop 0.16 • Early 2008 • Authorization without authentication • POSIX-like RWX bits 3
  • 4. MapReduce ACLs • Added in HADOOP-3698 • Hadoop 0.19 • Late 2008 • ACLs per job queue • Set a list of allowed users or groups per operation • Job submission • Job administration • No authentication 4
  • 5. Securing a Cluster Through a Gateway • Hadoop cluster runs on a private network • Gateway server dual-homed (Hadoop network and public network) • Users SSH onto gateway • Optionally can create an SSH proxy for jobs to be submitted from the client machine • Provides minimum level of protection 5
  • 6. Big Data Security WHY SECURITY MATTERS 6
  • 7. Prevent Accidental Access • Don’t let users shoot themselves in the foot • Main driver for early features • Not security per-se, but a critical first step • Doesn’t require strong authentication 7
  • 8. Stop Malicious Users • Early features were necessary, but not sufficient • Security has to get real • Hadoop runs arbitrary code • Implicit trust doesn’t prevent the insider threat 8
  • 9. Co-mingle All Your Data • Often overlooked • Big data means getting rid of stovepipes • Scalability and flexibility are only 50% of the problem • Trust your data in a multi-tenant environment • Most critical driver 9
  • 10. Big Data Security AN EVOLVING STORY 10
  • 11. Authorization • Files • MapReduce/YARN job queues • Service-level authorization • Whitelists and blacklists of hosts and users 11
  • 12. Authentication 2.2 High Level Use Cases 2 USE CASES • HADOOP-4487 • Hadoop 0.22evel U0.20.205 2.2 H igh L and se Cases 1. A ppl icat i ons accessing fi les on H D F S cl ust er s Non-MapReduce ap- • Late 2010ions, including hadoop fs, access files st ored on one or more HDFS plicat clust ers. T he applicat ion should only be able t o access files and services • Based on Kerberos and internal delegation tokens t hey are aut horized t o access. See figure 1. Variat ions: (a) Access HDFS direct ly using HDFS prot ocol. • Provides strong user authentication servers via t he HFT P (b) Access HDFS indirect ly t hough HDFS proxy FileSyst em or HT T P get . • Also used for service-to-service authentication Name delg(jo (joe) Node e kerb ) MapReduce Application kerb(hdfs) Task bloc e n k to ken tok ck Data blo Node Figure 1: HDFS High-level Dat aflow 12
  • 13. Encryption • Over the wire encryption for some socket connections • RPC encryption added soon after Kerberos • Shuffle encryption (HTTPS) added in Hadoop 2.0.2- alpha, back ported to CDH4 MR1 • HDFS block streamer encryption added in Hadoop 2.0.2-alpha • Volume-level encryption for data at rest 13
  • 14. Big Data Security SECURITY FOR KEY VALUE STORES 14
  • 15. Apache Accumulo • Robust, scalable, high performance data storage and retrieval system • Built by NSA, now an Apache project • Based on Google’s BigTable • Built on top of HDFS, ZooKeeper and Thrift • Iterators for server-side extensions • Cell labels for flexible security models 15
  • 16. Data Model • Multi-dimensional, persistent, sorted map • Key/Value store with a twist • A single primary key (Row ID) • Secondary key (Column) internal to a row • Family • Qualifier • Per-cell timestamp 16
  • 17. Cell-Level Security • Labels stored per cell • Labels consist of Boolean expressions (AND, OR, nesting) • Labels associated with each user • Cell labels checked against user’s labels with a built- in iterator 17
  • 18. Pluggable Authentication • Currently supports username/password authentication backed by ZooKeeper • ACCUMULO-259 • Targeted for Accumulo 1.5.0 • Authentication info replaced with generic tokens • Supports multiple implementations (e.g. Kerberos) 18
  • 19. Application Level • Accumulo often paired with application level authentication/authorization • Accumulo users created per application • Each application granted access level of most permitted user • Application authenticates users, grabs user authorizations, passes user labels with requests 19
  • 20. Apache HBase • Also based on Google’s BigTable • Started as a Hadoop contrib project • Supports column-level ACLs • Kerberos for authentication • Discussion and early prototypes of cell-level security ongoing 20
  • 21. Big Data Security FUTURE 21
  • 22. Encryption for Data at Rest • Need multiple levels of granularity • Encryption keys tied to authorization labels (like Accumulo labels or HBase ACLs) • APIs for file-level, block-level, or record-level encryption 22
  • 23. Hive Security • Column-level ACLs • Kerberos authentication • AccessServer 23
  • 24. 24 ©2013 Cloudera, Inc.