Creating a Secure Hadoop Initiative
Securing the Big Data Ecosystem




This document contains confidential, proprietary and trade secret information and is subject to certain legal protection. You may not
review, copy, or distribute this information unless you are a designated recipient, and have prior written authorization from Zettaset, Inc.
About Me
• CTO Zettaset, Inc. – Big Data Hadoop Company
   – Founded 2007
• Distributed Computing Guy
   – Have been since college
• Security Guy
   – Founder SPI Dynamics (sold to HP, 2007)
   – Internet Security Systems, Prof. Services
   – Security First Network Bank, Sec. Guru.
Zettaset Enables Enterprise-Ready Hadoop
• Zettaset Orchestrator™ automates
  Hadoop installation and cluster
  management with an enterprise-ready
  solution for Big Data deployments
 – Enterprise-class – Hardened for security, high
   availability, and performance
 – Dramatically lowers operational expenses –
   Reduces IT resource requirements
 – Simple to deploy – Accelerates time to value
   from weeks to hours
 – Eliminates unnecessary dependencies on
   professional services
 – Works with any Apache Hadoop distribution



         3                  © 2012 Zettaset, Inc. | Proprietary and Confidential
Zettaset Orchestrator:
Making Hadoop Clusters Enterprise-Ready




What is Big Data?

• Great Question
• It’s not a number; people define it
  differently.
• The majority define it as a scalability
  issue:
  – “The inability to continue storing and processing data the way
    that you’ve been storing and processing data.”
Exponential Data Growth = Big Data
Estimated Global Data Volume:
  • 2011: 1.8 Zettabytes
  • 2015: 7.9 Zettabytes
The world's information doubles every two years
Over the next 10 years:
  • The number of servers worldwide will grow by 10x
  • The amount of information managed by enterprise data centers will grow by 50x
  • The number of “files” enterprise data centers handle will grow by 75x

Source: http://www.emc.com/leadership/programs/digital-universe.htm, which was based on the 2011 IDC Digital Universe Study
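
As a rough sanity check on the figures above, the two quoted data points can be turned into an implied annual growth rate and doubling time. This is a back-of-the-envelope sketch, not part of the IDC study; it assumes smooth compound growth between 2011 and 2015.

```python
import math

# Figures quoted on the slide (2011 IDC Digital Universe Study).
volume_2011_zb = 1.8
volume_2015_zb = 7.9
years = 2015 - 2011

# Compound annual growth rate implied by the two data points
# (assumes smooth compounding between them).
cagr = (volume_2015_zb / volume_2011_zb) ** (1 / years) - 1

# Doubling time implied by that growth rate.
doubling_years = math.log(2) / math.log(1 + cagr)

# What a strict "doubles every two years" rule would predict for 2015.
strict_doubling_2015_zb = volume_2011_zb * 2 ** (years / 2)

print(f"Implied annual growth: {cagr:.1%}")
print(f"Implied doubling time: {doubling_years:.2f} years")
print(f"Strict two-year doubling predicts {strict_doubling_2015_zb:.1f} ZB in 2015")
```

The implied doubling time comes out slightly under two years, so the 2015 estimate is a little more aggressive than the "doubles every two years" headline.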


Hadoop Distribution Landscape
Y = included, - = not included. (The original slide also color-coded components as open source Apache Hadoop or proprietary.)

Distribution & Core Components  CDH3u4  CDH4u0  HDP v1.0  Apache Bigtop v0.3  MapR M3  MapR M5  BigInsights 1.4 BE  BigInsights 1.4 EE
Apache Hadoop                   Y       Y       Y         Y                   Y        Y        Y                   Y
HDFS                            Y       Y       Y         Y                   Y        Y        Y                   Y
Fuse-DFS                        Y       Y       Y         Y                   -        -        -                   -
MapReduce                       Y       Y       Y         Y                   Y        Y        Y                   Y
MapReduce 2                     -       Y       -         -                   -        -        -                   -
Hadoop Common                   Y       Y       Y         Y                   -        -        Y                   Y
Apache Hive                     Y       Y       Y         Y                   Y        Y        Y                   Y
Apache Pig                      Y       Y       Y         Y                   Y        Y        Y                   Y
Apache HBase                    Y       Y       Y         Y                   Y        Y        Y                   Y
Apache Zookeeper                Y       Y       Y         Y                   -        -        Y                   Y
Apache Ambari                   -       -       Y         -                   -        -        -                   -
Apache Templeton                -       -       Y         -                   -        -        -                   -
Apache Flume                    Y       Y       -         Y                   Y        Y        Y                   Y
Apache Sqoop                    Y       Y       Y         Y                   Y        Y        -                   -
Apache Mahout                   Y       Y       -         Y                   Y        Y        -                   -
Apache Whirr                    Y       Y       -         Y                   Y        Y        -                   -
Apache Oozie                    Y       Y       Y         Y                   Y        Y        Y                   Y
Apache Lucene                   -       -       -         -                   -        -        Y                   Y
Apache Derby                    -       -       -         -                   -        -        Y                   Y
Apache Avro                     -       -       -         -                   -        -        Y                   Y
Hue                             Y       Y       -         -                   -        -        -                   -
BigInsights Apps                -       -       -         -                   -        -        -                   Y

Hadoop Management               CDH3u4  CDH4u0  HDP v1.0  Apache Bigtop v0.3  MapR M3  MapR M5  BigInsights 1.4 BE  BigInsights 1.4 EE
Nagios                          -       -       Y         -                   -        -        -                   -
Ganglia                         -       -       Y         -                   -        -        -                   -
Zettaset Orchestrator™          Y       Y       Y         Y                   Y        Y        Y                   Y
Cloudera Manager                Y       Y       -         -                   -        -        -                   -
MapR Manager                    -       -       -         -                   Y        Y        -                   -
BigInsights web console         -       -       -         -                   -        -        -                   Y
BigInsights simple console      -       -       -         -                   -        -        Y                   -
January 24, 2013                             Zettaset, Inc. | Proprietary
What is the current state of Security?
•   Another Great Question
•   Minimal work has been done in this field
•   Currently not a huge community focus
•   Everyone feels like it’s been addressed by
    adding Kerberos to the system

Don’t tell InfoSec people that Kerberos has fixed
                   everything!
Why Not Tell Them That?
• You will give them an aneurysm.
• Kerberos is “Brushed On” Security NOT
  “Baked In” security.
• Kerberos does NOT address compliance
  issues around data (HIPAA, GLBA, PCI, etc.)

 Nothing around encryption, nothing around
              best practices.
Hadoop: What’s Missing?

• All Hadoop distros are constrained
  by the limitations of the Apache
  open source components
• Not written to support hardened
  security, compliance, encryption,
  policy-enablement, and risk
  management
• Not written with high
  availability, service
  management, and monitoring in
  mind
Current State of Hadoop Security

• Existing security for Apache-based Hadoop
  distributions does not meet enterprise requirements
  for regulatory compliance mandates such as
  HIPAA and SOX
• Security breaches can have negative
  impacts, e.g., releasing sensitive information, damaging
  the brand, compromising competitive advantage, or sparking
  litigation
• The Hadoop security mechanism provides mutual
  authentication of users and services via SASL and
  Kerberos, but this has limitations
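
To make the Kerberos point concrete, here is a minimal sketch that checks whether a cluster's core-site.xml has Kerberos authentication and service-level authorization switched on. The property names `hadoop.security.authentication` and `hadoop.security.authorization` are standard Hadoop settings; the sample file contents and the helper names `read_properties` and `find_gaps` are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml excerpt, used only for this illustration.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
"""

# Values expected when Kerberos authentication is enabled.
EXPECTED = {
    "hadoop.security.authentication": "kerberos",
    "hadoop.security.authorization": "true",
}


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop config file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


def find_gaps(props):
    """List expected settings that are missing or set to another value."""
    return [
        f"{name}: expected {want!r}, found {props.get(name)!r}"
        for name, want in EXPECTED.items()
        if props.get(name, "").lower() != want
    ]


print(find_gaps(read_properties(SAMPLE_CORE_SITE)) or "authentication settings look enabled")
```

A clean result here only shows that authentication is turned on. It says nothing about encryption, auditing, or compliance, which is exactly the limitation these slides describe.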




Enterprise-Class Hadoop Security
      Addresses the security gaps and vulnerabilities that exist
             in all Apache-based Hadoop distributions
•   Hardened to address access control, policy, compliance and risk management

•   Support for Lightweight Directory Access Protocol (LDAP) and Active Directory
    (AD), enabling Hadoop clusters to seamlessly integrate with existing security policies
    within the enterprise environment

•   Centralized configuration management, logging, and auditing, which maintains control
    of ingress and egress points in the cluster, and enables Hadoop clusters to meet
    compliance requirements for reporting and forensics

•   Role-based access control (RBAC), which significantly improves the user
    authentication process, and enables Kerberos to be run against all components of a
    big data ecosystem, not just Hadoop
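
The role-based access control and centralized auditing ideas above can be sketched in a few lines. This is an illustrative model only, not Zettaset Orchestrator's implementation: the role names, resource paths, and the in-memory `audit_log` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission map. Each permission is a (resource, action) pair.
ROLE_PERMISSIONS = {
    "analyst": {("/data/sales", "read")},
    "etl": {("/data/sales", "read"), ("/data/sales", "write")},
    "admin": {("cluster-config", "write")},
}


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)


# In-memory stand-in for a centralized audit log.
audit_log = []


def is_allowed(user, resource, action):
    """Grant access only if one of the user's roles carries the permission."""
    return any(
        (resource, action) in ROLE_PERMISSIONS.get(role, set())
        for role in user.roles
    )


def check_access(user, resource, action):
    """Decide, then record the decision so it can be reported on later."""
    allowed = is_allowed(user, resource, action)
    audit_log.append((user.name, resource, action, allowed))
    return allowed


alice = User("alice", {"analyst"})
print(check_access(alice, "/data/sales", "read"))
print(check_access(alice, "/data/sales", "write"))
```

Permissions hang off roles rather than individual users, so a role could be mapped to an existing LDAP or Active Directory group. Logging denials alongside grants is what makes the record useful for forensics.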




Defining Big Data Use Case
• What is your use case?
• What are you trying to accomplish?
• What data are you going to be storing?
• What are you going to do after you store it?


  This will define your Security Threat Model and how you
                      protect your data.
Big Data Production System




[Figure: a big data production system ingesting log files, alerts, transactions, etc., across three data types: structured, semi-structured, and unstructured data.]
Big Data Landscape (Version 2.0)
[Figure: landscape map grouped into the following categories]
• Infrastructure: NoSQL Databases, NewSQL Databases, MPP Databases, Hadoop Related, Management / Monitoring, Cluster Services, Security, Storage, Crowdsourcing, Collection / Transport
• Analytics: Analytics Solutions, Data Visualization, Statistical Computing, Social Media, Sentiment Analysis, Analytics Services, Location / People / Events, Big Data Search, IT Analytics, Real-Time, Crowdsourced Analytics, SMB Analytics
• Applications: Ad Optimization, Publisher Tools, Marketing, Industry Applications, Application Service Providers
• Data Sources: Data Marketplaces, Data Sources, Personal Data
• Cross Infrastructure / Analytics
• Open Source Projects: Framework, Query / Data Flow, Data Access, Coordination / Workflow, Real-Time, Statistical Tools, Machine Learning, Cloud Deployment
© Matt Turck (@mattturck) and Shivon Zilis (@shivonz) Bloomberg Ventures
What is a Threat Model?
• Threat modeling is based on the notion that any
  system or organization has assets of value worth
  protecting and these assets have certain
  vulnerabilities.
• Internal or external threats exploit these
  vulnerabilities in order to cause damage to the
  assets, and appropriate security
  countermeasures exist that mitigate the threats.
• A threat model can help to assess the
  probability, the potential harm, the priority etc., of
  attacks, and thus help to minimize or eradicate
  the threats.
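
The vocabulary above (assets, vulnerabilities, threats, countermeasures) maps naturally onto a small data model. The sketch below is one hypothetical way to record it; the class names, fields, and example entries are assumptions made for illustration, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Threat:
    name: str
    source: str         # "internal" or "external"
    asset: str          # the asset of value being targeted
    vulnerability: str  # the weakness the threat exploits


@dataclass
class ThreatModel:
    assets: set = field(default_factory=set)
    threats: list = field(default_factory=list)
    # Maps a threat name to the countermeasures chosen for it.
    countermeasures: dict = field(default_factory=dict)

    def add_threat(self, threat):
        """Record a threat; it must target an asset the model knows about."""
        if threat.asset not in self.assets:
            raise ValueError(f"unknown asset: {threat.asset}")
        self.threats.append(threat)

    def mitigate(self, threat_name, countermeasure):
        """Attach a countermeasure to a named threat."""
        self.countermeasures.setdefault(threat_name, []).append(countermeasure)

    def unmitigated(self):
        """Threats that still have no countermeasure assigned."""
        return [t for t in self.threats if not self.countermeasures.get(t.name)]


# Hypothetical example entries.
model = ThreatModel(assets={"customer records"})
model.add_threat(Threat("bulk export", "internal", "customer records",
                        "no per-user access control on the data directory"))
model.add_threat(Threat("disk theft", "external", "customer records",
                        "data stored unencrypted at rest"))
model.mitigate("disk theft", "encrypt data at rest")
print([t.name for t in model.unmitigated()])
```

Listing the unmitigated threats is the useful output: it shows where a vulnerability is known but no countermeasure has been chosen yet.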
Approaches to threat modeling
•   Three general approaches to threat modeling:

• Attacker-centric
     – Attacker-centric threat modeling starts with an attacker and evaluates their
       goals and how they might achieve them. Attackers' motivations are often
       considered, for example, "The NSA wants to read this email," or "Jon wants to
       copy this DVD and share it with his friends." This approach usually starts from
       either entry points or assets.
• Software-centric
     – Software-centric threat modeling (also called 'system-centric,' 'design-centric,' or
       'architecture-centric') starts from the design of the system, and attempts to step
       through a model of the system, looking for types of attacks against each element
       of the model. This approach is used in threat modeling in Microsoft's Security
       Development Lifecycle.
• Asset-centric
     – Asset-centric threat modeling involves starting from assets entrusted to a
       system, such as a collection of sensitive personal information.
Bottom Line
• Identify any threats to the confidentiality, availability and
  integrity of the data and the application based on the
  data access control matrix that your application should
  be enforcing
• Assign risk values and determine the risk responses
• Determine the countermeasures to implement based on
  your chosen risk responses
• Continually update the threat model based on the
  emerging security landscape.
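
The "assign risk values and determine the risk responses" step can be sketched as a simple scoring pass. The 1-to-5 scales, the likelihood-times-impact score, the thresholds, and the example risks below are all assumptions chosen for illustration; a real program would use whatever scheme its risk framework prescribes.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    threat: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def score(self):
        """Simple risk value: likelihood multiplied by impact."""
        return self.likelihood * self.impact


def choose_response(risk, mitigate_at=12, accept_below=5):
    """Map a score to a response using hypothetical thresholds."""
    if risk.score >= mitigate_at:
        return "mitigate"
    if risk.score < accept_below:
        return "accept"
    return "review"


def prioritize(risks):
    """Highest score first, so countermeasures go to the worst risks."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical example risks.
risks = [
    Risk("Stale test data left in a scratch directory", 2, 2),
    Risk("Unencrypted data at rest is copied off a node", 3, 5),
    Risk("Shared service account hides who ran a job", 4, 3),
]

for risk in prioritize(risks):
    print(f"{risk.score:>2}  {choose_response(risk):<8}  {risk.threat}")
```

Re-running the scoring whenever likelihood or impact estimates change is one lightweight way to keep the threat model continually updated.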
Summary
 • All existing Apache-based Hadoop distributions have functional
   limitations which constrain enterprise adoption
 • Zettaset Orchestrator is addressing the enterprise-level gaps in
   security, high availability, performance, and manageability that
   exist in all Apache-based Hadoop distributions
 • Orchestrator is a universal management and control software
   layer that can sit on top of any Hadoop distribution (distro-
   agnostic)
 • Orchestrator fills the Service Management gaps that exist in all
   Hadoop distributions and cluster deployments, and makes
   Hadoop ready for broader enterprise adoption



Thank You!

Big Data Cloud Meetup - Jan 24 2013 - Zettaset


Editor's Notes

  • #4 What does Zettaset do?
  • #7 “Data capacity on average in enterprises is growing at 40% to 60% year over year due to a number of factors, including an explosion in unstructured data.” - Computerworld, 2010. “80 percent of data is unstructured.” - IBM, 2010
  • #8 IBM BigInsights includes proprietary applications as part of its EE distribution only. These include Jaql, Jaqlserver, Workflow, BigSheets, and LanguageWare
  • #9 Don’t tell them it’s secure just because it’s behind a firewall either… it didn’t work in 1995 and it doesn’t always work now.
  • #14 Before you can talk about securing big data, you have to understand your data use case