Combat Cyber Threats with Cloudera Impala & Apache Hadoop



Learn how you can use Cloudera Impala to:

- Operate with all data in your domain
- Address cyber security analysis and forensics needs
- Combat fraud, waste, and abuse

Published in: Technology

  • Interactive SQL for Hadoop
      - Responses in seconds vs. minutes or hours
      - 4-100x faster than Hive
      - Nearly ANSI-92 standard SQL with HiveQL: CREATE, ALTER, SELECT, INSERT, JOIN, subqueries, etc.
      - ODBC/JDBC drivers
      - Compatible SQL interface for existing Hadoop/CDH applications
    Native MPP Query Engine
      - Purpose-built for low-latency queries: another application being brought to Hadoop
      - Separate runtime from MapReduce, which is designed for batch processing
      - Tightly integrated with the Hadoop ecosystem, a major design imperative and differentiator for Cloudera
          - Single system (no integration)
          - Native, open file formats that are compatible across the ecosystem (no copying)
          - Single metadata model (no synchronization)
          - Single set of hardware and system resources (better performance, lower cost)
          - Integrated, end-to-end security (no vulnerabilities)
    Open Source
      - Keeps with our strategy of an open platform, i.e. if it stores or processes data, it’s open source
      - Apache-licensed
      - Code available on GitHub
  • More & Faster Value from Big Data
      - Provides an interactive BI/analytics experience on Hadoop
          - Previously, BI/analytics was impractical due to the batch orientation of MapReduce
      - Enables more users to gain value from organizational data assets (SQL/BI users)
      - Makes more data available for analysis (raw data, multi-structured data, historical data)
      - Removes delays from data migration
          - Into specialized analytical DBMSs
          - Into proprietary file formats that happen to be stored in HDFS
          - Into transient in-memory stores
    Flexibility
      - Query across existing data in Hadoop
          - HDFS and HBase
          - Access data immediately and directly in its native format
      - Select best-fit file formats
          - Use raw data formats when unsure of access patterns (text files, RCFiles, LZO)
          - Increase performance with optimized file formats when access patterns are known (Parquet, Avro)
          - All file formats are compatible across the entire Hadoop ecosystem, i.e. MapReduce, Pig, Hive, Impala, etc. on the same data at the same time
    Cost Efficiency
      - Reduce movement, duplicate storage & compute
          - Data movement: no time or resource penalty for migrating data into specialized systems or formats
          - Duplicate storage: no need to duplicate data across systems or within the same system in different file formats
          - Compute: use the same compute resources as the rest of the Hadoop system
              - You don’t need a separate set of nodes to run interactive queries vs. batch processing (MapReduce)
              - You don’t need to overprovision your hardware to enable memory-intensive, on-the-fly format conversions
      - 10% to 1% the cost of an analytic DBMS
          - Less than $1,000/TB
    Full Fidelity Analysis
      - No loss of fidelity from aggregations or conforming to fixed schemas
      - If the attribute exists in the raw data, you can query against it
  • This is an overview of the simple cluster I put together for the webinar. It has 4 nodes in total: a 3-node Hadoop cluster and an application server. This configuration is one that would be present in many public and private organizations. We have placed a sensor at the gateway (or gateways) across the enterprise to monitor incoming and outgoing traffic. This information is captured by a variety of sensors/collectors and written to files on a regular basis. Now let's go through the data sets.
  • 1.) Provide a brief tour of the cluster using Cloudera Manager

    1. Combat Cyber Threats with Cloudera Impala & Apache Hadoop
       Justin Erickson | Director, Product Management, Cloudera
       Wayne Wheeles | Analytic, Infrastructure and Enrichment Developer, Cyber Security, Six3 Systems
       July 2013
    2. Agenda
       What’s new in Impala?
         • Impala recap
         • Impala 1.1
         • Authorization with Sentry
       Cyber security with Impala
         • Cyber security demo overview
         • Working with WebProxy Data
         • Working with Netflow Data
         • IDS Amplification and Correlation “holy grail use case”
         • Discussion and questions
    3. Cloudera Impala
       Interactive SQL for Hadoop
         • Responses in seconds
         • ANSI-92 standard SQL with Hive SQL
       Native MPP Query Engine
         • Purpose-built for low-latency queries
         • Separate runtime from MapReduce
         • Designed as part of the Hadoop ecosystem
       Open Source
         • Apache-licensed
    4. Benefits of Impala
       More & Faster Value from “Big Data”
         • Interactive BI/analytics experience via SQL
         • No delays from data migration
       Flexibility
         • Query across existing data
         • Select best-fit file formats (Parquet, Avro, etc.)
         • Run multiple frameworks on the same data at the same time
       Cost Efficiency
         • Reduce movement, duplicate storage & compute
         • 10% to 1% the cost of analytic DBMS
       Full Fidelity Analysis
         • No loss from aggregations or fixed schemas
    5. Impala 1.1 (released July 23, 2013)
       Sentry support
         • Fine-grained authorization
         • Role-based authorization
       Support for views
       Performance
         • Parquet columnar performance
         • Join order sorted by table size
         • More efficient metadata refresh for larger installations
       Additional SQL
         • SQL-89 joins (in addition to existing SQL-92)
         • LOAD function
         • REFRESH command for JDBC/ODBC
       Improved HBase support
         • Binary types
         • Caching configuration
       ©2013 Cloudera, Inc. All Rights Reserved.
    6. Previous State of Authorization: Two Sub-Optimal Choices for SQL on Hadoop
       Insecure Advisory Authorization
         • Users can grant themselves permissions
         • Intended to prevent accidental deletion of data
         • Problem: Doesn’t guard against malicious users
       HDFS Impersonation
         • Data is protected at the file level by HDFS permissions
         • Problem: File-level not granular enough
         • Problem: Not role-based
    7. Sentry with CDH4.3 Hive and Impala 1.1
       Secure Authorization
         • Ability to control access to data and/or privileges on data for authenticated users
       Fine-Grained Authorization
         • Ability to give users access to a subset of data in a database
       Role-Based Authorization
         • Ability to create/apply templatized privileges based on functional roles
       Multi-Tenant Administration
         • Ability for central admin group to empower lower-level admins to manage security for each database/schema
    8. Part of an overall infosec landscape
       Perimeter
         • Guarding access to the cluster itself
         • Technical Concepts: Authentication, Network isolation
       Data
         • Protecting data in the cluster from unauthorized visibility
         • Technical Concepts: Encryption, Data masking
       Access
         • Defining what users and applications can do with data
         • Technical Concepts: Permissions, Authorization
       Visibility
         • Reporting on where data came from and how it’s being used
         • Technical Concepts: Auditing, Lineage
       Sentry
       Kerberos | Oozie | Knox
       Cloudera Navigator
       Certified Partners
       Available 7/23
    9. Agenda: Cyber security with Impala
       What’s new in Impala?
         • Impala recap
         • Impala 1.1
         • Authorization with Sentry
       Cyber security with Impala
         • Cyber security demo overview
         • Working with WebProxy Data
         • Working with Netflow Data
         • IDS Amplification and Correlation “holy grail use case”
         • Discussion and questions
    10. Impala Mission Demonstration Platform
        [Diagram] Application Server and Cloudera CDH 4 Cluster; hosts sherpa1, sherpa2, sherpa3, sherpa4.
        Service groups shown on the nodes:
          • Cloudera Manager, HDFS, Impala, HBase, MR, Hive
          • HDFS, Impala, HBase, MR, Hive
          • HDFS (NN), Impala (State Store), HBase (RS), MR, Hue, Oozie, ZooKeeper, Hive
        Organization Network and Gateway to Internet, with a sensor producing Netflow, WebProxy, and IDS data.
    11. Demo Platform Data Sets
        Webinar Data Sets
          • Netflow Data
              - The term "flow" refers to a single data flow connection between two hosts, defined uniquely by its five-tuple.
          • IDS/IPS Data
              - Produced by a device or software application that monitors network or system activities for malicious activities or policy violations and sends reports to a management station.
          • WebProxy Data
              - WebProxy records of requests made by users within the corporate domain.
        Enrichment Data Sets
          • Geographic enrichment
              - Geo-location information for addresses
          • Blacklist Information
              - List of addresses identified as potential threats
          • Whitelist Information
              - Addresses known to be located within the corporate network
          • Statistical Cubes
              - Cubes built for the purpose of providing statistical amplification for analysis
    12. Demonstration
        Impala Mission Demonstration Platform
    13. Why Impala for Cyber Security?
        Cloudera Impala and HDFS are a great choice for cyber security:
          • Offers one powerful and secure platform for structured and unstructured data.
          • Uniquely provides the capability to store large amounts of data at an acceptable price point.
          • Sentry provides even greater protection for your cyber security data.
    14. Thank You
          • Ask questions on the Q&A tab
          • Recording will be available at
          • After webinar, inquire at:
          • Contact info: Email: Twitter: @WayneWheeles @JustinErickson @Cloudera
        Cloudera Impala
        “Imagination is more important than knowledge. For knowledge is limited to all we now know and understand, while imagination embraces the entire world, and all there ever will be to know and understand.” ~Albert Einstein
        Six3 Cyber Security Demo
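
The data-set slide describes a Netflow flow as uniquely defined by its five-tuple, and lists blacklist and whitelist information as enrichment. The slides do not show how the demo does this, so the following is only a minimal Python sketch of the idea, not the webinar's implementation. It assumes the conventional five-tuple (source address, destination address, source port, destination port, protocol); the names `FlowKey`, `aggregate_flows`, `classify`, and `enrich` are hypothetical, as is the record layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class FlowKey:
    """Hypothetical five-tuple key identifying one flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def aggregate_flows(records):
    """Sum bytes per five-tuple.

    Each record is assumed to be a tuple:
    (src_ip, dst_ip, src_port, dst_port, protocol, n_bytes).
    """
    totals = defaultdict(int)
    for src_ip, dst_ip, src_port, dst_port, protocol, n_bytes in records:
        totals[FlowKey(src_ip, dst_ip, src_port, dst_port, protocol)] += n_bytes
    return dict(totals)


def classify(address, blacklist, internal_networks):
    """Label an address using blacklist and whitelist enrichment data."""
    if address in blacklist:
        return "blacklisted"
    parsed = ip_address(address)
    if any(parsed in network for network in internal_networks):
        return "internal"
    return "unknown"


def enrich(flows, blacklist, internal_networks):
    """Attach a label for both endpoints of every aggregated flow."""
    return [
        {
            "flow": key,
            "bytes": n_bytes,
            "src": classify(key.src_ip, blacklist, internal_networks),
            "dst": classify(key.dst_ip, blacklist, internal_networks),
        }
        for key, n_bytes in flows.items()
    ]


if __name__ == "__main__":
    # Illustrative records only; the addresses come from documentation ranges.
    sample = [
        ("10.0.0.5", "203.0.113.9", 51000, 443, "tcp", 1200),
        ("10.0.0.5", "203.0.113.9", 51000, 443, "tcp", 800),
        ("10.0.0.7", "198.51.100.20", 51001, 80, "tcp", 300),
    ]
    flows = aggregate_flows(sample)
    for row in enrich(flows, {"203.0.113.9"}, [ip_network("10.0.0.0/8")]):
        print(row)
```

On a cluster like the one in the deck, the same grouping and lookup would more likely be written as SQL joins between the flow table and the enrichment tables; the sketch only makes the logic concrete.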