
Five Tips for Running Cloudera on AWS

This deck covers key considerations and provides advice for enterprises looking to run production-scale Cloudera on AWS. We touch on everything from security to governance to selecting the right instance type for your Hadoop workload (Spark, Impala, Search, etc.).


Five Tips for Running Cloudera on AWS

  1. Five Tips for Running Cloudera on AWS. Joy Chatterjee | Senior Product Manager | Cloudera. Rahul Bhartia | Ecosystem Solutions Architect | Amazon Web Services. © Cloudera, Inc. All rights reserved.
  2. What We’re Going to Cover
     • Hadoop in the cloud
     • Architectural and access patterns
     • Deployment and management
     • Security and governance
     • How to get started
  3. Big data transforming business: Customer & Channel; Data-Driven Products; Security, Risk & Compliance
  4. Customer 360: Airbnb uses Cloudera on AWS as a platform for machine learning and search that more effectively matches customers with the right rental property.
  5. Data-Driven Products: A camera company uses Cloudera to run big data and analytic workloads on AWS to literally soar ahead of the competition.
  6. Risk: FINRA uses Cloudera on AWS to look at 30B market events per day to build a holistic picture of US market activity, while saving $10-20M annually.
  7. Hadoop in the AWS Cloud
  8. Why AWS for Hadoop?
     • Immediate Availability: Deploy the infrastructure you need almost instantly without long provisioning cycles.
     • Broad & Deep Capabilities: Find everything you need to collect, store, process, analyze and visualize Big Data.
     • Scalable: Scale from a few gigabytes to several petabytes, and from a few machines to thousands of nodes, with just a few clicks.
  9. Global Footprint
     • Over 1 million active customers across 190 countries
     • 1,700 government agencies
     • 4,500 educational institutions
     • 11 regions
     • 30 availability zones
     • 53 edge locations
     • Every day, AWS adds enough new server capacity to support Amazon.com when it was a $7 billion global enterprise.
     • Map legend: Region, Edge Location
  10. AWS services overview (diagram)
      • Administration & Security: Access Control; Identity Management; Key Management & Storage; Monitoring & Logs; Resource & Usage Auditing
      • Platform Services
        Analytics: Data Pipelines; Data Warehouse; Hadoop; Real-time Streaming Data
        App Services: App Streaming; Email; Queuing & Notifications; Search; Transcoding; Workflow
        Developer Tools & Operations: Application Lifecycle Management; Containers; Deployment; DevOps; Event-driven Computing; Resource Templates
        Mobile Services: Identity; Mobile Analytics; Push Notifications; Sync
      • Core Services: CDN; Compute (VMs, Auto-scaling & Load Balancing); Databases (Relational, NoSQL, Caching); Networking (VPC, DX, DNS); Storage (Object, Block and Archival)
      • Infrastructure: Availability Zones; Points of Presence; Regions
      • Enterprise Applications: Business Email; Sharing & Collaboration; Virtual Desktop
      • Technical & Business Support: Account Management; Partner Ecosystem; Professional Services; Security & Pricing Reports; Solutions Architects; Support; Training & Certification
  11. Wide Choice of Compute & Storage
      • Compute: GPU enabled (G2); General purpose (M4, T2); Memory optimized (R3); Storage optimized (I2, D2); Compute optimized (C4)
      • Storage: Object (Amazon S3); File (Amazon EFS); Block (Amazon EBS, Amazon EC2 instance storage); Amazon Glacier
  12. Cloudera Enterprise on AWS: Hybrid Deployment Flexibility
      A new kind of data platform:
      • One place for unlimited data
      • Unified, multi-framework data access
      Cloudera makes it:
      • Fast for business
      • Easy to manage
      • Secure without compromise
  13. Hadoop is Different in the Cloud
      • Storage: Direct Attached (on-premises) vs. Direct Attached or Object Store (cloud)
      • Data: Not shared across clusters (on-premises) vs. Shared across multiple clusters (cloud)
      • Sizing: Fixed-size (on-premises) vs. Dynamic based on load (cloud)
      • Usage Model: All users share cluster (on-premises) vs. Clusters created as needed for apps/users (cloud)
      • Hardware: Industry Standard Servers with CPU, Memory, & Direct Attached Storage (on-premises) vs. Industry Standard Servers with CPU & Memory, plus Object Storage (cloud)
  14. On-Premises vs. Cloud Deployment (diagram labels: Impala; Compute bound; Memory bound; Desires balance of compute and memory; Impala; S3 Object Store)
  15. Common Architectural Patterns in AWS
      • S3 Object Store: Source Data; Seed Data; Backup/DR
      • ETL/Modeling (Spark, MapReduce): Short-running clusters; Elastic workload; Little local storage
      • BI/Analytics (Impala, Solr): Long-running clusters; Sized to demand; Some local storage
      • App Delivery (HBase, Kudu): Fixed clusters; Periodic sync; All local storage
  16. Different Data Types and Access Patterns
      • Fast (< 2 sec) access to large volumes of data using Impala
      • Bulk processing of huge volumes of data using Spark or Hive
      • Operational Data: Real-time (< 10 ms) access to structured customer data using HBase
      • Search: Integrated full-text search using Apache Solr
      • Stream Processing: Real-time stream processing of data using Spark Streaming
  17. Deployment and Management
  18. Cloudera Director: Deploy and manage enterprise-grade Hadoop in the cloud (www.cloudera.com/downloads)
      • Trusted for Production: Supports large scale customer deployments; Integrated part of Cloudera Enterprise
      • Flexible Deployment: Out-of-the-box integrations with AWS; Supports multiple cloud platforms
      • Simple Administration: Dynamic cluster lifecycle management; Multi-cluster, multi-environment; Custom use case support through API
  19. Example of Cloud Cluster
      Layout: Master Nodes; Worker Group 1 (disk optimized instances); Worker Group 2 (memory optimized instances)
      • Optimize worker groups by using various instance types within the same cluster
      • Grow and shrink compute independently across worker groups based on load
      • Use on-demand instances for master nodes and spot instances for worker nodes
      • Use Director to manage the configuration files that serve as a blueprint for repeatable deployments
  20. Job dispatch to the cluster (diagram labels: Job; Dispatcher; Policy; Cluster 1; Cluster 2; Cluster 3; Cluster 4; Object storage)
  21. Security and Governance
  22. Security is Job Zero
      • Familiar Security Model: Validated and driven by customers’ security experts; Benefits all customers
      • Layers: People & Process; System; Network; Physical
  23. AWS And You Share Responsibility for Security
      • You: Customer applications & content; Network Security; Server Security; Data Security; Access Control. You get to define your controls IN the Cloud.
      • AWS: AWS Foundation Services (Compute, Storage, Database, Networking); AWS Global Infrastructure (Regions, Availability Zones, Edge Locations). AWS takes care of the security OF the Cloud.
  24. Key AWS Certifications and Assurance Programs
  25. Understand Configuration Changes
      • Automate IT asset inventory
      • Discover and provision cloud services
      • Audit and troubleshoot configuration changes in the cloud
  26. Monitoring: Get consistent visibility of logs. Full visibility of your AWS environment
      • CloudTrail records API calls and saves the logs in your S3 buckets, no matter how those API calls were made
      • Who did what, when, and from where (IP address)
      • CloudTrail supports a growing list of AWS services, including EC2, EBS, VPC, RDS, IAM and Redshift
      • Easily aggregate all log information
  27. Comprehensive, Compliance-Ready Security: Authentication, Authorization, Audit, and Compliance
      • Access: Defining what users and applications can do with data. Technical Concepts: Permissions, Authorization
      • Data: Protecting data in the cluster from unauthorized visibility. Technical Concepts: Encryption, Tokenization, Data masking
      • Visibility: Reporting on where data came from and how it’s being used. Technical Concepts: Auditing, Lineage
      • Perimeter: Guarding access to the cluster itself. Technical Concepts: Authentication, Network isolation
  28. RecordService: Unified Access Control Enforcement
      • New high performance security layer that centrally enforces access control policies across Hadoop
      • Complements Apache Sentry’s unified policy definition
      • Row- and column-based security
      • Dynamic data masking
      • Apache-licensed open source
      • Beta now available
      Platform diagram:
      • Workloads: Data Engineering; Data Discovery & Analytics; Data Apps
      • Engines: Batch (Spark, Hive, Pig); Stream (Spark); SQL (Impala); Search (Solr); Model (Spark); Online (HBase)
      • Unified Data Services: Resource Management (YARN); Security (Sentry, RecordService)
      • Data Integration & Storage: Filesystem (HDFS); NoSQL (HBase); Filesystem (S3); Ingest (Sqoop, Flume, Kafka)
  29. Fine-Grained HDFS Access without RecordService: Split the original file; Use HDFS permissions to limit access
      Original file:
      Date/time | Accnt # | SSN | Asset | Trade | Country
      09:33:11 16-Feb-2015 | 0234837823 | 238-23-9876 | AAPL | Sell | US
      11:33:01 16-Feb-2015 | 3947848494 | 329-44-9847 | TBT | Buy | EU
      14:12:34 16-Feb-2015 | 4848367383 | 123-56-2345 | IBM | Sell | UK
      09:22:03 16-Feb-2015 | 3485739384 | 585-11-2345 | INTC | Buy | US
      11:55:33 16-Feb-2015 | 3847598390 | 234-11-8765 | F | Buy | US
      10:22:55 16-Feb-2015 | 8765432176 | 344-22-9876 | UA | Buy | UK
      13:45:24 16-Feb-2015 | 3456789012 | 412-22-8765 | AMZN | Sell | EU
      09:03:44 16-Feb-2015 | 4857389329 | 123-44-5678 | TMV | Buy | US
      15:55:55 16-Feb-2015 | 4756983234 | 234-76-9274 | MA | Buy | UK
      Split file (UK rows):
      Date/time | Accnt # | SSN | Asset | Trade | Country
      14:12:34 16-Feb-2015 | 4848367383 | 123-56-2345 | IBM | Sell | UK
      10:22:55 16-Feb-2015 | 8765432176 | 344-22-9876 | UA | Buy | UK
      15:55:55 16-Feb-2015 | 4756983234 | 234-76-9274 | MA | Buy | UK
      Split file (EU rows):
      Date/time | Accnt # | SSN | Asset | Trade | Country
      11:33:01 16-Feb-2015 | 3947848494 | 329-44-9847 | TBT | Buy | EU
      13:45:24 16-Feb-2015 | 3456789012 | 412-22-8765 | AMZN | Sell | EU
      Split file (US rows):
      Date/time | Accnt # | SSN | Asset | Trade | Country
      09:33:11 16-Feb-2015 | 0234837823 | 238-23-9876 | AAPL | Sell | US
      09:22:03 16-Feb-2015 | 3485739384 | 585-11-2345 | INTC | Buy | US
      11:55:33 16-Feb-2015 | 3847598390 | 234-11-8765 | F | Buy | US
      09:03:44 16-Feb-2015 | 4857389329 | 123-44-5678 | TMV | Buy | US
  30. Fine-Grained HDFS Access Control with RecordService
      • Apply controls to the master data file
      • Row, column, and sub-column (masking) controls
      • Enforce these across all access paths
      Master data file (annotated with Column-Level Controls and Row-Level Controls):
      Date/time | Accnt # | SSN | Asset | Trade | Country
      09:33:11 16-Feb-2015 | 0234837823 | 238-23-9876 | AAPL | Sell | US
      11:33:01 16-Feb-2015 | 3947848494 | 329-44-9847 | TBT | Buy | EU
      14:12:34 16-Feb-2015 | 4848367383 | 123-56-2345 | IBM | Sell | EU
      09:22:03 16-Feb-2015 | 3485739384 | 585-11-2345 | INTC | Buy | US
      11:55:33 16-Feb-2015 | 3847598390 | 234-11-8765 | F | Buy | US
      10:22:55 16-Feb-2015 | 8765432176 | 344-22-9876 | UA | Buy | EU
      13:45:24 16-Feb-2015 | 3456789012 | 412-22-8765 | AMZN | Sell | EU
      What U.S. Brokers See (annotated with Column-Level Controls and Row-Level Controls; SSN values overlaid with XXX-XX masks):
      Date/time | Accnt # | SSN | Asset | Trade | Country
      09:33:11 16-Feb-2015 | 0234837823 | 238-23-9876 | AAPL | Sell | US
      11:33:01 16-Feb-2015 | 3947848494 | 329-44-9847 | TBT | Buy | group2
      14:12:34 16-Feb-2015 | 4848367383 | 123-56-2345 | IBM | Sell | group3
      09:22:03 16-Feb-2015 | 3485739384 | 585-11-2345 | INTC | Buy | US
      11:55:33 16-Feb-2015 | 3847598390 | 234-11-8765 | F | Buy | US
      10:22:55 16-Feb-2015 | 8765432176 | 344-22-9876 | UA | Buy | group3
      13:45:24 16-Feb-2015 | 3456789012 | 412-22-8765 | AMZN | Sell | group2
  31. Data Management
  32. Data Management Challenges
      • Business Users: How do I find what’s relevant? Can I trust what I find? How can I explore data on my own?
      • Compliance Officers: Who’s accessing what data? What are they doing with the data? Is sensitive data governed and protected? Can I meet compliance needs?
      • Database Admins: How is data being used today? How can I optimize for future workloads? How can I take advantage of Hadoop risk-free and fast?
      • Data Stewards/Curators: How can I manage data from ingest to purge? How do I classify data efficiently? How can data be made available to end-users?
  33. Cloudera Navigator in the Cloud: Data Management & Governance
      • Audit: Only distribution to pass PCI audit
      • Lineage: How did I get to this truth?
      • Search (metadata): What are all the files associated with this product?
      • Policy: Move data from HDFS to Amazon S3
  34. How to Get Started
  35. Cloudera on AWS Checklist
      1. Use the right AWS instance types for the right Hadoop workload.
      2. Take a DevOps approach to managing the Hadoop lifecycle on AWS.
      3. Amazon S3 provides good, low-cost cloud storage for Hadoop jobs.
      4. Security is everyone’s responsibility. Enforce multi-layer security that covers cloud, cluster, and data access, authentication, and encryption.
      5. Metadata is key to data management and governance. Without it, your users can’t answer the questions that matter to your business.
  36. Cloudera Live runs on AWS and includes step-by-step tutorials and integrations with familiar BI tools: www.cloudera.com/live
  37. AWS Quick Start Reference Deployments: http://aws.amazon.com/quickstart
  38. Thank You!
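
As a rough illustration of the cluster layout on slide 19 (on-demand master nodes plus separately sized worker groups on spot instances), the following Python sketch models a cluster as named instance groups that can be resized independently. It is not Cloudera Director's configuration schema or API: the class names, group names, node counts, and the specific instance types (`m4.xlarge`, `d2.xlarge`, `r3.xlarge`) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceGroup:
    """One group of identically configured nodes (hypothetical model)."""
    name: str
    instance_type: str  # illustrative EC2 instance type, not a recommendation
    count: int
    pricing: str        # "on-demand" or "spot"

    def resize(self, new_count: int) -> None:
        """Grow or shrink this group without touching the others."""
        if new_count < 0:
            raise ValueError("node count cannot be negative")
        self.count = new_count


@dataclass
class ClusterSpec:
    """A cluster described as a set of independently sized groups."""
    name: str
    groups: dict = field(default_factory=dict)

    def add_group(self, group: InstanceGroup) -> None:
        self.groups[group.name] = group

    def total_nodes(self) -> int:
        return sum(g.count for g in self.groups.values())


def example_cluster() -> ClusterSpec:
    """Build the slide's layout; all counts and types are assumed values."""
    spec = ClusterSpec(name="analytics")
    spec.add_group(InstanceGroup("masters", "m4.xlarge", 3, "on-demand"))
    spec.add_group(InstanceGroup("workers-disk", "d2.xlarge", 6, "spot"))
    spec.add_group(InstanceGroup("workers-memory", "r3.xlarge", 4, "spot"))
    return spec


if __name__ == "__main__":
    cluster = example_cluster()
    # Scale only the memory-optimized workers in response to load.
    cluster.groups["workers-memory"].resize(8)
    print(cluster.total_nodes())
```

Keeping each group's size and pricing model separate is what lets the memory-optimized workers scale without affecting the masters or the disk-optimized workers.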
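
Slide 26 describes CloudTrail logs as answering who did what, when, and from where. A minimal sketch of reading those answers out of a log file is shown below. It relies on the `Records`, `userIdentity`, `eventSource`, `eventName`, `eventTime`, and `sourceIPAddress` fields of the CloudTrail record format; the sample record itself, including the user name and IP address, is invented for illustration, and a real pipeline would first download the log objects from the S3 bucket.

```python
import json

# Invented sample in the shape of a CloudTrail log file (not real data).
SAMPLE_LOG = """
{
  "Records": [
    {
      "eventTime": "2016-01-27T09:33:11Z",
      "eventSource": "ec2.amazonaws.com",
      "eventName": "RunInstances",
      "sourceIPAddress": "203.0.113.10",
      "userIdentity": {"type": "IAMUser", "userName": "example-admin"}
    }
  ]
}
"""


def summarize(log_text: str) -> list:
    """Return one 'who did what, when, from where' line per record."""
    records = json.loads(log_text).get("Records", [])
    lines = []
    for rec in records:
        identity = rec.get("userIdentity", {})
        # Fall back through the identity fields, since not every
        # identity type carries a userName.
        who = (identity.get("userName")
               or identity.get("arn")
               or identity.get("type", "unknown"))
        what = f"{rec.get('eventSource', '?')}:{rec.get('eventName', '?')}"
        when = rec.get("eventTime", "?")
        where = rec.get("sourceIPAddress", "?")
        lines.append(f"{who} called {what} at {when} from {where}")
    return lines


if __name__ == "__main__":
    for line in summarize(SAMPLE_LOG):
        print(line)
```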
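
The contrast between slides 29 and 30 (splitting files per region versus applying row and column controls to one master file) can be sketched in a few lines of Python. This is only a toy model of the idea, not RecordService's or Sentry's API: the `Trade` type and the `mask_ssn` and `view_for` helpers are hypothetical names, and keeping the last four SSN digits visible is an assumption, since the slide shows only an XXX-XX mask. The rows are the sample data from slide 30.

```python
from typing import NamedTuple


class Trade(NamedTuple):
    timestamp: str
    account: str
    ssn: str
    asset: str
    side: str
    country: str


# Master data file from slide 30; it stays in one place, unsplit.
TRADES = [
    Trade("09:33:11 16-Feb-2015", "0234837823", "238-23-9876", "AAPL", "Sell", "US"),
    Trade("11:33:01 16-Feb-2015", "3947848494", "329-44-9847", "TBT", "Buy", "EU"),
    Trade("14:12:34 16-Feb-2015", "4848367383", "123-56-2345", "IBM", "Sell", "EU"),
    Trade("09:22:03 16-Feb-2015", "3485739384", "585-11-2345", "INTC", "Buy", "US"),
    Trade("11:55:33 16-Feb-2015", "3847598390", "234-11-8765", "F", "Buy", "US"),
    Trade("10:22:55 16-Feb-2015", "8765432176", "344-22-9876", "UA", "Buy", "EU"),
    Trade("13:45:24 16-Feb-2015", "3456789012", "412-22-8765", "AMZN", "Sell", "EU"),
]


def mask_ssn(ssn: str) -> str:
    """Sub-column control: hide all but the last four digits (assumed policy)."""
    return "XXX-XX-" + ssn[-4:]


def view_for(country: str, trades=TRADES) -> list:
    """Row-level filter by country plus SSN masking, applied at read time."""
    return [t._replace(ssn=mask_ssn(t.ssn)) for t in trades if t.country == country]


if __name__ == "__main__":
    for row in view_for("US"):
        print(row)
```

Because the filter and mask are applied when the data is read, the master file is never copied or split, which is the point the slide makes about enforcing controls across all access paths.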
