1© Cloudera, Inc. All rights reserved.
Cloudera & The Cloud
Explosion of data and devices (IoT)
• 30B connected devices
• 440x more data
Transformation of IT infrastructure
• open source
• cloud
• machine learning
$200B total market¹
¹ IDC Worldwide Big Data and Business Analytics Market Through 2020
The data-driven enterprise
The Big Shift
In 2017: 58% on-premises, 11% private cloud, 25% public cloud
By 2019: 38% on-premises, 15% private cloud, 41% public cloud
Source: 451 Research, Voice of the Enterprise: Workloads and Key Projects, Cloud Transformation, 2017.
My organization is moving to the cloud. Why should we consider Cloudera?
Current State
Stakeholders
KNOWLEDGE WORKERS
• Instant, self-service access to data and resources
• Application performance
• Job-oriented tools
• Choice
INFRASTRUCTURE TEAM
• Secure, controlled provisioning
• Predictable costs
• Systems-oriented tools
• Standards and portability
DATA TEAM
• Advance strategic initiatives
• Link analytics to business
• Reduce admin burden
• Integrated solutions
CLOUD BENEFITS (+)
• Speed of deployment
• Tenant isolation
• Self-service
• Workload elasticity
• Shared storage
• Pay-as-you-go
• Bring your own tools
• Bring your own data
CLOUD PROBLEMS (–)
• Disjointed services
• Too many data copies
• Multiple security frameworks
• Difficult to troubleshoot workloads
• No shared metadata
• Unable to track data lineage
• Few on-premises integration services
• Proprietary services
• Cloud lock-in
Deployment options
Bare Metal Private Cloud Cloud IaaS Cloud PaaS
Applications Applications Applications Applications
Clusters Clusters Clusters Clusters
Operating System Operating System Operating System Operating System
Network Network Network Network
Storage Storage Storage Storage
Servers Servers Servers Servers
Customer managed Vendor managed
On-premises
The Answer
● The modern platform for machine learning and analytics
● with multiple deployment options
● and one shared data experience
The modern platform for ML and Analytics...
DATA ENGINEERING: DATA PROCESSING
• Cost efficient
• Reliable
• Scalable
• Based on Spark, MapReduce, Hive & Pig
• Supported by Workload Analytics
ANALYTIC DATABASE: FAST BI & SQL
• Flexibility
• Elastic scale
• Go beyond SQL
• Based on Impala & Hive
• SQL dev environment
• Supported by Workload Analytics
DATA SCIENCE: MACHINE LEARNING
• Fast dev to production
• Secure self-serve
• Based on Python, R, and Spark
• ML dev environment (CDSW)
OPERATIONAL DATABASE: ONLINE & REAL-TIME
• High throughput, low latency
• Strongly consistent
• Based on HBase, Kudu & Spark Streaming
...With multiple deployment options...
OPERATIONAL
DATABASE
DATA
SCIENCE
ANALYTIC
DATABASE
DATA
ENGINEERING
DATA
ENGINEERING
ANALYTIC
DATABASE
PRIVATE CLOUD, BARE METAL, INFRASTRUCTURE SERVICES
(in beta)
SDX in EDH clusters SDX Reference Architecture
Altus SDX (in beta)
Via Cloudera Manager and Director
via Altus PaaS
...And one Shared Data Experience
• Shared catalog
• Unified security
• Consistent governance
• Easy workload management
• Flexible ingest and replication
The modern platform for machine learning and analytics optimized for the cloud
Cloudera Enterprise
EXTENSIBLE SERVICES: DATA CATALOG, SECURITY, GOVERNANCE, WORKLOAD MANAGEMENT, INGEST & REPLICATION
CORE SERVICES: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE
STORAGE SERVICES: AWS S3, AWS EBS, HDFS, KUDU
DEPLOYMENT OPTIONS: PRIVATE CLOUD, BARE METAL, INFRASTRUCTURE SERVICES
"Better to bet on cloud providers
for infrastructure, Cloudera for data,
analytics and security fabric, and
leave the rest to the ecosystem"
BUSINESS VALUE ($)
• 2-4x storage savings, faster ramp
• Lower risk of data breach
• Analysts more productive on jobs
• Safe self-service, no shadow IT
• Less effort for audits/compliance
• IT more strategic, less admin time
• Deployment choices
• Portability and price arbitrage
• No proprietary lock-in
CLOUDERA ADVANTAGES (+)
• Eliminate data copies
• Single security framework
• Easy to troubleshoot workloads
• Universally shared metadata
• Easy to track data lineage
• Unified services
• Same solution as on-premises
• Same solution in multi-cloud
• Open source standards
Cloud Customer Successes
● Manufacturing Customer
● Analytic DB, Data Science &
Engineering on AWS
“We can deliver all of the data with less compute and much
less complexity.”
- Joe Smith, ACME Manufacturing
● Advisory and tech for finserv
● Cloudera EDH on AWS
“Cloudera gives us the best of both worlds—the ability to
capture and process thousands of critical events at scale and
the option to deploy on prem, in the cloud, or in a hybrid
architecture. And running Cloudera on AWS enables us to
leverage a world-class, reliable cloud infrastructure that
supports our changing business needs and the needs of our
customers.”
- Kaushik Deka, CTO and Director of Engineering
Call to Action
1. Learn Cloudera product offerings
2. Deploy using AWS Quickstart template
3. Leverage Professional Services
4. Optimize based on changing business need
Best practices for running your Cloudera Enterprise cluster on AWS
Quick AWS Component Review
EC2 - servers
S3 - object storage
EBS - block storage
RDS - managed relational databases (we still need an RDBMS)
VPC - networking (everything needs a network)
AWS Service Limits
Most AWS services have default limits (most can be increased)
• EC2 - ten (10) m4.4xlarge instances per region
• EBS - 20 TB of Throughput Optimized HDD (st1) per region
• S3 - 100 buckets per account
• VPC - 5 VPCs per region
Defaults might limit your cluster, so plan ahead!
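A minimal sketch of the "plan ahead" check, assuming the default limits listed above (actual limits vary by account and region and can be raised). The dictionary keys, the function name `limits_exceeded`, and the sample plan are hypothetical and chosen only for illustration.

```python
# Default limits as listed on the slide; real accounts may differ.
DEFAULT_LIMITS = {
    "m4.4xlarge_instances_per_region": 10,
    "st1_tb_per_region": 20,
    "s3_buckets_per_account": 100,
    "vpcs_per_region": 5,
}


def limits_exceeded(plan, limits=DEFAULT_LIMITS):
    """Return {resource: (planned, limit)} for each planned value above its limit."""
    return {
        name: (planned, limits[name])
        for name, planned in plan.items()
        if name in limits and planned > limits[name]
    }


# Hypothetical plan: 12 worker instances, each with 2 x 1 TB st1 volumes.
plan = {
    "m4.4xlarge_instances_per_region": 12,
    "st1_tb_per_region": 12 * 2,
    "s3_buckets_per_account": 3,
    "vpcs_per_region": 1,
}

over = limits_exceeded(plan)
for name, (planned, limit) in over.items():
    print(f"request an increase for {name}: planned {planned}, default {limit}")
```

Anything reported here is a limit increase to request before provisioning starts.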
Deployment Topology
Deployment Topology
Two types of deployments from a network perspective:
• Public Subnet - direct access to the Internet & AWS services
• Private Subnet - instances must go through VPC endpoints to reach AWS services & NAT instances to reach the Internet
If you require high-bandwidth access to data sources on the Internet or to AWS services in another region, deploy to a public subnet.
Public doesn’t mean open to the world. Limit access using Security Groups.
Deployment Topology (Public Subnet)
Deployment Topology (Private Subnet)
A Word About Edge Nodes
Edge nodes:
• Have direct access to the cluster
• Users use these nodes via client applications
• Web applications, BI tools, or just the Hadoop command-line client
Avoid direct user access to the cluster!
Deployment Topology with Edge Nodes
Deployment Topology with Edge Nodes
Roles and Instance Types
So you want a cluster, eh?
Deployment Model
Two types of deployments:
• Persistent - long-lived, always on, no spin-up time
• Temporary - short-lived, can be started/stopped to save money
Impacts job setup time, instance types, storage method, and cost.
Instance Types
Ephemeral
• have locally attached storage, HDD or SSD
• on instance termination, data is irrecoverable
EBS-only
• no local storage, must mount EBS volumes
• on instance termination, data is safe in EBS
Networking
Connectivity
Security
Storage Configuration
Three types of storage:
• Instance Storage
• Elastic Block Storage (EBS)
• Object Storage (S3)
Storage Configuration - Instance Storage
• Attached to EC2 instances, like physical disks on a physical server
• Lifetime of storage == lifetime of EC2 instance
• Each EC2 instance type has a different amount of storage
  c1.xlarge has 4 x 420 GB
  d2.8xlarge has 24 x 2 TB
• For HDFS data directories, we like HDD instance storage
• Backup planning: account for multi-instance shutdowns and multi-VM AWS events
Storage Configuration - EBS
• Persistent block level storage volumes
• Can be encrypted at rest w/ negligible impact to latency/throughput
• OS: General Purpose (gp2)
• DFS: Throughput Optimized HDD (st1), Cold HDD (sc1)
• Baseline & burst performance increase with size of provisioned volume
  e.g. 500 GB st1 baseline throughput 20 MB/s; 1000 GB → 40 MB/s
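The scaling in the st1 example can be sketched as a simple linear rule. This assumes the rate implied by the slide's two data points (40 MB/s per 1000 GB) and ignores per-volume maximums and burst behaviour; `st1_baseline_mb_s` is a hypothetical helper name.

```python
# Rate implied by the slide: 500 GB -> 20 MB/s, 1000 GB -> 40 MB/s.
ST1_MB_S_PER_1000_GB = 40


def st1_baseline_mb_s(size_gb):
    """Approximate st1 baseline throughput in MB/s for a volume of size_gb.

    Linear approximation only; per-volume caps are not modelled.
    """
    return size_gb * ST1_MB_S_PER_1000_GB / 1000


for size_gb in (500, 1000, 2000):
    print(f"{size_gb} GB st1 -> {st1_baseline_mb_s(size_gb):.0f} MB/s baseline")
```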
Storage Configuration - EBS Recommendations
Instance selection
• Use EBS-optimized instances OR instances with 10Gb+ network
• Minimum dedicated EBS bandwidth of 1000 Mb/s (125 MB/s)
Volume selection
• Baseline performance, 40 MB/s or better (1000 GB st1, 3200 GB sc1)
• Do not exceed instance’s dedicated EBS bandwidth!
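A rough way to check the last recommendation: sum the baseline throughput of the attached volumes and compare it with the instance's dedicated EBS bandwidth. Using baseline rather than burst throughput is an assumption of this sketch, and `ebs_plan_ok` is a hypothetical name.

```python
def ebs_plan_ok(volume_baselines_mb_s, instance_ebs_mbit_s):
    """Compare total volume baseline throughput with instance EBS bandwidth.

    volume_baselines_mb_s: baseline throughput of each attached volume, in MB/s.
    instance_ebs_mbit_s:   dedicated EBS bandwidth of the instance, in Mb/s.
    Returns (fits, total_mb_s, instance_mb_s).
    """
    instance_mb_s = instance_ebs_mbit_s / 8  # megabits -> megabytes
    total_mb_s = sum(volume_baselines_mb_s)
    return total_mb_s <= instance_mb_s, total_mb_s, instance_mb_s


# Three 1000 GB st1 volumes (40 MB/s each) on an instance with 1000 Mb/s of EBS bandwidth.
print(ebs_plan_ok([40, 40, 40], 1000))
# A fourth volume pushes the total past the instance's 125 MB/s.
print(ebs_plan_ok([40, 40, 40, 40], 1000))
```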
Storage Configuration - S3
• Great for cold backup: durable, available, inexpensive
• For hot backup, use a second HDFS cluster
• Hive and Spark can also use S3 directly
• Standard data operations can read from & write to S3 buckets
Storage Configuration
Three types of storage, plus the root device:
• Instance Storage
• Elastic Block Storage (EBS)
• Object Storage (S3)
• Root Device
Storage Configuration - Root Device
• For operating system and logs
• Use EBS gp2 volumes as root devices
• At least 500 GB for OS, CDH software, and logs
• Do not use instance storage for the root device!
Capacity Planning
Capacity Planning
• AWS makes expansion easy; advance planning makes things easier
• Consider workloads: how much storage vs compute? balanced?
• Consider data replication (3x), growth, retention
• Low storage density: r3.8xlarge or c4.8xlarge provide less storage but more compute and memory
• High storage density: d2.8xlarge offers 48 TB per instance with a good amount of compute and memory
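A back-of-the-envelope sizing sketch based on the points above: 3x replication, a growth allowance, and 48 TB of raw disk per d2.8xlarge. The `headroom` fraction reserved for temporary and non-HDFS data is an assumption of this sketch, not a figure from the slides, and `nodes_needed` is a hypothetical name.

```python
import math


def nodes_needed(raw_tb, replication=3, growth=0.0,
                 disk_tb_per_node=48.0, headroom=0.25):
    """Estimate how many storage-dense nodes a dataset needs.

    raw_tb:           size of the data before replication, in TB
    replication:      HDFS replication factor (3x on the slide)
    growth:           expected growth over the planning period (0.5 = +50%)
    disk_tb_per_node: raw disk per instance (48 TB for d2.8xlarge)
    headroom:         assumed fraction of disk kept free for temp data
    """
    required_tb = raw_tb * replication * (1 + growth)
    usable_tb_per_node = disk_tb_per_node * (1 - headroom)
    return math.ceil(required_tb / usable_tb_per_node)


# 100 TB of data, 3x replication, 50% growth: 450 TB over 36 usable TB per node.
print(nodes_needed(100, growth=0.5))
```

Retention policy changes `raw_tb` and `growth`; rerun the estimate whenever either assumption moves.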
Cloudera Enterprise Hardware Requirements Guide
tiny.cloudera.com/hw-reqs
Provisioning Instances
Provisioning Instances
• Cloudera Director automates most things
• Manual provisioning via the EC2 command-line tools or the AWS Management Console
• Don’t forget your databases (either RDS or self-managed)
• Cloudera Altus
Provisioning Instances
No matter which route you take...
• root device: 500 GB+ gp2 EBS volume
• master metadata: ephemeral or (recommended) gp2 EBS volumes
• DFS data: ephemeral or (recommended) st1/sc1 EBS volumes
• use tags to indicate the role instances/volumes will play
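The checklist above can be written down as a per-role volume plan before anything is provisioned. This is only a sketch of the EBS-backed option: the role labels (`master`, `worker`), the tag keys, and the function name `volume_plan` are hypothetical, and data volume counts and sizes are left to the caller because the slides do not fix them.

```python
def volume_plan(role, data_volume_count, data_volume_size_gb):
    """Build a list of EBS volume specs for one instance of the given role.

    role: "master" (metadata on gp2) or "worker" (DFS data on st1).
    """
    data_type = {"master": "gp2", "worker": "st1"}[role]

    # Root device: 500 GB gp2, as recommended above.
    volumes = [{"use": "root", "type": "gp2", "size_gb": 500}]
    volumes += [
        {"use": "data", "type": data_type, "size_gb": data_volume_size_gb}
        for _ in range(data_volume_count)
    ]

    # Tag every volume with the role it plays (tag keys are illustrative).
    for volume in volumes:
        volume["tags"] = {"cluster-role": role, "volume-use": volume["use"]}
    return volumes


for spec in volume_plan("worker", data_volume_count=2, data_volume_size_gb=1000):
    print(spec)
```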
Thank You

Big Data Journey to the Cloud, 5.30.18, Asher Bartch
