1© Cloudera, Inc. All rights reserved.
Multi-Tenant Operations with
Cloudera Enterprise
A look inside British Telecommunications
Phill Radley | Chief Data Architect | BT
Matt Schumpert | Director Product Management | Cloudera
What is Multi-Tenant Hadoop?
• Single General Purpose Hadoop Cluster
• Multiple distinct user groups with code & data that need to be separated
• Sharing storage (HDFS) & processing resources (cores & RAM)
• Storage allocated with HDFS Quota
• Compute managed with Fair Share Scheduler (at run time)
• Mixed workloads: storage only, batch & interactive processing
• Typically on-premises, run by an in-house data centre team
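As a rough illustration of what sharing compute at run time means, the sketch below divides a fixed number of cores among tenants in proportion to their weights and redistributes whatever a tenant does not need. It is a simplified model of weighted fair sharing, not the Fair Scheduler's actual implementation; the tenant names, weights and demands are invented for the example.

```python
def fair_shares(total, weights, demands):
    """Weighted max-min fair split of `total` units among tenants.

    weights: tenant -> relative weight
    demands: tenant -> units the tenant currently wants
    """
    alloc = {tenant: 0.0 for tenant in weights}
    active = {t for t in weights if demands.get(t, 0) > 0}
    remaining = float(total)

    while active and remaining > 1e-9:
        weight_sum = sum(weights[t] for t in active)
        satisfied = set()
        spent = 0.0
        for t in active:
            offer = remaining * weights[t] / weight_sum
            need = demands[t] - alloc[t]
            give = min(offer, need)  # never hand out more than is wanted
            alloc[t] += give
            spent += give
            if alloc[t] >= demands[t] - 1e-9:
                satisfied.add(t)
        remaining -= spent
        if not satisfied:
            break  # everyone still wants more and the capacity is used up
        active -= satisfied  # leftover capacity goes to the others next round
    return alloc


if __name__ == "__main__":
    shares = fair_shares(
        total=100,
        weights={"small": 1, "medium": 1, "flagship": 2},
        demands={"small": 10, "medium": 50, "flagship": 100},
    )
    for tenant, cores in shares.items():
        print(f"{tenant}: {cores:.0f} cores")
```

The small tenant gets everything it asks for, and the capacity it leaves unused is split between the other two according to their weights.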
Why Implement Multi-Tenant Hadoop?
• A single place for all raw enterprise data, kept for as long as needed
• Universally popular concept in the business, except in Finance
• Target data sets the business will be interested in
• Highly efficient use of infrastructure
• Allows small tenants access to big resources
• Fast self-service provisioning enables fast project spin-up
• New low unit cost makes old business cases viable (e.g. active archives)
• Start small, with one or two small tenants, but plan for many more
• E.g. find a struggling old batch application & re-platform it as an internal IT project
• Once the platform is up and running, go after a high-profile flagship tenant
Platform as a Service – Hadoop as a Service
Target Users
• Application developers, testers & production
• Business Analysts/Data Scientists wanting access to live data
Service specification
• HaaS Version 1.0, change control & roadmap
• Features (e.g. HDFS (HttpFS/NFS/API), MapReduce, Hue, Pig, Hive, HBase, Search)
Service Management
• Ordering form & process, Helpdesk
• Service Manager, Capacity Manager
Security & Governance
• Tenant data privacy
• Microsoft Active Directory integration with Kerberos
• All user groups & accounts managed in AD
• HDFS Encryption Zones
• Data governance to control data sharing
• Identified data stewards who approve creation of shared views and grants
• Security Logging & Reporting
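To make the HDFS Encryption Zones bullet concrete, here is a minimal sketch that assembles the standard Hadoop commands for creating a key and an encryption zone over a tenant directory. The tenant name, key naming scheme and path layout are assumptions made for illustration; the deck does not say how BT names keys or lays out zones. The function only builds the command lines and does not run anything.

```python
def encryption_zone_commands(tenant):
    """Return the CLI commands to put a tenant directory in its own zone.

    `tenant` is a hypothetical lower-case tenant identifier.
    """
    key_name = f"{tenant}-key"   # assumed naming scheme
    path = f"/user/{tenant}"     # assumed directory layout
    return [
        # Create the zone key in the configured key provider (KMS).
        ["hadoop", "key", "create", key_name],
        # The zone root must exist and be empty before the zone is created.
        ["hdfs", "dfs", "-mkdir", "-p", path],
        # Turn the directory into an encryption zone backed by that key.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", path],
    ]


if __name__ == "__main__":
    for cmd in encryption_zone_commands("haasa_ap_00307_12126"):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to `subprocess.run` later without quoting problems.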
The genesis of HaaS
Research & Innovation: Adastral Park
Business HQ: London
From Hadoop to HaaS
• Standing up a cluster is straightforward
• Buy Hadoop-optimized servers (lots of local disk); the unit cost is a fraction of a typical private cloud
• Install Linux (integrate with Active Directory/Kerberos)
• Use Cloudera Manager to create the cluster
• Decide what services to offer based on the pipeline of tenant workloads
• Feb 2014, HaaS R1: a “minimum viable product”
• Storage + Batch Compute (M/R) + UI (Hue) + Kerberos
• Oct 2015, HaaS R2: added interactive SQL
• Impala + Sqoop + Sentry
• Aug 2016, HaaS R3: in-memory
• Spark + Second site + Search…
What is a HaaS Tenant?
• A tenant is synonymous with a HaaS Service instance
1. An identifying Group in Active Directory
2. A set of Hadoop resources owned by the Group
• HDFS Quota
• YARN Resource Pool
• Hive database
• ( + other options e.g. Flume port/agent, + data wrangling tool)
• All services are accessed through common access points
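The tenant definition above can be sketched as a small provisioning plan: one identifier that names the HDFS home directory, the YARN pool and the Hive database, all owned by one Active Directory group. This is not BT's provisioning script, which the deck does not show. The identifier format, permissions and default quota value passed to the commands are assumptions; the HDFS and HiveQL commands themselves are standard.

```python
def provisioning_plan(service_id, ad_group, quota="500g"):
    """Build the steps needed to create one tenant (service instance).

    service_id: hypothetical identifier, reused as directory, pool and DB name
    ad_group:   Active Directory group that owns the resources
    quota:      HDFS space quota (counts replicated bytes)
    """
    home = f"/user/{service_id}"
    return {
        "hdfs": [
            ["hdfs", "dfs", "-mkdir", "-p", home],
            ["hdfs", "dfs", "-chgrp", ad_group, home],
            ["hdfs", "dfs", "-chmod", "770", home],  # assumed permissions
            ["hdfs", "dfsadmin", "-setSpaceQuota", quota, home],
        ],
        # The pool itself would be created in Cloudera Manager's
        # Dynamic Resource Pools configuration; only the name is planned here.
        "yarn_pool": service_id,
        "hive": f"CREATE DATABASE IF NOT EXISTS {service_id}",
    }


if __name__ == "__main__":
    plan = provisioning_plan("haasa_ap_00307_12126", "haasa_ap_00307_12126")
    for cmd in plan["hdfs"]:
        print(" ".join(cmd))
    print("YARN pool:", plan["yarn_pool"])
    print(plan["hive"])
```

Using the same identifier for the group, directory, pool and database mirrors the one-to-one relationship between a tenant and its resources described on the slide.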
Service ID: HAAS A AP 00307_12126 (cluster, service type, service no., business application ID)
Diagram: a service request goes from the HaaS Service Instance Admin (e.g. developer, data scientist) to the Hadoop Platform Admin; a provisioning script runs and a “Welcome to HaaS” message is sent. Resources shown for the instance: Microsoft Active Directory group HAAS A AP 00307_12126, HDFS storage /user/HAASA AP 00307_12126 (default quota 500GB), YARN Resource Pool HAASA AP 00307_12126, Hive Database HAASA AP 00307_12126 (tables and views; Pig, HQL, Java).
HaaS Tenant Reporting
BT has developed a range of supporting tools & training materials to help onboard tenants and monitor the service, for example the provisioning script and weekly HDFS capacity reports.
Example report: one project (NAD) with multiple services (Service 123 = prod, Service 153 = test)
Environment codes: P for Production, T for Test, D for Dev
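A weekly HDFS capacity report of the kind mentioned above could be as simple as comparing each service's usage with its quota. The sketch below is a hypothetical version: the input structure, the 80% warning threshold and the sample figures are assumptions, not details of BT's report.

```python
def capacity_report(tenants, warn_at=0.8):
    """Return one row per service: (name, used_gb, quota_gb, percent, status).

    tenants: service name -> (used_gb, quota_gb)
    warn_at: fraction of quota at which a service is flagged
    """
    rows = []
    for name, (used_gb, quota_gb) in sorted(tenants.items()):
        fraction = used_gb / quota_gb if quota_gb else 0.0
        if fraction >= 1.0:
            status = "OVER"
        elif fraction >= warn_at:
            status = "WARN"
        else:
            status = "OK"
        rows.append((name, used_gb, quota_gb, round(fraction * 100, 1), status))
    return rows


if __name__ == "__main__":
    sample = {"service_123_prod": (450, 500), "service_153_test": (100, 500)}
    for name, used, quota, pct, status in capacity_report(sample):
        print(f"{name:<20} {used:>5} / {quota:<5} GB  {pct:>5}%  {status}")
```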
e.g. HAASAAP0067_05038: CMF Customer Master File
“Customer Master File (CMF)”
• A 10-year-old batch app that needed re-platforming (2014)
• Data from 12 source systems merged with D&B Legal Entities used as reference data
• Existing SQL modules ported to HQL + Pig
Benefits
• Business able to do multiple runs in a day (15x faster)
• Adding new sources is quicker (schema on read)
• Data available for self-service teams (DQ/Data Science)
Source systems: CSS, COSMOSS, DISE, BTC, C2B, Antillia, Glossi, Cyclone, Phoenix, Radianz, Siebel OV, Siebel OS
Pipeline steps in HAASAAP0067_05038: 1 Pre-Load, 2 Load, 3 Match / De-Dupe, 4 Key Gen, 5 Business Rule, 6 Publish, 7 Post Load
Other diagram labels: Old CMF DB, Staging, Source Systems, Reference Data, CMF
Three existing business applications (CRM, Orders, Faults) extended into HaaS
• CRM (app ID 2029): RDBMS Customer table, loaded with Sqoop into T_Customer in Hive DB HAASA AP 00101_2029, shared as V_Customer
• Orders (app ID 3531): RDBMS Orders table, loaded with Sqoop into T_Orders in Hive DB HAASA AP 00202_3531, shared as V_Orders
• Faults (app ID 4369): RDBMS Faults table, loaded with Sqoop into T_Faults in Hive DB HAASA AP 00303_4369, shared as V_Faults
Roles shown: Business Data Stewards; Business Analysts / Data Scientists
Target for Self-Service Data Access using HaaS
1. Browse & select data (Data Catalogue)
2. Get Steward Approval
3. Create VIEWs & GRANTs
4. Select/join Views
• Self-service, workflow-driven access to any table on any system (contrast with the design/develop legacy warehouse approach)
• Option to add homomorphic encryption to any table to anonymize PII data to further reduce risk
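The Sqoop load and the "Create VIEWs & GRANTs" step might look like the following sketch, which only generates the command line and the SQL statements. The JDBC URL, database, column, role and group names are all hypothetical, and the GRANT statements assume Sentry-style role-based authorization (Sentry is listed in HaaS R2); the deck does not show the statements BT actually uses.

```python
def sqoop_import_command(jdbc_url, source_table, hive_db, hive_table):
    """Command line to copy one RDBMS table into a tenant's Hive database."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", source_table,
        "--hive-import",
        "--hive-database", hive_db,
        "--hive-table", hive_table,
    ]


def share_view_statements(hive_db, table, view, columns, role, group):
    """SQL a steward-approved request could translate into.

    Only the listed columns are exposed through the view, and read access
    is granted to a role that is then assigned to the requesting group.
    """
    column_list = ", ".join(columns)
    return [
        f"USE {hive_db}",
        f"CREATE VIEW {view} AS SELECT {column_list} FROM {table}",
        f"CREATE ROLE {role}",
        f"GRANT SELECT ON TABLE {view} TO ROLE {role}",
        f"GRANT ROLE {role} TO GROUP {group}",
    ]


if __name__ == "__main__":
    print(" ".join(sqoop_import_command(
        "jdbc:mysql://crm-db.example.com/crm", "customer",
        "haasa_ap_00101_2029", "t_customer")))
    for stmt in share_view_statements(
            "haasa_ap_00101_2029", "t_customer", "v_customer",
            ["customer_id", "segment"], "crm_readers", "data_science_team"):
        print(stmt + ";")
```

Exposing a view rather than the base table lets the data steward approve exactly the columns being shared.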
Cloudera Manager 5.7
Easier Multi-Tenant Operations
Major Enablers of Multi-Tenancy in Cloudera Manager
• Dynamic Resource Pools
• Cluster Utilization Reporting
• HDFS Usage Reports
Dynamic Resource Pools Define Tenants!
• Hierarchical buckets that
• Express prioritization
• Protect fixed capacity
• Create sensible guardrails
• Make an admin’s life easy with
• User/group-based creation
• ACLs
• Automatic preemption
• Rotating service windows
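Cloudera Manager manages resource pools through its own configuration pages, but underneath they correspond to queues in a YARN Fair Scheduler allocation file. As a hedged illustration of what a pool with a weight, a protected minimum and a group ACL looks like in that format, the sketch below generates such a file with the Python standard library. The pool names, sizes and groups are invented, and a real deployment would normally let Cloudera Manager produce this configuration.

```python
import xml.etree.ElementTree as ET


def allocations_xml(pools):
    """Render a minimal Fair Scheduler allocation file.

    pools: list of dicts with keys name, weight, min_mb, min_vcores, group
    """
    root = ET.Element("allocations")
    for pool in pools:
        queue = ET.SubElement(root, "queue", name=pool["name"])
        # Relative priority when the cluster is contended.
        ET.SubElement(queue, "weight").text = str(pool["weight"])
        # Capacity the pool is entitled to before fair sharing applies.
        ET.SubElement(queue, "minResources").text = (
            f'{pool["min_mb"]} mb, {pool["min_vcores"]} vcores'
        )
        # ACL format is "users groups"; a leading space means groups only.
        ET.SubElement(queue, "aclSubmitApps").text = " " + pool["group"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(allocations_xml([
        {"name": "bi", "weight": 2, "min_mb": 16384, "min_vcores": 8,
         "group": "bi_users"},
        {"name": "marketing", "weight": 1, "min_mb": 8192, "min_vcores": 4,
         "group": "marketing_users"},
    ]))
```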
Dynamic Resource Pools Configuration
Roadmap: Dynamic Resource Pools
• Automatic user/group-based job placement under a tenant’s pool
Cluster Utilization Reporting
Usage Data + Resource Allocations → Report, broken down by tenant (e.g. BI, Marketing, Engineering)
• “How much CPU & memory did each tenant use?”
• “I set up the Fair Scheduler. Did each of my tenants get their fair share?”
• “Which tenants had to wait the longest for their applications to get resources?”
• “Which tenants asked for the most memory but used the least?”
• “When do I need to add nodes to my cluster?”
• Configurable Time Window
• Tenant Aggregation View
• User Aggregation View
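The questions above boil down to joining usage data with allocation data and aggregating per tenant. The sketch below shows that aggregation on hand-made records; it does not call Cloudera Manager, and the record fields, units and sample numbers are assumptions chosen for the example.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate per-application records into per-tenant totals.

    Each record: tenant, allocated_mb_h, used_mb_h (memory-hours), wait_s.
    """
    totals = defaultdict(
        lambda: {"allocated_mb_h": 0.0, "used_mb_h": 0.0, "wait_s": 0.0}
    )
    for rec in records:
        entry = totals[rec["tenant"]]
        entry["allocated_mb_h"] += rec["allocated_mb_h"]
        entry["used_mb_h"] += rec["used_mb_h"]
        entry["wait_s"] += rec["wait_s"]

    summary = {}
    for tenant, entry in totals.items():
        allocated = entry["allocated_mb_h"]
        # Share of the requested memory that was actually used.
        efficiency = entry["used_mb_h"] / allocated if allocated else 0.0
        summary[tenant] = {**entry, "efficiency": efficiency}
    return summary


if __name__ == "__main__":
    sample = [
        {"tenant": "BI", "allocated_mb_h": 4000, "used_mb_h": 3000, "wait_s": 120},
        {"tenant": "Marketing", "allocated_mb_h": 8000, "used_mb_h": 2000, "wait_s": 30},
        {"tenant": "Engineering", "allocated_mb_h": 6000, "used_mb_h": 5400, "wait_s": 600},
    ]
    report = summarize(sample)
    asked_most_used_least = min(report, key=lambda t: report[t]["efficiency"])
    waited_longest = max(report, key=lambda t: report[t]["wait_s"])
    print("Asked for most, used least:", asked_most_used_least)
    print("Waited longest:", waited_longest)
```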
Roadmap: Cluster Utilization Reporting
• Container Allocation Latency
• A definitive wait metric for each bit of YARN workload
• Support for more components
• HDFS, HBase, Search, etc
• Support additional metrics
• Disk I/O, Network I/O
• Add additional tools to existing metrics:
• Showback/chargeback: associate $$ with resource usage
• Capacity planning: trend lines
• DBA tools: identify/flag rogue queries (Hive, Impala, HBase)
• Workload management: tag critical apps with SLAs
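Showback/chargeback, as listed on the roadmap, means attaching a price to measured usage. A minimal sketch, assuming usage is already available as vcore-hours and GB-hours per tenant and that the unit rates are set by the platform team (both the rates and the figures here are made up):

```python
def showback(usage, vcore_hour_rate, gb_hour_rate):
    """Cost per tenant = vcore-hours * rate + memory GB-hours * rate."""
    return {
        tenant: round(
            u["vcore_hours"] * vcore_hour_rate + u["gb_hours"] * gb_hour_rate, 2
        )
        for tenant, u in usage.items()
    }


if __name__ == "__main__":
    monthly_usage = {
        "BI": {"vcore_hours": 1200, "gb_hours": 4800},
        "Marketing": {"vcore_hours": 300, "gb_hours": 900},
    }
    for tenant, cost in showback(monthly_usage, 0.05, 0.01).items():
        print(f"{tenant}: {cost:.2f}")
```

Showback reports the figure to the tenant; chargeback would pass the same number to internal billing.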
HDFS Usage Reports
• Recently revamped based on known HaaS implementations
• Drill-down by user/tenant to do housecleaning
More Information & Next Steps
Get Started
• Download C5.7: www.cloudera.com/downloads
Release Notes
• www.cloudera.com/documentation/enterprise/latest/topics/rg_release_notes.html
Training Classes
• university.cloudera.com
Check out Cloudera Manager Demo Videos at go.cloudera.com/hadoop-demo-cm1
Questions?

Multi-Tenant Operations with Cloudera 5.7 & BT

Editor's Notes

  • #7 To set some context, I thought I’d take a slide to give you the backstory to HaaS. As a business, BT has always invested in R&D. Our UK research campus, Adastral Park, was opened 40 years ago, and last year BT spent over £500 million on R&D. In addition to our in-house research work we have technology scouts in Silicon Valley and researchers at MIT.

    In 2010/11 our customer experience research team were working on social media sentiment analysis when they came across Hadoop. They had been working on small data samples on laptops in RStudio. Hadoop’s scale-out architecture and schema on read made it easy for them to ingest millions of tweets, so they built a research cluster. Pretty soon they were using Hadoop to answer different business questions, such as “What proportion of UK phone lines could support 50MB internet?”, “What would the fault rates be if 80% of customers had 50MB broadband?” and “How many additional engineers might we need?”

    The business was catching on to big data, spurred on by articles like the HBR Oct 2012 issue and the torrent of analyst waves and hype cycles. They started to rely on the research Hadoop capability, because they could get answers to big ad-hoc questions much faster from Research and Hadoop than from traditional data warehouses, which weren’t set up to quickly ingest new data sets and run statistical models.

    Research now had a problem, because they are not set up to offer a production service with support and an SLA. They came to the Chief Architect’s Office (CAO) for help in getting Hadoop out of Research and into BAU data centres ASAP. Within the CAO we saw lots of opportunities with Hadoop, the most significant being the ability to build a single enterprise data hub that we could use to deliver data democratisation, i.e. giving the data back to the business owners. There were other shorter-term benefits, such as the ability to re-platform old batch apps that needed to be kept running and to provide low-cost storage & archive.
  • #8 Design: write the service description based on customer needs (MVP!). Sign-offs (Data Centre Operations, Info Security). Try it out: use Cloudera Manager to set up & monitor services. Reuse what the business already had (Order Gateway, Active Directory). Automate provisioning. Market & communicate.