This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information
contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
Data Lake – Multitenancy Best Practices
30 November, 2018 | Author: Sanjay Upadhyay; Sr. Solution Architect
CitiusTech Thought Leadership
Objective
 Multitenancy in a data lake allows organizations to share their cluster resources across user
communities without impacting business SLAs and capabilities or security and privacy needs
 This document covers guidelines around achieving multitenancy in a data lake environment
 It describes the design and implementation guidelines needed for on-premise as well as cloud-based multitenant data lakes, and highlights the reference architecture for both deployment options
Agenda
 Introduction
 Key Drivers for Data Lake Multitenancy
 On-Premise Data Lake Multitenancy
• Design Considerations
• Implementation Guidelines
• Reference Architecture
 On-Cloud Data Lake Multitenancy
• Design Considerations
• Implementation Guidelines
• Reference Architecture
Introduction
 In modern data management infrastructure, a data lake is an important repository that holds a vast amount of data in its native format until it is needed. An Enterprise Data Lake (EDL) not only stores and shares enterprise-wide information but can also run a variety of enterprise workloads such as batch processing, stream processing, interactive SQL, enterprise search and advanced analytics
 The adoption of data lakes is increasing and organizations are leveraging data lakes for numerous
use cases. As usage increases, operational challenges like resource clogging also increase,
resulting in failure of critical business processes and impacting service level agreements (SLAs)
 Earlier, most data lake implementations were on-premise. However, organizations today are
leveraging cloud technology to replace or expand their existing data lake implementations
 Data lake multi-tenancy is an important architectural paradigm that enables multiple business
users and processes to share a common set of resources, such as Apache Hadoop clusters.
 This includes setting up appropriate policies around resource provisioning and access while
meeting SLAs and security requirements for each tenant
 The guidelines for achieving multitenancy differ between on-premise and on-cloud deployments. Some of the key design principles and implementation guidelines are described in this document
Key Drivers for Data Lake Multitenancy
 Resource and Cost Optimization: Having multiple business units share the same cluster resources brings significant cost savings (across hardware and operational costs)
 Collaboration and Decision Making: Multi-tenancy helps achieve data sharing and collaboration between different teams and integrates multiple data silos to provide a unified view of data
 Operational Simplicity: Sharing a cluster among multiple users makes it dramatically easier to deploy and manage
 Wide Audience: A multi-tenant, cloud-based cluster enables a wide range of stakeholders (developers, analysts, data scientists) from different organizational units to access and use the data
 Seamless Scalability: A multi-tenant model makes it easy for IT teams to scale out clusters to meet changing business needs (e.g., business growth, new markets, reorganizations, mergers and acquisitions)
On-Premise Design Considerations
Key design considerations to achieve data lake multitenancy:
 Defining the Tenant: Identification of different business units that would access the cluster. All business units should have defined use cases to leverage the enterprise data lake
 Selecting the Right Isolation Model: Strategic cluster design can be done using one of three key architectural models: Share Nothing, Shared Management and Shared Resources
 Defining the Resource Usage Agreement: Ensure complete coverage of utilization and SLA requirements across storage, compute, data governance, tracking, auditing and on-boarding
On-Premise Design: Isolation Models
Model 1: Share Nothing
 Cluster management and data are segregated
 Does not leverage multitenancy, but can be used by IT teams based on operational realities and governance policies
Model 2: Shared Management
 Cluster management is shared, while data and resources
are separated for tenant groups
 Useful for high-priority clusters that can’t afford to risk any
performance issues or resource contentions
Model 3: Shared Resources
 Leverages the multitenancy benefits from consolidated
cluster management, shared data and resources
 This isolation model is recommended for the development of enterprise data lakes
[Figure: isolation model diagrams. Share Nothing: a separate Cluster Manager for Tenant A and for Tenant B. Shared Management and Shared Resources: a single Cluster Manager serving Tenant A and Tenant B]
On-Premise Design: Resource Usage Agreement
Key Components
Storage
 Every tenant group on the cluster should have access to its own section (namespace / directory)
 A dedicated directory should be assigned to users in a tenant group to store data
 The storage quota needs to be controlled so that a large influx of data from one group does not impact the work of other groups and users
Compute
 All tenant groups should be guaranteed an agreed-upon minimum compute power at all times
 Compute power should be provided at the tenant group level, irrespective of the tenant users
Data Governance
 Define metadata management and data lineage
Tracking & Auditing
 Track cluster access and generate reports
 Audit the data assets accessed, along with other metadata such as access time, IP address, etc.
On-boarding
 Design the hierarchy of the tenant groups and service accounts
 The process of adding tenant groups and users to the cluster should be straightforward
On-Premise Implementation Guidelines (1/8)
Hadoop Distributed File System (HDFS) Resource Management
HDFS is the key storage unit in a data lake and is shared by all the tenant groups, users and service accounts for processing jobs. HDFS storage is broadly classified into three categories:
 LOB Space: Space allocated to a particular line of business (LOB) such as Finance, Marketing, etc.
 User Space: Dedicated space for individual users for development / experimentation
 Enterprise Space: This layer stores all datasets (raw or processed) used by multiple business groups
HDFS Resource Management
Storage Structure
HDFS Quota Management
HDFS Resource Isolation
SQL-On-Hadoop Data Management
Compute
Security
Governance
On-Premise Implementation Guidelines (2/8)
Storage Structure
The structure of HDFS storage should be simple and clearly
isolate the different business units’ raw data.
e.g. Storage Structure
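As an illustration only, one possible directory layout that follows the three storage categories described earlier (LOB, user and enterprise space) can be sketched as below. The root paths (`/data/lob`, `/user`, `/data/enterprise`), the `raw` / `processed` zones and the tenant names are assumptions made for this sketch; they are not taken from the original slide.

```python
from posixpath import join

# Hypothetical root directories; adjust to the cluster's own conventions.
LOB_ROOT = "/data/lob"
USER_ROOT = "/user"
ENTERPRISE_ROOT = "/data/enterprise"
ZONES = ("raw", "processed")


def lob_paths(lob):
    """Return the per-zone directories for one line of business."""
    return [join(LOB_ROOT, lob, zone) for zone in ZONES]


def user_path(username):
    """Return the development / experimentation directory for one user."""
    return join(USER_ROOT, username)


def enterprise_paths():
    """Return the shared directories used by multiple business groups."""
    return [join(ENTERPRISE_ROOT, zone) for zone in ZONES]


def layout(lobs, users):
    """Build the full list of directories to create for the given tenants."""
    paths = []
    for lob in lobs:
        paths.extend(lob_paths(lob))
    paths.extend(user_path(u) for u in users)
    paths.extend(enterprise_paths())
    return paths


if __name__ == "__main__":
    for p in layout(["finance", "marketing"], ["alice"]):
        print(p)
```

Keeping each business unit's raw data under its own subtree means quotas and permissions can later be attached to a single directory per tenant.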
On-Premise Implementation Guidelines (3/8)
HDFS Quota Management
HDFS supports two quota mechanisms that administrators
can utilize to manage space usage by cluster tenants:
Disk Space Quotas
 Sets disk space limits on a per-directory basis
 Prevents users from accidentally or maliciously
consuming excess disk space within the cluster
Name Space Quotas
 Limits the number of files or subdirectories within a
particular directory
 Helps administrators optimize the metadata subsystem
(NameNode) within the Hadoop cluster
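The two quota types above map onto the `hdfs dfsadmin -setSpaceQuota` and `hdfs dfsadmin -setQuota` commands. A minimal sketch that only builds the command lines for a set of tenant directories is shown below; the directory names and quota values are hypothetical, and the commands are not executed here.

```python
def space_quota_cmd(directory, quota):
    """Build the command that sets a disk space quota, e.g. quota='10t'.

    Note: HDFS counts every block replica against the space quota.
    """
    return ["hdfs", "dfsadmin", "-setSpaceQuota", quota, directory]


def name_quota_cmd(directory, max_names):
    """Build the command that limits the number of files and directories."""
    return ["hdfs", "dfsadmin", "-setQuota", str(max_names), directory]


def quota_cmds(tenant_quotas):
    """Return all quota commands for a {directory: (space, names)} mapping."""
    cmds = []
    for directory, (space, names) in tenant_quotas.items():
        cmds.append(space_quota_cmd(directory, space))
        cmds.append(name_quota_cmd(directory, names))
    return cmds


# Hypothetical tenant directories and limits, for illustration only.
EXAMPLE_QUOTAS = {
    "/data/lob/finance": ("10t", 1_000_000),
    "/data/lob/marketing": ("5t", 500_000),
}

if __name__ == "__main__":
    for cmd in quota_cmds(EXAMPLE_QUOTAS):
        print(" ".join(cmd))
```

On a host with an HDFS client and administrator rights, each list could be passed to `subprocess.run`; usage can then be checked with `hdfs dfs -count -q <directory>`.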
On-Premise Implementation Guidelines (4/8)
HDFS Resource Isolation
 HDFS resource isolation is achieved using HDFS federation (HDFS 2.x)
 HDFS federation uses multiple independent namenodes / namespaces to scale the name service horizontally
 All data nodes are used as common block storage by all the namenodes. Each data node registers with all the namenodes in the cluster
Limitations of Single Namespace / Namenode
 Namespace and block storage are tightly coupled
 The namespace does not scale the way storage does: adding data nodes scales an HDFS cluster horizontally, but the namespace stays on a single namenode
 File system operations are limited by the throughput of a single namenode
 There is no separation of namespaces, so tenant organizations using the cluster are not isolated from one another
Benefits of HDFS Federation
 A single namenode offers no isolation in a multi-user environment. With multiple namenodes, different categories of applications and users can be placed in separate namespaces
 Federation scales the file system namespace horizontally across namenodes
 Read / write throughput can be improved by adding more namenodes
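To make the idea of placing tenants in separate namespaces concrete, the sketch below routes a path to a namenode by longest-prefix match, in the spirit of a client-side mount table. It is a simplified illustration, not Hadoop's implementation; the prefixes and namenode URIs are hypothetical.

```python
# Hypothetical mount table: path prefix -> namenode URI.
MOUNT_TABLE = {
    "/data/lob/finance": "hdfs://nn-finance:8020",
    "/data/lob/marketing": "hdfs://nn-marketing:8020",
    "/data/enterprise": "hdfs://nn-shared:8020",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Return the namenode URI whose prefix is the longest match for path."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so /a/b does not match /a/bc.
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namespace mounted for {path}")
    return mount_table[best]


if __name__ == "__main__":
    print(resolve("/data/lob/finance/raw/ledger.csv"))
```

Because each prefix is served by its own namenode, heavy metadata activity from one tenant does not consume the namenode capacity of another.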
On-Premise Implementation Guidelines (5/8)
SQL-On-Hadoop Data Management
There are three ways to manage multi-tenant data in a SQL-on-Hadoop database. The data is ultimately stored on HDFS but can be queried using SQL tools such as Hive, HAWQ and Impala.
 Separated Databases: Storing tenant data in separate databases is the simplest approach to data isolation
 Shared Database and Separate Schema: This approach involves housing multiple tenants in the same database. Each tenant has its own set of tables that are grouped into a schema created specifically for that tenant
 Shared Database and Schema: The same database and same set of tables host multiple tenants' data. Each table holds records from multiple tenants, segregated by the value of a tenant ID column
[Figure: isolation spectrum from Isolated to Shared: Separated Databases, Separate Schema, Shared Schema]
Implementing Shared Database and Separate Schema is recommended.
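For contrast with the recommended approach, the sketch below illustrates the fully shared model, where isolation depends on every query filtering by the tenant ID column. It uses Python's built-in `sqlite3` purely as a stand-in for a SQL-on-Hadoop engine; the `claims` table, its columns and the tenant names are hypothetical.

```python
import sqlite3


def make_shared_table():
    """Create an in-memory table holding rows for several tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE claims (tenant_id TEXT NOT NULL, claim_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO claims VALUES (?, ?, ?)",
        [
            ("finance", 1, 120.0),
            ("finance", 2, 75.5),
            ("marketing", 3, 40.0),
        ],
    )
    return conn


def claims_for_tenant(conn, tenant_id):
    """Return only the rows that belong to the given tenant."""
    cur = conn.execute(
        "SELECT claim_id, amount FROM claims WHERE tenant_id = ? ORDER BY claim_id",
        (tenant_id,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = make_shared_table()
    print(claims_for_tenant(conn, "finance"))
```

A single query that omits the `tenant_id` filter would expose other tenants' rows, which is one reason a separate schema per tenant can be easier to secure.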
On-Premise Implementation Guidelines (6/8)
Compute
 YARN Capacity Scheduler is used as a resource
management application to allocate shared cluster
resources among users and groups
 The queue is an important component of scheduling in YARN and of isolating resources. Important queue properties are:
• Queue name
• Queue path name
• Associated child queues and applications
• Minimum and maximum capacity of the queue
 Resources can be allocated with the Capacity Scheduler by:
• Enabling the Capacity Scheduler
• Setting up queues
• Controlling access to queues with ACLs (Access Control Lists)
e.g. Setting up a queue hierarchy: Root, with child queues Finance, Marketing and Operations
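A queue hierarchy like the Root / Finance / Marketing / Operations example could be expressed as Capacity Scheduler properties along the following lines. The property names follow the `yarn.scheduler.capacity.<queue-path>.*` convention used in `capacity-scheduler.xml`; the capacity percentages and group names are assumptions chosen for illustration.

```python
PREFIX = "yarn.scheduler.capacity.root"

# Hypothetical guaranteed / maximum capacities (percent) and submit ACLs.
# ACL format is "users groups"; a leading space means "no users, only groups".
EXAMPLE_QUEUES = {
    "finance": {"capacity": 40, "max": 60, "acl": " finance_grp"},
    "marketing": {"capacity": 30, "max": 50, "acl": " marketing_grp"},
    "operations": {"capacity": 30, "max": 50, "acl": " operations_grp"},
}


def capacity_scheduler_props(queues):
    """Return capacity-scheduler.xml properties for child queues of root."""
    total = sum(q["capacity"] for q in queues.values())
    if total != 100:
        # Sibling queue capacities under one parent must add up to 100.
        raise ValueError(f"queue capacities sum to {total}, expected 100")
    props = {f"{PREFIX}.queues": ",".join(queues)}
    for name, q in queues.items():
        props[f"{PREFIX}.{name}.capacity"] = str(q["capacity"])
        props[f"{PREFIX}.{name}.maximum-capacity"] = str(q["max"])
        props[f"{PREFIX}.{name}.acl_submit_applications"] = q["acl"]
    return props


if __name__ == "__main__":
    for key, value in capacity_scheduler_props(EXAMPLE_QUEUES).items():
        print(f"{key}={value}")
```

The scheduler itself is enabled separately, by setting `yarn.resourcemanager.scheduler.class` to the Capacity Scheduler class in `yarn-site.xml`.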
On-Premise Implementation Guidelines (7/8)
Security
Authentication
 Need to know who the users are and which groups they belong to
 Kerberos is the only authentication method supported by most components
Authorization
 Need to know what users can access and what level of permissions they have
 In a multi-tenant environment, service accounts / IDs need to be set up at the group and enterprise levels
 Service accounts exist in the enterprise identity store, and are provided to end users using Apache Ranger / Sentry
Auditing
 Determines who did what and when
 Apache Ranger / Sentry or a custom solution can be used for auditing
Data Protection
 Data in Transit
• SSL / TLS needs to be enabled to encrypt data between clients and service endpoints
• Keys / certificates are configured per service / role
 Data at Rest
• Multiple encryption zones on HDFS allow only authorized users to access data
• Data is transmitted in an encrypted form as encryption is on HDFS block level
• Keys can be stored in a Java keystore or an HSM
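As a sketch of how a per-tenant encryption zone might be set up, the function below builds the usual command sequence: create a key, create an empty directory, then turn that directory into an encryption zone. It assumes a Hadoop KMS is already configured; the key naming scheme and path are hypothetical, and the commands are only constructed, not run.

```python
def encryption_zone_cmds(tenant, path):
    """Build the commands that create a key and an encryption zone."""
    key_name = f"{tenant}_key"  # hypothetical naming convention
    return [
        # Create the encryption key in the configured key provider (KMS).
        ["hadoop", "key", "create", key_name],
        # An encryption zone can only be created on an empty directory.
        ["hdfs", "dfs", "-mkdir", "-p", path],
        # Bind the directory to the key, making it an encryption zone.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", path],
    ]


if __name__ == "__main__":
    for cmd in encryption_zone_cmds("finance", "/data/lob/finance/secure"):
        print(" ".join(cmd))
```

Using one key per tenant means access to a zone can be granted or revoked through key permissions without touching other tenants' data.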
On-Premise Implementation Guidelines (8/8)
Governance
Apache Atlas can be used for managing the metadata of the data assets and keeping track of lineage information.
On-Premise Reference Architecture
[Figure: on-premise reference architecture for the data lake]
Tenants: Operations, Data Science, BI & DWH, Marketing
Data Access: Impala, Drill, Presto, NoSQL, Solr, SQL, Streaming, MLlib, GraphX, Spark, Hive, Pig, Cascading, Tez, Giraph, Mahout, MapReduce
YARN (Data Operating System): YARN Capacity Scheduler
HDFS (Hadoop Distributed File System):
1. HDFS Quota Management
• Disk Space Quota
• Name Space Quota
2. HDFS Federation
Cross-cutting layers: Data Management, Governance, Security & Operation
On-Cloud Design Considerations
Infrastructure Isolation (Physical & Logical)
 Infrastructure shared physically or logically based on security expectations (multi-tenant environment)
 Complete isolation of compute using a dedicated cluster or resource pool
 Logical isolation at the virtual machine level
 Complete network isolation through a dedicated network for every tenant, and VLANs for logical isolation
Virtualization Using Multiple VM Support
 Infrastructure components are virtualized and managed as a single entity, and are simultaneously isolated based on tenancy
 Security and compliance needs are met through anti-collocation, hypervisor-level firewalls, resource grouping of compute and storage, and VLAN-based isolation
Cloud Automation & Integration
 Essential to support client-specific business processes, identity management and integration with tools and services
 In a multitenant cloud environment, it is essential to define standard processes and practices for providing customization flexibility
Catalogue Manager
 Each aspect of multi-tenancy must culminate in an intuitive user interface
 Catalogue content depends on tenancy and individual privileges
 Multi-tenancy at a catalogue level may authorize integration with different directory services for each tenant
On-Cloud Implementation Guideline (1/4)
Key guidelines to achieve multitenant data lake environment:
Storage
Object stores are the key storage units in a cloud data lake. Below are the offerings of some of the key cloud providers:
Cloud Provider | AWS | Azure | Google
Service Name | S3 | Azure Storage (Blob) | Google Cloud Storage
Hot | S3 Standard | Hot Blob Storage | GCS
Cool | S3 Standard - Infrequent Access | Cool Blob Storage | GCS Nearline
Archival | Glacier | Archive Blob Storage | GCS Coldline
Object Limit | Unlimited | Unlimited | Unlimited
Size Limit | 5 TB / Object | 500 TB / Account | 5 TB / Object
Storage Management
Multitenancy can be achieved using two methods:
 Storage Account: Isolation of storage can be implemented using a separate storage account for each tenant group along with appropriate identity and access management
 Containers / Buckets: Isolation can be implemented by creating separate containers / buckets for each tenant group with appropriate identity and access management
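For the bucket-per-tenant method, access is typically restricted with a policy attached to each bucket. The sketch below builds an S3-style bucket policy document as a Python dictionary; the bucket name, account ID and role name are hypothetical, and the set of allowed actions is an assumption that would need to match the tenant's real needs.

```python
import json


def tenant_bucket_policy(bucket, role_arn):
    """Return a bucket policy granting one tenant role access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "TenantListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading and writing apply to the objects in the bucket.
                "Sid": "TenantReadWriteObjects",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    policy = tenant_bucket_policy(
        "edl-finance",  # hypothetical bucket name
        "arn:aws:iam::123456789012:role/finance-analyst",  # hypothetical role
    )
    print(json.dumps(policy, indent=2))
```

Generating the policy from the tenant name keeps every tenant's bucket configured the same way, which simplifies on-boarding and auditing.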
On-Cloud Implementation Guideline (2/4)
Resource Group
The best approach is to create a separate resource group for each LOB tenant
Big Data Processing
Big data processing services provided by key cloud providers:
Cloud Provider | AWS | Azure | Google
Service Name | Elastic MapReduce (EMR), AWS Databricks | HDInsight, Azure Databricks | Google Cloud Dataproc
Data Processing Isolation
 Create multiple services within the same subscription, one for each LOB, e.g. the operations department has its own Azure / AWS Databricks service within a given subscription
 Deploy multiple clusters, one for each tenant group or user, with appropriate identity and access management
 Data processing results can be stored in the tenant-specific storage accounts / buckets
On-Cloud Implementation Guideline (3/4)
Security, Identity and Access
Cloud service providers are responsible for the security of their data centers
Application Security Implementation
 Configure AWS / Azure Stack to support users from multiple IAM / Azure Active Directory (Azure AD) tenants to use services in AWS / Azure Stack
 Provide authorization using AWS Organization / Azure RBAC
 Encryption keys can be secured using AWS Key Management Service / Azure Key Vault
Security Features from Cloud Providers | AWS | Azure | Google
Authentication & Authorization | Identity and Access Management (IAM) | Azure Active Directory | Google Cloud Identity and Access Management
Authentication & Authorization | AWS Organization | Azure RBAC | -
Encryption | Server-side Encryption with Amazon S3 Key Management Service | Azure Storage Service Encryption | Google / Customer Managed Encryption Key
Encryption | Key Management Service | Key Vault | -
On-Cloud Implementation Guideline (4/4)
Network Services
Multi-tenant Network Strategy
 Isolate the application servers on their own physical network. This approach works for single tenants on dedicated servers
 Define Virtual Switches (vSwitches) for each tenant. vSwitches can bring all relevant virtual machines together on one logical switch. Similar to virtual machines, vSwitches can move within the cloud environment
 Configure Virtual Local Area Networks (VLANs) to create a separate network for each tenant
Network Services from Cloud Providers | AWS | Azure | Google
Cloud Virtual Network | Virtual Private Cloud (VPC) | Virtual Network | Virtual Private Cloud (VPC)
Cross Premise Connectivity | AWS VPN Gateway | Azure VPN Gateway | Google VPN Gateway
Dedicated Network | Direct Connect | ExpressRoute | Dedicated Interconnect
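Creating a separate network per tenant usually starts with carving non-overlapping address ranges out of a larger block. A minimal sketch using Python's standard `ipaddress` module is shown below; the `10.0.0.0/16` block, the `/24` subnet size and the tenant names are assumptions for illustration.

```python
import ipaddress


def tenant_subnets(cidr, tenants, new_prefix=24):
    """Assign each tenant its own non-overlapping subnet of the given block."""
    network = ipaddress.ip_network(cidr)
    subnets = network.subnets(new_prefix=new_prefix)
    allocation = {}
    for tenant in tenants:
        try:
            allocation[tenant] = str(next(subnets))
        except StopIteration:
            raise ValueError("address block too small for all tenants") from None
    return allocation


if __name__ == "__main__":
    plan = tenant_subnets("10.0.0.0/16", ["finance", "marketing", "operations"])
    for tenant, subnet in plan.items():
        print(tenant, subnet)
```

The resulting ranges could then be used when defining per-tenant subnets, VLANs or virtual networks with the provider's own tooling.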
On-Cloud Reference Architecture
[Figure: on-cloud reference architecture]
Tenants: Customer A, Customer B, Customer C, Customer D
Data Access: SQL, Streaming, MLlib, GraphX, Databricks Spark, Redshift, Azure SQL DWH, Elastic DWH
HDInsight / EMR: Hive, HBase, Spark, Hadoop
IaaS: Impala, Drill, Presto, NoSQL, Solr
Compute: Load Balancer in front of VMs (Instance 1, Instance 2, ..., Instance n-1, Instance n)
YARN: Data Operating System
Cloud Data Storage (S3 / Azure Blob, ADLS / GCS)
Cross-cutting layers: Data Management, Governance, Security & Operation
Key Words
 Enterprise Data Lake
 Multitenancy
 HDFS Resource Management
 HDFS Resource Isolation
 HDFS Quota Management
Thank You
Authors:
Sanjay Upadhyay
thoughtleaders@citiustech.com
About CitiusTech
3,200+ healthcare IT professionals worldwide
100% healthcare industry focus
30%+ CAGR over last 5 years
110+ healthcare customers
• Healthcare technology companies
• Hospitals, IDNs & medical groups
• Payers and health plans
• ACOs, MCOs, HIEs, HIXs, NHINs
• Pharma & Life Sciences companies

More Related Content

What's hot

Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
Snowflake Computing
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
Denodo
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
Ilham31574
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
Derek Stainer
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!
Chris Taylor
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
DataWorks Summit/Hadoop Summit
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Dr. Arif Wider
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
PolarSeven Pty Ltd
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
Databricks
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
inovex GmbH
 
Emerging Trends in Data Engineering
Emerging Trends in Data EngineeringEmerging Trends in Data Engineering
Emerging Trends in Data Engineering
Ananth PackkilDurai
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
Databricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
Databricks
 

What's hot (20)

Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
 
Emerging Trends in Data Engineering
Emerging Trends in Data EngineeringEmerging Trends in Data Engineering
Emerging Trends in Data Engineering
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 

Similar to Data Lake - Multitenancy Best Practices

The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
Impetus Technologies
 
Untangling cluster management with Helix
Untangling cluster management with HelixUntangling cluster management with Helix
Untangling cluster management with Helix
Kishore Gopalakrishna
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
IJCERT JOURNAL
 
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET Journal
 
Benefits of a data lake
Benefits of a data lake Benefits of a data lake
Benefits of a data lake
Sun Technologies
 
Challenges Management and Opportunities of Cloud DBA
Challenges Management and Opportunities of Cloud DBAChallenges Management and Opportunities of Cloud DBA
Challenges Management and Opportunities of Cloud DBA
inventy
 
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Denodo
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architecture
Rahul Chaturvedi
 
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
dbpublications
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
ShreyasKv13
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
VishalBH1
 
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureShaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Denodo
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
Nalini Mehta
 
EMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data AnalyticsEMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
Alluxio, Inc.
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
hktripathy
 
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
IJET - International Journal of Engineering and Techniques
 
Database Performance Management in Cloud
Database Performance Management in CloudDatabase Performance Management in Cloud
Database Performance Management in Cloud
Dr. Amarjeet Singh
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
Umair Shafique
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 






Data Lake - Multitenancy Best Practices

  • 1. Data Lake – Multitenancy Best Practices
      30 November, 2018 | Author: Sanjay Upadhyay; Sr. Solution Architect
      CitiusTech Thought Leadership
      This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
  • 2. Objective
      - Multitenancy in a data lake allows organizations to share their cluster resources across user communities without impacting business SLAs and capabilities or security and privacy needs
      - This document covers guidelines for achieving multitenancy in a data lake environment
      - It covers the design and implementation guidelines for on-premise as well as cloud-based multitenant data lakes, and highlights the reference architecture for both deployment options
  • 3. Agenda
      - Introduction
      - Key Drivers for Data Lake Multitenancy
      - On-Premise Data Lake Multitenancy
          - Design Considerations
          - Implementation Guidelines
          - Reference Architecture
      - On-Cloud Data Lake Multitenancy
          - Design Considerations
          - Implementation Guidelines
          - Reference Architecture
  • 4. Introduction
      - In modern data management infrastructure, a data lake is an important repository that holds a vast amount of data in its native format until it is needed. An Enterprise Data Lake (EDL) can not only store and share enterprise-wide information but can also run a variety of enterprise workloads such as batch processing, stream processing, interactive SQL, enterprise search and advanced analytics
      - The adoption of data lakes is increasing and organizations are leveraging data lakes for numerous use cases. As usage increases, operational challenges like resource clogging also increase, resulting in failure of critical business processes and impacting service level agreements (SLAs)
      - Earlier, most data lake implementations were on-premise. However, organizations today are leveraging cloud technology to replace or expand their existing data lake implementations
      - Data lake multi-tenancy is an important architectural paradigm that enables multiple business users and processes to share a common set of resources, such as Apache Hadoop clusters
      - This includes setting up appropriate policies around resource provisioning and access while meeting SLAs and security requirements for each tenant
      - The guidelines for achieving multitenancy differ between on-premise and on-cloud deployments. Some of the key design principles and implementation guidelines are described in this document
  • 5. Key Drivers for Data Lake Multitenancy
      - Resource and Cost Optimization: Having multiple business units share the same cluster resources brings significant cost savings (across hardware and operational costs)
      - Collaboration and Decision Making: Multi-tenancy helps achieve data sharing and collaboration between different teams, and integrates multiple data silos to give a unified view of data
      - Operational Simplicity: Sharing a cluster among multiple users makes it dramatically easier to deploy and manage
      - Wide Audience: A multi-tenant, cloud-based cluster enables a wide range of stakeholders (developers, analysts, data scientists) from different organizational units to access and use the data
      - Seamless Scalability: A multi-tenant model makes it easy for IT teams to scale out clusters to meet changing business needs (e.g., business growth, new markets, reorganizations, mergers and acquisitions)
  • 6. On-Premise Design Considerations
      Key design considerations to achieve data lake multitenancy:
      - Defining the Tenant: Identify the business units that will access the cluster. All business units should have defined use cases to leverage the enterprise data lake
      - Selecting the Right Isolation Model: Strategic cluster design can be done using one of three key architectural models: Share Nothing, Shared Management and Shared Resources
      - Defining the Resource Usage Agreement: Ensure complete coverage of utilization and SLA requirements across storage, compute, data governance, tracking, auditing and on-boarding
  • 7. On-Premise Design: Isolation Models
      Model 1: Share Nothing
      - Cluster management and data are segregated
      - Does not leverage multitenancy, but can be used by IT teams based on operational realities and governance policies
      Model 2: Shared Management
      - Cluster management is shared, while data and resources are separated for tenant groups
      - Useful for high-priority clusters that can't afford to risk any performance issues or resource contention
      Model 3: Shared Resources
      - Leverages the multitenancy benefits of consolidated cluster management, shared data and shared resources
      - This is the recommended isolation model for development of enterprise data lakes
      [Figure: Cluster Manager and Tenant A / Tenant B arrangement for each of the three models]
  • 8. On-Premise Design: Resource Usage Agreement
      Key Components
      Storage
      - Every tenant group on the cluster should have access to its own section (namespace / directory)
      - A dedicated directory should be assigned to users in a tenant group to store data
      - The storage quota needs to be controlled so that the work of other groups and users isn't impacted by an influx of large data
      Compute
      - All tenant groups should be guaranteed an agreed-upon minimum compute power at all times
      - Recommended to provide compute power at the tenant group level, irrespective of the tenant users
      Data Governance
      - Define metadata management and data lineage
      Tracking & Auditing
      - Track cluster access and generate reports
      - Audit each data asset accessed, along with metadata such as access time, IP address etc.
      On-boarding
      - Design the hierarchy of the tenant groups and service accounts
      - The process of adding tenant groups and users to the cluster should be straightforward
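The storage and compute parts of a resource usage agreement can be captured as data and checked before tenants are on-boarded. The following is a minimal sketch, not part of the original slides: the `TenantAgreement` record, its field names and the tenant figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantAgreement:
    """Hypothetical record of one tenant group's resource usage agreement."""
    tenant: str
    storage_quota_tb: float      # storage ceiling for the tenant's directory
    min_compute_percent: float   # compute share guaranteed at all times


def validate_agreements(agreements):
    """Return a list of problems; an empty list means the set is consistent."""
    problems = []
    names = [a.tenant for a in agreements]
    if len(names) != len(set(names)):
        problems.append("duplicate tenant names")
    for a in agreements:
        if a.storage_quota_tb <= 0:
            problems.append(f"{a.tenant}: storage quota must be positive")
        if not 0 < a.min_compute_percent <= 100:
            problems.append(f"{a.tenant}: compute share must be in (0, 100]")
    # Guaranteed minimums cannot promise more than the whole cluster.
    total = sum(a.min_compute_percent for a in agreements)
    if total > 100:
        problems.append(f"guaranteed compute adds up to {total}%, above 100%")
    return problems


# Illustrative figures only.
agreements = [
    TenantAgreement("finance", storage_quota_tb=50, min_compute_percent=40),
    TenantAgreement("marketing", storage_quota_tb=30, min_compute_percent=35),
    TenantAgreement("operations", storage_quota_tb=20, min_compute_percent=25),
]
print(validate_agreements(agreements))
```

Checking the guarantees at the tenant group level, rather than per user, mirrors the recommendation above to provide compute power per tenant group.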
  • 9. On-Premise Implementation Guidelines (1/8)
      Topics covered: HDFS Resource Management, Storage Structure, HDFS Quota Management, HDFS Resource Isolation, SQL-On-Hadoop Data Management, Compute, Security, Governance
      Hadoop Distributed File System (HDFS) Resource Management
      HDFS is the key storage unit in the data lake and is shared by all the tenant groups, users and service accounts for processing jobs. HDFS storage is broadly classified into three categories:
      - LOB Space: Space allocated to a particular line of business, such as Finance or Marketing
      - User Space: Dedicated space for individual users for development / experimentation
      - Enterprise Space: This layer stores all datasets (raw or processed) used by multiple business groups
  • 10. On-Premise Implementation Guidelines (2/8)
      Storage Structure
      The structure of HDFS storage should be simple and should clearly isolate the different business units' raw data.
      [Figure: example storage structure]
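One way to picture such a structure is to generate it from the three storage categories described earlier (LOB, user and enterprise space). This is only an illustrative sketch: the `/data` root, the `raw` / `processed` subdirectories and the tenant names are assumptions, not the layout shown in the original figure.

```python
def storage_layout(lobs, users, root="/data"):
    """Build a hypothetical HDFS directory layout with one branch per space."""
    # Enterprise space: datasets shared by multiple business groups.
    paths = [f"{root}/enterprise/raw", f"{root}/enterprise/processed"]
    # LOB space: each line of business keeps its raw data in its own branch.
    for lob in lobs:
        paths.append(f"{root}/lob/{lob}/raw")
        paths.append(f"{root}/lob/{lob}/processed")
    # User space: per-user area for development and experimentation.
    for user in users:
        paths.append(f"{root}/user/{user}")
    return paths


for path in storage_layout(["finance", "marketing"], ["alice"]):
    print(path)
```

Keeping each business unit under its own top-level branch means quotas and permissions can later be attached to a single directory per tenant.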
  • 11. On-Premise Implementation Guidelines (3/8)
      HDFS Quota Management
      HDFS supports two quota mechanisms that administrators can use to manage space usage by cluster tenants:
      Disk Space Quotas
      - Set disk space limits on a per-directory basis
      - Prevent users from accidentally or maliciously consuming excess disk space within the cluster
      Name Space Quotas
      - Limit the number of files or subdirectories within a particular directory
      - Help administrators optimize the metadata subsystem (NameNode) within the Hadoop cluster
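Both quota types are set with the `hdfs dfsadmin` command (`-setSpaceQuota` for disk space, `-setQuota` for the name count). The sketch below only builds the command strings for one tenant directory; the directory path, sizes and the helper name `quota_commands` are hypothetical. It also assumes the default replication factor of 3, which matters because HDFS charges every replica against the space quota.

```python
def quota_commands(directory, logical_tb, max_names, replication=3):
    """Build hdfs dfsadmin commands for one tenant directory.

    The space quota is charged per replica, so the raw limit is the
    logical allowance multiplied by the replication factor.
    """
    raw_tb = logical_tb * replication
    return [
        # Disk space quota, in terabytes of raw (replicated) storage.
        f"hdfs dfsadmin -setSpaceQuota {raw_tb}t {directory}",
        # Name quota: maximum number of files and directories under the path.
        f"hdfs dfsadmin -setQuota {max_names} {directory}",
    ]


for command in quota_commands("/data/lob/finance", logical_tb=10, max_names=1000000):
    print(command)
```

Current usage against both quotas can be inspected with `hdfs dfs -count -q <directory>`.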
  • 12. On-Premise Implementation Guidelines (4/8)
      HDFS Resource Isolation
      - HDFS resource isolation is achieved using HDFS federation (HDFS 2.X)
      - HDFS federation in Hadoop uses multiple independent namenodes / namespaces to scale the name service horizontally
      - All data nodes are used as common storage for blocks by all the namenodes. Each data node registers with all the namenodes in the cluster
      Limitations of Single Namespace / Namenode
      - Namespace and block storage are tightly coupled
      - The namespace isn't scalable in the way data nodes are. Horizontal scalability in an HDFS cluster is achieved only by adding more data nodes
      - Hadoop's performance depends on the throughput of the namenode. All file system operations are bound by the throughput of a single namenode
      - There is no separation of namespaces, so there is no isolation among the tenant organizations using the cluster
      Benefits of HDFS Federation
      - Isolation: a single namenode offers no isolation in a multi-user environment. With federation, different categories of applications and users can be placed in separate namespaces served by separate namenodes
      - Namenodes in a federation scale the file system's namespace horizontally
      - Read / write throughput can be improved by adding more namenodes
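The effect of federation on tenants can be illustrated with a client-side mount table that routes each path to the namenode owning its namespace. This is a simplified sketch of the idea and not how HDFS itself is implemented; the namenode host names, port and paths are hypothetical.

```python
# Hypothetical mapping of namespace roots to independent namenodes.
MOUNT_TABLE = {
    "/data/lob/finance": "hdfs://nn-finance:8020",
    "/data/lob/marketing": "hdfs://nn-marketing:8020",
    "/data/enterprise": "hdfs://nn-enterprise:8020",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the namenode URI owning `path`, using the longest matching prefix."""
    matches = [
        prefix for prefix in mount_table
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        raise KeyError(f"no namespace is mounted for {path}")
    return mount_table[max(matches, key=len)]


print(resolve_namenode("/data/lob/finance/raw/ledger.csv"))
```

Because each tenant's metadata operations land on a different namenode, heavy activity in one namespace does not consume the name service capacity of another.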
  • 13. On-Premise Implementation Guidelines (5/8)
      SQL-On-Hadoop Data Management
      There are three ways to manage multi-tenant data in a SQL-on-Hadoop database. The data is ultimately stored on HDFS but can be viewed using SQL tools like Hive, HAWQ, Impala etc.
      - Separated Databases: Storing tenant data in separate databases is the simplest approach to data isolation
      - Shared Database and Separate Schema: This approach involves housing multiple tenants in the same database. Each tenant has its own set of tables, grouped into a schema created specifically for that tenant
      - Shared Database and Schema: The same database and the same set of tables host multiple tenants' data. Each table holds records from multiple tenants, segregated by a tenant ID column
      [Figure: isolation spectrum from Isolated to Shared: Separated Databases, Separate Schema, Shared Schema]
      Implementing Shared Database and Separate Schema is recommended.
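The difference between the three approaches shows up in how a tenant's table is addressed. The sketch below is an illustration under assumed naming conventions: the database, schema and column names (`shared_db`, `common`, `tenant_id`) are hypothetical, and the three-part names are generic SQL notation that would need mapping to a specific engine (in Hive, for instance, database and schema are synonyms).

```python
def table_reference(model, tenant, table):
    """Return (qualified_table, row_filter) for a tenant under each model."""
    if model == "separate_database":
        # One database per tenant.
        return f"{tenant}_db.{table}", None
    if model == "separate_schema":
        # One shared database, one schema per tenant.
        return f"shared_db.{tenant}.{table}", None
    if model == "shared_schema":
        # Shared tables; rows are segregated by the tenant ID column.
        return f"shared_db.common.{table}", f"tenant_id = '{tenant}'"
    raise ValueError(f"unknown model: {model}")


def select_all(model, tenant, table):
    """Build an illustrative query; real code should use bound parameters."""
    name, row_filter = table_reference(model, tenant, table)
    query = f"SELECT * FROM {name}"
    return f"{query} WHERE {row_filter}" if row_filter else query


for model in ("separate_database", "separate_schema", "shared_schema"):
    print(select_all(model, "finance", "claims"))
```

Only the shared-schema model relies on every query carrying the tenant filter, which is one reason the separate-schema model is easier to secure with object-level permissions.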
  • 14. On-Premise Implementation Guidelines (6/8)
      Compute
      - YARN Capacity Scheduler is used as a resource management application to allocate shared cluster resources among users and groups
      - The queue is an important component of scheduling in YARN and of resource isolation. Important queue properties are:
          - Queue name
          - Queue path name
          - Associated child queues and applications
          - Minimum and maximum capacity of the queue
      - Resources can be allocated with the Capacity Scheduler by:
          - Enabling the Capacity Scheduler
          - Setting up queues
          - Controlling access to queues with ACLs (Access Control Lists)
      [Figure: example queue hierarchy: Root with child queues Finance, Marketing and Operations]
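The queue setup steps map onto properties in `capacity-scheduler.xml`. The following sketch generates those properties for a Finance / Marketing / Operations hierarchy under `root`; the capacity percentages and group names are hypothetical, and the helper `capacity_scheduler_properties` is introduced here purely for illustration.

```python
def capacity_scheduler_properties(queues):
    """Build Capacity Scheduler properties for child queues of root.

    `queues` maps a queue name to (capacity %, maximum-capacity %, group).
    """
    # Capacities of sibling queues must add up to 100 percent.
    total = sum(capacity for capacity, _, _ in queues.values())
    if total != 100:
        raise ValueError(f"child capacities must add up to 100, got {total}")
    prefix = "yarn.scheduler.capacity.root"
    props = {f"{prefix}.queues": ",".join(queues)}
    for name, (capacity, maximum, group) in queues.items():
        props[f"{prefix}.{name}.capacity"] = str(capacity)
        props[f"{prefix}.{name}.maximum-capacity"] = str(maximum)
        # ACL format is "users groups"; a leading space means groups only.
        props[f"{prefix}.{name}.acl_submit_applications"] = f" {group}"
    return props


# Illustrative shares and group names only.
props = capacity_scheduler_properties({
    "finance": (40, 60, "fin_grp"),
    "marketing": (35, 50, "mkt_grp"),
    "operations": (25, 40, "ops_grp"),
})
for key, value in props.items():
    print(f"{key}={value}")
```

The scheduler itself is enabled by setting `yarn.resourcemanager.scheduler.class` to the Capacity Scheduler class in `yarn-site.xml`. Since child queues inherit ACLs from their parent, the ACL on `root` usually needs restricting as well for the per-queue ACLs to take effect.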
  • 15. On-Premise Implementation Guidelines (7/8): Security
    Authentication
     Establishes who the users are and which groups they belong to
     Kerberos is the only authentication method supported by most components
    Authorization
     Establishes what users can access and what level of permissions they have
     In a multi-tenant environment, service accounts / IDs need to be set up at the group and enterprise levels
     Service accounts exist in the enterprise identity store, and access is granted to end users through Apache Ranger / Sentry
    Auditing
     Determines who did what and when
     Apache Ranger / Sentry or a custom solution can be used for auditing
    Data Protection
     Data in Transit
      • SSL / TLS needs to be enabled to encrypt data between clients and service endpoints
      • Keys / certificates are configured per service / role
     Data at Rest
      • Multiple encryption zones on HDFS allow only authorized users to access data
      • HDFS encrypts and decrypts data on the client, so data in an encryption zone also travels to and from the data nodes in encrypted form
      • Keys can be stored in a Java keystore or an HSM
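As a rough sketch of the per-tenant encryption zones mentioned above, the helper below builds the command sequence for one zone using the standard `hadoop key create` and `hdfs crypto -createZone` commands. The key naming scheme (`<tenant>-key`) and the zone paths are assumptions for illustration. The commands are only assembled and printed here, not executed, since they need a cluster with a configured key provider (KMS).

```python
def encryption_zone_commands(tenant, zone_path):
    """Return the CLI commands that create one encryption zone.

    The key name '<tenant>-key' is a hypothetical naming convention.
    """
    key_name = f"{tenant}-key"
    return [
        # Create the zone key in the configured key provider
        ["hadoop", "key", "create", key_name],
        # An encryption zone must be created on an empty directory
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # Bind the directory to the key; files written under it are encrypted
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
    ]


if __name__ == "__main__":
    for tenant in ("finance", "marketing"):
        for command in encryption_zone_commands(tenant, f"/data/{tenant}/secure"):
            print(" ".join(command))
```

Using one key per tenant means access to a tenant's data can be revoked at the key provider without touching other tenants' zones.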
  • 16. On-Premise Implementation Guidelines (8/8): Governance
    Apache Atlas can be used to manage the metadata of data assets and to keep track of lineage information.
  • 17. On-Premise Reference Architecture
    [Architecture diagram. Recoverable labels:]
     Storage: HDFS (Hadoop Distributed File System), with 1. HDFS Quota Management (Disk Space Quota, Name Space Quota) and 2. HDFS Federation
     Resource management: YARN (Data Operating System) with the YARN Capacity Scheduler
     Data Access: Impala, Drill, Presto, NoSQL, Solr; Spark (SQL, Streaming, MLlib, GraphX); Tez (Hive, Pig, Cascading); MapReduce (Hive, Pig, Giraph, Mahout)
     Cross-cutting layers: Data Management, Governance, Security & Operation
     Other labels: Data Lake; Operations, Data Science, BI & DWH, Marketing
  • 18. On-Cloud Design Considerations
    Infrastructure Isolation (Physical & Logical)
     Infrastructure shared physically or logically based on security expectations (multi-tenant environment)
     Complete isolation of compute using a dedicated cluster or resource pool
     Logical isolation at the virtual machine level
     Complete network isolation through a dedicated network for every tenant, and VLANs for logical isolation
    Virtualization Using Multiple VM Support
     Infrastructure components are virtualized and managed as a single entity, and are simultaneously isolated based on tenancy
     Security and compliance needs are met through anti-collocation, hypervisor-level firewalls, resource grouping of compute and storage, and VLAN-based isolation
    Cloud Automation & Integration
     Essential to support client-specific business processes, identity management and integration with tools and services
     In a multitenant cloud environment, it is essential to define standard processes and practices for providing customization flexibility
    Catalogue Manager
     Each aspect of multi-tenancy must culminate in an intuitive user interface
     Catalogue content depends on tenancy and individual privileges
     Multi-tenancy at a catalogue level may authorize integration with different directory services for each tenant
  • 19. On-Cloud Implementation Guideline (1/4)
    Key guidelines to achieve a multitenant data lake environment:
    Storage
    Object stores are the key storage units in a cloud data lake. The offerings of the key cloud providers are:
    Cloud Provider | AWS | Azure | Google
    Service Name | S3 | Azure Storage (Blob) | Google Cloud Storage
    Hot | S3 Standard | Hot Blob Storage | GCS
    Cool | S3 Standard - Infrequent Access | Cool Blob Storage | GCS Nearline
    Archival | Glacier | Archive Blob Storage | GCS Coldline
    Object Limit | Unlimited | Unlimited | Unlimited
    Size Limit | 5 TB / Object | 500 TB / Account | 5 TB / Object
    Storage Management
    Multitenancy can be achieved using two methods:
     Storage Account: Isolate storage by using a separate storage account for each tenant group, along with appropriate identity and access management
     Containers / Buckets: Create separate containers / buckets for each tenant group, with appropriate identity and access management
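The bucket-per-tenant method can be sketched for AWS as an IAM policy document that limits a tenant group to its own bucket. This is a minimal illustration under assumptions: the bucket naming scheme (`datalake-<tenant>`) is hypothetical and the action list is deliberately small. The policy structure and the `s3:` action names follow the standard IAM policy format. The equivalent on Azure or Google would use role assignments scoped to a container or bucket.

```python
import json


def tenant_bucket_policy(tenant):
    """Build an IAM policy allowing access only to the tenant's bucket.

    The bucket name 'datalake-<tenant>' is an assumed naming convention.
    """
    bucket_arn = f"arn:aws:s3:::datalake-{tenant}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Reading and writing applies to objects inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(tenant_bucket_policy("finance"), indent=2))
```

Because IAM denies anything not explicitly allowed, a group that carries only this policy cannot read another tenant's bucket.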
  • 20. On-Cloud Implementation Guideline (2/4)
    Resource Group
    The best approach is to create a separate resource group for each line-of-business (LOB) tenant
    Big Data Processing
    Big data processing services provided by the key cloud providers:
    Cloud Provider | AWS | Azure | Google
    Service Name | Elastic MapReduce (EMR), AWS Databricks | HDInsight, Azure Databricks | Google Cloud Dataproc
    Data Processing Isolation
     Create a separate service instance for each LOB within the same subscription, e.g. the operations department has its own Azure / AWS Databricks service within a given subscription
     Deploy separate clusters for each tenant group or user, with appropriate identity and access management
     Store data processing results in the tenant-specific storage accounts / buckets
  • 21. On-Cloud Implementation Guideline (3/4)
    Security, Identity and Access
    Cloud service providers are responsible for the security of their data centers
    Application Security Implementation
     Configure AWS / Azure Stack so that users from multiple IAM / Azure Active Directory (Azure AD) tenants can use its services
     Provide authorization using AWS Organizations / Azure RBAC
     Manage encryption keys using AWS Key Management Service / Azure Key Vault
    Security Features from Cloud Providers
    Feature | AWS | Azure | Google
    Authentication & Authorization | Identity and Access Management (IAM) | Azure Active Directory | Google Cloud Identity and Access Management
    Authentication & Authorization | AWS Organizations | Azure RBAC | -
    Encryption | Server-side Encryption with Amazon S3 Key Management Service | Azure Storage Service Encryption | Google / Customer Managed Encryption Key
    Encryption | Key Management Service | Key Vault | -
  • 22. On-Cloud Implementation Guideline (4/4)
    Network Services
    Multi-tenant Network Strategy
     Isolate the application servers on their own physical network. This approach works for single tenants on dedicated servers
     Define virtual switches (vSwitches) for each tenant. A vSwitch can bring all relevant virtual machines together on one logical switch. Like virtual machines, vSwitches can move within the cloud environment
     Configure virtual local area networks (VLANs) and create a separate network for each tenant
    Network Services from Cloud Providers
    Service | AWS | Azure | Google
    Cloud Virtual Network | Virtual Private Cloud (VPC) | Virtual Network | Virtual Private Cloud (VPC)
    Cross Premise Connectivity | AWS VPN Gateway | Azure VPN Gateway | Google VPN Gateway
    Dedicated Network | Direct Connect | ExpressRoute | Dedicated Interconnect
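Creating a separate network for each tenant usually starts with carving non-overlapping address ranges out of the virtual network. The sketch below does this with Python's standard `ipaddress` module. The address block `10.0.0.0/16`, the `/24` subnet size and the tenant names are assumptions chosen for the example.

```python
import ipaddress


def allocate_tenant_subnets(network_cidr, tenants, new_prefix=24):
    """Assign each tenant its own non-overlapping subnet.

    Raises ValueError when the address block has too few subnets.
    """
    network = ipaddress.ip_network(network_cidr)
    # subnets() yields consecutive, non-overlapping blocks of the given size
    available = network.subnets(new_prefix=new_prefix)
    allocation = {}
    for tenant in tenants:
        try:
            allocation[tenant] = next(available)
        except StopIteration:
            raise ValueError(
                f"{network_cidr} has no free /{new_prefix} subnet for {tenant}"
            ) from None
    return allocation


if __name__ == "__main__":
    plan = allocate_tenant_subnets(
        "10.0.0.0/16", ["finance", "marketing", "operations"]
    )
    for tenant, subnet in plan.items():
        print(f"{tenant}: {subnet} ({subnet.num_addresses} addresses)")
```

Each resulting range can then back one tenant's subnet or VLAN, with routing and firewall rules between ranges denied by default.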
  • 23. On-Cloud Reference Architecture
    [Architecture diagram. Recoverable labels:]
     Tenants: Customer A, Customer B, Customer C, Customer D
     Storage: Cloud Data Storage (S3 / Azure Blob, ADLS / GCS)
     Resource management: YARN (Data Operating System)
     Data Access: Databricks Spark (SQL, Streaming, MLlib, GraphX); Redshift, Azure SQL DWH, Elastic DWH; HDInsight / EMR (Hive, HBase, Spark, Hadoop); IaaS (Impala, Drill, Presto, NoSQL, Solr)
     Cross-cutting layers: Data Management, Governance, Security & Operation
     Compute: Load Balancer in front of VMs, Instance 1, Instance 2, ..., Instance n-1, Instance n
  • 25. Key Words
     Enterprise Data Lake
     Multitenancy
     HDFS Resource Management
     HDFS Resource Isolation
     HDFS Quota Management
  • 26. Thank You
    Authors: Sanjay Upadhyay
    thoughtleaders@citiustech.com
    About CitiusTech
     3,200+ healthcare IT professionals worldwide
     100% healthcare industry focus
     30%+ CAGR over last 5 years
     110+ healthcare customers
      • Healthcare technology companies
      • Hospitals, IDNs & medical groups
      • Payers and health plans
      • ACOs, MCOs, HIEs, HIXs, NHINs
      • Pharma & Life Sciences companies