© Okera, Inc. – Confidential and Proprietary | 1
August 11th 2018
Data Management Best Practices
Amandeep Khurana
About Me
Who I am
• Dad to a 7 week old girl
• Dog-dad to a 2 yr old rescue
• Foodie
• Music lover
• Love solving hard problems
• Inefficiency killer
What I do
• CEO, Co-Founder, Okera
• Principal Solutions Architect @ Cloudera
• Software Engineering @ AWS EMR
• Co-author, HBase in Action
• Help people get value from technology
Focus for today:
Data Lakes in the Cloud
Agenda
1. Core tenets of a modern data lake
2. Why is cloud different?
3. Data Management topics
This is not a vendor pitch or endorsement
Data Lake is a capability
It’s not a technology stack
Core Tenets
1. Separation of storage and compute
2. Elastic
3. Multi-tenant
4. Flexibility in compute tools
5. Flexibility in storage systems
6. Flexibility in infrastructure
7. Dynamically on-board new users and workloads
8. Secure, Governed, Monitored
• Not just the technology
• Scale independently
• Add/remove independently
• Making changes at one layer shouldn't affect the other
• Regardless of which tool you use, same catalog, same security, same auditability, same monitoring.
Cloud vs. Data Center: what’s different?
• Fundamental paradigm shift: Everything is a service
• The stack is getting deconstructed, including the data warehouse
• Elasticity
• Pay based on usage
• Decision making: IT → Lines of Business (end users)
Data Governance topics
1. Metadata management
2. Authentication
3. Access Control and Obfuscation
4. Audit
5. Misc topics
Metadata
Primer (Critical to get right)
3 kinds
• Technical (schemas, file locations, size, owner)
• Operational (transformations, queries, performance, timing, how fresh)
• Business (semantic meaning, access policies, tags)
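One way to picture the three kinds of metadata is as three parts of a single dataset record. The sketch below is only an illustration: the class names, field names, and the sample `sales.orders` dataset are hypothetical and are not taken from the talk or from any product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TechnicalMetadata:
    schema: dict          # column name -> type name
    location: str         # where the files live
    size_bytes: int
    owner: str


@dataclass
class OperationalMetadata:
    produced_by: str      # transformation or query that wrote the data
    last_updated: datetime


@dataclass
class BusinessMetadata:
    description: str      # semantic meaning
    tags: set = field(default_factory=set)
    access_policies: list = field(default_factory=list)


@dataclass
class DatasetRecord:
    name: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


def freshness_hours(record, now):
    """Hours since the dataset was last updated (operational metadata)."""
    return (now - record.operational.last_updated).total_seconds() / 3600.0


# Hypothetical example record.
orders = DatasetRecord(
    name="sales.orders",
    technical=TechnicalMetadata(
        schema={"order_id": "bigint", "email": "string", "amount": "double"},
        location="s3://example-bucket/sales/orders/",
        size_bytes=1024,
        owner="sales-eng",
    ),
    operational=OperationalMetadata(
        produced_by="nightly_orders_etl",
        last_updated=datetime(2018, 8, 10, tzinfo=timezone.utc),
    ),
    business=BusinessMetadata(
        description="One row per customer order",
        tags={"pii"},
        access_policies=["analysts_masked_email"],
    ),
)
```

Keeping all three parts on one record is a design choice made here for brevity; it mirrors the point that the layers build on each other.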
Metadata
Approach
Wrong: Get a crawler to find sensitive info (fingerprint)
Wrong: Get a catalog to classify and govern data
Wrong:
Right: Build technical -> operational -> business. Think holistically. Afterthoughts are expensive.
Wrong: “It’s schema on read”
Right: It certainly is schema on read. Reader needs a schema. Schemas enable sharing.
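The schema-on-read point can be sketched in a few lines: the raw data is stored untyped, and the reader still has to bring a schema to make sense of it. The schema, sample data, and function name below are hypothetical.

```python
import csv
import io

# Hypothetical shared schema: (column name, type to cast to), in file order.
ORDERS_SCHEMA = [("order_id", int), ("email", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply a schema to untyped CSV text at read time."""
    rows = []
    for values in csv.reader(io.StringIO(raw_text)):
        if not values:
            continue  # skip blank lines
        if len(values) != len(schema):
            raise ValueError(f"expected {len(schema)} fields, got {len(values)}")
        rows.append(
            {name: cast(value) for (name, cast), value in zip(schema, values)}
        )
    return rows


# The stored data carries no types or column names of its own.
RAW = "1,a@example.com,19.99\n2,b@example.com,5.00\n"
rows = read_with_schema(RAW, ORDERS_SCHEMA)
```

Two readers that share `ORDERS_SCHEMA` agree on what the bytes mean, which is the sense in which schemas enable sharing.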
Metadata
Where?
Wrong: Multiple metastores. Let users read RDS.
Right: Put API in front of database
Wrong: Use wiki to store dataset locations and schemas
Right: Build a metastore and integrate in your workflow
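A minimal sketch of "put an API in front of the database", assuming an in-memory SQLite database as a stand-in for the real backing store. The `Metastore` class, its method names, and the table layout are hypothetical; the point is only that callers use `register` and `describe` and never query the database directly.

```python
import json
import sqlite3


class Metastore:
    """Small API layer; the backing database is an implementation detail."""

    def __init__(self, conn):
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS datasets ("
            "name TEXT PRIMARY KEY, "
            "location TEXT NOT NULL, "
            "schema_json TEXT NOT NULL)"
        )

    def register(self, name, location, schema):
        """Create or update the entry for a dataset."""
        self._conn.execute(
            "INSERT OR REPLACE INTO datasets VALUES (?, ?, ?)",
            (name, location, json.dumps(schema)),
        )
        self._conn.commit()

    def describe(self, name):
        """Return location and schema, or raise KeyError if unknown."""
        row = self._conn.execute(
            "SELECT location, schema_json FROM datasets WHERE name = ?",
            (name,),
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return {"name": name, "location": row[0], "schema": json.loads(row[1])}


# SQLite stands in for whatever database actually backs the metastore.
metastore = Metastore(sqlite3.connect(":memory:"))
metastore.register(
    "sales.orders",
    "s3://example-bucket/sales/orders/",
    {"order_id": "bigint", "email": "string"},
)
```

Because the workflow talks to the API, the database behind it can be swapped or secured without touching readers and writers.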
Authentication
Source of identity
Wrong: IAM
Right: AD / SSO for every system. Single source
Granularity
Wrong: IAM roles
Right: Users and Groups
Protocols
Wrong: Kerberos
Right: Tokens / OAuth etc
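To make "tokens" and "users and groups" concrete, here is a toy signed token that carries a user and that user's groups. It is a simplified stand-in and not OAuth or any real identity provider's format; the secret, claim names, and functions are all hypothetical, and a real deployment would rely on the single identity source and an established token standard.

```python
import base64
import hashlib
import hmac
import json

# Assumption: a shared demo secret. A real system would use managed keys.
SECRET = b"demo-only-secret"


def issue_token(user, groups, expires_at):
    """Return 'payload.signature' carrying the user and group claims."""
    payload = json.dumps(
        {"sub": user, "groups": sorted(groups), "exp": expires_at},
        sort_keys=True,
    ).encode()
    body = base64.urlsafe_b64encode(payload)
    signature = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + signature


def verify_token(token, now):
    """Check signature and expiry, then return the claims."""
    body, _, signature = token.partition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(body.encode()))
    if now >= claims["exp"]:
        raise PermissionError("token expired")
    return claims


token = issue_token("alice", ["analysts"], expires_at=2000)
claims = verify_token(token, now=1000)
```

Downstream services only need the verified claims (user and groups), not a storage-specific role.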
Access Control and Obfuscation
Mechanism
Wrong: IAM on bucket policies
Right: Role based
Where?
Wrong: S3 level (or any storage level)
Wrong: Compute engine
Right: Independent of storage and compute. Remember: Flexibility and separation of storage and compute. It’s “data” access controls. Build independent policy engine that works with multiple tools.
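A minimal sketch of a role-based policy engine that knows nothing about the storage system or the compute engine. Any tool would ask it the same question: which columns of this dataset may these groups read? The `Grant` and `PolicyEngine` names, the roles, and the sample grants are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    role: str
    dataset: str
    columns: frozenset    # columns the role may read


class PolicyEngine:
    """Answers access questions; holds no storage or engine specifics."""

    def __init__(self, grants, role_members):
        self._grants = list(grants)
        self._role_members = role_members   # role -> set of group names

    def allowed_columns(self, groups, dataset):
        """Union of columns granted to any role held by these groups."""
        allowed = set()
        for grant in self._grants:
            if grant.dataset != dataset:
                continue
            if self._role_members.get(grant.role, set()) & set(groups):
                allowed |= grant.columns
        return allowed


# Hypothetical roles, groups, and grants.
ROLE_MEMBERS = {"analyst": {"analysts"}, "steward": {"data-stewards"}}
GRANTS = [
    Grant("analyst", "sales.orders", frozenset({"order_id", "amount"})),
    Grant("steward", "sales.orders", frozenset({"order_id", "amount", "email"})),
]
engine = PolicyEngine(GRANTS, ROLE_MEMBERS)
```

Since the decision is expressed in terms of datasets and columns, the same policies apply whichever tool performs the read.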
Access Control and Obfuscation
Who?
Wrong: Central team
Right: Distributed stewardship
How?
Wrong: Create copies with slices of data for different users (think FGAC, GDPR consent)
Right: Single copy, control access on top.
Wrong: Obfuscate data and then store it
Worse: Throw away key
Worse: Hard-code key into application code
Right: Store full fidelity. Control access. Obfuscate at read time. Use case specific.
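The "single copy, obfuscate at read time" idea can be sketched as a read path that applies use-case-specific rules to a row stored at full fidelity. The use cases, rule table, and helper functions are hypothetical, and the truncated unsalted hash is only a placeholder for a proper tokenization scheme.

```python
import hashlib


def mask_email(value):
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


def tokenize(value):
    """Illustrative only: a stable pseudonym derived from the value."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]


# Hypothetical per-use-case rules: column name -> transformation.
READ_RULES = {
    "support": {"email": mask_email},
    "analytics": {"email": tokenize},
    "fraud": {},                      # sees full fidelity
}


def read_row(row, use_case):
    """Return a new row with the use case's obfuscation applied."""
    rules = READ_RULES[use_case]
    return {
        column: rules[column](value) if column in rules else value
        for column, value in row.items()
    }


# The single stored copy is never modified.
STORED = {"order_id": 1, "email": "alice@example.com", "amount": 19.99}
support_view = read_row(STORED, "support")
fraud_view = read_row(STORED, "fraud")
```

Adding a new use case means adding a rule set, not another copy of the data.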
Audit
When to build?
Wrong: After everything else
Right: Architect from the get-go. Authentication, authorization, and audit go together
Wrong: Oh, I have this info. CloudTrail + EMR logs
Right: Audit is an always-on capability. Build reports. Ask questions of the audit trail. Don’t leave it for later.
Wrong: Build a chargeback system
Right: Build a rich audit trail system that gives visibility
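"Ask questions of the audit trail" suggests that audit events should be structured records that can be queried, rather than raw logs to dig through later. A small sketch under that assumption, with a hypothetical event shape and sample events:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    timestamp: str
    user: str
    dataset: str
    action: str
    allowed: bool


class AuditTrail:
    """Append-only list of events plus a few canned questions."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def denied(self):
        """Every access attempt that was refused."""
        return [e for e in self._events if not e.allowed]

    def reads_per_dataset(self):
        """How often each dataset was successfully read."""
        return Counter(
            e.dataset for e in self._events if e.allowed and e.action == "read"
        )


# Hypothetical events.
trail = AuditTrail()
trail.record(AuditEvent("2018-08-11T09:00:00Z", "alice", "sales.orders", "read", True))
trail.record(AuditEvent("2018-08-11T09:05:00Z", "bob", "hr.salaries", "read", False))
trail.record(AuditEvent("2018-08-11T09:10:00Z", "carol", "sales.orders", "read", True))
```

Usage reports (and, if wanted, chargeback) can then be derived from the same trail instead of being built as a separate system.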
Misc Topics
Who’s responsible?
Wrong: Central team
Right: Distributed stewardship
Wrong: Deletes and updates by reading the entire partition/dataset, recreating it, writing it to a new location, and altering the partition to change location (CDC, GDPR)
Wrong: Use a different storage system (HBase, Kudu etc)
Right: Build views to reconcile on the fly. Compact periodically
Wrong: Crawl the lake and find stuff
Right: Catalog as a part of the ingest / creation process. Don’t just fill the lake
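"Build views to reconcile on the fly, compact periodically" can be read as a merge-on-read pattern: keep the base data immutable, append deletes and updates to a change log, and merge the two when reading. The sketch below assumes that reading; the change-log format and function names are hypothetical.

```python
def reconciled_view(base_rows, delta_log, key="id"):
    """Merge base rows with logged changes at read time, in log order."""
    current = {row[key]: row for row in base_rows}
    for change in delta_log:
        if change["op"] == "delete":
            current.pop(change[key], None)
        else:  # "upsert"
            current[change[key]] = change["row"]
    return [current[k] for k in sorted(current)]


def compact(base_rows, delta_log, key="id"):
    """Periodic job: fold the log into a new base and start an empty log."""
    return reconciled_view(base_rows, delta_log, key), []


# Hypothetical base data and change log (for example CDC or GDPR deletes).
BASE = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
DELTAS = [
    {"op": "delete", "id": 2},
    {"op": "upsert", "id": 3, "row": {"id": 3, "email": "c@example.com"}},
    {"op": "upsert", "id": 1, "row": {"id": 1, "email": "a2@example.com"}},
]

view = reconciled_view(BASE, DELTAS)
new_base, new_log = compact(BASE, DELTAS)
```

Readers always see the reconciled result, while the expensive rewrite happens only when compaction runs.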
Misc Topics
Tool Selection
Wrong: The same tools from the data center should work
Right: Build cloud based services as capabilities
Wrong: Oh we should just build this
Right: Focus on your core competency. Buy stuff. It’s cheaper
Summary
• Cloud and separation of storage and compute is a powerful paradigm
• Think about data management holistically
• Approach as capabilities and services in the cloud
• Build the foundations first
• Use the right abstractions and frameworks
Scaling governance enables self-service and agility.
Thank you
ak@okera.com

More Related Content

What's hot

Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...DataKitchen
 
Hopper energyservices
Hopper energyservicesHopper energyservices
Hopper energyserviceshopperdev
 
Using Elastic @ Elastic: InfoSec and Elastic Security
Using Elastic @ Elastic: InfoSec and Elastic SecurityUsing Elastic @ Elastic: InfoSec and Elastic Security
Using Elastic @ Elastic: InfoSec and Elastic SecurityElasticsearch
 
Choosing the Right Open Source Database
Choosing the Right Open Source DatabaseChoosing the Right Open Source Database
Choosing the Right Open Source DatabaseAll Things Open
 
Metadata Mastery: A Big Step for BI Modernization
Metadata Mastery: A Big Step for BI ModernizationMetadata Mastery: A Big Step for BI Modernization
Metadata Mastery: A Big Step for BI ModernizationEric Kavanagh
 
Securing Search Data in the Cloud
Securing Search Data in the CloudSecuring Search Data in the Cloud
Securing Search Data in the CloudSearchStax
 
[NEW LAUNCH!] Introducing Amazon Comprehend Medical (AIM398) - AWS re:Invent ...
[NEW LAUNCH!] Introducing Amazon Comprehend Medical (AIM398) - AWS re:Invent ...[NEW LAUNCH!] Introducing Amazon Comprehend Medical (AIM398) - AWS re:Invent ...
[NEW LAUNCH!] Introducing Amazon Comprehend Medical (AIM398) - AWS re:Invent ...Amazon Web Services
 
Acquisition de données dans Neo4j pour le Master Data Management
Acquisition de données dans Neo4j pour le Master Data ManagementAcquisition de données dans Neo4j pour le Master Data Management
Acquisition de données dans Neo4j pour le Master Data ManagementNeo4j
 
Hurry Up and Wait! Leveraging Open Source to Fuel Sutter’s HIT Innovation Ple...
Hurry Up and Wait! Leveraging Open Source to Fuel Sutter’s HIT Innovation Ple...Hurry Up and Wait! Leveraging Open Source to Fuel Sutter’s HIT Innovation Ple...
Hurry Up and Wait! Leveraging Open Source to Fuel Sutter’s HIT Innovation Ple...Appnovation Technologies
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti
 
Truevert Vertical Semantic Search
Truevert Vertical Semantic SearchTruevert Vertical Semantic Search
Truevert Vertical Semantic SearchTruevert
 
Prepare to Recover: Fully Protect Your Salesforce Data
Prepare to Recover: Fully Protect Your Salesforce Data Prepare to Recover: Fully Protect Your Salesforce Data
Prepare to Recover: Fully Protect Your Salesforce Data Spanning Cloud Apps
 
The Need for NoSQL - MarkLogic
The Need for NoSQL - MarkLogicThe Need for NoSQL - MarkLogic
The Need for NoSQL - MarkLogicGovLoop
 
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetuBig Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetuEmre Sevinç
 
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...Denodo
 
The Journey to Success with Big Data
The Journey to Success with Big DataThe Journey to Success with Big Data
The Journey to Success with Big DataCloudera, Inc.
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMatillion
 
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...StampedeCon
 
Starting the Hadoop Journey at a Global Leader in Cancer Research
Starting the Hadoop Journey at a Global Leader in Cancer ResearchStarting the Hadoop Journey at a Global Leader in Cancer Research
Starting the Hadoop Journey at a Global Leader in Cancer ResearchDataWorks Summit/Hadoop Summit
 

What's hot (20)

Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
 
Hopper energyservices
Hopper energyservicesHopper energyservices
Hopper energyservices
 
Using Elastic @ Elastic: InfoSec and Elastic Security
Using Elastic @ Elastic: InfoSec and Elastic SecurityUsing Elastic @ Elastic: InfoSec and Elastic Security
Using Elastic @ Elastic: InfoSec and Elastic Security
 
Choosing the Right Open Source Database
Choosing the Right Open Source DatabaseChoosing the Right Open Source Database
Choosing the Right Open Source Database
 
Metadata Mastery: A Big Step for BI Modernization
Metadata Mastery: A Big Step for BI ModernizationMetadata Mastery: A Big Step for BI Modernization
Metadata Mastery: A Big Step for BI Modernization
 
Securing Search Data in the Cloud
Securing Search Data in the CloudSecuring Search Data in the Cloud
Securing Search Data in the Cloud
 
[NEW LAUNCH!] Introducing Amazon Comprehend Medical (AIM398) - AWS re:Invent ...
[NEW LAUNCH!] Introducing Amazon Comprehend Medical (AIM398) - AWS re:Invent ...[NEW LAUNCH!] Introducing Amazon Comprehend Medical (AIM398) - AWS re:Invent ...
[NEW LAUNCH!] Introducing Amazon Comprehend Medical (AIM398) - AWS re:Invent ...
 
Acquisition de données dans Neo4j pour le Master Data Management
Acquisition de données dans Neo4j pour le Master Data ManagementAcquisition de données dans Neo4j pour le Master Data Management
Acquisition de données dans Neo4j pour le Master Data Management
 
Hurry Up and Wait! Leveraging Open Source to Fuel Sutter’s HIT Innovation Ple...
Hurry Up and Wait! Leveraging Open Source to Fuel Sutter’s HIT Innovation Ple...Hurry Up and Wait! Leveraging Open Source to Fuel Sutter’s HIT Innovation Ple...
Hurry Up and Wait! Leveraging Open Source to Fuel Sutter’s HIT Innovation Ple...
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to Production
 
Truevert Vertical Semantic Search
Truevert Vertical Semantic SearchTruevert Vertical Semantic Search
Truevert Vertical Semantic Search
 
Prepare to Recover: Fully Protect Your Salesforce Data
Prepare to Recover: Fully Protect Your Salesforce Data Prepare to Recover: Fully Protect Your Salesforce Data
Prepare to Recover: Fully Protect Your Salesforce Data
 
The Need for NoSQL - MarkLogic
The Need for NoSQL - MarkLogicThe Need for NoSQL - MarkLogic
The Need for NoSQL - MarkLogic
 
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetuBig Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
 
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
 
The Journey to Success with Big Data
The Journey to Success with Big DataThe Journey to Success with Big Data
The Journey to Success with Big Data
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
 
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
 
Oracle Big Data SQL
Oracle Big Data SQLOracle Big Data SQL
Oracle Big Data SQL
 
Starting the Hadoop Journey at a Global Leader in Cancer Research
Starting the Hadoop Journey at a Global Leader in Cancer ResearchStarting the Hadoop Journey at a Global Leader in Cancer Research
Starting the Hadoop Journey at a Global Leader in Cancer Research
 

Similar to Data Con LA 2018 - Critical data management practices by Amandeep Khurana

Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
PyData: The Next Generation
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next GenerationWes McKinney
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaNeo4j
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Cloudera, Inc.
 
Comprehensive Security for the Enterprise IV: Visibility Through a Single End...
Comprehensive Security for the Enterprise IV: Visibility Through a Single End...Comprehensive Security for the Enterprise IV: Visibility Through a Single End...
Comprehensive Security for the Enterprise IV: Visibility Through a Single End...Cloudera, Inc.
 
2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the UnionCloudera, Inc.
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubCloudera, Inc.
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and ManufacturingCloudera, Inc.
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightCloudera, Inc.
 
Embedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern StaenderEmbedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern StaenderDataconomy Media
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubCloudera, Inc.
 
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchCloudera, Inc.
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesDataStax
 
Non-Relational Revolution - Joseph Idziorek
Non-Relational Revolution - Joseph IdziorekNon-Relational Revolution - Joseph Idziorek
Non-Relational Revolution - Joseph IdziorekAmazon Web Services
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and BeyondStanding Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and BeyondCloudera, Inc.
 
IDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database MigrationsIDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database MigrationsIDERA Software
 
Enterprise Cloud transformation z pohledu Oracle
Enterprise Cloud transformation z pohledu OracleEnterprise Cloud transformation z pohledu Oracle
Enterprise Cloud transformation z pohledu OracleMarketingArrowECS_CZ
 

Similar to Data Con LA 2018 - Critical data management practices by Amandeep Khurana (20)

Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
PyData: The Next Generation
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next Generation
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
 
Comprehensive Security for the Enterprise IV: Visibility Through a Single End...
Comprehensive Security for the Enterprise IV: Visibility Through a Single End...Comprehensive Security for the Enterprise IV: Visibility Through a Single End...
Comprehensive Security for the Enterprise IV: Visibility Through a Single End...
 
2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and Manufacturing
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
 
Embedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern StaenderEmbedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern Staender
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with Search
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Non-Relational Revolution - Joseph Idziorek
Non-Relational Revolution - Joseph IdziorekNon-Relational Revolution - Joseph Idziorek
Non-Relational Revolution - Joseph Idziorek
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and BeyondStanding Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
 
IDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database MigrationsIDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database Migrations
 
Enterprise Cloud transformation z pohledu Oracle
Enterprise Cloud transformation z pohledu OracleEnterprise Cloud transformation z pohledu Oracle
Enterprise Cloud transformation z pohledu Oracle
 

More from Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

More from Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning


Data Con LA 2018 - Critical data management practices by Amandeep Khurana

  • 1. Data Management Best Practices. Amandeep Khurana. August 11th 2018. ©2018 Okera, Inc. – Confidential and Proprietary
  • 2. About Me
      • Who I am: Dad to a 7 week old girl; dog-dad to a 2 yr old rescue; foodie; music lover; love solving hard problems; inefficiency killer.
      • What I do: CEO, Co-Founder, Okera; Principal Solutions Architect @ Cloudera; Software Engineering @ AWS EMR; co-author, HBase in Action; help people get value from technology.
  • 3. Focus for today: Data Lakes in the Cloud
  • 4. Agenda: (1) Core tenets of a modern data lake; (2) Why is cloud different?; (3) Data Management topics
  • 5. This is not a vendor pitch or endorsement
  • 6. Data Lake is a capability. It’s not a technology stack.
  • 7. Core Tenets
      1. Separation of storage and compute
      2. Elastic
      3. Multi-tenant
      4. Flexibility in compute tools
      5. Flexibility in storage systems
      6. Flexibility in infrastructure
      7. Dynamically on-board new users and workloads
      8. Secure, Governed, Monitored
      • Not just the technology
      • Scale independently
      • Add/remove independently
      • Making changes at one layer shouldn't affect the other
      • Regardless of which tool you use, same catalog, same security, same auditability, same monitoring.
  • 8. Cloud vs. Data Center: what’s different?
      • Fundamental paradigm shift: Everything is a service
      • The stack is getting deconstructed, including the data warehouse
      • Elasticity
      • Pay based on usage
      • Decision making: IT → Lines of Business (end users)
  • 9. Data Governance topics: (1) Metadata management; (2) Authentication; (3) Access Control and Obfuscation; (4) Audit; (5) Misc topics
  • 10. Metadata: Primer (critical to get right). 3 kinds:
      • Technical (schemas, file locations, size, owner)
      • Operational (transformations, queries, performance, timing, how fresh)
      • Business (semantic meaning, access policies, tags)
  • 11. Metadata: Approach
      • Wrong: Get a crawler to find sensitive info (fingerprint). Wrong: Get a catalog to classify and govern data. Right: Build technical -> operational -> business. Think holistically. Afterthoughts are expensive.
      • Wrong: “It’s schema on read”. Right: It certainly is schema on read. Reader needs a schema. Schemas enable sharing.
  • 12. Metadata: Where?
      • Wrong: Multiple metastores. Let users read RDS. Right: Put an API in front of the database.
      • Wrong: Use a wiki to store dataset locations and schemas. Right: Build a metastore and integrate it into your workflow.
  • 13. Authentication
      • Source of identity. Wrong: IAM. Right: AD / SSO for every system. Single source.
      • Granularity. Wrong: IAM roles. Right: Users and Groups.
      • Protocols. Wrong: Kerberos. Right: Tokens / OAuth etc.
  • 14. Access Control and Obfuscation
      • Mechanism. Wrong: IAM on bucket policies. Right: Role based.
      • Where? Wrong: S3 level (or any storage level). Wrong: Compute engine. Right: Independent of storage and compute.
      • Remember: Flexibility and separation of storage and compute. It’s “data” access controls. Build an independent policy engine that works with multiple tools.
  • 15. Access Control and Obfuscation
      • Who? Wrong: Central team. Right: Distributed stewardship.
      • How? Wrong: Create copies with slices of data for different users (think FGAC, GDPR consent). Right: Single copy, control access on top.
      • Wrong: Obfuscate data and then store it. Worse: Throw away key. Worse: Hard-code key into application code. Right: Store full fidelity. Control access. Obfuscate at read time. Use case specific.
  • 16. Audit
      • When to build? Wrong: After everything else. Right: Architect from the get go. Authentication, authorization, and audit go together.
      • Wrong: Oh, I have this info. CloudTrail + EMR logs. Right: Audit is an always-on capability. Build reports. Ask questions of the audit trail. Don’t leave it for later.
      • Wrong: Build a chargeback system. Right: Build a rich audit trail system that gives visibility.
  • 17. Misc Topics
      • Who’s responsible? Wrong: Central team. Right: Distributed stewardship.
      • Wrong: Deletes and updates by reading the entire partition/dataset, recreating it, writing it to a new location, then altering the partition to change location (CDC, GDPR). Wrong: Use a different storage system (HBase, Kudu etc). Right: Build views to reconcile on the fly. Compact periodically.
      • Wrong: Crawl the lake and find stuff. Right: Catalog as a part of the ingest / creation process. Don’t just fill the lake.
  • 18. Misc Topics: Tool Selection
      • Wrong: Same from data center should work. Right: Build cloud based services as capabilities.
      • Wrong: Oh we should just build this. Right: Focus on your core competency. Buy stuff. It’s cheaper.
  • 19. Summary
      • Cloud and separation of storage and compute is a powerful paradigm
      • Think about data management holistically
      • Approach as capabilities and services in the cloud
      • Build the foundations first
      • Use the right abstractions and frameworks
  • 20. Scaling governance enables self-service and agility.
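
The access control slides recommend keeping a single full-fidelity copy, using role-based policies that live outside storage and compute, and obfuscating at read time. The Python sketch below is one way that idea might look in miniature. It is not taken from the talk and is not Okera's implementation: the role names, column names, masking functions, `POLICY` table, and `read_records` helper are all hypothetical, and the hash-based token is illustrative rather than a vetted scheme.

```python
import hashlib
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Mask = Callable[[str], str]


def redact(value: str) -> str:
    """Hide the value entirely."""
    return "****"


def tokenize(value: str) -> str:
    """Replace the value with a deterministic token so equal inputs still match.

    Illustrative only: a plain truncated hash is not a vetted tokenization scheme.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Hypothetical policy table: role -> column -> mask applied at read time.
# Columns not listed for a role are returned unchanged.
POLICY: Dict[str, Dict[str, Mask]] = {
    "analyst": {"email": tokenize, "ssn": redact},
    "steward": {},
}


def read_records(rows: Iterable[Record], role: str) -> List[Record]:
    """Return a masked view of the rows for the given role.

    The stored rows are never modified; unknown roles are denied.
    """
    if role not in POLICY:
        raise PermissionError(f"no policy for role {role!r}")
    masks = POLICY[role]
    return [
        {col: masks[col](val) if col in masks else val for col, val in row.items()}
        for row in rows
    ]


# Single full-fidelity copy (made-up sample data).
FULL_FIDELITY = [{"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}]
```

Calling `read_records(FULL_FIDELITY, "analyst")` returns masked `email` and `ssn` values, while `"steward"` sees the rows unchanged. Because the policy sits in front of the data rather than in the storage layer or a particular engine, changing who sees what means editing the policy table instead of rewriting or copying the data.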