SlideShare a Scribd company logo
1 of 17
Download to read offline
Anarchy Doesn’t Scale:
Your Data Lake Needs Governance
Kiran Kamreddy
Sr. Product Marketing Manager
Hadoop Portfolio, Teradata
One Data Lake: Many Definitions
One Data Lake: Many Definitions
A	
  centralized	
  repository	
  of	
  raw	
  data	
  into	
  which	
  many	
  data-­‐producing	
  
streams	
  flow	
  and	
  from	
  which	
  downstream	
  facili7es	
  may	
  draw.	
  	
  	
  	
  
Informa4on	
  
Sources	
  
Data	
  Lake	
  
Downstream	
  
Facili4es	
  
Data	
  Variety	
   is	
  the	
  driving	
  factor	
  in	
  building	
  a	
  Data	
  Lake	
  
Data Lake Maturity & Risks
Data	
  Lake	
  ini4a4ves	
  usually	
  start	
  small	
  	
  
–  Single	
  business	
  unit	
  
–  Few	
  data	
  sources,	
  applica7ons	
  
–  Few	
  data	
  scien7sts	
  
–  Exclusive	
  domain	
  
	
  
	
  
Informal,	
  Tribal	
  ways	
  of	
  sharing	
  	
  	
  
–  Handshakes	
  
–  Over	
  the	
  desk	
  conversa7ons	
  
–  No	
  processes,	
  mechanisms	
  
–  Everyone	
  is	
  free	
  to	
  use	
  	
  
Anarchy	
  works	
  well	
  
Data Lake Maturity & Risks
Data	
  Lake	
  ini4a4ves	
  grow	
  popular	
  
–  More	
  Business	
  Units	
  
–  More	
  Data	
  Sources	
  	
  
–  More	
  Applica7ons	
  
–  More	
  Users	
  	
  
…. grow into more complex environments
Without	
  proper	
  governance	
  mechanisms	
  	
  	
  
Data	
  lakes	
  risk	
  turning	
  data	
  swamps	
  
Consistency
Quality
Security
What can Governance do for my data lake?
Fundamental	
  capabili4es	
  for	
  tracking,	
  organizing	
  and	
  understanding	
  data	
  
- Where	
  did	
  my	
  data	
  come	
  from	
  ?	
  How	
  is	
  it	
  being	
  transformed	
  ?	
  	
  
- Track	
  usage,	
  resolve	
  anomalies,	
  visualize,	
  op7mize	
  and	
  clarify	
  data	
  lineage	
  
- 	
  Search	
  and	
  access	
  data	
  (	
  not	
  only	
  browse	
  )	
  
- Assess	
  data	
  quality	
  and	
  fitness	
  for	
  purpose	
  
Specialized capabilities to meets security& compliance requirements
-  Govern who can/cannot access the data and who cannot
-  Data life cycle management, archiving and retention policies
-  Auditing, compliance
Data	
  Governance	
  first	
  approach	
  to	
  prevent	
  turning	
  to	
  Data	
  swamps	
  	
  
RetrofiIng	
  data	
  governance	
  is	
  not	
  feasible	
  
	
  
Governance and Productivity
Governance	
  should	
  supports	
  day-­‐to-­‐day	
  use	
  of	
  data	
  	
  
–  Data	
  workers	
  need	
  a	
  strong	
  understanding	
  	
  
–  Roles	
  for	
  data	
  stewards,	
  data	
  owners,	
  data	
  analysts/scien7sts	
  need	
  to	
  
be	
  assigned	
  
	
  
Operational Metadata is critical to understanding
–  Where did it come from?
–  What is the environment? – landing zone, OS, Line of Business
–  What processes touched my data?did you lose any data? – Checksums etc.
–  When did the data get ingested, transformed?
–  Did it get exported, when, where how will it be used (organizational)?
Provision consistent ingest methods that track operational metadata
What about Security & Compliance?
•  Compliance	
  and	
  Regulatory	
  	
  	
  
–  Capture,	
  store	
  and	
  move	
  data	
  	
  	
  
–  Sarbanes-­‐Oxley,	
  HIPAA,	
  Basel	
  II	
  	
  
•  Security	
  	
  	
  
–  Authoriza7on,	
  Authen7ca7on	
  	
  
–  Handling	
  sensi7ve	
  data	
  	
  
•  Audi4ng	
  	
  
–  Recording	
  every	
  aUempt	
  to	
  access	
  
•  Archive	
  &	
  Reten4on	
  
–  Data	
  life	
  cycle	
  policies	
  
	
  
Governance Challenges on Hadoop Data Lakes
Hadoop	
  is	
  different	
  to	
  DW	
  
• Scale:	
  High	
  volumes	
  of	
  data,	
  mul7ple	
  user	
  access	
  
• Variety:	
  Schema-­‐on-­‐read,	
  mul7ple	
  formats	
  of	
  data	
  
• Mul7ple	
  storage	
  layers	
  (HDFS,	
  Hive,	
  HBase)	
  
• Many	
  processing	
  engines	
  (MR,	
  Hive,	
  Pig,	
  Impala,	
  Drill…)	
  
• Many	
  workflow	
  engines/schedules	
  (Cron,	
  Oozie,	
  Falcon…)	
  
• Holis7c	
  view	
  of	
  data	
  with	
  required	
  context	
  is	
  difficult	
  	
  
	
  
Hadoop needs less stringent, more flexible mechanisms
Balance agility and self service with processes, rules, regulations
Maintain Governance without losing Hadoop's power
•  Apache	
  Hadoop	
  has	
  built-­‐in	
  support	
  for	
  these	
  capabili4es	
  
•  HCatalog,	
  Hive,	
  etc	
  
•  Hadoop	
  distribu4on	
  vendors	
  have	
  all	
  made	
  improvements	
  in	
  
each	
  of	
  these	
  areas	
  
•  Navigator,	
  Ranger,	
  etc	
  
•  Vendors	
  provide	
  specialized	
  capabili4es	
  in	
  each	
  area	
  that	
  go	
  
beyond	
  what	
  a	
  Hadoop	
  distribu4on	
  provides	
  
Multiple Approaches….. but incomplete
One approach for Data Governance on Hadoop
Teradata	
  Loom®	
  –	
  Integrated	
  Data	
  Management	
  for	
  Hadoop	
  
	
  Metadata	
  management,	
  Lineage,	
  Data	
  Wrangling	
  
	
  Automa7c	
  data	
  cataloging,	
  data	
  profiling	
  and	
  sta7s7cs	
  genera7on	
  
Teradata Rainstor – Data Archiving
Structured data archiving in Hadoop with robust security
Compliance and auditing
ThinkBig – Hadoop professional services
Hadoop Data Lake – packaged service/product offering to build and deploy
high-quality, governed data lakes
Teradata Loom
Find	
  and	
  Understand	
  Your	
  Data	
  
•  Ac7veScan	
  
–  Data	
  cataloging	
  
–  Event	
  triggers	
  
–  Job	
  detec7on	
  and	
  lineage	
  crea7on	
  
–  Data	
  profiling	
  (sta7s7cs)	
  
•  Workbench	
  and	
  Metadata	
  Registry	
  
–  Data	
  explora7on	
  and	
  discovery	
  
–  Technical	
  and	
  business	
  metadata	
  
–  Data	
  sampling	
  and	
  previews	
  
–  Lineage	
  rela7onships	
  
–  Search	
  over	
  metadata	
  
–  REST	
  API	
  –	
  easily	
  integrate	
  third-­‐party	
  
apps	
  
Prepare	
  Your	
  Data	
  
•  Data	
  Wrangling	
  
–  Self-­‐service,	
  interac7ve	
  data	
  wrangling	
  for	
  
Hadoop	
  
–  Metadata	
  tracked	
  	
  
•  HiveQL	
  
–  Joins,	
  unions,	
  aggrega7ons,	
  UDFs	
  
–  Metadata	
  tracked	
  in	
  Loom	
  
Teradata Rainstor
–  Retain	
  data	
  online	
  that	
  is	
  queryable	
  for	
  an	
  indefinite	
  period	
  
– 	
  Re7re	
  data	
  that	
  are	
  no	
  longer	
  required	
  with	
  auto-­‐expira4on	
  policies	
  
– 	
  Comply	
  with	
  strict	
  government	
  rules	
  and	
  regula7ons	
  
– 	
  Retain	
  the	
  metadata	
  as	
  it	
  was	
  originally	
  captured	
  
– 	
  Store	
  tamper-­‐proof,	
  immutable	
  (unchangeable)	
  data	
  
– 	
  Maintain	
  availability	
  to	
  data	
  as	
  RDBMS	
  versions	
  change	
  or	
  expire	
  
–  Compression,	
  MPP	
  SQL	
  query	
  engine,	
  Encryp7on,	
  Audi7ng	
  
Big Data Services from ThinkBig
Big Data
Strategy &
Roadmap
Data Lake
Implementation
Analytics &
Data Science
Training &
Support
Lack of Clear
Big Data
Strategy
Difficulty
Turning Data
into Action
Missing Big
Data Skills
	
  
Focused	
  exclusively	
  on	
  tying	
  Hadoop	
  and	
  big	
  data	
  solu7ons	
  to	
  measurable	
  
business	
  value	
  
	
  
Data Scattered
& Not Well
Understood
Impact	
  
•  Co-­‐loca4on	
  of	
  data	
  provides	
  more	
  efficient	
  workflow	
  for	
  analysts	
  
•  Hadoop	
  provides	
  scalability	
  at	
  a	
  lower	
  cost	
  than	
  tradi4onal	
  systems	
  
•  Develop	
  new	
  insights	
  to	
  drive	
  business	
  value	
  
Situa4on	
  
	
  	
  
Problem	
  
Lack	
  of	
  centralized	
  metadata	
  repository	
  makes	
  data	
  governance	
  impossible.	
  	
  
Enterprise	
  must	
  have	
  transparency	
  into	
  data	
  in	
  the	
  cluster	
  and	
  capability	
  to	
  define	
  
extensible	
  metadata.	
  
Large	
  scale	
  data	
  lake	
  planned	
  with	
  many	
  heterogeneous	
  sources	
  and	
  many	
  individual	
  
analyst	
  users.	
  
Solu4on	
  
Hadoop	
  provides	
  data	
  lake	
  infrastructure.	
  	
  Loom	
  provides	
  centralized	
  metadata	
  
management,	
  with	
  an	
  automa7on	
  framework.	
  
Data	
  Governance	
  for	
  Hadoop	
  
Bank	
  Holding	
  Company	
  
Talk Summary
•  Data	
  governance	
  is	
  cri4cal	
  to	
  building	
  a	
  successful	
  data	
  lake	
  
–  Fundamental	
  governance	
  capabili7es	
  make	
  data	
  workers	
  more	
  
produc7ve	
  
–  Solu7ons	
  for	
  mee7ng	
  regulatory	
  requirements	
  are	
  also	
  needed	
  
•  Teradata	
  Loom	
  provides	
  required	
  data	
  cataloging	
  and	
  lineage	
  
capabili4es	
  to	
  make	
  hadoop	
  users	
  more	
  produc4ve	
  
•  RainStor	
  provides	
  advanced	
  archiving	
  solu4on	
  
•  ThinkBig	
  Data	
  Lake	
  provides	
  the	
  complete	
  package	
  
Stop	
  by	
  our	
  booth	
  for	
  
more	
  details	
  	
  
We	
  Love	
  Feedback	
  
Questions/Comments
Email: kiran.kamreddy@teradata.com
Follow Me
Twitter @kirankamreddy
Rate This Session
with the PARTNERS Mobile App
Remember To Share Your Virtual Passes
Follow Teradata 2015 PARTNERS
www.teradata-partners.com/social

More Related Content

What's hot

Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesDenodo
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecturemark madsen
 
Designing the Next Generation Data Lake
Designing the Next Generation Data LakeDesigning the Next Generation Data Lake
Designing the Next Generation Data LakeRobert Chong
 
Data Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopData Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopDataWorks Summit
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data LakeVMware Tanzu
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success DataWorks Summit/Hadoop Summit
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesDataWorks Summit
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefitsRicky Barron
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Lviv Startup Club
 
Creating a Modern Data Architecture
Creating a Modern Data ArchitectureCreating a Modern Data Architecture
Creating a Modern Data ArchitectureZaloni
 
Data Governance Initiative
Data Governance InitiativeData Governance Initiative
Data Governance InitiativeDataWorks Summit
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake ArchitectureDATAVERSITY
 
DAMA Chicago - Ensuring your data lake doesn’t become a data swamp
DAMA Chicago - Ensuring your data lake doesn’t become a data swampDAMA Chicago - Ensuring your data lake doesn’t become a data swamp
DAMA Chicago - Ensuring your data lake doesn’t become a data swampNVISIA
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data LakeMetroStar
 
Big Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksBig Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksAmazon Web Services
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architectureMilos Milovanovic
 

What's hot (20)

Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
 
Designing the Next Generation Data Lake
Designing the Next Generation Data LakeDesigning the Next Generation Data Lake
Designing the Next Generation Data Lake
 
Data Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopData Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise Hadoop
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefits
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
 
Creating a Modern Data Architecture
Creating a Modern Data ArchitectureCreating a Modern Data Architecture
Creating a Modern Data Architecture
 
Data Governance Initiative
Data Governance InitiativeData Governance Initiative
Data Governance Initiative
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Beyond TCO
Beyond TCOBeyond TCO
Beyond TCO
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
Solving Big Data Problems using Hortonworks
Solving Big Data Problems using Hortonworks Solving Big Data Problems using Hortonworks
Solving Big Data Problems using Hortonworks
 
DAMA Chicago - Ensuring your data lake doesn’t become a data swamp
DAMA Chicago - Ensuring your data lake doesn’t become a data swampDAMA Chicago - Ensuring your data lake doesn’t become a data swamp
DAMA Chicago - Ensuring your data lake doesn’t become a data swamp
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake
 
Big Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksBig Data & Data Lakes Building Blocks
Big Data & Data Lakes Building Blocks
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 

Viewers also liked

Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...Avinash Ramineni
 
Data Lakes: 8 Enterprise Data Management Requirements
Data Lakes: 8 Enterprise Data Management RequirementsData Lakes: 8 Enterprise Data Management Requirements
Data Lakes: 8 Enterprise Data Management RequirementsSnapLogic
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldDataWorks Summit/Hadoop Summit
 
Data Lake: A simple introduction
Data Lake: A simple introductionData Lake: A simple introduction
Data Lake: A simple introductionIBM Analytics
 
The Emerging Data Lake IT Strategy
The Emerging Data Lake IT StrategyThe Emerging Data Lake IT Strategy
The Emerging Data Lake IT StrategyThomas Kelly, PMP
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceHortonworks
 
Introduction to metadata management
Introduction to metadata managementIntroduction to metadata management
Introduction to metadata managementOpen Data Support
 

Viewers also liked (10)

Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
 
Data Lakes: 8 Enterprise Data Management Requirements
Data Lakes: 8 Enterprise Data Management RequirementsData Lakes: 8 Enterprise Data Management Requirements
Data Lakes: 8 Enterprise Data Management Requirements
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data World
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
 
Metadata Workshop
Metadata WorkshopMetadata Workshop
Metadata Workshop
 
Data Lake: A simple introduction
Data Lake: A simple introductionData Lake: A simple introduction
Data Lake: A simple introduction
 
The Emerging Data Lake IT Strategy
The Emerging Data Lake IT StrategyThe Emerging Data Lake IT Strategy
The Emerging Data Lake IT Strategy
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
 
Introduction to metadata management
Introduction to metadata managementIntroduction to metadata management
Introduction to metadata management
 

Similar to Your Data Lake Needs Governance to Avoid Becoming a Data Swamp

Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentationmlang222
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarioskcmallu
 
Navigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryNavigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryDataWorks Summit/Hadoop Summit
 
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupScott Mitchell
 
BAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneyBAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneySai Paravastu
 
Hadoop Perspectives for 2017
Hadoop Perspectives for 2017Hadoop Perspectives for 2017
Hadoop Perspectives for 2017Precisely
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which DataWorks Summit
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game ChangerCaserta
 
Oracle Modern Information Management Platform - v1.0
Oracle Modern Information Management Platform - v1.0Oracle Modern Information Management Platform - v1.0
Oracle Modern Information Management Platform - v1.0Bratamay Majumder
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceCaserta
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedDunn Solutions Group
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Big data beyond the hype may 2014
Big data beyond the hype may 2014Big data beyond the hype may 2014
Big data beyond the hype may 2014bigdatagurus_meetup
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on HadoopCaserta
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...MapR Technologies
 
Scaling Data overview
Scaling Data overviewScaling Data overview
Scaling Data overviewWade Malone
 
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentCaserta
 
Govern This! Data Discovery and the application of data governance with new s...
Govern This! Data Discovery and the application of data governance with new s...Govern This! Data Discovery and the application of data governance with new s...
Govern This! Data Discovery and the application of data governance with new s...Cloudera, Inc.
 

Similar to Your Data Lake Needs Governance to Avoid Becoming a Data Swamp (20)

Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentation
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
 
Navigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryNavigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data Discovery
 
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
 
BAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneyBAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, Sydney
 
Hadoop Perspectives for 2017
Hadoop Perspectives for 2017Hadoop Perspectives for 2017
Hadoop Perspectives for 2017
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Oracle Modern Information Management Platform - v1.0
Oracle Modern Information Management Platform - v1.0Oracle Modern Information Management Platform - v1.0
Oracle Modern Information Management Platform - v1.0
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They Need
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Big data beyond the hype may 2014
Big data beyond the hype may 2014Big data beyond the hype may 2014
Big data beyond the hype may 2014
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
 
Scaling Data overview
Scaling Data overviewScaling Data overview
Scaling Data overview
 
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business Environment
 
Govern This! Data Discovery and the application of data governance with new s...
Govern This! Data Discovery and the application of data governance with new s...Govern This! Data Discovery and the application of data governance with new s...
Govern This! Data Discovery and the application of data governance with new s...
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Your Data Lake Needs Governance to Avoid Becoming a Data Swamp

  • 1. Anarchy Doesn’t Scale: Your Data Lake Needs Governance Kiran Kamreddy Sr. Product Marketing Manager Hadoop Portfolio, Teradata
  • 2. One Data Lake: Many Definitions
  • 3. One Data Lake: Many Definitions A  centralized  repository  of  raw  data  into  which  many  data-­‐producing   streams  flow  and  from  which  downstream  facili7es  may  draw.         Informa4on   Sources   Data  Lake   Downstream   Facili4es   Data  Variety   is  the  driving  factor  in  building  a  Data  Lake  
  • 4. Data Lake Maturity & Risks Data  Lake  ini4a4ves  usually  start  small     –  Single  business  unit   –  Few  data  sources,  applica7ons   –  Few  data  scien7sts   –  Exclusive  domain       Informal,  Tribal  ways  of  sharing       –  Handshakes   –  Over  the  desk  conversa7ons   –  No  processes,  mechanisms   –  Everyone  is  free  to  use     Anarchy  works  well  
  • 5. Data Lake Maturity & Risks Data  Lake  ini4a4ves  grow  popular   –  More  Business  Units   –  More  Data  Sources     –  More  Applica7ons   –  More  Users     …. grow into more complex environments Without  proper  governance  mechanisms       Data  lakes  risk  turning  data  swamps   Consistency Quality Security
  • 6. What can Governance do for my data lake? Fundamental  capabili4es  for  tracking,  organizing  and  understanding  data   - Where  did  my  data  come  from  ?  How  is  it  being  transformed  ?     - Track  usage,  resolve  anomalies,  visualize,  op7mize  and  clarify  data  lineage   -   Search  and  access  data  (  not  only  browse  )   - Assess  data  quality  and  fitness  for  purpose   Specialized capabilities to meets security& compliance requirements -  Govern who can/cannot access the data and who cannot -  Data life cycle management, archiving and retention policies -  Auditing, compliance Data  Governance  first  approach  to  prevent  turning  to  Data  swamps     RetrofiIng  data  governance  is  not  feasible    
  • 7. Governance and Productivity Governance  should  supports  day-­‐to-­‐day  use  of  data     –  Data  workers  need  a  strong  understanding     –  Roles  for  data  stewards,  data  owners,  data  analysts/scien7sts  need  to   be  assigned     Operational Metadata is critical to understanding –  Where did it come from? –  What is the environment? – landing zone, OS, Line of Business –  What processes touched my data?did you lose any data? – Checksums etc. –  When did the data get ingested, transformed? –  Did it get exported, when, where how will it be used (organizational)? Provision consistent ingest methods that track operational metadata
  • 8. What about Security & Compliance? •  Compliance  and  Regulatory       –  Capture,  store  and  move  data       –  Sarbanes-­‐Oxley,  HIPAA,  Basel  II     •  Security       –  Authoriza7on,  Authen7ca7on     –  Handling  sensi7ve  data     •  Audi4ng     –  Recording  every  aUempt  to  access   •  Archive  &  Reten4on   –  Data  life  cycle  policies    
  • 9. Governance Challenges on Hadoop Data Lakes Hadoop  is  different  to  DW   • Scale:  High  volumes  of  data,  mul7ple  user  access   • Variety:  Schema-­‐on-­‐read,  mul7ple  formats  of  data   • Mul7ple  storage  layers  (HDFS,  Hive,  HBase)   • Many  processing  engines  (MR,  Hive,  Pig,  Impala,  Drill…)   • Many  workflow  engines/schedules  (Cron,  Oozie,  Falcon…)   • Holis7c  view  of  data  with  required  context  is  difficult       Hadoop needs less stringent, more flexible mechanisms Balance agility and self service with processes, rules, regulations Maintain Governance without losing Hadoop's power
  • 10. •  Apache  Hadoop  has  built-­‐in  support  for  these  capabili4es   •  HCatalog,  Hive,  etc   •  Hadoop  distribu4on  vendors  have  all  made  improvements  in   each  of  these  areas   •  Navigator,  Ranger,  etc   •  Vendors  provide  specialized  capabili4es  in  each  area  that  go   beyond  what  a  Hadoop  distribu4on  provides   Multiple Approaches….. but incomplete
  • 11. One approach for Data Governance on Hadoop Teradata  Loom®  –  Integrated  Data  Management  for  Hadoop    Metadata  management,  Lineage,  Data  Wrangling    Automa7c  data  cataloging,  data  profiling  and  sta7s7cs  genera7on   Teradata Rainstor – Data Archiving Structured data archiving in Hadoop with robust security Compliance and auditing ThinkBig – Hadoop professional services Hadoop Data Lake – packaged service/product offering to build and deploy high-quality, governed data lakes
  • 12. Teradata Loom Find  and  Understand  Your  Data   •  Ac7veScan   –  Data  cataloging   –  Event  triggers   –  Job  detec7on  and  lineage  crea7on   –  Data  profiling  (sta7s7cs)   •  Workbench  and  Metadata  Registry   –  Data  explora7on  and  discovery   –  Technical  and  business  metadata   –  Data  sampling  and  previews   –  Lineage  rela7onships   –  Search  over  metadata   –  REST  API  –  easily  integrate  third-­‐party   apps   Prepare  Your  Data   •  Data  Wrangling   –  Self-­‐service,  interac7ve  data  wrangling  for   Hadoop   –  Metadata  tracked     •  HiveQL   –  Joins,  unions,  aggrega7ons,  UDFs   –  Metadata  tracked  in  Loom  
  • 13. Teradata Rainstor –  Retain  data  online  that  is  queryable  for  an  indefinite  period   –   Re7re  data  that  are  no  longer  required  with  auto-­‐expira4on  policies   –   Comply  with  strict  government  rules  and  regula7ons   –   Retain  the  metadata  as  it  was  originally  captured   –   Store  tamper-­‐proof,  immutable  (unchangeable)  data   –   Maintain  availability  to  data  as  RDBMS  versions  change  or  expire   –  Compression,  MPP  SQL  query  engine,  Encryp7on,  Audi7ng  
  • 14. Big Data Services from ThinkBig Big Data Strategy & Roadmap Data Lake Implementation Analytics & Data Science Training & Support Lack of Clear Big Data Strategy Difficulty Turning Data into Action Missing Big Data Skills   Focused  exclusively  on  tying  Hadoop  and  big  data  solu7ons  to  measurable   business  value     Data Scattered & Not Well Understood
  • 15. Impact   •  Co-­‐loca4on  of  data  provides  more  efficient  workflow  for  analysts   •  Hadoop  provides  scalability  at  a  lower  cost  than  tradi4onal  systems   •  Develop  new  insights  to  drive  business  value   Situa4on       Problem   Lack  of  centralized  metadata  repository  makes  data  governance  impossible.     Enterprise  must  have  transparency  into  data  in  the  cluster  and  capability  to  define   extensible  metadata.   Large  scale  data  lake  planned  with  many  heterogeneous  sources  and  many  individual   analyst  users.   Solu4on   Hadoop  provides  data  lake  infrastructure.    Loom  provides  centralized  metadata   management,  with  an  automa7on  framework.   Data  Governance  for  Hadoop   Bank  Holding  Company  
  • 16. Talk Summary •  Data  governance  is  cri4cal  to  building  a  successful  data  lake   –  Fundamental  governance  capabili7es  make  data  workers  more   produc7ve   –  Solu7ons  for  mee7ng  regulatory  requirements  are  also  needed   •  Teradata  Loom  provides  required  data  cataloging  and  lineage   capabili4es  to  make  hadoop  users  more  produc4ve   •  RainStor  provides  advanced  archiving  solu4on   •  ThinkBig  Data  Lake  provides  the  complete  package   Stop  by  our  booth  for   more  details    
  • 17. We  Love  Feedback   Questions/Comments Email: kiran.kamreddy@teradata.com Follow Me Twitter @kirankamreddy Rate This Session with the PARTNERS Mobile App Remember To Share Your Virtual Passes Follow Teradata 2015 PARTNERS www.teradata-partners.com/social