DATA LAKE ARCHITECTURE
Monojit Basu, Founder & Director
TechYugadi IT Solutions & Consulting
OSI DAYS 2016, BANGALORE
Data Never Sleeps
 Every minute
 Facebook users share 216,302 photos
 Dropbox users upload 833,333 new files
 YouTube users share 400 hours of new video
 Twitter users send 350,000 tweets
 A Boeing 737 Aircraft in flight generates 40 TB of data
EDW vs Data Lake
 Data Lake is built on the premise that every drop of
data is valuable
 It’s a place for capturing and exploring huge
volumes of raw data that a business generates
 Explorers are diverse: business analysts, data
scientists, …
 even business managers (using self-service)
 Goals of exploration may be loosely defined
EDW vs Data Lake
 EDW stores filtered and processed data
 For pre-meditated usage scenarios
 Traditionally structured in the form of ‘cubes’
 Analogy
 Difference between a college library (focused on
curriculum) and the US Library of Congress
EDW vs Data Lake
 Schema-on-Read (Data Lake)
[Diagram: sources (XML, JSON, CSV, PDF; trading partner REST API, invoicing, orders DB) land as-is in the DATA LAKE; CRM analytics, SCM analytics and the recommendation engine each READ / EXTRACT from it]
 Schema-on-Write (EDW)
[Diagram: the same sources pass through ETL into the ENTERPRISE DATA WAREHOUSE, which serves Sales, Operations and Marketing]
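A minimal Python sketch of schema-on-read, under stated assumptions: the in-memory `RAW_STORE`, the file names and the order fields are all hypothetical stand-ins for raw files in a lake. Nothing is validated on write; types are imposed only when a reader asks for them.

```python
import csv
import io
import json

# Raw payloads are kept exactly as received; no schema is enforced on write.
# (Hypothetical sample data standing in for files in the lake.)
RAW_STORE = [
    ("orders.json", '{"order_id": 1, "amount": "19.90", "region": "south"}'),
    ("orders.csv", "order_id,amount,region\n2,5.10,north\n"),
]


def read_orders(raw_store):
    """Apply a schema (order_id: int, amount: float) only at read time."""
    for name, payload in raw_store:
        if name.endswith(".json"):
            rows = [json.loads(payload)]
        elif name.endswith(".csv"):
            rows = list(csv.DictReader(io.StringIO(payload)))
        else:
            continue  # formats this reader does not understand are skipped
        for row in rows:
            yield {"order_id": int(row["order_id"]),
                   "amount": float(row["amount"])}


orders = list(read_orders(RAW_STORE))
```

A different consumer could read the same raw payloads with a different schema (for example, keeping `region`), which is the flexibility the EDW's ETL step gives up.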
Why Think of Data Lake
 Business Drivers
 Diverse sources of data: transactions, interactions, human
and machine-generated
 Routine analysis not enough – deeper insights lead to
differentiation
 Agile and Adaptive Business Models
 Technology Drivers
 Fast, cheap and scalable storage (e.g. HDFS)
 Diverse data-processing engines (e.g. NoSQL)
 Infinitely elastic processing power (cluster of commodity
servers)
Application Domains
 Healthcare
 IoT
 E-Governance
 Insurance
What Features Should It Support
 Scalable Storage Layer
 3 V’s of Data Inflow (volume, velocity, variety)
 Data Discovery
 Data Governance
 Pluggable and Extensible Analytics
 Elastic Processing Power
 Multi-stakeholder and Multi-tenant Access
Building It On Top Of Hadoop
 Data Lake doesn’t have to be Hadoop
 But Hadoop has proven its prowess on planet-scale
data, in terms of:
 Data Volumes
 Elastic Data Processing Power
 Probably the idea of a Data Lake was inspired by
Hadoop
 Naturally most often a Data Lake Architecture is
built around Hadoop
Storage Capacity: Metrics
 Normally HDFS scales even with one NameNode
 Unless you have hundreds of petabytes of data
 But you need to monitor the usage pattern
 Are you creating too many small files (what’s the
average number of blocks per file)?
 How much RAM would you need for the NameNode? (a
large heap could mean longer GC pauses)
 Internal Load (heartbeats and block reports) vs
External Get and Create Requests
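The two sizing questions above can be sketched as a back-of-envelope calculation in Python. The figure of about 150 bytes of NameNode heap per namespace object (file or block) is an assumed, commonly quoted rule of thumb, not a measured value, and the file and block counts are hypothetical; directories are ignored.

```python
# Assumed rule of thumb: ~150 bytes of NameNode heap per file or block object.
BYTES_PER_OBJECT = 150


def namenode_estimate(num_files, num_blocks, bytes_per_object=BYTES_PER_OBJECT):
    """Return (average blocks per file, estimated heap in GiB)."""
    avg_blocks_per_file = num_blocks / num_files
    heap_bytes = (num_files + num_blocks) * bytes_per_object
    return avg_blocks_per_file, heap_bytes / 1024 ** 3


# Hypothetical cluster: 100 million files stored in 120 million blocks.
avg_blocks, heap_gib = namenode_estimate(num_files=100_000_000,
                                         num_blocks=120_000_000)
print(f"{avg_blocks:.2f} blocks per file, about {heap_gib:.1f} GiB of heap")
```

An average close to one block per file suggests many files smaller than a block, which is the small-files pattern the slide warns about.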
Storage Capacity: HDFS Federation
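 Single NameNode vs NameNode Federation
[Diagram, single NameNode: an MR client sends Get / Create requests to one NameNode, which also carries the internal load from DataNode1 … DataNodeN]
[Diagram, federation: NameNode1 and NameNode2 manage Block Pool1 and Block Pool2 respectively, both stored across the shared DataNode1 … DataNodeN]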
Storage Capacity: Availability
 NameNode Federation does not ensure HA
 Even if you don’t go for Federation, configuring high
availability is recommended
 Essentially set up a Standby NameNode
 Active NameNode shares state with the Standby
 Using the Quorum Journal Manager (a set of JournalNodes), or
 Simply using an NFS-mounted shared file directory
 Synchronization frequency is configurable
Compute Capacity
 Hadoop 1.0 supported 1 type of Job (Map-Reduce)
 MR jobs were scheduled by a ‘JobTracker’ process
 Hadoop 2.0 offers a Resource Manager (YARN)
 It is intended to replace JobTracker and raise the
Hadoop cluster size limit from 3,000 to 10,000 nodes
 But more important: YARN supports different types of
Jobs including MR to run on Hadoop
 Hence Data Lake should preferably be built on YARN
Compute Capacity: YARN
 YARN ARCHITECTURE
[Diagram: an MR client and a Spark client submit jobs to the Resource Manager. Node 1 runs a Node Manager hosting the MR App Master and a Spark task; Node 2 runs a Node Manager hosting the Spark App Master and an MR task]
Data Inflow
 The goal is to build a pipeline into Hadoop-native
data stores
 HDFS, mandatorily
 Hive and HBase, preferably
 Considering the variety of data formats that a Data
Lake must accommodate:
 A general purpose Data Integration Tool must be chosen
 For example, Pentaho Data Integration (PDI)
Data Inflow
 Pipelines specialized for specific data formats may
also be plugged in
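[Diagram: .txt files enter through a flat file input connector and .json data through a web service input connector, both feeding an HDFS output connector; alongside, Sqoop loads from databases and Flume from logs into HDFS]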
Data Inflow: Streaming Data
 Streaming Data may be processed in two ways
 Simply store in the Data Lake for future analysis
 Interesting tweets for building a sentiment analysis model
 Store and Forward to a Real-time Analytics Engine
 Even as real-time processing occurs, the source data in
raw format may be useful in future
 To build / update machine learning models, for example
in fraud analytics
[Diagram: two paths for streaming data into HDFS: STORE, and STORE & FORWARD]
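The two streaming paths can be sketched in Python as follows. This is only an illustration: a local JSON-lines file stands in for HDFS, and the `flag_large_payment` rule is a hypothetical placeholder for a real-time analytics engine.

```python
import json
import tempfile
from pathlib import Path


def store(event, raw_log):
    """Append the raw event, untouched, to the landing file (stand-in for HDFS)."""
    with raw_log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def store_and_forward(event, raw_log, forward):
    """Persist the raw event first, then hand it to a real-time consumer."""
    store(event, raw_log)
    forward(event)


raw_log = Path(tempfile.mkdtemp()) / "events.jsonl"
alerts = []


def flag_large_payment(event):
    # Hypothetical real-time rule; a real engine would apply a trained model.
    if event.get("amount", 0) > 1000:
        alerts.append(event["id"])


# Path 1: store only (e.g. tweets kept for a future sentiment model).
store({"id": "t1", "text": "great product"}, raw_log)
# Path 2: store and forward (e.g. payments checked for fraud as they arrive).
store_and_forward({"id": "p1", "amount": 5000}, raw_log, flag_large_payment)
store_and_forward({"id": "p2", "amount": 20}, raw_log, flag_large_payment)

stored = raw_log.read_text(encoding="utf-8").splitlines()
```

Writing before forwarding means the raw event survives even if the real-time consumer fails, so it remains available for building or updating models later.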
Data Analytics
 A Data Lake built on HDFS will most likely use a
Hadoop cluster to analyze data
 Sometimes the result of the analysis may be stored
back into HDFS (or possibly Hive / HBase)
 But Data Visualization and Reporting / Dashboards
may work only on structured data cubes
 Hence on the Analytics side, a Data Lake may need
outflow paths from HDFS into structured data stores
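As a rough illustration of such an outflow path, the Python sketch below rolls analyzed records up into a small cube-like structure keyed by dimensions. The records, dimension names and the `roll_up` helper are hypothetical; a real deployment would load the result into an OLAP store rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical analysis output, as it might be read back from HDFS.
analyzed = [
    {"region": "south", "month": "2016-10", "amount": 120.0},
    {"region": "south", "month": "2016-10", "amount": 80.0},
    {"region": "north", "month": "2016-10", "amount": 50.0},
]


def roll_up(records, dimensions, measure):
    """Sum one measure over every combination of the given dimensions."""
    cube = defaultdict(float)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        cube[key] += record[measure]
    return dict(cube)


cube = roll_up(analyzed, ("region", "month"), "amount")
```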
Plugging In Data Analytics Engine
 Jaspersoft Reporting with HDFS
[Diagram: analyzed data in HDFS flows through Jaspersoft ETL (HDFS input connector) into an OLAP cube, which feeds the Jaspersoft reporting engine]
Data Governance
 Data Lake does not conform to a schema
 Data Governance makes it possible to make sense
of the data
 To both analysts and administrators
 Data Governance is a fairly open-ended subject
 Vendors offer different techniques to solve each
governance use case
 But common patterns are emerging across the landscape
Data Governance: Analyst Use Cases
 To search and retrieve ‘relevant’ data for analysis
 Common Techniques
 Metadata Management
 Data tagging
 Text Search
 Data Classification
 Metadata can include technical as well as business
information (linked to a Business Glossary)
 Data tags are often created by users collaboratively
Data Governance: Admin Use Cases
 Lineage: track data flow from source to end applications
 Data Life-cycle Management: retain, replicate and archive based on usage
 Auditing: track access and usage information for compliance
Automated Metadata Generation
 As data is ingested, suitable attributes are extracted
and stored into a metadata repository
 Data type (XML, PDF, text, etc.)
 Data size
 Creation and Last Access time, etc.
 Even data tags can be inserted at the time of ingest
 Unconditionally, e.g. ‘sales’
 Conditionally, e.g. ‘holiday_sales’
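A minimal Python sketch of metadata extraction at ingest, under stated assumptions: the list `repository` stands in for a metadata repository, the extension-to-type map is illustrative, and the conditional rule (file name contains "holiday") is a hypothetical example of a tagging condition. Modification time is recorded because true creation time is platform-dependent.

```python
import os
import tempfile
from datetime import datetime, timezone

# Illustrative mapping; a real ingest tool would inspect content as well.
TYPE_BY_EXTENSION = {".xml": "XML", ".pdf": "PDF", ".txt": "text",
                     ".json": "JSON", ".csv": "CSV"}


def _iso(timestamp):
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def extract_metadata(path, source_tag):
    """Collect technical attributes and tags for one ingested file."""
    info = os.stat(path)
    name = os.path.basename(path).lower()
    tags = [source_tag]                       # unconditional tag
    if "holiday" in name:                     # hypothetical condition
        tags.append("holiday_" + source_tag)  # conditional tag
    return {
        "path": path,
        "data_type": TYPE_BY_EXTENSION.get(os.path.splitext(name)[1], "unknown"),
        "size_bytes": info.st_size,
        "modified": _iso(info.st_mtime),
        "last_access": _iso(info.st_atime),
        "tags": tags,
    }


repository = []  # stand-in for a metadata repository
sample = os.path.join(tempfile.mkdtemp(), "holiday_orders_2016.csv")
with open(sample, "wb") as fh:
    fh.write(b"order_id,amount\n1,19.90\n")

metadata = extract_metadata(sample, "sales")
repository.append(metadata)
```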
Apache Atlas For Data Governance
Source: http://atlas.incubator.apache.org/Architecture.html
Data Access And Security
 HDFS can be secured using
 Kerberos for authentication (this must be enabled; the out-of-the-box setting is simple authentication), and
 Unix-style file permissions for authorization
 In a large data repository with diverse stakeholders
you may need more control
 If so, a couple of products may be considered for
augmenting Data Security:
 Apache Knox
 Apache Ranger
Data Access And Security
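[Diagram: Knox provides perimeter security in front of the HDFS cluster; within it, Kerberos handles authentication, file permissions (rwx) handle authorization, and Ranger provides federated access control across nodes 1 to N]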
Why Use Ranger
 Supports Federated Access Control
 Can fall-back upon default HDFS file permissions
 Manages Access Control over several Hadoop-
based components, like Hive, Storm, etc.
 Advanced fine-grained access control, like
 Deny policies for user or group
 Tag-based access control, where a collection of
resources share a common access tag
 For example, a few columns in a Hive table and
certain files in HDFS could share a tag: ‘internal_audit’
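The idea of tag-based access control with deny policies can be modelled in a few lines of Python. This is an illustrative model only, not Ranger's API or policy format: the `Policy` class, the resource names, the principals, and the assumption that a deny overrides an allow are all hypothetical choices for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    tag: str
    principal: str  # user or group name
    allow: bool     # False models a deny policy


# Hypothetical resources: a Hive column and an HDFS file share one tag.
RESOURCE_TAGS = {
    "hive:finance.ledger.amount": {"internal_audit"},
    "hdfs:/data/audit/2016.csv": {"internal_audit"},
    "hdfs:/data/public/readme.txt": set(),
}

POLICIES = [
    Policy("internal_audit", "auditors", allow=True),
    Policy("internal_audit", "contractor_bob", allow=False),
]


def tag_decision(principals, resource):
    """Return True or False if a tag policy decides, else None.

    None means no tag policy applies, so the caller would fall back
    to the default file permissions.
    """
    tags = RESOURCE_TAGS.get(resource, set())
    matching = [p for p in POLICIES
                if p.tag in tags and p.principal in principals]
    if any(not p.allow for p in matching):
        return False  # assumption in this sketch: deny wins over allow
    if any(p.allow for p in matching):
        return True
    return None


auditor = tag_decision({"alice", "auditors"}, "hive:finance.ledger.amount")
denied = tag_decision({"contractor_bob", "auditors"}, "hdfs:/data/audit/2016.csv")
untagged = tag_decision({"alice"}, "hdfs:/data/public/readme.txt")
```

One policy on the `internal_audit` tag covers both the Hive column and the HDFS file, so access rules do not have to be repeated per resource.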
Steps To Build A Data Lake
 Set up a scalable data storage layer
 Set up a Compute Cluster capable of running a
diverse mix of Jobs
 Create data flow pipeline(s) for batch jobs
 Create data flow pipeline(s) for streaming data
Steps To Build A Data Lake
 Plug in one or more Analytics Engine(s)
 Set up mechanisms for efficient data discovery
and data governance
 Implement Data Access Controls
 Design a Monitoring Infrastructure for Jobs and
Resources (not covered today)
Building A Data Lake: Starting Points
 Set up a scalable data storage layer: HDFS
 Set up a Compute Cluster capable of running a
diverse mix of Jobs: YARN
 Create data flow pipeline(s) for batch jobs:
Pentaho HDFS Connector
 Create data flow pipeline(s) for streaming data:
Pentaho Messaging Connector
Building A Data Lake: Starting Points
 Plug in one or more Analytics Engine(s): Pentaho
Reporting and Spark MLlib
 Set up mechanisms for efficient data discovery
and data governance: Apache Atlas
 Implement Data Access Controls: Apache Ranger
 Design a Monitoring Infrastructure for Jobs and
Resources: Apache Ambari
Taking The Plunge
 Do you need to plan for and build a Data Lake?
 Ask yourself: what fraction of your data are you
analyzing today?
 What value might the unused data offer?
 For marketing campaigns
 For product lifecycle management
 For regulatory compliance, and so on …
 Engage your stakeholders from different LoBs
 Is decision making being hampered by lack of data?
Taking The Plunge
 Start small: There is a learning curve
 Storing data is not enough: maintaining and
stewarding the data is all-important
 Design for extensibility and pluggability
 Minimize vendor lock-in
 Be open to change as you scale your infrastructure
monojit@techyugadi.com

More Related Content

What's hot

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
Prakash Chockalingam
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Databricks
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
Amazon Web Services
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Cathrine Wilhelmsen
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Databricks
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
Amazon Web Services
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
Databricks
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
Rodney Joyce
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
ELT vs. ETL - How they’re different and why it matters
ELT vs. ETL - How they’re different and why it mattersELT vs. ETL - How they’re different and why it matters
ELT vs. ETL - How they’re different and why it matters
Matillion
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
DataScienceConferenc1
 

What's hot (20)

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
ELT vs. ETL - How they’re different and why it matters
ELT vs. ETL - How they’re different and why it mattersELT vs. ETL - How they’re different and why it matters
ELT vs. ETL - How they’re different and why it matters
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 

Viewers also liked

Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
mark madsen
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Hortonworks
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
SnapLogic
 
The technology of the business data lake
The technology of the business data lakeThe technology of the business data lake
The technology of the business data lake
Capgemini
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
Milos Milovanovic
 
Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centre
jatin batra
 
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
RSD
 
R language
R languageR language
R language
LearningTech
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
Peng Cheng
 
Taming the Data Lake with Scalable Metrics Model Framework
Taming the Data Lake with Scalable Metrics Model FrameworkTaming the Data Lake with Scalable Metrics Model Framework
Taming the Data Lake with Scalable Metrics Model Framework
Ramkumar Ravichandran
 
Industrial internet big data uk market study
Industrial internet big data uk market studyIndustrial internet big data uk market study
Industrial internet big data uk market study
Sari Ojala
 
Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...
Eugene Yan Ziyou
 
The concept of Datalake with Hadoop
The concept of Datalake with HadoopThe concept of Datalake with Hadoop
The concept of Datalake with Hadoop
Avkash Chauhan
 
Social network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG communitySocial network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG community
Eugene Yan Ziyou
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
DataWorks Summit
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
Bigstep
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Institute of Contemporary Sciences
 
Big model, big data
Big model, big dataBig model, big data
Big model, big data
Christian Robert
 
AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)
Eugene Yan Ziyou
 

Viewers also liked (20)

Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
The technology of the business data lake
The technology of the business data lakeThe technology of the business data lake
The technology of the business data lake
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 
Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centre
 
search engines
search enginessearch engines
search engines
 
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
 
R language
R languageR language
R language
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Taming the Data Lake with Scalable Metrics Model Framework
Taming the Data Lake with Scalable Metrics Model FrameworkTaming the Data Lake with Scalable Metrics Model Framework
Taming the Data Lake with Scalable Metrics Model Framework
 
Industrial internet big data uk market study
Industrial internet big data uk market studyIndustrial internet big data uk market study
Industrial internet big data uk market study
 
Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...
 
The concept of Datalake with Hadoop
The concept of Datalake with HadoopThe concept of Datalake with Hadoop
The concept of Datalake with Hadoop
 
Social network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG communitySocial network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG community
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 
Big model, big data
Big model, big dataBig model, big data
Big model, big data
 
AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)
 

Similar to Datalake Architecture

Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
CCG
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
James Serra
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
Stéphane Fréchette
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
javier ramirez
 
Cloud computing major project
Cloud computing major projectCloud computing major project
Cloud computing major project
ayk115
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Hadoop Big data Solution Provider
Hadoop Big data Solution ProviderHadoop Big data Solution Provider
Hadoop Big data Solution Provider
Agileiss
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
Amr Awadallah
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
IT Strategy Group
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposal
DataWorks Summit
 
Data ingestion
Data ingestionData ingestion
Data ingestion
nitheeshe2
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
eduarderwee
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
datastack
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
Supratim Ray
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
Hortonworks
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
Shubham Tagra
 

Similar to Datalake Architecture (20)

Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Datalake Architecture

  • 1. DATA LAKE ARCHITECTURE
      • Monojit Basu, Founder & Director, TechYugadi IT Solutions & Consulting
      • OSI DAYS 2016, BANGALORE
  • 2. Data Never Sleeps
      • Every minute
          • Facebook users share 216,302 photos
          • Dropbox users upload 833,333 new files
          • YouTube users share 400 hours of new video
          • Twitter users send 350,000 tweets
      • A Boeing 737 aircraft in flight generates 40 TB of data
  • 3. EDW vs Data Lake
      • A Data Lake is built on the premise that every drop of data is valuable
      • It’s a place for capturing and exploring huge volumes of raw data that a business generates
      • Explorers are diverse: business analysts, data scientists, …
          • even business managers (using self-service)
      • Goals of exploration may be loosely defined
  • 4. EDW vs Data Lake
      • An EDW stores filtered and processed data
          • For premeditated usage scenarios
          • Traditionally structured in the form of ‘cubes’
      • Analogy
          • The difference between a college library (focused on curriculum) and the US Library of Congress
  • 5. EDW vs Data Lake
      • Schema-on-Read (Data Lake) vs Schema-on-Write (EDW)
      • [Diagram: DATA LAKE] Sources (XML, JSON, CSV, PDF, trading partner REST API, invoicing, orders DB) flow into the lake; CRM analytics, SCM analytics and a reco engine each READ / EXTRACT from it
      • [Diagram: ENTERPRISE DATA WAREHOUSE] The same sources pass through ETL into the warehouse, which serves sales, operations and marketing
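The schema-on-read versus schema-on-write contrast on this slide can be illustrated with a small Python sketch. It is not taken from the deck: the record fields (`order_id`, `amount`, `channel`) and all function names are hypothetical, and plain lists stand in for the warehouse and the lake.

```python
import json

# Hypothetical raw records, as different sources might emit them.
RAW_RECORDS = [
    '{"order_id": 1, "amount": "19.99", "channel": "web"}',
    '{"order_id": 2, "amount": "5.00"}',
    '{"order_id": "n/a", "note": "malformed upstream record"}',
]

def load_schema_on_write(raw_lines):
    """EDW style: validate and shape each record before storing it."""
    table = []
    for line in raw_lines:
        rec = json.loads(line)
        try:
            table.append({"order_id": int(rec["order_id"]),
                          "amount": float(rec["amount"])})
        except (KeyError, ValueError):
            continue  # rejected at load time; the raw record is not kept
    return table

def load_schema_on_read(raw_lines):
    """Data-lake style: store every record exactly as it arrived."""
    return list(raw_lines)

def read_with_schema(lake, fields):
    """Apply a schema only when a consumer reads; missing fields become None."""
    for line in lake:
        rec = json.loads(line)
        yield {name: rec.get(name) for name in fields}

warehouse = load_schema_on_write(RAW_RECORDS)
lake = load_schema_on_read(RAW_RECORDS)
# A question nobody planned for at load time can still be asked of the lake.
channels = list(read_with_schema(lake, ["order_id", "channel"]))
```

In this sketch the warehouse keeps only the two records that fit its fixed schema and drops the `channel` field, while the lake keeps all three records and lets each reader decide which fields matter.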
  • 6. Why Think of Data Lake
      • Business Drivers
          • Diverse sources of data: transactions, interactions, human and machine-generated
          • Routine analysis is not enough; deeper insights lead to differentiation
          • Agile and Adaptive Business Models
      • Technology Drivers
          • Fast, cheap and scalable storage (e.g. HDFS)
          • Diverse data-processing engines (e.g. NoSQL)
          • Highly elastic processing power (cluster of commodity servers)
  • 7. Application Domains
      • Healthcare
      • IoT
      • E-Governance
      • Insurance
  • 8. What Features Should It Support
      • Scalable Storage Layer
      • 3 V’s of Data Inflow (volume, velocity, variety)
      • Data Discovery
      • Data Governance
      • Pluggable and Extensible Analytics
      • Elastic Processing Power
      • Multi-stakeholder and Multi-tenant Access
  • 9. Building It On Top Of Hadoop
      • A Data Lake doesn’t have to be Hadoop
      • But Hadoop has proven its prowess on planet-scale data, in terms of:
          • Data Volumes
          • Elastic Data Processing Power
      • The idea of a Data Lake was probably inspired by Hadoop
      • Naturally, a Data Lake Architecture is most often built around Hadoop
  • 10. Storage Capacity: Metrics
      • Normally HDFS scales even with one NameNode
          • Unless you have hundreds of petabytes of data
      • But you need to monitor the usage pattern
          • Are you creating too many small files (what’s the average number of blocks per file)?
          • How much RAM would you need for the NameNode? (a high value could mean longer GC pauses)
          • Internal Load (heartbeats and block reports) vs External Get and Create Requests
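The two sizing questions on this slide lend themselves to a back-of-the-envelope calculation. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block); that figure is an assumption, not something stated in the deck, and real usage depends on the Hadoop version and path lengths. Directories are ignored for simplicity, and the function names are hypothetical.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value

def avg_blocks_per_file(total_blocks, total_files):
    """A value close to 1 suggests many files smaller than one HDFS block."""
    return total_blocks / total_files

def estimate_namenode_heap_gib(total_files, total_blocks,
                               bytes_per_object=BYTES_PER_OBJECT):
    """Rough heap needed to hold file and block metadata in memory."""
    objects = total_files + total_blocks
    return objects * bytes_per_object / 1024 ** 3

# Hypothetical cluster: 100 million files stored in 120 million blocks.
ratio = avg_blocks_per_file(120_000_000, 100_000_000)
heap_gib = estimate_namenode_heap_gib(100_000_000, 120_000_000)
print(f"avg blocks/file: {ratio:.2f}, estimated heap: {heap_gib:.1f} GiB")
```

Under these assumptions a ratio near 1 together with a heap estimate in the tens of GiB would be a hint to compact small files before considering Federation.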
  • 11. Storage Capacity: HDFS Federation
      • [Diagram: Single NameNode] One NameNode serves Get / Create requests from the MR client and handles the internal load from Data Node 1 … Data Node N
      • [Diagram: NameNode Federation] NameNode 1 and NameNode 2 each own a block pool (Block Pool 1, Block Pool 2) stored across the shared Data Node 1 … Data Node N
  • 12. Storage Capacity: Availability
      • NameNode Federation does not ensure HA
      • Even if you don’t go for Federation, configuring high availability is recommended
      • Essentially, set up a Standby NameNode
      • The Active NameNode shares state with the Standby
          • Using a shared Journal Manager, or
          • Simply using an NFS-mounted shared file directory
      • Synchronization frequency is configurable
  • 13. Compute Capacity
      • Hadoop 1.0 supported one type of job (MapReduce)
          • MR jobs were scheduled by a ‘JobTracker’ process
      • Hadoop 2.0 offers a Resource Manager (YARN)
          • It is intended to replace JobTracker and raise the Hadoop cluster size limit from 3,000 to 10,000 nodes
          • But more important: YARN allows different types of jobs, including MR, to run on Hadoop
      • Hence a Data Lake should preferably be built on YARN
  • 14. Compute Capacity: YARN
      • [Diagram: YARN architecture] An MR client and a Spark client submit jobs to the Resource Manager; Node 1 runs a Node Manager hosting the MR App Master and a Spark task, and Node 2 runs a Node Manager hosting the Spark App Master and an MR task
  • 15. Data Inflow
      • The goal is to build a pipeline into Hadoop-native data stores
          • HDFS, mandatorily
          • Hive and HBase, preferably
      • Considering the variety of data formats that a Data Lake must accommodate:
          • A general-purpose Data Integration Tool must be chosen
          • For example, Pentaho Data Integration (PDI)
  • 16. Data Inflow
      • Pipelines specialized for specific data formats may also be plugged in
      • [Diagram] .txt and .json inputs pass through a flat file input connector and a web service input connector into an HDFS output connector; Sqoop (DB) and Flume (log) feed HDFS directly
  • 17. Data Inflow: Streaming Data
      • Streaming Data may be processed in two ways
      • Simply store in the Data Lake for future analysis
          • For example, interesting tweets for building a sentiment analysis model
      • Store and Forward to a Real-time Analytics Engine
          • Even as real-time processing occurs, the source data in raw format may be useful in future
          • To build / update machine learning models, for example in fraud analytics
      • [Diagram] Two paths into HDFS: STORE, and STORE & FORWARD
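The two streaming paths on this slide can be sketched as a single ingest function with an optional forwarding hook. This is a minimal illustration under stated assumptions: a Python list stands in for HDFS, another list stands in for the real-time analytics engine, and the `ingest` and `flag_large_payment` names and the `user,amount` event format are all hypothetical.

```python
from typing import Callable, Iterable, List, Optional

def ingest(events: Iterable[str],
           raw_store: List[str],
           forward: Optional[Callable[[str], None]] = None) -> int:
    """Append every event to the raw store; optionally forward it as well."""
    count = 0
    for event in events:
        raw_store.append(event)   # the raw record is always kept
        if forward is not None:
            forward(event)        # store-and-forward path
        count += 1
    return count

lake: List[str] = []     # stands in for an HDFS directory
alerts: List[str] = []   # stands in for a real-time analytics engine

def flag_large_payment(event: str) -> None:
    """Toy real-time rule: remember users with payments above 1000."""
    user, amount = event.split(",")
    if float(amount) > 1000:
        alerts.append(user)

ingest(["alice,20.5", "bob,5000"], lake)                    # store only
ingest(["carol,1200", "dave,3"], lake, flag_large_payment)  # store and forward
```

Both calls leave the raw events in the lake, so a fraud model could later be retrained on all four records even though only the second batch was scored in real time.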
  • 18. Data Analytics
      • A Data Lake built on HDFS will most likely use a Hadoop cluster to analyze data
      • Sometimes the result of the analysis may be stored back into HDFS (or possibly Hive / HBase)
      • But Data Visualization and Reporting / Dashboards may work only on structured data cubes
      • Hence, on the Analytics side, a Data Lake may need outflow paths from HDFS into structured data stores
  • 19. Plugging In Data Analytics Engine
      • Jaspersoft Reporting with HDFS
      • [Diagram] Analyzed data in HDFS → Jaspersoft ETL (HDFS input connector) → OLAP cube → Jaspersoft reporting engine
  • 20. Data Governance
      • A Data Lake does not conform to a schema
      • Data Governance makes it possible to make sense of the data
          • For both analysts and administrators
      • Data Governance is a fairly open-ended subject
          • Vendors offer different techniques to solve each governance use case
          • But common patterns are emerging across the landscape
  • 21. Data Governance: Analyst Use Cases
      • To search and retrieve ‘relevant’ data for analysis
      • Common Techniques
          • Metadata Management
          • Data tagging
          • Text Search
          • Data Classification
      • Metadata can include technical as well as business information (linked to a Business Glossary)
      • Data tags are often created by users collaboratively
  • 22. Data Governance: Admin Use Cases
      • Lineage: track data flow from source to end applications
      • Data Life-cycle Management: retain, replicate and archive based on usage
      • Auditing: track access and usage information for compliance
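Lineage, the first admin use case on this slide, amounts to walking a graph of "derived from" relationships. The sketch below is a minimal illustration, not how any particular governance tool stores lineage: the dataset names and the `LINEAGE` dictionary are hypothetical, and the traversal is a plain breadth-first search.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets derived directly from it.
LINEAGE = {
    "orders_db": ["raw/orders"],
    "raw/orders": ["clean/orders"],
    "clean/orders": ["cube/sales", "model/reco"],
    "crm_api": ["raw/customers"],
    "raw/customers": ["cube/sales"],
}

def downstream(source, edges):
    """Return every dataset reachable from `source`, in breadth-first order."""
    seen, queue = [], deque([source])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

affected = downstream("orders_db", LINEAGE)
```

A query like this answers the admin question "if the orders source changes, which end applications are affected?"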
  • 23. Automated Metadata Generation
      • As data is ingested, suitable attributes are extracted and stored in a metadata repository
          • Data type (XML, PDF, text, etc.)
          • Data size
          • Creation and Last Access time, etc.
      • Even data tags can be inserted at the time of ingest
          • Unconditionally, e.g. ‘sales’
          • Conditionally, e.g. ‘holiday_sales’
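The ingest-time extraction described on this slide might look like the following sketch. It is an assumption-laden illustration: the `extract_metadata` function, the holiday calendar, and the returned dictionary layout are hypothetical, and a local temporary file stands in for data arriving in the lake. Because a portable creation time is not available from `os.stat`, the sketch records the last-modified time instead.

```python
import os
import tempfile
from datetime import date

def extract_metadata(path, ingest_date, holidays=frozenset()):
    """Collect basic attributes and tags for one ingested file."""
    stat = os.stat(path)
    ext = os.path.splitext(path)[1].lstrip(".").lower() or "unknown"
    tags = ["sales"]                    # unconditional tag
    if ingest_date in holidays:
        tags.append("holiday_sales")    # conditional tag
    return {
        "path": path,
        "data_type": ext,               # inferred from the file extension
        "size_bytes": stat.st_size,
        "modified": stat.st_mtime,      # stand-in for creation time
        "last_access": stat.st_atime,
        "tags": tags,
    }

# Demo with a throwaway file standing in for an ingested document.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as f:
    f.write(b"<order/>")
    sample_path = f.name

HOLIDAYS = frozenset({date(2016, 12, 25)})  # hypothetical calendar
meta_holiday = extract_metadata(sample_path, date(2016, 12, 25), HOLIDAYS)
meta_regular = extract_metadata(sample_path, date(2016, 12, 26), HOLIDAYS)
os.remove(sample_path)
```

In a real pipeline the resulting dictionaries would be written to the metadata repository rather than kept in variables.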
  • 24. Apache Atlas For Data Governance
      • Source: http://atlas.incubator.apache.org/Architecture.html
  • 25. Data Access And Security
      • Out of the box, HDFS can be secured using
          • Kerberos for authentication (it must be enabled; the default authentication mode is ‘simple’), and
          • Unix-style file permissions for authorization
      • In a large data repository with diverse stakeholders you may need more control
      • If so, a couple of products may be considered for augmenting Data Security:
          • Apache Knox
          • Apache Ranger
  • 26. Data Access And Security
      • [Diagram] Knox provides perimeter security around the HDFS cluster (Node 1 … Node N); Kerberos handles authentication, rwx file permissions handle authorization, and Ranger adds federated access control
  • 27. Why Use Ranger
      • Supports Federated Access Control
          • Can fall back on default HDFS file permissions
      • Manages Access Control over several Hadoop-based components, like Hive, Storm, etc.
      • Advanced fine-grained access control, like
          • Deny policies for a user or group
          • Tag-based access control, where a collection of resources share a common access tag
          • For example, a few columns in a Hive table and certain files in HDFS could share a tag: ‘internal_audit’
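The combination of tag-based policies, deny policies and fall-back to file permissions can be illustrated with a toy evaluator. This is a conceptual sketch only and does not use or reproduce Ranger's API or policy model: `TagPolicy`, `is_allowed`, the resource names and the fallback rule are all hypothetical. It assumes that a matching deny policy overrides any allow, and that resources with no matching policy defer to a fallback check standing in for HDFS file permissions.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class TagPolicy:
    tag: str
    groups: FrozenSet[str]
    allow: bool  # False marks a deny policy

# A Hive column and an HDFS file sharing one access tag.
RESOURCE_TAGS = {
    "hive:sales.orders.card_number": frozenset({"internal_audit"}),
    "hdfs:/finance/ledger.csv": frozenset({"internal_audit"}),
}

POLICIES = [
    TagPolicy("internal_audit", frozenset({"auditors"}), allow=True),
    TagPolicy("internal_audit", frozenset({"contractors"}), allow=False),
]

def file_permissions_allow(resource, user_groups):
    """Stand-in for default file permissions: only /public is readable."""
    return resource.startswith("hdfs:/public/")

def is_allowed(resource, user_groups, resource_tags=RESOURCE_TAGS,
               policies=POLICIES, fallback=file_permissions_allow):
    tags = resource_tags.get(resource, frozenset())
    matching = [p for p in policies
                if p.tag in tags and p.groups & user_groups]
    if any(not p.allow for p in matching):
        return False                      # a deny policy wins
    if any(p.allow for p in matching):
        return True                       # allowed through the shared tag
    return fallback(resource, user_groups)  # no policy matched

auditor_ok = is_allowed("hdfs:/finance/ledger.csv", {"auditors"})
contractor_blocked = is_allowed("hdfs:/finance/ledger.csv",
                                {"auditors", "contractors"})
```

The point of the tag is that one policy covers both the Hive column and the HDFS file, so adding a new sensitive resource only requires tagging it.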
  • 28. Steps To Build A Data Lake
      • Set up a scalable data storage layer
      • Set up a Compute Cluster capable of running a diverse mix of Jobs
      • Create data flow pipeline(s) for batch jobs
      • Create data flow pipeline(s) for streaming data
  • 29. Steps To Build A Data Lake
      • Plug in one or more Analytics Engine(s)
      • Set up mechanisms for efficient data discovery and data governance
      • Implement Data Access Controls
      • Design a Monitoring Infrastructure for Jobs and Resources (not covered today)
  • 30. Building A Data Lake: Starting Points
      • Set up a scalable data storage layer: HDFS
      • Set up a Compute Cluster capable of running a diverse mix of Jobs: YARN
      • Create data flow pipeline(s) for batch jobs: Pentaho HDFS Connector
      • Create data flow pipeline(s) for streaming data: Pentaho Messaging Connector
  • 31. Steps To Build A Data Lake
      • Plug in one or more Analytics Engine(s): Pentaho Reporting and Spark MLlib
      • Set up mechanisms for efficient data discovery and data governance: Apache Atlas
      • Implement Data Access Controls: Apache Ranger
      • Design a Monitoring Infrastructure for Jobs and Resources: Apache Ambari
  • 32. Taking The Plunge
      • Do you need to plan for and build a Data Lake?
      • Ask yourself: what fraction of your data are you analyzing today?
      • What value might the unused data offer?
          • For marketing campaigns
          • For product lifecycle management
          • For regulatory compliance, and so on …
      • Engage your stakeholders from different LoBs
          • Is decision making being hampered by lack of data?
  • 33. Taking The Plunge
      • Start small: there is a learning curve
      • Storing data is not enough; maintaining and stewarding the data is all-important
      • Design for extensibility and pluggability
      • Minimize vendor lock-in
      • Be open to change as you scale your infrastructure