DATA LAKE ARCHITECTURE
Monojit Basu, Founder & Director
TechYugadi IT Solutions & Consulting
OSI DAYS 2016, BANGALORE
Data Never Sleeps
 Every minute
 Facebook users share 216,302 photos
 Dropbox users upload 833,333 new files
 YouTube users share 400 hours of new video
 Twitter users send 350,000 tweets
 A Boeing 737 Aircraft in flight generates 40 TB of data
EDW vs Data Lake
 Data Lake is built on the premise that every drop of
data is valuable
 It's a place for capturing and exploring huge
volumes of raw data that a business generates
 Explorers are diverse: business analysts, data
scientists, …
 even business managers (using self-service)
 Goals of exploration may be loosely defined
EDW vs Data Lake
 EDW stores filtered and processed data
 For predefined usage scenarios
 Traditionally structured in the form of ‘cubes’
 Analogy
 Difference between a college library (focused on
curriculum) and the US Library of Congress
EDW vs Data Lake
 Schema-on-Read
 Schema-on-Write
[Figure: Data Lake (schema-on-read). Sources (XML, JSON, CSV, PDF, Trading Partner REST API, Invoicing, Orders DB) land in the Data Lake; CRM Analytics, SCM Analytics and the Reco Engine each read / extract from it.]
[Figure: Enterprise Data Warehouse (schema-on-write). The same sources are loaded through ETL into the Enterprise Data Warehouse, which serves Sales, Operations and Marketing.]
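The schema-on-read idea can be illustrated with a minimal Python sketch. It assumes raw records arrive as JSON lines; the field names and the two consumer schemas are hypothetical and only serve the illustration.

```python
import json

# Raw records as they might land in the lake: no schema is enforced on write.
RAW_LINES = [
    '{"order_id": 1, "amount": "19.90", "channel": "web"}',
    '{"order_id": 2, "amount": "5.00"}',
    'not valid json',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, cast them, skip bad records."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw store tolerates junk; each reader decides what to do
        if not isinstance(record, dict):
            continue
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows

# Two consumers read the same raw data with different (hypothetical) schemas.
finance_schema = {"order_id": (int, None), "amount": (float, 0.0)}
marketing_schema = {"order_id": (int, None), "channel": (str, "unknown")}

finance_rows = read_with_schema(RAW_LINES, finance_schema)
marketing_rows = read_with_schema(RAW_LINES, marketing_schema)
```

With schema-on-write, the casting and rejection logic would instead run once in the ETL step, and only the conforming rows would be stored.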
Why Think of Data Lake
 Business Drivers
 Diverse sources of data: transactions, interactions, human
and machine-generated
 Routine analysis not enough – deeper insights lead to
differentiation
 Agile and Adaptive Business Models
 Technology Drivers
 Fast, cheap and scalable storage (e.g. HDFS)
 Diverse data-processing engines (e.g. NoSQL)
 Elastic processing power (clusters of commodity
servers)
Application Domains
 Healthcare
 IoT
 E-Governance
 Insurance
What Features Should It Support
 Scalable Storage Layer
 3 V’s of Data Inflow
 Data Discovery
 Data Governance
 Pluggable and Extensible Analytics
 Elastic Processing Power
 Multi-stakeholder and Multi-tenant Access
Building It On Top Of Hadoop
 Data Lake doesn’t have to be Hadoop
 But Hadoop has proven its prowess on planet-scale
data, in terms of:
 Data Volumes
 Elastic Data Processing Power
 Probably the idea of a Data Lake was inspired by
Hadoop
 Naturally most often a Data Lake Architecture is
built around Hadoop
Storage Capacity: Metrics
 Normally HDFS scales even with one NameNode
 Unless you have hundreds of petabytes of data
 But you need to monitor the usage pattern
 Are you creating too many small files (what’s the
average number of blocks per file)?
 How much RAM would you need for the NameNode? (a
high value could mean larger GC pauses)
 Internal Load (heartbeats and block reports) vs
External Get and Create Requests
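The small-files question can be made concrete with a rough Python estimate. This is a sketch under stated assumptions: the figure of about 150 bytes of NameNode heap per namespace object (file or block) is a commonly quoted rule of thumb rather than a measured value, and the 128 MiB block size is assumed.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measurement
BLOCK_SIZE = 128 * 1024 ** 2    # assumed HDFS block size of 128 MiB

def blocks_for(file_size):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size / BLOCK_SIZE)

def namenode_heap_estimate(file_sizes):
    """Return (average blocks per file, estimated NameNode heap in bytes)."""
    n_files = len(file_sizes)
    n_blocks = sum(blocks_for(size) for size in file_sizes)
    heap = (n_files + n_blocks) * BYTES_PER_OBJECT
    return n_blocks / n_files, heap

# The same 10 GiB stored two ways.
few_large = [1024 ** 3] * 10        # 10 files of 1 GiB
many_small = [1024 ** 2] * 10240    # 10,240 files of 1 MiB

large_stats = namenode_heap_estimate(few_large)
small_stats = namenode_heap_estimate(many_small)
```

Under these assumptions the small-file layout needs a few hundred times more NameNode memory for the same volume of data, which is why the average number of blocks per file is worth monitoring.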
Storage Capacity: HDFS Federation
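 Single NameNode vs NameNode Federation
[Figure: Single NameNode. One NameNode serves Get / Create requests from the MR client and handles internal load from DataNode1 … DataNodeN.]
[Figure: NameNode Federation. NameNode1 and NameNode2 manage Block Pool1 and Block Pool2 respectively, stored across the shared DataNode1 … DataNodeN.]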
Storage Capacity: Availability
 NameNode Federation does not ensure HA
 Even if you don’t go for Federation, configuring high
availability is recommended
 Essentially set up a Standby NameNode
 Active NameNode shares state (the edit log) with the Standby
 Using the Quorum Journal Manager (a set of JournalNodes), or
 Simply using an NFS-mounted shared directory
 Synchronization frequency is configurable
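As a sketch of what an active/standby setup involves, the Python helper below assembles the main hdfs-site.xml properties for the journal-based option. The property names follow the Apache Hadoop HA documentation; the nameservice ID, NameNode IDs and host names are hypothetical, and a real deployment needs further settings (fencing, automatic failover) that are omitted here.

```python
def ha_properties(nameservice, namenodes, journalnodes):
    """Build hdfs-site.xml properties for an active/standby NameNode pair.

    namenodes:    dict of NameNode ID -> "host:port" RPC address
    journalnodes: list of "host:port" JournalNode addresses
    """
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        # Shared edit log location, written by the active and read by the standby.
        "dfs.namenode.shared.edits.dir":
            "qjournal://" + ";".join(journalnodes) + "/" + nameservice,
        # Lets HDFS clients find whichever NameNode is currently active.
        f"dfs.client.failover.proxy.provider.{nameservice}":
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
    }
    for nn_id, address in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = address
    return props

# Hypothetical cluster layout, for illustration only.
props = ha_properties(
    "lake",
    {"nn1": "master1.example.com:8020", "nn2": "master2.example.com:8020"},
    ["jn1.example.com:8485", "jn2.example.com:8485", "jn3.example.com:8485"],
)
```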
Compute Capacity
 Hadoop 1.0 supported 1 type of Job (Map-Reduce)
 MR jobs were scheduled by a ‘JobTracker’ process
 Hadoop 2.0 offers a Resource Manager (YARN)
 It is intended to replace JobTracker and raise the
Hadoop cluster size limit from 3000 to 10000 nodes
 But more important: YARN supports different types of
Jobs including MR to run on Hadoop
 Hence Data Lake should preferably be built on YARN
Compute Capacity: YARN
 YARN Architecture
[Figure: An MR client and a Spark client submit jobs to the Resource Manager. On Node 1 the Node Manager hosts the MR App Master and a Spark task; on Node 2 the Node Manager hosts the Spark App Master and an MR task.]
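The point that one cluster can host different job types side by side can be sketched with a toy model in Python. This is not how YARN's real schedulers work: the first-fit placement, the memory sizes and the container names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A worker node with some free memory, managed by a node manager."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

def allocate(nodes, requests):
    """First-fit placement of (app, memory_mb) container requests on nodes.

    Returns the requests that could not be placed.
    """
    pending = []
    for app, mem in requests:
        for node in nodes:
            if node.free_mb >= mem:
                node.free_mb -= mem
                node.containers.append(app)
                break
        else:
            pending.append((app, mem))  # no node had enough free memory
    return pending

nodes = [Node("node1", 4096), Node("node2", 4096)]
requests = [
    ("mr-app-master", 1024),
    ("spark-app-master", 1024),
    ("mr-task", 2048),
    ("spark-task", 2048),
    ("spark-task", 2048),
    ("mr-task", 1024),
]
pending = allocate(nodes, requests)
```

The scheduler in this toy never looks at the job type, only at the resources requested, which is the property that lets MR and Spark share the same nodes.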
Data Inflow
 The goal is to build a pipeline into Hadoop-native
data stores
 HDFS, mandatorily
 Hive and HBase, preferably
 Considering the variety of data formats that a Data
Lake must accommodate:
 A general purpose Data Integration Tool must be chosen
 For example, Pentaho Data Integration (PDI)
Data Inflow
 Pipelines specialized for specific data formats may
also be plugged in
[Figure: Data integration pipeline into HDFS. A Flat File Input Connector (.txt) and a Web Service Input Connector (.json) feed an HDFS Output Connector; Sqoop (DB) and Flume (log) also load into HDFS.]
Data Inflow: Streaming Data
 Streaming Data may be processed in two ways
 Simply store in the Data Lake for future analysis
 Interesting tweets for building a sentiment analysis model
 Store and Forward to a Real-time Analytics Engine
 Even as real-time processing occurs, the source data in
raw format may be useful in future
 To build / update machine learning models, for example
in fraud analytics
[Figure: Streaming data flows into HDFS along two paths: Store, and Store & Forward.]
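The store-and-forward path can be sketched in a few lines of Python. Here a plain list stands in for the raw store in HDFS and a callback stands in for the real-time analytics engine; both stand-ins, and the fraud-style threshold rule, are assumptions made for illustration.

```python
import json

class StoreAndForward:
    """Persist every raw event, then hand it to a real-time consumer."""

    def __init__(self, raw_store, forward=None):
        self.raw_store = raw_store  # any object with append(); stands in for HDFS
        self.forward = forward      # callable; stands in for a real-time engine

    def handle(self, event):
        # Store first, so the raw event is kept even if forwarding fails.
        self.raw_store.append(json.dumps(event, sort_keys=True))
        if self.forward is not None:
            self.forward(event)

raw = []
alerts = []

def flag_large_payment(event):
    """Hypothetical real-time rule: flag payments above a fixed amount."""
    if event["amount"] > 1000:
        alerts.append(event["id"])

pipeline = StoreAndForward(raw, flag_large_payment)
for event in [
    {"id": 1, "amount": 40},
    {"id": 2, "amount": 5200},
    {"id": 3, "amount": 75},
]:
    pipeline.handle(event)
```

Passing `forward=None` gives the plain store path: events are kept for future analysis and nothing is processed in real time.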
Data Analytics
 A Data Lake built on HDFS will most likely use a
Hadoop cluster to analyze data
 Sometimes the result of the analysis may be stored
back into HDFS (or possibly Hive / HBase)
 But Data Visualization and Reporting / Dashboards
may work only on structured data cubes
 Hence on the Analytics side, a Data Lake may need
outflow paths from HDFS into structured data stores
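An outflow path of this kind can be sketched in Python, with an in-memory SQLite database standing in for the structured store that a reporting tool would query. The table layout and the sample records are hypothetical.

```python
import sqlite3

# Analyzed results as they might be read back from HDFS (hypothetical sample).
analyzed = [
    {"region": "south", "month": "2016-10", "revenue": 120.0},
    {"region": "south", "month": "2016-11", "revenue": 80.0},
    {"region": "north", "month": "2016-10", "revenue": 50.0},
]

# Load them into a structured store with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (:region, :month, :revenue)", analyzed
)

# A dashboard-style roll-up along one dimension.
by_region = conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```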
Plugging In Data Analytics Engine
 Jaspersoft Reporting with HDFS
[Figure: Analyzed data in HDFS is read by Jaspersoft ETL (HDFS Input Connector) into an OLAP cube, which feeds the Jaspersoft Reporting Engine.]
Data Governance
 Data Lake does not conform to a schema
 Data Governance makes it possible to make sense
of the data
 To both analysts and administrators
 Data Governance is a fairly open-ended subject
 Vendors offer different techniques to solve each
governance use case
 But common patterns are emerging across the landscape
Data Governance: Analyst Use Cases
 To search and retrieve ‘relevant’ data for analysis
 Common Techniques
 Metadata Management
 Data tagging
 Text Search
 Data Classification
 Metadata can include technical as well as business
information (linked to a Business Glossary)
 Data tags are often created by users collaboratively
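The analyst's search can be illustrated with a minimal Python sketch over a metadata catalog, combining tag lookup and text search. The catalog entries, paths and tags are hypothetical; a real deployment would query a metadata repository rather than an in-memory list.

```python
# Hypothetical catalog entries: technical metadata plus business description.
CATALOG = [
    {"path": "/lake/raw/orders/2016-11.csv",
     "tags": {"sales", "holiday_sales"},
     "description": "Order lines exported from the orders database"},
    {"path": "/lake/raw/invoices/2016-11.pdf",
     "tags": {"sales"},
     "description": "Scanned invoices from trading partners"},
    {"path": "/lake/raw/clicks/2016-11.json",
     "tags": {"web"},
     "description": "Clickstream events from the storefront"},
]

def search(catalog, tag=None, text=None):
    """Return paths of entries matching an optional tag and optional text."""
    results = []
    for entry in catalog:
        if tag is not None and tag not in entry["tags"]:
            continue
        if text is not None and text.lower() not in entry["description"].lower():
            continue
        results.append(entry["path"])
    return results

sales_paths = search(CATALOG, tag="sales")
invoice_paths = search(CATALOG, tag="sales", text="invoices")
```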
Data Governance: Admin Use Cases
 Lineage: track data flow from source to end applications
 Data Life-cycle Management: retain, replicate and archive based on usage
 Auditing: track access and usage information for compliance
Automated Metadata Generation
 As data is ingested, suitable attributes are extracted
and stored into a metadata repository
 Data type (XML, PDF, text, etc)
 Data size
 Creation and Last Access time, etc
 Even data tags can be inserted at the time of ingest
 Unconditionally, e.g. ‘sales’
 Conditionally, e.g. ‘holiday_sales’
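Metadata extraction at ingest can be sketched in Python as follows. The rule that triggers the conditional tag (ingest during December) is purely hypothetical, and the file's modification time is used in place of a creation time, which is not portable across platforms.

```python
import os
import tempfile
from datetime import date, datetime, timezone

def extract_metadata(path, ingest_date):
    """Collect attributes for the metadata repository as a file is ingested."""
    stat = os.stat(path)
    extension = os.path.splitext(path)[1].lstrip(".").lower() or "unknown"

    def iso(timestamp):
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()

    tags = ["sales"]                  # unconditional tag
    if ingest_date.month == 12:       # hypothetical condition for the extra tag
        tags.append("holiday_sales")
    return {
        "path": path,
        "data_type": extension,
        "size_bytes": stat.st_size,
        "modified": iso(stat.st_mtime),
        "last_access": iso(stat.st_atime),
        "tags": tags,
    }

# Usage with a throwaway file standing in for an ingested document.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "orders.xml")
    with open(sample, "wb") as handle:
        handle.write(b"<o/>")
    december = extract_metadata(sample, date(2016, 12, 24))
    october = extract_metadata(sample, date(2016, 10, 3))
```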
Apache Atlas For Data Governance
Source: http://atlas.incubator.apache.org/Architecture.html
Data Access And Security
 HDFS is commonly secured using
 Kerberos for authentication (it must be enabled; the default is simple authentication), and
 Unix-style file permissions for authorization
 In a large data repository with diverse stakeholders
you may need more control
 If so, a couple of products may be considered for
augmenting Data Security:
 Apache Knox
 Apache Ranger
Data Access And Security
[Figure: HDFS nodes (Node 1 … Node N) behind Knox perimeter security, with Kerberos authentication, rwx file-permission authorization, and Ranger federated access control.]
Why Use Ranger
 Supports Federated Access Control
 Can fall back on default HDFS file permissions
 Manages Access Control over several Hadoop-
based components, like Hive, Storm, etc.
 Advanced fine-grained access control, like
 Deny policies for user or group
 Tag-based access control, where a collection of
resources share a common access tag
 For example, a few columns in a Hive table and
certain files in HDFS could share a tag: ‘internal_audit’
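Tag-based access control with deny policies can be modelled in a few lines of Python. This is a toy evaluator written for illustration, not the Ranger API: the resource names, groups and the rule that a deny overrides an allow are assumptions, and the `fallback` argument stands in for the decision the default HDFS file permissions would give.

```python
# Resources across components share tags (hypothetical names).
RESOURCE_TAGS = {
    "hive:finance.ledger.salary": {"internal_audit"},
    "hdfs:/lake/audit/2016-q3.pdf": {"internal_audit"},
    "hdfs:/lake/raw/clicks.json": set(),
}

# One policy per (tag, group); a policy either allows or denies access.
POLICIES = [
    {"tag": "internal_audit", "group": "auditors", "effect": "allow"},
    {"tag": "internal_audit", "group": "contractors", "effect": "deny"},
]

def is_allowed(user_groups, resource, fallback=False):
    """Evaluate tag policies; deny wins, otherwise allow, otherwise fall back."""
    tags = RESOURCE_TAGS.get(resource, set())
    matching = [
        policy for policy in POLICIES
        if policy["tag"] in tags and policy["group"] in user_groups
    ]
    if any(policy["effect"] == "deny" for policy in matching):
        return False
    if any(policy["effect"] == "allow" for policy in matching):
        return True
    return fallback  # no tag policy applies: defer to file permissions

auditor_ok = is_allowed({"auditors"}, "hive:finance.ledger.salary")
contractor_ok = is_allowed({"auditors", "contractors"}, "hdfs:/lake/audit/2016-q3.pdf")
untagged_ok = is_allowed({"analysts"}, "hdfs:/lake/raw/clicks.json", fallback=True)
```

A single policy on the tag covers both the Hive column and the HDFS file, so adding a newly tagged resource requires no new policy.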
Steps To Build A Data Lake
 Set up a scalable data storage layer
 Set up a Compute Cluster capable of running a
diverse mix of Jobs
 Create data flow pipeline(s) for batch jobs
 Create data flow pipeline(s) for streaming data
Steps To Build A Data Lake
 Plug in one or more Analytics Engine(s)
 Set up mechanisms for efficient data discovery
and data governance
 Implement Data Access Controls
 Design a Monitoring Infrastructure for Jobs and
Resources (not covered today)
Building A Data Lake: Starting Points
 Set up a scalable data storage layer: HDFS
 Set up a Compute Cluster capable of running a
diverse mix of Jobs: YARN
 Create data flow pipeline(s) for batch jobs:
Pentaho HDFS Connector
 Create data flow pipeline(s) for streaming data:
Pentaho Messaging Connector
Steps To Build A Data Lake
 Plug in one or more Analytics Engine(s): Pentaho
Reporting and Spark MLlib
 Set up mechanisms for efficient data discovery
and data governance: Apache Atlas
 Implement Data Access Controls: Apache Ranger
 Design a Monitoring Infrastructure for Jobs and
Resources: Apache Ambari
Taking The Plunge
 Do you need to plan for and build a Data Lake?
 Ask yourself: what fraction of your data are you
analyzing today?
 What value might the unused data offer?
 For marketing campaigns
 For product lifecycle management
 For regulatory compliance, and so on …
 Engage your stakeholders from different LoBs
 Is decision making being hampered by lack of data?
Taking The Plunge
 Start small: There is a learning curve
 Storing data is not enough: maintaining and
stewarding the data is all-important
 Design for extensibility and pluggability
 Minimize vendor lock-in
 Be open to change as you scale your infrastructure
monojit@techyugadi.com

Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagraadet6151
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一0uyfyq0q4
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...Amil baba
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp onlinebalibahu1313
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证ppy8zfkfm
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证dq9vz1isj
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyRafigAliyev2
 

Recently uploaded (20)

2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
 
Toko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra Malang
Toko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra MalangToko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra Malang
Toko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra Malang
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagra
Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat ViagraToko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagra
Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagra
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
 
123.docx. .
123.docx.                                 .123.docx.                                 .
123.docx. .
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 

Datalake Architecture

  • 1. DATA LAKE ARCHITECTURE
      - Monojit Basu, Founder & Director, TechYugadi IT Solutions & Consulting
      - OSI DAYS 2016, BANGALORE
  • 2. Data Never Sleeps
      - Every minute:
          - Facebook users share 216,302 photos
          - Dropbox users upload 833,333 new files
          - YouTube users share 400 hours of new video
          - Twitter users send 350,000 tweets
      - A Boeing 737 aircraft in flight generates 40 TB of data
  • 3. EDW vs Data Lake
      - A Data Lake is built on the premise that every drop of data is valuable
      - It's a place for capturing and exploring the huge volumes of raw data that a business generates
      - Explorers are diverse: business analysts, data scientists, …
          - even business managers (using self-service)
      - Goals of exploration may be loosely defined
  • 4. EDW vs Data Lake
      - An EDW (Enterprise Data Warehouse) stores filtered and processed data
          - For predefined usage scenarios
          - Traditionally structured in the form of ‘cubes’
      - Analogy
          - The difference between a college library (focused on the curriculum) and the US Library of Congress
  • 5. EDW vs Data Lake
      - Data Lake: Schema-on-Read
      - EDW: Schema-on-Write
      - [Diagram labels, Data Lake: sources (XML, JSON, CSV, PDF, trading partner REST API, invoicing, orders DB); Data Lake; Read / Extract by CRM analytics, SCM analytics and a reco engine]
      - [Diagram labels, Enterprise Data Warehouse: the same sources; ETL; Enterprise Data Warehouse; Sales, Operations, Marketing]
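The schema-on-read / schema-on-write contrast can be illustrated with a minimal Python sketch. The sample records, the field names, and the `read_orders` / `write_order` helpers are hypothetical and do not come from the talk; they only show where the schema is enforced in each model.

```python
import csv
import io
import json

# Hypothetical raw records, kept in the lake exactly as they arrived.
RAW_ORDERS_CSV = "order_id,amount\n1001,250.50\n1002,99.00\n"
RAW_ORDERS_JSON = '[{"id": 1003, "total": "410.25", "coupon": "NEW10"}]'


def read_orders(raw, fmt):
    """Schema-on-read: the reader's schema is applied only at query time."""
    if fmt == "csv":
        rows = csv.DictReader(io.StringIO(raw))
        return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])}
                for r in rows]
    if fmt == "json":
        # Extra fields (such as "coupon") stay in the raw data, unused here.
        return [{"order_id": int(r["id"]), "amount": float(r["total"])}
                for r in json.loads(raw)]
    raise ValueError(f"unsupported format: {fmt}")


def write_order(table, record):
    """Schema-on-write: records that do not fit the fixed schema are rejected."""
    if set(record) != {"order_id", "amount"}:
        raise ValueError(f"record does not match schema: {sorted(record)}")
    table.append({"order_id": int(record["order_id"]),
                  "amount": float(record["amount"])})


orders = read_orders(RAW_ORDERS_CSV, "csv") + read_orders(RAW_ORDERS_JSON, "json")
total = sum(o["amount"] for o in orders)
```

In the lake model the `coupon` field is still available to a later reader with a different schema; in the warehouse model the same record would have to be transformed before it could be loaded.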
  • 6. Why Think of Data Lake
      - Business Drivers
          - Diverse sources of data: transactions, interactions, human- and machine-generated
          - Routine analysis is not enough: deeper insights lead to differentiation
          - Agile and adaptive business models
      - Technology Drivers
          - Fast, cheap and scalable storage (e.g. HDFS)
          - Diverse data-processing engines (e.g. NoSQL)
          - Highly elastic processing power (clusters of commodity servers)
  • 7. Application Domains
      - Healthcare
      - IoT
      - E-Governance
      - Insurance
  • 8. What Features Should It Support
      - Scalable storage layer
      - The 3 V’s of data inflow (volume, velocity, variety)
      - Data discovery
      - Data governance
      - Pluggable and extensible analytics
      - Elastic processing power
      - Multi-stakeholder and multi-tenant access
  • 9. Building It On Top Of Hadoop
      - A Data Lake doesn’t have to be built on Hadoop
      - But Hadoop has proven itself on planet-scale data, in terms of:
          - Data volumes
          - Elastic data processing power
      - The idea of a Data Lake was probably inspired by Hadoop
      - Naturally, a Data Lake architecture is most often built around Hadoop
  • 10. Storage Capacity: Metrics
      - Normally HDFS scales even with one NameNode
          - Unless you have hundreds of petabytes of data
      - But you need to monitor the usage pattern
          - Are you creating too many small files (what’s the average number of blocks per file)?
          - How much RAM would you need for the NameNode? (a large heap could mean longer GC pauses)
          - Internal load (heartbeats and block reports) vs external Get and Create requests
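The two sizing questions above (blocks per file and NameNode RAM) can be approximated with a few lines of Python. This is a rough sketch, not a sizing tool: the figure of about 150 bytes of NameNode heap per file or block object is a commonly quoted rule of thumb that is assumed here, and the function names are hypothetical. The 128 MiB block size is the Hadoop 2 default.

```python
import math

DEFAULT_BLOCK_SIZE = 128 * 1024 ** 2  # Hadoop 2 default block size (128 MiB)
BYTES_PER_OBJECT = 150                # assumed rule of thumb, not a measured value


def blocks_for_file(size_bytes, block_size=DEFAULT_BLOCK_SIZE):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(size_bytes / block_size)


def average_blocks_per_file(total_blocks, total_files):
    """A value close to 1 suggests many small files."""
    return total_blocks / total_files if total_files else 0.0


def estimate_namenode_heap_gib(total_files, total_blocks,
                               bytes_per_object=BYTES_PER_OBJECT):
    """Rough heap needed to hold the namespace in memory, in GiB."""
    return (total_files + total_blocks) * bytes_per_object / 1024 ** 3
```

For example, 100 million files of one block each come to roughly 28 GiB of heap under this assumption, which is the kind of number that makes GC pauses a concern.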
  • 11. Storage Capacity: HDFS Federation
      - [Diagram labels, Single NameNode: NameNode; Data Node 1, Data Node 2, … Data Node N; MR client; Get / Create; Internal Load]
      - [Diagram labels, NameNode Federation: NameNode 1, NameNode 2; Block Pool 1, Block Pool 2; Data Node 1, Data Node 2, … Data Node N]
  • 12. Storage Capacity: Availability
      - NameNode Federation does not ensure high availability (HA)
      - Even if you don’t go for Federation, configuring HA is recommended
      - Essentially, set up a Standby NameNode
      - The Active NameNode shares state with the Standby
          - Using a shared Journal Manager, or
          - Simply using an NFS-mounted shared file directory
      - Synchronization frequency is configurable
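As a rough illustration of the Active / Standby setup, the sketch below generates the core `hdfs-site.xml` properties for an HA pair. The property names are standard HDFS HA settings, but the nameservice ID, host names, ports and the helper functions are assumptions made for the example; a real deployment needs further settings (fencing, client failover) that are not shown.

```python
import xml.etree.ElementTree as ET


def ha_properties(nameservice, namenodes, shared_edits_uri):
    """Build core HA properties; `namenodes` maps NameNode ID -> host name."""
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        # Either a qjournal:// URI (journal nodes) or a file:// URI (NFS mount).
        "dfs.namenode.shared.edits.dir": shared_edits_uri,
    }
    for nn_id, host in namenodes.items():
        # Port 8020 is an assumed RPC port for this example.
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = f"{host}:8020"
    return props


def to_hdfs_site_xml(props):
    """Render the properties in the <configuration> layout Hadoop expects."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical cluster: two NameNodes sharing edits through three journal nodes.
example = ha_properties(
    "lakecluster",
    {"nn1": "namenode1.example.com", "nn2": "namenode2.example.com"},
    "qjournal://jn1.example.com:8485;jn2.example.com:8485;"
    "jn3.example.com:8485/lakecluster",
)
```

Swapping the `qjournal://` URI for a `file://` path on an NFS mount corresponds to the second option on the slide.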
  • 13. Compute Capacity
      - Hadoop 1.0 supported one type of job (MapReduce)
          - MR jobs were scheduled by a ‘JobTracker’ process
      - Hadoop 2.0 offers a resource manager (YARN)
          - It is intended to replace the JobTracker and raise the Hadoop cluster size limit from 3,000 to 10,000 nodes
      - More importantly, YARN allows different types of jobs, including MR, to run on Hadoop
      - Hence a Data Lake should preferably be built on YARN
  • 14. Compute Capacity: YARN
      - [Diagram, YARN architecture. Labels: Resource Manager; Node 1 (Node Manager, MR App Master, Spark task); Node 2 (Node Manager, Spark App Master, MR task); MR client; Spark client]
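The idea that one resource manager hands out containers to different kinds of jobs can be shown with a toy model. This is not the YARN API: the `NodeManager` and `ResourceManager` classes below are simplified stand-ins, and the first-fit placement rule is an assumption chosen only to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy node: tracks free memory and the containers placed on it."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy resource manager using first-fit placement (an assumption)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_type, role, memory_mb):
        """Place a container for any job type; return the node name or None."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append((app_type, role))
                return node.name
        return None  # no capacity left; a real scheduler would queue the request


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])
placements = [
    rm.allocate("mapreduce", "app-master", 1024),
    rm.allocate("spark", "app-master", 4096),
    rm.allocate("spark", "task", 2048),
]
```

The point of the sketch is that the manager never inspects `app_type`: MapReduce and Spark containers compete for the same pool of memory.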
  • 15. Data Inflow
      - The goal is to build a pipeline into Hadoop-native data stores
          - HDFS, mandatorily
          - Hive and HBase, preferably
      - Considering the variety of data formats that a Data Lake must accommodate:
          - A general-purpose data integration tool must be chosen
          - For example, Pentaho Data Integration (PDI)
  • 16. Data Inflow
      - Pipelines specialized for specific data formats may also be plugged in
      - [Diagram labels: .txt and .json sources; flat file input connector; web service input connector; HDFS output connector; Sqoop (DB); Flume (log); HDFS]
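The "pluggable pipeline" idea can be sketched as a small connector registry in Python. Everything here is hypothetical (the `connector` decorator, the reader functions, and the in-memory `sink` list standing in for HDFS); it is not how Pentaho, Sqoop or Flume are implemented, only an illustration of routing each format to its own reader.

```python
import json
import os

CONNECTORS = {}  # file extension -> reader function


def connector(extension):
    """Register a reader for one file extension."""
    def register(reader):
        CONNECTORS[extension] = reader
        return reader
    return register


@connector(".txt")
def read_flat_file(payload):
    # One record per non-empty line.
    return [{"line": line} for line in payload.splitlines() if line.strip()]


@connector(".json")
def read_json(payload):
    data = json.loads(payload)
    return data if isinstance(data, list) else [data]


def ingest(filename, payload, sink):
    """Route the payload to the matching connector and append to the sink."""
    extension = os.path.splitext(filename)[1].lower()
    reader = CONNECTORS.get(extension)
    if reader is None:
        raise ValueError(f"no connector registered for {extension!r}")
    records = reader(payload)
    sink.extend(records)
    return len(records)
```

Adding a new format then means registering one more function, without touching `ingest`.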
  • 17. Data Inflow: Streaming Data
      - Streaming data may be processed in two ways
      - Simply store it in the Data Lake for future analysis
          - For example, interesting tweets for building a sentiment analysis model
      - Store and forward to a real-time analytics engine
          - Even as real-time processing occurs, the source data in raw format may be useful in future
          - To build / update machine learning models, for example in fraud analytics
      - [Diagram labels: HDFS; Store; Store & Forward]
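The two streaming modes differ only in whether the event is also handed to a real-time consumer. A minimal sketch, assuming an in-memory list as the lake and a hypothetical fraud rule (`flag_large_payment` with an arbitrary threshold) as the real-time engine:

```python
def store(event, lake):
    """Mode 1: keep the raw event for future analysis only."""
    lake.append(event)


def store_and_forward(event, lake, realtime_handler):
    """Mode 2: keep the raw event, then pass it to a real-time engine."""
    lake.append(event)  # the raw copy is kept regardless of the handler's result
    return realtime_handler(event)


def flag_large_payment(event, threshold=10_000):
    """Hypothetical real-time rule: flag payments above an arbitrary threshold."""
    return event.get("amount", 0) > threshold


lake = []
store({"type": "tweet", "text": "loving the new release"}, lake)
suspicious = store_and_forward({"type": "payment", "amount": 25_000}, lake,
                               flag_large_payment)
```

Both events end up in `lake`, so the raw payment is still available later for retraining a fraud model.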
  • 18. Data Analytics
      - A Data Lake built on HDFS will most likely use a Hadoop cluster to analyze data
      - Sometimes the result of the analysis may be stored back into HDFS (or possibly Hive / HBase)
      - But data visualization and reporting / dashboards may work only on structured data cubes
      - Hence, on the analytics side, a Data Lake may need outflow paths from HDFS into structured data stores
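An outflow path of this kind boils down to rolling analyzed records up into a structured table that a reporting tool can query. The sketch below uses an in-memory SQLite table as a stand-in for the cube; the `sales_cube` table, its columns and the sample rows are hypothetical.

```python
import sqlite3


def build_sales_cube(rows):
    """Aggregate (region, month, amount) rows into a small cube-like table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_cube ("
        "region TEXT, month TEXT, revenue REAL, PRIMARY KEY (region, month))"
    )
    totals = {}
    for region, month, amount in rows:
        key = (region, month)
        totals[key] = totals.get(key, 0.0) + amount
    conn.executemany(
        "INSERT INTO sales_cube VALUES (?, ?, ?)",
        [(region, month, revenue) for (region, month), revenue in totals.items()],
    )
    conn.commit()
    return conn


# Hypothetical analysis output, as it might be read back from HDFS.
analyzed = [
    ("north", "2016-10", 120.0),
    ("north", "2016-10", 80.0),
    ("north", "2016-11", 50.0),
    ("south", "2016-10", 70.0),
]
cube = build_sales_cube(analyzed)
by_region = cube.execute(
    "SELECT region, SUM(revenue) FROM sales_cube GROUP BY region ORDER BY region"
).fetchall()
```

A dashboard would issue queries like the last one against the structured store rather than against the raw files.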
  • 19. Plugging In Data Analytics Engine
      - Jaspersoft Reporting with HDFS
      - [Diagram labels: HDFS (analyzed data); Jaspersoft ETL with HDFS input connector; OLAP cube; Jaspersoft reporting engine]
  • 20. Data Governance
      - A Data Lake does not conform to a schema
      - Data governance makes it possible to make sense of the data
          - For both analysts and administrators
      - Data governance is a fairly open-ended subject
          - Vendors offer different techniques to solve each governance use case
          - But common patterns are emerging across the landscape
  • 21. Data Governance: Analyst Use Cases
      - To search and retrieve ‘relevant’ data for analysis
      - Common techniques
          - Metadata management
          - Data tagging
          - Text search
          - Data classification
      - Metadata can include technical as well as business information (linked to a Business Glossary)
      - Data tags are often created by users collaboratively
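Tagging and text search can be combined in a very small catalog model. The `CatalogEntry` class, the `search` function and the sample entries below are hypothetical; a real catalog would sit on a metadata store and a search index rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data set in the lake, with a description and user-applied tags."""
    path: str
    description: str
    tags: set = field(default_factory=set)


def search(catalog, text="", tag=None):
    """Return paths whose description contains `text` and that carry `tag`."""
    needle = text.lower()
    return [
        entry.path
        for entry in catalog
        if (tag is None or tag in entry.tags)
        and needle in entry.description.lower()
    ]


catalog = [
    CatalogEntry("/lake/orders/2016", "Raw order exports from the orders DB", {"sales"}),
    CatalogEntry("/lake/invoices/2016", "Invoice PDFs from trading partners", {"sales", "finance"}),
    CatalogEntry("/lake/logs/web", "Web server access logs", {"ops"}),
]
catalog[0].tags.add("holiday_sales")  # tags can be added collaboratively later
```

For example, `search(catalog, tag="sales")` returns the two sales data sets, and `search(catalog, text="logs")` finds the web logs by description alone.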
  • 22. Data Governance: Admin Use Cases
      - Lineage: track data flow from source to end applications
      - Data life-cycle management: retain, replicate and archive based on usage
      - Auditing: track access and usage information for compliance
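Lineage tracking amounts to walking a graph from an end application back to its original sources. A minimal sketch, assuming lineage is recorded as a mapping from each data set to its direct inputs; the data set names and the `upstream_sources` helper are hypothetical.

```python
def upstream_sources(lineage, dataset):
    """Return the original sources (data sets with no inputs) behind `dataset`."""
    sources, seen, stack = set(), {dataset}, [dataset]
    while stack:
        current = stack.pop()
        inputs = lineage.get(current, [])
        if not inputs and current != dataset:
            sources.add(current)  # nothing feeds this data set: it is a source
        for parent in inputs:
            if parent not in seen:  # guards against cycles and repeated visits
                seen.add(parent)
                stack.append(parent)
    return sources


# Hypothetical lineage: data set -> the data sets it was derived from.
lineage = {
    "sales_dashboard": ["sales_cube"],
    "sales_cube": ["orders_raw", "invoices_raw"],
}
```

Here `upstream_sources(lineage, "sales_dashboard")` yields the two raw data sets, which is the answer an administrator needs when a source system changes.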
  • 23. Automated Metadata Generation
      - As data is ingested, suitable attributes are extracted and stored in a metadata repository
          - Data type (XML, PDF, text, etc.)
          - Data size
          - Creation and last access time, etc.
      - Even data tags can be inserted at the time of ingest
          - Unconditionally, e.g. ‘sales’
          - Conditionally, e.g. ‘holiday_sales’
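Metadata extraction at ingest can be sketched as a function that runs on every incoming file. The attribute names, the `extract_metadata` and `is_holiday` helpers, and the rule that treats 20 to 31 December as the holiday window are all assumptions made for illustration.

```python
import os
from datetime import datetime


def is_holiday(day):
    """Hypothetical condition: treat 20-31 December as the holiday season."""
    return day.month == 12 and day.day >= 20


def extract_metadata(filename, payload, ingested_at):
    """Derive technical attributes and tags for one ingested file."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    tags = ["sales"]  # unconditional tag
    if is_holiday(ingested_at.date()):
        tags.append("holiday_sales")  # conditional tag
    return {
        "name": filename,
        "data_type": extension or "unknown",
        "size_bytes": len(payload),
        "created": ingested_at.isoformat(),
        "last_accessed": ingested_at.isoformat(),
        "tags": tags,
    }


record = extract_metadata("orders.XML", b"<orders/>", datetime(2016, 12, 24, 9, 0))
```

The returned dictionary is what would be written to the metadata repository alongside the raw file.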
  • 24. Apache Atlas For Data Governance
      - [Architecture diagram. Source: http://atlas.incubator.apache.org/Architecture.html]
  • 25. Data Access And Security
      - HDFS is typically secured using
          - Kerberos for authentication (when security is enabled), and
          - Unix-style file permissions for authorization
      - In a large data repository with diverse stakeholders you may need more control
      - If so, a couple of products may be considered for augmenting data security:
          - Apache Knox
          - Apache Ranger
  • 26. Data Access And Security
      - [Diagram labels: HDFS (Node 1 … Node N); perimeter security: Knox; Kerberos authentication; authorization (rwx); Ranger federated access control]
  • 27. Why Use Ranger
      - Supports federated access control
          - Can fall back on default HDFS file permissions
      - Manages access control over several Hadoop-based components, like Hive, Storm, etc.
      - Advanced fine-grained access control, like
          - Deny policies for a user or group
          - Tag-based access control, where a collection of resources share a common access tag
          - For example, a few columns in a Hive table and certain files in HDFS could share a tag: ‘internal_audit’
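The interplay of tag-based policies, deny policies and the fall-back to file permissions can be modelled in a few lines. This is a toy evaluator, not Ranger's API or its exact evaluation order: the `Policy` class, the `is_allowed` function and the rule that a matching deny always wins are assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Grants or denies one group access to every resource carrying a tag."""
    tag: str
    group: str
    allow: bool


def is_allowed(user_groups, resource_tags, policies, file_permission_allows):
    """Evaluate tag policies first, then fall back to file permissions."""
    matching = [
        p for p in policies
        if p.tag in resource_tags and p.group in user_groups
    ]
    if any(not p.allow for p in matching):
        return False  # in this sketch a matching deny policy always wins
    if any(p.allow for p in matching):
        return True
    return file_permission_allows  # no policy matched: use HDFS rwx result


policies = [
    Policy(tag="internal_audit", group="auditors", allow=True),
    Policy(tag="internal_audit", group="contractors", allow=False),
]
```

Because the policy is attached to the `internal_audit` tag, it covers a Hive column and an HDFS file alike, as long as both carry that tag.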
  • 28. Steps To Build A Data Lake
      - Set up a scalable data storage layer
      - Set up a compute cluster capable of running a diverse mix of jobs
      - Create data flow pipeline(s) for batch jobs
      - Create data flow pipeline(s) for streaming data
  • 29. Steps To Build A Data Lake
      - Plug in one or more analytics engine(s)
      - Set up mechanisms for efficient data discovery and data governance
      - Implement data access controls
      - Design a monitoring infrastructure for jobs and resources (not covered today)
  • 30. Building A Data Lake: Starting Points
      - Set up a scalable data storage layer: HDFS
      - Set up a compute cluster capable of running a diverse mix of jobs: YARN
      - Create data flow pipeline(s) for batch jobs: Pentaho HDFS Connector
      - Create data flow pipeline(s) for streaming data: Pentaho Messaging Connector
  • 31. Steps To Build A Data Lake
      - Plug in one or more analytics engine(s): Pentaho Reporting and Spark MLlib
      - Set up mechanisms for efficient data discovery and data governance: Apache Atlas
      - Implement data access controls: Apache Ranger
      - Design a monitoring infrastructure for jobs and resources: Apache Ambari
  • 32. Taking The Plunge
      - Do you need to plan for and build a Data Lake?
      - Ask yourself: what fraction of your data are you analyzing today?
      - What value might the unused data offer?
          - For marketing campaigns
          - For product lifecycle management
          - For regulatory compliance, and so on …
      - Engage your stakeholders from different lines of business (LoBs)
          - Is decision making being hampered by a lack of data?
  • 33. Taking The Plunge
      - Start small: there is a learning curve
      - Storing data is not enough: maintaining and stewarding the data is all-important
      - Design for extensibility and pluggability
      - Minimize vendor lock-in
      - Be open to change as you scale your infrastructure