My presentation on building data lakes for big data using Hadoop as the data platform. Learn more about our consulting and training work in Hortonworks Hadoop, Big Data, Data Warehousing, and Business Intelligence.
1. Data Lakes
a practical view
Marco Garcia
CTO, Founder – Cetax, TutorPro
mgarcia@cetax.com.br
https://www.linkedin.com/in/mgarciacetax/
2. With more than 20 years of experience in IT, 18 of them exclusively with Business Intelligence, Data Warehousing, and Big Data, Marco Garcia is certified by Kimball University in the USA, where he was taught in person by Ralph Kimball, one of the leading gurus of data warehousing.
First Hortonworks Certified Instructor in LATAM
Data Architect and Instructor at Cetax Consultoria
About the Speaker
5. The ability to learn or understand or to deal with new or trying situations: reason; also: the skilled use of reason; the ability to apply knowledge to manipulate one's environment or to think abstractly as measured by objective criteria (such as tests).
What is intelligence?
Data Lake?
7. Data Warehouse vs. Data Lake
https://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
Bottled water:
- Clean
- Treated
- Packaged
- Ready for consumption
Data lake:
- Raw
- Untreated
- Needs to be worked before it can be consumed
8. DATA IS THE NEW OIL!
Like oil, data needs to be refined!
13. What is Apache Hadoop?
The Apache Hadoop project describes the technology as a software framework that:
- Allows for the distributed processing of large data sets across clusters of computers using simple programming models
- Is designed to scale up from single servers to thousands of machines, each offering local computation and storage
- Does not rely on hardware to deliver high availability; rather, the library itself is designed to detect and handle failures at the application layer
- Delivers a highly available service on top of a cluster of computers, each of which may be prone to failures
Source: http://hadoop.apache.org
18. (Architecture diagram: a cluster of compute & storage nodes running YARN, with Knox for perimeter security and Ambari for management.)
Step 1 (Extract & Load): Sqoop, Flume, NiFi, and Kafka load source data (app/system logs, customer/inventory data, transaction/sales data, flat files, Twitter/Facebook streams) arriving over DB, file, JMS, REST, HTTP, and streaming channels.
Step 2 (Model/Apply Metadata): HCatalog holds the table metadata.
Step 3 (Transform, Aggregate & Materialize): Hive and Pig perform the data processing.
Steps 1-3: the data lifecycle is managed with Falcon.
Step 4a (Publish/Exchange): results are loaded out via Sqoop/Hive and WebHDFS to RDBMS and No/New SQL stores (Oracle, HANA) and the EDW (SAP BW).
Step 4b (Explore/Visualize): query/visualization/reporting tools (SAP BO, Tableau/Excel, any JDBC-compliant tool) connect to the interactive HiveServer.
Step 4c (Analyze): analytical tools (SAS, Python, R, MATLAB) run against the cluster via analytical and streaming application masters.
19. Steps for the Data Lake
Step 1 - Extract and Load (a minimal Sqoop sketch follows this list)
Step 2 - Model and Apply the Metadata
Step 3 - Transform, Aggregate, and Materialize the Data
Step 4a - Publish or Exchange Data
Step 4b - Explore and Visualize
Step 4c - Analyze, Do Data Science
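A minimal sketch of Step 1, assuming Sqoop is on the path; the MySQL host, database, table, and landing path are hypothetical:

```python
# Hedged sketch: load one RDBMS table into the landing zone with Sqoop.
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",  # hypothetical source
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop.pwd",      # keep secrets off the command line
        "--table", "orders",
        "--target-dir", "/lake/landing/sales/orders",   # landing zone path
        "--as-avrodatafile",                            # raw but splittable on HDFS
    ],
    check=True,
)
```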
21. Key Points
Align the data lake with the organizational structure
Create zones in the data lake (ingest zone, transformation zone, presentation zone); see the sketch after this list
Data ingestion processes
Security
Data lineage
Understand the requirements
Integrations will be necessary!
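As a minimal sketch of the zones idea, assuming the `hdfs` Python package (WebHDFS) and a hypothetical NameNode address, the zones can start as top-level HDFS directories:

```python
# Hedged sketch: one top-level HDFS directory per data lake zone.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")  # hypothetical address
for zone in ("landing", "archive", "presentation", "exploration"):
    client.makedirs(f"/lake/{zone}")
```

Per-zone permissions and encryption come next; the landing zone case is covered on the following slides.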
22. The Organization's Logical Structure
Align the structure by functions rather than by departments or teams; organizations change, but functions are almost always similar.
Think of it as a long-term investment.
Always stay alert to regulations and to internal or even external controls.
Think of the data lake in layers.
24. HDFS Layer: Landing Zone
Data is written into the landing zone (Sqoop, HDF, Flume, …) in RAW format
Security
- Contains PII information
- The landing zone uses HDFS TDE for data protection (a setup sketch follows)
- Only ETL tools access this layer
- Access by data wranglers only
Data retention is limited (< 1 month)
(Diagram: an RDBMS feeding the Landing zone via Sqoop and NiFi)
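A hedged sketch of the TDE setup, assuming a Hadoop KMS is already configured; the key name and path are hypothetical:

```python
# Hedged sketch: create a KMS key, then mark the (empty) landing directory
# as an HDFS encryption zone so data is encrypted at rest.
import subprocess

subprocess.run(["hadoop", "key", "create", "landing_key"], check=True)
subprocess.run(
    ["hdfs", "crypto", "-createZone",
     "-keyName", "landing_key",
     "-path", "/lake/landing"],
    check=True,
)
```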
25. HDFS Layer: Archival Layer
Data is compressed into large files
- Hadoop Archive (HAR) solves the small-file problem (a packing sketch follows)
Data is automatically removed
- Retention policy managed via Falcon
Security
- The archive zone uses HDFS TDE for data protection
- A limited set of users can access it
- HDFS tiering
(Diagram: data moves from Landing to Archive)
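A minimal sketch of the HAR packing step, with hypothetical paths; `hadoop archive` launches a MapReduce job that bundles many small landing files into one archive:

```python
# Hedged sketch: archive one month of landing data into a single HAR file.
import subprocess

subprocess.run(
    ["hadoop", "archive",
     "-archiveName", "sales-2018-01.har",
     "-p", "/lake/landing/sales",  # parent of the source directories
     "2018-01",                    # source, relative to the parent
     "/lake/archive/sales"],       # destination directory
    check=True,
)
```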
26. HDFS Layer: Presentation Layer
Data moves from Landing into the presentation layer
- Data is cleaned as part of the ETL
Optimized file formats (ORC, Parquet, Avro, …); a materialization sketch follows
Multiple copies of the same dataset, depending on the use cases
- RAW data stored in an optimized file format
- Tokenised, normalised, data marts, ...
Security
- Sensitive data are tokenised
- Business users access this layer
(Diagram: Landing, Archive, and Presentation zones)
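A minimal sketch of materializing a cleaned ORC copy for this layer, assuming PyHive; the database, table, and column names are hypothetical:

```python
# Hedged sketch: CTAS from the raw landing table into an ORC-backed
# presentation table, with a trivial cleaning predicate.
from pyhive import hive

conn = hive.connect(host="hiveserver", port=10000, username="etl_user")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE presentation.orders_orc
    STORED AS ORC
    AS SELECT order_id, customer_id, amount, order_date
         FROM landing.orders_raw
        WHERE order_date IS NOT NULL
""")
```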
27. Multi-tenant Environment: Development and Test Layer
Third-party tools move data from landing into the dev & test zone
PII information is encrypted using a third-party solution
- One-way tokenisation (sketched below)
- Data is consistently tokenised, enabling joins between different datasets
Benefit
- Development is done against a realistic dataset (volume & format)
- Gives access to the data scientist team
(Diagram: Landing feeding the Dev / Test / … zones)
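A minimal sketch of one-way, consistent tokenisation using HMAC-SHA256 (the deck does not name the third-party product, so this stands in for the idea): the same input always yields the same token, so joins still line up, but the original value cannot be recovered. The key is hypothetical and belongs in a secrets store:

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical; never hard-code

def tokenize(value: str) -> str:
    """Deterministic, irreversible token for a PII value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# The same customer ID tokenises identically in every dataset,
# so dev/test joins keep working:
assert tokenize("customer-42") == tokenize("customer-42")
```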
28. Multi-tenant Environment: Data Exploration Layer
Data
- Accessed from the presentation layer
Benefit
- Gives data scientist teams access to a version of the production data
- Allows the data science team to acquire ad-hoc external datasets (a sketch follows)
(Diagram: Landing, Dev / Test / …, and Data Exploration zones)
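A minimal sketch of acquiring an ad-hoc external dataset, assuming the `hdfs` (WebHDFS) Python package; the file and paths are hypothetical:

```python
# Hedged sketch: a data scientist drops an external CSV into the
# exploration zone.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="datascience")
client.upload("/lake/exploration/external/census.csv", "census.csv")
```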
29. Multi-tenant Environment: Production Layer
Third-party tools move data from landing into the dev & test zone
PII information is encrypted using a third-party solution
- Reversible tokenisation (a vault-style sketch follows)
- Data is consistently tokenised, enabling joins between different datasets
(Diagram: Landing, Dev / Test / …, Prod, and Data Exploration zones)
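For the production layer the tokenisation must be reversible. A minimal vault-style sketch of the idea (in practice the vault would be a secured store, not an in-memory dict):

```python
import uuid

class TokenVault:
    """Hedged sketch: token <-> value lookup for reversible tokenisation."""

    def __init__(self):
        self._by_value = {}  # original value -> token
        self._by_token = {}  # token -> original value

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so tokenisation stays consistent
        # and joins between datasets keep working.
        if value not in self._by_value:
            token = uuid.uuid4().hex
            self._by_value[value] = token
            self._by_token[token] = value
        return self._by_value[value]

    def detokenize(self, token: str) -> str:
        return self._by_token[token]

vault = TokenVault()
t = vault.tokenize("maria@example.com")
assert vault.detokenize(t) == "maria@example.com"
```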
30. Best Practices
Do's:
- Create a catalogue of datasets in Atlas (data owner, source system, projects using it)
- Keep multiple copies of the same data (raw, optimized, tokenized)
- Disaster recovery: run Dev / Test / Data Exploration on the DR cluster and define prioritized workloads
Don'ts:
- Create dataset structures based upon projects (datasets will be reused across projects)
- Give write access to business users
Data may be the new oil, the new rush companies will face to multiply their profits!
Correctly collecting, processing, and analyzing data can be a competitive differentiator for every business.
Of course, like oil, data also needs to be refined for better results.
This list is an example of possible sources, but we will likely have many more.
New tools allow connecting to and capturing data from many categories of software, and even from electronic devices that expose data for capture.
And of course, this is in addition to the traditional data we already pull from other systems, databases, and text files.
Reference - http://voltdb.com/blog/big-data/big-data-value-continuum/
Too many software tools? Please stay calm; we will talk about this a bit further on.
This “wordy” slide is straight from the project’s self-description and warrants a splash before we go much further…
So what is Apache Hadoop? It is a scalable, fault tolerant, open source framework for the distributed storing and processing of large sets of data on commodity hardware. But what does all that mean?
Well first of all it is scalable. Hadoop clusters can range from as few as one machine to literally thousands of machines. That is scalability!
It is also fault tolerant. Hadoop services become fault tolerant through redundancy. For example, the Hadoop Distributed File System, called HDFS, automatically replicates data blocks to three separate machines, assuming that your cluster has at least three machines in it. Many other Hadoop services are replicated, too, in order to avoid any single points of failure.
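To make the replication point concrete, here is a small hedged sketch: WebHDFS reports each file's replication factor and block size, shown via the `hdfs` Python package with a hypothetical NameNode address and path:

```python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")
status = client.status("/lake/landing/sales/orders/part-m-00000")
print(status["replication"], status["blockSize"])  # e.g. 3 134217728
```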
Hadoop is also open source. Hadoop development is a community effort governed under the licensing of the Apache Software Foundation. Anyone can help to improve Hadoop by adding features, fixing software bugs, or improving performance and scalability.
Hadoop also uses distributed storage and processing. Large datasets are automatically split into smaller chunks, called blocks, and distributed across the cluster machines. Not only that, but each machine processes its local block of data. This means that processing is distributed too, potentially across hundreds of CPUs and hundreds of gigabytes of memory.
All of this occurs on commodity hardware which reduces not only the original purchase price, but also potentially reduces support costs as well.
At the most granular level, Hadoop is an engine that provides storage via HDFS and compute via YARN. The “ecosystem” tools wrap around this core.
Hadoop is not a monolithic piece of software. It is a collection of architectural pillars that contain software frameworks. Most of the frameworks are part of the Apache software ecosystem. The picture illustrates the Apache frameworks that are part of the Hortonworks Hadoop distribution.
So why does Hadoop have so many frameworks and tools? The reason is that each tool is designed for a specific purpose. The functionality of some tools overlap but typically one tool is going to be better than others when performing certain tasks.
For example, both Apache Storm and Apache Flume ingest data and perform real-time analysis. But Storm has more functionality and is more powerful for real-time data analysis.
Here is an example cluster with three master nodes, 12 worker nodes, and two utility nodes. The cluster is running various services, like YARN and HDFS. Services can be implemented by one or more service components.
The three master nodes are running service master components. The 12 worker nodes are running service worker components, sometimes called slave components. The two utility nodes are running service components that provide access, security, and management services for the cluster.
This page does not illustrate all services, service master, or service worker components. More detail is provided in other lessons.
Break glass?
If data needs to be reprocessed, copy it from Archive back into Landing (a DistCp sketch follows).
HAR files are tracked by Atlas.
ISO 27001: data & processing should be separated, which does not mean separate environments.
Separate dev & test environments are used for upgrade/patch testing; they can be smaller, virtualised, etc.
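A hedged sketch of that break-glass reprocessing path, with hypothetical paths; DistCp copies the archived data back into Landing:

```python
import subprocess

subprocess.run(
    ["hadoop", "distcp",
     "/lake/archive/sales/2018-01",
     "/lake/landing/sales/2018-01-replay"],
    check=True,
)
```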