My presentation on building data lakes for big data using Hadoop as the data platform. Learn more about our consulting and training work in Hortonworks Hadoop, Big Data, Data Warehousing, and Business Intelligence.
1. Data Lakes
a practical view
Marco Garcia
CTO, Founder – Cetax, TutorPro
mgarcia@cetax.com.br
https://www.linkedin.com/in/mgarciacetax/
2. With more than 20 years of experience in IT, 18 of them exclusively with Business Intelligence, Data Warehousing, and Big Data, Marco Garcia is certified by Kimball University in the USA, where he was taught in person by Ralph Kimball, one of the leading gurus of data warehousing.
First Hortonworks Certified Instructor in LATAM
Data Architect and Instructor at Cetax Consultoria
About the Speaker
5. The ability to learn or understand or to deal with new or trying situations: reason; also: the skilled use of reason; the ability to apply knowledge to manipulate one's environment or to think abstractly as measured by objective criteria (such as tests).
What is intelligence?
Data Lake?
7. Data Warehouse vs. Data Lake
https://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
Bottled water:
- Clean
- Treated
- Packaged
- Ready for consumption
Data lake:
- Raw
- Untreated
- Needs to be worked before it can be consumed
8. DATA IS THE NEW OIL!
Like oil, data needs to be refined!
13. What is Apache Hadoop?
The Apache Hadoop project describes the technology as a software framework that:
- Allows for the distributed processing of large data sets across clusters of computers using simple programming models
- Is designed to scale up from single servers to thousands of machines, each offering local computation and storage
- Does not rely on hardware to deliver high availability; rather, the library itself is designed to detect and handle failures at the application layer
- Delivers a highly available service on top of a cluster of computers, each of which may be prone to failures
Source: http://hadoop.apache.org
18. (Architecture diagram: a cluster of compute & storage nodes running YARN, with Knox for perimeter security and Ambari for management.)
Step 1 (Extract & Load): Sqoop, Flume, NiFi, and Kafka load source data (app/system logs, customer/inventory data, transaction/sales data, flat files, Twitter/Facebook streams) arriving over DB, file, JMS, REST, HTTP, and streaming channels.
Step 2 (Model/Apply Metadata): HCatalog holds the table metadata.
Step 3 (Transform, Aggregate & Materialize): Hive and Pig perform the data processing.
Steps 1-3: the data lifecycle is managed with Falcon.
Step 4a (Publish/Exchange): results are loaded out via Sqoop/Hive and WebHDFS to RDBMS and No/New SQL stores (Oracle, HANA) and the EDW (SAP BW).
Step 4b (Explore/Visualize): query/visualization/reporting tools (SAP BO, Tableau/Excel, any JDBC-compliant tool) connect to the interactive HiveServer.
Step 4c (Analyze): analytical tools (SAS, Python, R, MATLAB) run against the cluster via analytical and streaming application masters.
19. Steps for the Data Lake
Step 1 - Extract and Load (a minimal Sqoop sketch follows this list)
Step 2 - Model and Apply the Metadata
Step 3 - Transform, Aggregate, and Materialize the Data
Step 4a - Publish or Exchange Data
Step 4b - Explore and Visualize
Step 4c - Analyze, Do Data Science
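A minimal sketch of Step 1, assuming Sqoop is on the path; the MySQL host, database, table, and landing path are hypothetical:

```python
# Hedged sketch: load one RDBMS table into the landing zone with Sqoop.
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",  # hypothetical source
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop.pwd",      # keep secrets off the command line
        "--table", "orders",
        "--target-dir", "/lake/landing/sales/orders",   # landing zone path
        "--as-avrodatafile",                            # raw but splittable on HDFS
    ],
    check=True,
)
```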
21. Key Points
Align the data lake with the organizational structure
Create zones in the data lake (ingest zone, transformation zone, presentation zone); see the sketch after this list
Data ingestion processes
Security
Data lineage
Understand the requirements
Integrations will be necessary!
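As a minimal sketch of the zones idea, assuming the `hdfs` Python package (WebHDFS) and a hypothetical NameNode address, the zones can start as top-level HDFS directories:

```python
# Hedged sketch: one top-level HDFS directory per data lake zone.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")  # hypothetical address
for zone in ("landing", "archive", "presentation", "exploration"):
    client.makedirs(f"/lake/{zone}")
```

Per-zone permissions and encryption come next; the landing zone case is covered on the following slides.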
22. The Organization's Logical Structure
Align the structure by functions rather than by departments or teams; organizations change, but functions are almost always similar.
Think of it as a long-term investment.
Always stay alert to regulations and to internal or even external controls.
Think of the data lake in layers.
24. HDFS Layer: Landing Zone
Data is written into the landing zone (Sqoop, HDF, Flume, …) in RAW format
Security
- Contains PII information
- The landing zone uses HDFS TDE for data protection (a setup sketch follows)
- Only ETL tools access this layer
- Access by data wranglers only
Data retention is limited (< 1 month)
(Diagram: an RDBMS feeding the Landing zone via Sqoop and NiFi)
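A hedged sketch of the TDE setup, assuming a Hadoop KMS is already configured; the key name and path are hypothetical:

```python
# Hedged sketch: create a KMS key, then mark the (empty) landing directory
# as an HDFS encryption zone so data is encrypted at rest.
import subprocess

subprocess.run(["hadoop", "key", "create", "landing_key"], check=True)
subprocess.run(
    ["hdfs", "crypto", "-createZone",
     "-keyName", "landing_key",
     "-path", "/lake/landing"],
    check=True,
)
```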
25. HDFS Layer: Archival Layer
Data is compressed into large files
- Hadoop Archive (HAR) solves the small-file problem (a packing sketch follows)
Data is automatically removed
- Retention policy managed via Falcon
Security
- The archive zone uses HDFS TDE for data protection
- A limited set of users can access it
- HDFS tiering
(Diagram: data moves from Landing to Archive)
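A minimal sketch of the HAR packing step, with hypothetical paths; `hadoop archive` launches a MapReduce job that bundles many small landing files into one archive:

```python
# Hedged sketch: archive one month of landing data into a single HAR file.
import subprocess

subprocess.run(
    ["hadoop", "archive",
     "-archiveName", "sales-2018-01.har",
     "-p", "/lake/landing/sales",  # parent of the source directories
     "2018-01",                    # source, relative to the parent
     "/lake/archive/sales"],       # destination directory
    check=True,
)
```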
26. HDFS Layer: Presentation Layer
Data moves from Landing into the presentation layer
- Data is cleaned as part of the ETL
Optimized file formats (ORC, Parquet, Avro, …); a materialization sketch follows
Multiple copies of the same dataset, depending on the use cases
- RAW data stored in an optimized file format
- Tokenised, normalised, data marts, ...
Security
- Sensitive data are tokenised
- Business users access this layer
(Diagram: Landing, Archive, and Presentation zones)
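A minimal sketch of materializing a cleaned ORC copy for this layer, assuming PyHive; the database, table, and column names are hypothetical:

```python
# Hedged sketch: CTAS from the raw landing table into an ORC-backed
# presentation table, with a trivial cleaning predicate.
from pyhive import hive

conn = hive.connect(host="hiveserver", port=10000, username="etl_user")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE presentation.orders_orc
    STORED AS ORC
    AS SELECT order_id, customer_id, amount, order_date
         FROM landing.orders_raw
        WHERE order_date IS NOT NULL
""")
```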
27. Multi-tenant Environment: Development and Test Layer
Third-party tools move data from landing into the dev & test zone
PII information is encrypted using a third-party solution
- One-way tokenisation (sketched below)
- Data is consistently tokenised, enabling joins between different datasets
Benefit
- Development is done against a realistic dataset (volume & format)
- Gives access to the data scientist team
(Diagram: Landing feeding the Dev / Test / … zones)
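A minimal sketch of one-way, consistent tokenisation using HMAC-SHA256 (the deck does not name the third-party product, so this stands in for the idea): the same input always yields the same token, so joins still line up, but the original value cannot be recovered. The key is hypothetical and belongs in a secrets store:

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical; never hard-code

def tokenize(value: str) -> str:
    """Deterministic, irreversible token for a PII value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# The same customer ID tokenises identically in every dataset,
# so dev/test joins keep working:
assert tokenize("customer-42") == tokenize("customer-42")
```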
28. Multi-tenant Environment: Data Exploration Layer
Data
- Accessed from the presentation layer
Benefit
- Gives data scientist teams access to a version of the production data
- Allows the data science team to acquire ad-hoc external datasets (a sketch follows)
(Diagram: Landing, Dev / Test / …, and Data Exploration zones)
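A minimal sketch of acquiring an ad-hoc external dataset, assuming the `hdfs` (WebHDFS) Python package; the file and paths are hypothetical:

```python
# Hedged sketch: a data scientist drops an external CSV into the
# exploration zone.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="datascience")
client.upload("/lake/exploration/external/census.csv", "census.csv")
```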
29. Multi-tenant Environment: Production Layer
Third-party tools move data from landing into the dev & test zone
PII information is encrypted using a third-party solution
- Reversible tokenisation (a vault-style sketch follows)
- Data is consistently tokenised, enabling joins between different datasets
(Diagram: Landing, Dev / Test / …, Prod, and Data Exploration zones)
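For the production layer the tokenisation must be reversible. A minimal vault-style sketch of the idea (in practice the vault would be a secured store, not an in-memory dict):

```python
import uuid

class TokenVault:
    """Hedged sketch: token <-> value lookup for reversible tokenisation."""

    def __init__(self):
        self._by_value = {}  # original value -> token
        self._by_token = {}  # token -> original value

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so tokenisation stays consistent
        # and joins between datasets keep working.
        if value not in self._by_value:
            token = uuid.uuid4().hex
            self._by_value[value] = token
            self._by_token[token] = value
        return self._by_value[value]

    def detokenize(self, token: str) -> str:
        return self._by_token[token]

vault = TokenVault()
t = vault.tokenize("maria@example.com")
assert vault.detokenize(t) == "maria@example.com"
```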
30. Best Practices
Do's:
- Create a catalogue of datasets in Atlas (data owner, source system, projects using it)
- Keep multiple copies of the same data (raw, optimized, tokenized)
- Disaster recovery: run Dev / Test / Data Exploration on the DR cluster and define prioritized workloads
Don'ts:
- Create dataset structures based upon projects (datasets will be reused across projects)
- Give write access to business users
Data may be the new oil, the new rush companies will face to multiply their profits!
Correctly collecting, processing, and analyzing data can be a competitive differentiator for every business.
Of course, like oil, data also needs to be refined for better results.
This list is an example of possible sources, but we will likely have many more.
New tools allow connecting to and capturing data from many categories of software, and even from electronic devices that expose data for capture.
And of course, this is in addition to the traditional data we already pull from other systems, databases, and text files.
Reference - http://voltdb.com/blog/big-data/big-data-value-continuum/
Too many software tools? Please stay calm; we will talk about this a bit further on.
This “wordy” slide is straight from the project’s self-description and warrants a splash before we go much further…
So what is Apache Hadoop? It is a scalable, fault tolerant, open source framework for the distributed storing and processing of large sets of data on commodity hardware. But what does all that mean?
Well first of all it is scalable. Hadoop clusters can range from as few as one machine to literally thousands of machines. That is scalability!
It is also fault tolerant. Hadoop services become fault tolerant through redundancy. For example, the Hadoop Distributed File System, called HDFS, automatically replicates data blocks to three separate machines, assuming that your cluster has at least three machines in it. Many other Hadoop services are replicated, too, in order to avoid any single points of failure.
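To make the replication point concrete, here is a small hedged sketch: WebHDFS reports each file's replication factor and block size, shown via the `hdfs` Python package with a hypothetical NameNode address and path:

```python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")
status = client.status("/lake/landing/sales/orders/part-m-00000")
print(status["replication"], status["blockSize"])  # e.g. 3 134217728
```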
Hadoop is also open source. Hadoop development is a community effort governed under the licensing of the Apache Software Foundation. Anyone can help to improve Hadoop by adding features, fixing software bugs, or improving performance and scalability.
Hadoop also uses distributed storage and processing. Large datasets are automatically split into smaller chunks, called blocks, and distributed across the cluster machines. Not only that, but each machine processes its local block of data. This means that processing is distributed too, potentially across hundreds of CPUs and hundreds of gigabytes of memory.
All of this occurs on commodity hardware which reduces not only the original purchase price, but also potentially reduces support costs as well.
At the most granular level, Hadoop is an engine that provides storage via HDFS and compute via YARN. The “ecosystem” tools wrap around this core.
Hadoop is not a monolithic piece of software. It is a collection of architectural pillars that contain software frameworks. Most of the frameworks are part of the Apache software ecosystem. The picture illustrates the Apache frameworks that are part of the Hortonworks Hadoop distribution.
So why does Hadoop have so many frameworks and tools? The reason is that each tool is designed for a specific purpose. The functionality of some tools overlap but typically one tool is going to be better than others when performing certain tasks.
For example, both Apache Storm and Apache Flume ingest data and perform real-time analysis. But Storm has more functionality and is more powerful for real-time data analysis.
Here is an example cluster with three master nodes, 12 worker nodes, and two utility nodes. The cluster is running various services, like YARN and HDFS. Services can be implemented by one or more service components.
The three master nodes are running service master components. The 12 worker nodes are running service worker components, sometimes called slave components. The two utility nodes are running service components that provide access, security, and management services for the cluster.
This page does not illustrate all services, service master, or service worker components. More detail is provided in other lessons.
Break glass?
If data needs to be reprocessed, copy it from Archive back into Landing (a DistCp sketch follows).
HAR files are tracked by Atlas.
ISO 27001: data & processing should be separated, which does not mean separate environments.
Separate dev & test environments are used for upgrade/patch testing; they can be smaller, virtualised, etc.
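A hedged sketch of that break-glass reprocessing path, with hypothetical paths; DistCp copies the archived data back into Landing:

```python
import subprocess

subprocess.run(
    ["hadoop", "distcp",
     "/lake/archive/sales/2018-01",
     "/lake/landing/sales/2018-01-replay"],
    check=True,
)
```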