DIRECTION
REDACTOR DATE
BIG DATA PLATFORM
INDUSTRIALIZATION
BIG DATA INITIATIVES @Renault
2014
Big Data Sandbox on old
HPC Infrastructure.
Site: Innovation LAB
POC: Quality Data
Exploration
2015
DataLab Implementation
New HP Infrastructure
Data Protection: NO
1st Level of
Industrialization
2016
Big Data Platform
Industrialization to host
both POCs and Projects in
Production.
Data Protection: YES
Big Data Deployment Production Stakes
• One Hadoop cluster with 24/7, always-on visibility of data (instead of siloing it).
• Many possibilities for crossing data
• Simplify Operations
• Design Simplicity
• Charge Back Model
• Scalability and Isolation
• Isolate Experimental applications from Production
Deployment of Quality Projects on the Data Lake
Big Data
Developers
Client Server
(Hadoop Clients
Installed)
DMZ
KNOX
GATEWAY
Search
Node
Edge
Node
Name
Node
DataStore
Master
Data Nodes (DN): two rows of nine
Data Sources
LoadBalancer
Web
Applications
Web service
Import
Access GUI
Search
Node
Quality
Sales and
Marketing
Supply Chain Engineering
Consumers
Open Data
Internet of Things
Producers
Batch (RDBMS,
Files)
Messages, Logs
Streaming, Data Flow
NFS Gateway, Sqoop,
Spark SQL
FLUME
LOGSTASH
KAFKA PRODUCERS
Kafka
Broker
(Topics)
Spark
Streaming
Elasticsearch
HBASE
HIVE
HDFS
Spark SQL
Spark RDD
Big Data Ecosystem @ Renault
YARN + HDFS
Data Ingestion Scenarios : RDBMS
RDBMS
Sqoop
Spark SQL
HIVE
Flat Files
HBase
Basic data import based on column id (Integer) or timestamp
Example: SOPHIA
ELT Architecture: Extract, Load and Transform
Non-standard Data Import, Specific Schema
Example: BLMS
Supports ETL Architecture: Extract, Transform and Load
Supports Ingestion directly into Elasticsearch
INSERT ONLY
NOCTURNAL BATCH
SQL QUERIES
HIGH LATENCY
INSERT ONLY
NOCTURNAL BATCH
DATA PROCESSING
File Formats: CSV,
PARQUET, AVRO
INSERT AND UPDATE
NOSQL DB (KEY-VALUE SCHEMA)
LOW LATENCY
FOR SCALING OUT RELATIONAL
DB ON HADOOP
INTERACTIVE ANALYTICS
(SPOTFIRE)
VERY LOW LATENCY (SSD DISKS)
NEAR REAL TIME ANALYTICS
(WATCHING ALERTS)
TEXT SEARCH (LOG ANALYSIS)
NESTED and PARENT/CHILD
RELATIONSHIPS
Elasticsearch
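The "basic data import based on column id (Integer) or timestamp" scenario can be sketched in a few lines. This is only an illustration of the idea (pull the rows whose check column is above the last value seen, then remember the new high-water mark), not the platform's actual Sqoop job. The `quality_events` table, its columns and the in-memory SQLite database standing in for the RDBMS are all assumptions made for the example.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch rows newer than last_value and return them with the new high-water mark.

    table and check_column are assumed to be trusted identifiers (not user input).
    """
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}",
        (last_value,),
    )
    rows = cur.fetchall()
    columns = [d[0] for d in cur.description]
    idx = columns.index(check_column)
    # If nothing new arrived, the mark does not move.
    new_last = rows[-1][idx] if rows else last_value
    return rows, new_last


# Hypothetical source table standing in for the RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quality_events (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO quality_events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)

# First nightly batch: everything above id 0.
first_batch, mark = incremental_import(conn, "quality_events", "id", 0)

# A new row arrives; the next batch only picks up what was added since.
conn.execute("INSERT INTO quality_events VALUES (4, 'd')")
second_batch, mark = incremental_import(conn, "quality_events", "id", mark)
```

The same function works with a timestamp column as the check column, provided the stored values sort chronologically. Because only rows above the mark are read, this pattern fits the INSERT ONLY case on the slide; updated rows would need a last-modified column instead.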
Interactive SQL Data Analytics
Main Objectives
• Speeding up BI queries on Big Data Stores
• Hide the complexity of the Big Data architecture from the end user and provide only one data
connector for Spotfire
• Provide Interactive User Experience
• Data Virtualization (no need to systematically import RDBMS data in order to cross data)
→ Spark SQL (1.6): The emerging solution for Interactive SQL Data Analytics with the Data
Source API.
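The "only one data connector" objective can be pictured as a single `load` entry point that dispatches to pluggable readers, so that data from different stores can be crossed without first copying everything into one system. The sketch below is a plain-Python analogy of that idea, not Spark SQL's Data Source API: the `register`, `load` and `join` helpers, the sample CSV text and the `vehicles` table are all hypothetical.

```python
import csv
import io
import sqlite3

READERS = {}  # format name -> reader function


def register(fmt):
    """Decorator that plugs a reader in under a format name."""
    def wrap(fn):
        READERS[fmt] = fn
        return fn
    return wrap


@register("csv")
def read_csv(text):
    # Each row becomes a dict keyed by the header line.
    return list(csv.DictReader(io.StringIO(text)))


@register("sql")
def read_sql(conn, query):
    cur = conn.execute(query)
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur.fetchall()]


def load(fmt, *args, **options):
    """The single connector: callers only name the format."""
    return READERS[fmt](*args, **options)


def join(left, right, key):
    """Inner join of two lists of dicts on a shared key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical relational source, queried in place.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (vin TEXT, model TEXT)")
conn.executemany(
    "INSERT INTO vehicles VALUES (?, ?)", [("V1", "ModelA"), ("V3", "ModelC")]
)

defects = load("csv", "vin,defect\nV1,paint\nV2,brake\n")
vehicles = load("sql", conn, "SELECT vin, model FROM vehicles")
crossed = join(defects, vehicles, "vin")
```

Adding a new store then means registering one more reader; the calling code and the BI tool in front of it keep using the same entry point.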
Only One Data Connector
Data Processing Applications
Add In-Memory Capability
Load/Insert Load/Insert Load/Insert
Interactive SQL Data Analytics
HBASE HIVE Elasticsearch
Spark SQL
Files Parquet
DataSource API
Load/Insert
RDBMS
Load/Insert
Big Data In Action
Multitenancy
High Availability
Security
Data Governance
Policy
Continuous Delivery
Data Protection
Hadoop
Organization
POC#0
POC#1
POC#2
POC#n
Release
Management
PRODUCTION
SLA & Monitoring
Tenant 1
App 1
Tenant 2
App 1
App 2
Tenant n
App 1
Management and
Monitoring
DataLab, Open to
Data Exploration
Bi-modal Big Data Platform
One physical, cost-effective platform
Hadoop Global Data Life Cycle
Data Sources
Load or archive
batch data
Stream real time
data
Mask Sensitive data
with Automated
Process
Renault Big Data
Platform
Refine, curate, process,
query data
Big Data Web
Services
INGESTION
Scheduled and
Monitored by
AITS in PROD
Policies pre-defined by
Security Officer
ADA
ARCA
subscription
request to
datasets
Data
Access
AITS: Groups
Management
DIRx Data
Owner
Validation
Sync
Users
Defined in Ranger
and Protegrity
Defined in
Falcon
Active / Active Hadoop Platform
Rack Room 1 C2 Rack Room 2 C2
Data Nodes
for Block Storage
HDFS PROD HDFS POC
High Availability Architecture
Hadoop Security Levels
OS Security
Authorization
Perimeter Level Security
Protected Zone
Data Protection
Selected Solution :
Tokenization
(Protegrity)
Tokenization Definition
Selected solution: Tokenization
• Tokenization is a form of data protection that
converts sensitive data into fake data.
• The real data can be retrieved by authorized
users.
• Protegrity: the only available solution for Hadoop
(also supports traditional data systems)
Data Protection Key Requirements:
• Ability to de-identify Personally Identifiable Data
• Restrict data access (financial data, Data
residency obligations, …)
• Provide central management and control of all
data security operations
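To make the definition concrete, here is a minimal vault-style tokenization sketch: sensitive values are swapped for random look-alike tokens, and only authorized users can map a token back to the real value. It is a toy model of the concept, not Protegrity's product or API. The `TokenVault` class and the user names are hypothetical; the VIN is the one used in the deck's own example.

```python
import secrets
import string


class TokenVault:
    """Toy vault-based tokenizer: tokens are random and keep the input's shape."""

    def __init__(self, authorized_users):
        self._authorized = set(authorized_users)
        self._value_by_token = {}
        self._token_by_value = {}

    @staticmethod
    def _random_like(ch):
        # Replace each character with a random one of the same class.
        for alphabet in (string.digits, string.ascii_uppercase, string.ascii_lowercase):
            if ch in alphabet:
                return secrets.choice(alphabet)
        return ch  # punctuation and spaces are kept as-is

    def tokenize(self, value):
        if value in self._token_by_value:
            return self._token_by_value[value]  # same input, same token
        if not any(ch in string.ascii_letters + string.digits for ch in value):
            raise ValueError("nothing to tokenize in this value")
        for _ in range(1000):  # bounded retries in case of a collision
            token = "".join(self._random_like(ch) for ch in value)
            if token != value and token not in self._value_by_token:
                self._value_by_token[token] = value
                self._token_by_value[value] = token
                return token
        raise RuntimeError("could not find a free token")

    def detokenize(self, token, user):
        if user not in self._authorized:
            raise PermissionError(f"{user} may not read clear data")
        return self._value_by_token[token]


vault = TokenVault(authorized_users={"alice"})  # "alice" is a hypothetical user
vin = "VF1112C0000724284"
token = vault.tokenize(vin)
clear = vault.detokenize(token, "alice")
```

Keeping the token in the same format as the original (same length, digits stay digits, letters stay letters) means downstream jobs and schemas can keep working on protected data; only the mapping held in the vault reveals the real value.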
DATA PROTECTION Example 1
Fine Grained Protection
Identifier | Clear | Protected | Authorized Role 1 (can see most data in the clear) | Authorized Role 2 (can see limited data in the clear)
--- | --- | --- | --- | ---
Name | Joe Smith | csu wusoj | Joe Smith | Joe Smith
Address | 100 Main Street, Pleasantville, CA | 476 srta coetse, cysieondusbak, HA | 100 Main Street, Pleasantville, CA | “No Access”
Date of Birth | 12/25/1966 | 01/02/1966 | 12/25/1966 | 01/02/1966
VIN | VF1112C0000724284 | AB9875R8467364752 | VF1112C0000724284 | “No Access”
Credit Card Number | 3678 2289 3907 3378 | 3846 2290 3371 3890 | xxxx xxxx xxxx 3378 | 3846 2290 3371 3890
E-mail Address | joe.smith@surferdude.org | eoe.nwuer@beusorpdqo.aku | joe.smith@surferdude.org | joe.smith@surferdude.org
Telephone Number | 760-278-3389 | 998-389-2289 | 760-278-3389 | 998-389-2289
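The per-role columns of this example can be read as a small policy table: for each field, a role sees the clear value, the protected value, a partially masked value, or nothing. The sketch below encodes a few rows of the example that way. The rule names (`clear`, `protected`, `mask`, `none`) and the role keys are assumptions chosen for the illustration; the values are the ones shown in the example.

```python
# Clear and protected values copied from the example.
CLEAR = {
    "Name": "Joe Smith",
    "Date of Birth": "12/25/1966",
    "VIN": "VF1112C0000724284",
    "Credit Card Number": "3678 2289 3907 3378",
}
PROTECTED = {
    "Name": "csu wusoj",
    "Date of Birth": "01/02/1966",
    "VIN": "AB9875R8467364752",
    "Credit Card Number": "3846 2290 3371 3890",
}

# Hypothetical encoding of what each role may see, field by field.
POLICY = {
    "role1": {
        "Name": "clear",
        "Date of Birth": "clear",
        "VIN": "clear",
        "Credit Card Number": "mask",
    },
    "role2": {
        "Name": "clear",
        "Date of Birth": "protected",
        "VIN": "none",
        "Credit Card Number": "protected",
    },
}


def mask_all_but_last_group(value):
    """Hide every space-separated group except the last one."""
    groups = value.split()
    return " ".join(["xxxx"] * (len(groups) - 1) + [groups[-1]])


def view(role, field):
    """Return what the given role is allowed to see for one field."""
    rule = POLICY[role][field]
    if rule == "clear":
        return CLEAR[field]
    if rule == "protected":
        return PROTECTED[field]
    if rule == "mask":
        return mask_all_but_last_group(CLEAR[field])
    return "No Access"


role1_card = view("role1", "Credit Card Number")
role2_vin = view("role2", "VIN")
role2_birth = view("role2", "Date of Birth")
```

Holding the policy as data rather than code is what makes the protection fine grained: changing what a role sees for one field is a one-line policy change.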
DATA PROTECTION Example 2
Data Residency
The Next Step: HDFS FEDERATION
Name Service 1
/store1/
Name Service 2
/store2/
Name
Node 1
Name
Node 2
Name
Node 3
Name
Node 4
Data
Node 1
Data
Node 2
Data
Node 3
Data
Node 4
Data
Node 5
Data
Node 6
Data
Node 7
Data
Node 8
Federation
Scale-out Data Nodes
Resources usage orchestrated by YARN
HA HA
• can see and access only /store2
• cannot access /store1
• cannot access
Name Node 1 and 2
• can see and access only /store1
• cannot access /store2
• cannot access
Name Node 3 and 4
PHYSICAL ISOLATION
Privileged Users can
access /store1 and /store2
for Crossing Data
Block Storage
Unreadable Data
Data
Node 9
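The isolation rules of this slide can be sketched as a mount table plus a grant table: each top-level path belongs to one name service (served by its own pair of Name Nodes), and a user reaches a path only if that mount was granted. This is a conceptual model, not HDFS configuration or an HDFS API. The user names and the `resolve` and `can_access` helpers are hypothetical; the paths and Name Node pairings follow the slide.

```python
from pathlib import PurePosixPath

# Mount point -> (name service, Name Nodes serving it), as drawn on the slide.
MOUNTS = {
    "/store1": ("Name Service 1", ("Name Node 1", "Name Node 2")),
    "/store2": ("Name Service 2", ("Name Node 3", "Name Node 4")),
}

# Hypothetical users and the mounts they have been granted.
GRANTS = {
    "tenant_a": {"/store1"},
    "tenant_b": {"/store2"},
    "privileged": {"/store1", "/store2"},  # may cross data from both stores
}


def resolve(path):
    """Map an absolute path to its mount point and name service."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2 or parts[0] != "/":
        raise ValueError(f"not an absolute path under a mount: {path}")
    mount = "/" + parts[1]
    if mount not in MOUNTS:
        raise KeyError(f"no name service mounted at {mount}")
    service, _name_nodes = MOUNTS[mount]
    return mount, service


def can_access(user, path):
    """A user reaches a path only through a mount they were granted."""
    mount, _service = resolve(path)
    return mount in GRANTS.get(user, set())


routed = resolve("/store1/quality/part-0001.parquet")
```

In this model a tenant limited to /store1 never talks to Name Nodes 3 and 4 at all, which is the isolation the slide describes, while Data Nodes remain a shared, scale-out pool underneath both name services.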
BIG DATA PROJECT PROCESS
Business
Use Case DIRx
POC PROD Project
Data
Exploration
AT
Data Sources Ingestion
Security Policies
Data Processing development
User Access Authorization
CPT
DIA - Innovation Front End deployment
Admin
@BICC
architecture
development
MEP
Implementation
Deployment
management
DAT
DATA CHANGE MANAGEMENT
• Publish a real-time dashboard of all data flows.
• For each data source the producer and the consumers will be displayed:
• Producer: the DIRx generating the Data
• Consumers: all the Big Data applications using this Data source
• If the Producer decides to change the data source schema, they must send a notification
from the dashboard to all consumers.
• Monitoring the logs of data ingestion jobs in real time makes failure detection faster and
more efficient.
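The producer/consumer bookkeeping described above can be sketched as a small registry: each data source has one producer and a set of consuming applications, and a schema change announced by the producer fans out as one notification per consumer. The `DataFlowRegistry` class, the producer name and the application names are hypothetical; SOPHIA is used only because the deck cites it as an example source.

```python
class DataFlowRegistry:
    """Tracks who produces and who consumes each data source."""

    def __init__(self):
        self._producer = {}   # source -> producing DIRx
        self._consumers = {}  # source -> set of consuming applications
        self.outbox = []      # notifications sent so far: (application, message)

    def register_source(self, source, producer):
        self._producer[source] = producer
        self._consumers.setdefault(source, set())

    def subscribe(self, source, application):
        if source not in self._producer:
            raise KeyError(f"unknown data source: {source}")
        self._consumers[source].add(application)

    def flows(self):
        """Dashboard view: producer and consumers for every source."""
        return {
            source: (producer, sorted(self._consumers[source]))
            for source, producer in self._producer.items()
        }

    def announce_schema_change(self, source, producer, description):
        # Only the registered producer may announce a change to its source.
        if self._producer.get(source) != producer:
            raise PermissionError(f"{producer} does not produce {source}")
        notes = [
            (app, f"{source}: {description}")
            for app in sorted(self._consumers[source])
        ]
        self.outbox.extend(notes)
        return notes


registry = DataFlowRegistry()
registry.register_source("SOPHIA", producer="DIR-Quality")  # hypothetical DIRx name
registry.subscribe("SOPHIA", "alerts-app")       # hypothetical applications
registry.subscribe("SOPHIA", "spotfire-report")
sent = registry.announce_schema_change(
    "SOPHIA", "DIR-Quality", "column 'status' renamed to 'state'"
)
```

Because consumers are recorded at subscription time, the producer does not need to know who depends on the source; the registry works out the recipients.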
COLLECT LOGS FOR (NRT) CARTO
Sqoop
Metastore
Storing Jobs
Hive
Metastore
Storing SQL Schema
Oozie
Metastore
Storing
Workflows
Oozie Logs
Yarn
Job History Logs
Storing
Projects
Hue
Database
Elasticsearch
JDBC PLUGIN
LOGSTASH
Kibana
Dynamic Relationship Portal
Big Data Platform Industrialization
