The document summarizes a presentation on LinkedIn's use of Hadoop and HDFS to store and process over 1 exabyte of data across multiple clusters, followed by an introduction to AWS Glue DataBrew and a CDK-based CI/CD demonstration for data preparation. Key points:
1. LinkedIn now stores over 1 exabyte of total data across all of its Hadoop clusters; the largest cluster has 10,000 nodes storing 500 petabytes.
2. Each cluster's metadata is managed by a single NameNode that serves RPCs with an average latency under 10 milliseconds; high-availability features remove the NameNode as a single point of failure.
3. LinkedIn has optimized performance with techniques such as Java tuning and satellite clusters, which address problems like small files and heavy logging directories.
4. Two Milestones
1. LinkedIn now stores 1 exabyte of total data across all Hadoop clusters.
2. The largest cluster is a 10,000-node cluster.
a. 500 PB
b. 1 billion objects
c. A single NameNode serving RPCs
d. An average latency under 10 milliseconds
7. High Availability
1. The NameNode used to be a single point of failure (SPOF).
2. Hadoop 2 introduced an HA architecture with two shared-storage options:
a. Quorum Journal Manager (QJM), sketched below
b. Network File System (NFS)
3. Besides eliminating the SPOF, HA is also crucial for rolling upgrades.
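QJM's core rule is that an edit is durable once a strict majority of JournalNodes have acknowledged it. A minimal TypeScript sketch of that quorum rule (illustrative only, not Hadoop's Java implementation; the journal interface is invented here):

// Send one edit to every journal; resolve once a majority acks,
// reject once a majority can no longer be reached.
async function quorumWrite(
  journals: Array<(edit: string) => Promise<void>>,
  edit: string,
): Promise<void> {
  const majority = Math.floor(journals.length / 2) + 1;
  let acks = 0;
  let failures = 0;
  return new Promise<void>((resolve, reject) => {
    for (const send of journals) {
      send(edit).then(
        () => {
          if (++acks === majority) resolve(); // durable: majority reached
        },
        () => {
          if (++failures > journals.length - majority) {
            reject(new Error('quorum lost')); // majority now impossible
          }
        },
      );
    }
  });
}

With three JournalNodes this tolerates one failure, which is why QJM deployments run an odd number of journals.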
11. Java Tuning
1. Java heap generations
a. Young generation
b. Tenured generation
2. Non-fair locking (contrasted with fair locking in the sketch below)
a. Theoretically, it is unfavorable to writers.
b. In practice, it substantially improves performance because roughly 95% of requests are reads.
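The NameNode's global namesystem lock is a Java ReentrantReadWriteLock, and running it in non-fair mode is what the slide refers to. A conceptual TypeScript mutex contrasting the two policies (illustrative only; this class is a simplification, not the NameNode's code):

// fair = strict FIFO handoff; non-fair = a newcomer may "barge" past the queue.
class Mutex {
  private locked = false;
  private waiters: Array<() => void> = [];
  constructor(private readonly fair: boolean) {}

  async acquire(): Promise<void> {
    while (this.locked || (this.fair && this.waiters.length > 0)) {
      await new Promise<void>((wake) => this.waiters.push(wake));
      if (this.fair) return; // fair mode handed us the lock directly
    }
    this.locked = true; // non-fair: a free lock goes to whoever arrives first
  }

  release(): void {
    const next = this.waiters.shift();
    if (this.fair && next) {
      next(); // FIFO handoff; the lock never appears free in between
      return;
    }
    this.locked = false;
    if (next) next(); // non-fair: the woken waiter must race with newcomers
  }
}

Barging looks unfair to a waiting writer, but in the real read-write lock, non-fair mode roughly lets newly arriving readers join readers already holding the lock instead of queueing behind a waiting writer; with ~95% reads, that is where most of the throughput gain comes from.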
13. Satellite cluster
1. The small-files problem
2. The logging directory
3. Bootstrapping the satellite cluster
a. 60 TB of data copied with DistCp in 12 hours
b. FailoverFS (sketched below)
4. Very large block reports
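The slides do not spell out how FailoverFS works; one plausible reading is a client-side wrapper that prefers the satellite cluster and falls back to the primary while DistCp is still backfilling data. A hypothetical TypeScript sketch of that idea (every name and interface here is invented):

// Hypothetical: route reads to the satellite cluster first, fall back
// to the primary cluster for paths DistCp has not copied yet.
interface Fs {
  read(path: string): Promise<Uint8Array>;
}

class FailoverFs implements Fs {
  constructor(
    private readonly satellite: Fs,
    private readonly primary: Fs,
  ) {}

  async read(path: string): Promise<Uint8Array> {
    try {
      return await this.satellite.read(path);
    } catch {
      return this.primary.read(path); // not yet bootstrapped: fail over
    }
  }
}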
14. Consistent reads from Standby Node
1. Motivation and requirements
a. HDFS-12943
2. Consistency model
3. The stale read problem
4. The consistency principle (modeled below)
5. Journal tailing: Fast path
a. HDFS-13150
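The consistency principle of HDFS-12943 reduces to transaction IDs: the active NameNode stamps responses with the last journal transaction it wrote, clients remember that ID, and an Observer may answer a read only after journal tailing has applied at least that transaction. A toy TypeScript model of the idea (all classes invented for illustration):

// Toy model: read-your-writes between an active node and an observer.
type TxId = number;

class ActiveNode {
  private txId: TxId = 0;
  write(): TxId {
    return ++this.txId; // every edit bumps the journal transaction ID
  }
  lastTxId(): TxId {
    return this.txId;
  }
}

class ObserverNode {
  private appliedTxId: TxId = 0;
  // Journal tailing: the observer continuously applies the active's edits;
  // the fast path (HDFS-13150) shrinks the tailing lag.
  catchUpTo(txId: TxId): void {
    this.appliedTxId = Math.max(this.appliedTxId, txId);
  }
  read(clientSeenTxId: TxId): string {
    if (this.appliedTxId < clientSeenTxId) {
      throw new Error('stale: wait or retry until tailing catches up');
    }
    return 'result at least as fresh as the client has already seen';
  }
}

const active = new ActiveNode();
const observer = new ObserverNode();
const seen = active.write();           // client writes through the active NN
observer.catchUpTo(active.lastTxId()); // tailing applies the edit
console.log(observer.read(seen));      // safe: observer has caught up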
15. Scaling beyond the Hadoop ecosystem
1. Port-based selective wire encryption (sketched after this list)
a. Motivated by GDPR and CCPA
b. HADOOP-10335, HDFS-13541
c. 36-46% reduction in read/write latency and a 56-85% increase in read/write throughput
2. Encryption at rest
a. Dataset-level encryption
b. LiKMS (LinkedIn's key management service)
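Port-based selective wire encryption means the same service listens on two ports and only the cross-boundary port speaks TLS, so internal traffic avoids the encryption cost. A generic TypeScript/Node sketch of the pattern (ports, file paths, and the trivial handler are placeholders, not HDFS configuration):

import * as fs from 'fs';
import * as net from 'net';
import * as tls from 'tls';

// Identical application logic behind both ports.
const handle = (socket: net.Socket): void => {
  socket.end('hello\n');
};

// Plaintext port for clients inside the trusted boundary.
net.createServer(handle).listen(9000);

// Encrypted port for clients that must cross the boundary
// (the traffic that GDPR/CCPA compliance cares about).
tls.createServer(
  {
    key: fs.readFileSync('/path/to/server-key.pem'),   // placeholder path
    cert: fs.readFileSync('/path/to/server-cert.pem'), // placeholder path
  },
  handle,
).listen(9001);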
16. Scaling beyond the Hadoop ecosystem (cont.)
3. Wormhole
a. A collection of pipelines for transferring data
b. From a single HDFS to the universe
19. Glue DataBrew 101
1. A service that lets users meet ETL needs without coding skills (see the sketch after this list).
2. 250+ patterns (built-in transformations)
3. Tracks the transformation history of data.
4. Integrates with data pipelines.
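To ground the overview, here is a hedged CDK v1 sketch that defines a DataBrew dataset and a profile job using the L1 constructs from @aws-cdk/aws-databrew (the bucket names, object key, and role ARN are placeholders):

import * as cdk from '@aws-cdk/core';
import * as databrew from '@aws-cdk/aws-databrew';

class DataBrewSketchStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Placeholder dataset pointing at a CSV object in S3.
    const dataset = new databrew.CfnDataset(this, 'Dataset', {
      name: 'sales-dataset',
      input: { s3InputDefinition: { bucket: 'my-raw-bucket', key: 'sales.csv' } },
    });

    // Profile job: lets DataBrew inspect the data before transforming it.
    const job = new databrew.CfnJob(this, 'ProfileJob', {
      name: 'sales-profile-job',
      type: 'PROFILE',
      datasetName: dataset.name,
      roleArn: 'arn:aws:iam::111122223333:role/DataBrewServiceRole', // placeholder
      outputLocation: { bucket: 'my-profile-output-bucket' },
    });
    job.addDependsOn(dataset); // create the dataset before the job
  }
}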
24. Continuous Integration & Continuous Delivery
1. The culture enables teams to store and version code and to maintain parity between development and production environments.
2. CI/CD practices apply beyond software delivery.
3. A demonstration of the practice on data preparation.
26. cdk-databrew-cicd: a CDK construct for the demonstration, published for four programming languages, with a demo included for each language.
https://awscdk.io/packages/cdk-databrew-cicd@0.1.13/#/
Construct for the architecture
27.
import * as cdk from '@aws-cdk/core';
import { DataBrewCodePipeline } from 'cdk-databrew-cicd';

class TypescriptStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const preproductionAccountId = 'PREPRODUCTION_ACCOUNT_ID';
    const productionAccountId = 'PRODUCTION_ACCOUNT_ID';

    const dataBrewPipeline = new DataBrewCodePipeline(this, 'DataBrewCicdPipeline', {
      preproductionIamRoleArn: `arn:${cdk.Aws.PARTITION}:iam::${preproductionAccountId}:role/preproduction-Databrew-Cicd-Role`,
      productionIamRoleArn: `arn:${cdk.Aws.PARTITION}:iam::${productionAccountId}:role/production-Databrew-Cicd-Role`,
      // bucketName: 'OPTIONAL',
      // repoName: 'OPTIONAL',
      // branchName: 'OPTIONAL',
      // pipelineName: 'OPTIONAL'
    });

    new cdk.CfnOutput(this, 'OPreproductionLambdaArn', { value: dataBrewPipeline.preproductionFunctionArn });
    new cdk.CfnOutput(this, 'OProductionLambdaArn', { value: dataBrewPipeline.productionFunctionArn });
    new cdk.CfnOutput(this, 'OCodeCommitRepoArn', { value: dataBrewPipeline.codeCommitRepoArn });
    new cdk.CfnOutput(this, 'OCodePipelineArn', { value: dataBrewPipeline.codePipelineArn });
  }
}

const app = new cdk.App();
new TypescriptStack(app, 'TypescriptStack', { stackName: 'DataBrew-CICD' });
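Running `cdk deploy` against this stack provisions the CodeCommit repository, the CodePipeline, and the pre-production/production Lambda functions; the four CfnOutput values surface their ARNs for later reference.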
28. What the architecture looks like in the CloudFormation Designer after deploying the CDK construct.