The document summarizes a presentation on LinkedIn's use of Hadoop and HDFS to store and process over 1 exabyte of data across multiple clusters, followed by an introduction to AWS Glue DataBrew and a CDK-based CI/CD demonstration for data preparation. Key points:
1. LinkedIn now stores over 1 exabyte of total data across all of its Hadoop clusters; the largest cluster has 10,000 nodes storing 500 petabytes.
2. Each cluster's metadata is managed by a single NameNode that serves RPCs with an average latency under 10 milliseconds; high-availability features remove the NameNode as a single point of failure.
3. LinkedIn has optimized performance with techniques such as Java tuning and satellite clusters, which address problems like small files and heavy logging directories.
4. Two Milestones
1. LinkedIn now stores 1 exabyte of total data across all Hadoop clusters.
2. The largest cluster is a 10,000-node cluster.
a. 500 PB
b. 1 billion objects
c. A single NameNode serving RPCs
d. An average latency under 10 milliseconds
7. High Availability
1. The NameNode used to be a single point of failure (SPOF).
2. Hadoop 2 introduced an HA architecture with two shared-storage options:
a. Quorum Journal Manager (QJM), sketched below
b. Network File System (NFS)
3. Besides eliminating the SPOF, HA is also crucial for rolling upgrades.
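QJM's core rule is that an edit is durable once a strict majority of JournalNodes have acknowledged it. A minimal TypeScript sketch of that quorum rule (illustrative only, not Hadoop's Java implementation; the journal interface is invented here):

// Send one edit to every journal; resolve once a majority acks,
// reject once a majority can no longer be reached.
async function quorumWrite(
  journals: Array<(edit: string) => Promise<void>>,
  edit: string,
): Promise<void> {
  const majority = Math.floor(journals.length / 2) + 1;
  let acks = 0;
  let failures = 0;
  return new Promise<void>((resolve, reject) => {
    for (const send of journals) {
      send(edit).then(
        () => {
          if (++acks === majority) resolve(); // durable: majority reached
        },
        () => {
          if (++failures > journals.length - majority) {
            reject(new Error('quorum lost')); // majority now impossible
          }
        },
      );
    }
  });
}

With three JournalNodes this tolerates one failure, which is why QJM deployments run an odd number of journals.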
11. Java Tuning
1. Java heap generations
a. Young generation
b. Tenured generation
2. Non-fair locking (contrasted with fair locking in the sketch below)
a. Theoretically, it is unfavorable to writers.
b. In practice, it substantially improves performance because roughly 95% of requests are reads.
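The NameNode's global namesystem lock is a Java ReentrantReadWriteLock, and running it in non-fair mode is what the slide refers to. A conceptual TypeScript mutex contrasting the two policies (illustrative only; this class is a simplification, not the NameNode's code):

// fair = strict FIFO handoff; non-fair = a newcomer may "barge" past the queue.
class Mutex {
  private locked = false;
  private waiters: Array<() => void> = [];
  constructor(private readonly fair: boolean) {}

  async acquire(): Promise<void> {
    while (this.locked || (this.fair && this.waiters.length > 0)) {
      await new Promise<void>((wake) => this.waiters.push(wake));
      if (this.fair) return; // fair mode handed us the lock directly
    }
    this.locked = true; // non-fair: a free lock goes to whoever arrives first
  }

  release(): void {
    const next = this.waiters.shift();
    if (this.fair && next) {
      next(); // FIFO handoff; the lock never appears free in between
      return;
    }
    this.locked = false;
    if (next) next(); // non-fair: the woken waiter must race with newcomers
  }
}

Barging looks unfair to a waiting writer, but in the real read-write lock, non-fair mode roughly lets newly arriving readers join readers already holding the lock instead of queueing behind a waiting writer; with ~95% reads, that is where most of the throughput gain comes from.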
13. Satellite cluster
1. The small-files problem
2. The logging directory
3. Bootstrapping the satellite cluster
a. 60 TB of data copied with DistCp in 12 hours
b. FailoverFS (sketched below)
4. Very large block reports
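The slides do not spell out how FailoverFS works; one plausible reading is a client-side wrapper that prefers the satellite cluster and falls back to the primary while DistCp is still backfilling data. A hypothetical TypeScript sketch of that idea (every name and interface here is invented):

// Hypothetical: route reads to the satellite cluster first, fall back
// to the primary cluster for paths DistCp has not copied yet.
interface Fs {
  read(path: string): Promise<Uint8Array>;
}

class FailoverFs implements Fs {
  constructor(
    private readonly satellite: Fs,
    private readonly primary: Fs,
  ) {}

  async read(path: string): Promise<Uint8Array> {
    try {
      return await this.satellite.read(path);
    } catch {
      return this.primary.read(path); // not yet bootstrapped: fail over
    }
  }
}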
14. Consistent reads from Standby Node
1. Motivation and requirements
a. HDFS-12943
2. Consistency model
3. The stale read problem
4. The consistency principle (modeled below)
5. Journal tailing: Fast path
a. HDFS-13150
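The consistency principle of HDFS-12943 reduces to transaction IDs: the active NameNode stamps responses with the last journal transaction it wrote, clients remember that ID, and an Observer may answer a read only after journal tailing has applied at least that transaction. A toy TypeScript model of the idea (all classes invented for illustration):

// Toy model: read-your-writes between an active node and an observer.
type TxId = number;

class ActiveNode {
  private txId: TxId = 0;
  write(): TxId {
    return ++this.txId; // every edit bumps the journal transaction ID
  }
  lastTxId(): TxId {
    return this.txId;
  }
}

class ObserverNode {
  private appliedTxId: TxId = 0;
  // Journal tailing: the observer continuously applies the active's edits;
  // the fast path (HDFS-13150) shrinks the tailing lag.
  catchUpTo(txId: TxId): void {
    this.appliedTxId = Math.max(this.appliedTxId, txId);
  }
  read(clientSeenTxId: TxId): string {
    if (this.appliedTxId < clientSeenTxId) {
      throw new Error('stale: wait or retry until tailing catches up');
    }
    return 'result at least as fresh as the client has already seen';
  }
}

const active = new ActiveNode();
const observer = new ObserverNode();
const seen = active.write();           // client writes through the active NN
observer.catchUpTo(active.lastTxId()); // tailing applies the edit
console.log(observer.read(seen));      // safe: observer has caught up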
15. Scaling beyond the Hadoop ecosystem
1. Port-based selective wire encryption (sketched after this list)
a. Motivated by GDPR and CCPA
b. HADOOP-10335, HDFS-13541
c. 36-46% reduction in read/write latency and a 56-85% increase in read/write throughput
2. Encryption at rest
a. Dataset-level encryption
b. LiKMS (LinkedIn's key management service)
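Port-based selective wire encryption means the same service listens on two ports and only the cross-boundary port speaks TLS, so internal traffic avoids the encryption cost. A generic TypeScript/Node sketch of the pattern (ports, file paths, and the trivial handler are placeholders, not HDFS configuration):

import * as fs from 'fs';
import * as net from 'net';
import * as tls from 'tls';

// Identical application logic behind both ports.
const handle = (socket: net.Socket): void => {
  socket.end('hello\n');
};

// Plaintext port for clients inside the trusted boundary.
net.createServer(handle).listen(9000);

// Encrypted port for clients that must cross the boundary
// (the traffic that GDPR/CCPA compliance cares about).
tls.createServer(
  {
    key: fs.readFileSync('/path/to/server-key.pem'),   // placeholder path
    cert: fs.readFileSync('/path/to/server-cert.pem'), // placeholder path
  },
  handle,
).listen(9001);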
16. Scaling beyond the Hadoop ecosystem (cont.)
3. Wormhole
a. A collection of pipelines for transferring data
b. From a single HDFS to the universe
19. Glue DataBrew 101
1. A service that lets users meet ETL needs without coding skills (see the sketch after this list).
2. 250+ patterns (built-in transformations)
3. Tracks the transformation history of data.
4. Integrates with data pipelines.
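To ground the overview, here is a hedged CDK v1 sketch that defines a DataBrew dataset and a profile job using the L1 constructs from @aws-cdk/aws-databrew (the bucket names, object key, and role ARN are placeholders):

import * as cdk from '@aws-cdk/core';
import * as databrew from '@aws-cdk/aws-databrew';

class DataBrewSketchStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Placeholder dataset pointing at a CSV object in S3.
    const dataset = new databrew.CfnDataset(this, 'Dataset', {
      name: 'sales-dataset',
      input: { s3InputDefinition: { bucket: 'my-raw-bucket', key: 'sales.csv' } },
    });

    // Profile job: lets DataBrew inspect the data before transforming it.
    const job = new databrew.CfnJob(this, 'ProfileJob', {
      name: 'sales-profile-job',
      type: 'PROFILE',
      datasetName: dataset.name,
      roleArn: 'arn:aws:iam::111122223333:role/DataBrewServiceRole', // placeholder
      outputLocation: { bucket: 'my-profile-output-bucket' },
    });
    job.addDependsOn(dataset); // create the dataset before the job
  }
}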
24. Continuous Integration & Continuous Delivery
1. The culture enables teams to store and version code and to maintain parity between development and production environments.
2. CI/CD practices apply beyond software delivery.
3. A demonstration of the practice on data preparation.
26. cdk-databrew-cicd: a CDK construct for the demonstration, published for four programming languages, with a demo included for each language.
https://awscdk.io/packages/cdk-databrew-cicd@0.1.13/#/
Construct for the architecture
27.
import * as cdk from '@aws-cdk/core';
import { DataBrewCodePipeline } from 'cdk-databrew-cicd';

class TypescriptStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const preproductionAccountId = 'PREPRODUCTION_ACCOUNT_ID';
    const productionAccountId = 'PRODUCTION_ACCOUNT_ID';

    const dataBrewPipeline = new DataBrewCodePipeline(this, 'DataBrewCicdPipeline', {
      preproductionIamRoleArn: `arn:${cdk.Aws.PARTITION}:iam::${preproductionAccountId}:role/preproduction-Databrew-Cicd-Role`,
      productionIamRoleArn: `arn:${cdk.Aws.PARTITION}:iam::${productionAccountId}:role/production-Databrew-Cicd-Role`,
      // bucketName: 'OPTIONAL',
      // repoName: 'OPTIONAL',
      // branchName: 'OPTIONAL',
      // pipelineName: 'OPTIONAL'
    });

    new cdk.CfnOutput(this, 'OPreproductionLambdaArn', { value: dataBrewPipeline.preproductionFunctionArn });
    new cdk.CfnOutput(this, 'OProductionLambdaArn', { value: dataBrewPipeline.productionFunctionArn });
    new cdk.CfnOutput(this, 'OCodeCommitRepoArn', { value: dataBrewPipeline.codeCommitRepoArn });
    new cdk.CfnOutput(this, 'OCodePipelineArn', { value: dataBrewPipeline.codePipelineArn });
  }
}

const app = new cdk.App();
new TypescriptStack(app, 'TypescriptStack', { stackName: 'DataBrew-CICD' });
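Running `cdk deploy` against this stack provisions the CodeCommit repository, the CodePipeline, and the pre-production/production Lambda functions; the four CfnOutput values surface their ARNs for later reference.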
28. What the architecture looks like in the CloudFormation Designer after deploying the CDK construct.