SlideShare a Scribd company logo
The exabyte journey &
DataBrew with CICD
Speaker: Shu-Jeng, Hsieh (Scott)
Date: Jun 25th, 2021 (Fri)
Location: the 104 Corporation
The Exabyte Journey in
LinkedIn with Hadoop
1 exabyte (EB)
= 1000 PB
= 1M TB
= 1B GB
Two Milestones
1. LinkedIn now stores 1 exabyte of total data
across all Hadoop clusters.
2. The largest cluster is a 10,000-node cluster.
a. 500 PB
b. 1 billion objects
c. A single NameNode serving RPCs
d. An average latency under 10 milliseconds
Historical
Scale
Two Milestones
1. LinkedIn now stores 1 exabyte of total data
across all Hadoop clusters.
2. The largest cluster is a 10,000-node cluster.
a. 500 PB
b. 1 billion objects
c. A single NameNode serving RPCs
d. An average latency under 10 milliseconds
High Availability
1. The NameNode used to be a single point of
failure (SPOF).
2. From Hadoop 2, the HA architecture is
introduced.
a. Quorum Journal Manager (QJM)
b. Network File System (NFS)
3. Aside from preventing from the SPOF, HA is
also crucial for rolling upgrades.
DN DN DN DN
NN
Active
NN
Standby
JN
JN JN
Java Tuning
380GB heap to maintain
1.1Bof namespace objects
Java Tuning
1. Java heap generations
a. Young generation
b. Tenured generation
2. Non-fair locking
a. Theoretically, it’s unfavorable for writers.
b. In practice, it substantially improves performance due to the
occupancy of read requests, which is 95%.
Java Tuning
Satellite cluster
1. The small files problem
2. The logging directory
3. Bootstrapping the satellite cluster
a. 60TB of data using DistCp, 12 hours
b. FailoverFS
4. Very large block reports
Consistent reads
from Standby Node
1. Motivation and requirements
a. HDFS-12943
2. Consistency model
3. The stale read problem
4. The consistency principle
5. Journal tailing: Fast path
a. HDFS-13150
Scaling beyond the
Hadoop ecosystem
1. Port-based selective wire encryption
a. GDPR, CCPA
b. HADOOP-10335, HDFS-13541
c. 36-46% reduction in read/write latency, 56-85% increase of
read/write throughput.
2. Encryption at rest
a. Dataset-level encryption
b. LiKMS
Scaling beyond the
Hadoop ecosystem
3. Wormhole
a. Bunches of pipelines for data transferring
b. From a single HDFS to the universe
Traditional Chinese Version
https://fantasticsie.medium.com/%E8%89%BE%E4
%BD%8D%E5%85%83%E4%BF%B1%E6%A8%8
2%E9%83%A8-linkedin-
%E5%9C%A8%E6%93%B4%E5%B1%95-
hadoop-
%E5%88%86%E6%95%A3%E5%BC%8F%E6%A
A%94%E6%A1%88%E7%B3%BB%E7%B5%B1%
E4%B8%8A%E7%9A%84%E6%97%85%E7%A8
Glue DataBrew with CICD
Glue DataBrew 101
1. A service that empowers those who have
needs on ETL without code skill.
2. 250+ patterns (built-in transformations)
3. Track the transforming history
of data.
4. Integrates with data pipelines
A Glance
Data Profile
Data Lineage
AWS Glue DataBrew SAS EG
DataBrew recipes
A series of transformations
on a DataBrew dataset
Continuous Integration &
Continuous Delivery
1. The culture enables teams to store and version
code, maintain parity between development and
production environments.
2. CI/CD practices apply beyond software delivery.
3. A demonstration of the practice on data prepartion
Architecture
A CDK construct for the demonstration with 4
programming languages available worldwide,
demos for the languages included.
cdk-databrew-cicd
https://awscdk.io/packages/cdk-databrew-cicd@0.1.13/#/
Construct for the architecture
import * as cdk from '@aws-cdk/core';
import { DataBrewCodePipeline } from 'cdk-databrew-cicd';
class TypescriptStack extends cdk.Stack {
constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
const preproductionAccountId = 'PREPRODUCTION_ACCOUNT_ID';
const productionAccountId = 'PRODUCTION_ACCOUNT_ID';
const dataBrewPipeline = new DataBrewCodePipeline(this, 'DataBrewCicdPipeline', {
preproductionIamRoleArn:
`arn:${cdk.Aws.PARTITION}:iam::${preproductionAccountId}:role/preproduction-Databrew-Cicd-Role`,
productionIamRoleArn: `arn:${cdk.Aws.PARTITION}:iam::${productionAccountId}:role/production-
Databrew-Cicd-Role`,
// bucketName: 'OPTIONAL',
// repoName: 'OPTIONAL',
// branchName: 'OPTIONAL',
// pipelineName: 'OPTIONAL'
});
new cdk.CfnOutput(this, 'OPreproductionLambdaArn', { value:
dataBrewPipeline.preproductionFunctionArn });
new cdk.CfnOutput(this, 'OProductionLambdaArn', { value: dataBrewPipeline.productionFunctionArn
});
new cdk.CfnOutput(this, 'OCodeCommitRepoArn', { value: dataBrewPipeline.codeCommitRepoArn });
new cdk.CfnOutput(this, 'OCodePipelineArn', { value: dataBrewPipeline.codePipelineArn });
}
}
const app = new cdk.App();
new TypescriptStack(app, 'TypescriptStack', {stackName: 'DataBrew-CICD'});
How it looks like in the
CloudFormation Designer
after deploying
the CDK construct
Ask
Me
Anything

More Related Content

What's hot

Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoopguest27e6764
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Edureka!
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programPraveen Kumar Donta
 
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...Edureka!
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questionsKalyan Hadoop
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPTAnand Pandey
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And HadoopEdureka!
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyShital Kat
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopLeons Petražickis
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsSkillspeed
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
Hadoop Adminstration with Latest Release (2.0)
Hadoop Adminstration with Latest Release (2.0)Hadoop Adminstration with Latest Release (2.0)
Hadoop Adminstration with Latest Release (2.0)Edureka!
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop IntroductionDzung Nguyen
 

What's hot (20)

Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoop
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPT
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And Hadoop
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop Technology
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Hadoop Adminstration with Latest Release (2.0)
Hadoop Adminstration with Latest Release (2.0)Hadoop Adminstration with Latest Release (2.0)
Hadoop Adminstration with Latest Release (2.0)
 
RHadoop
RHadoopRHadoop
RHadoop
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 

Similar to The Exabyte Journey and DataBrew with CICD

High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...Larry Smarr
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance DataWorks Summit/Hadoop Summit
 
Resume boda liao-storage
Resume boda liao-storageResume boda liao-storage
Resume boda liao-storageliao boda
 
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...Larry Smarr
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...Uwe Korn
 
Riding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New ScienceRiding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New ScienceLarry Smarr
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuCeph Community
 
Distributed and Cloud Computing 1st Edition Hwang Solutions Manual
Distributed and Cloud Computing 1st Edition Hwang Solutions ManualDistributed and Cloud Computing 1st Edition Hwang Solutions Manual
Distributed and Cloud Computing 1st Edition Hwang Solutions Manualkyxeminut
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Robert Evans
 
High Performance Cloud Computing
High Performance Cloud ComputingHigh Performance Cloud Computing
High Performance Cloud ComputingDeepak Singh
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCSheetal Dolas
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
 
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...Larry Smarr
 
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical ResearchBruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical ResearchDanny Abukalam
 
How to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway SystemHow to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway SystemLarry Smarr
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...DataWorks Summit/Hadoop Summit
 

Similar to The Exabyte Journey and DataBrew with CICD (20)

High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance
 
Resume boda liao-storage
Resume boda liao-storageResume boda liao-storage
Resume boda liao-storage
 
Devops Spark Streaming
Devops Spark StreamingDevops Spark Streaming
Devops Spark Streaming
 
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
 
Riding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New ScienceRiding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New Science
 
Ods chapter7
Ods chapter7Ods chapter7
Ods chapter7
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
 
Distributed and Cloud Computing 1st Edition Hwang Solutions Manual
Distributed and Cloud Computing 1st Edition Hwang Solutions ManualDistributed and Cloud Computing 1st Edition Hwang Solutions Manual
Distributed and Cloud Computing 1st Edition Hwang Solutions Manual
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
High Performance Cloud Computing
High Performance Cloud ComputingHigh Performance Cloud Computing
High Performance Cloud Computing
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOC
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
 
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical ResearchBruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
 
How to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway SystemHow to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway System
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 

More from Shu-Jeng Hsieh

Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKShu-Jeng Hsieh
 
Serverless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineServerless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineShu-Jeng Hsieh
 
How cdk and projen benefit to A team
How cdk and projen benefit to A teamHow cdk and projen benefit to A team
How cdk and projen benefit to A teamShu-Jeng Hsieh
 
Optimization technique for SAS
Optimization technique for SASOptimization technique for SAS
Optimization technique for SASShu-Jeng Hsieh
 
Trial plan with capitation payment of the national healthcare insurance in ta...
Trial plan with capitation payment of the national healthcare insurance in ta...Trial plan with capitation payment of the national healthcare insurance in ta...
Trial plan with capitation payment of the national healthcare insurance in ta...Shu-Jeng Hsieh
 
Prediction for consuming behavior of noncontractual customers
Prediction for consuming behavior of noncontractual customersPrediction for consuming behavior of noncontractual customers
Prediction for consuming behavior of noncontractual customersShu-Jeng Hsieh
 
Scheduling advertisements on a web page to maximize revenue
Scheduling advertisements on a web page to maximize revenueScheduling advertisements on a web page to maximize revenue
Scheduling advertisements on a web page to maximize revenueShu-Jeng Hsieh
 
A data driven approach to measure web site navigability
A data driven approach to measure web site navigabilityA data driven approach to measure web site navigability
A data driven approach to measure web site navigabilityShu-Jeng Hsieh
 

More from Shu-Jeng Hsieh (12)

Big Data Tools in AWS
Big Data Tools in AWSBig Data Tools in AWS
Big Data Tools in AWS
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDK
 
Serverless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineServerless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipeline
 
How cdk and projen benefit to A team
How cdk and projen benefit to A teamHow cdk and projen benefit to A team
How cdk and projen benefit to A team
 
Optimization technique for SAS
Optimization technique for SASOptimization technique for SAS
Optimization technique for SAS
 
The way
The wayThe way
The way
 
Demo of our thoughts
Demo of our thoughtsDemo of our thoughts
Demo of our thoughts
 
Trial plan with capitation payment of the national healthcare insurance in ta...
Trial plan with capitation payment of the national healthcare insurance in ta...Trial plan with capitation payment of the national healthcare insurance in ta...
Trial plan with capitation payment of the national healthcare insurance in ta...
 
Prediction for consuming behavior of noncontractual customers
Prediction for consuming behavior of noncontractual customersPrediction for consuming behavior of noncontractual customers
Prediction for consuming behavior of noncontractual customers
 
Scheduling advertisements on a web page to maximize revenue
Scheduling advertisements on a web page to maximize revenueScheduling advertisements on a web page to maximize revenue
Scheduling advertisements on a web page to maximize revenue
 
A data driven approach to measure web site navigability
A data driven approach to measure web site navigabilityA data driven approach to measure web site navigability
A data driven approach to measure web site navigability
 

Recently uploaded

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单enxupq
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?DOT TECH
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIAlejandraGmez176757
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxStephen266013
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesStarCompliance.io
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsalex933524
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJames Polillo
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单yhkoc
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...correoyaya
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sMAQIB18
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBAlireza Kamrani
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单enxupq
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .NABLAS株式会社
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictJack Cole
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单ewymefz
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundOppotus
 

Recently uploaded (20)

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 

The Exabyte Journey and DataBrew with CICD

  • 1. The exabyte journey & DataBrew with CICD Speaker: Shu-Jeng, Hsieh (Scott) Date: Jun 25th, 2021 (Fri) Location: the 104 Corporation
  • 2. The Exabyte Journey in LinkedIn with Hadoop
  • 3. 1 exabyte (EB) = 1000 PB = 1M TB = 1B GB
  • 4. Two Milestones 1. LinkedIn now stores 1 exabyte of total data across all Hadoop clusters. 2. The largest cluster is a 10,000-node cluster. a. 500 PB b. 1 billion objects c. A single NameNode serving RPCs d. An average latency under 10 milliseconds
  • 6. Two Milestones 1. LinkedIn now stores 1 exabyte of total data across all Hadoop clusters. 2. The largest cluster is a 10,000-node cluster. a. 500 PB b. 1 billion objects c. A single NameNode serving RPCs d. An average latency under 10 milliseconds
  • 7. High Availability 1. The NameNode used to be a single point of failure (SPOF). 2. From Hadoop 2, the HA architecture is introduced. a. Quorum Journal Manager (QJM) b. Network File System (NFS) 3. Aside from preventing from the SPOF, HA is also crucial for rolling upgrades.
  • 8.
  • 9. DN DN DN DN NN Active NN Standby JN JN JN
  • 10. Java Tuning 380GB heap to maintain 1.1Bof namespace objects
  • 11. Java Tuning 1. Java heap generations a. Young generation b. Tenured generation 2. Non-fair locking a. Theoretically, it’s unfavorable for writers. b. In practice, it substantially improves performance due to the occupancy of read requests, which is 95%.
  • 13. Satellite cluster 1. The small files problem 2. The logging directory 3. Bootstrapping the satellite cluster a. 60TB of data using DistCp, 12 hours b. FailoverFS 4. Very large block reports
  • 14. Consistent reads from Standby Node 1. Motivation and requirements a. HDFS-12943 2. Consistency model 3. The stale read problem 4. The consistency principle 5. Journal tailing: Fast path a. HDFS-13150
  • 15. Scaling beyond the Hadoop ecosystem 1. Port-based selective wire encryption a. GDPR, CCPA b. HADOOP-10335, HDFS-13541 c. 36-46% reduction in read/write latency, 56-85% increase of read/write throughput. 2. Encryption at rest a. Dataset-level encryption b. LiKMS
  • 16. Scaling beyond the Hadoop ecosystem 3. Wormhole a. Bunches of pipelines for data transferring b. From a single HDFS to the universe
  • 19. Glue DataBrew 101 1. A service that empowers those who have needs on ETL without code skill. 2. 250+ patterns (built-in transformations) 3. Track the transforming history of data. 4. Integrates with data pipelines
  • 22. Data Lineage AWS Glue DataBrew SAS EG
  • 23. DataBrew recipes A series of transformations on a DataBrew dataset
  • 24. Continuous Integration & Continuous Delivery 1. The culture enables teams to store and version code, maintain parity between development and production environments. 2. CI/CD practices apply beyond software delivery. 3. A demonstration of the practice on data prepartion
  • 26. A CDK construct for the demonstration with 4 programming languages available worldwide, demos for the languages included. cdk-databrew-cicd https://awscdk.io/packages/cdk-databrew-cicd@0.1.13/#/ Construct for the architecture
  • 27. import * as cdk from '@aws-cdk/core'; import { DataBrewCodePipeline } from 'cdk-databrew-cicd'; class TypescriptStack extends cdk.Stack { constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) { super(scope, id, props); const preproductionAccountId = 'PREPRODUCTION_ACCOUNT_ID'; const productionAccountId = 'PRODUCTION_ACCOUNT_ID'; const dataBrewPipeline = new DataBrewCodePipeline(this, 'DataBrewCicdPipeline', { preproductionIamRoleArn: `arn:${cdk.Aws.PARTITION}:iam::${preproductionAccountId}:role/preproduction-Databrew-Cicd-Role`, productionIamRoleArn: `arn:${cdk.Aws.PARTITION}:iam::${productionAccountId}:role/production- Databrew-Cicd-Role`, // bucketName: 'OPTIONAL', // repoName: 'OPTIONAL', // branchName: 'OPTIONAL', // pipelineName: 'OPTIONAL' }); new cdk.CfnOutput(this, 'OPreproductionLambdaArn', { value: dataBrewPipeline.preproductionFunctionArn }); new cdk.CfnOutput(this, 'OProductionLambdaArn', { value: dataBrewPipeline.productionFunctionArn }); new cdk.CfnOutput(this, 'OCodeCommitRepoArn', { value: dataBrewPipeline.codeCommitRepoArn }); new cdk.CfnOutput(this, 'OCodePipelineArn', { value: dataBrewPipeline.codePipelineArn }); } } const app = new cdk.App(); new TypescriptStack(app, 'TypescriptStack', {stackName: 'DataBrew-CICD'});
  • 28. How it looks like in the CloudFormation Designer after deploying the CDK construct
  • 29.

Editor's Notes

  1. 2015: 20PB, 2020: 500PB 2015: 1.45 億; 2020: 16 億個物件(three namespace volume) Gbhr cmpute