Big Data Tools in AWS
Scott (考特)
Sep. 29th, 2021 (Wed)
AWS User Group
Oh. So. CDK
Scott (考特)
Shu-Jeng, Hsieh
● Sr. Data Engineer, the 104
● AWS Community Builder
Agenda
AWS EMR
AWS Glue 3.0
AWS EMR
HCatalog
Transient or long-running clusters
Long-running and auto scaling
1. Great for lines of business leaders
2. Great for short-running jobs or ad hoc queries
3. Ideal to save costs for multi-tenanted data science and data engineering jobs
Example use cases:
● Notebooks
● Ad-hoc jobs and experimentation
● Streaming
Transient and job scoped
1. Works well for job-scoped pipelines
2. Reduces blast radius
3. Easier to upgrade clusters and restart jobs
Example use cases:
● Large-scale transformation
● ETL to other DWH or Data Lake
● Building ML jobs
Liem, M., 2020. Amazon EMR Deep Dive and Best Practices - AWS Online Tech
Talks. [video] Available at: <https://www.youtube.com/watch?v=dU40df0Suoo>
Automation
API requests via the AWS SDK
API requests via AWS Lambda
State machine (AWS Step Functions)
AWS Data Pipeline
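As a rough illustration of what an SDK-driven launch has to describe, the sketch below builds a plain request object shaped like a subset of the EMR RunJobFlow API. The helper name `buildClusterRequest`, the release label, and the instance types are assumptions for illustration only, and nothing here calls AWS. The `KeepJobFlowAliveWhenNoSteps` flag is what separates a transient cluster from a long-running one.

```typescript
// Hand-written subset of the EMR RunJobFlow request shape.
// Only the fields used in this sketch are modelled; the real API has many more.
interface InstanceGroup {
  InstanceRole: 'MASTER' | 'CORE' | 'TASK';
  InstanceType: string;
  InstanceCount: number;
}

interface RunJobFlowRequest {
  Name: string;
  ReleaseLabel: string;
  ServiceRole: string;
  JobFlowRole: string;
  Instances: {
    InstanceGroups: InstanceGroup[];
    KeepJobFlowAliveWhenNoSteps: boolean;
  };
}

// Hypothetical helper: `transient` decides whether the cluster terminates
// once its steps finish (job scoped) or stays up (long running).
function buildClusterRequest(name: string, transient: boolean): RunJobFlowRequest {
  return {
    Name: name,
    ReleaseLabel: 'emr-6.3.0', // assumed release label
    ServiceRole: 'EMR_DefaultRole',
    JobFlowRole: 'EMR_EC2_DefaultRole',
    Instances: {
      InstanceGroups: [
        // Instance types and counts are placeholders, not recommendations.
        { InstanceRole: 'MASTER', InstanceType: 'm5.xlarge', InstanceCount: 1 },
        { InstanceRole: 'CORE', InstanceType: 'm5.xlarge', InstanceCount: 2 },
      ],
      KeepJobFlowAliveWhenNoSteps: !transient,
    },
  };
}

const nightlyEtl = buildClusterRequest('nightly-etl', true);
const notebookCluster = buildClusterRequest('team-notebooks', false);
```

The same object could be passed to an SDK client from a script, a Lambda function, or a state machine task; only the caller changes.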
Deployment options
● Amazon EC2
● Amazon EKS
● AWS Outposts
Workloads
● Ad hoc workloads
● Data pipelines
● Data science notebooks
Richardson, C., Novikova, M. and Zhang, K., 2021. How Tamr Optimized Amazon EMR Workloads to
Unify 200 Billion Records 5x Faster than On-Premises
[Architecture diagram: Amazon S3 (source data), Amazon EMR (prepare data), Amazon S3 (marts), launch service; data is used via JDBC access, Zeppelin, and Airflow pipelines]
Dubrovsky, O. and Reuveni, Y., 2020. AWS re:Invent 2020: How
Nielsen built a multi-petabyte data platform using Amazon EMR.
import * as cdk from '@aws-cdk/core';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';
interface ExtendedEmrCreateClusterProps extends tasks.EmrCreateClusterProps {
/**
* Specifies the step concurrency level to allow multiple steps to run in parallel
*
* Requires EMR release label 5.28.0 or above.
* Must be in range [1, 256].
*
* @default 1 - no step concurrency allowed
*/
readonly stepConcurrencyLevel?: number;
}
class ExtendedEmrCreateCluster extends tasks.EmrCreateCluster {
protected readonly stepConcurrencyLevel: number;
constructor(
scope: cdk.Construct,
id: string,
props: ExtendedEmrCreateClusterProps
) {
super(scope, id, props);
this.stepConcurrencyLevel = props.stepConcurrencyLevel ?? 1;
}
protected _renderTask(): any {
const originalObject = super._renderTask();
const extensionObject = {};
Object.assign(extensionObject, originalObject, {
Parameters: {
StepConcurrencyLevel: cdk.numberToCloudFormation(
this.stepConcurrencyLevel
),
...originalObject.Parameters,
},
});
return extensionObject;
}
}
CDK issues
● #15223
● #15242
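The override above boils down to merging one extra key into the rendered task's `Parameters`. The sketch below shows that merge on plain objects, with no CDK dependency. `TaskDefinition` and `withStepConcurrency` are hypothetical names, and the range check is an addition based on the `[1, 256]` limit stated in the doc comment, not something the original class performs.

```typescript
// Plain-object stand-in for the JSON that a Step Functions task renders to.
type TaskDefinition = {
  Parameters?: Record<string, unknown>;
  [key: string]: unknown;
};

// Hypothetical helper mirroring the merge order of the override:
// existing Parameters are spread last, so they win on a key collision.
function withStepConcurrency(task: TaskDefinition, level: number = 1): TaskDefinition {
  // Assumed validation, derived from the documented range [1, 256].
  if (Math.floor(level) !== level || level < 1 || level > 256) {
    throw new RangeError(
      `stepConcurrencyLevel must be an integer in [1, 256], got ${level}`,
    );
  }
  return {
    ...task,
    Parameters: {
      StepConcurrencyLevel: level,
      ...task.Parameters,
    },
  };
}

const renderedTask = withStepConcurrency(
  { Type: 'Task', Parameters: { Name: 'demo-cluster' } },
  4,
);
```

Because the original parameters are spread after `StepConcurrencyLevel`, a value already present in the rendered task would take precedence over the extension's value.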
import * as sfn from '@aws-cdk/aws-stepfunctions';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';
tasks.EmrSetClusterTerminationProtection
tasks.EmrAddStep
tasks.EmrTerminateCluster
sfn.Choice
sfn.Condition
sfn.Parallel
Constructs that you'll encounter frequently
// Create the Parallel state first, then attach each branch to it.
const dataMovementParallel = new sfn.Parallel(
  this,
  'Do some complex things in an EMR Cluster',
  {
    resultPath: sfn.JsonPath.DISCARD,
  }
);
// Branch 1: download and run a shell script on the cluster.
dataMovementParallel.branch(
  new tasks.EmrAddStep(this, 'Make Traditional Chinese available', {
    name: 'modify metadata',
    clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
    actionOnFailure: tasks.ActionOnFailure.CONTINUE,
    jar: 'command-runner.jar',
    args: [
      'bash',
      '-c',
      [
        `aws s3 cp s3://${this.demoBucketName}/modify_meta_database.sh .`,
        'chmod +x modify_meta_database.sh',
        './modify_meta_database.sh',
        'rm modify_meta_database.sh',
      ].join('; '),
    ],
  })
);
// Branch 2: submit a Spark ETL job in cluster mode on YARN.
dataMovementParallel.branch(
  new tasks.EmrAddStep(this, 'Some ETL', {
    name: 'Execute an ETL',
    clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
    actionOnFailure: tasks.ActionOnFailure.CONTINUE,
    jar: 'command-runner.jar',
    args: [
      'spark-submit',
      '--deploy-mode',
      'cluster',
      '--master',
      'yarn',
      '--num-executors',
      '2',
      '--executor-cores',
      '8',
      '--executor-memory',
      '12g',
      '--conf',
      'spark.yarn.submit.waitAppCompletion=true',
      `s3://${this.demoBucketName}/etl/spark-etl.py`,
    ],
  })
);
Example: assigning parallel tasks to an EMR cluster
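When several steps submit Spark jobs, the long `args` arrays tend to repeat. One way to keep them consistent is a small builder, sketched below under the assumption that every job runs in cluster mode on YARN as in the example step. `SparkSubmitOptions` and `sparkSubmitArgs` are hypothetical names, not part of the CDK.

```typescript
// Hypothetical options for one spark-submit invocation.
interface SparkSubmitOptions {
  script: string; // S3 URI of the PySpark script
  numExecutors: number;
  executorCores: number;
  executorMemory: string; // for example '12g'
  conf?: Record<string, string>;
}

// Builds the argument list that command-runner.jar would receive.
function sparkSubmitArgs(opts: SparkSubmitOptions): string[] {
  const confArgs: string[] = [];
  const conf = opts.conf ?? {};
  Object.keys(conf).forEach((key) => {
    confArgs.push('--conf', `${key}=${conf[key]}`);
  });
  return [
    'spark-submit',
    '--deploy-mode', 'cluster',
    '--master', 'yarn',
    '--num-executors', String(opts.numExecutors),
    '--executor-cores', String(opts.executorCores),
    '--executor-memory', opts.executorMemory,
    ...confArgs,
    opts.script, // the application script must come last
  ];
}

const etlArgs = sparkSubmitArgs({
  script: 's3://demo-bucket/etl/spark-etl.py', // placeholder bucket name
  numExecutors: 2,
  executorCores: 8,
  executorMemory: '12g',
  conf: { 'spark.yarn.submit.waitAppCompletion': 'true' },
});
```

The resulting array can be passed as the `args` property of an `EmrAddStep` task.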
import * as events from '@aws-cdk/aws-events';
import * as targets from '@aws-cdk/aws-events-targets';
const stateMachine = new sfn.StateMachine(this, 'StateMachine', {
stateMachineName: stateMachineName,
definition: shouldLaunchCluster,
});
const stateMachineTarget = new targets.SfnStateMachine(
stateMachine,
{
input: events.RuleTargetInput.fromObject({
LaunchCluster: true,
TerminateCluster: false,
}),
}
);
const stateMachineRule = new events.Rule(
this,
'StateMachineRule',
{
schedule: events.Schedule.expression(`cron(20 0 ? * Mon-Fri *)`),
ruleName: `${process.env.DEPLOYMENT_ENV}-sql-analytics-statemachine-rule`,
enabled: true,
description:
'An event rule to launch an EMR cluster via AWS Step Functions.',
}
);
stateMachineRule.addTarget(stateMachineTarget);
Example of production workload
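EventBridge cron expressions have six fields (minutes, hours, day of month, month, day of week, year) and are evaluated in UTC, so `cron(20 0 ? * Mon-Fri *)` fires at 00:20 UTC from Monday to Friday. The sketch below only splits an expression into named fields to make that reading explicit. `parseCron` is a hypothetical helper and does not validate the individual field values.

```typescript
interface EventBridgeCron {
  minutes: string;
  hours: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
  year: string;
}

// Hypothetical helper: splits `cron(...)` into its six named fields.
function parseCron(expression: string): EventBridgeCron {
  const match = /^cron\((.+)\)$/.exec(expression.trim());
  if (!match) {
    throw new Error(`Not a cron(...) expression: ${expression}`);
  }
  const fields = match[1].trim().split(/\s+/);
  if (fields.length !== 6) {
    throw new Error(`Expected 6 fields, got ${fields.length}`);
  }
  const [minutes, hours, dayOfMonth, month, dayOfWeek, year] = fields;
  // EventBridge does not allow both day fields to carry a value:
  // one of them has to be '?'.
  if (dayOfMonth !== '?' && dayOfWeek !== '?') {
    throw new Error("Use '?' for either day-of-month or day-of-week");
  }
  return { minutes, hours, dayOfMonth, month, dayOfWeek, year };
}

const weekdaySchedule = parseCron('cron(20 0 ? * Mon-Fri *)');
```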
Common Errors in
AWS Step Functions
https://docs.aws.amazon.com/step-functions/latest/apireference/CommonErrors.html
src
├── custom-glue-workflows.ts
├── emr-scripts
│ ├── bootstrap
│ │ ├── modify_meta_database.sh
│ │ └── update_ssm.sh
│ ├── hive
│ │ ├── hive_create_database.q
│ │ └── hive_create_table.q
│ └── spark
│ └── remove-auto.py
├── glue-resources.ts
├── glue-scripts
│ ├── app-team
│ │ └── resume_detection.py
│ └── schema
│ └── update_workflow_property.py
AWS Glue
AWS Services
On-premises
Big Data Data Warehouse
SaaS
Cross-cloud
Data Store
● Inferring schema, detecting data drift, keeping metadata up to date
● Reusable data pipelines, event-triggered workflow
● Visual data preparation tool for data analysis
● Materialized views
AWS Glue 3.0
● Performance-optimized Spark runtime
○ upgraded from Spark 2.4 to Spark 3.1.1
○ upgraded JDBC drivers
● Faster read and write access
● Faster, more efficient partition pruning
● Fine-grained access control
● ACID transactions
● Improved user experience for monitoring, debugging, and tuning Spark applications
Xue, C. and Zhou, Y., 2021. Building a SIMD Supported Vectorized Native Engine for
Spark SQL. [video] Available at: <https://youtu.be/hwAzodnaqa0>
Shuffled hash join improvement (SPARK-32461)
● Preserve shuffled hash join build side partitioning (SPARK-32330)
● Preserve hash join (BHJ and SHJ) stream side ordering (SPARK-32383)
● Coalesce bucketed tables for shuffled hash join (SPARK-32286)
● Add code-gen for shuffled hash join (SPARK-32421)
● Support full outer join in shuffled hash join (SPARK-32399)
Event-triggered Glue workflow
{
"Sid": "S3Event1",
"Effect": "Allow",
"Principal": {
"Service": "cloudtrail.amazonaws.com"
},
"Action": "s3:GetBucketAcl",
"Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1"
},
{
"Sid": "S3Event2",
"Effect": "Allow",
"Principal": {
"Service": "cloudtrail.amazonaws.com"
},
"Action": "s3:PutObject",
"Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1/hakunamatata/*",
"Condition": {
"StringEquals": {
"s3:x-amz-acl": "bucket-owner-full-control"
}
}
}
S3 Bucket Policy
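The two statements differ only in action, resource, and condition, so they can be generated from a bucket name and key prefix. The sketch below does that with plain objects. `cloudTrailBucketStatements` and the `PolicyStatement` interface are hypothetical names, and the output mirrors the statements shown on the slide rather than a complete bucket policy.

```typescript
// Minimal shape for the statements used in this sketch.
interface PolicyStatement {
  Sid: string;
  Effect: 'Allow' | 'Deny';
  Principal: { Service: string };
  Action: string;
  Resource: string;
  Condition?: { StringEquals: Record<string, string> };
}

// Hypothetical helper: builds the two statements that let CloudTrail
// check the bucket ACL and write objects under `keyPrefix`.
function cloudTrailBucketStatements(
  bucketName: string,
  keyPrefix: string,
): PolicyStatement[] {
  const bucketArn = `arn:aws:s3:::${bucketName}`;
  const principal = { Service: 'cloudtrail.amazonaws.com' };
  return [
    {
      Sid: 'S3Event1',
      Effect: 'Allow',
      Principal: principal,
      Action: 's3:GetBucketAcl',
      Resource: bucketArn,
    },
    {
      Sid: 'S3Event2',
      Effect: 'Allow',
      Principal: principal,
      Action: 's3:PutObject',
      Resource: `${bucketArn}/${keyPrefix}/*`,
      Condition: {
        StringEquals: { 's3:x-amz-acl': 'bucket-owner-full-control' },
      },
    },
  ];
}

const trailStatements = cloudTrailBucketStatements(
  'scott-target-bucket-ap-northeast-1',
  'hakunamatata',
);
```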
import * as cloudtrail from '@aws-cdk/aws-cloudtrail';
import * as s3 from '@aws-cdk/aws-s3';
const glueWorkflowTrail = new cloudtrail.Trail(this, 'GlueWorkflowTrail', {
trailName: `s3-events-trail`,
bucket: s3.Bucket.fromBucketName(
this,
'DemoBucket',
`scott-demo-events-${cdk.Aws.REGION}`
),
s3KeyPrefix: 'event-folder',
});
glueWorkflowTrail.addS3EventSelector(
[
{
bucket: s3.Bucket.fromBucketName(
this,
'BucketWhereFileWillBePut',
`scott-target-bucket-${cdk.Aws.REGION}`
),
objectPrefix: 'hakunamatata/',
},
],
{
includeManagementEvents: false,
readWriteType: cloudtrail.ReadWriteType.WRITE_ONLY,
}
);
CloudTrail configuration
import * as glue from '@aws-cdk/aws-glue';
const glueWorkFlow = new glue.CfnWorkflow(this, 'GlueWorkFlow', {
name: `scott-demo-glue-workflow`,
description:
'A Glue workflow',
});
const glueWorkFlowArn =
  `arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:workflow/${glueWorkFlow.name}`;
Glue workflow
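The workflow ARN is assembled by hand because it is needed as a plain string for the event rule target. As a small sketch of the format, with a placeholder account ID and a hypothetical helper name `glueWorkflowArn`:

```typescript
// Hypothetical helper: formats a Glue workflow ARN from its parts.
function glueWorkflowArn(
  partition: string,
  region: string,
  accountId: string,
  workflowName: string,
): string {
  return `arn:${partition}:glue:${region}:${accountId}:workflow/${workflowName}`;
}

// The account ID below is a placeholder.
const demoWorkflowArn = glueWorkflowArn(
  'aws',
  'ap-northeast-1',
  '123456789012',
  'scott-demo-glue-workflow',
);
```

In the CDK code the partition, region, and account come from `cdk.Aws` tokens, which resolve at deploy time instead of being literal strings.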
import * as events from '@aws-cdk/aws-events';
const eventRule = new events.Rule(this, 'FileDetectionRule', {
ruleName: `event-glue-workflow-trigger`,
description:
'An event rule to trigger the Glue workflow',
eventPattern: {
source: ['aws.s3'],
detailType: ['AWS API Call via CloudTrail'],
detail: {
['eventSource']: ['s3.amazonaws.com'],
['eventName']: ['PutObject'],
['requestParameters']: {
bucketName: [`scott-target-bucket-${cdk.Aws.REGION}`],
key: [{ prefix: 'hakunamatata/' }],
},
},
},
});
Amazon EventBridge
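To see which CloudTrail events this pattern would select, the sketch below re-implements a small part of the matching locally: exact string matches plus the `prefix` matcher on the object key. It is a simplified illustration and not EventBridge's actual matching engine. `matchesAny`, `matchesRule`, and `PutObjectDetail` are hypothetical names.

```typescript
// A matcher is either an exact string or a { prefix } object,
// which are the only two forms the rule above uses.
type Matcher = string | { prefix: string };

function matchesAny(value: string, matchers: Matcher[]): boolean {
  return matchers.some((m) =>
    typeof m === 'string' ? value === m : value.indexOf(m.prefix) === 0,
  );
}

// Only the fields of the CloudTrail event detail that the rule inspects.
interface PutObjectDetail {
  eventSource: string;
  eventName: string;
  requestParameters: { bucketName: string; key: string };
}

function matchesRule(
  detail: PutObjectDetail,
  bucketName: string,
  keyPrefix: string,
): boolean {
  return (
    matchesAny(detail.eventSource, ['s3.amazonaws.com']) &&
    matchesAny(detail.eventName, ['PutObject']) &&
    matchesAny(detail.requestParameters.bucketName, [bucketName]) &&
    matchesAny(detail.requestParameters.key, [{ prefix: keyPrefix }])
  );
}

const sampleDetail: PutObjectDetail = {
  eventSource: 's3.amazonaws.com',
  eventName: 'PutObject',
  requestParameters: {
    bucketName: 'scott-target-bucket-ap-northeast-1',
    key: 'hakunamatata/resume.csv', // hypothetical object key
  },
};
const sampleMatches = matchesRule(
  sampleDetail,
  'scott-target-bucket-ap-northeast-1',
  'hakunamatata/',
);
```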
const cfnEventRule = eventRule.node.defaultChild as events.CfnRule;
cfnEventRule.targets = [
{
arn: glueWorkFlowArn,
id: 'CloudTrailTriggersWorkflow',
roleArn: eventBridgeExecutionRole.roleArn,
},
];
Set target as the Glue workflow
import * as cr from '@aws-cdk/custom-resources';
import * as iam from '@aws-cdk/aws-iam';
import * as logs from '@aws-cdk/aws-logs';
const updateTriggerSdkCall: cr.AwsSdkCall = {
service: 'Glue',
action: 'updateTrigger',
parameters: {
Name: triggerEntity.name,
TriggerUpdate: {
Actions: [
{
JobName: jobName,
Timeout: 1,
},
],
Description: triggerEntity.description,
EventBatchingCondition: {
BatchSize: 8,
BatchWindow: 120,
},
},
},
physicalResourceId: cr.PhysicalResourceId.of(Date.now().toString()),
};
new cr.AwsCustomResource(this, id + 'CustomResource', {
onCreate: updateTriggerSdkCall,
onUpdate: updateTriggerSdkCall,
policy: cr.AwsCustomResourcePolicy.fromStatements([
new iam.PolicyStatement({
actions: ['glue:UpdateTrigger'],
resources: [
`arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:trigger/*`,
],
}),
]),
logRetention: logs.RetentionDays.ONE_WEEK,
});}
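The `EventBatchingCondition` above means the trigger fires once 8 events have arrived, or 120 seconds after the first event of a batch, whichever comes first. The simulation below is a simplified model of that behaviour for reasoning about batch settings. `simulateBatching` is a hypothetical helper, times are plain seconds, and it is not how Glue implements batching internally.

```typescript
// Returns the times (in seconds) at which the trigger would fire
// for the given event arrival times.
function simulateBatching(
  eventTimes: number[],
  batchSize: number,
  batchWindow: number,
): number[] {
  const fired: number[] = [];
  const sorted = eventTimes.slice().sort((a, b) => a - b);
  let pending = 0; // events waiting in the current batch
  let windowStart = 0; // arrival time of the first pending event

  for (let i = 0; i < sorted.length; i++) {
    const t = sorted[i];
    // The window ran out before this event arrived: fire at expiry.
    if (pending > 0 && t >= windowStart + batchWindow) {
      fired.push(windowStart + batchWindow);
      pending = 0;
    }
    if (pending === 0) {
      windowStart = t; // this event opens a new window
    }
    pending += 1;
    // The batch filled up: fire immediately.
    if (pending === batchSize) {
      fired.push(t);
      pending = 0;
    }
  }
  // Leftover events fire when their window expires.
  if (pending > 0) {
    fired.push(windowStart + batchWindow);
  }
  return fired;
}

// Eight quick uploads fill the batch; three sparse uploads rely on the window.
const fullBatch = simulateBatching([0, 1, 2, 3, 4, 5, 6, 7], 8, 120);
const sparseUploads = simulateBatching([0, 10, 200], 8, 120);
```

With these inputs, the first call fires once when the eighth event arrives, while the second fires twice on window expiry.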
https://databricks.com/product/delta-lake-on-databricks
Ask Me Anything

More Related Content

What's hot

Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...
DataWorks Summit
 
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
Michael Stack
 

What's hot (20)

Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
ebay
ebayebay
ebay
 
Omid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBaseOmid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBase
 
Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid
 
BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2
 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with Spark
 
Lambda architecture
Lambda architectureLambda architecture
Lambda architecture
 
In Flux Limiting for a multi-tenant logging service
In Flux Limiting for a multi-tenant logging serviceIn Flux Limiting for a multi-tenant logging service
In Flux Limiting for a multi-tenant logging service
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...
 
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar SeriesGetting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
 
Kafka Security
Kafka SecurityKafka Security
Kafka Security
 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta Lake
 
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
 

Similar to Big Data Tools in AWS

Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
DataWorks Summit
 
OGCE Overview for SciDAC 2009
OGCE Overview for SciDAC 2009OGCE Overview for SciDAC 2009
OGCE Overview for SciDAC 2009
marpierc
 

Similar to Big Data Tools in AWS (20)

Docker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline ExecutionDocker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline Execution
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDK
 
Machine learning at scale with aws sage maker
Machine learning at scale with aws sage makerMachine learning at scale with aws sage maker
Machine learning at scale with aws sage maker
 
Phil Basford - machine learning at scale with aws sage maker
Phil Basford - machine learning at scale with aws sage makerPhil Basford - machine learning at scale with aws sage maker
Phil Basford - machine learning at scale with aws sage maker
 
Building a serverless company on AWS lambda and Serverless framework
Building a serverless company on AWS lambda and Serverless frameworkBuilding a serverless company on AWS lambda and Serverless framework
Building a serverless company on AWS lambda and Serverless framework
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Designing a production grade realtime ml inference endpoint
Designing a production grade realtime ml inference endpointDesigning a production grade realtime ml inference endpoint
Designing a production grade realtime ml inference endpoint
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Cassandra Summit 2014: Highly Scalable Web Application in the Cloud with Cass...
Cassandra Summit 2014: Highly Scalable Web Application in the Cloud with Cass...Cassandra Summit 2014: Highly Scalable Web Application in the Cloud with Cass...
Cassandra Summit 2014: Highly Scalable Web Application in the Cloud with Cass...
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 
Expanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerExpanding your impact with programmability in the data center
Expanding your impact with programmability in the data center
 
[NDC 2019] Enterprise-Grade Serverless
[NDC 2019] Enterprise-Grade Serverless[NDC 2019] Enterprise-Grade Serverless
[NDC 2019] Enterprise-Grade Serverless
 
[NDC 2019] Functions 2.0: Enterprise-Grade Serverless
[NDC 2019] Functions 2.0: Enterprise-Grade Serverless[NDC 2019] Functions 2.0: Enterprise-Grade Serverless
[NDC 2019] Functions 2.0: Enterprise-Grade Serverless
 
OGCE Overview for SciDAC 2009
OGCE Overview for SciDAC 2009OGCE Overview for SciDAC 2009
OGCE Overview for SciDAC 2009
 
Taking advantage of the Amazon Web Services (AWS) Family
Taking advantage of the Amazon Web Services (AWS) FamilyTaking advantage of the Amazon Web Services (AWS) Family
Taking advantage of the Amazon Web Services (AWS) Family
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 

More from Shu-Jeng Hsieh

Scheduling advertisements on a web page to maximize revenue
Scheduling advertisements on a web page to maximize revenueScheduling advertisements on a web page to maximize revenue
Scheduling advertisements on a web page to maximize revenue
Shu-Jeng Hsieh
 
A data driven approach to measure web site navigability
A data driven approach to measure web site navigabilityA data driven approach to measure web site navigability
A data driven approach to measure web site navigability
Shu-Jeng Hsieh
 

More from Shu-Jeng Hsieh (11)

Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
The Exabyte Journey and DataBrew with CICD
The Exabyte Journey and DataBrew with CICDThe Exabyte Journey and DataBrew with CICD
The Exabyte Journey and DataBrew with CICD
 
Serverless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineServerless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipeline
 
How cdk and projen benefit to A team
How cdk and projen benefit to A teamHow cdk and projen benefit to A team
How cdk and projen benefit to A team
 
Optimization technique for SAS
Optimization technique for SASOptimization technique for SAS
Optimization technique for SAS
 
The way
The wayThe way
The way
 
Demo of our thoughts
Demo of our thoughtsDemo of our thoughts
Demo of our thoughts
 
Trial plan with capitation payment of the national healthcare insurance in ta...
Trial plan with capitation payment of the national healthcare insurance in ta...Trial plan with capitation payment of the national healthcare insurance in ta...
Trial plan with capitation payment of the national healthcare insurance in ta...
 
Prediction for consuming behavior of noncontractual customers
Prediction for consuming behavior of noncontractual customersPrediction for consuming behavior of noncontractual customers
Prediction for consuming behavior of noncontractual customers
 
Scheduling advertisements on a web page to maximize revenue
Scheduling advertisements on a web page to maximize revenueScheduling advertisements on a web page to maximize revenue
Scheduling advertisements on a web page to maximize revenue
 
A data driven approach to measure web site navigability
A data driven approach to measure web site navigabilityA data driven approach to measure web site navigability
A data driven approach to measure web site navigability
 

Recently uploaded

Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
HyderabadDolls
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
HyderabadDolls
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 

Recently uploaded (20)

Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime GiridihGiridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 

Big Data Tools in AWS

  • 1. Big Data Tools in AWS Scott (考特) Sep. 29th, 2021 (Wed) AWS User Group Oh. So. CDK
  • 2. Scott (考特) Shu-Jeng, Hsieh ● Sr. Data Engineer, the 104 ● AWS Community Builder
  • 6. Transient or long-running clusters Long-running and auto scaling Transient and job scoped 1.Great for lines of business leaders 2.Great for short-running jobs or ad hoc queries 3.Ideal to save costs for multi-tenanted data science and data engineering jobs 1.Works well for job-scoped pipelines 2.Reduces blast radius 3.Easier to upgrade clusters and restart jobs Example use cases: ● Notebooks ● Ad-hoc jobs and experimentation ● streaming Example use cases: ● Large-scale transformation ● ETL to other DWH or Data Lake ● Building ML jobs Liem, M., 2020. Amazon EMR Deep Dive and Best Practices - AWS Online Tech Talks. [video] Available at: <https://www.youtube.com/watch?v=dU40df0Suoo>
  • 7. Automation API requesting via AWS SDK API requesting via AWS Lambda State machine Amazon Data Pipeline
  • 8. Deployment options Amazon EC2 Amazon EKS AWS Outposts Ad hoc workloads Data pipelines Data science notebooks
  • 9. Richardson, C., Novikova, M. and Zhang, K., 2021. How Tamr Optimized Amazon EMR Workloads to Unify 200 Billion Records 5x Faster than On-Premises
  • 10. Amazon S3 marts Amazon S3 source data Amazon EMR Prepare data Launch Service Use data JDBC Access Zeppelin AirFlow pipelines Dubrovsky, O. and Reuveni, Y., 2020. AWS re:Invent 2020: How Nielsen built a multi-petabyte data platform using Amazon EMR.
  • 12.
    import * as cdk from '@aws-cdk/core';
    import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';

    interface ExtendedEmrCreateClusterProps extends tasks.EmrCreateClusterProps {
      /**
       * Specifies the step concurrency level to allow multiple steps to run in parallel
       *
       * Requires EMR release label 5.28.0 or above.
       * Must be in range [1, 256].
       *
       * @default 1 - no step concurrency allowed
       */
      readonly stepConcurrencyLevel?: number;
    }

    class ExtendedEmrCreateCluster extends tasks.EmrCreateCluster {
      protected readonly stepConcurrencyLevel: number;

      constructor(
        scope: cdk.Construct,
        id: string,
        props: ExtendedEmrCreateClusterProps
      ) {
        super(scope, id, props);
        this.stepConcurrencyLevel = props.stepConcurrencyLevel ?? 1;
      }

      // Adds StepConcurrencyLevel to the Parameters rendered by the parent task.
      protected _renderTask(): any {
        const originalObject = super._renderTask();
        const extensionObject = {};
        Object.assign(extensionObject, originalObject, {
          Parameters: {
            StepConcurrencyLevel: cdk.numberToCloudFormation(
              this.stepConcurrencyLevel
            ),
            ...originalObject.Parameters,
          },
        });
        return extensionObject;
      }
    }

    CDK issues
    ● #15223
    ● #15242
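The override on this slide depends on merge order: `StepConcurrencyLevel` is written first and the original `Parameters` are spread afterwards. A minimal sketch of that merge with plain objects, assuming no CDK dependency, is shown here. `RenderedTask` and `renderWithStepConcurrency` are hypothetical names used only for illustration, and the sample task object is invented.

```typescript
// Hypothetical stand-in for the object returned by super._renderTask().
interface RenderedTask {
  Resource: string;
  Parameters: Record<string, unknown>;
}

// Same merge order as the slide: StepConcurrencyLevel first, then the
// original parameters spread on top of it.
function renderWithStepConcurrency(
  original: RenderedTask,
  stepConcurrencyLevel: number
): RenderedTask {
  return {
    ...original,
    Parameters: {
      StepConcurrencyLevel: stepConcurrencyLevel,
      // Spread last, so a key already present in the original wins.
      ...original.Parameters,
    },
  };
}

const rendered = renderWithStepConcurrency(
  { Resource: 'example-resource', Parameters: { Name: 'demo-cluster' } },
  4
);
console.log(JSON.stringify(rendered.Parameters));
```

Because the original parameters are spread last, a `StepConcurrencyLevel` that the parent class already rendered would take precedence over the extended value.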
  • 13. Constructs that you will encounter frequently
    import * as sfn from '@aws-cdk/aws-stepfunctions';
    import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';

    tasks.EmrSetClusterTerminationProtection
    tasks.EmrAddStep
    tasks.EmrTerminateCluster
    sfn.Choice
    sfn.Condition
    sfn.Parallel
  • 14. Example of assigning parallel tasks to an EMR cluster
    // Declare the Parallel state first, then attach one branch per EMR step.
    const dataMovementParallel = new sfn.Parallel(
      this,
      'Do some complex things in an EMR Cluster',
      {
        resultPath: sfn.JsonPath.DISCARD,
      }
    );

    dataMovementParallel.branch(
      new tasks.EmrAddStep(this, 'Make Traditional Chinese available', {
        name: 'modify metadata',
        clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
        actionOnFailure: tasks.ActionOnFailure.CONTINUE,
        jar: 'command-runner.jar',
        args: [
          'bash',
          '-c',
          `aws s3 cp s3://${this.demoBucketName}/modify_meta_database.sh .; chmod +x modify_meta_database.sh; ./modify_meta_database.sh; rm modify_meta_database.sh;`,
        ],
      })
    );

    dataMovementParallel.branch(
      new tasks.EmrAddStep(this, 'Some ETL', {
        name: 'Execute an ETL',
        clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
        actionOnFailure: tasks.ActionOnFailure.CONTINUE,
        jar: 'command-runner.jar',
        args: [
          'spark-submit',
          '--deploy-mode', 'cluster',
          '--master', 'yarn',
          '--num-executors', '2',
          '--executor-cores', '8',
          '--executor-memory', '12g',
          '--conf', 'spark.yarn.submit.waitAppCompletion=true',
          `s3://${this.demoBucketName}/etl/spark-etl.py`,
        ],
      })
    );
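The `args` array for the Spark step is a flat list of strings, so a typo in one flag is easy to miss. One possible way to keep it readable is a small builder. This is a sketch only: `SparkSubmitOptions` and `buildSparkSubmitArgs` are hypothetical helpers, not part of the CDK or of the original deck, and the bucket name is a placeholder.

```typescript
// Hypothetical options type covering only the flags used on the slide.
interface SparkSubmitOptions {
  script: string;
  numExecutors: number;
  executorCores: number;
  executorMemory: string;
  conf: Record<string, string>;
}

// Builds the flat argument list that command-runner.jar receives.
function buildSparkSubmitArgs(options: SparkSubmitOptions): string[] {
  const args: string[] = [
    'spark-submit',
    '--deploy-mode', 'cluster',
    '--master', 'yarn',
    '--num-executors', String(options.numExecutors),
    '--executor-cores', String(options.executorCores),
    '--executor-memory', options.executorMemory,
  ];
  // Each configuration entry becomes a "--conf key=value" pair.
  for (const key of Object.keys(options.conf)) {
    args.push('--conf', `${key}=${options.conf[key]}`);
  }
  args.push(options.script);
  return args;
}

const etlArgs = buildSparkSubmitArgs({
  script: 's3://example-bucket/etl/spark-etl.py', // placeholder bucket
  numExecutors: 2,
  executorCores: 8,
  executorMemory: '12g',
  conf: { 'spark.yarn.submit.waitAppCompletion': 'true' },
});
console.log(etlArgs.join(' '));
```

The returned array could be passed as the `args` property of `tasks.EmrAddStep` in place of the hand-written list.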
  • 15.
    import * as events from '@aws-cdk/aws-events';
    import * as targets from '@aws-cdk/aws-events-targets';
    import * as sfn from '@aws-cdk/aws-stepfunctions';

    // stateMachineName and shouldLaunchCluster are defined elsewhere in the stack.
    const stateMachine = new sfn.StateMachine(this, 'StateMachine', {
      stateMachineName: stateMachineName,
      definition: shouldLaunchCluster,
    });

    const stateMachineTarget = new targets.SfnStateMachine(
      stateMachine,
      {
        input: events.RuleTargetInput.fromObject({
          LaunchCluster: true,
          TerminateCluster: false,
        }),
      }
    );

    const stateMachineRule = new events.Rule(
      this,
      'StateMachineRule',
      {
        schedule: events.Schedule.expression(`cron(20 0 ? * Mon-Fri *)`),
        ruleName: `${process.env.DEPLOYMENT_ENV}-sql-analytics-statemachine-rule`,
        enabled: true,
        description: 'An event rule to launch an EMR cluster via AWS Step Functions.',
      }
    );
    stateMachineRule.addTarget(stateMachineTarget);
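The schedule expression on this slide uses the six-field EventBridge cron format (minutes, hours, day of month, month, day of week, year), evaluated in UTC, so `cron(20 0 ? * Mon-Fri *)` fires at 00:20 UTC from Monday to Friday. A minimal sketch that only splits the expression into named fields, which can help when reviewing schedules, is shown here. `parseEventBridgeCron` is a hypothetical helper and does not validate field values.

```typescript
// Named view of the six fields of an EventBridge cron expression.
interface EventBridgeCron {
  minutes: string;
  hours: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
  year: string;
}

// Splits "cron(...)" into its six fields; it does not check value ranges.
function parseEventBridgeCron(expression: string): EventBridgeCron {
  const match = /^cron\((.+)\)$/.exec(expression.trim());
  if (!match) {
    throw new Error(`Not a cron() expression: ${expression}`);
  }
  const fields = match[1].trim().split(/\s+/);
  if (fields.length !== 6) {
    throw new Error(`Expected 6 fields, got ${fields.length}`);
  }
  const [minutes, hours, dayOfMonth, month, dayOfWeek, year] = fields;
  return { minutes, hours, dayOfMonth, month, dayOfWeek, year };
}

const launchSchedule = parseEventBridgeCron('cron(20 0 ? * Mon-Fri *)');
console.log(launchSchedule);
```

In this format one of the day-of-month and day-of-week fields carries the `?` wildcard, which is why the slide uses `?` for the day of month and `Mon-Fri` for the day of week.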
  • 17. Common Errors in AWS Step Functions https://docs.aws.amazon.com/step-functions/latest/apireference/CommonErrors.html
  • 18.
    src
    ├── custom-glue-workflows.ts
    ├── emr-scripts
    │   ├── bootstrap
    │   │   ├── modify_meta_database.sh
    │   │   └── update_ssm.sh
    │   ├── hive
    │   │   ├── hive_create_database.q
    │   │   └── hive_create_table.q
    │   └── spark
    │       └── remove-auto.py
    ├── glue-resources.ts
    ├── glue-scripts
    │   ├── app-team
    │   │   └── resume_detection.py
    │   └── schema
    │       └── update_workflow_property.py
  • 20. AWS Glue AWS Services On-premises Big Data Data Warehouse SaaS Cross-cloud Data Store
  • 21.
    ● Inferring schema, detecting data drift, keeping metadata up to date
    ● Reusable data pipelines, event-triggered workflow
    ● Visual data preparation tool for data analysis
    ● Materialized views
  • 23.
    ● Performance-optimized Spark runtime
      ○ Upgrading from Spark 2.4 to Spark 3.1.1
      ○ Upgraded JDBC drivers
    ● Faster read and write access
    ● Faster and more efficient partition pruning
    ● Fine-grained access control
    ● ACID transactions
    ● Improved user experience for monitoring, debugging, and tuning Spark applications
  • 24. Xue, C. and Zhou, Y., 2021. Building a SIMD Supported Vectorized Native Engine for Spark SQL. [video] Available at: <https://youtu.be/hwAzodnaqa0>
  • 25. Shuffled hash join improvement (SPARK-32461)
    ● Preserve shuffled hash join build side partitioning (SPARK-32330)
    ● Preserve hash join (BHJ and SHJ) stream side ordering (SPARK-32383)
    ● Coalesce bucketed tables for shuffled hash join (SPARK-32286)
    ● Add code-gen for shuffled hash join (SPARK-32421)
    ● Support full outer join in shuffled hash join (SPARK-32399)
  • 28. S3 Bucket Policy
    {
      "Sid": "S3Event1",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1"
    },
    {
      "Sid": "S3Event2",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1/hakunamatata/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-acl": "bucket-owner-full-control"
        }
      }
    }
  • 29. CloudTrail configuration
    import * as cdk from '@aws-cdk/core';
    import * as cloudtrail from '@aws-cdk/aws-cloudtrail';
    import * as s3 from '@aws-cdk/aws-s3';

    const glueWorkflowTrail = new cloudtrail.Trail(this, 'GlueWorkflowTrail', {
      trailName: `s3-events-trail`,
      bucket: s3.Bucket.fromBucketName(
        this,
        'DemoBucket',
        `scott-demo-events-${cdk.Aws.REGION}`
      ),
      s3KeyPrefix: 'event-folder',
    });

    // Record only write events for objects under the watched prefix.
    glueWorkflowTrail.addS3EventSelector(
      [
        {
          bucket: s3.Bucket.fromBucketName(
            this,
            'BucketWhereFileWillBePut',
            // Backticks are required here so that ${cdk.Aws.REGION} is interpolated.
            `scott-target-bucket-${cdk.Aws.REGION}`
          ),
          objectPrefix: 'hakunamatata/',
        },
      ],
      {
        includeManagementEvents: false,
        readWriteType: cloudtrail.ReadWriteType.WRITE_ONLY,
      }
    );
  • 30. Glue workflow
    import * as glue from '@aws-cdk/aws-glue';

    const glueWorkFlow = new glue.CfnWorkflow(this, 'GlueWorkFlow', {
      name: `scott-demo-glue-workflow`,
      description: 'A Glue workflow',
    });

    const glueWorkFlowArn = `arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:workflow/${glueWorkFlow.name}`;
  • 31. Amazon EventBridge
    import * as events from '@aws-cdk/aws-events';

    const eventRule = new events.Rule(this, 'FileDetectionRule', {
      ruleName: `event-glue-workflow-trigger`,
      description: 'An event rule to trigger the Glue workflow',
      eventPattern: {
        source: ['aws.s3'],
        detailType: ['AWS API Call via CloudTrail'],
        detail: {
          ['eventSource']: ['s3.amazonaws.com'],
          ['eventName']: ['PutObject'],
          ['requestParameters']: {
            // Backticks are required here so that ${cdk.Aws.REGION} is interpolated.
            bucketName: [`scott-target-bucket-${cdk.Aws.REGION}`],
            key: [{ prefix: 'hakunamatata/' }],
          },
        },
      },
    });
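In an EventBridge event pattern, the values inside one array are alternatives (OR) and separate fields must all match (AND). The following is a deliberately simplified sketch of how the `requestParameters` part of the pattern above selects events. It is not EventBridge's implementation: `RequestParametersPattern` and `matchesRequestParameters` are hypothetical names, only exact and `prefix` matching are modelled, and the region in the bucket name is an example value.

```typescript
// A key matcher is either an exact string or a prefix rule.
type KeyMatcher = string | { prefix: string };

interface RequestParametersPattern {
  bucketName: string[];
  key: KeyMatcher[];
}

// True when the bucket equals one listed name AND the key satisfies
// at least one matcher. Other pattern operators are not modelled.
function matchesRequestParameters(
  pattern: RequestParametersPattern,
  bucketName: string,
  key: string
): boolean {
  const bucketOk = pattern.bucketName.some((name) => name === bucketName);
  const keyOk = pattern.key.some((matcher) =>
    typeof matcher === 'string'
      ? matcher === key
      : key.slice(0, matcher.prefix.length) === matcher.prefix
  );
  return bucketOk && keyOk;
}

const examplePattern: RequestParametersPattern = {
  bucketName: ['scott-target-bucket-ap-northeast-1'],
  key: [{ prefix: 'hakunamatata/' }],
};

console.log(
  matchesRequestParameters(
    examplePattern,
    'scott-target-bucket-ap-northeast-1',
    'hakunamatata/resume.csv'
  )
);
```

Under this reading, an object uploaded outside the `hakunamatata/` prefix, or to a different bucket, would not trigger the rule.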
  • 32. Set target as the Glue workflow
    // Drop down to the L1 CfnRule to point the rule at the Glue workflow ARN.
    const cfnEventRule = eventRule.node.defaultChild as events.CfnRule;
    cfnEventRule.targets = [
      {
        arn: glueWorkFlowArn,
        id: 'CloudTrailTriggersWorkflow',
        roleArn: eventBridgeExecutionRole.roleArn,
      },
    ];
  • 33.
    import * as cr from '@aws-cdk/custom-resources';
    import * as iam from '@aws-cdk/aws-iam';
    import * as logs from '@aws-cdk/aws-logs';

    // SDK call that sets the event batching condition on an existing trigger.
    const updateTriggerSdkCall: cr.AwsSdkCall = {
      service: 'Glue',
      action: 'updateTrigger',
      parameters: {
        Name: triggerEntity.name,
        TriggerUpdate: {
          Actions: [
            {
              JobName: jobName,
              Timeout: 1,
            },
          ],
          Description: triggerEntity.description,
          EventBatchingCondition: {
            BatchSize: 8,
            BatchWindow: 120,
          },
        },
      },
      // A new physical ID on every deployment forces the call to run again.
      physicalResourceId: cr.PhysicalResourceId.of(Date.now().toString()),
    };

    new cr.AwsCustomResource(this, id + 'CustomResource', {
      onCreate: updateTriggerSdkCall,
      onUpdate: updateTriggerSdkCall,
      policy: cr.AwsCustomResourcePolicy.fromStatements([
        new iam.PolicyStatement({
          actions: ['glue:UpdateTrigger'],
          resources: [
            `arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:trigger/*`,
          ],
        }),
      ]),
      logRetention: logs.RetentionDays.ONE_WEEK,
    });
    } // closing brace of the enclosing scope, which the slide does not show
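With `BatchSize: 8` and `BatchWindow: 120`, the trigger is expected to fire once 8 events have arrived, or 120 seconds after the first event of a batch, whichever comes first. The sketch below simulates that behaviour for a list of event timestamps in seconds. It reflects one reading of the documented semantics, not Glue's internal logic, and `simulateBatchFirings` is a hypothetical helper.

```typescript
// Returns the times (in seconds) at which the trigger would fire.
// Assumption: the window starts at the first event of each batch.
function simulateBatchFirings(
  eventTimes: number[],
  batchSize: number,
  batchWindow: number
): number[] {
  const sorted = eventTimes.slice().sort((a, b) => a - b);
  const firings: number[] = [];
  let windowStart: number | undefined;
  let count = 0;

  for (const t of sorted) {
    // The open window expired before this event arrived: fire at expiry.
    if (windowStart !== undefined && t >= windowStart + batchWindow) {
      firings.push(windowStart + batchWindow);
      windowStart = undefined;
      count = 0;
    }
    // The first event of a new batch opens the window.
    if (windowStart === undefined) {
      windowStart = t;
    }
    count += 1;
    // Enough events collected: fire immediately and reset.
    if (count >= batchSize) {
      firings.push(t);
      windowStart = undefined;
      count = 0;
    }
  }
  // A batch still open at the end fires when its window expires.
  if (windowStart !== undefined) {
    firings.push(windowStart + batchWindow);
  }
  return firings;
}

console.log(simulateBatchFirings([0, 1, 2, 3, 4, 5, 6, 7], 8, 120));
console.log(simulateBatchFirings([0, 10, 200], 8, 120));
```

In the first call the eighth event completes the batch. In the second call too few events arrive, so each batch fires when its 120-second window runs out.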