This talk was given at a seminar co-hosted by Cathay Bank and the AWS User Group in Taiwan. It gives an overview of Amazon EMR and AWS Glue, and shows how to manage those services with the AWS CDK through practical scenarios.
6. Transient or long-running clusters

Long-running and auto scaling
1. Great for lines of business leaders
2. Great for short-running jobs or ad hoc queries
3. Ideal to save costs for multi-tenanted data science and data engineering jobs
Example use cases:
● Notebooks
● Ad hoc jobs and experimentation
● Streaming

Transient and job scoped
1. Works well for job-scoped pipelines
2. Reduces blast radius
3. Easier to upgrade clusters and restart jobs
Example use cases:
● Large-scale transformation
● ETL to other DWH or data lake
● Building ML jobs
Liem, M., 2020. Amazon EMR Deep Dive and Best Practices - AWS Online Tech
Talks. [video] Available at: <https://www.youtube.com/watch?v=dU40df0Suoo>
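One place where the long-running versus transient choice becomes concrete is the `Instances` section of an EMR `RunJobFlow` request, where `KeepJobFlowAliveWhenNoSteps` controls whether the cluster stays up once its steps finish. The sketch below is only an illustration and is not taken from the talk: the `ClusterMode` type and the `instancesFor` helper are hypothetical names, and turning on `TerminationProtected` for long-running clusters is an assumption rather than a recommendation from the slides.

```typescript
// Hypothetical helper; only the two flag names come from the EMR RunJobFlow API.
type ClusterMode = 'long-running' | 'transient';

interface InstancesFlags {
  KeepJobFlowAliveWhenNoSteps: boolean;
  TerminationProtected: boolean;
}

function instancesFor(mode: ClusterMode): InstancesFlags {
  const longRunning = mode === 'long-running';
  return {
    // A transient cluster shuts down once its last step has finished.
    KeepJobFlowAliveWhenNoSteps: longRunning,
    // Assumption: only shared, long-lived clusters are protected from termination.
    TerminationProtected: longRunning,
  };
}
```

A transient, job-scoped pipeline would pass `instancesFor('transient')`, so a failed or finished job never leaves an idle cluster behind.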
9. Richardson, C., Novikova, M. and Zhang, K., 2021. How Tamr Optimized Amazon EMR Workloads to
Unify 200 Billion Records 5x Faster than On-Premises
10. [Architecture diagram. Labels: Amazon S3 (source data); Amazon EMR (prepare data); Amazon S3 (marts); Launch service; Use data; JDBC access; Zeppelin; Airflow pipelines.]
Dubrovsky, O. and Reuveni, Y., 2020. AWS re:Invent 2020: How
Nielsen built a multi-petabyte data platform using Amazon EMR.
12.
import * as cdk from '@aws-cdk/core';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';

interface ExtendedEmrCreateClusterProps extends tasks.EmrCreateClusterProps {
  /**
   * Specifies the step concurrency level to allow multiple steps to run in parallel.
   *
   * Requires EMR release label 5.28.0 or above.
   * Must be in range [1, 256].
   *
   * @default 1 - no step concurrency allowed
   */
  readonly stepConcurrencyLevel?: number;
}

class ExtendedEmrCreateCluster extends tasks.EmrCreateCluster {
  protected readonly stepConcurrencyLevel: number;

  constructor(
    scope: cdk.Construct,
    id: string,
    props: ExtendedEmrCreateClusterProps
  ) {
    super(scope, id, props);
    this.stepConcurrencyLevel = props.stepConcurrencyLevel ?? 1;
  }

  // Extend the rendered task with StepConcurrencyLevel, which the base
  // construct did not expose at the time (see the CDK issues below).
  protected _renderTask(): any {
    const originalObject = super._renderTask();
    const extensionObject = {};
    Object.assign(extensionObject, originalObject, {
      Parameters: {
        StepConcurrencyLevel: cdk.numberToCloudFormation(
          this.stepConcurrencyLevel
        ),
        // Spread last, so keys already rendered by the base class take precedence.
        ...originalObject.Parameters,
      },
    });
    return extensionObject;
  }
}
CDK issues
● #15223
● #15242
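The merge performed in the `_renderTask` override above can be tried without the CDK. The following is a minimal sketch on plain objects, assuming the rendered task is an object with an optional `Parameters` map; `RenderedTask` and `withStepConcurrency` are hypothetical names, and the range check is added here because the doc comment states [1, 256] while the slide code does not enforce it.

```typescript
// Plain-object stand-in for the state JSON returned by the render step.
interface RenderedTask {
  Parameters?: { [key: string]: unknown };
  [key: string]: unknown;
}

function withStepConcurrency(task: RenderedTask, level: number = 1): RenderedTask {
  // Enforce the documented range: an integer in [1, 256].
  if (Math.floor(level) !== level || level < 1 || level > 256) {
    throw new RangeError(`stepConcurrencyLevel must be an integer in [1, 256], got ${level}`);
  }
  return {
    ...task,
    Parameters: {
      StepConcurrencyLevel: level,
      // Spread last, so keys that already exist in Parameters are kept as-is.
      ...task.Parameters,
    },
  };
}
```

Because the original parameters are spread last, the extension only adds the new key and cannot silently overwrite anything the base construct rendered.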
13. Constructs that you will encounter frequently

import * as sfn from '@aws-cdk/aws-stepfunctions';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';

● tasks.EmrSetClusterTerminationProtection
● tasks.EmrAddStep
● tasks.EmrTerminateCluster
● sfn.Choice
● sfn.Condition
● sfn.Parallel
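To make `sfn.Choice` and `sfn.Condition` less abstract, it may help to look at the kind of Amazon States Language `Choice` state they describe. The sketch below builds such a state as a plain object, with no CDK involved; `booleanChoice` is a hypothetical helper, the two target state names are invented for illustration, and `$.LaunchCluster` is borrowed from the input that the event rule later in this deck passes to the state machine.

```typescript
// Shape of a boolean rule and a Choice state in Amazon States Language.
interface ChoiceRule {
  Variable: string;
  BooleanEquals: boolean;
  Next: string;
}

interface ChoiceState {
  Type: 'Choice';
  Choices: ChoiceRule[];
  Default: string;
}

// Hypothetical helper: branch to `whenTrue` if the variable is true, else to `otherwise`.
function booleanChoice(variable: string, whenTrue: string, otherwise: string): ChoiceState {
  return {
    Type: 'Choice',
    Choices: [{ Variable: variable, BooleanEquals: true, Next: whenTrue }],
    Default: otherwise,
  };
}

// State names are assumptions made for this example only.
const launchDecision = booleanChoice(
  '$.LaunchCluster',
  'Create Cluster',
  'Use Existing Cluster'
);
```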
14. Example: assigning parallel tasks to an EMR cluster

// Declare the Parallel state first, then attach one branch per EMR step.
const dataMovementParallel = new sfn.Parallel(
  this,
  'Do some complex things in an EMR Cluster',
  {
    resultPath: sfn.JsonPath.DISCARD,
  }
);

dataMovementParallel.branch(
  new tasks.EmrAddStep(this, 'Make Traditional Chinese available', {
    name: 'modify metadata',
    clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
    actionOnFailure: tasks.ActionOnFailure.CONTINUE,
    jar: 'command-runner.jar',
    args: [
      'bash',
      '-c',
      // Download the script, run it, and clean up afterwards.
      [
        `aws s3 cp s3://${this.demoBucketName}/modify_meta_database.sh .`,
        'chmod +x modify_meta_database.sh',
        './modify_meta_database.sh',
        'rm modify_meta_database.sh',
      ].join('; '),
    ],
  })
);

dataMovementParallel.branch(
  new tasks.EmrAddStep(this, 'Some ETL', {
    name: 'Execute an ETL',
    clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
    actionOnFailure: tasks.ActionOnFailure.CONTINUE,
    jar: 'command-runner.jar',
    args: [
      'spark-submit',
      '--deploy-mode',
      'cluster',
      '--master',
      'yarn',
      '--num-executors',
      '2',
      '--executor-cores',
      '8',
      '--executor-memory',
      '12g',
      '--conf',
      'spark.yarn.submit.waitAppCompletion=true',
      `s3://${this.demoBucketName}/etl/spark-etl.py`,
    ],
  })
);
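Long `spark-submit` argument arrays like the one in the ETL step are easy to get wrong when flags and values are listed by hand. One possible way to keep them readable is a small builder, sketched below; `SparkSubmitOptions` and `sparkSubmitArgs` are hypothetical names that do not appear in the talk, the bucket name in the usage line is a placeholder, and the fixed `cluster` deploy mode and `yarn` master simply mirror the slide.

```typescript
// Hypothetical options type; field names are chosen for this example.
interface SparkSubmitOptions {
  script: string; // S3 URI of the PySpark script
  numExecutors: number;
  executorCores: number;
  executorMemory: string; // for example '12g'
  conf?: { [key: string]: string };
}

function sparkSubmitArgs(opts: SparkSubmitOptions): string[] {
  const args: string[] = [
    'spark-submit',
    '--deploy-mode', 'cluster',
    '--master', 'yarn',
    '--num-executors', String(opts.numExecutors),
    '--executor-cores', String(opts.executorCores),
    '--executor-memory', opts.executorMemory,
  ];
  // Each configuration entry becomes its own "--conf key=value" pair.
  const conf = opts.conf ?? {};
  for (const key of Object.keys(conf)) {
    args.push('--conf', `${key}=${conf[key]}`);
  }
  // The application script always comes last.
  args.push(opts.script);
  return args;
}

const etlArgs = sparkSubmitArgs({
  script: 's3://demo-bucket/etl/spark-etl.py', // placeholder bucket name
  numExecutors: 2,
  executorCores: 8,
  executorMemory: '12g',
  conf: { 'spark.yarn.submit.waitAppCompletion': 'true' },
});
```

The resulting array could then be passed as `args` to `tasks.EmrAddStep` in place of the hand-written list.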
15.
import * as events from '@aws-cdk/aws-events';
import * as targets from '@aws-cdk/aws-events-targets';
import * as sfn from '@aws-cdk/aws-stepfunctions';

const stateMachine = new sfn.StateMachine(this, 'StateMachine', {
  stateMachineName: stateMachineName,
  definition: shouldLaunchCluster,
});

// Input passed to the state machine on every scheduled run.
const stateMachineTarget = new targets.SfnStateMachine(
  stateMachine,
  {
    input: events.RuleTargetInput.fromObject({
      LaunchCluster: true,
      TerminateCluster: false,
    }),
  }
);

const stateMachineRule = new events.Rule(
  this,
  'StateMachineRule',
  {
    schedule: events.Schedule.expression(`cron(20 0 ? * Mon-Fri *)`),
    ruleName: `${process.env.DEPLOYMENT_ENV}-sql-analytics-statemachine-rule`,
    enabled: true,
    description:
      'An event rule to launch an EMR cluster via AWS Step Functions.',
  }
);
stateMachineRule.addTarget(stateMachineTarget);
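Cron expressions in event rule schedules are evaluated in UTC, so `cron(20 0 ? * Mon-Fri *)` fires at 00:20 UTC on weekdays. If the intended audience is in Taiwan (UTC+8), which is an assumption since the slide does not state the intended local time, that corresponds to 08:20 local time. The sketch below derives such an expression from a local time; `weekdayCron` is a hypothetical helper that only supports whole-hour offsets and deliberately refuses cases where the UTC time lands on a different day.

```typescript
// Hypothetical helper: build a weekday cron expression (in UTC) from a local time.
function weekdayCron(localHour: number, localMinute: number, utcOffsetHours: number): string {
  const utcHour = localHour - utcOffsetHours;
  if (utcHour < 0 || utcHour > 23) {
    // The weekday range would have to shift as well, so this sketch does not guess.
    throw new RangeError('UTC time falls on a different day; adjust the day-of-week field by hand');
  }
  return `cron(${localMinute} ${utcHour} ? * Mon-Fri *)`;
}

// 08:20 in UTC+8 is 00:20 UTC.
const launchSchedule = weekdayCron(8, 20, 8);
```

The resulting string could be passed to `events.Schedule.expression` as in the rule above.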
21.
● Inferring schema, detecting data drift, keeping metadata up to date
● Reusable data pipelines, event-triggered workflows
● Visual data preparation tool for data analysis
● Materialized views