This presentation was part of the "AWS Big Data Demystified #5 | Automate all your EMR related activities" meetup.
In this presentation I shared, from my own experience, how we automated EMR cluster creation for scheduled Spark ETL jobs, submitted ad-hoc Spark steps, and created EMR clusters per developer request from Slack, with the help of the super cool chatbot developed at WeissBeerger.
1. Automate all your EMR related activities
Eitan Sela - System Architect
eitan.sela@weissbeerger.com
2. $ whoami
• "Hands-on" system architect with more than 17 years of
experience in billing, banking, information security (DLP) and
cloud IoT/Big Data applications.
• Big Data specialist – Hadoop, Spark, Hive and EMR on AWS.
• Work with a wide range of AWS services, especially on
serverless projects.
• Expert in Java development, scalability, performance and
stabilization.
• Alexa skills developer.
• Love to share my experience in lectures and meetups.
3. What to expect from this session
• WeissBeerger use case – Aggregating raw orders and IoT data.
• Amazon EMR basics.
• Implementing ETLs with Spark.
• Submitting work to a Cluster.
• Provisioning scheduled transient EMR clusters for ETL jobs.
• Our new Slack chatbot for EMR, using Amazon Lex!
21. Launching Applications with spark-submit
./bin/spark-submit \
  --jars jar1.jar,jar2.jar \
  --py-files path/to/my/pymodule1.py,path/to/my/pymodule2.py \
  my_program.py arg1 arg2
• The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
• It can use all of Spark’s supported cluster managers through a
uniform interface so you don’t have to configure your application
especially for each one.
22. EMR Steps - Submit Work to a Cluster
• You can submit work to a cluster by adding steps or by interactively
submitting Hadoop jobs to the master node.
• You can add steps to a cluster using the AWS Management Console,
the AWS CLI, or the Amazon EMR API.
• You can add steps during cluster creation or to a running cluster.
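As a minimal sketch, adding a Spark step to a running cluster with boto3 could look like this. The cluster ID, S3 script path and arguments are hypothetical placeholders; `command-runner.jar` is the standard EMR way to invoke spark-submit in a step.

```python
# Sketch: add a Spark step to a running EMR cluster via boto3.
# The cluster ID and S3 script path below are hypothetical placeholders.

def build_spark_step(name, script_s3_path, args=()):
    """Build an EMR step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_s3_path, *args],
        },
    }

def add_step(cluster_id, step):
    import boto3  # requires AWS credentials and network access
    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]

step = build_spark_step("hourly-etl", "s3://my-bucket/jobs/etl.py", ["2018-06-01"])
# step_id = add_step("j-XXXXXXXXXXXXX", step)  # uncomment with a real cluster ID
```

The same step dictionary can also be passed at cluster-creation time in the `Steps` parameter of `run_job_flow`.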
29. Requirements
• Run ETL using Spark on an EMR cluster every hour, for data up to
one month back.
• Input: MySQL or Hive (stg).
• Output: Hive (stg) or Redshift.
• Storage should be separated from the compute, so EMR clusters
should be transient.
• Multiple clusters should be able to run together.
• Fully automated and monitored.
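A minimal sketch of provisioning such a transient cluster with boto3 is below. Instance types, counts, the release label and role names are illustrative assumptions; the key point is `KeepJobFlowAliveWhenNoSteps: False`, which makes the cluster terminate once its steps finish, keeping compute separate from storage.

```python
# Sketch: launch a transient EMR cluster that auto-terminates after its steps.
# Instance types, counts, release label and role names are assumptions.

def build_transient_cluster_config(name, steps):
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.13.0",
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.large",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m4.xlarge",
                 "InstanceCount": 2},
            ],
            # Transient: shut the cluster down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Steps": list(steps),
    }

def launch_cluster(config):
    import boto3  # needs AWS credentials; not exercised here
    return boto3.client("emr").run_job_flow(**config)["JobFlowId"]

config = build_transient_cluster_config("hourly-etl-cluster", steps=[])
# cluster_id = launch_cluster(config)
```

Because each run gets its own cluster, several hourly windows can be processed in parallel, matching the "multiple clusters should be able to run together" requirement.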
31. Passing Spark Job steps parameters to Lambda input
• We created a simple JSON with all the parameters required to add a step to the EMR cluster.
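The idea can be sketched as a Lambda handler that receives the step parameters as its JSON event and turns them into an EMR step definition. All field names here are hypothetical; the talk's actual JSON schema is not shown.

```python
# Sketch: a Lambda that maps a JSON event of step parameters to an EMR step.
# The event field names are made up for illustration.

def lambda_handler(event, context):
    step = {
        "Name": event["step_name"],
        "ActionOnFailure": event.get("action_on_failure", "CONTINUE"),
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", event["script_s3_path"],
                     *event.get("script_args", [])],
        },
    }
    # import boto3
    # boto3.client("emr").add_job_flow_steps(
    #     JobFlowId=event["cluster_id"], Steps=[step])
    return step

example_event = {
    "cluster_id": "j-XXXXXXXXXXXXX",
    "step_name": "hourly-etl",
    "script_s3_path": "s3://my-bucket/jobs/etl.py",
    "script_args": ["--hours-back", "1"],
}
```

Keeping the parameters in the event JSON means the same Lambda can serve any scheduled or ad-hoc job without code changes.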
32. Monitoring EMR Steps with Lambda and Datadog
• We created a Lambda to sample all running EMR clusters for failed steps.
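A monitoring Lambda along those lines could be sketched as follows. The `list_clusters`/`list_steps` calls are real EMR APIs; forwarding to Datadog is only indicated in a comment, since it depends on the Datadog client setup.

```python
# Sketch: scan active EMR clusters for failed steps, as a monitoring Lambda
# might. Datadog reporting is left as a comment.

def find_failed_steps(emr):
    """Return (cluster_id, step_name) pairs for failed steps on active clusters."""
    failed = []
    clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]
    for cluster in clusters:
        steps = emr.list_steps(ClusterId=cluster["Id"],
                               StepStates=["FAILED"])["Steps"]
        for step in steps:
            failed.append((cluster["Id"], step["Name"]))
    return failed

def lambda_handler(event, context):
    import boto3  # needs AWS credentials when actually deployed
    failed = find_failed_steps(boto3.client("emr"))
    for cluster_id, step_name in failed:
        # e.g. emit a Datadog event/metric here for alerting
        print(f"FAILED step {step_name!r} on cluster {cluster_id}")
    return {"failed_count": len(failed)}
```

Scheduling this Lambda (e.g. with a CloudWatch Events rule) gives the "fully automated and monitored" property from the requirements.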
35. Amazon Lex
• Conversational interfaces for your applications.
• Powered by the same deep learning technologies as Alexa.
• Amazon Lex provides the advanced deep learning functionalities of
automatic speech recognition (ASR) for converting speech to text,
and natural language understanding (NLU) to recognize the intent of
the text.
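The chatbot's fulfillment side can be sketched as a Lambda handling a Lex (V1-style) intent event. The intent and slot names below are hypothetical; the `dialogAction` response shape follows the Lex V1 Lambda input/output format.

```python
# Sketch: a fulfillment Lambda for a hypothetical "CreateEmrCluster" Lex intent.
# Intent and slot names are made up; the response shape is the Lex V1 format.

def lex_close(message):
    """Build a Lex V1 'Close' response that ends the conversation."""
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": message},
        }
    }

def lambda_handler(event, context):
    intent = event["currentIntent"]["name"]
    if intent == "CreateEmrCluster":
        size = event["currentIntent"]["slots"].get("ClusterSize") or "small"
        # here one would call run_job_flow() to actually launch the cluster
        return lex_close(f"Launching a {size} EMR cluster for you.")
    return lex_close("Sorry, I don't know that request.")

example_event = {
    "currentIntent": {"name": "CreateEmrCluster",
                      "slots": {"ClusterSize": "large"}}
}
```

Wiring this Lambda as the intent's fulfillment hook, with the Lex bot connected to Slack as a channel, closes the loop from a chat message to a running cluster.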
41. We Are Hiring!
Senior Data Scientist
Senior Designer (UI/UX)
Senior Full Stack Developer
Java Developer
Senior Manual QA
Director of Ops
BI Analyst
Data Management Analyst
Customer Success Manager
Senior BI Analyst