This presentation was part of the "AWS Big Data Demystified #5 | Automate all your EMR related activities" meetup.
In this presentation I shared, from my own experience, how we automated EMR cluster creation for scheduled Spark ETL jobs, submitted ad-hoc Spark steps, and created EMR clusters per developer request from Slack, with the help of the super cool chatbot developed at WeissBeerger.
1. Automate all your EMR related activities
Eitan Sela - System Architect
eitan.sela@weissbeerger.com
2. $ whoami
• "Hands-on" system architect with more than 17 years of
experience in billing, banking, information security (DLP) and
cloud IoT/Big Data applications.
• Big Data specialist – Hadoop, Spark, Hive and EMR on AWS.
• Work with a wide range of AWS services, especially on
serverless projects.
• Expert in Java development, scalability, performance and
stabilization.
• Alexa skills developer.
• Love to share my experience in lectures and meetups.
3. What to expect from this session
• WeissBeerger use case – Aggregating raw orders and IoT data.
• Amazon EMR basics.
• Implementing ETLs with Spark.
• Submitting work to a Cluster.
• Provisioning scheduled transient EMR clusters for ETL jobs.
• Our new Slack chatbot for EMR, using Amazon Lex!
21. Launching Applications with spark-submit
./bin/spark-submit \
  --jars jar1.jar,jar2.jar \
  --py-files path/to/my/pymodule1.py,path/to/my/pymodule2.py \
  my_program.py arg1 arg2
• The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
• It can use all of Spark’s supported cluster managers through a
uniform interface so you don’t have to configure your application
especially for each one.
22. EMR Steps - Submit Work to a Cluster
• You can submit work to a cluster by adding steps or by interactively
submitting Hadoop jobs to the master node.
• You can add steps to a cluster using the AWS Management Console,
the AWS CLI, or the Amazon EMR API.
• You can add steps during cluster creation or to a running cluster.
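As a minimal sketch, adding a Spark step to a running cluster with boto3 could look like this. The cluster ID, S3 script path and arguments are hypothetical placeholders; `command-runner.jar` is the standard EMR way to invoke spark-submit in a step.

```python
# Sketch: add a Spark step to a running EMR cluster via boto3.
# The cluster ID and S3 script path below are hypothetical placeholders.

def build_spark_step(name, script_s3_path, args=()):
    """Build an EMR step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_s3_path, *args],
        },
    }

def add_step(cluster_id, step):
    import boto3  # requires AWS credentials and network access
    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]

step = build_spark_step("hourly-etl", "s3://my-bucket/jobs/etl.py", ["2018-06-01"])
# step_id = add_step("j-XXXXXXXXXXXXX", step)  # uncomment with a real cluster ID
```

The same step dictionary can also be passed at cluster-creation time in the `Steps` parameter of `run_job_flow`.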
29. Requirements
• Run ETL using Spark on an EMR cluster every hour, for data up to
one month back.
• Input: MySQL or Hive (stg).
• Output: Hive (stg) or Redshift.
• Storage should be separated from the compute, so EMR clusters
should be transient.
• Multiple clusters should be able to run together.
• Fully automated and monitored.
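A minimal sketch of provisioning such a transient cluster with boto3 is below. Instance types, counts, the release label and role names are illustrative assumptions; the key point is `KeepJobFlowAliveWhenNoSteps: False`, which makes the cluster terminate once its steps finish, keeping compute separate from storage.

```python
# Sketch: launch a transient EMR cluster that auto-terminates after its steps.
# Instance types, counts, release label and role names are assumptions.

def build_transient_cluster_config(name, steps):
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.13.0",
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.large",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m4.xlarge",
                 "InstanceCount": 2},
            ],
            # Transient: shut the cluster down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Steps": list(steps),
    }

def launch_cluster(config):
    import boto3  # needs AWS credentials; not exercised here
    return boto3.client("emr").run_job_flow(**config)["JobFlowId"]

config = build_transient_cluster_config("hourly-etl-cluster", steps=[])
# cluster_id = launch_cluster(config)
```

Because each run gets its own cluster, several hourly windows can be processed in parallel, matching the "multiple clusters should be able to run together" requirement.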
31. Passing Spark Job steps parameters to Lambda input
• We created a simple JSON with all the parameters required to add a step to the EMR cluster.
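The idea can be sketched as a Lambda handler that receives the step parameters as its JSON event and turns them into an EMR step definition. All field names here are hypothetical; the talk's actual JSON schema is not shown.

```python
# Sketch: a Lambda that maps a JSON event of step parameters to an EMR step.
# The event field names are made up for illustration.

def lambda_handler(event, context):
    step = {
        "Name": event["step_name"],
        "ActionOnFailure": event.get("action_on_failure", "CONTINUE"),
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", event["script_s3_path"],
                     *event.get("script_args", [])],
        },
    }
    # import boto3
    # boto3.client("emr").add_job_flow_steps(
    #     JobFlowId=event["cluster_id"], Steps=[step])
    return step

example_event = {
    "cluster_id": "j-XXXXXXXXXXXXX",
    "step_name": "hourly-etl",
    "script_s3_path": "s3://my-bucket/jobs/etl.py",
    "script_args": ["--hours-back", "1"],
}
```

Keeping the parameters in the event JSON means the same Lambda can serve any scheduled or ad-hoc job without code changes.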
32. Monitoring EMR Steps with Lambda and Datadog
• We created a Lambda to sample all running EMR clusters for failed steps.
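A monitoring Lambda along those lines could be sketched as follows. The `list_clusters`/`list_steps` calls are real EMR APIs; forwarding to Datadog is only indicated in a comment, since it depends on the Datadog client setup.

```python
# Sketch: scan active EMR clusters for failed steps, as a monitoring Lambda
# might. Datadog reporting is left as a comment.

def find_failed_steps(emr):
    """Return (cluster_id, step_name) pairs for failed steps on active clusters."""
    failed = []
    clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]
    for cluster in clusters:
        steps = emr.list_steps(ClusterId=cluster["Id"],
                               StepStates=["FAILED"])["Steps"]
        for step in steps:
            failed.append((cluster["Id"], step["Name"]))
    return failed

def lambda_handler(event, context):
    import boto3  # needs AWS credentials when actually deployed
    failed = find_failed_steps(boto3.client("emr"))
    for cluster_id, step_name in failed:
        # e.g. emit a Datadog event/metric here for alerting
        print(f"FAILED step {step_name!r} on cluster {cluster_id}")
    return {"failed_count": len(failed)}
```

Scheduling this Lambda (e.g. with a CloudWatch Events rule) gives the "fully automated and monitored" property from the requirements.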
35. Amazon Lex
• Conversational interfaces for your applications.
• Powered by the same deep learning technologies as Alexa.
• Amazon Lex provides the advanced deep learning functionalities of
automatic speech recognition (ASR) for converting speech to text,
and natural language understanding (NLU) to recognize the intent of
the text.
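The chatbot's fulfillment side can be sketched as a Lambda handling a Lex (V1-style) intent event. The intent and slot names below are hypothetical; the `dialogAction` response shape follows the Lex V1 Lambda input/output format.

```python
# Sketch: a fulfillment Lambda for a hypothetical "CreateEmrCluster" Lex intent.
# Intent and slot names are made up; the response shape is the Lex V1 format.

def lex_close(message):
    """Build a Lex V1 'Close' response that ends the conversation."""
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": message},
        }
    }

def lambda_handler(event, context):
    intent = event["currentIntent"]["name"]
    if intent == "CreateEmrCluster":
        size = event["currentIntent"]["slots"].get("ClusterSize") or "small"
        # here one would call run_job_flow() to actually launch the cluster
        return lex_close(f"Launching a {size} EMR cluster for you.")
    return lex_close("Sorry, I don't know that request.")

example_event = {
    "currentIntent": {"name": "CreateEmrCluster",
                      "slots": {"ClusterSize": "large"}}
}
```

Wiring this Lambda as the intent's fulfillment hook, with the Lex bot connected to Slack as a channel, closes the loop from a chat message to a running cluster.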
41. We Are Hiring!
Senior Data Scientist
Senior Designer (UI/UX)
Senior Full Stack Developer
Java Developer
Senior Manual QA
Director of Ops
BI Analyst
Data Management Analyst
Customer Success Manager
Senior BI Analyst