SlideShare a Scribd company logo
Automate all your EMR related activities
Eitan Sela - System Architect
eitan.sela@weissbeerger.com
$ whoami
• "Hands-On" system Architect with more than 17 years of
experience with billing, banking, information security (DLP) and
Cloud IoT/Big Data applications.
• Big Data specialist – Hadoop, Spark, Hive and EMR on AWS.
• Work with vast AWS services, and with serverless projects
especially.
• Java development, scalability performance and stabilization
expert.
• Alexa skills developer.
• Love to share my experience in lectures and meetups.
What to expect from this session
• WeissBeerger use case – Aggregating raw orders and IoT data.
• Amazon EMR basics.
• Implementing ETLs with Spark.
• Submitting work to a Cluster.
• Provisioning scheduled transient EMR Clusters for ETLs jobs.
• Our new Slack Chabot for EMR, using Amazon Lex!
WeissBeerger use case – Aggregating raw orders and IoT data
Solution
• WeissBeerger bridges the gap between breweries, bars and
customers.
Benefits for the brewery
• Consumption Analytics.
• Dynamic Promotions.
• Beer Quality.
• Value creation.
• Beer penetration.
Benefits for the bar
• Real Time Consumption Tracking.
• Waste Reduction.
• Sales Growth.
How does it work?
• IoT (Pouring) – Beverage Analytics Hub.
• Point of sales – POS Vendors, via REST API, S3, DB, etc.
Aggregating raw point of sales orders and IoT data
Amazon EMR basics
Amazon EMR - Easily Run and Scale Apache Hadoop, Spark, HBase, Presto,
Hive, and other Big Data Frameworks
Amazon EMR – Create Cluster – Software and Steps
Amazon EMR – Create Cluster – Hardware
Amazon EMR – General Cluster Settings
Amazon EMR – Security
Implementing ETLs with Spark
Problem – Huge queries from MySQL to aggregative tables in Redshift
Solution - Implementing all ETL’s with PySpark
New data pipeline using EMR PySpark jobs – ELT rather than ETL
Submitting work to a Cluster
Launching Applications with spark-submit
./bin/spark-submit 
--jars jar1.jar,jar2.jar 
--py-files path/to/my/pymodule1.py, path/to/my/pymodule2.py
my_program.py arg1 arg2
• The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
• It can use all of Spark’s supported cluster managers through a
uniform interface so you don’t have to configure your application
especially for each one.
EMR Steps - Submit Work to a Cluster
• You can submit work to a cluster by adding steps or by interactively
submitting Hadoop jobs to the master node.
• You can add steps to a cluster using the AWS Management Console,
the AWS CLI, or the Amazon EMR API.
• You can add step during cluster creation or to a running cluster.
EMR Steps – Job Types
• Custom Jar.
• Streaming Program.
• Spark App.
• Hive Program.
EMR Steps - Lifecycle
• Pending
• Cancelled (by user or API request)
• Running
• Completed / Failed.
EMR Steps – Add jobs to a running cluster
EMR Steps – View logs
WeissBeerger’s Spark ETL jobs submitted to EMR Cluster
Provisioning scheduled transient EMR Clusters for ETLs jobs
Requirements
• Run ETL using Spark on EMR cluster every 1 hour for one month
back.
• Input: MySQL or Hive (stg).
• Output: Hive (stg) or Redshift.
• Storage should be separated from the compute, so EMR clusters
should be transient.
• Multiple clusters should be able to run together.
• Fully automated and monitored.
Passing Spark Job steps parameters to Lambda input
• We created a simple json with all parameters required to add step to EMR cluster.
Monitoring EMR Steps with Lambda and Datadog
• We created a Lambda to sample all running EMR clusters for failed steps.
As more developers are developing PySpark Jobs…
Our new Slack Chabot for EMR, using Amazon Lex
Amazon Lex
• Conversational interfaces for your applications.
• Powered by the same deep learning technologies as Alexa.
• Amazon Lex provides the advanced deep learning functionalities of
automatic speech recognition (ASR) for converting speech to text,
and natural language understanding (NLU).
Amazon Lex - Use cases - Call Center Bots
awsbot - the chatbot that help you manage AWS resources
awsbot - Demo
awsbot - Demo - EMR Cluster is ready
Q & A
We Are Hiring!
Senior Data Scientist
Senior Designer (UI/UX)
Senior Full Stack Developer
Java Developer
Senior Manual QA
Director of Ops
BI Analyst
Data Management Analyst
Customer Success Manager
Senior BI Analyst

More Related Content

What's hot

RDB開発者のためのApache Cassandra データモデリング入門
RDB開発者のためのApache Cassandra データモデリング入門RDB開発者のためのApache Cassandra データモデリング入門
RDB開発者のためのApache Cassandra データモデリング入門
Yuki Morishita
 
2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard
Ceph Community
 
Dreaming Infrastructure
Dreaming InfrastructureDreaming Infrastructure
Dreaming Infrastructurekyhpudding
 
The Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systemsThe Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systems
Romain Jacotin
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated ArchitectureImproving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Databricks
 
Pegasus KV Storage, Let the Users focus on their work (2018/07)
Pegasus KV Storage, Let the Users focus on their work (2018/07)Pegasus KV Storage, Let the Users focus on their work (2018/07)
Pegasus KV Storage, Let the Users focus on their work (2018/07)
涛 吴
 
Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)
Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)
Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)
涛 吴
 
CloudStack vs OpenStack
CloudStack vs OpenStackCloudStack vs OpenStack
CloudStack vs OpenStack
Victor Zhang
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
Amazon Web Services
 
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataProblems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Jignesh Shah
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
Amazon Aurora Deep Dive (db tech showcase 2016)
Amazon Aurora Deep Dive (db tech showcase 2016)Amazon Aurora Deep Dive (db tech showcase 2016)
Amazon Aurora Deep Dive (db tech showcase 2016)
Amazon Web Services Japan
 
Aurora
AuroraAurora
Aurora
maruyama097
 
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Amazon Web Services
 
Vectors are the new JSON in PostgreSQL
Vectors are the new JSON in PostgreSQLVectors are the new JSON in PostgreSQL
Vectors are the new JSON in PostgreSQL
Jonathan Katz
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxData
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
Ceph Community
 
PostgreSQL replication
PostgreSQL replicationPostgreSQL replication
PostgreSQL replication
NTT DATA OSS Professional Services
 

What's hot (20)

RDB開発者のためのApache Cassandra データモデリング入門
RDB開発者のためのApache Cassandra データモデリング入門RDB開発者のためのApache Cassandra データモデリング入門
RDB開発者のためのApache Cassandra データモデリング入門
 
2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard
 
Dreaming Infrastructure
Dreaming InfrastructureDreaming Infrastructure
Dreaming Infrastructure
 
The Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systemsThe Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systems
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated ArchitectureImproving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 
Pegasus KV Storage, Let the Users focus on their work (2018/07)
Pegasus KV Storage, Let the Users focus on their work (2018/07)Pegasus KV Storage, Let the Users focus on their work (2018/07)
Pegasus KV Storage, Let the Users focus on their work (2018/07)
 
Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)
Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)
Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)
 
CloudStack vs OpenStack
CloudStack vs OpenStackCloudStack vs OpenStack
CloudStack vs OpenStack
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataProblems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
 
Amazon Aurora Deep Dive (db tech showcase 2016)
Amazon Aurora Deep Dive (db tech showcase 2016)Amazon Aurora Deep Dive (db tech showcase 2016)
Amazon Aurora Deep Dive (db tech showcase 2016)
 
MapReduce入門
MapReduce入門MapReduce入門
MapReduce入門
 
Aurora
AuroraAurora
Aurora
 
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
 
Vectors are the new JSON in PostgreSQL
Vectors are the new JSON in PostgreSQLVectors are the new JSON in PostgreSQL
Vectors are the new JSON in PostgreSQL
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
 
PostgreSQL replication
PostgreSQL replicationPostgreSQL replication
PostgreSQL replication
 

Similar to Automate all your EMR related activities

Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Amazon Web Services
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
Amazon Web Services
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
Sid Anand
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
Helen Rogers
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)
Amazon Web Services
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
Aws re invent 2018 recap
Aws re invent 2018 recapAws re invent 2018 recap
Aws re invent 2018 recap
CloudHesive
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Spark Summit
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Amazon Web Services
 
Cloud & Native Cloud for Managers
Cloud & Native Cloud for ManagersCloud & Native Cloud for Managers
Cloud & Native Cloud for Managers
Eitan Sela
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 
Managing Your Cloud Assets
Managing Your Cloud AssetsManaging Your Cloud Assets
Managing Your Cloud Assets
Amazon Web Services
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
Amazon Web Services
 
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Amazon Web Services
 

Similar to Automate all your EMR related activities (20)

Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Aws re invent 2018 recap
Aws re invent 2018 recapAws re invent 2018 recap
Aws re invent 2018 recap
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
Cloud & Native Cloud for Managers
Cloud & Native Cloud for ManagersCloud & Native Cloud for Managers
Cloud & Native Cloud for Managers
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
Managing Your Cloud Assets
Managing Your Cloud AssetsManaging Your Cloud Assets
Managing Your Cloud Assets
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
 

Recently uploaded

Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Software Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdfSoftware Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdf
MayankTawar1
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
varshanayak241
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
Jelle | Nordend
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
XfilesPro
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
ayushiqss
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
Peter Caitens
 

Recently uploaded (20)

Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Software Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdfSoftware Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdf
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 

Automate all your EMR related activities

  • 1. Automate all your EMR related activities Eitan Sela - System Architect eitan.sela@weissbeerger.com
  • 2. $ whoami • "Hands-On" system Architect with more than 17 years of experience with billing, banking, information security (DLP) and Cloud IoT/Big Data applications. • Big Data specialist – Hadoop, Spark, Hive and EMR on AWS. • Work with vast AWS services, and with serverless projects especially. • Java development, scalability performance and stabilization expert. • Alexa skills developer. • Love to share my experience in lectures and meetups.
  • 3. What to expect from this session • WeissBeerger use case – Aggregating raw orders and IoT data. • Amazon EMR basics. • Implementing ETLs with Spark. • Submitting work to a Cluster. • Provisioning scheduled transient EMR Clusters for ETLs jobs. • Our new Slack Chabot for EMR, using Amazon Lex!
  • 4. WeissBeerger use case – Aggregating raw orders and IoT data
  • 5. Solution • WeissBeerger bridges the gap between breweries, bars and customers.
  • 6. Benefits for the brewery • Consumption Analytics. • Dynamic Promotions. • Beer Quality. • Value creation. • Beer penetration.
  • 7. Benefits for the bar • Real Time Consumption Tracking. • Waste Reduction. • Sales Growth.
  • 8. How does it work? • IoT (Pouring) – Beverage Analytics Hub. • Point of sales – POS Vendors, via REST API, S3, DB, etc.
  • 9. Aggregating raw point of sales orders and IoT data
  • 11. Amazon EMR - Easily Run and Scale Apache Hadoop, Spark, HBase, Presto, Hive, and other Big Data Frameworks
  • 12. Amazon EMR – Create Cluster – Software and Steps
  • 13. Amazon EMR – Create Cluster – Hardware
  • 14. Amazon EMR – General Cluster Settings
  • 15. Amazon EMR – Security
  • 17. Problem – Huge queries from MySQL to aggregative tables in Redshift
  • 18. Solution - Implementing all ETL’s with PySpark
  • 19. New data pipeline using EMR PySpark jobs – ELT rather than ETL
  • 20. Submitting work to a Cluster
  • 21. Launching Applications with spark-submit ./bin/spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/pymodule1.py, path/to/my/pymodule2.py my_program.py arg1 arg2 • The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. • It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application especially for each one.
  • 22. EMR Steps - Submit Work to a Cluster • You can submit work to a cluster by adding steps or by interactively submitting Hadoop jobs to the master node. • You can add steps to a cluster using the AWS Management Console, the AWS CLI, or the Amazon EMR API. • You can add step during cluster creation or to a running cluster.
  • 23. EMR Steps – Job Types • Custom Jar. • Streaming Program. • Spark App. • Hive Program.
  • 24. EMR Steps - Lifecycle • Pending • Cancelled (by user or API request) • Running • Completed / Failed.
  • 25. EMR Steps – Add jobs to a running cluster
  • 26. EMR Steps – View logs
  • 27. WeissBeerger’s Spark ETL jobs submitted to EMR Cluster
  • 28. Provisioning scheduled transient EMR Clusters for ETLs jobs
  • 29. Requirements • Run ETL using Spark on EMR cluster every 1 hour for one month back. • Input: MySQL or Hive (stg). • Output: Hive (stg) or Redshift. • Storage should be separated from the compute, so EMR clusters should be transient. • Multiple clusters should be able to run together. • Fully automated and monitored.
  • 30.
  • 31. Passing Spark Job steps parameters to Lambda input • We created a simple json with all parameters required to add step to EMR cluster.
  • 32. Monitoring EMR Steps with Lambda and Datadog • We created a Lambda to sample all running EMR clusters for failed steps.
  • 33. As more developers are developing PySpark Jobs…
  • 34. Our new Slack Chabot for EMR, using Amazon Lex
  • 35. Amazon Lex • Conversational interfaces for your applications. • Powered by the same deep learning technologies as Alexa. • Amazon Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU).
  • 36. Amazon Lex - Use cases - Call Center Bots
  • 37. awsbot - the chatbot that help you manage AWS resources
  • 39. awsbot - Demo - EMR Cluster is ready
  • 40. Q & A
  • 41. We Are Hiring! Senior Data Scientist Senior Designer (UI/UX) Senior Full Stack Developer Java Developer Senior Manual QA Director of Ops BI Analyst Data Management Analyst Customer Success Manager Senior BI Analyst