SlideShare a Scribd company logo
1 of 31
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Keith Steward, Ph.D.,
Specialist (EMR) Solutions Architect
September 29, 2016
Building Real-Time Processing
Data Analytics Applications on
AWS
Build powerful applications easily
Agenda
1. Introduction to Amazon EMR and Kinesis
2. Real Time Processing Application Demo
3. Best Practices
Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Elastic
Easily add or remove capacity
Reliable
Spend less time monitoring
Secure
Manage firewalls
Flexible
Customize the cluster
Storage
S3 (EMRFS), HDFS
YARN
Cluster Resource Management
Batch
MapReduce
Interactive
Tez
In Memory
Spark
Applications
Hive, Pig, Spark SQL/Streaming/ML, Mahout,
Sqoop
HBase/Phoenix
Presto
Hue (SQL Interface/Metastore
Management)
Zeppelin (Interactive Notebook)
Ganglia (Monitoring)
HiveServer2/Spark Thriftserver
(JDBC/ODBC)
Amazon EMR service
Options to submit jobs to EMR
Amazon EMR
Step API
Submit a Spark
application
Amazon EMR
Airflow, Luigi, or other
schedulers on EC2
Create pipeline
to schedule job
submission or create
complex workflows
Use AWS Lambda to
submit applications to
EMR Step API or directly
to Spark on your cluster
AWS
Lambda
AWS Data
Pipeline
Many storage layers to choose from
Amazon DynamoDB
Amazon RDS Amazon Kinesis
Amazon Redshift
EMR File
System
(EMRFS)
Amazon S3
Amazon EMR
EMR 5.0 - Applications
Apache Spark 2.0
Quick introduction to Spark
join
filter
groupBy
Stage 3
Stage 1
Stage 2
A: B:
C: D: E:
F:
= cached partition= RDD
map
• Massively parallel
• Uses DAGs instead of map-
reduce for execution
• Minimizes I/O by storing data
in DataFrames in memory
• Partitioning-aware to avoid
network-intensive shuffle
Spark components to match your use case
Spark 2.0 – Performance Enhancements
• Second generation Tungsten engine
• Whole-stage code generation to create optimized
bytecode at runtime
• Improvements to Catalyst optimizer for query
performance
• New vectorized Parquet decoder to increase throughput
Datasets & DataFrames (Spark 2.0)
Datasets
• Distributed collection of data
• Strong typing, can use Lambda
functions
• Object-oriented operations (similar to
RDD API)
• Optimized encoders for better
performance; minimize serialization
/deserialization overhead
• Compile-time type safety for robust
applications
DataFrames
• Dataset organized into named
columns
• Represented as a Dataset of rows
Spark SQL (Spark 2.0)
• SparkSession – replaces the old SQLContext and
HiveContext
• Seamlessly mix SQL with Spark programs
• ANSI SQL Parser and subquery support
• HiveQL compatibility and can directly use tables in
Hive metastore
• Connect through JDBC / ODBC using the Spark Thrift
server
Spark 2.0 – Structured Streaming
• Structured Streaming API: extension to the
DataFrame/Dataset API (instead of DStream)
• SparkSession is the new entry point for streaming
• Better merges processing on static and streaming
datasets, abstracting the velocity of the data
Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• YARN dynamically adjusts executors based on resource
needs of Spark application
• Spark uses the executor memory and executor cores
settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default,
and calculates the default executor size to use based on
the instance family of your Core Group
Try different configurations to find your optimal architecture.
CPU
c3 family
c4 family
cc1.4xlarge
cc2.8xlarge
Memory
m2 family
r3 family
Disk/IO
d2 family
i2 family
General
m4 family
m3 family
m1 family
Choose your instance types
Batch Machine Spark and Large
process learning interactive HDFS
Zeppelin 0.6.1 – New Features
• Shiro Authentication
• Notebook Authorization
Configurations to save notebook in S3
zeppelin-env.sh:
export ZEPPELIN_NOTEBOOK_S3_BUCKET = bucket_name
export ZEPPELIN_NOTEBOOK_S3_USER = username
(optional) export ZEPPELIN_NOTEBOOK_S3_KMS_KEY_ID = kms-key-id
Amazon Kinesis
Streams
Build your own custom
applications that
process or analyze
streaming data
Amazon Kinesis
Firehose
Easily load massive
volumes of streaming
data into Amazon S3
and Redshift
Amazon Kinesis
Analytics
Easily analyze data
streams using
standard SQL queries
Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Amazon Kinesis Concepts
Streams: • Ordered sequence of data records
Data Records: • Unit of data stored in a Kinesis stream
Producers: • Anything that puts records into a Kinesis stream
• Kinesis Producer Library (KPL)
Consumers: • End-points that get records from a Kinesis stream &
processes them
• Kinesis Consumer Library (KCL)
Amazon Kinesis Streams features…
• Kinesis Client Library in Python, Node.JS,
Ruby…
• PutRecords API, up to 500 records or 5 MB of payload
• Kinesis Producer Library to simplify producer development
• Server-Side Timestamps
• individual max record payload up to 1 MB
• Low end-to-end propagation delay
• Stream Retention defaults to 24 hrs, but can increase to 7 days
Real Time Processing
Application
and Demo
Real Time Processing Application
+
Kinesis Producer
(KCL)
Amazon Kinesis
Apache Zeppelin
+
Produce Collect Process Analyze
Amazon
EMR
Real Time Processing Application – 5 Steps
http://bit.ly/realtime-aws
1. Create Kinesis Stream
2. Create Amazon EMR Cluster
3. Start Kinesis Producer
4. Ad-hoc Analytics
• Analyze data using Apache Zeppelin on
Amazon EMR with Spark Streaming
5. Continuous Streaming
• Spark Streaming application submitted to
Amazon EMR cluster using Step API
Amazon
Kinesis
Amazon
EMR
Amazon
EC2
(Kinesis
Producer)
(Kinesis
Consumer)
Real Time Processing Application
1. Create Kinesis Stream:
$ aws kinesis create-stream --stream-name spark-demo --shard-count 2
Real Time Processing Application
2. Create Amazon EMR Cluster with Spark and Zeppelin:
$ aws emr create-cluster --release-label emr-5.0.0 
--applications Name=Zeppelin Name=Spark Name=Hadoop 
--enable-debugging 
--ec2-attributes KeyName=test-key-1,AvailabilityZone=us-east-1d 
--log-uri s3://kinesis-spark-streaming1/logs 
--instance-groups 
Name=Master,InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 
Name=Core,InstanceGroupType=CORE,InstanceType=m3.xlarge,InstanceCount=2 
Name=Task,InstanceGroupType=TASK,InstanceType=m3.xlarge,InstanceCount=2 
--name "kinesis-processor"
Real Time Processing Application
3. Start Kinesis Producer (Using Kinesis Producer Library)
a. On a host (e.g. EC2), download JAR and run Kinesis Producer:
$ wget https://s3.amazonaws.com/chayel-emr/KinesisProducer.jar
$ java –jar KinesisProducer.jar
b. Continuous stream of data records will be fed into Kinesis in CSV format:
… device_id,temperature,timestamp …
Git link: https://github.com/manjeetchayel/emr-kpl-demo
Real Time Processing Application
Use Zeppelin on Amazon
EMR:
• Configure Spark interpreter to use
org.apache.spark:spark-streaming-
kinesis-asl_2.11:2.0.0 dependency
• Import notebook from
https://raw.githubusercontent.com/m
anjeetchayel/aws-big-data-
blog/master/aws-blog-realtime-
analytics-using-
zeppelin/Spark_Streaming.json
• Run spark code blocks to generate
real-time analytics.
Best Practices
Best Practices - Kinesis
• Use Kinesis Producer Library
• Increase capacity utilization with Aggregation
• Setup CloudWatch alarms
• Randomized Partitions key to distribute puts across
Shards
• Deal with unsuccessful records
• Use PutRecords for higher throughput!
• Kinesis Scaling Utility for scaling Streams
Best Practices – Amazon EMR + Spark Streaming
• Use Step API to submit long running applications
• Run Spark application in cluster mode
• Use maxResourceAllocation
• Unqiue Spark Application name for KCL DynamoDB
metadata
• Setup CloudWatch alarms
• Use IAM roles
Thank you!
stewardk@amazon.com
http://aws.amazon.com/emr
http://blogs.aws.amazon.com/bigdata

More Related Content

Viewers also liked

Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaAmazon Web Services
 
(SDD415) NEW LAUNCH: Amazon Aurora: Amazon’s New Relational Database Engine |...
(SDD415) NEW LAUNCH: Amazon Aurora: Amazon’s New Relational Database Engine |...(SDD415) NEW LAUNCH: Amazon Aurora: Amazon’s New Relational Database Engine |...
(SDD415) NEW LAUNCH: Amazon Aurora: Amazon’s New Relational Database Engine |...Amazon Web Services
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
AWS re:Invent 2016: Simplify Cloud Migration with AWS Server Migration Servic...
AWS re:Invent 2016: Simplify Cloud Migration with AWS Server Migration Servic...AWS re:Invent 2016: Simplify Cloud Migration with AWS Server Migration Servic...
AWS re:Invent 2016: Simplify Cloud Migration with AWS Server Migration Servic...Amazon Web Services
 
AWS re:Invent 2016: Simplifying Microsoft Architectures with AWS services (WI...
AWS re:Invent 2016: Simplifying Microsoft Architectures with AWS services (WI...AWS re:Invent 2016: Simplifying Microsoft Architectures with AWS services (WI...
AWS re:Invent 2016: Simplifying Microsoft Architectures with AWS services (WI...Amazon Web Services
 
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar SeriesDeep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar SeriesAmazon Web Services
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)Amazon Web Services
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Amazon Web Services
 
AWS January 2016 Webinar Series - Managing your Infrastructure as Code
AWS January 2016 Webinar Series - Managing your Infrastructure as CodeAWS January 2016 Webinar Series - Managing your Infrastructure as Code
AWS January 2016 Webinar Series - Managing your Infrastructure as CodeAmazon Web Services
 
Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkRussell Jurney
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 
Real Time Analytics On AWS: Optimized Architectures
Real Time Analytics On AWS: Optimized ArchitecturesReal Time Analytics On AWS: Optimized Architectures
Real Time Analytics On AWS: Optimized ArchitecturesAmazon Web Services
 
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksDeep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksAmazon Web Services
 
AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis...
AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis...AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis...
AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis...Amazon Web Services
 

Viewers also liked (18)

Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS Lambda
 
(SDD415) NEW LAUNCH: Amazon Aurora: Amazon’s New Relational Database Engine |...
(SDD415) NEW LAUNCH: Amazon Aurora: Amazon’s New Relational Database Engine |...(SDD415) NEW LAUNCH: Amazon Aurora: Amazon’s New Relational Database Engine |...
(SDD415) NEW LAUNCH: Amazon Aurora: Amazon’s New Relational Database Engine |...
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
AWS re:Invent 2016: Simplify Cloud Migration with AWS Server Migration Servic...
AWS re:Invent 2016: Simplify Cloud Migration with AWS Server Migration Servic...AWS re:Invent 2016: Simplify Cloud Migration with AWS Server Migration Servic...
AWS re:Invent 2016: Simplify Cloud Migration with AWS Server Migration Servic...
 
AWS Real-Time Event Processing
AWS Real-Time Event ProcessingAWS Real-Time Event Processing
AWS Real-Time Event Processing
 
AWS re:Invent 2016: Simplifying Microsoft Architectures with AWS services (WI...
AWS re:Invent 2016: Simplifying Microsoft Architectures with AWS services (WI...AWS re:Invent 2016: Simplifying Microsoft Architectures with AWS services (WI...
AWS re:Invent 2016: Simplifying Microsoft Architectures with AWS services (WI...
 
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar SeriesDeep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
AWS Lambdaを紐解く
AWS Lambdaを紐解くAWS Lambdaを紐解く
AWS Lambdaを紐解く
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS
 
AWS January 2016 Webinar Series - Managing your Infrastructure as Code
AWS January 2016 Webinar Series - Managing your Infrastructure as CodeAWS January 2016 Webinar Series - Managing your Infrastructure as Code
AWS January 2016 Webinar Series - Managing your Infrastructure as Code
 
Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySpark
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Real Time Analytics On AWS: Optimized Architectures
Real Time Analytics On AWS: Optimized ArchitecturesReal Time Analytics On AWS: Optimized Architectures
Real Time Analytics On AWS: Optimized Architectures
 
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksDeep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
 
AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis...
AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis...AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis...
AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis...
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekCzechDreamin
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomCzechDreamin
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeCzechDreamin
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsUXDXConf
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101vincent683379
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka DoktorováCzechDreamin
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024Stephanie Beckett
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfFIDO Alliance
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...FIDO Alliance
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...CzechDreamin
 

Recently uploaded (20)

AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 

Building Real-Time Data Analytics Applications on AWS - September 2016 Webinar Series

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Keith Steward, Ph.D., Specialist (EMR) Solutions Architect September 29, 2016 Building Real-Time Processing Data Analytics Applications on AWS Build powerful applications easily
  • 2. Agenda 1. Introduction to Amazon EMR and Kinesis 2. Real Time Processing Application Demo 3. Best Practices
  • 3. Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Elastic Easily add or remove capacity Reliable Spend less time monitoring Secure Manage firewalls Flexible Customize the cluster
  • 4. Storage S3 (EMRFS), HDFS YARN Cluster Resource Management Batch MapReduce Interactive Tez In Memory Spark Applications Hive, Pig, Spark SQL/Streaming/ML, Mahout, Sqoop HBase/Phoenix Presto Hue (SQL Interface/Metastore Management) Zeppelin (Interactive Notebook) Ganglia (Monitoring) HiveServer2/Spark Thriftserver (JDBC/ODBC) Amazon EMR service
  • 5. Options to submit jobs to EMR Amazon EMR Step API Submit a Spark application Amazon EMR Airflow, Luigi, or other schedulers on EC2 Create pipeline to schedule job submission or create complex workflows Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster AWS Lambda AWS Data Pipeline
  • 6. Many storage layers to choose from Amazon DynamoDB Amazon RDS Amazon Kinesis Amazon Redshift EMR File System (EMRFS) Amazon S3 Amazon EMR
  • 7. EMR 5.0 - Applications
  • 9. Quick introduction to Spark join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: = cached partition= RDD map • Massively parallel • Uses DAGs instead of map- reduce for execution • Minimizes I/O by storing data in DataFrames in memory • Partitioning-aware to avoid network-intensive shuffle
  • 10. Spark components to match your use case
  • 11. Spark 2.0 – Performance Enhancements • Second generation Tungsten engine • Whole-stage code generation to create optimized bytecode at runtime • Improvements to Catalyst optimizer for query performance • New vectorized Parquet decoder to increase throughput
  • 12. Datasets & DataFrames (Spark 2.0) Datasets • Distributed collection of data • Strong typing, can use Lambda functions • Object-oriented operations (similar to RDD API) • Optimized encoders for better performance; minimize serialization /deserialization overhead • Compile-time type safety for robust applications DataFrames • Dataset organized into named columns • Represented as a Dataset of rows
  • 13. Spark SQL (Spark 2.0) • SparkSession – replaces the old SQLContext and HiveContext • Seamlessly mix SQL with Spark programs • ANSI SQL Parser and subquery support • HiveQL compatibility and can directly use tables in Hive metastore • Connect through JDBC / ODBC using the Spark Thrift server
  • 14. Spark 2.0 – Structured Streaming • Structured Streaming API: extension to the DataFrame/Dataset API (instead of DStream) • SparkSession is the new entry point for streaming • Better merges processing on static and streaming datasets, abstracting the velocity of the data
  • 15. Configuring Executors – Dynamic Allocation • Optimal resource utilization • YARN dynamically adjusts executors based on resource needs of Spark application • Spark uses the executor memory and executor cores settings in the configuration for each executor • Amazon EMR uses dynamic allocation by default, and calculates the default executor size to use based on the instance family of your Core Group
  • 16. Try different configurations to find your optimal architecture. CPU c3 family c4 family cc1.4xlarge cc2.8xlarge Memory m2 family r3 family Disk/IO d2 family i2 family General m4 family m3 family m1 family Choose your instance types Batch Machine Spark and Large process learning interactive HDFS
  • 17. Zeppelin 0.6.1 – New Features • Shiro Authentication • Notebook Authorization Configurations to save notebook in S3 zeppelin-env.sh: export ZEPPELIN_NOTEBOOK_S3_BUCKET = bucket_name export ZEPPELIN_NOTEBOOK_S3_USER = username (optional) export ZEPPELIN_NOTEBOOK_S3_KMS_KEY_ID = kms-key-id
  • 18. Amazon Kinesis Streams Build your own custom applications that process or analyze streaming data Amazon Kinesis Firehose Easily load massive volumes of streaming data into Amazon S3 and Redshift Amazon Kinesis Analytics Easily analyze data streams using standard SQL queries Amazon Kinesis: Streaming data made easy Services make it easy to capture, deliver, and process streams on AWS
  • 19. Amazon Kinesis Concepts Streams: • Ordered sequence of data records Data Records: • Unit of data stored in a Kinesis stream Producers: • Anything that puts records into a Kinesis stream • Kinesis Producer Library (KPL) Consumers: • End-points that get records from a Kinesis stream & processes them • Kinesis Consumer Library (KCL)
  • 20. Amazon Kinesis Streams features… • Kinesis Client Library in Python, Node.JS, Ruby… • PutRecords API, up to 500 records or 5 MB of payload • Kinesis Producer Library to simplify producer development • Server-Side Timestamps • individual max record payload up to 1 MB • Low end-to-end propagation delay • Stream Retention defaults to 24 hrs, but can increase to 7 days
  • 22. Real Time Processing Application + Kinesis Producer (KCL) Amazon Kinesis Apache Zeppelin + Produce Collect Process Analyze Amazon EMR
  • 23. Real Time Processing Application – 5 Steps http://bit.ly/realtime-aws 1. Create Kinesis Stream 2. Create Amazon EMR Cluster 3. Start Kinesis Producer 4. Ad-hoc Analytics • Analyze data using Apache Zeppelin on Amazon EMR with Spark Streaming 5. Continuous Streaming • Spark Streaming application submitted to Amazon EMR cluster using Step API Amazon Kinesis Amazon EMR Amazon EC2 (Kinesis Producer) (Kinesis Consumer)
  • 24. Real Time Processing Application 1. Create Kinesis Stream: $ aws kinesis create-stream --stream-name spark-demo --shard-count 2
  • 25. Real Time Processing Application 2. Create Amazon EMR Cluster with Spark and Zeppelin: $ aws emr create-cluster --release-label emr-5.0.0 --applications Name=Zeppelin Name=Spark Name=Hadoop --enable-debugging --ec2-attributes KeyName=test-key-1,AvailabilityZone=us-east-1d --log-uri s3://kinesis-spark-streaming1/logs --instance-groups Name=Master,InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 Name=Core,InstanceGroupType=CORE,InstanceType=m3.xlarge,InstanceCount=2 Name=Task,InstanceGroupType=TASK,InstanceType=m3.xlarge,InstanceCount=2 --name "kinesis-processor"
  • 26. Real Time Processing Application 3. Start Kinesis Producer (Using Kinesis Producer Library) a. On a host (e.g. EC2), download JAR and run Kinesis Producer: $ wget https://s3.amazonaws.com/chayel-emr/KinesisProducer.jar $ java –jar KinesisProducer.jar b. Continuous stream of data records will be fed into Kinesis in CSV format: … device_id,temperature,timestamp … Git link: https://github.com/manjeetchayel/emr-kpl-demo
  • 27. Real Time Processing Application Use Zeppelin on Amazon EMR: • Configure Spark interpreter to use org.apache.spark:spark-streaming- kinesis-asl_2.11:2.0.0 dependency • Import notebook from https://raw.githubusercontent.com/m anjeetchayel/aws-big-data- blog/master/aws-blog-realtime- analytics-using- zeppelin/Spark_Streaming.json • Run spark code blocks to generate real-time analytics.
  • 29. Best Practices - Kinesis • Use Kinesis Producer Library • Increase capacity utilization with Aggregation • Setup CloudWatch alarms • Randomized Partitions key to distribute puts across Shards • Deal with unsuccessful records • Use PutRecords for higher throughput! • Kinesis Scaling Utility for scaling Streams
  • 30. Best Practices – Amazon EMR + Spark Streaming • Use Step API to submit long running applications • Run Spark application in cluster mode • Use maxResourceAllocation • Unqiue Spark Application name for KCL DynamoDB metadata • Setup CloudWatch alarms • Use IAM roles