© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Sr. Product Manager – Amazon EMR
Manjeet Chayel, Specialist Solutions Architect
October 2015
BDT309
Data Science & Best Practices for
Apache Spark on Amazon EMR
What to Expect from the Session
• Data science with Apache Spark
• Running Spark on Amazon EMR
• Customer use cases and architectures
• Best practices for running Spark
• Demo: Using Apache Zeppelin to analyze US domestic
flights dataset
Spark is fast
[Figure: example DAG of RDDs A to F executed in three stages (Stage 1 to Stage 3) using map, filter, groupBy, and join; legend: RDD, cached partition]
• Massively parallel
• Uses DAGs instead of MapReduce for execution
• Minimizes I/O by storing data
in RDDs in memory
• Partitioning-aware to avoid
network-intensive shuffle
Spark components to match your use case
Spark speaks your language
And more!
Apache Zeppelin notebook to develop queries
Now available on
Amazon EMR 4.1.0!
Use DataFrames to easily interact with data
• Distributed
collection of data
organized in
columns
• An extension of the
existing RDD API
• Optimized for query
execution
Easily create DataFrames from many formats
RDD
Load data with the Spark SQL Data Sources API
Additional libraries at spark-packages.org
Sample DataFrame manipulations
Use DataFrames for machine learning
• Spark ML libraries
(replacing MLlib) use
DataFrames as
input/output for
models
• Create ML pipelines
with a variety of
distributed algorithms
Create DataFrames on streaming data
• Access data in Spark Streaming DStream
• Create SQLContext on the SparkContext used for Spark
Streaming application for ad hoc queries
• Incorporate DataFrame in Spark Streaming application
Use R to interact with DataFrames
• SparkR package for using R to manipulate DataFrames
• Create SparkR applications or interactively use the SparkR
shell (no Zeppelin support yet - ZEPPELIN-156)
• Comparable performance to Python and Scala
DataFrames
Spark SQL
• Seamlessly mix SQL with Spark programs
• Uniform data access
• Hive compatibility – run Hive queries without
modifications using HiveContext
• Connect through JDBC/ODBC
Running Spark on
Amazon EMR
Focus on deriving insights from your data
instead of manually configuring clusters
Easy to install and
configure Spark
Secured
Spark submit or use
Zeppelin UI
Quickly add
and remove capacity
Hourly, reserved, or
EC2 Spot pricing
Use S3 to decouple
compute and storage
Launch the latest Spark version
July 15 – Spark 1.4.1 GA release
July 24 – Spark 1.4.1 available on Amazon EMR
September 9 – Spark 1.5.0 GA release
September 30 – Spark 1.5.0 available on Amazon EMR
< 3 week cadence with latest open source release
Amazon EMR runs Spark on YARN
• Dynamically share and centrally configure
the same pool of cluster resources across
engines
• Schedulers for categorizing, isolating, and
prioritizing workloads
• Choose the number of executors to use, or
allow YARN to choose (dynamic allocation)
• Kerberos authentication
Storage: S3, HDFS
Cluster resource management: YARN
Batch: MapReduce
In memory: Spark
Applications: Pig, Hive, Cascading, Spark Streaming, Spark SQL
Create a fully configured cluster in minutes
AWS Management
Console
AWS Command Line
Interface (CLI)
Or use an AWS SDK directly with the Amazon EMR API
Or easily change your settings
Many storage layers to choose from
• Amazon DynamoDB: EMR-DynamoDB connector
• Amazon RDS: JDBC Data Source w/ Spark SQL
• Amazon Kinesis: streaming data connectors
• Elasticsearch: Elasticsearch connector
• Amazon Redshift: Amazon Redshift COPY from HDFS
• Amazon S3: EMR File System (EMRFS)
Decouple compute and storage by using S3
as your data layer
HDFS
S3 is designed for 11
9’s of durability and is
massively scalable
[Figure labels: EC2 instance memory, Amazon S3, Amazon EMR (three clusters)]
Easy to run your Spark workloads
Amazon EMR Step API
SSH to master node
(Spark Shell)
Submit a Spark
application
Amazon EMR
Secure Spark clusters – encryption at rest
On-Cluster
HDFS transparent encryption (AES 256)
[new on release emr-4.1.0]
Local disk encryption for temporary files
using LUKS encryption via bootstrap action
Amazon S3
EMRFS support for Amazon S3 client-side
and server-side encryption (AES 256)
Secure Spark clusters – encryption in flight
Internode communication on-cluster
Blocks are encrypted in-transit in HDFS
when using transparent encryption
Spark’s Broadcast and FileServer services
can use SSL. BlockTransferService (for
shuffle) can’t use SSL (SPARK-5682).
Amazon S3
S3 to Amazon EMR cluster
Secure communication with SSL
Objects encrypted over the wire if using
client-side encryption
Secure Spark clusters – additional features
Permissions:
• Cluster level: IAM roles for the Amazon
EMR service and the cluster
• Application level: Kerberos (Spark on
YARN only)
• Amazon EMR service level: IAM users
Access: VPC, security groups
Auditing: AWS CloudTrail
Customer use cases
Some of our customers running Spark on Amazon EMR
Best Practices for
Spark on Amazon EMR
• Using correct instance
• Understanding Executors
• Sizing your executors
• Dynamic allocation on YARN
• Understanding storage layers
• File formats and compression
• Boost your performance
• Data serialization
• Avoiding shuffle
• Managing partitions
• RDD Persistence
• Using Zeppelin notebook
What does Spark need?
• Memory – lots of it!!
• Network
• CPU
• Horizontal Scaling
Workflow           Resource
Machine learning   CPU
ETL                I/O
Instance
Try different configurations to find your optimal architecture.
Choose your instance types
Category   Instance types
CPU        c1 family, c3 family, cc1.4xlarge, cc2.8xlarge
Memory     m2 family, r3 family, cr1.8xlarge
Disk/IO    d2 family, i2 family
General    m1 family, m3 family
Typical workloads: batch processing, machine learning, interactive analysis, large HDFS
How Spark executes
• Spark Driver
• Executor
Spark Executor on YARN
This is where all the action happens
- How to select number of executors?
- How many cores per executor?
- Example
Sample Amazon EMR cluster
Model        vCPU   Mem (GB)   SSD storage (GB)   Networking
r3.large     2      15.25      1 x 32             Moderate
r3.xlarge    4      30.5       1 x 80             Moderate
r3.2xlarge   8      61         1 x 160            High
r3.4xlarge   16     122        1 x 320            High
r3.8xlarge   32     244        2 x 320            10 Gigabit
Create Amazon EMR cluster
$ aws emr create-cluster --name "Spark cluster" \
    --release-label emr-4.1.0 \
    --applications Name=Hive Name=Spark \
    --use-default-roles \
    --ec2-attributes KeyName=myKey --instance-type r3.4xlarge \
    --instance-count 6 \
    --no-auto-terminate
Selecting number of executor cores:
• Leave 1 core for OS and other activities
• 4-5 cores per executor gives good performance
• Each executor can run up to 4-5 tasks
• i.e. 4-5 threads for read/write operations to HDFS
Inside Spark Executor on YARN
Selecting number of executors:
--num-executors or spark.executor.instances
• $\text{Number of executors per node} = \frac{\text{Number of cores on node} - 1\ (\text{for OS})}{\text{Number of tasks per executor}}$
• $\frac{16 - 1}{5} = 3\ \text{executors per node}$
Model        vCPU   Mem (GB)   SSD storage (GB)   Networking
r3.4xlarge   16     122        1 x 320            High
• 6 instances
• $\text{num of executors} = 3 \times 6 - 1 = \mathbf{17}$ (one executor slot is left free, e.g. for the YARN application master)
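The executor-count arithmetic above can be sketched as a small helper. This is an illustrative sketch only: the function names `executors_per_node` and `total_executors` are hypothetical, and the single reserved core and single reserved executor slot are the assumptions used on the slides.

```python
def executors_per_node(vcpus: int, cores_per_executor: int, reserved_cores: int = 1) -> int:
    """Whole executors that fit on one node after reserving cores for the OS."""
    return (vcpus - reserved_cores) // cores_per_executor


def total_executors(nodes: int, per_node: int, reserved_slots: int = 1) -> int:
    """Cluster-wide executor count, leaving `reserved_slots` executor slots free."""
    return nodes * per_node - reserved_slots


# Values from the sample cluster: 6 x r3.4xlarge (16 vCPU), 5 cores per executor.
per_node = executors_per_node(vcpus=16, cores_per_executor=5)
print(per_node)                                      # 3
print(total_executors(nodes=6, per_node=per_node))   # 17
```

Integer division is used so that a node never gets a partial executor.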
Max container size on node
YARN container: controls the maximum total memory that containers can use on the node
yarn.nodemanager.resource.memory-mb
Default: 116 G
Config file: yarn-site.xml
Executor container: the space where the Spark executor runs
Executor memory overhead: off-heap memory (VM overheads, interned strings, etc.)
$\texttt{spark.yarn.executor.memoryOverhead} = \text{executorMemory} \times 0.10$
Config file: spark-defaults.conf
Spark executor memory: amount of memory to use per executor process
spark.executor.memory
Config file: spark-defaults.conf
Shuffle memory fraction: fraction of Java heap to use for aggregation and cogroups during shuffles
spark.shuffle.memoryFraction
Default: 0.2
Storage memory fraction: fraction of Java heap to use for Spark's memory cache
spark.storage.memoryFraction
Default: 0.6
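To make the two fractions concrete, here is a rough sketch of how an executor heap divides under the default settings. The helper name `heap_split` is hypothetical, the 34 GB heap is the example figure used in this deck, and the sketch ignores the additional safety margins Spark applies on top of these fractions.

```python
def heap_split(heap_gb: float, shuffle_fraction: float = 0.2, storage_fraction: float = 0.6) -> dict:
    """Nominal split of an executor heap into shuffle, storage, and remaining space."""
    shuffle = heap_gb * shuffle_fraction
    storage = heap_gb * storage_fraction
    return {
        "shuffle": shuffle,                      # spark.shuffle.memoryFraction share
        "storage": storage,                      # spark.storage.memoryFraction share
        "other": heap_gb - shuffle - storage,    # left for user code and objects
    }


split = heap_split(34)
for region, gb in split.items():
    print(f"{region}: {gb:.1f} GB")
```

With the defaults, roughly a fifth of the heap is nominally left for everything other than shuffle buffers and cached data.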
--executor-memory or spark.executor.memory
$\text{Executor memory} = \frac{\text{Max container size}}{\text{Number of executors per node}}$
$\text{Executor memory} = \frac{116\ \text{G}}{3} \approx 38\ \text{G}$
$\text{Memory overhead} = 38 \times 0.10 = 3.8\ \text{G}$
$\text{Executor memory} = 38 - 3.8 \approx 34\ \text{GB}$
Config file: spark-defaults.conf
Optimal setting:
--num-executors 17
--executor-cores 5
--executor-memory 34G
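The memory steps that lead to these flags can be reproduced with a short sketch. The helper name `size_executor_memory` is hypothetical; rounding down to whole gigabytes at each step is an assumption chosen to mirror the rounding on the slides.

```python
import math


def size_executor_memory(max_container_gb: float, executors_per_node: int,
                         overhead_fraction: float = 0.10):
    """Return (per-executor share, overhead, heap) in GB, rounding down like the slides."""
    share_gb = math.floor(max_container_gb / executors_per_node)   # 116 / 3 -> 38
    overhead_gb = share_gb * overhead_fraction                     # about 3.8
    heap_gb = math.floor(share_gb - overhead_gb)                   # 38 - 3.8 -> 34
    return share_gb, overhead_gb, heap_gb


share, overhead, heap = size_executor_memory(max_container_gb=116, executors_per_node=3)
flags = f"--num-executors 17 --executor-cores 5 --executor-memory {heap}G"
print(flags)
```

Treat the result as a starting point and adjust it for the actual instance type and workload.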
Dynamic Allocation on YARN
… allows your Spark applications to scale up based on
demand and scale down when not required.
Remove Idle executors,
Request more on demand
Scaling up on executors
- Request when you want the job to complete faster
- Idle resources on cluster
- Exponential increase in executors over time
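The "exponential increase" can be illustrated with a toy simulation. This is a sketch, not Spark's implementation: the function name `executor_ramp_up` is hypothetical, and it assumes each request round asks for twice as many additional executors as the previous round (1, 2, 4, 8, ...) until the configured maximum is reached.

```python
def executor_ramp_up(start: int, max_executors: int) -> list:
    """Executor targets after each request round while tasks stay backlogged."""
    targets = []
    target = start
    to_add = 1
    while target < max_executors:
        target = min(target + to_add, max_executors)  # never exceed the cap
        targets.append(target)
        to_add *= 2                                   # request doubles each round
    return targets


# Starting from 5 executors with a cap of 17.
print(executor_ramp_up(5, 17))   # [6, 8, 12, 17]
```

Starting slowly avoids over-requesting when only a few extra executors are needed, while doubling lets a heavily backlogged job reach the cap in a few rounds.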
Dynamic allocation setup
Property                                                   Value
spark.dynamicAllocation.enabled                            true
spark.shuffle.service.enabled                              true
spark.dynamicAllocation.minExecutors                       5
spark.dynamicAllocation.maxExecutors                       17
spark.dynamicAllocation.initialExecutors                   0
spark.dynamicAllocation.executorIdleTimeout                60s
spark.dynamicAllocation.schedulerBacklogTimeout            5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout   5s
Optional: the properties below the two enabled flags
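One way to apply these properties is to pass them as `--conf key=value` pairs to spark-submit. The sketch below only builds the argument list; the helper name `to_spark_submit_args` is hypothetical, and `initialExecutors` is left out on purpose so that Spark falls back to its default of `minExecutors`.

```python
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "17",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s",
}


def to_spark_submit_args(props: dict) -> list:
    """Turn a property map into repeated '--conf key=value' arguments."""
    args = []
    for key, value in props.items():
        args += ["--conf", f"{key}={value}"]
    return args


args = to_spark_submit_args(DYNAMIC_ALLOCATION)
print(" ".join(args))
```

The same key/value pairs could instead be written to spark-defaults.conf, one property per line.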
Compressions
• Always compress data files on Amazon S3
• Reduces storage cost
• Reduces bandwidth between Amazon S3 and
Amazon EMR
• Speeds up your job
Compression types:
– Some are fast BUT offer less space reduction
– Some are space efficient BUT slower
– Some are splittable and some are not
Algorithm   % Space remaining   Encoding speed   Decoding speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
• If you are time-sensitive, faster compressions are a better choice
• If you have a large amount of data, use space-efficient compressions
• If you don’t care, pick GZIP
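The trade-off can be estimated from the figures in the compression table. This is a back-of-the-envelope sketch: the helper name `estimate` is hypothetical, and it assumes the quoted speeds are single-threaded throughput measured against the uncompressed size.

```python
# (space remaining, encode MB/s, decode MB/s), copied from the table in this deck.
CODECS = {
    "GZIP": (0.13, 21, 118),
    "LZO": (0.20, 135, 410),
    "Snappy": (0.22, 172, 409),
}


def estimate(codec: str, size_mb: float) -> dict:
    """Estimated stored size and encode/decode time for `size_mb` of raw data."""
    remaining, encode_mbps, decode_mbps = CODECS[codec]
    return {
        "stored_mb": size_mb * remaining,
        "encode_s": size_mb / encode_mbps,
        "decode_s": size_mb / decode_mbps,
    }


for name in CODECS:
    e = estimate(name, 1000)
    print(f"{name}: {e['stored_mb']:.0f} MB stored, "
          f"{e['encode_s']:.1f} s to encode, {e['decode_s']:.1f} s to decode")

fastest = min(CODECS, key=lambda c: estimate(c, 1000)["encode_s"])
smallest = min(CODECS, key=lambda c: estimate(c, 1000)["stored_mb"])
```

For 1000 MB of raw data, `fastest` and `smallest` pick different codecs, which is the speed versus space trade-off described in the bullets.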
Data Serialization
• Data is serialized when cached or shuffled
Default: Java serializer
• Kryo serialization (often as much as 10x faster than Java serialization)
• Does not support all Serializable types
• Register the class in advance
Usage: Set in SparkConf
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Spark doesn’t like to Shuffle
• Shuffling is expensive
• Disk I/O
• Data Serialization
• Network I/O
• Spill to disk
• Increased Garbage collection
• Use aggregateByKey() instead of your own aggregator
Usage:
myRDD.aggregateByKey(0)((acc, v) => acc + v.toInt, (acc1, acc2) => acc1 + acc2).collect()
• Apply filter earlier on data
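To show why aggregateByKey() reduces shuffle volume, here is a plain-Python emulation of its two-phase behavior. It is a sketch and not the Spark API: `aggregate_by_key` is a hypothetical helper, and each inner list stands in for one partition. Values are first folded per key inside each partition, so only one partial result per key and partition would need to cross the network.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Emulate aggregateByKey: fold within partitions, then merge the partials."""
    partials = []
    for partition in partitions:
        acc = {}
        for key, value in partition:
            # seq_op folds one value into the per-partition accumulator for its key.
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)

    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            # comb_op merges accumulators that came from different partitions.
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


partitions = [
    [("a", "1"), ("b", "2"), ("a", "3")],
    [("a", "4"), ("b", "5")],
]
result = aggregate_by_key(partitions, 0,
                          lambda acc, v: acc + int(v),
                          lambda x, y: x + y)
print(result)   # {'a': 8, 'b': 7}
```

Here five input records shrink to four partial sums before the merge step; on real data with many values per key the reduction is far larger.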
Parallelism & Partitions
spark.default.parallelism
• getNumPartitions()
• If you have >10K tasks, then it's good to coalesce
• If you are not using all the slots on cluster, repartition can
increase parallelism
• 2-3 tasks per CPU core in your cluster
Config file: spark-defaults.conf
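These rules of thumb can be turned into a quick check. The sketch below is illustrative: the helper names are hypothetical, and the thresholds (2-3 tasks per core, more than 10K tasks) are the ones quoted on this slide.

```python
def recommended_partition_range(num_executors: int, cores_per_executor: int):
    """Partition count range that gives 2-3 tasks per CPU core."""
    total_cores = num_executors * cores_per_executor
    return 2 * total_cores, 3 * total_cores


def partition_advice(current_partitions: int, total_cores: int) -> str:
    """Suggest coalesce, repartition, or keep, based on the slide's thresholds."""
    if current_partitions > 10_000:
        return "coalesce"        # too many small tasks
    if current_partitions < total_cores:
        return "repartition"     # some task slots would sit idle
    return "keep"


# Using the executor sizing from earlier in the deck: 17 executors x 5 cores.
low, high = recommended_partition_range(17, 5)
print(low, high)                    # 170 255
print(partition_advice(40, 85))     # repartition
print(partition_advice(50_000, 85)) # coalesce
```

A value in the printed range is a reasonable first setting for spark.default.parallelism on that cluster.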
RDD Persistence
• Caching or persisting dataset in memory
• Methods
• cache()
• persist()
• Small RDD → MEMORY_ONLY
• Big RDD → MEMORY_ONLY_SER (CPU intensive)
• Don’t spill to disk
• Use replicated storage for faster recovery
Zeppelin and Spark
on Amazon EMR
Remember to complete
your evaluations!
Thank you!

More Related Content

What's hot

Cassandraのしくみ データの読み書き編
Cassandraのしくみ データの読み書き編Cassandraのしくみ データの読み書き編
Cassandraのしくみ データの読み書き編
Yuki Morishita
 
Data platformdesign
Data platformdesignData platformdesign
Data platformdesign
Ryoma Nagata
 
Oracle Database Vaultのご紹介
Oracle Database Vaultのご紹介Oracle Database Vaultのご紹介
Oracle Database Vaultのご紹介
オラクルエンジニア通信
 
ポスト・ラムダアーキテクチャの切り札? Apache Hudi(NTTデータ テクノロジーカンファレンス 2020 発表資料)
ポスト・ラムダアーキテクチャの切り札? Apache Hudi(NTTデータ テクノロジーカンファレンス 2020 発表資料)ポスト・ラムダアーキテクチャの切り札? Apache Hudi(NTTデータ テクノロジーカンファレンス 2020 発表資料)
ポスト・ラムダアーキテクチャの切り札? Apache Hudi(NTTデータ テクノロジーカンファレンス 2020 発表資料)
NTT DATA Technology & Innovation
 
Kinesis Firehoseを使ってみた
Kinesis Firehoseを使ってみたKinesis Firehoseを使ってみた
Kinesis Firehoseを使ってみた
Masaki Misawa
 
202110 AWS Black Belt Online Seminar AWS Site-to-Site VPN
202110 AWS Black Belt Online Seminar AWS Site-to-Site VPN202110 AWS Black Belt Online Seminar AWS Site-to-Site VPN
202110 AWS Black Belt Online Seminar AWS Site-to-Site VPN
Amazon Web Services Japan
 
日本のお客様におけるAmazon Auroraへの移行・検証事例と技術ポイント
日本のお客様におけるAmazon Auroraへの移行・検証事例と技術ポイント日本のお客様におけるAmazon Auroraへの移行・検証事例と技術ポイント
日本のお客様におけるAmazon Auroraへの移行・検証事例と技術ポイント
Amazon Web Services Japan
 
KeycloakでAPI認可に入門する
KeycloakでAPI認可に入門するKeycloakでAPI認可に入門する
KeycloakでAPI認可に入門する
Hitachi, Ltd. OSS Solution Center.
 
負荷分散だけじゃないELBのメリット
負荷分散だけじゃないELBのメリット負荷分散だけじゃないELBのメリット
負荷分散だけじゃないELBのメリット
Takashi Toyosaki
 
Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築
Minero Aoki
 
Oracle Data Guard による高可用性
Oracle Data Guard による高可用性Oracle Data Guard による高可用性
Oracle Data Guard による高可用性
Yahoo!デベロッパーネットワーク
 
AWS Black Belt Online Seminar Amazon Aurora
AWS Black Belt Online Seminar Amazon AuroraAWS Black Belt Online Seminar Amazon Aurora
AWS Black Belt Online Seminar Amazon Aurora
Amazon Web Services Japan
 
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデートAmazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Web Services Japan
 
Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]
Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]
Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]
オラクルエンジニア通信
 
システムのモダナイズ 落ちても良いアプリの作り方
システムのモダナイズ 落ちても良いアプリの作り方システムのモダナイズ 落ちても良いアプリの作り方
システムのモダナイズ 落ちても良いアプリの作り方
Chihiro Ito
 
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
NTT DATA Technology & Innovation
 
Snowflake Architecture and Performance
Snowflake Architecture and PerformanceSnowflake Architecture and Performance
Snowflake Architecture and Performance
Mineaki Motohashi
 
Kinesis + Elasticsearchでつくるさいきょうのログ分析基盤
Kinesis + Elasticsearchでつくるさいきょうのログ分析基盤Kinesis + Elasticsearchでつくるさいきょうのログ分析基盤
Kinesis + Elasticsearchでつくるさいきょうのログ分析基盤
Amazon Web Services Japan
 
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
Trainocate Japan, Ltd.
 
Verrazzanoご紹介
Verrazzanoご紹介Verrazzanoご紹介

What's hot (20)

Cassandraのしくみ データの読み書き編
Cassandraのしくみ データの読み書き編Cassandraのしくみ データの読み書き編
Cassandraのしくみ データの読み書き編
 
Data platformdesign
Data platformdesignData platformdesign
Data platformdesign
 
Oracle Database Vaultのご紹介
Oracle Database Vaultのご紹介Oracle Database Vaultのご紹介
Oracle Database Vaultのご紹介
 
ポスト・ラムダアーキテクチャの切り札? Apache Hudi(NTTデータ テクノロジーカンファレンス 2020 発表資料)
ポスト・ラムダアーキテクチャの切り札? Apache Hudi(NTTデータ テクノロジーカンファレンス 2020 発表資料)ポスト・ラムダアーキテクチャの切り札? Apache Hudi(NTTデータ テクノロジーカンファレンス 2020 発表資料)
ポスト・ラムダアーキテクチャの切り札? Apache Hudi(NTTデータ テクノロジーカンファレンス 2020 発表資料)
 
Kinesis Firehoseを使ってみた
Kinesis Firehoseを使ってみたKinesis Firehoseを使ってみた
Kinesis Firehoseを使ってみた
 
202110 AWS Black Belt Online Seminar AWS Site-to-Site VPN
202110 AWS Black Belt Online Seminar AWS Site-to-Site VPN202110 AWS Black Belt Online Seminar AWS Site-to-Site VPN
202110 AWS Black Belt Online Seminar AWS Site-to-Site VPN
 
日本のお客様におけるAmazon Auroraへの移行・検証事例と技術ポイント
日本のお客様におけるAmazon Auroraへの移行・検証事例と技術ポイント日本のお客様におけるAmazon Auroraへの移行・検証事例と技術ポイント
日本のお客様におけるAmazon Auroraへの移行・検証事例と技術ポイント
 
KeycloakでAPI認可に入門する
KeycloakでAPI認可に入門するKeycloakでAPI認可に入門する
KeycloakでAPI認可に入門する
 
負荷分散だけじゃないELBのメリット
負荷分散だけじゃないELBのメリット負荷分散だけじゃないELBのメリット
負荷分散だけじゃないELBのメリット
 
Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築
 
Oracle Data Guard による高可用性
Oracle Data Guard による高可用性Oracle Data Guard による高可用性
Oracle Data Guard による高可用性
 
AWS Black Belt Online Seminar Amazon Aurora
AWS Black Belt Online Seminar Amazon AuroraAWS Black Belt Online Seminar Amazon Aurora
AWS Black Belt Online Seminar Amazon Aurora
 
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデートAmazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
 
Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]
Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]
Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]
 
システムのモダナイズ 落ちても良いアプリの作り方
システムのモダナイズ 落ちても良いアプリの作り方システムのモダナイズ 落ちても良いアプリの作り方
システムのモダナイズ 落ちても良いアプリの作り方
 
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
 
Snowflake Architecture and Performance
Snowflake Architecture and PerformanceSnowflake Architecture and Performance
Snowflake Architecture and Performance
 
Kinesis + Elasticsearchでつくるさいきょうのログ分析基盤
Kinesis + Elasticsearchでつくるさいきょうのログ分析基盤Kinesis + Elasticsearchでつくるさいきょうのログ分析基盤
Kinesis + Elasticsearchでつくるさいきょうのログ分析基盤
 
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
 
Verrazzanoご紹介
Verrazzanoご紹介Verrazzanoご紹介
Verrazzanoご紹介
 

Viewers also liked

Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Amazon Web Services
 
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
Amazon Web Services Korea
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
Amazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
Hoon Park
 
쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기
Brian Hong
 

Viewers also liked (7)

Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
 
쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기
 

Similar to (BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR

Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Amazon Web Services
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
Adarsh Pannu
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
Amazon Web Services
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
Julien SIMON
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
Amazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
 
EMR Training
EMR TrainingEMR Training
EMR Training
vishal192091
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
Israel AWS User Group
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
Demi Ben-Ari
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Amazon Web Services
 
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
Amazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2
Jianfeng Zhang
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @Windward
Demi Ben-Ari
 

Similar to (BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR (20)

Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
EMR Training
EMR TrainingEMR Training
EMR Training
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @Windward
 

(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz, Sr. Product Manager – Amazon EMR Manjeet Chayel, Specialist Solutions Architect October 2015 BDT309 Data Science & Best Practices for Apache Spark on Amazon EMR
  • 2. What to Expect from the Session • Data science with Apache Spark • Running Spark on Amazon EMR • Customer use cases and architectures • Best practices for running Spark • Demo: Using Apache Zeppelin to analyze US domestic flights dataset
  • 4. Spark is fast [Diagram: RDDs A to F, some with cached partitions, linked by map, join, filter, and groupBy across Stages 1 to 3] • Massively parallel • Uses DAGs instead of MapReduce for execution • Minimizes I/O by storing data in RDDs in memory • Partitioning-aware to avoid network-intensive shuffle
  • 5. Spark components to match your use case
  • 6. Spark speaks your language And more!
  • 7. Apache Zeppelin notebook to develop queries Now available on Amazon EMR 4.1.0!
  • 8. Use DataFrames to easily interact with data • Distributed collection of data organized in columns • An extension of the existing RDD API • Optimized for query execution
  • 9. Easily create DataFrames from many formats RDD
  • 10. Load data with the Spark SQL Data Sources API Additional libraries at spark-packages.org
  • 12. Use DataFrames for machine learning • Spark ML libraries (replacing MLlib) use DataFrames as input/output for models • Create ML pipelines with a variety of distributed algorithms
  • 13. Create DataFrames on streaming data • Access data in Spark Streaming DStream • Create SQLContext on the SparkContext used for Spark Streaming application for ad hoc queries • Incorporate DataFrame in Spark Streaming application
  • 14. Use R to interact with DataFrames • SparkR package for using R to manipulate DataFrames • Create SparkR applications or interactively use the SparkR shell (no Zeppelin support yet - ZEPPELIN-156) • Comparable performance to Python and Scala DataFrames
  • 15. Spark SQL • Seamlessly mix SQL with Spark programs • Uniform data access • Hive compatibility – run Hive queries without modifications using HiveContext • Connect through JDBC/ODBC
  • 17. Focus on deriving insights from your data instead of manually configuring clusters Easy to install and configure Spark Secured Spark submit or use Zeppelin UI Quickly add and remove capacity Hourly, reserved, or EC2 Spot pricing Use S3 to decouple compute and storage
  • 18. Launch the latest Spark version July 15 – Spark 1.4.1 GA release July 24 – Spark 1.4.1 available on Amazon EMR September 9 – Spark 1.5.0 GA release September 30 – Spark 1.5.0 available on Amazon EMR < 3 week cadence with latest open source release
  • 19. Amazon EMR runs Spark on YARN • Dynamically share and centrally configure the same pool of cluster resources across engines • Schedulers for categorizing, isolating, and prioritizing workloads • Choose the number of executors to use, or allow YARN to choose (dynamic allocation) • Kerberos authentication Storage S3, HDFS YARN Cluster Resource Management Batch MapReduce In Memory Spark Applications Pig, Hive, Cascading, Spark Streaming, Spark SQL
  • 20. Create a fully configured cluster in minutes AWS Management Console AWS Command Line Interface (CLI) Or use an AWS SDK directly with the Amazon EMR API
  • 21. Or easily change your settings
  • 22. Many storage layers to choose from Amazon DynamoDB EMR-DynamoDB connector Amazon RDS Amazon Kinesis Streaming data connectors JDBC Data Source w/ Spark SQL Elasticsearch connector Amazon Redshift Amazon Redshift Copy From HDFS EMR File System (EMRFS) Amazon S3 Amazon EMR
  • 23. Decouple compute and storage by using S3 as your data layer HDFS S3 is designed for 11 9’s of durability and is massively scalable EC2 Instance Memory Amazon S3 Amazon EMR Amazon EMR Amazon EMR
  • 24. Easy to run your Spark workloads Amazon EMR Step API SSH to master node (Spark Shell) Submit a Spark application Amazon EMR
  • 25. Secure Spark clusters – encryption at rest On-Cluster HDFS transparent encryption (AES 256) [new on release emr-4.1.0] Local disk encryption for temporary files using LUKS encryption via bootstrap action Amazon S3 Amazon S3 EMRFS support for Amazon S3 client-side and server-side encryption (AES 256)
  • 26. Secure Spark clusters – encryption in flight Internode communication on-cluster Blocks are encrypted in-transit in HDFS when using transparent encryption Spark’s Broadcast and FileServer services can use SSL. BlockTransferService (for shuffle) can’t use SSL (SPARK-5682). Amazon S3 S3 to Amazon EMR cluster Secure communication with SSL Objects encrypted over the wire if using client-side encryption
  • 27. Secure Spark clusters – additional features Permissions: • Cluster level: IAM roles for the Amazon EMR service and the cluster • Application level: Kerberos (Spark on YARN only) • Amazon EMR service level: IAM users Access: VPC, security groups Auditing: AWS CloudTrail
  • 29. Some of our customers running Spark on Amazon EMR
  • 32. Best Practices for Spark on Amazon EMR
  • 33. • Using correct instance • Understanding Executors • Sizing your executors • Dynamic allocation on YARN • Understanding storage layers • File formats and compression • Boost your performance • Data serialization • Avoiding shuffle • Managing partitions • RDD Persistence • Using Zeppelin notebook
  • 34. What does Spark need? • Memory – lots of it!! • Network • CPU • Horizontal Scaling Table columns: Workflow, Resource, Instance. Machine learning: CPU. ETL: I/O.
  • 35. Choose your instance types. Try different configurations to find your optimal architecture. CPU: c1 family, c3 family, cc1.4xlarge, cc2.8xlarge. Memory: m2 family, r3 family, cr1.8xlarge. Disk/IO: d2 family, i2 family. General: m1 family, m3 family. Use cases shown: batch process, machine learning, interactive analysis, large HDFS.
  • 36. • Using correct instance • Understanding Executors • Sizing your executors • Dynamic allocation on YARN • Understanding storage layers • File formats and compression • Caching tables • Boost your performance • Data serialization • Avoiding shuffle • Managing partitions • RDD Persistence
  • 37. How Spark executes • Spark Driver • Executor Spark Driver Executor
  • 38. Spark Executor on YARN This is where all the action happens - How to select number of executors? - How many cores per executor? - Example
  • 39. Sample Amazon EMR cluster
      Model | vCPU | Mem (GB) | SSD Storage (GB) | Networking
      r3.large | 2 | 15.25 | 1 x 32 | Moderate
      r3.xlarge | 4 | 30.5 | 1 x 80 | Moderate
      r3.2xlarge | 8 | 61 | 1 x 160 | High
      r3.4xlarge | 16 | 122 | 1 x 320 | High
      r3.8xlarge | 32 | 244 | 2 x 320 | 10 Gigabit
  • 40. Create Amazon EMR cluster $ aws emr create-cluster --name "Spark cluster" --release-label emr-4.1.0 --applications Name=Hive Name=Spark --use-default-roles --ec2-attributes KeyName=myKey --instance-type r3.4xlarge --instance-count 6 --no-auto-terminate
  • 41. Inside Spark Executor on YARN Selecting number of executor cores: • Leave 1 core for OS and other activities • 4-5 cores per executor give good performance • Each executor can run up to 4-5 tasks • i.e. 4-5 threads for read/write operations to HDFS
  • 42. Inside Spark Executor on YARN Selecting number of executors: --num-executors or spark.executor.instances • $\text{executors per node} = \dfrac{\text{cores on node} - 1\ (\text{for OS})}{\text{tasks per executor}}$ • $\dfrac{16 - 1}{5} = 3$ executors per node Model | vCPU | Mem (GB) | SSD Storage (GB) | Networking: r3.4xlarge | 16 | 122 | 1 x 320 | High
  • 43. Inside Spark Executor on YARN Selecting number of executors: --num-executors or spark.executor.instances • $\dfrac{16 - 1}{5} = 3$ executors per node • 6 instances • $\text{number of executors} = 3 \times 6 - 1 = 17$
  • 44. Inside Spark Executor on YARN Max Container size on node YARN Container Controls the max sum of memory used by the container yarn.nodemanager.resource.memory-mb → Default: 116 G Config File: yarn-site.xml
  • 45. Inside Spark Executor on YARN Max Container size on node Executor space Where Spark executor Runs Executor Container →
  • 46. Inside Spark Executor on YARN Max Container size on node Executor Memory Overhead - Off heap memory (VM overheads, interned strings etc.) spark.yarn.executor.memoryOverhead = executorMemory * 0.10 Executor Container Memory Overhead Config File: spark-defaults.conf
  • 47. Inside Spark Executor on YARN Max Container size on node Spark executor memory - Amount of memory to use per executor process spark.executor.memory Executor Container Memory Overhead Spark Executor Memory Config File: spark-defaults.conf
  • 48. Inside Spark Executor on YARN Max Container size on node Shuffle Memory Fraction- Fraction of Java heap to use for aggregation and cogroups during shuffles spark.shuffle.memoryFraction Executor Container Memory Overhead Spark Executor Memory Shuffle memoryFraction Default: 0.2
  • 49. Inside Spark Executor on YARN Max Container size on node Storage Memory Fraction - Fraction of Java heap to use for Spark's memory cache spark.storage.memoryFraction Executor Container Memory Overhead Spark Executor Memory Shuffle memoryFraction Storage memoryFraction Default: 0.6
  • 50. Inside Spark Executor on YARN Max Container size on node --executor-memory or spark.executor.memory $\text{executor memory} = \dfrac{\text{max container size}}{\text{number of executors per node}}$ Config File: spark-defaults.conf
  • 51. Inside Spark Executor on YARN Max Container size on node --executor-memory or spark.executor.memory $\text{executor memory} = \dfrac{116\ \text{G}}{3} \approx 38\ \text{G}$ Config File: spark-defaults.conf
  • 52. Inside Spark Executor on YARN Max Container size on node --executor-memory or spark.executor.memory $\text{memory overhead} = 38 \times 0.10 = 3.8\ \text{G}$ Config File: spark-defaults.conf
  • 53. Inside Spark Executor on YARN Max Container size on node --executor-memory or spark.executor.memory $\text{executor memory} = 38 - 3.8 \approx 34\ \text{G}$ Config File: spark-defaults.conf
  • 54. Optimal setting: --num-executors 17 --executor-cores 5 --executor-memory 34G Inside Spark Executor on YARN
  • 55. Optimal setting: --num-executors 17 --executor-cores 5 --executor-memory 34G Inside Spark Executor on YARN
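The sizing arithmetic on slides 42 to 53 can be sketched in a few lines of Python. This is a minimal sketch, not an EMR or Spark tool: the function name `size_executors` and its parameters are hypothetical, the 116 G container limit and 10% overhead factor are the values quoted on the slides, and rounding down at each step is an assumption chosen to reproduce the slide figures.

```python
def size_executors(cores_per_node, nodes, max_container_gb,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Reproduce the slide arithmetic for executor sizing on YARN (hypothetical helper)."""
    # Leave one core per node for the OS, then pack executors by core count.
    executors_per_node = (cores_per_node - 1) // cores_per_executor
    # The slides subtract one executor from the cluster-wide total.
    num_executors = executors_per_node * nodes - 1
    # Split the YARN container limit evenly, rounding down to whole gigabytes (assumption).
    container_share_gb = max_container_gb // executors_per_node
    # Reserve the off-heap overhead; the remainder is the executor heap.
    overhead_gb = container_share_gb * overhead_fraction
    executor_memory_gb = int(container_share_gb - overhead_gb)
    return executors_per_node, num_executors, executor_memory_gb


# Six r3.4xlarge nodes: 16 vCPUs each and a 116 G YARN container limit.
cores = 5
per_node, total, memory_gb = size_executors(16, 6, 116, cores_per_executor=cores)
print(f"--num-executors {total} --executor-cores {cores} --executor-memory {memory_gb}G")
```

For the sample cluster this yields the same three flags as the "Optimal setting" slide; other instance types would need their own container limit from yarn-site.xml.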
  • 56. • Using correct instance • Understanding Executors • Sizing your executors • Dynamic allocation on YARN • Understanding storage layers • File formats and compression • Boost your performance • Data serialization • Avoiding shuffle • Managing partitions • RDD Persistence • Using Zeppelin Notebook
  • 57. Dynamic Allocation on YARN … allows your Spark applications to scale up based on demand and scale down when not required. Remove Idle executors, Request more on demand
  • 58. Dynamic Allocation on YARN Scaling up on executors - Request when you want the job to complete faster - Idle resources on cluster - Exponential increase in executors over time
  • 59. Dynamic allocation setup
      Property | Value
      spark.dynamicAllocation.enabled | true
      spark.shuffle.service.enabled | true
      Optional:
      spark.dynamicAllocation.minExecutors | 5
      spark.dynamicAllocation.maxExecutors | 17
      spark.dynamicAllocation.initialExecutors | 0
      spark.dynamicAllocation.executorIdleTimeout | 60s
      spark.dynamicAllocation.schedulerBacklogTimeout | 5s
      spark.dynamicAllocation.sustainedSchedulerBacklogTimeout | 5s
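One way to apply these settings per job is to pass them to spark-submit as `--conf key=value` pairs. The sketch below only builds the command line; it uses the property names as spelled in the Spark configuration documentation and the values from the slide. The helper `to_submit_flags` and the application name `my_app.py` are hypothetical. Some Spark releases may reject an initial executor count below the minimum, so check the values against your version before using them.

```python
# Property values copied from the slide; tune them for your own cluster.
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    # The remaining properties are optional and have defaults.
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "17",
    "spark.dynamicAllocation.initialExecutors": "0",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s",
}


def to_submit_flags(properties):
    """Turn a property mapping into spark-submit '--conf key=value' arguments."""
    flags = []
    for key, value in properties.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


# my_app.py is a placeholder application name.
command = ["spark-submit"] + to_submit_flags(DYNAMIC_ALLOCATION) + ["my_app.py"]
print(" ".join(command))
```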
  • 60. • Using correct instance • Understanding Executors • Sizing your executors • Dynamic allocation on YARN • Understanding storage layers • File formats and compression • Boost your performance • Data serialization • Avoiding shuffle • Managing partitions • RDD Persistence • Using Zeppelin notebook
  • 61. Compression • Always compress data files on Amazon S3 • Reduces storage cost • Reduces bandwidth between Amazon S3 and Amazon EMR • Speeds up your job
  • 62. Compression types: – Some are fast BUT offer less space reduction – Some are space efficient BUT slower – Some are splittable and some are not
      Algorithm | % Space Remaining | Encoding Speed | Decoding Speed
      GZIP | 13% | 21 MB/s | 118 MB/s
      LZO | 20% | 135 MB/s | 410 MB/s
      Snappy | 22% | 172 MB/s | 409 MB/s
  • 63. Compression • If you are time-sensitive, faster compression is a better choice • If you have a large amount of data, use space-efficient compression • If you don’t care, pick GZIP
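The rule of thumb on the compression slides can be written as a small lookup. This is an illustrative sketch only: the figures are the ones from the slide's table, while the function name `pick_codec` and the "speed"/"space" labels are hypothetical, and ranking by encoding speed (rather than decoding speed) is an assumption.

```python
# (space remaining in %, encoding MB/s, decoding MB/s), as listed on the slide.
CODECS = {
    "GZIP": (13, 21, 118),
    "LZO": (20, 135, 410),
    "Snappy": (22, 172, 409),
}


def pick_codec(priority=None):
    """Pick a codec by priority: 'speed', 'space', or None for no preference."""
    if priority == "speed":
        # Time-sensitive jobs: highest encoding throughput.
        return max(CODECS, key=lambda name: CODECS[name][1])
    if priority == "space":
        # Large datasets: smallest remaining size.
        return min(CODECS, key=lambda name: CODECS[name][0])
    # No preference: the slide's default recommendation.
    return "GZIP"


for priority in ("speed", "space", None):
    print(priority, "->", pick_codec(priority))
```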
  • 64. • Using correct instance • Understanding Executors • Sizing your executors • Dynamic allocation on YARN • Understanding storage layers • File formats and compression • Boost your performance • Data serialization • Avoiding shuffle • Managing partitions • RDD Persistence • Using Zeppelin notebook
  • 65. Data Serialization • Data is serialized when cached or shuffled Default: Java serializer Memory Disk Memory Disk Spark executor
  • 66. Data Serialization • Data is serialized when cached or shuffled Default: Java serializer • Kryo serialization (up to 10x faster than Java serialization) • Does not support all Serializable types • Register the class in advance Usage: Set in SparkConf conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  • 67. Spark doesn’t like to Shuffle • Shuffling is expensive • Disk I/O • Data Serialization • Network I/O • Spill to disk • Increased Garbage collection • Use aggregateByKey() instead of your own aggregator Usage: myRDD.aggregateByKey(0)((acc, v) => acc + v.toInt, (a, b) => a + b).collect • Apply filter earlier on data
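To see why aggregateByKey() reduces shuffle volume, it may help to mimic its two steps in plain Python. This is a conceptual sketch, not Spark code: the function `aggregate_by_key` and the sample partitions are hypothetical, and lists stand in for RDD partitions. Each partition is first folded locally with the sequence function, so only one partial result per key per partition would need to cross the network before the combine function merges them.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Mimic aggregateByKey semantics over a list of in-memory partitions."""
    # Step 1: fold each partition locally, as would happen before any shuffle.
    partials = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = seq_op(local.get(key, zero), value)
        partials.append(local)
    # Step 2: merge the per-partition partial results for each key.
    merged = {}
    for local in partials:
        for key, acc in local.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged


# Two hypothetical partitions of (key, string value) pairs.
partitions = [[("a", "1"), ("b", "2"), ("a", "3")], [("a", "4"), ("b", "5")]]
totals = aggregate_by_key(partitions, 0,
                          lambda acc, v: acc + int(v),  # fold a value into the accumulator
                          lambda x, y: x + y)           # merge two accumulators
print(totals)  # {'a': 8, 'b': 7}
```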
  • 68. Parallelism & Partitions spark.default.parallelism • getNumPartitions() • If you have >10K tasks, then it's good to coalesce • If you are not using all the slots on the cluster, repartition can increase parallelism • 2-3 tasks per CPU core in your cluster Config File: spark-defaults.conf
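The partitioning guidance above can be turned into a quick check. This is a rough sketch under stated assumptions: the helper names `parallelism_range` and `partition_advice` are hypothetical, the 10,000-task threshold and the 2-3 tasks per core come from the slide, and "not using all the slots" is interpreted here as having fewer partitions than cores.

```python
def parallelism_range(total_cores, low=2, high=3):
    """Suggested range for spark.default.parallelism: 2-3 tasks per CPU core."""
    return total_cores * low, total_cores * high


def partition_advice(num_partitions, total_cores, max_tasks=10_000):
    """Return 'coalesce', 'repartition', or 'keep' following the slide's rules of thumb."""
    if num_partitions > max_tasks:
        return "coalesce"      # too many small tasks
    if num_partitions < total_cores:
        return "repartition"   # some executor slots would sit idle
    return "keep"


# The sample cluster from the sizing slides: 17 executors with 5 cores each.
cores = 17 * 5
print(parallelism_range(cores))        # (170, 255)
print(partition_advice(40, cores))     # repartition
print(partition_advice(20_000, cores)) # coalesce
```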
  • 69. RDD Persistence • Caching or persisting dataset in memory • Methods • cache() • persist() • Small RDD → MEMORY_ONLY • Big RDD → MEMORY_ONLY_SER (CPU intensive) • Don’t spill to disk • Use replicated storage for faster recovery
  • 70. Zeppelin and Spark on Amazon EMR