SlideShare a Scribd company logo
1 of 75
Download to read offline
Top 5 Mistakes when writing
Spark applications
Mark Grover | @mark_grover | Software Engineer
Ted Malaska | @TedMalaska | Principal Solutions Architect
tiny.cloudera.com/spark-mistakes
About the book
•  @hadooparchbook
•  hadooparchitecturebook.com
•  github.com/hadooparchitecturebook
•  slideshare.com/hadooparchbook
Mistakes people make
when using Spark
Mistakes people we’ve made
when using Spark
Mistakes people make
when using Spark
Mistake # 1
# Executors, cores, memory !?!
•  6 Nodes
•  16 cores each
•  64 GB of RAM each
Decisions, decisions, decisions
•  Number of executors (--num-executors)
•  Cores for each executor (--executor-cores)
•  Memory for each executor (--executor-memory)
•  6 nodes
•  16 cores each
•  64 GB of RAM
Spark Architecture recap
Answer #1 – Most granular
•  Have smallest sized executors
possible
•  1 core each
•  64GB/node / 16 executors/node
= 4 GB/executor
•  Total of 16 cores x 6 nodes
= 96 cores => 96 executors
Worker node
Executor 6
Executor 5
Executor 4
Executor 3
Executor 2
Executor 1
Answer #1 – Most granular
•  Have smallest sized executors
possible
•  1 core each
•  64GB/node / 16 executors/node
= 4 GB/executor
•  Total of 16 cores x 6 nodes
= 96 cores => 96 executors
Worker node
Executor 6
Executor 5
Executor 4
Executor 3
Executor 2
Executor 1
Why?
•  Not using benefits of running multiple tasks in
same executor
Answer #2 – Least granular
•  6 executors in total
=>1 executor per node
•  64 GB memory each
•  16 cores each
Worker node
Executor 1
Answer #2 – Least granular
•  6 executors in total
=>1 executor per node
•  64 GB memory each
•  16 cores each
Worker node
Executor 1
Why?
•  Need to leave some memory overhead for OS/
Hadoop daemons
Answer #3 – with overhead
•  6 executors – 1 executor/node
•  63 GB memory each
•  15 cores each
Worker node
Executor 1
Overhead(1G,1 core)
Answer #3 – with overhead
•  6 executors – 1 executor/node
•  63 GB memory each
•  15 cores each
Worker node
Executor 1
Overhead(1G,1 core)
Let’s assume…
•  You are running Spark on YARN, from here on…
3 things
•  3 other things to keep in mind
#1 – Memory overhead
•  --executor-memory controls the heap size
•  Need some overhead (controlled by
spark.yarn.executor.memory.overhead) for off heap memory
•  Default is max(384MB, .07 * spark.executor.memory)
#2 - YARN AM needs a core: Client
mode
#2 YARN AM needs a core: Cluster
mode
#3 HDFS Throughput
•  15 cores per executor can lead to bad HDFS I/O
throughput.
•  Best is to keep under 5 cores per executor
Calculations
•  5 cores per executor
–  For max HDFS throughput
•  Cluster has 6 * 15 = 90 cores in total
after taking out Hadoop/Yarn daemon cores)
•  90 cores / 5 cores/executor
= 18 executors
•  Each node has 3 executors
•  63 GB/3 = 21 GB, 21 x (1-0.07)
~ 19 GB
•  1 executor for AM => 17 executors
Overhead
Worker node
Executor 3
Executor 2
Executor 1
Correct answer
•  17 executors in total
•  19 GB memory/executor
•  5 cores/executor
* Not etched in stone
Overhead
Worker node
Executor 3
Executor 2
Executor 1
Dynamic allocation helps with
though, right?
•  Dynamic allocation allows Spark to dynamically
scale the cluster resources allocated to your
application based on the workload.
•  Works with Spark-On-Yarn
Decisions with Dynamic Allocation
•  Number of executors (--num-executors)
•  Cores for each executor (--executor-cores)
•  Memory for each executor (--executor-memory)
•  6 nodes
•  16 cores each
•  64 GB of RAM
Read more
•  From a great blog post on this topic by Sandy
Ryza:
http://blog.cloudera.com/blog/2015/03/how-to-tune-
your-apache-spark-jobs-part-2/
Mistake # 2
Application failure
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in
stage 6.0 (TID 120, 10.215.149.47):
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) at
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:
517) at
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:
146) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:
70)
Why?
•  No Spark shuffle block can be greater than 2 GB
Ok, what’s a shuffle block again?
•  In MapReduce terminology, a file written from
one Mapper for a Reducer
•  The Reducer makes a local copy of this file
(reducer local copy) and then ‘reduces’ it
Defining shuffle and partition
Each yellow arrow
in this diagram
represents a
shuffle block.
Each blue block is
a partition.
Once again
•  Overflow exception if shuffle block size > 2 GB
What’s going on here?
•  Spark uses ByteBuffer as abstraction for
blocks
val buf = ByteBuffer.allocate(length.toInt)
•  ByteBuffer is limited by Integer.MAX_SIZE (2 GB)!
Spark SQL
•  Especially problematic for Spark SQL
•  Default number of partitions to use when doing
shuffles is 200
–  This low number of partitions leads to high shuffle
block size
Umm, ok, so what can I do?
1.  Increase the number of partitions
–  Thereby, reducing the average partition size
2.  Get rid of skew in your data
–  More on that later
Umm, how exactly?
•  In Spark SQL, increase the value of
spark.sql.shuffle.partitions
•  In regular Spark applications, use
rdd.repartition() or rdd.coalesce()
(latter to reduce #partitions, if needed)
But, how many partitions should I
have?
•  Rule of thumb is around 128 MB per partition
But! There’s more!
•  Spark uses a different data structure for
bookkeeping during shuffles, when the number
of partitions is less than 2000, vs. more than
2000.
Don’t believe me?
•  In MapStatus.scala
def apply(loc: BlockManagerId, uncompressedSizes:
Array[Long]): MapStatus = {
if (uncompressedSizes.length > 2000) {
HighlyCompressedMapStatus(loc, uncompressedSizes)
} else {
new CompressedMapStatus(loc, uncompressedSizes)
}
}
Ok, so what are you saying?
If number of partitions < 2000, but not by much,
bump it to be slightly higher than 2000.
Can you summarize, please?
•  Don’t have too big partitions
–  Your job will fail due to 2 GB limit
•  Don’t have too few partitions
–  Your job will be slow, not making using of parallelism
•  Rule of thumb: ~128 MB per partition
•  If #partitions < 2000, but close, bump to just > 2000
•  Track SPARK-6235 for removing various 2 GB limits
Mistake # 3
Slow jobs on Join/Shuffle
•  Your dataset takes 20 seconds to run over with a
map job, but take 4 hours when joined or
shuffled. What wrong?
Mistake - Skew
Single Thread
Single Thread
Single Thread
Single Thread
Single Thread
Single Thread
Single Thread
Normal
Distributed
The Holy Grail of Distributed Systems
Mistake - Skew
Single Thread
Normal
Distributed
What about Skew, because that is a thing
Mistake – Skew : Answers
•  Salting
•  Isolated Salting
•  Isolated Map Joins
Mistake – Skew : Salting
•  Normal Key: “Foo”
•  Salted Key: “Foo” + random.nextInt(saltFactor)
Managing Parallelism
Mistake – Skew: Salting
Add Example Slide
©2014 Cloudera, Inc. All rights reserved.
Mistake – Skew : Salting
•  Two Stage Aggregation
–  Stage one to do operations on the salted keys
–  Stage two to do operation access unsalted key results
Data Source Map
Convert to
Salted Key & Value
Tuple
Reduce
By Salted Key
Map Convert
results to
Key & Value
Tuple
Reduce
By Key
Results
Mistake – Skew : Isolated Salting
•  Second Stage only required for Isolated Keys
Data Source Map
Convert to
Key & Value
Isolate Key and
convert to
Salted Key &
Value
Tuple
Reduce
By Key &
Salted Key
Filter Isolated
Keys
From Salted
Keys
Map Convert
results to
Key & Value
Tuple
Reduce
By Key
Union to
Results
Mistake – Skew : Isolated Map Join
•  Filter Out Isolated Keys and use Map Join/
Aggregate on those
•  And normal reduce on the rest of the data
•  This can remove a large amount of data being
shuffled
Data Source Filter Normal
Keys
From Isolated
Keys
Reduce
By Normal Key
Union to
Results
Map Join
For Isolated
Keys
Managing Parallelism
Cartesian Join
Map Task
Shuffle Tmp 1
Shuffle Tmp 2
Shuffle Tmp 3
Shuffle Tmp 4
Map Task
Shuffle Tmp 1
Shuffle Tmp 2
Shuffle Tmp 3
Shuffle Tmp 4
Map Task
Shuffle Tmp 1
Shuffle Tmp 2
Shuffle Tmp 3
Shuffle Tmp 4
ReduceTask
ReduceTask
ReduceTask
ReduceTask
Amount
of Data
Amount of Data
10x
100x
1000x
10000x
100000x
1000000x
Or more
Table YTable X
Managing Parallelism
•  How To fight Cartesian Join
–  Nested Structures
A, 1
A, 2
A, 3
A, 4
A, 5
A, 6
Table X
A, 1, 4
A, 2, 4
A, 3, 4
A, 1, 5
A, 2, 5
A, 3, 5
A, 1, 6
A, 2, 6
A, 3, 6
JOIN OR
Table X
A
A, 1
A, 2
A, 3
A, 4
A, 5
A, 6
Managing Parallelism
•  How To fight Cartesian Join
–  Nested Structures
create table nestedTable (
col1 string,
col2 string,
col3 array< struct<
col3_1: string,
col3_2: string>>
val rddNested = sc.parallelize(Array(
Row("a1", "b1", Seq(Row("c1_1",
"c2_1"),
Row("c1_2", "c2_2"),
Row("c1_3", "c2_3"))),
Row("a2", "b2", Seq(Row("c1_2",
"c2_2"),
Row("c1_3", "c2_3"),
Row("c1_4", "c2_4")))), 2)
=
Mistake # 4
Out of luck?
• Do you every run out of memory?
• Do you every have more then 20 stages?
• Is your driver doing a lot of work?
Mistake – DAG Management
• Shuffles are to be avoided
• ReduceByKey over GroupByKey
• TreeReduce over Reduce
• Use Complex/Nested Types
Mistake – DAG Management:
Shuffles
•  Map Side reduction, where possible
•  Think about partitioning/bucketing ahead of time
•  Do as much as possible with a single shuffle
•  Only send what you have to send
•  Avoid Skew and Cartesians
ReduceByKey over GroupByKey
•  ReduceByKey can do almost anything that GroupByKey
can do
•  Aggregations
•  Windowing
•  Use memory
•  But you have more control
•  ReduceByKey has a fixed limit of Memory requirements
•  GroupByKey is unbound and dependent on data
TreeReduce over Reduce
•  TreeReduce & Reduce return some result to driver
•  TreeReduce does more work on the executors
•  While Reduce bring everything back to the driver
Partition
Partition
Partition
Partition
Driver
100%
Partition
Partition
Partition
Partition
Driver
4
25%
25%
25%
25%
Complex Types
• Top N List
• Multiple types of Aggregations
• Windowing operations
• All in one pass
Complex Types
•  Think outside of the box use objects to reduce by
•  (Make something simple)
Mistake # 5
Ever seen this?
Exception in thread "main" java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
at org.apache.spark.util.collection.OpenHashSet.org
$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
at
org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
at
org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
at
org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at
org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
at…....
But!
• I already included protobuf in my app’s
maven dependencies?
Ah!
• My protobuf version doesn’t match with
Spark’s protobuf version!
Shading
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.2</version>
...
<relocations>
<relocation>
<pattern>com.google.protobuf</pattern>
<shadedPattern>com.company.my.protobuf</shadedPattern>
</relocation>
</relocations>
Future of shading
• Spark 2.0 has some libraries shaded
• Gauva is fully shaded
Summary
5 Mistakes
• Size up your executors right
• 2 GB limit on Spark shuffle blocks
• Evil thing about skew and cartesians
• Learn to manage your DAG, yo!
• Do shady stuff, don’t let classpath leaks
mess you up
THANK YOU.
tiny.cloudera.com/spark-mistakes
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska

More Related Content

What's hot

Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaSpark Summit
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performancePostgreSQL-Consulting
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
 

What's hot (20)

Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 

Viewers also liked

Hive on Spark の設計指針を読んでみた
Hive on Spark の設計指針を読んでみたHive on Spark の設計指針を読んでみた
Hive on Spark の設計指針を読んでみたRecruit Technologies
 
SparkやBigQueryなどを用いた モバイルゲーム分析環境
SparkやBigQueryなどを用いたモバイルゲーム分析環境SparkやBigQueryなどを用いたモバイルゲーム分析環境
SparkやBigQueryなどを用いた モバイルゲーム分析環境yuichi_komatsu
 
[db tech showcase Tokyo 2017] AzureでOSS DB/データ処理基盤のPaaSサービスを使ってみよう (Azure Dat...
[db tech showcase Tokyo 2017] AzureでOSS DB/データ処理基盤のPaaSサービスを使ってみよう (Azure Dat...[db tech showcase Tokyo 2017] AzureでOSS DB/データ処理基盤のPaaSサービスを使ってみよう (Azure Dat...
[db tech showcase Tokyo 2017] AzureでOSS DB/データ処理基盤のPaaSサービスを使ってみよう (Azure Dat...Naoki (Neo) SATO
 
SparkMLlibで始めるビッグデータを対象とした機械学習入門
SparkMLlibで始めるビッグデータを対象とした機械学習入門SparkMLlibで始めるビッグデータを対象とした機械学習入門
SparkMLlibで始めるビッグデータを対象とした機械学習入門Takeshi Mikami
 
Business Innovation cases driven by AI and BigData technologies
Business Innovation cases driven by AI and BigData technologiesBusiness Innovation cases driven by AI and BigData technologies
Business Innovation cases driven by AI and BigData technologiesDataWorks Summit/Hadoop Summit
 
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話Yahoo!デベロッパーネットワーク
 
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-Makoto SHIMURA
 
モバイル開発を支えるAWS Mobile Services
モバイル開発を支えるAWS Mobile Servicesモバイル開発を支えるAWS Mobile Services
モバイル開発を支えるAWS Mobile ServicesKeisuke Nishitani
 
Sparkを活用したレコメンドエンジンのパフォーマンスチューニング&自動化
Sparkを活用したレコメンドエンジンのパフォーマンスチューニング&自動化Sparkを活用したレコメンドエンジンのパフォーマンスチューニング&自動化
Sparkを活用したレコメンドエンジンのパフォーマンスチューニング&自動化Nagato Kasaki
 
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016Tokoroten Nakayama
 
Apache Hadoopを利用したビッグデータ分析基盤
Apache Hadoopを利用したビッグデータ分析基盤Apache Hadoopを利用したビッグデータ分析基盤
Apache Hadoopを利用したビッグデータ分析基盤Hortonworks Japan
 
20分でおさらいするサーバレスアーキテクチャ 「サーバレスの薄い本ダイジェスト」 #serverlesstokyo
20分でおさらいするサーバレスアーキテクチャ 「サーバレスの薄い本ダイジェスト」 #serverlesstokyo20分でおさらいするサーバレスアーキテクチャ 「サーバレスの薄い本ダイジェスト」 #serverlesstokyo
20分でおさらいするサーバレスアーキテクチャ 「サーバレスの薄い本ダイジェスト」 #serverlesstokyoMasahiro NAKAYAMA
 
大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016Cloudera Japan
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesAerospike, Inc.
 
Hadoop’s Impact on Recruit Company
Hadoop’s Impact on Recruit CompanyHadoop’s Impact on Recruit Company
Hadoop’s Impact on Recruit CompanyRecruit Technologies
 
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
ちょっと理解に自信がないなという皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)ちょっと理解に自信がないなという皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)hamaken
 
sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16Yifeng Jiang
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) 40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) hamaken
 

Viewers also liked (20)

Hive on Spark の設計指針を読んでみた
Hive on Spark の設計指針を読んでみたHive on Spark の設計指針を読んでみた
Hive on Spark の設計指針を読んでみた
 
SparkやBigQueryなどを用いた モバイルゲーム分析環境
SparkやBigQueryなどを用いたモバイルゲーム分析環境SparkやBigQueryなどを用いたモバイルゲーム分析環境
SparkやBigQueryなどを用いた モバイルゲーム分析環境
 
[db tech showcase Tokyo 2017] AzureでOSS DB/データ処理基盤のPaaSサービスを使ってみよう (Azure Dat...
[db tech showcase Tokyo 2017] AzureでOSS DB/データ処理基盤のPaaSサービスを使ってみよう (Azure Dat...[db tech showcase Tokyo 2017] AzureでOSS DB/データ処理基盤のPaaSサービスを使ってみよう (Azure Dat...
[db tech showcase Tokyo 2017] AzureでOSS DB/データ処理基盤のPaaSサービスを使ってみよう (Azure Dat...
 
SparkMLlibで始めるビッグデータを対象とした機械学習入門
SparkMLlibで始めるビッグデータを対象とした機械学習入門SparkMLlibで始めるビッグデータを対象とした機械学習入門
SparkMLlibで始めるビッグデータを対象とした機械学習入門
 
SEGA : Growth hacking by Spark ML for Mobile games
SEGA : Growth hacking by Spark ML for Mobile gamesSEGA : Growth hacking by Spark ML for Mobile games
SEGA : Growth hacking by Spark ML for Mobile games
 
Case Study: OLAP usability on Spark and Hadoop
Case Study: OLAP usability on Spark and HadoopCase Study: OLAP usability on Spark and Hadoop
Case Study: OLAP usability on Spark and Hadoop
 
Business Innovation cases driven by AI and BigData technologies
Business Innovation cases driven by AI and BigData technologiesBusiness Innovation cases driven by AI and BigData technologies
Business Innovation cases driven by AI and BigData technologies
 
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
 
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
 
モバイル開発を支えるAWS Mobile Services
モバイル開発を支えるAWS Mobile Servicesモバイル開発を支えるAWS Mobile Services
モバイル開発を支えるAWS Mobile Services
 
Sparkを活用したレコメンドエンジンのパフォーマンスチューニング&自動化
Sparkを活用したレコメンドエンジンのパフォーマンスチューニング&自動化Sparkを活用したレコメンドエンジンのパフォーマンスチューニング&自動化
Sparkを活用したレコメンドエンジンのパフォーマンスチューニング&自動化
 
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
 
Apache Hadoopを利用したビッグデータ分析基盤
Apache Hadoopを利用したビッグデータ分析基盤Apache Hadoopを利用したビッグデータ分析基盤
Apache Hadoopを利用したビッグデータ分析基盤
 
20分でおさらいするサーバレスアーキテクチャ 「サーバレスの薄い本ダイジェスト」 #serverlesstokyo
20分でおさらいするサーバレスアーキテクチャ 「サーバレスの薄い本ダイジェスト」 #serverlesstokyo20分でおさらいするサーバレスアーキテクチャ 「サーバレスの薄い本ダイジェスト」 #serverlesstokyo
20分でおさらいするサーバレスアーキテクチャ 「サーバレスの薄い本ダイジェスト」 #serverlesstokyo
 
大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use Cases
 
Hadoop’s Impact on Recruit Company
Hadoop’s Impact on Recruit CompanyHadoop’s Impact on Recruit Company
Hadoop’s Impact on Recruit Company
 
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
ちょっと理解に自信がないなという皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)ちょっと理解に自信がないなという皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
 
sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) 40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
 

Similar to Top 5 mistakes when writing Spark applications

Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)Spark Summit
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkIlya Ganelin
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit
 
Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014marvin herrera
 
What every developer should know about database scalability, PyCon 2010
What every developer should know about database scalability, PyCon 2010What every developer should know about database scalability, PyCon 2010
What every developer should know about database scalability, PyCon 2010jbellis
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
MongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB
 
#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future DesignPivotalOpenSourceHub
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudRose Toomey
 
Harnessing Big Data with Spark
Harnessing Big Data with SparkHarnessing Big Data with Spark
Harnessing Big Data with SparkAlpine Data
 
Spark tunning in Apache Kylin
Spark tunning in Apache KylinSpark tunning in Apache Kylin
Spark tunning in Apache KylinShi Shao Feng
 
Taming Go's Memory Usage — and Avoiding a Rust Rewrite
Taming Go's Memory Usage — and Avoiding a Rust RewriteTaming Go's Memory Usage — and Avoiding a Rust Rewrite
Taming Go's Memory Usage — and Avoiding a Rust RewriteScyllaDB
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 

Similar to Top 5 mistakes when writing Spark applications (20)

Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
 
What every developer should know about database scalability, PyCon 2010
What every developer should know about database scalability, PyCon 2010What every developer should know about database scalability, PyCon 2010
What every developer should know about database scalability, PyCon 2010
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
MongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: Sharding
 
#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Harnessing Big Data with Spark
Harnessing Big Data with SparkHarnessing Big Data with Spark
Harnessing Big Data with Spark
 
Spark tunning in Apache Kylin
Spark tunning in Apache KylinSpark tunning in Apache Kylin
Spark tunning in Apache Kylin
 
Taming Go's Memory Usage — and Avoiding a Rust Rewrite
Taming Go's Memory Usage — and Avoiding a Rust RewriteTaming Go's Memory Usage — and Avoiding a Rust Rewrite
Taming Go's Memory Usage — and Avoiding a Rust Rewrite
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 

More from hadooparchbook

Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platformhadooparchbook
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata Londonhadooparchbook
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 

More from hadooparchbook (20)

Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platform
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 

Recently uploaded

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Top 5 mistakes when writing Spark applications

  • 1. Top 5 Mistakes when writing Spark applications Mark Grover | @mark_grover | Software Engineer Ted Malaska | @TedMalaska | Principal Solutions Architect tiny.cloudera.com/spark-mistakes
  • 2. About the book •  @hadooparchbook •  hadooparchitecturebook.com •  github.com/hadooparchitecturebook •  slideshare.com/hadooparchbook
  • 4. Mistakes people we’ve made when using Spark
  • 7. # Executors, cores, memory !?! •  6 Nodes •  16 cores each •  64 GB of RAM each
  • 8. Decisions, decisions, decisions •  Number of executors (--num-executors) •  Cores for each executor (--executor-cores) •  Memory for each executor (--executor-memory) •  6 nodes •  16 cores each •  64 GB of RAM
  • 10. Answer #1 – Most granular •  Have smallest sized executors possible •  1 core each •  64GB/node / 16 executors/node = 4 GB/executor •  Total of 16 cores x 6 nodes = 96 cores => 96 executors Worker node Executor 6 Executor 5 Executor 4 Executor 3 Executor 2 Executor 1
  • 11. Answer #1 – Most granular •  Have smallest sized executors possible •  1 core each •  64GB/node / 16 executors/node = 4 GB/executor •  Total of 16 cores x 6 nodes = 96 cores => 96 executors Worker node Executor 6 Executor 5 Executor 4 Executor 3 Executor 2 Executor 1
  • 12. Why? •  Not using benefits of running multiple tasks in same executor
  • 13. Answer #2 – Least granular •  6 executors in total =>1 executor per node •  64 GB memory each •  16 cores each Worker node Executor 1
  • 14. Answer #2 – Least granular •  6 executors in total =>1 executor per node •  64 GB memory each •  16 cores each Worker node Executor 1
  • 15. Why? •  Need to leave some memory overhead for OS/ Hadoop daemons
  • 16. Answer #3 – with overhead •  6 executors – 1 executor/node •  63 GB memory each •  15 cores each Worker node Executor 1 Overhead(1G,1 core)
  • 17. Answer #3 – with overhead •  6 executors – 1 executor/node •  63 GB memory each •  15 cores each Worker node Executor 1 Overhead(1G,1 core)
  • 18. Let’s assume… •  You are running Spark on YARN, from here on…
  • 19. 3 things •  3 other things to keep in mind
  • 20. #1 – Memory overhead •  --executor-memory controls the heap size •  Need some overhead (controlled by spark.yarn.executor.memory.overhead) for off heap memory •  Default is max(384MB, .07 * spark.executor.memory)
  • 21. #2 - YARN AM needs a core: Client mode
  • 22. #2 YARN AM needs a core: Cluster mode
  • 23. #3 HDFS Throughput •  15 cores per executor can lead to bad HDFS I/O throughput. •  Best is to keep under 5 cores per executor
  • 24. Calculations •  5 cores per executor –  For max HDFS throughput •  Cluster has 6 * 15 = 90 cores in total after taking out Hadoop/Yarn daemon cores) •  90 cores / 5 cores/executor = 18 executors •  Each node has 3 executors •  63 GB/3 = 21 GB, 21 x (1-0.07) ~ 19 GB •  1 executor for AM => 17 executors Overhead Worker node Executor 3 Executor 2 Executor 1
  • 25. Correct answer •  17 executors in total •  19 GB memory/executor •  5 cores/executor * Not etched in stone Overhead Worker node Executor 3 Executor 2 Executor 1
  • 26. Dynamic allocation helps with though, right? •  Dynamic allocation allows Spark to dynamically scale the cluster resources allocated to your application based on the workload. •  Works with Spark-On-Yarn
  • 27. Decisions with Dynamic Allocation •  Number of executors (--num-executors) •  Cores for each executor (--executor-cores) •  Memory for each executor (--executor-memory) •  6 nodes •  16 cores each •  64 GB of RAM
  • 28. Read more •  From a great blog post on this topic by Sandy Ryza: http://blog.cloudera.com/blog/2015/03/how-to-tune- your-apache-spark-jobs-part-2/
  • 30. Application failure 15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala: 517) at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala: 146) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala: 70)
  • 31. Why? •  No Spark shuffle block can be greater than 2 GB
  • 32. Ok, what’s a shuffle block again? •  In MapReduce terminology, a file written from one Mapper for a Reducer •  The Reducer makes a local copy of this file (reducer local copy) and then ‘reduces’ it
  • 33. Defining shuffle and partition Each yellow arrow in this diagram represents a shuffle block. Each blue block is a partition.
  • 34. Once again •  Overflow exception if shuffle block size > 2 GB
  • 35. What’s going on here? •  Spark uses ByteBuffer as abstraction for blocks val buf = ByteBuffer.allocate(length.toInt) •  ByteBuffer is limited by Integer.MAX_SIZE (2 GB)!
  • 36. Spark SQL •  Especially problematic for Spark SQL •  Default number of partitions to use when doing shuffles is 200 –  This low number of partitions leads to high shuffle block size
  • 37. Umm, ok, so what can I do? 1.  Increase the number of partitions –  Thereby, reducing the average partition size 2.  Get rid of skew in your data –  More on that later
  • 38. Umm, how exactly? •  In Spark SQL, increase the value of spark.sql.shuffle.partitions •  In regular Spark applications, use rdd.repartition() or rdd.coalesce() (latter to reduce #partitions, if needed)
  • 39. But, how many partitions should I have? •  Rule of thumb is around 128 MB per partition
  • 40. But! There’s more! •  Spark uses a different data structure for bookkeeping during shuffles, when the number of partitions is less than 2000, vs. more than 2000.
  • 41. Don’t believe me? •  In MapStatus.scala def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = { if (uncompressedSizes.length > 2000) { HighlyCompressedMapStatus(loc, uncompressedSizes) } else { new CompressedMapStatus(loc, uncompressedSizes) } }
  • 42. Ok, so what are you saying? If number of partitions < 2000, but not by much, bump it to be slightly higher than 2000.
  • 43. Can you summarize, please? •  Don’t have too big partitions –  Your job will fail due to 2 GB limit •  Don’t have too few partitions –  Your job will be slow, not making using of parallelism •  Rule of thumb: ~128 MB per partition •  If #partitions < 2000, but close, bump to just > 2000 •  Track SPARK-6235 for removing various 2 GB limits
  • 45. Slow jobs on Join/Shuffle •  Your dataset takes 20 seconds to run over with a map job, but take 4 hours when joined or shuffled. What wrong?
  • 46. Mistake - Skew Single Thread Single Thread Single Thread Single Thread Single Thread Single Thread Single Thread Normal Distributed The Holy Grail of Distributed Systems
  • 47. Mistake - Skew Single Thread Normal Distributed What about Skew, because that is a thing
  • 48. Mistake – Skew : Answers •  Salting •  Isolated Salting •  Isolated Map Joins
  • 49. Mistake – Skew : Salting •  Normal Key: “Foo” •  Salted Key: “Foo” + random.nextInt(saltFactor)
  • 51. Mistake – Skew: Salting
  • 52. Add Example Slide ©2014 Cloudera, Inc. All rights reserved.
  • 53. Mistake – Skew : Salting •  Two Stage Aggregation –  Stage one to do operations on the salted keys –  Stage two to do operation access unsalted key results Data Source Map Convert to Salted Key & Value Tuple Reduce By Salted Key Map Convert results to Key & Value Tuple Reduce By Key Results
  • 54. Mistake – Skew : Isolated Salting •  Second Stage only required for Isolated Keys Data Source Map Convert to Key & Value Isolate Key and convert to Salted Key & Value Tuple Reduce By Key & Salted Key Filter Isolated Keys From Salted Keys Map Convert results to Key & Value Tuple Reduce By Key Union to Results
  • 55. Mistake – Skew : Isolated Map Join •  Filter Out Isolated Keys and use Map Join/ Aggregate on those •  And normal reduce on the rest of the data •  This can remove a large amount of data being shuffled Data Source Filter Normal Keys From Isolated Keys Reduce By Normal Key Union to Results Map Join For Isolated Keys
  • 56. Managing Parallelism Cartesian Join Map Task Shuffle Tmp 1 Shuffle Tmp 2 Shuffle Tmp 3 Shuffle Tmp 4 Map Task Shuffle Tmp 1 Shuffle Tmp 2 Shuffle Tmp 3 Shuffle Tmp 4 Map Task Shuffle Tmp 1 Shuffle Tmp 2 Shuffle Tmp 3 Shuffle Tmp 4 ReduceTask ReduceTask ReduceTask ReduceTask Amount of Data Amount of Data 10x 100x 1000x 10000x 100000x 1000000x Or more
  • 57. Table YTable X Managing Parallelism •  How To fight Cartesian Join –  Nested Structures A, 1 A, 2 A, 3 A, 4 A, 5 A, 6 Table X A, 1, 4 A, 2, 4 A, 3, 4 A, 1, 5 A, 2, 5 A, 3, 5 A, 1, 6 A, 2, 6 A, 3, 6 JOIN OR Table X A A, 1 A, 2 A, 3 A, 4 A, 5 A, 6
  • 58. Managing Parallelism •  How To fight Cartesian Join –  Nested Structures create table nestedTable ( col1 string, col2 string, col3 array< struct< col3_1: string, col3_2: string>> val rddNested = sc.parallelize(Array( Row("a1", "b1", Seq(Row("c1_1", "c2_1"), Row("c1_2", "c2_2"), Row("c1_3", "c2_3"))), Row("a2", "b2", Seq(Row("c1_2", "c2_2"), Row("c1_3", "c2_3"), Row("c1_4", "c2_4")))), 2) =
  • 60. Out of luck? • Do you every run out of memory? • Do you every have more then 20 stages? • Is your driver doing a lot of work?
  • 61. Mistake – DAG Management • Shuffles are to be avoided • ReduceByKey over GroupByKey • TreeReduce over Reduce • Use Complex/Nested Types
  • 62. Mistake – DAG Management: Shuffles •  Map Side reduction, where possible •  Think about partitioning/bucketing ahead of time •  Do as much as possible with a single shuffle •  Only send what you have to send •  Avoid Skew and Cartesians
  • 63. ReduceByKey over GroupByKey •  ReduceByKey can do almost anything that GroupByKey can do •  Aggregations •  Windowing •  Use memory •  But you have more control •  ReduceByKey has a fixed limit of Memory requirements •  GroupByKey is unbound and dependent on data
  • 64. TreeReduce over Reduce •  TreeReduce & Reduce return some result to driver •  TreeReduce does more work on the executors •  While Reduce bring everything back to the driver Partition Partition Partition Partition Driver 100% Partition Partition Partition Partition Driver 4 25% 25% 25% 25%
  • 65. Complex Types • Top N List • Multiple types of Aggregations • Windowing operations • All in one pass
  • 66. Complex Types •  Think outside of the box use objects to reduce by •  (Make something simple)
  • 68. Ever seen this? Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; at org.apache.spark.util.collection.OpenHashSet.org $apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261) at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165) at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102) at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210) at…....
  • 69. But! • I already included protobuf in my app’s maven dependencies?
  • 70. Ah! • My protobuf version doesn’t match with Spark’s protobuf version!
  • 72. Future of shading • Spark 2.0 has some libraries shaded • Gauva is fully shaded
  • 74. 5 Mistakes • Size up your executors right • 2 GB limit on Spark shuffle blocks • Evil thing about skew and cartesians • Learn to manage your DAG, yo! • Do shady stuff, don’t let classpath leaks mess you up
  • 75. THANK YOU. tiny.cloudera.com/spark-mistakes Mark Grover | @mark_grover Ted Malaska | @TedMalaska