Big DATA SKEW
Ayan Ray
Big Data Analytics Engineer
© 2016 RS Software (India) Ltd. 2
Index
• Definition
• Types
• Problem in Hadoop
• Problem Solving Approaches
• Mapreduce
• Hive
• Pig
Definition
Skewness is the measure of asymmetry of the
probability distribution of a real-valued random variable
about its mean.
Types
Negative/Left
• The left tail is longer; the mass of the distribution is
concentrated on the right.
• The mean lies to the left of the peak, i.e. the mean of
the data is less than the median.
Positive/Right
• The right tail is longer; the mass of the distribution is
concentrated on the left.
• The mean lies to the right of the peak, i.e. the mean of
the data is larger than the median.
No Skew / Normal Distribution
• A normal distribution is not skewed; it is perfectly
symmetrical.
• The mean is exactly at the peak, so mean = median.
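The relationship between mean and median described above can be checked on a small sample. The sketch below uses only Python's standard library. The sample values and the helper name sample_skewness are made up for illustration, and the formula is the plain moment-based coefficient (third central moment divided by the cubed standard deviation), which is one of several common definitions.

```python
from statistics import mean, median

def sample_skewness(values):
    """Moment-based skewness: third central moment / (population std dev cubed)."""
    n = len(values)
    m = mean(values)
    second = sum((v - m) ** 2 for v in values) / n
    third = sum((v - m) ** 3 for v in values) / n
    return third / second ** 1.5

# Hypothetical posts-per-user counts: most users post little, one posts a lot.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]
symmetric = [1, 2, 3, 4, 5]

for name, data in (("right-skewed", right_skewed), ("symmetric", symmetric)):
    print(name, "mean:", mean(data), "median:", median(data),
          "skewness:", round(sample_skewness(data), 3))
```

For the right-skewed sample the single large value pulls the mean above the median and the coefficient is positive; for the symmetric sample the mean and median coincide and the coefficient is zero.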
Problem in Hadoop
• Say we have to process Twitter feed data for a set of
users, where each record is in the
format <twitter_id, twitter_post>.
• Some of the users are very active on Twitter
and some seldom use it.
• A heavy user will have a very large number of <id,
post> records.
• When we process the data through a
MapReduce job, the reducer assigned to a heavy user
will take a long time to complete.
• This results in a long overall job time and low resource
utilization, because the other reducers finish early and sit idle.
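The effect can be illustrated with a small simulation. This is only a sketch: the user names, post counts and the number of reducers are invented, and partition_for is a deterministic stand-in for hash partitioning rather than Hadoop's actual hash function.

```python
from collections import Counter

NUM_REDUCERS = 4

def partition_for(key, num_reducers):
    """Stand-in for hash partitioning: the same key always maps to the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers

# Hypothetical <twitter_id, twitter_post> record counts per user.
posts_per_user = {"heavy_user": 100_000, "user_a": 40, "user_b": 25,
                  "user_c": 60, "user_d": 10, "user_e": 35}

# Every record of a user lands on that user's reducer.
load = Counter()
for user, count in posts_per_user.items():
    load[partition_for(user, NUM_REDUCERS)] += count

busiest = max(load.values())
total = sum(load.values())
print("records per reducer:", dict(load))
print(f"busiest reducer holds {busiest / total:.1%} of all records")
```

Whichever reducer receives heavy_user ends up with almost all of the records, so the job's running time is dictated by that one reducer.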
Solution in Mapreduce
Combiner
• Implement a combiner to reduce the amount of data
going into the reduce phase. This can significantly
reduce the effects of reduce-side skew.
• Combiners are effective at handling Partitioning
Skew and Expensive Input at the reduce side when
the skew observed during the reduce phase is mainly due
to the volume of data transferred during the shuffle
phase.
• But we can't run a combiner in all cases. The reducer
calculation must be associative and commutative, so an
operation that is not (say, average) cannot be reused
directly as the combiner.
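An average can still benefit from a combiner if the combiner emits (sum, count) pairs, since adding sums and counts is associative and commutative. The following is a minimal sketch of that common workaround in plain Python, not Hadoop code. The function names and the sample records are hypothetical.

```python
from collections import defaultdict

def map_side_combine(records):
    """Combiner: collapse one mapper's (key, value) records into (key, (sum, count))."""
    partial = defaultdict(lambda: [0, 0])
    for key, value in records:
        partial[key][0] += value
        partial[key][1] += 1
    return {key: tuple(sc) for key, sc in partial.items()}

def reduce_average(partials):
    """Reducer: merge the (sum, count) pairs from all mappers, then divide once."""
    totals = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            totals[key][0] += s
            totals[key][1] += c
    return {key: s / c for key, (s, c) in totals.items()}

# Two hypothetical mapper outputs: (user, post_length) records.
mapper_1 = [("heavy", 10), ("heavy", 20), ("heavy", 30), ("light", 8)]
mapper_2 = [("heavy", 100), ("light", 12)]

averages = reduce_average([map_side_combine(mapper_1), map_side_combine(mapper_2)])
print(averages)

# Averaging the per-mapper averages instead gives a different (wrong) value for "heavy".
naive = (sum([10, 20, 30]) / 3 + 100 / 1) / 2
print(naive, "vs", averages["heavy"])
```

The reducer now receives one pair per key per mapper instead of every raw record, which is where the reduction in shuffle volume comes from.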
Partitioner
• This phase sits between the map and reduce phases.
• The number of partitions is equal to the number of
reduce tasks.
• Partitioner has an inherent method as follows:
int getPartition(K key, V value, int numReduceTasks)
• Based on the integer value returned from the above
function, Hadoop selects the reduce task that
receives a particular key.
• We can override the above method to write our
own custom Partitioner.
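As a point of reference, Hadoop's default HashPartitioner masks off the sign bit of the key's hash code and takes the remainder modulo the number of reduce tasks. The sketch below mirrors that arithmetic in Python for string keys. The helper java_string_hash is written here for illustration and matches Java's String.hashCode() only for characters in the Basic Multilingual Plane.

```python
def java_string_hash(text):
    """32-bit signed hash computed the way Java's String.hashCode() does (BMP characters)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h >= 0x80000000 else h

def get_partition(key, value, num_reduce_tasks):
    """Mask off the sign bit, then take the remainder to pick a reduce task."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

for user in ("alice", "bob", "alice"):
    print(user, "->", get_partition(user, None, 3))
```

The value argument is ignored, so every record with the same key is routed to the same reduce task, which is exactly why a single hot key overloads one reducer.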
• By default, all values for a particular key go to the same
reducer.
• If we know that the values for a particular key are
likely to be overcrowded, we can write a custom
partitioner to divide them further across
different reducers.
Let us take an example:
• We are trying to find the highest-salaried employee
by gender in different age groups (below 20,
between 20 and 40, and above 40).
Input Data:
Id    Name      Age  Gender  Salary
1201  gopal     45   Male    50,000
1202  manisha   40   Female  50,000
1203  khalil    34   Male    30,000
1204  prasanth  30   Male    30,000
1205  kiran     20   Male    40,000
1206  laxmi     25   Female  35,000
1207  bhavya    20   Female  15,000
1208  reshma    19   Female  15,000
1209  kranthi   22   Male    22,000
If we analyse the data, we find the following number of
records in each age range. We can observe that the age
range 20<=x<=40 is overcrowded.
Range      Count
<20        1
20<=x<=40  7
>40        1
• Hence, the reducer handling the 20<=x<=40 range will take
much longer than the other 2 reducers.
• The job cannot finish until that reducer completes, so the
other 2 reducers sit idle while it carries on its processing.
• We can split the overcrowded range across different reducers.
Now the load is more uniformly distributed and the
skew effect is dampened. Because one age range is now
spread over two reducers, their per-gender maxima must be
compared in a final step.
Range                        Count
<20                          1
20<=x<=40 && salary <35000   4
20<=x<=40 && salary >=35000  3
>40                          1
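The split can be sketched as a custom partition function applied to the nine input rows. This is a plain Python illustration, not a Hadoop Partitioner class. The partition numbering, the helper names and the final merge step are assumptions made for the example, and the 35000 threshold is the one used in the table above.

```python
from collections import Counter

# (id, name, age, gender, salary) rows from the input table.
EMPLOYEES = [
    (1201, "gopal", 45, "Male", 50000),
    (1202, "manisha", 40, "Female", 50000),
    (1203, "khalil", 34, "Male", 30000),
    (1204, "prasanth", 30, "Male", 30000),
    (1205, "kiran", 20, "Male", 40000),
    (1206, "laxmi", 25, "Female", 35000),
    (1207, "bhavya", 20, "Female", 15000),
    (1208, "reshma", 19, "Female", 15000),
    (1209, "kranthi", 22, "Male", 22000),
]

def get_partition(age, salary):
    """Custom partitioner: the crowded 20-40 range is split in two by salary."""
    if age < 20:
        return 0
    if age <= 40:
        return 1 if salary < 35000 else 2
    return 3

def age_group(age):
    return "<20" if age < 20 else "20-40" if age <= 40 else ">40"

load = Counter()
partial_max = {}  # (partition, gender, age group) -> best row seen by that reducer
for row in EMPLOYEES:
    _, _, age, gender, salary = row
    part = get_partition(age, salary)
    load[part] += 1
    key = (part, gender, age_group(age))
    if key not in partial_max or salary > partial_max[key][4]:
        partial_max[key] = row

# Final merge: partitions 1 and 2 both hold 20-40 rows, so keep the larger maximum.
final_max = {}
for (_, gender, group), row in partial_max.items():
    if (gender, group) not in final_max or row[4] > final_max[(gender, group)][4]:
        final_max[(gender, group)] = row

print(sorted(load.items()))
print(final_max[("Male", "20-40")][1], final_max[("Female", "20-40")][1])
```

Splitting on salary works for a maximum because the overall maximum is simply the larger of the two partial maxima; an aggregate without that property would need a different split or merge.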
Combiner and Partitioner
Both combiner and partitioner can be combined and
used in the same job where possible.
Solution in Hive
Skewed table
• A skewed table is a special type of table where the
values that appear very often (heavy skew) are split
out into separate files and the rest of the values go to
some other file.
• Syntax:
create table <T> (schema) skewed by (keys) on ('c1', 'c2')
[STORED as DIRECTORIES];
• Example:
create table T (c1 string, c2 string) skewed by (c1) on
('x1');
How does it solve Data skew?
• When the skewed values are specified, Hive splits those
out into separate files automatically.
• It takes this fact into account during queries so that it
can skip (or include) whole files where possible, which
improves performance.
List Bucketing
• List bucketing is a special type of skewed table where
we identify the keys which are highly skewed and
maintain one directory per skewed key. The data
corresponding to the remaining (non-skewed) keys goes
into a separate directory.
Single key
create table list_bucketed_table (c1 int, c2 int, c3 int)
skewed by (c1) on (10,20,30) stored as directories;
• This will create separate directories for the c1 values
10, 20 and 30, and one more directory for all other
values.
select c1, c2, c3 from list_bucketed_table where c1=10;
• The Hive compiler will only use the directory
corresponding to c1=10 for the map-reduce job.
select c1, c2, c3 from list_bucketed_table where c1=40;
• The Hive compiler will only use the directory
corresponding to c1=others for the map-reduce job.
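The pruning idea can be modelled in a few lines. This is a conceptual sketch only: the directory labels ("c1=10", "others") and both function names are hypothetical and do not reproduce Hive's real on-disk naming or its compiler logic.

```python
SKEWED_VALUES = (10, 20, 30)   # values named in the SKEWED BY ... ON clause
OTHERS = "others"              # hypothetical label for the catch-all directory

def directory_for(c1):
    """Directory a row is written to: one per skewed value, one for the rest."""
    return f"c1={c1}" if c1 in SKEWED_VALUES else OTHERS

def directories_to_scan(c1_filter):
    """Directories an equality filter on c1 has to read; all others are skipped."""
    return [directory_for(c1_filter)]

print(directories_to_scan(10))
print(directories_to_scan(40))
```

An equality filter on a skewed value touches only that value's directory, while a filter on any other value touches only the catch-all directory and skips the skewed data entirely.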
Multiple key
create table list_bucketed_table (c1 string, c2 int, c3 int)
skewed by (c1, c2) on (('a', 10), ('b', 20)) stored as
directories;
The metastore will have a mapping like ('a', 10) -> 1, ('b', 20)
-> 2, others -> 3.
select c1, c2, c3 from list_bucketed_table where c1='a' and
c2=10;
The Hive query will use the files from directory ('a', 10) -> 1.
Advantages:
• The approach pays off when each partition's skewed keys
account for a significant percentage of the total data. In
the above scenario, if the skewed keys 10, 20 and 30 occupy
a significant portion of the data, then queries of the
form c1=40 do not need to scan that portion; they read
only the directory holding the remaining data.
• The number of skewed keys per partition should be small.
This list is stored in the metastore, so it does not
make sense to store a very large number of keys per
partition.
Disadvantages:
• The approach is not scalable when the number of
skewed keys is very large. This creates a scalability
problem for the metastore.
• It is also not well suited when the table is skewed on
more than one key but a query does not specify all of
them.
• It will not give the desired result when the skewed keys
occupy a very small percentage of the total data.
Solution in Pig
Skewed Join
• Skew join works by first sampling one input for the
join.
• In that input it identifies any keys with so many records
that they are estimated not to fit into memory, and splits
the records for those keys across multiple reducers.
• For all records except those identified in the sample,
it does a standard join, collecting records with the
same key onto the same reducer.
• The second input is the one that is sampled and has
its keys with a large number of values split across
reducers. The first input has its records with those keys
replicated across reducers.
For example,
Employee = load 'employee' as (name:chararray,
city:chararray);
Citydetails = load 'citydetails' as (city:chararray,
population:int);
Joinop = join Citydetails by city, Employee by city using
'skewed';
Suppose the distribution is as follows:
20 employees live in Bangalore,
10000 employees live in Kolkata,
300 employees live in Chennai.
• Let us assume that Pig determined that 7500 records
could fit into memory.
• Without a skewed join, all 10000 Kolkata records go to a
single reducer, which becomes a straggler and may run
out of memory.
• With a skewed join, Pig splits the records with
Kolkata as key across two reducers.
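The arithmetic behind the split can be sketched as follows. This is an illustration of the idea rather than Pig's actual algorithm: the function name and the simple ceiling rule are assumptions, and the 7500-record capacity and the city counts are taken from the example above.

```python
import math

RECORDS_THAT_FIT = 7500  # per-reducer capacity assumed in the example

# Record counts per city from the example distribution.
city_counts = {"Bangalore": 20, "Kolkata": 10000, "Chennai": 300}

def reducers_needed(count, capacity=RECORDS_THAT_FIT):
    """Number of reducers a key must be spread over so each share fits in memory."""
    return max(1, math.ceil(count / capacity))

plan = {city: reducers_needed(n) for city, n in city_counts.items()}
print(plan)

# Rows from the other input with a split key must be copied to every reducer
# that holds a share of that key, otherwise some matches would be missed.
replicated = [city for city, k in plan.items() if k > 1]
print("replicate matching rows of the other input for:", replicated)
```

Only Kolkata exceeds the capacity, so only its records are spread over two reducers; the other cities are joined on a single reducer each, as in a standard join.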
Memory Usage
• Pig looks at the record sizes in the sample and
assumes it can use 30% (default) of the JVM's heap to
materialize records that will be joined.
• If the join still fails with out-of-memory errors even
with skew join, this fraction should be decreased, i.e.
you should tell Pig to use less.
Memory allocation can be configured manually using the
following property:
pig.skewedjoin.reduce.memusage=0.25
It can also be passed on the command line:
-D pig.skewedjoin.reduce.memusage=0.25
This will use 25% instead of 30%.
Thanking You!
For further assistance and explanation in anything
related to Big Data feel free to mail me at
ayanray089@gmail.com

More Related Content

What's hot

Guide to alfresco monitoring
Guide to alfresco monitoringGuide to alfresco monitoring
Guide to alfresco monitoringMiguel Rodriguez
 
性能測定道 事始め編
性能測定道 事始め編性能測定道 事始め編
性能測定道 事始め編Yuto Hayamizu
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
分散処理基盤Apache Hadoop入門とHadoopエコシステムの最新技術動向 (オープンソースカンファレンス 2015 Tokyo/Spring 講...
分散処理基盤Apache Hadoop入門とHadoopエコシステムの最新技術動向 (オープンソースカンファレンス 2015 Tokyo/Spring 講...分散処理基盤Apache Hadoop入門とHadoopエコシステムの最新技術動向 (オープンソースカンファレンス 2015 Tokyo/Spring 講...
分散処理基盤Apache Hadoop入門とHadoopエコシステムの最新技術動向 (オープンソースカンファレンス 2015 Tokyo/Spring 講...NTT DATA OSS Professional Services
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
新しいTERASOLUNA Batch Frameworkとは
新しいTERASOLUNA Batch Frameworkとは新しいTERASOLUNA Batch Frameworkとは
新しいTERASOLUNA Batch Frameworkとはapkiban
 
vSRX on Your Laptop : PCで始めるvSRX ~JUNOSをさわってみよう!~
vSRX on Your Laptop : PCで始めるvSRX ~JUNOSをさわってみよう!~vSRX on Your Laptop : PCで始めるvSRX ~JUNOSをさわってみよう!~
vSRX on Your Laptop : PCで始めるvSRX ~JUNOSをさわってみよう!~Juniper Networks (日本)
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connectconfluent
 
俺のサイジング
俺のサイジング俺のサイジング
俺のサイジングToru Makabe
 
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送Google Cloud Platform - Japan
 
Bulk Loading Data into Cassandra
Bulk Loading Data into CassandraBulk Loading Data into Cassandra
Bulk Loading Data into CassandraDataStax
 
SRv6 Mobile User Plane : Initial POC and Implementation
SRv6 Mobile User Plane : Initial POC and ImplementationSRv6 Mobile User Plane : Initial POC and Implementation
SRv6 Mobile User Plane : Initial POC and ImplementationKentaro Ebisawa
 
Knative Eventing 入門(Kubernetes Novice Tokyo #11 発表資料)
Knative Eventing 入門(Kubernetes Novice Tokyo #11 発表資料)Knative Eventing 入門(Kubernetes Novice Tokyo #11 発表資料)
Knative Eventing 入門(Kubernetes Novice Tokyo #11 発表資料)NTT DATA Technology & Innovation
 
フレームワークも使っていないWebアプリをLaravel+PWAでモバイルアプリっぽくしてみちゃう
フレームワークも使っていないWebアプリをLaravel+PWAでモバイルアプリっぽくしてみちゃうフレームワークも使っていないWebアプリをLaravel+PWAでモバイルアプリっぽくしてみちゃう
フレームワークも使っていないWebアプリをLaravel+PWAでモバイルアプリっぽくしてみちゃう株式会社オプト 仙台ラボラトリ
 
忙しい人のための Rocky Linux 入門〜Rocky LinuxはCentOSの後継者たり得るか?〜
忙しい人のための Rocky Linux 入門〜Rocky LinuxはCentOSの後継者たり得るか?〜忙しい人のための Rocky Linux 入門〜Rocky LinuxはCentOSの後継者たり得るか?〜
忙しい人のための Rocky Linux 入門〜Rocky LinuxはCentOSの後継者たり得るか?〜Masahito Zembutsu
 
大規模データ活用向けストレージレイヤソフトのこれまでとこれから(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
大規模データ活用向けストレージレイヤソフトのこれまでとこれから(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)大規模データ活用向けストレージレイヤソフトのこれまでとこれから(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
大規模データ活用向けストレージレイヤソフトのこれまでとこれから(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)NTT DATA Technology & Innovation
 
SDN界隈の用語・考え方をざっくりまとめます。
SDN界隈の用語・考え方をざっくりまとめます。SDN界隈の用語・考え方をざっくりまとめます。
SDN界隈の用語・考え方をざっくりまとめます。Etsuji Nakai
 

What's hot (20)

Guide to alfresco monitoring
Guide to alfresco monitoringGuide to alfresco monitoring
Guide to alfresco monitoring
 
性能測定道 事始め編
性能測定道 事始め編性能測定道 事始め編
性能測定道 事始め編
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
分散処理基盤Apache Hadoop入門とHadoopエコシステムの最新技術動向 (オープンソースカンファレンス 2015 Tokyo/Spring 講...
分散処理基盤Apache Hadoop入門とHadoopエコシステムの最新技術動向 (オープンソースカンファレンス 2015 Tokyo/Spring 講...分散処理基盤Apache Hadoop入門とHadoopエコシステムの最新技術動向 (オープンソースカンファレンス 2015 Tokyo/Spring 講...
分散処理基盤Apache Hadoop入門とHadoopエコシステムの最新技術動向 (オープンソースカンファレンス 2015 Tokyo/Spring 講...
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
新しいTERASOLUNA Batch Frameworkとは
新しいTERASOLUNA Batch Frameworkとは新しいTERASOLUNA Batch Frameworkとは
新しいTERASOLUNA Batch Frameworkとは
 
vSRX on Your Laptop : PCで始めるvSRX ~JUNOSをさわってみよう!~
vSRX on Your Laptop : PCで始めるvSRX ~JUNOSをさわってみよう!~vSRX on Your Laptop : PCで始めるvSRX ~JUNOSをさわってみよう!~
vSRX on Your Laptop : PCで始めるvSRX ~JUNOSをさわってみよう!~
 
MapReduce入門
MapReduce入門MapReduce入門
MapReduce入門
 
Exadata X8M-2 KVM仮想化ベストプラクティス
Exadata X8M-2 KVM仮想化ベストプラクティスExadata X8M-2 KVM仮想化ベストプラクティス
Exadata X8M-2 KVM仮想化ベストプラクティス
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connect
 
俺のサイジング
俺のサイジング俺のサイジング
俺のサイジング
 
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
 
Bulk Loading Data into Cassandra
Bulk Loading Data into CassandraBulk Loading Data into Cassandra
Bulk Loading Data into Cassandra
 
SRv6 Mobile User Plane : Initial POC and Implementation
SRv6 Mobile User Plane : Initial POC and ImplementationSRv6 Mobile User Plane : Initial POC and Implementation
SRv6 Mobile User Plane : Initial POC and Implementation
 
Knative Eventing 入門(Kubernetes Novice Tokyo #11 発表資料)
Knative Eventing 入門(Kubernetes Novice Tokyo #11 発表資料)Knative Eventing 入門(Kubernetes Novice Tokyo #11 発表資料)
Knative Eventing 入門(Kubernetes Novice Tokyo #11 発表資料)
 
フレームワークも使っていないWebアプリをLaravel+PWAでモバイルアプリっぽくしてみちゃう
フレームワークも使っていないWebアプリをLaravel+PWAでモバイルアプリっぽくしてみちゃうフレームワークも使っていないWebアプリをLaravel+PWAでモバイルアプリっぽくしてみちゃう
フレームワークも使っていないWebアプリをLaravel+PWAでモバイルアプリっぽくしてみちゃう
 
忙しい人のための Rocky Linux 入門〜Rocky LinuxはCentOSの後継者たり得るか?〜
忙しい人のための Rocky Linux 入門〜Rocky LinuxはCentOSの後継者たり得るか?〜忙しい人のための Rocky Linux 入門〜Rocky LinuxはCentOSの後継者たり得るか?〜
忙しい人のための Rocky Linux 入門〜Rocky LinuxはCentOSの後継者たり得るか?〜
 
大規模データ活用向けストレージレイヤソフトのこれまでとこれから(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
大規模データ活用向けストレージレイヤソフトのこれまでとこれから(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)大規模データ活用向けストレージレイヤソフトのこれまでとこれから(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
大規模データ活用向けストレージレイヤソフトのこれまでとこれから(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
 
SDN界隈の用語・考え方をざっくりまとめます。
SDN界隈の用語・考え方をざっくりまとめます。SDN界隈の用語・考え方をざっくりまとめます。
SDN界隈の用語・考え方をざっくりまとめます。
 

Similar to Big data skew

Production model lifecycle management 2016 09
Production model lifecycle management 2016 09Production model lifecycle management 2016 09
Production model lifecycle management 2016 09Greg Makowski
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analyticsAvinash Pandu
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Shuffle sort 101
Shuffle sort 101Shuffle sort 101
Shuffle sort 101Jeff Bean
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduceAkhilesh Joshi
 
MapReduce
MapReduceMapReduce
MapReduceKavyaGo
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
[Cassandra summit Tokyo, 2015] Cassandra 2015 最新情報 by ジョナサン・エリス(Jonathan Ellis)
[Cassandra summit Tokyo, 2015] Cassandra 2015 最新情報 by ジョナサン・エリス(Jonathan Ellis)[Cassandra summit Tokyo, 2015] Cassandra 2015 最新情報 by ジョナサン・エリス(Jonathan Ellis)
[Cassandra summit Tokyo, 2015] Cassandra 2015 最新情報 by ジョナサン・エリス(Jonathan Ellis)datastaxjp
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfTSANKARARAO
 
AWS Summit 2013 | Auckland - Big Data Analytics
AWS Summit 2013 | Auckland - Big Data AnalyticsAWS Summit 2013 | Auckland - Big Data Analytics
AWS Summit 2013 | Auckland - Big Data AnalyticsAmazon Web Services
 
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon
 

Similar to Big data skew (20)

Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
 
Hadoop
HadoopHadoop
Hadoop
 
Production model lifecycle management 2016 09
Production model lifecycle management 2016 09Production model lifecycle management 2016 09
Production model lifecycle management 2016 09
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Shuffle sort 101
Shuffle sort 101Shuffle sort 101
Shuffle sort 101
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduce
 
MapReduce
MapReduceMapReduce
MapReduce
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
[Cassandra summit Tokyo, 2015] Cassandra 2015 最新情報 by ジョナサン・エリス(Jonathan Ellis)
[Cassandra summit Tokyo, 2015] Cassandra 2015 最新情報 by ジョナサン・エリス(Jonathan Ellis)[Cassandra summit Tokyo, 2015] Cassandra 2015 最新情報 by ジョナサン・エリス(Jonathan Ellis)
[Cassandra summit Tokyo, 2015] Cassandra 2015 最新情報 by ジョナサン・エリス(Jonathan Ellis)
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
AWS Summit 2013 | Auckland - Big Data Analytics
AWS Summit 2013 | Auckland - Big Data AnalyticsAWS Summit 2013 | Auckland - Big Data Analytics
AWS Summit 2013 | Auckland - Big Data Analytics
 
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environment
 

Recently uploaded

MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证nhjeo1gg
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excelysmaelreyes
 
SWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptxSWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptxviniciusperissetr
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 

Recently uploaded (20)

MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree

Big data skew

  • 1. Big Data Skew. Ayan Ray, Big Data Analytics Engineer
  • 2. © 2016 RS Software (India) Ltd. Index: Definition; Types; Problem in Hadoop; Problem-solving approaches (MapReduce, Hive, Pig)
  • 3. Definition: Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
  • 4. Types: Negative (left) skew
      • The left tail is longer; the mass of the distribution is concentrated on the right.
      • The mean lies to the left of the peak, i.e. the mean of the data is typically less than the median.
  • 5. Types: Positive (right) skew
      • The right tail is longer; the mass of the distribution is concentrated on the left.
      • The mean lies to the right of the peak, i.e. the mean of the data is typically larger than the median.
  • 6. Types: No skew (normal distribution). A normal distribution is not skewed: it is perfectly symmetrical, the mean lies exactly at the peak, and mean = median.
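To make the three cases concrete, here is a minimal Python sketch that computes skewness as the third standardized moment and compares mean and median. The sample lists and the helper name `skewness` are illustrative assumptions, not part of the original slides.

```python
from statistics import mean, median

def skewness(xs):
    """Third standardized moment: m3 / m2**1.5 (population form)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n  # variance
    m3 = sum((x - m) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up samples, one per case described above.
right_skewed = [1, 1, 2, 2, 3, 10]         # long right tail
left_skewed = [-x for x in right_skewed]   # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

for name, xs in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(xs), 3), "mean:", mean(xs), "median:", median(xs))
```

The sign of the result tells the direction of the skew: positive for a long right tail, negative for a long left tail, and zero for a symmetric sample.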
  • 7. Problem in Hadoop
      • Say we have to process a Twitter feed in which each record belongs to a user and has the format <twitter_id, twitter_post>.
      • Some users are very active on Twitter and some seldom use it.
      • A heavy user will have a very large number of <id, post> records.
      • When we process the data through a MapReduce job, the reducer assigned to the heavy user takes a long time to complete.
      • This results in a high overall run time and low resource utilization.
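The effect can be sketched in a few lines of Python. This is a simulation, not Hadoop code: `zlib.crc32` stands in for the key hash (Hadoop's default HashPartitioner uses the key's `hashCode()` modulo the number of reduce tasks), and the user names and post counts are made up for illustration.

```python
import zlib

def partition(key, num_reducers):
    """Hash partitioning: every record with the same key lands on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(post_counts, num_reducers):
    """Total number of records each reducer receives."""
    loads = {r: 0 for r in range(num_reducers)}
    for user, n_posts in post_counts.items():
        loads[partition(user, num_reducers)] += n_posts
    return loads

# Hypothetical feed: 99 light users with 10 posts each, one heavy user with 10,000.
post_counts = {f"user{i}": 10 for i in range(99)}
post_counts["heavy_user"] = 10_000

loads = reducer_loads(post_counts, num_reducers=4)
print(loads)
```

Whichever reducer receives `heavy_user` ends up with at least 10,000 of the 10,990 records, so the job finishes only when that one reducer does.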
  • 8. Solution in MapReduce: Combiner
      • Implement a combiner to reduce the amount of data going into the reduce phase. This can significantly reduce the effects of reduce-side skew.
      • Combiners are effective at handling partitioning skew and expensive input at the reduce side when the skew observed during the reduce phase is mainly due to the volume of data transferred during the shuffle phase.
      • But we cannot run a combiner in all cases, especially when the reducer calculation is not associative and commutative (say, an average).
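The average case can still benefit from a combiner if the combiner emits partial (sum, count) pairs instead of partial averages. The Python sketch below simulates this outside Hadoop; the function names and the sample values are assumptions made for illustration.

```python
def combine(values):
    """Combiner: collapse one mapper's values for a key into a (sum, count) pair."""
    return (sum(values), len(values))

def reduce_average(partials):
    """Reducer: merge the (sum, count) pairs and divide once at the end."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Values for the same key as seen by two different mappers (made-up data).
mapper_a = [10, 20, 30]
mapper_b = [100]

correct = reduce_average([combine(mapper_a), combine(mapper_b)])

# Averaging the per-mapper averages gives a different, wrong answer.
naive = (sum(mapper_a) / len(mapper_a) + sum(mapper_b) / len(mapper_b)) / 2

print(correct, naive)
```

Sums and counts can be merged in any order and grouping, which is what makes the (sum, count) form safe to combine while a plain average is not.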
  • 9. Solution in MapReduce: Partitioner
      • This phase sits between the map and reduce phases.
      • The number of partitions is equal to the number of reducers.
      • The Partitioner has the following method: int getPartition(K key, V value, int numReduceTasks)
      • Based on the integer value returned from this method, Hadoop selects the reduce task that will process a particular key.
      • We can override this method to write our own custom Partitioner.
  • 10. Solution in MapReduce: Partitioner (continued)
      • By default, all values for a particular key go to the same reducer.
      • If we know that a particular key is likely to be overcrowded with values, we can write a custom partitioner to divide it further across different reducers.
      • Example: we are trying to find the highest-salaried employee by gender in different age groups (below 20, between 20 and 40, and above 40).
  • 11. Solution in MapReduce: Partitioner (continued). Input data:
      Id    Name      Age  Gender  Salary
      1201  gopal     45   Male    50,000
      1202  manisha   40   Female  50,000
      1203  khalil    34   Male    30,000
      1204  prasanth  30   Male    30,000
      1205  kiran     20   Male    40,000
      1206  laxmi     25   Female  35,000
      1207  bhavya    20   Female  15,000
      1208  reshma    19   Female  15,000
      1209  kranthi   22   Male    22,000
  • 12. Solution in MapReduce: Partitioner (continued)
  • 13. Solution in MapReduce: Partitioner (continued). Counting the records in each age range shows that the range 20<=x<=40 is overcrowded:
      Range       Count
      <20         1
      20<=x<=40   7
      >40         1
  • 14. Solution in MapReduce: Partitioner (continued)
      • Hence, reducer 1 will take much longer than the other two reducers.
      • The other two reducers finish early and sit idle while reducer 1 is still processing.
      • We can split the overcrowded range across different reducers.
  • 15. Solution in MapReduce: Partitioner (continued)
  • 16. Solution in MapReduce: Partitioner (continued). Now the load is distributed more uniformly and the effect of the skew is dampened:
      Range                         Count
      <20                           1
      20<=x<=40 && salary <35000    4
      20<=x<=40 && salary >=35000   3
      >40                           1
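The two partitioning schemes can be compared with a short Python simulation over the input data from slide 11. The functions below only mirror the shape of getPartition(K key, V value, int numReduceTasks); their names, and the choice to pass age and salary directly, are assumptions for illustration and not Hadoop API.

```python
from collections import Counter

# (id, name, age, gender, salary) rows copied from the input data above.
EMPLOYEES = [
    (1201, "gopal", 45, "Male", 50000),
    (1202, "manisha", 40, "Female", 50000),
    (1203, "khalil", 34, "Male", 30000),
    (1204, "prasanth", 30, "Male", 30000),
    (1205, "kiran", 20, "Male", 40000),
    (1206, "laxmi", 25, "Female", 35000),
    (1207, "bhavya", 20, "Female", 15000),
    (1208, "reshma", 19, "Female", 15000),
    (1209, "kranthi", 22, "Male", 22000),
]

def age_partition(age, salary, num_reduce_tasks):
    """Partition by age range only: <20, 20..40, >40."""
    if age < 20:
        p = 0
    elif age <= 40:
        p = 1
    else:
        p = 2
    return p % num_reduce_tasks

def age_salary_partition(age, salary, num_reduce_tasks):
    """Same ranges, but the crowded 20..40 range is split by salary."""
    if age < 20:
        p = 0
    elif age <= 40:
        p = 1 if salary < 35000 else 2
    else:
        p = 3
    return p % num_reduce_tasks

by_age = Counter(age_partition(a, s, 3) for _, _, a, _, s in EMPLOYEES)
by_age_salary = Counter(age_salary_partition(a, s, 4) for _, _, a, _, s in EMPLOYEES)
print(sorted(by_age.items()))
print(sorted(by_age_salary.items()))
```

Because the 20 to 40 group is now spread over two reducers, each of them produces a partial maximum for that group; taking the larger of the two partial results afterwards gives the final answer, since a maximum can be merged in any order.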
  • 17. Solution in MapReduce: Combiner and Partitioner. A combiner and a custom partitioner can be used together in the same job where possible.
  • 18. Solution in Hive: Skewed table
      • A skewed table is a special type of table in which the values that appear very often (heavy skew) are split out into separate files, and the remaining values go to other files.
      • Syntax: CREATE TABLE <T> (schema) SKEWED BY (keys) ON ('c1', 'c2') [STORED AS DIRECTORIES];
      • Example: CREATE TABLE T (c1 string, c2 string) SKEWED BY (c1) ON ('x1');
  • 19. Solution in Hive: How does it solve data skew?
      • Once the skewed values are specified, Hive splits them out into separate files automatically.
      • It takes this into account during queries, so that it can skip (or include) whole files where possible, which improves performance.
  • 20. Solution in Hive: List bucketing. List bucketing is a special type of skewed table in which we identify the highly skewed keys and maintain one directory per skewed key. The data for the remaining (non-skewed) keys goes into a separate directory.
  • 21. Solution in Hive: Single key
      CREATE TABLE list_bucketed_table (c1 int, c2 int, c3 int) SKEWED BY (c1) ON (10, 20, 30) STORED AS DIRECTORIES;
      • This creates separate directories for the c1 values 10, 20 and 30, and one more directory for all other values.
      SELECT c1, c2, c3 FROM list_bucketed_table WHERE c1=10;
      • The Hive compiler uses only the directory corresponding to c1=10 for the map-reduce job.
      SELECT c1, c2, c3 FROM list_bucketed_table WHERE c1=40;
      • The Hive compiler uses only the directory corresponding to c1=others for the map-reduce job.
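The pruning idea behind the single-key case can be sketched in Python. This is a toy model, not Hive internals: the directory names `c1=10` and `c1=others`, the helper names and the sample rows are all hypothetical and chosen only to mirror the wording of the slide.

```python
SKEWED_VALUES = {10, 20, 30}  # the values listed in SKEWED BY (c1) ON (10, 20, 30)

def directory_for(c1):
    """One directory per skewed value, one shared directory for everything else."""
    return f"c1={c1}" if c1 in SKEWED_VALUES else "c1=others"

def load(rows):
    """Route each (c1, c2, c3) row to its directory at write time."""
    dirs = {}
    for row in rows:
        dirs.setdefault(directory_for(row[0]), []).append(row)
    return dirs

def select_where_c1(dirs, value):
    """Scan only the one directory that can contain c1 == value."""
    scanned = dirs.get(directory_for(value), [])
    matches = [row for row in scanned if row[0] == value]
    return matches, len(scanned)

# Made-up table contents: 10 and 20 are heavy, 40 and 50 are rare.
rows = [(10, 1, 1)] * 5 + [(20, 2, 2)] * 3 + [(40, 4, 4), (50, 5, 5)]
dirs = load(rows)

print(select_where_c1(dirs, 10))
print(select_where_c1(dirs, 40))
```

A query on a skewed value touches only that value's directory, while a query on any other value scans the shared directory and skips the heavy ones. A composite key such as (c1, c2) would work the same way with tuples in place of single values.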
  • 22. Solution in Hive: Multiple keys
      CREATE TABLE list_bucketed_table (c1 string, c2 int, c3 int) SKEWED BY (c1, c2) ON (('a', 10), ('b', 20)) STORED AS DIRECTORIES;
      • The metastore will hold a mapping such as ('a', 10) -> 1, ('b', 20) -> 2, others -> 3.
      SELECT c1, c2, c3 FROM list_bucketed_table WHERE c1='a' AND c2=10;
      • The Hive query uses the files from the directory for ('a', 10) -> 1.
  • 24. Solution in Hive: Advantages. List bucketing works well when:
      • Each partition's skewed keys account for a significant percentage of the total data. In the scenario above, if the skewed keys 10, 20 and 30 occupy a significant portion of the data, a query of the form c1=40 does not need to scan that portion.
      • The number of skewed keys per partition is small. The list is stored in the metastore, so it does not make sense to store a very large number of keys per partition.
  • 25. Solution in Hive: Disadvantages
      • The approach is not scalable when the number of skewed keys is very large, because the list strains the metastore.
      • It is also not well suited when the table is skewed on more than one key but a query does not specify all of them.
      • It does not give the desired result when the skewed keys occupy only a small percentage of the total data.
  • 27. Solution in Pig: Skewed join
      • Skew join works by first sampling one input for the join.
      • In that input it identifies the keys with so many records that they will not all fit into memory, and splits the records for those keys across multiple reducers.
      • For all records except those identified in the sample, it does a standard join, collecting records with the same key onto the same reducer.
      • The second input is the one that is sampled and has its keys with a large number of values split across reducers. The first input has the records for those keys replicated across those reducers.
  • 28. Solution in Pig: For example,
      Employee = load 'employee' as (name:chararray, city:chararray);
      Citydetails = load 'citydetails' as (city:chararray, population:int);
      Joinop = join Citydetails by city, Employee by city using 'skewed';
      Suppose the distribution is as follows: 20 employees live in Bangalore, 10,000 in Kolkata and 300 in Chennai.
  • 29. Solution in Pig
      • Let us assume Pig determined that 7,500 records fit into memory on a reducer.
      • Without a skew join, the reducer that receives all 10,000 Kolkata records can fail with an OutOfMemory error.
      • With a skew join, Pig splits the records with Kolkata as key across two reducers.
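The arithmetic behind the Kolkata split, and the replication of the other input, can be sketched in Python. This is a simplified model under the assumptions of the example (7,500 records per reducer); the function names and the round-robin routing are illustrative and do not reproduce Pig's actual partitioner.

```python
import math
from collections import Counter

RECORDS_IN_MEMORY = 7500  # per-reducer capacity assumed in the example above
employees_per_city = {"Bangalore": 20, "Kolkata": 10_000, "Chennai": 300}

# Number of reducers each key needs so that its slice fits in memory.
splits = {city: math.ceil(n / RECORDS_IN_MEMORY)
          for city, n in employees_per_city.items()}

def route_employee(city, seq):
    """Sampled input: spread a key's records round-robin over its reducers."""
    return (city, seq % splits[city])

def route_city_detail(city):
    """Other input: replicate the record to every reducer holding a slice of the key."""
    return [(city, i) for i in range(splits[city])]

kolkata_load = Counter(route_employee("Kolkata", i)
                       for i in range(employees_per_city["Kolkata"]))

print(splits)
print(kolkata_load)
print(route_city_detail("Kolkata"))
```

Only Kolkata exceeds the capacity, so only its records are split; the single Kolkata row from the city details is copied to both slices so that each reducer can complete its share of the join.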
  • 30. Solution in Pig: Memory usage
      • Pig looks at the record sizes in the sample and assumes it can use 30% (default) of the JVM's heap to materialize the records that will be joined.
      • If the join still fails with out-of-memory errors even with a skew join, tell Pig to use a smaller fraction.
  • 31. Solution in Pig: Memory usage. The fraction can be configured manually with the following property:
      pig.skewedjoin.reduce.memusage=0.25
      It can also be passed on the command line:
      -Dpig.skewedjoin.reduce.memusage=0.25
      This uses 25% of the heap instead of 30%.
  • 32. Thank you! For further assistance or explanation on anything related to Big Data, feel free to mail me at ayanray089@gmail.com