Big data skew

Ayan Ray
Big Data Analytics Engineer

© 2016 RS Software (India) Ltd.
Index
• Definition
• Types
• Problem in Hadoop
• Problem Solving Approaches
• MapReduce
• Hive
• Pig

Definition
Skewness is the measure of asymmetry of the
probability distribution of a real-valued random variable
about its mean.

Types
Negative/Left
• The left tail is longer; the mass of the distribution is
concentrated on the right.
• The mean lies to the left of the peak, i.e. the mean of
the data is less than the median.

Types
Positive/Right
• The right tail is longer; the mass of the distribution is
concentrated on the left.
• The mean lies to the right of the peak, i.e. the mean of
the data is larger than the median.

Types
No Skew / Normal Distribution
• A normal distribution is not skewed; it is perfectly
symmetrical.
• The mean is exactly at the peak: mean = median.
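
The three cases can be checked numerically. The following is a minimal sketch in plain Python that computes the moment coefficient of skewness (the mean of the cubed standardized deviations) and compares mean and median; the sample lists are made up for illustration only.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Moment coefficient of skewness: mean of ((x - mean) / stdev)^3."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

# Illustrative samples (not from any real data set).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # long right tail
left_skewed = [-x for x in right_skewed]   # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed),
                   ("left", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, "skewness:", skewness(data),
          "mean:", mean(data), "median:", median(data))
```

A positive coefficient goes with mean > median, a negative one with mean < median, and a symmetric sample gives a coefficient of zero.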

Problem in Hadoop
• Say we have to process Twitter feeds of users, where
each record is in the format <twitter_id, twitter_post>.
• Now, say some of the users are very active on Twitter
and some seldom use it.
• A heavy user will have a very large number of <id,
post> records.
• When we process the data through a MapReduce job,
the reducer assigned the heavy user's key will take a
long time to complete.
• The job finishes only when its slowest reducer does,
so this results in a high overall run time and low
resource utilization.
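
The effect can be illustrated with a small simulation. This is a sketch in plain Python, not Hadoop code: the feed is invented, user ids are integers, and the routing rule (id modulo number of reducers) is a stand-in for whatever partitioning the real job uses.

```python
def reducer_loads(records, num_reducers):
    """Count the records each reducer receives when all records
    for one key are routed to the same reducer."""
    loads = [0] * num_reducers
    for user_id, _post in records:
        loads[user_id % num_reducers] += 1  # stand-in routing rule
    return loads

# Hypothetical feed: user 7 is a heavy user, users 1-6 post rarely.
records = [(7, f"post {i}") for i in range(1000)]
records += [(uid, "hello") for uid in range(1, 7) for _ in range(5)]

loads = reducer_loads(records, 3)
print("records per reducer:", loads)
print("busiest / average:", max(loads) / (sum(loads) / len(loads)))
```

Because the job ends only when the busiest reducer ends, the ratio of the busiest load to the average load is a rough measure of how much time the skew costs.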

Solution in MapReduce
Combiner
• Implement a combiner to reduce the amount of data
going into the reduce phase. This can significantly
reduce the effects of reduce-side skew.
• Combiners are effective at handling Partitioning
Skew and Expensive Input at the reduce side when
the skew observed during the reduce phase is mainly
due to the volume of data transferred during the
shuffle phase.
• But we cannot use a combiner in all cases: the reduce
function can be reused as a combiner only when the
calculation is associative and commutative (e.g. sum
or max), which is not true of an average, for example.
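
An average can still benefit from a combiner if the combiner emits partial (sum, count) pairs instead of partial averages, since sums and counts can be merged in any order. The sketch below simulates this in plain Python; the function names and the sample map outputs are hypothetical and do not use the Hadoop API.

```python
from collections import defaultdict

def combine(map_output):
    """Map-side combiner: collapse (key, value) pairs into one
    (key, (partial_sum, partial_count)) pair per key."""
    partial = defaultdict(lambda: [0, 0])
    for key, value in map_output:
        partial[key][0] += value
        partial[key][1] += 1
    return [(key, (s, c)) for key, (s, c) in partial.items()]

def reduce_average(combined_outputs):
    """Reducer: merge partial (sum, count) pairs, then divide once."""
    totals = defaultdict(lambda: [0, 0])
    for key, (s, c) in combined_outputs:
        totals[key][0] += s
        totals[key][1] += c
    return {key: s / c for key, (s, c) in totals.items()}

# Two hypothetical map tasks emitting (user, post_length) pairs.
map_task_1 = [("heavy", 10), ("heavy", 20), ("heavy", 30), ("light", 4)]
map_task_2 = [("heavy", 40), ("light", 6)]

shuffled = combine(map_task_1) + combine(map_task_2)
averages = reduce_average(shuffled)
print("pairs shuffled:", len(shuffled), "of", len(map_task_1) + len(map_task_2))
print("averages:", averages)
```

Averaging the two per-task averages for the heavy key would give a different (wrong) answer, which is why the plain average function cannot serve as its own combiner.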

Partitioner
• Partitioning happens between the map and reduce
phases.
• The number of partitions is equal to the number of
reduce tasks.
• The Partitioner has the following method:
int getPartition(K key, V value, int numReduceTasks)
• The integer returned by this method is the partition
number, which determines the reduce task that will
process a particular key.
• We can override this method to write our own
custom Partitioner.
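
For reference, Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch below mirrors that rule in plain Python for string keys; java_string_hash is a helper written here to imitate Java's String.hashCode() and is not part of any library.

```python
def java_string_hash(s):
    """Imitates Java's String.hashCode() as a signed 32-bit integer
    (valid for characters in the Basic Multilingual Plane)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def hash_partition(key, num_reduce_tasks):
    """Same rule as the default hash partitioner: clear the sign bit,
    then take the remainder by the number of reduce tasks."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

for key in ["alice", "bob", "carol"]:
    print(key, "->", hash_partition(key, 3))
```

The same key always maps to the same partition, which is exactly why one very frequent key overloads one reducer under the default scheme.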

Partitioner-Continued
• By default, all values for a particular key go to the
same reducer.
• If we know that a particular key is likely to carry a
disproportionately large number of values, we can
write a custom partitioner that divides it further
among different reducers.
Let us take an example:
• We are trying to find the highest salaried employee
by gender in different age groups (below 20,
between 20 and 40, and above 40).

Input Data:
Id    Name      Age  Gender  Salary
1201  gopal     45   Male    50,000
1202  manisha   40   Female  50,000
1203  khalil    34   Male    30,000
1204  prasanth  30   Male    30,000
1205  kiran     20   Male    40,000
1206  laxmi     25   Female  35,000
1207  bhavya    20   Female  15,000
1208  reshma    19   Female  15,000
1209  kranthi   22   Male    22,000

If we analyse the data, we find the following number of
records in each age range:
Range       Count
x<20        1
20<=x<=40   7
x>40        1
The age range 20<=x<=40 is overcrowded.

• Hence, the reducer handling the 20<=x<=40 range
(reducer 1) will take much longer than the other 2
reducers.
• The job cannot finish until reducer 1 does, so the
other 2 reducers sit idle while it carries on processing.
• We can split the overcrowded range across different
reducers, for example by salary.

Range                        Count
x<20                         1
20<=x<=40 && salary<35000    4
20<=x<=40 && salary>=35000   3
x>40                         1
Now the load is more uniformly distributed and the skew
effect is dampened. Each of the two reducers for the
20<=x<=40 range sees only part of that group, so their
results must be merged afterwards to get the final answer
for the group.
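
The two partitioning schemes can be compared on the input data with a short simulation. This is a sketch in plain Python rather than a real Hadoop Partitioner; the two functions play the role of getPartition, and the partition numbers they return are an arbitrary choice made for illustration.

```python
# The nine input records: (id, name, age, gender, salary).
EMPLOYEES = [
    (1201, "gopal", 45, "Male", 50000),
    (1202, "manisha", 40, "Female", 50000),
    (1203, "khalil", 34, "Male", 30000),
    (1204, "prasanth", 30, "Male", 30000),
    (1205, "kiran", 20, "Male", 40000),
    (1206, "laxmi", 25, "Female", 35000),
    (1207, "bhavya", 20, "Female", 15000),
    (1208, "reshma", 19, "Female", 15000),
    (1209, "kranthi", 22, "Male", 22000),
]

def age_partition(age, salary):
    """Three partitions by age range only (salary is ignored)."""
    if age < 20:
        return 0
    return 1 if age <= 40 else 2

def age_salary_partition(age, salary):
    """Four partitions: the crowded 20-40 range is split by salary."""
    if age < 20:
        return 0
    if age <= 40:
        return 1 if salary < 35000 else 2
    return 3

def partition_counts(partition_of, num_partitions):
    """Number of records each partition (reducer) would receive."""
    counts = [0] * num_partitions
    for _id, _name, age, _gender, salary in EMPLOYEES:
        counts[partition_of(age, salary)] += 1
    return counts

print("by age:           ", partition_counts(age_partition, 3))
print("by age and salary:", partition_counts(age_salary_partition, 4))
```

Printing both lists shows how the records of the middle age range are divided between two reducers once salary is taken into account.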

Combiner and Partitioner
A combiner and a custom partitioner can be used together
in the same job where applicable.

Solution in Hive
Skewed table
• A skewed table is a special type of table where the
values that appear very often (heavy skew) are split
out into separate files and the rest of the values go
to some other file.
• Syntax:
create table <T> (schema) skewed by (keys) on ('c1', 'c2')
[stored as directories];
• Example:
create table T (c1 string, c2 string) skewed by (c1) on
('x1');

Solution in Hive
How does it solve Data skew?
• Once the skewed values are specified, Hive splits
those out into separate files automatically.
• It takes this fact into account during queries, so that
it can skip (or include) whole files where possible,
thus improving performance.

Solution in Hive
List Bucketing
• List bucketing is a special type of skewed table where
we identify the keys which are highly skewed and
maintain one directory per skewed key. The data
corresponding to the remaining (non-skewed) keys
goes into a separate directory.

Solution in Hive
Single key
create table list_bucketed_table (c1 int, c2 int, c3 int)
skewed by (c1) on (10, 20, 30) stored as directories;
• This will create separate directories for the c1 values
10, 20 and 30, and one more directory for all other
values.
select c1, c2, c3 from list_bucketed_table where c1 = 10;
• The Hive compiler will use only the directory
corresponding to c1=10 for the map-reduce job.
select c1, c2, c3 from list_bucketed_table where c1 = 40;
• The Hive compiler will use only the directory
corresponding to c1=others for the map-reduce job.
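
The directory pruning idea can be mimicked in a few lines of plain Python. This is only a sketch of the concept: the directory names and the sample rows are invented for illustration and do not reflect how Hive actually names or stores its directories.

```python
SKEWED_VALUES = {10, 20, 30}  # the values listed in the "skewed by ... on" clause

def directory_for(c1):
    """Illustrative directory name for a row, chosen by its c1 value."""
    return f"c1={c1}" if c1 in SKEWED_VALUES else "others"

def load(rows):
    """Group rows into one list per directory."""
    dirs = {}
    for row in rows:
        dirs.setdefault(directory_for(row[0]), []).append(row)
    return dirs

def select_where_c1(dirs, value):
    """Scan only the single directory that can contain the value.
    Returns (matching_rows, number_of_rows_scanned)."""
    scanned = dirs.get(directory_for(value), [])
    return [row for row in scanned if row[0] == value], len(scanned)

# Hypothetical table of 100 rows (c1, c2, c3), dominated by 10, 20 and 30.
rows = ([(10, 0, 0)] * 50 + [(20, 0, 0)] * 30 + [(30, 0, 0)] * 15
        + [(40, 0, 0)] * 3 + [(50, 0, 0)] * 2)
dirs = load(rows)

for value in (10, 40):
    matches, scanned = select_where_c1(dirs, value)
    print(f"c1={value}: {len(matches)} rows returned, {scanned} of {len(rows)} scanned")
```

A lookup on a skewed value touches only its own directory, and a lookup on any other value touches only the small shared directory, so neither query scans the whole table.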

Solution in Hive
Multiple key
create table list_bucketed_table (c1 string, c2 int, c3 int)
skewed by (c1, c2) on (('a', 10), ('b', 20)) stored as
directories;
The metastore will have a mapping like ('a', 10) -> 1,
('b', 20) -> 2, others -> 3.
select c1, c2, c3 from list_bucketed_table where c1 = 'a' and
c2 = 10;
The Hive query will use the files from the directory for
('a', 10) -> 1.
Solution in Hive
Advantages (the approach works well when):
• Each partition's skewed keys account for a
significant percentage of the total data. In the above
scenario, if the skewed keys 10, 20 and 30 occupy a
significant portion of the data, then a query of the
form c1=40 needs to scan only the remaining, smaller
portion of the data.
• The number of skewed keys per partition is small.
Since this list is stored in the metastore, it does not
make sense to store a very large number of keys per
partition.

Solution in Hive
Disadvantages:
• The approach is not scalable when the number of
skewed keys is very large. This creates a problem for
metastore capacity.
• It is also not well suited when the table is skewed on
more than one key column but the query does not
specify all of them.
• It will not give the desired result when the skewed
keys occupy only a small percentage of the total data.

Solution in Pig
Skewed Join
• Skew join works by first sampling one input of the
join.
• From the sample it identifies the keys that have so
many records that they will not fit into memory on
one reducer, and splits those keys across multiple
reducers.
• For all records except those identified in the sample,
it does a standard join, collecting records with the
same key onto the same reducer.
• The second input is the one that is sampled and has
its keys with a large number of values split across
reducers. The first input will have the records for
those keys replicated to each of those reducers.

Solution in Pig
For example,
Employee = load 'employee' as (name:chararray,
city:chararray);
Citydetails = load 'citydetails' as (city:chararray,
population:int);
Joinop = join Citydetails by city, Employee by city using
'skewed';
Suppose the distribution is as follows:
20 employees live in Bangalore,
10000 employees live in Kolkata,
300 employees live in Chennai.

Solution in Pig
• Let us assume that Pig determined that 7500 records
can fit into memory on one reducer.
• Without a skew join, all 10000 Kolkata records go to
a single reducer, which becomes a bottleneck and
may run out of memory.
• With a skew join, Pig splits the records with Kolkata
as key across two reducers.
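
The splitting and replication can be sketched in plain Python as follows. This is a simplified illustration, not Pig's implementation: it counts every record instead of sampling, it assigns records of an oversized key round-robin to numbered buckets, and the population figures are placeholders.

```python
import math
from collections import Counter, defaultdict

def skewed_join(employees, cities, max_in_memory):
    """Join employees (name, city) with cities (city, population).
    Keys with more than max_in_memory employee records are split
    across several buckets (reducers); the matching city row is
    replicated to each of those buckets.
    Returns (joined_rows, bucket_sizes)."""
    counts = Counter(city for _name, city in employees)
    splits = {c: math.ceil(n / max_in_memory) for c, n in counts.items()}

    # Split side: spread each key's records over its buckets.
    buckets = defaultdict(list)
    seen = Counter()
    for name, city in employees:
        buckets[(city, seen[city] % splits[city])].append(name)
        seen[city] += 1

    # Replicated side: copy the city row to every bucket of that key.
    joined = []
    for city, population in cities:
        for i in range(splits.get(city, 0)):
            for name in buckets[(city, i)]:
                joined.append((name, city, population))

    bucket_sizes = {b: len(names) for b, names in buckets.items()}
    return joined, bucket_sizes

# The distribution from the example; population values are placeholders.
employees = ([(f"b{i}", "Bangalore") for i in range(20)]
             + [(f"k{i}", "Kolkata") for i in range(10000)]
             + [(f"c{i}", "Chennai") for i in range(300)])
cities = [("Bangalore", 1), ("Kolkata", 2), ("Chennai", 3)]

joined, bucket_sizes = skewed_join(employees, cities, max_in_memory=7500)
print("bucket sizes:", bucket_sizes)
print("joined rows:", len(joined))
```

With a limit of 7500 records, only the Kolkata key is split, and no bucket holds more records than the limit while the join result stays complete.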

Solution in Pig
Memory Usage
• Pig looks at the record sizes in the sample and
assumes it can use 30% (default) of the JVM's heap to
materialize the records that will be joined.
• If the join still fails with out-of-memory errors even
when using skew join, this fraction is too high and
you should tell Pig to use less.

Solution in Pig
Memory Usage
The memory fraction can be configured manually using the
following property:
pig.skewedjoin.reduce.memusage=0.25
It can also be passed on the command line:
-Dpig.skewedjoin.reduce.memusage=0.25
This will use 25% instead of 30%.

Thank You!
For further assistance or explanation on anything
related to Big Data, feel free to mail me at
ayanray089@gmail.com
