Big data skew

The document explains the problem, cause and effect of data skew. It also explains different techniques to minimize data skew across big data technologies such as MapReduce, Hive and Pig.

Published in: Data & Analytics

  1. Big Data Skew. Ayan Ray, Big Data Analytics Engineer
  2. © 2016 RS Software (India) Ltd. Index • Definition • Types • Problem in Hadoop • Problem Solving Approaches • MapReduce • Hive • Pig
  3. Definition: Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
  4. Types: Negative/Left • The left tail is longer; the mass of the distribution is concentrated on the right. • The mean lies to the left of the peak, i.e. the mean of the data is typically less than the median.
  5. Types: Positive/Right • The right tail is longer; the mass of the distribution is concentrated on the left. • The mean lies to the right of the peak, i.e. the mean of the data is typically larger than the median.
  6. Types: No Skew/Normal Distribution • A normal distribution is not skewed; it is perfectly symmetrical. • The mean is exactly at the peak, so mean = median.
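The three cases above can be checked numerically. The following is a minimal sketch, assuming the moment-based (population) definition of skewness; the sample lists are made up for illustration and do not come from the slides.

```python
from statistics import mean, median

def skewness(values):
    """Population (moment) skewness: E[(x - mean)^3] / sigma^3."""
    m = mean(values)
    n = len(values)
    variance = sum((x - m) ** 2 for x in values) / n
    third_moment = sum((x - m) ** 3 for x in values) / n
    return third_moment / variance ** 1.5

# Hypothetical samples, chosen only to show the sign of the result.
right_skewed = [1, 1, 2, 2, 3, 10]          # long right tail
left_skewed = [-x for x in right_skewed]    # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]                 # no skew

for name, data in [("right", right_skewed),
                   ("left", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, "skewness:", skewness(data),
          "mean:", mean(data), "median:", median(data))
```

A positive result goes with mean > median, a negative result with mean < median, and the symmetric sample gives zero.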
  7. Problem in Hadoop • Say we have to process a Twitter feed in which each record has the format <twitter_id, twitter_post>. • Some users are very active on Twitter and some seldom use it. • A heavy user will have a very large number of <id, post> records. • When we process the data with a MapReduce job, the reducer assigned the heavy user's key will take a long time to complete. • This results in a high overall run time and low resource utilization, because the other reducers sit idle while it finishes.
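The effect described above can be simulated without a cluster. This is a rough sketch under stated assumptions: the user ids, post counts and reducer count are hypothetical, and `reducer_for` is a simple modulo stand-in for hash partitioning, not Hadoop code.

```python
from collections import Counter

NUM_REDUCERS = 4  # hypothetical number of reduce tasks

# Hypothetical feed: user 7 is a heavy user, the other 19 post rarely.
posts_per_user = {user_id: 10 for user_id in range(1, 21)}
posts_per_user[7] = 5000

def reducer_for(user_id, num_reducers):
    """Stand-in for hash partitioning: all records of a key go to one reducer."""
    return user_id % num_reducers

# Count how many <id, post> records each reducer receives.
load = Counter()
for user_id, post_count in posts_per_user.items():
    load[reducer_for(user_id, NUM_REDUCERS)] += post_count

slowest = max(load.values())
average = sum(load.values()) / NUM_REDUCERS
print("records per reducer:", dict(load))
print("slowest reducer:", slowest, "records; average:", average)
```

The job finishes only when the most loaded reducer finishes, so the gap between the slowest reducer and the average is a rough measure of wasted capacity.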
  8. Solution in MapReduce: Combiner • Implement a combiner to reduce the amount of data going into the reduce phase. This can significantly reduce the effects of reduce-side skew. • Combiners are effective at handling partitioning skew and expensive input at the reduce side when the skew observed during the reduce phase is mainly due to the volume of data transferred during the shuffle phase. • But we can't reuse the reducer as a combiner in all cases: it is safe only when the reduce calculation is associative and commutative (sum or max, say), and not otherwise (a plain average, say).
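The average case is worth spelling out. The sketch below uses hypothetical values for one key split across two mappers. It shows that averaging the per-mapper averages gives the wrong answer, while a combiner that emits (sum, count) pairs gives the right one. The function names are illustrative, not Hadoop APIs.

```python
def true_average(values):
    return sum(values) / len(values)

# Hypothetical map outputs for a single key, produced by two mappers.
mapper_a = [10, 20, 30, 40]
mapper_b = [100]

# Wrong: reusing "average" as the combiner, then averaging again in the reducer.
avg_of_avgs = true_average([true_average(mapper_a), true_average(mapper_b)])

# Workaround: the combiner emits (sum, count), which merges associatively.
def combine(values):
    return (sum(values), len(values))

def reduce_pairs(pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count

combined_avg = reduce_pairs([combine(mapper_a), combine(mapper_b)])

print("true average:       ", true_average(mapper_a + mapper_b))
print("average of averages:", avg_of_avgs)
print("via (sum, count):   ", combined_avg)
```

With the (sum, count) form, each mapper ships one pair per key instead of every value, which is what cuts the shuffle volume.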
  9. Solution in MapReduce: Partitioner • The partitioning phase sits between the map and reduce phases. • The number of partitions is equal to the number of reducers. • The Partitioner has the following method: int getPartition(K key, V value, int numReduceTasks) • Based on the integer returned by this method, Hadoop selects the reduce task to which a particular key is sent. • We can override this method to write our own custom Partitioner.
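To make the contract of getPartition concrete, here is a small Python emulation of hash partitioning. It follows the formula used by Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks. It assumes keys hash like Java's String.hashCode(), which is a simplification, since Hadoop's Text keys hash their bytes differently. The sample keys are hypothetical.

```python
def java_string_hash(s):
    """Emulates Java's String.hashCode() as a signed 32-bit value (BMP characters only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def get_partition(key, num_reduce_tasks):
    """Mask off the sign bit so the result is never negative, then take the modulo."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

# Hypothetical keys; the same key always maps to the same partition.
for key in ["user_a", "user_b", "user_a", "user_c"]:
    print(key, "->", get_partition(key, 3))
```

Because the partition depends only on the key, every record of a heavy key lands on the same reducer, which is why a custom partitioner is needed to spread such keys.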
  10. Solution in MapReduce: Partitioner (continued) • By default, all values for a particular key go to the same reducer. • If we know that a particular key is likely to be overcrowded with values, we can write a custom partitioner to divide them further across different reducers. • Example: we are trying to find the highest-salaried employee by gender in different age groups (below 20, between 20 and 40, and above 40).
  11. Solution in MapReduce: Partitioner (continued) Input data:

      Id    Name      Age  Gender  Salary
      1201  gopal     45   Male    50,000
      1202  manisha   40   Female  50,000
      1203  khalil    34   Male    30,000
      1204  prasanth  30   Male    30,000
      1205  kiran     20   Male    40,000
      1206  laxmi     25   Female  35,000
      1207  bhavya    20   Female  15,000
      1208  reshma    19   Female  15,000
      1209  kranthi   22   Male    22,000
  12. Solution in MapReduce: Partitioner (continued)
  13. Solution in MapReduce: Partitioner (continued) If we analyse the input data, we find the following number of records in each category. The age range 20<=x<=40 is overcrowded.

      Range      Count
      <20        1
      20<=x<=40  7
      >40        1
  14. Solution in MapReduce: Partitioner (continued) • Hence reducer 1, which handles the overcrowded 20<=x<=40 range, will take much longer than the other 2 reducers. • The other 2 reducers will finish early and sit idle while reducer 1 carries on processing. • We can split its records across different reducers.
  15. Solution in MapReduce: Partitioner (continued)
  16. Solution in MapReduce: Partitioner (continued) Now the load is more uniformly distributed and the skew effect is dampened.

      Range                           Count
      <20                             1
      20<=x<=40 && salary <35000      4
      20<=x<=40 && salary >=35000     3
      >40                             1
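The two partitioning schemes can be compared directly on the input table from slide 11. This is a sketch in plain Python rather than a real Hadoop Partitioner. The function names and the partition numbering are assumptions; the age and salary thresholds are the ones given in the slides.

```python
from collections import Counter

# (id, name, age, gender, salary) rows copied from the input table on slide 11.
EMPLOYEES = [
    (1201, "gopal", 45, "Male", 50000),
    (1202, "manisha", 40, "Female", 50000),
    (1203, "khalil", 34, "Male", 30000),
    (1204, "prasanth", 30, "Male", 30000),
    (1205, "kiran", 20, "Male", 40000),
    (1206, "laxmi", 25, "Female", 35000),
    (1207, "bhavya", 20, "Female", 15000),
    (1208, "reshma", 19, "Female", 15000),
    (1209, "kranthi", 22, "Male", 22000),
]

def age_partition(age, num_reduce_tasks):
    """Three partitions: below 20, 20 to 40, above 40."""
    if age < 20:
        p = 0
    elif age <= 40:
        p = 1
    else:
        p = 2
    return p % num_reduce_tasks

def age_salary_partition(age, salary, num_reduce_tasks):
    """Four partitions: the crowded 20 to 40 group is split on salary."""
    if age < 20:
        p = 0
    elif age <= 40:
        p = 1 if salary < 35000 else 2
    else:
        p = 3
    return p % num_reduce_tasks

by_age = Counter(age_partition(age, 3) for _, _, age, _, _ in EMPLOYEES)
by_age_salary = Counter(age_salary_partition(age, salary, 4)
                        for _, _, age, _, salary in EMPLOYEES)

print("records per reducer, age only:      ", dict(by_age))
print("records per reducer, age and salary:", dict(by_age_salary))
```

One consequence of this design: the 20 to 40 group now spans two reducers, so each of them can only report a partial maximum, and the per-group result needs a final merge of those two outputs.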
  17. Solution in MapReduce: Combiner and Partitioner • A combiner and a custom partitioner can be used together in the same job where possible.
  18. Solution in Hive: Skewed table • A skewed table is a special type of table in which the values that appear very often (heavy skew) are split out into separate files and the rest of the values go to some other file. • Syntax: CREATE TABLE <T> (schema) SKEWED BY (keys) ON ('c1', 'c2') [STORED AS DIRECTORIES]; • Example: CREATE TABLE T (c1 string, c2 string) SKEWED BY (c1) ON ('x1');
  19. Solution in Hive: How does it solve data skew? • Once the skewed values are specified, Hive splits them out into separate files automatically. • It takes this into account during queries so that it can skip (or include) whole files where possible, which improves performance.
  20. Solution in Hive: List bucketing • List bucketing is a special type of skewed table in which we identify the highly skewed keys and maintain one directory per skewed key. The data for the remaining (non-skewed) keys goes into a separate directory.
  21. Solution in Hive: Single key
      CREATE TABLE list_bucketed_table (c1 int, c2 int, c3 int) SKEWED BY (c1) ON (10, 20, 30) STORED AS DIRECTORIES;
      • This creates separate directories for the c1 values 10, 20 and 30, and one more directory for all other values.
      SELECT c1, c2, c3 FROM list_bucketed_table WHERE c1 = 10;
      • The Hive compiler will only use the directory corresponding to c1=10 for the map-reduce job.
      SELECT c1, c2, c3 FROM list_bucketed_table WHERE c1 = 40;
      • The Hive compiler will only use the directory corresponding to c1=others for the map-reduce job.
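The pruning behaviour can be illustrated with a toy model. The sketch below is not Hive code: it models directories as Python lists, uses hypothetical row counts, and the names `directory_for`, `load_table` and `select_where_c1` are invented for illustration. Only the skewed values 10, 20 and 30 come from the slide.

```python
from collections import defaultdict

SKEWED_VALUES = {10, 20, 30}   # from: skewed by (c1) on (10, 20, 30)
OTHERS = "others"              # hypothetical label for the default directory

def directory_for(c1):
    """One directory per skewed value, one shared directory for everything else."""
    return c1 if c1 in SKEWED_VALUES else OTHERS

def load_table(rows):
    directories = defaultdict(list)
    for row in rows:
        directories[directory_for(row[0])].append(row)
    return directories

def select_where_c1(directories, value):
    """Scan only the directory that can contain c1 == value."""
    scanned = directories.get(directory_for(value), [])
    matches = [row for row in scanned if row[0] == value]
    return matches, len(scanned)

# Hypothetical data: 95% of the rows carry one of the three skewed values.
rows = ([(10, i, i) for i in range(500)] + [(20, i, i) for i in range(300)] +
        [(30, i, i) for i in range(150)] + [(40, i, i) for i in range(30)] +
        [(50, i, i) for i in range(20)])
table = load_table(rows)

hot_matches, hot_scanned = select_where_c1(table, 10)
cold_matches, cold_scanned = select_where_c1(table, 40)
print("c1=10: scanned", hot_scanned, "of", len(rows), "rows")
print("c1=40: scanned", cold_scanned, "of", len(rows), "rows")
```

In this model a query on a skewed value reads just its own directory, and a query on any other value reads only the small shared directory.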
  22. Solution in Hive: Multiple keys
      CREATE TABLE list_bucketed_table (c1 string, c2 int, c3 int) SKEWED BY (c1, c2) ON (('a', 10), ('b', 20)) STORED AS DIRECTORIES;
      • The metastore will have a mapping like ('a', 10) -> 1, ('b', 20) -> 2, others -> 3.
      SELECT c1, c2, c3 FROM list_bucketed_table WHERE c1 = 'a' AND c2 = 10;
      • The Hive query will use the files from directory 1, which holds ('a', 10).
  24. Solution in Hive: Advantages. The approach works well when: • Each partition's skewed keys account for a significant percentage of the total data. In the single-key example, if the skewed keys 10, 20 and 30 occupy a significant portion of the data, a query of the form c1=40 only needs to scan the small remaining portion. • The number of skewed keys per partition is small. The list is stored in the metastore, so it does not make sense to store a very large number of keys per partition.
  25. Solution in Hive: Disadvantages • The approach is not scalable when the number of skewed keys is very large, because the list of keys strains the metastore. • It is also not well suited when the table is skewed on more than one key but the query does not specify all of them. • It will not give the desired result when the skewed keys occupy a very small percentage of the total data.
  27. Solution in Pig: Skewed join • Skew join works by first sampling one input for the join. • In that sample it identifies keys with so many records that they will not all fit into memory, and it splits the records for those keys across multiple reducers. • For all records except those identified in the sample, it does a standard join, collecting records with the same key onto the same reducer. • The second input is the one that is sampled and has its keys with a large number of values split across reducers. The first input has the records for those keys replicated to each of those reducers.
  28. Solution in Pig: Example
      Employee = load 'employee' as (name:chararray, city:chararray);
      Citydetails = load 'citydetails' as (city:chararray, population:int);
      Joinop = join Citydetails by city, Employee by city using 'skewed';
      Suppose the distribution is as follows: 20 users live in Bangalore, 10000 users live in Kolkata, and 300 users live in Chennai.
  29. Solution in Pig • Let us assume Pig determines that 7500 records fit into memory. • Without skew join, the reducer that receives all 10000 Kolkata records may fail with an out-of-memory error. • With skew join, Pig splits the users with Kolkata as key across two reducers.
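The split-and-replicate idea can be sketched with the numbers from this example. The code below is a simplified model, not Pig's implementation: the round-robin assignment, the function names and the population figures are assumptions, while the user counts and the 7500-record limit come from the slides.

```python
import math
from collections import defaultdict

USERS_PER_CITY = {"Bangalore": 20, "Kolkata": 10000, "Chennai": 300}
RECORDS_THAT_FIT = 7500  # the slide's assumed in-memory capacity

def reducers_needed(record_count, capacity):
    return max(1, math.ceil(record_count / capacity))

plan = {city: reducers_needed(n, RECORDS_THAT_FIT)
        for city, n in USERS_PER_CITY.items()}

def skewed_join(city_rows, user_rows, plan):
    """Split user records of hot keys round-robin; replicate the matching city row."""
    buckets = defaultdict(lambda: {"cities": [], "users": []})
    seen = defaultdict(int)
    for name, city in user_rows:            # second input: split across reducers
        slot = seen[city] % plan.get(city, 1)
        seen[city] += 1
        buckets[(city, slot)]["users"].append((name, city))
    for city, population in city_rows:      # first input: replicated per reducer
        for slot in range(plan.get(city, 1)):
            buckets[(city, slot)]["cities"].append((city, population))
    joined = [(city, population, name)
              for bucket in buckets.values()
              for city, population in bucket["cities"]
              for name, _ in bucket["users"]]
    return joined, buckets

# Hypothetical population figures, used only as join payload.
city_rows = [("Bangalore", 1), ("Kolkata", 2), ("Chennai", 3)]
user_rows = [(f"user{i}", city)
             for city, n in USERS_PER_CITY.items() for i in range(n)]

joined, buckets = skewed_join(city_rows, user_rows, plan)
print("reducers per key:", plan)
print("largest reducer input:", max(len(b["users"]) for b in buckets.values()))
print("joined rows:", len(joined))
```

Every user still meets its city row exactly once, so the join output is unchanged; only the placement of the Kolkata records differs.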
  30. Solution in Pig: Memory usage • Pig looks at the record sizes in the sample and assumes it can use 30% (the default) of the JVM's heap to materialize the records that will be joined. • If the join still fails with out-of-memory errors even with skew join, tell Pig to use a smaller fraction of the heap.
  31. Solution in Pig: Memory usage • The memory fraction can be configured manually with the property pig.skewedjoin.reduce.memusage=0.25 • It can also be passed on the command line: -Dpig.skewedjoin.reduce.memusage=0.25 • This will use 25% of the heap instead of 30%.
  32. Thanking You! For further assistance and explanation on anything related to Big Data, feel free to mail me at ayanray089@gmail.com
