1. Hadoop Puzzlers
    Aaron Myers & Daniel Templeton
    Cloudera, Inc.
2. Your Hosts
    Aaron “ATM” Myers
    • AKA “Cash Money”
    • Software Engineer
    • Apache Hadoop Committer
    Daniel Templeton
    • Certification Developer
    • Crusty, old HPC guy
    • Likes Perl
    ©2014 Cloudera, Inc. All rights reserved.
3. What is a Hadoop Puzzler
    • Shameless knockoff of Josh Bloch’s Java Puzzlers talks
    • We’ll walk through a puzzle
    • You vote on the answer
    • We all learn a valuable lesson
5. An Easy One

    public class MaxMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable();

        protected void map(LongWritable key, Text val, Context c) … {
            String[] parts = val.toString().split(",");
            k.set(parts[0]);
            v.set(Integer.parseInt(parts[1]));
            c.write(k, v);
        }
    }

    public class MaxReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context c) … {
            IntWritable max = new IntWritable(0);
            for (IntWritable v : values)
                if (v.get() > max.get())
                    max = v;
            c.write(key, max);
        }
    }
6. An Easy One
    The data:
        A,1
        A,5
        A,3
    The results:
        a) A 5
        b) A 1
        c) A 3
        d) The job fails
9. An Easy One (Answer)
    The data:
        A,1
        A,5
        A,3
    The results:
        a) A 5
        b) A 1
        c) A 3
        d) The job fails
10. An Easy One (Problem)
    Same MaxMap and MaxReduce code as slide 5.
11. An Easy One (Moral)
    • MapReduce reuses Writables whenever it can
    • That includes while iterating through the values
    • Always be careful to store only the value, not the Writable!
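The reuse described in this moral can be reproduced without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: `IntBox` and `reusing()` are hypothetical stand-ins for `IntWritable` and the reducer's values iterable, assumed here to hand back one shared object on every step. It contrasts keeping a reference (as MaxReduce does) with copying the primitive out.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java stand-in for Writable reuse; nothing here is a Hadoop type. */
final class WritableReuseDemo {

    /** Hypothetical mutable box playing the role of IntWritable. */
    static final class IntBox {
        private int value;
        int get() { return value; }
        void set(int value) { this.value = value; }
    }

    /** Iterates over the ints but hands back the SAME IntBox every time. */
    static Iterable<IntBox> reusing(final List<Integer> data) {
        final IntBox shared = new IntBox();
        return () -> new Iterator<IntBox>() {
            private final Iterator<Integer> it = data.iterator();
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public IntBox next() { shared.set(it.next()); return shared; }
        };
    }

    /** Mirrors MaxReduce: keeps a reference to the reused object. */
    static int maxKeepingReference(Iterable<IntBox> values) {
        IntBox max = new IntBox();           // starts at 0
        for (IntBox v : values) {
            if (v.get() > max.get()) {
                max = v;                     // now aliases the shared instance
            }
        }
        return max.get();                    // whatever the shared box holds last
    }

    /** Safer variant: copies the primitive value out of the box. */
    static int maxKeepingValue(Iterable<IntBox> values) {
        int max = 0;
        for (IntBox v : values) {
            if (v.get() > max) {
                max = v.get();
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 5, 3);
        System.out.println(maxKeepingReference(reusing(data)));
        System.out.println(maxKeepingValue(reusing(data)));
    }
}
```

With the values arriving in the order 1, 5, 3, the reference-keeping version ends up reporting the last value seen rather than the maximum, because `max` and `v` point at the same object after the first assignment.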
12. A Sinking Feeling

    public class AsyncSubmit extends Configured implements Tool {
        public static void main(String[] args) throws Exception {
            int ret = ToolRunner.run(
                    new Configuration(), new AsyncSubmit(), args);
            System.exit(ret);
        }

        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf());
            job.setNumReduceTasks(0);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(false);
            return job.isComplete() ? 1 : 0;
        }
    }
13. A Sinking Feeling
    The data:
        The complete works of William Shakespeare
    The results:
        a) Fails to compile
        b) The job fails
        c) Exits with 0
        d) Exits with 1
16. A Sinking Feeling (Answer)
    The data:
        The complete works of William Shakespeare
    The results:
        a) Fails to compile
        b) The job fails
        c) Exits with 0
        d) Exits with 1
17. A Sinking Feeling (Problem)
    Same AsyncSubmit code as slide 12.
18. A Sinking Feeling (Moral)
    • Read the API docs!
    • Sometimes the obvious meanings of methods and parameters aren’t correct
    • The parameter for waitForCompletion() controls whether status output is printed
    • The driver does wait for the job to exit but does not print all the job status information
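The exit-code logic from this puzzle can be sketched in plain Java. `JobLike` below is a hypothetical stand-in for the two `Job` methods the slide calls, not a Hadoop type; it assumes, as the API docs describe, that `waitForCompletion(verbose)` blocks and returns whether the job succeeded, and that `isComplete()` is true once the job has finished either way.

```java
/** Plain-Java sketch of the exit-code logic; JobLike is a hypothetical stand-in. */
final class ExitCodeSketch {

    /** Mimics the two Job methods used on the slide (the real ones also declare checked exceptions). */
    interface JobLike {
        boolean waitForCompletion(boolean verbose); // blocks; true only if the job succeeded
        boolean isComplete();                       // true once the job has finished, pass or fail
    }

    /** A fake job that has already finished with the given outcome. */
    static JobLike finishedJob(final boolean succeeded) {
        return new JobLike() {
            @Override public boolean waitForCompletion(boolean verbose) { return succeeded; }
            @Override public boolean isComplete() { return true; }
        };
    }

    /** Same logic as AsyncSubmit.run(): returns 1 whenever the job has finished. */
    static int exitCodeFromSlide(JobLike job) {
        job.waitForCompletion(false);   // false only silences progress output
        return job.isComplete() ? 1 : 0;
    }

    /** Conventional mapping: 0 on success, 1 on failure. */
    static int conventionalExitCode(JobLike job) {
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(exitCodeFromSlide(finishedJob(true)));
        System.out.println(conventionalExitCode(finishedJob(true)));
        System.out.println(conventionalExitCode(finishedJob(false)));
    }
}
```

In a real driver the conventional form would be `return job.waitForCompletion(true) ? 0 : 1;`, which uses the method's return value instead of a separate `isComplete()` check.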
19. Do-over

    public class MaxMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable();

        protected void map(LongWritable key, Text val, Context c) … {
            String[] parts = val.toString().split(",");
            k.set(parts[0]);
            v.set(Integer.parseInt(parts[1]));
            c.write(k, v);
        }
    }

    public class MaxReduceRedux extends
            Reducer<Text, Text, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context c) … {
            int max = 0;
            for (IntWritable v : values)
                if (v.get() > max)
                    max = v.get();
            c.write(key, new IntWritable(max));
        }
    }
20. Do-over
    The data:
        A,1
        A,5
    The results:
        a) A 5
        b) A 1
        c) A 1
           A 5
        d) The job fails
23. Do-over (Answer)
    The data:
        A,1
        A,5
    The results:
        a) A 5
        b) A 1
        c) A 1
           A 5
        d) The job fails
24. Do-over (Problem)
    Same MaxMap and MaxReduceRedux code as slide 19.
25. Do-over (Moral)
    • Mismatched signatures can lead to unexpected behaviors because of exposed base class method implementations
    • ALWAYS use @Override!
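The overload-versus-override trap from this moral can be shown in plain Java. In this sketch `BaseReducer` is a hypothetical stand-in for `Reducer`, assumed to ship a pass-through `reduce()` that the "framework" method `run()` calls; none of the names come from Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of a mismatched signature silently falling back to the base method. */
final class OverrideDemo {

    /** Hypothetical stand-in for Reducer: the default reduce() passes every value through. */
    static class BaseReducer<K, V> {
        protected void reduce(K key, Iterable<V> values, List<String> out) {
            for (V v : values) {
                out.add(key + " " + v);
            }
        }

        /** The "framework" always calls the base-class signature. */
        final List<String> run(K key, Iterable<V> values) {
            List<String> out = new ArrayList<>();
            reduce(key, values, out);
            return out;
        }
    }

    /** Declares V as String but reduces Iterable<Integer>: an overload that run() never calls. */
    static class MismatchedMax extends BaseReducer<String, String> {
        // Adding @Override here would turn the mistake into a compile error.
        protected void reduce(String key, Iterable<Integer> values, List<String> out) {
            int max = 0;
            for (int v : values) {
                if (v > max) max = v;
            }
            out.add(key + " " + max);
        }
    }

    /** Type parameters match the method, so @Override compiles and run() dispatches here. */
    static class OverridingMax extends BaseReducer<String, Integer> {
        @Override
        protected void reduce(String key, Iterable<Integer> values, List<String> out) {
            int max = 0;
            for (int v : values) {
                if (v > max) max = v;
            }
            out.add(key + " " + max);
        }
    }

    public static void main(String[] args) {
        System.out.println(new MismatchedMax().run("A", Arrays.asList("1", "5")));
        System.out.println(new OverridingMax().run("A", Arrays.asList(1, 5)));
    }
}
```

`MismatchedMax` compiles cleanly yet behaves like the pass-through base class, which is why the annotation is worth the extra line.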
26. Joining Forces

    hive> DESCRIBE table1;
    OK
    id      int
    phone   string
    state   string
    Time taken: 0.236 seconds
    hive> DESCRIBE table2;
    OK
    id      int
    city    string
    state   string
    Time taken: 0.116 seconds
    hive> CREATE TABLE table3 AS
          SELECT table2.*, table1.phone, table1.state AS s
          FROM table1 JOIN table2 ON (table1.id == table2.id);
    …
    hive> EXPORT TABLE table3 TO '/user/cloudera/table3.csv';
    …
    hive> exit
    $ hadoop fs -cat table3.csv | head -1 | tr , '\n' | wc -l
27. Joining Forces
    The data:
        hive> SELECT * FROM table1;
        OK
        1   6506506500  CA
        2   2282282280  MS
        Time taken: 1.006 seconds
        hive> SELECT * FROM table2;
        OK
        1   Palo Alto   CA
        2   Gautier     MS
        Time taken: 1.202 seconds
    The results:
        a) 5
        b) 4
        c) 1
        d) The join fails
30. Joining Forces (Answer)
    The data:
        hive> SELECT * FROM table1;
        OK
        1   6506506500  CA
        2   2282282280  MS
        Time taken: 1.006 seconds
        hive> SELECT * FROM table2;
        OK
        1   Palo Alto   CA
        2   Gautier     MS
        Time taken: 1.202 seconds
    The results:
        a) 5
        b) 4
        c) 1
        d) The join fails
31. Joining Forces (Problem)
    Same Hive session and shell command as slide 26.
32. Joining Forces (Moral)
    • Hive’s default delimiter is 0x01 (CTRL-A)
    • Easy to assume export will use a sane delimiter; it doesn’t
    • Incidentally, Hive’s join rules are pretty sane and work as you’d expect
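The effect of the delimiter on the field count can be illustrated in plain Java. This sketch assumes a row of table3 is stored as five fields joined by 0x01, as the moral states; `sampleRow()` is a hypothetical reconstruction built from the data on the slides, not output captured from Hive.

```java
/** Plain-Java sketch: counting fields with the wrong and the right delimiter. */
final class HiveDelimiterDemo {

    /** Hive's default field delimiter according to the slide: 0x01 (CTRL-A). */
    static final char CTRL_A = '\u0001';

    /** Number of fields in the line when split on the given delimiter. */
    static int fieldCount(String line, char delimiter) {
        int fields = 1;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delimiter) {
                fields++;
            }
        }
        return fields;
    }

    /** Hypothetical first row of table3: table2.* followed by phone and s. */
    static String sampleRow() {
        return String.join(String.valueOf(CTRL_A),
                "1", "Palo Alto", "CA", "6506506500", "CA");
    }

    public static void main(String[] args) {
        System.out.println(fieldCount(sampleRow(), ','));     // splitting on commas
        System.out.println(fieldCount(sampleRow(), CTRL_A));  // splitting on CTRL-A
    }
}
```

Splitting on commas sees the whole row as a single field, which is the same mistake the `tr , '\n' | wc -l` pipeline makes.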
33. Close Enough

    public class MaxMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable();

        protected void map(LongWritable key, Text val, Context c) … {
            String[] parts = val.toString().split(",");
            k.set(parts[0]);
            v.set(Integer.parseInt(parts[1]));
            c.write(k, v);
        }
    }

    public class Top20Reduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context c) … {
            float max = 0.0f;
            for (IntWritable v : values)
                if (v.get() > max)
                    max = v.get();
            max *= 0.8f;
            for (IntWritable v : values)
                if (v.get() >= max)
                    c.write(key, v);
        }
    }
34. Close Enough
    The data:
        A,1
        A,5
        A,4
    The results:
        a) (no output)
        b) A 5
        c) A 5
           A 4
        d) The job fails
37. Close Enough (Answer)
    The data:
        A,1
        A,5
        A,4
    The results:
        a) (no output)
        b) A 5
        c) A 5
           A 4
        d) The job fails
38. Close Enough (Problem)
    Same MaxMap and Top20Reduce code as slide 33.
39. Close Enough (Moral)
    • For scalability reasons, the values iterable is single-shot
    • Subsequent iterators iterate over an empty collection
    • Store values (not Writables!) in the first pass
    • Better yet, restructure the logic to avoid storing all values in memory
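The single-shot behavior and the "store values in the first pass" advice can be sketched in plain Java. `singleShot()` is a hypothetical stand-in for the reducer's values iterable, assumed here to return the same already-advancing iterator on every call; the two methods mirror Top20Reduce and a buffered alternative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java sketch of a single-shot iterable; nothing here is a Hadoop type. */
final class SingleShotDemo {

    /** Hypothetical stand-in: every iterator() call returns the same shared iterator. */
    static Iterable<Integer> singleShot(List<Integer> data) {
        final Iterator<Integer> shared = data.iterator();
        return () -> shared;
    }

    /** Mirrors Top20Reduce: the second loop finds the iterator already exhausted. */
    static List<Integer> twoPasses(Iterable<Integer> values) {
        float max = 0.0f;
        for (int v : values) {
            if (v > max) max = v;
        }
        max *= 0.8f;
        List<Integer> out = new ArrayList<>();
        for (int v : values) {           // never executes its body
            if (v >= max) out.add(v);
        }
        return out;
    }

    /** Copies the primitive values during the only pass, then filters the copy. */
    static List<Integer> bufferedPass(Iterable<Integer> values) {
        List<Integer> seen = new ArrayList<>();
        int max = 0;
        for (int v : values) {
            seen.add(v);
            if (v > max) max = v;
        }
        float threshold = max * 0.8f;
        List<Integer> out = new ArrayList<>();
        for (int v : seen) {
            if (v >= threshold) out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 5, 4);
        System.out.println(twoPasses(singleShot(data)));
        System.out.println(bufferedPass(singleShot(data)));
    }
}
```

Buffering keeps every value for a key in memory, so it only suits keys with few values; the last bullet's advice to restructure the logic still applies for large groups.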
40. Overbyte

    public class MinLineMap extends
            Mapper<LongWritable, Text, Text, Text> {
        Text k = new Text();

        protected void map(LongWritable key, Text value, Context c) … {
            String val = value.toString();
            k.set(val.substring(0, 1));
            c.write(k, value);
        }
    }

    public class MinLineReduce extends
            Reducer<Text, Text, Text, IntWritable> {
        protected void reduce(Text key, Iterable<Text> values,
                Context c) … {
            int min = Integer.MAX_VALUE;
            for (Text v : values)
                if (v.getBytes().length < min)
                    min = v.getBytes().length;
            c.write(key, new IntWritable(min));
        }
    }
41. Overbyte
    The data:
        Hadoop
        Spark
        Hive
        Sqoop2
    The results:
        a) H 4
           S 5
        b) H 6
           S 5
        c) H 6
           S 6
        d) The job fails
44. Overbyte (Answer)
    The data:
        Hadoop
        Spark
        Hive
        Sqoop2
    The results:
        a) H 4
           S 5
        b) H 6
           S 5
        c) H 6
           S 6
        d) The job fails
45. Overbyte (Problem)
    Same MinLineMap and MinLineReduce code as slide 40.
46. Overbyte (Moral)
    • Writables get reused in loops
    • In addition, Text.getBytes() reuses the byte array allocated by previous calls
    • Net result is wrongness
    • Text.getLength() is the correct way to get the length of a Text
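The gap between the backing array and the valid length can be shown in plain Java. `ReusableText` is a hypothetical stand-in for a reused `Text`; it assumes the backing array grows to exactly the size needed and never shrinks, which is a simplification and may not match the real class's growth policy.

```java
import java.nio.charset.StandardCharsets;

/** Plain-Java sketch of a reused buffer whose array outlives shorter contents. */
final class BackingArrayDemo {

    /** Hypothetical stand-in for a reused Text object. */
    static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length;

        void set(String s) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            if (bytes.length < utf8.length) {
                bytes = new byte[utf8.length];   // grow only; never shrink
            }
            System.arraycopy(utf8, 0, bytes, 0, utf8.length);
            length = utf8.length;
        }

        byte[] getBytes() { return bytes; }      // raw array; only the first getLength() bytes are valid
        int getLength() { return length; }
    }

    /** Mirrors MinLineReduce: measures the backing array. */
    static int minByArrayLength(String... lines) {
        ReusableText text = new ReusableText();
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            text.set(line);
            min = Math.min(min, text.getBytes().length);
        }
        return min;
    }

    /** Measures the valid length instead. */
    static int minByGetLength(String... lines) {
        ReusableText text = new ReusableText();
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            text.set(line);
            min = Math.min(min, text.getLength());
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(minByArrayLength("Hadoop", "Hive"));
        System.out.println(minByGetLength("Hadoop", "Hive"));
    }
}
```

Once "Hadoop" has sized the array to six bytes, the shorter "Hive" still reports six through the array, while the length accessor reports four.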
47. What We Learned
    • Beware of reuse of Writables
    • Always use @Override so your compiler can help you
    • Don’t assume you know what a method does because of the name or parameters: read the docs!
    • Sometimes scalability is inconvenient
48. One Closing Note
    • Hadoop is still not easy
    • Being good takes effort and experience
    • Recognizing Hadoop talent can be hard
    • Cloudera is working to make Hadoop talent easier to recognize through certification
      http://cloudera.com/content/cloudera/en/training/certification.html