Hands on Hadoop and pig


Published in: Technology

  1. 1. BigData using Hadoop and Pig. Sudar Muthu, Research Engineer, Yahoo Labs
  2. 2. Who am I? Research Engineer at Yahoo Labs. Mines useful information from huge datasets. Worked on both structured and unstructured data. Builds robots as a hobby ;)
  3. 3. What will we see today? What is BigData? Get our hands dirty with Hadoop. See some code. Try out Pig. Glimpse of HBase and Hive.
  4. 4. What is BigData?
  5. 5. “Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools”
  6. 6. How big is BigData?
  7. 7. 1 GB today is not the same as 1 GB just 10 years ago
  8. 8. Anything that doesn’t fit into the RAM of a single machine
  9. 9. Types of Big Data
  10. 10. Data in movement (streams): Twitter/Facebook comments; stock market data; access logs of a busy web server; sensors, e.g. vital signs of a newborn
  11. 11. Data at rest (oceans): a collection of what has streamed; emails or IM messages; social media; unstructured documents such as forms and claims
  12. 12. We have all this data and need to find a way to process it
  13. 13. Traditional way of scaling (scaling up): make the machine more powerful (add more RAM, add more CPU cores). It is going to be very expensive. It will be limited by disk seek and read time. It is a single point of failure.
  14. 14. New way to scale (scaling out): add more instances of the same machine. Cost is lower compared to scaling up. Immune to the failure of a single node or a set of nodes. Disk seek and write time is not going to be a bottleneck. Future safe (to some extent).
  15. 15. Is it fit for ALL types of problems?
  16. 16. Divide and conquer
  17. 17. Hadoop
  18. 18. A scalable, fault-tolerant grid operating system for data storage and processing
  19. 19. What is Hadoop? Runs on commodity hardware. HDFS: fault-tolerant, high-bandwidth clustered storage. MapReduce: distributed data processing. Works with structured and unstructured data. Open source, Apache license. Master (NameNode) – slave architecture.
  20. 20. Design principles: the system shall manage and heal itself. Performance shall scale linearly. The algorithm should move to the data (lower latency, lower bandwidth). Simple core, modular and extensible.
  21. 21. Components of Hadoop: HDFS, MapReduce, Pig, HBase, Hive
  22. 22. Getting started with Hadoop
  23. 23. What am I not going to cover? Installing or setting up Hadoop (all the code will run on a single-node instance). Monitoring of clusters. Performance tuning. User authentication or quotas.
  24. 24. Before we get into code, let’s understand some concepts
  25. 25. Map Reduce
  26. 26. Framework for distributed processing of large datasets
  27. 27. MapReduce consists of two functions. Map: filter and transform the input into a form the reducer can understand. Reduce: aggregate over the output of the Map function.
  28. 28. Formal definition
      Map<k1, v1> -> list(<k2, v2>)
      Reduce<k2, list(v2)> -> list(<k3, v3>)
  29. 29. Let’s see some examples
  30. 30. Count number of words in files
      Map<file_name, file_contents> => list(<word, count>)
      Reduce<word, list(count)> => <word, sum_of_counts>
  31. 31. Count number of words in files
      Map<“file1”, “to be or not to be”> =>
      {<“to”,1>, <“be”,1>, <“or”,1>, <“not”,1>, <“to”,1>, <“be”,1>}
  32. 32. Count number of words in files
      Reduce{<“to”,<1,1>>, <“be”,<1,1>>, <“or”,<1>>, <“not”,<1>>} =>
      {<“to”,2>, <“be”,2>, <“or”,1>, <“not”,1>}
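The word-count walkthrough above can be traced in a few lines of plain Java. This is a minimal in-memory sketch, not the Hadoop API: the class name `MapReduceSketch` and its `map`, `shuffle`, `reduce`, and `countWords` methods are hypothetical names chosen for illustration, and the explicit shuffle step stands in for the grouping that the framework performs between the two phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only; none of these names come from Hadoop.
final class MapReduceSketch {

    // Map<file_name, file_contents> => list(<word, 1>)
    static List<Map.Entry<String, Integer>> map(String fileName, String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : contents.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the emitted values by key, giving <word, list(count)>.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce<word, list(count)> => sum_of_counts
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Chains the three stages for a single input file.
    static Map<String, Integer> countWords(String fileName, String contents) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(map(fileName, contents)).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countWords("file1", "to be or not to be"));
    }
}
```

Running `main` prints `{be=2, not=1, or=1, to=2}`. The keys come out in alphabetical order only because the sketch collects them in a `TreeMap`.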
  33. 33. Max temperature in a year
      Map<file_name, file_contents> => list(<year, temp>)
      Reduce<year, list(temp)> => <year, max_temp>
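The same pattern fits the max-temperature example. The sketch below is plain Java rather than Hadoop code, and it makes assumptions the slides do not state: each record is a line of the form `<year> <temp>`, the sample values are invented, and `MaxTemperatureSketch` and `maxPerYear` are hypothetical names. For brevity it folds the map and reduce steps into one pass by keeping a running maximum per year.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only; the record format is an assumption.
final class MaxTemperatureSketch {

    static Map<Integer, Integer> maxPerYear(List<String> records) {
        Map<Integer, Integer> maxByYear = new TreeMap<>();
        for (String record : records) {
            // "Map" step: turn one record into a <year, temp> pair.
            String[] fields = record.trim().split("\\s+");
            int year = Integer.parseInt(fields[0]);
            int temp = Integer.parseInt(fields[1]);
            // "Reduce" step: keep the largest temperature seen for that year.
            maxByYear.merge(year, temp, Integer::max);
        }
        return maxByYear;
    }

    public static void main(String[] args) {
        // Invented sample data, for illustration only.
        List<String> records = Arrays.asList(
                "1949 111", "1949 78", "1950 0", "1950 22", "1950 -11");
        System.out.println(maxPerYear(records));
    }
}
```

With the invented sample data, `main` prints `{1949=111, 1950=22}`.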
  34. 34. HDFS
  35. 35. HDFS Distributed file system Data is distributed over different nodes Will be replicated for fail over Is abstracted out for the algorithms
  36. 36. HDFS Commands
  37. 37. HDFS Commands
      hadoop fs -mkdir <dir_name>
      hadoop fs -ls <dir_name>
      hadoop fs -rmr <dir_name>
      hadoop fs -put <local_file> <remote_dir>
      hadoop fs -get <remote_file> <local_dir>
      hadoop fs -cat <remote_file>
      hadoop fs -help
  38. 38. Let’s write some code
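  39. 39. Count Words demo: create a mapper class (override the map() method). Create a reducer class (override the reduce() method). Create a main method. Create a JAR. Run it on Hadoop.
  40. 40. Map Method
      // Assumes the enclosing CountWordsMapper extends Mapper<LongWritable, Text, Text, IntWritable>.
      // Emits <word, 1> for every whitespace-separated token in the input line.
      @Override
      public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
              context.write(new Text(itr.nextToken()), new IntWritable(1));
          }
      }
  41. 41. Reduce Method
      // Assumes the enclosing CountWordsReducer extends Reducer<Text, IntWritable, Text, IntWritable>.
      // Sums all the counts emitted for one word and writes <word, total>.
      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
              sum += value.get();
          }
          context.write(key, new IntWritable(sum));
      }
  42. 42. Main Method
      public static void main(String[] args) throws Exception {
          Job job = new Job();  // on Hadoop 2 and later, Job.getInstance() is preferred
          job.setJarByClass(CountWords.class);
          job.setJobName("Count Words");
          // args[0] is the input path; args[1] is the output path, which must not exist yet
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.setMapperClass(CountWordsMapper.class);
          job.setReducerClass(CountWordsReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          // Submit the job and wait for it to finish
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  43. 43. Run it on Hadoop
      hadoop jar dist/countwords.jar com.sudarmuthu.hadoop.countwords.CountWords input/ output/
  44. 44. Output
      at 1
      be 3
      can 7
      cant 1
      code 2
      command 1
      connect 1
      consider 1
      continued 1
      control 4
      could 1
      couple 1
      courtesy 1
      desktop, 1
      detailed 1
      details 1
      …..
      …..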
  45. 45. Pig
  46. 46. What is Pig? Pig provides an abstraction for processing large datasets. It consists of Pig Latin (a language to express data flows) and an execution environment.
  47. 47. Why do we need Pig? MapReduce can get complex if your data needs a lot of processing or transformations. MapReduce provides only primitive data structures; Pig provides rich data structures. Pig supports complex operations like joins.
  48. 48. Running Pig programs: in an interactive shell called Grunt; as a Pig script; embedded in Java programs (like JDBC).
  49. 49. Grunt – Interactive Shell
  50. 50. Grunt shell
      fs commands (like hadoop fs):
        fs -ls
        fs -mkdir
        fs -copyToLocal <file>
        fs -copyFromLocal <local_file> <dest>
      exec: execute Pig scripts
      sh: execute shell commands
  51. 51. Let’s see them in action
  52. 52. Pig Latin
      LOAD: read files
      DUMP: dump data to the console
      JOIN: do a join on data sets
      FILTER: filter data sets
      ORDER: sort data
      STORE: store data back in files
  53. 53. Let’s see some code
  54. 54. Sort words based on count
  55. 55. Filter words present in a list
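The transcript does not preserve the Pig Latin scripts behind these two demos, so the following only sketches the data flow they describe, in plain Java rather than Pig Latin. The four word counts are taken from the Count Words output earlier in the deck; the class name `PigFlowSketch`, its methods, and the list of wanted words are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// In-memory illustration of the two demos; this is not Pig Latin.
final class PigFlowSketch {

    // Sort <word, count> rows by count, highest first; ties are broken alphabetically.
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            rows.add(new SimpleEntry<>(entry.getKey(), entry.getValue()));
        }
        rows.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return rows;
    }

    // Keep only the rows whose word appears in the given list.
    static Map<String, Integer> filterByList(Map<String, Integer> counts, Set<String> wanted) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (wanted.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("be", 3);
        counts.put("can", 7);
        counts.put("code", 2);
        counts.put("control", 4);

        // Hypothetical word list for the filter demo.
        Set<String> wanted = new HashSet<>(Arrays.asList("can", "control", "pig"));

        System.out.println(sortByCount(counts));
        System.out.println(filterByList(counts, wanted));
    }
}
```

`main` prints `[can=7, control=4, be=3, code=2]` followed by `{can=7, control=4}`. In Pig itself the same flow would be expressed with the LOAD, FILTER or JOIN, and ORDER operators rather than hand-written loops.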
  56. 56. HBase
  57. 57. What is HBase? A distributed, column-oriented database built on top of HDFS. Useful when real-time random read/write access to very large datasets is needed. Can handle billions of rows with millions of columns.
  58. 58. Hive
  59. 59. What is Hive? Useful for managing and querying structured data. Provides SQL-like syntax. Metadata is stored in an RDBMS. Extensible with types, functions, scripts, etc.
  60. 60. Hadoop: affordable storage/compute; structured or unstructured data; resilient, auto scalability
      Relational databases: interactive response times; ACID; structured data; cost/scale prohibitive
  61. 61. Thank You