Hive vs Pig for HadoopSourceCodeReading

Published in: Technology

  1. HIVE, PIG, MAPREDUCE • @hamburger_kid, 2010-05-27
  2. Done: CLOUDERA HADOOP TRAINING FOR DEVELOPERS
  4. Training agenda, Day 1 to Day 3 (diagram labels: Hadoop, HDFS, mapreduce, RDBMS, Hive, Pig)
  5. Training agenda, Day 1 to Day 3 (diagram labels: Hadoop, HDFS, mapreduce, RDBMS, Hive, Pig, Mr. Alex)
  7. Hive vs Pig
  8. Hive vs Pig VS
  9. mapreduce
  11. Cluster diagram: ClientNode, NameNode, Secondary NameNode, JobTracker, DataNode (Block), TaskTracker
  12. Cluster diagram with Hive and Pig: ClientNode, NameNode, Secondary NameNode, JobTracker, DataNode (Block), TaskTracker
  13. mapreduce: the input "THE END OF MONEY IS THE END OF LOVE" passes through map, shuffle & sort, and reduce (source: http://techblog.yahoo.co.jp/cat207/cat209/hadoop/)
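
The three phases named on slide 13 can be sketched in plain Python. This is only an in-memory illustration of map, shuffle & sort, and reduce applied to the slide's sentence; the function names are hypothetical and nothing here uses the Hadoop API.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """map: emit a (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """shuffle & sort: order the pairs by key and collect the values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_phase(key, values):
    """reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle_and_sort(pairs))


counts = word_count(["THE END OF MONEY IS THE END OF LOVE"])
print(counts)
```

On a real cluster the map and reduce calls run as separate tasks on different nodes, and the framework performs the shuffle between them; the sketch only mirrors the data flow.
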
  14. Hive
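  16. Hive: developed at Facebook; SQL-like queries (Hive QL) that run as mapreduce jobs; data model: Table, Partitions, Buckets; Metastore; storage on HDFS
  17. Hive Table, Partitions, Buckets. Table: typed columns (int, float, string, boolean). Partitions: the table's data is partitioned, and the Partitions are stored on HDFS. Buckets: the data is split into Buckets; number of Buckets = number of Reduce tasks; used for Sampling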
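
As a rough illustration of the Buckets idea on slide 17, the sketch below assigns rows to a fixed number of buckets by hashing a key column, then reads a single bucket as a sample. CRC32 is only a stand-in chosen for being deterministic; it is not claimed to be the hash Hive uses, and the rows and function names are made up.

```python
import zlib
from collections import defaultdict


def bucket_for(key, num_buckets):
    """Map a key to a bucket index with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets, key_index=0):
    """Group rows by the bucket index of their key column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_index], num_buckets)].append(row)
    return buckets


# Hypothetical (word, freq) rows, only for illustration.
rows = [("hadoop", 3), ("hive", 2), ("pig", 2), ("hadoop", 1)]
buckets = bucketize(rows, num_buckets=4)

# Sampling: reading one bucket out of four touches only part of the data.
sample = buckets[0]
```

Because the bucket index depends only on the key, every row with the same key lands in the same bucket, which is what makes a single bucket usable as a sample.
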
  18. Hive Metastore: the Metastore keeps the metadata for Table and Partitions in a Derby/MySQL DB (diagram labels: Metastore, ClientNode, NameNode); the Metastore maps each Table to its Partitions on HDFS
  19. Hive on HDFS: the HDFS directory /user/hive/warehouse is the warehouse; each Table is a warehouse subdirectory; each Partition is a Table subdirectory; the data files come from reduce; layout: /user/hive/warehouse/table/partition/data; formats: SequenceFiles, SerDe format (source: http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook)
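
The directory layout on slide 19 can be expressed as a small path helper. This is a sketch under the assumption that each partition directory is named `column=value`, which matches the `dt=20100526` directory listed on slide 23; the helper name is hypothetical.

```python
import posixpath

WAREHOUSE_ROOT = "/user/hive/warehouse"  # warehouse directory from slide 19


def partition_dir(table, partitions=None, root=WAREHOUSE_ROOT):
    """Build the HDFS directory for a table and, optionally, one partition.

    `partitions` maps partition column names to values, in declaration order.
    """
    parts = [f"{column}={value}" for column, value in (partitions or {}).items()]
    return posixpath.join(root, table, *parts)


print(partition_dir("producer"))
print(partition_dir("producer", {"dt": "20100526"}))
```
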
  20. http://www.rakuten.co.jp/recruit/en/career/employee/appengineer.html http://www.rakuten.co.jp/recruit/en/career/employee/systemproducer.html
  21. HDFS
        # check the local data, then upload it to a fresh hive/input directory
        ls -al /home/hamburgerkid/workspace/techtalk/data/
        hadoop fs -rmr hive
        hadoop fs -mkdir hive/input
        hadoop fs -put /home/hamburgerkid/workspace/techtalk/data/* hive/input
        hadoop fs -ls /user/hamburgerkid/hive/input
  22. Table data from mapreduce (wordcount)
        # run the bundled wordcount example once per input
        hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar wordcount hive/input/app_eng hive/output/app_eng/
        hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar wordcount hive/input/producer hive/output/producer/
        # inspect the results
        hadoop fs -ls /user/hamburgerkid/hive/output/app_eng/
        hadoop fs -ls /user/hamburgerkid/hive/output/producer/
        hadoop fs -cat /user/hamburgerkid/hive/output/app_eng/part*
        hadoop fs -cat /user/hamburgerkid/hive/output/producer/part*
  23. wordcount output
        -- Hive: table over the tab-separated wordcount output
        CREATE TABLE producer (word STRING, freq INT)
          PARTITIONED BY (dt STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          STORED AS TEXTFILE ;
        SHOW TABLES ;
        DESCRIBE producer ;
        LOAD DATA INPATH '/user/hamburgerkid/hive/output/producer/part*'
          INTO TABLE producer PARTITION (dt='20100526') ;
        SELECT * FROM producer
          WHERE LENGTH(word) > 3 AND freq > 1
          SORT BY freq DESC LIMIT 10 ;
        EXPLAIN SELECT * FROM producer
          WHERE LENGTH(word) > 3 AND freq > 1
          SORT BY freq DESC LIMIT 10 ;
        # shell: LOAD DATA INPATH moves the files into the warehouse
        hadoop fs -ls /user/hive/warehouse/producer/dt=20100526/
        hadoop fs -ls /user/hamburgerkid/hive/output/producer/
  24. wordcount output
        -- Hive: same table definition for the app_eng output
        CREATE TABLE app_eng (word STRING, freq INT)
          PARTITIONED BY (dt STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          STORED AS TEXTFILE ;
        LOAD DATA INPATH '/user/hamburgerkid/hive/output/app_eng/part*'
          INTO TABLE app_eng PARTITION (dt='20100526') ;
  25. Table (JOIN)
        CREATE TABLE develop (word STRING, p_freq INT, e_freq INT) ;
        -- words that occur more than once in both tables
        INSERT OVERWRITE TABLE develop
          SELECT p.word, p.freq, e.freq
          FROM producer p JOIN app_eng e ON (p.word = e.word)
          WHERE p.freq > 1 AND e.freq > 1 ;
        SELECT word, p_freq, e_freq, (p_freq + e_freq) AS ttl
          FROM develop
          WHERE LENGTH(word) > 3
          SORT BY ttl DESC LIMIT 10 ;
  26. (OUTER JOIN)
        -- words that appear in app_eng but not in producer
        SELECT e.word, e.freq, p.freq
          FROM app_eng e LEFT OUTER JOIN producer p ON (e.word = p.word)
          WHERE LENGTH(e.word) > 3 AND p.freq IS NULL
          SORT BY e.freq DESC LIMIT 10 ;
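
To make the intent of the two join queries on slides 25 and 26 concrete, here is a small in-memory sketch. The word frequencies are made up, the function names are hypothetical, and `None` plays the role of SQL `NULL`; it only mirrors the query logic and does not talk to Hive.

```python
# Hypothetical word -> freq tables standing in for the Hive tables.
producer = {"hadoop": 3, "team": 2, "plan": 1}
app_eng = {"hadoop": 5, "team": 1, "java": 4, "code": 2}


def develop(p, e):
    """Inner JOIN on word, keeping rows where both frequencies exceed 1."""
    return [(w, p[w], e[w]) for w in p if w in e and p[w] > 1 and e[w] > 1]


def top_by_total(rows, limit=10):
    """Add ttl = p_freq + e_freq, keep words longer than 3, sort descending."""
    with_total = [(w, pf, ef, pf + ef) for w, pf, ef in rows if len(w) > 3]
    return sorted(with_total, key=lambda r: r[3], reverse=True)[:limit]


def only_in_left(e, p, limit=10):
    """LEFT OUTER JOIN followed by 'p.freq IS NULL': rows with no match in p."""
    joined = [(w, ef, p.get(w)) for w, ef in e.items()]  # None stands for NULL
    unmatched = [r for r in joined if len(r[0]) > 3 and r[2] is None]
    return sorted(unmatched, key=lambda r: r[1], reverse=True)[:limit]


print(top_by_total(develop(producer, app_eng)))
print(only_in_left(app_eng, producer))
```

One caveat about the mapping: the sketch sorts all rows in one place, which corresponds to a single reducer. In Hive, `SORT BY` orders rows within each reducer, while `ORDER BY` gives a total order.
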
  27. Engineering with..
  29. Pig
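  31. Pig: developed at Yahoo!; Pig Latin scripts run as mapreduce jobs; operations: join, group, filter, sort; Grunt is the Pig shell, and Pig also runs scripts; unlike Hive, no Metastore; data types: map, tuple, bag; execution modes: mapreduce on Hadoop, or Local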
  32. Pig data types. Scalar types: int, long, double, chararray, bytearray, e.g. 'apache.org' , '1.0'. tuple: <apache.org , 1.0>. bag: a collection of tuples, {<apache.org , 1.0> , <flickr.com , 0.8>}. map: key/value pairs, any type is OK as a value, [ 'apache' : <'search' , 'news'> ; 'cnn' : 'news' ]
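
The three nested types on slide 32 have close analogues in ordinary Python values, which may help when reading the Pig Latin scripts on the following slides. The mapping below is an informal analogy rather than anything Pig defines, and `pig_kind` is a hypothetical helper.

```python
# tuple: an ordered sequence of fields
t = ("apache.org", 1.0)

# bag: a collection of tuples (a list here, since duplicates are allowed)
bag = [("apache.org", 1.0), ("flickr.com", 0.8)]

# map: string keys, with values of any type (a tuple and a plain string here)
m = {"apache": ("search", "news"), "cnn": "news"}


def pig_kind(value):
    """Name the Pig type that a Python value plays in this analogy."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, tuple):
        return "tuple"
    if isinstance(value, list):
        return "bag"
    return "scalar"


print([pig_kind(v) for v in (t, bag, m, "apache.org")])
```
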
  33. HDFS (as with Hive)
        # check the local data, then upload it to a fresh pig/input directory
        ls -al /home/hamburgerkid/workspace/techtalk/data/
        hadoop fs -rmr pig
        hadoop fs -mkdir pig/input
        hadoop fs -put /home/hamburgerkid/workspace/techtalk/data/* pig/input
        hadoop fs -ls /user/hamburgerkid/pig/input
      Hive, Pig, mapreduce
  34. wordcount
        -- Pig Latin: split each line into words, group by word, count
        log   = LOAD 'pig/input/app_eng' USING TextLoader() ;
        flatd = FOREACH log GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word ;
        grpd  = GROUP flatd BY word ;
        cntd  = FOREACH grpd GENERATE COUNT(flatd) , group ;
        STORE cntd INTO 'pig/output/app_eng' ;
        # shell: inspect the result
        hadoop fs -ls /user/hamburgerkid/pig/output/app_eng
        hadoop fs -cat /user/hamburgerkid/pig/output/app_eng/part*
  35. wordcount
        -- Pig Latin: the same pipeline for the producer input
        log   = LOAD 'pig/input/producer' USING TextLoader() ;
        flatd = FOREACH log GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word ;
        grpd  = GROUP flatd BY word ;
        cntd  = FOREACH grpd GENERATE COUNT(flatd) , group ;
        STORE cntd INTO 'pig/output/producer' ;
        # shell: inspect the result
        hadoop fs -ls /user/hamburgerkid/pig/output/producer
        hadoop fs -cat /user/hamburgerkid/pig/output/producer/part*
  36. (JOIN)
        eng   = LOAD 'pig/output/app_eng' AS (freq , word) ;
        pro   = LOAD 'pig/output/producer' AS (freq , word) ;
        cg    = COGROUP eng BY word , pro BY word ;
        -- flattening both bags keeps only words present in both inputs
        flatd = FOREACH cg GENERATE FLATTEN(eng) , FLATTEN(pro.freq) AS freq2 ;
        ttld  = FOREACH flatd GENERATE word , SIZE(word) AS size , freq , freq2 , (freq + freq2) AS total ;
        fltrd = FILTER ttld BY freq > 1 AND freq2 > 1 AND size > 3L ;
        odrd  = LIMIT (ORDER fltrd BY total DESC) 10 ;
        DUMP odrd ;
        STORE odrd INTO 'pig/output/develop' ;
        # shell: inspect the result
        hadoop fs -ls pig/output/develop
        hadoop fs -cat pig/output/develop/part*
  37. (OUTER JOIN)
        eng   = LOAD 'pig/output/app_eng' AS (freq , word) ;
        pro   = LOAD 'pig/output/producer' AS (freq , word) ;
        cg    = COGROUP eng BY word , pro BY word ;
        -- keep the groups that have no matching tuple in eng
        outrd = FILTER cg BY COUNT(eng) == 0 ;
        flatd = FOREACH outrd GENERATE FLATTEN(pro) ;
        szd   = FOREACH flatd GENERATE word , SIZE(word) AS size , freq ;
        fltrd = FILTER szd BY size > 3L ;
        odrd  = LIMIT (ORDER fltrd BY freq DESC) 10 ;
        DUMP odrd ;
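
Both Pig scripts on slides 36 and 37 hinge on `COGROUP`, so a small in-memory sketch of its behaviour may help. The data is made up and the function names are hypothetical; tuples are `(freq, word)` as in the `LOAD ... AS (freq , word)` statements.

```python
from collections import defaultdict
from itertools import product


def cogroup(eng, pro, key=lambda t: t[1]):
    """COGROUP eng BY word, pro BY word: one (group, eng_bag, pro_bag) per key."""
    groups = defaultdict(lambda: ([], []))
    for t in eng:
        groups[key(t)][0].append(t)
    for t in pro:
        groups[key(t)][1].append(t)
    return [(k, bags[0], bags[1]) for k, bags in sorted(groups.items())]


def flatten_join(cg):
    """FLATTEN(eng), FLATTEN(pro.freq): cross product per group.

    A group with an empty bag yields no rows, which gives inner-join behaviour.
    """
    return [(ef, word, pf)
            for _, eng_bag, pro_bag in cg
            for (ef, word), (pf, _) in product(eng_bag, pro_bag)]


def only_in_pro(cg):
    """FILTER cg BY COUNT(eng) == 0, then FLATTEN(pro)."""
    return [t for _, eng_bag, pro_bag in cg if len(eng_bag) == 0 for t in pro_bag]


# Hypothetical (freq, word) tuples.
eng = [(5, "hadoop"), (4, "java")]
pro = [(3, "hadoop"), (2, "plan")]
cg = cogroup(eng, pro)
print(flatten_join(cg))
print(only_in_pro(cg))
```

Read this way, the filter on slide 37 keeps words that occur only in producer, whereas the Hive query on slide 26 keeps words that occur only in app_eng, so the two outer-join examples appear to point in opposite directions.
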
  38. Produce for..
  40. Hive vs Pig: mapreduce < Hive < Pig; mapreduce > Hive < Pig; mapreduce > Hive > Pig; Hive/Pig UDF
  41. Hive, Pig: Pig close w; Core logic: mapreduce; SQL: Hive; Pig
  42. @hamburger_kid
