Hive vs Pig for HadoopSourceCodeReading


    1. HIVE, PIG, MAPREDUCE • @hamburger_kid • 2010/5/27
    2. done: CLOUDERA HADOOP TRAINING FOR DEVELOPERS
    3. done: CLOUDERA HADOOP TRAINING FOR DEVELOPERS
    4. Day 1: Hadoop, mapreduce, HDFS / Day 2: mapreduce / Day 3: RDBMS and Hadoop, Hive, Pig
    5. Day 1: Hadoop, mapreduce, HDFS / Day 2: mapreduce / Day 3: RDBMS and Hadoop, Hive, Pig (Mr.Alex)
    6. Day 1: Hadoop, mapreduce, HDFS / Day 2: mapreduce / Day 3: RDBMS and Hadoop, Hive, Pig (Mr.Alex)
    7. Hive vs Pig
    8. Hive vs Pig: VS
    9. mapreduce
    10. mapreduce
    11. NameNode, Secondary NameNode, ClientNode, JobTracker / DataNode (Block), TaskTracker
    12. Hive / Pig: NameNode, Secondary NameNode, ClientNode, JobTracker / DataNode (Block), TaskTracker
    13. mapreduce example: "THE END OF MONEY IS THE END OF LOVE", traced through map, shuffle & sort, reduce (a worked sketch appears after the transcript). source: http://techblog.yahoo.co.jp/cat207/cat209/hadoop/
    14. Hive
    15. Hive
    16. Hive: developed at Facebook; the SQL-like Hive QL is compiled into mapreduce; data is organized into Tables, Partitions and Buckets; metadata lives in the Metastore; the data itself lives on HDFS
    17. Hive data model: a Table has typed columns (int, float, string, boolean); Partitions split a table's data and map to HDFS subdirectories; Buckets split the data further (number of Buckets = number of Reduce tasks) and are useful for Sampling (a Hive QL sketch appears after the transcript)
    18. Hive Metastore: the Metastore records Table and Partition metadata in a Derby/MySQL DB on the ClientNode; the table and partition data itself stays in HDFS (see the Metastore queries after the transcript)
    19. Hive on HDFS: the warehouse directory is /user/hive/warehouse; each Table is a subdirectory of the warehouse, each Partition a subdirectory of its Table, and the data files (one per reduce) sit under /user/hive/warehouse/table/partition/data; files can be plain text or SequenceFiles, with the format handled by a SerDe. source: http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook
    20. http://www.rakuten.co.jp/recruit/en/career/employee/appengineer.html http://www.rakuten.co.jp/recruit/en/career/employee/systemproducer.html
    21. Put the data into HDFS:
        ls -al /home/hamburgerkid/workspace/techtalk/data/
        hadoop fs -rmr hive
        hadoop fs -mkdir hive/input
        hadoop fs -put /home/hamburgerkid/workspace/techtalk/data/* hive/input
        hadoop fs -ls /user/hamburgerkid/hive/input
    22. Build word counts for each input with the mapreduce wordcount example:
        hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar wordcount hive/input/app_eng hive/output/app_eng/
        hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar wordcount hive/input/producer hive/output/producer/
        hadoop fs -ls /user/hamburgerkid/hive/output/app_eng/
        hadoop fs -ls /user/hamburgerkid/hive/output/producer/
        hadoop fs -cat /user/hamburgerkid/hive/output/app_eng/part*
        hadoop fs -cat /user/hamburgerkid/hive/output/producer/part*
    23. Load the producer wordcount output into a partitioned Hive table and query it:
        CREATE TABLE producer (word STRING, freq INT)
          PARTITIONED BY (dt STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          STORED AS TEXTFILE ;
        SHOW TABLES ;
        DESCRIBE producer ;
        LOAD DATA INPATH '/user/hamburgerkid/hive/output/producer/part*' INTO TABLE producer PARTITION (dt='20100526') ;
        SELECT * FROM producer WHERE LENGTH(word) > 3 AND freq > 1 SORT BY freq DESC LIMIT 10 ;
        EXPLAIN SELECT * FROM producer WHERE LENGTH(word) > 3 AND freq > 1 SORT BY freq DESC LIMIT 10 ;
        hadoop fs -ls /user/hive/warehouse/producer/dt=20100526/
        hadoop fs -ls /user/hamburgerkid/hive/output/producer/
    24. Load the app_eng wordcount output the same way:
        CREATE TABLE app_eng (word STRING, freq INT)
          PARTITIONED BY (dt STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          STORED AS TEXTFILE ;
        LOAD DATA INPATH '/user/hamburgerkid/hive/output/app_eng/part*' INTO TABLE app_eng PARTITION (dt='20100526') ;
    25. Combine the two tables (JOIN):
        CREATE TABLE develop (word STRING, p_freq INT, e_freq INT) ;
        INSERT OVERWRITE TABLE develop
          SELECT p.word, p.freq, e.freq
          FROM producer p JOIN app_eng e ON (p.word = e.word)
          WHERE p.freq > 1 AND e.freq > 1 ;
        SELECT word, p_freq, e_freq, (p_freq + e_freq) AS ttl
          FROM develop
          WHERE LENGTH(word) > 3
          SORT BY ttl DESC LIMIT 10 ;
    26. Words that appear only in app_eng (OUTER JOIN):
        SELECT e.word, e.freq, p.freq
          FROM app_eng e LEFT OUTER JOIN producer p ON (e.word = p.word)
          WHERE LENGTH(e.word) > 3 AND p.freq IS NULL
          SORT BY e.freq DESC LIMIT 10 ;
    27. Engineering with..
    28. Engineering with..
    29. Pig
    30. Pig
    31. Pig: developed at Yahoo!; Pig Latin scripts are compiled into mapreduce; join, group, filter and sort are built in; Grunt is the interactive Pig shell, and scripts can also be run in batch; unlike Hive there is no Metastore, but the data model is richer (map, tuple, bag); runs in Local mode or on Hadoop (run-mode examples appear after the transcript)
    32. Pig data types: scalars are int, long, double, chararray and bytearray ('apache.org' , '1.0'); a tuple is an ordered set of fields, <apache.org , 1.0>; a bag is a collection of tuples, {<apache.org , 1.0> , <flickr.com , 0.8>}; a map holds key/value pairs where any value type is OK, [ 'apache' : <'search' , 'news'> ; 'cnn' : 'news' ] (a schema sketch appears after the transcript)
    33. Put the same data into HDFS again, this time for Pig (the same flow as the Hive demo, now written as Pig mapreduce):
        ls -al /home/hamburgerkid/workspace/techtalk/data/
        hadoop fs -rmr pig
        hadoop fs -mkdir pig/input
        hadoop fs -put /home/hamburgerkid/workspace/techtalk/data/* pig/input
        hadoop fs -ls /user/hamburgerkid/pig/input
    34. wordcount in Pig (app_eng):
        log = LOAD 'pig/input/app_eng' USING TextLoader() ;
        flatd = FOREACH log GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word ;
        grpd = GROUP flatd BY word ;
        cntd = FOREACH grpd GENERATE COUNT(flatd) , group ;
        STORE cntd INTO 'pig/output/app_eng' ;
        hadoop fs -ls /user/hamburgerkid/pig/output/app_eng
        hadoop fs -cat /user/hamburgerkid/pig/output/app_eng/part*
    35. wordcount in Pig (producer):
        log = LOAD 'pig/input/producer' USING TextLoader() ;
        flatd = FOREACH log GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word ;
        grpd = GROUP flatd BY word ;
        cntd = FOREACH grpd GENERATE COUNT(flatd) , group ;
        STORE cntd INTO 'pig/output/producer' ;
        hadoop fs -ls /user/hamburgerkid/pig/output/producer
        hadoop fs -cat /user/hamburgerkid/pig/output/producer/part*
    36. Combine the two outputs (JOIN):
        eng = LOAD 'pig/output/app_eng' AS (freq , word) ;
        pro = LOAD 'pig/output/producer' AS (freq , word) ;
        cg = COGROUP eng BY word , pro BY word ;
        flatd = FOREACH cg GENERATE FLATTEN(eng) , FLATTEN(pro.freq) AS freq2 ;
        ttld = FOREACH flatd GENERATE word , SIZE(word) AS size , freq , freq2 , (freq + freq2) AS total ;
        fltrd = FILTER ttld BY freq > 1 AND freq2 > 1 AND size > 3L ;
        odrd = LIMIT (ORDER fltrd BY total DESC) 10 ;
        DUMP odrd ;
        STORE odrd INTO 'pig/output/develop' ;
        hadoop fs -ls pig/output/develop
        hadoop fs -cat pig/output/develop/part*
    37. Words that appear only in producer (OUTER JOIN):
        eng = LOAD 'pig/output/app_eng' AS (freq , word) ;
        pro = LOAD 'pig/output/producer' AS (freq , word) ;
        cg = COGROUP eng BY word , pro BY word ;
        outrd = FILTER cg BY COUNT(eng) == 0 ;
        flatd = FOREACH outrd GENERATE FLATTEN(pro) ;
        szd = FOREACH flatd GENERATE word , SIZE(word) AS size , freq ;
        fltrd = FILTER szd BY size > 3L ;
        odrd = LIMIT (ORDER fltrd BY freq DESC) 10 ;
        DUMP odrd ;
    38. Produce for..
    39. Produce for..
    40. Hive vs Pig vs mapreduce: mapreduce < Hive < Pig / mapreduce > Hive < Pig / mapreduce > Hive > Pig; both Hive and Pig can be extended with UDFs (a registration sketch appears after the transcript)
    41. Hive vs Pig: personally closer to Pig (w); core logic stays in mapreduce, SQL-style work goes to Hive or Pig
    42. @hamburger_kid
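
A rough sketch of the word-count flow behind slide 13, assuming the standard wordcount job (each mapper emits (word, 1), the shuffle groups by word, and the reducer sums the counts):

    map:            (THE,1) (END,1) (OF,1) (MONEY,1) (IS,1) (THE,1) (END,1) (OF,1) (LOVE,1)
    shuffle & sort: END -> [1,1]   IS -> [1]   LOVE -> [1]   MONEY -> [1]   OF -> [1,1]   THE -> [1,1]
    reduce:         (END,2) (IS,1) (LOVE,1) (MONEY,1) (OF,2) (THE,2)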
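
A minimal Hive QL sketch of the Table / Partitions / Buckets model from slide 17; the page_view table and its columns are hypothetical, not part of the talk:

    -- typed columns, a date partition, and 32 buckets on userid (hypothetical table)
    CREATE TABLE page_view (userid INT, url STRING, score FLOAT)
      PARTITIONED BY (dt STRING)
      CLUSTERED BY (userid) INTO 32 BUCKETS
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE ;
    -- sampling reads a single bucket instead of scanning the whole table
    SELECT * FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) ;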
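
The metadata slide 18 attributes to the Metastore can be inspected from Hive QL itself, for example against the producer table created on slide 23:

    SHOW TABLES ;
    SHOW PARTITIONS producer ;
    -- EXTENDED also prints the HDFS location and SerDe recorded in the Metastore
    DESCRIBE EXTENDED producer ;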
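
The Grunt shell and the Local / Hadoop run modes mentioned on slide 31, roughly (the script name is hypothetical):

    pig -x local                      # Grunt shell against the local filesystem
    pig -x mapreduce                  # Grunt shell against the Hadoop cluster (the default)
    pig -x mapreduce wordcount.pig    # run a Pig Latin script in batch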
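
A minimal LOAD sketch showing how the types on slide 32 appear in a schema; the file and field names are hypothetical:

    -- scalars, a bag of tuples, and a map in a single schema
    sites = LOAD 'pig/input/sites'
            AS (url:chararray, rank:double,
                links:bag{t:tuple(target:chararray)},
                props:map[]) ;
    DESCRIBE sites ;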
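
A hedged illustration of the UDF point on slide 40; the jar path, class names and the normalize function are hypothetical, only the registration syntax is standard:

    -- Hive: register a jar and expose a UDF to Hive QL
    ADD JAR /path/to/my-udfs.jar ;
    CREATE TEMPORARY FUNCTION normalize AS 'com.example.hive.NormalizeUDF' ;
    SELECT normalize(word), freq FROM producer ;
    -- Pig: register the jar and alias the UDF for Pig Latin
    REGISTER /path/to/my-udfs.jar ;
    DEFINE normalize com.example.pig.Normalize() ;
    clean = FOREACH flatd GENERATE normalize(word) ;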
