My life as a beekeeper

Your Hive honeymoon can be cut short if you don't take the necessary precautions. In this talk I'll share my experience with Hive over the last 3 years (in Elastic MapReduce and Cloudera CDH3), describing what I got wrong the first time around, and what eventually saved the day. I've used Hive in environments handling anywhere from a few million to a few billion events a day, so hopefully there'll be something for everyone.

Published in: Technology

  • Slide 5: https://www.facebook.com/note.php?note_id=470667928919
    “Currently, if the total size of small tables is larger than 25MB, then the conditional task will choose the original common join to run. 25MB is a very conservative number and you can change this number with set hive.smalltable.filesize=30000000”
    Hint syntax: SELECT /*+ MAPJOIN(f,b,g) */
    set hive.auto.convert.join = true;
    The size property is hive.smalltable.filesize or hive.mapjoin.smalltable.filesize, depending on version.
    set hive.mapjoin.localtask.max.memory.usage = 0.999;
  • Slide 7: Also, there’s no UPDATE; you can only overwrite a whole table, so use partitions.
    E.g., 20 games with 40 events with 5 attrs on average, per day (date=/game=/event=/attr=): 4000 partitions per day, 1.46M partitions per year.
    SET hive.exec.max.dynamic.partitions=100000;
    SET hive.exec.max.dynamic.partitions.pernode=100000;
    Avoid RECOVER PARTITIONS: generate a partition list and add them statically, or use a persistent metastore.
  • Slide 8: Or INSERT OVERWRITE. Append (INSERT INTO) is only available from 0.8 onwards.
    Obviously works with partitions, static (with the value in the INSERT statement) or dynamic, but:
    The dynamic partition columns must be specified last among the columns in the SELECT statement, and in the same order in which they appear in the PARTITION() clause.
  • Slide 11: Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
    No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
    The schema (defined in JSON) is included in the data files.
    Hive >= 0.9.1
  • Slide 12: The new SerDe uses TBLPROPERTIES and avro.schema.url / avro.schema.literal. Its class name is org.apache.hadoop.hive.serde2.avro.AvroSerDe.
    Also, the statement order is important!
    One more thing: Avro 1.6.x won’t read files created with 1.7.x. CDH3 up to u3 comes with 1.6.0, so be conservative.
  • Slide 13: Look at the historical prices and bid above them.
    Regular price: $0.38, spot: $0.03
  • Slides 14-19: These give you the number of slots per node, adjust the above accordingly:
    mapred.tasktracker.map.tasks.maximum
    mapred.tasktracker.reduce.tasks.maximum
    Watch the memory you give the JVM if you change these.
    mapred.output.compress.*
    hive.exec.parallel.thread.number
    https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
  • Slide 20: When using an RDBMS, it’s much harder to get at your data from other tools.
  • Slide 21: Convoluted, long-winded code.
    Reporting is hard.
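The partition arithmetic and the "generate a partition list and add them statically" advice in the notes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name `events`, the sample game, event and attribute values, and both helper function names are hypothetical. The partition column names come from the date=/game=/event=/attr= layout in the notes, and `date` is backtick-quoted on the assumption that your Hive version treats it as a keyword.

```python
from datetime import date
from itertools import product


def partitions_per_day(games: int, events: int, attrs: int) -> int:
    """Number of leaf partitions created each day for a date/game/event/attr layout."""
    return games * events * attrs


def add_partition_statements(table, day, games, events, attrs):
    """Return one static ADD PARTITION statement per (game, event, attr) combination."""
    statements = []
    for game, event, attr in product(games, events, attrs):
        for value in (game, event, attr):
            # Assumption: keep the sketch simple by refusing values that need escaping.
            if "'" in value:
                raise ValueError(f"unsupported quote in partition value: {value!r}")
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
            f"(`date`='{day.isoformat()}', game='{game}', event='{event}', attr='{attr}');"
        )
    return statements


if __name__ == "__main__":
    # Figures from the notes: 20 games, 40 events, 5 attributes on average.
    per_day = partitions_per_day(games=20, events=40, attrs=5)
    print(per_day, per_day * 365)
    # Hypothetical table and values, for illustration only.
    for stmt in add_partition_statements(
        "events", date(2012, 1, 1), ["farm"], ["login", "purchase"], ["level"]
    ):
        print(stmt)
```

Writing the generated statements to a file and running it with the Hive CLI would register the partitions without scanning the storage layer, which is the point of avoiding RECOVER PARTITIONS.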
Transcript of "My life as a beekeeper"

    1. My life as a beekeeper
       @89clouds
    2. Who am I?
       Pedro Figueiredo (pfig@89clouds.com)
       Hadoop et al
       Social
       Facebook games, media (TV, publishing)
       Elastic MapReduce, Cloudera
       NoSQL, as in “Not a SQL guy”
    3. The problem with Hive
       It looks like SQL
    4. No, seriously
         SELECT
           CONCAT(vishi, vislo),
           SUM(CASE WHEN searchengine = 'google' THEN 1 ELSE 0 END) AS google_searches
         FROM omniture
         WHERE year(hittime) = 2011
           AND month(hittime) = 8
           AND is_search = 'Y'
         GROUP BY CONCAT(vishi, vislo);
    5. “It’s just like Oracle!”
       Analysts will be very happy
       At least until they join with that 30 billion-record table
       Pro tip: explain MapReduce and then MAPJOIN
         set hive.mapjoin.smalltable.filesize=xxx;
    6. Your first interview question
       “Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
    7. Dynamic partitions
       Partitions are the poor person’s indexes
       Unstructured data is full of surprises
         set hive.exec.dynamic.partition.mode=nonstrict;
         set hive.exec.dynamic.partition=true;
         set hive.exec.max.dynamic.partitions=100000;
         set hive.exec.max.dynamic.partitions.pernode=100000;
       Plan your partitions ahead
    8. Multi-vitamins
       You can minimise input scans by using multi-table INSERTs:
         FROM input
         INSERT INTO TABLE output1 SELECT foo
         INSERT INTO TABLE output2 SELECT bar;
    9. Persistence, do you speak it?
       External Hive metastore
       Avoid the pain of cluster set up
       Use an RDS metastore if on AWS, RDBMS otherwise.
       10GB will get you a long way, this thing is tiny
    10. Now you have 2 problems
        Regular expressions are great, if you’re using a real programming language.
          WHERE foo RLIKE '(a|b|c)'
        will hurt
          WHERE foo='a' OR foo='b' OR foo='c'
        Generate these statements, if needs be, it will pay off.
    11. Avro
        Serialisation framework (think Thrift/Protocol Buffers).
        Avro container files are SequenceFile-like, splittable.
        Support for snappy built-in.
        If using the LinkedIn SerDe, the table creation syntax changes.
    12. Avro
          CREATE EXTERNAL TABLE IF NOT EXISTS mytable
            PARTITIONED BY (ds STRING)
            ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
            WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
            STORED AS
              INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
              OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
            LOCATION '/data/mytable';
    13. MAKE! MONEY! FAST!
        Use spot instances in EMR
        Usually stick around until America wakes up
        Brilliant for worker nodes
    14-19. Bag of tricks
        set hive.optimize.s3.query=true;
        set hive.cli.print.header=true;
        set hive.exec.max.created.files=xxx;
        set mapred.reduce.tasks=xxx;
        set hive.exec.compress.intermediate=true;
        set hive.exec.parallel=true;
    20. To be or not to be
        “Consider a traditional RDBMS”
        At what size should we do this?
        Hive is not an end, it’s the means
        Data on HDFS/S3 is simply available, not “available to Hive”
        Hive isn’t suitable for near real time
    21. Hive != MapReduce
        Don’t use Hive instead of Native/Streaming
        “I know, I’ll just stream this bit through a shell script!”
        IMO, Hive excels at analysis and aggregation, so use it for that
    22. Thank you
        Fred Easey (@poppa_f)
        Peter Hanlon
    23. Questions?
        pfig@89clouds.com
        @pfig / @89clouds
        http://89clouds.com/
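Slide 10 recommends generating the equality predicates rather than relying on RLIKE. A minimal Python sketch of such a generator follows. The function name `equality_predicate` is hypothetical, and it only covers the simple case from the slide, an alternation of literal values. It rejects values containing quotes or backslashes rather than guessing at escaping rules.

```python
def equality_predicate(column, values):
    """Return "(col='a' OR col='b' ...)" for a non-empty list of literal values."""
    if not values:
        raise ValueError("need at least one value")
    for value in values:
        # Assumption: keep the sketch simple by refusing values that would need escaping.
        if "'" in value or "\\" in value:
            raise ValueError(f"unsupported character in value: {value!r}")
    return "(" + " OR ".join(f"{column}='{v}'" for v in values) + ")"


if __name__ == "__main__":
    # Reproduces the hand-written clause from slide 10.
    print("WHERE " + equality_predicate("foo", ["a", "b", "c"]))
```

Wrapping the result in parentheses keeps the generated clause safe to combine with other conditions using AND.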