My life as a beekeeper

Your Hive honeymoon can be cut short if you don't take the necessary precautions. In this talk I'll share my experience with Hive in the last 3 years (in Elastic MapReduce and Cloudera CDH3), describing what I got wrong the first time around, and what eventually saved the day. I've used Hive in environments with a number of events ranging from a few million to a few billion a day, so hopefully there'll be something for everyone.

Slide notes

  • https://www.facebook.com/note.php?note_id=470667928919
    “Currently, if the total size of small tables is larger than 25MB, then the conditional task will choose the original common join to run. 25MB is a very conservative number and you can change this number with set hive.smalltable.filesize=30000000”
    SELECT /*+ MAPJOIN(f,b,g) */
    set hive.auto.convert.join = true;
    The property is hive.smalltable.filesize or hive.mapjoin.smalltable.filesize, depending on version.
    set hive.mapjoin.localtask.max.memory.usage = 0.999;
  • Also, there’s no UPDATE; you can only overwrite a whole table, so use partitions.
    e.g., 20 games with 40 events with 5 attrs on average, per day (date=/game=/event=/attr=): 1.46M partitions per year (4000/day)
    SET hive.exec.max.dynamic.partitions=100000;
    SET hive.exec.max.dynamic.partitions.pernode=100000;
    Avoid RECOVER PARTITIONS; generate a partition list and add the partitions statically, or use a persistent metastore.
  • Or INSERT OVERWRITE. Append (INSERT INTO) is only available from 0.8 onwards.
    This obviously works with partitions, static (with the value in the INSERT statement) or dynamic, but:
    the dynamic partition columns must be specified last among the columns in the SELECT statement, and in the same order in which they appear in the PARTITION() clause.
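  • As a minimal sketch of that ordering rule (the table and column names here are hypothetical), the dynamic partition columns go last in the SELECT, in PARTITION() order:

    ```sql
    -- Hypothetical tables: ds and game are the dynamic partition columns,
    -- so they appear last in the SELECT, in the same order as in PARTITION()
    INSERT OVERWRITE TABLE events_out PARTITION (ds, game)
    SELECT user_id, event_name, ds, game
    FROM events_in;
    ```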
  • Untagged data: since the schema is present when data is read, considerably less type information need be encoded with the data, resulting in smaller serialization size.
    No manually-assigned field IDs: when a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
    The schema (defined in JSON) is included in the data files.
    Hive >= 0.9.1
  • The new SerDe uses TBLPROPERTIES with avro.schema.url / avro.schema.literal. Another difference is the class name: org.apache.hadoop.hive.serde2.avro.AvroSerDe.
    Also, the statement order is important!
    One more thing: Avro 1.6.x won’t read files created with 1.7.x. CDH3 up to u3 comes with 1.6.0, so be conservative.
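  • For comparison, a sketch of a table definition using this built-in SerDe; the table name and schema path are placeholders:

    ```sql
    -- Built-in Avro SerDe (Hive >= 0.9.1); table name and schema path are placeholders
    CREATE EXTERNAL TABLE IF NOT EXISTS mytable
      PARTITIONED BY (ds STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hadoop/avro/myschema.avsc');
    ```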
  • Look at the historical prices, bid above them.
    Regular price: $0.38, spot: $0.03
  • These give you the number of slots per node; adjust the above accordingly:
    mapred.tasktracker.map.tasks.maximum
    mapred.tasktracker.reduce.tasks.maximum
    Watch the memory you give the JVM if you change these.
    mapred.output.compress.*
    hive.exec.parallel.thread.number
    https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
  • When using an RDBMS, it’s much harder to get at your data from other tools.
  • Convoluted, long-winded code.
    Reporting is hard.

Transcript

  • 1. My life as a beekeeper @89clouds
  • 2. Who am I?
    Pedro Figueiredo (pfig@89clouds.com)
    Hadoop et al
    Social
    Facebook games, media (TV, publishing)
    Elastic MapReduce, Cloudera
    NoSQL, as in “Not a SQL guy”
  • 3. The problem with Hive
    It looks like SQL
  • 4. No, seriously
    SELECT CONCAT(vishi, vislo),
           SUM(CASE WHEN searchengine = 'google' THEN 1 ELSE 0 END) AS google_searches
    FROM omniture
    WHERE year(hittime) = 2011
      AND month(hittime) = 8
      AND is_search = 'Y'
    GROUP BY CONCAT(vishi, vislo);
  • 5. “It’s just like Oracle!”
    Analysts will be very happy
    At least until they join with that 30 billion-record table
    Pro tip: explain MapReduce, and then MAPJOIN
    set hive.mapjoin.smalltable.filesize=xxx;
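  • A minimal MAPJOIN sketch, assuming hypothetical tables big_fact and dim; the hint loads the small table into memory so the join runs map-side, avoiding a shuffle against the big table:

    ```sql
    -- Raise the "small table" threshold (bytes), then hint the map-side join;
    -- table and column names are hypothetical
    set hive.mapjoin.smalltable.filesize=30000000;
    SELECT /*+ MAPJOIN(d) */ f.user_id, d.country
    FROM big_fact f JOIN dim d ON (f.country_id = d.id);
    ```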
  • 6. Your first interview question
    “Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
  • 7. Dynamic partitions
    Partitions are the poor person’s indexes
    Unstructured data is full of surprises
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.exec.dynamic.partition=true;
    set hive.exec.max.dynamic.partitions=100000;
    set hive.exec.max.dynamic.partitions.pernode=100000;
    Plan your partitions ahead
  • 8. Multi-vitamins
    You can minimise input scans by using multi-table INSERTs:
    FROM input
    INSERT INTO TABLE output1 SELECT foo
    INSERT INTO TABLE output2 SELECT bar;
  • 9. Persistence, do you speak it?
    External Hive metastore
    Avoid the pain of cluster set-up
    Use an RDS metastore if on AWS, an RDBMS otherwise. 10GB will get you a long way, this thing is tiny
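  • A sketch of the hive-site.xml entries that point the metastore at an external MySQL/RDS database; the endpoint, database name and credentials are placeholders:

    ```xml
    <!-- Placeholders: swap in your own RDS endpoint and credentials -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://metastore.example.com:3306/hivemeta</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>secret</value>
    </property>
    ```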
  • 10. Now you have 2 problems
    Regular expressions are great, if you’re using a real programming language.
    WHERE foo RLIKE '(a|b|c)' will hurt
    WHERE foo='a' OR foo='b' OR foo='c'
    Generate these statements, if needs be; it will pay off.
  • 11. Avro
    Serialisation framework (think Thrift/Protocol Buffers).
    Avro container files are SequenceFile-like, splittable.
    Support for Snappy built in.
    If using the LinkedIn SerDe, the table creation syntax changes.
  • 12. Avro
    CREATE EXTERNAL TABLE IF NOT EXISTS mytable
      PARTITIONED BY (ds STRING)
      ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
      WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
      STORED AS
        INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
        OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
      LOCATION '/data/mytable';
  • 13. MAKE! MONEY! FAST!
    Use spot instances in EMR
    Usually stick around until America wakes up
    Brilliant for worker nodes
  • 14. Bag of tricks
    set hive.optimize.s3.query=true;
    set hive.cli.print.header=true;
    set hive.exec.max.created.files=xxx;
    set mapred.reduce.tasks=xxx;
    set hive.exec.compress.intermediate=true;
    set hive.exec.parallel=true;
  • 20. To be or not to be
    “Consider a traditional RDBMS”
    At what size should we do this?
    Hive is not an end, it’s the means
    Data on HDFS/S3 is simply available, not “available to Hive”
    Hive isn’t suitable for near real-time
  • 21. Hive != MapReduce
    Don’t use Hive instead of native/streaming MapReduce
    “I know, I’ll just stream this bit through a shell script!”
    IMO, Hive excels at analysis and aggregation, so use it for that
  • 22. Thank you
    Fred Easey (@poppa_f)
    Peter Hanlon
  • 23. Questions?
    pfig@89clouds.com
    @pfig / @89clouds
    http://89clouds.com/