First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA



  1. @
  2. Who is talking?
     • Tomáš Červenka
     • Quick bio:
       • Slovakia
       • Cambridge CompSci + Management
       • Google – AdSense for TV
       • VisualDNA – Software Engineer -> … -> CTO
     • @tomascervenka
  3. What is this talk about?
     • What is Hive?
       • What is it useful for?
       • What is it not useful for?
     • Where to start?
       • Amazon EMR + S3
       • Simple example
     • How do we use Hive at VisualDNA?
       • What is VisualDNA, anyway?
       • Use cases: reporting, analytics, ML
       • Tips and tricks
     • Q&A
  4. What is Hive?
     • Data warehousing solution built on top of Hadoop
     • Input format agnostic: can read CSV, JSON, Thrift, SequenceFile…
     • Initially developed at Facebook; now an Apache project (it started as a Hadoop subproject)
     • In simple terms, gives you an SQL-like interface to query Big Data.
       • HiveQL together with custom mappers and reducers gives you enough flexibility to write most data-processing back-ends.
     • Hive compiles your HiveQL queries into a set of MapReduce jobs and uses Hadoop to deliver the results of the query.
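To make the SQL-like interface concrete, a minimal sketch (the `events` table is hypothetical here; even this one-liner gets compiled into a full MapReduce job):

```sql
-- Count events per action type; Hive turns the GROUP BY into a
-- map phase (emit the action) plus a reduce phase (count per key).
SELECT action, count(1) AS n
FROM events
GROUP BY action;
```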
  5. Why is HiveQL important?
  6. What is Hive useful for?
     • Big Data analytics
       • Running queries over large semi-structured datasets
       • Makes filtering, aggregation, joins etc. very easy
       • Hides the complexity of MR => used by non-developers
     • Big Data processing
       • Efficient and effective way to write data pipelines
       • Easy way to parallelise computationally complex queries
       • Scales nicely with amount of data and cluster size
  7. What is Hive not useful for?
     • Real-time analytics or processing
       • Even small queries can take tens of seconds or minutes
       • Can't build Hive (or Hadoop, for that matter) into a real-time flow
     • Algorithms which are difficult to parallelise
       • Almost everything can be expressed as a number of MR steps
       • Almost always MR is sub-optimal
       • If your data is small, R or scripting is often better and faster
     • Another downside: Hive (on EMR) tends to be a pain to debug
  8. How to start with Hive?
     • Build your own Hadoop cluster + install Hive
       • The "right" way to do it; might take some time for a multi-node setup
     • Spin up an EMR cluster
       • The quick and cheap way to do it.
       • You need an Amazon AWS account and some data on S3.
       • You need the EMR ruby library installed and configured locally.
       • You need to spin up an EMR cluster in interactive mode. Voilà.

       $ emr --create --alive --name "MY JOB" --hive-interactive \
           --num-instances 8 --instance-type cc1.4xlarge \
           --hive-versions 0.8.1 --bootstrap-action "s3://my-bucket/emr-bootstrap"
  9. Getting Started with Hive
     • SSH into your cluster (your namenode)
       $ emr --ssh j-AHF0QE733K8F
     • Run screen (you'll quickly find out why)
       $ screen
     • Run Hive
       $ hive
     • Welcome to the Hive interface!
     • Monitor Hive
       $ elinks http://localhost:9100
     • Terminate the cluster
       $ emr --terminate j-AHF0QE733K8F
  10. Example – CTR by ad by day

      add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;

      -- The SerDe class is the one shipped in the jsonserde.jar added above
      CREATE EXTERNAL TABLE events (time string, action string, id string)
      PARTITIONED BY (d string)
      ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
      WITH SERDEPROPERTIES (paths='time,action,id')
      LOCATION 's3://my-bucket/events/';

      ALTER TABLE events ADD PARTITION (d='2012-07-09');
      ALTER TABLE events ADD PARTITION (d='2012-07-08');
      ALTER TABLE events ADD PARTITION (d='2012-07-07');
  11. Example – CTR by ad by day

      CREATE EXTERNAL TABLE ad_stats (
        d string, id string, impressions bigint, clicks bigint, ctr float)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
      LOCATION 's3://my-bucket/ad-stats/';

      INSERT OVERWRITE TABLE ad_stats
      SELECT i.d,, impressions, clicks, clicks / impressions
      FROM
        (SELECT d, id, count(1) AS clicks
         FROM events WHERE action = 'CLICK' GROUP BY d, id) c
      FULL OUTER JOIN
        (SELECT d, id, count(1) AS impressions
         FROM events WHERE action = 'IMPRESSION' GROUP BY d, id) i
      ON (c.d = i.d AND =;
  12. What is VisualDNA, anyway?
     • Leading audience-profiling technology and audience data provider
     • ≈ 100 million monthly uniques reached globally
     • Data for ad targeting, risk, personalisation, recommendations…
     • Uses Cassandra, Hadoop, Hive, Redis, Scala, PHP and Java in production
     • Running on a mix of AWS and physical HW in London
     • We're hiring (like mad): Back-End, Front-End and Research Engineers
  13. How do we use Hive?
     • Main data source: events
     • Events are associated (mainly) with user actions
       • Conversions, pageviews, impressions, clicks, syncs…
       • Contain user ID, timestamp, browser info, geo, event info…
     • Roughly 50M of them a day = 50 GB of text
       • JSON format, one JSON object per line, validated input
       • Coming from 8 event trackers, rotated every 5 mins
       • Partitioned by date (d=2012-07-09)
       • Storing all of them on S3. Never deleting anything.
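Because the table is partitioned by date, a query that filters on d touches only that day's S3 prefix rather than the whole never-deleted history. A hedged sketch (bucket path as in the example slides):

```sql
-- Partition pruning: only s3://my-bucket/events/d=2012-07-09/
-- is scanned, not every event ever stored.
SELECT count(1)
FROM events
WHERE d = '2012-07-09';
```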
  14. Use case #1: Analytics
     • Analytics queries on our events table
       • How many people started each quiz in the last 3 months?
       • Give me the IDs of people visiting the football section on the Mirror today.
       • Give me a histogram of the frequency of visits per user.
       • …
     • Best thing about Hive: non-developers use it (after we wrote a wiki)
       • Can simplify further by using Karmasphere on AWS
     • Downsides:
       • Takes time to spin up the cluster on AWS.
       • Takes time to execute simple queries. Very big queries often fail.
       • Replacing a lot of the "how many" queries with Redis.
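As an illustration, the visits-per-user histogram above could be written along these lines (a sketch; the 'PAGEVIEW' action value is an assumption):

```sql
-- Inner query: pageviews per user.
-- Outer query: how many users had each visit count.
SELECT visits, count(1) AS users
FROM (
  SELECT id, count(1) AS visits
  FROM events
  WHERE action = 'PAGEVIEW'
  GROUP BY id
) per_user
GROUP BY visits;
```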
  15. Use case #2: Reporting pipeline
     • Interactive mode in Hive is only part of the picture.
     • Hive can also run scripted queries for you:

       $ emr --create --hive-script --name "Test" --num-instances 2 \
           --slave-instance-type cc1.4xlarge --master-instance-type c1.medium \
           --arg hive_script.q --args "-d,PARTITION=2012-07-09,-d,RUNDATE=2012-07-10"

       • Note: arguments are accessible in the Hive query as ${PARTITION}
       • Rule of thumb: always run queries by hand first; script them only once you're sure they work
     • Reporting is repeated analytics => similar queries, but run regularly
     • Hive drives a lot of our reporting tools and provides data for Redis
     • We use cron + bash scripts to schedule, run and monitor Hive jobs
       • Poll emr for the status of the job until it finishes (success or fail)
       • Suggestions for better tools?
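A sketch of what a hive_script.q driven this way might look like (the query itself is hypothetical; the ${...} placeholders are substituted from the -d arguments on the emr command line):

```sql
-- hive_script.q: register yesterday's partition, then aggregate it.
ALTER TABLE events ADD IF NOT EXISTS PARTITION (d='${PARTITION}');

SELECT action, count(1)
FROM events
WHERE d = '${PARTITION}'
GROUP BY action;
```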
  16. Use case #3: Inference Engine (ML)
     • The Inference Engine helps us scale audience data to 100M+ profiles
     • In principle, it extrapolates quiz profile data over user behaviour online
     • At its heart, it's a few hundred lines of Hive queries
     • Every day, it fetches users from Cassandra and sifts through events:
       • Update profiles for pages visited by profiled users yesterday
       • Update profiles for users based on their behaviour yesterday
     • Input is about 2M users and 50M events; output is 5–10M user profiles
     • Runs in < 3 hours with 10 large instances -> parallelises nicely
       • Could have used Apache Mahout, but it was single-threaded back then
     • Biggest issues? Global sorts, and running out of memory/disk on joins.
  17. Tips and Tricks
     • Performance related
       • On AWS, S3 is often the bottleneck. Use cc1.* or cc2.* instances.
       • Copy from S3 to an internal table if you query it multiple times.
       • Use compression for output. Plenty of CPU cycles for this.
       • Use the SequenceFile format and internal tables where applicable.
       • Use MapJoin wherever possible (SELECT /*+ MAPJOIN(table) */).
       • Avoid SerDes and TRANSFORM mappers if possible.
       • Don't sort (ORDER BY) unless you really have to => 1 reducer.
       • Partition your data (input and/or output) if you can.
     • Might make your life easier
       • If queries start stalling, add more instances. Debugging is painful.
       • Use arguments to pass in commands / partitions (if you need to).
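A hedged sketch combining two of the tips above, output compression and the MapJoin hint (the `ads` lookup table is hypothetical):

```sql
-- Compress job output: S3 bandwidth is the bottleneck, CPU is cheap.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- MapJoin: the small ads table is held in memory on each mapper,
-- so the join needs no shuffle/reduce phase at all.
SELECT /*+ MAPJOIN(a) */ e.d,, count(1) AS impressions
FROM events e
JOIN ads a ON ( =
GROUP BY e.d,;
```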
  18. Q&A
     • Thank you for your time!
     • Hope this was a bit useful – let me know your feedback.
     • Any questions?