Simplifying the Use of Hive with The Hive Query Tool



The slides from my talk about a tool we've developed at TripAdvisor and open-sourced. The Hive Query Tool makes it easy for non-technical end users to run highly customizable reports on Hive.

Published in: Technology
  • Hi Stephen, we were faced with a similar requirement from the business recently. So glad to see someone already did this and open-sourced it. However, we have already built a very simple web app that uses Hive Thrift JDBC, and users liked it. 'The Hive Thrift Server is horrible': I couldn't agree more. Even the guys at Tableau (or anyone dependent on Thrift) feel similarly (search 'Known limitations'). While your HQT looks very tempting with more features, we will wait a while before adopting it (training ourselves on Perl being the topmost reason). We plan to let users enter questions in plain English instead of having to fill out forms. Right now, we are doing a feasibility study of implementing it. Do you have any plans to provide a natural language interface to HQT? Any thoughts/comments on the NLI feature? Thanks, Eswara

Simplifying the Use of Hive with The Hive Query Tool

  1. Simplifying the use of Hive with the Hive Query Tool
     Stephen R. Scaffidi
     sscaffidi@tripadvisor.com
     Hadoop Summit
     San Jose 2013
  2. Talk Outline
     • Introduction
     • What is the Hive Query Tool (HQT)?
     • Why did we build it?
     • How it’s being used today
     • Design & system requirements
     • HQT Query Templates
     • Getting the source, building, and running
     • Future plans & possibilities
  3. Introduction
  4. Introduction
     Me
     • Sr. Software Engineer at TripAdvisor
     • Data Warehouse Engineering Group
     • Mildly obsessed with making things “Just Work”
     • OK, more than mildly...
     • Varied background, from PC Tech to Email Admin to Telco NMS to Lisp Hacker, etc...
     • Thrives on making computers do the work
     • No Hadoop experience before joining
  5. Introduction
     my team
     • Data Warehouse Engineering
     • Small, focused, tenacious group
     • Varied skills and backgrounds
     • We keep the elephants fed and healthy
     • We help others in the company make use of the facilities provided on the clusters
     • DevOps in every sense of the term
  6. Introduction
     TripAdvisor
     • Awesome place to work
     • It really feels like we’re one big team
     • Always new challenges and things to learn
     • Smart, driven and genuinely *nice* people
     • Offices around the world
     • Great benefits
     • We’re hiring!
  7. What is the Hive Query Tool?
  8. What is the Hive Query Tool?
     simple web interface for running reports on Hive
     • Our Specific Goals/Needs:
       • Easy to use for non-technical people
       • More flexible query customization than simple variable interpolation
       • Relatively easy installation and administration
       • Allow jobs to run with different scheduler queues and users
       • Performance equal-to or better-than plain Hive
  9. What is the Hive Query Tool?
     for non-technical end-users
     • Intended for use by non-technical people:
       • Sales, Marketing, Customer-Relations, etc.
       • People who don’t know anything about Hadoop or Hive (or need to)
       • People who don’t live in a *nix shell
     • No need to even know anything about SQL!
  10. What is the Hive Query Tool?
      Query Customization
      • Other solutions we looked at were too limited
      • We needed to give the users something more powerful than simple variable substitution.
      • HQT’s template system can generate and insert arbitrary HQL clauses into a query based on a user’s input to a simple web interface.
  11. What is the Hive Query Tool?
      Install and Administration
      • If we were going to build our own, we didn’t want maintenance to be *another* full-time job
      • Internal adoption by other engineers was important
      • Java hackers don’t want to deal with a 23.5-step install and configure process
      • Especially if it’s not written in Java
      • Check-out the source, run the setup script, edit a single config file and run the startup scripts.
  12. What is the Hive Query Tool?
      jobs with different Users and Queues
      • Face it, the Hive Thrift Server is horrible
      • Most other user-friendly Hive front-ends use it
      • So they have all its limitations
      • And its bugs
      • The HQT simply spawns a Hive CLI for each job, using sudo to change users when necessary.
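As a rough illustration of the one-CLI-per-job approach on this slide, the sketch below builds the command line for a single job. HQT itself is written in Perl; this is Python for illustration only. The function name `build_hive_command`, its parameters, and the use of the `mapred.job.queue.name` property to pick the scheduler queue are assumptions, not taken from the HQT source.

```python
def build_hive_command(hql, run_as=None, queue=None, current_user=None):
    """Return the argv list for one Hive CLI job (hypothetical helper)."""
    cmd = ["hive"]
    if queue:
        # Assumed way to select a queue: pass a Hadoop property via --hiveconf.
        cmd += ["--hiveconf", f"mapred.job.queue.name={queue}"]
    # -e tells the Hive CLI to run the given query string and exit.
    cmd += ["-e", hql]
    if run_as and run_as != current_user:
        # Only wrap in sudo when the job must run as a different user.
        cmd = ["sudo", "-u", run_as] + cmd
    return cmd
```

Because the result is a list rather than a shell string, it could be handed to a process spawner directly, which sidesteps shell quoting of the HQL.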
  13. What is the Hive Query Tool?
      • Some options we looked at before building the HQT did a whole lot more
      • Some claimed to be faster than Hive.
      • Some of these options had so much overhead that they were slower than using Hive directly!
      • The HQT simply runs HQL code thru the standard Hive CLI. No overhead, no difference in performance over plain-vanilla Hive.
  14. Why did we need this?
  15. Why did we need the HQT?
      the data accessible
      • The data we pump into our Hadoop clusters is full of valuable information to our business
      • And more is fed into our Hive tables every day
      • And more people need access to that data every day
      • But not all of those people are 733t h4(k3r engineers 😉
  16. Why did we need the HQT?
      the data accessible
      • The target users may not know Linux and Java and SQL...
      • But they do know how they want the data filtered and correlated and aggregated.
      • We needed a way to let them run queries where they could choose these parameters with a high degree of flexibility...
      • But without having to teach them all HQL
  17. Why did we need the HQT?
      at what was available...
      • Nothing else we looked at seemed to satisfy all our requirements.
      • Some that looked interesting, unfortunately had terrible performance, as they did not use Hive directly.
      • Not that everything we looked at was terrible – some solutions were really quite impressive.
      • But it came down to a classic question in tech-oriented businesses...
  18. Why did we need the HQT?
      bottom line
      • We knew what we wanted
      • We knew what we wanted wasn’t particularly complex
      • We asked ourselves if we could just build something that gives us exactly what we need
      • And would that effort cost less than trying to make something else work the way we wanted?
      • A “Eureka!” moment and a rough prototype answered the question 😉
  19. HQT Use at TripAdvisor
  20. HQT Use at TripAdvisor
      surprise hit
      • Some interested people tried the prototype
      • Liked how it worked, requested more features
      • Other groups became interested
      • Even committed engineering resources to help get it to “beta”
      • It’s now being used across the company
      • New report templates constantly being added
      • (sorry, those aren’t available publicly)
  21. HQT Use at TripAdvisor
      adoption
      • End users find it easy to use and relatively convenient.
      • Template authors have found it easy to create and modify report templates.
      • Users include people in Sales, Marketing, Commerce, and even Legal!
      • Weekly peak usage at over 40 simultaneous Hive jobs – on a single server.
        (we’ve actually had to add throttling to keep HQT jobs from using too many mapred slots)
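The slide does not say how the throttling works. One simple way to get a similar effect is to cap the number of jobs running at once and queue the rest, sketched below in Python (HQT itself is Perl). The class `JobThrottle` and its methods are hypothetical, and capping job count is a simplification of limiting mapred slots.

```python
from collections import deque


class JobThrottle:
    """Caps concurrently running jobs; extra jobs wait in FIFO order."""

    def __init__(self, max_running):
        self.max_running = max_running
        self.running = set()
        self.waiting = deque()

    def submit(self, job_id):
        """Start the job if there is room; otherwise queue it. Returns True if started."""
        if len(self.running) < self.max_running:
            self.running.add(job_id)
            return True
        self.waiting.append(job_id)
        return False

    def finish(self, job_id):
        """Mark a job done and promote the next waiting job, returning its id (or None)."""
        self.running.discard(job_id)
        if self.waiting and len(self.running) < self.max_running:
            next_job = self.waiting.popleft()
            self.running.add(next_job)
            return next_job
        return None
```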
  22. HQT Design
  23. HQT Design
      Front-End
      • Web interface
      • Handles user authentication
      • Processes HQT Templates to determine...
        • What options/input elements to present the user
        • How to process and validate input from the user
        • What HQL to send to the back-end
      • Gets job progress and status info from the back-end
      • Doesn’t do much else
  24. HQT Design
      Back-End
      • Presents a “json/rest-like” interface over HTTP to receive requests from the front-end
      • Uses an event-loop instead of threads
      • Spawns Hive CLI instances to run submitted HQL
      • Tracks and parses output from each instance
      • Watches CLI instances for progress and errors
      • Processes results for retrieval by users
      • Sends email notifications
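To make the event-loop idea concrete, here is a minimal sketch using Python's `asyncio` as a stand-in for AnyEvent. It spawns a child process, reads its output line by line without blocking other jobs, and reports progress. The names `run_job` and `parse_progress` are hypothetical, and the progress pattern assumes Hive CLI lines of the form `Stage-1 map = 50%,  reduce = 0%`; the real HQT parsing may differ.

```python
import asyncio
import re
import sys

# Assumed shape of a Hive CLI progress line, e.g. "Stage-1 map = 50%,  reduce = 0%".
PROGRESS = re.compile(r"map = (\d+)%,\s+reduce = (\d+)%")


def parse_progress(line):
    """Return (map_pct, reduce_pct) if the line reports progress, else None."""
    match = PROGRESS.search(line)
    return (int(match.group(1)), int(match.group(2))) if match else None


async def run_job(argv, on_progress):
    """Run one child process, feeding progress tuples to a callback."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,  # merge stderr so one reader sees everything
    )
    async for raw in proc.stdout:
        progress = parse_progress(raw.decode(errors="replace"))
        if progress is not None:
            on_progress(progress)
    return await proc.wait()  # exit code of the child
```

Several `run_job` coroutines can be scheduled on the same loop, so many CLI instances are watched from a single thread.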
  25. HQT Design
      System
      • The “special sauce” of the HQT
      • The template “language” is designed so that “directives” concisely express a whole lot:
        • What input to gather (and optionally what kind)
        • How to validate that input
        • What output to generate and how to format it
      • It’s a little tricky to explain
      • But extremely flexible
      • More details shortly...
  26. HQT Design
      & Frameworks
      • Written in Perl
      • Uses lots of components from the CPAN
      • Front-end web framework is Mojolicious
      • Template System uses Text::Template
      • Back-end uses AnyEvent
      • Most classes built using Moo
      • Decent example of “Modern Perl”, but is still a work-in-progress.
  27. HQT Design
      Requirements
      • Requires Perl 5.10.1 or newer
      • Hadoop & Hive clients & libs should already be installed and configured
      • Does *not* require root or root access
      • LDAP & sudo should be configured if you want to run jobs as different users.
      • Web-server is built-in, but can run under just about any setup you want
  28. HQT Design
      State
      • The front-end code is rather nice
        • MVC-style web app code
        • Uses Mojolicious .epl templates for web content, which is very similar to .erb
      • Back-end code is kind of hairy
        • AnyEvent is fairly low-level
        • REST/json stuff too mixed with the code that wraps the Hive CLI processes.
        • It shouldn’t be responsible for sending email!
  29. HQT Design
      State, contd.
      • Template-system code:
        • Fairly simple code, but allows for a lot of interesting functionality.
        • Other engineers seem to think it’s fine...
        • But I think it needs refactoring
          ∙ Too much “action at a distance”
          ∙ Template evaluation is a big security risk
          ∙ Should use OO instead of ad-hoc data structures
          ∙ Etc...
  30. The HQL Template System
  31. Template System
      • Template code blocks are embedded into otherwise normal HQL:

          {{ begin_main_select }}
          SELECT foo, bar FROM baz
          WHERE ds={{insert_var date => { type => 'date', default => days_ago_ymd(3) }}}
          {{append_where { columns => { wibble => 'string', wobble => 'int' }}}}
  32. Template System
      • Template functions/“directives” simultaneously define...
        • What input options to present the user
        • Input validation
        • What to insert into the HQL based on the input
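The triple role of a directive can be sketched as a single function that records an input spec, validates the value, and returns the text to splice into the query. This is a Python illustration only (HQT's directives are Perl functions). The class `TemplateContext`, the `kind` parameter, and the quoting rules are assumptions loosely modeled on the `insert_var` call shown on the slides.

```python
import re
from datetime import date


class TemplateContext:
    """Collects input specs while a template renders (hypothetical design)."""

    def __init__(self, user_input=None):
        self.user_input = user_input or {}
        self.fields = []  # what the web form should ask for

    def insert_var(self, name, kind="string", default=None):
        # 1. Declare the input so a front-end could render a form field for it.
        self.fields.append({"name": name, "kind": kind, "default": default})

        # 2. Validate whatever the user supplied, falling back to the default.
        value = self.user_input.get(name, default)
        if value is None:
            raise ValueError(f"missing value for {name}")
        if kind == "date":
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(value)):
                raise ValueError(f"{name} must look like YYYY-MM-DD")
            date.fromisoformat(str(value))  # raises ValueError for e.g. 2013-02-30
            # 3. Return the text to insert into the HQL.
            return f"'{value}'"
        if kind == "int":
            return str(int(value))  # int() raises ValueError on non-numeric input
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
```

Rendering a template once with no user input would fill `fields` for the form; rendering it again with the submitted values would produce the final HQL.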
  33. Template System
      this...

          {{ begin_main_select }}
          SELECT foo, bar FROM baz
          WHERE ds={{insert_var date => { type => 'date', default => days_ago_ymd(3) }}}
          {{append_where { columns => { wibble => 'string', wobble => 'int' }}}}
  34. Template System
      this:
  35. Template System
      when filled out like this:
  36. Template System
      HQL like this:
  37. Template System
      Engine
      • Didn’t build anything new, just used the existing Text::Template module in a clever way
      • Template blocks are just Perl code, evaluated in a specified package/namespace.
      • Used some trickery to make it look a little less like Perl, but nothing fancy.
      • The things that look like “directives” are just functions.
      • Lots of functions defined in that namespace...
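The "blocks are just code, directives are just functions" idea can be shown with a tiny Python analogue. This is not Text::Template and not HQT's code: the `render` function is hypothetical, and Python's `eval` stands in for Perl evaluation inside a package. Emptying `__builtins__` here only trims the namespace; it is not a real sandbox, which mirrors the security concern raised on the slides.

```python
import re

# Matches one {{ ... }} block; non-greedy so adjacent blocks stay separate.
BLOCK = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def render(template, namespace):
    """Replace each {{ expr }} with the result of evaluating expr in namespace."""

    def run(match):
        # The block body is ordinary code; "directives" are plain functions
        # looked up by name in the supplied namespace.
        result = eval(match.group(1).strip(), {"__builtins__": {}}, namespace)
        return "" if result is None else str(result)

    return BLOCK.sub(run, template)
```

For example, with a namespace that provides a hypothetical `quote` function:

```python
hql = render(
    "SELECT foo FROM baz WHERE ds={{ quote(day) }}",
    {"quote": lambda s: f"'{s}'", "day": "2013-06-23"},
)
```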
  38. Template System
      Functions
      • Functions available for:
        • Simple value insertion/substitution
        • Adding & extending WHERE clauses
        • Adding & extending GROUP BY clauses
        • Setting defaults
        • Manipulating and comparing dates
        • Parameter validation
        • Plus a lot of misc utils and support functions that probably should be in a different module.
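Two of these function families, date manipulation and WHERE-clause extension, might look roughly like the Python sketch below. The names echo `days_ago_ymd` and `append_where` from the slides, but the signatures, the YYYY-MM-DD output format, and the generated `AND` clauses are guesses for illustration, not HQT's actual behavior.

```python
from datetime import date, timedelta


def days_ago_ymd(n, today=None):
    """Date n days before today, formatted YYYY-MM-DD (assumed format)."""
    today = today or date.today()
    return (today - timedelta(days=n)).isoformat()


def append_where(columns, filters):
    """Build extra ' AND col = value' clauses from user-chosen filters.

    columns maps allowed column names to 'string' or 'int';
    filters maps column names to the values the user entered.
    """
    clauses = []
    for name, value in filters.items():
        kind = columns.get(name)
        if kind is None:
            raise ValueError(f"unknown column: {name}")
        if kind == "int":
            clauses.append(f"{name} = {int(value)}")  # int() rejects non-numeric input
        else:
            escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
            clauses.append(f"{name} = '{escaped}'")
    return "".join(f" AND {clause}" for clause in clauses)
```

Restricting filters to a declared column list is one way to keep user input from naming arbitrary columns.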
  39. Template System
      Files
      • Simple format – a YAML header followed by templatized HQL code like you saw earlier:

          id: pageviews_uniques
          name: Daily Pageviews and Unique Visitors
          description: >
            Any description which will appear on the page.
            <i>May include HTML</i>
          author: optional...
          {{ begin_main_select() }}
          SELECT foo, bar FROM baz
          WHERE ds={{ insert_var date => {type => 'date'} }}
  40. Template System
      • Code in the web-app depends on the structure of data internal to the template module.
        • Would take a lot of work to fix, but worth it.
      • Template evaluation is a potential security nightmare.
        • Perl does have a sandbox module for this sort of thing, though. I just need to RTFM and use it.
      • The APIs of the various functions aren’t entirely consistent, but not too bad
        • Will definitely fix for next release.
  41. Try it for yourself
  42. Try for Yourself
      • Source code available on GitHub now:
      • Apache 2.0 Licensed
      • Modest system prerequisites
      • Automated download and installation of all dependencies
      • Works on a variety of platforms
      • Bug Reports, Feature Requests and Pull Requests all *very welcome*
  43. The future?
  44. Future Plans
      • Complete rewrite of the back-end for cleaner and more flexible code.
      • Implement sandboxing for template security
      • More user-oriented features, like
      • Ability to save pre-filled query reports
      • Better management of past and running jobs
      • Better status info from the backend
      • Column-headers in report output
      • An administrator dashboard/console
      • Bug-fixes, feature enhancements, lots more
  45. Future Possibilities
      • Workflow & Scheduling functionality
      • Separate template system for stand-alone use
      • Make the back-end good enough to be a viable replacement for the Hive Thrift Server.
      • Add template functions for joins, sub-selects, and lots of other HQL constructs that aren’t yet customizable.
      • Add ability for a single template to define multiple queries delivering in multiple result-sets.
      • Rewrite in Perl 6 😉
  46. Questions?
      Any Questions?