Simplifying the Use of Hive with The Hive Query Tool

The slides from my talk about a tool we've developed at TripAdvisor and open-sourced. The Hive Query Tool makes it easy for non-technical end users to run highly customizable reports on Hive.

Usage Rights

CC Attribution-ShareAlike License

  • Hi Stephen, we were faced with a similar requirement from the business recently, so we are glad to see someone already did this and open-sourced it. However, we have already built a very simple web app that uses Hive Thrift JDBC, and users liked it. 'The Hive Thrift Server is horrible': I couldn't agree more. Even the guys at Tableau (or anyone dependent on Thrift) feel similarly (http://kb.tableausoftware.com/articles/knowledgebase/administering-hadoop-hive , search 'Known limitations'). While your HQT looks very tempting with more features, we will wait for a while before adopting it (training ourselves on Perl being the topmost reason). We have plans to let users enter questions in plain English instead of having to fill in forms. Right now, we are doing a feasibility study of implementing it. Do you have any plans to provide a natural language interface to HQT? Any thoughts or comments on the NLI feature? Thanks, Eswara

    Simplifying the Use of Hive with The Hive Query Tool: Presentation Transcript

    • Simplifying the use of Hive with the Hive Query Tool
        Stephen R. Scaffidi, sscaffidi@tripadvisor.com
        Hadoop Summit, San Jose 2013
        http://tripadvisor.com/careers
    • Talk Outline
        • Introduction
        • What is the Hive Query Tool (HQT)?
        • Why did we build it?
        • How it’s being used today
        • Design & system requirements
        • HQT Query Templates
        • Getting the source, building, and running
        • Future plans & possibilities
    • Introduction
    • About Me
        • Sr. Software Engineer at TripAdvisor
        • Data Warehouse Engineering Group
        • Mildly obsessed with making things “Just Work”
        • OK, more than mildly...
        • Varied background, from PC Tech to Email Admin to Telco NMS to Lisp Hacker, etc...
        • Thrives on making computers do the work
        • No Hadoop experience before joining
    • About my team
        • Data Warehouse Engineering
        • Small, focused, tenacious group
        • Varied skills and backgrounds
        • We keep the elephants fed and healthy
        • We help others in the company make use of the facilities provided on the clusters
        • DevOps in every sense of the term
    • About TripAdvisor
        • Awesome place to work
        • It really feels like we’re one big team
        • Always new challenges and things to learn
        • Smart, driven and genuinely *nice* people
        • Offices around the world
        • Great benefits
        • We’re hiring!
    • What is the Hive Query Tool?
    • A simple web interface for running reports on Hive
        • Our Specific Goals/Needs:
            • Easy to use for non-technical people
            • More flexible query customization than simple variable interpolation
            • Relatively easy installation and administration
            • Allow jobs to run with different scheduler queues and users
            • Performance equal-to or better-than plain Hive
    • Easy for non-technical end-users
        • Intended for use by non-technical people:
            • Sales, Marketing, Customer-Relations, etc.
            • People who don’t know anything about Hadoop or Hive (and don’t need to)
            • People who don’t live in a *nix shell
        • No need to even know anything about SQL!
    • Flexible Query Customization
        • Other solutions we looked at were too limited
        • We needed to give the users something more powerful than simple variable substitution.
        • HQT’s template system can generate and insert arbitrary HQL clauses into a query based on a user’s input to a simple web interface.
    • Easy Install and Administration
        • If we were going to build our own, we didn’t want maintenance to be *another* full-time job
        • Internal adoption by other engineers was important
            • Java hackers don’t want to deal with a 23.5-step install and configure process
            • Especially if it’s not written in Java
        • Check out the source, run the setup script, edit a single config file and run the startup scripts.
    • Run jobs with different Users and Queues
        • Face it, the Hive Thrift Server is horrible
        • Most other user-friendly Hive front-ends use it
            • So they have all its limitations
            • And its bugs
        • The HQT simply spawns a Hive CLI for each job, using sudo to change users when necessary.
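The "one Hive CLI per job, sudo when necessary" idea can be sketched as a function that only builds the command line. This is an illustrative Python sketch, not the HQT's Perl code: the function name `hive_command` and its parameters are hypothetical, and it assumes the scheduler queue is selected with the Hadoop 1.x property `mapred.job.queue.name` passed through the Hive CLI's `--hiveconf` option.

```python
def hive_command(hql, run_as=None, queue=None):
    """Build the argv for one Hive CLI job (hypothetical helper, not HQT code)."""
    cmd = ["hive"]
    if queue:
        # Assumption: the queue is chosen via the MapReduce queue-name property.
        cmd += ["--hiveconf", f"mapred.job.queue.name={queue}"]
    # -e runs the given query string and exits.
    cmd += ["-e", hql]
    if run_as:
        # sudo -u switches to the target user before starting the CLI.
        cmd = ["sudo", "-u", run_as] + cmd
    return cmd


print(hive_command("SELECT foo FROM baz", run_as="reports", queue="adhoc"))
```

Returning a list rather than a shell string means the HQL never passes through shell quoting, which matters when the query text is generated from user input.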
    • Performance?
        • Some options we looked at before building the HQT did a whole lot more
        • Some claimed to be faster than Hive.
        • Some of these options had so much overhead that they were slower than using Hive directly!
        • The HQT simply runs HQL code through the standard Hive CLI. No overhead, no difference in performance over plain-vanilla Hive.
    • Why did we need this?
    • Making the data accessible
        • The data we pump into our Hadoop clusters is full of valuable information to our business
        • And more is fed into our Hive tables every day
        • And more people need access to that data every day
        • But not all of those people are 733t h4(k3r engineers 😉
    • Making the data accessible
        • The target users may not know Linux and Java and SQL...
        • But they do know how they want the data filtered and correlated and aggregated.
        • We needed a way to let them run queries where they could choose these parameters with a high degree of flexibility...
        • But without having to teach them all HQL
    • Looked at what was available...
        • Nothing else we looked at seemed to satisfy all our requirements.
        • Some that looked interesting unfortunately had terrible performance, as they did not use Hive directly.
        • Not that everything we looked at was terrible – some solutions were really quite impressive.
        • But it came down to a classic question in tech-oriented businesses...
    • The bottom line
        • We knew what we wanted
        • We knew what we wanted wasn’t particularly complex
        • We asked ourselves if we could just build something that gives us exactly what we need
        • And would that effort cost less than trying to make something else work the way we wanted?
        • A “Eureka!” moment and a rough prototype answered the question 😉
    • HQT Use at TripAdvisor
    • A surprise hit
        • Some interested people tried the prototype
        • Liked how it worked, requested more features
        • Other groups became interested
        • Even committed engineering resources to help get it to “beta”
        • It’s now being used across the company
        • New report templates constantly being added
        • (sorry, those aren’t available publicly)
    • Company-wide adoption
        • End users find it easy to use and relatively convenient.
        • Template authors have found it easy to create and modify report templates.
        • Users include people in Sales, Marketing, Commerce, and even Legal!
        • Weekly peak usage at over 40 simultaneous Hive jobs – on a single server. (We’ve actually had to add throttling to keep HQT jobs from using too many mapred slots.)
    • HQT Design
    • Architecture: Front-End
        • Web interface
        • Handles user authentication
        • Processes HQT Templates to determine...
            • What options/input elements to present the user
            • How to process and validate input from the user
            • What HQL to send to the back-end
        • Gets job progress and status info from the back-end
        • Doesn’t do much else
    • Architecture: Back-End
        • Presents a “json/rest-like” interface over HTTP to receive requests from the front-end
        • Uses an event-loop instead of threads
        • Spawns Hive CLI instances to run submitted HQL
        • Tracks and parses output from each instance
        • Watches CLI instances for progress and errors
        • Processes results for retrieval by users
        • Sends email notifications
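The back-end pattern described above (an event loop that spawns one child process per job and tracks its output, rather than one thread per job) can be sketched as follows. The HQT does this in Perl with AnyEvent; the sketch below uses Python's `asyncio` purely for illustration, and the names `run_job`, `run_all` and the `status` dictionary layout are assumptions. A tiny Python child process stands in for the Hive CLI so the example runs anywhere.

```python
import asyncio
import sys


async def run_job(job_id, argv, status):
    """Spawn one child process and record its output lines and final state."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    status[job_id] = {"state": "running", "lines": []}
    # Reading line by line yields control to the event loop between lines,
    # so many jobs can be watched concurrently without threads.
    async for raw in proc.stdout:
        status[job_id]["lines"].append(raw.decode().rstrip())
    exit_code = await proc.wait()
    status[job_id]["state"] = "done" if exit_code == 0 else "failed"


async def run_all(jobs):
    """Run every job concurrently on one event loop and return the status table."""
    status = {}
    await asyncio.gather(*(run_job(job_id, argv, status) for job_id, argv in jobs.items()))
    return status


if __name__ == "__main__":
    # Stand-in commands; a real back-end would launch the Hive CLI here.
    demo_jobs = {
        "ok": [sys.executable, "-c", "print('row1')"],
        "bad": [sys.executable, "-c", "import sys; sys.exit(2)"],
    }
    print(asyncio.run(run_all(demo_jobs)))
```

Throttling of the kind mentioned earlier in the talk could be layered on by acquiring an `asyncio.Semaphore` inside `run_job` before spawning the process.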
    • Template System
        • The “special sauce” of the HQT
        • The template “language” is designed so that “directives” concisely express a whole lot:
            • What input to gather (and optionally what kind)
            • How to validate that input
            • What output to generate and how to format it
        • It’s a little tricky to explain
        • But extremely flexible
        • More details shortly...
    • Language & Frameworks
        • Written in Perl
        • Uses lots of components from the CPAN
            • Front-end web framework is Mojolicious
            • Template System uses Text::Template
            • Back-end uses AnyEvent
            • Most classes built using Moo
        • Decent example of “Modern Perl”, but is still a work-in-progress.
    • System Requirements
        • Requires Perl 5.10.1 or newer
        • Hadoop & Hive clients & libs should already be installed and configured
        • Does *not* require root or root access
        • LDAP & sudo should be configured if you want to run jobs as different users.
        • Web-server is built-in, but can run under just about any setup you want
    • Current State
        • The front-end code is rather nice
            • MVC-style web app code
            • Uses Mojolicious .epl templates for web content, which are very similar to .erb
        • Back-end code is kind of hairy
            • AnyEvent is fairly low-level
            • REST/json stuff is too mixed with the code that wraps the Hive CLI processes.
            • It shouldn’t be responsible for sending email!
    • Current State, contd.
        • Template-system code:
            • Fairly simple code, but allows for a lot of interesting functionality.
            • Other engineers seem to think it’s fine...
            • But I think it needs refactoring
                ∙ Too much “action at a distance”
                ∙ Template evaluation is a big security risk
                ∙ Should use OO instead of ad-hoc data structures
                ∙ Etc...
    • The HQL Template System
    • Template System
        • Template code blocks are embedded into otherwise normal HQL:
            {{ begin_main_select }}
            SELECT foo, bar FROM baz
            WHERE ds={{ insert_var date => { type => 'date', default => days_ago_ymd(3) } }}
            {{ append_where { columns => { wibble => 'string', wobble => 'int' } } }}
    • Template System
        • Template functions/“directives” simultaneously define...
            • What input options to present the user
            • Input validation
            • What to insert into the HQL based on the input
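How a single directive can declare a form field, validate the submitted value, and emit an HQL fragment all at once can be sketched like this. It is a Python illustration of the idea only: the class `Directives`, the two-phase use of `user_input`, and the quoting rules are assumptions, and the real `insert_var` in the HQT is a Perl function whose internals the slides do not show.

```python
from datetime import datetime


class Directives:
    """Hypothetical stand-in for the namespace a template is evaluated in."""

    def __init__(self, user_input=None):
        self.user_input = user_input  # None means "only collect the form fields"
        self.form_fields = {}         # name -> spec, used to draw the web form

    def insert_var(self, name, type="string", default=None):
        # 1. Declare the input element the web form should present.
        self.form_fields[name] = {"type": type, "default": default}
        if self.user_input is None:
            return ""
        # 2. Validate what the user submitted (falling back to the default).
        value = self.user_input.get(name, default)
        if value is None:
            raise ValueError(f"missing value for {name}")
        if type == "date":
            datetime.strptime(value, "%Y-%m-%d")  # raises ValueError if malformed
            return f"'{value}'"
        if type == "int":
            return str(int(value))  # raises ValueError if not an integer
        # 3. Emit a quoted HQL literal for plain strings.
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"


# First pass: no input yet, so the directive only describes the form.
form_pass = Directives()
form_pass.insert_var("date", type="date", default="2013-06-24")
print(form_pass.form_fields)

# Second pass: the same call now validates and renders the user's value.
render_pass = Directives({"date": "2013-06-26"})
print("WHERE ds=" + render_pass.insert_var("date", type="date"))
```

Keeping the field declaration and the rendering in one call is what lets a template author add a new form input by editing only the query text.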
    • Template System
        So, this...
            {{ begin_main_select }}
            SELECT foo, bar FROM baz
            WHERE ds={{ insert_var date => { type => 'date', default => days_ago_ymd(3) } }}
            {{ append_where { columns => { wibble => 'string', wobble => 'int' } } }}
    • Template System
        Renders this: [screenshot of the generated web form; not included in the transcript]
    • Template System
        Which when filled out like this: [screenshot of the completed form; not included in the transcript]
    • Template System
        Generates HQL like this: [generated HQL; not included in the transcript]
    • Template Engine
        • Didn’t build anything new, just used the existing Text::Template module in a clever way
        • Template blocks are just Perl code, evaluated in a specified package/namespace.
        • Used some trickery to make it look a little less like Perl, but nothing fancy.
        • The things that look like “directives” are just functions.
        • Lots of functions defined in that namespace...
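The "blocks are just code evaluated in a namespace, directives are just functions" design can be sketched in a few lines. The HQT does this with Perl's Text::Template; the Python version below is only an analogy, and the names `render`, `BLOCK` and the sample namespace are hypothetical. Like the original, it evaluates template code directly, so it carries the same security risk the talk points out and should only ever see trusted templates.

```python
import re

# Matches one {{ ... }} block; non-greedy so adjacent blocks stay separate.
BLOCK = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def render(template, namespace):
    """Replace each {{ ... }} block with the result of evaluating it.

    The block body is ordinary Python evaluated against `namespace`,
    so anything that looks like a directive is just a function in that dict.
    WARNING: eval() on untrusted templates is unsafe; this is not a sandbox.
    """
    def run_block(match):
        result = eval(match.group(1).strip(), dict(namespace))
        return "" if result is None else str(result)

    return BLOCK.sub(run_block, template)


# Hypothetical directive functions for the demo.
namespace = {
    "table": lambda: "baz",
    "quoted": lambda value: f"'{value}'",
}
print(render("SELECT foo, bar FROM {{ table() }} WHERE ds={{ quoted('2013-06-26') }}", namespace))
```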
    • Template Functions
        • Functions available for:
            • Simple value insertion/substitution
            • Adding & extending WHERE clauses
            • Adding & extending GROUP BY clauses
            • Setting defaults
            • Manipulating and comparing dates
            • Parameter validation
            • Plus a lot of misc utils and support functions that probably should be in a different module.
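Two of the function families listed above, date manipulation and WHERE-clause extension, might look roughly like this. The names `days_ago_ymd` and `append_where` come from the slides, but everything about their behaviour here is an assumption made for illustration: the `YYYY-MM-DD` output format, the equality-only comparisons, and the rule that empty inputs are skipped. The code is Python, not the HQT's Perl.

```python
from datetime import date, timedelta


def days_ago_ymd(days, today=None):
    """Return the date `days` days before today, assumed here to be YYYY-MM-DD."""
    today = today or date.today()
    return (today - timedelta(days=days)).strftime("%Y-%m-%d")


def append_where(columns, user_input):
    """Build extra ' AND col = value' conditions for the columns the user filled in.

    `columns` maps column name -> 'string' or 'int'; blank inputs add nothing,
    so the base query is left unchanged when the user skips a field.
    """
    clauses = []
    for name, kind in columns.items():
        value = user_input.get(name)
        if value in (None, ""):
            continue
        if kind == "int":
            clauses.append(f"{name} = {int(value)}")  # ValueError on bad input
        else:
            escaped = str(value).replace("'", "\\'")
            clauses.append(f"{name} = '{escaped}'")
    return "".join(f" AND {clause}" for clause in clauses)


base = "SELECT foo, bar FROM baz WHERE ds='" + days_ago_ymd(3, today=date(2013, 6, 27)) + "'"
extra = append_where({"wibble": "string", "wobble": "int"}, {"wobble": "42"})
print(base + extra)
```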
    • Template Files
        • Simple format – a YAML header followed by templatized HQL code like you saw earlier:
            id: pageviews_uniques
            name: Daily Pageviews and Unique Visitors
            description: >
              Any description which will appear on the page.
              <i>May include HTML</i>
            author: optional
            ...
            {{ begin_main_select() }}
            SELECT foo, bar FROM baz
            WHERE ds={{ insert_var date => { type => 'date' } }}
    • Issues
        • Code in the web-app depends on the structure of data internal to the template module.
            • Would take a lot of work to fix, but worth it.
        • Template evaluation is a potential security nightmare.
            • Perl does have a sandbox module for this sort of thing, though. I just need to RTFM and use it.
        • The APIs of the various functions aren’t entirely consistent, but not too bad
            • Will definitely fix for next release.
    • Try it for yourself
    • Availability
        • Source code available on GitHub now:
            • https://github.com/tripadvisor/hive-query-tool
            • Apache 2.0 Licensed
        • Modest system prerequisites
        • Automated download and installation of all dependencies
        • Works on a variety of platforms
        • Bug Reports, Feature Requests and Pull Requests all *very welcome*
    • The future?
    • Future Plans
        • Complete rewrite of the back-end for cleaner and more flexible code.
        • Implement sandboxing for template security
        • More user-oriented features, like
            • Ability to save pre-filled query reports
            • Better management of past and running jobs
            • Better status info from the backend
            • Column-headers in report output
        • An administrator dashboard/console
        • Bug-fixes, feature enhancements, lots more
    • Future Possibilities
        • Workflow & Scheduling functionality
        • Separate template system for stand-alone use
        • Make the back-end good enough to be a viable replacement for the Hive Thrift Server.
        • Add template functions for joins, sub-selects, and lots of other HQL constructs that aren’t yet customizable.
        • Add ability for a single template to define multiple queries delivering multiple result-sets.
        • Rewrite in Perl 6 😉
    • Any Questions?