Simplifying Use of Hive with the Hive Query Tool

As TripAdvisor moves increasing amounts of data into Hadoop and Hive, the need for simplifying, controlling, and expanding access to this data has grown. Having reviewed existing solutions without finding what we needed, we began working on our own solution to meet our specific goals and use-cases. The Hive Query Tool (HQT) is a web interface that allows anybody to configure and run Hive queries without requiring client-side installation or even knowledge of the query language. Users familiar with HQL can add sophisticated and highly customizable queries with a flexible and powerful template system. The template system, a primary innovation, lets a template author define the inputs available to the end-user, the validation checks, and the HQL to generate, easily and concisely. We plan to release the code as open-source.

This talk will discuss:
– The features of the HQT and how it is used for business intelligence
– The challenges it was built to meet and how its design and architecture address them
– Installing and running an HQT server
– How to use, customize, and expand the template system
– Known limitations and issues
– Future plans and features

Simplifying Use of Hive with the Hive Query Tool Presentation Transcript

  • 1. Simplifying the use of Hive with the Hive Query Tool
    Stephen R. Scaffidi, sscaffidi@tripadvisor.com
    Hadoop Summit San Jose 2013
    http://tripadvisor.com/careers
  • 2. Talk Outline
    • Introduction
    • What is the Hive Query Tool (HQT)?
    • Why did we build it?
    • How it’s being used today
    • Design & system requirements
    • HQT Query Templates
    • Getting the source, building, and running
    • Future plans & possibilities
  • 3. Introduction
  • 4. Introduction: About Me
    • Sr. Software Engineer at TripAdvisor
    • Data Warehouse Engineering Group
    • Mildly obsessed with making things “Just Work”
    • OK, more than mildly...
    • Varied background, from PC Tech to Email Admin to Telco NMS to Lisp Hacker, etc...
    • Thrives on making computers do the work
    • No Hadoop experience before joining
  • 5. Introduction: About my team
    • Data Warehouse Engineering
    • Small, focused, tenacious group
    • Varied skills and backgrounds
    • We keep the elephants fed and healthy
    • We help others in the company make use of the facilities provided on the clusters
    • DevOps in every sense of the term
  • 6. Introduction: About TripAdvisor
    • Awesome place to work
    • It really feels like we’re one big team
    • Always new challenges and things to learn
    • Smart, driven and genuinely *nice* people
    • Offices around the world
    • Great benefits
    • We’re hiring!
  • 7. What is the Hive Query Tool?
  • 8. What is the Hive Query Tool? A simple web interface for running reports on Hive
    • Our specific goals/needs:
      ∙ Easy to use for non-technical people
      ∙ More flexible query customization than simple variable interpolation
      ∙ Relatively easy installation and administration
      ∙ Allow jobs to run with different scheduler queues and users
      ∙ Performance equal to or better than plain Hive
  • 9. What is the Hive Query Tool? Easy for non-technical end-users
    • Intended for use by non-technical people:
      ∙ Sales, Marketing, Customer-Relations, etc.
      ∙ People who don’t know anything about Hadoop or Hive (and don’t need to)
      ∙ People who don’t live in a *nix shell
    • No need to even know anything about SQL!
  • 10. What is the Hive Query Tool? Flexible Query Customization
    • Other solutions we looked at were too limited
    • We needed to give the users something more powerful than simple variable substitution.
    • HQT’s template system can generate and insert arbitrary HQL clauses into a query based on a user’s input to a simple web interface.
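
The difference between plain variable substitution and clause generation can be sketched in a few lines. This is an illustrative Python sketch, not HQT's Perl code: the `substitute` and `where_clauses` helpers, the `$name` placeholder syntax, and the column whitelist are all assumptions made for the example (the column names are borrowed from the template slides).

```python
import re

# Hypothetical column whitelist: column name -> expected type.
COLUMNS = {"wibble": "string", "wobble": "int"}


def substitute(template: str, values: dict) -> str:
    """Plain variable substitution: each $name is replaced by its value."""
    return re.sub(r"\$(\w+)", lambda m: str(values[m.group(1)]), template)


def where_clauses(user_input: dict, columns: dict = COLUMNS) -> str:
    """Generate 'AND col = value' clauses only for fields the user filled in."""
    parts = []
    for name, raw in user_input.items():
        if name not in columns:
            raise ValueError(f"unknown column: {name}")
        if raw == "":
            continue  # empty form field: no clause is generated
        if columns[name] == "int":
            parts.append(f"AND {name} = {int(raw)}")  # int() rejects non-numbers
        else:
            escaped = raw.replace("\\", "\\\\").replace("'", "\\'")
            parts.append(f"AND {name} = '{escaped}'")
    return "\n".join(parts)


base = substitute("SELECT foo, bar FROM baz WHERE ds = '$date'", {"date": "2013-06-23"})
print(base)
print(where_clauses({"wibble": "abc", "wobble": ""}))
```

Substitution can only fill a slot that already exists in the query, whereas the clause generator changes the shape of the query depending on which fields were filled in.
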
  • 11. What is the Hive Query Tool? Easy Install and Administration
    • If we were going to build our own, we didn’t want maintenance to be *another* full-time job
    • Internal adoption by other engineers was important
    • Java hackers don’t want to deal with a 23.5-step install and configure process
      ∙ Especially if it’s not written in Java
    • Check out the source, run the setup script, edit a single config file and run the startup scripts.
  • 12. What is the Hive Query Tool? Run jobs with different Users and Queues
    • Face it, the Hive Thrift Server is horrible
    • Most other user-friendly Hive front-ends use it
      ∙ So they have all its limitations
      ∙ And its bugs
    • The HQT simply spawns a Hive CLI for each job, using sudo to change users when necessary.
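
A rough sketch of what "spawn a Hive CLI per job, with sudo when needed" might look like as a command line. This is Python for illustration only; the `hive_command` helper, the file path, user, and queue names are hypothetical, and the slides do not show how HQT actually builds its command. The `hive -f` and `--hiveconf` options and `sudo -n -u` are standard, and `mapred.job.queue.name` is the classic MapReduce queue property, which is assumed here to be how the queue is selected.

```python
from typing import List, Optional


def hive_command(hql_file: str,
                 run_as: Optional[str] = None,
                 queue: Optional[str] = None) -> List[str]:
    """Build the argument list for one Hive CLI job (nothing is executed here)."""
    cmd = ["hive"]
    if queue:
        # Assumption: the scheduler queue is chosen via this Hadoop property.
        cmd += ["--hiveconf", f"mapred.job.queue.name={queue}"]
    cmd += ["-f", hql_file]
    if run_as:
        # -n makes sudo fail instead of prompting for a password.
        cmd = ["sudo", "-n", "-u", run_as] + cmd
    return cmd


# Hypothetical job: file path, user, and queue are made up for the example.
job = hive_command("/tmp/job42.hql", run_as="reports", queue="adhoc")
print(" ".join(job))
```

Passing the command as a list rather than a shell string avoids quoting problems when the list is later handed to a process-spawning call.
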
  • 13. What is the Hive Query Tool? Performance?
    • Some options we looked at before building the HQT did a whole lot more
    • Some claimed to be faster than Hive.
    • Some of these options had so much overhead that they were slower than using Hive directly!
    • The HQT simply runs HQL code through the standard Hive CLI. No overhead, no difference in performance over plain-vanilla Hive.
  • 14. Why did we need this?
  • 15. Why did we need the HQT? Making the data accessible
    • The data we pump into our Hadoop clusters is full of valuable information to our business
    • And more is fed into our Hive tables every day
    • And more people need access to that data every day
    • But not all of those people are 733t h4(k3r engineers 😉
  • 16. Why did we need the HQT? Making the data accessible
    • The target users may not know Linux and Java and SQL...
    • But they do know how they want the data filtered and correlated and aggregated.
    • We needed a way to let them run queries where they could choose these parameters with a high degree of flexibility...
    • But without having to teach them all HQL
  • 17. Why did we need the HQT? Looked at what was available...
    • Nothing else we looked at seemed to satisfy all our requirements.
    • Some that looked interesting unfortunately had terrible performance, as they did not use Hive directly.
    • Not that everything we looked at was terrible – some solutions were really quite impressive.
    • But it came down to a classic question in tech-oriented businesses...
  • 18. Why did we need the HQT? The bottom line
    • We knew what we wanted
    • We knew what we wanted wasn’t particularly complex
    • We asked ourselves if we could just build something that gives us exactly what we need
    • And would that effort cost less than trying to make something else work the way we wanted?
    • A “Eureka!” moment and a rough prototype answered the question 😉
  • 19. HQT Use at TripAdvisor
  • 20. HQT Use at TripAdvisor: A surprise hit
    • Some interested people tried the prototype
    • Liked how it worked, requested more features
    • Other groups became interested
    • Even committed engineering resources to help get it to “beta”
    • It’s now being used across the company
    • New report templates constantly being added
    • (sorry, those aren’t available publicly)
  • 21. HQT Use at TripAdvisor: Company-wide adoption
    • End users find it easy to use and relatively convenient.
    • Template authors have found it easy to create and modify report templates.
    • Users include people in Sales, Marketing, Commerce, and even Legal!
    • Weekly peak usage at over 40 simultaneous Hive jobs – on a single server. (We’ve actually had to add throttling to keep HQT jobs from using too many mapred slots.)
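
The slides do not say how the throttling is implemented. One common way to cap the number of jobs running at once is a counting semaphore; the Python sketch below assumes that approach, and the limit, the `run_job` stand-in, and all names are hypothetical.

```python
import asyncio

MAX_CONCURRENT_JOBS = 2  # hypothetical limit, chosen small for the demo


async def run_job(job_id: int, limiter: asyncio.Semaphore,
                  running: set, observed: list) -> None:
    """Wait for a free slot, then 'run' one job."""
    async with limiter:
        running.add(job_id)
        observed.append(len(running))  # record how many jobs are active now
        await asyncio.sleep(0.01)      # stand-in for a real Hive job
        running.discard(job_id)


async def run_all(n_jobs: int) -> int:
    """Submit n_jobs at once and return the peak number that ran together."""
    limiter = asyncio.Semaphore(MAX_CONCURRENT_JOBS)
    running, observed = set(), []
    await asyncio.gather(*(run_job(i, limiter, running, observed)
                           for i in range(n_jobs)))
    return max(observed)


peak_jobs = asyncio.run(run_all(6))
print(peak_jobs)
```

Jobs beyond the limit simply wait for a slot, so a burst of submissions queues up instead of flooding the cluster.
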
  • 22. HQT Design
  • 23. HQT Design: Architecture: Front-End
    • Web interface
    • Handles user authentication
    • Processes HQT Templates to determine...
      ∙ What options/input elements to present the user
      ∙ How to process and validate input from the user
      ∙ What HQL to send to the back-end
    • Gets job progress and status info from the back-end
    • Doesn’t do much else
  • 24. HQT Design: Architecture: Back-End
    • Presents a “json/rest-like” interface over HTTP to receive requests from the front-end
    • Uses an event-loop instead of threads
    • Spawns Hive CLI instances to run submitted HQL
    • Tracks and parses output from each instance
    • Watches CLI instances for progress and errors
    • Processes results for retrieval by users
    • Sends email notifications
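
As a rough illustration of the back-end's job (spawn a process from an event loop, read its output line by line, and pull out progress), here is a Python `asyncio` sketch. HQT itself uses Perl and AnyEvent, so none of this is its actual code. The progress-line pattern is an assumption about what the Hive CLI prints during MapReduce stages, and a small Python child process stands in for Hive so the sketch runs without Hadoop installed.

```python
import asyncio
import re
import sys
from typing import List, Optional, Tuple

# Assumed shape of a progress line, e.g. "... Stage-1 map = 50%,  reduce = 0%".
PROGRESS_RE = re.compile(r"Stage-(\d+)\s+map\s*=\s*(\d+)%,\s*reduce\s*=\s*(\d+)%")

Progress = Tuple[int, int, int]  # (stage number, map percent, reduce percent)


def parse_progress(line: str) -> Optional[Progress]:
    """Return (stage, map%, reduce%) if the line is a progress line, else None."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    stage, map_pct, reduce_pct = (int(group) for group in match.groups())
    return stage, map_pct, reduce_pct


async def watch(cmd: List[str]) -> Tuple[int, List[Progress]]:
    """Run cmd without blocking the event loop and collect its progress updates."""
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.STDOUT)
    updates = []
    async for raw in proc.stdout:
        progress = parse_progress(raw.decode(errors="replace"))
        if progress is not None:
            updates.append(progress)
    return await proc.wait(), updates


# Stand-in for a Hive CLI process: prints two progress lines and one other line.
FAKE_JOB = [sys.executable, "-c",
            "print('Stage-1 map = 0%,  reduce = 0%'); print('OK'); "
            "print('Stage-1 map = 100%,  reduce = 100%')"]

exit_code, updates = asyncio.run(watch(FAKE_JOB))
print(exit_code, updates)
```

Because each `watch` call only awaits I/O, many such watchers can share one event loop, which is the property the slide contrasts with a thread-per-job design.
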
  • 25. HQT Design: Template System
    • The “special sauce” of the HQT
    • The template “language” is designed so that “directives” concisely express a whole lot:
      ∙ What input to gather (and optionally what kind)
      ∙ How to validate that input
      ∙ What output to generate and how to format it
    • It’s a little tricky to explain
    • But extremely flexible
    • More details shortly...
  • 26. HQT Design: Language & Frameworks
    • Written in Perl
    • Uses lots of components from the CPAN
    • Front-end web framework is Mojolicious
    • Template System uses Text::Template
    • Back-end uses AnyEvent
    • Most classes built using Moo
    • Decent example of “Modern Perl”, but is still a work-in-progress.
  • 27. HQT Design: System Requirements
    • Requires Perl 5.10.1 or newer
    • Hadoop & Hive clients & libs should already be installed and configured
    • Does *not* require root or root access
    • LDAP & sudo should be configured if you want to run jobs as different users.
    • Web-server is built-in, but can run under just about any setup you want
  • 28. HQT Design: Current State
    • The front-end code is rather nice
      ∙ MVC-style web app code
      ∙ Uses Mojolicious .epl templates for web content, which is very similar to .erb
    • Back-end code is kind of hairy
      ∙ AnyEvent is fairly low-level
      ∙ REST/json stuff too mixed with the code that wraps the Hive CLI processes.
      ∙ It shouldn’t be responsible for sending email!
  • 29. HQT Design: Current State, contd.
    • Template-system code:
      ∙ Fairly simple code, but allows for a lot of interesting functionality.
      ∙ Other engineers seem to think it’s fine...
    • But I think it needs refactoring
      ∙ Too much “action at a distance”
      ∙ Template evaluation is a big security risk
      ∙ Should use OO instead of ad-hoc data structures
      ∙ Etc...
  • 30. The HQL Template System
  • 31. Template System
    • Template code blocks are embedded into otherwise normal HQL:

      {{ begin_main_select }}
      SELECT foo, bar
      FROM baz
      WHERE ds={{ insert_var date => { type => 'date', default => days_ago_ymd(3) } }}
      {{ append_where { columns => { wibble => 'string', wobble => 'int' } } }}

  • 32. Template System
    • Template functions/“directives” simultaneously define...
      ∙ What input options to present the user
      ∙ Input validation
      ∙ What to insert into the HQL based on the input
  • 33. Template System
    So, this...

      {{ begin_main_select }}
      SELECT foo, bar
      FROM baz
      WHERE ds={{ insert_var date => { type => 'date', default => days_ago_ymd(3) } }}
      {{ append_where { columns => { wibble => 'string', wobble => 'int' } } }}

  • 34. Template System
    Renders this: [slide image not captured in the transcript]
  • 35. Template System
    Which when filled out like this: [slide image not captured in the transcript]
  • 36. Template System
    Generates HQL like this: [slide image not captured in the transcript]
  • 37. Template System: Template Engine
    • Didn’t build anything new, just used the existing Text::Template module in a clever way
    • Template blocks are just Perl code, evaluated in a specified package/namespace.
    • Used some trickery to make it look a little less like Perl, but nothing fancy.
    • The things that look like “directives” are just functions.
    • Lots of functions defined in that namespace...
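
The idea that "directives are just functions evaluated in a namespace" can be shown with a toy engine. The sketch below is Python, not Text::Template, and every name in it (`Namespace`, `render`, the Python-style `insert_var` call syntax) is an assumption made for illustration. Each `{{ ... }}` block is evaluated as an expression against a small set of functions; a function both records which input it wants from the user and returns the text to splice into the HQL.

```python
import re

BLOCK_RE = re.compile(r"\{\{(.*?)\}\}", re.S)  # matches one {{ ... }} block


class Namespace:
    """Holds the submitted form values and the functions templates may call."""

    def __init__(self, form: dict):
        self.form = form
        self.inputs = []  # (name, type) pairs the web form would need to show

    def insert_var(self, name: str, type: str = "string", default: str = "") -> str:
        self.inputs.append((name, type))            # declares the input...
        return str(self.form.get(name, default))    # ...and yields its value

    def functions(self) -> dict:
        return {"insert_var": self.insert_var}


def render(template: str, ns: Namespace) -> str:
    """Replace every block with the result of evaluating it in the namespace."""
    # Emptying __builtins__ is NOT a real sandbox; it only keeps the demo tidy.
    env = {"__builtins__": {}, **ns.functions()}
    return BLOCK_RE.sub(lambda m: str(eval(m.group(1).strip(), env)), template)


ns = Namespace({"date": "2013-06-23"})
hql = render("SELECT foo FROM baz WHERE ds='{{ insert_var('date', type='date') }}'", ns)
print(hql)
print(ns.inputs)
```

Evaluating template text as code is what makes the approach so flexible, and it is also why the speaker flags template evaluation as a security risk: whoever can edit a template can run code on the server.
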
  • 38. Template System: Template Functions
    • Functions available for:
      ∙ Simple value insertion/substitution
      ∙ Adding & extending WHERE clauses
      ∙ Adding & extending GROUP BY clauses
      ∙ Setting defaults
      ∙ Manipulating and comparing dates
      ∙ Parameter validation
    • Plus a lot of misc utils and support functions that probably should be in a different module.
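
Two of the function categories above, date manipulation and parameter validation, are easy to sketch. The Python below is a guess at the behaviour, not HQT's code: the slides use a function named `days_ago_ymd` but never show its output format, so the `YYYY-MM-DD` format, the optional `today` parameter, and the `validate_date` helper are all assumptions.

```python
from datetime import date, datetime, timedelta
from typing import Optional

YMD = "%Y-%m-%d"  # assumed format; the slides do not show what "ymd" produces


def days_ago_ymd(days: int, today: Optional[date] = None) -> str:
    """Return the date `days` days before `today` (default: now) as a string."""
    today = today or date.today()
    return (today - timedelta(days=days)).strftime(YMD)


def validate_date(value: str) -> str:
    """Return the date in normalized form, or raise ValueError if it is not one."""
    return datetime.strptime(value.strip(), YMD).strftime(YMD)


print(days_ago_ymd(3, today=date(2013, 6, 26)))
print(validate_date("2013-6-3"))
```

Normalizing on the way in means the generated HQL always compares against one canonical string format, whatever the user typed.
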
  • 39. Template System: Template Files
    • Simple format – a YAML header followed by templatized HQL code like you saw earlier:

      id: pageviews_uniques
      name: Daily Pageviews and Unique Visitors
      description: >
        Any description which will appear on the page.
        <i>May include HTML</i>
      author: optional
      ...
      {{ begin_main_select() }}
      SELECT foo, bar
      FROM baz
      WHERE ds={{ insert_var date => {type => 'date'} }}

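
A loader for this kind of file needs to separate the header from the HQL body. The Python sketch below assumes that a line containing only `...` (YAML's document-end marker) is the separator; the transcript is ambiguous on this point, since the `...` on the slide could equally be an ellipsis. The `split_template` and `simple_fields` helpers are hypothetical, and `simple_fields` reads only single-line `key: value` pairs, where a real loader would hand the header to a YAML parser.

```python
from typing import Dict, Tuple


def split_template(text: str) -> Tuple[str, str]:
    """Split a template file into (header, body) at the first '...' line."""
    header, body, in_body = [], [], False
    for line in text.splitlines():
        if not in_body and line.strip() == "...":
            in_body = True  # assumed separator; everything after is HQL
            continue
        (body if in_body else header).append(line)
    if not in_body:
        raise ValueError("no '...' separator found")
    return "\n".join(header), "\n".join(body)


def simple_fields(header: str) -> Dict[str, str]:
    """Read top-level single-line 'key: value' pairs; skip block scalars."""
    fields = {}
    for line in header.splitlines():
        if not line or line[0].isspace() or ":" not in line:
            continue  # blank line or continuation of a multi-line value
        key, value = line.split(":", 1)
        if value.strip() not in (">", "|", ""):
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = """id: pageviews_uniques
name: Daily Pageviews and Unique Visitors
description: >
  Any description which will appear on the page.
author: optional
...
{{ begin_main_select() }}
SELECT foo, bar FROM baz"""

header, body = split_template(SAMPLE)
print(simple_fields(header))
print(body)
```
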
  • 40. Template System: Issues
    • Code in the web-app depends on the structure of data internal to the template module.
      ∙ Would take a lot of work to fix, but worth it.
    • Template evaluation is a potential security nightmare.
      ∙ Perl does have a sandbox module for this sort of thing, though. I just need to RTFM and use it.
    • The APIs of the various functions aren’t entirely consistent, but not too bad
      ∙ Will definitely fix for next release.
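
To make the security concern concrete, here is one way to restrict what a template block may contain before it is evaluated. This is a Python sketch of a syntax whitelist, not the Perl sandbox module the speaker refers to and not anything HQT is known to do; `check_block`, `is_safe`, and the allowed-function set are hypothetical. It accepts only calls to named, pre-approved functions with literal arguments.

```python
import ast
from typing import Set

# Only these syntax elements may appear in a template block.
ALLOWED_NODES = (ast.Expression, ast.Call, ast.Name, ast.Load,
                 ast.Constant, ast.keyword, ast.Dict)


def check_block(source: str, allowed_functions: Set[str]) -> None:
    """Raise ValueError unless the block is a plain call to an allowed function."""
    tree = ast.parse(source.strip(), mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in allowed_functions:
            raise ValueError(f"unknown name: {node.id}")
        if isinstance(node, ast.Call) and not isinstance(node.func, ast.Name):
            raise ValueError("only direct calls to named functions are allowed")


def is_safe(source: str, allowed_functions: Set[str]) -> bool:
    """True if the block passes the whitelist, False otherwise."""
    try:
        check_block(source, allowed_functions)
        return True
    except (ValueError, SyntaxError):
        return False


DIRECTIVES = {"insert_var"}  # hypothetical set of approved template functions
print(is_safe("insert_var('date', type='date')", DIRECTIVES))
print(is_safe("__import__('os').system('ls')", DIRECTIVES))
```

A whitelist like this trades flexibility for safety: templates lose the ability to run arbitrary code, which is exactly the power the current design relies on, so adopting it would mean moving any needed logic into the approved functions.
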
  • 41. Try it for yourself
  • 42. Try it for yourself: Availability
    • Source code available on GitHub now:
      ∙ https://github.com/tripadvisor/hive-query-tool
    • Apache 2.0 Licensed
    • Modest system prerequisites
    • Automated download and installation of all dependencies
    • Works on a variety of platforms
    • Bug Reports, Feature Requests and Pull Requests all *very welcome*
  • 43. The future?
  • 44. Future Plans
    • Complete rewrite of the back-end for cleaner and more flexible code.
    • Implement sandboxing for template security
    • More user-oriented features, like
      ∙ Ability to save pre-filled query reports
      ∙ Better management of past and running jobs
      ∙ Better status info from the backend
      ∙ Column-headers in report output
    • An administrator dashboard/console
    • Bug-fixes, feature enhancements, lots more
  • 45. Future Possibilities
    • Workflow & Scheduling functionality
    • Separate template system for stand-alone use
    • Make the back-end good enough to be a viable replacement for the Hive Thrift Server.
    • Add template functions for joins, sub-selects, and lots of other HQL constructs that aren’t yet customizable.
    • Add ability for a single template to define multiple queries delivering multiple result sets.
    • Rewrite in Perl 6 😉
  • 46. Any Questions?