Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Querying & Analytical Extensions


During the Big Data Warehousing Meetup, Hortonworks provided a deep-dive demo of Stinger!

We also discussed options for enabling real-time/interactive queries to support business intelligence type functionality on Hadoop. See that presentation here:

If you would like more information, please don't hesitate to contact us at Or, visit our website at

Published in: Technology
  • Enterprise Reports – Your cell phone bill is an example
    Dashboard – KPI tracking
    Parameterized Reports – What are the hot prospects in my region?
    Visualization – Visual exploration of data
    Data Mining – Large scale data processing and extraction usually fed to other tools
    How?
    Improve Latency & Throughput: query engine improvements; new “Optimized RCFile” column store; next-gen runtime (eliminates M/R latency)
    Extend Deep Analytical Ability: analytics functions; improved SQL coverage; continued focus on core Hive use cases
  • Add statistics on Compression . . .
  • For illustration, here’s a quick glance at benchmarking. This is, of course, a very active R&D area for us. The point is that we are seeing performance uplift of 10x and upwards when all is said and done. This will only get better.

    1. Hortonworks: Stinger, Tez. Leveraging Hive & Yarn for High-Performance/Interactive Querying & Analytical Extensions (© Hortonworks Inc. 2013)
    2. Stinger Initiative: Accelerating Hive into the Future
    3. What are the Stinger and Tez initiatives?
       • Collection of development threads in the Hive community for:
         – Improved SQL Interface
         – Updated Query Engine
         – Optimized File Format
         – Always-on Services
    4. Stinger Initiative: 2-Pronged Approach (Making Hive Best for Interactive Query)
       Improve Latency and Throughput
       • Tez
         – New primitives move beyond map-reduce and beyond batch
         – Avoid unnecessary persistence of temporary data
         – Hive, Pig and others generate Tez plans for high perf
       • Query Engine Improvements
         – Cost-based optimizer
         – In-memory joins
       • State-of-the-art Column Store
         – “Optimized RCFile” or ORCFile
         – Minimizes disk IO and deserialization
       • Tez Service
         – Always-on service providing query interactivity
       Extend Deep Analytical Ability
       • Analytics Functions
         – SQL:2003 Compliant
         – OVER with PARTITION BY and ORDER BY
         – Wide variety of windowing functions: RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE, and many more
         – Aligns well with BI ecosystem
       • Improved SQL Coverage
         – Subqueries within IN / HAVING
         – Expanded SQL types including DATETIME, VARCHAR, etc.
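To make the windowing functions listed on this slide concrete, here is a minimal sketch of OVER with PARTITION BY and ORDER BY. It is an assumption-laden illustration: it runs against SQLite (through Python's built-in `sqlite3` module, which needs SQLite 3.25 or later for window functions) rather than Hive, and the `sales` table, its columns, and the sample rows are hypothetical. Hive's syntax for these functions is similar, but this block has not been run against Hive.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
rows = [
    ("east", "ann", 300),
    ("east", "bob", 200),
    ("east", "cat", 200),
    ("west", "dan", 500),
    ("west", "eve", 100),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Each function is evaluated per region (PARTITION BY) in descending
# amount order (ORDER BY). RANK gives tied amounts the same rank, while
# ROW_NUMBER always increments; rep is added as a tie-breaker so that
# ROW_NUMBER, LAG and FIRST_VALUE are deterministic.
query = """
SELECT region, rep, amount,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC, rep) AS row_num,
       LAG(amount) OVER (PARTITION BY region ORDER BY amount DESC, rep) AS prev_amount,
       FIRST_VALUE(amount) OVER (PARTITION BY region ORDER BY amount DESC, rep) AS top_amount
FROM sales
ORDER BY region, row_num
"""
result = conn.execute(query).fetchall()

for row in result:
    print(row)
```

In this sketch the two east-region reps with equal amounts share a rank but receive distinct row numbers, which is the usual distinction between RANK and ROW_NUMBER.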
    5. Stinger Phases
    6. Where we are
       • Key features in Hive 0.11
         – ORC File
         – Improved Data Types
         – Analytic Functions: RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE and more; aggregate OVER functions with PARTITION BY and ORDER BY
         – Joins improved in Hive 0.11: broadcast join and the SMB join work without user hints
       • Tez Alpha Released
    7. Stinger: Enhance Hive for BI Use Cases
       • Use cases: Enterprise Reports, Dashboard / Scorecard, Parameterized Reports, Visualization, Data Mining
       • Diagram labels: Interactive, Batch
       • More SQL & Better Performance
    8. Hive Performance: Intelligent Optimizer
       • For joins where one side fits in memory:
         – In-Memory Hash Join: Hive reads the small table into a hash table and makes it available to all participating nodes via the distributed cache.
         – Scans through the big file to produce the output.
       • Users often don’t know how to provide Hive hints
         – End up with a long pipeline of MapReduce jobs.
         – Removed need for many hints
       • Star-schema joins
         – Dimension tables loaded to memory and distributed via distributed cache.
         – Scatter-Gather without distributed joins (resolved locally).
       • Improvements
         – Lower the footprint of the fact tables in memory.
         – Enable the optimizer to automatically pick map joins.
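The in-memory hash join described on this slide can be sketched in a few lines. This is a single-process toy under stated assumptions, not Hive's implementation: the function names `build_hash_table` and `map_side_join`, the tuple layout, and the sample tables are all hypothetical, and the distributed-cache step is only mentioned in a comment.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple


def build_hash_table(small_rows: Iterable[Row], key_index: int) -> Dict[object, List[Row]]:
    """Load the small (dimension) table into a hash table keyed on the join column."""
    table: Dict[object, List[Row]] = defaultdict(list)
    for row in small_rows:
        table[row[key_index]].append(row)
    return table


def map_side_join(big_rows: Iterable[Row],
                  small_table: Dict[object, List[Row]],
                  key_index: int) -> Iterator[Row]:
    """Stream the big table once, probing the in-memory hash table per row.

    In a cluster, every node would hold its own copy of small_table
    (shipped via the distributed cache), so no shuffle step is needed.
    """
    for row in big_rows:
        for match in small_table.get(row[key_index], ()):
            yield row + match


# Hypothetical data: a small dimension table and a larger fact table.
dim_state = [("CA", "California"), ("NY", "New York")]
fact_orders = [(1, "CA", 20.0), (2, "NY", 35.5), (3, "TX", 12.0), (4, "CA", 7.5)]

lookup = build_hash_table(dim_state, key_index=0)
joined = list(map_side_join(fact_orders, lookup, key_index=1))

for row in joined:
    print(row)
```

The design point is that the large table is scanned exactly once and never repartitioned; the cost is that the small table must fit in memory on each participating node.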
    9. Some New Benchmarking Results . . .
       Incremental changes adding up to BIG improvements:
       • JIRA: HIVE-3784 – Remove need to explicitly provide “hint” to optimizer.
       • JIRA: HIVE-3952 – MapJoins for multiple small tables joining large table.
       • JIRA: HIVE-2340 – Collapse Order By/Group By into single task . . .
       In this case, six MR jobs reduced to one
    10. ORCFile - Optimized Column Storage
        • JIRA: HIVE-3874 – Make a better columnar storage file
          – Evolve based on Google Dremel format
        • Decompose complex row types into primitive fields
          – Better compression and projection
        • Only read bytes from HDFS for the required columns.
        • Store column level aggregates in the files
          – Only need to read the file meta information for common queries
          – Stored both for file and each section of a file
          – Aggregates: min, max, sum, average, count
          – Allows fast access by sorted columns
        • Ability to add bloom filters for columns
          – Enables quick checks for whether a value is present
          – Accelerates searches on alternate keys
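As a rough illustration of why column-level aggregates help, the sketch below stores min, max, sum and count per section of a single column, answers a row count from metadata alone, and skips sections whose min/max range cannot contain a searched value. Everything here is hypothetical (the `Section` and `SectionStats` classes, the section size, the helper names); it does not reflect the actual ORC file layout or reader API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SectionStats:
    minimum: int
    maximum: int
    total: int
    count: int


@dataclass
class Section:
    values: List[int]       # column data for this section of the file
    stats: SectionStats     # aggregates stored alongside the data


def write_sections(values: List[int], section_size: int) -> List[Section]:
    """Split one column into sections and record aggregates for each."""
    sections = []
    for start in range(0, len(values), section_size):
        chunk = values[start:start + section_size]
        stats = SectionStats(min(chunk), max(chunk), sum(chunk), len(chunk))
        sections.append(Section(chunk, stats))
    return sections


def count_from_metadata(sections: List[Section]) -> int:
    """Answer COUNT(*) without touching any column data."""
    return sum(s.stats.count for s in sections)


def find_equal(sections: List[Section], target: int) -> Tuple[List[int], int]:
    """Return matching values and how many sections actually had to be scanned."""
    hits: List[int] = []
    scanned = 0
    for s in sections:
        if target < s.stats.minimum or target > s.stats.maximum:
            continue  # min/max proves the value cannot be in this section
        scanned += 1
        hits.extend(v for v in s.values if v == target)
    return hits, scanned


column = list(range(1, 13))                 # a sorted column: 1..12
sections = write_sections(column, section_size=4)
total_rows = count_from_metadata(sections)
hits, scanned = find_equal(sections, target=6)
print(total_rows, hits, scanned)
```

Because the sample column is sorted, each section covers a disjoint range and only one section is scanned, which is the "fast access by sorted columns" effect the slide mentions. On unsorted data the ranges overlap and fewer sections can be skipped.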
    11. ORCFile - File Layout
    12. Tez Initiative
    13. Tez – Moving Hive Beyond MapReduce
        • Low level data-processing execution engine
        • Use it for the base of MapReduce, Hive, Pig, Cascading etc.
        • Enables efficient pipelining of jobs
        • Removes task and job launch times
        • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
          – Performance-oriented jobs aren’t forced into interleaving model
        • Does not write intermediate output to HDFS
          – Much lighter disk and network usage
          – Appropriate for shorter-running jobs, where performance is more important than being able to re-start a failed job where it left off
        • Built on YARN
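The difference between persisting intermediate output and pipelining it can be shown with a small analogy. This is only a single-machine toy: a local temporary file stands in for HDFS, Python generators stand in for pipelined stages, and the function names (`run_staged`, `run_pipelined`) and sample records are hypothetical. It does not use or describe the Tez API.

```python
import json
import os
import tempfile
from collections import Counter
from typing import Dict, Iterable, Iterator

records = ["ca", "ny", "ca", "tx", "ny", "ca"]


def normalize(rows: Iterable[str]) -> Iterator[str]:
    """Stage 1: clean each record as it arrives."""
    for r in rows:
        yield r.upper()


def count_by_key(rows: Iterable[str]) -> Dict[str, int]:
    """Stage 2: aggregate the cleaned records."""
    return dict(Counter(rows))


def run_staged(rows: Iterable[str]) -> Dict[str, int]:
    """Stage 1 writes its full output to disk; stage 2 starts only afterwards."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(list(normalize(rows)), f)    # persist intermediate data
        with open(path) as f:
            intermediate = json.load(f)            # read it back for stage 2
        return count_by_key(intermediate)
    finally:
        os.remove(path)


def run_pipelined(rows: Iterable[str]) -> Dict[str, int]:
    """Stage 2 consumes stage 1's records directly; nothing is written to disk."""
    return count_by_key(normalize(rows))


staged = run_staged(records)
pipelined = run_pipelined(records)
print(staged == pipelined, pipelined)
```

Both paths produce the same answer. The staged version pays for an extra write and read but leaves a restart point behind, while the pipelined version avoids the I/O and gives up that restart point, which matches the trade-off stated on the slide.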
    14. YARN – The Foundation for Tez
        • Diagram labels: Client, ResourceManager, NodeManager (three instances), Container, App Mstr; MapReduce Status, Job Submission, Node Status, Resource Request
        • Tez is a YARN application . . . Instances run on all nodes hosting data targeted for accelerated query processing
    15. Pig/Hive-MR versus Pig/Hive-Tez
        • Diagram labels: I/O Synchronization Barrier, I/O Pipelining; Pig/Hive - MR, Pig/Hive - Tez
        • Example query: SELECT a.state, COUNT(*) FROM a JOIN b ON ( = BY a.state
    16. Result: Massive Performance Uplift
        • Existing Hive: Parse Query 0.5s; Create Plan 0.5s; Launch Map-Reduce 35s; Process Map-Reduce 102s; Total 138s
        • Interactive Hive: Parse Query 0.5s; Create Plan 0.5s; Launch Map-Reduce 35s; Process Map-Reduce 7s; Total 43s
        • Interactive Hive & Tez: Parse Query 0.5s; Create Plan 0.5s; Submit to Service 0.1s; Process Map-Reduce 7s; Total 8.1s
        • Interactive Hive & Tez I/O: Parse Query 0.5s; Create Plan 0.5s; Submit to Service 0.1s; Process Map-Reduce – No Disk I/O 3.5s; Total 4.6s
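The totals on this slide can be checked, and turned into speedup factors, with a few lines of arithmetic. The stage timings below are copied from the slide; the dictionary layout and variable names are just one way to organize them.

```python
# Per-stage timings in seconds, as listed on the slide.
stages = {
    "Existing Hive": {
        "parse": 0.5, "plan": 0.5, "launch": 35.0, "process": 102.0,
    },
    "Interactive Hive": {
        "parse": 0.5, "plan": 0.5, "launch": 35.0, "process": 7.0,
    },
    "Interactive Hive & Tez": {
        "parse": 0.5, "plan": 0.5, "submit": 0.1, "process": 7.0,
    },
    "Interactive Hive & Tez I/O": {
        "parse": 0.5, "plan": 0.5, "submit": 0.1, "process": 3.5,
    },
}

# Sum each configuration; round to one decimal to hide float noise.
totals = {name: round(sum(steps.values()), 1) for name, steps in stages.items()}

# Speedup of each configuration relative to the existing Hive baseline.
baseline = totals["Existing Hive"]
speedups = {name: round(baseline / total, 1) for name, total in totals.items()}

for name in stages:
    print(f"{name}: total {totals[name]}s, speedup {speedups[name]}x")
```

The sums reproduce the slide's totals (138s, 43s, 8.1s and 4.6s), so the last configuration is 30 times faster than the baseline. Most of the gain between the second and third configurations comes from replacing the 35s Map-Reduce launch with a 0.1s submit to the always-on service.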
    17. FastQuery: Beyond Batch with YARN
        • Tez Generalizes Map-Reduce: simplified execution plans process data more efficiently
        • Always-On Tez Service: low latency processing for all Hadoop data processing
    18. Tez Service
        • MR Query Startup Expensive
          – Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s)
        • Solution
          – Tez Service: removes task-launch overhead; removes job-launch overhead
          – Hive/Pig: submit query-plan to Tez Service
          – Native Hadoop service, not ad-hoc
        • An Architecture that can be Extended to the Next Level of Performance
          – Potential for future memory-based performance optimizations based on staging/pre-loading designated tables, indexes, and aggregates . . .
    19. Questions