Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Querying & Analytical Extensions

During the Big Data Warehousing Meetup, Hortonworks provided a deep-dive demo of Stinger!

We also discussed options for enabling real-time/interactive queries to support business intelligence type functionality on Hadoop. See that presentation here:

If you would like more information, please don't hesitate to contact us at Or, visit our website at

Published in: Technology

  • Enterprise Reports – Your cell phone bill is an example
  • Dashboard – KPI tracking
  • Parameterized Reports – What are the hot prospects in my region?
  • Visualization – Visual exploration of data
  • Data Mining – Large scale data processing and extraction usually fed to other tools
  • How?
    • Improve Latency & Throughput
      • Query engine improvements
      • New “Optimized RCFile” column store
      • Next-gen runtime (eliminates M/R latency)
    • Extend Deep Analytical Ability
      • Analytics functions
      • Improved SQL coverage
    • Continued focus on core Hive use cases
  • Add statistics on Compression . . .
  • For illustration, here’s a quick glance at benchmarking. This is, of course, a very active area of R&D for us. The point is that we are seeing performance uplifts of 10x and upwards when all is said and done. This will only get better.
  • Transcript

    • 1. Hortonworks: Stinger, Tez. Leveraging Hive & Yarn for High-Performance/Interactive Querying & Analytical Extensions (© Hortonworks Inc. 2013)
    • 2. Stinger Initiative: Accelerating Hive into the Future
    • 3. What are the Stinger and Tez initiatives?
      • Collection of development threads in the Hive community for
        • Improved SQL Interface
        • Updated Query Engine
        • Optimized File Format
        • Always-on Services
    • 4. Stinger Initiative: 2-Pronged Approach (Making Hive Best for Interactive Query)
      • Improve Latency and Throughput
        • Tez
          • New primitives move beyond map-reduce and beyond batch
          • Avoid unnecessary persistence of temporary data
          • Hive, Pig and others generate Tez plans for high perf
        • Query Engine Improvements
          • Cost-based optimizer
          • In-memory joins
        • State-of-the-art Column Store
          • “Optimized RCFile” or ORCFile
          • Minimizes disk IO and deserialization
        • Tez Service
          • Always-on service providing query interactivity
      • Extend Deep Analytical Ability
        • Analytics Functions
          • SQL:2003 Compliant
          • OVER with PARTITION BY and ORDER BY
          • Wide variety of windowing functions: RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE, many more
          • Aligns well with BI ecosystem
        • Improved SQL Coverage
          • Subqueries within IN / HAVING
          • Expanded SQL types including DATETIME, VARCHAR, etc.
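
To make the windowing bullets on slide 4 concrete, here is a minimal Python sketch of what ROW_NUMBER, RANK and LAG compute over a PARTITION BY / ORDER BY window. It only emulates the semantics in plain Python; it is not Hive code, and the sample rows, column meanings and the `window` helper are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (region, rep, sales). Not taken from the slides.
rows = [
    ("east", "ann", 300),
    ("east", "bob", 500),
    ("east", "cy", 500),
    ("west", "dee", 200),
    ("west", "eli", 100),
]


def window(rows):
    """Emulate ROW_NUMBER, RANK and LAG(sales, 1) over
    (PARTITION BY region ORDER BY sales DESC)."""
    out = []
    # PARTITION BY region: sort on the key so groupby sees each region once.
    by_region = sorted(rows, key=itemgetter(0))
    for region, part in groupby(by_region, key=itemgetter(0)):
        # ORDER BY sales DESC within the partition (stable, ties keep input order).
        ordered = sorted(part, key=itemgetter(2), reverse=True)
        rank = 0
        for i, (_, rep, sales) in enumerate(ordered):
            # RANK: ties share a rank; the next distinct value skips ahead.
            if i == 0 or sales != ordered[i - 1][2]:
                rank = i + 1
            # LAG: previous row's value, or None at the start of the partition.
            lag = ordered[i - 1][2] if i > 0 else None
            # Output columns: region, rep, sales, row_number, rank, lag
            out.append((region, rep, sales, i + 1, rank, lag))
    return out


result = window(rows)
for r in result:
    print(r)
```

The point of the sketch is that each function is evaluated per partition and per ordered row, which is why these functions map well onto BI-style reports such as "rank reps within each region".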
    • 5. Stinger Phases
    • 6. Where we're at
      • Key features in Hive 0.11
        • ORC File
        • Improved Data Types
        • Analytic Functions
          • RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE and more
          • Aggregate OVER functions with PARTITION BY and ORDER BY
        • Joins improved in Hive 0.11
          • Broadcast join and the SMB join work without user hints
      • Tez Alpha Released
    • 7. Stinger: Enhance Hive for BI Use Cases
      • Diagram labels: Enterprise Reports, Dashboard / Scorecard, Parameterized Reports, Visualization, Data Mining; Interactive, Batch; More SQL & Better Performance
    • 8. Hive Performance: Intelligent Optimizer
      • For joins where one side fits in memory:
        • In-Memory Hash Join: Hive reads the small table into a hash table, makes available to all participating nodes via dist. cache.
        • Scans through the big file to produce the output.
      • Users often don’t know how to provide Hive hints
        • End up with a long pipeline of MapReduce jobs.
        • Removed need for many hints
      • Star-schema joins
        • Dimension Tables loaded to memory/distributed via distributed cache.
        • Scatter-Gather without distributed joins (resolved locally).
      • Improvements
        • Lower the footprint of the fact tables in memory.
        • Enable the optimizer to automatically pick map joins.
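
The in-memory hash join on slide 8 can be sketched in a few lines of Python. This is an illustration of the general technique only (build a hash table from the small side, then scan the big side once), not Hive's implementation; the function name `map_join` and the sample tables are assumptions made for the example.

```python
from collections import defaultdict


def map_join(small, big, small_key, big_key):
    """Inner hash join: hash the small table, then stream the big one."""
    # Build phase: the small (dimension) table is loaded fully into memory.
    table = defaultdict(list)
    for row in small:
        table[row[small_key]].append(row)
    # Probe phase: a single scan of the big (fact) table, no shuffle or sort.
    for row in big:
        for match in table.get(row[big_key], ()):
            yield {**match, **row}


# Hypothetical dimension and fact tables.
states = [{"state_id": 1, "state": "CA"}, {"state_id": 2, "state": "NY"}]
orders = [
    {"order": 10, "state_id": 1},
    {"order": 11, "state_id": 2},
    {"order": 12, "state_id": 1},
    {"order": 13, "state_id": 3},  # no matching dimension row, dropped
]

joined = list(map_join(states, orders, "state_id", "state_id"))
for row in joined:
    print(row)
```

Because the probe phase needs only the hash table and one pass over the big input, every node holding a slice of the big table can join locally, which is what the distributed-cache bullet describes.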
    • 9. Some New Benchmarking Results . . .
      • Incremental changes adding up to BIG improvements:
        • JIRA: HIVE-3784 – Remove need to explicitly provide “hint” to optimizer.
        • JIRA: HIVE-3952 – MapJoins for multiple small tables joining large table.
        • JIRA: HIVE-2340 – Collapse Order By/Group By into single task . . .
      • In this case, six MR’s reduced to one
    • 10. ORCFile - Optimized Column Storage
      • JIRA-3874: Make a better columnar storage file
        • Evolve based on Google Dremel format
      • Decompose complex row types into primitive fields
        • Better compression and projection
      • Only read bytes from HDFS for the required columns.
      • Store column level aggregates in the files
        • Only need to read the file meta information for common queries
        • Stored both for file and each section of a file
        • Aggregates: min, max, sum, average, count
        • Allows fast access by sorted columns
      • Ability to add bloom filters for columns
        • Enables quick checks for whether a value is present
        • Accelerates searches on alternate keys
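
As a rough illustration of why stored column-level aggregates help, the Python sketch below keeps min, max, count and sum for each section of a column, answers a SUM from metadata alone, and skips sections whose min/max range cannot contain a searched value. The `Section` layout and helper names are hypothetical and do not reflect the actual ORC file format.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """One section of a column plus its stored aggregates (hypothetical layout)."""
    values: list
    lo: int
    hi: int
    count: int
    total: int


def write_sections(values, size):
    """Split a column into fixed-size sections and record aggregates for each."""
    sections = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        sections.append(Section(chunk, min(chunk), max(chunk), len(chunk), sum(chunk)))
    return sections


def column_sum(sections):
    # Answered from the stored aggregates; no column values are read.
    return sum(s.total for s in sections)


def find_equal(sections, target):
    """Return (matches, sections_read), skipping sections that cannot match."""
    hits, read = [], 0
    for s in sections:
        if target < s.lo or target > s.hi:
            continue  # min/max rule this section out
        read += 1
        hits.extend(v for v in s.values if v == target)
    return hits, read


sections = write_sections(list(range(1, 13)), 4)  # sorted column, 3 sections
print(column_sum(sections))
print(find_equal(sections, 7))
```

On a sorted column the min/max ranges do not overlap, so a point lookup touches a single section, which is one way to read the "fast access by sorted columns" bullet.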
    • 11. ORCFile - File Layout
    • 12. Tez Initiative
    • 13. Tez – Moving Hive Beyond MapReduce
      • Low level data-processing execution engine
      • Use it for the base of MapReduce, Hive, Pig, Cascading etc.
      • Enables efficient pipelining of jobs
      • Removes task and job launch times
      • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
        • Performance-oriented jobs aren’t forced into interleaving model
      • Does not write intermediate output to HDFS
        • Much lighter disk and network usage
        • Appropriate for shorter-running jobs, where performance is more important than being able to re-start a failed job where it left off
      • Built on YARN
    • 14. YARN – The Foundation for Tez
      • Diagram labels: Client, ResourceManager, NodeManager, Container, App Mstr; MapReduce Status, Job Submission, Node Status, Resource Request
      • Tez is a YARN application . . . Instances run on all nodes hosting data targeted for accelerated query processing
    • 15. Pig/Hive-MR versus Pig/Hive-Tez
      • Pig/Hive - MR: I/O Synchronization Barrier
      • Pig/Hive - Tez: I/O Pipelining
      • Example query: SELECT a.state, COUNT(*) FROM a JOIN b ON ( = BY a.state
    • 16. Result: Massive Performance Uplift
      • Existing Hive: Parse Query 0.5s; Create Plan 0.5s; Launch Map-Reduce 35s; Process Map-Reduce 102s; Total 138s
      • Interactive Hive: Parse Query 0.5s; Create Plan 0.5s; Launch Map-Reduce 35s; Process Map-Reduce 7s; Total 43s
      • Interactive Hive & Tez: Parse Query 0.5s; Create Plan 0.5s; Submit to Service 0.1s; Process Map-Reduce 7s; Total 8.1s
      • Interactive Hive & Tez I/O: Parse Query 0.5s; Create Plan 0.5s; Submit to Service 0.1s; Process Map-Reduce (No Disk I/O) 3.5s; Total 4.6s
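
The slide 16 figures can be checked with a little arithmetic. The Python sketch below re-adds the per-stage timings quoted on the slide and derives the speedup of each configuration relative to Existing Hive; the speedup ratios are computed here and are not stated on the slide.

```python
# Per-stage timings in seconds, as quoted on slide 16:
# parse query, create plan, launch/submit, process.
stages = {
    "Existing Hive": [0.5, 0.5, 35, 102],
    "Interactive Hive": [0.5, 0.5, 35, 7],
    "Interactive Hive & Tez": [0.5, 0.5, 0.1, 7],
    "Interactive Hive & Tez I/O": [0.5, 0.5, 0.1, 3.5],
}

# Totals should reproduce the slide's 138s, 43s, 8.1s and 4.6s.
totals = {name: round(sum(t), 1) for name, t in stages.items()}

# Speedup relative to the Existing Hive baseline (derived, not from the slide).
baseline = totals["Existing Hive"]
speedup = {name: round(baseline / total, 1) for name, total in totals.items()}

for name in stages:
    print(f"{name}: {totals[name]}s total, {speedup[name]}x vs. Existing Hive")
```

Read this way, the launch step (35s down to 0.1s) and the processing step (102s down to 3.5s) each account for a large share of the overall gain.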
    • 17. FastQuery: Beyond Batch with YARN
      • Tez Generalizes Map-Reduce: Simplified execution plans process data more efficiently
      • Always-On Tez Service: Low latency processing for all Hadoop data processing
    • 18. Tez Service
      • MR Query Startup Expensive
        • Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s)
      • Solution
        • Tez Service
          • Removes task-launch overhead
          • Removes job-launch overhead
        • Hive/Pig: Submit query-plan to Tez Service
        • Native Hadoop service, not ad-hoc
      • An Architecture that can be Extended to the Next Level of Performance
        • Potential for Future Memory-based performance optimizations based on staging/pre-loading designated tables, indexes, and aggregates . . .
    • 19. Questions