Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Querying & Analytical Extensions

During the Big Data Warehousing Meetup, Hortonworks provided a deep-dive demo of Stinger!

We also discussed options for enabling real-time/interactive queries to support business intelligence type functionality on Hadoop. See that presentation here: http://www.slideshare.net/CasertaConcepts/real-time-interactive-queries-in-hadoop-big-data-warehousing-meetup

If you would like more information, please don't hesitate to contact us at info@casertaconcepts.com. Or, visit our website at http://casertaconcepts.com/.

  • Enterprise Reports – Your cell phone bill is an example
    Dashboard – KPI tracking
    Parameterized Reports – What are the hot prospects in my region?
    Visualization – Visual exploration of data
    Data Mining – Large scale data processing and extraction usually fed to other tools
    How?
    Improve Latency & Throughput
      – Query engine improvements
      – New "Optimized RCFile" column store
      – Next-gen runtime (eliminates M/R latency)
    Extend Deep Analytical Ability
      – Analytics functions
      – Improved SQL coverage
    Continued focus on core Hive use cases
  • Add statistics on Compression . . .
  • For illustration, here's a quick glance at benchmarking. This is, of course, a very active area of R&D for us. The point is that we are seeing a performance uplift of 10x and upwards when all is said and done, and this will only get better.

Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Querying & Analytical Extensions Presentation Transcript

  • 1. Hortonworks: Stinger, Tez
    Leveraging Hive & Yarn for High-Performance/Interactive Querying & Analytical Extensions
    © Hortonworks Inc. 2013
  • 2. Stinger Initiative
    Accelerating Hive into the Future
  • 3. What are the Stinger and Tez initiatives?
    • Collection of development threads in the Hive community for
      – Improved SQL Interface
      – Updated Query Engine
      – Optimized File Format
      – Always-on Services
  • 4. Stinger Initiative: 2-Pronged Approach
    Making Hive Best for Interactive Query
    Improve Latency and Throughput
      Tez
        • New primitives move beyond map-reduce and beyond batch
        • Avoid unnecessary persistence of temporary data
        • Hive, Pig and others generate Tez plans for high perf
      Query Engine Improvements
        • Cost-based optimizer
        • In-memory joins
      State-of-the-art Column Store
        • "Optimized RCFile" or ORCFile
        • Minimizes disk IO and deserialization
      Tez Service
        • Always-on service providing query interactivity
    Extend Deep Analytical Ability
      Analytics Functions
        • SQL:2003 Compliant
        • OVER with PARTITION BY and ORDER BY
        • Wide variety of windowing functions:
          – RANK
          – LEAD/LAG
          – ROW_NUMBER
          – FIRST_VALUE
          – LAST_VALUE
          – Many more
        • Aligns well with BI ecosystem
      Improved SQL Coverage
        • Subqueries within IN / HAVING
        • Expanded SQL types including DATETIME, VARCHAR, etc.
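To make the windowing functions on slide 4 concrete, here is a minimal Python sketch of what RANK, ROW_NUMBER and LAG compute over PARTITION BY and ORDER BY. It only emulates the semantics on in-memory rows and is not how Hive implements them. The `window` helper and the `sales` rows are hypothetical names and data invented for illustration.

```python
from itertools import groupby
from operator import itemgetter


def window(rows, partition_by, order_by):
    """Emulate RANK, ROW_NUMBER and LAG over PARTITION BY / ORDER BY (ascending)."""
    out = []
    # Sort by partition key, then by the ordering column within each partition.
    ordered = sorted(rows, key=itemgetter(partition_by, order_by))
    for _, partition in groupby(ordered, key=itemgetter(partition_by)):
        prev = None
        rank = 0
        for row_number, row in enumerate(partition, start=1):
            # RANK: tied rows share a rank; the next distinct value skips ahead.
            if prev is None or row[order_by] != prev[order_by]:
                rank = row_number
            out.append({
                **row,
                "row_number": row_number,
                "rank": rank,
                # LAG: the ordering column's value on the previous row, if any.
                "lag": prev[order_by] if prev is not None else None,
            })
            prev = row
    return out


# Hypothetical sample data, not taken from the slides.
sales = [
    {"state": "NY", "amount": 30},
    {"state": "CA", "amount": 10},
    {"state": "NY", "amount": 20},
    {"state": "NY", "amount": 30},
    {"state": "CA", "amount": 40},
]
result = window(sales, partition_by="state", order_by="amount")
for row in result:
    print(row)
```

Each partition restarts the counters, which is why the two NY rows with the same amount share a rank while still getting distinct row numbers.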
  • 5. Stinger Phases
  • 6. Where we are at
    • Key features in Hive 0.11
      – ORC File
      – Improved Data Types
      – Analytic Functions
        – RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE and more
        – Aggregate OVER functions with PARTITION BY and ORDER BY
      – Joins improved in Hive 0.11
        – Broadcast join and the SMB join work without user hints
    • Tez Alpha Released
  • 7. Stinger: Enhance Hive for BI Use Cases
    [Figure labels: Enterprise Reports, Dashboard / Scorecard, Parameterized Reports, Visualization, Data Mining; scale: Interactive, Batch]
    More SQL & Better Performance
  • 8. Hive Performance: Intelligent Optimizer
    • For joins where one side fits in memory:
      – In-Memory Hash Join: Hive reads the small table into a hash table and makes it available to all participating nodes via the distributed cache.
      – Scans through the big file to produce the output.
    • Users often don't know how to provide Hive hints
      – End up with a long pipeline of MapReduce jobs.
      – Removed need for many hints
    • Star-schema joins
      – Dimension Tables loaded to memory/distributed via distributed cache.
      – Scatter-Gather without distributed joins (resolved locally).
    • Improvements
      – Lower the footprint of the fact tables in memory.
      – Enable the optimizer to automatically pick map joins.
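The in-memory hash join described on slide 8 can be sketched in a few lines of Python: build a hash table from the small side, then stream the big side past it once. This is a single-process illustration of the general technique only; it leaves out the distributed cache and is not Hive's code. The `map_join` function and the `stores` and `sales` tables are hypothetical.

```python
from collections import defaultdict


def map_join(small_table, big_table, key):
    """Inner-join two lists of dict rows using an in-memory hash table."""
    # Build phase: load the small side into a hash table keyed on the join column.
    hash_table = defaultdict(list)
    for row in small_table:
        hash_table[row[key]].append(row)
    # Probe phase: scan the big side once and emit a merged row per match.
    for big_row in big_table:
        for small_row in hash_table.get(big_row[key], ()):
            yield {**small_row, **big_row}


# Hypothetical dimension (small) and fact (big) tables.
stores = [
    {"store_id": 1, "state": "NY"},
    {"store_id": 2, "state": "CA"},
]
sales = [
    {"store_id": 1, "amount": 30},
    {"store_id": 2, "amount": 10},
    {"store_id": 1, "amount": 20},
    {"store_id": 3, "amount": 5},  # no matching store, dropped by the inner join
]
joined = list(map_join(stores, sales, "store_id"))
for row in joined:
    print(row)
```

Because only the small table is held in memory, the big table is read sequentially exactly once and no shuffle or sort step is needed.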
  • 9. Some New Benchmarking Results . . .
    Incremental Changes Adding-up to BIG improvements:
    • JIRA: HIVE-3784 – Remove need to explicitly provide "hint" to optimizer.
    • JIRA: HIVE-3952 – MapJoins for multiple small tables joining large table.
    • JIRA: HIVE-2340 – Collapse Order By/Group By into single task . . .
    In this case, Six MR's reduced to One
  • 10. ORCFile - Optimized Column Storage
    • JIRA: HIVE-3874 – Make a better columnar storage file
      – Evolve based on Google Dremel format
    • Decompose complex row types into primitive fields
      – Better compression and projection
    • Only read bytes from HDFS for the required columns.
    • Store column level aggregates in the files
      – Only need to read the file meta information for common queries
      – Stored both for file and each section of a file
      – Aggregates: min, max, sum, average, count
      – Allows fast access by sorted columns
    • Ability to add bloom filters for columns
      – Enables quick checks for whether a value is present
      – Accelerates searches on alternate keys
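The benefit of storing column-level aggregates per section of a file can be illustrated with a small Python sketch: a reader checks each section's min and max before touching its rows, and answers some aggregate queries from the metadata alone. This is an assumption-laden toy, not the ORC file format or its reader; `build_sections`, `scan_equal` and the section layout are hypothetical.

```python
def build_sections(values, section_size):
    """Split a column into sections, recording min/max/sum/count for each."""
    sections = []
    for start in range(0, len(values), section_size):
        chunk = values[start:start + section_size]
        sections.append({
            "rows": chunk,
            "min": min(chunk),
            "max": max(chunk),
            "sum": sum(chunk),
            "count": len(chunk),
        })
    return sections


def scan_equal(sections, target):
    """Return matching values and the number of sections actually read."""
    hits, sections_read = [], 0
    for section in sections:
        # The statistics prove the value cannot be in this section: skip it.
        if target < section["min"] or target > section["max"]:
            continue
        sections_read += 1
        hits.extend(v for v in section["rows"] if v == target)
    return hits, sections_read


column = list(range(1, 13))  # hypothetical sorted column holding 1..12
sections = build_sections(column, section_size=4)
hits, sections_read = scan_equal(sections, 6)

# SUM and COUNT answered from metadata only, without reading any rows.
total = sum(s["sum"] for s in sections)
row_count = sum(s["count"] for s in sections)
print(hits, sections_read, total, row_count)
```

On a sorted column the min/max ranges of the sections do not overlap, so a point lookup reads a single section, which is the "fast access by sorted columns" effect the slide mentions.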
  • 11. ORCFile - File Layout
  • 12. Tez Initiative
  • 13. Tez – Moving Hive Beyond MapReduce
    • Low level data-processing execution engine
    • Use it for the base of MapReduce, Hive, Pig, Cascading etc.
    • Enables efficient pipelining of jobs
    • Removes task and job launch times
    • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
      – Performance-oriented jobs aren't forced into interleaving model
    • Does not write intermediate output to HDFS
      – Much lighter disk and network usage
      – Appropriate for shorter-running jobs, where performance is more important than being able to re-start a failed job where it left off
    • Built on YARN
  • 14. YARN – The Foundation for Tez
    [Diagram labels: Client, ResourceManager, NodeManager, Container, App Mstr; flows: Job Submission, MapReduce Status, Node Status, Resource Request]
    Tez is a YARN application . . . Instances run on all nodes hosting data targeted for accelerated query processing
  • 15. Pig/Hive-MR versus Pig/Hive-Tez
    Pig/Hive - MR: I/O Synchronization Barrier
    Pig/Hive - Tez: I/O Pipelining
    SELECT a.state, COUNT(*)
    FROM a JOIN b ON (a.id = b.id)
    GROUP BY a.state
  • 16. Result: Massive Performance Uplift
    Step                 Existing Hive   Interactive Hive   Interactive Hive & Tez   Interactive Hive & Tez I/O
    Parse Query          0.5s            0.5s               0.5s                     0.5s
    Create Plan          0.5s            0.5s               0.5s                     0.5s
    Launch Map-Reduce    35s             35s                -                        -
    Submit to Service    -               -                  0.1s                     0.1s
    Process Map-Reduce   102s            7s                 7s                       3.5s (No Disk I/O)
    Total                138s            43s                8.1s                     4.6s
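The totals on slide 16 follow directly from the per-step timings, and the implied speedups can be checked with a short calculation. The sketch below uses only the numbers from the slide; the `timings`, `totals` and `speedups` names are illustrative.

```python
# Per-step timings in seconds, copied from slide 16. The third entry is
# "Launch Map-Reduce" for the first two configurations and "Submit to Service"
# for the two Tez configurations.
timings = {
    "Existing Hive": [0.5, 0.5, 35, 102],
    "Interactive Hive": [0.5, 0.5, 35, 7],
    "Interactive Hive & Tez": [0.5, 0.5, 0.1, 7],
    "Interactive Hive & Tez I/O": [0.5, 0.5, 0.1, 3.5],
}

# Total latency per configuration, rounded to the slide's precision.
totals = {name: round(sum(steps), 1) for name, steps in timings.items()}

# Speedup of each configuration relative to existing Hive.
baseline = totals["Existing Hive"]
speedups = {name: round(baseline / total, 1) for name, total in totals.items()}

for name in timings:
    print(f"{name}: {totals[name]}s total, {speedups[name]}x vs. existing Hive")
```

The 35s job launch is the largest single cost once processing drops to 7s, which is why replacing it with a 0.1s submission to an always-on service accounts for most of the step from 43s to 8.1s.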
  • 17. FastQuery: Beyond Batch with YARN
    Tez Generalizes Map-Reduce: Simplified execution plans process data more efficiently
    Always-On Tez Service: Low latency processing for all Hadoop data processing
  • 18. Tez Service
    • MR Query Startup Expensive
      – Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s)
    • Solution
      – Tez Service
        – Removes task-launch overhead
        – Removes job-launch overhead
      – Hive/Pig
        – Submit query-plan to Tez Service
      – Native Hadoop service, not ad-hoc
    • An Architecture that can be Extended to the Next Level of Performance
      – Potential for Future Memory-based performance optimizations based on staging/pre-loading designated tables, indexes, and aggregates . . .
  • 19. Questions