
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)



Deep dive on how to tune and optimize U-SQL queries and data.

Published in: Data & Analytics


  1. 1. Tuning and Optimizing U-SQL queries for maximum performance Michael Rys, Principal Program Manager, Microsoft @MikeDoesBigData, usql@microsoft.com AD-315-MAD-400-M
  2. 2. Please silence cell phones
  3. 3. Session Objectives And Takeaways Session Objective(s): • Understand the U-SQL Query execution • Be able to understand and improve U-SQL job performance/cost • Be able to understand and improve the U-SQL query plan • Know how to write more efficient U-SQL scripts Key Takeaways: • U-SQL is designed for scale-out • U-SQL provides scalable execution of user code • U-SQL has a tool set that can help you analyze and improve your scalability, cost and performance
  4. 4. Agenda • Job Execution Experience and Investigations: Query Execution; Stage Graph; Dryad crash course; Job Metrics; Resource Planning • Job Performance Analysis: Analyze the critical path; Heat Map; Critical Path; Data Skew • Tuning / Optimizations: Cost Optimizations; Data Partitioning; Partition Elimination; Predicate Pushing; Column Pruning; Some Data Hints; UDOs can be evil; INSERT optimizations. U-SQL Query Execution and Performance Tuning
  5. 5. Overall U-SQL Batch Job Execution Lifetime [Diagram. Labels shown: Author; Plan; Vertex Execution; Consume; Front-End Service; USQL Compiler Service & USQL Catalog; Compiler; Optimizer; Optimized Plan; Job Scheduler & Queue; Job Manager; Vertex Scheduling on containers; U-SQL Runtime; Vertexes running in YARN Containers; Local Storage; Data Lake Store.]
  6. 6. Expression-flow Programming Style Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model. Execution plan that is optimized out of the box and without user intervention. Per-job and user-driven level of parallelization. Detailed visibility into execution steps, for debugging. Heat-map-like functionality to identify performance bottlenecks.
  7. 7. U-SQL Compilation Process [Diagram. Labels shown: Compiler & Optimizer; U-SQL Metadata Service; Compilation output (in job folder): C#, C++, Algebra, managed dll, unmanaged dll, other files (system files, deployed resources); Deployed to Vertices.]
  8. 8. Analyzing a job
  9. 9. Parallelism: 1000 ADLAUs. Work composed of 12K vertices. 1 ADLAU currently maps to a VM with 2 cores and 6 GB of memory.
  10. 10. U-SQL Query Execution Physical plans vs. Dryad stage graph…
  11. 11. Stage Details [Annotated screenshot. Callouts: 252 pieces of work; AVG vertex execution time; 4.3 billion rows; data read & written. Super Vertex = Stage.]
  12. 12. U-SQL Query Execution Redefinition of big-data…
  13. 13. U-SQL Query Execution Redefinition of big-data…
  14. 14. U-SQL Performance Analysis Analyze the critical path, heat maps, playback, and runtime metrics on every vertex…
  15. 15. Tuning for Cost Efficiency
  16. 16. Dips down to 1 active vertex at these times
  17. 17. Smallest estimated time when given 2425 ADLAUs: 1410 seconds = 23.5 minutes
  18. 18. Model with 100 ADLAUs: 8709 seconds ≈ 145.2 minutes
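The two modeled runs above can be compared on cost as well as time. The sketch below is a back-of-the-envelope model, not the service's billing logic: it assumes cost is proportional to allocated ADLAUs multiplied by wall-clock duration (AU-hours), and the function name `au_hours` is made up for illustration.

```python
# Figures come from the two slides above; the billing model
# (allocated ADLAUs x wall-clock time) is an assumption.

def au_hours(adlaus: int, seconds: float) -> float:
    """Reserved capacity in AU-hours: allocated units times elapsed hours."""
    return adlaus * seconds / 3600.0

fast = au_hours(2425, 1410)   # smallest estimated time
cheap = au_hours(100, 8709)   # modeled run with 100 ADLAUs

speedup = 8709 / 1410         # how much sooner the large allocation finishes
cost_ratio = fast / cheap     # how much more capacity it reserves

print(f"2425 ADLAUs: {fast:.1f} AU-hours")
print(f" 100 ADLAUs: {cheap:.1f} AU-hours")
print(f"speedup {speedup:.2f}x for {cost_ratio:.2f}x the AU-hours")
```

Under that assumption the larger allocation finishes roughly six times sooner but reserves roughly four times the capacity, which is the trade-off the cost-efficiency slides point at.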
  19. 19. Data Storage: Files and Tables • Unstructured data in files • Files are split into 250 MB extents • 4 extents per vertex -> 1 GB per vertex • Different file content formats: • Splittable formats are parallelizable: row-oriented (CSV etc.), where data does not span extents • Non-splittable formats cannot be parallelized: XML/JSON; they have to be processed in a single-vertex extractor with atomicFileProcessing=true • Use File Sets to provide semantic partition pruning • Tables • Clustered index (row-oriented) storage • Vertical and horizontal partitioning • Statistics for the optimizer (CREATE STATISTICS) • Native scalar value serialization
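The extent arithmetic on the slide above gives a rough way to estimate extraction parallelism. This is a minimal sketch that takes the slide's figures (250 MB extents, 4 extents per vertex) at face value; the function `extract_vertices` is hypothetical, and the real scheduler may group extents differently.

```python
import math

EXTENT_MB = 250          # extent size stated on the slide
EXTENTS_PER_VERTEX = 4   # extents read by one vertex, per the slide

def extract_vertices(file_size_mb: float, splittable: bool = True) -> int:
    """Estimate how many vertices read a file of the given size."""
    if not splittable:
        # XML/JSON-style formats are processed by a single vertex.
        return 1
    extents = math.ceil(file_size_mb / EXTENT_MB)
    return max(1, math.ceil(extents / EXTENTS_PER_VERTEX))

for size_mb in (100, 1_001, 10_000):
    print(size_mb, "MB ->", extract_vertices(size_mb), "vertices")
print("10,000 MB, non-splittable ->", extract_vertices(10_000, splittable=False))
```

The last line shows why non-splittable formats hurt at scale: the whole file lands on one vertex regardless of size.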
  20. 20. Querying unstructured data
  21. 21. U-SQL Optimizations: Partition Elimination – Unstructured Files. Before: // Unstructured Files (24 hours daily log impressions) @Impressions = EXTRACT ClientId int, Market string, OS string, ... FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif" ; // … // Filter by Market @US = SELECT * FROM @Impressions WHERE Market == "en" ; This extracts all files in the folder; the post filter pays the I/O cost only to drop most of the data. After: FROM @"wasb://ads@wcentralus/2015/10/30/{Market}_{*}.nif" ; PE pushes the predicate into the EXTRACT, which now reads only the “en” files. Files in ../2015/10/30/: en_10.0.nif, en_8.1.nif, de_10.0.nif, jp_7.0.nif, de_8.1.nif, … Partition Elimination • Even with unstructured files! • Leverage virtual columns (named) • Avoid unnamed {*} • WHERE predicates on named virtual columns bind the PE range at compilation time • Named virtual columns without predicate = warning • Design directories/files with PE in mind • Design for elimination early in the tree, not in the leaves
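The effect of a named virtual column can be modeled outside U-SQL. The sketch below uses the file names from the slide above; the regular expression is only a Python stand-in for the `{Market}_{*}.nif` file-set pattern, and `files_to_read` is a hypothetical helper, not part of any U-SQL API.

```python
import re

# File names shown on the slide for the folder ../2015/10/30/
FILES = ["en_10.0.nif", "en_8.1.nif", "de_10.0.nif", "jp_7.0.nif", "de_8.1.nif"]

# Python stand-in for the file-set pattern {Market}_{*}.nif
PATTERN = re.compile(r"^(?P<Market>[^_]+)_(?P<rest>.+)\.nif$")

def files_to_read(files, market=None):
    """Return the files an extractor must open.

    With a predicate on the named virtual column, non-matching files
    are pruned from the list before any I/O happens.
    """
    selected = []
    for name in files:
        match = PATTERN.match(name)
        if match is None:
            continue
        if market is None or match.group("Market") == market:
            selected.append(name)
    return selected

print(files_to_read(FILES))        # no predicate: every file is read
print(files_to_read(FILES, "en"))  # predicate on Market: only the "en" files
```

Because the market is encoded in the file name, the filter is decided from names alone, which is the point of designing directories and files with elimination in mind.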
  22. 22. How many clicks per domain? @rows = SELECT Domain, SUM(Clicks) AS TotalClicks FROM @ClickData GROUP BY Domain;
  23. 23. [Diagram comparing two plans for the clicks-per-domain query. File: each of EXTENT 1, EXTENT 2 and EXTENT 3 contains rows for CNN, FB and WH; the plan is Read -> Partial Agg -> Partition -> Full Agg -> Write, and the repartitioning is marked "Expensive!". U-SQL Table Distributed by Domain: EXTENT 1 holds FB, EXTENT 2 holds WH, EXTENT 3 holds CNN; the plan is Read -> Full Agg -> Write.]
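The difference between the two plans above can be sketched as a toy model. The click values below are invented for illustration, and counting "shuffled" partial rows is a simplification of what repartitioning costs; neither function reflects actual runtime internals.

```python
from collections import Counter

def clicks_from_file(extents):
    """Unstructured file: partial aggregate per extent, repartition, full aggregate."""
    partials = [Counter() for _ in extents]
    for partial, rows in zip(partials, extents):
        for domain, clicks in rows:
            partial[domain] += clicks
    shuffled = sum(len(p) for p in partials)  # partial rows that must move
    totals = Counter()
    for partial in partials:
        totals.update(partial)                # full aggregation after the shuffle
    return dict(totals), shuffled

def clicks_from_distributed_table(buckets):
    """Table distributed by Domain: each bucket aggregates locally."""
    totals = {}
    for rows in buckets:
        local = Counter()
        for domain, clicks in rows:
            local[domain] += clicks
        totals.update(local)                  # domains never span buckets
    return totals, 0                          # nothing to repartition

# Hypothetical data: every extent of the file holds all three domains.
FILE_EXTENTS = [
    [("CNN", 1), ("FB", 2), ("WH", 3)],
    [("CNN", 4), ("FB", 5), ("WH", 6)],
    [("CNN", 7), ("FB", 8), ("WH", 9)],
]
# The same rows stored in a table distributed by Domain.
TABLE_BUCKETS = [
    [("FB", 2), ("FB", 5), ("FB", 8)],
    [("WH", 3), ("WH", 6), ("WH", 9)],
    [("CNN", 1), ("CNN", 4), ("CNN", 7)],
]

file_totals, file_shuffled = clicks_from_file(FILE_EXTENTS)
table_totals, table_shuffled = clicks_from_distributed_table(TABLE_BUCKETS)
print(file_totals, "rows shuffled:", file_shuffled)
print(table_totals, "rows shuffled:", table_shuffled)
```

Both layouts produce the same totals; only the file-based plan has to move data between the partial and full aggregation.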
  24. 24. Scaling out with Distributions
  25. 25. Data Partitioning Tables: Table Partitioning and Distribution • Fine grained (horizontal) partitioning/distribution • Distributes within a partition (together with clustering) to keep same data values close • Choose for: join alignment, partition size, filter selectivity, partition elimination • Coarse grained (vertical) partitioning • Based on partition keys • Partition is addressable in the language • Query predicates will allow partition pruning • Choose for: data life cycle management, partition elimination. Distribution scheme and when to use it: HASH(keys): automatic hash for fast item lookup; DIRECT HASH(id): exact control of hash bucket value; RANGE(keys): keeps ranges together; ROUND ROBIN: to get equal distribution (if others give skew)
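The four distribution schemes listed above differ only in how a row is mapped to a bucket. The following sketch models that mapping in Python; the CRC32 hash, the modulo in the direct-hash case and all function names are assumptions for illustration, not what U-SQL uses internally.

```python
import bisect
import zlib

def hash_bucket(key: str, buckets: int) -> int:
    """HASH(keys): the system derives the bucket from a hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % buckets

def direct_hash_bucket(bucket_id: int, buckets: int) -> int:
    """DIRECT HASH(id): the caller supplies the bucket value itself."""
    return bucket_id % buckets

def range_bucket(key, split_points) -> int:
    """RANGE(keys): sorted split points keep neighboring keys together."""
    return bisect.bisect_right(split_points, key)

def round_robin_buckets(row_count: int, buckets: int):
    """ROUND ROBIN: rows are dealt out in turn, ignoring their values."""
    return [i % buckets for i in range(row_count)]

print(hash_bucket("CNN", 4))
print(direct_hash_bucket(6, 4))
print([range_bucket(k, [10, 20, 30]) for k in (5, 15, 25, 35)])
print(round_robin_buckets(8, 4))
```

Only round robin ignores the key entirely, which is why it evens out skew but gives up bucket elimination and join alignment on that key.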
  26. 26. Partitions, Distributions and Clusters. LOGICAL: TABLE T ( id … , C … , date DateTime, … , INDEX i CLUSTERED (id, C) PARTITIONED BY (date) DISTRIBUTED BY HASH(id) INTO 4) with PARTITION (@date1), PARTITION (@date2), PARTITION (@date3). [Diagram: each partition contains hash distributions (HASH DISTRIBUTION 1 to 4), and each distribution holds clustered values (C1 to C10).] PHYSICAL: /catalog/…/tables/Guid(T)/ containing Guid(T.p1).ss, Guid(T.p2).ss, Guid(T.p3).ss
  27. 27. Benefits of Clustered Index in Distribution Benefits • Design for most frequent/costly queries • Manage data skew in distribution bucket • Provide locality of same data values • Provide seeks and range scans for query predicates (index lookup) A clustered index in tables is mandatory; choose it according to the desired benefits. Pro Tip: Distribution keys should be a prefix of the clustered index keys, especially for RANGE distribution; the optimizer will then make use of global ordering. If you make the RANGE distribution key a prefix of the index key, U-SQL will repartition on demand to align any UNION ALLed or JOINed tables or partitions! Split points of table distribution partitions are chosen independently, so any partitioned table can do UNION ALL in this manner if the data is to be processed subsequently on the distribution key.
  28. 28. Benefits of Distribution in Tables Benefits • Design for most frequent/costly queries • Manage data skew in partition/table • Manage parallelism in querying (by number of distributions) • Manage minimizing data movement in joins • Provide distribution seeks and range scans for query predicates (distribution bucket elimination) Distribution in tables is mandatory; choose it according to the desired benefits
  29. 29. Benefits of Partitioned Tables Benefits • Partitions are addressable • Enables finer-grained data lifecycle management at partition level • Manage parallelism in querying by number of partitions • Query predicates provide partition elimination • Predicate has to be constant-foldable Use partitioned tables for • Managing large amounts of incrementally growing structured data • Queries with strong locality predicates • point in time, for specific market etc • Managing windows of data • provide data for last x months for processing
  30. 30. Partitioned tables Use partitioned tables for querying parts of large amounts of incrementally growing structured data. Get partition elimination optimizations with the right query predicates. Creating a partitioned table: CREATE TABLE vehiclesP(vehicle_id int, event_date DateTime, lat float, long float, INDEX idx CLUSTERED (vehicle_id ASC) PARTITIONED BY(event_date) DISTRIBUTED BY HASH (vehicle_id) INTO 4); Creating partitions: DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 00,00,00,00,DateTimeKind.Utc); DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00,00,00,00,DateTimeKind.Utc); ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@pdate2); Loading data into partitions dynamically: DECLARE @date1 DateTime = DateTime.Parse("2014-09-14"); DECLARE @date2 DateTime = DateTime.Parse("2014-09-16"); INSERT INTO vehiclesP ON INTEGRITY VIOLATION IGNORE SELECT vehicle_id, event_date, lat, long FROM @data WHERE event_date >= @date1 AND event_date <= @date2; • Filters and inserts clean data only, ignores “dirty” data. Loading data into partitions statically: ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@baddate); INSERT INTO vehiclesP ON INTEGRITY VIOLATION MOVE TO @baddate SELECT vehicle_id, lat, long FROM @data WHERE event_date >= @date1 AND event_date <= @date2;
  31. 31. U-SQL Optimizations: Distributions – Minimize (re)partitions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView OPTION(SKEWFACTOR(Query)=0.5) ; // Q1(A,B): input must be distributed on (Query) or (ClientId) or (Query, ClientId) @Sessions = SELECT ClientId, Query, SUM(PageClicks) AS Clicks FROM @Impressions GROUP BY Query, ClientId ; // Q2(B): input must be distributed on (Query) @Display = SELECT * FROM @Sessions INNER JOIN @Campaigns ON @Sessions.Query == @Campaigns.Query ; The optimizer wants to distribute only once, but Query could be skewed. Data Distribution • Re-distributing is very expensive • Many U-SQL operators can handle multiple distribution choices • Optimizer bases its decision upon estimations • Wrong statistics may result in worse query performance
  32. 32. U-SQL Optimizations: Distribution – Cardinality // Unstructured (24 hours daily log impressions) @Huge = EXTRACT ClientId int, ... FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif" ; // Small subset (i.e., ForgetMe opt out) @Small = SELECT * FROM @Huge WHERE Bing.ForgetMe(x,y,z) OPTION(ROWCOUNT=500) ; // Result (not enough info to determine simple broadcast join) @Remove = SELECT * FROM Bing.Sessions INNER JOIN @Small ON Sessions.Client == @Small.Client ; Broadcast JOIN, right? Without the hint, the optimizer has no statistics telling it that @Small is small; with OPTION(ROWCOUNT=500), broadcast is now a candidate. Wrong statistics may result in worse query performance => CREATE STATISTICS
  33. 33. What is Data Skew? • Some data points are much more common than others • Data may be distributed such that all rows that match a certain key go to a single vertex • Result: imbalanced execution, vertex time-outs. [Bar chart: Population by State; y-axis from 0 to 40,000,000; x-axis labels include California, Florida, Ohio, North Carolina, Washington, …, District of Columbia.]
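The imbalance described above can be made concrete with a small simulation. Everything in this sketch is synthetic: the key counts are invented, CRC32 stands in for whatever hash the runtime uses, and `vertex_loads` and `imbalance` are hypothetical helpers.

```python
import zlib
from collections import Counter

def vertex_loads(keys, vertices):
    """Count rows per vertex when rows are hash-partitioned by key."""
    loads = Counter({v: 0 for v in range(vertices)})
    for key in keys:
        loads[zlib.crc32(key.encode("utf-8")) % vertices] += 1
    return loads

def imbalance(loads):
    """Busiest vertex relative to the average vertex (1.0 = perfectly even)."""
    mean = sum(loads.values()) / len(loads)
    return max(loads.values()) / mean

# Synthetic input: one very common key plus 100 rare ones.
skewed_keys = ["California"] * 900 + [f"key{i}" for i in range(100)]
loads = vertex_loads(skewed_keys, vertices=4)
print(dict(loads))
print(f"imbalance: {imbalance(loads):.2f}")
```

All 900 rows of the common key hash to the same vertex, so that vertex carries at least 3.6 times the average load no matter how the rare keys fall.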
  34. 34. Low Distinctiveness Keys • Keys with small selectivity can lead to large vertices even without skew @rows = SELECT Gender, AGG<MyAgg>(…) AS Result FROM @HugeInput GROUP BY Gender; [Diagram: @HugeInput is split across only two vertices, Vertex 0 and Vertex 1, one for Gender==Male and one for Gender==Female.]
  35. 35. Why is this a problem? Vertexes have a 5 hour runtime limit! Your UDO or join may excessively allocate memory. • Your memory usage may not be obvious due to garbage collection
  36. 36. Addressing Data Skew/Low distinctiveness • Improve data partition sizes: • Find more fine-grained keys, e.g., states and congressional districts or ZIP codes • If no fine-grained keys can be found, or they are too fine-grained: use ROUND ROBIN distribution • Write queries that can handle data skew: • Use filters that prune skew out early • Use Data Hints to identify skew and “low distinctness” in keys: • SKEWFACTOR(columns) = x provides a hint that the given columns have a skew factor x between 0 (no skew) and 1 (very heavy skew) • DISTINCTVALUE(columns) = n lets you specify how many distinct values the given columns have (n>1) • Implement aggregation/reducer recursively if possible
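  37. 37. Non-Recursive vs Recursive SUM [Diagram. Non-recursive: the inputs 1 2 3 4 5 6 7 8 are summed in a single step to 36. Recursive: the same inputs are first summed in groups to the partial sums 6, 15 and 15, which are then summed to 36.]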
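The recursive SUM above can be sketched as a tree of partial sums. This is an illustration of the idea only; the fan-in of 3 is chosen to reproduce the partial sums on the slide, and `sum_tree` is a hypothetical helper rather than anything the runtime exposes.

```python
def sum_tree(values, fan_in=3):
    """Return every level of a multi-stage sum, from inputs to final total."""
    levels = [list(values)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Each group of `fan_in` values could be summed on a separate vertex.
        levels.append([sum(current[i:i + fan_in])
                       for i in range(0, len(current), fan_in)])
    return levels

levels = sum_tree(range(1, 9))
for depth, level in enumerate(levels):
    print(depth, level)
```

The final total equals the one-step sum because addition is associative, which is exactly the property a recursive aggregator or reducer has to guarantee.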
  38. 38. U-SQL Partitioning during Processing Data Skew
  39. 39. U-SQL Partitioning: Data Skew – Recursive Reducer // Metrics per domain @Metric = REDUCE @Impressions ON UrlDomain USING new Bing.TopNReducer(count:10) ; // … Inherent data skew: www.bing.com is brought to a single vertex. [SqlUserDefinedReducer(IsRecursive = true)] public class TopNReducer : IReducer { public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output) { // Compute TOP(N) per group // … } } Recursive • Allows multi-stage aggregation trees • Requires same schema (input => output) • Requires associativity: • R(x, y) = R( R(x), R(y) ) • Default = non-recursive • User code has to honor recursive semantics
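The associativity requirement R(x, y) = R(R(x), R(y)) can be checked on a toy top-N reducer. The sketch below is a Python model with invented rows; it does not use the U-SQL `IReducer` interface, and `top_n` and `recursive_top_n` are hypothetical names.

```python
import heapq

def top_n(rows, n):
    """Reducer R: keep the n largest (score, item) rows; output schema = input schema."""
    return heapq.nlargest(n, rows)

def recursive_top_n(partitions, n):
    """Apply R per partition first, then apply R again to the partial results."""
    partials = [top_n(partition, n) for partition in partitions]
    merged = [row for partial in partials for row in partial]
    return top_n(merged, n)

# Invented rows for one heavily skewed group, split across two vertices.
rows = [(5, "a"), (1, "b"), (9, "c"), (7, "d"), (3, "e"), (8, "f")]
partitions = [rows[:3], rows[3:]]

single_stage = top_n(rows, 2)
multi_stage = recursive_top_n(partitions, 2)
print(single_stage)
print(multi_stage)
```

Because each partial result has the same shape as the input, the reducer can be stacked into a multi-stage tree, so a single hot key no longer has to be processed on one vertex in one pass.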
  40. 40. // Bing impressions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute sessions @Sessions = REDUCE @Impressions ON Client, Market READONLY Market USING new Bing.SessionReducer(range : 30) ; // Users metrics @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" ; // … Microsoft Confidential U-SQL Optimizations Predicate pushing – UDO pass-through columns
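Why a pass-through (READONLY) column makes predicate pushing safe can be shown with a small model of the script above. The reducer below is a hypothetical stand-in that counts events per (client, market) and copies the market value through unchanged; the rows are invented.

```python
from itertools import groupby

def session_reducer(rows):
    """Hypothetical reducer: one output row per (client, market) with an event count.

    The market value is copied through unchanged, which is the property
    a READONLY declaration promises to the optimizer.
    """
    def key(row):
        return (row["client"], row["market"])

    output = []
    for (client, market), group in groupby(sorted(rows, key=key), key=key):
        output.append({"client": client, "market": market,
                       "events": len(list(group))})
    return output

rows = [
    {"client": 1, "market": "en-us"},
    {"client": 1, "market": "en-us"},
    {"client": 2, "market": "de-de"},
    {"client": 3, "market": "en-us"},
    {"client": 2, "market": "de-de"},
]

# Filter after the reducer, as the script is written.
filtered_late = [r for r in session_reducer(rows) if r["market"] == "en-us"]
# Filter before the reducer, as the optimizer may do once Market is READONLY.
filtered_early = session_reducer([r for r in rows if r["market"] == "en-us"])

print(filtered_late)
print(filtered_early)
```

Both orders give the same rows, but the early filter hands the reducer three input rows instead of five; on real data that difference is the saving from predicate pushing.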
  41. 41. Show me U-SQL UDOs! https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/
  42. 42. // Bing impressions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute page views @Impressions = PROCESS @Impressions READONLY Market PRODUCE Client, Market, Header string USING new Bing.HtmlProcessor() ; @Sessions = REDUCE @Impressions ON Client, Market READONLY Market USING new Bing.SessionReducer(range : 30) ; // Users metrics @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" ; U-SQL Optimizations Predicate pushing – UDO row level processors public abstract class IProcessor : IUserDefinedOperator { /// <summary/> public abstract IRow Process(IRow input, IUpdatableRow output); } public abstract class IReducer : IUserDefinedOperator { /// <summary/> public abstract IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output); }
  43. 43. U-SQL Optimizations: Predicate pushing – relational vs. C# semantics // Bing impressions @Impressions = SELECT Client, Market, Html FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute page views @Impressions = PROCESS @Impressions PRODUCE Client, Market, Header string USING new Bing.HtmlProcessor() ; // Users metrics, C# semantics (&& evaluates left to right and short-circuits) @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" && Header.Contains("microsoft.com") ; // Users metrics, relational semantics (AND lets the optimizer reorder the predicates) @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" AND Header.Contains("microsoft.com") ;
  44. 44. U-SQL Optimizations: Column Pruning and dependencies // Bing impressions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute page views @Impressions = PROCESS @Impressions PRODUCE * REQUIRED ClientId, HtmlContent(Header, Footer) USING new Bing.HtmlProcessor() ; // Users metrics @Metrics = SELECT ClientId, Market, Header FROM @Sessions WHERE Market == "en-us" ; [Diagram: the input has up to 1000 columns (A, B, C, …); only columns C, H and M flow through the plan.] Column Pruning • Minimize I/O (data shuffling) • Minimize CPU (complex processing, html) • Requires dependency knowledge: • R(D*) = Input ( Output ) • Default: no pruning • User code has to honor reduced columns
  45. 45. UDO Tips and Warnings • Tips when using UDOs: • READONLY clause to allow pushing predicates through UDOs • REQUIRED clause to allow column pruning through UDOs • PRESORT on REDUCE if you need global order • Hint cardinality if the optimizer chooses the wrong plan • Warnings and better alternatives: • Use SELECT with UDFs instead of PROCESS • Use user-defined aggregators instead of REDUCE • Learn to use windowing functions (OVER expression) • Good use cases for PROCESS/REDUCE/COMBINE: • The logic needs to dynamically access the input and/or output schema, e.g., create a JSON doc for the data in the row where the columns are not known a priori. • Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO • You need an ordered aggregator or need to produce more than 1 row per group
  46. 46. INSERT Multiple INSERTs into same table • Generates separate file per insert in physical storage: • Can lead to performance degradation • Recommendations: • Try to avoid small inserts • Rebuild table after frequent insertions with: ALTER TABLE T REBUILD;
  47. 47. Future Items GA and beyond • Tooling • Resource planning based on $-cost • Storage support • Storage compression (available since this week!) • Columnar Storage/Index • Secondary Index
  48. 48. Additional Resources Blogs and community page: • http://usql.io (U-SQL Github) • http://blogs.msdn.microsoft.com/mrys/ • http://blogs.msdn.microsoft.com/azuredatalake/ • https://channel9.msdn.com/Search?term=U-SQL#ch9Search Documentation and articles: • http://aka.ms/usql_reference • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/ • https://msdn.microsoft.com/en-us/magazine/mt614251 ADL forums and feedback: • http://aka.ms/adlfeedback • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake • http://stackoverflow.com/questions/tagged/u-sql Slide decks: • http://www.Slideshare.net/MichaelRys
  49. 49. Explore Everything PASS Has to Offer • FREE ONLINE WEBINAR EVENTS • FREE 1-DAY LOCAL TRAINING EVENTS • LOCAL USER GROUPS AROUND THE WORLD • ONLINE SPECIAL INTEREST USER GROUPS • BUSINESS ANALYTICS TRAINING • VOLUNTEERING OPPORTUNITIES • PASS COMMUNITY NEWSLETTER • BA INSIGHTS NEWSLETTER • FREE ONLINE RESOURCES
  50. 50. Session Evaluations: 3 ways to access • Go to passSummit.com • Download the GuideBook App and search: PASS Summit 2016 • Follow the QR code link displayed on session signage throughout the conference venue and in the program guide. Submit by 5pm Friday November 6th to WIN prizes. Your feedback is important and valuable.
  51. 51. Thank You Learn more from Michael Rys usql@microsoft.com or follow @MikeDoesBigData
