Automating DWH Patterns Through Metadata

Around 80% of the work to create a data warehouse/BI solution is spent on the ETL phase. Although building an ETL solution can be a challenge, you can break down the project into at least two separate processes for easier management. One process is strictly related to business modeling, and therefore cannot be replicated. But the other is made up of purely technical processes that are always the same, regardless of the business environment we operate in, and thus can be highly automated.

In this session, we will look at well-known patterns for solving common problems and at how they can be automated with specific tools and techniques that use metadata to reduce development time and bugs. Using these engineering techniques, you will be able to adopt an Agile approach to your BI solution.

Published in: Technology
  • http://chartporn.org/2012/05/10/repetitive-tasks/
  • http://en.wikipedia.org/wiki/Software_design_pattern
  • Matt Masson Blog: http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx
  • Automating DWH Patterns Through Metadata

    1. Automating Data Warehouse Patterns Through Metadata
       Davide Mauri
       dmauri@solidq.com
    2. Davide Mauri
       • 20 years of experience on the SQL Server platform
         – Specialized in Data Solution Architecture, Database Design, Performance Tuning, Business Intelligence, Data Warehouse, Big Data & Analytics
       • Microsoft SQL Server MVP
       • President of UGISS (Italian SQL Server UG)
       • Mentor @ SolidQ
         – Regular speaker @ SQL Server events
         – Projects, Consulting, Mentoring & Training
       • Find me here:
         – Blog: http://sqlblog.com/blogs/davide_mauri/default.aspx
         – Twitter: @mauridb
    3. Building a DWH in 2013
       • Is still an (almost) manual process
       • A *lot* of repetitive, low-value work
       • No (or very few) standard tools available
    4. How it should be
       • Semi-automatic process
         – “develop by intent”
       • Define the mapping logic from a semantic perspective
         – Source to Dimensions / Measures
           • (Metadata anyone?)
       • Design the model and let the tool build it for you
       • Statements shown on the slide:
           CREATE DIMENSION Customer FROM SourceCustomerTable MAP USING CustomerMetadata
           ALTER DIMENSION Customers ADD ATTRIBUTE LoyaltyLevel AS TYPE 1
           CREATE FACT Orders FROM SourceOrdersTable MAP USING OrdersMetadata
           ALTER FACT Orders ADD DIMENSION Customer
    5. The perfect BI process & architecture
       • Iterative!
    6. Is automation possible?
       DWH PROCESSES
    7. Invest in Automation?
       • Faster development
         – Reduce costs
         – Embrace changes
       • Fewer bugs
       • Increase solution quality and make it consistent throughout the whole product
    8. Automation Pre-Requisites
       • Split the process into two separate types of processes
         – What can be automated
         – What can NOT be automated
       • Create and impose a set of rules that defines
         – How to solve common technical problems
         – How to implement the identified solutions
    9. No Monkey Work!
       Let the people think and let the machines do the «monkey» work.
    10. Design Pattern
        “A general reusable solution to a commonly occurring problem within a given context”
    11. Design Pattern
        • Generic ETL Pattern
          – Partition Load
          – Incremental/Differential Load
        • Generic BI Design Pattern
          – Slowly Changing Dimension
            • SCD1, SCD2, etc.
          – Fact Table
            • Transactional, Snapshot, Temporal Snapshot
    12. Design Pattern
        • Specific SQL Server Patterns
          – Change Data Capture
          – Change Tracking
          – Partition Load
          – SSIS Parallelism
    13. Engineering the DWH
        “Software Engineering allows and requires the formalization of the software building and maintenance process.”
    14. Sample Rules
        • Always add a «last_update» column
        • Always log Inserted/Updated/Deleted rows to the log.load_info table
        • Use MD5 – binary(16) for checksums
        • Use views to expose data
          – Dimension & Fact views MUST use the same column names for lookup columns
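The MD5 rule can be illustrated with a minimal Python sketch. The slides do not show how the checksum is computed, so the helper name `row_checksum`, the `|` separator and the NULL marker are assumptions made for illustration only. The point is that an MD5 digest is always 16 bytes, which is why it fits a binary(16) column.

```python
import hashlib

def row_checksum(row, columns, sep="|"):
    """Return the 16-byte MD5 digest of the selected columns of a row (a dict).

    Hypothetical helper: the separator and NULL marker are illustrative choices.
    """
    parts = []
    for col in columns:
        value = row.get(col)
        # Encode NULL with a marker so that None and '' give different checksums.
        parts.append("\x00" if value is None else str(value))
    return hashlib.md5(sep.join(parts).encode("utf-8")).digest()

tracked = ["name", "city"]  # columns that take part in change detection
row_a = {"name": "Alice", "city": "Rome", "last_update": "2013-01-01"}
row_b = {"name": "Alice", "city": "Milan", "last_update": "2013-02-01"}
row_c = {"name": "Alice", "city": "Rome", "last_update": "2013-03-01"}

checksum_a = row_checksum(row_a, tracked)
checksum_b = row_checksum(row_b, tracked)
checksum_c = row_checksum(row_c, tracked)
```

Here `row_a` and `row_c` produce the same checksum because «last_update» is not among the tracked columns, while `row_b` differs because the city changed.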
    15. Engineering the DWH
        There are two intrinsic processes hidden in the development of a BI solution that must be allowed (or forced) to emerge.
    16. Business Process
        • Data manipulation, transformation, enrichment & cleansing logic
        • Specific to every customer
        • Hardly automatable
    17. Technical Process
        • Application of data extraction and loading techniques
        • Recurring (pattern) in any solution
        • Highly automatable
    18. High-Level Vision
        (Diagram: data flows from OLTP through STG to DWH; the ETL steps along the way are labeled either Technical Process or Business Process.)
    19. ETL Phases
        • «E» and «L» must be
          – Simple, easy and straightforward
          – Completely automated
          – Completely reusable
        • «E» and «L» have ZERO value in a BI solution
          – Should be done in the most economical way
    20. Well known solution to common problems
        PATTERN
    21. Source Full Load («E» phase)
    22. Source Incremental Load («E» phase)
        In this scenario, “ID” is an IDENTITY/SEQUENCE column, probably a PK.
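The slide's diagram is not in the transcript, so the following is only a sketch of one common way to use an ever-increasing ID: remember the highest ID already loaded and extract only the rows above it. SQLite stands in for the source system, and the table `source_orders` and function `extract_incremental` are hypothetical names.

```python
import sqlite3

def extract_incremental(conn, last_loaded_id):
    """Return the rows added since the last load and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, payload FROM source_orders WHERE id > ? ORDER BY id",
        (last_loaded_id,),
    ).fetchall()
    # If nothing is new, the high-water mark stays where it was.
    new_watermark = rows[-1][0] if rows else last_loaded_id
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source_orders (id, payload) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)

# Row 1 was loaded by a previous run; only rows 2 and 3 are extracted now.
new_rows, watermark = extract_incremental(conn, 1)
```

This approach only sees inserts. Updates and deletes on already loaded rows need one of the differential techniques.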
    23. Source Differential Load/1 («E» phase)
        In this scenario the source table doesn’t offer any specific way to understand what has changed.
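When the source gives no hint about changes, one option is to compare the current extract with what was stored at the previous load. The sketch below assumes that a checksum per key was kept from the last run; the names `fingerprint` and `diff_snapshots` are illustrative and the slide may show a different technique.

```python
import hashlib

def fingerprint(row):
    """MD5 digest over all values of a row, taken in column-name order."""
    joined = "|".join(str(row[col]) for col in sorted(row))
    return hashlib.md5(joined.encode("utf-8")).digest()

def diff_snapshots(previous, current):
    """Compare stored checksums with the current source rows.

    previous: {key: checksum saved at the last load}
    current:  {key: row dict read from the source now}
    """
    inserted = [k for k in current if k not in previous]
    updated = [
        k for k in current
        if k in previous and previous[k] != fingerprint(current[k])
    ]
    deleted = [k for k in previous if k not in current]
    return inserted, updated, deleted

previous = {
    1: fingerprint({"name": "A"}),
    2: fingerprint({"name": "B"}),
    3: fingerprint({"name": "C"}),
}
current = {1: {"name": "A"}, 2: {"name": "B2"}, 4: {"name": "D"}}

inserted, updated, deleted = diff_snapshots(previous, current)
```

The cost of this approach is a full read of the source at every load, which is why a native change indicator is preferable when one exists.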
    24. Source Differential Load/2 («E» phase)
        In this scenario the source table has a timestamp-like column.
    25. Source Differential Load («E» phase)
        • SQL Server 2012 features that can help with incremental/differential load
          – Change Data Capture
            • Natively supported in SSIS 2012
            • http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sqlserver-2012-2/
          – Change Tracking
            • Underused feature in BI: not as rich as CDC but MUCH simpler and easier
    26. SCD 1 & SCD 2 («L» phase)
        (Flowchart, reconstructed from its labels)
        • Start
        • Calculate the MD5 checksum of the non-SCD-key columns
        • Look up Dimension Id and MD5 checksum from the business key
        • Dimension Id is null?
          – Yes: insert new members into DWH
          – No: checksums are different?
            • Yes: store into temp table
        • Merge data from temp table to DWH
        • End
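The decision part of this flow can be sketched in a few lines of Python. This is an illustration under assumptions, not the SSIS implementation from the talk: the dimension is represented by an in-memory index keyed by business key, and `classify_rows` is a hypothetical name.

```python
def classify_rows(source_rows, dimension_index):
    """Split incoming rows into new members and changed members.

    source_rows:     list of (business_key, checksum, attributes)
    dimension_index: {business_key: (dimension_id, stored_checksum)}
    """
    new_members, changed = [], []
    for business_key, checksum, attrs in source_rows:
        found = dimension_index.get(business_key)
        if found is None:
            # "Dimension Id is null": the member is not in the DWH yet.
            new_members.append((business_key, checksum, attrs))
        elif found[1] != checksum:
            # Checksums differ: stage the row for the later merge.
            changed.append((found[0], business_key, checksum, attrs))
        # Otherwise the member is unchanged and nothing is written.
    return new_members, changed

dimension_index = {"C001": (1, b"aaa"), "C002": (2, b"bbb")}
source_rows = [
    ("C001", b"aaa", {"city": "Rome"}),    # unchanged
    ("C002", b"zzz", {"city": "Milan"}),   # changed
    ("C003", b"ccc", {"city": "Turin"}),   # new
]

new_members, changed = classify_rows(source_rows, dimension_index)
```

Only the `changed` list would go to the temp table, so the merge touches just the rows that really differ.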
    27. SCD 2 Special Note («L» phase)
        • Merge => UPDATE Interval + INSERT New Row
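A minimal sketch of what "UPDATE Interval + INSERT New Row" means for one changed member. The column names `valid_from` and `valid_to` and the use of `None` for the open-ended current version are assumptions; the slides do not name the interval columns.

```python
def apply_scd2_change(dimension, business_key, new_attrs, change_date):
    """Close the current version of a member and append the new version."""
    # UPDATE interval: end the validity of the currently open row.
    for row in dimension:
        if row["business_key"] == business_key and row["valid_to"] is None:
            row["valid_to"] = change_date
    # INSERT new row: the new version starts where the old one ended.
    new_row = {
        "dimension_id": max((r["dimension_id"] for r in dimension), default=0) + 1,
        "business_key": business_key,
        "attrs": new_attrs,
        "valid_from": change_date,
        "valid_to": None,
    }
    dimension.append(new_row)
    return new_row

dim_customer = [
    {"dimension_id": 1, "business_key": "C001", "attrs": {"city": "Rome"},
     "valid_from": "2012-01-01", "valid_to": None},
]
apply_scd2_change(dim_customer, "C001", {"city": "Milan"}, "2013-05-01")
```

After the call the dimension holds two rows for C001: the closed historical version and the new open one.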
    28. FACT TABLE LOAD («L» phase)
    29. Partition Load («E» and «L» phases)
    30. Parallel Load («E» and «L» phases)
        • Logically split the work into several steps
          – E.g.: load/process one customer at a time
        • Create a «queue» table that stores information for each step
          – Step 1 -> Load Customer «A»
          – Step 2 -> Load Customer «B»
        • Create a package that
          1. Picks the first step not already picked up
          2. Does the work
          3. Goes back to step 1
        • Call the package «n» times simultaneously
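The queue pattern can be sketched with Python threads. An in-memory list guarded by a lock stands in for the «queue» table, and each thread plays the role of one running copy of the package; `StepQueue`, `worker` and `run_parallel` are hypothetical names, and a real implementation would do the pick-up atomically inside the database.

```python
import threading

class StepQueue:
    """In-memory stand-in for the «queue» table."""

    def __init__(self, steps):
        self._rows = [{"step": s, "status": "pending"} for s in steps]
        self._lock = threading.Lock()

    def pick_next(self):
        """Atomically pick the first step that nobody has picked up yet."""
        with self._lock:
            for row in self._rows:
                if row["status"] == "pending":
                    row["status"] = "picked"
                    return row
        return None

def worker(queue, do_work, done):
    while True:
        row = queue.pick_next()      # 1. pick the first free step
        if row is None:
            return                   # queue is empty: this worker stops
        do_work(row["step"])         # 2. do the work
        row["status"] = "done"
        done.append(row["step"])     # 3. loop back and pick the next step

def run_parallel(steps, do_work, n):
    """Run «n» workers simultaneously over the same queue."""
    queue, done = StepQueue(steps), []
    threads = [
        threading.Thread(target=worker, args=(queue, do_work, done))
        for _ in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

processed = run_parallel(["A", "B", "C", "D", "E"], lambda customer: None, n=3)
```

Because the pick-up happens under a lock, every step is processed exactly once, whatever the number of workers. The completion order is not deterministic.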
    31. Other SSIS Specific Patterns
        • Range Lookup
          – Not natively supported
          – Matt Masson has the answer in his blog
            • http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx
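To show what a range lookup does (independently of how it is built in SSIS, which the linked post covers), here is a small Python sketch: a value is matched against non-overlapping intervals instead of against an exact key. The class name, the interval bounds and the loyalty levels are invented for the example.

```python
import bisect

class RangeLookup:
    """Find the value whose [lower, upper) interval contains a key."""

    def __init__(self, ranges):
        # ranges: (lower_inclusive, upper_exclusive, value), non-overlapping
        self._ranges = sorted(ranges)
        self._lowers = [r[0] for r in self._ranges]

    def find(self, key):
        # Locate the last interval whose lower bound is <= key.
        i = bisect.bisect_right(self._lowers, key) - 1
        if i >= 0:
            low, high, value = self._ranges[i]
            if low <= key < high:
                return value
        return None  # no interval matches

loyalty = RangeLookup([
    (0, 100, "Bronze"),
    (100, 500, "Silver"),
    (500, 10_000, "Gold"),
])
```

For example, `loyalty.find(250)` returns `"Silver"`, and a key outside every interval returns `None`.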
    32. A key ingredient in automation
        METADATA
    33. Metadata
        • Provide context information
          – Which columns are used to build/feed a Dimension?
          – Which columns are Business Keys?
          – Which table is the Fact Table?
          – How are Fact and Dimension connected?
            • Which columns are used?
    34. How to manage Metadata?
        • Naming Convention
        • Extended Properties
        • Specific, Ad Hoc Database or Tables
        • Other (XML, File, etc.)
    35. Naming Convention
        • The easiest and cheapest
          – No additional (hidden) costs
          – No need to be maintained
          – Never out-of-sync
          – No documentation needed
            • Actually, it IS PART of the documentation
          – Imposes a standard
        • Very limited in terms of flexibility and usage
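As an illustration of metadata carried by names alone, the sketch below derives the role of a table and of its columns from prefixes. The prefixes `dim_`, `fact_`, `bk_` and `id_` are a made-up convention for this example; the slides do not say which convention the author uses.

```python
def classify_table(table_name, columns):
    """Derive metadata from a (hypothetical) naming convention."""
    if table_name.startswith("dim_"):
        kind = "dimension"
    elif table_name.startswith("fact_"):
        kind = "fact"
    else:
        kind = "unknown"
    return {
        "kind": kind,
        # bk_ marks business keys, id_ marks keys that point to a dimension.
        "business_keys": [c for c in columns if c.startswith("bk_")],
        "dimension_keys": [c for c in columns if c.startswith("id_")],
        "attributes": [c for c in columns if not c.startswith(("bk_", "id_"))],
    }

customer_meta = classify_table(
    "dim_customer", ["id_customer", "bk_customer", "name", "loyalty_level"]
)
orders_meta = classify_table(
    "fact_orders", ["id_customer", "id_date", "amount"]
)
```

The limits mentioned on the slide show up quickly: anything that does not fit in a prefix, such as the SCD type of an attribute, needs another metadata store.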
    36. Extended Properties
        • Support most metadata needs
        • No additional software needed
        • Very verbose usage
          – Developing a wrapper to make usage simpler is feasible and encouraged
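A wrapper of the kind suggested here could be as small as the following sketch, which only builds the T-SQL text for `sys.sp_addextendedproperty` on a column. The function names and the property name `dwh.role` are hypothetical, and the author's own wrapper (not shown in the transcript) may look quite different.

```python
def quote(value):
    """Render a Python string as a T-SQL Unicode literal."""
    return "N'" + value.replace("'", "''") + "'"

def add_column_property_sql(schema, table, column, name, value):
    """Build the call that attaches an extended property to a column."""
    return (
        "EXEC sys.sp_addextendedproperty "
        f"@name = {quote(name)}, @value = {quote(value)}, "
        f"@level0type = N'SCHEMA', @level0name = {quote(schema)}, "
        f"@level1type = N'TABLE', @level1name = {quote(table)}, "
        f"@level2type = N'COLUMN', @level2name = {quote(column)};"
    )

# 'dwh.role' is an invented property name used to tag business keys.
statement = add_column_property_sql(
    "dbo", "dim_customer", "bk_customer", "dwh.role", "business_key"
)
```

The eight parameters needed for a single column property show why the slide calls the native usage very verbose.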
    37. Metadata Objects
        • Dedicated Ad-Hoc Database and Tables
        • As flexible as you need
        • Maintenance overhead to keep metadata in sync with data
          – Development of an automatic check procedure is needed
          – DMVs can help a lot here
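An automatic check of this kind compares what the metadata tables claim with what the database catalog actually contains. The sketch below uses SQLite, with `pragma_table_info` standing in for the SQL Server catalog views and DMVs; the table `meta_columns` and the function `find_out_of_sync` are hypothetical.

```python
import sqlite3

def find_out_of_sync(conn):
    """Return (table, column) pairs listed in the metadata but missing in the catalog."""
    problems = []
    described = conn.execute(
        "SELECT table_name, column_name FROM meta_columns "
        "ORDER BY table_name, column_name"
    ).fetchall()
    for table, column in described:
        # Ask the catalog which columns the table really has.
        actual = {
            r[0] for r in conn.execute(
                "SELECT name FROM pragma_table_info(?)", (table,)
            )
        }
        if column not in actual:
            problems.append((table, column))
    return problems

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id_customer INTEGER, bk_customer TEXT, name TEXT)")
conn.execute("CREATE TABLE meta_columns (table_name TEXT, column_name TEXT)")
conn.executemany(
    "INSERT INTO meta_columns VALUES (?, ?)",
    [("dim_customer", "bk_customer"), ("dim_customer", "loyalty_level")],
)

out_of_sync = find_out_of_sync(conn)
```

In this example the metadata still describes a `loyalty_level` column that the table does not have, so the check reports it.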
    38. External Metadata Objects
        • Really expensive to keep them in sync
          – A tool is needed, otherwise too much manual work
        • Do not give any specific benefit over Ad-Hoc Database/Tables
    39. DEMO
    40. Let’s make it possible!
        AUTOMATION
    41. Automation Scenarios
        • Run-Time: «Auto-Configuring» Packages
          – Really hard to customize packages
          – SSIS limitations must be managed
            • E.g.: Data Flow cannot be changed at runtime
            • On-the-fly creation of packages may be needed
        • Design-Time: Package Generators / Package Templates
          – Easy to customize created packages
    42. Automation Solutions
        • Specific tools/frameworks
          – BIML / MIST
        • SQL Server Platform
          – SQL, PowerShell, .NET
          – SMO, AMO
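Whatever tool is chosen, design-time generation boils down to turning metadata into artifacts. As a language-neutral sketch (Python here, rather than BIML or .NET), the function below turns a small metadata dictionary into a T-SQL MERGE statement for an SCD 1 load. The metadata layout, the `checksum` column and all object names are assumptions made for the example.

```python
def generate_scd1_merge(meta):
    """Generate a T-SQL MERGE for an SCD 1 dimension from metadata."""
    key = meta["business_key"]
    attrs = meta["attributes"]
    all_cols = [key] + attrs + ["checksum"]
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in attrs + ["checksum"])
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {meta['target']} AS t\n"
        f"USING {meta['source']} AS s\n"
        f"    ON t.{key} = s.{key}\n"
        # Overwrite attributes only when the checksum shows a real change.
        f"WHEN MATCHED AND t.checksum <> s.checksum THEN\n"
        f"    UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED BY TARGET THEN\n"
        f"    INSERT ({insert_cols}) VALUES ({insert_vals});"
    )

customer_metadata = {
    "target": "dwh.dim_customer",
    "source": "stg.customer",
    "business_key": "bk_customer",
    "attributes": ["name", "city"],
}
merge_sql = generate_scd1_merge(customer_metadata)
```

Adding an attribute to the dimension then means adding one entry to the metadata and regenerating, instead of editing the statement by hand.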
    43. Package Generators
        • Required assemblies
          – Microsoft.SqlServer.ManagedDTS
          – Microsoft.SqlServer.DTSRuntimeWrap
          – Microsoft.SqlServer.DTSPipelineWrap
        • Path: C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies
    44. DEMO
    45. Useful Resources
        • «STOCK» Tasks:
          – http://msdn.microsoft.com/en-us/library/ms135956.aspx
        • How to set Task properties at runtime:
          – http://technet.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.executables.add.aspx
    46. BIML – BI Markup Language
        • Developed by Varigence
          – http://www.varigence.com
          – http://bimlscript.com/
          – MIST: BIML full-featured IDE
        • Free via BIDS Helper
          – Support “limited” to SSIS package generation
          – http://bidshelper.codeplex.com
    47. THANK YOU!
        • For attending this session and PASS SQLRally Nordic 2013, Stockholm
