This document provides an overview of SSIS design patterns for data warehousing and change data capture. It discusses what design patterns are and how they are commonly used in SSIS and data warehousing projects. It then covers 13 specific patterns, including truncate and load, slowly changing dimensions, hashbytes, change data capture, merge, and master/child workflows. The document explains when each pattern is best used and provides pros and cons. It also provides guidance on configuring and using SQL Server change data capture (CDC) functionality.
3. What is a Design Pattern?
• Pattern – A design for a package that solves a certain scenario
• Over time certain SSIS logic flows have emerged as best practices
• These designs have been classified into patterns for reference purposes
• Standard Design Patterns
– Learn from others
– Common patterns make it easy for new personnel to understand and work with
– Easy to apply in new projects
4. Design Patterns and Data Warehousing
• SSIS most commonly used in Data Warehousing
• Patterns in this course most commonly used in Data Warehousing
• Applicable to non-DW projects
• Definitions
– Type 1 – Dimension updates simply overwrite pre-existing values
– Type 2 – Each update to a dimension causes a new record to be created
– Fact – Records the measures for a transaction and associates with dimensions
5. What You Need?
• SQL Server Data Tools – BI Components
– For SQL Server 2012 use Visual Studio 2012
– For SQL Server 2014 use Visual Studio 2013
• SQL Server Data Tools – Database Project – SQL Server 2012
– Uses Visual Studio 2012
• SQL Server Data Tools – Database Project – SQL Server 2014
– Included in Visual Studio 2013 Community Edition
– Included in other versions of VS 2013 out of the box
– Make sure to install Update 4
6. Versions of SQL Server
• We will use SQL Server 2014 Project Deployment Mode
• Material works identically in 2012 (Project Deployment Mode)
– Package Deployment Mode for 2012/2014 requires older style configurations for Master/Child
• Patterns applicable in 2008R2 & 2008 with limitations
– CDC has to be manually implemented, no controls in SSIS Toolbox
• Master / Child works differently – uses configurations
• Limited applicability to SQL Server 2005
– No Hashbytes
– No Merge
– No CDC
– Master / Child works differently – uses configurations
7. Deploying the Test Database
• Before running the project you will need to deploy and set up the test database
• Uses an SSDT Database Project as part of the solution
• Deploy the database
• After the deploy, run the stored procedure DDL.CreateAllObjects
8. The 13 Patterns
• Truncate and Load
• SCD Wizard
– Type 1
– Type 2
• Set Based Updates
– Type 1
– Type 2
• Hashbytes
– Different Databases
– Same Database
• Change Data Capture
• Merge
• Date Based
• Fact Table Pattern
• Master / Child
– Basic
– Passing Parameters
– Load Balancing
9. Truncate and Load
• Deletes all rows in target, then completely reloads from source
• Commonly used in staging environments, often with other patterns
• Pros
– Simple to implement
– Fast for small to medium datasets
• Cons
– No change tracking
– Slower for large datasets
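A minimal T-SQL sketch of the pattern, assuming a hypothetical staging target dbo.StgCustomer loaded from a hypothetical source table src.Customer (in SSIS the TRUNCATE typically runs in an Execute SQL Task and the reload in a data flow):

-- Truncate and Load: empty the target, then reload everything from the source.
-- dbo.StgCustomer and src.Customer are illustration-only names.
TRUNCATE TABLE dbo.StgCustomer;

INSERT INTO dbo.StgCustomer (CustomerID, FirstName, LastName, Email)
SELECT CustomerID, FirstName, LastName, Email
FROM   src.Customer;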
10. SCD Wizard Type 1
• SCD (Slowly Changing Dimension) Wizard
• Pattern for tables with Type 1 attributes only
• Pros
– Easy to create
– Good for very very small updates
• Cons
– When something changes all SCD generated components must be deleted and recreated.
– Incredibly slow.
– Did we mention it is slow?
– It is really really slow.
11. SCD Wizard Type 2
• SCD (Slowly Changing Dimension) Wizard
• Pattern for tables with Type 1 and 2 attributes
• Wizard is the same for both patterns, just different options
• Pros
– Easy to create
– Good for very small updates
• Cons
– When something changes all SCD generated components must be deleted and recreated.
– Incredibly slow.
– It didn’t get any faster since the last section
12. Set Based Updates – Type 1
• Set Based Updates – Type 1
• Pros
– Scales well
– Runs fast
• Cons
– Requires extra tables in the database
– Requires more setup work in the package
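A hedged sketch of the set-based Type 1 update that would typically run in an Execute SQL Task after the data flow has landed changed rows in an extra update-staging table; all table and column names here are hypothetical:

-- Type 1: overwrite current dimension values from the update-staging table.
UPDATE d
SET    d.FirstName = u.FirstName,
       d.LastName  = u.LastName,
       d.Email     = u.Email
FROM   dbo.DimCustomer AS d
       JOIN upd.DimCustomer AS u
         ON u.CustomerAK = d.CustomerAK;   -- business (alternate) key

TRUNCATE TABLE upd.DimCustomer;            -- clear the staging table for the next run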
13. Set Based Updates – Type 2
• Set Based Updates – Type 2
• Pros
– Scales well
– Runs fast
• Cons
– Logic is somewhat complex
– Requires extra tables in the database
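A sketch of the corresponding set-based Type 2 logic, again assuming hypothetical dimension and update-staging tables with RowIsCurrent / RowStartDate / RowEndDate housekeeping columns:

-- Type 2: expire the current row, then insert the new version of the row.
UPDATE d
SET    d.RowIsCurrent = 0,
       d.RowEndDate   = SYSDATETIME()
FROM   dbo.DimCustomer AS d
       JOIN upd.DimCustomer AS u
         ON u.CustomerAK = d.CustomerAK
WHERE  d.RowIsCurrent = 1;

INSERT INTO dbo.DimCustomer
       (CustomerAK, FirstName, LastName, Email, RowIsCurrent, RowStartDate, RowEndDate)
SELECT CustomerAK, FirstName, LastName, Email, 1, SYSDATETIME(), '9999-12-31'
FROM   upd.DimCustomer;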
14. Hashbytes – Different Databases
• Uses the Hashbytes function to generate a hash value for comparisons
• Pros
– Good for tables with many columns
– Scales well – fast
• Cons
– Requires use of lookups – caching requires memory
– Requires concatenation of all data columns in the select statement
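A sketch of the hash computation on the source side, with hypothetical table and column names; the resulting RowHash is compared against the stored hash (for example in a Lookup followed by a Conditional Split) to detect changed rows:

-- Concatenate the data columns with a delimiter so ('ab','c') and ('a','bc')
-- do not produce the same hash, then hash the result.
-- Note: before SQL Server 2016, HASHBYTES input is limited to 8,000 bytes.
SELECT CustomerID,
       HASHBYTES('SHA2_256',
                 CONCAT(FirstName, '|', LastName, '|', Email, '|', City)) AS RowHash
FROM   src.Customer;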
15. Hashbytes – Same Database
• Uses Hashbytes with a Merge Join
• Pros
– Avoids use of lookups, lowers memory requirements
– Scales very well
– Will work on different databases but most efficient in a single database
• Cons
– Requires data sources to be sorted
– Requires a common key to sort on
– Needs to concatenate data columns for Hashbytes
16. Change Data Capture
• Lets SQL Server track which rows in the source have changed
• Pros
– Tracks changes to data
– Only reads rows which have changed
– Easy to determine Create / Update / Delete actions
• Cons
– Only works with SQL Server
– Requires setup work in the database and tables before it can be used
– Must have the ability to alter the source system
17. Merge
• Uses the SQL Server MERGE statement
• Pros
– Simple to implement
– Very fast
• Cons
– No transformations
– No ability to track progress
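A hedged T-SQL sketch of the pattern using hypothetical staging and dimension tables; the MERGE usually runs in an Execute SQL Task after staging is loaded:

-- One statement handles both inserts of new rows and updates of changed rows.
MERGE dbo.DimCustomer AS tgt
USING stg.Customer    AS src
   ON tgt.CustomerAK = src.CustomerAK
WHEN MATCHED AND (tgt.FirstName <> src.FirstName
               OR tgt.LastName  <> src.LastName
               OR tgt.Email     <> src.Email) THEN
    UPDATE SET tgt.FirstName = src.FirstName,
               tgt.LastName  = src.LastName,
               tgt.Email     = src.Email
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerAK, FirstName, LastName, Email)
    VALUES (src.CustomerAK, src.FirstName, src.LastName, src.Email);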
18. Date Based
• Uses date driven values to determine changes
• Pros
– Easy to determine changes to rows
– Reduces number of rows that are read
– Can be combined with any of the other patterns
• Cons
– Requires source system to have a reliable date field indicating changes
– Still requires logic to determine new rows vs updates
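A minimal sketch of a date-based extract, assuming the source has a reliable ModifiedDate column and the last successful load time is kept in a hypothetical etl.LoadControl table:

-- Read only rows modified since the last successful load.
DECLARE @LastLoadDate datetime2 =
        (SELECT MAX(LoadEndDate) FROM etl.LoadControl);   -- hypothetical control table

SELECT CustomerID, FirstName, LastName, Email, ModifiedDate
FROM   src.Customer
WHERE  ModifiedDate > @LastLoadDate;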
19. Fact Table Pattern
• Used to update metrics / measurements in the data warehouse
• Pros
– Common pattern
– Easy to implement
• Cons
– Can require many lookups
– Updates not always simple
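A sketch of a fact load with hypothetical table names; in SSIS the dimension-key joins below are usually implemented as Lookup transformations:

-- Resolve surrogate keys against the dimensions, then insert the measures.
INSERT INTO dbo.FactSales (DateKey, CustomerKey, ProductKey, Quantity, SalesAmount)
SELECT d.DateKey, c.CustomerKey, p.ProductKey, s.Quantity, s.SalesAmount
FROM   stg.Sales AS s
       JOIN dbo.DimDate     AS d ON d.FullDate   = s.OrderDate
       JOIN dbo.DimCustomer AS c ON c.CustomerAK = s.CustomerID
       JOIN dbo.DimProduct  AS p ON p.ProductAK  = s.ProductID;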
20. Master / Child (Basic)
• A master (parent) package which coordinates the execution of other packages (children)
• Pros
– Simple to implement
• Cons
– Not always efficient when many packages are involved
21. Master / Child (Parameters)
• Passing values from the master package to children
• Pros
– SQL Server 2012 / 2014 project deployment mode makes it very easy to pass values
– Easy to reuse values across multiple child packages
• Cons
– In package deployment mode, or SQL Server 2008R2 and previous, requires the more complex configurations
22. Master/Child (Load Balanced)
• Uses a table to drive package execution
• Pros
– Easy to alter execution – just update a table
– Can easily balance parallel execution of packages
• Cons
– Needs many variables
– Requires some manual effort and monitoring to effectively balance
23. Choosing a Pattern
• Truncate and Load
– Low to moderate number of rows
– No requirement to track changes
– Good for staging tables
• SCD Wizard, Type 1 & 2
– Very small number of rows (< 2000)
– Packages that won’t change
– There is almost always a better pattern
24. Choosing a Pattern
• Set Based Updates, Type 1 & 2
– Scales well
– Good for limited number of columns
– Extra RAM required
• Hashbytes
– Scales well
– Good for large number of columns
– Source system needs to implement a form of the Hashbytes function
• Change Data Capture
– Excellent pattern – SQL Server tells you all changes
– Data source must be SQL Server
25. Choosing a Pattern
• Merge
– Good for very simple ETL when no monitoring is required
• Date Based
– Limits number of rows read in
– Can be combined with other patterns
• Fact Table Pattern
• Master / Child
– Basic
– Passing Parameters
– Load Balancing
27. Introduction
• Many applications have requirements for identifying data changes that have occurred in a database for various reasons
– Tracking historical changes to data
– Auditing changes to data
– Synchronizing data changes across disconnected systems
– Implementing an Operational Data Store (ODS)
– Incremental data loading for a Data Warehouse
• SQL Server provides many techniques for tracking changes to data
28. Techniques for Identifying Changes
• DML table triggers
– Can track before and after row state and deleted rows
– Can be customized to include the user, modification time, or input buffer
– Can introduce significant performance overhead on transactional systems
• Modified datetime or timestamp column
– Can introduce performance overhead to pull changes
– Does not track deleted rows
– Requires schema modifications to source tables and code to set the value
29. Techniques for Identifying Changes
• Data comparisons
– Comparing source and destination data requires scanning all rows to determine changes and introduces significant performance overhead
• Replication and Subscriber triggers
– Offloads change identification to the subscription database
– Requires customizations and manual management of schema changes
30. Change Data Capture Solution
• Change Data Capture provides information about DML changes to a table in near real-time using the same Log Reader Agent as transactional replication
– Eliminates expensive techniques that require schema modifications
• DML triggers
• Timestamp columns
• Data comparisons and complex JOIN queries
• May be used to answer a variety of critical questions:
– What are all of the changes that happened to a table since the last ETL?
– Which columns changed?
– What type of changes occurred? INSERT/UPDATE/DELETE?
– What was the before image of a row that was modified?
• Supports net change identification, with a performance trade-off due to an additional index on change tables
31. Configuring Change Data Capture
• Change Data Capture provides the ability to capture the row data from DML changes to a database when enabled for capture
• Configuring Change Data Capture has specific requirements, which when met allow individual tables to be configured for change capture
• The options for configuring a table for change capture affect performance, the data collected, and security controls for accessing the capture tables
32. Requirements to Enable CDC
• Enterprise Edition feature only
• Enabling CDC for a database requires sysadmin privileges
– Requires executing sp_cdc_enable_db
• Enabling CDC for a table requires db_owner privileges in the database
– Requires executing sp_cdc_enable_table for each table that will track changes within the database
• Querying the results from the CDC tables requires membership within the database role specified in the sp_cdc_enable_table procedure call, if specified
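A minimal enablement script; MyDatabase and dbo.Customer are hypothetical, and the table is assumed to have a primary key so net changes can be supported:

USE MyDatabase;
EXEC sys.sp_cdc_enable_db;                 -- database level, requires sysadmin

EXEC sys.sp_cdc_enable_table               -- table level, requires db_owner
     @source_schema        = N'dbo',
     @source_name          = N'Customer',
     @role_name            = NULL,         -- NULL = no gating role
     @supports_net_changes = 1;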
33. sp_cdc_enable_table Options
• @source_schema – the name of the source table schema (required)
• @source_name – the name of the source table (required)
• @role_name – the name of the database security role used for gating access to the change data (required, but can explicitly be set to NULL)
• @capture_instance – the name of the capture instance (optional)
• @supports_net_changes – indicates whether querying for net changes is supported by the capture instance (optional, defaults to 1)
– Enabling net change support adds an additional non-clustered index to the capture table, which can impact insert performance for change rows
• @index_name – the name of a unique index to uniquely identify rows in the source table (optional, defaults to the primary key)
34. sp_cdc_enable_table Options
• @captured_column_list – the list of source table columns to include in the change table (optional)
• @filegroup_name – the filegroup to be used for the change table (optional)
• @allow_partition_switch – indicates whether the SWITCH PARTITION command of ALTER TABLE can be executed against a table that is enabled for CDC (optional)
– Switching a partition into a CDC enabled table does not generate INSERT change data for rows that previously existed in the partition prior to the switch
– Switching a partition out of a CDC enabled table does not generate DELETE change data for the rows contained within the partition
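For illustration, a hypothetical call that sets the optional parameters described on the last two slides; every name and value here is an assumption:

EXEC sys.sp_cdc_enable_table
     @source_schema          = N'dbo',
     @source_name            = N'Customer',
     @role_name              = N'cdc_reader',        -- gating role for change data access
     @capture_instance       = N'dbo_Customer_v2',
     @supports_net_changes   = 1,
     @captured_column_list   = N'CustomerID, FirstName, LastName, Email',
     @filegroup_name         = N'CDC_FG',
     @allow_partition_switch = 1;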
36. Introduction
• After enabling a database for Change Data Capture and configuring capture instances for the source tables, the change data must be queried for processing
• All change rows are identified by the Log Sequence Number (LSN) associated with the transaction that changed the row
• Change tables include internal metadata columns that describe the change row as well as the captured columns configured for the table
37. Finding Change Table Metadata
• Stored Procedures
– sys.sp_cdc_help_change_data_capture – returns the CDC capture information for each table enabled within a database
• May return up to two rows per table, one for each capture instance
• The @source_schema parameter specifies the source schema to return results for when the procedure executes
• The @source_name parameter specifies the source table to return results for when the procedure executes
– sys.sp_cdc_get_captured_columns – returns the captured columns for the capture instance specified by the @capture_instance parameter
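Example calls, assuming the hypothetical dbo.Customer table and its default capture instance name dbo_Customer:

EXEC sys.sp_cdc_help_change_data_capture
     @source_schema = N'dbo',
     @source_name   = N'Customer';

EXEC sys.sp_cdc_get_captured_columns
     @capture_instance = N'dbo_Customer';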
38. Finding Change Table Metadata
• System tables
– cdc.change_tables – contains up to two rows, one per capture instance enabled on a source table
– cdc.captured_columns – contains one row per captured column for a source capture instance
• Querying the system tables directly is not recommended; use the stored procedures instead
39. Understanding Change Table Columns
• Change tables exist within the CDC schema and are named <capture_instance>_CT
• The first five columns are metadata columns:
– __$start_lsn – the starting LSN of the transaction
– __$end_lsn – the ending LSN of the transaction
– __$seqval – the sequence or order of the row changes within a transaction
– __$operation – the type of operation reflected by the change row
• 1 = Delete
• 2 = Insert
• 3 = Value before update
• 4 = Value after update
– __$update_mask – bitmask of columns changed by the operation within the row
• Remaining columns match the source table column definition when the capture instance was created
40. Determining Change Rows to Process
• By LSN:
– sys.fn_cdc_map_time_to_lsn ( '<relational_operator>', tracking_time )
• Returns the LSN value from the start_lsn column of the cdc.lsn_time_mapping system table for the tracking_time specified
• The relational_operator specifies the comparison to be applied against the tran_end_time of the cdc.lsn_time_mapping table when determining the LSN value to return
– largest less than, largest less than or equal, smallest greater than, or smallest greater than or equal
– sys.fn_cdc_get_min_lsn ( 'capture_instance_name' )
• Returns the start_lsn value for the capture instance from cdc.change_tables
• Sets the lower endpoint for change data for a given capture instance
– sys.fn_cdc_get_max_lsn ()
• Returns the maximum start_lsn column value from the cdc.lsn_time_mapping system table, setting the upper endpoint for all capture instances
41. Determining Change Rows to Process
• By LSN:
– Custom tracking table updated by application code to track the capture instance name and last processed LSN
– sys.fn_cdc_decrement_lsn ( lsn_value )
• Returns the previous LSN in the sequence based upon the specified LSN
• Often used to decrement the sys.fn_cdc_get_max_lsn () value to set the upper endpoint without overlapping LSNs across different data loads
– sys.fn_cdc_increment_lsn ( lsn_value )
• Returns the next LSN in the sequence based upon the specified LSN
• Often used to increment the last saved LSN from a custom tracking table to set a new lower endpoint without overlapping LSNs across different data loads
• By time:
– sys.fn_cdc_map_lsn_to_time ( lsn_value )
• Returns the tran_end_time column from cdc.lsn_time_mapping for the specified LSN, allowing an LSN endpoint to be expressed as a transaction commit time
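A sketch of establishing an LSN range for one incremental load; etl.CdcTracking is a hypothetical custom tracking table and dbo_Customer a hypothetical capture instance:

DECLARE @from_lsn binary(10), @to_lsn binary(10), @last_lsn binary(10);

SELECT @last_lsn = LastProcessedLsn
FROM   etl.CdcTracking                     -- hypothetical tracking table
WHERE  CaptureInstance = N'dbo_Customer';

-- Start just past the last processed LSN, or at the capture instance minimum on the first run.
SET @from_lsn = ISNULL(sys.fn_cdc_increment_lsn(@last_lsn),
                       sys.fn_cdc_get_min_lsn(N'dbo_Customer'));
SET @to_lsn   = sys.fn_cdc_get_max_lsn();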
42. Change Row Table-Valued Functions
• cdc.fn_cdc_get_all_changes_<capture_instance>
– Returns one row for each modification applied to the source table within the specified LSN range
– Multiple modifications of a source row within the LSN range will be represented individually in the result set
• cdc.fn_cdc_get_net_changes_<capture_instance>
– Returns a single net change row for each source row modified within the specified LSN range
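Continuing the LSN-range sketch above, the change rows are read from the capture-instance-specific functions (the dbo_Customer names are hypothetical):

SELECT __$start_lsn,
       __$operation,      -- 1 = delete, 2 = insert, 3 = before update, 4 = after update
       __$update_mask,
       CustomerID, FirstName, LastName, Email
FROM   cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');

-- One net row per modified source row (requires @supports_net_changes = 1):
SELECT *
FROM   cdc.fn_cdc_get_net_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');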
43. Determining Whether a Column
Changed
• sys.fn_cdc_get_column_ordinal (
'capture_instance', 'column_name‘ )
– Return s the ordinal position of a column name
within the specified capture instances update mask
• sys.fn_cdc_is_bit_set ( position, update_mask )
– Checks the specified ordinal position of the update
mask to determine if the change bit is set
• sys.fn_cdc_has_column_changed (
'capture_instance', 'column_name', update_mask )
– Identifies whether the specified column has been
updated in the associated change row
– Ideally only used for post processing
– Use sys.fn_cdc_get_column_ordinal once to set the
position, and sys.fn_cdc_is_bit_set to parse the
update_mask in queries against change tables for
better performance
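A sketch of that pattern (the capture instance and column names are illustrative):
-- resolve the ordinal once, outside the per-row work
DECLARE @name_pos int =
    sys.fn_cdc_get_column_ordinal(N'Production_Product', N'Name');

-- then test the update mask for each change row
SELECT __$start_lsn, __$operation,
       sys.fn_cdc_is_bit_set(@name_pos, __$update_mask) AS NameChanged
FROM   cdc.fn_cdc_get_all_changes_Production_Product(
           sys.fn_cdc_get_min_lsn(N'Production_Product'),
           sys.fn_cdc_get_max_lsn(), N'all');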
45. Introduction
• SQL Server 2012 introduced new
SQL Server Integration Services
(SSIS) components for CDC to simplify
extracting and consuming change
data
• Using the CDC components does
not require advanced knowledge
of SSIS to move change data from
a source system to a target for
further processing
46. CDC Control Task Component
• Used to control the life cycle of CDC packages in SSIS
– Synchronizes initial package load and the management of
LSN ranges processed by the CDC package executions
– Maintains the state across executions by persisting state
variable to a table
– Handles error scenarios and recovery from problems
during processing
• Supports two types of operations
– Synchronization of the initial data load and change processing
• Mark initial load start and initial load end for a full load from an
active source
• Resetting the CDC state variable to restart tracking
• Marking the CDC start from a snapshot LSN from a snapshot
database
– Management of change processing LSN ranges and tracking
what is processed successfully
• Getting a processing range before execution
• Marking a processing range after successfully processing changes
47. CDC Control Task Component
• Persisting state across executions
– Manual state persistence requires the package developer
to read and write the state variable for the package
– Automated state persistence reads the value of the state
variable from the table configured in the Control Task
editor to get the processing range and writes the value to
the table to mark the processed range
• Errors can be reported by the Control Task if:
– A get processing range is called after a previous get
processing range operation without the mark processed
range operation occurring
• Possibly a different package running concurrently with the same
state variable name
– Reading the persisted state variable value from the
persisted store fails
– The state variable value read from the persistent store is
not consistent
– Writing the state variable value to the persistent store fails
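For automated state persistence, the CDC Control Task reads and writes a small state table. A minimal sketch of one possible shape (the table name, schema, and column sizes are assumptions; the task's editor can also create the table for you):
-- Hypothetical state table for automated CDC state persistence:
-- one row per state variable name
CREATE TABLE dbo.cdc_states
(
    name  nvarchar(256) NOT NULL PRIMARY KEY,
    state nvarchar(256) NOT NULL
);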
48. CDC Source Component
• Reads the processing range of change data from a capture instance change table and
delivers the changes to other SSIS components
– The processing range is derived from the state package
variable that is set by a CDC Control task executed before
the data flow starts
• The CDC source requires the following configurations:
– ADO.NET connection manager to access the SQL Server
CDC database
– The name of a table enabled for CDC
– The name of the capture instance of the table to read the
changes from
– The change processing mode to use for reading the
changes
– The name of the CDC state package variable used to determine the CDC processing range
• The CDC source does not modify that variable; a subsequent CDC
Control task execution after the data flow must be used to update
the state values
49. CDC Processing Modes (1)
• All
– Returns a single row for each change applied to the source
table
– Similar to querying the
cdc.fn_cdc_get_all_changes_<capture_instance> table-
valued function with the ‘all’ filter option
• All with old values
– Similar to All, but with two rows per update, one for the
Before value and one for the After value
– Similar to querying the
cdc.fn_cdc_get_all_changes_<capture_instance> table-
valued function with the ‘all update old’ filter option
– The __$operation column distinguishes between Before (3)
and After (4)
50. CDC Processing Modes (2)
• Net
– Rolls up all changes for a key into a single row to simplify ETL processing
– Requires @supports_net_changes = 1 for the capture instance
– Similar to querying the cdc.fn_cdc_get_net_changes_<capture_instance>
table-valued function with the ‘all’ filter option
• Net with update mask
– Similar to Net but includes additional boolean columns
(__$<column_name>_Changed) specifying whether a column was changed
– Similar to querying the cdc.fn_cdc_get_net_changes_<capture_instance>
table-valued function with the ‘all with mask’ filter option
51. CDC Processing Modes (3)
• Net with merge
– Groups INSERT and UPDATE operations
together making it easier to use the
MERGE statement (__$operation = 5)
– Similar to querying the
cdc.fn_cdc_get_net_changes_<capture_i
nstance> table-valued function with the
‘all with merge’ filter option
– Only the DELETE and UPDATE split paths
will receive rows from the CDC Splitter in
this mode
52. CDC Splitter Components
• Splits a single input of change rows from the CDC
Source component into separate outputs for Insert,
Update and Delete operations based on the
__$operation column value from the change table
– 1 – Delete
– 2 – Insert (not available using Net with Merge mode)
– 3 – Before Update row (only when using All with Old Values
mode)
– 4 – After Update row
– 5 – Merged Update row (only when using Net with Merge
mode)
• The CDC Source for the Data Flow must have the
Net CDC processing mode configured to use the CDC
Splitter
• No advanced configuration is required for the CDC
Splitter
53. Package Design Considerations
Configure separate packages for handling Initial Load and Incremental Loads
– The initial load will mark the start LSN before transferring data
from the source, and the end LSN after, using the CDC tracking
variable for all tables associated with the data flow
– Facilitates easier re-initialization from the source system if necessary
• Error handling considerations need to be made when operation order must be
maintained as a part of the data flow
– CDC components can redirect error rows when appropriate
to prevent component failures but may result in out-of-
order processing of changes
• Consider using staging tables to fast load change data
and perform batch processing of changes in Transact-
SQL to prevent row-by-row processing of changes in
SSIS (see the sketch after this list)
– Change from ETL (SSIS processing of rows) to ELT (database engine processing) to benefit from set-based operations
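A minimal sketch of that ELT step (the target table dbo.DimProduct is a hypothetical name, the staging table names follow the later slides, and the column list is abbreviated):
-- Set-based processing of staged change rows instead of row-by-row SSIS logic
MERGE dbo.DimProduct AS tgt
USING (
    SELECT ProductID, Name FROM stage.stageProduct_Inserts
    UNION ALL
    SELECT ProductID, Name FROM stage.stageProduct_Updates
) AS src
    ON tgt.ProductID = src.ProductID
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductID, Name) VALUES (src.ProductID, src.Name);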
54. CDC Setup
• Step 1. Enable CDC for the database

USE AdventureWorks2012
GO
EXEC sp_changedbowner 'sa'
GO
EXEC sys.sp_cdc_enable_db
GO

– Enabling CDC for the database creates the CDC functions,
CDC stored procedures, and CDC tables
55. CDC Setup
• Step 2. Enable CDC for table(s)

USE AdventureWorks2012
GO
EXEC sys.sp_cdc_enable_table
    @source_schema = N'Production'
    ,@source_name = N'Product'
    ,@role_name = N'cdc_Admin'
    ,@capture_instance = N'Production_Product'
    ,@supports_net_changes = 1

– Creates the change table cdc.Production_Product_CT
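Optionally, the resulting configuration can be reviewed with the built-in help procedure (a quick sanity check, not a required step):
-- Returns the CDC configuration for the capture instance(s) on the table
EXEC sys.sp_cdc_help_change_data_capture
     @source_schema = N'Production',
     @source_name   = N'Product';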
56. Anatomy of a CDC Table
• __$start_lsn and __$seqval
– Link record to a transaction
– Specify order of operations
• __$operation
– 1 = delete
– 2 = insert
– 3 = update (record data before change)
– 4 = update (record data after change)
– 5 = merge
• __$update_mask
– Identify which columns changed
– Use with sys.fn_cdc_has_column_changed
57. CDC in Integration Services
• Diagram: source database and staging database table structures
– SSIS manages the current state of CDC processing in a state table
– One staging table per type of change, with the source columns AND
the change columns
– Exception: the Updates table also includes a ChangeType column when
both Type 1 and Type 2 processing is required
58. CDC in Integration Services Control Flow – Extraction
• Separate Initial Load and Incremental Load control flows
– A CDC Control Task marks the beginning and end of processing
– The three staging tables are truncated before loading
61. Data Flow – Transform and Load

SELECT [__$start_lsn]
      ,[__$operation]
      ,[__$update_mask]
      ,[ProductID]
      ,[Name]
      . . .
FROM [stage].[stageProduct_Inserts]
UNION ALL
SELECT [__$start_lsn]
      ,[__$operation]
      ,[__$update_mask]
      ,[ProductID]
      ,[Name]
      . . .
FROM [stage].[stageProduct_Updates]
WHERE ChangeType = 2