Designing an extensible, flexible schema that supports user customization is a common requirement, but it's easy to paint yourself into a corner.
Examples of extensible database requirements:
- A database that allows users to declare new fields on demand.
- An e-commerce catalog with many products, each with distinct attributes.
- A content management platform that supports extensions for custom data.
The solutions we use to meet these requirements are often overly complex, and their performance is terrible. How do we find the right balance between schema and schemaless database design?
I'll briefly cover the disadvantages of Entity-Attribute-Value (EAV), a problematic design that's an example of the antipattern called the Inner-Platform Effect: modeling an attribute-management system on top of the RDBMS architecture, which already provides attributes through columns, data types, and constraints.
Then we'll discuss the pros and cons of alternative data modeling patterns, with respect to developer productivity, data integrity, storage efficiency and query performance, and ease of extensibility (a sketch of the first pattern follows the list below):
- Class Table Inheritance
- Serialized BLOB
- Inverted Indexing
Finally we'll show tools like pt-online-schema-change and new features of MySQL 5.6 that take the pain out of schema modifications.
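As a preview of the first alternative, here is a minimal Class Table Inheritance sketch in SQL (table and column names are illustrative, not from the talk): common attributes live in a base table, and each product class gets its own table of class-specific attributes, so every attribute keeps a real column, data type, and constraints.

CREATE TABLE products (
    product_id  INT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    price       DECIMAL(9,2) NOT NULL
);

CREATE TABLE books (
    product_id  INT PRIMARY KEY REFERENCES products (product_id),
    isbn        CHAR(13) NOT NULL,
    page_count  INT
);

CREATE TABLE shirts (
    product_id  INT PRIMARY KEY REFERENCES products (product_id),
    size        VARCHAR(5) NOT NULL,
    color       VARCHAR(20)
);

-- Unlike EAV, adding a new product class means adding one new table with
-- properly typed columns, not stuffing generic name/value rows into one table.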
Table Partitioning in SQL Server: A Magic Solution for Better Performance? (Cathrine Wilhelmsen; presented during Pragmatic Works' Free Training on the T's, March 26, 2015)
Recording available on Pragmatic Works' website: http://pragmaticworks.com/Training/FreeTraining/ViewWebinar/WebinarID/1743
Presentation given at OSCON 2009 and PostgreSQL West 09. Describes SQL solutions to a selection of object-oriented problems:
- Extensibility
- Polymorphism
- Hierarchies
- Using ORM in MVC application architecture
These slides are excerpted from another presentation, "SQL Antipatterns Strike Back."
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction (Mark Ginnebaugh)
Patrick Sheehan of Microsoft covers platform architecture, data warehousing methodology, and multi-dimensional cube development.
You will learn:
* How to develop and deploy data cubes using SQL Server Analysis Services (SSAS)
* Optimal data warehouse methodology for use with SSAS
* Tips and tricks for designing and building cubes over a nonexistent or suboptimal source data warehouse (it happens)
* Cube processing types - How/why to use each
* Cube design practices + How to build and deploy cubes!
Tree-like data relationships are common, but working with trees in SQL usually requires awkward recursive queries. This talk describes alternative solutions in SQL, including:
- Adjacency List
- Path Enumeration
- Nested Sets
- Closure Table
Code examples will show how to use these designs in PHP and offer guidelines for choosing one design over another; a Closure Table sketch in SQL follows below.
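Although the talk's code examples are in PHP, the designs themselves are plain table structures. A minimal Closure Table sketch (illustrative names) looks like this:

CREATE TABLE comments (
    comment_id  INT PRIMARY KEY,
    body        TEXT
);

-- One row per ancestor/descendant pair, including each node paired with itself.
CREATE TABLE tree_paths (
    ancestor    INT NOT NULL REFERENCES comments (comment_id),
    descendant  INT NOT NULL REFERENCES comments (comment_id),
    PRIMARY KEY (ancestor, descendant)
);

-- Fetch the entire subtree under comment 4 with a plain join, no recursion:
SELECT c.*
FROM comments c
JOIN tree_paths t ON c.comment_id = t.descendant
WHERE t.ancestor = 4;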
Single-Row Functions in Oracle Database (Salman Memon)
After completing this lesson, you should be able to do the following (a one-query illustration follows):
- Describe the various types of functions available in SQL
- Use character, number, and date functions in SELECT statements
- Describe the use of conversion functions
http://phpexecutor.com
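For a taste of what the lesson covers, one SELECT can exercise all four categories (a sketch assuming the classic SCOTT.emp sample table):

SELECT UPPER(ename)                      AS name,            -- character function
       ROUND(sal * 1.08, 2)              AS raised_salary,   -- number function
       MONTHS_BETWEEN(SYSDATE, hiredate) AS months_employed, -- date function
       TO_CHAR(hiredate, 'DD-MON-YYYY')  AS hired_on         -- conversion function
FROM   emp;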
Learn Entity Framework in a Day with Code First, Model First and Database First (Jibran Rasheed Khan)
•Introduction to Entity Framework (EF)
•Architecture
•What’s new!
•Different approaches to work with (Code First, Database First and Model First)
•Choosing the right work model
•A pictorial tour of each model
•Features & Advantages
•Question & Answer
For any help or questions, feel free to contact me. Thank you!
An introduction to Microsoft Power BI, emphasizing the usability of Power Query and how it's useful for the Excel user population. A session delivered at Orion India Systems Pvt. Ltd.
Qlik Sense for Beginners - www.techstuffy.com - QlikView Next Generation (Practical QlikView)
Learn Qlik Sense Desktop the Easy and Fast way with Qlik Sense for Beginners (QS4B) - Sample
If you'd like to analyse your data so you can make informed decisions that help you reach your goals, read on.
Qlik Sense can be used for all sorts of goals. For example, in business that goal might be to make more profit, whereas for the personal user that goal might be to find where they are spending all their money.
Qlik Sense for Beginners (QS4B) will:
· Teach you how to create Qlik Sense apps from scratch, in easy-to-understand steps with plenty of screenshots.
· Explain how to get data into a Qlik Sense app from a variety of sources such as Excel, text files, Access and SQL databases.
· Show you how to create various charts and tables in Qlik Sense, for example bars, gauges, lines, combos, treemaps and scatter plots.
· Show you how to manage the Qlik Sense data model using joins, grouping, inline tables, link tables and dimensions.
Once you have mastered the basics some of the other topics we will cover are:
· Development Tips - Migrating from QlikView to Qlik Sense.
· Development Techniques - subroutines and external scripts, crosstables.
· Advanced Functions such as Class, Intervalmatch, Dual and more.
· Set analysis
· QVDs and incremental loads
· Bookmarks
· Storytelling feature
· Qlik Sense Extensions
And much more...
Qlik Sense Desktop is a FREE product.
Examples use Qlik Sense Desktop version 0.96.
Qlik Sense Desktop has recently been released.
Buy with confidence on the Kindle knowing that you will get UPDATES to the book automatically.
Other books:
Practical QlikView
Practical QlikView 2 - Beyond Basic QlikView
Practical Sql: Microsoft Sql Server T-SQL for Beginners
https://www.youtube.com/user/practicalqlik
https://twitter.com/practicalqlik
http://www.techstuffy.com
Alteryx is a platform that allows companies to answer business questions quickly and efficiently. The platform can be used as a major building block in a digital transformation or automation initiative. Alteryx allows teams to build processes in a more efficient, repeatable, less error-prone, and less risky way.
Building Lakehouses on Delta Lake with SQL Analytics Primer (Databricks)
You’ve heard the marketing buzz, maybe you have been to a workshop and worked with some Spark, Delta, SQL, Python, or R, but you still need some help putting all the pieces together? Join us as we review some common techniques to build a lakehouse using Delta Lake, use SQL Analytics to perform exploratory analysis, and build connectivity for BI applications.
Let's dig into several common types of queries that developers struggle with: I'll show SQL solutions and then analyze them for optimal efficiency. I'll cover Exclusion Join, Random Selection, Greatest-Per-Group, Dynamic Pivot, and Relational Division; a sample of the greatest-per-group pattern appears below.
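As a sample of the greatest-per-group pattern (which doubles as an exclusion join), the query below uses an illustrative bugs table: it keeps each product's most recently reported bug by excluding any row for which a later-reported row exists for the same product.

SELECT b1.*
FROM bugs b1
LEFT JOIN bugs b2
       ON b1.product_id = b2.product_id
      AND b2.reported_on > b1.reported_on
WHERE b2.bug_id IS NULL;   -- no later row was found, so b1 is the latest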
Wednesday 14-Dec-2016, Future of Data - Princeton Meetup
@TigerLabs in Princeton, NJ. A talk on Apache NiFi for processing drone data. Apache NiFi reads the images from a directory or MQTT, extracts metadata including geolocation, runs TensorFlow for image recognition, and stores all the metadata in Phoenix as well as raw JSON in HDFS. Images are also stored in HDFS.
This presentation deals with the fundamentals of SQL, installation, and database concepts. Presented by our team at Alphalogic Inc: https://www.alphalogicinc.com/
Implementing Change Systems in SQL Server 2016 (Douglas McClurg)
Features like Change Tracking, Change Data Capture, Temporal Tables, and other similar delta systems are complex and may carry a stigma or misapprehension in your organization around performance, security, or cost. Even if you do not implement these features directly, most information systems rely on tracking changes, especially from legacy line-of-business applications. I'm here to show you robust techniques for implementing delta systems in SQL Server to increase the trustworthiness of your data warehouse. I will also steer you away from common pitfalls.
Interactive real time dashboards on data streams using Kafka, Druid, and Superset (DataWorks Summit)
When interacting with analytics dashboards, two key requirements for a smooth user experience are quick response time and data freshness. To meet the requirements of creating fast interactive BI dashboards over streaming data, organizations often struggle with selecting a proper serving layer.
Cluster computing frameworks such as Hadoop or Spark work well for storing large volumes of data, although they are not optimized for making it available for queries in real time. Long query latencies also make these systems suboptimal choices for powering interactive dashboards and BI use cases.
This talk presents an open source real time data analytics stack using Apache Kafka, Druid, and Superset. The stack combines the low-latency streaming and processing capabilities of Kafka with Druid, which enables immediate exploration and provides low-latency queries over the ingested data streams. Superset provides the visualization and dashboarding that integrates nicely with Druid. In this talk we will discuss why this architecture is well suited to interactive applications over streaming data, present an end-to-end demo of the complete stack, discuss its key features, and review performance characteristics from real-world use cases.
Speaker
Nishant Bangarwa, Software Engineer, Hortonworks
Subject: Database Management System (DBMS)
Topics presented in this PPT (with a small atomicity example after the list):
Transaction Concept
Transaction State
Implementation of Atomicity and Durability
Concurrent Executions
Recoverability
Serializability, schedules, and schedule types
ACID Properties
Transaction Operations
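As a one-glance illustration of atomicity and basic transaction operations (a sketch with a hypothetical accounts table), the two updates below take effect together or not at all:

START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;  -- both updates persist together; a ROLLBACK (or a crash before
         -- COMMIT) would leave neither in effect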
Containerized Stream Engine to Build Modern Delta Lake (Databricks)
Everything is changing by the day: your business, your analytics platform, and your data. Deriving real-time insights from this humongous volume of data is key to survival, and a robust solution can let you operate at the speed of change.
Performant Streaming in Production: Preventing Common Pitfalls when Productio... (Databricks)
Running a stream in a development environment is relatively easy. However, some topics can cause serious issues in production when they are not addressed properly.
Oracle GoldenGate for Big Data - LendingClub Implementation (Vengata Guruswamy)
This slide deck covers the LendingClub use case for implementing real-time analytics using Oracle GoldenGate for Big Data. It covers architecture, implementation, and troubleshooting steps.
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick? (Jim Czuprynski)
Autonomous Transaction Processing (ATP) - the second in the family of Oracle’s Autonomous Databases – offers Oracle DBAs the ability to apply a force multiplier for their OLTP database application workloads. However, it’s important to understand both the benefits and limitations of ATP before migrating any workloads to that environment. I'll offer a quick but deep dive into how best to take advantage of ATP - including how to load data quickly into the underlying database – and some ideas on how ATP will impact the role of Oracle DBA in the immediate future. (Hint: Think automatic transmission instead of stick-shift.)
When your query execution is slow, a couple of questions arise. Where should you look for resource utilization? What tools do you have to analyze CPU, hard drive, and RAM bottlenecks? Could you do something to reduce query execution time? MariaDB's Patrick LeBlanc and Roman Nozdrin touch on both ColumnStore's query execution introspection tools and the operating system capabilities that everyone should know about. They go on to discuss a number of real-life use cases too. Some called for configuration changes, whilst others forced them to make serious changes in the code.
Building a globalized, customer facing e-commerce product, powered by micro-s... (Nikos Dimitrakopoulos)
Insights on what 10 years of trying to go global has taught us.
Presented during the 21st Athens Ruby Meetup by team e-Travel (mytrip.com, airtickets24.com, trip.ru, trip.ae, pamediakopes.gr, et al.).
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects demand to grow and supply to evolve, driven by institutional investment rotating out of offices and into work from home (“WFH”), and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
With versioning:
We track all changes to the data, who made the change and when ― not just the last change made ― without using cumbersome “history” tables. Examining the history is simply a matter of writing the right query.
We may “look back in time” to see how the data looked on a particular date (transaction time) ― even if some of the entities have subsequently been deleted. This is particularly useful in auditing and report generation.
We may “look back in time” to see how the data should have looked on a particular date (valid time).
It becomes possible to implement an “undo” feature in the database. So a change to the data, no matter how extensive, may be reversed or rolled all the way back to the way it was at a particular time.
Data may be “pre-inserted” or “pre-updated” to take effect at a particular date and time in the future. The new or modified data will remain invisible to queries until the effective date and time is reached.
We don’t have to track history through extraordinary means (such as separate “history” tables). Every versioned entity will automatically maintain a complete history, from creation to deletion with no additional effort on the part of the applications.
Not only are changes tracked, but also when each change was made or became effective. Thus we can also track the length of time each version was in effect.
All referential integrity constraints to and from versioned tables work the same way as with non-versioned tables.
Queries against versioned tables (which, we realize, will be a bit more complicated) follow a consistent pattern and, once learned, are the same no matter the table or the database in which they are used.
Versioning takes place at the field level. A “versioned table” is just a table that has had some of its fields versioned, not necessarily all. The underlying table can be altered (fields added or dropped) without impacting the versioned fields — unless, of course, a versioned field is dropped. Versioned fields may be unversioned and unversioned fields may be versioned with little impact.
Applications like CVS and SVN ― which allow us to look back in time to follow the changes made to source code ― may easily be developed to give us the same ability to track changes to our data.
All this in a transactional database without maintaining separate snapshot or audit tables, and with only minor impact on performance.
Versioning State Of The Art
I only recently became aware of the following synopsis of database versioning found in Wikipedia. The entry is titled Slowly Changing Dimension (SCD) which gives a brief description of the problem of tracking changes in information. This, of course, applies mainly to data warehousing applications, but still provides excellent background as to the nature of the solutions in use today. It lists six methods that are, I assume, currently used to handle modifications to data, from doing nothing (Type 1) to a hodge-podge design that uses a date to designate when the data became effective, a date to designate when it is no longer effective, an indicator to signify the row that is now current and a sequence value (Type 6/Hybrid). There are several versions of Type 6, one that seems as though it would almost be workable. There are several distinct disadvantages to even the best of these methods.
There are two dates used: start and end. The start date indicates when the data in the row becomes effective. The end date indicates when the data in the row is no longer effective. The problem with this is that we now have to keep adjacent versions in sync with each other. This creates the problem I refer to as a “row spanning dependency”: when data in one row of a table is dependent on data in another row of the same table. When we need to write a new version that is to take effect immediately, we first have to update the existing current version to store “now” in the end date field and write the new version with “now” in the start date field. So all changes to the data involve an Update/Insert pair of operations (sometimes referred to as an “Upsert”), as sketched below.
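In SQL terms, the Update/Insert pair looks something like this (an illustrative table, with '9999-12-31' standing in for the value understood to mean “current”):

-- Step 1: close out the current version.
UPDATE employee_versions
   SET end_date = CURRENT_TIMESTAMP
 WHERE emp_id = 1001
   AND end_date = '9999-12-31';

-- Step 2: insert the new version; both steps must succeed or fail together.
INSERT INTO employee_versions (emp_id, last_name, start_date, end_date)
VALUES (1001, 'Smith', CURRENT_TIMESTAMP, '9999-12-31');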
There is also a problem if we want to delete the current version, in effect “undo” the last change. The current version may be deleted but the last version must now be located to return the end date to the value understood to mean “current”.
Referential integrity is still not possible as the version timestamp or a sequence value must be part of the primary key or unique index. Any foreign key in another table referring to the versioned table can only refer to one version.
Actually, one of the methods solved the referential integrity problem in a clever way. The primary key consists of the natural key and a sequence number starting at 0. Any foreign key reference would refer to this row. When the second version is created, the first row is copied back into the table, becoming sequence number 1, and the first row is actually updated with the new data. So the current version is always sequence number 0 and previous versions take on higher sequence values. This also means that sequential versions must be kept synchronized and all changes involve a two-step, Insert/Update, operation. There is also the complicated (and unmentioned in the article) problem of maintaining sequential values for each entity in the table. All current versions of all entities have value 0, so this column can't be an Identity (or autogenerated) field.
Basic Requirements Of Maintaining Temporally Sensitive Data
To make working with temporally sensitive data relatively simple and less error prone, we must demand certain minimum requirements.
We must have a date and time value to signify when the version takes effect. But we should not have to specify when the version is no longer effective. Let's just agree that a version remains in effect until superseded by the next version or the entity is deleted. This eliminates row-spanning dependencies (the necessity of keeping different rows of the table in sync with each other).
We should not have a separate field used to indicate the current version. Whether it be a 'Y/N' indication or a set numeric value, this is just another area where errors can creep in. (What happens when we have more than one version with a 'Y' in this field? Don't even try to argue that it would never happen!)
When a new version is written to the table, no changes at all should be needed to any previous version including the current version. A logical Update of entity data is converted into a physical Insert, but this should be the only operation needed and it cannot be allowed to introduce systemic errors into the data.
We should be able to create foreign key references to temporally sensitive entities no differently than to “normal” entities.
The versioning method should be usable for any and all temporally sensitive data, whatever the entity and wherever in the enterprise it is needed.
Easier said than done? Possibly, but we have to start somewhere and I contend that if we don't meet those basic requirements, we don't have a workable solution.
The first two rules of versioning already mentioned (no overlapping and no gaps), are brought about by requirements 1 and 2 above. With only one date, the effective date, there is no way for the versions to get out of sync. The effective date of one version is the “end” date of the previous version. It is not possible, therefore, for overlapping of versions or gaps between versions. By using only the effective dates to determine which is the current version (or the version in effect at any particular moment in time) we eliminate any confusion or errors possible with other indicators. And because the effective date becomes part of the primary key of the versioned data, it is not possible to have two versions with the same effective date.
By using only the effective date to determine the one version of interest, we satisfy requirement 3. When we modify the versioned entity ― which is performed by inserting a new version ― no changes need be made to any data already existing in the entity table.
Requirement 4 was the tough one. But the solution turned out to be simplicity itself. It took me an entire month to come up with a workable solution, but it took another three months before I saw just how simple it turned out to be. I had arrived at an answer by such a convoluted path that I kept seeing it as complicated ― workable but complicated ― until one day I drew it all out on a whiteboard to explain it to a coworker. By ignoring the path to the solution and just looking at the solution itself, the simplicity just leaps out at you. This simplicity also extends to its universal applicability, so requirement 5 is met.
The solution turns out to be nothing more than to consider versioning as just another form of relational normalization ― a Version Normal Form (vnf), if you will.
I doubt any experienced data modeler actually steps through the normalization process as we are taught in Data Management 101 class in college. We very quickly reach the point where we think in 3nf (or BCNF) and our design hits the paper already in the required normal form. But let's consider the process for a moment.
We have identified all the attributes we will maintain for an entity. We have identified all the candidate keys and selected one to be the Primary Key (or elected to use a surrogate). Now we perform the first step in the normalization process (1nf), and as a result we have two tables ― the primary entity table (or independent table) and, say, one additional table (a dependent table) which contains the repeating attributes.
Have we created a problem with referential integrity?
Of course not. In fact, data integrity is maintained within the entity by using a foreign key in the dependent tables back to the independent table. Any references to the entity from other entities will be resolved by referencing the independent table. It has the one entity primary key, and the normalization process will not affect that. The dependent tables will have primary keys of their own, usually consisting of the entity key field and one or more additional identifying values.
The same thing applies when we continue the normalization process. In the end, we have one logical entity that may be spread across several physical tables, but one table is the “entity” table (Employees, Customers, Parts, Classes, Items, etc.), and it is this table that is the target of any foreign key that refers to the entity. There is no conflict between normalization and referential integrity.
Thus, after placing the entity in temporal normal form, we get all the benefits of versioning and referential integrity is unaffected. We also get the benefit of having versioning in a pattern (normalization) that database programmers are intimately familiar with. So the learning curve of working with versioned data is pretty close to zero.
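To make the pattern concrete before the example that follows, here is a minimal sketch of the Employees/EmployeeVersions pair (the Effective column appears in the example; the other column names are assumptions):

CREATE TABLE Employees (
    EmpId  INT PRIMARY KEY   -- the stable entity key; all foreign keys point here
);

CREATE TABLE EmployeeVersions (
    EmpId      INT NOT NULL REFERENCES Employees (EmpId),
    Effective  TIMESTAMP NOT NULL,  -- the only date: when this version takes effect
    LastName   VARCHAR(50) NOT NULL,
    PRIMARY KEY (EmpId, Effective)  -- no two versions can share an effective date
);

-- No end date and no "current" flag: a version stays in effect until the next
-- version's Effective value supersedes it, so existing rows are never updated.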
Now when Sally notifies us of her name change, we just take the current versioned row from EmployeeVersions, change “Jones” to “Smith” and Effective to the date of the wedding, and insert the new record. When we query the database, we see “Smith.” But if we query asking “show me the data for employee 1001 as of Dec. 31, 2005” (or any date before then), we will get “Jones.” Because this pattern is bi-temporal, we can also ask “show me the data for employee 1001 as it existed in the database on Jan. 10, 2006,” and we will get “Jones” because, on that date, the database had not yet been updated with the correct name.
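Against that sketch, the “as of” query might be written like this (valid time only; the transaction-time half of a bi-temporal design would add and filter on a recorded-at timestamp in the same way):

-- Employee 1001's name as of Dec. 31, 2005: the version with the latest
-- Effective date that is on or before the date of interest.
SELECT v.LastName
FROM EmployeeVersions v
WHERE v.EmpId = 1001
  AND v.Effective = (SELECT MAX(v2.Effective)
                     FROM EmployeeVersions v2
                     WHERE v2.EmpId = 1001
                       AND v2.Effective <= DATE '2005-12-31');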