We're going to start out looking at some diagrams of systems as they usually are before a data warehouse initiative is put into place, and we'll see what the target looks like, as well as some typical outcomes. We'll do a quick review of normalized relational database theory. We'll look at dimensional models and their advantages for data warehousing. I've got an example which shows how complex reporting can get in even a simple traditional normalized model, and then a possible dimensional model for some of that reporting. Then I'll cover a couple of best practices and worst practices, and we'll hit the Q&A.
Typically, business processes have evolved and there is no coherent data strategy in place. Individual business units are responsible for their own systems; when people need things tied together, they get it built. I've categorized the data reporting needs in 4 ways:
- Reporting: no interpretation; effectively sending data/information to people to interpret
- Analysis: forward-looking or ad hoc, usually designed to answer questions and what-if scenarios
- Tactics: operational requirements, business-as-usual, batch processes, dashboards
- Strategy: dashboards, status, monitoring
Pros: you only build something if it's necessary.
Cons: changes to systems are hard to make because of the many downstream dependencies. Data is fragmented; systems differ, formats differ, conventions differ, retention varies, and processes can interfere with production.
In all these cases, I'm assuming regular business processes are creating and updating this data in their own applications. We're not trying to bring everything into one application, but simply to tie the systems together reliably for read-only usage.
Solution! Make a bottleneck. (Define "data warehouse"; define "ETL".) Everything is in one place: although it might not really be one place in the physical world, there is one logical clearinghouse. You get standards, data cleansing, consistency, unified security, data management, and consolidated storage management. Pros: what, exactly? Cons: it's a bottleneck.
It would be nice to think you always get to the promised land, but I'm going to start off by showing the scenarios you still end up with a lot of the time.
In this scenario, you've got important data which isn't in the warehouse (yet), and you end up building a process which goes around the warehouse for that data. And it would be nice to think that eventually that data gets in the warehouse and you redefine your process and you're all homogeneous again. But the fact is, these things stick around. For. a. long. time.
A lot of times, you're still in love with reports, because that's what users always ask for. For some reason users like to work; this is not a problem that programmers have. A programmer would rather spend 40 hours writing a program and never do any more work than spend 1 hour each year consolidating 25 reports into a TPS report for a boss. Users, on the other hand, will insist on asking for reports and spend 10 hours a week using them (to do what? well, that's the question that's never asked) instead of having a program written in 20 hours which does it for them with a click (or no click). The fact is that the question not asked is "what is the user's goal?" With users and programmers focused on the requests and not the goals, the system is never going to capture the actual user and business goals. Why would you expect a system to help the enterprise achieve its goals if questions are never asked in terms of goals, and only in terms of requirements or current usage patterns?
What stops this from happening? For one, users asking for "things" instead of everyone understanding their goals. Another is having a process which isn't responsive enough to keep up with users' changing goals, so users end up using the "things" they have in new and different ways instead of getting to their new goals most efficiently. Also, the old "things" become the new goals, instead of anyone understanding why the old "things" were necessary, seeing if they are still necessary, or seeing if the needs for them have changed. So in data warehousing, we look for ways to make things easier by accommodating user needs without having to anticipate every user need. Two big areas where we want to make progress are making the data more accessible for self-service reporting by data analysts who aren't necessarily programmers, and performance: you want to be able to get very powerful results with modest computing power, and without sacrificing the application which is generating and managing all your operational data. Dimensional modeling has certain advantages related to exactly this.
When we talk about dimensional modeling, this is modeling the data in a completely different way than traditional Entity-Relationship Modeling (E-R, or ERM).
Traditional normal forms in relational databases can be summed up as "the key, the whole key, and nothing but the key". Data is organized in relations based on sharing a key; "relation" means that all the data is related to the key. Keys between tables can be used to join the data together. Relational databases are now dominant, but before the 80s they competed with a number of database systems such as network databases, hierarchical databases, and various special file-based databases. A normalized database, while optimized for transactions and for keeping a low intellectual distance between the system and the real world, requires a great deal of effort to extract data in all but the most trivial models. I won't dwell on the entire relational theory, but just point out one problem when reporting off normalized data: the multiplicity of results in many-to-many relationships can be hard for ordinary report-writing users to get their heads around.
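To make that concrete, here's a minimal sketch in SQL with hypothetical table names (not from the actual model): a measure stored at the meeting level gets silently multiplied when joined through a many-to-many relationship.

    -- Hypothetical tables: venue_cost lives at the meeting level;
    -- joining through the many-to-many attendance table repeats it
    -- once per attendee.
    SELECT SUM(m.venue_cost) AS total_venue_cost   -- overstated!
    FROM meeting m
    JOIN attendance a ON a.meeting_id = m.meeting_id;
    -- A report writer who doesn't notice the fan-out reports a venue
    -- cost inflated by the average attendee count per meeting.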
In 1964, Christopher Alexander made a case for the decomposition of architectural designs, based on a need to measure goodness of fit and to solve design problems at different scales in an iterative and composable way: context and solution are two sides of the same boundary, and the quality of a design's fitness is measured by the absence of misfit. This became the basis for his Pattern Language, which was borrowed for the Design Patterns movement in software. Today we recognize not only the power of design patterns in the software industry, but also the sound software engineering principles of cohesion and coupling (Constantine and Yourdon). The impact of this on data warehousing, especially warehouses which rely on dimensional modeling, is that dimensional modeling is born out of pragmatism in the face of data complexity. In all cases, dimensional modeling is a move away from a normalized database to a database optimized for data analysis. I want to make clear that the techniques I'm going to discuss are data modeling for a specific usage pattern: high-performance, read-only data analysis. This is not a transactional model, and I do not advocate designing any application against a dimensional model unless it's an analytical platform.
In transforming from a normal form to a dimensional model, the facts which comprise a traditional normal form are allocated to several simple star schemas. Star schemas are particularly simple to query and to optimize for. Typically, the star model means that each fact row has a foreign key to every dimension, so there is no possibility of multiplication during joins. It is usually possible to represent the star in a flattened form which is fully equivalent, without any loss of data or need for separate interpretation. A data warehouse system will have multiple fact tables, determined by the grain (usually the time) and the subject matter. Different stars may share some dimensions; when these dimensions are system-wide or enterprise-wide, they are called conformed dimensions. Date or time are very good examples. Customer might also be an example if you have relatively homogeneous customers; account might be another.
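Here's a minimal sketch of what such a star might look like for the conference example coming up, in SQL Server-style DDL. All table and column names here are my own invention for illustration, not a prescription.

    -- One fact table at the grain of one attendee at one meeting,
    -- with a foreign key to every dimension (the "star"):
    CREATE TABLE dim_date     (date_key int PRIMARY KEY, full_date date, year int, month int);
    CREATE TABLE dim_attendee (attendee_key int PRIMARY KEY, name varchar(100), office varchar(50));
    CREATE TABLE dim_venue    (venue_key int PRIMARY KEY, venue_name varchar(100), city varchar(50));

    CREATE TABLE fact_attendance (
        date_key     int NOT NULL REFERENCES dim_date(date_key),
        attendee_key int NOT NULL REFERENCES dim_attendee(attendee_key),
        venue_key    int NOT NULL REFERENCES dim_venue(venue_key),
        meal_cost    money NOT NULL   -- additive fact: safe to SUM
    );

Because every join from the fact is many-to-one to a dimension, no join can multiply rows, which is exactly the failure mode we just saw in the normalized form.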
I've modeled a relatively simple system in a normalized form. This system is a conference/meeting management system. Several times a year a conference is held, so you need to track classes, venues, sections, attendees, attendance, hotels, invoices, etc. For several years we had a system like this to manage training meetings for office managers, and in looking for an example for this presentation, I threw this together from memory. It's only a partial model; I haven't included hotel accommodations, room sharing, or details like employee-office relationships.
What you'll find is that the terms are typically used very vaguely, and it is often difficult to get a lot out of the terminology when applied to a specific system without looking at the actual architecture. The key things about a data warehouse that are always common:
- it's a copy of the data
- it never changes
Integration is always part of the goal; subject-orientation depends on the compartmentalization mindset. The top-down and bottom-up associations are largely false.
There are a number of tools out there, but nothing is going to do it off the shelf. There are huge problems dealing with data which can sometimes be relatively freeform, like Excel.
Kimball is largely responsible for popularizing this approach, which relies on remodelling the data into a star model (or sometimes a snowflake model) to simplify reporting and eliminate common user mistakes. Other approaches keep the data in its original normalized form and use parallelism (Teradata), or put it in a more explicit object/dynamic form and use parallelism (Hadoop, MapReduce).
Something especially true about data warehouses is that they don't submit well to top-down development. The users won't know what they need until they can use the data warehouse to learn about the data, so it's a chicken-and-egg situation, and it is perfect for incremental delivery. As long as users have Excel, they will generate contradictory reporting. Total control is impossible: users WILL work around controls to produce information. The dimensional model works to our advantage here.
When people say bottom-up, they expect that you will handle each department's needs individually and then somehow tie it all together at the end. In reality, no one works that way. People in an enterprise know the common data, and they know that it needs to be shared. Modelling the data without knowing how it is going to be used will miss key factors that are best handled in the modelling or in the ETL, for instance, pre-calculating logic. When people say top-down, there is an expectation that you take all the reports, deduce all the needs from them, and that no new insight is going to be gained from the departmental level of work. In practice, you always have to burn the candle at every end, feeding back and forth, in order to be successful.
You may hear these terms. Like I said, they don't really mean much until you've seen a system's architecture, whether it uses these concepts, and what they represent there.
In this relatively simple example, a calendar entry has a foreign key to a meeting entity. A meeting entity contains references to a meal and a venue. Attendees are linked to meetings (a typical many-to-many relationship). Some data queries can be awkward for reporting purposes because of the larger network of tables, and especially the many-to-many or cascading many-to-one relationships, which can cause multiplicity of results.
While this example is rather contrived, it shows that although a given meeting an attendee attended had only one meal and only one venue it could be held at, you still have to walk through several entity relationships to traverse the entity-relationship model and answer even simple questions.
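For instance, with hypothetical table and column names standing in for the model just described, a simple business question like "how many attendees from each office ended up at each venue" means walking the whole chain:

    -- Six tables for one business question (names are illustrative):
    SELECT v.venue_name, o.office_name,
           COUNT(DISTINCT att.attendee_id) AS attendees
    FROM calendar_entry c
    JOIN meeting m     ON m.meeting_id    = c.meeting_id
    JOIN venue v       ON v.venue_id      = m.venue_id
    JOIN attendance a  ON a.meeting_id    = m.meeting_id
    JOIN attendee att  ON att.attendee_id = a.attendee_id
    JOIN office o      ON o.office_id     = att.office_id
    GROUP BY v.venue_name, o.office_name;
    -- In a star, the same question is one fact table joined to two
    -- dimensions, with no chance of fan-out.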
Data Warehousing for Gnocode
Data Warehouses: A Whistle-Stop Tour Cade Roux [email_address]
The intuitive resolution of contemporary design problems simply lies beyond the reach of a single individual's integrative grasp…
…there are bounds to man's cognitive and creative capacity…
…the very frequent failure of individual designers to produce well organized forms suggests strongly that there are limits to the individual designer's capacity.
Christopher Alexander – Notes on the Synthesis of Form, Introduction: The Need for Rationality
Best Practices
- There are tons of top ten lists of tips and keys to success in articles and books. I will give you my top two.
- Incremental Delivery – show successes early, win people over, prove concepts and approach.
- Proactively Manage Quality – test thoroughly and automate. Testing is usually considered important, but people don't approach it systematically. Round-trip the data, know the dimensional behavior with benchmarking, automate exception reporting, and make sure false positives don't make your warning system too noisy. Get confidence by showing the tests are working. Add tests as defects are found, documenting expectations.
Worst Practices
- Again, there are plenty of online tips – every one of the best practices has a corresponding anti-practice, but these are my top two.
- Avoid understanding the data, the business motivations, or the details, because there are far too many feeds of data coming into the warehouse. Avoid looking ahead to how the data will be used, because you shouldn't change the ETL process to accommodate expectations or provide services.
- Handle every model the same way, so the data warehouse is consistent, even if some models are awkward and difficult for users to use and difficult to change over time as the business evolves.
Glossary
- Data warehouse
- Bill Inmon: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" – typically associated with top-down design
- Ralph Kimball: "A copy of transaction data specifically structured for query and analysis." – typically associated with bottom-up design
Glossary (2)
- ETL
- Extract...Transform...Load
- Shorthand for any number of ways of getting the data into the warehouse.
- Sometimes it's really transform...extract...load; sometimes it's extract...load...transform...load.
- The key things are to have a strategy and principles for when data is changed, cleaned, or conformed, and for when exceptions are reported.
Glossary (3)
- Facts
- Dimensions
- Conformed Dimensions
- Slowly Changing Dimensions
- Granularity
- Dimensionally modelled data is mostly associated with Kimball.
- Huge advantages in analyzing large amounts of data.
- Modelling is problematic, but not nearly as hard as normalizing a non-normalized database.
Glossary (4)
- Business Intelligence
- Single version of the truth
- These are relatively meaningless terms, but they point to the problem trying to be solved:
- Get good decision support information to the business – every business is different, and there isn't a silver bullet
- Eliminate, as much as possible, the ability for users to generate inconsistent information from the same data
Glossary (5)
- Data Mart
- Silos
- Silos are mini-data warehouses that are specialized to a subject area – typically from a bottom-up approach.
- Data Marts are the components of a data warehouse in a top-down design, the building blocks of a data warehouse in a bottom-up design.
- Typically, you cannot really do JUST top-down or JUST bottom-up. The reality is always hybrid, because you have to look forward to enterprise-level integration.
Glossary (6)
- Operational Data Store
- Enterprise data warehouse
- An ODS is a place where data is combined before load. Sometimes services are performed off this. Typically, the data model has not changed dramatically from the original operational source systems, but it is (another) copy of the data.
- EDW is an Inmon term which means that the data warehouse covers the enterprise in an integrated fashion. It is mainly used to distinguish from a data warehouse which does not cover the entire enterprise.
Glossary (7)
- OLTP
- OLAP
- OnLine Transaction Processing: typical online systems; may maintain coherent temporal history, may overwrite themselves when data is changed; usually modelled in third normal form or better, with Entity-Relationship modeling.
- OnLine Analytical Processing: fast analysis of multi-dimensional data – generally refers to tools running against dimensional data warehouses, because the dimensions are explicit; often precalculated "cubes" are created.
Dimensional Modelling
- Facts:
  - Usually scalar quantities
  - Typically can be: SUM, AVG, etc.
- Modelling:
  - View all data as either facts or dimensions
  - Determine the nature of the changes in the dimensions
  - Then divide up dimensions for convenience – based on usage/data patterns
  - Combination of art and science
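As a sketch of the usage pattern this enables (reusing the hypothetical star from earlier): facts are additive measures you aggregate, and dimensions are what you slice them by.

    -- Slice the additive fact by any combination of dimensions:
    SELECT d.year, v.city,
           SUM(f.meal_cost) AS total_meal_cost,
           AVG(f.meal_cost) AS avg_meal_cost
    FROM fact_attendance f
    JOIN dim_date d  ON d.date_key  = f.date_key
    JOIN dim_venue v ON v.venue_key = f.venue_key
    GROUP BY d.year, v.city;
    -- Swapping dimensions in and out changes the question without
    -- changing the shape of the query.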
Topics
- Conformed dimensions
- NULLs
- Junk Dimensions
- Too Few Dimensions
- Too Many Dimensions
- Parallel ETL
Conformed Dimensions
- Reduces the learning curve
- Allows models to be combined
- Account number padding, e.g.
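For example, here's a sketch of conforming an account-number business key in the ETL, with hypothetical staging names: pad and trim so every source system lands on the same row in the shared account dimension.

    -- Source A stores '123', source B stores '0000000123 '.
    -- Conform both to a fixed-width, zero-padded key during ETL so
    -- they hit the same dim_account row (names are hypothetical):
    SELECT RIGHT(REPLICATE('0', 10) + LTRIM(RTRIM(account_no)), 10)
               AS account_key_conformed
    FROM stage_source_a;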
Some things to keep in mind
- Terminology is confusing and inconsistent – only your architecture matters. Keep your eyes open to approaches, but terminology is not as important as choosing conventions that match the environment you want.
- The overriding concern is practicality – get the information into users' hands; this will drive the need for more information and guide you into managing the data.
- Decoupling produces a lot of redundancy: Source->Flat File->EDI gateway->Stage->DW. Understand where the redundancy can be removed, and where decoupling is the goal.
NULLs
- Usually represent unknowns
- Big problem for users in the face of model evolution
- Say you have a derived stat/measure like customer.allfees = customer.latefees + customer.nsffees
- The model starts out like:
  - latefees money NOT NULL
  - nsffees money NOT NULL
- Now we branch out into mailbox rental:
  - customer.rentalfees NULL (or NOT NULL?)
  - customer.allfees = customer.latefees + customer.nsffees + customer.rentalfees
- Handle it with a view (sketched below), or populate old data with 0
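A minimal sketch of the view approach, using the slide's hypothetical columns: coalesce the late-arriving column to zero so the derived measure stays stable for historical rows.

    -- rentalfees was added later and is NULL for historical rows;
    -- without ISNULL, allfees would go NULL for every old customer.
    CREATE VIEW v_customer_fees AS
    SELECT customer_id,
           latefees,
           nsffees,
           ISNULL(rentalfees, 0) AS rentalfees,
           latefees + nsffees + ISNULL(rentalfees, 0) AS allfees
    FROM customer;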
Performance Issues
- Cleansing/Manipulation:
- Y/N, M/F, codes – standardize in the ETL, use data types efficiently
- Indexes – keep end goals in mind, index according to expected usage
- SARG'able, leading zeros/spaces – look for consistency where possible to avoid data manipulation prior to joins; this helps index usage
- Partitioning tables – for sliding windows of data retention, partitioning the tables allows old data to be dropped off the end of the fact tables fairly easily (see the sketch below)
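Here's a sketch of the sliding-window idea in SQL Server partitioning syntax (all names and dates are hypothetical); the point is that aging out the oldest period becomes a metadata operation instead of a massive DELETE.

    -- Partition the fact table by month so old data can be switched
    -- out instead of deleted row by row:
    CREATE PARTITION FUNCTION pf_month (date)
        AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');
    CREATE PARTITION SCHEME ps_month
        AS PARTITION pf_month ALL TO ([PRIMARY]);

    -- The fact table would be created ON ps_month(event_date);
    -- retiring the oldest month is then roughly:
    --   ALTER TABLE fact_attendance SWITCH PARTITION 1 TO fact_archive;
    --   ALTER PARTITION FUNCTION pf_month() MERGE RANGE ('2024-01-01');
    -- (fact_archive must match the fact table's structure.)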
Application Logic
- Shared work should be pushed into the ETL when it is:
  - Not likely to change
  - Expensive
  - Needed by everybody
- Examples:
  - Trivial – scaling to a convention (rates)
  - Intermediate – simple calculations (patient age)
  - Marginal – interest rates, risk ratings
- An alternative for marginal cases: generate additional facts, either in their own fact tables or as late-arriving facts
- When a DW is shared, there are lots more applications to worry about
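For the intermediate case (patient age), here's a sketch of precomputing in the ETL, with hypothetical table and column names, so every consumer gets the same answer instead of each report computing it differently.

    -- Compute the derived value once, at load time:
    INSERT INTO fact_visit (patient_key, visit_date_key, patient_age, charge)
    SELECT p.patient_key,
           d.date_key,
           -- Note: DATEDIFF(year, ...) counts year boundaries, not
           -- birthdays; a production load would adjust for birthdays
           -- not yet reached in the visit year.
           DATEDIFF(year, p.birth_date, s.visit_date) AS patient_age,
           s.charge
    FROM stage_visit s
    JOIN dim_patient p ON p.patient_id = s.patient_id
    JOIN dim_date d    ON d.full_date  = s.visit_date;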