The naming convention is a key component of any IT project.
The purpose of this article is to suggest a standard for a practical and effective Data Warehouse design
Recipes 6 of Data Warehouse and Business Intelligence - Naming convention techniques
1. Recipes of Data Warehouse and
Business Intelligence
Naming convention techniques
The naming convention is a key component of any
IT project.
The purpose of this article is to suggest a standard
for a practical and effective Data Warehouse design
2. Introduction (1)
• Usually, the naming convention is the method by which you decide to name of the various entities
of the OLTP (Online Transaction Processing) or DSS (Decision Support System) system that you are
designing.
• In this article the term "entity" will be synonymous with "structure": a generic term to identify
data structures typical of a relational Data Warehouse such as: tables, views, materialized
views, synonyms, etc.
• The application of a method should not be considered a separate issue, restricted to the particular
entity that is being analyzed, separated from the global system architecture. On the contrary, it
shall produce a name that should be able to immediately recall all semantic entities involved, no
matter if they belong to a Data Warehouse or to an operational system.
• This is the keystone of the method, because to define a naming convention for an entity not
simply give it a name but means defining an architectural context.
• The names selected will be used in discussions, meetings, documents, deliverables, in relational
schemas, and accompany our working life for several months.
3. Introduction (2)
• Very soon the relationships between various entities, and between entities and programs, will
prevent any changes in the naming convention, and bad choices or no choice, will have unpleasant
consequences for the economy of the entire project.
• The examples shown in the article focus mainly on the data tables, because practically all the work
that takes place within the data loading process, is aimed at filling these entities.
• In these cases the effectiveness of the naming convention appears clear: try to think of the time
needed to acquire knowledge about a system when all entities, perhaps hundreds, have not
followed a standard, or followed only in part or in ambiguous way.
• The purpose of these techniques is to avoid these situations for us and for our partners.
4. Guidelines
• Naming convention must be established early in the life-cycle of a project. You can not start the
process of physical design and development phase without the naming convention of the entities
that constitute our system. No "create ..." will be possible without these decisions.
• The guidelines that are the basis of the techniques showned here are just three:
1 – The logical partitioning
2 – The coding rule
3 – The classification
• The Logical partitioning allows you to define a hierarchy among the entities involved, the
classification allows you to associate a physical connotation to the entities.
• the coding rule standardizes the naming derived from partitioning and classification. Since the
best way to learn these concepts is to see them do in practice, the three guidelines are detailed in
the next sections where we will see how you can apply to a real project.
• Now focus our attention on the entities.
5. Naming convention for the entities (1)
• The paradigm that underlies the naming convention for the entities can be summarized in the
following formula:
<entity name> = <project code>_<area code>_<section code>_<logical name>_<type code>
• To understand the formula, use the first guideline, that the logical partitioning. Define it a
guideline is certainly simplistic to say the least. It's actually a moment of reflection that can only
be faced after realizing our global vision of the system.
• The term ‘logical partitioning’ means the decomposition of the system in its logical components so
called areas and sections. It is an approach very similar to a drill-down: an overview towards a
particular view. It starts with the project, which is the root of our logic tree, it is divided into
areas, each area is divided into the sections.
• The figure 1 show you immediately an example of logical partitioning.
6. Naming convention for the entities (2)
Figure 1
• Since we are talking about Data Warehouse, what are the areas and sections and the way you
think the logical partitioning will be based on the Ralph Kimball philosophy, an approach that I use
successfully from several years.
• Let's look all the five individual components of the <entity name> .
7. The project code (1)
• The first decision to be addressed in a project is to assign it a code. It may seem an obvious
statement, but the identification of a project code is a precondition to lead the project in the right
direction. The choice of the project code has immediate and important consequences that impact
at different levels. Let's see.
Architectural level
• The first level impacted is the logical / physical architecture of the Data Warehouse. After
choosing the code project, each entity will be named with that project code followed by
underscore "_".
• This means that each table, view, materialized view, synonym, program, or generally, any physical
structure is created, it must start with <project code>.
• Anywhere the physical structures are allocated, we have established a unique way to identify all
the objects in our Data Warehouse. (Obviously there must not be another project with the same
code). A selection from the database dictionary will allow us to identify exclusively the objects
that concerns us.
8. The project code (2)
File system level
• The second level impacted is the physical file system. All project directories, in which arrives the
flat files coming from the source systems, should also start with ''<project code> _ ". Another
example might be the directories of the database files, of the creation script, etc.
Documentation level
• The third level impacted concern documentation and deliverables. Using the same technique for
them, we will be sure to identify with our search always and only the documents associated with
our project.
9. The rule of three
• At this point we need a coding rule. If, for example, we have defined our project as "Intelligent
Enterprise Data Warehouse," it is unthinkable to use 38 characters as project code.
• My experience about the design of several Data Warehouses, provided me a very clear indication.
It can be argued that the best rule to identify the project code is the "rule of three.“
• This rule suggests that the project code must be of three characters. In our example, a good
acronym, and then the project code, will be "EDW”. If we are designing a Data Warehouse of
Exploration and Production of an oil company, a good project code will be" EPD. “
• As you know, three is the perfect number, and in my past, this perfection was clearly manifested in
all its aspects. Two characters cause ambiguity and misunderstanding, four or plus characters are
too much. Since three is the perfect number, that rule will also apply to other encodings.
• So, we start then with the project code “EDW”.
10. The area code
• The areas are a way to make a first logical partitioning of the objects of our Data Warehouse.
Break all objects in areas of membership, allow us to associate them in a single scope, all the
items with similar logical/temporal affinity.
• To understand the areas, we must express some principles of data warehousing, in
particular, some aspects of the ETL process (Extract, Transform and Load). These principles will be
simplified as much as possible to ensure an understanding to all.
• Obviously there is a very extensive literature on this subject which I suggest you for further
investigations.
11. The ETL process (1)
• The loading processes of the tables of a Data Warehouse, typically is a workflow like the following.
• External systems produce source data files. Such systems may be of any type (OLTP, other Data
Warehouse, Operational Data Store, etc.) and these files can be of any format (ASCII flat files of
fixed length, with terminator, XML, etc.).
• The data in these flat files are loaded into work tables, defined staging tables. These tables are
almost always structured almost identical picture of the flat file.
• After the loading of the staging tables, the process continues with the loading of the others
tables, which are called dimensional and basic fact tables; I like call them of “level 0”. This process
is responsible to perform all the cleaning and checking work on the data of the staging
tables, define the artificial keys for the fact tables, and then load the data according to the
"dimensional model.“
• With this term we mean that the data structure is composed with a fact table and dimensional
tables that make up what is called the “star-schema”. (excuse me, Ralph, the simplification...)
12. The ETL process (2)
• After loading the fact tables of level 0, may be loaded aggregate tables (to improve the
performance of reading data from the Business Intelligent tools) or higher-level fact tables (1.2, ..
N). It depends on the complexity and needs of the Data Warehouse.
• A typical example in the market of large retail may be the level 0 fact tables that stores the data of
sales of the stores and the costs associated with the sold items.
• From these fact tables, you can load a fact table to a level 1 where you can lose some dimensions
of analysis and compare the sell and costs to get the net or gross margin.
• The whole process that loads the data arriving from the external systems up to the loading of all
other structures is defined the ETL process.
• There may be variations on this workflow, but it can be said that nearly all Data Warehouse
projects are adapting to this process.
13. First step - the area partitioning
• After this brief interlude, we return to the speech left open, that is the connection of the ETL
process with our naming convention. The logical lartitioning of the entities that we do in fact, is
precisely focused on the process described above. We can identify, for example, the following
logical areas:
• Staging Area: In this area we have all the entities that receive the flat data files coming from
external systems.
• Level 0 Data Mart Area: In this area there are all entities which contain the data cleaned and
dimensionalized by the load process.
• Level 1 Data Mart Area (and subsequent levels): In this area there are higher level entities that are
derived by the entities of level 0.
• Common Area: In this area there are all the dimension tables and all those entities that are
common to the whole Data Warehouse, such as configuration tables, service tables, etc..
• Of course you can also identify other areas. For example we can identify an area called
“Operational Data Store” where all the data structures are normalized and not structured
according to the dimensional model, or we can identify an area of infrastructural
information, linked to the logging of programs and schedules. In short, the possibilities are
different, and we will think carefully.
14. Second step – the coding rule
• The next step then is to codify by the "rule of three" all areas that we identified. Using acronyms
very intuitive, like STA, DM0, DM1, COM obtain as indicated in Figure 1.
• As you can see the effects of naming conventions are starting to take shape and already at this
level, still high, you can easily identify all the entities belonging to a common logic.
• If you want to know the names of all entities that collect data from external flat files, is enough to
extract ,from the RDBMS data dictionary, all the object types that begin with “EDW_STA_”. If you
want to identify all the fact tables of level 1, search all object types that begin with “EDW_DM1_”
• In reality, this initial partitioning process gives you a too high level. In projects of some
complexity, where the number of entities is high, should be further partitioned areas in the
sections or sub-areas.
• Each area will then be decomposed into smaller elements that deepen and enrich the knowledge
of the Data Warehouse structures.
15. The section code (sub-area)
• The sections allow you to group logically all entities that belong to the different areas. Each area
has specific characteristics, so that even the process of partitioning areas in the sections is specific
to each area.
• Obviously, every designer can define the sections as it seems appropriate, based on their
experience and creativity. Sections that now will be described, for the areas defined above, are
those based on my personal experience and it is only a suggestion.
• Let's look in more detail.
16. The sections of the Staging Area
• Taking the example of the flat data files feeding the Data Warehouse, often producing subsytems
can be different, each generating different data. If, for example, we have three subsystem like
sales, accounting and logistical movements, the approach can be as follows:
• Codify with the rule of three the external subsystems, such as selling stores, with STO
(Store), logistics movements with MOV (Movements) and accounts with PLS (Profit and Loss).
• Each entity linked to these subsystems will have these codes within its name.
• If you want to know the names of all Staging Area tables loaded from logistics subsystem, simply
extract from the database data dictionary all entities that begin with "EDW_STA_MOV_.
17. The sections of the level 0 Data Mart area
• This is a key area because it found space all the basic fact tables, that are those with the most
detailed granularity of the information. The best way to partition the area into sections, is
obviously to identify with the rule of three the different entities that comprise it.
• For example, if a data mart contains all the sales data, a code should be SLS or SAL (short for
sales), if it contains cost data, will be CST or COS (short for Costs).
• Since the concept of data marts is essential in a Data Warehouse project, I prefer to open another
parenthesis about principles of data warehousing and give you a more detailed definition for ease
of understanding of the naming convention that uses it.
18. The Data Mart
• The data mart is a "logical subset of the complete Data Warehouse" as shown by R. Kimball in his
book [1]. Each data mart is usually organized around a single business processes, and the union
of all data marts is the Data Warehouse.
• Then sections of the DM0, will be specific business process that you are deploying. Someone can
think that the fact table with and data mart are conceptually the same. The data mart is a logical
concept, the fact table is only a part of the physical implementation of this logical concept.
• In reality, the data mart consists of a fact table and all its aggregated tables (materialized views in
Oracle) that are created to increase performance reporting associated with that business area.
Obviously, the aggregated structures are not mandatory, but a necessity for VLDW (Very Large
Data Warehouse), where the amount of data processed may exceed terabytes.
• A data mart that contains only the fact table, perhaps several gigabytes of data, is not a data mart:
it is a problem.
• Because in a Data Warehouse the number of aggregate tables tend to be high, the naming
convention provides an help to immediately identify all data structures associated with a specific
business process.
19. The sections of the level 1 Data Mart area
• In this area there are the data marts whose content is not immediately derivable from a data mart
level 0, but consists of data taken by several lower-level data marts or obtained by processes that
do not make a trivial aggregate calculation. We have two examples to clarify the concept.
• As mentioned above, in the market associated with large retailers, it is typical to have a data mart
of Profitability, which contains a single data structure sales information and a data Mart with
various cost types that allow you to get the net or gross margin.
• We can think Profitability as a data mart of level 1, which is loaded by the two data marts level 0
indicated above, ie EDW_DM0_SLS and EDW_DM0_CST. You can then encode this data marts of
profitability with EDW_DM1_PRF or EDW_DM1_PRO.
• Another example is related to aggregate data that are not immediately achievable with simple
sums or counts (because in that case I use the materialized views). If a user requirement is to see
the monthly data associates with the the same month of the last year, both current and
progressive, a design choice may be to load, with the ETL process, a level1 data mart with these
aggregate data already calculated, rather than entrust this task to the reporting tools.
• You can encode this data mart as EDW_DM1_MPY (indicating an Monthly, with data P revious year
and progressive, Year to date).
20. The sections of the Common Area (1)
• In this area we associate all entities that are common to the entire project and not connected to
one of the areas described above. Typical examples of these sections are the following:
• CDI section (Conformed Dimension) - The dimensions of analysis present in the Data
Warehouse, are not specific to a Data Mart, but they are shared by many data marts. Just think of
the time dimension, present in virtually all data mart of any level. Are therefore common
structures that can be grouped into one section.
• CFA section (Conformed Fact) - Concept similar to the previous but applied to the fact tables. An
example is the table of currency conversion, which is a typical fact table but comprehensive and
usable by all the Data Warehouse.
• LOG section (Logging) - All structures associated with the log of the ETL process.
• MTD section (metadata) - All metadata structure in the Data Warehouse.
21. The sections of the Common Area
• The Figure 2 show you an example of sections, and then complete our design with that just
described.
Figure 2
• At this point our partitioning process can be considered complete. Additional sub-sections may
give little added value to knowledge, in addition to further extend the final name of various
entities. Indeed, in very simple projects, may be enough the partitioning at the level area. Explore
now the fourth components of the <entity name>, that is the logical name.
22. The logical name
• We have to identify a logical name, preferably short, after the codes just described. Returning to
the source data files that load a Staging Area table, if the subsystem Profit and Loss provide us the
customer informations, its logical name will be “CUST” and its complete name will be:
EDW_STA_PRL_CUST
• You can see the completeness of information that comes from the name of this entity. In a few
characters we know that they are customer data, provided by Profit and Loss Subsystem, which
are loaded on a Staging area that is part of the Intelligent Enterprise Data Warehouse.
• It is not necessary submit the logical to the rule of three, and, in some cases, it may be omitted.
For example, in the data mart area, the prefix is already sufficiently explanatory, and the addition
of a logical name will not give any added value to understanding the content of the entity.
• Indeed, long before the start of deploy to production, the data marts of Profitability exemplified
above, will be called “affectionately" PRO”, and all recognize the Profitability Data Mart naming
only the three characters of the logical name.
23. The type code (1)
• Now you might think of getting our initial goal. The name obtained by following the guidelines of
the naming convention is semantically significant, and I believe that if we simply called the entity
CUSTOMER, we would have lost in knowledge and context.
• Unfortunately it is not just so. The entity EDW_STA_PRL_CUST still does not tell us everything
about its own nature.
• If you carefully review the semantics that the naming convention has given to the word
EDW_STA_PRL_CUST, we find that there are some cognitive deficiencies. It is quite clear that
customer data are provided by the Profit and Loss Subsystem, but does not show up anywhere
that it is a table of data. The technique of classification can help us to fill these gaps.
• Among the various entities which are present in the section of an area, there may be tables, but
also view (views are the selection of certain rows and / or certain columns in a table), synonyms
(ie tables aliases), sequences, etc.
• This information is important because it avoids us to access to the database data dictionary. This
fact should not be underestimated because in many situations, this information, which seems
highly associated with the "physical" side of the entity, also has a substantial logical content.
24. The type code (2)
• Consider, for example the case of the fact table and materialized views (the Oracle summary
tables). The fact tables are always present in a Data Warehouse projects.
• The materialized views are aggregated tables that are typically generated from fact tables, but that
aggregate the data according to certain dimensions of analysis.
• It is important to distinguish immediately the two types of entities, because their content is
different and also their feeding process can be different and performed at different times.
• Inside the data mart level 0 area, there may be a section that identifies the sales data mart, but
within this section may be dozens of entities, and how many and what are the "aggregated"
ie, materialized views, is an important information for us and for the database administrator.
• By classification, this information will be immediately apparent, not only for a specific section or
area, but also across all areas. If we decide to typify all the materialized views derived from fact
tables with the code FMV (Fact Materialized View), all entities whose name ends with FMV will
surely "aggregated" tables.
25. The type code (3)
• What follows is a fairly exhaustive list of codes and descriptions that will certainly help us in the
classification process.
• DAT - Data Table. General data table typical of staging areas. Are basically tables that contain
metrics, rate, or other measures, with their natural production codes not transformed in artificial
keys.
• DST - Descriptive Table. Like the previous, it refers to tables typical of staging areas. For example
the personal data codes and their natural attributes of the customers, the product codes, etc. In a
world, all staging tables that will turn into dimensional tables.
• EDT - External Data Table. Typical of the staging areas. The 'external table’ is a very useful feature
of Oracle by which a flat file is seen in effect as a table. Then easily managed with the SQL
language.
• EXT - Exception Table. Table that contains all lines that have failed integrity checks in a table.
• FAT - Fact Table. Typically contains facts at maximum transactional detail, how the individual
movements of current account or the individual lines of receipts issued by a supermarket
checkout.
• FFT - Factless Fact Table. Fact table without numeric information. In them, the mere presence of
the line certify the occurrence of an event, such as the presence of a student in a university
course.
26. The type code (4)
• PSF - Periodic Snapshot Fact table. This type of fact table reassume many facts in one line
summary of a situation at the end of a period of time. A typical example is the bank account: it is
possible to construct a monthly summaries supplied by the basic fact table where there are details
of each movement performed on the bank account.
• ASF - Accumulating Snapshot Fact table. This type of table describes the processes that have a
beginning and an end not too time-deferred and when a dimension is repeated several times. A
typical example is the story of an order or an invoice, all summarized in one line. FAV - Fact View -
Partial view of data in one or more fact tables
• FMV - Fact materialized views - materialized view built on one or more fact tables.
• DIT - Dimensional table - Table for dimensional analysis
• DIV - Dimensional View - view associated with a dimensional table.
• DIM - Dimension - For Oracle the dimension is not the dimension table, but the logical structure
that describes its content in terms of hierarchical levels and attributes associated with the various
levels.
• TMT - Temporary table – They are tables that contains temporary data used in complex ETL
processes.
• CFT - Configuration table – Generic table containing configuration data.
• DTE - Data Entry Table - Table that can be updated by the user with a graphical interface.
• AWS - Analytic workspace - Cube OLAP data structure that is typical of the multidimensional
databases.
27. Conclusions
• It's the end of this brief journey through the naming convention techniques . How exposed is
certainly not exhaustive of the potential for application of such techniques, nor intends to be a
dogma to be followed. Each of us, based on its experience, can build own naming convention
according to their needs and according to their intuition.
• In fact it does not matter which choice you start to partition the system or how its components
are classified, but it is important to apply a standardization method.
• An effective naming convention certainly provides all the tools needed to control the Data
Warehouse, in terms of knowledge, management and maintenance.
Bibliography
[1] R. Kimball - The Data WarehouseLifecycle Toolkit ", Wiley, 1998