Data Marts: What They Are and Why Businesses Need Them
Imagine you run a candy store. Some of the goodies are in display cases for quick access while the rest is
in the storeroom. Now let’s think of the sweets as the data required for your company’s daily operations.
Instead of combing through the vast amounts of all organizational data stored in a data warehouse, you can
use a data mart — a repository that makes specific pieces of data available quickly to any given business
unit. Just like display cases in a store.
This article is going to provide an in-depth explanation of what data marts are and how they store data
for Business Intelligence purposes. You’ll also find out about the key types of data marts, their structure
schemas, implementation steps, and more.
What is a data mart?
A data mart is a smaller subsection of a data warehouse built specifically for a particular subject area,
business function, or group of users. The main idea is to provide a specific part of an organization with data
that is the most relevant for their analytical needs. For example, the sales or finance teams can use a data
mart containing sales information only to make quarterly or yearly reports and projections. Since data marts
provide analytical capabilities for a restricted area of a data warehouse, they offer isolated security and
isolated performance.
Data mart vs data warehouse vs data lake vs OLAP cube
Data lakes, data warehouses, and data marts are all data repositories of different sizes. Apart from the size,
there are other significant characteristics to highlight.
A data warehouse (DW) is a data repository that enables storing and managing all the historical enterprise
data, coming from disparate internal and external sources like CRMs, ERPs, flat files, etc. Initially, DWs
dealt with structured data presented in tabular forms. Modern cloud warehouses make it possible to store
data in its raw formats similar to what data lakes do. While cloud solutions are quicker to set up, on-premise
DWs may take months to build.
A data mart is a subject-oriented relational database commonly containing a subset of DW data that is
specific to a particular business department of an enterprise, e.g., a marketing department. Data marts get
information from relatively few sources and are small in size — less than 100 GB. They typically contain
structured data and take less time for setup — normally 3 to 6 months for on-premise solutions.
A data lake is a central repository used to store massive amounts of both structured and unstructured
data coming from a great variety of sources. Data lakes accept raw data, eliminating the need for prior
cleansing and processing. As for size, they can hold many files, where even a single file can be
larger than 100 GB. Depending on the goal, it may take weeks or months to set up a data lake. Moreover,
not all organizations use data lakes.
Data marts shouldn’t be confused with OLAP cubes either. An OLAP or Online Analytical Processing cube is
the tool used to represent data for analysis in a multidimensional way. So, just like data warehouses, data
marts can be used as the foundation for creating an OLAP cube. For example, a company has a data mart
containing all the financial data. The company may wish to model an OLAP cube to summarize this data by
different dimensions: by time, by product, or by city, to name a few.
Types of data marts
Based on how data marts are related to the data warehouse as well as external and internal data sources,
they can be categorized as dependent, independent, and hybrid. Let’s elaborate on each one.
Dependent data marts are the subdivisions of a larger data warehouse that serves as a centralized data
source. This is something known as the top-down approach — you first create a data warehouse and then
design data marts on top of it. Within this sort of relationship, data marts do not interact with data sources
directly. Based on the subjects, different sets of data are clustered inside a data warehouse, restructured,
and loaded into respective data marts from where they can be queried.
Dependent data marts are well suited for larger companies that need better control over the systems,
improved performance, and lower telecommunication costs.
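The top-down flow can be sketched in a few lines of Python using the standard-library sqlite3 module. This is only an illustration under stated assumptions: the two in-memory databases stand in for a warehouse and a sales data mart, and the table and column names (orders, sales_orders, department) are hypothetical.

```python
import sqlite3

# In-memory database standing in for the central warehouse (hypothetical schema).
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE orders (order_id INTEGER, department TEXT, amount REAL, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 'sales',     120.0, '2023-01-05'),
        (2, 'marketing',  40.0, '2023-01-06'),
        (3, 'sales',      75.5, '2023-02-11');
""")

# Separate database standing in for the dependent sales data mart.
sales_mart = sqlite3.connect(":memory:")
sales_mart.execute(
    "CREATE TABLE sales_orders (order_id INTEGER, amount REAL, order_date TEXT)"
)

# The mart is fed from the warehouse only, never from the source systems.
rows = warehouse.execute(
    "SELECT order_id, amount, order_date FROM orders WHERE department = 'sales'"
).fetchall()
sales_mart.executemany("INSERT INTO sales_orders VALUES (?, ?, ?)", rows)
sales_mart.commit()

# Queries now run against the small mart instead of the whole warehouse.
mart_total = sales_mart.execute("SELECT SUM(amount) FROM sales_orders").fetchone()[0]
print(mart_total)  # 195.5
```

In a real deployment the subset would be refreshed on a schedule rather than copied once.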
Independent data marts act as standalone systems, meaning they can work without a data warehouse.
They receive data from external and internal data sources directly. The data presented in independent data
marts can be then used for the creation of a data warehouse. This approach is called bottom-up.
Often, the motivation behind choosing independent data marts is shorter time to market. They work great for
small to medium-sized companies.
So, the key difference between dependent and independent data marts is in the way they get data from
sources. The step involving data transfer, filtering, and loading into either a data warehouse or data mart is
called the extract-transform-load (ETL) process. When dealing with dependent data marts, the central data
warehouse already keeps data formatted and cleansed, so ETL tools will do little work. On the other hand,
independent data marts require the complete ETL process for data to be ingested.
Hybrid data marts integrate data from all existing operational data sources and/or data warehouses. This
method collects the benefits and addresses the issues of both top-down and bottom-up approaches. Hybrid
data marts are a good choice for organizations that have multiple databases.
Data mart structure schemas
Similar to traditional data warehouses, data marts use a relational approach to data modeling. A relation is a
mathematical term for a table, which is a combination of rows and columns containing different values. To
logically arrange pieces of data in a data mart, companies use two main schemas — star and snowflake.
Both consist of a fact table and dimension tables with different levels of joins.
Star schema, as the name suggests, resembles a star. It comprises only one fact table that is placed in the
center of the model and breaks down into several dimension tables with denormalized data. This means that
the data is redundant and that results in faster data retrieval as fewer joins are needed.
The fact table encompasses aggregated data designed to be used for analytical and reporting purposes
while the dimension tables contain descriptions of the stored data. The star schema is a simple type of data
mart structure as the fact table has only one link to each dimension table. As such, this model makes it
easier to accomplish complex queries.
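As a rough illustration, a star schema can be modeled with Python's sqlite3 module. The table names (fact_sales, dim_product, dim_store) and the sample rows are assumptions made for this sketch, not part of any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: denormalized descriptions of the facts.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);

    -- Fact table: one row per sale, with a single link to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        quantity   INTEGER,
        revenue    REAL
    );

    INSERT INTO dim_product VALUES (1, 'Lollipop', 'Candy'), (2, 'Truffle', 'Chocolate');
    INSERT INTO dim_store   VALUES (1, 'Austin', 'USA'), (2, 'Lyon', 'France');
    INSERT INTO fact_sales  VALUES (1, 1, 10, 15.0), (2, 1, 2, 18.0), (1, 2, 4, 6.0);
""")

# One join per dimension is enough to answer a reporting question.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)  # [('Candy', 21.0), ('Chocolate', 18.0)]
```

Note that the category text is repeated on every product row. That redundancy is the price of needing only one join.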
Snowflake schema has the star schema as its base, yet the data in dimension tables is normalized as it is
split into additional dimension tables. The dimension tables in the snowflake schema are normalized by
moving attributes with few unique values out into separate tables. Such an
arrangement forms a sort of snowflake, hence the name of the schema. Though the snowflake schema
protects data integrity more efficiently and takes up less disk space, querying becomes more complex
because of many levels of joins between tables.
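The extra level of joins can be seen in a small sqlite3 sketch. It assumes a hypothetical dim_category table that holds the category attribute moved out of the product dimension. All names and figures are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The category attribute is normalized out of the product dimension.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_product (
        product_id  INTEGER PRIMARY KEY,
        name        TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        revenue    REAL
    );

    INSERT INTO dim_category VALUES (1, 'Candy'), (2, 'Chocolate');
    INSERT INTO dim_product  VALUES (1, 'Lollipop', 1), (2, 'Gummy bear', 1), (3, 'Truffle', 2);
    INSERT INTO fact_sales   VALUES (1, 15.0), (2, 5.0), (3, 18.0);
""")

# The same question as in a star schema now needs two joins instead of one.
snowflake_revenue = conn.execute("""
    SELECT c.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
print(snowflake_revenue)  # [('Candy', 20.0), ('Chocolate', 18.0)]
```

Each category name is stored once, which saves space and keeps updates consistent, at the cost of a longer query.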
Data mart implementation steps
The process of creating data marts may be complicated and differ depending on the needs of a particular
company. In most cases, there are five core steps: designing a data mart, constructing it,
transferring data, configuring access to a repository, and finally managing it. We’ll walk you through each
step in more detail.
Data mart designing
The first thing to do when implementing a data mart is to decide on the scope of the project and its design.
Since data marts are subject-oriented databases, this step involves determining a subject or a topic to which
data stored in a mart will be related. In addition to collecting information about technical specifications, you
need to decide on business requirements during this phase too. It is also necessary to identify the data
sources related to the subject and design the logical and physical structure of the data mart.
Data mart constructing
Once the scope of work is established, the second step is to construct the logical
and physical structures of the data mart architecture designed during the first phase.
Logical structure refers to the scenario where data exists in the form of virtual tables or
views separated from the warehouse logically, not physically. Virtual data marts may be a good option
when resources are limited.
Physical structure refers to the scenario where a database is physically separated from the
warehouse. The database may be cloud-based or on-premises.
Also, this step requires the creation of the schema objects (e.g., tables, indexes) and setting up data access
structures.
It is essential to perform a detailed requirement collection before implementing any scenario since different
organizations may need different types of data marts.
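The difference between a logical and a physical data mart can be sketched with sqlite3. Here a view plays the role of the virtual mart and an attached in-memory database plays the role of the physically separate one. The names finance_mart_v, finance_mart, and transactions are hypothetical, and a production warehouse would use its own mechanisms for both.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE transactions (id INTEGER, unit TEXT, amount REAL);
    INSERT INTO transactions VALUES
        (1, 'finance', 900.0), (2, 'sales', 120.0), (3, 'finance', 50.0);

    -- Logical data mart: a view over the warehouse, no data is copied.
    CREATE VIEW finance_mart_v AS
        SELECT id, amount FROM transactions WHERE unit = 'finance';
""")

# Physical data mart: a separate database that holds its own copy of the data.
warehouse.execute("ATTACH DATABASE ':memory:' AS finance_mart")
warehouse.execute("""
    CREATE TABLE finance_mart.transactions AS
        SELECT id, amount FROM main.transactions WHERE unit = 'finance'
""")
# Schema objects such as indexes are created inside the mart itself.
warehouse.execute("CREATE INDEX finance_mart.idx_transactions_id ON transactions (id)")

# A new warehouse row shows up in the view at once, but not in the copy.
warehouse.execute("INSERT INTO main.transactions VALUES (4, 'finance', 25.0)")
view_count = warehouse.execute("SELECT COUNT(*) FROM finance_mart_v").fetchone()[0]
copy_count = warehouse.execute(
    "SELECT COUNT(*) FROM finance_mart.transactions"
).fetchone()[0]
print(view_count, copy_count)  # 3 2
```

The view costs no extra storage but shares the warehouse's load. The copy is isolated but has to be refreshed.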
Data transferring
The third step covers all the tasks related to transferring data from sources to data marts:
extracting information from target data sources,
cleansing and converting data into a fitting format, and
loading data into a data mart.
To perform the processes of extraction, transformation, and loading, ETL tools are used.
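The three tasks above can be sketched in plain Python. This is a minimal illustration rather than a substitute for an ETL tool: the CSV text stands in for a source system export, and the column names and cleaning rules are assumptions.

```python
import csv
import io
import sqlite3

# Extract: a small CSV export standing in for a source system (hypothetical data).
raw_export = io.StringIO(
    "order_id,amount,currency\n"
    "1, 19.99 ,usd\n"
    "2,,usd\n"
    "3,5.00,USD\n"
)
extracted = list(csv.DictReader(raw_export))

# Transform: drop rows without an amount, convert types, normalize currency codes.
transformed = [
    (int(row["order_id"]), float(row["amount"]), row["currency"].strip().upper())
    for row in extracted
    if row["amount"].strip()
]

# Load: write the cleaned rows into the data mart table.
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
mart.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)
mart.commit()

loaded = mart.execute(
    "SELECT order_id, amount, currency FROM orders ORDER BY order_id"
).fetchall()
print(loaded)  # [(1, 19.99, 'USD'), (3, 5.0, 'USD')]
```

For a dependent data mart most of the transform step would already have been done in the warehouse.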
Data access configuring
Now that data is in data marts, it’s time to put it to use: making queries, analyzing data, creating reports, etc.
The accessing step involves the following tasks:
setting up the intermediate (meta) layer for the front-end application (the layer converts database
structures into business terms so that end clients can access data from data marts easily);
setting up and managing database structures like summarized tables; and
setting up APIs (application programming interfaces) if required.
Data marts can be accessed via a command line or GUI (graphical user interface), which is a more user-
friendly option.
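A summarized table and a very small meta layer might look like the following sketch. The cryptic column names (rgn_cd, rev_amt), the BUSINESS_TERMS mapping, and the report helper are all hypothetical and only show the idea of translating database structures into business terms.

```python
import sqlite3

mart = sqlite3.connect(":memory:")
mart.executescript("""
    CREATE TABLE fact_sales (sale_dt TEXT, rgn_cd TEXT, rev_amt REAL);
    INSERT INTO fact_sales VALUES
        ('2023-01-10', 'E', 100.0),
        ('2023-01-22', 'E',  50.0),
        ('2023-02-03', 'W',  70.0);

    -- Summarized table: monthly totals precomputed for reporting tools.
    CREATE TABLE sales_monthly AS
        SELECT substr(sale_dt, 1, 7) AS sale_month, rgn_cd, SUM(rev_amt) AS rev_amt
        FROM fact_sales
        GROUP BY sale_month, rgn_cd;
""")

# Meta layer: business terms mapped to physical column names.
BUSINESS_TERMS = {"Month": "sale_month", "Region": "rgn_cd", "Revenue": "rev_amt"}

def report(terms):
    """Return rows from the summary table, selected by business term."""
    # Unknown terms raise KeyError, so only mapped columns reach the query.
    columns = ", ".join(BUSINESS_TERMS[term] for term in terms)
    query = f"SELECT {columns} FROM sales_monthly ORDER BY sale_month"
    return mart.execute(query).fetchall()

monthly_revenue = report(["Month", "Revenue"])
print(monthly_revenue)  # [('2023-01', 150.0), ('2023-02', 70.0)]
```

End users ask for "Month" and "Revenue" and never need to know how the columns are named in the database.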
Managing
The final step of the data mart implementation process encompasses different management tasks like:
providing secure user access to data;
optimizing and fine-tuning the system for better performance;
adding and managing new data; and
ensuring system availability and planning recovery scenarios.
Data mart use cases
Companies can become more agile and data-driven with the right approach to business intelligence and
data analytics. Data marts were initially created to help companies make more informed business decisions
and address unique organizational problems — those specific to one or several departments. There are
quite a few cases where data marts can be used. We’ll cover the typical ones in the next sections.
Subject-focused data analytics
Data analytics plays a crucial role in any business lifecycle. Data marts allow for more focused data analysis
because they only contain records organized around specific subjects such as products, sales, customers,
etc. Since there’s no extraneous information, businesses can discern clearer and more accurate insights.
For example, data marts can be used as on-premise or cloud-based destinations to consolidate all the
marketing data and store it in a structured format. This allows marketing teams to reach a single source of
truth and get a better handle on important metrics such as the return on investment (ROI), customer
acquisition cost (CAC), and return on ad spend (ROAS). Data marts provide easy and fast access to
important data points when needed. They can process complex queries and push the required data into
corresponding reporting and data analytics tools.
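For reference, the three metrics can be computed as below. The formulas are the commonly used definitions, and exact definitions vary between teams. The campaign figures are invented for the example, and the whole campaign spend is treated as ad spend for simplicity.

```python
def roi(revenue, cost):
    """Return on investment: net gain relative to what was spent."""
    return (revenue - cost) / cost

def cac(acquisition_spend, new_customers):
    """Customer acquisition cost: spend per newly acquired customer."""
    return acquisition_spend / new_customers

def roas(ad_revenue, ad_spend):
    """Return on ad spend: revenue attributed to ads per unit of ad spend."""
    return ad_revenue / ad_spend

# Hypothetical campaign row, as it might be read from a marketing data mart.
campaign = {"spend": 2000.0, "revenue": 5000.0, "new_customers": 40}

campaign_roi = roi(campaign["revenue"], campaign["spend"])
campaign_cac = cac(campaign["spend"], campaign["new_customers"])
campaign_roas = roas(campaign["revenue"], campaign["spend"])
print(campaign_roi, campaign_cac, campaign_roas)  # 1.5 50.0 2.5
```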
Selective data access
Data marts can be used in situations when an organization needs selective privileges for accessing and
managing data. This is often the case for big enterprises that can’t expose the entire data warehouse to all
users. Building multiple dependent data marts can help protect sensitive data from unauthorized access and
accidental writes.
Improved resource management
Providing each department with a separate data mart can be a good way to manage the imbalance of
resource use by different organizational units. Say, the department running logistics operations runs a large
number of database operations every day. This load may slow down or disrupt the work of other departments that
perform fewer database queries. Eventually, this may reduce the performance of the whole company. Separate data
marts isolate the workload of each department, so resources are used more efficiently.
Time-limited data projects
Compared to corporate data warehouses that require significant time and effort, data marts are much easier
and faster to set up: Data engineers and developers work with smaller amounts of data, fewer sources, and
simpler schemas. On top of that, data marts are cheaper to implement than a DW. So, if you have time
limitations in terms of completing a data project, data marts may be the way to go.
The “cloud-y” future of data marts
Businesses face an endless growth of information. Getting actionable, data-driven insights becomes difficult
for those still using on-premises solutions. In the Big Data reality, data warehouses are progressively moving
to the cloud, and so are data marts. Cloud solutions facilitate storing and sharing massive sets of data,
unlocking the true power of effective data analysis.
Cloud-based platforms offer flexible architectures with separate data storage and compute resources, resulting
in better scalability and faster data querying. With a single repository containing all data marts in the cloud,
businesses can not only lower costs but also provide all departments with unhindered access to data in real-
time.
In addition, cloud data marts can be a great tool for machine learning purposes. Data marts contain all the
relevant information connected to transactions, products, or customers for a given period of time. Because
they’re credible, they can be used to build different ML models such as propensity
models predicting customer churn or those providing personalized recommendations.
Data Cube
What Does Data Cube Mean?
A data cube is a three-dimensional (3D) (or higher) range of values that
is generally used to explain the time sequence of an image's data. It is a data
abstraction to evaluate aggregated data from a variety of viewpoints. It is also
useful for imaging spectroscopy as a spectrally-resolved image is depicted as a
3-D volume.
A data cube can also be described as the multidimensional extensions of two-
dimensional tables. It can be viewed as a collection of identical 2-D tables
stacked upon one another. Data cubes are used to represent data that is too
complex to be described by a table of columns and rows. As such, data cubes
can go far beyond 3-D to include many more dimensions.
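The idea of stacking identical 2-D tables can be shown with nested Python lists. The products, cities, months, and sales figures below are made up for the illustration.

```python
# Each 2-D table holds units sold: rows are products, columns are cities.
products = ["lollipop", "truffle"]
cities = ["Austin", "Lyon", "Osaka"]
months = ["Jan", "Feb"]

january = [[10, 4, 7],
           [2, 9, 1]]
february = [[12, 3, 8],
            [5, 6, 0]]

# Stacking the tables along a third axis (month) gives a 3-D cube.
cube = [january, february]

def cell(month, product, city):
    """Look up one value by its member on each of the three dimensions."""
    return cube[months.index(month)][products.index(product)][cities.index(city)]

truffle_lyon_feb = cell("Feb", "truffle", "Lyon")
print(truffle_lyon_feb)  # 6

# Aggregating along the month axis collapses the cube back into a 2-D table.
totals = [
    [sum(table[p][c] for table in cube) for c in range(len(cities))]
    for p in range(len(products))
]
print(totals)  # [[22, 7, 15], [7, 15, 1]]
```

Adding a fourth dimension, such as sales channel, would mean one more level of nesting.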
Techopedia Explains Data Cube
A data cube is generally used to easily interpret data. It is especially useful
when representing data together with dimensions as certain measures of
business requirements. Each dimension of a cube represents a certain
characteristic of the database, for example, daily, monthly or yearly sales. The
data included inside a data cube makes it possible to analyze almost all the figures
for virtually any or all customers, sales agents, products, and much more.
Thus, a data cube can help to establish trends and analyze performance.
Data cubes fall mainly into two categories:
Multidimensional Data Cube: Most OLAP products are developed based on a
structure where the cube is patterned as a multidimensional array. These
multidimensional OLAP (MOLAP) products usually offer improved
performance when compared to other approaches mainly because they can
index directly into the structure of the data cube to gather subsets of
data. As the number of dimensions grows, the cube becomes
sparser. That means that several cells that represent particular attribute
combinations will not contain any aggregated data. This in turn boosts the
storage requirements, which may reach undesirable levels at times,
making the MOLAP solution untenable for huge data sets with many
dimensions. Compression techniques might help; however, their use can
damage the natural indexing of MOLAP.
Relational OLAP: Relational OLAP makes use of the relational database
model. The ROLAP data cube is implemented as a set of relational tables
(approximately twice as many as the quantity of dimensions) compared to
OLAP (online analytical processing)
OLAP (online analytical processing) is a computing method that enables users to easily and
selectively extract and query data in order to analyze it from different points of view. OLAP business
intelligence queries often aid in trends analysis, financial reporting, sales forecasting, budgeting and other
planning purposes.
For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's
beach ball products sold in Florida in the month of July, compare revenue figures with those for the same
products in September and then see a comparison of other product sales in Florida in the same time period.
How OLAP systems work
To facilitate this kind of analysis, data is collected from multiple data sources and stored in data warehouses, then
cleansed and organized into data cubes. Each OLAP cube contains data categorized by dimensions (such as customers,
geographic sales region and time period) derived from dimension tables in the data warehouses. Dimensions are then
populated by members (such as customer names, countries and months) that are organized hierarchically. OLAP cubes
are often pre-summarized across dimensions to drastically improve query time over relational databases.
Analysts can then perform five types of OLAP analytical operations against these multidimensional databases:
Roll-up. Also known as consolidation, or drill-up, this operation summarizes the data along a dimension.
Drill-down. This allows analysts to navigate deeper among the dimensions of data, for example drilling down from
"time period" to "years" and "months" to chart sales growth for a product.
Slice. This enables an analyst to take one level of information for display, such as "sales in 2017."
Dice. This allows an analyst to select data from multiple dimensions to analyze, such as "sales of blue beach balls
in Iowa in 2017."
Pivot. Analysts can gain a new view of data by rotating the data axes of the cube.
OLAP software then locates the intersection of dimensions, such as all products sold in the Eastern region above a
certain price during a certain time period, and displays them. The result is the "measure"; each OLAP cube has anywhere from
one to perhaps hundreds of measures, which are derived from information stored in fact tables in the data warehouse.
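The five operations can be imitated on a small list of facts in plain Python. This is a sketch of the concepts, not of how an OLAP engine is implemented: the facts, the DIMS mapping, and the aggregate and select helpers are assumptions made for the example, with units sold as the only measure.

```python
from collections import defaultdict

# Each fact: (year, month, region, product, units sold). Hypothetical data.
facts = [
    (2017, "Jan", "Iowa",    "blue beach ball", 5),
    (2017, "Feb", "Iowa",    "blue beach ball", 3),
    (2017, "Jan", "Florida", "red beach ball",  7),
    (2018, "Jan", "Iowa",    "blue beach ball", 4),
]
DIMS = {"year": 0, "month": 1, "region": 2, "product": 3}

def aggregate(rows, by):
    """Sum the units measure, grouped by the listed dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[DIMS[d]] for d in by)] += row[4]
    return dict(totals)

def select(rows, **members):
    """Keep rows whose dimensions match the given members."""
    return [r for r in rows if all(r[DIMS[d]] == m for d, m in members.items())]

# Drill-down: view the time dimension at the finer year and month level.
drill_down = aggregate(facts, ["year", "month"])
# Roll-up: summarize the same data at the coarser year level.
roll_up = aggregate(facts, ["year"])
# Slice: fix a single member of one dimension ("sales in 2017").
slice_2017 = select(facts, year=2017)
# Dice: restrict several dimensions at once.
dice = select(facts, year=2017, region="Iowa", product="blue beach ball")
# Pivot: same totals, with the region and year axes swapped.
by_region_year = aggregate(facts, ["region", "year"])
pivoted = {(year, region): units for (region, year), units in by_region_year.items()}

print(roll_up)               # {(2017,): 15, (2018,): 4}
print(sum(r[4] for r in dice))  # 8
```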
Types of OLAP systems
OLAP (online analytical processing) systems typically fall into one of three types:
Multidimensional OLAP (MOLAP) is OLAP that indexes directly into a multidimensional database.
Relational OLAP (ROLAP) is OLAP that performs dynamic multidimensional analysis of data stored in
a relational database.
Hybrid OLAP (HOLAP) is a combination of ROLAP and MOLAP. HOLAP was developed to combine the
greater data capacity of ROLAP with the superior processing capability of MOLAP.
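The storage difference between the first two types can be sketched as follows. The nested list only imitates a multidimensional array and the sqlite3 table only imitates a relational store. The member lists and sales figures are hypothetical.

```python
import sqlite3

products = ["lollipop", "truffle", "toffee"]
regions = ["East", "West"]
months = ["Jul", "Sep"]

# Only these combinations actually had sales (hypothetical figures).
sales = {("lollipop", "East", "Jul"): 30, ("truffle", "West", "Sep"): 12}

# MOLAP-style storage: a dense array with one cell per combination of members.
molap = [[[0 for _ in months] for _ in regions] for _ in products]
for (product, region, month), units in sales.items():
    molap[products.index(product)][regions.index(region)][months.index(month)] = units

total_cells = len(products) * len(regions) * len(months)
empty_cells = total_cells - len(sales)
print(total_cells, empty_cells)  # 12 10

# A cell is addressed directly by its position on each axis.
molap_lookup = molap[products.index("truffle")][regions.index("West")][months.index("Sep")]

# ROLAP-style storage: only existing facts are kept, as rows in a relational table.
rolap = sqlite3.connect(":memory:")
rolap.execute("CREATE TABLE sales (product TEXT, region TEXT, month TEXT, units INTEGER)")
rolap.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(*key, units) for key, units in sales.items()],
)
rolap_lookup = rolap.execute(
    "SELECT units FROM sales WHERE product = ? AND region = ? AND month = ?",
    ("truffle", "West", "Sep"),
).fetchone()[0]
print(molap_lookup, rolap_lookup)  # 12 12
```

Ten of the twelve array cells are empty here, and the share of empty cells tends to grow with every added dimension, while the relational table only grows with the number of facts.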
Uses of OLAP
OLAP can be used for data mining or the discovery of previously undiscerned relationships between data
items. An OLAP database does not need to be as large as a data warehouse, since not all transactional
data is needed for trend analysis. Using Open Database Connectivity (ODBC), data can be imported from
existing relational databases to create a multidimensional database for OLAP.
OLAP products include IBM Cognos, Oracle OLAP and Oracle Essbase. OLAP features are also included in
tools such as Microsoft Excel and Microsoft SQL Server's Analysis Services. OLAP products are typically
designed for multiple-user environments, with the cost of the software based on the number of users.