2. Introduction
Dramatic advances in data capture, processing power, data
transmission, and storage capabilities are enabling organisations to
integrate their various databases into data warehouses.
Data warehousing is defined as a process of centralised data
management and retrieval.
Data warehousing represents an ideal vision of maintaining a central
repository of all organisational data. Centralisation of data is needed
to maximize user access and analysis.
As knowledge becomes the new currency of organisations, information
is now viewed in an entirely new way: as a strategic source of
opportunity.
With this new focus on information delivery, government and industry
are looking to data warehousing as a valuable construct to convert
data into information.
3. Data Mining & Data Warehouse
Data mining is a broad technology that can potentially benefit any
functional area in a business where there is a major need or
opportunity for improved performance and where data analysis can
impact that improvement.
Part of the power of data mining is that it not only solves difficult
business problems, but it does so in ways that are repeatable.
The data mining process involves developing models that can be used
to solve the business problem at hand. Since they are models, they
can be reused on new data.
As the data in the warehouse is refreshed, the models can be re-run
on new data and new results obtained.
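The reuse of a fitted model on refreshed warehouse data can be sketched as follows. This is an illustrative toy only: the "model", field names, and data are all invented, standing in for whatever mining technique is actually used.

```python
# Illustrative sketch: a "model" fitted on one warehouse snapshot and
# re-run unchanged on refreshed data. All names and data are invented.

def fit_churn_model(rows):
    """Learn a simple threshold: customers spending below half the
    average spend are flagged as at-risk."""
    avg = sum(r["spend"] for r in rows) / len(rows)
    threshold = avg / 2
    def model(row):
        return row["spend"] < threshold
    return model

# First snapshot: fit the model once.
q1 = [{"id": 1, "spend": 100}, {"id": 2, "spend": 300}, {"id": 3, "spend": 20}]
model = fit_churn_model(q1)
at_risk_q1 = [r["id"] for r in q1 if model(r)]   # -> [3]

# Warehouse refreshed: the same model is simply re-run on the new data.
q2 = [{"id": 1, "spend": 40}, {"id": 2, "spend": 310}, {"id": 4, "spend": 500}]
at_risk_q2 = [r["id"] for r in q2 if model(r)]   # -> [1]
```

The point is only that the fitted artefact (here, a closure over a threshold) is decoupled from any one snapshot of the data.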
4. Data Mining: A KDD Process
[Diagram: the KDD process — Databases → Data Cleaning and Data
Integration → Data Warehouse → Selection of Task-relevant Data →
Data Mining → Pattern Evaluation]
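The KDD flow on this slide can be sketched as a chain of functions, one per stage. The function names and toy records below are illustrative, not any real toolkit.

```python
# A minimal sketch of the KDD process as a pipeline of functions.

def clean(records):            # Data cleaning: drop records with missing values
    return [r for r in records if None not in r.values()]

def integrate(*sources):       # Data integration: merge several databases
    merged = []
    for s in sources:
        merged.extend(s)
    return merged

def select(records, field):    # Selection: keep only task-relevant data
    return [r[field] for r in records]

def mine(values):              # Data mining: here, a trivial frequency count
    return {v: values.count(v) for v in set(values)}

db1 = [{"product": "tea", "qty": 2}, {"product": None, "qty": 1}]
db2 = [{"product": "tea", "qty": 5}, {"product": "coffee", "qty": 3}]

patterns = mine(select(clean(integrate(db1, db2)), "product"))
# Pattern evaluation would then rank these patterns by interestingness.
```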
5. Characteristics of a Data Warehouse
A common way of introducing data warehousing is to refer to the
characteristics of a data warehouse as set forth by William Inmon:
Subject Oriented
Integrated
Non-volatile *
Time Variant *
The major characteristics of a data warehouse are:
Organisation
Consistency
Non-volatile *
Time Variant *
Relational
Client/server
Web-based
(Turban, McLean, Wetherbe)
6. Characteristics of a Data Warehouse
Organisation: Data are organised by subject (e.g. by customer, vendor,
product, price level, and region) and contain information relevant for
decision support only.
Consistency: Data in different operational databases may be encoded
differently. In the warehouse they will be coded in a consistent manner.
Relational: Typically the data warehouse uses a relational structure.
Client/server: The data warehouse uses the client/server architecture,
mainly to provide the end user with easy access to its data.
Web-based: Today’s data warehouses are designed to provide an
efficient computing environment for Web-based applications
(Rundensteiner et al., 2000)
CISM01 Intelligent Systems for Management Unit 9
7. Subject Oriented
Data warehouses are designed to help you analyse data.
For example, to learn more about your company's sales data, you can
build a warehouse that concentrates on sales.
Using this warehouse, you can answer questions like "Who was our
best customer for this item last year?" This ability to define a data
warehouse by subject matter, sales in this case, makes the data
warehouse subject oriented.
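The "best customer for this item last year" question can be sketched as a query against a sales-focused warehouse table. This uses an in-memory SQLite database; the table, columns, and figures are invented for illustration.

```python
# Sketch: a subject-oriented (sales) table answering "who was our best
# customer for this item last year?". All names and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales (
    customer TEXT, item TEXT, year INTEGER, amount REAL)""")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("Acme", "widget", 2023, 500.0),
     ("Beta", "widget", 2023, 900.0),
     ("Beta", "widget", 2022, 100.0),
     ("Acme", "gadget", 2023, 999.0)])

best = con.execute("""
    SELECT customer, SUM(amount) AS total
    FROM sales
    WHERE item = 'widget' AND year = 2023
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 1""").fetchone()   # -> ('Beta', 900.0)
```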
8. Integrated
Integration is closely related to subject orientation.
Data warehouses must put data from disparate sources into a
consistent format.
They must resolve such problems as naming conflicts and
inconsistencies among units of measure.
When they achieve this, they are said to be integrated.
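The naming-conflict and unit-of-measure problems can be sketched as per-source mapping functions applied during the warehouse load. The two "source systems", their field names, and the units are invented for illustration.

```python
# Sketch of integration: two sources name the customer key differently
# and record weight in different units; the load normalises both.

def from_orders_system(rec):
    # Source A uses "cust_id" and records weight in pounds.
    return {"customer_id": rec["cust_id"],
            "weight_kg": round(rec["weight_lb"] * 0.453592, 2)}

def from_shipping_system(rec):
    # Source B uses "customerNo" and records weight in kilograms.
    return {"customer_id": rec["customerNo"],
            "weight_kg": rec["weight_kg"]}

a = from_orders_system({"cust_id": "C1", "weight_lb": 10.0})
b = from_shipping_system({"customerNo": "C2", "weight_kg": 3.5})
warehouse_rows = [a, b]   # one consistent name and unit for every row
```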
9. Nonvolatile
Non-volatile means that, once entered into the warehouse, data are not
changed/updated.
This is logical because the purpose of a warehouse is to enable you to
analyse what has occurred.
10. Time Variant
In order to discover trends in business, analysts need large amounts of
data.
The data are kept for many years so they can be used for trends,
forecasting, and comparisons over time.
This is very much in contrast to online transaction processing (OLTP)
systems, where performance requirements demand that historical data
be moved to an archive.
A data warehouse's focus on change over time is what is
meant by the term time variant.
11. Data Marts
The high cost of data warehouses confines their use to large
companies.
An alternative used by many other firms is the creation of a lower-cost,
scaled-down version of a data warehouse called a data mart.
A data mart is a small warehouse designed for a strategic business
unit (SBU) or a department.
The advantages of data marts include:
Low cost
Significantly shorter lead time for implementation
Local rather than central control, conferring power on the using group
12. Data Marts
From a statistical viewpoint, a data mart should be organised according
to two principles:
The statistical units, the elements in the reference population that are
considered important for the aims of the analysis (e.g. the supply
companies, the customers, the people who visit the site).
The statistical variables, the important characteristics, measured for
each statistical unit (e.g. the amounts customers buy, the payment
methods they use, the socio-demographic profile of each customer).
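These two principles amount to one row per statistical unit and one column per statistical variable. A sketch, with invented customers and transactions:

```python
# Sketch of a data mart organised by statistical units (customers) and
# statistical variables (amount bought, payment methods). Data invented.

transactions = [
    {"customer": "C1", "amount": 120.0, "method": "card"},
    {"customer": "C1", "amount": 30.0,  "method": "cash"},
    {"customer": "C2", "amount": 75.0,  "method": "card"},
]

mart = {}
for t in transactions:
    row = mart.setdefault(t["customer"],
                          {"total_amount": 0.0, "methods": set()})
    row["total_amount"] += t["amount"]   # variable 1: amount bought
    row["methods"].add(t["method"])      # variable 2: payment methods used
```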
13. Operational vs. Informational

               Operational Data     Data Warehouse
Application    OLTP                 OLAP
Use            Precise queries      Ad hoc
Temporal       Snapshot             Historical
Modification   Dynamic              Static
Orientation    Application          Business
Data           Operational values   Integrated
Size           Gigabytes            Terabytes
Level          Detailed             Summarised
Access         Often                Less often
Response       Few seconds          Minutes
Data schema    Relational           Star/Snowflake
14. OLTP vs. OLAP

                    OLTP                            OLAP
users               clerk, IT professional          knowledge worker
function            day-to-day operations           decision support
DB design           application-oriented            subject-oriented
data                current, up-to-date;            historical; summarised,
                    detailed, flat relational;      multidimensional;
                    isolated                        integrated, consolidated
usage               repetitive                      ad hoc
access              read/write; index/hash          lots of scans
                    on primary key
unit of work        short, simple transaction       complex query
# records accessed  tens                            millions
# users             thousands                       hundreds
DB size             100 MB–GB                       100 GB–TB
metric              transaction throughput          query throughput, response
15. Contrasting OLTP & Data Warehousing Environments

                           OLTP                  Data Warehouse
Data structures            Complex               Multidimensional
Indexes                    Few                   Many
Joins                      Many                  Some
Duplicated data            Normalised DBMS       Denormalised DBMS
Derived data & aggregates  Rare                  Common
16. Contrasting OLTP & Data Warehousing Environments
One major difference between the types of system is that data
warehouses are not usually in third normal form (3NF), a type of data
normalisation common in OLTP environments.
Data warehouses and OLTP systems have very different requirements.
Here are some examples of differences between typical data
warehouses and OLTP systems:
Workload
Data warehouses are designed to accommodate ad hoc queries. You might
not know the workload of your data warehouse in advance, so a data
warehouse should be optimised to perform well for a wide variety of
possible query operations.
OLTP systems support only predefined operations. Your applications might
be specifically tuned or designed to support only these operations.
17. Contrasting OLTP & Data Warehousing Environments
Data modifications
A data warehouse is updated on a regular basis using bulk data
modification techniques. The end users of a data warehouse do not
directly update the data warehouse.
In OLTP systems, end users routinely issue individual data modification
statements to the database. The OLTP database is always up to date, and
reflects the current state of each business transaction.
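The two modification styles can be sketched side by side with SQLite: a periodic bulk load for the warehouse, a single-row statement for OLTP. Table names and data are invented.

```python
# Sketch: bulk warehouse refresh vs. an individual OLTP update.
import sqlite3

con = sqlite3.connect(":memory:")

# Warehouse side: end users never write; a batch job bulk-loads new data.
con.execute("CREATE TABLE fact_sales (customer TEXT, amount REAL)")
nightly_extract = [("Acme", 500.0), ("Beta", 900.0), ("Gamma", 120.0)]
con.executemany("INSERT INTO fact_sales VALUES (?, ?)", nightly_extract)

# OLTP side: an end user changes exactly one current record.
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
con.execute("INSERT INTO orders VALUES (1, 'open')")
con.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")

loaded = con.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]  # 3
status = con.execute(
    "SELECT status FROM orders WHERE id = 1").fetchone()[0]            # 'shipped'
```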
18. Contrasting OLTP & Data Warehousing Environments
Schema design
Data warehouses often use denormalised or partially denormalised
schemas (such as a star schema) to optimise query performance.
OLTP systems often use fully normalized schemas to optimise
update/insert/delete performance, and to guarantee data consistency.
Typical operations
A typical data warehouse query scans thousands or millions of rows. For
example, "Find the total sales for all customers last month."
A typical OLTP operation accesses only a handful of records. For
example, "Retrieve the current order for this customer."
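A minimal star schema, and the two typical operations just described, can be sketched in SQLite. The fact and dimension tables, column names, and figures are all invented for illustration.

```python
# Sketch: a tiny star schema (one fact table keyed to dimensions),
# queried warehouse-style (scan + aggregate) and OLTP-style (lookup).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales   (customer_id INTEGER, date_id INTEGER, amount REAL);
""")
con.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                [(1, "Acme"), (2, "Beta")])
con.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(10, "2024-05"), (11, "2024-06")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 10, 100.0), (2, 10, 250.0), (1, 11, 40.0)])

# Warehouse-style query: scan the fact table, total sales for one month.
total = con.execute("""
    SELECT SUM(f.amount) FROM fact_sales f
    JOIN dim_date d ON f.date_id = d.date_id
    WHERE d.month = '2024-05'""").fetchone()[0]   # -> 350.0

# OLTP-style operation: fetch one record for one customer.
row = con.execute("""
    SELECT f.amount FROM fact_sales f
    JOIN dim_customer c ON f.customer_id = c.customer_id
    WHERE c.name = 'Beta' LIMIT 1""").fetchone()  # -> (250.0,)
```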
19. Contrasting OLTP & Data Warehousing Environments
Historical data
Data warehouses usually store many months or years of data. This is to
support historical analysis.
OLTP systems usually store data from only a few weeks or months; they
retain historical data only as needed to successfully meet the
requirements of the current transaction.
20. Levels of the Data Warehouse Architecture
Organisationally structured: to meet the informational requirements
of the entire organisation.
Departmentally structured: to meet the focused informational
requirements of the distinct group identified by a specific business
function.
Individually structured: to meet an even more focused set of
informational requirements as defined by a specific management
function.
21. Metadata
Metadata describes types of information that are stored in the database.
For a data warehouse, metadata provides discipline, since changes to
the warehouse must be reflected in the metadata to be communicated to
users.
A good metadata system helps ensure the success of a data warehouse
by making users more aware of, and comfortable with, the contents. It
provides valuable assistance in understanding the data.
The metadata repository is an often overlooked component of the data
warehousing environment.
(from Berry & Linoff, 2004)
Without metadata, the data warehouse and its associated components
in the architected environment are merely disjoint components
working independently and with separate goals.
23. A General Architecture for Data Warehousing
[Diagram: three-tier data warehouse architecture. Information sources
(operational DBs and other sources) feed, via Extract, Transform, Load
and Refresh processes, the data warehouse server and data marts
(Tier 1: data storage); OLAP servers, e.g. ROLAP or MOLAP (Tier 2:
OLAP engine); and client analysis tools for query, reporting, analysis
and data mining (Tier 3: front-end tools).]
24. A General Architecture for Data Warehousing
The major components of data warehouse architecture are:
Source systems are where the data comes from.
Extraction, transformation, and load (ETL) move data between different
data stores.
The central repository is the main store for the data warehouse.
The metadata repository describes what is available and where.
Data marts provide fast, specialised access for end users and
applications.
Operational feedback integrates decision support back into the
operational systems.
End-users are the reason for developing the warehouse in the first place.
MOLAP: Multi-Dimensional On-Line Analytical Processing
ROLAP: Relational On-Line Analytical Processing
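The ETL component listed above can be sketched end to end: extract rows from two "source systems", transform them into one consistent shape, and load them into the central repository. Every name and format here is invented for illustration.

```python
# Toy end-to-end ETL sketch: extract -> transform -> load into SQLite.
import sqlite3

def extract():
    crm = [{"name": "Acme", "revenue": "500"}]        # revenue as a string
    erp = [{"customer": "Beta", "rev_eur": 900.0}]    # different field names
    return crm, erp

def transform(crm, erp):
    # Normalise both sources to (customer, revenue-as-float) tuples.
    rows = [(r["name"], float(r["revenue"])) for r in crm]
    rows += [(r["customer"], r["rev_eur"]) for r in erp]
    return rows

def load(rows):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE central_repo (customer TEXT, revenue REAL)")
    con.executemany("INSERT INTO central_repo VALUES (?, ?)", rows)
    return con

con = load(transform(*extract()))
total = con.execute(
    "SELECT SUM(revenue) FROM central_repo").fetchone()[0]   # -> 1400.0
```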
25. Cloud Database
A cloud database is a database that typically runs on a cloud
computing platform, with access to the database provided as a
service.
26. Methods to Run a Database in a Cloud
• There are two primary methods to run a database in a cloud:
• Virtual machine image
• Cloud platforms allow users to purchase virtual-machine instances for a
limited time, and one can run a database on such virtual machines. Users
can either upload their own machine image with a database installed on it,
or use ready-made machine images that already include an optimized
installation of a database.
• Database-as-a-service (DBaaS)
• With a database as a service model, application owners do not have to install
and maintain the database themselves. Instead, the database service
provider takes responsibility for installing and maintaining the database, and
application owners are charged according to their usage of the service. This
is a type of software as a service (SaaS).
27. Architecture and common characteristics
• Most database services offer web-based consoles, which the end user can
use to provision and configure database instances.
• Database services consist of a database-manager component, which controls
the underlying database instances using a service API. The service API is
exposed to the end user, and permits users to perform maintenance and
scaling operations on their database instances.
• The underlying software stack typically includes the operating system,
the database, and third-party software used to manage the database. The
service provider is responsible for installing, patching and updating the
underlying software stack and ensuring the overall health and performance
of the database.
• Scalability features differ between vendors – some offer auto-scaling, others
enable the user to scale up using an API, but do not scale automatically.
• There is typically a commitment for a certain level of high availability (e.g.
99.9% or 99.99%). This is achieved by replicating data and failing instances
over to other database instances.
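The availability figures quoted above translate into concrete downtime budgets. A quick worked calculation (assuming a non-leap year of 8,760 hours):

```python
# Allowed downtime per year for the availability levels mentioned above.

HOURS_PER_YEAR = 365 * 24            # 8760 hours in a non-leap year

def downtime_hours(availability):
    return HOURS_PER_YEAR * (1 - availability)

three_nines = downtime_hours(0.999)    # ~8.76 hours of downtime per year
four_nines  = downtime_hours(0.9999)   # ~0.88 hours (~53 minutes) per year
```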
28. Data model
• Advanced queries expressed in SQL work well with the strict relationships
that are imposed on information by relational databases. However, relational
database technology was not initially designed or developed for use over
distributed systems. This issue has been addressed with the addition of
clustering enhancements to the relational databases, although some basic
tasks require complex and expensive protocols, such as with data
synchronization.
• Modern relational databases have shown poor performance on data-
intensive systems; the idea of NoSQL has therefore been adopted within
database management systems for cloud-based systems. Within NoSQL-based
storage, there are no requirements for fixed table schemas,
and the use of join operations is avoided. "The NoSQL databases have
proven to provide efficient horizontal scalability, good performance, and
ease of assembly into cloud applications." Data models relying on simplified
relay algorithms have also been employed in data-intensive cloud mapping
applications unique to virtual frameworks.
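Two of the NoSQL properties named here, schemaless records and easy horizontal scaling, can be sketched with a toy key-value store that hashes each key to a node. This is a deliberately simplified illustration (real systems use consistent hashing, replication, and networked nodes); all names are invented.

```python
# Toy key-value store: no fixed schema, records sharded across "nodes"
# by hashing the key - the basis of horizontal scalability.
import hashlib

NODES = ["node-0", "node-1", "node-2"]

def node_for(key):
    # Deterministic hash of the key picks the shard.
    digest = hashlib.sha256(key.encode()).digest()
    return NODES[digest[0] % len(NODES)]

store = {n: {} for n in NODES}       # one dict per "node"

def put(key, value):                 # no schema, no joins
    store[node_for(key)][key] = value

def get(key):
    return store[node_for(key)].get(key)

put("user:1", {"name": "Ada", "email": "ada@example.com"})
put("user:2", {"name": "Alan"})      # a record with different fields is fine
```

Adding capacity means adding a node and redistributing keys, which is why such stores scale out more naturally than a single relational server scales up.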
29. Differences between relational and non-relational (NoSQL) cloud
databases
• SQL databases are one type of database which can run in the cloud,
either in a virtual machine or as a service, depending on the vendor.
While SQL databases are easily vertically scalable, horizontal
scalability poses a challenge that cloud database services based on SQL
have started to address. Examples include:
• EDB Postgres Advanced Server
• IBM Db2
• Ingres (database)
• MariaDB
• MySQL
• NuoDB
• Oracle Database
• PostgreSQL
• SAP HANA
• YugabyteDB
30. NoSQL databases
• NoSQL databases are another type of database which can run in the
cloud. They are built to service heavy read/write loads and can scale up
and down easily, and are therefore more natively suited to running in the
cloud. However, most contemporary applications are built around an SQL
data model, so working with NoSQL databases often requires a complete
rewrite of application code.
• Some SQL databases have developed NoSQL capabilities including JSON,
binary JSON (e.g. BSON or similar variants), and key-value store data types.
• A multi-model database with relational and non-relational capabilities
provides a standard SQL interface to users and applications and thus
facilitates the usage of such databases for contemporary applications built
around an SQL data model. Native multi-model databases support multiple
data models with one core and a unified query language to access all data
models.
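The idea of a relational database carrying NoSQL-style documents can be sketched by storing JSON text in an ordinary SQLite column and decoding it in the application layer. The table and documents are invented; real multi-model databases expose richer JSON operators than this plain-text approach.

```python
# Sketch: NoSQL-style JSON documents inside a relational table.
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

docs = [{"type": "order", "items": ["tea", "mug"]},
        {"type": "customer", "name": "Ada", "vip": True}]
con.executemany("INSERT INTO docs (body) VALUES (?)",
                [(json.dumps(d),) for d in docs])

# Each row is decoded back to a document; rows need not share a shape.
decoded = [json.loads(b) for (b,) in
           con.execute("SELECT body FROM docs ORDER BY id")]
```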
31. Examples
• Apache Cassandra on Amazon EC2 or Google Compute Engine
• ArangoDB on Amazon EC2, Google Compute or Microsoft Azure
• Clusterpoint Database Virtual Box VM
• CouchDB on Amazon EC2 or Google Cloud Platform
• EDB Postgres Advanced Server
• Hadoop on Amazon EC2, Google Cloud Platform, or Rackspace
• MarkLogic on Amazon EC2 or Google Cloud Platform
• MongoDB on Amazon EC2, Google Compute Engine, Microsoft Azure, or
Rackspace
• Neo4J on Amazon EC2 or Microsoft Azure
• ScyllaDB on Amazon EC2 or Google Cloud Platform
• YugabyteDB