This document provides an introduction to data warehousing, covering its history from early use of punched cards and magnetic tape to modern architectures. It describes how data warehousing has evolved from simple extracts of operational data stored on disks to fully architected systems using techniques like extract-transform-load (ETL) to integrate data from multiple sources into a centralized data warehouse for analysis and reporting. It also compares the different paradigms of online transaction processing (OLTP) versus online analytical processing (OLAP) and discusses emerging technologies impacting data warehousing like massively parallel processing and in-memory databases.
2. Contents
• History
• OLTP vs. OLAP
• Paradigm Shift
• Architecture
• Emerging Technologies
• Questions
3. History: Hollerith Cards
Once upon a time…
• Reporting was done with data stored on Hollerith cards.
– A card contained one record.
– Maximum length of the record was 80 characters.
– A data file consisted of a stack of cards.
• Data was “loaded” each time a report was run.
• Data had no “modeling” per se. There were just sets of records.
• Programs in languages such as COBOL, RPG, or BASIC would (see the sketch after this slide):
– Read a stack of data cards into memory
– Loop through records and perform a series of steps on each (e.g. increment a counter or add an amount to a total)
– Send a formatted report to a printer
• It was difficult to report from multiple record types.
• Changes to data were implemented by simply adding, removing or replacing cards.
4. History: Hollerith Cards
FACTOIDS
• Card type: IBM 80-column punched card
• A/K/A: “Punched Card”, “IBM Card”
• Size: 7 3⁄8 by 3 1⁄4 inches
• Thickness: .007 inches (143 cards per inch)
• Capacity: 80 columns with 12 punch locations each
5. History: Magnetic Tape
• Cards were eventually replaced by magnetic tape.
• Tapes made data easier to load and storage more efficient.
• Records were stored sequentially, so individual records could not be quickly accessed.
• Data processing was still very similar to that of cards—load a file into computer memory
and loop through the records to process.
• Updating data files was still difficult.
6. History: Disk Storage
• The arrival of disk storage revolutionized data storage and access.
• It was now possible to have a home base for data: a database.
• Data was always available: online.
• Direct access replaced sequential access so data could be accessed more quickly.
• Data stored on disk required new methods for adding, deleting, or updating records. This
type of processing became known as Online Transaction Processing (OLTP).
• Reporting from a database became known as Online Analytical Processing (OLAP).
7. History: Disk Storage
FACTOIDS
• Storage Device: IBM 350 disk storage unit
• Released: 1956
• Capacity: 5 million 6-bit characters (3.75 megabytes).
• Disk spin speed: 1200 RPM.
• Data transfer rate: 8,800 characters per second.
8. History: Relational Model
• In the late 1960s, E.F. Codd developed the relational model, which he published in 1970.
• Relational modeling was based on a branch of mathematics called set theory and added
rigor to the organization and management of data.
• Relational modeling also improved OLTP (inserting, updating, and deleting data) by
making these processes more efficient and reducing data anomalies.
• The relational model also introduced primary keys, foreign keys, referential integrity,
relational algebra, and a number of other concepts used in modern database systems (see
the sketch at the end of this slide).
• It soon became apparent that different data models facilitated OLAP vs. OLTP.
• Relational data was often denormalized (redundancy deliberately reintroduced) to support OLAP.
• Structured Query Language (SQL) was one of the first languages created to support
relational database operations, both OLAP and OLTP.
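As an illustration of these concepts, here is a minimal sketch in standard SQL (hypothetical tables, not from the original slides) showing a primary key, a foreign key, and the referential integrity between them:

  CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,      -- primary key: uniquely identifies each row
    customer_name VARCHAR(100) NOT NULL
  );

  CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
      REFERENCES customer (customer_id),    -- foreign key: enforces referential integrity
    order_date  DATE
  );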
9. History: Relational Model
FACTOIDS
• Although E.F. Codd was employed by IBM when he created the relational model and IBM
originated the SQL language (then “SEQUEL”), IBM was not the first vendor to produce a
relational database or to use SQL.
• The first commercial implementation of a relational database and SQL came from Relational
Software, Inc., which is now Oracle Corporation.
• SQL has been standardized by the American National Standards Institute (ANSI) and the
International Organization for Standardization (ISO).
10. History: Extracts 1
• In early relational databases, data was extracted from OLTP systems into denormalized
extracts for reporting.
[Diagram: an OLTP source feeds a denormalized extract, which is used for an OLAP report.]
11. History: Extracts 2
• And more extracts...
[Diagram: multiple OLTP sources each feed their own extracts and reports, and further extracts are taken from existing extracts.]
15. History: Naturally Evolving Systems 2
• Naturally evolving systems resulted in
– Poor organization of data
– Extremely complicated processing requirements
– Inconsistencies in extract refresh status
– Inconsistent report results.
• This created a need for architected systems for analysis and reporting.
• Instead of multiple extract files, a single source of truth was needed for each data source.
16. History: Architected Systems
• Developers began to design architected systems for OLAP data.
• In the 1980s and 1990s, organizations began to integrate data from multiple sources such
as accounts receivable, accounts payable, HR, inventory, and so on. These integrated OLAP
databases became known as Enterprise Data Warehouses (EDWs).
• Over time methods and techniques for extracting and integrating data into architected
systems began to evolve and standardize.
• The term data warehousing is now used to refer to the commonly used architectures,
methods, and techniques for transforming and integrating data to be used for analysis.
17. History: An Architected Data Warehouse
Example of an Architected Data Warehouse
[Diagram: multiple OLTP sources feed a staging area and an ODS; staging loads history; history populates data marts (DM) and data sets, which serve reports.]
19. History: Compare Architected Data Warehouse
Compare: Architected Data Warehouse
[Diagram: the architected data warehouse above, shown again for comparison with the naturally evolving extract systems.]
20. History: Inmon
• In the early 1990s, W.H. Inmon published Building the Data Warehouse (ISBN-10: 0471141615).
• Inmon put together the quickly accumulating knowledge of data warehousing and
popularized most of the terminology we use today.
• Data in a data warehouse is extracted from another data source, transformed to make it
suitable for analysis, and loaded into the data warehouse. This process is often referred to
as Extract, Transform and Load (ETL); a minimal sketch follows this list.
• Since data from multiple sources was integrated in most data warehouses, Inmon also
described the process as Transformation and Integration (T&I).
• Data in a data warehouse is stored in history tables, which are modeled for fast query
performance.
• The history tables are the source of truth.
• Data from history is usually extracted into data marts which are used for analysis and
reporting.
• Separate data marts are created for each application. There is often redundant data
across data marts.
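A minimal ETL sketch in SQL, assuming hypothetical staging and history tables (the names and the currency-conversion transform are illustrative only):

  -- Extract from staging, Transform the values, and Load the result into history
  INSERT INTO sales_history (sale_id, sale_date, store_code, amount_usd)
  SELECT s.sale_id,
         s.sale_date,
         UPPER(TRIM(s.store_code)),     -- transform: standardize code values
         s.amount * fx.usd_rate         -- transform: convert to a common currency
  FROM   stg_sales s
  JOIN   stg_fx_rates fx ON fx.currency_code = s.currency_code;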
21. History: Inmon
FACTOIDS
• W.H. Inmon coined the term data warehouse.
• W.H. Inmon is recognized by many as the father of data warehousing.
• W.H. Inmon created the first and most commonly accepted definition of a data
warehouse: a subject-oriented, nonvolatile, integrated, time-variant collection of data in
support of management's decisions.
• Other firsts of W.H. Inmon
– Wrote the first book on data warehousing
– Wrote the first magazine column on data warehousing
– Taught the first classes on data warehousing
22. History: Kimball 1
• Also in the 1990s, Ralph Kimball published The Data Warehouse Toolkit (ISBN-10: 0471153370),
which popularized dimensional modeling.
• Dimensional modeling is based on the cube concept which is a multi-dimensional view of
data.
[Figure: a cube used to represent multi-dimensional data.]
• The cube metaphor can only illustrate three dimensions; a dimensional model can have any
number of dimensions.
23. History: Kimball 2
• Kimball implemented cubes as star schemas, which support querying data in multiple
dimensions.
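A minimal star schema sketch in SQL (hypothetical tables, chosen only to illustrate the shape): dimension tables surround a central fact table that references them.

  CREATE TABLE dim_date (
    date_key       INTEGER PRIMARY KEY,
    calendar_date  DATE,
    calendar_year  INTEGER,
    calendar_month INTEGER
  );

  CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)
  );

  CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,
    store_name VARCHAR(100),
    region     VARCHAR(50)
  );

  -- The fact table sits at the center of the star and references each dimension.
  CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date (date_key),
    product_key INTEGER REFERENCES dim_product (product_key),
    store_key   INTEGER REFERENCES dim_store (store_key),
    quantity    INTEGER,
    amount      DECIMAL(12,2)
  );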
24. History: Kimball 3
• Kimball’s books do not discuss the relational model in depth, but his dimensional model
can be explained in relational terms.
• A star schema is a useful way to store data for quickly slicing and dicing data on multiple
dimensions (see the query sketch after this list).
• Dimensional modeling and star schema are frequently misunderstood and improperly
implemented. Queries against incorrectly designed tables in a star schema can skew
report results.
• The term OLAP has come to be used specifically to refer to dimensional modeling in many
marketing materials.
• Star schemas are implemented as data marts so that they can be queried by users and
applications. However, data marts are not necessarily star schemas.
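Using the hypothetical star schema sketched earlier, a typical slice-and-dice query joins the fact table to the dimensions it needs, filters on one dimension, and aggregates along others:

  SELECT d.calendar_year,
         p.category,
         SUM(f.amount) AS total_sales
  FROM   fact_sales f
  JOIN   dim_date    d ON d.date_key    = f.date_key
  JOIN   dim_product p ON p.product_key = f.product_key
  JOIN   dim_store   s ON s.store_key   = f.store_key
  WHERE  s.region = 'West'                 -- slice: restrict one dimension
  GROUP  BY d.calendar_year, p.category;   -- dice: aggregate along the others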
25. History: Kimball 4
FACTOIDS
• Ralph Kimball earned a Ph.D. in electrical engineering from Stanford University.
• Kimball worked at the Xerox Palo Alto Research Center (PARC). PARC is where laser
printing, Ethernet, object-oriented programming, and graphical user interfaces were
invented.
• Kimball was a principal designer of the Xerox Star Workstation, which was the first
commercial personal computer to use windows, icons, and a mouse.
26. OLTP vs. OLAP
Operational Data/OLTP vs. Data Warehouse/OLAP
• Schema: OLTP data is normalized (3NF). OLAP data may be normalized, denormalized, use
dimensional models, application-specific data sets, or other designs.
• Volatility: OLTP data is constantly updated. OLAP data represents a state at a point in time;
existing data does not change, but new data can be added to history.
• Typical operations: OLTP performs selects on small sets of records and inserts, updates,
and deletes of individual records. OLAP performs selects, sorts, groupings, and aggregations
over large numbers of records, plus inserts of thousands or millions of records.
• Logging: in OLTP, all transactions are logged. In OLAP, inserts may not be logged at the
record level, and there are normally no updates or deletes.
• Indexing: OLTP uses B-tree indexes for performance. OLAP uses partitioning and bitmap
indexes.
• Development: OLTP follows a traditional development life cycle. OLAP development is
heuristic and agile.
• Data origin: OLTP data is designed for its application. OLAP data is taken from some other
application.
• Retention: in OLTP, the date range of records is limited and old transactions are archived.
OLAP history tables can span many years.
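The contrast is easiest to see in two representative statements (hypothetical tables):

  -- Typical OLTP operation: touch one record, fully logged
  UPDATE account
  SET    balance = balance - 100
  WHERE  account_id = 42;

  -- Typical OLAP operation: scan and aggregate millions of history records
  SELECT region, SUM(amount) AS total_amount
  FROM   sales_history
  GROUP  BY region;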
27. Paradigm Shift: For Management
For Management
• Traditional development life cycle doesn’t work well when building a data warehouse. There
is a discovery process. Agile development works better.
• OLTP data was designed for a given purpose, but OLAP is created from data that was
designed for some other purpose—not reporting. It is important to evaluate data content
before designing applications.
• OLAP data may not be as complete or as precise as the application requires.
• Data integrated from different sources may be inconsistent.
– Different code values
– Different columns
– Different meaning of column names
• OLAP data tends to be much larger, requiring more resources.
• Storage, storage, storage…
28. Paradigm Shift: For DBAs
For DBAs
• Different system configurations (in Oracle, different initialization parameters)
• Transaction logging may not be used, and methods for recovery from failure are different.
• Different tuning requirements:
– Selects are high cardinality (large percentage of rows)
– Massive sorting, grouping and aggregation
– DML operations can involve thousands or millions of records.
• Need much more temporary space for caching aggregations, sorts and temporary tables.
• Need different backup strategies. Backup frequency is based on ETL scheduling instead of
transaction volume.
• May be required to add new partitions and archive old partitions in history tables (see the
sketch after this list).
• Storage, storage, storage…
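For example, maintaining a rolling window on a range-partitioned history table might look like this in Oracle syntax (table and partition names are hypothetical):

  -- Add a partition for the newest quarter of data
  ALTER TABLE sales_history
    ADD PARTITION p2015q1 VALUES LESS THAN (DATE '2015-04-01');

  -- Archive the oldest quarter, then drop its partition
  ALTER TABLE sales_history DROP PARTITION p2010q1;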
29. Paradigm Shift: For Architects & Developers
For Architects and Developers
• Different logical modeling and schema design.
• Use indexes differently (e.g. bitmap rather than b-tree; see the sketch after this list)
• Extensive use of partitioning for history and other large tables
• Different tuning requirements
– Selects are high cardinality (large percentage of rows)
– Lots of sorting, grouping and aggregation
– DML operations can involve thousands or millions of records.
• ETL processes are different from typical DML processes
– Use different coding techniques
– Use packages, functions, and stored procedures but rarely use triggers or constraints
– Many steps to a process
– Integrate data from multiple sources
• Iterative and incremental development process (agile development)
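As an example of the indexing difference, a bitmap index suits low-cardinality columns in read-mostly warehouse tables, where a B-tree index would serve the OLTP side (Oracle syntax; all names are hypothetical):

  -- Warehouse side: bitmap index, compact and fast for low-cardinality columns
  CREATE BITMAP INDEX ix_sales_region ON sales_history (region_code);

  -- OLTP side: the conventional B-tree equivalent
  CREATE INDEX ix_account_customer ON account (customer_id);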
30. Paradigm Shift: For Analysts and Data Users
For Analysts and Data Users—All Good News
• A custom schema (data mart) can be created for each application per the user requirements.
• Data marts can be permanent, temporary, generalized or project-specific.
• New data marts can be created quickly—typically in days instead of weeks or months.
• Data marts can easily be refreshed when new data is added to the data warehouse. Data
mart refreshes can be scheduled or on demand.
• In addition to parameterized queries and SQL, there may be additional query tools and
dashboards (e.g. Business Intelligence, Self-Service BI, data visualization, etc.).
• Several years of history can be maintained in a data warehouse—bigger samples.
• There is a consistent single source of truth for any given data set.
31. Architecture: Main Components
Components of a Data Warehouse
[Diagram: operational OLTP systems feed staging and an ODS; ETL loads history and reference (REF) data; a second ETL stage populates data marts (DM) and data sets, which serve reports.]
32. Architecture: Staging and ODS
Staging and ODS
• New data is initially loaded into staging so that it can be processed into the data warehouse.
[Diagram: the staging area and ODS highlighted within the overall data warehouse architecture.]
• Many options are available for getting operational data
from internal or external sources into the staging area:
• SQL*Loader
• imp/exp/impdp/expdp
• Change Data Capture (CDC)
• Replication via materialized views (see the sketch
after this list)
• Third-party ETL tools
• Staging contains a snapshot in time of operational data.
• An Operational Data Store (ODS) is an optional
component that is used for near-real-time reporting.
• Transformation and integration of data in an ODS
is limited.
• Less history (shorter time span) is kept in an ODS.
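A sketch of the materialized-view replication option in Oracle syntax (the table, the database link, and the materialized view log required for fast refresh are all assumptions):

  -- On the source database: a log so the view can be fast-refreshed
  CREATE MATERIALIZED VIEW LOG ON orders;

  -- On the warehouse: a staging copy refreshed on demand over a database link
  CREATE MATERIALIZED VIEW stg_orders
    REFRESH FAST ON DEMAND
    AS SELECT * FROM orders@source_db;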
33. Architecture: History and Reference Data
History and Reference Data
[Diagram: the history and reference (REF) tables highlighted within the overall data warehouse architecture.]
• History includes all source data—no
exclusions or integrity constraints.
• Partitioning is used to:
• manage extremely large tables
• improve the performance of queries
• facilitate a “rolling window” of history
(a partitioning sketch follows this slide).
• Denormalization can be used to reduce the
number of joins when selecting data from
history.
• No surrogate keys—maintain all
original code values in history.
• Reference data should also have
history (e.g. changing ICD9 codes
over time).
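A minimal sketch of a range-partitioned history table in Oracle syntax (hypothetical names; one partition per year supports the rolling window described above):

  CREATE TABLE sales_history (
    sale_id    NUMBER,
    sale_date  DATE NOT NULL,
    store_code VARCHAR2(10),   -- original code values kept; no surrogate keys
    amount     NUMBER(12,2)
  )
  PARTITION BY RANGE (sale_date) (
    PARTITION p2013 VALUES LESS THAN (DATE '2014-01-01'),
    PARTITION p2014 VALUES LESS THAN (DATE '2015-01-01'),
    PARTITION p2015 VALUES LESS THAN (DATE '2016-01-01')
  );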
34. Architecture: Data Marts
Data Marts
• Data marts are built to the requirements of users and applications.
[Diagram: the data marts highlighted within the overall data warehouse architecture.]
• Selection criteria (conditions in the WHERE
clause) are applied when creating data
marts, as sketched after this slide.
• Logical data modeling is applied at data
mart level (e.g. denormalized, star
schemas, analytic data sets, etc.).
• Integrity constraints can be applied at data
mart level.
• Any surrogate keys can be applied at data
mart level (e.g. patient IDs).
• Data marts can be Oracle, SQL Server,
text files, SAS data sets, etc.
• Data marts can be permanent or
temporary for ongoing or one-time
applications.
• Data mart refreshes can be scheduled or
on demand.
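A data mart built with selection criteria applied can be as simple as a CREATE TABLE AS SELECT (hypothetical names and criteria):

  -- A project-specific data mart: western stores, 2015 onward
  CREATE TABLE dm_west_sales AS
  SELECT sale_id, sale_date, store_code, amount
  FROM   sales_history
  WHERE  sale_date >= DATE '2015-01-01'
    AND  store_code IN ('W01', 'W02', 'W03');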
35. Emerging Technologies
Emerging technologies that are having an impact on data warehousing
• Massively Parallel Processing (MPP)
• In-Memory Databases (IMDB)
• Unstructured Databases
• Column-Oriented Databases
• Database Appliances
• Data Access Tools
• Cloud Database Services
36. Emerging Technologies: MPP
Massively Parallel Processing (MPP)
• Data is partitioned over hundreds or even thousands of server nodes.
• A controller node manages query execution.
• A query is passed to all nodes simultaneously.
• Data is retrieved from all nodes and assembled to produce query results.
• MPP systems will automatically partition and distribute data using their own algorithms.
Developers and architects need only be concerned with conventional data modeling and DML
operations.
• MPP systems make sense for OLAP and data warehousing where queries are on very large
numbers of records.
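The slides name no specific MPP product; as one concrete assumption, Amazon Redshift-style syntax lets the designer hint how rows are distributed across nodes and sorted within them:

  CREATE TABLE fact_sales (
    sale_date   DATE,
    customer_id INTEGER,
    amount      DECIMAL(12,2)
  )
  DISTSTYLE KEY
  DISTKEY (customer_id)   -- rows with the same customer land on the same node
  SORTKEY (sale_date);    -- sorted within each node to speed range scans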
37. Emerging Technologies: IMDB
In-Memory Database
• Data is stored in random access memory (RAM) rather than on disk or SSD.
• Memory is accessed much more quickly than disk, and seek delays are eliminated.
• Traditional RDBMS software often uses a memory cache when processing data, but it is
optimized for limited cache with most data stored on disk.
• IMDB software has modified algorithms to be optimized to read data from memory.
• Database replication with failover is typically required because of the volatility of computer
memory.
• Cost of RAM has dropped considerably in recent years making IMDB systems more feasible.
• Microsoft SQL Server has an In-Memory option. Tables must be defined as memory-
optimized to use this feature (see the sketch after this list).
• Oracle has recently announced the upcoming availability of their In-Memory Option.
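A sketch of the SQL Server feature mentioned above, in T-SQL (the table is hypothetical; a memory-optimized filegroup must already exist in the database):

  CREATE TABLE dbo.SessionState (
    SessionId INT NOT NULL
      PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    Payload   VARBINARY(4000)
  )
  WITH (MEMORY_OPTIMIZED = ON,           -- store the table in RAM
        DURABILITY = SCHEMA_AND_DATA);   -- log to disk so data survives restart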
38. Emerging Technologies: Unstructured Databases
Unstructured Databases
• Unstructured databases, sometimes referred to as NoSQL databases, support vast amounts
of text data and extremely fast text searches.
• Unstructured databases utilize massively parallel processing (MPP) and extensive text
indexing.
• Unstructured databases do not fully support relational features such as complex data
modeling, join operations and referential integrity. However, these databases are evolving to
incorporate additional relational capabilities.
• Oracle, Microsoft, and other RDBMS vendors are creating hybrid database systems that
incorporate unstructured data with relational database systems.
• Unstructured databases are useful for very fast text searches on very large amounts of data—
they are generally not useful for complex transaction processing, analyses and informatics.
39. Emerging Technologies: Big Data
FACTOIDS
• Big data became an issue as early as 1880 with the U.S. Census, which took several years to
tabulate with then-existing methods.
• The term information explosion was first used in the Lawton Constitution, a small-town
Oklahoma newspaper, in 1941.
• The term big data was used for the first time in a 1997 article by NASA researchers Michael
Cox and David Ellsworth. The article discussed the inability of then-current computer
systems to handle the increasing amounts of data.
• Google was a pioneer in creating modern hardware and software solutions for big data.
• Parkinson’s Law of Data: “Data expands to fill the space available.”
• 1 exabyte = 1000^6 bytes = 10^18 bytes = 1,000 petabytes = 1 billion gigabytes.
40. Emerging Technologies: Column-Oriented
Column-Oriented Databases
• Data in a typical relational database is organized by row. The row paradigm is used for
physical storage as well as the logical organization of data.
• Column-oriented databases physically organize data by column while still being able to
present data within rows.
• Data is stored on disk in blocks. While row-oriented databases store the contents of a
row in a block, column-oriented databases store the contents of a column in a block.
• Each column has row and table identifiers so that columns can be combined to produce rows
of data in a table.
• Since most queries select a subset of columns (rather than entire rows), column-oriented
databases tend to perform much better for analytical processing (e.g. querying a data mart).
• Microsoft SQL Server and Oracle Exadata have support for column-based data storage.
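A sketch of the SQL Server feature mentioned above, in T-SQL: a columnstore index stores the named columns column-by-column to speed analytical scans (table and column names are hypothetical):

  CREATE NONCLUSTERED COLUMNSTORE INDEX ix_saleshistory_cs
  ON dbo.SalesHistory (SaleDate, StoreCode, Amount);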
41. Emerging Technologies: Appliances
Database Appliances
• A database appliance is an integrated, preconfigured package of RDBMS software and
hardware.
• The most common type of database appliance is a data warehouse appliance.
• Most major database vendors including Microsoft and Oracle and their hardware partners
package and sell database appliances for data warehousing.
• Data warehouse appliances utilize massively parallel processing (MPP).
• Database appliances generally do not scale well outside of the purchased configuration. For
example, you generally don’t add storage to a database appliance.
• A database appliance removes much of the burden of performance tuning; on the other
hand, database administrators have less flexibility.
• A database appliance can be a cost-effective solution for data warehousing in many
situations.
42. Emerging Technologies: Data Access Tools
Data Access Tools
• Business Intelligence (BI) tools allow users to view and access data, create aggregations and
summaries, create reports, and view dashboards with current data.
• BI tools typically sit on top of data marts created by the architects and developers. Data
marts that support BI are typically star schemas.
• Newer Self-Service BI tools add additional capabilities such as allowing users to integrate
multiple data sources and do further analysis on result data sets from previous analyses.
• Data visualization tools allow users to view data in various graphs.
• Newer tools allow users to access and analyze data from multiple form factors including
smart phones and tablets.
• Data access, BI and data visualization tools do not always provide the capability to perform
complex analyses or fulfill specific requirements of complex reports (e.g. complex statistical
analyses or studies submitted to journals). Programming skills are frequently still required.
43. Emerging Technologies: Cloud Databases
Cloud Database Services
• Oracle, Microsoft, and other database vendors offer cloud database services.
• The cloud service performs all database administrative tasks:
– Replicating data on multiple servers
– Making backups
– Scaling growing databases
– Tuning performance
• Cloud services can be useful for prototyping and heuristic development. A large commitment
to hardware purchases and administrative staff can be postponed for later assessment.
• Cloud services could result in considerable cost savings for some organizations.
• A cloud hybrid database is one that has database components both on the cloud and on local
servers.
• Cloud services may limit administrative options and flexibility vs. having your own DBAs.
• Cloud services may not meet regulatory requirements for security and storage for some
industries (e.g. medical data).
44. [Closing slide: the architected data warehouse diagram repeated. Operational OLTP sources feed staging and an ODS; ETL loads history and reference (REF) data; a second ETL stage populates the data marts (DM) and data sets that serve reports.]