The key steps in developing a data warehouse can be summarized as:
1. Project initiation and requirements analysis
2. Design of the architecture, databases, and applications
3. Construction by selecting tools, developing data feeds, and building reports
4. Deployment including release and training
5. Ongoing maintenance
2. Data Warehouse Process and Technology: Warehousing
Strategy, Warehouse Management and Support Processes,
Warehouse Planning and Implementation,
Hardware and Operating Systems for Data Warehousing,
Client/Server Computing Model & Data Warehousing,
Parallel Processors & Cluster Systems, Distributed DBMS Implementations,
Warehousing Software, Warehouse Schema Design,
Data Extraction, Cleanup & Transformation Tools,
Warehouse Metadata.
3. “Storage or warehousing provides the place
utility as part of logistics for any business and
along with Transportation is a critical
component of customer service standards”.
4. To support the company’s customer policy.
To maintain a source of supply without interruptions.
To support changing market conditions and sudden
changes in demand.
To provide customers with the right mix of products
at all times and all locations.
To achieve the lowest logistics cost for a desired level of
customer service.
5. More cost-effective decision making.
Better enterprise intelligence: increasing the
quality and flexibility of enterprise analysis.
Enhanced customer service.
Business re-engineering: knowing what
information is important provides direction and
priority for re-engineering efforts.
Information system re-engineering.
6. Private warehouses: A storage facility mostly
owned by big companies or single manufacturing units.
Also known as proprietary warehousing.
Public warehouses: A facility that stores inventory for
many different businesses, as opposed to a
private warehouse.
Contract warehouses: A contract warehouse handles the
shipping, receiving, and storage of goods on a contract
basis. This type of warehouse usually requires a client to
commit to its services for a particular period of time.
7. An integrated warehouse strategy focuses on two
questions:
1. How many warehouses should be employed?
2. Which warehouse types should be used to meet
market requirements?
Many firms utilize a combination of private, public,
and contract facilities.
8. It involves the following activities:
1. Establish sponsorship.
2. Identify enterprise needs.
3. Determine measurement cycle.
4. Validate measures.
5. Design data warehouse architecture.
6. Apply appropriate technologies.
7. Implement the data warehouse.
9. 1. Establish sponsorship: Establishing the right
sponsorship chain ensures successful development
and implementation. The sponsorship chain should include
a data warehousing manager and two key individuals.
2. Identify enterprise needs: Interviews with key
enterprise managers and analysis of other pertinent
documentation are the techniques used to determine
enterprise needs.
10. 3. Determine measurement cycle: Describe
the cycles or time periods used for the measures. Are
quarters, months, or hours appropriate for capturing
useful measurement data? Is historical data needed?
4. Validate measures: After determining and
identifying enterprise needs, it is necessary to
"reality check" them. The feedback is used to
refine the measures.
11. 5. Design data warehouse architecture: This activity
involves active user participation in facilitated design
sessions.
6. Apply appropriate technologies: The enterprise selects
technologies and addresses key technology issues, security policies, etc.
7. Implement the data warehouse: load preliminary
data, design the user interface, develop standard
queries and reports, etc.
13. There are four major processes that build a data
warehouse:
1. Extract and load data: Data extraction takes data
from the source systems. Data load takes the
extracted data and loads it into the data warehouse.
It involves:
Controlling the process: determining when to start data
extraction. This ensures that the tools, the logic modules,
and the programs are executed in the correct sequence and at
the correct time.
14. When to initiate the extract: The data warehouse should
present a single, consistent version of the information to
the user, so the data needs to be in a consistent state
before extraction.
Loading the data: Data is loaded into a temporary data
store, where it is cleaned up and made consistent.
2. Cleaning and transforming the data: clean
and transform the loaded data into the target structure,
partition the data, and generate aggregations.
15. 3. Backup and archive the data: In order to recover the
data in the event of data loss, software failure, or
hardware failure, it is necessary to keep regular
backups.
4. Managing queries and directing them to the
appropriate data sources: this process manages queries,
helps speed up their execution time, and directs them
to their most effective data sources. It ensures that
all system sources are used in the most effective way
and monitors actual query profiles.
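The query-direction idea above can be sketched as a small router. The source names and routing rules below are hypothetical, purely for illustration:

```python
# Minimal sketch of a query manager that directs each query to the most
# effective data source. Source names and routing rules are hypothetical.

def route_query(query: str) -> str:
    """Pick a data source using simple, illustrative rules."""
    q = query.lower()
    if "sum(" in q or "avg(" in q:
        return "aggregate_store"   # pre-computed summaries answer aggregates fastest
    if "history" in q:
        return "archive_store"     # historical data lives in the archive
    return "detail_store"          # default: the detailed fact data

# A profile log lets the manager monitor actual query patterns.
query_profile = []

def execute(query: str) -> str:
    source = route_query(query)
    query_profile.append((query, source))
    return source

sources = [execute("SELECT SUM(sales) FROM sales_fact"),
           execute("SELECT * FROM order_history"),
           execute("SELECT * FROM orders")]
```

A real query manager would also rewrite queries and balance load; this sketch only shows the routing and monitoring responsibilities named above.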
17. A warehouse management system (WMS) is a
software application designed to support and
optimize warehouse or distribution-center
management.
It facilitates daily planning, organizing, staffing,
directing, and controlling the utilization of available
resources to move and store materials into, within,
and out of a warehouse, while supporting staff in the
performance of material movement and storage in and
around a warehouse.
18. 1. Load management: relates to the collection of
information from internal or external sources.
The loading process includes summarizing,
manipulating, and changing the data structures
into a format that lends itself to analytical
processing.
2. Warehouse management: the management
tasks include ensuring the warehouse's availability,
the effective backup of its contents, and its security.
19. 3. Query management: relates to the provision of
access to the contents of the warehouse, and may
include partitioning the information into
different areas with different privileges for
different users.
Access may be provided through custom-built
applications or ad hoc query tools.
21. Includes loading preliminary data, implementing
the transformation programs, designing the user interface,
developing standard queries and reports, and
training warehouse users.
23. The process of extracting data from source systems and
bringing it into the data warehouse is commonly
called ETL, which stands for:
Extraction: to retrieve all the required data from the
source system using as few resources as possible,
Transformation, and
Loading.
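The three ETL stages can be sketched end to end. The row shapes, field names, and cleaning rules below are illustrative assumptions, not a real system:

```python
# A minimal end-to-end ETL sketch. Source rows, field names, and
# cleaning rules are assumptions made for illustration only.

def extract(source_rows):
    """Extraction: pull raw records from the source system."""
    return list(source_rows)

def transform(rows):
    """Transformation: clean and reshape rows for the warehouse schema."""
    out = []
    for r in rows:
        if r.get("amount") is None:      # drop unusable records
            continue
        out.append({"customer": r["customer"].strip().title(),
                    "amount": float(r["amount"])})
    return out

def load(rows, warehouse):
    """Loading: append the transformed rows to the target store."""
    warehouse.extend(rows)
    return len(rows)

warehouse = []
source = [{"customer": " alice ", "amount": "10.5"},
          {"customer": "BOB", "amount": None}]   # BOB's row fails cleaning
loaded = load(transform(extract(source)), warehouse)
```

In practice each stage is far more involved (incremental extracts, staging tables, bulk loaders), but the pipeline shape is the same.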
24. Ways to perform the extract:
Update notification – If the source system is able to
provide a notification that a record has been changed,
this is the easiest way to get the data.
Incremental extract – The source system is able to identify
which records have been modified and provides an extract of
only those records. With a daily incremental extract, deleted
records may not be detected.
Full extract - The full extract requires keeping a copy of
the last extract in the same format in order to be able to
identify changes. Handles deletions as well.
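The trade-off between incremental and full extracts can be sketched as follows; the record shapes and the `modified` timestamp field are assumptions:

```python
# Sketch contrasting incremental and full extraction.
# Record shapes and timestamp values are illustrative.

def incremental_extract(rows, last_run):
    """Return only rows modified since the last run.
    Deleted rows are NOT detected -- they simply stop appearing."""
    return [r for r in rows if r["modified"] > last_run]

def full_extract(current, previous):
    """Compare a full snapshot against the previous one.
    Detects inserts/updates AND deletions."""
    prev = {r["id"]: r for r in previous}
    curr = {r["id"]: r for r in current}
    changed = [r for rid, r in curr.items() if prev.get(rid) != r]
    deleted = [r for rid, r in prev.items() if rid not in curr]
    return changed, deleted

rows = [{"id": 1, "modified": 5}, {"id": 2, "modified": 12}]
recent = incremental_extract(rows, last_run=10)          # id 2 only
changed, deleted = full_extract([{"id": 1, "modified": 5}], rows)
```

Note how the full extract, by keeping the previous snapshot, finds that record 2 was deleted, which the incremental approach would silently miss.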
25. 2. Clean: Ensures the quality of the data in the data
warehouse.
3. Transform: Applies a set of rules to transform the
data from the source to the target.
Converting any measured data to the same dimension,
using the same units, so that it can later be joined.
It also requires joining data from several sources,
generating aggregates, generating surrogate keys,
sorting, and deriving new calculated values.
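A few of these transformation rules (unit conversion, surrogate-key generation, aggregation) can be sketched like this; all names and conversion factors are illustrative:

```python
# Sketch of transformation rules: converting measures to a common unit,
# generating surrogate keys, and building an aggregate. Names assumed.
import itertools

_key_counter = itertools.count(1)
_key_map = {}

def surrogate_key(natural_key):
    """Map a source-system key to a stable warehouse surrogate key."""
    if natural_key not in _key_map:
        _key_map[natural_key] = next(_key_counter)
    return _key_map[natural_key]

def to_kg(value, unit):
    """Convert measured weights to a single unit (kilograms)."""
    factors = {"kg": 1.0, "lb": 0.453592, "g": 0.001}
    return value * factors[unit]

def aggregate_total(rows):
    """Generate a simple aggregate over the converted measures."""
    return sum(to_kg(r["weight"], r["unit"]) for r in rows)

k1 = surrogate_key("CUST-9")
k2 = surrogate_key("CUST-9")   # same natural key -> same surrogate
total = aggregate_total([{"weight": 2, "unit": "kg"},
                         {"weight": 500, "unit": "g"}])
```

The unit conversion is what makes the later join and aggregation meaningful: once all weights are in kilograms, rows from different sources can be summed together.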
26. 4. Load: to ensure that the load is performed correctly
and with as few resources as possible. The target of the
load process is often a database. Referential integrity
needs to be maintained by the ETL tool to ensure consistency.
5. Managing the ETL process:
There is a possibility that the ETL process fails. This can be
caused by missing values in one of the reference tables, or
simply by a connection or power outage. It is necessary to
design the ETL process with fail-recovery in mind.
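Designing for fail-recovery can be sketched as a checkpointed, retrying load loop. The batch structure, failure mode, and checkpoint format below are assumptions:

```python
# Sketch of fail-recovery in an ETL run: each batch is retried on
# transient errors, and a checkpoint records the last batch loaded so a
# restart can resume where it left off. Checkpoint format is assumed.

def run_etl(batches, load_fn, checkpoint, max_retries=3):
    """Load batches in order, resuming after checkpoint['done'] on restart."""
    for i, batch in enumerate(batches):
        if i < checkpoint["done"]:
            continue                        # already loaded in a previous run
        for attempt in range(max_retries):
            try:
                load_fn(batch)
                checkpoint["done"] = i + 1  # persist progress after success
                break
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise                   # give up; checkpoint preserves state
    return checkpoint["done"]

# Demo: a load that fails once on the second batch, then succeeds on retry.
calls = {"n": 0}
def flaky_load(batch):
    calls["n"] += 1
    if calls["n"] == 2:
        raise ConnectionError("transient outage")

checkpoint = {"done": 0}
completed = run_etl([["a"], ["b"], ["c"]], flaky_load, checkpoint)
```

A production scheduler would persist the checkpoint durably (e.g. in a control table) rather than in memory, but the resume logic is the same.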
27. 6. Staging:
A staging area or landing zone is an intermediate storage
area used for data processing during the ETL process.
The primary motivations for its use are to increase the
efficiency of ETL processes, ensure data integrity, and
support data-quality operations.
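A staging area can be sketched as a transient landing table whose rows are validated before promotion to the warehouse; the quality checks below are illustrative:

```python
# Sketch of a staging area: rows land in a staging table, are validated
# there, and only clean rows are promoted to the warehouse. The table
# shapes and the quality checks are illustrative assumptions.

def stage(rows, staging):
    """Land raw data in the intermediate storage area first."""
    staging.extend(rows)

def promote(staging, warehouse):
    """Move only rows that pass quality checks; keep rejects for review."""
    rejects = []
    for r in staging:
        if r.get("id") is not None and r.get("amount", 0) >= 0:
            warehouse.append(r)
        else:
            rejects.append(r)
    staging.clear()                    # staging is transient storage
    return rejects

staging, warehouse = [], []
stage([{"id": 1, "amount": 5}, {"id": None, "amount": 2}], staging)
rejects = promote(staging, warehouse)
```

Keeping rejects separate (instead of silently dropping them) is what makes the data-quality operations mentioned above auditable.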
28. Commercial tools : Ab Initio, IBM InfoSphere
DataStage, Informatica, Oracle Data
Integrator and SAP Data Integrator.
Open source ETL tools: CloverETL, Apatar,
Pentaho and Talend.
29. Data warehouses come in all shapes and sizes,
which has a direct bearing on the cost and
time involved.
The points listed below summarize some of the
things to consider:
Get Professional Advice
Plan the Data
Who will use the Data Warehouse
Integration to External Applications
30. The key steps in developing a data warehouse can
be summarized as follows:
Project initiation
Requirements analysis
Design (architecture, databases and applications)
Construction (selecting and installing tools,
developing data feeds and building reports)
Deployment (release & training)
Maintenance
32. Client/server computing applies to software architectures that
describe the processing split between an application and its
supporting services.
It represents distributed, cooperative processing; the relationship
between client and server parallels the relationship between
hardware and software components.
It covers a wide range of functions, services, and other
aspects of a distributed environment.
33. Host-based application processing is performed on one
computer system with attached unintelligent, "dumb"
terminals.
A single stand-alone PC, or an IBM mainframe with
attached character-based display terminals, are examples
of host-based processing environments.
Host-based processing is totally non-distributed.
35. Slave computers are attached to a master computer and
perform application-processing-related functions only as
directed by their master.
Distribution of processing tends to be unidirectional,
from master to slaves.
Slaves are capable of some limited local application
processing.
E.g., a mainframe (host) computer, such as the IBM 3090, used
with cluster controllers and intelligent terminals.
37. This generation is used to model:
1. Shared-device LAN processing environment: PCs
are attached to a system device that allows them
to share a common resource – a file server on a
hard disk or a print server.
E.g., Microsoft's LAN Manager, which allows a LAN
to have a system dedicated to file and print services.
39. 2. Client/server LAN processing environment:
An extension of shared-device processing.
E.g., SYBASE SQL Server.
An application running on a PC sends a read request
to its database server. The DB server processes it locally
and sends only the requested records back to the PC
application.
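That division of labor can be sketched as follows: the server evaluates the request and returns only the matching rows instead of shipping the whole table to the client. The data and API here are assumptions (a real system would carry the request over the network):

```python
# Sketch of the client/server division of labor: the server filters
# records locally and returns only those the client asked for, rather
# than sending the entire data set across the LAN. Data/API assumed.

class DBServer:
    def __init__(self, records):
        self.records = records

    def read(self, predicate):
        """Process the request server-side; return matching rows only."""
        return [r for r in self.records if predicate(r)]

server = DBServer([{"id": i, "region": "east" if i % 2 else "west"}
                   for i in range(1000)])

# The client receives 500 rows, not 1000 -- the server did the filtering.
east_rows = server.read(lambda r: r["region"] == "east")
```

Contrast this with a shared-device environment, where the whole file would travel to the PC and the PC would do the filtering itself.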
41. Evolved from a two-tiered architecture to a multi-tiered
architecture.
The computing model deals with servers dedicated to
application, data, transaction-management, and
system-management functions.
Supported data structures evolved from relational to
multidimensional to multimedia.
43. A distributed database system consists of
loosely coupled sites that share no physical
component.
Database systems that run on each site are
independent of each other.
Transactions may access data at one or more
sites.
44. In a homogeneous distributed database
All sites have identical software
Are aware of each other and agree to cooperate in processing
user requests.
Each site surrenders part of its autonomy in terms of right to
change schemas or software
Appears to the user as a single system
In a heterogeneous distributed database
Different sites may use different schemas and software
▪ Difference in schema is a major problem for query processing
▪ Difference in software is a major problem for transaction processing
Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing
45. DDBMS architectures are generally developed
depending on three parameters −
Distribution − It states the physical distribution of
data across the different sites.
Autonomy − It indicates the distribution of control
of the database system and the degree to which each
constituent DBMS can operate independently.
Heterogeneity − It refers to the uniformity or
dissimilarity of the data models, system components
and databases.
46. Data Replication
Fragmentation
The three dimensions of distribution
transparency are −
Location transparency
Fragmentation transparency
Replication transparency
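Horizontal fragmentation with location transparency can be sketched like this; the site names and the fragmentation rule are illustrative:

```python
# Sketch of horizontal fragmentation with location transparency: rows
# are split across sites by a key range, but queries go through one
# interface that hides where each fragment lives. Sites/rule assumed.

class DistributedTable:
    def __init__(self):
        # Each site holds one horizontal fragment (a subset of rows).
        self.sites = {"site_A": [], "site_B": []}

    def insert(self, row):
        # Fragmentation rule: ids below 100 go to site_A, the rest to site_B.
        site = "site_A" if row["id"] < 100 else "site_B"
        self.sites[site].append(row)

    def select_all(self):
        """Location transparency: callers never name a site."""
        return [r for frag in self.sites.values() for r in frag]

t = DistributedTable()
for i in (5, 150, 42):
    t.insert({"id": i})
```

Replication transparency would extend this by storing each fragment at more than one site and hiding which copy serves a given query.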
51. Data warehouse operations mainly consist of huge data loads and
index builds, generation of materialized views, and queries over large
volumes of data. The underlying I/O system of a data warehouse should
be built to meet these heavy requirements.
Architecture Options:
1. Symmetric Multiprocessing (SMP): where two or more identical
processors are connected to a single, shared main memory.
2. Massive parallel processing (MPP): large number of processors to
perform a set of coordinated computations in parallel.
Key sizing factors: the number of CPUs, the memory available
to the data warehouse, and the number of disks.
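The divide-and-combine idea behind SMP/MPP query processing can be sketched as below. Threads stand in for the parallel processors here; a real system distributes the chunks across CPUs or cluster nodes:

```python
# Sketch of parallel aggregation in the SMP/MPP style: a large scan is
# split into chunks, workers aggregate each chunk independently, and
# the partial results are merged. Threads model the processors here.

from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """One worker's coordinated computation over its slice of the data."""
    return sum(chunk)

def parallel_total(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))  # merge partial results

total = parallel_total(list(range(1000)))          # same answer as a serial sum
```

In an SMP machine the workers share one memory; in an MPP system each worker would own its chunk on a separate node and only the partial sums would cross the interconnect.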
52. The server OS determines:
how quickly the server can fulfill client requests,
how many clients it can support concurrently and
reliably, and
how efficiently system resources such as memory,
disk I/O, and communication components are
utilized.
53. Multiuser Support
Preemptive multitasking
Multithreaded Design
Memory protection: concurrent tasks should not
violate each other's memory.
Scalability
Security
Reliability
Availability
54. A microkernel is relatively small and highly secure.
It offers a simplified architecture, extensibility, portability,
real-time support, robust system security, and multiprocessor
support.
This architecture results in a highly modular OS that
can support multiple OS "personalities" by configuring
outside services as needed.
E.g., the Mach 3.0 microkernel, used by IBM to allow the
DOS, OS/2, and AIX operating systems to coexist on a single machine.
58. A cluster is a set of loosely coupled SMP machines
connected by a high-speed interconnection
network.
A cluster behaves just like a single large
machine.