A company that spends less money on its data warehouse is usually happier with it.
The main justification for the development expense is that a DW reduces the cost of accessing the information owned by the organization.
Since information has to be extracted from the source systems just once (when it is loaded into the warehouse), DW users see a lower cost for each report generated.
[Diagram: Typical multidatabase report and screen generation — data download and transformation contribute to retrieval costs for every report or screen generated; Source Systems A–D feed each report directly]
[Diagram: Typical DW report and screen generation — data upload and transformation costs occur just once, so retrieval costs are lower; Source Systems A–D feed the Organizational Data Warehouse]
Farmers and Explorers
Every corporation has two types of DW users.
Farmers know what they want before they set out to find it. They submit small queries and retrieve small nuggets of information.
Explorers are quite unpredictable. They often submit large queries. Sometimes they find nothing, sometimes they find priceless nuggets.
Cost justification for the DW is usually done on the basis of the results obtained by farmers, since the results of explorers are unpredictable.
1-5: Data Marts and the Data Warehouse
Legacy systems feed data to the warehouse; the warehouse feeds specialized information to departments.
[Diagram: Legacy Systems → Operational Data Stores → Organizational Data Warehouse → Finance, Accounting, Marketing, and Sales Data Marts]
The Data Mart is More Specialized
The data mart serves the needs of one business unit, not the organization.
[Diagram: Organizational Data Warehouse feeding the Finance, Accounting, Marketing, and Sales Data Marts]
Summarized, aggregated data
Star join design
Limited historical data
Limited data volume
Requirements driven data
Focused on departmental needs
Multi-dimensional DBMS technologies
Organizational Data Warehouse
Highly granular data
Robust historical data
Large data volume
Data Model driven data
General purpose DBMS technologies
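The "star join design" listed for the data mart can be sketched concretely: a central fact table joined once to each small dimension table. The following is a minimal, self-contained illustration using an in-memory SQLite database; the table and column names (fact_sales, dim_product, dim_region) are hypothetical, not from the source.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A star schema: one fact table at the center, one row per measured event,
# surrounded by small descriptive dimension tables.
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_region  VALUES (1, 'East'),   (2, 'West');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 75.0);
""")

# The star join itself: the fact table joined once to each dimension,
# producing the summarized, aggregated view a mart typically serves.
rows = cur.execute("""
    SELECT p.name, r.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_region  r ON f.region_id  = r.region_id
    GROUP BY p.name, r.name
    ORDER BY p.name, r.name
""").fetchall()
print(rows)  # one summed row per (product, region) pair
```

This shape is what makes mart queries cheap: every join is a single hop from the fact table to a small dimension table, never dimension-to-dimension.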
1-6: Foundations of Data Mining
Data mining is the process of using raw data to infer important business relationships.
Despite a consensus on the value of data mining, a great deal of confusion exists about what it is.
It is a collection of powerful techniques intended for analyzing large datasets.
There is no single data mining approach, but rather a set of techniques that can be used in combination with each other.
1-7: The Roots of Data Mining
The approach has roots in practice dating back over 30 years.
In the early 1960s, data mining was called statistical analysis, and the pioneers were statistical software companies such as SAS and SPSS.
By the 1980s, the traditional techniques had been augmented by new methods such as fuzzy logic, heuristics and neural networks.
A General Approach
Although all data mining endeavors are unique, they possess a common set of process steps:
Infrastructure preparation – choice of hardware platform, the database system and one or more mining tools
Exploration – looking at summary data, sampling and applying intuition
Analysis – each discovered pattern is analyzed for significance and trends
A General Approach (continued)
Interpretation – Once patterns have been discovered and analyzed, the next step is to interpret them. Considerations include business cycles, seasonality and the population the pattern applies to.
Exploitation – this is both a business and a technical activity. One way to exploit a pattern is to use it for prediction. Others are to package, price or advertise the product in a different way.
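The exploration step above — looking at summary data, sampling and applying intuition — can be sketched in a few lines. This is an illustrative example only; the data are randomly generated and the variable names are hypothetical.

```python
import random
import statistics

random.seed(1)
# Stand-in for a large warehouse column, e.g. 10,000 order totals.
order_totals = [round(random.uniform(5, 500), 2) for _ in range(10_000)]

# Summary data first: cheap figures that orient the analyst...
print("count:", len(order_totals))
print("mean: ", round(statistics.mean(order_totals), 2))
print("stdev:", round(statistics.stdev(order_totals), 2))

# ...then a small random sample to apply intuition to.
print("sample:", random.sample(order_totals, 5))
```

Only after this cheap pass would the analysis and interpretation steps be run against the full data.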
1.8: The Approach to Data Exploration and Data Mining
The basis for all data mining activities is correlation.
[Diagram: scatter plots of A vs. B illustrating a perfect, a strong, and a weak correlation]
The Spectrum of Correlation
In general, a correlation coefficient is a number whose magnitude lies between 0 and 1, showing the strength of a relationship.
Some types of correlation are signed ( ±) to also show the direction of the relationship.
Even a weak correlation can be interesting, however, if it shows a trend over time.
[Spectrum: 1 = perfect correlation, .5 = moderate correlation, 0 = no correlation]
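The signed correlation coefficient described above (magnitude for strength, sign for direction) can be computed directly. Below is a minimal sketch of Pearson's coefficient in plain Python; the sample data are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Signed correlation: magnitude 0..1 is strength, sign is direction."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfect correlation: B is an exact linear function of A.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ 1.0
# Weak correlation: B varies with little relation to A.
print(pearson([1, 2, 3, 4], [5, 1, 4, 2]))   # ≈ -0.42 (weak, negative)
```

A coefficient near 1 or −1 sits at the "perfect" end of the spectrum; a value near 0 indicates no linear relationship.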
Methods to Determine Correlation
The method used depends on the type of elements being correlated:
Data element vs. data element
Data element vs. unit of time
Data element vs. data element groups
Data element vs. geography
Data element vs. external trends
Data element vs. demographics
The Data Warehouse and Data Mining
Data mining does not require the use of a warehouse, but it may be the best foundation for mining.
If multiple analyses are run in sequence, the data need to be held constant (as in a DW). In an operational database, data change often.
Also important is that the data in the DW are integrated and stable.
Volumes of Data – The Biggest Challenge
The largest challenge a data miner may face is the sheer volume of data in the warehouse.
A major problem is that this sheer volume may mask the important relationships the analyst is interested in.
It is quite important, then, that summary data also be available to get the analysis started.
The ability to overcome the volume and visualize the data becomes quite important.
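One common way to overcome volume, as suggested above, is to collapse detail rows into summary rows before analysis begins. A minimal pure-Python sketch follows; the fields (region, month, amount) are hypothetical.

```python
from collections import defaultdict

# Stand-in for millions of detail rows in the warehouse.
detail = [
    ("East", "2024-01", 120.0), ("East", "2024-01", 80.0),
    ("East", "2024-02", 95.0),  ("West", "2024-01", 60.0),
    ("West", "2024-02", 70.0),  ("West", "2024-02", 30.0),
]

# Collapse to one summary row per (region, month): far fewer rows,
# and broad patterns are no longer masked by raw volume.
summary = defaultdict(float)
for region, month, amount in detail:
    summary[(region, month)] += amount

for key in sorted(summary):
    print(key, summary[key])
```

In practice this aggregation would run inside the DBMS (a GROUP BY), but the principle is the same: start the analysis from summary data, then drill into detail only where a pattern appears.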
1.9: Foundations of Data Visualization
One of the earliest known examples of data visualization was in London during the 1854 cholera epidemic. A map (next slide) helped to identify the source of the disease.
Modern visualization techniques grew from the twin technologies of computer graphics and high performance computing in the 1970s and 1980s.
One computer scientist, Douglas Engelbart, anticipated this trend as early as the 1950s.
Dr. John Snow used a map to show that the source of cholera was a water pump, thus proving the disease was water-borne. [Map annotation: Broad Street Pump]
Opportunity and Timing
Alternative input devices (light pen, sketch pad and mouse) began to appear in the 1960s.
In the 1970s, flight simulators became much more realistic when graphics replaced film.
In the same decade, special effects computers became entrenched in the entertainment industry.
In the 1980s, visualization grew more dynamic with applications like the animation of Los Angeles smog patterns.
One of today’s more useful applications of visualization is the simulator (in games as well as in training). This is the only way most of us will ever fly a Boeing 747.
It is now both cheaper and safer to train commercial pilots on simulators. With good software, pilots can be placed in situations they may not ever see – until too late – in the cockpit.
[Animation: A Sequence of Frames Animating LA Smog — Day 1: swirling winds, light smog particles; Day 2: offshore winds, moderate smog particles; Day 3: head-on view of smog particles and streamlines]
Number Crunching With a Difference
In the 1990s, rapid advances in chip technology, in both the CPU and the graphics processor, put data visualization everywhere.
Imagine trying to understand DNA sequences from just the numbers!
On the next slide, a Mapuccino display helps us see where the results from a text search come from.