Data mining has been used successfully since the 1980s in select, focused situations ranging from oil drilling to niche identification. It has also been used for medical diagnosis, genome analysis, and behavioral profiling. Wherever historical data exists with established precedents for observing trends and patterns over time, data mining can assist in any field, across all industries. With warehousing as the foundation, data mining brings the power and intelligence layer to the warehouse environment, and ultimately to the business user’s desktop.
Most tools still work in their own proprietary environment: the process of moving vast amounts of data out of warehouse databases into the tool environment (and back and forth during the exploration process) is cumbersome, time-consuming and unintuitive.
Data mining processing uses warehouse data as input. It crunches through the historical data, finding patterns and developing rules about the business. Once the analysis is run, business users validate the output.
Business identifies the business questions, verifies potential business factors and data sources, and determines the validity of the analytical model.
Analysts are experts in statistics, machine learning algorithms, business analysis and technology.
The right tools:
1. Comprehensiveness: a variety of statistical and machine learning algorithms, e.g. regression, clustering, factor analysis, decision trees.
2. Data manipulation: can the tool work directly with data in the source database, or must the data be moved in and out of the tool environment for derivation, transformation, modeling and testing?
3. Functionality: can users set parameters, easily read output, understand the validity of the model, change settings, and choose different variables?
4. Metadata: is there easy access to information about the development and use of the analytical models?
Teradata Warehouse Miner’s in-place mining avoids the data movement problem: the data stays at the source, and processing times are reduced by orders of magnitude.
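The data movement point can be illustrated with a small sketch. This is not Teradata Warehouse Miner code: it uses Python's built-in sqlite3 module as a stand-in database, and the `sales` table, its columns, and its values are made up for illustration. It contrasts extracting every row into the tool before aggregating with asking the database to aggregate in place.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0), (2, 15.0), (3, 100.0)],
)


def totals_by_extract(conn):
    """Move every row out of the database, then aggregate in the client."""
    totals = {}
    for customer_id, amount in conn.execute(
        "SELECT customer_id, amount FROM sales"
    ):
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return totals


def totals_in_place(conn):
    """Aggregate inside the database; only one row per customer is returned."""
    rows = conn.execute(
        "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"
    )
    return dict(rows)


extract_totals = totals_by_extract(conn)
in_place_totals = totals_in_place(conn)
print(extract_totals == in_place_totals)
```

Both functions return the same totals; the difference is how many rows leave the database. With five rows the cost is invisible, but the same pattern applied to warehouse-scale tables is what the data manipulation criterion is asking about.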
UNDERSTANDING DATA MINING SOFTWARE II Ekin Baykal Nikhil Brahmbhatt Jechand Chennupati Joel Edgeman Pushpendra Singh
INTRODUCTION <ul><li>Data Mining </li></ul><ul><li>CRISP-DM Model </li></ul><ul><li>Teradata Warehouse Miner </li></ul><ul><li>Show and Tell </li></ul>
INTRODUCTION <ul><li>The best search engine on the internet indexes only 16% of the sites. In 1999 the internet contained over 15 terabytes of data. (Nature, 1999b) </li></ul><ul><li>The quantity of data in GenBank, the international repository for genome sequences, doubles every 14 months. (Economist, 1999) </li></ul><ul><li>The 'Large Hadron Collider' at CERN will generate 20 terabytes of test data each day for the next 15 years. (Nature, 1999a) </li></ul>Source: http://www.stt.nl/stt2_intl/projects/datm/datm.htm
DATA MINING The process of identifying and interpreting intrinsic patterns in data to solve a business problem.
HISTORICAL CHALLENGES <ul><li>Lack of standards and business packaging </li></ul><ul><li>Inability of tools to scale up to the volumes of data </li></ul><ul><li>Noisy, missing, and faulty corporate data </li></ul><ul><li>Corporate warehousing has been slow to evolve </li></ul><ul><li>Databases designed for operational processing cannot scale up to voluminous analytical processing </li></ul><ul><li>Business doesn’t trust results that it can’t validate or understand </li></ul><ul><li>Data analysis and mining are typically niche-oriented processes that exist outside of business processes. </li></ul>
TODAY’S GROWING DEMAND <ul><li>Technological advances in compute power and speed </li></ul><ul><li>Advanced data processing and management techniques </li></ul><ul><li>Greater user sophistication </li></ul><ul><li>BUT… </li></ul><ul><li>Most tools still work in their own proprietary environment </li></ul><ul><li>Most databases aren’t optimized for analytic processing. </li></ul><ul><li>Businesses haven’t integrated data mining and knowledge discovery into their workflow. </li></ul><ul><li>Lack of executive commitment </li></ul>
WHERE DOES MINING FIT? [Figure: a customer record before and after mining. Data warehouse data: Name, Addr., # Prod.s, Tot.$, #Yrs. After mining, the same data warehouse data is extended with mined intelligence: Prop to buy Product X,Y,Z; Prof. Score; Churn Score; Cluster ID.]
WHERE DOES MINING FIT? The intelligence from the analysis is incorporated back into the warehouse in the form of scores, predictions, forecasts, and descriptions. [Figure: three stages. BUILD: develop the analytical model. TEST & DEPLOY: the tested model can be deployed as code, database triggers, a called module, or a one-time report. USE: deploy to the DW, OLAP, DSS, reports, and operational databases.]
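A minimal sketch of writing mined intelligence back into the warehouse as a score column follows. It again uses Python's sqlite3 module as a stand-in database. The `customer` table, its rows, and the `toy_churn_score` rule are hypothetical placeholders, not a real churn model and not how any particular product deploys models.

```python
import sqlite3

# Stand-in warehouse table with a score column waiting to be filled in.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "name TEXT, num_products INTEGER, years INTEGER, churn_score REAL)"
)
conn.executemany(
    "INSERT INTO customer (name, num_products, years) VALUES (?, ?, ?)",
    [("Ann", 1, 1), ("Bob", 4, 10)],
)


def toy_churn_score(num_products, years):
    """Placeholder rule: fewer products and shorter tenure give a higher score."""
    return round(1.0 / (1 + num_products + years), 3)


# Register the scoring function with the database so the UPDATE runs
# against the rows where they live, one form of a "called module".
conn.create_function("toy_churn_score", 2, toy_churn_score)
conn.execute(
    "UPDATE customer SET churn_score = toy_churn_score(num_products, years)"
)

scores = dict(conn.execute("SELECT name, churn_score FROM customer"))
print(scores)
```

Once the score sits in the table next to Name, Addr. and the other warehouse columns, reports, OLAP tools and operational systems can read it like any other attribute.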
SUCCESSFUL MINING <ul><li>The right people, an integrated technological environment, good tools and sound business commitment. </li></ul><ul><li>To be successful and profitable, it must be a collaboration driven by the business, developed by mining analysts and supported by IT. </li></ul><ul><li>Good quality data </li></ul><ul><li>The right tools: IT and analysts work together to determine which tools work best within the technical architecture. </li></ul>
THE ANALYTIC ROADMAP [Figure: grid of analyses grouped under six business areas: CUSTOMER, MARKETING, SALES, EQUIPMENT, PRODUCT, FINANCIAL. Analyses shown: Loyalty, Buying Prop, Churn Prop, Satisfact., Profitab., Lifetime Value; Profitab, Retention, Loss, Satisfac., Forecasting, Lifetime Value; Supply/Demand, Price Point Analysis, Bundling, New Product Projections, Product Optimization, Lifecycle Analysis; Inventory Analysis, Shipper Profiling, Timeline Optimiz., Shipment Analysis, Warehouse Optimization, Maintenance Forecast; Channel Analysis, Rep Profiling, Best Practices, Partner Profiling, Bundling; Target Marketing, Cross-sell Strategy, Best Campaign, Mkt Basket Analysis, Campaign Effectiv., Life Cycle Sequence, Sales Forecast.]
DATA MINING SYSTEMS <ul><li>Four generations of Data Mining Systems </li></ul><ul><ul><li>First – Vector value data </li></ul></ul><ul><ul><li>Second – Databases & data warehouses </li></ul></ul><ul><ul><li>Third – Internets and Extranets </li></ul></ul><ul><ul><li>Fourth – Mobile & embedded computing devices </li></ul></ul>Source: http://www.lac.uic.edu/~grossman/papers/esj-98.htm
CRISP-DM MODEL <ul><li>CRoss Industry Standard Process for Data Mining </li></ul><ul><li>Non-proprietary, documented, and freely available process model </li></ul><ul><li>Provides a “complete blueprint for conducting a data mining project” </li></ul><ul><li>Conceived by four leaders of the data mining market: Daimler-Benz, Integral Solutions, NCR, & OHRA </li></ul>
CRISP-DM MODEL <ul><li>Data Mining process organized into six phases </li></ul><ul><ul><li>Business understanding </li></ul></ul><ul><ul><li>Data understanding </li></ul></ul><ul><ul><li>Data preparation </li></ul></ul><ul><ul><li>Modeling </li></ul></ul><ul><ul><li>Evaluation </li></ul></ul><ul><ul><li>Deployment </li></ul></ul>
CRISP-DM REFERENCE MODEL <ul><li>Business Understanding: Determine business objectives, Assess situation, Determine data mining goals, Produce project plan </li></ul><ul><li>Data Understanding: Collect initial data, Describe data, Explore data, Verify data quality </li></ul><ul><li>Data Preparation: Data set, Select data, Clean data, Construct data, Integrate data, Format data </li></ul><ul><li>Modeling: Select modeling technique, Generate test design, Build model, Assess model </li></ul><ul><li>Evaluation: Evaluate results, Review process, Determine next steps </li></ul><ul><li>Deployment: Plan deployment, Plan monitoring and maintenance, Produce final report, Review project </li></ul>
[Figure: Teradata Warehouse Miner architecture.] Teradata Platform: MP-RAS, Windows NT 4.0, Windows 2000. Client Platform: Windows NT 4.0, Windows 2000. Teradata RDBMS Version 2 Release 3.1 or later. Components shown: Teradata OLAP and Data Mining Assists; Teradata Warehouse Miner Interfaces; TeraMiner TM Stats COM Interface; Teradata Warehouse Miner Graphical User Interface; Teradata Data Dictionary; Teradata ODBC Driver; Analytic Metadata; Metadata Services; Teradata Source Data; Analytic Algorithm EXE Server; ActiveX; TM Private Interface; Matrix Builder EXE Server; Scoring & Evaluation EXE Server; Visualization EXE Server; 3rd party / NCR CRM applications. Source: Teradata product documentation
RESOURCES <ul><li>Data Mining for Enterprise Solutions, Lelia Morrill, NCR Corporation, 2001 </li></ul><ul><li>The CRISP-DM Model: The New Blueprint for Data Mining, Colin Shearer, Journal of Data Warehousing, Vol. 5 No. 4, Fall 2000 (Abstract) </li></ul><ul><li>Data Mining (DATM), http://www.stt.nl/stt2_intl/projects/datm/datm.htm </li></ul><ul><li>Data Rich, Information Poor, http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex/ </li></ul><ul><li>There's Gold in that Mountain of Data, Dan R. Greening, http://www.newarchitectmag.com/archives/2000/01/greening/ </li></ul><ul><li>Supporting the Data Mining Process with Next Generation Data Mining Systems, Robert Grossman, http://www.lac.uic.edu/~grossman/papers/esj-98.htm </li></ul>