Introduction to data mining and data warehousing

Types of database processing, OLTP vs. data warehouses (OLAP), subject orientation,
functionalities of a data warehouse, roll-up (consolidation),
KDD process, applications of data mining

  1. 1. Er. Nawaraj Bhandari Data Warehouse/Data Mining Chapter 1: Introduction to Data Mining and Data Warehousing
  2. 2. Course Overview: Course Title: Data Warehousing and Data Mining (BSC-CSIT, TU). Course no: CSC-459. Credit hours: 3. Nature of course: Theory (3 Hrs.) + Lab (3 Hrs.). Full Marks: 60+20+20. Pass Marks: 24+8+8. Prerequisite: C, Data Structure, Database.
  3. 3. Text Books
  4. 4. Reference Books
  5. 5. Types of database processing • OLTP (on-line transaction processing): a class of programs that facilitate and manage transaction-oriented applications; it supports daily business operations. • OLAP (on-line analytical processing): a way of viewing data in a multidimensional format; it supports decision making.
  6. 6. Transform “Data” into “Information”: A data warehouse provides a multidimensional view of an organization’s operational (OLTP) data to help users make faster, better-informed decisions.
  7. 7. OLTP vs. Data Warehouses (OLAP)
     Property | OLTP | Data Warehouse
     Nature of database | 3NF | Multidimensional
     Indexes | Few | Many
     Joins | Many | Some
     Duplicate data | Normalized | Denormalized
     Aggregate data | Rare | Common
     Nature of queries | Mostly simple | Mostly complex
     Updates | All the time | Not allowed, only refreshed
     Historical data | Often not available | Essential
  8. 8. Supermarket Systems (diagram): customer with loyalty card; point of sale; OLTP for point of sale; stock-taking and reordering database; customer records database; web server and database for online shopping; connected through a LAN and the Internet with a VPN or WAN.
  9. 9. Activity – Identify the types of data being collected and used here.
  10. 10. And… what are the benefits of bringing this data together?
  11. 11. And… what are the benefits of bringing this data together? • Sales trends • Customer buying habits • Regional variations • Variations by time • Goods generating profit
  12. 12. What is a Data Warehouse? A data warehouse is: • Subject-oriented • Integrated • Time-variant • Non-volatile
  13. 13. Subject Orientation: A data warehouse can be used to analyse a particular subject area. (Diagram: a data warehouse organized around subjects such as supplier, customer, product, and buying.)
  14. 14. Integration: A data warehouse holds integrated data from multiple data sources. For example, data source A and data source B may identify a product in different ways, but in the data warehouse there is only a single way of identifying a product. (Diagram: the OLTP applications encode gender as App1: m,f; App2: 1,0; App3: male,female, and dates as App1: yymmdd; App2: mmddyy; App3: ddmmyy. After integration, the data warehouse stores gender as m,f and dates as ddmmyy.)
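The integration step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular ETL tool: the source names `app1` to `app3`, the function `integrate`, and the reading of `1` as male and `0` as female are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical per-application encodings, mirroring the slide's example.
GENDER_MAPS = {
    "app1": {"m": "m", "f": "f"},
    "app2": {"1": "m", "0": "f"},  # assumption: 1 means male, 0 means female
    "app3": {"male": "m", "female": "f"},
}
DATE_FORMATS = {"app1": "%y%m%d", "app2": "%m%d%y", "app3": "%d%m%y"}
WAREHOUSE_DATE_FORMAT = "%d%m%y"  # the single warehouse representation


def integrate(source, gender, date_text):
    """Convert one source record to the warehouse's single representation."""
    parsed = datetime.strptime(date_text, DATE_FORMATS[source])
    return {
        "gender": GENDER_MAPS[source][gender.lower()],
        "date": parsed.strftime(WAREHOUSE_DATE_FORMAT),
    }


# The same fact (male, 15 March 2024) as each application would store it.
records = [
    integrate("app1", "m", "240315"),
    integrate("app2", "1", "031524"),
    integrate("app3", "Male", "150324"),
]
```

All three source records end up identical once loaded, which is the point of integration: downstream queries see one encoding only.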
  15. 15. Time Variant: All data in the data warehouse is identified with a particular time period. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with that customer. OLTP system: • time horizon of 60-90 days, depending on the business • key will not usually have an element of time • data can be changed. Data warehouse: • long-term time horizon of 5-10 years • key will contain an element of time • data cannot be changed.
  16. 16. Non-Volatile: Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business. (Diagram: an operational system supports create, update, retrieve, and delete; a data warehouse supports only load and access.)
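The customer-address example can be made concrete with a small sketch. It assumes two in-memory stores with hypothetical names (`oltp_address`, `warehouse_address`) and illustrative city values; a real warehouse would use a database table whose key includes a date column.

```python
from datetime import date

# OLTP-style store: one entry per customer, overwritten on every change.
oltp_address = {}

# Warehouse-style store: the key contains a time element; rows are only added.
warehouse_address = {}


def record_address(customer_id, address, valid_from):
    """Apply an address change to both stores."""
    oltp_address[customer_id] = address
    warehouse_address[(customer_id, valid_from)] = address


def address_as_of(customer_id, when):
    """Return the address that was current on the given date, if any."""
    candidates = [
        (valid_from, address)
        for (cid, valid_from), address in warehouse_address.items()
        if cid == customer_id and valid_from <= when
    ]
    return max(candidates)[1] if candidates else None


record_address(1, "Kathmandu", date(2015, 1, 1))
record_address(1, "Pokhara", date(2019, 6, 1))
```

After both changes, `oltp_address[1]` holds only the latest address, while `address_as_of(1, date(2017, 1, 1))` can still answer a historical question.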
  17. 17. Functionalities of Data Warehouse: A data warehouse is characterized by a relatively low volume of transactions; its queries are often very complex and involve aggregations. The basic operations in OLAP are: 1. Roll-up (consolidation) 2. Drill-down 3. Slicing 4. Dicing 5. Pivot
  18. 18. Roll-Up (Consolidation): Roll-up performs aggregation on a data cube, so that data is summarized with increased generalization. It is performed in either of the following ways: • By climbing up a concept hierarchy for a dimension. • By dimension reduction.
  19. 19. Roll-Up(Consolidation)
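Both forms of roll-up can be sketched over a tiny fact table. The rows, the city-to-country hierarchy, and the function names below are illustrative assumptions, not data from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, quarter, item, units_sold)
SALES = [
    ("Kathmandu", "Q1", "phone", 10),
    ("Kathmandu", "Q2", "phone", 12),
    ("Pokhara", "Q1", "phone", 5),
    ("Delhi", "Q1", "phone", 20),
    ("Delhi", "Q2", "laptop", 8),
]
# Hypothetical concept hierarchy for the location dimension: city -> country
CITY_TO_COUNTRY = {"Kathmandu": "Nepal", "Pokhara": "Nepal", "Delhi": "India"}


def roll_up_location(rows):
    """Climb the location hierarchy: aggregate city-level rows by country."""
    totals = defaultdict(int)
    for city, quarter, item, units in rows:
        totals[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(totals)


def roll_up_drop_quarter(rows):
    """Dimension reduction: remove the time dimension by summing over it."""
    totals = defaultdict(int)
    for city, quarter, item, units in rows:
        totals[(city, item)] += units
    return dict(totals)


by_country = roll_up_location(SALES)
without_time = roll_up_drop_quarter(SALES)
```

In both cases the result has fewer, coarser cells than the input, and the grand total of units is unchanged.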
  20. 20. Drill-Down: It is the reverse of roll-up. It is performed in either of the following ways: • By stepping down a concept hierarchy for a dimension. • By introducing a new dimension.
  21. 21. Drill-Down
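A minimal sketch of drill-down by stepping down a time hierarchy from quarter to month. The monthly figures and the helper names are made up for illustration; the detailed rows must already be stored, since drill-down cannot recover detail from the summary alone.

```python
from collections import defaultdict

# Hypothetical month-level facts: (month, units_sold)
MONTHLY = [("Jan", 4), ("Feb", 3), ("Mar", 5), ("Apr", 6), ("May", 2), ("Jun", 7)]
MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}


def by_quarter(rows):
    """The summarized (rolled-up) view: units per quarter."""
    totals = defaultdict(int)
    for month, units in rows:
        totals[MONTH_TO_QUARTER[month]] += units
    return dict(totals)


def drill_down(rows, quarter):
    """Step down from one quarter to the months it summarizes."""
    return {month: units for month, units in rows
            if MONTH_TO_QUARTER[month] == quarter}


quarterly = by_quarter(MONTHLY)
q1_detail = drill_down(MONTHLY, "Q1")
```

The months returned for a quarter always add up to that quarter's summarized value.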
  22. 22. Slice: The slice operation performs a selection on one particular dimension of a given cube, fixing it to a single value, and provides a new sub-cube. Consider the following diagram that shows how slice works.
  23. 23. Slice
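As a rough sketch, a cube can be modelled as a dictionary keyed by one value per dimension; slicing then fixes one dimension and drops it from the key. The cube contents, `DIMENSIONS`, and `slice_cube` are assumptions made for this example.

```python
# Hypothetical cube stored as {(city, quarter, item): units_sold}
DIMENSIONS = ("city", "quarter", "item")
CUBE = {
    ("Kathmandu", "Q1", "phone"): 10,
    ("Kathmandu", "Q2", "phone"): 12,
    ("Delhi", "Q1", "phone"): 20,
    ("Delhi", "Q1", "laptop"): 8,
    ("Delhi", "Q2", "laptop"): 9,
}


def slice_cube(cube, dimension, value):
    """Fix one dimension to a single value and drop it from the keys."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: units
        for key, units in cube.items()
        if key[axis] == value
    }


q1 = slice_cube(CUBE, "quarter", "Q1")
```

The result `q1` is a two-dimensional sub-cube over city and item, containing only the Q1 cells.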
  24. 24. Dice: The dice operation performs a selection on two or more dimensions of a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation.
  25. 25. Dice
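Dicing can be sketched the same way, again assuming a cube modelled as a dictionary keyed by one value per dimension. Unlike a slice, the result keeps every dimension and only restricts the allowed values. The data and the name `dice_cube` are illustrative.

```python
# Hypothetical cube stored as {(city, quarter, item): units_sold}
DIMENSIONS = ("city", "quarter", "item")
CUBE = {
    ("Kathmandu", "Q1", "phone"): 10,
    ("Kathmandu", "Q2", "phone"): 12,
    ("Delhi", "Q1", "phone"): 20,
    ("Delhi", "Q1", "laptop"): 8,
    ("Delhi", "Q2", "laptop"): 9,
}


def dice_cube(cube, criteria):
    """Keep cells whose value on every listed dimension is in the allowed set."""
    return {
        key: units
        for key, units in cube.items()
        if all(key[DIMENSIONS.index(dim)] in allowed
               for dim, allowed in criteria.items())
    }


# Select on two dimensions at once: city and quarter.
delhi_q1 = dice_cube(CUBE, {"city": {"Delhi"}, "quarter": {"Q1"}})
```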
  26. 26. Pivot The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of data. Consider the following diagram that shows the pivot operation.
  27. 27. Pivot
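A pivot changes only the presentation, which a short sketch makes visible. The two-dimensional view and the function `pivot` below are hypothetical.

```python
# Hypothetical 2-D view of a cube: rows are cities, columns are items.
VIEW = {
    "Kathmandu": {"phone": 10, "laptop": 3},
    "Delhi": {"phone": 20, "laptop": 8},
}


def pivot(view):
    """Swap the row and column axes without changing any cell value."""
    rotated = {}
    for row_label, cells in view.items():
        for col_label, value in cells.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated


by_item = pivot(VIEW)
```

Pivoting twice returns the original view, which confirms that no data is aggregated or filtered by the rotation.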
  28. 28. Overview of the KDD Process • The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the "high-level" application of particular data mining methods. • It is of interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization. • The unifying goal of the KDD process is to extract knowledge from data in the context of large databases.
  29. 29. Overview of the KDD Process
  30. 30. Overview of the KDD Process: Developing an understanding of: • the application domain • the relevant prior knowledge • the goals of the end user
  31. 31. Overview of the KDD Process: Creating a target data set: • Selecting a data set, or • Focusing on a subset of variables or data samples on which discovery is to be performed.
  32. 32. Overview of the KDD Process 1. Data Cleaning: • Removal of noise or outliers. • Cleaning is also performed to detect syntax errors. • A parser decides whether a given string of data is acceptable within the data specification.
  33. 33. Overview of the KDD Process 2. Data Integration: where multiple data sources are combined. 3. Data Selection: where data relevant to the analysis task are retrieved from the database.
  34. 34. Overview of the KDD Process 4. Transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations. 5. Data Mining: an essential process where intelligent methods are applied to extract data patterns.
  35. 35. Overview of the KDD Process 6. Pattern Evaluation: to identify the truly interesting patterns representing knowledge, based on some interestingness measures. 7. Knowledge Representation: where visualization and knowledge representation techniques are used to present the mined knowledge to the users.
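The seven steps can be walked through end to end on a toy data set. This is only a sketch under stated assumptions: the two sales sources, their field names, the choice of frequent item pairs as the mining method, and the 0.5 support threshold are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources with different layouts and one noisy row.
STORE_SALES = [
    {"basket": 1, "item": "bread"}, {"basket": 1, "item": "milk"},
    {"basket": 2, "item": "bread"}, {"basket": 2, "item": "milk"},
    {"basket": 2, "item": None},
    {"basket": 3, "item": "bread"}, {"basket": 3, "item": "eggs"},
]
ONLINE_SALES = [
    {"order_id": 4, "product": "Milk"}, {"order_id": 4, "product": "Bread"},
]


def clean(rows, item_key):                      # 1. data cleaning
    return [r for r in rows if r.get(item_key)]


def integrate(store_rows, online_rows):         # 2. data integration
    merged = [(r["basket"], r["item"].lower()) for r in store_rows]
    merged += [(r["order_id"], r["product"].lower()) for r in online_rows]
    return merged


def select(pairs, wanted_items):                # 3. data selection
    return [(b, i) for b, i in pairs if i in wanted_items]


def transform(pairs):                           # 4. transformation
    baskets = {}
    for basket_id, item in pairs:
        baskets.setdefault(basket_id, set()).add(item)
    return baskets


def mine(baskets):                              # 5. data mining
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts


def evaluate(counts, n_baskets, min_support):   # 6. pattern evaluation
    return {pair: c / n_baskets for pair, c in counts.items()
            if c / n_baskets >= min_support}


def present(patterns):                          # 7. knowledge representation
    return [f"{a} + {b}: support {s:.2f}"
            for (a, b), s in sorted(patterns.items())]


rows = integrate(clean(STORE_SALES, "item"), clean(ONLINE_SALES, "product"))
baskets = transform(select(rows, {"bread", "milk", "eggs"}))
report = present(evaluate(mine(baskets), len(baskets), min_support=0.5))
```

Here the support of a pair is the fraction of baskets that contain both items; any other interestingness measure could be substituted in the evaluation step.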
  36. 36. Major Issues in Data Warehousing: Building a data warehouse is very difficult and can be a pain. It is challenging, but it is a fabulous project to be involved in, because when data warehouses work properly, they are magnificently useful, huge fun and unbelievably rewarding. Some of the major issues involved in building a data warehouse are discussed below: • General Issues • Technical Issues • Cultural Issues
  37. 37. General Issues: These include, but are not limited to, the following: • What kind of analysis do the business users want to perform? • Do you currently collect the data required to support that analysis? • How clean is the data? • Are there multiple sources for similar data? • What structure is best for the core data warehouse (i.e., dimensional or relational)?
  38. 38. Technical Issues: These include, but are not limited to, the following: • How much data are you going to ship around your network, and will it be able to cope? • How much disk space will be needed? • How fast does the disk storage need to be? • Are you going to use SSDs to store “hot” data (i.e., frequently accessed information)? • What database and data management technology expertise already exists within the company?
  39. 39. Cultural Issues: These include, but are not limited to, the following: • How do data definitions differ between your operational systems? Different departments and business units often use their own definitions of terms like “customer,” “sale” and “order” within systems. So you’ll need to standardize the definitions and add qualifiers such as “all sales,” “recent sales,” “commercial sales” and so on. • What’s the process for gathering business requirements? Some people will not want to spend time with you. Instead, they will expect you to use your telepathic powers to divine their warehousing and data analysis needs.
  40. 40. Applications of Data Warehousing: Information processing, analytical processing, and data mining are the three types of data warehouse applications, discussed below. Information Processing: A data warehouse allows the data stored in it to be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. Analytical Processing: A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill-down, roll-up (drill-up), and pivoting.
  41. 41. Applications of Data Warehousing: Data Mining: Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. The mining results can be presented using visualization tools.
  42. 42. Application of Data Mining: Market Analysis and Management: target marketing, customer relationship management, market basket analysis, cross-selling, and market segmentation. Find clusters of customers who share the same characteristics (interest, income level, spending habits, etc.) and determine customer purchasing patterns over time. Risk Analysis and Management: forecasting, customer retention, improved underwriting, quality control, competitive analysis, and credit scoring. Fraud Detection and Management: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances, for example to detect suspicious money transactions.
  43. 43. Application of Data Mining: Sports: Data mining can be used to analyze the shots and fouls of different athletes and their weaknesses, helping athletes improve their games. Space Science: Data mining can be used to automate the analysis of image data collected from sky surveys with better accuracy. Internet Web Surf-Aid: Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
  44. 44. Application of Data Mining: Social Web and Networks: There is a growing number of highly popular user-centric applications, such as blogs, wikis and Web communities, that generate a lot of structured and semi-structured information. In these applications, data mining can be used to explain and predict the evolution of social networks, personalize search for social interaction, predict user behavior, etc.
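As one small illustration of the fraud detection application mentioned above, the sketch below flags suspicious transaction amounts. Note the substitution: instead of a model of fraudulent behavior, it builds a profile of a customer's normal behavior from history and flags large deviations, which is a simpler stand-in. The amounts, the threshold of three standard deviations, and the function names are assumptions.

```python
from statistics import mean, stdev

# Hypothetical historical transaction amounts for one customer.
HISTORY = [20.0, 25.0, 22.0, 30.0, 18.0, 24.0]


def build_profile(amounts):
    """Summarize past behavior as a mean and a sample standard deviation."""
    return mean(amounts), stdev(amounts)


def is_suspicious(amount, profile, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    average, spread = profile
    return abs(amount - average) > threshold * spread


profile = build_profile(HISTORY)
new_transactions = [27.0, 500.0, 21.0]
flagged = [a for a in new_transactions if is_suspicious(a, profile)]
```

A production system would use labelled fraud cases and many more features; this only shows the idea of learning from historical data and scoring new instances against it.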