Data Mining


Published in: Education, Technology, Business

  1. 1. KUMARAGURU COLLEGE OF TECHNOLOGY, COIMBATORE
     DATA WAREHOUSING AND DATA MINING
     Presented by:
     K. Santhosh (07bcs43), Contact No: 9788153199
     V. Siddharth (07bcs50), Contact No: 9843286841
  2. 2. DATA WAREHOUSING AND DATA MINING
     ABSTRACT:
     Fast, accurate and scalable data analysis techniques are needed to extract useful information from huge piles of data. A data warehouse is a single, integrated source of decision support information, formed by collecting data from multiple sources, both internal and external to the organization, and transforming and summarizing this information to enable improved decision making. A data warehouse is designed to give users easy access to large amounts of information, and data access is typically supported by specialized analytical tools and applications. Typical applications include decision support systems and executive information systems.
     Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. It is “an information extraction activity whose goal is to discover hidden facts contained in databases”. It is also the process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions. Data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. A data mining model is a description of a specific aspect of a dataset; it produces output values for an assigned set of input values. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.
  3. 3. DATA WAREHOUSING AND DATA MINING
     Introduction:
     Organizations increasingly analyze current and historical data to identify useful patterns and support business strategies. A large amount of the right information is the key to survival in today’s competitive environment, and this kind of information can be made available only if there is a fully integrated enterprise data warehouse.
     What is data warehousing?
     A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management’s decisions.
     NEED FOR A DATA WAREHOUSE:
     • IT or business staff spend a lot of time developing special reports for decision-makers.
     • Many PC-based or small server systems obtain extracts of data and cannot present a holistic view of the entire gamut of information.
     • The same data is present on different systems, in different departments, and users may be unaware of this fact.
     • It is difficult to get meaningful information in a timely manner.
     • Multiple systems give different answers to the same business questions.
     • Decision makers and policy planners do less analysis because sophisticated tools and easily decipherable, timely and comprehensive information are not available.
  4. 4. PURPOSE OF A DATA WAREHOUSE:
     • Better business intelligence for end users
     • Reduction in time to locate, access and analyze information
     • Consolidation of disparate information sources
     • Replacement of older, less-responsive decision support systems
     • Faster time to market for products and services
     • Strategic advantage over competitors
     Data Warehouse Characteristics:
     1. Subject-oriented: the warehouse is organized around the major subjects of the enterprise rather than the major application areas. This is reflected in the need to store decision-support data rather than application-oriented data.
     2. Integrated: the source data come together from different enterprise-wide application systems. The source data is often inconsistent, so the integrated data source must be made consistent to present a unified view of the data to the users.
     3. Time-variant: the data in the warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
     4. Non-volatile: data is not updated in real time but is refreshed from the operational systems on a regular basis. New data is always added as a supplement to the database, rather than as a replacement. The database continually absorbs this new data, incrementally integrating it with previous data.
     DATA WAREHOUSE LIFE CYCLE:
     Data warehousing is a concept. It is not a product that can be purchased off the shelf. It is a set of hardware and software components integrated together which can be used to
  5. 5. analyze the massive amount of stored data efficiently. It is a process through which one can build a successful data warehouse. The following are the five steps towards building a successful data warehouse:
     1. JUSTIFICATION
     2. REQUIREMENT ANALYSIS
     3. DESIGN
     4. DEVELOPMENT AND IMPLEMENTATION
     5. DEPLOYMENT
     Main Components:
     1. Operational data sources: data for the DW is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization’s suppliers or customers.
     2. Operational data store (ODS): a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
     3. Load manager: also called the frontend component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse.
     4. Warehouse manager: performs all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
  6. 6. 5. Query manager: also called the backend component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.
     6. Detailed data, lightly and highly summarized data, and archive/backup data
     7. Meta-data
     8. End-user access tools: these can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.
     Data Flows:
     1. Inflow: the processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
     2. Upflow: the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
     3. Downflow: the processes associated with archiving and backing up the data in the warehouse.
     4. Outflow: the processes associated with making the data available to the end users.
     5. Meta-flow: the processes associated with the management of the meta-data.
     Tools and Technologies:
     The critical steps in the construction of a data warehouse are:
     a. Extraction
     b. Cleansing
     c. Transformation
     After the critical steps, loading the results into the target system can be carried out either by separate products or by a single one. The tools fall into the following categories:
     • Code generators
     • Database data replication tools
     • Dynamic transformation engines
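The critical steps named above (extraction, cleansing, transformation, followed by loading) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a description of any particular product: the record layout, the field names (`customer`, `amount`, `currency`), the conversion rates, and the plain list standing in for the warehouse are all invented for the example.

```python
# Hypothetical source records; field names and values are illustrative only.
SOURCE_ROWS = [
    {"customer": " Alice ", "amount": "100.0", "currency": "USD"},
    {"customer": "Bob", "amount": "", "currency": "USD"},  # missing amount
    {"customer": "carol", "amount": "80.0", "currency": "EUR"},
]

# Made-up conversion rates to a single reporting currency (assumption).
RATES_TO_USD = {"USD": 1.0, "EUR": 2.0}


def extract(rows):
    """Extraction: read raw records from a source system (here, a list)."""
    return list(rows)


def cleanse(rows):
    """Cleansing: drop records with no amount and tidy customer names."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # unusable record, skip it
        cleaned.append({**row, "customer": row["customer"].strip().title()})
    return cleaned


def transform(rows):
    """Transformation: convert every amount to the reporting currency."""
    return [
        {
            "customer": row["customer"],
            "amount_usd": float(row["amount"]) * RATES_TO_USD[row["currency"]],
        }
        for row in rows
    ]


def load(rows, warehouse):
    """Loading: append the prepared records to the target store."""
    warehouse.extend(rows)
    return warehouse


# Run the steps in order; the empty list stands in for the warehouse.
warehouse = load(transform(cleanse(extract(SOURCE_ROWS))), [])
```

Keeping each step as its own function mirrors the way the text separates the steps, and it matches the remark that loading may be handled by a different tool from the one that does the earlier steps.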
  7. 7. The importance of managing meta-data (integration):
     1. Integration concerns the meta-data, that is, “data about data”.
     2. Meta-data is used for a variety of purposes, and managing it is a critical issue in achieving a fully integrated data warehouse.
     3. The major purpose of meta-data is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse.
     4. The meta-data associated with data transformation and loading must describe the source data and any changes that were made to the data.
     5. The meta-data associated with data management describes the data as it is stored in the warehouse.
     6. The meta-data is required by the query manager to generate appropriate queries, and is also associated with the users of queries.
     Data Warehousing Issues:
     1. Semantic Integration: when getting data from multiple sources, mismatches must be eliminated, e.g., different currencies or database schemas.
     2. Heterogeneous Sources: data must be accessed from a variety of source formats and repositories. Replication capabilities can be exploited here.
     3. Load, Refresh, Purge: data must be loaded, periodically refreshed, and purged when it is too old.
     4. Metadata Management: the source, loading time, and other information must be tracked for all data in the warehouse.
     Star Schema:
     A logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data (which can be denormalized).
     Snowflake Schema:
  8. 8. A variant of the star schema where dimension tables do not contain denormalized data.
     Starflake Schema:
     A hybrid structure that contains a mixture of star and snowflake schemas.
     The benefits of data warehousing:
     The potential benefits of data warehousing are:
     1. High returns on investment
     2. Substantial competitive advantage
     3. Increased productivity of corporate decision-makers
     4. More cost-effective decision making
     5. Better enterprise intelligence
     6. Enhanced customer service
     7. Better asset/liability management
     8. Business process reengineering
     9. Empowerment of all employees
     Applications:
     On-Line Transaction Processing (OLTP):
     OLTP systems are a major kind of enterprise application. Examples: order entry systems, inventory control systems, reservation systems, point-of-sale systems, tracking systems, etc.
     Executive Information Systems (EIS):
     They present information at the highest level of summarization using corporate business measures. They are designed for extreme ease of use and, in many cases, only a mouse is required. Graphics are usually generously incorporated to provide at-a-glance indications of performance.
     Decision Support Systems (DSS):
  9. 9. They ideally present information in graphical and tabular form, providing the user with the ability to drill down on selected information. Compared with an EIS, they offer increased detail and more data manipulation options.
     DATA MINING
     What is data mining?
     Data mining refers to the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of tools used for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Data mining is about techniques for finding and describing structural patterns in data.
     Definition:
     Data mining is the process of finding correlations or patterns among fields in large relational databases. It is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).
     Different Types of Data Mining:
     1. Business Data Mining
     2. Scientific Data Mining
     3. Internet Data Mining
     Five major elements of Data Mining:
  10. 10. 1. Extract, transform, and load transaction data onto the data warehouse system.
     2. Store and manage the data in a multidimensional database system.
     3. Provide data access to business analysts and information technology professionals.
     4. Analyze the data with application software.
     5. Present the data in a useful format, such as a graph or table.
     Requirements of Data Mining:
     1. Handling of different types of data
     2. Efficiency and scalability of algorithms
     3. Usefulness, certainty and expressiveness of results
     4. Expression of various kinds of mining results
     5. Interactive mining of knowledge at multiple levels
     6. Mining information from different sources of data
     7. Protection of privacy and data security
     Various kinds of data on which Data Mining is applied:
     1. Relational database
     2. Data warehouse
     3. Transactional database
     4. Multimedia database
     5. Spatial and temporal data
     6. Object-relational database
     Data mining applications:
     A main application of data mining is WEB MINING.
     What is Web Mining?
     “Web mining can be broadly defined as the automated discovery and analysis of useful information from the Web documents and services using data mining techniques.”
  11. 11. Web mining is the application of data mining or other information processing techniques to the WWW in order to find useful patterns. People can take advantage of these patterns to access the WWW more efficiently.
     NEED FOR WEB MINING:
     Nowadays, the World Wide Web is a popular and interactive medium, ideal for publishing information. It is huge, diverse and dynamic, and thus raises issues of scalability, multimedia data and temporal data respectively. As a result, users are currently “drowning” in an information overload that expands at a rate that far outpaces the human ability to process and exploit it.
     Domains of Web Mining:
     There are three domains that pertain to Web mining:
     1. Web Content Mining
     2. Web Structure Mining
     3. Web Usage Mining
     1. Web Content Mining
     Web content mining is an automatic process that extracts patterns from on-line information, such as HTML files, images, or e-mails, and it already goes beyond keyword extraction or simple statistics of words and phrases in documents. Web content mining is the "process of information or resource discovery from millions of sources across the World Wide Web". There are two approaches in Web content mining:
     • Agent-based approaches
     • Database approaches
     Agent-based approaches:
     The agent-based approach involves artificial intelligence systems that can "act autonomously or semi-autonomously on behalf of a particular user, to discover and organize Web-based information". Some intelligent Web agents can use a user profile to
  12. 12. search for relevant information, then organize and interpret the discovered information (e.g., Harvest).
     Database approaches:
     The database approach focuses on "integrating and organizing the heterogeneous and semi-structured data on the Web into more structured and high-level collections of resources." These "metadata, are organized into structured collections (e.g., relational or object-oriented databases) and can be analyzed".
     2. Web Structure Mining
     Web structure mining works on the data that describes the organization of content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. This can be represented as a tree structure, where the <html> tag becomes the root of the tree. The principal kind of inter-page structure information is the hyperlinks connecting one page to another.
     3. Web Usage Mining
     Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the Web access logs of different Web sites can help to understand user behavior and the Web structure, and thereby to improve the design of this colossal collection of resources.
     Web Mining Techniques
     The common techniques for Web mining are:
     • Clustering/classification
     • Association rules
     • Path analysis
     • Sequential patterns
     1. Clustering/classification
     This technique is used to develop profiles of items with similar characteristics. This ability enhances the discovery of relationships that are otherwise not obvious. E.g.:
  13. 13. Classification of Web access logs allows a company to discover the average age of customers who order a certain product.
     2. Association rules
     Rules that govern "databases of transactions where each transaction consists of a set of items." This technique is used to predict the correlation of items "where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items."
     3. Path analysis
     A technique that involves the generation of some form of graph that "represents relation[s] defined on Web pages." This can be the physical layout of a Web site in which the Web pages are nodes and the hypertext links between these pages are directed edges. E.g.: which paths do users travel before they go to a particular URL?
     4. Sequential patterns
     This technique is applied to Web server access logs. The purpose is to discover sequential patterns that indicate user visit patterns over a certain period.
     Web mining as a tool:
     Web mining can be a promising tool to address ineffective search engines, which produce incomplete indexing and retrieve information of unverified reliability. Web mining not only discovers information from mounds of data on the WWW, but also monitors and predicts user visit habits. This gives designers more reliable information for structuring and designing a Web site. Web mining technology can help librarians design Web sites with paths that can be traveled easily by end users, saving time and effort. E.g.: Web mining technology and academic librarianship.
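The association-rule technique described above, where one set of items implies another "with a certain degree of confidence", can be made concrete with a small Python sketch. It uses the usual support and confidence measures; the transactions, the item names (`home`, `catalog`, `order`), and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical transactions: each set lists the items (for example, pages
# requested) seen together in one visit. The data is invented.
TRANSACTIONS = [
    {"home", "catalog", "order"},
    {"home", "catalog"},
    {"home", "order"},
    {"catalog", "order"},
    {"home", "catalog", "order"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    return (count(antecedent | consequent, transactions)
            / count(antecedent, transactions))


# Evaluate the candidate rule {catalog} -> {order}.
rule_support = support({"catalog", "order"}, TRANSACTIONS)
rule_confidence = confidence({"catalog"}, {"order"}, TRANSACTIONS)
```

With this toy data, `catalog` and `order` occur together in 3 of the 5 visits (support 0.6), and 3 of the 4 visits that contain `catalog` also contain `order` (confidence 0.75). A real miner would search over many candidate itemsets rather than test a single hand-picked rule.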
  14. 14. Conclusion:
     Data warehousing provides the means to change raw data into information for making effective business decisions; the emphasis is on information, not data. The data warehouse is the hub for decision support data. Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks. It can benefit business, medicine, and science, but it needs more efficient algorithms to speed up the mining process. Web mining is a huge, interdisciplinary and very dynamic scientific area, converging from several research communities such as databases, information retrieval and artificial intelligence, especially machine learning and natural language processing. The area is so broad today partly because of the interests of these various research communities.
     References:
     • Data Base Systems - Elmasri, Navathe
     • Data Mining Technologies - Arun K. Pujari
     • Data Mining and Data Warehousing and OLAP - A. Berson, S. J. Smith
     • Database Management System - Sylbardcards