DEPARTMENT OF BUSINESS AND INDUSTRIAL
MANAGEMENT
TERM ASSIGNMENT 2014-15
INFORMATION TECHNOLOGY FOR BUSINESS
FYMBA- SEM-I
...
data mining and data warehousing

data mining and data warehousing meaning purpose issues cases example implications applications

Published in: Business
Overview of Data Mining:
• Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both.
• Data mining, the extraction of hidden predictive information from large databases, is a powerful technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.
• Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

DEFINITION OF 'DATA MINING'
• A process used by companies to turn raw data into useful information. By using software to look for patterns in large batches of data, businesses can learn more about their customers and develop more effective marketing strategies, as well as increase sales and decrease costs. Data mining depends on effective data collection and warehousing as well as computer processing.
• Grocery stores are well-known users of data mining techniques. Many supermarkets offer free loyalty cards that give customers access to reduced prices not available to non-members. The cards make it easy for stores to track who is buying what, when they are buying it, and at what price. After analyzing this data, the stores can use it for multiple purposes, such as offering customers coupons targeted to their buying habits and deciding when to put items on sale and when to sell them at full price.

Data Mining Engine
• The data mining engine is essential to the data mining system. It consists of a set of functional modules.
• These modules perform the following tasks:
o Characterization
o Association and Correlation Analysis
o Classification
o Prediction
o Cluster analysis
o Outlier analysis
o Evolution analysis

Purpose and Uses of Data Mining
• The purpose of data mining is to identify patterns in order to make predictions from information contained in databases. It allows the user to be proactive in identifying and predicting trends with that information.
• Common uses of data mining in government include knowledge discovery, fraud detection, analysis of research, decision support, and website personalization.
• The most common federal government uses of data mining as identified by GAO include:
1) Improving service or performance
2) Detecting fraud, waste, and abuse
3) Analyzing scientific and research information
4) Managing human resources
5) Detecting criminal activities or patterns
6) Analyzing intelligence and detecting terrorist activities
• State government data mining efforts include programs to ensure that the proper beneficiaries of state benefits programs receive the correct amount of benefits. Such uses can save states substantial amounts of money that would otherwise be erroneously paid out in the form of state benefits.
• Moreover, in a recent report, GAO found that twenty-one states are using data mining software to look for unusual patterns in claims, provider, and beneficiary information stored in data warehouses in order to identify potential provider abuse.
Major Data Mining Tasks
• The two high-level primary goals of data mining, in practice, are prediction and description.
1) Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest.
2) Description focuses on finding human-interpretable patterns describing the data.
• The relative importance of prediction and description for particular data mining applications can vary considerably. However, in the context of the knowledge discovery in databases (KDD) process, description tends to be more important than prediction. This is in contrast to pattern recognition and machine learning applications (such as speech recognition), where prediction is often the primary goal.
• The goals of prediction and description are achieved by using the following primary data mining tasks:
1. Classification is learning a function that maps (classifies) a data item into one of several predefined classes.
2. Regression is learning a function which maps a data item to a real-valued prediction variable.
3. Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data.
o A cluster is a group of similar objects. Cluster analysis means forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
o Closely related to clustering is the task of probability density estimation, which consists of techniques for estimating, from data, the joint multivariate probability density function of all of the variables/fields in the database.
4. Summarization involves methods for finding a compact description for a subset of data.
5. Dependency Modeling consists of finding a model which describes significant dependencies between variables.
o Dependency models exist at two levels:
i. The structural level of the model specifies (often graphically) which variables are locally dependent on each other, and
ii. The quantitative level of the model specifies the strengths of the dependencies using some numerical scale.
6. Change and Deviation Detection focuses on discovering the most significant changes in the data from previously measured or normative values.

Mining Methodology and User Interaction Issues
These are issues of the following kinds:
1. Mining different kinds of knowledge in databases
• Different users have different needs and may be interested in different kinds of knowledge. Data mining therefore has to cover a broad range of knowledge discovery tasks.
2. Interactive mining of knowledge at multiple levels of abstraction
• The data mining process needs to be interactive, so that users can focus the search for patterns, providing and refining data mining requests based on returned results.
3. Incorporation of background knowledge
• Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
4. Data mining query languages and ad hoc data mining
• A data mining query language that lets the user describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
5. Presentation and visualization of data mining results
• Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.
6. Handling noisy or incomplete data
• Data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
7. Pattern evaluation
• This refers to the interestingness of the discovered patterns. Many discovered patterns are not interesting, because they either represent common knowledge or lack novelty, so patterns need to be evaluated before they are presented.

Performance Issues
These are the following issues:
1. Efficiency and scalability of data mining algorithms
• To extract information effectively from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
2. Parallel, distributed, and incremental mining algorithms
• Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the data again from scratch.

Diverse Data Types Issues
1. Handling of relational and complex types of data
• A database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
2. Mining information from heterogeneous databases and global information systems
• Data is available from different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured, so mining knowledge from them adds challenges to data mining.
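The partition-and-merge idea described under Performance Issues can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy transactions and the function names `count_items` and `merge_counts` are hypothetical, and the partitions are processed sequentially here, although each call could run on a separate worker.

```python
from collections import Counter

def count_items(partition):
    """Count how many transactions in one partition contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))  # count an item once per transaction
    return counts

def merge_counts(partial_counts):
    """Merge the per-partition counts into one global count."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Hypothetical toy data, already split into two partitions.
partitions = [
    [["bread", "milk"], ["bread", "butter"]],
    [["milk", "butter"], ["bread", "milk", "butter"]],
]

# Each partition is mined independently, then the results are merged.
partial = [count_items(p) for p in partitions]
global_counts = merge_counts(partial)

# Incremental step: fold in a new batch without recounting the old data.
new_batch = [["bread"]]
global_counts.update(count_items(new_batch))
```

Because the per-partition counts simply add up, the merge step is cheap, and the same property is what makes the incremental update possible.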
Classification and Prediction Issues
• The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities:
1) Data Cleaning
• Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
2) Relevance Analysis
• The database may also contain irrelevant attributes. Correlation analysis is used to find out whether any two given attributes are related.
3) Data Transformation and Reduction
• The data can be transformed by either of the following methods:
o Normalization - Normalization involves scaling all values of a given attribute so that they fall within a small specified range. It is used when the learning step relies on neural networks or on methods involving distance measurements.
o Generalization - The data can also be transformed by generalizing it to a higher-level concept. Concept hierarchies can be used for this purpose.

Data Mining Applications
Here is the list of areas where data mining is widely used:
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection
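As a rough illustration of the data preparation steps described under Classification and Prediction Issues, the sketch below fills a missing value with the most common value of the attribute and then applies min-max normalization. The function names and the sample `ages` column are assumptions made for this example; min-max scaling is one common form of normalization, not the only one.

```python
from collections import Counter

def fill_missing_with_mode(values):
    """Replace None entries with the most commonly occurring known value."""
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# Hypothetical attribute column with one missing value.
ages = [20, None, 30, 30, 40]
cleaned = fill_missing_with_mode(ages)   # the missing entry becomes 30
scaled = min_max_normalize(cleaned)      # values now lie between 0 and 1
```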
1. FINANCIAL DATA ANALYSIS
• Financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining.
• Here are a few typical cases:
o Design and construction of data warehouses for multidimensional data analysis and data mining.
o Loan payment prediction and customer credit policy analysis.
o Classification and clustering of customers for targeted marketing.
o Detection of money laundering and other financial crimes.

2. RETAIL INDUSTRY
• Data mining has great application in the retail industry because retailers collect large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. The quantity of data collected will naturally continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
• Data mining in the retail industry helps identify customer buying patterns and trends, which leads to improved quality of customer service and good customer retention and satisfaction. Here is a list of examples of data mining in the retail industry:
o Design and construction of data warehouses based on the benefits of data mining.
o Multidimensional analysis of sales, customers, products, time, and region.
o Analysis of the effectiveness of sales campaigns.
o Customer retention.
o Product recommendation and cross-referencing of items.

3. TELECOMMUNICATION INDUSTRY
• Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail, web data transmission, etc.
• Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important for understanding the business.
• Data mining in the telecommunication industry helps identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service.
• Here is a list of examples where data mining improves telecommunication services:
o Multidimensional analysis of telecommunication data.
o Fraudulent pattern analysis.
o Identification of unusual patterns.
o Multidimensional association and sequential pattern analysis.
o Mobile telecommunication services.
o Use of visualization tools in telecommunication data analysis.

4. BIOLOGICAL DATA ANALYSIS
• There has been vast growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics.
• Data mining contributes to biological data analysis in the following ways:
o Semantic integration of heterogeneous, distributed genomic and proteomic databases.
o Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.
o Discovery of structural patterns and analysis of genetic networks and protein pathways.
o Association and path analysis.
o Visualization tools in genetic data analysis.
5. OTHER SCIENTIFIC APPLICATIONS
• The applications discussed above tend to handle relatively small and homogeneous data sets, for which statistical techniques are appropriate. By contrast, huge amounts of data have been collected from scientific domains such as geosciences and astronomy, and large data sets are being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering, and fluid dynamics.
• Following are the applications of data mining in the field of scientific applications:
o Data warehouses and data preprocessing.
o Graph-based mining.
o Visualization and domain-specific knowledge.

6. INTRUSION DETECTION
• Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. Increased usage of the Internet and the availability of tools and tricks for intruding on and attacking networks have made intrusion detection a critical component of network administration.
• Here is the list of areas in which data mining technology may be applied for intrusion detection:
o Development of data mining algorithms for intrusion detection.
o Association and correlation analysis, and aggregation to help select and build discriminating attributes.
o Analysis of stream data.
o Distributed data mining.
o Visualization and query tools.
Data Mining Process
• Data mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related, also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
• The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one that has the most direct business applications.
• The process of data mining consists of three stages:
1. The initial exploration.
2. Model building or pattern identification with validation/verification.
3. Deployment (i.e., the application of the model to new data in order to generate predictions).

Stage 1: Exploration
• This stage usually starts with data preparation, which may involve cleaning data, data transformations, selecting subsets of records and, in the case of data sets with large numbers of variables ("fields"), performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods being considered).
• Then, depending on the nature of the analytic problem, this first stage may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods.

Stage 2: Model building and validation
• This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact it sometimes involves a very elaborate process. A variety of techniques have been developed to achieve that goal, many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best.
• These techniques, which are often considered the core of predictive data mining, include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.

Stage 3: Deployment
• This final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
• The concept of data mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty.
• In recent times, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business data mining (e.g., Classification Trees), but data mining is still based on the conceptual principles of statistics, including traditional Exploratory Data Analysis (EDA) and modeling, and it shares with them both some components of its general approaches and specific techniques.
• However, an important general difference in focus and purpose between data mining and traditional EDA is that data mining is oriented more towards applications than towards the basic nature of the underlying phenomena. In other words, data mining is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables is not the main goal of data mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, data mining accepts, among others, a "black box" approach to data exploration or knowledge discovery, and uses not only traditional EDA techniques but also techniques such as neural networks, which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.
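To make the bagging-by-voting technique named in Stage 2 more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the one-feature toy data, the very simple "stump" model (a threshold halfway between the two class means, assuming class 1 has the larger values), and the names `train_stump`, `bagging_fit`, and `bagging_predict`. Real projects would normally rely on an established library instead.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-feature threshold classifier on (x, label) pairs.

    Assumes class 1 tends to have larger x values than class 0.
    """
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        # Degenerate bootstrap sample with one class only: constant prediction.
        label = 0 if zeros else 1
        return lambda x: label
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0

def bagging_fit(data, n_models=25, seed=0):
    """Train one model per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (feature value, class label).
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
models = bagging_fit(data)
low_prediction = bagging_predict(models, 0.5)
high_prediction = bagging_predict(models, 9.0)
```

For a numeric target, the vote in `bagging_predict` would be replaced by an average of the individual predictions, which is the "Averaging" variant mentioned above.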
The Scope of Data Mining
• Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
1. Automated prediction
• Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered quickly and directly from the data.
• A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
2. Automated discovery
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step.
• An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
• Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high-performance parallel processing systems, they can analyze massive databases in minutes.
Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.

Techniques of Data Mining
1. Decision trees
• Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
• In the decision tree technique, the root of the tree is a simple question or condition that has multiple answers. Each answer then leads to a further set of questions or conditions that narrow down the data, so that a final decision can be made.
• For example, a decision tree can be used to determine whether or not to play tennis.
• Starting at the root node, if the outlook is overcast, we should definitely play tennis. If it is rainy, we should play tennis only if the wind is weak. If it is sunny, we should play tennis only if the humidity is normal.
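The play-tennis rules just described translate directly into nested conditions. The sketch below hard-codes them purely for illustration; the function name `play_tennis` and the string values used for the attributes are assumptions, and a real decision tree method such as CART would learn the tree from data rather than have it written by hand.

```python
def play_tennis(outlook, humidity, wind):
    """Walk the decision tree from the root (outlook) down to a decision."""
    if outlook == "overcast":
        return True                      # overcast: always play
    if outlook == "rainy":
        return wind == "weak"            # rainy: play only if the wind is weak
    if outlook == "sunny":
        return humidity == "normal"      # sunny: play only if humidity is normal
    raise ValueError(f"unknown outlook: {outlook!r}")

decision_overcast = play_tennis("overcast", humidity="high", wind="strong")
decision_rainy = play_tennis("rainy", humidity="normal", wind="strong")
decision_sunny = play_tennis("sunny", humidity="normal", wind="weak")
```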
2. Association
• Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between items in the same transaction. That is why the association technique is also known as the relation technique. It is used in market basket analysis to identify sets of products that customers frequently purchase together.
• Retailers use the association technique to research customers' buying habits. Based on historical sales data, retailers might find that customers often buy crisps when they buy beer, and can therefore put beer and crisps next to each other to save customers time and increase sales.

3. Classification
• Classification is a classic data mining technique based on machine learning. Basically, classification is used to assign each item in a set of data to one of a predefined set of classes or groups. Classification methods make use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics.
• In classification, we develop software that can learn how to classify data items into groups.
• For example, we can apply classification to the task "given all records of employees who left the company, predict who will probably leave the company in a future period." In this case, we divide the employee records into two groups named "leave" and "stay", and then ask our data mining software to classify the employees into those groups.

4. Clustering
• Clustering is a data mining technique that automatically forms meaningful or useful clusters of objects that have similar characteristics.
• The clustering technique defines the classes and puts objects in each class, whereas in classification, objects are assigned to predefined classes. To make the concept clearer, we can take book management in a library as an example. A library holds a wide range of books on various topics. The challenge is how to keep those books in a way that lets readers take several books on a particular topic without hassle. By using the clustering
technique, we can keep books that have some kind of similarity in one cluster or on one shelf and label it with a meaningful name. Readers who want books on that topic then only have to go to that shelf instead of searching the entire library.

5. Prediction
• Prediction, as its name implies, is a data mining technique that discovers relationships among independent variables and relationships between dependent and independent variables.
• For instance, prediction analysis can be used in sales to predict future profit: if we treat sales as the independent variable, profit is the dependent variable. Based on historical sales and profit data, we can then fit a regression curve that is used for profit prediction.

6. Sequential Patterns
• Sequential pattern analysis is a data mining technique that seeks to discover or identify similar patterns, regular events, or trends in transaction data over a business period.
• In sales, businesses can use historical transaction data to identify sets of items that customers buy together at different times of the year. They can then use this information to recommend those items to customers, with better deals based on their past purchasing frequency.

Challenges in Web Mining
• The web poses great challenges for resource and knowledge discovery, based on the following observations:
1. The web is too huge
• The web is very large and rapidly growing. It appears to be too huge for data warehousing and data mining.
2. Complexity of Web pages
• Web pages do not have a unifying structure. They are very complex compared to traditional text documents. The web's digital library holds a huge number of documents, and these are not arranged in any particular sorted order.
3. Web is a dynamic information source
• Information on the web is rapidly updated. Data such as news, stock markets, weather, sports, and shopping is regularly updated.
4. Diversity of user communities
• The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. More than 100 million workstations are connected to the Internet, and the number is still rapidly increasing.
5. Relevancy of information
• A particular person is generally interested in only a small portion of the web; the rest contains information that is not relevant to that user and may swamp the desired results.

Advantages of Data Mining
1. Marketing / Retail
• Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail or online marketing campaigns. Using the results, marketers can take an appropriate approach to selling profitable products to targeted customers.
• Data mining brings many of the same benefits to retail companies. Through market basket analysis, a store can arrange its products so that customers can conveniently buy frequently purchased items together. It also helps retail companies offer discounts on particular products that will attract more customers.
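Market basket analysis, mentioned above, is usually expressed with two measures: the support of an itemset (the share of transactions containing it) and the confidence of a rule such as "beer implies crisps" (the share of beer transactions that also contain crisps). The sketch below computes both on a handful of made-up baskets; the data and the function names `support` and `confidence` are assumptions for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Hypothetical shopping baskets.
baskets = [
    {"beer", "crisps"},
    {"beer", "crisps", "bread"},
    {"beer", "milk"},
    {"bread", "milk"},
]

beer_crisps_support = support(baskets, {"beer", "crisps"})
beer_to_crisps_confidence = confidence(baskets, {"beer"}, {"crisps"})
```

In this toy data, half of all baskets contain both items, and two out of the three beer baskets also contain crisps. A retailer would typically act only on rules whose support and confidence both exceed chosen minimum thresholds.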
  17. 17. 2. Finance / Banking  Data mining gives financial institutions insight into loan information and credit reporting. By building a model from historical customer data, a bank or financial institution can distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent credit card transactions to protect card owners. 3. Manufacturing  By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters.  For example, semiconductor manufacturers face the challenge that even when the conditions of the manufacturing environments at different wafer production plants are similar, the quality of the wafers is not the same, and some wafers have defects for unknown reasons. Data mining has been applied to determine the ranges of control parameters that lead to the production of a golden wafer. Those optimal control parameters are then used to manufacture wafers of the desired quality. 4. Governments  Data mining helps government agencies by digging through and analyzing records of financial transactions to build patterns that can detect money laundering or other criminal activities. Disadvantages of data mining 1. Privacy Issues  Concerns about personal privacy have increased enormously in recent years, especially as the internet booms with social networks, e-commerce, forums and blogs.  Because of privacy issues, people are afraid that their personal information will be collected and used in unethical ways that could cause them a lot of trouble. Businesses collect information about their customers in many ways to understand their purchasing behaviour and trends.  However, businesses do not last forever; some day they may be acquired by others or cease to exist. At that point, the personal information they own may be sold to others or leaked.
  18. 18. 2. Security issues  Security is a big issue. Businesses hold information about their employees and customers, including social security numbers, birthdays, payroll and so on.  However, how well this information is protected is still in question. There have been many cases in which hackers accessed and stole large amounts of customer data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, credit card theft and identity theft have become a big problem. 3. Misuse of information/inaccurate information  Information collected through data mining for ethical purposes can be misused. It may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.  In addition, data mining techniques are not perfectly accurate. If inaccurate information is used for decision-making, it can have serious consequences. Data Mining Example: Marketing  In marketing, in the area of advertising campaigns, data mining can often increase the response and purchase rate by a factor of two to three.  The following describes a typical [data mining] example:  A company wants to launch an advertising campaign for a product. Among its present customers, the company wants to send product information to those with a high probability of purchasing the product. The company has data describing past customer behaviour and personal data about each of its customers. There are also customers who have already bought the product, e.g. in a trial period. The customers of the trial period are divided into two classes: those who have bought the product and those who have not. With this data, a prediction model is created to predict the probability of purchasing the product. That probability is then predicted for all other customers, and only those with a higher probability are addressed. 
As a side effect, the company learns from this data mining
  19. 19. analysis which customer attributes are the relevant drivers of buying a specific product (or at least of being very interested in it).  The example shows how data mining can help marketing predict the purchase probability of customers for a specific product. This reduces cost, because sales activity can be focused much better (lower cost for mailings and flyers, or for cost-intensive on-site visits by sales agents). Customers benefit at the same time, because the average relevance of the company’s offers increases (or, put the other way round, the “spam” quota of non-relevant offers is reduced).
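The marketing example above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the method of any particular tool: the "prediction model" here is just the purchase rate observed per customer segment in the trial period, and the segment names, customer ids and the 0.5 threshold are made up for the example.

```python
from collections import defaultdict

def fit_purchase_rates(trial_customers):
    """Estimate the purchase probability per segment from trial customers.

    trial_customers: iterable of (segment, bought) pairs, `bought` is a bool.
    Returns (rates, overall): a dict segment -> purchase rate, and the
    overall purchase rate across all trial customers.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [buyers, total]
    for segment, bought in trial_customers:
        counts[segment][0] += int(bought)
        counts[segment][1] += 1
    rates = {seg: buyers / total for seg, (buyers, total) in counts.items()}
    total_buyers = sum(buyers for buyers, _ in counts.values())
    total = sum(n for _, n in counts.values())
    overall = total_buyers / total if total else 0.0
    return rates, overall

def select_targets(customers, rates, overall, threshold):
    """Return the ids of customers whose predicted probability >= threshold."""
    targets = []
    for customer_id, segment in customers:
        # A segment never seen in the trial falls back to the overall rate.
        probability = rates.get(segment, overall)
        if probability >= threshold:
            targets.append(customer_id)
    return targets

# Hypothetical trial-period data: (segment, bought the product?)
trial = [("young_urban", True), ("young_urban", True), ("young_urban", False),
         ("senior_rural", False), ("senior_rural", False), ("senior_rural", True)]
rates, overall = fit_purchase_rates(trial)

# Hypothetical remaining customers: (customer id, segment)
others = [(101, "young_urban"), (102, "senior_rural"), (103, "unknown")]
targets = select_targets(others, rates, overall, threshold=0.5)
```

The per-segment rates also show the side effect mentioned above: they indicate which customer attributes go together with buying the product. A real campaign would normally use a richer model (for example a decision tree or logistic regression) over many attributes.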
  20. 20. A Producer wants to know... 1. Which are our lowest/highest margin customers? 2. Who are our customers and what products are they buying? 3. Which customers are most likely to go to the competitors? 4. What impact will new products/services have on revenues and margins? 5. Which product promotions have the biggest impact on revenues? 6. What is the most effective distribution channel? What is a Data Warehouse?  A single, complete and consistent store of data obtained from a variety of different sources made available to end users in what they can understand and use in a business context (Barry Devlin).  “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)  Data warehousing: o The process of constructing and using data warehouses. Characteristics of Data warehousing 1. Subject-Oriented  Organized around major subjects, such as customer, product, sales.  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.  Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
  21. 21. 2. Integrated  Constructed by integrating multiple, heterogeneous data sources. o Relational databases, flat files, on-line transaction records.  Data cleaning and data integration techniques are applied. o Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.  E.g., hotel price: currency, tax, whether breakfast is covered, etc. o When data is moved to the warehouse, it is converted. 3. Time Variant  The time horizon for the data warehouse is significantly longer than that of operational systems. o Operational database: current value data. o Data warehouse data: provides information from a historical perspective (e.g., the past 5-10 years).  Every key structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a “time element”. 4. Non-Volatile  Once data has been entered into the warehouse, it is not changed or updated.  A physically separate store of data transformed from the operational environment.  Operational update of data does not occur in the data warehouse environment. o Does not require transaction processing, recovery, and concurrency control mechanisms. o Requires only two operations in data accessing: initial loading of data and access of data. Purpose of Data warehousing (why data warehousing)  As part of a company's business intelligence solution, a data warehouse is integral to the gathering, processing and use of all the information a business receives daily. A strong business intelligence plan, coupled with a robust data warehouse, will guarantee a business has all the tools needed to make the right decisions for today and for the future.
  22. 22.  The term "Business Intelligence" describes the process a business uses to gather all its raw data from multiple sources and process it into practical information that it applies to determine the effectiveness of business processes, create policy, forecast trends, analyze the market and much more. Data warehousing is an integral part of any effective business intelligence endeavour. Data warehousing is more than just a database-like method of storing information. While a database simply holds data, a well-designed data warehousing system actually comprises three segments: a. Staging: raw data is stored and manipulated by developers. Their goal in this stage is to take raw information from widely disparate sources and standardize and organize it, readying it for integration. b. Integration: the data is further categorized and stored logically according to the needs of the end user, allowing easier access. c. Access: data is presented to users in a coherent way that is easy to understand and use. Clients employ computer applications to both access and analyze the information the data warehousing system provides.  While many companies are on board with data warehousing and storage and use business intelligence systems daily, others find the concept of a data warehouse and its benefits hard to grasp.  Here is a look at some of the pros of employing a data warehousing solution: 1. Improved user access: a standard database can be read and manipulated by programs like SQL Query Studio or the Oracle client, but there is considerable ramp-up time before end users can use these apps effectively to get what they need. Business intelligence and data warehouse end-user access tools are built specifically for the purposes data warehouses serve: analysis, benchmarking, prediction and more. 2. 
Better consistency of data: developers work with data warehousing systems after data has been received so that all the information contained in the data warehouse is standardized. Only uniform data can be used efficiently for successful comparisons. Other solutions simply cannot match a data warehouse's level of consistency. 3. All-in-one:
  23. 23. o A data warehouse can receive data from many different sources, meaning any system in a business can contribute its data. o Let's face it: different business segments use different applications. Only a proper data warehouse solution can receive data from all of them and give a business the "big picture" view that is needed to analyze the business, make plans, track competitors and more. 4. Advanced query processing: in most businesses, even the best database systems are bound to either a single server or a handful of servers in a cluster. A data warehouse is a purpose-built hardware solution far more advanced than standard database servers. This means a data warehouse will process queries much faster and more effectively, leading to efficiency and increased productivity. 5. Retention of data history: end-user applications typically don't have the ability, not to mention the space, to maintain much transaction history and keep track of multiple changes to data. Data warehousing solutions can track all alterations to data, providing a reliable history of all changes, additions and deletions. With a data warehouse, the integrity of data is ensured. 6. Disaster recovery implications: a data warehouse system offers a great deal of security when it comes to disaster recovery. Since data from disparate systems is all sent to a data warehouse, that data warehouse essentially acts as another information backup source. Considering the data warehouse will also be backed up, that's now four places where the same information is stored: the original source, its backup, the data warehouse and its subsequent backup. This is unparalleled information security. Data Warehouse Tools and Utilities Functions The following are the functions of data warehouse tools and utilities: 1. Data Extraction  Data Extraction involves gathering the data from multiple heterogeneous sources. 2. 
Data Cleaning  Data Cleaning involves finding and correcting the errors in data.
  24. 24. 3. Data Transformation  Data Transformation involves converting data from legacy format to warehouse format. 4. Data Loading  Data Loading involves sorting, summarizing, consolidating, checking integrity and building indices and partitions. 5. Refreshing  Refreshing involves propagating updates from the data sources to the warehouse. DATA WAREHOUSING APPLICATIONS  In computing, a data warehouse is a system used for reporting and analysis.  These systems store historical as well as current data, which is used to produce trend reports for senior management, such as annual and quarterly comparisons.  A data warehouse brings all the data together in one central location. All the data stored in the warehouse is uploaded from the operational systems and passes through various operations on the way. The data warehouse environment consists of various source systems that supply the warehouse with data, and various data integration technologies are used to make the data ready to use.  Various architectures, tools and applications are involved in storing data in the warehouse.  A data warehouse traditionally has its foundation on a mainframe server. The data is extracted and organized there to serve user queries. This gives the advantage of gathering information and data from diverse sources for easy access and analysis. The applications of data warehousing are data mining, web mining and decision support systems. 1. Data mining  It is the analysis of data for new relationships between various types of data. It is basically done by sorting and analysing the data to recognize the patterns and relationships
  25. 25. between various types of data. Patterns are associated by relating one event to another. After the patterns are analysed, a sequence or path is set up in which one event leads to the occurrence of another. All the patterns are classified and the data is organized accordingly. The new patterns discovered each time are used for predictive analysis. Data mining techniques are used in research areas. 2. Web mining  It is becoming important in the field of customer relationship management. It is basically the integration of data and information by data mining methodologies. The information is gathered from all over the world. In customer relationship management, it is used to observe customer behaviour and needs more accurately, which helps success in the market. Data mining parameters such as classification, association and clustering are used to evaluate the data. 3. Decision support system  It is an application of data warehousing that analyses business data and presents the results in a way that makes business decisions easier for business users. It is considered an informational application. Its basic purposes include comparing the sales figures of various weeks. Revenue figures are also forecast based on product sales, and past experience and sales are taken into account to reach the right decisions. A decision support system presents its information graphically. It may also include an artificial intelligence system for the purpose. From all the points above, it is clear that data warehousing has many applications, which are used in almost every field. 
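As a rough illustration of the decision support tasks described above (comparing the sales figures of various weeks and forecasting revenue from past sales), here is a minimal Python sketch. The sales numbers are invented, and the three-week moving average is only one simple forecasting choice assumed for this example; a real decision support system would read its figures from the warehouse and typically use more refined methods.

```python
def week_over_week_change(weekly_sales):
    """Percentage change between each week and the week before it.

    Assumes no week has sales of exactly zero (division by the previous week).
    """
    return [
        (current - previous) / previous * 100.0
        for previous, current in zip(weekly_sales, weekly_sales[1:])
    ]

def moving_average_forecast(weekly_sales, window=3):
    """Forecast next week's sales as the mean of the last `window` weeks."""
    recent = weekly_sales[-window:]
    return sum(recent) / len(recent)

# Hypothetical weekly sales figures, oldest first.
sales = [100.0, 110.0, 99.0, 121.0]
changes = week_over_week_change(sales)               # one value per pair of weeks
forecast = moving_average_forecast(sales, window=3)  # mean of the last 3 weeks
```

The list of changes is the kind of figure a decision support system would plot for the business user, and the forecast gives a first estimate of next week's revenue.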
Advantages and disadvantages of data warehouses  Data warehouses are the traditional solution for data integration, and for good reason, but it is becoming increasingly difficult to scale them and to copy data from multiple data sources in multiple organizations in multiple locations. 1. A Data Warehouse Delivers Enhanced Business Intelligence
  26. 26.  Because a data warehouse provides data from various sources, managers and executives no longer need to make business decisions based on limited data or their gut. In addition, “data warehouses and related BI can be applied directly to business processes including marketing segmentation, inventory management, financial management, and sales.” 2. A Data Warehouse Saves Time  Since business users can quickly access critical data from a number of sources (all in one place), they can rapidly make informed decisions on key initiatives. They won’t waste precious time retrieving data from multiple sources.  Not only that, but business execs can query the data themselves with little or no support from IT, saving more time and more money. That means business users won’t have to wait until IT gets around to generating the reports, and those hardworking folks in IT can do what they do best: keep the business running. 3. A Data Warehouse Enhances Data Quality and Consistency  A data warehouse implementation includes the conversion of data from numerous source systems into a common format.  Since the data from the various departments is standardized, each department will produce results that are in line with all the other departments. So you can have more confidence in the accuracy of your data. And accurate data is the basis for strong business decisions. 4. A Data Warehouse Provides Historical Intelligence  A data warehouse stores large amounts of historical data so you can analyze different time periods and trends in order to make future predictions. Such data typically cannot be stored in a transactional database or used to generate reports from a transactional system. 5. A Data Warehouse Generates a High ROI  Finally, the pièce de résistance: return on investment. Companies that have implemented data warehouses and complementary BI systems have generated more revenue and saved more money than companies that haven’t invested in BI systems and data warehouses.
  27. 27. The Disadvantages of a Data Warehouse 1. Extra Reporting Work  Depending on the size of the organization, a data warehouse runs the risk of creating extra work for departments. Each type of data that's needed in the warehouse typically has to be generated by the IT teams in each division of the business. This can be as simple as duplicating data from an existing database, but at other times it involves gathering data from customers or employees that wasn't gathered before. 2. Cost/Benefit Ratio  A commonly cited disadvantage of data warehousing is the cost/benefit ratio. A data warehouse is a big IT project, and like many big IT projects, it can consume a lot of IT man-hours and budget to produce a tool that doesn't get used often enough to justify the implementation expense.  And that completely sidesteps the expense of maintaining the data warehouse and updating it as the business grows and adapts to the market. 3. Data Ownership Concerns  Data warehouses are often, but not always, Software as a Service implementations, or cloud services applications. Your data security in this environment is only as good as your cloud vendor. Even if implemented locally, there are concerns about data access throughout the company. Make sure that the people doing the analysis are individuals that your organization trusts, especially with customers' personal data. A data warehouse that leaks customer data is a privacy and public relations nightmare. 4. Data Flexibility  Data warehouses tend to have static data sets with minimal ability to "drill down" to specific solutions. The data is imported and filtered through a schema, and it is often days or weeks old by the time it's actually used. In addition, data warehouses are usually subject to ad hoc queries and are thus notoriously difficult to tune for processing speed and query speed. While
  28. 28. the queries are often ad hoc, they are limited by the data relations that were set when the aggregation was assembled. Top 10 challenges in building data warehouse for large banks 1) Lack of strategic focus to build Enterprise Data Warehouse (EDW)  Building an EDW is a strategic initiative, since it requires a shift in culture and a longer timescale and, more importantly, is an expensive affair. Hence it should be one of the top agenda items of the CXOs, who need to monitor progress closely and provide executive support to break down any unwanted barriers. 2) Need of considerable Time, Effort & Cost  The typical time a global bank takes to build an EDW varies from a couple of years to 5 years. It also requires substantial effort and, eventually, a huge amount of money. Also, evidence of successful ROI in existing data warehouse implementations is very opaque. 3) Lack of cross divisional collaboration  Building an EDW requires constructive collaboration among various teams: multiple business divisions, source system teams, architecture & design teams, project teams and vendor teams. 4) Technological complexity  Mostly, source data is kept on multiple operating systems and in multiple database technologies. There are plenty of tools for data sourcing, data quality management, data integration, data warehousing, reporting & analytics.  Choosing appropriate technology is not simple, and is complicated by various emerging techniques such as data virtualization, self-service BI, in-database analytics, columnar databases, NoSQL databases, massively parallel processing and in-memory computing.  Also, the traditional data warehouse needs to be integrated with big data technologies & the Internet of Things to gain business insights.
  29. 29. 5) Ill-defined, changing business data requirements & Insensitivity of technical team in understanding business requirements  Most of the time, the business finds it difficult to define the data requirements, since they keep evolving as the use of data increases. However, the technical team wants finalised data requirements from the business before designing & building a data warehouse. 6) Lack of clarity on true source of data  Most large banks have a long legacy behind them and have grown over decades through mergers & acquisitions. They have a widespread footprint across geographies and customer segments. In the process, they have acquired many systems that are poorly integrated and poorly documented, and data is scattered across multiple systems. It is a nightmare for these banks to identify the true source of their data. 7) Lack of ability to manage data quality issues  Since data is an organisational asset, it needs to be acquired & maintained well.  Many front office/customer facing systems don’t capture quality data at its origination. There is no unified data capturing process across the organisation.  For example, the last name of a personal customer may not have been captured in a front office system because it is not a mandatory field there, whereas it may be a mandatory field in another system.  Sometimes there is a lack of well-defined processes & technologies to curtail data quality issues. 8) Vested interest of vendors in promoting their own solution  Most of the top data warehousing vendors have their own suite of solutions/products covering the entire data warehousing ecosystem. These vendors tend to promote their own solution rather than advocating what is best suited to the customer. 9) Comfort of using divisional data marts  Reporting is an indispensable activity in banking. Many banks have built divisional data marts to fulfil their own divisional needs. Though divisional marts do not provide
  30. 30. an enterprise-wide view, many business users are comfortable using a divisional data mart, assuming that “Known devil is better than unknown angel”. 10) Suboptimal use of data warehouse  Business users from various divisions need to use the data warehouse for reporting, business intelligence, data analytics & advanced analytics to unleash the full potential of the enterprise data asset. An under-utilised data warehouse will not grow & will not yield the desired return on investment (ROI). Case Study Data Warehousing Solution for One of Europe's Largest Financial Services Groups o The client sought a business intelligence solution to consolidate the mortgage administration processes and provide better sales cycle management, mortgage product performance analysis, financial forecasting based on sales demands, fraud detection and general mortgage operational reporting. Infosys delivered a highly scalable solution. The Client o The client is one of Europe's largest financial services groups in corporate and commercial banking, retail banking, credit cards and general insurance. The company sells mortgages to corporate and retail customers through various channels. These mortgage systems run on different technology platforms and follow different business processes. Business Need o Consolidate the mortgage administration processes for all brands and BI for different brands. o Support better sales cycle management, mortgage product performance analysis, financial forecasting based on sales demands, fraud detection and general mortgage operational reporting. The Challenges o The biggest challenge was to provide a scalable architecture for consolidating a huge amount of data.
  31. 31. Our Solution o Infosys designed and implemented a data warehouse solution to extract information from the Mortgage Sales Application and administration systems of different brands and house it in a single data warehouse database. This resulted in a highly scalable solution that met the following requirements:  Transaction volume expected: 73 Million per year; annual growth rate of 110%  Size expected: 180 GB at the end of Year 1; annual growth rate of 45% Implementation Process o Infosys followed an iterative phased approach to implement the solution that included the following phases:  Business requirements analysis  Data warehouse dimensional modeling  Architecture design  ETL (Extract, Transform and Load) and business intelligence reporting development and implementation Benefits o Highly scalable solution to meet the following requirements:  Transaction volume expected: 73 Million per year; annual growth rate of 110%  Size expected: 180 GB at the end of Year 1; annual growth rate of 45% Integrating Data Mining System with a Database or Data Warehouse System  The data mining system needs to be integrated with a database or data warehouse system. If it is not integrated with any database or data warehouse system, there is no system for it to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets.  Here is the list of integration schemes:
  32. 32. 1) No Coupling  In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms. The data mining result is stored in another file. 2) Loose Coupling  In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from a data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse. 3) Semi-tight Coupling  In this scheme, the data mining system is linked to a database or data warehouse system, and in addition, efficient implementations of a few data mining primitives can be provided in the database or data warehouse system. 4) Tight coupling  In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.
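The difference between the first two schemes can be sketched in Python. This is only an illustration under stated assumptions: the "mining algorithm" is a toy that returns the most frequent item, in-memory text buffers stand in for files, an in-memory SQLite database stands in for the database or data warehouse, and the table names `sales` and `mining_results` are made up for the example.

```python
import csv
import io
import sqlite3
from collections import Counter

def most_frequent_item(items):
    """Toy 'mining algorithm': return the item that occurs most often."""
    return Counter(items).most_common(1)[0][0]

def mine_no_coupling(source_file, result_file):
    """No coupling: read a flat file, mine it, write the result to another file."""
    items = [row["item"] for row in csv.DictReader(source_file)]
    result_file.write(most_frequent_item(items))

def mine_loose_coupling(connection):
    """Loose coupling: use the database to fetch the data and store the result."""
    items = [row[0] for row in connection.execute("SELECT item FROM sales")]
    result = most_frequent_item(items)
    connection.execute("CREATE TABLE IF NOT EXISTS mining_results (top_item TEXT)")
    connection.execute("INSERT INTO mining_results VALUES (?)", (result,))
    connection.commit()
    return result

# No coupling: text buffers stand in for files on disk.
source = io.StringIO("item\nbread\nmilk\nbread\n")
output = io.StringIO()
mine_no_coupling(source, output)

# Loose coupling: an in-memory SQLite database stands in for the warehouse.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (item TEXT)")
db.executemany("INSERT INTO sales VALUES (?)", [("bread",), ("milk",), ("bread",)])
stored = mine_loose_coupling(db)
```

In both cases the mining itself runs outside the database. Semi-tight and tight coupling would move part or all of that work into the database or data warehouse system, which this sketch does not attempt to show.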
  34. 34. DEPARTMENT OF BUSINESS AND INDUSTRIAL MANAGEMENT TERM ASSIGNMENT 2014-15 INFORMATION TECHNOLOGY FOR BUSINESS FYMBA- SEM-I SECTION-A TOPIC: DATA MINING AND DATA WAREHOUSING GROUP NUMBER: 9 BY, 16: CHAWLA DIVYA 23: GANDHI SANI 43: LADHNI ROMA SUBMITTED ON-24TH DECEMBER,2014 SUBMITTED TO – DR. JAYDEEPCHAUDRY
