“Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”
Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.
The prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems.
Data mining tools can answer business questions that traditionally were too time-consuming to resolve.
They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line.
Classical statistics embraces concepts such as regression analysis, standard distribution, standard deviation, standard variance, and cluster analysis, all of which are used to study data and data relationships.
These are the building blocks that underpin more advanced statistical analyses.
At the heart of today’s data mining tools and techniques, classical statistical analysis plays a significant role.
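As a rough illustration of these building blocks, the following Python sketch computes a variance, a standard deviation, and a simple least-squares regression line. The spend and sales figures and the helper name least_squares are invented for illustration and do not come from any real dataset.

```python
import statistics

# Hypothetical data: advertising spend (x) and unit sales (y) for five periods.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.0, 4.0, 5.0, 4.0, 5.0]


def least_squares(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


variance = statistics.pvariance(sales)  # population variance of sales
std_dev = statistics.pstdev(sales)      # population standard deviation of sales
slope, intercept = least_squares(spend, sales)

print(f"variance={variance:.2f}, std_dev={std_dev:.2f}")
print(f"sales is roughly {intercept:.2f} + {slope:.2f} * spend")
```

The regression line summarizes how one quantity tends to move with another, which is the kind of relationship that more advanced mining techniques build on.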
Artificial intelligence (AI) is built upon heuristics (methods that often lead rapidly to a solution that is usually close to the best possible answer) as opposed to statistics, and attempts to apply human-thought-like processing to statistical problems.
Since this approach requires vast computer processing power, it was not practical until the early 1980s, when computers began to offer useful power at reasonable prices.
Certain AI concepts were adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).
Machine learning is an evolution of artificial intelligence because it blends artificial intelligence heuristics with advanced statistical analysis.
Machine learning attempts to let computer programs learn about the data they study, such that programs make different decisions based on the qualities of the studied data, using statistics for fundamental concepts, and adding more advanced AI heuristics and algorithms to achieve its goals.
Evolution of Data Mining

Evolutionary Step: Data Collection (1960s)
Business Question: "What was my total revenue in the last five years?"
Enabling Technologies: Computers, tapes, disks
Product Providers: IBM, CDC
Purpose: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
Business Question: "What were unit sales in New England last March?"
Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
Purpose: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
Business Question: "What were unit sales in New England last March? Drill down to Boston."
Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
Product Providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
Purpose: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
Business Question: "What’s likely to happen to Boston unit sales next month? Why?"
Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
Product Providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
Purpose: Prospective, proactive information delivery
Automated prediction of trends and behaviors. A typical example of a predictive problem is targeted marketing: data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings.
Another predictive problem is identifying segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step.
An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together (e.g., beer and diapers).
Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
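Identifying anomalous data can be sketched with a simple outlier test. The Python example below flags values using a modified z-score based on the median absolute deviation. This is one common choice, not necessarily what any particular data mining tool uses. The transaction amounts, the function name flag_anomalies, and the 3.5 threshold (a conventional default) are assumptions made for illustration.

```python
import statistics


def flag_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing can be flagged
    # 0.6745 rescales the MAD so the score is comparable to an ordinary z-score.
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Hypothetical card transaction amounts; 4100.0 could be a keying error for 41.00.
amounts = [42.0, 39.5, 41.0, 40.5, 4100.0, 38.0, 43.5]
suspicious = flag_anomalies(amounts)
print(suspicious)  # [4100.0]
```

A median-based score is used here because a single extreme value inflates the ordinary mean and standard deviation, which can hide the very record being searched for.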
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The pairing of beer and diapers in retail sales data is an example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
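The association idea can be made concrete with two standard measures: support (the share of transactions containing a set of items) and confidence (how often the consequent appears when the antecedent does). The Python sketch below counts item pairs in a handful of invented shopping baskets. The basket contents and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "diapers"},
]


def pair_counts(transactions):
    """Count how many transactions contain each pair of items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


top_pair, top_count = pair_counts(baskets).most_common(1)[0]
print(top_pair, top_count)  # ('beer', 'diapers') 3
print(support(baskets, {"beer", "diapers"}))
print(confidence(baskets, {"beer"}, {"diapers"}))
```

In this toy data every basket with beer also contains diapers, so the rule "beer implies diapers" has confidence 1.0, while the reverse rule is weaker because diapers also appear without beer.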
Clustering is the method by which like records are grouped together. Usually this is done to give the end user a high-level view of what is going on in the database. Clustering is sometimes used to mean segmentation, which most marketing people will tell you is useful for coming up with a bird's-eye view of the business.
Examples: 1) Clustering people with similar movie preferences
2) Amazon.com displays “Customers who bought this book also bought…”
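Grouping people by movie preferences can be sketched with k-means, one widely used clustering method. The text does not name a specific algorithm, so this choice is an assumption. The ratings below (action score, romance score per viewer) are invented, and the first k points are used as starting centroids to keep the example deterministic.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters; returns (assignments, centroids)."""
    centroids = [points[i] for i in range(k)]  # naive start: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical viewers: (rating given to action films, rating given to romance films).
ratings = [(5, 1), (1, 5), (4, 2), (2, 4), (5, 2), (1, 4)]
groups, centers = k_means(ratings, k=2)
print(groups)  # [0, 1, 0, 1, 0, 1]
```

Viewers 0, 2 and 4 end up in one cluster (action fans) and viewers 1, 3 and 5 in the other (romance fans), giving the high-level view described above.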
The nearest neighbor algorithm is a refinement of clustering. It predicts a value for a record by looking up the known prediction values of the most similar records (its near neighbors).
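A minimal Python sketch of nearest neighbor prediction follows, assuming Euclidean distance and a majority vote among the k closest records. The customer records (age, income in thousands, and whether they responded to a mailing) and the function name predict_nearest are hypothetical.

```python
import math
from collections import Counter


def predict_nearest(records, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest records.

    `records` is a list of (features, label) pairs.
    """
    neighbors = sorted(records, key=lambda r: math.dist(r[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical history: (age, income in thousands) and whether the customer responded.
history = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
    ((48, 85), "yes"),
    ((28, 32), "no"),
]

print(predict_nearest(history, (47, 82)))  # yes
print(predict_nearest(history, (27, 31)))  # no
```

In practice the features would usually be rescaled first, since a field with a large numeric range can otherwise dominate the distance.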
Decision Tree: A decision tree takes as input an object or situation described by a set of properties and outputs a yes/no decision. Decision trees therefore represent Boolean functions. Specifically, each branch of the tree is a classification question, and the leaves of the tree are partitions of the dataset with their classification.
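To show how a decision tree represents a Boolean function, the Python sketch below encodes a small hand-built tree as nested tuples and walks it for a given record. The tree, its property names (previous_customer, responded_to_last_mailing, high_spender) and the mailing scenario are invented for illustration. A real tool would learn the tree from data rather than have it written by hand.

```python
def decide(tree, example):
    """Walk the tree until a Boolean leaf is reached and return it.

    An internal node is a tuple (property, subtree_if_true, subtree_if_false);
    a leaf is simply True or False.
    """
    while not isinstance(tree, bool):
        prop, if_true, if_false = tree
        tree = if_true if example[prop] else if_false
    return tree


# Hypothetical tree answering: should this person receive the next mailing?
mailing_tree = (
    "previous_customer",
    ("responded_to_last_mailing", True, ("high_spender", True, False)),
    False,
)

customer = {
    "previous_customer": True,
    "responded_to_last_mailing": False,
    "high_spender": True,
}
print(decide(mailing_tree, customer))  # True
```

This particular tree computes the Boolean function previous_customer AND (responded_to_last_mailing OR high_spender); each internal node is one classification question and each leaf is a yes/no answer.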