2. Lecture Overview
• Decision making and analytics (concepts)
• Categorization of analytical methods and models
• Data scientist: cool job opportunities
3. Business Analytics
• The extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions & actions.
--- Thomas Davenport
6. What is Data Analytics (Mining)?
• Data analytics is the process of discovering knowledge in large data repositories
• Many other definitions:
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
9. Data Analytics Applications
• Lots of data being collected and warehoused
– Web data, e-commerce
– Social networks
– Purchases at department/grocery stores
– Bank/credit card transactions
– Government agencies
• Computers have become cheaper and more powerful
11. Data Analytics Tasks
• Predictive Tasks
– Use some variables (explanatory/independent/input variables) to predict unknown or future values of a particular variable (target/dependent variable)
• Descriptive Tasks
– Find general properties that describe the data
12. Data Analytics Tasks…
• Classification [Predictive]
• Regression [Predictive]
• Visualization [Descriptive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Graph Mining / Social Networks [Descriptive]
13. Classification: Example
• Direct Marketing
– Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided otherwise. This {buy, not buy} binary decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction related information about all such customers.
• Use this information as input attributes to learn a classification model.
• Predict the class attribute value of new customers, given their known input attributes.
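The approach above can be sketched as a toy classifier. The minimal 1-nearest-neighbour sketch below uses invented customer attributes (age, income) and labels; a real campaign would use richer attributes and a proper learning algorithm.

```python
# A minimal 1-nearest-neighbour classifier for the {buy, not buy} task.
# The customer attributes (age, income) and labels below are invented
# toy data, not from any real campaign.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(train, new_point):
    """Label a new customer with the class of the closest past customer."""
    nearest = min(train, key=lambda row: euclidean(row[0], new_point))
    return nearest[1]

# Past campaign: (age, income in $1000s) -> bought the phone?
history = [
    ((25, 40), "buy"),
    ((30, 55), "buy"),
    ((55, 30), "not buy"),
    ((60, 35), "not buy"),
]

print(predict(history, (28, 50)))   # closest to the young, mid-income buyers
```

A real model would also hold out part of the historical data to estimate accuracy before mailing anyone.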
14. Classification: Example
• Customer Churn/Attrition:
– Goal: Predict whether a customer is likely to be lost to a competitor.
– Approach:
• Use the detailed record of transactions with each past and present customer to find attributes.
– How often the customer calls, where he calls from, what time of day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
15. Regression/Prediction: Example
• Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Greatly studied in statistics, econometrics, and the neural network field.
• Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices (forecasting).
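As a minimal illustration of the first example, the sketch below fits a least-squares line relating advertising expenditure to sales; the numbers are invented for illustration.

```python
# Least-squares fit of a line: sales = a + b * ad_spend.
# The advertising/sales numbers are made-up illustration data.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

ad_spend = [10, 20, 30, 40]      # e.g. $1000s spent on advertising
sales    = [25, 45, 65, 85]      # resulting sales (same units)

a, b = fit_line(ad_spend, sales)
print(a, b)                      # intercept and slope of the fitted line
predicted = a + b * 50           # forecast sales at a new spend level
print(predicted)
```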
16. Clustering: Example
• Market Segmentation:
– Goal: Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their geographical and lifestyle-related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those in different clusters.
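The clustering approach above can be sketched with a bare-bones k-means; the customer points (age, weekly spend) and starting centroids below are invented.

```python
# A bare-bones k-means sketch for market segmentation.
# Customer points (age, weekly spend) and the two starting
# centroids are invented for illustration.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else cent
            for cluster, cent in zip(clusters, centroids)
        ]
    return centroids, clusters

customers = [(22, 10), (25, 12), (24, 11), (60, 80), (62, 85), (58, 78)]
centroids, clusters = kmeans(customers, centroids=[(22, 10), (60, 80)])
print(centroids)
```

In practice the number of clusters and the starting centroids themselves have to be chosen carefully; this sketch only shows the assign-and-update loop.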
17. Association Rule Mining: Example
• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that will predict the occurrence of an item based on the occurrence of other items.
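Such dependency rules are usually scored by support and confidence; the sketch below computes both for a hypothetical rule {milk} -> {salad} over a made-up basket list.

```python
# Support and confidence for a candidate rule {milk} -> {salad},
# computed from a tiny made-up transaction list.

def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # Fraction of baskets containing lhs that also contain rhs.
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

baskets = [
    {"milk", "salad", "bread"},
    {"milk", "salad"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

print(support(baskets, {"milk", "salad"}))        # joint support of the rule
print(confidence(baskets, {"milk"}, {"salad"}))   # confidence of milk -> salad
```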
18. Challenges of Data Analytics
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
BA makes decision making more objective. Tools of BA can aid decision making by creating insights from data, by improving our ability to forecast more accurately for planning, by helping us quantify risk, and by yielding better alternatives through analysis and optimization.
Firms guided by data-driven decision making have higher productivity and market values and increased output and profitability.
Business analytics (BA) is the practice of iterative, methodical exploration of an organization’s data with emphasis on statistical analysis. Business analytics is used by companies committed to data-driven decision making.
Successful business analytics depends on data quality, skilled analysts who understand both the technologies and the business, and an organizational commitment to data-driven decision making.
Data mining has attracted a great deal of attention in the business community in recent years.
This is due to the wide availability of data and the imminent need for turning such data into valuable information for companies.
Simply stated, data mining refers to the process of discovering knowledge and patterns in large data repositories.
The term itself is actually a misnomer; our goal in this process is to mine knowledge from the data, as shown in the diagram, not to mine the data itself.
The diagram shown here is called the knowledge discovery process and places data mining in a broader scope.
The knowledge discovery process is an iterative sequence consisting of several steps.
The process starts with the selection and extraction of data relevant to the business question and the analysis.
This data is then pre-processed to remove noise and inconsistent values.
The third step involves transformation or consolidation of data into forms appropriate for mining by performing operations such as aggregation or dimensionality reduction.
The following step is the actual data mining step and, simply put, is the essential procedure in the knowledge discovery process. Data mining as a step refers to the application of computer-based methods and techniques to extract data patterns. The majority of our course will be dedicated to studying such methods, as you will see in later lectures.
Once the data mining step is completed, the resulting patterns are evaluated and interpreted in terms of their interestingness and correctness.
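The five steps just described can be sketched as a toy pipeline; every function below is a stand-in for a much richer real-world step, and the sales records are invented.

```python
# The five KDD steps as a toy pipeline over invented sales records.
# Each function is a stand-in for a much richer real-world step.

raw = [
    {"store": "A", "amount": 10.0},
    {"store": "A", "amount": None},    # inconsistent/missing value
    {"store": "B", "amount": 7.5},
    {"store": "B", "amount": 12.5},
]

def select(records, store):                  # 1. selection & extraction
    return [r for r in records if r["store"] == store]

def preprocess(records):                     # 2. remove noise / missing values
    return [r for r in records if r["amount"] is not None]

def transform(records):                      # 3. consolidate into a mining-ready form
    return [r["amount"] for r in records]

def mine(values):                            # 4. "data mining": here, just a mean
    return sum(values) / len(values)

def evaluate(pattern, threshold=5.0):        # 5. judge interestingness/correctness
    return pattern if pattern > threshold else None

result = evaluate(mine(transform(preprocess(select(raw, "B")))))
print(result)   # average sale at store B, kept because it passes the threshold
```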
In this figure, you can see the hierarchy between intelligence of an organization and the data it keeps. To put this hierarchy into perspective, let’s consider the case of a grocery store.
Suppose that a customer in this store has just completed a check-out. Now, the individual transaction record of this customer (that shows what he/she has just bought in this transaction) will be entered into a database and can be considered as a small data piece. By itself, this piece of data can be of little use to the company’s operational and strategic decision making.
However, now suppose that this customer has a member's card with the grocery store. The member's card lets the store identify the customer whenever he/she buys something. Since we know the customer's identity, we can pull his/her transaction records for a period of time. Compiling and analyzing the history of this individual's transactions, suppose we find that milk and salad are purchased together in every transaction by this customer. Now, this would be considered information and will be somewhat more valuable to the company.
Further, consider the case where we compiled and analyzed historical transaction record of numerous customers in this store. Suppose that we realized a pattern – customers who buy milk now tend to buy salad in the next two transactions. Now, this is something that wasn’t known before and can be considered knowledge.
Finally, suppose that the grocery store decides to give coupons for salad dressing to customers who buy milk. The company’s assumption is that milk buying customers will also buy salad and maybe they can also be persuaded to buy salad dressing through the use of a discount coupon. This is an intelligent decision made in order to increase sales based on knowledge.
Obviously, the grocery store scenario is only one example of how data mining supports intelligent business decisions.
In real life, many organizations collect and store huge amounts of data every day, such as e-commerce websites, financial institutions, and government and law enforcement agencies.
Since it may take human analysts weeks or months to discover useful knowledge from such data, computers have become necessary tools to support the knowledge discovery process.
Descriptive analytics encompasses the set of techniques that describe what has happened in the past. Techniques include data queries, reports, descriptive statistics, and visualization, including data dashboards and spreadsheet models.
Predictive analytics consists of techniques that use models constructed from past data to predict the future or explain the impact of one variable on another. Techniques include regression models, data mining, and simulation.
Prescriptive analytics indicates a best course of action to take. Techniques include optimization models and simulation.
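As a minimal illustration of prescriptive analytics, the sketch below enumerates candidate prices against an assumed linear demand curve (demand = 100 - 2 * price, an invented model) and picks the most profitable one.

```python
# A toy prescriptive-analytics step: choose the best course of action
# (a price) by enumerating candidates against a simple demand model.
# The linear demand curve below is an invented assumption.

def profit(price, unit_cost=10):
    demand = max(0, 100 - 2 * price)     # assumed demand at this price
    return (price - unit_cost) * demand

# Enumerate candidate prices and pick the one maximizing profit.
best_price = max(range(10, 51), key=profit)
print(best_price, profit(best_price))
```

Real prescriptive models replace this grid search with optimization solvers and estimate the demand curve from data rather than assuming it.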
In general, data mining tasks can be classified into two groups: predictive and descriptive.
Predictive tasks use some variables in the data to make inferences about unknown or future values of other variables.
On the other hand, descriptive tasks characterize the general properties of the data in the databases.
These two tasks can be further divided into sub-categories based on their characteristics, the techniques they employ, and the kinds of patterns they can discover.
Classification is the process of finding a model that describes and distinguishes between a discrete number of data classes. The purpose is to use this model to predict the classes of observations whose class labels are unknown in advance. An example of a classification task could be to predict whether a potential customer browsing a store is actually a buyer or not. Here the class attribute has two possible values: buyer or not buyer. The class attribute can also be called the prediction attribute. A possible predictor attribute for finding the class, on the other hand, could be the income level of this person.
Regression is similar in concept to classification; however, instead of predicting one of a discrete number of classes, regression intends to predict the value of a continuous-valued attribute. An example of regression could be predicting the income level of a person. Here, income level is a continuous and numeric attribute.
Visualization is the study of the visual representation of data, creating an abstract form of data characteristics and the relationship between data attributes. The main goal of data visualization is to communicate this information clearly and effectively through graphical means.
Clustering is the task of grouping observations in the data in such a way that observations in the same group are more similar to each other than to those in different groups. Unlike classification, clustering analyzes data observations without consulting a known class label.
Association rule discovery is used for finding interesting relationships between variables in large databases. For example, the grocery store scenario we described earlier, where milk and salad were observed to be frequently purchased together, is a case of association rule discovery.
Finally, graph mining is a structured form of data mining, where interrelations and connections between data observations are used to discover interesting patterns in the data.
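As a small illustration, the sketch below mines one structural pattern, the most connected member, from an invented social network.

```python
# A tiny "social network" as an adjacency structure; the structural
# pattern mined here is simply the most connected member.
# Names and edges are invented for illustration.

from collections import defaultdict

edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]

# Build an undirected graph from the edge list.
graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

# The "hub": the node with the most connections.
hub = max(graph, key=lambda node: len(graph[node]))
print(hub, len(graph[hub]))
```

Real graph mining looks for far richer patterns (frequent subgraphs, communities, influence paths), but they are all built on this kind of connection structure.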
For each of these tasks, there exist unique and significant challenges that can undermine the success of a data mining initiative.
Two of these challenges are scalability and dimensionality. In data mining, it is always desirable that the techniques employed are robust enough to handle data of all sizes and dimensions. As we will see later in this course, unfortunately this is not always the case, and we need to come up with smart approaches to introduce scalability into our analyses.
Another challenge of data mining is the complexity and heterogeneity of the data. In most business cases, we need to deal with different data types including temporal or location-based data, monetary values, text, web data, and multimedia. Further, such data can come from all kinds of different sources, including distinct databases, legacy systems, and web sources.
Standardizing all this information in a computer-readable form is a difficult process and can lead to data quality problems including missing values, noisy data, and duplicate observations.
Finally, privacy and data ownership issues are big concerns that need to be addressed thoroughly in all data mining practices. Especially, mining of personal information for organizational purposes is a very sensitive issue that can have ethical as well as legal consequences.