This document discusses Simpson's paradox and how the relationship between variables can change or reverse depending on whether data is analyzed together or stratified based on other factors. It provides an example where customers who buy high-definition TVs appear more likely to buy exercise machines when data is combined, but the opposite is true within customer groups (college students vs. working adults). The document also discusses how skewed support distributions in data can affect association rule mining by generating many weakly correlated "cross-support patterns" when a low support threshold is used.
The document summarizes a math exploration project analyzing problems at a model construction company. The student uses Pareto analysis to determine the top causes of errors, finding the main issues are differences in block lengths, storage, packaging, logistics, and availability. Normal distribution graphs are then used to analyze sample data from three potential wooden block suppliers, with Company A found to have the lowest standard deviation and highest accuracy for the requested 50.0000m length.
The document describes the phases of a data mining project on predicting home insurance purchases. It analyzes customer data using various techniques, creates new variables, selects the top predictive variables, and generates models. The best model was one where the cost of false negatives was 10 times the cost of false positives, as it had the lowest overall cost and highest return on investment. This model will be deployed.
This document provides examples and explanations of various graphical methods for describing data, including frequency distributions, bar charts, pie charts, stem-and-leaf diagrams, histograms, and cumulative relative frequency plots. It demonstrates how to construct these graphs using sample data on student weights, grades, ages, and other examples. The goal is to help readers understand different ways to visually represent data distributions and patterns.
This document discusses regression analysis techniques for estimating relationships between variables. It provides examples of using single and multiple regression to model how dependent variables, like income, are impacted by independent variables, such as education levels and population density. Key outputs from regression analyses like the model summary, ANOVA table, and coefficients are also presented to interpret the results and significance of relationships.
The report describes the results of a Discrete Choice Experiment (a type of Conjoint-Analysis) to explore the potential configuration of a tablet computer from a new entrant to the category.
Building & Evaluating Predictive model: Supermarket Business CaseSiddhanth Chaurasiya
- The document describes building predictive models using decision tree and regression modeling to predict which customers are likely to purchase new organic products being introduced by a supermarket.
- Both decision tree and logistic regression models were created, with the decision tree models performing slightly better based on various evaluation metrics such as cumulative lift, ROC curve, and average square error.
- The top variables influencing the likelihood of a customer purchasing organics according to the models were gender, age, and affluence level.
The document discusses various graphical methods for describing data, including bar charts, pie charts, stem-and-leaf diagrams, histograms, and cumulative relative frequency plots. It provides examples of each using sample student data on vision correction, weights, ages, and GPAs to illustrate how to construct and interpret the different graph types.
The document summarizes a math exploration project analyzing problems at a model construction company. The student uses Pareto analysis to determine the top causes of errors, finding the main issues are differences in block lengths, storage, packaging, logistics, and availability. Normal distribution graphs are then used to analyze sample data from three potential wooden block suppliers, with Company A found to have the lowest standard deviation and highest accuracy for the requested 50.0000m length.
The document describes the phases of a data mining project on predicting home insurance purchases. It analyzes customer data using various techniques, creates new variables, selects the top predictive variables, and generates models. The best model was one where the cost of false negatives was 10 times the cost of false positives, as it had the lowest overall cost and highest return on investment. This model will be deployed.
This document provides examples and explanations of various graphical methods for describing data, including frequency distributions, bar charts, pie charts, stem-and-leaf diagrams, histograms, and cumulative relative frequency plots. It demonstrates how to construct these graphs using sample data on student weights, grades, ages, and other examples. The goal is to help readers understand different ways to visually represent data distributions and patterns.
This document discusses regression analysis techniques for estimating relationships between variables. It provides examples of using single and multiple regression to model how dependent variables, like income, are impacted by independent variables, such as education levels and population density. Key outputs from regression analyses like the model summary, ANOVA table, and coefficients are also presented to interpret the results and significance of relationships.
The report describes the results of a Discrete Choice Experiment (a type of Conjoint-Analysis) to explore the potential configuration of a tablet computer from a new entrant to the category.
Building & Evaluating Predictive model: Supermarket Business CaseSiddhanth Chaurasiya
- The document describes building predictive models using decision tree and regression modeling to predict which customers are likely to purchase new organic products being introduced by a supermarket.
- Both decision tree and logistic regression models were created, with the decision tree models performing slightly better based on various evaluation metrics such as cumulative lift, ROC curve, and average square error.
- The top variables influencing the likelihood of a customer purchasing organics according to the models were gender, age, and affluence level.
The document discusses various graphical methods for describing data, including bar charts, pie charts, stem-and-leaf diagrams, histograms, and cumulative relative frequency plots. It provides examples of each using sample student data on vision correction, weights, ages, and GPAs to illustrate how to construct and interpret the different graph types.
Producing a staff handbook is important for all businesses. It is a place where staff can find information about the rules that govern their working relationship. In this guide we present some of the core areas that you should address as a business and what policies should be included.
Jak dziękować za wsparcie? O roli wdzięczności w budowaniu relacji z darczyńcamiMichal Serwinski
Prezentacja została przygotowana do VI Webinarium FAOO - organizowanego przez Fundację Akademia Organizacji Obywatelskich na platformie Kursodrom.
WEBINAR: https://youtu.be/gObW1EojN_c?t=5s
https://www.facebook.com/events/937698639635804/
#nonprofitPL
Dynamicznie rozwijająca się technologia i komunikacja stanowią filary fundraisingu. Jak z korzyścią dla organizacji stosować technologie w fundraisingu, budując pozytywne relacje z darczyńcami?
Prezentacja została stworzona na ewentualność prezentacji podczas V Spotkania FAOO | Odkryjmy darczyńców
indywidualnych. Fakty, doświadczenia, narzędzia
1 czerwca 2016
Live Classroom in Blackboard allows for:
1) Online synchronous meetings with optional archives for asynchronous users, where participants can share presentations, chat, and use audio;
2) Instructors to create rooms and enable guest access in the room settings;
3) Participants to enter rooms, add presentations, record archives of the sessions, and use tools like chat, emoticons, and raising hands during sessions.
Setfords offeres a tailored collections service for recovery of your service charge and ground rent arrears which includes demand letters and county court proceedings through to enforcement of your judgment and issuing section 146 notices.
Going through a divorce can be very stressful and emotional and the financial side can cause a lot of friction in the process. Here we present some of the common financial questions around divorce and offer some advice on how to tackle them.
This document provides information about calculating statutory redundancy pay and unfair dismissal awards in the UK. It includes a ready reckoner table to calculate redundancy payments based on an employee's age, length of service and weekly salary. It also lists the current maximum weekly salary and total payment amounts. Contact details are provided for legal advice on redundancy and employment issues.
My TEDx Talk - TEDx MasterCanteenSquareRachit Pandey
The document describes the author's experience as an engineering student and contrasts it with what is considered a typical engineering life. While the author attended many tech festivals and worked at multiple startups like other engineers, traveling to over 100 cities, their experience diverged from most. What was common was low class attendance, last-minute exam preparation, and never understanding physics formulas. The author questions what defines success and wonders if believing in oneself rather than external pressures could lead to more people finding success.
Is the person working for you an employee? AmberBoniface
The document discusses factors that indicate whether a person working for a business is an employee or self-employed. It lists factors like control over work, mutual obligations between parties, personal service requirements, integration within the business, and responsibility for taxation that can determine employment status. The legal status is important for rights like unfair dismissal claims. The document provides contact information for Setfords Solicitors which can offer advice on employment law issues.
UniqueSoft provides automation software that enables faster, higher quality software development at lower costs compared to traditional methods. They focus on end-to-end lifecycle automation to eliminate errors and reduce cycle times. UniqueSoft's approach results in development that is 2-3 times faster, with defects reduced by an order of magnitude, and costs that are 20-30% lower than typical projects. They are a stable company with experienced leadership and a track record of successful government projects.
Drogi do pozyskania funduszy przez internet - fundraising w sieciMichal Serwinski
This document provides links to three different websites. The first link is to a non-existent website with a questionable domain name. The second link directs to a page where one can donate to Charity Water, a non-profit focused on bringing clean drinking water to people in developing nations. The third link leads to a blog post listing 40 hashtags related to social good causes that non-profits and donors can use to spread awareness on social media.
This document provides information on the services offered by Ivyctor to help with MBA application preparation. Ivyctor guides applicants through the entire MBA application process, including GMAT preparation, school selection, essay writing, recommendations, and interview preparation. Their mentors are experienced MBA graduates who can provide personalized assistance. Ivyctor offers flexible packages tailored to students' needs that include candidate profiling, essay reviews, mock interviews, and help with different parts of the application based on the number of schools applying to.
A female student committed suicide by jumping from the roof of her dormitory. Afterwards, students heard splatting sounds moving down the hall towards her room along with a voice asking "Is anyone in there?". During summer break when students were gone, the janitor experienced similar creepy events while cleaning the empty dormitory alone at night. While hiding under a desk, a disembodied voice told her that it could see her and that she couldn't hide from it.
The document provides information about the Technical University of Manabi including its vision, mission, and courses. It discusses the university's vision to be a leading higher education institution in Ecuador that promotes knowledge creation and transmission. The mission is to train responsible, ethical professionals to solve problems in Ecuador and contribute to national development through teaching, research, and knowledge generation. It also notes a computer systems engineering course called English 3C taught at the university's faculty of computer science.
The document discusses exploratory data analysis techniques used to analyze a telecommunications customer churn dataset containing 3,333 records and 20 variables.
Key techniques included:
1. Examining relationships between categorical variables like international plan and voice mail plan to churn, finding customers in international plans churned at higher rates.
2. Exploring distributions and correlations of numeric variables like account length, day minutes, and customer service calls with churn. Higher values in calls and minutes were linked to higher churn.
3. Using histograms, scatter plots, and other graphs to identify multivariate relationships, like finding customers with many calls but low minutes churned more.
The analysis helped identify variables likely to predict churn for modeling without pre
The document describes building a model to predict customer responses to a home equity line of credit mailing. Reducing the mailing by 25% captures 77.77% of responses. The most impactful variables are customer segments - prospects who are high status seniors have a 10.12% higher response probability, and prospects who are Dinki's (double income, no kids) have a 12.45% higher probability. An alternative 36% mailing reduction captures 44% of responses based on lift analysis.
Churn in the Telecommunications Industryskewdlogix
Strategic Business Analysis Capstone Project Telecommunications Churn Management
Churn is a significant problem that costs telecommunications companies billions of dollars through lost revenue. Now that the market is more mature, the only way for a company to grow is to take their competitors customers. This issue
combined with the greater choice that consumers have gained means that any adverse touch point with a consumer can result in a lost customer.
�
� �
�
5
Association Analysis:
Basic Concepts and
Algorithms
Many business enterprises accumulate large quantities of data from their day-
to-day operations. For example, huge amounts of customer purchase data are
collected daily at the checkout counters of grocery stores. Table 5.1 gives an
example of such data, commonly known as market basket transactions.
Each row in this table corresponds to a transaction, which contains a unique
identifier labeled TID and a set of items bought by a given customer. Retailers
are interested in analyzing the data to learn about the purchasing behavior
of their customers. Such valuable information can be used to support a vari-
ety of business-related applications such as marketing promotions, inventory
management, and customer relationship management.
This chapter presents a methodology known as association analysis,
which is useful for discovering interesting relationships hidden in large data
sets. The uncovered relationships can be represented in the form of sets of
items present in many transactions, which are known as frequent itemsets,
Table 5.1. An example of market basket transactions.
TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
�
� �
�
358 Chapter 5 Association Analysis
or association rules, that represent relationships between two itemsets. For
example, the following rule can be extracted from the data set shown in
Table 5.1:
{Diapers} −→ {Beer}.
The rule suggests a relationship between the sale of diapers and beer because
many customers who buy diapers also buy beer. Retailers can use these types
of rules to help them identify new opportunities for cross-selling their products
to the customers.
Besides market basket data, association analysis is also applicable to data
from other application domains such as bioinformatics, medical diagnosis, web
mining, and scientific data analysis. In the analysis of Earth science data, for
example, association patterns may reveal interesting connections among the
ocean, land, and atmospheric processes. Such information may help Earth
scientists develop a better understanding of how the different elements of the
Earth system interact with each other. Even though the techniques presented
here are generally applicable to a wider variety of data sets, for illustrative
purposes, our discussion will focus mainly on market basket data.
There are two key issues that need to be addressed when applying associ-
ation analysis to market basket data. First, discovering patterns from a large
transaction data set can be computationally expensive. Second, some of the
discovered patterns may be spurious (happen simply by chance) and even for
non-spurious patterns, some are more interesting than others. The remainder
of this chapter is organized around these two issues. The first part of the
chapter is devoted to explaining the basic concepts of association ...
Producing a staff handbook is important for all businesses. It is a place where staff can find information about the rules that govern their working relationship. In this guide we present some of the core areas that you should address as a business and what policies should be included.
Jak dziękować za wsparcie? O roli wdzięczności w budowaniu relacji z darczyńcamiMichal Serwinski
Prezentacja została przygotowana do VI Webinarium FAOO - organizowanego przez Fundację Akademia Organizacji Obywatelskich na platformie Kursodrom.
WEBINAR: https://youtu.be/gObW1EojN_c?t=5s
https://www.facebook.com/events/937698639635804/
#nonprofitPL
Dynamicznie rozwijająca się technologia i komunikacja stanowią filary fundraisingu. Jak z korzyścią dla organizacji stosować technologie w fundraisingu, budując pozytywne relacje z darczyńcami?
Prezentacja została stworzona na ewentualność prezentacji podczas V Spotkania FAOO | Odkryjmy darczyńców
indywidualnych. Fakty, doświadczenia, narzędzia
1 czerwca 2016
Live Classroom in Blackboard allows for:
1) Online synchronous meetings with optional archives for asynchronous users, where participants can share presentations, chat, and use audio;
2) Instructors to create rooms and enable guest access in the room settings;
3) Participants to enter rooms, add presentations, record archives of the sessions, and use tools like chat, emoticons, and raising hands during sessions.
Setfords offeres a tailored collections service for recovery of your service charge and ground rent arrears which includes demand letters and county court proceedings through to enforcement of your judgment and issuing section 146 notices.
Going through a divorce can be very stressful and emotional and the financial side can cause a lot of friction in the process. Here we present some of the common financial questions around divorce and offer some advice on how to tackle them.
This document provides information about calculating statutory redundancy pay and unfair dismissal awards in the UK. It includes a ready reckoner table to calculate redundancy payments based on an employee's age, length of service and weekly salary. It also lists the current maximum weekly salary and total payment amounts. Contact details are provided for legal advice on redundancy and employment issues.
My TEDx Talk - TEDx MasterCanteenSquareRachit Pandey
The document describes the author's experience as an engineering student and contrasts it with what is considered a typical engineering life. While the author attended many tech festivals and worked at multiple startups like other engineers, traveling to over 100 cities, their experience diverged from most. What was common was low class attendance, last-minute exam preparation, and never understanding physics formulas. The author questions what defines success and wonders if believing in oneself rather than external pressures could lead to more people finding success.
Is the person working for you an employee? AmberBoniface
The document discusses factors that indicate whether a person working for a business is an employee or self-employed. It lists factors like control over work, mutual obligations between parties, personal service requirements, integration within the business, and responsibility for taxation that can determine employment status. The legal status is important for rights like unfair dismissal claims. The document provides contact information for Setfords Solicitors which can offer advice on employment law issues.
UniqueSoft provides automation software that enables faster, higher quality software development at lower costs compared to traditional methods. They focus on end-to-end lifecycle automation to eliminate errors and reduce cycle times. UniqueSoft's approach results in development that is 2-3 times faster, with defects reduced by an order of magnitude, and costs that are 20-30% lower than typical projects. They are a stable company with experienced leadership and a track record of successful government projects.
Drogi do pozyskania funduszy przez internet - fundraising w sieciMichal Serwinski
This document provides links to three different websites. The first link is to a non-existent website with a questionable domain name. The second link directs to a page where one can donate to Charity Water, a non-profit focused on bringing clean drinking water to people in developing nations. The third link leads to a blog post listing 40 hashtags related to social good causes that non-profits and donors can use to spread awareness on social media.
This document provides information on the services offered by Ivyctor to help with MBA application preparation. Ivyctor guides applicants through the entire MBA application process, including GMAT preparation, school selection, essay writing, recommendations, and interview preparation. Their mentors are experienced MBA graduates who can provide personalized assistance. Ivyctor offers flexible packages tailored to students' needs that include candidate profiling, essay reviews, mock interviews, and help with different parts of the application based on the number of schools applying to.
A female student committed suicide by jumping from the roof of her dormitory. Afterwards, students heard splatting sounds moving down the hall towards her room along with a voice asking "Is anyone in there?". During summer break when students were gone, the janitor experienced similar creepy events while cleaning the empty dormitory alone at night. While hiding under a desk, a disembodied voice told her that it could see her and that she couldn't hide from it.
The document provides information about the Technical University of Manabi including its vision, mission, and courses. It discusses the university's vision to be a leading higher education institution in Ecuador that promotes knowledge creation and transmission. The mission is to train responsible, ethical professionals to solve problems in Ecuador and contribute to national development through teaching, research, and knowledge generation. It also notes a computer systems engineering course called English 3C taught at the university's faculty of computer science.
The document discusses exploratory data analysis techniques used to analyze a telecommunications customer churn dataset containing 3,333 records and 20 variables.
Key techniques included:
1. Examining relationships between categorical variables like international plan and voice mail plan to churn, finding customers in international plans churned at higher rates.
2. Exploring distributions and correlations of numeric variables like account length, day minutes, and customer service calls with churn. Higher values in calls and minutes were linked to higher churn.
3. Using histograms, scatter plots, and other graphs to identify multivariate relationships, like finding customers with many calls but low minutes churned more.
The analysis helped identify variables likely to predict churn for modeling without pre
The document describes building a model to predict customer responses to a home equity line of credit mailing. Reducing the mailing by 25% captures 77.77% of responses. The most impactful variables are customer segments - prospects who are high status seniors have a 10.12% higher response probability, and prospects who are Dinki's (double income, no kids) have a 12.45% higher probability. An alternative 36% mailing reduction captures 44% of responses based on lift analysis.
Churn in the Telecommunications Industryskewdlogix
Strategic Business Analysis Capstone Project Telecommunications Churn Management
Churn is a significant problem that costs telecommunications companies billions of dollars through lost revenue. Now that the market is more mature, the only way for a company to grow is to take their competitors customers. This issue
combined with the greater choice that consumers have gained means that any adverse touch point with a consumer can result in a lost customer.
�
� �
�
5
Association Analysis:
Basic Concepts and
Algorithms
Many business enterprises accumulate large quantities of data from their day-
to-day operations. For example, huge amounts of customer purchase data are
collected daily at the checkout counters of grocery stores. Table 5.1 gives an
example of such data, commonly known as market basket transactions.
Each row in this table corresponds to a transaction, which contains a unique
identifier labeled TID and a set of items bought by a given customer. Retailers
are interested in analyzing the data to learn about the purchasing behavior
of their customers. Such valuable information can be used to support a vari-
ety of business-related applications such as marketing promotions, inventory
management, and customer relationship management.
This chapter presents a methodology known as association analysis,
which is useful for discovering interesting relationships hidden in large data
sets. The uncovered relationships can be represented in the form of sets of
items present in many transactions, which are known as frequent itemsets,
Table 5.1. An example of market basket transactions.
TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
�
� �
�
358 Chapter 5 Association Analysis
or association rules, that represent relationships between two itemsets. For
example, the following rule can be extracted from the data set shown in
Table 5.1:
{Diapers} −→ {Beer}.
The rule suggests a relationship between the sale of diapers and beer because
many customers who buy diapers also buy beer. Retailers can use these types
of rules to help them identify new opportunities for cross-selling their products
to the customers.
Besides market basket data, association analysis is also applicable to data
from other application domains such as bioinformatics, medical diagnosis, web
mining, and scientific data analysis. In the analysis of Earth science data, for
example, association patterns may reveal interesting connections among the
ocean, land, and atmospheric processes. Such information may help Earth
scientists develop a better understanding of how the different elements of the
Earth system interact with each other. Even though the techniques presented
here are generally applicable to a wider variety of data sets, for illustrative
purposes, our discussion will focus mainly on market basket data.
There are two key issues that need to be addressed when applying associ-
ation analysis to market basket data. First, discovering patterns from a large
transaction data set can be computationally expensive. Second, some of the
discovered patterns may be spurious (happen simply by chance) and even for
non-spurious patterns, some are more interesting than others. The remainder
of this chapter is organized around these two issues. The first part of the
chapter is devoted to explaining the basic concepts of association ...
In our increasingly data-driven world, it's more important than ever to have accessible ways to view and understand data.
After all, employees' demand for data skills steadily increases each year.
Employees and Business owners at every level need to understand data and its impact.
That's where Data Visualization comes in handy.
To make Data more accessible and understandable, Data Visualization in Dashboards is the go-to tool for many businesses to Analyze and share Information.
Association rule mining is used to find interesting relationships among data items in large datasets. It can help with business decision making by analyzing customer purchasing patterns. For example, market basket analysis looks at what items are frequently bought together. Association rules use support and confidence metrics, where support is the probability an itemset occurs and confidence is the probability that a rule is correct. The Apriori algorithm is commonly used to generate association rules by first finding frequent itemsets that meet a minimum support threshold across multiple passes of the data. It then generates rules from those itemsets if they meet a minimum confidence. Association rule mining has various applications and can provide useful insights but also has computational limitations.
As mentioned earlier, the mid-term will have conceptual and quanti.docxfredharris32
As mentioned earlier, the mid-term will have conceptual and quantitative multiple-choice questions. You need to read all 4 chapters and you need to be able to solve problems in all 4 chapters in order to do well in this test.
The following are for review and learning purposes only. I am not indicating that identical or similar problems will be in the test. As I have indicated in the class syllabus, all the exams in this course will have multiple-choice questions and problems.
Suggestion: treat this review set as you would an actual test. Sit down with your one page of notes and your calculator, and give it a try. That way you will know what areas you still need to study.
ADMN 210
Answers to Review for Midterm #1
1) Classify each of the following as nominal, ordinal, interval, or ratio data.
a. The time required to produce each tire on an assembly line – ratio since it is numeric with a valid 0 point meaning “lack of”
b. The number of quarts of milk a family drinks in a month - ratio since it is numeric with a valid 0 point meaning “lack of”
c. The ranking of four machines in your plant after they have been designated as excellent, good, satisfactory, and poor – ordinal since it is ranking data only
d. The telephone area code of clients in the United States – nominal since it is a label
e. The age of each of your employees - ratio since it is numeric with a valid 0 point meaning “lack of”
f. The dollar sales at the local pizza house each month - ratio since it is numeric with a valid 0 point meaning “lack of”
g. An employee’s identification number – nominal since it is a label
h. The response time of an emergency unit - ratio since it is numeric with a valid 0 point meaning “lack of”
2) True or False: The highest level of data measurement is the ratio-level measurement.
True (you can do the most powerful analysis with this kind of data)
3) True or False: Interval- and ratio-level data are also referred to as categorical data.
False (Interval and ratio level data are numeric and therefore quantitative, NOT qualitative….Nominal is qualitative)
4) A small portion or a subset of the population on which data is collected for conducting statistical analysis is called __________.
A sample! A population is the total group, a census IS the population, and a data set can be either a sample or a population.
5) One of the advantages for taking a sample instead of conducting a census is this:
a sample is more accurate than census
a sample is difficult to take
a sample cannot be trusted
a sample can save money when data collection process is destructive
6) Selection of the winning numbers is a lottery is an example of __________.
convenience sampling
random sampling
nonrandom sampling
regulatory sampling
7) A type of random sampling in which the population is divided into non-overlapping subpopulations is called __________.
stratified random sampling
cluster sampling
systematic random sampling
regulatory sampling
8) A ...
Instructions and Advice · This assignment consists of six que.docxdirkrplav
Instructions and Advice:
· This assignment consists of six questions. They each have lots of parts but most of them are very short!
· Data for Questions 3 and 6 are in the companion Excel spreadsheet <Asst3_2013_Data.xlsx>.
· Present the parts of your answers in the same order as the questions are asked.
· Do not include any original data in your printed submission.
· Maintain all precision in your calculator or in Excel as you do your multi-step computations. Round off to fewer decimal places only when you write your work and the final answer down to hand in.
· When formatting numbers in Excel, display only as many decimal places as provide decision-making value to the reader.
Question 1 – Interpreting or Misinterpreting Correlation
a) Various factors are associated with the gross domestic product (GDP) of nations. State whether each of the following statements is reasonable or not. If not, explain the blunder.
(i) A correlation of –0.722 shows that there is almost no association between GDP and Infant Mortality Rate.
(ii) There is a correlation of 0.44 between GDP and Continent.
(iii) There is a very strong correlation of 1.22 between Life Expectancy and GDP.
(iv) The correlation between Literacy Rate and GDP was 0.83. This shows that countries wanting to increase their standard of living should invest heavily in education.
b) An article in a business magazine reported that Internet E-commerce has doubled nearly every three years. It then stated that there was a high correlation between sales made on the Internet and year. Do you think this is an appropriate summary? Explain in one sentence.
c) Simpson’s Paradox can occur in regression, when a relationship between variables within groups of observations is reversed if all the data are combined. Here is an example.
Group
X
Y
Group
X
Y
1
1
10.1
2
6
18.3
1
2
8.9
2
7
17.1
1
3
8.9
2
8
16.2
1
4
6.9
2
9
15.1
1
5
6.1
2
10
14.3
(i) Make a scatterplot of the data for Group 1 and add the least squares line. Describe the relationship between Y and X for Group 1. Find the correlation (using Excel).
(ii) Do the same for Group 2.
(iii) Make a scatterplot using all 10 observations and add the least squares line. Find the correlation (using Excel).
(iv) Summarize your findings in one or two sentences.
d) Since 1980, average mortgage interest rates in the U.S. have fluctuated from a low of under 6% to a high of over 14%. Is there a relationship between the amount of money people borrow and the interest rate that’s offered? Here is a scatterplot of Total Mortgages in the U.S. (in millions of 2005 dollars) vs. Interest Rates at various times over the past 26 years. The correlation is -0.84.
(i) Describe the relationship between Total Mortgages and Interest Rate.
(ii) If we standardized both variables, what would the correlation coefficient between the standardized variables be?
(iii) If we were to measure Total Mortgages in thousands of dollars instead of millions of dollars, how would the.
The document summarizes a data science portfolio project that aims to help a financial institution solve customer acquisition and retention problems. It details a machine learning project that analyzes a credit card application dataset to build classifiers to predict approvals. A decision tree model is found to best separate approvals from denials, with an AUC of 0.823. This stable model is preferred over logistic regression and tree ensemble models and is expected to save the institution up to 50% of funds wasted on failed applications compared to the tree ensemble model.
The document discusses supervised versus unsupervised discretization methods for transforming variables in cluster analysis models. It finds that unsupervised, or SAS-defined, transformations generally result in more profitable models compared to supervised, or user-defined, transformations. However, the most profitable transformations can be complex and difficult to explain. There is a tradeoff between profitability and interpretability, known as the "cost of simplicity." The document analyzes different variable transformations applied to a credit risk prediction model to determine which balance of profit and explanation is most appropriate.
I have done this analysis using SAS on a dataset with 5000 records. I have used CART and Logistic regression to build a predictive model to identify customers which are likely to shift to competitors network.
The document discusses various data mining techniques including association rules, classification, clustering, and approaches to discovering patterns in datasets. It covers clustering algorithms like partition and hierarchical clustering. It also explains different data mining problems like discovering sequential patterns, patterns in time series data, and classification and regression rules.
Creating an Explainable Machine Learning AlgorithmBill Fite
How to create an explainable scorecard model using machine learning to optimize its performance with results and insights from applying it to a stock picking problem.
The document discusses creating explainable machine learning models and provides examples of widely used statistical models like logistic regression that produce easily understandable linear scores. It then proposes that an explainable machine learning model could combine both the probability of a financial outcome and the expected gain or loss, while maintaining model interpretability through techniques like scorecards with integer weights. The document outlines applying such a model to problems like picking stocks to maximize expected return over time periods while selecting a set number of stocks and considering constraints.
IRJET- Minning Frequent Patterns,Associations and CorrelationsIRJET Journal
This document discusses mining frequent patterns, associations, and correlations from data. It begins by defining frequent patterns as patterns that occur often in a dataset. It then discusses market basket analysis and how it is used to find associations between frequently purchased items. The document outlines key concepts for mining patterns including support, confidence, and association rules. It also discusses different types of patterns that can be mined such as closed, maximal and approximate patterns. Finally, it provides an overview of the different methodologies used for pattern mining and applications.
Question 1. 1.You are given only three quarterly seasonal indi.docxteofilapeerless
Question 1.
1.
You are given only three quarterly seasonal indices and quarterly seasonally adjusted data for the entire year. What is the raw data value for Q4? Raw data is not adjusted for seasonality.
Quarter Seasonal Index Seasonally Adjusted Data
Q1 .80 295
Q2 .85 299
Q3 1.15 270
Q4 --- 271
(Points : 3)
325
225
252
271
Question 2.
2.
One model of exponential smoothing will provide almost the same forecast as a liner trend method. What are linear trend intercept and slope counterparts for exponential smoothing? (Points : 3)
Alpha and Delta
Delta and Gamma
Alpha and Gamma
Std Dev and Mean
Question 3.
3.
Why is the residual mean value important to a forecaster? (Points : 3)
Large mean values indicate nonautoregressiveness.
Small mean values indicate the total amount of error is small.
Large absolute mean values indicate estimate bias.
Large mean values indicate the standard error of the model is small.
Question 4.
4.
When performing correlation analysis what is the null hypothesis? What measure in Minitab is used to test it and to be 95% confident in the significance of correlation coefficient. (Points : 3)
Ho: r = .05 p < .5
Ho: r = 1 p =.05
Ho: r ≠ 0 p≤.05
Ho: r = 0 p≤.05
Question 5.
5.
In decomposition what does the cycle factor (CF) of .80 represent for a monthly forecast estimate of a Y variable? (Points : 3)
The estimated value is 80% of the average monthly seasonal estimate.
The estimate is .80 of the forecasted Y trend value.
The estimated value is .80 of the historical average CMA values.
The estimated value has 20% more variation than the average historical Y data values.
Question 6.
6.
A Burger King franchise owner notes that the sales per store has fallen below the stated national Burger King outlet average of $1,258,000. He asserts a change has occurred that reduced the fast food eating habits of Americans. What is his hypothesis (H1) and what type of test for significance must be applied? (Points : 3)
H1: u ≥ $1.258,000 A one-tailed t-test to the left.
H1: u = $1.258,000 A two-tailed t-test.
H1: u < $1.258,000 A one-tailed t-test to the left.
H1: p < $1.258,000 A one-tailed test to the right.
Question 7.
7.
The CEO of Home Depot wants to see if city size has any relationship to the current profit margins of the company stores. What data type will he likely use to determine this?
(Points : 3)
Time series data of profits by store.
Recent 10 year sample of profits by stores.
Recent cross section of store profits by city.
Trend of a random sample of store profits over time.
Question 8.
8.
Sometimes forecasters get lazy or forgetful and do not.
1. You are given only three quarterly seasonal indices and quarter.docxjackiewalcutt
1. You are given only three quarterly seasonal indices and quarterly seasonally adjusted data for the entire year. What is the raw data value for Q4? Raw data is not adjusted for seasonality.
Quarter Seasonal Index Seasonally Adjusted Data
Q1 .80 295
Q2 .85 299
Q3 1.15 270
Q4 --- 271
(Points : 3)
325
225
252
271
Question 2. 2. One model of exponential smoothing will provide almost the same forecast as a liner trend method. What are linear trend intercept and slope counterparts for exponential smoothing? (Points : 3)
Alpha and Delta
Delta and Gamma
Alpha and Gamma
Std Dev and Mean
Question 3. 3. Why is the residual mean value important to a forecaster? (Points : 3)
Large mean values indicate nonautoregressiveness.
Small mean values indicate the total amount of error is small.
Large absolute mean values indicate estimate bias. Large mean values indicate the standard error of the model is small.
Question 4. 4. When performing correlation analysis what is the null hypothesis? What measure in Minitab is used to test it and to be 95% confident in the significance of correlation coefficient. (Points : 3)
Ho: r = .05 p < .5
Ho: r = 1 p =.05
Ho: r ≠ 0 p≤.05
Ho: r = 0 p≤.05
Question 5. 5. In decomposition what does the cycle factor (CF) of .80 represent for a monthly forecast estimate of a Y variable? (Points : 3)
The estimated value is 80% of the average monthly seasonal estimate.
The estimate is .80 of the forecasted Y trend value.
The estimated value is .80 of the historical average CMA values.
The estimated value has 20% more variation than the average historical Y data values.
Question 6. 6. A Burger King franchise owner notes that the sales per store has fallen below the stated national Burger King outlet average of $1,258,000. He asserts a change has occurred that reduced the fast food eating habits of Americans. What is his hypothesis (H1) and what type of test for significance must be applied? (Points : 3)
H1: u ≥ $1.258,000 A one-tailed t-test to the left.
H1: u = $1.258,000 A two-tailed t-test.
H1: u < $1.258,000 A one-tailed t-test to the left.
H1: p < $1.258,000 A one-tailed test to the right.
Question 7. 7.
The CEO of Home Depot wants to see if city size has any relationship to the current profit margins of the company stores. What data type will he likely use to determine this?
(Points : 3)
Time series data of profits by store.
Recent 10 year sample of profits by stores.
Recent cross section of store profits by city.
Trend of a random sample of store profits over time.
Question 8. 8. Sometimes forecasters get lazy or forgetful and do not check the significance of XY data correlations ...
This presentation was provided by Rebecca Benner, Ph.D., of the American Society of Anesthesiologists, for the second session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session Two: 'Expanding Pathways to Publishing Careers,' was held June 13, 2024.
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.pptHenry Hollis
The History of NZ 1870-1900.
Making of a Nation.
From the NZ Wars to Liberals,
Richard Seddon, George Grey,
Social Laboratory, New Zealand,
Confiscations, Kotahitanga, Kingitanga, Parliament, Suffrage, Repudiation, Economic Change, Agriculture, Gold Mining, Timber, Flax, Sheep, Dairying,
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
Chapter wise All Notes of First year Basic Civil Engineering.pptxDenish Jangid
Chapter wise All Notes of First year Basic Civil Engineering
Syllabus
Chapter-1
Introduction to objective, scope and outcome the subject
Chapter 2
Introduction: Scope and Specialization of Civil Engineering, Role of civil Engineer in Society, Impact of infrastructural development on economy of country.
Chapter 3
Surveying: Object Principles & Types of Surveying; Site Plans, Plans & Maps; Scales & Unit of different Measurements.
Linear Measurements: Instruments used. Linear Measurement by Tape, Ranging out Survey Lines and overcoming Obstructions; Measurements on sloping ground; Tape corrections, conventional symbols. Angular Measurements: Instruments used; Introduction to Compass Surveying, Bearings and Longitude & Latitude of a Line, Introduction to total station.
Levelling: Instrument used Object of levelling, Methods of levelling in brief, and Contour maps.
Chapter 4
Buildings: Selection of site for Buildings, Layout of Building Plan, Types of buildings, Plinth area, carpet area, floor space index, Introduction to building byelaws, concept of sun light & ventilation. Components of Buildings & their functions, Basic concept of R.C.C., Introduction to types of foundation
Chapter 5
Transportation: Introduction to Transportation Engineering; Traffic and Road Safety: Types and Characteristics of Various Modes of Transportation; Various Road Traffic Signs, Causes of Accidents and Road Safety Measures.
Chapter 6
Environmental Engineering: Environmental Pollution, Environmental Acts and Regulations, Functional Concepts of Ecology, Basics of Species, Biodiversity, Ecosystem, Hydrological Cycle; Chemical Cycles: Carbon, Nitrogen & Phosphorus; Energy Flow in Ecosystems.
Water Pollution: Water Quality standards, Introduction to Treatment & Disposal of Waste Water. Reuse and Saving of Water, Rain Water Harvesting. Solid Waste Management: Classification of Solid Waste, Collection, Transportation and Disposal of Solid. Recycling of Solid Waste: Energy Recovery, Sanitary Landfill, On-Site Sanitation. Air & Noise Pollution: Primary and Secondary air pollutants, Harmful effects of Air Pollution, Control of Air Pollution. . Noise Pollution Harmful Effects of noise pollution, control of noise pollution, Global warming & Climate Change, Ozone depletion, Greenhouse effect
Text Books:
1. Palancharmy, Basic Civil Engineering, McGraw Hill publishers.
2. Satheesh Gopi, Basic Civil Engineering, Pearson Publishers.
3. Ketki Rangwala Dalal, Essentials of Civil Engineering, Charotar Publishing House.
4. BCP, Surveying volume 1
A Visual Guide to 1 Samuel | A Tale of Two HeartsSteve Thomason
These slides walk through the story of 1 Samuel. Samuel is the last judge of Israel. The people reject God and want a king. Saul is anointed as the first king, but he is not a good king. David, the shepherd boy is anointed and Saul is envious of him. David shows honor while Saul continues to self destruct.
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumMJDuyan
(𝐓𝐋𝐄 𝟏𝟎𝟎) (𝐋𝐞𝐬𝐬𝐨𝐧 𝟏)-𝐏𝐫𝐞𝐥𝐢𝐦𝐬
𝐃𝐢𝐬𝐜𝐮𝐬𝐬 𝐭𝐡𝐞 𝐄𝐏𝐏 𝐂𝐮𝐫𝐫𝐢𝐜𝐮𝐥𝐮𝐦 𝐢𝐧 𝐭𝐡𝐞 𝐏𝐡𝐢𝐥𝐢𝐩𝐩𝐢𝐧𝐞𝐬:
- Understand the goals and objectives of the Edukasyong Pantahanan at Pangkabuhayan (EPP) curriculum, recognizing its importance in fostering practical life skills and values among students. Students will also be able to identify the key components and subjects covered, such as agriculture, home economics, industrial arts, and information and communication technology.
𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐭𝐡𝐞 𝐍𝐚𝐭𝐮𝐫𝐞 𝐚𝐧𝐝 𝐒𝐜𝐨𝐩𝐞 𝐨𝐟 𝐚𝐧 𝐄𝐧𝐭𝐫𝐞𝐩𝐫𝐞𝐧𝐞𝐮𝐫:
-Define entrepreneurship, distinguishing it from general business activities by emphasizing its focus on innovation, risk-taking, and value creation. Students will describe the characteristics and traits of successful entrepreneurs, including their roles and responsibilities, and discuss the broader economic and social impacts of entrepreneurial activities on both local and global scales.
Gender and Mental Health - Counselling and Family Therapy Applications and In...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
1. 384 Chapter 6 Association Analysis
Table 6.19. A two-way contingency table between the sale of high-definition television and exercise
machine.
BuyBuy Exercise Machine
HDTVYesNo
Yes9981180
No5466120
153147300
Table 6.20. Example of a three-way contingency table.
Customer Buy Buy Exercise Machine Total
Group HDTV Yes No
College Students Yes 1 9 10
No 4 30 34
Yes 98 72 170
Working Adult No 50 36 86
6.7.3 Simpson’s Paradox
It is important to exercise caution when interpreting the association between
variables because the observed relationship may be influenced by the presence
of other confounding factors, i.e., hidden variables that are not included in
the analysis. In some cases, the hidden variables may cause the observed
relationship between a pair of variables to disappear or reverse its direction, a
phenomenon that is known as Simpson’s paradox. We illustrate the nature of
this paradox with the following example.
Consider the relationship between the sale of high-definition television
(HDTV) and exercise machine, as shown in Table 6.19. The rule {HDTV=Yes}
−→ {Exercise machine=Yes} has a confidence of 99/180 = 55% and the rule
{HDTV=No} −→ {Exercise machine=Yes} has a confidence of 54/120 = 45%.
Together, these rules suggest that customers who buy high-definition televi-
sions are more likely to buy exercise machines than those who do not buy
high-definition televisions.
However, a deeper analysis reveals that the sales of these items depend
on whether the customer is a college student or a working adult. Table 6.20
summarizes the relationship between the sale of HDTVs and exercise machines
among college students and working adults. Notice that the support counts
given in the table for college students and working adults sum up to the fre-
quencies shown in Table 6.19. Furthermore, there are more working adults
2. 6.7 Evaluation of Association Patterns 385
than college students who buy these items. For college students:
c {HDTV=Yes} −→ {Exercise machine=Yes} = 1/10 = 10%,
c {HDTV=No} −→ {Exercise machine=Yes} = 4/34 = 11.8%,
while for working adults:
c {HDTV=Yes} −→ {Exercise machine=Yes} = 98/170 = 57.7%,
c {HDTV=No} −→ {Exercise machine=Yes} = 50/86 = 58.1%.
The rules suggest that, for each group, customers who do not buy high-
definition televisions are more likely to buy exercise machines, which contradict
the previous conclusion when data from the two customer groups are pooled
together. Even if alternative measures such as correlation, odds ratio, or
interest are applied, we still find that the sale of HDTV and exercise machine
is positively correlated in the combined data but is negatively correlated in
the stratified data (see Exercise 20 on page 414). The reversal in the direction
of association is known as Simpson’s paradox.
The paradox can be explained in the following way. Notice that most
customers who buy HDTVs are working adults. Working adults are also the
largest group of customers who buy exercise machines. Because nearly 85% of
the customers are working adults, the observed relationship between HDTV
and exercise machine turns out to be stronger in the combined data than
what it would have been if the data is stratified. This can also be illustrated
mathematically as follows. Suppose
a/b < c/d and p/q < r/s,
where a/b and p/q may represent the confidence of the rule A −→ B in two
different strata, while c/d and r/s may represent the confidence of the rule
A −→ B in the two strata. When the data is pooled together, the confidence
values of the rules in the combined data are (a + p)/(b + q) and (c + r)/(d + s),
respectively. Simpson’s paradox occurs when
c+ra+p
>,
b+qd+s
thus leading to the wrong conclusion about the relationship between the vari-
ables. The lesson here is that proper stratification is needed to avoid generat-
ing spurious patterns resulting from Simpson’s paradox. For example, market
3. 386 Chapter 6 Association Analysis
4
5
4.5
4
3.5
3
Support
2.5
2
1.5
1
0.5
0
0 500 10001500 2000 2500
Items sorted by support
Figure 6.29. Support distribution of items in the census data set.
basket data from a major supermarket chain should be stratified according to
store locations, while medical records from various patients should be stratified
according to confounding factors such as age and gender.
6.8 Effect of Skewed Support Distribution
The performances of many association analysis algorithms are influenced by
properties of their input data. For example, the computational complexity of
the Apriori algorithm depends on properties such as the number of items in
the data and average transaction width. This section examines another impor-
tant property that has significant influence on the performance of association
analysis algorithms as well as the quality of extracted patterns. More specifi-
cally, we focus on data sets with skewed support distributions, where most of
the items have relatively low to moderate frequencies, but a small number of
them have very high frequencies.
An example of a real data set that exhibits such a distribution is shown in
Figure 6.29. The data, taken from the PUMS (Public Use Microdata Sample)
census data, contains 49,046 records and 2113 asymmetric binary variables.
We shall treat the asymmetric binary variables as items and records as trans-
actions in the remainder of this section. While more than 80% of the items
have support less than 1%, a handful of them have support greater than 90%.
4. 6.8 Effect of Skewed Support Distribution 387
Table 6.21. Grouping the items in the census data set based on their support values.
Group G1 G2 G3
Support < 1% 1% − 90% > 90%
Number of Items 1735 358 20
To illustrate the effect of skewed support distribution on frequent itemset min-
ing, we divide the items into three groups, G1 , G2 , and G3 , according to their
support levels. The number of items that belong to each group is shown in
Table 6.21.
Choosing the right support threshold for mining this data set can be quite
tricky. If we set the threshold too high (e.g., 20%), then we may miss many
interesting patterns involving the low support items from G1 . In market bas-
ket analysis, such low support items may correspond to expensive products
(such as jewelry) that are seldom bought by customers, but whose patterns
are still interesting to retailers. Conversely, when the threshold is set too
low, it becomes difficult to find the association patterns due to the following
reasons. First, the computational and memory requirements of existing asso-
ciation analysis algorithms increase considerably with low support thresholds.
Second, the number of extracted patterns also increases substantially with low
support thresholds. Third, we may extract many spurious patterns that relate
a high-frequency item such as milk to a low-frequency item such as caviar.
Such patterns, which are called cross-support patterns, are likely to be spu-
rious because their correlations tend to be weak. For example, at a support
threshold equal to 0.05%, there are 18,847 frequent pairs involving items from
G1 and G3 . Out of these, 93% of them are cross-support patterns; i.e., the pat-
terns contain items from both G1 and G3 . The maximum correlation obtained
from the cross-support patterns is 0.029, which is much lower than the max-
imum correlation obtained from frequent patterns involving items from the
same group (which is as high as 1.0). Similar statement can be made about
many other interestingness measures discussed in the previous section. This
example shows that a large number of weakly correlated cross-support pat-
terns can be generated when the support threshold is sufficiently low. Before
presenting a methodology for eliminating such patterns, we formally define the
concept of cross-support patterns.
5. 388 Chapter 6 Association Analysis
Definition 6.9 (Cross-Support Pattern). A cross-support pattern is an
itemset X = {i1 , i2 , . . . , ik } whose support ratio
min s(i1 ), s(i2 ), . . . , s(ik )
r(X) = , (6.13)
max s(i1 ), s(i2 ), . . . , s(ik )
is less than a user-specified threshold hc .
Example 6.4. Suppose the support for milk is 70%, while the support for
sugar is 10% and caviar is 0.04%. Given hc = 0.01, the frequent itemset
{milk, sugar, caviar} is a cross-support pattern because its support ratio is
min 0.7, 0.1, 0.00040.0004
r= = 0.00058 < 0.01.=
0.7max 0.7, 0.1, 0.0004
Existing measures such as support and confidence may not be sufficient
to eliminate cross-support patterns, as illustrated by the data set shown in
Figure 6.30. Assuming that hc = 0.3, the itemsets {p, q}, {p, r}, and {p, q, r}
are cross-support patterns because their support ratios, which are equal to
0.2, are less than the threshold hc . Although we can apply a high support
threshold, say, 20%, to eliminate the cross-support patterns, this may come
at the expense of discarding other interesting patterns such as the strongly
correlated itemset, {q, r} that has support equal to 16.7%.
Confidence pruning also does not help because the confidence of the rules
extracted from cross-support patterns can be very high. For example, the
confidence for {q} −→ {p} is 80% even though {p, q} is a cross-support pat-
tern. The fact that the cross-support pattern can produce a high-confidence
rule should not come as a surprise because one of its items (p) appears very
frequently in the data. Therefore, p is expected to appear in many of the
transactions that contain q. Meanwhile, the rule {q} −→ {r} also has high
confidence even though {q, r} is not a cross-support pattern. This example
demonstrates the difficulty of using the confidence measure to distinguish be-
tween rules extracted from cross-support and non-cross-support patterns.
Returning to the previous example, notice that the rule {p} −→ {q} has
very low confidence because most of the transactions that contain p do not
contain q. In contrast, the rule {r} −→ {q}, which is derived from the pattern
{q, r}, has very high confidence. This observation suggests that cross-support
patterns can be detected by examining the lowest confidence rule that can be
extracted from a given itemset. The proof of this statement can be understood
as follows.