Data Mining
Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6
Introduction to Data Mining
by Tan, Steinbach, Kumar
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}
k-itemset
  An itemset that contains k items
Support count (σ)
  Frequency of occurrence of an itemset
  E.g. σ({Milk, Bread, Diaper}) = 2
Support
  Fraction of transactions that contain an itemset
  E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
  An itemset whose support is greater than or equal to a minsup threshold
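The definitions above can be checked directly against the five market-basket transactions. The following is a minimal sketch, not code from the lecture notes; the function names `support_count`, `support`, and `is_frequent` are illustrative choices, and the transactions are copied from the table in these notes.

```python
# Transactions copied from the market-basket table (TID 1 to 5).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Support count sigma: number of transactions containing every item."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """Support s: fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent when its support is at least minsup."""
    return support(itemset, transactions) >= minsup


if __name__ == "__main__":
    example = {"Milk", "Bread", "Diaper"}
    print(support_count(example, TRANSACTIONS))  # 2 (TID 4 and TID 5)
    print(support(example, TRANSACTIONS))        # 0.4, i.e. 2/5
```

With these transactions, {Milk, Bread, Diaper} is frequent for minsup = 0.4 but not for minsup = 0.5.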
Definition: Association Rule
Association Rule
  An implication expression of the form X → Y, where X and Y are itemsets
  Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
  Support (s)
    Fraction of transactions that contain both X and Y
  Confidence (c)
    Measures how often items in Y appear in transactions that contain X
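As a worked illustration of the two metrics, the sketch below computes s and c for the example rule {Milk, Diaper} → {Beer} on the market-basket transactions. The function name `rule_metrics` is hypothetical, and returning 0.0 when no transaction contains X is an assumption made here to avoid division by zero; the notes do not define confidence in that case.

```python
# Transactions copied from the market-basket table (TID 1 to 5).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    x, y = set(antecedent), set(consequent)
    n_x = sum(1 for t in transactions if x <= t)          # sigma(X)
    n_xy = sum(1 for t in transactions if (x | y) <= t)   # sigma(X and Y together)
    s = n_xy / len(transactions)
    # Assumption: confidence is reported as 0.0 when X never occurs.
    c = n_xy / n_x if n_x else 0.0
    return s, c


if __name__ == "__main__":
    s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, TRANSACTIONS)
    print(f"s={s:.2f}, c={c:.2f}")  # s=0.40, c=0.67
```

Here σ({Milk, Diaper, Beer}) = 2 and σ({Milk, Diaper}) = 3, so s = 2/5 = 0.4 and c = 2/3 ≈ 0.67.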
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold
Brute-force approach:
  List all possible association rules
  Compute the support and confidence for each rule
  Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
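To make the cost of the brute-force approach concrete, the three steps can be sketched as follows. This is an illustrative sketch only; the names `nonempty_subsets` and `brute_force_rules` are hypothetical, and a rule is taken here to be any pair of non-empty, disjoint itemsets X → Y over the items seen in the transactions.

```python
from itertools import combinations

# Transactions copied from the market-basket table (TID 1 to 5).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    ordered = sorted(items)
    for k in range(1, len(ordered) + 1):
        for combo in combinations(ordered, k):
            yield frozenset(combo)


def brute_force_rules(transactions, minsup, minconf):
    """List every candidate rule, score it, and prune by minsup and minconf."""
    all_items = set().union(*transactions)
    n = len(transactions)
    candidates = 0
    kept = []
    for x in nonempty_subsets(all_items):
        for y in nonempty_subsets(all_items - x):
            candidates += 1
            n_x = sum(1 for t in transactions if x <= t)
            n_xy = sum(1 for t in transactions if (x | y) <= t)
            s = n_xy / n
            c = n_xy / n_x if n_x else 0.0  # assumption: 0.0 when X never occurs
            if s >= minsup and c >= minconf:
                kept.append((x, y, s, c))
    return candidates, kept


if __name__ == "__main__":
    total, rules = brute_force_rules(TRANSACTIONS, minsup=0.4, minconf=0.6)
    print(total)  # 602 candidate rules for the 6 distinct items
```

Even with only 6 distinct items, 602 candidate rules are scored, which matches the closed form 3^d − 2^(d+1) + 1 for d = 6. The count grows exponentially with the number of items, and each candidate requires a pass over the transactions.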
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
  All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
  Rules originating from the same itemset have identical support but can have different confidence
  Thus, we may decouple the support and confidence requirements
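The observations can be reproduced by generating every binary partition of {Milk, Diaper, Beer} and scoring each resulting rule. The sketch below is illustrative rather than part of the lecture notes; `rules_from_itemset` is a hypothetical helper name. Note that the support is computed once for the whole itemset, while the confidence depends on which side of the partition is the antecedent.

```python
from itertools import combinations

# Transactions copied from the market-basket table (TID 1 to 5).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def rules_from_itemset(itemset, transactions):
    """Return (X, Y, support, confidence) for every binary partition X -> Y."""
    items = sorted(itemset)
    full = frozenset(items)
    n_full = sum(1 for t in transactions if full <= t)
    s = n_full / len(transactions)  # identical for every rule from this itemset
    rules = []
    for k in range(1, len(items)):  # proper, non-empty antecedents only
        for lhs in combinations(items, k):
            x = frozenset(lhs)
            n_x = sum(1 for t in transactions if x <= t)
            c = n_full / n_x if n_x else 0.0  # assumption: 0.0 when X never occurs
            rules.append((x, full - x, s, c))
    return rules


if __name__ == "__main__":
    for x, y, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, TRANSACTIONS):
        print(f"{sorted(x)} -> {sorted(y)} (s={s:.1f}, c={c:.2f})")
```

A 3-itemset yields 2^3 − 2 = 6 rules, all with s = 0.4 here. Because the support check depends only on the itemset, it can be done first on itemsets, and the confidence check can be applied afterwards to the rules generated from the itemsets that pass.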
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Macroeconomics- Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
Honest Reviews of Tim Han LMA Course Program.pptxtimhan337
Personal development courses are widely available today, with each one promising life-changing outcomes. Tim Han’s Life Mastery Achievers (LMA) Course has drawn a lot of interest. In addition to offering my frank assessment of Success Insider’s LMA Course, this piece examines the course’s effects via a variety of Tim Han LMA course reviews and Success Insider comments.
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
Embracing GenAI - A Strategic ImperativePeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Data Mining
Association Analysis: Basic Concepts
and Algorithms
Lecture Notes for Chapter 6
Introduction to Data Mining
by
Tan, Steinbach, Kumar
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR
2002
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
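Definition: Frequent Itemset
Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper}
k-itemset: an itemset that contains k items
Support count (σ): frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Diaper}) = 2
Support (s): fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset: an itemset whose support is greater than or equal to a minsup threshold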
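Definition: Association Rule
Association Rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
Support (s): fraction of transactions that contain both X and Y, $s = \sigma(X \cup Y) / |T|$
Confidence (c): measures how often items in Y appear in transactions that contain X, $c = \sigma(X \cup Y) / \sigma(X)$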
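
As a rough illustration of the two metrics, the following Python sketch computes support and confidence for {Milk, Diaper} → {Beer} on the five market-basket transactions. The helper names `support_count` and `rule_metrics` are hypothetical, chosen only for this example.

```python
# Market-basket transactions from the table (one set of items per TID).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence) of the rule lhs -> rhs."""
    both = support_count(lhs | rhs, transactions)
    return both / len(transactions), both / support_count(lhs, transactions)

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(f"s = {s:.2f}, c = {c:.2f}")  # s = 0.40, c = 0.67
```

Both values come from the same count σ({Milk, Diaper, Beer}) = 2; only the denominator differs (all 5 transactions for support, the 3 transactions containing {Milk, Diaper} for confidence).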
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but can have different confidence
Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
Given d items, there are $2^d$ possible candidate itemsets
Frequent Itemset Generation
Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
Match each transaction against every candidate
Complexity ~ O(NMw) => Expensive since $M = 2^d$ !!!
[Figure: N transactions of width w matched against a list of M candidates]
Computational Complexity
Given d unique items:
Total number of itemsets = $2^d$
Total number of possible association rules: $R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$
If d=6, R = 602 rules
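
The figure of 602 rules for d = 6 can be checked numerically. The sketch below counts rules in two ways: by the standard closed form $3^d - 2^{d+1} + 1$, and by choosing a non-empty left-hand side and then a non-empty right-hand side from the remaining items. The function names are hypothetical.

```python
from math import comb

def rule_count_closed_form(d):
    """Closed form R = 3^d - 2^(d+1) + 1."""
    return 3**d - 2**(d + 1) + 1

def rule_count_by_enumeration(d):
    """Pick k items for the left-hand side, then a non-empty
    right-hand side from the remaining d - k items."""
    return sum(
        comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d)
    )

print(2**6)                          # 64 itemsets for d = 6
print(rule_count_closed_form(6))     # 602
print(rule_count_by_enumeration(6))  # 602
```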
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
  Complete search: $M = 2^d$
  Use pruning techniques to reduce M
Reduce the number of transactions (N)
  Reduce size of N as the size of itemset increases
  Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
  Use efficient data structures to store the candidates or transactions
  No need to match every candidate against every transaction
Reducing Number of Candidates
Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent
Apriori principle holds due to the following property of the support measure: $\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$
Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
Illustrating Apriori Principle
Minimum Support = 3
If every subset is considered, $\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41$ candidates
With support-based pruning, 6 + 6 + 1 = 13 candidates
Items (1-itemsets)
Item Count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1
Pairs (2-itemsets) (No need to generate candidates involving Coke or Eggs)
Itemset Count
{Bread,Milk} 3
{Bread,Beer} 2
{Bread,Diaper} 3
{Milk,Beer} 2
{Milk,Diaper} 3
{Beer,Diaper} 3
Triplets (3-itemsets)
Itemset Count
{Bread,Milk,Diaper} 2
Apriori Algorithm
Method:
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
  Generate length (k+1) candidate itemsets from length k frequent itemsets
  Prune candidate itemsets containing subsets of length k that are infrequent
  Count the support of each candidate by scanning the DB
  Eliminate candidates that are infrequent, leaving only those that are frequent
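
The method above can be sketched in Python as follows. This is a minimal, unoptimized version under stated assumptions: candidates are generated by joining any two frequent k-itemsets whose union has k+1 items, support is counted by a plain scan rather than a hash structure, and the function name `apriori` and the use of an absolute support count are choices made for this example.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # k = 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Generate (k+1)-candidates from frequent k-itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune candidates that contain an infrequent k-subset.
            if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                candidates.add(union)

        # Count support of the surviving candidates with one scan.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # Eliminate infrequent candidates.
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        result.update(frequent)
        k += 1
    return result

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
found = apriori(baskets, minsup_count=3)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On the five market-basket transactions with a minimum support count of 3, only one 3-itemset survives the pruning step as a candidate, which is the effect the principle is meant to achieve.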
Reducing Number of Comparisons
Candidate counting:
Scan the database of transactions to determine the support of each candidate itemset
To reduce the number of comparisons, store the candidates in a hash structure
Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
[Figure: N transactions matched against a hash structure of k buckets]
Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
Hash function
Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
Association Rule Discovery: Hash tree
Hash function: items 1, 4, 7 hash to one branch; 2, 5, 8 to a second; 3, 6, 9 to a third
[Figure: Candidate Hash Tree, shown three times and annotated in turn with "Hash on 1, 4 or 7", "Hash on 2, 5 or 8" and "Hash on 3, 6 or 9"]
Subset Operation
Given a transaction t, what are the possible subsets of size 3?
[Figure: enumeration of the size-3 subsets of Transaction t = {1 2 3 5 6}]
Subset Operation Using Hash Tree
[Figure: transaction hashed through the candidate hash tree]
Match transaction against 11 out of 15 candidates
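
A candidate hash tree along these lines can be sketched in Python. The details are assumptions for illustration: the hash function is taken as (item - 1) mod 3, which groups 1,4,7 / 2,5,8 / 3,6,9; the max leaf size is set to 3; and the class and function names are hypothetical. The exact tree shape, and therefore the number of candidates compared, depends on these choices, so the sketch only reports which candidates are contained in the transaction.

```python
MAX_LEAF_SIZE = 3  # assumed max leaf size

def h(item):
    """Assumed hash function: 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2."""
    return (item - 1) % 3

class Node:
    def __init__(self, depth):
        self.depth = depth      # position of the item hashed at this node
        self.children = None    # bucket -> Node once the node is internal
        self.itemsets = []      # candidates stored while the node is a leaf

    def _child(self, itemset):
        bucket = h(itemset[self.depth])
        if bucket not in self.children:
            self.children[bucket] = Node(self.depth + 1)
        return self.children[bucket]

    def insert(self, itemset):
        if self.children is not None:
            self._child(itemset).insert(itemset)
            return
        self.itemsets.append(itemset)
        # Split an overfull leaf while there is still an item to hash on.
        if len(self.itemsets) > MAX_LEAF_SIZE and self.depth < len(itemset):
            stored, self.itemsets, self.children = self.itemsets, [], {}
            for s in stored:
                self._child(s).insert(s)

def matches(node, transaction, start=0, found=None):
    """Collect the candidates in the tree that are subsets of `transaction`."""
    if found is None:
        found = set()
    if node.children is None:
        items = set(transaction)
        found.update(c for c in node.itemsets if items.issuperset(c))
        return found
    # Hash on each remaining item and descend into the matching bucket.
    for i in range(start, len(transaction)):
        child = node.children.get(h(transaction[i]))
        if child is not None:
            matches(child, transaction, i + 1, found)
    return found

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node(0)
for c in candidates:
    root.insert(c)

hits = matches(root, (1, 2, 3, 5, 6))
print(sorted(hits))  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```

Only leaves reachable by hashing the transaction's items are visited, so candidates in other leaves are never compared against the transaction.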
Factors Affecting Complexity
Choice of minimum support threshold
  lowering support threshold results in more frequent itemsets
  this may increase number of candidates and max length of frequent itemsets
Dimensionality (number of items) of the data set
  more space is needed to store support count of each item
  if number of frequent items also increases, both computation and I/O costs may also increase
Size of database
  since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
Average transaction width
  transaction width increases with denser data sets
  this may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)
Compact Representation of Frequent Itemsets
Some itemsets are redundant because they have identical support as their supersets
The number of frequent itemsets can be very large, so a compact representation is needed
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice with the Border separating the Maximal Itemsets from the Infrequent Itemsets]
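
The definition can be illustrated with a short Python sketch. It relies on the anti-monotone property of support: if any proper superset of an itemset is frequent, some immediate superset is frequent too, so it is enough to test for frequent proper supersets. The input list is an assumption for this example (the frequent itemsets of the market-basket table at a minimum support count of 3), and `maximal_frequent` is a hypothetical name.

```python
def maximal_frequent(frequent_itemsets):
    """Keep the frequent itemsets that have no frequent proper superset."""
    frequent = [frozenset(s) for s in frequent_itemsets]
    return [s for s in frequent if not any(s < other for other in frequent)]

# Assumed input: frequent itemsets of the market-basket data, minsup count 3.
frequent = [
    {"Bread"}, {"Milk"}, {"Diaper"}, {"Beer"},
    {"Bread", "Milk"}, {"Bread", "Diaper"},
    {"Milk", "Diaper"}, {"Diaper", "Beer"},
]
maximal = maximal_frequent(frequent)
for itemset in maximal:
    print(sorted(itemset))
```

Every frequent itemset is a subset of at least one maximal itemset, which is why the maximal set works as a compact representation, although it does not retain the support counts of the subsets.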
[Figure: itemset lattice from null to ABCD organized as (a) Prefix tree and (b) Suffix tree]
Alternative Methods for Frequent Itemset Generation
Traversal of Itemset Lattice: Breadth-first vs Depth-first
[Figure: (a) Breadth first, (b) Depth first]
Alternative Methods for Frequent Itemset Generation
Representation of Database: horizontal vs vertical data layout
[Figure: Horizontal Data Layout and Vertical Data Layout]
FP-growth Algorithm
Use a compressed representation of the database using an FP-tree
Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets
FP-tree construction
Transaction Database
TID Items
1 {A,B}
2 {B,C,D}
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
9 {A,B,D}
10 {B,C,E}
After reading TID=1: null → A:1 → B:1
After reading TID=2: null → A:1 → B:1 and null → B:1 → C:1 → D:1
FP-Tree Construction
FP-tree for the transaction database:
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1
Header table: Item, Pointer (one entry for each of A, B, C, D, E)
Pointers are used to assist frequent itemset generation
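
The construction can be sketched in Python. This is an illustration under assumptions: items are inserted in the fixed order A, B, C, D, E used in the figure (FP-growth implementations commonly order items by decreasing support instead), the header table is modelled as a plain list of nodes per item rather than a linked pointer chain, and the names `FPNode`, `build_fp_tree` and `show` are hypothetical.

```python
class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions, order):
    """Insert each transaction as a path from the root, sharing common prefixes."""
    rank = {item: i for i, item in enumerate(order)}
    root = FPNode()
    header = {item: [] for item in order}  # stand-in for the header table
    for t in transactions:
        node = root
        for item in sorted(t, key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def show(node, depth=0):
    """Print the tree with one 'item:count' label per line."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# The ten transactions of the example database, one string of items per TID.
transactions = ["AB", "BCD", "ACDE", "ADE", "ABC", "ABCD", "BC", "ABC", "ABD", "BCE"]
root, header = build_fp_tree(transactions, order="ABCDE")
show(root)
```

Transactions that share a prefix share a path, so the ten transactions collapse into two branches under the root, one starting at A and one at B.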
FP-growth
[Figure: the branches of the FP-tree that end in D]
Conditional Pattern base for D:
P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}
Recursively apply FP-growth on P
Frequent Itemsets found (with sup > 1):
AD, BD, CD, ABD, ACD, BCD
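
The conditional pattern base for D can be reproduced with a short Python sketch. As a simplifying assumption, it reads the prefix paths directly from the transactions (in the item order A, B, C, D, E) instead of following node pointers in an FP-tree; both give the same prefix paths and counts. The function name is hypothetical.

```python
from collections import Counter

ORDER = "ABCDE"  # assumed item order along each tree path
transactions = ["AB", "BCD", "ACDE", "ADE", "ABC", "ABCD", "BC", "ABC", "ABD", "BCE"]

def conditional_pattern_base(item, transactions, order=ORDER):
    """Prefix paths that precede `item`, with the number of times each occurs."""
    cut = order.index(item)
    base = Counter()
    for t in transactions:
        if item in t:
            prefix = tuple(i for i in order[:cut] if i in t)
            if prefix:
                base[prefix] += 1
    return base

base_d = conditional_pattern_base("D", transactions)
for prefix, count in sorted(base_d.items()):
    print(prefix, count)
```

Each prefix path plays the role of a small transaction; FP-growth then recurses on this reduced database to find the frequent itemsets that end in D.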
Tree Projection
Set enumeration tree:
Possible Extension: E(A) = {B,C,D,E}
Possible Extension: E(ABC) = {D,E}
Tree Projection
Items are listed in lexicographic order
Each node P stores the following information:
  Itemset for node P
  List of possible lexicographic extensions of P: E(P)
  Pointer to projected database of its ancestor node
  Bitvector containing information about which transactions in the projected database contain the itemset
Projected Database
For each transaction T, the projected transaction at node A is T ∩ E(A) (empty when T does not contain A)
Original Database:
TID Items
1 {A,B}
2 {B,C,D}
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
9 {A,B,D}
10 {B,C,E}
Projected Database for node A:
TID Items
1 {B}
2 {}
3 {C,D,E}
4 {D,E}
5 {B,C}
6 {B,C,D}
7 {}
8 {B,C}
9 {B,D}
10 {}
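
The projection step can be sketched in a few lines of Python. The rule assumed here, inferred from the two tables, is that a transaction containing the node's itemset is reduced to its intersection with E(A) and any other transaction becomes empty. The function name `project` is hypothetical.

```python
transactions = {
    1: "AB", 2: "BCD", 3: "ACDE", 4: "ADE", 5: "ABC",
    6: "ABCD", 7: "BC", 8: "ABC", 9: "ABD", 10: "BCE",
}

def project(transactions, node_itemset, extensions):
    """Keep T ∩ E(P) for transactions that contain P; empty set otherwise."""
    projected = {}
    for tid, items in transactions.items():
        items = set(items)
        projected[tid] = items & extensions if node_itemset <= items else set()
    return projected

projected_a = project(transactions, {"A"}, set("BCDE"))
for tid, items in projected_a.items():
    print(tid, sorted(items))
```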
ECLAT
For each item, store a list of transaction ids (tids)
TID Items
1 A,B,E
2 B,C,D
3 C,E
4 A,C,D
5 A,B,C,D
6 A,E
7 A,B
8 A,B,C
9 A,C,D
10 B
TID-list
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
ECLAT
Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets.
Example: A = {1, 4, 5, 6, 7, 8, 9} and B = {1, 2, 5, 7, 8, 10} give AB = {1, 5, 7, 8}
3 traversal approaches: top-down, bottom-up and hybrid
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too large for memory
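
The tid-list intersection can be sketched in Python. The transaction table is taken from the slide's horizontal layout; the function name `to_vertical` is hypothetical, and tid-lists are modelled as Python sets for simplicity.

```python
from collections import defaultdict

# Horizontal layout: TID -> items, as read from the slide's table.
horizontal = {
    1: "ABE", 2: "BCD", 3: "CE", 4: "ACD", 5: "ABCD",
    6: "AE", 7: "AB", 8: "ABC", 9: "ACD", 10: "B",
}

def to_vertical(horizontal):
    """Build one tid-list (a set of TIDs) per item."""
    tidlists = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)

tidlists = to_vertical(horizontal)

# Support of {A, B} is the size of the intersection of the two tid-lists.
ab = tidlists["A"] & tidlists["B"]
print(sorted(ab), len(ab))  # [1, 5, 7, 8] 4
```

The same intersection applies one level up: the tid-list of a 3-itemset such as {A, B, E} can be obtained by intersecting the tid-lists of {A, B} and {A, E}.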
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement
If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |L| = k, then there are $2^k - 2$ candidate association rules (ignoring L → ∅ and ∅ → L)
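
The candidate rules of a frequent itemset can be enumerated with a brief Python sketch: every non-empty proper subset becomes a left-hand side and the remaining items form the right-hand side. The function name is hypothetical.

```python
from itertools import combinations

def candidate_rules(itemset):
    """All rules f -> L - f with f a non-empty proper subset of L."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            rhs = tuple(i for i in items if i not in lhs)
            rules.append((lhs, rhs))
    return rules

rules = candidate_rules("ABCD")
print(len(rules))  # 14 = 2**4 - 2
for lhs, rhs in rules:
    print("".join(lhs), "->", "".join(rhs))
```

Each candidate would still have to be checked against the minimum confidence threshold; no extra support check is needed, since every rule built from L has the support of L.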
Rule Generation
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
But confidence of rules generated from the same itemset has an anti-monotone property
e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
Rule Generation for Apriori Algorithm
Lattice of rules:
ABCD=>{ }
BCD=>A, ACD=>B, ABD=>C, ABC=>D
CD=>AB, BD=>AC, BC=>AD, AD=>BC, AC=>BD, AB=>CD
D=>ABC, C=>ABD, B=>ACD, A=>BCD
[Figure: lattice of rules with a Low Confidence Rule marked]
Rule Generation for Apriori Algorithm
Candidate rule is generated by merging two rules that share the same prefix in the rule consequent
join(CD=>AB, BD=>AC) would produce the candidate rule D=>ABC
Prune rule D=>ABC if its subset AD=>BC does not have high confidence
Effect of Support Distribution
Many real data sets have skewed support distribution
[Figure: Support distribution of a retail data set]
Effect of Support Distribution
How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
If minsup is set too low, it is computationally expensive and the number of itemsets is very large
Using a single minimum support threshold may not be effective
Multiple Minimum Support
How to apply multiple minimum supports?
MS(i): minimum support for item i
e.g.: MS(Milk)=5%, MS(Coke)=3%, MS(Broccoli)=0.1%, MS(Salmon)=0.5%
MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%
Challenge: Support is no longer anti-monotone
Suppose: Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%
{Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent
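
The example can be checked with a small Python sketch, in which an itemset's threshold is the minimum of its items' thresholds. The function names are hypothetical; the numbers are the ones given above.

```python
# Per-item minimum supports from the example, as fractions.
MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

def min_support(itemset):
    """MS of an itemset = smallest MS among its items."""
    return min(MS[item] for item in itemset)

def is_frequent(itemset, support):
    return support >= min_support(itemset)

print(is_frequent({"Milk", "Coke"}, 0.015))              # False: 1.5% < 3%
print(is_frequent({"Milk", "Coke", "Broccoli"}, 0.005))  # True: 0.5% >= 0.1%
```

The superset is frequent although its subset is not, because adding Broccoli lowers the threshold faster than it lowers the support; this is why the usual subset-based pruning can no longer be applied unchanged.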
Multiple Minimum Support (Liu 1999)
Order the items according to their minimum support (in ascending order)
e.g.: MS(Milk)=5%, MS(Coke)=3%, MS(Broccoli)=0.1%, MS(Salmon)=0.5%
Ordering: Broccoli, Salmon, Coke, Milk
Need to modify Apriori such that:
L1 : set of frequent items
F1 : set of items whose support is ≥ MS(1), where MS(1) is $\min_i(MS(i))$
C2 : candidate itemsets of size 2 are generated from F1 instead of L1
Multiple Minimum Support (Liu 1999)
Modifications to Apriori:
In traditional Apriori,
  A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
  The candidate is pruned if it contains any infrequent subsets of size k
Pruning step has to be modified:
  Prune only if subset contains the first item
  e.g.: Candidate={Broccoli, Coke, Milk} (ordered according to minimum support)
  {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent
  Candidate is not pruned because {Coke,Milk} does not contain the first item, i.e., Broccoli.
Pattern Evaluation
Association rule algorithms tend to produce too many rules
  many of them are uninteresting or redundant
  Redundant if {A,B,C} → {D} and {A,B} → {D} have same support & confidence
Interestingness measures can be used to prune/rank the derived patterns
In the original formulation of association rules, support & confidence are the only measures used
Application of Interestingness Measure
Given a rule X → Y, information needed to compute rule interestingness can be obtained from a contingency table
Contingency table for X → Y
    Y    ¬Y
X   f11  f10  f1+
¬X  f01  f00  f0+
    f+1  f+0  |T|
f11: support of X and Y; f10: support of X and ¬Y; f01: support of ¬X and Y; f00: support of ¬X and ¬Y
Used to define various measures: support, confidence, lift, Gini, J-measure, etc.
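Drawback of Confidence
      Coffee  ¬Coffee
Tea   15      5        20
¬Tea  75      5        80
      90      10       100
Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
Although confidence is high, the rule is misleading: P(Coffee|¬Tea) = 75/80 = 0.9375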
Statistical Independence
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S,B)
P(S∧B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42
P(S∧B) = P(S) × P(B) => Statistical independence
P(S∧B) > P(S) × P(B) => Positively correlated
P(S∧B) < P(S) × P(B) => Negatively correlated
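Statistical-based Measures
Measures that take into account statistical dependence:
$Lift = \frac{P(Y|X)}{P(Y)}$
$Interest = \frac{P(X,Y)}{P(X)P(Y)}$
$PS = P(X,Y) - P(X)P(Y)$
$\phi\text{-coefficient} = \frac{P(X,Y) - P(X)P(Y)}{\sqrt{P(X)[1-P(X)]P(Y)[1-P(Y)]}}$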
Example: Lift/Interest
      Coffee  ¬Coffee
Tea   15      5        20
¬Tea  75      5        80
      90      10       100
Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75
but P(Coffee) = 0.9
Lift = 0.75/0.9 = 0.8333 (< 1, therefore is negatively associated)
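
The lift value can be reproduced from the four cells of the contingency table. The following Python sketch assumes the cell naming f11 (Tea and Coffee), f10 (Tea, no Coffee), f01 (Coffee, no Tea), f00 (neither); the function name is hypothetical.

```python
def lift(f11, f10, f01, f00):
    """Lift of X -> Y = P(Y|X) / P(Y), computed from contingency counts."""
    n = f11 + f10 + f01 + f00
    confidence = f11 / (f11 + f10)  # P(Y|X)
    p_y = (f11 + f01) / n           # P(Y)
    return confidence / p_y

# Tea -> Coffee: 15 drink both, 5 tea only, 75 coffee only, 5 neither.
print(round(lift(15, 5, 75, 5), 4))  # 0.8333
```

A lift below 1 means that Coffee is less likely among tea drinkers than in the population as a whole, even though the rule's confidence of 0.75 looks high in isolation.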
Drawback of Lift & Interest
    Y   ¬Y
X   10  0   10
¬X  0   90  90
    10  90  100
Lift = 0.1 / (0.1 × 0.1) = 10
    Y   ¬Y
X   90  0   90
¬X  0   10  10
    90  10  100
Lift = 0.9 / (0.9 × 0.9) = 1.11
Statistical independence: If P(X,Y)=P(X)P(Y) => Lift = 1
There are lots of measures proposed in the literature
Some measures are good for certain applications, but not for others
What criteria should we use to determine whether a measure is good or bad?
What about Apriori-style support based pruning? How does it affect these measures?
Properties of A Good Measure
Piatetsky-Shapiro: 3 properties a good measure M must satisfy:
M(A,B) = 0 if A and B are statistically independent
M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
Comparing Different Measures
10 examples of contingency tables:
Rankings of contingency tables using various measures:
Example: the second table multiplies the Male column by 2 and the Female column by 10
      Male  Female
High  2     3       5
Low   1     4       5
      3     7       10
      Male  Female
High  4     30      34
Low   2     40      42
      6     70      76
The underlying association should not depend on the relative number of male and female students in the samples
Property under Inversion Operation
[Figure: columns running from Transaction 1 to Transaction N]
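φ-coefficient is analogous to correlation coefficient for continuous variables
    Y   ¬Y
X   60  10  70
¬X  10  20  30
    70  30  100
φ = (0.6 - 0.7 × 0.7) / sqrt(0.7 × 0.3 × 0.7 × 0.3) = 0.5238
    Y   ¬Y
X   20  10  30
¬X  10  60  70
    30  70  100
φ = (0.2 - 0.3 × 0.3) / sqrt(0.3 × 0.7 × 0.3 × 0.7) = 0.5238
φ-coefficient is the same for both tables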
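
The φ-coefficient of a 2×2 contingency table can be computed with a short Python sketch, using the usual definition (P(X,Y) - P(X)P(Y)) divided by sqrt(P(X)[1 - P(X)]P(Y)[1 - P(Y)]). The function name and the f11/f10/f01/f00 argument order are choices made for this example.

```python
from math import sqrt

def phi(f11, f10, f01, f00):
    """Phi-coefficient of a 2x2 contingency table given its four cell counts."""
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n
    p_x = (f11 + f10) / n
    p_y = (f11 + f01) / n
    return (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

print(round(phi(60, 10, 10, 20), 4))  # 0.5238
print(round(phi(20, 10, 10, 60), 4))  # 0.5238
```

The two tables differ only in which of co-presence and co-absence is the dominant cell, and φ does not distinguish between them.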
Property under Null Addition
Yes*: Yes if measure is normalized
Yes**: Yes if measure is symmetrized by taking max(M(A,B), M(B,A))
No*: Symmetry under row or column permutation
Support-based Pruning
Most of the association rule mining algorithms use support measure to prune rules and itemsets
Study effect of support pruning on correlation of itemsets
  Generate 10000 random contingency tables
  Compute support and pairwise correlation for each table
  Apply support-based pruning and examine the tables that are removed
Effect of Support-based Pruning
[Figure: histograms of Correlation for All Itempairs and for item pairs with Support > 0.9, > 0.7, > 0.5, < 0.01, < 0.03 and < 0.05]
Support-based pruning eliminates mostly negatively correlated itemsets
Effect of Support-based Pruning
Investigate how support-based pruning affects other measures
Steps:
  Generate 10000 contingency tables
  Rank each table according to the different measures
  Compute the pair-wise correlation between the measures
Effect of Support-based Pruning
Without Support Pruning (All Pairs)
Red cells indicate correlation between the pair of measures > 0.85
40.14% pairs have correlation > 0.85
[Figure: Scatter Plot between Correlation & Jaccard Measure]
Effect of Support-based Pruning
… pairs have correlation > 0.85
[Figure: Scatter Plot between Correlation & Jaccard Measure]
Effect of Support-based Pruning
… pairs have correlation > 0.85
[Figure: Scatter Plot between Correlation & Jaccard Measure]
Subjective Interestingness Measure
Objective measure:
  Rank patterns based on statistics computed from data
  e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc).
Subjective measure:
  Rank patterns according to user’s interpretation
  A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin)
  A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin)
Interestingness via Unexpectedness
Need to model expectation of users (domain knowledge)
Need to combine expectation of users with evidence from data (i.e., extracted patterns)
[Figure legend: + = pattern expected to be frequent / pattern found to be frequent; - = pattern expected to be infrequent / pattern found to be infrequent. Expected Patterns: expectation and finding agree (+ with +, - with -). Unexpected Patterns: expectation and finding disagree (- with +, + with -).]
Interestingness via Unexpectedness
Web Data (Cooley et al 2001)
  Domain knowledge in the form of site structure
  Given an itemset F = {X1, X2, …, Xk} (Xi: Web pages)
    L: number of links connecting the pages
    lfactor = L / (k × (k-1))
    cfactor = 1 (if graph is connected), 0 (disconnected graph)
  Structure evidence
  Usage evidence
  Use Dempster-Shafer theory to combine domain knowledge and evidence from data
Data Mining
Association Rules: Advanced Concepts and Algorithms
Lecture Notes for Chapter 7
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Continuous and Categorical Attributes
Example of Association Rule:
  … = No}
How to apply association analysis formulation to non-asymmetric binary variables?
[Table: Web session data with columns Session Id, Country, Session Length (sec), Number of Web Pages viewed, Gender, Browser Type, Buy]
Handling Categorical Attributes
Transform categorical attribute into asymmetric binary variables
Introduce a new “item” for each distinct attribute-value pair
  Example: replace Browser Type attribute with
    Browser Type = Internet Explorer
    Browser Type = Netscape
    Browser Type = Mozilla
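The transformation can be sketched in a few lines of Python. This is only an illustration: the helper name to_items, the choice to skip the Session Id column, and the sample record are assumptions made for the example.

```python
def to_items(record, skip=("Session Id",)):
    """Turn one attribute-value record into a set of binary 'items'.

    Each distinct (attribute, value) pair becomes one item string such as
    'Browser Type=IE'. Identifier columns listed in `skip` are ignored.
    """
    return {f"{attr}={value}" for attr, value in record.items() if attr not in skip}


# Hypothetical session record using only categorical attributes.
session = {
    "Session Id": 1,
    "Country": "USA",
    "Gender": "Male",
    "Browser Type": "IE",
    "Buy": "No",
}

items = to_items(session)
print(sorted(items))
```

The resulting item sets can be fed to any frequent itemset algorithm that expects market-basket style transactions.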
Handling Categorical Attributes
Potential Issues
  What if attribute has many possible values?
    Example: attribute country has more than 200 possible values
    Many of the attribute values may have very low support
    Potential solution: Aggregate the low-support attribute values
  What if distribution of attribute values is highly skewed?
    Example: 95% of the visitors have Buy = No
    Most of the items will be associated with (Buy=No) item
    Potential solution: drop the highly frequent items
Handling Continuous Attributes
Different kinds of …
Different methods:
  Discretization-based
  Statistics-based
  Non-discretization based
    minApriori
Handling Continuous Attributes
Use discretization
Unsupervised:
  Equal-width binning
  Equal-depth binning
  Clustering
Supervised:
  [Figure labels: bin1, bin2, bin3]
  Attribute values, v

  Class       v1   v2   v3   v4   v5   v6   v7   v8   v9
  Anomalous    0    0   20   10   20    0    0    0    0
  Normal     150  100    0    0    0  100  100  150  100
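The two simplest unsupervised schemes can be sketched as follows. The function names are hypothetical, and the sample values are the five session lengths that appear in the Web session table of these notes.

```python
def equal_width_edges(values, k):
    """Return k+1 bin edges that split [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding (nearly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


session_lengths = [982, 811, 2125, 596, 123]

edges = equal_width_edges(session_lengths, 2)
bins = equal_depth_bins(session_lengths, 2)
print(edges)  # two intervals of equal width
print(bins)   # two bins of nearly equal size
```

Equal-width binning is sensitive to outliers such as the 2125-second session, while equal-depth binning keeps the number of values per bin balanced.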
Discretization Issues
Size of the discretized intervals affects support & confidence
  If intervals too small: may not have enough support
  If intervals too large: may not have enough confidence
Potential solution: use all possible intervals

Discretization Issues
Execution time
  If intervals contain n values, there are on average O(n^2) possible ranges
Too many rules
  {Refund = …
Approach by Srikant & Agrawal
Preprocess the data
  Discretize attribute using equi-depth partitioning
    Use partial completeness measure to determine number of partitions
    Merge adjacent intervals as long as support is less than max-support
Apply existing association rule mining algorithms
Determine interesting rules in the output
Approach by Srikant & Agrawal
Discretization will lose information
  [Figure: X and Approximated X]
Use partial completeness measure to determine how much information is lost
  C: frequent itemsets obtained by considering all ranges of attribute values
  P: frequent itemsets obtained by considering all ranges over the partitions
  P is K-… such that:
    1. X’ is a generaliz… support(Y)
  Given K (partial completeness level), can determine number of intervals (N)
Interestingness Measure
Given an itemset: Z = {z1, z2, …, zk} and its generalization Z’ = {z1’, z2’, …, zk’}
  P(Z): support of Z
  EZ’(Z): expected support of Z based on Z’
  Z is R-…
{Refund = N…
  ES’(Y|X): expected support of Z based on Z’
  Rule S is R-interesting w.r.t its ancestor rule S’ if Support, P(S) …
Statistics-based Methods
Example: …
Rule consequent consists of a continuous variable, characterized by its statistics
  mean, median, standard deviation, etc.
Approach:
  Withhold the target variable from the rest of the data
  Apply existing frequent itemset generation on the rest of the data
  For each frequent itemset, compute the descriptive statistics for the corresponding target variable
    Frequent itemset becomes a rule by introducing the target variable as rule consequent
  Apply statistical test to determine interestingness of the rule
Statistics-based Methods
How to determine whether an association rule is interesting?
  Compare the statistics for segment of population covered by the rule vs segment of population not covered by the rule:
  … variance 1 under null hypothesis

Statistics-based Methods
Example:
  … years (i… n1 = 50, s1 = 3.5
  For r’ (complement): n2 = 250, s2 = 6.5
  For 1-sided test at 95% confidence level, critical Z-value for rejecting null hypothesis is 1.64.
  Since Z is greater than 1.64, r is an interesting rule
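A minimal sketch of such a test is given below. It assumes the usual two-sample Z statistic, Z = (mu' - mu - delta) / sqrt(s1^2/n1 + s2^2/n2), which is an assumption because the formula is not legible in these notes. The sample sizes and standard deviations are the ones quoted above; the two means (23 and 30) and the threshold delta = 5 are hypothetical values chosen only to make the example runnable.

```python
import math


def z_statistic(mu_rule, mu_rest, delta, s_rule, s_rest, n_rule, n_rest):
    """Two-sample Z statistic for 'the means differ by more than delta'."""
    standard_error = math.sqrt(s_rule ** 2 / n_rule + s_rest ** 2 / n_rest)
    return (mu_rest - mu_rule - delta) / standard_error


# n and s come from the example; the means and delta are assumed values.
z = z_statistic(mu_rule=23, mu_rest=30, delta=5,
                s_rule=3.5, s_rest=6.5, n_rule=50, n_rest=250)

CRITICAL_Z = 1.64  # one-sided test at the 95% confidence level
print(round(z, 2), z > CRITICAL_Z)
```

With these assumed inputs the statistic exceeds the critical value, so the rule would be reported as interesting.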
Min-Apriori (Han et al)
Example:
  W1 and W2 tend to appear together in the same document
Document-term matrix:

  TID  W1  W2  W3  W4  W5
  D1    2   2   0   0   1
  D2    0   0   1   2   2
  D3    2   3   0   0   0
  D4    0   0   1   0   1
  D5    1   1   1   0   2

Same matrix with each row scaled to sum to 1:

  TID   W1   W2   W3   W4   W5
  D1   0.4  0.4  0.0  0.0  0.2
  D2   0.0  0.0  0.2  0.4  0.4
  D3   0.4  0.6  0.0  0.0  0.0
  D4   0.0  0.0  0.5  0.0  0.5
  D5   0.2  0.2  0.2  0.0  0.4
Min-Apriori
Data contains only continuous attributes of the same “type”
  e.g., frequency of words in a document
Potential solution:
  Convert into 0/1 matrix and then apply existing algorithms
    lose word frequency information
  Discretization does not apply as users want association among words not ranges of words
Min-Apriori
How to determine the support of a word?
  If we simply sum up its frequency, support count will be greater than total number of documents!
    Normalize the word vectors, e.g., using L1 norm
    Each word has a support equal to 1.0
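Document-term matrix before normalization:

  TID  W1  W2  W3  W4  W5
  D1    2   2   0   0   1
  D2    0   0   1   2   2
  D3    2   3   0   0   0
  D4    0   0   1   0   1
  D5    1   1   1   0   2

After normalizing each word (column) to sum to 1:

  TID    W1    W2    W3    W4    W5
  D1   0.40  0.33  0.00  0.00  0.17
  D2   0.00  0.00  0.33  1.00  0.33
  D3   0.40  0.50  0.00  0.00  0.00
  D4   0.00  0.00  0.33  0.00  0.17
  D5   0.20  0.17  0.33  0.00  0.33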
Min-Apriori
New definition of support:
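The formula is not legible at this point in the notes. The sketch below assumes the definition commonly given for Min-Apriori: the support of a word set C is the sum, over all documents, of the smallest normalized frequency among the words in C. The function names are hypothetical; the count matrix is the document-term matrix shown earlier.

```python
# Rows are documents D1..D5, columns are words W1..W5.
counts = [
    [2, 2, 0, 0, 1],
    [0, 0, 1, 2, 2],
    [2, 3, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 1, 0, 2],
]


def normalize_columns(rows):
    """L1-normalize every word (column) so that its entries sum to 1."""
    totals = [sum(col) for col in zip(*rows)]
    return [[value / total for value, total in zip(row, totals)] for row in rows]


def min_support(norm_rows, cols):
    """Assumed Min-Apriori support: sum over documents of the minimum
    normalized frequency among the selected word columns."""
    return sum(min(row[c] for c in cols) for row in norm_rows)


norm = normalize_columns(counts)
sup_w1 = min_support(norm, [0])            # {W1}
sup_w1_w2 = min_support(norm, [0, 1])      # {W1, W2}
sup_w1_w2_w3 = min_support(norm, [0, 1, 2])  # {W1, W2, W3}
print(round(sup_w1, 2), round(sup_w1_w2, 2), round(sup_w1_w2_w3, 2))
```

Under this definition support can only shrink as words are added to the set, which is the anti-monotone property that Apriori-style pruning relies on.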
Multi-level Association Rules
Why should we incorporate concept hierarchy?
  Rules at lower levels may not have enough support to appear in any frequent itemsets
  Rules at lower levels of the hierarchy are overly specific
    e.g., … are indicative of association between milk and bread
Multi-level Association Rules
How do support and confidence vary as we traverse the concept hierarchy?
  If X is the parent item for both X1 and X2, then …
  If … and X is parent of X1, Y is parent of Y1, then …
  If … then …
Multi-level Association Rules
Approach 1:
  Extend current association rule formulation by augmenting each transaction with higher level items
    Original Transaction: {skim milk, wheat bread}
    Augmented Transaction: {skim milk, wheat bread, milk, bread, food}
Issues:
  Items that reside at higher levels have much higher support counts
    if support threshold is low, too many frequent patterns involving items from the higher levels
  Increased dimensionality of the data
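The augmentation step of Approach 1 can be sketched as below. The PARENT dictionary is a hypothetical concept hierarchy chosen to reproduce the example transaction; a real hierarchy would come from domain knowledge.

```python
# Hypothetical concept hierarchy: child item -> parent item.
PARENT = {
    "skim milk": "milk",
    "wheat bread": "bread",
    "milk": "food",
    "bread": "food",
}


def augment(transaction, parent=PARENT):
    """Add every ancestor of every item in the transaction."""
    result = set(transaction)
    for item in transaction:
        while item in parent:  # walk up the hierarchy to the root
            item = parent[item]
            result.add(item)
    return result


augmented = augment({"skim milk", "wheat bread"})
print(sorted(augmented))
```

Because every transaction gains its ancestors, high-level items such as food appear in almost every transaction, which is the source of the issues listed above.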
Multi-level Association Rules
Approach 2:
  Generate frequent patterns at highest level first
  Then, generate frequent patterns at the next highest level, and so on
Issues:
  I/O requirements will increase dramatically because we need to perform more passes over the data
  May miss some potentially interesting cross-level association patterns
Sequence Data
Sequence Database:
[Figure: sequence database shown on a timeline with timestamps from 10 to 35; each element (transaction) contains one or more events (items)]

Sequence Database: Customer
  Sequence: Purchase history of a given customer
  Element (Transaction): A set of items bought by a customer at time t
  Event (Item): Books, dairy products, CDs, etc.
Sequence Database: Web Data
  Sequence: Browsing activity of a particular Web visitor
  Element (Transaction): A collection of files viewed by a Web visitor after a single mouse click
  Event (Item): Home page, index page, contact info, etc.
Sequence Database: Event data
  Sequence: History of events generated by a given sensor
  Element (Transaction): Events triggered by a sensor at time t
  Event (Item): Types of alarms generated by sensors
Sequence Database: Genome sequences
  Sequence: DNA sequence of a particular species
  Element (Transaction): An element of the DNA sequence
  Event (Item): Bases A, T, G, C
Formal Definition of a Sequence
A sequence is an ordered list of elements (transactions)
  s = < e1 e2 e3 … >
  Each element contains a collection of events (items)
    ei = {i1, i2, …, ik}
  Each element is attributed to a specific time or location
Length of a sequence, |s|, is given by the number of elements of the sequence
A k-sequence is a sequence that contains k events (items)
Examples of Sequence
Web sequence:
  < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >
Sequence of initiating events causing the nuclear accident at 3-mile Island:
(http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm)
  < {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} >
Sequence of books checked out at a library:
  < {Fellowship of the Ring} {The Two Towers} {Return of the King} >
Formal Definition of a Subsequence
A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin
The support of a subsequence w is defined as the fraction of data sequences that contain w
A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)

  Data sequence             Subsequence      Contain?
  < {2,4} {3,5,6} {8} >     < {2} {3,5} >    Yes
  < {1,2} {3,4} >           < {1} {2} >      No
  < {2,4} {2,4} {2,5} >     < {2} {4} >      Yes
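A minimal containment check, assuming sequences are stored as lists of sets (a representation chosen here for illustration), can be written with a greedy left-to-right scan. The three calls reproduce the rows of the table.

```python
def contains(data_seq, sub_seq):
    """Return True if sub_seq is a subsequence of data_seq.

    Each element of sub_seq must be a subset of a distinct element of
    data_seq, and the matched elements must appear in the same order.
    Matching each element as early as possible is sufficient.
    """
    pos = 0
    for element in sub_seq:
        while pos < len(data_seq) and not set(element) <= set(data_seq[pos]):
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next element must match strictly later
    return True


r1 = contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}])
r2 = contains([{1, 2}, {3, 4}], [{1}, {2}])
r3 = contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}])
print(r1, r2, r3)
```

The second case fails because 1 and 2 occur in the same element of the data sequence, while the subsequence requires them in two different elements.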
Sequential Pattern Mining: Definition
Given:
  a database of sequences
  a user-specified minimum support threshold, minsup
Task:
  Find all subsequences with support ≥ minsup
Sequential Pattern Mining: Challenge
Given a sequence: <{a b} {c d e} {f} {g h i}>
  Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.
How many k-subsequences can be extracted from a given n-sequence?
  <{a b} {c d e} {f} {g h i}>    n = 9
  k=4:  Y _ _ Y Y _ _ _ Y
  <{a} {d e} {i}>
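Each k-subsequence corresponds to choosing which k of the n events to keep, so the count is the binomial coefficient C(n, k). The sketch below evaluates it for the example and decodes the Y/_ mask shown above; the helper name select is hypothetical.

```python
from math import comb

sequence = [["a", "b"], ["c", "d", "e"], ["f"], ["g", "h", "i"]]


def select(sequence, mask):
    """Keep the events marked 'Y' in mask; drop elements that become empty."""
    picks = iter(mask)
    kept = [[e for e in element if next(picks) == "Y"] for element in sequence]
    return [element for element in kept if element]


n = sum(len(element) for element in sequence)
count = comb(n, 4)
chosen = select(sequence, "Y__YY___Y")
print(n, count, chosen)
```

For n = 9 and k = 4 this gives 126 subsequences from a single data sequence, which is why the candidate space grows so quickly.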
Sequential Pattern Mining: Example
Minsup = 50%
Examples of Frequent Subsequences:
  < {1,2} >          s=60%
  < {2,3} >          s=60%
  < {2,4} >          s=80%
  < {3} {5} >        s=80%
  < {1} {2} >        s=80%
  < {2} {2} >        s=60%
  < {1} {2,3} >      s=60%
  < {2} {2,3} >      s=60%
  < {1,2} {2,3} >    s=60%
Extracting Sequential Patterns
Given n events: i1, i2, i3, …, in
Candidate 1-subsequences:
  <{i1}>, <{i2}>, <{i3}>, …, <{in}>
Candidate 2-subsequences:
  <{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}>
Candidate 3-subsequences:
  <{i1, i2, i3}>, <{i1, i2, i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …,
  <{i1} {i1, i2}>, <{i1} {i1, i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, …
Generalized Sequential Pattern (GSP)
Step 1:
  Make the first pass over the sequence database D to yield all the 1-element frequent sequences
Step 2:
  Repeat until no new frequent sequences are found
    Candidate Generation: Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
    Candidate Pruning: Prune candidate k-sequences that contain infrequent (k-1)-subsequences
    Support Counting: Make a new pass over the sequence database D to find the support for these candidate sequences
    Candidate Elimination: Eliminate candidate k-sequences whose actual support is less than minsup
Candidate Generation
Base case (k=2):
  Merging two frequent 1-sequences <{i1}> and <{i2}> will produce two candidate 2-sequences: <{i1} {i2}> and <{i1 i2}>
General case (k>2):
  A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2
    The resulting candidate after merging is given by the sequence w1 extended with the last event of w2.
      If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1
      Otherwise, the last event in w2 becomes a separate element appended to the end of w1
Candidate Generation Examples
Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}> because the last two events in w2 (4 and 5) belong to the same element
Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}> because the last two events in w2 (4 and 5) do not belong to the same element
We do not have to merge the sequences w1 = <{1} {2 6} {4}> and w2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}> because if the latter is a viable candidate, then it can be obtained by merging w1 with <{2 6} {4 5}>
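The general-case merge rule can be sketched as follows. The representation is an assumption made for the example: a sequence is a tuple of elements, each element is a tuple of events kept in sorted order, so the "first event" is the first entry of the first element and the "last event" is the last entry of the last element. Only the k>2 case is handled.

```python
def drop_first(seq):
    """Remove the first event of the first element (drop the element if it empties)."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last event of the last element (drop the element if it empties)."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def merge(w1, w2):
    """Return the candidate produced by merging w1 with w2, or None if they do not join."""
    if drop_first(w1) != drop_last(w2):
        return None
    last_event = w2[-1][-1]
    if len(w2[-1]) > 1:
        # Last two events of w2 share an element: extend the last element of w1.
        return w1[:-1] + (w1[-1] + (last_event,),)
    # Otherwise the last event becomes a new element at the end of w1.
    return w1 + ((last_event,),)


c1 = merge(((1,), (2, 3), (4,)), ((2, 3), (4, 5)))
c2 = merge(((1,), (2, 3), (4,)), ((2, 3), (4,), (5,)))
c3 = merge(((1,), (2, 6), (4,)), ((1,), (2,), (4, 5)))
print(c1, c2, c3)
```

The first two calls reproduce the first two examples above; the third returns None because removing the first event of w1 and the last event of w2 leaves different subsequences.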
GSP Example
Frequent 3-sequences:
  < {1} {2} {3} >
  < {1} {2 5} >
  < {1} {5} {3} >
  < {2} {3} {4} >
  < {2 5} {3} >
  < {3} {4} {5} >
  < {5} {3 4} >
Candidate Generation:
  < {1} {2} {3} {4} >
  < {1} {2 5} {3} >
  < {1} {5} {3 4} >
  < {2} {3} {4} {5} >
  < {2 5} {3 4} >
Candidate Pruning:
  < {1} {2 5} {3} >
Timing Constraints (I)
[Figure: sequence {A B} {C} {D E}; the gap between consecutive elements must be <= xg and > ng, and the whole pattern must span <= ms]
  xg: max-gap
  ng: min-gap
  ms: maximum span

xg = 2, ng = 0, ms = 4

  Data sequence                              Subsequence        Contain?
  < {2,4} {3,5,6} {4,7} {4,5} {8} >          < {6} {5} >        Yes
  < {1} {2} {3} {4} {5} >                    < {1} {4} >        No
  < {1} {2,3} {3,4} {4,5} >                  < {2} {3} {5} >    Yes
  < {1,2} {3} {2,3} {3,4} {2,4} {4,5} >      < {1,2} {5} >      No
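A containment check with these constraints can be sketched with a small backtracking search. The sketch assumes that consecutive elements of a data sequence are one time unit apart, so the timestamp of an element is simply its position; the function name and the list-of-sets representation are also assumptions.

```python
def contains_timed(data_seq, pattern, max_gap, min_gap, max_span):
    """Check whether pattern occurs in data_seq under max-gap, min-gap and
    maximum-span constraints, using element positions as timestamps."""

    def search(k, prev_t, first_t):
        if k == len(pattern):
            return True
        start = 0 if prev_t is None else prev_t + 1
        for t in range(start, len(data_seq)):
            if not set(pattern[k]) <= set(data_seq[t]):
                continue
            if prev_t is not None:
                gap = t - prev_t
                if gap > max_gap or gap <= min_gap:
                    continue
                if t - first_t > max_span:
                    continue
            if search(k + 1, t, t if first_t is None else first_t):
                return True
        return False

    return search(0, None, None)


args = dict(max_gap=2, min_gap=0, max_span=4)
t1 = contains_timed([{2, 4}, {3, 5, 6}, {4, 7}, {4, 5}, {8}], [{6}, {5}], **args)
t2 = contains_timed([{1}, {2}, {3}, {4}, {5}], [{1}, {4}], **args)
t3 = contains_timed([{1}, {2, 3}, {3, 4}, {4, 5}], [{2}, {3}, {5}], **args)
t4 = contains_timed([{1, 2}, {3}, {2, 3}, {3, 4}, {2, 4}, {4, 5}], [{1, 2}, {5}], **args)
print(t1, t2, t3, t4)
```

Backtracking is needed here because, unlike the unconstrained case, matching an element as early as possible can violate the max-gap constraint even when a later match would succeed.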
Mining Sequential Patterns with Timing Constraints
Approach 1:
  Mine sequential patterns without timing constraints
  Postprocess the discovered patterns
Approach 2:
  Modify GSP to directly prune candidates that violate timing constraints
  Question: Does Apriori principle still hold?
Apriori Principle for Sequence Data
Suppose:
xg = 1 (max-gap)
ng = 0 (min-gap)
ms = 5 (maximum span)
minsup = 60%
<{2} {5}> support = 40%
but
<{2} {3} {5}> support = 60%
Problem exists because of max-gap constraint
No such problem if max-gap is infinite
Contiguous Subsequences
s is a contiguous subsequence of w = <e1> <e2> … <ek> if any of the following conditions hold:
  s is obtained from w by deleting an item from either e1 or ek
  s is obtained from w by deleting an item from any element ei that contains more than 2 items
  s is a contiguous subsequence of s’ and s’ is a contiguous subsequence of w (recursive definition)
Examples: s = < {1} {2} >
  is a contiguous subsequence of < {1} {2 3} >, < {1 2} {2} {3} >, and < {3 4} {1 2} {2 3} {4} >
  is not a contiguous subsequence of < {1} {3} {2} > and < {2} {1} {3} {2} >
Modified Candidate Pruning Step
Without maxgap constraint:
  A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent
With maxgap constraint:
  A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent
Timing Constraints (II)
  xg: max-gap
  ng: min-gap
  ws: window size
  ms: maximum span

xg = 2, ng = 0, ws = 1, ms = 5

  Data sequence                          Subsequence          Contain?
  < {2,4} {3,5,6} {4,7} {4,6} {8} >      < {3} {5} >          No
  < {1} {2} {3} {4} {5} >                < {1,2} {3} >        Yes
  < {1,2} {2,3} {3,4} {4,5} >            < {1,2} {3,4} >      Yes
Modified Support Counting Step
Given a candidate pattern: <{a, c}>
Any data sequences that contain
  <… {a c} … >,
  <… {a} … {c} …> (where time({c}) - time({a}) ≤ ws)
  <… {c} … {a} …> (where time({a}) - time({c}) ≤ ws)
will contribute to the support count of candidate pattern
Other Formulation
In some domains, we may have only one very long time series
  Example:
    monitoring network traffic events for attacks
    monitoring telecommunication alarm signals
Goal is to find frequent sequences of events in the time series
  This problem is also known as frequent episode mining
[Figure: an object's timeline of events E1 to E5, annotated with the counts CWIN = 6, CMINWIN = 4, CDIST_O = 8, CDIST = 5]
Frequent Subgraph Mining
Extend association rule mining to finding frequent subgraphs
Useful for Web Mining, computational chemistry, bioinformatics, spatial data sets, etc.
Graph Definitions
Representing Transactions as Graphs
Each transaction is a clique of items

  Transaction Id   Items
  1                {A,B,C,D}
  2                {A,B,E}
  3                {B,C}
  4                {A,B,D,E}
  5                {B,C,D}
Representing Graphs as Transactions
Challenges
Node may contain duplicate labels
Support and confidence
  How to define them?
Additional constraints imposed by pattern structure
  Support and confidence are not the only constraints
  Assumption: frequent subgraphs must be connected
Apriori-like approach:
  Use frequent k-subgraphs to generate frequent (k+1) subgraphs
    What is k?
Challenges…
Support:
  number of graphs that contain a particular subgraph
Apriori principle still holds
Level-wise (Apriori-like) approach:
  Vertex growing: k is the number of vertices
  Edge growing: k is the number of edges
Vertex Growing
[Figure: graphs G1 and G2 and the result of vertex growing, G3 = join(G1,G2)]
Apriori-like Algorithm
Find frequent 1-subgraphs
Repeat
  Candidate generation
    Use frequent (k-1)-subgraphs to generate candidate k-subgraph
  Candidate pruning
    Prune candidate subgraphs that contain infrequent (k-1)-subgraphs
  Support counting
    Count the support of each remaining candidate
  Eliminate candidate k-subgraphs that are infrequent
In practice, it is not as easy. There are many other issues
Example: Dataset
[Figure: example graph dataset with minimum support count = 2, showing the frequent subgraphs for k=2 and a pruned candidate]
Candidate Generation
In Apriori:
  Merging two frequent k-itemsets will produce a candidate (k+1)-itemset
In frequent subgraph mining (vertex/edge growing)
  Merging two frequent k-subgraphs may produce more than one candidate (k+1)-subgraph
Multiplicity of Candidates (Vertex Growing)
[Figure: two graphs whose join by vertex growing yields more than one candidate]
Multiplicity of Candidates (Edge growing)
Case 3: Core multiplicity
Adjacency Matrix Representation
The same graph can be represented in many ways

        A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8)
  A(1)    1    1    1    0    1    0    0    0
  A(2)    1    1    0    1    0    1    0    0
  A(3)    1    0    1    1    0    0    1    0
  A(4)    0    1    1    1    0    0    0    1
  B(5)    1    0    0    0    1    1    1    0
  B(6)    0    1    0    0    1    1    0    1
  B(7)    0    0    1    0    1    0    1    1
  B(8)    0    0    0    1    0    1    1    1

        A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8)
  A(1)    1    1    0    1    0    1    0    0
  A(2)    1    1    1    0    0    0    1    0
  A(3)    0    1    1    1    1    0    0    0
  A(4)    1    0    1    1    0    0    0    1
  B(5)    0    0    1    0    1    0    1    1
  B(6)    1    0    0    0    0    1    1    1
  B(7)    0    1    0    0    1    1    1    0
  B(8)    0    0    0    1    1    1    0    1
Graph Isomorphism
A graph is isomorphic if it is topologically equivalent to another graph
[Figure: two isomorphic graphs with vertices labeled A and B]
Graph Isomorphism
Test for graph isomorphism is needed:
  During candidate generation step, to determine whether a candidate has been generated
  During candidate pruning step, to check whether its (k-1)-subgraphs are frequent
  During candidate counting, to check whether a candidate is contained within another graph
Graph Isomorphism
Use canonical labeling to handle isomorphism
  Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs will be mapped to the same canonical encoding
  Example:
    Lexicographically largest adjacency matrix
    String: 0010001111010110
    Canonical: 0111101011001000
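The example can be reproduced with a brute-force sketch that tries every vertex ordering and keeps the lexicographically largest row-by-row string. Reading the 16-character string as a 4 x 4 adjacency matrix in row-major order is an assumption about how the code was formed, and the exhaustive search is only practical for very small graphs.

```python
from itertools import permutations


def matrix_string(adj, order):
    """Concatenate the adjacency matrix row by row under a given vertex order."""
    return "".join(str(adj[i][j]) for i in order for j in order)


def canonical_code(adj):
    """Lexicographically largest matrix string over all vertex orderings."""
    return max(matrix_string(adj, order) for order in permutations(range(len(adj))))


# 4-vertex graph whose row-major string is 0010001111010110.
adj = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]

original = matrix_string(adj, range(4))
canonical = canonical_code(adj)
print(original)
print(canonical)
```

Two graphs are isomorphic exactly when their canonical codes are equal, so the code can serve as a hash key for detecting duplicate candidates.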
  Session Id  Country    Session Length (sec)  Number of Web Pages viewed  Gender  Browser Type  Buy
  1           USA        982                   8                           Male    IE            No
  2           China      811                   10                          Female  Netscape      No
  3           USA        2125                  45                          Female  Mozilla       Yes
  4           Germany    596                   4                           Male    IE            Yes
  5           Australia  123                   9                           Male    Mozilla       No
  …           …          …                     …                           …       …             …
  10