A Project Report On
MARKET ANALYSIS AND SALES DEVELOPMENT
Submitted By
SUNNY BARKE UIN 091P041
AADIL CHOUDHARY UIN 112P014
ADEEL ANSARI UIN 112P012
SAYED MEHDI ABBAS UIN 112P002
Under the guidance of
Prof. DINESH DEORE
Submitted as a partial fulfillment of
Bachelor of Engineering
B.E. (Semester VIII), COMPUTER
[2013 - 2014]
from
Rizvi College of Engineering
New Rizvi Educational Complex, Off-Carter Road,
Bandra(w), Mumbai - 400050
Affiliated to
University of Mumbai
CERTIFICATE
This is to certify that the project report entitled
"MARKET ANALYSIS AND SALES DEVELOPMENT"
Submitted By
SUNNY BARKE
AADIL CHOUDHARY
ADEEL ANSARI
SAYED MEHDI ABBAS
of Rizvi College of Engineering, Computer has been approved in partial fulfillment of requirement for
the degree of Bachelor of Engineering.
Prof. DINESH DEORE Prof. _______________
Internal Guide External Guide
Prof. DINESH DEORE Dr. Varsha Shah
Head of Department Principal
Prof. _______________ Prof. _______________
Internal Examiner External Examiner
Date:
Acknowledgement
I am profoundly grateful to Prof. DINESH DEORE for his expert guidance and continuous encouragement throughout, ensuring that this project reached its target from its commencement to its completion.
I would like to express my deepest appreciation towards Dr. Varsha Shah, Principal, RCOE, Mumbai, and Prof. DINESH DEORE, HoD, Computer Department, whose invaluable guidance supported me in completing this project.
At last I must express my sincere heartfelt gratitude to all the staff members of the Computer Department who helped me directly or indirectly during this course of work.
SUNNY BARKE
AADIL CHOUDHARY
ADEEL ANSARI
SAYED MEHDI ABBAS
ABSTRACT
The proposed system is designed to find the most frequent combinations of items. It is based on developing an efficient algorithm that outperforms the best available frequent-pattern algorithms on a number of typical data sets. This will help in marketing and sales, and the technique can be used to uncover interesting cross-sells and related products. Three different algorithms from association mining have been implemented, and the best combination method is then utilized to find more interesting results. The analyst can then perform the data mining and extraction, conclude the result, and make the appropriate decision.
With the explosive growth of information sources available on the World Wide Web, it has become
increasingly necessary for users to utilize automated tools to find the desired information resources, and
to track and analyze their usage patterns. Association rule mining is an active data mining research
area. However, most ARM algorithms cater to a centralized environment. In contrast to previous ARM
algorithms, Optimized Distributed Association Rule Mining (ODARM) is a distributed algorithm for
geographically spread data sets that aims to reduce operational/communication costs. Recently, as
the need to mine patterns across distributed databases has grown, Distributed Association Rule Mining
(DARM) algorithms have been developed. These algorithms assume that the databases are either hori-
zontally or vertically distributed. In the special case of databases populated from information extracted
from textual data, existing D-ARM algorithms cannot discover rules based on higher-order associations
between items in distributed textual documents that are neither vertically nor horizontally distributed,
but rather a hybrid of the two. Hence, this paper proposes a Distributed Count Association Rule Mining
Algorithm (DCARM), which is evaluated on real-time datasets obtained from the UCI Machine Learning
repository.
We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and
pruning techniques. We also present results of applying this algorithm to sales data obtained from a
large retailing company, which show the effectiveness of the algorithm.
Keywords: Association rule mining, Optimized Distributed Association Rule Mining (ODARM), Distributed Count Association Rule Mining Algorithm (DCARM)
Chapter 1
Introduction
Data mining, the extraction of hidden predictive information from large databases, is a powerful new
technology with great potential to help companies focus on the most important information in their
data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make
proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining
move beyond the analyses of past events provided by retrospective tools typical of decision support
systems. Data mining tools can answer business questions that traditionally were too time consuming to
resolve. They scour databases for hidden patterns, finding predictive information that experts may miss
because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. Data mining techniques can
be implemented rapidly on existing software and hardware platforms to enhance the value of existing
information resources, and can be integrated with new products and systems as they are brought on-line.
When implemented on high performance client/server or parallel processing computers, data mining
tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"
Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery
and Data Mining, is the process of automatically searching large volumes of data for patterns using tools
such as classification, association rule mining, clustering, etc. Data mining is a complex topic and has
links with multiple core fields such as computer science and adds value to rich seminal computational
techniques from statistics, information retrieval, machine learning and pattern recognition.
Data mining techniques are the result of a long process of research and product development. This
evolution began when business data was first stored on computers, continued with improvements in data
access, and more recently, generated technologies that allow users to navigate through their data in real
time. Data mining takes this evolutionary process beyond retrospective data access and navigation to
prospective and proactive information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms
Rizvi College of Engineering, Bandra, Mumbai. 1
Commercial databases are growing at unprecedented rates. A recent META Group survey of data
warehouse projects found that 19 percent of respondents are beyond the 50 gigabyte level, while 59
percent expect to be there by the second quarter of 1996. In some industries, such as retail, these numbers
can be much larger. The accompanying need for improved computational engines can now be met
in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms
embody techniques that have existed for at least 10 years, but have only recently been implemented as
mature, reliable, understandable tools that consistently outperform older statistical methods.
With the explosive growth of information sources available on the World Wide Web, it has become
increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly
defined as the discovery and analysis of useful information from the World Wide Web. This describes
the automatic search of information resources available online, i.e. Web content mining, and the discov-
ery of user access patterns from Web servers, i.e., Web usage mining.
Figure 1.1: Entity Relationship Diagram of Market-Basket Analysis
Chapter 2
Literature Survey
2.1 PROBLEM STATEMENT
To develop an efficient algorithm to find the desired information resources and their usage patterns, and also to develop a distributed algorithm for geographically distributed data sets that reduces communication cost and overhead.
Purpose
It has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. Association rule mining is an active data
mining research area. However, most ARM algorithms cater to a centralized environment. Distributed
Association Rule Mining (D-ARM) algorithms have been developed. These algorithms, however, as-
sume that the databases are either horizontally or vertically distributed. In the special case of databases
populated from information extracted from textual data, existing D-ARM algorithms cannot discover
rules based on higher-order associations between items in distributed textual documents that are neither
vertically nor horizontally distributed, but rather a hybrid of the two.
2.2 EXISTING SYSTEM
Data mining algorithms can be categorized into the following:
• Association Algorithm
• Classification
• Clustering Algorithm
2.2.1 Classification:
The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to the specific variable(s) you are trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "Good" and "Bad."
2.2.2 Clustering:
The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another,
where distance is measured with respect to all available variables. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
• Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-on
analysis can now be answered directly from the data quickly. A typical example of a predictive
problem is targeted marketing. Data mining uses data on past promotional mailings to identify the
targets most likely to maximize return on investment in future mailings. Other predictive problems
include forecasting bankruptcy and other forms of default, and identifying segments of a population
likely to respond similarly to given events.
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases
and identify previously hidden patterns in one step. DARM discovers rules from various geographically distributed data sets. However, the network connection between those data sets isn't as fast
as in a parallel environment, so distributed mining usually aims to minimize communication costs.
2.3 PROPOSED SYSTEM
• Unlike other algorithms, ODAM offers better performance by minimizing candidate-itemset generation costs. It achieves this by focusing on two major DARM issues: communication and synchronization. Communication is one of the most important DARM objectives; DARM algorithms will perform better if we can reduce communication costs (for example, message exchange size).
• Synchronization forces each participating site to wait a certain period until globally frequent itemset generation completes. Each site will wait longer if computing support counts takes more time. Hence, we reduce the computation time of candidate itemsets' support counts.
• To reduce communication costs, we highlight several message optimization techniques. Based on the message exchange method, we can divide the message optimization techniques into two categories: direct and indirect support-count exchange.
• Each method has different aims, expectations, advantages, and disadvantages. For example, the first method exchanges each candidate itemset's support count to generate the globally frequent itemsets of that pass (CD and FDM are examples of this approach).
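The direct support-count exchange described above can be illustrated with a minimal sketch: each site counts the candidate itemsets against its own transactions and sends only those counts to a coordinator, which sums them to decide global frequency. This is an illustrative sketch, not the actual ODAM/CD/FDM implementation; the site data, candidate list, and threshold are all hypothetical.

```python
from collections import Counter

# Each geographically distributed site holds its own transactions.
site_databases = {
    "site_A": [{"bread", "milk"}, {"bread"}],
    "site_B": [{"bread", "milk"}, {"milk", "jam"}],
    "site_C": [{"bread", "milk", "jam"}],
}
candidates = [frozenset({"bread", "milk"}), frozenset({"milk", "jam"})]
global_minsupport = 3  # absolute count across all sites (arbitrary choice)

def local_counts(db):
    """A site exchanges one support count per candidate, never raw data."""
    return Counter({c: sum(1 for t in db if c <= t) for c in candidates})

# Coordinator sums the per-site counts (the 'message exchange' step).
total = Counter()
for db in site_databases.values():
    total += local_counts(db)

globally_frequent = [c for c in candidates if total[c] >= global_minsupport]
```

Only one integer per candidate crosses the network per site, which is why reducing the candidate set directly reduces communication cost.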
2.4 SYSTEM REQUIREMENT SPECIFICATION
2.4.1 ENVIRONMENTAL SPECIFICATION
The environmental specification specifies the hardware and software requirements for carrying out this
project. The following are the hardware and the software requirements.
Hardware:
• 1 GB RAM
• 320 GB HDD
• Intel Core 2 Duo 2.4 GHz processor
Software:
• Windows XP Service Pack 2 / Windows 7
• Visual Studio 2008
• MS SQL Server 2005
• Windows operating system
Chapter 3
TECHNOLOGIES
3.1 SOFTWARE ENVIRONMENT
ASP.NET
ASP.NET is more than the next version of Active Server Pages (ASP); it is a unified Web development
platform that provides the services necessary for developers to build enterprise-class Web applications.
While ASP.NET is largely syntax-compatible with ASP, it also provides a new programming model and
infrastructure that enables a powerful new class of applications. You can migrate your existing ASP
applications by incrementally adding ASP.NET functionality to them. ASP.NET is a compiled, .NET Framework-based environment. You can author applications in any .NET Framework-compatible language, including Visual Basic and Visual C#. Additionally, the entire .NET Framework platform is
available to any ASP.NET application. Developers can easily access the benefits of the .NET Frame-
work, which include a fully managed, protected, and feature-rich application execution environment,
simplified development and deployment, and seamless integration with a wide variety of languages.
VB.NET
Visual Basic is a programming language that is designed especially for Windows programming. This chapter explains most of the tools available for implementing GUI-based programs. After introducing the basic facilities and tools provided by Visual Basic, we apply our knowledge to implementing a small VB program. Our program will implement a visual interface for a commonly known stack abstract data type.
VB.NET is still the only language in VS.NET that includes background compilation, which means
that it can flag errors immediately, while you type. VB.NET is the only .NET language that supports
late binding. In the VS.NET IDE, VB.NET provides a dropdown list at the top of the code window with
all the objects and events; the IDE does not provide this functionality for any other language. VB.NET
is also unique for providing default values for optional parameters, and for having a collection of the
controls available to the developer.
Advantages of VB.NET:
• Build Robust Windows-based Applications:
With new Windows Forms, developers using Visual Basic .NET can build Windows-based applications that leverage the rich user interface features available in the Windows operating system.
All the rapid application development (RAD) tools that developers have come to expect from Mi-
crosoft are found in Visual Basic .NET, including drag-and-drop design and code behind forms.
In addition, new features such as automatic control resizing eliminate the need for complex resize
code.
• Resolve Deployment and Versioning Issues Seamlessly: Visual Basic .NET delivers the answer
to all of your application setup and maintenance problems. With Visual Basic .NET, issues with
Component Object Model (COM) registration and DLL overwrites are relics of the past. Side-by-
side versioning prevents the overwriting and corruption of existing components and applications.
• Microsoft SQL Server 2005: Business today demands a different kind of data management solution. Performance, scalability, and reliability are essential, but businesses now expect more from their key
IT investment. SQL Server 2005 exceeds dependability requirements and provides innovative capa-
bilities that increase employee effectiveness, integrate heterogeneous IT ecosystems, and maximize
capital and operating budgets. SQL Server 2005 provides the enterprise data management plat-
form your organization needs to adapt quickly in a fast changing environment. Benchmarked for
scalability, speed, and performance, SQL Server 2005 is a fully enterprise-class database product,
providing core support for Extensible Markup Language (XML) and Internet queries.
• Easy-to-use Business Intelligence (BI) Tools: Through rich data analysis and data mining capabilities that integrate with familiar applications such as Microsoft Office, SQL Server 2005 enables
you to provide all of your employees with critical, timely business information tailored to their
specific information needs. Every copy of SQL Server 2005 ships with a suite of BI services.
• Self-Tuning and Management Capabilities: Revolutionary self-tuning and dynamic self-configuring
features optimize database performance, while management tools automate standard activities.
Graphical tools and performance wizards simplify setup, database design, and performance monitoring, allowing database administrators to focus on meeting strategic business needs.
• Data Management Applications and Services: Unlike its competitors, SQL Server 2005 provides a
powerful and comprehensive data management platform. Every software license includes extensive
management and development tools, a powerful extraction, transformation, and loading (ETL)
tool, business intelligence and analysis services such as Notification Service. The result is the
best overall business value available. Enterprise Edition includes the complete set of SQL Server
data management and analysis features and is uniquely characterized by several features that make it the most scalable and available edition of SQL Server 2005. It scales to the performance levels required to support the largest Web sites, Enterprise Online Transaction Processing (OLTP) systems, and Data Warehousing systems. Its support for failover clustering also makes it ideal for
any mission critical line-of-business application.
Chapter 4
SYSTEM DESIGN
4.1 SOFTWARE DESIGN
System design describes the approach to the creation of a system. This important phase provides the understanding and procedural details necessary for implementing the system recommended in the feasibility study. The design step produces a data design, an architectural design, and a procedural design. The data design transforms the information domain model created during analysis into the data structures that will be required to implement the software.
The architectural design defines the relationships among the major structural components and refines them into a procedural description of the software. Source code is then generated, and testing is conducted to integrate and validate the software. From a project management point of view, software design is conducted in two steps.
Preliminary design is concerned with the transformation of requirements into data and software architecture. Detailed design focuses on refinements to the architectural representation that lead to detailed data structure and algorithmic representations of the software.
4.1.1 Logical Design
The logical design of an information system is analogous to an engineering blueprint or conceptual view of an automobile. It shows the major features and how they are related to one another. The outputs, inputs, and relationships between the variables are designed in this phase. The objectives of database design are accuracy, integrity, successful recovery from failure, privacy and security of data, and good overall performance.
4.1.2 Input Design
The input design is the bridge between users and the information system. It specifies the manner in
which data enters the system for processing. It can ensure the reliability of the system and produce
reports from accurate data, or it may result in the output of erroneous information. Online data entry is
available which accepts input from the keyboard and data is displayed on the screen for verification.
While designing, the following points have been taken into consideration. Input formats are designed as
per the user requirements.
a) Interaction with the user is maintained in simple dialogues.
b) Appropriate fields are locked thereby allowing only valid inputs.
4.1.3 Output Design
Each and every activity in this work is result-oriented. The most important feature of an information system for users is the output. Efficient, intelligent output design improves the usability and acceptability of
the system and also helps in decision-making. Thus the following points are considered during output
design.
(1) What information is to be presented?
(2) Whether to display or print the information?
(3) How to arrange the information in an acceptable format?
(4) How is the status to be maintained each and every time?
(5) How to distribute the outputs to the recipients?
The system, being user-friendly in nature, serves to fulfill the requirements of the users; suitable screen designs are made and presented to the user for refinement. The main requirement for the user is the retrieval of information related to a particular user.
4.1.4 Data Design
Data design is the first of the three design activities that are conducted during software engineering. The
impact of data structure on program structure and procedural complexity causes data design to have a
profound influence on software quality. The concepts of information hiding and data abstraction provide
the foundation for an approach to data design.
4.2 FUNDAMENTAL DESIGN CONCEPTS
4.2.1 Abstraction
During software design, abstraction allows us to organize and channel our process by postponing structural considerations until the functional characteristics, data streams, and data stores have been established. Data abstraction involves specifying legal operations on objects; representation and manipulation details are suppressed.
4.2.2 Information Hiding
Information hiding is a fundamental design concept for software. When a software system is designed using the information hiding approach, each module in the system hides the internal details of its processing activities, and modules communicate only through well-defined interfaces. Information hiding can be used as the principal design technique for the architectural design of a system.
4.2.3 Modularity
Modular systems incorporate collections of abstractions in which each functional abstraction, each data
abstraction, and each control abstraction handles a local aspect of the problem being solved. A modular system consists of well-defined interfaces among the units. Modularity enhances design clarity, which
in turn eases implementation, debugging and maintenance of the software product.
4.2.4 Concurrency
Software systems can be categorized as sequential or concurrent. In a sequential system, only one part of the system is active at any given time. Concurrent systems have independent processes that can be active simultaneously if multiple processors are available.
4.2.5 Verification
Design is the bridge between customer requirements and an implementation that satisfies those requirements. This is typically done in two steps:
1. Verification that the software requirements definition satisfies the customer's needs.
2. Verification that the design satisfies the requirements definition.
4.3 DATA FLOW DIAGRAM
Figure 4.1:
Figure 4.2:
Overview of the System:
Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. Association rules show attribute-value conditions that occur frequently together in a given dataset. A typical and widely used example of association rule mining is Market Basket Analysis.
For example, data are collected using bar-code scanners in supermarkets. Such market basket databases
consist of a large number of transaction records. Each record lists all items bought by a customer on a
single purchase transaction. Managers would be interested to know if certain groups of items are consis-
tently purchased together. They could use this data for adjusting store layouts (placing items optimally
with respect to each other), for cross-selling, for promotions, for catalog design and to identify customer
segments based on buying patterns.
Association rules provide information of this type in the form of "if-then" statements. These rules
are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in
nature.
In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two numbers that express the degree of uncertainty about the rule. In association analysis the
antecedent and consequent are sets of items (called itemsets) that are disjoint (do not have any items in
common).
The first number is called the support for the rule. The support is simply the number of transactions
that include all items in the antecedent and consequent parts of the rule. (The support is sometimes
expressed as a percentage of the total number of records in the database.)
The other number is known as the confidence of the rule. Confidence is the ratio of the number of
transactions that include all items in the consequent as well as the antecedent (namely, the support) to
the number of transactions that include all items in the antecedent.
For example, if a supermarket database has 100,000 point-of-sale transactions, out of which 2,000
include both items A and B and 800 of these include item C, the association rule "If A and B are purchased then C is purchased on the same trip" has a support of 800 transactions (alternatively 0.8% =
800/100,000) and a confidence of 40% (=800/2,000). One way to think of support is that it is the proba-
bility that a randomly selected transaction from the database will contain all items in the antecedent and
the consequent, whereas the confidence is the conditional probability that a randomly selected transac-
tion will include all the items in the consequent given that the transaction includes all the items in the
antecedent. An association rule tells us about the association between two or more items. For example:
In 80% of the cases when people buy bread, they also buy milk. This tells us of the association between
bread and milk.
We represent it as: bread => milk (80%)
This should be read as: "Bread means or implies milk, 80% of the time." Here 80% is the "confidence factor" of the rule.
Association rules can be between more than 2 items. For example:
bread, milk => jam (60%)
bread => milk, jam (40%)
Given any rule, we can easily find its confidence. For example, for the rule
bread, milk => jam
we count the number, say n1, of records that contain bread and milk. Of these, how many contain jam as well? Let this be n2. Then the required confidence is n2/n1.
This means that the user has to guess which rule is interesting and ask for its confidence. But our goal was to "automatically" find all interesting rules. This is going to be difficult because the database is bound to be very large. We might have to go through the entire database many times to find all interesting rules.
Brute Force
The common-sense approach to solving this problem is as follows -
Let I = { i1, i2, ..., in } be a set of items, also called an itemset. The number of times this itemset appears in the database is called its "support". Note that we speak about the support of an itemset and the confidence of a rule. The other combinations, support of a rule and confidence of an itemset, are not defined.
Now, if we know the support of "I" and all its subsets, we can calculate the confidence of all rules which involve these items. For example, the confidence of the rule i1, i2, i3 => i4, i5 is

    support of { i1, i2, i3, i4, i5 } / support of { i1, i2, i3 }
So, the easiest approach would be to let "I" contain all items in the supermarket, then set up a counter for every subset of "I" to count all its occurrences in the database. At the end of one pass of the database, we would have all those counts and we could find the confidence of all rules, then select the most "interesting" rules based on their confidence factors. How easy. The problem with this approach is that normally "I" will contain at least about 100 items. This means that it can have 2^100 subsets, and we would need to maintain that many counters. If each counter is a single byte, then about 10^20 GB would be required. Clearly this can't be done.
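For a tiny item universe the brute-force scheme can be written out directly, which makes the exponential counter count obvious (the data below is illustrative only):

```python
from itertools import combinations

# Brute-force approach: one counter per non-empty subset of I.
# Feasible only for tiny universes; for |I| = 100 there would be
# 2**100 - 1 counters, which is exactly why this does not scale.
items = ["bread", "milk", "jam"]  # tiny illustrative universe
transactions = [{"bread", "milk"}, {"bread", "milk", "jam"}, {"milk"}]

counters = {}
for r in range(1, len(items) + 1):
    for subset in combinations(items, r):
        counters[frozenset(subset)] = 0

# One pass over the database updates every matching counter.
for t in transactions:
    for s in counters:
        if s <= t:
            counters[s] += 1

# Already 2**3 - 1 = 7 counters for just three items.
```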
Minimum Support
To make the problem tractable, we introduce the concept of minimum support. The user has to specify
this parameter; let us call it minsupport. Then any rule i1, i2, ..., in => j1, j2, ..., jn needs to be considered only if the set of all items in this rule, which is { i1, i2, ..., in, j1, j2, ..., jn }, has support greater than minsupport.
The idea is that in the rule
bread, milk => jam
if the number of people buying bread, milk and jam together is very small, then this rule is hardly worth
consideration (even if it has high confidence).
Our problem now becomes: find all rules that have a given minimum confidence and involve itemsets whose support is more than minsupport. Clearly, once we know the supports of all these itemsets, we can easily determine the rules and their confidences. Hence we need to concentrate on the problem of finding all itemsets which have minimum support. We call such itemsets frequent itemsets.
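The resulting problem, find every itemset whose support is at least minsupport, can be stated directly in code. A brute-force sketch over a toy dataset (the threshold is an arbitrary choice for illustration):

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "jam"},
    {"bread", "jam"},
    {"milk", "jam"},
]
minsupport = 2  # absolute transaction count, chosen arbitrarily

# Enumerate every non-empty itemset over the observed items and keep
# only those whose support meets the threshold: the frequent itemsets.
items = sorted(set().union(*transactions))
frequent = {}
for r in range(1, len(items) + 1):
    for c in combinations(items, r):
        s = frozenset(c)
        count = sum(1 for t in transactions if s <= t)
        if count >= minsupport:
            frequent[s] = count
```

On this data every single item and every pair is frequent, but the triple {bread, milk, jam} appears only once and is discarded.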
Some Properties of Frequent Itemsets
The methods used to find frequent itemsets are based on the following properties -
1. Every subset of a frequent itemset is also frequent. Algorithms make use of this property in the following way: we need not count an itemset if any of its subsets is not frequent. So, we can first find the counts of some short itemsets in one pass of the database, then consider longer and longer itemsets in subsequent passes. When we consider a long itemset, we can make sure that all its subsets are frequent, because we already have the counts of all those subsets from previous passes.
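Property 1 underlies the level-wise (Apriori-style) strategy: a length-k candidate is counted only if all of its (k-1)-subsets were frequent in the previous pass. A minimal sketch, with illustrative data and threshold:

```python
from itertools import combinations

transactions = [
    {"bread", "milk", "jam"},
    {"bread", "milk"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "milk", "jam"},
]
minsupport = 3  # arbitrary absolute threshold

def count(s):
    return sum(1 for t in transactions if s <= t)

# Pass 1: frequent 1-itemsets.
items = set().union(*transactions)
level = {frozenset([i]) for i in items if count(frozenset([i])) >= minsupport}

frequent = set(level)
k = 2
while level:
    # Join the previous level into k-itemset candidates, then prune any
    # candidate with an infrequent (k-1)-subset (property 1) before counting.
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent
                         for s in combinations(c, k - 1))}
    level = {c for c in candidates if count(c) >= minsupport}
    frequent |= level
    k += 1
```

On this data all three items and all three pairs are frequent, while the triple survives pruning but fails the support count, so the loop stops after the third pass.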
2. Let us divide the tuples of the database into partitions, not necessarily of equal size. Then an itemset can be frequent only if it is frequent in at least one partition. This property enables us to apply divide-and-conquer type algorithms: we can divide the database into partitions and find the frequent itemsets in each partition. To see that this is true, consider k partitions of sizes n1, n2, ..., nk.
Let the minimum support be s, expressed as a fraction of the database size.
Consider an itemset which does not have minimum support in any partition. Then its counts in the
partitions must be less than s*n1, s*n2, ..., s*nk respectively. Therefore its total count must be less
than the sum of all these counts, which is s(n1 + n2 + ... + nk), i.e. s * (size of the database).
Hence the itemset is not frequent in the entire database. The same argument extends to a distributed
database.
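This partition argument can be verified numerically. Below is a small illustrative sketch; the transaction data and the threshold are invented for the example:

```python
from itertools import chain

# Invented toy database split into two unequal partitions.
partitions = [
    [{"bread", "milk"}, {"bread", "jam"}, {"milk"}],                  # n1 = 3
    [{"bread", "milk", "jam"}, {"jam"}, {"milk", "jam"}, {"bread"}],  # n2 = 4
]
s = 0.5  # minimum support as a fraction of the database size

def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

itemset = {"bread", "milk"}
local_counts = [count(itemset, p) for p in partitions]
total = count(itemset, list(chain.from_iterable(partitions)))

# If the itemset misses s*ni in every partition, its total count must
# miss s*(n1 + n2 + ... + nk), i.e. s * (size of database), as well.
if all(c < s * len(p) for c, p in zip(local_counts, partitions)):
    assert total < s * sum(len(p) for p in partitions)
```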
Use Case Diagrams
Figure 4.3: KDD
Chapter 5
IMPLEMENTATION
5.1 ALGORITHMS
Association Rule Mining
Association rule mining finds interesting associations and/or correlation relationships among large sets
of data items. Association rules show attribute-value conditions that occur frequently together in a
given dataset. A typical and widely used example of association rule mining is Market Basket Analysis.
For example, data are collected using bar-code scanners in supermarkets. Such market basket databases
consist of a large number of transaction records. Each record lists all items bought by a customer on a
single purchase transaction. Association rules provide information of this type in the form of "if-then"
statements. These rules are computed from the data and, unlike the if-then rules of logic, association
rules are probabilistic in nature. In addition to the antecedent (the "if" part) and the consequent (the
"then" part), an association rule has two numbers that express the degree of uncertainty about the rule:
• Support
• Confidence
Support: In association analysis the antecedent and consequent are sets of items (called itemsets) that
are disjoint (do not have any items in common). The first number is called the support for the rule. The
support is simply the number of transactions that include all items in the antecedent and consequent
parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in
the database.)
Confidence: The other number is known as the confidence of the rule. Confidence is the ratio of the
number of transactions that include all items in the consequent as well as the antecedent (namely, the
support) to the number of transactions that include all items in the antecedent.
Let us see an example based on these two association rule numbers:
If a supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items
A and B and 800 of these include item C, the association rule âIf A and B are purchased then C is
purchased on the same tripâ has a support of 800 transactions (alternatively 0.8% = 800/100,000) and
a confidence of 40% (=800/2,000). One way to think of support is that it is the probability that a ran-
domly selected transaction from the database will contain all items in the antecedent and the consequent,
whereas the confidence is the conditional probability that a randomly selected transaction will include
all the items in the consequent given that the transaction includes all the items in the antecedent.
An association rule tells us about the association between two or more items. For example: in 80% of
the cases when people buy bread, they also buy milk. This tells us of the association between bread and
milk.
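The arithmetic of the supermarket example above is easy to reproduce:

```python
total_transactions = 100_000  # point-of-sale transactions in the database
n_antecedent = 2_000          # transactions containing both A and B
n_rule = 800                  # of those, transactions that also contain C

support = n_rule / total_transactions  # fraction of all transactions
confidence = n_rule / n_antecedent     # conditional frequency of C given A and B

print(f"support = {support:.1%}, confidence = {confidence:.0%}")
# support = 0.8%, confidence = 40%
```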
Rizvi College of Engineering, Bandra, Mumbai. 14
22. Chapter 5 IMPLEMENTATION
We represent it as -
bread => milk (80%)
This should be read as: "Bread means or implies milk, 80% of the time." Here 80% is the "confidence
factor" of the rule. Association rules can involve more than two items. For example:
bread, milk => jam (60%)
bread => milk, jam (40%)
Given any rule, we can easily find its confidence. For example, for the rule
bread, milk => jam
We count the number, say n1, of records that contain bread and milk. Of these, how many contain jam
as well? Let this be n2. Then the required confidence is n2/n1. This means that the user has to guess
which rule is interesting and ask for its confidence. But our goal was to "automatically" find all
interesting rules. This is going to be difficult because the database is bound to be very large; we might
have to go through the entire database many times to find all interesting rules.
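Counting n1 and n2 for a single rule is a simple scan of the database; the toy data below is invented for illustration:

```python
# Invented toy transaction database.
transactions = [
    {"bread", "milk", "jam"},
    {"bread", "milk"},
    {"bread", "milk", "jam", "beer"},
    {"milk", "jam"},
    {"bread"},
]

antecedent = {"bread", "milk"}
consequent = {"jam"}

# n1: records containing the antecedent; n2: records containing both sides.
n1 = sum(1 for t in transactions if antecedent <= t)
n2 = sum(1 for t in transactions if (antecedent | consequent) <= t)

confidence = n2 / n1
print(n1, n2, confidence)
```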
Apriori Algorithm
Apriori is designed to operate on databases containing transactions - for example, collections of items
bought by customers, or details of website visits. As is common in association rule mining, given a set
of itemsets (for instance, sets of retail transactions, each listing individual items purchased), the
algorithm attempts to find subsets which are common to at least a minimum number C of the itemsets.
Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step
known as candidate generation) and groups of candidates are tested against the data. Apriori uses
breadth-first search and a hash tree structure to count candidate itemsets efficiently.
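A minimal Apriori sketch in Python. This illustrates the bottom-up, level-wise idea described above, together with the downward-closure pruning discussed earlier; it is an illustration, not the project's actual implementation:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return every itemset contained in at least `min_count` transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Level 1: frequent individual items.
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count}
    frequent = set(level)
    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset
        # (every subset of a frequent itemset must itself be frequent).
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Count step: one pass over the data per level.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count}
        frequent |= level
        k += 1
    return frequent

transactions = [
    {"bread", "milk"}, {"bread", "jam"}, {"bread", "milk", "jam"},
    {"milk", "jam"}, {"bread", "milk"},
]
print(sorted(sorted(s) for s in apriori(transactions, 3)))
# [['bread'], ['bread', 'milk'], ['jam'], ['milk']]
```

Each iteration of the loop corresponds to one pass of the database, so the number of passes grows with the length of the longest frequent itemset.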
Brute Force
The common-sense approach to solving this problem is as follows.
Let I = {i1, i2, ..., in} be a set of items, also called an itemset. The number of times this itemset
appears in the database is called its "support". Note that we can speak of the support of an itemset and
the confidence of a rule; the other combinations - the support of a rule and the confidence of an
itemset - are not defined.
Now, if we know the support of I and all its subsets, we can calculate the confidence of all rules which
involve these items. For example, the confidence of the rule
i1, i2, i3 => i4, i5
is
support of { i1, i2, i3, i4, i5 } / support of { i1, i2, i3 }
So, the easiest approach would be to let I contain all items in the supermarket, then set up a counter
for every subset of I to count all its occurrences in the database. At the end of one pass of the
database we would have all those counts, and we could find the confidence of all rules and select the
most "interesting" ones based on their confidence factors. The problem with this approach is that
normally I will contain at least about 100 items. This means that it can have 2^100 subsets, and we
would need to maintain that many counters. If each counter is a single byte, about 10^21 GB would be
required. Clearly this can't be done.
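The storage estimate can be checked directly:

```python
n_items = 100
n_subsets = 2 ** n_items      # one counter per subset of I
bytes_needed = n_subsets      # one byte per counter
gigabytes = bytes_needed / 2 ** 30
print(f"{gigabytes:.2e} GB")  # about 1.18e+21 GB
```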
Figure 5.1: Frequency Itemset Generation
MODULES:
Network Connections Management
Client-server computing, or networking, is a distributed application architecture that partitions tasks
or workloads between service providers (servers) and service requesters (clients). Often clients and
servers operate over a computer network on separate hardware. A server machine is a high-performance
host that runs one or more server programs which share their resources with clients. A client does not
share its resources; instead, clients initiate communication sessions with servers, which await (listen
for) incoming requests.
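A minimal sketch of this pattern using Python's standard library. The payload is invented, and this is only an illustration of the client-server idea, not the project's actual networking layer:

```python
import socket
import threading

# Server side: bind to loopback, let the OS pick a free port, listen.
srv = socket.create_server(("127.0.0.1", 0))
port = srv.getsockname()[1]

def serve():
    conn, _ = srv.accept()          # await an incoming request
    with conn:
        conn.recv(1024)             # read the client's request
        conn.sendall(b"support=6")  # share the server's "resource"

t = threading.Thread(target=serve, daemon=True)
t.start()

# Client side: initiate the session, send a request, read the reply.
with socket.create_connection(("127.0.0.1", port)) as cli:
    cli.sendall(b"GET support")
    reply = cli.recv(1024)
srv.close()
print(reply.decode())
```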
Database Management
The distributed database in our model is horizontally partitioned, which means the database schemas of
all the partitions are the same. However, a distributed database also has an intrinsic data-skewness
property: the distributions of the itemsets in different partitions are not identical, and many items
occur more frequently in some partitions than in others. As a result, many itemsets may be large
(frequent) locally at some sites but not necessarily at other sites. This skewness property poses a new
requirement on the design of the mining algorithm.
ARM Module:
Association rule mining is an active data mining research area, and most ARM algorithms cater to a
centralized environment. However, adapting centralized data mining to discover useful patterns in
distributed databases isn't always feasible, because merging data sets from different sites incurs huge
network communication costs. Therefore, our goal is to develop a distributed algorithm for
geographically distributed data sets that reduces communication costs.
EDMA Module:
In this project we develop an efficient association rule mining algorithm for distributed databases,
called EDMA. We have found that many candidate sets generated by applying the Apriori-gen function are
not needed in the search for frequent itemsets. In fact, there is a natural and effective method for
every site to generate its own set of candidate sets, which is typically much smaller than the set of
all candidate sets. Following that, every site only needs to find the frequent itemsets among these
candidate sets.
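The observation can be sketched as follows. The sites, data, and threshold below are invented, and this is only a simplified illustration of per-site candidate generation, not the EDMA algorithm itself:

```python
from itertools import combinations

def locally_large(transactions, min_count, k):
    """k-itemsets meeting the support threshold at this site alone."""
    counts = {}
    for t in transactions:
        for c in combinations(sorted(t), k):
            counts[c] = counts.get(c, 0) + 1
    return {c for c, n in counts.items() if n >= min_count}

# Two invented sites with skewed data.
site1 = [{"bread", "milk"}, {"bread", "milk"}, {"beer"}]
site2 = [{"bread", "milk"}, {"beer"}, {"beer"}]

# An itemset can be globally frequent only if it is locally large at
# some site, so candidates can be drawn from this union, which is
# typically far smaller than the set of all possible k-itemsets.
candidates = locally_large(site1, 2, 2) | locally_large(site2, 2, 2)
print(candidates)
```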
Results and Statistics
Then 2-itemsets are formed from the returned globally large 1-itemsets at each site, and their local
counts are calculated. The process is repeated until no further sets are formed or returned.
Figure 5.2:
Global support threshold:
(50/100) * 12 = 6
[The global support count is calculated only by adding the counts of locally large itemsets.]
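This threshold arithmetic, and the filtering of item counts against it, can be reproduced directly; the counts are the ones tabulated below, and the variable names are ours:

```python
total_transactions = 12
min_support = 0.50  # 50%

global_threshold = min_support * total_transactions
print(global_threshold)  # 6.0

# Item counts as tabulated in the text.
counts = {"Bread": 9, "Peanut butter": 6, "Milk": 5, "Beer": 3}
globally_large = {item for item, n in counts.items() if n >= global_threshold}
print(sorted(globally_large))  # ['Bread', 'Peanut butter']
```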
Item            Count
Bread           9
Peanut butter   6
Milk            5
Beer            3
Messages: [Considering site 3 as receiver site]
Site 1:
Messages sent = 2
Messages received= 2
Site 2:
Messages sent = 3
Messages received = 1
Site 3:
Messages sent = 3
Messages received = 5
TOTAL SENT TO SITE 3 = 3
TOTAL RECEIVED FROM SITE 3 = 5
TOTAL MESSAGES = 8
Figure 5.3: Data Flow Diagram of the Admin's Functions
Figure 5.4: Sequence Diagram of Manager, GUI & Application
Chapter 6
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality of
components, sub-assemblies, assemblies, and/or a finished product. It is the process of exercising
software with the intent of ensuring that the software system meets its requirements and user
expectations and does not fail in an unacceptable manner. There are various types of test, and each test
type addresses a specific testing requirement.
6.1 Types of Testing
6.1.1 Unit testing:
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly and that program inputs produce valid outputs. All decision branches and internal
code flow should be validated. It is the testing of individual software units of the application; it is
done after the completion of an individual unit and before integration. This is structural testing that
relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at the
component level and test a specific business process, application, and/or system configuration. Unit
tests ensure that each unique path of a business process performs accurately to the documented
specifications and contains clearly defined inputs and expected results.
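As an illustration, a unit test for a support-counting routine might look like the sketch below; the function under test is a hypothetical stand-in, not the project's actual code:

```python
import unittest

def support(itemset, transactions):
    """Hypothetical unit under test: count transactions containing `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

class TestSupport(unittest.TestCase):
    def test_counts_matching_transactions(self):
        db = [{"bread", "milk"}, {"bread"}, {"milk"}]
        self.assertEqual(support({"bread"}, db), 2)

    def test_empty_database(self):
        self.assertEqual(support({"bread"}, []), 0)
```

Running `python -m unittest` on the containing module would execute both test cases.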
6.1.2 Integration testing:
Integration tests are designed to test integrated software components to determine whether they actually
run as one program. Testing is event driven and is more concerned with the basic outcome of screens or
fields. Integration tests demonstrate that although the components were individually satisfactory, as
shown by successful unit testing, the combination of components is correct and consistent. Integration
testing is specifically aimed at exposing the problems that arise from the combination of components.
6.1.3 Functional test:
Functional tests provide systematic demonstrations that functions tested are available as specified by
the business and technical requirements, system documentation, and user manuals. Functional testing is
centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Rizvi College of Engineering, Bandra, Mumbai. 21
29. Chapter 6 SYSTEM TESTING
Output : identified classes of application outputs must be exercised.
Systems/Procedure : interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions, or special
test cases. In addition, systematic coverage of identified business process flows (data fields,
predefined processes, and successive processes) must be considered for testing. Before functional
testing is complete, additional tests are identified and the effective value of current tests is
determined.
6.1.4 System Test:
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration-oriented system integration test. System testing is based on process descriptions and
flows, emphasizing pre-driven process links and integration points.
6.1.5 White Box Testing:
White box testing is testing in which the software tester has knowledge of the inner workings, structure
and language of the software, or at least its purpose. It is used to test areas that cannot be reached
from a black-box level.
6.1.6 Black Box Testing:
Black box testing is testing the software without any knowledge of the inner workings, structure or
language of the module being tested. Black box tests, like most other kinds of tests, must be written
from a definitive source document, such as a specification or requirements document. It is testing in
which the software under test is treated as a black box: you cannot "see" into it. The test provides
inputs and responds to outputs without considering how the software works.
6.1.7 Unit Testing:
Unit testing is usually conducted as part of a combined code and unit test phase of the software lifecycle,
although it is not uncommon for coding and unit testing to be conducted as two distinct phases.
Test strategy and approach
Field testing will be performed manually and functional tests will be written in detail.
Test objectives:
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.
Features to be tested:
• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.
6.1.8 Integration Testing:
Software integration testing is the incremental integration testing of two or more integrated software
components on a single platform to produce failures caused by interface defects. The task of the
integration test is to check that components or software applications (e.g. components in a software
system or, one step up, software applications at the company level) interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing: User acceptance testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
Chapter 7
SYSTEM STUDY
7.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a
very general plan for the project and some cost estimates. During system analysis the feasibility study
of the proposed system is carried out, to ensure that the proposed system is not a burden to the
company. Three key considerations are involved in the feasibility analysis:
• ECONOMIC FEASIBILITY
• TECHNICAL FEASIBILITY
• SOCIAL FEASIBILITY
7.2 ECONOMIC FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization.
The amount of funds that the company can pour into the research and development of the system is
limited, and the expenditures must be justified. The developed system is well within budget; this was
achieved because most of the technologies used are freely available.
7.3 TECHNICAL FEASIBILITY:
This study is carried out to check the technical feasibility, that is, the technical requirements of the
system. Any system developed must not have a high demand on the available technical resources. This
will lead to high demands on the available technical resources. This will lead to high demands being
placed on the client. The developed system must have a modest requirement, as only minimal or null
changes are required for implementing this system.
7.4 SOCIAL FEASIBILITY:
This aspect of the study checks the level of acceptance of the system by the user. This includes the
process of training the user to use the system efficiently. The user must not feel threatened by the
system, but must instead accept it as a necessity. The level of acceptance by the users depends on the
methods employed to educate users about the system and to make them familiar with it. Their level of
confidence must be raised, as they are the final users of the system.
Chapter 8
PLAN OF WORK & PROJECT TIMELINE
Figure 8.1: Project Timeline
Gantt Charts
A Gantt chart shows planned and actual progress for a number of tasks displayed against a horizontal
time scale. It is an effective and easy-to-read method of indicating the actual current status of each
task in a set compared to its planned progress.
Gantt charts provide a clear picture of the current state of the project.
Figure 8.2: Gantt Charts
Figure 8.3: Planned Gantt Charts
Figure 8.4: Pert Charts
Chapter 9
Conclusion and Future Scope
9.1 CONCLUSION
Distributed ARM algorithms must reduce communication costs so that generating global association rules
costs less than combining the participating sites' datasets at a centralized site. We have developed an
efficient algorithm for mining association rules in distributed databases, which:
• Reduces the size of message exchanges through novel local and global pruning.
• Reduces the time needed to scan partition databases for support counts by using a compressed matrix
(CMatrix), which is very effective in increasing performance.
• Uses a central site to manage all message exchanges; to obtain all globally frequent itemsets, only
O(n) messages are needed for support-count exchange. This is much less than a straight adaptation of
Apriori, which requires O(n^2) messages for support-count exchange.
9.2 FUTURE ENHANCEMENT
EDMA can also be applied to the mining of association rules in a large centralized database by
partitioning the database across the nodes of a distributed system. This is particularly useful if the
data set is too large for sequential mining. In future work, since users' attention to the different
alarms in our communication network varies, how to decide the weight of each alarm remains to be
investigated.
References
[1] D.W. Cheung et al., "A Fast Distributed Algorithm for Mining Association Rules", Proc. Parallel and
Distributed Information Systems, IEEE CS Press, 1996, pp. 31-42.
[2] M.J. Zaki and Y. Pin, "Introduction: Recent Developments in Parallel and Distributed Data Mining",
J. Distributed and Parallel Databases, vol. 11, no. 2, 2002, pp. 123-127.
[3] D.W. Cheung et al., "Efficient Mining of Association Rules in Distributed Databases", IEEE Trans.
Knowledge and Data Eng., vol. 8, no. 6, 1996, pp. 911-922.
[4] A. Schuster and R. Wolff, "Communication-Efficient Distributed Mining of Association Rules", Proc.
ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2001, pp. 473-484.
[5] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large
Databases", Proc. ACM SIGMOD Int'l Conf. Management of Data, May 1993.
[6] M.Z. Ashrafi, "ODAM: An Optimized Distributed Association Rule Mining Algorithm", IEEE Distributed
Systems Online, ISSN 1541-4922, 2004.
[7] R. Kimball and M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd
edn., John Wiley & Sons, New York, 2002.
[8] Y. Ma, B. Liu, and C.K. Wong, "Web for Data Mining: Organizing and Interpreting the Discovered Rules
Using the Web", SIGKDD Explorations, vol. 2, no. 1, ACM Press, 2000, pp. 16-23.
[9] M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New Algorithms for Fast Discovery of
Association Rules", Technical Report No. 651, University of Rochester, 1997,
http://cs.aue.aau.dk/contribution/projects/datamining/papers/tr651.pdf
Appendix A
Project Hosting
The project is hosted at Google Code. The complete source code, along with a manual to operate the
project and supplementary files, has been uploaded.
Project Link : https://code.google.com/p/proquiz
QR CODE: