Data mining involves extracting hidden patterns from large databases. It helps companies analyze important information in their data. Some applications of data mining include financial data analysis, retail industry analysis, telecommunications analysis, biological data analysis, scientific applications, and intrusion detection. Data mining uses techniques like classification, clustering, and prediction.
The existing model uses structured data to predict whether patients are at high or low risk.
But for a complex disease, structured data alone cannot adequately describe the disease.
We propose a new convolutional neural network (CNN)-based multimodal disease risk prediction algorithm that uses both structured and unstructured hospital data.
In this paper we focus mainly on risk prediction for cerebral infarction.
The following topics are discussed in the presentation:
Introduction
Objective
Motivation
Literature Survey
Some Key Features of Disease
Plan of Action
Methodology Adopted
Data Collection
Steps to be Performed
Functional Architecture
A presentation on recent data mining techniques and future research directions, drawn from recent research papers; made in the pre-master's program at Cairo University under the supervision of Dr. Rabie.
This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens.
The presentation gives an overview of the MapReduce framework and a description of its open-source implementation (Hadoop). Amazon's own Elastic MapReduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.
Data Mining: Concepts and Techniques (3rd ed.), Chapter 3: Preprocessing (Salah Amean)
The chapter contains:
Data Preprocessing: An Overview,
Data Quality,
Major Tasks in Data Preprocessing,
Data Cleaning,
Data Integration,
Data Reduction,
Data Transformation and Data Discretization,
Summary.
Data science is an interdisciplinary field that uses algorithms, procedures, and processes to examine large amounts of data in order to uncover hidden patterns, generate insights, and direct decision making.
Andrea Bielli, IT Architect Global Digital Solution, Enel
Davide Gimondo, Software Engineer, Enel
Enel shows how Neo4j helps with the management of electricity grids in 8 countries around the world, with the goal of optimizing grid-traversal algorithms so as to make the networks ever more efficient and resilient.
Enel's objective is optimal management of the network topology in support of the group's goals: the energy transition and the electrification of the countries in which it operates, toward the Net Zero target for reducing emissions in the production and distribution of electricity.
The right architecture is key for any IT project. This is especially the case for big data projects, where there are no standard architectures that have proven their suitability over years. This session discusses the different big data architectures that have evolved over time, including the traditional Big Data architecture, the Streaming Analytics architecture, and the Lambda and Kappa architectures, and presents a mapping of components from both open source and the Oracle stack onto these architectures.
This is a PowerPoint presentation comparing the various available analytical tools, including tools for business analytics, with detailed descriptions.
DATA MINING AND DATA WAREHOUSE
W.H. Inmon
OLAP (on-line analytical processing)
OLTP (on-line transaction processing)
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data warehouse vs Data Mining
Use in Urban Planning
Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics whose overall goal is to extract information from a data set and transform it into a comprehensible structure for further use. The process of digging through data to discover hidden connections and predict future trends has a long history. Sometimes referred to as 'knowledge discovery in databases', the term data mining wasn't coined until the 1990s. What was old is new again, as data mining technology keeps evolving to keep pace with the limitless potential of big data and affordable computing power. Over the last decade, advances in processing power and speed have enabled us to move beyond manual, tedious and time-consuming practices to quick, easy and automated data analysis. The more complex the data sets collected, the more potential there is to uncover relevant insights. Rupashi Koul, "Overview of Data Mining", International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN 2456-6470, Volume 4, Issue 4, June 2020. Paper: https://www.ijtsrd.com/papers/ijtsrd31368.pdf | https://www.ijtsrd.com/engineering/computer-engineering/31368/overview-of-data-mining/rupashi-koul
Real World Application of Big Data in Data Mining Tools (ijsrd.com)
The main aim of this paper is to study the notion of big data and its application in data mining tools like R, Weka, RapidMiner, KNIME, Mahout, etc. We are awash in a flood of data today. In a broad range of application areas, data is being collected at unmatched scale. Decisions that previously were based on surmise, or on painstakingly constructed models of reality, can now be made based on the data itself. Such big data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences. The paper mainly focuses on different types of data mining tools and their usage on big data in knowledge discovery.
The Effectiveness of Data Mining Techniques in Banking (csijjournal)
The aim of this study is to identify the extent of the data mining activities that are practiced by banks. Data mining is the ability to link structured and unstructured information with the changing rules by which people apply it; it is not a technology, but a solution that applies information technologies. Currently, several industries, including banking, finance, retail, insurance, publicity, database marketing and sales prediction, use data mining tools for customer analysis. Leading banks are using data mining tools for customer segmentation and profitability, credit scoring and approval, predicting payment lapse, marketing, detecting illegal transactions, etc. Banking is realizing that it is possible to gain competitive advantage by deploying data mining. This article examines the effectiveness of data mining techniques in organized banking. It also discusses standard tasks involved in data mining and evaluates various data mining applications in different sectors.
Data Mining in the Telecommunication Industry (ijsrd.com)
Telecommunication companies today operate in a highly competitive and challenging environment. Vast volumes of data are generated from various operational systems and are used for solving many business problems that require urgent handling. These data include call detail data, customer data and network data. Data mining methods and business intelligence technology are widely used for handling business problems in this industry. The goal of this paper is to provide a broad review of data mining concepts.
1. Web Mining – Web mining is an application of data mining for discovering data patterns from the web. Web mining has three categories: content mining, structure mining and usage mining. Content mining detects patterns in data collected by the search engine. Structure mining examines data related to the structure of a website, while usage mining examines data from the user's browser. The data collected through web mining is evaluated and analyzed using techniques like clustering, classification, and association. It is a very good topic for a thesis in data mining.
2. Predictive Analytics – Predictive analytics is a set of statistical techniques for analyzing current and historical data to predict future events. The techniques include predictive modeling, machine learning, and data mining. In large organizations, predictive analytics helps businesses identify risks and opportunities. Both structured and unstructured data are analyzed to detect patterns. Predictive analysis is a lengthy process and consists of seven stages: project definition, data collection, data analysis, statistics, modeling, deployment, and monitoring. It is an excellent choice for research and a thesis.
3. Oracle Data Mining – Oracle Data Mining, also referred to as ODM, is a component of the Oracle Advanced Analytics Database. It provides powerful data mining algorithms that assist data analysts in getting valuable insights from data to predict future trends. It helps in predicting customer behavior, which ultimately helps in targeting the best customers and cross-selling. SQL functions are used in the algorithms to mine data tables and views. It is also a good choice for a thesis and research in data mining and databases.
4. Clustering – Clustering is a process in which data objects are divided into meaningful sub-classes known as clusters. Objects with similar characteristics are aggregated together in a cluster. There are distinct models of clustering, such as centralized and distributed. In centroid-based clustering, a vector value is assigned to each cluster. There are various applications of clustering in data mining, such as market research, image processing, and data analysis. It is also used in credit card fraud detection.
5. Text mining – Text mining, or text data mining, is a process for extracting high-quality information from text. It is done through patterns and trends devised using statistical pattern learning. First, the input data is structured; then patterns are derived from this structured data; finally, the output is evaluated and interpreted. The main applications of text mining include competitive intelligence, e-discovery, national security, and social media monitoring. It is a trending topic for a thesis in data mining.
6. Fraud Detection – The number of frauds in daily life is increasing in sectors like banking, finance, and government, and accurate detection of fraud is a challenge.
In the information age, data has become vital, so it is important to understand data in order to face future information challenges. This paper deals with the importance of data mining, explaining the concepts and life cycle involved. It extracts the basic gist of the topic and presents it in a user-friendly way, further developing the different stages of data mining and its extended application in practical business platforms.
Characterizing and Processing of Big Data Using Data Mining Techniques (IJTET Journal)
Abstract: Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. It concerns large-volume, complex and growing data sets with multiple, autonomous sources. Not only in science and engineering: big data is now rapidly expanding in all domains, physical, biological, etc. The main objective of this paper is to characterize the features of big data. The HACE theorem, which characterizes the features of the big data revolution, is used, and a big data processing model is proposed from the data mining perspective. The aggregation of mining and analysis, information sources, user interest modeling, and privacy and security are involved in this model. Exploring large volumes of data and extracting useful information or knowledge from them is the most fundamental challenge in big data, so we should analyze these problems and this knowledge revolution.
An Analysis of Impact Factors on the Agriculture Field Using Data Mining Techniques (ijcnes)
In computing and information systems, huge amounts of data are held in storage, and the task is to extract the specified data from the raw data. Data mining is one of the techniques that can extract it, and data mining techniques are used in many places. Techniques like k-means, k-nearest neighbor, support vector machines, bi-clustering, naive Bayes classifiers, neural networks and fuzzy c-means have been applied to agricultural data. There are many factors in agriculture; the main factors for the farmer are climate, soil and yield prediction. To improve production, farmers must select a suitable crop for a suitable climate. This paper presents the various concepts of data mining and their applications, and discusses the research field in agriculture, including the different types of factors that have an impact in the agriculture field.
Define “data mining”. Enumerate five example applications that can benefit from using data mining. 5M
Data mining:
1. The extraction of hidden information from large databases is called data mining.
2. Data mining is a powerful new technology that helps companies focus on the most important information in their databases or data warehouses.
3. Data mining tools allow businesses to make proactive, knowledge-driven decisions.
4. The analyses offered by data mining move beyond the analyses of past events provided by decision support systems.
5. Data mining tools can answer business questions that were traditionally too time-consuming to resolve.
6. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources.
7. They can be integrated with new products and systems as they are brought on-line.
8. Data mining techniques are the result of a long process of research and product development.
9. Data mining allows users to go through their data in real time.
10. Data mining is ready for application in the business community.
11. It is supported by three technologies:
i. Massive data collection
ii. Powerful multiprocessor computers
iii. Data mining algorithms
12. Data mining is widely used in diverse areas.
Data mining applications: financial data analysis, the retail industry, the telecommunication industry, biological data analysis, other scientific applications, and intrusion detection.
1. Financial Data Analysis:
Financial data is generally reliable and of high quality, which facilitates systematic data analysis and data mining.
Some of the cases are as follows:
i. Design and construction of data warehouses for multidimensional data analysis and data mining.
ii. Loan payment prediction and customer credit policy analysis.
iii. Classification and clustering of customers for targeted marketing.
iv. Detection of money laundering and other financial crimes.
2. Retail Industry:
Data mining has one of its best applications in the retail industry, because the industry collects large amounts of data on sales, customers, consumption and services.
Data mining in the retail industry helps identify customer buying patterns and trends, which leads to improved quality of customer service and satisfaction.
Here is a list of examples of data mining in the retail industry:
i. Design and construction of data warehouses based on the benefits of data mining.
ii. Multidimensional analysis of sales, customers, products, time and region.
iii. Analysis of the effectiveness of sales campaigns.
iv. Customer retention.
v. Product recommendation and cross-referencing of items.
3. Telecommunication Industry:
The telecommunication industry is one of the leading industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, etc.
Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding.
Data mining helps the telecommunication industry identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service.
Here is a list of examples for which data mining improves telecommunication services:
i. Multidimensional analysis of telecommunication data.
ii. Fraudulent pattern analysis.
iii. Identification of unusual patterns.
iv. Multidimensional association and sequential pattern analysis.
v. Mobile telecommunication services.
vi. Use of visualization tools in telecommunication data analysis.
4. Biological Data Analysis:
Biological data analysis is a very important part of bioinformatics.
Aspects of biological data analysis:
i. Semantic integration of heterogeneous, distributed genomic and proteomic databases.
ii. Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
iii. Discovery of structural patterns and analysis of genetic networks and protein pathways.
iv. Association and path analysis.
v. Visualization tools in genetic data analysis.
5. Other Scientific Applications:
Huge amounts of data are collected from scientific domains such as astronomy.
Large data sets are also being generated by fast numerical simulations in various fields.
Applications of data mining in scientific applications:
i. Data warehouses and data preprocessing.
ii. Graph-based mining.
iii. Visualization and domain-specific knowledge.
6. Intrusion Detection:
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources.
With the increased use of the internet and the availability of tools and tricks for intruding into networks, intrusion detection has become a critical component of network administration.
Here is a list of areas in which data mining technology may be applied for intrusion detection:
i. Development of data mining algorithms for intrusion detection.
ii. Association and correlation analysis, and aggregation to help select and build discriminating attributes.
iii. Analysis of stream data.
iv. Distributed data mining.
v. Visualization and query tools.
What is data preprocessing? Explain the different methods for the data cleansing phase. 5M
Data preprocessing:
1. Data preprocessing is an important step in the data mining process.
2. Data preprocessing includes cleaning, normalization, transformation, feature extraction and selection.
3. The product of data preprocessing is the final training set.
4. Data preprocessing is one of the most critical steps in a data mining process; it deals with the preparation and transformation of the initial dataset.
5. Data preprocessing methods are divided into the following categories:
i. Data Cleaning
ii. Data Integration
iii. Data Transformation
iv. Data Reduction
Different methods for the data cleansing phase: handling missing values (ignore the tuple, fill in the value manually, use a global constant, use the attribute mean, or use the most probable value) and smoothing noisy data by binning, regression and clustering, as described under the noisy-data question later in these notes; a small sketch follows below.
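As a minimal illustration of two of these preprocessing steps, here is a hedged Python sketch (the "income" column and its values are invented for the example; pandas is assumed to be available) that fills a missing value with the attribute mean and then applies min-max normalization:

# Toy preprocessing sketch: clean a missing value, then normalize the attribute.
import pandas as pd

df = pd.DataFrame({"income": [30.0, None, 70.0, 50.0]})  # hypothetical attribute

# Data cleaning: replace the missing value with the attribute mean (here 50.0).
df["income"] = df["income"].fillna(df["income"].mean())

# Data transformation: min-max normalization to the [0, 1] range.
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)

print(df)  # income: 30, 50, 70, 50 -> income_norm: 0.0, 0.5, 1.0, 0.5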
Difference between classification and prediction. 5M
1. Classification classifies data into predefined classes, while prediction predicts the value of new or unseen data.
2. Accuracy: the accuracy of a classifier refers to the ability of the given classifier to correctly classify new data; the accuracy of a predictor refers to how well a given predictor can estimate the value of new or unseen data.
3. Speed: the speed of a classifier refers to the computational cost involved in generating and using the classifier; likewise, the speed of a predictor refers to the computational cost involved in generating and using the predictor.
4. Robustness: the robustness of a classifier is its ability to make correct classifications on noisy data or data with missing values; the robustness of a predictor is its ability to make correct predictions on such data.
5. Scalability: the scalability of a classifier is the ability to construct the classifier so that it works efficiently on large amounts of data; the scalability of a predictor is defined analogously.
Data mining architecture 10M
1. Data mining is a very important process in which potentially useful and previously unknown information is extracted from large volumes of data.
2. A number of components are involved in the data mining process.
3. These components constitute the architecture of a data mining system.
4. The architecture of a typical data mining system may have the following major components.
5. Database, data warehouse, World Wide Web, or other information repository: this is one database or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
6. Data cleaning and data integration techniques may be performed on this data.
7. Database or data warehouse server:
The database or data warehouse server is responsible for fetching the relevant data, based on
the user’s data mining request.
8. Knowledge base:
i. This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
ii. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
iii. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included.
9. Data mining engine:
i. The data mining engine is the core component of any data mining system.
ii. It consists of a number of modules for performing data mining tasks including association,
classification, characterization, clustering, prediction, time-series analysis etc.
10. Pattern evaluation module:
i. This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns.
ii. It may use interestingness thresholds to filter out discovered patterns.
iii. Alternatively, the pattern evaluation module may be integrated with the mining module,
depending on the implementation of the data mining method used.
11. User interface:
i. This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results.
ii. In addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the patterns in different
forms.
12. Diagram: (architecture diagram of the above components; figure omitted in these notes).
13. Each and every component of a data mining system has its own role and importance in completing data mining efficiently.
14. These different modules need to interact correctly with each other in order to complete the complex process of data mining successfully.
What is KDD? Explain its process.
KDD (knowledge discovery from data, also knowledge discovery in databases):
1. KDD is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
2. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.
3. The KDD process is commonly defined with the stages: Selection, Pre-processing, Transformation, Data Mining and Interpretation/Evaluation.
KDD PROCESS:
1. Creating a target data set (data selection).
2. Data cleaning and preprocessing (may take 60% of the effort!).
3. Data reduction and transformation: finding useful features, dimensionality/variable reduction, and invariant representation.
4. Choosing the functions of data mining: summarization, classification, regression, association, clustering.
5. Choosing the mining algorithm(s).
ALGORITHM (DBSCAN, density-based clustering):
Input: D: a data set containing n objects; ε: the radius parameter; MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
1) mark all objects as unvisited;
2) do
3) randomly select an unvisited object p;
4) mark p as visited;
5) if the ε-neighborhood of p has at least MinPts objects
6) create a new cluster C, and add p to C;
7) let N be the set of objects in the ε-neighborhood of p;
8) for each point p' in N
9) if p' is unvisited
10) mark p' as visited;
11) if the ε-neighborhood of p' has at least MinPts points, add those points to N;
12) if p' is not yet a member of any cluster, add p' to C;
13) end for
14) output C;
15) else mark p as noise;
16) until no object is unvisited;
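A minimal runnable sketch of this algorithm, using scikit-learn's DBSCAN implementation rather than re-implementing the pseudocode (scikit-learn is assumed to be available; the toy points are invented, and eps/min_samples play the roles of ε and MinPts):

# DBSCAN demo: two dense groups of points plus one distant noise point.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # dense group A
    [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],   # dense group B
    [25.0, 25.0],                         # isolated point
])

# eps is the radius parameter; min_samples is the density threshold (MinPts).
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # expected: [0 0 0 1 1 1 -1], where -1 marks noise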
Advantages:
1. DBSCAN does not require one to specify the number of clusters in the data a priori, as
opposed to k-means.
2. DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded
by (but not connected to) a different cluster.
3. Due to the MinPts parameter, the so-called single-link effect (different clusters being
connected by a thin line of points) is reduced.
4. DBSCAN has a notion of noise, and is robust to outliers.
5. DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points
in the database.
6. DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an
R* tree.
Disadvantages:
1. DBSCAN is not entirely deterministic: border points that are reachable from more than one
cluster can be part of either cluster, depending on the order the data is processed.
2. The quality of DBSCAN depends on the distance measure used in the function
regionQuery(P,ε). The most common distance metric used is Euclidean distance.
3. DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε
combination cannot then be chosen appropriately for all clusters.
4. If the data and scale are not well understood, choosing a meaningful distance threshold ε can
be difficult.
What is Resource Allocation? Explain Resource leveling & Resource smoothing.
Resource allocation is the scheduling of activities and the resources required by those activities
while taking into consideration both the resource availability and project time.
The resource allocation procedure mainly consists of two activities: resource smoothing and resource leveling.
Resource Smoothing:
1. If the duration of the project is the constraint, then resource smoothing should be applied, without changing the total project duration.
2. The periods of minimum demand for resources are located, and activities are shifted according to the float available and the resources required.
3. Thus, intelligent utilization of floats can smooth the demand for resources to the maximum possible extent.
4. This type of resource allocation is called resource smoothing.
Resource Leveling:
1. In the process of resource leveling, whenever the availability of a resource becomes less than its maximum requirement, the only alternative is to delay the activity having the larger float.
2. In case two or more activities require the same amount of a resource, the activity with the minimum duration is chosen for resource allocation.
3. Resource leveling is done when the restriction is on the availability of resources.
Write a short note on linear regression. 5M
Linear regression involves finding the “best” line to fit two attributes (or variables), so that one
attribute can be used to predict the other.
Linear Regression
a. Straight-line regression:
1. Straight-line regression analysis involves a response variable, y, and a single predictor
variable, x.
2. It is the simplest form of regression, and models y as a linear function of x.
3. That is, y = b+wx; where the variance of y is assumed to be constant, b and w are regression
coefficients specifying the Y-intercept and slope of the line, respectively.
4. These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
5. The regression coefficients can be estimated using this method with the following equations: w = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)² and b = ȳ − w·x̄, where x̄ and ȳ are the means of the x and y values.
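As a small check of these equations, here is a sketch in plain Python (the sample x and y values are invented for illustration):

# Least-squares fit of the straight line y = b + w*x, using the equations above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Slope w and intercept b from the least-squares equations.
w = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b = y_bar - w * x_bar
print(f"y = {b:.2f} + {w:.2f}x")  # -> y = 0.09 + 1.99x for this sample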
b. Multiple linear regression:
1. Multiple linear regression is an extension of straight-line regression that involves more than one predictor variable.
2. It allows the response variable y to be modeled as a linear function of n predictor variables or attributes.
3. The equations, obtained from the method of least squares, become long and are tedious to solve by hand.
4. Multiple regression problems are instead commonly solved with statistical software packages such as SAS, SPSS, and S-Plus.
Criteria for comparing classification and prediction methods:
1. Speed and scalability: the time to construct the model and the time to use the model.
2. Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values.
3. Scalability: the ability to construct the classifier efficiently given large amounts of data.
4. Interpretability: the level of understanding and insight that is provided by the classifier.
5. Goodness of rules: decision tree size and the compactness of classification rules.
What is noisy data? How to handle it?
Noisy data:
1. Noisy data is meaningless data.
2. It includes any data that cannot be understood and interpreted correctly by machines, such as unstructured text.
3. Noisy data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data mining analysis.
4. Noisy data can be caused by faulty data collection instruments, human or computer errors at data entry, data transmission errors, limited buffer sizes for coordinating synchronized data transfer, inconsistencies in the naming conventions or data codes used, and inconsistent formats for input fields (e.g. dates).
Noisy data can be handled by the following procedures:
A. Binning:
1. Binning methods smooth a sorted data value by consulting the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the values around a value, they perform local smoothing.
2. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value.
3. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal-width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique.
B. Regression:
1. Here data can be smoothed by fitting the data to a function.
2. Linear regression involves finding the “best” line to fit two attributes, so that one attribute
can be used to predict the other.
3. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
C. Clustering:
1. Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.”
2. Similarly, values that fall outside of the set of clusters may also be considered outliers.
Describe one hierarchical clustering algorithm using an example dendrogram. 5M
There are three algorithms of hierarchical clustering: agglomerative hierarchical clustering, divisive hierarchical clustering, and BIRCH.
Agglomerative hierarchical clustering:
1. Agglomerative hierarchical clustering is a bottom-up clustering approach.
2. In this clustering, the clusters have sub-clusters.
3. An example of this type of clustering is species taxonomy.
4. Gene expression data also exhibit this hierarchical quality.
5. Agglomerative hierarchical clustering starts with every single object (gene or sample) in its own cluster.
6. Then, in each successive iteration, it merges the closest pair of clusters satisfying some similarity criterion, until all the data is in one cluster.
7. The clusters generated in early stages are nested in the clusters generated in later stages.
8. The clusters of different sizes in the tree can be valuable for discovery.
9. This type of clustering produces an ordering of the objects, which may be informative for data display.
10. In this clustering algorithm, smaller clusters are generated, which may be helpful for discovery.
11. Different methods for combining clusters in agglomerative hierarchical clustering:
i. Single linkage
ii. Complete linkage
iii. Average linkage
iv. Centroid method
v. Ward’s method
12. Example dendrogram: (figure omitted; a dendrogram is a tree diagram showing the order and distances at which clusters are merged).
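A short sketch that builds such a dendrogram with SciPy and matplotlib (both assumed available; the six one-dimensional points are invented for the example):

# Agglomerative clustering of six points, drawn as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[2.0], [3.0], [4.0], [10.0], [11.0], [25.0]])

# method="average" corresponds to the average-linkage method listed above.
Z = linkage(X, method="average")
dendrogram(Z, labels=["2", "3", "4", "10", "11", "25"])
plt.title("Example dendrogram")
plt.show()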
Explain the concept of a decision support system with the help of an example application. 5M
1. A decision support system (DSS) is a computer application that analyzes business data and presents it so that business decisions can be made more easily.
2. A decision support system is an "informational application" system.
3. Decision support systems help businesses and organizations in decision-making activities.
4. A decision support system presents information graphically and may include artificial intelligence.
5. Categories of decision support systems:
i. Communication-driven DSS: its purpose is to conduct meetings where users can collaborate.
ii. Data-driven DSS: used to query a database to seek specific answers for a specific purpose.
iii. Document-driven DSS: its purpose is to search web pages and find documents.
iv. Knowledge-driven DSS: used to produce management advice and to choose products or services.
v. Model-driven DSS: used by managers and staff members of the business to interact with each other and analyze decisions.
6. Decision support systems are combinations of integrated resources working together.
7. For example, a national on-line book seller wants to begin selling its products internationally, but first needs to determine whether that would be a wise business decision.
8. In such a case, the vendor can use a DSS to collect information from its own resources, using a tool such as OLAP, to determine whether the company has the ability to expand its business.
9. It can also collect information from external resources, such as industry data, to determine whether there is indeed a demand to meet.
10. The decision support system will collect and analyze the data and then present it in a way that can be interpreted by humans.
11. A few decision support systems come very close to acting as artificial intelligence agents.
What is clustering? Explain the k-means clustering algorithm. Suppose the data for clustering is {2, 4, 10, 12, 3, 20, 11, 25}. Consider k=2 and cluster the given data using the k-means algorithm. 10M
Clustering:
1. The process of partitioning a set of data into a set of meaningful sub-classes, or clusters, is called clustering.
2. Clustering is a technique used to place data elements into related groups.
3. Example: a graphical representation of clustering (scatter-plot figure omitted here) in which the points form four clusters.
4. A cluster is a collection of objects which are similar to one another and dissimilar to the objects of other clusters.
K-means clustering algorithm:
1. K-means clustering is an algorithm that groups objects, based on their features, into K groups.
2. K is a positive integer and is decided by the user.
3. The centroids of the clusters are generally far away from each other.
4. Assign each element to the cluster whose centroid is nearest, then recompute each centroid from the elements assigned to it.
5. In every step the centroids change, and elements may move from one cluster to another.
6. Repeat the same process until no element moves from one cluster to another.
Data: {2, 4, 10, 12, 3, 20, 11, 25}
K=2
Select any two initial means M1 and M2:
M1=4
M2=12
Partition the data set by assigning each value to the nearer mean:
K1 = {2, 3, 4}, mean = 3
K2 = {10, 11, 12, 20, 25}, mean = 15.6
Reassign the values of the data set according to the new mean values:
K1 = {2, 3, 4}, mean = 3
K2 = {10, 11, 12, 20, 25}, mean = 15.6
The means are unchanged, so the algorithm stops.
(Note: if you use different initial means for M1 and M2, the intermediate steps will differ and the algorithm may stop after a different number of iterations.)
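The same worked example as a small self-contained Python sketch (using the initial means 4 and 12, as above):

# 1-D k-means (k=2) on the data from the worked example.
def kmeans_1d(data, means):
    while True:
        # Assignment step: each value joins the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # Update step: recompute the means; stop when they no longer change.
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:
            return clusters, means
        means = new_means

clusters, means = kmeans_1d([2, 4, 10, 12, 3, 20, 11, 25], means=[4.0, 12.0])
print(clusters, means)  # [[2, 4, 3], [10, 12, 20, 11, 25]] [3.0, 15.6]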
For the given set of data points: 10M
(a) Find the mean, median and mode.
(b) Show a boxplot of the data, clearly indicating the five-number summary.
Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
(a) Mean, median and mode:
Mean:
Mean = sum of all values divided by the number of values
Mean = (11+13+13+15+15+16+19+20+20+20+21+21+22+23+24+30+40+45+45+45+71+72+73+75)/24 = 769/24 = 32.04
Median:
With 24 (an even number of) sorted values, the median is the average of the 12th and 13th values.
Median = (21 + 22)/2 = 21.5
Mode:
Mode = the value(s) repeated more often than the other values
Mode = 20 and 45 (each occurs three times)
(b) Boxplot of the data with the five-number summary (boxplot figure omitted): Minimum = 11, Q1 = 17.5, Median = 21.5, Q3 = 45, Maximum = 75.
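A short sketch that computes the summary and draws the boxplot (NumPy and matplotlib assumed; note that NumPy's default percentile interpolation can give slightly different quartiles than the median-of-halves convention used above):

# Five-number summary and boxplot for the data set above.
import numpy as np
import matplotlib.pyplot as plt

data = [11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
        22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75]

q1, median, q3 = np.percentile(data, [25, 50, 75])
print(min(data), q1, median, q3, max(data))

plt.boxplot(data)   # the box spans Q1..Q3 with a line at the median
plt.show()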
What is an outlier? Describe methods that can be used for outlier analysis. 10M
1. An outlier is an observation point that is distant from other observations.
2. An outlier may indicate measurement error or experimental error.
3. Outliers can carry novel, new, abnormal, unusual or noisy information about the data.
4. Outliers can be classified into three categories:
i. Point outliers
ii. Contextual outliers
iii. Collective outliers
5. Point outliers:
i. This is the simplest type of outlier, and the majority of research on outlier detection focuses on it.
ii. If an individual data point can be considered anomalous with respect to the rest of the data, it is called a point outlier.
6. Contextual outliers:
i. If an individual data point is anomalous only in a specific context, it is called a contextual (or conditional) outlier.
ii. In contextual outlier detection, each data point is described by two sets of attributes: contextual attributes and behavioral attributes.
7. Collective outliers:
i. If a collection of data points is anomalous with respect to the entire data set, it is called a collective outlier.
ii. Collective outliers can occur only in data sets in which the data points are somehow related.
8. The benefit of identifying an outlier is that it can be removed or modeled separately in regression modeling to improve accuracy.
9. Outlier detection is one of the basic problems of data mining.
10. Outliers may be erroneous or real.
Methods used for outlier analysis:
1. Statistical approach
2. Distance-based approach
3. Density-based local outlier approach
4. Deviation-based approach
1. Statistical approach:
i. This method assumes a distribution for the given data set and then identifies outliers with respect to that model using a discordance test.
ii. A statistical discordance test examines two hypotheses: a working hypothesis and an alternative hypothesis.
iii. The working hypothesis states that the entire data set of n objects comes from an initial distribution model.
iv. The alternative hypothesis states that the data set comes from another distribution model.
2. Distance-based approach:
i. This method generalizes the ideas behind discordance testing for various standard distributions.
ii. An object's neighbors are defined based on their distance from the given object.
iii. Many efficient algorithms for mining distance-based outliers have been developed: index-based, nested-loop and cell-based algorithms.
3. Density-based local outlier approach:
i. Whereas distance-based detection depends on the overall distribution of the given set of data points, this approach introduces the notion of local outliers: objects that are outliers relative to their local neighborhood.
ii. This approach can detect both global and local outliers.
4. Deviation-based approach:
i. This method identifies outliers by examining the main characteristics of objects in a group; objects that deviate from this description are considered outliers.
ii. The term "deviation" is typically used to refer to outliers in this approach.
iii. There are two techniques for deviation-based outlier detection: the first compares objects sequentially in a set, and the second employs an OLAP data-cube approach.
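As an illustration of the density-based local outlier approach, here is a minimal sketch using scikit-learn's LocalOutlierFactor (scikit-learn assumed available; the data points are invented, and the expected output is a plausible result rather than a guarantee):

# Density-based local outlier detection with the Local Outlier Factor (LOF).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[2.0], [3.0], [4.0], [10.0], [11.0], [12.0], [25.0]])

# n_neighbors controls the local neighborhood used to compare densities.
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)  # -1 marks points flagged as outliers
print(labels)                # the isolated value 25 should be flagged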
Design a BI system for fraud detection. Describe all the steps from data collection to decision making clearly. 10M
Fraud detection in the telecommunication industry:
1. Fraud is an adaptive crime; it needs special methods of intelligent data analysis to detect and prevent it.
2. The telecommunications industry has expanded with the development of affordable mobile phone technology.
3. For example, forensic analytics may be used to review an employee's purchasing-card activity to assess whether any of the purchases were diverted for personal use.
4. The main steps in forensic analytics are data collection, data preparation, data analysis and reporting.
5. Fraud detection methods exist in the areas of knowledge discovery in databases (KDD), data mining, machine learning and statistics.
6. They offer applicable and successful solutions in different areas of fraud crimes.
7. Fraud detection:
8. Fraud detection techniques are categorized into two primary classes:
i. Statistical data analysis techniques
ii. Artificial intelligence techniques
9. Statistical data analysis techniques:
i. Statistical data analysis techniques include data preprocessing techniques for detection, data validation, error correction and filling in of missing or incorrect data.
ii. They also include the calculation of various statistical parameters such as averages and performance metrics.
10. Artificial intelligence techniques:
i. Artificial intelligence techniques use data mining to classify, cluster and segment the data and automatically find association rules in the data related to fraud.
ii. They also include pattern recognition to detect approximate classes, clusters or patterns of suspicious behavior, either automatically or by matching given inputs.
Steps from data collection to decision making:
1. Data collection:
i. Before you collect new data, determine what information could be collected from existing databases or sources on hand.
ii. Determine a file storing and naming system ahead of time to help all tasked team members collaborate.
iii. If you need to gather data via observation or interviews, develop an interview template ahead of time to ensure consistency and save time.
iv. Keep your collected data organized in a log with collection dates, and add any source notes as you go.
2. Analyze data:
i. After you've collected the right data, it's time for deeper data analysis.
ii. Begin by manipulating your data in a number of different ways, such as plotting it and finding correlations, or by creating a pivot table in Excel.
iii. A pivot table lets you sort and filter data by different variables and lets you calculate the mean, maximum, minimum and standard deviation of your data.
3. Interpret results:
i. After analyzing the data, and possibly conducting further research, it's time to interpret your results.
ii. As you interpret your analysis, remember that you can never prove a hypothesis true; you can only fail to reject it.
iii. By following these steps in your data analysis process, you can make better decisions for your business or government agency.
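To make the statistical-analysis step concrete, here is a hedged sketch (the transaction amounts and the two-standard-deviation threshold are invented for illustration) that flags unusually large transactions:

# Simple statistical fraud screen: flag amounts far from the mean.
import statistics

amounts = [120.0, 80.5, 99.9, 110.0, 95.0, 130.0, 5000.0, 105.0]  # toy data

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Flag any amount more than two standard deviations from the mean.
flagged = [a for a in amounts if abs(a - mean) > 2 * stdev]
print(flagged)  # the 5000.0 transaction stands out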
Partition the given data into 4 bins using the equi-depth binning method and perform smoothing according to the following methods. 10M
Smoothing by bin means
Smoothing by bin medians
Smoothing by bin boundaries
Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
Distribute the data into 4 bins using equi-depth (equal-frequency) binning:
Total values (T) = 24
Number of values in each bin = 24/4 = 6
Thus we get:
B1 = 11, 13, 13, 15, 15, 16
B2 = 19, 20, 20, 20, 21, 21
B3 = 22, 23, 24, 30, 40, 45
B4 = 45, 45, 71, 72, 73, 75
i. Smoothing by bin means:
Replace each value of a bin with the bin's mean value.
Mean for B1 = (11+13+13+15+15+16)/6 = 13.83
Mean for B2 = (19+20+20+20+21+21)/6 = 20.17
Mean for B3 = (22+23+24+30+40+45)/6 = 30.67
Mean for B4 = (45+45+71+72+73+75)/6 = 63.5
Thus we get:
B1 = 13.83, 13.83, 13.83, 13.83, 13.83, 13.83
B2 = 20.17, 20.17, 20.17, 20.17, 20.17, 20.17
B3 = 30.67, 30.67, 30.67, 30.67, 30.67, 30.67
B4 = 63.5, 63.5, 63.5, 63.5, 63.5, 63.5
ii. Smoothing by bin medians:
Replace each value of a bin with the bin's median value.
Median for B1 = (13+15)/2 = 14
Median for B2 = (20+20)/2 = 20
Median for B3 = (24+30)/2 = 27
Median for B4 = (71+72)/2 = 71.5
Thus we get:
B1 = 14, 14, 14, 14, 14, 14
B2 = 20, 20, 20, 20, 20, 20
B3 = 27, 27, 27, 27, 27, 27
B4 = 71.5, 71.5, 71.5, 71.5, 71.5, 71.5
iii. Smoothing by bin boundaries:
Replace each value of a bin with its closest boundary value (here, a value equidistant from both boundaries is replaced by the upper boundary).
Thus we get:
B1 = 11, 11, 11, 16, 16, 16
B2 = 19, 21, 21, 21, 21, 21
B3 = 22, 22, 22, 22, 45, 45
B4 = 45, 45, 75, 75, 75, 75
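The whole exercise as a compact Python sketch (this assumes the bin depth of 6 computed above, and resolves ties in boundary smoothing toward the upper boundary, matching the worked answer):

# Equi-depth binning with mean, median and boundary smoothing.
data = sorted([11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
               22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75])

depth = len(data) // 4                                    # 4 bins of 6 values
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

by_mean = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
by_median = [[(b[depth // 2 - 1] + b[depth // 2]) / 2] * len(b) for b in bins]  # even depth
by_boundary = [[b[0] if x - b[0] < b[-1] - x else b[-1] for x in b] for b in bins]

print(by_mean)      # [[13.83]*6, [20.17]*6, [30.67]*6, [63.5]*6]
print(by_median)    # [[14.0]*6, [20.0]*6, [27.0]*6, [71.5]*6]
print(by_boundary)  # matches the bin-boundary answer above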