Define “data mining”. Enumerate five example applications that can benefit by using
data mining. 5M
Data mining:
1. The extraction of hidden, potentially useful information from large databases is called data mining.
2. Data mining is a powerful new technology that helps companies focus on the most important information in their databases and data warehouses.
3. Data mining tools allow businesses to make proactive, knowledge-driven decisions.
4. The analyses offered by data mining move beyond the analyses of past events provided by decision support systems.
5. Data mining tools can answer business questions that were traditionally too time-consuming to resolve.
6. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources.
7. They can be integrated with new products and systems as they are brought on-line.
8. Data mining techniques are the result of a long process of research and product development.
9. Data mining allows users to analyze their data in real time.
10. Data mining is ready for application in the business community.
11. It is supported by three technologies:
i. Massive data collection
ii. Powerful multiprocessor computers
iii. Data mining algorithms.
12. Data mining is widely used in diverse areas.
Data Mining Applications: Financial Data Analysis, Retail Industry, Telecommunication
Industry, Biological Data Analysis, Other Scientific Applications & Intrusion Detection.
1. Financial Data Analysis:
Financial data is generally reliable and of high quality, which facilitates systematic data analysis and data mining.
Some of the cases are as follows −
i. Design and construction of data warehouses for multidimensional data analysis and data
mining.
ii. Loan payment prediction and customer credit policy analysis.
iii. Classification and clustering of customers for targeted marketing.
iv. Detection of money laundering and other financial crimes.
2. Retail Industry:
Data mining has one of its best applications in the retail industry, because the industry collects large amounts of data on sales, customers, consumption and services.
Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and satisfaction.
Here is the list of examples of data mining in the retail industry −
i. Design and Construction of data warehouses based on the benefits of data mining.
ii. Multidimensional analysis of sales, customers, products, time and region.
iii. Analysis of effectiveness of sales campaigns.
iv. Customer Retention.
v. Product recommendation and cross-referencing of items.
3. Telecommunication Industry:
The telecommunication industry is one of the leading industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, etc.
Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding.
Data mining helps the telecommunication industry identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service.
Here is the list of examples for which data mining improves telecommunication services −
i. Multidimensional Analysis of Telecommunication data.
ii. Fraudulent pattern analysis.
iii. Identification of unusual patterns.
iv. Multidimensional association and sequential patterns analysis.
v. Mobile Telecommunication services.
vi. Use of visualization tools in telecommunication data analysis.
4. Biological Data Analysis:
Biological data analysis is a very important part of Bioinformatics.
Aspects of biological data analysis −
i. Semantic integration of heterogeneous, distributed genomic and proteomic databases.
ii. Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
iii. Discovery of structural patterns and analysis of genetic networks and protein pathways.
iv. Association and path analysis.
v. Visualization tools in genetic data analysis.
5. Other Scientific Applications:
Huge amounts of data are collected from scientific domains such as astronomy.
Large data sets are also being generated by fast numerical simulations in various fields.
Applications of data mining in the field of Scientific Applications −
i. Data Warehouses and data preprocessing.
ii. Graph-based mining.
iii. Visualization and domain-specific knowledge.
6. Intrusion Detection:
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources.
With the increased use of the internet and the availability of tools and tricks for intruding into and attacking networks, intrusion detection has become a critical component of network administration.
Here is the list of areas in which data mining technology may be applied for intrusion detection
−
i. Development of data mining algorithms for intrusion detection.
ii. Association and correlation analysis, aggregation to help select and build discriminating
attributes.
iii. Analysis of Stream data.
iv. Distributed data mining.
v. Visualization and query tools.
What is data preprocessing? Explain the different methods for the data cleansing phase. 5M
Data preprocessing:
1. Data preprocessing is an important step in the data mining process.
2. Data preprocessing includes cleaning, normalization, transformation, feature extraction and selection.
3. The product of data preprocessing is the final training set.
4. Data preprocessing is one of the most critical steps in a data mining process; it deals with the preparation and transformation of the initial dataset.
5. Data preprocessing methods are divided into the following categories:
i. Data Cleaning
ii. Data Integration
iii. Data Transformation
iv. Data Reduction
Different methods for the data cleansing phase:
i. Handling missing values: ignore the tuple, fill in the missing value manually, or fill it with a global constant, the attribute mean, or the most probable value.
ii. Binning: smooth a sorted data value by consulting the values around it (smoothing by bin means, bin medians, or bin boundaries).
iii. Regression: smooth the data by fitting it to a function, such as a line.
iv. Clustering: organize similar values into groups; values that fall outside the clusters may be treated as outliers (noise).
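The cleaning and normalization steps above can be sketched in a few lines of Python (an illustrative sketch; the function name and the choice of mean imputation followed by min-max normalization are my own, not part of the original answer):

```python
from statistics import mean

def clean_and_normalize(values):
    """Fill missing values (None) with the attribute mean,
    then min-max normalize the result into the range [0, 1]."""
    known = [v for v in values if v is not None]
    fill = mean(known)                        # mean imputation for missing values
    filled = [fill if v is None else v for v in values]
    lo, hi = min(filled), max(filled)
    return [(v - lo) / (hi - lo) for v in filled]

# A small attribute column with one missing reading:
print(clean_and_normalize([10, None, 20, 30]))  # [0.0, 0.5, 0.5, 1.0]
```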
Difference between classification and prediction. 5M
No. | Classification | Prediction
1 | Classification classifies data into predefined classes. | Prediction predicts the value of new or unseen data.
2 | The accuracy of a classifier refers to the ability of a given classifier to correctly classify new data. | The accuracy of a predictor refers to how well a given predictor can estimate the value of new or unseen data.
3 | The speed of a classifier refers to the computational cost involved in generating and using the classifier. | The speed of a predictor refers to the computational cost involved in generating and using the predictor.
4 | The robustness of a classifier is the ability to make correct classifications on noisy data or data with missing values. | The robustness of a predictor is the ability to make correct predictions on noisy data or data with missing values.
5 | The scalability of a classifier is the ability to construct the classifier to work efficiently on large amounts of data. | The scalability of a predictor is the ability to construct the predictor to work efficiently on large amounts of data.
Data mining architecture 10M
1. Data mining is a very important process where potentially useful and previously unknown
information is extracted from large volumes of data.
2. There are a number of components involved in the data mining process.
3. These components constitute the architecture of a data mining system.
4. The architecture of a typical data mining system has the following major components.
5. Database, data warehouse, World Wide Web, or other information repository: this is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
6. Data cleaning and data integration techniques may be performed on the data.
7. Database or data warehouse server:
The database or data warehouse server contains the actual data and is responsible for fetching the relevant data, based on the user’s data mining request.
8. Knowledge base:
i. This is the domain knowledge that is used to guide the search or evaluate the interestingness
of resulting patterns.
ii. Such knowledge can include concept hierarchies, used to organize attributes or attribute
values into different levels of abstraction.
iii. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness
based on its unexpectedness, may also be included.
9. Data mining engine:
i. The data mining engine is the core component of any data mining system.
ii. It consists of a number of modules for performing data mining tasks including association,
classification, characterization, clustering, prediction, time-series analysis etc.
10. Pattern evaluation module:
i. This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns.
ii. It may use interestingness thresholds to filter out discovered patterns.
iii. Alternatively, the pattern evaluation module may be integrated with the mining module,
depending on the implementation of the data mining method used.
11. User interface:
i. This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results.
ii. In addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the patterns in different
forms.
12. Diagram: (architecture of a typical data mining system, showing the information repository, the database or data warehouse server, the knowledge base, the data mining engine, the pattern evaluation module and the user interface)
13. Each and every component of a data mining system has its own role and importance in completing data mining efficiently.
14. These different modules need to interact correctly with each other in order to complete the complex process of data mining successfully.
What is KDD? Explain its process.
KDD (Knowledge Discovery from Data):
1. Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
2. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.
3. The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
Selection, Pre-processing, Transformation, Data Mining and Interpretation/Evaluation.
KDD PROCESS
1. Creating a target data set: data selection.
2. Data cleaning and preprocessing (may take 60% of the effort!).
3. Data reduction and transformation: find useful features, dimensionality/variable reduction, and invariant representation.
4. Choosing the functions of data mining: summarization, classification, regression, association, clustering.
5. Choosing the mining algorithm(s).
6. Data mining: searching for patterns of interest.
7. Interpretation and evaluation of the discovered patterns.
DBSCAN ALGORITHM
Input: D : a data set containing n objects, ε : the radius parameter, and MinPts: the
neighborhood density threshold.
Output: A set of density-based clusters.
Method:
1) mark all objects as unvisited;
2) do
3) randomly select an unvisited object p;
4) mark p as visited;
5) if the ε -neighborhood of p has at least MinPts objects
6) create a new cluster C, and add p to C;
7) let N be the set of objects in the ε -neighborhood of p;
8) for each point p' in N
9) if p' is unvisited
10) mark p' as visited;
11) if the ε-neighborhood of p' has at least MinPts points, add those points to N;
12) if p' is not yet a member of any cluster, add p' to C;
13) end for
14) output C;
15) else mark p as noise;
16) until no object is unvisited;
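The pseudocode above can be turned into a runnable sketch (a minimal plain-Python implementation written for illustration; a real system would accelerate the region queries with a spatial index such as an R*-tree):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN following the pseudocode above.
    Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # region query: all points within eps of point i (including itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)          # None = unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        n = neighbors(i)
        if len(n) < min_pts:               # not a core point: mark as noise
            labels[i] = -1
            continue
        cluster += 1                       # create a new cluster C
        labels[i] = cluster
        seeds = list(n)
        while seeds:                       # expand C through density-reachable points
            j = seeds.pop()
            if labels[j] == -1:            # noise becomes a border point of C
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:         # j is also core: add its neighborhood to N
                seeds.extend(nj)
    return labels

# Two dense groups and one far-away noise point:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```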
Advantages:
1. DBSCAN does not require one to specify the number of clusters in the data a priori, as
opposed to k-means.
2. DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded
by (but not connected to) a different cluster.
3. Due to the MinPts parameter, the so-called single-link effect (different clusters being
connected by a thin line of points) is reduced.
4. DBSCAN has a notion of noise, and is robust to outliers.
5. DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points
in the database.
6. DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an
R* tree.
Disadvantages:
1. DBSCAN is not entirely deterministic: border points that are reachable from more than one
cluster can be part of either cluster, depending on the order the data is processed.
2. The quality of DBSCAN depends on the distance measure used in the function
regionQuery(P,ε). The most common distance metric used is Euclidean distance.
3. DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε
combination cannot then be chosen appropriately for all clusters.
4. If the data and scale are not well understood, choosing a meaningful distance threshold ε can
be difficult.
What is Resource Allocation? Explain Resource leveling & Resource smoothing.
Resource allocation is the scheduling of activities and the resources required by those activities
while taking into consideration both the resource availability and project time.
The resource allocation procedure mainly consists of two activities: Resource Smoothing and
Resource Leveling.
Resource Smoothing:-
1. If duration of completion of the project is the constraint, then resource smoothing should be
applied without changing the total project duration.
2. The periods of minimum demand for resources are located and the activities are shifted
according to the float availability and the requirement of resources.
3. Thus the intelligent utilization of floats can smoothen the demand of resources to the
maximum possible extent.
4. This type of resource allocation is called Resource Smoothing.
Resource leveling:-
1. In the process of resource leveling, whenever the availability of a resource becomes less than its maximum requirement, the only alternative is to delay the activity having the larger float.
2. In case, two or more activities require the same amount of resources, the activity with
minimum duration is chosen for resource allocation.
3. Resource leveling is done if the restriction is on the availability of resources.
Write a short note on linear regression. 5M
Linear regression involves finding the “best” line to fit two attributes (or variables), so that one
attribute can be used to predict the other.
Linear Regression
a. Straight-line regression:
1. Straight-line regression analysis involves a response variable, y, and a single predictor
variable, x.
2. It is the simplest form of regression, and models y as a linear function of x.
3. That is, y = b+wx; where the variance of y is assumed to be constant, b and w are regression
coefficients specifying the Y-intercept and slope of the line, respectively.
4. These coefficients can be solved by the method of least squares, which estimates the best-
fitting straight line as the one that minimizes the error between the actual data and the
estimate of the line.
5. The regression coefficients can be estimated using this method with the following equations:
w = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²  and  b = ȳ − w·x̄,
where x̄ and ȳ are the mean values of x and y in the training data.
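As a quick illustration of the least-squares method (a sketch, not part of the original answer; the function name is my own):

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope w and intercept b for y = b + w*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # slope: covariance of x and y divided by variance of x
    w = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    b = y_bar - w * x_bar                  # intercept from the means
    return b, w

# Points lying exactly on y = 1 + 2x:
b, w = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b, w)  # 1.0 2.0
```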
b. Multiple linear regression:
1. Multiple linear regression is an extension of straight-line regression so as to involve more than one predictor variable.
2. It allows the response variable y to be modeled as a linear function of n predictor variables or attributes.
3. The equations (obtained from the method of least squares), become long and are tedious to
solve by hand.
4. Multiple regression problems are instead commonly solved with the use of statistical software packages, such as SAS, SPSS, and S-Plus.
5. Speed: This refers to the time to construct the model and the time to use the model.
6. Robustness: This is the ability of the classifier to make correct predictions given noisy data or
data with missing values
7. Scalability: This refers to the ability to construct the classifier efficiently given large amounts
of data.
8. Interpretability: This refers to the level of understanding and insight that is provided by the
classifier
9. Goodness of rules: This refers to decision tree size and the compactness of classification rules.
What is noisy data? How to handle it?
Noisy data:
1. Noisy data is meaningless data.
2. It includes any data that cannot be understood and interpreted correctly by machines, such
as unstructured text.
3. Noisy data unnecessarily increases the amount of storage space required and can also
adversely affect the results of any data mining analysis.
4. Noisy data can be caused by faulty data collection instruments, human or computer errors at data entry, data transmission errors, limited buffer sizes for coordinating synchronized data transfer, inconsistencies in naming conventions or data codes, and inconsistent formats for input fields (e.g., date).
Noisy data can be handled by following the given procedures:
A. Binning:
1. Binning methods smooth a sorted data value by consulting the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
2. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Similarly, in smoothing by bin medians, each bin value is replaced by the bin median.
3. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
4. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal-width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique.
B. Regression:
1. Here data can be smoothed by fitting the data to a function.
2. Linear regression involves finding the “best” line to fit two attributes, so that one attribute
can be used to predict the other.
3. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
C. Clustering:
1. Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”
2. Values that fall outside of the set of clusters may be considered outliers.
Describe one hierarchical clustering algorithm using an example dendrogram. 5M
There are three algorithms of hierarchical clustering:
Agglomerative hierarchical clustering, Divisive hierarchical clustering and BIRCH hierarchical
clustering.
Agglomerative hierarchical clustering:
1. Agglomerative hierarchical clustering is a bottom-up clustering approach.
2. In this clustering the clusters have sub-clusters.
3. The example of this type of clustering is species taxonomy.
4. Gene expression data also exhibit this hierarchical quality.
5. Agglomerative hierarchical clustering starts with every single object (gene or sample) in its own cluster.
6. Then, in each successive iteration, it merges the closest pair of clusters satisfying some similarity criterion, until all the data are in one cluster.
7. The clusters generated in early stages are nested within clusters generated in later stages.
8. Clusters of different sizes in the tree can be valuable for discovery.
9. This type of clustering produces an ordering of the objects, which may be informative for data display.
10. In this clustering algorithm, smaller clusters are generated, which may be helpful for discovery.
11. Different methods for combining clusters in agglomerative hierarchical clustering:
i. Single linkage
ii. Complete linkage
iii. Average linkage
iv. Centroid method
v. Ward’s method
12. Example dendrogram: (a tree diagram in which each horizontal join records the merge of two clusters; the leaves are the individual objects and the root is the single cluster containing all of them)
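A minimal single-linkage agglomerative run can be traced in Python (an illustrative sketch on made-up 1-D data; each printed merge corresponds to one join in a dendrogram):

```python
def agglomerative_trace(points):
    """Single-linkage agglomerative clustering on 1-D data.
    Prints each merge, i.e. one join in the dendrogram, bottom up."""
    clusters = [[p] for p in points]        # start: every object in its own cluster
    while len(clusters) > 1:
        # find the closest pair of clusters (single linkage: min pairwise distance)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        print(f"merge {clusters[i]} + {clusters[j]} at distance {d}")
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]

agglomerative_trace([1, 2, 9, 10, 25])
```

The nearby values merge first (1 with 2, 9 with 10), then those small clusters merge, and the far-away value 25 joins last, which is exactly the nesting a dendrogram depicts.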
Explain the concept of decision support system with the help of an example
application. 5M
1. A decision support system (DSS) is a computer program application which analyzes business
data and presents it to make business decisions more easily.
2. A decision support system is an "informational application" system.
3. A decision support system helps businesses and organizations in decision-making activities.
4. A decision support system presents information graphically and may include artificial intelligence.
5. Categories of decision support system:
i. Communication driven DSS: purpose is to conduct a meeting for users to collaborate.
ii. Data driven DSS: it is used to query a database to seek specific answers for specific purpose.
iii. Document driven DSS: purpose of this is to search a web page and to find documents.
iv. Knowledge driven DSS: it is used to produce management advice and to choose products or
services.
v. Model driven DSS: it is used by managers and staff members of the business to interact with
each other to analyze decisions.
6. Decision support systems are combinations of integrated resources working together.
7. For example: a national on-line book seller wants to begin selling its products internationally, but first needs to determine whether that will be a wise business decision.
8. In such a case, the vendor can use a DSS to collect information from its own resources, using a tool such as OLAP, to determine whether the company has the ability to expand its business.
9. It can also collect information from external resources, such as industry data, to determine if there is indeed a demand to meet.
10. The decision support system will collect and analyze the data and then present it in a way that can be interpreted by humans.
11. A few decision support systems come very close to acting as artificial intelligence agents.
What is clustering? Explain k-means clustering algorithm. Suppose the data for
clustering is {2, 4, 10, 12, 3, 20, 11, 25} Consider k=2, cluster the given data using k-
means algorithm. 10M
Clustering:
1. The process of partitioning a set of data into a set of meaningful sub-classes or clusters is called clustering.
2. Clustering is a technique used to place data elements into related groups.
3. Example of graphical representation of the clustering:
4. A cluster is a collection of objects which are similar and are dissimilar to the objects of the
other clusters.
5. In the above example there are four clusters.
K-means clustering algorithm:
1. K-means clustering is an algorithm that groups objects into K clusters based on their features.
2. K is a positive integer number decided by the user.
3. The centroids of the clusters are generally far away from each other.
4. Each element is assigned to the cluster whose centroid is nearest; the centroids are then recomputed from the new groups.
5. In every step the centroids change and elements may move from one cluster to another.
6. The same process is followed until no element moves from one cluster to another.
Data: {2, 4, 10, 12, 3, 20, 11, 25}
K=2
Select any two initial means, M1 and M2:
M1 = 4
M2 = 12
Partition the given data set by assigning each value to the nearest mean:
K1 = 2, 3, 4 (mean = 3)
K2 = 10, 11, 12, 20, 25 (mean = 15.6)
Reassign the values of the data set as per the new mean values:
K1 = 2, 3, 4 (mean = 3)
K2 = 10, 11, 12, 20, 25 (mean = 15.6)
Again we have same mean values.
Therefore, stop solving algorithm.
(Note: if you start from different initial means M1 and M2, the intermediate steps will differ and the algorithm may stop after a different number of iterations.)
For the given set of data points, 10M
(a) Find Mean, Median and Mode
(b) Show a boxplot of the data. Clearly indicating the five –numbersummary.
11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
Given data:
11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
(a) Mean, Median and Mode:
Mean:
Mean = the sum of all the values divided by the number of values
Mean = (11+13+13+15+15+16+19+20+20+20+21+21+22+23+24+30+40+45+45+45+71+72+73+75)/24
= 769/24 = 32.04
Median:
Median = the middle value of the sorted data; with 24 values it is the average of the 12th and 13th values
Median = (21 + 22)/2 = 21.5
Mode:
Mode = the value(s) that occur most frequently
Mode = 20 and 45 (each occurs three times, so the data set is bimodal)
(b) Boxplot of the data with five-number summary:
Using the median-of-halves convention for the quartiles, the five-number summary is: Minimum = 11, Q1 = 17.5, Median = 21.5, Q3 = 45, Maximum = 75.
The box of the boxplot extends from Q1 (17.5) to Q3 (45), with a line at the median (21.5); the whiskers extend to the minimum (11) and the maximum (75).
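The five-number summary can be verified with a short Python sketch (illustrative; it uses the median-of-halves convention for the quartiles, so interpolating quartile conventions may give slightly different Q1/Q3):

```python
def five_number_summary(data):
    """Min, Q1, median, Q3, max using the median-of-halves convention
    (for even n, Q1/Q3 are the medians of the lower/upper halves)."""
    s = sorted(data)
    n = len(s)

    def median(vals):
        m = len(vals)
        mid = m // 2
        return vals[mid] if m % 2 else (vals[mid - 1] + vals[mid]) / 2

    return min(s), median(s[:n // 2]), median(s), median(s[(n + 1) // 2:]), max(s)

data = [11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
        22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75]
print(five_number_summary(data))  # (11, 17.5, 21.5, 45.0, 75)
```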
What is an outlier? Describe methods that can be used for outlier analysis. 10M
1. An outlier is an observation point that is distant from the other observations.
2. An outlier may indicate a measurement error or an experimental error.
3. Outliers can carry novel, new, abnormal, unusual or noisy information about the data.
4. Outliers can be classified into three categories:
i. Point outliers
ii. Contextual outliers
iii. Collective outliers
5. Point outliers:
i. This is the simplest type of outlier, and the majority of research on outlier detection focuses on it.
ii. If an individual data point can be considered anomalous with respect to the rest of the data, it is called a point outlier.
6. Contextual outliers:
i. If an individual data point is anomalous in a specific context, it is called a contextual outlier or conditional outlier.
ii. Each data point in contextual outlier detection is defined by two sets of attributes: contextual attributes and behavioral attributes.
7. Collective outliers:
i. If a collection of data points is anomalous with respect to the entire data set, it is called a collective outlier.
ii. Collective outliers can occur only in data sets in which data points are somehow related.
8. The benefit of detecting outliers is that they can be removed or considered separately in regression modeling to improve accuracy.
9. Outlier detection is one of the basic problems of data mining.
10. Outliers may be erroneous or real.
Methods used for outlier analysis:
1. Statistical approach
2. Distance-based approach
3. Density-based local outlier approach
4. Deviation-based approach
1. Statistical approach:
i. This method assumes a distribution for the given data set and then identifies outliers with respect to the model using a discordance test.
ii. A statistical discordance test examines two hypotheses: a working hypothesis and an alternative hypothesis.
iii. The working hypothesis states that the entire data set of n objects comes from an initial distribution model.
iv. The alternative hypothesis states that the entire data set of n objects comes from another distribution model.
2. Distance-based approach:
i. This method generalizes the ideas behind discordance testing for various standard distributions.
ii. An object is a distance-based outlier if too few of its neighbors, defined based on their distance from the given object, lie close to it.
iii. Many efficient algorithms for mining distance-based outliers have been developed: the index-based algorithm, the nested-loop algorithm and the cell-based algorithm.
3. Density-based local outlier approach:
i. This approach depends on the overall distribution of the given set of data points.
ii. This brings us to the notion of local outliers: an object may be an outlier relative to its local neighborhood even if it is not a global outlier.
iii. This approach can detect both global and local outliers.
4. Deviation-based approach:
i. This method identifies outliers by examining the main characteristics of objects in a group.
ii. The term deviation is typically used to refer to outliers in this approach.
iii. There are two techniques for deviation-based outlier detection.
iv. The first technique compares objects sequentially in a set and the second employs an OLAP
data cube approach.
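A minimal distance-based outlier check can be sketched as follows (illustrative code; the DB(p, d_min) style definition used here, where a point is flagged if at least a fraction p of the other points lie farther than d_min away, is one common formalization of the distance-based approach):

```python
import math

def distance_based_outliers(points, d_min, p):
    """Flag x as an outlier if at least fraction p of the other
    objects lie farther than d_min from x."""
    outliers = []
    for i, x in enumerate(points):
        # count how many other points are farther than d_min from x
        far = sum(1 for j, y in enumerate(points)
                  if j != i and math.dist(x, y) > d_min)
        if far >= p * (len(points) - 1):
            outliers.append(x)
    return outliers

# Four points in a tight group, one far away:
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (20, 20)]
print(distance_based_outliers(pts, d_min=5, p=0.9))  # [(20, 20)]
```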
Design a BI system for fraud detection. Describe all the steps from Data collection to
Decision making clearly. 10M
Fraud detection in Telecommunication Industry:
1. Fraud is an adaptive crime; it needs special methods of intelligent data analysis to detect and prevent it.
2. The telecommunications industry has expanded with the development of affordable mobile
phone technology.
3. For example, forensic analytics are used to review an employee’s purchasing card activity to
assess whether any of the purchases were diverted for personal use.
4. The main steps in forensic analytics are data collection, data preparation, data analysis and
reporting.
5. Fraud detection methods exist in the areas of Knowledge Discovery in Databases (KDD), Data Mining, Machine Learning and Statistics.
6. They offer applicable and successful solutions in different areas of fraud crimes.
7. Fraud detection techniques are categorized into two primary classes:
i. Statistical data analysis techniques
ii. Artificial intelligence techniques
8. Statistical data analysis techniques:
i. Statistical data analysis techniques include data preprocessing techniques for detection, data validation, error correction and filling in of missing or incorrect data.
ii. They also include the calculation of various statistical parameters such as averages and performance metrics.
9. Artificial intelligence techniques:
i. Artificial intelligence techniques use data mining to classify, cluster and segment the data and automatically find association rules in the data related to fraud.
ii. They also include pattern recognition to detect approximate classes, clusters, or patterns of suspicious behavior, either automatically or by matching given inputs.
Steps from data collection to decision making:
1. Data collection:
i. Before you collect new data, determine what information could be collected from existing
databases or sources on hand.
ii. Determine a file storing and naming system ahead of time to help all tasked team members
collaborate.
iii. If you need to gather data via observation or interviews, then develop an interview template
ahead of time to ensure consistency and save time.
iv. Keep your collected data organized in a log with collection dates and add any source notes as
you go.
2. Analyze data:
i. After you’ve collected the right data it’s time for deeper data analysis.
ii. Begin by manipulating your data in a number of different ways, such as plotting it out and
finding correlations or by creating a pivot table in Excel.
iii. A pivot table lets you sort and filter data by different variables and lets you calculate the
mean, maximum, minimum and standard deviation of your data.
3. Interpret results:
i. After analyzing data and possibly conducting further research it’s time to interpret your
results.
ii. As you interpret your analysis, you cannot ever prove a hypothesis true; you can only fail to
reject the hypothesis.
iii. By following these steps in your data analysis process, you make better decisions for your
business or government agency.
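The statistical screening step described above can be sketched in Python (illustrative only; the function name and the 2-standard-deviations rule are my own choices for the sketch, not a prescribed fraud detection method):

```python
from statistics import mean, pstdev

def flag_suspicious(amounts, threshold=2.0):
    """Flag amounts lying more than `threshold` standard deviations
    from the mean of the history, a simple fraud-screening step."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    return [a for a in amounts if abs(a - mu) > threshold * sigma]

# A customer's card activity with one unusually large purchase:
history = [20, 25, 22, 18, 24, 21, 23, 500]
print(flag_suspicious(history))  # [500]
```

In a full BI pipeline this kind of per-customer screen would feed the interpretation step, where flagged transactions are reviewed by an analyst rather than rejected automatically.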
Partition the given data into 4 bins using the equi-depth binning method and perform smoothing according to the following methods. 10M
Smoothing by bin mean
Smoothing by bin median
Smoothing by bin boundaries
Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
Given data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73,
75
Distribute the data into 4 bins using equi-depth binning.
Total values (T) =24
Number of values in each bin=24/4=6
Thus we get,
B1=11, 13, 13, 15, 15, 16
B2=19, 20, 20, 20, 21, 21
B3=22, 23, 24, 30, 40, 45
B4=45, 45, 71, 72, 73, 75
i. Smoothing by bin mean:
Replace each value of the bin with its mean value.
Mean for B1=(11+13+13+15+15+16)/6=13.83
Mean for B2=(19+20+20+20+21+21)/6=20.17
Mean for B3=(22+23+24+30+40+45)/6=30.67
Mean for B4=(45+45+71+72+73+75)/6=63.5
Thus we get,
B1=13.83, 13.83, 13.83, 13.83, 13.83, 13.83
B2=20.17, 20.17, 20.17, 20.17, 20.17, 20.17
B3=30.67, 30.67, 30.67, 30.67, 30.67, 30.67
B4=63.5, 63.5, 63.5, 63.5, 63.5, 63.5
ii. Smoothing by bin median:
Replace each value of bin with its median value.
Median for B1=(13+15)/2=14
Median for B2=(20+20)/2=20
Median for B3=(24+30)/2=27
Median for B4=(71+72)/2=71.5
Thus we get,
B1=14, 14, 14, 14, 14, 14
B2=20, 20, 20, 20, 20, 20
B3=27, 27, 27, 27, 27, 27
B4=71.5, 71.5, 71.5, 71.5, 71.5, 71.5
iii. Smoothing by bin boundaries:
Replace each value of the bin with its closest boundary value (a value equidistant from both boundaries, such as 20 in B2, is assigned to the upper boundary here).
Thus we get,
B1=11, 11, 11, 16, 16, 16
B2=19, 21, 21, 21, 21, 21
B3=22, 22, 22, 22, 45, 45
B4=45, 45, 75, 75, 75, 75
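The three smoothing methods can be reproduced with a short Python sketch (illustrative; ties between boundaries are broken toward the upper boundary, matching the worked answer, and the median formula assumes even-sized bins as in this problem):

```python
def equi_depth_bins(data, n_bins):
    """Split sorted data into n_bins bins holding an equal number of values."""
    s = sorted(data)
    depth = len(s) // n_bins
    return [s[i:i + depth] for i in range(0, len(s), depth)]

def smooth(bins, how):
    out = []
    for b in bins:
        if how == "mean":
            out.append([round(sum(b) / len(b), 2)] * len(b))
        elif how == "median":
            m = (b[len(b) // 2 - 1] + b[len(b) // 2]) / 2  # even-sized bins
            out.append([m] * len(b))
        elif how == "boundaries":
            lo, hi = b[0], b[-1]
            # ties between the two boundaries go to the upper boundary here
            out.append([lo if v - lo < hi - v else hi for v in b])
    return out

data = [11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
        22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75]
bins = equi_depth_bins(data, 4)
print(smooth(bins, "mean"))
print(smooth(bins, "median"))
print(smooth(bins, "boundaries"))
```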
  • 2. The financial data is generally reliable and it has high quality which facilitates systematic data analysis and data mining. Some of the cases are as follows − i. Design and construction of data warehouses for multidimensional data analysis and data mining. ii. Loan payment prediction and customer credit policy analysis. iii. Classification and clustering of customers for targeted marketing. iv. Detection of money laundering and other financial crimes. 2. Retail Industry: Data Mining has its best application in Retail Industry because it collects large amount of data from on sales, customer, consumption and services. Data mining in retail industry helps in identifying customer buying patterns and trends that lead to improved quality of customer service and satisfaction. Here is the list of examples of data mining in the retail industry − i. Design and Construction of data warehouses based on the benefits of data mining. ii. Multidimensional analysis of sales, customers, products, time and region. iii. Analysis of effectiveness of sales campaigns. iv. Customer Retention. v. Product recommendation and cross-referencing of items. 3. Telecommunication Industry: The telecommunication industry is one of the most industries providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. Data mining helps telecommunication industry to identify the telecommunication patterns, activities, make better use of resource, and improve quality of service. Here is the list of examples for which data mining improves telecommunication services −
  • 3. i. Multidimensional Analysis of Telecommunication data. ii. Fraudulent pattern analysis. iii. Identification of unusual patterns. iv. Multidimensional association and sequential patterns analysis. v. Mobile Telecommunication services. vi. Use of visualization tools in telecommunication data analysis. 4. Biological Data Analysis: Biological data analysis is a very important part of Bioinformatics. Aspects of biological data analysis − i. Semantic integration of heterogeneous, distributed genomic and proteomic databases. ii. Alignment, indexing, similarity search and comparative analysis multiple nucleotide sequences. iii. Discovery of structural patterns and analysis of genetic networks and protein pathways. iv. Association and path analysis. v. Visualization tools in genetic data analysis. 5. Other Scientific Applications: Huge amount of data is collected from scientific domains such as astronomy. A large amount of data sets is being generated because of the fast numerical simulations in various fields. Applications of data mining in the field of Scientific Applications − i. Data Warehouses and data preprocessing. ii. Graph-based mining. Iii. Visualization and domain specific knowledge. 6. Intrusion Detection:
  • 4. Intrusion refers to any kind of action that causes error in integrity, confidentiality, or the availability of network resources. With increased use of internet and availability of the tools and tricks for intrusion detection is critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection − i. Development of data mining algorithm for intrusion detection. ii. Association and correlation analysis, aggregation to help select and build discriminating attributes. iii. Analysis of Stream data. iv. Distributed data mining. v. Visualization and query tools. What is data preprocessing? Explain the different methods for the data cleansing phase. 5M Data preprocessing: 1. Data preprocessing is important step in data mining process. 2. Data preprocessing includes cleaning, normalization, transformation, feature extraction and selection. 3. The product of data preprocessing is the final training set. 4. Data preprocessing is one of the most critical step in a data mining process which deals with the preparation and transformation of the initial dataset. 5. Data preprocessing methods divided into following categories: i. Data Cleaning ii. Data Integration iii. Data Transformation iv. Data Reduction Different methods for the data cleansing phase:
  • 5. Difference between classification and prediction. 5M No. Classification Prediction 1 Classification classifies data into classes. Prediction predicts the value of unseen data. 2 The accuracy of a classifier refers to the ability of given classifier to correctly classify new data. The accuracy of the predicter refers to how will a given predicter can give the value of new or unseen data. 3 The speech of classifier refers to computational cost involving in generating and using the classifier. The speed of predicter refers to computational cost involving in generating and using the predicter. 4 The robustness of classifier is the ability to make correct classification on noisy data or data with missing value. The robustness of predicter is the ability to make correct prediction on noisy data or data with missing value. 5 The scalability of classification is the ability to construct the classifier to efficiently work on large amount of data. The scalability of predicter is the ability to construct the predicter to efficiently work on amount of data. Data mining architecture 10M 1. Data mining is a very important process where potentially useful and previously unknown information is extracted from large volumes of data. 2. There are a number of components involved in the data mining process. 3. These components constitute the architecture of a data mining system. 4. The architecture of a typical data mining system may have the following major components Database, data warehouse, World Wide Web, or other information repository. 5. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. 6. Data cleaning and data integration techniques may be performed on the data. 7. Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request. 8. Knowledge base:
  • 6. i. This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. ii. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. iii. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. 9. Data mining engine: i. The data mining engine is the core component of any data mining system. ii. It consists of a number of modules for performing data mining tasks including association, classification, characterization, clustering, prediction, time-series analysis etc. 10. Pattern evaluation module: i. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. ii. It may use interestingness thresholds to filter out discovered patterns. iii. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. 11. User interface: i. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. ii. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms. 12. Database or Data Warehouse Server: i. The database or data warehouse server contains the actual data that is ready to be processed. ii. Hence, the server is responsible for retrieving the relevant data based on the data mining request of the user.
13. Diagram: (architecture figure omitted — it shows the components above in layers, from the information repository and database/warehouse server at the bottom, through the data mining engine and pattern evaluation module, up to the user interface, with the knowledge base alongside.)
14. Each and every component of a data mining system has its own role and importance in completing data mining efficiently.
15. These different modules need to interact correctly with each other in order to complete the complex process of data mining successfully.

What is KDD? Explain its process. (KDD: Knowledge Discovery in Databases)

1. KDD is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
2. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.
3. The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages: Selection, Pre-processing, Transformation, Data Mining and Interpretation/Evaluation.

KDD PROCESS
1. Creating a target data set: data selection.
2. Data cleaning and preprocessing (may take 60% of the effort!).
3. Data reduction and transformation: find useful features, dimensionality/variable reduction, and invariant representation.
4. Choosing the functions of data mining: summarization, classification, regression, association, clustering.
5. Choosing the mining algorithm(s).

DBSCAN ALGORITHM (density-based clustering)

Input: D: a data set containing n objects; ε: the radius parameter; MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
1) mark all objects as unvisited;
2) do
3)   randomly select an unvisited object p;
4)   mark p as visited;
5)   if the ε-neighborhood of p has at least MinPts objects
6)     create a new cluster C, and add p to C;
7)     let N be the set of objects in the ε-neighborhood of p;
8)     for each point p' in N
9)       if p' is unvisited
10)        mark p' as visited;
11)        if the ε-neighborhood of p' has at least MinPts points, add those points to N;
12)      if p' is not yet a member of any cluster, add p' to C;
13)    end for
14)    output C;
15)  else mark p as noise;
16) until no object is unvisited;

Advantages:
1. DBSCAN does not require the number of clusters in the data to be specified a priori, as opposed to k-means.
2. DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster.
3. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
4. DBSCAN has a notion of noise, and is robust to outliers.
5. DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database.
6. DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R*-tree.

Disadvantages:
1. DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order in which the data are processed.
2. The quality of DBSCAN depends on the distance measure used in the function regionQuery(P, ε). The most common distance metric used is Euclidean distance.
3. DBSCAN cannot cluster data sets well when they contain large differences in densities, since the MinPts-ε combination cannot then be chosen appropriately for all clusters.
4. If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.

What is Resource Allocation? Explain Resource leveling & Resource smoothing.

Resource allocation is the scheduling of activities and the resources required by those activities while taking into consideration both the resource availability and the project time.
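The DBSCAN pseudocode above can be turned into a runnable sketch. This is a minimal, illustrative Python version — plain Euclidean distance is assumed, and the names `dbscan` and `region_query` are my own, not from any particular library:

```python
import math

def dbscan(D, eps, min_pts):
    """Density-based clustering following the DBSCAN pseudocode.

    D: list of point tuples; eps: radius parameter; min_pts: density threshold.
    Returns (clusters, noise): a list of clusters (lists of points) and the noise points.
    """
    def region_query(p):
        # The eps-neighborhood of p (p itself included).
        return [q for q in D if math.dist(p, q) <= eps]

    visited, clustered = set(), set()
    clusters, noise = [], []
    for p in D:                                  # "select an unvisited object p"
        if p in visited:
            continue
        visited.add(p)
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            noise.append(p)                      # mark p as noise (a border point may later join a cluster)
            continue
        cluster = [p]                            # create a new cluster C, add p to C
        clustered.add(p)
        i = 0
        while i < len(neighbors):                # "for each point p' in N" — N may grow
            q = neighbors[i]
            i += 1
            if q not in visited:
                visited.add(q)
                if len(region_query(q)) >= min_pts:
                    neighbors.extend(region_query(q))   # q is a core point: extend N
            if q not in clustered:               # "if p' is not yet a member of any cluster, add p' to C"
                cluster.append(q)
                clustered.add(q)
                if q in noise:                   # reclassify a former noise point as a border point
                    noise.remove(q)
        clusters.append(cluster)
    return clusters, noise
```

As the disadvantages above note, the cluster a border point ends up in can depend on processing order; this sketch simply keeps a border point in the first cluster that reaches it.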
The resource allocation procedure mainly consists of two activities: resource smoothing and resource leveling.

Resource Smoothing:
1. If the duration of completion of the project is the constraint, then resource smoothing should be applied without changing the total project duration.
2. The periods of minimum demand for resources are located, and the activities are shifted according to the float availability and the requirement of resources.
3. Thus the intelligent utilization of floats can smooth the demand for resources to the maximum possible extent.
4. This type of resource allocation is called resource smoothing.

Resource Leveling:
1. In the process of resource leveling, whenever the availability of a resource becomes less than its maximum requirement, the only alternative is to delay the activity having the larger float.
2. In case two or more activities require the same amount of resources, the activity with the minimum duration is chosen for resource allocation.
3. Resource leveling is done if the restriction is on the availability of resources.

Write a short note on linear regression. 5M

Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other.

a. Straight-line regression:
1. Straight-line regression analysis involves a response variable, y, and a single predictor variable, x.
2. It is the simplest form of regression, and models y as a linear function of x.
3. That is, y = b + wx, where the variance of y is assumed to be constant, and b and w are regression coefficients specifying the Y-intercept and slope of the line, respectively.
4. These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
5. The regression coefficients can be estimated using this method with the following equations:
   w = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
   b = ȳ − w·x̄
   where x̄ and ȳ are the means of the x and y values, respectively.

b. Multiple linear regression:
1. Multiple linear regression is an extension of straight-line regression that involves more than one predictor variable.
2. It allows the response variable y to be modeled as a linear function of n predictor variables or attributes.
3. The equations (obtained from the method of least squares) become long and are tedious to solve by hand.
4. Multiple regression problems are instead commonly solved with the use of statistical software packages, such as SAS, SPSS, and S-Plus.

Criteria for evaluating classification and prediction methods:
5. Speed and scalability: time to construct the model and time to use the model.
6. Robustness: the ability of the model to make correct predictions given noisy data or data with missing values.
7. Scalability: the ability to construct the model efficiently given large amounts of data.
8. Interpretability: the level of understanding and insight that is provided by the model.
9. Goodness of rules: decision tree size and compactness of classification rules.

What is noisy data? How to handle it?

Noisy data:
1. Noisy data is meaningless data.
2. It includes any data that cannot be understood and interpreted correctly by machines, such as unstructured text.
3. Noisy data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data mining analysis.
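The method of least squares for straight-line regression (y = b + wx) described above can be computed in a few lines. A minimal sketch — the function name `linear_regression` is illustrative, and the docstring restates the standard closed-form estimates:

```python
def linear_regression(xs, ys):
    """Least-squares estimates for the straight line y = b + w*x.

    w = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean)^2)
    b = y_mean - w * x_mean
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x (unnormalized).
    w = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    # Intercept: the fitted line passes through the point (x_mean, y_mean).
    b = y_mean - w * x_mean
    return b, w
```

For example, the points (1, 3), (2, 5), (3, 7), (4, 9) lie exactly on y = 1 + 2x, so the sketch recovers b = 1 and w = 2.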
4. Noisy data can be caused by faulty data collection instruments, human or computer errors occurring at data entry, data transmission errors, limited buffer sizes for coordinating synchronized data transfer, inconsistencies in naming conventions or data codes used, and inconsistent formats for input fields (e.g., date).

Noisy data can be handled by the following procedures:

A. Binning:
1. Binning methods smooth a sorted data value by consulting the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the values around a value, they perform local smoothing.
2. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
3. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
4. In general, the larger the bin width, the greater the effect of the smoothing. Alternatively, bins may be equal-width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique.

B. Regression:
1. Here data can be smoothed by fitting the data to a function.
2. Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other.
3. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.

C. Clustering:
1. Outliers may be detected by clustering, where similar values are organized into groups, or "clusters."
2. Values that fall outside of the set of clusters may be considered outliers.

Describe one hierarchical clustering algorithm using an example dendrogram. 5M

There are three algorithms of hierarchical clustering:
Agglomerative hierarchical clustering, divisive hierarchical clustering and BIRCH hierarchical clustering.

Agglomerative hierarchical clustering:
1. Agglomerative hierarchical clustering is a bottom-up clustering approach.
2. In this clustering the clusters have sub-clusters.
3. An example of this type of clustering is species taxonomy.
4. Gene expression data also exhibit this hierarchical quality.
5. Agglomerative hierarchical clustering starts with every single object (gene or sample) in a single cluster.
6. Then, in each successive iteration, it merges the closest pair of clusters satisfying some similarity criteria, until all the data is in one cluster.
7. The clusters generated in early stages are nested in clusters generated in later stages.
8. The clusters of different sizes in the tree can be valuable for discovery.
9. This type of clustering produces an ordering of the objects, which may be informative for data display.
10. In this clustering algorithm, smaller clusters are generated, which may be helpful for discovery.
11. Different methods for combining clusters in agglomerative hierarchical clustering:
i. Single linkage
ii. Complete linkage
iii. Average linkage
iv. Centroid method
v. Ward's method
12. Example dendrogram: (figure omitted — a dendrogram is a tree diagram in which each leaf is a single object, each internal node records the merging of the two closest clusters, and the root is the single cluster containing all objects.)
Explain the concept of decision support system with the help of an example application. 5M

1. A decision support system (DSS) is a computer program application which analyzes business data and presents it so that business decisions can be made more easily.
2. A decision support system is an "informational application" system.
3. A decision support system helps businesses and organizations in decision-making activities.
4. A decision support system presents information graphically and may include artificial intelligence techniques.
5. Categories of decision support system:
i. Communication-driven DSS: its purpose is to conduct a meeting for users to collaborate.
ii. Data-driven DSS: it is used to query a database to seek specific answers for a specific purpose.
iii. Document-driven DSS: its purpose is to search web pages and find documents.
iv. Knowledge-driven DSS: it is used to produce management advice and to choose products or services.
v. Model-driven DSS: it is used by managers and staff members of the business to interact with each other to analyze decisions.
6. Decision support systems are a combination of integrated resources working together.
7. For example: a national on-line bookseller wants to begin selling its products internationally, but first needs to determine whether that will be a wise business decision.
8. In such a case, the vendor can use a DSS to collect information from its own resources, using a tool such as OLAP, to determine whether the company has the ability to expand its business.
9. It can also collect information from external resources, such as industry data, to determine whether there is indeed a demand to meet.
10. The decision support system will collect and analyze the data and then present it in a way that can be interpreted by humans.
11. A few decision support systems come very close to acting as artificial intelligence agents.

What is clustering? Explain the k-means clustering algorithm. Suppose the data for clustering is {2, 4, 10, 12, 3, 20, 11, 25}. Consider k=2; cluster the given data using the k-means algorithm. 10M

Clustering:
1. The process of partitioning a set of data into a set of meaningful sub-classes or clusters is called clustering.
2. Clustering is a technique used to place data elements into related groups.
3. Example of a graphical representation of clustering: (figure omitted)
4. A cluster is a collection of objects which are similar to one another and dissimilar to the objects of the other clusters.
5. In the example figure there are four clusters.

K-means clustering algorithm:
1. K-means clustering is an algorithm that groups different objects, based on their features, into K groups.
2. K is a positive integer number and can be decided by the user.
3. The initial centroids of the clusters are generally chosen far away from each other.
4. Assign each element to the cluster whose centroid is nearest, then recompute the centroids and reassign elements with respect to the new centroids.
5. In every step the centroids change and elements may move from one cluster to another.
6. Repeat the process until no element moves from one cluster to another.

Data: {2, 4, 10, 12, 3, 20, 11, 25}, K=2

Select any two initial means M1 and M2: M1 = 4, M2 = 12.

Assign each value of the data set to the nearer mean:
K1 = {2, 3, 4}, new mean = 3
K2 = {10, 11, 12, 20, 25}, new mean = 15.6

Reassign the values of the data set using the new means (3 and 15.6):
K1 = {2, 3, 4}, mean = 3
K2 = {10, 11, 12, 20, 25}, mean = 15.6

The clusters and their means are unchanged, so the algorithm stops.

(Note: if you use different initial means for M1 and M2, the intermediate steps will differ and the algorithm may stop after a different number of iterations.)

For the given set of data points, 10M
(a) Find Mean, Median and Mode
(b) Show a boxplot of the data, clearly indicating the five-number summary.

Given data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75

(a) Mean, Median and Mode:

Mean:
Mean = the sum of all values divided by the number of values
Mean = (11+13+13+15+15+16+19+20+20+20+21+21+22+23+24+30+40+45+45+45+71+72+73+75)/24 = 769/24 = 32.04

Median:
With 24 sorted values (an even count), the median is the average of the 12th and 13th values.
Median = (21 + 22)/2 = 21.5

Mode:
Mode = the value(s) repeated more times than any other.
Both 20 and 45 occur three times, so the data is bimodal: Mode = 20 and 45.

(b) Boxplot of the data with five-number summary: (boxplot figure omitted)
Five-number summary (quartiles taken as the medians of the lower and upper halves): Minimum = 11, Q1 = 17.5, Median = 21.5, Q3 = 45, Maximum = 75.

What is an outlier? Describe methods that can be used for outlier analysis. 10M

1. An outlier is an observation point that is distant from other observations.
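The summary statistics worked out above can be checked with Python's standard `statistics` module. A small verification sketch — the quartiles here use the median-of-halves convention, which can differ from other quartile definitions:

```python
import statistics

data = [11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
        22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75]

mean = statistics.mean(data)           # 769 / 24 ≈ 32.04
median = statistics.median(data)       # average of the 12th and 13th values = 21.5
modes = statistics.multimode(data)     # all most-frequent values: [20, 45]

# Five-number summary, with quartiles as medians of the lower/upper halves.
q1 = statistics.median(data[:12])      # (16 + 19) / 2 = 17.5
q3 = statistics.median(data[12:])      # (45 + 45) / 2 = 45.0
five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)             # (11, 17.5, 21.5, 45.0, 75)
```

A boxplot would draw the box from Q1 to Q3 with a line at the median and whiskers toward the minimum and maximum.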
2. An outlier may indicate measurement error or experimental error.
3. Outliers can be novel, new, abnormal, unusual or noisy information about the data.
4. Outliers can be classified into three categories:
i. Point outliers
ii. Contextual outliers
iii. Collective outliers
5. Point outliers:
i. This is the simplest type of outlier, and it is the focus of the majority of research on outlier detection.
ii. If an individual data point can be considered anomalous with respect to the rest of the data, it is called a point outlier.
6. Contextual outliers:
i. If an individual data point is anomalous in a specific context, it is called a contextual outlier or conditional outlier.
ii. Each data point is defined with two sets of attributes: contextual attributes and behavioral attributes.
7. Collective outliers:
i. If a collection of data points is anomalous with respect to the entire data set, it is called a collective outlier.
ii. Collective outliers can occur only in data sets in which data points are somehow related.
8. The benefit of identifying an outlier is that it can be removed or considered separately in regression modeling to improve accuracy.
9. Outlier detection is one of the basic problems of data mining.
10. Outliers may be erroneous or real.

Methods used for outlier analysis:
1. Statistical approach
2. Distance-based approach
3. Density-based local outlier approach
4. Deviation-based approach

1. Statistical approach:
i. This method assumes a distribution for the given data set and then identifies outliers with respect to the model using a discordance test.
ii. A statistical discordance test examines two hypotheses: a working hypothesis and an alternative hypothesis.
iii. The working hypothesis states that the entire data set of n objects comes from an initial distribution model.
iv. The alternative hypothesis states that the entire data set of n objects comes from another distribution model.

2. Distance-based approach:
i. This method generalizes the ideas behind discordance testing for various standard distributions.
ii. An object's neighbors are defined based on their distance from the given object.
iii. Many efficient algorithms for mining distance-based outliers have been developed: the index-based algorithm, the nested-loop algorithm and the cell-based algorithm.

3. Density-based local outlier approach:
i. This approach depends on the overall distribution of the given set of data points.
ii. This brings us to the notion of local outliers: an object is a local outlier if its local density is significantly lower than that of its neighbors.
iii. This approach can detect both global and local outliers.

4. Deviation-based approach:
i. This method identifies outliers by examining the main characteristics of objects in a group.
ii. The term deviation is typically used to refer to outliers in this approach.
iii. There are two techniques for deviation-based outlier detection.
iv. The first technique compares objects sequentially in a set, and the second employs an OLAP data cube approach.

Design a BI system for fraud detection. Describe all the steps from Data collection to Decision making clearly. 10M

Fraud detection in the Telecommunication Industry:
1. Fraud is an adaptive crime; it needs special methods of intelligent data analysis to detect and prevent it.
2. The telecommunications industry has expanded with the development of affordable mobile phone technology.
3. For example, forensic analytics may be used to review an employee's purchasing card activity to assess whether any of the purchases were diverted for personal use.
4. The main steps in forensic analytics are data collection, data preparation, data analysis and reporting.
5. Fraud detection methods exist in the areas of Knowledge Discovery in Databases (KDD), Data Mining, Machine Learning and Statistics.
6. They offer applicable and successful solutions in different areas of fraud crimes.
7. Fraud detection: (overview figure omitted)
8. Fraud detection techniques are categorized into two primary classes:
i. Statistical data analysis techniques
ii. Artificial intelligence techniques
9. Statistical data analysis techniques:
i. Statistical data analysis techniques include data preprocessing techniques for detection, data validation, error correction and filling in of missing or incorrect data.
ii. They also include the calculation of various statistical parameters such as averages and performance metrics.
10. Artificial intelligence techniques:
i. Artificial intelligence techniques include data mining to classify, cluster and segment the data and automatically find association rules in the data related to fraud.
ii. They also include pattern recognition to detect approximate classes, clusters, or patterns of suspicious behavior, either automatically or by matching given inputs.

Steps from data collection to decision making:
1. Data collection:
i. Before you collect new data, determine what information could be collected from existing databases or sources on hand.
ii. Determine a file storing and naming system ahead of time to help all tasked team members collaborate.
iii. If you need to gather data via observation or interviews, develop an interview template ahead of time to ensure consistency and save time.
iv. Keep your collected data organized in a log with collection dates, and add any source notes as you go.
2. Analyze data:
i. After you have collected the right data, it is time for deeper data analysis.
ii. Begin by manipulating your data in a number of different ways, such as plotting it and finding correlations, or by creating a pivot table in Excel.
iii. A pivot table lets you sort and filter data by different variables and lets you calculate the mean, maximum, minimum and standard deviation of your data.
3. Interpret results:
i. After analyzing the data, and possibly conducting further research, it is time to interpret your results.
ii. As you interpret your analysis, keep in mind that you can never prove a hypothesis true; you can only fail to reject it.
iii. By following these steps in your data analysis process, you make better decisions for your business or government agency.

Partition the given data into 4 bins using the equi-depth binning method and perform smoothing according to the following methods. 10M
i. Smoothing by bin means
ii. Smoothing by bin medians
iii. Smoothing by bin boundaries

Given data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75

Distribute the (sorted) data into 4 bins using equi-depth binning:
Total values (T) = 24
Number of values in each bin = 24/4 = 6
Thus we get:
B1 = 11, 13, 13, 15, 15, 16
B2 = 19, 20, 20, 20, 21, 21
B3 = 22, 23, 24, 30, 40, 45
B4 = 45, 45, 71, 72, 73, 75

i. Smoothing by bin means: replace each value of a bin with the bin's mean value.
Mean for B1 = (11+13+13+15+15+16)/6 = 13.83
Mean for B2 = (19+20+20+20+21+21)/6 = 20.17
Mean for B3 = (22+23+24+30+40+45)/6 = 30.67
Mean for B4 = (45+45+71+72+73+75)/6 = 63.5
Thus we get:
B1 = 13.83, 13.83, 13.83, 13.83, 13.83, 13.83
B2 = 20.17, 20.17, 20.17, 20.17, 20.17, 20.17
B3 = 30.67, 30.67, 30.67, 30.67, 30.67, 30.67
B4 = 63.5, 63.5, 63.5, 63.5, 63.5, 63.5

ii. Smoothing by bin medians: replace each value of a bin with the bin's median value.
Median for B1 = (13+15)/2 = 14
Median for B2 = (20+20)/2 = 20
Median for B3 = (24+30)/2 = 27
Median for B4 = (71+72)/2 = 71.5
Thus we get:
B1 = 14, 14, 14, 14, 14, 14
B2 = 20, 20, 20, 20, 20, 20
B3 = 27, 27, 27, 27, 27, 27
B4 = 71.5, 71.5, 71.5, 71.5, 71.5, 71.5

iii. Smoothing by bin boundaries: replace each value of a bin with its closest boundary value (the bin minimum or maximum; a tie such as 20 in B2 is resolved toward the upper boundary here).
Thus we get:
B1 = 11, 11, 11, 16, 16, 16
B2 = 19, 21, 21, 21, 21, 21
B3 = 22, 22, 22, 22, 45, 45
B4 = 45, 45, 75, 75, 75, 75
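The equi-depth binning and the three smoothing methods worked through above can be reproduced in a few lines of Python. A minimal sketch — the helper names are illustrative, and the boundary rule breaks ties toward the upper boundary, matching the worked answer:

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal count.

    Assumes len(values) is divisible by n_bins, as in the example above.
    """
    s = sorted(values)
    size = len(s) // n_bins
    return [s[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_mean(bins):
    # Replace every value in a bin by the bin mean (rounded to 2 decimals).
    return [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

def smooth_by_median(bins):
    # Replace every value by the bin median; bins are sorted and even-sized here.
    return [[(b[len(b) // 2 - 1] + b[len(b) // 2]) / 2] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of the bin's min/max; ties go to the max.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo < hi - v else hi for v in b])
    return out

data = [11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
        22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75]
bins = equi_depth_bins(data, 4)
```

Running the three smoothers on `bins` reproduces the B1–B4 results shown in the worked answer.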