Data mining involves extracting hidden patterns from large databases. It helps companies analyze important information in their data. Some applications of data mining include financial data analysis, retail industry analysis, telecommunications analysis, biological data analysis, scientific applications, and intrusion detection. Data mining uses techniques like classification, clustering, and prediction.
The existing model uses structured data to predict whether patients are at high or low risk.
But for a complex disease, structured data alone cannot adequately describe the disease.
We propose a new convolutional neural network (CNN)-based multimodal disease risk prediction algorithm that uses both structured and unstructured hospital data.
In this paper we focus mainly on risk prediction for cerebral infarction.
The following topics are discussed in the presentation:
Introduction
Objective
Motivation
Literature Survey
Some Key Features of Disease
Plan of Action
Methodology Adopted
Data Collection
Steps to be Performed
Functional Architecture
A presentation on recent data mining techniques and future research directions, drawn from recent research papers; made in the pre-master's program at Cairo University under the supervision of Dr. Rabie.
This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens.
The presentation gives an overview of the MapReduce framework and a description of its open-source implementation (Hadoop). Amazon's own Elastic MapReduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.
Data Mining: Concepts and Techniques (3rd ed.), Chapter 3: Preprocessing (Salah Amean)
The chapter contains:
Data Preprocessing: An Overview,
Data Quality,
Major Tasks in Data Preprocessing,
Data Cleaning,
Data Integration,
Data Reduction,
Data Transformation and Data Discretization,
Summary.
Data science is an interdisciplinary field that uses algorithms, procedures, and processes to examine large amounts of data in order to uncover hidden patterns, generate insights, and direct decision making.
Andrea Bielli, IT Architect Global Digital Solution, Enel
Davide Gimondo, Software Engineer, Enel
Enel shows how Neo4j helps with the management of electricity grids in 8 countries around the world, with the goal of optimizing grid-traversal algorithms so as to make the networks ever more efficient and resilient.
Enel's objective is optimal management of the network topology in support of the group's goals: the energy transition and the electrification of the countries in which it operates, toward the Net Zero target for reducing emissions in the production and distribution of electricity.
The right architecture is key for any IT project. This is especially the case for big data projects, where there are no standard architectures that have proven their suitability over years. This session discusses the different big data architectures that have evolved over time, including the traditional Big Data architecture, the Streaming Analytics architecture, and the Lambda and Kappa architectures, and presents a mapping of components from both open source and the Oracle stack onto these architectures.
This is a PowerPoint presentation comparing the various available analytical tools, including tools for business analytics, with detailed descriptions.
DATA MINING AND DATA WAREHOUSE
W.H. Inmon
OLAP (on-line analytical processing)
OLTP (on-line transaction processing)
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data warehouse vs Data Mining
Use in Urban Planning
Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics whose overall goal is to extract information from a data set and transform it into a comprehensible structure for further use. The process of digging through data to discover hidden connections and predict future trends has a long history. Sometimes referred to as 'knowledge discovery in databases', the term data mining wasn't coined until the 1990s. What was old is new again, as data mining technology keeps evolving to keep pace with the limitless potential of big data and affordable computing power. Over the last decade, advances in processing power and speed have enabled us to move beyond manual, tedious and time-consuming practices to quick, easy and automated data analysis. The more complex the data sets collected, the more potential there is to uncover relevant insights. Rupashi Koul, "Overview of Data Mining", International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN 2456-6470, Volume 4, Issue 4, June 2020. Paper: https://www.ijtsrd.com/papers/ijtsrd31368.pdf | https://www.ijtsrd.com/engineering/computer-engineering/31368/overview-of-data-mining/rupashi-koul
Real World Application of Big Data in Data Mining Tools (ijsrd.com)
The main aim of this paper is to study the notion of big data and its application in data mining tools like R, Weka, RapidMiner, KNIME, Mahout, etc. We are awash in a flood of data today. In a broad range of application areas, data is being collected at unmatched scale. Decisions that previously were based on surmise, or on painstakingly constructed models of reality, can now be made based on the data itself. Such big data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences. The paper mainly focuses on different types of data mining tools and their usage on big data in knowledge discovery.
The Effectiveness of Data Mining Techniques in Banking (csijjournal)
The aim of this study is to identify the extent of the data mining activities that are practiced by banks. Data mining is the ability to link structured and unstructured information with the changing rules by which people apply it; it is not a technology, but a solution that applies information technologies. Currently, several industries, including banking, finance, retail, insurance, publicity, database marketing and sales prediction, use data mining tools for customer analysis. Leading banks are using data mining tools for customer segmentation and profitability, credit scoring and approval, predicting payment lapse, marketing, detecting illegal transactions, etc. Banking is realizing that it is possible to gain competitive advantage by deploying data mining. This article examines the effectiveness of data mining techniques in organized banking. It also discusses standard tasks involved in data mining and evaluates various data mining applications in different sectors.
Data Mining in the Telecommunication Industry (ijsrd.com)
Telecommunication companies today operate in a highly competitive and challenging environment. Vast volumes of data are generated from various operational systems and are used for solving many business problems that require urgent handling. These data include call detail data, customer data and network data. Data mining methods and business intelligence technology are widely used for handling business problems in this industry. The goal of this paper is to provide a broad review of data mining concepts.
1. Web Mining – Web mining is an application of data mining for discovering data patterns from the web. Web mining has three categories: content mining, structure mining and usage mining. Content mining detects patterns in data collected by the search engine. Structure mining examines data related to the structure of a website, while usage mining examines data from the user's browser. The data collected through web mining is evaluated and analyzed using techniques like clustering, classification, and association. It is a very good topic for a thesis in data mining.
2. Predictive Analytics – Predictive analytics is a set of statistical techniques for analyzing current and historical data to predict future events. The techniques include predictive modeling, machine learning, and data mining. In large organizations, predictive analytics helps businesses identify risks and opportunities. Both structured and unstructured data are analyzed to detect patterns. Predictive analysis is a lengthy process and consists of seven stages: project definition, data collection, data analysis, statistics, modeling, deployment, and monitoring. It is an excellent choice for research and a thesis.
3. Oracle Data Mining – Oracle Data Mining, also referred to as ODM, is a component of the Oracle Advanced Analytics Database. It provides powerful data mining algorithms that assist data analysts in getting valuable insights from data to predict future trends. It helps in predicting customer behavior, which ultimately helps in targeting the best customers and cross-selling. SQL functions are used in the algorithms to mine data tables and views. It is also a good choice for a thesis and research in data mining and databases.
4. Clustering – Clustering is a process in which data objects are divided into meaningful sub-classes known as clusters. Objects with similar characteristics are aggregated together in a cluster. There are distinct models of clustering, such as centralized and distributed. In centroid-based clustering, a vector value is assigned to each cluster. There are various applications of clustering in data mining, such as market research, image processing, and data analysis. It is also used in credit card fraud detection.
5. Text mining – Text mining, or text data mining, is a process for extracting high-quality information from text. It is done through patterns and trends devised using statistical pattern learning. First, the input data is structured; then patterns are derived from this structured data; finally, the output is evaluated and interpreted. The main applications of text mining include competitive intelligence, e-discovery, national security, and social media monitoring. It is a trending topic for a thesis in data mining.
6. Fraud Detection – The number of frauds in daily life is increasing in sectors like banking, finance, and government, and accurate detection of fraud is a challenge.
In the information age, data has become vital, so it is important to understand data in order to face future information challenges. This paper deals with the importance of data mining, explaining the concepts and life cycle involved. It extracts the basic gist of the topic and presents it in a user-friendly way, further developing the different stages of data mining and its extended application in practical business platforms.
Characterizing and Processing of Big Data Using Data Mining Techniques (IJTET Journal)
Abstract: Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. It concerns large-volume, complex and growing data sets with multiple, autonomous sources. Not only in science and engineering: big data is now rapidly expanding in all domains, physical, biological, etc. The main objective of this paper is to characterize the features of big data. The HACE theorem, which characterizes the features of the big data revolution, is used, and a big data processing model is proposed from the data mining perspective. The aggregation of mining and analysis, information sources, user interest modeling, and privacy and security are involved in this model. Exploring large volumes of data and extracting useful information or knowledge from them is the most fundamental challenge in big data, so we should analyze these problems and this knowledge revolution.
An Analysis of Impact Factors on the Agriculture Field Using Data Mining Techniques (ijcnes)
In computing and information systems, huge amounts of data are held in storage, and the task is to extract the specified data from the raw data. Data mining is one of the techniques that can extract it, and data mining techniques are used in many places. Techniques like k-means, k-nearest neighbor, support vector machines, bi-clustering, naive Bayes classifiers, neural networks and fuzzy c-means have been applied to agricultural data. There are many factors in agriculture; the main factors for the farmer are climate, soil and yield prediction. To improve production, farmers must select a suitable crop for a suitable climate. This paper presents the various concepts of data mining and their applications, and discusses the research field in agriculture, including the different types of factors that have an impact in the agriculture field.
Define “data mining”. Enumerate five example applications that can benefit from using data mining. 5M
Data mining:
1. The extraction of hidden information from large databases is called data mining.
2. Data mining is a powerful new technology that helps companies focus on the most important information in their databases or data warehouses.
3. Data mining tools allow businesses to make proactive, knowledge-driven decisions.
4. The analyses offered by data mining move beyond the analyses of past events provided by decision support systems.
5. Data mining tools can answer business questions that were traditionally too time-consuming to resolve.
6. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources.
7. They can be integrated with new products and systems as they are brought on-line.
8. Data mining techniques are the result of a long process of research and product development.
9. Data mining allows users to go through their data in real time.
10. Data mining is ready for application in the business community.
11. It is supported by three technologies:
i. Massive data collection
ii. Powerful multiprocessor computers
iii. Data mining algorithms
12. Data mining is widely used in diverse areas.
Data mining applications: financial data analysis, the retail industry, the telecommunication industry, biological data analysis, other scientific applications, and intrusion detection.
1. Financial Data Analysis:
Financial data is generally reliable and of high quality, which facilitates systematic data analysis and data mining.
Some of the cases are as follows:
i. Design and construction of data warehouses for multidimensional data analysis and data mining.
ii. Loan payment prediction and customer credit policy analysis.
iii. Classification and clustering of customers for targeted marketing.
iv. Detection of money laundering and other financial crimes.
2. Retail Industry:
Data mining has one of its best applications in the retail industry, because the industry collects large amounts of data on sales, customers, consumption and services.
Data mining in the retail industry helps identify customer buying patterns and trends, which leads to improved quality of customer service and satisfaction.
Here is a list of examples of data mining in the retail industry:
i. Design and construction of data warehouses based on the benefits of data mining.
ii. Multidimensional analysis of sales, customers, products, time and region.
iii. Analysis of the effectiveness of sales campaigns.
iv. Customer retention.
v. Product recommendation and cross-referencing of items.
3. Telecommunication Industry:
The telecommunication industry is one of the leading industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, etc.
Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding.
Data mining helps the telecommunication industry identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service.
Here is a list of examples for which data mining improves telecommunication services:
i. Multidimensional analysis of telecommunication data.
ii. Fraudulent pattern analysis.
iii. Identification of unusual patterns.
iv. Multidimensional association and sequential pattern analysis.
v. Mobile telecommunication services.
vi. Use of visualization tools in telecommunication data analysis.
4. Biological Data Analysis:
Biological data analysis is a very important part of bioinformatics.
Aspects of biological data analysis:
i. Semantic integration of heterogeneous, distributed genomic and proteomic databases.
ii. Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
iii. Discovery of structural patterns and analysis of genetic networks and protein pathways.
iv. Association and path analysis.
v. Visualization tools in genetic data analysis.
5. Other Scientific Applications:
Huge amounts of data are collected from scientific domains such as astronomy.
Large data sets are also being generated by fast numerical simulations in various fields.
Applications of data mining in scientific applications:
i. Data warehouses and data preprocessing.
ii. Graph-based mining.
iii. Visualization and domain-specific knowledge.
6. Intrusion Detection:
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources.
With the increased use of the internet and the availability of tools and tricks for intruding into networks, intrusion detection has become a critical component of network administration.
Here is a list of areas in which data mining technology may be applied for intrusion detection:
i. Development of data mining algorithms for intrusion detection.
ii. Association and correlation analysis, and aggregation to help select and build discriminating attributes.
iii. Analysis of stream data.
iv. Distributed data mining.
v. Visualization and query tools.
What is data preprocessing? Explain the different methods for the data cleansing phase. 5M
Data preprocessing:
1. Data preprocessing is an important step in the data mining process.
2. Data preprocessing includes cleaning, normalization, transformation, feature extraction and selection.
3. The product of data preprocessing is the final training set.
4. Data preprocessing is one of the most critical steps in a data mining process; it deals with the preparation and transformation of the initial dataset.
5. Data preprocessing methods are divided into the following categories:
i. Data Cleaning
ii. Data Integration
iii. Data Transformation
iv. Data Reduction
Different methods for the data cleansing phase: handling missing values (ignore the tuple, fill in the value manually, use a global constant, use the attribute mean, or use the most probable value) and smoothing noisy data by binning, regression and clustering, as described under the noisy-data question later in these notes; a small sketch follows below.
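As a minimal illustration of two of these preprocessing steps, here is a hedged Python sketch (the "income" column and its values are invented for the example; pandas is assumed to be available) that fills a missing value with the attribute mean and then applies min-max normalization:

# Toy preprocessing sketch: clean a missing value, then normalize the attribute.
import pandas as pd

df = pd.DataFrame({"income": [30.0, None, 70.0, 50.0]})  # hypothetical attribute

# Data cleaning: replace the missing value with the attribute mean (here 50.0).
df["income"] = df["income"].fillna(df["income"].mean())

# Data transformation: min-max normalization to the [0, 1] range.
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)

print(df)  # income: 30, 50, 70, 50 -> income_norm: 0.0, 0.5, 1.0, 0.5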
Difference between classification and prediction. 5M
1. Classification classifies data into predefined classes, while prediction predicts the value of new or unseen data.
2. Accuracy: the accuracy of a classifier refers to the ability of the given classifier to correctly classify new data; the accuracy of a predictor refers to how well a given predictor can estimate the value of new or unseen data.
3. Speed: the speed of a classifier refers to the computational cost involved in generating and using the classifier; likewise, the speed of a predictor refers to the computational cost involved in generating and using the predictor.
4. Robustness: the robustness of a classifier is its ability to make correct classifications on noisy data or data with missing values; the robustness of a predictor is its ability to make correct predictions on such data.
5. Scalability: the scalability of a classifier is the ability to construct the classifier so that it works efficiently on large amounts of data; the scalability of a predictor is defined analogously.
Data mining architecture 10M
1. Data mining is a very important process in which potentially useful and previously unknown information is extracted from large volumes of data.
2. A number of components are involved in the data mining process.
3. These components constitute the architecture of a data mining system.
4. The architecture of a typical data mining system may have the following major components.
5. Database, data warehouse, World Wide Web, or other information repository: this is one database or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
6. Data cleaning and data integration techniques may be performed on this data.
7. Database or data warehouse server:
The database or data warehouse server is responsible for fetching the relevant data, based on
the user’s data mining request.
8. Knowledge base:
i. This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
ii. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
iii. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included.
9. Data mining engine:
i. The data mining engine is the core component of any data mining system.
ii. It consists of a number of modules for performing data mining tasks including association,
classification, characterization, clustering, prediction, time-series analysis etc.
10. Pattern evaluation module:
i. This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns.
ii. It may use interestingness thresholds to filter out discovered patterns.
iii. Alternatively, the pattern evaluation module may be integrated with the mining module,
depending on the implementation of the data mining method used.
11. User interface:
i. This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results.
ii. In addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the patterns in different
forms.
12. Diagram: (architecture diagram of the above components; figure omitted in these notes).
13. Each and every component of a data mining system has its own role and importance in completing data mining efficiently.
14. These different modules need to interact correctly with each other in order to complete the complex process of data mining successfully.
What is KDD? Explain its process.
KDD (knowledge discovery from data, also knowledge discovery in databases):
1. KDD is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
2. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.
3. The KDD process is commonly defined with the stages: Selection, Pre-processing, Transformation, Data Mining and Interpretation/Evaluation.
KDD PROCESS:
1. Creating a target data set (data selection).
2. Data cleaning and preprocessing (may take 60% of the effort!).
3. Data reduction and transformation: finding useful features, dimensionality/variable reduction, and invariant representation.
4. Choosing the functions of data mining: summarization, classification, regression, association, clustering.
5. Choosing the mining algorithm(s).
ALGORITHM (DBSCAN, density-based clustering):
Input: D: a data set containing n objects; ε: the radius parameter; MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
1) mark all objects as unvisited;
2) do
3) randomly select an unvisited object p;
4) mark p as visited;
5) if the ε-neighborhood of p has at least MinPts objects
6) create a new cluster C, and add p to C;
7) let N be the set of objects in the ε-neighborhood of p;
8) for each point p' in N
9) if p' is unvisited
10) mark p' as visited;
11) if the ε-neighborhood of p' has at least MinPts points, add those points to N;
12) if p' is not yet a member of any cluster, add p' to C;
13) end for
14) output C;
15) else mark p as noise;
16) until no object is unvisited;
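A minimal runnable sketch of this algorithm, using scikit-learn's DBSCAN implementation rather than re-implementing the pseudocode (scikit-learn is assumed to be available; the toy points are invented, and eps/min_samples play the roles of ε and MinPts):

# DBSCAN demo: two dense groups of points plus one distant noise point.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # dense group A
    [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],   # dense group B
    [25.0, 25.0],                         # isolated point
])

# eps is the radius parameter; min_samples is the density threshold (MinPts).
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # expected: [0 0 0 1 1 1 -1], where -1 marks noise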
Advantages:
1. DBSCAN does not require one to specify the number of clusters in the data a priori, as
opposed to k-means.
2. DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded
by (but not connected to) a different cluster.
3. Due to the MinPts parameter, the so-called single-link effect (different clusters being
connected by a thin line of points) is reduced.
4. DBSCAN has a notion of noise, and is robust to outliers.
5. DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points
in the database.
6. DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an
R* tree.
Disadvantages:
1. DBSCAN is not entirely deterministic: border points that are reachable from more than one
cluster can be part of either cluster, depending on the order the data is processed.
2. The quality of DBSCAN depends on the distance measure used in the function
regionQuery(P,ε). The most common distance metric used is Euclidean distance.
3. DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε
combination cannot then be chosen appropriately for all clusters.
4. If the data and scale are not well understood, choosing a meaningful distance threshold ε can
be difficult.
What is Resource Allocation? Explain Resource leveling & Resource smoothing.
Resource allocation is the scheduling of activities and the resources required by those activities
while taking into consideration both the resource availability and project time.
The resource allocation procedure mainly consists of two activities: resource smoothing and resource leveling.
Resource Smoothing:
1. If the duration of the project is the constraint, then resource smoothing should be applied, without changing the total project duration.
2. The periods of minimum demand for resources are located, and activities are shifted according to the float available and the resources required.
3. Thus, intelligent utilization of floats can smooth the demand for resources to the maximum possible extent.
4. This type of resource allocation is called resource smoothing.
Resource Leveling:
1. In the process of resource leveling, whenever the availability of a resource becomes less than its maximum requirement, the only alternative is to delay the activity having the larger float.
2. In case two or more activities require the same amount of a resource, the activity with the minimum duration is chosen for resource allocation.
3. Resource leveling is done when the restriction is on the availability of resources.
Write a short note on linear regression. 5M
Linear regression involves finding the “best” line to fit two attributes (or variables), so that one
attribute can be used to predict the other.
Linear Regression
a. Straight-line regression:
1. Straight-line regression analysis involves a response variable, y, and a single predictor
variable, x.
2. It is the simplest form of regression, and models y as a linear function of x.
3. That is, y = b+wx; where the variance of y is assumed to be constant, b and w are regression
coefficients specifying the Y-intercept and slope of the line, respectively.
4. These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
5. The regression coefficients can be estimated using this method with the following equations: w = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)² and b = ȳ − w·x̄, where x̄ and ȳ are the means of the x and y values.
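As a small check of these equations, here is a sketch in plain Python (the sample x and y values are invented for illustration):

# Least-squares fit of the straight line y = b + w*x, using the equations above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Slope w and intercept b from the least-squares equations.
w = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b = y_bar - w * x_bar
print(f"y = {b:.2f} + {w:.2f}x")  # -> y = 0.09 + 1.99x for this sample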
b. Multiple linear regression:
1. Multiple linear regression is an extension of straight-line regression that involves more than one predictor variable.
2. It allows the response variable y to be modeled as a linear function of n predictor variables or attributes.
3. The equations, obtained from the method of least squares, become long and are tedious to solve by hand.
4. Multiple regression problems are instead commonly solved with statistical software packages such as SAS, SPSS, and S-Plus.
Criteria for comparing classification and prediction methods:
1. Speed and scalability: the time to construct the model and the time to use the model.
2. Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values.
3. Scalability: the ability to construct the classifier efficiently given large amounts of data.
4. Interpretability: the level of understanding and insight that is provided by the classifier.
5. Goodness of rules: decision tree size and the compactness of classification rules.
What is noisy data? How to handle it?
Noisy data:
1. Noisy data is meaningless data.
2. It includes any data that cannot be understood and interpreted correctly by machines, such as unstructured text.
3. Noisy data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data mining analysis.
4. Noisy data can be caused by faulty data collection instruments, human or computer errors at data entry, data transmission errors, limited buffer sizes for coordinating synchronized data transfer, inconsistencies in the naming conventions or data codes used, and inconsistent formats for input fields (e.g. dates).
Noisy data can be handled by the following procedures:
A. Binning:
1. Binning methods smooth a sorted data value by consulting the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the values around a value, they perform local smoothing.
2. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value.
3. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal-width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique.
B. Regression:
1. Here data can be smoothed by fitting the data to a function.
2. Linear regression involves finding the “best” line to fit two attributes, so that one attribute
can be used to predict the other.
3. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
C. Clustering:
1. Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.”
2. Similarly, values that fall outside of the set of clusters may also be considered outliers.
Describe one hierarchical clustering algorithm using an example dendrogram. 5M
There are three algorithms of hierarchical clustering: agglomerative hierarchical clustering, divisive hierarchical clustering, and BIRCH.
Agglomerative hierarchical clustering:
1. Agglomerative hierarchical clustering is a bottom-up clustering approach.
2. In this clustering, the clusters have sub-clusters.
3. An example of this type of clustering is species taxonomy.
4. Gene expression data also exhibit this hierarchical quality.
5. Agglomerative hierarchical clustering starts with every single object (gene or sample) in its own cluster.
6. Then, in each successive iteration, it merges the closest pair of clusters satisfying some similarity criterion, until all the data is in one cluster.
7. The clusters generated in early stages are nested in the clusters generated in later stages.
8. The clusters of different sizes in the tree can be valuable for discovery.
9. This type of clustering produces an ordering of the objects, which may be informative for data display.
10. In this clustering algorithm, smaller clusters are generated, which may be helpful for discovery.
11. Different methods for combining clusters in agglomerative hierarchical clustering:
i. Single linkage
ii. Complete linkage
iii. Average linkage
iv. Centroid method
v. Ward’s method
12. Example dendrogram: (figure omitted; a dendrogram is a tree diagram showing the order and distances at which clusters are merged).
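A short sketch that builds such a dendrogram with SciPy and matplotlib (both assumed available; the six one-dimensional points are invented for the example):

# Agglomerative clustering of six points, drawn as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[2.0], [3.0], [4.0], [10.0], [11.0], [25.0]])

# method="average" corresponds to the average-linkage method listed above.
Z = linkage(X, method="average")
dendrogram(Z, labels=["2", "3", "4", "10", "11", "25"])
plt.title("Example dendrogram")
plt.show()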
Explain the concept of a decision support system with the help of an example application. 5M
1. A decision support system (DSS) is a computer application that analyzes business data and presents it so that business decisions can be made more easily.
2. A decision support system is an "informational application" system.
3. Decision support systems help businesses and organizations in decision-making activities.
4. A decision support system presents information graphically and may include artificial intelligence.
5. Categories of decision support systems:
i. Communication-driven DSS: its purpose is to conduct meetings where users can collaborate.
ii. Data-driven DSS: used to query a database to seek specific answers for a specific purpose.
iii. Document-driven DSS: its purpose is to search web pages and find documents.
iv. Knowledge-driven DSS: used to produce management advice and to choose products or services.
v. Model-driven DSS: used by managers and staff members of the business to interact with each other and analyze decisions.
6. Decision support systems are combinations of integrated resources working together.
7. For example, a national on-line book seller wants to begin selling its products internationally, but first needs to determine whether that would be a wise business decision.
8. In such a case, the vendor can use a DSS to collect information from its own resources, using a tool such as OLAP, to determine whether the company has the ability to expand its business.
9. It can also collect information from external resources, such as industry data, to determine whether there is indeed a demand to meet.
10. The decision support system will collect and analyze the data and then present it in a way that can be interpreted by humans.
11. A few decision support systems come very close to acting as artificial intelligence agents.
What is clustering? Explain the k-means clustering algorithm. Suppose the data for clustering is {2, 4, 10, 12, 3, 20, 11, 25}. Consider k=2 and cluster the given data using the k-means algorithm. 10M
Clustering:
1. The process of partitioning a set of data into a set of meaningful sub-classes, or clusters, is called clustering.
2. Clustering is a technique used to place data elements into related groups.
3. Example: a graphical representation of clustering (scatter-plot figure omitted here) in which the points form four clusters.
4. A cluster is a collection of objects which are similar to one another and dissimilar to the objects of other clusters.
K-means clustering algorithm:
1. K-means clustering is an algorithm that groups objects, based on their features, into K groups.
2. K is a positive integer and is decided by the user.
3. The centroids of the clusters are generally far away from each other.
4. Assign each element to the cluster whose centroid is nearest, then recompute each centroid from the elements assigned to it.
5. In every step the centroids change, and elements may move from one cluster to another.
6. Repeat the same process until no element moves from one cluster to another.
Data: {2, 4, 10, 12, 3, 20, 11, 25}
K=2
Select any two initial means M1 and M2:
M1=4
M2=12
Partition the data set by assigning each value to the nearer mean:
K1 = {2, 3, 4}, mean = 3
K2 = {10, 11, 12, 20, 25}, mean = 15.6
Reassign the values of the data set according to the new mean values:
K1 = {2, 3, 4}, mean = 3
K2 = {10, 11, 12, 20, 25}, mean = 15.6
The means are unchanged, so the algorithm stops.
(Note: if you use different initial means for M1 and M2, the intermediate steps will differ and the algorithm may stop after a different number of iterations.)
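The same worked example as a small self-contained Python sketch (using the initial means 4 and 12, as above):

# 1-D k-means (k=2) on the data from the worked example.
def kmeans_1d(data, means):
    while True:
        # Assignment step: each value joins the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # Update step: recompute the means; stop when they no longer change.
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:
            return clusters, means
        means = new_means

clusters, means = kmeans_1d([2, 4, 10, 12, 3, 20, 11, 25], means=[4.0, 12.0])
print(clusters, means)  # [[2, 4, 3], [10, 12, 20, 11, 25]] [3.0, 15.6]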
For the given set of data points: 10M
(a) Find the mean, median and mode.
(b) Show a boxplot of the data, clearly indicating the five-number summary.
Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
(a) Mean, median and mode:
Mean:
Mean = sum of all values divided by the number of values
Mean = (11+13+13+15+15+16+19+20+20+20+21+21+22+23+24+30+40+45+45+45+71+72+73+75)/24 = 769/24 = 32.04
Median:
With 24 (an even number of) sorted values, the median is the average of the 12th and 13th values.
Median = (21 + 22)/2 = 21.5
Mode:
Mode = the value(s) repeated more often than the other values
Mode = 20 and 45 (each occurs three times)
(b) Boxplot of the data with the five-number summary (boxplot figure omitted): Minimum = 11, Q1 = 17.5, Median = 21.5, Q3 = 45, Maximum = 75.
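A short sketch that computes the summary and draws the boxplot (NumPy and matplotlib assumed; note that NumPy's default percentile interpolation can give slightly different quartiles than the median-of-halves convention used above):

# Five-number summary and boxplot for the data set above.
import numpy as np
import matplotlib.pyplot as plt

data = [11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
        22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75]

q1, median, q3 = np.percentile(data, [25, 50, 75])
print(min(data), q1, median, q3, max(data))

plt.boxplot(data)   # the box spans Q1..Q3 with a line at the median
plt.show()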
What is an outlier? Describe methods that can be used for outlier analysis. 10M
1. An outlier is an observation point that is distant from other observations.
2. An outlier may indicate measurement error or experimental error.
3. Outliers can carry novel, new, abnormal, unusual or noisy information about the data.
4. Outliers can be classified into three categories:
i. Point outliers
ii. Contextual outliers
iii. Collective outliers
5. Point outliers:
i. This is the simplest type of outlier, and the majority of research on outlier detection focuses on it.
ii. If an individual data point can be considered anomalous with respect to the rest of the data, it is called a point outlier.
6. Contextual outliers:
i. If an individual data point is anomalous only in a specific context, it is called a contextual (or conditional) outlier.
ii. In contextual outlier detection, each data point is described by two sets of attributes: contextual attributes and behavioral attributes.
7. Collective outliers:
i. If a collection of data points is anomalous with respect to the entire data set, it is called a collective outlier.
ii. Collective outliers can occur only in data sets in which the data points are somehow related.
8. The benefit of identifying an outlier is that it can be removed or modeled separately in regression modeling to improve accuracy.
9. Outlier detection is one of the basic problems of data mining.
10. Outliers may be erroneous or real.
Methods used for outlier analysis:
1. Statistical approach
2. Distance-based approach
3. Density-based local outlier approach
4. Deviation-based approach
1. Statistical approach:
i. This method assumes a distribution for the given data set and then identifies outliers with respect to that model using a discordance test.
ii. A statistical discordance test examines two hypotheses: a working hypothesis and an alternative hypothesis.
iii. The working hypothesis states that the entire data set of n objects comes from an initial distribution model.
iv. The alternative hypothesis states that the data set comes from another distribution model.
2. Distance-based approach:
i. This method generalizes the ideas behind discordance testing for various standard distributions.
ii. An object's neighbors are defined based on their distance from the given object.
iii. Many efficient algorithms for mining distance-based outliers have been developed: index-based, nested-loop and cell-based algorithms.
3. Density-based local outlier approach:
i. Whereas distance-based detection depends on the overall distribution of the given set of data points, this approach introduces the notion of local outliers: objects that are outliers relative to their local neighborhood.
ii. This approach can detect both global and local outliers.
4. Deviation-based approach:
i. This method identifies outliers by examining the main characteristics of objects in a group; objects that deviate from this description are considered outliers.
ii. The term "deviation" is typically used to refer to outliers in this approach.
iii. There are two techniques for deviation-based outlier detection: the first compares objects sequentially in a set, and the second employs an OLAP data-cube approach.
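As an illustration of the density-based local outlier approach, here is a minimal sketch using scikit-learn's LocalOutlierFactor (scikit-learn assumed available; the data points are invented, and the expected output is a plausible result rather than a guarantee):

# Density-based local outlier detection with the Local Outlier Factor (LOF).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[2.0], [3.0], [4.0], [10.0], [11.0], [12.0], [25.0]])

# n_neighbors controls the local neighborhood used to compare densities.
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)  # -1 marks points flagged as outliers
print(labels)                # the isolated value 25 should be flagged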
Design a BI system for fraud detection. Describe all the steps from data collection to decision making clearly. 10M
Fraud detection in the telecommunication industry:
1. Fraud is an adaptive crime; it needs special methods of intelligent data analysis to detect and prevent it.
2. The telecommunications industry has expanded with the development of affordable mobile phone technology.
3. For example, forensic analytics may be used to review an employee's purchasing-card activity to assess whether any of the purchases were diverted for personal use.
4. The main steps in forensic analytics are data collection, data preparation, data analysis and reporting.
5. Fraud detection methods exist in the areas of knowledge discovery in databases (KDD), data mining, machine learning and statistics.
6. They offer applicable and successful solutions in different areas of fraud crimes.
7. Fraud detection:
8. Fraud detection techniques are categorized into two primary classes:
i. Statistical data analysis techniques
ii. Artificial intelligence techniques
9. Statistical data analysis techniques:
i. Statistical data analysis techniques include data preprocessing techniques for detection, data validation, error correction and filling in of missing or incorrect data.
ii. They also include the calculation of various statistical parameters such as averages and performance metrics.
10. Artificial intelligence techniques:
i. Artificial intelligence techniques use data mining to classify, cluster and segment the data and automatically find association rules in the data related to fraud.
ii. They also include pattern recognition to detect approximate classes, clusters or patterns of suspicious behavior, either automatically or by matching given inputs.
Steps from data collection to decision making:
1. Data collection:
i. Before you collect new data, determine what information could be collected from existing databases or sources on hand.
ii. Determine a file storing and naming system ahead of time to help all tasked team members collaborate.
iii. If you need to gather data via observation or interviews, develop an interview template ahead of time to ensure consistency and save time.
iv. Keep your collected data organized in a log with collection dates, and add any source notes as you go.
2. Analyze data:
i. After you've collected the right data, it's time for deeper data analysis.
ii. Begin by manipulating your data in a number of different ways, such as plotting it and finding correlations, or by creating a pivot table in Excel.
iii. A pivot table lets you sort and filter data by different variables and lets you calculate the mean, maximum, minimum and standard deviation of your data.
3. Interpret results:
i. After analyzing the data, and possibly conducting further research, it's time to interpret your results.
ii. As you interpret your analysis, remember that you can never prove a hypothesis true; you can only fail to reject it.
iii. By following these steps in your data analysis process, you can make better decisions for your business or government agency.
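To make the statistical-analysis step concrete, here is a hedged sketch (the transaction amounts and the two-standard-deviation threshold are invented for illustration) that flags unusually large transactions:

# Simple statistical fraud screen: flag amounts far from the mean.
import statistics

amounts = [120.0, 80.5, 99.9, 110.0, 95.0, 130.0, 5000.0, 105.0]  # toy data

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Flag any amount more than two standard deviations from the mean.
flagged = [a for a in amounts if abs(a - mean) > 2 * stdev]
print(flagged)  # the 5000.0 transaction stands out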
Partition the given data into 4 bins using the equi-depth binning method and perform smoothing according to the following methods. 10M
Smoothing by bin means
Smoothing by bin medians
Smoothing by bin boundaries
Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
Distribute the data into 4 bins using equi-depth (equal-frequency) binning:
Total values (T) = 24
Number of values in each bin = 24/4 = 6
Thus we get:
B1 = 11, 13, 13, 15, 15, 16
B2 = 19, 20, 20, 20, 21, 21
B3 = 22, 23, 24, 30, 40, 45
B4 = 45, 45, 71, 72, 73, 75
i. Smoothing by bin means:
Replace each value of a bin with the bin's mean value.
Mean for B1 = (11+13+13+15+15+16)/6 = 13.83
Mean for B2 = (19+20+20+20+21+21)/6 = 20.17
Mean for B3 = (22+23+24+30+40+45)/6 = 30.67
Mean for B4 = (45+45+71+72+73+75)/6 = 63.5
Thus we get:
B1 = 13.83, 13.83, 13.83, 13.83, 13.83, 13.83
B2 = 20.17, 20.17, 20.17, 20.17, 20.17, 20.17
B3 = 30.67, 30.67, 30.67, 30.67, 30.67, 30.67
B4 = 63.5, 63.5, 63.5, 63.5, 63.5, 63.5
ii. Smoothing by bin medians:
Replace each value of a bin with the bin's median value.
Median for B1 = (13+15)/2 = 14
Median for B2 = (20+20)/2 = 20
Median for B3 = (24+30)/2 = 27
Median for B4 = (71+72)/2 = 71.5
Thus we get:
B1 = 14, 14, 14, 14, 14, 14
B2 = 20, 20, 20, 20, 20, 20
B3 = 27, 27, 27, 27, 27, 27
B4 = 71.5, 71.5, 71.5, 71.5, 71.5, 71.5
iii. Smoothing by bin boundaries:
Replace each value of a bin with its closest boundary value (here, a value equidistant from both boundaries is replaced by the upper boundary).
Thus we get:
B1 = 11, 11, 11, 16, 16, 16
B2 = 19, 21, 21, 21, 21, 21
B3 = 22, 22, 22, 22, 45, 45
B4 = 45, 45, 75, 75, 75, 75
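The whole exercise as a compact Python sketch (this assumes the bin depth of 6 computed above, and resolves ties in boundary smoothing toward the upper boundary, matching the worked answer):

# Equi-depth binning with mean, median and boundary smoothing.
data = sorted([11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
               22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75])

depth = len(data) // 4                                    # 4 bins of 6 values
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

by_mean = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
by_median = [[(b[depth // 2 - 1] + b[depth // 2]) / 2] * len(b) for b in bins]  # even depth
by_boundary = [[b[0] if x - b[0] < b[-1] - x else b[-1] for x in b] for b in bins]

print(by_mean)      # [[13.83]*6, [20.17]*6, [30.67]*6, [63.5]*6]
print(by_median)    # [[14.0]*6, [20.0]*6, [27.0]*6, [71.5]*6]
print(by_boundary)  # matches the bin-boundary answer above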