This slide deck covers the essential topics in Data Mining, including the important graduation-level questions. It presents the typical 2- and 4-mark questions along with their answers.
Data Mining System and Applications: A Review (ijdpsjournal)
In the Information Technology era, information plays a vital role in every sphere of human life. It is very important to gather data from different sources, store and maintain it, generate information and knowledge, and disseminate data, information, and knowledge to every stakeholder. Owing to the widespread use of computers and electronic devices and the tremendous growth in computing power and storage capacity, data collection has grown explosively. Storing data in a data warehouse enables an entire enterprise to access a reliable, current database. Analyzing this vast amount of data and drawing fruitful conclusions and inferences requires special tools called data mining tools. This paper gives an overview of data mining systems and some of their applications.
International Journal of Engineering Research and Applications (IJERA) is an open access, online, peer-reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
Introduction to feature subset selection method (IJSRD)
Data mining is a computational process for discovering patterns in large data sets. Among its important techniques is classification, which has recently received great attention in the database community. Classification can solve problems in fields such as medicine, industry, business, and science. Particle Swarm Optimization (PSO) is an optimization method based on social behaviour. Feature Selection (FS) finds a subset of prominent features to improve predictive accuracy and to remove redundant features. Rough Set Theory (RST) is a mathematical tool for dealing with the uncertainty and vagueness of decision systems.
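The filter style of feature subset selection mentioned above can be sketched in a few lines. The snippet below scores each feature by its absolute correlation with the label and keeps the top k; it is a minimal illustration with hypothetical toy data, not the PSO- or RST-based method the paper actually studies:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    # Pearson correlation between two equal-length numeric lists.
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

def select_features(rows, labels, k):
    # Rank every feature column by |correlation| with the label; keep top k.
    scores = []
    for j in range(len(rows[0])):
        col = [row[j] for row in rows]
        scores.append((abs(pearson(col, labels)), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:k])

# Hypothetical data: feature 0 tracks the label, feature 1 is noise,
# feature 2 is constant (redundant).
rows = [[1, 5, 7], [2, 3, 7], [3, 6, 7], [4, 2, 7]]
labels = [1, 2, 3, 4]
print(select_features(rows, labels, 1))  # [0]
```

A wrapper method (like the paper's PSO search) would instead evaluate whole candidate subsets against a classifier; the filter above is only the cheapest baseline.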
Privacy preservation techniques in data mining (eSAT Journals)
Abstract: In this paper, different privacy preservation techniques are compared. Classification is the most commonly applied data mining technique; it employs a set of pre-classified examples to develop a model that can classify the population of records at large. Fraud detection and credit-risk applications are particularly well suited to this type of analysis. This approach frequently employs decision-tree or neural-network-based classification algorithms. The data classification process involves two steps: learning and classification. In learning, the training data are analyzed by a classification algorithm; in classification, test data are used to estimate the accuracy of the classification rules. If the accuracy is acceptable, the rules can be applied to new data tuples. For a fraud detection application, this would include complete records of both fraudulent and valid activities, determined on a record-by-record basis. The classifier-training algorithm uses these pre-classified examples to determine the set of parameters required for proper discrimination, and then encodes these parameters into a model called a classifier. Index Terms: Data Mining, Privacy Preservation, Clustering, Classification Techniques, Naive Bayes.
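The learn-then-classify process described above can be made concrete with a tiny Naive Bayes classifier (one of the abstract's index terms). The fraud records below are hypothetical, and this sketch stands in for, rather than reproduces, the paper's techniques:

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Learning step: count class and per-feature value frequencies."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feat_counts[(y, j)][v] += 1
    return class_counts, feat_counts

def predict_nb(model, row):
    """Classification step: pick the class with the highest log-posterior."""
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y, cnt in class_counts.items():
        lp = math.log(cnt / total)
        for j, v in enumerate(row):
            counts = feat_counts[(y, j)]
            # Laplace smoothing so unseen values never give zero probability.
            lp += math.log((counts[v] + 1) / (cnt + len(counts) + 1))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Hypothetical pre-classified fraud records: (amount band, location).
rows = [("high", "abroad"), ("high", "abroad"), ("low", "home"), ("low", "home")]
labels = ["fraud", "fraud", "valid", "valid"]
model = train_nb(rows, labels)
print(predict_nb(model, ("high", "abroad")))  # fraud
```

The trained `model` plays the role of the "classifier" the abstract describes: the parameters learned from pre-classified examples, applied later to new tuples.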
IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY (Editor IJMTER)
Data mining environments produce large amounts of data that need to be analyzed, and patterns have to be extracted from them to gain knowledge. In this new period, with an explosion of both ordered and unordered data, it has become difficult to process, manage, and analyze patterns using traditional databases and architectures. To gain knowledge about Big Data, a proper architecture should be understood. Classification is an important data mining technique with broad applications: it is used to classify the various kinds of data found in nearly every field of our life, assigning each item to one of a predefined set of classes according to its features. This paper provides an inclusive survey of classification algorithms, shedding light on algorithms including J48, C4.5, the k-nearest neighbor classifier, Naive Bayes, and SVM, using the random concept.
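Of the algorithms this survey covers, the k-nearest neighbor classifier is simple enough to sketch directly: store the training points and classify a query by majority vote among its k closest neighbors. The 2-D points below are made up for illustration; this is not code from the survey:

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(row, query), y) for row, y in zip(train_rows, train_labels)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated classes.
train_rows = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
train_labels = ["a", "a", "b", "b"]
print(knn_predict(train_rows, train_labels, (1.1, 0.9), k=3))  # a
```

Note the contrast with tree learners like C4.5/J48: k-NN does no training at all, so all its cost falls on the query side, which is exactly why it struggles on the very large data sets the survey is concerned with.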
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer (IJERA Editor)
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has his or her own definition of toughness and easiness, and there is no absolute scale for measuring knowledge, but examination scores indicate a student's performance. In this case study, data mining knowledge is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data: it allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). This project describes the use of the clustering data mining technique to improve the efficiency of academic performance in educational institutions. A live experiment was conducted on students: an exam was given to computer science majors using MOODLE (an LMS), the data generated were analysed using RapidMiner (data mining software), and clustering was then performed on the data. This method helps to identify the students who need special advising or counselling from the teacher, in order to provide a high quality of education.
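The clustering step of such a study can be sketched with a plain k-means (Lloyd's algorithm) implementation. The (marks, attendance) pairs below are hypothetical, and this sketch only stands in for what a tool like RapidMiner does internally:

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: assign points to nearest centroid, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep old if empty).
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical student records: (exam marks, attendance %). Two clear groups:
# a low-performing group that may need counselling, and a high-performing one.
scores = [(35, 40), (38, 45), (40, 42), (85, 90), (88, 95), (90, 92)]
centroids, clusters = kmeans(scores, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The cluster containing the lower centroid would correspond to the students flagged for special advising in the case study.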
Overview of basic concepts related to Data Mining: database, data model, fuzzy sets, information retrieval, data warehouse, dimensional modeling, data cubes, OLAP, machine learning.
The past two decades have seen a dramatic increase in the amount of information, or data, being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months, and the size and number of databases are increasing even faster. The increased use of electronic data-gathering devices, such as point-of-sale or remote sensing devices, has contributed to this explosion of available data. Figure 1, from the Red Brick company, illustrates the data explosion.
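Among the concepts listed above, the data cube is easy to demonstrate compactly: an OLAP roll-up aggregates a measure over every subset of the chosen dimensions. The sketch below does this over a hypothetical sales table; it illustrates the idea only, not any particular OLAP engine:

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, dims, measure):
    """Tiny data-cube roll-up: total `measure` for every subset of `dims`."""
    result = {}
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            result[group] = dict(totals)
    return result

# Hypothetical fact table for a sales data warehouse.
sales = [
    {"region": "east", "quarter": "Q1", "amount": 10},
    {"region": "east", "quarter": "Q2", "amount": 20},
    {"region": "west", "quarter": "Q1", "amount": 5},
]
cb = cube(sales, ("region", "quarter"), "amount")
print(cb[()])           # grand total: {(): 35.0}
print(cb[("region",)])  # per-region totals: {('east',): 30.0, ('west',): 5.0}
```

Each entry of `cb` corresponds to one "cuboid" of the cube; drilling down means moving from a coarser grouping key to a finer one.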
STUDENTS’ PERFORMANCE PREDICTION SYSTEM USING MULTI AGENT DATA MINING TECHNIQUE (IJDKP)
High prediction accuracy for students’ performance helps to identify low-performing students at the beginning of the learning process, and data mining is used to attain this objective. Data mining techniques discover models or patterns in data and are very helpful in decision-making. Boosting is the most popular technique for constructing ensembles of classifiers to improve classification accuracy. Adaptive Boosting (AdaBoost) is a well-known boosting algorithm; it is designed for binary classification and is not directly applicable to multiclass classification. The SAMME boosting technique extends AdaBoost to multiclass classification without reducing it to a set of binary sub-classifications. In this paper, a students’ performance prediction system using Multi Agent Data Mining is proposed to predict students’ performance from their data with high accuracy and to help low-performing students through optimization rules. The proposed system has been implemented and evaluated by investigating the prediction accuracy of the AdaBoost.M1 and LogitBoost ensemble classifier methods against the C4.5 single classifier method. The results show that the SAMME boosting technique improves prediction accuracy and outperforms both the C4.5 single classifier and LogitBoost.
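The boosting idea behind these methods can be sketched with a minimal binary AdaBoost over one-feature decision stumps: each round reweights the misclassified points so the next stump focuses on them. This is the two-class case only (the paper's point is precisely that SAMME is needed for multiclass), and the data are hypothetical:

```python
import math

def stump_predict(feature, threshold, polarity, x):
    # A decision stump: one threshold test on one feature, output +1 or -1.
    return polarity if x[feature] >= threshold else -polarity

def fit_adaboost(rows, labels, n_rounds=5):
    """Binary AdaBoost with threshold stumps; labels must be +1 / -1."""
    n = len(rows)
    w = [1.0 / n] * n  # uniform sample weights to start
    ensemble = []
    for _ in range(n_rounds):
        best, best_err = None, float("inf")
        # Exhaustively pick the stump with lowest weighted error.
        for f in range(len(rows[0])):
            for t in sorted({x[f] for x in rows}):
                for pol in (1, -1):
                    err = sum(wi for wi, x, y in zip(w, rows, labels)
                              if stump_predict(f, t, pol, x) != y)
                    if err < best_err:
                        best_err, best = err, (f, t, pol)
        err = max(best_err, 1e-10)  # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, best))
        # Reweight: misclassified points gain weight, then normalise.
        w = [wi * math.exp(-alpha * y * stump_predict(*best, x))
             for wi, x, y in zip(w, rows, labels)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict_adaboost(ensemble, x):
    score = sum(a * stump_predict(*stump, x) for a, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data: low scores fail (-1), high scores pass (+1).
rows = [(1,), (2,), (3,), (4,)]
labels = [-1, -1, 1, 1]
model = fit_adaboost(rows, labels, n_rounds=3)
print([predict_adaboost(model, x) for x in rows])  # [-1, -1, 1, 1]
```

SAMME changes the alpha formula (adding a log(K-1) term for K classes) so a weak learner only needs to beat random guessing among K classes rather than 1/2.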
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET (Editor IJMTER)
Data mining environments produce large amounts of data that need to be analyzed. Using traditional databases and architectures, it has become difficult to process, manage, and analyze patterns. To gain knowledge about Big Data, a proper architecture should be understood. Classification is an important data mining technique with broad applications: it classifies the various kinds of data used in nearly every field of our life, assigning each item to one of a predefined set of classes according to its features. This paper sheds light on various classification algorithms, including J48, C4.5, and Naive Bayes, using a large dataset.
A SURVEY ON DATA MINING IN STEEL INDUSTRIES (IJCSES Journal)
In industrial environments, huge amounts of data are generated and collected in databases and data warehouses from all areas involved, such as planning, process design, materials, assembly, production, quality, process control, scheduling, fault detection, shutdown, customer relationship management, and so on. Data mining has become a useful tool for knowledge acquisition in the industrial processes of iron and steel making. Due to its rapid growth, various industries have started using data mining technology to search for hidden patterns, which might feed new knowledge back into the system and inform new models that enhance production quality, productivity, optimum cost, maintenance, and more. The continuous improvement of all steel production processes, with regard to avoiding quality deficiencies and thereby improving production yield, is an essential task for steel producers. A zero-defect strategy is therefore popular today, and several quality assurance techniques are used to maintain it. The present report explains the methods of data mining and describes their application in the industrial environment, especially in the steel industry.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH (IJDKP)
Text mining is an emerging research field evolving from the information retrieval area. Clustering and classification are the two data mining approaches that may also be used to perform text classification and text clustering; the former is supervised while the latter is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used to perform text classification. The distance metric is validated for the worst, average, and best case situations [15]. The results show the proposed distance metric outperforms the existing measures.
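As a baseline for the kind of distance metric the paper improves on, the sketch below computes plain term-frequency cosine distance between two texts. It is a generic stand-in with made-up sentences, not the proposed metric:

```python
import math
from collections import Counter

def cosine_distance(doc_a, doc_b):
    """1 minus cosine similarity over term-frequency vectors of two texts."""
    ta = Counter(doc_a.lower().split())
    tb = Counter(doc_b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta)
    norm = math.sqrt(sum(c * c for c in ta.values())) * \
           math.sqrt(sum(c * c for c in tb.values()))
    return 1.0 - (dot / norm if norm else 0.0)

# Hypothetical documents: d1 and d2 share vocabulary, d3 does not.
d1 = "data mining finds patterns in data"
d2 = "data mining discovers patterns"
d3 = "steel production quality control"
print(cosine_distance(d1, d2) < cosine_distance(d1, d3))  # True
```

A frequent-pattern approach like the paper's would first restrict the vectors to frequent term sets, shrinking dimensionality before any such pairwise comparison.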
In a world of data explosion, where the rate of data generation and consumption keeps increasing, comes the buzzword: Big Data. Big Data is the concept of fast-moving, large-volume data arriving in varying dimensions from highly unpredictable sources.
The 4Vs of Big Data
● Volume - Scale of Data
● Velocity - Analysis of Streaming Data
● Variety - Different forms of Data
● Veracity - Uncertainty of Data
With increasing data availability, the new trend in industry demands not just data collection but making ample sense of the acquired data: the concept of Data Analytics. Taking it a step further, to make predictions and realistic inferences about the future, is the concept of Machine Learning. A blend of both gives a robust analysis of data for the past, the present, and the future. There is a thin line between data analytics and machine learning, which becomes very obvious when you dig deep.
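That thin line can be made concrete: descriptive analytics summarises what already happened, while even the simplest learned model extrapolates beyond the data. The sketch below uses hypothetical monthly sales figures:

```python
from statistics import mean

# Hypothetical monthly sales figures for half a year.
monthly_sales = [100, 110, 125, 140, 160, 180]

# Descriptive analytics: summarise the past.
print("average:", mean(monthly_sales))
print("total growth:", monthly_sales[-1] - monthly_sales[0])

# Machine-learning step: fit a least-squares line, then extrapolate
# one month into the future.
xs = list(range(len(monthly_sales)))
mx, my = mean(xs), mean(monthly_sales)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, monthly_sales)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
next_month = slope * len(monthly_sales) + intercept
print("forecast for next month:", round(next_month, 1))
```

The first two prints answer "what happened"; the last answers "what is likely next", which is the boundary the paragraph above describes.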
An Empirical Study of the Applications of Classification Techniques in Studen... (IJERA Editor)
University servers and databases store huge amounts of data, including personal details, registration details, evaluation assessments, performance profiles, and much more, for students and lecturers alike. The main problem facing any system administration, or any user, is that the data stored on the servers grows every second, in different types and formats. Learning about students from this huge amount of data, anticipating graduation and academic outcomes, and maintaining the structure and content of courses according to previous results have become important. The paper's objectives are to extract knowledge from an incomplete data structure and to determine the most suitable data mining method or technique for extracting knowledge from a huge amount of student data, helping the administration use technology to make quick decisions. Data mining aims to discover useful information or knowledge using one of its techniques; this paper uses the classification technique to discover knowledge from the students' server database, where all students' information is registered and stored. The classification task, using the C4.5 classifier tree, is applied to predict students' final academic results (grades). The data cover a four-year period [2006-2009]. Experimental results show that the classification process succeeded on the training set: the predicted instances are similar to the training set, which supports the suggested classification model. The efficiency and effectiveness of the C4.5 algorithm in predicting academic results (grades) are very good. The model can also improve the efficiency of retrieving academic results and evidently promotes retrieval precision.
Indexing based Genetic Programming Approach to Record Deduplication (idescitation)
In this paper, we present a genetic programming (GP) approach to record deduplication with indexing techniques. Data deduplication is a process in which data are cleaned of duplicate records caused by misspellings, field swaps, or any other mistake or data inconsistency. This process requires identifying objects that appear in more than one list. The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in the data warehouse, so we need an algorithm that can detect and eliminate the maximum number of duplications. GP with indexing is one optimization technique that helps to find the maximum number of duplicates in the database. We used a deduplication function that can identify whether two or more entries in a repository are replicas. Many industries and systems depend on the accuracy and reliability of databases to carry out their operations; therefore, the quality of the information stored in databases can have significant cost implications for a system that relies on that information to function and conduct business. Moreover, it is a fact that clean, replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process that data.
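A deduplication function of the kind described can be sketched with simple fuzzy string matching. The sketch below uses Python's difflib rather than an evolved GP function, and the records are illustrative only:

```python
from difflib import SequenceMatcher

def is_duplicate(rec_a, rec_b, threshold=0.85):
    """Average per-field fuzzy similarity; flags two records as likely replicas."""
    scores = [SequenceMatcher(None, a.lower(), b.lower()).ratio()
              for a, b in zip(rec_a, rec_b)]
    return sum(scores) / len(scores) >= threshold

def deduplicate(records):
    # Keep a record only if it does not match any record already kept.
    kept = []
    for rec in records:
        if not any(is_duplicate(rec, k) for k in kept):
            kept.append(rec)
    return kept

# Hypothetical repository entries: (name, address).
records = [
    ("John Smith", "12 High Street"),
    ("Jon Smith", "12 High St."),   # misspelling plus abbreviation
    ("Mary Jones", "4 Oak Avenue"),
]
print(len(deduplicate(records)))  # 2
```

In the GP approach, the hand-set threshold and equal field weighting above are exactly what evolution would tune: each evolved individual is a candidate combination of per-field similarity scores.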
How to Create Map Views in the Odoo 17 ERPCeline George
The map views are useful for providing a geographical representation of data. They allow users to visualize and analyze the data in a more intuitive manner.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
Model Attribute Check Company Auto PropertyCeline George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
The Indian economy is classified into different sectors to simplify the analysis and understanding of economic activities. For Class 10, it's essential to grasp the sectors of the Indian economy, understand their characteristics, and recognize their importance. This guide will provide detailed notes on the Sectors of the Indian Economy Class 10, using specific long-tail keywords to enhance comprehension.
For more information, visit-www.vavaclasses.com
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
How to Split Bills in the Odoo 17 POS ModuleCeline George
Bills have a main role in point of sale procedure. It will help to track sales, handling payments and giving receipts to customers. Bill splitting also has an important role in POS. For example, If some friends come together for dinner and if they want to divide the bill then it is possible by POS bill splitting. This slide will show how to split bills in odoo 17 POS.
The Art Pastor's Guide to Sabbath | Steve ThomasonSteve Thomason
What is the purpose of the Sabbath Law in the Torah. It is interesting to compare how the context of the law shifts from Exodus to Deuteronomy. Who gets to rest, and why?
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
DM_Notes.pptx
DATA_MINING_NOTES
1. Explain steps in KDD process. [5]
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets. The KDD process in
data mining typically involves the following steps:
1. Selection: Select a relevant subset of the data for analysis.
2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include
tasks such as data normalization, missing value handling, and data integration.
3. Transformation: Transform the data into a format suitable for data mining, such as a matrix or a
graph.
4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful
information and insights. This may include tasks such as clustering, classification, association
rule mining, and anomaly detection.
5. Interpretation: Interpret the results and extract knowledge from the data. This may include
tasks such as visualizing the results, evaluating the quality of the discovered patterns, and
identifying relationships and associations among the data.
6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and
meaningful.
7. Deployment: Use the discovered knowledge to solve the business problem and make decisions.
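On toy data, the whole process can be sketched in a few lines. The records, department names, and items below are hypothetical, chosen only to illustrate the flow from selection through interpretation:

```python
from collections import Counter

# Hypothetical raw transaction records, one dirty entry included
raw = [
    {"dept": "books", "items": ["pen", "notebook"]},
    {"dept": "books", "items": ["pen", "ink"]},
    {"dept": "food",  "items": ["milk"]},
    {"dept": "books", "items": None},            # dirty record
    {"dept": "books", "items": ["pen", "notebook"]},
]

# 1-2. Selection and pre-processing: keep only clean "books" transactions
clean = [r["items"] for r in raw if r["dept"] == "books" and r["items"]]

# 3-4. Transformation and mining: flatten the baskets and count item
# frequencies (a trivial stand-in for frequent-itemset mining)
counts = Counter(item for basket in clean for item in basket)

# 5-7. Interpretation / evaluation / deployment: report the top pattern
item, freq = counts.most_common(1)[0]
print(item, freq)
```

Real KDD pipelines replace each of these steps with far richer tooling, but the shape of the process is the same.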
2. What is text mining? [2]
o Definition: Text mining is the process of extracting meaningful information from text data.
o Process: It involves using natural language processing (NLP) techniques and machine
learning algorithms to analyze large volumes of unstructured text data and identify
patterns, trends, and insights that would be difficult to uncover manually.
o Application: It can be applied in various fields such as sentiment analysis, topic modeling,
and text classification.
o Goal: The goal of text mining is to extract valuable information from text data and use it to
make data-driven decisions or predictions.
3. What do you mean by Clustering? [2]
o Clustering is an unsupervised machine learning technique that groups data points into
clusters so that objects in the same cluster are more similar to one another than to objects
in other clusters.
o Clustering is a method of partitioning a set of data or objects into a set of significant
subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set, and is used either
as a stand-alone tool to get better insight into the data distribution or as a pre-processing
step for other algorithms.
4. Linear Regression.
o It is the simplest form of regression. Linear regression attempts to model the relationship
between two variables by fitting a linear equation to the observed data.
o If the fitted relationship is a straight line, the model is linear; if it is a curved line, the
model is non-linear.
o The relationship is described by a straight line with a single independent variable:
Y = α + βX
The model expresses Y as a linear function of X: as X changes, Y increases or decreases in a
linear manner.
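The coefficients α and β can be estimated directly from data by least squares. The sketch below uses made-up points that lie exactly on Y = 1 + 2X, so the fit recovers those values:

```python
# Least-squares estimate of Y = alpha + beta * X (pure-Python sketch)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]        # generated from Y = 1 + 2X

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# beta = covariance(X, Y) / variance(X); alpha = mean_y - beta * mean_x
beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
alpha = mean_y - beta * mean_x
print(alpha, beta)
```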
4. Difference between Data Mining and Text Mining. [3/5]
o Data mining is a process to extract useful information from huge datasets; text mining is a
part of data mining that processes text from large document collections.
o In data mining, the stored data is in a structured format; in text mining, the data is in an
unstructured format.
o Data mining allows the mining of mixed data; text mining allows mining of text only.
o In data mining, data processing is done directly; in text mining, processing is done
linguistically.
o Data mining is a homogeneous process; text mining is a heterogeneous process.
o Data mining collects information from pre-defined databases and sheets; text mining
gathers high-quality information from raw text.
o Data mining uses statistical methods for evaluation; text mining uses computational
linguistic principles to evaluate the text.
5. Difference between DM and OLAP. [3/5]
o Data mining refers to the field of computer science that deals with extracting data, trends,
and patterns from huge data sets; OLAP is a technology for immediate access to data
through multidimensional structures.
o OLAP deals with summaries of the data; data mining can work down to detailed,
record-level data.
o Data mining is discovery-driven; OLAP is query-driven.
o Data mining is used for predicting future trends; OLAP is used for analyzing past data.
o Data mining can handle a huge number of dimensions; OLAP supports a limited number of
dimensions.
o Data mining follows a bottom-up approach; OLAP follows a top-down approach.
o Data mining is an emerging field; OLAP is already widely used.
6. Difference between Descriptive and predictive data mining. [3/5]
o Descriptive mining is usually used to provide correlation, cross-tabulation, frequency, etc.;
predictive mining is the analysis done to predict a future event, value, or trend.
o Descriptive mining is based on a reactive approach; predictive mining is based on a
proactive approach.
o Descriptive mining specifies the characteristics of the data in a target data set; predictive
mining performs induction over current and past data so that predictions can be made.
o Descriptive mining needs data aggregation and data mining; predictive mining needs
statistics and data forecasting procedures.
o Descriptive mining provides precise summaries of the data; predictive mining produces
outcomes whose accuracy is not guaranteed.
7. Difference between Classification and Clustering [3/5]
o Classification is a supervised learning approach in which known labels are used to train
and test a model that classifies new observations; clustering is an unsupervised learning
approach in which grouping is done on the basis of similarity.
o Classification uses a labeled training dataset; clustering does not use a training dataset.
o Classification uses algorithms to categorize new data according to observations in the
training set; clustering uses statistical concepts to divide the data set into subsets with
similar features.
o In classification, there are labels for the training data; in clustering, there are no labels.
o The objective of classification is to find which of a set of predefined classes a new object
belongs to; the objective of clustering is to group a set of objects and discover whether
there are relationships between them.
o Classification is more complex than clustering.
8. Difference between Supervised and Unsupervised Learning [3/5]
o Supervised learning algorithms are trained using labeled data; unsupervised learning
algorithms are trained using unlabeled data.
o A supervised learning model predicts an output; an unsupervised learning model finds
hidden patterns in the data.
o In supervised learning, input data is provided to the model along with the corresponding
output; in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output for
new data; the goal of unsupervised learning is to find hidden patterns and useful insights
in an unlabeled dataset.
o Supervised learning can be categorized into classification and regression problems;
unsupervised learning can be categorized into clustering and association problems.
o A supervised learning model generally produces more accurate results; an unsupervised
model may give less accurate results, since there is no labeled ground truth to learn from.
o Supervised algorithms include Linear Regression, Logistic Regression, Support Vector
Machine, Decision Tree, Naive Bayes, and KNN; unsupervised algorithms include K-Means,
Hierarchical Clustering, and the Apriori algorithm.
9. Difference between OLAP and OLTP [3/5]
o Definition: OLAP (online analytical processing) is an online database query and analysis
system; OLTP (online transaction processing) is an online database modifying system.
o Data source: OLAP consists of historical data consolidated from various databases; OLTP
consists only of current operational data.
o Method used: OLAP makes use of a data warehouse; OLTP makes use of a standard
database management system (DBMS).
o Application: OLAP is subject-oriented, used for data mining, analytics, decision making,
etc.; OLTP is application-oriented, used for day-to-day business tasks.
o Normalization: In an OLAP database, tables are not normalized; in an OLTP database,
tables are normalized (3NF).
o Usage of data: OLAP data is used in planning, problem solving, and decision making; OLTP
data is used to perform day-to-day fundamental operations.
o Purpose: OLAP serves to extract information for analysis and decision making; OLTP serves
to insert, update, and delete information in the database.
o Volume of data: OLAP stores large amounts of data, typically TBs or PBs; OLTP data is
relatively small (MBs or GBs), as historical data is archived.
o Queries: OLAP queries are relatively slow because of the large amount of data involved
and may take hours; OLTP queries are very fast because they operate on only a small
fraction (around 5%) of the data.
10. Difference between Data Mining and Data Warehousing. [3/5]
o Data mining is the process of discovering patterns in data; a data warehouse is a database
system designed for analytics.
o Data mining extracts useful information from a large set of data; data warehousing is the
process of combining all the relevant data into one repository.
o Business users carry out data mining with the help of engineers; data warehousing is
carried out entirely by engineers.
o In data mining, data is analyzed repeatedly; in data warehousing, data is loaded and stored
periodically.
o Data mining uses pattern-recognition techniques to identify patterns; data warehousing
extracts and stores data in a way that makes reporting easier.
o A valuable data mining capability is the detection and identification of unwanted errors in
the system; an advantage of the data warehouse is its ability to be updated frequently,
which makes it ideal for businesses that need up-to-date data.
o Data mining techniques are cost-efficient compared with other statistical data
applications; the responsibility of the data warehouse is to simplify every type of business
data.
o Data mining techniques are not 100 percent accurate and may lead to serious
consequences in certain conditions; with a data warehouse, there is a possibility that data
required for analysis is never integrated into the warehouse, which can lead to loss of
information.
o Companies benefit from data mining as an analytical tool that provides accessible,
knowledge-based insight; a data warehouse stores a huge amount of historical data that
helps users analyze different periods and trends to make future predictions.
11. K-Means vs KNN [3/5]
o Algorithm: K-Means is an unsupervised learning algorithm; KNN is a supervised learning
algorithm.
o Process: K-Means groups data points into k clusters based on their similarity; KNN
classifies a data point based on the majority class of its k nearest neighbors.
o Parameter k: K-Means requires the number of clusters (k) to be specified in advance; KNN
requires the number of nearest neighbors (k) to be specified in advance.
o Method: K-Means clusters using the mean of the data points in each cluster; KNN classifies
using a majority vote of the k nearest neighbors.
o Suitability: K-Means is suitable for continuous variables; KNN is suitable for both
continuous and categorical variables.
o Scalability: K-Means is generally faster and more scalable than KNN, especially for large
datasets; KNN is slower at prediction time because each query must be compared against
the stored training data.
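As a concrete illustration of the contrast, here is a minimal K-Means sketch on 1-D toy data with hand-picked initial centers. It is an illustrative toy, not a production implementation:

```python
# Minimal K-Means sketch: alternate assignment and update steps.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            groups[i].append(p)
        # Update step: move each center to the mean of its group
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

pts = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
print(kmeans(pts, [0.0, 5.0]))
```

With these two well-separated groups, the centers converge to the group means after a single pass.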
12. What do you mean by an outlier? [2]
An outlier in data mining is an observation that is significantly different from the other observations
in a dataset.
o Outliers can have a major impact on the results of data mining and statistical analysis, and
are often considered to be undesirable because they can skew the results and lead to
inaccurate conclusions.
o Outliers can be identified by a number of methods, including statistical tests, visualization
techniques, and machine learning algorithms.
o Once identified, outliers can be handled in a number of ways, such as removing them from
the dataset, treating them as special cases, or including them in the analysis but with
appropriate caution.
It is important to note that the definition of an outlier is context dependent; in some cases an
outlier carries valuable information. In fraud detection, for example, identifying an outlier can
be the key to finding a fraudulent transaction.
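One simple statistical test is to flag values far from the mean. The sketch below uses toy data and a two-standard-deviation threshold, which is a common rule of thumb rather than the only choice:

```python
# Flag values more than 2 standard deviations from the mean (toy data)
data = [10, 12, 11, 13, 12, 11, 10, 95]

n = len(data)
mean = sum(data) / n
std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5  # population std dev

outliers = [x for x in data if abs(x - mean) > 2 * std]
print(outliers)
```

More robust methods (e.g. based on the median and interquartile range) are preferred when outliers themselves distort the mean and standard deviation.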
13.What is Knowledge Discovery in Databases? [2]
Knowledge Discovery in Databases (KDD) is the iterative process of extracting useful and valuable
information from large and complex sets of data.
o The goal of KDD is to identify patterns, trends, and insights hidden within the data that can
be used to make better decisions and improve business processes etc.
o The KDD process typically involves several steps, including data cleaning and preprocessing,
data mining, pattern evaluation, and knowledge representation.
o This process can be used in a variety of applications, including business intelligence, fraud
detection, and customer relationship management.
14. Hierarchical Clustering in Data Mining: [4]
Definition: A Hierarchical clustering method works via grouping data into a tree of clusters.
Hierarchical clustering begins by treating every data point as a separate cluster. It then
repeatedly executes two steps:
o Identify the two clusters that are closest together, and
o Merge these two most similar clusters.
These steps continue until all the clusters are merged together.
Steps:
1. Compute the pairwise similarity or distance between all data points.
2. Start with each data point as a separate cluster.
3. Merge the two closest clusters into a new larger cluster.
4. Repeat step 3 until all data points belong to a single cluster or some stopping criteria is
met.
Representation: The hierarchy of clusters can be represented using a tree-based structure
called a dendrogram.
Advantages:
o It can handle non-linearly separable data.
o It can handle different shapes and sizes of clusters.
o It allows for incremental and dynamic updates of the clustering results.
o It can be used to visualize the relationships between clusters.
Disadvantages:
o It is sensitive to the choice of the similarity or distance metric.
o It is sensitive to the choice of linkage method used to merge clusters.
o It can be computationally expensive for large datasets.
o It can be hard to interpret the results for higher dimensions.
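The agglomerative steps above can be sketched in plain Python using single linkage, where the distance between two clusters is the distance between their closest members. This is a cubic-time illustration, not an efficient implementation:

```python
# Minimal single-linkage agglomerative clustering (illustrative sketch)
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative(points, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    while len(clusters) > k:
        best = (float("inf"), 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge the two closest clusters
        del clusters[j]
    return clusters

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9)]
print(agglomerative(pts, 2))
```

Recording the order and distance of each merge would yield exactly the dendrogram described above.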
15. Associative Classification in Data Mining. [2]
Definition: A data mining technique that integrates association rule mining with
classification: it mines association rules whose consequent is a class label and uses those
rules to predict the class of new records.
Advantages:
o It can handle noisy and incomplete data.
o It can discover important features and relationships between features and class labels.
Disadvantages:
o It is only applicable for binary or nominal class labels.
o It can be computationally expensive for large datasets.
16. Explain the following terms in the context of association rule mining:
(i) Support of an itemset.
(ii) Frequent closed itemset.
(iii) Lift of a rule. [3X2]
i. Support of an Itemset:
Definition: The proportion of transactions in a transaction database that contain a particular
itemset.
Calculation: The support of an itemset X can be calculated as the number of transactions
containing X divided by the total number of transactions in the database.
Significance: Support is a measure of the popularity of an itemset and is used as a threshold to
determine which itemsets are considered frequent.
Advantages:
o It provides a simple and intuitive measure of the popularity of an itemset.
o It can be easily calculated from transaction data.
ii. Frequent Closed Itemset:
Definition: A frequent itemset is closed if there is no superset of the itemset that has the same
support.
Significance: A frequent closed itemset is considered a more meaningful result than a frequent
itemset as it captures the complete information of the itemset and its subsets.
Advantages:
o It can avoid generating redundant and less meaningful results.
o It can capture the complete information of the itemset and its subsets.
iii. Lift of a Rule:
Definition: A measure of the degree of association between two items in a rule, compared to
their individual frequencies in the transaction database.
Calculation: The lift of a rule X -> Y is calculated as the ratio of the support of X U Y divided by
the support of X times the support of Y.
Significance: Lift is a measure of the strength of the association between two items in a rule, and
is used to rank and select the most interesting rules.
Advantages:
o It provides a measure of the strength of the association between two items in a rule.
o It can adjust for the overall popularity of the items in the transaction database.
17. Why data preprocessing is required? [2]
Real-world data generally contains noise and missing values, and may be in an unusable format
that cannot be used directly by machine learning models. Data preprocessing is the set of tasks
that clean the data and make it suitable for a machine learning model, which also increases the
accuracy and efficiency of the model.
It involves below steps:
o Getting the dataset
o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling
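Two of the steps above, finding missing data and feature scaling, can be sketched on toy data. Here None marks a missing value, and the numbers are made up for illustration:

```python
# Tiny preprocessing sketch: mean imputation, then min-max scaling to [0, 1]
data = [[2.0, None], [4.0, 10.0], [None, 30.0], [6.0, 20.0]]

# Compute each column's mean over its non-missing values
cols = list(zip(*data))
means = [sum(v for v in col if v is not None) / sum(v is not None for v in col)
         for col in cols]

# Finding missing data -> impute with the column mean
filled = [[v if v is not None else means[j] for j, v in enumerate(row)]
          for row in data]

# Feature scaling (min-max normalization)
mins = [min(col) for col in zip(*filled)]
maxs = [max(col) for col in zip(*filled)]
scaled = [[(v - mins[j]) / (maxs[j] - mins[j]) for j, v in enumerate(row)]
          for row in filled]
print(scaled)
```

Libraries such as pandas and scikit-learn provide these operations ready-made; the point here is only to show what the steps do.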
18. Explain Market Basket Analysis with suitable example.
Suppose 5,000 transactions have been made through a popular e-commerce website, and we
want to calculate the support, confidence, and lift for two products, say pen and notebook. Out
of the 5,000 transactions, 500 contain a pen only, 700 contain a notebook only, and 1,000
contain both. So 1,500 transactions contain a pen and 1,700 contain a notebook. (Note that the
transactions containing both items must be a subset of those containing either item; otherwise
the counts would be inconsistent and confidence could exceed 100%, which is impossible.)
Support:
Support for pen = (number of transactions containing pen) / (total number of transactions) =
1500 / 5000 = 0.3 or 30%
Support for notebook = (number of transactions containing notebook) / (total number of
transactions) = 1700 / 5000 = 0.34 or 34%
Support for pen and notebook = (number of transactions containing both pen and notebook) /
(total number of transactions) = 1000 / 5000 = 0.2 or 20%
Confidence:
Confidence of the rule "If a customer buys a pen, they will also buy a notebook" = (number of
transactions containing both pen and notebook) / (number of transactions containing pen) =
1000 / 1500 ≈ 0.667 or 66.7%
Confidence of the rule "If a customer buys a notebook, they will also buy a pen" = (number of
transactions containing both pen and notebook) / (number of transactions containing
notebook) = 1000 / 1700 ≈ 0.588 or 58.8%
Lift:
Lift of the rule "If a customer buys a pen, they will also buy a notebook" = (confidence of the
rule) / (support of notebook) = 0.667 / 0.34 ≈ 1.96
Lift of the rule "If a customer buys a notebook, they will also buy a pen" = (confidence of the
rule) / (support of pen) = 0.588 / 0.3 ≈ 1.96
Note that the lift is the same for both rules; lift is symmetric and does not depend on the order
of the antecedent and the consequent.
A lift value of 1 indicates that there is no association between the antecedent and consequent,
and values greater than 1 indicate a positive association. Here the lift is about 1.96, a positive
association between buying a pen and buying a notebook; the larger the lift value, the stronger
the association.
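The arithmetic can be checked in a few lines. The 500 and 700 counts are read here as pen-only and notebook-only purchases, since the 1,000 "both" transactions must be a subset of the transactions containing each item:

```python
# Market-basket measures from the example counts (5,000 transactions,
# 500 pen-only, 700 notebook-only, 1,000 containing both)
total = 5000
pen_only, notebook_only, both = 500, 700, 1000
pen = pen_only + both            # 1,500 transactions contain a pen
notebook = notebook_only + both  # 1,700 transactions contain a notebook

support_pen = pen / total                  # 0.30
support_notebook = notebook / total        # 0.34
support_both = both / total                # 0.20

conf_pen_to_nb = both / pen                # ~0.667
conf_nb_to_pen = both / notebook           # ~0.588

# lift(X -> Y) = support(X and Y) / (support(X) * support(Y));
# symmetric, so both rule directions give the same value
lift = support_both / (support_pen * support_notebook)
print(round(lift, 2))
```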
19. Support Vector Machine
A support vector machine (SVM) is a supervised machine learning algorithm that performs
classification or regression by finding the hyperplane that separates the classes with the
maximum margin.
In AI and machine learning, a supervised learning system is provided with both the input data
and the desired output, which are labeled for classification.