As the name suggests, this block deals with preparing the data for mining by converting it to the proper format. Data preparation is the most important part of the data mining process, because data can be analyzed accurately only if it is structured in a proper format. This is done using the three tools provided for this purpose:
Explore Data: this tool helps us create a histogram for any column in the table.
Clean Data: using this tool, we can specify the maximum and minimum values that we accept in a particular column.
The Explore Data tool takes a given column from the table and plots a histogram. The histogram gives us insight into the distribution of the data and how often each set of values occurs, enabling us to see which discrete value or group of values dominates our data set.
How to use:
Choose a column and produce its histogram.
In the next slide, we have used the tool to explore the Income column of the data set. We can see that most of the customers have an income in the range 30000 to 50000 and very few have an income in the range 150000-170000, so we can market our product accordingly.
If required, we can add this data as a column in our table.
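The idea behind the histogram can be sketched in a few lines of Python. This is only an illustration of fixed-width binning, not how the Explore Data tool is implemented; the function name `histogram`, the bin width of 20000, and the income values are all assumptions made up for the example.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Each bin is identified by its lower bound, so with bin_width=20000
    the value 45000 is counted in the bin starting at 40000.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical income values, invented for illustration only.
incomes = [32000, 41000, 45000, 38000, 72000, 160000, 49000, 35000]

for lower, count in histogram(incomes, 20000).items():
    # Print one text bar per bin; longer bars mean more customers.
    print(f"{lower}-{lower + 20000 - 1}: {'#' * count}")
```

The bins with the longest bars are the value ranges that dominate the column, which is the same reading we take from the plotted histogram.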
The Clean Data tool helps to identify outlying or rare values, that is, values that lie above or below given limits in the table. Such values may be exceptions that make the table data inconsistent. After detecting outliers, we may choose to change their values to the average or to null.
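The outlier handling described here can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool's actual behavior: the function name `clean_outliers` and its parameters are hypothetical, and "average" is assumed to mean the mean of the values that lie inside the limits.

```python
def clean_outliers(values, minimum, maximum, replace_with="mean"):
    """Replace values outside [minimum, maximum] with the mean or with None.

    Assumption: the mean is computed over the in-range values only,
    so the outliers themselves do not distort the replacement value.
    """
    in_range = [v for v in values if minimum <= v <= maximum]
    if replace_with == "mean" and in_range:
        replacement = sum(in_range) / len(in_range)
    else:
        replacement = None  # stands in for a null cell
    return [v if minimum <= v <= maximum else replacement for v in values]


# Hypothetical column values, invented for illustration only.
ages = [30, 40, 50, 900]
print(clean_outliers(ages, 0, 100))          # outlier replaced by the mean
print(clean_outliers(ages, 0, 100, "null"))  # outlier replaced by None
```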
The actual work of data mining is done on the prepared data using these tools. Internally, the tools mine the data using powerful mining algorithms provided by SQL Server Analysis Services.
Sr. no.   Tool name   Mining algorithm used
1.        Classify    Microsoft Decision Trees
2.        Estimate    Microsoft Decision Trees
3.        Cluster     Microsoft Clustering
4.        Associate   Microsoft Association Rules
5.        Forecast    Microsoft Time Series
In this part, we can find tools that can be used to test and validate our mining models. It is important to know how well the mining models we develop work with real-world data; by checking their accuracy, we can validate them.
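As a rough illustration of what checking accuracy means, the Python sketch below compares a model's predictions with the known values and reports the fraction that match. The function name `accuracy` and the sample labels are assumptions for the example; the validation tools themselves may report accuracy in other ways.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the actual values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Hypothetical test data, invented for illustration only.
actual = ["yes", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no"]
print(accuracy(actual, predicted))
```

Measuring accuracy on rows that were not used to build the model gives a fairer picture of how it will behave on real-world data.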