Enhancing the Classification Accuracy of Noisy Dataset by Fusing Correlation Based Feature Selection With K-Nearest Neighbour

The performance of data mining and machine learning tasks can be significantly degraded by the presence of noisy, irrelevant and high-dimensional data containing a large number of features. A large amount of real-world data contains noise or missing values. While collecting data, many irrelevant features may be gathered into the storage repositories. These redundant and irrelevant feature values distort the classification principle, increase the computational overhead and decrease the prediction ability of the classifier. The high dimensionality of such datasets poses a major bottleneck in the fields of data mining, statistics and machine learning. Among the several methods of dimensionality reduction, attribute or feature selection is often used. Since the k-NN algorithm is sensitive to irrelevant attributes, its performance degrades significantly when a dataset contains missing values or noisy data. This weakness of the k-NN algorithm can, however, be minimized by combining it with a feature selection technique. In this research we combine Correlation based Feature Selection (CFS) with the k-Nearest Neighbour (k-NN) classification algorithm to obtain better classification results when the dataset contains missing values or noisy data. The reduced attribute set also decreases the time required for classification. The research shows that when dimensionality reduction is done using CFS and classification with the k-NN algorithm, datasets with no or very little noise may show a negative impact on classification accuracy compared with the k-NN algorithm alone. When additional noise is introduced to these datasets, the performance of k-NN degrades significantly; when these noisy datasets are classified using CFS and k-NN together, the classification accuracy improves.


INTRODUCTION
Data mining is the process of extracting knowledge from enormous amounts of data. Classification is an important data analysis technique among the major components of data mining, in which data models that describe important data classes are extracted. These models are called classifiers, and they predict categorical class labels 1.
Most real-world data sources have to deal with the unavoidable problem of incomplete data 2. To improve data quality, the data may first be preprocessed, and the refined data may then be used for further data mining. There are several data preprocessing techniques 3. Data cleaning is the process of removing noise and correcting inconsistencies in the data. Dimensionality reduction is a technique in which a reduced or compressed dataset is obtained by reducing the attribute set; the resulting dataset is a representation of the original one. Data compression techniques include Wavelet Transforms and Principal Component Analysis; attribute subset selection, in which irrelevant attributes are removed; and attribute construction, where a new attribute is constructed from two or more attributes and is usually more useful than the original attributes.
Analysis of high-dimensional data for knowledge discovery does not always require all the attributes to understand the underlying knowledge of interest. The analysis of high-dimensional datasets thus drives the need for new theoretical developments 4. Though predictive models with high accuracy can be constructed from high-dimensional data using computationally expensive methods 5, reducing the dimension of the original data is a concern in many applications. There are several methods of handling missing data 6, from which an appropriate method may be chosen depending on the circumstances of each case.
The process of identifying and removing irrelevant and redundant information is known as attribute subset selection. Here a minimal subset of the attributes in the original dataset is selected in such a way that the probability distribution of the resulting classes is as close as possible to the original distribution. Mining a dataset with reduced attributes has additional advantages. First, mining on a reduced attribute set requires less computation time than on the original attribute set. Second, it makes the discovered patterns easier to understand. Finding the optimal subset of attributes by exhaustive search is prohibitively expensive, especially as the number of attributes and the number of data classes increase. Therefore, heuristic methods are commonly used for attribute subset selection. These methods are typically greedy: while searching through the attribute space, they always make what looks like the best choice at the time.
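As an illustration of such a greedy search, the following minimal Python sketch (not the Weka implementation used later in the paper; the toy relevance weights and size penalty are assumptions) adds, at each step, the attribute that most improves a subset evaluation score:

```python
def greedy_forward_select(features, evaluate):
    """Greedy forward search over attribute subsets: at each step add the
    attribute that most improves the subset's evaluation score."""
    selected, best_score = [], float("-inf")
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best_score:
                best_score, best_f, improved = score, f, True
        if improved:
            selected.append(best_f)
    return selected

# Toy evaluator: subset score is the sum of per-attribute "relevance" weights,
# penalized for subset size, so a near-irrelevant attribute is never added.
weights = {"a": 0.9, "b": 0.5, "c": 0.05}
score = lambda s: sum(weights[f] for f in s) - 0.1 * len(s)
print(greedy_forward_select(list(weights), score))  # -> ['a', 'b']
```

The search stops as soon as no remaining attribute improves the score, which is why it is fast but not guaranteed to find the globally optimal subset.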
Memory-based learning is a type of learning algorithm that matches new test instances with training instances stored in memory, instead of performing explicit generalization. Since this type of learning constructs hypotheses directly from the training instances, it is also called instance-based learning. The advantage of memory-based learning over other machine learning methods is its capability to adapt its model to previously unseen data. A learning method is termed "unstable" if small changes in the training-test set split can result in large changes in the resulting classifier 7. The disadvantage of instance-based classifiers is their large computational time for classification; a reduced attribute set may therefore significantly improve the classification time of an instance-based learner. Hence, determining which input features should be used for modelling becomes a key issue, because it can improve the classification accuracy and reduce the classification time 8.

Fig. 1: CFS and k-NN for class label prediction

The k-Nearest Neighbour algorithm is a simple example of an instance-based learning algorithm. When trying to solve new problems, people often look at solutions to similar problems that they have previously solved 9. The same principle is used in the k-nearest neighbour classification technique. It determines in which class an instance is to be placed by examining the 'k' most similar cases or neighbours. It counts the number of neighbours belonging to each class and assigns the new instance to the class to which most of its neighbours belong. The sensitivity of k-NN to irrelevant attributes degrades its classification accuracy significantly when a dataset contains missing values or noisy data 10.
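The neighbour-counting principle described above can be sketched in a few lines of Python (an illustrative toy, not the Weka implementation used in the experiments; the two-class sample points are assumptions):

```python
from collections import Counter
import math

def knn_predict(train, labels, query, k=1):
    """Classify `query` by majority vote among its k nearest training instances."""
    # Euclidean distance from the query to every stored training instance
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    # Majority vote over the k closest neighbours
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated classes
train = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "b", "b"]
print(knn_predict(train, labels, (1.1, 0.9), k=3))  # -> a
```

Because every distance to every stored instance must be computed at prediction time, removing irrelevant attributes directly reduces both the classification time and the distortion those attributes cause in the distance measure.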
Though there are several different techniques for attribute selection and classification, few are used together to improve classification accuracy. Correlation-based Feature Selection 11 for machine learning, originally proposed by Mark A. Hall, is one such feature selection technique.
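Hall's CFS scores a subset of k features by its "merit", which rewards high average feature-class correlation and penalizes redundancy among the features themselves. A small sketch of the merit computation (the correlation values fed in below are assumed, illustrative inputs):

```python
import math

def cfs_merit(k, avg_feat_class_corr, avg_feat_feat_corr):
    """Hall's CFS merit of a subset of k features: high feature-class
    correlation in the numerator, feature-feature redundancy in the
    denominator."""
    return (k * avg_feat_class_corr) / math.sqrt(
        k + k * (k - 1) * avg_feat_feat_corr
    )

# Two subsets with equally relevant features: the less redundant one wins.
print(cfs_merit(3, 0.6, 0.1))  # low inter-feature correlation -> higher merit
print(cfs_merit(3, 0.6, 0.8))  # highly redundant features -> lower merit
```

The search over subsets (greedy stepwise in this paper) then simply maximizes this merit.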

Methodology/ Experiments
Classification accuracy is defined as the percentage of test tuples correctly classified by the algorithm; the error rate of an algorithm is one minus its accuracy. Measuring accuracy on a test set of tuples is better than using the training set, because tuples in the test set have not been used to induce the concept descriptions. Using the training set to measure accuracy typically provides an optimistically biased estimate, especially if the learning algorithm overfits the training data.
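These two quantities can be computed directly; a minimal sketch (the toy predictions and ground-truth labels are assumptions):

```python
def accuracy(predictions, truth):
    """Fraction of test tuples correctly classified;
    the error rate is its complement."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)

preds = ["a", "b", "b", "a"]
truth = ["a", "b", "a", "a"]
acc = accuracy(preds, truth)
print(acc, 1 - acc)  # -> 0.75 0.25
```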
Datasets for analysis may contain hundreds of attributes, many of which may be irrelevant for the mining task. Mining useful information 12 from a huge dataset is a complex task. The UCI repository 13 for machine learning provides a large collection of datasets. To show that the k-NN algorithm is sensitive to noise, an additional amount of missing values is introduced into each dataset used for the experiment. The resulting noisy dataset is then classified again with the k-NN algorithm. The noisy datasets are then preprocessed using the CFS algorithm to obtain the reduced attribute set and classified once more with the k-NN algorithm, in order to show the percentage improvement for each dataset compared with the improvement in the percentage of correctly classified instances in the initial datasets.
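The noise-injection step might be sketched as follows. This is an illustrative stand-in, not the procedure the paper necessarily used: the fraction, the "?" missing-value marker and the fixed seed are all assumptions:

```python
import random

def add_missing_values(rows, fraction, missing="?", seed=42):
    """Randomly replace a fraction of attribute values with a missing-value
    marker, mimicking the injection of extra noise into a dataset."""
    rng = random.Random(seed)
    noisy = [list(r) for r in rows]
    cells = [(i, j) for i in range(len(noisy)) for j in range(len(noisy[0]))]
    for i, j in rng.sample(cells, int(fraction * len(cells))):
        noisy[i][j] = missing
    return noisy

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
noisy = add_missing_values(data, 0.25)
print(sum(v == "?" for row in noisy for v in row))  # 3 of the 12 cells marked
```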
Weka 14 is a popular open-source data mining tool implemented in Java.

The process uses both CFS and k-NN together to predict the class labels (see Fig. 1). It can be summarized as follows:
Step 1: Select a dataset with no or minimal missing values.
Step 2: Find the accuracy of k-NN classifier for the given dataset.
Step 3: Select the attributes using CFS algorithm.
Step 4: Remove the remaining attributes from all the instances.
Step 5: Classify the dataset with reduced set of attributes using k-NN classifier.
Step 6: Record the result of the k-NN classifier with the reduced set of attributes.
Step 7: Add a small amount of noise to the dataset.
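Steps 1 through 7 can be sketched end to end in Python. This is an illustrative stand-in, not the Weka code used in the paper: the attribute selector below keeps attributes by simple feature-class Pearson correlation against a threshold (a simplification of CFS), and the toy dataset is an assumption:

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def select_attributes(X, y, threshold=0.5):
    """Step 3 stand-in: keep attribute indices strongly correlated with the class."""
    cols = list(zip(*X))
    return [j for j, col in enumerate(cols) if abs(pearson(col, y)) > threshold]

def loo_accuracy_1nn(X, y):
    """Steps 2/5: leave-one-out accuracy of a 1-NN classifier."""
    correct = 0
    for i, q in enumerate(X):
        _, label = min((math.dist(x, q), y[j]) for j, x in enumerate(X) if j != i)
        correct += label == y[i]
    return correct / len(X)

# Toy dataset: attribute 0 predicts the class, attribute 1 is random noise.
X = [(0.1, 3.0), (0.2, 0.5), (0.9, 2.9), (1.0, 0.4), (0.15, 1.0), (0.95, 1.1)]
y = [0, 0, 1, 1, 0, 1]

keep = select_attributes(X, y)                      # Step 3: relevant attributes
X_red = [tuple(row[j] for j in keep) for row in X]  # Step 4: drop the rest
print(keep, loo_accuracy_1nn(X_red, y))             # Steps 5-6: re-classify, record
```

On this toy data only the predictive attribute survives selection, and 1-NN on the reduced attribute set classifies every instance correctly.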

RESULTS AND DISCUSSIONS
The performance of the proposed approach has been tested on 10 different datasets, each with no or a minimal level of missing values. Each dataset and the percentage of missing values in its instances are summarized in Table 1. Feature selection can reduce the number of training cases, because fewer features equate to fewer distinct instances (especially when features are nominal). The speed of the algorithm can be increased significantly if the number of training cases needed is reduced while maintaining an acceptable error rate. For CFS-based attribute selection, the Greedy Stepwise Search 15 algorithm with backward search is used, and for k-NN the value of k is set to 1. The results obtained are summarized in Table 1.
The graph in Figure 2 shows that the improvement in accuracy is very small or even negative when k-NN alone is compared with CFS and k-NN together on the datasets with no or minimal noise. The best improvement in accuracy is 6.49% in the "splice" dataset, whereas in the worst case the improvement is negative, that is, -4.71% for the "molecular-biology-promoters" dataset. This is also evident from Table 1.
When additional noise is introduced to each dataset, the performance of the k-NN algorithm degrades significantly in all cases. When these noisy datasets are classified using CFS and k-NN together, the classification accuracy improves in every case. The comparison of the percentage of correctly classified instances of each dataset, classified with k-NN alone and with CFS and k-NN together, is presented in Table 2 and Figure 3. Figure 4, composed from Table 3, shows that there is a significant reduction in the performance of k-NN when additional noise is introduced to the initial dataset.
Comparing the improvement in classification percentage between the initial and the noisy datasets, every dataset shows a better improvement in the percentage of correctly classified instances on its noisy version. The results for each dataset are compared in Table 4 and Figure 5.

CONCLUSION
The main objective of classification algorithms is to predict more precise, accurate and certain class labels. Various methods have been suggested for the construction of ensembles of classifiers. If we are concerned only with the best possible classification accuracy, it might be difficult or almost impossible to find a single classifier that performs as well as a good ensemble of classifiers. Further, the presence of noise or missing values degrades the performance of classifiers. However, when a classification algorithm is combined with an appropriate feature selection tool, it can improve the classification accuracy significantly on noisy datasets. In this research we combined the Correlation based Feature Selection technique and the k-Nearest Neighbour algorithm to improve classification accuracy when a dataset contains missing values. The best case in this research is the "splice" dataset, where classification accuracy improved by 6.49% when CFS was applied with k-NN; in the worst case, the "molecular-biology-promoters" dataset, the improvement in accuracy was negative, -4.71%, before additional noise was added. When additional noise was introduced to all these datasets and they were classified using CFS and k-NN together, the improvement in accuracy was positive in all cases. Even the previous worst case, the "molecular-biology-promoters" dataset, which was initially without missing values, showed an improvement in accuracy from -4.71% to 8.5% once additional noise was introduced. The weakness of the k-NN algorithm, its sensitivity to missing values, is demonstrated in this research (refer to Table 3 and Figure 4). In each of the noisy datasets the classification accuracy improved when CFS and the k-NN algorithm were used together. The objective of utilizing the strengths of one method (CFS) to complement the weaknesses of another (k-NN) is thus achieved in this research.

Fig. 2: Comparison of percentage of correctly classified instances with k-NN algorithm and CFS+k-NN Algorithm and corresponding improvement/ degradation in classification accuracy percentage

Fig. 3: Comparison of the percentage of correctly classified instances with k-NN alone and with CFS and k-NN together

Fig. 4: Degradation in performance of k-NN when additional noise is introduced to the initial dataset

Fig. 5: Comparison of improvement in accuracy in initial datasets with datasets containing additional noise

The relevant classes of the Weka source code are used in Java for the experiment. The process uses 10-fold cross-validation for training and predicting the class label of each dataset used in the experiment.
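A 10-fold split of instance indices, as used in the cross-validation above, might be sketched as follows (the shuffle seed is an assumption; Weka's own stratified folding is more elaborate):

```python
import random

def ten_fold_indices(n, seed=1):
    """Split n instance indices into 10 folds for cross-validation: each fold
    serves once as the test set while the other nine form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(50)
print(len(folds), sum(len(f) for f in folds))  # 10 folds covering all 50 instances
```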

Step 8: Repeat Steps 2 through 6 for all datasets.
Step 9: Evaluate the accuracy improvement between the initial datasets and the noisy datasets.

Let N_A = total number of attributes in a dataset, S_A = total number of selected attributes, N_I = number of instances in the dataset, C_kNN = number of correctly classified instances with k-NN, and C_CFS+kNN = number of correctly classified instances with CFS + k-NN. Then the total number of values in the dataset is

T_v = N_A * N_I ...(1)

and the percentage of reduction in attributes is

PR_A = (N_A - S_A) * 100 / N_A ...(2)

Let M_V = number of missing values in the dataset and P_MV = percentage of missing values in the dataset. The percentage of missing values in the dataset can be calculated as

P_MV = M_V * 100 / T_v ...(3)
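A worked example of equations (1) through (3), with all input numbers assumed purely for illustration:

```python
# Hypothetical dataset: 10 attributes, 4 selected by CFS,
# 200 instances, and 150 missing cells.
N_A, S_A, N_I, M_V = 10, 4, 200, 150

T_v = N_A * N_I                 # (1) total number of values in the dataset
PR_A = (N_A - S_A) * 100 / N_A  # (2) percentage of reduction in attributes
P_MV = M_V * 100 / T_v          # (3) percentage of missing values

print(T_v, PR_A, P_MV)  # -> 2000 60.0 7.5
```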

Table 3: Comparison of the percentage of correctly classified instances with k-NN alone and with CFS and k-NN together

Table 4: Percentage of missing values and percentage improvement in correctly classified instances in the initial datasets and the noisy datasets
13. UCI Machine Learning Repository. Available at http://mlr.cs.umass.edu/ml/datasets.html, accessed Sep 2016.
14. Weka Documentation. Available at www.cs.waikato.ac.nz, accessed Sep 2016.
15. Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research.