The standard classification task consists of making generalizations from a set of training examples. The knowledge learned from them can be applied to a set of unobserved examples in order to predict their classes. When noise affects the features or classes of the training examples, it becomes more difficult to predict the class of unseen examples accurately.
This section shows the data sets with class noise available in the repository. Each one defines a supervised classification problem, in which every example is composed of several nominal or numerical attributes and a nominal output attribute (its class). Each data file has the following structure:
- @relation: Name of the data set
- @attribute: Description of an attribute (one for each attribute)
- @inputs: List with the names of the input attributes
- @output: Name of the output attribute
- @data: Starting tag of the data
The rest of the file contains all the examples belonging to the data set, expressed in comma-separated values format. None of the data sets contains missing values.
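As an illustration, the following minimal Python sketch (a hypothetical helper, not part of the KEEL software) reads a file with this structure and separates the header tags from the data section:

```python
def read_keel_file(path):
    """Parse a KEEL-format file into (relation, attributes, inputs, output, rows)."""
    relation, attributes, inputs, output, rows = None, [], [], None, []
    in_data = False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if in_data:
                # Data section: comma-separated values, one example per line
                rows.append([v.strip() for v in line.split(",")])
            elif line.lower().startswith("@relation"):
                relation = line.split(None, 1)[1]
            elif line.lower().startswith("@attribute"):
                attributes.append(line.split(None, 1)[1])
            elif line.lower().startswith("@inputs"):
                inputs = [a.strip() for a in line.split(None, 1)[1].split(",")]
            elif line.lower().startswith("@output"):
                output = line.split(None, 1)[1]
            elif line.lower().startswith("@data"):
                in_data = True
    return relation, attributes, inputs, output, rows
```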
All data sets included here are Standard data sets in which we have introduced random class noise, following the scheme described in the next section.
In order to introduce class noise in the data sets, we adopt the scheme described as follows:
Given a pair of classes (X, Y), where X is the majority class and Y is the second majority class, and a noise level x%, each instance with label X has a probability of x% of being incorrectly labeled as Y.
This scheme was proposed by Zhu et al. in "Eliminating class noise in large datasets", Proceedings of the 20th International Conference on Machine Learning (ICML 2003), Washington D.C., pp. 920-927, 2003. They suggested that this scheme is appropriate because, in practice, it is more likely that only certain types of classes are mislabeled.
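A minimal Python sketch of this pairwise noise scheme, assuming the class labels are available as a list (the function name and parameters are illustrative):

```python
import random
from collections import Counter

def add_pairwise_class_noise(labels, noise_level, seed=None):
    """Relabel instances of the majority class X as the second majority class Y
    with probability noise_level (e.g., 0.05 for a 5% noise level)."""
    rng = random.Random(seed)
    # X = majority class, Y = second majority class of the data set
    (x_class, _), (y_class, _) = Counter(labels).most_common(2)
    noisy = []
    for label in labels:
        if label == x_class and rng.random() < noise_level:
            noisy.append(y_class)  # mislabel an X instance as Y
        else:
            noisy.append(label)
    return noisy
```

For example, calling add_pairwise_class_noise(train_labels, 0.10) on a hypothetical list of training labels would produce labels corrupted at the 10% level.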
The partitioning scheme uses 5 partitions (a stratified 5-fold cross-validation). Since each fold contains more examples with 5 partitions than with a higher number of partitions (e.g., 10), small changes in the classifiers caused by noise in the training sets are more likely to show up in the test sets, because a larger number of test examples is considered.
We introduce the noise only in the training partitions, while the test sets remain unchanged.
We have introduced four different noise levels: 5%, 10%, 15% and 20%. Thus, for the Noisy Train - Clean Test scheme, we have the following combinations (a code sketch of how such partitions can be built follows this list):
- 5% Noisy Train - Clean Test
- 10% Noisy Train - Clean Test
- 15% Noisy Train - Clean Test
- 20% Noisy Train - Clean Test
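The following Python sketch illustrates how such partitions could be generated, assuming X and y are NumPy arrays and using scikit-learn's StratifiedKFold; it is not the exact procedure used to build the repository files:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_noisy_partitions(X, y, noise_levels=(0.05, 0.10, 0.15, 0.20), seed=0):
    """Yield stratified 5-fold partitions in which only the training labels are
    corrupted with the pairwise X -> Y scheme; test labels stay clean."""
    rng = np.random.default_rng(seed)
    # Determine X (majority) and Y (second majority) classes of the full data set
    classes, counts = np.unique(y, return_counts=True)
    order = np.argsort(counts)[::-1]
    x_class, y_class = classes[order[0]], classes[order[1]]

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for level in noise_levels:
        for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
            y_train = y[train_idx].copy()
            # Flip each majority-class training label to Y with probability `level`
            flip = (y_train == x_class) & (rng.random(len(y_train)) < level)
            y_train[flip] = y_class
            yield level, fold, (X[train_idx], y_train), (X[test_idx], y[test_idx])
```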
Below you can find all the Class Noise data sets available with 5% noise in the training sets, while the test sets remain unchanged. For each data set, its name is shown together with its number of instances, attributes (Real/Integer/Nominal valued) and classes (number of possible values of the output variable).
The table allows you to download each data set, already partitioned by means of a 5-fold cross-validation procedure, in KEEL format (inside a ZIP file).
By clicking on the column headers, you can sort the table by name (alphabetically) or by the number of examples, attributes or classes. Clicking again sorts the rows in reverse order.
Below you can find all the Class Noise data sets available with 10% noise in the training sets, while the test sets remain unchanged. For each data set, its name is shown together with its number of instances, attributes (Real/Integer/Nominal valued) and classes (number of possible values of the output variable).
The table allows you to download each data set, already partitioned by means of a 5-fold cross-validation procedure, in KEEL format (inside a ZIP file).
By clicking on the column headers, you can sort the table by name (alphabetically) or by the number of examples, attributes or classes. Clicking again sorts the rows in reverse order.
Below you can find all the Class Noise data sets available with 15% noise in the training sets, while the test sets remain unchanged. For each data set, its name is shown together with its number of instances, attributes (Real/Integer/Nominal valued) and classes (number of possible values of the output variable).
The table allows you to download each data set, already partitioned by means of a 5-fold cross-validation procedure, in KEEL format (inside a ZIP file).
By clicking on the column headers, you can sort the table by name (alphabetically) or by the number of examples, attributes or classes. Clicking again sorts the rows in reverse order.
Below you can find all the Class Noise data sets available with 20% noise in the training sets, while the test sets remain unchanged. For each data set, its name is shown together with its number of instances, attributes (Real/Integer/Nominal valued) and classes (number of possible values of the output variable).
The table allows you to download each data set, already partitioned by means of a 5-fold cross-validation procedure, in KEEL format (inside a ZIP file).
By clicking on the column headers, you can sort the table by name (alphabetically) or by the number of examples, attributes or classes. Clicking again sorts the rows in reverse order.
Collecting Data Sets
If you have some example data sets and you would like to share them with the rest of the research community by means of this page, please be so kind as to send your data to the Webmaster Team with the following information:
- People responsible for the data (full name, affiliation, e-mail, web page, ...).
- Training and test data sets considered, preferably in ASCII format.
- A brief description of the application.
- References where it is used.
- Results obtained by the methods proposed by the authors or used for comparison.
- Type of experiment developed.
- Any additional useful information.
Collecting Results
If you have applied your methods to some of the problems presented here, we will be glad to show your results on this page. Please be so kind as to send the following information to the Webmaster Team:
- Name of the application considered and type of experiment developed.
- Results obtained by the methods proposed by the authors or used for comparison.
- References where the results are shown.
- Any additional useful information.
Contact Us
If you are interested in being informed of each update made to this page, or you would like to comment on it, please contact the Webmaster Team.