KEEL: A software tool to assess evolutionary algorithms for Data Mining problems (regression, classification, clustering, pattern mining and so on)

The standard classification task consists of generalizing from a set of training examples. The knowledge learned from them can then be applied to a set of unseen examples to predict their classes. When noise affects the features or classes of the training examples, it is more difficult to predict the classes of unseen examples well.

This section lists the data sets with attribute noise available in the repository. Each one defines a supervised classification problem in which every example is composed of nominal or numerical attributes and a nominal output attribute (its class).

Each data file has the following structure:

@relation: Name of the data set
@attribute: Description of an attribute (one for each attribute)
@inputs: List with the names of the input attributes
@output: Name of the output attribute
@data: Starting tag of the data

The rest of the file contains all the examples belonging to the data set, expressed in comma-separated values format. None of the data sets contains missing values.
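The file structure described above can be read with a few lines of code. The following is a minimal sketch, not an official KEEL reader: the function name `parse_keel` and the sample text are hypothetical, and accepting `@outputs` as a synonym of `@output` is an assumption made for tolerance.

```python
def parse_keel(text):
    """Split KEEL-style text into a header dict and a list of data rows."""
    header = {"relation": None, "attributes": [], "inputs": [], "output": None}
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if in_data:
            # After @data, every line is one comma-separated example.
            rows.append([value.strip() for value in line.split(",")])
            continue
        tag, _, rest = line.partition(" ")
        tag, rest = tag.lower(), rest.strip()
        if tag == "@relation":
            header["relation"] = rest
        elif tag == "@attribute":
            header["attributes"].append(rest)  # kept as raw description text
        elif tag == "@inputs":
            header["inputs"] = [name.strip() for name in rest.split(",")]
        elif tag in ("@output", "@outputs"):  # "@outputs" accepted as an assumption
            header["output"] = rest
        elif tag == "@data":
            in_data = True
    return header, rows


# Hypothetical two-attribute sample, written only to exercise the parser.
SAMPLE = """@relation iris
@attribute sepalLength real [4.3, 7.9]
@attribute petalLength real [1.0, 6.9]
@attribute class {Iris-setosa, Iris-versicolor}
@inputs sepalLength, petalLength
@output class
@data
5.1, 1.4, Iris-setosa
7.0, 4.7, Iris-versicolor
"""

header, rows = parse_keel(SAMPLE)
print(header["inputs"], len(rows))
```

The attribute descriptions are stored unparsed, because the exact syntax of types and ranges is not specified on this page.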

All data sets included here are standard data sets into which we have introduced random attribute noise, following the scheme described below.

To introduce attribute noise into the data sets, we adopt the following scheme:

Wrong values are introduced into each attribute A at a level of x%. To corrupt attribute A with a noise level of x%, approximately x% of the examples in the data set are chosen, and the value of A in each of these examples is replaced by a random value drawn uniformly between the minimum and maximum of that attribute's domain.
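The scheme above can be sketched in code as follows. This is an illustration under stated assumptions, not the generator used to build the repository files: the function name `add_attribute_noise` and its parameters are hypothetical, only numeric attributes are handled, and exactly `round(level * n)` examples are corrupted per attribute, whereas the text only says "approximately".

```python
import random


def add_attribute_noise(rows, domains, level, rng):
    """Return a noisy copy of `rows` (lists of numeric input values).

    domains: one (minimum, maximum, is_integer) tuple per attribute.
    level:   noise level as a fraction, e.g. 0.05 for 5%.
    rng:     a random.Random instance, so runs are reproducible.
    """
    noisy = [list(row) for row in rows]  # leave the clean data untouched
    n = len(rows)
    n_corrupt = round(level * n)  # assumption: exact count per attribute
    for a, (lo, hi, is_integer) in enumerate(domains):
        # Each attribute gets its own independent choice of examples.
        for i in rng.sample(range(n), n_corrupt):
            if is_integer:
                noisy[i][a] = rng.randint(lo, hi)  # uniform over lo..hi inclusive
            else:
                noisy[i][a] = rng.uniform(lo, hi)
    return noisy


# Hypothetical toy data: one real and one integer attribute, 100 examples.
rows = [[float(i), i] for i in range(100)]
domains = [(0.0, 99.0, False), (0, 99, True)]
noisy = add_attribute_noise(rows, domains, 0.10, random.Random(0))
changed = [sum(r[a] != s[a] for r, s in zip(rows, noisy)) for a in range(2)]
print(changed)  # at most 10 changed values per attribute
```

Drawing the corrupted examples separately for each attribute is a design choice that keeps the noise in one attribute independent of the noise in the others.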

This scheme was proposed by Wu et al. in "Error detection and impact-sensitive instance ranking in noisy datasets", in Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-2004), San Jose, CA, 2004. They suggested it based on the hypothesis that interactions between attributes are weak. As a result, the noise introduced in each attribute has a low correlation with the noise introduced in the others. With this scheme, the actual percentage of noise in the data set may be lower than desired, because the random assignment can sometimes pick the original value again.
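The gap between the nominal and the actual noise level can be estimated with a short calculation. The sketch below assumes that the replacement value is drawn uniformly from a finite set of `n_values` equally likely values that includes the original one; the function name is hypothetical. For real-valued attributes the chance of redrawing the original value is negligible, so the actual level is close to the nominal one.

```python
def expected_effective_noise(level, n_values):
    """Expected fraction of examples whose value really changes.

    level:    nominal noise level as a fraction (0.20 for 20%).
    n_values: number of distinct values in the attribute's domain.
    A replacement equals the original with probability 1 / n_values.
    """
    return level * (1.0 - 1.0 / n_values)


# A binary attribute at a nominal 20% keeps its value half of the time.
print(expected_effective_noise(0.20, 2))
```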

The partitioning scheme uses 5 partitions (a stratified 5-fold cross-validation). Each fold contains more examples with 5 partitions than with a higher number of partitions, e.g., 10. Small changes in the classifiers caused by noise in the training sets are therefore more likely to show up in the test sets, because each test set contains more examples.
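The effect of the number of partitions on fold size can be illustrated with a small stratified split. This is a sketch of one possible stratification, not the procedure used to build the repository partitions; `stratified_folds` is a hypothetical helper, and the labels imitate a balanced 150-example, 3-class data set.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Assign example indices to k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    counter = 0  # running position, so fold sizes stay balanced across classes
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            folds[counter % k].append(index)
            counter += 1
    return folds


labels = ["a"] * 50 + ["b"] * 50 + ["c"] * 50
folds5 = stratified_folds(labels, 5, random.Random(0))
folds10 = stratified_folds(labels, 10, random.Random(0))
print([len(f) for f in folds5])   # 30 test examples per fold
print([len(f) for f in folds10])  # 15 test examples per fold
```

With 5 folds each test partition holds twice as many examples as with 10 folds, which is the property the paragraph above relies on.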

We have created the data sets following three different schemes, depending on where the noise is present:

-    Noisy Train - Noisy Test
-    Noisy Train - Clean Test
-    Clean Train - Noisy Test

For each of these schemes, we have introduced four different noise levels: 5%, 10%, 15% and 20%.

So for the scheme Noisy Train - Noisy Test, we have:

-      5% Noisy Train - Noisy Test
-    10% Noisy Train - Noisy Test
-    15% Noisy Train - Noisy Test
-    20% Noisy Train - Noisy Test

For the scheme Noisy Train - Clean Test, we have:

-      5% Noisy Train - Clean Test
-    10% Noisy Train - Clean Test
-    15% Noisy Train - Clean Test
-    20% Noisy Train - Clean Test

And for the scheme Clean Train - Noisy Test, we have:

-      5% Clean Train - Noisy Test
-    10% Clean Train - Noisy Test
-    15% Clean Train - Noisy Test
-    20% Clean Train - Noisy Test
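The combinations listed above can be enumerated programmatically. The sketch below assumes the naming pattern `<name>-<level>an-<suffix>` that appears in the tables on this page, with the suffixes `nn`, `nc` and `cn` read as Noisy/Noisy, Noisy/Clean and Clean/Noisy (train/test); the pattern is inferred from the table entries, not from an official specification.

```python
# Suffix -> (training set, test set); inferred from the table names.
SCHEMES = {
    "nn": ("noisy", "noisy"),
    "nc": ("noisy", "clean"),
    "cn": ("clean", "noisy"),
}
LEVELS = (5, 10, 15, 20)  # noise levels in percent


def variant_names(base):
    """List the 12 noisy variants of one base data set name."""
    return [f"{base}-{level}an-{suffix}" for suffix in SCHEMES for level in LEVELS]


names = variant_names("iris")
print(len(names), names[0], names[-1])
```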

Below you can find all the attribute noise data sets available with 5% noise in both the training and test sets. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

The table allows you to download each complete data set in KEEL format (inside a ZIP file). The complete data set is only available for this scheme (Noisy Train - Noisy Test), because it is the only one in which it makes sense: here the training and test partitions of each fold always add up to the same complete data set, which is not the case in the other schemes. Additionally, each data set can be downloaded already partitioned by means of a 5-fold cross-validation procedure.

By clicking on the column headers, you can sort the table by name (alphabetically) or by the number of examples, attributes or classes. Clicking again sorts the rows in reverse order.

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-5an-nn	4 (4/0/0)	150	3
ecoli-5an-nn	7 (7/0/0)	336	8
pima-5an-nn	8 (8/0/0)	768	2
yeast-5an-nn	8 (8/0/0)	1484	10
glass-5an-nn	9 (9/0/0)	214	7
contraceptive-5an-nn	9 (0/9/0)	1473	3
page-blocks-5an-nn	10 (4/6/0)	5472	5
wine-5an-nn	13 (13/0/0)	178	3
heart-5an-nn	13 (1/12/0)	270	2
penbased-5an-nn	16 (0/16/0)	10992	10
segment-5an-nn	19 (19/0/0)	2310	7
ring-5an-nn	20 (20/0/0)	7400	2
twonorm-5an-nn	20 (20/0/0)	7400	2
thyroid-5an-nn	21 (6/15/0)	7200	3
wdbc-5an-nn	30 (30/0/0)	569	2
ionosphere-5an-nn	33 (32/1/0)	351	2
satimage-5an-nn	36 (0/36/0)	6435	7
spambase-5an-nn	57 (57/0/0)	4597	2
sonar-5an-nn	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 10% noise in both the training and test sets. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-10an-nn	4 (4/0/0)	150	3
ecoli-10an-nn	7 (7/0/0)	336	8
pima-10an-nn	8 (8/0/0)	768	2
yeast-10an-nn	8 (8/0/0)	1484	10
glass-10an-nn	9 (9/0/0)	214	7
contraceptive-10an-nn	9 (0/9/0)	1473	3
page-blocks-10an-nn	10 (4/6/0)	5472	5
wine-10an-nn	13 (13/0/0)	178	3
heart-10an-nn	13 (1/12/0)	270	2
penbased-10an-nn	16 (0/16/0)	10992	10
segment-10an-nn	19 (19/0/0)	2310	7
ring-10an-nn	20 (20/0/0)	7400	2
twonorm-10an-nn	20 (20/0/0)	7400	2
thyroid-10an-nn	21 (6/15/0)	7200	3
wdbc-10an-nn	30 (30/0/0)	569	2
ionosphere-10an-nn	33 (32/1/0)	351	2
satimage-10an-nn	36 (0/36/0)	6435	7
spambase-10an-nn	57 (57/0/0)	4597	2
sonar-10an-nn	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 15% noise in both the training and test sets. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-15an-nn	4 (4/0/0)	150	3
ecoli-15an-nn	7 (7/0/0)	336	8
pima-15an-nn	8 (8/0/0)	768	2
yeast-15an-nn	8 (8/0/0)	1484	10
glass-15an-nn	9 (9/0/0)	214	7
contraceptive-15an-nn	9 (0/9/0)	1473	3
page-blocks-15an-nn	10 (4/6/0)	5472	5
wine-15an-nn	13 (13/0/0)	178	3
heart-15an-nn	13 (1/12/0)	270	2
penbased-15an-nn	16 (0/16/0)	10992	10
segment-15an-nn	19 (19/0/0)	2310	7
ring-15an-nn	20 (20/0/0)	7400	2
twonorm-15an-nn	20 (20/0/0)	7400	2
thyroid-15an-nn	21 (6/15/0)	7200	3
wdbc-15an-nn	30 (30/0/0)	569	2
ionosphere-15an-nn	33 (32/1/0)	351	2
satimage-15an-nn	36 (0/36/0)	6435	7
spambase-15an-nn	57 (57/0/0)	4597	2
sonar-15an-nn	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 20% noise in both the training and test sets. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-20an-nn	4 (4/0/0)	150	3
ecoli-20an-nn	7 (7/0/0)	336	8
pima-20an-nn	8 (8/0/0)	768	2
yeast-20an-nn	8 (8/0/0)	1484	10
glass-20an-nn	9 (9/0/0)	214	7
contraceptive-20an-nn	9 (0/9/0)	1473	3
page-blocks-20an-nn	10 (4/6/0)	5472	5
wine-20an-nn	13 (13/0/0)	178	3
heart-20an-nn	13 (1/12/0)	270	2
penbased-20an-nn	16 (0/16/0)	10992	10
segment-20an-nn	19 (19/0/0)	2310	7
ring-20an-nn	20 (20/0/0)	7400	2
twonorm-20an-nn	20 (20/0/0)	7400	2
thyroid-20an-nn	21 (6/15/0)	7200	3
wdbc-20an-nn	30 (30/0/0)	569	2
ionosphere-20an-nn	33 (32/1/0)	351	2
satimage-20an-nn	36 (0/36/0)	6435	7
spambase-20an-nn	57 (57/0/0)	4597	2
sonar-20an-nn	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 5% noise in the training set while the test set remains unchanged. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

The table allows you to download each data set, already partitioned by means of a 5-fold cross-validation procedure, in KEEL format (inside a ZIP file).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-5an-nc	4 (4/0/0)	150	3
ecoli-5an-nc	7 (7/0/0)	336	8
pima-5an-nc	8 (8/0/0)	768	2
yeast-5an-nc	8 (8/0/0)	1484	10
glass-5an-nc	9 (9/0/0)	214	7
contraceptive-5an-nc	9 (0/9/0)	1473	3
page-blocks-5an-nc	10 (4/6/0)	5472	5
wine-5an-nc	13 (13/0/0)	178	3
heart-5an-nc	13 (1/12/0)	270	2
penbased-5an-nc	16 (0/16/0)	10992	10
segment-5an-nc	19 (19/0/0)	2310	7
ring-5an-nc	20 (20/0/0)	7400	2
twonorm-5an-nc	20 (20/0/0)	7400	2
thyroid-5an-nc	21 (6/15/0)	7200	3
wdbc-5an-nc	30 (30/0/0)	569	2
ionosphere-5an-nc	33 (32/1/0)	351	2
satimage-5an-nc	36 (0/36/0)	6435	7
spambase-5an-nc	57 (57/0/0)	4597	2
sonar-5an-nc	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 10% noise in the training set while the test set remains unchanged. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-10an-nc	4 (4/0/0)	150	3
ecoli-10an-nc	7 (7/0/0)	336	8
pima-10an-nc	8 (8/0/0)	768	2
yeast-10an-nc	8 (8/0/0)	1484	10
glass-10an-nc	9 (9/0/0)	214	7
contraceptive-10an-nc	9 (0/9/0)	1473	3
page-blocks-10an-nc	10 (4/6/0)	5472	5
wine-10an-nc	13 (13/0/0)	178	3
heart-10an-nc	13 (1/12/0)	270	2
penbased-10an-nc	16 (0/16/0)	10992	10
segment-10an-nc	19 (19/0/0)	2310	7
ring-10an-nc	20 (20/0/0)	7400	2
twonorm-10an-nc	20 (20/0/0)	7400	2
thyroid-10an-nc	21 (6/15/0)	7200	3
wdbc-10an-nc	30 (30/0/0)	569	2
ionosphere-10an-nc	33 (32/1/0)	351	2
satimage-10an-nc	36 (0/36/0)	6435	7
spambase-10an-nc	57 (57/0/0)	4597	2
sonar-10an-nc	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 15% noise in the training set while the test set remains unchanged. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-15an-nc	4 (4/0/0)	150	3
ecoli-15an-nc	7 (7/0/0)	336	8
pima-15an-nc	8 (8/0/0)	768	2
yeast-15an-nc	8 (8/0/0)	1484	10
glass-15an-nc	9 (9/0/0)	214	7
contraceptive-15an-nc	9 (0/9/0)	1473	3
page-blocks-15an-nc	10 (4/6/0)	5472	5
wine-15an-nc	13 (13/0/0)	178	3
heart-15an-nc	13 (1/12/0)	270	2
penbased-15an-nc	16 (0/16/0)	10992	10
segment-15an-nc	19 (19/0/0)	2310	7
ring-15an-nc	20 (20/0/0)	7400	2
twonorm-15an-nc	20 (20/0/0)	7400	2
thyroid-15an-nc	21 (6/15/0)	7200	3
wdbc-15an-nc	30 (30/0/0)	569	2
ionosphere-15an-nc	33 (32/1/0)	351	2
satimage-15an-nc	36 (0/36/0)	6435	7
spambase-15an-nc	57 (57/0/0)	4597	2
sonar-15an-nc	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 20% noise in the training set while the test set remains unchanged. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-20an-nc	4 (4/0/0)	150	3
ecoli-20an-nc	7 (7/0/0)	336	8
pima-20an-nc	8 (8/0/0)	768	2
yeast-20an-nc	8 (8/0/0)	1484	10
glass-20an-nc	9 (9/0/0)	214	7
contraceptive-20an-nc	9 (0/9/0)	1473	3
page-blocks-20an-nc	10 (4/6/0)	5472	5
wine-20an-nc	13 (13/0/0)	178	3
heart-20an-nc	13 (1/12/0)	270	2
penbased-20an-nc	16 (0/16/0)	10992	10
segment-20an-nc	19 (19/0/0)	2310	7
ring-20an-nc	20 (20/0/0)	7400	2
twonorm-20an-nc	20 (20/0/0)	7400	2
thyroid-20an-nc	21 (6/15/0)	7200	3
wdbc-20an-nc	30 (30/0/0)	569	2
ionosphere-20an-nc	33 (32/1/0)	351	2
satimage-20an-nc	36 (0/36/0)	6435	7
spambase-20an-nc	57 (57/0/0)	4597	2
sonar-20an-nc	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 5% noise in the test set while the training set remains unchanged. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

The table allows you to download each data set, already partitioned by means of a 5-fold cross-validation procedure, in KEEL format (inside a ZIP file).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-5an-cn	4 (4/0/0)	150	3
ecoli-5an-cn	7 (7/0/0)	336	8
pima-5an-cn	8 (8/0/0)	768	2
yeast-5an-cn	8 (8/0/0)	1484	10
glass-5an-cn	9 (9/0/0)	214	7
contraceptive-5an-cn	9 (0/9/0)	1473	3
page-blocks-5an-cn	10 (4/6/0)	5472	5
wine-5an-cn	13 (13/0/0)	178	3
heart-5an-cn	13 (1/12/0)	270	2
penbased-5an-cn	16 (0/16/0)	10992	10
segment-5an-cn	19 (19/0/0)	2310	7
ring-5an-cn	20 (20/0/0)	7400	2
twonorm-5an-cn	20 (20/0/0)	7400	2
thyroid-5an-cn	21 (6/15/0)	7200	3
wdbc-5an-cn	30 (30/0/0)	569	2
ionosphere-5an-cn	33 (32/1/0)	351	2
satimage-5an-cn	36 (0/36/0)	6435	7
spambase-5an-cn	57 (57/0/0)	4597	2
sonar-5an-cn	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 10% noise in the test set while the training set remains unchanged. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-10an-cn	4 (4/0/0)	150	3
ecoli-10an-cn	7 (7/0/0)	336	8
pima-10an-cn	8 (8/0/0)	768	2
yeast-10an-cn	8 (8/0/0)	1484	10
glass-10an-cn	9 (9/0/0)	214	7
contraceptive-10an-cn	9 (0/9/0)	1473	3
page-blocks-10an-cn	10 (4/6/0)	5472	5
wine-10an-cn	13 (13/0/0)	178	3
heart-10an-cn	13 (1/12/0)	270	2
penbased-10an-cn	16 (0/16/0)	10992	10
segment-10an-cn	19 (19/0/0)	2310	7
ring-10an-cn	20 (20/0/0)	7400	2
twonorm-10an-cn	20 (20/0/0)	7400	2
thyroid-10an-cn	21 (6/15/0)	7200	3
wdbc-10an-cn	30 (30/0/0)	569	2
ionosphere-10an-cn	33 (32/1/0)	351	2
satimage-10an-cn	36 (0/36/0)	6435	7
spambase-10an-cn	57 (57/0/0)	4597	2
sonar-10an-cn	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 15% noise in the test set while the training set remains unchanged.

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-15an-cn	4 (4/0/0)	150	3
ecoli-15an-cn	7 (7/0/0)	336	8
pima-15an-cn	8 (8/0/0)	768	2
yeast-15an-cn	8 (8/0/0)	1484	10
glass-15an-cn	9 (9/0/0)	214	7
contraceptive-15an-cn	9 (0/9/0)	1473	3
page-blocks-15an-cn	10 (4/6/0)	5472	5
wine-15an-cn	13 (13/0/0)	178	3
heart-15an-cn	13 (1/12/0)	270	2
penbased-15an-cn	16 (0/16/0)	10992	10
segment-15an-cn	19 (19/0/0)	2310	7
ring-15an-cn	20 (20/0/0)	7400	2
twonorm-15an-cn	20 (20/0/0)	7400	2
thyroid-15an-cn	21 (6/15/0)	7200	3
wdbc-15an-cn	30 (30/0/0)	569	2
ionosphere-15an-cn	33 (32/1/0)	351	2
satimage-15an-cn	36 (0/36/0)	6435	7
spambase-15an-cn	57 (57/0/0)	4597	2
sonar-15an-cn	60 (60/0/0)	208	2
All data sets

Below you can find all the attribute noise data sets available with 20% noise in the test set while the training set remains unchanged. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples, and its number of classes (the number of possible values of the output variable).

Name	#Attributes (R/I/N)	#Examples	#Classes
iris-20an-cn	4 (4/0/0)	150	3
ecoli-20an-cn	7 (7/0/0)	336	8
pima-20an-cn	8 (8/0/0)	768	2
yeast-20an-cn	8 (8/0/0)	1484	10
glass-20an-cn	9 (9/0/0)	214	7
contraceptive-20an-cn	9 (0/9/0)	1473	3
page-blocks-20an-cn	10 (4/6/0)	5472	5
wine-20an-cn	13 (13/0/0)	178	3
heart-20an-cn	13 (1/12/0)	270	2
penbased-20an-cn	16 (0/16/0)	10992	10
segment-20an-cn	19 (19/0/0)	2310	7
ring-20an-cn	20 (20/0/0)	7400	2
twonorm-20an-cn	20 (20/0/0)	7400	2
thyroid-20an-cn	21 (6/15/0)	7200	3
wdbc-20an-cn	30 (30/0/0)	569	2
ionosphere-20an-cn	33 (32/1/0)	351	2
satimage-20an-cn	36 (0/36/0)	6435	7
spambase-20an-cn	57 (57/0/0)	4597	2
sonar-20an-cn	60 (60/0/0)	208	2
All data sets

Collecting Data Sets

If you have example data sets and would like to share them with the rest of the research community through this page, please send your data to the Webmaster Team with the following information:

People responsible for the data (full name, affiliation, e-mail, web page, ...).
Training and test data sets considered, preferably in ASCII format.
A brief description of the application.
References where it is used.
Results obtained by the methods proposed by the authors or used for comparison.
Type of experiment developed.
Any additional useful information.

Collecting Results

If you have applied your methods to some of the problems presented here, we will be glad to show your results on this page. Please send the following information to the Webmaster Team:

Name of the application considered and type of experiment developed.
Results obtained by the methods proposed by the authors or used for comparison.
References where the results are shown.
Any additional useful information.

If you would like to be informed of each update made to this page, or would like to comment on it, please contact the Webmaster Team.