KEEL: A software tool to assess evolutionary algorithms for Data Mining problems (regression, classification, clustering, pattern mining and so on)

This section lists the standard classification data sets available in the repository. Each one defines a supervised classification problem in which every example consists of nominal or numerical attributes and a nominal output attribute (its class).

Each data file has the following structure:

@relation: Name of the data set
@attribute: Description of an attribute (one for each attribute)
@inputs: List with the names of the input attributes
@output: Name of the output attribute
@data: Starting tag of the data

The rest of the file contains all the examples belonging to the data set, in comma-separated values format.
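As an illustration of the structure described above, here is a minimal Python sketch of a reader for this layout. It is not the KEEL tool's own parser: the function name `parse_keel`, the sample data, and the attribute descriptions in the sample are assumptions made for the example, and attribute descriptions are kept as raw strings rather than interpreted. The sketch also accepts `@outputs` and `@input` as tag variants, which is an assumption rather than something stated on this page.

```python
def parse_keel(text):
    """Split KEEL-style text into a header dict and a list of data rows.

    Hypothetical helper for illustration; values are kept as strings.
    """
    header = {"relation": None, "attributes": [], "inputs": [], "output": None}
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_data:
            # After @data, every line is one comma-separated example.
            rows.append([value.strip() for value in line.split(",")])
            continue
        tag, _, rest = line.partition(" ")
        tag = tag.lower()
        rest = rest.strip()
        if tag == "@relation":
            header["relation"] = rest
        elif tag == "@attribute":
            # Raw description, e.g. "class {a, b}"; not parsed further here.
            header["attributes"].append(rest)
        elif tag in ("@inputs", "@input"):  # tag variants are an assumption
            header["inputs"] = [name.strip() for name in rest.split(",")]
        elif tag in ("@output", "@outputs"):  # tag variants are an assumption
            header["output"] = rest
        elif tag == "@data":
            in_data = True
    return header, rows


# Made-up sample file, not taken from the repository.
SAMPLE = """@relation toy
@attribute length real [4.3, 7.9]
@attribute width real [0.1, 2.5]
@attribute class {small, large}
@inputs length, width
@output class
@data
5.1, 0.2, small
7.0, 1.4, large
"""

header, rows = parse_keel(SAMPLE)
print(header["relation"], header["inputs"], header["output"])
print(rows)
```

Keeping every value as a string avoids guessing at types; a fuller reader would use the attribute descriptions to convert numeric columns.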

We offer information about experimental studies using these data sets (result files, papers and more) in the Experimental studies in classification section of the repository.

Below you can find all the available Standard Classification data sets. For each data set, the table shows its name and its number of instances, attributes (broken down into Real/Integer/Nominal attributes) and classes (the number of possible values of the output variable). The table also shows whether the data set has missing values; for data sets with missing values, it gives the number of instances without missing values, followed by the total number of instances in brackets.
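For readers who want to process the table programmatically, the following Python sketch parses one tab-separated row into its fields. The function name `parse_row` and the dictionary keys are hypothetical, and the sketch assumes rows follow exactly the layout described above (attribute count with an R/I/N breakdown, and an optional bracketed total for data sets with missing values).

```python
import re


def parse_row(line):
    """Parse one tab-separated table row into a dict (hypothetical helper)."""
    name, attrs, examples, classes, missing = line.rstrip("\n").split("\t")

    # "5 (0/5/0)" -> total attribute count plus Real/Integer/Nominal counts.
    a = re.fullmatch(r"(\d+) \((\d+)/(\d+)/(\d+)\)", attrs.strip())
    if a is None:
        raise ValueError(f"unexpected attribute field: {attrs!r}")
    total, real, integer, nominal = map(int, a.groups())

    # "830 (961)" -> instances without missing values, then total instances.
    # A plain "150" means both counts are the same.
    e = re.fullmatch(r"(\d+)(?: \((\d+)\))?", examples.strip())
    if e is None:
        raise ValueError(f"unexpected examples field: {examples!r}")
    clean = int(e.group(1))
    full = int(e.group(2)) if e.group(2) else clean

    return {
        "name": name,
        "attributes": total,
        "real": real,
        "integer": integer,
        "nominal": nominal,
        "breakdown_consistent": real + integer + nominal == total,
        "examples_clean": clean,
        "examples_total": full,
        "classes": int(classes),
        "missing_values": missing.strip().lower() == "yes",
    }


row = parse_row("mammographic\t5 (0/5/0)\t830 (961)\t2\tYes")
print(row["examples_clean"], row["examples_total"], row["missing_values"])
```

The `breakdown_consistent` flag is a simple sanity check that the R/I/N counts add up to the attribute total.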

The table lets you download each data set in KEEL format (inside a ZIP file). Each data set is also available already partitioned by means of a 10-fold / 5-fold stratified cross-validation (SCV) procedure. Partitions produced with a 10-fold / 5-fold distribution optimally balanced stratified cross-validation (DOB-SCV) are also available, except for data sets with a very high number of examples, since this partitioning scheme requires considerable computation. The latter validation procedure was proposed in:

J.G. Moreno-Torres, J.A. Sáez, F. Herrera, Study on the impact of partition-induced dataset shift on k-fold cross-validation, IEEE Transactions on Neural Networks and Learning Systems 23 (8) (2012) 1304-1313.
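To make the idea of stratification concrete, here is a minimal Python sketch of plain stratified k-fold partitioning (SCV), in which each fold receives roughly the same share of every class. It is an illustration only: it is not the code used to generate the partitions offered here, it does not implement DOB-SCV, and the function name `stratified_folds` and its `seed` parameter are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign example indices to k folds, keeping class proportions similar.

    Hypothetical helper: returns a list of k lists of indices into `labels`.
    """
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    next_fold = 0  # keeps rotating across classes so fold sizes stay even
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled examples of this class out round-robin.
        for idx in indices:
            folds[next_fold % k].append(idx)
            next_fold += 1
    return folds


labels = ["a"] * 10 + ["b"] * 5
for fold_number, test_indices in enumerate(stratified_folds(labels, k=5)):
    counts = {c: sum(labels[i] == c for i in test_indices) for c in ("a", "b")}
    print(fold_number, counts)
```

Each fold serves once as the test set while the remaining folds form the training set. Within any class, fold counts differ by at most one example.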

For data sets with missing values, only the cleaned version (where instances with missing values are not included) is provided. A complete version including instances with missing values can be found in the description page of each data set or in the missing values section of KEEL-dataset. Finally, we provide a header file to give additional information about each data set and its attributes.

By clicking on the column headers, you can sort the table by name (alphabetically), by the number of examples, attributes or classes, or by the presence of missing values. Clicking again sorts the rows in reverse order.

Name	#Attributes (R/I/N)	#Examples	#Classes	Missing values
banana	2 (2/0/0)	5300	2	No
haberman	3 (0/3/0)	306	2	No
titanic	3 (3/0/0)	2201	2	No
iris	4 (4/0/0)	150	3	No
hayes-roth	4 (0/4/0)	160	3	No
balance	4 (4/0/0)	625	3	No
tae	5 (0/5/0)	151	3	No
newthyroid	5 (4/1/0)	215	3	No
mammographic	5 (0/5/0)	830 (961)	2	Yes
phoneme	5 (5/0/0)	5404	2	No
bupa	6 (1/5/0)	345	2	No
monk-2	6 (0/6/0)	432	2	No
car	6 (0/0/6)	1728	4	No
kr-vs-k	6 (0/0/6)	28056	17	No
appendicitis	7 (7/0/0)	106	2	No
ecoli	7 (7/0/0)	336	8	No
led7digit	7 (7/0/0)	500	10	No
post-operative	8 (0/0/8)	87 (90)	3	Yes
pima	8 (8/0/0)	768	2	No
yeast	8 (8/0/0)	1484	10	No
abalone	8 (7/0/1)	4174	28	No
nursery	8 (0/0/8)	12690	5	No
glass	9 (9/0/0)	214	7	No
breast	9 (0/0/9)	277 (286)	2	Yes
saheart	9 (5/3/1)	462	2	No
wisconsin	9 (0/9/0)	683 (699)	2	Yes
tic-tac-toe	9 (0/0/9)	958	2	No
contraceptive	9 (0/9/0)	1473	3	No
shuttle	9 (0/9/0)	58000	7	No
page-blocks	10 (4/6/0)	5472	5	No
magic	10 (10/0/0)	19020	2	No
poker	10 (0/10/0)	1025010	10	No
flare	11 (0/0/11)	1066	6	No
winequality-red	11 (11/0/0)	1599	11	No
winequality-white	11 (11/0/0)	4898	11	No
wine	13 (13/0/0)	178	3	No
heart	13 (1/12/0)	270	2	No
cleveland	13 (13/0/0)	297 (303)	5	Yes
vowel	13 (10/3/0)	990	11	No
marketing	13 (0/13/0)	6876 (8993)	9	Yes
australian	14 (3/5/6)	690	2	No
adult	14 (6/0/8)	45222 (48842)	2	Yes
crx	15 (3/3/9)	653 (690)	2	Yes
zoo	16 (0/0/16)	101	7	No
housevotes	16 (0/0/16)	232 (435)	2	Yes
penbased	16 (0/16/0)	10992	10	No
letter	16 (0/16/0)	20000	26	No
lymphography	18 (0/3/15)	148	4	No
vehicle	18 (0/18/0)	846	4	No
hepatitis	19 (2/17/0)	80 (155)	2	Yes
bands	19 (13/6/0)	365 (539)	2	Yes
segment	19 (19/0/0)	2310	7	No
german	20 (0/7/13)	1000	2	No
ring	20 (20/0/0)	7400	2	No
twonorm	20 (20/0/0)	7400	2	No
thyroid	21 (6/15/0)	7200	3	No
mushroom	22 (0/0/22)	5644 (8124)	2	Yes
automobile	25 (15/0/10)	150 (205)	6	Yes
fars	29 (5/0/24)	100968	8	No
wdbc	30 (30/0/0)	569	2	No
ionosphere	33 (32/1/0)	351	2	No
dermatology	34 (0/34/0)	358 (366)	6	Yes
chess	36 (0/0/36)	3196	2	No
satimage	36 (0/36/0)	6435	7	No
texture	40 (40/0/0)	5500	11	No
census	41 (1/12/28)	142521 (299284)	3	Yes
kddcup	41 (26/0/15)	494020	23	No
connect-4	42 (0/0/42)	67557	3	No
spectfheart	44 (0/44/0)	267	2	No
spambase	57 (57/0/0)	4597	2	No
sonar	60 (60/0/0)	208	2	No
splice	60 (0/0/60)	3190	3	No
optdigits	64 (0/64/0)	5620	10	No
coil2000	85 (0/85/0)	9822	2	No
movement_libras	90 (90/0/0)	360	15	No
All data sets (SCV)
All data sets (DOB-SCV)

Collecting Data Sets

If you have some example data sets and you would like to share them with the rest of the research community by means of this page, please be so kind as to send your data to the Webmaster Team with the following information:

People responsible for the data (full name, affiliation, e-mail, web page, ...).
Training and test data sets considered, preferably in ASCII format.
A brief description of the application.
References where it is used.
Results obtained by the methods proposed by the authors or used for comparison.
Type of experiment developed.
Any additional useful information.

Collecting Results

If you have applied your methods to some of the problems presented here, we will be glad to show your results on this page. Please be so kind as to send the following information to the Webmaster Team:

Name of the application considered and type of experiment developed.
Results obtained by the methods proposed by the authors or used for comparison.
References where the results are shown.
Any additional useful information.

If you are interested in being informed of each update made to this page, or you would like to comment on it, please contact the Webmaster Team.