This page contains an HTML version of the KEEL Reference Manual, providing basic guidelines to help in the development of new methods. It also describes the structure of the data and configuration files employed by the KEEL GUI. A PDF version of the contents of the manual can be downloaded here.
Contents
1 Introduction to KEEL Software Suite
  1.1 KEEL Suite 3.0 Description
  1.2 How to get KEEL
  1.3 System requirements
2 Getting Started
  2.1 Download and Start KEEL
    2.1.1 Starting from the pre-compiled version
    2.1.2 Starting from the Source Code
  2.2 Importing your own data
  2.3 An example of running experiments with KEEL
    2.3.1 Standard use case
    2.3.2 Advanced use case
  2.4 Where to go from here
3 Data Management
  3.1 Data import
    3.1.1 Import dataset
    3.1.2 Import partitions
    3.1.3 Importing SQL databases to KEEL format
  3.2 Data export
    3.2.1 Export dataset
    3.2.2 Export partitions
  3.3 File formats
    3.3.1 CSV data file format
    3.3.2 TXT and TSV data file format
    3.3.3 PRN data file format
    3.3.4 DIF data file format
    3.3.5 C4.5 data file format
    3.3.6 Excel data file format
    3.3.7 Weka data file format
    3.3.8 XML data file format
    3.3.9 HTML data file format
    3.3.10 KEEL data file format
  3.4 Visualize data
    3.4.1 Dataset view
    3.4.2 Attribute info
    3.4.3 Charts 2D
  3.5 Edit data
    3.5.1 Data edition
    3.5.2 Variable edition
  3.6 Data partition
4 Experiment Design
  4.1 Configuration of experiments
  4.2 Selection of datasets
  4.3 Experiment Graph
    4.3.1 Datasets
    4.3.2 Preprocessing methods
    4.3.3 Standard Methods
    4.3.4 Post-processing methods
    4.3.5 Statistical tests
    4.3.6 Visualization modules
    4.3.7 Connections
  4.4 Graph Management
  4.5 Algorithm parameters configuration
  4.6 Generation of Experiments
  4.7 Menu bar
  4.8 Tool bar
  4.9 Status bar
5 Running KEEL Experiments
  5.1 Deploying a KEEL experiment
  5.2 Viewing the experiment results
6 Teaching module
  6.1 Introduction
  6.2 Menu Bar
  6.3 Tools Bar
  6.4 Status Bar
  6.5 Experiment Graph
    6.5.1 Datasets
    6.5.2 Algorithms
    6.5.3 Connections
    6.5.4 Interface Management
7 KEEL Modules
  7.1 Imbalanced Learning Module
    7.1.1 Introduction to classification with imbalanced datasets
    7.1.2 Imbalanced Experiments Design: Offline module
  7.2 Statistical tests Module
    7.2.1 Introduction to statistical tests
    7.2.2 KEEL Suite for Statistical Analysis
  7.3 Semi-supervised Learning Module
    7.3.1 Semi-supervised Learning Experiments Design: Offline module
  7.4 Multiple Instance Learning Module
    7.4.1 Introduction to multiple instance learning
    7.4.2 Multiple Instance Learning Experiments Design: Offline module
8 Appendix
1 Introduction to KEEL Software Suite
1.1 KEEL Suite 3.0 Description
KEEL (Knowledge Extraction based on Evolutionary Learning) is a free software (GPLv3) Java suite which empowers the user to assess the behavior of evolutionary learning and soft computing based techniques for different kinds of data mining problems: regression, classification, clustering, pattern mining and so on. The main features of KEEL are:
The current version of KEEL consists of the following function blocks:
The blocks that compose the KEEL Software Suite also determine the organization of this User Manual. First, we describe the operations of the Data Management section, the first step to obtain the data needed in the experiments. Then, the Experiments section, the most powerful part of the suite, is detailed and all of its operations are explained. Next, the Educational section is presented together with all its options. Finally, the remaining modules are presented in the same order as they appear in the KEEL menu.
1.2 How to get KEEL
KEEL Software can be downloaded from the Web page of the project at http://www.keel.es/download.php. From here, several options are available:
The simplest way to begin with KEEL is to download the latest version of the prototype, which is already compiled for Java JRE 1.7. Additionally, all versions of the KEEL Software Suite include a basic package of datasets. However, we encourage users to browse the KEEL-Dataset repository (http://www.keel.es/dataset.php), where more than 600 datasets (classification datasets, regression datasets and more) are available, ready to be imported into the prototype. Once you have saved the compressed KEEL file, you only need to unzip it into any folder. Then, go into the “dist” folder and run the “GraphInterKeel.jar” file to open the main menu. Finally, by following the guidelines provided in this document, you will be able to configure any data mining experiment. Furthermore, you may include your own algorithms for a more complete study; please refer to the “KEEL Developer manual” for this purpose.
1.3 System requirements
KEEL is fully developed in Java. This means that any computer able to install and run a Java Virtual Machine (JVM) will be enough to run both the KEEL graphical interface and the data mining experiments created with the suite.
Currently, we recommend installing the latest stable version of Java (available at http://www.java.com/), although any JVM from version 1.7 onward should be enough to run the graphical interface and the algorithms included in KEEL. Memory requirements (the only critical resource for some algorithms) can be adjusted when the experiments are created. All these resources are free software; therefore, no custom or proprietary software is required to work with the tools provided by the KEEL project.
2 Getting Started
This section provides a quick introduction to using the KEEL software tool. The following subsections will show you how to download, install and run both simple and elaborate examples in KEEL.
2.1 Download and Start KEEL
To follow along with this guide, first download the KEEL software from the website (http://www.keel.es/download.php). You can either download the compiled version or the source code from the Git repository (https://github.com/SCI2SUGR/KEEL). The figure below shows the two download options to get the latest version of KEEL. Note that Java version 7 needs to be installed on your system for this to work. Depending on your computing platform, you may have to download and install it separately; it is available for free from Sun. If you already have Java installed on your system, please update it to the latest version if you want to use the newest KEEL versions.
2.1.1 Starting from the pre-compiled version
If you have downloaded the binary version (Software-20XX-YY-ZZ.zip), you first have to unzip this file. Then, navigate into the “dist” folder and launch the KEEL Software Suite by executing the “GraphInterKeel.jar” file. There are two different procedures to execute this jar file.
java -jar ./dist/GraphInterKeel.jar
Make sure you have properly set up the Java path. For related issues, go to https://www.java.com/en/download/help/path.xml. This is the launch window that appears after typing that command. This GUI lets you import datasets, run (educational) experiments and run different modules (Imbalanced Learning, Non-parametric Statistical Analysis, Semi-supervised Learning and Multiple Instance Learning). It also provides a Help file explaining the content of the initial screen.
2.1.2 Starting from the Source Code
If you want to compile the KEEL source code, it is advisable to use the Apache Ant tool (available for download at the Apache Ant Project web page: http://ant.apache.org/). The KEEL software tool includes a “build.xml” file to be used together with ant. To compile the KEEL project (assuming you have already installed ant), you just have to type the following commands:
ant cleanAll
This command erases previous binary files so that there are no conflicts with new binary builds.
ant
This command builds the whole KEEL project binaries using the available source code. You can now navigate into the dist folder and run the generated .jar file:
java -jar ./dist/GraphInterKeel.jar
For more information, please refer to Subsection 1.2.
2.2 Importing your own data
The installation of new datasets into the application can be done using the Data Management module or the Experiments module. These modules can convert data from several formats (CSV, ARFF or plain text) to the KEEL format, thus allowing the user to quickly integrate them. In what follows, we show a simple example of use, enumerating the steps to be taken. Please refer to Section 3.1 for full details. Let’s say that we have the following dataset file in CSV format, corresponding to a subset of the Iris classification problem:
4.6,3.1,1.5,0.2,1
5.0,3.6,1.4,0.2,1
5.4,3.9,1.7,0.4,1
4.6,3.4,1.4,0.3,1
6.9,3.1,4.9,1.5,0
5.5,2.3,4.0,1.3,0
6.5,2.8,4.6,1.5,0
5.7,2.8,4.5,1.3,0
...
From the first screen:
After these steps, you will have created a new dataset with k-fold cross validation for the given CSV file. You can now close the current window and come back to the KEEL welcome screen.
2.3 An example of running experiments with KEEL
In this section, we present several examples of how to create and run experiments with the KEEL software tool. We will first present a simple use case, and then a more elaborate use case will be developed.
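The k-fold partitioning created in the import step splits the data into k train/test pairs. A minimal sketch of the stratified variant, in which each class is dealt evenly across folds, may make the idea concrete. This is an illustration in Python, not KEEL's own Java code, and it omits the shuffling a real tool would apply:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Deal the indices of each class round-robin across k folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # round-robin within each class
    return folds

# Toy class labels: 10 instances of class 1, 5 of class 0.
labels = [1] * 10 + [0] * 5
folds = stratified_folds(labels, 5)
# Every fold keeps the 2:1 class ratio of the full data.
print([len(fold) for fold in folds])
```

Fold i's test set is `folds[i]`; the remaining indices form the training set. KEEL's DOB-SCV scheme, used in the advanced example later in this section, refines plain stratification by spreading nearby same-class instances across folds.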
2.3.1 Standard use case
In this example, we will test the performance of one existing method within the KEEL software suite over the datasets that are already included in the tool. Specifically, we would like to obtain the accuracy of the C4.5 decision tree using a standard 10-fold cross validation partitioning scheme. To do so, we first select the “Experiments” option from the KEEL software suite main menu, as shown in Figure 7. Now, we select the type of experiment that we want to perform. First, we select the partitioning scheme. As we want to perform a 10-fold cross validation, we need to select the first bullet, “k-fold cross validation”, from the “Type of partitions” menu, setting the value of k to 10. Then we select the “Type of the experiment” by clicking on the “Classification” button. This procedure is depicted in Figure 8. Now, we have to select the datasets that we want to use in this experiment. As we want to test all the data available in KEEL, we just click on the “Select All” button. This action will highlight all the datasets on the left panel. Then, we need to add these data to the experiment. To do so, we just have to click on any place of the right panel. Figure 9 shows how the KEEL screen has changed after adding the data to the experiment. Now, we select the methods that we want to add to the experiment. Since we want to test the C4.5 decision tree, we click on the methods panel on the left side menu. This will prompt a list of methods organized by folders. We then expand the “Decision Trees” folder and click on the C45-C method, which is the C4.5 decision tree that we want to use. Then, we click on any part of the right panel to place this method in the experiment. If we want to make sure that we have selected the correct method, we can click on the “Data set / Algorithms Use Case” menu at the bottom to find further information about the selected method.
In our case, we check that “C45-C” effectively corresponds to the “C4.5 Decision Tree” according to its description. Figure 10 shows the screen used to add the C45-C method to the experiment. Furthermore, we want to test the accuracy obtained by this method. To easily check the accuracy obtained by the C4.5 decision tree, we include a visualization method. To do so, we click on the visualization panel on the left side menu. This will prompt a list of methods organized by folders. Since we are using a single classification method, we expand the “Show Results (classification)” folder and select its only method, “Vis-Clas-Check”. Now, we click on any part of the right panel to place this visualization approach in the experiment. Figure 11 shows how the visualization method is added to the experiment. Now we need to establish the execution flow of the experiment. In this case, we just need to connect the data with the method and with the visualization approach. To do so, we click on the arrow (connection) on the left side menu. Then, we connect the “data” and “C45-C” elements, clicking on the first one and dragging to the second one. We repeat this action with “C45-C” and “Vis-Clas-Check”. Figure 12 displays the current state of the KEEL screen. Finally, we click on the generate ZIP experiment button on the top menu (Figure 13). This will prompt the generation of the zip experiment. A menu will be shown to select where we want to place our experiment and how we want to name it. We select the name “c45” and we place the ZIP file in the “D:\\” folder. We have now created our KEEL experiment! However, we have not finished yet, as we still have to run the experiment. We now unzip the “c45.zip” file that has just been generated, move to its “scripts” subfolder and type “java -jar RunKeel.jar” in a console. With this command, we launch the experiment.
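The run writes .tra and .tst files that pair each instance's expected class with the predicted one after a short header of @-prefixed lines. Assuming that layout (the exact contents can vary between methods), the accuracy can be recomputed from such a file with a few lines of Python:

```python
# A fragment in the assumed KEEL result layout: header lines start with '@',
# then one "expected predicted" pair per instance.
sample_tst = """@relation iris
@data
Iris-setosa Iris-setosa
Iris-setosa Iris-setosa
Iris-versicolor Iris-virginica
Iris-virginica Iris-virginica"""

# Skip header lines, split each remaining line into (expected, predicted).
pairs = [line.split() for line in sample_tst.splitlines()
         if line and not line.startswith("@")]
hits = sum(1 for expected, predicted in pairs if expected == predicted)
accuracy = hits / len(pairs)
print(f"{accuracy:.2f}")  # 3 of the 4 instances are classified correctly
```

The same loop applied to a real .tst file from the “results” subfolder should reproduce the accuracy reported by the Vis-Clas-Check summary.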
Now we wait until the experiments are completed; this is indicated by the message “Experiment completed successfully” (Figure 15). We have now finished running our KEEL experiment! If we want to explore the results we have obtained, we have to check the contents of the “results” subfolder associated with our KEEL experiment. In this subfolder we can find several subfolders containing all the results. The “C45-C.datasetName” subfolders contain the detailed results of the C4.5 algorithm over the “datasetName” dataset. In each of these subfolders, we will find 30 files, three per partition: one .tra file containing the classification results on the training partition, one .tst file containing the classification results on the test partition, and one .txt file containing the built tree and related statistics. Figure 16 shows the content of one of these .txt files for the “iris” dataset. Moreover, in the “results” subfolder, we can find an additional subfolder named “Vis-Clas-Check”. This folder contains the summary results of the C4.5 algorithm considering the accuracy. Specifically, we will first see another subfolder named “TSTC45-C”, and in it, the .stat files with the accuracy associated with each dataset. Figure 17 shows the content of one of the .stat files associated with the “iris” dataset.
2.3.2 Advanced use case
In this example, we will test the performance of two existing methods within the KEEL software suite over some datasets, and we will compare them through statistical tests to see which method performs better. Specifically, we would like to compare the classification accuracy of an SMO support vector machine against the K-nearest neighbor classifier (from the lazy learning family), using the 5-fold DOB-SCV cross validation partitioning scheme and comparing over some datasets which are not initially included in the tool: one from the KEEL dataset repository and the other one from the UCI dataset repository.
To perform this experiment, the first step is to obtain these external datasets. We are going to use the “mammographic” classification dataset from the KEEL dataset repository. To download this data, we access the associated webpage in its standard classification section through http://www.keel.es/category.php?cat=clas. As partitions are available for this data, we download the generated partitions for 5-dobscv, as seen in Figure 18. We unzip the downloaded file. Moreover, we are also going to use the “Indian Liver Patient Dataset” (ILPD) from the UCI dataset repository. We access the repository through http://archive.ics.uci.edu/ml/index.html and we download the dataset, as seen in Figure 19. As the only available format is CSV, we obtain this format and we will process the file with KEEL. Now, we start the KEEL software suite. We select the “Data Management” option from the KEEL software suite main menu, as shown in Figure 20. Since we are going to add datasets, we select the “Import Data” option from the menu, as seen in Figure 21. To add the “mammographic” dataset we select the “Import Partitions” option (Figure 22), as we downloaded a set of partitions for this data. In the following screen (Figure 23), we have to select the location where we unzipped the downloaded files and organize them according to whether they are training or test files. Moreover, we need to specify that the data files are originally in DAT format, selecting “Keel to Keel” in the “Select Input Format” option. Before finally adding this dataset to KEEL, we find another confirmation window (Figure 24) where we need to include additional information about the data. First, we need to make sure that the “Import to the Experiments Section” checkbox is on. Then, we need to select the type of dataset and the partitioning of the data we are adding. In this case, we use the options “Real” and “DOB-SCV”, respectively. We then click on the “Save” button.
Then, a dialog asks us to provide a name for the dataset (Figure 25). We select “mammographic” and confirm this selection. Then, we are asked about the type of problem this dataset belongs to (Figure 26), where we select “Classification”. Now we have successfully imported the “mammographic” dataset and we are back at the “Import Data” menu. Since we do not have partitions for the “Indian Liver Patient Dataset” (ILPD), we now select the “Import Dataset” option (Figure 27). In the first screen that is shown (Figure 28), we have to search for the input file that contains the whole dataset and select it. We also need to include some information about the data in the “Input Format” section. Specifically, we have to select the “CSV to Keel” option and untick the “Attribute name header” option, as the first line in the CSV file does not contain any information about the attributes. Having selected all the options, we click on the “Next” button. Now, we find a confirmation window (Figure 29) where we need to include additional information about the data. As in the previous case, we need to make sure that the “Import to the Experiments Section” checkbox is on. Then, we need to select the type of dataset we are adding, which in this case will be “Real”. We then click on the “Save” button. A dialog (Figure 30) will now ask for the name of this dataset. We select “indian” and confirm this selection. Then, we are asked about the type of problem this dataset belongs to (Figure 31), where we select “Classification”. Next, we are asked whether we want to edit this dataset (Figure 32), where we answer “No”, as we do not want to make changes to the original dataset. Afterwards, we are asked if we want to create partitions for this dataset (Figure 33). In this case, we answer “Yes”, as we want to perform experiments with DOB-SCV. We are now at the partitioning screen (Figure 34). We have to select the options for the partitioning of our data.
In our case, we first select the “Indian Liver Patient Dataset” by selecting the “indian.dat” file. Then, we select the correct “Type of Partition” by choosing the “K-Fold Distribution Optimally Balanced Stratified Cross Validation” option from the list. Additionally, we have to click on the “Options” button to change the number of folds k to 5 (Figure 35). Having selected the appropriate options, we now click on the “Divide” button. First of all, we obtain a message stating that this process may take long (Figure 36). We accept it and wait for the partitions to be created (Figure 37). When they are created, we receive a message with that information (Figure 38). We can now go back to the KEEL main menu. As we have added our data, we now select the “Experiments” option from the KEEL software suite main menu, as shown in Figure 39. Now, we select the type of experiment that we want to perform. First, we select the partitioning scheme. As we want to perform a 5-fold DOB-SCV cross validation, we need to select the second bullet, “k-fold DOB-SCV”, from the “Type of partitions” menu, setting the value of k to 5. Then we select the “Type of the experiment” by clicking on the “Classification” button. This procedure is depicted in Figure 40. Now, we have to select the datasets that we want to use in this experiment. The datasets that we have just added to KEEL are available under the “User Dataset” listing. We select the “indian” and “mammographic” datasets. We also select the “Bupa” and “Ecoli” datasets from the “KEEL Datasets” listing. Now, we need to add these data to the experiment. To do so, we just have to click on any place of the right panel. Figure 41 shows how the KEEL screen has changed after adding the data to the experiment. Now, we select the methods that we want to add to the experiment. Since our data contains some missing values, we will introduce a preprocessing method to impute them.
To do so, we click on the pre-processing panel on the left side menu. This will prompt a list of pre-processing approaches organized by folders. We then expand the “Missing Values” folder and click on the MostCommon-MV method, which is the missing values imputation method that we want to use. Then, we click on any part of the right panel to place this method in the experiment. Figure 42 shows the screen including the mentioned missing values approach. As we want to compare two classifiers, we click on the methods panel on the left side menu. This will prompt a list of methods organized by folders. We then expand the “Lazy Learning” and “Support Vector Machines” folders, as they contain the methods we want to test. We click on the “KNN-C” method in the “Lazy Learning” folder and then on any part of the right panel to place this method in the experiment. Then, we do the same with the “SMO-C” method in the “Support Vector Machines” folder. Figure 43 shows the screen representing the experiment. We may want to change the parameters associated with the methods. To do so, we just have to double-click on the box containing the method whose parameters we want to change. We double-click on the “KNN-C” method and a new menu opens (Figure 44). There, we modify the “K Value” to 3, using the 3 nearest neighbors to classify. Then, we double-click on the “SMO-C” algorithm and a new menu opens (Figure 45). As we want to change the kernel of the support vector machine and its option to fit logistic models, we change the option “KERNELtype” to “RBFKernel” and “FitLogisticModel” to “True”. Furthermore, we want to test the accuracy obtained by these methods. We first want to compare the methods’ performance according to a statistical test. Since we are comparing two approaches, we will use the Wilcoxon test.
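Before placing the test, it may help to recall what it computes: the Wilcoxon signed-rank test ranks the absolute per-dataset accuracy differences between the two classifiers and sums the ranks of the positive and the negative differences. A simplified sketch in Python, with made-up accuracy values and without the tie-averaging of a full implementation:

```python
def wilcoxon_ranks(acc_a, acc_b):
    """Return (R+, R-): rank sums of positive and negative differences.

    Simplified: zero differences are dropped and tied ranks are not
    averaged, unlike a complete implementation of the test.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank  # rank 1 = smallest absolute difference
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus

# Hypothetical per-dataset accuracies for the two classifiers:
knn = [0.95, 0.70, 0.66, 0.80]
smo = [0.93, 0.74, 0.66, 0.75]
print(wilcoxon_ranks(knn, smo))  # KNN takes ranks 1 and 3, SMO takes rank 2
```

The more unbalanced R+ and R- are, the stronger the evidence that one classifier consistently outperforms the other; the KEEL test module also reports the associated p-value.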
Therefore, we click on the statistical test panel on the left side menu and expand the “Tests for Classification” folder, as we are performing a classification experiment. Among the methods, we select the Wilcoxon test, which is named “Clas-Wilcoxon-ST”, and we click on the right panel to place this test. Figure 44 shows the current state of the experiment. Moreover, we also want to obtain statistics about the accuracy obtained by the tested methods. To calculate this information we include a visualization method, clicking on the visualization panel on the left side menu. This will prompt a list of methods organized by folders. Since we are using several classification methods, we expand the “Multiple results (classif.)” folder and select one of its methods, “Vis-Clas-Tabular”, which organizes the information in tables. Now, we click on any part of the right panel to place this visualization approach in the experiment. Figure 47 shows how the visualization method is added to the experiment. Now we need to establish the execution flow of the experiment. In this case, we need to connect the data with the preprocessing method, then with the classification methods, and then both classifiers with the statistical test and the visualization approach. To do so, we click on the arrow (connection) on the left side menu. Then, we connect the “data” and “MostCommon-MV” elements, clicking on the first one and dragging to the second one. We repeat this action with “MostCommon-MV” and “KNN-C”, “MostCommon-MV” and “SMO-C”, “KNN-C” and “Clas-Wilcoxon-ST”, “KNN-C” and “Vis-Clas-Tabular”, “SMO-C” and “Clas-Wilcoxon-ST”, and “SMO-C” and “Vis-Clas-Tabular”. Figure 48 depicts the current state of the KEEL screen. Finally, we click on the generate ZIP experiment button on the top menu (Figure 49). This will prompt the generation of the zip experiment.
A menu will be shown to select where we want to place our experiment and how we want to name it. We select the name “knnvssmo” and we place the ZIP file in the “D:\\” folder. We have now created our KEEL experiment! However, we have not finished yet, as we still have to run the experiment. We now unzip the “knnvssmo.zip” file that has just been generated, move to its “scripts” subfolder and type “java -jar RunKeel.jar” in a console. With this command, we launch the experiment. Now we wait until the experiments are completed; this is indicated by the message “Experiment completed successfully” (Figure 51). We have now finished running our KEEL experiment! Now we would like to explore the results that we have obtained. To do so, we have to check the contents of the “results” subfolder associated with our KEEL experiment. In this subfolder we can find several subfolders containing all the results. First, we find a set of subfolders with names like “KNN-C.datasetName” or “SMO-C.datasetName”. These subfolders contain the detailed results of the KNN and SMO algorithms over the “datasetName” dataset. In each of these subfolders, we will find 10 files, two per partition: one .tra file containing the classification results on the training partition, and one .tst file containing the classification results on the test partition. Figure 52 shows the content of one of these .tra files for the “bupa” dataset using the KNN algorithm. Moreover, in the “results” subfolder, we can find an additional subfolder named “Vis-Clas-Tabular”. This folder contains the summary results of both the KNN and SMO algorithms considering the accuracy. Specifically, we will first see another subfolder named “TSTSMO-CvsKNN-C”, and in it, the .stat files with the accuracy associated with each dataset.
For instance, the “Summary_s0.stat” file shows a table with the average statistics of all the methods; the “datasetName_KNN-C_ConfussionMatrix_s0.stat” file shows the confusion matrix of the “KNN-C” method for the “datasetName” dataset; and the “datasetName_ByFoldByClassifier_s0.stat” file shows a table with the accuracy obtained in each fold by the methods for the “datasetName” dataset. Figure 17 shows the content of one of the .stat files associated with the “iris” dataset. Furthermore, in the “results” subfolder, we can find another subfolder named “Clas-Wilcoxon-ST”. This folder contains the results associated with the Wilcoxon statistical test. Specifically, we will first see another subfolder named “TSTSMO-CvsKNN-C”, and in it, several .stat files and a .tex file. The .stat files include the information associated with the Wilcoxon test for each used dataset. The .tex file is a LaTeX file providing the output of the Wilcoxon test over all the selected datasets. Figure 54 shows the content of the “output.tex” file.
2.4 Where to go from here
Congratulations on running your first experiments with KEEL! For an in-depth overview of the KEEL features, go into the further sections to:
3 Data Management
The following tasks can be carried out using the KEEL data management module. In Figure 3, the data management main menu is shown, featuring the available options:
3.1 Data import
The import option allows a user to transform files in different formats (TXT, Excel, XML, etc.) to the KEEL format. Notice that the design of experiments only uses datasets in the KEEL format; therefore, if you want to use your own datasets within the KEEL software suite, a previous import step will be required. Figure 56 shows the two possible options to import datasets. One option consists of importing a single dataset; the other consists of importing a set of partitions which you have available in formats other than the KEEL format. In the following, we show the process of both options.
3.1.1 Import dataset
Select this option if you want to import only a single file from another format to the KEEL format. Figure 57 shows the window for this option. To import a dataset, it is necessary to follow these steps:
Finally, the tool will ask if you agree to perform data partitions for this new dataset. For this procedure, please refer to Section 3.6 (Data partition) in this document.
3.1.2 Import partitions
Select this option if you have previously created partitions of a dataset in other formats and you want to import them to the KEEL format. This option allows the selection of a set of training and test files separately. Figure 62 shows the window for this option. To import partitions, the following parts are necessary:
3.1.3 Importing SQL databases to KEEL format
This section describes how to import databases stored in SQL format to the KEEL format. Once the Import Dataset option has been chosen within the Import menu, it is necessary to follow these steps:
3.2 Data export
Data export allows you to transform datasets in the KEEL format to the desired format (TXT, Excel, XML, HTML table and so on). Figure 69 shows the two possible options to export datasets. One option consists of exporting a single dataset; the other consists of exporting a set of partitions that you have available in the KEEL format. In what follows, we show the process of these two options.
3.2.1 Export dataset
Select this option if you want to export only a single file from the KEEL format to another format (see Figure 70). This option consists of the following parts:
If you agree with the conversion, click on the Save button and select the destination directory for the transformed dataset.
3.2.2 Export partitions
Select this option if you have previously created partitions in the KEEL format and want to export them to another format. This option allows the selection of a set of training and test files separately. Figure 75 shows the window that features this option. This option consists of the following parts:
If you agree with the conversion, click on the Save button and select the destination directory for the transformed dataset.
3.3 File formats
There are different formats of data that can be used to work with the KEEL software suite. In the following, we show the different formats available to import/export data. The last format described is the KEEL format, the one used within KEEL experiments.
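As a preview of the conversions these subsections describe, the sketch below turns a tiny CSV table into the KEEL layout: an ARFF-like header (@relation, @attribute, @inputs, @outputs) followed by @data rows, as described in Section 3.3.10. The attribute names and relation name here are made up for the illustration, and a real converter would also handle nominal attributes and missing values:

```python
import csv
import io

# Tiny CSV input with no header line; the last column is the class.
raw = """4.6,3.1,1.5,0.2,1
6.9,3.1,4.9,1.5,0"""
rows = list(csv.reader(io.StringIO(raw)))

names = ["attr1", "attr2", "attr3", "attr4"]  # hypothetical attribute names
lines = ["@relation example"]
for i, name in enumerate(names):
    column = [float(row[i]) for row in rows]
    # Real attributes carry their observed [min, max] range in the header.
    lines.append(f"@attribute {name} real [{min(column)}, {max(column)}]")
lines.append("@attribute class {0, 1}")
lines.append("@inputs " + ", ".join(names))
lines.append("@outputs class")
lines.append("@data")
lines.extend(", ".join(row) for row in rows)
keel_dat = "\n".join(lines)
print(keel_dat)
```

The resulting text is what the import wizards of Section 3.1 produce automatically from CSV, TXT and the other source formats below.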
3.3.1 CSV data file format
The CSV file (comma-separated values) is one implementation of a delimited text file, which uses a comma to separate values. The CSV file format is very simple and is supported by almost all spreadsheets and database management systems. The characteristics associated with the CSV file format are the following:
A CSV (Comma-Separated Values) data file is usually built following this file format:
attribute1, attribute2, ..., attributeN
value11, value12, ..., value1N
...
valueM1, valueM2, ..., valueMN
An example of a valid CSV file is:
FirstName, LastName, Company, EmailAddress
Johnathan,Doe,"ABC Company","johndoe@abccompany.com"
Harrie,Wong,"Company Inc.","hwong@myprovider.com"
Mary,"Jo Smith","Any Corp.","mjsmith@myprovider.com"
In the following example we can see the use of some of the rules explained before, such as the null value expressed as two consecutive commas and the use of double quotes so that the comma character can be used as part of the data and not as a separator:
OBS,CAREXPEND,DISPOSINC,DOLLARVALUE,WAGES
"1960:1",14.2,362,,270.7 "1960:2",14.1,365.9,,273.4 "1960:3",14.6,367.6,,273.9 "1960:4",13.2,369.2,,273.3 "1961:1",10.8,72.9,,273.7 "1961:2",11.7,378.4,,277.6 "1961:3",12.2,385.1,,282.2 "1961:4",13.7,393.2,,288.4 3.3.2 TXT and TSV data file formatA TXT (Text Separated by Tabs) or TSV (Tab-Separated Values) file is a simple text format that allows tabular data to be exchanged between applications with different internal formats. Values separated by tabs have been officially registered as a MIME type (Multipurpose Internet Mail Extensions) under the name text/tab-separated-values. The characteristics associated with the TXT or TSV file format are the following:
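As an illustration of the rules above (quoted fields that may contain commas, and two consecutive commas denoting a null value), a minimal reader can be sketched with Python's standard csv module; the same sketch reads tab-separated files by changing the delimiter. This is an illustrative sketch, not KEEL's own importer:

```python
import csv
import io

def read_delimited(text, delimiter=","):
    """Parse delimiter-separated text into dicts; empty fields become None (null)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    return [dict(zip(header, (v if v != "" else None for v in row)))
            for row in data]

# Two consecutive commas yield a null DOLLARVALUE field;
# the quoted "1960:1" field is read with the quotes stripped.
sample = 'OBS,CAREXPEND,DISPOSINC,DOLLARVALUE,WAGES\n"1960:1",14.2,362,,270.7\n'
records = read_delimited(sample)
```

Passing `delimiter="\t"` to the same function covers the TSV layout described in this section.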
A TXT (Text Separated by Tabs) or TSV (Tab-Separated Values) data file is usually built following this file format: attribute1<TAB>attribute2<TAB>...<TAB>attributeN
value11<TAB> value12<TAB> ... <TAB> value1N ... valueM1<TAB> valueM2<TAB> ... <TAB> valueMN An example of a valid TXT or TSV file is: FirstName <TAB> LastName <TAB> Company <TAB> EmailAddress
Johnathan <TAB> Doe <TAB> ABC Company <TAB> johndoe@abccompany.com Harrie <TAB> Wong <TAB> Company Inc. <TAB> hwong@myprovider.com Mary <TAB> Jo Smith <TAB> Any Corp. <TAB> mjsmith@myprovider.com 3.3.3 PRN data file formatThis format has the same features and restrictions as the CSV format. The main difference is the separator between fields, which in the PRN format is a blank space. However, spaces in the PRN format play a different role than in CSV files. The characteristics associated with the PRN file format are the following:
PRN files have their data separated by blank spaces. A PRN data file is usually built following the format shown in Figure 82. An example of a valid PRN file is (Figure 83): OBS DELL GE YAHOO
1 26.99 48.5 22.92 2 26 49.93 20.83 3 26.24 49.96 20.13 4 25.76 49.48 19.98 5 26.73 49.43 19.74 6 24.93 49.83 18.86 7 25.84 49.01 18.23 8 25.91 49.73 17.79 9 24.6 50.15 17.1 3.3.4 DIF data file formatA DIF (Data Interchange Format) file is a text file used to exchange data between different spreadsheet programs such as Excel, StarCalc, dBase, and so on. This type of file is stored with the DIF extension. The characteristics associated with the DIF file format are the following:
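Since PRN fields are separated by blank spaces, rows like the PRN example above can be read by simply splitting each line on whitespace. The following is an illustrative sketch, not KEEL's importer:

```python
# Illustrative PRN reading: split the header row and each data row on whitespace.
prn_text = """OBS DELL GE YAHOO
1 26.99 48.5 22.92
2 26 49.93 20.83
"""
prn_lines = prn_text.strip().splitlines()
prn_header = prn_lines[0].split()
prn_rows = [dict(zip(prn_header, line.split())) for line in prn_lines[1:]]
```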
An example of a valid DIF file is the vehicles table (Month, Week, Vehicle, Quantity). The internal format of the generated DIF file is the following: TABLE 0,1 "EXCEL" VECTORS 0,4 "" TUPLES 0,4 "" DATA 0,0 "" -1,0 BOT 1,0 "Month" 1,0 "Week" 1,0 "Vehicle" 1,0 "Quantity" -1,0 BOT 1,0 "January" 0,1 V 1,0 "Auto" 0,105.000 V -1,0 BOT 1,0 "January" 0,1 V 1,0 "Lorry" 0,1.050 V -1,0 BOT 1,0 "January" 0,1 V 1,0 "Bus" 0,1.575 V -1,0 EOD 3.3.5 C4.5 data file formatData files can also be encoded according to the C4.5 format. This format consists of two files: a names file with the NAMES extension and a data file with the DATA extension. The characteristics associated with the NAMES file are the following:
A NAMES file is usually built following this file format: class-1, class-2, ..., class-N.
characteristic-1: domain. characteristic-2: domain. ... characteristic-M: domain. The characteristics associated with the DATA file are the following:
A DATA file is usually built following this file format: An example of a valid C4.5 data file is:
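The NAMES layout just described (a line of classes, then one "name: domain" line per attribute, each ended by a period) can be parsed in a few lines. The attribute names and values below are invented for illustration; this is a sketch, not KEEL's C4.5 importer:

```python
# Hypothetical NAMES content following the layout described above.
names_text = """good, bad.
outlook: sunny, overcast, rainy.
temperature: continuous.
"""
# Strip the trailing period from each line, then split classes and domains.
names_lines = [l.strip().rstrip(".") for l in names_text.strip().splitlines()]
classes = [c.strip() for c in names_lines[0].split(",")]
attributes = {}
for line in names_lines[1:]:
    name, domain = line.split(":", 1)
    attributes[name.strip()] = [v.strip() for v in domain.split(",")]
```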
3.3.6 Excel data file formatMicrosoft Excel is a spreadsheet program written and distributed by Microsoft. It is currently one of the most widely used spreadsheet suites for operating systems like Microsoft Windows and Apple OS X. Microsoft Excel is integrated as part of the Microsoft Office suite. A spreadsheet is a program that allows you to manipulate numerical and alphanumeric data. Spreadsheets are arranged in rows and columns. The intersection of a row and a column is called a cell. Each cell can contain data or a formula that can refer to the contents of other cells. A spreadsheet contains 256 columns, labeled with letters (from A to IV), and 65,536 rows, labeled with numbers, making a total of 16,777,216 cells per spreadsheet. Because of the versatility of modern spreadsheets, they are sometimes used to build small databases, reports, and other documents. The Microsoft Excel format has the XLS extension. An example of a valid Excel file is: 3.3.7 Weka data file formatWeka (Waikato Environment for Knowledge Analysis) is a suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License, and it is a popular tool for machine learning and data analysis. Its files are stored by default with the ARFF extension. The characteristics associated with the ARFF file format are the following:
Some additional specifications of the ARFF format are:
A Weka data file is usually built following the format shown in Figure 98: @relation <relation-name>
@attribute <attribute-name-1> <datatype> ... @attribute <attribute-name-N> <datatype> @data value11,value12,...,value1N ... valueM1,valueM2,...,valueMN An example of a valid Weka file is shown in Figure 99: % Comment
@relation weather @attribute outlook {sunny, overcast, rainy} @attribute temperature real @attribute humidity real @attribute windy {TRUE, FALSE} @attribute play {yes, no} @data sunny,85,85,FALSE,no sunny,80,90,TRUE,no overcast,83,86,FALSE,yes rainy,70,96,FALSE,yes rainy,68,80,FALSE,yes 3.3.8 XML data file formatXML (eXtensible Markup Language) is a set of rules to define semantic labels that organize a document into different parts. XML is a meta-language that defines the syntax used to define other structured label languages. Not all XML files describe data files. In the following, the basic features of the XML format are described, with special interest in how these files are built to store data:
An XML data file for the KEEL suite is usually built following the format shown in Figure 100: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root> <row1> <att-name-1>att-value-11</att-name-1> <att-name-2>att-value-12</att-name-2> <att-name-N>att-value-1N</att-name-N> </row1> ... <rowM> <att-name-1>att-value-M1</att-name-1> <att-name-2>att-value-M2</att-name-2> <att-name-N>att-value-MN</att-name-N> </rowM> </root> Another XML data file format valid for the KEEL suite is shown in Figure 101: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root> <row1> <field name="att-name-1">att-value-11</field> <field name="att-name-2">att-value-12</field> <field name="att-name-N">att-value-1N</field> </row1> ... <rowM> <field name="att-name-1">att-value-M1</field> <field name="att-name-2">att-value-M2</field> <field name="att-name-N">att-value-MN</field> </rowM> </root> An example of a valid XML file is depicted in Figure 102: <?xml version="1.0" encoding="UTF-8"?>
<root> <customer> <id>5</id> <course>66</course> <name>My book</name> <summary>Book summary</summary> <numbering>2</numbering> <disableprinting>0</disableprinting> <customtitles>1</customtitles> <timecreated>1114095924</timecreated> <timemodified>1114097355</timemodified> </customer> <customer> <id>6</id> <course>207</course> <name>My book</name> <summary>A test summary</summary> <numbering>1</numbering> <disableprinting>0</disableprinting> <customtitles>0</customtitles> <timecreated>1114095966</timecreated> <timemodified>1114095966</timemodified> </customer> </root> In this example there are:
The following example (Figure 103) presents another XML data structure that contains the same data as the previous example. <?xml version="1.0" encoding="UTF-8"?>
<root> <row> <field name="id">5</field> <field name="course">66</field> <field name="name">My book</field> <field name="summary">Book summary</field> <field name="numbering">2</field> <field name="disableprinting">0</field> <field name="customtitles">1</field> <field name="timecreated">1114095924</field> <field name="timemodified">1114097355</field> </row> <row> <field name="id">6</field> <field name="course">207</field> <field name="name">My book</field> <field name="summary">A test summary</field> <field name="numbering">1</field> <field name="disableprinting">0</field> <field name="customtitles">0</field> <field name="timecreated">1114095966</field> <field name="timemodified">1114095966</field> </row> </root> 3.3.9 HTML data file formatHTML (HyperText Markup Language) is the predominant markup language for web pages. It provides a means to describe the structure of text-based information in a document (denoting certain text as headings, paragraphs, lists, and so on) and to supplement that text with interactive forms, embedded images, and other objects. HTML is written in the form of labels (known as tags) surrounded by angle brackets. HTML is an application of SGML according to the international standard ISO 8879. XHTML is a reformulation of HTML 4 as an application of XML 1.0 that, by following a set of rules, maintains compatibility with user agents that already support HTML 4. The basic HTML tags are:
An HTML file is usually built following the previously described format, shown in Figure 104. The HTML table model enables the arrangement of data like text, preformatted text, images, links, forms, form fields, other tables, and so on, into rows and columns of cells. Tables are defined with the <TABLE> tag. A table is divided into rows (with the <TR> tag), and each row is divided into data cells (with the <TD> tag). The TD tag stands for table data, which is the content of a data cell. A data cell can contain text, images, lists, paragraphs, forms, horizontal rules, tables, etc. The different tags which define the structure of the table for obtaining a valid data file are:
An HTML data file valid for KEEL is usually built following the format shown in Figure 105: <table>
<tr> <th>Header 1</th> <th>Header 2</th> <th>Header 3</th> </tr> <tr> <td>Value 1</td> <td>Value 2</td> <td>Value 3</td> </tr> <tr> <td>Value 4</td> <td>Value 5</td> <td>Value 6</td> </tr> </table> An example of a valid HTML file is the following (Figure 106): <html>
<head> <h1 align="center">VEHICLES</h1> </head> <body> <table border="1" cellspacing="1" cellpadding="0"> <tr align="center"> <td>Month</td> <td>Week</td> <td>Vehicle</td> <td>Amount</td> </tr> <tr> <td>January</td> <td>1</td> <td>Car</td> <td>105.0</td> </tr> <tr> <td>January</td> <td>1</td> <td>Truck</td> <td>1.05</td> </tr> <tr> <td>January</td> <td>1</td> <td>MotorBike</td> <td>1.575</td> </tr> </table> </body> </html> 3.3.10 KEEL data file formatAll the other data formats described in this section can be imported/exported to the KEEL data file format. This format is used in KEEL experiments and associated operations. KEEL data files are represented as plain ASCII text files with the DAT extension. Each KEEL data file is composed of two sections:
Comments are allowed in both sections using the "%" character. The header is composed of the following metadata:
The @inputs and @outputs definitions are optional. If they are missing, all the attributes will be considered as input attributes, except the last, which will be considered as the output attribute. @relation bupa2
@attribute mcv nominal {a,b,c} @attribute alkphos integer [23, 138] @attribute sgpt integer [4, 155] @attribute sgot integer [5, 82] @attribute gammagt integer [5, 297] @attribute drinks real [0.0, 20.0] @attribute selector {true,false} @inputs mcv, alkphos, sgpt, sgot, gammagt, drinks @outputs selector The data instances are represented as rows of comma separated values, where each value corresponds to one attribute, in the order defined by the header. Missing or null values are represented as <null> or ?. If the dataset corresponds to a classification problem, the output type must be nominal: ...
@attribute selector {true,false} ... @outputs selector @data a, 92, 45, 27, 31, 0.0, true a, 64, 59, 32, 23, <null>, false b, 54, <null>, 16, 54, 0.0, false If the dataset corresponds to a regression problem, the output type must be real: ...
@attribute selector real [0.0, 20.0] ... @outputs selector @data a, 92, 45, 27, 31, 0.0, 0.9 a, 64, 59, 32, 23, <null>, 17.5 b, 54, <null>, 16, 54, 0.0, 3.5 A full example of a valid KEEL file is shown in Figure 110: % Comment
@relation bupa2 @attribute mcv nominal {a,b,c} @attribute alkphos integer [23, 138] @attribute sgpt integer [4, 155] @attribute sgot integer [5, 82] @attribute gammagt integer [5, 297] @attribute drinks real [0.0, 20.0] @attribute selector {true,false} @inputs mcv, alkphos, sgpt, sgot, gammagt, drinks @outputs selector @data a, 92, 45, 27, 31, 0.0, true a, 64, 59, 32, 23, <null>, false b, 54, <null>, 16, 54, 0.0, false a, 78, 34, 24, 36, 0.0, false a, 55, 13, 17, 17, 0.0, false b, 62, 20, 17, 9, 0.5, true c, 67, 21, 11, 11, 0.5, true a, 54, 22, 20, 7, 0.5, true 3.4 Visualize dataThe visualization options provide graphical information about existing KEEL datasets. There are different options related to this graphical information: a user can choose to view the content of a dataset, see specific information about the attributes, or compare two attributes using charts. Figure 111 shows the main window of the visualization menu. First of all, the user must select the path of the source dataset (in KEEL format) to be visualized (see Figure 112). When the file is loaded, different information about the dataset is shown according to the selected option. 3.4.1 Dataset viewIf the user chooses to visualize the dataset information, the content of the selected dataset is shown in plain text form. The data cannot be modified; the user can only visualize it (see Figure 113). 3.4.2 Attribute infoIn this option, the user can obtain detailed information about the attributes defined in the dataset. The information shown is the attribute's type (integer, real or nominal) and whether the attribute is an input or an output. Below the attribute information, there are two additional areas that provide further information about the attribute selected in the attribute list. On the left side, textual information about the attribute is shown. This information depends on the attribute type.
If the attribute is integer or real, the range of values, average and variance associated with the data are shown. In the case of a nominal attribute, only its possible values are displayed. On the right side, graphical information about the selected attribute is provided. Specifically, the distribution of the attribute's values is shown through a chart. Figure 114 shows how this information is organized for a real attribute and Figure 115 shows the information provided for a nominal attribute. 3.4.3 Charts 2DThis option enables a user to contrast a pair of different attributes. To do so, the user has to select the two attributes to be compared. There are two drop-down lists for selecting them; each of these lists contains all the attributes of the dataset (see Figure 116). When the attributes are selected, the user has to click the View chart button, and a graphic depicting the values of these attributes is shown (Figure 117). To include the generated chart in another document, the user can use the buttons Convert to PNG, which saves the graph as a PNG image, and Convert to PDF, which saves the chart as a PDF document (Figure 118). 3.5 Edit dataThe edit data feature allows a user to edit any existing KEEL dataset in order to add new attributes, delete others, correct some errors within the data, and so on. Figure 119 shows the main window of the edit menu. First of all, the user must select the path of the source dataset (in KEEL format) to be edited (see Figure 120). When the file is loaded, its content appears below the Load Dataset option, organized as a table in the Data area. Modifications to this dataset can be performed on both the instances and the variables. In the following, we address how a user can alter the values in a dataset in both ways.
3.5.1 Data editionThis option enables a user to add new instances, delete existing instances or modify any of the available instances in the data (see Figure 121). To do so, the user has to interact with the table that displays the dataset information and with its associated buttons. The operations that can be performed are:
3.5.2 Variable editionIn this option, different modifications to the variables of the selected dataset can be carried out (see Figure 122). To do so, the user has to interact with the table that displays the dataset information and with its associated buttons. The operations that can be performed are:
When all the changes to the data have been applied, the user can save them to a file by clicking the Save button. 3.6 Data partitionThe data partition feature enables a user to create partitions from an existing dataset in KEEL format. Figure 123 shows the main window of this option. To create partitions from a given dataset, the user has to follow these steps:
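As a rough sketch of what the partitioning option produces, the following shows a plain k-fold split of a list of instances (round-robin assignment; KEEL's actual partitioning, including stratified variants, may differ):

```python
def k_fold_partitions(instances, k=10):
    """Return k (train, test) pairs, as in k-fold cross validation."""
    folds = [instances[i::k] for i in range(k)]   # round-robin fold assignment
    pairs = []
    for i in range(k):
        test = folds[i]
        # The training set is the union of all other folds.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        pairs.append((train, test))
    return pairs

pairs = k_fold_partitions(list(range(10)), k=5)
```

Each pair corresponds to one training/test file pair written by the tool.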
4 Experiment DesignThe goal of the Experiment Design section is to allow a user to create the desired experiments using a graphical interface. To do so, the user employs the available datasets and algorithms to generate a file containing a folder structure with all the files needed to run the designed experiments on the processing unit selected by the user. In this way, the user only needs to select the input data (datasets), the algorithms to be tested, and the connections that define the processing flow to be run. It is possible to concatenate methods, insert statistical tests, and so on. Moreover, the tool allows an easy configuration of the parameters associated with each method: they can be set using the graphical interface, without external configuration files. This part of KEEL has two main objectives: on the one hand, a user can use the software as a test and evaluation tool during the development of an algorithm. On the other hand, it is also a good option for comparing new developments with standard algorithms already implemented and available in the KEEL software suite 3.0.
4.1 Configuration of experimentsWhen the Experiments option is selected, the main window of the Experiments module will appear (Figure 126): First, it is necessary to select the type of experiment and the type of partitions to employ; the options selected will determine the kind of methods and datasets that will be available to design the experiment. The types of partitions available (as shown in Figure 127) are the following ones:
Currently, the KEEL Experiments module offers the following types of experiments:
When the type of experiment has been selected, the datasets selection panel will be shown, allowing the user to continue with the experiment design. 4.2 Selection of datasetsThe datasets selection panel shows the available datasets for the current experiment. Its contents depend on the type of experiment already selected: The next step is to choose the desired datasets from the panel. The Select All and Invert buttons make the selection easier: The Import button allows importing an existing dataset into the KEEL environment, ready to be selected for the current experiment. By clicking on it, the main window of the Data Import Tool will be shown. The process to import a new dataset is described in the Data Management section of the manual (Section 3.1). If a new dataset is added, new buttons will appear allowing the user to Invert the current selection of user datasets, or to Select All of them. Furthermore, it is possible to add even more datasets (with the Import button), or to Remove the selected datasets. When all the necessary datasets are selected, the experiment design process can continue. To do so, the user must click on the white graph panel to set the datasets node of the experiment. At this point, the KEEL Experiments module will check whether all the necessary partitions of the currently selected datasets are present. If missing partitions are found (e.g. if the user selected a k value different from the sets available in the standard distribution), the tool will prompt the following message: Clicking Yes will result in the generation of the missing partitions inside the KEEL environment. If the user chooses not to generate the partitions, this warning will be shown again before the generation of the experiment graph. 4.3 Experiment GraphThe experiment graph shows the components of the current experiment and describes the relationships between them.
The user can add new components by using the left menu: This menu has the following categories available: Datasets: Modify the datasets of the experiment. Preprocessing methods: Preprocessing over the initial datasets. Standard methods: Data mining methods. Postprocessing methods: Post-processing over the results of standard methods. Statistical tests: Statistical procedures to contrast the results achieved in the experiment. Visualization modules: Show the results of the experiments in an upgraded way. Connections: Links between the components of the experiment. 4.3.1 DatasetsThis module lets the user edit the datasets currently selected for the experiment. As in the Select Datasets panel, the user can still Add and Delete datasets in the experiment (from those already registered in the KEEL environment). It is also still possible to import new datasets. Furthermore, the Edit button allows the user to indicate which partitions (training and test) to use. This way, it is possible to temporarily alter the files which will be included in the experiment. This dialog shows the initial files of the dataset. From it, it is possible to Remove a pair of training/test files, or to Remove All files. The dialog also allows the user to Add new pairs of training and test files. To do so, they must be selected by using the search buttons. Finally, it is also possible to add a complete set of k-fold cross validation files by selecting the appropriate number of folds and pressing the Add k-fold cv button. 4.3.2 Preprocessing methodsThis category includes several preprocessing methods
To add a preprocessing method to the current experiment, simply select it and click on the experiment graph: 4.3.3 Standard MethodsThis category includes the data mining methods included in the KEEL software suite:
To add a method to the current experiment, simply select it and click on the experiment graph: 4.3.4 Post-processing methodsThis category includes the postprocessing methods included in the KEEL software suite:
To add a postprocessing method to the current experiment, simply select it and click on the experiment graph: 4.3.5 Statistical testsThis category includes several statistical modules available to contrast experiments performed with the KEEL software suite:
To add a statistical procedure to the current experiment, simply select it and click on the experiment graph: Additionally, a full module is available for carrying out non-parametric statistical tests on the results obtained from experiments developed with KEEL or with any other software tool. Please refer to the content of this manual regarding this specific module (Section 7.2). 4.3.6 Visualization modulesThis category includes several visualization modules developed to analyze and summarize the results achieved in the experiments:
To add a visualization module to the current experiment, simply select it and click on the experiment graph: 4.3.7 ConnectionsConnections complete the design of the experiment by linking the included modules with arcs that represent the data flow of the experiment. They can act as both inputs and outputs of the modules.
4.4 Graph ManagementThe graph allows performing the following operations on its elements:
4.5 Algorithm parameters configurationOnce a module has been inserted in the graph, it is possible to configure the values of its parameters. To do so, the user has to double-click on the algorithm symbol, and a dialog will be shown (Figure 151). At the top of this dialog, it is possible to set the number of times that the algorithm will be executed (only available for random methods). Each execution will use a seed generated from the initial seed. The second list allows specifying on which datasets the parameters will be changed. In the table located in the center of the window, all the algorithm parameters are set to their initial values. These values can be modified, as long as the new values are appropriate for the specific method; otherwise, an error message will appear, as shown in Figure 152. Finally, the Default Values button returns all parameters to their default values. 4.6 Generation of ExperimentsOnce an experiment has been designed, the user can generate it through the Run Experiment option of the Tools menu. Alternatively, it is possible to use the tool bar button. At this point, the software tool will perform several checks on the completeness of the experiment. First, if it detects that there are missing partitions for some of the datasets employed, the following dialog will be shown, allowing the user to regenerate them: This is the last opportunity to generate them; otherwise, the experiment will be generated incorrectly. Second, if some of the elements of the graph are not connected by flows, the following warning will be prompted, and the isolated nodes will be discarded. If everything is correct, the user will have to select a path for the experiment's zip file: The generation process creates a ZIP file containing all the elements needed to run the experiment. If the experiment generation is completed successfully, the following message will be shown.
The experiment must be run using the RunKeel jar file located at "experiment/scripts". The following picture shows an example of the directory structure that is created. Four directories are created:
4.7 Menu barEach item of the menu bar contains different submenus. These are the different options available:
5 Running KEEL ExperimentsThis section describes the procedure that needs to be followed in order to run and visualize an existing KEEL experiment from the ZIP file generated with the experiment design process.
5.1 Deploying a KEEL experimentIn order to launch a KEEL experiment, the user has to previously design that experiment using the KEEL software suite (following the procedure described in Section 4.6). This will create a ZIP file containing all the files needed to run the experiment. First of all, the user has to unzip this ZIP file on the machine that will run the experiment (this does not have to be the same machine that was used to create the experiment, but it needs to be able to run a Java Virtual Machine, version 1.7 at least). The user will obtain a directory called "experimentName" (the name the user gave to the experiment). Then, the user has to move into that "experimentName" folder, and then into the "scripts" subfolder. To run the experiment, the user just has to run the "java -jar RunKeel.jar" command. The experiment is thus executed. When it finishes, the message "Experiment completed successfully" appears at the command prompt. 5.2 Viewing the experiment resultsOnce an experiment run has finished, the associated result files can be found in the results\ subdirectory of the experiment. Depending on the type of methods used, the following directories and files will be available:
On the other hand, note that the new datasets obtained as the result of the execution of a preprocessing method will be placed in the datasets\ directory of the experiment, so that they can be further used by linked methods in the same experiment. 6 Teaching moduleThis module aims to help teachers and students better understand the working procedure of the Data Mining process, for educational purposes. In what follows, we recall the features of the KEEL software and the usage of this particular section.
6.1 IntroductionKEEL is a software tool developed to build and use different Data Mining models. We would like to remark that this is the first software tool of this type containing a free code Java library of Evolutionary Learning Algorithms. The main features of KEEL are:
We can distinguish three parts in the graphic environment:
6.2 Menu BarEach item of the menu bar contains different submenus. These are the different options available:
6.3 Tools BarThere are two tool bars in this program. One of them appears under the menu bar; pressing its buttons gives access to the most frequently used options that appear in the menus. It looks like Figure 171: The other one is located on the left of the main window, and it contains buttons to perform specific design operations. It looks like the one shown in Figure 172: If you hover the mouse over a button, a short description of it will appear. 6.4 Status BarThe status bar is located at the bottom of the window (Figure 173). Information about the action being carried out appears here, helping the user to understand the meaning of each command or button. 6.5 Experiment Graph
6.5.1 Datasets
6.5.2 Algorithms
6.5.3 ConnectionsThey allow you to connect the outputs of an algorithm (or a dataset) to the inputs of another algorithm, creating a data flow that will be run later.
6.5.4 Interface ManagementIn this section we will see some additional considerations about other possibilities that this application provides.
7 KEEL ModulesIn this section, we introduce several modules that are included in KEEL for particular purposes. Specifically, three different modules have been developed:
All these modules are described throughout the following sections.
7.1 Imbalanced Learning ModuleIn many supervised learning applications, there is a significant difference between the prior probabilities of the different classes. This situation is known as the class imbalance problem, and it is common in many real problems from telecommunications, the web, finance, ecology, biology, medicine and so on. For this reason, it has been considered one of the top problems in data mining today. Furthermore, it is worth pointing out that the minority class is usually the one with the highest interest from the learning point of view, and misclassifying it usually implies a high cost. The KEEL Software Suite has taken this significant classification scenario into account and includes a complete framework for experimenting with this type of problem. In this section, we briefly introduce the features of classification with imbalanced datasets and describe how it is addressed in KEEL.
7.1.1 Introduction to classification with imbalanced datasetsThe hitch with imbalanced datasets is that standard classification learning algorithms are often biased towards the majority class (known as the "negative" class), and therefore there is a higher misclassification rate for the minority class instances (called the "positive" examples). Since most of the standard learning algorithms assume a balanced training set, they may generate suboptimal classification models, i.e. models with good coverage of the majority examples but frequent misclassification of the minority ones. Therefore, algorithms that behave well in the framework of standard classification do not necessarily achieve the best performance for imbalanced datasets. There are several reasons behind this behavior:
Therefore, throughout the last years, many solutions have been proposed to deal with this problem, both for standard learning algorithms and for ensemble techniques. They can be categorized into three major groups:
Most of the studies on the behavior of several standard classifiers in imbalanced domains have shown that the significant loss of performance is mainly due to the skewed class distribution, given by the imbalance ratio (IR), defined as the ratio of the number of instances in the majority class to the number of examples in the minority class. In imbalanced domains, the evaluation of the classifiers' performance must be carried out using specific metrics that take the class distribution into account. In particular, four metrics can be employed for computing the classification performance of both the positive and negative classes independently:
Since in this classification scenario we intend to achieve good quality results for both classes, the individual measures of the positive and negative classes need to be combined, as none of these measures is adequate by itself. A well-known approach to unify these measures and produce an evaluation criterion is the Receiver Operating Characteristic (ROC) graphic. This graphic visualizes the trade-off between benefits (TPrate) and costs (FPrate), showing that no classifier can increase the number of true positives without also increasing the false positives. The Area Under the ROC Curve (AUC) corresponds to the probability of correctly identifying which one of two stimuli is noise and which one is signal plus noise, and it provides a single measure of a classifier’s performance for evaluating which model is better on average. Figure 184 shows how to build the ROC space by plotting the TPrate (Y-axis) against the FPrate (X-axis) on a two-dimensional chart. The points (0,0) and (1,1) are trivial classifiers whose predicted class is always the negative and the positive one, respectively, whereas the point (0,1) represents the perfect classifier. The AUC measure is computed as the area of the resulting graphic, which for a single classifier point reduces to AUC = (1 + TPrate − FPrate) / 2. Apart from the AUC measure, it is also common to use the geometric mean (GM) of the true positive and true negative rates (TPrate and TNrate) obtained by the classifier, given by GM = √(TPrate · TNrate).
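For a single (discrete) classifier, the ROC space contains one point and both measures reduce to simple arithmetic. A minimal sketch with our own helper names (not KEEL code):

```python
import math

# AUC for a single classifier point in ROC space: the area under the
# polygon (0,0) -> (FPrate, TPrate) -> (1,1), which simplifies to
# (1 + TPrate - FPrate) / 2.
def auc_single_point(tp_rate, fp_rate):
    return (1.0 + tp_rate - fp_rate) / 2.0

# Geometric mean of the true positive and true negative rates.
def geometric_mean(tp_rate, tn_rate):
    return math.sqrt(tp_rate * tn_rate)
```

Note that GM is 0 whenever one class is completely misclassified, which is exactly why it penalizes classifiers that ignore the minority class.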
7.1.2 Imbalanced Experiments Design: Offline module

To access this part of the software, click on Modules in the first frame of the program and then select Imbalanced Learning, as shown in Figure 185. A new window will appear with the same appearance as the standard “Design of Experiments” framework (please refer to Section 4.3). Accordingly, all bars follow exactly the same pattern, i.e. the menu, tool, and status bars. In fact, the whole process of preparing an experiment follows the same scheme as in the standard “Offline experiments” module, described in Sections 4.3, 4.4 and 4.6. However, we must point out several significant differences between the two scenarios regarding the Experiment Graph: (1) Datasets, (2) Preprocessing methods, (3) Algorithms, and (4) Statistical tests and Visualization:
7.2 Statistical tests Module

The goodness of a given approach cannot be measured only in terms of the improvement in mean performance. Significant differences must be found among the different algorithms in order to conclude that the one achieving the highest average result behaves best. For this reason, the KEEL Software Suite includes several hypothesis testing techniques that provide statistical support for the analysis of results. Specifically, we use non-parametric tests, since the initial conditions that guarantee the reliability of parametric tests may not be satisfied, causing a statistical analysis based on them to lose credibility. Interested readers can find additional information on the website http://sci2s.ugr.es/sicidm/.
7.2.1 Introduction to statistical test

The experimental analysis of the performance of a new method is a crucial and necessary task in Data Mining and Computational Intelligence research. Deciding when one algorithm is better than another may not be a trivial task.

Hypothesis testing and p-values: In inferential statistics, sample data are primarily employed in two ways to draw inferences about one or more populations. One of them is hypothesis testing. The most basic concept in hypothesis testing is the hypothesis itself, which can be defined as a prediction about a single population or about the relationship between two or more populations. Hypothesis testing is a procedure in which sample data are employed to evaluate a hypothesis. There is a distinction between a research hypothesis and a statistical hypothesis. The first is a general statement of what the researcher predicts. In order to evaluate a research hypothesis, it is restated within the framework of two statistical hypotheses: the null hypothesis, denoted H0, and the alternative hypothesis, denoted H1. The null hypothesis is a statement of no effect or no difference. Since the research hypothesis generally predicts the presence of a difference with respect to whatever is being studied, the null hypothesis will generally be a hypothesis that the researcher expects to be rejected. The alternative hypothesis is a statistical statement indicating the presence of an effect or a difference; in this case, the researcher generally expects it to be supported. An alternative hypothesis can be nondirectional (two-tailed) or directional (one-tailed). The first type does not predict a specific direction, e.g. H1: μ ≠ 100. The latter implies a choice of one of the directional alternative hypotheses, e.g. H1: μ > 100 or H1: μ < 100.
Upon collecting the data for a study, the next step in the hypothesis testing procedure is to evaluate the data through the appropriate inferential statistical test. An inferential statistical test yields a test statistic. This value is interpreted by employing special tables that contain information about the expected distribution of the test statistic. Such tables contain extreme values of the test statistic (referred to as critical values) that are highly unlikely to occur if the null hypothesis is true, and they allow a researcher to determine whether or not the results of a study are statistically significant. The conventional hypothesis testing model employed in inferential statistics assumes that, prior to conducting a study, the researcher stipulates whether a directional or nondirectional alternative hypothesis is employed, as well as the level of significance at which the null hypothesis is to be evaluated. The probability value which identifies the level of significance is represented by α. When one employs the term significance in the context of scientific research, it is instructive to distinguish between statistical significance and practical significance. Statistical significance only implies that the outcome of a study is highly unlikely to have occurred as a result of chance; it does not necessarily mean that any difference or effect detected in a set of data is of any practical value. For example, no one would normally care whether algorithm A in continuous optimization solves the sphere function to within 10^-10 of the global optimum while algorithm B solves it to within 10^-15: a statistically significant difference could be found between them, but in a practical sense that difference is irrelevant. Instead of stipulating a level of significance α a priori, one could calculate the smallest level of significance that results in the rejection of the null hypothesis.
This is the definition of the p-value, a useful and interesting datum for many consumers of statistical analyses. A p-value provides information about whether a statistical hypothesis test is significant or not, and it also indicates something about how significant the result is: the smaller the p-value, the stronger the evidence against the null hypothesis. Most importantly, it does this without committing to a particular level of significance. The most common way of obtaining the p-value associated with a hypothesis is by means of normal approximations: once the statistic associated with a statistical test or procedure has been computed, a specific expression or algorithm yields a z value, which follows a normal distribution. Then, using normal distribution tables, we can obtain the p-value associated with z.
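The last step above (turning a z value into a p-value) needs no tables if the standard normal CDF is available, e.g. via the error function in Python’s standard library. A generic sketch, not KEEL code:

```python
import math

# Standard normal cumulative distribution function via the error function.
def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Two-sided p-value for a z statistic obtained from a normal approximation:
# the probability of a value at least as extreme as |z| in either tail.
def p_value_two_sided(z):
    return 2.0 * (1.0 - normal_cdf(abs(z)))

p = p_value_two_sided(1.96)   # close to the classical 0.05 threshold
```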
7.2.2 KEEL Suite for Statistical Analysis

To access this part of the software, click on Modules in the first frame of the program and then select Non-Parametric Statistical Analysis, as shown in Figure 192. A new window will appear. This module allows several non-parametric statistical tests to be performed over a given set of results. Further information about them can be found on the SCI2S thematic website on Statistical Inference in Computational Intelligence and Data Mining, http://sci2s.ugr.es/sicidm/. In this version, the available procedures are the following:
They can be selected through the Statistical procedures box.

Post hoc methods

In order to characterize the differences detected by the statistical tests, this module also provides a set of well-known post hoc methods. For the Friedman, Quade and Friedman aligned tests, it is possible to employ the following post hoc tests:
For multiple tests, it is possible to employ the following post hoc tests:
Performance measure

Depending on the characteristics of the problem considered, it is possible to perform the statistical test for maximization or minimization problems. This option indicates whether the results come from a maximization problem (e.g. accuracy in supervised classification) or a minimization problem (e.g. mean squared error in regression).

Working with data

The data table stores the average result achieved by each algorithm on each data set (problem); see Figure 195. Values can be entered directly in the cells of the table, updating both the results and the names of the data sets (the algorithms’ names, however, can only be updated by reading a CSV results file; see the next section). See Figure 196.

Table controls

The following operations are defined to manage the data table (Figure 197).
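As a sketch of what the module computes from such a table (our own minimal implementation, not the KEEL source), the Friedman statistic can be obtained by ranking the algorithms within each data set and comparing their average ranks:

```python
# Friedman statistic for a results table: rows = data sets, columns = algorithms.
# For a maximization measure (e.g. accuracy), the best algorithm on each data
# set receives rank 1; tied values receive the average of the tied ranks.
def friedman_statistic(results, maximize=True):
    n, k = len(results), len(results[0])
    avg_ranks = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: -row[j] if maximize else row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            # extend the group while values are tied with the group's first value
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2.0 + 1.0
            for t in range(i, j + 1):
                ranks[order[t]] = mean_rank
            i = j + 1
        for j in range(k):
            avg_ranks[j] += ranks[j] / n
    # chi-square-distributed Friedman statistic with k-1 degrees of freedom
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    return chi2, avg_ranks

# Hypothetical accuracy results: 4 data sets (rows) x 3 algorithms (columns).
table = [[0.90, 0.80, 0.70],
         [0.85, 0.90, 0.60],
         [0.80, 0.70, 0.75],
         [0.95, 0.90, 0.85]]
chi2, ranks = friedman_statistic(table)
```

The resulting statistic is then compared against a chi-square distribution (or converted to a p-value) to decide whether the algorithms differ significantly before applying the post hoc methods.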
Generation of analysis

Finally, when the data table has been filled with the results to analyze and all the relevant options have been selected, the analysis can be performed through the Perform Analysis button (Figure 199). The name of a .tex (LaTeX) file will be requested to store the results of the analysis. Then, if the data introduced are correct, the analysis will be performed (Figure 200). This .tex file contains all the information provided by the tests. To review it, just compile the file with your favourite LaTeX processor in order to obtain a PDF/PS/DVI file containing the results.

7.3 Semi-supervised Learning Module

The Semi-Supervised Learning (SSL) paradigm has attracted much attention in many different fields, ranging from bioinformatics to web mining, where unlabeled data are easier to obtain than labeled data because they require less effort, expertise and time. In this context, traditional supervised learning is limited to using labeled data to build a model. SSL, in contrast, is a learning paradigm concerned with the design of models in the presence of both labeled and unlabeled data. Essentially, SSL methods use unlabeled samples to either modify or reprioritize the hypothesis obtained from labeled samples alone. The KEEL Software Suite takes this significant classification scenario into account and includes a complete framework for experimentation with this type of problem. In this section, we briefly introduce the features of semi-supervised classification and describe how it is addressed in KEEL.
7.3.1 Semi-supervised Learning Experiments Design: Offline module

To access this part of the software, click on Modules in the first frame of the program and then select Semi-supervised Learning, as shown in Figure 201. A new window will appear with the same appearance as the standard “Design of Experiments” framework (please refer to Section 4.3). Accordingly, all bars follow exactly the same pattern, i.e. the menu, tool, and status bars. In fact, the whole process of preparing an experiment follows a scheme very similar to that of the standard “Offline experiments” module, described in Sections 4.3, 4.4 and 4.6. However, we must point out several significant differences between the two scenarios regarding the Experiment Graph: (1) Datasets, (2) Preprocessing, and (3) Algorithms:
7.4 Multiple Instance Learning Module

Multiple instance learning (MIL) is a generalization of traditional supervised learning. In MIL, training patterns called bags are represented as sets of feature vectors called instances. Each bag contains a number of non-repeated instances, and each instance usually represents a different view of the training pattern attached to it. There is information about the bags, and each one receives a label, although the labels of the individual instances are unknown. The problem consists of generating a classifier that will correctly classify unseen bags of instances. The key challenge in MIL is coping with the ambiguity of not knowing which instances in a positive bag are actually positive examples and which are not. In this sense, a multiple instance learning problem can be regarded as a special kind of supervised learning problem with incomplete labeling information. The KEEL Software Suite takes this significant classification scenario into account and includes a complete framework for experimentation with this type of problem. In this section, we briefly introduce the features of classification with multiple instance data and describe how it is addressed in KEEL.
7.4.1 Introduction to multiple instance learning

MIL is designed to solve the same problems as single-instance learning: learning a concept that correctly classifies the training data as well as generalizing to unseen data. Although the actual learning process is quite similar, the two approaches differ in the class labels provided, which are what they learn from. In a traditional machine learning setting, an object m_i is represented by a feature vector v_i, which is associated with a label f(m_i). In the multiple instance setting, however, each object m_i may have V_i variant instances, denoted m_{i,1}, m_{i,2}, …, m_{i,V_i}. Each of these variants is represented by a (usually) distinct feature vector V(m_i,j). A complete training example is therefore written as ({V(m_i,1), V(m_i,2), …, V(m_i,V_i)}, f(m_i)). The goal of learning is to find a good approximation to the function f(m_i) by analyzing a set of such labeled training examples. To obtain this function, Dietterich defines a hypothesis which assumes that if the observed result is positive, then at least one of the variant instances must have produced that positive result; furthermore, if the observed result is negative, then none of the variant instances could have produced a positive result. This can be modeled by introducing a second function g(V(m_i,j)) that takes a single variant instance and produces a result. The externally observed result, f(m_i), can then be defined as follows: f(m_i) = 1 if g(V(m_i,j)) = 1 for some j, and f(m_i) = 0 otherwise, i.e. f(m_i) = ⋁_j g(V(m_i,j)).
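Under Dietterich’s hypothesis described above, the bag-level label is simply the logical OR of the instance-level predictions. A toy sketch (the threshold-based instance classifier g below is a made-up stand-in, not a KEEL method):

```python
# Multiple-instance bag labeling under the standard MIL assumption:
# a bag is positive iff at least one of its instances is predicted positive.
def bag_label(bag, g):
    """bag: list of instance feature vectors; g: instance-level classifier."""
    return 1 if any(g(instance) == 1 for instance in bag) else 0

# Hypothetical instance classifier: positive if the first feature exceeds 0.5.
g = lambda v: 1 if v[0] > 0.5 else 0

positive_bag = [[0.1, 3.0], [0.9, 1.0]]   # one positive instance -> bag positive
negative_bag = [[0.2, 0.7], [0.4, 2.5]]   # no positive instance  -> bag negative
```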
7.4.2 Multiple Instance Learning Experiments Design: Offline module

To access this part of the software, click on Modules in the first frame of the program and then select Multiple Instance Learning, as shown in Figure 185. A new window will appear with the same appearance as the standard “Design of Experiments” framework (please refer to Section 4.3). Accordingly, all bars follow exactly the same pattern, i.e. the menu, tool, and status bars. In fact, the whole process of preparing an experiment follows the same scheme as in the standard “Offline experiments” module, described in Sections 4.3, 4.4 and 4.6. However, we must point out several significant differences between the two scenarios regarding the Experiment Graph: (1) Datasets, (2) Preprocessing methods, and (3) Algorithms:
8 Appendix

In this section, we provide the unit test routines performed on the KEEL software.