KEEL - dataset     Text Classification data sets

Text classification (also known as text categorization, or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, and authorship attribution. Automated text classification is attractive because it frees organizations from the need to organize document bases manually, which can be too expensive, or simply not feasible given the time constraints of the application or the number of documents involved.

This section lists the Text Classification data sets available in the repository. Please note that the data sets found here are not the original ones. We have applied a pre-processing technique that divides each original data set into a collection of N binary ones (one for each class label of the original data set). Each of these binary data sets consists of 100 attributes selected using the Chi-square scoring function.
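The pre-processing described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's actual procedure: documents are assumed to be sets of terms, the Chi-square statistic is assumed to be the usual 2x2 term/class contingency form, and the names `chi_square` and `binarize_and_select` are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: positive documents containing the term
    n10: negative documents containing the term
    n01: positive documents without the term
    n00: negative documents without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # degenerate table: the term carries no information
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def binarize_and_select(docs, labels, k=100):
    """Build one binary problem per class and keep the k best-scoring terms.

    docs:   list of sets of terms (assumed representation)
    labels: list of class labels, one per document
    Returns {class_label: (selected_terms, binary_labels)}.
    """
    vocabulary = sorted(set().union(*docs))
    result = {}
    for target in sorted(set(labels)):
        # One-vs-rest labels: True for the target class, False otherwise.
        binary = [lab == target for lab in labels]
        n_pos = sum(binary)
        n_neg = len(binary) - n_pos
        scores = {}
        for term in vocabulary:
            n11 = sum(1 for d, b in zip(docs, binary) if b and term in d)
            n10 = sum(1 for d, b in zip(docs, binary) if not b and term in d)
            scores[term] = chi_square(n11, n10, n_pos - n11, n_neg - n10)
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(vocabulary, key=lambda t: (-scores[t], t))
        result[target] = (ranked[:k], binary)
    return result


# Toy, made-up corpus purely for illustration.
toy_docs = [{"heart", "patient"}, {"heart", "surgery"},
            {"stock", "market"}, {"stock", "price"}]
toy_labels = ["med", "med", "fin", "fin"]
toy_result = binarize_and_select(toy_docs, toy_labels, k=2)
```

With the default `k=100`, each binary problem would keep 100 terms, matching the attribute count reported for the data sets on this page.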

Thus, every data set defines a supervised classification problem, where each example is composed of a set of nominal attributes and a label indicating the example's class.

Each data file has the following structure:

  • @relation: Name of the data set
  • @attribute: Description of an attribute (one for each attribute)
  • @inputs: List with the names of the input attributes
  • @output: List with the names of the output attributes
  • @data: Starting tag of the data

The rest of the file contains all the examples belonging to the data set, expressed in comma-separated values format.
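As a rough illustration of the structure listed above, a minimal reader might look like the sketch below. The function name `parse_keel` and the sample content are hypothetical, and the exact syntax of the `@attribute` lines is an assumption, so the sketch stores each attribute description as raw text instead of interpreting it.

```python
def parse_keel(text):
    """Split KEEL-style text into a header dict and a list of example rows."""
    header = {"relation": None, "attributes": [], "inputs": [], "outputs": []}
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if in_data:
            # After @data, every line is one example in comma-separated form.
            rows.append([value.strip() for value in line.split(",")])
            continue
        tag, _, rest = line.partition(" ")
        tag = tag.lower()
        if tag == "@relation":
            header["relation"] = rest.strip()
        elif tag == "@attribute":
            # Kept as raw text; the attribute syntax is not interpreted here.
            header["attributes"].append(rest.strip())
        elif tag in ("@inputs", "@input"):
            header["inputs"] = [n.strip() for n in rest.split(",") if n.strip()]
        elif tag in ("@outputs", "@output"):
            header["outputs"] = [n.strip() for n in rest.split(",") if n.strip()]
        elif tag == "@data":
            in_data = True
    return header, rows


# Made-up sample, not taken from any data set in the repository.
SAMPLE = """@relation demo
@attribute term_a {0, 1}
@attribute term_b {0, 1}
@attribute class {positive, negative}
@inputs term_a, term_b
@output class
@data
1, 0, positive
0, 1, negative
"""

header, rows = parse_keel(SAMPLE)
```

All values are returned as strings, which suits data sets whose attributes are all nominal; numeric conversion would need to be added for other data.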


Below you can find all the Text Classification data sets available. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of instances, and its number of labels (number of output variables).

The table allows you to download each data set in KEEL format (inside a ZIP file). Additionally, each data set can be obtained already partitioned for 5-fold cross-validation.
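For readers who prefer to build their own partitions, a 5-fold split can be sketched as below. This is a plain shuffled split under assumed settings (the function name `k_fold_indices` and the seed are hypothetical); how the partitions offered for download were generated, for instance whether they are stratified by class, is not stated here.

```python
import random


def k_fold_indices(n_examples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs covering all examples."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    partitions = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for fold in folds[:i] + folds[i + 1:] for j in fold)
        partitions.append((train, test))
    return partitions


# 913 is the number of examples listed for OH15-100.
partitions = k_fold_indices(913, k=5)
```

Each example appears in exactly one test fold and in the training part of the other four.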

By clicking on the column headers, you can sort the table by name (alphabetically) or by the number of examples, attributes, or labels. Clicking again sorts the rows in reverse order.

Name              #Attributes (R/I/N)   #Examples   #Labels
OH15-100          100 (0/0/100)         913         10
OH5-100           100 (0/0/100)         918         10
OH0-100           100 (0/0/100)         1003        10
OH10-100          100 (0/0/100)         1050        10
blogsGender-100   100 (0/0/100)         3232        2
ohscale-100       100 (0/0/100)         11162       10
r10-100           100 (0/0/100)         12897       10
ohsumed-100       100 (0/0/100)         13929       10

Collecting Data Sets

If you have some example data sets and you would like to share them with the rest of the research community by means of this page, please be so kind as to send your data to the Webmaster Team with the following information:

  • People responsible for the data (full name, affiliation, e-mail, web page, ...).
  • Training and test data sets considered, preferably in ASCII format.
  • A brief description of the application.
  • References where it is used.
  • Results obtained by the methods proposed by the authors or used for comparison.
  • Type of experiment developed.
  • Any additional useful information.

Collecting Results

If you have applied your methods to some of the problems presented here, we will be glad to show your results on this page. Please be so kind as to send the following information to the Webmaster Team:

  • Name of the application considered and type of experiment developed.
  • Results obtained by the methods proposed by the authors or used for comparison.
  • References where the results are shown.
  • Any additional useful information.

Contact Us

If you are interested in being informed of each update made to this page, or you would like to comment on it, please contact the Webmaster Team.



 
 Copyright 2004-2018, KEEL (Knowledge Extraction based on Evolutionary Learning)