  ARCADE: A Prediction Method for Nominal Variables
 
 
Title: ARCADE: A Prediction Method for Nominal Variables
Author: Costa, J.F.P.; Lerman, I.C.
Appeared in: Intelligent data analysis
Paging: Volume 2 (2013) nr. 4 pages 265-286
Year: 2013-06-14
Contents: The main problem considered in this paper is binarizing categorical (nominal) attributes that have a very large number of values (204 in our application). A small number of relevant binary attributes are extracted from each initial attribute. Suppose that we want to binarize a categorical attribute v with L values, where L is large or very large. The total number of binary attributes that can be extracted from v is $2^{L-1}-1$, which is prohibitive when L is large. Our idea is to select only those binary attributes that are predictive; these constitute a small fraction of all possible binary attributes. To do this, the key idea is to group the L values of a categorical attribute by means of a hierarchical clustering method. This requires defining a similarity between values that is associated with their predictive power. By clustering the L values into a small number of clusters (J), we define a new categorical attribute with only J values. The hierarchical clustering method that we use, AVL, allows us to choose a significant value for J. Now, we could consider using all the $2^{J-1}-1$ binary attributes associated with this new categorical attribute. Nevertheless, the J values are tree-structured, because we have used a hierarchical clustering method. We take advantage of this and consider only about 2×J binary attributes. If L is extremely large, for complexity and statistical reasons, we might not be able to apply a clustering algorithm directly. In this case, we start by “factorizing” v into a pair $(v^{1},v^{2})$, each component having about $\sqrt{L(v)}$ values. For a simple example, consider an attribute v with only four values $m_{1},m_{2},m_{3},m_{4}$. Obviously, in this example, there is no need to factorize the set of values of v, because it has a very small number of values.
Nevertheless, for illustration purposes, v could be decomposed (factorized) into two attributes with only two values each; the correspondence between the values of v and $(v^{1},v^{2})$ would be \[\begin{array}{c@{\qquad}c@{\qquad}c}v&(v^{1},&v^{2})\\m_{1}&1&1\\m_{2}&1&2\\m_{3}&2&1\\m_{4}&2&2\\\end{array}\] Now we apply the clustering method to both sets of values of $v^{1}$ and $v^{2}$, thereby defining a new synthetic pair $(\bar{v}^{1},\bar{v}^{2})$. Then, we “multiply” these new attributes and get another attribute v10 with $J_{1}\times J_{2}$ values, where $J_{1}$ (resp. $J_{2}$) is the number of values of $\bar{v}^{1}$ (resp. $\bar{v}^{2}$). Now, we apply a final clustering to the values of v10, and proceed as above. The solution that we propose is independent of the number of classes and can be applied to various situations. The application of ARCADE to the protein secondary structure prediction problem proves the validity of our approach.
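As a rough illustration of the counting argument and of the factorization step in the abstract, the following Python sketch may help. The function names (count_binary_attributes, factorize, multiply) are hypothetical, and the row-major assignment of values to pairs is an assumption chosen only because it reproduces the four-value example; the abstract does not say how ARCADE assigns values to the pair, and the AVL clustering step itself is not reproduced here.

```python
from itertools import product
from math import ceil, sqrt


def count_binary_attributes(num_values: int) -> int:
    """Number of ways to split num_values values into two non-empty groups.

    This is the 2^(L-1) - 1 count quoted in the abstract.
    """
    return 2 ** (num_values - 1) - 1


def factorize(values):
    """Map each value to a pair of 1-based codes, each with about sqrt(L) levels.

    Assumption: values are assigned row by row; this is only one possible
    factorization, used here for illustration.
    """
    side = ceil(sqrt(len(values)))
    return {value: (i // side + 1, i % side + 1) for i, value in enumerate(values)}


def multiply(levels_1, levels_2):
    """Combine two clustered attributes into one attribute with J1 x J2 values."""
    return list(product(levels_1, levels_2))


if __name__ == "__main__":
    # Four values give 2^3 - 1 = 7 candidate binary attributes.
    print(count_binary_attributes(4))
    # Reproduces the correspondence m1 -> (1, 1), ..., m4 -> (2, 2).
    print(factorize(["m1", "m2", "m3", "m4"]))
    # Hypothetical cluster labels: J1 = 3 and J2 = 2 give 6 combined values.
    print(len(multiply(["a", "b", "c"], ["x", "y"])))
```

With 204 values, as in the application mentioned above, this sketch would produce two component attributes with at most 15 levels each, which is far more manageable for a clustering step than the original 204 values.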
Publisher: IOS Press
Source file: Elektronische Wetenschappelijke Tijdschriften
 
 

 
 Koninklijke Bibliotheek - National Library of the Netherlands