PRE-PROCESSING OF HIGH-DIMENSIONAL CATEGORICAL PREDICTORS IN CLASSIFICATION SETTINGS
Authors:
Eugene Tuv a;
George Runger b
Affiliations:
a Analysis Control Technology, Intel Corporation, Chandler, AZ, USA.
b Department of Industrial Engineering, Arizona State University, Tempe, AZ, USA.
DOI:
10.1080/713827172
Publication Frequency:
10 issues per year
Subjects:
Artificial Intelligence;
Computer Science (General);
Information & Communication Technology (ICT);
Abstract
Models in industrial applications can encounter categorical predictors with a large number of categories (hundreds or thousands). An example is the lot identifier of a product in semiconductor manufacturing. Such variables pose a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of the values of such variables, such that the values in each partition have similar statistical properties with respect to the categorical response. Such partitions (interesting in themselves) can then be used as input to standard learning algorithms, such as decision trees and support vector machines. The proposed approach introduces a data transformation on derived sparse frequency tables. Applying even the simplest non-hierarchical metric clustering method to the transformed coordinates shows significant improvement in both speed and partition quality compared with currently used methods.
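
The general idea of grouping category values by their response behavior can be illustrated with a minimal Python sketch. The abstract does not specify the transformation, so this sketch makes two assumptions that are not taken from the paper: each category's row of the category-by-class frequency table is simply normalized to class proportions, and plain k-means with a deterministic farthest-point start stands in for "the simplest non-hierarchical metric clustering method". All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def class_profiles(categories, labels):
    """Build the category-by-class frequency table and normalize each row.

    ASSUMPTION: row normalization is a stand-in; the paper's actual
    transformation is not described in the abstract.
    """
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # sparse table: only observed cells are stored
    for cat, y in zip(categories, labels):
        counts[cat][y] += 1
    profiles = {}
    for cat, row in counts.items():
        total = sum(row.values())
        profiles[cat] = [row[y] / total for y in classes]
    return profiles


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return [list(c) for c in centers]


def kmeans(points, k, max_iter=100):
    """Plain k-means (Lloyd's algorithm); returns one cluster index per point."""
    centers = farthest_point_init(points, k)
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # leave an empty cluster's center unchanged
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def group_categories(categories, labels, k):
    """Map each category value to a group index based on its class profile."""
    profiles = class_profiles(categories, labels)
    names = sorted(profiles)
    assign = kmeans([profiles[n] for n in names], k)
    return dict(zip(names, assign))


# Toy data (hypothetical): lots A and B mostly fail, lots C and D mostly pass.
lots = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4
outcome = (
    ["fail", "fail", "fail", "pass"]
    + ["fail"] * 4
    + ["pass", "pass", "pass", "fail"]
    + ["pass"] * 4
)
groups = group_categories(lots, outcome, k=2)
print(groups)
```

The resulting group index could replace the original high-cardinality variable as an input to a downstream classifier. With thousands of categories and few observations per category, raw proportions become noisy, which is presumably part of what the paper's transformation is designed to handle.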