Finding relevant attributes in high dimensional data: a distributed computing hybrid data mining strategy

From National Research Council Canada

DOI	https://doi.org/10.1007/978-3-540-71200-8_20
Author	Valdés, Julio; Barton, Alan
Format	Text, Article
Abstract	In many domains, data objects are described in terms of a large number of features (e.g. microarray experiments, or spectral characterizations of organic and inorganic samples). A pipelined approach using two clustering algorithms in combination with rough sets is investigated for the purpose of discovering important combinations of attributes in high dimensional data. The Leader and several k-means algorithms are used as fast procedures for attribute set simplification of the information systems presented to the rough sets algorithms. The data described in terms of these fewer features are then discretized with respect to the decision attribute according to different rough-set-based schemes. From the discretized data, reducts and their derived rules are extracted and applied to test data in order to evaluate the resulting classification accuracy in cross-validation experiments. The data mining process is implemented within a high throughput distributed computing environment. Nonlinear transformations of attribute subsets that preserve the similarity structure of the data were also investigated. Their classification ability, and that of the attribute subsets obtained after the mining process, was described in terms of analytic functions obtained by genetic programming (gene expression programming) and simplified using computer algebra systems. Visual data mining techniques using virtual reality were used for inspecting results. This approach was explored in a series of experiments using Leukemia, Colon cancer and Breast cancer gene expression data. These experiments led to small subsets of genes with high discrimination power.
Publication date	2007
Publisher	Springer
In	Transactions on Rough Sets VI: 366–396.
Language	English
Peer reviewed	Yes
NRC number	NRCC 48766
NPARC number	5764714
Record identifier	4cd37a2c-96cc-49b7-b656-ab3ec99a8508
Record created	2009-03-29
Record modified	2020-05-10

Date modified: 2024-04-19
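
The abstract names the Leader algorithm as a fast procedure for attribute set simplification. As a rough illustration only, the single-pass idea can be sketched as below. This record does not state the similarity measure, threshold, or variant the authors used, so the Euclidean distance, the "join the first leader within the threshold" rule, the function name `leader_clustering`, and the toy data are all assumptions made for the example.

```python
import math


def leader_clustering(vectors, threshold):
    """Single-pass Leader clustering (illustrative variant).

    Each vector joins the first existing leader that lies within
    `threshold` (Euclidean distance, an assumption); if none does,
    the vector becomes a new leader.
    """
    leaders = []     # indices of the vectors chosen as leaders
    assignment = []  # assignment[i] = position in `leaders` for vector i
    for i, v in enumerate(vectors):
        for pos, j in enumerate(leaders):
            if math.dist(v, vectors[j]) <= threshold:
                assignment.append(pos)
                break
        else:
            # No leader was close enough: this vector starts a new cluster.
            leaders.append(i)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


# Hypothetical toy data: rows are attributes (e.g. genes), columns are samples.
attributes = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [9.0, 8.0, 7.0],
    [1.0, 2.0, 3.2],
    [9.2, 8.1, 7.1],
]
leaders, assignment = leader_clustering(attributes, threshold=0.5)
print(leaders)     # indices of the representative attributes
print(assignment)  # cluster position for every attribute
```

In a pipeline like the one the abstract describes, only the leader attributes would be passed on to the later discretization and reduct computation, which is what makes the step a form of attribute set simplification. The result depends on the order in which attributes are presented and on the threshold, so both would need to be chosen with care.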