Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data

Erika Cantor, Sandra Guauque-Olarte, Roberto León, Steren Chabert, Rodrigo Salas

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomic data. Although random forest (RF) is a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, and we exemplify its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm, which determines the relevance of each gene from its connections and location on a protein-protein interaction network. Then, each relevance is used to modify the probability of drawing that gene as a candidate split feature in the conventional RF. Experiments on simulated datasets with very small sample sizes (n≤30), comparing knowledge-slanted RF against conventional RF and logistic lasso regression, suggest that it predicts the outcome with higher precision than either alternative. The knowledge-slanted RF was then completed with a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes than conventional RF, offering users a higher level of explainability. These findings were corroborated in one real case study identifying genes relevant to calcific aortic valve stenosis.
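The two stages described in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy five-gene network, the seed gene, the restart probability of 0.3, and the choice to make the selection probability directly proportional to the relevance score are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a random walk with restart.

    adj     : symmetric adjacency matrix of the gene network (assumed input)
    seeds   : non-negative vector marking the starting genes (assumed input)
    restart : probability of jumping back to the seeds (assumed value)
    """
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0        # isolated nodes: avoid division by zero
    W = adj / col_sums                   # column-normalised transition matrix
    p0 = np.asarray(seeds, dtype=float)
    p0 = p0 / p0.sum()
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


def draw_split_candidates(relevance, mtry, rng):
    """Draw mtry distinct candidate split features, weighted by relevance.

    Assumption: selection probability is proportional to the relevance score.
    """
    probs = relevance / relevance.sum()
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


# Toy network of five genes; gene 0 is the (hypothetical) seed gene.
adj = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
seeds = np.array([1, 0, 0, 0, 0])

relevance = random_walk_with_restart(adj, seeds)
rng = np.random.default_rng(0)
candidates = draw_split_candidates(relevance, mtry=2, rng=rng)
print("relevance:", np.round(relevance, 3))
print("candidate split features:", candidates)
```

In a conventional RF the candidates at each node are drawn uniformly; here genes close to the seed in the network are drawn more often, which is the sense in which the forest is "slanted" toward prior knowledge.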

Original language: English
Article number: 34
Journal: BioData Mining
Volume: 17
Issue number: 1
State: Published - Dec 2024
