Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data

Erika Cantor, Sandra Guauque-Olarte, Roberto León, Steren Chabert, Rodrigo Salas

Research output: Contribution to journal > Article > peer-review

1 Scopus citation

Abstract

The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomic data. Although random forest (RF) is a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, exemplifying its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated, by running a random walk with restart algorithm, into a relevance score for each gene based on its connections and location in a protein-protein interaction network. Then, each relevance score is used to modify the probability of drawing a gene as a candidate split feature in the conventional RF. Experiments on simulated datasets with very small sample sizes (n ≤ 30), comparing knowledge-slanted RF against conventional RF and logistic lasso regression, suggest improved precision in outcome prediction relative to the other methods. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case study identifying genes relevant to calcific aortic valve stenosis.
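
The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the restart probability of 0.3, the choice of seed genes, the column normalization of the adjacency matrix, and the small smoothing floor are all assumptions made for illustration. The first function runs a random walk with restart on a network to score each gene; the second draws candidate split features with probability proportional to those scores instead of uniformly.

```python
import numpy as np


def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score each node by its proximity to the seed nodes (hypothetical helper).

    adjacency: symmetric (n x n) array for an undirected network.
    seeds: indices of nodes where the walk restarts (an assumption here).
    """
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero for isolated nodes
    transition = adjacency / col_sums      # column-normalized transition matrix

    p0 = np.zeros(adjacency.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)     # restart distribution over the seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (transition @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:  # L1 change below tolerance: converged
            return p_next
        p = p_next
    return p


def draw_candidate_features(relevance, mtry, rng, floor=1e-12):
    """Draw mtry distinct feature indices, weighted by relevance (hypothetical helper).

    The floor keeps zero-relevance genes selectable with tiny probability
    (an assumption, so that sampling without replacement never runs out).
    """
    weights = np.asarray(relevance, dtype=float) + floor
    prob = weights / weights.sum()
    return rng.choice(len(prob), size=mtry, replace=False, p=prob)


# Toy network: a path 0 - 1 - 2 - 3, with gene 0 as the only seed.
toy_adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
scores = random_walk_with_restart(toy_adjacency, seeds=[0])
rng = np.random.default_rng(0)
candidates = draw_candidate_features(scores, mtry=2, rng=rng)
```

In this toy network the scores decrease with distance from the seed gene, so genes close to the seed are more likely to be offered as split candidates at each tree node. A conventional RF corresponds to passing equal relevance for every gene.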

Original language: English
Article number: 34
Journal: BioData Mining
Volume: 17
Issue number: 1
State: Published - Dec 2024

Keywords

  • Explainability
  • Feature selection
  • Gene selection
  • High-dimensional
  • Prior knowledge
  • Protein-protein interaction
  • Random forest
  • RNA-Seq
