TY - JOUR
T1 - Application of weighted low rank approximations: outlier detection in a data matrix
AU - Garcia-Pena, Marisol
AU - Arciniegas-Alarcon, Sergio
AU - Basford, Kaye E.
N1 - © 2025. The Author(s).
PY - 2025/5/27
Y1 - 2025/5/27
AB - Objective: A mandatory step in the exploratory analysis of any rectangular database is the identification of possible outliers. Their presence determines what type of explanatory and/or predictive modeling should be used subsequently. This paper presents strategies to identify outliers in any data set using weighted approximations of a matrix. The strategies are evaluated through artificial contamination in sixteen real data sets, of which two have multivariate characteristics and fourteen come from multi-environment trials. As an evaluation criterion, a statistic is proposed whose value is small when the detection method performs well and large when false positives or false negatives appear. Results: Six criteria for identifying outliers from weighted approximations were considered, including simple residuals, squared residuals with differential weights, jackknife and their corresponding iterative versions, and they were compared with the gold standard, which is based on limits from a bias-adjusted boxplot. All methods are applicable to any numerical data set written in matrix form, e.g. experiments with genotype-by-environment interaction. It was found that, in the presence of random outliers in a matrix with numerical entries, the identification of outliers using weighted approximations is more effective than detection based on limits from a bias-adjusted boxplot.
KW - Atypical elements
KW - Criss-cross regression
KW - Data preprocessing
KW - Exploratory analysis
KW - Genotype-by-environment interaction
KW - Data Interpretation, Statistical
KW - Algorithms
KW - Humans
KW - Models, Statistical
UR - http://www.scopus.com/inward/record.url?scp=105006752230&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/ae034c96-2253-3522-aa73-d7429c4cae01/
U2 - 10.1186/s13104-025-07284-2
DO - 10.1186/s13104-025-07284-2
M3 - Article
C2 - 40420101
AN - SCOPUS:105006752230
SN - 1756-0500
VL - 18
SP - 1
EP - 11
JO - BMC Research Notes
JF - BMC Research Notes
IS - 1
M1 - 234
ER -