TY - JOUR
T1 - Interpretable risk models for Sleep Apnea and Coronary diseases from structured and non-structured data
AU - Silva, Carlos Anderson Oliveira
AU - Gonzalez-Otero, Rafael
AU - Bessani, Michel
AU - Mendoza, Liliana Otero
AU - de Castro, Cristiano L.
N1 - Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/8/15
Y1 - 2022/8/15
AB - Machine learning-based risk models built from Electronic Health Records (EHR) can support medical decision-making. However, the lack of standardization of EHR data and the “black-box” nature of machine learning approaches have hindered their acceptance as support tools in clinical environments. This paper presents a method that predicts and explains the diagnosis of Atrial Fibrillation (AF), Sleep Apnea (SA), and Coronary Arterial Disease (CAD); the proposed model is learned from patients’ structured EHR data (commonly used screening variables) and non-structured EHR data (text drawn from medical reports). An embedding scheme for variables, together with a labeling approach, is used to mimic an expert’s ability to categorize the non-structured textual data. The method combines complex predictive models with the SHAP approach to explain the predictions. A comparison of prediction models with different configurations of input variables showed that using non-structured data improved the performance of CAD risk prediction models. Moreover, the comparison indicated that patients’ medical histories are an important factor that should be considered during the data-driven learning process.
KW - Coronary diseases
KW - EHR data
KW - SHAP
KW - Sleep Apnea
KW - Text embedding
KW - Weak supervision
UR - http://www.scopus.com/inward/record.url?scp=85127506236&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2022.116955
DO - 10.1016/j.eswa.2022.116955
M3 - Article
AN - SCOPUS:85127506236
SN - 0957-4174
VL - 200
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 116955
ER -