TY - GEN
T1 - LABAS-TS: A system for assisting labeling of training sets for text classification
AU - Sierra-Múnera, Alejandro
AU - Pomares-Quimbaya, Alexandra
AU - Rivera, Rafael Andrés González
AU - Rodríguez, Julián Camilo Daza
AU - Velandia, Oscar Mauricio Muñoz
AU - Peña, Angel Alberto Garcia
N1 - Publisher Copyright: © 2017 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved.
PY - 2017
Y1 - 2017
AB - Most text classification techniques rely on training data sets to build models. However, in many text classification projects, previously labeled texts are rarely available because of differences in language (e.g. Spanish), domain (e.g. healthcare) and regional or institutional written culture (e.g. a specific hospital). To help address this problem, this paper presents LABAS-TS, a web-enabled system for assisting the open, collaborative labeling of training sets for text classification. LABAS-TS is framed within a named entity recognition approach that uses gazetteers to identify important entities in a domain-specific corpus, and a language-specific sentence analyzer that extracts the portions of text that should be annotated. LABAS-TS was evaluated by generating training data sets for classifying whether an electronic health record text contains a diagnosis, a test or a procedure. It proved useful in reducing the time required to build a reliable training set, with an average of eleven seconds between two labels.
KW - Labeling
KW - Text classification
KW - Training set
UR - http://www.scopus.com/inward/record.url?scp=85055592014&partnerID=8YFLogxK
U2 - 10.5220/0006504901740180
DO - 10.5220/0006504901740180
M3 - Conference contribution
AN - SCOPUS:85055592014
SN - 9789897582738
T3 - IC3K 2017 - Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
SP - 174
EP - 180
BT - IC3K 2017 - Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
A2 - Liu, Kecheng
A2 - Salgado, Ana Carolina
A2 - Bernardino, Jorge
A2 - Filipe, Joaquim
PB - SciTePress
T2 - 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2017
Y2 - 1 November 2017 through 3 November 2017
ER -