TY - GEN
T1 - An automatic approach to generate corpus in Spanish
AU - Puertas, Edwin
AU - Alvarado-Valencia, Jorge Andres
AU - Moreno-Sandoval, Luis Gabriel
AU - Pomares-Quimbaya, Alexandra
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2018.
PY - 2018
Y1 - 2018
AB - A corpus is an indispensable linguistic resource for any application of natural language processing. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain, using a propagation algorithm to determine the categories associated with a domain region and a set of seeds to delimit the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, with the purpose of defining the quality of the extraction.
KW - Corpus
KW - Knowledge extraction
KW - Linguistic computational
KW - Natural language processing
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=85054377708&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-98998-3_12
DO - 10.1007/978-3-319-98998-3_12
M3 - Conference contribution
AN - SCOPUS:85054377708
SN - 9783319989976
T3 - Communications in Computer and Information Science
SP - 150
EP - 161
BT - Advances in Computing - 13th Colombian Conference, CCC 2018, Proceedings
A2 - Serrano C., Jairo E.
A2 - Martínez-Santos, Juan Carlos
PB - Springer Verlag
T2 - 13th Colombian Conference on Computing, CCC 2018
Y2 - 26 September 2018 through 28 September 2018
ER -