An automatic approach to generate corpus in Spanish

Edwin Puertas, Jorge Andres Alvarado-Valencia, Luis Gabriel Moreno-Sandoval, Alexandra Pomares-Quimbaya

Research output: Chapter in book/report/conference proceeding, Conference contribution, peer-reviewed

Abstract

A corpus is an indispensable linguistic resource for any natural language processing application. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain: a propagation algorithm determines the categories associated with a domain region, and a set of seeds delimits the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, with the purpose of defining the quality of the extraction.
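The abstract does not specify how the propagation over Wikipedia categories works, so the following is only a minimal sketch of one plausible reading: a breadth-first expansion from seed categories with a depth limit. The category graph, the function name `propagate_categories`, and the `max_depth` parameter are all hypothetical and are not taken from the paper.

```python
from collections import deque

# Hypothetical toy category graph (category -> subcategories).
# This is illustrative data, not real Wikipedia content.
CATEGORY_GRAPH = {
    "Salud": ["Medicina", "Nutrición"],
    "Medicina": ["Cardiología", "Farmacología"],
    "Nutrición": ["Dietas"],
    "Cardiología": [],
    "Farmacología": ["Química"],
    "Dietas": [],
    "Química": ["Física"],
    "Física": [],
}


def propagate_categories(graph, seeds, max_depth):
    """Expand breadth-first from the seed categories, at most max_depth hops.

    Returns a dict mapping each reached category to its distance from the
    nearest seed. Seeds that are missing from the graph are ignored.
    """
    depth_of = {seed: 0 for seed in seeds if seed in graph}
    queue = deque(depth_of)
    while queue:
        category = queue.popleft()
        if depth_of[category] == max_depth:
            continue  # the depth limit delimits the domain region
        for child in graph.get(category, []):
            if child not in depth_of:
                depth_of[child] = depth_of[category] + 1
                queue.append(child)
    return depth_of


domain = propagate_categories(CATEGORY_GRAPH, seeds=["Salud"], max_depth=2)
print(sorted(domain))
```

In this sketch the depth limit plays the role of delimiting the search: with a limit of 2, the expansion stops before drifting from "Salud" into loosely related categories such as "Química". The actual system may bound the propagation differently.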

Original language: English
Title of host publication: Advances in Computing - 13th Colombian Conference, CCC 2018, Proceedings
Editors: Jairo E. Serrano C., Juan Carlos Martínez-Santos
Publisher: Springer Verlag
Pages: 150-161
Number of pages: 12
ISBN (print): 9783319989976
Publication status: Published - 2018
Event: 13th Colombian Conference on Computing, CCC 2018 - Cartagena, Colombia
Duration: 26 Sep 2018 to 28 Sep 2018

Publication series

Name: Communications in Computer and Information Science
Volume: 885
ISSN (print): 1865-0929

Conference

Conference: 13th Colombian Conference on Computing, CCC 2018
Country/Territory: Colombia
City: Cartagena
Period: 26/09/18 to 28/09/18
