An automatic approach to generate corpus in Spanish

Edwin Puertas, Jorge Andres Alvarado-Valencia, Luis Gabriel Moreno-Sandoval, Alexandra Pomares-Quimbaya

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

A corpus is an indispensable linguistic resource for any natural language processing application. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain: a propagation algorithm determines the categories associated with a domain region, and a set of seeds delimits the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, with the purpose of defining the quality of the extraction.
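
The abstract does not spell out the propagation algorithm, so the following is only a minimal sketch of one plausible reading: a breadth-first walk that starts from seed categories and follows subcategory links up to a fixed depth, then gathers the articles filed under the categories reached. The function names, the depth limit, and the toy category graph are all assumptions made for illustration; none of them come from the paper, and no Wikipedia API is called.

```python
from collections import deque


def propagate_categories(subcats, seeds, max_depth):
    """Breadth-first propagation from seed categories, bounded by depth.

    Returns a dict mapping each reached category to its distance from
    the nearest seed. (Hypothetical helper, not the authors' algorithm.)
    """
    reached = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        cat = queue.popleft()
        depth = reached[cat]
        if depth == max_depth:
            continue  # do not expand beyond the domain boundary
        for child in subcats.get(cat, []):
            if child not in reached:
                reached[child] = depth + 1
                queue.append(child)
    return reached


def collect_articles(articles, categories):
    """Gather the distinct article titles filed under the given categories."""
    titles = set()
    for cat in categories:
        titles.update(articles.get(cat, []))
    return sorted(titles)


# Toy, hand-written category graph (illustrative data only).
SUBCATS = {
    "Salud": ["Enfermedades", "Medicina"],
    "Enfermedades": ["Enfermedades infecciosas"],
    "Medicina": ["Farmacología"],
    "Enfermedades infecciosas": ["Virus"],
}
ARTICLES = {
    "Salud": ["Salud pública"],
    "Enfermedades": ["Diabetes"],
    "Medicina": ["Cirugía"],
    "Enfermedades infecciosas": ["Gripe"],
    "Virus": ["Coronavirus"],
}

domain = propagate_categories(SUBCATS, ["Salud"], max_depth=2)
corpus_titles = collect_articles(ARTICLES, domain)
```

In this sketch the depth limit plays the role of the domain boundary: "Virus" sits three links away from the seed "Salud", so it and its articles are left out of the corpus.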

Original language: English
Title of host publication: Advances in Computing - 13th Colombian Conference, CCC 2018, Proceedings
Editors: Jairo E. Serrano C., Juan Carlos Martínez-Santos
Publisher: Springer Verlag
Pages: 150-161
Number of pages: 12
ISBN (Print): 9783319989976
State: Published - 2018
Event: 13th Colombian Conference on Computing, CCC 2018 - Cartagena, Colombia
Duration: 26 Sep 2018 - 28 Sep 2018

Publication series

Name: Communications in Computer and Information Science
Volume: 885
ISSN (Print): 1865-0929

Conference

Conference: 13th Colombian Conference on Computing, CCC 2018
Country/Territory: Colombia
City: Cartagena
Period: 26/09/18 - 28/09/18

Keywords

  • Corpus
  • Knowledge extraction
  • Computational linguistics
  • Natural language processing
  • Text mining
