TY - JOUR
T1 - Detection of Hate Speech, Racism and Misogyny in Digital Social Networks
T2 - Colombian Case Study
AU - Moreno-Sandoval, Luis Gabriel
AU - Pomares-Quimbaya, Alexandra
AU - Barbosa-Sierra, Sergio Andres
AU - Pantoja-Rojas, Liliana Maria
N1 - Data Availability Statement
The raw data supporting the conclusions of this article will be made available by the authors on request.
Acknowledgments
This work has been supported by Pontificia Universidad Javeriana and the Center of Excellence and Appropriation in Big Data and Data Analytics in Colombia (CAOBA). Likewise, thanks are expressed to the students of the Master in Artificial Intelligence, Eng. Gisell Natalia Cristiano and Eng. Andrés Felipe Ethorimn, for their valuable contribution. We also thank the International Research Group in Computer Science, Communications and Knowledge Management (GICOGE) and IDEAS of the Universidad Distrital Francisco José de Caldas, and the LUMON LV TECH research group.
Publisher Copyright:
© 2024 by the authors.
PY - 2024/9/6
Y1 - 2024/9/6
N2 - The growing popularity of social networking platforms worldwide has substantially increased the presence of offensive language on these platforms. To date, most systems developed to mitigate this problem focus primarily on English content. However, the issue is a global concern and also affects other languages, such as Spanish. This article addresses the task of identifying hate speech, racism, and misogyny in Spanish within the Colombian context on social networks, and introduces a gold standard dataset specifically developed for this purpose. The experiment compares the performance of TLM models from Deep Learning methods, such as BERT, RoBERTa, XLM, and BETO, adjusted to the Colombian slang domain, and then compares the best TLM model against a GPT model, with a significant impact on the accuracy of predictions in this task. Finally, the study describes in detail the components used in the system, including the model architectures and the selection of functions. The best results show that the BERT model achieves an accuracy of 83.6% for hate speech detection, while the GPT model achieves an accuracy of 90.8% for racism detection and 90.4% for misogyny detection.
AB - The growing popularity of social networking platforms worldwide has substantially increased the presence of offensive language on these platforms. To date, most systems developed to mitigate this problem focus primarily on English content. However, the issue is a global concern and also affects other languages, such as Spanish. This article addresses the task of identifying hate speech, racism, and misogyny in Spanish within the Colombian context on social networks, and introduces a gold standard dataset specifically developed for this purpose. The experiment compares the performance of TLM models from Deep Learning methods, such as BERT, RoBERTa, XLM, and BETO, adjusted to the Colombian slang domain, and then compares the best TLM model against a GPT model, with a significant impact on the accuracy of predictions in this task. Finally, the study describes in detail the components used in the system, including the model architectures and the selection of functions. The best results show that the BERT model achieves an accuracy of 83.6% for hate speech detection, while the GPT model achieves an accuracy of 90.8% for racism detection and 90.4% for misogyny detection.
KW - large language models
KW - digital social networks
KW - hate speech detection
KW - sentiment analysis
KW - social network analysis
KW - subjectivity analysis
KW - text classification
UR - http://www.scopus.com/inward/record.url?scp=85205060278&partnerID=8YFLogxK
U2 - 10.3390/bdcc8090113
DO - 10.3390/bdcc8090113
M3 - Article
SN - 2504-2289
VL - 8
SP - 113
JO - Big Data and Cognitive Computing
JF - Big Data and Cognitive Computing
IS - 9
ER -