Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations

Published: 2026-04-14

Collected: 2026-04-14 04:31

Academic Frontier, score 5.1: medium quality; a routine academic paper of moderate reference value

Source: cs.CL updates on arXiv.org

arXiv:2604.09625v1 · Announce type: new

Abstract: We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via OpenWebSearch.eu (OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we apply continued pre-training to BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that…