Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

Tags
Microsoft
arxiv id
2505.16722

Abstract Summary

The research focuses on cross-lingual detoxification, aiming to keep large language models toxicity-free across different languages and script families.
The effectiveness of cross-lingual detoxification is evaluated across a variety of settings to measure toxicity reduction in limited-data scenarios and to understand the trade-offs between safety and knowledge preservation.

Abstract

As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high- and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 504 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data, and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at this https URL.