DISCO: A large scale human annotated corpus for disfluency correction in Indo-European languages

Disfluency Correction (DC) removes fillers, repetitions, and corrections from spoken utterances, making Automatic Speech Recognition (ASR) outputs better suited to downstream language tasks. While most DC research focuses on English, we introduce a high-quality human-annotated DC corpus covering English, Hindi, German, and French, enabling multilingual disfluency correction. In our evaluation, state-of-the-art DC models achieve F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German), and 92.97 (French). Additionally, integrating DC with Machine Translation (MT) improves BLEU scores by 5.65 points on average. We open-source our dataset and implementation for further research.
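
To make the task and the F1 metric concrete, the following is a minimal sketch, not the implementation released with the corpus. It assumes DC is framed as per-token binary labelling (1 = disfluent, 0 = fluent) and that F1 is computed over the disfluent class; the example utterance, the labels, and the function names `remove_disfluencies` and `disfluency_f1` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def remove_disfluencies(tokens: List[str], labels: List[int]) -> str:
    """Return the fluent sentence by dropping tokens labelled 1 (disfluent)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok for tok, lab in zip(tokens, labels) if lab == 0)


def disfluency_f1(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the disfluent class (label 1)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical utterance with a filler ("well", "uh") and a correction
# ("to boston ... i mean to denver"); labels are illustrative, not corpus data.
tokens = "well i want a flight to boston uh i mean to denver".split()
gold = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
# A hypothetical model output that misses the sentence-initial filler "well".
pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

fluent = remove_disfluencies(tokens, gold)  # "i want a flight to denver"
precision, recall, f1 = disfluency_f1(gold, pred)
print(fluent)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case the prediction finds five of the six disfluent tokens with no false positives, so precision is 1 and recall is 5/6. The reported scores may be computed under a different protocol; the sketch only shows the general shape of a token-level evaluation.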
