Charsiu

Charsiu forced aligner & automatic speech recognition tool

The Charsiu phonetic aligner can:

  • force align: requires speech audio + the corresponding text transcription
  • automatically recognise and align: requires speech audio only

The task of phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment. The proposed Wav2Vec2-FS, a semi-supervised model, directly learns phone-to-audio alignment through contrastive learning and a forward sum loss, and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The other model, Wav2Vec2-FC, is a frame classification model trained on force-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that both proposed methods, even when transcriptions are not available, generate results very close to those of existing forced alignment tools. Our work presents a neural pipeline for fully automated phone-to-audio alignment.
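
To make the frame classification idea concrete, the sketch below shows one way per-frame phone labels could be merged into timed segments. It is an illustration only, not the Charsiu implementation: the function name frames_to_segments, the 10 ms frame duration, and the example labels are all assumptions made for this sketch.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_duration=0.01):
    """Merge consecutive identical frame labels into (start, end, label) segments.

    frame_labels: one predicted phone label per frame (hypothetical input).
    frame_duration: seconds per frame; 0.01 (10 ms) is an assumed value.
    """
    segments = []
    start = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of frames in this run
        end = start + length
        segments.append(
            (round(start * frame_duration, 4), round(end * frame_duration, 4), label)
        )
        start = end
    return segments


if __name__ == "__main__":
    # Made-up frame predictions for the word "cat", padded with silence.
    frames = ["[SIL]", "[SIL]", "K", "K", "K", "AE", "AE", "T", "[SIL]"]
    for segment in frames_to_segments(frames):
        print(segment)
```

Each output tuple gives a start time, an end time, and a phone label. With a transcription available, the label sequence would be constrained by the text; without one, the labels would come from the recogniser's own predictions.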

The evaluation results against other aligners are:


Github repo:


Tutorial:

A step-by-step tutorial for linguists: Open In Colab


Related articles:

  1. ICASSP
    Phone-to-audio alignment without text: A Semi-supervised Approach
    Jian Zhu, Cong Zhang, and David Jurgens
    In IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

    Related talks:

      1. speech tech
        Phone-to-audio alignment without text: A Semi-supervised Approach.
        Cong Zhang, Jian Zhu, and David Jurgens
        IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
        Singapore, Singapore, 22-27 May 2022

      Related resources:

      1. dataset
        Phoneme and word level forced aligned data: Common Voice - English (860,000 utterances)
        Jian Zhu and Cong Zhang
        2022
      2. dataset
        Phoneme and word level forced aligned data: multiple datasets - Mandarin (over 1 million utterances)
        Jian Zhu and Cong Zhang
        2022
      3. aligner
        Phone-to-audio alignment without text: A Semi-supervised Approach
        Jian Zhu, Cong Zhang, and David Jurgens
        2022