Charsiu

Charsiu forced aligner & automatic speech recognition tool

The Charsiu phonetic aligner can:

force-align: requires speech audio plus the corresponding text transcription
automatically recognise and align: requires speech audio only
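
To make the forced-alignment mode concrete, here is a minimal sketch of monotonic alignment by dynamic programming. It is not Charsiu's actual implementation: the function name `force_align`, the toy per-frame scores, and the rule that every phone covers at least one frame are assumptions made for illustration. In a real aligner the per-frame scores would come from an acoustic model rather than being written by hand.

```python
import math


def force_align(log_probs, phones):
    """Assign one phone to every frame, keeping the phones in transcript order.

    log_probs: list of T dicts mapping phone label -> log probability for that frame.
    phones:    list of N phone labels from the transcription (N <= T).
    Returns a list of T phone labels (the highest-scoring monotonic path).
    """
    T, N = len(log_probs), len(phones)
    if N == 0 or T < N:
        raise ValueError("need at least one phone and at least one frame per phone")

    neg_inf = -math.inf
    score = [[neg_inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]  # 0 = stay on same phone, 1 = advance

    score[0][0] = log_probs[0][phones[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            if move > stay:
                best, back[t][n] = move, 1
            else:
                best, back[t][n] = stay, 0
            score[t][n] = best + log_probs[t][phones[n]]

    # Trace back from the last frame, which must carry the last phone.
    n = N - 1
    path = []
    for t in range(T - 1, -1, -1):
        path.append(phones[n])
        n -= back[t][n]
    path.reverse()
    return path


# Hypothetical per-frame probabilities for five frames of the word "cat".
probs = [
    {"K": 0.90, "AE": 0.05, "T": 0.05},
    {"K": 0.60, "AE": 0.30, "T": 0.10},
    {"K": 0.10, "AE": 0.80, "T": 0.10},
    {"K": 0.10, "AE": 0.70, "T": 0.20},
    {"K": 0.05, "AE": 0.05, "T": 0.90},
]
frames = [{p: math.log(v) for p, v in frame.items()} for frame in probs]
aligned = force_align(frames, ["K", "AE", "T"])
print(aligned)
```

The transcription constrains the path to the known phone order, which is what distinguishes forced alignment from the audio-only mode.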

The task of phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment. The proposed Wav2Vec2-FS, a semi-supervised model, directly learns phone-to-audio alignment through contrastive learning and a forward sum loss, and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The other model, Wav2Vec2-FC, is a frame classification model trained on force-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that both proposed methods, even when transcriptions are not available, produce results close to those of existing forced alignment tools. Our work presents a neural pipeline for fully automated phone-to-audio alignment.
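
As an illustration of the text-independent segmentation that a frame classification model makes possible, the sketch below merges per-frame phone labels into timed segments. The 10 ms frame shift, the `[SIL]` label, the function name `frames_to_segments`, and the example predictions are all assumptions for illustration, not values taken from the Charsiu code.

```python
from itertools import groupby

FRAME_SHIFT = 0.01  # assumed frame shift in seconds (10 ms)


def frames_to_segments(frame_labels, frame_shift=FRAME_SHIFT):
    """Merge runs of identical per-frame labels into (start, end, phone) tuples."""
    segments = []
    index = 0
    for phone, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        start = round(index * frame_shift, 3)
        end = round((index + length) * frame_shift, 3)
        segments.append((start, end, phone))
        index += length
    return segments


# Hypothetical frame-level predictions, one label per frame.
predicted = ["[SIL]", "[SIL]", "HH", "HH", "HH", "AY", "AY", "AY", "AY", "[SIL]"]
segments = frames_to_segments(predicted)
for start, end, phone in segments:
    print(f"{start:.2f}\t{end:.2f}\t{phone}")
```

No transcription is needed here: the segment boundaries fall wherever the predicted label changes from one frame to the next.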

The evaluation results against other aligners are:

GitHub repo: https://github.com/lingjzhu/charsiu

Tutorial:

A step-by-step tutorial for linguists:

Related articles:

ICASSP
Phone-to-audio alignment without text: A Semi-supervised Approach

Jian Zhu, Cong Zhang, and David Jurgens

In IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

@inproceedings{zhu2022phone-charsiu, title = {Phone-to-audio alignment without text: A Semi-supervised Approach}, author = {Zhu, Jian and Zhang, Cong and Jurgens, David}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing}, year = {2022} }

Related talks:

speech tech
Phone-to-audio alignment without text: A Semi-supervised Approach.

Jian Zhu, Cong Zhang, and David Jurgens

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Singapore, Singapore, 22-27 May 2022

Code and pretrained models are available at https://github.com/lingjzhu/charsiu.
@conference{zhu2022phone-c, author = {Zhu, Jian and Zhang, Cong and Jurgens, David}, title = {Phone-to-audio alignment without text: A Semi-supervised Approach.}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year = {2022}, category = {oral}, month = {22-27 May}, location = {Singapore, Singapore} }

Related resources:

dataset
Phoneme and word level forced aligned data: Common Voice - English (860,000 utterances)

Jian Zhu and Cong Zhang

2022

Word and phone alignments for 2,000 hours of English speech from Common Voice (https://github.com/lingjzhu/charsiu/blob/main/misc/data.md#alignments-for-english-datasets). Some of the data come with demographic annotations, which makes them useful for studying speech styles, accents, and variation.
@misc{zhu2022aligned-en, author = {Zhu, Jian and Zhang, Cong}, title = {{Phoneme and word level forced aligned data: Common Voice - English (860,000 utterances)}}, year = {2022}, category = {dataset} }
dataset
Phoneme and word level forced aligned data: multiple datasets - Mandarin (over 1 million utterances)

Jian Zhu and Cong Zhang

2022

Phone and word alignments for 1,300 hours of speech from open-source Mandarin datasets, automatically aligned with our own Charsiu forced aligner.
@misc{zhu2022aligned-cn, author = {Zhu, Jian and Zhang, Cong}, title = {{Phoneme and word level forced aligned data: multiple datasets - Mandarin (over 1 million utterances)}}, year = {2022}, category = {dataset} }
aligner
Phone-to-audio alignment without text: A Semi-supervised Approach

Jian Zhu, Cong Zhang, and David Jurgens

2022

Charsiu is a phonetic alignment tool that can (1) force-align speech audio with its text transcription at the phone level, and/or (2) automatically recognise the text in speech audio without the need for any transcription. It is currently available for both Mandarin Chinese and English (mainly American English).
@misc{zhu2022phone-charsiu, title = {Phone-to-audio alignment without text: A Semi-supervised Approach}, author = {Zhu, Jian and Zhang, Cong and Jurgens, David} }
