publications
I will try my best to keep this page updated.
Articles
2024
- [tone] A cross-linguistic review of citation tone production studies: Methodology and recommendations. Chenzi Xu and Cong Zhang. The Journal of the Acoustical Society of America, Oct 2024.
The study of citation tones, lexical tones produced in isolation, is one of the first steps towards understanding speech prosody in tone languages. However, methodologies for investigating citation tones vary significantly, often leading to limited comparability of tone inventories, both within and across languages. This paper presents a systematic review of research methods and practices in 136 citation tone studies on 129 tonal language varieties in China, including 99 studies published in Chinese, which are therefore not easily available to an international scientific readership. The review provides an overview of possible analytical decisions along the research pipeline, and unveils considerable variation in data collection, analysis, and reporting conventions, particularly in how f0, the primary acoustic correlate for tone, is operationalised and reported across studies. Key methodological issues are identified, including small sample sizes and inadequate transparency in communicating methodological decisions and procedure. This paper offers a clear road map for citation tone production research and proposes a range of recommendations on speaker sampling, experimental design, acoustic processing techniques, f0 analysis, and result reporting, with the goal of facilitating future tonal research and enhancing resources for underrepresented tonal varieties.
- [remote] Investigating differences in lab-quality and remote recording methods with dynamic acoustic measures. Cong Zhang, Kathleen Jepson, and Yu-Ying Chuang. Laboratory Phonology, Oct 2024.
Increasingly, phonetic research uses data collected from participants who record themselves on readily available devices. Though such recordings are convenient, their suitability for acoustic analysis remains an open question, especially regarding how recording methods affect acoustic measures over time. We used Quantile Generalized Additive Mixed Models (QGAMMs) to analyse measures of F0, intensity, and the first and second formants, comparing files recorded using a laboratory-standard recording method (Zoom H6 recorder with an external microphone) to three remote recording methods: (1) the Awesome Voice Recorder application on a smartphone (AVR), (2) the Zoom meeting application with default settings (Zoom-default), and (3) the Zoom meeting application with the “Turn on Original Sound” setting (Zoom-raw). A linear temporal alignment issue was observed for the Zoom methods over the course of the long recording-session files; however, the difference was not significant for utterance-length files. F0 was reliably measured using all methods. Intensity and formants presented non-linear differences across methods that could not be corrected for simply. Overall, the AVR files were most similar to the H6’s, and so AVR is deemed to be a more reliable recording method than either Zoom-default or Zoom-raw.
- [aphasia] Prosody of speech production in latent post-stroke aphasia. Cong Zhang, Tong Li, Gayle DeDe, and Christos Salis. In Interspeech 2024, Oct 2024.
This study explores prosodic production in latent aphasia, a mild form of aphasia associated with left-hemisphere brain damage (e.g. stroke). Unlike prior research on moderate to severe aphasia, we investigated latent aphasia, whose speech production can seem very similar to neurotypical speech. We analysed the f0, intensity and duration of utterance-initial and utterance-final words of ten speakers with latent aphasia and ten matched controls. Regression models were fitted to improve our understanding of this understudied type of very mild aphasia. The results highlighted varying degrees of differences in all three prosodic measures between groups. We also investigated the diagnostic classification of latent aphasia versus neurotypical controls using a random forest classifier, aiming to build a fast and reliable tool to assist with the identification of latent aphasia. The random forest analysis also reinforced the significance of prosodic features in distinguishing latent aphasia.
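To make the classification step concrete, here is a minimal sketch of a random-forest diagnostic classifier over prosodic features using scikit-learn. The file name, column names, and feature set are hypothetical placeholders for illustration, not the features or data used in the paper.

```python
# Minimal sketch, assuming a hypothetical CSV of per-utterance prosodic measures.
# Column names ("f0_initial", "duration_final", ...) are illustrative only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("prosody_features.csv")            # hypothetical file
features = ["f0_initial", "f0_final", "intensity_initial",
            "intensity_final", "duration_initial", "duration_final"]
X = df[features]
y = df["group"]                                      # e.g. "latent_aphasia" vs "control"

clf = RandomForestClassifier(n_estimators=500, random_state=42)
print("Mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
for name, importance in zip(features, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")               # which prosodic cues carry most weight
```

Feature importances give a quick view of which prosodic cues the forest relies on, in the spirit of the paper's point that prosodic features are informative for this distinction.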
- [gamification] Collecting Big Data Through Citizen Science: Gamification and Game-based Approaches to Data Collection in Applied Linguistics. Yoolim Kim, Vita V. Kogan, and Cong Zhang. Applied Linguistics, Oct 2024.
Gamification of behavioral experiments has been applied successfully to research in a number of disciplines, including linguistics. We believe that these methods have been underutilized in applied linguistics, in particular second-language acquisition research. The incorporation of games and gaming elements (gamification) in behavioral experiments has been shown to mitigate many of the practical constraints characteristic of lab settings, such as limited recruitment and small-scale data. With gamified and game-based experiments, data collection can occur remotely with greater ease and on a much wider scale, yielding data that are ecologically valid and robust. These methods enable the collection of data that are comparable in quality to data collected in more traditional settings, while engaging far more diverse participants with different language backgrounds who are more representative of the greater population. We highlight three successful applications of games and gamification in applied linguistic experiments to illustrate the effectiveness of such approaches, and to invite other applied linguists to do the same.
2023
- [clinical] Exploring the Acoustic and Prosodic Features of a Lung-Function-Sensitive Repeated-Word Speech Articulation Test. Biao Zeng, Edgar Mark Williams, Chelsea Owen, Cong Zhang, Shakiela Khanam Davies, Keira Evans, and Savannah-Rose Preudhomme. Frontiers in Psychology, Oct 2023.
Introduction: Speech breathing is a term usually used to refer to the manner in which expired air and lung mechanics are utilized for the production of the airflow necessary for phonation. Neurologically, speech breathing overrides the normal rhythms of alveolar ventilation. Speech breathing is generated using the diaphragm, glottis, and tongue. The glottis is the opening between the vocal folds in the larynx; it is the primary valve between the lungs and the mouth, and by varying its degree of opening, the sound can be varied. The use of voice as an indicator of health has been widely reported. Chronic obstructive pulmonary disease (COPD) is the most common long-term respiratory disease. The main symptoms of COPD are increasing breathlessness, a persistent chesty cough with phlegm, frequent chest infections, and persistent wheezing. There is no cure for COPD, and it is one of the leading causes of death worldwide. The principal cause of COPD is tobacco smoking, and estimates indicate that COPD will become the third leading cause of death worldwide by 2030. The long-term aim of this research program is to understand how speech generation, breathing, and lung function are linked in people with chronic respiratory diseases such as COPD. Methods: This pilot study was designed to test an articulatory speech task that uses a single word (“helicopter”), repeated multiple times, to challenge speech-generated breathing and breathlessness. Specifically, a single-word articulation task was used to challenge respiratory system endurance in people with healthy lungs by asking participants to rapidly repeat the word “helicopter” for three 20-s runs interspersed with two 20-s rest periods of silent relaxed breathing. Acoustic and prosodic features were then extracted from the audio recordings of each adult participant. Results and discussion: The pause ratio increased from the first run to the third, representing an increasing demand for breath. These data show that the repeated articulation task challenges speech articulation in a quantifiable manner, which may prove useful in defining respiratory ill-health.
- [BigTeam] Multidimensional signals and analytic flexibility: Estimating degrees of freedom in human speech analyses. Ste Coretta, Joseph V. Casillas, ..., Cong Zhang, ..., and Timo B. Roettger. Advances in Methods and Practices in Psychological Science, Oct 2023.
Recent empirical studies have highlighted the large degree of analytic flexibility in data analysis which can lead to substantially different conclusions based on the same data set. Thus, researchers have expressed their concerns that these researcher degrees of freedom might facilitate bias and can lead to claims that do not stand the test of time. Even greater flexibility is to be expected in fields in which the primary data lend themselves to a variety of possible operationalizations. The multidimensional, temporally extended nature of speech constitutes an ideal testing ground for assessing the variability in analytic approaches, which derives not only from aspects of statistical modeling, but also from decisions regarding the quantification of the measured behavior. In the present study, we gave the same speech production data set to 46 teams of researchers and asked them to answer the same research question, resulting in substantial variability in reported effect sizes and their interpretation. Using Bayesian meta-analytic tools, we further find little to no evidence that the observed variability can be explained by analysts’ prior beliefs, expertise or the perceived quality of their analyses. In light of this idiosyncratic variability, we recommend that researchers more transparently share details of their analysis, strengthen the link between theoretical construct and quantitative system and calibrate their (un)certainty in their conclusions.
- [ICPhS] Language redundancy effects on F0: A preliminary controlled study. Cong Zhang, Catherine Lai, Ricardo Souza, Alice Turk, and Tina Bögel. In Proceedings of the 20th International Congress of Phonetic Sciences, Oct 2023.
Previous research suggests that words with a high level of language redundancy (i.e. recognition likelihood from familiarity and predictability based on syntactic, pragmatic, and semantic factors) have reduced acoustic salience, such as shorter duration and reduced vowels. The Smooth Signal Redundancy Hypothesis proposes that acoustic salience is controlled via prosodic structure, and predicts that parameters such as fundamental frequency should also be affected by language redundancy. This study investigates the relationship of F0 with lexical frequency, together with bigram (verb-adjective or adjective-noun) frequency and the ratio between these two bigram frequencies. Results from a carefully controlled experiment with quadruplets of minimal pairs suggest that language redundancy can affect fundamental frequency in English.
2022
- [EMNLP Findings] Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings. Jian Zhu, Zuoyu Tian, Yadong Liu, Cong Zhang, and Chia-wen Lo. Findings of Empirical Methods in Natural Language Processing, Oct 2022.
Inducing semantic representations directly from speech signals is a highly challenging task but has many useful applications in speech mining and spoken language understanding. This study tackles the unsupervised learning of semantic representations for spoken utterances. By converting speech signals into hidden units generated from acoustic unit discovery, we propose WavEmbed, a multimodal sequential autoencoder that predicts hidden units from a dense representation of speech. We also propose S-HuBERT to induce meaning through knowledge distillation, in which a sentence embedding model is first trained on hidden units and passes its knowledge to a speech encoder through contrastive learning. The best performing model achieves a moderate correlation (0.5–0.6) with human judgments, without relying on any labels or transcriptions. Furthermore, these models can also be easily extended to leverage textual transcriptions of speech to learn much better speech embeddings that are strongly correlated with human annotations. Our proposed methods are applicable to the development of purely data-driven systems for speech mining, indexing and search.
- [rhythm] Task effect on L2 rhythm production by Cantonese learners of Portuguese. Yuqi Sun and Cong Zhang. DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada, Oct 2022.
This study examines L2 Portuguese speech produced by eight native Cantonese speakers from Macao, China. The aims of this study are to investigate (1) whether the speech rhythm in L2 Portuguese is more source-like (more similar to Cantonese) or more target-like (more similar to Portuguese), and (2) whether L2 speech rhythm differs across three different tasks: a reading task, a retelling task, and an interpreting task. Seven rhythm metrics, i.e., %V, ΔC, ΔV, VarcoC, VarcoV, rPVI_C, and nPVI_V, were adopted for comparison and investigation. The results showed that L2 Portuguese rhythm produced by Cantonese speakers differed from L1 Portuguese speakers’ rhythm. R-deletion and vowel epenthesis were the causes of the variability and instability of L2 Portuguese production by Cantonese learners, as they affect the duration and the number of vowel and consonantal intervals. Moreover, in Cantonese learners’ L2 Portuguese production, the semi-spontaneous tasks (retelling and interpreting) differed significantly from the reading task. The driving force behind this difference was the cognitive load of the tasks.
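For readers unfamiliar with these duration-based rhythm metrics, the sketch below shows how they are conventionally computed from vocalic and consonantal interval durations. The interval durations are made-up placeholders; only the formulas follow the standard definitions of the metrics.

```python
# Minimal sketch of the standard duration-based rhythm metrics.
# Interval durations (in seconds) are illustrative only, not data from the study.
import statistics

def rhythm_metrics(vowel_ints, cons_ints):
    total = sum(vowel_ints) + sum(cons_ints)
    pct_v = 100 * sum(vowel_ints) / total                   # %V: proportion of vocalic material
    delta_v = statistics.pstdev(vowel_ints)                 # ΔV: SD of vocalic interval durations
    delta_c = statistics.pstdev(cons_ints)                  # ΔC: SD of consonantal interval durations
    varco_v = 100 * delta_v / statistics.mean(vowel_ints)   # VarcoV: rate-normalised ΔV
    varco_c = 100 * delta_c / statistics.mean(cons_ints)    # VarcoC: rate-normalised ΔC
    # raw PVI over consonantal intervals, normalised PVI over vocalic intervals
    rpvi_c = statistics.mean(abs(a - b) for a, b in zip(cons_ints, cons_ints[1:]))
    npvi_v = 100 * statistics.mean(
        abs(a - b) / ((a + b) / 2) for a, b in zip(vowel_ints, vowel_ints[1:])
    )
    return {"%V": pct_v, "deltaV": delta_v, "deltaC": delta_c,
            "VarcoV": varco_v, "VarcoC": varco_c,
            "rPVI_C": rpvi_c, "nPVI_V": npvi_v}

# Illustrative durations only
print(rhythm_metrics([0.08, 0.12, 0.10, 0.15], [0.06, 0.09, 0.05, 0.11]))
```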
- [Interspeech] ByT5 model for massively multilingual grapheme-to-phoneme conversion. Jian Zhu*, Cong Zhang*, and David Jurgens [* equal contribution]. In Interspeech 2022, Oct 2022.
In this study, we tackle massively multilingual grapheme-to-phoneme (G2P) conversion by implementing G2P models based on ByT5. We curated a G2P dataset from various sources that covers around 100 languages and trained large-scale multilingual G2P models based on ByT5. We found that ByT5, operating on byte-level inputs, significantly outperformed the token-based mT5 model in multilingual G2P. Pairwise comparison with monolingual models in these languages suggests that multilingual ByT5 models generally lower the phone error rate by jointly learning from a variety of languages. The pretrained model can further benefit low-resource G2P through zero-shot prediction on unseen languages, or by providing pretrained weights for finetuning, which help the model converge to a lower phone error rate than randomly initialized weights. To facilitate future research on multilingual G2P, we make our code and pretrained multilingual G2P models available at: https://github.com/lingjzhu/CharsiuG2P.
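As a rough illustration of how a pretrained ByT5 G2P model of this kind could be queried with the Hugging Face transformers library: the checkpoint name and the language-tag prompt format below are assumptions made for illustration only; consult the CharsiuG2P repository linked above for the actual model names and input conventions.

```python
# Minimal sketch, not the authors' exact interface.
# The checkpoint name and the "<lang>: word" prompt format are assumptions;
# see https://github.com/lingjzhu/CharsiuG2P for the real usage.
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "charsiu/g2p_multilingual_byT5_small"     # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

words = ["<eng-us>: phonetics", "<fra>: bonjour"]      # language tags are assumptions
inputs = tokenizer(words, padding=True, return_tensors="pt")
phones = model.generate(**inputs, max_length=50)
print(tokenizer.batch_decode(phones, skip_special_tokens=True))
```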
- [ICASSP] Phone-to-audio alignment without text: A semi-supervised approach. Jian Zhu, Cong Zhang, and David Jurgens. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Oct 2022.
The task of phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment. The proposed Wav2Vec2-FS, a semi-supervised model, directly learns phone-to-audio alignment through contrastive learning and a forward sum loss, and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The other model, Wav2Vec2-FC, is a frame classification model trained on forced-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that both proposed methods, even when transcriptions are not available, produce results very close to those of existing forced alignment tools. Our work presents a neural pipeline for fully automated phone-to-audio alignment.
- [Speech Prosody] The many shapes of H*. Stella Gryllia, Amalia Arvaniti, Cong Zhang, and Katherine Marcoux. In Speech Prosody 2022, Oct 2022.
We examined individual and task-related variability in the realization of Greek nuclear H* followed by L-L% edge tones. The accents (N = 748) were elicited from native speakers of Greek, producing scripted and unscripted speech, and examined using functional Principal Components Analysis. The accented vowel onset was used for landmark registration to capture accent shape and the alignment of the fall. The resulting PCs were analysed using LMEMs (fixed factors: speaker; task type (scripted, unscripted); accented syllable distance from the analysis window offset, to examine the effects of tonal crowding). Tonal scaling and the steepness of the fall (reflected in PC1 and PC2 respectively) changed by task in ways that differed across speakers. PC3, which captured accent shape, also varied by speaker, reflecting shape differences between a rise-fall and (the expected) plateau-plus-fall realization. Tonal crowding did not have consistent effects. In short, the overall accent shape and the alignment of the accentual fall varied by speaker and task. These results hint at substantial variability in tonal realization. At the same time, they indicate that tonal alignment is not as consistent as is sometimes portrayed and thus it should not be the sole criterion for tone categorization.
- [Speech Prosody] Disentangling emphasis from pragmatic contrastivity in the English H* ∼ L+H* contrast. Amalia Arvaniti, Stella Gryllia, Cong Zhang, and Katherine Marcoux. In Speech Prosody 2022, Oct 2022.
English H* and L+H* indicate new and contrastive information respectively, though some argue the difference between them is solely one of phonetic emphasis. We used (modified) Rapid Prosody Transcription to test these views. Forty-seven speakers of Standard Southern British English (SSBE) listened to 86 SSBE utterances and marked the words they considered prominent or emphatic. Accents (N = 281) were independently coded as H* or L+H* using phonetic criteria, and as contrastive or non-contrastive using pragmatic criteria. If L+H* is an emphatic H*, all L+H*s should be more prominent than H*s. If the accents mark pragmatic information, contrastivity should drive responses. Contrastive accents and L+H*s were considered more prominent than non-contrastive accents and H*s respectively. Individual responses showed different strategies: for some participants, all L+H*s were more prominent than H*s, for others, contrastive accents were more prominent than non-contrastive accents, and for still others, there was no difference between categories. These results indicate that a reason for the continuing debate about English H* and L+H* may be that the two accents form a weak contrast which some speakers acquire and attend to while others do not.
2021
- [Interspeech] Synchronising Speech Segments with Musical Beats in Mandarin and English Singing. Cong Zhang and Jian Zhu. In Interspeech 2021, Oct 2021.
Generating synthesised singing voice with models trained on speech data has many advantages due to the models’ flexibility and controllability. However, since information about the temporal relationship between segments and beats is lacking in speech training data, the synthesised singing may sound off-beat at times. The availability of information on the temporal relationship between speech segments and musical beats is therefore crucial. The current study investigated segment-beat synchronisation in singing data, with hypotheses formed on the basis of the linguistic theories of the P-centre and the sonority hierarchy. A Mandarin corpus and an English corpus of professional singing data were manually annotated and analysed. The results showed that the presence of musical beats was more dependent on segment duration than on sonority. However, the sonority hierarchy and the P-centre theory were highly related to the location of beats. Mandarin and English demonstrated cross-linguistic variation despite exhibiting common patterns.
- [JASA] Comparing acoustic analyses of speech data collected remotely. Cong Zhang, Kathleen Jepson, Georg Lohfink, and Amalia Arvaniti. The Journal of the Acoustical Society of America, Oct 2021.
Face-to-face speech data collection has been next to impossible globally due to COVID-19 restrictions. To address this problem, simultaneous recordings of three repetitions of the cardinal vowels were made using a Zoom H6 Handy Recorder with external microphone (henceforth H6) and compared with two alternatives accessible to potential participants at home: the Zoom meeting application (henceforth Zoom) and two lossless mobile phone applications (Awesome Voice Recorder, and Recorder; henceforth Phone). F0 was tracked accurately by all devices; however, for formant analysis (F1, F2, F3) Phone performed better than Zoom, i.e. more similarly to H6, though data extraction method (VoiceSauce, Praat) also resulted in differences. In addition, Zoom recordings exhibited unexpected drops in intensity. The results suggest that lossless format phone recordings present a viable option for at least some phonetic studies.
2020
- [Speech Prosody] Segment Duration and Proportion in Mandarin Singing. Cong Zhang and Xinrong Wang. In Proc. Speech Prosody 2020, Oct 2020.
Speech-based singing synthesis has various merits, but it also has unsolved issues. One of the most noticeable is segment duration and proportion in synthesised singing, which arises from the difference between the short syllables of speech and the lengthened syllables of singing. This study therefore investigates how syllables are lengthened in Mandarin singing data. A total of 20 songs from the MIREX singing corpus were segmented and analysed. The results showed that (1) the segment proportions in Mandarin syllables differ between speech and singing; (2) the lengthening is influenced more by the slots in the syllable structure than by the types of segments; (3) in the Mandarin CGVX syllable structure, the nuclear V lengthens the most, followed by X. The durations of C and G also increase, but their proportions within the syllable decrease.
2019
- [tone-intonation] Stacking and Unstacking Prosodies: The Production and Perception of Sentence Prosody in a Tonal Language. Cong Zhang. In Proceedings of the 19th International Congress of Phonetic Sciences, Oct 2019.
Teasing apart lexical prosody and sentence prosody has been one of the most difficult tasks in the study of intonational tunes in tonal languages. Are different prosodic manifestations stacked, or are they an integrated whole? With evidence from production and perception data on the intonational yes/no question tune in Tianjin Mandarin at sentence level, this paper proposes that (1) lexical tonal alterations (a.k.a. tone sandhi) are lexical-level prosody and do not belong to the sentence-level tune; (2) pitch accents induced by information structure are “intra-tune” features, i.e. sentence-level prosodic features that do not change the sentence type. Despite being sentence-level prosodic features, they are not part of the tune of the intonational yes/no question.
2018
- Tianjin Mandarin Tones and Tunes. Cong Zhang. Doctoral thesis, University of Oxford, Oct 2018.
Lexical tones and intonational tunes are both mainly realised through pitch modulation. What role does intonation play in a language which has a lexical tonal contrast? Can one separate ‘tone’ from ‘intonation’? If yes, how do lexical tones interact with intonational tunes? In order to answer these questions, this thesis investigates how tone and intonation interact in production and perception in Tianjin Mandarin, by examining the components of different intonational tunes under the Autosegmental-Metrical (AM) framework (Pierrehumbert, 1980), and the cues native listeners use during the tune identification process. Chapters 1–3 are the introductory chapters: Chapter 1 introduces the topic of research and sketches the three research goals for this thesis – the theoretical goal, the documentation goal, and the methodological goal; Chapter 2 addresses the theoretical foundation of this thesis – the AM theory; and Chapter 3 outlines the linguistic background of Tianjin Mandarin. Chapter 4 presents production studies of the tune of intonational yes/no questions (IntQ) in Tianjin Mandarin. A total of six native Tianjin speakers were recorded producing monosyllabic words in isolation (Mono(ISO)) and monosyllabic words as sentence prominence (Mono(SEN)), with the statement tune and the IntQ tune respectively. The results show that when a monosyllabic word is produced in isolation, the IntQ tune has a raised register and a floating H% boundary tone at the end of the intonational phrase. When a monosyllabic word is in sentence prominence position, the IntQ tune also has a raised register and a floating H% boundary tone, as well as an H* pitch accent coming from the focus and post-focus compression. The IntQ tune is: [H* pitch accent + (post-focus compression) + floating H̥% boundary tone]_higher register. To further investigate how the IntQ tune is represented, three perception experiments were conducted in Chapter 5 on monosyllabic words in isolation, monosyllabic words as sentence prominence, and sentences with monosyllabic words as prominence. A total of 28 native Tianjin Mandarin speakers participated in the experiments. They were asked to identify the tunes (yes/no question or statement) of the audio stimuli. The accuracy of their responses and their reaction times together show that they strongly prefer the H-Rising lexical tone for IntQs and the L-Falling lexical tone for statements, which indicates that they look for low-register information when identifying statements and for an H boundary tone when identifying the IntQ tune. Another important tune, the chanted call (CC) tune, was also studied in Chapter 6 to further investigate the possibilities of intonational tunes in a tonal language. Six native speakers’ productions of monosyllabic and disyllabic words were recorded. The results show that there is an L% boundary tone at the end of the intonational phrase, regardless of the lexical tones. Unlike in the IntQ data, the L% boundary tone is phonetically manifested and overrides the lexical tone contours. An H* pitch accent was found to be associated with the H of each lexical tone. Lengthening was also found in the CC tune. The CC tune in Tianjin Mandarin can be represented as: [[H*]_sustained]_higher register + L%.
- [Speech Prosody] Chanted Call Tune in Tianjin Mandarin: Disyllabic Calls. Cong Zhang. In 9th International Conference on Speech Prosody 2018, Oct 2018.
This paper examines the chanted call (CC) tune in Tianjin Mandarin in order to investigate the possibilities of intonational components, i.e. pitch accents, boundary tones, etc., in a tonal language. Six native Tianjin speakers’ productions of disyllabic names and kinship terms were recorded. The speech materials were composed of a set of left-prominent disyllabic names and a set of right-prominent disyllabic names. The results show that there is an L% boundary tone at the end of the intonational phrase, regardless of the lexical tones. Unlike in the intonational question (IntQ) data, the L% boundary tone is phonetically manifested and overrides the lexical tone contours. An H* pitch accent was found to be associated with the H of each lexical tone. Lengthening was also found in the CC tune. The CC tune in Tianjin Mandarin can be represented as: [[H*]_sustained]_higher register + L%.
2015
- [fluency] The Effect of Study Abroad Experience on L2 Mandarin Disfluency in Different Types of Tasks. Clare Wright and Cong Zhang. In Proceedings of Disfluency in Spontaneous Speech, Oct 2015.
Disfluency is a common phenomenon in L2 speech, especially in beginners’ speech. Whether studying abroad can help reduce disfluency remains debated. In this paper, we examined longitudinal data from 10 adult English instructed learners of Mandarin, measured before and after ten months of studying abroad (SA). We used two speaking tasks comparing pre-planned and unplanned spontaneous speech to compare differences over time and between tasks, using eight linguistic and temporal fluency measures (analysed using CLAN and PRAAT). Overall mean linguistic and temporal fluency scores improved significantly (p < .05), especially speech rate (p < .01), supporting the general claim that SA favours oral development, particularly fluency. Further analysis revealed task differences at both times of measurement, but with greater improvement in the spontaneous task.
2014
- [fluency] Examining the Effects of Study Abroad on Mandarin Chinese Language Development among UK University Learners. Clare Wright and Cong Zhang. Newcastle Working Papers in Linguistics, Oct 2014.
This study tracked ten third-year English students learning Mandarin Chinese as a second language (L2) at a UK university, to examine changes in L2 Mandarin during an eight-month period spent studying abroad (SA). We used three writing tasks and four speaking tasks as measures of writing and speaking proficiency, to assess total output, grammatical accuracy, lexical development, pronunciation and fluency, repeated before and after SA in China. Overall mean oral proficiency scores improved significantly (p < .05), especially speech rate (p < .01), supporting the claim that SA favours fluency development (Freed et al. 2004), although the measures highlighted difficulties in clarifying precisely how to assess oral proficiency. Written proficiency showed fewer marked improvements: only one writing test (an untimed short essay) improved significantly in length (p < .05) and in use of complex grammar (de relative-clause morphemes, p < .001). A sub-group (n=7) provided quantitative data on L2 Mandarin use at different times during SA, showing clear individual differences and highlighting the value of capturing details of students’ experiences during SA (Regan et al. 2009). We also note the lack of standardised linguistically-informed measures for tracking development in L2 Mandarin (Freed et al. 2004; Pallotti 2009; De Jong et al. 2012). Further research is therefore much needed to identify systematic linguistic development in L2 Mandarin, and also to bridge theory and practice in L2 Mandarin language teaching to clarify the interconnecting factors that affect L2 Mandarin language development.
- [feedback] The effect of immediate feedback on the perception of Mandarin lexical tones by non-native speakers of Mandarin. Cong Zhang. St. Anne’s Annual Review, Oct 2014.
Lexical tone is one of the most difficult aspects of learning Mandarin as a foreign language. Various efforts have been made to improve non-native speakers’ perception of Mandarin lexical tones through training. Immediate feedback, however, an essential and efficient means of perceptual learning, has been understudied. An AX discrimination task was used to test whether participants’ perception of Mandarin lexical tones improves after being given immediate feedback. The results show a clear effect of immediate feedback on the perception of Mandarin lexical tones, both within the experiment group and between the experiment group and the control group.
2012
- The effect of immediate and simple feedback on the perception of Mandarin lexical tones by English speakers. Cong Zhang. MA dissertation, Newcastle University, UK, Oct 2012.
Lexical tone is an important feature of Mandarin. However, it causes substantial difficulty for non-Mandarin speakers. Since lexical tones are not completely unintelligible to non-Mandarin speakers, many methods have been employed to improve the perception of Mandarin lexical tones by non-Mandarin speakers. One method is perceptual learning, a learning style through which people pick up “previously unused information” (Gibson & Gibson, 1955). Most perceptual learning studies on Mandarin lexical tone perception have used perceptual training as the condition for the participants to pick up information (e.g. Wang et al. 1999; Wang et al. 2003; Francis et al. 2008). Feedback, another important tool for perceptual learning, is nevertheless understudied in Mandarin lexical tone perceptual learning research. Since feedback has many categories, the current study examines only the immediate and simplest form of feedback. Whether the perception of lexical tones by non-Mandarin speakers can be improved by receiving immediate and simple feedback is the focus of this study. The aims of this study are: 1) to investigate the effect of feedback and provide future studies with experimental grounds for the use of feedback; 2) to suggest the use of feedback in Computer Assisted Language Learning by investigating the effect of feedback; 3) to contribute empirical evidence to existing theories, such as Autosegmental Theory (Goldsmith 1979), Categorical Perception (Best 1995), and the Noticing Hypothesis (Schmidt 1990). The current study examined the perception of Mandarin lexical tones by 24 native British English speakers, with 5 native Mandarin speakers as one of the control groups. The experiment made use of an AX discrimination task which required the participants to judge whether the tones of 160 pairs of stimuli were the same or not, regardless of the consonants or vowels. The experiment group consisted of 12 native British English speakers. This group received simple feedback, which indicated only that a judgment was incorrect, immediately after the judgment was made. As a control group, the other 12 participants did not receive any feedback. The results of the experiment showed that simple immediate feedback did not have a significant effect on the perception of Mandarin lexical tones by English speakers, in terms of either accuracy or reaction time. The reasons for these results are discussed mainly in terms of feedback types, individual differences in perceptual learning, and the influence of musical experience on lexical tone perception. The first chapter of this dissertation is a general introduction. The second chapter introduces background information on the phonology of Mandarin syllables. The third chapter reviews previous studies on lexical tone perception, perceptual learning and feedback, to establish the basis for this study. A pilot experiment and a full-scale experiment are reported in the fourth and fifth chapters respectively. The last chapter is dedicated to the general discussion, conclusion and suggestions for future studies.
Preprints
- [featureTTS] Applying Phonological Features in Multilingual Text-To-Speech. Cong Zhang, Huinan Zeng, Huang Liu, and Jiewen Zheng. 2021.
This study investigates whether phonological features can be applied in text-to-speech (TTS) systems to generate native and non-native speech. We present a mapping from ARPABET/pinyin to SAMPA/SAMPA-SC to phonological features, and test whether native, non-native, and code-switched speech can be successfully generated using this mapping. We ran two experiments, one with a small dataset and one with a larger dataset. The results showed that phonological features can be a feasible input system, although further investigation is needed to improve model performance. The accented output generated by the TTS models also helps with understanding human second language acquisition processes.
- [intonation] Floating Boundary Tone: Production and Perception of Syntactically Unmarked Polar Question in Tianjin Mandarin. Cong Zhang and Aditi Lahiri. 2021.
The present study investigates the intonational tune of the syntactically unmarked polar question in Tianjin Mandarin. A production study was conducted to examine the phonological features of the syntactically unmarked polar question (a.k.a. intonational yes/no question) tune by comparing it against the statement tune. The results show a significant register lift HR and a high floating boundary tone H̥I. The tone shape and tone register played a significant role in how the tunes varied. A tune identification task then further verified whether the two prosodic features identified in production are used in perception. The results showed that both the register difference and the boundary tone made a difference in native speakers’ perception when discriminating questions from statements.