This post goes into a bit more phonetic detail than some of the people who know me might be comfortable with.
On Friday I went to Tokyo JALT’s monthly meeting (no link because I can’t find a permalink) to see three presentations on pronunciation (or more accurately, phonology, seeing as Alastair Graham-Marr covered both productive and receptive, i.e. listening, skills). All three presenters, Kenichi Ohyama, Yukie Saito and Alastair Graham-Marr, were interesting, but one particular point from Yukie Saito’s presentation stuck with me.
She was talking about rating pronunciation and how it had often been carried out by ‘native speaker’ raters. She also said that these ratings were often based on rater intuition, using Likert scales of ‘fluency’ (usually operating as speed of speech), ‘intelligibility’ (usually meaning phonemic conformity to a target community norm) or ‘comprehensibility’ (how easily raters understand speakers).
What could work instead is a question that needs answering, not only to make research in applied linguistics more rigorous but also to make pronunciation assessment less arbitrary. I have an idea. Audio corpora of speakers in target communities could be gathered, the phonemes analysed in Praat, and typical acceptable ranges of formant frequencies established. Learners would then be rated for comprehensibility by proficient speakers, ideally from the target community, and their recordings would also be run through Praat to check whether their phonemes fall within the acceptable formant ranges. The two sets of data would then be triangulated and a single value assigned based on both.
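To make the idea a bit more concrete, here is a rough Python sketch of the triangulation step. It assumes the F1/F2 values (in Hz) have already been measured in Praat and exported, so nothing here talks to Praat itself. The ‘acceptable range’ rule (mean ± 2 standard deviations), the 1 to 9 comprehensibility scale and the 50/50 weighting are all my own illustrative assumptions, as are the function names, and the numbers are invented purely to exercise the code. Formant checks only really suit vowels and approximants such as /l/ (stops and fricatives would need other acoustic cues), and raw Hz values vary with the speaker’s vocal tract, so a real version would probably need some speaker normalisation too.

```python
from statistics import mean, stdev

# Rough sketch of the formant-range idea. Assumes F1/F2 values (Hz) were already
# measured in Praat and exported; nothing here calls Praat itself.
# The mean +/- k*SD rule and the 50/50 weighting are illustrative assumptions.


def acceptable_range(values, k=2.0):
    """Return (low, high) = mean +/- k sample standard deviations (needs >= 2 values)."""
    m, s = mean(values), stdev(values)
    return (m - k * s, m + k * s)


def formant_ranges(corpus_tokens, k=2.0):
    """Build {formant_name: (low, high)} from target-community tokens of one phoneme."""
    names = corpus_tokens[0].keys()
    return {f: acceptable_range([t[f] for t in corpus_tokens], k) for f in names}


def conforms(token, ranges):
    """True if every measured formant of a learner token falls inside its range."""
    return all(low <= token[f] <= high for f, (low, high) in ranges.items())


def conformity_rate(learner_tokens, ranges):
    """Proportion (0-1) of a learner's tokens that fall inside the ranges."""
    if not learner_tokens:
        raise ValueError("no learner tokens to check")
    return sum(conforms(t, ranges) for t in learner_tokens) / len(learner_tokens)


def combined_score(conformity, rater_scores, scale_min=1, scale_max=9, weight=0.5):
    """Triangulate acoustic conformity with mean Likert comprehensibility ratings.

    Ratings are rescaled to 0-1 so both sources share a scale before weighting.
    """
    rater = (mean(rater_scores) - scale_min) / (scale_max - scale_min)
    return weight * conformity + (1 - weight) * rater


if __name__ == "__main__":
    # Invented numbers, purely to exercise the code (not real measurements).
    corpus = [
        {"F1": 300, "F2": 1000},
        {"F1": 320, "F2": 1040},
        {"F1": 310, "F2": 1020},
        {"F1": 330, "F2": 980},
        {"F1": 290, "F2": 1060},
    ]
    learner = [
        {"F1": 305, "F2": 1010},  # inside both ranges
        {"F1": 400, "F2": 1010},  # F1 too high
        {"F1": 320, "F2": 1050},  # inside both ranges
        {"F1": 310, "F2": 1200},  # F2 too high
    ]
    ranges = formant_ranges(corpus)
    rate = conformity_rate(learner, ranges)
    print(rate)  # 0.5
    print(combined_score(rate, [7, 8, 6]))  # 0.625
```

The weighting is the obvious weak point: there is no principled reason for 50/50, and whatever weighting is chosen would itself need justifying, but at least it would be stated openly rather than buried in a rater’s head.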
Now, I fully acknowledge that there are some major drawbacks to this. Gathering an audio corpus is a massive pain. Running it all through Praat and gathering the data is even more so. Then doing the same with learners for assessment makes things more taxing still. However, is it really better to rely on rater hunches and hope that the raters generally agree? I don’t think so, and the reason is that there is no construct that makes any of this less arbitrary, especially if assessment is done quickly. With Praat, there is at least some quantifiable evidence of whether, for example, a learner-produced /l/ conforms to the /l/ typically produced in the target community, and this would be triangulated with the rater data. It would also go some way towards making the sometimes baffling assessment methodologies a bit more transparent, at least to other researchers.