This is intended primarily as a note for myself, and is very much a work in progress, but I thought that others might benefit. Also, if anyone commented, I would benefit. With the disclaimer out of the way, I will get to the point.
Basically, we have problems
In a pre-print, Flake and Fried (2019) make the point that valid measurement in psychology is very difficult to do and, even worse, that its validity is difficult to check because the researchers involved underreport their decision-making processes. This matters because psychology and its sub-disciplines heavily influence applied linguistics/SLA.
While psychology attempts to get through its replication crisis, the main ways for it to do so seem to be pre-registered studies and greater transparency in reporting them. Flake and Fried (2019) choose to look at "Questionable Measurement Practices (QMPs)" as opposed to "Questionable Research Practices" (Banks et al., 2016; John, Loewenstein, & Prelec, 2012, in Flake & Fried, ibid), such as HARKing (hypothesising after the results are known) (Kerr, 1998, in Flake & Fried, ibid) and p-hacking (manipulating data or analyses until the p-value, the probability of obtaining results at least as extreme as those observed if there were no real effect, falls below the significance threshold) (Head et al., 2015).
They go on to differentiate as follows:
“In the presence of QMPs, all four types of validity become difficult to evaluate… Statistical conclusion validity, which QRPs have largely focused on, captures whether conclusions from a statistical analysis are correct. It is difficult to evaluate when undisclosed measurement flexibility generates multiple comparisons in a statistical test, which could be exploited to obtain a desired result (i.e., QRPs). ”
(Flake & Fried, 2019, pp. 6-7)
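To make the "multiple comparisons" point concrete, here is a minimal simulation sketch in Python. This is not from Flake and Fried; the four-measure setup and all the numbers are my own hypothetical example. Two groups are drawn from identical distributions, but if a researcher has four interchangeable outcome measures and quietly reports whichever one reaches significance, the real false positive rate climbs well above the nominal five per cent.

# Sketch: undisclosed measurement flexibility as multiple comparisons.
# No true effect exists, yet "flexible" reporting produces many false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_measures, n_studies = 30, 4, 10_000
flexible_hits = 0

for _ in range(n_studies):
    p_values = []
    for _ in range(n_measures):
        # Both groups come from the same distribution: the null is true for every measure.
        group_a = rng.normal(0.0, 1.0, n_per_group)
        group_b = rng.normal(0.0, 1.0, n_per_group)
        p_values.append(stats.ttest_ind(group_a, group_b).pvalue)
    # Undisclosed flexibility: the study "works" if any one measure gives p < .05.
    if min(p_values) < 0.05:
        flexible_hits += 1

print("Nominal alpha: 0.05")
print("False positive rate with 4 undisclosed measures:", round(flexible_hits / n_studies, 3))
# With independent measures this comes out near 1 - 0.95**4, roughly 0.19;
# correlated measures inflate the rate less, but the basic problem remains.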
Flake and Fried (2019) state that many QMPs are not carried out deliberately; rather, a major problem is the lack of transparency about decisions made in the measurement process, which hinders not only replication but also any checking of validity.
They advocate answering the questions in a checklist (Flake & Fried, 2019, p. 9) to reduce the possibility of QMPs arising.
I am quite certain that a lot of applied linguistics masters-level students and above have seen articles where statistics are reported but it is not clear why those particular tests were chosen. Often these are blindly followed procedures of running an ANOVA or ANCOVA in SPSS. I will go out on a limb and say that these problems are ignored because that is simply how things are usually done.
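As a rough illustration of what declaring the reasoning behind a test could look like, here is a sketch in Python with scipy rather than SPSS; the group names and scores are invented. The idea is simply that the choice of a one-way ANOVA is tied to stated checks rather than habit.

# Sketch: making the choice of test explicit. Group names and data are invented.
import numpy as np
from scipy import stats

scores = {
    "control":   np.array([54, 61, 58, 49, 66, 57, 60, 52]),
    "treatment": np.array([63, 70, 59, 72, 68, 65, 61, 74]),
    "delayed":   np.array([58, 64, 60, 67, 55, 69, 62, 59]),
}

# Declared checks behind the decision to use a one-way ANOVA:
# 1. roughly normal scores within each group (Shapiro-Wilk),
# 2. roughly equal variances across groups (Levene's test).
for name, values in scores.items():
    print(name, "Shapiro-Wilk p =", round(stats.shapiro(values).pvalue, 3))
print("Levene p =", round(stats.levene(*scores.values()).pvalue, 3))

# If the checks look acceptable, run and report the one-way ANOVA;
# otherwise a stated alternative (e.g. Kruskal-Wallis) would be used and reported.
f_stat, p_value = stats.f_oneway(*scores.values())
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")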
However, how many of us have considered our controlled variables? For example, when running studies on phonological perception, are we explicit about the ranges of volume, fundamental frequency and formant frequency, or about any processing for noise reduction? I have seen studies that make claims to generalisability, not merely exploratory or preliminary studies, yet do not control these. If you are going to make such claims, I think there should be greater controls than in a study run primarily for oneself and shared only because it might be informative for others. Of course, declaring the decision-making process and rationale ought to be necessary in both.
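A minimal sketch of documenting one such property, overall stimulus level, might look like the following. The file names and the acceptable range are hypothetical, and fundamental frequency or formant measurements would need a dedicated tool such as Praat; this only shows the kind of explicit, reportable check I have in mind.

# Sketch: report the level (dBFS RMS) of each stimulus and flag anything outside
# a declared range. File names and the range itself are hypothetical.
import numpy as np
from scipy.io import wavfile

stimuli = ["stim_pair_01.wav", "stim_pair_02.wav", "stim_pair_03.wav"]
declared_range_db = (-23.0, -17.0)   # the range that would be stated in the method section

for path in stimuli:
    rate, samples = wavfile.read(path)
    if np.issubdtype(samples.dtype, np.integer):
        samples = samples / np.iinfo(samples.dtype).max   # scale integer samples to -1..1
    else:
        samples = samples.astype(np.float64)
    if samples.ndim > 1:                                   # mix stereo to mono if needed
        samples = samples.mean(axis=1)
    rms_db = 20 * np.log10(np.sqrt(np.mean(samples ** 2)))
    in_range = declared_range_db[0] <= rms_db <= declared_range_db[1]
    print(f"{path}: {rms_db:.1f} dBFS RMS (within declared range: {in_range})")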
There is an awful lot of talk about how language acquisition studies in classrooms are problematic because individual differences act as confounds. One way to increase validity and generalisability is to be explicit about the choices made regarding measurement and variables.
Acknowledgements
I took part in a Google Hangout hosted by Julia Strand. Some of the ideas discussed over that hour are bound to have wormed their way in and mingled with my own.
References
Flake, J. K., & Fried, E. I. (2019, January 17). Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them. Retrieved January 20, 2019, from https://doi.org/10.31234/osf.io/hs7wm
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13(3), e1002106. Retrieved February 1, 2019, from https://doi.org/10.1371/journal.pbio.1002106