Chapter 1 Introduction
1.1 Context of the research
1.1.1 Rater variability in performance assessment
As a direct measure of learners' communicative language ability, performance assessment is commonly espoused for the close link between the assessment situation and authentic language use, and it is often assumed to enhance the validity of inferences drawn from scores (Bachman et al., 1995; Lynch & McNamara, 1998). It has therefore been increasingly included as a compulsory or optional component of many large-scale language assessments worldwide.
However, the elicitation of complex responses from candidates inevitably calls for raters to make evaluative judgments on the effectiveness of candidate performance, or on the degree of mastery of the underlying construct that the assessment intends to measure. Variability in the resulting scores has been regarded as “measurement error” that lowers the reliability, and hence the validity, of the test (Huot, 2002; Ruth & Murphy, 1988). Research has accordingly focused on increasing score consistency across raters (intra-/inter-rater reliability) and across writing tasks and occasions (test-retest reliability) by controlling or reducing the variability attributable to rater factors. Various procedures have been proposed, including training raters, using standardized rating scales to direct raters to look for the same writing features, monitoring raters regularly to check their consistency in applying the rating scales, and adopting double marking (e.g., Jacobs et al., 1981; Underhill, 1982). However, findings from numerous studies of performance assessment indicate that even after principled rater training or standardization, raters still exhibit considerable variability or idiosyncrasy in the ratings they award (Lumley & McNamara, 1995; Weigle, 1998). Rater factors have therefore been identified as one of the significant sources of variability that researchers must explore and take into account when interpreting and using test scores to make valid and fair decisions (Cumming, 1997; Huot, 2002; Schoonen, 2005). It is reasonable to assume that raters mediate between the candidate performance and the final score through their internalized criteria and their particular approach to implementing those criteria, and that this mediation shapes the meaningfulness of the score and the appropriateness of the inferences made from the results.
Two lines of research have investigated rater variability by exploring the underlying factors that lead to the observed variation. One line of studies has been concerned with the variability introduced by raters' judgments, examined through statistical analysis of the scores they award. The other has focused on raters' rationales for their scoring decisions. Rather than focusing on the final scores, this line of research views raters as decision-makers who go through different thought processes to arrive at those scores.
The first line of research has focused on the statistical modeling of rater effects. The statistical methods most commonly used to detect and measure rater variability include inter-/intra-rater reliability indices under Classical Test Theory (CTT), estimation of the variance components associated with the rater facet under Generalizability Theory (G-theory), and calibration of individual raters' rating patterns with the Many-Facet Rasch Model (MFRM). By operationalizing rater variability in different ways, these methods provide different statistical indices that depict raters' ratings from different perspectives. Although the more sophisticated methods such as G-theory and MFRM enable researchers to investigate rater factors more systematically than the correlation coefficients of CTT, they leave the complexity of, and interactions within, the rating process unexplored. Many researchers have therefore called, in the future agendas of their research, for more in-depth investigation of the rating process (e.g., Weigle, 1998; Eckes, 2005).
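To make the contrast concrete, the MFRM expresses rater severity as an explicit model parameter rather than leaving it buried in a correlation coefficient. A minimal sketch of one common formulation, offered here for illustration and not drawn from the studies cited above, is:

\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \beta_i - \alpha_j - \tau_k
\]

where θ_n denotes the ability of candidate n, β_i the difficulty of task or criterion i, α_j the severity of rater j, and τ_k the difficulty of moving from score category k-1 to category k on the rating scale; P_nijk is the probability of candidate n being awarded category k rather than k-1 by rater j on task i. Differences among the estimated α_j thus index how harshly or leniently individual raters score, whereas CTT correlation coefficients summarize only the relative agreement among raters.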
The other line of research is therefore devoted to exploring how raters arrive at their scoring decisions, with the aim of uncovering the underlying reasons for individual scoring judgments among raters. Some studies have drawn upon indirect evidence to infer which writing features raters attend to (Cumming et al., 2006; Eckes, 2005; Homburg, 1984; Jenkins & Parra, 2003; Laufer & Nation, 1995). These endeavors help to identify salient features of candidate performance that influence raters' decision-making and therefore provide useful information for the validation or development of rating scales. Other studies have employed introspective or retrospective verbal protocols as direct evidence of raters' thought processes in decision-making. Some of these studies have focused on describing the similarities and differences in raters' rating foci and reading styles, as well as the strategies they employ to acquire and process information (e.g., Vaughan, 1991; Milanovic et al., 1996; Sakyi, 2003; Cumming et al., 2002, 2003; Lumley, 2002, 2005). Given the exploratory nature of these studies, their findings are mixed, owing to the specific assessment contexts and rater groups investigated and, more importantly, to the complex nature of the rating process.
Though small in number, these studies have revealed important sources of variance among raters with different personal, cultural, and professional backgrounds. At the same time, they still fall short of providing a comprehensive account of the mechanism by which these factors lead to rater variability in both scoring outcomes and scoring processes. A mixed-methods approach that examines both raters' rating outcomes and their rating processes can therefore enhance understanding of raters' scoring judgments and explore how those judgments relate to the factors influencing their ratings.
1.1.2 Pre-service EFL teachers and writing assessment
1.1.2.1 Writing proficiency of pre-service EFL teachers
Language proficiency has long been recognized as one of the most essential characteristics of a good language teacher (Lange, 1990). This recognition has given rise to concerns about language teachers' proficiency, particularly for English-as-a-foreign-language (EFL) teachers (Arva & Medgyes, 2000; Coniam & Falvey, 1996, 2000, 2001, 2007; Elder, 2001). Nowadays, EFL teachers are faced with the challenge of achieving a level of target language proficiency appropriate for delivering effective instruction and carrying out their professional activities. In the context of Asia-Pacific countries, Nunan (2003) suggests that the English language proficiency of many teachers is not sufficient to provide learners with the rich input needed for successful foreign language acquisition. Apart from the important role that teachers' language proficiency plays for students, it has been argued that this proficiency is too often overlooked (Johnson, 1990; Richards, 1998).
Of the four macro-skills related to language proficiency, it has been argued that learning to write in a second language (L2) is far more challenging than learning to listen to, speak, or read a foreign language (Bell & Burnaby, 1984; Bialystok, 1978; Nunan, 1989). Writing requires coordinating a complex and multifaceted set of skills, and learning these skills requires careful instruction and guidance from teachers who are competent and confident in their own writing ability (Ochsner & Fowler, 2004). In a writing class, teachers have to respond meaningfully to and critically evaluate students' written work, whether produced under traditional writing tests and scored on some sort of numerical scale (Hamp-Lyons, 1991) or generated through more informal assessment activities such as portfolios or take-home writing assignments. Teachers' capabilities for evaluating writing and their competence in providing feedback are thus closely tied to their ability to judge varying levels of writing quality and to use these judgments to give their students diagnostic feedback (Dappen, Isernhagen & Anderson, 2008).
However, teachers' writing performance is far from satisfactory. A comparison of the results of teachers taking the Language Proficiency Assessment for Teachers of English (LPAT) across its different papers from 2001 to 2011 shows that scores on the Writing paper were the weakest (Education Bureau, 2011; Lin, 2007). Current research on the writing proficiency of teacher candidates is scant at best, and no data on the writing proficiency of pre-service EFL teachers are available. In addition, compared with the bulk of assessment studies focusing on students, relatively few studies have empirically examined the evaluation criteria and assessment practices of EFL teachers in the classroom context (Xu & Liu, 2009).
1.1.2.2 Pre-service EFL teachers as raters
Language teachers, novice or experienced, are usually involved as raters in various language assessments. In the context of large-scale writing assessments, a series of studies has explored the influence of rater background on rating; in most cases, however, only differences between the judgments and behaviors of expert/novice or experienced/less-experienced raters have been examined (Cumming, 1990; Huot, 1993; Wolfe & Kao, 1996). The influence of raters' language proficiency has emerged most evidently in studies contrasting native-speaker (NS) and nonnative-speaker (NNS) raters of EFL writing. This body of research, however, has yielded ambiguous and inconclusive findings (e.g., Brown, 1995; Connor-Linton, 1995; Fayer & Krasinski, 1987; O’Loughlin, 1994; Santos, 1988; Shi, 2001).
In the context of classroom assessment, assessing student performance is one of the most critical aspects of a teacher's job. Research shows that teachers can spend as much as one third to one half of their professional time on assessment or assessment-related activities (Cheng, 2001). Studies of EFL teachers' assessments of student writing have compared them with the assessments of native English-speaking teachers (Connor-Linton, 1995b; Hamp-Lyons & Zhang, 2001; Kobayashi, 1992; Kobayashi & Rinnert, 1996; Santos, 1988). Some studies have focused on English teachers at the tertiary level assessing heterogeneous groups of EFL students (Cumming, 1990; Brown, 1991; Hamp-Lyons, 1989; Santos, 1988; Vaughan, 1991) or homogeneous groups of students (Hamp-Lyons & Zhang, 2001; Kobayashi, 1992; Connor-Linton, 1995; Kobayashi & Rinnert, 1996). However, the writing proficiency of EFL teachers has not been investigated as a factor influencing their judgments of student writing. Furthermore, studies investigating the role of teachers' writing proficiency in writing assessment have yet to be conducted in the Chinese context, which is arguably an influential one given its variety of English and its large population of learners (Berns, 2005).
Typically, in Mainland China, many EFL teachers, in-service or pre-service, perceive the processes of writing assessment as vague and beyond their control, which affects their writing assessment practices (Sheng, 2009). Pre-service teacher education and teacher professional development programs have provided little systematic experience to prepare EFL teachers for writing assessment (Xu & Liu, 2009). A common result is that pre-service EFL teachers enter the profession without any formal training in assessing student writing, leaving it unclear whether they will be able to provide quality assessment in the future. Various factors might influence their judgments of student writing; for example, their own experiences with writing might be uneven or even negative, resulting in ineffective writing assessment (Bruning & Horn, 2000). The need therefore exists to examine how pre-service EFL teachers make judgments about student writing. Though most pre-service EFL teachers, through their own writing and their reading of others' writing, might be capable of making general judgments about writing, assessing and analyzing student writing at a micro level is a complex and challenging task for them. The explicit connection between pre-service EFL teachers' writing proficiency and their classroom assessment practices needs to be established.
To this end, the current book seeks to explore the relationship between the writing proficiency of pre-service EFL teachers and their judgments of student writing. Furthermore, no previous study has attempted to employ a mixed-methods approach to investigate the relationship between raters' writing proficiency and their assignment of scores.