2.3 Measuring the construct and constructing the measures
The previous section presented a list of discourse features identified from the standards/requirements on language teachers' writing proficiency (Table 2-3). In this section, the discourse-analytic approach is briefly discussed. The theoretical foundations underlying the selected discourse features are then considered, followed by the discourse-analytic measures identified in empirical studies of L2 writing to operationalize these features. Based on a comprehensive literature review, a list of measures for the writing features is identified for the subsequent empirical analysis. It is hoped that the discourse-analytic measures identified can represent the categories of discourse features selected, which are considered integral to writing in academic contexts and important in the writing assessment of language teachers.
2.3.1 Discourse features as evidence of writing proficiency
Complementary to the research on rater perceptions and judgment criteria, an objective analysis of discourse features justifies the constructs underlying raters' scoring of L2 writing (Homburg, 1984). The discourse-analytic approach focuses on investigating the discourse features inherent in the written performances of candidates at different performance levels. Such an approach can "examine the linguistic features of written responses at each level to justify and complement the constructs underlying raters' scoring."[2]
A considerable number of empirical studies have been conducted to document a range of discourse features, such as the morphological, syntactic and rhetorical aspects of L2 writing at different levels of writing proficiency (e.g., Cumming et al., 2006; Kennedy & Thorp, 1999). A number of meta-analyses and replication studies have also been conducted in this regard (e.g., Ortega, 2003; Polio, 1997; Wolfe-Quintero et al., 1998).
In the setting of large-scale language assessments worldwide, both TOEFL and IELTS researchers have undertaken a series of empirical studies employing the discourse-analytic approach to anchor test scores and band levels against the performances of L2 writers. These studies have aimed to inform rating scale descriptors as part of a broader, multi-method approach to rating scale validation (e.g., Cumming et al., 2001, 2002, 2006). Methodologically, these studies have usually adopted quantitative or mixed-methods designs to examine features of language use, thus providing part of the validity evidence for score interpretation and use. Of relevance to the current study, Cumming et al. (2006) examined the discourse features in test-taker performances on the integrated writing tasks of the new TOEFL.
It can be concluded that the discourse-analytic approach to L2 writing is usually based upon quantifiable features of written discourse. The scores of writing performance can be verified empirically through the analysis of discourse features that differentiate proficiency levels.[3]
2.3.2 Operationalization and measures of discourse features
In this section, the various discourse-analytic measures employed in previous empirical studies to operationalize the discourse features selected in the previous section are reviewed and discussed. Among them, measures that have produced reliable results and that have clear theoretical justifications and operational definitions are identified for the further analysis of the writing performance of pre-service EFL teachers.
2.3.2.1 Lexical Resources
Lexical Resources in the current study refers to the range and sophistication of a writer's lexicon. This section reviews measures of Lexical Resources, which are grouped into three sub-categories: word-based measures, ratio-based measures, and corpus- and list-based measures.
Word-based measures. Word tokens and word types are regarded as the two most prevalent text indices of lexical range. Though the number of word tokens has proved to be a straightforward and effective measure of writing quality, it does not take account of the quality of a writer's production, being a purely quantitative measure. Inconsistent results have been reported in the literature for this measure. Some studies have confirmed a significant correlation between word tokens and proficiency level, as reported by Wolfe-Quintero et al. (1998), while others have produced contradictory results (e.g., Raimes, 1985). More recently, Cumming et al. (2005) also failed to identify a significant difference in word tokens between the two higher levels of TOEFL essays.
In addition to word tokens, the number of word types has been one of the most prevalent indices for measuring lexical production. The number of word types reflects both the quantity and the quality of the lexis, as the use of a large number of repeated words certainly does not indicate lexical range.
In a more recent study, Cumming et al. (2005) employed average word length as an indicator of lexical complexity to examine candidate performance on the new TOEFL writing tasks. As one of the differences between written and spoken language is that written English generally employs words that are considerably longer than those used in spoken English, it is assumed that average word length can be an effective measure of lexical sophistication. Average word length has proved effective in a number of previous empirical studies in differentiating between candidates at different proficiency levels (e.g., Engber, 1995; Grant & Ginther, 2000).
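To make the three word-based measures concrete, a minimal Python sketch is given below. The regex tokenizer and the decision to lowercase before counting types are assumptions made for illustration and would need to match the tokenization conventions actually adopted in the study.

import re

def word_based_measures(text):
    """Compute word tokens, word types, and average word length.

    Assumes a simple regex tokenizer; punctuation is excluded and
    case is ignored when counting types.
    """
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    word_tokens = len(tokens)                      # total number of words
    word_types = len(set(tokens))                  # number of different words
    avg_len = (sum(len(t) for t in tokens) / word_tokens
               if word_tokens else 0.0)            # mean characters per word
    return word_tokens, word_types, avg_len

tokens, types_, avg_len = word_based_measures("The cat sat on the mat.")
print(tokens, types_, round(avg_len, 2))  # 6 5 2.83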
Ratio-based measures. Among all the ratio-based measures of Lexical Resources, the type/token ratio (TTR), that is, the number of different lexical items divided by the total number of words, is the most popular among researchers. One criticism of the TTR, however, is that longer texts automatically obtain lower TTRs, because the chance of a new word type occurring decreases as text length increases, since any writer has only a limited number of different words at his/her disposal (Richards, 1987).
To remedy the effect of text length on the TTR, numerous alternative measures have been proposed, including the Guiraud Index (Guiraud, 1960), the vocd D measure (Malvern et al., 2004), and the Measure of Textual Lexical Diversity (MTLD; McCarthy & Jarvis, 2010). The Guiraud Index is defined as the number of word types divided by the square root of the number of word tokens (Guiraud, 1960). Though findings have shown the Guiraud Index to have advantages over the TTR (Broeder et al., 1992), researchers maintain that the Guiraud Index makes no distinction between lexical items, some of which may be qualitatively worthier than others (Daller et al., 2003). It remains unclear whether the Guiraud Index outperforms the TTR as a measure of lexical complexity in the EFL writing context.
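The two ratio-based measures defined above can be computed directly from a token list, as in the following sketch (assuming the same simple tokenization as above).

import math

def ttr(tokens):
    """Type/token ratio: different word types over total word tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def guiraud_index(tokens):
    """Guiraud Index: word types divided by the square root of word tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens)) if tokens else 0.0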
Unlike other measures of lexical diversity, MTLD does not vary as a function of text length for text segments in the 100-2,000 word range. It therefore allows comparisons between text segments of widely different lengths (up to 2,000 words) and has proven effective for both spoken and written texts. Additionally, it produces reliable results over a wide range of genres while correlating strongly with other lexical measures (McCarthy, 2005). MTLD is thus able to examine differences in lexical diversity between texts even when those texts differ considerably in length.
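As a rough illustration of how MTLD operates, the sketch below implements a simplified, one-directional version of the procedure described by McCarthy and Jarvis (2010). The published measure averages a forward and a backward pass, so this is an approximation rather than a reference implementation.

def mtld_forward(tokens, threshold=0.72):
    """Simplified one-directional MTLD (after McCarthy & Jarvis, 2010).

    The text is read word by word; whenever the running TTR falls to the
    0.72 threshold, one 'factor' is counted and the count restarts. Any
    leftover segment contributes a partial factor, and MTLD is the number
    of tokens divided by the total factor count.
    """
    factors = 0.0
    types_seen = set()
    token_count = 0
    for tok in tokens:
        token_count += 1
        types_seen.add(tok)
        running_ttr = len(types_seen) / token_count
        if running_ttr <= threshold:
            factors += 1.0
            types_seen.clear()
            token_count = 0
    if token_count > 0:  # partial factor for the leftover segment
        running_ttr = len(types_seen) / token_count
        factors += (1.0 - running_ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors > 0 else 0.0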
To transform the traditional TTR, the vocd D measure was developed by McKee, Malvern, and Richards (2000) to measure the lexical range of written text. It is represented by the mathematical expression below, where N is the number of tokens and D a parameter:
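One common formulation of this model (following Malvern et al., 2004) expresses the expected TTR for a sample of N tokens as

\[
\mathrm{TTR} = \frac{D}{N}\left[\left(1 + 2\frac{N}{D}\right)^{1/2} - 1\right]
\]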
A higher D indicates a greater lexical complexity of a written text, and it can be estimated by a computer program called vocd. Jarvis (2002) compared the reliability of several algebraic indices and suggested that the D measure was more accurate and consistent in modeling lexical sophistication than other indices. The methodological advantage of using D is that "it allows valid comparisons between varying quantities of linguistic data and is more informative because it is representative of the whole of the type-token ratio versus token size curve rather than just a single point on it."[4]
Corpus- and list-based measures. Measures in this category are usually calculated by identifying the lexical words in a written sample that are not on a list of basic words, or that are on a specific sophisticated word list, as with the Academic Word List (Coxhead, 2000) and the Lexical Frequency Profile (LFP; Laufer & Nation, 1995). What makes these measures different from the TTR and its modified variants is that they make a distinction regarding the quality, or the depth, of the vocabulary contained in texts, though the numbers they provide are still susceptible to text length.
Another measure regularly used is CELEX word frequency, based on the CELEX database developed by Baayen, Piepenbrock, and van Rijn (1993), which measures the frequency of specific words in the written text. CELEX word frequency has been employed in a number of recent studies examining the relationship between lexical features and writing proficiency (e.g., Crossley et al., 2009; McNamara et al., 2010). The findings showed that writers at lower levels of proficiency tend to rely on more frequent words than those at higher levels of proficiency, since high-frequency words are easier to retrieve than low-frequency words.
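A list-based measure such as the proportion of Academic Word List items can be sketched as follows. The file name and the simple exact-match lookup are assumptions made for illustration; in practice the LFP matches word families rather than surface forms, so lemmatization would be needed before lookup.

def awl_proportion(tokens, awl_path="awl_headwords.txt"):
    """Proportion of word tokens appearing on the Academic Word List.

    Assumes a plain-text file with one AWL headword per line (hypothetical
    path); exact string matching is used here purely for illustration.
    """
    with open(awl_path, encoding="utf-8") as f:
        awl = {line.strip().lower() for line in f if line.strip()}
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in awl)
    return hits / len(tokens)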
2.3.2.2 Grammatical complexity
Grammatical complexity, referred to as Grammatical Range in the current study, is defined as "the range of forms that surface in language production and the degree of sophistication of such forms."[5] A grammatically complex writer employs a wide range of both basic and complex structures, while a grammatically simple writer uses only a narrow range of basic structures (Wolfe-Quintero et al., 1998). Two measures that have been shown to distinguish significantly between proficiency levels are the t-unit complexity ratio and the dependent clause ratio (Wolfe-Quintero et al., 1998).
The t-unit complexity ratio is calculated as the number of clauses divided by the total number of t-units. Following Hunt (1965), a t-unit is defined as a "minimal terminable unit…minimal as to length, and each would be grammatically capable of being terminated with a capital letter [at one end] and a period [at the other]."[6] A t-unit complexity ratio of two therefore means that each t-unit consists of one independent clause plus one other clause (Cumming et al., 2005). However, in L2 writing studies there have been mixed results regarding the relationship between the t-unit complexity ratio and proficiency levels. Wolfe-Quintero, Inagaki and Kim (1998) regarded the t-unit complexity ratio as one of the best measures of complexity, as there was a positive linear correlation between the t-unit complexity ratio and proficiency level. However, some researchers have found no significant results (Banerjee et al., 2004; Cumming et al., 2005). According to Wolfe-Quintero et al. (1998), the t-unit complexity ratio is most closely related to holistic ratings, which are adopted in the current study. Therefore, the t-unit complexity ratio will be employed to measure grammatical complexity.
The second measure identified is the dependent clause ratio (the number of dependent clauses per total number of clauses or t-units), which examines the degree of embedding in a text. Studies using the dependent clause ratio have successfully investigated the relationship between this measure and holistic ratings (Homburg, 1984; Vann, 1989).
Compared with the above-mentioned measures, some researchers feel safer using average sentence length to operationalize grammatical complexity (Szmrecsanyi, 2004). Some have gone further, providing evidence that average sentence length is predictive of the quality of student essays (Reid & Findlay, 1986). However, other studies have found no significant correlation between average sentence length and essay scores (e.g., Carlson, 1988).
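Because these complexity indices are ratios of manually coded counts, their computation is straightforward once t-units and clauses have been segmented. The sketch below assumes such counts are already available from hand-coding.

def complexity_ratios(num_clauses, num_t_units, num_dependent_clauses,
                      num_sentences, num_words):
    """Ratio measures of grammatical complexity from coded counts.

    Assumes t-units, clauses, and dependent clauses have already been
    identified (e.g., by hand-coding following Hunt, 1965, and Polio, 1997).
    """
    t_unit_complexity = num_clauses / num_t_units                 # clauses per t-unit
    dependent_clause_ratio = num_dependent_clauses / num_clauses  # degree of embedding
    avg_sentence_length = num_words / num_sentences               # words per sentence
    return t_unit_complexity, dependent_clause_ratio, avg_sentence_length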
2.3.2.3 Grammatical accuracy
In judging the grammatical accuracy of written texts, some studies have employed a holistic rating scale (e.g., Hamp-Lyons & Henning, 1991), while others have relied on objective measures. A number of discourse-analytic measures have been developed to analyze the grammatical accuracy of written discourse (see Polio, 1997), and in these the number of errors has been counted in different ways. Two approaches to this process have been developed by researchers. The first focuses on whether a structural unit is error-free. Two typical measures, the error-free t-unit ratio and the error-free clause ratio, have been employed by quite a few studies, albeit with contradictory findings.
In the current study, a clause is defined as a group of words containing a subject and a verb that form part of a sentence (Cumming et al., 2006). The current study follows several guidelines for counting the number of t-units and clauses (Polio, 1997). For example: (1) A t-unit refers to an independent clause with all its dependent clauses (Cumming et al., 2006); the sentence "My hometown, where I've lived for eighteen years, is in China." is therefore counted as one t-unit and two clauses. (2) When there is grammatical subject deletion in a coordinate clause, the entire sentence is counted as one t-unit, but each clause is counted separately.[7] Based on this guideline, the sentence "I left the school and ran away" is counted as one t-unit and two clauses.
The error-free t-unit ratio refers to the number of error-free t-units per total number of t-units. Researchers have adopted the error-free t-unit ratio to identify the relationship between this measure and proficiency level. However, no consensus has been reached. Several studies have found a relationship between the error-free t-unit ratio and writing proficiency measured by program level (e.g., Hirano, 1991; Tedick, 1990), standardized test scores (e.g., Hirano, 1991), and holistic scores (e.g., Homburg, 1984; Perkins, 1980). There are, however, researchers who have identified no relationship between the error-free t-unit ratio and student grades (e.g., Kawata, 1992). Studies employing the error-free clause ratio show similar findings to those adopting the error-free t-unit ratio. Bardovi-Harlig and Bofman (1989) have criticized error-free measures of accuracy for not disclosing the types of errors involved, as some errors might impede communication more than others. Moreover, raters have to agree on operational definitions of errors during rating.
Another group of measures has therefore been proposed to address the above-mentioned criticism. Among these, the error-free clause ratio, a variant of the error-free t-unit ratio, has been deemed effective in differentiating among proficiency levels. However, only one researcher (Ishikawa, 1995) chose this measure, in order to examine beginning-level learners, who are more likely to produce error-free clauses than error-free t-units. Since the participants in the current study are college-level writers, a measure designed for beginning-level learners is not appropriate in this instance, and it was therefore excluded from the analysis. The second measure is errors per t-unit, which has also proved to be related to program level, holistic ratings and standardized test scores (e.g., Flahive & Gerlach Snow, 1980; Perkins, 1980). Therefore, two measures of grammatical accuracy, the error-free t-unit ratio and errors per t-unit, will be employed in the further analysis of the current study.
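The two accuracy measures retained for the study reduce to simple ratios over coded counts, as sketched below (again assuming that t-units and errors have already been identified under agreed operational definitions).

def accuracy_measures(num_t_units, num_error_free_t_units, num_errors):
    """Accuracy ratios from hand-coded t-unit and error counts.

    Assumes t-units have been segmented and errors tallied according to
    shared operational definitions (Polio, 1997).
    """
    error_free_t_unit_ratio = num_error_free_t_units / num_t_units
    errors_per_t_unit = num_errors / num_t_units
    return error_free_t_unit_ratio, errors_per_t_unit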
2.3.2.4 Cohesion and Coherence
For the current study, coherence refers to the "linking of ideas through logical sequencing," while cohesion refers to "the varied and apposite use of cohesive devices to assist in making the conceptual and referential relationships between and within sentences clear."[8]
Measuring cohesion. Cohesion occurs when the semantic relations between linguistic elements in the discourse depend on one another.[9] For analyzing cohesion, Halliday and Hasan (1976) proposed their taxonomy of cohesion and a framework for analysis (see Table 2-4), which has been widely employed to examine the discourse organization of texts. Two types of cohesion are identified in the model. The first is grammatical cohesion, which comprises the surface semantic links between clauses and sentences in written discourse. Grammatical cohesion is further divided into four categories: reference, conjunction, substitution, and ellipsis, among which substitution and ellipsis are more a relation at the lexico-grammatical level and more frequent in spoken texts. The second major group of cohesive relations is lexical cohesion, which refers to related vocabulary occurring across clause and sentence boundaries in written texts; it is produced through the use of repetition and collocation. These grammatical and lexical means of creating cohesion are referred to as "cohesive devices," which have been applied in a number of studies investigating the writing features that could differentiate essays of different quality (e.g., Connor, 1984; Johnson, 1992; Witte & Faigley, 1981), especially essays produced by native and non-native speakers (e.g., Field & Yip, 1992; Reid, 1992).
At the same time, Halliday and Hasan (1976) acknowledged that lexical and grammatical cohesion cannot be clearly differentiated in terms of definition and usage. The findings in the literature are therefore somewhat mixed on the relationship between cohesive devices and writing proficiency.
Measuring coherence. Several different approaches to measuring coherence have been proposed in the literature, of which three will be discussed in this section: meta-discourse markers, topic structure analysis, and Latent Semantic Analysis.
First, meta-discourse is defined as "the writers' discourse about discourse, their directions of how readers should read, react to, and evaluate what they have written about the subject matter."[10] Meta-discourse is primarily regarded as reader-oriented guidance from the writer to help the reader better understand the discourse and the writer's position. Vande Kopple (1980) contributed to the study of meta-discourse by distinguishing seven kinds of meta-discourse within two categories: textual meta-discourse and interpersonal meta-discourse. Based on Vande Kopple's (1980) classification, Crismore et al. revised the seven sub-categories into twelve categories while retaining the two main categories. Their classification has been widely employed to investigate the difference between good and poor L2 essays (e.g., Intaraprawat & Steffensen, 1995).
Second, topic structure analysis is an approach to analyzing coherence, first developed by Lautamatti (1987) in the context of text readability to analyze text coherence from the pattern of topic-comment progression. Three types of thematic progression have been proposed to establish local coherence: sequential progression, where the rheme of one sentence becomes the theme of the next (a→b, b→c, c→d); parallel progression, where the theme of one clause becomes the theme of the next or subsequent clauses (a→b, a→c, a→d); and extended parallel progression, in which the first and the last topics of a piece of text are the same but are interrupted by some sequential progression (a→b, b→c, a→d) (Hoenisch, 1996). A number of studies have drawn on this approach to compare argumentative or persuasive writing scripts from different writer groups, with similar findings (e.g., Connor & Farmer, 1990; Witte, 1983).
However, these two approaches do not take into account all features of coherence, such as the overall organization of the writing. In addition, researchers have conducted studies on the rhetoric of the text to identify certain text types that help readers interpret particular texts (Paltridge, 2001). For example, the three characteristic stages of the essay structure (Introduction-Body-Conclusion) are deeply embedded in academic English writing curricula, especially for non-native speakers, such as EFL learners in China (Mickan & Slater, 2003).
Third, Latent Semantic Analysis (LSA) has often been used to measure the amount of textual coherence and to predict the effect of text coherence on comprehension. In terms of methodology, LSA differs in that it focuses on the content and knowledge conveyed in the essay rather than on its style, syntax, or argument structure. The meaning of a word is assessed based on its relations with all the other words (Landauer & Psotka, 2001). One major methodological advantage of LSA is that it uses both relative and absolute scoring methods, so that an essay can be compared to other essays in the same sample or to outside source materials (Chung & O'Neil, 1997). However, it does not take into account syntactic information such as word order, syntactic relations or logic; it can be tricked, in that the matrix arrangement of information makes every possible combination of words in a sentence equivalent (Weigle, 2002). LSA has also been criticized because it assesses neither the structure of the responses nor vocabulary (Landauer et al., 2003). For the present study, LSA is used to compute the similarity between two sentences, or between the entire text and a sentence, to measure the coherence of the text.
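A much-simplified sketch of LSA-style sentence similarity is shown below. Genuine LSA derives its semantic space from a large reference corpus via singular value decomposition, whereas this illustration builds a small space from the essay's own sentences, so the resulting values are indicative only.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def adjacent_sentence_similarity(sentences, n_dims=2):
    """Mean cosine similarity between adjacent sentences in a reduced space.

    Builds a term-sentence matrix, reduces it with truncated SVD, and
    averages the cosine similarity of neighbouring sentence vectors.
    """
    counts = CountVectorizer().fit_transform(sentences)
    dims = max(1, min(n_dims, counts.shape[1] - 1))
    vectors = TruncatedSVD(n_components=dims).fit_transform(counts)
    sims = []
    for a, b in zip(vectors, vectors[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(float(a @ b) / denom if denom else 0.0)
    return sum(sims) / len(sims) if sims else 0.0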
2.3.2.5 Content
Though raters have found content to be important (Vaughan, 1991; Lumley, 2002; Huot, 1990), assessing and measuring it has proved to be a great challenge. Raters have often complained that "the exact nature of the construct they assess remains uncertain."[11] In addition, raters respond to and interpret rating scales differently; as Erdosy (2004) states, "constructs such as content and organization have as many manifestations as there are raters."[12] Despite this, there are still some common features that raters focus on when rating the content of writing.
In the context of large-scale L2 writing tests, content has usually been conceptualized along two dimensions: topic relevance and topic development (Liu, Mak & Jin, 2012). Specifically, the two aspects usually involve a logical connection between topic and statement/argument, and the inclusion of supporting details to develop ideas (Liu, Mak & Jin, 2012). For example, in the IELTS writing rating scales, content is assessed in terms of arguments, ideas and evidence, using phrases such as "addresses the task" and "fully developed position…with relevant, fully extended and well supported ideas" (Cambridge ESOL, 2011).
As Cumming et al. (2000) noted, content is one of the two major dimensions of writing identified in the evaluative criteria that raters considered the best indicators of writing quality. It includes aspects of "organization, coherence, and progression and development, specificity and quality of information and ideas."[13] A similar study was conducted by Erdosy (2004), who gathered think-aloud protocols from four experienced raters to compare the ways they evaluated a written text using a holistic scale. Raters in this study attended to various aspects of content, such as the development of ideas, argumentation, reasoning, logic and topic development. Other studies have reported similar responses from raters or teachers (Freedman & Calfee, 1983; Lumley, 2002; Vaughan, 1991).
Judging content has always been susceptible to raters' personal judgments, and few studies have investigated objective measures of content. The current study therefore develops a hierarchical decision-tree approach to measuring the content of writing (see Liu, Mak & Jin, 2012), which incorporates both descriptive richness and simplicity of decision-making through a performance data-based approach (Fulcher, Davidson & Kemp, 2011). Content was operationalized in the current study as topic relevance (related ideas; requirements/expectations of the task) and topic development (hierarchy of ideas with facts and supporting information; introductory framing and concluding phrases; paragraphing), indicating the degree to which a candidate conveys relevant and well-elaborated/developed ideas on given topics. It seems that there may be a fundamental connection between organization and content in some rating scales, which may prevent the two features from ever being completely distinct from each other. There has, however, been no particular discussion in the literature of the possible overlap between these two features (e.g., Connor & Carrell, 1993; Vaughan, 1991).
The hierarchical decision-tree approach is a scoring system for argumentative/persuasive writing. The integration of evidence into written arguments to gain credibility for the claims advanced is a type of persuasive appeal and a necessary requisite for successful academic performance in tertiary contexts.[14] This approach is therefore developed on the basis of Toulmin's (1958) schematic structure for argument analysis (see Figure 2-1). Previous studies have indicated the efficacy of Toulmin's model in investigating the nature of persuasive writing. The modified argument model employed by the current study is depicted in Figure 2-1 in the form of a graphic representation.
Toulmin's (1958) model consists of the following elements: (a) a Claim, which is a contentious assertion advanced in response to a problem; (b) Data, which constitute the evidence or grounds on which a claim is based; (c) a Warrant, which authorizes the link between data and claim; (d) Backing, or support for the warrant; (e) a Qualifier, which is a modal term indicating that the claim is a probable conclusion; and (f) a Reservation, which refers to conditions or circumstances under which the warrant will not hold and that can hence defeat the claim. Though Toulmin's model captured the essential characteristics of an argument that are common to everyday discourse, the current study employs his model with two modifications. First, Data or reasons offered in support of a Claim may be supported by specific examples or facts (Rieke & Sillars, 1975), in much the same way that Toulmin's Warrant may be supported by Backing; a component termed Supporting Detail is therefore added. Second, researchers have reported the presence of argument chains or embedded arguments (Thomas, 1986); the feature of argument chains is also considered in the current study. The hierarchical decision-tree approach employed by the current study is depicted in Figure 2-2 in the form of a graphic representation.
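For illustration, the modified Toulmin elements and the notion of embedded argument chains could be represented as a simple data structure such as the sketch below. The field names follow the model described above, but the nested representation of argument chains is an assumption for illustration, not the study's actual coding scheme.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Argument:
    """One argument coded with the modified Toulmin elements."""
    claim: str
    data: List[str] = field(default_factory=list)                 # evidence/grounds
    supporting_details: List[str] = field(default_factory=list)   # added component
    warrant: Optional[str] = None
    backing: Optional[str] = None
    qualifier: Optional[str] = None
    reservation: Optional[str] = None
    sub_arguments: List["Argument"] = field(default_factory=list) # argument chains

def chain_depth(arg: Argument) -> int:
    """Depth of embedded argument chains under a major argument."""
    if not arg.sub_arguments:
        return 1
    return 1 + max(chain_depth(sub) for sub in arg.sub_arguments)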
The rationale for devising the hierarchical decision-tree approach to scoring content can be explained from two perspectives. First, in the scoring model, the Major Argument and the Supporting Detail are the minimal requirements of the argument structure, while the remaining elements are optional or elaborative structures.[15] The argument structures in Figure 2-2 refer to the other elements in an argument, including the Reservation, Qualifier, and Rebuttal. Second, the ability to establish a focus of argument, by raising key points of argument that are closely related to the main topic of the writing task and by employing chains of embedded arguments to formulate appropriate depths of discussion in relation to these points, is a strong indicator of successful argumentation in student writing.
Figure 2-1 Toulmin's (1958) argument structure
In a pilot study employing this decision-tree approach to measuring content, the resulting scores were compared with scores awarded on the basis of a traditional five-point holistic rating scale (Liu, Mak & Jin, 2013). The results showed that, first, the proposed scoring approach could significantly differentiate candidate performance at three different levels of writing proficiency. Second, the proposed approach enjoyed relatively higher inter-rater reliability (r = 0.81, p < .01) than the traditional holistic scoring (r = 0.71, p < .01). In addition, the results of a paired-samples t-test confirmed the stability of the proposed approach. The hierarchical decision-tree approach to measuring content is therefore employed in the current study.