Volume 4, Issue 1: DECEMBER 2011
Conventional wisdom and common practice suggest that to preserve the independence of holistic judgments, they should precede analytic scoring. However, little is known about the effects of scoring order on the scores obtained, or about whether true holistic scoring is even possible for a scorer who has already been trained to provide analytic scores and will be asked to provide them as well. This research explores the independence of scores and the effects of scoring order upon those judgments. Our analysis shows statistically significant differences in mean scores under the two conditions (holistic scoring preceding analytic and the reverse), with the holistic scores more nearly replicating "pure" holistic scoring only when holistic scoring precedes analytic scoring. This research affirms that when readers will be asked to score both ways, holistic scoring should precede analytic scoring. It also offers insights into the cognitive processes scorers engage as they score holistically and analytically.
This essay provides an overview of the research and scholarship on reliability in college writing assessment from the author's perspective as a composition and rhetoric scholar. It argues for reframing reliability by drawing on traditions from the fields of college composition and educational measurement, with the goal of developing a more productive discussion about reliability as we work toward a unified field of writing assessment. In making this argument, the author uses the concept of framing to argue that writing assessment scholars should develop a shared understanding of reliability. The shared understanding begins with the values (such as accuracy, consistency, fairness, responsibility, and meaningfulness) that we have in common with others, including psychometricians and measurement specialists, instead of focusing on the methods. Traditionally, reliability has been framed by statistical methods and calculations associated with positivist science, although psychometric theory has moved beyond this perspective. Over time, the author argues, if we can shift the frame associated with reliability, we can develop methods to support assessments that lead to improvement of teaching and learning.
Validity Inquiry of Race and Shared Evaluation Practices in a Large-Scale, University-Wide Writing Portfolio Assessment
This article examines the intersections of students' race with the evaluation of their writing abilities in a locally-developed, context-rich, university-wide, junior-level writing portfolio assessment that relies on faculty articulation of standards and shared evaluation practices. This study employs sequential regression analysis to identify how faculty raters operationalize their definition of good writing within this university-wide writing portfolio assessment, and, in particular, whether students' race accounts for any of the variability in faculty's assessment of student writing. The findings suggest that there is a difference in student performance by race, but that student race does not contribute to faculty's assessment of students' writing in this setting. However, the findings also suggest that faculty employ a limited set of the criteria published by the writing assessment program, and faculty use non-programmatic criteria—including perceived demographic variables—in their operationalization of "good writing" in this writing portfolio assessment. This study provides a model for future validity inquiry of emerging context-rich writing assessment practices.
The present study identified the characteristics of seventh-grade writing produced in an on-demand state assessment situation. The subjects were 464 seventh graders in three middle schools in the southeastern United States. The research team included 12 English language arts teachers. The analysis yielded 32 prominent features, 22 positive and 10 negative. The features were correlated with state assessment scores, which ranged from 1 to 4. Of the 22 positive features, 14 correlated positively with the assessment scores. Of the 10 negative features, 8 correlated negatively with the assessment scores. The study also found 108 statistically significant (p < .001) intercorrelations among the features. From the features themselves, a formula was devised to create a prominent feature score for each paper, the scores ranging from 3 to 21. The prominent feature scores were also significantly correlated with assessment scores (r = .54). Whereas statewide assessment scoring assigns numerical values to student writing, prominent feature analysis or scoring derives numerical values from specific rhetorical features. These results may be helpful to classroom teachers in assessing and diagnosing student writing and to professionals who lead staff development programs for teachers.
Out of the Box: A Review of Ericsson and Haswell's (Eds.) Machine Scoring of Student Writing: Truth and Consequences