Volume 6, Issue 1: AUGUST 2013
Welcome to Volume 6 of the Journal of Writing Assessment. This is the third volume in the online, open-access format, and the second volume in which we are serving as editors. As we mentioned in the last volume, our policy is to publish articles as they complete the review process so that scholarly work appears as soon as it is accepted for publication. As a result, JWA doesn’t construct an issue in the same way a print journal does. Instead, the issue grows organically, and when we complete the calendar year we will provide some retrospective comments. All of the JWA archives are available for free. The archives include all of the print issues of JWA, which were made available through the generosity of Hampton Press.
Critique of Mark D. Shermis & Ben Hammer, “Contrasting State-of-the-Art Automated Scoring of Essays: Analysis”
Although the unpublished study by Shermis & Hammer (2012) received substantial publicity for its claim that automated essay scoring (AES) of student essays was as accurate as scoring by human readers, a close examination of the paper’s methodology demonstrates that the data and analytic procedures employed in the study do not support such a claim. The most notable shortcoming in the study is the absence of any articulated construct for writing, the variable that is being measured. Indeed, half of the writing samples used were not essays but short one-paragraph responses involving literary analysis or reading comprehension that were not evaluated on any construct involving writing. In addition, the study’s methodology employed one method for calculating the reliability of human readers and a different method for calculating the reliability of machines, a difference that artificially privileged the machines in half the writing samples. Moreover, many of the study’s conclusions were based on impressionistic and sometimes inaccurate comparisons drawn without the performance of any statistical tests. Finally, there was no standard testing of the model as a whole for significance, which, given the large number of comparisons, allowed machine variables to occasionally surpass human readers merely through random chance. These defects in methodology and reporting should prompt the authors to consider formally retracting the study. Furthermore, because of the widespread publicity surrounding this study and because its findings may be used by states and state consortia in implementing the Common Core State Standards, the authors should make the test data publicly available for analysis.
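The concern about the large number of comparisons can be made concrete with a short calculation. The following is a minimal sketch, not a reconstruction of the critique's analysis: it assumes that the comparisons are independent and that each uses a conventional threshold of 0.05, neither of which is stated in the abstract. The function name is hypothetical.

```python
def familywise_error_rate(num_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent comparisons.

    Assumption (not from the source): every comparison is independent and
    uses the same per-comparison threshold `alpha`.
    """
    if num_comparisons < 0:
        raise ValueError("num_comparisons must be non-negative")
    # Chance that no comparison is a false positive, subtracted from 1.
    return 1.0 - (1.0 - alpha) ** num_comparisons


if __name__ == "__main__":
    # Illustrative comparison counts only; the study's actual count is not given here.
    for m in (1, 10, 50):
        print(m, round(familywise_error_rate(m), 3))
```

Under these assumptions, the chance of at least one spurious "win" rises quickly as comparisons accumulate, which is why a test of the model as a whole, or a correction for multiple comparisons, is normally expected.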
This article explores the value of using social media and a community rubric to assess writing ability across genres, course sections, and classes. From Fall 2011 through Spring 2013, approximately 70 instructors each semester in the first-year composition program at the University of South Florida used one rubric to evaluate over 100,000 student essays. Between Fall 2012 and Spring 2013, students used the same rubric to conduct more than 20,000 peer reviews. The rubric was developed via a datagogical, crowdsourcing process (Moxley, 2008; Vieregge, Stedman, Mitchell, & Moxley, 2012). It was administered via My Reviewers, a web-based software tool designed to facilitate document review, peer review, and writing program assessment. This report explores what we have learned by comparing rubric scores on five measures (Focus, Organization, Evidence, Style, and Format) by project, section, semester, and course and by comparing independent evaluators’ scores with classroom teachers’ scores on two assignments for two semesters. Findings suggest that use of the rubric across genres, sections, and courses facilitates a high level of inter-rater reliability among instructors; illustrates ways a curriculum affects student success; measures the level of difficulty of specific writing projects for student cohorts; and provides a measure of transfer. WPAs and instructors may close the assessment loop by consulting learning analytics that reveal real-time, big-data patterns, which facilitate evidence-based curriculum decisions. While not an absolute measure of student learning or ability, these methods enable tentative mapping of students’ reasoning, research, and writing abilities.
Keywords: big data, writing assessment, social pedagogy, datagogies, transfer, curriculum standardization, peer production, communal agency
This article presents an investigation of the reliability of a rubric-based writing assessment system, the National Writing Project’s (NWP) Analytic Writing Continuum (AWC), which applies both holistic and analytic scoring. Data from double-scored student writing samples collected over several national scoring events (2005 to 2011) were used. First examined was the extent to which scorers trained to apply the AWC tended to agree with each other on the quality of various attributes of student writing (inter-rater agreement rates). Next considered were how consistently groups of scorers applied the standards of the AWC over multiple scoring events (cross-time reliability), and how consistently the attributes of the AWC collectively represented the construct of writing (internal consistency reliability). Finally, generalizability analyses were conducted to determine the degree to which the observed score variances were attributable to two sources of measurement error: scorers and scoring environment (grade group). Reliability analyses from consensus, consistency, and measurement approaches indicate that the AWC assessment system generates highly reliable scoring of both holistic and analytic components of writing. The AWC assessment system includes expert scorers, training procedures, and materials as essential components and serves purposes beyond assessment of writing. It provides a common framework for structuring professional development and coordinating research and evaluation programs, encouraging the growth of professional learning communities and improved understanding of the links between professional development, classroom practice, and student writing performance.
Keywords: writing assessment, scoring rubric, reliability, teacher professional development
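Two of the reliability notions named in the abstract above, inter-rater agreement and internal consistency, can be illustrated with a short sketch. The abstract does not say which statistics the authors computed, so the choices here (exact and adjacent agreement rates, and Cronbach's alpha) are assumptions, as are the function names and the made-up scores.

```python
from statistics import variance


def agreement_rate(scores_a, scores_b, tolerance=0):
    """Share of papers on which two scorers differ by at most `tolerance` points.

    tolerance=0 gives exact agreement; tolerance=1 gives adjacent agreement.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    hits = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b))
    return hits / len(scores_a)


def cronbach_alpha(attribute_scores):
    """Cronbach's alpha for a list of attributes, each a list of per-paper scores."""
    k = len(attribute_scores)
    if k < 2:
        raise ValueError("need at least two attributes")
    # Sum of the sample variances of the individual attributes.
    item_variance_sum = sum(variance(item) for item in attribute_scores)
    # Sample variance of each paper's total score across attributes.
    totals = [sum(paper) for paper in zip(*attribute_scores)]
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


if __name__ == "__main__":
    # Made-up scores for four papers, for illustration only.
    scorer_1 = [4, 3, 5, 2]
    scorer_2 = [4, 4, 5, 4]
    print(agreement_rate(scorer_1, scorer_2))               # exact agreement
    print(agreement_rate(scorer_1, scorer_2, tolerance=1))  # adjacent agreement
    print(cronbach_alpha([[4, 3, 5, 2], [4, 4, 5, 3], [3, 3, 5, 2]]))
```

Agreement rates address consensus between scorers, while alpha addresses whether the attributes move together across papers. The generalizability analyses mentioned in the abstract require variance-component estimation and are not sketched here.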
Using Appraisal Theory to Understand Rater Values: An Examination of Rater Comments on ESL Test Essays
This study illustrates the value of appraisal theory in studies of writing assessment. To demonstrate its functionality and value, appraisal theory is used to examine rater variability and scoring criteria through an analysis of the evaluative nature of rater comments. This allows an exploration of the values raters bring to the task of rating second language writing proficiency. The written comments of three raters scoring the same sixteen writing tests were analyzed through appraisal theory and correlated with each test score. The analysis of the comments suggests that textual features external to the scoring rubric influenced raters’ scoring decisions. The findings shed light on raters’ perception of the construct of “good writing” and show how raters bring their own interpretations to the rating task. The findings also suggest that there may be unidentified shared rater values, as was evidenced when all raters awarded the same score but disagreed on the quality of specific features of a text. These observations may have implications for rater monitoring, rater training, and scoring rubric revision. This study illustrates that appraisal theory may offer a systematic means to analyze rater comments as they relate to the rating process.
This study examined automated essay scoring for experimental tests of writing from sources. These tests (part of the CBAL research initiative at ETS) embed writing tasks within a scenario in which students read and respond to sources. Two large-scale pilots are reported: one, administered in 2009, piloted four writing assessments, and the other, administered in 2011, included two writing assessments and two reading assessments. Two different rubrics were applied by human raters to each prompt: a general rubric intended to measure only those skills for which automated essay scoring provides relatively direct measurement, and a genre-specific rubric focusing on specific skills such as argumentation and literary analysis. An automated scoring engine (e-rater®) was trained on part of the 2009 dataset and cross-validated against the remainder of the 2009 dataset and all the 2011 data. The results indicated that automated scoring can achieve operationally acceptable levels of accuracy in this context. However, differentiation between the general rubric and the genre-specific rubric reinforces the need to achieve full construct coverage by supplementing automated scoring with additional sources of evidence.