A scoring method is fair if it does not, in effect, penalize or privilege any one class of people.
It is reliable if its outcome is repeatable, even when irrelevant external factors are altered.
Before computers entered the picture, high-stakes essays were typically given scores by two trained human raters.
If the scores differed by more than one point, a third, more experienced rater would settle the disagreement.
In this system, there is a straightforward way to measure reliability: inter-rater agreement.
If raters do not consistently agree within one point, their training may be at fault.
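For illustration only, the within-one-point agreement check and the third-rater trigger described above can be sketched in a few lines of Python; the 1-6 scale and the sample score pairs are assumptions for the example, not drawn from any particular scoring program.

```python
from typing import List, Tuple

def adjacent_agreement(scores: List[Tuple[int, int]]) -> float:
    """Fraction of essays on which the two raters agree within one point."""
    if not scores:
        return 0.0
    within_one = sum(1 for r1, r2 in scores if abs(r1 - r2) <= 1)
    return within_one / len(scores)

def needs_third_rater(r1: int, r2: int) -> bool:
    """A third, more experienced rater adjudicates when the two scores
    differ by more than one point."""
    return abs(r1 - r2) > 1

# Hypothetical paired scores on an assumed 1-6 scale from two trained raters.
pairs = [(4, 4), (3, 4), (5, 3), (2, 2), (6, 4)]
print(f"adjacent agreement: {adjacent_agreement(pairs):.2f}")  # 0.60
print([needs_third_rater(a, b) for a, b in pairs])  # [False, False, True, False, True]
```

Low adjacent agreement across many essays would suggest a problem with rater training rather than with any single essay.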
Rising education costs have led to pressure to hold the educational system accountable for results by imposing standards.
The advance of information technology promises to measure educational achievement at reduced cost.
By 1990, desktop computers had become so powerful and so widespread that AES was a practical possibility.
Eventually, Page sold PEG (Project Essay Grade) to Measurement Incorporated.
Although the investigators reported that the automated essay scoring was as reliable as human scoring, the study has drawn several major criticisms.
Five of the eight datasets consisted of paragraphs rather than essays, and four of the eight were graded by human readers for content only rather than for writing ability.
Furthermore, rather than measuring human readers and the AES machines against the "true score" (the average of the two readers' scores), the study employed an artificial construct, the "resolved score", which in four datasets consisted of the higher of the two human scores in cases of disagreement.
This last practice, in particular, gave the machines an unfair advantage by allowing them to round up for these datasets.
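A small sketch, using the two definitions above with illustrative scores, makes the advantage concrete: on a disagreement the "true score" averages the two readers, while the "resolved score" takes the higher value, so a machine evaluated against the resolved score is rewarded for rounding up.

```python
def true_score(r1: int, r2: int) -> float:
    """'True score': the average of the two human readers' scores."""
    return (r1 + r2) / 2

def resolved_score(r1: int, r2: int) -> int:
    """'Resolved score' as defined in four of the datasets: the higher
    of the two human scores when the readers disagree."""
    return max(r1, r2)

# Illustrative disagreement: readers give 3 and 4.
r1, r2 = 3, 4
print(true_score(r1, r2))      # 3.5 -- the benchmark the critics argue for
print(resolved_score(r1, r2))  # 4   -- the benchmark the study used
```

On every disagreement the resolved score sits at or above the true score, so a machine that systematically predicts the higher of two plausible values appears more accurate than it would against the average.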