Technical Adequacy of Assessments

Technical Adequacy of the Reading Benchmark Assessment Passages

The following information provides evidence for the reliability and validity of the procedures employed in the current version of the Read Naturally benchmark assessment passages. All data were collected in public school classrooms. The reliability and validity coefficients reported in this section are based on words read correctly per minute on standard passages.

Learn more about Benchmark Assessor Live (Read Naturally’s reading benchmark assessment tool).

Methodology

In the first edition of Read Naturally’s reading benchmark assessment tool (Reading Fluency Monitor), benchmark passages were developed to match the readability of anchor passages that had been extensively field-tested for predictability of comprehension scores. The data for single-measure technical adequacy were collected primarily in Washington State from 1998 to 2000. The data and correlations from the initial anchor passages provided substantial evidence of both concurrent and predictive validity.

During the development of the second edition of Read Naturally’s reading benchmark assessment tool (available in Reading Fluency Benchmark Assessor and Benchmark Assessor Live), Read Naturally extensively field-tested new passages to ensure that the passages within each grade level were similar in difficulty. During field testing, students orally read the grade-level passages during one-minute timings. When the results indicated that a passage was not at the expected level, that passage was rewritten or replaced, and the new passage was retested for level of difficulty. The field-testing data indicated that, compared to the first edition, the new passages produced a narrower range of scores at the intended grade level across sets of three passages. Therefore, these new passages replaced a number of the first-edition benchmark passages. Multiple-measure reliability and validity studies were conducted during the 2002–2003 school year in Minnesota, California, Texas, Virginia, Michigan, Iowa, and Pennsylvania.

Reliability

Reliability is the degree to which scores produced under standard procedures will be dependably replicated on another occasion, with sets of similar passages, and with different test administrators. Evidence for the single-measure reliability of oral reading fluency measurement is provided in Table 1 and Table 2. The 121 correlations contributing to the database were meta-analyzed using the Comprehensive Meta-Analysis software. The results of this analysis are presented in two ways. Table 1 presents test-retest reliability coefficients for groups of children in grades one through seven, grouped into three time-span categories as follows:

  • Reliability: Test and retest were done within the same academic year.
  • Reliability-Delayed: Test and retest were done in contiguous years.
  • Reliability-Delayed 2: Test and retest were separated by more than one year.

The data in Table 1 show that across the 121 coefficients, the point estimate is .907 with a 95% confidence interval from .899 to .915. This is well within acceptable psychometric standards. The magnitude of the correlations is remarkably stable across time-span categories, although it is not surprising to note a slight inverse relationship between the size of the correlation and the length of time between test administrations.

Table 1: Meta-analysis of Oral Reading Fluency Reliability Coefficients
Time Span  Effect  Lower  Upper  N Total  P Value
Same Year (84)*  .915  .906  .923  12213  .000
Delayed: Consecutive Years (19)*  .896  .881  .909  958  .000
Delayed 2: Alternate Years (18)*  .866  .820  .901  1361  .000
Total Combined (121)*  .907  .899  .915  14532  .000

* ( ) number of correlation coefficients in meta-analysis
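
The source reports only that the Comprehensive Meta-Analysis software produced these pooled estimates; it does not document the pooling method. As an illustrative sketch only, correlation coefficients are conventionally combined by converting each r to Fisher's z, weighting by sample size, and back-transforming, as in the Python below. The function name and the sample correlations are hypothetical, and this fixed-effect approach is an assumption, not necessarily the exact computation behind Table 1.

    import math

    def pool_correlations(pairs):
        # Fixed-effect pooling of correlations via Fisher's z transform.
        # Each r becomes z = atanh(r), weighted by n - 3 (the inverse of
        # z's sampling variance); the weighted mean z is back-transformed.
        weighted = [(math.atanh(r), n - 3) for r, n in pairs]
        total_w = sum(w for _, w in weighted)
        z_bar = sum(z * w for z, w in weighted) / total_w
        se = 1.0 / math.sqrt(total_w)  # standard error of the pooled z
        lo, hi = z_bar - 1.96 * se, z_bar + 1.96 * se
        return math.tanh(z_bar), math.tanh(lo), math.tanh(hi)

    # Hypothetical test-retest correlations paired with their sample sizes:
    pooled, lower, upper = pool_correlations([(0.91, 120), (0.89, 85), (0.93, 150)])
    print(round(pooled, 3), round(lower, 3), round(upper, 3))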

Table 2 shows the single-measure reliability coefficients organized by grade level across time-span categories. The grade 1 reliability estimate (.847) is lower than those for the other grade levels, which can be explained by the fact that many first-grade children are just beginning to learn to read. The reliability estimates for all other grades are above .90.

Table 2: Meta-analysis of Oral Reading Fluency Reliability Coefficients by Grade Level
Grade Effect Lower Upper N Total P Value
1 (8)* .847 .768 .901 1631 .000
2 (16)* .914 .895 .929 3920 .000
3 (23)* .909 .897 .920 4345 .000
4 (38)* .905 .883 .922 2121 .000
5 (28)* .916 .905 .925 2131 .000
6 (2)* .922 .877 .946 122 .000
7 (6)* .906 .861 .936 262 .000
Combined (121)* .907 .899 .915 14532 .000

* ( ) number of correlation coefficients in meta-analysis

The Standard Error of Measurement

Even with substantial reliability, it is important to recognize that there is error involved in any single score. The magnitude of this error is quantified by the standard error of measurement, whose size is a function of the standard deviation of the measure and the reliability estimate.

In the benchmark assessment research with single passages, the standard deviation of oral reading fluency scores ranged from approximately 35 to 42 words correct per minute. Using a reliability coefficient of .90, the standard error of measurement was estimated to be between approximately 11 and 13 words correct per minute.
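
The standard psychometric formula relating these quantities is SEM = SD × √(1 − r). The source does not show the computation, but the following minimal Python sketch reproduces the reported range under that standard formula.

    import math

    def sem(sd, reliability):
        # Standard error of measurement: SEM = SD * sqrt(1 - r).
        return sd * math.sqrt(1.0 - reliability)

    # Standard deviations of 35 and 42 with a reliability of .90 (per the text):
    print(round(sem(35, 0.90), 1))  # about 11.1
    print(round(sem(42, 0.90), 1))  # about 13.3

In practice, the SEM bands an observed score: a student scoring 100 words correct per minute with an SEM of 12 would have a 68% confidence band of roughly 88 to 112.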

Validity and Reading Comprehension

Validity is the degree to which an assessment measures what it purports to measure. In this case, the benchmark assessment passages purport to measure reading achievement.

Tables 3 and 4 provide initial evidence (1998 to 2000) for the validity of oral reading fluency measures as an indicator of reading comprehension. The correlations between the oral reading fluency measure and various high-stakes measures of reading were combined and analyzed using the Comprehensive Meta-Analysis computer program. Time between the administration of the oral reading fluency measure and administration of the criterion measure ranged from one month to two years. The analysis was done in two ways: (1) across grades 1 through 6 by criterion measure and (2) across criterion measures by grade level (see Tables 3 and 4, respectively).

The overall validity estimate across grades and measures was .730 with a 95% confidence interval from .716 to .744. The variance interpretation of this correlation is that roughly half of the variability in high-stakes reading comprehension scores can be explained by oral reading fluency.
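
The "roughly half" figure follows from squaring the validity coefficient, since r² estimates the proportion of shared variance. A one-line check, using the .730 estimate reported above:

    r = 0.730      # overall validity estimate across grades and measures
    print(r ** 2)  # 0.5329, i.e., about 53% of the variance explained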

Table 3: Meta-analysis of Validity Coefficients Using Oral Reading Fluency with Measures of Reading Comprehension
Test Effect Lower Upper N Total P Value
CTBSa Comprehension (4)b .789 .726 .839 186 .000
CTBSa Total Reading (8)b .770 .730 .804 486 .000
CTBSa Vocabulary (1)b .855 .745 .920 42 .000
Gatesc Comprehension (8)b .746 .703 .783 784 .000
Gatesc Total Reading (8)b .782 .763 .800 2382 .000
Gatesc Vocabulary (9)b .730 .684 .770 585 .000
ITBSd Comprehension (11)b .688 .655 .717 6346 .000
ITBSd Total Reading (8)b .712 .677 .743 3297 .000
ITBSd Vocabulary (11)b .610 .594 .625 6409 .000
WASLe Reading (19)b .680 .647 .711 1136 .000
Combined (115)b .730 .716 .744 21653 .000

aCTBS:  Comprehensive Test of Basic Skills
b( ) number of correlation coefficients in meta-analysis
cGates:  Gates MacGinitie Reading Tests (GMRT), 3rd edition
dITBS:  Iowa Test of Basic Skills
eWASL:  Washington Assessment of Student Learning (state-mandated performance assessment)

Table 4: Meta-analysis of Validity Coefficients Using Oral Reading Fluency with Measures of Reading Comprehension by Grade Level
Grade Effect Lower Upper N Total P Value
1 (3)* .779 .716 .829 200 .000
2 (26)* .731 .702 .758 10626 .000
3 (29)* .698 .671 .723 7529 .000
4 (29)* .748 .719 .774 1734 .000
5 (19)* .752 .724 .778 1075 .000
6 (1)* .760 .605 .860 47 .000
Combined (115)* .730 .716 .744 21653 .000

* ( ) number of correlation coefficients in meta-analysis
