## Wednesday, August 02, 2006

### HQT Q&A #1

In this post I respond to some questions that were raised about my post, "The Highly Qualified Math Teacher."

1. In Table 1 how did you calculate the last column? I suppose that what I am asking is how raw scores were converted into scaled scores for the Praxis II exam.
In standardized tests, the raw to scaled score conversion can change with each administration of the test, though usually not by much. The only test-to-test number that is comparable for psychometrically sound tests is the scaled score. The Education Trust must have had access to the actual raw to scaled score conversion tables for the tests they analyzed in order to produce the table on page 22 of Not Good Enough. In their table they produce an estimate of the percent correct, based on the now known raw score. Of course, from their percentage I can calculate the corresponding raw score.

Their results match up nicely with a sample raw-to-scaled score conversion table found on page 201 of Cracking the Praxis by Fritz Stewart and Rick Sliter (go to books.google.com and search for "mathematics Praxis II"; this book shows up as one of the first few entries). One thing that is obvious from this more complete table is that raw scores between 0 and 14 all map to the lowest scaled score of 100. This is consistent with the fact that, on average, an examinee will get a raw score of 12-13 just by randomly guessing at all the questions (one quarter of 50 questions is 12.5), so these raw scores represent no content knowledge.

A scaled score of 124 corresponds to 46% correct responses in both the Not Good Enough and Cracking the Praxis tables. Since there is no penalty for guessing, we may safely assume the examinee will answer all the questions. On this 50-question test the examinee got 23 (46%) right and 27 wrong. Applying the appropriate SAT-like penalty of -1/3 point per incorrect response (appropriate for 4 answer choices per problem), the adjusted raw score is 23 - 27/3 = 14, and so the adjusted percent correct is (14/50)*100% = 28%.

You can also think of it this way: Suppose an examinee can solve 28% of the problems on this exam. If he randomly guesses at the rest, which he will do because there is no penalty, what is his expected raw score? He gets the 14 he knows how to solve plus 1/4 of the 36 he guesses at, i.e. 14 + 9 = 23, the raw score we started with from the original raw-to-scaled score conversion.

This correction method can be reduced to the equation:

$$P_a = \frac{4}{3}\,(P_r - 25\%)$$

where $P_a$ (e.g. 28%) is the adjusted percentage and $P_r$ (e.g. 46%) is the originally reported raw percentage.
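The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the correction described in this post; the function names and the `n_choices` parameter are my own labels, not anything taken from ETS or the Education Trust.

```python
def adjusted_percent(raw_percent, n_choices=4):
    """Rescale a raw percent correct so that pure guessing maps to 0%.

    With 4 answer choices this is (4/3) * (raw_percent - 25).
    """
    chance = 100.0 / n_choices  # expected percent correct from blind guessing
    return (raw_percent - chance) * n_choices / (n_choices - 1)


def penalty_score(n_right, n_wrong, n_choices=4):
    """Raw score after an SAT-like penalty of 1/(n_choices - 1) per wrong answer."""
    return n_right - n_wrong / (n_choices - 1)


def expected_raw_percent(known_percent, n_choices=4):
    """Expected raw percent if the examinee guesses blindly on everything unknown."""
    return known_percent + (100.0 - known_percent) / n_choices


print(adjusted_percent(46))      # adjusted percent for the 46% example
print(penalty_score(23, 27))     # 23 right, 27 wrong on the 50-question test
print(expected_raw_percent(28))  # the round trip back to the raw percent
```

The three functions are the same correction seen from three directions: the linear rescaling, the per-question penalty, and the "knows 28%, guesses the rest" view.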

One can also derive the same equation by using three-parameter Item Response Theory with the guessing parameter set at 0.25. You can think of the above as a simple linear rescaling from the non-intuitive 25%-100% range to the more intuitive 0%-100% range. If you want to use 46%, you must keep in mind that this is on a scale where 25% represents zero content knowledge.
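A rough sketch of how that derivation might go, under assumptions of my own: every item shares the same guessing parameter $c$, and the logistic part of the three-parameter model is read as the chance that the examinee actually knows the answer. The symbols $c$, $a_i$, $b_i$, $\theta$ and $n$ are introduced here only for illustration.

```latex
% Three-parameter model for item i, examinee ability \theta,
% with a common guessing parameter c (assumed equal for all items):
P_i(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}

% Average over the n items. Call the average of the logistic terms P_a
% (fraction known) and the average of P_i the raw fraction correct P_r:
P_r = c + (1 - c)\,P_a
\quad\Longrightarrow\quad
P_a = \frac{P_r - c}{1 - c}

% With four answer choices, c = 1/4:
P_a = \frac{P_r - \tfrac{1}{4}}{\tfrac{3}{4}} = \frac{4}{3}\,\bigl(P_r - 25\%\bigr)
```

Plugging in $P_r = 46\%$ gives $P_a = 28\%$, the same figure as the penalty calculation.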

Of all these explanations, I prefer the SAT-like penalty one, because this practice is used on multiple-choice tests that should be familiar to a general audience (the SAT has five answer choices per question and penalizes an incorrect response by 1/4 point). As I argued in my previous post, this adjusted score IS the score the examinee would get on the test with penalties, because the optimal strategy would be to continue to guess on any question on which he can eliminate at least one answer choice, and the penalty is neutral in those cases where he cannot eliminate any answer choices.

Also, every caveat I can come up with tends to drive the scores even lower. What we would really like to know is how the examinee would perform on a constructed response exam, i.e. an exam where no answer choices are given. To estimate this you have to go beyond three-parameter Item Response Theory and make other assumptions. If I use the Garcia-Perez model in our previous case, the estimated percent correct drops from 46% on a multiple-choice test without penalty to 28% on the multiple-choice test with penalty to 21% on a constructed response exam. The Strauss estimate, based on the scaled score alone, would be 24%. Furthermore, since this test has a margin of error, the examinee can take it multiple times, and we are examining the bottom end, where multiple attempts are likely, we should adjust these scores downward by some fraction of the scoring error margin.

2. How did you calculate (or estimate) the numbers in Table 2?
I got them directly from ETS. ETS kindly agreed to meet with me and we spent several hours going over data. Prior to meeting with ETS, I estimated similar numbers by fitting a normal distribution to the summary statistics that accompanied my score report. (Yes, I actually took this test. Even though some of my high school math was rusty from 30 years of non-use, I still scored above a 190 and am convinced I would have scored a 200 as a high school senior.) Those statistics told me that the first and third quartile scores were 128 and 157 respectively, from which I got a normal distribution with a mean of 142.5 and a standard deviation of 21.5. As an additional check, ETS defines an ROE (Recognition of Excellence) level for a subset of their Praxis II exams, which for the 0061 test was a 165 and was at the 85th percentile. This matched my normal distribution almost perfectly.
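For anyone who wants to reproduce the quartile fit, here is a minimal Python sketch. It assumes the scaled scores really are close to normal, which is my working assumption rather than anything ETS stated, and the helper name `fit_normal_from_quartiles` is just something I made up for this example.

```python
from statistics import NormalDist


def fit_normal_from_quartiles(q1, q3):
    """Fit a normal distribution whose 25th and 75th percentiles are q1 and q3."""
    z75 = NormalDist().inv_cdf(0.75)  # about 0.674 standard deviations
    mean = (q1 + q3) / 2              # a normal is symmetric about its mean
    sd = (q3 - q1) / (2 * z75)        # the quartiles sit z75 sd either side
    return NormalDist(mean, sd)


scores = fit_normal_from_quartiles(128, 157)
print(scores.mean, scores.stdev)

# Cross-check against the Recognition of Excellence level:
# a 165 was reported to sit at the 85th percentile.
print(scores.cdf(165))
```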

3. A related question is how did you come up with the numbers in the paragraph just below Table 2, particularly the 15% for the percentage who “demonstrate competency”?
From the normal distribution. Of course, the scaled to raw score table I am using is quantized, so what I really find is that 15.8% of the examinee population scores at or above a 164, which corresponds to an adjusted percent correct of 70.7%. It is a somewhat arbitrary assumption on my part that an adjusted score of 70% defines competency.
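The tail calculation is short enough to show. This sketch assumes the fitted normal distribution with mean 142.5 and standard deviation 21.5, and it ignores the quantization of the score table; the variable names are only for illustration.

```python
from statistics import NormalDist

# Fitted distribution of scaled scores (an approximation, not ETS data).
scores = NormalDist(mu=142.5, sigma=21.5)

# Fraction of examinees at or above a scaled score of 164.
share_at_or_above_164 = 1 - scores.cdf(164)
print(f"{share_at_or_above_164:.1%}")
```

A 164 is exactly one standard deviation above the mean of this fit, so the result is just the familiar upper tail of a normal distribution beyond one sigma.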

What we would really like to know is what these percentages are in the existing teacher pool. All we have now are estimates based on the examinee pool. But in math, some 25% never pass, so these are eliminated. We have reason to believe that at the upper end of the distribution a disproportionately large number either never enter teaching or leave teaching early. So my best guess is that the bottom end isn't as bad, nor the top end as good, as these examinee numbers suggest.